Linux virtualization list

Linux virtualization list
 help / color / mirror / Atom feed

* [patch 02/24] timekeeping: Use system_time_snapshot::sys instead of ::real
From: Thomas Gleixner @ 2026-05-26 17:13 UTC (permalink / raw)
  To: LKML
  Cc: David Woodhouse, Miroslav Lichvar, John Stultz, Stephen Boyd,
	Anna-Maria Behnsen, Frederic Weisbecker, thomas.weissschuh,
	Arthur Kiyanovski, Rodolfo Giometti, Vincent Donnefort,
	Marc Zyngier, Oliver Upton, kvmarm, Oliver Upton, Richard Cochran,
	netdev, Takashi Iwai, Miri Korenblit, Johannes Berg, Jacob Keller,
	Tony Nguyen, Saeed Mahameed, Peter Hilber, Michael S. Tsirkin,
	virtualization, linux-wireless, linux-sound
In-Reply-To: <20260526165826.392227559@kernel.org>

system_time_snapshot::sys provides the same information as
system_time_snapshot::real when the snapshot was taken with
ktime_get_snapshot_id(CLOCK_REALTIME).

Convert the history interpolation over to use 'sys' as 'real' is going away
once all users are converted.

As a side effect this is the first step to support CLOCK_AUX with
get_device_crosstime_stamp() and the history interpolation.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 kernel/time/timekeeping.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1342,7 +1342,7 @@ static int adjust_historical_crosststamp
 			(corr_raw, tk->tkr_mono.mult, tk->tkr_raw.mult);
 	} else {
 		corr_real = (u64)ktime_to_ns(
-			ktime_sub(ts->sys_realtime, history->real));
+			ktime_sub(ts->sys_realtime, history->sys));
 		ret = scale64_check_overflow(partial_history_cycles,
 					     total_history_cycles, &corr_real);
 		if (ret)
@@ -1352,7 +1352,7 @@ static int adjust_historical_crosststamp
 	/* Fixup monotonic raw and real time time values */
 	if (interp_forward) {
 		ts->sys_monoraw = ktime_add_ns(history->raw, corr_raw);
-		ts->sys_realtime = ktime_add_ns(history->real, corr_real);
+		ts->sys_realtime = ktime_add_ns(history->sys, corr_real);
 	} else {
 		ts->sys_monoraw = ktime_sub_ns(ts->sys_monoraw, corr_raw);
 		ts->sys_realtime = ktime_sub_ns(ts->sys_realtime, corr_real);


^ permalink raw reply

* [patch 01/24] timekeeping: Provide ktime_get_snapshot_id()
From: Thomas Gleixner @ 2026-05-26 17:13 UTC (permalink / raw)
  To: LKML
  Cc: David Woodhouse, Miroslav Lichvar, John Stultz, Stephen Boyd,
	Anna-Maria Behnsen, Frederic Weisbecker, thomas.weissschuh,
	Arthur Kiyanovski, Rodolfo Giometti, Vincent Donnefort,
	Marc Zyngier, Oliver Upton, kvmarm, Oliver Upton, Richard Cochran,
	netdev, Takashi Iwai, Miri Korenblit, Johannes Berg, Jacob Keller,
	Tony Nguyen, Saeed Mahameed, Peter Hilber, Michael S. Tsirkin,
	virtualization, linux-wireless, linux-sound
In-Reply-To: <20260526165826.392227559@kernel.org>

ktime_get_snapshot() provides a snapshot of the underlying clocksource
counter value and the corresponding CLOCK_MONOTONIC_RAW, CLOCK_REALTIME and
CLOCK_BOOTTIME timestamps.

There is no usage of CLOCK_REALTIME and CLOCK_BOOTTIME at the same time and
CLOCK_BOOTTIME support was just added for the ARM64 KVM tracing mechanism,
which needs CLOCK_BOOTTIME and the underlying clocksource counter value.

ktime_get_snapshot() is also not suitable for usage with CLOCK_AUX, but
that's a prerequisite to support PTP hardware timestamping for CLOCK_AUX
steering.

As a first step, rename ktime_get_snapshot() to ktime_get_snapshot_id(),
which now takes a clockid argument to select the clock which needs to be
captured. The result is stored in system_time_snapshot::sys, which will
replace the system_time_snapshot::real/boot members once all usage sites
have been converted.

ktime_get_snapshot() is a simple wrapper which hands in CLOCK_REALTIME as
clockid argument for the conversion period. That means CLOCK_REALTIME is
now captured twice, but that redunancy is only temporary.

No functional change vs. current users of ktime_get_snapshot()

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/linux/timekeeping.h |   29 ++++++++++-----
 kernel/time/timekeeping.c   |   84 +++++++++++++++++++++++++++++++++-----------
 2 files changed, 84 insertions(+), 29 deletions(-)

--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -276,24 +276,28 @@ static inline bool ktime_get_aux_ts64(cl
 #endif
 
 /**
- * struct system_time_snapshot - simultaneous raw/real time capture with
- *				 counter value
- * @cycles:	Clocksource counter value to produce the system times
- * @real:	Realtime system time
- * @boot:	Boot time
- * @raw:	Monotonic raw system time
- * @cs_id:	Clocksource ID
+ * struct system_time_snapshot - Simultaneous time capture of CLOCK_MONOTONIC_RAW,
+ *				 a selected CLOCK_* and the clocksource counter value
+ * @cycles:		Clocksource counter value to produce the system times
+ * @sys:		The system time of the selected CLOCK ID
+ * @real:		Realtime system time
+ * @boot:		Boot time
+ * @raw:		Monotonic raw system time
+ * @cs_id:		Clocksource ID
  * @clock_was_set_seq:	The sequence number of clock-was-set events
  * @cs_was_changed_seq:	The sequence number of clocksource change events
+ * @valid:		True if the snapshot is valid
  */
 struct system_time_snapshot {
 	u64			cycles;
+	ktime_t			sys;
 	ktime_t			real;
 	ktime_t			boot;
 	ktime_t			raw;
 	enum clocksource_ids	cs_id;
 	unsigned int		clock_was_set_seq;
 	u8			cs_was_changed_seq;
+	u8			valid;
 };
 
 /**
@@ -341,9 +345,16 @@ extern int get_device_system_crosststamp
 			struct system_device_crosststamp *xtstamp);
 
 /*
- * Simultaneously snapshot realtime and monotonic raw clocks
+ * Simultaneously snapshot a given clock with MONOTONIC_RAW and the underlying
+ * clocksource counter value.
  */
-extern void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot);
+extern bool ktime_get_snapshot_id(struct system_time_snapshot *systime_snapshot,
+				  clockid_t clock_id);
+
+static inline void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
+{
+	WARN_ON_ONCE(!ktime_get_snapshot_id(systime_snapshot, CLOCK_REALTIME));
+}
 
 /*
  * Persistent clock related interfaces
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1183,43 +1183,87 @@ noinstr time64_t __ktime_get_real_second
 }
 
 /**
- * ktime_get_snapshot - snapshots the realtime/monotonic raw clocks with counter
- * @systime_snapshot:	pointer to struct receiving the system time snapshot
+ * ktime_get_snapshot_id -  Simultaneously snapshot a given clock ID with
+ *			    CLOCK_MONOTONIC_RAW and the underlying
+ *			    clocksource counter value.
+ * @systime_snapshot:	Pointer to struct receiving the system time snapshot
+ * @clock_id:		The clock ID to snapshot
  */
-void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
+bool ktime_get_snapshot_id(struct system_time_snapshot *systime_snapshot, clockid_t clock_id)
 {
-	struct timekeeper *tk = &tk_core.timekeeper;
+	ktime_t base_raw, base_sys, offs_sys, *offs, offs_zero = 0;
+	u64 nsec_raw, nsec_sys, now;
+	struct timekeeper *tk;
+	struct tk_data *tkd;
 	unsigned int seq;
-	ktime_t base_raw;
 	ktime_t base_real;
 	ktime_t base_boot;
-	u64 nsec_raw;
-	u64 nsec_real;
-	u64 now;
 
-	WARN_ON_ONCE(timekeeping_suspended);
+	/* Invalidate the snapshot for all failure cases */
+	systime_snapshot->valid = false;
+
+	if (WARN_ON_ONCE(timekeeping_suspended))
+		return false;
+
+	switch (clock_id) {
+	case CLOCK_REALTIME:
+		tkd = &tk_core;
+		offs = &tk_core.timekeeper.offs_real;
+		break;
+	/* Map RAW to MONOTONIC so the loop below is trivial */
+	case CLOCK_MONOTONIC_RAW:
+	case CLOCK_MONOTONIC:
+		tkd = &tk_core;
+		offs = &offs_zero;
+		break;
+	case CLOCK_BOOTTIME:
+		tkd = &tk_core;
+		offs = &tk_core.timekeeper.offs_boot;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return false;
+	}
+
+	tk = &tkd->timekeeper;
 
 	do {
-		seq = read_seqcount_begin(&tk_core.seq);
+		seq = read_seqcount_begin(&tkd->seq);
+
 		now = tk_clock_read(&tk->tkr_mono);
 		systime_snapshot->cs_id = tk->tkr_mono.clock->id;
 		systime_snapshot->cs_was_changed_seq = tk->cs_was_changed_seq;
 		systime_snapshot->clock_was_set_seq = tk->clock_was_set_seq;
-		base_real = ktime_add(tk->tkr_mono.base,
-				      tk_core.timekeeper.offs_real);
-		base_boot = ktime_add(tk->tkr_mono.base,
-				      tk_core.timekeeper.offs_boot);
+
+		base_sys = tk->tkr_mono.base;
+		offs_sys = *offs;
 		base_raw = tk->tkr_raw.base;
-		nsec_real = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
-		nsec_raw  = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
-	} while (read_seqcount_retry(&tk_core.seq, seq));
+
+		/* Kept around until the callers are fixed up */
+		base_real = ktime_add(base_sys, tk_core.timekeeper.offs_real);
+		base_boot = ktime_add(base_sys, tk_core.timekeeper.offs_boot);
+
+		nsec_sys = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
+		nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
+	} while (read_seqcount_retry(&tkd->seq, seq));
 
 	systime_snapshot->cycles = now;
-	systime_snapshot->real = ktime_add_ns(base_real, nsec_real);
-	systime_snapshot->boot = ktime_add_ns(base_boot, nsec_real);
+	systime_snapshot->sys = ktime_add_ns(base_sys, offs_sys + nsec_sys);
+	systime_snapshot->real = ktime_add_ns(base_real, nsec_sys);
+	systime_snapshot->boot = ktime_add_ns(base_boot, nsec_sys);
 	systime_snapshot->raw = ktime_add_ns(base_raw, nsec_raw);
+
+	/*
+	 * Special case for PTP. Just transfer the raw time into sys,
+	 * so the call sites can consistently use snap::sys.
+	 */
+	if (clock_id == CLOCK_MONOTONIC_RAW)
+		systime_snapshot->sys = systime_snapshot->raw;
+	/* Tell the consumer that this snapshot is valid */
+	systime_snapshot->valid = true;
+	return true;
 }
-EXPORT_SYMBOL_GPL(ktime_get_snapshot);
+EXPORT_SYMBOL_GPL(ktime_get_snapshot_id);
 
 /* Scale base by mult/div checking for overflow */
 static int scale64_check_overflow(u64 mult, u64 div, u64 *base)


^ permalink raw reply

* [patch 00/24] timekeeping/ptp: Expand snapshot functionality
From: Thomas Gleixner @ 2026-05-26 17:13 UTC (permalink / raw)
  To: LKML
  Cc: David Woodhouse, Miroslav Lichvar, John Stultz, Stephen Boyd,
	Anna-Maria Behnsen, Frederic Weisbecker, thomas.weissschuh,
	Arthur Kiyanovski, Rodolfo Giometti, Vincent Donnefort,
	Marc Zyngier, Oliver Upton, kvmarm, Oliver Upton, Richard Cochran,
	netdev, Takashi Iwai, Miri Korenblit, Johannes Berg, Jacob Keller,
	Tony Nguyen, Saeed Mahameed, Peter Hilber, Michael S. Tsirkin,
	virtualization, linux-wireless, linux-sound

Sorry for the large CC list, but changing the inner workings touches
unfortunately a lot of places in one go.

PTP wants to grow new snapshot functionality, which provides not only the
captured CLOCK* values, but also the underlying clocksource counter value.

   https://lore.kernel.org/20260515164033.6403-1-akiyano@amazon.com

There was quite some discussion in seemingly related threads how to capture
these values and how to provide core infrastructure so that driver writers
have something to work with

   https://lore.kernel.org/20260514225842.110706-1-hramamurthy@google.com
   https://lore.kernel.org/20260520135207.37826-1-dwmw2@infradead.org

This series implements the timekeeping related mechanisms to:

     1) Capture CLOCK values along with the clocksource counter value for
     	non-hardware based sampling

     2) Expanding the hardware cross time stamp mechanism to hand back the
     	clocksource counter value, which was captured by the device, along
     	with the related CLOCK values

     3) Adding AUX clock support to the hardware cross timestamping core

It's based on v7.1-rc2 and also available from git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git timekeeping-ptp-extend-v1

Thanks to David for rebasing his PTP/timekeeping work on top and providing
feedback, fixes and testing.

Thanks,

	tglx
---
 arch/arm64/kvm/hyp_trace.c                          |    8 
 arch/arm64/kvm/hypercalls.c                         |    6 
 drivers/net/dsa/sja1105/sja1105_main.c              |    8 
 drivers/net/ethernet/intel/ice/ice_ptp.c            |    5 
 drivers/net/ethernet/intel/igc/igc.h                |    1 
 drivers/net/ethernet/intel/igc/igc_ptp.c            |    4 
 drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c |    4 
 drivers/net/wireless/intel/iwlwifi/mld/ptp.c        |    5 
 drivers/net/wireless/intel/iwlwifi/mvm/ptp.c        |    7 
 drivers/pps/generators/pps_gen-dummy.c              |    6 
 drivers/pps/generators/pps_gen_tio.c                |    6 
 drivers/ptp/ptp_chardev.c                           |   18 +
 drivers/ptp/ptp_ocp.c                               |   11 -
 drivers/ptp/ptp_vmclock.c                           |   25 --
 drivers/virtio/virtio_rtc_ptp.c                     |    2 
 include/linux/pps_kernel.h                          |    8 
 include/linux/ptp_clock_kernel.h                    |   15 -
 include/linux/timekeeping.h                         |   54 ++---
 kernel/time/timekeeping.c                           |  211 ++++++++++++--------
 sound/hda/common/controller.c                       |    4 
 20 files changed, 236 insertions(+), 172 deletions(-)

^ permalink raw reply

* [PATCH v15] can: virtio: Add virtio CAN driver
From: Matias Ezequiel Vara Larsen @ 2026-05-26 16:42 UTC (permalink / raw)
  To: Harald Mommer, Marc Kleine-Budde, Vincent Mailhol,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	linux-can, virtualization
  Cc: Mikhail Golubev-Ciuchea, Stefano Garzarella, francesco, mvaralar

Add virtio CAN driver based on Virtio 1.4 specification (see
https://github.com/oasis-tcs/virtio-spec/tree/virtio-1.4). The driver
implements a complete CAN bus interface over Virtio transport,
supporting both CAN Classic and CAN-FD Ids. In term of frames, it
supports classic and CAN FD. RTR frames are only supported with classic
CAN.

Usage:
- "ip link set up can0" - start controller
- "ip link set down can0" - stop controller
- "candump can0" - receive frames
- "cansend can0 123#DEADBEEF" - send frames

Signed-off-by: Harald Mommer <harald.mommer@oss.qualcomm.com>
Co-developed-by: Harald Mommer <harald.mommer@oss.qualcomm.com>
Signed-off-by: Mikhail Golubev-Ciuchea <mikhail.golubev-ciuchea@oss.qualcomm.com>
Co-developed-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Damir Shaikhutdinov <Damir.Shaikhutdinov@opensynergy.com>
Reviewed-by: Francesco Valla <francesco@valla.it>
Tested-by: Francesco Valla <francesco@valla.it>
Signed-off-by: Matias Ezequiel Vara Larsen <mvaralar@redhat.com>
---
V15:
- Drop unchecked can_put_echo_skb() return value and dead error check in
  virtio_can_freeze() since virtio_can_close() always returns zero.
- Fix two use-after-free bugs: in virtio_can_start_xmit(), returning
  NETDEV_TX_BUSY after virtqueue_add_sgs() fails is wrong because
  can_put_echo_skb() already consumed the skb; and in the freeze/restore/remove
  path, stale vq pointers left by a failed restore() are dereferenced by a
  subsequent remove().

V14:
- set length before memcpy for field defined with __counted_by_le()
- move napi_enable() to open() and napi_disable() to close() thus following
  pattern in other CAN devices
- add flag to prevent remove() to hang if restore() fails

V13:
- kick device only when some buffers have been added
- add virtio_can_del_vq() when virtio_can_start() fails  
- move napi_disable() at the end of virtio_can_freeze()
- move napi_enable() at beginning of virtio_can_restore()
- let virtio_can_start() sets CAN_STATE_ERROR_ACTIVE
- in freeze(), if can_stop() fails, restore the state of the device
- in restore(), if !netif_running, set CAN_STATE_STOPPED

V12:
- check return value in virtio_can_start()
- use devm_krealloc() in virtio_can_restore() to re-allocate rpkt
- remove access to cf->len and replace it with len
- use __counted_by_le()
- free tx buffers during virtio_can_del_vq()
- add reinit_completion() for ctrl msgs

V11:
* Set CAN_VIRTIO_CAN config before CAN_XILINXCAN
* Use GFP_ATOMIC in virtio_can_alloc_tx_idx()
* Use open_candev() and set bittiming.bitrate to CAN_BITRATE_UNKNOWN and
  data_bittiming.bitrate to CAN_BITRATE_UNKNOWN
* Fix leak of skb by adding kfree_skb(skb)
* Fix out-of-bounds in virtio_can_populate_rx_vq()
* Initialize `ret`
* Use virtio_reset_device()
* Add napi_disable() in virtio_can_remove()
* Propagate error in virtio_can_start() and virtio_can_stop()

V10:
* Follow Reverse Christmas Tree convention

V9:
* Remove unnecessary comments
* Update maintainer list

V8:
* Address nits

V7:
* Address nits
* Remove unnecessary comments
* Remove io_callbacks[]
* Use guard() syntax
* Remove kicking for each inbuf
* replace sdu_len with rpkt_len
* Use devm_kzalloc()
* Use scoped_guard() to protect virtqueue_add_sgs() and virtqueue_kicks() for
  tx queue
* Tested with vhost-device-can
  (see https://github.com/rust-vmm/vhost-device/tree/main/vhost-device-can) and
  Qemu (942b0d3) with [1]. A reviewer observed that the device stops to work
  after flooding from host. This issue is still present.

[1]
https://lore.kernel.org/qemu-devel/20251031155617.1223248-1-mvaralar@redhat.com/

V6:
* Address nits
* Check for error during register_virtio_can()
* Remove virtio_device_ready()
* Allocate virtio_can_rx rpkt[] at probe
* Define virtio_can_control struct
* Return VIRTIO_CAN_RESULT_NOT_OK after unlocking
* Define sdu[] as a flex array for both tx and rx. For rx, use
  VIRTIO_CAN_F_CAN_FD to figure out the max len for sdu
* Fix statistics in virtio_can_read_tx_queue() and
  how we indicate error to the user when getting
  VIRTIO_CAN_RESULT_NOT_OK
* Fix syntax of virtio_find_vqs()
* Drop tx_list
* Fix values of VIRTIO_CAN_F_LATE_TX_ACK and VIRTIO_CAN_F_RTR_FRAMES
* Tested with vhost-device-can
  (see
  https://github.com/rust-vmm/vhost-device/tree/main/vhost-device-can)
  and qemu (see
  https://github.com/virtualopensystems/qemu/tree/vhu-can-rfc) 

V5:
* Re-base on top of linux-next (next-20240103)
* Tested with https://github.com/OpenSynergy/qemu/tree/virtio-can-spec-rfc-v3

RFC V4:
* Apply reverse Christmas tree style
* Add member *classic_dlc to RX and TX CAN frames
* Fix race causing a NETDEV_TX_BUSY return
---
 MAINTAINERS                     |    9 +
 drivers/net/can/Kconfig         |   12 +
 drivers/net/can/Makefile        |    1 +
 drivers/net/can/virtio_can.c    | 1022 +++++++++++++++++++++++++++++++
 include/uapi/linux/virtio_can.h |   78 +++
 5 files changed, 1122 insertions(+)
 create mode 100644 drivers/net/can/virtio_can.c
 create mode 100644 include/uapi/linux/virtio_can.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 80cd3498c293..f295a904c93e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -27068,6 +27068,15 @@ F:	drivers/scsi/virtio_scsi.c
 F:	include/uapi/linux/virtio_blk.h
 F:	include/uapi/linux/virtio_scsi.h
 
+VIRTIO CAN DRIVER
+M:	"Harald Mommer" <harald.mommer@oss.qualcomm.com>
+M:	"Matias Ezequiel Vara Larsen" <mvaralar@redhat.com>
+L:	virtualization@lists.linux.dev
+L:	linux-can@vger.kernel.org
+S:	Maintained
+F:	drivers/net/can/virtio_can.c
+F:	include/uapi/linux/virtio_can.h
+
 VIRTIO CONSOLE DRIVER
 M:	Amit Shah <amit@kernel.org>
 L:	virtualization@lists.linux.dev
diff --git a/drivers/net/can/Kconfig b/drivers/net/can/Kconfig
index d43d56694667..f33ae46d9766 100644
--- a/drivers/net/can/Kconfig
+++ b/drivers/net/can/Kconfig
@@ -209,6 +209,18 @@ config CAN_TI_HECC
 	  Driver for TI HECC (High End CAN Controller) module found on many
 	  TI devices. The device specifications are available from www.ti.com
 
+config CAN_VIRTIO_CAN
+	depends on VIRTIO
+	tristate "Virtio CAN device support"
+	default n
+	help
+	  Say Y here if you want to support for Virtio CAN.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called virtio-can.
+
+	  If unsure, say N.
+
 config CAN_XILINXCAN
 	tristate "Xilinx CAN"
 	depends on ARCH_ZYNQ || ARM64 || MICROBLAZE || COMPILE_TEST
diff --git a/drivers/net/can/Makefile b/drivers/net/can/Makefile
index 56138d8ddfd2..2ddea733ed5d 100644
--- a/drivers/net/can/Makefile
+++ b/drivers/net/can/Makefile
@@ -32,6 +32,7 @@ obj-$(CONFIG_CAN_PEAK_PCIEFD)	+= peak_canfd/
 obj-$(CONFIG_CAN_SJA1000)	+= sja1000/
 obj-$(CONFIG_CAN_SUN4I)		+= sun4i_can.o
 obj-$(CONFIG_CAN_TI_HECC)	+= ti_hecc.o
+obj-$(CONFIG_CAN_VIRTIO_CAN)	+= virtio_can.o
 obj-$(CONFIG_CAN_XILINXCAN)	+= xilinx_can.o
 
 subdir-ccflags-$(CONFIG_CAN_DEBUG_DEVICES) += -DDEBUG
diff --git a/drivers/net/can/virtio_can.c b/drivers/net/can/virtio_can.c
new file mode 100644
index 000000000000..f67d0bf09681
--- /dev/null
+++ b/drivers/net/can/virtio_can.c
@@ -0,0 +1,1022 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * CAN bus driver for the Virtio CAN controller
+ *
+ * Copyright (C) 2021-2023 OpenSynergy GmbH
+ * Copyright Red Hat, Inc. 2025
+ */
+
+#include <linux/atomic.h>
+#include <linux/idr.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/netdevice.h>
+#include <linux/stddef.h>
+#include <linux/can/dev.h>
+#include <linux/virtio.h>
+#include <linux/virtio_ring.h>
+#include <linux/virtio_can.h>
+
+/* CAN device queues */
+#define VIRTIO_CAN_QUEUE_TX 0
+#define VIRTIO_CAN_QUEUE_RX 1
+#define VIRTIO_CAN_QUEUE_CONTROL 2
+#define VIRTIO_CAN_QUEUE_COUNT 3
+
+#define CAN_KNOWN_FLAGS \
+	(VIRTIO_CAN_FLAGS_EXTENDED |\
+	 VIRTIO_CAN_FLAGS_FD |\
+	 VIRTIO_CAN_FLAGS_RTR)
+
+/* Max. number of in flight TX messages */
+#define VIRTIO_CAN_ECHO_SKB_MAX 128
+
+struct virtio_can_tx {
+	unsigned int putidx;
+	struct virtio_can_tx_in tx_in;
+	/* Keep virtio_can_tx_out at the end of the structure due to flex array */
+	struct virtio_can_tx_out tx_out;
+};
+
+struct virtio_can_control {
+	struct virtio_can_control_out cpkt_out;
+	struct virtio_can_control_in cpkt_in;
+};
+
+/* virtio_can private data structure */
+struct virtio_can_priv {
+	struct can_priv can;	/* must be the first member */
+	/* NAPI for RX messages */
+	struct napi_struct napi;
+	/* NAPI for TX messages */
+	struct napi_struct napi_tx;
+	/* The network device we're associated with */
+	struct net_device *dev;
+	/* The virtio device we're associated with */
+	struct virtio_device *vdev;
+	/* The virtqueues */
+	struct virtqueue *vqs[VIRTIO_CAN_QUEUE_COUNT];
+	/* Lock for TX operations */
+	spinlock_t tx_lock;
+	/* Control queue lock */
+	struct mutex ctrl_lock;
+	/* Wait for control queue processing without polling */
+	struct completion ctrl_done;
+	/* Array of receive queue messages */
+	struct virtio_can_rx *rpkt;
+	struct virtio_can_control can_ctr_msg;
+	/* Data to get and maintain the putidx for local TX echo */
+	struct ida tx_putidx_ida;
+	/* In flight TX messages */
+	atomic_t tx_inflight;
+	/* Packet length */
+	int rpkt_len;
+	/* BusOff pending. Reset after successful indication to upper layer */
+	bool busoff_pending;
+	/* Tracks whether NAPI instances are currently enabled */
+	bool napi_active;
+};
+
+static void virtqueue_napi_schedule(struct napi_struct *napi,
+				    struct virtqueue *vq)
+{
+	if (napi_schedule_prep(napi)) {
+		virtqueue_disable_cb(vq);
+		__napi_schedule(napi);
+	}
+}
+
+static void virtqueue_napi_complete(struct napi_struct *napi,
+				    struct virtqueue *vq, int processed)
+{
+	int opaque;
+
+	opaque = virtqueue_enable_cb_prepare(vq);
+	if (napi_complete_done(napi, processed)) {
+		if (unlikely(virtqueue_poll(vq, opaque)))
+			virtqueue_napi_schedule(napi, vq);
+	} else {
+		virtqueue_disable_cb(vq);
+	}
+}
+
+static void virtio_can_free_candev(struct net_device *ndev)
+{
+	struct virtio_can_priv *priv = netdev_priv(ndev);
+
+	ida_destroy(&priv->tx_putidx_ida);
+	free_candev(ndev);
+}
+
+static void virtio_can_napi_enable(struct virtio_can_priv *priv)
+{
+	if (!priv->napi_active) {
+		napi_enable(&priv->napi);
+		napi_enable(&priv->napi_tx);
+		priv->napi_active = true;
+	}
+}
+
+static void virtio_can_napi_disable(struct virtio_can_priv *priv)
+{
+	if (priv->napi_active) {
+		napi_disable(&priv->napi_tx);
+		napi_disable(&priv->napi);
+		priv->napi_active = false;
+	}
+}
+
+static int virtio_can_alloc_tx_idx(struct virtio_can_priv *priv)
+{
+	int tx_idx;
+
+	tx_idx = ida_alloc_max(&priv->tx_putidx_ida,
+			       priv->can.echo_skb_max - 1, GFP_ATOMIC);
+	if (tx_idx >= 0)
+		atomic_inc(&priv->tx_inflight);
+
+	return tx_idx;
+}
+
+static void virtio_can_free_tx_idx(struct virtio_can_priv *priv,
+				   unsigned int idx)
+{
+	ida_free(&priv->tx_putidx_ida, idx);
+	atomic_dec(&priv->tx_inflight);
+}
+
+/* Create a scatter-gather list representing our input buffer and put
+ * it in the queue.
+ *
+ * Callers should take appropriate locks.
+ */
+static int virtio_can_add_inbuf(struct virtqueue *vq, void *buf,
+				unsigned int size)
+{
+	struct scatterlist sg[1];
+	int ret;
+
+	sg_init_one(sg, buf, size);
+
+	ret = virtqueue_add_inbuf(vq, sg, 1, buf, GFP_ATOMIC);
+
+	return ret;
+}
+
+/* Send a control message with message type either
+ *
+ * - VIRTIO_CAN_SET_CTRL_MODE_START or
+ * - VIRTIO_CAN_SET_CTRL_MODE_STOP.
+ *
+ */
+static u8 virtio_can_send_ctrl_msg(struct net_device *ndev, u16 msg_type)
+{
+	struct scatterlist sg_out, sg_in, *sgs[2] = { &sg_out, &sg_in };
+	struct virtio_can_priv *priv = netdev_priv(ndev);
+	struct virtqueue *vq = priv->vqs[VIRTIO_CAN_QUEUE_CONTROL];
+	struct device *dev = &priv->vdev->dev;
+	unsigned int len;
+	int err;
+
+	if (!vq)
+		return VIRTIO_CAN_RESULT_NOT_OK;
+
+	guard(mutex)(&priv->ctrl_lock);
+
+	priv->can_ctr_msg.cpkt_out.msg_type = cpu_to_le16(msg_type);
+	sg_init_one(&sg_out, &priv->can_ctr_msg.cpkt_out,
+		    sizeof(priv->can_ctr_msg.cpkt_out));
+	sg_init_one(&sg_in, &priv->can_ctr_msg.cpkt_in, sizeof(priv->can_ctr_msg.cpkt_in));
+
+	reinit_completion(&priv->ctrl_done);
+
+	err = virtqueue_add_sgs(vq, sgs, 1u, 1u, priv, GFP_ATOMIC);
+	if (err != 0) {
+		dev_err(dev, "%s(): virtqueue_add_sgs() failed\n", __func__);
+		return VIRTIO_CAN_RESULT_NOT_OK;
+	}
+
+	if (!virtqueue_kick(vq)) {
+		dev_err(dev, "%s(): Kick failed\n", __func__);
+		return VIRTIO_CAN_RESULT_NOT_OK;
+	}
+
+	while (!virtqueue_get_buf(vq, &len) && !virtqueue_is_broken(vq))
+		wait_for_completion(&priv->ctrl_done);
+
+	return priv->can_ctr_msg.cpkt_in.result;
+}
+
+static int virtio_can_start(struct net_device *ndev)
+{
+	struct virtio_can_priv *priv = netdev_priv(ndev);
+	u8 result;
+
+	result = virtio_can_send_ctrl_msg(ndev, VIRTIO_CAN_SET_CTRL_MODE_START);
+	if (result != VIRTIO_CAN_RESULT_OK) {
+		netdev_err(ndev, "CAN controller start failed\n");
+		return -EIO;
+	}
+
+	priv->busoff_pending = false;
+	priv->can.state = CAN_STATE_ERROR_ACTIVE;
+
+	return 0;
+}
+
+static int virtio_can_set_mode(struct net_device *dev, enum can_mode mode)
+{
+	int err;
+
+	switch (mode) {
+	case CAN_MODE_START:
+		err = virtio_can_start(dev);
+		if (err)
+			return err;
+		netif_wake_queue(dev);
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
+static int virtio_can_open(struct net_device *ndev)
+{
+	struct virtio_can_priv *priv = netdev_priv(ndev);
+	int err;
+
+	err = open_candev(ndev);
+	if (err)
+		return err;
+
+	err = virtio_can_start(ndev);
+	if (err) {
+		close_candev(ndev);
+		return err;
+	}
+
+	virtio_can_napi_enable(priv);
+	netif_start_queue(ndev);
+
+	return 0;
+}
+
+static int virtio_can_stop(struct net_device *ndev)
+{
+	struct virtio_can_priv *priv = netdev_priv(ndev);
+	struct device *dev = &priv->vdev->dev;
+	u8 result;
+
+	result = virtio_can_send_ctrl_msg(ndev, VIRTIO_CAN_SET_CTRL_MODE_STOP);
+	if (result != VIRTIO_CAN_RESULT_OK) {
+		dev_err(dev, "CAN controller stop failed\n");
+		return -EIO;
+	}
+
+	priv->busoff_pending = false;
+	priv->can.state = CAN_STATE_STOPPED;
+
+	/* Switch carrier off if device was connected to the bus */
+	if (netif_carrier_ok(ndev))
+		netif_carrier_off(ndev);
+
+	return 0;
+}
+
+static int virtio_can_close(struct net_device *dev)
+{
+	struct virtio_can_priv *priv = netdev_priv(dev);
+
+	netif_stop_queue(dev);
+	/* Ignore stop error: ndo_stop must always complete cleanup regardless.
+	 * virtio_can_stop() already logs the error if it fails.
+	 */
+	virtio_can_stop(dev);
+	virtio_can_napi_disable(priv);
+	close_candev(dev);
+
+	return 0;
+}
+
+static netdev_tx_t virtio_can_start_xmit(struct sk_buff *skb,
+					 struct net_device *dev)
+{
+	struct scatterlist sg_out, sg_in, *sgs[2] = { &sg_out, &sg_in };
+	const unsigned int hdr_size = sizeof(struct virtio_can_tx_out);
+	struct canfd_frame *cf = (struct canfd_frame *)skb->data;
+	struct virtio_can_priv *priv = netdev_priv(dev);
+	struct virtqueue *vq = priv->vqs[VIRTIO_CAN_QUEUE_TX];
+	netdev_tx_t xmit_ret = NETDEV_TX_OK;
+	struct virtio_can_tx *can_tx_msg;
+	u32 can_flags;
+	int putidx;
+	int err;
+
+	if (can_dev_dropped_skb(dev, skb))
+		goto kick; /* No way to return NET_XMIT_DROP here */
+
+	/* No local check for CAN_RTR_FLAG or FD frame against negotiated
+	 * features. The device will reject those anyway if not supported.
+	 */
+
+	can_tx_msg = kzalloc(sizeof(*can_tx_msg) + cf->len, GFP_ATOMIC);
+	if (!can_tx_msg) {
+		kfree_skb(skb);
+		dev->stats.tx_dropped++;
+		goto kick; /* No way to return NET_XMIT_DROP here */
+	}
+
+	can_tx_msg->tx_out.msg_type = cpu_to_le16(VIRTIO_CAN_TX);
+	can_tx_msg->tx_out.length = cpu_to_le16(cf->len);
+	can_flags = 0;
+
+	if (cf->can_id & CAN_EFF_FLAG) {
+		can_flags |= VIRTIO_CAN_FLAGS_EXTENDED;
+		can_tx_msg->tx_out.can_id = cpu_to_le32(cf->can_id & CAN_EFF_MASK);
+	} else {
+		can_tx_msg->tx_out.can_id = cpu_to_le32(cf->can_id & CAN_SFF_MASK);
+	}
+	if (cf->can_id & CAN_RTR_FLAG)
+		can_flags |= VIRTIO_CAN_FLAGS_RTR;
+	else
+		memcpy(can_tx_msg->tx_out.sdu, cf->data, cf->len);
+	if (can_is_canfd_skb(skb))
+		can_flags |= VIRTIO_CAN_FLAGS_FD;
+
+	can_tx_msg->tx_out.flags = cpu_to_le32(can_flags);
+
+	sg_init_one(&sg_out, &can_tx_msg->tx_out, hdr_size + cf->len);
+	sg_init_one(&sg_in, &can_tx_msg->tx_in, sizeof(can_tx_msg->tx_in));
+
+	putidx = virtio_can_alloc_tx_idx(priv);
+
+	if (unlikely(putidx < 0)) {
+		/* -ENOMEM or -ENOSPC here. -ENOSPC should not be possible as
+		 * tx_inflight >= can.echo_skb_max is checked in flow control
+		 */
+		WARN_ON_ONCE(putidx == -ENOSPC);
+		kfree(can_tx_msg);
+		kfree_skb(skb);
+		dev->stats.tx_dropped++;
+		goto kick; /* No way to return NET_XMIT_DROP here */
+	}
+
+	can_tx_msg->putidx = (unsigned int)putidx;
+
+	/* Push loopback echo. Will be looped back on TX interrupt/TX NAPI */
+	err = can_put_echo_skb(skb, dev, can_tx_msg->putidx, 0);
+	if (unlikely(err)) {
+		/* skb was already freed by can_put_echo_skb() on error */
+		virtio_can_free_tx_idx(priv, can_tx_msg->putidx);
+		kfree(can_tx_msg);
+		dev->stats.tx_dropped++;
+		goto kick;
+	}
+
+	/* Protect queue and list operations */
+	scoped_guard(spinlock_irqsave, &priv->tx_lock)
+		err = virtqueue_add_sgs(vq, sgs, 1u, 1u, can_tx_msg, GFP_ATOMIC);
+
+	if (unlikely(err)) {
+		/*
+		 * can_put_echo_skb() already consumed skb via consume_skb(),
+		 * so returning NETDEV_TX_BUSY would cause the stack to requeue
+		 * a freed pointer. Drop the frame and return OK instead.
+		 */
+		can_free_echo_skb(dev, can_tx_msg->putidx, NULL);
+		virtio_can_free_tx_idx(priv, can_tx_msg->putidx);
+		netif_stop_queue(dev);
+		kfree(can_tx_msg);
+		dev->stats.tx_dropped++;
+		/* Expected never to be seen */
+		netdev_warn(dev, "TX: Stop queue, err = %d\n", err);
+		goto kick;
+	}
+
+	/* Normal flow control: stop queue when no transmission slots left */
+	if (atomic_read(&priv->tx_inflight) >= priv->can.echo_skb_max ||
+	    vq->num_free == 0 || (vq->num_free < ARRAY_SIZE(sgs) &&
+	    !virtio_has_feature(vq->vdev, VIRTIO_RING_F_INDIRECT_DESC))) {
+		netif_stop_queue(dev);
+		netdev_dbg(dev, "TX: Normal stop queue\n");
+	}
+
+kick:
+	if (netif_queue_stopped(dev) || !netdev_xmit_more()) {
+		scoped_guard(spinlock_irqsave, &priv->tx_lock) {
+			if (!virtqueue_kick(vq))
+				netdev_err(dev, "%s(): Kick failed\n", __func__);
+		}
+	}
+
+	return xmit_ret;
+}
+
+static const struct net_device_ops virtio_can_netdev_ops = {
+	.ndo_open = virtio_can_open,
+	.ndo_stop = virtio_can_close,
+	.ndo_start_xmit = virtio_can_start_xmit,
+};
+
+static int register_virtio_can_dev(struct net_device *dev)
+{
+	dev->flags |= IFF_ECHO;	/* we support local echo */
+	dev->netdev_ops = &virtio_can_netdev_ops;
+
+	return register_candev(dev);
+}
+
+static int virtio_can_read_tx_queue(struct virtqueue *vq)
+{
+	struct virtio_can_priv *can_priv = vq->vdev->priv;
+	struct net_device *dev = can_priv->dev;
+	struct virtio_can_tx *can_tx_msg;
+	struct net_device_stats *stats;
+	unsigned int len;
+	u8 result;
+
+	stats = &dev->stats;
+
+	scoped_guard(spinlock_irqsave, &can_priv->tx_lock)
+		can_tx_msg = virtqueue_get_buf(vq, &len);
+
+	if (!can_tx_msg)
+		return 0;
+
+	if (unlikely(len < sizeof(struct virtio_can_tx_in))) {
+		netdev_err(dev, "TX ACK: Device sent no result code\n");
+		result = VIRTIO_CAN_RESULT_NOT_OK; /* Keep things going */
+	} else {
+		result = can_tx_msg->tx_in.result;
+	}
+
+	if (can_priv->can.state < CAN_STATE_BUS_OFF) {
+		if (result != VIRTIO_CAN_RESULT_OK) {
+			struct can_frame *skb_cf;
+			struct sk_buff *skb = alloc_can_err_skb(dev, &skb_cf);
+
+			if (skb) {
+				skb_cf->can_id |= CAN_ERR_CRTL;
+				skb_cf->data[1] |= CAN_ERR_CRTL_UNSPEC;
+				netif_rx(skb);
+			}
+			netdev_warn(dev, "TX ACK: Result = %u\n", result);
+			can_free_echo_skb(dev, can_tx_msg->putidx, NULL);
+			stats->tx_dropped++;
+		} else {
+			stats->tx_bytes += can_get_echo_skb(dev, can_tx_msg->putidx,
+				NULL);
+			stats->tx_packets++;
+		}
+	} else {
+		netdev_dbg(dev, "TX ACK: Controller inactive, drop echo\n");
+		can_free_echo_skb(dev, can_tx_msg->putidx, NULL);
+		stats->tx_dropped++;
+	}
+
+	virtio_can_free_tx_idx(can_priv, can_tx_msg->putidx);
+
+	/* Flow control */
+	if (netif_queue_stopped(dev)) {
+		netdev_dbg(dev, "TX ACK: Wake up stopped queue\n");
+		netif_wake_queue(dev);
+	}
+
+	kfree(can_tx_msg);
+
+	return 1; /* Queue was not empty so there may be more data */
+}
+
+static int virtio_can_tx_poll(struct napi_struct *napi, int quota)
+{
+	struct net_device *dev = napi->dev;
+	struct virtio_can_priv *priv = netdev_priv(dev);
+	struct virtqueue *vq = priv->vqs[VIRTIO_CAN_QUEUE_TX];
+	int work_done = 0;
+
+	while (work_done < quota && virtio_can_read_tx_queue(vq) != 0)
+		work_done++;
+
+	if (work_done < quota)
+		virtqueue_napi_complete(napi, vq, work_done);
+
+	return work_done;
+}
+
+static void virtio_can_tx_intr(struct virtqueue *vq)
+{
+	struct virtio_can_priv *can_priv = vq->vdev->priv;
+
+	virtqueue_disable_cb(vq);
+	napi_schedule(&can_priv->napi_tx);
+}
+
+/* This function is the NAPI RX poll function and NAPI guarantees that this
+ * function is not invoked simultaneously on multiple processors.
+ * Read a RX message from the used queue and sends it to the upper layer.
+ */
+static int virtio_can_read_rx_queue(struct virtqueue *vq)
+{
+	const unsigned int header_size = sizeof(struct virtio_can_rx);
+	struct virtio_can_priv *priv = vq->vdev->priv;
+	struct net_device *dev = priv->dev;
+	struct net_device_stats *stats;
+	struct virtio_can_rx *can_rx;
+	unsigned int transport_len;
+	struct canfd_frame *cf;
+	struct sk_buff *skb;
+	unsigned int len;
+	u32 can_flags;
+	u16 msg_type;
+	u32 can_id;
+	int ret;
+
+	stats = &dev->stats;
+
+	can_rx = virtqueue_get_buf(vq, &transport_len);
+	if (!can_rx)
+		return 0; /* No more data */
+
+	if (transport_len < header_size) {
+		netdev_warn(dev, "RX: Message too small\n");
+		goto putback;
+	}
+
+	if (priv->can.state >= CAN_STATE_ERROR_PASSIVE) {
+		netdev_dbg(dev, "%s(): Controller not active\n", __func__);
+		goto putback;
+	}
+
+	msg_type = le16_to_cpu(can_rx->msg_type);
+	if (msg_type != VIRTIO_CAN_RX) {
+		netdev_warn(dev, "RX: Got unknown msg_type %04x\n", msg_type);
+		goto putback;
+	}
+
+	len = le16_to_cpu(can_rx->length);
+	can_flags = le32_to_cpu(can_rx->flags);
+	can_id = le32_to_cpu(can_rx->can_id);
+
+	if (can_flags & ~CAN_KNOWN_FLAGS) {
+		stats->rx_dropped++;
+		netdev_warn(dev, "RX: CAN Id 0x%08x: Invalid flags 0x%x\n",
+			    can_id, can_flags);
+		goto putback;
+	}
+
+	if (can_flags & VIRTIO_CAN_FLAGS_EXTENDED) {
+		can_id &= CAN_EFF_MASK;
+		can_id |= CAN_EFF_FLAG;
+	} else {
+		can_id &= CAN_SFF_MASK;
+	}
+
+	if (can_flags & VIRTIO_CAN_FLAGS_RTR) {
+		if (!virtio_has_feature(vq->vdev, VIRTIO_CAN_F_RTR_FRAMES)) {
+			stats->rx_dropped++;
+			netdev_warn(dev, "RX: CAN Id 0x%08x: RTR not negotiated\n",
+				    can_id);
+			goto putback;
+		}
+		if (can_flags & VIRTIO_CAN_FLAGS_FD) {
+			stats->rx_dropped++;
+			netdev_warn(dev, "RX: CAN Id 0x%08x: RTR with FD not possible\n",
+				    can_id);
+			goto putback;
+		}
+
+		if (len > 0xF) {
+			stats->rx_dropped++;
+			netdev_warn(dev, "RX: CAN Id 0x%08x: RTR with DLC > 0xF\n",
+				    can_id);
+			goto putback;
+		}
+
+		if (len > 0x8)
+			len = 0x8;
+
+		can_id |= CAN_RTR_FLAG;
+	}
+
+	if (transport_len < header_size + len) {
+		netdev_warn(dev, "RX: Message too small for payload\n");
+		goto putback;
+	}
+
+	if (can_flags & VIRTIO_CAN_FLAGS_FD) {
+		if (!virtio_has_feature(vq->vdev, VIRTIO_CAN_F_CAN_FD)) {
+			stats->rx_dropped++;
+			netdev_warn(dev, "RX: CAN Id 0x%08x: FD not negotiated\n",
+				    can_id);
+			goto putback;
+		}
+
+		if (len > CANFD_MAX_DLEN)
+			len = CANFD_MAX_DLEN;
+
+		skb = alloc_canfd_skb(priv->dev, &cf);
+	} else {
+		if (!virtio_has_feature(vq->vdev, VIRTIO_CAN_F_CAN_CLASSIC)) {
+			stats->rx_dropped++;
+			netdev_warn(dev, "RX: CAN Id 0x%08x: classic not negotiated\n",
+				    can_id);
+			goto putback;
+		}
+
+		if (len > CAN_MAX_DLEN)
+			len = CAN_MAX_DLEN;
+
+		skb = alloc_can_skb(priv->dev, (struct can_frame **)&cf);
+	}
+	if (!skb) {
+		stats->rx_dropped++;
+		netdev_warn(dev, "RX: No skb available\n");
+		goto putback;
+	}
+
+	cf->can_id = can_id;
+	cf->len = len;
+	if (!(can_flags & VIRTIO_CAN_FLAGS_RTR)) {
+		/* RTR frames have a DLC but no payload */
+		memcpy(cf->data, can_rx->sdu, len);
+	}
+
+	if (netif_receive_skb(skb) == NET_RX_SUCCESS) {
+		stats->rx_packets++;
+		if (!(can_flags & VIRTIO_CAN_FLAGS_RTR))
+			stats->rx_bytes += len;
+	}
+
+putback:
+	/* Put processed RX buffer back into avail queue */
+	ret = virtio_can_add_inbuf(vq, can_rx,
+				   priv->rpkt_len);
+	if (!ret)
+		virtqueue_kick(vq);
+	return 1; /* Queue was not empty so there may be more data */
+}
+
+static int virtio_can_handle_busoff(struct net_device *dev)
+{
+	struct virtio_can_priv *priv = netdev_priv(dev);
+	struct can_frame *cf;
+	struct sk_buff *skb;
+
+	if (!priv->busoff_pending)
+		return 0;
+
+	if (priv->can.state < CAN_STATE_BUS_OFF) {
+		netdev_dbg(dev, "entered error bus off state\n");
+
+		/* bus-off state */
+		priv->can.state = CAN_STATE_BUS_OFF;
+		priv->can.can_stats.bus_off++;
+		can_bus_off(dev);
+	}
+
+	/* propagate the error condition to the CAN stack */
+	skb = alloc_can_err_skb(dev, &cf);
+	if (unlikely(!skb))
+		return 0;
+
+	/* bus-off state */
+	cf->can_id |= CAN_ERR_BUSOFF;
+
+	/* Ensure that the BusOff indication does not get lost */
+	if (netif_receive_skb(skb) == NET_RX_SUCCESS)
+		priv->busoff_pending = false;
+
+	return 1;
+}
+
+static int virtio_can_rx_poll(struct napi_struct *napi, int quota)
+{
+	struct net_device *dev = napi->dev;
+	struct virtio_can_priv *priv = netdev_priv(dev);
+	struct virtqueue *vq = priv->vqs[VIRTIO_CAN_QUEUE_RX];
+	int work_done = 0;
+
+	work_done += virtio_can_handle_busoff(dev);
+
+	while (work_done < quota && virtio_can_read_rx_queue(vq) != 0)
+		work_done++;
+
+	if (work_done < quota)
+		virtqueue_napi_complete(napi, vq, work_done);
+
+	return work_done;
+}
+
+static void virtio_can_rx_intr(struct virtqueue *vq)
+{
+	struct virtio_can_priv *can_priv = vq->vdev->priv;
+
+	virtqueue_disable_cb(vq);
+	napi_schedule(&can_priv->napi);
+}
+
+static void virtio_can_control_intr(struct virtqueue *vq)
+{
+	struct virtio_can_priv *can_priv = vq->vdev->priv;
+
+	complete(&can_priv->ctrl_done);
+}
+
+static void virtio_can_config_changed(struct virtio_device *vdev)
+{
+	struct virtio_can_priv *can_priv = vdev->priv;
+	u16 status;
+
+	status = virtio_cread16(vdev, offsetof(struct virtio_can_config,
+					       status));
+
+	if (!(status & VIRTIO_CAN_S_CTRL_BUSOFF))
+		return;
+
+	if (!can_priv->busoff_pending &&
+	    can_priv->can.state < CAN_STATE_BUS_OFF) {
+		can_priv->busoff_pending = true;
+		napi_schedule(&can_priv->napi);
+	}
+}
+
+static void virtio_can_populate_rx_vq(struct virtio_device *vdev)
+{
+	struct virtio_can_priv *priv = vdev->priv;
+	struct virtqueue *vq = priv->vqs[VIRTIO_CAN_QUEUE_RX];
+	unsigned int buf_size = priv->rpkt_len;
+	int num_elements = vq->num_free;
+	u8 *buf = (u8 *)priv->rpkt;
+	unsigned int idx;
+	int ret = 0;
+
+	for (idx = 0; idx < num_elements; idx++) {
+		ret = virtio_can_add_inbuf(vq, buf, buf_size);
+		if (ret < 0) {
+			dev_dbg(&vdev->dev, "rpkt fill: ret=%d, idx=%u, size=%u\n",
+				ret, idx, buf_size);
+			break;
+		}
+		buf += buf_size;
+	}
+
+	if (idx > 0)
+		virtqueue_kick(vq);
+
+	dev_dbg(&vdev->dev, "%u rpkt added\n", idx);
+}
+
+static int virtio_can_find_vqs(struct virtio_can_priv *priv)
+{
+	struct virtqueue_info vqs_info[] = {
+		{ "can-tx", virtio_can_tx_intr },
+		{ "can-rx", virtio_can_rx_intr },
+		{ "can-state-ctrl", virtio_can_control_intr },
+	};
+
+	/* Find the queues. */
+	return virtio_find_vqs(priv->vdev, VIRTIO_CAN_QUEUE_COUNT, priv->vqs,
+			       vqs_info, NULL);
+}
+
+/* Function must not be called before virtio_can_find_vqs() has been run */
+static void virtio_can_del_vq(struct virtio_device *vdev)
+{
+	struct virtio_can_priv *priv = vdev->priv;
+	struct virtqueue *vq = priv->vqs[VIRTIO_CAN_QUEUE_TX];
+	struct virtio_can_tx *can_tx_msg;
+	int q;
+
+	if (!vq)
+		return;
+
+	/* Reset the device */
+	virtio_reset_device(vdev);
+
+	/* From here we have dead silence from the device side so no locks
+	 * are needed to protect against device side events.
+	 */
+
+	/* Free pending TX buffers which were allocated in virtio_can_start_xmit() */
+	while ((can_tx_msg = virtqueue_detach_unused_buf(vq))) {
+		can_free_echo_skb(priv->dev, can_tx_msg->putidx, NULL);
+		virtio_can_free_tx_idx(priv, can_tx_msg->putidx);
+		kfree(can_tx_msg);
+	}
+
+	/* RX and control queue buffers are managed elsewhere, just detach */
+	for (q = VIRTIO_CAN_QUEUE_RX; q < VIRTIO_CAN_QUEUE_COUNT; q++)
+		while (virtqueue_detach_unused_buf(priv->vqs[q]))
+			;
+
+	if (vdev->config->del_vqs)
+		vdev->config->del_vqs(vdev);
+
+	memset(priv->vqs, 0, sizeof(priv->vqs));
+}
+
+static void virtio_can_remove(struct virtio_device *vdev)
+{
+	struct virtio_can_priv *priv = vdev->priv;
+	struct net_device *dev = priv->dev;
+
+	unregister_candev(dev);
+
+	virtio_can_del_vq(vdev);
+
+	virtio_can_free_candev(dev);
+}
+
+static int virtio_can_validate(struct virtio_device *vdev)
+{
+	/* CAN needs always access to the config space.
+	 * Check that the driver can access the config space
+	 */
+	if (!vdev->config->get) {
+		dev_err(&vdev->dev, "%s failure: config access disabled\n",
+			__func__);
+		return -EINVAL;
+	}
+
+	if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {
+		dev_err(&vdev->dev,
+			"device does not comply with spec version 1.x\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int virtio_can_probe(struct virtio_device *vdev)
+{
+	struct virtio_can_priv *priv;
+	struct net_device *dev;
+	size_t size;
+	int err;
+
+	dev = alloc_candev(sizeof(struct virtio_can_priv),
+			   VIRTIO_CAN_ECHO_SKB_MAX);
+	if (!dev)
+		return -ENOMEM;
+
+	priv = netdev_priv(dev);
+
+	ida_init(&priv->tx_putidx_ida);
+
+	netif_napi_add(dev, &priv->napi, virtio_can_rx_poll);
+	netif_napi_add(dev, &priv->napi_tx, virtio_can_tx_poll);
+
+	SET_NETDEV_DEV(dev, &vdev->dev);
+
+	priv->dev = dev;
+	priv->vdev = vdev;
+	vdev->priv = priv;
+
+	priv->can.do_set_mode = virtio_can_set_mode;
+	priv->can.bittiming.bitrate = CAN_BITRATE_UNKNOWN;
+	/* Set Virtio CAN supported operations */
+	priv->can.ctrlmode_supported = CAN_CTRLMODE_BERR_REPORTING;
+	if (virtio_has_feature(vdev, VIRTIO_CAN_F_CAN_FD)) {
+		priv->can.fd.data_bittiming.bitrate = CAN_BITRATE_UNKNOWN;
+		err = can_set_static_ctrlmode(dev, CAN_CTRLMODE_FD);
+		if (err != 0)
+			goto on_failure;
+	}
+
+	/* Initialize virtqueues */
+	err = virtio_can_find_vqs(priv);
+	if (err != 0)
+		goto on_failure;
+
+	spin_lock_init(&priv->tx_lock);
+	mutex_init(&priv->ctrl_lock);
+
+	init_completion(&priv->ctrl_done);
+
+	priv->rpkt_len = sizeof(struct virtio_can_rx);
+
+	if (virtio_has_feature(vdev, VIRTIO_CAN_F_CAN_FD))
+		priv->rpkt_len += CANFD_MAX_DLEN;
+	else
+		priv->rpkt_len += CAN_MAX_DLEN;
+
+	size = priv->rpkt_len * priv->vqs[VIRTIO_CAN_QUEUE_RX]->num_free;
+	priv->rpkt = devm_kzalloc(&vdev->dev, size, GFP_KERNEL);
+	if (!priv->rpkt) {
+		virtio_can_del_vq(vdev);
+		err = -ENOMEM;
+		goto on_failure;
+	}
+	virtio_can_populate_rx_vq(vdev);
+
+	err = register_virtio_can_dev(dev);
+	if (err) {
+		virtio_can_del_vq(vdev);
+		goto on_failure;
+	}
+
+	return 0;
+
+on_failure:
+	virtio_can_free_candev(dev);
+	return err;
+}
+
+static int __maybe_unused virtio_can_freeze(struct virtio_device *vdev)
+{
+	struct virtio_can_priv *priv = vdev->priv;
+	struct net_device *ndev = priv->dev;
+
+	if (netif_running(ndev)) {
+		/* virtio_can_close() calls netif_stop_queue(), virtio_can_stop(),
+		 * napi_disable() and close_candev(). Call it directly (not via
+		 * dev_close()) to preserve IFF_UP so that netif_running() returns
+		 * true in virtio_can_restore() and the device is brought back up.
+		 */
+		virtio_can_close(ndev);
+		netif_device_detach(ndev);
+	}
+
+	priv->can.state = CAN_STATE_SLEEPING;
+
+	virtio_can_del_vq(vdev);
+
+	return 0;
+}
+
+static int __maybe_unused virtio_can_restore(struct virtio_device *vdev)
+{
+	struct virtio_can_priv *priv = vdev->priv;
+	struct net_device *ndev = priv->dev;
+	size_t size;
+	int err;
+
+	err = virtio_can_find_vqs(priv);
+	if (err != 0)
+		return err;
+
+	size = priv->rpkt_len * priv->vqs[VIRTIO_CAN_QUEUE_RX]->num_free;
+	priv->rpkt = devm_krealloc(&vdev->dev, priv->rpkt, size, GFP_KERNEL | __GFP_ZERO);
+	if (!priv->rpkt) {
+		virtio_can_del_vq(vdev);
+		return -ENOMEM;
+	}
+	virtio_can_populate_rx_vq(vdev);
+
+	if (netif_running(ndev)) {
+		/* virtio_can_open() calls open_candev(), virtio_can_start(),
+		 * napi_enable() and netif_start_queue(). Call it directly (not
+		 * via dev_open()) since IFF_UP is still set from before freeze.
+		 */
+		err = virtio_can_open(ndev);
+		if (err) {
+			virtio_can_del_vq(vdev);
+			return err;
+		}
+		netif_device_attach(ndev);
+	} else {
+		priv->can.state = CAN_STATE_STOPPED;
+	}
+
+	return 0;
+}
+
+static struct virtio_device_id virtio_can_id_table[] = {
+	{ VIRTIO_ID_CAN, VIRTIO_DEV_ANY_ID },
+	{ 0 },
+};
+
+static unsigned int features[] = {
+	VIRTIO_CAN_F_CAN_CLASSIC,
+	VIRTIO_CAN_F_CAN_FD,
+	VIRTIO_CAN_F_LATE_TX_ACK,
+	VIRTIO_CAN_F_RTR_FRAMES,
+};
+
+static struct virtio_driver virtio_can_driver = {
+	.feature_table = features,
+	.feature_table_size = ARRAY_SIZE(features),
+	.driver.name = KBUILD_MODNAME,
+	.driver.owner = THIS_MODULE,
+	.id_table = virtio_can_id_table,
+	.validate = virtio_can_validate,
+	.probe = virtio_can_probe,
+	.remove = virtio_can_remove,
+	.config_changed = virtio_can_config_changed,
+#ifdef CONFIG_PM_SLEEP
+	.freeze = virtio_can_freeze,
+	.restore = virtio_can_restore,
+#endif
+};
+
+module_virtio_driver(virtio_can_driver);
+MODULE_DEVICE_TABLE(virtio, virtio_can_id_table);
+
+MODULE_AUTHOR("OpenSynergy GmbH");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("CAN bus driver for Virtio CAN controller");
diff --git a/include/uapi/linux/virtio_can.h b/include/uapi/linux/virtio_can.h
new file mode 100644
index 000000000000..08d7e3e78776
--- /dev/null
+++ b/include/uapi/linux/virtio_can.h
@@ -0,0 +1,78 @@
+/* SPDX-License-Identifier: BSD-3-Clause */
+/*
+ * Copyright (C) 2021-2023 OpenSynergy GmbH
+ * Copyright Red Hat, Inc. 2025
+ */
+#ifndef _LINUX_VIRTIO_VIRTIO_CAN_H
+#define _LINUX_VIRTIO_VIRTIO_CAN_H
+
+#include <linux/types.h>
+#include <linux/virtio_types.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_config.h>
+
+/* Feature bit numbers */
+#define VIRTIO_CAN_F_CAN_CLASSIC        0
+#define VIRTIO_CAN_F_CAN_FD             1
+#define VIRTIO_CAN_F_RTR_FRAMES         2
+#define VIRTIO_CAN_F_LATE_TX_ACK        3
+
+/* CAN Result Types */
+#define VIRTIO_CAN_RESULT_OK            0
+#define VIRTIO_CAN_RESULT_NOT_OK        1
+
+/* CAN flags to determine type of CAN Id */
+#define VIRTIO_CAN_FLAGS_EXTENDED       0x8000
+#define VIRTIO_CAN_FLAGS_FD             0x4000
+#define VIRTIO_CAN_FLAGS_RTR            0x2000
+
+#define VIRTIO_CAN_MAX_DLEN    64 // this is like CANFD_MAX_DLEN
+
+struct virtio_can_config {
+#define VIRTIO_CAN_S_CTRL_BUSOFF (1u << 0) /* Controller BusOff */
+	/* CAN controller status */
+	__le16 status;
+};
+
+/* TX queue message types */
+struct virtio_can_tx_out {
+#define VIRTIO_CAN_TX                   0x0001
+	__le16 msg_type;
+	__le16 length; /* 0..8 CC, 0..64 CAN-FD, 0..2048 CAN-XL, 12 bits */
+	__u8 reserved_classic_dlc; /* If CAN classic length = 8 then DLC can be 8..15 */
+	__u8 padding;
+	__le16 reserved_xl_priority; /* May be needed for CAN XL priority */
+	__le32 flags;
+	__le32 can_id;
+	__u8 sdu[] __counted_by_le(length);
+};
+
+struct virtio_can_tx_in {
+	__u8 result;
+};
+
+/* RX queue message types */
+struct virtio_can_rx {
+#define VIRTIO_CAN_RX                   0x0101
+	__le16 msg_type;
+	__le16 length; /* 0..8 CC, 0..64 CAN-FD, 0..2048 CAN-XL, 12 bits */
+	__u8 reserved_classic_dlc; /* If CAN classic length = 8 then DLC can be 8..15 */
+	__u8 padding;
+	__le16 reserved_xl_priority; /* May be needed for CAN XL priority */
+	__le32 flags;
+	__le32 can_id;
+	__u8 sdu[] __counted_by_le(length);
+};
+
+/* Control queue message types */
+struct virtio_can_control_out {
+#define VIRTIO_CAN_SET_CTRL_MODE_START  0x0201
+#define VIRTIO_CAN_SET_CTRL_MODE_STOP   0x0202
+	__le16 msg_type;
+};
+
+struct virtio_can_control_in {
+	__u8 result;
+};
+
+#endif /* #ifndef _LINUX_VIRTIO_VIRTIO_CAN_H */
-- 
2.42.0


^ permalink raw reply related

* [PATCH 4/5] x86/xen: Get rid of last XEN_LAZY_MMU uses
From: Juergen Gross @ 2026-05-26 15:05 UTC (permalink / raw)
  To: linux-kernel, x86, virtualization
  Cc: Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Boris Ostrovsky, xen-devel
In-Reply-To: <20260526150514.129330-1-jgross@suse.com>

There are only very few use cases of XEN_LAZY_MMU left. Get rid of
them in order to avoid having to call enter_lazy(XEN_LAZY_MMU) and
leave_lazy(XEN_LAZY_MMU).

The query in xen_batched_set_pte() can be replaced by using
is_lazy_mmu_mode_active() instead.

As xen_flush_lazy_mmu() will be called only with lazy MMU mode being
active, the test for the lazy mode can just be dropped.

In xen_start_context_switch() and xen_end_context_switch() use
__task_lazy_mmu_mode_pause() and __task_lazy_mmu_mode_resume(),
allowing to drop xen_enter_lazy_mmu() and xen_leave_lazy_mmu()
completely.

Call arch_flush_lazy_mmu_mode() from arch_leave_lazy_mmu_mode(), as
this is the only required action now.

Drop the lazy mmu enter and leave paravirt hooks, leaving the flush
hook as the only needed one.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/include/asm/paravirt.h       |  9 ++++-----
 arch/x86/include/asm/paravirt_types.h | 11 +----------
 arch/x86/kernel/paravirt.c            |  6 +-----
 arch/x86/xen/enlighten_pv.c           |  7 ++-----
 arch/x86/xen/mmu_pv.c                 | 28 +++------------------------
 5 files changed, 11 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index cdfe4007443e..0591aa38fd85 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -483,17 +483,16 @@ static inline void arch_end_context_switch(struct task_struct *next)
 
 static inline void arch_enter_lazy_mmu_mode(void)
 {
-	PVOP_VCALL0(pv_ops, mmu.lazy_mode.enter);
 }
 
-static inline void arch_leave_lazy_mmu_mode(void)
+static inline void arch_flush_lazy_mmu_mode(void)
 {
-	PVOP_VCALL0(pv_ops, mmu.lazy_mode.leave);
+	PVOP_VCALL0(pv_ops, mmu.lazy_mode_flush);
 }
 
-static inline void arch_flush_lazy_mmu_mode(void)
+static inline void arch_leave_lazy_mmu_mode(void)
 {
-	PVOP_VCALL0(pv_ops, mmu.lazy_mode.flush);
+	arch_flush_lazy_mmu_mode();
 }
 
 static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 4f5ae0068aab..b4c4a23e77a1 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -19,15 +19,6 @@ struct cpumask;
 struct flush_tlb_info;
 struct vm_area_struct;
 
-#ifdef CONFIG_PARAVIRT_XXL
-struct pv_lazy_ops {
-	/* Set deferred update mode, used for batching operations. */
-	void (*enter)(void);
-	void (*leave)(void);
-	void (*flush)(void);
-} __no_randomize_layout;
-#endif
-
 struct pv_cpu_ops {
 	/* hooks for various privileged instructions */
 #ifdef CONFIG_PARAVIRT_XXL
@@ -171,7 +162,7 @@ struct pv_mmu_ops {
 
 	void (*set_pgd)(pgd_t *pgdp, pgd_t pgdval);
 
-	struct pv_lazy_ops lazy_mode;
+	void (*lazy_mode_flush)(void);
 
 	/* dom0 ops */
 
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 792fa96b3233..22f72034470f 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -204,11 +204,7 @@ struct paravirt_patch_template pv_ops = {
 
 	.mmu.enter_mmap		= paravirt_nop,
 
-	.mmu.lazy_mode = {
-		.enter		= paravirt_nop,
-		.leave		= paravirt_nop,
-		.flush		= paravirt_nop,
-	},
+	.mmu.lazy_mode_flush	= paravirt_nop,
 
 	.mmu.set_fixmap		= native_set_fixmap,
 #endif /* CONFIG_PARAVIRT_XXL */
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 428189f5d437..8ee4910d597a 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -424,9 +424,7 @@ static void xen_start_context_switch(struct task_struct *prev)
 {
 	BUG_ON(preemptible());
 
-	if (this_cpu_read(xen_lazy_mode) == XEN_LAZY_MMU) {
-		arch_leave_lazy_mmu_mode();
-	}
+	__task_lazy_mmu_mode_pause(prev);
 	enter_lazy(XEN_LAZY_CPU);
 }
 
@@ -436,8 +434,7 @@ static void xen_end_context_switch(struct task_struct *next)
 
 	xen_mc_flush();
 	leave_lazy(XEN_LAZY_CPU);
-	if (__task_lazy_mmu_mode_active(next))
-		arch_enter_lazy_mmu_mode();
+	__task_lazy_mmu_mode_resume(next);
 }
 
 static unsigned long xen_store_tr(void)
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 51dbdc67c73c..928b4b0a4484 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -324,7 +324,7 @@ static bool xen_batched_set_pte(pte_t *ptep, pte_t pteval)
 {
 	struct mmu_update u;
 
-	if (xen_get_lazy_mode() != XEN_LAZY_MMU)
+	if (!is_lazy_mmu_mode_active())
 		return false;
 
 	xen_mc_batch();
@@ -2151,21 +2151,10 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
 #endif
 }
 
-static void xen_enter_lazy_mmu(void)
-{
-	preempt_disable();
-	if (xen_get_lazy_mode() != XEN_LAZY_MMU)
-		enter_lazy(XEN_LAZY_MMU);
-	preempt_enable();
-}
-
 static void xen_flush_lazy_mmu(void)
 {
 	preempt_disable();
-
-	if (xen_get_lazy_mode() == XEN_LAZY_MMU)
-		xen_mc_flush();
-
+	xen_mc_flush();
 	preempt_enable();
 }
 
@@ -2189,15 +2178,6 @@ static void __init xen_post_allocator_init(void)
 	pv_ops.mmu.write_cr3 = &xen_write_cr3;
 }
 
-static void xen_leave_lazy_mmu(void)
-{
-	preempt_disable();
-	xen_mc_flush();
-	if (xen_get_lazy_mode() != XEN_LAZY_NONE)
-		leave_lazy(XEN_LAZY_MMU);
-	preempt_enable();
-}
-
 void __init xen_init_mmu_ops(void)
 {
 	x86_init.paging.pagetable_init = xen_pagetable_init;
@@ -2237,9 +2217,7 @@ void __init xen_init_mmu_ops(void)
 	pv_ops.mmu.make_p4d = PV_CALLEE_SAVE(xen_make_p4d);
 	pv_ops.mmu.enter_mmap = xen_enter_mmap;
 	pv_ops.mmu.exit_mmap = xen_exit_mmap;
-	pv_ops.mmu.lazy_mode.enter = xen_enter_lazy_mmu;
-	pv_ops.mmu.lazy_mode.leave = xen_leave_lazy_mmu;
-	pv_ops.mmu.lazy_mode.flush = xen_flush_lazy_mmu;
+	pv_ops.mmu.lazy_mode_flush = xen_flush_lazy_mmu;
 	pv_ops.mmu.set_fixmap = xen_set_fixmap;
 
 	memset(dummy_mapping, 0xff, PAGE_SIZE);
-- 
2.54.0


^ permalink raw reply related

* [PATCH 0/5] x86/xen: Get rid of Xen private lazy MMU mode tracking
From: Juergen Gross @ 2026-05-26 15:05 UTC (permalink / raw)
  To: linux-kernel, x86, linux-trace-kernel, linux-mm, virtualization
  Cc: Juergen Gross, Boris Ostrovsky, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, xen-devel, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Ajay Kaher, Alexey Makhalov, Broadcom internal kernel review list

With generic lazy MMU mode tracking being available, there is no real
need for having Xen PV specific code to track lazy MMU mode, too.

This makes it possible to drop two paravirt hooks.

Note that this series is based on my series "x86/xen: Do some Xen-PV
related cleanups" [1].

[1]: https://lore.kernel.org/lkml/20260522152114.77319-1-jgross@suse.com/

Juergen Gross (5):
  x86/xen: Drop lazy mode from trace entries
  x86/xen: Change interface of xen_mc_issue()
  mm: Refactor lazy_mmu_mode_pause() and lazy_mmu_mode_resume()
  x86/xen: Get rid of last XEN_LAZY_MMU uses
  x86/xen: Replace generic lazy tracking with cpu specific one

 arch/x86/include/asm/paravirt.h       |  9 ++--
 arch/x86/include/asm/paravirt_types.h | 11 +----
 arch/x86/include/asm/xen/hypervisor.h | 25 +---------
 arch/x86/kernel/paravirt.c            |  6 +--
 arch/x86/xen/enlighten_pv.c           | 30 +++++-------
 arch/x86/xen/mmu_pv.c                 | 66 +++++++++------------------
 arch/x86/xen/xen-ops.h                | 12 +++--
 include/linux/pgtable.h               | 56 ++++++++++++++++++-----
 include/trace/events/xen.h            | 33 ++++++++------
 9 files changed, 112 insertions(+), 136 deletions(-)

-- 
2.54.0


^ permalink raw reply

* [PATCH] vsock/vmci: fix sk_ack_backlog leak on failed handshake
From: Raf Dickson @ 2026-05-26 10:43 UTC (permalink / raw)
  To: netdev, virtualization, linux-kernel
  Cc: sgarzare, stefanha, bryan-bt.tan, vishnu.dasa,
	bcm-kernel-feedback-list, stable, Raf Dickson

When vmci_transport_recv_connecting_server() returns an error,
vmci_transport_recv_listen() calls vsock_remove_pending() but never
calls sk_acceptq_removed(). This leaves sk_ack_backlog incremented
permanently.

Repeated handshake failures (malformed packets, queue pair alloc
failure, event subscribe failure) cause sk_ack_backlog to climb
toward sk_max_ack_backlog. Once it reaches the limit the listener
permanently refuses all new connections with -ECONNREFUSED, a
silent denial of service requiring a process restart to recover.

The two existing sk_acceptq_removed() calls in af_vsock.c do not
cover this path: line 764 checks vsock_is_pending() which returns
false after vsock_remove_pending(), and line 1889 is only reached
on successful accept().

Fix by balancing sk_acceptq_added() with sk_acceptq_removed() on
the error path.

Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Cc: stable@vger.kernel.org
Signed-off-by: Raf Dickson <rafdog35@gmail.com>
---
 net/vmw_vsock/vmci_transport.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index d2579380f5..88ccc55455 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -980,8 +980,10 @@ static int vmci_transport_recv_listen(struct sock *sk,
 			err = -EINVAL;
 		}

-		if (err < 0)
+		if (err < 0) {
 			vsock_remove_pending(sk, pending);
+			sk_acceptq_removed(sk);
+		}

 		release_sock(pending);
 		vmci_transport_release_pending(pending);
-- 
2.54.0

^ permalink raw reply related

* Re: [PATCH v2] drm/virtio: abort virtqueue wait on device removal to avoid hung task
From: Dmitry Osipenko @ 2026-05-26 10:24 UTC (permalink / raw)
  To: Ryosuke Yasuoka, David Airlie, Gerd Hoffmann, Gurchetan Singh,
	Chia-I Wu, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter
  Cc: dri-devel, virtualization, linux-kernel
In-Reply-To: <18b1d7267ace8b76.329809b8dd1e13b3.4134c0f069791aa6@ryasuoka-thinkpadx1carbongen9.tokyo.csb>

On 5/22/26 11:51, Ryosuke Yasuoka wrote:
> Hi Dmitry,
> Thank you for your review and the comment.
> 
> On 21/05/2026 22:01, Dmitry Osipenko wrote:
>> 21.05.2026 05:19, Ryosuke Yasuoka пишет:
>>> virtio_gpu_queue_ctrl_sgs() and virtio_gpu_queue_cursor() use
>>> wait_event() without any abort condition when waiting for virtqueue
>>> space. If the host device stops processing commands, these waits block
>>> indefinitely inside a drm_dev_enter/exit() critical section. Since
>>> drm_dev_unplug(), which is called in device removal and system shutdown
>>> call path, blocks on synchronize_srcu() until all critical sections
>>> complete, device removal and system shutdown also hang.
>>>
>>> Add a vqs_released flag to virtio_gpu_device and include it in the
>>> wait_event() condition. Set the flag and wake up both queues in a new
>>> virtio_gpu_release_vqs() helper, called before drm_dev_unplug() in both
>>> virtio_gpu_remove() and virtio_gpu_shutdown(). When the flag is set, the
>>> wait returns immediately and the command is aborted, following the same
>>> cleanup path as drm_dev_enter() failure.
>>>
>>> Reported-by: syzbot+d6dd6f86d3aaf7eebe7406e45c1c6e549453f224@syzkaller.appspotmail.com
>>> Closes: https://syzkaller.appspot.com/bug?id=d6dd6f86d3aaf7eebe7406e45c1c6e549453f224
>>> Reported-by: syzbot+908bd910da5dd79b88de4cf7baf376cc873a922e@syzkaller.appspotmail.com
>>> Closes: https://syzkaller.appspot.com/bug?id=908bd910da5dd79b88de4cf7baf376cc873a922e
>>> Signed-off-by: Ryosuke Yasuoka <ryasuoka@redhat.com>
>>> ---
>>> Changes in v2:
>>> - Update the commit message.
>>> - Replace wait_event_timeout() with wait_event() using a compound
>>> condition that includes a new vqs_released flag.
>>> - Add virtio_gpu_release_vqs() helper to set the flag and wake up
>>> both queues, called before drm_dev_unplug() in remove and shutdown
>>> paths.
>>> - Remove the hardcoded 5-second timeout. Recovery is now driven by
>>> the driver flag instead of an arbitrary timeout value.
>>> ---
>>>  drivers/gpu/drm/virtio/virtgpu_drv.c | 15 +++++++++++++++
>>>  drivers/gpu/drm/virtio/virtgpu_drv.h |  1 +
>>>  drivers/gpu/drm/virtio/virtgpu_vq.c  | 23 +++++++++++++++++++++--
>>>  3 files changed, 37 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.c b/drivers/gpu/drm/virtio/virtgpu_drv.c
>>> index a5ce96fb8a1d..e4fe5e0780f9 100644
>>> --- a/drivers/gpu/drm/virtio/virtgpu_drv.c
>>> +++ b/drivers/gpu/drm/virtio/virtgpu_drv.c
>>> @@ -119,10 +119,24 @@ static int virtio_gpu_probe(struct virtio_device *vdev)
>>>  	return ret;
>>>  }
>>>  
>>> +/*
>>> + * Release pending virtqueue waits so the drm_dev_enter/exit() critical
>>> + * sections complete before drm_dev_unplug() blocks on synchronize_srcu().
>>> + */
>>> +static void virtio_gpu_release_vqs(struct drm_device *dev)
>>> +{
>>> +	struct virtio_gpu_device *vgdev = dev->dev_private;
>>> +
>>> +	vgdev->vqs_released = true;
>>> +	wake_up_all(&vgdev->ctrlq.ack_queue);
>>> +	wake_up_all(&vgdev->cursorq.ack_queue);
>>> +}
>>> +
>>>  static void virtio_gpu_remove(struct virtio_device *vdev)
>>>  {
>>>  	struct drm_device *dev = vdev->priv;
>>>  
>>> +	virtio_gpu_release_vqs(dev);
>>>  	drm_dev_unplug(dev);
>>>  	drm_atomic_helper_shutdown(dev);
>>>  	virtio_gpu_deinit(dev);
>>> @@ -133,6 +147,7 @@ static void virtio_gpu_shutdown(struct virtio_device *vdev)
>>>  {
>>>  	struct drm_device *dev = vdev->priv;
>>>  
>>> +	virtio_gpu_release_vqs(dev);
>>>  	/* stop talking to the device */
>>>  	drm_dev_unplug(dev);
>>>  }
>>> diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.h b/drivers/gpu/drm/virtio/virtgpu_drv.h
>>> index f17660a71a3e..0bd69a40857e 100644
>>> --- a/drivers/gpu/drm/virtio/virtgpu_drv.h
>>> +++ b/drivers/gpu/drm/virtio/virtgpu_drv.h
>>> @@ -235,6 +235,7 @@ struct virtio_gpu_device {
>>>  
>>>  	struct virtio_gpu_queue ctrlq;
>>>  	struct virtio_gpu_queue cursorq;
>>> +	bool vqs_released;
>>>  	struct kmem_cache *vbufs;
>>>  
>>>  	atomic_t pending_commands;
>>> diff --git a/drivers/gpu/drm/virtio/virtgpu_vq.c b/drivers/gpu/drm/virtio/virtgpu_vq.c
>>> index 67865810a2e7..8057a9b7356d 100644
>>> --- a/drivers/gpu/drm/virtio/virtgpu_vq.c
>>> +++ b/drivers/gpu/drm/virtio/virtgpu_vq.c
>>> @@ -396,7 +396,19 @@ static int virtio_gpu_queue_ctrl_sgs(struct virtio_gpu_device *vgdev,
>>>  	if (vq->num_free < elemcnt) {
>>>  		spin_unlock(&vgdev->ctrlq.qlock);
>>>  		virtio_gpu_notify(vgdev);
>>> -		wait_event(vgdev->ctrlq.ack_queue, vq->num_free >= elemcnt);
>>> +		wait_event(vgdev->ctrlq.ack_queue,
>>> +			   vq->num_free >= elemcnt || vgdev->vqs_released);
>>> +		/*
>>> +		 * Set by virtio_gpu_release_vqs() to unblock
>>> +		 * synchronize_srcu() wait in drm_dev_unplug().
>>> +		 */
>>> +		if (vgdev->vqs_released) {
>>> +			if (fence && vbuf->objs)
>>> +				virtio_gpu_array_unlock_resv(vbuf->objs);
>>> +			free_vbuf(vgdev, vbuf);
>>> +			drm_dev_exit(idx);
>>> +			return -ENODEV;
>>> +		}
>>>  		goto again;
>>>  	}
>>>  
>>> @@ -566,7 +578,14 @@ static void virtio_gpu_queue_cursor(struct virtio_gpu_device *vgdev,
>>>  	ret = virtqueue_add_sgs(vq, sgs, outcnt, 0, vbuf, GFP_ATOMIC);
>>>  	if (ret == -ENOSPC) {
>>>  		spin_unlock(&vgdev->cursorq.qlock);
>>> -		wait_event(vgdev->cursorq.ack_queue, vq->num_free >= outcnt);
>>> +		wait_event(vgdev->cursorq.ack_queue,
>>> +			   vq->num_free >= outcnt || vgdev->vqs_released);
>>> +		/* See comment in virtio_gpu_queue_ctrl_sgs(). */
>>> +		if (vgdev->vqs_released) {
>>> +			free_vbuf(vgdev, vbuf);
>>> +			drm_dev_exit(idx);
>>> +			return;
>>> +		}
>>>  		spin_lock(&vgdev->cursorq.qlock);
>>>  		goto retry;
>>>  	} else {
>>
>> What about other wait_event in the driver? Why only these?
> 
> There are other wait_event() calls on vgdev->resp_wq in virtgpu_prime.c
> and virtgpu_vram.c. These can also cause a stuck process when the host
> device stops or does not respond and should be fixed. However, they are
> not related to the issue being fixed here. I think they should be
> addressed in a separate commit.
> 
> The wait_event(vgdev->resp_wq) calls wait for a host response, not for
> virtqueue ring space. They are not inside a drm_dev_enter/exit()
> critical section, so they don't block drm_dev_unplug() ->
> synchronize_srcu() and are not part of the deadlock reported by syzbot.
> 
> However, if a process is stuck in wait_event(vgdev->resp_wq), there is
> no way to recover other than host device recovery itself. We can still
> remove the device and destroy the virtqueues while the process is stuck,
> so the host response can never arrive. IIUC, the stuck thread holds a
> drm_device reference, preventing vgdev from being freed; this is a
> resource leak.
> 
> This can be fixed by adding a wake_up_all() and a vqs_released check
> similar to what is done for ctrlq/cursorq. However, I think this should
> be a separate commit since the root cause and problem are different.
> 
> Alternatively, the resp_wq wait_event() calls could be converted to
> wait_event_interruptible() to fix the issue. But as you mentioned
> earlier in the v1 comment, we need the wait_event_interruptible()
> rework.
> 
> What do you think? Should I include the resp_wq fix as a separate commit
> in this patch series, or leave it for the _interruptible() rework?

Let's leave it for the later rework.

I briefly tried testing this patch and it's not apparent what exact
problem this patch solves.

With a regular Linux OS running systemd, during normal shutdown, systemd
waits for processes to be terminated before it would perform kernel
shutdown, hence virtio_gpu_shutdown() doesn't have a chance to be
invoked in my scenario.

In the referenced syzkaller reports, I don't see anything related to
shutdown or driver removal. Could you please clarify how to reproduce
and test the original issue?

-- 
Best regards,
Dmitry

^ permalink raw reply

* [PATCH] vhost/net: fix clear_user start address in VHOST_GET_FEATURES_ARRAY
From: rom.wang @ 2026-05-26  8:03 UTC (permalink / raw)
  To: Michael S . Tsirkin, Jason Wang, Eugenio Pérez, Paolo Abeni,
	kvm, virtualization
  Cc: netdev, linux-kernel, Yufeng Wang

From: Yufeng Wang <wangyufeng@kylinos.cn>

The clear_user() call in VHOST_GET_FEATURES_ARRAY incorrectly starts
at argp, which is the beginning of the features array, overwriting the
data just written by copy_to_user(). It should start after the copied
elements at argp + copied * sizeof(u64) to only zero the trailing
unused space.

Fixes: 333c515d1896 ("vhost-net: allow configuring extended features")
Signed-off-by: Yufeng Wang <wangyufeng@kylinos.cn>
---
 drivers/vhost/net.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index db341c922673..70c578acf840 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1777,7 +1777,8 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
 			return -EFAULT;
 
 		/* Zero the trailing space provided by user-space, if any */
-		if (clear_user(argp, size_mul(count - copied, sizeof(u64))))
+		if (clear_user(argp + copied * sizeof(u64),
+			       size_mul(count - copied, sizeof(u64))))
 			return -EFAULT;
 		return 0;
 	case VHOST_SET_FEATURES_ARRAY:
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH 00/11] Convert moduleparams to seq_buf
From: Petr Pavlu @ 2026-05-26  6:53 UTC (permalink / raw)
  To: Kees Cook
  Cc: Luis Chamberlain, Pengpeng Hou, Richard Weinberger, Anton Ivanov,
	Johannes Berg, Rafael J. Wysocki, Len Brown, Corey Minyard,
	Gabriel Somlo, Michael S. Tsirkin, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, David Airlie, Simona Vetter,
	Bart Van Assche, Jason Gunthorpe, Leon Romanovsky,
	Laurent Pinchart, Hans de Goede, Mauro Carvalho Chehab,
	Bjorn Helgaas, Hannes Reinecke, James E.J. Bottomley,
	Martin K. Petersen, Daniel Lezcano, Zhang Rui, Lukasz Luba,
	Greg Kroah-Hartman, Jiri Slaby, Alan Stern, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Jason Baron, Jim Cromie, Tiwei Bie,
	Benjamin Berg, Ilpo Järvinen, David E. Box,
	Maciej W. Rozycki, Srinivas Pandruvada, Peter Zijlstra,
	Heiko Carstens, Vasily Gorbik, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vinod Koul, Frank Li, Daniel Gomez, Sami Tolvanen,
	Aaron Tomlin, Alexander Potapenko, Marco Elver, Dmitry Vyukov,
	Andrew Morton, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Andy Shevchenko, Georgia Garcia, kvm, dmaengine,
	linux-modules, kasan-dev, linux-mm, apparmor,
	linux-security-module, linux-um, linux-acpi, openipmi-developer,
	qemu-devel, intel-gfx, dri-devel, linux-rdma, linux-media,
	linux-pci, linux-scsi, linux-pm, linuxppc-dev, linux-serial,
	linux-usb, usb-storage, virtualization, linux-kernel, linux-arch,
	netdev, linux-fsdevel, linux-hardening
In-Reply-To: <20260521133315.work.845-kees@kernel.org>

On 5/21/26 3:33 PM, Kees Cook wrote:
> Hi,
> 
> I tried to trim the CC list here, but it's still pretty huge...
> 
> We've had a long-standing issue with "write to a string pointer" callbacks
> that don't bounds check the destination (and for which the bounds is
> also not part of the callback prototype, even if it is "known" to be
> PAGE_SIZE, which sysfs_emit() depends on). Both moduleparams and sysfs
> use this pattern. As a first step, and to test the migration method,
> migrate moduleparams first.
> 
> There are 2 "mechanical" treewide patches that are handled by Coccinelle:
> - treewide: Convert struct kernel_param_ops initializers to DEFINE_KERNEL_PARAM_OPS
> - treewide: Convert custom kernel_param_ops .get callbacks to seq_buf via cocci
> 
> The last treewide patch is manual, and may need to be broken up into
> per-subsystem patches, though I'd prefer to avoid this, as it would
> extend the migration from 1 relase to at least 2 releases. (1 to
> release the migration infrastructure, then 1 release to collect all the
> subsystem changes, and possibly 1 more release to remove the migration
> infrastructure.)
> 
> Thoughts, questions?

This looks reasonable to me. I added a few minor comments on the patches
but they already look solid.

-- 
Thanks,
Petr

^ permalink raw reply

* Re: [PATCH] virtio_net: drop redundant err assignment in virtnet_probe()
From: Jakub Kicinski @ 2026-05-25 18:49 UTC (permalink / raw)
  To: Li Wang
  Cc: Michael S . Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	netdev, virtualization, linux-kernel
In-Reply-To: <20260520102309.281616-1-liwang@kylinos.cn>

On Wed, 20 May 2026 18:23:09 +0800 Li Wang wrote:
> err is initialized to -ENOMEM at the start of virtnet_probe(), and no
> code path between that and the devm_kzalloc() failure can change
> err. Assigning -ENOMEM again before goto free therefore is redundant.

This is asking for a bug, someone may add code before this 
if () overriding err to 0. They are 130 lines apart!

If you want to removing something remove the inline init at the start
of the function. But FTR I think that'd also be pointless churn.

^ permalink raw reply

* Re: [PATCH net] vsock/virtio: fix skb overhead overflow on 32-bit builds
From: David Laight @ 2026-05-25 17:14 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Michael S. Tsirkin, patchwork-bot+netdevbpf, netdev, xuanzhuo,
	horms, virtualization, linux-kernel, kvm, kuba, eperezma, pabeni,
	davem, jasowang, stefanha, edumazet, stable
In-Reply-To: <ahRmBTnNMrg-oano@sgarzare-redhat>

On Mon, 25 May 2026 17:16:18 +0200
Stefano Garzarella <sgarzare@redhat.com> wrote:

> On Mon, May 25, 2026 at 03:53:22PM +0100, David Laight wrote:
> >On Mon, 25 May 2026 15:09:54 +0200
> >Stefano Garzarella <sgarzare@redhat.com> wrote:
...
> >> >Indeed, I wasn't thinking. For this to even get close to overflowing
> >> >we'd have to have almost all of 4G available to the 32 bit kernel taken
> >> >up by this single queue.  
> >
> >Except there is usually only 1G or 2G available to the kernel.
> >And all the skb would have to contain no data.  
> 
> `skb_overhead` was introduced to prevent exactly queueing skb without 
> any data (essentially containing just the EOM for SEQPACKET) that we 
> plan to fix properly in some other way instead of queueing empty skb.

I think most of the network code counts skb->truesize so that it limits
kernel memory on the queue rather than the amount of data.

This does rather rely on the low level code setting it to a sane value.
Last time I looked some of the usbnet drivers lied badly.
(IIRC short packets in a 64k buffer with the truesize set to the data length.)

You need to do something to limit the number of single (or even zero) byte
SEQPACKET messages that can be queued.

-- David

^ permalink raw reply

* Re: [PATCH 08/11] params: Convert generic kernel_param_ops .get helpers to seq_buf
From: Petr Pavlu @ 2026-05-25 17:10 UTC (permalink / raw)
  To: Kees Cook
  Cc: Luis Chamberlain, Pengpeng Hou, Richard Weinberger, Anton Ivanov,
	Johannes Berg, Rafael J. Wysocki, Len Brown, Corey Minyard,
	Gabriel Somlo, Michael S. Tsirkin, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, David Airlie, Simona Vetter,
	Bart Van Assche, Jason Gunthorpe, Leon Romanovsky,
	Laurent Pinchart, Hans de Goede, Mauro Carvalho Chehab,
	Bjorn Helgaas, Hannes Reinecke, James E.J. Bottomley,
	Martin K. Petersen, Daniel Lezcano, Zhang Rui, Lukasz Luba,
	Greg Kroah-Hartman, Jiri Slaby, Alan Stern, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Jason Baron, Jim Cromie, Tiwei Bie,
	Benjamin Berg, Ilpo Järvinen, David E. Box,
	Maciej W. Rozycki, Srinivas Pandruvada, Peter Zijlstra,
	Heiko Carstens, Vasily Gorbik, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vinod Koul, Frank Li, Daniel Gomez, Sami Tolvanen,
	Aaron Tomlin, Alexander Potapenko, Marco Elver, Dmitry Vyukov,
	Andrew Morton, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Andy Shevchenko, Georgia Garcia, kvm, dmaengine,
	linux-modules, kasan-dev, linux-mm, apparmor,
	linux-security-module, linux-um, linux-acpi, openipmi-developer,
	qemu-devel, intel-gfx, dri-devel, linux-rdma, linux-media,
	linux-pci, linux-scsi, linux-pm, linuxppc-dev, linux-serial,
	linux-usb, usb-storage, virtualization, linux-kernel, linux-arch,
	netdev, linux-fsdevel, linux-hardening
In-Reply-To: <20260521133326.2465264-8-kees@kernel.org>

On 5/21/26 3:33 PM, Kees Cook wrote:
> Convert the generic struct kernel_param_ops .get helpers in
> kernel/params.c directly to the seq_buf signature, drop their legacy
> "char *" form, and refresh prototypes in <linux/moduleparam.h>:
> 
>   param_get_byte/short/ushort/int/uint/long/ulong/ullong/hexint
>   param_get_charp/bool/invbool/string
>   param_array_get
> 
> The STANDARD_PARAM_DEF() macro expands to a seq_buf body for every
> numeric helper. param_array_get() now writes element output directly
> into the parent seq_buf when the element ops provide .get; it only
> allocates the per-call PAGE_SIZE bounce buffer when the element ops
> still use the legacy .get_str path. The common "rewrite the prior
> element's trailing newline as a comma" step lives outside both
> branches so the two paths share it.
> 
> The non-core changes in this commit (arch/x86/kvm, mm/kfence,
> drivers/dma/dmatest, security/apparmor) are the small set of callers that
> directly invoke one of the converted generic helpers from their own .get
> callback (e.g. an apparmor wrapper that adds a capability check and then
> delegates to param_get_bool()). Because the helpers' signature changes
> here, these wrappers must move in lockstep. Each of them is updated
> to take "struct seq_buf *" and pass it through; param_get_debug() in
> apparmor also pulls aa_print_debug_params() (and its val_mask_to_str()
> helper, in security/apparmor/lib.c) over to seq_buf, since that is the
> only consumer. No other behavioural change is intended.
> 
> Custom .get callbacks that do not delegate to a generic helper (and
> therefore still match the .get_str signature) are routed automatically
> to the .get_str field by the DEFINE_KERNEL_PARAM_OPS _Generic dispatcher
> and are deliberately left alone here, to be changed separately within
> their respective subsystems.
> 
> Signed-off-by: Kees Cook <kees@kernel.org>
> ---
> [...]
> @@ -453,36 +457,46 @@ static int param_array_set(const char *val, const struct kernel_param *kp)
>  			   arr->num ?: &temp_num);
>  }
>  
> -static int param_array_get(char *buffer, const struct kernel_param *kp)
> +static int param_array_get(struct seq_buf *s, const struct kernel_param *kp)
>  {
> -	int i, off, ret;
> -	char *elem_buf;
>  	const struct kparam_array *arr = kp->arr;
>  	struct kernel_param p = *kp;
> +	char *elem_buf = NULL;
> +	int i, ret = 0;
>  
> -	elem_buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
> -	if (!elem_buf)
> -		return -ENOMEM;
> +	for (i = 0; i < (arr->num ? *arr->num : arr->max); i++) {
> +		size_t before = s->len;
>  
> -	for (i = off = 0; i < (arr->num ? *arr->num : arr->max); i++) {
>  		p.arg = arr->elem + arr->elemsize * i;
>  		check_kparam_locked(p.mod);
> -		ret = arr->ops->get_str(elem_buf, &p);
> -		if (ret < 0)
> -			goto out;
> -		ret = min(ret, (int)(PAGE_SIZE - 1 - off));
> -		if (!ret)
> +
> +		if (arr->ops->get) {
> +			ret = arr->ops->get(s, &p);
> +			if (ret < 0)
> +				goto out;
> +		} else {
> +			if (!elem_buf) {
> +				elem_buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
> +				if (!elem_buf) {
> +					ret = -ENOMEM;
> +					goto out;
> +				}
> +			}
> +			ret = arr->ops->get_str(elem_buf, &p);
> +			if (ret < 0)
> +				goto out;
> +			seq_buf_putmem(s, elem_buf, ret);
> +		}
> +
> +		/* Nothing got written (e.g. overflow) — stop. */
> +		if (s->len == before)
>  			break;
> +
>  		/* Replace the previous element's trailing newline with a comma. */
> -		if (i)
> -			buffer[off - 1] = ',';
> -		memcpy(buffer + off, elem_buf, ret);
> -		off += ret;
> -		if (off == PAGE_SIZE - 1)
> -			break;
> +		if (i && s->buffer[before - 1] == '\n')
> +			s->buffer[before - 1] = ',';
>  	}
> -	buffer[off] = '\0';
> -	ret = off;
> +	ret = 0;
>  out:
>  	kfree(elem_buf);
>  	return ret;

Since you're almost completely rewriting the logic in param_array_get(),
I suggest tightening it up a bit. The function could warn or return an
error when a kernel_param_ops::get/get_str() call adds a string that
doesn't terminate with '\n', specifically, when the call adds either
a zero-length string or a non-zero-length string that ends with
a different character (unless an overflow occurred).

The updated code silently stops the loop when a get call returns
a zero-length string. Similarly, handling of a string not terminated by
'\n' is halfway there because of the added check
"s->buffer[before - 1] == '\n'".

-- 
Thanks,
Petr

^ permalink raw reply

* Re: [PATCH 07/11] moduleparam: Route DEFINE_KERNEL_PARAM_OPS get pointer via _Generic
From: Petr Pavlu @ 2026-05-25 16:24 UTC (permalink / raw)
  To: Kees Cook
  Cc: Luis Chamberlain, Pengpeng Hou, Richard Weinberger, Anton Ivanov,
	Johannes Berg, Rafael J. Wysocki, Len Brown, Corey Minyard,
	Gabriel Somlo, Michael S. Tsirkin, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, David Airlie, Simona Vetter,
	Bart Van Assche, Jason Gunthorpe, Leon Romanovsky,
	Laurent Pinchart, Hans de Goede, Mauro Carvalho Chehab,
	Bjorn Helgaas, Hannes Reinecke, James E.J. Bottomley,
	Martin K. Petersen, Daniel Lezcano, Zhang Rui, Lukasz Luba,
	Greg Kroah-Hartman, Jiri Slaby, Alan Stern, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Jason Baron, Jim Cromie, Tiwei Bie,
	Benjamin Berg, Ilpo Järvinen, David E. Box,
	Maciej W. Rozycki, Srinivas Pandruvada, Peter Zijlstra,
	Heiko Carstens, Vasily Gorbik, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vinod Koul, Frank Li, Daniel Gomez, Sami Tolvanen,
	Aaron Tomlin, Alexander Potapenko, Marco Elver, Dmitry Vyukov,
	Andrew Morton, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Andy Shevchenko, Georgia Garcia, kvm, dmaengine,
	linux-modules, kasan-dev, linux-mm, apparmor,
	linux-security-module, linux-um, linux-acpi, openipmi-developer,
	qemu-devel, intel-gfx, dri-devel, linux-rdma, linux-media,
	linux-pci, linux-scsi, linux-pm, linuxppc-dev, linux-serial,
	linux-usb, usb-storage, virtualization, linux-kernel, linux-arch,
	netdev, linux-fsdevel, linux-hardening
In-Reply-To: <20260521133326.2465264-7-kees@kernel.org>

On 5/21/26 3:33 PM, Kees Cook wrote:
> Make the DEFINE_KERNEL_PARAM_OPS family route their _get argument to
> either .get (struct seq_buf *) or .get_str (char *) at compile time
> based on the pointer's actual function signature. Two helper macros
> do the routing:
> 
>   _KERNEL_PARAM_OPS_GET     - return the pointer if it has the seq_buf
>                               signature, otherwise NULL of that type
>   _KERNEL_PARAM_OPS_GET_STR - mirror image for the char * signature
> 
> Both use _Generic; only the two valid function-pointer types are
> listed, so any third-party type is a compile error rather than
> silently falling through.
> 
> Now a callback whose body has been migrated from char * to struct
> seq_buf * needs no change at its kernel_param_ops initialization site,
> because the macro picks up the new type automatically and assigns to
> the correct field.
> 
> Signed-off-by: Kees Cook <kees@kernel.org>
> ---
>  include/linux/moduleparam.h | 33 ++++++++++++++++++++++++++-------
>  1 file changed, 26 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/moduleparam.h b/include/linux/moduleparam.h
> index c52120f6ac28..795bc7c654ef 100644
> --- a/include/linux/moduleparam.h
> +++ b/include/linux/moduleparam.h
> @@ -85,15 +85,32 @@ struct kernel_param_ops {
>   *
>   *   static DEFINE_KERNEL_PARAM_OPS(my_ops, my_set, my_get);
>   *
> - * Routing the @_set and @_get function pointers through the macro
> - * (rather than naming the struct fields at every call site) lets the
> - * field layout change in one place when callbacks are migrated to a
> - * new signature.
> + * @_get may be either of:
> + *   int (*)(struct seq_buf *, const struct kernel_param *) (seq_buf)
> + *   int (*)(char *, const struct kernel_param *)           (legacy)
> + *
> + * The macro uses _Generic to route the function pointer to the
> + * matching field (.get or .get_str) at compile time, leaving the
> + * other field NULL. Each helper matches the wrong prototype signature
> + * and returns NULL, falling through to the default branch otherwise;
> + * if @_get has neither expected signature the assignment to the
> + * fields gets a normal compile-time type-mismatch error.
>   */
> +#define _KERNEL_PARAM_OPS_GET(_get)					\
> +	_Generic((_get),						\
> +	    int (*)(char *, const struct kernel_param *): NULL,		\
> +	    default: (_get))
> +
> +#define _KERNEL_PARAM_OPS_GET_STR(_get)					\
> +	_Generic((_get),						\
> +	    int (*)(struct seq_buf *, const struct kernel_param *): NULL, \
> +	    default: (_get))
> +
>  #define DEFINE_KERNEL_PARAM_OPS(_name, _set, _get)			\
>  	const struct kernel_param_ops _name = {				\
>  		.set = (_set),						\
> -		.get_str = (_get),					\
> +		.get = _KERNEL_PARAM_OPS_GET(_get),			\
> +		.get_str = _KERNEL_PARAM_OPS_GET_STR(_get),		\
>  	}
>  
>  /* As DEFINE_KERNEL_PARAM_OPS, with KERNEL_PARAM_OPS_FL_NOARG set. */
> @@ -101,14 +118,16 @@ struct kernel_param_ops {
>  	const struct kernel_param_ops _name = {				\
>  		.flags = KERNEL_PARAM_OPS_FL_NOARG,			\
>  		.set = (_set),						\
> -		.get_str = (_get),					\
> +		.get = _KERNEL_PARAM_OPS_GET(_get),			\
> +		.get_str = _KERNEL_PARAM_OPS_GET_STR(_get),		\
>  	}
>  
>  /* As DEFINE_KERNEL_PARAM_OPS, with an additional .free callback. */
>  #define DEFINE_KERNEL_PARAM_OPS_FREE(_name, _set, _get, _free)		\
>  	const struct kernel_param_ops _name = {				\
>  		.set = (_set),						\
> -		.get_str = (_get),					\
> +		.get = _KERNEL_PARAM_OPS_GET(_get),			\
> +		.get_str = _KERNEL_PARAM_OPS_GET_STR(_get),		\
>  		.free = (_free),					\
>  	}
>  

Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>

-- Petr

^ permalink raw reply

* Re: [PATCH 06/11] moduleparam: Add seq_buf-based .get callback alongside .get_str
From: Petr Pavlu @ 2026-05-25 16:19 UTC (permalink / raw)
  To: Kees Cook
  Cc: Luis Chamberlain, Pengpeng Hou, Richard Weinberger, Anton Ivanov,
	Johannes Berg, Rafael J. Wysocki, Len Brown, Corey Minyard,
	Gabriel Somlo, Michael S. Tsirkin, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, David Airlie, Simona Vetter,
	Bart Van Assche, Jason Gunthorpe, Leon Romanovsky,
	Laurent Pinchart, Hans de Goede, Mauro Carvalho Chehab,
	Bjorn Helgaas, Hannes Reinecke, James E.J. Bottomley,
	Martin K. Petersen, Daniel Lezcano, Zhang Rui, Lukasz Luba,
	Greg Kroah-Hartman, Jiri Slaby, Alan Stern, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Jason Baron, Jim Cromie, Tiwei Bie,
	Benjamin Berg, Ilpo Järvinen, David E. Box,
	Maciej W. Rozycki, Srinivas Pandruvada, Peter Zijlstra,
	Heiko Carstens, Vasily Gorbik, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vinod Koul, Frank Li, Daniel Gomez, Sami Tolvanen,
	Aaron Tomlin, Alexander Potapenko, Marco Elver, Dmitry Vyukov,
	Andrew Morton, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Andy Shevchenko, Georgia Garcia, kvm, dmaengine,
	linux-modules, kasan-dev, linux-mm, apparmor,
	linux-security-module, linux-um, linux-acpi, openipmi-developer,
	qemu-devel, intel-gfx, dri-devel, linux-rdma, linux-media,
	linux-pci, linux-scsi, linux-pm, linuxppc-dev, linux-serial,
	linux-usb, usb-storage, virtualization, linux-kernel, linux-arch,
	netdev, linux-fsdevel, linux-hardening
In-Reply-To: <20260521133326.2465264-6-kees@kernel.org>

On 5/21/26 3:33 PM, Kees Cook wrote:
> Add a new struct kernel_param_ops::get callback whose signature
> takes a struct seq_buf instead of a raw char buffer:
> 
>   int (*get)(struct seq_buf *sb, const struct kernel_param *kp);
> 
> The previously-legacy .get field is now .get_str (char *buffer);
> .get is the new seq_buf-aware form.  param_attr_show() prefers .get
> when set, otherwise falls back to .get_str.  WARN_ON_ONCE() if both
> are set.  Return contract for .get:
> 
>   < 0 : errno propagated to userspace; seq_buf contents discarded
>   = 0 : success; length derived from seq_buf_used()
>   > 0 : forbidden; the dispatcher WARN_ON_ONCE()s and treats as 0
> 
> The default policy on seq_buf_has_overflowed() is silent truncation,
> matching scnprintf()/sysfs_emit() behaviour.  Callbacks that want a
> specific overflow errno can check seq_buf_has_overflowed() and
> return their preferred error.
> 
> No callbacks use .get yet; the legacy path is still the only one in use
> after this commit. A subsequent commit teaches DEFINE_KERNEL_PARAM_OPS
> to route initializers by type.
> 
> Signed-off-by: Kees Cook <kees@kernel.org>

Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>

-- Petr

^ permalink raw reply

* Re: [PATCH net] vsock/virtio: fix skb overhead overflow on 32-bit builds
From: Stefano Garzarella @ 2026-05-25 15:16 UTC (permalink / raw)
  To: David Laight
  Cc: Michael S. Tsirkin, patchwork-bot+netdevbpf, netdev, xuanzhuo,
	horms, virtualization, linux-kernel, kvm, kuba, eperezma, pabeni,
	davem, jasowang, stefanha, edumazet, stable
In-Reply-To: <20260525155322.240fbd87@pumpkin>

On Mon, May 25, 2026 at 03:53:22PM +0100, David Laight wrote:
>On Mon, 25 May 2026 15:09:54 +0200
>Stefano Garzarella <sgarzare@redhat.com> wrote:
>
>> On Mon, May 25, 2026 at 08:42:01AM -0400, Michael S. Tsirkin wrote:
>> >On Mon, May 25, 2026 at 11:53:14AM +0100, David Laight wrote:
>> >> On Mon, 25 May 2026 11:57:45 +0200
>> >> Stefano Garzarella <sgarzare@redhat.com> wrote:
>> >>
>> >> > On Sat, May 23, 2026 at 05:35:57PM +0100, David Laight wrote:
>> >> > >On Sat, 23 May 2026 02:20:29 +0000
>> >> > >patchwork-bot+netdevbpf@kernel.org wrote:
>> >> > >
>> >> > >> Hello:
>> >> > >>
>> >> > >> This patch was applied to netdev/net.git (main)
>> >> > >> by Jakub Kicinski <kuba@kernel.org>:
>> >> > >
>> >> > >Did anyone else notice that is isn't a bug?
>> >> > >
>> >> > >There is no way that a 'count of bytes of kernel memory' can overflow
>> >> > >the size of 'long'.
>> >> >
>> >> > It's more of an estimate than an actual calculation of memory usage if
>> >> > we queue the incoming packet. In theory, an overflow could occur if the
>> >> > user sets `buf_alloc` to 4GB. In practice, though, I think you're right:
>> >> > the memory should run out before we get to that check.
>> >>
>> >> The calculation is:
>> >>
>> >> 	u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0);
>> >>
>> >> skb_queue_len() will be the number of items on the queue.
>> >> SKB_TRUESIZE(0) is the memory taken up by a zero length skb (basically sizeof(skb)).
>> >>
>> >> Unless you either corrupt the queue length or manage to allocate skb that use
>> >> less than the minimum about of memory that product can't overflow 'unsigned long'.
>> >>
>> >> The later calculations might wrap - but the multiply can't.
>> >>
>> >> -- David
>> >
>> >
>> >Indeed, I wasn't thinking. For this to even get close to overflowing
>> >we'd have to have almost all of 4G available to the 32 bit kernel taken
>> >up by this single queue.
>
>Except there is usually only 1G or 2G available to the kernel.
>And all the skb would have to contain no data.

`skb_overhead` was introduced to prevent exactly queueing skb without 
any data (essentially containing just the EOM for SEQPACKET) that we 
plan to fix properly in some other way instead of queueing empty skb.

>
>> >
>> >Revert, I'd say.
>>
>> I also blindly added the cast to silence sashiko :-(
>> I see now that it could never actually happen, but semantically it’s
>> correct, so maybe we can avoid the revert.
>
>Lots of things are semantically correct :-)

I see :-)

>
>I didn't look any further down the function to see if it could be
>'unsigned long' (or even size_t - but I like 'proper' types when they
>are always correct, I have to remember that size_t is unsigned long).

IIRC here the main reason was to handle the next check with u64 to avoid 
overflows (buf_alloc is u32) when it was introduce, but maybe, after 
commit c6087c5aaad6 ("vsock/virtio: fix skb overhead accounting to 
preserve full buf_alloc") it could be size_t.

>
>The problem with the (u64) cast is that gcc is very likely to make a
>'pigs breakfast' of it and do a full 64x64 multiply.
>It'll then try to keep the 64bit value in a register-pair which ends
>up being spilled to stack as a pair.
>I've seen it spill a constant zero and do a multiply by an immediate
>zero when doing 64bit maths on 32bit x86.
>I think gcc can hold a 64bit value as two separate 32bit values; that
>can generate reasonable code. But if they get merged (eg because of an
>"=A" asm constraint) it all goes horribly wrong.
>This is why there are some asm 'helpers' for mixed 32bit/64bit maths.

Thanks for the details, I'll keep these in mind, really useful!

Stefano


^ permalink raw reply

* Re: [PATCH net] vsock/virtio: fix skb overhead overflow on 32-bit builds
From: David Laight @ 2026-05-25 14:53 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Michael S. Tsirkin, patchwork-bot+netdevbpf, netdev, xuanzhuo,
	horms, virtualization, linux-kernel, kvm, kuba, eperezma, pabeni,
	davem, jasowang, stefanha, edumazet, stable
In-Reply-To: <ahRJS2bN9Bw_AKyo@sgarzare-redhat>

On Mon, 25 May 2026 15:09:54 +0200
Stefano Garzarella <sgarzare@redhat.com> wrote:

> On Mon, May 25, 2026 at 08:42:01AM -0400, Michael S. Tsirkin wrote:
> >On Mon, May 25, 2026 at 11:53:14AM +0100, David Laight wrote:  
> >> On Mon, 25 May 2026 11:57:45 +0200
> >> Stefano Garzarella <sgarzare@redhat.com> wrote:
> >>  
> >> > On Sat, May 23, 2026 at 05:35:57PM +0100, David Laight wrote:  
> >> > >On Sat, 23 May 2026 02:20:29 +0000
> >> > >patchwork-bot+netdevbpf@kernel.org wrote:
> >> > >  
> >> > >> Hello:
> >> > >>
> >> > >> This patch was applied to netdev/net.git (main)
> >> > >> by Jakub Kicinski <kuba@kernel.org>:  
> >> > >
> >> > >Did anyone else notice that is isn't a bug?
> >> > >
> >> > >There is no way that a 'count of bytes of kernel memory' can overflow
> >> > >the size of 'long'.  
> >> >
> >> > It's more of an estimate than an actual calculation of memory usage if
> >> > we queue the incoming packet. In theory, an overflow could occur if the
> >> > user sets `buf_alloc` to 4GB. In practice, though, I think you're right:
> >> > the memory should run out before we get to that check.  
> >>
> >> The calculation is:
> >>
> >> 	u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0);
> >>
> >> skb_queue_len() will be the number of items on the queue.
> >> SKB_TRUESIZE(0) is the memory taken up by a zero length skb (basically sizeof(skb)).
> >>
> >> Unless you either corrupt the queue length or manage to allocate skb that use
> >> less than the minimum about of memory that product can't overflow 'unsigned long'.
> >>
> >> The later calculations might wrap - but the multiply can't.
> >>
> >> -- David  
> >
> >
> >Indeed, I wasn't thinking. For this to even get close to overflowing
> >we'd have to have almost all of 4G available to the 32 bit kernel taken
> >up by this single queue.

Except there is usually only 1G or 2G available to the kernel.
And all the skb would have to contain no data.

> >
> >Revert, I'd say.  
> 
> I also blindly added the cast to silence sashiko :-(
> I see now that it could never actually happen, but semantically it’s 
> correct, so maybe we can avoid the revert.

Lots of things are semantically correct :-)

I didn't look any further down the function to see if it could be
'unsigned long' (or even size_t - but I like 'proper' types when they
are always correct, I have to remember that size_t is unsigned long).

The problem with the (u64) cast is that gcc is very likely to make a
'pigs breakfast' of it and do a full 64x64 multiply.
It'll then try to keep the 64bit value in a register-pair which ends
up being spilled to stack as a pair.
I've seen it spill a constant zero and do a multiply by an immediate
zero when doing 64bit maths on 32bit x86.
I think gcc can hold a 64bit value as two separate 32bit values; that
can generate reasonable code. But if they get merged (eg because of an
"=A" asm constraint) it all goes horribly wrong.
This is why there are some asm 'helpers' for mixed 32bit/64bit maths.

-- David

> 
> Thanks,
> Stefano
> 
> >  
> >> >
> >> > Thanks,
> >> > Stefano
> >> >  
> >> > >
> >> > >-- David
> >> > >  
> >> > >>
> >> > >> On Thu, 21 May 2026 14:47:32 +0200 you wrote:  
> >> > >> > From: Stefano Garzarella <sgarzare@redhat.com>
> >> > >> >
> >> > >> > On 32-bit architectures, both skb_queue_len() and SKB_TRUESIZE(0) evaluate
> >> > >> > to 32-bit values. The multiplication can overflow before being assigned to
> >> > >> > the u64 skb_overhead variable, making the skb overhead check ineffective.
> >> > >> >
> >> > >> > Cast skb_queue_len() to u64 so the multiplication is always performed in
> >> > >> > 64-bit arithmetic.
> >> > >> >
> >> > >> > [...]  
> >> > >>
> >> > >> Here is the summary with links:
> >> > >>   - [net] vsock/virtio: fix skb overhead overflow on 32-bit builds
> >> > >>     https://git.kernel.org/netdev/net/c/4157501b9a8f
> >> > >>
> >> > >> You are awesome, thank you!  
> >> > >  
> >> >  
> >  
> 


^ permalink raw reply

* Re: [PATCH 04/11] treewide: Convert struct kernel_param_ops initializers to DEFINE_KERNEL_PARAM_OPS
From: Petr Pavlu @ 2026-05-25 13:35 UTC (permalink / raw)
  To: Kees Cook
  Cc: Luis Chamberlain, Pengpeng Hou, Richard Weinberger, Anton Ivanov,
	Johannes Berg, Rafael J. Wysocki, Len Brown, Corey Minyard,
	Gabriel Somlo, Michael S. Tsirkin, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, David Airlie, Simona Vetter,
	Bart Van Assche, Jason Gunthorpe, Leon Romanovsky,
	Laurent Pinchart, Hans de Goede, Mauro Carvalho Chehab,
	Bjorn Helgaas, Hannes Reinecke, James E.J. Bottomley,
	Martin K. Petersen, Daniel Lezcano, Zhang Rui, Lukasz Luba,
	Greg Kroah-Hartman, Jiri Slaby, Alan Stern, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Jason Baron, Jim Cromie, Tiwei Bie,
	Benjamin Berg, Ilpo Järvinen, David E. Box,
	Maciej W. Rozycki, Srinivas Pandruvada, Peter Zijlstra,
	Heiko Carstens, Vasily Gorbik, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vinod Koul, Frank Li, Daniel Gomez, Sami Tolvanen,
	Aaron Tomlin, Alexander Potapenko, Marco Elver, Dmitry Vyukov,
	Andrew Morton, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Andy Shevchenko, Georgia Garcia, kvm, dmaengine,
	linux-modules, kasan-dev, linux-mm, apparmor,
	linux-security-module, linux-um, linux-acpi, openipmi-developer,
	qemu-devel, intel-gfx, dri-devel, linux-rdma, linux-media,
	linux-pci, linux-scsi, linux-pm, linuxppc-dev, linux-serial,
	linux-usb, usb-storage, virtualization, linux-kernel, linux-arch,
	netdev, linux-fsdevel, linux-hardening
In-Reply-To: <20260521133326.2465264-4-kees@kernel.org>

On 5/21/26 3:33 PM, Kees Cook wrote:
> Using Coccinelle, rewrite every struct kernel_param_ops initializer that
> sets .get into a DEFINE_KERNEL_PARAM_OPS-family macro invocation,
> for example:
> 
> @@
> declarer name DEFINE_KERNEL_PARAM_OPS;
> identifier OPS;
> expression SET, GET;
> @@
> - const struct kernel_param_ops OPS = {
> -       .set = SET,
> -       .get = GET,
> - };
> + DEFINE_KERNEL_PARAM_OPS(OPS, SET, GET);
> 
> Using the macro for initialization means future changes can manipulate
> the struct layout and callback prototypes without having to change every
> initializer.

Nit: For consistency, I suggest also converting the few remaining
kernel_param_ops instances that specify only .set and no .get, such as
simdisk_param_ops_filename.

-- 
Thanks,
Petr

^ permalink raw reply

* Re: [PATCH 03/11] moduleparam: Add DEFINE_KERNEL_PARAM_OPS macro family
From: Petr Pavlu @ 2026-05-25 13:27 UTC (permalink / raw)
  To: Kees Cook
  Cc: Luis Chamberlain, Pengpeng Hou, Richard Weinberger, Anton Ivanov,
	Johannes Berg, Rafael J. Wysocki, Len Brown, Corey Minyard,
	Gabriel Somlo, Michael S. Tsirkin, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, David Airlie, Simona Vetter,
	Bart Van Assche, Jason Gunthorpe, Leon Romanovsky,
	Laurent Pinchart, Hans de Goede, Mauro Carvalho Chehab,
	Bjorn Helgaas, Hannes Reinecke, James E.J. Bottomley,
	Martin K. Petersen, Daniel Lezcano, Zhang Rui, Lukasz Luba,
	Greg Kroah-Hartman, Jiri Slaby, Alan Stern, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Jason Baron, Jim Cromie, Tiwei Bie,
	Benjamin Berg, Ilpo Järvinen, David E. Box,
	Maciej W. Rozycki, Srinivas Pandruvada, Peter Zijlstra,
	Heiko Carstens, Vasily Gorbik, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vinod Koul, Frank Li, Daniel Gomez, Sami Tolvanen,
	Aaron Tomlin, Alexander Potapenko, Marco Elver, Dmitry Vyukov,
	Andrew Morton, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Andy Shevchenko, Georgia Garcia, kvm, dmaengine,
	linux-modules, kasan-dev, linux-mm, apparmor,
	linux-security-module, linux-um, linux-acpi, openipmi-developer,
	qemu-devel, intel-gfx, dri-devel, linux-rdma, linux-media,
	linux-pci, linux-scsi, linux-pm, linuxppc-dev, linux-serial,
	linux-usb, usb-storage, virtualization, linux-kernel, linux-arch,
	netdev, linux-fsdevel, linux-hardening
In-Reply-To: <20260521133326.2465264-3-kees@kernel.org>

On 5/21/26 3:33 PM, Kees Cook wrote:
> Add macros that define a struct kernel_param_ops initializer through a
> macro so the underlying field layout can evolve without touching every
> call site. Three variants cover the three cases:
> 
>  DEFINE_KERNEL_PARAM_OPS(name, set, get) // basic
>  DEFINE_KERNEL_PARAM_OPS_NOARG(name, set, get) // set KERNEL_PARAM_OPS_FL_NOARG
>  DEFINE_KERNEL_PARAM_OPS_FREE(name, set, get, free) // also set .free
> 
> Callers prefix their own visibility qualifiers, e.g.:
> 
>   static DEFINE_KERNEL_PARAM_OPS(my_ops, my_set, my_get);
> 
> Also update module_param_call() and STANDARD_PARAM_DEF() to use
> DEFINE_KERNEL_PARAM_OPS internally so the generated ops table will go
> through the same macro as everything else.
> 
> Subsequent commits convert all open-coded struct kernel_param_ops
> definitions to use these macros, in preparation for migrating to a
> seq_buf .get API.
> 
> Signed-off-by: Kees Cook <kees@kernel.org>
> ---
>  include/linux/moduleparam.h | 36 ++++++++++++++++++++++++++++++++++--
>  kernel/params.c             |  6 ++----
>  2 files changed, 36 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/moduleparam.h b/include/linux/moduleparam.h
> index 075f28585074..26bf45b36d02 100644
> --- a/include/linux/moduleparam.h
> +++ b/include/linux/moduleparam.h
> @@ -68,6 +68,39 @@ struct kernel_param_ops {
>  	void (*free)(void *arg);
>  };
>  
> +/*
> + * Define a const struct kernel_param_ops initializer. Callers prefix with
> + * any required visibility qualifiers (typically "static"):
> + *
> + *   static DEFINE_KERNEL_PARAM_OPS(my_ops, my_set, my_get);
> + *
> + * Routing the @_set and @_get function pointers through the macro
> + * (rather than naming the struct fields at every call site) lets the
> + * field layout change in one place when callbacks are migrated to a
> + * new signature.
> + */

Nit: The newly introduced DEFINE_KERNEL_PARAM_OPS*() macros remain in
place at the end of the series after the migration is complete and this
comment is removed in patch 7. It would be helpful to describe in the
commit message why these macros are generally preferable to defining
kernel_param_ops instances directly.

I assume the motivation is that the structure is simple enough and using
macros then makes defining kernel_param_ops instances a bit more
concise. A minor disadvantage is that some analysis tools, such as
ctags, may no longer see the generated definition, but that is also the
case for DEFINE_MUTEX() and other similar macros.

-- 
Thanks,
Petr

^ permalink raw reply

* Re: [PATCH net] vsock/virtio: fix skb overhead overflow on 32-bit builds
From: Stefano Garzarella @ 2026-05-25 13:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Laight, patchwork-bot+netdevbpf, netdev, xuanzhuo, horms,
	virtualization, linux-kernel, kvm, kuba, eperezma, pabeni, davem,
	jasowang, stefanha, edumazet, stable
In-Reply-To: <20260525083859-mutt-send-email-mst@kernel.org>

On Mon, May 25, 2026 at 08:42:01AM -0400, Michael S. Tsirkin wrote:
>On Mon, May 25, 2026 at 11:53:14AM +0100, David Laight wrote:
>> On Mon, 25 May 2026 11:57:45 +0200
>> Stefano Garzarella <sgarzare@redhat.com> wrote:
>>
>> > On Sat, May 23, 2026 at 05:35:57PM +0100, David Laight wrote:
>> > >On Sat, 23 May 2026 02:20:29 +0000
>> > >patchwork-bot+netdevbpf@kernel.org wrote:
>> > >
>> > >> Hello:
>> > >>
>> > >> This patch was applied to netdev/net.git (main)
>> > >> by Jakub Kicinski <kuba@kernel.org>:
>> > >
>> > >Did anyone else notice that is isn't a bug?
>> > >
>> > >There is no way that a 'count of bytes of kernel memory' can overflow
>> > >the size of 'long'.
>> >
>> > It's more of an estimate than an actual calculation of memory usage if
>> > we queue the incoming packet. In theory, an overflow could occur if the
>> > user sets `buf_alloc` to 4GB. In practice, though, I think you're right:
>> > the memory should run out before we get to that check.
>>
>> The calculation is:
>>
>> 	u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0);
>>
>> skb_queue_len() will be the number of items on the queue.
>> SKB_TRUESIZE(0) is the memory taken up by a zero length skb (basically sizeof(skb)).
>>
>> Unless you either corrupt the queue length or manage to allocate skb that use
>> less than the minimum about of memory that product can't overflow 'unsigned long'.
>>
>> The later calculations might wrap - but the multiply can't.
>>
>> -- David
>
>
>Indeed, I wasn't thinking. For this to even get close to overflowing
>we'd have to have almost all of 4G available to the 32 bit kernel taken
>up by this single queue.
>
>Revert, I'd say.

I also blindly added the cast to silence sashiko :-(
I see now that it could never actually happen, but semantically it’s 
correct, so maybe we can avoid the revert.

Thanks,
Stefano

>
>> >
>> > Thanks,
>> > Stefano
>> >
>> > >
>> > >-- David
>> > >
>> > >>
>> > >> On Thu, 21 May 2026 14:47:32 +0200 you wrote:
>> > >> > From: Stefano Garzarella <sgarzare@redhat.com>
>> > >> >
>> > >> > On 32-bit architectures, both skb_queue_len() and SKB_TRUESIZE(0) evaluate
>> > >> > to 32-bit values. The multiplication can overflow before being assigned to
>> > >> > the u64 skb_overhead variable, making the skb overhead check ineffective.
>> > >> >
>> > >> > Cast skb_queue_len() to u64 so the multiplication is always performed in
>> > >> > 64-bit arithmetic.
>> > >> >
>> > >> > [...]
>> > >>
>> > >> Here is the summary with links:
>> > >>   - [net] vsock/virtio: fix skb overhead overflow on 32-bit builds
>> > >>     https://git.kernel.org/netdev/net/c/4157501b9a8f
>> > >>
>> > >> You are awesome, thank you!
>> > >
>> >
>


^ permalink raw reply

* Re: [PATCH net] vsock/virtio: bind uarg before filling zerocopy skb
From: Stefano Garzarella @ 2026-05-25 12:48 UTC (permalink / raw)
  To: malin (R)
  Cc: Arseniy Krasnov, tanjingguo, mst@redhat.com, jasowang@redhat.com,
	xuanzhuo@linux.alibaba.com, eperezma@redhat.com,
	stefanha@redhat.com, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, horms@kernel.org, Chenzhe,
	cenxianlong, cuirongzhen, virtualization@lists.linux.dev,
	kvm@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <9fece5ea269049f883f367642c07eaa0@huawei.com>

On Mon, May 25, 2026 at 11:15:12AM +0000, malin (R) wrote:
>From 9eea4f61a4dca97f56c23e12267219bf791a20d1 Mon Sep 17 00:00:00 2001
>From: Jingguo Tan <tanjingguo@huawei.com>
>Date: Fri, 22 May 2026 19:53:45 +0800
>Subject: [PATCH net] vsock/virtio: bind uarg before filling zerocopy skb
>
>virtio_transport_send_pkt_info() allocates or reuses the zerocopy uarg
>before entering the send loop, but virtio_transport_alloc_skb() still
>fills the skb before it inherits that uarg. When fixed-buffer vectored
>zerocopy hits MAX_SKB_FRAGS, io_sg_from_iter() may partially attach
>managed frags and return -EMSGSIZE. The rollback path calls kfree_skb()
>to free an skb that carries SKBFL_MANAGED_FRAG_REFS but no uarg, so
>skb_release_data() falls through to ordinary frag unref.
>
>Pass the uarg into virtio_transport_alloc_skb() and bind it immediately
>before virtio_transport_fill_skb(). This keeps control or no-payload skbs
>untouched while ensuring success and rollback share one lifetime rule.
>
>Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support")
>Signed-off-by: Lin Ma <malin89@huawei.com>
>Signed-off-by: Rongzhen Cui <cuirongzhen@huawei.com>
>Signed-off-by: Jingguo Tan <tanjingguo@huawei.com>
>---
>
> net/vmw_vsock/virtio_transport_common.c | 11 ++++++++---
> 1 file changed, 8 insertions(+), 3 deletions(-)

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>


^ permalink raw reply

* Re: [PATCH net] vsock/virtio: fix skb overhead overflow on 32-bit builds
From: Michael S. Tsirkin @ 2026-05-25 12:42 UTC (permalink / raw)
  To: David Laight
  Cc: Stefano Garzarella, patchwork-bot+netdevbpf, netdev, xuanzhuo,
	horms, virtualization, linux-kernel, kvm, kuba, eperezma, pabeni,
	davem, jasowang, stefanha, edumazet, stable
In-Reply-To: <20260525115314.3cf310e6@pumpkin>

On Mon, May 25, 2026 at 11:53:14AM +0100, David Laight wrote:
> On Mon, 25 May 2026 11:57:45 +0200
> Stefano Garzarella <sgarzare@redhat.com> wrote:
> 
> > On Sat, May 23, 2026 at 05:35:57PM +0100, David Laight wrote:
> > >On Sat, 23 May 2026 02:20:29 +0000
> > >patchwork-bot+netdevbpf@kernel.org wrote:
> > >  
> > >> Hello:
> > >>
> > >> This patch was applied to netdev/net.git (main)
> > >> by Jakub Kicinski <kuba@kernel.org>:  
> > >
> > >Did anyone else notice that is isn't a bug?
> > >
> > >There is no way that a 'count of bytes of kernel memory' can overflow
> > >the size of 'long'.  
> > 
> > It's more of an estimate than an actual calculation of memory usage if 
> > we queue the incoming packet. In theory, an overflow could occur if the 
> > user sets `buf_alloc` to 4GB. In practice, though, I think you're right: 
> > the memory should run out before we get to that check.
> 
> The calculation is:
> 
> 	u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0); 
> 
> skb_queue_len() will be the number of items on the queue.
> SKB_TRUESIZE(0) is the memory taken up by a zero length skb (basically sizeof(skb)).
> 
> Unless you either corrupt the queue length or manage to allocate skb that use
> less than the minimum about of memory that product can't overflow 'unsigned long'.
> 
> The later calculations might wrap - but the multiply can't.
> 
> -- David


Indeed, I wasn't thinking. For this to even get close to overflowing
we'd have to have almost all of 4G available to the 32 bit kernel taken
up by this single queue.

Revert, I'd say.

> > 
> > Thanks,
> > Stefano
> > 
> > >
> > >-- David
> > >  
> > >>
> > >> On Thu, 21 May 2026 14:47:32 +0200 you wrote:  
> > >> > From: Stefano Garzarella <sgarzare@redhat.com>
> > >> >
> > >> > On 32-bit architectures, both skb_queue_len() and SKB_TRUESIZE(0) evaluate
> > >> > to 32-bit values. The multiplication can overflow before being assigned to
> > >> > the u64 skb_overhead variable, making the skb overhead check ineffective.
> > >> >
> > >> > Cast skb_queue_len() to u64 so the multiplication is always performed in
> > >> > 64-bit arithmetic.
> > >> >
> > >> > [...]  
> > >>
> > >> Here is the summary with links:
> > >>   - [net] vsock/virtio: fix skb overhead overflow on 32-bit builds
> > >>     https://git.kernel.org/netdev/net/c/4157501b9a8f
> > >>
> > >> You are awesome, thank you!  
> > >  
> > 


^ permalink raw reply

* Re: [PATCH net] vsock/virtio: bind uarg before filling zerocopy skb
From: Arseniy Krasnov @ 2026-05-25 11:25 UTC (permalink / raw)
  To: malin (R), tanjingguo, mst@redhat.com, jasowang@redhat.com,
	xuanzhuo@linux.alibaba.com, eperezma@redhat.com,
	stefanha@redhat.com, sgarzare@redhat.com, davem@davemloft.net,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	horms@kernel.org
  Cc: Chenzhe, cenxianlong, cuirongzhen, virtualization@lists.linux.dev,
	kvm@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <9fece5ea269049f883f367642c07eaa0@huawei.com>



On 25/05/2026 14:15, malin (R) wrote:
> From 9eea4f61a4dca97f56c23e12267219bf791a20d1 Mon Sep 17 00:00:00 2001
> From: Jingguo Tan <tanjingguo@huawei.com>
> Date: Fri, 22 May 2026 19:53:45 +0800
> Subject: [PATCH net] vsock/virtio: bind uarg before filling zerocopy skb
> 
> virtio_transport_send_pkt_info() allocates or reuses the zerocopy uarg
> before entering the send loop, but virtio_transport_alloc_skb() still
> fills the skb before it inherits that uarg. When fixed-buffer vectored
> zerocopy hits MAX_SKB_FRAGS, io_sg_from_iter() may partially attach
> managed frags and return -EMSGSIZE. The rollback path calls kfree_skb()
> to free an skb that carries SKBFL_MANAGED_FRAG_REFS but no uarg, so
> skb_release_data() falls through to ordinary frag unref.
> 
> Pass the uarg into virtio_transport_alloc_skb() and bind it immediately
> before virtio_transport_fill_skb(). This keeps control or no-payload skbs
> untouched while ensuring success and rollback share one lifetime rule.
> 
> Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support")
> Signed-off-by: Lin Ma <malin89@huawei.com>
> Signed-off-by: Rongzhen Cui <cuirongzhen@huawei.com>
> Signed-off-by: Jingguo Tan <tanjingguo@huawei.com>

Acked-by: Arseniy Krasnov <avkrasnov@salutedevices.com>

> ---
> 
>  net/vmw_vsock/virtio_transport_common.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index df3b418e0392..73f58925ff72 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -205,6 +205,7 @@ static u16 virtio_transport_get_type(struct sock *sk)
>  static struct sk_buff *virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>  						  size_t payload_len,
>  						  bool zcopy,
> +						  struct ubuf_info *uarg,
>  						  u32 src_cid,
>  						  u32 src_port,
>  						  u32 dst_cid,
> @@ -245,6 +246,11 @@ static struct sk_buff *virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *
>  	if (info->msg && payload_len > 0) {
>  		int err;
>  
> +		/* Bind the zerocopy lifetime before filling frags so error rollback
> +		 * frees managed fixed-buffer pages through the uarg-aware path.
> +		 */
> +		skb_zcopy_set(skb, uarg, NULL);
> +
>  		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
>  		if (err)
>  			goto out;
> @@ -364,6 +370,7 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>  		skb_len = min(max_skb_len, rest_len);
>  
>  		skb = virtio_transport_alloc_skb(info, skb_len, can_zcopy,
> +						 uarg,
>  						 src_cid, src_port,
>  						 dst_cid, dst_port);
>  		if (!skb) {
> @@ -371,8 +378,6 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>  			break;
>  		}
>  
> -		skb_zcopy_set(skb, uarg, NULL);
> -
>  		virtio_transport_inc_tx_pkt(vvs, skb);
>  
>  		ret = t_ops->send_pkt(skb, info->net);
> @@ -1183,7 +1188,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>  	if (!t)
>  		return -ENOTCONN;
>  
> -	reply = virtio_transport_alloc_skb(&info, 0, false,
> +	reply = virtio_transport_alloc_skb(&info, 0, false, NULL,
>  					   le64_to_cpu(hdr->dst_cid),
>  					   le32_to_cpu(hdr->dst_port),
>  					   le64_to_cpu(hdr->src_cid),


^ permalink raw reply

* Re: [PATCH net] vsock/virtio: bind uarg before filling zerocopy skb
From: Michael S. Tsirkin @ 2026-05-25 11:25 UTC (permalink / raw)
  To: malin (R)
  Cc: Arseniy Krasnov, tanjingguo, jasowang@redhat.com,
	xuanzhuo@linux.alibaba.com, eperezma@redhat.com,
	stefanha@redhat.com, sgarzare@redhat.com, davem@davemloft.net,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	horms@kernel.org, Chenzhe, cenxianlong, cuirongzhen,
	virtualization@lists.linux.dev, kvm@vger.kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <9fece5ea269049f883f367642c07eaa0@huawei.com>

On Mon, May 25, 2026 at 11:15:12AM +0000, malin (R) wrote:
> >From 9eea4f61a4dca97f56c23e12267219bf791a20d1 Mon Sep 17 00:00:00 2001
> From: Jingguo Tan <tanjingguo@huawei.com>
> Date: Fri, 22 May 2026 19:53:45 +0800
> Subject: [PATCH net] vsock/virtio: bind uarg before filling zerocopy skb
> 
> virtio_transport_send_pkt_info() allocates or reuses the zerocopy uarg
> before entering the send loop, but virtio_transport_alloc_skb() still
> fills the skb before it inherits that uarg. When fixed-buffer vectored
> zerocopy hits MAX_SKB_FRAGS, io_sg_from_iter() may partially attach
> managed frags and return -EMSGSIZE. The rollback path calls kfree_skb()
> to free an skb that carries SKBFL_MANAGED_FRAG_REFS but no uarg, so
> skb_release_data() falls through to ordinary frag unref.
> 
> Pass the uarg into virtio_transport_alloc_skb() and bind it immediately
> before virtio_transport_fill_skb(). This keeps control or no-payload skbs
> untouched while ensuring success and rollback share one lifetime rule.
> 
> Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support")
> Signed-off-by: Lin Ma <malin89@huawei.com>
> Signed-off-by: Rongzhen Cui <cuirongzhen@huawei.com>
> Signed-off-by: Jingguo Tan <tanjingguo@huawei.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
> 
>  net/vmw_vsock/virtio_transport_common.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index df3b418e0392..73f58925ff72 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -205,6 +205,7 @@ static u16 virtio_transport_get_type(struct sock *sk)
>  static struct sk_buff *virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>  						  size_t payload_len,
>  						  bool zcopy,
> +						  struct ubuf_info *uarg,
>  						  u32 src_cid,
>  						  u32 src_port,
>  						  u32 dst_cid,
> @@ -245,6 +246,11 @@ static struct sk_buff *virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *
>  	if (info->msg && payload_len > 0) {
>  		int err;
>  
> +		/* Bind the zerocopy lifetime before filling frags so error rollback
> +		 * frees managed fixed-buffer pages through the uarg-aware path.
> +		 */
> +		skb_zcopy_set(skb, uarg, NULL);
> +
>  		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
>  		if (err)
>  			goto out;
> @@ -364,6 +370,7 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>  		skb_len = min(max_skb_len, rest_len);
>  
>  		skb = virtio_transport_alloc_skb(info, skb_len, can_zcopy,
> +						 uarg,
>  						 src_cid, src_port,
>  						 dst_cid, dst_port);
>  		if (!skb) {
> @@ -371,8 +378,6 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>  			break;
>  		}
>  
> -		skb_zcopy_set(skb, uarg, NULL);
> -
>  		virtio_transport_inc_tx_pkt(vvs, skb);
>  
>  		ret = t_ops->send_pkt(skb, info->net);
> @@ -1183,7 +1188,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>  	if (!t)
>  		return -ENOTCONN;
>  
> -	reply = virtio_transport_alloc_skb(&info, 0, false,
> +	reply = virtio_transport_alloc_skb(&info, 0, false, NULL,
>  					   le64_to_cpu(hdr->dst_cid),
>  					   le32_to_cpu(hdr->dst_port),
>  					   le64_to_cpu(hdr->src_cid),
> -- 
> 2.53.0


^ permalink raw reply

* [PATCH net] vsock/virtio: bind uarg before filling zerocopy skb
From: malin (R) @ 2026-05-25 11:15 UTC (permalink / raw)
  To: Arseniy Krasnov, tanjingguo, mst@redhat.com, jasowang@redhat.com,
	xuanzhuo@linux.alibaba.com, eperezma@redhat.com,
	stefanha@redhat.com, sgarzare@redhat.com, davem@davemloft.net,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	horms@kernel.org
  Cc: Chenzhe, cenxianlong, cuirongzhen, virtualization@lists.linux.dev,
	kvm@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, tanjingguo

From 9eea4f61a4dca97f56c23e12267219bf791a20d1 Mon Sep 17 00:00:00 2001
From: Jingguo Tan <tanjingguo@huawei.com>
Date: Fri, 22 May 2026 19:53:45 +0800
Subject: [PATCH net] vsock/virtio: bind uarg before filling zerocopy skb

virtio_transport_send_pkt_info() allocates or reuses the zerocopy uarg
before entering the send loop, but virtio_transport_alloc_skb() still
fills the skb before it inherits that uarg. When fixed-buffer vectored
zerocopy hits MAX_SKB_FRAGS, io_sg_from_iter() may partially attach
managed frags and return -EMSGSIZE. The rollback path calls kfree_skb()
to free an skb that carries SKBFL_MANAGED_FRAG_REFS but no uarg, so
skb_release_data() falls through to ordinary frag unref.

Pass the uarg into virtio_transport_alloc_skb() and bind it immediately
before virtio_transport_fill_skb(). This keeps control or no-payload skbs
untouched while ensuring success and rollback share one lifetime rule.

Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support")
Signed-off-by: Lin Ma <malin89@huawei.com>
Signed-off-by: Rongzhen Cui <cuirongzhen@huawei.com>
Signed-off-by: Jingguo Tan <tanjingguo@huawei.com>
---

 net/vmw_vsock/virtio_transport_common.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index df3b418e0392..73f58925ff72 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -205,6 +205,7 @@ static u16 virtio_transport_get_type(struct sock *sk)
 static struct sk_buff *virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
 						  size_t payload_len,
 						  bool zcopy,
+						  struct ubuf_info *uarg,
 						  u32 src_cid,
 						  u32 src_port,
 						  u32 dst_cid,
@@ -245,6 +246,11 @@ static struct sk_buff *virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *
 	if (info->msg && payload_len > 0) {
 		int err;
 
+		/* Bind the zerocopy lifetime before filling frags so error rollback
+		 * frees managed fixed-buffer pages through the uarg-aware path.
+		 */
+		skb_zcopy_set(skb, uarg, NULL);
+
 		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
 		if (err)
 			goto out;
@@ -364,6 +370,7 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 		skb_len = min(max_skb_len, rest_len);
 
 		skb = virtio_transport_alloc_skb(info, skb_len, can_zcopy,
+						 uarg,
 						 src_cid, src_port,
 						 dst_cid, dst_port);
 		if (!skb) {
@@ -371,8 +378,6 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 			break;
 		}
 
-		skb_zcopy_set(skb, uarg, NULL);
-
 		virtio_transport_inc_tx_pkt(vvs, skb);
 
 		ret = t_ops->send_pkt(skb, info->net);
@@ -1183,7 +1188,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
 	if (!t)
 		return -ENOTCONN;
 
-	reply = virtio_transport_alloc_skb(&info, 0, false,
+	reply = virtio_transport_alloc_skb(&info, 0, false, NULL,
 					   le64_to_cpu(hdr->dst_cid),
 					   le32_to_cpu(hdr->dst_port),
 					   le64_to_cpu(hdr->src_cid),
-- 
2.53.0

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox