Linux virtualization list

Linux virtualization list
 help / color / mirror / Atom feed

* Re: [PATCH v3] i2c: virtio: retain xfer with kref to fix UAF on interrupted wait
From: Michael S. Tsirkin @ 2026-06-10  6:50 UTC (permalink / raw)
  To: Gavin Li
  Cc: linux-i2c, viresh.kumar, Chen, Jian Jun, andi.shyti,
	virtualization
In-Reply-To: <20260609134358.36583-1-gavin.li@samsara.com>

On Tue, Jun 09, 2026 at 09:43:58AM -0400, Gavin Li wrote:
> commit a663b3c47ab1 ("i2c: virtio: Avoid hang by using interruptible
> completion wait") switched virtio_i2c_complete_reqs() to
> wait_for_completion_interruptible() so a stuck device cannot hang a
> task forever. That left a use-after-free: if the wait returns early on
> a signal, virtio_i2c_xfer() frees reqs and DMA bounce buffers while the
> device may still hold virtqueue tokens pointing at &reqs[i] and DMA
> into read buffers. When those requests complete later,
> virtio_i2c_msg_done() calls complete() on freed memory.
> 
> Waiting uninterruptibly for every completion before freeing avoids the
> UAF but can hang the caller indefinitely if the virtio side never
> completes the request. The virtio spec provides no way to cancel an
> in-flight transfer, so that is not an acceptable tradeoff.

except for a queue reset, or a device reset if that's not
available.

> This commit manages the freeing of the xfer allocations via kref, and
> ensures that each in-flight request holds a reference. This fixes the
> use-after-free by ensuring that the virtio device has a valid location
> to write to until the request completes. This will cause a memory leak
> in cases where the device hangs, but that is much preferable to memory
> corruption.
> 
> Additionally, force usage of a bounce buffer even if the i2c_msg buf is
> DMA-safe, since the buffer passed to the virtqueue must remain valid
> even if the transfer is interrupted. Remove usage of
> i2c_get_dma_safe_msg_buf() since it may pass through msg->buf directly.
> This results in an extra allocation+copy when I2C_M_DMA_SAFE is used.
> 
> Signed-off-by: Gavin Li <gavin.li@samsara.com>

It's possible that copying unconditionally is the best you could do.
Some vague ideas below but it's possible they are not practical.


> ---

pls start a new thread with each version, and include a changelog there.


>  drivers/i2c/busses/i2c-virtio.c | 110 ++++++++++++++++++++++++--------
>  1 file changed, 83 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/i2c/busses/i2c-virtio.c b/drivers/i2c/busses/i2c-virtio.c
> index 5da6fef92bec3..7ee96b08f6685 100644
> --- a/drivers/i2c/busses/i2c-virtio.c
> +++ b/drivers/i2c/busses/i2c-virtio.c
> @@ -13,6 +13,7 @@
>  #include <linux/err.h>
>  #include <linux/i2c.h>
>  #include <linux/kernel.h>
> +#include <linux/kref.h>
>  #include <linux/module.h>
>  #include <linux/virtio.h>
>  #include <linux/virtio_ids.h>
> @@ -31,39 +32,72 @@ struct virtio_i2c {
>  	struct virtqueue *vq;
>  };
>  
> +struct virtio_i2c_xfer;
> +
>  /**
>   * struct virtio_i2c_req - the virtio I2C request structure
> + * @xfer: owning transfer
>   * @completion: completion of virtio I2C message
>   * @out_hdr: the OUT header of the virtio I2C message
> - * @buf: the buffer into which data is read, or from which it's written
> + * @buf: bounce buffer for i2c_msg.buf, set to NULL upon free
>   * @in_hdr: the IN header of the virtio I2C message
>   */
>  struct virtio_i2c_req {
> +	struct virtio_i2c_xfer *xfer;
>  	struct completion completion;
>  	struct virtio_i2c_out_hdr out_hdr	____cacheline_aligned;
>  	uint8_t *buf				____cacheline_aligned;
>  	struct virtio_i2c_in_hdr in_hdr		____cacheline_aligned;
>  };
>  
> +/**
> + * struct virtio_i2c_xfer - a queued I2C transfer
> + * @ref: one ref for the caller, plus one per in-flight virtqueue request
> + * @num: number of messages
> + * @reqs: the virtio I2C requests
> + */
> +struct virtio_i2c_xfer {
> +	struct kref ref;
> +	int num;
> +	struct virtio_i2c_req reqs[];
> +};
> +
> +static void virtio_i2c_xfer_release(struct kref *ref)
> +{
> +	struct virtio_i2c_xfer *xfer = container_of(ref, struct virtio_i2c_xfer, ref);
> +	int i;
> +
> +	for (i = 0; i < xfer->num; i++) {
> +		struct virtio_i2c_req *req = &xfer->reqs[i];
> +		kfree(req->buf);
> +	}
> +
> +	kfree(xfer);
> +}
> +
>  static void virtio_i2c_msg_done(struct virtqueue *vq)
>  {
>  	struct virtio_i2c_req *req;
>  	unsigned int len;
>  
> -	while ((req = virtqueue_get_buf(vq, &len)))
> +	while ((req = virtqueue_get_buf(vq, &len))) {
>  		complete(&req->completion);
> +		kref_put(&req->xfer->ref, virtio_i2c_xfer_release);
> +	}
>  }
>  
> -static int virtio_i2c_prepare_reqs(struct virtqueue *vq,
> -				   struct virtio_i2c_req *reqs,
> +static int virtio_i2c_prepare_xfer(struct virtqueue *vq,
> +				   struct virtio_i2c_xfer *xfer,
>  				   struct i2c_msg *msgs, int num)
>  {
>  	struct scatterlist *sgs[3], out_hdr, msg_buf, in_hdr;
> +	struct virtio_i2c_req *reqs = xfer->reqs;
>  	int i;
>  
>  	for (i = 0; i < num; i++) {
>  		int outcnt = 0, incnt = 0;
>  
> +		reqs[i].xfer = xfer;
>  		init_completion(&reqs[i].completion);
>  
>  		/*
> @@ -82,9 +116,16 @@ static int virtio_i2c_prepare_reqs(struct virtqueue *vq,
>  		sgs[outcnt++] = &out_hdr;
>  
>  		if (msgs[i].len) {
> -			reqs[i].buf = i2c_get_dma_safe_msg_buf(&msgs[i], 1);
> +			/*
> +			 * Even if msg->flags has I2C_M_DMA_SAFE set, a bounce
> +			 * buffer is required because the transfer may be
> +			 * interrupted, after which msg->buf is no longer valid.
> +			 */
> +			reqs[i].buf = kzalloc(msgs[i].len, GFP_KERNEL);

is len always reasonably small?

>  			if (!reqs[i].buf)
>  				break;
> +			if (!(msgs[i].flags & I2C_M_RD))
> +				memcpy(reqs[i].buf, msgs[i].buf, msgs[i].len);
>  
>  			sg_init_one(&msg_buf, reqs[i].buf, msgs[i].len);
>

maybe there's some way to hang on to the original memory?


  
> @@ -97,38 +138,51 @@ static int virtio_i2c_prepare_reqs(struct virtqueue *vq,
>  		sg_init_one(&in_hdr, &reqs[i].in_hdr, sizeof(reqs[i].in_hdr));
>  		sgs[outcnt + incnt++] = &in_hdr;
>  
> +		/* This reference is released in virtio_i2c_msg_done(). */
> +		kref_get(&xfer->ref);
> +
>  		if (virtqueue_add_sgs(vq, sgs, outcnt, incnt, &reqs[i], GFP_KERNEL)) {
> -			i2c_put_dma_safe_msg_buf(reqs[i].buf, &msgs[i], false);
> +			kref_put(&xfer->ref, virtio_i2c_xfer_release);
> +
> +			kfree(reqs[i].buf);
> +			reqs[i].buf = NULL; /* prevent free by virtio_i2c_xfer_release */
> +
>  			break;
>  		}
>  	}
>  
> +	xfer->num = i;
>  	return i;
>  }
>  
> -static int virtio_i2c_complete_reqs(struct virtqueue *vq,
> -				    struct virtio_i2c_req *reqs,
> -				    struct i2c_msg *msgs, int num)
> +static int virtio_i2c_complete_xfer(struct virtio_i2c_xfer *xfer,
> +				    struct i2c_msg *msgs)
>  {
> -	bool failed = false;
> -	int i, j = 0;
> +	struct virtio_i2c_req *reqs = xfer->reqs;
> +	int i, fail_index = -1;
>  
> -	for (i = 0; i < num; i++) {
> +	for (i = 0; i < xfer->num; i++) {
>  		struct virtio_i2c_req *req = &reqs[i];
> +		struct i2c_msg *msg = &msgs[i];
>  
> -		if (!failed) {
> -			if (wait_for_completion_interruptible(&req->completion))
> -				failed = true;
> -			else if (req->in_hdr.status != VIRTIO_I2C_MSG_OK)
> -				failed = true;
> -			else
> -				j++;
> +		if (wait_for_completion_interruptible(&req->completion))
> +			return -EINTR;
> +
> +		if (req->in_hdr.status != VIRTIO_I2C_MSG_OK) {
> +			/* Don't break yet. Try to wait until all requests complete. */
> +			if (fail_index < 0)
> +				fail_index = i;
>  		}
>  
> -		i2c_put_dma_safe_msg_buf(reqs[i].buf, &msgs[i], !failed);
> +		if (fail_index < 0 && (msg->flags & I2C_M_RD))
> +			memcpy(msg->buf, req->buf, msg->len);
> +
> +		kfree(req->buf);
> +		req->buf = NULL; /* prevent free by virtio_i2c_xfer_release */
>  	}
>  
> -	return j;
> +	/* Return number of successful transactions */
> +	return fail_index >= 0 ? fail_index : xfer->num;
>  }
>  
>  static int virtio_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs,
> @@ -136,14 +190,16 @@ static int virtio_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs,
>  {
>  	struct virtio_i2c *vi = i2c_get_adapdata(adap);
>  	struct virtqueue *vq = vi->vq;
> -	struct virtio_i2c_req *reqs;
> +	struct virtio_i2c_xfer *xfer;
>  	int count;
>  
> -	reqs = kzalloc_objs(*reqs, num);
> -	if (!reqs)
> +	xfer = kzalloc(struct_size(xfer, reqs, num), GFP_KERNEL);
> +	if (!xfer)
>  		return -ENOMEM;
>  
> -	count = virtio_i2c_prepare_reqs(vq, reqs, msgs, num);
> +	kref_init(&xfer->ref);
> +
> +	count = virtio_i2c_prepare_xfer(vq, xfer, msgs, num);
>  	if (!count)
>  		goto err_free;
>  
> @@ -157,10 +213,10 @@ static int virtio_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs,
>  	 */
>  	virtqueue_kick(vq);
>  
> -	count = virtio_i2c_complete_reqs(vq, reqs, msgs, count);
> +	count = virtio_i2c_complete_xfer(xfer, msgs);
>  
>  err_free:
> -	kfree(reqs);
> +	kref_put(&xfer->ref, virtio_i2c_xfer_release);
>  	return count;
>  }
>  
> -- 
> 2.54.0
> 


^ permalink raw reply

* Re: [PATCH] [v3] vduse: fix compat handling for VDUSE_IOTLB_GET_FD/VDUSE_VQ_GET_INFO
From: Arnd Bergmann @ 2026-06-10  6:54 UTC (permalink / raw)
  To: Michael S. Tsirkin, Arnd Bergmann
  Cc: Jason Wang, Xie Yongji, Eugenio Pérez, Xuan Zhuo,
	Marco Crivellari, Anders Roxell, virtualization, linux-kernel
In-Reply-To: <20260610012421-mutt-send-email-mst@kernel.org>

On Wed, Jun 10, 2026, at 07:26, Michael S. Tsirkin wrote:
> On Fri, Feb 13, 2026 at 04:40:46PM +0100, Arnd Bergmann wrote:
>> +	};
>> +	__u8 ready;
>> +	__u8 padding[3];
>> +} __uapi_arch_align;
>
> what is this __uapi_arch_align supposed to be doing?
>
> It compiles by luck because gcc thinks it's a global variable.

Sorry, that should not have been part of this patch, it is
part of the series of changes I was testing when I noticed
the problem here, but that other series is unfortunately
still not ready for merging.

The background here is that I'm adding -Werror=padded to
the UAPI checks in usr/include/Makefile, in order to find
any instances of data structures with implied padding,
and then instead add explicit padding using architecture
specific macros. In vduse_vq_info, this includes a final
32 bit of padding on architectures with naturally
aligned __u64 but no padding when __u64 has a smaller
alignment.

The __uapi_arch_align macro in turn is what I add to
structures with optional padding, in order to be able
to reduce the alignment of a structure when building the
kernel with a higher default alignment than userspace
(e.g. on m68k with -malign-int).

Removing the __uapi_arch_align is the correct fix for the
moment. Let me know if you would like me to send a
replacement patch without it, a fixup patch on top, or if
you will fix it up yourself.

     Arnd

^ permalink raw reply

* Re: [PATCH] [v3] vduse: fix compat handling for VDUSE_IOTLB_GET_FD/VDUSE_VQ_GET_INFO
From: Michael S. Tsirkin @ 2026-06-10  6:58 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Arnd Bergmann, Jason Wang, Xie Yongji, Eugenio Pérez,
	Xuan Zhuo, Marco Crivellari, Anders Roxell, virtualization,
	linux-kernel
In-Reply-To: <0134edd1-de51-46b3-9ea8-bdbcf48b878f@app.fastmail.com>

On Wed, Jun 10, 2026 at 08:54:59AM +0200, Arnd Bergmann wrote:
> On Wed, Jun 10, 2026, at 07:26, Michael S. Tsirkin wrote:
> > On Fri, Feb 13, 2026 at 04:40:46PM +0100, Arnd Bergmann wrote:
> >> +	};
> >> +	__u8 ready;
> >> +	__u8 padding[3];
> >> +} __uapi_arch_align;
> >
> > what is this __uapi_arch_align supposed to be doing?
> >
> > It compiles by luck because gcc thinks it's a global variable.
> 
> Sorry, that should not have been part of this patch, it is
> part of the series of changes I was testing when I noticed
> the problem here, but that other series is unfortunately
> still not ready for merging.
> 
> The background here is that I'm adding -Werror=padded to
> the UAPI checks in usr/include/Makefile, in order to find
> any instances of data structures with implied padding,
> and then instead add explicit padding using architecture
> specific macros. In vduse_vq_info, this includes a final
> 32 bit of padding on architectures with naturally
> aligned __u64 but no padding when __u64 has a smaller
> alignment.
> 
> The __uapi_arch_align macro in turn is what I add to
> structures with optional padding, in order to be able
> to reduce the alignment of a structure when building the
> kernel with a higher default alignment than userspace
> (e.g. on m68k with -malign-int).
> 
> Removing the __uapi_arch_align is the correct fix for the
> moment. Let me know if you would like me to send a
> replacement patch without it, a fixup patch on top, or if
> you will fix it up yourself.
> 
>      Arnd

I fixed it, thanks.


^ permalink raw reply

* Re: [PATCH RESEND] virtio_console: read size from config space during device init
From: Michael S. Tsirkin @ 2026-06-10  7:04 UTC (permalink / raw)
  To: Filip Hejsek
  Cc: Amit Shah, Arnd Bergmann, Greg Kroah-Hartman, Rusty Russell,
	virtualization, linux-kernel
In-Reply-To: <20260223-virtio-console-fix-v1-1-0cf08303b428@gmail.com>

On Mon, Feb 23, 2026 at 06:37:02PM +0100, Filip Hejsek wrote:
> Previously, the size was only read upon receiving the config interrupt.
> This interrupt is sent when the size changes. However, we also need to
> read the initial size.
> 
> Also make sure to only read the size from config if F_SIZE is enabled.
> 
> Fixes: 9778829cffd4 ("virtio: console: Store each console's size in the console structure")
> Signed-off-by: Filip Hejsek <filip.hejsek@gmail.com>
> ---
> This is a resend of [1], which hasn't received any response.
> 
> I found this bug while developing patches for QEMU that add virtio
> console resize support. If you want to test this, you can get my QEMU
> patches from [2]. You will need to disable multiport
> using `-device virtio-serial,max_ports=1`.
> 
> [1]: https://lore.kernel.org/all/20251224-virtio-console-fix-v1-1-69d0349692dc@gmail.com/
> [2]: https://lore.kernel.org/all/20250921-console-resize-v5-0-89e3c6727060@gmail.com/
> 
> I'll also repeat my questions from the previous submission here. These are
> things that confused me when I was trying to understand the surrounding code,
> but should in no way prevent merging this patch.
> 
>   - Why does use_multiport use __virtio_test_bit instead of
>     virtio_has_feature?
> 
>   - The VIRTIO_CONSOLE_RESIZE handler sets irq_requested to 1, which I
>     think makes no sense?
> ---
>  drivers/char/virtio_console.c | 52 ++++++++++++++++++++++++++-----------------
>  1 file changed, 31 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
> index 088182e54d..c355f6d392 100644
> --- a/drivers/char/virtio_console.c
> +++ b/drivers/char/virtio_console.c
> @@ -1771,32 +1771,40 @@ static void config_intr(struct virtio_device *vdev)
>  		schedule_work(&portdev->config_work);
>  }
>  
> -static void config_work_handler(struct work_struct *work)
> +static void update_size_from_config(struct ports_device *portdev)
>  {
> -	struct ports_device *portdev;
> +	struct virtio_device *vdev;
> +	struct port *port;
> +	u16 rows, cols;
>  
> -	portdev = container_of(work, struct ports_device, config_work);
> -	if (!use_multiport(portdev)) {
> -		struct virtio_device *vdev;
> -		struct port *port;
> -		u16 rows, cols;
> +	vdev = portdev->vdev;
>  
> -		vdev = portdev->vdev;
> -		virtio_cread(vdev, struct virtio_console_config, cols, &cols);
> -		virtio_cread(vdev, struct virtio_console_config, rows, &rows);
> +	/*
> +	 * We'll use this way of resizing only for legacy support.
> +	 * For multiport devices, use control messages to indicate
> +	 * console size changes so that it can be done per-port.
> +	 *
> +	 * Don't test F_SIZE at all if we're rproc: not a valid feature.
> +	 */
> +	if (is_rproc_serial(vdev) ||

Wait a second. Why is there this rproc test here?
Was not in the original code and commit log says nothing about it.


> +	    use_multiport(portdev) ||
> +	    !virtio_has_feature(vdev, VIRTIO_CONSOLE_F_SIZE))
> +		return;
>  
> -		port = find_port_by_id(portdev, 0);
> -		set_console_size(port, rows, cols);
> +	virtio_cread(vdev, struct virtio_console_config, cols, &cols);
> +	virtio_cread(vdev, struct virtio_console_config, rows, &rows);
>  
> -		/*
> -		 * We'll use this way of resizing only for legacy
> -		 * support.  For newer userspace
> -		 * (VIRTIO_CONSOLE_F_MULTPORT+), use control messages
> -		 * to indicate console size changes so that it can be
> -		 * done per-port.
> -		 */
> -		resize_console(port);
> -	}
> +	port = find_port_by_id(portdev, 0);
> +	set_console_size(port, rows, cols);
> +	resize_console(port);
> +}
> +
> +static void config_work_handler(struct work_struct *work)
> +{
> +	struct ports_device *portdev;
> +
> +	portdev = container_of(work, struct ports_device, config_work);
> +	update_size_from_config(portdev);
>  }
>  
>  static int init_vqs(struct ports_device *portdev)
> @@ -2054,6 +2062,8 @@ static int virtcons_probe(struct virtio_device *vdev)
>  	__send_control_msg(portdev, VIRTIO_CONSOLE_BAD_ID,
>  			   VIRTIO_CONSOLE_DEVICE_READY, 1);
>  
> +	update_size_from_config(portdev);
> +
>  	return 0;
>  
>  free_chrdev:
> 
> ---
> base-commit: b927546677c876e26eba308550207c2ddf812a43
> change-id: 20251224-virtio-console-fix-3d46980ef569
> 
> Best regards,
> -- 
> Filip Hejsek <filip.hejsek@gmail.com>


^ permalink raw reply

* Re: [PATCH v15] can: virtio: Add virtio CAN driver
From: Michael S. Tsirkin @ 2026-06-10  7:08 UTC (permalink / raw)
  To: Matias Ezequiel Vara Larsen
  Cc: Harald Mommer, Marc Kleine-Budde, Vincent Mailhol, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, linux-can, virtualization,
	Mikhail Golubev-Ciuchea, Stefano Garzarella, francesco
In-Reply-To: <ahXNb+KzuHYbS24+@fedora>

On Tue, May 26, 2026 at 06:42:23PM +0200, Matias Ezequiel Vara Larsen wrote:
> Add virtio CAN driver based on Virtio 1.4 specification (see
> https://github.com/oasis-tcs/virtio-spec/tree/virtio-1.4). The driver
> implements a complete CAN bus interface over Virtio transport,
> supporting both CAN Classic and CAN-FD Ids. In term of frames, it
> supports classic and CAN FD. RTR frames are only supported with classic
> CAN.
> 
> Usage:
> - "ip link set up can0" - start controller
> - "ip link set down can0" - stop controller
> - "candump can0" - receive frames
> - "cansend can0 123#DEADBEEF" - send frames
> 
> Signed-off-by: Harald Mommer <harald.mommer@oss.qualcomm.com>
> Co-developed-by: Harald Mommer <harald.mommer@oss.qualcomm.com>
> Signed-off-by: Mikhail Golubev-Ciuchea <mikhail.golubev-ciuchea@oss.qualcomm.com>
> Co-developed-by: Marc Kleine-Budde <mkl@pengutronix.de>
> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
> Cc: Damir Shaikhutdinov <Damir.Shaikhutdinov@opensynergy.com>
> Reviewed-by: Francesco Valla <francesco@valla.it>
> Tested-by: Francesco Valla <francesco@valla.it>
> Signed-off-by: Matias Ezequiel Vara Larsen <mvaralar@redhat.com>


It's in next now. Marc, Vincent, should I send it upstream in the next pull?



> ---
> V15:
> - Drop unchecked can_put_echo_skb() return value and dead error check in
>   virtio_can_freeze() since virtio_can_close() always returns zero.
> - Fix two use-after-free bugs: in virtio_can_start_xmit(), returning
>   NETDEV_TX_BUSY after virtqueue_add_sgs() fails is wrong because
>   can_put_echo_skb() already consumed the skb; and in the freeze/restore/remove
>   path, stale vq pointers left by a failed restore() are dereferenced by a
>   subsequent remove().
> 
> V14:
> - set length before memcpy for field defined with __counted_by_le()
> - move napi_enable() to open() and napi_disable() to close() thus following
>   pattern in other CAN devices
> - add flag to prevent remove() to hang if restore() fails
> 
> V13:
> - kick device only when some buffers have been added
> - add virtio_can_del_vq() when virtio_can_start() fails  
> - move napi_disable() at the end of virtio_can_freeze()
> - move napi_enable() at beginning of virtio_can_restore()
> - let virtio_can_start() sets CAN_STATE_ERROR_ACTIVE
> - in freeze(), if can_stop() fails, restore the state of the device
> - in restore(), if !netif_running, set CAN_STATE_STOPPED
> 
> V12:
> - check return value in virtio_can_start()
> - use devm_krealloc() in virtio_can_restore() to re-allocate rpkt
> - remove access to cf->len and replace it with len
> - use __counted_by_le()
> - free tx buffers during virtio_can_del_vq()
> - add reinit_completion() for ctrl msgs
> 
> V11:
> * Set CAN_VIRTIO_CAN config before CAN_XILINXCAN
> * Use GFP_ATOMIC in virtio_can_alloc_tx_idx()
> * Use open_candev() and set bittiming.bitrate to CAN_BITRATE_UNKNOWN and
>   data_bittiming.bitrate to CAN_BITRATE_UNKNOWN
> * Fix leak of skb by adding kfree_skb(skb)
> * Fix out-of-bounds in virtio_can_populate_rx_vq()
> * Initialize `ret`
> * Use virtio_reset_device()
> * Add napi_disable() in virtio_can_remove()
> * Propagate error in virtio_can_start() and virtio_can_stop()
> 
> V10:
> * Follow Reverse Christmas Tree convention
> 
> V9:
> * Remove unnecessary comments
> * Update maintainer list
> 
> V8:
> * Address nits
> 
> V7:
> * Address nits
> * Remove unnecessary comments
> * Remove io_callbacks[]
> * Use guard() syntax
> * Remove kicking for each inbuf
> * replace sdu_len with rpkt_len
> * Use devm_kzalloc()
> * Use scoped_guard() to protect virtqueue_add_sgs() and virtqueue_kicks() for
>   tx queue
> * Tested with vhost-device-can
>   (see https://github.com/rust-vmm/vhost-device/tree/main/vhost-device-can) and
>   Qemu (942b0d3) with [1]. A reviewer observed that the device stops to work
>   after flooding from host. This issue is still present.
> 
> [1]
> https://lore.kernel.org/qemu-devel/20251031155617.1223248-1-mvaralar@redhat.com/
> 
> V6:
> * Address nits
> * Check for error during register_virtio_can()
> * Remove virtio_device_ready()
> * Allocate virtio_can_rx rpkt[] at probe
> * Define virtio_can_control struct
> * Return VIRTIO_CAN_RESULT_NOT_OK after unlocking
> * Define sdu[] as a flex array for both tx and rx. For rx, use
>   VIRTIO_CAN_F_CAN_FD to figure out the max len for sdu
> * Fix statistics in virtio_can_read_tx_queue() and
>   how we indicate error to the user when getting
>   VIRTIO_CAN_RESULT_NOT_OK
> * Fix syntax of virtio_find_vqs()
> * Drop tx_list
> * Fix values of VIRTIO_CAN_F_LATE_TX_ACK and VIRTIO_CAN_F_RTR_FRAMES
> * Tested with vhost-device-can
>   (see
>   https://github.com/rust-vmm/vhost-device/tree/main/vhost-device-can)
>   and qemu (see
>   https://github.com/virtualopensystems/qemu/tree/vhu-can-rfc) 
> 
> V5:
> * Re-base on top of linux-next (next-20240103)
> * Tested with https://github.com/OpenSynergy/qemu/tree/virtio-can-spec-rfc-v3
> 
> RFC V4:
> * Apply reverse Christmas tree style
> * Add member *classic_dlc to RX and TX CAN frames
> * Fix race causing a NETDEV_TX_BUSY return
> ---
>  MAINTAINERS                     |    9 +
>  drivers/net/can/Kconfig         |   12 +
>  drivers/net/can/Makefile        |    1 +
>  drivers/net/can/virtio_can.c    | 1022 +++++++++++++++++++++++++++++++
>  include/uapi/linux/virtio_can.h |   78 +++
>  5 files changed, 1122 insertions(+)
>  create mode 100644 drivers/net/can/virtio_can.c
>  create mode 100644 include/uapi/linux/virtio_can.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 80cd3498c293..f295a904c93e 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -27068,6 +27068,15 @@ F:	drivers/scsi/virtio_scsi.c
>  F:	include/uapi/linux/virtio_blk.h
>  F:	include/uapi/linux/virtio_scsi.h
>  
> +VIRTIO CAN DRIVER
> +M:	"Harald Mommer" <harald.mommer@oss.qualcomm.com>
> +M:	"Matias Ezequiel Vara Larsen" <mvaralar@redhat.com>
> +L:	virtualization@lists.linux.dev
> +L:	linux-can@vger.kernel.org
> +S:	Maintained
> +F:	drivers/net/can/virtio_can.c
> +F:	include/uapi/linux/virtio_can.h
> +
>  VIRTIO CONSOLE DRIVER
>  M:	Amit Shah <amit@kernel.org>
>  L:	virtualization@lists.linux.dev
> diff --git a/drivers/net/can/Kconfig b/drivers/net/can/Kconfig
> index d43d56694667..f33ae46d9766 100644
> --- a/drivers/net/can/Kconfig
> +++ b/drivers/net/can/Kconfig
> @@ -209,6 +209,18 @@ config CAN_TI_HECC
>  	  Driver for TI HECC (High End CAN Controller) module found on many
>  	  TI devices. The device specifications are available from www.ti.com
>  
> +config CAN_VIRTIO_CAN
> +	depends on VIRTIO
> +	tristate "Virtio CAN device support"
> +	default n
> +	help
> +	  Say Y here if you want to support for Virtio CAN.
> +
> +	  To compile this driver as a module, choose M here: the
> +	  module will be called virtio-can.
> +
> +	  If unsure, say N.
> +
>  config CAN_XILINXCAN
>  	tristate "Xilinx CAN"
>  	depends on ARCH_ZYNQ || ARM64 || MICROBLAZE || COMPILE_TEST
> diff --git a/drivers/net/can/Makefile b/drivers/net/can/Makefile
> index 56138d8ddfd2..2ddea733ed5d 100644
> --- a/drivers/net/can/Makefile
> +++ b/drivers/net/can/Makefile
> @@ -32,6 +32,7 @@ obj-$(CONFIG_CAN_PEAK_PCIEFD)	+= peak_canfd/
>  obj-$(CONFIG_CAN_SJA1000)	+= sja1000/
>  obj-$(CONFIG_CAN_SUN4I)		+= sun4i_can.o
>  obj-$(CONFIG_CAN_TI_HECC)	+= ti_hecc.o
> +obj-$(CONFIG_CAN_VIRTIO_CAN)	+= virtio_can.o
>  obj-$(CONFIG_CAN_XILINXCAN)	+= xilinx_can.o
>  
>  subdir-ccflags-$(CONFIG_CAN_DEBUG_DEVICES) += -DDEBUG
> diff --git a/drivers/net/can/virtio_can.c b/drivers/net/can/virtio_can.c
> new file mode 100644
> index 000000000000..f67d0bf09681
> --- /dev/null
> +++ b/drivers/net/can/virtio_can.c
> @@ -0,0 +1,1022 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * CAN bus driver for the Virtio CAN controller
> + *
> + * Copyright (C) 2021-2023 OpenSynergy GmbH
> + * Copyright Red Hat, Inc. 2025
> + */
> +
> +#include <linux/atomic.h>
> +#include <linux/idr.h>
> +#include <linux/interrupt.h>
> +#include <linux/io.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/netdevice.h>
> +#include <linux/stddef.h>
> +#include <linux/can/dev.h>
> +#include <linux/virtio.h>
> +#include <linux/virtio_ring.h>
> +#include <linux/virtio_can.h>
> +
> +/* CAN device queues */
> +#define VIRTIO_CAN_QUEUE_TX 0
> +#define VIRTIO_CAN_QUEUE_RX 1
> +#define VIRTIO_CAN_QUEUE_CONTROL 2
> +#define VIRTIO_CAN_QUEUE_COUNT 3
> +
> +#define CAN_KNOWN_FLAGS \
> +	(VIRTIO_CAN_FLAGS_EXTENDED |\
> +	 VIRTIO_CAN_FLAGS_FD |\
> +	 VIRTIO_CAN_FLAGS_RTR)
> +
> +/* Max. number of in flight TX messages */
> +#define VIRTIO_CAN_ECHO_SKB_MAX 128
> +
> +struct virtio_can_tx {
> +	unsigned int putidx;
> +	struct virtio_can_tx_in tx_in;
> +	/* Keep virtio_can_tx_out at the end of the structure due to flex array */
> +	struct virtio_can_tx_out tx_out;
> +};
> +
> +struct virtio_can_control {
> +	struct virtio_can_control_out cpkt_out;
> +	struct virtio_can_control_in cpkt_in;
> +};
> +
> +/* virtio_can private data structure */
> +struct virtio_can_priv {
> +	struct can_priv can;	/* must be the first member */
> +	/* NAPI for RX messages */
> +	struct napi_struct napi;
> +	/* NAPI for TX messages */
> +	struct napi_struct napi_tx;
> +	/* The network device we're associated with */
> +	struct net_device *dev;
> +	/* The virtio device we're associated with */
> +	struct virtio_device *vdev;
> +	/* The virtqueues */
> +	struct virtqueue *vqs[VIRTIO_CAN_QUEUE_COUNT];
> +	/* Lock for TX operations */
> +	spinlock_t tx_lock;
> +	/* Control queue lock */
> +	struct mutex ctrl_lock;
> +	/* Wait for control queue processing without polling */
> +	struct completion ctrl_done;
> +	/* Array of receive queue messages */
> +	struct virtio_can_rx *rpkt;
> +	struct virtio_can_control can_ctr_msg;
> +	/* Data to get and maintain the putidx for local TX echo */
> +	struct ida tx_putidx_ida;
> +	/* In flight TX messages */
> +	atomic_t tx_inflight;
> +	/* Packet length */
> +	int rpkt_len;
> +	/* BusOff pending. Reset after successful indication to upper layer */
> +	bool busoff_pending;
> +	/* Tracks whether NAPI instances are currently enabled */
> +	bool napi_active;
> +};
> +
> +static void virtqueue_napi_schedule(struct napi_struct *napi,
> +				    struct virtqueue *vq)
> +{
> +	if (napi_schedule_prep(napi)) {
> +		virtqueue_disable_cb(vq);
> +		__napi_schedule(napi);
> +	}
> +}
> +
> +static void virtqueue_napi_complete(struct napi_struct *napi,
> +				    struct virtqueue *vq, int processed)
> +{
> +	int opaque;
> +
> +	opaque = virtqueue_enable_cb_prepare(vq);
> +	if (napi_complete_done(napi, processed)) {
> +		if (unlikely(virtqueue_poll(vq, opaque)))
> +			virtqueue_napi_schedule(napi, vq);
> +	} else {
> +		virtqueue_disable_cb(vq);
> +	}
> +}
> +
> +static void virtio_can_free_candev(struct net_device *ndev)
> +{
> +	struct virtio_can_priv *priv = netdev_priv(ndev);
> +
> +	ida_destroy(&priv->tx_putidx_ida);
> +	free_candev(ndev);
> +}
> +
> +static void virtio_can_napi_enable(struct virtio_can_priv *priv)
> +{
> +	if (!priv->napi_active) {
> +		napi_enable(&priv->napi);
> +		napi_enable(&priv->napi_tx);
> +		priv->napi_active = true;
> +	}
> +}
> +
> +static void virtio_can_napi_disable(struct virtio_can_priv *priv)
> +{
> +	if (priv->napi_active) {
> +		napi_disable(&priv->napi_tx);
> +		napi_disable(&priv->napi);
> +		priv->napi_active = false;
> +	}
> +}
> +
> +static int virtio_can_alloc_tx_idx(struct virtio_can_priv *priv)
> +{
> +	int tx_idx;
> +
> +	tx_idx = ida_alloc_max(&priv->tx_putidx_ida,
> +			       priv->can.echo_skb_max - 1, GFP_ATOMIC);
> +	if (tx_idx >= 0)
> +		atomic_inc(&priv->tx_inflight);
> +
> +	return tx_idx;
> +}
> +
> +static void virtio_can_free_tx_idx(struct virtio_can_priv *priv,
> +				   unsigned int idx)
> +{
> +	ida_free(&priv->tx_putidx_ida, idx);
> +	atomic_dec(&priv->tx_inflight);
> +}
> +
> +/* Create a scatter-gather list representing our input buffer and put
> + * it in the queue.
> + *
> + * Callers should take appropriate locks.
> + */
> +static int virtio_can_add_inbuf(struct virtqueue *vq, void *buf,
> +				unsigned int size)
> +{
> +	struct scatterlist sg[1];
> +	int ret;
> +
> +	sg_init_one(sg, buf, size);
> +
> +	ret = virtqueue_add_inbuf(vq, sg, 1, buf, GFP_ATOMIC);
> +
> +	return ret;
> +}
> +
> +/* Send a control message with message type either
> + *
> + * - VIRTIO_CAN_SET_CTRL_MODE_START or
> + * - VIRTIO_CAN_SET_CTRL_MODE_STOP.
> + *
> + */
> +static u8 virtio_can_send_ctrl_msg(struct net_device *ndev, u16 msg_type)
> +{
> +	struct scatterlist sg_out, sg_in, *sgs[2] = { &sg_out, &sg_in };
> +	struct virtio_can_priv *priv = netdev_priv(ndev);
> +	struct virtqueue *vq = priv->vqs[VIRTIO_CAN_QUEUE_CONTROL];
> +	struct device *dev = &priv->vdev->dev;
> +	unsigned int len;
> +	int err;
> +
> +	if (!vq)
> +		return VIRTIO_CAN_RESULT_NOT_OK;
> +
> +	guard(mutex)(&priv->ctrl_lock);
> +
> +	priv->can_ctr_msg.cpkt_out.msg_type = cpu_to_le16(msg_type);
> +	sg_init_one(&sg_out, &priv->can_ctr_msg.cpkt_out,
> +		    sizeof(priv->can_ctr_msg.cpkt_out));
> +	sg_init_one(&sg_in, &priv->can_ctr_msg.cpkt_in, sizeof(priv->can_ctr_msg.cpkt_in));
> +
> +	reinit_completion(&priv->ctrl_done);
> +
> +	err = virtqueue_add_sgs(vq, sgs, 1u, 1u, priv, GFP_ATOMIC);
> +	if (err != 0) {
> +		dev_err(dev, "%s(): virtqueue_add_sgs() failed\n", __func__);
> +		return VIRTIO_CAN_RESULT_NOT_OK;
> +	}
> +
> +	if (!virtqueue_kick(vq)) {
> +		dev_err(dev, "%s(): Kick failed\n", __func__);
> +		return VIRTIO_CAN_RESULT_NOT_OK;
> +	}
> +
> +	while (!virtqueue_get_buf(vq, &len) && !virtqueue_is_broken(vq))
> +		wait_for_completion(&priv->ctrl_done);
> +
> +	return priv->can_ctr_msg.cpkt_in.result;
> +}
> +
> +static int virtio_can_start(struct net_device *ndev)
> +{
> +	struct virtio_can_priv *priv = netdev_priv(ndev);
> +	u8 result;
> +
> +	result = virtio_can_send_ctrl_msg(ndev, VIRTIO_CAN_SET_CTRL_MODE_START);
> +	if (result != VIRTIO_CAN_RESULT_OK) {
> +		netdev_err(ndev, "CAN controller start failed\n");
> +		return -EIO;
> +	}
> +
> +	priv->busoff_pending = false;
> +	priv->can.state = CAN_STATE_ERROR_ACTIVE;
> +
> +	return 0;
> +}
> +
> +static int virtio_can_set_mode(struct net_device *dev, enum can_mode mode)
> +{
> +	int err;
> +
> +	switch (mode) {
> +	case CAN_MODE_START:
> +		err = virtio_can_start(dev);
> +		if (err)
> +			return err;
> +		netif_wake_queue(dev);
> +		break;
> +	default:
> +		return -EOPNOTSUPP;
> +	}
> +
> +	return 0;
> +}
> +
> +static int virtio_can_open(struct net_device *ndev)
> +{
> +	struct virtio_can_priv *priv = netdev_priv(ndev);
> +	int err;
> +
> +	err = open_candev(ndev);
> +	if (err)
> +		return err;
> +
> +	err = virtio_can_start(ndev);
> +	if (err) {
> +		close_candev(ndev);
> +		return err;
> +	}
> +
> +	virtio_can_napi_enable(priv);
> +	netif_start_queue(ndev);
> +
> +	return 0;
> +}
> +
> +static int virtio_can_stop(struct net_device *ndev)
> +{
> +	struct virtio_can_priv *priv = netdev_priv(ndev);
> +	struct device *dev = &priv->vdev->dev;
> +	u8 result;
> +
> +	result = virtio_can_send_ctrl_msg(ndev, VIRTIO_CAN_SET_CTRL_MODE_STOP);
> +	if (result != VIRTIO_CAN_RESULT_OK) {
> +		dev_err(dev, "CAN controller stop failed\n");
> +		return -EIO;
> +	}
> +
> +	priv->busoff_pending = false;
> +	priv->can.state = CAN_STATE_STOPPED;
> +
> +	/* Switch carrier off if device was connected to the bus */
> +	if (netif_carrier_ok(ndev))
> +		netif_carrier_off(ndev);
> +
> +	return 0;
> +}
> +
> +static int virtio_can_close(struct net_device *dev)
> +{
> +	struct virtio_can_priv *priv = netdev_priv(dev);
> +
> +	netif_stop_queue(dev);
> +	/* Ignore stop error: ndo_stop must always complete cleanup regardless.
> +	 * virtio_can_stop() already logs the error if it fails.
> +	 */
> +	virtio_can_stop(dev);
> +	virtio_can_napi_disable(priv);
> +	close_candev(dev);
> +
> +	return 0;
> +}
> +
> +static netdev_tx_t virtio_can_start_xmit(struct sk_buff *skb,
> +					 struct net_device *dev)
> +{
> +	struct scatterlist sg_out, sg_in, *sgs[2] = { &sg_out, &sg_in };
> +	const unsigned int hdr_size = sizeof(struct virtio_can_tx_out);
> +	struct canfd_frame *cf = (struct canfd_frame *)skb->data;
> +	struct virtio_can_priv *priv = netdev_priv(dev);
> +	struct virtqueue *vq = priv->vqs[VIRTIO_CAN_QUEUE_TX];
> +	netdev_tx_t xmit_ret = NETDEV_TX_OK;
> +	struct virtio_can_tx *can_tx_msg;
> +	u32 can_flags;
> +	int putidx;
> +	int err;
> +
> +	if (can_dev_dropped_skb(dev, skb))
> +		goto kick; /* No way to return NET_XMIT_DROP here */
> +
> +	/* No local check for CAN_RTR_FLAG or FD frame against negotiated
> +	 * features. The device will reject those anyway if not supported.
> +	 */
> +
> +	can_tx_msg = kzalloc(sizeof(*can_tx_msg) + cf->len, GFP_ATOMIC);
> +	if (!can_tx_msg) {
> +		kfree_skb(skb);
> +		dev->stats.tx_dropped++;
> +		goto kick; /* No way to return NET_XMIT_DROP here */
> +	}
> +
> +	can_tx_msg->tx_out.msg_type = cpu_to_le16(VIRTIO_CAN_TX);
> +	can_tx_msg->tx_out.length = cpu_to_le16(cf->len);
> +	can_flags = 0;
> +
> +	if (cf->can_id & CAN_EFF_FLAG) {
> +		can_flags |= VIRTIO_CAN_FLAGS_EXTENDED;
> +		can_tx_msg->tx_out.can_id = cpu_to_le32(cf->can_id & CAN_EFF_MASK);
> +	} else {
> +		can_tx_msg->tx_out.can_id = cpu_to_le32(cf->can_id & CAN_SFF_MASK);
> +	}
> +	if (cf->can_id & CAN_RTR_FLAG)
> +		can_flags |= VIRTIO_CAN_FLAGS_RTR;
> +	else
> +		memcpy(can_tx_msg->tx_out.sdu, cf->data, cf->len);
> +	if (can_is_canfd_skb(skb))
> +		can_flags |= VIRTIO_CAN_FLAGS_FD;
> +
> +	can_tx_msg->tx_out.flags = cpu_to_le32(can_flags);
> +
> +	sg_init_one(&sg_out, &can_tx_msg->tx_out, hdr_size + cf->len);
> +	sg_init_one(&sg_in, &can_tx_msg->tx_in, sizeof(can_tx_msg->tx_in));
> +
> +	putidx = virtio_can_alloc_tx_idx(priv);
> +
> +	if (unlikely(putidx < 0)) {
> +		/* -ENOMEM or -ENOSPC here. -ENOSPC should not be possible as
> +		 * tx_inflight >= can.echo_skb_max is checked in flow control
> +		 */
> +		WARN_ON_ONCE(putidx == -ENOSPC);
> +		kfree(can_tx_msg);
> +		kfree_skb(skb);
> +		dev->stats.tx_dropped++;
> +		goto kick; /* No way to return NET_XMIT_DROP here */
> +	}
> +
> +	can_tx_msg->putidx = (unsigned int)putidx;
> +
> +	/* Push loopback echo. Will be looped back on TX interrupt/TX NAPI */
> +	err = can_put_echo_skb(skb, dev, can_tx_msg->putidx, 0);
> +	if (unlikely(err)) {
> +		/* skb was already freed by can_put_echo_skb() on error */
> +		virtio_can_free_tx_idx(priv, can_tx_msg->putidx);
> +		kfree(can_tx_msg);
> +		dev->stats.tx_dropped++;
> +		goto kick;
> +	}
> +
> +	/* Protect queue and list operations */
> +	scoped_guard(spinlock_irqsave, &priv->tx_lock)
> +		err = virtqueue_add_sgs(vq, sgs, 1u, 1u, can_tx_msg, GFP_ATOMIC);
> +
> +	if (unlikely(err)) {
> +		/*
> +		 * can_put_echo_skb() already consumed skb via consume_skb(),
> +		 * so returning NETDEV_TX_BUSY would cause the stack to requeue
> +		 * a freed pointer. Drop the frame and return OK instead.
> +		 */
> +		can_free_echo_skb(dev, can_tx_msg->putidx, NULL);
> +		virtio_can_free_tx_idx(priv, can_tx_msg->putidx);
> +		netif_stop_queue(dev);
> +		kfree(can_tx_msg);
> +		dev->stats.tx_dropped++;
> +		/* Expected never to be seen */
> +		netdev_warn(dev, "TX: Stop queue, err = %d\n", err);
> +		goto kick;
> +	}
> +
> +	/* Normal flow control: stop queue when no transmission slots left */
> +	if (atomic_read(&priv->tx_inflight) >= priv->can.echo_skb_max ||
> +	    vq->num_free == 0 || (vq->num_free < ARRAY_SIZE(sgs) &&
> +	    !virtio_has_feature(vq->vdev, VIRTIO_RING_F_INDIRECT_DESC))) {
> +		netif_stop_queue(dev);
> +		netdev_dbg(dev, "TX: Normal stop queue\n");
> +	}
> +
> +kick:
> +	if (netif_queue_stopped(dev) || !netdev_xmit_more()) {
> +		scoped_guard(spinlock_irqsave, &priv->tx_lock) {
> +			if (!virtqueue_kick(vq))
> +				netdev_err(dev, "%s(): Kick failed\n", __func__);
> +		}
> +	}
> +
> +	return xmit_ret;
> +}
> +
> +static const struct net_device_ops virtio_can_netdev_ops = {
> +	.ndo_open = virtio_can_open,
> +	.ndo_stop = virtio_can_close,
> +	.ndo_start_xmit = virtio_can_start_xmit,
> +};
> +
> +static int register_virtio_can_dev(struct net_device *dev)
> +{
> +	dev->flags |= IFF_ECHO;	/* we support local echo */
> +	dev->netdev_ops = &virtio_can_netdev_ops;
> +
> +	return register_candev(dev);
> +}
> +
> +static int virtio_can_read_tx_queue(struct virtqueue *vq)
> +{
> +	struct virtio_can_priv *can_priv = vq->vdev->priv;
> +	struct net_device *dev = can_priv->dev;
> +	struct virtio_can_tx *can_tx_msg;
> +	struct net_device_stats *stats;
> +	unsigned int len;
> +	u8 result;
> +
> +	stats = &dev->stats;
> +
> +	scoped_guard(spinlock_irqsave, &can_priv->tx_lock)
> +		can_tx_msg = virtqueue_get_buf(vq, &len);
> +
> +	if (!can_tx_msg)
> +		return 0;
> +
> +	if (unlikely(len < sizeof(struct virtio_can_tx_in))) {
> +		netdev_err(dev, "TX ACK: Device sent no result code\n");
> +		result = VIRTIO_CAN_RESULT_NOT_OK; /* Keep things going */
> +	} else {
> +		result = can_tx_msg->tx_in.result;
> +	}
> +
> +	if (can_priv->can.state < CAN_STATE_BUS_OFF) {
> +		if (result != VIRTIO_CAN_RESULT_OK) {
> +			struct can_frame *skb_cf;
> +			struct sk_buff *skb = alloc_can_err_skb(dev, &skb_cf);
> +
> +			if (skb) {
> +				skb_cf->can_id |= CAN_ERR_CRTL;
> +				skb_cf->data[1] |= CAN_ERR_CRTL_UNSPEC;
> +				netif_rx(skb);
> +			}
> +			netdev_warn(dev, "TX ACK: Result = %u\n", result);
> +			can_free_echo_skb(dev, can_tx_msg->putidx, NULL);
> +			stats->tx_dropped++;
> +		} else {
> +			stats->tx_bytes += can_get_echo_skb(dev, can_tx_msg->putidx,
> +				NULL);
> +			stats->tx_packets++;
> +		}
> +	} else {
> +		netdev_dbg(dev, "TX ACK: Controller inactive, drop echo\n");
> +		can_free_echo_skb(dev, can_tx_msg->putidx, NULL);
> +		stats->tx_dropped++;
> +	}
> +
> +	virtio_can_free_tx_idx(can_priv, can_tx_msg->putidx);
> +
> +	/* Flow control */
> +	if (netif_queue_stopped(dev)) {
> +		netdev_dbg(dev, "TX ACK: Wake up stopped queue\n");
> +		netif_wake_queue(dev);
> +	}
> +
> +	kfree(can_tx_msg);
> +
> +	return 1; /* Queue was not empty so there may be more data */
> +}
> +
> +static int virtio_can_tx_poll(struct napi_struct *napi, int quota)
> +{
> +	struct net_device *dev = napi->dev;
> +	struct virtio_can_priv *priv = netdev_priv(dev);
> +	struct virtqueue *vq = priv->vqs[VIRTIO_CAN_QUEUE_TX];
> +	int work_done = 0;
> +
> +	while (work_done < quota && virtio_can_read_tx_queue(vq) != 0)
> +		work_done++;
> +
> +	if (work_done < quota)
> +		virtqueue_napi_complete(napi, vq, work_done);
> +
> +	return work_done;
> +}
> +
> +static void virtio_can_tx_intr(struct virtqueue *vq)
> +{
> +	struct virtio_can_priv *can_priv = vq->vdev->priv;
> +
> +	virtqueue_disable_cb(vq);
> +	napi_schedule(&can_priv->napi_tx);
> +}
> +
> +/* This function is the NAPI RX poll function and NAPI guarantees that this
> + * function is not invoked simultaneously on multiple processors.
> + * Read a RX message from the used queue and sends it to the upper layer.
> + */
> +static int virtio_can_read_rx_queue(struct virtqueue *vq)
> +{
> +	const unsigned int header_size = sizeof(struct virtio_can_rx);
> +	struct virtio_can_priv *priv = vq->vdev->priv;
> +	struct net_device *dev = priv->dev;
> +	struct net_device_stats *stats;
> +	struct virtio_can_rx *can_rx;
> +	unsigned int transport_len;
> +	struct canfd_frame *cf;
> +	struct sk_buff *skb;
> +	unsigned int len;
> +	u32 can_flags;
> +	u16 msg_type;
> +	u32 can_id;
> +	int ret;
> +
> +	stats = &dev->stats;
> +
> +	can_rx = virtqueue_get_buf(vq, &transport_len);
> +	if (!can_rx)
> +		return 0; /* No more data */
> +
> +	if (transport_len < header_size) {
> +		netdev_warn(dev, "RX: Message too small\n");
> +		goto putback;
> +	}
> +
> +	if (priv->can.state >= CAN_STATE_ERROR_PASSIVE) {
> +		netdev_dbg(dev, "%s(): Controller not active\n", __func__);
> +		goto putback;
> +	}
> +
> +	msg_type = le16_to_cpu(can_rx->msg_type);
> +	if (msg_type != VIRTIO_CAN_RX) {
> +		netdev_warn(dev, "RX: Got unknown msg_type %04x\n", msg_type);
> +		goto putback;
> +	}
> +
> +	len = le16_to_cpu(can_rx->length);
> +	can_flags = le32_to_cpu(can_rx->flags);
> +	can_id = le32_to_cpu(can_rx->can_id);
> +
> +	if (can_flags & ~CAN_KNOWN_FLAGS) {
> +		stats->rx_dropped++;
> +		netdev_warn(dev, "RX: CAN Id 0x%08x: Invalid flags 0x%x\n",
> +			    can_id, can_flags);
> +		goto putback;
> +	}
> +
> +	if (can_flags & VIRTIO_CAN_FLAGS_EXTENDED) {
> +		can_id &= CAN_EFF_MASK;
> +		can_id |= CAN_EFF_FLAG;
> +	} else {
> +		can_id &= CAN_SFF_MASK;
> +	}
> +
> +	if (can_flags & VIRTIO_CAN_FLAGS_RTR) {
> +		if (!virtio_has_feature(vq->vdev, VIRTIO_CAN_F_RTR_FRAMES)) {
> +			stats->rx_dropped++;
> +			netdev_warn(dev, "RX: CAN Id 0x%08x: RTR not negotiated\n",
> +				    can_id);
> +			goto putback;
> +		}
> +		if (can_flags & VIRTIO_CAN_FLAGS_FD) {
> +			stats->rx_dropped++;
> +			netdev_warn(dev, "RX: CAN Id 0x%08x: RTR with FD not possible\n",
> +				    can_id);
> +			goto putback;
> +		}
> +
> +		if (len > 0xF) {
> +			stats->rx_dropped++;
> +			netdev_warn(dev, "RX: CAN Id 0x%08x: RTR with DLC > 0xF\n",
> +				    can_id);
> +			goto putback;
> +		}
> +
> +		if (len > 0x8)
> +			len = 0x8;
> +
> +		can_id |= CAN_RTR_FLAG;
> +	}
> +
> +	if (transport_len < header_size + len) {
> +		netdev_warn(dev, "RX: Message too small for payload\n");
> +		goto putback;
> +	}
> +
> +	if (can_flags & VIRTIO_CAN_FLAGS_FD) {
> +		if (!virtio_has_feature(vq->vdev, VIRTIO_CAN_F_CAN_FD)) {
> +			stats->rx_dropped++;
> +			netdev_warn(dev, "RX: CAN Id 0x%08x: FD not negotiated\n",
> +				    can_id);
> +			goto putback;
> +		}
> +
> +		if (len > CANFD_MAX_DLEN)
> +			len = CANFD_MAX_DLEN;
> +
> +		skb = alloc_canfd_skb(priv->dev, &cf);
> +	} else {
> +		if (!virtio_has_feature(vq->vdev, VIRTIO_CAN_F_CAN_CLASSIC)) {
> +			stats->rx_dropped++;
> +			netdev_warn(dev, "RX: CAN Id 0x%08x: classic not negotiated\n",
> +				    can_id);
> +			goto putback;
> +		}
> +
> +		if (len > CAN_MAX_DLEN)
> +			len = CAN_MAX_DLEN;
> +
> +		skb = alloc_can_skb(priv->dev, (struct can_frame **)&cf);
> +	}
> +	if (!skb) {
> +		stats->rx_dropped++;
> +		netdev_warn(dev, "RX: No skb available\n");
> +		goto putback;
> +	}
> +
> +	cf->can_id = can_id;
> +	cf->len = len;
> +	if (!(can_flags & VIRTIO_CAN_FLAGS_RTR)) {
> +		/* RTR frames have a DLC but no payload */
> +		memcpy(cf->data, can_rx->sdu, len);
> +	}
> +
> +	if (netif_receive_skb(skb) == NET_RX_SUCCESS) {
> +		stats->rx_packets++;
> +		if (!(can_flags & VIRTIO_CAN_FLAGS_RTR))
> +			stats->rx_bytes += len;
> +	}
> +
> +putback:
> +	/* Put processed RX buffer back into avail queue */
> +	ret = virtio_can_add_inbuf(vq, can_rx,
> +				   priv->rpkt_len);
> +	if (!ret)
> +		virtqueue_kick(vq);
> +	return 1; /* Queue was not empty so there may be more data */
> +}
> +
> +static int virtio_can_handle_busoff(struct net_device *dev)
> +{
> +	struct virtio_can_priv *priv = netdev_priv(dev);
> +	struct can_frame *cf;
> +	struct sk_buff *skb;
> +
> +	if (!priv->busoff_pending)
> +		return 0;
> +
> +	if (priv->can.state < CAN_STATE_BUS_OFF) {
> +		netdev_dbg(dev, "entered error bus off state\n");
> +
> +		/* bus-off state */
> +		priv->can.state = CAN_STATE_BUS_OFF;
> +		priv->can.can_stats.bus_off++;
> +		can_bus_off(dev);
> +	}
> +
> +	/* propagate the error condition to the CAN stack */
> +	skb = alloc_can_err_skb(dev, &cf);
> +	if (unlikely(!skb))
> +		return 0;
> +
> +	/* bus-off state */
> +	cf->can_id |= CAN_ERR_BUSOFF;
> +
> +	/* Ensure that the BusOff indication does not get lost */
> +	if (netif_receive_skb(skb) == NET_RX_SUCCESS)
> +		priv->busoff_pending = false;
> +
> +	return 1;
> +}
> +
> +static int virtio_can_rx_poll(struct napi_struct *napi, int quota)
> +{
> +	struct net_device *dev = napi->dev;
> +	struct virtio_can_priv *priv = netdev_priv(dev);
> +	struct virtqueue *vq = priv->vqs[VIRTIO_CAN_QUEUE_RX];
> +	int work_done = 0;
> +
> +	work_done += virtio_can_handle_busoff(dev);
> +
> +	while (work_done < quota && virtio_can_read_rx_queue(vq) != 0)
> +		work_done++;
> +
> +	if (work_done < quota)
> +		virtqueue_napi_complete(napi, vq, work_done);
> +
> +	return work_done;
> +}
> +
> +static void virtio_can_rx_intr(struct virtqueue *vq)
> +{
> +	struct virtio_can_priv *can_priv = vq->vdev->priv;
> +
> +	virtqueue_disable_cb(vq);
> +	napi_schedule(&can_priv->napi);
> +}
> +
> +static void virtio_can_control_intr(struct virtqueue *vq)
> +{
> +	struct virtio_can_priv *can_priv = vq->vdev->priv;
> +
> +	complete(&can_priv->ctrl_done);
> +}
> +
> +static void virtio_can_config_changed(struct virtio_device *vdev)
> +{
> +	struct virtio_can_priv *can_priv = vdev->priv;
> +	u16 status;
> +
> +	status = virtio_cread16(vdev, offsetof(struct virtio_can_config,
> +					       status));
> +
> +	if (!(status & VIRTIO_CAN_S_CTRL_BUSOFF))
> +		return;
> +
> +	if (!can_priv->busoff_pending &&
> +	    can_priv->can.state < CAN_STATE_BUS_OFF) {
> +		can_priv->busoff_pending = true;
> +		napi_schedule(&can_priv->napi);
> +	}
> +}
> +
> +static void virtio_can_populate_rx_vq(struct virtio_device *vdev)
> +{
> +	struct virtio_can_priv *priv = vdev->priv;
> +	struct virtqueue *vq = priv->vqs[VIRTIO_CAN_QUEUE_RX];
> +	unsigned int buf_size = priv->rpkt_len;
> +	int num_elements = vq->num_free;
> +	u8 *buf = (u8 *)priv->rpkt;
> +	unsigned int idx;
> +	int ret = 0;
> +
> +	for (idx = 0; idx < num_elements; idx++) {
> +		ret = virtio_can_add_inbuf(vq, buf, buf_size);
> +		if (ret < 0) {
> +			dev_dbg(&vdev->dev, "rpkt fill: ret=%d, idx=%u, size=%u\n",
> +				ret, idx, buf_size);
> +			break;
> +		}
> +		buf += buf_size;
> +	}
> +
> +	if (idx > 0)
> +		virtqueue_kick(vq);
> +
> +	dev_dbg(&vdev->dev, "%u rpkt added\n", idx);
> +}
> +
> +static int virtio_can_find_vqs(struct virtio_can_priv *priv)
> +{
> +	struct virtqueue_info vqs_info[] = {
> +		{ "can-tx", virtio_can_tx_intr },
> +		{ "can-rx", virtio_can_rx_intr },
> +		{ "can-state-ctrl", virtio_can_control_intr },
> +	};
> +
> +	/* Find the queues. */
> +	return virtio_find_vqs(priv->vdev, VIRTIO_CAN_QUEUE_COUNT, priv->vqs,
> +			       vqs_info, NULL);
> +}
> +
> +/* Function must not be called before virtio_can_find_vqs() has been run */
> +static void virtio_can_del_vq(struct virtio_device *vdev)
> +{
> +	struct virtio_can_priv *priv = vdev->priv;
> +	struct virtqueue *vq = priv->vqs[VIRTIO_CAN_QUEUE_TX];
> +	struct virtio_can_tx *can_tx_msg;
> +	int q;
> +
> +	if (!vq)
> +		return;
> +
> +	/* Reset the device */
> +	virtio_reset_device(vdev);
> +
> +	/* From here we have dead silence from the device side so no locks
> +	 * are needed to protect against device side events.
> +	 */
> +
> +	/* Free pending TX buffers which were allocated in virtio_can_start_xmit() */
> +	while ((can_tx_msg = virtqueue_detach_unused_buf(vq))) {
> +		can_free_echo_skb(priv->dev, can_tx_msg->putidx, NULL);
> +		virtio_can_free_tx_idx(priv, can_tx_msg->putidx);
> +		kfree(can_tx_msg);
> +	}
> +
> +	/* RX and control queue buffers are managed elsewhere, just detach */
> +	for (q = VIRTIO_CAN_QUEUE_RX; q < VIRTIO_CAN_QUEUE_COUNT; q++)
> +		while (virtqueue_detach_unused_buf(priv->vqs[q]))
> +			;
> +
> +	if (vdev->config->del_vqs)
> +		vdev->config->del_vqs(vdev);
> +
> +	memset(priv->vqs, 0, sizeof(priv->vqs));
> +}
> +
> +static void virtio_can_remove(struct virtio_device *vdev)
> +{
> +	struct virtio_can_priv *priv = vdev->priv;
> +	struct net_device *dev = priv->dev;
> +
> +	unregister_candev(dev);
> +
> +	virtio_can_del_vq(vdev);
> +
> +	virtio_can_free_candev(dev);
> +}
> +
> +static int virtio_can_validate(struct virtio_device *vdev)
> +{
> +	/* CAN needs always access to the config space.
> +	 * Check that the driver can access the config space
> +	 */
> +	if (!vdev->config->get) {
> +		dev_err(&vdev->dev, "%s failure: config access disabled\n",
> +			__func__);
> +		return -EINVAL;
> +	}
> +
> +	if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {
> +		dev_err(&vdev->dev,
> +			"device does not comply with spec version 1.x\n");
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int virtio_can_probe(struct virtio_device *vdev)
> +{
> +	struct virtio_can_priv *priv;
> +	struct net_device *dev;
> +	size_t size;
> +	int err;
> +
> +	dev = alloc_candev(sizeof(struct virtio_can_priv),
> +			   VIRTIO_CAN_ECHO_SKB_MAX);
> +	if (!dev)
> +		return -ENOMEM;
> +
> +	priv = netdev_priv(dev);
> +
> +	ida_init(&priv->tx_putidx_ida);
> +
> +	netif_napi_add(dev, &priv->napi, virtio_can_rx_poll);
> +	netif_napi_add(dev, &priv->napi_tx, virtio_can_tx_poll);
> +
> +	SET_NETDEV_DEV(dev, &vdev->dev);
> +
> +	priv->dev = dev;
> +	priv->vdev = vdev;
> +	vdev->priv = priv;
> +
> +	priv->can.do_set_mode = virtio_can_set_mode;
> +	priv->can.bittiming.bitrate = CAN_BITRATE_UNKNOWN;
> +	/* Set Virtio CAN supported operations */
> +	priv->can.ctrlmode_supported = CAN_CTRLMODE_BERR_REPORTING;
> +	if (virtio_has_feature(vdev, VIRTIO_CAN_F_CAN_FD)) {
> +		priv->can.fd.data_bittiming.bitrate = CAN_BITRATE_UNKNOWN;
> +		err = can_set_static_ctrlmode(dev, CAN_CTRLMODE_FD);
> +		if (err != 0)
> +			goto on_failure;
> +	}
> +
> +	/* Initialize virtqueues */
> +	err = virtio_can_find_vqs(priv);
> +	if (err != 0)
> +		goto on_failure;
> +
> +	spin_lock_init(&priv->tx_lock);
> +	mutex_init(&priv->ctrl_lock);
> +
> +	init_completion(&priv->ctrl_done);
> +
> +	priv->rpkt_len = sizeof(struct virtio_can_rx);
> +
> +	if (virtio_has_feature(vdev, VIRTIO_CAN_F_CAN_FD))
> +		priv->rpkt_len += CANFD_MAX_DLEN;
> +	else
> +		priv->rpkt_len += CAN_MAX_DLEN;
> +
> +	size = priv->rpkt_len * priv->vqs[VIRTIO_CAN_QUEUE_RX]->num_free;
> +	priv->rpkt = devm_kzalloc(&vdev->dev, size, GFP_KERNEL);
> +	if (!priv->rpkt) {
> +		virtio_can_del_vq(vdev);
> +		err = -ENOMEM;
> +		goto on_failure;
> +	}
> +	virtio_can_populate_rx_vq(vdev);
> +
> +	err = register_virtio_can_dev(dev);
> +	if (err) {
> +		virtio_can_del_vq(vdev);
> +		goto on_failure;
> +	}
> +
> +	return 0;
> +
> +on_failure:
> +	virtio_can_free_candev(dev);
> +	return err;
> +}
> +
> +static int __maybe_unused virtio_can_freeze(struct virtio_device *vdev)
> +{
> +	struct virtio_can_priv *priv = vdev->priv;
> +	struct net_device *ndev = priv->dev;
> +
> +	if (netif_running(ndev)) {
> +		/* virtio_can_close() calls netif_stop_queue(), virtio_can_stop(),
> +		 * napi_disable() and close_candev(). Call it directly (not via
> +		 * dev_close()) to preserve IFF_UP so that netif_running() returns
> +		 * true in virtio_can_restore() and the device is brought back up.
> +		 */
> +		virtio_can_close(ndev);
> +		netif_device_detach(ndev);
> +	}
> +
> +	priv->can.state = CAN_STATE_SLEEPING;
> +
> +	virtio_can_del_vq(vdev);
> +
> +	return 0;
> +}
> +
> +static int __maybe_unused virtio_can_restore(struct virtio_device *vdev)
> +{
> +	struct virtio_can_priv *priv = vdev->priv;
> +	struct net_device *ndev = priv->dev;
> +	size_t size;
> +	int err;
> +
> +	err = virtio_can_find_vqs(priv);
> +	if (err != 0)
> +		return err;
> +
> +	size = priv->rpkt_len * priv->vqs[VIRTIO_CAN_QUEUE_RX]->num_free;
> +	priv->rpkt = devm_krealloc(&vdev->dev, priv->rpkt, size, GFP_KERNEL | __GFP_ZERO);
> +	if (!priv->rpkt) {
> +		virtio_can_del_vq(vdev);
> +		return -ENOMEM;
> +	}
> +	virtio_can_populate_rx_vq(vdev);
> +
> +	if (netif_running(ndev)) {
> +		/* virtio_can_open() calls open_candev(), virtio_can_start(),
> +		 * napi_enable() and netif_start_queue(). Call it directly (not
> +		 * via dev_open()) since IFF_UP is still set from before freeze.
> +		 */
> +		err = virtio_can_open(ndev);
> +		if (err) {
> +			virtio_can_del_vq(vdev);
> +			return err;
> +		}
> +		netif_device_attach(ndev);
> +	} else {
> +		priv->can.state = CAN_STATE_STOPPED;
> +	}
> +
> +	return 0;
> +}
> +
> +static struct virtio_device_id virtio_can_id_table[] = {
> +	{ VIRTIO_ID_CAN, VIRTIO_DEV_ANY_ID },
> +	{ 0 },
> +};
> +
> +static unsigned int features[] = {
> +	VIRTIO_CAN_F_CAN_CLASSIC,
> +	VIRTIO_CAN_F_CAN_FD,
> +	VIRTIO_CAN_F_LATE_TX_ACK,
> +	VIRTIO_CAN_F_RTR_FRAMES,
> +};
> +
> +static struct virtio_driver virtio_can_driver = {
> +	.feature_table = features,
> +	.feature_table_size = ARRAY_SIZE(features),
> +	.driver.name = KBUILD_MODNAME,
> +	.driver.owner = THIS_MODULE,
> +	.id_table = virtio_can_id_table,
> +	.validate = virtio_can_validate,
> +	.probe = virtio_can_probe,
> +	.remove = virtio_can_remove,
> +	.config_changed = virtio_can_config_changed,
> +#ifdef CONFIG_PM_SLEEP
> +	.freeze = virtio_can_freeze,
> +	.restore = virtio_can_restore,
> +#endif
> +};
> +
> +module_virtio_driver(virtio_can_driver);
> +MODULE_DEVICE_TABLE(virtio, virtio_can_id_table);
> +
> +MODULE_AUTHOR("OpenSynergy GmbH");
> +MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("CAN bus driver for Virtio CAN controller");
> diff --git a/include/uapi/linux/virtio_can.h b/include/uapi/linux/virtio_can.h
> new file mode 100644
> index 000000000000..08d7e3e78776
> --- /dev/null
> +++ b/include/uapi/linux/virtio_can.h
> @@ -0,0 +1,78 @@
> +/* SPDX-License-Identifier: BSD-3-Clause */
> +/*
> + * Copyright (C) 2021-2023 OpenSynergy GmbH
> + * Copyright Red Hat, Inc. 2025
> + */
> +#ifndef _LINUX_VIRTIO_VIRTIO_CAN_H
> +#define _LINUX_VIRTIO_VIRTIO_CAN_H
> +
> +#include <linux/types.h>
> +#include <linux/virtio_types.h>
> +#include <linux/virtio_ids.h>
> +#include <linux/virtio_config.h>
> +
> +/* Feature bit numbers */
> +#define VIRTIO_CAN_F_CAN_CLASSIC        0
> +#define VIRTIO_CAN_F_CAN_FD             1
> +#define VIRTIO_CAN_F_RTR_FRAMES         2
> +#define VIRTIO_CAN_F_LATE_TX_ACK        3
> +
> +/* CAN Result Types */
> +#define VIRTIO_CAN_RESULT_OK            0
> +#define VIRTIO_CAN_RESULT_NOT_OK        1
> +
> +/* CAN flags to determine type of CAN Id */
> +#define VIRTIO_CAN_FLAGS_EXTENDED       0x8000
> +#define VIRTIO_CAN_FLAGS_FD             0x4000
> +#define VIRTIO_CAN_FLAGS_RTR            0x2000
> +
> +#define VIRTIO_CAN_MAX_DLEN    64 // this is like CANFD_MAX_DLEN
> +
> +struct virtio_can_config {
> +#define VIRTIO_CAN_S_CTRL_BUSOFF (1u << 0) /* Controller BusOff */
> +	/* CAN controller status */
> +	__le16 status;
> +};
> +
> +/* TX queue message types */
> +struct virtio_can_tx_out {
> +#define VIRTIO_CAN_TX                   0x0001
> +	__le16 msg_type;
> +	__le16 length; /* 0..8 CC, 0..64 CAN-FD, 0..2048 CAN-XL, 12 bits */
> +	__u8 reserved_classic_dlc; /* If CAN classic length = 8 then DLC can be 8..15 */
> +	__u8 padding;
> +	__le16 reserved_xl_priority; /* May be needed for CAN XL priority */
> +	__le32 flags;
> +	__le32 can_id;
> +	__u8 sdu[] __counted_by_le(length);
> +};
> +
> +struct virtio_can_tx_in {
> +	__u8 result;
> +};
> +
> +/* RX queue message types */
> +struct virtio_can_rx {
> +#define VIRTIO_CAN_RX                   0x0101
> +	__le16 msg_type;
> +	__le16 length; /* 0..8 CC, 0..64 CAN-FD, 0..2048 CAN-XL, 12 bits */
> +	__u8 reserved_classic_dlc; /* If CAN classic length = 8 then DLC can be 8..15 */
> +	__u8 padding;
> +	__le16 reserved_xl_priority; /* May be needed for CAN XL priority */
> +	__le32 flags;
> +	__le32 can_id;
> +	__u8 sdu[] __counted_by_le(length);
> +};
> +
> +/* Control queue message types */
> +struct virtio_can_control_out {
> +#define VIRTIO_CAN_SET_CTRL_MODE_START  0x0201
> +#define VIRTIO_CAN_SET_CTRL_MODE_STOP   0x0202
> +	__le16 msg_type;
> +};
> +
> +struct virtio_can_control_in {
> +	__u8 result;
> +};
> +
> +#endif /* #ifndef _LINUX_VIRTIO_VIRTIO_CAN_H */
> -- 
> 2.42.0


^ permalink raw reply

* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Miaohe Lin @ 2026-06-10  7:24 UTC (permalink / raw)
  To: Michael S. Tsirkin, Zi Yan
  Cc: David Hildenbrand (Arm), Andrew Morton, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <20260609165829-mutt-send-email-mst@kernel.org>

On 2026/6/10 5:00, Michael S. Tsirkin wrote:
> On Tue, Jun 09, 2026 at 04:54:01PM -0400, Zi Yan wrote:
>> On 9 Jun 2026, at 16:34, Michael S. Tsirkin wrote:
>>
>>> On Tue, Jun 09, 2026 at 02:52:47PM -0400, Zi Yan wrote:
>>>> On 9 Jun 2026, at 14:39, Zi Yan wrote:
>>>>
>>>>> On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
>>>>>
>>>>>> On 6/9/26 20:10, Andrew Morton wrote:
>>>>>>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>>>>
>>>>>>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
>>>>>>>> update to page->flags can race with non-atomic flag operations
>>>>>>>> that run under zone->lock in the buddy allocator.
>>>>>>>>
>>>>>>>> In particular, __free_pages_prepare() does:
>>>>>>>>
>>>>>>>>     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>>>>>>>
>>>>>>>> This non-atomic read-modify-write, while correctly excluding
>>>>>>>> __PG_HWPOISON from the mask, can still lose a concurrent
>>>>>>>> TestSetPageHWPoison if the read happens before the poison bit
>>>>>>>> is set and the write happens after.  Will only get worse if/when
>>>>>>>> we add more non-atomic flag operations.
>>>>>>>>
>>>>>>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
>>>>>>>> around ClearPageHWPoison in the retry path.  This
>>>>>>>> serializes with all buddy flag manipulation.  The cost is
>>>>>>>> negligible: one lock/unlock in an extremely rare path
>>>>>>>> (hardware memory errors).
>>>>>>>>
>>>>>>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
>>>>>>>> in this file operate on pages already removed from the buddy
>>>>>>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
>>>>>>>> need zone->lock protection.
>>>>>>>
>>>>>>> Sashiko is saying this doesn't do anything "Because
>>>>>>> __free_pages_prepare() executes entirely locklessly".  Did it goof?
>>>>>>>
>>>>>>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
>>>>>>
>>>>>> Battle of the bots: it's right.
>>>>>
>>>>> Yep, __free_pages_prepare() changes the page flag without holding
>>>>> zone->lock.
>>>>
>>>> __free_pages_prepare() works on frozen pages and assumes no one else
>>>> touches the input page. To avoid this race, memory_failure() might
>>>> want to try_get_page() before TestClearPageHWPoison(), but I am not
>>>> sure if that works along with memory failure flow.
>>>>
>>>> Best Regards,
>>>> Yan, Zi
>>>
>>>
>>>
>>> Actually memory failure already plays with this down the road no?
>>>
>>> So maybe it's enough to just SetPageHWPoison afterwards again?
>>>
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index ee42d4361309..4758fea94a96 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -2415,6 +2415,7 @@ int memory_failure(unsigned long pfn, int flags)
>>>  	if (!res) {
>>>  		if (is_free_buddy_page(p)) {
>>>  			if (take_page_off_buddy(p)) {
>>> +				SetPageHWPoison(p);
>>>  				page_ref_inc(p);
>>>  				res = MF_RECOVERED;
>>>  			} else {
>>>
>>>
>>> and maybe in a bunch of other places in there?
>>
>> You mean for fear of losing HWPoison flag in the earlier TestSetPageHWPoison(),
>> just set it again here?
> 
> Yea.
> 
>> Why not do it after get_hwpoison_page(), since that
>> is the expected page flag?
> 
> It's still in the buddy at that point right? I'm worried buddy might
> poke at flags.

Since __free_pages_prepare() executes entirely locklessly, the only way to ensure
HWPoison flag won't be lost might be only set hwpoison flag iff we can make sure
pages are not on the way to buddy...

Thanks.
.

^ permalink raw reply

* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Michael S. Tsirkin @ 2026-06-10  7:35 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Zi Yan, David Hildenbrand (Arm), Andrew Morton, linux-kernel,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <dd8d89be-9c73-8ff7-7d44-9705bddeda4d@huawei.com>

On Wed, Jun 10, 2026 at 03:24:30PM +0800, Miaohe Lin wrote:
> On 2026/6/10 5:00, Michael S. Tsirkin wrote:
> > On Tue, Jun 09, 2026 at 04:54:01PM -0400, Zi Yan wrote:
> >> On 9 Jun 2026, at 16:34, Michael S. Tsirkin wrote:
> >>
> >>> On Tue, Jun 09, 2026 at 02:52:47PM -0400, Zi Yan wrote:
> >>>> On 9 Jun 2026, at 14:39, Zi Yan wrote:
> >>>>
> >>>>> On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
> >>>>>
> >>>>>> On 6/9/26 20:10, Andrew Morton wrote:
> >>>>>>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>>>>>>
> >>>>>>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
> >>>>>>>> update to page->flags can race with non-atomic flag operations
> >>>>>>>> that run under zone->lock in the buddy allocator.
> >>>>>>>>
> >>>>>>>> In particular, __free_pages_prepare() does:
> >>>>>>>>
> >>>>>>>>     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >>>>>>>>
> >>>>>>>> This non-atomic read-modify-write, while correctly excluding
> >>>>>>>> __PG_HWPOISON from the mask, can still lose a concurrent
> >>>>>>>> TestSetPageHWPoison if the read happens before the poison bit
> >>>>>>>> is set and the write happens after.  Will only get worse if/when
> >>>>>>>> we add more non-atomic flag operations.
> >>>>>>>>
> >>>>>>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
> >>>>>>>> around ClearPageHWPoison in the retry path.  This
> >>>>>>>> serializes with all buddy flag manipulation.  The cost is
> >>>>>>>> negligible: one lock/unlock in an extremely rare path
> >>>>>>>> (hardware memory errors).
> >>>>>>>>
> >>>>>>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> >>>>>>>> in this file operate on pages already removed from the buddy
> >>>>>>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
> >>>>>>>> need zone->lock protection.
> >>>>>>>
> >>>>>>> Sashiko is saying this doesn't do anything "Because
> >>>>>>> __free_pages_prepare() executes entirely locklessly".  Did it goof?
> >>>>>>>
> >>>>>>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
> >>>>>>
> >>>>>> Battle of the bots: it's right.
> >>>>>
> >>>>> Yep, __free_pages_prepare() changes the page flag without holding
> >>>>> zone->lock.
> >>>>
> >>>> __free_pages_prepare() works on frozen pages and assumes no one else
> >>>> touches the input page. To avoid this race, memory_failure() might
> >>>> want to try_get_page() before TestClearPageHWPoison(), but I am not
> >>>> sure if that works along with memory failure flow.
> >>>>
> >>>> Best Regards,
> >>>> Yan, Zi
> >>>
> >>>
> >>>
> >>> Actually memory failure already plays with this down the road no?
> >>>
> >>> So maybe it's enough to just SetPageHWPoison afterwards again?
> >>>
> >>>
> >>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >>> index ee42d4361309..4758fea94a96 100644
> >>> --- a/mm/memory-failure.c
> >>> +++ b/mm/memory-failure.c
> >>> @@ -2415,6 +2415,7 @@ int memory_failure(unsigned long pfn, int flags)
> >>>  	if (!res) {
> >>>  		if (is_free_buddy_page(p)) {
> >>>  			if (take_page_off_buddy(p)) {
> >>> +				SetPageHWPoison(p);
> >>>  				page_ref_inc(p);
> >>>  				res = MF_RECOVERED;
> >>>  			} else {
> >>>
> >>>
> >>> and maybe in a bunch of other places in there?
> >>
> >> You mean for fear of losing HWPoison flag in the earlier TestSetPageHWPoison(),
> >> just set it again here?
> > 
> > Yea.
> > 
> >> Why not do it after get_hwpoison_page(), since that
> >> is the expected page flag?
> > 
> > It's still in the buddy at that point right? I'm worried buddy might
> > poke at flags.
> 
> Since __free_pages_prepare() executes entirely locklessly, the only way to ensure
> HWPoison flag won't be lost might be only set hwpoison flag iff we can make sure
> pages are not on the way to buddy...
> 
> Thanks.
> .

Right so here after take_page_off_buddy it's ok, for example.

-- 
MST


^ permalink raw reply

* [PATCH net-next] virtio-net: support xsk wake up
From: Menglong Dong @ 2026-06-10  8:16 UTC (permalink / raw)
  To: xuanzhuo
  Cc: mst, jasowang, eperezma, andrew+netdev, davem, edumazet, kuba,
	pabeni, netdev, virtualization, linux-kernel

For now, XDP_RING_NEED_WAKEUP is not supported properly by the virtio-net.
Take the tx path for example, we set xsk_set_tx_need_wakeup() in
virtnet_xsk_xmit(), but we didn't call xsk_clear_tx_need_wakeup()
anywhere, which means the user will call send() for every packet.

For the tx path, we will call xsk_set_tx_need_wakeup() after
virtnet_xsk_xmit_batch() if sq->vq is empty, as we can't be wakeup by the
skb_xmit_done() in this case. Otherwise, we will clear the wakeup flag.

For the rx path, we will call xsk_set_rx_need_wakeup() if we have free
buffers in rq->vq.

Race condition is considered for both rx and tx path.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
---
 drivers/net/virtio_net.c | 52 ++++++++++++++++++++++++++++++++++------
 1 file changed, 45 insertions(+), 7 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 9b3da9f9786c..25e895b849a6 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1323,16 +1323,27 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
 				   struct xsk_buff_pool *pool, gfp_t gfp)
 {
 	struct xdp_buff **xsk_buffs;
+	bool need_wakeup;
 	dma_addr_t addr;
 	int err = 0;
 	u32 len, i;
 	int num;
 
+	need_wakeup = xsk_uses_need_wakeup(pool);
 	xsk_buffs = rq->xsk_buffs;
 
+	/* If both rq->vq and fill ring are empty, and then the user submit
+	 * all the chunks to the fill ring and check the wake up flag
+	 * after xsk_buff_alloc_batch() and before xsk_set_rx_need_wakeup(),
+	 * we will lose the chance to wake up the rx napi, so we have to
+	 * set the need_wakeup flag here.
+	 */
+	if (need_wakeup && virtqueue_get_vring_size(rq->vq) == rq->vq->num_free)
+		xsk_set_rx_need_wakeup(pool);
+
 	num = xsk_buff_alloc_batch(pool, xsk_buffs, rq->vq->num_free);
 	if (!num) {
-		if (xsk_uses_need_wakeup(pool)) {
+		if (need_wakeup) {
 			xsk_set_rx_need_wakeup(pool);
 			/* Return 0 instead of -ENOMEM so that NAPI is
 			 * descheduled.
@@ -1341,8 +1352,6 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
 		}
 
 		return -ENOMEM;
-	} else {
-		xsk_clear_rx_need_wakeup(pool);
 	}
 
 	len = xsk_pool_get_rx_frame_size(pool) + vi->hdr_len;
@@ -1363,6 +1372,16 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
 			goto err;
 	}
 
+	if (need_wakeup) {
+		if (rq->vq->num_free)
+			/* We have free buffers, so we'd better wake up the
+			 * rx napi as soon as possible.
+			 */
+			xsk_set_rx_need_wakeup(pool);
+		else
+			xsk_clear_rx_need_wakeup(pool);
+	}
+
 	return num;
 
 err:
@@ -1440,8 +1459,9 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
 	struct virtnet_info *vi = sq->vq->vdev->priv;
 	struct virtnet_sq_free_stats stats = {};
 	struct net_device *dev = vi->dev;
+	int sent, vring_size;
+	bool need_wakeup;
 	u64 kicks = 0;
-	int sent;
 
 	/* Avoid to wakeup napi meanless, so call __free_old_xmit instead of
 	 * free_old_xmit().
@@ -1451,8 +1471,29 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
 	if (stats.xsk)
 		xsk_tx_completed(sq->xsk_pool, stats.xsk);
 
+	vring_size = virtqueue_get_vring_size(sq->vq);
+	need_wakeup = xsk_uses_need_wakeup(pool);
+	/* If the sq->vq is empty, and the tx ring is empty, and the user
+	 * submit an entry to the tx ring after virtnet_xsk_xmit_batch() and
+	 * before xsk_set_tx_need_wakeup(), we will lose the chance to wake
+	 * up the tx napi, so we have to set the need_wakeup flag here.
+	 */
+	if (need_wakeup && vring_size == sq->vq->num_free)
+		xsk_set_tx_need_wakeup(pool);
+
 	sent = virtnet_xsk_xmit_batch(sq, pool, budget, &kicks);
 
+	if (need_wakeup) {
+		if (vring_size == sq->vq->num_free)
+			/* we can't wake up by ourself, and it should be done
+			 * by the user.
+			 */
+			xsk_set_tx_need_wakeup(pool);
+		else
+			/* we can wake up from skb_xmit_done() */
+			xsk_clear_tx_need_wakeup(pool);
+	}
+
 	if (!is_xdp_raw_buffer_queue(vi, sq - vi->sq))
 		check_sq_full_and_disable(vi, vi->dev, sq);
 
@@ -1470,9 +1511,6 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
 	u64_stats_add(&sq->stats.xdp_tx,  sent);
 	u64_stats_update_end(&sq->stats.syncp);
 
-	if (xsk_uses_need_wakeup(pool))
-		xsk_set_tx_need_wakeup(pool);
-
 	return sent;
 }
 
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH net-next] virtio-net: support xsk wake up
From: Eugenio Perez Martin @ 2026-06-10  8:27 UTC (permalink / raw)
  To: Menglong Dong
  Cc: xuanzhuo, mst, jasowang, andrew+netdev, davem, edumazet, kuba,
	pabeni, netdev, virtualization, linux-kernel
In-Reply-To: <20260610081648.2205711-1-dongml2@chinatelecom.cn>

On Wed, Jun 10, 2026 at 10:17 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
>
> For now, XDP_RING_NEED_WAKEUP is not supported properly by the virtio-net.
> Take the tx path for example, we set xsk_set_tx_need_wakeup() in
> virtnet_xsk_xmit(), but we didn't call xsk_clear_tx_need_wakeup()
> anywhere, which means the user will call send() for every packet.
>
> For the tx path, we will call xsk_set_tx_need_wakeup() after
> virtnet_xsk_xmit_batch() if sq->vq is empty, as we can't be wakeup by the
> skb_xmit_done() in this case. Otherwise, we will clear the wakeup flag.
>
> For the rx path, we will call xsk_set_rx_need_wakeup() if we have free
> buffers in rq->vq.
>
> Race condition is considered for both rx and tx path.
>

I didn't review the code itself, but the patch message lacks a Fixes:
tag. And the From and Signed-off-by emails don't match, which I'm not
sure is valid.

> Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
> ---
>  drivers/net/virtio_net.c | 52 ++++++++++++++++++++++++++++++++++------
>  1 file changed, 45 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 9b3da9f9786c..25e895b849a6 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1323,16 +1323,27 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
>                                    struct xsk_buff_pool *pool, gfp_t gfp)
>  {
>         struct xdp_buff **xsk_buffs;
> +       bool need_wakeup;
>         dma_addr_t addr;
>         int err = 0;
>         u32 len, i;
>         int num;
>
> +       need_wakeup = xsk_uses_need_wakeup(pool);
>         xsk_buffs = rq->xsk_buffs;
>
> +       /* If both rq->vq and fill ring are empty, and then the user submit
> +        * all the chunks to the fill ring and check the wake up flag
> +        * after xsk_buff_alloc_batch() and before xsk_set_rx_need_wakeup(),
> +        * we will lose the chance to wake up the rx napi, so we have to
> +        * set the need_wakeup flag here.
> +        */
> +       if (need_wakeup && virtqueue_get_vring_size(rq->vq) == rq->vq->num_free)
> +               xsk_set_rx_need_wakeup(pool);
> +
>         num = xsk_buff_alloc_batch(pool, xsk_buffs, rq->vq->num_free);
>         if (!num) {
> -               if (xsk_uses_need_wakeup(pool)) {
> +               if (need_wakeup) {
>                         xsk_set_rx_need_wakeup(pool);
>                         /* Return 0 instead of -ENOMEM so that NAPI is
>                          * descheduled.
> @@ -1341,8 +1352,6 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
>                 }
>
>                 return -ENOMEM;
> -       } else {
> -               xsk_clear_rx_need_wakeup(pool);
>         }
>
>         len = xsk_pool_get_rx_frame_size(pool) + vi->hdr_len;
> @@ -1363,6 +1372,16 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
>                         goto err;
>         }
>
> +       if (need_wakeup) {
> +               if (rq->vq->num_free)
> +                       /* We have free buffers, so we'd better wake up the
> +                        * rx napi as soon as possible.
> +                        */
> +                       xsk_set_rx_need_wakeup(pool);
> +               else
> +                       xsk_clear_rx_need_wakeup(pool);
> +       }
> +
>         return num;
>
>  err:
> @@ -1440,8 +1459,9 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
>         struct virtnet_info *vi = sq->vq->vdev->priv;
>         struct virtnet_sq_free_stats stats = {};
>         struct net_device *dev = vi->dev;
> +       int sent, vring_size;
> +       bool need_wakeup;
>         u64 kicks = 0;
> -       int sent;
>
>         /* Avoid to wakeup napi meanless, so call __free_old_xmit instead of
>          * free_old_xmit().
> @@ -1451,8 +1471,29 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
>         if (stats.xsk)
>                 xsk_tx_completed(sq->xsk_pool, stats.xsk);
>
> +       vring_size = virtqueue_get_vring_size(sq->vq);
> +       need_wakeup = xsk_uses_need_wakeup(pool);
> +       /* If the sq->vq is empty, and the tx ring is empty, and the user
> +        * submit an entry to the tx ring after virtnet_xsk_xmit_batch() and
> +        * before xsk_set_tx_need_wakeup(), we will lose the chance to wake
> +        * up the tx napi, so we have to set the need_wakeup flag here.
> +        */
> +       if (need_wakeup && vring_size == sq->vq->num_free)
> +               xsk_set_tx_need_wakeup(pool);
> +
>         sent = virtnet_xsk_xmit_batch(sq, pool, budget, &kicks);
>
> +       if (need_wakeup) {
> +               if (vring_size == sq->vq->num_free)
> +                       /* we can't wake up by ourself, and it should be done
> +                        * by the user.
> +                        */
> +                       xsk_set_tx_need_wakeup(pool);
> +               else
> +                       /* we can wake up from skb_xmit_done() */
> +                       xsk_clear_tx_need_wakeup(pool);
> +       }
> +
>         if (!is_xdp_raw_buffer_queue(vi, sq - vi->sq))
>                 check_sq_full_and_disable(vi, vi->dev, sq);
>
> @@ -1470,9 +1511,6 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
>         u64_stats_add(&sq->stats.xdp_tx,  sent);
>         u64_stats_update_end(&sq->stats.syncp);
>
> -       if (xsk_uses_need_wakeup(pool))
> -               xsk_set_tx_need_wakeup(pool);
> -
>         return sent;
>  }
>
> --
> 2.54.0
>


^ permalink raw reply

* [PATCH v3] vduse: Add suspend
From: Eugenio Pérez @ 2026-06-10  8:34 UTC (permalink / raw)
  To: Michael S . Tsirkin
  Cc: virtualization, Jason Wang, Cindy Lu, Xuan Zhuo,
	Stefano Garzarella, linux-kernel, Laurent Vivier, Yongji Xie,
	Eugenio Pérez, Maxime Coquelin

Implement suspend operation for vduse devices, so vhost-vdpa will offer
that backend feature and userspace can effectively suspend the device.

This is a must before get virtqueue indexes (base) for live migration,
since the device could modify them after userland gets them.

This patch does not implement resume, so VMM resets the whole device
to recover from a live migration failure.  Resume optimization can be
implemented on top of these patches, as other vDPA devices have done in
the past.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
This patch depends on
https://lore.kernel.org/lkml/20260310190759.1097506-1-eperezma@redhat.com

v3:
* Expand the patch message with information about resume operation.

v2:
* Take the rwsem only before the actual kick, not in vduse_vdpa_kick_vq.
  This assures that we're not in a critical section.
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 86 +++++++++++++++++++++++++++++-
 include/uapi/linux/vduse.h         |  4 ++
 2 files changed, 88 insertions(+), 2 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 5b9248d58938..92f8bb4bb282 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -54,7 +54,8 @@
 #define IRQ_UNBOUND -1
 
 /* Supported VDUSE features */
-static const uint64_t vduse_features = BIT_U64(VDUSE_F_QUEUE_READY);
+static const uint64_t vduse_features = BIT_U64(VDUSE_F_QUEUE_READY) |
+				       BIT_U64(VDUSE_F_SUSPEND);
 
 /*
  * VDUSE instance have not asked the vduse API version, so assume 0.
@@ -85,6 +86,7 @@ struct vduse_virtqueue {
 	int irq_effective_cpu;
 	struct cpumask irq_affinity;
 	struct kobject kobj;
+	struct vduse_dev *dev;
 };
 
 struct vduse_dev;
@@ -134,6 +136,7 @@ struct vduse_dev {
 	int minor;
 	bool broken;
 	bool connected;
+	bool suspended;
 	u64 api_version;
 	u64 device_features;
 	u64 driver_features;
@@ -503,6 +506,7 @@ static void vduse_dev_reset(struct vduse_dev *dev)
 
 	down_write(&dev->rwsem);
 
+	dev->suspended = false;
 	dev->status = 0;
 	dev->driver_features = 0;
 	dev->generation++;
@@ -561,6 +565,10 @@ static void vduse_vq_kick(struct vduse_virtqueue *vq)
 	if (!vq->ready)
 		goto unlock;
 
+	guard(rwsem_read)(&vq->dev->rwsem);
+	if (vq->dev->suspended)
+		return;
+
 	if (vq->kickfd)
 		eventfd_signal(vq->kickfd);
 	else
@@ -919,6 +927,27 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
 	return 0;
 }
 
+static int vduse_vdpa_suspend(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_dev_msg msg = { 0 };
+	int ret;
+
+	msg.req.type = VDUSE_SUSPEND;
+
+	ret = vduse_dev_msg_sync(dev, &msg);
+	if (ret == 0) {
+		scoped_guard(rwsem_write, &dev->rwsem)
+			dev->suspended = true;
+
+		cancel_work_sync(&dev->inject);
+		for (u32 i = 0; i < dev->vq_num; i++)
+			cancel_work_sync(&dev->vqs[i]->inject);
+	}
+
+	return ret;
+}
+
 static void vduse_vdpa_free(struct vdpa_device *vdpa)
 {
 	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
@@ -960,6 +989,41 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
 	.free			= vduse_vdpa_free,
 };
 
+static const struct vdpa_config_ops vduse_vdpa_config_ops_with_suspend = {
+	.set_vq_address		= vduse_vdpa_set_vq_address,
+	.kick_vq		= vduse_vdpa_kick_vq,
+	.set_vq_cb		= vduse_vdpa_set_vq_cb,
+	.set_vq_num             = vduse_vdpa_set_vq_num,
+	.get_vq_size		= vduse_vdpa_get_vq_size,
+	.get_vq_group		= vduse_get_vq_group,
+	.set_vq_ready		= vduse_vdpa_set_vq_ready,
+	.get_vq_ready		= vduse_vdpa_get_vq_ready,
+	.set_vq_state		= vduse_vdpa_set_vq_state,
+	.get_vq_state		= vduse_vdpa_get_vq_state,
+	.get_vq_align		= vduse_vdpa_get_vq_align,
+	.get_device_features	= vduse_vdpa_get_device_features,
+	.set_driver_features	= vduse_vdpa_set_driver_features,
+	.get_driver_features	= vduse_vdpa_get_driver_features,
+	.set_config_cb		= vduse_vdpa_set_config_cb,
+	.get_vq_num_max		= vduse_vdpa_get_vq_num_max,
+	.get_device_id		= vduse_vdpa_get_device_id,
+	.get_vendor_id		= vduse_vdpa_get_vendor_id,
+	.get_status		= vduse_vdpa_get_status,
+	.set_status		= vduse_vdpa_set_status,
+	.get_config_size	= vduse_vdpa_get_config_size,
+	.get_config		= vduse_vdpa_get_config,
+	.set_config		= vduse_vdpa_set_config,
+	.get_generation		= vduse_vdpa_get_generation,
+	.set_vq_affinity	= vduse_vdpa_set_vq_affinity,
+	.get_vq_affinity	= vduse_vdpa_get_vq_affinity,
+	.reset			= vduse_vdpa_reset,
+	.set_map		= vduse_vdpa_set_map,
+	.set_group_asid		= vduse_set_group_asid,
+	.get_vq_map		= vduse_get_vq_map,
+	.suspend		= vduse_vdpa_suspend,
+	.free			= vduse_vdpa_free,
+};
+
 static void vduse_dev_sync_single_for_device(union virtio_map token,
 					     dma_addr_t dma_addr, size_t size,
 					     enum dma_data_direction dir)
@@ -1171,6 +1235,10 @@ static void vduse_dev_irq_inject(struct work_struct *work)
 {
 	struct vduse_dev *dev = container_of(work, struct vduse_dev, inject);
 
+	guard(rwsem_read)(&dev->rwsem);
+	if (dev->suspended)
+		return;
+
 	spin_lock_bh(&dev->irq_lock);
 	if (dev->config_cb.callback)
 		dev->config_cb.callback(dev->config_cb.private);
@@ -1182,6 +1250,10 @@ static void vduse_vq_irq_inject(struct work_struct *work)
 	struct vduse_virtqueue *vq = container_of(work,
 					struct vduse_virtqueue, inject);
 
+	guard(rwsem_read)(&vq->dev->rwsem);
+	if (vq->dev->suspended)
+		return;
+
 	spin_lock_bh(&vq->irq_lock);
 	if (vq->ready && vq->cb.callback)
 		vq->cb.callback(vq->cb.private);
@@ -1212,6 +1284,9 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
 	int ret = -EINVAL;
 
 	down_read(&dev->rwsem);
+	if (dev->suspended)
+		return ret;
+
 	if (!(dev->status & VIRTIO_CONFIG_S_DRIVER_OK))
 		goto unlock;
 
@@ -1976,6 +2051,7 @@ static int vduse_dev_init_vqs(struct vduse_dev *dev, u32 vq_align, u32 vq_num)
 		}
 
 		dev->vqs[i]->index = i;
+		dev->vqs[i]->dev = dev;
 		dev->vqs[i]->irq_effective_cpu = IRQ_UNBOUND;
 		INIT_WORK(&dev->vqs[i]->inject, vduse_vq_irq_inject);
 		INIT_WORK(&dev->vqs[i]->kick, vduse_vq_kick_work);
@@ -2426,12 +2502,18 @@ static struct vduse_mgmt_dev *vduse_mgmt;
 static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
 {
 	struct vduse_vdpa *vdev;
+	const struct vdpa_config_ops *ops;
 
 	if (dev->vdev)
 		return -EEXIST;
 
+	if (dev->vduse_features & BIT_U64(VDUSE_F_SUSPEND))
+		ops = &vduse_vdpa_config_ops_with_suspend;
+	else
+		ops = &vduse_vdpa_config_ops;
+
 	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
-				 &vduse_vdpa_config_ops, &vduse_map_ops,
+				 ops, &vduse_map_ops,
 				 dev->ngroups, dev->nas, name, true);
 	if (IS_ERR(vdev))
 		return PTR_ERR(vdev);
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index 7324faea5df4..8c616895c511 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -17,6 +17,9 @@
 /* The VDUSE instance expects a request for vq ready */
 #define VDUSE_F_QUEUE_READY	0
 
+/* The VDUSE instance expects a request for suspend */
+#define VDUSE_F_SUSPEND		1
+
 /*
  * Get the version of VDUSE API that kernel supported (VDUSE_API_VERSION).
  * This is used for future extension.
@@ -334,6 +337,7 @@ enum vduse_req_type {
 	VDUSE_UPDATE_IOTLB,
 	VDUSE_SET_VQ_GROUP_ASID,
 	VDUSE_SET_VQ_READY,
+	VDUSE_SUSPEND,
 };
 
 /**
-- 
2.54.0


^ permalink raw reply related

* [PATCH net-next 0/2] vsock: fold acceptq accounting into core helpers
From: Raf Dickson @ 2026-06-10  9:11 UTC (permalink / raw)
  To: netdev, virtualization
  Cc: pabeni, sgarzare, stefanha, bryan-bt.tan, vishnu.dasa,
	bcm-kernel-feedback-list, Raf Dickson

These two patches follow up on commit c05fa14db43e
("vsock/vmci: fix sk_ack_backlog leak on failed handshake")
by folding sk_acceptq_added() and sk_acceptq_removed() into
vsock_enqueue_accept() and vsock_remove_pending() respectively,
as suggested by Paolo Abeni and Stefano Garzarella.

This makes the accounting automatic and prevents future callers
from forgetting it.

Raf Dickson (2):
  vsock: fold sk_acceptq_added() into vsock_enqueue_accept()
  vsock: fold sk_acceptq_removed() into vsock_remove_pending()

 net/vmw_vsock/af_vsock.c                | 4 ++--
 net/vmw_vsock/hyperv_transport.c        | 1 -
 net/vmw_vsock/virtio_transport_common.c | 1 -
 net/vmw_vsock/vmci_transport.c          | 6 +-----
 4 files changed, 3 insertions(+), 9 deletions(-)

-- 
2.54.0

^ permalink raw reply

* [PATCH net-next 1/2] vsock: fold sk_acceptq_added() into vsock_enqueue_accept()
From: Raf Dickson @ 2026-06-10  9:11 UTC (permalink / raw)
  To: netdev, virtualization
  Cc: pabeni, sgarzare, stefanha, bryan-bt.tan, vishnu.dasa,
	bcm-kernel-feedback-list, Raf Dickson
In-Reply-To: <20260610091121.213324-1-rafdog35@gmail.com>

All three transports (vmci, virtio, hyperv) call sk_acceptq_added()
immediately before vsock_enqueue_accept(). Move the call into
vsock_enqueue_accept() itself so callers cannot forget it and the
accounting is always consistent.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>

Signed-off-by: Raf Dickson <rafdog35@gmail.com>
---
 net/vmw_vsock/af_vsock.c                | 1 +
 net/vmw_vsock/hyperv_transport.c        | 1 -
 net/vmw_vsock/virtio_transport_common.c | 1 -
 net/vmw_vsock/vmci_transport.c          | 1 -
 4 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 2ce1063d4a..73e6416ee9 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -507,6 +507,7 @@ void vsock_enqueue_accept(struct sock *listener, struct sock *connected)
 	sock_hold(connected);
 	sock_hold(listener);
 	list_add_tail(&vconnected->accept_queue, &vlistener->accept_queue);
+	sk_acceptq_added(listener);
 }
 EXPORT_SYMBOL_GPL(vsock_enqueue_accept);
 
diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index b3394946b2..0de8148877 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -410,7 +410,6 @@ static void hvs_open_connection(struct vmbus_channel *chan)
 
 	if (conn_from_host) {
 		new->sk_state = TCP_ESTABLISHED;
-		sk_acceptq_added(sk);
 
 		hvs_new->vm_srv_id = *if_type;
 		hvs_new->host_srv_id = *if_instance;
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index b10666937c..4a39d48db9 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1582,7 +1582,6 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
 		return ret;
 	}
 
-	sk_acceptq_added(sk);
 	if (virtio_transport_space_update(child, skb))
 		child->sk_write_space(child);
 
diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index 91516488a7..4ce6660c11 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -1109,7 +1109,6 @@ static int vmci_transport_recv_listen(struct sock *sk,
 	}
 
 	vsock_add_pending(sk, pending);
-	sk_acceptq_added(sk);
 
 	pending->sk_state = TCP_SYN_SENT;
 	vmci_trans(vpending)->produce_size =
-- 
2.54.0


^ permalink raw reply related

* [PATCH net-next 2/2] vsock: fold sk_acceptq_removed() into vsock_remove_pending()
From: Raf Dickson @ 2026-06-10  9:11 UTC (permalink / raw)
  To: netdev, virtualization
  Cc: pabeni, sgarzare, stefanha, bryan-bt.tan, vishnu.dasa,
	bcm-kernel-feedback-list, Raf Dickson
In-Reply-To: <20260610091121.213324-1-rafdog35@gmail.com>

Callers of vsock_remove_pending() must also call sk_acceptq_removed()
to keep sk_ack_backlog consistent. Move the call into
vsock_remove_pending() itself to make it automatic and prevent future
callers from forgetting it.

Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Raf Dickson <rafdog35@gmail.com>
---
 net/vmw_vsock/af_vsock.c       | 3 +--
 net/vmw_vsock/vmci_transport.c | 5 +----
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 73e6416ee9..b9772a0205 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -493,6 +493,7 @@ void vsock_remove_pending(struct sock *listener, struct sock *pending)
 	list_del_init(&vpending->pending_links);
 	sock_put(listener);
 	sock_put(pending);
+	sk_acceptq_removed(listener);
 }
 EXPORT_SYMBOL_GPL(vsock_remove_pending);
 
@@ -761,8 +762,6 @@ static void vsock_pending_work(struct work_struct *work)
 
 	if (vsock_is_pending(sk)) {
 		vsock_remove_pending(listener, sk);
-
-		sk_acceptq_removed(listener);
 	} else if (!vsk->rejected) {
 		/* We are not on the pending list and accept() did not reject
 		 * us, so we must have been accepted by our user process.  We
diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index 4ce6660c11..356bebca7c 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -980,11 +980,8 @@ static int vmci_transport_recv_listen(struct sock *sk,
 			err = -EINVAL;
 		}
 
-		if (err < 0) {
+		if (err < 0)
 			vsock_remove_pending(sk, pending);
-			sk_acceptq_removed(sk);
-		}
-
 		release_sock(pending);
 		vmci_transport_release_pending(pending);
 
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH net-next] virtio-net: support xsk wake up
From: Menglong Dong @ 2026-06-10  9:15 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: xuanzhuo, mst, jasowang, andrew+netdev, davem, edumazet, kuba,
	pabeni, netdev, virtualization, linux-kernel
In-Reply-To: <CAJaqyWeYiruNosJsMTh2jQ=XCEcPg7956aqeRRpDSyynfpjNZA@mail.gmail.com>

On Wed, Jun 10, 2026 at 4:28 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Wed, Jun 10, 2026 at 10:17 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> >
> > For now, XDP_RING_NEED_WAKEUP is not supported properly by the virtio-net.
> > Take the tx path for example, we set xsk_set_tx_need_wakeup() in
> > virtnet_xsk_xmit(), but we didn't call xsk_clear_tx_need_wakeup()
> > anywhere, which means the user will call send() for every packet.
> >
> > For the tx path, we will call xsk_set_tx_need_wakeup() after
> > virtnet_xsk_xmit_batch() if sq->vq is empty, as we can't be wakeup by the
> > skb_xmit_done() in this case. Otherwise, we will clear the wakeup flag.
> >
> > For the rx path, we will call xsk_set_rx_need_wakeup() if we have free
> > buffers in rq->vq.
> >
> > Race condition is considered for both rx and tx path.
> >
>
> I didn't review the code itself, but the patch message lacks a Fixes:

Yeah, you are right, it's better to use the Fix tag here, and I'll
add it in the next version.

> tag. And the From and Signed-off-by emails don't match, which I'm not
> sure is valid.

Sorry about that. I use my company email in the SOB. I used
my company email for the From too. However, my company
email system can trigger "Delivery failure" messages sometimes,
and Alexei said it is annoying. So I didn't use it for the From
and removed it from the CC since then :/

Thanks!
Menglong Dong

>
> > Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
> > ---
> >  drivers/net/virtio_net.c | 52 ++++++++++++++++++++++++++++++++++------
> >  1 file changed, 45 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 9b3da9f9786c..25e895b849a6 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
[...]
> >
> > --
> > 2.54.0
> >
>

^ permalink raw reply

* [PATCH] 9p/trans_virtio: bound mount_tag show copy to one page
From: Michael Bommarito @ 2026-06-10 11:42 UTC (permalink / raw)
  To: Eric Van Hensbergen, Latchesar Ionkov, Dominique Martinet,
	Christian Schoenebeck
  Cc: v9fs, virtualization, linux-kernel, stable

p9_mount_tag_show() copies strlen(chan->tag) + 1 bytes into the
single-page buffer the sysfs core provides, with no upper bound. The
mount tag length comes from virtio_9p_config.tag_len, a 16-bit field read
from the device at probe in p9_virtio_probe() with no cap. Under the
confidential-computing threat model, where the guest does not trust the
host, a malicious or compromised host can present a 65535-byte tag with
no embedded NUL. A read of the world-readable /sys/.../mount_tag
attribute (udev reads it at probe) then copies ~64 KiB into the 4 KiB
sysfs page, a slab-out-of-bounds write of host-controlled content.

Bound the copy to the page size in the show handler.

Fixes: 179a5bc4b8cb ("net/9p: use memcpy() instead of snprintf() in p9_mount_tag_show()")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
---
 net/9p/trans_virtio.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 4cdab7094b273..b62aa7b309f1c 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -573,7 +573,11 @@ static ssize_t p9_mount_tag_show(struct device *dev,
 	chan = vdev->priv;
 	tag_len = strlen(chan->tag);

-	memcpy(buf, chan->tag, tag_len + 1);
+	if (tag_len > PAGE_SIZE - 2)
+		tag_len = PAGE_SIZE - 2;
+
+	memcpy(buf, chan->tag, tag_len);
+	buf[tag_len] = '\0';

 	return tag_len + 1;
 }
-- 
2.53.0

^ permalink raw reply related

* Re: [PATCH net] virtio_net: do not allow tunnel csum offload for non GSO packets
From: Gabriel Goller @ 2026-06-10 12:19 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, Michael S. Tsirkin, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, virtualization, Willem de Bruijn
In-Reply-To: <38d33d63ce5d1aee486bd6c7c47345040d526e35.1781016169.git.pabeni@redhat.com>

On Tue, 09 Jun 2026 16:44:26 +0200, Paolo Abeni <pabeni@redhat.com> wrote:
> [...]
> 
> Fixes: 56a06bd40fab ("virtio_net: enable gso over UDP tunnel support.")
> Reported-by: Fiona Ebner <f.ebner@proxmox.com>
> Closes: https://bugzilla.proxmox.com/show_bug.cgi?id=7627
> Tested-by: Fiona Ebner <f.ebner@proxmox.com>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Gave it a spin and it works alright, so consider:

>
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f4adcfee7a80..07b8710639f9 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -6222,6 +6222,18 @@ static void virtnet_free_irq_moder(struct virtnet_info *vi)
>  	rtnl_unlock();
>  }
>  
> +static netdev_features_t virtnet_features_check(struct sk_buff *skb,
> +						struct net_device *dev,
> +						netdev_features_t features)
> +{
> +	/* Inner csum offload is only available for GSO packets. */
> +	if (skb->encapsulation && !skb_is_gso(skb))

A small question -- should we maybe check for skb_gso_ok here as well?
So add:

	(!skb_is_gso(skb) || !skb_gso_ok(skb, features)))

Because skb_is_gso alone doesn't guarantee that the packets leaving virtio will
be gso'd, they could be software gso'd at validate_xmit_skb, which is called
after ndo_feature_check.
leaving the virtio device.

Not sure if this can happen though.

> @@ -6235,7 +6247,7 @@ static const struct net_device_ops virtnet_netdev = {
>  	.ndo_bpf		= virtnet_xdp,
>  	.ndo_xdp_xmit		= virtnet_xdp_xmit,
>  	.ndo_xsk_wakeup         = virtnet_xsk_wakeup,
> -	.ndo_features_check	= passthru_features_check,
> +	.ndo_features_check	= virtnet_features_check,
>  	.ndo_get_phys_port_name	= virtnet_get_phys_port_name,
>  	.ndo_set_features	= virtnet_set_features,
>  	.ndo_tx_timeout		= virtnet_tx_timeout,

Thanks,
Gabriel

Tested-by: Gabriel Goller <g.goller@proxmox.com>
So: packet is a GSO packet but will be segmented by validate_xmit_skb before

So: packet is a GSO packet but will be segmented by validate_xmit_skb before

-- 
Gabriel Goller <g.goller@proxmox.com>


^ permalink raw reply

* Re: [PATCH net] virtio_net: do not allow tunnel csum offload for non GSO packets
From: Gabriel Goller @ 2026-06-10 12:26 UTC (permalink / raw)
  To: Gabriel Goller
  Cc: Paolo Abeni, netdev, Michael S. Tsirkin, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, virtualization, Willem de Bruijn
In-Reply-To: <178109396600.129329.5073003425673349392.b4-review@b4>

On 2026-06-10 14:19:26+02:00, Gabriel Goller wrote:
> On Tue, 09 Jun 2026 16:44:26 +0200, Paolo Abeni <pabeni@redhat.com> wrote:
> 
> > [...]
> > 
> > Fixes: 56a06bd40fab ("virtio_net: enable gso over UDP tunnel support.")
> > Reported-by: Fiona Ebner <f.ebner@proxmox.com>
> > Closes: https://bugzilla.proxmox.com/show_bug.cgi?id=7627
> > Tested-by: Fiona Ebner <f.ebner@proxmox.com>
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> 
> Gave it a spin and it works alright, so consider:

Tested-by: Gabriel Goller <g.goller@proxmox.com>
Missed this :(

> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index f4adcfee7a80..07b8710639f9 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -6222,6 +6222,18 @@ static void virtnet_free_irq_moder(struct virtnet_info *vi)
> >  	rtnl_unlock();
> >  }
> >  
> > +static netdev_features_t virtnet_features_check(struct sk_buff *skb,
> > +						struct net_device *dev,
> > +						netdev_features_t features)
> > +{
> > +	/* Inner csum offload is only available for GSO packets. */
> > +	if (skb->encapsulation && !skb_is_gso(skb))
> 
> A small question -- should we maybe check for skb_gso_ok here as well?
> So add:
> 
> 	(!skb_is_gso(skb) || !skb_gso_ok(skb, features)))
> 
> Because skb_is_gso alone doesn't guarantee that the packets leaving virtio will
> be gso'd, they could be software gso'd at validate_xmit_skb, which is called
> after ndo_feature_check.
> leaving the virtio device.

This is supposed to say:

So packet is a GSO packet but will be segmented by validate_xmit_skb before
leaving the virtio device.

b4 thought "So:" is a git trailer :(

> Not sure if this can happen though.
> 
> > @@ -6235,7 +6247,7 @@ static const struct net_device_ops virtnet_netdev = {
> >  	.ndo_bpf		= virtnet_xdp,
> >  	.ndo_xdp_xmit		= virtnet_xdp_xmit,
> >  	.ndo_xsk_wakeup         = virtnet_xsk_wakeup,
> > -	.ndo_features_check	= passthru_features_check,
> > +	.ndo_features_check	= virtnet_features_check,
> >  	.ndo_get_phys_port_name	= virtnet_get_phys_port_name,
> >  	.ndo_set_features	= virtnet_set_features,
> >  	.ndo_tx_timeout		= virtnet_tx_timeout,
> 
> Thanks,
> Gabriel

Sorry for the noise.



^ permalink raw reply

* [PATCH v2] mm: page_reporting: allow driver to set batch capacity
From: Michael S. Tsirkin @ 2026-06-10 13:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Miaohe Lin, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Perez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi,
	Alexander Duyck

At the moment, if a virtio balloon device has a page reporting vq but
its size is < PAGE_REPORTING_CAPACITY (32), the balloon driver fails
probe.

But, there's no way for host to know this value, so it can easily
create a smaller vq and suddenly adding the reporting capability
to the device makes all of the driver fail. Not pretty.

Add a capacity field to page_reporting_dev_info so drivers can
control the maximum number of pages per report batch.

In virtio-balloon, set the capacity to the reporting virtqueue size,
letting page_reporting adapt to whatever the device provides.

Capacity need not be a power of two.  Code previously called out
division by PAGE_REPORTING_CAPACITY as cheap since it was a power
of 2, but no performance difference was observed with non-power-of-2
values.

If capacity is 0 or exceeds PAGE_REPORTING_CAPACITY, it defaults
to PAGE_REPORTING_CAPACITY.  The 0 check and the clamping is done in
page_reporting_register(), before the reporting work is scheduled,
so we never get division by 0.

Fixes: b0c504f15471 ("virtio-balloon: add support for providing free page reports to host")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
Changes v1->v2:
- Document capacity=0 as default in commit log
- Document that capacity need not be a power of two
- Drop unnecessary comment about integer division cost
- Update comment on capacity field: "0 (default) means PAGE_REPORTING_CAPACITY"

 drivers/virtio/virtio_balloon.c |  5 +----
 include/linux/page_reporting.h  |  3 +++
 mm/page_reporting.c             | 24 ++++++++++++------------
 3 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f6c2dff33f8a..6a1a610c2cb1 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -1017,10 +1017,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		unsigned int capacity;
 
 		capacity = virtqueue_get_vring_size(vb->reporting_vq);
-		if (capacity < PAGE_REPORTING_CAPACITY) {
-			err = -ENOSPC;
-			goto out_unregister_oom;
-		}
 
 		vb->pr_dev_info.order = PAGE_REPORTING_ORDER_UNSPECIFIED;
 
@@ -1041,6 +1037,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		vb->pr_dev_info.order = 5;
 #endif
 
+		vb->pr_dev_info.capacity = capacity;
 		err = page_reporting_register(&vb->pr_dev_info);
 		if (err)
 			goto out_unregister_oom;
diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
index 9d4ca5c218a0..048578118a4b 100644
--- a/include/linux/page_reporting.h
+++ b/include/linux/page_reporting.h
@@ -22,6 +22,9 @@ struct page_reporting_dev_info {
 
 	/* Minimal order of page reporting */
 	unsigned int order;
+
+	/* Max pages per report batch; 0 (default) means PAGE_REPORTING_CAPACITY */
+	unsigned int capacity;
 };
 
 /* Tear-down and bring-up for page reporting devices */
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 7418f2e500bb..942e84b6908a 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -173,11 +173,8 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
 	 * any pages that may have already been present from the previous
 	 * list processed. This should result in us reporting all pages on
 	 * an idle system in about 30 seconds.
-	 *
-	 * The division here should be cheap since PAGE_REPORTING_CAPACITY
-	 * should always be a power of 2.
 	 */
-	budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
+	budget = DIV_ROUND_UP(area->nr_free, prdev->capacity * 16);
 
 	/* loop through free list adding unreported pages to sg list */
 	list_for_each_entry_safe(page, next, list, lru) {
@@ -222,10 +219,10 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
 		spin_unlock_irq(&zone->lock);
 
 		/* begin processing pages in local list */
-		err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
+		err = prdev->report(prdev, sgl, prdev->capacity);
 
 		/* reset offset since the full list was reported */
-		*offset = PAGE_REPORTING_CAPACITY;
+		*offset = prdev->capacity;
 
 		/* update budget to reflect call to report function */
 		budget--;
@@ -234,7 +231,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
 		spin_lock_irq(&zone->lock);
 
 		/* flush reported pages from the sg list */
-		page_reporting_drain(prdev, sgl, PAGE_REPORTING_CAPACITY, !err);
+		page_reporting_drain(prdev, sgl, prdev->capacity, !err);
 
 		/*
 		 * Reset next to first entry, the old next isn't valid
@@ -260,13 +257,13 @@ static int
 page_reporting_process_zone(struct page_reporting_dev_info *prdev,
 			    struct scatterlist *sgl, struct zone *zone)
 {
-	unsigned int order, mt, leftover, offset = PAGE_REPORTING_CAPACITY;
+	unsigned int order, mt, leftover, offset = prdev->capacity;
 	unsigned long watermark;
 	int err = 0;
 
 	/* Generate minimum watermark to be able to guarantee progress */
 	watermark = low_wmark_pages(zone) +
-		    (PAGE_REPORTING_CAPACITY << page_reporting_order);
+		    (prdev->capacity << page_reporting_order);
 
 	/*
 	 * Cancel request if insufficient free memory or if we failed
@@ -290,7 +287,7 @@ page_reporting_process_zone(struct page_reporting_dev_info *prdev,
 	}
 
 	/* report the leftover pages before going idle */
-	leftover = PAGE_REPORTING_CAPACITY - offset;
+	leftover = prdev->capacity - offset;
 	if (leftover) {
 		sgl = &sgl[offset];
 		err = prdev->report(prdev, sgl, leftover);
@@ -322,11 +319,11 @@ static void page_reporting_process(struct work_struct *work)
 	atomic_set(&prdev->state, state);
 
 	/* allocate scatterlist to store pages being reported on */
-	sgl = kmalloc_objs(*sgl, PAGE_REPORTING_CAPACITY);
+	sgl = kmalloc_objs(*sgl, prdev->capacity);
 	if (!sgl)
 		goto err_out;
 
-	sg_init_table(sgl, PAGE_REPORTING_CAPACITY);
+	sg_init_table(sgl, prdev->capacity);
 
 	for_each_zone(zone) {
 		err = page_reporting_process_zone(prdev, sgl, zone);
@@ -377,6 +374,9 @@ int page_reporting_register(struct page_reporting_dev_info *prdev)
 			page_reporting_order = pageblock_order;
 	}
 
+	if (!prdev->capacity || prdev->capacity > PAGE_REPORTING_CAPACITY)
+		prdev->capacity = PAGE_REPORTING_CAPACITY;
+
 	/* initialize state and work structures */
 	atomic_set(&prdev->state, PAGE_REPORTING_IDLE);
 	INIT_DELAYED_WORK(&prdev->work, &page_reporting_process);
-- 
MST


^ permalink raw reply related

* [PATCH v4] i2c: virtio: retain xfer with kref to fix UAF on interrupted wait
From: Gavin Li @ 2026-06-10 15:02 UTC (permalink / raw)
  To: linux-i2c, viresh.kumar, Michael S. Tsirkin
  Cc: Chen, Jian Jun, andi.shyti, virtualization, Gavin Li

commit a663b3c47ab1 ("i2c: virtio: Avoid hang by using interruptible
completion wait") switched virtio_i2c_complete_reqs() to
wait_for_completion_interruptible() so a stuck device cannot hang a
task forever. That left a use-after-free: if the wait returns early on
a signal, virtio_i2c_xfer() frees reqs and DMA bounce buffers while the
device may still hold virtqueue tokens pointing at &reqs[i] and DMA
into read buffers. When those requests complete later,
virtio_i2c_msg_done() calls complete() on freed memory.

Waiting uninterruptibly for every completion before freeing avoids the
UAF but can hang the caller indefinitely if the virtio side never
completes the request. Performing a queue reset / device reset is
possible in this scenario, but adds complexity.

This commit manages the freeing of the xfer allocations via kref, and
ensures that each in-flight request holds a reference. This fixes the
use-after-free by ensuring that the virtio device has a valid location
to write to until the request completes. This will cause a memory leak
in cases where the device hangs, but that is much preferable to memory
corruption.

Additionally, force usage of a bounce buffer even if the i2c_msg buf is
DMA-safe, since the buffer passed to the virtqueue must remain valid
even if the transfer is interrupted. Remove usage of
i2c_get_dma_safe_msg_buf() since it may pass through msg->buf directly.
All bounce buffers are part of the single xfer allocation, so there is
no additional allocation overhead.

Signed-off-by: Gavin Li <gavin.li@samsara.com>

---

Changes in v4:
- Pack bounce buffers into a single allocation after reqs[]
- Remove per-request buf pointer and xfer->num
- Remove req.msg, track message direction with req->read
- Simplify xfer release to a single kfree()
- Restore using j to track successful transfers in _complete_xfer()
---
 drivers/i2c/busses/i2c-virtio.c | 121 +++++++++++++++++++++++++-------
 1 file changed, 94 insertions(+), 27 deletions(-)

diff --git a/drivers/i2c/busses/i2c-virtio.c b/drivers/i2c/busses/i2c-virtio.c
index 5da6fef92bec3..33b469ca39d4b 100644
--- a/drivers/i2c/busses/i2c-virtio.c
+++ b/drivers/i2c/busses/i2c-virtio.c
@@ -13,7 +13,9 @@
 #include <linux/err.h>
 #include <linux/i2c.h>
 #include <linux/kernel.h>
+#include <linux/kref.h>
 #include <linux/module.h>
+#include <linux/overflow.h>
 #include <linux/virtio.h>
 #include <linux/virtio_ids.h>
 #include <linux/virtio_config.h>
@@ -31,39 +33,74 @@ struct virtio_i2c {
 	struct virtqueue *vq;
 };
 
+struct virtio_i2c_xfer;
+
 /**
  * struct virtio_i2c_req - the virtio I2C request structure
+ * @xfer: owning transfer
  * @completion: completion of virtio I2C message
+ * @read: true if this is a read message (I2C_M_RD is set)
  * @out_hdr: the OUT header of the virtio I2C message
- * @buf: the buffer into which data is read, or from which it's written
  * @in_hdr: the IN header of the virtio I2C message
  */
 struct virtio_i2c_req {
+	struct virtio_i2c_xfer *xfer;
 	struct completion completion;
+	bool read;
 	struct virtio_i2c_out_hdr out_hdr	____cacheline_aligned;
-	uint8_t *buf				____cacheline_aligned;
 	struct virtio_i2c_in_hdr in_hdr		____cacheline_aligned;
 };
 
+/**
+ * struct virtio_i2c_xfer - a queued I2C transfer
+ * @ref: one ref for the caller, plus one per in-flight virtqueue request
+ * @bounce_buf_base: start of bounce buffer region
+ * @reqs: the virtio I2C requests
+ *
+ * Allocation layout:
+ * - struct virtio_i2c_xfer xfer
+ * - struct virtio_i2c_req reqs[num]
+ * - u8 bounce_buf[msgs[0].len]
+ *   ...
+ * - u8 bounce_buf[msgs[num-1].len]
+ */
+struct virtio_i2c_xfer {
+	struct kref ref;
+	u8 *bounce_buf_base;
+	struct virtio_i2c_req reqs[];
+};
+
+static void virtio_i2c_xfer_release(struct kref *ref)
+{
+	struct virtio_i2c_xfer *xfer = container_of(ref, struct virtio_i2c_xfer, ref);
+	kfree(xfer);
+}
+
 static void virtio_i2c_msg_done(struct virtqueue *vq)
 {
 	struct virtio_i2c_req *req;
 	unsigned int len;
 
-	while ((req = virtqueue_get_buf(vq, &len)))
+	while ((req = virtqueue_get_buf(vq, &len))) {
 		complete(&req->completion);
+		kref_put(&req->xfer->ref, virtio_i2c_xfer_release);
+	}
 }
 
-static int virtio_i2c_prepare_reqs(struct virtqueue *vq,
-				   struct virtio_i2c_req *reqs,
+static int virtio_i2c_prepare_xfer(struct virtqueue *vq,
+				   struct virtio_i2c_xfer *xfer,
 				   struct i2c_msg *msgs, int num)
 {
 	struct scatterlist *sgs[3], out_hdr, msg_buf, in_hdr;
+	struct virtio_i2c_req *reqs = xfer->reqs;
+	u8 *bounce_buf = xfer->bounce_buf_base;
 	int i;
 
 	for (i = 0; i < num; i++) {
 		int outcnt = 0, incnt = 0;
 
+		reqs[i].xfer = xfer;
+		reqs[i].read = !!(msgs[i].flags & I2C_M_RD);
 		init_completion(&reqs[i].completion);
 
 		/*
@@ -82,23 +119,31 @@ static int virtio_i2c_prepare_reqs(struct virtqueue *vq,
 		sgs[outcnt++] = &out_hdr;
 
 		if (msgs[i].len) {
-			reqs[i].buf = i2c_get_dma_safe_msg_buf(&msgs[i], 1);
-			if (!reqs[i].buf)
-				break;
+			/*
+			 * Even if msg->flags has I2C_M_DMA_SAFE set, a bounce
+			 * buffer is required because the transfer may be
+			 * interrupted, after which msg->buf is no longer valid.
+			 */
+			if (!(msgs[i].flags & I2C_M_RD))
+				memcpy(bounce_buf, msgs[i].buf, msgs[i].len);
 
-			sg_init_one(&msg_buf, reqs[i].buf, msgs[i].len);
+			sg_init_one(&msg_buf, bounce_buf, msgs[i].len);
 
 			if (msgs[i].flags & I2C_M_RD)
 				sgs[outcnt + incnt++] = &msg_buf;
 			else
 				sgs[outcnt++] = &msg_buf;
 		}
+		bounce_buf += msgs[i].len;
 
 		sg_init_one(&in_hdr, &reqs[i].in_hdr, sizeof(reqs[i].in_hdr));
 		sgs[outcnt + incnt++] = &in_hdr;
 
+		/* This reference is released in virtio_i2c_msg_done(). */
+		kref_get(&xfer->ref);
+
 		if (virtqueue_add_sgs(vq, sgs, outcnt, incnt, &reqs[i], GFP_KERNEL)) {
-			i2c_put_dma_safe_msg_buf(reqs[i].buf, &msgs[i], false);
+			kref_put(&xfer->ref, virtio_i2c_xfer_release);
 			break;
 		}
 	}
@@ -106,26 +151,38 @@ static int virtio_i2c_prepare_reqs(struct virtqueue *vq,
 	return i;
 }
 
-static int virtio_i2c_complete_reqs(struct virtqueue *vq,
-				    struct virtio_i2c_req *reqs,
-				    struct i2c_msg *msgs, int num)
+static int virtio_i2c_complete_xfer(struct virtio_i2c_xfer *xfer,
+				    struct i2c_msg *msgs,
+				    int num)
 {
+	struct virtio_i2c_req *reqs = xfer->reqs;
+	u8 *bounce_buf = xfer->bounce_buf_base;
 	bool failed = false;
 	int i, j = 0;
 
 	for (i = 0; i < num; i++) {
 		struct virtio_i2c_req *req = &reqs[i];
+		struct i2c_msg *msg = &msgs[i];
+
+		if (wait_for_completion_interruptible(&req->completion))
+			return -EINTR;
+
+		if (req->in_hdr.status != VIRTIO_I2C_MSG_OK) {
+			/*
+			 * Don't break yet. Try to wait until all requests
+			 * complete to ensure that the virtqueue has enough
+			 * descriptor slots for the next transfer.
+			 */
+			failed = true;
+		}
 
 		if (!failed) {
-			if (wait_for_completion_interruptible(&req->completion))
-				failed = true;
-			else if (req->in_hdr.status != VIRTIO_I2C_MSG_OK)
-				failed = true;
-			else
-				j++;
+			if (req->read)
+				memcpy(msg->buf, bounce_buf, msg->len);
+			j++;
 		}
 
-		i2c_put_dma_safe_msg_buf(reqs[i].buf, &msgs[i], !failed);
+		bounce_buf += msg->len;
 	}
 
 	return j;
@@ -136,14 +193,24 @@ static int virtio_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs,
 {
 	struct virtio_i2c *vi = i2c_get_adapdata(adap);
 	struct virtqueue *vq = vi->vq;
-	struct virtio_i2c_req *reqs;
-	int count;
+	struct virtio_i2c_xfer *xfer;
+	size_t alloc_size;
+	int i, count;
 
-	reqs = kzalloc_objs(*reqs, num);
-	if (!reqs)
+	alloc_size = struct_size(xfer, reqs, num);
+	for (i = 0; i < num; i++) {
+		if (check_add_overflow(alloc_size, msgs[i].len, &alloc_size))
+			return -EOVERFLOW;
+	}
+
+	xfer = kzalloc(alloc_size, GFP_KERNEL);
+	if (!xfer)
 		return -ENOMEM;
 
-	count = virtio_i2c_prepare_reqs(vq, reqs, msgs, num);
+	kref_init(&xfer->ref);
+	xfer->bounce_buf_base = (u8 *)(xfer->reqs + num);
+
+	count = virtio_i2c_prepare_xfer(vq, xfer, msgs, num);
 	if (!count)
 		goto err_free;
 
@@ -157,10 +224,10 @@ static int virtio_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs,
 	 */
 	virtqueue_kick(vq);
 
-	count = virtio_i2c_complete_reqs(vq, reqs, msgs, count);
+	count = virtio_i2c_complete_xfer(xfer, msgs, count);
 
 err_free:
-	kfree(reqs);
+	kref_put(&xfer->ref, virtio_i2c_xfer_release);
 	return count;
 }
 
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH v4] i2c: virtio: retain xfer with kref to fix UAF on interrupted wait
From: Michael S. Tsirkin @ 2026-06-10 15:08 UTC (permalink / raw)
  To: Gavin Li
  Cc: linux-i2c, viresh.kumar, Chen, Jian Jun, andi.shyti,
	virtualization
In-Reply-To: <20260610150232.24834-1-gavin.li@samsara.com>

On Wed, Jun 10, 2026 at 11:02:32AM -0400, Gavin Li wrote:
> commit a663b3c47ab1 ("i2c: virtio: Avoid hang by using interruptible
> completion wait") switched virtio_i2c_complete_reqs() to
> wait_for_completion_interruptible() so a stuck device cannot hang a
> task forever. That left a use-after-free: if the wait returns early on
> a signal, virtio_i2c_xfer() frees reqs and DMA bounce buffers while the
> device may still hold virtqueue tokens pointing at &reqs[i] and DMA
> into read buffers. When those requests complete later,
> virtio_i2c_msg_done() calls complete() on freed memory.
> 
> Waiting uninterruptibly for every completion before freeing avoids the
> UAF but can hang the caller indefinitely if the virtio side never
> completes the request. Performing a queue reset / device reset is
> possible in this scenario, but adds complexity.
> 
> This commit manages the freeing of the xfer allocations via kref, and
> ensures that each in-flight request holds a reference. This fixes the
> use-after-free by ensuring that the virtio device has a valid location
> to write to until the request completes. This will cause a memory leak
> in cases where the device hangs, but that is much preferable to memory
> corruption.
> 
> Additionally, force usage of a bounce buffer even if the i2c_msg buf is
> DMA-safe, since the buffer passed to the virtqueue must remain valid
> even if the transfer is interrupted. Remove usage of
> i2c_get_dma_safe_msg_buf() since it may pass through msg->buf directly.
> All bounce buffers are part of the single xfer allocation, so there is
> no additional allocation overhead.
> 
> Signed-off-by: Gavin Li <gavin.li@samsara.com>
> 
> ---
> 
> Changes in v4:
> - Pack bounce buffers into a single allocation after reqs[]
> - Remove per-request buf pointer and xfer->num
> - Remove req.msg, track message direction with req->read
> - Simplify xfer release to a single kfree()
> - Restore using j to track successful transfers in _complete_xfer()
> ---
>  drivers/i2c/busses/i2c-virtio.c | 121 +++++++++++++++++++++++++-------
>  1 file changed, 94 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/i2c/busses/i2c-virtio.c b/drivers/i2c/busses/i2c-virtio.c
> index 5da6fef92bec3..33b469ca39d4b 100644
> --- a/drivers/i2c/busses/i2c-virtio.c
> +++ b/drivers/i2c/busses/i2c-virtio.c
> @@ -13,7 +13,9 @@
>  #include <linux/err.h>
>  #include <linux/i2c.h>
>  #include <linux/kernel.h>
> +#include <linux/kref.h>
>  #include <linux/module.h>
> +#include <linux/overflow.h>
>  #include <linux/virtio.h>
>  #include <linux/virtio_ids.h>
>  #include <linux/virtio_config.h>
> @@ -31,39 +33,74 @@ struct virtio_i2c {
>  	struct virtqueue *vq;
>  };
>  
> +struct virtio_i2c_xfer;
> +
>  /**
>   * struct virtio_i2c_req - the virtio I2C request structure
> + * @xfer: owning transfer
>   * @completion: completion of virtio I2C message
> + * @read: true if this is a read message (I2C_M_RD is set)
>   * @out_hdr: the OUT header of the virtio I2C message
> - * @buf: the buffer into which data is read, or from which it's written
>   * @in_hdr: the IN header of the virtio I2C message
>   */
>  struct virtio_i2c_req {
> +	struct virtio_i2c_xfer *xfer;
>  	struct completion completion;
> +	bool read;
>  	struct virtio_i2c_out_hdr out_hdr	____cacheline_aligned;
> -	uint8_t *buf				____cacheline_aligned;
>  	struct virtio_i2c_in_hdr in_hdr		____cacheline_aligned;
>  };
>  
> +/**
> + * struct virtio_i2c_xfer - a queued I2C transfer
> + * @ref: one ref for the caller, plus one per in-flight virtqueue request
> + * @bounce_buf_base: start of bounce buffer region
> + * @reqs: the virtio I2C requests
> + *
> + * Allocation layout:
> + * - struct virtio_i2c_xfer xfer
> + * - struct virtio_i2c_req reqs[num]
> + * - u8 bounce_buf[msgs[0].len]
> + *   ...
> + * - u8 bounce_buf[msgs[num-1].len]

You seems to be ignoring DMA alignment requirements here.
Did you try running with DMA debugging enabled?


> + */
> +struct virtio_i2c_xfer {
> +	struct kref ref;
> +	u8 *bounce_buf_base;
> +	struct virtio_i2c_req reqs[];
> +};
> +
> +static void virtio_i2c_xfer_release(struct kref *ref)
> +{
> +	struct virtio_i2c_xfer *xfer = container_of(ref, struct virtio_i2c_xfer, ref);
> +	kfree(xfer);
> +}
> +
>  static void virtio_i2c_msg_done(struct virtqueue *vq)
>  {
>  	struct virtio_i2c_req *req;
>  	unsigned int len;
>  
> -	while ((req = virtqueue_get_buf(vq, &len)))
> +	while ((req = virtqueue_get_buf(vq, &len))) {
>  		complete(&req->completion);
> +		kref_put(&req->xfer->ref, virtio_i2c_xfer_release);
> +	}
>  }
>  
> -static int virtio_i2c_prepare_reqs(struct virtqueue *vq,
> -				   struct virtio_i2c_req *reqs,
> +static int virtio_i2c_prepare_xfer(struct virtqueue *vq,
> +				   struct virtio_i2c_xfer *xfer,
>  				   struct i2c_msg *msgs, int num)
>  {
>  	struct scatterlist *sgs[3], out_hdr, msg_buf, in_hdr;
> +	struct virtio_i2c_req *reqs = xfer->reqs;
> +	u8 *bounce_buf = xfer->bounce_buf_base;
>  	int i;
>  
>  	for (i = 0; i < num; i++) {
>  		int outcnt = 0, incnt = 0;
>  
> +		reqs[i].xfer = xfer;
> +		reqs[i].read = !!(msgs[i].flags & I2C_M_RD);
>  		init_completion(&reqs[i].completion);
>  
>  		/*
> @@ -82,23 +119,31 @@ static int virtio_i2c_prepare_reqs(struct virtqueue *vq,
>  		sgs[outcnt++] = &out_hdr;
>  
>  		if (msgs[i].len) {
> -			reqs[i].buf = i2c_get_dma_safe_msg_buf(&msgs[i], 1);
> -			if (!reqs[i].buf)
> -				break;
> +			/*
> +			 * Even if msg->flags has I2C_M_DMA_SAFE set, a bounce
> +			 * buffer is required because the transfer may be
> +			 * interrupted, after which msg->buf is no longer valid.
> +			 */
> +			if (!(msgs[i].flags & I2C_M_RD))
> +				memcpy(bounce_buf, msgs[i].buf, msgs[i].len);
>  
> -			sg_init_one(&msg_buf, reqs[i].buf, msgs[i].len);
> +			sg_init_one(&msg_buf, bounce_buf, msgs[i].len);
>  
>  			if (msgs[i].flags & I2C_M_RD)
>  				sgs[outcnt + incnt++] = &msg_buf;
>  			else
>  				sgs[outcnt++] = &msg_buf;
>  		}
> +		bounce_buf += msgs[i].len;
>  
>  		sg_init_one(&in_hdr, &reqs[i].in_hdr, sizeof(reqs[i].in_hdr));
>  		sgs[outcnt + incnt++] = &in_hdr;
>  
> +		/* This reference is released in virtio_i2c_msg_done(). */
> +		kref_get(&xfer->ref);
> +
>  		if (virtqueue_add_sgs(vq, sgs, outcnt, incnt, &reqs[i], GFP_KERNEL)) {
> -			i2c_put_dma_safe_msg_buf(reqs[i].buf, &msgs[i], false);
> +			kref_put(&xfer->ref, virtio_i2c_xfer_release);
>  			break;
>  		}
>  	}
> @@ -106,26 +151,38 @@ static int virtio_i2c_prepare_reqs(struct virtqueue *vq,
>  	return i;
>  }
>  
> -static int virtio_i2c_complete_reqs(struct virtqueue *vq,
> -				    struct virtio_i2c_req *reqs,
> -				    struct i2c_msg *msgs, int num)
> +static int virtio_i2c_complete_xfer(struct virtio_i2c_xfer *xfer,
> +				    struct i2c_msg *msgs,
> +				    int num)
>  {
> +	struct virtio_i2c_req *reqs = xfer->reqs;
> +	u8 *bounce_buf = xfer->bounce_buf_base;
>  	bool failed = false;
>  	int i, j = 0;
>  
>  	for (i = 0; i < num; i++) {
>  		struct virtio_i2c_req *req = &reqs[i];
> +		struct i2c_msg *msg = &msgs[i];
> +
> +		if (wait_for_completion_interruptible(&req->completion))
> +			return -EINTR;
> +
> +		if (req->in_hdr.status != VIRTIO_I2C_MSG_OK) {
> +			/*
> +			 * Don't break yet. Try to wait until all requests
> +			 * complete to ensure that the virtqueue has enough
> +			 * descriptor slots for the next transfer.
> +			 */
> +			failed = true;
> +		}
>  
>  		if (!failed) {
> -			if (wait_for_completion_interruptible(&req->completion))
> -				failed = true;
> -			else if (req->in_hdr.status != VIRTIO_I2C_MSG_OK)
> -				failed = true;
> -			else
> -				j++;
> +			if (req->read)
> +				memcpy(msg->buf, bounce_buf, msg->len);
> +			j++;
>  		}
>  
> -		i2c_put_dma_safe_msg_buf(reqs[i].buf, &msgs[i], !failed);
> +		bounce_buf += msg->len;
>  	}
>  
>  	return j;
> @@ -136,14 +193,24 @@ static int virtio_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs,
>  {
>  	struct virtio_i2c *vi = i2c_get_adapdata(adap);
>  	struct virtqueue *vq = vi->vq;
> -	struct virtio_i2c_req *reqs;
> -	int count;
> +	struct virtio_i2c_xfer *xfer;
> +	size_t alloc_size;
> +	int i, count;
>  
> -	reqs = kzalloc_objs(*reqs, num);
> -	if (!reqs)
> +	alloc_size = struct_size(xfer, reqs, num);
> +	for (i = 0; i < num; i++) {
> +		if (check_add_overflow(alloc_size, msgs[i].len, &alloc_size))
> +			return -EOVERFLOW;
> +	}
> +
> +	xfer = kzalloc(alloc_size, GFP_KERNEL);
> +	if (!xfer)
>  		return -ENOMEM;
>  
> -	count = virtio_i2c_prepare_reqs(vq, reqs, msgs, num);
> +	kref_init(&xfer->ref);
> +	xfer->bounce_buf_base = (u8 *)(xfer->reqs + num);
> +
> +	count = virtio_i2c_prepare_xfer(vq, xfer, msgs, num);
>  	if (!count)
>  		goto err_free;
>  
> @@ -157,10 +224,10 @@ static int virtio_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs,
>  	 */
>  	virtqueue_kick(vq);
>  
> -	count = virtio_i2c_complete_reqs(vq, reqs, msgs, count);
> +	count = virtio_i2c_complete_xfer(xfer, msgs, count);
>  
>  err_free:
> -	kfree(reqs);
> +	kref_put(&xfer->ref, virtio_i2c_xfer_release);
>  	return count;
>  }
>  
> -- 
> 2.54.0


^ permalink raw reply

* Re: [PATCH v2] mm: page_reporting: allow driver to set batch capacity
From: David Hildenbrand (Arm) @ 2026-06-10 15:17 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Miaohe Lin, Jason Wang, Xuan Zhuo, Eugenio Perez, Muchun Song,
	Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, virtualization, linux-mm, Andrea Arcangeli,
	Naoya Horiguchi, Alexander Duyck
In-Reply-To: <cb43adc61d2ed3069b2fe428f3e051dbdc4cc28d.1781097156.git.mst@redhat.com>

> --- a/include/linux/page_reporting.h
> +++ b/include/linux/page_reporting.h
> @@ -22,6 +22,9 @@ struct page_reporting_dev_info {
>  
>  	/* Minimal order of page reporting */
>  	unsigned int order;
> +
> +	/* Max pages per report batch; 0 (default) means PAGE_REPORTING_CAPACITY */
> +	unsigned int capacity;
>  };


I think PAGE_REPORTING_CAPACITY still documents

"This value should always be a power of 2, see page_reporting_cycle()"

With that:

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David

^ permalink raw reply

* [PATCH v5 01/15] drm/amd/display: Handle struct drm_plane_state.ignore_damage_clips
From: Thomas Zimmermann @ 2026-06-10 15:18 UTC (permalink / raw)
  To: mripard, maarten.lankhorst, airlied, airlied, simona, admin,
	gargaditya08, paul, jani.nikula, mhklkml, zack.rusin,
	bcm-kernel-feedback-list, harry.wentland, sunpeng.li, siqueira,
	alexander.deucher, rodrigo.vivi, joonas.lahtinen, tursulin,
	javierm, dmitry.osipenko, gurchetansingh, olvaffe
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, amd-gfx, Thomas Zimmermann, Zack Rusin, stable
In-Reply-To: <20260610152505.260172-1-tzimmermann@suse.de>

The mode-setting pipeline can disabled damage clippings for a commit
by setting ignore_damage_clips in struct drm_plane_state. The commit
will then do a full display update.

Test the flag in DCN code and do a full update in DCN code if it has
been set.

Commit 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers
to ignore damage clips") introduced ignore_damage_clips to selectively
ignore damage clipping in certain framebuffer changes. This driver does
not do that, but DRM's damage iterator will soon rely on the flag.
Therefore supporting it here as well make sense for consistency.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Fixes: 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers to ignore damage clips")
Cc: Javier Martinez Canillas <javierm@redhat.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Zack Rusin <zackr@vmware.com>
Cc: dri-devel@lists.freedesktop.org
Cc: <stable@vger.kernel.org> # v6.8+
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 0e20194e6662..4cbb27f65a0b 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -6614,8 +6614,8 @@ static void fill_dc_dirty_rects(struct drm_plane *plane,
 {
 	struct dm_crtc_state *dm_crtc_state = to_dm_crtc_state(crtc_state);
 	struct rect *dirty_rects = flip_addrs->dirty_rects;
-	u32 num_clips;
-	struct drm_mode_rect *clips;
+	u32 num_clips = 0;
+	struct drm_mode_rect *clips = NULL;
 	bool bb_changed;
 	bool fb_changed;
 	u32 i = 0;
@@ -6631,8 +6631,10 @@ static void fill_dc_dirty_rects(struct drm_plane *plane,
 	if (new_plane_state->rotation != DRM_MODE_ROTATE_0)
 		goto ffu;
 
-	num_clips = drm_plane_get_damage_clips_count(new_plane_state);
-	clips = drm_plane_get_damage_clips(new_plane_state);
+	if (!new_plane_state->ignore_damage_clips) {
+		num_clips = drm_plane_get_damage_clips_count(new_plane_state);
+		clips = drm_plane_get_damage_clips(new_plane_state);
+	}
 
 	if (num_clips && (!amdgpu_damage_clips || (amdgpu_damage_clips < 0 &&
 						   is_psr_su)))
-- 
2.54.0


^ permalink raw reply related

* [PATCH v5 00/15] drm: Improve logic behind damage handling
From: Thomas Zimmermann @ 2026-06-10 15:18 UTC (permalink / raw)
  To: mripard, maarten.lankhorst, airlied, airlied, simona, admin,
	gargaditya08, paul, jani.nikula, mhklkml, zack.rusin,
	bcm-kernel-feedback-list, harry.wentland, sunpeng.li, siqueira,
	alexander.deucher, rodrigo.vivi, joonas.lahtinen, tursulin,
	javierm, dmitry.osipenko, gurchetansingh, olvaffe
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, amd-gfx, Thomas Zimmermann

DRM clients can supply information on framebuffer areas to update as
part of each page flip, called damage-clipping rectangles. But DRM's
processing of this information is inconsistent and prone to errors.

- There are multiple fields and tests that decide if damage clips
should be taken or ignored.

- Sometimes damage clips are removed behind the back of the DRM client.

- Atomic helpers evaluate damage clipping in the middle of the atomic
check: after connectors and encoders, but before planes and CRTCs. Hence
pipeline stages have an inconsistent view.

- Which leads to drivers (ingenic) doing a re-evaluation if necessary.

- Tests of plane source coordinates only happen during commits. At this
point, the driver should already know if damage clips are to be taken or
not. Because of this, some drivers (appletbdrm) might operate on incorrect
damage information for their internal workings. This also leads to excessive
use of the old plane state.

Therefore go through DRM helpers and drivers and fix the logic.

- Run all of the atomic checks with the damage information supplied by
DRM clients. Afterwards evaluate plane and CRTC states on whether to
take or ignore damage clips. Do all related tests in a single atomic
helper.

- Do not discard damage clips. Set ignore_damage_clips in struct
drm_plane_state instead. This includes changes to plane source-coordinates.
The damage iterator now only has to look at this flag to detect if it
should use the damage clips. 

- Go over drivers and fix the damage handling in the plane's
atomic_update helpers. Most drivers no longer need the old plane state
in their update.

- The appletbdrm driver requires a fix in how it uses damage information.
Ingenic and vmwgfx can be simplified. These changes improve the drivers'
code organization.

- Add support for ignore_damage_clips to various drivers that ignored it
until now.

- Kunit tests require some changes. Drop some obsolete tests and add a new
one for ignore_damage_flags.

Tested with bochs, mgag200, Kunit tests.

v5:
- support ignore_damage_clips in amdgpu, i915, virtgpu, vmwgfx
- reorder patches to avoid possible regressions during the series
- fix clearing ignore_damage_clips in a separate patch (Javier)
v4:
- reorder patches to avoid error-prone intermediate state
v3:
- fix error path in appletbdrm
v2:
- rebase on latest upstream

Thomas Zimmermann (15):
  drm/amd/display: Handle struct drm_plane_state.ignore_damage_clips
  drm/i915/display: Handle struct drm_plane_state.ignore_damage_clips
  drm/vboxvideo: Handle struct drm_plane_state.ignore_damage_clips
  drm/vmwgfx: Handle struct drm_plane_state.ignore_damage_clips
  drm/appletbdrm: Allocate request/response buffers in begin_fb_access
  drm/damage-helper: Clear ignore_damage_clips in plane-state
    duplication
  drm/damage-helper: Do not alter damage clips on modeset, but ignore
    them
  drm/atomic-helpers: Evaluate plane damage after atomic_check
  drm/ingenic: Remove calls to drm_atomic_helper_check_plane_damage()
  drm/atomic_helper: Do not evaluate plane damage before atomic_check
  drm/damage-helper: Test src coord in
    drm_atomic_helper_check_plane_damage()
  drm/damage-helper: Remove old state from
    drm_atomic_helper_damage_iter_init()
  drm/damage-helper: Remove old state from
    drm_atomic_helper_damage_merged()
  drm/damage-helper: Rename state parameters in damage helpers
  drm/vmwgfx: Remove unused field struct
    vmwgfx_du_update_plane.old_state

 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  10 +-
 drivers/gpu/drm/ast/ast_cursor.c              |   3 +-
 drivers/gpu/drm/ast/ast_mode.c                |   2 +-
 drivers/gpu/drm/drm_atomic_helper.c           |   6 +-
 drivers/gpu/drm/drm_atomic_state_helper.c     |   1 +
 drivers/gpu/drm/drm_damage_helper.c           |  44 ++--
 drivers/gpu/drm/drm_fb_dma_helper.c           |   2 +-
 drivers/gpu/drm/drm_mipi_dbi.c                |   3 +-
 drivers/gpu/drm/gud/gud_pipe.c                |   3 +-
 drivers/gpu/drm/hyperv/hyperv_drm_modeset.c   |   3 +-
 drivers/gpu/drm/i915/display/intel_plane.c    |  11 +-
 drivers/gpu/drm/i915/display/intel_psr.c      |   8 +-
 drivers/gpu/drm/ingenic/ingenic-drm-drv.c     |   3 -
 drivers/gpu/drm/ingenic/ingenic-ipu.c         |   8 +-
 drivers/gpu/drm/mgag200/mgag200_mode.c        |   3 +-
 drivers/gpu/drm/sitronix/st7571.c             |   3 +-
 drivers/gpu/drm/sitronix/st7586.c             |   3 +-
 drivers/gpu/drm/sitronix/st7920.c             |   3 +-
 drivers/gpu/drm/solomon/ssd130x.c             |   9 +-
 drivers/gpu/drm/sysfb/drm_sysfb_modeset.c     |   3 +-
 .../gpu/drm/tests/drm_damage_helper_test.c    | 200 +++---------------
 drivers/gpu/drm/tiny/appletbdrm.c             |  59 +++---
 drivers/gpu/drm/tiny/bochs.c                  |   3 +-
 drivers/gpu/drm/tiny/cirrus-qemu.c            |   2 +-
 drivers/gpu/drm/tiny/gm12u320.c               |   2 +-
 drivers/gpu/drm/tiny/ili9225.c                |   3 +-
 drivers/gpu/drm/tiny/repaper.c                |   2 +-
 drivers/gpu/drm/tiny/sharp-memory.c           |   3 +-
 drivers/gpu/drm/udl/udl_modeset.c             |   3 +-
 drivers/gpu/drm/vboxvideo/vbox_mode.c         |  11 +-
 drivers/gpu/drm/virtio/virtgpu_plane.c        |   2 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_kms.c           |   5 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_kms.h           |   2 -
 drivers/gpu/drm/vmwgfx/vmwgfx_ldu.c           |   9 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_scrn.c          |  12 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_stdu.c          |  15 +-
 include/drm/drm_damage_helper.h               |   9 +-
 37 files changed, 148 insertions(+), 325 deletions(-)


base-commit: fc59f76558703febba8056be87d1c97d14f7485e
-- 
2.54.0


^ permalink raw reply

* [PATCH v5 03/15] drm/vboxvideo: Handle struct drm_plane_state.ignore_damage_clips
From: Thomas Zimmermann @ 2026-06-10 15:18 UTC (permalink / raw)
  To: mripard, maarten.lankhorst, airlied, airlied, simona, admin,
	gargaditya08, paul, jani.nikula, mhklkml, zack.rusin,
	bcm-kernel-feedback-list, harry.wentland, sunpeng.li, siqueira,
	alexander.deucher, rodrigo.vivi, joonas.lahtinen, tursulin,
	javierm, dmitry.osipenko, gurchetansingh, olvaffe
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, amd-gfx, Thomas Zimmermann, Zack Rusin, stable
In-Reply-To: <20260610152505.260172-1-tzimmermann@suse.de>

The mode-setting pipeline can disabled damage clippings for a commit
by setting ignore_damage_clips in struct drm_plane_state. The commit
will then do a full display update.

Test the flag in the primary plane's atomic_update and do a full update
if it has been set.

Commit 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers
to ignore damage clips") introduced ignore_damage_clips to selectively
ignore damage clipping in certain framebuffer changes. Vboxvideo does not
do that, but DRM's damage iterator will soon rely on the flag. Therefore
supporting it here as well make sense for consistency.

While at it, also replace uint32_t with the preferred u32.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Fixes: 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers to ignore damage clips")
Cc: Javier Martinez Canillas <javierm@redhat.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Zack Rusin <zackr@vmware.com>
Cc: dri-devel@lists.freedesktop.org
Cc: <stable@vger.kernel.org> # v6.8+
---
 drivers/gpu/drm/vboxvideo/vbox_mode.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/vboxvideo/vbox_mode.c b/drivers/gpu/drm/vboxvideo/vbox_mode.c
index 8e4e5fc9d3c5..635777f1ca23 100644
--- a/drivers/gpu/drm/vboxvideo/vbox_mode.c
+++ b/drivers/gpu/drm/vboxvideo/vbox_mode.c
@@ -283,8 +283,9 @@ static void vbox_primary_atomic_update(struct drm_plane *plane,
 	struct drm_crtc *crtc = new_state->crtc;
 	struct drm_framebuffer *fb = new_state->fb;
 	struct vbox_private *vbox = to_vbox_dev(fb->dev);
-	struct drm_mode_rect *clips;
-	uint32_t num_clips, i;
+	struct drm_mode_rect *clips = NULL;
+	u32 num_clips = 0;
+	u32 i;
 
 	vbox_crtc_set_base_and_mode(crtc, fb,
 				    new_state->src_x >> 16,
@@ -292,8 +293,10 @@ static void vbox_primary_atomic_update(struct drm_plane *plane,
 
 	/* Send information about dirty rectangles to VBVA. */
 
-	clips = drm_plane_get_damage_clips(new_state);
-	num_clips = drm_plane_get_damage_clips_count(new_state);
+	if (!new_state->ignore_damage_clips) {
+		clips = drm_plane_get_damage_clips(new_state);
+		num_clips = drm_plane_get_damage_clips_count(new_state);
+	}
 
 	if (!num_clips)
 		return;
-- 
2.54.0


^ permalink raw reply related

* [PATCH v5 02/15] drm/i915/display: Handle struct drm_plane_state.ignore_damage_clips
From: Thomas Zimmermann @ 2026-06-10 15:18 UTC (permalink / raw)
  To: mripard, maarten.lankhorst, airlied, airlied, simona, admin,
	gargaditya08, paul, jani.nikula, mhklkml, zack.rusin,
	bcm-kernel-feedback-list, harry.wentland, sunpeng.li, siqueira,
	alexander.deucher, rodrigo.vivi, joonas.lahtinen, tursulin,
	javierm, dmitry.osipenko, gurchetansingh, olvaffe
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, amd-gfx, Thomas Zimmermann, stable, Zack Rusin
In-Reply-To: <20260610152505.260172-1-tzimmermann@suse.de>

The mode-setting pipeline can disabled damage clippings for a commit
by setting ignore_damage_clips in struct drm_plane_state. The commit
will then do a full display update. Commit 35ed38d58257 ("drm: Allow
drivers to indicate the damage helpers to ignore damage clips") introduced
ignore_damage_clips to selectively ignore damage clipping in certain
framebuffer changes.

The i915 driver does not modify the flag, but DRM's damage iterator
will soon rely on it. Calling drm_atomic_helper_check_plane_damage()
right before drm_atomic_helper_damage_merged() guarantees that it
has the correct state. The i915 driver does not do this elsewhere
so far.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Fixes: 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers to ignore damage clips")
Cc: <stable@vger.kernel.org> # v6.4+
Cc: Javier Martinez Canillas <javierm@redhat.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Zack Rusin <zackr@vmware.com>
Cc: dri-devel@lists.freedesktop.org
Cc: <stable@vger.kernel.org> # v6.8+
---
 drivers/gpu/drm/i915/display/intel_psr.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/display/intel_psr.c b/drivers/gpu/drm/i915/display/intel_psr.c
index e138982dc91f..e4f43eb5bd72 100644
--- a/drivers/gpu/drm/i915/display/intel_psr.c
+++ b/drivers/gpu/drm/i915/display/intel_psr.c
@@ -3018,6 +3018,9 @@ int intel_psr2_sel_fetch_update(struct intel_atomic_state *state,
 		src = drm_plane_state_src(&new_plane_state->uapi);
 		drm_rect_fp_to_int(&src, &src);
 
+		/* Prepare plane-damage state before using it */
+		drm_atomic_helper_check_plane_damage(&state->base, &new_plane_state->uapi);
+
 		if (!drm_atomic_helper_damage_merged(&old_plane_state->uapi,
 						     &new_plane_state->uapi, &damaged_area))
 			continue;
-- 
2.54.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox