Linux virtualization list


* [PATCH net v2 0/2] vsock/virtio: fix msg_iter desync on transmission failure
From: Octavian Purdila @ 2026-06-13  0:09 UTC (permalink / raw)
  To: netdev
  Cc: Alexander Viro, Andrew Morton, Arseniy Krasnov, David S. Miller,
	Eric Dumazet, Eugenio Pérez, Jakub Kicinski, Jason Wang, kvm,
	linux-block, linux-fsdevel, linux-kernel, Michael S. Tsirkin,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Stefano Garzarella,
	virtualization, Xuan Zhuo, Octavian Purdila

This series fixes a msg_iter desync issue in the virtio vsock transport
that can lead to warnings and eventual -ENOMEM under specific failure
scenarios (e.g. partial GUP failure during MSG_ZEROCOPY transmission).

To fix this, we need to restore the msg_iter state on transmission failure.
However, since virtio vsock transport can be built as a module, we first
need to export iov_iter_restore.

Patch 1 exports iov_iter_restore.
Patch 2 implements the msg_iter restoration in virtio vsock.
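For readers unfamiliar with the pattern, here is a minimal userspace sketch of "take one restore point, rewind on failure". It is not the kernel code: struct toy_iter, toy_iter_save(), toy_iter_restore() and toy_send() are hypothetical stand-ins that only model the role iov_iter_savestate()/iov_iter_restore() play in this series, and the fail_after parameter is an invented hook that simulates a mid-transfer failure.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an iterator: a cursor over one flat buffer. */
struct toy_iter {
	const char *base;	/* start of the source buffer */
	size_t offset;		/* bytes already consumed */
	size_t count;		/* bytes still available */
};

/* Snapshot of the fields that consuming data changes. */
struct toy_iter_state {
	size_t offset;
	size_t count;
};

static void toy_iter_save(const struct toy_iter *it, struct toy_iter_state *st)
{
	st->offset = it->offset;
	st->count = it->count;
}

static void toy_iter_restore(struct toy_iter *it, const struct toy_iter_state *st)
{
	it->offset = st->offset;
	it->count = st->count;
}

/* Copy up to len bytes out of the iterator and advance it. */
static size_t toy_iter_copy(struct toy_iter *it, char *dst, size_t len)
{
	if (len > it->count)
		len = it->count;
	memcpy(dst, it->base + it->offset, len);
	it->offset += len;
	it->count -= len;
	return len;
}

/*
 * Send len bytes in chunks of at most chunk bytes. If fail_after >= 0, the
 * send aborts once that many chunks have been copied. On failure the
 * iterator is rewound to the single restore point taken at entry, so the
 * caller can retry with a consistent view. Returns 0 on success, -1 on error.
 */
static int toy_send(struct toy_iter *it, size_t len, size_t chunk, int fail_after)
{
	struct toy_iter_state saved;
	char buf[64];
	int chunks = 0;

	if (chunk == 0 || chunk > sizeof(buf))
		return -1;

	toy_iter_save(it, &saved);
	while (len > 0 && it->count > 0) {
		size_t n = len < chunk ? len : chunk;

		if (fail_after >= 0 && chunks == fail_after) {
			toy_iter_restore(it, &saved);
			return -1;
		}
		n = toy_iter_copy(it, buf, n);
		len -= n;
		chunks++;
	}
	return 0;
}
```

Without the restore call, a failed toy_send() would leave the iterator partly consumed, which is the kind of desync between the caller's accounting and the iterator that this series is about.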

Changes in v2:
- Use iov_iter_savestate()/iov_iter_restore() (Stefano)
- Use a single restore point (Stefano)
- Reverse xmas tree (Stefano)
- Added comments in the code (Stefano)

v1: https://lore.kernel.org/all/20260609004809.1285028-1-tavip@google.com/

Octavian Purdila (2):
  iov_iter: export iov_iter_restore
  vsock/virtio: restore msg_iter on transmission failure

 lib/iov_iter.c                          |  1 +
 net/vmw_vsock/virtio_transport_common.c | 13 +++++++++++++
 2 files changed, 14 insertions(+)

-- 
2.54.0.1136.gdb2ca164c4-goog


* Re: [PATCH v2 net] virtio_net: do not allow tunnel csum offload for non GSO packets
From: patchwork-bot+netdevbpf @ 2026-06-12 23:50 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, mst, jasowang, xuanzhuo, eperezma, andrew+netdev, davem,
	edumazet, kuba, virtualization, g.goller, f.ebner
In-Reply-To: <6c3b6c47fb05c100f384630dc48f3975cf37b67a.1781195144.git.pabeni@redhat.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Thu, 11 Jun 2026 18:36:48 +0200 you wrote:
> Fiona reports broken connectivity for a virtio-net setup using a UDP
> tunnel inside the guest and a NIC without UDP tunnel TSO support in the host.
> 
> Currently the virtio_net driver exposes csum offload for UDP-tunneled,
> TCP non-GSO packets. Such packets reach the host as CSUM_PARTIAL ones
> with the 'encapsulation' flag cleared, as the virtio specification does
> not support this specific kind of offload.
> 
> [...]

Here is the summary with links:
  - [v2,net] virtio_net: do not allow tunnel csum offload for non GSO packets
    https://git.kernel.org/netdev/net/c/86c51f0f2313

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* [PATCH net-next 2/2] selftests/vsock: skip vng setsid workaround on >= 1.41
From: Bobby Eshleman @ 2026-06-12 19:08 UTC (permalink / raw)
  To: Stefano Garzarella, Shuah Khan
  Cc: virtualization, netdev, linux-kselftest, linux-kernel,
	Bobby Eshleman
In-Reply-To: <20260612-vsock-test-update-v1-0-7d7eeed3ac8f@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

virtme-ng 1.41 ships the upstream fix for the SIGTTOU hang
(https://github.com/arighi/virtme-ng/pull/453), so the setsid wrapper in
vng_dry_run() is no longer needed there. Gate the workaround on the vng
version: setsid is used for vng < 1.41, and vng is invoked directly on
>= 1.41.

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
 tools/testing/selftests/vsock/vmtest.sh | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vsock/vmtest.sh b/tools/testing/selftests/vsock/vmtest.sh
index ee69ac9dd3dc..310dfc2a39ad 100755
--- a/tools/testing/selftests/vsock/vmtest.sh
+++ b/tools/testing/selftests/vsock/vmtest.sh
@@ -445,8 +445,14 @@ vng_dry_run() {
 	# stopped with SIGTTOU and hangs until kselftest's timer expires.
 	# setsid works around this by launching vng in a new session that has
 	# no controlling terminal, so tcsetattr() succeeds.
+	#
+	# Fixed in 1.41 (https://github.com/arighi/virtme-ng/pull/453).
 
-	setsid -w vng --run "$@" --dry-run &>/dev/null
+	if version_lt "$(vng --version | awk '{print $2}')" "1.41"; then
+		setsid -w vng --run "$@" --dry-run &>/dev/null
+	else
+		vng --run "$@" --dry-run &>/dev/null
+	fi
 }
 
 vm_start() {

-- 
2.53.0-Meta



* [PATCH net-next 1/2] selftests/vsock: accept vng 1.33 or >= 1.36
From: Bobby Eshleman @ 2026-06-12 19:08 UTC (permalink / raw)
  To: Stefano Garzarella, Shuah Khan
  Cc: virtualization, netdev, linux-kselftest, linux-kernel,
	Bobby Eshleman
In-Reply-To: <20260612-vsock-test-update-v1-0-7d7eeed3ac8f@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

The current vng version check uses a discrete allowlist of "1.33",
"1.36", and "1.37", which forces a script update on every new release
even though all post-1.36 releases work.

Replace the discrete list with: "1.33", or any version >= 1.36. 1.34
and 1.35 are skipped because they were not tested. Add a version_lt()
helper that compares MAJOR.MINOR numerically, so the check reads as a
straightforward version comparison.

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
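Note for readers: the support policy described above ("1.33, or any version >= 1.36") can be sketched outside the shell script as follows. This is only an illustration in C, not part of the patch; toy_parse_version(), toy_version_lt() and toy_vng_supported() are hypothetical names, and the parser assumes the version string starts with MAJOR.MINOR.

```c
#include <stdbool.h>
#include <stdio.h>

struct toy_version {
	int major;
	int minor;
};

/* Parse "MAJOR.MINOR" from the start of s. Returns false if it does not match. */
static bool toy_parse_version(const char *s, struct toy_version *v)
{
	return sscanf(s, "%d.%d", &v->major, &v->minor) == 2;
}

/* Numeric comparison: true if a < b. Major decides first, then minor. */
static bool toy_version_lt(struct toy_version a, struct toy_version b)
{
	if (a.major != b.major)
		return a.major < b.major;
	return a.minor < b.minor;
}

/* Policy from the commit message: 1.33 is accepted, as is anything >= 1.36. */
static bool toy_vng_supported(const char *s)
{
	struct toy_version min = { 1, 36 };
	struct toy_version v;

	if (!toy_parse_version(s, &v))
		return false;
	if (v.major == 1 && v.minor == 33)
		return true;
	return !toy_version_lt(v, min);
}
```

Comparing the fields as integers is what makes "1.9" sort below "1.36"; a plain string comparison would get that case wrong.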
 tools/testing/selftests/vsock/vmtest.sh | 39 +++++++++++++++++++--------------
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/vsock/vmtest.sh b/tools/testing/selftests/vsock/vmtest.sh
index d97913a6bdc7..ee69ac9dd3dc 100755
--- a/tools/testing/selftests/vsock/vmtest.sh
+++ b/tools/testing/selftests/vsock/vmtest.sh
@@ -330,27 +330,34 @@ check_netns() {
 	return 0
 }
 
+# Compare MAJOR.MINOR versions numerically. Returns 0 (true) if $1 < $2.
+version_lt() {
+	local -a a=(${1//./ })
+	local -a b=(${2//./ })
+
+	if [[ "${a[0]}" -lt "${b[0]}" ]]; then
+		return 0
+	elif [[ "${a[0]}" -gt "${b[0]}" ]]; then
+		return 1
+	elif [[ "${a[1]}" -lt "${b[1]}" ]]; then
+		return 0
+	fi
+
+	return 1
+}
+
 check_vng() {
-	local tested_versions
 	local version
-	local ok
 
-	tested_versions=("1.33" "1.36" "1.37")
-	version="$(vng --version)"
+	version="$(vng --version | awk '{print $2}')"
 
-	ok=0
-	for tv in "${tested_versions[@]}"; do
-		if [[ "${version}" == *"${tv}"* ]]; then
-			ok=1
-			break
-		fi
-	done
-
-	if [[ ! "${ok}" -eq 1 ]]; then
-		printf "warning: vng version '%s' has not been tested and may " "${version}" >&2
-		printf "not function properly.\n\tThe following versions have been tested: " >&2
-		echo "${tested_versions[@]}" >&2
+	# Supported: 1.33, or any version >= 1.36. 1.34 and 1.35 are untested.
+	if [[ "${version}" == "1.33" ]] || ! version_lt "${version}" "1.36"; then
+		return
 	fi
+
+	printf "warning: vng version '%s' has not been tested and may " "${version}" >&2
+	printf "not function properly.\n\tSupported: 1.33 or >= 1.36\n" >&2
 }
 
 check_socat() {

-- 
2.53.0-Meta



* [PATCH net-next 0/2] selftests/vsock: improve vng version and quirk handling
From: Bobby Eshleman @ 2026-06-12 19:08 UTC (permalink / raw)
  To: Stefano Garzarella, Shuah Khan
  Cc: virtualization, netdev, linux-kselftest, linux-kernel,
	Bobby Eshleman

As vng has continued to release new versions, two things in our
selftests have been affected. First, newer versions always trigger the
vng version warning. Second, we carry a workaround that newer versions
no longer need.

This series updates the version handling to accept all newer versions
without a warning, and version-gates the workaround so that it only
applies to versions that lack the commit that fixed the root cause.

Additionally, we add a function for comparing major.minor versions, which
is used in both patches.

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Bobby Eshleman (2):
      selftests/vsock: accept vng 1.33 or >= 1.36
      selftests/vsock: skip vng setsid workaround on >= 1.41

 tools/testing/selftests/vsock/vmtest.sh | 47 +++++++++++++++++++++------------
 1 file changed, 30 insertions(+), 17 deletions(-)
---
base-commit: dfcc2ff12925d99e858eaf539eaa4aaaf81fe2a6
change-id: 20260612-vsock-test-update-fcae9ffced52

Best regards,
-- 
Bobby Eshleman <bobbyeshleman@meta.com>


* Re: [PATCH v4 0/7] nvdimm: virtio_pmem: fix request lifetime and converge broken queue failures
From: Alison Schofield @ 2026-06-12 18:42 UTC (permalink / raw)
  To: Li Chen
  Cc: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	virtualization, nvdimm, linux-kernel
In-Reply-To: <20260609120726.1714780-1-me@linux.beauty>

On Tue, Jun 09, 2026 at 08:07:14PM +0800, Li Chen wrote:
> Hi,

Hi Li Chen,

In case you missed it, a Sashiko AI review of this set has been posted.
Please take a look.

https://sashiko.dev/#/patchset/20260609120726.1714780-1-me%40linux.beauty

-- Alison

> 
> The nvdimm flush helper currently converts any non-zero provider flush
> callback error to -EIO. That hides useful errno values from providers. For
> example, virtio-pmem may fail child flush bio allocation with -ENOMEM, but
> that is currently reported as -EIO by nvdimm_flush().
> 
> The raw failure seen in the local mkfs sanity test was:
> 
>   wipefs: /dev/pmem0: cannot flush modified buffers: Input/output error
>   mkfs.ext4: Input/output error while writing out and closing file system
>   nd_region region0: dbg: nvdimm_flush rc=-5
> 
> The first two patches keep provider flush errors intact and make the
> virtio-pmem child flush bio allocation use GFP_NOIO. async_pmem_flush() can
> allocate a child flush bio from filesystem flush and writeback paths;
> GFP_NOIO is a better fit than GFP_ATOMIC there because the allocation may
> sleep but must not recurse into filesystem I/O reclaim. With these changes,
> the same mkfs sanity test reached mkfs_rc=0, mount_rc=0, and umount_rc=0.
> 
> The rest of the series addresses virtio-pmem request lifetime and broken
> virtqueue handling. The virtio-pmem flush path uses a virtqueue cookie/token
> to carry a per-request context through completion. Under broken virtqueue /
> notify failure conditions, the submitter can return and free the request
> object while the host/backend may still complete the published request. The
> IRQ completion handler then dereferences freed memory when waking waiters,
> which is reported by KASAN as a slab-use-after-free and may manifest as lock
> corruption (e.g. "BUG: spinlock already unlocked") without KASAN.
> 
> In addition, the flush path has two wait sites: one for virtqueue descriptor
> availability (-ENOSPC from virtqueue_add_sgs()) and one for request
> completion. If the virtqueue becomes broken, forward progress is no longer
> guaranteed and these waiters may sleep indefinitely unless the driver
> converges the failure and wakes all wait sites.
> 
> This series addresses these issues:
> 
> 1/7 nvdimm: preserve flush callback errors
> Return provider flush callback errors directly from nvdimm_flush().
> 
> 2/7 nvdimm: virtio_pmem: use GFP_NOIO for child flush bio
> Use GFP_NOIO for the child flush bio allocation.
> 
> 3/7 nvdimm: virtio_pmem: always wake -ENOSPC waiters
> Wake one -ENOSPC waiter for each reclaimed used buffer, decoupled from
> token completion.
> 
> 4/7 nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags
> Use READ_ONCE()/WRITE_ONCE() for the wait_event() flags (done and
> wq_buf_avail).
> 
> 5/7 nvdimm: virtio_pmem: refcount requests for token lifetime
> Refcount request objects so the token lifetime spans the window where it is
> reachable through the virtqueue until completion/drain drops the virtqueue
> reference.
> 
> 6/7 nvdimm: virtio_pmem: converge broken virtqueue to -EIO
> Track a device-level broken state to converge broken/notify failures to -EIO:
> wake all waiters and drain/detach outstanding requests to complete them with
> an error, and fail-fast new requests.
> 
> 7/7 nvdimm: virtio_pmem: drain requests in freeze
> Drain outstanding requests in freeze() before tearing down virtqueues so
> waiters do not sleep indefinitely.
> 
> Testing was done on QEMU x86_64 with a virtio-pmem device exported as
> /dev/pmem0. This v4 series applies to v7.1-rc7, builds with
> CONFIG_VIRTIO_PMEM=m, passes checkpatch, and passed the local repro checks
> with a local-only virtqueue_kick() fault injection. I also checked that it
> applies cleanly to next/master at 6e845bcb78c9 ("Add linux-next specific
> files for 20260605").
> 
> Thanks,
> Li Chen
> 
> Changelog:
> v3->v4:
> - Rebased the series onto v7.1-rc7 so it applies cleanly to Linux 7.1-rc7.
> - Update the allocation site in 6/7 from kmalloc(sizeof(*req_data),
>   GFP_KERNEL) to kmalloc_obj(*req_data) to match current nvdimm code.
> - Add 1/7 to preserve provider flush callback errors in nvdimm_flush().
> - Include the GFP_NOIO child flush bio allocation fix as 2/7.
> - Renumber the old request lifetime and broken virtqueue fixes after the two
>   new flush error patches.
> v2->v3:
> - Split patch 1 as suggested by Pankaj Gupta: keep the waiter wakeup
>   ordering change in 1/5 and move READ_ONCE()/WRITE_ONCE() updates to
>   2/5 (no functional change intended).
> - Add log report to commit msg
> - Fold the export fix into 4/5 to keep the series bisectable when
>   CONFIG_VIRTIO_PMEM=m.
> v1->v2: add the export patch to fix compile issue.
> 
> Links:
> v3: https://lore.kernel.org/all/20260226025712.2236279-1-me@linux.beauty/#t
> v2: https://lore.kernel.org/all/20251225042915.334117-1-me@linux.beauty/
> v1: https://www.spinics.net/lists/kernel/msg5974818.html
> 
> Li Chen (7):
>   nvdimm: preserve flush callback errors
>   nvdimm: virtio_pmem: use GFP_NOIO for child flush bio
>   nvdimm: virtio_pmem: always wake -ENOSPC waiters
>   nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags
>   nvdimm: virtio_pmem: refcount requests for token lifetime
>   nvdimm: virtio_pmem: converge broken virtqueue to -EIO
>   nvdimm: virtio_pmem: drain requests in freeze
> 
>  drivers/nvdimm/nd_virtio.c   | 139 +++++++++++++++++++++++++++++------
>  drivers/nvdimm/region_devs.c |   6 +-
>  drivers/nvdimm/virtio_pmem.c |  14 ++++
>  drivers/nvdimm/virtio_pmem.h |   6 ++
>  4 files changed, 139 insertions(+), 26 deletions(-)
> -- 
> 2.52.0


* [PATCH v4 2/2] vduse: Add suspend
From: Eugenio Pérez @ 2026-06-12 18:14 UTC (permalink / raw)
  To: Michael S . Tsirkin
  Cc: Maxime Coquelin, linux-kernel, Yongji Xie, Jason Wang,
	virtualization, Cindy Lu, Stefano Garzarella, Xuan Zhuo,
	Eugenio Pérez, Laurent Vivier
In-Reply-To: <20260612181457.622955-1-eperezma@redhat.com>

Implement suspend operation for vduse devices, so vhost-vdpa will offer
that backend feature and userspace can effectively suspend the device.

This is required before getting the virtqueue indexes (base) for live
migration, since the device could otherwise modify them after userland
reads them.

This patch does not implement resume, so the VMM resets the whole device
to recover from a live migration failure.  Resume optimization can be
implemented on top of these patches, as other vDPA devices have done in
the past.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
v4:
* Add preparatory patch to not flush the kick and irq works under rwsem
  (MST).
* Fix a jump over a semaphore guard (Nathan Chancellor).
* Fix taking the device semaphore in the vq spinlock context (MST).
* Add suspend guard at vq_signal_irqfd so the device will not send an
  IRQ after suspend.

v3:
* Expand the patch message with information about resume operation.

v2:
* Take the rwsem only before the actual kick, not in vduse_vdpa_kick_vq.
  This assures that we're not in a critical section.
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 101 +++++++++++++++++++++++++----
 include/uapi/linux/vduse.h         |   4 ++
 2 files changed, 94 insertions(+), 11 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 0f15575df394..80dc37ed7e13 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -54,7 +54,8 @@
 #define IRQ_UNBOUND -1
 
 /* Supported VDUSE features */
-static const uint64_t vduse_features = BIT_U64(VDUSE_F_QUEUE_READY);
+static const uint64_t vduse_features = BIT_U64(VDUSE_F_QUEUE_READY) |
+				       BIT_U64(VDUSE_F_SUSPEND);
 
 /*
  * VDUSE instance have not asked the vduse API version, so assume 0.
@@ -85,6 +86,7 @@ struct vduse_virtqueue {
 	int irq_effective_cpu;
 	struct cpumask irq_affinity;
 	struct kobject kobj;
+	struct vduse_dev *dev;
 };
 
 struct vduse_dev;
@@ -134,6 +136,7 @@ struct vduse_dev {
 	int minor;
 	bool broken;
 	bool connected;
+	bool suspended;
 	u64 api_version;
 	u64 device_features;
 	u64 driver_features;
@@ -502,6 +505,7 @@ static void vduse_dev_reset(struct vduse_dev *dev)
 	}
 
 	scoped_guard(rwsem_write, &dev->rwsem) {
+		dev->suspended = false;
 		dev->status = 0;
 		dev->driver_features = 0;
 		dev->generation++;
@@ -560,16 +564,18 @@ static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
 
 static void vduse_vq_kick(struct vduse_virtqueue *vq)
 {
-	spin_lock(&vq->kick_lock);
+	guard(rwsem_read)(&vq->dev->rwsem);
+	if (vq->dev->suspended)
+		return;
+
+	guard(spinlock)(&vq->kick_lock);
 	if (!vq->ready)
-		goto unlock;
+		return;
 
 	if (vq->kickfd)
 		eventfd_signal(vq->kickfd);
 	else
 		vq->kicked = true;
-unlock:
-	spin_unlock(&vq->kick_lock);
 }
 
 static void vduse_vq_kick_work(struct work_struct *work)
@@ -922,6 +928,27 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
 	return 0;
 }
 
+static int vduse_vdpa_suspend(struct vdpa_device *vdpa)
+{
+	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+	struct vduse_dev_msg msg = { 0 };
+	int ret;
+
+	msg.req.type = VDUSE_SUSPEND;
+
+	ret = vduse_dev_msg_sync(dev, &msg);
+	if (ret == 0) {
+		scoped_guard(rwsem_write, &dev->rwsem)
+			dev->suspended = true;
+
+		cancel_work_sync(&dev->inject);
+		for (u32 i = 0; i < dev->vq_num; i++)
+			cancel_work_sync(&dev->vqs[i]->inject);
+	}
+
+	return ret;
+}
+
 static void vduse_vdpa_free(struct vdpa_device *vdpa)
 {
 	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
@@ -963,6 +990,41 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
 	.free			= vduse_vdpa_free,
 };
 
+static const struct vdpa_config_ops vduse_vdpa_config_ops_with_suspend = {
+	.set_vq_address		= vduse_vdpa_set_vq_address,
+	.kick_vq		= vduse_vdpa_kick_vq,
+	.set_vq_cb		= vduse_vdpa_set_vq_cb,
+	.set_vq_num             = vduse_vdpa_set_vq_num,
+	.get_vq_size		= vduse_vdpa_get_vq_size,
+	.get_vq_group		= vduse_get_vq_group,
+	.set_vq_ready		= vduse_vdpa_set_vq_ready,
+	.get_vq_ready		= vduse_vdpa_get_vq_ready,
+	.set_vq_state		= vduse_vdpa_set_vq_state,
+	.get_vq_state		= vduse_vdpa_get_vq_state,
+	.get_vq_align		= vduse_vdpa_get_vq_align,
+	.get_device_features	= vduse_vdpa_get_device_features,
+	.set_driver_features	= vduse_vdpa_set_driver_features,
+	.get_driver_features	= vduse_vdpa_get_driver_features,
+	.set_config_cb		= vduse_vdpa_set_config_cb,
+	.get_vq_num_max		= vduse_vdpa_get_vq_num_max,
+	.get_device_id		= vduse_vdpa_get_device_id,
+	.get_vendor_id		= vduse_vdpa_get_vendor_id,
+	.get_status		= vduse_vdpa_get_status,
+	.set_status		= vduse_vdpa_set_status,
+	.get_config_size	= vduse_vdpa_get_config_size,
+	.get_config		= vduse_vdpa_get_config,
+	.set_config		= vduse_vdpa_set_config,
+	.get_generation		= vduse_vdpa_get_generation,
+	.set_vq_affinity	= vduse_vdpa_set_vq_affinity,
+	.get_vq_affinity	= vduse_vdpa_get_vq_affinity,
+	.reset			= vduse_vdpa_reset,
+	.set_map		= vduse_vdpa_set_map,
+	.set_group_asid		= vduse_set_group_asid,
+	.get_vq_map		= vduse_get_vq_map,
+	.suspend		= vduse_vdpa_suspend,
+	.free			= vduse_vdpa_free,
+};
+
 static void vduse_dev_sync_single_for_device(union virtio_map token,
 					     dma_addr_t dma_addr, size_t size,
 					     enum dma_data_direction dir)
@@ -1174,6 +1236,10 @@ static void vduse_dev_irq_inject(struct work_struct *work)
 {
 	struct vduse_dev *dev = container_of(work, struct vduse_dev, inject);
 
+	guard(rwsem_read)(&dev->rwsem);
+	if (dev->suspended)
+		return;
+
 	spin_lock_bh(&dev->irq_lock);
 	if (dev->config_cb.callback)
 		dev->config_cb.callback(dev->config_cb.private);
@@ -1185,6 +1251,10 @@ static void vduse_vq_irq_inject(struct work_struct *work)
 	struct vduse_virtqueue *vq = container_of(work,
 					struct vduse_virtqueue, inject);
 
+	guard(rwsem_read)(&vq->dev->rwsem);
+	if (vq->dev->suspended)
+		return;
+
 	spin_lock_bh(&vq->irq_lock);
 	if (vq->ready && vq->cb.callback)
 		vq->cb.callback(vq->cb.private);
@@ -1195,6 +1265,10 @@ static bool vduse_vq_signal_irqfd(struct vduse_virtqueue *vq)
 {
 	bool signal = false;
 
+	guard(rwsem_read)(&vq->dev->rwsem);
+	if (vq->dev->suspended)
+		return false;
+
 	if (!vq->cb.trigger)
 		return false;
 
@@ -1214,9 +1288,9 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
 {
 	int ret = -EINVAL;
 
-	down_read(&dev->rwsem);
-	if (!(dev->status & VIRTIO_CONFIG_S_DRIVER_OK))
-		goto unlock;
+	guard(rwsem_read)(&dev->rwsem);
+	if (dev->suspended || !(dev->status & VIRTIO_CONFIG_S_DRIVER_OK))
+		return ret;
 
 	ret = 0;
 	if (irq_effective_cpu == IRQ_UNBOUND)
@@ -1224,8 +1298,6 @@ static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
 	else
 		queue_work_on(irq_effective_cpu,
 			      vduse_irq_bound_wq, irq_work);
-unlock:
-	up_read(&dev->rwsem);
 
 	return ret;
 }
@@ -1979,6 +2051,7 @@ static int vduse_dev_init_vqs(struct vduse_dev *dev, u32 vq_align, u32 vq_num)
 		}
 
 		dev->vqs[i]->index = i;
+		dev->vqs[i]->dev = dev;
 		dev->vqs[i]->irq_effective_cpu = IRQ_UNBOUND;
 		INIT_WORK(&dev->vqs[i]->inject, vduse_vq_irq_inject);
 		INIT_WORK(&dev->vqs[i]->kick, vduse_vq_kick_work);
@@ -2429,12 +2502,18 @@ static struct vduse_mgmt_dev *vduse_mgmt;
 static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
 {
 	struct vduse_vdpa *vdev;
+	const struct vdpa_config_ops *ops;
 
 	if (dev->vdev)
 		return -EEXIST;
 
+	if (dev->vduse_features & BIT_U64(VDUSE_F_SUSPEND))
+		ops = &vduse_vdpa_config_ops_with_suspend;
+	else
+		ops = &vduse_vdpa_config_ops;
+
 	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
-				 &vduse_vdpa_config_ops, &vduse_map_ops,
+				 ops, &vduse_map_ops,
 				 dev->ngroups, dev->nas, name, true);
 	if (IS_ERR(vdev))
 		return PTR_ERR(vdev);
diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
index 7324faea5df4..8c616895c511 100644
--- a/include/uapi/linux/vduse.h
+++ b/include/uapi/linux/vduse.h
@@ -17,6 +17,9 @@
 /* The VDUSE instance expects a request for vq ready */
 #define VDUSE_F_QUEUE_READY	0
 
+/* The VDUSE instance expects a request for suspend */
+#define VDUSE_F_SUSPEND		1
+
 /*
  * Get the version of VDUSE API that kernel supported (VDUSE_API_VERSION).
  * This is used for future extension.
@@ -334,6 +337,7 @@ enum vduse_req_type {
 	VDUSE_UPDATE_IOTLB,
 	VDUSE_SET_VQ_GROUP_ASID,
 	VDUSE_SET_VQ_READY,
+	VDUSE_SUSPEND,
 };
 
 /**
-- 
2.54.0



* [PATCH v4 1/2] vduse: do not take rwsem at reset work flush
From: Eugenio Pérez @ 2026-06-12 18:14 UTC (permalink / raw)
  To: Michael S . Tsirkin
  Cc: Maxime Coquelin, linux-kernel, Yongji Xie, Jason Wang,
	virtualization, Cindy Lu, Stefano Garzarella, Xuan Zhuo,
	Eugenio Pérez, Laurent Vivier
In-Reply-To: <20260612181457.622955-1-eperezma@redhat.com>

The following patches need to check the suspend flag in these work items,
and the rwsem protects updates to the suspend flag.  If a work item takes
the rwsem while the reset path flushes it with the rwsem held for write,
the result is a deadlock.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 63 ++++++++++++++++--------------
 1 file changed, 33 insertions(+), 30 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index eff104f90cee..0f15575df394 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -501,44 +501,47 @@ static void vduse_dev_reset(struct vduse_dev *dev)
 			vduse_domain_reset_bounce_map(domain);
 	}
 
-	down_write(&dev->rwsem);
+	scoped_guard(rwsem_write, &dev->rwsem) {
+		dev->status = 0;
+		dev->driver_features = 0;
+		dev->generation++;
+		spin_lock(&dev->irq_lock);
+		dev->config_cb.callback = NULL;
+		dev->config_cb.private = NULL;
+		spin_unlock(&dev->irq_lock);
+
+		for (i = 0; i < dev->vq_num; i++) {
+			struct vduse_virtqueue *vq = dev->vqs[i];
+
+			vq->ready = false;
+			vq->desc_addr = 0;
+			vq->driver_addr = 0;
+			vq->device_addr = 0;
+			vq->num = 0;
+			memset(&vq->state, 0, sizeof(vq->state));
+
+			spin_lock(&vq->kick_lock);
+			vq->kicked = false;
+			if (vq->kickfd)
+				eventfd_ctx_put(vq->kickfd);
+			vq->kickfd = NULL;
+			spin_unlock(&vq->kick_lock);
+
+			spin_lock(&vq->irq_lock);
+			vq->cb.callback = NULL;
+			vq->cb.private = NULL;
+			vq->cb.trigger = NULL;
+			spin_unlock(&vq->irq_lock);
+		}
+	}
 
-	dev->status = 0;
-	dev->driver_features = 0;
-	dev->generation++;
-	spin_lock(&dev->irq_lock);
-	dev->config_cb.callback = NULL;
-	dev->config_cb.private = NULL;
-	spin_unlock(&dev->irq_lock);
 	flush_work(&dev->inject);
-
 	for (i = 0; i < dev->vq_num; i++) {
 		struct vduse_virtqueue *vq = dev->vqs[i];
 
-		vq->ready = false;
-		vq->desc_addr = 0;
-		vq->driver_addr = 0;
-		vq->device_addr = 0;
-		vq->num = 0;
-		memset(&vq->state, 0, sizeof(vq->state));
-
-		spin_lock(&vq->kick_lock);
-		vq->kicked = false;
-		if (vq->kickfd)
-			eventfd_ctx_put(vq->kickfd);
-		vq->kickfd = NULL;
-		spin_unlock(&vq->kick_lock);
-
-		spin_lock(&vq->irq_lock);
-		vq->cb.callback = NULL;
-		vq->cb.private = NULL;
-		vq->cb.trigger = NULL;
-		spin_unlock(&vq->irq_lock);
 		flush_work(&vq->inject);
 		flush_work(&vq->kick);
 	}
-
-	up_write(&dev->rwsem);
 }
 
 static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
-- 
2.54.0



* [PATCH v4 0/2] vduse: Add suspend
From: Eugenio Pérez @ 2026-06-12 18:14 UTC (permalink / raw)
  To: Michael S . Tsirkin
  Cc: Maxime Coquelin, linux-kernel, Yongji Xie, Jason Wang,
	virtualization, Cindy Lu, Stefano Garzarella, Xuan Zhuo,
	Eugenio Pérez, Laurent Vivier

Implement suspend operation for vduse devices, so vhost-vdpa will offer
that backend feature and userspace can effectively suspend the device.

This is required before getting the virtqueue indexes (base) for live
migration, since the device could otherwise modify them after userland
reads them.

This series does not implement resume, so the VMM resets the whole device
to recover from a live migration failure.  Resume optimization can be
implemented on top of these patches, as other vDPA devices have done in
the past.

This series applies on top of e372c4ca7931cadb5cbee1cd9124cfad38fa2391,
replacing all the previous patches.

v4:
* Add preparatory patch to not flush the kick and irq works under rwsem
  (MST).
* Fix a jump over a semaphore guard (Nathan Chancellor).
* Fix taking the device semaphore in the vq spinlock context (MST).
* Add suspend guard at vq_signal_irqfd so the device will not send an
  IRQ after suspend.

v3:
* Expand the patch message with information about resume operation.

v2:
* Take the rwsem only before the actual kick, not in vduse_vdpa_kick_vq.
  This assures that we're not in a critical section.

Eugenio Pérez (2):
  vduse: do not take rwsem at reset work flush
  vduse: Add suspend

 drivers/vdpa/vdpa_user/vduse_dev.c | 164 +++++++++++++++++++++--------
 include/uapi/linux/vduse.h         |   4 +
 2 files changed, 127 insertions(+), 41 deletions(-)

-- 
2.54.0


* [PATCH 4/4] vhost/vsock: re-scan TX virtqueue on device start
From: Andrey Drobyshev @ 2026-06-12 16:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den,
	andrey.drobyshev
In-Reply-To: <20260612165718.433546-1-andrey.drobyshev@virtuozzo.com>

During QEMU CPR live-update (and VHOST_RESET_OWNER in general) the guest
keeps running while the host drops and later re-attaches vhost backends.
If the guest adds a buffer to the TX virtqueue (guest->host) and kicks
while the backend is temporarily NULL (between vhost_vsock_drop_backends()
and the next vhost_vsock_start()), then the kick is delivered to the
vhost worker, handle_tx_kick() sees a NULL backend and returns, and the
kick signal is consumed.  The buffer is then left in the ring.

Then upon device start vhost_vsock_start() only re-kicks the RX send
worker, never the TX VQ, so the buffer is processed only if the guest
happens to kick again.  But if the guest itself is now waiting for data
from the host, it will never kick TX VQ again, and we end up in a
deadlock.

The deadlock can be reproduced during an active host->guest socat data
transfer under multiple consecutive CPR live-updates.

To fix this, in vhost_vsock_start(), after kicking the RX send worker, also
queue the TX vq poll so any buffers the guest enqueued while we were paused
get scanned.

Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
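Note for readers: the lost-kick scenario above can be modelled with a few lines of userspace C. This is a sketch under simplifying assumptions, not the vhost code: struct toy_vq and the toy_* functions are hypothetical, the kick is modelled as a direct function call, and the rescan_tx parameter stands in for queuing the TX poll on start.

```c
#include <stdbool.h>

/* Hypothetical model of one guest->host queue. */
struct toy_vq {
	int pending;		/* buffers sitting in the ring */
	bool has_backend;	/* false while the backend is detached */
	int processed;		/* buffers the host has handled */
};

/*
 * Host-side kick handler. With no backend it returns immediately: the
 * notification is consumed, but the buffers stay in the ring.
 */
static void toy_handle_kick(struct toy_vq *vq)
{
	if (!vq->has_backend)
		return;
	vq->processed += vq->pending;
	vq->pending = 0;
}

/* Guest adds one buffer and notifies the host exactly once. */
static void toy_guest_add_and_kick(struct toy_vq *vq)
{
	vq->pending++;
	toy_handle_kick(vq);
}

/*
 * Device start. When rescan_tx is true the queue is scanned once on start,
 * so buffers added while the backend was detached are picked up even if the
 * guest never kicks again.
 */
static void toy_vq_start(struct toy_vq *vq, bool rescan_tx)
{
	vq->has_backend = true;
	if (rescan_tx)
		toy_handle_kick(vq);
}
```

Starting with rescan_tx set to false leaves the buffer stranded until the next guest kick, which never comes if the guest is itself blocked waiting for host data.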
 drivers/vhost/vsock.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index bcaba36becd7..1fcfe71d18be 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -655,6 +655,12 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
 	 */
 	vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work);

+	/*
+	 * Some packets might've also been queued in TX VQ.  Re-scan it here,
+	 * mirroring the RX send-worker kick above.
+	 */
+	vhost_poll_queue(&vsock->vqs[VSOCK_VQ_TX].poll);
+
 	mutex_unlock(&vsock->dev.mutex);
 	return 0;

-- 
2.47.1


* [PATCH 3/4] vhost/vsock: suppress EHOSTUNREACH fast-fail during CPR pause
From: Andrey Drobyshev @ 2026-06-12 16:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den,
	andrey.drobyshev
In-Reply-To: <20260612165718.433546-1-andrey.drobyshev@virtuozzo.com>

From: "Denis V. Lunev" <den@openvz.org>

An earlier commit ("ms/vhost/vsock: Refuse the connection immediately when
guest isn't ready") added a fast-fail in vhost_transport_send_pkt().  It
rejects every host send with -EHOSTUNREACH until the destination calls
SET_RUNNING(1).  The fast-fail condition checks whether the device's
backends are dropped, and if they are, the guest is considered not ready.

However, there might be other reasons for the backends to be NULL.  In
particular, when QEMU is performing CPR (checkpoint-restore) migration,
device ownership is RESET and SET again, which causes the backends to be
dropped and reattached.  If a connection attempt lands in this window, an
AF_VSOCK client gets -EHOSTUNREACH, which is wrong.

Add a cpr_paused flag set inside vhost_vsock_drop_backends() when the
backend was previously live, cleared by vhost_vsock_start(). When set,
vhost_transport_send_pkt() queues the skb instead of fast-failing; the
existing kick of send_pkt_work in vhost_vsock_start() drains it on
resume. A device that has never run keeps cpr_paused == false and the
boot-time fast-fail behaviour is preserved.

Pair the cpr_paused store with the backend store using an
smp_wmb()/smp_rmb() pair so that a concurrent sender on a weakly-ordered
architecture never observes (NULL backend, !paused).

Signed-off-by: Denis V. Lunev <den@openvz.org>
---
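Note for readers: the ordering argument above can be sketched in portable C11. This is an analogy rather than the kernel code: struct toy_dev and the toy_* functions are hypothetical, and C11 release/acquire fences are used here to play the role that smp_wmb()/smp_rmb() play in the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device state: a backend pointer and a "paused" flag. */
struct toy_dev {
	_Atomic(void *) backend;
	atomic_bool paused;
};

enum toy_verdict {
	TOY_PROCEED_LIVE,	/* backend attached, send normally */
	TOY_QUEUE_PAUSED,	/* backend gone but paused, keep queueing */
	TOY_FAIL_UNREACHABLE	/* backend gone and never started, fail fast */
};

static void toy_dev_init(struct toy_dev *d)
{
	atomic_init(&d->backend, NULL);
	atomic_init(&d->paused, false);
}

/* Writer: publish paused=true before the backend becomes NULL. */
static void toy_drop_backend(struct toy_dev *d)
{
	if (atomic_load_explicit(&d->backend, memory_order_relaxed) != NULL) {
		atomic_store_explicit(&d->paused, true, memory_order_relaxed);
		atomic_thread_fence(memory_order_release);
	}
	atomic_store_explicit(&d->backend, NULL, memory_order_relaxed);
}

/* Writer: attach the backend before clearing paused. */
static void toy_dev_start(struct toy_dev *d, void *backend)
{
	atomic_store_explicit(&d->backend, backend, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&d->paused, false, memory_order_relaxed);
}

/* Reader: if the backend is seen as NULL, paused is read only after the fence. */
static enum toy_verdict toy_send_check(struct toy_dev *d)
{
	if (atomic_load_explicit(&d->backend, memory_order_relaxed) != NULL)
		return TOY_PROCEED_LIVE;
	atomic_thread_fence(memory_order_acquire);
	if (atomic_load_explicit(&d->paused, memory_order_relaxed))
		return TOY_QUEUE_PAUSED;
	return TOY_FAIL_UNREACHABLE;
}
```

In both writers the fence sits between the two stores, and in the reader it sits between the two loads. A reader that sees the NULL written by toy_drop_backend() is therefore guaranteed to also see paused set to true.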
 drivers/vhost/vsock.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index e629886e5cf8..bcaba36becd7 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -61,6 +61,7 @@ struct vhost_vsock {
 
 	u32 guest_cid;
 	bool seqpacket_allow;
+	bool cpr_paused;	/* between stop and next start */
 };
 
 static u32 vhost_transport_get_local_cid(void)
@@ -311,11 +312,17 @@ vhost_transport_send_pkt(struct sk_buff *skb, struct net *net)
 	 * the mutex would be too expensive in this hot path, and we already have
 	 * all the outcomes covered: if the backend becomes NULL right after the check,
 	 * vhost_transport_do_send_pkt() will check it under the mutex anyway.
+	 *
+	 * Don't fast-fail if cpr_paused is set, keep queueing skbs instead.
+	 * The kick in vhost_vsock_start() will drain them on resume.
 	 */
 	if (unlikely(!data_race(vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])))) {
-		rcu_read_unlock();
-		kfree_skb(skb);
-		return -EHOSTUNREACH;
+		smp_rmb();	/* pairs with smp_wmb() in start/drop_backends */
+		if (!READ_ONCE(vsock->cpr_paused)) {
+			rcu_read_unlock();
+			kfree_skb(skb);
+			return -EHOSTUNREACH;
+		}
 	}
 
 	if (virtio_vsock_skb_reply(skb))
@@ -640,6 +647,9 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
 		mutex_unlock(&vq->mutex);
 	}
 
+	smp_wmb();	/* pairs with smp_rmb() in send_pkt */
+	WRITE_ONCE(vsock->cpr_paused, false);
+
 	/* Some packets may have been queued before the device was started,
 	 * let's kick the send worker to send them.
 	 */
@@ -671,6 +681,11 @@ static void vhost_vsock_drop_backends(struct vhost_vsock *vsock)
 
 	lockdep_assert_held(&vsock->dev.mutex);
 
+	if (vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])) {
+		WRITE_ONCE(vsock->cpr_paused, true);
+		smp_wmb();	/* pairs with smp_rmb() in send_pkt */
+	}
+
 	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
 		vq = &vsock->vqs[i];
 
@@ -728,6 +743,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
 
 	vsock->guest_cid = 0; /* no CID assigned yet */
 	vsock->seqpacket_allow = false;
+	vsock->cpr_paused = false;
 
 	atomic_set(&vsock->queued_replies, 0);
 
-- 
2.47.1


^ permalink raw reply related

* [PATCH 0/4] vhost/vsock: add support for VHOST_RESET_OWNER and CPR migration
From: Andrey Drobyshev @ 2026-06-12 16:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den,
	andrey.drobyshev

Host<-->guest connections via AF_VSOCK sockets aren't supposed to
outlive VM migration, since the VM is moving to another host.  However,
there's a special case, which is QEMU live-update, or CPR
(checkpoint-restore) migration.  In this case, the VM remains on the same
host, and we'd like such connections to persist.

For this to work, we need to be able to transfer device ownership from
source QEMU to dest QEMU.  Namely, source needs to reset ownership by
issuing VHOST_RESET_OWNER ioctl, and then target has to claim it by
calling VHOST_SET_OWNER.
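
As a rough illustration of the two calls involved, here is a minimal
userspace sketch.  VHOST_RESET_OWNER and VHOST_SET_OWNER come from
<linux/vhost.h>; the wrapper names are hypothetical, and how the
descriptor reaches the destination process is outside this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Source side: give up ownership of the vhost device.
 * Returns 0 on success or a negative errno value.
 */
static int vhost_release_ownership(int vhost_fd)
{
	if (ioctl(vhost_fd, VHOST_RESET_OWNER) < 0)
		return -errno;
	return 0;
}

/* Destination side: claim ownership of the same device.
 * Returns 0 on success or a negative errno value.
 */
static int vhost_claim_ownership(int vhost_fd)
{
	if (ioctl(vhost_fd, VHOST_SET_OWNER) < 0)
		return -errno;
	return 0;
}
```

The source would call vhost_release_ownership() on its vhost-vsock
descriptor before the destination calls vhost_claim_ownership() on it.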

Since VHOST_RESET_OWNER isn't yet implemented for vhost-vsock, let's add
such an implementation (patches 1-2).  Also fix a regression introduced by
the earlier commit [1] (patch 3), and fix a deadlock bug (patch 4).

There's a complementary series for QEMU [0] adding support for vhost-vsock
devices during CPR migration.

NOTE: this series needs to be applied on top of Michael's vhost/linux-next
tree, as it contains the relevant commit [1], which is not yet present in
the master branch.

I've tested this (patched QEMU + patched kernel) approximately as follows:

  * Run listener in the guest:
  socat -u VSOCK-LISTEN:9999 - >/tmp/recv.bin

  * Run data transfer from host to guest:
  socat -u FILE:/root/bigfile.bin VSOCK-CONNECT:CID:9999

  * Perform CPR migration during transfer (either cpr-exec or cpr-transfer)
  * Check that file hash sum matches

[0] https://lore.kernel.org/qemu-devel/20260612165110.431376-1-andrey.drobyshev@virtuozzo.com
[1] https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/commit/?id=bb26ed5f3a8b

Andrey Drobyshev (1):
  vhost/vsock: re-scan TX virtqueue on device start

Denis V. Lunev (1):
  vhost/vsock: suppress EHOSTUNREACH fast-fail during CPR pause

Pavel Tikhomirov (2):
  vhost/vsock: split out vhost_vsock_drop_backends helper
  vhost/vsock: add VHOST_RESET_OWNER ioctl

 drivers/vhost/vsock.c | 80 +++++++++++++++++++++++++++++++++++++------
 1 file changed, 69 insertions(+), 11 deletions(-)

-- 
2.47.1

^ permalink raw reply

* [PATCH 1/4] vhost/vsock: split out vhost_vsock_drop_backends helper
From: Andrey Drobyshev @ 2026-06-12 16:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den,
	andrey.drobyshev
In-Reply-To: <20260612165718.433546-1-andrey.drobyshev@virtuozzo.com>

From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

Split the actual backend dropping part from vhost_vsock_stop.  We're
going to need it for the VHOST_RESET_OWNER implementation in the
following patch, when vsock->dev.mutex is already taken and owner is
checked.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
---
 drivers/vhost/vsock.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 9aaab6bb8061..b12221ce6faf 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -664,9 +664,24 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
 	return ret;
 }
 
-static int vhost_vsock_stop(struct vhost_vsock *vsock, bool check_owner)
+static void vhost_vsock_drop_backends(struct vhost_vsock *vsock)
 {
+	struct vhost_virtqueue *vq;
 	size_t i;
+
+	lockdep_assert_held(&vsock->dev.mutex);
+
+	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
+		vq = &vsock->vqs[i];
+
+		mutex_lock(&vq->mutex);
+		vhost_vq_set_backend(vq, NULL);
+		mutex_unlock(&vq->mutex);
+	}
+}
+
+static int vhost_vsock_stop(struct vhost_vsock *vsock, bool check_owner)
+{
 	int ret = 0;
 
 	mutex_lock(&vsock->dev.mutex);
@@ -677,14 +692,7 @@ static int vhost_vsock_stop(struct vhost_vsock *vsock, bool check_owner)
 			goto err;
 	}
 
-	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
-		struct vhost_virtqueue *vq = &vsock->vqs[i];
-
-		mutex_lock(&vq->mutex);
-		vhost_vq_set_backend(vq, NULL);
-		mutex_unlock(&vq->mutex);
-	}
-
+	vhost_vsock_drop_backends(vsock);
 err:
 	mutex_unlock(&vsock->dev.mutex);
 	return ret;
-- 
2.47.1


^ permalink raw reply related

* [PATCH 2/4] vhost/vsock: add VHOST_RESET_OWNER ioctl
From: Andrey Drobyshev @ 2026-06-12 16:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den,
	andrey.drobyshev
In-Reply-To: <20260612165718.433546-1-andrey.drobyshev@virtuozzo.com>

From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

This ioctl is needed for QEMU's CPR (checkpoint-restore) migration of
the guest with vhost-vsock device.  For this to work, we need to reset
the device ownership on the source side by calling RESET_OWNER, and then
claim it on the dest side by calling SET_OWNER.  We expect not to lose any
AF_VSOCK connection while this happens.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
---
 drivers/vhost/vsock.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index b12221ce6faf..e629886e5cf8 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -894,6 +894,32 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features)
 	return -EFAULT;
 }
 
+static int vhost_vsock_reset_owner(struct vhost_vsock *vsock)
+{
+	struct vhost_iotlb *umem;
+	long err;
+
+	mutex_lock(&vsock->dev.mutex);
+	err = vhost_dev_check_owner(&vsock->dev);
+	if (err)
+		goto done;
+	umem = vhost_dev_reset_owner_prepare();
+	if (!umem) {
+		err = -ENOMEM;
+		goto done;
+	}
+	/* Follows vhost_vsock_dev_release closely except for guest_cid drop */
+	vsock_for_each_connected_socket(&vhost_transport.transport,
+					vhost_vsock_reset_orphans);
+	vhost_vsock_drop_backends(vsock);
+	vhost_vsock_flush(vsock);
+	vhost_dev_stop(&vsock->dev);
+	vhost_dev_reset_owner(&vsock->dev, umem);
+done:
+	mutex_unlock(&vsock->dev.mutex);
+	return err;
+}
+
 static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
 				  unsigned long arg)
 {
@@ -937,6 +963,8 @@ static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
 			return -EOPNOTSUPP;
 		vhost_set_backend_features(&vsock->dev, features);
 		return 0;
+	case VHOST_RESET_OWNER:
+		return vhost_vsock_reset_owner(vsock);
 	default:
 		mutex_lock(&vsock->dev.mutex);
 		r = vhost_dev_ioctl(&vsock->dev, ioctl, argp);
-- 
2.47.1


^ permalink raw reply related

* [PATCH v1] s390/virtio_ccw: Also suppress -EINVAL on device detach
From: William Bezenah @ 2026-06-12 15:54 UTC (permalink / raw)
  To: linux-s390
  Cc: cohuck, pasic, farman, hca, gor, agordeev, borntraeger, svens,
	mjrosato, vneethv, oberpar, virtualization, kvm, linux-kernel

When detaching virtio devices with multiple queues, spurious and
non-fatal error messages appear in the guest kernel log. While
virtio-net devices have multiple queues by default, this issue can
also be reproduced with other virtio device types (e.g., virtio-blk)
when configured with multiple queues:

[   33.820621] virtio_ccw 0.0.0001: Failed to deregister indicators (-22)
[   33.820628] virtio_net virtio2: Error -22 while deleting queue 0
[   33.820632] virtio_net virtio2: Error -22 while deleting queue 1
[   33.820634] virtio_net virtio2: Error -22 while deleting queue 2

Since commit 8c58a229688c ("s390/cio: Do not unregister the
subchannel based on DNV"), subchannel behavior following a device
detach has been updated and results in -EINVAL being propagated
rather than -ENODEV, originating from ccw_device_start_timeout_key()
in cio/device_ops. In the end, the virtio driver has no ability to
react to the difference between device and subchannel states here,
and during detach, both -ENODEV and -EINVAL indicate the device
cannot be used and should not be treated as errors requiring
attention. Update error handling in virtio_ccw_del_vq() and
virtio_ccw_drop_indicator() to suppress -EINVAL in addition to
-ENODEV.
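
The resulting check can be sketched as a small predicate.  This is only
an illustration: detach_error_needs_log() is a hypothetical helper name,
and the patch itself open-codes the condition at both call sites.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper: returns true when a ccw_io_helper() style return
 * value should be reported.  Success (0), -ENODEV and -EINVAL are all
 * treated as "device is gone, nothing to report".
 */
static bool detach_error_needs_log(int ret)
{
	return ret && ret != -ENODEV && ret != -EINVAL;
}
```

With this predicate, the -22 (-EINVAL) results from the log excerpt above
would no longer be reported, while other failures such as -EIO still
would.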

Fixes: 8c58a229688c ("s390/cio: Do not unregister the subchannel based on DNV")
Signed-off-by: William Bezenah <wbezenah@linux.ibm.com>
---
 drivers/s390/virtio/virtio_ccw.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
index bab6cad3fd5c..02fd8bf7e469 100644
--- a/drivers/s390/virtio/virtio_ccw.c
+++ b/drivers/s390/virtio/virtio_ccw.c
@@ -429,7 +429,7 @@ static void virtio_ccw_drop_indicator(struct virtio_ccw_device *vcdev,
 			    vcdev->is_thinint ?
 			    VIRTIO_CCW_DOING_SET_IND_ADAPTER :
 			    VIRTIO_CCW_DOING_SET_IND);
-	if (ret && (ret != -ENODEV))
+	if (ret && (ret != -ENODEV) && (ret != -EINVAL))
 		dev_info(&vcdev->cdev->dev,
 			 "Failed to deregister indicators (%d)\n", ret);
 	else if (vcdev->is_thinint)
@@ -515,10 +515,10 @@ static void virtio_ccw_del_vq(struct virtqueue *vq, struct ccw1 *ccw)
 	ret = ccw_io_helper(vcdev, ccw,
 			    VIRTIO_CCW_DOING_SET_VQ | index);
 	/*
-	 * -ENODEV isn't considered an error: The device is gone anyway.
+	 * -ENODEV and -EINVAL aren't considered errors: The device is gone anyway.
 	 * This may happen on device detach.
 	 */
-	if (ret && (ret != -ENODEV))
+	if (ret && (ret != -ENODEV) && (ret != -EINVAL))
 		dev_warn(&vq->vdev->dev, "Error %d while deleting queue %d\n",
 			 ret, index);
 
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH net-next v3 4/4] vsock: fold sk_acceptq_removed() into vsock_remove_pending()
From: Bobby Eshleman @ 2026-06-12 14:51 UTC (permalink / raw)
  To: Raf Dickson
  Cc: netdev, virtualization, pabeni, sgarzare, stefanha, bryan-bt.tan,
	vishnu.dasa, bcm-kernel-feedback-list, leonardi, horms, edumazet,
	kuba
In-Reply-To: <20260612045216.105796-5-rafdog35@gmail.com>

On Fri, Jun 12, 2026 at 04:52:16AM +0000, Raf Dickson wrote:
> Callers of vsock_remove_pending() must also call sk_acceptq_removed()
> to keep sk_ack_backlog consistent. Move the call into
> vsock_remove_pending() itself to make it automatic and prevent future
> callers from forgetting it.
> 
> Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Raf Dickson <rafdog35@gmail.com>
> ---
>  net/vmw_vsock/af_vsock.c       | 2 +-
>  net/vmw_vsock/vmci_transport.c | 4 +---
>  2 files changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index 24916dd4e9..4a7d6d247a 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -494,6 +494,7 @@ void vsock_remove_pending(struct sock *listener, struct sock *pending)
>  	list_del_init(&vpending->pending_links);
>  	sock_put(listener);
>  	sock_put(pending);
> +	sk_acceptq_removed(listener);
>  }
>  EXPORT_SYMBOL_GPL(vsock_remove_pending);
>  
> @@ -773,7 +774,6 @@ static void vsock_pending_work(struct work_struct *work)
>  	if (vsock_is_pending(sk)) {
>  		vsock_remove_pending(listener, sk);
>  
> -		sk_acceptq_removed(listener);
>  	} else if (!vsk->rejected) {
>  		/* We are not on the pending list and accept() did not reject
>  		 * us, so we must have been accepted by our user process.  We
> diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
> index c2db016cca..3e6445f4e1 100644
> --- a/net/vmw_vsock/vmci_transport.c
> +++ b/net/vmw_vsock/vmci_transport.c
> @@ -980,10 +980,8 @@ static int vmci_transport_recv_listen(struct sock *sk,
>  			err = -EINVAL;
>  		}
>  
> -		if (err < 0) {
> +		if (err < 0)
>  			vsock_remove_pending(sk, pending);
> -			sk_acceptq_removed(sk);
> -		}
>  
>  		release_sock(pending);
>  		vmci_transport_release_pending(pending);
> -- 
> 2.54.0
> 

Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>

^ permalink raw reply

* Re: [PATCH net-next v3 3/4] vsock: fold sk_acceptq_added() into vsock_enqueue_accept()
From: Bobby Eshleman @ 2026-06-12 14:51 UTC (permalink / raw)
  To: Raf Dickson
  Cc: netdev, virtualization, pabeni, sgarzare, stefanha, bryan-bt.tan,
	vishnu.dasa, bcm-kernel-feedback-list, leonardi, horms, edumazet,
	kuba
In-Reply-To: <20260612045216.105796-4-rafdog35@gmail.com>

On Fri, Jun 12, 2026 at 04:52:15AM +0000, Raf Dickson wrote:
> virtio and hyperv call sk_acceptq_added() immediately before
> vsock_enqueue_accept(). Move the call into vsock_enqueue_accept()
> itself so callers cannot forget it and the accounting is consistent.
> 
> Suggested-by: Paolo Abeni <pabeni@redhat.com>
> Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Raf Dickson <rafdog35@gmail.com>
> ---
>  net/vmw_vsock/af_vsock.c                | 1 +
>  net/vmw_vsock/hyperv_transport.c        | 1 -
>  net/vmw_vsock/virtio_transport_common.c | 1 -
>  3 files changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index 6cfa89b6f3..24916dd4e9 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -518,6 +518,7 @@ void vsock_enqueue_accept(struct sock *listener, struct sock *connected)
>  	sock_hold(connected);
>  	sock_hold(listener);
>  	list_add_tail(&vconnected->accept_queue, &vlistener->accept_queue);
> +	sk_acceptq_added(listener);
>  }
>  EXPORT_SYMBOL_GPL(vsock_enqueue_accept);
>  
> diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
> index b3394946b2..0de8148877 100644
> --- a/net/vmw_vsock/hyperv_transport.c
> +++ b/net/vmw_vsock/hyperv_transport.c
> @@ -410,7 +410,6 @@ static void hvs_open_connection(struct vmbus_channel *chan)
>  
>  	if (conn_from_host) {
>  		new->sk_state = TCP_ESTABLISHED;
> -		sk_acceptq_added(sk);
>  
>  		hvs_new->vm_srv_id = *if_type;
>  		hvs_new->host_srv_id = *if_instance;
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index b10666937c..4a39d48db9 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -1582,7 +1582,6 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
>  		return ret;
>  	}
>  
> -	sk_acceptq_added(sk);
>  	if (virtio_transport_space_update(child, skb))
>  		child->sk_write_space(child);
>  
> -- 
> 2.54.0
> 

Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>

^ permalink raw reply

* Re: [PATCH net-next v3 2/4] vsock: fold sk_acceptq_added() into vsock_add_pending()
From: Bobby Eshleman @ 2026-06-12 14:51 UTC (permalink / raw)
  To: Raf Dickson
  Cc: netdev, virtualization, pabeni, sgarzare, stefanha, bryan-bt.tan,
	vishnu.dasa, bcm-kernel-feedback-list, leonardi, horms, edumazet,
	kuba
In-Reply-To: <20260612045216.105796-3-rafdog35@gmail.com>

On Fri, Jun 12, 2026 at 04:52:14AM +0000, Raf Dickson wrote:
> Move sk_acceptq_added() into vsock_add_pending() so callers cannot
> forget it. vmci is the only transport using the pending list and
> is updated accordingly.
> 
> Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Raf Dickson <rafdog35@gmail.com>
> ---
>  net/vmw_vsock/af_vsock.c       | 1 +
>  net/vmw_vsock/vmci_transport.c | 1 -
>  2 files changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index 1f94f0d44c..6cfa89b6f3 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -483,6 +483,7 @@ void vsock_add_pending(struct sock *listener, struct sock *pending)
>  	sock_hold(pending);
>  	sock_hold(listener);
>  	list_add_tail(&vpending->pending_links, &vlistener->pending_links);
> +	sk_acceptq_added(listener);
>  }
>  EXPORT_SYMBOL_GPL(vsock_add_pending);
>  
> diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
> index 635ebf9da4..c2db016cca 100644
> --- a/net/vmw_vsock/vmci_transport.c
> +++ b/net/vmw_vsock/vmci_transport.c
> @@ -1109,7 +1109,6 @@ static int vmci_transport_recv_listen(struct sock *sk,
>  	}
>  
>  	vsock_add_pending(sk, pending);
> -	sk_acceptq_added(sk);
>  
>  	pending->sk_state = TCP_SYN_SENT;
>  	vmci_trans(vpending)->produce_size =
> -- 
> 2.54.0
> 

Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>

^ permalink raw reply

* Re: [PATCH net-next v3 1/4] vsock: introduce vsock_pending_to_accept() helper
From: Bobby Eshleman @ 2026-06-12 14:50 UTC (permalink / raw)
  To: Raf Dickson
  Cc: netdev, virtualization, pabeni, sgarzare, stefanha, bryan-bt.tan,
	vishnu.dasa, bcm-kernel-feedback-list, leonardi, horms, edumazet,
	kuba
In-Reply-To: <20260612045216.105796-2-rafdog35@gmail.com>

On Fri, Jun 12, 2026 at 04:52:13AM +0000, Raf Dickson wrote:
> Add vsock_pending_to_accept() to move a socket directly from the
> pending list to the accept queue in a single operation, avoiding
> the sock_put/sock_hold dance and the sk_acceptq_removed()/
> sk_acceptq_added() pair that would otherwise be needed when
> calling vsock_remove_pending() followed by vsock_enqueue_accept().
> 
> Use it in vmci_transport_recv_connecting_server() where a completed
> handshake transitions the socket from pending to accept queue.
> 
> Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Raf Dickson <rafdog35@gmail.com>
> ---
>  include/net/af_vsock.h         |  1 +
>  net/vmw_vsock/af_vsock.c       | 10 ++++++++++
>  net/vmw_vsock/vmci_transport.c |  3 +--
>  3 files changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> index 4e40063ada..30046a3c20 100644
> --- a/include/net/af_vsock.h
> +++ b/include/net/af_vsock.h
> @@ -220,6 +220,7 @@ static inline bool __vsock_in_connected_table(struct vsock_sock *vsk)
>  void vsock_add_pending(struct sock *listener, struct sock *pending);
>  void vsock_remove_pending(struct sock *listener, struct sock *pending);
>  void vsock_enqueue_accept(struct sock *listener, struct sock *connected);
> +void vsock_pending_to_accept(struct sock *listener, struct sock *pending);
>  void vsock_insert_connected(struct vsock_sock *vsk);
>  void vsock_remove_bound(struct vsock_sock *vsk);
>  void vsock_remove_connected(struct vsock_sock *vsk);
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index 2ce1063d4a..1f94f0d44c 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -496,6 +496,16 @@ void vsock_remove_pending(struct sock *listener, struct sock *pending)
>  }
>  EXPORT_SYMBOL_GPL(vsock_remove_pending);
>  
> +void vsock_pending_to_accept(struct sock *listener, struct sock *pending)
> +{
> +	struct vsock_sock *vpending = vsock_sk(pending);
> +	struct vsock_sock *vlistener = vsock_sk(listener);
> +
> +	list_del_init(&vpending->pending_links);
> +	list_add_tail(&vpending->accept_queue, &vlistener->accept_queue);
> +}
> +EXPORT_SYMBOL_GPL(vsock_pending_to_accept);
> +
>  void vsock_enqueue_accept(struct sock *listener, struct sock *connected)
>  {
>  	struct vsock_sock *vlistener;
> diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
> index 91516488a7..635ebf9da4 100644
> --- a/net/vmw_vsock/vmci_transport.c
> +++ b/net/vmw_vsock/vmci_transport.c
> @@ -1258,8 +1258,7 @@ vmci_transport_recv_connecting_server(struct sock *listener,
>  	 * listener's pending list to the accept queue so callers of accept()
>  	 * can find it.
>  	 */
> -	vsock_remove_pending(listener, pending);
> -	vsock_enqueue_accept(listener, pending);
> +	vsock_pending_to_accept(listener, pending);
>  
>  	/* Callers of accept() will be waiting on the listening socket, not
>  	 * the pending socket.
> -- 
> 2.54.0
> 

Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
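
For anyone following along, here is a toy model of why the single-step
move leaves the accounting untouched.  It is a sketch under assumptions:
the toy_* types and functions are hypothetical counters, not kernel
structures, and the add/remove/enqueue helpers model the state after the
whole series, with the backlog accounting folded in.

```c
#include <assert.h>

/* Toy counters standing in for socket state; not kernel types. */
struct toy_sock {
	int refcnt;		/* models sock_hold()/sock_put() */
};

struct toy_listener {
	int refcnt;		/* models sock_hold()/sock_put() */
	int ack_backlog;	/* models sk_ack_backlog */
	int pending_len;	/* entries on the pending list */
	int accept_len;		/* entries on the accept queue */
};

static void toy_add_pending(struct toy_listener *l, struct toy_sock *s)
{
	s->refcnt++;
	l->refcnt++;
	l->pending_len++;
	l->ack_backlog++;	/* models sk_acceptq_added() */
}

static void toy_remove_pending(struct toy_listener *l, struct toy_sock *s)
{
	l->pending_len--;
	l->refcnt--;
	s->refcnt--;
	l->ack_backlog--;	/* models sk_acceptq_removed() */
}

static void toy_enqueue_accept(struct toy_listener *l, struct toy_sock *s)
{
	s->refcnt++;
	l->refcnt++;
	l->accept_len++;
	l->ack_backlog++;	/* models sk_acceptq_added() */
}

/* Single-step move: only list membership changes. */
static void toy_pending_to_accept(struct toy_listener *l, struct toy_sock *s)
{
	(void)s;		/* references are carried over unchanged */
	l->pending_len--;
	l->accept_len++;
}
```

In this model, remove-then-enqueue and the single-step move end in the
same state; the helper just skips the intermediate drop and re-take of
the references and the backlog count.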

^ permalink raw reply

* Re: [PATCH net-next v3 4/4] vsock: fold sk_acceptq_removed() into vsock_remove_pending()
From: Luigi Leonardi @ 2026-06-12 13:47 UTC (permalink / raw)
  To: Raf Dickson
  Cc: netdev, virtualization, pabeni, sgarzare, stefanha, bryan-bt.tan,
	vishnu.dasa, bcm-kernel-feedback-list, bobbyeshleman, horms,
	edumazet, kuba
In-Reply-To: <20260612045216.105796-5-rafdog35@gmail.com>

On Fri, Jun 12, 2026 at 04:52:16AM +0000, Raf Dickson wrote:
>Callers of vsock_remove_pending() must also call sk_acceptq_removed()
>to keep sk_ack_backlog consistent. Move the call into
>vsock_remove_pending() itself to make it automatic and prevent future
>callers from forgetting it.
>
>Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
>Signed-off-by: Raf Dickson <rafdog35@gmail.com>

Reviewed-by: Luigi Leonardi <leonardi@redhat.com>


^ permalink raw reply

* Re: [PATCH net-next v3 3/4] vsock: fold sk_acceptq_added() into vsock_enqueue_accept()
From: Luigi Leonardi @ 2026-06-12 13:46 UTC (permalink / raw)
  To: Raf Dickson
  Cc: netdev, virtualization, pabeni, sgarzare, stefanha, bryan-bt.tan,
	vishnu.dasa, bcm-kernel-feedback-list, bobbyeshleman, horms,
	edumazet, kuba
In-Reply-To: <20260612045216.105796-4-rafdog35@gmail.com>

On Fri, Jun 12, 2026 at 04:52:15AM +0000, Raf Dickson wrote:
>virtio and hyperv call sk_acceptq_added() immediately before
>vsock_enqueue_accept(). Move the call into vsock_enqueue_accept()
>itself so callers cannot forget it and the accounting is consistent.
>
>Suggested-by: Paolo Abeni <pabeni@redhat.com>
>Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
>Signed-off-by: Raf Dickson <rafdog35@gmail.com>

Reviewed-by: Luigi Leonardi <leonardi@redhat.com>


^ permalink raw reply

* Re: [PATCH net-next v3 2/4] vsock: fold sk_acceptq_added() into vsock_add_pending()
From: Luigi Leonardi @ 2026-06-12 13:39 UTC (permalink / raw)
  To: Raf Dickson
  Cc: netdev, virtualization, pabeni, sgarzare, stefanha, bryan-bt.tan,
	vishnu.dasa, bcm-kernel-feedback-list, bobbyeshleman, horms,
	edumazet, kuba
In-Reply-To: <20260612045216.105796-3-rafdog35@gmail.com>

On Fri, Jun 12, 2026 at 04:52:14AM +0000, Raf Dickson wrote:
>Move sk_acceptq_added() into vsock_add_pending() so callers cannot
>forget it. vmci is the only transport using the pending list and
>is updated accordingly.
>
>Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
>Signed-off-by: Raf Dickson <rafdog35@gmail.com>

Reviewed-by: Luigi Leonardi <leonardi@redhat.com>


^ permalink raw reply

* Re: [PATCH net-next v3 1/4] vsock: introduce vsock_pending_to_accept() helper
From: Luigi Leonardi @ 2026-06-12 13:30 UTC (permalink / raw)
  To: Raf Dickson
  Cc: netdev, virtualization, pabeni, sgarzare, stefanha, bryan-bt.tan,
	vishnu.dasa, bcm-kernel-feedback-list, bobbyeshleman, horms,
	edumazet, kuba
In-Reply-To: <20260612045216.105796-2-rafdog35@gmail.com>

On Fri, Jun 12, 2026 at 04:52:13AM +0000, Raf Dickson wrote:
>Add vsock_pending_to_accept() to move a socket directly from the
>pending list to the accept queue in a single operation, avoiding
>the sock_put/sock_hold dance and the sk_acceptq_removed()/
>sk_acceptq_added() pair that would otherwise be needed when
>calling vsock_remove_pending() followed by vsock_enqueue_accept().
>
>Use it in vmci_transport_recv_connecting_server() where a completed
>handshake transitions the socket from pending to accept queue.
>
>Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
>Signed-off-by: Raf Dickson <rafdog35@gmail.com>

Reviewed-by: Luigi Leonardi <leonardi@redhat.com>


^ permalink raw reply

* Re: [PATCH net-next v2] vsock/vmci: use sk_acceptq_is_full() helper
From: Luigi Leonardi @ 2026-06-12 13:26 UTC (permalink / raw)
  To: Raf Dickson
  Cc: netdev, virtualization, pabeni, sgarzare, stefanha, bryan-bt.tan,
	vishnu.dasa, bcm-kernel-feedback-list, horms, edumazet, kuba
In-Reply-To: <20260612045842.122207-1-rafdog35@gmail.com>

On Fri, Jun 12, 2026 at 04:58:42AM +0000, Raf Dickson wrote:
>Replace the open-coded backlog check with sk_acceptq_is_full().
>The helper uses > instead of >=, which is the correct comparison
>per commit 64a146513f8f ("[NET]: Revert incorrect accept queue
>backlog changes."), and adds READ_ONCE() for proper memory ordering.
>
>Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
>Signed-off-by: Raf Dickson <rafdog35@gmail.com>

Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
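
To make the > versus >= difference concrete, here is a small sketch.  It
assumes plain integer fields and omits READ_ONCE(); struct toy_sk and
both function names are hypothetical and only mirror the two comparisons
discussed in the quoted message.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the two backlog fields; not struct sock. */
struct toy_sk {
	unsigned int ack_backlog;
	unsigned int max_ack_backlog;
};

/* Open-coded style check using >=. */
static bool toy_full_open_coded(const struct toy_sk *sk)
{
	return sk->ack_backlog >= sk->max_ack_backlog;
}

/* Helper style check using >. */
static bool toy_acceptq_is_full(const struct toy_sk *sk)
{
	return sk->ack_backlog > sk->max_ack_backlog;
}
```

The two checks differ only when the backlog equals the limit: the >=
form already reports a full queue, while the > form still admits one
more connection.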


^ permalink raw reply

* Re: [RFC PATCH 0/6] Support virtio-mem memory hotplug in TDX guests
From: Kiryl Shutsemau @ 2026-06-12 12:16 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: marcandre.lureau, david, rick.p.edgecombe, prsampat, pbonzini,
	mst, peterx, chenyi.qiang, elena.reshetova, michaeluth,
	ackerleytng, linux-kernel, linux-coco, virtualization, x86,
	yilun.xu, xiaoyao.li, chao.p.peng
In-Reply-To: <20260604093551.1511079-1-zhenzhong.duan@intel.com>

On Thu, Jun 04, 2026 at 05:35:45AM -0400, Zhenzhong Duan wrote:
> 2. Re-accepting already-accepted memory returns errors. Ignoring these errors
> can mislead the guest into believing re-accepted memory is zeroed when it
> contains stale data.

The re-accepting concern is valid, but often overblown. Re-accepting memory
that never got allocated is fine.

> == About this series ==
> 
> This series takes a different direction, supporting start-private memory
> and addressing the limitations of previous series [1] by implementing a
> callback-based infrastructure that integrates TDX memory acceptance and
> release operations with proper subblock granularity.

You are presenting these callbacks as a generic memory hotplug thingy, but
they are only plugged into virtio-mem. ACPI hotplug won't accept/release
memory unless I'm missing something. Are you expecting them to cover
non-virtio cases too?

And these callbacks feel like a very ad-hoc solution.

> See Rick and Paolo's
> discussion about using TDG.MEM.PAGE.RELEASE in [1].

Having RELEASE in the hotplug path without addressing private->shared
conversion first is odd. That's the most obvious path that has to be
covered first.

Hm?

> == Future work ==
> support lazy accept

It would be nice to have an outline of how we will get there, to
understand whether this patchset is a stepping stone or a dead end that
has to be thrown away later on.

Hot[un]plug is often used to manage an overcommitted host. Eager accept
might be counter-productive.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply
