Linux virtualization list

Linux virtualization list
 help / color / mirror / Atom feed

* Re: [PATCH 2/2] iommu/virtio: Avoid using the list iterator past the loop in viommu_add_resv_mem()
From: Jean-Philippe Brucker @ 2026-06-04 16:04 UTC (permalink / raw)
  To: Maoyi Xie; +Cc: joro, will, robin.murphy, iommu, linux-kernel, virtualization
In-Reply-To: <20260604051816.2976221-3-maoyixie.tju@gmail.com>

On Thu, Jun 04, 2026 at 01:18:16PM +0800, Maoyi Xie wrote:
> viommu_add_resv_mem() walks vdev->resv_regions to find the insertion
> point. When every element has a smaller start address, the
> list_for_each_entry() iterator ends up one past the last entry, and
> &next->list then aliases the list head, so the following list_add_tail()
> still appends at the tail. The result is correct, but using the iterator
> after the loop is undefined per the list_for_each_entry() contract.

Thank you for removing this hack, though I don't find a contract in the
list_for_each_entry() doc, and the fix still accesses a cursor outside the
loop. Since you mentioned C11 UB in another email, do you have more info
on the precise operation which is undefined in the kernel (container_of
into an invalid object or the &next->list addition)?  Just so I can avoid
it in the future.

Anyway, thanks for the patch. If this is just general cleanup that's fine
too.

Reviewed-by: Jean-Philippe Brucker <jpb@kernel.org>

> 
> The loop only needs a list_head as the insertion point, so iterate with
> list_for_each() and keep the typed list_entry() dereference inside the loop
> body. No functional change.
> 
> Suggested-by: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com>

> ---
>  drivers/iommu/virtio-iommu.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
> index 587fc13197f1..1d58d6b626a5 100644
> --- a/drivers/iommu/virtio-iommu.c
> +++ b/drivers/iommu/virtio-iommu.c
> @@ -486,7 +486,8 @@ static int viommu_add_resv_mem(struct viommu_endpoint *vdev,
>  	size_t size;
>  	u64 start64, end64;
>  	phys_addr_t start, end;
> -	struct iommu_resv_region *region = NULL, *next;
> +	struct iommu_resv_region *region = NULL;
> +	struct list_head *pos;
>  	unsigned long prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
>  
>  	start = start64 = le64_to_cpu(mem->start);
> @@ -520,11 +521,14 @@ static int viommu_add_resv_mem(struct viommu_endpoint *vdev,
>  		return -ENOMEM;
>  
>  	/* Keep the list sorted */
> -	list_for_each_entry(next, &vdev->resv_regions, list) {
> +	list_for_each(pos, &vdev->resv_regions) {
> +		struct iommu_resv_region *next =
> +			list_entry(pos, struct iommu_resv_region, list);
> +
>  		if (next->start > region->start)
>  			break;
>  	}
> -	list_add_tail(&region->list, &next->list);
> +	list_add_tail(&region->list, pos);
>  	return 0;
>  }
>  
> -- 
> 2.34.1
> 

^ permalink raw reply

* [PATCH v1] drm/virtio: Fix driver removal with disabled KMS
From: Dmitry Osipenko @ 2026-06-04 12:27 UTC (permalink / raw)
  To: David Airlie, Gerd Hoffmann, Gurchetan Singh, Chia-I Wu,
	Ryosuke Yasuoka
  Cc: dri-devel, virtualization, linux-kernel

DRM atomic and modesetting aren't initialized if virtio-gpu driver built
with disabled KMS, leading to access of uninitialized data on driver
removal/unbinding and crashing kernel. Fix it by skipping shutting down
atomic core with unavailable KMS.

Fixes: 72122c69d717 ("drm/virtio: Add option to disable KMS support")
Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
---
 drivers/gpu/drm/virtio/virtgpu_drv.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.c b/drivers/gpu/drm/virtio/virtgpu_drv.c
index f0fb784c0f6f..2aaa7cb08085 100644
--- a/drivers/gpu/drm/virtio/virtgpu_drv.c
+++ b/drivers/gpu/drm/virtio/virtgpu_drv.c
@@ -138,7 +138,10 @@ static void virtio_gpu_remove(struct virtio_device *vdev)
 
 	virtio_gpu_release_vqs(dev);
 	drm_dev_unplug(dev);
-	drm_atomic_helper_shutdown(dev);
+
+	if (drm_core_check_feature(dev, DRIVER_ATOMIC))
+		drm_atomic_helper_shutdown(dev);
+
 	virtio_gpu_deinit(dev);
 	drm_dev_put(dev);
 }
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH] vsock/vmci: fix sk_ack_backlog leak on failed handshake
From: patchwork-bot+netdevbpf @ 2026-06-04 11:10 UTC (permalink / raw)
  To: Raf Dickson
  Cc: netdev, virtualization, linux-kernel, sgarzare, stefanha,
	bryan-bt.tan, vishnu.dasa, bcm-kernel-feedback-list, stable
In-Reply-To: <20260526104356.469928-1-rafdog35@gmail.com>

Hello:

This patch was applied to netdev/net.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Tue, 26 May 2026 10:43:56 +0000 you wrote:
> When vmci_transport_recv_connecting_server() returns an error,
> vmci_transport_recv_listen() calls vsock_remove_pending() but never
> calls sk_acceptq_removed(). This leaves sk_ack_backlog incremented
> permanently.
> 
> Repeated handshake failures (malformed packets, queue pair alloc
> failure, event subscribe failure) cause sk_ack_backlog to climb
> toward sk_max_ack_backlog. Once it reaches the limit the listener
> permanently refuses all new connections with -ECONNREFUSED, a
> silent denial of service requiring a process restart to recover.
> 
> [...]

Here is the summary with links:
  - vsock/vmci: fix sk_ack_backlog leak on failed handshake
    https://git.kernel.org/netdev/net/c/c05fa14db43e

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net] vhost/net: complete zerocopy ubufs only once
From: Paolo Abeni @ 2026-06-04 10:55 UTC (permalink / raw)
  To: Qing Ming, Michael S. Tsirkin, Jason Wang
  Cc: Eugenio Pérez, Shirley, David S. Miller, kvm, virtualization,
	netdev, linux-kernel
In-Reply-To: <20260601104300.197210-1-a0yami@mailbox.org>

On 6/1/26 12:43 PM, Qing Ming wrote:
> vhost-net initializes one ubuf_info per outstanding zerocopy TX
> descriptor and hands it to the backend socket.  The networking stack may
> then clone a zerocopy skb before all skb references are released.  For
> example, batman-adv fragmentation reaches skb_split(), which calls
> skb_zerocopy_clone() and increments the same ubuf_info refcount.
> 
> vhost_zerocopy_complete() currently treats every ubuf callback as a
> completed vhost descriptor.  It dereferences ubuf->ctx, writes the
> descriptor completion state, and drops the vhost_net_ubuf_ref even when
> the callback only releases a cloned skb reference.  A backend reset can
> therefore wait for and free the vhost_net_ubuf_ref while another cloned
> skb still carries the same ubuf_info.  A later completion then
> dereferences the freed ubufs pointer.
> 
> KASAN reports the stale completion as:
> 
>   BUG: KASAN: slab-use-after-free in vhost_zerocopy_complete+0x1d7/0x1f0
>   BUG: KASAN: slab-use-after-free in vhost_zerocopy_complete+0x101/0x1f0
>   vhost_zerocopy_complete
>   skb_copy_ubufs
>   __dev_forward_skb2
>   veth_xmit
> 
> The freed object was allocated from vhost_net_ioctl() while setting the
> backend and freed through kfree_rcu()/kvfree_rcu_bulk after backend
> removal, while delayed skb completion still reached
> vhost_zerocopy_complete().
> 
> Honor the generic ubuf_info refcount before touching vhost state, and run
> the vhost descriptor completion only for the final ubuf reference.  This
> matches the msg_zerocopy_complete() ownership rule for cloned zerocopy
> skbs.
> 
> Fixes: bab632d69ee4 ("vhost: vhost TX zero-copy support")
> Signed-off-by: Qing Ming <a0yami@mailbox.org>

The patch LGTM.

@Michael: to you want to take it via your tree?

/P


^ permalink raw reply

* Re: [PATCH] vsock/vmci: fix sk_ack_backlog leak on failed handshake
From: Stefano Garzarella @ 2026-06-04 10:40 UTC (permalink / raw)
  To: Raf Dickson
  Cc: netdev, virtualization, linux-kernel, stefanha, bryan-bt.tan,
	vishnu.dasa, bcm-kernel-feedback-list, stable
In-Reply-To: <20260526104356.469928-1-rafdog35@gmail.com>

On Tue, May 26, 2026 at 10:43:56AM +0000, Raf Dickson wrote:
>When vmci_transport_recv_connecting_server() returns an error,
>vmci_transport_recv_listen() calls vsock_remove_pending() but never
>calls sk_acceptq_removed(). This leaves sk_ack_backlog incremented
>permanently.
>
>Repeated handshake failures (malformed packets, queue pair alloc
>failure, event subscribe failure) cause sk_ack_backlog to climb
>toward sk_max_ack_backlog. Once it reaches the limit the listener
>permanently refuses all new connections with -ECONNREFUSED, a
>silent denial of service requiring a process restart to recover.
>
>The two existing sk_acceptq_removed() calls in af_vsock.c do not
>cover this path: line 764 checks vsock_is_pending() which returns
>false after vsock_remove_pending(), and line 1889 is only reached
>on successful accept().
>
>Fix by balancing sk_acceptq_added() with sk_acceptq_removed() on
>the error path.
>
>Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
>Cc: stable@vger.kernel.org
>Signed-off-by: Raf Dickson <rafdog35@gmail.com>
>---
> net/vmw_vsock/vmci_transport.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)

Acked-by: Stefano Garzarella <sgarzare@redhat.com>


^ permalink raw reply

* Re: [PATCH] vsock/vmci: fix sk_ack_backlog leak on failed handshake
From: Stefano Garzarella @ 2026-06-04 10:39 UTC (permalink / raw)
  To: Raf Dickson
  Cc: pabeni, netdev, virtualization, linux-kernel, stefanha,
	bryan-bt.tan, vishnu.dasa, bcm-kernel-feedback-list, stable
In-Reply-To: <20260601095646.180085-1-rafdog35@gmail.com>

On Mon, Jun 01, 2026 at 09:56:46AM +0000, Raf Dickson wrote:
>On Mon, Jun 1, 2026 at 9:26 AM Paolo Abeni wrote:
>> I'm wondering if sk_acceptq_removed() should be bounded in
>> vsock_remove_pending() ? (even if that change would probably be
>> net-next material).
>
>Agreed, that would prevent this class of bug entirely. Happy to prepare
>a follow-up patch for net-next once this fix lands, if that would be
>useful.

And maybe sk_acceptq_added() calls moved in vsock_add_pending().
That said I was wondering about other transports, but it seems both 
virtio and hyperv have a simplier handshake that doesn't require the 
pending list, since the socket is moved directly in the accept list.

BTW if you are going to sent a follow up, maybe another improvement 
(unrelated so another patch) could be to use sk_acceptq_is_full() 
instead of `sk->sk_ack_backlog >= sk->sk_max_ack_backlog`. Discovered 
while comparing vmci with virtio.

Thanks,
Stefano

^ permalink raw reply

* [RFC PATCH 6/6] x86/tdx: Release private memory before private->shared conversion
From: Zhenzhong Duan @ 2026-06-04  9:35 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova, michaeluth,
	ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng
In-Reply-To: <20260604093551.1511079-1-zhenzhong.duan@intel.com>

TDX supports a PAGE.RELEASE feature, when configured, host can only
remove a private page until guest releases it and puts it in a PENDING
state through TDG.MEM.PAGE.RELEASE.

When TDX PAGE.RELEASE is supported, release private memory pages before
converting them to shared state, this ensures pages transition from
accepted to pending state.

The release operation helps handle scenarios where the hypervisor may
retain old private pages during conversion. Without proper release,
subsequent shared->private conversions could encounter re-acceptance
errors when attempting to accept pages that are still in accepted state.

If the release operation fails, abort the conversion to prevent
inconsistent memory state. Note that if tdx_map_gpa() fails after
successful release, we cannot safely rollback because the GPA mapping may
have partially succeeded, creating a mix of shared and private pages that
cannot be reliably tracked or recovered.

Co-developed-by: Xu Yilun <yilun.xu@linux.intel.com>
Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 arch/x86/coco/tdx/tdx.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 0abfb3505093..ecee6df92395 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -1121,7 +1121,25 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 {
 	phys_addr_t start = __pa(vaddr);
 	phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
+	bool release_required = !enc && tdx_page_release_supported;

+	/*
+	 * For private->shared conversion, release memory pages first.
+	 * This transitions pages from accepted to pending state to be
+	 * more robust with buggy VMM, e.g., VMM may keep old pages,
+	 * when converting back to private, re-accept error triggers.
+	 */
+	if (release_required && !tdx_release_memory(start, end))
+		return false;
+
+	/*
+	 * Update the GPA mapping state. If this fails, we cannot rollback
+	 * by calling tdx_accept_memory() because tdx_map_gpa() may have
+	 * partially succeeded, creating a mix of shared and private pages.
+	 * Attempting to accept the entire range would fail on pages that
+	 * are still in shared state, and we have no way to determine which
+	 * pages are in which state after partial failure.
+	 */
 	if (!tdx_map_gpa(start, end, enc))
 		return false;

-- 
2.52.0

^ permalink raw reply related

* [RFC PATCH 5/6] x86/tdx: Register memory pre-unplug callback for TDX guests
From: Zhenzhong Duan @ 2026-06-04  9:35 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova, michaeluth,
	ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng
In-Reply-To: <20260604093551.1511079-1-zhenzhong.duan@intel.com>

Add support for releasing memory pages before unplugging in TDX guests.
When memory is about to be unplugged by virtio-mem or other memory
hotplug drivers, the TDX guest should release the memory pages back to the
hypervisor using TDG.MEM.PAGE.RELEASE TDCALL to be more robust for buggy
VMM behavior, e.g., VMM may do nothing for unplug request.

The implementation detects TDG.MEM.PAGE.RELEASE support and optimizes
release operations by trying larger page sizes 1G/2M before falling back
to 4K pages. If release fails, the function re-accepts any released pages
to maintain consistency. Without proper memory release, re-plugging memory
in TDX guests fails when guest accepts those memory because hypervisor can
do no-op to memory unplug request and memory is already in "accepted"
state.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 arch/x86/include/asm/shared/tdx.h |   2 +
 arch/x86/coco/tdx/tdx.c           | 135 ++++++++++++++++++++++++++++++
 2 files changed, 137 insertions(+)

diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 049638e3da74..910ec1e57528 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -19,6 +19,7 @@
 #define TDG_MEM_PAGE_ACCEPT		6
 #define TDG_VM_RD			7
 #define TDG_VM_WR			8
+#define TDG_MEM_PAGE_RELEASE		30
 
 /* TDX TD attributes */
 #define TDX_TD_ATTR_DEBUG_BIT		0
@@ -54,6 +55,7 @@
 
 /* TDCS_CONFIG_FLAGS bits */
 #define TDCS_CONFIG_FLEXIBLE_PENDING_VE	BIT_ULL(1)
+#define TDCS_CONFIG_PAGE_RELEASE	BIT_ULL(6)
 
 /* TDCS_TD_CTLS bits */
 #define TD_CTLS_PENDING_VE_DISABLE_BIT	0
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index d93ba092d311..0abfb3505093 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -345,6 +345,139 @@ static int tdx_memory_post_plug(u64 addr, u64 size)
 	return -EINVAL;
 }
 
+static bool tdx_page_release_supported;
+
+static void detect_mem_page_release(void)
+{
+	u64 config = 0;
+
+	tdg_vm_rd(TDCS_CONFIG_FLAGS, &config);
+
+	tdx_page_release_supported = !!(config & TDCS_CONFIG_PAGE_RELEASE);
+}
+
+static unsigned long try_release_one(phys_addr_t start, unsigned long len,
+				     enum pg_level pg_level)
+{
+	unsigned long release_size = page_level_size(pg_level);
+	struct tdx_module_args args = {};
+	u8 page_size;
+	u64 ret;
+
+	if (!IS_ALIGNED(start, release_size))
+		return 0;
+
+	if (len < release_size)
+		return 0;
+
+	/*
+	 * Pass the page physical address to TDX module to release the
+	 * private page and to put it in PENDING state.
+	 *
+	 * Bits 2:0 of RCX encode page size: 0 - 4K, 1 - 2M, 2 - 1G.
+	 */
+	switch (pg_level) {
+	case PG_LEVEL_4K:
+		page_size = TDX_PS_4K;
+		break;
+	case PG_LEVEL_2M:
+		page_size = TDX_PS_2M;
+		break;
+	case PG_LEVEL_1G:
+		page_size = TDX_PS_1G;
+		break;
+	default:
+		return 0;
+	}
+
+	args.rcx = start | page_size;
+	ret = __tdcall(TDG_MEM_PAGE_RELEASE, &args);
+	if (ret)
+		return 0;
+
+	return release_size;
+}
+
+static bool _tdx_release_memory(phys_addr_t start, phys_addr_t end, phys_addr_t *cur)
+{
+	*cur = start;
+
+	while (*cur < end) {
+		unsigned long len = end - *cur;
+		unsigned long release_size;
+
+		/*
+		 * Try larger release first. It speeds up process by cutting
+		 * number of hypercalls (if successful).
+		 */
+
+		release_size = try_release_one(*cur, len, PG_LEVEL_1G);
+		if (!release_size)
+			release_size = try_release_one(*cur, len, PG_LEVEL_2M);
+		if (!release_size)
+			release_size = try_release_one(*cur, len, PG_LEVEL_4K);
+		if (!release_size)
+			return false;
+		*cur += release_size;
+	}
+
+	return true;
+}
+
+/*
+ * Release memory pages back to the hypervisor in TDX guests.
+ *
+ * @start: Physical start address of memory range to release
+ * @end:   Physical end address of memory range to release
+ *
+ * Uses TDG.MEM.PAGE.RELEASE TDCALL to transition private pages back to
+ * pending state. If PAGE_RELEASE is not supported by the TDX
+ * configuration, returns true (success) as no action is needed.
+ *
+ * On partial failure, automatically re-accepts any successfully released
+ * pages to restore consistent memory state. Re-acceptance failure is
+ * treated as a fatal error since it indicates severe TDX module issues.
+ *
+ * Returns: true on success, false on failure
+ */
+static bool tdx_release_memory(phys_addr_t start, phys_addr_t end)
+{
+	phys_addr_t released = start;
+	bool ret;
+
+	if (!tdx_page_release_supported)
+		return true;
+
+	ret = _tdx_release_memory(start, end, &released);
+	if (!ret) {
+		pr_err("Failed to release memory [0x%llx, 0x%llx)\n",
+		       (unsigned long long)start, (unsigned long long)end);
+
+		/*
+		 * Re-accept any pages that were successfully released before
+		 * the failure occurred. This should never fail since we're
+		 * just restoring the previous accepted state.
+		 */
+		if (!tdx_accept_memory(start, released))
+			panic("%s Failed to re-accept memory\n", __func__);
+	}
+
+	return ret;
+}
+
+static int tdx_memory_pre_unplug(u64 addr, u64 size)
+{
+	u64 end;
+
+	if (!PAGE_ALIGNED(addr) || !PAGE_ALIGNED(size))
+		return -EINVAL;
+
+	if (check_add_overflow(addr, size, &end))
+		return -EINVAL;
+
+	return tdx_release_memory(addr, end) ? 0 : -EINVAL;
+}
+
 static void tdx_setup(u64 *cc_mask)
 {
 	struct tdx_module_args args = {};
@@ -380,6 +513,8 @@ static void tdx_setup(u64 *cc_mask)
 	reduce_unnecessary_ve();
 
 	set_memory_post_plug_callback(tdx_memory_post_plug);
+	detect_mem_page_release();
+	set_memory_pre_unplug_callback(tdx_memory_pre_unplug);
 }
 
 /*
-- 
2.52.0


^ permalink raw reply related

* [RFC PATCH 4/6] x86/tdx: Register memory post-plug callback for TDX guests
From: Zhenzhong Duan @ 2026-06-04  9:35 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova, michaeluth,
	ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng
In-Reply-To: <20260604093551.1511079-1-zhenzhong.duan@intel.com>

Register a callback to handle memory acceptance after memory plugging in
TDX guests. When memory is added by virtio-mem or other memory hotplug
drivers, the TDX guest must accept the memory pages using
TDG.MEM.PAGE.ACCEPT TDCALL before they can be safely accessed.

The callback uses the existing tdx_accept_memory() function to accept all
pages in the newly plugged memory range. Without this callback, newly
added memory would remain in "unaccepted" state, and any access to these
pages would trigger VM exits and potentially cause guest crashes. The
callback is registered during TDX setup and remains active for the
lifetime of the guest, ensuring all dynamically added memory is properly
accepted before being made available to the kernel's memory management
subsystem.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 arch/x86/coco/tdx/tdx.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 186915a17c50..d93ba092d311 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -326,6 +326,25 @@ static void reduce_unnecessary_ve(void)
 	enable_cpu_topology_enumeration();
 }

+static int tdx_memory_post_plug(u64 addr, u64 size)
+{
+	u64 end;
+
+	if (!PAGE_ALIGNED(addr) || !PAGE_ALIGNED(size))
+		return -EINVAL;
+
+	if (check_add_overflow(addr, size, &end))
+		return -EINVAL;
+
+	if (tdx_accept_memory(addr, end))
+		return 0;
+
+	pr_err("Failed to accept memory [0x%llx, 0x%llx)\n",
+	       (unsigned long long)addr, (unsigned long long)end);
+
+	return -EINVAL;
+}
+
 static void tdx_setup(u64 *cc_mask)
 {
 	struct tdx_module_args args = {};
@@ -359,6 +378,8 @@ static void tdx_setup(u64 *cc_mask)
 	disable_sept_ve(td_attr);

 	reduce_unnecessary_ve();
+
+	set_memory_post_plug_callback(tdx_memory_post_plug);
 }

 /*
-- 
2.52.0

^ permalink raw reply related

* [RFC PATCH 3/6] virtio-mem: Integrate memory acceptance and release callbacks
From: Zhenzhong Duan @ 2026-06-04  9:35 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova, michaeluth,
	ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng
In-Reply-To: <20260604093551.1511079-1-zhenzhong.duan@intel.com>

Integrate the memory post-plug and pre-unplug callbacks into virtio-mem's
plug and unplug operations to support TDX memory acceptance and release.

For memory plugging, call the post-plug callback after successfully
requesting memory from the hypervisor to ensure newly added memory is
accepted by TDX guests. If acceptance fails, return -EINVAL to mark the
device as broken rather than attempting rollback, since unplug operations
may also fail and partial acceptance creates difficult-to-recover state.

For memory unplugging, call the pre-unplug callback before requesting
memory removal from the hypervisor to allow TDX guests to release memory
pages. If release fails, return -EINVAL to mark the device as broken.

If the hypervisor unplug request fails after successful memory release,
attempt to re-accept the memory to restore consistent state for retry. If
re-acceptance fails, mark the device as broken to prevent corruption.

The config_changed check is moved to the wrapper functions to ensure
callbacks are not invoked unnecessarily when operations will be retried.

This integration ensures proper memory lifecycle management in
confidential computing environments while maintaining backward
compatibility with non-TDX systems where the callbacks are no-ops.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 drivers/virtio/virtio_mem.c | 80 ++++++++++++++++++++++++++++++++-----
 1 file changed, 70 insertions(+), 10 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 48051e9e98ab..12b8229dab0d 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1416,8 +1416,8 @@ static uint64_t virtio_mem_send_request(struct virtio_mem *vm,
 	return virtio16_to_cpu(vm->vdev, vm->resp.type);
 }
 
-static int virtio_mem_send_plug_request(struct virtio_mem *vm, uint64_t addr,
-					uint64_t size)
+static int _virtio_mem_send_plug_request(struct virtio_mem *vm, uint64_t addr,
+					 uint64_t size)
 {
 	const uint64_t nb_vm_blocks = size / vm->device_block_size;
 	const struct virtio_mem_req req = {
@@ -1427,9 +1427,6 @@ static int virtio_mem_send_plug_request(struct virtio_mem *vm, uint64_t addr,
 	};
 	int rc = -ENOMEM;
 
-	if (atomic_read(&vm->config_changed))
-		return -EAGAIN;
-
 	dev_dbg(&vm->vdev->dev, "plugging memory: 0x%llx - 0x%llx\n", addr,
 		addr + size - 1);
 
@@ -1454,8 +1451,8 @@ static int virtio_mem_send_plug_request(struct virtio_mem *vm, uint64_t addr,
 	return rc;
 }
 
-static int virtio_mem_send_unplug_request(struct virtio_mem *vm, uint64_t addr,
-					  uint64_t size)
+static int _virtio_mem_send_unplug_request(struct virtio_mem *vm, uint64_t addr,
+					   uint64_t size)
 {
 	const uint64_t nb_vm_blocks = size / vm->device_block_size;
 	const struct virtio_mem_req req = {
@@ -1465,9 +1462,6 @@ static int virtio_mem_send_unplug_request(struct virtio_mem *vm, uint64_t addr,
 	};
 	int rc = -ENOMEM;
 
-	if (atomic_read(&vm->config_changed))
-		return -EAGAIN;
-
 	dev_dbg(&vm->vdev->dev, "unplugging memory: 0x%llx - 0x%llx\n", addr,
 		addr + size - 1);
 
@@ -1489,6 +1483,72 @@ static int virtio_mem_send_unplug_request(struct virtio_mem *vm, uint64_t addr,
 	return rc;
 }
 
+static int virtio_mem_send_plug_request(struct virtio_mem *vm, uint64_t addr,
+					uint64_t size)
+{
+	int ret;
+
+	if (atomic_read(&vm->config_changed))
+		return -EAGAIN;
+
+	ret = _virtio_mem_send_plug_request(vm, addr, size);
+	if (ret)
+		return ret;
+
+	/*
+	 * If memory acceptance fails, we cannot safely rollback to the pre-plug
+	 * state because the unplug operation may also fail (e.g., hypervisor
+	 * out of memory, VM migration in progress). Additionally, acceptance
+	 * failures may be partial, leaving some pages accepted and others not,
+	 * creating inconsistent memory state that is difficult to track and
+	 * recover from.
+	 *
+	 * Rather than attempting complex state recovery that may fail, we treat
+	 * acceptance failure as a critical error and return -EINVAL. This causes
+	 * the caller to set the broken flag and stop processing further requests,
+	 * preventing potential memory corruption or system instability. As a
+	 * consequence, the hypervisor-side memory for the failing range is
+	 * leaked for the lifetime of the device.
+	 */
+	if (memory_post_plug_call(addr, size))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int virtio_mem_send_unplug_request(struct virtio_mem *vm, uint64_t addr,
+					  uint64_t size)
+{
+	int ret;
+
+	if (atomic_read(&vm->config_changed))
+		return -EAGAIN;
+
+	/*
+	 * If memory release fails, treat it as a critical error similar to
+	 * acceptance failure. See virtio_mem_send_plug_request() for detailed
+	 * rationale on why we avoid complex error recovery.
+	 */
+	ret = memory_pre_unplug_call(addr, size);
+	if (ret)
+		return -EINVAL;
+
+	ret = _virtio_mem_send_unplug_request(vm, addr, size);
+	/*
+	 * If the hypervisor unplug request fails (e.g., out of memory, VM
+	 * migration), the operation will be retried later. Since we already
+	 * released the memory from TDX perspective, we must re-accept it to
+	 * restore consistent state for the next retry. If re-acceptance fails,
+	 * treat it as critical error to prevent state corruption. As a
+	 * consequence, the hypervisor-side memory for the failing range is
+	 * leaked for the lifetime of the device.
+	 */
+	if (ret && memory_post_plug_call(addr, size))
+		return -EINVAL;
+
+	return ret;
+}
+
 static int virtio_mem_send_unplug_all_request(struct virtio_mem *vm)
 {
 	const struct virtio_mem_req req = {
-- 
2.52.0


^ permalink raw reply related

* [RFC PATCH 2/6] mm/memory_hotplug: Add memory pre-unplug callback infrastructure
From: Zhenzhong Duan @ 2026-06-04  9:35 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova, michaeluth,
	ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng
In-Reply-To: <20260604093551.1511079-1-zhenzhong.duan@intel.com>

In confidential computing environments like TDX, memory that was
previously accepted by the guest could be explicitly "released" back to
the hypervisor before it is unplugged, because hypervisor can do no-op
for the unplug operation without guest awares, then replug will fail
with re-accept error.

This callback infrastructure allows the TDX guest code to register a
handler that will be invoked after kernel removes memory from its memory
management subsystem but before it is unplugged, ensuring all memory
pages are properly released via TDG.MEM.PAGE.RELEASE TDCALL. Then re-plug
triggers TDG.MEM.PAGE.ACCEPT on pages in "unaccepted" state and succeed.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/linux/memory_hotplug.h | 10 ++++++++++
 mm/memory_hotplug.c            | 20 ++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 39f0a35a5112..5bb77670b6cf 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -29,6 +29,7 @@ enum mmop {
 };
 
 typedef int (*memory_post_plug_callback_t)(u64 addr, u64 size);
+typedef int (*memory_pre_unplug_callback_t)(u64 addr, u64 size);
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 struct page *pfn_to_online_page(unsigned long pfn);
@@ -278,6 +279,9 @@ extern int remove_memory(u64 start, u64 size);
 extern void __remove_memory(u64 start, u64 size);
 extern int offline_and_remove_memory(u64 start, u64 size);
 
+void set_memory_pre_unplug_callback(memory_pre_unplug_callback_t callback);
+int memory_pre_unplug_call(u64 addr, u64 size);
+
 #else
 static inline void try_offline_node(int nid) {}
 
@@ -293,6 +297,12 @@ static inline int remove_memory(u64 start, u64 size)
 }
 
 static inline void __remove_memory(u64 start, u64 size) {}
+
+static inline void set_memory_pre_unplug_callback(memory_pre_unplug_callback_t callback) {}
+static inline int memory_pre_unplug_call(u64 addr, u64 size)
+{
+	return 0;
+}
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
 #ifdef CONFIG_MEMORY_HOTPLUG
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 73054ed016fd..fcb6f85c40d0 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -2451,4 +2451,24 @@ int offline_and_remove_memory(u64 start, u64 size)
 	return rc;
 }
 EXPORT_SYMBOL_GPL(offline_and_remove_memory);
+
+static memory_pre_unplug_callback_t memory_pre_unplug_callback __ro_after_init;
+
+void set_memory_pre_unplug_callback(memory_pre_unplug_callback_t callback)
+{
+	/* Fatal error to set callback twice in boot stage */
+	if (memory_pre_unplug_callback)
+		panic("memory_pre_unplug_callback is already registered\n");
+
+	memory_pre_unplug_callback = callback;
+}
+
+int memory_pre_unplug_call(u64 addr, u64 size)
+{
+	if (!memory_pre_unplug_callback)
+		return 0;
+
+	return (*memory_pre_unplug_callback)(addr, size);
+}
+EXPORT_SYMBOL_GPL(memory_pre_unplug_call);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
-- 
2.52.0


^ permalink raw reply related

* [RFC PATCH 1/6] mm/memory_hotplug: Add memory post-plug callback infrastructure
From: Zhenzhong Duan @ 2026-06-04  9:35 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova, michaeluth,
	ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng
In-Reply-To: <20260604093551.1511079-1-zhenzhong.duan@intel.com>

In confidential computing environments like TDX, newly added memory must be
explicitly "accepted" by the guest before it can be safely accessed. When
virtio-mem or other memory hotplug drivers add memory to a TDX guest, the
memory pages are initially in an "unaccepted" state. Accessing unaccepted
memory triggers VM exits and can cause guest crashes. The guest must call
TDX hypercalls to accept each page before use.

This callback infrastructure allows the TDX guest code to register a
handler that will be invoked after memory is plugged, ensuring all newly
added memory is properly accepted before being made available to the
kernel's memory management subsystem.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/linux/memory_hotplug.h | 11 +++++++++++
 mm/memory_hotplug.c            | 20 ++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 815e908c4135..39f0a35a5112 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -28,6 +28,8 @@ enum mmop {
 	MMOP_ONLINE_MOVABLE,
 };
 
+typedef int (*memory_post_plug_callback_t)(u64 addr, u64 size);
+
 #ifdef CONFIG_MEMORY_HOTPLUG
 struct page *pfn_to_online_page(unsigned long pfn);
 
@@ -176,6 +178,9 @@ static inline void pgdat_kswapd_lock_init(pg_data_t *pgdat)
 	mutex_init(&pgdat->kswapd_lock);
 }
 
+void set_memory_post_plug_callback(memory_post_plug_callback_t callback);
+int memory_post_plug_call(u64 addr, u64 size);
+
 #else /* ! CONFIG_MEMORY_HOTPLUG */
 #define pfn_to_online_page(pfn)			\
 ({						\
@@ -221,6 +226,12 @@ static inline bool mhp_supports_memmap_on_memory(void)
 static inline void pgdat_kswapd_lock(pg_data_t *pgdat) {}
 static inline void pgdat_kswapd_unlock(pg_data_t *pgdat) {}
 static inline void pgdat_kswapd_lock_init(pg_data_t *pgdat) {}
+
+static inline void set_memory_post_plug_callback(memory_post_plug_callback_t callback) {}
+static inline int memory_post_plug_call(u64 addr, u64 size)
+{
+	return 0;
+}
 #endif /* ! CONFIG_MEMORY_HOTPLUG */
 
 /*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 40c7915dabe0..73054ed016fd 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1729,6 +1729,26 @@ bool mhp_range_allowed(u64 start, u64 size, bool need_mapping)
 	return false;
 }
 
+static memory_post_plug_callback_t memory_post_plug_callback __ro_after_init;
+
+void set_memory_post_plug_callback(memory_post_plug_callback_t callback)
+{
+	/* Fatal error to set callback twice in boot stage */
+	if (memory_post_plug_callback)
+		panic("memory_post_plug_callback is already registered\n");
+
+	memory_post_plug_callback = callback;
+}
+
+int memory_post_plug_call(u64 addr, u64 size)
+{
+	if (!memory_post_plug_callback)
+		return 0;
+
+	return (*memory_post_plug_callback)(addr, size);
+}
+EXPORT_SYMBOL_GPL(memory_post_plug_call);
+
 #ifdef CONFIG_MEMORY_HOTREMOVE
 /*
  * Scan pfn range [start,end) to find movable/migratable pages (LRU and
-- 
2.52.0


^ permalink raw reply related

* [RFC PATCH 0/6] Support virtio-mem memory hotplug in TDX guests
From: Zhenzhong Duan @ 2026-06-04  9:35 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova, michaeluth,
	ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng

This RFC series explores the start-private memory approach for virtio-mem
CoCo support using TDG.MEM.PAGE.RELEASE. We are seeking feedback from
Kiryl on the CoCo guest implementation, MM experts on the callback
infrastructure and virtio-mem integration, and broader virtio/CoCo
community input on the overall approach. We are not seeking x86 maintainer
review at this stage.

== Background ==

In Confidential Computing (CoCo) guests like TDX, memory hotplug
operations face unique challenges:

1. Newly added memory must be explicitly "accepted" by the guest using
TDG.MEM.PAGE.ACCEPT TDCALL before it can be safely accessed. Accessing
unaccepted memory triggers VM exits and guest crashes.
2. Hypervisor may perform no-op unplug operations, leaving old memory in
place. Re-accepting this already-accepted memory during re-plug operations
returns errors.
3. State management become much more complex, "accepted"/"unccepted" plus
"plugged"/"unplugged".
4. Initial virtio-mem memory may be start-private or start-shared.

A previous series [1][2] supports start-private memory and utilized memory
hotplug notifiers to call tdx_accept_memory() before pages are freed to
the buddy allocator. However, this approach has limitations:

1. virtio-mem operates memory at subblock granularity (e.g., 2MB chunks
within 128MB memory blocks), while generic memory notifiers operate on entire
memory blocks, causing acceptance of unplugged subblocks with no backing
memory.
2. Re-accepting already-accepted memory returns errors. Ignoring these errors
can mislead the guest into believing re-accepted memory is zeroed when it
contains stale data.

Currently, virtio-mem spec doesn't define what kind of hotplugged memory
should be supported for CoCo guest, shared or private or both. There is a
newer series [3][4] supporting start-shared memory in discuss. It converts
shared->private before online (via set_memory_encrypted-> MapGPA + ACCEPT),
and back to shared on unplug (via set_memory_decrypted).

== About this series ==

This series takes a different direction, supporting start-private memory
and addressing the limitations of previous series [1] by implementing a
callback-based infrastructure that integrates TDX memory acceptance and
release operations with proper subblock granularity. See Rick and Paolo's
discussion about using TDG.MEM.PAGE.RELEASE in [1].

The goal is not to compete with existing efforts, but rather to kick off
discussion and seek for suggestions from mm expert whether utilizing
callback-based infrastructure and PAGE.RELEASE API is a viable scheme.

We chose the generic post-plug and pre-unplug callback approach because
it provides a simple proof-of-concept that can support kexec/kdump
scenarios, though it does not support lazy acceptance. We rely on
community discussion to identify better, more upstreamable solutions if
the start-private direction is ultimately adopted.

== More details ==

**Post-plug callbacks** are registered by TDX guests during early boot and
triggered by virtio-mem after successfully requesting memory from the
hypervisor. The callback invokes tdx_accept_memory(), which performs
TDG.MEM.PAGE.ACCEPT TDCALL on the exact memory range that was plugged,
providing subblock-aware granularity. Note that tdx_accept_memory() may
not be fully self-consistent in all environments, as some pages may
remain in an "accepted" state while others do not, since page release is
not supported across all TDX module versions.

**Pre-unplug callbacks** are registered during early boot and invoked by
virtio-mem before requesting memory removal from the hypervisor. The
callback executes tdx_release_memory(), which performs
TDG.MEM.PAGE.RELEASE TDCALL with an optimization strategy that attempts
1GB/2MB page releases first before falling back to 4KB pages for maximum
efficiency. Unlike acceptance operations, tdx_release_memory() maintains
full self-consistency since page acceptance is universally supported
across TDX implementations.

**Error handling strategy** prioritizes system stability by marking the
virtio-mem device as broken whenever TDX operations fail:

1. Post-plug failures: If memory acceptance fails after successful
hypervisor allocation, the device is marked as broken to prevent memory
corruption. The hypervisor-side memory is leaked for the device lifetime.
2. Pre-unplug failures: If TDX memory release fails, the device is marked as
broken and no hypervisor unplug is attempted.
3. Hypervisor unplug failures: If the hypervisor unplug fails after
successful TDX release, the system attempts to re-accept the memory for
consistency. If re-acceptance fails, the device is marked as broken.

This approach avoids complex recovery mechanisms that could fail and
cause state corruption, choosing instead to fail safely by disabling the
device when TDX operations cannot maintain consistent state between guest
and hypervisor.

**PAGE.RELEASE configuration** requires explicit enablement by the
hypervisor during TD creation. The hypervisor must set the
CONFIG_FLAGS.PAGE_RELEASE flag in the TD's configuration to enable
TDG.MEM.PAGE.RELEASE functionality within the guest. Without this
configuration, guests cannot perform memory release operations and must
rely on the hypervisor to handle private memory release. This series
focuses on guest-side changes and does not include hypervisor
modifications, which can be added in future versions if needed.

== Testing ==
Tested with qemu [2] which supports start-private memory:
Basic memory hotplug/unplug test.
Basic kexec/kdump functions test with zero/half/full memory plugged.

Interestingly, it also pass with qemu [4] which supports start-shared memory,
because acceptance triggers memory convert implicitly, but it's slow as
implicit conversion is 4K page granularity.

== Future work ==
support lazy accept

Thanks
Zhenzhong

[1] kernel: https://lore.kernel.org/kvm/20260324-tdx-hotplug-fixes-v1-0-8f29f2c17278@redhat.com/
[2] qemu: https://lore.kernel.org/qemu-devel/20260226140001.3622334-1-marcandre.lureau@redhat.com/
[3] kernel: https://lore.kernel.org/lkml/20260401-coco-v1-1-b9c3072e2d9c@redhat.com/
[4] qemu: https://lore.kernel.org/qemu-devel/20260504-rdm5-v4-0-bdf61e57c1e1@redhat.com/

Zhenzhong Duan (6):
  mm/memory_hotplug: Add memory post-plug callback infrastructure
  mm/memory_hotplug: Add memory pre-unplug callback infrastructure
  virtio-mem: Integrate memory acceptance and release callbacks
  x86/tdx: Register memory post-plug callback for TDX guests
  x86/tdx: Register memory pre-unplug callback for TDX guests
  x86/tdx: Release private memory before private->shared conversion

 arch/x86/include/asm/shared/tdx.h |   2 +
 include/linux/memory_hotplug.h    |  21 ++++
 arch/x86/coco/tdx/tdx.c           | 174 ++++++++++++++++++++++++++++++
 drivers/virtio/virtio_mem.c       |  80 ++++++++++++--
 mm/memory_hotplug.c               |  40 +++++++
 5 files changed, 307 insertions(+), 10 deletions(-)

-- 
2.52.0

^ permalink raw reply

* [PATCH] drm/virtgpu: Use ERR_PTR for fb_create callback error return
From: liuqiangneo @ 2026-06-04  7:11 UTC (permalink / raw)
  To: airlied, kraxel, dmitry.osipenko
  Cc: gurchetansingh, olvaffe, maarten.lankhorst, mripard, tzimmermann,
	simona, dri-devel, virtualization, linux-kernel, Qiang Liu

From: Qiang Liu <liuqiang@kylinos.cn>

The fb_create callback returns an error code encoded with ERR_PTR()
on failure. This lets the caller see the actual error code instead of
just a generic failure.

Signed-off-by: Qiang Liu <liuqiang@kylinos.cn>
---
 drivers/gpu/drm/virtio/virtgpu_display.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_display.c b/drivers/gpu/drm/virtio/virtgpu_display.c
index 44ffffec550f..85ea252c658e 100644
--- a/drivers/gpu/drm/virtio/virtgpu_display.c
+++ b/drivers/gpu/drm/virtio/virtgpu_display.c
@@ -344,7 +344,7 @@ virtio_gpu_user_framebuffer_create(struct drm_device *dev,
 	if (ret) {
 		kfree(virtio_gpu_fb);
 		drm_gem_object_put(obj);
-		return NULL;
+		return ERR_PTR(ret);
 	}
 
 	return &virtio_gpu_fb->base;
-- 
2.43.0


^ permalink raw reply related

* [PATCH 2/2] iommu/virtio: Avoid using the list iterator past the loop in viommu_add_resv_mem()
From: Maoyi Xie @ 2026-06-04  5:18 UTC (permalink / raw)
  To: joro, will, robin.murphy, jpb
  Cc: iommu, linux-kernel, virtualization, Maoyi Xie
In-Reply-To: <20260604051816.2976221-1-maoyixie.tju@gmail.com>

viommu_add_resv_mem() walks vdev->resv_regions to find the insertion
point. When every element has a smaller start address, the
list_for_each_entry() iterator ends up one past the last entry, and
&next->list then aliases the list head, so the following list_add_tail()
still appends at the tail. The result is correct, but using the iterator
after the loop is undefined per the list_for_each_entry() contract.

The loop only needs a list_head as the insertion point, so iterate with
list_for_each() and keep the typed list_entry() dereference inside the loop
body. No functional change.

Suggested-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com>
---
 drivers/iommu/virtio-iommu.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 587fc13197f1..1d58d6b626a5 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -486,7 +486,8 @@ static int viommu_add_resv_mem(struct viommu_endpoint *vdev,
 	size_t size;
 	u64 start64, end64;
 	phys_addr_t start, end;
-	struct iommu_resv_region *region = NULL, *next;
+	struct iommu_resv_region *region = NULL;
+	struct list_head *pos;
 	unsigned long prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
 
 	start = start64 = le64_to_cpu(mem->start);
@@ -520,11 +521,14 @@ static int viommu_add_resv_mem(struct viommu_endpoint *vdev,
 		return -ENOMEM;
 
 	/* Keep the list sorted */
-	list_for_each_entry(next, &vdev->resv_regions, list) {
+	list_for_each(pos, &vdev->resv_regions) {
+		struct iommu_resv_region *next =
+			list_entry(pos, struct iommu_resv_region, list);
+
 		if (next->start > region->start)
 			break;
 	}
-	list_add_tail(&region->list, &next->list);
+	list_add_tail(&region->list, pos);
 	return 0;
 }
 
-- 
2.34.1


^ permalink raw reply related

* [PATCH 1/2] iommu: Avoid using the list iterator past the loop in iommu_insert_resv_region()
From: Maoyi Xie @ 2026-06-04  5:18 UTC (permalink / raw)
  To: joro, will, robin.murphy, jpb
  Cc: iommu, linux-kernel, virtualization, Maoyi Xie
In-Reply-To: <20260604051816.2976221-1-maoyixie.tju@gmail.com>

iommu_insert_resv_region() walks @regions to find the insertion point.
When no element compares greater, the list_for_each_entry() iterator ends
up one past the last entry, so &iter->list aliases the list head through
container_of() offset cancellation and the following list_add_tail() still
targets the tail. The result is correct, but using the iterator after the
loop is undefined per the list_for_each_entry() contract.

The loop only needs a list_head to use as the insertion point, not the
entry itself, so iterate with list_for_each() and keep the typed
list_entry() dereference inside the loop body. No functional change.

Suggested-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com>
---
 drivers/iommu/iommu.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 61c12ba78206..f9f53db9696f 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -862,6 +862,7 @@ static int iommu_insert_resv_region(struct iommu_resv_region *new,
 				    struct list_head *regions)
 {
 	struct iommu_resv_region *iter, *tmp, *nr, *top;
+	struct list_head *pos;
 	LIST_HEAD(stack);
 
 	nr = iommu_alloc_resv_region(new->start, new->length,
@@ -870,12 +871,14 @@ static int iommu_insert_resv_region(struct iommu_resv_region *new,
 		return -ENOMEM;
 
 	/* First add the new element based on start address sorting */
-	list_for_each_entry(iter, regions, list) {
+	list_for_each(pos, regions) {
+		iter = list_entry(pos, struct iommu_resv_region, list);
+
 		if (nr->start < iter->start ||
 		    (nr->start == iter->start && nr->type <= iter->type))
 			break;
 	}
-	list_add_tail(&nr->list, &iter->list);
+	list_add_tail(&nr->list, pos);
 
 	/* Merge overlapping segments of type nr->type in @regions, if any */
 	list_for_each_entry_safe(iter, tmp, regions, list) {
-- 
2.34.1


^ permalink raw reply related

* [PATCH 0/2] iommu: avoid using list iterators past the loop in resv region insertion
From: Maoyi Xie @ 2026-06-04  5:18 UTC (permalink / raw)
  To: joro, will, robin.murphy, jpb
  Cc: iommu, linux-kernel, virtualization, Maoyi Xie
In-Reply-To: <CAHPEe=G-FZvEXjkE+vAXN5MHXCtsOaUoKwg2RQL6m=om+c20hQ@mail.gmail.com>

This follows up the inquiry at [1], where Will and Robin agreed that the
two list iterator sites in drivers/iommu/ that run past the loop end are
worth fixing.

Both loops only walk a list to find a list_head to insert before, not to
operate on the entries. Per Robin's suggestion, rather than tracking an
explicit insert_before pointer, each loop now iterates with list_for_each()
and keeps the list_entry() dereference inside the loop body. There is no
functional change in either patch.

[1] https://lore.kernel.org/all/CAHPEe=G-FZvEXjkE+vAXN5MHXCtsOaUoKwg2RQL6m=om+c20hQ@mail.gmail.com/

Maoyi Xie (2):
  iommu: Avoid using the list iterator past the loop in
    iommu_insert_resv_region()
  iommu/virtio: Avoid using the list iterator past the loop in
    viommu_add_resv_mem()

 drivers/iommu/iommu.c        |  7 +++++--
 drivers/iommu/virtio-iommu.c | 10 +++++++---
 2 files changed, 12 insertions(+), 5 deletions(-)

base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
--
2.34.1

^ permalink raw reply

* [PATCH] tools/virtio: fix build for kmalloc_obj API and missing stubs
From: Michael S. Tsirkin @ 2026-06-03 21:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jason Wang, Xuan Zhuo, Eugenio Pérez, virtualization

Add stubs for kmalloc_obj() and kmalloc_objs() to the tools/virtio
test harness, matching the new kernel allocator API. Also add the
DMA_ATTR_CPU_CACHE_CLEAN definition and include kernel.h from err.h
for the unlikely() macro.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 tools/virtio/linux/dma-mapping.h | 2 ++
 tools/virtio/linux/err.h         | 1 +
 tools/virtio/linux/kernel.h      | 6 ++++++
 3 files changed, 9 insertions(+)

diff --git a/tools/virtio/linux/dma-mapping.h b/tools/virtio/linux/dma-mapping.h
index fddfa2fbb276..8d1a16cb20db 100644
--- a/tools/virtio/linux/dma-mapping.h
+++ b/tools/virtio/linux/dma-mapping.h
@@ -60,4 +60,6 @@ enum dma_data_direction {
  */
 #define DMA_MAPPING_ERROR		(~(dma_addr_t)0)
 
+#define DMA_ATTR_CPU_CACHE_CLEAN	(1UL << 11)
+
 #endif
diff --git a/tools/virtio/linux/err.h b/tools/virtio/linux/err.h
index 0943c644a701..b7b4cb516dc9 100644
--- a/tools/virtio/linux/err.h
+++ b/tools/virtio/linux/err.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef ERR_H
 #define ERR_H
+#include <linux/kernel.h>
 #define MAX_ERRNO	4095
 
 #define IS_ERR_VALUE(x) unlikely((x) >= (unsigned long)-MAX_ERRNO)
diff --git a/tools/virtio/linux/kernel.h b/tools/virtio/linux/kernel.h
index 416d02703f61..104abf9d1aee 100644
--- a/tools/virtio/linux/kernel.h
+++ b/tools/virtio/linux/kernel.h
@@ -65,6 +65,12 @@ static inline void *kmalloc_array(unsigned n, size_t s, gfp_t gfp)
 	return kmalloc(n * s, gfp);
 }
 
+#define kmalloc_obj(VAR_OR_TYPE, ...) \
+	((typeof(VAR_OR_TYPE) *)kmalloc(sizeof(typeof(VAR_OR_TYPE)), 0))
+
+#define kmalloc_objs(VAR_OR_TYPE, COUNT, ...) \
+	((typeof(VAR_OR_TYPE) *)kmalloc(sizeof(typeof(VAR_OR_TYPE)) * (COUNT), 0))
+
 static inline void *kzalloc(size_t s, gfp_t gfp)
 {
 	void *p = kmalloc(s, gfp);
-- 
MST


^ permalink raw reply related

* [PATCH v3 4/4] virtio_console: fix race between hvc put_chars and virtqueue teardown on freeze
From: Sungho Bae @ 2026-06-03 18:37 UTC (permalink / raw)
  To: amit, arnd, gregkh; +Cc: virtualization, linux-kernel, Sungho Bae
In-Reply-To: <20260603183757.21587-1-baver.bae@gmail.com>

From: Sungho Bae <baver.bae@lge.com>

With no_console_suspend enabled, hvc console output can continue while
virtio_console is freezing. In that window, put_chars can still enqueue
buffers to the output virtqueue while virtcons_freeze is tearing queues
down, triggering a BUG_ON in virtqueue_detach_unused_buf_split:

  BUG_ON(vq->vq.num_free != vq->split.vring.num)

Add a pm_freezing flag to ports_device. Set it via smp_store_release()
at the start of virtcons_freeze(); put_chars() and __send_to_port() drop
output while the flag is set, checked via smp_load_acquire().

The check in __send_to_port() is placed under outvq_lock, making it
atomic with remove_port_data() which also acquires outvq_lock. Once
remove_port_data() returns for a given port, no concurrent
__send_to_port() can add buffers before remove_vqs() tears down the vq.

After setting pm_freezing, acquire and release outvq_lock for each port
(protected by ports_lock to prevent list manipulation races) before
calling virtio_reset_device(). A TX thread that already passed the
pm_freezing check may still hold outvq_lock while spinning for host
acknowledgment; the drain loop ensures all such threads have completed
before the device is reset.

Clear pm_freezing in virtcons_restore() only after all port->out_vq
pointers have been reassigned to the newly allocated virtqueues,
preventing TX paths from dereferencing freed vqs during restore.

Link: https://sashiko.dev/#/patchset/20260519162242.7324-1-baver.bae%40gmail.com
Signed-off-by: Sungho Bae <baver.bae@lge.com>
---
 drivers/char/virtio_console.c | 70 ++++++++++++++++++++++++++++++++---
 1 file changed, 64 insertions(+), 6 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index dd31f7069e19..482d57a311c6 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -157,6 +157,12 @@ struct ports_device {
 
 	/* Major number for this device.  Ports will be created as minors. */
 	int chr_major;
+
+	/*
+	 * Set to true during PM freeze to block TX paths that may race
+	 * with virtqueue teardown (e.g. hvc put_chars with no_console_suspend).
+	 */
+	bool pm_freezing;
 };
 
 struct port_stats {
@@ -611,6 +617,17 @@ static ssize_t __send_to_port(struct port *port, struct scatterlist *sg,
 		goto free_and_done;
 	}
 
+	/*
+	 * Check freeze flag under the lock so that the flag check and
+	 * virtqueue_add_outbuf() are atomic with respect to
+	 * remove_port_data() which also takes outvq_lock.  This
+	 * guarantees that once remove_port_data() returns, no new
+	 * buffers can be added before remove_vqs() tears down the vq.
+	 * Pairs with smp_store_release() in virtcons_freeze/restore.
+	 */
+	if (smp_load_acquire(&portdev->pm_freezing)) /* pairs with freeze/restore */
+		goto free_and_done;
+
 	out_vq = port->out_vq;
 
 	reclaim_consumed_buffers(port);
@@ -1125,14 +1142,27 @@ static ssize_t put_chars(u32 vtermno, const u8 *buf, size_t count)
 		return -EPIPE;
 
 	/*
-	 * Silently drop output if device hot-unplug is in progress.
-	 * portdev was NULLed by unplug_port() after hvc_remove() was
-	 * already called, so the hvc layer will stop invoking put_chars()
-	 * very soon. Returning count avoids a pointless retry loop in the
-	 * interim.
+	 * Silently drop output in two cases, both by returning count so
+	 * that the hvc layer does not spin-retry:
+	 *
+	 *  1. Device hot-unplug (!portdev): portdev was NULLed by
+	 *     unplug_port() after hvc_remove() was already called, so
+	 *     the hvc layer will stop invoking put_chars() very soon.
+	 *     Returning count avoids a pointless retry loop in the
+	 *     interim.
+	 *
+	 *  2. PM freeze (pm_freezing): the hvc console stays active
+	 *     under no_console_suspend but virtqueues are being torn
+	 *     down.  Drop the output silently so the hvc layer does not
+	 *     stall suspend.
+	 *
+	 * This early check avoids a pointless GFP_ATOMIC allocation;
+	 * __send_to_port() rechecks under outvq_lock for correctness.
+	 * Pairs with smp_store_release() in virtcons_freeze/restore.
 	 */
 	portdev = READ_ONCE(port->portdev);
-	if (!portdev)
+	if (!portdev ||
+	    smp_load_acquire(&portdev->pm_freezing)) /* pairs with freeze/restore */
 		return count;
 
 	pbuf = alloc_buf(portdev->vdev, count, 0, GFP_ATOMIC);
@@ -2002,6 +2032,7 @@ static int virtcons_probe(struct virtio_device *vdev)
 	/* Attach this portdev to this virtio_device, and vice-versa. */
 	portdev->vdev = vdev;
 	vdev->priv = portdev;
+	portdev->pm_freezing = false;
 
 	portdev->chr_major = register_chrdev(0, "virtio-portsdev",
 					     &portdev_fops);
@@ -2119,9 +2150,30 @@ static int virtcons_freeze(struct virtio_device *vdev)
 {
 	struct ports_device *portdev;
 	struct port *port;
+	unsigned long flags;
 
 	portdev = vdev->priv;
 
+	/*
+	 * Block TX paths (put_chars, __send_to_port) before resetting the
+	 * device and tearing down virtqueues.  This prevents races with
+	 * hvc console writes that remain active under no_console_suspend.
+	 */
+	smp_store_release(&portdev->pm_freezing, true);
+
+	/*
+	 * Synchronize with any concurrent __send_to_port() that may have
+	 * passed the pm_freezing check. By acquiring and releasing the
+	 * outvq_lock for each port, we ensure all active TX paths have
+	 * completed before we reset the device.
+	 */
+	spin_lock_irqsave(&portdev->ports_lock, flags);
+	list_for_each_entry(port, &portdev->ports, list) {
+		spin_lock(&port->outvq_lock);
+		spin_unlock(&port->outvq_lock);
+	}
+	spin_unlock_irqrestore(&portdev->ports_lock, flags);
+
 	virtio_reset_device(vdev);
 
 	if (use_multiport(portdev))
@@ -2192,6 +2244,12 @@ static int virtcons_restore(struct virtio_device *vdev)
 	if (use_multiport(portdev))
 		fill_queue(portdev->c_ivq, &portdev->c_ivq_lock);
 
+	/*
+	 * Allow TX paths only after all port->out_vq pointers have
+	 * been reassigned to the newly allocated virtqueues.
+	 */
+	smp_store_release(&portdev->pm_freezing, false);
+
 	return 0;
 }
 #endif
-- 
2.34.1


^ permalink raw reply related

* [PATCH v3 3/4] virtio_console: fix control queue race during restore
From: Sungho Bae @ 2026-06-03 18:37 UTC (permalink / raw)
  To: amit, arnd, gregkh; +Cc: virtualization, linux-kernel, Sungho Bae
In-Reply-To: <20260603183757.21587-1-baver.bae@gmail.com>

From: Sungho Bae <baver.bae@lge.com>

In virtcons_restore(), after virtio_device_ready() sets DRIVER_OK, the
device becomes active. If the control receive queue (c_ivq) is populated
immediately, the host can instantly deliver pending control messages
(e.g., VIRTIO_CONSOLE_PORT_REMOVE).

This triggers the control_work_handler(), which can modify the
portdev->ports list concurrently with the unprotected list_for_each_entry
loop in virtcons_restore(), leading to list corruption or Use-After-Free.

Fix this by deferring the population of the control receive queue
(fill_queue for c_ivq) until after the list iteration is complete. This
ensures the host cannot inject control messages during the vulnerable
window.

Signed-off-by: Sungho Bae <baver.bae@lge.com>
---
 drivers/char/virtio_console.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 589a12261e23..dd31f7069e19 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -2164,9 +2164,6 @@ static int virtcons_restore(struct virtio_device *vdev)
 
 	virtio_device_ready(portdev->vdev);
 
-	if (use_multiport(portdev))
-		fill_queue(portdev->c_ivq, &portdev->c_ivq_lock);
-
 	list_for_each_entry(port, &portdev->ports, list) {
 		port->in_vq = portdev->in_vqs[port->id];
 		port->out_vq = portdev->out_vqs[port->id];
@@ -2183,6 +2180,18 @@ static int virtcons_restore(struct virtio_device *vdev)
 		if (port->guest_connected)
 			send_control_msg(port, VIRTIO_CONSOLE_PORT_OPEN, 1);
 	}
+
+	/*
+	 * Populate the control receive queue only after the list iteration
+	 * is complete. If we fill this queue before iterating, the host could
+	 * immediately deliver a VIRTIO_CONSOLE_PORT_REMOVE message.
+	 * This would trigger the control workqueue, which modifies the
+	 * portdev->ports list concurrently with the unprotected loop above,
+	 * leading to a Use-After-Free and list corruption.
+	 */
+	if (use_multiport(portdev))
+		fill_queue(portdev->c_ivq, &portdev->c_ivq_lock);
+
 	return 0;
 }
 #endif
-- 
2.34.1


^ permalink raw reply related

* [PATCH v3 2/4] virtio_console: fix hot-unplug races in TX paths
From: Sungho Bae @ 2026-06-03 18:37 UTC (permalink / raw)
  To: amit, arnd, gregkh; +Cc: virtualization, linux-kernel, Sungho Bae
In-Reply-To: <20260603183757.21587-1-baver.bae@gmail.com>

From: Sungho Bae <baver.bae@lge.com>

When a port is hot-unplugged, unplug_port() nullifies port->portdev.
However, concurrent TX paths (__send_to_port, put_chars) could read a
stale pointer or encounter a NULL pointer dereference.

Add READ_ONCE(port->portdev) and NULL checks in the TX paths. In
__send_to_port(), move the out_vq assignment inside the outvq_lock and
check portdev under the lock. Correspondingly, update unplug_port() to
NULL out port->portdev while holding the outvq_lock to serialize with
__send_to_port().

In put_chars(), return count instead of 0 on unplug to prevent the hvc
layer from spinning in an infinite retry loop.

Signed-off-by: Sungho Bae <baver.bae@lge.com>
---
 drivers/char/virtio_console.c | 37 +++++++++++++++++++++++++++++------
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index bbf5b3825f12..589a12261e23 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -600,11 +600,19 @@ static ssize_t __send_to_port(struct port *port, struct scatterlist *sg,
 	int err;
 	unsigned long flags;
 	unsigned int len;
-
-	out_vq = port->out_vq;
+	struct ports_device *portdev;
 
 	spin_lock_irqsave(&port->outvq_lock, flags);
 
+	portdev = READ_ONCE(port->portdev);
+
+	if (!portdev) {
+		in_count = 0;
+		goto free_and_done;
+	}
+
+	out_vq = port->out_vq;
+
 	reclaim_consumed_buffers(port);
 
 	err = virtqueue_add_outbuf(out_vq, sg, nents, buf, GFP_ATOMIC);
@@ -1110,12 +1118,24 @@ static ssize_t put_chars(u32 vtermno, const u8 *buf, size_t count)
 	struct port *port;
 	struct scatterlist sg[1];
 	struct port_buffer *pbuf;
+	struct ports_device *portdev;
 
 	port = find_port_by_vtermno(vtermno);
 	if (!port)
 		return -EPIPE;
 
-	pbuf = alloc_buf(port->portdev->vdev, count, 0, GFP_ATOMIC);
+	/*
+	 * Silently drop output if device hot-unplug is in progress.
+	 * portdev was NULLed by unplug_port() after hvc_remove() was
+	 * already called, so the hvc layer will stop invoking put_chars()
+	 * very soon. Returning count avoids a pointless retry loop in the
+	 * interim.
+	 */
+	portdev = READ_ONCE(port->portdev);
+	if (!portdev)
+		return count;
+
+	pbuf = alloc_buf(portdev->vdev, count, 0, GFP_ATOMIC);
 	if (!pbuf)
 		return -ENOMEM;
 
@@ -1503,11 +1523,16 @@ static void unplug_port(struct port *port)
 	remove_port_data(port);
 
 	/*
-	 * We should just assume the device itself has gone off --
-	 * else a close on an open port later will try to send out a
-	 * control message.
+	 * Null out portdev under outvq_lock so that __send_to_port()
+	 * cannot race: it checks port->portdev inside the same lock
+	 * and bails out if NULL, preventing any buffer from being
+	 * enqueued to an already torn-down virtqueue.  Also prevents
+	 * a close on an open port later from sending a stale control
+	 * message.
 	 */
+	spin_lock_irq(&port->outvq_lock);
 	port->portdev = NULL;
+	spin_unlock_irq(&port->outvq_lock);
 
 	sysfs_remove_group(&port->dev->kobj, &port_attribute_group);
 	device_destroy(&port_class, port->dev->devt);
-- 
2.34.1


^ permalink raw reply related

* [PATCH v3 1/4] virtio_console: refactor __send_to_port() buffer ownership
From: Sungho Bae @ 2026-06-03 18:37 UTC (permalink / raw)
  To: amit, arnd, gregkh; +Cc: virtualization, linux-kernel, Sungho Bae
In-Reply-To: <20260603183757.21587-1-baver.bae@gmail.com>

From: Sungho Bae <baver.bae@lge.com>

Modify __send_to_port() to take ownership of a struct port_buffer *
instead of a void * raw buffer.

Previously, put_chars() would pass a raw kmemdup'd buffer and free it
immediately after __send_to_port() returned. This caused a potential
Use-After-Free and data corruption if the virtqueue was shared with
nonblocking writers, as virtqueue_get_buf() might return an older
completed buffer, causing the newly added buffer to be kfree'd while the
host is still DMAing from it.

By transferring ownership of the allocated port_buffer to __send_to_port(),
we ensure that the exact buffer returned by the host is the one that gets
freed, resolving the memory lifecycle mismatch.

Signed-off-by: Sungho Bae <baver.bae@lge.com>
---
 drivers/char/virtio_console.c | 69 +++++++++++++++++++----------------
 1 file changed, 37 insertions(+), 32 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 9a33217c68d9..bbf5b3825f12 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -402,7 +402,7 @@ static void reclaim_dma_bufs(void)
 }
 
 static struct port_buffer *alloc_buf(struct virtio_device *vdev, size_t buf_size,
-				     int pages)
+				     int pages, gfp_t gfp)
 {
 	struct port_buffer *buf;
 
@@ -436,11 +436,10 @@ static struct port_buffer *alloc_buf(struct virtio_device *vdev, size_t buf_size
 
 		/* Increase device refcnt to avoid freeing it */
 		get_device(buf->dev);
-		buf->buf = dma_alloc_coherent(buf->dev, buf_size, &buf->dma,
-					      GFP_KERNEL);
+		buf->buf = dma_alloc_coherent(buf->dev, buf_size, &buf->dma, gfp);
 	} else {
 		buf->dev = NULL;
-		buf->buf = kmalloc(buf_size, GFP_KERNEL);
+		buf->buf = kmalloc(buf_size, gfp);
 	}
 
 	if (!buf->buf)
@@ -595,7 +594,7 @@ static void reclaim_consumed_buffers(struct port *port)
 
 static ssize_t __send_to_port(struct port *port, struct scatterlist *sg,
 			      int nents, size_t in_count,
-			      void *data, bool nonblock)
+			      struct port_buffer *buf, bool nonblock)
 {
 	struct virtqueue *out_vq;
 	int err;
@@ -608,14 +607,14 @@ static ssize_t __send_to_port(struct port *port, struct scatterlist *sg,
 
 	reclaim_consumed_buffers(port);
 
-	err = virtqueue_add_outbuf(out_vq, sg, nents, data, GFP_ATOMIC);
+	err = virtqueue_add_outbuf(out_vq, sg, nents, buf, GFP_ATOMIC);
 
 	/* Tell Host to go! */
 	virtqueue_kick(out_vq);
 
 	if (err) {
 		in_count = 0;
-		goto done;
+		goto free_and_done;
 	}
 
 	if (out_vq->num_free == 0)
@@ -632,10 +631,19 @@ static ssize_t __send_to_port(struct port *port, struct scatterlist *sg,
 	 * buffer and relax the spinning requirement.  The downside is
 	 * we need to kmalloc a GFP_ATOMIC buffer each time the
 	 * console driver writes something out.
+	 *
+	 * Spin until host returns the buffer.
+	 * Capture the returned buf so we can free it.
+	 * If broken, buf == NULL and buf stays in the vq;
+	 * remove_vqs() will call virtqueue_detach_unused_buf() -> free_buf().
 	 */
-	while (!virtqueue_get_buf(out_vq, &len)
+	while (!(buf = virtqueue_get_buf(out_vq, &len))
 		&& !virtqueue_is_broken(out_vq))
 		cpu_relax();
+
+free_and_done:
+	if (buf)
+		free_buf(buf, false);
 done:
 	spin_unlock_irqrestore(&port->outvq_lock, flags);
 
@@ -816,14 +824,14 @@ static ssize_t port_fops_write(struct file *filp, const char __user *ubuf,
 
 	count = min((size_t)(32 * 1024), count);
 
-	buf = alloc_buf(port->portdev->vdev, count, 0);
+	buf = alloc_buf(port->portdev->vdev, count, 0, GFP_KERNEL);
 	if (!buf)
 		return -ENOMEM;
 
 	ret = copy_from_user(buf->buf, ubuf, count);
 	if (ret) {
-		ret = -EFAULT;
-		goto free_buf;
+		free_buf(buf, true);
+		return -EFAULT;
 	}
 
 	/*
@@ -835,15 +843,7 @@ static ssize_t port_fops_write(struct file *filp, const char __user *ubuf,
 	 */
 	nonblock = true;
 	sg_init_one(sg, buf->buf, count);
-	ret = __send_to_port(port, sg, 1, count, buf, nonblock);
-
-	if (nonblock && ret > 0)
-		goto out;
-
-free_buf:
-	free_buf(buf, true);
-out:
-	return ret;
+	return __send_to_port(port, sg, 1, count, buf, nonblock);
 }
 
 struct sg_list {
@@ -932,7 +932,7 @@ static ssize_t port_fops_splice_write(struct pipe_inode_info *pipe,
 		goto error_out;
 
 	occupancy = pipe_buf_usage(pipe);
-	buf = alloc_buf(port->portdev->vdev, 0, occupancy);
+	buf = alloc_buf(port->portdev->vdev, 0, occupancy, GFP_KERNEL);
 
 	if (!buf) {
 		ret = -ENOMEM;
@@ -946,11 +946,12 @@ static ssize_t port_fops_splice_write(struct pipe_inode_info *pipe,
 	sg_init_table(sgl.sg, sgl.size);
 	ret = __splice_from_pipe(pipe, &sd, pipe_to_sg);
 	pipe_unlock(pipe);
+
 	if (likely(ret > 0))
 		ret = __send_to_port(port, buf->sg, sgl.n, sgl.len, buf, true);
-
-	if (unlikely(ret <= 0))
+	else
 		free_buf(buf, true);
+
 	return ret;
 
 error_out:
@@ -1108,21 +1109,25 @@ static ssize_t put_chars(u32 vtermno, const u8 *buf, size_t count)
 {
 	struct port *port;
 	struct scatterlist sg[1];
-	void *data;
-	int ret;
+	struct port_buffer *pbuf;
 
 	port = find_port_by_vtermno(vtermno);
 	if (!port)
 		return -EPIPE;
 
-	data = kmemdup(buf, count, GFP_ATOMIC);
-	if (!data)
+	pbuf = alloc_buf(port->portdev->vdev, count, 0, GFP_ATOMIC);
+	if (!pbuf)
 		return -ENOMEM;
 
-	sg_init_one(sg, data, count);
-	ret = __send_to_port(port, sg, 1, count, data, false);
-	kfree(data);
-	return ret;
+	memcpy(pbuf->buf, buf, count);
+	pbuf->len = count;
+	sg_init_one(sg, pbuf->buf, count);
+
+	/*
+	 * Ownership of pbuf is transferred to __send_to_port().
+	 * Do not touch or free pbuf after this call.
+	 */
+	return __send_to_port(port, sg, 1, count, pbuf, false);
 }
 
 /*
@@ -1295,7 +1300,7 @@ static int fill_queue(struct virtqueue *vq, spinlock_t *lock)
 
 	nr_added_bufs = 0;
 	do {
-		buf = alloc_buf(vq->vdev, PAGE_SIZE, 0);
+		buf = alloc_buf(vq->vdev, PAGE_SIZE, 0, GFP_KERNEL);
 		if (!buf)
 			return -ENOMEM;
 
-- 
2.34.1


^ permalink raw reply related

* [PATCH v3 0/4] virtio_console: fix suspend/resume and hot-unplug races
From: Sungho Bae @ 2026-06-03 18:37 UTC (permalink / raw)
  To: amit, arnd, gregkh; +Cc: virtualization, linux-kernel, Sungho Bae

Hi all,

This patchset addresses several critical race conditions and memory
lifecycle bugs in the virtio_console driver, primarily surrounding
the TX path(`__send_to_port`, `put_chars`), device hot-unplug, and
PM freeze/restore.

The initial motivation for this series was to fix a race where
`hvc_console` writes (which can remain active during suspend
if `no_console_suspend` is enabled) could enqueue buffers while
`virtcons_freeze` was tearing down virtqueues, leading to a `BUG_ON` in
`virtqueue_detach_unused_buf_split()`.

During the investigation and testing of the freeze race, several related
vulnerabilities were uncovered in how TX paths handled buffer ownership and
stale `portdev` pointers during concurrent hot-unplugs. Because these fixes
structurally overlap in `__send_to_port()` and `put_chars()`, they are
submitted together as a single patchset to avoid merge conflicts.

Patch Summary
=============

 1. virtio_console: refactor __send_to_port() buffer ownership

    Refactors buffer ownership in `__send_to_port()`. Previously,
    `put_chars()` would allocate a raw buffer, pass it down, and `kfree()`
    it immediately upon return. If `virtqueue_get_buf()` returned an older
    completed buffer from a previous non-blocking write, the newly queued
    buffer could be freed while the host was still DMA-ing from it.
    By transferring ownership of a `struct port_buffer` to
    `__send_to_port()`, we guarantee only the exact buffer returned by
    the host is freed.

 2. virtio_console: fix hot-unplug races in TX paths

    Hardens the TX paths against concurrent hot-unplugs. It adds
    `READ_ONCE(port->portdev)` checks, bails out cleanly (returning `count`
    to prevent `hvc` infinite retries), and synchronizes `portdev = NULL`
    in `unplug_port()` using the `outvq_lock`.

 3. virtio_console: fix control queue race during restore

    Fixes a race during `virtcons_restore()` where filling the control
    receive queue (c_ivq) immediately after `DRIVER_OK` could trigger the
    control workqueue before the driver finished restoring port states,
    leading to list corruption.

 4. virtio_console: fix race between hvc put_chars and virtqueue teardown

    Addresses the original PM freeze race by introducing a `pm_freezing`
    state. TX paths now safely drop output while freezing.
    A synchronization loop in `virtcons_freeze()` ensures all active TX
    threads have drained before `virtio_reset_device()` is called.

Testing
=======

Runtime-tested on arm64 systems with `no_console_suspend` enabled
during S4 cycles.

Changes
=======

v3:
  Split the original monolithic patch into a 4-part patchset for better
  logical grouping and bisectability.

Sungho Bae (4):
  virtio_console: refactor __send_to_port() buffer ownership
  virtio_console: fix hot-unplug races in TX paths
  virtio_console: fix control queue race during restore
  virtio_console: fix race between hvc put_chars and virtqueue teardown
    on freeze

 drivers/char/virtio_console.c | 177 ++++++++++++++++++++++++++--------
 1 file changed, 137 insertions(+), 40 deletions(-)

-- 
2.34.1

^ permalink raw reply

* Re: [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock
From: Peter Zijlstra @ 2026-06-03 14:26 UTC (permalink / raw)
  To: Dmitry Ilvokhin
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, linux-kernel, linux-mips,
	virtualization, linux-arch, linux-mm, linux-trace-kernel,
	kernel-team, Paul E. McKenney
In-Reply-To: <aiA3dzjKVq2_cpSY@shell.ilvokhin.com>

On Wed, Jun 03, 2026 at 02:17:27PM +0000, Dmitry Ilvokhin wrote:

> > Something a little like so, which is completely untested, except to
> > build kernel/locking/spinlock.o (with clang-23).
> 
> Thanks a lot for taking a look, Peter.
> 
> I like the static_call idea. It's truly zero cost on x86 (and, as you
> note, even a byte smaller). The one caveat is that it relies on
> HAVE_STATIC_CALL_INLINE to stay free. 
> 
> So my plan would be: static_call where HAVE_STATIC_CALL_INLINE is
> available (x86), and a static branch fallback elsewhere, gated behind a
> default-off config so it imposes nothing on arches/kernels that don't
> opt in. I'm mostly interested in x86, but would like arm64 to work too,
> which would use the fallback.

(i386 doesn't have STATIC_CALL_INLINE, but nobody cares about the
performance on that target, so anything goes really ;-)

> 
> Concretely:
> 
> 1. Split the sleepable-lock patches out and send them separately.
>    They're independent of the static call work and look far less
>    controversial.
> 
> 2. Convert the paravirt spinlock unlock to a static_call, as the
>    foundation for the unlock tracepoint. I'm happy to take a stab at it.
>    Let me know if you'd rather do it yourself.

Yeah, I think that patch as-is *should* work, but like said, I haven't
even tried it, so it could be terribly broken :-)

> 3. Build the unlock tracepoint on top: static_call where it's cheap,
>    config-gated static_branch fallback where it isn't.

Right, so I think we need some sort of custom callback for tracepoint
enable/disable. Its been a minute since I dug through the tracepoint
code, but I don't think it provides that with a convenient wrapper, but
it should be doable.

One thing to note is that when you set the tracepoint unlock function,
it should either tail-call into the original function, or you have to
create two unlock_trace functions, one for native and one for paravirt
and pick the right one.

> Does this plan sound reasonable to you?

Yeah, should work.

> > Also, I think someone should go do some performance runs with
> > ARCH_INLINE_SPIN_* set for x86 just like for s390.
> 
> That's a good point, I'll run benchmarks and report back with the
> results.

Thanks!

^ permalink raw reply

* Re: [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock
From: Dmitry Ilvokhin @ 2026-06-03 14:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, linux-kernel, linux-mips,
	virtualization, linux-arch, linux-mm, linux-trace-kernel,
	kernel-team, Paul E. McKenney
In-Reply-To: <20260603120811.GW3493090@noisy.programming.kicks-ass.net>

On Wed, Jun 03, 2026 at 02:08:11PM +0200, Peter Zijlstra wrote:
> On Thu, May 14, 2026 at 12:34:55PM +0000, Dmitry Ilvokhin wrote:
> 
> > Baseline, in best case scenario of least number of executed
> > instructions.
> > 
> >     3e0:  endbr64                          ; 4 bytes (always executed)
> >     3e4:  movb $0x0,(%rdi)                 ; 3 bytes (unlock,
> >                                            ; always executed)
> >     3e7:  decl %gs:__preempt_count         ; 7 bytes (always executed)
> >     3ee:  je   3f5                         ; 2 bytes (always executed)
> >     3f0:  jmp  __x86_return_thunk          ; 5 bytes (executed if above
> >                                            ; je is not taken)
> >                                            ; rest is not executed
> >     3f5:  call __SCT__preempt_schedule     ; 5 bytes
> >     3fa:  jmp  __x86_return_thunk          ; 5 bytes
> > 
> > Tracepoint (again same case of least number of executed instructions).
> > 
> >     bc0:  endbr64                          ; 4 bytes (always executed)
> >     bc4:  xchg %ax,%ax                     ; 2 bytes (always executed, this is an
> >                                            ; only addition on the execution path).
> >     bc6:  movb $0x0,(%rdi)                 ; 3 bytes (unlock, always executed)
> >     bc9:  decl %gs:__preempt_count         ; 7 bytes (always executed)
> >     bd0:  je   bde                         ; 2 bytes (always executed)
> >     bd2:  jmp  __x86_return_thunk          ; 5 bytes (executed if above
> >                                            ; je is not taken)
> >                                            ; rest is not executed
> >     bd7:  call queued_spin_release_traced  ; 5 bytes
> >     bdc:  jmp  bc9                         ; 2 bytes
> >     bde:  call __SCT__preempt_schedule     ; 5 bytes
> >     be3:  jmp  __x86_return_thunk          ; 5 bytes
> > 
> 
> So I've been playing with this a bit, and it is all really sad.
> 
> Now, since pretty much everybody+dog will have PARAVIRT_SPINLOCK=y, the
> 'best' solution would be changing that paravirt call with a
> static_call(), that actually shrinks the code by 1 byte.
> 
> And then this tracepoint nonsense can simply use a different unlock
> function, just like paravirt.
> 
> 0000 00000000000001d0 <_raw_spin_unlock>:
> 0000  1d0:	f3 0f 1e fa          	endbr64
> 0004  1d4:	ff 15 00 00 00 00    	call   *0x0(%rip)        # 1da <_raw_spin_unlock+0xa>	1d6: R_X86_64_PC32	pv_ops_lock+0x4
> 000a  1da:	65 ff 0d 00 00 00 00 	decl   %gs:0x0(%rip)        # 1e1 <_raw_spin_unlock+0x11>	1dd: R_X86_64_PC32	__preempt_count-0x4
> 0011  1e1:	74 06                	je     1e9 <_raw_spin_unlock+0x19>
> 0013  1e3:	2e e9 00 00 00 00    	cs jmp 1e9 <_raw_spin_unlock+0x19>	1e5: R_X86_64_PLT32	__x86_return_thunk-0x4
> 0019  1e9:	e8 00 00 00 00       	call   1ee <_raw_spin_unlock+0x1e>	1ea: R_X86_64_PLT32	__SCT__preempt_schedule-0x4
> 001e  1ee:	2e e9 00 00 00 00    	cs jmp 1f4 <_raw_spin_unlock+0x24>	1f0: R_X86_64_PLT32	__x86_return_thunk-0x4
> 
> 
> 0000 00000000000001d0 <_raw_spin_unlock>:
> 0000  1d0:	f3 0f 1e fa          	endbr64
> 0004  1d4:	e8 00 00 00 00       	call   1d9 <_raw_spin_unlock+0x9>	1d5: R_X86_64_PLT32	__SCT__queued_spin_unlock-0x4
> 0009  1d9:	65 ff 0d 00 00 00 00 	decl   %gs:0x0(%rip)        # 1e0 <_raw_spin_unlock+0x10>	1dc: R_X86_64_PC32	__preempt_count-0x4
> 0010  1e0:	74 06                	je     1e8 <_raw_spin_unlock+0x18>
> 0012  1e2:	2e e9 00 00 00 00    	cs jmp 1e8 <_raw_spin_unlock+0x18>	1e4: R_X86_64_PLT32	__x86_return_thunk-0x4
> 0018  1e8:	e8 00 00 00 00       	call   1ed <_raw_spin_unlock+0x1d>	1e9: R_X86_64_PLT32	__SCT__preempt_schedule-0x4
> 001d  1ed:	2e e9 00 00 00 00    	cs jmp 1f3 <_raw_spin_unlock+0x23>	1ef: R_X86_64_PLT32	__x86_return_thunk-0x4
> 
> 
> Something a little like so, which is completely untested, except to
> build kernel/locking/spinlock.o (with clang-23).

Thanks a lot for taking a look, Peter.

I like the static_call idea. It's truly zero cost on x86 (and, as you
note, even a byte smaller). The one caveat is that it relies on
HAVE_STATIC_CALL_INLINE to stay free. 

So my plan would be: static_call where HAVE_STATIC_CALL_INLINE is
available (x86), and a static branch fallback elsewhere, gated behind a
default-off config so it imposes nothing on arches/kernels that don't
opt in. I'm mostly interested in x86, but would like arm64 to work too,
which would use the fallback.

Concretely:

1. Split the sleepable-lock patches out and send them separately.
   They're independent of the static call work and look far less
   controversial.

2. Convert the paravirt spinlock unlock to a static_call, as the
   foundation for the unlock tracepoint. I'm happy to take a stab at it.
   Let me know if you'd rather do it yourself.

3. Build the unlock tracepoint on top: static_call where it's cheap,
   config-gated static_branch fallback where it isn't.

Does this plan sound reasonable to you?

> 
> Also, I think someone should go do some performance runs with
> ARCH_INLINE_SPIN_* set for x86 just like for s390.

That's a good point, I'll run benchmarks and report back with the
results.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox