Linux virtualization list
* Re: [PATCH net-next v3] vsock/virtio: rewrite MSG_ZEROCOPY flag handling
From: Arseniy Krasnov @ 2026-06-24  7:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
	Bobby Eshleman, Xuan Zhuo, Eugenio Pérez, Simon Horman, kvm,
	virtualization, netdev, linux-kernel, oxffffaa, rulkc
In-Reply-To: <20260623132014-mutt-send-email-mst@kernel.org>


6/23/26 20:26, Michael S. Tsirkin wrote:
> On Tue, Jun 23, 2026 at 06:38:19PM +0300, Arseniy Krasnov wrote:
>> Logically it was based on TCP implementation, so to make further support
>> easier, rewrite it in the TCP way (like in 'tcp_sendmsg_locked()'). This
>> patch only rewrites flag handling (e.g. it doesn't change logic).
>>
>> Signed-off-by: Arseniy Krasnov <avkrasnov@rulkc.org>
>
> It seems to change logic though:
>
>> ---
>>  Changelog v1->v2:
>>  * Rebase on last 'net-next'. Don't need 'skb_zcopy_set()' now - it was
>>    already added.
>>  Changelog v2->v3:
>>  * Update commit message.
>>  * Remove one empty line.
>>
>>  net/vmw_vsock/virtio_transport_common.c | 47 ++++++++++++-------------
>>  1 file changed, 22 insertions(+), 25 deletions(-)
>>
>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> index 09475007165b..41c2a0b82a8e 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -328,38 +328,35 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>  		return pkt_len;
>>  
>> -	if (info->msg) {
>> -		/* If zerocopy is not enabled by 'setsockopt()', we behave as
>> -		 * there is no MSG_ZEROCOPY flag set.
>> +	if (info->msg && (info->msg->msg_flags & MSG_ZEROCOPY)) {
>> +		/* If 'info->msg' is not NULL, this is only VIRTIO_VSOCK_OP_RW.
>> +		 * 'MSG_ZEROCOPY' flag handling here is based on the same flag
>> +		 * handling from 'tcp_sendmsg_locked()'.
>>  		 */
>> -		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>> -			info->msg->msg_flags &= ~MSG_ZEROCOPY;
> So previously without SOCK_ZEROCOPY, MSG_ZEROCOPY was always ignored...
>
>
>> +		if (info->msg->msg_ubuf) {
>> +			uarg = info->msg->msg_ubuf;
>> +			can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
> now it's not in this case?

Yes, this case is currently for io_uring only, because io_uring doesn't set SOCK_ZEROCOPY to perform zerocopy transmission.

>
>
> Maybe the right call, but saying "does not change logic" seems wrong.

Agreed, I need to update the commit message again :)

Thanks

>
>
>> +		} else if (sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY)) {
>> +			uarg = msg_zerocopy_realloc(sk_vsock(vsk), pkt_len,
>> +						    NULL, false);
>> +			if (!uarg) {
>> +				virtio_transport_put_credit(vvs, pkt_len);
>> +				return -ENOMEM;
>> +			}
>>  
>> -		if (info->msg->msg_flags & MSG_ZEROCOPY)
>>  			can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
>> +			if (!can_zcopy)
>> +				uarg_to_msgzc(uarg)->zerocopy = 0;
>>  
>> +			have_uref = true;
>> +		}
>> +
>> +		/* 'can_zcopy' means that this transmission will be
>> +		 * in zerocopy way (e.g. using 'frags' array).
>> +		 */
>>  		if (can_zcopy)
>>  			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>>  					    (MAX_SKB_FRAGS * PAGE_SIZE));
>> -
>> -		if (info->msg->msg_flags & MSG_ZEROCOPY &&
>> -		    info->op == VIRTIO_VSOCK_OP_RW) {
>> -			uarg = info->msg->msg_ubuf;
>> -
>> -			if (!uarg) {
>> -				uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>> -							    pkt_len, NULL, false);
>> -				if (!uarg) {
>> -					virtio_transport_put_credit(vvs, pkt_len);
>> -					return -ENOMEM;
>> -				}
>> -
>> -				if (!can_zcopy)
>> -					uarg_to_msgzc(uarg)->zerocopy = 0;
>> -
>> -				have_uref = true;
>> -			}
>> -		}
>>  	}
>>  
>>  	rest_len = pkt_len;
>> -- 
>> 2.25.1
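
The behavior change raised in this exchange can be checked with a small
user-space model. This is only a sketch, not kernel code: 'struct zc_input',
'old_can_zcopy()', 'new_can_zcopy()' and 'count_differences()' are
hypothetical names, the 'transport_ok' field stands in for the result of
'virtio_transport_can_zcopy()', and 'can_zcopy' is assumed to be false before
the hunk runs.

```c
#include <stdbool.h>

/* Hypothetical model of the inputs the hunk looks at; not kernel types. */
struct zc_input {
	bool msg_zerocopy;	/* MSG_ZEROCOPY set in msg_flags */
	bool sock_zerocopy;	/* SOCK_ZEROCOPY enabled by setsockopt() */
	bool has_ubuf;		/* msg->msg_ubuf != NULL (io_uring case) */
	bool transport_ok;	/* stand-in for virtio_transport_can_zcopy() */
};

/* Old flow: without SOCK_ZEROCOPY the MSG_ZEROCOPY flag is cleared first. */
static bool old_can_zcopy(struct zc_input in)
{
	if (!in.sock_zerocopy)
		in.msg_zerocopy = false;

	return in.msg_zerocopy && in.transport_ok;
}

/* New flow: a caller-supplied ubuf is honoured regardless of SOCK_ZEROCOPY. */
static bool new_can_zcopy(struct zc_input in)
{
	if (!in.msg_zerocopy)
		return false;
	if (in.has_ubuf)
		return in.transport_ok;
	if (in.sock_zerocopy)
		return in.transport_ok;

	return false;
}

/* Count the input combinations where the two versions disagree. */
static int count_differences(void)
{
	int diff = 0;

	for (int bits = 0; bits < 16; bits++) {
		struct zc_input in = {
			.msg_zerocopy  = bits & 1,
			.sock_zerocopy = bits & 2,
			.has_ubuf      = bits & 4,
			.transport_ok  = bits & 8,
		};

		if (old_can_zcopy(in) != new_can_zcopy(in))
			diff++;
	}

	return diff;
}
```

In this model the two versions disagree for exactly one combination:
MSG_ZEROCOPY set, 'msg_ubuf' present, SOCK_ZEROCOPY not enabled, and the
transport able to do zerocopy. That is the io_uring case described in the
reply above.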

* Re: [PATCH v3] virtio_net: disable cb when NAPI is busy-polled
From: Michael S. Tsirkin @ 2026-06-24  7:08 UTC (permalink / raw)
  To: Longjun Tang
  Cc: xuanzhuo, jasowang, edumazet, virtualization, netdev, tanglongjun
In-Reply-To: <20260624070206.85467-1-lange_tang@163.com>

On Wed, Jun 24, 2026 at 03:02:06PM +0800, Longjun Tang wrote:
> From: Longjun Tang <tanglongjun@kylinos.cn>
> 
> When busy-poll is active, napi_schedule_prep() returns false in
> virtqueue_napi_schedule(), so virtqueue_disable_cb() is skipped.
> The device may keep firing irqs until reaches virtqueue_napi_complete().
> Under load (received == budget), it will lead to a large number
> of spurious interrupts.
> 
> Fix it by disabling the callback at the virtnet_poll() entry. This keeps
> the callback off while we poll and re-enable

and it is re-enabled

> by virtqueue_napi_complete()
> when going idle.
> 
> Fixes: ceef438d613f ("virtio_net: remove custom busy_poll")
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Longjun Tang <tanglongjun@kylinos.cn>
> 
> ---
> V1 -> V2: Remain agnostic to busy polling
> V2 -> V3: Add fixes tag
> ---
>  drivers/net/virtio_net.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f4adcfee7a80..0a11f2b32500 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3008,6 +3008,11 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
>  	unsigned int xdp_xmit = 0;
>  	bool napi_complete;
>  
> +	/* Keep callbacks suppressed for the duration of this poll,
> +	 * busy-poll need.

I don't know what "busy-poll need" means. Just drop this part?
In fact, the whole comment can go, we know virtqueue_disable_cb
disables callbacks.

> +	 */
> +	virtqueue_disable_cb(rq->vq);
> +
>  	virtnet_poll_cleantx(rq, budget);
>  
>  	received = virtnet_receive(rq, budget, &xdp_xmit);
> -- 
> 2.43.0


* [PATCH v3] virtio_net: disable cb when NAPI is busy-polled
From: Longjun Tang @ 2026-06-24  7:02 UTC (permalink / raw)
  To: mst, xuanzhuo
  Cc: jasowang, edumazet, virtualization, netdev, tanglongjun,
	lange_tang

From: Longjun Tang <tanglongjun@kylinos.cn>

When busy-poll is active, napi_schedule_prep() returns false in
virtqueue_napi_schedule(), so virtqueue_disable_cb() is skipped.
The device may keep firing irqs until reaches virtqueue_napi_complete().
Under load (received == budget), it will lead to a large number
of spurious interrupts.

Fix it by disabling the callback at the virtnet_poll() entry. This keeps
the callback off while we poll and re-enable by virtqueue_napi_complete()
when going idle.

Fixes: ceef438d613f ("virtio_net: remove custom busy_poll")
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Longjun Tang <tanglongjun@kylinos.cn>

---
V1 -> V2: Remain agnostic to busy polling
V2 -> V3: Add fixes tag
---
 drivers/net/virtio_net.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f4adcfee7a80..0a11f2b32500 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3008,6 +3008,11 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
 	unsigned int xdp_xmit = 0;
 	bool napi_complete;
 
+	/* Keep callbacks suppressed for the duration of this poll,
+	 * busy-poll need.
+	 */
+	virtqueue_disable_cb(rq->vq);
+
 	virtnet_poll_cleantx(rq, budget);
 
 	received = virtnet_receive(rq, budget, &xdp_xmit);
-- 
2.43.0
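
The interrupt behavior described in the commit message can be illustrated
with a toy user-space model. This is a sketch under strong assumptions, not
driver code: 'struct vq_model', 'struct napi_model', 'model_irq()' and
'model_busy_poll()' are hypothetical names, the device is assumed to raise
at most one interrupt per poll pass, and every pass is assumed to consume
the full budget so that the completion path is never reached.

```c
#include <stdbool.h>

/* Hypothetical model state; none of these are kernel structures. */
struct vq_model {
	bool cb_enabled;	/* device is allowed to raise interrupts */
	unsigned int irqs;	/* interrupts delivered so far */
};

struct napi_model {
	bool sched;		/* models NAPI already being owned by a poller */
};

/* Models virtqueue_napi_schedule() as described in the commit message. */
static void model_irq(struct vq_model *vq, struct napi_model *napi)
{
	vq->irqs++;
	if (!napi->sched) {
		/* napi_schedule_prep() succeeded: suppress callbacks. */
		napi->sched = true;
		vq->cb_enabled = false;
	}
	/* Otherwise busy-poll already owns NAPI and callbacks stay on. */
}

/*
 * Run 'rounds' poll passes under load while busy-poll holds NAPI.
 * 'disable_at_entry' models the virtqueue_disable_cb() call added at the
 * top of virtnet_poll().
 */
static unsigned int model_busy_poll(bool disable_at_entry, unsigned int rounds)
{
	struct vq_model vq = { .cb_enabled = true, .irqs = 0 };
	struct napi_model napi = { .sched = true };

	for (unsigned int i = 0; i < rounds; i++) {
		if (disable_at_entry)
			vq.cb_enabled = false;
		/* Packets keep arriving; the device interrupts only if allowed. */
		if (vq.cb_enabled)
			model_irq(&vq, &napi);
	}

	return vq.irqs;
}
```

With callbacks left alone the model takes one interrupt per pass, and with
the disable at poll entry it takes none. Real interrupt rates depend on the
device and on event suppression, which the model ignores.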


* Re: [PATCH 2/2] virtio_balloon: quiesce balloon work before device shutdown
From: Michael S. Tsirkin @ 2026-06-23 19:27 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: David Hildenbrand (Arm), Denis V. Lunev, virtualization,
	linux-kernel
In-Reply-To: <33fba3a4-cebc-4f13-923a-31cb5753222b@virtuozzo.com>

On Tue, Jun 23, 2026 at 09:25:18PM +0200, Denis V. Lunev wrote:
> On 6/22/26 16:38, David Hildenbrand (Arm) wrote:
> > On 6/22/26 15:37, Denis V. Lunev wrote:
> >> Commit 8bd2fa086a04 ("virtio: break and reset virtio devices on
> >> device_shutdown()") added a generic virtio bus .shutdown handler that
> >> breaks and resets every virtio device during device_shutdown(), i.e. on
> >> reboot and kexec.
> >>
> >> virtio_balloon provides no .shutdown of its own, so that generic path
> >> runs while the balloon's asynchronous work is still armed. Once the
> >> device has been broken, virtqueue_add_inbuf() in
> >> virtballoon_free_page_report() returns -EIO and trips its
> >> WARN_ON_ONCE(). On a kernel booted with panic_on_warn that turns an
> >> ordinary reboot, for example a kexec based upgrade, into a fatal panic
> >> in the middle of device_shutdown(), so the machine never reaches the
> >> new kernel.
> >>
> >> Relaxing that single WARN_ON_ONCE() would only hide the symptom: the
> >> inflate/deflate and OOM paths do not warn, they call
> >> wait_event(vb->acked, ...) and would instead block forever on a broken
> >> queue that can no longer complete. The device has to be quiesced, not
> >> just kept quiet.
> > Ah, so
> >
> > 	/* We should always be able to add one buffer to an empty queue. */
> > 	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> >
> > is not actually correct.
> >
> > Yeah, quiescing sounds cleaner, although I am thinking whether we should also
> > warn if virtqueue_add_outbuf() fails, similar to what we do in
> > virtballoon_free_page_report().
> Good catch, will do.

separate patch pls.

> >> Add a .shutdown handler that quiesces the balloon via the shared
> >> virtballoon_quiesce() helper while the device is still alive, and only
> >> then breaks and resets it. The break and reset are repeated here rather
> >> than reused from virtio_dev_shutdown(): drv->shutdown replaces the
> >> generic handler rather than augmenting it, so that drivers such as
> >> virtio-gpu can opt out of the reset. Unlike virtballoon_remove() the
> >> balloon workqueue is not destroyed, as shutdown does not free the
> >> device and cancel_work_sync() together with stop_update already prevent
> >> any further work from being queued.
> >>
> >> Fixes: 8bd2fa086a04 ("virtio: break and reset virtio devices on device_shutdown()")
> >> Signed-off-by: Denis V. Lunev <den@openvz.org>
> >> ---
> >>  drivers/virtio/virtio_balloon.c | 10 ++++++++++
> >>  1 file changed, 10 insertions(+)
> >>
> >> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> >> index 5b02d9191ac6..e35ada767b4b 100644
> >> --- a/drivers/virtio/virtio_balloon.c
> >> +++ b/drivers/virtio/virtio_balloon.c
> >> @@ -1137,6 +1137,15 @@ static void virtballoon_remove(struct virtio_device *vdev)
> >>  	kfree(vb);
> >>  }
> >>  
> >> +static void virtballoon_shutdown(struct virtio_device *vdev)
> >> +{
> >> +	virtballoon_quiesce(vdev->priv);
> >> +
> >> +	virtio_break_device(vdev);
> >> +	virtio_synchronize_cbs(vdev);
> >> +	vdev->config->reset(vdev);
> > I guess it would be good if we wouldn't have to copy what the default handler
> > does, but could instead just have it in a reusable core function?
> Ok. Sounds great. Will do.
> 
> Thanks for review,
>     Den


* Re: [PATCH 2/2] virtio_balloon: quiesce balloon work before device shutdown
From: Denis V. Lunev @ 2026-06-23 19:25 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Denis V. Lunev, mst; +Cc: virtualization, linux-kernel
In-Reply-To: <8b83f251-3a3e-4fc9-8ea9-8d101fb92919@kernel.org>

On 6/22/26 16:38, David Hildenbrand (Arm) wrote:
> On 6/22/26 15:37, Denis V. Lunev wrote:
>> Commit 8bd2fa086a04 ("virtio: break and reset virtio devices on
>> device_shutdown()") added a generic virtio bus .shutdown handler that
>> breaks and resets every virtio device during device_shutdown(), i.e. on
>> reboot and kexec.
>>
>> virtio_balloon provides no .shutdown of its own, so that generic path
>> runs while the balloon's asynchronous work is still armed. Once the
>> device has been broken, virtqueue_add_inbuf() in
>> virtballoon_free_page_report() returns -EIO and trips its
>> WARN_ON_ONCE(). On a kernel booted with panic_on_warn that turns an
>> ordinary reboot, for example a kexec based upgrade, into a fatal panic
>> in the middle of device_shutdown(), so the machine never reaches the
>> new kernel.
>>
>> Relaxing that single WARN_ON_ONCE() would only hide the symptom: the
>> inflate/deflate and OOM paths do not warn, they call
>> wait_event(vb->acked, ...) and would instead block forever on a broken
>> queue that can no longer complete. The device has to be quiesced, not
>> just kept quiet.
> Ah, so
>
> 	/* We should always be able to add one buffer to an empty queue. */
> 	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>
> is not actually correct.
>
> Yeah, quiescing sounds cleaner, although I am thinking whether we should also
> warn if virtqueue_add_outbuf() fails, similar to what we do in
> virtballoon_free_page_report().
Good catch, will do.

>> Add a .shutdown handler that quiesces the balloon via the shared
>> virtballoon_quiesce() helper while the device is still alive, and only
>> then breaks and resets it. The break and reset are repeated here rather
>> than reused from virtio_dev_shutdown(): drv->shutdown replaces the
>> generic handler rather than augmenting it, so that drivers such as
>> virtio-gpu can opt out of the reset. Unlike virtballoon_remove() the
>> balloon workqueue is not destroyed, as shutdown does not free the
>> device and cancel_work_sync() together with stop_update already prevent
>> any further work from being queued.
>>
>> Fixes: 8bd2fa086a04 ("virtio: break and reset virtio devices on device_shutdown()")
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> ---
>>  drivers/virtio/virtio_balloon.c | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index 5b02d9191ac6..e35ada767b4b 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -1137,6 +1137,15 @@ static void virtballoon_remove(struct virtio_device *vdev)
>>  	kfree(vb);
>>  }
>>  
>> +static void virtballoon_shutdown(struct virtio_device *vdev)
>> +{
>> +	virtballoon_quiesce(vdev->priv);
>> +
>> +	virtio_break_device(vdev);
>> +	virtio_synchronize_cbs(vdev);
>> +	vdev->config->reset(vdev);
> I guess it would be good if we wouldn't have to copy what the default handler
> does, but could instead just have it in a reusable core function?
Ok. Sounds great. Will do.

Thanks for review,
    Den
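
The ordering discussed in this thread, and the suggestion to share the
break-and-reset sequence, can be sketched in user space. Everything here is
a hypothetical stand-in: 'struct dev_model', 'record()',
'model_break_and_reset()' and the two shutdown functions are invented for
illustration and do not correspond to an existing core helper.

```c
#include <string.h>

/* Hypothetical stand-in for a virtio device; it only records call order. */
struct dev_model {
	char log[64];
};

/* Append one step name to the comma-separated log, truncating if full. */
static void record(struct dev_model *d, const char *step)
{
	if (d->log[0])
		strncat(d->log, ",", sizeof(d->log) - strlen(d->log) - 1);
	strncat(d->log, step, sizeof(d->log) - strlen(d->log) - 1);
}

/* Hypothetical shared helper: the break/sync/reset sequence in one place. */
static void model_break_and_reset(struct dev_model *d)
{
	record(d, "break");	/* models virtio_break_device() */
	record(d, "sync");	/* models virtio_synchronize_cbs() */
	record(d, "reset");	/* models vdev->config->reset() */
}

/* Generic path, used when a driver provides no .shutdown of its own. */
static void model_generic_shutdown(struct dev_model *d)
{
	model_break_and_reset(d);
}

/* Balloon path: quiesce while the device is alive, then reuse the helper. */
static void model_balloon_shutdown(struct dev_model *d)
{
	record(d, "quiesce");	/* models virtballoon_quiesce() */
	model_break_and_reset(d);
}
```

The point of the sketch is only the ordering: the balloon path differs from
the generic one by a single quiesce step placed before the device is broken.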

* Re: [PATCH net-next v3] vsock/virtio: rewrite MSG_ZEROCOPY flag handling
From: Michael S. Tsirkin @ 2026-06-23 17:26 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
	Bobby Eshleman, Xuan Zhuo, Eugenio Pérez, Simon Horman, kvm,
	virtualization, netdev, linux-kernel, oxffffaa, rulkc
In-Reply-To: <20260623153819.697635-1-avkrasnov@rulkc.org>

On Tue, Jun 23, 2026 at 06:38:19PM +0300, Arseniy Krasnov wrote:
> Logically it was based on TCP implementation, so to make further support
> easier, rewrite it in the TCP way (like in 'tcp_sendmsg_locked()'). This
> patch only rewrites flag handling (e.g. it doesn't change logic).
> 
> Signed-off-by: Arseniy Krasnov <avkrasnov@rulkc.org>


It seems to change logic though:

> ---
>  Changelog v1->v2:
>  * Rebase on last 'net-next'. Don't need 'skb_zcopy_set()' now - it was
>    already added.
>  Changelog v2->v3:
>  * Update commit message.
>  * Remove one empty line.
> 
>  net/vmw_vsock/virtio_transport_common.c | 47 ++++++++++++-------------
>  1 file changed, 22 insertions(+), 25 deletions(-)
> 
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index 09475007165b..41c2a0b82a8e 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -328,38 +328,35 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>  		return pkt_len;
>  
> -	if (info->msg) {
> -		/* If zerocopy is not enabled by 'setsockopt()', we behave as
> -		 * there is no MSG_ZEROCOPY flag set.
> +	if (info->msg && (info->msg->msg_flags & MSG_ZEROCOPY)) {
> +		/* If 'info->msg' is not NULL, this is only VIRTIO_VSOCK_OP_RW.
> +		 * 'MSG_ZEROCOPY' flag handling here is based on the same flag
> +		 * handling from 'tcp_sendmsg_locked()'.
>  		 */
> -		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
> -			info->msg->msg_flags &= ~MSG_ZEROCOPY;

So previously without SOCK_ZEROCOPY, MSG_ZEROCOPY was always ignored...


> +		if (info->msg->msg_ubuf) {
> +			uarg = info->msg->msg_ubuf;
> +			can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);

now it's not in this case?


Maybe the right call, but saying "does not change logic" seems wrong.


> +		} else if (sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY)) {
> +			uarg = msg_zerocopy_realloc(sk_vsock(vsk), pkt_len,
> +						    NULL, false);
> +			if (!uarg) {
> +				virtio_transport_put_credit(vvs, pkt_len);
> +				return -ENOMEM;
> +			}
>  
> -		if (info->msg->msg_flags & MSG_ZEROCOPY)
>  			can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
> +			if (!can_zcopy)
> +				uarg_to_msgzc(uarg)->zerocopy = 0;
>  
> +			have_uref = true;
> +		}
> +
> +		/* 'can_zcopy' means that this transmission will be
> +		 * in zerocopy way (e.g. using 'frags' array).
> +		 */
>  		if (can_zcopy)
>  			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>  					    (MAX_SKB_FRAGS * PAGE_SIZE));
> -
> -		if (info->msg->msg_flags & MSG_ZEROCOPY &&
> -		    info->op == VIRTIO_VSOCK_OP_RW) {
> -			uarg = info->msg->msg_ubuf;
> -
> -			if (!uarg) {
> -				uarg = msg_zerocopy_realloc(sk_vsock(vsk),
> -							    pkt_len, NULL, false);
> -				if (!uarg) {
> -					virtio_transport_put_credit(vvs, pkt_len);
> -					return -ENOMEM;
> -				}
> -
> -				if (!can_zcopy)
> -					uarg_to_msgzc(uarg)->zerocopy = 0;
> -
> -				have_uref = true;
> -			}
> -		}
>  	}
>  
>  	rest_len = pkt_len;
> -- 
> 2.25.1


* [PATCH net-next v3] vsock/virtio: rewrite MSG_ZEROCOPY flag handling
From: Arseniy Krasnov @ 2026-06-23 15:38 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Jason Wang, Bobby Eshleman, Xuan Zhuo, Eugenio Pérez,
	Simon Horman
  Cc: kvm, virtualization, netdev, linux-kernel, oxffffaa, rulkc,
	Arseniy Krasnov

Logically it was based on TCP implementation, so to make further support
easier, rewrite it in the TCP way (like in 'tcp_sendmsg_locked()'). This
patch only rewrites flag handling (e.g. it doesn't change logic).

Signed-off-by: Arseniy Krasnov <avkrasnov@rulkc.org>
---
 Changelog v1->v2:
 * Rebase on last 'net-next'. Don't need 'skb_zcopy_set()' now - it was
   already added.
 Changelog v2->v3:
 * Update commit message.
 * Remove one empty line.

 net/vmw_vsock/virtio_transport_common.c | 47 ++++++++++++-------------
 1 file changed, 22 insertions(+), 25 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 09475007165b..41c2a0b82a8e 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -328,38 +328,35 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
 		return pkt_len;
 
-	if (info->msg) {
-		/* If zerocopy is not enabled by 'setsockopt()', we behave as
-		 * there is no MSG_ZEROCOPY flag set.
+	if (info->msg && (info->msg->msg_flags & MSG_ZEROCOPY)) {
+		/* If 'info->msg' is not NULL, this is only VIRTIO_VSOCK_OP_RW.
+		 * 'MSG_ZEROCOPY' flag handling here is based on the same flag
+		 * handling from 'tcp_sendmsg_locked()'.
 		 */
-		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
-			info->msg->msg_flags &= ~MSG_ZEROCOPY;
+		if (info->msg->msg_ubuf) {
+			uarg = info->msg->msg_ubuf;
+			can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
+		} else if (sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY)) {
+			uarg = msg_zerocopy_realloc(sk_vsock(vsk), pkt_len,
+						    NULL, false);
+			if (!uarg) {
+				virtio_transport_put_credit(vvs, pkt_len);
+				return -ENOMEM;
+			}
 
-		if (info->msg->msg_flags & MSG_ZEROCOPY)
 			can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
+			if (!can_zcopy)
+				uarg_to_msgzc(uarg)->zerocopy = 0;
 
+			have_uref = true;
+		}
+
+		/* 'can_zcopy' means that this transmission will be
+		 * in zerocopy way (e.g. using 'frags' array).
+		 */
 		if (can_zcopy)
 			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
 					    (MAX_SKB_FRAGS * PAGE_SIZE));
-
-		if (info->msg->msg_flags & MSG_ZEROCOPY &&
-		    info->op == VIRTIO_VSOCK_OP_RW) {
-			uarg = info->msg->msg_ubuf;
-
-			if (!uarg) {
-				uarg = msg_zerocopy_realloc(sk_vsock(vsk),
-							    pkt_len, NULL, false);
-				if (!uarg) {
-					virtio_transport_put_credit(vvs, pkt_len);
-					return -ENOMEM;
-				}
-
-				if (!can_zcopy)
-					uarg_to_msgzc(uarg)->zerocopy = 0;
-
-				have_uref = true;
-			}
-		}
 	}
 
 	rest_len = pkt_len;
-- 
2.25.1


* [PATCH] scsi: virtio_scsi: fixup endian conversions for warning messages
From: Ben Dooks @ 2026-06-23 13:24 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
	Eugenio Pérez, James E.J. Bottomley, Martin K. Petersen,
	virtualization, linux-scsi, linux-kernel
  Cc: Ben Dooks

Several printing calls are passed fields that have not been through the
endian conversion helpers. Convert them with virtio32_to_cpu() before
printing to fix the warnings.

Fixes the following warnings from (prototype) sparse:
drivers/scsi/virtio_scsi.c:126:9: warning: incorrect type in argument 7 (different base types)
drivers/scsi/virtio_scsi.c:126:9:    expected unsigned int
drivers/scsi/virtio_scsi.c:126:9:    got restricted __virtio32 [usertype] sense_len
drivers/scsi/virtio_scsi.c:312:17: warning: incorrect type in argument 2 (different base types)
drivers/scsi/virtio_scsi.c:312:17:    expected unsigned int
drivers/scsi/virtio_scsi.c:312:17:    got restricted __virtio32 [usertype] reason
drivers/scsi/virtio_scsi.c:412:17: warning: incorrect type in argument 2 (different base types)
drivers/scsi/virtio_scsi.c:412:17:    expected unsigned int
drivers/scsi/virtio_scsi.c:412:17:    got restricted __virtio32 [usertype] event

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
---
 drivers/scsi/virtio_scsi.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 5fdaa71f0652..35731b18c519 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -122,10 +122,11 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
 	struct virtio_scsi_cmd *cmd = buf;
 	struct scsi_cmnd *sc = cmd->sc;
 	struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
+	unsigned sense_len = virtio32_to_cpu(vscsi->vdev, resp->sense_len);
 
 	dev_dbg(&sc->device->sdev_gendev,
 		"cmd %p response %u status %#02x sense_len %u\n",
-		sc, resp->response, resp->status, resp->sense_len);
+		sc, resp->response, resp->status, sense_len);
 
 	sc->result = resp->status;
 	virtscsi_compute_resid(sc, virtio32_to_cpu(vscsi->vdev, resp->resid));
@@ -166,13 +167,10 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
 		break;
 	}
 
-	WARN_ON(virtio32_to_cpu(vscsi->vdev, resp->sense_len) >
-		VIRTIO_SCSI_SENSE_SIZE);
+	WARN_ON(sense_len > VIRTIO_SCSI_SENSE_SIZE);
 	if (resp->sense_len) {
 		memcpy(sc->sense_buffer, resp->sense,
-		       min_t(u32,
-			     virtio32_to_cpu(vscsi->vdev, resp->sense_len),
-			     VIRTIO_SCSI_SENSE_SIZE));
+		       min_t(u32, sense_len, VIRTIO_SCSI_SENSE_SIZE));
 	}
 
 	scsi_done(sc);
@@ -288,8 +286,9 @@ static void virtscsi_handle_transport_reset(struct virtio_scsi *vscsi,
 	struct Scsi_Host *shost = virtio_scsi_host(vscsi->vdev);
 	unsigned int target = event->lun[1];
 	unsigned int lun = (event->lun[2] << 8) | event->lun[3];
+	unsigned int reason = virtio32_to_cpu(vscsi->vdev, event->reason);
 
-	switch (virtio32_to_cpu(vscsi->vdev, event->reason)) {
+	switch (reason) {
 	case VIRTIO_SCSI_EVT_RESET_RESCAN:
 		if (lun == 0) {
 			scsi_scan_target(&shost->shost_gendev, 0, target,
@@ -309,7 +308,7 @@ static void virtscsi_handle_transport_reset(struct virtio_scsi *vscsi,
 		}
 		break;
 	default:
-		pr_info("Unsupported virtio scsi event reason %x\n", event->reason);
+		pr_info("Unsupported virtio scsi event reason %x\n", reason);
 	}
 }
 
@@ -409,7 +408,8 @@ static void virtscsi_handle_event(struct work_struct *work)
 		virtscsi_handle_param_change(vscsi, event);
 		break;
 	default:
-		pr_err("Unsupported virtio scsi event %x\n", event->event);
+		pr_err("Unsupported virtio scsi event %x\n",
+		       virtio32_to_cpu(vscsi->vdev, event->event));
 	}
 	virtscsi_kick_event(vscsi, event_node);
 }
-- 
2.37.2.352.g3c44437643
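
Why the conversion matters for the printed values can be shown with a small
user-space sketch. It assumes a device that lays the field out in
little-endian byte order; 'le32_decode()' and 'host_load()' are hypothetical
helpers, not the kernel's virtio32_to_cpu(), which additionally decides the
byte order per device.

```c
#include <stdint.h>
#include <string.h>

/* Decode a little-endian 32-bit field byte by byte (host independent). */
static uint32_t le32_decode(const uint8_t b[4])
{
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Reinterpret the same bytes in host order, as an unconverted print would. */
static uint32_t host_load(const uint8_t b[4])
{
	uint32_t v;

	memcpy(&v, b, sizeof(v));
	return v;
}
```

For the wire bytes 02 00 00 00, 'le32_decode()' yields 2 on any host, while
'host_load()' yields 2 on a little-endian host and 0x02000000 on a
big-endian one, which is the kind of misleading value an unconverted
"%x" print would show.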


* Re: [PATCH] virtio_console: fix endian conversion in handle_control_message()
From: Amit Shah @ 2026-06-23 12:23 UTC (permalink / raw)
  To: Ben Dooks, Arnd Bergmann, Greg Kroah-Hartman, virtualization,
	linux-kernel
In-Reply-To: <20260623092141.631355-1-ben.dooks@codethink.co.uk>

On Tue, 2026-06-23 at 10:21 +0100, Ben Dooks wrote:
> There are a couple of prints in handle_control_message() which should
> have converted cpkt->id through virtio32_to_cpu() before passing to
> a print.
> 
> This fixes the following (prototype) sparse warnings:
> drivers/char/virtio_console.c:1538:17: warning: incorrect type in
> argument 4 (different base types)
> drivers/char/virtio_console.c:1538:17:    expected unsigned int
> drivers/char/virtio_console.c:1538:17:    got restricted __virtio32
> [usertype] id
> drivers/char/virtio_console.c:1553:25: warning: incorrect type in
> argument 3 (different base types)
> drivers/char/virtio_console.c:1553:25:    expected unsigned int
> drivers/char/virtio_console.c:1553:25:    got restricted __virtio32
> [usertype] id
> 
> Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>

Reviewed-by: Amit Shah <amit@kernel.org>

		Amit

* [RFCv2 PATCH 6/6] virtio-mem: Support memory hotplug/unplug for coco guest
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
	michael.roth, ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-1-zhenzhong.duan@intel.com>

Integrate coco memory management operations into the virtio-mem driver to
manage the state of hotplug memory.

In virtio_mem_send_plug_request(), once the host hypervisor acknowledges a
plug request, invoke coco_set_plugged_bitmap() to set the corresponding
bits in the plugged bitmap. Conversely, in virtio_mem_send_unplug_request()
and virtio_mem_send_unplug_all_request(), call unaccept_memory() to let the
guest autonomously transition the target private pages back to "unaccepted"
state before asking the VMM to unplug them. After the VMM acknowledges the
unplug request, clear the ranges from the plugged bitmap.

Note that memory block hotplug/unplug also sets or clears the plugged
bitmap at memory block granularity. While doing this at device block
granularity here creates a slight redundancy, it is completely harmless.

Additionally, update virtio_mem_fake_online() to explicitly invoke
accept_memory() when transitioning memory out of the fake-offline state and
back into service. This ensures that any pages returning to the buddy
system are cleanly accepted by the guest architecture before they are freed
back into the allocator via free_contig_range().

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 drivers/virtio/virtio_mem.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 48051e9e98ab..9f6e53df8caf 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1211,6 +1211,7 @@ static void virtio_mem_fake_online(unsigned long pfn, unsigned long nr_pages)
 			generic_online_page(page, order);
 		} else {
 			virtio_mem_clear_fake_offline(pfn + i, 1 << order, true);
+			accept_memory(page_to_phys(page), PAGE_SIZE << order);
 			free_contig_range(pfn + i, 1 << order);
 			adjust_managed_page_count(page, 1 << order);
 		}
@@ -1436,6 +1437,7 @@ static int virtio_mem_send_plug_request(struct virtio_mem *vm, uint64_t addr,
 	switch (virtio_mem_send_request(vm, &req)) {
 	case VIRTIO_MEM_RESP_ACK:
 		vm->plugged_size += size;
+		WARN_ON(coco_set_plugged_bitmap(addr, size, true));
 		return 0;
 	case VIRTIO_MEM_RESP_NACK:
 		rc = -EAGAIN;
@@ -1471,9 +1473,12 @@ static int virtio_mem_send_unplug_request(struct virtio_mem *vm, uint64_t addr,
 	dev_dbg(&vm->vdev->dev, "unplugging memory: 0x%llx - 0x%llx\n", addr,
 		addr + size - 1);
 
+	unaccept_memory(addr, size);
+
 	switch (virtio_mem_send_request(vm, &req)) {
 	case VIRTIO_MEM_RESP_ACK:
 		vm->plugged_size -= size;
+		WARN_ON(coco_set_plugged_bitmap(addr, size, false));
 		return 0;
 	case VIRTIO_MEM_RESP_BUSY:
 		rc = -ETXTBSY;
@@ -1498,10 +1503,13 @@ static int virtio_mem_send_unplug_all_request(struct virtio_mem *vm)
 
 	dev_dbg(&vm->vdev->dev, "unplugging all memory");
 
+	unaccept_memory(vm->addr, vm->region_size);
+
 	switch (virtio_mem_send_request(vm, &req)) {
 	case VIRTIO_MEM_RESP_ACK:
 		vm->unplug_all_required = false;
 		vm->plugged_size = 0;
+		WARN_ON(coco_set_plugged_bitmap(vm->addr, vm->region_size, false));
 		/* usable region might have shrunk */
 		atomic_set(&vm->config_changed, 1);
 		return 0;
-- 
2.52.0
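
The bookkeeping order this patch describes can be sketched as a user-space
model. It is an illustration only: 'struct coco_model', 'MODEL_BLOCKS',
'range_ok()' and the 'model_*()' functions are hypothetical, the two arrays
stand in for the plugged bitmap and the guest's acceptance state, and the
'ack' parameter stands in for a VIRTIO_MEM_RESP_ACK reply from the host.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BLOCKS 8	/* hypothetical tiny device region */

struct coco_model {
	bool plugged[MODEL_BLOCKS];	/* models the plugged bitmap */
	bool accepted[MODEL_BLOCKS];	/* models guest acceptance state */
};

/* True if [first, first + n) lies inside the modelled region. */
static bool range_ok(size_t first, size_t n)
{
	return first <= MODEL_BLOCKS && n <= MODEL_BLOCKS - first;
}

/* Plug: the bitmap is updated only once the host has acknowledged. */
static int model_plug(struct coco_model *m, size_t first, size_t n, bool ack)
{
	if (!range_ok(first, n) || !ack)
		return -1;
	for (size_t i = first; i < first + n; i++)
		m->plugged[i] = true;

	return 0;
}

/* Stand-in for accept_memory() on the fake-online path. */
static void model_accept(struct coco_model *m, size_t first, size_t n)
{
	if (!range_ok(first, n))
		return;
	for (size_t i = first; i < first + n; i++)
		m->accepted[i] = true;
}

/* Unplug: unaccept first, send the request, clear the bitmap on ACK. */
static int model_unplug(struct coco_model *m, size_t first, size_t n, bool ack)
{
	if (!range_ok(first, n))
		return -1;
	for (size_t i = first; i < first + n; i++)
		m->accepted[i] = false;	/* models unaccept_memory() */
	if (!ack)
		return -1;
	for (size_t i = first; i < first + n; i++)
		m->plugged[i] = false;	/* models clearing the plugged bits */

	return 0;
}
```

In the model, as in the patch, the unaccept step runs before the request is
sent, so it takes effect whether or not the host acknowledges; only the
plugged bits wait for the ACK.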


* [RFCv2 PATCH 5/6] mm/memory_hotplug: Support ACPI hotplug/unplug for coco guest
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
	michael.roth, ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-1-zhenzhong.duan@intel.com>

Integrate coco memory management operations into the core memory hotplug
subsystem to handle the lifecycle of hotplug memory.

In add_memory_resource(), invoke coco_set_plugged_bitmap(..., true) to mark
memory plugged before adding the memory block, because self hosted memmap
initialization needs their plugged bits set before acceptance. There is no
explicit call to accept_memory() for normal pages, because they can be
lazily accepted by the core memory management subsystem after the memory
block is onlined.

In try_remove_memory(), before finalizing the physical removal of the
memory blocks, invoke unaccept_memory(). This allows the guest to take
direct control of its own memory state and release the pages itself,
eliminating the dependency on the VMM to implicitly hole-punch the memory.
It loops through the targeted ranges using find_next_andnot_bit(), matching
pages that are marked plugged and accepted, and releases them back to the
host. Following the unacceptance step, clear the ranges from the plugged
bitmap.

These operations guarantee that both the unaccepted and plugged tracking
states stay completely synchronized with the actual dynamic memory
configurations of the guest.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/linux/mm.h                       | 11 +++
 drivers/firmware/efi/unaccepted_memory.c | 94 ++++++++++++++++++++++++
 mm/memory_hotplug.c                      | 16 ++++
 3 files changed, 121 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fc2acedf0b76..4c094038872a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -5105,6 +5105,8 @@ int set_anon_vma_name(unsigned long addr, unsigned long size,
 
 bool range_contains_unaccepted_memory(phys_addr_t start, unsigned long size);
 void accept_memory(phys_addr_t start, unsigned long size);
+void unaccept_memory(phys_addr_t start, unsigned long size);
+int coco_set_plugged_bitmap(phys_addr_t start, unsigned long size, bool set);
 
 #else
 
@@ -5118,6 +5120,15 @@ static inline void accept_memory(phys_addr_t start, unsigned long size)
 {
 }
 
+static inline void unaccept_memory(phys_addr_t start, unsigned long size)
+{
+}
+
+static inline int coco_set_plugged_bitmap(phys_addr_t start, unsigned long size, bool set)
+{
+	return 0;
+}
+
 #endif
 
 static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
index c290b16c5142..f35f7016af53 100644
--- a/drivers/firmware/efi/unaccepted_memory.c
+++ b/drivers/firmware/efi/unaccepted_memory.c
@@ -233,6 +233,100 @@ bool range_contains_unaccepted_memory(phys_addr_t start, unsigned long size)
 	return ret;
 }
 
+static int coco_hotplug_range_check(struct efi_unaccepted_memory *unaccepted,
+				    phys_addr_t start, unsigned long size)
+{
+	u64 unit_size = unaccepted->unit_size;
+	u64 phys_base = unaccepted->phys_base;
+	u64 phys_end = phys_base + unaccepted->size * unit_size * BITS_PER_BYTE;
+
+	if (!IS_ALIGNED(start | size, unit_size))
+		return -EINVAL;
+
+	if (start < phys_base || start + size > phys_end)
+		return -EINVAL;
+
+	return 0;
+}
+
+/* Only used by hotplug memory, we don't unaccept static memory */
+void unaccept_memory(phys_addr_t start, unsigned long size)
+{
+	unsigned long range_start, range_end, bitmap_size, flags;
+	struct efi_unaccepted_memory *unaccepted;
+	void *plugged_bitmap;
+	u64 unit_size;
+
+	unaccepted = efi_get_unaccepted_table();
+	if (!unaccepted)
+		return;
+
+	if (WARN_ON(coco_hotplug_range_check(unaccepted, start, size)))
+		return;
+
+	unit_size = unaccepted->unit_size;
+	range_start = (start - unaccepted->phys_base) / unit_size;
+	bitmap_size = range_start + size / unit_size;
+	plugged_bitmap = plugged_bitmap_of(unaccepted);
+
+	spin_lock_irqsave(&unaccepted_memory_lock, flags);
+	for (; range_start < bitmap_size; range_start = range_end) {
+		unsigned long phys_start, phys_end;
+		unsigned long unaccepted_one, plugged_zero;
+
+		range_start = find_next_andnot_bit(plugged_bitmap, unaccepted->bitmap,
+						   bitmap_size, range_start);
+
+		if (range_start >= bitmap_size)
+			break;
+
+		unaccepted_one = find_next_bit(unaccepted->bitmap, bitmap_size, range_start);
+		plugged_zero = find_next_zero_bit(plugged_bitmap, bitmap_size, range_start);
+		range_end = min(unaccepted_one, plugged_zero);
+
+		phys_start = range_start * unit_size + unaccepted->phys_base;
+		phys_end = range_end * unit_size + unaccepted->phys_base;
+
+		arch_unaccept_memory(phys_start, phys_end);
+		bitmap_set(unaccepted->bitmap, range_start, range_end - range_start);
+	}
+	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+}
+
+/*
+ * Only used by hotplug memory, plugged bits of static memory are handled
+ * in process_unaccepted_memory()
+ */
+int coco_set_plugged_bitmap(phys_addr_t start, unsigned long size, bool set)
+{
+	struct efi_unaccepted_memory *unaccepted;
+	unsigned long range_start, flags;
+	void *plugged_bitmap;
+	u64 unit_size;
+	int ret;
+
+	unaccepted = efi_get_unaccepted_table();
+	if (!unaccepted)
+		return 0;
+
+	ret = coco_hotplug_range_check(unaccepted, start, size);
+	if (ret)
+		return ret;
+
+	unit_size = unaccepted->unit_size;
+	range_start = (start - unaccepted->phys_base) / unit_size;
+	plugged_bitmap = plugged_bitmap_of(unaccepted);
+
+	spin_lock_irqsave(&unaccepted_memory_lock, flags);
+	if (set)
+		bitmap_set(plugged_bitmap, range_start, size / unit_size);
+	else
+		bitmap_clear(plugged_bitmap, range_start, size / unit_size);
+	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+
+	return 0;
+}
+
 #ifdef CONFIG_PROC_VMCORE
 static bool unaccepted_memory_vmcore_pfn_is_ram(struct vmcore_cb *cb,
 						unsigned long pfn)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 40c7915dabe0..2f71514a0616 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1429,6 +1429,8 @@ static void remove_memory_blocks_and_altmaps(u64 start, u64 size)
 
 		arch_remove_memory(cur_start, memblock_size, altmap);
 
+		unaccept_memory(cur_start, PFN_PHYS(altmap->free));
+
 		/* Verify that all vmemmap pages have actually been freed. */
 		WARN(altmap->alloc, "Altmap not fully unmapped");
 		kfree(altmap);
@@ -1459,9 +1461,13 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
 			goto out;
 		}
 
+		/* Accept the self-hosted memmap array before accessing it */
+		accept_memory(cur_start, PFN_PHYS(mhp_altmap.free));
+
 		/* call arch's memory hotadd */
 		ret = arch_add_memory(nid, cur_start, memblock_size, &params);
 		if (ret < 0) {
+			unaccept_memory(cur_start, PFN_PHYS(mhp_altmap.free));
 			kfree(params.altmap);
 			goto out;
 		}
@@ -1471,6 +1477,7 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
 						  params.altmap, group);
 		if (ret) {
 			arch_remove_memory(cur_start, memblock_size, NULL);
+			unaccept_memory(cur_start, PFN_PHYS(mhp_altmap.free));
 			kfree(params.altmap);
 			goto out;
 		}
@@ -1540,6 +1547,10 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 		new_node = true;
 	}
 
+	ret = coco_set_plugged_bitmap(start, size, true);
+	if (ret)
+		goto error_offline_node;
+
 	/*
 	 * Self hosted memmap array
 	 */
@@ -1584,6 +1595,8 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 
 	return ret;
 error:
+	WARN_ON(coco_set_plugged_bitmap(start, size, false));
+error_offline_node:
 	if (new_node) {
 		node_set_offline(nid);
 		unregister_node(nid);
@@ -2282,6 +2295,9 @@ static int try_remove_memory(u64 start, u64 size)
 	if (nid != NUMA_NO_NODE)
 		try_offline_node(nid);
 
+	unaccept_memory(start, size);
+	WARN_ON(coco_set_plugged_bitmap(start, size, false));
+
 	mem_hotplug_done();
 	return 0;
 }
-- 
2.52.0



* [RFCv2 PATCH 4/6] x86/tdx: Implement arch_unaccept_memory()
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
	michael.roth, ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-1-zhenzhong.duan@intel.com>

During memory hot-unplug, if the VMM does not punch a hole in the memory,
the memory stays in the "accepted" state. Consequently, re-accepting that
same memory during a later re-plug operation fails. To guard against this,
a confidential guest must control the memory state explicitly, e.g., by
setting memory to the "unaccepted" state during unplug.

In the context of TDX, the "unaccepted" state maps to the PENDING state,
while the "accepted" state maps to the MAPPED state. Implement
arch_unaccept_memory() for TDX guest via the TDG.MEM.PAGE.RELEASE TDCALL.
It uses 1G/2M/4K page size fallbacks and rolls back on partial failure. A
failure during this rollback step indicates severe corruption of the TDX
module state and triggers a kernel panic.
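
As an illustration only (not part of the patch), the size fallback can be
sketched in userspace C. The release_fn callback and the two mock
functions are hypothetical stand-ins for the TDCALL; the sketch only shows
how the cursor advances and where it stops on failure, so that a caller
knows which prefix [start, *cur) has to be rolled back:

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_4K (1ULL << 12)
#define SZ_2M (1ULL << 21)
#define SZ_1G (1ULL << 30)

/* Hypothetical stand-in for the TDCALL: true if the release succeeded. */
typedef bool (*release_fn)(uint64_t addr, uint64_t size);

/* Try one release of 'size' at 'addr'; return bytes released, 0 on failure. */
static uint64_t try_release(release_fn fn, uint64_t addr, uint64_t len,
			    uint64_t size)
{
	if ((addr & (size - 1)) || len < size)
		return 0;
	return fn(addr, size) ? size : 0;
}

/*
 * Release [start, end), largest page size first. On return, *cur is the
 * first address that was not released.
 */
static bool release_range(release_fn fn, uint64_t start, uint64_t end,
			  uint64_t *cur)
{
	static const uint64_t sizes[] = { SZ_1G, SZ_2M, SZ_4K };

	*cur = start;
	while (*cur < end) {
		uint64_t done = 0;

		for (unsigned int i = 0; i < 3 && !done; i++)
			done = try_release(fn, *cur, end - *cur, sizes[i]);
		if (!done)
			return false;
		*cur += done;
	}
	return true;
}

/* Mock that always succeeds and counts how many calls were made. */
static unsigned int release_calls;
static bool mock_release_ok(uint64_t addr, uint64_t size)
{
	(void)addr;
	(void)size;
	release_calls++;
	return true;
}

/* Mock that refuses anything ending above 2M + 8K, to hit the failure path. */
static bool mock_release_limited(uint64_t addr, uint64_t size)
{
	return addr + size <= SZ_2M + 2 * SZ_4K;
}
```

With mock_release_ok, a 2M + 4K range starting at 0 takes two calls: one
2M release followed by one 4K release.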

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 arch/x86/include/asm/shared/tdx.h        |   2 +
 arch/x86/include/asm/tdx.h               |   2 +
 arch/x86/include/asm/unaccepted_memory.h |  11 +++
 arch/x86/coco/tdx/tdx.c                  | 120 +++++++++++++++++++++++
 4 files changed, 135 insertions(+)

diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 049638e3da74..910ec1e57528 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -19,6 +19,7 @@
 #define TDG_MEM_PAGE_ACCEPT		6
 #define TDG_VM_RD			7
 #define TDG_VM_WR			8
+#define TDG_MEM_PAGE_RELEASE		30
 
 /* TDX TD attributes */
 #define TDX_TD_ATTR_DEBUG_BIT		0
@@ -54,6 +55,7 @@
 
 /* TDCS_CONFIG_FLAGS bits */
 #define TDCS_CONFIG_FLEXIBLE_PENDING_VE	BIT_ULL(1)
+#define TDCS_CONFIG_PAGE_RELEASE	BIT_ULL(6)
 
 /* TDCS_TD_CTLS bits */
 #define TD_CTLS_PENDING_VE_DISABLE_BIT	0
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index a149740b24e8..8608d33a7db6 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -72,6 +72,8 @@ int tdx_mcall_extend_rtmr(u8 index, u8 *data);
 
 u64 tdx_hcall_get_quote(u8 *buf, size_t size);
 
+bool tdx_unaccept_memory(phys_addr_t start, phys_addr_t end);
+
 void __init tdx_dump_attributes(u64 td_attr);
 void __init tdx_dump_td_ctls(u64 td_ctls);
 
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
index f5937e9866ac..9fd9411d2c44 100644
--- a/arch/x86/include/asm/unaccepted_memory.h
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -18,6 +18,17 @@ static inline void arch_accept_memory(phys_addr_t start, phys_addr_t end)
 	}
 }
 
+static inline void arch_unaccept_memory(phys_addr_t start, phys_addr_t end)
+{
+	/* Platform-specific memory-unacceptance call goes here */
+	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+		if (!tdx_unaccept_memory(start, end))
+			panic("TDX: Failed to unaccept memory\n");
+	} else {
+		panic("Cannot unaccept memory: unknown platform\n");
+	}
+}
+
 static inline struct efi_unaccepted_memory *efi_get_unaccepted_table(void)
 {
 	if (efi.unaccepted == EFI_INVALID_TABLE_ADDR)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 186915a17c50..1bab8f4687bf 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -326,6 +326,124 @@ static void reduce_unnecessary_ve(void)
 	enable_cpu_topology_enumeration();
 }
 
+static bool tdx_page_release_supported;
+
+static void tdx_detect_page_release_support(void)
+{
+	u64 config = 0;
+
+	tdg_vm_rd(TDCS_CONFIG_FLAGS, &config);
+
+	tdx_page_release_supported = !!(config & TDCS_CONFIG_PAGE_RELEASE);
+}
+
+static unsigned long try_release_one(phys_addr_t start, unsigned long len,
+				     enum pg_level pg_level)
+{
+	unsigned long release_size = page_level_size(pg_level);
+	struct tdx_module_args args = {};
+	u8 page_size;
+	u64 ret;
+
+	if (!IS_ALIGNED(start, release_size))
+		return 0;
+
+	if (len < release_size)
+		return 0;
+
+	/*
+	 * Pass the page physical address to TDX module to release the
+	 * private page and to put it in PENDING state.
+	 *
+	 * Encode page size in RCX[2:0] using TDX_PS_*
+	 */
+	switch (pg_level) {
+	case PG_LEVEL_4K:
+		page_size = TDX_PS_4K;
+		break;
+	case PG_LEVEL_2M:
+		page_size = TDX_PS_2M;
+		break;
+	case PG_LEVEL_1G:
+		page_size = TDX_PS_1G;
+		break;
+	default:
+		return 0;
+	}
+
+	args.rcx = start | page_size;
+	ret = __tdcall(TDG_MEM_PAGE_RELEASE, &args);
+	if (ret)
+		return 0;
+
+	return release_size;
+}
+
+static bool tdx_release_memory(phys_addr_t start, phys_addr_t end, phys_addr_t *cur)
+{
+	*cur = start;
+
+	while (*cur < end) {
+		unsigned long len = end - *cur;
+		unsigned long release_size;
+
+		/*
+		 * Try larger release first. It speeds up process by cutting
+		 * number of hypercalls (if successful).
+		 */
+
+		release_size = try_release_one(*cur, len, PG_LEVEL_1G);
+		if (!release_size)
+			release_size = try_release_one(*cur, len, PG_LEVEL_2M);
+		if (!release_size)
+			release_size = try_release_one(*cur, len, PG_LEVEL_4K);
+		if (!release_size)
+			return false;
+		*cur += release_size;
+	}
+
+	return true;
+}
+
+/**
+ * tdx_unaccept_memory() - Release private memory and put it in PENDING state.
+ *
+ * @start: Physical start address of memory range to release
+ * @end:   Physical end address of memory range to release
+ *
+ * Uses TDG.MEM.PAGE.RELEASE TDCALL to transition private pages back to
+ * PENDING state. If PAGE_RELEASE is not supported by the TDX
+ * configuration, returns true (success) as no action is needed.
+ *
+ * On partial failure, automatically re-accepts any successfully released
+ * pages to restore consistent memory state. Re-acceptance failure is
+ * treated as a fatal error since it indicates severe TDX module issues.
+ *
+ * Returns: true on success, false on failure
+ */
+bool tdx_unaccept_memory(phys_addr_t start, phys_addr_t end)
+{
+	phys_addr_t released = start;
+	bool ret;
+
+	if (!tdx_page_release_supported)
+		return true;
+
+	ret = tdx_release_memory(start, end, &released);
+	if (!ret) {
+		pr_err("Failed to unaccept memory [%pa, %pa)\n", &start, &end);
+		/*
+		 * Re-accept any pages that were successfully released before
+		 * the failure occurred. This should never fail since we're
+		 * just restoring the previous MAPPED state.
+		 */
+		if (!tdx_accept_memory(start, released))
+			panic("%s: Failed to re-accept memory\n", __func__);
+	}
+
+	return ret;
+}
+
 static void tdx_setup(u64 *cc_mask)
 {
 	struct tdx_module_args args = {};
@@ -359,6 +477,8 @@ static void tdx_setup(u64 *cc_mask)
 	disable_sept_ve(td_attr);
 
 	reduce_unnecessary_ve();
+
+	tdx_detect_page_release_support();
 }
 
 /*
-- 
2.52.0



* [RFCv2 PATCH 3/6] efi/unaccepted: Create plugged bitmap to support hotplug memory in coco guest
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
	michael.roth, ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-1-zhenzhong.duan@intel.com>

The load_unaligned_zeropad() function can cause unintended memory loads
across page boundaries. To safely handle these unaligned reads in a
confidential computing guest, the kernel implicitly accepts an extra
unit_size block of memory to serve as a safety guard.

However, near hotplug boundaries, this extra acceptance can fall within
unpopulated gaps between hotplugged memory ranges, triggering a guest
kernel crash.

To protect these boundaries against out-of-bounds access, introduce a
"plugged" bitmap positioned immediately following the unaccepted memory
bitmap.

Initial static boot memory ranges have their corresponding bits marked
as plugged by default during early initialization. For hotpluggable
memory ranges, the memory driver must explicitly set the proper bits
when a memory block is plugged, and clear them upon an unplug event.

Update accept_memory() and range_contains_unaccepted_memory() to check
the intersection of both bitmaps. The kernel now combines them to
determine exactly which plugged, unaccepted pages require acceptance.

Additionally, bump the unaccepted memory table layout version from 1
to 2. This strict layout enforcement guarantees that a version 1 table
passed to a new kernel, or a version 2 table passed to an old kernel,
will explicitly fail kexec early due to the version mismatch.
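
As an illustration only (not part of the patch), the two-bitmap layout can
be sketched in userspace C. struct demo_table is a simplified, hypothetical
stand-in for struct efi_unaccepted_memory that uses a byte-based bitmap.
The "size" field is the size of one bitmap in bytes, so the plugged bitmap
starts "size" bytes after the unaccepted one:

```c
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for struct efi_unaccepted_memory; not the kernel layout. */
struct demo_table {
	uint32_t version;
	uint32_t unit_size;
	uint64_t phys_base;
	uint64_t size;		/* size of ONE bitmap, in bytes */
	unsigned char bitmap[];	/* unaccepted bitmap, then plugged bitmap */
};

/* Allocate a zeroed table with room for both bitmaps. */
static struct demo_table *demo_table_alloc(uint64_t bitmap_size)
{
	struct demo_table *t = calloc(1, sizeof(*t) + 2 * bitmap_size);

	if (!t)
		return NULL;
	t->version = 2;
	t->size = bitmap_size;
	return t;
}

/* The plugged bitmap immediately follows the unaccepted bitmap. */
static unsigned char *demo_plugged_bitmap(struct demo_table *t)
{
	return t->bitmap + t->size;
}

static void demo_set_bit(unsigned char *map, uint64_t nr)
{
	map[nr / 8] |= (unsigned char)(1u << (nr % 8));
}

static int demo_test_bit(const unsigned char *map, uint64_t nr)
{
	return (map[nr / 8] >> (nr % 8)) & 1;
}

/* A unit needs acceptance only if it is both plugged and unaccepted. */
static int demo_needs_accept(struct demo_table *t, uint64_t nr)
{
	return demo_test_bit(demo_plugged_bitmap(t), nr) &&
	       demo_test_bit(t->bitmap, nr);
}
```

A unit that is marked unaccepted but not plugged, such as a gap between
hotplugged ranges, is therefore never passed to the acceptance path.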

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/linux/efi.h                           |  5 ++++
 arch/x86/boot/compressed/mem.c                |  2 +-
 drivers/firmware/efi/efi.c                    |  4 +--
 .../firmware/efi/libstub/unaccepted_memory.c  | 16 +++++++----
 drivers/firmware/efi/unaccepted_memory.c      | 28 +++++++++++++++----
 5 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/include/linux/efi.h b/include/linux/efi.h
index ccbc35479684..579d102f128a 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -551,6 +551,11 @@ struct efi_unaccepted_memory {
 	unsigned long bitmap[];
 };
 
+static inline void *plugged_bitmap_of(struct efi_unaccepted_memory *u)
+{
+	return (void *)u->bitmap + u->size;
+}
+
 /*
  * Architecture independent structure for describing a memory map for the
  * benefit of efi_memmap_init_early(), and for passing context between
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 40e9c81a2206..61b8d0edd2f6 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -69,7 +69,7 @@ bool init_unaccepted_memory(void)
 	if (!table)
 		return false;
 
-	if (table->version != 1)
+	if (table->version != 2)
 		error("Unknown version of unaccepted memory table\n");
 
 	/*
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 318d1cc9a066..7f7341634c13 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -701,7 +701,7 @@ static __init void reserve_unaccepted(struct efi_unaccepted_memory *unaccepted)
 	phys_addr_t start, end;
 
 	start = PAGE_ALIGN_DOWN(efi.unaccepted);
-	end = PAGE_ALIGN(efi.unaccepted + sizeof(*unaccepted) + unaccepted->size);
+	end = PAGE_ALIGN(efi.unaccepted + sizeof(*unaccepted) + unaccepted->size * 2);
 
 	memblock_add(start, end - start);
 	memblock_reserve(start, end - start);
@@ -837,7 +837,7 @@ int __init efi_config_parse_tables(const efi_config_table_t *config_tables,
 		unaccepted = early_memremap(efi.unaccepted, sizeof(*unaccepted));
 		if (unaccepted) {
 
-			if (unaccepted->version == 1) {
+			if (unaccepted->version == 2) {
 				reserve_unaccepted(unaccepted);
 			} else {
 				efi.unaccepted = EFI_INVALID_TABLE_ADDR;
diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
index 01bed8e751ca..5b0deb6c91f1 100644
--- a/drivers/firmware/efi/libstub/unaccepted_memory.c
+++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
@@ -113,7 +113,7 @@ efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
 					struct efi_boot_memmap *map)
 {
 	efi_guid_t unaccepted_table_guid = LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID;
-	u64 unaccepted_start = ULLONG_MAX, unaccepted_end = 0, bitmap_size;
+	u64 unaccepted_start = ULLONG_MAX, unaccepted_end = 0, bitmap_size, total_size;
 	struct srat_parse_ctx ctx;
 	efi_status_t status;
 	int i;
@@ -124,7 +124,7 @@ efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
 	/* Check if the table is already installed */
 	unaccepted_table = get_efi_config_table(unaccepted_table_guid);
 	if (unaccepted_table) {
-		if (unaccepted_table->version != 1) {
+		if (unaccepted_table->version != 2) {
 			efi_err("Unknown version of unaccepted memory table\n");
 			return EFI_UNSUPPORTED;
 		}
@@ -173,19 +173,22 @@ efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
 	bitmap_size = DIV_ROUND_UP(unaccepted_end - unaccepted_start,
 				   EFI_UNACCEPTED_UNIT_SIZE * BITS_PER_BYTE);
 
+	/* There is a plugged bitmap after unaccepted bitmap */
+	total_size = bitmap_size << 1;
+
 	status = efi_bs_call(allocate_pool, EFI_ACPI_RECLAIM_MEMORY,
-			     sizeof(*unaccepted_table) + bitmap_size,
+			     sizeof(*unaccepted_table) + total_size,
 			     (void **)&unaccepted_table);
 	if (status != EFI_SUCCESS) {
 		efi_err("Failed to allocate unaccepted memory config table\n");
 		return status;
 	}
 
-	unaccepted_table->version = 1;
+	unaccepted_table->version = 2;
 	unaccepted_table->unit_size = EFI_UNACCEPTED_UNIT_SIZE;
 	unaccepted_table->phys_base = unaccepted_start;
 	unaccepted_table->size = bitmap_size;
-	memset(unaccepted_table->bitmap, 0, bitmap_size);
+	memset(unaccepted_table->bitmap, 0, total_size);
 	parse_acpi_srat_regions(mark_hotplug_memory_unaccepted, &ctx);
 
 	status = efi_bs_call(install_configuration_table,
@@ -287,6 +290,9 @@ void process_unaccepted_memory(u64 start, u64 end)
 	 */
 	bitmap_set(unaccepted_table->bitmap,
 		   start / unit_size, (end - start) / unit_size);
+	/* Set plugged bits for static memory and never unset */
+	bitmap_set(plugged_bitmap_of(unaccepted_table),
+		   start / unit_size, (end - start) / unit_size);
 }
 
 void accept_memory(phys_addr_t start, unsigned long size)
diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
index 4a8ec8d6a571..c290b16c5142 100644
--- a/drivers/firmware/efi/unaccepted_memory.c
+++ b/drivers/firmware/efi/unaccepted_memory.c
@@ -38,6 +38,7 @@ void accept_memory(phys_addr_t start, unsigned long size)
 	unsigned long flags;
 	phys_addr_t end;
 	u64 unit_size;
+	void *plugged_bitmap;
 
 	unaccepted = efi_get_unaccepted_table();
 	if (!unaccepted)
@@ -126,12 +127,23 @@ void accept_memory(phys_addr_t start, unsigned long size)
 	 */
 	list_add(&range.list, &accepting_list);
 
-	range_start = range.start;
-	for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
-				   range.end) {
+	plugged_bitmap = plugged_bitmap_of(unaccepted);
+
+	for (range_start = range.start; range_start < range.end; range_start = range_end) {
 		unsigned long phys_start, phys_end;
-		unsigned long len = range_end - range_start;
+		unsigned long len;
+		unsigned long unaccepted_zero, plugged_zero;
+
+		range_start = find_next_and_bit(plugged_bitmap, unaccepted->bitmap,
+						range.end, range_start);
+
+		if (range_start >= range.end)
+			break;
 
+		unaccepted_zero = find_next_zero_bit(unaccepted->bitmap, range.end, range_start);
+		plugged_zero = find_next_zero_bit(plugged_bitmap, range.end, range_start);
+		range_end = min(unaccepted_zero, plugged_zero);
+		len = range_end - range_start;
 		phys_start = range_start * unit_size + unaccepted->phys_base;
 		phys_end = range_end * unit_size + unaccepted->phys_base;
 
@@ -167,6 +179,7 @@ bool range_contains_unaccepted_memory(phys_addr_t start, unsigned long size)
 	bool ret = false;
 	phys_addr_t end;
 	u64 unit_size;
+	void *plugged_bitmap;
 
 	unaccepted = efi_get_unaccepted_table();
 	if (!unaccepted)
@@ -201,9 +214,14 @@ bool range_contains_unaccepted_memory(phys_addr_t start, unsigned long size)
 	if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
 		end = unaccepted->size * unit_size * BITS_PER_BYTE;
 
+	plugged_bitmap = plugged_bitmap_of(unaccepted);
+
 	spin_lock_irqsave(&unaccepted_memory_lock, flags);
 	while (start < end) {
-		if (test_bit(start / unit_size, unaccepted->bitmap)) {
+		unsigned long range_start = start / unit_size;
+
+		if (test_bit(range_start, plugged_bitmap) &&
+		    test_bit(range_start, unaccepted->bitmap)) {
 			ret = true;
 			break;
 		}
-- 
2.52.0



* [RFCv2 PATCH 2/6] efi/unaccepted: Set unaccepted bits for all hotplug memory
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
	michael.roth, ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-1-zhenzhong.duan@intel.com>

In coco guests, hotpluggable memory ranges are initially unaccepted.
While a previous change expanded the unaccepted memory bitmap boundaries
to include these hotplug spaces, the actual bits inside the bitmap are
not yet marked as unaccepted.

Walk the SRAT a second time after the bitmap is allocated and set the bits
corresponding to hotpluggable ranges.

This ensures the bitmap state accurately reflects all static and hotplug
memory ranges before booting the kernel.
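
As an illustration only (not part of the patch), the translation from an
SRAT range to bitmap bits can be sketched in userspace C. The demo_*
helpers are hypothetical. The sketch assumes unit_size is a power of two
and that the range does not start below phys_base. It also adds a guard
for ranges that contain no whole unit:

```c
#include <stdint.h>

/* Assumes 'unit' is a power of two. */
static uint64_t demo_round_up(uint64_t x, uint64_t unit)
{
	return (x + unit - 1) & ~(unit - 1);
}

static uint64_t demo_round_down(uint64_t x, uint64_t unit)
{
	return x & ~(unit - 1);
}

/*
 * Convert [base, base + length) into a run of bitmap bits. Only units
 * fully covered by the range are counted. Returns the number of bits and
 * stores the index of the first bit in *first_bit.
 */
static uint64_t demo_range_to_bits(uint64_t base, uint64_t length,
				   uint64_t phys_base, uint64_t unit,
				   uint64_t *first_bit)
{
	uint64_t start = demo_round_up(base, unit);
	uint64_t end = demo_round_down(base + length, unit);

	/* Guard added for this sketch: the range holds no whole unit. */
	if (end <= start) {
		*first_bit = 0;
		return 0;
	}

	*first_bit = (start - phys_base) / unit;
	return (end - start) / unit;
}
```

For a 2M unit and phys_base at 4G, a 5M range starting at 4G + 1M covers
two whole units, beginning at bit 1.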

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 .../firmware/efi/libstub/unaccepted_memory.c   | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
index bfbb78bd7b8a..01bed8e751ca 100644
--- a/drivers/firmware/efi/libstub/unaccepted_memory.c
+++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
@@ -92,6 +92,23 @@ static void update_mem_boundaries(struct acpi_srat_mem_affinity *mem, struct sra
 		*(ctx->mem_end) = range_end;
 }
 
+static void mark_hotplug_memory_unaccepted(struct acpi_srat_mem_affinity *mem,
+					   struct srat_parse_ctx *ctx)
+{
+	u64 unit_size = unaccepted_table->unit_size;
+	u64 start, end;
+
+	start = round_up(mem->base_address, unit_size);
+	end = round_down(mem->base_address + mem->length, unit_size);
+
+	/* Translate to offsets from the beginning of the bitmap */
+	start -= unaccepted_table->phys_base;
+	end -= unaccepted_table->phys_base;
+
+	bitmap_set(unaccepted_table->bitmap,
+		   start / unit_size, (end - start) / unit_size);
+}
+
 efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
 					struct efi_boot_memmap *map)
 {
@@ -169,6 +186,7 @@ efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
 	unaccepted_table->phys_base = unaccepted_start;
 	unaccepted_table->size = bitmap_size;
 	memset(unaccepted_table->bitmap, 0, bitmap_size);
+	parse_acpi_srat_regions(mark_hotplug_memory_unaccepted, &ctx);
 
 	status = efi_bs_call(install_configuration_table,
 			     &unaccepted_table_guid, unaccepted_table);
-- 
2.52.0



* [RFCv2 PATCH 1/6] efi/unaccepted: Support hotplug memory in unaccepted bitmap via SRAT
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
	michael.roth, ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng
In-Reply-To: <20260623101739.79695-1-zhenzhong.duan@intel.com>

Currently, allocate_unaccepted_bitmap() only scans the initial EFI
boot memory map. This misses hotpluggable ranges described in the
ACPI SRAT. Without early tracking, hotplug pages are accessed without
acceptance, which triggers a guest crash.

Introduce a lightweight ACPI SRAT parser to scan these regions early.
If a region has both ACPI_SRAT_MEM_ENABLED and ACPI_SRAT_MEM_HOT_PLUGGABLE
flags, expand the tracking boundaries. This avoids pulling in the full
ACPI subsystem while ensuring the bitmap covers both static memory and
hotplug memory.

Bail out early with success on non-confidential guests to prevent
unnecessary bitmap allocation.
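
As an illustration only (not part of the patch), the subtable walk can be
sketched in userspace C. struct demo_mem_record is a hypothetical,
simplified record and not the real struct acpi_srat_mem_affinity layout.
The DEMO_* constants mirror the SRAT memory affinity type and flag bits.
The sketch also bounds each record length against the buffer, which is an
extra check added here:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_TYPE_MEMORY	1		/* memory affinity subtable */
#define DEMO_MEM_ENABLED	(1u << 0)
#define DEMO_MEM_HOTPLUG	(1u << 1)

/* Hypothetical simplified record: type and length lead, as in ACPI subtables. */
struct demo_mem_record {
	uint8_t type;
	uint8_t length;
	uint16_t pad;
	uint32_t flags;
	uint64_t base;
	uint64_t size;
};

/* Expand [*lo, *hi) to cover every enabled, hotpluggable memory record. */
static void demo_scan(const uint8_t *buf, size_t len, uint64_t *lo, uint64_t *hi)
{
	const uint32_t mask = DEMO_MEM_ENABLED | DEMO_MEM_HOTPLUG;
	size_t off = 0;

	while (off + 2 <= len) {
		uint8_t type = buf[off];
		uint8_t rec_len = buf[off + 1];

		/* Malformed record: stop rather than loop forever or overrun. */
		if (rec_len < 2 || rec_len > len - off)
			break;

		if (type == DEMO_TYPE_MEMORY &&
		    rec_len >= sizeof(struct demo_mem_record)) {
			struct demo_mem_record rec;

			memcpy(&rec, buf + off, sizeof(rec));
			/* Skip ranges whose end would overflow 64 bits. */
			if ((rec.flags & mask) == mask &&
			    rec.size <= UINT64_MAX - rec.base) {
				if (rec.base < *lo)
					*lo = rec.base;
				if (rec.base + rec.size > *hi)
					*hi = rec.base + rec.size;
			}
		}
		off += rec_len;
	}
}
```

Starting from lo = UINT64_MAX and hi = 0, the result stays unchanged when
no hotpluggable range exists, which matches the early "return EFI_SUCCESS"
check on unaccepted_start in the patch.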

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 drivers/firmware/efi/libstub/efistub.h        |  6 ++
 arch/x86/boot/compressed/mem.c                |  2 +-
 .../firmware/efi/libstub/unaccepted_memory.c  | 94 +++++++++++++++++++
 3 files changed, 101 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/libstub/efistub.h b/drivers/firmware/efi/libstub/efistub.h
index fd91fc15ec81..fc0cd33a5962 100644
--- a/drivers/firmware/efi/libstub/efistub.h
+++ b/drivers/firmware/efi/libstub/efistub.h
@@ -1260,4 +1260,10 @@ void arch_accept_memory(phys_addr_t start, phys_addr_t end);
 efi_status_t efi_zboot_decompress_init(unsigned long *alloc_size);
 efi_status_t efi_zboot_decompress(u8 *out, unsigned long outlen);
 
+bool early_is_tdx_guest(void);
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+bool early_is_sevsnp_guest(void);
+#else
+static inline bool early_is_sevsnp_guest(void) { return false; }
+#endif
 #endif
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 0e9f84ab4bdc..40e9c81a2206 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -12,7 +12,7 @@
  *
  * Enumerate TDX directly from the early users.
  */
-static bool early_is_tdx_guest(void)
+bool early_is_tdx_guest(void)
 {
 	static bool once;
 	static bool is_tdx;
diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
index 757dbe734a47..bfbb78bd7b8a 100644
--- a/drivers/firmware/efi/libstub/unaccepted_memory.c
+++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
@@ -1,19 +1,109 @@
 // SPDX-License-Identifier: GPL-2.0-only
 
 #include <linux/efi.h>
+#include <linux/acpi.h>
 #include <asm/efi.h>
 #include "efistub.h"
 
 struct efi_unaccepted_memory *unaccepted_table;
 
+struct srat_parse_ctx {
+	u64 *mem_start;
+	u64 *mem_end;
+};
+
+typedef void (*srat_region_handler_t)(struct acpi_srat_mem_affinity *mem,
+				      struct srat_parse_ctx *ctx);
+
+/*
+ * parse_acpi_srat_regions - Loop through ACPI SRAT tables to process
+ * hotpluggable memory regions via a custom callback handler.
+ */
+static void parse_acpi_srat_regions(srat_region_handler_t handler, struct srat_parse_ctx *ctx)
+{
+	u32 hotplug_mask = ACPI_SRAT_MEM_ENABLED | ACPI_SRAT_MEM_HOT_PLUGGABLE;
+	struct acpi_table_header *xsdt, *srat = NULL;
+	struct acpi_table_rsdp *rsdp = NULL;
+	u8 *current_ptr, *end_ptr;
+	u64 *table_pointers;
+	u32 entry_count;
+	unsigned long i;
+
+	rsdp = get_efi_config_table(ACPI_20_TABLE_GUID);
+
+	if (!rsdp || !ACPI_VALIDATE_RSDP_SIG(rsdp->signature))
+		return;
+
+	xsdt = (struct acpi_table_header *)(unsigned long)rsdp->xsdt_physical_address;
+	if (!xsdt || !ACPI_COMPARE_NAMESEG(xsdt->signature, ACPI_SIG_XSDT))
+		return;
+
+	if (xsdt->length < sizeof(struct acpi_table_header) + ACPI_XSDT_ENTRY_SIZE)
+		return;
+
+	entry_count = (xsdt->length - sizeof(struct acpi_table_header)) / ACPI_XSDT_ENTRY_SIZE;
+	table_pointers = (u64 *)((u8 *)xsdt + sizeof(struct acpi_table_header));
+
+	for (i = 0; i < entry_count; i++) {
+		struct acpi_table_header *tbl;
+
+		tbl = (struct acpi_table_header *)(unsigned long)table_pointers[i];
+		if (tbl && ACPI_COMPARE_NAMESEG(tbl->signature, ACPI_SIG_SRAT)) {
+			srat = tbl;
+			break;
+		}
+	}
+
+	if (!srat)
+		return;
+
+	current_ptr = (u8 *)srat + sizeof(struct acpi_table_srat);
+	end_ptr = (u8 *)srat + srat->length;
+
+	while (current_ptr < end_ptr) {
+		struct acpi_subtable_header *sub_header;
+		u64 range_end;
+
+		sub_header = (struct acpi_subtable_header *)current_ptr;
+		if (sub_header->length == 0)
+			break;
+
+		if (sub_header->type == ACPI_SRAT_TYPE_MEMORY_AFFINITY &&
+		    sub_header->length >= sizeof(struct acpi_srat_mem_affinity)) {
+			struct acpi_srat_mem_affinity *mem;
+
+			mem = (struct acpi_srat_mem_affinity *)current_ptr;
+			if ((mem->flags & hotplug_mask) == hotplug_mask &&
+			    !check_add_overflow(mem->base_address, mem->length, &range_end))
+				handler(mem, ctx);
+		}
+		current_ptr += sub_header->length;
+	}
+}
+
+static void update_mem_boundaries(struct acpi_srat_mem_affinity *mem, struct srat_parse_ctx *ctx)
+{
+	u64 range_end = mem->base_address + mem->length;
+
+	if (mem->base_address < *(ctx->mem_start))
+		*(ctx->mem_start) = mem->base_address;
+
+	if (range_end > *(ctx->mem_end))
+		*(ctx->mem_end) = range_end;
+}
+
 efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
 					struct efi_boot_memmap *map)
 {
 	efi_guid_t unaccepted_table_guid = LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID;
 	u64 unaccepted_start = ULLONG_MAX, unaccepted_end = 0, bitmap_size;
+	struct srat_parse_ctx ctx;
 	efi_status_t status;
 	int i;
 
+	if (!early_is_tdx_guest() && !early_is_sevsnp_guest())
+		return EFI_SUCCESS;
+
 	/* Check if the table is already installed */
 	unaccepted_table = get_efi_config_table(unaccepted_table_guid);
 	if (unaccepted_table) {
@@ -38,6 +128,10 @@ efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
 				     d->phys_addr + d->num_pages * PAGE_SIZE);
 	}
 
+	ctx.mem_start = &unaccepted_start;
+	ctx.mem_end = &unaccepted_end;
+	parse_acpi_srat_regions(update_mem_boundaries, &ctx);
+
 	if (unaccepted_start == ULLONG_MAX)
 		return EFI_SUCCESS;
 
-- 
2.52.0



* [RFCv2 PATCH 0/6] Support memory hotplug/unplug for TDX CoCo guests
From: Zhenzhong Duan @ 2026-06-23 10:17 UTC (permalink / raw)
  To: marcandre.lureau, david, kas, rick.p.edgecombe, prsampat,
	pbonzini, mst, peterx, chenyi.qiang, elena.reshetova,
	michael.roth, ackerleytng
  Cc: linux-kernel, linux-coco, virtualization, x86, yilun.xu,
	xiaoyao.li, chao.p.peng

This RFCv2 series implements comprehensive support for virtio-mem and ACPI
DIMM memory hotplug/unplug in Intel TDX confidential computing guests.
It explores the start-private memory approach utilizing the native
TDG.MEM.PAGE.RELEASE API.

We are seeking feedback from Kiryl on the CoCo guest implementation, from
MM experts on DIMM and virtio-mem memory hotplug integration, and broader
virtio/CoCo community input on the overall approach. We are not seeking
x86 maintainer review at this stage.

== Changes from RFC v1 ==

- Eliminated callback infrastructure: Dropped the plug callback and
  replaced the unplug callback with a platform-level unaccept function
  integrated into the core MM hotplug and virtio-mem subsystems.
- Added comprehensive bitmap tracking: Introduced a "plugged" bitmap
  alongside the unaccepted bitmap to track populated hotplug memory
  states to support load_unaligned_zeropad().
- Enhanced SRAT parsing: Extended the EFI stub to parse ACPI SRAT tables
  early, ensuring hotpluggable ranges are tracked from initial boot.

For background and an overview of related efforts in the community,
please see the RFCv1 cover letter [1].

== Technical Approach ==

- Early SRAT Integration: A lightweight EFI stub parser scans ACPI SRAT
  tables to identify hotpluggable ranges and adjust bitmap boundaries
  early, avoiding the overhead of the full ACPI subsystem.
- Comprehensive Bitmap Tracking: Introduces a "plugged" bitmap right
  after the unaccepted bitmap. Both static and hotplugged memory are
  tracked, allowing the guest to map which ranges are populated by the
  VMM. This prevents acceptance beyond plugged memory boundaries due to
  load_unaligned_zeropad() operations.
- Platform Extensibility: Exposes generic CoCo memory interfaces. Other
  confidential platforms (like AMD SEV-SNP) can easily adopt this by
  hooking their specific mechanisms into arch_unaccept_memory().
- Hotplug & Guest Control: Integrates platform-level unaccept logic
  into ACPI hotplug and virtio-mem handlers. Uses TDG.MEM.PAGE.RELEASE
  for TDX to explicitly set memory to the "unaccepted" state during
  unplug, removing host hole-punching dependencies.
- Kexec Handover: Leverages existing EFI mechanisms to seamlessly hand
  over both the extended unaccepted bitmap and the new plugged bitmap
  across kexec boundaries.
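
As a rough illustration of the two-bitmap idea above, here is a minimal
user-space sketch. It is not code from this series: the 2 MiB unit size,
the 64-unit limit, and every name in it (coco_bitmaps, plug_range,
unplug_range, accept_unit) are assumptions made for the example. It only
shows why acceptance is refused for units the VMM has not populated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumptions for this sketch: 2 MiB tracking units, at most 64 units. */
#define UNIT_SIZE (2ULL << 20)
#define NR_UNITS  64u

/* Hypothetical container for the two bitmaps (one bit per unit). */
struct coco_bitmaps {
	uint64_t phys_base;   /* first tracked physical address */
	uint64_t unaccepted;  /* bit set: unit has not been accepted */
	uint64_t plugged;     /* bit set: unit is populated by the VMM */
};

/* Map a physical address to its bit in both bitmaps. */
static uint64_t unit_bit(const struct coco_bitmaps *b, uint64_t addr)
{
	uint64_t idx = (addr - b->phys_base) / UNIT_SIZE;

	assert(addr >= b->phys_base && idx < NR_UNITS);
	return 1ULL << idx;
}

/* Hotplug: units in [start, end) become populated but stay unaccepted. */
static void plug_range(struct coco_bitmaps *b, uint64_t start, uint64_t end)
{
	for (uint64_t a = start; a < end; a += UNIT_SIZE) {
		b->plugged |= unit_bit(b, a);
		b->unaccepted |= unit_bit(b, a);
	}
}

/* Unplug: units return to the unaccepted state and are no longer populated. */
static void unplug_range(struct coco_bitmaps *b, uint64_t start, uint64_t end)
{
	for (uint64_t a = start; a < end; a += UNIT_SIZE) {
		b->plugged &= ~unit_bit(b, a);
		b->unaccepted |= unit_bit(b, a);
	}
}

/* Accept one unit; refuse anything the VMM has not populated. */
static bool accept_unit(struct coco_bitmaps *b, uint64_t addr)
{
	uint64_t bit = unit_bit(b, addr);

	if (!(b->plugged & bit) || !(b->unaccepted & bit))
		return false;
	b->unaccepted &= ~bit;
	return true;
}
```

In this model, a stray access just past the last plugged unit (the
load_unaligned_zeropad() case) lands on a unit whose plugged bit is
clear, so accept_unit() returns false instead of accepting memory that
is not there.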

== Testing ==

- DIMM and virtio-mem memory hotplug/unplug
- lazy and eager accept
- kexec/kdump with hotplugged memory

This was tested with Marc-André Lureau's latest QEMU series [2].

Comments appreciated, thanks.

Zhenzhong

[1] https://lore.kernel.org/all/20260604093551.1511079-1-zhenzhong.duan@intel.com/
[2] https://lore.kernel.org/all/20260604-rdm5-v5-0-5768e6a0943d@redhat.com/

Zhenzhong Duan (6):
  efi/unaccepted: Support hotplug memory in unaccepted bitmap via SRAT
  efi/unaccepted: Set unaccepted bits for all hotplug memory
  efi/unaccepted: Create plugged bitmap to support hotplug memory in
    coco guest
  x86/tdx: Implement arch_unaccept_memory()
  mm/memory_hotplug: Support ACPI hotplug/unplug for coco guest
  virtio-mem: Support memory hotplug/unplug for coco guest

 arch/x86/include/asm/shared/tdx.h             |   2 +
 arch/x86/include/asm/tdx.h                    |   2 +
 arch/x86/include/asm/unaccepted_memory.h      |  11 ++
 drivers/firmware/efi/libstub/efistub.h        |   6 +
 include/linux/efi.h                           |   5 +
 include/linux/mm.h                            |  11 ++
 arch/x86/boot/compressed/mem.c                |   4 +-
 arch/x86/coco/tdx/tdx.c                       | 120 ++++++++++++++++
 drivers/firmware/efi/efi.c                    |   4 +-
 .../firmware/efi/libstub/unaccepted_memory.c  | 128 +++++++++++++++++-
 drivers/firmware/efi/unaccepted_memory.c      | 122 ++++++++++++++++-
 drivers/virtio/virtio_mem.c                   |   8 ++
 mm/memory_hotplug.c                           |  16 +++
 13 files changed, 425 insertions(+), 14 deletions(-)

-- 
2.52.0



* Re: [PATCH v2] virtio_net: disable cb when NAPI is busy-polled
From: Michael S. Tsirkin @ 2026-06-23 10:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Longjun Tang, netdev, xuanzhuo, jasowang, virtualization,
	tanglongjun
In-Reply-To: <CANn89iLrOTKQNqJA_oZKPjkHb1Xyqm6LS9tDn72X4az65isDGQ@mail.gmail.com>

On Tue, Jun 23, 2026 at 02:55:30AM -0700, Eric Dumazet wrote:
> On Tue, Jun 23, 2026 at 2:19 AM Longjun Tang <lange_tang@163.com> wrote:
> >
> > From: Longjun Tang <tanglongjun@kylinos.cn>
> >
> > When busy-poll is active, napi_schedule_prep() returns false in
> > virtqueue_napi_schedule(), so virtqueue_disable_cb() is skipped.
> > The device may keep firing irqs until reaches virtqueue_napi_complete().
> > Under load (received == budget), it will lead to a large number
> > of spurious interrupts.
> >
> > Fix it by disabling the callback at the virtnet_poll() entry. This keeps
> > the callback off while we poll and re-enable by virtqueue_napi_complete()
> > when going idle.
> >
> > Signed-off-by: Longjun Tang <tanglongjun@kylinos.cn>
> >
> 
> I added netdev@ to get more attention from networking napi polling experts,
> 
> Please add a Fixes: tag as this will ease code review.
> 
> My rough guess is:
> 
> Fixes: ceef438d613f ("virtio_net: remove custom busy_poll")
> 
> Thanks.

Exactly. The old custom virtnet_busy_poll did napi_schedule_prep + virtqueue_disable_cb itself.

I'd even say CC stable: interrupt storms are devastating to performance.
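
To make the failure mode concrete, here is a small user-space model of
the behaviour described in the commit log. It is only a sketch: the
struct and function names are made up for illustration and none of this
is driver code. The model assumes one device interrupt attempt per poll
round and that busy-poll already owns the NAPI "scheduled" state, so the
schedule-prep step fails and callbacks never get disabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state, not the kernel's data structures. */
struct poll_model {
	bool sched;          /* NAPI instance is already owned/scheduled */
	bool cb_enabled;     /* device may raise interrupts */
	unsigned int spurious;
};

/* Models napi_schedule_prep(): fails if the instance is already owned. */
static bool model_schedule_prep(struct poll_model *m)
{
	if (m->sched)
		return false;
	m->sched = true;
	return true;
}

/* Models the interrupt path: callbacks are disabled only if prep succeeds. */
static void model_device_irq(struct poll_model *m)
{
	if (!m->cb_enabled)
		return;                  /* interrupt suppressed */
	if (model_schedule_prep(m))
		m->cb_enabled = false;
	else
		m->spurious++;           /* fired, but nothing to schedule */
}

/* One poll round under load: the budget is exhausted, so no completion. */
static void model_poll(struct poll_model *m, bool disable_cb_at_entry)
{
	if (disable_cb_at_entry)
		m->cb_enabled = false;   /* the change proposed in the patch */
	model_device_irq(m);             /* packets arrive while polling */
}

/* Count spurious interrupts over a number of busy-poll rounds. */
static unsigned int model_busy_poll(bool disable_cb_at_entry, unsigned int rounds)
{
	struct poll_model m = { .sched = true, .cb_enabled = true, .spurious = 0 };

	for (unsigned int i = 0; i < rounds; i++)
		model_poll(&m, disable_cb_at_entry);
	return m.spurious;
}
```

In this model, model_busy_poll(false, n) counts one spurious interrupt
per round, while model_busy_poll(true, n) counts none.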


> > ---
> > V1 -> V2: Remain agnostic to busy polling
> > ---
> >  drivers/net/virtio_net.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index f4adcfee7a80..0a11f2b32500 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -3008,6 +3008,11 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
> >         unsigned int xdp_xmit = 0;
> >         bool napi_complete;
> >
> > +       /* Keep callbacks suppressed for the duration of this poll,
> > +        * busy-poll need.
> > +        */
> > +       virtqueue_disable_cb(rq->vq);
> > +
> >         virtnet_poll_cleantx(rq, budget);
> >
> >         received = virtnet_receive(rq, budget, &xdp_xmit);
> > --
> > 2.43.0
> >



* Re: [PATCH v2] virtio_net: disable cb when NAPI is busy-polled
From: Michael S. Tsirkin @ 2026-06-23 10:10 UTC (permalink / raw)
  To: Longjun Tang; +Cc: xuanzhuo, jasowang, edumazet, virtualization, tanglongjun
In-Reply-To: <20260623091901.118315-1-lange_tang@163.com>

On Tue, Jun 23, 2026 at 05:19:01PM +0800, Longjun Tang wrote:
> From: Longjun Tang <tanglongjun@kylinos.cn>
> 
> When busy-poll is active, napi_schedule_prep() returns false in
> virtqueue_napi_schedule(), so virtqueue_disable_cb() is skipped.
> The device may keep firing irqs until reaches virtqueue_napi_complete().
> Under load (received == budget), it will lead to a large number
> of spurious interrupts.
> 
> Fix it by disabling the callback at the virtnet_poll() entry. This keeps
> the callback off while we poll and re-enable by virtqueue_napi_complete()
> when going idle.
> 
> Signed-off-by: Longjun Tang <tanglongjun@kylinos.cn>


Patch is correct


Acked-by: Michael S. Tsirkin <mst@redhat.com>

Ideally, I'd like to see performance numbers in such cases, included in the commit log.

> ---
> V1 -> V2: Remain agnostic to busy polling
> ---
>  drivers/net/virtio_net.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f4adcfee7a80..0a11f2b32500 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3008,6 +3008,11 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
>  	unsigned int xdp_xmit = 0;
>  	bool napi_complete;
>  
> +	/* Keep callbacks suppressed for the duration of this poll,
> +	 * busy-poll need.
> +	 */
> +	virtqueue_disable_cb(rq->vq);
> +
>  	virtnet_poll_cleantx(rq, budget);
>  
>  	received = virtnet_receive(rq, budget, &xdp_xmit);
> -- 
> 2.43.0



* Re: [PATCH v2] virtio_net: disable cb when NAPI is busy-polled
From: Eric Dumazet @ 2026-06-23  9:55 UTC (permalink / raw)
  To: Longjun Tang, netdev; +Cc: mst, xuanzhuo, jasowang, virtualization, tanglongjun
In-Reply-To: <20260623091901.118315-1-lange_tang@163.com>

On Tue, Jun 23, 2026 at 2:19 AM Longjun Tang <lange_tang@163.com> wrote:
>
> From: Longjun Tang <tanglongjun@kylinos.cn>
>
> When busy-poll is active, napi_schedule_prep() returns false in
> virtqueue_napi_schedule(), so virtqueue_disable_cb() is skipped.
> The device may keep firing irqs until reaches virtqueue_napi_complete().
> Under load (received == budget), it will lead to a large number
> of spurious interrupts.
>
> Fix it by disabling the callback at the virtnet_poll() entry. This keeps
> the callback off while we poll and re-enable by virtqueue_napi_complete()
> when going idle.
>
> Signed-off-by: Longjun Tang <tanglongjun@kylinos.cn>
>

I added netdev@ to get more attention from networking NAPI polling experts.

Please add a Fixes: tag as this will ease code review.

My rough guess is:

Fixes: ceef438d613f ("virtio_net: remove custom busy_poll")

Thanks.

> ---
> V1 -> V2: Remain agnostic to busy polling
> ---
>  drivers/net/virtio_net.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f4adcfee7a80..0a11f2b32500 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3008,6 +3008,11 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
>         unsigned int xdp_xmit = 0;
>         bool napi_complete;
>
> +       /* Keep callbacks suppressed for the duration of this poll,
> +        * busy-poll need.
> +        */
> +       virtqueue_disable_cb(rq->vq);
> +
>         virtnet_poll_cleantx(rq, budget);
>
>         received = virtnet_receive(rq, budget, &xdp_xmit);
> --
> 2.43.0
>


* Re: [PATCH v6 01/12] nvdimm: preserve flush callback errors
From: Pankaj Gupta @ 2026-06-23  9:46 UTC (permalink / raw)
  To: Li Chen
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alison Schofield, virtualization, nvdimm, linux-kernel
In-Reply-To: <20260621130246.2973254-2-me@linux.beauty>

> nvdimm_flush() currently converts any non-zero provider flush error to
> -EIO. That loses useful errno values from provider callbacks.
>
> A local virtio-pmem mkfs sanity test showed the masking clearly:
>
>   wipefs: /dev/pmem0: cannot flush modified buffers: Input/output error
>   mkfs.ext4: Input/output error while writing out and closing file system
>   nd_region region0: dbg: nvdimm_flush rc=-5
>
> The virtio-pmem callback can return -ENOMEM when async_pmem_flush() fails
> to allocate a child flush bio, but nvdimm_flush() hides that as -EIO before
> pmem_submit_bio() converts it to a block status.
>
> Return the provider callback error directly. The generic flush path still
> returns 0, and pmem_submit_bio() already handles errno-to-blk_status
> conversion for bio completion.
>
> Signed-off-by: Li Chen <me@linux.beauty>
> ---
> v3->v4:
> - New patch.
>
>  drivers/nvdimm/region_devs.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index e35c2e18518f0..0cd96503c0596 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -1114,10 +1114,8 @@ int nvdimm_flush(struct nd_region *nd_region, struct bio *bio)
>
>         if (!nd_region->flush)
>                 rc = generic_nvdimm_flush(nd_region);
> -       else {
> -               if (nd_region->flush(nd_region, bio))
> -                       rc = -EIO;
> -       }
> +       else
> +               rc = nd_region->flush(nd_region, bio);

IIRC the -EIO conversion was introduced as a generic error type, since a
failed flush can also propagate host-side errors that may not be
relevant to the guest.
That said, we could still consider passing through specific cases like
-ENOMEM, unless there is a better approach to address this.
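
One possible shape for that middle ground, purely as a sketch:
filter_flush_error() is a hypothetical helper, not something in the
patch or in the nvdimm code. It keeps -ENOMEM visible and folds every
other failure into -EIO, as the current code does for all errors.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper: pass through errors that are meaningful to the
 * guest (only -ENOMEM in this sketch) and report anything else as -EIO.
 */
static int filter_flush_error(int rc)
{
	if (rc == 0)
		return 0;
	if (rc == -ENOMEM)
		return rc;
	return -EIO;
}
```

Which errno values belong on the pass-through list is the open design
question; the single -ENOMEM case here is just the one named above.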

Thanks,
Pankaj
>
>         return rc;
>  }
> --
> 2.52.0


* Re: [PATCH] virtio_console: fix endian conversion in handle_control_message()
From: Arnd Bergmann @ 2026-06-23  9:28 UTC (permalink / raw)
  To: Ben Dooks, Amit Shah, Greg Kroah-Hartman, virtualization,
	linux-kernel
In-Reply-To: <20260623092141.631355-1-ben.dooks@codethink.co.uk>

On Tue, Jun 23, 2026, at 11:21, Ben Dooks wrote:
> There are a couple of prints in handle_control_message() which should
> have converted cpkt->id through virtio32_to_cpu() before passing to
> a print.
>
> This fixes the following (prototype) sparse warnings:
> drivers/char/virtio_console.c:1538:17: warning: incorrect type in 
> argument 4 (different base types)
> drivers/char/virtio_console.c:1538:17:    expected unsigned int
> drivers/char/virtio_console.c:1538:17:    got restricted __virtio32 
> [usertype] id
> drivers/char/virtio_console.c:1553:25: warning: incorrect type in 
> argument 3 (different base types)
> drivers/char/virtio_console.c:1553:25:    expected unsigned int
> drivers/char/virtio_console.c:1553:25:    got restricted __virtio32 
> [usertype] id
>
> Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>

Acked-by: Arnd Bergmann <arnd@arndb.de>


* [PATCH] virtio_console: fix endian conversion in handle_control_message()
From: Ben Dooks @ 2026-06-23  9:21 UTC (permalink / raw)
  To: Amit Shah, Arnd Bergmann, Greg Kroah-Hartman, virtualization,
	linux-kernel
  Cc: Ben Dooks

There are a couple of prints in handle_control_message() which should
have converted cpkt->id through virtio32_to_cpu() before passing to
a print.

This fixes the following (prototype) sparse warnings:
drivers/char/virtio_console.c:1538:17: warning: incorrect type in argument 4 (different base types)
drivers/char/virtio_console.c:1538:17:    expected unsigned int
drivers/char/virtio_console.c:1538:17:    got restricted __virtio32 [usertype] id
drivers/char/virtio_console.c:1553:25: warning: incorrect type in argument 3 (different base types)
drivers/char/virtio_console.c:1553:25:    expected unsigned int
drivers/char/virtio_console.c:1553:25:    got restricted __virtio32 [usertype] id
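
For readers unfamiliar with why the conversion matters, here is a
user-space analogue. It is a sketch only: wire32_to_cpu() is a made-up
name, and unlike virtio32_to_cpu() it takes the byte order as an explicit
flag instead of deriving it from the device. It shows that the same four
bytes yield different values depending on the assumed wire byte order,
which is why printing the raw field can show the wrong id.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode a 32-bit wire value into a host integer,
 * independent of the host's own byte order.
 */
static uint32_t wire32_to_cpu(bool little_endian, const uint8_t b[4])
{
	if (little_endian)
		return (uint32_t)b[0] |
		       ((uint32_t)b[1] << 8) |
		       ((uint32_t)b[2] << 16) |
		       ((uint32_t)b[3] << 24);

	return (uint32_t)b[3] |
	       ((uint32_t)b[2] << 8) |
	       ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[0] << 24);
}
```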

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
---
 drivers/char/virtio_console.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 198b97314168..cbdc497f5160 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -1536,7 +1536,8 @@ static void handle_control_message(struct virtio_device *vdev,
 	    cpkt->event != cpu_to_virtio16(vdev, VIRTIO_CONSOLE_PORT_ADD)) {
 		/* No valid header at start of buffer.  Drop it. */
 		dev_dbg(&portdev->vdev->dev,
-			"Invalid index %u in control packet\n", cpkt->id);
+			"Invalid index %u in control packet\n",
+			virtio32_to_cpu(vdev, cpkt->id));
 		return;
 	}
 
@@ -1553,7 +1554,8 @@ static void handle_control_message(struct virtio_device *vdev,
 			dev_warn(&portdev->vdev->dev,
 				"Request for adding port with "
 				"out-of-bound id %u, max. supported id: %u\n",
-				cpkt->id, portdev->max_nr_ports - 1);
+				 virtio32_to_cpu(vdev, cpkt->id),
+				 portdev->max_nr_ports - 1);
 			break;
 		}
 		add_port(portdev, virtio32_to_cpu(vdev, cpkt->id));
-- 
2.37.2.352.g3c44437643



* [PATCH v2] virtio_net: disable cb when NAPI is busy-polled
From: Longjun Tang @ 2026-06-23  9:19 UTC (permalink / raw)
  To: mst, xuanzhuo; +Cc: jasowang, edumazet, virtualization, tanglongjun, lange_tang

From: Longjun Tang <tanglongjun@kylinos.cn>

When busy-poll is active, napi_schedule_prep() returns false in
virtqueue_napi_schedule(), so virtqueue_disable_cb() is skipped.
The device may keep firing irqs until virtqueue_napi_complete()
is reached. Under load (received == budget), this leads to a large
number of spurious interrupts.

Fix it by disabling the callback at virtnet_poll() entry. This keeps
the callback off while we poll; virtqueue_napi_complete() re-enables
it when going idle.

Signed-off-by: Longjun Tang <tanglongjun@kylinos.cn>

---
V1 -> V2: Remain agnostic to busy polling
---
 drivers/net/virtio_net.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f4adcfee7a80..0a11f2b32500 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3008,6 +3008,11 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
 	unsigned int xdp_xmit = 0;
 	bool napi_complete;
 
+	/* Keep callbacks suppressed for the duration of this poll,
+	 * busy-poll need.
+	 */
+	virtqueue_disable_cb(rq->vq);
+
 	virtnet_poll_cleantx(rq, budget);
 
 	received = virtnet_receive(rq, budget, &xdp_xmit);
-- 
2.43.0



* Re: [PATCH v6 07/12] nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags
From: Pankaj Gupta @ 2026-06-23  8:34 UTC (permalink / raw)
  To: Li Chen
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alison Schofield, virtualization, nvdimm, linux-kernel
In-Reply-To: <20260621130246.2973254-8-me@linux.beauty>

> Use READ_ONCE()/WRITE_ONCE() for the wait_event() flags (done and
> wq_buf_avail). They are observed by waiters without pmem_lock, so make
> the accesses explicit single loads/stores and avoid compiler
> reordering/caching across the wait/wake paths.
>
> Signed-off-by: Li Chen <me@linux.beauty>

Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
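
For reference, a simplified user-space sketch of the pattern. The
SKETCH_* macros below are assumptions made for illustration; they mimic
only the core idea (a volatile access that the compiler must perform
exactly once) and are not the kernel's definitions. They rely on the
GCC/Clang __typeof__ extension, and flush_req is a made-up stand-in for
the request structure.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins: force one real load or store via a volatile cast. */
#define SKETCH_READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define SKETCH_WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical request with a flag read by a waiter without a lock. */
struct flush_req {
	bool done;
};

/* Submitter side: arm the request before queuing it. */
static void req_init(struct flush_req *r)
{
	SKETCH_WRITE_ONCE(r->done, false);
}

/* Completion side: publish the flag, then wake the waiter. */
static void req_complete(struct flush_req *r)
{
	SKETCH_WRITE_ONCE(r->done, true);
}

/* Waiter side: the condition that would be passed to a wait loop. */
static bool req_is_done(struct flush_req *r)
{
	return SKETCH_READ_ONCE(r->done);
}
```

Note that these accessors only prevent the compiler from caching,
tearing, or duplicating the access; they are not memory barriers. In
the driver, any ordering between the flag update and the wakeup comes
from the wait_event()/wake_up() pair.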

> ---
> v2->v3:
> - Split out READ_ONCE()/WRITE_ONCE() updates from patch 3/7 (no functional
>   change intended).
> v3->v4:
> - Rebased onto v7.1-rc7 and renumbered after the flush error patches.
>
>  drivers/nvdimm/nd_virtio.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
> index 8ed4d6b3a9284..da829e9f4bdff 100644
> --- a/drivers/nvdimm/nd_virtio.c
> +++ b/drivers/nvdimm/nd_virtio.c
> @@ -18,9 +18,9 @@ static void virtio_pmem_wake_one_waiter(struct virtio_pmem *vpmem)
>
>         req_buf = list_first_entry(&vpmem->req_list,
>                                    struct virtio_pmem_request, list);
> -       req_buf->wq_buf_avail = true;
> +       list_del_init(&req_buf->list);
> +       WRITE_ONCE(req_buf->wq_buf_avail, true);
>         wake_up(&req_buf->wq_buf);
> -       list_del(&req_buf->list);
>  }
>
>   /* The interrupt handler */
> @@ -34,7 +34,7 @@ void virtio_pmem_host_ack(struct virtqueue *vq)
>         spin_lock_irqsave(&vpmem->pmem_lock, flags);
>         while ((req_data = virtqueue_get_buf(vq, &len)) != NULL) {
>                 virtio_pmem_wake_one_waiter(vpmem);
> -               req_data->done = true;
> +               WRITE_ONCE(req_data->done, true);
>                 wake_up(&req_data->host_acked);
>         }
>         spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> @@ -66,7 +66,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
>         if (!req_data)
>                 return -ENOMEM;
>
> -       req_data->done = false;
> +       WRITE_ONCE(req_data->done, false);
>         init_waitqueue_head(&req_data->host_acked);
>         init_waitqueue_head(&req_data->wq_buf);
>         INIT_LIST_HEAD(&req_data->list);
> @@ -87,12 +87,12 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
>                                         GFP_ATOMIC)) == -ENOSPC) {
>
>                 dev_info(&vdev->dev, "failed to send command to virtio pmem device, no free slots in the virtqueue\n");
> -               req_data->wq_buf_avail = false;
> +               WRITE_ONCE(req_data->wq_buf_avail, false);
>                 list_add_tail(&req_data->list, &vpmem->req_list);
>                 spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
>
>                 /* A host response results in "host_ack" getting called */
> -               wait_event(req_data->wq_buf, req_data->wq_buf_avail);
> +               wait_event(req_data->wq_buf, READ_ONCE(req_data->wq_buf_avail));
>                 spin_lock_irqsave(&vpmem->pmem_lock, flags);
>         }
>         err1 = virtqueue_kick(vpmem->req_vq);
> @@ -106,7 +106,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
>                 err = -EIO;
>         } else {
>                 /* A host response results in "host_ack" getting called */
> -               wait_event(req_data->host_acked, req_data->done);
> +               wait_event(req_data->host_acked, READ_ONCE(req_data->done));
>                 err = le32_to_cpu(req_data->resp.ret);
>         }
>
> --
> 2.52.0


* Re: [PATCH 0/2] tools: Fix tools/virtio test build
From: Yichong Chen @ 2026-06-23  3:17 UTC (permalink / raw)
  To: mst
  Cc: akpm, chenyichong, eperezma, jasowang, linux-kernel, ljs, rppt,
	virtualization, xuanzhuo
In-Reply-To: <20260618080405-mutt-send-email-mst@kernel.org>

Hi Michael,

I tested the commit you pointed me to:

  8cb2c9285e4c ("can: virtio: Fix comment in UAPI header")

The tools/virtio build failure is still reproducible there with:

  make -C tools/virtio test

The first failure is:

  include/linux/virtio.h:10:10: fatal error:
  linux/mod_devicetable.h: No such file or directory

I also checked my original v7.1-based series, and it does not apply
cleanly on top of this commit due to context changes in
tools/virtio/linux/dma-mapping.h.

Would you like me to respin the series on top of 8cb2c9285e4c?

Thanks,
Yichong


