Linux virtualization list


* Re: [PATCH net v4] virtio-net: fix len check in receive_big()
From: Michael S. Tsirkin @ 2026-06-16  4:39 UTC (permalink / raw)
  To: Xiang Mei
  Cc: jasowang, xuanzhuo, eperezma, andrew+netdev, davem, edumazet,
	kuba, pabeni, netdev, virtualization, linux-kernel,
	minhquangbui99, bestswngs
In-Reply-To: <20260616042837.2249468-1-xmei5@asu.edu>

On Mon, Jun 15, 2026 at 09:28:37PM -0700, Xiang Mei wrote:
> receive_big() bounds the device-announced length by
> (big_packets_num_skbfrags + 1) * PAGE_SIZE.  That is still too loose:
> add_recvbuf_big() sets sg[1] to start at offset
> sizeof(struct padded_vnet_hdr) into the first page, so the chain
> actually carries hdr_len + (PAGE_SIZE - sizeof(padded_vnet_hdr)) +
> big_packets_num_skbfrags * PAGE_SIZE bytes -- 20 bytes less than the
> check allows for the common hdr_len == 12 case.
> 
> A malicious virtio backend can announce a len in that gap.  page_to_skb()
> then walks one frag past the page chain, storing a NULL page->private
> into skb_shinfo()->frags[MAX_SKB_FRAGS], which is both an out-of-bounds
> write past the static frag array and a NULL frag handed up the rx path.
> 
> Bound len by the size add_recvbuf_big() actually advertised.
> 
> Fixes: 0c716703965f ("virtio-net: fix received length check in big packets")
> Reported-by: Weiming Shi <bestswngs@gmail.com>
> Signed-off-by: Xiang Mei <xmei5@asu.edu>
> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
> v4: use easy to understand math to compute the max_len
> v3: revoke 2/2 and add Xuan Zhuo's Reviewed-by tag

I still feel 2/2 is good defence in depth but it can be
pursued separately.

> v2: add additional check as 2/2
> 
>  drivers/net/virtio_net.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f4adcfee7a80..8f4562316aaa 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1999,15 +1999,16 @@ static struct sk_buff *receive_big(struct net_device *dev,
>  				   struct virtnet_rq_stats *stats)
>  {
>  	struct page *page = buf;
> +	unsigned long max_len = (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE -
> +				sizeof(struct padded_vnet_hdr) + vi->hdr_len;
>  	struct sk_buff *skb;
>  
>  	/* Make sure that len does not exceed the size allocated in
>  	 * add_recvbuf_big.
>  	 */
> -	if (unlikely(len > (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE)) {
> +	if (unlikely(len > max_len)) {
>  		pr_debug("%s: rx error: len %u exceeds allocated size %lu\n",
> -			 dev->name, len,
> -			 (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE);
> +			 dev->name, len, max_len);
>  		goto err;
>  	}
>  
> -- 
> 2.43.0


^ permalink raw reply

* Re: [PATCH net v3] virtio-net: fix len check in receive_big()
From: Xiang Mei @ 2026-06-16  4:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: jasowang, xuanzhuo, eperezma, andrew+netdev, davem, edumazet,
	kuba, pabeni, netdev, virtualization, linux-kernel,
	minhquangbui99, bestswngs
In-Reply-To: <20260614152904-mutt-send-email-mst@kernel.org>

On Sun, Jun 14, 2026 at 12:29 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Sat, Jun 13, 2026 at 01:15:02PM -0700, Xiang Mei wrote:
> > On Wed, Jun 10, 2026 at 10:56 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Wed, Jun 10, 2026 at 07:46:16PM -0700, Xiang Mei wrote:
> > > > receive_big() bounds the device-announced length by
> > > > (big_packets_num_skbfrags + 1) * PAGE_SIZE.  That is still too loose:
> > > > add_recvbuf_big() sets sg[1] to start at offset
> > > > sizeof(struct padded_vnet_hdr) into the first page, so the chain
> > > > actually carries hdr_len + (PAGE_SIZE - sizeof(padded_vnet_hdr)) +
> > > > big_packets_num_skbfrags * PAGE_SIZE bytes -- 20 bytes less than the
> > > > check allows for the common hdr_len == 12 case.
> > > >
> > > > A malicious virtio backend can announce a len in that gap.  page_to_skb()
> > > > then walks one frag past the page chain, storing a NULL page->private
> > > > into skb_shinfo()->frags[MAX_SKB_FRAGS], which is both an out-of-bounds
> > > > write past the static frag array and a NULL frag handed up the rx path.
> > > >
> > > > Bound len by the size add_recvbuf_big() actually advertised.
> > > >
> > > > Fixes: 0c716703965f ("virtio-net: fix received length check in big packets")
> > > > Reported-by: Weiming Shi <bestswngs@gmail.com>
> > > > Signed-off-by: Xiang Mei <xmei5@asu.edu>
> > > > Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > >
> > > Thanks for the patch! Something small to improve:
> > >
> > > > ---
> > > > v3: revoke 2/2 and add Xuan Zhuo's Reviewed-by tag
> > > >
> > > >  drivers/net/virtio_net.c | 8 +++++---
> > > >  1 file changed, 5 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > index f4adcfee7a80..afe73eda1491 100644
> > > > --- a/drivers/net/virtio_net.c
> > > > +++ b/drivers/net/virtio_net.c
> > > > @@ -1999,15 +1999,17 @@ static struct sk_buff *receive_big(struct net_device *dev,
> > > >                                  struct virtnet_rq_stats *stats)
> > > >  {
> > > >       struct page *page = buf;
> > > > +     unsigned long max_len;
> > >
> > > Assignment can happen here?
> > >
> > > >       struct sk_buff *skb;
> > > >
> > > >       /* Make sure that len does not exceed the size allocated in
> > > >        * add_recvbuf_big.
> > > >        */
> > > > -     if (unlikely(len > (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE)) {
> > > > +     max_len = vi->hdr_len + (PAGE_SIZE - sizeof(struct padded_vnet_hdr)) +
> > > > +               vi->big_packets_num_skbfrags * PAGE_SIZE;
> > >
> > > Took me a while to figure out what is going on, but I finally
> > > understand:
> > >
> > >
> > > Reducing
> > > (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE
> > >
> > > (what we allocated)
> > >
> > > by sizeof(struct padded_vnet_hdr) - vi->hdr_len
> > >
> > >
> > > right?
> > >
> > > So clearer as:
> > >
> > >
> > >         unsigned long max_len = (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE -
> > >         sizeof(struct padded_vnet_hdr) + vi->hdr_len;
> > >
> > Right, that's the same value. Yours reads better!
> >
> > I'll fold this into the next respin. One thing I'd like to settle
> > first: David suggested storing this in a vi field computed once at the
> > probe (it's a per-device constant) and just comparing len against it
> > on the datapath, instead of re-deriving it in receive_big() each time.
> > I'll wait for his take on that and send a single v4 that covers both.
> >
> > Xiang
>
> I don't mind.
Thanks, Michael,

V4 has been sent.

Xiang
>
> > >
> > >
> > >
> > > > +     if (unlikely(len > max_len)) {
> > > >               pr_debug("%s: rx error: len %u exceeds allocated size %lu\n",
> > > > -                      dev->name, len,
> > > > -                      (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE);
> > > > +                      dev->name, len, max_len);
> > > >               goto err;
> > > >       }
> > > >
> > > > --
> > > > 2.43.0
> > >
>

^ permalink raw reply

* [PATCH net v4] virtio-net: fix len check in receive_big()
From: Xiang Mei @ 2026-06-16  4:28 UTC (permalink / raw)
  To: mst, jasowang, xuanzhuo, eperezma
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, netdev,
	virtualization, linux-kernel, minhquangbui99, bestswngs,
	Xiang Mei

receive_big() bounds the device-announced length by
(big_packets_num_skbfrags + 1) * PAGE_SIZE.  That is still too loose:
add_recvbuf_big() sets sg[1] to start at offset
sizeof(struct padded_vnet_hdr) into the first page, so the chain
actually carries hdr_len + (PAGE_SIZE - sizeof(padded_vnet_hdr)) +
big_packets_num_skbfrags * PAGE_SIZE bytes -- 20 bytes less than the
check allows for the common hdr_len == 12 case.

A malicious virtio backend can announce a len in that gap.  page_to_skb()
then walks one frag past the page chain, storing a NULL page->private
into skb_shinfo()->frags[MAX_SKB_FRAGS], which is both an out-of-bounds
write past the static frag array and a NULL frag handed up the rx path.

Bound len by the size add_recvbuf_big() actually advertised.

Fixes: 0c716703965f ("virtio-net: fix received length check in big packets")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
v4: use easy to understand math to compute the max_len
v3: revoke 2/2 and add Xuan Zhuo's Reviewed-by tag
v2: add additional check as 2/2

 drivers/net/virtio_net.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f4adcfee7a80..8f4562316aaa 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1999,15 +1999,16 @@ static struct sk_buff *receive_big(struct net_device *dev,
 				   struct virtnet_rq_stats *stats)
 {
 	struct page *page = buf;
+	unsigned long max_len = (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE -
+				sizeof(struct padded_vnet_hdr) + vi->hdr_len;
 	struct sk_buff *skb;

 	/* Make sure that len does not exceed the size allocated in
 	 * add_recvbuf_big.
 	 */
-	if (unlikely(len > (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE)) {
+	if (unlikely(len > max_len)) {
 		pr_debug("%s: rx error: len %u exceeds allocated size %lu\n",
-			 dev->name, len,
-			 (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE);
+			 dev->name, len, max_len);
 		goto err;
 	}

-- 
2.43.0

^ permalink raw reply related
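
The arithmetic behind the new bound can be checked in userspace. The sketch below is illustrative only: the function names are hypothetical, the 4 KiB page size is an assumption, and the 32-byte size for struct padded_vnet_hdr is inferred from the commit message's statement that the gap is 20 bytes when hdr_len == 12. It is not the kernel code.

```c
#include <assert.h>

/* Illustrative values only; the real ones come from the kernel build. */
#define SKETCH_PAGE_SIZE       4096UL /* assumption: 4 KiB pages */
#define SKETCH_PADDED_HDR_SIZE 32UL   /* assumption: implied by the 20-byte gap */

/* Old bound: every byte of every page in the chain. */
static unsigned long old_max_len(unsigned long num_skbfrags)
{
	return (num_skbfrags + 1) * SKETCH_PAGE_SIZE;
}

/* New bound: the old bound reduced by the part of the padded header
 * area in the first page that is not exposed to the device.
 */
static unsigned long new_max_len(unsigned long num_skbfrags,
				 unsigned long hdr_len)
{
	assert(hdr_len <= SKETCH_PADDED_HDR_SIZE);
	return (num_skbfrags + 1) * SKETCH_PAGE_SIZE -
	       SKETCH_PADDED_HDR_SIZE + hdr_len;
}

/* Number of len values the old check accepted but the new one rejects. */
static unsigned long bound_gap(unsigned long num_skbfrags,
			       unsigned long hdr_len)
{
	return old_max_len(num_skbfrags) - new_max_len(num_skbfrags, hdr_len);
}
```

Under these assumptions, bound_gap() depends only on hdr_len, not on the number of frags, which matches the commit message: the looseness comes entirely from the first page.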

* Re:Re: [PATCH] virtio_net: disable cb when napi_schedule_prep fails during busy-poll
From: Xuan Zhuo @ 2026-06-16  3:27 UTC (permalink / raw)
  To: Lange Tang
  Cc: edumazet@google.com, Jakub Kicinski,
	virtualization@lists.linux.dev, Tang Longjun, jasowang@redhat.com,
	mst@redhat.com
In-Reply-To: <19692a81.3001.19ece5f8ddc.Coremail.lange_tang@163.com>

On Tue, 16 Jun 2026 11:00:29 +0800 (CST), Lange Tang <lange_tang@163.com> wrote:
> At 2026-06-15 18:01:40, "Xuan Zhuo" <xuanzhuo@linux.alibaba.com> wrote:
> >On Mon, 15 Jun 2026 17:45:50 +0800, Longjun Tang <lange_tang@163.com> wrote:
> >> From: Longjun Tang <tanglongjun@kylinos.cn>
> >>
> >> When busy-poll is active, napi_schedule_prep() returns false in
> >> skb_recv_done(), so virtqueue_disable_cb() is skipped. The device
> >> may keep firing irqs until the next poll round reaches
> >> virtqueue_napi_complete(). If cb is enabled under busy-poll case,
> >> it will lead to a large number of spurious interrupts. Explicitly
> >> disable callbacks in this case to prevent spurious interrupts.
> >>
> >> Signed-off-by: Longjun Tang <tanglongjun@kylinos.cn>
> >> ---
> >>  drivers/net/virtio_net.c | 2 ++
> >>  1 file changed, 2 insertions(+)
> >>
> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> index f4adcfee7a80..6d675fddc59b 100644
> >> --- a/drivers/net/virtio_net.c
> >> +++ b/drivers/net/virtio_net.c
> >> @@ -728,6 +728,8 @@ static void virtqueue_napi_schedule(struct napi_struct *napi,
> >>  	if (napi_schedule_prep(napi)) {
> >>  		virtqueue_disable_cb(vq);
> >>  		__napi_schedule(napi);
> >> +	} else if (test_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state)) {
> >> +		virtqueue_disable_cb(vq);
> >
> >I see, but we should avoid checking NAPI_STATE_IN_BUSY_POLL directly in the
> >drivers. The NIC driver should remain agnostic to busy polling. I think we need
> >a better way, maybe we should rewrite virtqueue_napi_schedule instead.
>
> How about rewriting it like this?
> static void virtqueue_napi_schedule(struct napi_struct *napi,
>                                     struct virtqueue *vq)
> {
>         virtqueue_disable_cb(vq);
>         if (napi_schedule_prep(napi))
>                 __napi_schedule(napi);
> }
> Any comments are welcome.


Another CPU could be running NAPI and has just enabled the callbacks (cb).
Meanwhile, this side unconditionally disables the cb. Since NAPI on the other
CPU hasn't exited yet, the subsequent prep on this side fails, leaving no one to
re-enable the cb.

Thanks.


> >
> >Thanks.
> >
> >>  	}
> >>  }
> >>
> >> --
> >> 2.25.1
> >>
>

^ permalink raw reply
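
The interleaving described in the reply above can be replayed with a small deterministic model. This is a toy sketch, not kernel code: struct model and the model_* functions are hypothetical stand-ins, and the model deliberately ignores any recheck of the ring that the real completion path may perform. It only shows why disabling callbacks before a failed napi_schedule_prep() can leave nobody responsible for re-enabling them.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state, not kernel structures. */
struct model {
	bool cb_enabled;  /* device is allowed to raise interrupts */
	bool napi_sched;  /* stands in for the NAPI "scheduled" bit */
	bool napi_queued; /* a new poll run has been queued */
};

/* Stand-in for napi_schedule_prep(): succeeds only if NAPI is idle. */
static bool model_schedule_prep(struct model *m)
{
	if (m->napi_sched)
		return false;
	m->napi_sched = true;
	return true;
}

/* Existing shape: disable callbacks only when scheduling succeeds. */
static void model_irq_current(struct model *m)
{
	if (model_schedule_prep(m)) {
		m->cb_enabled = false;
		m->napi_queued = true;
	}
}

/* Proposed shape: disable callbacks first, then try to schedule. */
static void model_irq_proposed(struct model *m)
{
	m->cb_enabled = false;
	if (model_schedule_prep(m))
		m->napi_queued = true;
}

/* Replays the interleaving. Returns true if the queue ends up stalled:
 * callbacks are off and no poll run is pending to turn them back on.
 */
static bool model_replay(void (*irq)(struct model *))
{
	struct model m = { .napi_sched = true }; /* CPU A is inside NAPI */

	m.cb_enabled = true;  /* CPU A: finishing up, re-enables callbacks */
	irq(&m);              /* CPU B: interrupt arrives before CPU A exits */
	m.napi_sched = false; /* CPU A: NAPI completes and goes idle */

	return !m.cb_enabled && !m.napi_queued;
}
```

In this model, model_replay(model_irq_proposed) reports a stall while model_replay(model_irq_current) does not, because the existing shape leaves the callbacks untouched when the prep step fails.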

* Re:Re: [PATCH] virtio_net: disable cb when napi_schedule_prep fails during busy-poll
From: Lange Tang @ 2026-06-16  3:00 UTC (permalink / raw)
  To: xuanzhuo@linux.alibaba.com, mst@redhat.com
  Cc: edumazet@google.com, Jakub Kicinski,
	virtualization@lists.linux.dev, Tang Longjun, jasowang@redhat.com
In-Reply-To: <1781517700.4206195-1-xuanzhuo@linux.alibaba.com>

At 2026-06-15 18:01:40, "Xuan Zhuo" <xuanzhuo@linux.alibaba.com> wrote:
>On Mon, 15 Jun 2026 17:45:50 +0800, Longjun Tang <lange_tang@163.com> wrote:
>> From: Longjun Tang <tanglongjun@kylinos.cn>
>>
>> When busy-poll is active, napi_schedule_prep() returns false in
>> skb_recv_done(), so virtqueue_disable_cb() is skipped. The device
>> may keep firing irqs until the next poll round reaches
>> virtqueue_napi_complete(). If cb is enabled under busy-poll case,
>> it will lead to a large number of spurious interrupts. Explicitly
>> disable callbacks in this case to prevent spurious interrupts.
>>
>> Signed-off-by: Longjun Tang <tanglongjun@kylinos.cn>
>> ---
>>  drivers/net/virtio_net.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index f4adcfee7a80..6d675fddc59b 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -728,6 +728,8 @@ static void virtqueue_napi_schedule(struct napi_struct *napi,
>>  	if (napi_schedule_prep(napi)) {
>>  		virtqueue_disable_cb(vq);
>>  		__napi_schedule(napi);
>> +	} else if (test_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state)) {
>> +		virtqueue_disable_cb(vq);
>
>I see, but we should avoid checking NAPI_STATE_IN_BUSY_POLL directly in the
>drivers. The NIC driver should remain agnostic to busy polling. I think we need
>a better way, maybe we should rewrite virtqueue_napi_schedule instead.

How about rewriting it like this?
static void virtqueue_napi_schedule(struct napi_struct *napi,
                                    struct virtqueue *vq)
{       
        virtqueue_disable_cb(vq);
        if (napi_schedule_prep(napi))
                __napi_schedule(napi);
}
Any comments are welcome.
>
>Thanks.
>
>>  	}
>>  }
>>
>> --
>> 2.25.1
>>

^ permalink raw reply

* Re: [PATCH net-next v2 1/2] virtio_net: xsk: fix race in rx wake up
From: Menglong Dong @ 2026-06-16  1:48 UTC (permalink / raw)
  To: menglong8.dong, Xuan Zhuo
  Cc: mst, jasowang, andrew+netdev, davem, edumazet, kuba, pabeni,
	minhquangbui99, kerneljasonxing, netdev, virtualization,
	linux-kernel, eperezma
In-Reply-To: <1781491685.0613394-1-xuanzhuo@linux.alibaba.com>

On 2026/6/15 10:48 Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> On Thu, 11 Jun 2026 10:56:43 +0800, menglong8.dong@gmail.com wrote:
> > From: Menglong Dong <dongml2@chinatelecom.cn>
> >
> > During packet receiving in virtio-net, the rq can be empty, which means
> > "rq->vq->num_free == virtqueue_get_vring_size(rq->vq)", in
> > virtnet_add_recvbuf_xsk(), if we are using xsk. Meanwhile, the fill ring
> > can be empty too, which means we can't allocate anything from
> > xsk_buff_alloc_batch(). Then, we will set the XDP_RING_NEED_WAKEUP flag.
> >
[...]
> >
> > +	need_wakeup = xsk_uses_need_wakeup(pool);
> >  	xsk_buffs = rq->xsk_buffs;
> >
> > +	/* If both rq->vq and fill ring are empty, and then the user submit
> > +	 * all the chunks to the fill ring and check the wake up flag
> > +	 * after xsk_buff_alloc_batch() and before xsk_set_rx_need_wakeup(),
> > +	 * we will lose the chance to wake up the rx napi, so we have to
> > +	 * set the need_wakeup flag here.
> > +	 */
> > +	if (need_wakeup && virtqueue_get_vring_size(rq->vq) == rq->vq->num_free)
> > +		xsk_set_rx_need_wakeup(pool);
> 
> Is Condition A here too strict? We should trigger the wakeup under a wider range
> of scenarios.

Hi, Xuan. Thanks for your review :)

The logic here is an addition to the original wake-up logic, which I intended
as a fix for a race condition. However, this race condition seems unlikely to
happen, as we discussed in this thread:

https://lore.kernel.org/netdev/rHZz5_ylT4WggoZ-Ic2Q4w@linux.dev/

So this patch is not necessary, and I'll send the 2nd patch standalone.

Thanks!
Menglong Dong

> 
> > +
> >  	num = xsk_buff_alloc_batch(pool, xsk_buffs, rq->vq->num_free);
> >  	if (!num) {
> > -		if (xsk_uses_need_wakeup(pool)) {
> > +		if (need_wakeup) {
> >  			xsk_set_rx_need_wakeup(pool);
> >  			/* Return 0 instead of -ENOMEM so that NAPI is
> >  			 * descheduled.
> > @@ -1341,8 +1352,6 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
> >  		}
> >
> >  		return -ENOMEM;
> > -	} else {
> > -		xsk_clear_rx_need_wakeup(pool);
> >  	}
> >
> >  	len = xsk_pool_get_rx_frame_size(pool) + vi->hdr_len;
> > @@ -1363,6 +1372,16 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
> >  			goto err;
> >  	}
> >
> > +	if (need_wakeup) {
> > +		if (rq->vq->num_free)
> > +			/* We have free buffers, so we'd better wake up the
> > +			 * rx napi as soon as possible.
> > +			 */
> > +			xsk_set_rx_need_wakeup(pool);
> 
> Is the purpose of waking up RX NAPI to invoke try_fill_recv? However,
> virtnet_poll does not call try_fill_recv directly. it is done
> conditionally.
> 
> Thanks.
> 
> 
> > +		else
> > +			xsk_clear_rx_need_wakeup(pool);
> > +	}
> > +
> >  	return num;
> >
> >  err:
> > --
> > 2.54.0
> >
> 
> 




^ permalink raw reply

* Re: [PATCH v1] s390/virtio_ccw: Also suppress -EINVAL on device detach
From: Halil Pasic @ 2026-06-15 21:42 UTC (permalink / raw)
  To: William Bezenah
  Cc: Cornelia Huck, linux-s390, farman, hca, gor, agordeev,
	borntraeger, svens, mjrosato, vneethv, oberpar, virtualization,
	kvm, linux-kernel, Halil Pasic
In-Reply-To: <4d7fc371-4357-496f-9774-1f7a7c1a3091@linux.ibm.com>

On Mon, 15 Jun 2026 16:01:55 -0400
William Bezenah <wbezenah@linux.ibm.com> wrote:

> On 6/15/2026 10:58 AM, Cornelia Huck wrote:
> > On Mon, Jun 15 2026, Halil Pasic <pasic@linux.ibm.com> wrote:
> >  
> >> On Fri, 12 Jun 2026 17:54:07 +0200
> >> William Bezenah <wbezenah@linux.ibm.com> wrote:
> >>  
> >>> Since commit 8c58a229688c ("s390/cio: Do not unregister the
> >>> subchannel based on DNV"), subchannel behavior following a device
> >>> detach has been updated and results in -EINVAL being propagated
> >>> rather than -ENODEV, originating from ccw_device_start_timeout_key()
> >>> in cio/device_ops. In the end, the virtio driver has no ability to
> >>> react to the difference between device and subchannel states here,
> >>> and during detach, both -ENODEV and -EINVAL indicate the device
> >>> cannot be used and should not be treated as errors requiring
> >>> attention. Update error handling in virtio_ccw_del_vq() and
> >>> virtio_ccw_drop_indicator() to suppress -EINVAL in addition to
> >>> -ENODEV.  
> >> Hi William!
> >>
> >> Are you saying that ccw_device_start() started returning -EINVAL
> >> since 8c58a229688c ("s390/cio: Do not unregister the subchannel based on
> >> DNV")? Or did I somehow read the paragraph wrong?
> >>
> >> The function ccw_device_start is documented to return:
> >>  * Returns:                                                                     
> >>  *  %0, if the operation was successful;                                        
> >>  *  -%EBUSY, if the device is busy, or status pending;                          
> >>  *  -%EACCES, if no path specified in @lpm is operational;                      
> >>  *  -%ENODEV, if the device is not operational. 
> >> and the commit message does not say a thing about introducing -EINVAL to
> >> the mix.  
> > The function may return -EINVAL for non-enabled subchannels
> > (i.e. pmcw.ena == 0), maybe we get an all-zeroes schib with dnv == 0?
> > I'd expect it not to be enabled in that case anyway.  
> 
> Yep, that's at least how I've come to understand what changed. The
> function ccw_device_start_timeout_key() has always returned -EINVAL
> for non-enabled subchannels (pmcw.ena == 0), though it's not
> documented in the header.

Wasn't this -EINVAL actually introduced by commit:
823d494ac111 ("[S390] pm: ccw bus power management callbacks")?

> 
> What changed with commit 8c58a229688c is that cio_update_schib() now
> updates the schib even when DNV=0, rather than returning early as it
> did previously. Somehow this update results in pmcw.ena == 0 in
> ccw_device_start_timeout_key(). Previously, it saw pmcw.ena == 1 and
> moved to the condition (cdev->private->state == DEV_STATE_NOT_OPER)
> where it returned -ENODEV.

Sounds fishy to me. As far as I understand the DNV takes precedence over
all other pieces of PMCW.

> 
> So the commit didn't introduce -EINVAL as a new return value, rather,
> it changed the subchannel lifecycle such that existing paths now
> propagate -EINVAL rather than -ENODEV during the device detach
> scenario.
> 

I'm not convinced returning -EINVAL in the given situation is the
right thing to do. Peter, would you mind chiming in?

Regards,
Halil

^ permalink raw reply

* Re: [PATCH v1] s390/virtio_ccw: Also suppress -EINVAL on device detach
From: William Bezenah @ 2026-06-15 20:01 UTC (permalink / raw)
  To: Cornelia Huck, Halil Pasic
  Cc: linux-s390, farman, hca, gor, agordeev, borntraeger, svens,
	mjrosato, vneethv, oberpar, virtualization, kvm, linux-kernel
In-Reply-To: <875x3jn94r.fsf@redhat.com>


On 6/15/2026 10:58 AM, Cornelia Huck wrote:
> On Mon, Jun 15 2026, Halil Pasic <pasic@linux.ibm.com> wrote:
>
>> On Fri, 12 Jun 2026 17:54:07 +0200
>> William Bezenah <wbezenah@linux.ibm.com> wrote:
>>
>>> Since commit 8c58a229688c ("s390/cio: Do not unregister the
>>> subchannel based on DNV"), subchannel behavior following a device
>>> detach has been updated and results in -EINVAL being propagated
>>> rather than -ENODEV, originating from ccw_device_start_timeout_key()
>>> in cio/device_ops. In the end, the virtio driver has no ability to
>>> react to the difference between device and subchannel states here,
>>> and during detach, both -ENODEV and -EINVAL indicate the device
>>> cannot be used and should not be treated as errors requiring
>>> attention. Update error handling in virtio_ccw_del_vq() and
>>> virtio_ccw_drop_indicator() to suppress -EINVAL in addition to
>>> -ENODEV.
>> Hi William!
>>
>> Are you saying that ccw_device_start() started returning -EINVAL
>> since 8c58a229688c ("s390/cio: Do not unregister the subchannel based on
>> DNV")? Or did I somehow read the paragraph wrong?
>>
>> The function ccw_device_start is documented to return:
>>  * Returns:                                                                     
>>  *  %0, if the operation was successful;                                        
>>  *  -%EBUSY, if the device is busy, or status pending;                          
>>  *  -%EACCES, if no path specified in @lpm is operational;                      
>>  *  -%ENODEV, if the device is not operational. 
>> and the commit message does not say a thing about introducing -EINVAL to
>> the mix.
> The function may return -EINVAL for non-enabled subchannels
> (i.e. pmcw.ena == 0), maybe we get an all-zeroes schib with dnv == 0?
> I'd expect it not to be enabled in that case anyway.

Yep, that's at least how I've come to understand what changed. The
function ccw_device_start_timeout_key() has always returned -EINVAL
for non-enabled subchannels (pmcw.ena == 0), though it's not
documented in the header.

What changed with commit 8c58a229688c is that cio_update_schib() now
updates the schib even when DNV=0, rather than returning early as it
did previously. Somehow this update results in pmcw.ena == 0 in
ccw_device_start_timeout_key(). Previously, it saw pmcw.ena == 1 and
moved to the condition (cdev->private->state == DEV_STATE_NOT_OPER)
where it returned -ENODEV.

So the commit didn't introduce -EINVAL as a new return value, rather,
it changed the subchannel lifecycle such that existing paths now
propagate -EINVAL rather than -ENODEV during the device detach
scenario.


^ permalink raw reply
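
The error handling the patch description proposes amounts to treating two return codes as benign during detach. The sketch below is a hedged illustration: detach_error_needs_warning() is a hypothetical helper, not a function from virtio_ccw, and whether -EINVAL should be suppressed at all is still under discussion in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper: true if ret should be reported as a real failure
 * when tearing down a virtqueue or indicator during device detach.
 */
static bool detach_error_needs_warning(int ret)
{
	switch (ret) {
	case 0:       /* success */
	case -ENODEV: /* device gone: already treated as benign */
	case -EINVAL: /* subchannel not enabled: suppression proposed here */
		return false;
	default:
		return true;
	}
}
```

Keeping the benign codes in one predicate would let both teardown paths named in the patch description share the same policy, so a later decision on -EINVAL changes a single place.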

* Re: [PATCH net-next 0/2] selftests/vsock: improve vng version and quirk handling
From: patchwork-bot+netdevbpf @ 2026-06-15 20:00 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: sgarzare, shuah, virtualization, netdev, linux-kselftest,
	linux-kernel, bobbyeshleman
In-Reply-To: <20260612-vsock-test-update-v1-0-7d7eeed3ac8f@meta.com>

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Fri, 12 Jun 2026 12:08:40 -0700 you wrote:
> As vng has continued updating, there have been two things in our
> selftests that have been affected. One is that newer versions always
> emit the vng version warning, and two is that we have a workaround that
> is not needed in newer versions.
> 
> This series just updates the version handling to allow all newer
> versions without warning and version-gates the workaround to only those
> versions that don't have the commit that fixed the root cause.
> 
> [...]

Here is the summary with links:
  - [net-next,1/2] selftests/vsock: accept vng 1.33 or >= 1.36
    https://git.kernel.org/netdev/net-next/c/197503d5ac82
  - [net-next,2/2] selftests/vsock: skip vng setsid workaround on >= 1.41
    https://git.kernel.org/netdev/net-next/c/9361bff6bdb7

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net 0/2] vsock/virtio: fix MSG_PEEK calculation on bytes to copy
From: Michael S. Tsirkin @ 2026-06-15 19:48 UTC (permalink / raw)
  To: Luigi Leonardi
  Cc: Stefan Hajnoczi, Stefano Garzarella, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Arseniy Krasnov, kvm, virtualization,
	netdev, linux-kernel
In-Reply-To: <20260402-fix_peek-v1-0-ad274fcef77b@redhat.com>

On Thu, Apr 02, 2026 at 10:18:00AM +0200, Luigi Leonardi wrote:
> `virtio_transport_stream_do_peek`, when calculating the number of bytes to copy,
> didn't consider the `offset` caused by partial reads that happened before.
> This might cause an out-of-bounds read that leads to an EFAULT.
> More details in the commit.
> 
> Commit 1 introduces the fix
> Commit 2 introduces a test that checks for this bug to avoid future
> regressions.
> 
> Signed-off-by: Luigi Leonardi <leonardi@redhat.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
> Luigi Leonardi (2):
>       vsock/virtio: fix MSG_PEEK ignoring skb offset when calculating bytes to copy
>       vsock/test: add MSG_PEEK after partial recv test
> 
>  net/vmw_vsock/virtio_transport_common.c |  5 ++-
>  tools/testing/vsock/vsock_test.c        | 64 +++++++++++++++++++++++++++++++++
>  2 files changed, 66 insertions(+), 3 deletions(-)
> ---
> base-commit: 9147566d801602c9e7fc7f85e989735735bf38ba
> change-id: 20260401-fix_peek-6837b83469e3
> 
> Best regards,
> -- 
> Luigi Leonardi <leonardi@redhat.com>


^ permalink raw reply
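
The cover letter describes a peek length that ignored how much of a buffer an earlier partial read had already consumed. A minimal sketch of an offset-aware calculation follows; peek_bytes_to_copy() and its parameters are hypothetical names chosen for illustration, and the actual fix in virtio_transport_common.c may be structured differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes a peek may copy from one buffer.
 *
 * buf_len: total payload length of the buffer
 * offset:  bytes already consumed by earlier partial reads
 * want:    bytes the caller still asks for
 */
static size_t peek_bytes_to_copy(size_t buf_len, size_t offset, size_t want)
{
	size_t remaining;

	assert(offset <= buf_len);
	remaining = buf_len - offset;

	/* Never copy past the end of the unread part of the buffer. */
	return want < remaining ? want : remaining;
}
```

Clamping against buf_len alone would allow a copy of up to offset bytes past the end of the payload, which is the kind of out-of-bounds read the cover letter describes.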

* Re: [PATCH splitout] virtio_balloon: disable indirect descriptors
From: David Hildenbrand (Arm) @ 2026-06-15 16:11 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Miaohe Lin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi,
	Alexander Duyck
In-Reply-To: <7160278ae1fc4f06dec966867d307033989d029c.1781022765.git.mst@redhat.com>

On 6/9/26 18:33, Michael S. Tsirkin wrote:
> The page reporting callback submits an sg list to the reporting
> virtqueue.  With VIRTIO_RING_F_INDIRECT_DESC negotiated and
> total_sg > 1 (which it typically is), virtqueue_add reports it to the
> host by allocating an indirect descriptor via kmalloc(GFP_KERNEL).
> 
> This is not pretty: the reporting worker isolates potentially hundreds
> of MB of free pages from the buddy allocator (reported pages are at
> least pageblock_order, and the sg can contain up to
> PAGE_REPORTING_CAPACITY entries of varying orders).  As a result, at
> least in theory, the kmalloc might trigger OOM when we have in fact a
> ton of free memory.

Very theoretical, given that we isolate large pageblocks and the kmalloc would
likely need just a single page. But yeah, avoiding memory allocation where
possible on these paths makes sense, I guess.

> 
> Clear VIRTIO_RING_F_INDIRECT_DESC, to avoid using indirect descriptors.
> 
> Fixes: b0c504f15471 ("virtio-balloon: add support for providing free page reports to host")
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> ---
>  drivers/virtio/virtio_balloon.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 53b4a3984e7d..6698edb61474 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -7,6 +7,7 @@
>   */
>  
>  #include <linux/virtio.h>
> +#include <uapi/linux/virtio_ring.h>
>  #include <linux/virtio_balloon.h>
>  #include <linux/swap.h>
>  #include <linux/workqueue.h>
> @@ -1175,6 +1176,11 @@ static int virtballoon_validate(struct virtio_device *vdev)
>  	else if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON))
>  		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_REPORTING);
>  
> +	/*
> +	 * Disable indirect descriptors to avoid memory allocation in
> +	 * virtqueue_add during page reporting.
> +	 */
> +	__virtio_clear_bit(vdev, VIRTIO_RING_F_INDIRECT_DESC);


Acked-by: David Hildenbrand (Arm) <david@kernel.org>


-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v1] s390/virtio_ccw: Also suppress -EINVAL on device detach
From: Cornelia Huck @ 2026-06-15 14:58 UTC (permalink / raw)
  To: Halil Pasic, William Bezenah
  Cc: linux-s390, farman, hca, gor, agordeev, borntraeger, svens,
	mjrosato, vneethv, oberpar, virtualization, kvm, linux-kernel,
	Halil Pasic
In-Reply-To: <20260615002309.052e0614.pasic@linux.ibm.com>

On Mon, Jun 15 2026, Halil Pasic <pasic@linux.ibm.com> wrote:

> On Fri, 12 Jun 2026 17:54:07 +0200
> William Bezenah <wbezenah@linux.ibm.com> wrote:
>
>> Since commit 8c58a229688c ("s390/cio: Do not unregister the
>> subchannel based on DNV"), subchannel behavior following a device
>> detach has been updated and results in -EINVAL being propagated
>> rather than -ENODEV, originating from ccw_device_start_timeout_key()
>> in cio/device_ops. In the end, the virtio driver has no ability to
>> react to the difference between device and subchannel states here,
>> and during detach, both -ENODEV and -EINVAL indicate the device
>> cannot be used and should not be treated as errors requiring
>> attention. Update error handling in virtio_ccw_del_vq() and
>> virtio_ccw_drop_indicator() to suppress -EINVAL in addition to
>> -ENODEV.
>
> Hi William!
>
> Are you saying that ccw_device_start() started returning -EINVAL
> since 8c58a229688c ("s390/cio: Do not unregister the subchannel based on
> DNV")? Or did I somehow read the paragraph wrong?
>
> The function ccw_device_start is documented to return:
>  * Returns:                                                                     
>  *  %0, if the operation was successful;                                        
>  *  -%EBUSY, if the device is busy, or status pending;                          
>  *  -%EACCES, if no path specified in @lpm is operational;                      
>  *  -%ENODEV, if the device is not operational. 
> and the commit message does not say a thing about introducing -EINVAL to
> the mix.

The function may return -EINVAL for non-enabled subchannels
(i.e. pmcw.ena == 0), maybe we get an all-zeroes schib with dnv == 0?
I'd expect it not to be enabled in that case anyway.
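
For illustration only, the suppression under discussion can be sketched in
plain userspace C. The helper names below are hypothetical and are not taken
from the patch or from virtio_ccw; the sketch only assumes that, during
detach, both -ENODEV and -EINVAL mean "the device cannot be used".

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not from the patch: during detach, treat both
 * -ENODEV and -EINVAL as "device is gone" rather than as real errors.
 */
static inline bool detach_errno_is_benign(int ret)
{
	return ret == -ENODEV || ret == -EINVAL;
}

/* Returns true if the caller should warn about ret during detach. */
static inline bool detach_should_warn(int ret)
{
	return ret != 0 && !detach_errno_is_benign(ret);
}
```

With this shape, a return value such as -EBUSY would still be reported, while
the two "device unusable" codes would stay quiet.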


^ permalink raw reply

* Re: [PATCH v1 0/2] virtio: PCI ERS permanent failure teardown for virtio-blk
From: Stefan Hajnoczi @ 2026-06-15 14:52 UTC (permalink / raw)
  To: Xixin Liu
  Cc: linux-block, virtualization, mst, jasowang, xuanzhuo, eperezma,
	pbonzini, axboe, linux-kernel, Parav Pandit
In-Reply-To: <cover.virtio-blk-ers-v1.1780449274.git.liuxixin@kylinos.cn>

On Mon, Jun 15, 2026 at 10:00:00AM +0800, Xixin Liu wrote:
> Hi,
> 
> This series adds proper PCI AER error recovery handling for virtio-pci and
> completes virtio-blk teardown when ERS reports pci_channel_io_perm_failure.

CCing Parav because he previously looked at surprise removal:
https://lore.kernel.org/virtualization/20250822091706.21170-1-parav@nvidia.com/

> 
> virtio-pci only registered reset_prepare/reset_done.  The recovery core
> treats devices without error_detected as NO_AER_DRIVER and does not
> deliver perm_failure to the driver after a failed recovery.  When bus
> reset fails (reproduced on QEMU with DLLLA not set within 100 ms after
> secondary bus reset), virtio-blk disks stay live even though virtqueues
> may already have been torn down during the frozen phase.
> 
> Patch 1 registers error_detected (frozen quiesce + perm_failure notify).
> Patch 2 calls the virtio driver shutdown hook from virtio-pci on
> perm_failure, implements virtio-blk shutdown with blk_mark_disk_dead(),
> and fail-fast guards in virtio_queue_rq.
> 
> Thanks,
> Xixin Liu
> 
> ---
> 
> Xixin Liu (2):
>   virtio-pci: add error_detected for PCI AER recovery
>   virtio-blk: mark disk dead on ERS permanent failure
> 
>  drivers/block/virtio_blk.c         | 39 +++++++++++++++++++++++++++++++
>  drivers/virtio/virtio_pci_common.c | 47 ++++++++++++++++++++++++++++++++++
>  2 files changed, 85 insertions(+)
> 
> -- 
> 2.43.0
> 

^ permalink raw reply

* Re: vhost: fix vhost_get_avail_idx for a non empty ring
From: Christian Borntraeger @ 2026-06-15 14:24 UTC (permalink / raw)
  To: mst
  Cc: eperezma, jasowang, kvm, linux-kernel, netdev, sgarzare, shuangyu,
	stefanha, virtualization, Christian Borntraeger
In-Reply-To: <559b04ae6ce52973c535dc47e461638b7f4c3d63.1772441455.git.mst@redhat.com>

Late feedback, but this patch significantly improves our uperf latency and
bandwidth and reduces CPU consumption on s390. The improvements show up
across the board: streaming and transactional (100 byte/2000 byte). Nice fix.

Christian

^ permalink raw reply

* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: David Hildenbrand (Arm) @ 2026-06-15 10:54 UTC (permalink / raw)
  To: Miaohe Lin, Michael S. Tsirkin
  Cc: Zi Yan, Andrew Morton, linux-kernel, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <984d9775-e17c-0231-b021-126b13a9aa42@huawei.com>

On 6/15/26 05:29, Miaohe Lin wrote:
> On 2026/6/11 21:20, David Hildenbrand (Arm) wrote:
>> On 6/11/26 09:36, Miaohe Lin wrote:
>>>
>>> Agreed, it's not worth doing so.
>>>
>>>
>>> Since memory_failure might be the only place, this change would be unacceptable.
>>> We should come up with a better solution. Maybe we can try repeating SetPageHWPoison
>>> and ClearPageHWPoison at a first attempt though it looks somewhat weird to me and makes
>>> code more complicated.
>>
>> And I am fairly sure we could still have some remaining races ... it's shaky.
> 
> I have to agree it's shaky.

Right, just let the writing task get rescheduled after reading the flags,
but before writing them back.

> Any suggestion for next step?

We have various code that assumes that no concurrent writes are
possible, and consequently, we use no atomics.

__free_pages_prepare() is just one user.

Then we have __folio_set_locked(), __folio_clear_active()
and __folio_clear_unevictable().

But also __folio_mark_uptodate(), which is called rather frequently.

page_cpupid_reset_last() is also a thing, but it mostly falls
under __free_pages_prepare() handling.

... and __split_folio_to_order() also messes with flags directly without atomics.


Many of these are only possible for frozen pages (refcount == 0). I think
only  __folio_set_locked() and __folio_mark_uptodate() are called on
non-frozen pages, when there is the expectation that nobody will concurrently
use atomics that would be bad (e.g., don't trylock if not an lru page).


We don't want to use atomics at these places just to please memory failure code.

Would it be sufficient to know in memory-failure code that concurrent
handling succeeded?


Assume that we enlighten all non-atomics to grab the rcu read lock, such as

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7223f6f4e2b4..3c3852b60bbd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -803,10 +803,30 @@ static inline bool PageUptodate(const struct page *page)
        return folio_test_uptodate(page_folio(page));
 }
 
+#ifdef CONFIG_MEMORY_FAILURE
+static inline void page_flags_modify_nonatomic_begin(void)
+{
+       rcu_read_lock();
+}
+static inline void page_flags_modify_nonatomic_end(void)
+{
+       rcu_read_unlock();
+}
+#else
+static inline void page_flags_modify_nonatomic_begin(void)
+{
+}
+static inline void page_flags_modify_nonatomic_end(void)
+{
+}
+#endif
+
 static __always_inline void __folio_mark_uptodate(struct folio *folio)
 {
        smp_wmb();
+       page_flags_modify_nonatomic_begin();
        __set_bit(PG_uptodate, folio_flags(folio, 0));
+       page_flags_modify_nonatomic_end();
 }
 

And then we have some retry logic such as:

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 51508a55c405..1123c40aaf43 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -162,6 +162,62 @@ static struct rb_root_cached pfn_space_itree = RB_ROOT_CACHED;
 
 static DEFINE_MUTEX(pfn_space_lock);
 
+static bool page_test_set_hwpoison(struct page *page)
+{
+	lockdep_assert_held(&mf_mutex);
+
+	while (true) {
+		/* Already set -> not our problem. */
+		if (TestSetPageHWPoison(page))
+			return true;
+		/* Make sure concurrent non-atomic writers completed. */
+		synchronize_rcu();
+		/* Setting the flag was sticky. */
+		if (PageHWPoison(page))
+			return false;
+	}
+}
+
+static bool page_test_clear_hwpoison(struct page *page)
+{
+	lockdep_assert_held(&mf_mutex);
+
+	while (true) {
+		/* Already clear -> not our problem. */
+		if (!TestClearPageHWPoison(page))
+			return false;
+		/* Make sure concurrent non-atomic writers completed. */
+		synchronize_rcu();
+		/* Clearing the flag was sticky. */
+		if (!PageHWPoison(page))
+			return true;
+	}
+}
+
+static void page_set_hwpoison(struct page *page)
+{
+	lockdep_assert_held(&mf_mutex);
+
+	while (!PageHWPoison(page)) {
+		SetPageHWPoison(page);
+
+		/* Make sure concurrent non-atomic writers completed. */
+		synchronize_rcu();
+	}
+}
+
+static void page_clear_hwpoison(struct page *page)
+{
+	lockdep_assert_held(&mf_mutex);
+
+	while (PageHWPoison(page)) {
+		ClearPageHWPoison(page);
+
+		/* Make sure concurrent non-atomic writers completed. */
+		synchronize_rcu();
+	}
+}
+
 /*
  * Return values:
  *   1:   the page is dissolved (if needed) and taken off from buddy,
@@ -199,7 +255,7 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
 			return false;
 	}
 
-	SetPageHWPoison(page);
+	page_set_hwpoison(page);
 	if (release)
 		put_page(page);
 	page_ref_inc(page);
@@ -1744,7 +1800,7 @@ static int mf_generic_kill_procs(unsigned long long pfn, int flags,
 	 * Use this flag as an indication that the dax page has been
 	 * remapped UC to prevent speculative consumption of poison.
 	 */
-	SetPageHWPoison(&folio->page);
+	page_set_hwpoison(&folio->page);
 
 	/*
 	 * Unlike System-RAM there is no possibility to swap in a
@@ -1789,7 +1845,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 			goto unlock;
 
 		if (!pre_remove)
-			SetPageHWPoison(page);
+			page_set_hwpoison(page);
 
 		/*
 		 * The pre_remove case is revoking access, the memory is still
@@ -1866,7 +1922,7 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
 	head = llist_del_all(raw_hwp_list_head(folio));
 	llist_for_each_entry_safe(p, next, head, node) {
 		if (move_flag)
-			SetPageHWPoison(p->page);
+			page_set_hwpoison(p->page);
 		else
 			num_poisoned_pages_sub(page_to_pfn(p->page), 1);
 		kfree(p);
@@ -2380,7 +2436,7 @@ int memory_failure(unsigned long pfn, int flags)
 	if (res != -ENOENT)
 		goto unlock_mutex;
 
-	if (TestSetPageHWPoison(p)) {
+	if (page_test_set_hwpoison(p)) {
 		res = -EHWPOISON;
 		if (flags & MF_ACTION_REQUIRED)
 			res = kill_accessing_process(current, pfn, flags);
@@ -2410,7 +2466,7 @@ int memory_failure(unsigned long pfn, int flags)
 			} else {
 				/* We lost the race, try again */
 				if (retry) {
-					ClearPageHWPoison(p);
+					page_clear_hwpoison(p);
 					retry = false;
 					goto try_again;
 				}
@@ -2431,7 +2487,7 @@ int memory_failure(unsigned long pfn, int flags)
 	/* filter pages that are protected from hwpoison test by users */
 	folio_lock(folio);
 	if (hwpoison_filter(p)) {
-		ClearPageHWPoison(p);
+		page_clear_hwpoison(p);
 		folio_unlock(folio);
 		folio_put(folio);
 		res = -EOPNOTSUPP;
@@ -2751,7 +2807,7 @@ int unpoison_memory(unsigned long pfn)
 		}
 
 		folio_put(folio);
-		if (TestClearPageHWPoison(p)) {
+		if (page_test_clear_hwpoison(p)) {
 			folio_put(folio);
 			ret = 0;
 		}


Maybe that would work. There would still be issues to solve

(a) We don't hold the mf_mutex on all call paths, but we really need it so a
page_test_set_hwpoison() cannot race in weird ways with the other primitives I think.

(b) There are some leftover SetPageHWPoison etc. instances. The ones in
arch/x86/kernel/cpu/mce/core.c likely cannot grab the mutex, but maybe they are
corner cases either way and we can document the situation.


Further, while I assume the synchronize_rcu() on the MCE path should be fine
(who cares about performance there?), I don't know if the added RCU read lock
on some paths could be noticeable.

So one idea worth discussing, but I am sure there are more problems.
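
To make the lost-update scenario concrete, here is a deterministic userspace
toy model in C11. Everything in it is an assumption made for illustration:
the toy_* names, the two flag bits, and in particular toy_synchronize(),
which merely stands in for synchronize_rcu() by letting one simulated
in-flight non-atomic writer finish. It is not kernel code and says nothing
about the real cost or the remaining races.

```c
#include <stdbool.h>
#include <stdatomic.h>

#define FLAG_UPTODATE	(1UL << 0)
#define FLAG_HWPOISON	(1UL << 1)

/* Toy page: one flags word plus a simulated in-flight non-atomic writer. */
struct toy_page {
	atomic_ulong flags;
	bool writer_pending;		/* writer already read the flags */
	unsigned long writer_snapshot;	/* stale value the writer read */
};

static inline void toy_page_init(struct toy_page *p)
{
	atomic_init(&p->flags, 0);
	p->writer_pending = false;
	p->writer_snapshot = 0;
}

/* Non-atomic writer, step 1: read the flags, then get "rescheduled". */
static inline void writer_read(struct toy_page *p)
{
	p->writer_snapshot = atomic_load(&p->flags);
	p->writer_pending = true;
}

/*
 * Stand-in for synchronize_rcu(): let the in-flight writer finish.
 * Its store writes back the stale snapshot and can drop bits that
 * were set atomically in between.
 */
static inline void toy_synchronize(struct toy_page *p)
{
	if (p->writer_pending) {
		atomic_store(&p->flags, p->writer_snapshot | FLAG_UPTODATE);
		p->writer_pending = false;
	}
}

/* Set the poison bit and repeat until it survives a "grace period". */
static inline unsigned int toy_set_hwpoison(struct toy_page *p)
{
	unsigned int attempts = 0;

	while (!(atomic_load(&p->flags) & FLAG_HWPOISON)) {
		atomic_fetch_or(&p->flags, FLAG_HWPOISON);
		toy_synchronize(p);
		attempts++;
	}
	return attempts;
}
```

In this model, a writer that read the flags before the poison bit was set
clobbers the first attempt, and the loop needs a second pass before the bit
sticks; without a pending writer a single pass is enough.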

-- 
Cheers,

David

^ permalink raw reply related

* Re: [PATCH] virtio_net: disable cb when napi_schedule_prep fails during busy-poll
From: Xuan Zhuo @ 2026-06-15 10:01 UTC (permalink / raw)
  To: Longjun Tang
  Cc: edumazet, kuba, virtualization, lange_tang, tanglongjun, mst,
	jasowang
In-Reply-To: <20260615094550.106391-1-lange_tang@163.com>

On Mon, 15 Jun 2026 17:45:50 +0800, Longjun Tang <lange_tang@163.com> wrote:
> From: Longjun Tang <tanglongjun@kylinos.cn>
>
> When busy-poll is active, napi_schedule_prep() returns false in
> skb_recv_done(), so virtqueue_disable_cb() is skipped. The device
> may keep firing irqs until the next poll round reaches
> virtqueue_napi_complete(). If cb is enabled under busy-poll case,
> it will lead to a large number of spurious interrupts. Explicitly
> disable callbacks in this case to prevent spurious interrupts.
>
> Signed-off-by: Longjun Tang <tanglongjun@kylinos.cn>
> ---
>  drivers/net/virtio_net.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f4adcfee7a80..6d675fddc59b 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -728,6 +728,8 @@ static void virtqueue_napi_schedule(struct napi_struct *napi,
>  	if (napi_schedule_prep(napi)) {
>  		virtqueue_disable_cb(vq);
>  		__napi_schedule(napi);
> +	} else if (test_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state)) {
> +		virtqueue_disable_cb(vq);

I see, but we should avoid checking NAPI_STATE_IN_BUSY_POLL directly in the
drivers. The NIC driver should remain agnostic to busy polling. I think we need
a better way, maybe we should rewrite virtqueue_napi_schedule instead.

Thanks.

>  	}
>  }
>
> --
> 2.25.1
>

^ permalink raw reply

* [PATCH] virtio_net: disable cb when napi_schedule_prep fails during busy-poll
From: Longjun Tang @ 2026-06-15  9:45 UTC (permalink / raw)
  To: mst, jasowang
  Cc: edumazet, kuba, xuanzhuo, virtualization, lange_tang, tanglongjun

From: Longjun Tang <tanglongjun@kylinos.cn>

When busy-poll is active, napi_schedule_prep() returns false in
skb_recv_done(), so virtqueue_disable_cb() is skipped. The device
may keep firing irqs until the next poll round reaches
virtqueue_napi_complete(). If cb is enabled under busy-poll case,
it will lead to a large number of spurious interrupts. Explicitly
disable callbacks in this case to prevent spurious interrupts.

Signed-off-by: Longjun Tang <tanglongjun@kylinos.cn>
---
 drivers/net/virtio_net.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f4adcfee7a80..6d675fddc59b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -728,6 +728,8 @@ static void virtqueue_napi_schedule(struct napi_struct *napi,
 	if (napi_schedule_prep(napi)) {
 		virtqueue_disable_cb(vq);
 		__napi_schedule(napi);
+	} else if (test_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state)) {
+		virtqueue_disable_cb(vq);
 	}
 }
 
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH v2] mm: page_reporting: allow driver to set batch capacity
From: Gupta, Pankaj @ 2026-06-15  9:18 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Miaohe Lin, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Perez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi,
	Alexander Duyck
In-Reply-To: <cb43adc61d2ed3069b2fe428f3e051dbdc4cc28d.1781097156.git.mst@redhat.com>


> At the moment, if a virtio balloon device has a page reporting vq but
> its size is < PAGE_REPORTING_CAPACITY (32), the balloon driver fails
> probe.
>
> But, there's no way for host to know this value, so it can easily
> create a smaller vq and suddenly adding the reporting capability
> to the device makes all of the driver fail. Not pretty.
>
> Add a capacity field to page_reporting_dev_info so drivers can
> control the maximum number of pages per report batch.
>
> In virtio-balloon, set the capacity to the reporting virtqueue size,
> letting page_reporting adapt to whatever the device provides.
>
> Capacity need not be a power of two.  Code previously called out
> division by PAGE_REPORTING_CAPACITY as cheap since it was a power
> of 2, but no performance difference was observed with non-power-of-2
> values.
>
> If capacity is 0 or exceeds PAGE_REPORTING_CAPACITY, it defaults
> to PAGE_REPORTING_CAPACITY.  The 0 check and the clamping are done in
> page_reporting_register(), before the reporting work is scheduled,
> so we never get division by 0.
>
> Fixes: b0c504f15471 ("virtio-balloon: add support for providing free page reports to host")
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> ---
> Changes v1->v2:
> - Document capacity=0 as default in commit log
> - Document that capacity need not be a power of two
> - Drop unnecessary comment about integer division cost
> - Update comment on capacity field: "0 (default) means PAGE_REPORTING_CAPACITY"
>
>   drivers/virtio/virtio_balloon.c |  5 +----
>   include/linux/page_reporting.h  |  3 +++
>   mm/page_reporting.c             | 24 ++++++++++++------------
>   3 files changed, 16 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f6c2dff33f8a..6a1a610c2cb1 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -1017,10 +1017,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
>   		unsigned int capacity;
>   
>   		capacity = virtqueue_get_vring_size(vb->reporting_vq);
> -		if (capacity < PAGE_REPORTING_CAPACITY) {
> -			err = -ENOSPC;
> -			goto out_unregister_oom;
> -		}
>   
>   		vb->pr_dev_info.order = PAGE_REPORTING_ORDER_UNSPECIFIED;
>   
> @@ -1041,6 +1037,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
>   		vb->pr_dev_info.order = 5;
>   #endif
>   
> +		vb->pr_dev_info.capacity = capacity;
>   		err = page_reporting_register(&vb->pr_dev_info);
>   		if (err)
>   			goto out_unregister_oom;
> diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
> index 9d4ca5c218a0..048578118a4b 100644
> --- a/include/linux/page_reporting.h
> +++ b/include/linux/page_reporting.h
> @@ -22,6 +22,9 @@ struct page_reporting_dev_info {
>   
>   	/* Minimal order of page reporting */
>   	unsigned int order;
> +
> +	/* Max pages per report batch; 0 (default) means PAGE_REPORTING_CAPACITY */
> +	unsigned int capacity;
>   };
>   
>   /* Tear-down and bring-up for page reporting devices */
> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> index 7418f2e500bb..942e84b6908a 100644
> --- a/mm/page_reporting.c
> +++ b/mm/page_reporting.c
> @@ -173,11 +173,8 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
>   	 * any pages that may have already been present from the previous
>   	 * list processed. This should result in us reporting all pages on
>   	 * an idle system in about 30 seconds.
> -	 *
> -	 * The division here should be cheap since PAGE_REPORTING_CAPACITY
> -	 * should always be a power of 2.
>   	 */
> -	budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
> +	budget = DIV_ROUND_UP(area->nr_free, prdev->capacity * 16);
>   
>   	/* loop through free list adding unreported pages to sg list */
>   	list_for_each_entry_safe(page, next, list, lru) {
> @@ -222,10 +219,10 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
>   		spin_unlock_irq(&zone->lock);
>   
>   		/* begin processing pages in local list */
> -		err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
> +		err = prdev->report(prdev, sgl, prdev->capacity);
>   
>   		/* reset offset since the full list was reported */
> -		*offset = PAGE_REPORTING_CAPACITY;
> +		*offset = prdev->capacity;
>   
>   		/* update budget to reflect call to report function */
>   		budget--;
> @@ -234,7 +231,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
>   		spin_lock_irq(&zone->lock);
>   
>   		/* flush reported pages from the sg list */
> -		page_reporting_drain(prdev, sgl, PAGE_REPORTING_CAPACITY, !err);
> +		page_reporting_drain(prdev, sgl, prdev->capacity, !err);
>   
>   		/*
>   		 * Reset next to first entry, the old next isn't valid
> @@ -260,13 +257,13 @@ static int
>   page_reporting_process_zone(struct page_reporting_dev_info *prdev,
>   			    struct scatterlist *sgl, struct zone *zone)
>   {
> -	unsigned int order, mt, leftover, offset = PAGE_REPORTING_CAPACITY;
> +	unsigned int order, mt, leftover, offset = prdev->capacity;
>   	unsigned long watermark;
>   	int err = 0;
>   
>   	/* Generate minimum watermark to be able to guarantee progress */
>   	watermark = low_wmark_pages(zone) +
> -		    (PAGE_REPORTING_CAPACITY << page_reporting_order);
> +		    (prdev->capacity << page_reporting_order);
>   
>   	/*
>   	 * Cancel request if insufficient free memory or if we failed
> @@ -290,7 +287,7 @@ page_reporting_process_zone(struct page_reporting_dev_info *prdev,
>   	}
>   
>   	/* report the leftover pages before going idle */
> -	leftover = PAGE_REPORTING_CAPACITY - offset;
> +	leftover = prdev->capacity - offset;
>   	if (leftover) {
>   		sgl = &sgl[offset];
>   		err = prdev->report(prdev, sgl, leftover);
> @@ -322,11 +319,11 @@ static void page_reporting_process(struct work_struct *work)
>   	atomic_set(&prdev->state, state);
>   
>   	/* allocate scatterlist to store pages being reported on */
> -	sgl = kmalloc_objs(*sgl, PAGE_REPORTING_CAPACITY);
> +	sgl = kmalloc_objs(*sgl, prdev->capacity);
>   	if (!sgl)
>   		goto err_out;
>   
> -	sg_init_table(sgl, PAGE_REPORTING_CAPACITY);
> +	sg_init_table(sgl, prdev->capacity);
>   
>   	for_each_zone(zone) {
>   		err = page_reporting_process_zone(prdev, sgl, zone);
> @@ -377,6 +374,9 @@ int page_reporting_register(struct page_reporting_dev_info *prdev)
>   			page_reporting_order = pageblock_order;
>   	}
>   
> +	if (!prdev->capacity || prdev->capacity > PAGE_REPORTING_CAPACITY)
> +		prdev->capacity = PAGE_REPORTING_CAPACITY;
> +
>   	/* initialize state and work structures */
>   	atomic_set(&prdev->state, PAGE_REPORTING_IDLE);
>   	INIT_DELAYED_WORK(&prdev->work, &page_reporting_process);

With the comment change pointed out by David,

Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
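
As a small worked example of the clamping and the budget arithmetic, here is
a userspace sketch. The toy_* names are hypothetical; only the value 32 and
the DIV_ROUND_UP(nr_free, capacity * 16) form come from the patch, and the
clamp is folded into the budget helper purely to keep the sketch short.

```c
#define TOY_REPORTING_CAPACITY	32U	/* mirrors PAGE_REPORTING_CAPACITY (32) */
#define TOY_DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* 0 or anything above the maximum falls back to the maximum. */
static inline unsigned int toy_clamp_capacity(unsigned int capacity)
{
	if (!capacity || capacity > TOY_REPORTING_CAPACITY)
		return TOY_REPORTING_CAPACITY;
	return capacity;
}

/* Per-cycle budget: nr_free divided by (capacity * 16), rounded up. */
static inline unsigned long toy_budget(unsigned long nr_free,
				       unsigned int capacity)
{
	unsigned long per_call = (unsigned long)toy_clamp_capacity(capacity) * 16;

	return TOY_DIV_ROUND_UP(nr_free, per_call);
}
```

Because the clamped capacity is never 0, the divisor is at least 16 and the
division is always defined, including for non-power-of-2 values such as 24.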



^ permalink raw reply

* [PATCH v1 2/2] virtio-blk: mark disk dead on ERS permanent failure
From: Xixin Liu @ 2026-06-12 10:00 UTC (permalink / raw)
  To: linux-block, virtualization
  Cc: mst, jasowang, xuanzhuo, eperezma, pbonzini, stefanha, axboe,
	linux-kernel, liuxixin
In-Reply-To: <cover.virtio-blk-ers-v1.1780449274.git.liuxixin@kylinos.cn>

After ERS reports pci_channel_io_perm_failure, virtio-pci must ask the
virtio driver to tear down the block device — not only mark virtqueues
broken.  Call the virtio driver shutdown hook from virtio-pci on
perm_failure; virtio-blk implements shutdown with blk_mark_disk_dead().
Fail new requests early in virtio_queue_rq when the disk is dead or
virtqueues were removed during frozen reset_prepare.

Signed-off-by: Xixin Liu <liuxixin@kylinos.cn>
---
 drivers/block/virtio_blk.c         | 39 +++++++++++++++++++++++++++++++++++++++
 drivers/virtio/virtio_pci_common.c | 10 +++++++++-
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 32bf3ba07a9d..4740ae91d5be 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -435,6 +435,12 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
 	blk_status_t status;
 	int err;
 
+	/* Fail fast if ERS frozen tore down VQs or the disk was marked dead. */
+	if (unlikely(!disk_live(vblk->disk) || !vblk->vqs || !vblk->vdev)) {
+		blk_mq_start_request(req);
+		return BLK_STS_IOERR;
+	}
+
 	status = virtblk_prep_rq(hctx, vblk, req, vbr);
 	if (unlikely(status))
 		return status;
@@ -1561,6 +1567,29 @@ static int virtblk_probe(struct virtio_device *vdev)
 	return err;
 }
 
+/* Stop I/O and mark the gendisk dead (ERS perm_failure or system shutdown). */
+static void virtblk_shutdown(struct virtio_device *vdev)
+{
+	struct virtio_blk *vblk = vdev->priv;
+	struct request_queue *q;
+	unsigned int memflags;
+
+	if (!vblk || !vblk->disk)
+		return;
+
+	flush_work(&vblk->config_work);
+	virtio_break_device(vdev);
+
+	q = vblk->disk->queue;
+	memflags = blk_mq_freeze_queue(q);
+	blk_mq_quiesce_queue_nowait(q);
+
+	blk_mark_disk_dead(vblk->disk);
+
+	blk_mq_unquiesce_queue(q);
+	blk_mq_unfreeze_queue(q, memflags);
+}
+
 static void virtblk_remove(struct virtio_device *vdev)
 {
 	struct virtio_blk *vblk = vdev->priv;
@@ -1684,6 +1713,7 @@ static struct virtio_driver virtio_blk = {
 	.probe				= virtblk_probe,
 	.remove				= virtblk_remove,
 	.config_changed			= virtblk_config_changed,
+	.shutdown			= virtblk_shutdown,
 #ifdef CONFIG_PM_SLEEP
 	.freeze				= virtblk_freeze,
 	.restore			= virtblk_restore,
diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
index e2dda946e70e..924ceead436b 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -845,7 +845,15 @@ static pci_ers_result_t virtio_pci_error_detected(struct pci_dev *pci_dev,
 	case pci_channel_io_perm_failure:
 		dev_warn(&pci_dev->dev,
 			 "permanent failure, disconnecting device\n");
-		virtio_break_device(&vp_dev->vdev);
+		{
+			struct virtio_driver *drv =
+				drv_to_virtio(vp_dev->vdev.dev.driver);
+
+			if (drv && drv->shutdown)
+				drv->shutdown(&vp_dev->vdev);
+			else
+				virtio_break_device(&vp_dev->vdev);
+		}
 		return PCI_ERS_RESULT_DISCONNECT;
 	default:
 		break;


^ permalink raw reply related

* [PATCH v1 1/2] virtio-pci: add error_detected for PCI AER recovery
From: Xixin Liu @ 2026-06-10  6:20 UTC (permalink / raw)
  To: linux-block, virtualization
  Cc: mst, jasowang, xuanzhuo, eperezma, pbonzini, stefanha, axboe,
	linux-kernel, liuxixin
In-Reply-To: <cover.virtio-blk-ers-v1.1780449274.git.liuxixin@kylinos.cn>

virtio-pci only registered reset_prepare/reset_done.  The PCI error
recovery core treats devices without error_detected as NO_AER_DRIVER and
does not deliver pci_channel_io_perm_failure to the driver after a failed
recovery.  Virtio devices therefore miss the normal ERS quiesce/teardown
sequence.

Register error_detected: quiesce on frozen (reset_prepare) before bus
reset; on perm_failure break virtqueues and return DISCONNECT.  Block-layer
cleanup for virtio-blk is handled in the follow-up patch.

Signed-off-by: Xixin Liu <liuxixin@kylinos.cn>
---
 drivers/virtio/virtio_pci_common.c | 30 +++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
index 164f480b18a6..e2dda946e70e 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -828,7 +828,37 @@ static void virtio_pci_reset_done(struct pci_dev *pci_dev)
 		dev_warn(&pci_dev->dev, "Reset done failure: %d", ret);
 }
 
+static pci_ers_result_t virtio_pci_error_detected(struct pci_dev *pci_dev,
+						  pci_channel_state_t state)
+{
+	struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
+
+	/*
+	 * PCI ERS error_detected: quiesce on frozen before bus reset; on
+	 * permanent failure ask the virtio driver to shut down (virtio-blk
+	 * marks the disk dead in its .shutdown handler).
+	 */
+	switch (state) {
+	case pci_channel_io_normal:
+		return PCI_ERS_RESULT_CAN_RECOVER;
+	case pci_channel_io_frozen:
+		pci_info(pci_dev, "frozen error detected, quiesce device\n");
+		if (virtio_device_reset_prepare(&vp_dev->vdev))
+			dev_warn(&pci_dev->dev, "frozen: reset prepare failed\n");
+		return PCI_ERS_RESULT_NEED_RESET;
+	case pci_channel_io_perm_failure:
+		dev_warn(&pci_dev->dev,
+			 "permanent failure, disconnecting device\n");
+		virtio_break_device(&vp_dev->vdev);
+		return PCI_ERS_RESULT_DISCONNECT;
+	default:
+		break;
+	}
+	return PCI_ERS_RESULT_NEED_RESET;
+}
+
 static const struct pci_error_handlers virtio_pci_err_handler = {
+	.error_detected = virtio_pci_error_detected,
 	.reset_prepare  = virtio_pci_reset_prepare,
 	.reset_done     = virtio_pci_reset_done,
 };


^ permalink raw reply related

* [PATCH v1 0/2] virtio: PCI ERS permanent failure teardown for virtio-blk
From: Xixin Liu @ 2026-06-15  2:00 UTC (permalink / raw)
  To: linux-block, virtualization
  Cc: mst, jasowang, xuanzhuo, eperezma, pbonzini, stefanha, axboe,
	linux-kernel, liuxixin

Hi,

This series adds proper PCI AER error recovery handling for virtio-pci and
completes virtio-blk teardown when ERS reports pci_channel_io_perm_failure.

virtio-pci only registered reset_prepare/reset_done.  The recovery core
treats devices without error_detected as NO_AER_DRIVER and does not
deliver perm_failure to the driver after a failed recovery.  When bus
reset fails (reproduced on QEMU with DLLLA not set within 100 ms after
secondary bus reset), virtio-blk disks stay live even though virtqueues
may already have been torn down during the frozen phase.

Patch 1 registers error_detected (frozen quiesce + perm_failure notify).
Patch 2 calls the virtio driver shutdown hook from virtio-pci on
perm_failure, implements virtio-blk shutdown with blk_mark_disk_dead(),
and fail-fast guards in virtio_queue_rq.
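
A rough userspace model of the resulting state handling, for readers who
want the control flow at a glance. All toy_* types and names are
hypothetical stand-ins (for pci_channel_state_t, pci_ers_result_t and the
virtio driver shutdown hook); the sketch only mirrors the decisions
described above and does not touch any real PCI or block-layer API.

```c
/* Hypothetical stand-ins for pci_channel_state_t and pci_ers_result_t. */
enum toy_channel_state { TOY_IO_NORMAL, TOY_IO_FROZEN, TOY_IO_PERM_FAILURE };
enum toy_ers_result { TOY_CAN_RECOVER, TOY_NEED_RESET, TOY_DISCONNECT };

struct toy_dev {
	int quiesced;	/* set when the frozen state was handled */
	int broken;	/* virtqueues marked broken */
	int disk_dead;	/* driver shutdown hook marked the disk dead */
	void (*shutdown)(struct toy_dev *dev);	/* optional, may be null */
};

/* Models a block driver shutdown hook: break the device, kill the disk. */
static inline void toy_blk_shutdown(struct toy_dev *dev)
{
	dev->broken = 1;
	dev->disk_dead = 1;
}

static inline enum toy_ers_result
toy_error_detected(struct toy_dev *dev, enum toy_channel_state state)
{
	switch (state) {
	case TOY_IO_NORMAL:
		return TOY_CAN_RECOVER;
	case TOY_IO_FROZEN:
		dev->quiesced = 1;	/* quiesce before the bus reset */
		return TOY_NEED_RESET;
	case TOY_IO_PERM_FAILURE:
		if (dev->shutdown)
			dev->shutdown(dev);	/* full driver teardown */
		else
			dev->broken = 1;	/* fallback: only break the vqs */
		return TOY_DISCONNECT;
	}
	return TOY_NEED_RESET;
}
```

A device with a shutdown hook ends up with its disk marked dead on permanent
failure, while a device without one is only marked broken.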

Thanks,
Xixin Liu

---

Xixin Liu (2):
  virtio-pci: add error_detected for PCI AER recovery
  virtio-blk: mark disk dead on ERS permanent failure

 drivers/block/virtio_blk.c         | 39 +++++++++++++++++++++++++++++++
 drivers/virtio/virtio_pci_common.c | 47 ++++++++++++++++++++++++++++++++++
 2 files changed, 85 insertions(+)

-- 
2.43.0

^ permalink raw reply

* RE: [RFC PATCH 0/6] Support virtio-mem memory hotplug in TDX guests
From: Duan, Zhenzhong @ 2026-06-15  7:54 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: marcandre.lureau@redhat.com, david@kernel.org, Edgecombe, Rick P,
	prsampat@amd.com, pbonzini@redhat.com, mst@redhat.com,
	peterx@redhat.com, Qiang, Chenyi, Reshetova, Elena,
	michaeluth@amd.com, ackerleytng@google.com,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	virtualization@lists.linux.dev, x86@kernel.org, Xu, Yilun,
	Li, Xiaoyao, Peng, Chao P
In-Reply-To: <aiv0y-Op9bfP-CVO@thinkstation>

>-----Original Message-----
>From: Kiryl Shutsemau <kas@kernel.org>
>Subject: Re: [RFC PATCH 0/6] Support virtio-mem memory hotplug in TDX guests
>
>On Thu, Jun 04, 2026 at 05:35:45AM -0400, Zhenzhong Duan wrote:
>> 2. Re-accepting already-accepted memory returns errors. Ignoring these errors
>> can mislead the guest into believing re-accepted memory is zeroed when it
>> contains stale data.
>
>Re-accepting concern is valid, but often overblown.

> Reaccepting memory that never got allocated is fine.

I don't quite understand. "Reaccepting" implies accepting memory that was
already accepted earlier. For that to happen, the memory must have already
been allocated on the VMM side, correct?

>
>> == About this series ==
>>
>> This series takes a different direction, supporting start-private memory
>> and addressing the limitations of previous series [1] by implementing a
>> callback-based infrastructure that integrates TDX memory acceptance and
>> release operations with proper subblock granularity.
>
>You are presenting these callbacks as a generic memory hotplug thingy, but
>it is only plugged into virtio mem. ACPI hotplug won't accept/release
>memory unless I miss something. Are you expecting them to cover non
>virtio cases too?

You are right, I didn't add ACPI hotplug in this series. I'm working on RFCv2
supporting both virtio-mem and ACPI hotplug in eager/lazy accept mode.

>
>And these callbacks feel like a very ad-hoc solution.

OK, will drop the callbacks in RFCv2.

>
>> See Rick and Paolo's
>> discussion about using TDG.MEM.PAGE.RELEASE in [1].
>
>Having RELEASE in hotplug path without addressing private->shared
>conversion first is odd. That's the most obvious path that has to be
>covered first.
>
>Hm?

This patch series assumes that memory is plugged in as private memory
and must remain private prior to being unplugged. During the unplugging
process, memory is allocated from the buddy system and marked as
FAKE_OFFLINE. Because all free memory within the buddy system is
strictly private, shared memory can never be unplugged.

Shared memory is originally converted from private memory allocated by
the buddy system. Consequently, the driver must convert any shared
memory back to private and return it to the buddy system before it can
be unplugged.

>
>> == Future work ==
>> support lazy accept
>
>It would be nice to have some outline on how we will get there to
>understand if this patchset is stepping stone or dead end that has to be
>thrown away later on.

I realized the callbacks are specific to eager accept and are not useful
for lazy accept, so I will drop them in RFCv2.

>
>Hot[un]plug is often used to manager overcommited host. Eager accept
>might be counter-productive.

Agreed, I should have taken lazy accept into consideration from the start.

Thanks
Zhenzhong


* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Miaohe Lin @ 2026-06-15  3:29 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Michael S. Tsirkin
  Cc: Zi Yan, Andrew Morton, linux-kernel, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <cbffabdc-42bf-4e67-9d2b-ace9a1e09f29@kernel.org>

On 2026/6/11 21:20, David Hildenbrand (Arm) wrote:
> On 6/11/26 09:36, Miaohe Lin wrote:
>> On 2026/6/11 13:43, Michael S. Tsirkin wrote:
>>> On Thu, Jun 11, 2026 at 11:35:36AM +0800, Miaohe Lin wrote:
>>>>
>>>> Do you mean repeating SetPageHWPoison on every branch?
>>>
>>> Right.
>>>
>>>> Is it possible
>>>> to make __free_pages_prepare changes page->flags atomically or this race
>>>> is specified to memory_failure?
>>>>
>>>> Thanks.
>>>> .
>>>
>>>
>>> Adding an atomic op on every fast path page allocation is, I am
>>> guessing, going to slow down Linux measureably.
>>>
>>> Doing it for the benefit of memory_failure, which is the slowest of
>>> slow paths, seems unpalatable, to me.
>>
>> Agree, it's not worth to do so.
>>
>>>
>>> Neither am I sure it's the only racy place -
>>> grep for __SetPage and __ClearPage - all these have the same issue, I
>>> suspect.
>>>
>>> At the same time, I'm not an mm maintainer. If you disagree, try to
>>> upstream a change converting all non atomics in mm to atomics, and see
>>> what others say.
>>
>> Since memory_failure might be the only place, this change would be unacceptable.
>> We should come up with a better solution. Maybe we can try repeating SetPageHWPoison
>> and ClearPageHWPoison at a first attempt though it looks somewhat weird to me and makes
>> code more complicated.
> 
> And I am fairly sure we could still have some remaining races ... it's shaky.

I have to agree it's shaky. Any suggestions for the next step?

Thanks.
.
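
For readers following the thread, the lost-update problem discussed above
can be replayed deterministically in user space. This is only a minimal
sketch, not kernel code: the flag values, struct rmw, and all function names
are hypothetical. It assumes the non-atomic path behaves like a separate
load and store on the flags word, and that the HWPoison bit is set by another
CPU in between.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits, for illustration only (not the kernel's layout). */
#define FLAG_HWPOISON (1u << 0)
#define FLAG_OTHER    (1u << 1)

/* Models a non-atomic flag update: the load and the store are two separate
 * steps, so another CPU can change the word in between.
 */
struct rmw {
	uint32_t snapshot;
};

static void rmw_load(struct rmw *op, const uint32_t *flags)
{
	op->snapshot = *flags;
}

static void rmw_store(const struct rmw *op, uint32_t *flags, uint32_t bit)
{
	*flags = op->snapshot | bit;
}

/* Replays the racy interleaving; returns true if the poison bit survives. */
static bool hwpoison_survives_race(void)
{
	uint32_t flags = 0;
	struct rmw op;

	rmw_load(&op, &flags);               /* CPU A: load old flags        */
	flags |= FLAG_HWPOISON;              /* CPU B: set bit in the window */
	rmw_store(&op, &flags, FLAG_OTHER);  /* CPU A: write back stale word */

	return (flags & FLAG_HWPOISON) != 0;
}

/* Same two updates, but ordered as if both sides held a common lock. */
static bool hwpoison_survives_serialized(void)
{
	uint32_t flags = 0;
	struct rmw op;

	rmw_load(&op, &flags);               /* CPU A: complete update first */
	rmw_store(&op, &flags, FLAG_OTHER);
	flags |= FLAG_HWPOISON;              /* CPU B: runs afterwards       */

	return (flags & FLAG_HWPOISON) != 0;
}
```

In the first replay the stale write-back drops the poison bit; in the second
both bits are kept. That is the property serialization is meant to provide.
The sketch says nothing about whether other racy sites remain.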


* Re: [PATCH net-next v2 1/2] virtio_net: xsk: fix race in rx wake up
From: Xuan Zhuo @ 2026-06-15  2:48 UTC (permalink / raw)
  To: menglong8.dong
  Cc: mst, jasowang, andrew+netdev, davem, edumazet, kuba, pabeni,
	minhquangbui99, kerneljasonxing, netdev, virtualization,
	linux-kernel, eperezma
In-Reply-To: <20260611025644.2431148-2-dongml2@chinatelecom.cn>

On Thu, 11 Jun 2026 10:56:43 +0800, menglong8.dong@gmail.com wrote:
> From: Menglong Dong <dongml2@chinatelecom.cn>
>
> During packet receiving in virtio-net, the rq can be empty, which means
> "rq->vq->num_free == virtqueue_get_vring_size(rq->vq)", in
> virtnet_add_recvbuf_xsk(), if we are using xsk. Meanwhile, the fill ring
> can be empty too, which means we can't allocate anything from
> xsk_buff_alloc_batch(). Then, we will set the XDP_RING_NEED_WAKEUP flag.
>
> However, if the user clean all the data in rx ring and fill the
> "fill ring" and check the XDP_RING_NEED_WAKEUP flag after
> xsk_buff_alloc_batch() and before xsk_set_rx_need_wakeup(), then the rx
> napi will never be scheduled: the rx ring is empty, which means we will
> never receive a packet to trigger the further recv fill. The rx ring is
> empty now, so the user will not check the flag too.
>
> Fix this by set the XDP_RING_NEED_WAKEUP flag before
> xsk_buff_alloc_batch() if both rq->vq and fill ring are empty.
>
> Meanwhile, set the XDP_RING_NEED_WAKEUP flag if we have any free entry in
> rq->vq.
>
> Fixes: e3f8800aa243 ("virtio-net: xsk: Support wakeup on RX side")
> Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
> ---
>  drivers/net/virtio_net.c | 25 ++++++++++++++++++++++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f4adcfee7a80..4b5b3fa62008 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1323,16 +1323,27 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
>  				   struct xsk_buff_pool *pool, gfp_t gfp)
>  {
>  	struct xdp_buff **xsk_buffs;
> +	bool need_wakeup;
>  	dma_addr_t addr;
>  	int err = 0;
>  	u32 len, i;
>  	int num;
>
> +	need_wakeup = xsk_uses_need_wakeup(pool);
>  	xsk_buffs = rq->xsk_buffs;
>
> +	/* If both rq->vq and fill ring are empty, and then the user submit
> +	 * all the chunks to the fill ring and check the wake up flag
> +	 * after xsk_buff_alloc_batch() and before xsk_set_rx_need_wakeup(),
> +	 * we will lose the chance to wake up the rx napi, so we have to
> +	 * set the need_wakeup flag here.
> +	 */
> +	if (need_wakeup && virtqueue_get_vring_size(rq->vq) == rq->vq->num_free)
> +		xsk_set_rx_need_wakeup(pool);

Is this condition (rq->vq being completely empty) too strict? We should
probably set the wakeup flag under a wider range of scenarios.

> +
>  	num = xsk_buff_alloc_batch(pool, xsk_buffs, rq->vq->num_free);
>  	if (!num) {
> -		if (xsk_uses_need_wakeup(pool)) {
> +		if (need_wakeup) {
>  			xsk_set_rx_need_wakeup(pool);
>  			/* Return 0 instead of -ENOMEM so that NAPI is
>  			 * descheduled.
> @@ -1341,8 +1352,6 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
>  		}
>
>  		return -ENOMEM;
> -	} else {
> -		xsk_clear_rx_need_wakeup(pool);
>  	}
>
>  	len = xsk_pool_get_rx_frame_size(pool) + vi->hdr_len;
> @@ -1363,6 +1372,16 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
>  			goto err;
>  	}
>
> +	if (need_wakeup) {
> +		if (rq->vq->num_free)
> +			/* We have free buffers, so we'd better wake up the
> +			 * rx napi as soon as possible.
> +			 */
> +			xsk_set_rx_need_wakeup(pool);

Is the purpose of waking up the RX NAPI to invoke try_fill_recv()? Note that
virtnet_poll() does not call try_fill_recv() unconditionally; it only does so
under certain conditions.

Thanks.


> +		else
> +			xsk_clear_rx_need_wakeup(pool);
> +	}
> +
>  	return num;
>
>  err:
> --
> 2.54.0
>
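
The ordering problem described in the commit message can be replayed with a
small user-space model. This is a sketch under stated assumptions, not driver
code: struct model, its fields, and both function names are hypothetical. It
assumes rq->vq starts empty and that the user refills the fill ring and checks
the flag right after the allocation attempt.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the state involved in the race. */
struct model {
	int  fill_ring;    /* buffers the user has posted to the fill ring */
	bool need_wakeup;  /* stands in for XDP_RING_NEED_WAKEUP           */
	bool napi_kicked;  /* true once the user has issued a wakeup       */
};

/* User side: refill the ring, then kick only if the flag asks for it. */
static void user_refill_and_check(struct model *m)
{
	m->fill_ring = 8;
	if (m->need_wakeup)
		m->napi_kicked = true;
}

/* One driver refill attempt with an empty rq->vq and an empty fill ring.
 * Returns true if RX can still make progress afterwards.
 */
static bool refill_makes_progress(bool set_flag_before_alloc)
{
	struct model m = { 0 };
	int num;

	if (set_flag_before_alloc)
		m.need_wakeup = true;   /* patched order: flag set first */

	num = m.fill_ring;              /* allocation attempt, yields 0  */
	m.fill_ring -= num;

	user_refill_and_check(&m);      /* race window from the commit   */

	if (!num)
		m.need_wakeup = true;   /* original order: set too late  */

	return num > 0 || m.napi_kicked;
}
```

With set_flag_before_alloc false, the user sees a clear flag and never kicks,
so nothing reschedules RX. With it true, the user sees the flag and kicks.
The model covers only this single interleaving and does not address the
review questions above.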


* [BUG] crypto: virtio - KASAN slab-use-after-free in virtio_crypto_skcipher_encrypt
From: Shuangpeng Bai @ 2026-06-15  2:10 UTC (permalink / raw)
  To: arei.gonglei, mst, jasowang, xuanzhuo, eperezma, herbert, davem,
	virtualization, linux-crypto, linux-kernel

Hi,

I hit the following KASAN report while testing the current upstream kernel.

The issue was reproduced by queuing an AF_ALG skcipher request backed by
virtio-crypto, unbinding virtio0 from the virtio_crypto driver, and then
receiving from the old AF_ALG op fd.

KASAN: slab-use-after-free in virtio_crypto_skcipher_encrypt

I reproduced this on commit: e8c2f9fdadee7cbc75134dc463c1e0d856d6e5c7 (May 25 2026)

The reproducer and .config files are here:
https://gist.github.com/shuangpengbai/f6117a0883dd574f02288ca812bb7d65

I'm happy to test debug patches or provide additional information.

Reported-by: Shuangpeng Bai <shuangpeng.kernel@gmail.com>

[   54.367992][ T8332] BUG: KASAN: slab-use-after-free in virtio_crypto_skcipher_encrypt (drivers/crypto/virtio/virtio_crypto_skcipher_algs.c:473)
[   54.369596][ T8332] Read of size 8 at addr ffff888124a47010 by task virtio_crypto_a/8332
[   54.370922][ T8332]
[   54.371171][ T8332] Tainted: [W]=WARN
[   54.371172][ T8332] Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   54.371175][ T8332] Call Trace:
[   54.371179][ T8332]  <TASK>
[   54.371181][ T8332]  dump_stack_lvl (lib/dump_stack.c:94 lib/dump_stack.c:120)
[   54.371188][ T8332]  print_report (mm/kasan/report.c:378 mm/kasan/report.c:482)
[   54.371202][ T8332]  kasan_report (mm/kasan/report.c:595)
[   54.371213][ T8332]  virtio_crypto_skcipher_encrypt (drivers/crypto/virtio/virtio_crypto_skcipher_algs.c:473)
[   54.371216][ T8332]  skcipher_recvmsg (crypto/algif_skcipher.c:203 crypto/algif_skcipher.c:226)
[   54.371249][ T8332]  sock_recvmsg (net/socket.c:1137 net/socket.c:1159)
[   54.371253][ T8332]  __sys_recvfrom (net/socket.c:2315)
[   54.371273][ T8332]  __x64_sys_recvfrom (net/socket.c:2330 net/socket.c:2326 net/socket.c:2326)
[   54.371277][ T8332]  do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
[   54.371281][ T8332]  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)
[   54.371285][ T8332] RIP: 0033:0x7f3c6caaac2c
[   54.371289][ T8332] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 19 45 31 c9 45 31 c0 b8 2d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 64 c3 0f 1f 00 55 48 83 ec 20 48 89 54 24 10
[   54.371292][ T8332] RSP: 002b:00007ffed3785308 EFLAGS: 00000246 ORIG_RAX: 000000000000002d
[   54.371297][ T8332] RAX: ffffffffffffffda RBX: 0000000000000064 RCX: 00007f3c6caaac2c
[   54.371299][ T8332] RDX: 0000000000000040 RSI: 00007ffed37853a0 RDI: 0000000000000004
[   54.371301][ T8332] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
[   54.371303][ T8332] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000004
[   54.371305][ T8332] R13: 00007ffed37853a0 R14: 0000558cc9904118 R15: 0000000000000000
[   54.371309][ T8332]  </TASK>
[   54.371311][ T8332]
[   54.394932][ T8332] Freed by task 8332 on cpu 0 at 54.364772s:
[   54.395528][ T8332]  kasan_save_track (mm/kasan/common.c:57 mm/kasan/common.c:78)
[   54.395997][ T8332]  kasan_save_free_info (mm/kasan/generic.c:584)
[   54.396501][ T8332]  __kasan_slab_free (mm/kasan/common.c:253 mm/kasan/common.c:285)
[   54.396983][ T8332]  kfree (include/linux/kasan.h:235 mm/slub.c:2689 mm/slub.c:6251 mm/slub.c:6566)
[   54.397378][ T8332]  virtio_dev_remove (drivers/virtio/virtio.c:375)
[   54.397869][ T8332]  device_release_driver_internal (drivers/base/dd.c:619 drivers/base/dd.c:1352 drivers/base/dd.c:1375)
[   54.398475][ T8332]  unbind_store (drivers/base/bus.c:244)
[   54.398944][ T8332]  kernfs_fop_write_iter (fs/kernfs/file.c:352)
[   54.399476][ T8332]  vfs_write (fs/read_write.c:595 fs/read_write.c:688)
[   54.399915][ T8332]  ksys_write (fs/read_write.c:740)
[   54.400349][ T8332]  do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
[   54.400818][ T8332]  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)
[   54.401406][ T8332]
[   54.401650][ T8332] The buggy address belongs to the object at ffff888124a47000
[   54.401650][ T8332]  which belongs to the cache kmalloc-192 of size 192
[   54.403038][ T8332] The buggy address is located 16 bytes inside of
[   54.403038][ T8332]  freed 192-byte region [ffff888124a47000, ffff888124a470c0)
[   54.404385][ T8332]


Best,
Shuangpeng
