Linux virtualization list

* Re: [PATCH net-next] vsock/vmci: use sk_acceptq_is_full() helper
From: Luigi Leonardi @ 2026-06-11  7:58 UTC
  To: Raf Dickson
  Cc: netdev, virtualization, pabeni, sgarzare, stefanha, bryan-bt.tan,
	vishnu.dasa, bcm-kernel-feedback-list
In-Reply-To: <20260611023830.106259-1-rafdog35@gmail.com>

On Thu, Jun 11, 2026 at 02:38:30AM +0000, Raf Dickson wrote:
>Replace the open-coded backlog check with sk_acceptq_is_full().
>The helper uses > instead of >=, which is the correct comparison
>per commit 64a146513f8f ("[NET]: Revert incorrect accept queue
>backlog changes."), and adds READ_ONCE() for proper memory ordering.
>
>Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
>

this blank line should be dropped
>Signed-off-by: Raf Dickson <rafdog35@gmail.com>
>---
> net/vmw_vsock/vmci_transport.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
>index 91516488a7..56503bee31 100644
>--- a/net/vmw_vsock/vmci_transport.c
>+++ b/net/vmw_vsock/vmci_transport.c
>@@ -1010,7 +1010,7 @@ static int vmci_transport_recv_listen(struct sock *sk,
> 	 * reset.  Otherwise we create and initialize a child socket and reply
> 	 * with a connection negotiation.
> 	 */
>-	if (sk->sk_ack_backlog >= sk->sk_max_ack_backlog) {
>+	if (sk_acceptq_is_full(sk)) {
> 		vmci_transport_reply_reset(pkt);
> 		return -ECONNREFUSED;
> 	}
>-- 
>2.54.0
>

Thanks for the patch!

Note: according to patchwork [1], you forgot to CC some maintainers.
Please be more careful next time :)

Reviewed-by: Luigi Leonardi <leonardi@redhat.com>

[1] https://patchwork.kernel.org/project/netdevbpf/patch/20260611023830.106259-1-rafdog35@gmail.com/
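
For readers following the > versus >= point in the quoted commit
message, here is a minimal user-space sketch of the two checks. The
struct model_sock type and both function names are hypothetical
stand-ins for illustration, not kernel code; the real helper operates on
struct sock and wraps its loads in READ_ONCE().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two struct sock fields involved. */
struct model_sock {
	unsigned int ack_backlog;     /* connections waiting for accept() */
	unsigned int max_ack_backlog; /* limit requested via listen() */
};

/* The open-coded check removed by the patch: full once backlog == limit. */
static bool open_coded_is_full(const struct model_sock *sk)
{
	return sk->ack_backlog >= sk->max_ack_backlog;
}

/* Same comparison as the helper: full only once backlog exceeds the limit. */
static bool helper_style_is_full(const struct model_sock *sk)
{
	return sk->ack_backlog > sk->max_ack_backlog;
}
```

With ack_backlog == max_ack_backlog the first check reports full while
the second still admits one more connection, so the patch is a small
behavior change and not only a refactor.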



* Re: [PATCH v3] hwrng: virtio: clamp device-reported used.len at copy_data()
From: Michael S. Tsirkin @ 2026-06-11  7:58 UTC
  To: Herbert Xu
  Cc: Michael Bommarito, Olivia Mackall, linux-crypto, Jason Wang,
	Kees Cook, Christian Borntraeger, virtualization, linux-kernel,
	Dan Williams, Ingo Molnar, H. Peter Anvin, torvalds, alan, tglx
In-Reply-To: <aipn8sIAQ6Ai2sax@gondor.apana.org.au>

On Thu, Jun 11, 2026 at 03:46:58PM +0800, Herbert Xu wrote:
> On Thu, Jun 11, 2026 at 03:30:14AM -0400, Michael S. Tsirkin wrote:
> > On Thu, Jun 11, 2026 at 12:43:09PM +0800, Herbert Xu wrote:
> > > On Sun, May 31, 2026 at 10:22:51AM -0400, Michael Bommarito wrote:
> > > >
> > > > +	size = min_t(unsigned int, size, avail - vi->data_idx);
> > > > +	idx = array_index_nospec(vi->data_idx, sizeof(vi->data));
> > > > +	memcpy(buf, vi->data + idx, size);
> > 
> > All the "malicious device" things are confusing. Spectre things -
> > doubly so.
> > 
> > So if an access is speculated, then the CPU might speculatively feed a
> > kernel secret into the RNG. And then the speculated RNG value can maybe
> > also be used speculatively by some kernel code as an index to trigger a
> > cache access, finally leaking the secret?
> > 
> > Maybe?
> 
> The way Spectre works is if you have an actual instruction using
> idx directly.  I don't see how that translates to memcpy.

I am not sure it has to be direct:

if (malicious_idx > SIZE)
	return;
src += malicious_idx;
memcpy(&value, src, ...)
....
hash = complex_hash_of(value)
....
return p[hash * 512];

is, IIUC, still a valid Spectre v1 gadget leaking a value beyond SIZE, or
did I miss something?


And the RNG is a kind of complex hash, but I also think that the "...."
in the kernel is probably large enough to close any transient execution
window.


So sure, we can drop this.
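
For context on what the quoted array_index_nospec() line is meant to do,
here is a user-space sketch of branch-free index clamping, modeled on the
kernel's generic mask helper. The names clamp_index_mask and clamp_index
are hypothetical, and the sketch assumes a compiler with arithmetic right
shift of negative values (GCC and Clang behave this way). It is an
illustration under those assumptions, not the kernel's implementation.

```c
#include <limits.h>

/*
 * Returns ~0UL when index < size and 0 otherwise, without a branch.
 * Assumes size <= LONG_MAX and an arithmetic right shift for negative longs.
 */
static unsigned long clamp_index_mask(unsigned long index, unsigned long size)
{
	return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
			       (sizeof(long) * CHAR_BIT - 1));
}

/* In-bounds indices pass through; out-of-bounds ones collapse to 0. */
static unsigned long clamp_index(unsigned long index, unsigned long size)
{
	return index & clamp_index_mask(index, size);
}
```

Because the clamp is data-dependent arithmetic and not a conditional
branch, a mispredicted bounds check cannot steer a speculative load past
the end of the array through this index.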




> Cheers,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt



* Re: [PATCH RESEND] virtio_console: read size from config space during device init
From: Michael S. Tsirkin @ 2026-06-11  7:38 UTC
  To: Filip Hejsek
  Cc: Amit Shah, Arnd Bergmann, Greg Kroah-Hartman, Rusty Russell,
	virtualization, linux-kernel
In-Reply-To: <dce7a3af2e8fd674d842efe7e21abb4644248155.camel@gmail.com>

On Thu, Jun 11, 2026 at 08:57:57AM +0200, Filip Hejsek wrote:
> On Wed, 2026-06-10 at 03:04 -0400, Michael S. Tsirkin wrote:
> > On Mon, Feb 23, 2026 at 06:37:02PM +0100, Filip Hejsek wrote:
> > > Previously, the size was only read upon receiving the config interrupt.
> > > This interrupt is sent when the size changes. However, we also need to
> > > read the initial size.
> > > 
> > > Also make sure to only read the size from config if F_SIZE is enabled.
> > > 
> > > Fixes: 9778829cffd4 ("virtio: console: Store each console's size in the console structure")
> > > Signed-off-by: Filip Hejsek <filip.hejsek@gmail.com>
> > > ---
> > > This is a resend of [1], which hasn't received any response.
> > > 
> > > I found this bug while developing patches for QEMU that add virtio
> > > console resize support. If you want to test this, you can get my QEMU
> > > patches from [2]. You will need to disable multiport
> > > using `-device virtio-serial,max_ports=1`.
> > > 
> > > [1]: https://lore.kernel.org/all/20251224-virtio-console-fix-v1-1-69d0349692dc@gmail.com/
> > > [2]: https://lore.kernel.org/all/20250921-console-resize-v5-0-89e3c6727060@gmail.com/
> > > 
> > > I'll also repeat my questions from the previous submission here. These are
> > > things that confused me when I was trying to understand the surrounding code,
> > > but should in no way prevent merging this patch.
> > > 
> > >   - Why does use_multiport use __virtio_test_bit instead of
> > >     virtio_has_feature?
> > > 
> > >   - The VIRTIO_CONSOLE_RESIZE handler sets irq_requested to 1, which I
> > >     think makes no sense?
> > > ---
> > >  drivers/char/virtio_console.c | 52 ++++++++++++++++++++++++++-----------------
> > >  1 file changed, 31 insertions(+), 21 deletions(-)
> > > 
> > > diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
> > > index 088182e54d..c355f6d392 100644
> > > --- a/drivers/char/virtio_console.c
> > > +++ b/drivers/char/virtio_console.c
> > > @@ -1771,32 +1771,40 @@ static void config_intr(struct virtio_device *vdev)
> > >  		schedule_work(&portdev->config_work);
> > >  }
> > >  
> > > -static void config_work_handler(struct work_struct *work)
> > > +static void update_size_from_config(struct ports_device *portdev)
> > >  {
> > > -	struct ports_device *portdev;
> > > +	struct virtio_device *vdev;
> > > +	struct port *port;
> > > +	u16 rows, cols;
> > >  
> > > -	portdev = container_of(work, struct ports_device, config_work);
> > > -	if (!use_multiport(portdev)) {
> > > -		struct virtio_device *vdev;
> > > -		struct port *port;
> > > -		u16 rows, cols;
> > > +	vdev = portdev->vdev;
> > >  
> > > -		vdev = portdev->vdev;
> > > -		virtio_cread(vdev, struct virtio_console_config, cols, &cols);
> > > -		virtio_cread(vdev, struct virtio_console_config, rows, &rows);
> > > +	/*
> > > +	 * We'll use this way of resizing only for legacy support.
> > > +	 * For multiport devices, use control messages to indicate
> > > +	 * console size changes so that it can be done per-port.
> > > +	 *
> > > +	 * Don't test F_SIZE at all if we're rproc: not a valid feature.
> > > +	 */
> > > +	if (is_rproc_serial(vdev) ||
> > 
> > Wait a second. Why is there this rproc test here?
> > Was not in the original code and commit log says nothing about it.
> > 
> 
> Previously, this code was in config_work_handler(), which was never
> called for rproc_serial (it's scheduled from config_intr(), which is
> the config_changed handler only for virtio_console).
> 
> Now update_size_from_config() is called unconditionally from
> virtcons_probe(), so it will be called for rproc_serial too, which
> doesn't have the F_SIZE feature.

So why not test it? What does "not a valid feature" mean?
I dislike transport code leaking into devices.


> > 
> > > +	    use_multiport(portdev) ||
> > > +	    !virtio_has_feature(vdev, VIRTIO_CONSOLE_F_SIZE))
> > > +		return;
> > >  
> > > -		port = find_port_by_id(portdev, 0);
> > > -		set_console_size(port, rows, cols);
> > > +	virtio_cread(vdev, struct virtio_console_config, cols, &cols);
> > > +	virtio_cread(vdev, struct virtio_console_config, rows, &rows);
> > >  
> > > -		/*
> > > -		 * We'll use this way of resizing only for legacy
> > > -		 * support.  For newer userspace
> > > -		 * (VIRTIO_CONSOLE_F_MULTPORT+), use control messages
> > > -		 * to indicate console size changes so that it can be
> > > -		 * done per-port.
> > > -		 */
> > > -		resize_console(port);
> > > -	}
> > > +	port = find_port_by_id(portdev, 0);
> > > +	set_console_size(port, rows, cols);
> > > +	resize_console(port);
> > > +}
> > > +
> > > +static void config_work_handler(struct work_struct *work)
> > > +{
> > > +	struct ports_device *portdev;
> > > +
> > > +	portdev = container_of(work, struct ports_device, config_work);
> > > +	update_size_from_config(portdev);
> > >  }
> > >  
> > >  static int init_vqs(struct ports_device *portdev)
> > > @@ -2054,6 +2062,8 @@ static int virtcons_probe(struct virtio_device *vdev)
> > >  	__send_control_msg(portdev, VIRTIO_CONSOLE_BAD_ID,
> > >  			   VIRTIO_CONSOLE_DEVICE_READY, 1);
> > >  
> > > +	update_size_from_config(portdev);
> > > +
> > >  	return 0;
> > >  
> > >  free_chrdev:
> > > 
> > > ---
> > > base-commit: b927546677c876e26eba308550207c2ddf812a43
> > > change-id: 20251224-virtio-console-fix-3d46980ef569
> > > 
> > > Best regards,
> > > -- 
> > > Filip Hejsek <filip.hejsek@gmail.com>



* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Miaohe Lin @ 2026-06-11  7:36 UTC
  To: Michael S. Tsirkin
  Cc: Zi Yan, David Hildenbrand (Arm), Andrew Morton, linux-kernel,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <20260611013644-mutt-send-email-mst@kernel.org>

On 2026/6/11 13:43, Michael S. Tsirkin wrote:
> On Thu, Jun 11, 2026 at 11:35:36AM +0800, Miaohe Lin wrote:
>> On 2026/6/11 5:18, Michael S. Tsirkin wrote:
>>> On Wed, Jun 10, 2026 at 03:24:30PM +0800, Miaohe Lin wrote:
>>>> On 2026/6/10 5:00, Michael S. Tsirkin wrote:
>>>>> On Tue, Jun 09, 2026 at 04:54:01PM -0400, Zi Yan wrote:
>>>>>> On 9 Jun 2026, at 16:34, Michael S. Tsirkin wrote:
>>>>>>
>>>>>>> On Tue, Jun 09, 2026 at 02:52:47PM -0400, Zi Yan wrote:
>>>>>>>> On 9 Jun 2026, at 14:39, Zi Yan wrote:
>>>>>>>>
>>>>>>>>> On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
>>>>>>>>>
>>>>>>>>>> On 6/9/26 20:10, Andrew Morton wrote:
>>>>>>>>>>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
>>>>>>>>>>>> update to page->flags can race with non-atomic flag operations
>>>>>>>>>>>> that run under zone->lock in the buddy allocator.
>>>>>>>>>>>>
>>>>>>>>>>>> In particular, __free_pages_prepare() does:
>>>>>>>>>>>>
>>>>>>>>>>>>     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>>>>>>>>>>>
>>>>>>>>>>>> This non-atomic read-modify-write, while correctly excluding
>>>>>>>>>>>> __PG_HWPOISON from the mask, can still lose a concurrent
>>>>>>>>>>>> TestSetPageHWPoison if the read happens before the poison bit
>>>>>>>>>>>> is set and the write happens after.  Will only get worse if/when
>>>>>>>>>>>> we add more non-atomic flag operations.
>>>>>>>>>>>>
>>>>>>>>>>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
>>>>>>>>>>>> around ClearPageHWPoison in the retry path.  This
>>>>>>>>>>>> serializes with all buddy flag manipulation.  The cost is
>>>>>>>>>>>> negligible: one lock/unlock in an extremely rare path
>>>>>>>>>>>> (hardware memory errors).
>>>>>>>>>>>>
>>>>>>>>>>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
>>>>>>>>>>>> in this file operate on pages already removed from the buddy
>>>>>>>>>>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
>>>>>>>>>>>> need zone->lock protection.
>>>>>>>>>>>
>>>>>>>>>>> Sashiko is saying this doesn't do anything "Because
>>>>>>>>>>> __free_pages_prepare() executes entirely locklessly".  Did it goof?
>>>>>>>>>>>
>>>>>>>>>>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
>>>>>>>>>>
>>>>>>>>>> Battle of the bots: it's right.
>>>>>>>>>
>>>>>>>>> Yep, __free_pages_prepare() changes the page flag without holding
>>>>>>>>> zone->lock.
>>>>>>>>
>>>>>>>> __free_pages_prepare() works on frozen pages and assumes no one else
>>>>>>>> touches the input page. To avoid this race, memory_failure() might
>>>>>>>> want to try_get_page() before TestClearPageHWPoison(), but I am not
>>>>>>>> sure if that works along with memory failure flow.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Yan, Zi
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Actually memory failure already plays with this down the road no?
>>>>>>>
>>>>>>> So maybe it's enough to just SetPageHWPoison afterwards again?
>>>>>>>
>>>>>>>
>>>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>>>>> index ee42d4361309..4758fea94a96 100644
>>>>>>> --- a/mm/memory-failure.c
>>>>>>> +++ b/mm/memory-failure.c
>>>>>>> @@ -2415,6 +2415,7 @@ int memory_failure(unsigned long pfn, int flags)
>>>>>>>  	if (!res) {
>>>>>>>  		if (is_free_buddy_page(p)) {
>>>>>>>  			if (take_page_off_buddy(p)) {
>>>>>>> +				SetPageHWPoison(p);
>>>>>>>  				page_ref_inc(p);
>>>>>>>  				res = MF_RECOVERED;
>>>>>>>  			} else {
>>>>>>>
>>>>>>>
>>>>>>> and maybe in a bunch of other places in there?
>>>>>>
>>>>>> You mean for fear of losing HWPoison flag in the earlier TestSetPageHWPoison(),
>>>>>> just set it again here?
>>>>>
>>>>> Yea.
>>>>>
>>>>>> Why not do it after get_hwpoison_page(), since that
>>>>>> is the expected page flag?
>>>>>
>>>>> It's still in the buddy at that point right? I'm worried buddy might
>>>>> poke at flags.
>>>>
>>>> Since __free_pages_prepare() executes entirely locklessly, the only way to ensure
>>>> HWPoison flag won't be lost might be only set hwpoison flag iff we can make sure
>>>> pages are not on the way to buddy...
>>>>
>>>> Thanks.
>>>> .
>>>
>>>
>>> To clarify do you not agree repeating SetPageHWPoison is enough for
>>> this? And if not, do you have suggestions on how to fix this race?
>>
>> Do you mean repeating SetPageHWPoison on every branch?
> 
> Right.
> 
>> Is it possible
>> to make __free_pages_prepare changes page->flags atomically or this race
>> is specified to memory_failure?
>>
>> Thanks.
>> .
> 
> 
> Adding an atomic op on every fast path page allocation is, I am
> guessing, going to slow down Linux measurably.
> 
> Doing it for the benefit of memory_failure, which is the slowest of
> slow paths, seems unpalatable, to me.

Agreed, it's not worth doing so.

> 
> Neither am I sure it's the only racy place -
> grep for __SetPage and __ClearPage - all these have the same issue, I
> suspect.
> 
> At the same time, I'm not an mm maintainer. If you disagree, try to
> upstream a change converting all non atomics in mm to atomics, and see
> what others say.

Since memory_failure might be the only place affected, such a change would be
unacceptable. We should come up with a better solution. Maybe we can try repeating
SetPageHWPoison and ClearPageHWPoison as a first attempt, though it looks somewhat
weird to me and makes the code more complicated. But it's already complicated. :)

Thanks.
.
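
To make the race in the quoted commit message concrete, here is a
single-threaded C replay of the interleaving it describes. FLAG_HWPOISON,
FLAGS_CHECK_AT_PREP and replay_lost_update() are hypothetical names with
made-up bit values; the real code uses page->flags and the page-flag
helpers. The sketch assumes C11 atomics and only models the ordering, not
real concurrency.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define FLAG_HWPOISON       (1UL << 0) /* hypothetical bit position */
#define FLAGS_CHECK_AT_PREP (1UL << 1) /* hypothetical mask, excludes HWPOISON */

/*
 * Replays: free path reads flags, poison bit is set atomically, free path
 * writes back its stale value. Returns true if the poison bit survived.
 */
static bool replay_lost_update(bool reapply_poison)
{
	atomic_ulong flags = FLAGS_CHECK_AT_PREP;

	/* Free path, step 1: load of the flags word. */
	unsigned long stale = atomic_load(&flags);

	/* memory_failure(): atomic test-and-set lands in between. */
	atomic_fetch_or(&flags, FLAG_HWPOISON);

	/* Free path, step 2: store of the masked stale value. */
	atomic_store(&flags, stale & ~FLAGS_CHECK_AT_PREP);

	/* Idea from the thread: set the bit again once the page is ours. */
	if (reapply_poison)
		atomic_fetch_or(&flags, FLAG_HWPOISON);

	return atomic_load(&flags) & FLAG_HWPOISON;
}
```

replay_lost_update(false) shows the bit vanishing even though the mask
never names it. replay_lost_update(true) models repeating SetPageHWPoison
after the page has left the buddy allocator, which only helps if no
further non-atomic writer can run at that point.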




* Re: [PATCH v3] hwrng: virtio: clamp device-reported used.len at copy_data()
From: Michael S. Tsirkin @ 2026-06-11  7:30 UTC
  To: Herbert Xu
  Cc: Michael Bommarito, Olivia Mackall, linux-crypto, Jason Wang,
	Kees Cook, Christian Borntraeger, virtualization, linux-kernel,
	Dan Williams, Ingo Molnar, H. Peter Anvin, torvalds, alan, tglx
In-Reply-To: <aio83ZWadVTiuNpR@gondor.apana.org.au>

On Thu, Jun 11, 2026 at 12:43:09PM +0800, Herbert Xu wrote:
> On Sun, May 31, 2026 at 10:22:51AM -0400, Michael Bommarito wrote:
> >
> > +	size = min_t(unsigned int, size, avail - vi->data_idx);
> > +	idx = array_index_nospec(vi->data_idx, sizeof(vi->data));
> > +	memcpy(buf, vi->data + idx, size);
> 
> I don't see how nospec can help here.  Please enlighten me.


All the "malicious device" things are confusing. Spectre things -
doubly so.

So if an access is speculated, then the CPU might speculatively feed a
kernel secret into the RNG. And then the speculated RNG value can maybe
also be used speculatively by some kernel code as an index to trigger a
cache access, finally leaking the secret?

Maybe?




> Thanks,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt



* Re: [PATCH net] virtio_net: do not allow tunnel csum offload for non GSO packets
From: Paolo Abeni @ 2026-06-11  7:28 UTC
  To: Gabriel Goller
  Cc: netdev, Michael S. Tsirkin, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, virtualization, Willem de Bruijn
In-Reply-To: <178109396600.129329.5073003425673349392.b4-review@b4>

On 6/10/26 2:19 PM, Gabriel Goller wrote:
> On Tue, 09 Jun 2026 16:44:26 +0200, Paolo Abeni <pabeni@redhat.com> wrote:
>> [...]
>>
>> Fixes: 56a06bd40fab ("virtio_net: enable gso over UDP tunnel support.")
>> Reported-by: Fiona Ebner <f.ebner@proxmox.com>
>> Closes: https://bugzilla.proxmox.com/show_bug.cgi?id=7627
>> Tested-by: Fiona Ebner <f.ebner@proxmox.com>
>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> 
> Gave it a spin and it works alright, so consider:
> 
>>
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index f4adcfee7a80..07b8710639f9 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -6222,6 +6222,18 @@ static void virtnet_free_irq_moder(struct virtnet_info *vi)
>>  	rtnl_unlock();
>>  }
>>  
>> +static netdev_features_t virtnet_features_check(struct sk_buff *skb,
>> +						struct net_device *dev,
>> +						netdev_features_t features)
>> +{
>> +	/* Inner csum offload is only available for GSO packets. */
>> +	if (skb->encapsulation && !skb_is_gso(skb))
> 
> A small question -- should we maybe check for skb_gso_ok here as well?
> So add:
> 
> 	(!skb_is_gso(skb) || !skb_gso_ok(skb, features)))
> 
> Because skb_is_gso alone doesn't guarantee that the packets leaving virtio will
> be GSO'd: they could be software-segmented at validate_xmit_skb, which is
> called after ndo_features_check.

Good point. Indeed, disabling tx-udp_tnl-segmentation inside the guest
after successful VIRTIO_NET_F_HOST_UDP_TUNNEL_GSO negotiation would
again break connectivity in the critical scenarios.

Let me test a v2.

Thanks,

Paolo



* Re: [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock
From: Dmitry Ilvokhin @ 2026-06-11  7:17 UTC
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, linux-kernel, linux-mips,
	virtualization, linux-arch, linux-mm, linux-trace-kernel,
	kernel-team, Paul E. McKenney
In-Reply-To: <20260603120811.GW3493090@noisy.programming.kicks-ass.net>

On Wed, Jun 03, 2026 at 02:08:11PM +0200, Peter Zijlstra wrote:
> Also, I think someone should go do some performance runs with
> ARCH_INLINE_SPIN_* set for x86 just like for s390.

As promised, I set ARCH_INLINE_SPIN_UNLOCK{,_BH,_IRQ,_IRQRESTORE} for
x86 and measured the effect on a few real workloads.

Short version: inlining of _raw_spin_unlock() adds measurable kernel
i-cache pressure on every workload I tried, and on a
kernel-i-cache-bound one (nginx connection churn) it costs ~1.27%
throughput. I did not find a workload where it helps.

HOW BENCHMARKS WERE CHOSEN

The cost of inlining unlock is text footprint increase. Every unlock
site grows, and the extra bytes compete for the shared L1i. The bill is
paid by unrelated code, in both kernel and userspace.

Locktorture and similar microbenchmarks can't see this, because they
usually hammer a tiny loop that stays L1i-resident, so they measure
fast-path cycles, where inlining (fewer instructions per unlock) looks
neutral-to-good.

To make the cost visible, the workload has to have real instruction
cache pressure. To achieve that, it has to touch a lot of code.

A good way to screen benchmarks: look for a high tma_frontend_bound
fraction from 'perf stat -M TopdownL1' and also require the workload to
spend non-trivial time in the kernel (be syscall-heavy).

SETUP

Hardware: 2x Intel Xeon Gold 6138 (Skylake-SP), 20 cores/socket, 40C/80T
with kernel built from locking/core branch. Baseline _raw_spin_unlock()
is out-of-line via UNINLINE_SPIN_UNLOCK=y. Experiment adds the four
selects above (exact patch is at the end of this message). Cache
geometry (lscpu -C):

NAME ONE-SIZE ALL-SIZE WAYS TYPE        LEVEL  SETS PHY-LINE COHERENCY-SIZE
L1d       32K     1.3M    8 Data            1    64        1             64
L1i       32K     1.3M    8 Instruction     1    64        1             64
L2         1M      40M   16 Unified         2  1024        1             64
L3      27.5M      55M   11 Unified         3 40960        1             64

Per run I collected cycles, instructions and L1i-misses. To stay within
the available PMU counters, each run used only 3 events: cycles,
instructions and one L1i filter (:u or :k). The NMI watchdog was off and
every run reported 100% counter enablement (no multiplexing). Userspace
and kernel misses therefore come from separate runs. Each benchmark was
run 20x per side: 10 with the :u counter, 10 with :k.  Cycles,
instructions and throughput are pooled across all 20, each L1i split
comes from its 10.

KERNEL IMAGE SIZE

To give a sense of the code-footprint increase, scripts/bloat-o-meter on
vmlinux, GCC 11, x86_64, defconfig + CONFIG_PARAVIRT_SPINLOCKS=y:

    Total: Before=23838694, After=23977159, chg +0.58%

ROCKSDB (DELETESEQ)

    db_bench -benchmarks=deleteseq

Metric                       Baseline      Experiment     Delta   Sig
----------------------------------------------------------------------
Instructions (total)    9,574,476,543   9,573,602,441    -0.01%   flat
L1i-miss :k (kernel)      198,588,165     216,672,536    +9.11%   **
L1i-miss :u (userspace)   593,276,235     616,433,813    +3.90%   **
Throughput ops/s            431,398         432,897      +0.35%   ns
Cycles (total)          4,681,002,302   4,665,106,876    -0.34%   ns
IPC                          2.045           2.052       +0.33%   ns
Time elapsed (s)            2.4012          2.3865       -0.62%   ns
----------------------------------------------------------------------
L1i-miss: higher = worse. Throughput: higher = better.
** = beyond per-run noise (+-0.1..0.36%), ns = within noise.

At constant instructions, inlining raises L1i misses +9.11% (kernel) and
+3.90% (userspace), both well beyond noise. Throughput, cycles, IPC and
wall-time all stay within run-to-run noise. So the i-cache cost is real,
but at IPC ~2 db_bench isn't fetch-bound at the app level, so it doesn't
surface.

No benefit from _raw_spin_unlock() inlining.

KERNEL BUILD

Building locking/core (defconfig), GCC 11.

    make -j80

Metric              Baseline      Experiment     Delta   Sig
-------------------------------------------------------------
L1i-miss :k          36.72G        37.51G       +2.16%   **
L1i-miss :u         246.99G       246.06G       -0.38%   **
Sys (s)             478.250       482.420       +0.87%   **
Time elapsed (s)    105.221       105.373       +0.14%   ns
User (s)           4022.046      4024.012       +0.05%   flat
Cycles            8,894.10G     8,902.12G       +0.09%   flat
Instructions      8,424.28G     8,426.48G       +0.03%   flat
IPC                   0.947         0.947       -0.06%   flat
-------------------------------------------------------------
L1i-miss/Sys: higher = worse.
** = beyond per-run noise, ns = within noise.

Kernel i-cache misses (+2.16%) and sys time (+0.87%) both rise and are
significant. Wall-time and userspace L1i are flat. Kernel build is
GCC/userspace-bound (User 4022s vs Sys 478s), so the added kernel fetch
cost is real but appears to sit off the critical path.

No benefit from _raw_spin_unlock() inlining.

NGINX

I ran nginx with taskset -c 2.

    perf stat -C 2 ... -- ab -n 100000 -c 80 http://127.0.0.1:8080/

Config for nginx was the following.

  worker_processes 1;
  error_log /tmp/ngx/error.log;
  pid       /tmp/ngx/nginx.pid;
  events { worker_connections 16384; }
  http {
      access_log off;
      server { listen 8080 reuseport; location / { return 200 "ok\n"; } }
  }

I used nginx version 1.20.1 (prebuilt, from CentOS repo).

Metric              Baseline      Experiment     Delta   Sig
------------------------------------------------------------
req/s (ab)           25,113        24,795       -1.27%   **
L1i MPKI :k          70.06         72.10        +2.92%   **
L1i MPKI :u          20.16         20.66        +2.50%   **
instructions          5.86G         5.83G       -0.50%   **
L1i-miss :k           0.41G         0.42G       +2.44%   **
L1i-miss :u           0.12G         0.12G       +1.95%   **
cycles                4.82G         4.81G       -0.28%   ns
IPC                   1.215         1.213       -0.22%   ns
perf time (s)         4.077         4.129       +1.26%   **
failed reqs              0             0          -      valid
------------------------------------------------------------
req/s: higher=better. MPKI: higher=worse.
** = beyond per-run noise, ns = within noise.

nginx connection-churn is the one workload that is genuinely
kernel-fetch-bound: MPKI:k ~70 and IPC ~1.2 (vs db_bench's 2.05). Here
the cost surfaces: req/s −1.27%. Misses rise in both domains (+2.9%
MPKI:k, +2.5% MPKI:u). Unlike kernel build, userspace is hit too,
because nginx runs user and kernel hot on the same core and the kernel
bloat pollutes the shared L1i.
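
The derived columns in the table above can be rechecked from the raw
counters. A small sketch, assuming MPKI means misses per 1000 retired
instructions and Delta is the relative change from baseline. The
function names are hypothetical and not part of any tool used in the
runs.

```c
/* Misses per thousand instructions. */
static double mpki(double misses, double instructions)
{
	return misses / instructions * 1000.0;
}

/* Relative change from baseline to experiment, in percent. */
static double delta_percent(double baseline, double experiment)
{
	return (experiment - baseline) / baseline * 100.0;
}
```

For example, delta_percent(25113, 24795) gives about -1.27, matching the
req/s row, and mpki(0.41e9, 5.86e9) gives about 70, in line with the
MPKI :k baseline (the table's inputs are rounded).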

And the kicker: instructions fell 0.5% (inlining removed the call/ret)
yet throughput dropped.

Caveat: ab is single-threaded, so it seems the worker core is
under-saturated: cycles is flat (−0.28%, ns) while wall-time rose
(+1.26%).

Measurable throughput regression from _raw_spin_unlock() inlining.

CONCLUSION

Inlining _raw_spin_unlock() raises kernel L1i misses on every workload.
It's an unconditional cost. Whether it costs the application throughput
depends on how kernel-fetch-bound the workload is.

The cost is real everywhere. It only surfaces as throughput regression
where the kernel is on the fetch critical path. And inlining did not
help in any workload I measured. The one micro-effect inlining produced
(-0.5% instructions on nginx) was erased by the added i-cache pressure.

From 99502328caed3c195e20cf194a1e8aa1563f3896 Mon Sep 17 00:00:00 2001
From: Dmitry Ilvokhin <d@ilvokhin.com>
Date: Thu, 4 Jun 2026 07:43:00 -0700
Subject: [PATCH] x86/locking: Inline the spin_unlock()

Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
---
 arch/x86/Kconfig | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fdaef60b46d6..c9a0638225fd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -113,6 +113,10 @@ config X86
 	select ARCH_HAS_ZONE_DMA_SET if EXPERT
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_HAVE_EXTRA_ELF_NOTES
+	select ARCH_INLINE_SPIN_UNLOCK
+	select ARCH_INLINE_SPIN_UNLOCK_BH
+	select ARCH_INLINE_SPIN_UNLOCK_IRQ
+	select ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE
 	select ARCH_MEMORY_ORDER_TSO
 	select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
 	select ARCH_MIGHT_HAVE_ACPI_PDC		if ACPI
-- 
2.53.0-Meta


* [PATCH net-next 3/3] xsk: replace sk_busy_loop with sk_tx_busy_loop in __xsk_sendmsg()
From: menglong8.dong @ 2026-06-11  7:12 UTC
  To: jasowang
  Cc: mst, xuanzhuo, eperezma, andrew+netdev, davem, edumazet, kuba,
	pabeni, magnus.karlsson, maciej.fijalkowski, sdf, horms, ast,
	daniel, hawk, john.fastabend, bjorn, kerneljasonxing, netdev,
	virtualization, linux-kernel, bpf
In-Reply-To: <20260611071242.2485058-1-dongml2@chinatelecom.cn>

From: Menglong Dong <dongml2@chinatelecom.cn>

Replace sk_busy_loop with sk_tx_busy_loop to support tx napi in
__xsk_sendmsg().

Fixes: a0731952d9cd ("xsk: Add busy-poll support for {recv,send}msg()")
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
---
 net/xdp/xsk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 5e5786cd9af5..2bf9a7313ac4 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1158,7 +1158,7 @@ static int __xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len
 		return -ENOBUFS;
 
 	if (sk_can_busy_loop(sk))
-		sk_busy_loop(sk, 1); /* only support non-blocking sockets */
+		sk_tx_busy_loop(sk, 1); /* only support non-blocking sockets */
 
 	if (xs->zc && xsk_no_wakeup(sk))
 		return 0;
-- 
2.54.0



* [PATCH net-next 2/3] virtio_net: initialize napi.tx_napi in virtnet_alloc_queues()
From: menglong8.dong @ 2026-06-11  7:12 UTC
  To: jasowang
  Cc: mst, xuanzhuo, eperezma, andrew+netdev, davem, edumazet, kuba,
	pabeni, magnus.karlsson, maciej.fijalkowski, sdf, horms, ast,
	daniel, hawk, john.fastabend, bjorn, kerneljasonxing, netdev,
	virtualization, linux-kernel, bpf
In-Reply-To: <20260611071242.2485058-1-dongml2@chinatelecom.cn>

From: Menglong Dong <dongml2@chinatelecom.cn>

Initialize the tx_napi for the rx queue, which will allow us to get the tx
napi from the rx napi.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
---
 drivers/net/virtio_net.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 86b5c1ca568c..d72c124c9760 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -6543,6 +6543,7 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
 					 virtnet_poll_tx,
 					 napi_tx ? napi_weight : 0);
 
+		vi->rq[i].napi.tx_napi = &vi->sq[i].napi;
 		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
 		ewma_pkt_len_init(&vi->rq[i].mrg_avg_pkt_len);
 		sg_init_table(vi->sq[i].sg, ARRAY_SIZE(vi->sq[i].sg));
-- 
2.54.0



* [PATCH net-next 1/3] net: busy-poll: introduce sk_tx_busy_loop()
From: menglong8.dong @ 2026-06-11  7:12 UTC
  To: jasowang
  Cc: mst, xuanzhuo, eperezma, andrew+netdev, davem, edumazet, kuba,
	pabeni, magnus.karlsson, maciej.fijalkowski, sdf, horms, ast,
	daniel, hawk, john.fastabend, bjorn, kerneljasonxing, netdev,
	virtualization, linux-kernel, bpf
In-Reply-To: <20260611071242.2485058-1-dongml2@chinatelecom.cn>

From: Menglong Dong <dongml2@chinatelecom.cn>

For now, we use sk_busy_loop() for both the rx and tx paths. sk_busy_loop()
calls napi_busy_loop() for the specified napi_id. However, some NIC drivers,
such as virtio-net, have a tx napi. In this case, sk_busy_loop() doesn't
work, as it can only schedule the NAPI for the rx queue.

Therefore, introduce sk_tx_busy_loop() for the NIC drivers that support tx
napi, which will schedule the tx napi if available.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
---
 include/linux/netdevice.h |  1 +
 include/net/busy_poll.h   | 41 ++++++++++++++++++++++++++++++++++++---
 net/core/dev.c            | 26 +++++++------------------
 3 files changed, 46 insertions(+), 22 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0e1e581efc5a..8a771b014d54 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -416,6 +416,7 @@ struct napi_struct {
 	int			napi_rmap_idx;
 	int			index;
 	struct napi_config	*config;
+	struct napi_struct	*tx_napi;
 };
 
 enum {
diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 6e172d0f6ef5..0959e80272c7 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -33,6 +33,12 @@ static inline bool napi_id_valid(unsigned int napi_id)
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
 
+enum {
+	NAPI_F_PREFER_BUSY_POLL	= 1,
+	NAPI_F_END_ON_RESCHED	= 2,
+	NAPI_F_TX_NAPI		= 4,
+};
+
 struct napi_struct;
 extern unsigned int sysctl_net_busy_read __read_mostly;
 extern unsigned int sysctl_net_busy_poll __read_mostly;
@@ -49,9 +55,9 @@ static inline bool sk_can_busy_loop(const struct sock *sk)
 
 bool sk_busy_loop_end(void *p, unsigned long start_time);
 
-void napi_busy_loop(unsigned int napi_id,
-		    bool (*loop_end)(void *, unsigned long),
-		    void *loop_end_arg, bool prefer_busy_poll, u16 budget);
+void __napi_busy_loop(unsigned int napi_id,
+		      bool (*loop_end)(void *, unsigned long),
+		      void *loop_end_arg, unsigned int flags, u16 budget);
 
 void napi_busy_loop_rcu(unsigned int napi_id,
 			bool (*loop_end)(void *, unsigned long),
@@ -60,6 +66,17 @@ void napi_busy_loop_rcu(unsigned int napi_id,
 void napi_suspend_irqs(unsigned int napi_id);
 void napi_resume_irqs(unsigned int napi_id);
 
+static inline void napi_busy_loop(unsigned int napi_id,
+				  bool (*loop_end)(void *, unsigned long),
+				  void *loop_end_arg, bool prefer_busy_poll, u16 budget)
+{
+	unsigned int flags = prefer_busy_poll ? NAPI_F_PREFER_BUSY_POLL : 0;
+
+	rcu_read_lock();
+	__napi_busy_loop(napi_id, loop_end, loop_end_arg, flags, budget);
+	rcu_read_unlock();
+}
+
 #else /* CONFIG_NET_RX_BUSY_POLL */
 static inline unsigned long net_busy_loop_on(void)
 {
@@ -126,6 +143,24 @@ static inline void sk_busy_loop(struct sock *sk, int nonblock)
 #endif
 }
 
+static inline void sk_tx_busy_loop(struct sock *sk, int nonblock)
+{
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	unsigned int napi_id = READ_ONCE(sk->sk_napi_id);
+	unsigned int flags = NAPI_F_TX_NAPI;
+
+	if (READ_ONCE(sk->sk_prefer_busy_poll))
+		flags |= NAPI_F_PREFER_BUSY_POLL;
+
+	if (napi_id_valid(napi_id)) {
+		rcu_read_lock();
+		__napi_busy_loop(napi_id, nonblock ? NULL : sk_busy_loop_end, sk, flags,
+				 READ_ONCE(sk->sk_busy_poll_budget) ?: BUSY_POLL_BUDGET);
+		rcu_read_unlock();
+	}
+#endif
+}
+
 /* used in the NIC receive handler to mark the skb */
 static inline void __skb_mark_napi_id(struct sk_buff *skb,
 				      const struct gro_node *gro)
diff --git a/net/core/dev.c b/net/core/dev.c
index 0c6c270d9f7d..645a2e851918 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6878,11 +6878,6 @@ static void __busy_poll_stop(struct napi_struct *napi, unsigned long timeout)
 		      HRTIMER_MODE_REL_PINNED);
 }
 
-enum {
-	NAPI_F_PREFER_BUSY_POLL	= 1,
-	NAPI_F_END_ON_RESCHED	= 2,
-};
-
 static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
 			   unsigned flags, u16 budget)
 {
@@ -6932,9 +6927,9 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
 	local_bh_enable();
 }
 
-static void __napi_busy_loop(unsigned int napi_id,
+void __napi_busy_loop(unsigned int napi_id,
 		      bool (*loop_end)(void *, unsigned long),
-		      void *loop_end_arg, unsigned flags, u16 budget)
+		      void *loop_end_arg, unsigned int flags, u16 budget)
 {
 	unsigned long start_time = loop_end ? busy_loop_current_time() : 0;
 	int (*napi_poll)(struct napi_struct *napi, int budget);
@@ -6951,6 +6946,9 @@ static void __napi_busy_loop(unsigned int napi_id,
 	if (!napi)
 		return;
 
+	if ((flags & NAPI_F_TX_NAPI) && napi->tx_napi)
+		napi = napi->tx_napi;
+
 	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
 		preempt_disable();
 	for (;;) {
@@ -7015,6 +7013,7 @@ static void __napi_busy_loop(unsigned int napi_id,
 	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
 		preempt_enable();
 }
+EXPORT_SYMBOL(__napi_busy_loop);
 
 void napi_busy_loop_rcu(unsigned int napi_id,
 			bool (*loop_end)(void *, unsigned long),
@@ -7028,18 +7027,6 @@ void napi_busy_loop_rcu(unsigned int napi_id,
 	__napi_busy_loop(napi_id, loop_end, loop_end_arg, flags, budget);
 }
 
-void napi_busy_loop(unsigned int napi_id,
-		    bool (*loop_end)(void *, unsigned long),
-		    void *loop_end_arg, bool prefer_busy_poll, u16 budget)
-{
-	unsigned flags = prefer_busy_poll ? NAPI_F_PREFER_BUSY_POLL : 0;
-
-	rcu_read_lock();
-	__napi_busy_loop(napi_id, loop_end, loop_end_arg, flags, budget);
-	rcu_read_unlock();
-}
-EXPORT_SYMBOL(napi_busy_loop);
-
 void napi_suspend_irqs(unsigned int napi_id)
 {
 	struct napi_struct *napi;
@@ -7579,6 +7566,7 @@ void netif_napi_add_weight_locked(struct net_device *dev,
 	napi->poll_owner = -1;
 #endif
 	napi->list_owner = -1;
+	napi->tx_napi = NULL;
 	set_bit(NAPI_STATE_SCHED, &napi->state);
 	set_bit(NAPI_STATE_NPSVC, &napi->state);
 	netif_napi_dev_list_add(dev, napi);
-- 
2.54.0
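
For readers following along, the redirection added to __napi_busy_loop()
can be sketched in userspace as below. This is only a model under stated
assumptions: struct napi_model and pick_napi() are hypothetical names, not
kernel symbols, and only the flag values are taken from the patch above.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values copied from the patch; everything else is a hypothetical model. */
enum {
	NAPI_F_PREFER_BUSY_POLL	= 1,
	NAPI_F_END_ON_RESCHED	= 2,
	NAPI_F_TX_NAPI		= 4,
};

/* Stand-in for napi_struct, reduced to the one field the patch adds. */
struct napi_model {
	const char *name;
	struct napi_model *tx_napi;	/* NULL unless the driver links a tx NAPI */
};

/*
 * Models the check added to __napi_busy_loop(): poll the tx NAPI only
 * when the caller asked for it and the driver actually linked one.
 */
static struct napi_model *pick_napi(struct napi_model *napi, unsigned int flags)
{
	if ((flags & NAPI_F_TX_NAPI) && napi->tx_napi)
		return napi->tx_napi;
	return napi;
}
```

With an rx entry whose tx_napi points at a tx entry, pick_napi() returns
the tx entry only when NAPI_F_TX_NAPI is set. A driver that never sets
tx_napi keeps the existing behaviour.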


^ permalink raw reply related

* [PATCH net-next 0/3] xsk: support tx napi busy_poll
From: menglong8.dong @ 2026-06-11  7:12 UTC (permalink / raw)
  To: jasowang
  Cc: mst, xuanzhuo, eperezma, andrew+netdev, davem, edumazet, kuba,
	pabeni, magnus.karlsson, maciej.fijalkowski, sdf, horms, ast,
	daniel, hawk, john.fastabend, bjorn, kerneljasonxing, netdev,
	virtualization, linux-kernel, bpf

From: Menglong Dong <dongml2@chinatelecom.cn>

For now, we use sk_busy_loop() in __xsk_sendmsg() to send the data in the
tx ring. sk_busy_loop() polls on the target NAPI. However, for a nic driver
that supports tx napi, such as virtio-net, it can't schedule the tx NAPI,
only the rx NAPI. If we enable busy_poll for xsk and use virtio-net, we
can't send data, as the rx NAPI in virtio-net doesn't handle packet
sending.

Fix this by introducing sk_tx_busy_loop(), which polls on the tx NAPI if
available. To get the tx NAPI from the napi_id, we add the "tx_napi" field
to napi_struct, which is ugly :/

Another choice is to call virtnet_xsk_xmit() in virtnet_poll() too. But
this somewhat contradicts the design of tx NAPI.

Menglong Dong (3):
  net: busy-poll: introduce sk_tx_busy_loop()
  virtio_net: initialize napi.tx_napi in virtnet_alloc_queues()
  xsk: replace sk_busy_loop with sk_tx_busy_loop in __xsk_sendmsg()

 drivers/net/virtio_net.c  |  1 +
 include/linux/netdevice.h |  1 +
 include/net/busy_poll.h   | 41 ++++++++++++++++++++++++++++++++++++---
 net/core/dev.c            | 23 +++++-----------------
 net/xdp/xsk.c             |  2 +-
 5 files changed, 46 insertions(+), 22 deletions(-)

-- 
2.54.0

^ permalink raw reply

* Re: [PATCH RESEND] virtio_console: read size from config space during device init
From: Filip Hejsek @ 2026-06-11  6:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Amit Shah, Arnd Bergmann, Greg Kroah-Hartman, Rusty Russell,
	virtualization, linux-kernel
In-Reply-To: <20260610030318-mutt-send-email-mst@kernel.org>

On Wed, 2026-06-10 at 03:04 -0400, Michael S. Tsirkin wrote:
> On Mon, Feb 23, 2026 at 06:37:02PM +0100, Filip Hejsek wrote:
> > Previously, the size was only read upon receiving the config interrupt.
> > This interrupt is sent when the size changes. However, we also need to
> > read the initial size.
> > 
> > Also make sure to only read the size from config if F_SIZE is enabled.
> > 
> > Fixes: 9778829cffd4 ("virtio: console: Store each console's size in the console structure")
> > Signed-off-by: Filip Hejsek <filip.hejsek@gmail.com>
> > ---
> > This is a resend of [1], which hasn't received any response.
> > 
> > I found this bug while developing patches for QEMU that add virtio
> > console resize support. If you want to test this, you can get my QEMU
> > patches from [2]. You will need to disable multiport
> > using `-device virtio-serial,max_ports=1`.
> > 
> > [1]: https://lore.kernel.org/all/20251224-virtio-console-fix-v1-1-69d0349692dc@gmail.com/
> > [2]: https://lore.kernel.org/all/20250921-console-resize-v5-0-89e3c6727060@gmail.com/
> > 
> > I'll also repeat my questions from the previous submission here. These are
> > things that confused me when I was trying to understand the surrounding code,
> > but should in no way prevent merging this patch.
> > 
> >   - Why does use_multiport use __virtio_test_bit instead of
> >     virtio_has_feature?
> > 
> >   - The VIRTIO_CONSOLE_RESIZE handler sets irq_requested to 1, which I
> >     think makes no sense?
> > ---
> >  drivers/char/virtio_console.c | 52 ++++++++++++++++++++++++++-----------------
> >  1 file changed, 31 insertions(+), 21 deletions(-)
> > 
> > diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
> > index 088182e54d..c355f6d392 100644
> > --- a/drivers/char/virtio_console.c
> > +++ b/drivers/char/virtio_console.c
> > @@ -1771,32 +1771,40 @@ static void config_intr(struct virtio_device *vdev)
> >  		schedule_work(&portdev->config_work);
> >  }
> >  
> > -static void config_work_handler(struct work_struct *work)
> > +static void update_size_from_config(struct ports_device *portdev)
> >  {
> > -	struct ports_device *portdev;
> > +	struct virtio_device *vdev;
> > +	struct port *port;
> > +	u16 rows, cols;
> >  
> > -	portdev = container_of(work, struct ports_device, config_work);
> > -	if (!use_multiport(portdev)) {
> > -		struct virtio_device *vdev;
> > -		struct port *port;
> > -		u16 rows, cols;
> > +	vdev = portdev->vdev;
> >  
> > -		vdev = portdev->vdev;
> > -		virtio_cread(vdev, struct virtio_console_config, cols, &cols);
> > -		virtio_cread(vdev, struct virtio_console_config, rows, &rows);
> > +	/*
> > +	 * We'll use this way of resizing only for legacy support.
> > +	 * For multiport devices, use control messages to indicate
> > +	 * console size changes so that it can be done per-port.
> > +	 *
> > +	 * Don't test F_SIZE at all if we're rproc: not a valid feature.
> > +	 */
> > +	if (is_rproc_serial(vdev) ||
> 
> Wait a second. Why is there this rproc test here?
> Was not in the original code and commit log says nothing about it.
> 

Previously, this code was in config_work_handler(), which was never
called for rproc_serial (it's scheduled from config_intr(), which is
the config_changed handler only for virtio_console).

Now update_size_from_config() is called unconditionally from
virtcons_probe(), so it will be called for rproc_serial too, which
doesn't have the F_SIZE feature.

> 
> > +	    use_multiport(portdev) ||
> > +	    !virtio_has_feature(vdev, VIRTIO_CONSOLE_F_SIZE))
> > +		return;
> >  
> > -		port = find_port_by_id(portdev, 0);
> > -		set_console_size(port, rows, cols);
> > +	virtio_cread(vdev, struct virtio_console_config, cols, &cols);
> > +	virtio_cread(vdev, struct virtio_console_config, rows, &rows);
> >  
> > -		/*
> > -		 * We'll use this way of resizing only for legacy
> > -		 * support.  For newer userspace
> > -		 * (VIRTIO_CONSOLE_F_MULTPORT+), use control messages
> > -		 * to indicate console size changes so that it can be
> > -		 * done per-port.
> > -		 */
> > -		resize_console(port);
> > -	}
> > +	port = find_port_by_id(portdev, 0);
> > +	set_console_size(port, rows, cols);
> > +	resize_console(port);
> > +}
> > +
> > +static void config_work_handler(struct work_struct *work)
> > +{
> > +	struct ports_device *portdev;
> > +
> > +	portdev = container_of(work, struct ports_device, config_work);
> > +	update_size_from_config(portdev);
> >  }
> >  
> >  static int init_vqs(struct ports_device *portdev)
> > @@ -2054,6 +2062,8 @@ static int virtcons_probe(struct virtio_device *vdev)
> >  	__send_control_msg(portdev, VIRTIO_CONSOLE_BAD_ID,
> >  			   VIRTIO_CONSOLE_DEVICE_READY, 1);
> >  
> > +	update_size_from_config(portdev);
> > +
> >  	return 0;
> >  
> >  free_chrdev:
> > 
> > ---
> > base-commit: b927546677c876e26eba308550207c2ddf812a43
> > change-id: 20251224-virtio-console-fix-3d46980ef569
> > 
> > Best regards,
> > -- 
> > Filip Hejsek <filip.hejsek@gmail.com>

^ permalink raw reply

* Re: [PATCH] vduse: Fix error around jumping over a __cleanup() variable
From: Michael S. Tsirkin @ 2026-06-11  6:54 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: Jason Wang, Xuan Zhuo, Eugenio Pérez, virtualization,
	linux-kernel, llvm
In-Reply-To: <20260610-vduse_vq_kick-fix-guard-usage-v1-1-0ce02c08006e@kernel.org>

On Wed, Jun 10, 2026 at 12:16:49PM -0700, Nathan Chancellor wrote:
> When building with clang, there is an error in vduse_vq_kick() from
> attempting to jump over a variable declared with the cleanup attribute
> using goto:
> 
>   drivers/vdpa/vdpa_user/vduse_dev.c:566:3: error: cannot jump from this goto statement to its label
>     566 |                 goto unlock;
>         |                 ^
>   drivers/vdpa/vdpa_user/vduse_dev.c:568:2: note: jump bypasses initialization of variable with __attribute__((cleanup))
>     568 |         guard(rwsem_read)(&vq->dev->rwsem);
>         |         ^
> 
> Jumping over a variable declared with the cleanup attribute does not
> prevent the cleanup function from running, it would just result in the
> variable being passed uninitialized to the cleanup function. clang
> errors instead of generating the invalid code, unlike GCC.
> 
> The jump is only present to call spin_unlock(), so convert the
> spin_lock() and spin_unlock() calls to an equivalent guard() to avoid
> the jump while leaving runtime behavior unchanged, clearing up the
> warning.
> 
> Fixes: 6c141c034c1b ("vduse: Add suspend")
> Signed-off-by: Nathan Chancellor <nathan@kernel.org>

Eugenio?

> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index a257fdcb77b7..0500da043761 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -561,9 +561,9 @@ static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
>  
>  static void vduse_vq_kick(struct vduse_virtqueue *vq)
>  {
> -	spin_lock(&vq->kick_lock);
> +	guard(spinlock)(&vq->kick_lock);
>  	if (!vq->ready)
> -		goto unlock;
> +		return;
>  
>  	guard(rwsem_read)(&vq->dev->rwsem);
>  	if (vq->dev->suspended)
> @@ -573,8 +573,6 @@ static void vduse_vq_kick(struct vduse_virtqueue *vq)
>  		eventfd_signal(vq->kickfd);
>  	else
>  		vq->kicked = true;
> -unlock:
> -	spin_unlock(&vq->kick_lock);
>  }
>  
>  static void vduse_vq_kick_work(struct work_struct *work)
> 
> ---
> base-commit: 6c141c034c1b0b74a2ca4dd3d6fbb6d9054f6e46
> change-id: 20260610-vduse_vq_kick-fix-guard-usage-d1037c331419
> 
> Best regards,
> --  
> Cheers,
> Nathan
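
A rough userspace illustration of why the early return is safe where the
goto was not: with the cleanup attribute, each guard is released when its
scope ends, and a guard that has not been declared yet is simply never
run. TOY_GUARD, struct toy_lock and toy_kick() are made-up names for this
sketch; the kernel's guard() is more elaborate, and this assumes GCC or
clang for __attribute__((cleanup)).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy lock that only counts acquisitions and releases. */
struct toy_lock {
	int held;
	int releases;
};

/* Cleanup callback: runs automatically when the guard variable leaves scope. */
static void toy_unlock_cleanup(struct toy_lock **l)
{
	(*l)->held--;
	(*l)->releases++;
}

/* Toy analogue of guard(): take the lock now, drop it at end of scope. */
#define TOY_GUARD(var, lock) \
	struct toy_lock *var __attribute__((cleanup(toy_unlock_cleanup), unused)) = \
		((lock)->held++, (lock))

/* Shaped like the patched vduse_vq_kick(): no label, no manual unlock. */
static bool toy_kick(struct toy_lock *kick_lock, struct toy_lock *rwsem, bool ready)
{
	TOY_GUARD(g1, kick_lock);
	if (!ready)
		return false;	/* g1 is released here; g2 was never declared */

	TOY_GUARD(g2, rwsem);
	return true;		/* g2 is released first, then g1 */
}
```

In the not-ready case only the first lock is taken and released. A goto
past the second TOY_GUARD would instead reach the end of scope with g2
declared but never initialized, which is the situation clang rejects.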


^ permalink raw reply

* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Michael S. Tsirkin @ 2026-06-11  6:33 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Andrew Morton, linux-kernel, Miaohe Lin, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <8c1f468e-b50a-487a-a267-8d1ea5a61c87@kernel.org>

On Tue, Jun 09, 2026 at 08:38:09PM +0200, David Hildenbrand (Arm) wrote:
> On 6/9/26 20:10, Andrew Morton wrote:
> > On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> >> TestSetPageHWPoison() is called without zone->lock, so its atomic
> >> update to page->flags can race with non-atomic flag operations
> >> that run under zone->lock in the buddy allocator.
> >>
> >> In particular, __free_pages_prepare() does:
> >>
> >>     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >>
> >> This non-atomic read-modify-write, while correctly excluding
> >> __PG_HWPOISON from the mask, can still lose a concurrent
> >> TestSetPageHWPoison if the read happens before the poison bit
> >> is set and the write happens after.  Will only get worse if/when
> >> we add more non-atomic flag operations.
> >>
> >> Fix by acquiring zone->lock around TestSetPageHWPoison and
> >> around ClearPageHWPoison in the retry path.  This
> >> serializes with all buddy flag manipulation.  The cost is
> >> negligible: one lock/unlock in an extremely rare path
> >> (hardware memory errors).
> >>
> >> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> >> in this file operate on pages already removed from the buddy
> >> allocator or on non-buddy pages (DAX, hugetlb), so they do not
> >> need zone->lock protection.
> > 
> > Sashiko is saying this doesn't do anything "Because
> > __free_pages_prepare() executes entirely locklessly".  Did it goof?
> > 
> > https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
> 
> Battle of the bots: it's right.

Ugh it's bot against human - I remembered we have zone lock
normally in alloc and thought it helps and didn't double check we don't
have it here. The bot wins (

> -- 
> Cheers,
> 
> David


^ permalink raw reply

* Re: [PATCH net v2 2/2] virtio-net: harden page_to_skb() big-packet frag loop
From: Michael S. Tsirkin @ 2026-06-11  6:04 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Xiang Mei, andrew+netdev, davem, edumazet, kuba, pabeni, netdev,
	virtualization, linux-kernel, minhquangbui99, bestswngs, jasowang,
	eperezma
In-Reply-To: <1781144329.8069873-2-xuanzhuo@linux.alibaba.com>

On Thu, Jun 11, 2026 at 10:18:49AM +0800, Xuan Zhuo wrote:
> On Wed, 10 Jun 2026 16:29:36 -0700, Xiang Mei <xmei5@asu.edu> wrote:
> > This is a robustness hardening patch. The slow-path frag loop in
> > page_to_skb() walks the page chain via page->private until the
> > device-reported len is consumed, implicitly trusting that len fits the
> > chain. It does not stop when the chain is exhausted (page becomes NULL
> > at the tail), nor when nr_frags reaches the end of the static
> > skb_shinfo()->frags[MAX_SKB_FRAGS] array.
> >
> > Both bounds are needed: the chain length is big_packets_num_skbfrags + 1
> > pages, which for an MTU-driven configuration can be well below
> > MAX_SKB_FRAGS, so neither guard implies the other.

i don't get it, and then what?

> >
> > Make the loop self-defending so it no longer relies on the caller having
> > validated len: stop once the chain is exhausted, and never index past
> > MAX_SKB_FRAGS. No functional change for well-formed input.
> 
> At this point, we are assuming that len represents the correct packet length.
> If there is a bug in the validation, it can be fixed, just like in your previous
> patch. Indeed, not checking nr_frags is also based on the overall design.
> However, I do not recommend adding this kind of enhancement. If we follow
> this logic, we would end up adding similar code in many other places, which
> doesn't make much sense.
> 
> Thanks.


I will be frank, I'm never sure where the confidential computing guys
draw the line.

Are speculative things of concern, for example?


> >
> > Signed-off-by: Xiang Mei <xmei5@asu.edu>
> > ---
> > v2: robustness patch
> >
> >  drivers/net/virtio_net.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index afe73eda1491..518c22fa1b68 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -906,8 +906,11 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> >  	}
> >
> >  	BUG_ON(offset >= PAGE_SIZE);
> > -	while (len) {
> > +	while (len && page) {


don't see why we would check page

> >  		unsigned int frag_size = min((unsigned)PAGE_SIZE - offset, len);
> > +
> > +		if (unlikely(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS))
> > +			break;

so do we want BUG_ON here maybe?

> >  		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page, offset,
> >  				frag_size, truesize);
> >  		len -= frag_size;
> > --
> > 2.43.0
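
For reference, the two guards proposed in this v2 can be modelled in
userspace as below. This is a sketch only: struct toy_page, count_frags()
and the constants are hypothetical stand-ins (17 is assumed for
MAX_SKB_FRAGS, 4096 for PAGE_SIZE), and it says nothing about whether the
hardening is wanted, which is what the replies above are debating.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE	4096u
#define TOY_MAX_FRAGS	17u	/* assumed stand-in for MAX_SKB_FRAGS */

/* Stand-in for the page chain linked through page->private. */
struct toy_page {
	struct toy_page *next;
};

/*
 * Counts how many frags the loop would add when it stops on either
 * an exhausted chain (page == NULL) or a full frag array.
 */
static unsigned int count_frags(struct toy_page *page, unsigned int offset,
				unsigned int len)
{
	unsigned int nr_frags = 0;

	while (len && page) {
		unsigned int room = TOY_PAGE_SIZE - offset;
		unsigned int frag_size = room < len ? room : len;

		if (nr_frags >= TOY_MAX_FRAGS)
			break;
		nr_frags++;		/* models skb_add_rx_frag() */
		len -= frag_size;
		page = page->next;	/* models page->private */
		offset = 0;
	}
	return nr_frags;
}
```

With a two-page chain and a len of three pages, the walk stops after two
frags because the chain ends. With a twenty-page chain it stops at
TOY_MAX_FRAGS. For well-formed len neither guard triggers.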
> >


^ permalink raw reply

* Re: [PATCH net v3] virtio-net: fix len check in receive_big()
From: Michael S. Tsirkin @ 2026-06-11  5:56 UTC (permalink / raw)
  To: Xiang Mei
  Cc: jasowang, xuanzhuo, eperezma, andrew+netdev, davem, edumazet,
	kuba, pabeni, netdev, virtualization, linux-kernel,
	minhquangbui99, bestswngs
In-Reply-To: <20260611024616.1408317-1-xmei5@asu.edu>

On Wed, Jun 10, 2026 at 07:46:16PM -0700, Xiang Mei wrote:
> receive_big() bounds the device-announced length by
> (big_packets_num_skbfrags + 1) * PAGE_SIZE.  That is still too loose:
> add_recvbuf_big() sets sg[1] to start at offset
> sizeof(struct padded_vnet_hdr) into the first page, so the chain
> actually carries hdr_len + (PAGE_SIZE - sizeof(padded_vnet_hdr)) +
> big_packets_num_skbfrags * PAGE_SIZE bytes -- 20 bytes less than the
> check allows for the common hdr_len == 12 case.
> 
> A malicious virtio backend can announce a len in that gap.  page_to_skb()
> then walks one frag past the page chain, storing a NULL page->private
> into skb_shinfo()->frags[MAX_SKB_FRAGS], which is both an out-of-bounds
> write past the static frag array and a NULL frag handed up the rx path.
> 
> Bound len by the size add_recvbuf_big() actually advertised.
> 
> Fixes: 0c716703965f ("virtio-net: fix received length check in big packets")
> Reported-by: Weiming Shi <bestswngs@gmail.com>
> Signed-off-by: Xiang Mei <xmei5@asu.edu>
> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>

Thanks for the patch! Something small to improve:

> ---
> v3: revoke 2/2 and add Xuan Zhuo's Reviewed-by tag
> 
>  drivers/net/virtio_net.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f4adcfee7a80..afe73eda1491 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1999,15 +1999,17 @@ static struct sk_buff *receive_big(struct net_device *dev,
>  				   struct virtnet_rq_stats *stats)
>  {
>  	struct page *page = buf;
> +	unsigned long max_len;

Assignment can happen here?

>  	struct sk_buff *skb;
>  
>  	/* Make sure that len does not exceed the size allocated in
>  	 * add_recvbuf_big.
>  	 */
> -	if (unlikely(len > (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE)) {
> +	max_len = vi->hdr_len + (PAGE_SIZE - sizeof(struct padded_vnet_hdr)) +
> +		  vi->big_packets_num_skbfrags * PAGE_SIZE;

Took me a while to figure out what is going on, but I finally
understand:


Reducing
(vi->big_packets_num_skbfrags + 1) * PAGE_SIZE 

(what we allocated)

by sizeof(struct padded_vnet_hdr) - vi->hdr_len


right?

So clearer as:


	unsigned long max_len = (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE -
				sizeof(struct padded_vnet_hdr) + vi->hdr_len;




> +	if (unlikely(len > max_len)) {
>  		pr_debug("%s: rx error: len %u exceeds allocated size %lu\n",
> -			 dev->name, len,
> -			 (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE);
> +			 dev->name, len, max_len);
>  		goto err;
>  	}
>  
> -- 
> 2.43.0
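
To double check the arithmetic, here is a small userspace sketch comparing
the bound as written in the patch with the rearranged form suggested
above. The constants are assumptions for illustration: a 4096-byte page
and a 32-byte padded header, the latter chosen to match the "20 bytes
less" figure quoted for hdr_len == 12. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration; the real ones depend on the kernel build. */
#define TOY_PAGE_SIZE		4096UL
#define TOY_PADDED_HDR_SIZE	32UL	/* assumed sizeof(struct padded_vnet_hdr) */

/* Old bound: everything that was allocated for the buffer. */
static unsigned long old_bound(unsigned long num_skbfrags)
{
	return (num_skbfrags + 1) * TOY_PAGE_SIZE;
}

/* Bound in the form used by the patch. */
static unsigned long max_len_patch(unsigned long hdr_len,
				   unsigned long num_skbfrags)
{
	return hdr_len + (TOY_PAGE_SIZE - TOY_PADDED_HDR_SIZE) +
	       num_skbfrags * TOY_PAGE_SIZE;
}

/* Same bound in the form suggested in review. */
static unsigned long max_len_review(unsigned long hdr_len,
				    unsigned long num_skbfrags)
{
	return (num_skbfrags + 1) * TOY_PAGE_SIZE -
	       TOY_PADDED_HDR_SIZE + hdr_len;
}
```

For hdr_len = 12 and 17 frag pages, both forms give 73708 while the old
bound gives 73728, so the two expressions agree and the gap is the 20
bytes mentioned in the commit message.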


^ permalink raw reply

* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Michael S. Tsirkin @ 2026-06-11  5:43 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Zi Yan, David Hildenbrand (Arm), Andrew Morton, linux-kernel,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <14537566-94d9-eac5-2636-35f925a9d159@huawei.com>

On Thu, Jun 11, 2026 at 11:35:36AM +0800, Miaohe Lin wrote:
> On 2026/6/11 5:18, Michael S. Tsirkin wrote:
> > On Wed, Jun 10, 2026 at 03:24:30PM +0800, Miaohe Lin wrote:
> >> On 2026/6/10 5:00, Michael S. Tsirkin wrote:
> >>> On Tue, Jun 09, 2026 at 04:54:01PM -0400, Zi Yan wrote:
> >>>> On 9 Jun 2026, at 16:34, Michael S. Tsirkin wrote:
> >>>>
> >>>>> On Tue, Jun 09, 2026 at 02:52:47PM -0400, Zi Yan wrote:
> >>>>>> On 9 Jun 2026, at 14:39, Zi Yan wrote:
> >>>>>>
> >>>>>>> On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
> >>>>>>>
> >>>>>>>> On 6/9/26 20:10, Andrew Morton wrote:
> >>>>>>>>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
> >>>>>>>>>> update to page->flags can race with non-atomic flag operations
> >>>>>>>>>> that run under zone->lock in the buddy allocator.
> >>>>>>>>>>
> >>>>>>>>>> In particular, __free_pages_prepare() does:
> >>>>>>>>>>
> >>>>>>>>>>     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >>>>>>>>>>
> >>>>>>>>>> This non-atomic read-modify-write, while correctly excluding
> >>>>>>>>>> __PG_HWPOISON from the mask, can still lose a concurrent
> >>>>>>>>>> TestSetPageHWPoison if the read happens before the poison bit
> >>>>>>>>>> is set and the write happens after.  Will only get worse if/when
> >>>>>>>>>> we add more non-atomic flag operations.
> >>>>>>>>>>
> >>>>>>>>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
> >>>>>>>>>> around ClearPageHWPoison in the retry path.  This
> >>>>>>>>>> serializes with all buddy flag manipulation.  The cost is
> >>>>>>>>>> negligible: one lock/unlock in an extremely rare path
> >>>>>>>>>> (hardware memory errors).
> >>>>>>>>>>
> >>>>>>>>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> >>>>>>>>>> in this file operate on pages already removed from the buddy
> >>>>>>>>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
> >>>>>>>>>> need zone->lock protection.
> >>>>>>>>>
> >>>>>>>>> Sashiko is saying this doesn't do anything "Because
> >>>>>>>>> __free_pages_prepare() executes entirely locklessly".  Did it goof?
> >>>>>>>>>
> >>>>>>>>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
> >>>>>>>>
> >>>>>>>> Battle of the bots: it's right.
> >>>>>>>
> >>>>>>> Yep, __free_pages_prepare() changes the page flag without holding
> >>>>>>> zone->lock.
> >>>>>>
> >>>>>> __free_pages_prepare() works on frozen pages and assumes no one else
> >>>>>> touches the input page. To avoid this race, memory_failure() might
> >>>>>> want to try_get_page() before TestClearPageHWPoison(), but I am not
> >>>>>> sure if that works along with memory failure flow.
> >>>>>>
> >>>>>> Best Regards,
> >>>>>> Yan, Zi
> >>>>>
> >>>>>
> >>>>>
> >>>>> Actually memory failure already plays with this down the road no?
> >>>>>
> >>>>> So maybe it's enough to just SetPageHWPoison afterwards again?
> >>>>>
> >>>>>
> >>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >>>>> index ee42d4361309..4758fea94a96 100644
> >>>>> --- a/mm/memory-failure.c
> >>>>> +++ b/mm/memory-failure.c
> >>>>> @@ -2415,6 +2415,7 @@ int memory_failure(unsigned long pfn, int flags)
> >>>>>  	if (!res) {
> >>>>>  		if (is_free_buddy_page(p)) {
> >>>>>  			if (take_page_off_buddy(p)) {
> >>>>> +				SetPageHWPoison(p);
> >>>>>  				page_ref_inc(p);
> >>>>>  				res = MF_RECOVERED;
> >>>>>  			} else {
> >>>>>
> >>>>>
> >>>>> and maybe in a bunch of other places in there?
> >>>>
> >>>> You mean for fear of losing HWPoison flag in the earlier TestSetPageHWPoison(),
> >>>> just set it again here?
> >>>
> >>> Yea.
> >>>
> >>>> Why not do it after get_hwpoison_page(), since that
> >>>> is the expected page flag?
> >>>
> >>> It's still in the buddy at that point right? I'm worried buddy might
> >>> poke at flags.
> >>
> >> Since __free_pages_prepare() executes entirely locklessly, the only way to ensure
> >> HWPoison flag won't be lost might be only set hwpoison flag iff we can make sure
> >> pages are not on the way to buddy...
> >>
> >> Thanks.
> >> .
> > 
> > 
> > To clarify do you not agree repeating SetPageHWPoison is enough for
> > this? And if not, do you have suggestions on how to fix this race?
> 
> Do you mean repeating SetPageHWPoison on every branch?

Right.

> Is it possible
> to make __free_pages_prepare change page->flags atomically, or is this race
> specific to memory_failure?
> 
> Thanks.
> .


Adding an atomic op on every fast path page allocation is, I am
guessing, going to slow down Linux measurably.

Doing it for the benefit of memory_failure, which is the slowest of
slow paths, seems unpalatable, to me.

Neither am I sure it's the only racy place -
grep for __SetPage and __ClearPage - all these have the same issue, I
suspect.

At the same time, I'm not an mm maintainer. If you disagree, try to
upstream a change converting all non atomics in mm to atomics, and see
what others say.

-- 
MST
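
For anyone skimming the thread, the lost update under discussion can be
replayed deterministically in userspace. This is a single-threaded sketch
that hard-codes the bad interleaving; the bit positions and function names
are hypothetical, and it does not model page refcounts or zone->lock.

```c
#include <assert.h>

/* Hypothetical bit positions, not the kernel's real page flag layout. */
#define TOY_PG_HWPOISON		(1UL << 0)
#define TOY_PG_CHECK_AT_PREP	(1UL << 1)

/*
 * Replays the race: the free path reads flags, the poison bit is set
 * in between, then the free path writes back its stale snapshot.
 */
static unsigned long replay_lost_update(unsigned long flags)
{
	unsigned long snapshot = flags;			/* free path: read */

	flags |= TOY_PG_HWPOISON;			/* memory_failure: set bit */
	flags = snapshot & ~TOY_PG_CHECK_AT_PREP;	/* free path: stale write */
	return flags;
}

/* Same replay, with the poison bit set again afterwards. */
static unsigned long replay_with_reset(unsigned long flags)
{
	flags = replay_lost_update(flags);
	flags |= TOY_PG_HWPOISON;
	return flags;
}
```

The first replay ends with the poison bit cleared even though the mask
never named it, which is the point made in the commit message. The second
shows why setting the bit again was floated. Note it only helps if no
further non-atomic write follows, and the thread does not settle that.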


^ permalink raw reply

* Re: [PATCH v3] hwrng: virtio: clamp device-reported used.len at copy_data()
From: Herbert Xu @ 2026-06-11  4:43 UTC (permalink / raw)
  To: Michael Bommarito
  Cc: Olivia Mackall, linux-crypto, Michael S . Tsirkin, Jason Wang,
	Kees Cook, Christian Borntraeger, virtualization, linux-kernel,
	Dan Williams, Ingo Molnar, H. Peter Anvin, torvalds, alan, tglx
In-Reply-To: <20260531142251.2792061-1-michael.bommarito@gmail.com>

On Sun, May 31, 2026 at 10:22:51AM -0400, Michael Bommarito wrote:
>
> +	size = min_t(unsigned int, size, avail - vi->data_idx);
> +	idx = array_index_nospec(vi->data_idx, sizeof(vi->data));
> +	memcpy(buf, vi->data + idx, size);

I don't see how nospec can help here.  Please enlighten me.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Miaohe Lin @ 2026-06-11  3:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zi Yan, David Hildenbrand (Arm), Andrew Morton, linux-kernel,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <20260610171646-mutt-send-email-mst@kernel.org>

On 2026/6/11 5:18, Michael S. Tsirkin wrote:
> On Wed, Jun 10, 2026 at 03:24:30PM +0800, Miaohe Lin wrote:
>> On 2026/6/10 5:00, Michael S. Tsirkin wrote:
>>> On Tue, Jun 09, 2026 at 04:54:01PM -0400, Zi Yan wrote:
>>>> On 9 Jun 2026, at 16:34, Michael S. Tsirkin wrote:
>>>>
>>>>> On Tue, Jun 09, 2026 at 02:52:47PM -0400, Zi Yan wrote:
>>>>>> On 9 Jun 2026, at 14:39, Zi Yan wrote:
>>>>>>
>>>>>>> On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
>>>>>>>
>>>>>>>> On 6/9/26 20:10, Andrew Morton wrote:
>>>>>>>>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
>>>>>>>>>> update to page->flags can race with non-atomic flag operations
>>>>>>>>>> that run under zone->lock in the buddy allocator.
>>>>>>>>>>
>>>>>>>>>> In particular, __free_pages_prepare() does:
>>>>>>>>>>
>>>>>>>>>>     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>>>>>>>>>
>>>>>>>>>> This non-atomic read-modify-write, while correctly excluding
>>>>>>>>>> __PG_HWPOISON from the mask, can still lose a concurrent
>>>>>>>>>> TestSetPageHWPoison if the read happens before the poison bit
>>>>>>>>>> is set and the write happens after.  Will only get worse if/when
>>>>>>>>>> we add more non-atomic flag operations.
>>>>>>>>>>
>>>>>>>>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
>>>>>>>>>> around ClearPageHWPoison in the retry path.  This
>>>>>>>>>> serializes with all buddy flag manipulation.  The cost is
>>>>>>>>>> negligible: one lock/unlock in an extremely rare path
>>>>>>>>>> (hardware memory errors).
>>>>>>>>>>
>>>>>>>>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
>>>>>>>>>> in this file operate on pages already removed from the buddy
>>>>>>>>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
>>>>>>>>>> need zone->lock protection.
>>>>>>>>>
>>>>>>>>> Sashiko is saying this doesn't do anything "Because
>>>>>>>>> __free_pages_prepare() executes entirely locklessly".  Did it goof?
>>>>>>>>>
>>>>>>>>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
>>>>>>>>
>>>>>>>> Battle of the bots: it's right.
>>>>>>>
>>>>>>> Yep, __free_pages_prepare() changes the page flag without holding
>>>>>>> zone->lock.
>>>>>>
>>>>>> __free_pages_prepare() works on frozen pages and assumes no one else
>>>>>> touches the input page. To avoid this race, memory_failure() might
>>>>>> want to try_get_page() before TestClearPageHWPoison(), but I am not
>>>>>> sure if that works along with memory failure flow.
>>>>>>
>>>>>> Best Regards,
>>>>>> Yan, Zi
>>>>>
>>>>>
>>>>>
>>>>> Actually memory failure already plays with this down the road no?
>>>>>
>>>>> So maybe it's enough to just SetPageHWPoison afterwards again?
>>>>>
>>>>>
>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>>> index ee42d4361309..4758fea94a96 100644
>>>>> --- a/mm/memory-failure.c
>>>>> +++ b/mm/memory-failure.c
>>>>> @@ -2415,6 +2415,7 @@ int memory_failure(unsigned long pfn, int flags)
>>>>>  	if (!res) {
>>>>>  		if (is_free_buddy_page(p)) {
>>>>>  			if (take_page_off_buddy(p)) {
>>>>> +				SetPageHWPoison(p);
>>>>>  				page_ref_inc(p);
>>>>>  				res = MF_RECOVERED;
>>>>>  			} else {
>>>>>
>>>>>
>>>>> and maybe in a bunch of other places in there?
>>>>
>>>> You mean for fear of losing HWPoison flag in the earlier TestSetPageHWPoison(),
>>>> just set it again here?
>>>
>>> Yea.
>>>
>>>> Why not do it after get_hwpoison_page(), since that
>>>> is the expected page flag?
>>>
>>> It's still in the buddy at that point right? I'm worried buddy might
>>> poke at flags.
>>
>> Since __free_pages_prepare() executes entirely locklessly, the only way to ensure
>> HWPoison flag won't be lost might be only set hwpoison flag iff we can make sure
>> pages are not on the way to buddy...
>>
>> Thanks.
>> .
> 
> 
> To clarify do you not agree repeating SetPageHWPoison is enough for
> this? And if not, do you have suggestions on how to fix this race?

Do you mean repeating SetPageHWPoison on every branch? Is it possible
to make __free_pages_prepare() change page->flags atomically, or is this
race specific to memory_failure()?

Thanks.
.


^ permalink raw reply
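
Editor's sketch of the lost update discussed in this thread. It replays, by
hand and in a single thread, the interleaving from the original patch
description: the free path loads the flags, the poison bit is set
concurrently, and the free path then stores its stale value. The DEMO_* bit
values and function names are hypothetical and do not match the kernel's
page flag layout. The second function shows only that setting the bit again
after the stale store restores it in this one interleaving; it does not show
that the approach is sufficient in general, which is the open question above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit positions, for illustration only. */
#define DEMO_PG_HWPOISON   (1UL << 0)
#define DEMO_PG_CHECK_PREP (1UL << 1)

/*
 * Replay one bad interleaving by hand:
 *   1. the free path loads the flags,
 *   2. the poison bit is set concurrently,
 *   3. the free path stores its stale value with the prep bits cleared.
 * Returns true if the poison bit is still set at the end.
 */
bool demo_poison_survives_rmw(unsigned long flags)
{
	unsigned long stale = flags;            /* step 1: plain load */

	flags |= DEMO_PG_HWPOISON;              /* step 2: concurrent set */
	flags = stale & ~DEMO_PG_CHECK_PREP;    /* step 3: stale store */

	return (flags & DEMO_PG_HWPOISON) != 0;
}

/* Same interleaving, with the poison bit set once more afterwards. */
bool demo_poison_survives_with_reset(unsigned long flags)
{
	unsigned long stale = flags;

	flags |= DEMO_PG_HWPOISON;
	flags = stale & ~DEMO_PG_CHECK_PREP;
	flags |= DEMO_PG_HWPOISON;              /* repeated set after the store */

	return (flags & DEMO_PG_HWPOISON) != 0;
}
```

Note that the mask in step 3 does not include the poison bit, yet the bit is
still lost, because the stored value was computed from the load in step 1.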

* [PATCH net-next v2 2/2] virtio-net: xsk: support tx wake up
From: menglong8.dong @ 2026-06-11  2:56 UTC (permalink / raw)
  To: xuanzhuo, eperezma
  Cc: mst, jasowang, andrew+netdev, davem, edumazet, kuba, pabeni,
	minhquangbui99, kerneljasonxing, netdev, virtualization,
	linux-kernel
In-Reply-To: <20260611025644.2431148-1-dongml2@chinatelecom.cn>

From: Menglong Dong <dongml2@chinatelecom.cn>

For now, XDP_RING_NEED_WAKEUP is not properly supported by virtio-net in
the tx path: we call xsk_set_tx_need_wakeup() in virtnet_xsk_xmit(), but
we never call xsk_clear_tx_need_wakeup() anywhere, which means the user
has to call send() for every packet.

Call xsk_set_tx_need_wakeup() after virtnet_xsk_xmit_batch() if sq->vq
is empty, as we can't be woken up by skb_xmit_done() in this case.
Otherwise, clear the wakeup flag.

The flag is also set before virtnet_xsk_xmit_batch() when sq->vq is
empty, to cover the race condition in the tx path.

Fixes: 89f86675cb03 ("virtio_net: xsk: tx: support xmit xsk buffer")
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
---
 drivers/net/virtio_net.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 4b5b3fa62008..86b5c1ca568c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1459,8 +1459,9 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
 	struct virtnet_info *vi = sq->vq->vdev->priv;
 	struct virtnet_sq_free_stats stats = {};
 	struct net_device *dev = vi->dev;
+	int sent, vring_size;
+	bool need_wakeup;
 	u64 kicks = 0;
-	int sent;
 
 	/* Avoid to wakeup napi meanless, so call __free_old_xmit instead of
 	 * free_old_xmit().
@@ -1470,8 +1471,29 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
 	if (stats.xsk)
 		xsk_tx_completed(sq->xsk_pool, stats.xsk);
 
+	vring_size = virtqueue_get_vring_size(sq->vq);
+	need_wakeup = xsk_uses_need_wakeup(pool);
+	/* If the sq->vq is empty, and the tx ring is empty, and the user
+	 * submit an entry to the tx ring after virtnet_xsk_xmit_batch() and
+	 * before xsk_set_tx_need_wakeup(), we will lose the chance to wake
+	 * up the tx napi, so we have to set the need_wakeup flag here.
+	 */
+	if (need_wakeup && vring_size == sq->vq->num_free)
+		xsk_set_tx_need_wakeup(pool);
+
 	sent = virtnet_xsk_xmit_batch(sq, pool, budget, &kicks);
 
+	if (need_wakeup) {
+		if (vring_size == sq->vq->num_free)
+			/* we can't wake up by ourself, and it should be done
+			 * by the user.
+			 */
+			xsk_set_tx_need_wakeup(pool);
+		else
+			/* we can wake up from skb_xmit_done() */
+			xsk_clear_tx_need_wakeup(pool);
+	}
+
 	if (!is_xdp_raw_buffer_queue(vi, sq - vi->sq))
 		check_sq_full_and_disable(vi, vi->dev, sq);
 
@@ -1489,9 +1511,6 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
 	u64_stats_add(&sq->stats.xdp_tx,  sent);
 	u64_stats_update_end(&sq->stats.syncp);
 
-	if (xsk_uses_need_wakeup(pool))
-		xsk_set_tx_need_wakeup(pool);
-
 	return sent;
 }
 
-- 
2.54.0


^ permalink raw reply related
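
Editor's sketch of the tx decision described in the patch above, reduced to a
pure function. struct demo_vq and demo_tx_need_wakeup() are hypothetical
stand-ins for the virtqueue state the patch inspects (vring size versus
num_free); they are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the send virtqueue state. */
struct demo_vq {
	unsigned int vring_size; /* total number of descriptors */
	unsigned int num_free;   /* descriptors currently unused */
};

/*
 * Returns true if the need_wakeup flag should be set after a transmit
 * batch, false if it should be cleared.
 *
 * An entirely free ring means nothing is in flight, so no tx completion
 * will arrive to restart transmission and user space has to wake us up.
 */
bool demo_tx_need_wakeup(const struct demo_vq *vq)
{
	return vq->num_free == vq->vring_size;
}
```

With at least one buffer in flight, the function returns false, which
corresponds to the branch that clears the flag and relies on the completion
callback instead.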

* [PATCH net-next v2 1/2] virtio_net: xsk: fix race in rx wake up
From: menglong8.dong @ 2026-06-11  2:56 UTC (permalink / raw)
  To: xuanzhuo, eperezma
  Cc: mst, jasowang, andrew+netdev, davem, edumazet, kuba, pabeni,
	minhquangbui99, kerneljasonxing, netdev, virtualization,
	linux-kernel
In-Reply-To: <20260611025644.2431148-1-dongml2@chinatelecom.cn>

From: Menglong Dong <dongml2@chinatelecom.cn>

During packet receiving in virtio-net, the rq can be empty, which means
"rq->vq->num_free == virtqueue_get_vring_size(rq->vq)", in
virtnet_add_recvbuf_xsk(), if we are using xsk. Meanwhile, the fill ring
can be empty too, which means we can't allocate anything from
xsk_buff_alloc_batch(). Then, we will set the XDP_RING_NEED_WAKEUP flag.

However, if the user cleans all the data in the rx ring, fills the
"fill ring" and checks the XDP_RING_NEED_WAKEUP flag after
xsk_buff_alloc_batch() and before xsk_set_rx_need_wakeup(), then the rx
napi will never be scheduled: the rx ring is empty, which means we will
never receive a packet to trigger a further recv fill. The rx ring is
empty now, so the user will not check the flag again either.

Fix this by setting the XDP_RING_NEED_WAKEUP flag before
xsk_buff_alloc_batch() if both rq->vq and fill ring are empty.

Meanwhile, set the XDP_RING_NEED_WAKEUP flag if we have any free entry in
rq->vq.

Fixes: e3f8800aa243 ("virtio-net: xsk: Support wakeup on RX side")
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
---
 drivers/net/virtio_net.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f4adcfee7a80..4b5b3fa62008 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1323,16 +1323,27 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
 				   struct xsk_buff_pool *pool, gfp_t gfp)
 {
 	struct xdp_buff **xsk_buffs;
+	bool need_wakeup;
 	dma_addr_t addr;
 	int err = 0;
 	u32 len, i;
 	int num;
 
+	need_wakeup = xsk_uses_need_wakeup(pool);
 	xsk_buffs = rq->xsk_buffs;
 
+	/* If both rq->vq and fill ring are empty, and then the user submit
+	 * all the chunks to the fill ring and check the wake up flag
+	 * after xsk_buff_alloc_batch() and before xsk_set_rx_need_wakeup(),
+	 * we will lose the chance to wake up the rx napi, so we have to
+	 * set the need_wakeup flag here.
+	 */
+	if (need_wakeup && virtqueue_get_vring_size(rq->vq) == rq->vq->num_free)
+		xsk_set_rx_need_wakeup(pool);
+
 	num = xsk_buff_alloc_batch(pool, xsk_buffs, rq->vq->num_free);
 	if (!num) {
-		if (xsk_uses_need_wakeup(pool)) {
+		if (need_wakeup) {
 			xsk_set_rx_need_wakeup(pool);
 			/* Return 0 instead of -ENOMEM so that NAPI is
 			 * descheduled.
@@ -1341,8 +1352,6 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
 		}
 
 		return -ENOMEM;
-	} else {
-		xsk_clear_rx_need_wakeup(pool);
 	}
 
 	len = xsk_pool_get_rx_frame_size(pool) + vi->hdr_len;
@@ -1363,6 +1372,16 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
 			goto err;
 	}
 
+	if (need_wakeup) {
+		if (rq->vq->num_free)
+			/* We have free buffers, so we'd better wake up the
+			 * rx napi as soon as possible.
+			 */
+			xsk_set_rx_need_wakeup(pool);
+		else
+			xsk_clear_rx_need_wakeup(pool);
+	}
+
 	return num;
 
 err:
-- 
2.54.0


^ permalink raw reply related
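
Editor's sketch of the rx ordering problem described in the patch above. It
models the flag and the user's check as plain booleans and replays the two
orderings in a single thread; struct demo_rx_state and the demo_* functions
are hypothetical and only illustrate why setting the flag before the
allocation attempt closes the window.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one flag, and whether user space issued a wakeup. */
struct demo_rx_state {
	bool need_wakeup;
	bool user_kicked;
};

/* User space refills the fill ring, then kicks only if the flag is set. */
static void demo_user_refill_and_check(struct demo_rx_state *s)
{
	s->user_kicked = s->need_wakeup;
}

/* Old order: allocation fails, user checks, and only then the flag is set. */
bool demo_old_order_gets_kick(void)
{
	struct demo_rx_state s = { false, false };

	/* allocation attempt returns 0 buffers here */
	demo_user_refill_and_check(&s);   /* user sees the flag clear */
	s.need_wakeup = true;             /* set too late */

	return s.user_kicked;
}

/* New order: the flag is set before the allocation attempt. */
bool demo_new_order_gets_kick(void)
{
	struct demo_rx_state s = { false, false };

	s.need_wakeup = true;             /* set first */
	/* allocation attempt returns 0 buffers here */
	demo_user_refill_and_check(&s);   /* user sees the flag set */

	return s.user_kicked;
}
```

In the old order the user never kicks and nothing else reschedules rx; in the
new order the same interleaving produces a kick.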

* [PATCH net-next v2 0/2] virtio_net: xsk: rx and tx wake up
From: menglong8.dong @ 2026-06-11  2:56 UTC (permalink / raw)
  To: xuanzhuo, eperezma
  Cc: mst, jasowang, andrew+netdev, davem, edumazet, kuba, pabeni,
	minhquangbui99, kerneljasonxing, netdev, virtualization,
	linux-kernel

From: Menglong Dong <dongml2@chinatelecom.cn>

In the first patch, we fix a race condition for the xsk rx wake up in
virtio-net.

In the second patch, we support xsk tx wake up for virtio-net.

Changes since v1:
- split the rx and tx into two patches
- add the Fixes tag


Menglong Dong (2):
  virtio_net: xsk: fix race in rx wake up
  virtio-net: xsk: support tx wake up

 drivers/net/virtio_net.c | 52 ++++++++++++++++++++++++++++++++++------
 1 file changed, 45 insertions(+), 7 deletions(-)

-- 
2.54.0


^ permalink raw reply

* Re: [PATCH net v2 2/2] virtio-net: harden page_to_skb() big-packet frag loop
From: Xiang Mei @ 2026-06-11  2:47 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, netdev,
	virtualization, linux-kernel, minhquangbui99, bestswngs, mst,
	jasowang, eperezma
In-Reply-To: <1781145615.903036-3-xuanzhuo@linux.alibaba.com>

On Wed, Jun 10, 2026 at 7:41 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> On Wed, 10 Jun 2026 19:24:03 -0700, Xiang Mei <xmei5@asu.edu> wrote:
> > Thanks for the review. I agree with that as I replied at the end of
> > v1. If we obsolete 2/2 but keep 1/2, is it okay to just leave it as
> > is?
>
> You should post a new version.
>
Thanks for the tips! V3 has been sent out.

Xiang
> Thanks.
>
> >
> > Xiang
> >
> > On Wed, Jun 10, 2026 at 7:19 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > >
> > > On Wed, 10 Jun 2026 16:29:36 -0700, Xiang Mei <xmei5@asu.edu> wrote:
> > > > This is a robustness hardening patch. The slow-path frag loop in
> > > > page_to_skb() walks the page chain via page->private until the
> > > > device-reported len is consumed, implicitly trusting that len fits the
> > > > chain. It does not stop when the chain is exhausted (page becomes NULL
> > > > at the tail), nor when nr_frags reaches the end of the static
> > > > skb_shinfo()->frags[MAX_SKB_FRAGS] array.
> > > >
> > > > Both bounds are needed: the chain length is big_packets_num_skbfrags + 1
> > > > pages, which for an MTU-driven configuration can be well below
> > > > MAX_SKB_FRAGS, so neither guard implies the other.
> > > >
> > > > Make the loop self-defending so it no longer relies on the caller having
> > > > validated len: stop once the chain is exhausted, and never index past
> > > > MAX_SKB_FRAGS. No functional change for well-formed input.
> > >
> > > At this point, we are assuming that len represents the correct packet length. If
> > > there is a bug in the validation, it can be fixed, just like in your previous
> > > patch. Indeed, not checking nr_frags is also based on the overall design.
> > > However, I do not recommend adding this kind of enhancement. If we follow
> > > this logic, we would end up adding similar code in many other places, which
> > > doesn't make much sense.
> > >
> > > Thanks.
> > >
> > > >
> > > > Signed-off-by: Xiang Mei <xmei5@asu.edu>
> > > > ---
> > > > v2: robustness patch
> > > >
> > > >  drivers/net/virtio_net.c | 5 ++++-
> > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > index afe73eda1491..518c22fa1b68 100644
> > > > --- a/drivers/net/virtio_net.c
> > > > +++ b/drivers/net/virtio_net.c
> > > > @@ -906,8 +906,11 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> > > >       }
> > > >
> > > >       BUG_ON(offset >= PAGE_SIZE);
> > > > -     while (len) {
> > > > +     while (len && page) {
> > > >               unsigned int frag_size = min((unsigned)PAGE_SIZE - offset, len);
> > > > +
> > > > +             if (unlikely(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS))
> > > > +                     break;
> > > >               skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page, offset,
> > > >                               frag_size, truesize);
> > > >               len -= frag_size;
> > > > --
> > > > 2.43.0
> > > >

^ permalink raw reply

* [PATCH net v3] virtio-net: fix len check in receive_big()
From: Xiang Mei @ 2026-06-11  2:46 UTC (permalink / raw)
  To: mst, jasowang, xuanzhuo, eperezma
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, netdev,
	virtualization, linux-kernel, minhquangbui99, bestswngs,
	Xiang Mei

receive_big() bounds the device-announced length by
(big_packets_num_skbfrags + 1) * PAGE_SIZE.  That is still too loose:
add_recvbuf_big() sets sg[1] to start at offset
sizeof(struct padded_vnet_hdr) into the first page, so the chain
actually carries hdr_len + (PAGE_SIZE - sizeof(padded_vnet_hdr)) +
big_packets_num_skbfrags * PAGE_SIZE bytes -- 20 bytes less than the
check allows for the common hdr_len == 12 case.

A malicious virtio backend can announce a len in that gap.  page_to_skb()
then walks one frag past the page chain, storing a NULL page->private
into skb_shinfo()->frags[MAX_SKB_FRAGS], which is both an out-of-bounds
write past the static frag array and a NULL frag handed up the rx path.

Bound len by the size add_recvbuf_big() actually advertised.

Fixes: 0c716703965f ("virtio-net: fix received length check in big packets")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
v3: drop 2/2 and add Xuan Zhuo's Reviewed-by tag

 drivers/net/virtio_net.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f4adcfee7a80..afe73eda1491 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1999,15 +1999,17 @@ static struct sk_buff *receive_big(struct net_device *dev,
 				   struct virtnet_rq_stats *stats)
 {
 	struct page *page = buf;
+	unsigned long max_len;
 	struct sk_buff *skb;

 	/* Make sure that len does not exceed the size allocated in
 	 * add_recvbuf_big.
 	 */
-	if (unlikely(len > (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE)) {
+	max_len = vi->hdr_len + (PAGE_SIZE - sizeof(struct padded_vnet_hdr)) +
+		  vi->big_packets_num_skbfrags * PAGE_SIZE;
+	if (unlikely(len > max_len)) {
 		pr_debug("%s: rx error: len %u exceeds allocated size %lu\n",
-			 dev->name, len,
-			 (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE);
+			 dev->name, len, max_len);
 		goto err;
 	}

-- 
2.43.0

^ permalink raw reply related
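
Editor's sketch of the arithmetic in the commit message above. It compares
the old bound with the new one for given hdr_len and
big_packets_num_skbfrags values. DEMO_PAGE_SIZE (4096) and
DEMO_PADDED_HDR_SIZE (32) are assumptions; the latter is chosen to be
consistent with the 20-byte gap the message states for hdr_len == 12, and
the real value is whatever sizeof(struct padded_vnet_hdr) is in the tree.

```c
#include <assert.h>

#define DEMO_PAGE_SIZE       4096UL /* assumption: 4 KiB pages */
#define DEMO_PADDED_HDR_SIZE 32UL   /* assumption: sizeof(struct padded_vnet_hdr) */

/* Old check: every page in the chain counted in full. */
unsigned long demo_old_max_len(unsigned long num_skbfrags)
{
	return (num_skbfrags + 1) * DEMO_PAGE_SIZE;
}

/*
 * New check: the header, plus the first page minus the padded header
 * offset, plus the remaining full pages.
 */
unsigned long demo_new_max_len(unsigned long hdr_len,
			       unsigned long num_skbfrags)
{
	return hdr_len + (DEMO_PAGE_SIZE - DEMO_PADDED_HDR_SIZE) +
	       num_skbfrags * DEMO_PAGE_SIZE;
}
```

Under these assumptions the difference between the two bounds is
DEMO_PADDED_HDR_SIZE - hdr_len, independent of the number of frags, which
gives 20 bytes for hdr_len == 12.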

* Re: [PATCH net v2 2/2] virtio-net: harden page_to_skb() big-packet frag loop
From: Xuan Zhuo @ 2026-06-11  2:40 UTC (permalink / raw)
  To: Xiang Mei
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, netdev,
	virtualization, linux-kernel, minhquangbui99, bestswngs, mst,
	jasowang, eperezma
In-Reply-To: <CAPpSM+RApuYf3-ui15E+01fWEzUzfh6mgijQyT_+KjusMVxMfw@mail.gmail.com>

On Wed, 10 Jun 2026 19:24:03 -0700, Xiang Mei <xmei5@asu.edu> wrote:
> Thanks for the review. I agree with that as I replied at the end of
> v1. If we obsolete 2/2 but keep 1/2, is it okay to just leave it as
> is?

You should post a new version.

Thanks.

>
> Xiang
>
> On Wed, Jun 10, 2026 at 7:19 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> >
> > On Wed, 10 Jun 2026 16:29:36 -0700, Xiang Mei <xmei5@asu.edu> wrote:
> > > This is a robustness hardening patch. The slow-path frag loop in
> > > page_to_skb() walks the page chain via page->private until the
> > > device-reported len is consumed, implicitly trusting that len fits the
> > > chain. It does not stop when the chain is exhausted (page becomes NULL
> > > at the tail), nor when nr_frags reaches the end of the static
> > > skb_shinfo()->frags[MAX_SKB_FRAGS] array.
> > >
> > > Both bounds are needed: the chain length is big_packets_num_skbfrags + 1
> > > pages, which for an MTU-driven configuration can be well below
> > > MAX_SKB_FRAGS, so neither guard implies the other.
> > >
> > > Make the loop self-defending so it no longer relies on the caller having
> > > validated len: stop once the chain is exhausted, and never index past
> > > MAX_SKB_FRAGS. No functional change for well-formed input.
> >
> > At this point, we are assuming that len represents the correct packet length. If
> > there is a bug in the validation, it can be fixed, just like in your previous
> > patch. Indeed, not checking nr_frags is also based on the overall design.
> > However, I do not recommend adding this kind of enhancement. If we follow
> > this logic, we would end up adding similar code in many other places, which
> > doesn't make much sense.
> >
> > Thanks.
> >
> > >
> > > Signed-off-by: Xiang Mei <xmei5@asu.edu>
> > > ---
> > > v2: robustness patch
> > >
> > >  drivers/net/virtio_net.c | 5 ++++-
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index afe73eda1491..518c22fa1b68 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -906,8 +906,11 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> > >       }
> > >
> > >       BUG_ON(offset >= PAGE_SIZE);
> > > -     while (len) {
> > > +     while (len && page) {
> > >               unsigned int frag_size = min((unsigned)PAGE_SIZE - offset, len);
> > > +
> > > +             if (unlikely(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS))
> > > +                     break;
> > >               skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page, offset,
> > >                               frag_size, truesize);
> > >               len -= frag_size;
> > > --
> > > 2.43.0
> > >

^ permalink raw reply
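
Editor's sketch of the two guards proposed in the quoted 2/2 patch, which the
reviewer declined in favour of fixing the length validation. It walks a
linked chain of pages and counts the frags that would be added, stopping
when the chain ends or a frag limit is reached. struct demo_page,
DEMO_PAGE_SIZE and DEMO_MAX_FRAGS are hypothetical stand-ins; the kernel
code chains pages through page->private and uses MAX_SKB_FRAGS.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u
#define DEMO_MAX_FRAGS 17u    /* hypothetical frag array size */

/* Hypothetical page chain node. */
struct demo_page {
	struct demo_page *next;
};

/*
 * Count how many frags a bounded walk would add for len bytes starting
 * at offset in the first page. The caller guarantees offset is smaller
 * than DEMO_PAGE_SIZE.
 */
unsigned int demo_count_frags(const struct demo_page *page,
			      unsigned int offset, unsigned int len)
{
	unsigned int nr_frags = 0;

	while (len && page) {                   /* guard 1: chain exhausted */
		unsigned int room = DEMO_PAGE_SIZE - offset;
		unsigned int frag_size = room < len ? room : len;

		if (nr_frags >= DEMO_MAX_FRAGS)     /* guard 2: frag array full */
			break;

		nr_frags++;
		len -= frag_size;
		page = page->next;
		offset = 0;
	}

	return nr_frags;
}
```

For a two-page chain and a len that claims three pages, the walk stops after
two frags instead of following a NULL next pointer.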
