* Re: [PATCH net v4] virtio-net: fix len check in receive_big()
From: Michael S. Tsirkin @ 2026-06-16 4:39 UTC (permalink / raw)
To: Xiang Mei
Cc: jasowang, xuanzhuo, eperezma, andrew+netdev, davem, edumazet,
kuba, pabeni, netdev, virtualization, linux-kernel,
minhquangbui99, bestswngs
In-Reply-To: <20260616042837.2249468-1-xmei5@asu.edu>
On Mon, Jun 15, 2026 at 09:28:37PM -0700, Xiang Mei wrote:
> receive_big() bounds the device-announced length by
> (big_packets_num_skbfrags + 1) * PAGE_SIZE. That is still too loose:
> add_recvbuf_big() sets sg[1] to start at offset
> sizeof(struct padded_vnet_hdr) into the first page, so the chain
> actually carries hdr_len + (PAGE_SIZE - sizeof(padded_vnet_hdr)) +
> big_packets_num_skbfrags * PAGE_SIZE bytes -- 20 bytes less than the
> check allows for the common hdr_len == 12 case.
>
> A malicious virtio backend can announce a len in that gap. page_to_skb()
> then walks one frag past the page chain, storing a NULL page->private
> into skb_shinfo()->frags[MAX_SKB_FRAGS], which is both an out-of-bounds
> write past the static frag array and a NULL frag handed up the rx path.
>
> Bound len by the size add_recvbuf_big() actually advertised.
>
> Fixes: 0c716703965f ("virtio-net: fix received length check in big packets")
> Reported-by: Weiming Shi <bestswngs@gmail.com>
> Signed-off-by: Xiang Mei <xmei5@asu.edu>
> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> v4: use easy to understand math to compute the max_len
> v3: revoke 2/2 and add Xuan Zhuo's Reviewed-by tag
I still feel 2/2 is good defence in depth but it can be
pursued separately.
> v2: add additiona check as 2/2
>
> drivers/net/virtio_net.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f4adcfee7a80..8f4562316aaa 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1999,15 +1999,16 @@ static struct sk_buff *receive_big(struct net_device *dev,
> struct virtnet_rq_stats *stats)
> {
> struct page *page = buf;
> + unsigned long max_len = (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE -
> + sizeof(struct padded_vnet_hdr) + vi->hdr_len;
> struct sk_buff *skb;
>
> /* Make sure that len does not exceed the size allocated in
> * add_recvbuf_big.
> */
> - if (unlikely(len > (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE)) {
> + if (unlikely(len > max_len)) {
> pr_debug("%s: rx error: len %u exceeds allocated size %lu\n",
> - dev->name, len,
> - (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE);
> + dev->name, len, max_len);
> goto err;
> }
>
> --
> 2.43.0
^ permalink raw reply
* Re: Patch "vsock/virtio: fix potential unbounded skb queue" has been added to the 6.6-stable tree
From: Greg KH @ 2026-06-16 4:47 UTC (permalink / raw)
To: Stefano Garzarella
Cc: Sasha Levin, Michael S. Tsirkin, AVKrasnov, edumazet, eperezma,
jasowang, kuba, leonardi, stefanha, virtualization, xuanzhuo,
stable-commits, stable
In-Reply-To: <ag8EvTp29B-Q3nCq@sgarzare-redhat>
On Thu, May 21, 2026 at 03:15:54PM +0200, Stefano Garzarella wrote:
> On Sun, May 17, 2026 at 09:33:06AM -0400, Sasha Levin wrote:
> > > > What's the status of that fix?
> > >
> > > Stefano posted v3 and is working on v4.
> > >
> > > > Should it be reverted elsewhere?
> > >
> > > Donnu. With the change we have no DoS but the socket gets silently
> > > broken. Eric felt given the brokenness is upstream already it's better
> > > to work on a fix on top, not revert.
> >
> > Dropped from the 6.6, 6.12, 6.18, and 7.0 queues. We'll pick up Stefano's
> > follow-up once it lands upstream.
>
> FYI v4 is now merged in the net tree, so I guess they will land upstream
> soon. I CCed stable on both patches:
>
> a4f0b001782b ("vsock/virtio: reset connection on receiving queue overflow")
> c6087c5aaad6 ("vsock/virtio: fix skb overhead accounting to preserve full
> buf_alloc")
>
> Both are related, but the second is the main fix of this patch.
THe second one doesn't apply at all :(
^ permalink raw reply
* Re: [PATCH net v4] virtio-net: fix len check in receive_big()
From: Xiang Mei @ 2026-06-16 5:20 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: jasowang, xuanzhuo, eperezma, andrew+netdev, davem, edumazet,
kuba, pabeni, netdev, virtualization, linux-kernel,
minhquangbui99, bestswngs
In-Reply-To: <20260616003903-mutt-send-email-mst@kernel.org>
On Mon, Jun 15, 2026 at 9:40 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Jun 15, 2026 at 09:28:37PM -0700, Xiang Mei wrote:
> > receive_big() bounds the device-announced length by
> > (big_packets_num_skbfrags + 1) * PAGE_SIZE. That is still too loose:
> > add_recvbuf_big() sets sg[1] to start at offset
> > sizeof(struct padded_vnet_hdr) into the first page, so the chain
> > actually carries hdr_len + (PAGE_SIZE - sizeof(padded_vnet_hdr)) +
> > big_packets_num_skbfrags * PAGE_SIZE bytes -- 20 bytes less than the
> > check allows for the common hdr_len == 12 case.
> >
> > A malicious virtio backend can announce a len in that gap. page_to_skb()
> > then walks one frag past the page chain, storing a NULL page->private
> > into skb_shinfo()->frags[MAX_SKB_FRAGS], which is both an out-of-bounds
> > write past the static frag array and a NULL frag handed up the rx path.
> >
> > Bound len by the size add_recvbuf_big() actually advertised.
> >
> > Fixes: 0c716703965f ("virtio-net: fix received length check in big packets")
> > Reported-by: Weiming Shi <bestswngs@gmail.com>
> > Signed-off-by: Xiang Mei <xmei5@asu.edu>
> > Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
>
> > ---
> > v4: use easy to understand math to compute the max_len
> > v3: revoke 2/2 and add Xuan Zhuo's Reviewed-by tag
>
> I still feel 2/2 is good defence in depth but it can be
> pursued separately.
Thanks, Michael. I'll leave 2/2 out of this series.
Appreciate the review.
Xiang
>
> > v2: add additiona check as 2/2
> >
> > drivers/net/virtio_net.c | 7 ++++---
> > 1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index f4adcfee7a80..8f4562316aaa 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -1999,15 +1999,16 @@ static struct sk_buff *receive_big(struct net_device *dev,
> > struct virtnet_rq_stats *stats)
> > {
> > struct page *page = buf;
> > + unsigned long max_len = (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE -
> > + sizeof(struct padded_vnet_hdr) + vi->hdr_len;
> > struct sk_buff *skb;
> >
> > /* Make sure that len does not exceed the size allocated in
> > * add_recvbuf_big.
> > */
> > - if (unlikely(len > (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE)) {
> > + if (unlikely(len > max_len)) {
> > pr_debug("%s: rx error: len %u exceeds allocated size %lu\n",
> > - dev->name, len,
> > - (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE);
> > + dev->name, len, max_len);
> > goto err;
> > }
> >
> > --
> > 2.43.0
>
^ permalink raw reply
* Re:Re:Re: [PATCH] virtio_net: disable cb when napi_schedule_prep fails during busy-poll
From: Lange Tang @ 2026-06-16 6:07 UTC (permalink / raw)
To: xuanzhuo@linux.alibaba.com, mst@redhat.com
Cc: edumazet@google.com, Jakub Kicinski,
virtualization@lists.linux.dev, Tang Longjun, jasowang@redhat.com
In-Reply-To: <1781580432.712892-1-xuanzhuo@linux.alibaba.com>
At 2026-06-16 11:27:12, "Xuan Zhuo" <xuanzhuo@linux.alibaba.com> wrote:
>On Tue, 16 Jun 2026 11:00:29 +0800 (CST), Lange Tang <lange_tang@163.com> wrote:
>> At 2026-06-15 18:01:40, "Xuan Zhuo" <xuanzhuo@linux.alibaba.com> wrote:
>> >On Mon, 15 Jun 2026 17:45:50 +0800, Longjun Tang <lange_tang@163.com> wrote:
>> >> From: Longjun Tang <tanglongjun@kylinos.cn>
>> >>
>> >> When busy-poll is active, napi_schedule_prep() returns false in
>> >> skb_recv_done(), so virtqueue_disable_cb() is skipped. The device
>> >> may keep firing irqs until the next poll round reaches
>> >> virtqueue_napi_complete(). If cb is enabled under busy-poll case,
>> >> it will lead to a large number of spurious interrupts. Explicitly
>> >> disable callbacks in this case to prevent spurious interrupts.
>> >>
>> >> Signed-off-by: Longjun Tang <tanglongjun@kylinos.cn>
>> >> ---
>> >> drivers/net/virtio_net.c | 2 ++
>> >> 1 file changed, 2 insertions(+)
>> >>
>> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> >> index f4adcfee7a80..6d675fddc59b 100644
>> >> --- a/drivers/net/virtio_net.c
>> >> +++ b/drivers/net/virtio_net.c
>> >> @@ -728,6 +728,8 @@ static void virtqueue_napi_schedule(struct napi_struct *napi,
>> >> if (napi_schedule_prep(napi)) {
>> >> virtqueue_disable_cb(vq);
>> >> __napi_schedule(napi);
>> >> + } else if (test_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state)) {
>> >> + virtqueue_disable_cb(vq);
>> >
>> >I see, but we should avoid checking NAPI_STATE_IN_BUSY_POLL directly in the
>> >drivers. The NIC driver should remain agnostic to busy polling. I think we need
>> >a better way, maybe we should rewrite virtqueue_napi_schedule instead.
>>
>> How about rewrite it like this?
>> static void virtqueue_napi_schedule(struct napi_struct *napi,
>> struct virtqueue *vq)
>> {
>> virtqueue_disable_cb(vq);
>> if (napi_schedule_prep(napi))
>> __napi_schedule(napi);
>> }
>> Any comments are welcome.
>
>
>Another CPU could be running NAPI and has just enabled the callbacks (cb).
>Meanwhile, this side unconditionally disables the cb. Since NAPI on the other
>CPU hasn't exited yet, the subsequent prep on this side fails, leaving no one to
>re-enable the cb.
>
>Thanks.
Regarding the case you described, when NAPI on another CPU exits, the virtqueue_napi_complete func
will be executed to re-enable cb. and if there is still unconsumed data in the virtqueue, virtqueue_napi_schedule
will be called again to schedule NAPI.
In summary, I think that the disable_cb and __napi_schedule within the virtqueue_napi_schedule func do not need to be bound together.
Any comments are welcome. Thinks.
>
>
>> >
>> >
>> >> }
>> >> }
>> >>
>> >> --
>> >> 2.25.1
>> >>
>>
^ permalink raw reply
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Miaohe Lin @ 2026-06-16 6:32 UTC (permalink / raw)
To: David Hildenbrand (Arm), Michael S. Tsirkin
Cc: Zi Yan, Andrew Morton, linux-kernel, Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <be1b20ed-5b75-46b8-b7be-3c9b029f016b@kernel.org>
On 2026/6/15 18:54, David Hildenbrand (Arm) wrote:
> On 6/15/26 05:29, Miaohe Lin wrote:
>> On 2026/6/11 21:20, David Hildenbrand (Arm) wrote:
>>> On 6/11/26 09:36, Miaohe Lin wrote:
>>>>
>>>> Agree, it's not worth to do so.
>>>>
>>>>
>>>> Since memory_failure might be the only place, this change would be unacceptable.
>>>> We should come up with a better solution. Maybe we can try repeating SetPageHWPoison
>>>> and ClearPageHWPoison at a first attempt though it looks somewhat weird to me and makes
>>>> code more complicated.
>>>
>>> And I am fairly sure we could still have some remaining races ... it's shaky.
>>
>> I have to agree it's shaky.
>
> Right, just let writing task reschedule after reading the flags,
> but before writing the flags.
>
>> Any suggestion for next step?
>
> We have various code that assumes that no concurrent writes are
> possible, and consequently, we use no atomics.
>
> __free_pages_prepare() is just one user.
>
> Then we have __folio_set_locked(), __folio_clear_active()
> and __folio_clear_unevictable().
>
> But also __folio_mark_uptodate(), which is called rather frequently.
>
> page_cpupid_reset_last() is also a thing, but it mostly falls
> under __free_pages_prepare() handling.
>
> ... and __split_folio_to_order() also messes with flags directly without atomics.
>
>
> Many of these are only possible for frozen pages (refcount == 0). I think
> only __folio_set_locked() and __folio_mark_uptodate() are called on
> non-frozen pages, when there is the expectation that nobody will concurrently
> use atomics that would be bad (e.g., don't trylock if not an lru page).
>
Thanks David! This information is really helpful!
>
> We don't want to use atomics at these places just to please memory failure code.
Bad news. We have more places racing with memory failure code.
>
> Would it be sufficient to know in memory-failure code that concurrent
> handling succeeded?
I think so, that would be useful.
>
>
> Assume that we enlighten all non-atomics to grab the rcu read lock, such as
These non-atomics are defined and used because they want to avoid atomic ops overhead?
So I'm afraid using rcu read lock in these places would lead to unexpected overhead.
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 7223f6f4e2b4..3c3852b60bbd 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -803,10 +803,30 @@ static inline bool PageUptodate(const struct page *page)
> return folio_test_uptodate(page_folio(page));
> }
>
> +#ifdef CONFIG_MEMORY_FAILURE
> +static inline void page_flags_modify_nonatomic_begin(void)
> +{
> + rcu_read_lock();
> +}
> +static inline void page_flags_modify_nonatomic_end(void)
> +{
> + rcu_read_unlock();
> +}
> +#else
> +static inline void page_flags_modify_nonatomic_begin(void)
> +{
> +}
> +static inline void page_flags_modify_nonatomic_end(void)
> +{
> +}
> +#endif
> +
> static __always_inline void __folio_mark_uptodate(struct folio *folio)
> {
> smp_wmb();
> + page_flags_modify_nonatomic_begin();
> __set_bit(PG_uptodate, folio_flags(folio, 0));
> + page_flags_modify_nonatomic_end();
> }
>
>
> And then we have some retry logic such as:
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 51508a55c405..1123c40aaf43 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -162,6 +162,62 @@ static struct rb_root_cached pfn_space_itree = RB_ROOT_CACHED;
>
> static DEFINE_MUTEX(pfn_space_lock);
>
> +static bool page_test_set_hwpoison(struct page *page)
> +{
> + lockdep_assert_held(&mf_mutex);
> +
> + while (true) {
> + /* Already set -> not our problem. */
> + if (TestSetPageHWPoison(page))
> + return true;
> + /* Make sure concurrent non-atomic writers completed. */
> + synchronize_rcu();
> + /* Setting the flag was sticky. */
> + if (PageHWPoison(page))
> + return false;
> + }
> +}
> +
> +static bool page_test_clear_hwpoison(struct page *page)
> +{
> + lockdep_assert_held(&mf_mutex);
> +
> + while (true) {
> + /* Already clear -> not our problem. */
> + if (!TestClearPageHWPoison(page))
> + return false;
> + /* Make sure concurrent non-atomic writers completed. */
> + synchronize_rcu();
> + /* Clearing the flag was sticky. */
> + if (!PageHWPoison(page))
> + return true;
> + }
> +}
> +
> +static void page_set_hwpoison(struct page *page)
> +{
> + lockdep_assert_held(&mf_mutex);
> +
> + while (!PageHWPoison(page)) {
> + SetPageHWPoison(page);
> +
> + /* Make sure concurrent non-atomic writers completed. */
> + synchronize_rcu();
> + }
> +}
> +
> +static void page_clear_hwpoison(struct page *page)
> +{
> + lockdep_assert_held(&mf_mutex);
> +
> + while (PageHWPoison(page)) {
> + ClearPageHWPoison(page);
> +
> + /* Make sure concurrent non-atomic writers completed. */
> + synchronize_rcu();
> + }
> +}
> +
> /*
> * Return values:
> * 1: the page is dissolved (if needed) and taken off from buddy,
> @@ -199,7 +255,7 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
> return false;
> }
>
> - SetPageHWPoison(page);
> + page_set_hwpoison(page);
> if (release)
> put_page(page);
> page_ref_inc(page);
> @@ -1744,7 +1800,7 @@ static int mf_generic_kill_procs(unsigned long long pfn, int flags,
> * Use this flag as an indication that the dax page has been
> * remapped UC to prevent speculative consumption of poison.
> */
> - SetPageHWPoison(&folio->page);
> + page_set_hwpoison(&folio->page);
>
> /*
> * Unlike System-RAM there is no possibility to swap in a
> @@ -1789,7 +1845,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> goto unlock;
>
> if (!pre_remove)
> - SetPageHWPoison(page);
> + page_set_hwpoison(page);
>
> /*
> * The pre_remove case is revoking access, the memory is still
> @@ -1866,7 +1922,7 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
> head = llist_del_all(raw_hwp_list_head(folio));
> llist_for_each_entry_safe(p, next, head, node) {
> if (move_flag)
> - SetPageHWPoison(p->page);
> + page_set_hwpoison(p->page);
> else
> num_poisoned_pages_sub(page_to_pfn(p->page), 1);
> kfree(p);
> @@ -2380,7 +2436,7 @@ int memory_failure(unsigned long pfn, int flags)
> if (res != -ENOENT)
> goto unlock_mutex;
>
> - if (TestSetPageHWPoison(p)) {
> + if (page_test_set_hwpoison(p)) {
> res = -EHWPOISON;
> if (flags & MF_ACTION_REQUIRED)
> res = kill_accessing_process(current, pfn, flags);
> @@ -2410,7 +2466,7 @@ int memory_failure(unsigned long pfn, int flags)
> } else {
> /* We lost the race, try again */
> if (retry) {
> - ClearPageHWPoison(p);
> + page_clear_hwpoison(p);
> retry = false;
> goto try_again;
> }
> @@ -2431,7 +2487,7 @@ int memory_failure(unsigned long pfn, int flags)
> /* filter pages that are protected from hwpoison test by users */
> folio_lock(folio);
> if (hwpoison_filter(p)) {
> - ClearPageHWPoison(p);
> + page_clear_hwpoison(p);
> folio_unlock(folio);
> folio_put(folio);
> res = -EOPNOTSUPP;
> @@ -2751,7 +2807,7 @@ int unpoison_memory(unsigned long pfn)
> }
>
> folio_put(folio);
> - if (TestClearPageHWPoison(p)) {
> + if (page_test_clear_hwpoison(p)) {
> folio_put(folio);
> ret = 0;
> }
>
>
> Maybe that would work. There would still be issues to solve
>
> (a) We don't hold the mf_mutex on all call paths, but we really need it so a
> page_test_set_hwpoison() cannot race in weird ways with the other primitives I think.
>
> (b) There are some leftover SetPageHWPoison etc. instances. The ones in
> arch/x86/kernel/cpu/mce/core.c likely cannot grab the mutex, but maybe they are
> corner cases either way and we can document the situation.
>
>
> Further, while I assume the synchronize_rcu() on the MCE path should be fine
> (who cares about performance there?), I don't know if the added RCU read lock
> on some paths could be noticable.
>
> So one idea worth discussing, but I am sure there are more problems.
I think this is a good idea, although there are some remaining issues.
But such race should be really rare, is it worth all this effort? Could we
simply aim to resolve, not to be flawless? I.e. could we simply check
and re-set the hwpoison flag at the end of memory_failure handling to
simply avoid losing hwpoison flag as a best-effort attempt? Would it be
acceptable?
Thanks.
.
^ permalink raw reply
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: David Hildenbrand (Arm) @ 2026-06-16 6:56 UTC (permalink / raw)
To: Miaohe Lin, Michael S. Tsirkin
Cc: Zi Yan, Andrew Morton, linux-kernel, Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <d7e97bec-72bf-7d56-3811-6d9c41fefb35@huawei.com>
>>
>>
>> Assume that we enlighten all non-atomics to grab the rcu read lock, such as
>
> These non-atomics are defined and used because they want to avoid atomic ops overhead?
> So I'm afraid using rcu read lock in these places would lead to unexpected overhead.
It should be cheaper than atomics IIUC. Further, I assume that some pages could
batch over multiple such operations (esp. page freeing path when we process tail
pages).
With !CONFIG_PREEMPT_RCU it's simply preempt_disable()/preempt_enable(), which
is either a NOP or just adjusting the preempt counter of the current thread. Cheap.
With CONFIG_PREEMPT_RCU we mostly increment current->rcu_read_lock_nesting. But
there might be a function call involved (did not look into the details). So that
variant should be slightly more expensive.
We'd have to measure what an addition rcu read lock would cost in there. that
should be fairly easy to benchmark.
>>
>> Maybe that would work. There would still be issues to solve
>>
>> (a) We don't hold the mf_mutex on all call paths, but we really need it so a
>> page_test_set_hwpoison() cannot race in weird ways with the other primitives I think.
>>
>> (b) There are some leftover SetPageHWPoison etc. instances. The ones in
>> arch/x86/kernel/cpu/mce/core.c likely cannot grab the mutex, but maybe they are
>> corner cases either way and we can document the situation.
>>
>>
>> Further, while I assume the synchronize_rcu() on the MCE path should be fine
>> (who cares about performance there?), I don't know if the added RCU read lock
>> on some paths could be noticable.
>>
>> So one idea worth discussing, but I am sure there are more problems.
>
> I think this is a good idea, although there are some remaining issues.
> But such race should be really rare, is it worth all this effort? Could we
> simply aim to resolve, not to be flawless? I.e. could we simply check
> and re-set the hwpoison flag at the end of memory_failure handling to
> simply avoid losing hwpoison flag as a best-effort attempt? Would it be
> acceptable?
Hacky. Sufficient for the hypervisor to suspend the nonatomic-setting CPU at the
wrong time to still trigger the same behavior.
I think, either we fix it properly, or we redesign hwpoison handling to deal
with setting/clearing becoming stale at some random point in the future.
--
Cheers,
David
^ permalink raw reply
* Re:Re:Re: [PATCH] virtio_net: disable cb when napi_schedule_prep fails during busy-poll
From: Xuan Zhuo @ 2026-06-16 6:49 UTC (permalink / raw)
To: Lange Tang
Cc: edumazet@google.com, Jakub Kicinski,
virtualization@lists.linux.dev, Tang Longjun, jasowang@redhat.com,
mst@redhat.com
In-Reply-To: <4118686.4d22.19ecf0ad78e.Coremail.lange_tang@163.com>
On Tue, 16 Jun 2026 14:07:34 +0800 (CST), Lange Tang <lange_tang@163.com> wrote:
> At 2026-06-16 11:27:12, "Xuan Zhuo" <xuanzhuo@linux.alibaba.com> wrote:
> >On Tue, 16 Jun 2026 11:00:29 +0800 (CST), Lange Tang <lange_tang@163.com> wrote:
> >> At 2026-06-15 18:01:40, "Xuan Zhuo" <xuanzhuo@linux.alibaba.com> wrote:
> >> >On Mon, 15 Jun 2026 17:45:50 +0800, Longjun Tang <lange_tang@163.com> wrote:
> >> >> From: Longjun Tang <tanglongjun@kylinos.cn>
> >> >>
> >> >> When busy-poll is active, napi_schedule_prep() returns false in
> >> >> skb_recv_done(), so virtqueue_disable_cb() is skipped. The device
> >> >> may keep firing irqs until the next poll round reaches
> >> >> virtqueue_napi_complete(). If cb is enabled under busy-poll case,
> >> >> it will lead to a large number of spurious interrupts. Explicitly
> >> >> disable callbacks in this case to prevent spurious interrupts.
> >> >>
> >> >> Signed-off-by: Longjun Tang <tanglongjun@kylinos.cn>
> >> >> ---
> >> >> drivers/net/virtio_net.c | 2 ++
> >> >> 1 file changed, 2 insertions(+)
> >> >>
> >> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> >> index f4adcfee7a80..6d675fddc59b 100644
> >> >> --- a/drivers/net/virtio_net.c
> >> >> +++ b/drivers/net/virtio_net.c
> >> >> @@ -728,6 +728,8 @@ static void virtqueue_napi_schedule(struct napi_struct *napi,
> >> >> if (napi_schedule_prep(napi)) {
> >> >> virtqueue_disable_cb(vq);
> >> >> __napi_schedule(napi);
> >> >> + } else if (test_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state)) {
> >> >> + virtqueue_disable_cb(vq);
> >> >
> >> >I see, but we should avoid checking NAPI_STATE_IN_BUSY_POLL directly in the
> >> >drivers. The NIC driver should remain agnostic to busy polling. I think we need
> >> >a better way, maybe we should rewrite virtqueue_napi_schedule instead.
> >>
> >> How about rewrite it like this?
> >> static void virtqueue_napi_schedule(struct napi_struct *napi,
> >> struct virtqueue *vq)
> >> {
> >> virtqueue_disable_cb(vq);
> >> if (napi_schedule_prep(napi))
> >> __napi_schedule(napi);
> >> }
> >> Any comments are welcome.
> >
> >
> >Another CPU could be running NAPI and has just enabled the callbacks (cb).
> >Meanwhile, this side unconditionally disables the cb. Since NAPI on the other
> >CPU hasn't exited yet, the subsequent prep on this side fails, leaving no one to
> >re-enable the cb.
> >
> >Thanks.
>
> Regarding the case you described, when NAPI on another CPU exits, the virtqueue_napi_complete func
> will be executed to re-enable cb. and if there is still unconsumed data in the virtqueue, virtqueue_napi_schedule
> will be called again to schedule NAPI.
>
> In summary, I think that the disable_cb and __napi_schedule within the virtqueue_napi_schedule func do not need to be bound together.
>
> Any comments are welcome. Thinks.
<Your code>
static void virtqueue_napi_schedule(struct napi_struct *napi,
struct virtqueue *vq)
{
|static bool virtqueue_napi_complete(struct napi_struct *napi,
| struct virtqueue *vq, int processed)
|{
| int opaque;
|
| opaque = virtqueue_enable_cb_prepare(vq);
|
virtqueue_disable_cb(vq); |
if (napi_schedule_prep(napi)) |
__napi_schedule(napi); |
| if (napi_complete_done(napi, processed)) {
| if (unlikely(virtqueue_poll(vq, opaque)))
| virtqueue_napi_schedule(napi, vq);
| else
| return true; // return directly
| } else {
| virtqueue_disable_cb(vq);
| }
|
| return false;
|}
}
1. new packets (notified by irq) are consumed by napi before virtqueue_napi_complete
2. poll is not called by irq, maybe xsk wake up. So irq is not disabled.
Thanks.
>
> >
> >
> >> >
> >> >
> >> >> }
> >> >> }
> >> >>
> >> >> --
> >> >> 2.25.1
> >> >>
> >>
>
^ permalink raw reply
* Re: Patch "vsock/virtio: fix potential unbounded skb queue" has been added to the 6.6-stable tree
From: Stefano Garzarella @ 2026-06-16 7:52 UTC (permalink / raw)
To: Greg KH
Cc: Sasha Levin, Michael S. Tsirkin, AVKrasnov, edumazet, eperezma,
jasowang, kuba, leonardi, stefanha, virtualization, xuanzhuo,
stable-commits, stable
In-Reply-To: <2026061624-harbor-capture-a5bf@gregkh>
On Tue, Jun 16, 2026 at 10:17:31AM +0530, Greg KH wrote:
>On Thu, May 21, 2026 at 03:15:54PM +0200, Stefano Garzarella wrote:
>> On Sun, May 17, 2026 at 09:33:06AM -0400, Sasha Levin wrote:
>> > > > What's the status of that fix?
>> > >
>> > > Stefano posted v3 and is working on v4.
>> > >
>> > > > Should it be reverted elsewhere?
>> > >
>> > > Donnu. With the change we have no DoS but the socket gets silently
>> > > broken. Eric felt given the brokenness is upstream already it's better
>> > > to work on a fix on top, not revert.
>> >
>> > Dropped from the 6.6, 6.12, 6.18, and 7.0 queues. We'll pick up Stefano's
>> > follow-up once it lands upstream.
>>
>> FYI v4 is now merged in the net tree, so I guess they will land upstream
>> soon. I CCed stable on both patches:
>>
>> a4f0b001782b ("vsock/virtio: reset connection on receiving queue overflow")
>> c6087c5aaad6 ("vsock/virtio: fix skb overhead accounting to preserve full
>> buf_alloc")
>>
>> Both are related, but the second is the main fix of this patch.
>
>THe second one doesn't apply at all :(
>
The second one is the fix of the patch originally added to stable queue
by this thread, so should be applied on top of it (commit 059b7dbd20a6
("vsock/virtio: fix potential unbounded skb queue")).
I'm working on improving memory management, but for now I think it makes
sense to backport all three to the stable branches.
So, in summary:
059b7dbd20a6 ("vsock/virtio: fix potential unbounded skb queue")
a4f0b001782b ("vsock/virtio: reset connection on receiving queue overflow")
c6087c5aaad6 ("vsock/virtio: fix skb overhead accounting to preserve full buf_alloc")
Thanks,
Stefano
^ permalink raw reply
* Re: [PATCH v1] s390/virtio_ccw: Also suppress -EINVAL on device detach
From: Peter Oberparleiter @ 2026-06-16 7:58 UTC (permalink / raw)
To: Halil Pasic, William Bezenah, vneethv
Cc: Cornelia Huck, linux-s390, farman, hca, gor, agordeev,
borntraeger, svens, mjrosato, virtualization, kvm, linux-kernel
In-Reply-To: <20260615234246.0ec5d947.pasic@linux.ibm.com>
On 15.06.2026 23:42, Halil Pasic wrote:
> On Mon, 15 Jun 2026 16:01:55 -0400
> William Bezenah <wbezenah@linux.ibm.com> wrote:
>
>> On 6/15/2026 10:58 AM, Cornelia Huck wrote:
>>> On Mon, Jun 15 2026, Halil Pasic <pasic@linux.ibm.com> wrote:
>>>
>>>> On Fri, 12 Jun 2026 17:54:07 +0200
>>>> William Bezenah <wbezenah@linux.ibm.com> wrote:
>>>>
>>>>> Since commit 8c58a229688c ("s390/cio: Do not unregister the
>>>>> subchannel based on DNV"), subchannel behavior following a device
>>>>> detach has been updated and results in -EINVAL being propagated
>>>>> rather than -ENODEV, originating from ccw_device_start_timeout_key()
>>>>> in cio/device_ops. In the end, the virtio driver has no ability to
>>>>> react to the difference between device and subchannel states here,
>>>>> and during detach, both -ENODEV and -EINVAL indicate the device
>>>>> cannot be used and should not be treated as errors requiring
>>>>> attention. Update error handling in virtio_ccw_del_vq() and
>>>>> virtio_ccw_drop_indicator() to suppress -EINVAL in addition to
>>>>> -ENODEV.
>>>> Hi William!
>>>>
>>>> Are you saying that ccw_device_start() started returning -EINVAL
>>>> since 8c58a229688c ("s390/cio: Do not unregister the subchannel based on
>>>> DNV")? Or did I somehow read the paragraph wrong?
>>>>
>>>> The funcition ccw_device_start is documented to return:
>>>> * Returns:
>>>> * %0, if the operation was successful;
>>>> * -%EBUSY, if the device is busy, or status pending;
>>>> * -%EACCES, if no path specified in @lpm is operational;
>>>> * -%ENODEV, if the device is not operational.
>>>> and the commit message does not say a thing about introducing -EINVAL to
>>>> the mix.
>>> The function may return -EINVAL for non-enabled subchannels
>>> (i.e. pmcw.ena == 0), maybe we get an all-zeroes schib with dnv == 0?
>>> I'd expect it not to be enabled in that case anyway.
>>
>> Yep, that's at least how I've come to understand what changed. The
>> function ccw_device_start_timeout_key() has always returned -EINVAL
>> for non-enabled subchannels (pmcw.ena == 0), though it's not
>> documented in the header.
>
> Wasn't his -EINVAL actually introduced by commit:
> 823d494ac111 ("[S390] pm: ccw bus power management callbacks")?
In the context of virtio-ccw added in 2012, an EINVAL return code
introduced in 2009 might be considered "always" :)
>> What changed with commit 8c58a229688c is that cio_update_schib() now
>> updates the schib even when DNV=0, rather than returning early as it
>> did previously. Somehow this update results in pmcw.ena == 0 in
>> ccw_device_start_timeout_key(). Previously, it saw pmcw.ena == 1 and
>> moved to the condition (cdev->private->state == DEV_STATE_NOT_OPER)
>> where it returned -ENODEV.
>
> Sounds fishy to me. As far as I understand the DNV takes precedence over
> all other pieces of PMCW.
And you're right about that! The Principles of Operation states (p. 15-4
in SA22-7832-14 [1]) that the contents of all other fields in the PMCW
are unpredictable when DNV is 0, therefore 8c58a229688c is in error.
I'll work with Vineeth to determine how to fix this issue, potentially
via manually clearing some relevant SCHIB fields instead of copying the
unpredictable results of the STSCH instruction.
>> So the commit didn't introduce -EINVAL as a new return value, rather,
>> it changed the subchannel lifecycle such that existing paths now
>> propagate -EINVAL rather than -ENODEV during the device detach
>> scenario.
>
> I'm not convinced returning -EINVAL in the given situation is the
> right thing to do. Peter, would you mind to chime in?
I tend to agree that an attempt to start I/O for a subchannel that has
DNV 0 should result in ENODEV rather than EINVAL, though the latter is
still valid when a driver tries to start I/O on a subchannel that is not
enabled for I/O.
We'll make sure to design the fix for 8c58a229688c in away that ENODEV
will be returned when DNV is 0. Assuming that this is the only situation
where virtio-ccw's ccw_io_helper() receives -EINVAL from
ccw_device_start__timeout_key() during detach, the subject patch should
no longer be necessary.
[1] https://www.ibm.com/docs/en/module_1678991624569/pdf/SA22-7832-14.pdf
--
Peter Oberparleiter
Linux on IBM Z Development - IBM Germany R&D
^ permalink raw reply
* Re: Patch "vsock/virtio: fix potential unbounded skb queue" has been added to the 6.6-stable tree
From: Greg KH @ 2026-06-16 7:59 UTC (permalink / raw)
To: Stefano Garzarella
Cc: Sasha Levin, Michael S. Tsirkin, AVKrasnov, edumazet, eperezma,
jasowang, kuba, leonardi, stefanha, virtualization, xuanzhuo,
stable-commits, stable
In-Reply-To: <ajD_FBEak8hKNdIK@sgarzare-redhat>
On Tue, Jun 16, 2026 at 09:52:32AM +0200, Stefano Garzarella wrote:
> On Tue, Jun 16, 2026 at 10:17:31AM +0530, Greg KH wrote:
> > On Thu, May 21, 2026 at 03:15:54PM +0200, Stefano Garzarella wrote:
> > > On Sun, May 17, 2026 at 09:33:06AM -0400, Sasha Levin wrote:
> > > > > > What's the status of that fix?
> > > > >
> > > > > Stefano posted v3 and is working on v4.
> > > > >
> > > > > > Should it be reverted elsewhere?
> > > > >
> > > > > Donnu. With the change we have no DoS but the socket gets silently
> > > > > broken. Eric felt given the brokenness is upstream already it's better
> > > > > to work on a fix on top, not revert.
> > > >
> > > > Dropped from the 6.6, 6.12, 6.18, and 7.0 queues. We'll pick up Stefano's
> > > > follow-up once it lands upstream.
> > >
> > > FYI v4 is now merged in the net tree, so I guess they will land upstream
> > > soon. I CCed stable on both patches:
> > >
> > > a4f0b001782b ("vsock/virtio: reset connection on receiving queue overflow")
> > > c6087c5aaad6 ("vsock/virtio: fix skb overhead accounting to preserve full
> > > buf_alloc")
> > >
> > > Both are related, but the second is the main fix of this patch.
> >
> > THe second one doesn't apply at all :(
> >
>
> The second one is the fix of the patch originally added to stable queue by
> this thread, so should be applied on top of it (commit 059b7dbd20a6
> ("vsock/virtio: fix potential unbounded skb queue")).
>
> I'm working on improving memory management, but for now I think it makes
> sense to backport all three to the stable branches.
>
> So, in summary:
> 059b7dbd20a6 ("vsock/virtio: fix potential unbounded skb queue")
> a4f0b001782b ("vsock/virtio: reset connection on receiving queue overflow")
> c6087c5aaad6 ("vsock/virtio: fix skb overhead accounting to preserve full buf_alloc")
Again, this last one fails to apply everywhere :(
thanks,
greg k-h
^ permalink raw reply
* Re: [PATCH net-next] virtio-net: support xsk wake up
From: Eugenio Perez Martin @ 2026-06-16 8:35 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Menglong Dong, xuanzhuo, mst, jasowang, andrew+netdev, davem,
edumazet, pabeni, netdev, virtualization, linux-kernel
In-Reply-To: <20260613144612.0c5b7ba4@kernel.org>
On Sat, Jun 13, 2026 at 11:46 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 10 Jun 2026 10:27:28 +0200 Eugenio Perez Martin wrote:
> > And the From and Signed-off-by emails don't match, which I'm not sure is valid.
>
> It's clearly the same person. Please focus on the code, not trivial
> process issues.
>
> Quoting documentation:
>
> Reviewer guidance
> -----------------
>
> [...]
>
> Reviewers are highly encouraged to do more in-depth review of submissions
> and not focus exclusively on process issues, trivial or subjective
> matters like code formatting, tags etc.
>
> See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#reviewer-guidance
>
Ack'd, it was just a nitpick since the fixes tag was already needed.
Thanks for the doc pointer, I agree with that so I'll try to avoid
these nits in the future!
^ permalink raw reply
* Re: Patch "vsock/virtio: fix potential unbounded skb queue" has been added to the 6.6-stable tree
From: Stefano Garzarella @ 2026-06-16 8:36 UTC (permalink / raw)
To: Greg KH
Cc: Sasha Levin, Michael S. Tsirkin, AVKrasnov, edumazet, eperezma,
jasowang, kuba, leonardi, stefanha, virtualization, xuanzhuo,
stable-commits, stable
In-Reply-To: <2026061607-risotto-getaway-c57f@gregkh>
On Tue, 16 Jun 2026 at 10:00, Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Tue, Jun 16, 2026 at 09:52:32AM +0200, Stefano Garzarella wrote:
> > On Tue, Jun 16, 2026 at 10:17:31AM +0530, Greg KH wrote:
> > > On Thu, May 21, 2026 at 03:15:54PM +0200, Stefano Garzarella wrote:
> > > > On Sun, May 17, 2026 at 09:33:06AM -0400, Sasha Levin wrote:
> > > > > > > What's the status of that fix?
> > > > > >
> > > > > > Stefano posted v3 and is working on v4.
> > > > > >
> > > > > > > Should it be reverted elsewhere?
> > > > > >
> > > > > > Donnu. With the change we have no DoS but the socket gets silently
> > > > > > broken. Eric felt given the brokenness is upstream already it's better
> > > > > > to work on a fix on top, not revert.
> > > > >
> > > > > Dropped from the 6.6, 6.12, 6.18, and 7.0 queues. We'll pick up Stefano's
> > > > > follow-up once it lands upstream.
> > > >
> > > > FYI v4 is now merged in the net tree, so I guess they will land upstream
> > > > soon. I CCed stable on both patches:
> > > >
> > > > a4f0b001782b ("vsock/virtio: reset connection on receiving queue overflow")
> > > > c6087c5aaad6 ("vsock/virtio: fix skb overhead accounting to preserve full
> > > > buf_alloc")
> > > >
> > > > Both are related, but the second is the main fix of this patch.
> > >
> > > THe second one doesn't apply at all :(
> > >
> >
> > The second one is the fix of the patch originally added to stable queue by
> > this thread, so should be applied on top of it (commit 059b7dbd20a6
> > ("vsock/virtio: fix potential unbounded skb queue")).
> >
> > I'm working on improving memory management, but for now I think it makes
> > sense to backport all three to the stable branches.
> >
> > So, in summary:
> > 059b7dbd20a6 ("vsock/virtio: fix potential unbounded skb queue")
> > a4f0b001782b ("vsock/virtio: reset connection on receiving queue overflow")
> > c6087c5aaad6 ("vsock/virtio: fix skb overhead accounting to preserve full buf_alloc")
>
> Again, this last one fails to apply everywhere :(
Again, c6087c5aaad6 depends on 059b7dbd20a6 (as also indicated by the
Fixes tag in the patch description).
I don't know what you meant with "everywhere", but I just run `git
cherry-pick 059b7dbd20a6 c6087c5aaad6` on linux-6.12.y, linux-6.18.y,
and linux-7.0.y without any issue.
On linux-6.6.y it's failing because we are missing zero-copy support in
AF_VSOCK. So, I guess we didn't backport commit 45ca7e9f0730
("vsock/virtio: fix `rx_bytes` accounting for stream sockets") because
there were conflicts. That patch is needed to apply commit 059b7dbd20a6
("vsock/virtio: fix potential unbounded skb queue") cleanly.
Stefano
^ permalink raw reply
* Re: [PATCH] vhost/net: fix clear_user start address in VHOST_GET_FEATURES_ARRAY
From: rom.wang @ 2026-06-16 9:01 UTC (permalink / raw)
To: r4o5m6e8o
Cc: eperezma, jasowang, kvm, linux-kernel, mst, netdev, pabeni,
virtualization, wangyufeng
In-Reply-To: <20260526080336.61296-1-r4o5m6e8o@163.com>
Gentle ping. Any comments on this patch?
Thanks
Yufeng Wang
^ permalink raw reply
* Re: [PATCH v1] s390/virtio_ccw: Also suppress -EINVAL on device detach
From: Cornelia Huck @ 2026-06-16 9:16 UTC (permalink / raw)
To: Peter Oberparleiter, Halil Pasic, William Bezenah, vneethv
Cc: linux-s390, farman, hca, gor, agordeev, borntraeger, svens,
mjrosato, virtualization, kvm, linux-kernel
In-Reply-To: <2e543ef5-1aa8-4ddc-a68a-103c7bdfe58d@linux.ibm.com>
On Tue, Jun 16 2026, Peter Oberparleiter <oberpar@linux.ibm.com> wrote:
> On 15.06.2026 23:42, Halil Pasic wrote:
>> On Mon, 15 Jun 2026 16:01:55 -0400
>> William Bezenah <wbezenah@linux.ibm.com> wrote:
>>
>>> On 6/15/2026 10:58 AM, Cornelia Huck wrote:
>>>> On Mon, Jun 15 2026, Halil Pasic <pasic@linux.ibm.com> wrote:
>>>>
>>>>> On Fri, 12 Jun 2026 17:54:07 +0200
>>>>> William Bezenah <wbezenah@linux.ibm.com> wrote:
>>>>>
>>>>>> Since commit 8c58a229688c ("s390/cio: Do not unregister the
>>>>>> subchannel based on DNV"), subchannel behavior following a device
>>>>>> detach has been updated and results in -EINVAL being propagated
>>>>>> rather than -ENODEV, originating from ccw_device_start_timeout_key()
>>>>>> in cio/device_ops. In the end, the virtio driver has no ability to
>>>>>> react to the difference between device and subchannel states here,
>>>>>> and during detach, both -ENODEV and -EINVAL indicate the device
>>>>>> cannot be used and should not be treated as errors requiring
>>>>>> attention. Update error handling in virtio_ccw_del_vq() and
>>>>>> virtio_ccw_drop_indicator() to suppress -EINVAL in addition to
>>>>>> -ENODEV.
>>>>> Hi William!
>>>>>
>>>>> Are you saying that ccw_device_start() started returning -EINVAL
>>>>> since 8c58a229688c ("s390/cio: Do not unregister the subchannel based on
>>>>> DNV")? Or did I somehow read the paragraph wrong?
>>>>>
>>>>> The funcition ccw_device_start is documented to return:
>>>>> * Returns:
>>>>> * %0, if the operation was successful;
>>>>> * -%EBUSY, if the device is busy, or status pending;
>>>>> * -%EACCES, if no path specified in @lpm is operational;
>>>>> * -%ENODEV, if the device is not operational.
>>>>> and the commit message does not say a thing about introducing -EINVAL to
>>>>> the mix.
>>>> The function may return -EINVAL for non-enabled subchannels
>>>> (i.e. pmcw.ena == 0), maybe we get an all-zeroes schib with dnv == 0?
>>>> I'd expect it not to be enabled in that case anyway.
>>>
>>> Yep, that's at least how I've come to understand what changed. The
>>> function ccw_device_start_timeout_key() has always returned -EINVAL
>>> for non-enabled subchannels (pmcw.ena == 0), though it's not
>>> documented in the header.
>>
>> Wasn't his -EINVAL actually introduced by commit:
>> 823d494ac111 ("[S390] pm: ccw bus power management callbacks")?
>
> In the context of virtio-ccw added in 2012, an EINVAL return code
> introduced in 2009 might be considered "always" :)
:)
I'm wondering whether we should still expect to hit the "ssch with
ena==0" situation, given that pm support has been removed again in the
meanwhile. (Well, other than in situations like this, where it is a
follow-up to other problems.) IOW, can callers expect not to see
-EINVAL, unless they are doing something really stupid?
>
>>> What changed with commit 8c58a229688c is that cio_update_schib() now
>>> updates the schib even when DNV=0, rather than returning early as it
>>> did previously. Somehow this update results in pmcw.ena == 0 in
>>> ccw_device_start_timeout_key(). Previously, it saw pmcw.ena == 1 and
>>> moved to the condition (cdev->private->state == DEV_STATE_NOT_OPER)
>>> where it returned -ENODEV.
>>
>> Sounds fishy to me. As far as I understand the DNV takes precedence over
>> all other pieces of PMCW.
>
> And you're right about that! The Principles of Operation states (p. 15-4
> in SA22-7832-14 [1]) that the contents of all other fields in the PMCW
> are unpredictable when DNV is 0, therefore 8c58a229688c is in error.
>
> I'll work with Vineeth to determine how to fix this issue, potentially
> via manually clearing some relevant SCHIB fields instead of copying the
> unpredictable results of the STSCH instruction.
Can't you zero the whole SCHIB, or do you still need some of the
measurement block things for cleanup?
>
>>> So the commit didn't introduce -EINVAL as a new return value, rather,
>>> it changed the subchannel lifecycle such that existing paths now
>>> propagate -EINVAL rather than -ENODEV during the device detach
>>> scenario.
>>
>> I'm not convinced returning -EINVAL in the given situation is the
>> right thing to do. Peter, would you mind to chime in?
>
> I tend to agree that an attempt to start I/O for a subchannel that has
> DNV 0 should result in ENODEV rather than EINVAL, though the latter is
> still valid when a driver tries to start I/O on a subchannel that is not
> enabled for I/O.
>
> We'll make sure to design the fix for 8c58a229688c in away that ENODEV
> will be returned when DNV is 0. Assuming that this is the only situation
> where virtio-ccw's ccw_io_helper() receives -EINVAL from
> ccw_device_start__timeout_key() during detach, the subject patch should
> no longer be necessary.
I agree, I'd not expect to get -EINVAL in ccw_io_helper().
^ permalink raw reply
* Re: Patch "vsock/virtio: fix potential unbounded skb queue" has been added to the 6.6-stable tree
From: Greg KH @ 2026-06-16 9:43 UTC (permalink / raw)
To: Stefano Garzarella
Cc: Sasha Levin, Michael S. Tsirkin, AVKrasnov, edumazet, eperezma,
jasowang, kuba, leonardi, stefanha, virtualization, xuanzhuo,
stable-commits, stable
In-Reply-To: <CAGxU2F6PDqHKLsW97qLUg+7hWq=iYk5qDAGUuxWSbdkyEDmsQw@mail.gmail.com>
On Tue, Jun 16, 2026 at 10:36:43AM +0200, Stefano Garzarella wrote:
> On Tue, 16 Jun 2026 at 10:00, Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Tue, Jun 16, 2026 at 09:52:32AM +0200, Stefano Garzarella wrote:
> > > On Tue, Jun 16, 2026 at 10:17:31AM +0530, Greg KH wrote:
> > > > On Thu, May 21, 2026 at 03:15:54PM +0200, Stefano Garzarella wrote:
> > > > > On Sun, May 17, 2026 at 09:33:06AM -0400, Sasha Levin wrote:
> > > > > > > > What's the status of that fix?
> > > > > > >
> > > > > > > Stefano posted v3 and is working on v4.
> > > > > > >
> > > > > > > > Should it be reverted elsewhere?
> > > > > > >
> > > > > > > Donnu. With the change we have no DoS but the socket gets silently
> > > > > > > broken. Eric felt given the brokenness is upstream already it's better
> > > > > > > to work on a fix on top, not revert.
> > > > > >
> > > > > > Dropped from the 6.6, 6.12, 6.18, and 7.0 queues. We'll pick up Stefano's
> > > > > > follow-up once it lands upstream.
> > > > >
> > > > > FYI v4 is now merged in the net tree, so I guess they will land upstream
> > > > > soon. I CCed stable on both patches:
> > > > >
> > > > > a4f0b001782b ("vsock/virtio: reset connection on receiving queue overflow")
> > > > > c6087c5aaad6 ("vsock/virtio: fix skb overhead accounting to preserve full
> > > > > buf_alloc")
> > > > >
> > > > > Both are related, but the second is the main fix of this patch.
> > > >
> > > > THe second one doesn't apply at all :(
> > > >
> > >
> > > The second one is the fix of the patch originally added to stable queue by
> > > this thread, so should be applied on top of it (commit 059b7dbd20a6
> > > ("vsock/virtio: fix potential unbounded skb queue")).
> > >
> > > I'm working on improving memory management, but for now I think it makes
> > > sense to backport all three to the stable branches.
> > >
> > > So, in summary:
> > > 059b7dbd20a6 ("vsock/virtio: fix potential unbounded skb queue")
> > > a4f0b001782b ("vsock/virtio: reset connection on receiving queue overflow")
> > > c6087c5aaad6 ("vsock/virtio: fix skb overhead accounting to preserve full buf_alloc")
> >
> > Again, this last one fails to apply everywhere :(
>
> Again, c6087c5aaad6 depends on 059b7dbd20a6 (as also indicated by the
> Fixes tag in the patch description).
>
> I don't know what you meant with "everywhere", but I just run `git
> cherry-pick 059b7dbd20a6 c6087c5aaad6` on linux-6.12.y, linux-6.18.y,
> and linux-7.0.y without any issue.
Sorry, I was just searching for the short-id, which is in commits
already in those trees. The real commit worked, sorry for the
confusion.
> On linux-6.6.y it's failing because we are missing zero-copy support in
> AF_VSOCK. So, I guess we didn't backport commit 45ca7e9f0730
> ("vsock/virtio: fix `rx_bytes` accounting for stream sockets") because
> there were conflicts. That patch is needed to apply commit 059b7dbd20a6
> ("vsock/virtio: fix potential unbounded skb queue") cleanly.
That commit does not backport cleanly to 6.6.y, so I still need a patch
series for that tree.
thanks,
greg k-h
^ permalink raw reply
* Patch "vsock/virtio: fix potential unbounded skb queue" has been added to the 6.12-stable tree
From: gregkh @ 2026-06-16 9:48 UTC (permalink / raw)
To: AVKrasnov, edumazet, eperezma, gregkh, jasowang, kuba, mst,
sgarzare, stefanha, virtualization, xuanzhuo
Cc: stable-commits
This is a note to let you know that I've just added the patch titled
vsock/virtio: fix potential unbounded skb queue
to the 6.12-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary
The filename of the patch is:
vsock-virtio-fix-potential-unbounded-skb-queue.patch
and it can be found in the queue-6.12 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@vger.kernel.org> know about it.
From 059b7dbd20a6f0c539a45ddff1573cb8946685b5 Mon Sep 17 00:00:00 2001
From: Eric Dumazet <edumazet@google.com>
Date: Thu, 30 Apr 2026 12:26:52 +0000
Subject: vsock/virtio: fix potential unbounded skb queue
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
From: Eric Dumazet <edumazet@google.com>
commit 059b7dbd20a6f0c539a45ddff1573cb8946685b5 upstream.
virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc.
virtio_transport_recv_enqueue() skips coalescing for packets
with VIRTIO_VSOCK_SEQ_EOM.
If fed with packets with len == 0 and VIRTIO_VSOCK_SEQ_EOM,
a very large number of packets can be queued
because vvs->rx_bytes stays at 0.
Fix this by estimating the skb metadata size:
(Number of skbs in the queue) * SKB_TRUESIZE(0)
Fixes: 077706165717 ("virtio/vsock: don't use skbuff state to account credit")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: "Eugenio Pérez" <eperezma@redhat.com>
Cc: virtualization@lists.linux.dev
Link: https://patch.msgid.link/20260430122653.554058-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
net/vmw_vsock/virtio_transport_common.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -430,7 +430,9 @@ static int virtio_transport_send_pkt_inf
static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
u32 len)
{
- if (vvs->buf_used + len > vvs->buf_alloc)
+ u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0);
+
+ if (skb_overhead + vvs->buf_used + len > vvs->buf_alloc)
return false;
vvs->rx_bytes += len;
Patches currently in stable-queue which might be from edumazet@google.com are
queue-6.12/ipv6-sit-reload-inner-ipv6-header-after-gso-offloads.patch
queue-6.12/ieee802154-6lowpan-only-accept-ipv6-packets-in-lowpa.patch
queue-6.12/udp-clear-skb-dev-before-running-a-sockmap-verdict.patch
queue-6.12/inet-frags-fix-use-after-free-caused-by-the-fqdir_pre_exit-flush.patch
queue-6.12/net-add-pskb_may_pull-to-skb_gro_receive_list.patch
queue-6.12/net_sched-act_pedit-use-rcu-in-tcf_pedit_dump.patch
queue-6.12/net-sched-act_api-use-rcu-with-deferred-freeing-for-.patch
queue-6.12/ip6_vti-fix-incorrect-tunnel-matching-in-vti6_tnl_lo.patch
queue-6.12/tcp-restrict-so_attach_filter-to-priv-users.patch
queue-6.12/vsock-virtio-fix-potential-unbounded-skb-queue.patch
queue-6.12/ipv6-mcast-fix-use-after-free-when-processing-mld-queries.patch
queue-6.12/ipv4-restrict-ipopt_ssrr-and-ipopt_lsrr-options.patch
^ permalink raw reply
* Patch "vsock/virtio: fix potential unbounded skb queue" has been added to the 6.18-stable tree
From: gregkh @ 2026-06-16 9:50 UTC (permalink / raw)
To: AVKrasnov, edumazet, eperezma, gregkh, jasowang, kuba, mst,
sgarzare, stefanha, virtualization, xuanzhuo
Cc: stable-commits
This is a note to let you know that I've just added the patch titled
vsock/virtio: fix potential unbounded skb queue
to the 6.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary
The filename of the patch is:
vsock-virtio-fix-potential-unbounded-skb-queue.patch
and it can be found in the queue-6.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@vger.kernel.org> know about it.
From 059b7dbd20a6f0c539a45ddff1573cb8946685b5 Mon Sep 17 00:00:00 2001
From: Eric Dumazet <edumazet@google.com>
Date: Thu, 30 Apr 2026 12:26:52 +0000
Subject: vsock/virtio: fix potential unbounded skb queue
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
From: Eric Dumazet <edumazet@google.com>
commit 059b7dbd20a6f0c539a45ddff1573cb8946685b5 upstream.
virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc.
virtio_transport_recv_enqueue() skips coalescing for packets
with VIRTIO_VSOCK_SEQ_EOM.
If fed with packets with len == 0 and VIRTIO_VSOCK_SEQ_EOM,
a very large number of packets can be queued
because vvs->rx_bytes stays at 0.
Fix this by estimating the skb metadata size:
(Number of skbs in the queue) * SKB_TRUESIZE(0)
Fixes: 077706165717 ("virtio/vsock: don't use skbuff state to account credit")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: "Eugenio Pérez" <eperezma@redhat.com>
Cc: virtualization@lists.linux.dev
Link: https://patch.msgid.link/20260430122653.554058-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
net/vmw_vsock/virtio_transport_common.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -425,7 +425,9 @@ static int virtio_transport_send_pkt_inf
static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
u32 len)
{
- if (vvs->buf_used + len > vvs->buf_alloc)
+ u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0);
+
+ if (skb_overhead + vvs->buf_used + len > vvs->buf_alloc)
return false;
vvs->rx_bytes += len;
Patches currently in stable-queue which might be from edumazet@google.com are
queue-6.18/ipv6-sit-reload-inner-ipv6-header-after-gso-offloads.patch
queue-6.18/ieee802154-6lowpan-only-accept-ipv6-packets-in-lowpa.patch
queue-6.18/udp-clear-skb-dev-before-running-a-sockmap-verdict.patch
queue-6.18/inet-frags-fix-use-after-free-caused-by-the-fqdir_pre_exit-flush.patch
queue-6.18/net-add-pskb_may_pull-to-skb_gro_receive_list.patch
queue-6.18/net-sched-act_api-use-rcu-with-deferred-freeing-for-.patch
queue-6.18/ip6_vti-fix-incorrect-tunnel-matching-in-vti6_tnl_lo.patch
queue-6.18/tcp-restrict-so_attach_filter-to-priv-users.patch
queue-6.18/vsock-virtio-fix-potential-unbounded-skb-queue.patch
queue-6.18/ipv6-mcast-fix-use-after-free-when-processing-mld-queries.patch
queue-6.18/ip6_vti-set-netns_immutable-on-the-fallback-device.patch
queue-6.18/rxrpc-fix-the-ack-parser-to-extract-the-sack-table-for-parsing.patch
queue-6.18/ipv4-restrict-ipopt_ssrr-and-ipopt_lsrr-options.patch
^ permalink raw reply
* Patch "vsock/virtio: fix potential unbounded skb queue" has been added to the 7.0-stable tree
From: gregkh @ 2026-06-16 9:53 UTC (permalink / raw)
To: AVKrasnov, edumazet, eperezma, gregkh, jasowang, kuba, mst,
sgarzare, stefanha, virtualization, xuanzhuo
Cc: stable-commits
This is a note to let you know that I've just added the patch titled
vsock/virtio: fix potential unbounded skb queue
to the 7.0-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary
The filename of the patch is:
vsock-virtio-fix-potential-unbounded-skb-queue.patch
and it can be found in the queue-7.0 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@vger.kernel.org> know about it.
From 059b7dbd20a6f0c539a45ddff1573cb8946685b5 Mon Sep 17 00:00:00 2001
From: Eric Dumazet <edumazet@google.com>
Date: Thu, 30 Apr 2026 12:26:52 +0000
Subject: vsock/virtio: fix potential unbounded skb queue
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
From: Eric Dumazet <edumazet@google.com>
commit 059b7dbd20a6f0c539a45ddff1573cb8946685b5 upstream.
virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc.
virtio_transport_recv_enqueue() skips coalescing for packets
with VIRTIO_VSOCK_SEQ_EOM.
If fed with packets with len == 0 and VIRTIO_VSOCK_SEQ_EOM,
a very large number of packets can be queued
because vvs->rx_bytes stays at 0.
Fix this by estimating the skb metadata size:
(Number of skbs in the queue) * SKB_TRUESIZE(0)
Fixes: 077706165717 ("virtio/vsock: don't use skbuff state to account credit")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: "Eugenio Pérez" <eperezma@redhat.com>
Cc: virtualization@lists.linux.dev
Link: https://patch.msgid.link/20260430122653.554058-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
net/vmw_vsock/virtio_transport_common.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -425,7 +425,9 @@ static int virtio_transport_send_pkt_inf
static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
u32 len)
{
- if (vvs->buf_used + len > vvs->buf_alloc)
+ u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0);
+
+ if (skb_overhead + vvs->buf_used + len > vvs->buf_alloc)
return false;
vvs->rx_bytes += len;
Patches currently in stable-queue which might be from edumazet@google.com are
queue-7.0/ipv6-sit-reload-inner-ipv6-header-after-gso-offloads.patch
queue-7.0/ieee802154-6lowpan-only-accept-ipv6-packets-in-lowpa.patch
queue-7.0/udp-clear-skb-dev-before-running-a-sockmap-verdict.patch
queue-7.0/inet-frags-fix-use-after-free-caused-by-the-fqdir_pre_exit-flush.patch
queue-7.0/net-add-pskb_may_pull-to-skb_gro_receive_list.patch
queue-7.0/net-sched-act_api-use-rcu-with-deferred-freeing-for-.patch
queue-7.0/bonding-annotate-data-races-arcound-churn-variables.patch
queue-7.0/ip6_vti-fix-incorrect-tunnel-matching-in-vti6_tnl_lo.patch
queue-7.0/tcp-restrict-so_attach_filter-to-priv-users.patch
queue-7.0/vsock-virtio-fix-potential-unbounded-skb-queue.patch
queue-7.0/ipv6-mcast-fix-use-after-free-when-processing-mld-queries.patch
queue-7.0/tcp-add-preempt_-disable-enable-_nested-in-reqsk_que.patch
queue-7.0/ip6_vti-set-netns_immutable-on-the-fallback-device.patch
queue-7.0/rxrpc-fix-the-ack-parser-to-extract-the-sack-table-for-parsing.patch
queue-7.0/ipv4-restrict-ipopt_ssrr-and-ipopt_lsrr-options.patch
^ permalink raw reply
* Re: [PATCH v1] s390/virtio_ccw: Also suppress -EINVAL on device detach
From: Halil Pasic @ 2026-06-16 11:38 UTC (permalink / raw)
To: Peter Oberparleiter
Cc: William Bezenah, vneethv, Cornelia Huck, linux-s390, farman, hca,
gor, agordeev, borntraeger, svens, mjrosato, virtualization, kvm,
linux-kernel, Halil Pasic
In-Reply-To: <2e543ef5-1aa8-4ddc-a68a-103c7bdfe58d@linux.ibm.com>
On Tue, 16 Jun 2026 09:58:38 +0200
Peter Oberparleiter <oberpar@linux.ibm.com> wrote:
> >> So the commit didn't introduce -EINVAL as a new return value, rather,
> >> it changed the subchannel lifecycle such that existing paths now
> >> propagate -EINVAL rather than -ENODEV during the device detach
> >> scenario.
> >
> > I'm not convinced returning -EINVAL in the given situation is the
> > right thing to do. Peter, would you mind to chime in?
>
> I tend to agree that an attempt to start I/O for a subchannel that has
> DNV 0 should result in ENODEV rather than EINVAL, though the latter is
> still valid when a driver tries to start I/O on a subchannel that is not
> enabled for I/O.
>
That would be a "programming error". I think it would be nice to document
that ccw_device_start() can return -EINVAL as well, and that the semantic
of EINVAL is basically execution of the function was not attempted,
because of a programming error. Or if we think that the list is not
exhaustive even after adding -EINVAL, we should state that as well in
some way one can program against.
> We'll make sure to design the fix for 8c58a229688c in away that ENODEV
> will be returned when DNV is 0. Assuming that this is the only situation
> where virtio-ccw's ccw_io_helper() receives -EINVAL from
> ccw_device_start__timeout_key() during detach, the subject patch should
> no longer be necessary.
I fully agree! With the semantic of -EINVAL pinned down to programming
error, logging the condition seems to be the right thing to do. So we
should keep the old handing IMHO.
Regards,
Halill
^ permalink raw reply
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Miaohe Lin @ 2026-06-16 11:40 UTC (permalink / raw)
To: David Hildenbrand (Arm), Michael S. Tsirkin
Cc: Zi Yan, Andrew Morton, linux-kernel, Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <e4e123f3-ef32-417f-8369-ff944b6f358d@kernel.org>
On 2026/6/16 14:56, David Hildenbrand (Arm) wrote:
>>>
>>>
>>> Assume that we enlighten all non-atomics to grab the rcu read lock, such as
>>
>> These non-atomics are defined and used because they want to avoid atomic ops overhead?
>> So I'm afraid using rcu read lock in these places would lead to unexpected overhead.
>
> It should be cheaper than atomics IIUC. Further, I assume that some pages could
> batch over multiple such operations (esp. page freeing path when we process tail
> pages).
>
> With !CONFIG_PREEMPT_RCU it's simply preempt_disable()/preempt_enable(), which
> is either a NOP or just adjusting the preempt counter of the current thread. Cheap.
>
> With CONFIG_PREEMPT_RCU we mostly increment current->rcu_read_lock_nesting. But
> there might be a function call involved (did not look into the details). So that
> variant should be slightly more expensive.
I scanned the code and found rcu_read_unlock_special might be called in some cases.
Some expensive ops, e.g. irq_work_queue_on, might be called in some corner cases.
So the overhead of rcu read lock might be fluctuating.
>
> We'd have to measure what an addition rcu read lock would cost in there. that
> should be fairly easy to benchmark.
Sure. We can do that if needed.
>
>>>
>>> Maybe that would work. There would still be issues to solve
>>>
>>> (a) We don't hold the mf_mutex on all call paths, but we really need it so a
>>> page_test_set_hwpoison() cannot race in weird ways with the other primitives I think.
>>>
>>> (b) There are some leftover SetPageHWPoison etc. instances. The ones in
>>> arch/x86/kernel/cpu/mce/core.c likely cannot grab the mutex, but maybe they are
>>> corner cases either way and we can document the situation.
>>>
>>>
>>> Further, while I assume the synchronize_rcu() on the MCE path should be fine
>>> (who cares about performance there?), I don't know if the added RCU read lock
>>> on some paths could be noticable.
>>>
>>> So one idea worth discussing, but I am sure there are more problems.
>>
>> I think this is a good idea, although there are some remaining issues.
>> But such race should be really rare, is it worth all this effort? Could we
>> simply aim to resolve, not to be flawless? I.e. could we simply check
>> and re-set the hwpoison flag at the end of memory_failure handling to
>> simply avoid losing hwpoison flag as a best-effort attempt? Would it be
>> acceptable?
>
> Hacky. Sufficient for the hypervisor to suspend the nonatomic-setting CPU at the
> wrong time to still trigger the same behavior.
Right. hypervisor could make the issue easier to trigger...
>
> I think, either we fix it properly, or we redesign hwpoison handling to deal
> with setting/clearing becoming stale at some random point in the future.
I think your proposal, although there are still some issues to be resolved, is
nevertheless a good solution. We could also wait and see if anyone comes up with
a better one.
Thanks.
.
^ permalink raw reply
* [PATCH net-next v3] virtio-net: xsk: support tx wake up
From: Menglong Dong @ 2026-06-16 11:59 UTC (permalink / raw)
To: xuanzhuo, eperezma
Cc: mst, jasowang, andrew+netdev, davem, edumazet, kuba, pabeni,
netdev, virtualization, linux-kernel
For now, XDP_RING_NEED_WAKEUP is not supported properly by the virtio-net
in the tx path for example: we set xsk_set_tx_need_wakeup() in
virtnet_xsk_xmit(), but we didn't call xsk_clear_tx_need_wakeup()
anywhere, which means the user will call send() for every packet.
We call xsk_set_tx_need_wakeup() after virtnet_xsk_xmit_batch() if sq->vq
is empty, as we can't be wakeup by the skb_xmit_done() in this case.
Otherwise, we will clear the wakeup flag.
Race condition is considered for tx path.
Fixes: 89f86675cb03 ("virtio_net: xsk: tx: support xmit xsk buffer")
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
---
v3:
- remove the confusing comment
v2:
- add the Fixes tag
---
drivers/net/virtio_net.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f4adcfee7a80..6e099edef6e9 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1440,8 +1440,9 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
struct virtnet_info *vi = sq->vq->vdev->priv;
struct virtnet_sq_free_stats stats = {};
struct net_device *dev = vi->dev;
+ int sent, vring_size;
+ bool need_wakeup;
u64 kicks = 0;
- int sent;
/* Avoid to wakeup napi meanless, so call __free_old_xmit instead of
* free_old_xmit().
@@ -1451,8 +1452,25 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
if (stats.xsk)
xsk_tx_completed(sq->xsk_pool, stats.xsk);
+ vring_size = virtqueue_get_vring_size(sq->vq);
+ need_wakeup = xsk_uses_need_wakeup(pool);
+
+ if (need_wakeup && vring_size == sq->vq->num_free)
+ xsk_set_tx_need_wakeup(pool);
+
sent = virtnet_xsk_xmit_batch(sq, pool, budget, &kicks);
+ if (need_wakeup) {
+ if (vring_size == sq->vq->num_free)
+ /* we can't wake up by ourself, and it should be done
+ * by the user.
+ */
+ xsk_set_tx_need_wakeup(pool);
+ else
+ /* we can wake up from skb_xmit_done() */
+ xsk_clear_tx_need_wakeup(pool);
+ }
+
if (!is_xdp_raw_buffer_queue(vi, sq - vi->sq))
check_sq_full_and_disable(vi, vi->dev, sq);
@@ -1470,9 +1488,6 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
u64_stats_add(&sq->stats.xdp_tx, sent);
u64_stats_update_end(&sq->stats.syncp);
- if (xsk_uses_need_wakeup(pool))
- xsk_set_tx_need_wakeup(pool);
-
return sent;
}
--
2.54.0
^ permalink raw reply related
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: David Hildenbrand (Arm) @ 2026-06-16 12:18 UTC (permalink / raw)
To: Miaohe Lin, Michael S. Tsirkin
Cc: Zi Yan, Andrew Morton, linux-kernel, Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <438389f2-332d-2f70-cad4-784d7f54af9f@huawei.com>
On 6/16/26 13:40, Miaohe Lin wrote:
> On 2026/6/16 14:56, David Hildenbrand (Arm) wrote:
>>>
>>> These non-atomics are defined and used because they want to avoid atomic ops overhead?
>>> So I'm afraid using rcu read lock in these places would lead to unexpected overhead.
>>
>> It should be cheaper than atomics IIUC. Further, I assume that some pages could
>> batch over multiple such operations (esp. page freeing path when we process tail
>> pages).
>>
>> With !CONFIG_PREEMPT_RCU it's simply preempt_disable()/preempt_enable(), which
>> is either a NOP or just adjusting the preempt counter of the current thread. Cheap.
>>
>> With CONFIG_PREEMPT_RCU we mostly increment current->rcu_read_lock_nesting. But
>> there might be a function call involved (did not look into the details). So that
>> variant should be slightly more expensive.
>
> I scanned the code and found rcu_read_unlock_special might be called in some cases.
> Some expensive ops, e.g. irq_work_queue_on, might be called in some corner cases.
> So the overhead of rcu read lock might be fluctuating.
Right. Usually rcu_read_lock+unlock is supposed to be very lightweight, but that
might not be completely the case with that PREEMPT_RCU thingy ...
>
>>
>> We'd have to measure what an addition rcu read lock would cost in there. that
>> should be fairly easy to benchmark.
>
> Sure. We can do that if needed.
>
>>
>>>
>>> I think this is a good idea, although there are some remaining issues.
>>> But such race should be really rare, is it worth all this effort? Could we
>>> simply aim to resolve, not to be flawless? I.e. could we simply check
>>> and re-set the hwpoison flag at the end of memory_failure handling to
>>> simply avoid losing hwpoison flag as a best-effort attempt? Would it be
>>> acceptable?
>>
>> Hacky. Sufficient for the hypervisor to suspend the nonatomic-setting CPU at the
>> wrong time to still trigger the same behavior.
>
> Right. hypervisor could make the issue easier to trigger...
>
>>
>> I think, either we fix it properly, or we redesign hwpoison handling to deal
>> with setting/clearing becoming stale at some random point in the future.
>
> I think your proposal, although there are still some issues to be resolved, is
> nevertheless a good solution. We could also wait and see if anyone comes up with
> a better one.
I wouldn't call it "good" ... it's the only thing I was easily able to come up
with :)
The only alternative would be moving the hwpoison bit out of page->flags,
storing it in a sparse bitmap or sth. like that. It would be a bigger rework and
I am sure there are issues with that as well.
--
Cheers,
David
^ permalink raw reply
* Re: [RFC PATCH 1/2] virtio-balloon: extend stats with memory composition and pressure data
From: David Hildenbrand (Arm) @ 2026-06-16 12:30 UTC (permalink / raw)
To: Gregory Price, virtualization
Cc: linux-kernel, kernel-team, mst, jasowang, xuanzhuo, eperezma,
hannes, surenb, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
kprateek.nayak
In-Reply-To: <20260513165006.2790857-2-gourry@gourry.net>
On 5/13/26 18:50, Gregory Price wrote:
> When doing proactive ballooning to reduce the size of a guest, some
> additional vmstat information is useful in deciding how much pressure
> to exert on the VM and when the VM starts experiencing undesirable
> behavior (memory and io pressure).
>
> Add 11 new statistics tags to the existing balloon stats virtqueue
> for improved balloon sizing decisions. Old hosts ignore unknown tags,
> so no feature negotiation is required.
>
> New memory composition stats (bytes):
> S_DIRTY: dirty pages awaiting writeback
> S_WRITEBACK: pages under active writeback
> S_ANON: anonymous pages (for balloon ceiling calculation)
> S_INACTIVE_FILE: inactive file LRU (safely reclaimable subset of cache)
> S_SLAB_RECLAIM: reclaimable slab memory
>
> New workingset stats (counts):
> S_WS_REFAULT_A: anon workingset refaults
> S_WS_REFAULT_F: file workingset refaults
>
> New PSI stats (microseconds, cumulative):
> S_PSI_MEM_SOME: memory pressure (some stalled)
> S_PSI_MEM_FULL: memory pressure (all stalled)
> S_PSI_IO_SOME: IO pressure (some stalled)
> S_PSI_IO_FULL: IO pressure (all stalled)
>
> Export psi_system for module builds (CONFIG_VIRTIO_BALLOON=m with
> CONFIG_PSI=y).
>
> Assisted-by: Claude:claude-opus-4-6
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
> drivers/virtio/virtio_balloon.c | 33 +++++++++++++++++++++++++++++
> include/uapi/linux/virtio_balloon.h | 26 +++++++++++++++++++++--
> kernel/sched/psi.c | 1 +
> 3 files changed, 58 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f6c2dff33f8a..8fa33aec4ce7 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -18,6 +18,8 @@
> #include <linux/wait.h>
> #include <linux/mm.h>
> #include <linux/page_reporting.h>
> +#include <linux/vmstat.h>
> +#include <linux/psi.h>
>
> /*
> * Balloon device works in 4K page units. So each page is pointed to by
> @@ -414,6 +416,37 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb)
> update_stat(vb, idx++, VIRTIO_BALLOON_S_CACHES,
> pages_to_bytes(caches));
>
> + update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRTY,
> + pages_to_bytes(global_node_page_state(NR_FILE_DIRTY)));
> + update_stat(vb, idx++, VIRTIO_BALLOON_S_WRITEBACK,
> + pages_to_bytes(global_node_page_state(NR_WRITEBACK)));
> + update_stat(vb, idx++, VIRTIO_BALLOON_S_ANON,
> + pages_to_bytes(global_node_page_state(NR_ANON_MAPPED)));
> + update_stat(vb, idx++, VIRTIO_BALLOON_S_INACTIVE_FILE,
> + pages_to_bytes(global_node_page_state(NR_INACTIVE_FILE)));
> + update_stat(vb, idx++, VIRTIO_BALLOON_S_SLAB_RECLAIM,
> + pages_to_bytes(
> + global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B)));
> + update_stat(vb, idx++, VIRTIO_BALLOON_S_WS_REFAULT_A,
> + global_node_page_state(WORKINGSET_REFAULT_ANON));
> + update_stat(vb, idx++, VIRTIO_BALLOON_S_WS_REFAULT_F,
> + global_node_page_state(WORKINGSET_REFAULT_FILE));
> +
> +#ifdef CONFIG_PSI
> + update_stat(vb, idx++, VIRTIO_BALLOON_S_PSI_MEM_SOME,
> + div_u64(psi_system.total[PSI_AVGS][PSI_MEM_SOME],
> + NSEC_PER_USEC));
> + update_stat(vb, idx++, VIRTIO_BALLOON_S_PSI_MEM_FULL,
> + div_u64(psi_system.total[PSI_AVGS][PSI_MEM_FULL],
> + NSEC_PER_USEC));
> + update_stat(vb, idx++, VIRTIO_BALLOON_S_PSI_IO_SOME,
> + div_u64(psi_system.total[PSI_AVGS][PSI_IO_SOME],
> + NSEC_PER_USEC));
> + update_stat(vb, idx++, VIRTIO_BALLOON_S_PSI_IO_FULL,
> + div_u64(psi_system.total[PSI_AVGS][PSI_IO_FULL],
> + NSEC_PER_USEC));
> +#endif
> +
> return idx;
> }
>
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index ee35a372805d..37ec8a8466c4 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -77,7 +77,18 @@ struct virtio_balloon_config {
> #define VIRTIO_BALLOON_S_DIRECT_SCAN 13 /* Amount of memory scanned directly */
> #define VIRTIO_BALLOON_S_ASYNC_RECLAIM 14 /* Amount of memory reclaimed asynchronously */
> #define VIRTIO_BALLOON_S_DIRECT_RECLAIM 15 /* Amount of memory reclaimed directly */
> -#define VIRTIO_BALLOON_S_NR 16
> +#define VIRTIO_BALLOON_S_DIRTY 16 /* Dirty pages (bytes) */
> +#define VIRTIO_BALLOON_S_WRITEBACK 17 /* Pages under writeback (bytes) */
> +#define VIRTIO_BALLOON_S_ANON 18 /* Anonymous pages (bytes) */
> +#define VIRTIO_BALLOON_S_INACTIVE_FILE 19 /* Inactive file LRU pages (bytes) */
> +#define VIRTIO_BALLOON_S_SLAB_RECLAIM 20 /* Reclaimable slab (bytes) */
> +#define VIRTIO_BALLOON_S_WS_REFAULT_A 21 /* Workingset refaults anon (count) */
> +#define VIRTIO_BALLOON_S_WS_REFAULT_F 22 /* Workingset refaults file (count) */
> +#define VIRTIO_BALLOON_S_PSI_MEM_SOME 23 /* PSI memory some total (us) */
> +#define VIRTIO_BALLOON_S_PSI_MEM_FULL 24 /* PSI memory full total (us) */
> +#define VIRTIO_BALLOON_S_PSI_IO_SOME 25 /* PSI IO some total (us) */
> +#define VIRTIO_BALLOON_S_PSI_IO_FULL 26 /* PSI IO full total (us) */
> +#define VIRTIO_BALLOON_S_NR 27
>
> #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \
> VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \
> @@ -95,7 +106,18 @@ struct virtio_balloon_config {
> VIRTIO_BALLOON_S_NAMES_prefix "async-scans", \
> VIRTIO_BALLOON_S_NAMES_prefix "direct-scans", \
> VIRTIO_BALLOON_S_NAMES_prefix "async-reclaims", \
> - VIRTIO_BALLOON_S_NAMES_prefix "direct-reclaims" \
> + VIRTIO_BALLOON_S_NAMES_prefix "direct-reclaims", \
> + VIRTIO_BALLOON_S_NAMES_prefix "dirty", \
> + VIRTIO_BALLOON_S_NAMES_prefix "writeback", \
> + VIRTIO_BALLOON_S_NAMES_prefix "anon-pages", \
> + VIRTIO_BALLOON_S_NAMES_prefix "inactive-file", \
> + VIRTIO_BALLOON_S_NAMES_prefix "slab-reclaimable", \
> + VIRTIO_BALLOON_S_NAMES_prefix "ws-refault-anon", \
> + VIRTIO_BALLOON_S_NAMES_prefix "ws-refault-file", \
> + VIRTIO_BALLOON_S_NAMES_prefix "psi-mem-some-us", \
> + VIRTIO_BALLOON_S_NAMES_prefix "psi-mem-full-us", \
> + VIRTIO_BALLOON_S_NAMES_prefix "psi-io-some-us", \
> + VIRTIO_BALLOON_S_NAMES_prefix "psi-io-full-us" \
> }
>
> #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("")
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index d9c9d9480a45..8ab3aa1c4ef5 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -175,6 +175,7 @@ static DEFINE_PER_CPU(struct psi_group_cpu, system_group_pcpu);
> struct psi_group psi_system = {
> .pcpu = &system_group_pcpu,
> };
> +EXPORT_SYMBOL_GPL(psi_system);
Nothing too crazy here, however, the question is which of these values we
actually want to guarantee that we will provide them with unchanged semantics in
the future ... I guess anything we already expose to user space is alright
(because it effectively already must remain mostly unchanged I assume).
--
Cheers,
David
^ permalink raw reply
* Re: [RFC PATCH 2/2] virtio-balloon: add stats push mode
From: David Hildenbrand (Arm) @ 2026-06-16 12:33 UTC (permalink / raw)
To: Gregory Price, virtualization
Cc: linux-kernel, kernel-team, mst, jasowang, xuanzhuo, eperezma,
hannes, surenb, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
kprateek.nayak
In-Reply-To: <20260513165006.2790857-3-gourry@gourry.net>
On 5/13/26 18:50, Gregory Price wrote:
> When doing aggressive overcommit of VMs on a single host, a pull
> model of stat retrieval is problematic if a guest becomes some form
> of unresponsive. In particular, it's difficult to discern the
> difference between a hung guest and a slow guest - and why the
> guest is experiencing that.
>
> Add VIRTIO_BALLOON_F_STATS_PUSH feature that allows the host to
> configure the guest to push stats on a timer instead of the default
> pull model.
>
> The host sets stats_push_interval_ms in the balloon config space:
> 0 = disabled (pull-only, default)
> N > 0 = guest pushes stats every N milliseconds
>
> The push mode reuses the existing stats VQ, same buffer format,
> same tags. The host can change the interval at runtime by updating
> the config field.
>
> Push mode provides two advantages over pull:
> 1. Guest liveness detection: in pull mode, the host cannot
> distinguish a slow guest from a hung guest without implementing
> its own timeout tracking. In push mode, the absence of expected
> stats buffers is an implicit liveness signal; if the guest
> fails to push within the expected interval, the host can
> conclude it is unresponsive.
> 2. Latency-sensitive consumers (e.g., memory pressure response
> loops) receive fresh stats at a guaranteed cadence without
> the host needing to poll.
>
> STATS_PUSH requires STATS_VQ; the driver clears STATS_PUSH during
> feature validation if STATS_VQ is absent. When push mode is active,
> the pull callback is suppressed to avoid racing on buffer submission.
>
> The pull model remains available and is the default.
I don't quite see the big benefit here, really: either it's a timer in the
hypervisor or a timer in the VM. A slow VM will, in either model, delay the
update of stats.
If you need some "liveness detection", is virtio-balloon stats updates really
the right mechanism?
I don't quite understand the "Latency-sensitive consumers" problem. If the VM is
slow, it is slow and will mess with latency-sensitive consumers in either way?
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH net v2 1/2] iov_iter: export iov_iter_restore
From: Stefano Garzarella @ 2026-06-16 12:35 UTC (permalink / raw)
To: Octavian Purdila
Cc: netdev, Alexander Viro, Andrew Morton, Arseniy Krasnov,
David S. Miller, Eric Dumazet, Eugenio Pérez, Jakub Kicinski,
Jason Wang, kvm, linux-block, linux-fsdevel, linux-kernel,
Michael S. Tsirkin, Paolo Abeni, Simon Horman, Stefan Hajnoczi,
virtualization, Xuan Zhuo
In-Reply-To: <20260613000953.467473-2-tavip@google.com>
On Sat, Jun 13, 2026 at 12:09:52AM +0000, Octavian Purdila wrote:
>Export iov_iter_restore so that it can be used by modules.
>
>This is needed by the virtio vsock transport (which can be built as a
>module) to restore the msg_iter state when transmission fails.
>
>Signed-off-by: Octavian Purdila <tavip@google.com>
>---
> lib/iov_iter.c | 1 +
> 1 file changed, 1 insertion(+)
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
>
>diff --git a/lib/iov_iter.c b/lib/iov_iter.c
>index 243662af1af73..067e745f9ef53 100644
>--- a/lib/iov_iter.c
>+++ b/lib/iov_iter.c
>@@ -1491,6 +1491,7 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
> i->__iov -= state->nr_segs - i->nr_segs;
> i->nr_segs = state->nr_segs;
> }
>+EXPORT_SYMBOL(iov_iter_restore);
>
> /*
> * Extract a list of contiguous pages from an ITER_FOLIOQ iterator. This does
>--
>2.54.0.1136.gdb2ca164c4-goog
>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox