* [PATCH] vsock/virtio: fix memory leak in virtio_transport_recv_listen
From: Divya Mankani @ 2026-06-05 19:19 UTC
To: kuba, pabeni
Cc: horms, virtualization, kvm, netdev, linux-kernel,
syzbot+1b2c9c4a0f8708082678, Divya Mankani
Syzbot reported a memory leak inside virtio_transport_recv_listen
caused by a race condition when the parent listener socket shuts down
while an incoming packet is being enqueued.
Fix this by locking the parent socket and verifying its shutdown
state under the lock before executing vsock_enqueue_accept().
Fixes: a478546a782a ("vsock/virtio: add support for listen sockets")
Reported-by: syzbot+1b2c9c4a0f8708082678@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=1b2c9c4a0f8708082678
Signed-off-by: Divya Mankani <divyakm@unc.edu>
---
net/vmw_vsock/virtio_transport_common.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 3b294164b..8006a13bb 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1571,15 +1571,20 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
vsock_addr_init(&vchild->remote_addr, le64_to_cpu(hdr->src_cid),
le32_to_cpu(hdr->src_port));
- ret = vsock_assign_transport(vchild, vsk);
- /* Transport assigned (looking at remote_addr) must be the same
- * where we received the request.
+ /* Lock the parent listener socket to synchronize with a potential
+ * simultaneous shutdown thread running __vsock_release().
*/
- if (ret || vchild->transport != &t->transport) {
+ lock_sock(sk);
+
+ /* Check if the listener socket was shut down while we were
+ * creating and configuring the child socket.
+ */
+ if (sk->sk_shutdown == SHUTDOWN_MASK) {
+ release_sock(sk);
release_sock(child);
virtio_transport_reset_no_sock(t, skb, sock_net(sk));
sock_put(child);
- return ret;
+ return -ESHUTDOWN;
}
sk_acceptq_added(sk);
@@ -1590,6 +1595,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
vsock_enqueue_accept(sk, child);
virtio_transport_send_response(vchild, skb);
+ /* Safely release both locked objects */
+ release_sock(sk);
release_sock(child);
sk->sk_data_ready(sk);
--
2.50.1 (Apple Git-155)
* [PATCH] virtio_net: normalize non-positive napi_weight values
From: Samuel Moelius @ 2026-06-05 19:40 UTC
To: Michael S. Tsirkin
Cc: Samuel Moelius, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, open list:VIRTIO NET DRIVER,
open list:VIRTIO NET DRIVER, open list
The virtio_net.napi_weight module parameter is signed and is copied
directly into the RX NAPI weight after netif_napi_add_config().
A value of -1 lets virtnet_poll() run with a negative budget. The RX
loop processes no packets, the unsigned received-vs-budget comparison
completes NAPI, and the NAPI core reports:
NAPI poll function virtnet_poll returned 0, exceeding its budget of -1
The device then repeatedly drops receive progress under stock QEMU
virtio-net traffic.
Normalize non-positive values to the default NAPI weight before
assigning the RX and TX NAPI weights. TX NAPI can still be disabled
through the separate napi_tx parameter, which intentionally sets the TX
weight to zero.
Assisted-by: Codex:gpt-5.5-cyber-preview
Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
---
drivers/net/virtio_net.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f4adcfee7a80..667026607cfa 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -6481,6 +6481,7 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
static int virtnet_alloc_queues(struct virtnet_info *vi)
{
int i;
+ int weight = napi_weight > 0 ? napi_weight : NAPI_POLL_WEIGHT;
if (vi->has_cvq) {
vi->ctrl = kzalloc_obj(*vi->ctrl);
@@ -6500,10 +6501,10 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
vi->rq[i].pages = NULL;
netif_napi_add_config(vi->dev, &vi->rq[i].napi, virtnet_poll,
i);
- vi->rq[i].napi.weight = napi_weight;
+ vi->rq[i].napi.weight = weight;
netif_napi_add_tx_weight(vi->dev, &vi->sq[i].napi,
virtnet_poll_tx,
- napi_tx ? napi_weight : 0);
+ napi_tx ? weight : 0);
sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
ewma_pkt_len_init(&vi->rq[i].mrg_avg_pkt_len);
--
2.43.0
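A minimal userspace sketch of the arithmetic described in the commit message above, assuming a default weight of 64 to mirror NAPI_POLL_WEIGHT. The helper names (toy_normalize_weight, toy_rx_loop_runs, toy_poll_completes) are hypothetical and are not part of virtio_net; they only model a signed loop bound, an unsigned completion check, and the proposed clamp.

```c
#include <stdbool.h>

/* Assumed default; mirrors the kernel's NAPI_POLL_WEIGHT of 64. */
#define TOY_DEFAULT_WEIGHT 64

/* Proposed clamp: any non-positive request falls back to the default. */
static int toy_normalize_weight(int requested)
{
	return requested > 0 ? requested : TOY_DEFAULT_WEIGHT;
}

/* Signed loop bound: with budget == -1 the receive loop never starts. */
static bool toy_rx_loop_runs(int packets, int budget)
{
	return packets < budget;
}

/* Unsigned completion check: -1 converts to UINT_MAX, so 0 < budget holds. */
static bool toy_poll_completes(unsigned int received, int budget)
{
	return received < (unsigned int)budget;
}
```

With a budget of -1 the signed loop check is false while the unsigned completion check is true, which is consistent with the "returned 0, exceeding its budget of -1" message quoted above. After the clamp, both checks see 64.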
* Re: [PATCH v4 01/47] x86/tsc: Never re-calibrate TSC frequency if its exact timing is known
From: Thomas Gleixner @ 2026-06-05 19:51 UTC
To: Sean Christopherson
Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kiryl Shutsemau, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <aiMPxl5vkvJDldi9@google.com>
On Fri, Jun 05 2026 at 11:04, Sean Christopherson wrote:
> On Fri, Jun 05, 2026, Thomas Gleixner wrote:
>> On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
>> > Don't re-calibrate the TSC frequency if the TSC is known to run at a fixed
>> > frequency.
>>
>> That's misleading because fixed frequency means that the frequency does
>> not change, i.e. X86_FEATURE_CONSTANT_TSC is set. But
>> X86_FEATURE_CONSTANT_TSC does not imply that the frequency can be read
>> from CPUID/MSRs.
>
> Sorry, "if the TSC runs at a known, fixed frequency" would be a better way to
> phrase this.
>
>> > In practice, this is likely one big nop, as re-calibration is
>> > used only for SMP=n kernels, and only for hardware that is 20+ years old,
>> > i.e. is extremely unlikely to collide with TSC_KNOWN_FREQ.
>>
>> recalibrate_cpu_khz() is only invoked from Intel P4 and AMD K7 CPU
>> frequency drivers, which means that's absolutely not interesting and
>> neither X86_FEATURE_CONSTANT_TSC nor X86_FEATURE_TSC_KNOWN_FREQ can be
>> set on those systems.
>
> It _shouldn't_ be set on those systems, but in the world of virtualization it's
> not completely impossible.
>
>> IOW, this patch is pointless voodoo ware.
>
> Would y'all be opposed to adding a WARN? I don't actually care about P4 or K7
> CPUs, but without any reference to X86_FEATURE_TSC_KNOWN_FREQ in
> recalibrate_cpu_khz(), the code _looks_ wrong, and so is very confusing for
> readers that don't already know that in practice, it's limited to ancient CPUs.
>
> In other words, the point is to document expectations and mutual exclusion, not
> to "fix" anything.
Fair enough.
So yes, having a check there for X86_FEATURE_CONSTANT_TSC
(X86_FEATURE_TSC_KNOWN_FREQ is not interesting), emitting a warning and
returning early is the right thing to do there.
But we also should have a check in the TSC init code somewhere which
validates that X86_FEATURE_CONSTANT_TSC is set when
X86_FEATURE_TSC_KNOWN_FREQ is set. X86_FEATURE_TSC_KNOWN_FREQ is useless
w/o X86_FEATURE_CONSTANT_TSC.
Thanks,
tglx
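The two checks discussed above can be sketched in userspace as follows. This is only an illustration under stated assumptions: the TOY_* bits and both helper names are hypothetical stand-ins for the real CPU feature flags and are not kernel API. The first helper models the proposed init-time validation, the second the early return in the recalibration path.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two CPU feature flags discussed above. */
#define TOY_CONSTANT_TSC   (1u << 0)
#define TOY_TSC_KNOWN_FREQ (1u << 1)

/* Init-time sanity check: KNOWN_FREQ without CONSTANT_TSC is inconsistent. */
static bool toy_tsc_flags_consistent(unsigned int features)
{
	if (features & TOY_TSC_KNOWN_FREQ)
		return (features & TOY_CONSTANT_TSC) != 0;
	return true;
}

/* Recalibration guard: warn and return early when the TSC rate is constant. */
static bool toy_recalibration_allowed(unsigned int features)
{
	return (features & TOY_CONSTANT_TSC) == 0;
}
```

Because a consistent feature set never has KNOWN_FREQ without CONSTANT_TSC, checking CONSTANT_TSC alone in the recalibration guard also covers the KNOWN_FREQ case.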
* Re: [PATCH] VIRTIO: Update the desc 'flag' field last in packed ring.
From: Michael S. Tsirkin @ 2026-06-06 0:11 UTC
To: Si-Wei Liu
Cc: Eugenio Perez Martin, yangjiale, Jason Wang, Xuan Zhuo,
virtualization, linux-kernel, Andrew.Boyer
In-Reply-To: <5e1a10bd-2783-42ba-b443-853f12159756@oracle.com>
On Fri, Jun 05, 2026 at 11:50:36AM -0700, Si-Wei Liu wrote:
>
>
> On 6/5/2026 10:43 AM, Michael S. Tsirkin wrote:
> > On Fri, Jun 05, 2026 at 09:03:36AM -0700, Si-Wei Liu wrote:
> > >
> > > On 6/1/2026 11:04 PM, Eugenio Perez Martin wrote:
> > > > On Tue, Jun 2, 2026 at 6:34 AM yangjiale <yangjiale133@163.com> wrote:
> > > > > When a descriptor list spans across cache lines,
> > > > > updating the flag first can lead to a scenario where the device side
> > > > > perceives the flag as valid, yet the corresponding address and length
> > > > > fields remain unupdated—resulting in invalid values.
> > > > > Therefore, the flag field must be updated last.
> > > > >
> > > > > Signed-off-by: yangjiale <yangjiale133@163.com>
> > > > > ---
> > > > > drivers/virtio/virtio_ring.c | 8 ++++----
> > > > > 1 file changed, 4 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > index fbca7ce1c6bf..036b4f90d30f 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -1688,6 +1688,10 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
> > > > > &addr, &len, premapped, attr))
> > > > > goto unmap_release;
> > > > >
> > > > > + desc[i].addr = cpu_to_le64(addr);
> > > > > + desc[i].len = cpu_to_le32(len);
> > > > > + desc[i].id = cpu_to_le16(id);
> > > > > +
> > > > > flags = cpu_to_le16(vq->packed.avail_used_flags |
> > > > > (++c == total_sg ? 0 : VRING_DESC_F_NEXT) |
> > > > > (n < out_sgs ? 0 : VRING_DESC_F_WRITE));
> > > > > @@ -1696,10 +1700,6 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
> > > > > else
> > > > > desc[i].flags = flags;
> > > > >
> > > > > - desc[i].addr = cpu_to_le64(addr);
> > > > > - desc[i].len = cpu_to_le32(len);
> > > > > - desc[i].id = cpu_to_le16(id);
> > > > > -
> > > > > if (unlikely(vq->use_map_api)) {
> > > > > vq->packed.desc_extra[curr].addr = premapped ?
> > > > > DMA_MAPPING_ERROR : addr;
> > > > These flags are updated before the flags of the head descriptor at the
> > > > end of the function, at "vq->packed.vring.desc[head].flags =
> > > > head_flags", so the device should not see these. Because of that, the
> > > > relative order between the rest of the fields of the same descriptor
> > > > or other descriptors' fields, except for the head descriptor's flags,
> > > > should not matter. There is a write memory barrier just before
> > > > updating the head's flags.
> > > The above analysis is absolutely correct. Though one hardware vendor told me
> > > that this driver implementation kinda stops them from reading ahead of
> > > descriptors already posted beyond the available index, ending up with
> > > suboptimal performance that is hard to make up by other means. Would it be a
> > > bad idea to go with this change and add write barrier in a gentle way for a
> > > small flit in the batch, e.g. commit to memory after every cache line size
> > > worth of descriptors are posted? Would the memory barrier have negative
> > > performance overhead to other backend implementation variants than real
> > > hardware PCI device?
> > >
> > > -Siwei
> > this would need a new feature bit, won't it?
> Probably. This is to capture the device's expectation and behavior right?
> the driver change itself is not spec violating...
yes, device can't rely on this without a feature bit.
> >
> > > > Also, I don't get why the cache line matters here. Can you expand? Am
> > > > I missing something?
> > me too.
> >
> Just to avoid extra delay due to excessive coherency messages and frequent
> cache thrashing when a device read over the PCI bus contends with a host
> write/update on descriptors in the same cache line.
>
> -Siwei
this should be infrequent, the whole idea is that there's parallelism:
device reads descriptors from X while host writes other ones to Y.
btw i can't say whether it's ok for device to just issue 2 reads,
or does it have to receive read result and only then issue the second
read.
--
MST
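A userspace sketch of the publication order Eugenio describes earlier in this thread: every field of a chain, including the flags of the non-head descriptors, is written first, and the head descriptor's flags are stored last behind a release barrier. It uses C11 atomics in place of the kernel's write barrier. The struct, the TOY_* flag values and toy_publish_chain() are hypothetical, and the sketch assumes the chain does not wrap around the ring.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag bits; not the values from the virtio headers. */
#define TOY_F_NEXT  0x01
#define TOY_F_AVAIL 0x80

struct toy_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	_Atomic uint16_t flags;
};

/* Fill n descriptors starting at head, then publish the head flags last. */
static void toy_publish_chain(struct toy_desc *ring, size_t head,
			      const uint64_t *addrs, const uint32_t *lens,
			      size_t n, uint16_t id)
{
	uint16_t head_flags = 0;

	for (size_t i = 0; i < n; i++) {
		struct toy_desc *d = &ring[head + i];
		uint16_t flags = TOY_F_AVAIL | (i + 1 < n ? TOY_F_NEXT : 0);

		d->addr = addrs[i];
		d->len = lens[i];
		d->id = id;

		if (i == 0)
			head_flags = flags;	/* deferred until the end */
		else
			atomic_store_explicit(&d->flags, flags,
					      memory_order_relaxed);
	}

	/* Release store: all writes above become visible before the head flags. */
	atomic_store_explicit(&ring[head].flags, head_flags,
			      memory_order_release);
}
```

In this model a reader that polls only the head flags cannot observe a half-written chain, regardless of the relative order of addr, len, id and flags in the non-head descriptors. A device that reads ahead of the head descriptor gets no such guarantee, which is the case the feature-bit discussion above is about.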
* Re: [PATCH] vsock/virtio: fix memory leak in virtio_transport_recv_listen
From: kernel test robot @ 2026-06-06 7:19 UTC
To: Divya Mankani, kuba, pabeni
Cc: oe-kbuild-all, horms, virtualization, kvm, netdev, linux-kernel,
syzbot+1b2c9c4a0f8708082678, Divya Mankani
In-Reply-To: <20260605191922.12720-1-divyakm@unc.edu>
Hi Divya,
kernel test robot noticed the following build warnings:
[auto build test WARNING on mst-vhost/linux-next]
[also build test WARNING on linus/master v6.16-rc1 next-20260605]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Divya-Mankani/vsock-virtio-fix-memory-leak-in-virtio_transport_recv_listen/20260606-032557
base: https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
patch link: https://lore.kernel.org/r/20260605191922.12720-1-divyakm%40unc.edu
patch subject: [PATCH] vsock/virtio: fix memory leak in virtio_transport_recv_listen
config: s390-allnoconfig-bpf (https://download.01.org/0day-ci/archive/20260606/202606060907.y54Wngh6-lkp@intel.com/config)
compiler: s390x-linux-gnu-gcc (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260606/202606060907.y54Wngh6-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606060907.y54Wngh6-lkp@intel.com/
All warnings (new ones prefixed by >>):
net/vmw_vsock/virtio_transport_common.c: In function 'virtio_transport_recv_listen':
>> net/vmw_vsock/virtio_transport_common.c:1538:13: warning: unused variable 'ret' [-Wunused-variable]
1538 | int ret;
| ^~~
>> net/vmw_vsock/virtio_transport_common.c:1535:28: warning: unused variable 'vsk' [-Wunused-variable]
1535 | struct vsock_sock *vsk = vsock_sk(sk);
| ^~~
vim +/ret +1538 net/vmw_vsock/virtio_transport_common.c
c0cfa2d8a788fc Stefano Garzarella 2019-11-14 1528
06a8fc78367d07 Asias He 2016-07-28 1529 /* Handle server socket */
06a8fc78367d07 Asias He 2016-07-28 1530 static int
71dc9ec9ac7d3e Bobby Eshleman 2023-01-13 1531 virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
c0cfa2d8a788fc Stefano Garzarella 2019-11-14 1532 struct virtio_transport *t)
06a8fc78367d07 Asias He 2016-07-28 1533 {
71dc9ec9ac7d3e Bobby Eshleman 2023-01-13 1534 struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
06a8fc78367d07 Asias He 2016-07-28 @1535 struct vsock_sock *vsk = vsock_sk(sk);
06a8fc78367d07 Asias He 2016-07-28 1536 struct vsock_sock *vchild;
06a8fc78367d07 Asias He 2016-07-28 1537 struct sock *child;
c0cfa2d8a788fc Stefano Garzarella 2019-11-14 @1538 int ret;
06a8fc78367d07 Asias He 2016-07-28 1539
71dc9ec9ac7d3e Bobby Eshleman 2023-01-13 1540 if (le16_to_cpu(hdr->op) != VIRTIO_VSOCK_OP_REQUEST) {
a69686327e4291 Bobby Eshleman 2026-01-21 1541 virtio_transport_reset_no_sock(t, skb, sock_net(sk));
06a8fc78367d07 Asias He 2016-07-28 1542 return -EINVAL;
06a8fc78367d07 Asias He 2016-07-28 1543 }
06a8fc78367d07 Asias He 2016-07-28 1544
06a8fc78367d07 Asias He 2016-07-28 1545 if (sk_acceptq_is_full(sk)) {
a69686327e4291 Bobby Eshleman 2026-01-21 1546 virtio_transport_reset_no_sock(t, skb, sock_net(sk));
06a8fc78367d07 Asias He 2016-07-28 1547 return -ENOMEM;
06a8fc78367d07 Asias He 2016-07-28 1548 }
06a8fc78367d07 Asias He 2016-07-28 1549
d7b0ff5a866724 Michal Luczaj 2024-11-07 1550 /* __vsock_release() might have already flushed accept_queue.
d7b0ff5a866724 Michal Luczaj 2024-11-07 1551 * Subsequent enqueues would lead to a memory leak.
d7b0ff5a866724 Michal Luczaj 2024-11-07 1552 */
d7b0ff5a866724 Michal Luczaj 2024-11-07 1553 if (sk->sk_shutdown == SHUTDOWN_MASK) {
a69686327e4291 Bobby Eshleman 2026-01-21 1554 virtio_transport_reset_no_sock(t, skb, sock_net(sk));
d7b0ff5a866724 Michal Luczaj 2024-11-07 1555 return -ESHUTDOWN;
d7b0ff5a866724 Michal Luczaj 2024-11-07 1556 }
d7b0ff5a866724 Michal Luczaj 2024-11-07 1557
b9ca2f5ff7784d Stefano Garzarella 2019-11-14 1558 child = vsock_create_connected(sk);
06a8fc78367d07 Asias He 2016-07-28 1559 if (!child) {
a69686327e4291 Bobby Eshleman 2026-01-21 1560 virtio_transport_reset_no_sock(t, skb, sock_net(sk));
06a8fc78367d07 Asias He 2016-07-28 1561 return -ENOMEM;
06a8fc78367d07 Asias He 2016-07-28 1562 }
06a8fc78367d07 Asias He 2016-07-28 1563
06a8fc78367d07 Asias He 2016-07-28 1564 lock_sock_nested(child, SINGLE_DEPTH_NESTING);
06a8fc78367d07 Asias He 2016-07-28 1565
3b4477d2dcf270 Stefan Hajnoczi 2017-10-05 1566 child->sk_state = TCP_ESTABLISHED;
06a8fc78367d07 Asias He 2016-07-28 1567
06a8fc78367d07 Asias He 2016-07-28 1568 vchild = vsock_sk(child);
71dc9ec9ac7d3e Bobby Eshleman 2023-01-13 1569 vsock_addr_init(&vchild->local_addr, le64_to_cpu(hdr->dst_cid),
71dc9ec9ac7d3e Bobby Eshleman 2023-01-13 1570 le32_to_cpu(hdr->dst_port));
71dc9ec9ac7d3e Bobby Eshleman 2023-01-13 1571 vsock_addr_init(&vchild->remote_addr, le64_to_cpu(hdr->src_cid),
71dc9ec9ac7d3e Bobby Eshleman 2023-01-13 1572 le32_to_cpu(hdr->src_port));
06a8fc78367d07 Asias He 2016-07-28 1573
b09fac3ae8b655 Divya Mankani 2026-06-06 1574 /* Lock the parent listener socket to synchronize with a potential
b09fac3ae8b655 Divya Mankani 2026-06-06 1575 * simultaneous shutdown thread running __vsock_release().
c0cfa2d8a788fc Stefano Garzarella 2019-11-14 1576 */
b09fac3ae8b655 Divya Mankani 2026-06-06 1577 lock_sock(sk);
b09fac3ae8b655 Divya Mankani 2026-06-06 1578
b09fac3ae8b655 Divya Mankani 2026-06-06 1579 /* Check if the listener socket was shut down while we were
b09fac3ae8b655 Divya Mankani 2026-06-06 1580 * creating and configuring the child socket.
b09fac3ae8b655 Divya Mankani 2026-06-06 1581 */
b09fac3ae8b655 Divya Mankani 2026-06-06 1582 if (sk->sk_shutdown == SHUTDOWN_MASK) {
b09fac3ae8b655 Divya Mankani 2026-06-06 1583 release_sock(sk);
c0cfa2d8a788fc Stefano Garzarella 2019-11-14 1584 release_sock(child);
a69686327e4291 Bobby Eshleman 2026-01-21 1585 virtio_transport_reset_no_sock(t, skb, sock_net(sk));
c0cfa2d8a788fc Stefano Garzarella 2019-11-14 1586 sock_put(child);
b09fac3ae8b655 Divya Mankani 2026-06-06 1587 return -ESHUTDOWN;
c0cfa2d8a788fc Stefano Garzarella 2019-11-14 1588 }
c0cfa2d8a788fc Stefano Garzarella 2019-11-14 1589
52bcb57a4e8a08 Dudu Lu 2026-04-13 1590 sk_acceptq_added(sk);
71dc9ec9ac7d3e Bobby Eshleman 2023-01-13 1591 if (virtio_transport_space_update(child, skb))
c0cfa2d8a788fc Stefano Garzarella 2019-11-14 1592 child->sk_write_space(child);
c0cfa2d8a788fc Stefano Garzarella 2019-11-14 1593
06a8fc78367d07 Asias He 2016-07-28 1594 vsock_insert_connected(vchild);
06a8fc78367d07 Asias He 2016-07-28 1595 vsock_enqueue_accept(sk, child);
71dc9ec9ac7d3e Bobby Eshleman 2023-01-13 1596 virtio_transport_send_response(vchild, skb);
06a8fc78367d07 Asias He 2016-07-28 1597
b09fac3ae8b655 Divya Mankani 2026-06-06 1598 /* Safely release both locked objects */
b09fac3ae8b655 Divya Mankani 2026-06-06 1599 release_sock(sk);
06a8fc78367d07 Asias He 2016-07-28 1600 release_sock(child);
06a8fc78367d07 Asias He 2016-07-28 1601
06a8fc78367d07 Asias He 2016-07-28 1602 sk->sk_data_ready(sk);
06a8fc78367d07 Asias He 2016-07-28 1603 return 0;
06a8fc78367d07 Asias He 2016-07-28 1604 }
06a8fc78367d07 Asias He 2016-07-28 1605
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH] virtio_net: normalize non-positive napi_weight values
From: Michael S. Tsirkin @ 2026-06-06 8:40 UTC
To: Samuel Moelius
Cc: Jason Wang, Xuan Zhuo, Eugenio Pérez, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
open list:VIRTIO NET DRIVER, open list:VIRTIO NET DRIVER,
open list
In-Reply-To: <20260605194045.2625610-1-sam.moelius@trailofbits.com>
On Fri, Jun 05, 2026 at 07:40:44PM +0000, Samuel Moelius wrote:
> The virtio_net.napi_weight module parameter is signed and is copied
> directly into the RX NAPI weight after netif_napi_add_config().
>
> A value of -1 lets virtnet_poll() run with a negative budget. The RX
> loop processes no packets, the unsigned received-vs-budget comparison
> completes NAPI, and the NAPI core reports:
>
> NAPI poll function virtnet_poll returned 0, exceeding its budget of -1
>
> The device then repeatedly drops receive progress under stock QEMU
> virtio-net traffic.
>
> Normalize non-positive values to the default NAPI weight before
> assigning the RX and TX NAPI weights. TX NAPI can still be disabled
> through the separate napi_tx parameter, which intentionally sets the TX
> weight to zero.
>
> Assisted-by: Codex:gpt-5.5-cyber-preview
> Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
Why should we bother working around admin errors in the kernel?
Leave it for the userspace tooling which can at least provide
useful diagnostics.
> ---
> drivers/net/virtio_net.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f4adcfee7a80..667026607cfa 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -6481,6 +6481,7 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
> static int virtnet_alloc_queues(struct virtnet_info *vi)
> {
> int i;
> + int weight = napi_weight > 0 ? napi_weight : NAPI_POLL_WEIGHT;
>
> if (vi->has_cvq) {
> vi->ctrl = kzalloc_obj(*vi->ctrl);
> @@ -6500,10 +6501,10 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
> vi->rq[i].pages = NULL;
> netif_napi_add_config(vi->dev, &vi->rq[i].napi, virtnet_poll,
> i);
> - vi->rq[i].napi.weight = napi_weight;
> + vi->rq[i].napi.weight = weight;
> netif_napi_add_tx_weight(vi->dev, &vi->sq[i].napi,
> virtnet_poll_tx,
> - napi_tx ? napi_weight : 0);
> + napi_tx ? weight : 0);
>
> sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
> ewma_pkt_len_init(&vi->rq[i].mrg_avg_pkt_len);
> --
> 2.43.0
* Re: [PATCH v4 10/47] x86/tsc: Consolidate forcing of X86_FEATURE_TSC_KNOWN_FREQ for PV code
From: Thomas Gleixner @ 2026-06-06 10:34 UTC
To: Sean Christopherson, Paolo Bonzini, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <20260529144435.704127-11-seanjc@google.com>
On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
> Now that all paravirt code that explicitly specifies the TSC frequency
> also sets X86_FEATURE_TSC_KNOWN_FREQ, replace all of the one-off code
> and simply set X86_FEATURE_TSC_KNOWN_FREQ if the TSC frequency is known.
>
> Do NOT force set TSC_KNOWN_FREQ if the "known" TSC frequency was provided
> by the user. Per commit bd35c77e32e4 ("x86/tsc: Add tsc_early_khz command
> line parameter"), one of the goals of the param is to allow the refined
> calibration work "to do meaningful error checking".
>
> Note, preferring the user-provided TSC frequency over the frequency from
> the hypervisor or trusted firmware, while simultaneously not treating the
> user-provided frequency as gospel, is obviously incongruous. Sweep the
> problem under the rug for now to avoid opening a big can of worms that
> likely doesn't have a great answer.
There is a good answer I think.
tsc_early_khz exists to cater for the overclocking crowd. On their
modded systems the firmware supplied TSC frequency (CPUID/MSR) is not
matching reality anymore. So they work around that by supplying a close
enough tsc_early_khz and then they let the refined calibration work
figure it out.
Arguably that's only relevant for bare metal systems and what's worse is
that in virtual environments the refined calibration work can fail,
which renders the TSC unstable.
So I'd rather say we change this logic to:
    if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
            tsc_khz = x86_init.....();
            force(X86_FEATURE_TSC_KNOWN_FREQ);
    } else if (tsc_early_khz) {
            ....
    } else {
            ...
    }
Along with:
    if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
            if (tsc_early_khz)
                    pr_warn("Ignoring non-sensical tsc_early_khz command line argument\n");
    }
or something daft like that.
The kernel has for various reasons always tried to cater for the needs
of users who are plagued by bonkers firmware, but we have to stop
prioritizing ancient and modded out-of-spec hardware, or treating it as equal.
TBH, I consider that whole KVM clock nonsense to fall into the modded
out of spec hardware realm. Do a reality check:
How many production systems are out there still which run VMs on CPUs
with a broken TSC and the lack of VM TSC scaling?
I'm not saying that we should not support the few remaining systems
anymore, but our tendency to pretend that we can keep all of this
nonsense working and at the same time making progress is just a fallacy.
I rather want to have a more fine grained differentiation and
prioritization of:
1) The actual real world relevant use cases which run on contemporary
hardware.
2) Still relevant use cases on slightly older hardware with less
capabilities
3) Broken firmware
4) Modded out of spec nonsense
5) Support for ancient museum pieces
Thanks,
tglx
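The selection order proposed above can be modelled in userspace roughly as follows. This is a sketch under assumptions: the enum, the struct and toy_pick_tsc_khz() are hypothetical, and the contents of the two elided branches (use tsc_early_khz and leave refinement to calibration, otherwise calibrate) are inferred from the surrounding discussion rather than taken from the pseudocode.

```c
#include <stdbool.h>

enum toy_tsc_source {
	TOY_SRC_HYPERVISOR,	/* frequency supplied by the hypervisor */
	TOY_SRC_CMDLINE,	/* tsc_early_khz, refined later by calibration */
	TOY_SRC_CALIBRATION,	/* regular calibration path */
};

struct toy_tsc_choice {
	enum toy_tsc_source source;
	unsigned int khz;
	bool known_freq;	/* models forcing X86_FEATURE_TSC_KNOWN_FREQ */
	bool cmdline_ignored;	/* models the proposed pr_warn() */
};

static struct toy_tsc_choice toy_pick_tsc_khz(bool native, unsigned int hv_khz,
					      unsigned int early_khz,
					      unsigned int calibrated_khz)
{
	struct toy_tsc_choice c = {
		TOY_SRC_CALIBRATION, calibrated_khz, false, false
	};

	if (!native) {
		/* Guest: trust the hypervisor and ignore the command line. */
		c.source = TOY_SRC_HYPERVISOR;
		c.khz = hv_khz;
		c.known_freq = true;
		c.cmdline_ignored = early_khz != 0;
	} else if (early_khz) {
		/* Bare metal with an override: start from the user's value. */
		c.source = TOY_SRC_CMDLINE;
		c.khz = early_khz;
	}
	return c;
}
```

The point of the ordering is that the command line override only takes effect on bare metal, where the refined calibration can still correct it, while a guest always ends up with a known frequency.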
* Re: [PATCH v4 10/47] x86/tsc: Consolidate forcing of X86_FEATURE_TSC_KNOWN_FREQ for PV code
From: David Woodhouse @ 2026-06-06 10:52 UTC
To: Thomas Gleixner, Sean Christopherson, Paolo Bonzini, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Kiryl Shutsemau,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, Tom Lendacky, Nikunj A Dadhania,
Michael Kelley
In-Reply-To: <877boc554l.ffs@fw13>
On Sat, 2026-06-06 at 12:34 +0200, Thomas Gleixner wrote:
> On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
>
> > Now that all paravirt code that explicitly specifies the TSC frequency
> > also sets X86_FEATURE_TSC_KNOWN_FREQ, replace all of the one-off code
> > and simply set X86_FEATURE_TSC_KNOWN_FREQ if the TSC frequency is known.
> >
> > Do NOT force set TSC_KNOWN_FREQ if the "known" TSC frequency was provided
> > by the user. Per commit bd35c77e32e4 ("x86/tsc: Add tsc_early_khz command
> > line parameter"), one of the goals of the param is to allow the refined
> > calibration work "to do meaningful error checking".
> >
> > Note, preferring the user-provided TSC frequency over the frequency from
> > the hypervisor or trusted firmware, while simultaneously not treating the
> > user-provided frequency as gospel, is obviously incongruous. Sweep the
> > problem under the rug for now to avoid opening a big can of worms that
> > likely doesn't have a great answer.
>
> There is a good answer I think.
>
> tsc_early_khz exists to cater for the overclocking crowd. On their
> modded systems the firmware supplied TSC frequency (CPUID/MSR) is not
> matching reality anymore. So they work around that by supplying a close
> enough tsc_early_khz and then they let the refined calibration work
> figure it out.
>
> Arguably that's only relevant for bare metal systems and what's worse is
> that in virtual environments the refined calibration work can fail,
> which renders the TSC unstable.
>
> So I'd rather say we change this logic to:
>
>     if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
>             tsc_khz = x86_init.....();
>             force(X86_FEATURE_TSC_KNOWN_FREQ);
>     } else if (tsc_early_khz) {
>             ....
>     } else {
>             ...
>     }
>
> Along with:
>
>     if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
>             if (tsc_early_khz)
>                     pr_warn("Ignoring non-sensical tsc_early_khz command line argument\n");
>     }
>
> or something daft like that.
>
> The kernel has for various reasons always tried to cater for the needs
> of users who are plagued by bonkers firmware, but we have to stop
> prioritizing ancient and modded out-of-spec hardware, or treating it as equal.
>
> TBH, I consider that whole KVM clock nonsense to fall into the modded
> out of spec hardware realm. Do a reality check:
>
> How many production systems are out there still which run VMs on CPUs
> with a broken TSC and the lack of VM TSC scaling?
>
> I'm not saying that we should not support the few remaining systems
> anymore, but our tendency to pretend that we can keep all of this
> nonsense working and at the same time making progress is just a fallacy.
I don't know that we can take the KVM (and Xen) clock away from guests,
but all of the *horrid* part about it is the way it attempts to cope
with the possibility that the *host* timekeeping might flip away from
TSC-based mode at any point in time. By the end of my outstanding
cleanup series, that is the *only* thing the gtod_notifier remains for.
If we can trust the hardware *and* the host kernel, then KVM could
theoretically hardwire the kvmclock into 'master clock mode' where it
basically just advertises the TSC→kvmclock relationship *once* to all
CPUs and it never changes.
All the nonsense about updating it every time we enter a CPU could just
go away completely.
* [syzbot ci] Re: vsock/virtio: fix memory leak in virtio_transport_recv_listen
From: syzbot ci @ 2026-06-06 14:24 UTC
To: divyakm, divyakm, horms, kuba, kvm, linux-kernel, netdev, pabeni,
syzbot, virtualization
Cc: syzbot, syzkaller-bugs
In-Reply-To: <20260605191922.12720-1-divyakm@unc.edu>
syzbot ci has tested the following series
[v1] vsock/virtio: fix memory leak in virtio_transport_recv_listen
https://lore.kernel.org/all/20260605191922.12720-1-divyakm@unc.edu
* [PATCH] vsock/virtio: fix memory leak in virtio_transport_recv_listen
and found the following issue:
possible deadlock in virtio_transport_recv_listen
Full report is available here:
https://ci.syzbot.org/series/76f40e62-5a21-46d4-a636-10f0ec9c5040
***
possible deadlock in virtio_transport_recv_listen
tree: net-next
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net-next.git
base: bfa3d89cc15c09f7d1581c834a5ed725189ec19f
arch: amd64
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config: https://ci.syzbot.org/builds/5a94a424-94b1-4b1e-8b06-74d59933368f/config
syz repro: https://ci.syzbot.org/findings/a8224003-11b3-4232-a635-6755ad892149/syz_repro
============================================
WARNING: possible recursive locking detected
syzkaller #0 Not tainted
--------------------------------------------
kworker/0:3/5740 is trying to acquire lock:
ffff88810efb62a0 (sk_lock-AF_VSOCK){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1713 [inline]
ffff88810efb62a0 (sk_lock-AF_VSOCK){+.+.}-{0:0}, at: virtio_transport_recv_listen+0x964/0x2120 net/vmw_vsock/virtio_transport_common.c:1577
but task is already holding lock:
ffff88810efb62a0 (sk_lock-AF_VSOCK){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1713 [inline]
ffff88810efb62a0 (sk_lock-AF_VSOCK){+.+.}-{0:0}, at: virtio_transport_recv_pkt+0xa26/0x2800 net/vmw_vsock/virtio_transport_common.c:1668
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(sk_lock-AF_VSOCK);
lock(sk_lock-AF_VSOCK);
*** DEADLOCK ***
May be due to missing lock nesting notation
4 locks held by kworker/0:3/5740:
#0: ffff8881150d3540 ((wq_completion)vsock-loopback){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3289 [inline]
#0: ffff8881150d3540 ((wq_completion)vsock-loopback){+.+.}-{0:0}, at: process_scheduled_works+0xa35/0x1860 kernel/workqueue.c:3397
#1: ffffc90003fafc40 ((work_completion)(&vsock->pkt_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3290 [inline]
#1: ffffc90003fafc40 ((work_completion)(&vsock->pkt_work)){+.+.}-{0:0}, at: process_scheduled_works+0xa70/0x1860 kernel/workqueue.c:3397
#2: ffff88810efb62a0 (sk_lock-AF_VSOCK){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1713 [inline]
#2: ffff88810efb62a0 (sk_lock-AF_VSOCK){+.+.}-{0:0}, at: virtio_transport_recv_pkt+0xa26/0x2800 net/vmw_vsock/virtio_transport_common.c:1668
#3: ffff88810efb5120 (sk_lock-AF_VSOCK/1){+.+.}-{0:0}, at: virtio_transport_recv_listen+0x83f/0x2120 net/vmw_vsock/virtio_transport_common.c:1564
stack backtrace:
CPU: 0 UID: 0 PID: 5740 Comm: kworker/0:3 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Workqueue: vsock-loopback vsock_loopback_work
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
print_deadlock_bug+0x279/0x290 kernel/locking/lockdep.c:3041
check_deadlock kernel/locking/lockdep.c:3093 [inline]
validate_chain kernel/locking/lockdep.c:3895 [inline]
__lock_acquire+0x253f/0x2cf0 kernel/locking/lockdep.c:5237
lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5868
lock_sock_nested+0x41/0x100 net/core/sock.c:3790
lock_sock include/net/sock.h:1713 [inline]
virtio_transport_recv_listen+0x964/0x2120 net/vmw_vsock/virtio_transport_common.c:1577
virtio_transport_recv_pkt+0x1570/0x2800 net/vmw_vsock/virtio_transport_common.c:1692
vsock_loopback_work+0x32c/0x3f0 net/vmw_vsock/vsock_loopback.c:142
process_one_work kernel/workqueue.c:3314 [inline]
process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397
worker_thread+0xa53/0xfc0 kernel/workqueue.c:3478
kthread+0x389/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).
The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.
^ permalink raw reply
* [PATCH] virtio-blk: clamp zone report to the report buffer capacity
From: Michael Bommarito @ 2026-06-06 17:04 UTC (permalink / raw)
To: Michael S . Tsirkin, Jason Wang, Stefan Hajnoczi, Jens Axboe
Cc: Xuan Zhuo, virtualization, linux-block, linux-kernel
virtblk_report_zones() trusts the device-reported number of zones when
walking the report buffer:
nz = min_t(u64, virtio64_to_cpu(vblk->vdev, report->nr_zones),
nr_zones);
...
for (i = 0; i < nz && zone_idx < nr_zones; i++) {
ret = virtblk_parse_zone(vblk, &report->zones[i], ...);
The buffer is allocated by virtblk_alloc_report_buffer(), whose size is
capped by the queue's max hardware sectors and max segments and can
therefore hold fewer descriptors than nr_zones. nz is bounded only by
the device-supplied report->nr_zones and the requested nr_zones, never
by the buffer's descriptor capacity. At probe time the request count is
unbounded (blk_revalidate_disk_zones() calls report_zones() with
nr_zones == UINT_MAX), so the device-supplied report->nr_zones is the
sole gate: a device that reports more zones than fit in the buffer
drives the loop to read report->zones[i] past the end of the allocation.
A malicious or buggy virtio-blk device that reports an inflated nr_zones
triggers this during zone revalidation at probe. KASAN reports a
vmalloc-out-of-bounds read in virtblk_report_zones() against the report
buffer allocated a few lines earlier.
Clamp nz to the number of descriptors that actually fit in the report
buffer.
Fixes: 95bfec41bd3d ("virtio-blk: add support for zoned block devices")
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
---
drivers/block/virtio_blk.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index b1c9a27fe00f3..d50aaf956d558 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -689,6 +689,14 @@ static int virtblk_report_zones(struct gendisk *disk, sector_t sector,
nz = min_t(u64, virtio64_to_cpu(vblk->vdev, report->nr_zones),
nr_zones);
+ /*
+ * The device-reported nr_zones is untrusted; clamp it to the
+ * number of descriptors that actually fit in the report buffer
+ * so a malicious or buggy device cannot drive the parse loop
+ * past the allocation.
+ */
+ nz = min_t(u64, nz,
+ (buflen - sizeof(*report)) / sizeof(report->zones[0]));
if (!nz)
break;
base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8
--
2.53.0
^ permalink raw reply related
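The clamp in the patch above is plain capacity arithmetic: subtract the report header from the buffer length and divide by the descriptor size. A minimal standalone sketch of that calculation is below. The struct layouts and the names zone_desc, zone_report, report_capacity() and clamp_nr_zones() are hypothetical stand-ins rather than the driver's own types, and the guard against a buffer shorter than the header is an addition that the patch does not contain:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical report layout: a header followed by fixed-size descriptors. */
struct zone_desc {
	uint64_t start;
	uint64_t len;
};

struct zone_report {
	uint64_t nr_zones;		/* count claimed by the device */
	struct zone_desc zones[];	/* descriptors that follow the header */
};

/* Number of whole descriptors that fit in a report buffer of buflen bytes. */
static uint64_t report_capacity(size_t buflen)
{
	/* Extra guard (not in the patch): avoid unsigned underflow below. */
	if (buflen < sizeof(struct zone_report))
		return 0;
	return (buflen - sizeof(struct zone_report)) / sizeof(struct zone_desc);
}

/*
 * Bound the loop count by all three limits: what the device reported,
 * what the caller requested, and what the buffer can actually hold.
 */
static uint64_t clamp_nr_zones(uint64_t reported, uint64_t requested,
			       size_t buflen)
{
	uint64_t nz = reported < requested ? reported : requested;
	uint64_t cap = report_capacity(buflen);

	return nz < cap ? nz : cap;
}
```

With these hypothetical sizes (8-byte header, 16-byte descriptors), a 4096-byte buffer holds 255 descriptors, so an inflated device count combined with an unbounded request is still capped at 255.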
* Re: [PATCH] virtio-blk: clamp zone report to the report buffer capacity
From: Michael S. Tsirkin @ 2026-06-07 2:23 UTC (permalink / raw)
To: Michael Bommarito
Cc: Jason Wang, Stefan Hajnoczi, Jens Axboe, Xuan Zhuo,
virtualization, linux-block, linux-kernel
In-Reply-To: <20260606170415.1523660-1-michael.bommarito@gmail.com>
On Sat, Jun 06, 2026 at 01:04:15PM -0400, Michael Bommarito wrote:
> virtblk_report_zones() trusts the device-reported number of zones when
> walking the report buffer:
>
> nz = min_t(u64, virtio64_to_cpu(vblk->vdev, report->nr_zones),
> nr_zones);
> ...
> for (i = 0; i < nz && zone_idx < nr_zones; i++) {
> ret = virtblk_parse_zone(vblk, &report->zones[i], ...);
>
> The buffer is allocated by virtblk_alloc_report_buffer(), whose size is
> capped by the queue's max hardware sectors and max segments and can
> therefore hold fewer descriptors than nr_zones. nz is bounded only by
> the device-supplied report->nr_zones and the requested nr_zones, never
> by the buffer's descriptor capacity. At probe time the request count is
> unbounded (blk_revalidate_disk_zones() calls report_zones() with
> nr_zones == UINT_MAX), so the device-supplied report->nr_zones is the
> sole gate: a device that reports more zones than fit in the buffer
> drives the loop to read report->zones[i] past the end of the allocation.
>
> A malicious or buggy virtio-blk device that reports an inflated nr_zones
> triggers this during zone revalidation at probe. KASAN reports a
> vmalloc-out-of-bounds read in virtblk_report_zones() against the report
> buffer allocated a few lines earlier.
>
> Clamp nz to the number of descriptors that actually fit in the report
> buffer.
>
> Fixes: 95bfec41bd3d ("virtio-blk: add support for zoned block devices")
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> ---
> drivers/block/virtio_blk.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index b1c9a27fe00f3..d50aaf956d558 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -689,6 +689,14 @@ static int virtblk_report_zones(struct gendisk *disk, sector_t sector,
>
> nz = min_t(u64, virtio64_to_cpu(vblk->vdev, report->nr_zones),
I think nr_zones should have been le64, not virtio64.
> nr_zones);
> + /*
> + * The device-reported nr_zones is untrusted;
this part depends on the config. just drop it.
> clamp it to the
> + * number of descriptors that actually fit in the report buffer
> + * so a malicious or buggy device cannot drive the parse loop
> + * past the allocation.
> + */
> + nz = min_t(u64, nz,
> + (buflen - sizeof(*report)) / sizeof(report->zones[0]));
> if (!nz)
> break;
>
>
> base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8
> --
> 2.53.0
^ permalink raw reply
* Re: [PATCH v1] drm/virtio: Fix driver removal with disabled KMS
From: Ryosuke Yasuoka @ 2026-06-07 4:31 UTC (permalink / raw)
To: Dmitry Osipenko, David Airlie, Gerd Hoffmann, Gurchetan Singh,
Chia-I Wu
Cc: dri-devel, virtualization, linux-kernel
In-Reply-To: <20260604122743.13383-1-dmitry.osipenko@collabora.com>
Hi Dmitry
On 04/06/2026 15:27, Dmitry Osipenko wrote:
> DRM atomic and modesetting aren't initialized if virtio-gpu driver built
> with disabled KMS, leading to access of uninitialized data on driver
> removal/unbinding and crashing kernel. Fix it by skipping shutting down
> atomic core with unavailable KMS.
>
> Fixes: 72122c69d717 ("drm/virtio: Add option to disable KMS support")
> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
> ---
> drivers/gpu/drm/virtio/virtgpu_drv.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.c b/drivers/gpu/drm/virtio/virtgpu_drv.c
> index f0fb784c0f6f..2aaa7cb08085 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_drv.c
> +++ b/drivers/gpu/drm/virtio/virtgpu_drv.c
> @@ -138,7 +138,10 @@ static void virtio_gpu_remove(struct virtio_device *vdev)
>
> virtio_gpu_release_vqs(dev);
> drm_dev_unplug(dev);
> - drm_atomic_helper_shutdown(dev);
> +
> + if (drm_core_check_feature(dev, DRIVER_ATOMIC))
> + drm_atomic_helper_shutdown(dev);
> +
> virtio_gpu_deinit(dev);
> drm_dev_put(dev);
> }
The patch looks good to me at a glance. I haven't done a full, deep code
review yet, but I've tested it in my lab and everything works as
expected.
Tested-by: Ryosuke Yasuoka <ryasuoka@redhat.com>
test info
base-commit: ce73a5db44e3d5f9c0c061f0868ae209b59605f1 (drm-misc)
Removing the virtio-gpu device with the following command produces the
panic message shown below.
# echo 1 > /sys/devices/pci0000\:00/0000\:00\:01.0/remove
[ 330.048794] ------------[ cut here ]------------
[ 330.050023] WARNING: drivers/gpu/drm/drm_modeset_lock.c:319 at drm_modeset_lock+0x118/0x120, CPU#5: bash/22216
[ 330.052581] Modules linked in: rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables sunrpc qrtr vfat snd_hda_codec_generic fat intel_rapl_msr intel_rapl_common snd_hda_intel snd_intel_dspcfg snd_hda_codec kvm_amd iTCO_wdt intel_pmc_bxt snd_hwdep kvm snd_hda_core snd_pcm i2c_i801 irqbypass i2c_smbus snd_timer snd lpc_ich soundcore virtio_balloon virtio_net net_failover failover joydev dm_multipath loop nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vmw_vmci zram lz4hc_compress vsock lz4_compress virtio_scsi virtio_gpu virtio_dma_buf serio_raw scsi_dh_rdac scsi_dh_emc scsi_dh_alua i2c_dev qemu_fw_cfg virtiofs fuse
[ 330.067477] CPU: 5 UID: 0 PID: 22216 Comm: bash Not tainted 7.1.0-rc5 #9 PREEMPT(lazy)
[ 330.069058] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20251119-3.fc43 11/19/2025
[ 330.070880] RIP: 0010:drm_modeset_lock+0x118/0x120
[ 330.071858] Code: e8 ad e3 48 ff 85 c0 75 c7 b8 f0 ff ff ff eb c2 0f 0b e9 22 ff ff ff e8 06 82 5e 00 4c 8b 04 24 48 8b 4c 24 08 e9 40 ff ff ff <0f> 0b e9 4e ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[ 330.075466] RSP: 0018:ffffd2f38ea6baf8 EFLAGS: 00010286
[ 330.076522] RAX: 0000000000000000 RBX: ffffd2f38ea6bb78 RCX: ffffd2f38ea6bb78
[ 330.077939] RDX: ffff8ea806fd8000 RSI: ffffd2f38ea6bb78 RDI: ffff8ea8016ce190
[ 330.079362] RBP: ffffd2f38ea6bb78 R08: ffff8ea8016ce170 R09: 000000000fe6d5f1
[ 330.080785] R10: fffff93844235700 R11: ffff8ea80004f800 R12: 0000000000000000
[ 330.082410] R13: ffff8ea8016ce000 R14: 0000000000000090 R15: ffff8ea8017180d0
[ 330.083787] FS: 00007f202225f780(0000) GS:ffff8ea9ecea0000(0000) knlGS:0000000000000000
[ 330.085395] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 330.086553] CR2: 00005567379b41a0 CR3: 00000001a100e000 CR4: 0000000000750ef0
[ 330.087980] PKRU: 55555554
[ 330.088556] Call Trace:
[ 330.089041] <TASK>
[ 330.089480] ? rcutree_enqueue.isra.0+0x1e/0xe0
[ 330.090365] drm_modeset_lock_all_ctx+0x29/0x3f0
[ 330.091274] ? mnt_get_count+0x4d/0xa0
[ 330.091988] ? __destroy_inode+0x8a/0x180
[ 330.092799] drm_atomic_helper_shutdown+0x7b/0x120
[ 330.093779] virtio_gpu_remove+0x57/0x70 [virtio_gpu]
[ 330.094784] virtio_dev_remove+0x3f/0x90
[ 330.095595] device_release_driver_internal+0x19e/0x200
[ 330.096626] bus_remove_device+0xe4/0x1c0
[ 330.097425] ? srso_alias_return_thunk+0x5/0xfbef5
[ 330.098390] ? device_remove_attrs+0xa1/0x100
[ 330.099224] device_del+0x160/0x3d0
[ 330.099937] ? srso_alias_return_thunk+0x5/0xfbef5
[ 330.100881] ? pci_bus_read_config_dword+0x4c/0x80
[ 330.101842] device_unregister+0x17/0x70
[ 330.102629] unregister_virtio_device+0x15/0x30
[ 330.103555] virtio_pci_remove+0x3f/0x80
[ 330.104340] pci_device_remove+0x4a/0xc0
[ 330.105094] device_release_driver_internal+0x19e/0x200
[ 330.106142] pci_stop_bus_device+0x63/0x80
[ 330.106995] pci_stop_and_remove_bus_device_locked+0x1a/0x30
[ 330.108113] remove_store+0x83/0xa0
[ 330.108830] kernfs_fop_write_iter+0x147/0x200
[ 330.109773] vfs_write+0x25d/0x480
[ 330.110483] ksys_write+0x73/0xf0
[ 330.111140] do_syscall_64+0xe2/0x560
[ 330.112093] ? srso_alias_return_thunk+0x5/0xfbef5
[ 330.113066] ? irqentry_exit+0x40/0x6c0
[ 330.113858] ? srso_alias_return_thunk+0x5/0xfbef5
[ 330.114800] ? do_syscall_64+0x99/0x560
[ 330.115551] ? exc_page_fault+0x82/0x1d0
[ 330.116318] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 330.117286] RIP: 0033:0x7f20222d0bbe
[ 330.117969] Code: 4d 89 d8 e8 34 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 13 ff ff ff 0f 1f 00 f3 0f 1e fa
[ 330.121352] RSP: 002b:00007ffdb43331f0 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 330.122797] RAX: ffffffffffffffda RBX: 00007f202244c5c0 RCX: 00007f20222d0bbe
[ 330.124122] RDX: 0000000000000002 RSI: 00005567379b41a0 RDI: 0000000000000001
[ 330.125459] RBP: 00007ffdb4333200 R08: 0000000000000000 R09: 0000000000000000
[ 330.126839] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
[ 330.128205] R13: 0000000000000002 R14: 00005567379b41a0 R15: 00005567379c3370
[ 330.129605] </TASK>
[ 330.130073] ---[ end trace 0000000000000000 ]---
[ 330.130987] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 330.131963] #PF: supervisor write access in kernel mode
[ 330.131963] #PF: error_code(0x0002) - not-present page
[ 330.131963] PGD 0 P4D 0
[ 330.131963] Oops: Oops: 0002 [#1] SMP NOPTI
[ 330.131963] CPU: 5 UID: 0 PID: 22216 Comm: bash Tainted: G W 7.1.0-rc5 #9 PREEMPT(lazy)
[ 330.131963] Tainted: [W]=WARN
[ 330.131963] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20251119-3.fc43 11/19/2025
[ 330.131963] RIP: 0010:ww_mutex_lock+0x38/0x110
[ 330.131963] Code: 89 f4 55 53 48 89 fb 48 83 ec 20 65 48 8b 2d 57 93 21 02 48 89 6c 24 18 31 ed 2e 2e 2e 31 c0 65 48 8b 15 53 93 21 02 48 89 e8 <f0> 48 0f b1 13 75 53 4d 85 e4 74 2a 48 8d 6c 24 08 41 83 44 24 10
[ 330.131963] RSP: 0018:ffffd2f38ea6bae0 EFLAGS: 00010246
[ 330.131963] RAX: 0000000000000000 RBX: 0000000000000018 RCX: ffffd2f38ea6bb78
[ 330.131963] RDX: ffff8ea806fd8000 RSI: ffffd2f38ea6bb78 RDI: 0000000000000018
[ 330.131963] RBP: 0000000000000000 R08: ffff8ea8016ce170 R09: 000000000fe6d5f1
[ 330.131963] R10: fffff93844235700 R11: ffff8ea80004f800 R12: ffffd2f38ea6bb78
[ 330.131963] R13: ffff8ea8016ce000 R14: 0000000000000000 R15: 0000000000000018
[ 330.131963] FS: 00007f202225f780(0000) GS:ffff8ea9ecea0000(0000) knlGS:0000000000000000
[ 330.131963] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 330.131963] CR2: 0000000000000018 CR3: 00000001a100e000 CR4: 0000000000750ef0
[ 330.131963] PKRU: 55555554
[ 330.131963] Call Trace:
[ 330.131963] <TASK>
[ 330.131963] ? srso_alias_return_thunk+0x5/0xfbef5
[ 330.131963] ? drm_modeset_lock+0x92/0x120
[ 330.131963] drm_modeset_lock_all_ctx+0x8e/0x3f0
[ 330.131963] ? __destroy_inode+0x8a/0x180
[ 330.131963] drm_atomic_helper_shutdown+0x7b/0x120
[ 330.131963] virtio_gpu_remove+0x57/0x70 [virtio_gpu]
[ 330.131963] virtio_dev_remove+0x3f/0x90
[ 330.131963] device_release_driver_internal+0x19e/0x200
[ 330.131963] bus_remove_device+0xe4/0x1c0
[ 330.131963] ? srso_alias_return_thunk+0x5/0xfbef5
[ 330.131963] ? device_remove_attrs+0xa1/0x100
[ 330.131963] device_del+0x160/0x3d0
[ 330.131963] ? srso_alias_return_thunk+0x5/0xfbef5
[ 330.131963] ? pci_bus_read_config_dword+0x4c/0x80
[ 330.131963] device_unregister+0x17/0x70
[ 330.131963] unregister_virtio_device+0x15/0x30
[ 330.131963] virtio_pci_remove+0x3f/0x80
[ 330.131963] pci_device_remove+0x4a/0xc0
[ 330.131963] device_release_driver_internal+0x19e/0x200
[ 330.131963] pci_stop_bus_device+0x63/0x80
[ 330.131963] pci_stop_and_remove_bus_device_locked+0x1a/0x30
[ 330.131963] remove_store+0x83/0xa0
[ 330.131963] kernfs_fop_write_iter+0x147/0x200
[ 330.131963] vfs_write+0x25d/0x480
[ 330.131963] ksys_write+0x73/0xf0
[ 330.131963] do_syscall_64+0xe2/0x560
[ 330.131963] ? srso_alias_return_thunk+0x5/0xfbef5
[ 330.131963] ? irqentry_exit+0x40/0x6c0
[ 330.131963] ? srso_alias_return_thunk+0x5/0xfbef5
[ 330.131963] ? do_syscall_64+0x99/0x560
[ 330.131963] ? exc_page_fault+0x82/0x1d0
[ 330.131963] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 330.131963] RIP: 0033:0x7f20222d0bbe
[ 330.131963] Code: 4d 89 d8 e8 34 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 13 ff ff ff 0f 1f 00 f3 0f 1e fa
[ 330.131963] RSP: 002b:00007ffdb43331f0 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 330.131963] RAX: ffffffffffffffda RBX: 00007f202244c5c0 RCX: 00007f20222d0bbe
[ 330.131963] RDX: 0000000000000002 RSI: 00005567379b41a0 RDI: 0000000000000001
[ 330.131963] RBP: 00007ffdb4333200 R08: 0000000000000000 R09: 0000000000000000
[ 330.131963] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
[ 330.131963] R13: 0000000000000002 R14: 00005567379b41a0 R15: 00005567379c3370
[ 330.131963] </TASK>
[ 330.131963] Modules linked in: rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables sunrpc qrtr vfat snd_hda_codec_generic fat intel_rapl_msr intel_rapl_common snd_hda_intel snd_intel_dspcfg snd_hda_codec kvm_amd iTCO_wdt intel_pmc_bxt snd_hwdep kvm snd_hda_core snd_pcm i2c_i801 irqbypass i2c_smbus snd_timer snd lpc_ich soundcore virtio_balloon virtio_net net_failover failover joydev dm_multipath loop nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vmw_vmci zram lz4hc_compress vsock lz4_compress virtio_scsi virtio_gpu virtio_dma_buf serio_raw scsi_dh_rdac scsi_dh_emc scsi_dh_alua i2c_dev qemu_fw_cfg virtiofs fuse
[ 330.131963] CR2: 0000000000000018
[ 330.131963] ---[ end trace 0000000000000000 ]---
[ 330.131963] RIP: 0010:ww_mutex_lock+0x38/0x110
[ 330.131963] Code: 89 f4 55 53 48 89 fb 48 83 ec 20 65 48 8b 2d 57 93 21 02 48 89 6c 24 18 31 ed 2e 2e 2e 31 c0 65 48 8b 15 53 93 21 02 48 89 e8 <f0> 48 0f b1 13 75 53 4d 85 e4 74 2a 48 8d 6c 24 08 41 83 44 24 10
[ 330.131963] RSP: 0018:ffffd2f38ea6bae0 EFLAGS: 00010246
[ 330.131963] RAX: 0000000000000000 RBX: 0000000000000018 RCX: ffffd2f38ea6bb78
[ 330.131963] RDX: ffff8ea806fd8000 RSI: ffffd2f38ea6bb78 RDI: 0000000000000018
[ 330.131963] RBP: 0000000000000000 R08: ffff8ea8016ce170 R09: 000000000fe6d5f1
[ 330.131963] R10: fffff93844235700 R11: ffff8ea80004f800 R12: ffffd2f38ea6bb78
[ 330.131963] R13: ffff8ea8016ce000 R14: 0000000000000000 R15: 0000000000000018
[ 330.225288] FS: 00007f202225f780(0000) GS:ffff8ea9ecea0000(0000) knlGS:0000000000000000
[ 330.225288] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 330.225288] CR2: 0000000000000018 CR3: 00000001a100e000 CR4: 0000000000750ef0
[ 330.225288] PKRU: 55555554
[ 330.225288] note: bash[22216] exited with irqs disabled
Best regards,
Ryosuke
^ permalink raw reply
* [PATCH] drm/virtio: fix dma_fence refcount leak on error in virtio_gpu_dma_fence_wait()
From: Wentao Liang @ 2026-06-07 9:03 UTC (permalink / raw)
To: airlied, kraxel, dmitry.osipenko, maarten.lankhorst, mripard,
tzimmermann, simona
Cc: gurchetansingh, olvaffe, dri-devel, virtualization, linux-kernel,
Wentao Liang, stable
dma_fence_unwrap_for_each() internally calls dma_fence_unwrap_first()
which does cursor->chain = dma_fence_get(head), taking an extra
reference. On normal loop completion, dma_fence_unwrap_next()
releases this via dma_fence_chain_walk() -> dma_fence_put().
When virtio_gpu_do_fence_wait() fails and the function returns early
from inside the loop, the cursor->chain reference is never released.
This is the only caller in the entire kernel that does an early return
inside dma_fence_unwrap_for_each.
Add dma_fence_put(itr.chain) before the early return.
Cc: stable@vger.kernel.org
Fixes: eba57fb5498f ("drm/virtio: Wait for each dma-fence of in-fence array individually")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
---
drivers/gpu/drm/virtio/virtgpu_submit.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/virtio/virtgpu_submit.c b/drivers/gpu/drm/virtio/virtgpu_submit.c
index dae761fa5788..32cb1e4aa425 100644
--- a/drivers/gpu/drm/virtio/virtgpu_submit.c
+++ b/drivers/gpu/drm/virtio/virtgpu_submit.c
@@ -65,8 +65,10 @@ static int virtio_gpu_dma_fence_wait(struct virtio_gpu_submit *submit,
dma_fence_unwrap_for_each(f, &itr, fence) {
err = virtio_gpu_do_fence_wait(submit, f);
- if (err)
+ if (err) {
+ dma_fence_put(itr.chain);
return err;
+ }
}
return 0;
--
2.34.1
^ permalink raw reply related
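The leak described in the patch above is a general hazard of iterators that pin an object for the duration of the walk: leaving the loop early skips the step that would drop the pin. A minimal userspace model of that pattern is sketched below. Every name in it (struct ref, struct cursor, cursor_for_each(), wait_all(), leaked_refs()) is hypothetical and only mimics the behaviour the commit message attributes to dma_fence_unwrap_for_each(); it is not the dma-fence API:

```c
/* Toy reference-counted object (hypothetical, not struct dma_fence). */
struct ref {
	int count;
};

static struct ref *ref_get(struct ref *r)
{
	r->count++;
	return r;
}

static void ref_put(struct ref *r)
{
	r->count--;
}

/* Cursor that pins the container for as long as the walk is in progress. */
struct cursor {
	struct ref *chain;
	int pos;
	int n;
};

/* Advance; on exhaustion drop the pinned reference and return -1. */
static int cursor_next(struct cursor *c)
{
	if (++c->pos < c->n)
		return c->pos;
	ref_put(c->chain);
	c->chain = 0;
	return -1;
}

/* Start walking: take the extra reference, then yield the first index. */
static int cursor_first(struct cursor *c, struct ref *head, int n)
{
	c->chain = ref_get(head);
	c->pos = -1;
	c->n = n;
	return cursor_next(c);
}

#define cursor_for_each(i, c, head, n) \
	for ((i) = cursor_first((c), (head), (n)); (i) >= 0; \
	     (i) = cursor_next(c))

/* Walk n items and fail at index fail_at; put_on_error selects the fix. */
static int wait_all(struct ref *head, int n, int fail_at, int put_on_error)
{
	struct cursor c;
	int i;

	cursor_for_each(i, &c, head, n) {
		if (i == fail_at) {
			if (put_on_error)
				ref_put(c.chain);	/* the fix: unpin first */
			return -1;
		}
	}
	return 0;
}

/* Number of references left behind after a failing walk. */
static int leaked_refs(int put_on_error)
{
	struct ref head = { .count = 1 };

	wait_all(&head, 3, 1, put_on_error);
	return head.count - 1;
}
```

In this model leaked_refs(0) reports one stray reference and leaked_refs(1) reports none, which is the difference the one-line dma_fence_put() in the patch is meant to make.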
* [PATCH v2] virtio-blk: clamp zone report to the report buffer capacity
From: Michael Bommarito @ 2026-06-07 12:48 UTC (permalink / raw)
To: Michael S . Tsirkin, Jason Wang, Stefan Hajnoczi, Jens Axboe
Cc: Xuan Zhuo, virtualization, linux-block, linux-kernel
virtblk_report_zones() trusts the device-reported number of zones when
walking the report buffer:
nz = min_t(u64, virtio64_to_cpu(vblk->vdev, report->nr_zones),
nr_zones);
...
for (i = 0; i < nz && zone_idx < nr_zones; i++) {
ret = virtblk_parse_zone(vblk, &report->zones[i], ...);
The buffer is allocated by virtblk_alloc_report_buffer(), whose size is
capped by the queue's max hardware sectors and max segments and can
therefore hold fewer descriptors than nr_zones. nz is bounded only by
the device-supplied report->nr_zones and the requested nr_zones, never
by the buffer's descriptor capacity. At probe time the request count is
unbounded (blk_revalidate_disk_zones() calls report_zones() with
nr_zones == UINT_MAX), so the device-supplied report->nr_zones is the
sole gate: a device that reports more zones than fit in the buffer
drives the loop to read report->zones[i] past the end of the allocation.
A malicious or buggy virtio-blk device that reports an inflated nr_zones
triggers this during zone revalidation at probe. KASAN reports a
vmalloc-out-of-bounds read in virtblk_report_zones() against the report
buffer allocated a few lines earlier.
Clamp nz to the number of descriptors that actually fit in the report
buffer.
Fixes: 95bfec41bd3d ("virtio-blk: add support for zoned block devices")
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
---
v2: drop the explanatory comment per Michael S. Tsirkin's review; the
clamp itself is unchanged.
drivers/block/virtio_blk.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index b1c9a27..32bf3ba 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -689,6 +689,8 @@ static int virtblk_report_zones(struct gendisk *disk, sector_t sector,
nz = min_t(u64, virtio64_to_cpu(vblk->vdev, report->nr_zones),
nr_zones);
+ nz = min_t(u64, nz,
+ (buflen - sizeof(*report)) / sizeof(report->zones[0]));
if (!nz)
break;
base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8
--
2.53.0
^ permalink raw reply related
* [PATCH v1] i2c: virtio: wait uninterruptibly for completions to avoid UAF
From: Gavin Li @ 2026-06-07 14:36 UTC (permalink / raw)
To: linux-i2c, viresh.kumar
Cc: Chen, Jian Jun, andi.shyti, virtualization, Gavin Li
virtio_i2c_complete_reqs() uses wait_for_completion_interruptible() and stops
waiting when a signal arrives. virtio_i2c_xfer() then frees reqs and the
per-request DMA bounce buffers while the device may still hold virtqueue tokens
pointing at &reqs[i] and DMA into read bounce buffers. Additionally, when the
device later completes those requests, virtio_i2c_msg_done() calls complete()
on freed memory and can corrupt the slab freelist.
Wait uninterruptibly for every completion before freeing reqs. This
matches how other virtio drivers retain request storage until the device
completes it. The virtio spec unfortunately does not provide a way to
cancel an in-flight request, so waiting uninterruptibly is required.
Signed-off-by: Gavin Li <gavin.li@samsara.com>
---
drivers/i2c/busses/i2c-virtio.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/drivers/i2c/busses/i2c-virtio.c b/drivers/i2c/busses/i2c-virtio.c
index 5da6fef92bec3..12acc049f5999 100644
--- a/drivers/i2c/busses/i2c-virtio.c
+++ b/drivers/i2c/busses/i2c-virtio.c
@@ -116,14 +116,13 @@ static int virtio_i2c_complete_reqs(struct virtqueue *vq,
for (i = 0; i < num; i++) {
struct virtio_i2c_req *req = &reqs[i];
- if (!failed) {
- if (wait_for_completion_interruptible(&req->completion))
- failed = true;
- else if (req->in_hdr.status != VIRTIO_I2C_MSG_OK)
- failed = true;
- else
- j++;
- }
+ /* Wait uninterruptibly: device still owns token/bounce buf until completion. */
+ wait_for_completion(&req->completion);
+
+ if (!failed)
+ failed = req->in_hdr.status != VIRTIO_I2C_MSG_OK;
+ if (!failed)
+ j++;
i2c_put_dma_safe_msg_buf(reqs[i].buf, &msgs[i], !failed);
}
--
2.54.0
^ permalink raw reply related
* [PATCH v2] tools/virtio: check mmap return value in vringh_test
From: longlong yan @ 2026-06-08 1:19 UTC (permalink / raw)
To: mst; +Cc: eperezma, jasowang, linux-kernel, virtualization, xuanzhuo,
yanlonglong
In-Reply-To: <20260605060404-mutt-send-email-mst@kernel.org>
In parallel_test(), the return values of mmap() for both host_map and
guest_map are not checked against MAP_FAILED. If mmap() fails, the
subsequent code will dereference the invalid pointer, leading to a
segmentation fault.
Add MAP_FAILED checks after both mmap() calls, using err() to report
the error and exit, consistent with the existing error handling style
in this file (e.g., the open() call on line 149).
Fixes: 1515c5ce26ae ("tools/virtio: add vring_test.")
Signed-off-by: longlong yan <yanlonglong@kylinos.cn>
---
tools/virtio/vringh_test.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/virtio/vringh_test.c b/tools/virtio/vringh_test.c
index b9591223437a..5ea6d29bc992 100644
--- a/tools/virtio/vringh_test.c
+++ b/tools/virtio/vringh_test.c
@@ -159,7 +159,12 @@ static int parallel_test(u64 features,
/* Parent and child use separate addresses, to check our mapping logic! */
host_map = mmap(NULL, mapsize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+ if (host_map == MAP_FAILED)
+ err(1, "mmap host_map");
+
guest_map = mmap(NULL, mapsize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+ if (guest_map == MAP_FAILED)
+ err(1, "mmap guest_map");
pipe_ret = pipe(to_guest);
assert(!pipe_ret);
--
2.43.0
^ permalink raw reply related
* Re: [PATCH v1] i2c: virtio: wait uninterruptibly for completions to avoid UAF
From: Viresh Kumar @ 2026-06-08 3:44 UTC (permalink / raw)
To: Gavin Li; +Cc: linux-i2c, Chen, Jian Jun, andi.shyti, virtualization
In-Reply-To: <20260607143608.76122-1-gavin.li@samsara.com>
On 07-06-26, 10:36, Gavin Li wrote:
> virtio_i2c_complete_reqs() uses wait_for_completion_interruptible() and stops
> waiting when a signal arrives. virtio_i2c_xfer() then frees reqs and the
> per-request DMA bounce buffers while the device may still hold virtqueue tokens
> pointing at &reqs[i] and DMA into read bounce buffers. Additionally, when the
> device later completes those requests, virtio_i2c_msg_done() calls complete()
> on freed memory and can corrupt the slab freelist.
>
> Wait uninterruptibly for every completion before freeing reqs. This
> matches how other virtio drivers retain request storage until the device
> completes it. The virtio spec unfortunately does not provide a way to
> cancel an in-flight request, so waiting uninterruptibly is required.
>
> Signed-off-by: Gavin Li <gavin.li@samsara.com>
> ---
> drivers/i2c/busses/i2c-virtio.c | 15 +++++++--------
> 1 file changed, 7 insertions(+), 8 deletions(-)
This is a revert of the following commit (it may be better to mention that in the commit log):
commit a663b3c47ab1 ("i2c: virtio: Avoid hang by using interruptible completion wait")
I don't think this is the right approach here. We shouldn't hang the kernel
indefinitely if the other side is dead.
--
viresh
^ permalink raw reply
* Re: [PATCH] VIRTIO: Update the desc 'flag' fied last in packed ring.
From: Eugenio Perez Martin @ 2026-06-08 8:08 UTC (permalink / raw)
To: Si-Wei Liu
Cc: Michael S. Tsirkin, yangjiale, Jason Wang, Xuan Zhuo,
virtualization, linux-kernel, Andrew.Boyer, Dragos Tatulea DE
In-Reply-To: <5e1a10bd-2783-42ba-b443-853f12159756@oracle.com>
On Fri, Jun 5, 2026 at 8:51 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 6/5/2026 10:43 AM, Michael S. Tsirkin wrote:
> > On Fri, Jun 05, 2026 at 09:03:36AM -0700, Si-Wei Liu wrote:
> >>
> >> On 6/1/2026 11:04 PM, Eugenio Perez Martin wrote:
> >>> On Tue, Jun 2, 2026 at 6:34 AM yangjiale <yangjiale133@163.com> wrote:
> >>>> When a descriptor list spans across cache lines,
> >>>> updating the flag first can lead to a scenario where the device side
> >>>> perceives the flag as valid, yet the corresponding address and length
> >>>> fields remain unupdated—resulting in invalid values.
> >>>> Therefore, the flag field must be updated last.
> >>>>
> >>>> Signed-off-by: yangjiale <yangjiale133@163.com>
> >>>> ---
> >>>> drivers/virtio/virtio_ring.c | 8 ++++----
> >>>> 1 file changed, 4 insertions(+), 4 deletions(-)
> >>>>
> >>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> >>>> index fbca7ce1c6bf..036b4f90d30f 100644
> >>>> --- a/drivers/virtio/virtio_ring.c
> >>>> +++ b/drivers/virtio/virtio_ring.c
> >>>> @@ -1688,6 +1688,10 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
> >>>> &addr, &len, premapped, attr))
> >>>> goto unmap_release;
> >>>>
> >>>> + desc[i].addr = cpu_to_le64(addr);
> >>>> + desc[i].len = cpu_to_le32(len);
> >>>> + desc[i].id = cpu_to_le16(id);
> >>>> +
> >>>> flags = cpu_to_le16(vq->packed.avail_used_flags |
> >>>> (++c == total_sg ? 0 : VRING_DESC_F_NEXT) |
> >>>> (n < out_sgs ? 0 : VRING_DESC_F_WRITE));
> >>>> @@ -1696,10 +1700,6 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
> >>>> else
> >>>> desc[i].flags = flags;
> >>>>
> >>>> - desc[i].addr = cpu_to_le64(addr);
> >>>> - desc[i].len = cpu_to_le32(len);
> >>>> - desc[i].id = cpu_to_le16(id);
> >>>> -
> >>>> if (unlikely(vq->use_map_api)) {
> >>>> vq->packed.desc_extra[curr].addr = premapped ?
> >>>> DMA_MAPPING_ERROR : addr;
> >>> These flags are updated before the flags of the head descriptor at the
> >>> end of the function, at "vq->packed.vring.desc[head].flags =
> >>> head_flags", so the device should not see these. Because of that, the
> >>> relative order between the rest of the fields of the same descriptor
> >>> or other descriptors' fields, except for the head descriptor's flags,
> >>> should not matter. There is a write memory barrier just before
> >>> updating the head's flags.
> >> The above analysis is absolutely correct. Though one hardware vendor told me
> >> that this driver implementation kinda stops them from reading ahead of
> >> descriptors already posted beyond the available index, ending up with
> >> suboptimal performance that is hard to make up by other means. Would it be a
> >> bad idea to go with this change and add write barrier in a gentle way for a
> >> small flit in the batch, e.g. commit to memory after every cache line size
> >> worth of descriptors are posted? Would the memory barrier have negative
> >> performance overhead to other backend implementation variants than real
> >> hardware PCI device?
> >>
> >> -Siwei
> > this would need a new feature bit, won't it?
> Probably. This is to capture the device's expectation and behavior,
> right? The driver change itself does not violate the spec...
>
> >
> >>> Also, I don't get why the cache line matters here. Can you expand? Am
> >>> I missing something?
> > me too.
> >
> Just to avoid extra delay due to excessive coherency messages and
> frequent cache thrashing: device reads over the PCI bus contend with
> host writes/updates to descriptors in the same cache line.
>
Whether the descriptors are in the same cache line or not, how does
the device know that the memory for the other descriptors is updated
or dirty and needs to be read again? The only way I can imagine is to
force both the device and the driver to update the flags of all
descriptors, regardless of whether they're in a chain, and use that
information for synchronization. Also, the device must read these
flags strictly before the other members, as it does with the head's
flag now. I'm not sure if that memory dance beats the PCI latency.
I thought of something similar, not for cache thrashing but to save
the overhead and latency of the extra PCI read for the length and id
descriptor members. My understanding is that a PCI device can only
read 64 bits atomically at most. So it can only save one fetch of the
fields together with the flags (length and id) if the driver promises
to write all of them atomically. This needs a feature flag as MST
says.
Something like this super early draft:
VIRTIO_F_ATOMIC_64_FLAGS: The driver writes the length, id, and flags
of a packed descriptor atomically, ensuring they are always
synchronized. The device will not read them again once it finds the
descriptor available via its flags.
Conversely, the device could atomically update the descriptor ID along
with the flags, meaning the driver wouldn't need a memory barrier
between these updates.
I guess it does not buy much on x86 software devices, but it might
improve performance in architectures with less cache coherency. From
the driver's perspective, implementation isn't hard.
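To make the draft above a bit more concrete, here is a rough userspace sketch of what "write length, id, and flags atomically" could look like. It is not kernel code and not spec text. The names (sketch_packed_desc, sketch_pack_tail, sketch_publish, sketch_poll) are made up for illustration, it assumes a little-endian host so no byte swapping is shown, and the availability check is reduced to a single bit instead of the real AVAIL/USED wrap-counter comparison.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Same 16-byte layout as a packed descriptor (addr, len, id, flags),
 * with the last three fields overlaid by one 64-bit word.
 * Assumption: little-endian host, so len sits in bits 0-31,
 * id in bits 32-47 and flags in bits 48-63 of that word. */
struct sketch_packed_desc {
	uint64_t addr;
	_Atomic uint64_t len_id_flags;
};

static_assert(sizeof(struct sketch_packed_desc) == 16,
	      "packed descriptor is expected to be 16 bytes");

/* Combine len, id and flags into the single 64-bit tail word. */
uint64_t sketch_pack_tail(uint32_t len, uint16_t id, uint16_t flags)
{
	return (uint64_t)len | ((uint64_t)id << 32) | ((uint64_t)flags << 48);
}

/* Driver side: write addr first, then publish len/id/flags with one
 * release store, so a reader that sees the flags also sees addr. */
void sketch_publish(struct sketch_packed_desc *d, uint64_t addr,
		    uint32_t len, uint16_t id, uint16_t flags)
{
	d->addr = addr;
	atomic_store_explicit(&d->len_id_flags,
			      sketch_pack_tail(len, id, flags),
			      memory_order_release);
}

/* Device side: one 64-bit load yields flags, len and id together, so
 * no second read of len/id is needed once the descriptor is seen as
 * available. Returns 1 if avail_bit is set in the flags, 0 otherwise. */
int sketch_poll(struct sketch_packed_desc *d, uint16_t avail_bit,
		uint32_t *len, uint16_t *id)
{
	uint64_t tail = atomic_load_explicit(&d->len_id_flags,
					     memory_order_acquire);
	uint16_t flags = (uint16_t)(tail >> 48);

	if (!(flags & avail_bit))
		return 0;
	*len = (uint32_t)tail;
	*id = (uint16_t)(tail >> 32);
	return 1;
}
```

Whether a real PCI device observes such a store as a single 64-bit transaction depends on the platform and the device, which is exactly why a feature bit would be needed.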
I also see 128-bit CAS PCI, but I'm not sure if the CPU can write the
128 bits of the descriptor atomically from the device's POV or if the
driver's write barrier is sufficient. Perhaps this is an improvement
for SW devices actually.
Adding Dragos to the thread.
^ permalink raw reply
* Re: [PATCH v1] vsock/virtio: rework MSG_ZEROCOPY flag handling
From: Arseniy Krasnov @ 2026-06-08 8:10 UTC (permalink / raw)
To: David Laight
Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
Jason Wang, Bobby Eshleman, Xuan Zhuo, Eugenio Pérez,
Simon Horman, kvm, virtualization, netdev, linux-kernel, oxffffaa,
rulkc
In-Reply-To: <20260605160851.3ddbd2ed@pumpkin>
On 05/06/2026 18:08, David Laight wrote:
> On Fri, 5 Jun 2026 14:53:14 +0300
> Arseniy Krasnov <avkrasnov@rulkc.org> wrote:
>
>> Logically it was based on TCP implementation, so to make further
>> support easier, rewrite it in the TCP way.
>>
>> Signed-off-by: Arseniy Krasnov <avkrasnov@rulkc.org>
>> ---
>> net/vmw_vsock/virtio_transport_common.c | 64 ++++++++++++-------------
>> 1 file changed, 32 insertions(+), 32 deletions(-)
>>
>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> index 2fd9eaaf5ca6..00caeeaa5590 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -73,10 +73,13 @@ static bool virtio_transport_can_zcopy(const struct virtio_transport *t_ops,
>> static int virtio_transport_fill_skb(struct sk_buff *skb,
>> struct virtio_vsock_pkt_info *info,
>> size_t len,
>> - bool zcopy)
>> + bool zcopy, struct ubuf_info *uarg)
>> {
>> struct msghdr *msg = info->msg;
>>
>> + /* We have completion - attach it to 'skb'. */
>> + skb_zcopy_set(skb, uarg, NULL);
>> +
>> if (zcopy)
>> return __zerocopy_sg_from_iter(msg, NULL, skb,
>> &msg->msg_iter, len, NULL);
>> @@ -208,7 +211,8 @@ static struct sk_buff *virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *
>> u32 src_cid,
>> u32 src_port,
>> u32 dst_cid,
>> - u32 dst_port)
>> + u32 dst_port,
>> + struct ubuf_info *uarg)
>> {
>> struct vsock_sock *vsk;
>> struct sk_buff *skb;
>> @@ -245,7 +249,7 @@ static struct sk_buff *virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *
>> if (info->msg && payload_len > 0) {
>> int err;
>>
>> - err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
>> + err = virtio_transport_fill_skb(skb, info, payload_len, zcopy, uarg);
>> if (err)
>> goto out;
>>
>> @@ -321,38 +325,36 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>> if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>> return pkt_len;
>>
>> - if (info->msg) {
>> - /* If zerocopy is not enabled by 'setsockopt()', we behave as
>> - * there is no MSG_ZEROCOPY flag set.
>> + if (info->msg && (info->msg->msg_flags & MSG_ZEROCOPY)) {
>> + /* If 'info->msg' is not NULL, this is only VIRTIO_VSOCK_OP_RW.
>> + * 'MSG_ZEROCOPY' flag handling here is based on the same flag
>> + * handling from 'tcp_sendmsg_locked()'.
>> */
>> - if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>> - info->msg->msg_flags &= ~MSG_ZEROCOPY;
>> + if (info->msg->msg_ubuf) {
>> + uarg = info->msg->msg_ubuf;
>> + can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
>> + } else if (sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY)) {
>> + uarg = msg_zerocopy_realloc(sk_vsock(vsk), pkt_len,
>> + NULL, false);
>> + if (!uarg) {
>> + virtio_transport_put_credit(vvs, pkt_len);
>> + return -ENOMEM;
>> + }
>>
>> - if (info->msg->msg_flags & MSG_ZEROCOPY)
>> can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
>>
>> + if (!can_zcopy)
>> + uarg_to_msgzc(uarg)->zerocopy = 0;
>> +
>> + have_uref = true;
>> + }
>> +
>> + /* 'can_zcopy' means that this transmission will be
>> + * in zerocopy way (e.g. using 'frags' array).
>> + */
> I've not looked at the tcp code, but the above doesn't look right.
> I don't see why msg->msg_ubuf might be non-NULL without SOCK_ZEROCOPY set.
> That would give the outer code a callback when the last skb is freed but
> still copy the data.
Hi,
I guess the case where 'msg->msg_ubuf' is non-NULL is a special case that exists today for the io_uring MSG_ZEROCOPY implementation.
It was added here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eb315a7d1396b1139fc7daea55f2d3191e8e7092
As far as I can see from its test in tools/testing/selftests/net/io_uring_zerocopy_tx.c, it doesn't require setting the SOCK_ZEROCOPY option on the
socket, so for the virtio vsock case I just copied the same logic to maintain compatibility, because there is a MSG_ZEROCOPY io_uring test for virtio/vsock.
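To summarize the branch structure of the patched code quoted above, here is a small standalone model. It is only a sketch: the type and function names (uarg_source, zc_decision, sketch_zc_decide) are invented for illustration, the four booleans stand in for the real checks (MSG_ZEROCOPY in msg_flags, a non-NULL msg_ubuf, SOCK_ZEROCOPY on the socket, and the result of virtio_transport_can_zcopy()), and no kernel API is called.

```c
#include <stdbool.h>

/* Where the completion (uarg) comes from in the model. */
enum uarg_source {
	UARG_NONE,		/* no completion, plain copy */
	UARG_FROM_CALLER,	/* msg_ubuf supplied by the caller (io_uring) */
	UARG_ALLOCATED		/* allocated here, as msg_zerocopy_realloc() does */
};

struct zc_decision {
	enum uarg_source uarg;
	bool can_zcopy;		/* data goes out via the frags array */
	bool have_uref;		/* we own a reference that must be dropped */
};

/* Model of the MSG_ZEROCOPY branch in the patched
 * virtio_transport_send_pkt_info(). */
struct zc_decision sketch_zc_decide(bool msg_zerocopy, bool has_msg_ubuf,
				    bool sock_zerocopy,
				    bool transport_can_zcopy)
{
	struct zc_decision d = { UARG_NONE, false, false };

	if (!msg_zerocopy)
		return d;

	if (has_msg_ubuf) {
		/* Caller-provided completion: SOCK_ZEROCOPY is not checked. */
		d.uarg = UARG_FROM_CALLER;
		d.can_zcopy = transport_can_zcopy;
	} else if (sock_zerocopy) {
		/* Completion allocated locally; it is still allocated when
		 * the transport cannot do zerocopy, only marked as copied. */
		d.uarg = UARG_ALLOCATED;
		d.can_zcopy = transport_can_zcopy;
		d.have_uref = true;
	}
	/* MSG_ZEROCOPY without SOCK_ZEROCOPY and without msg_ubuf:
	 * behaves as if the flag was not set. */
	return d;
}
```

For example, sketch_zc_decide(true, true, false, true) models the io_uring case discussed here: a caller-provided completion with zerocopy transmission, even though SOCK_ZEROCOPY is not set.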
>
> I also don't see the point of calling msg_zerocopy_realloc() to get a
> callback when the last skb is freed and then setting
> uarg_to_msgzc(uarg)->zerocopy = 0;
> so that the callback doesn't actually do anything.
> It isn't as though you 'find out' later on that you can't actually do
> zerocopy.
Sorry, what do you mean by "last skb"? In this code we first allocate uarg (allocate, because the third argument is always NULL). Then, in
the loop, we allocate sk_buffs, fill them with data and send them. I mean that the first/last skb will be freed after uarg is already allocated, and we
don't touch it. I think I didn't understand your question here.
>
>> if (can_zcopy)
>> max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>> (MAX_SKB_FRAGS * PAGE_SIZE));
>> -
>> - if (info->msg->msg_flags & MSG_ZEROCOPY &&
>> - info->op == VIRTIO_VSOCK_OP_RW) {
>> - uarg = info->msg->msg_ubuf;
>> -
>> - if (!uarg) {
>> - uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>> - pkt_len, NULL, false);
>> - if (!uarg) {
>> - virtio_transport_put_credit(vvs, pkt_len);
>> - return -ENOMEM;
>> - }
>> -
>> - if (!can_zcopy)
>> - uarg_to_msgzc(uarg)->zerocopy = 0;
>> -
>> - have_uref = true;
>> - }
>> - }
>> }
>>
>> rest_len = pkt_len;
>> @@ -365,14 +367,12 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>
>> skb = virtio_transport_alloc_skb(info, skb_len, can_zcopy,
>> src_cid, src_port,
>> - dst_cid, dst_port);
>> + dst_cid, dst_port, uarg);
>> if (!skb) {
>> ret = -ENOMEM;
>> break;
>> }
>>
>> - skb_zcopy_set(skb, uarg, NULL);
> Aren't you passing uarg through two function calls instead of doing it here.
> Doesn't even make it clearer what is going on.
Agreed. To simplify the patch, uarg could be set earlier (without passing it through the functions), I guess.
Thanks
>
> -- David
>
>> -
>> virtio_transport_inc_tx_pkt(vvs, skb);
>>
>> ret = t_ops->send_pkt(skb, info->net);
>> @@ -1178,7 +1178,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>> le64_to_cpu(hdr->dst_cid),
>> le32_to_cpu(hdr->dst_port),
>> le64_to_cpu(hdr->src_cid),
>> - le32_to_cpu(hdr->src_port));
>> + le32_to_cpu(hdr->src_port), NULL);
>> if (!reply)
>> return -ENOMEM;
>>
^ permalink raw reply
* [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
From: Michael S. Tsirkin @ 2026-06-08 8:33 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli
Further, on architectures with aliasing caches, upstream with init_on_alloc
double-zeros user pages: once via kernel_init_pages() in
post_alloc_hook, and again via clear_user_highpage() at the
callsite (because user_alloc_needs_zeroing() returns true).
This series eliminates that double-zeroing by moving the zeroing
into the post_alloc_hook + propagating the "host
already zeroed this page" information through the buddy allocator.
For page reporting, VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED (bit 6)
is used. For the inflate/deflate path,
VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE (bit 7) is used.
Virtio spec: https://lore.kernel.org/all/cover.1778140241.git.mst@redhat.com
Based on v7.1-rc6. When applying on mm-unstable, two conflicts
are expected:
- kernel_init_pages() was renamed to clear_highpages_kasan_tagged()
in mm-unstable. Use clear_highpages_kasan_tagged() in the
post_alloc_hook else branch.
- FPI_PREPARED uses BIT(3) in mm-unstable. Bump FPI_ZEROED to
BIT(4).
Build-tested on mm-unstable at e9dd96806dbc:
https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git zero-mm-unstable
Patches 1-5: fixes/cleanups, dependencies of the zeroing patches.
Patches 6-9: thread user_addr through page allocator, contig API,
and gigantic hugetlb allocation.
Patches 10-16: folio_zero_user in post_alloc_hook, vma_alloc_zeroed
conversion, raw fault address threading.
Patches 17-24: PG_zeroed flag, aliasing guard, buddy merge/split
tracking, FPI_ZEROED optimization, folio_put_zeroed.
Patches 25-27: __GFP_ZERO callsite conversions (alloc_anon_folio,
vma_alloc_anon_folio_pmd) with memcg charge failure mitigation.
Patches 28-29: hugetlb __GFP_ZERO + HPG_zeroed.
Patches 30-35: page reporting zeroing (DEVICE_INIT_REPORTED),
disable indirect descriptors.
Patches 36-37: inflate/deflate zeroing (DEVICE_INIT_ON_INFLATE).
-------
Performance with THP enabled on a 2GB VM, 1 vCPU, allocating
256MB of anonymous pages:
metric baseline optimized delta
task-clock 232 +- 20 ms 51 +- 26 ms -78%
cache-misses 1.20M +- 248K 288K +- 102K -76%
instructions 16.3M +- 1.2M 13.8M +- 1.0M -15%
With hugetlb surplus pages:
metric baseline optimized delta
task-clock 219 +- 23 ms 65 +- 34 ms -70%
cache-misses 1.17M +- 391K 263K +- 36K -78%
instructions 17.9M +- 1.2M 15.1M +- 724K -16%
Two flags track known-zero pages:
PG_zeroed (aliased to PG_private) marks buddy allocator pages that
are known to contain all zeros, either because the host zeroed
them during page reporting, or because they were freed via the
balloon deflate path. It lives on free-list pages and is consumed
by post_alloc_hook() on allocation.
HPG_zeroed (stored in hugetlb folio->private bits) serves the same
purpose for hugetlb pool pages, which are kept in a pool and may
be zeroed long after buddy allocation, so PG_zeroed (consumed at
allocation time) cannot track their state.
PG_zeroed lifecycle:
Sets PG_zeroed:
- page_reporting_drain: on reported pages when host zeroes them
- __free_pages_ok / __free_frozen_pages: when FPI_ZEROED is set
(balloon deflate path)
- buddy merge: on merged page if both buddies were zeroed
- expand(): propagate to split-off buddy sub-pages
Clears PG_zeroed:
- __free_pages_prepare: clears all PAGE_FLAGS_CHECK_AT_PREP flags
(PG_zeroed included), preventing PG_private aliasing leaks
- rmqueue_buddy / __rmqueue_pcplist: read-then-clear, passes
zeroed hint to prep_new_page -> post_alloc_hook
- __isolate_free_page: clear (compaction/page_reporting isolation)
- compaction, alloc_contig, split_free_frozen: clear before use
- buddy merge: clear both pages before merge, then conditionally
re-set on merged head if both were zeroed
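The PG_zeroed rules listed above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: toy_page and the toy_* helpers are hypothetical names, a plain bool stands in for the page flag, and none of the real locking, free lists or flag aliasing is modelled.

```c
#include <stdbool.h>

/* Toy stand-in for a free buddy block: its order plus the zeroed hint. */
struct toy_page {
	unsigned int order;
	bool zeroed;
};

/* Buddy merge: clear the hint on both halves, then re-set it on the
 * merged head only if both halves were known to be zero. */
void toy_merge(struct toy_page *head, struct toy_page *buddy)
{
	bool both_zeroed = head->zeroed && buddy->zeroed;

	head->zeroed = false;
	buddy->zeroed = false;
	head->order++;
	head->zeroed = both_zeroed;
}

/* Split (as in expand()): the split-off buddy inherits the hint,
 * since every part of an all-zero block is itself all zero.
 * The caller must pass a head with order > 0. */
void toy_split(struct toy_page *head, struct toy_page *buddy)
{
	head->order--;
	buddy->order = head->order;
	buddy->zeroed = head->zeroed;
}

/* Allocation: read-then-clear, so the hint is consumed exactly once
 * and handed to the caller (post_alloc_hook in the real series). */
bool toy_alloc_consume(struct toy_page *page)
{
	bool hint = page->zeroed;

	page->zeroed = false;
	return hint;
}
```

Merging a zeroed block with a non-zeroed one therefore yields a non-zeroed block, which is the conservative direction: a lost hint only costs a redundant zeroing, while a false hint would expose stale data.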
HPG_zeroed lifecycle (hugetlb pool pages, stored in folio->private):
Sets HPG_zeroed:
- alloc_surplus_hugetlb_folio: after buddy allocation with
__GFP_ZERO, mark pool page as known-zero
Clears HPG_zeroed:
- free_huge_folio: page was mapped to userspace, no longer
known-zero when it returns to the pool
- alloc_hugetlb_folio: cleared unconditionally on output
- alloc_hugetlb_folio_reserve: cleared after checking
- The optimization is most effective with THP, where entire 2MB
pages are allocated directly from reported order-9+ buddy pages.
Without THP, only ~21% of order-0 allocations come from reported
pages due to low-order fragmentation.
- Persistent hugetlb pool pages are not covered: when freed by
userspace they return to the hugetlb free pool, not the buddy
allocator, so they are never reported to the host. Surplus
hugetlb pages are allocated from buddy and do benefit.
- PG_zeroed is aliased to PG_private. __free_pages_prepare() clears it
(preventing filesystem PG_private from leaking as false PG_zeroed).
FPI_ZEROED re-sets it after prepare for balloon deflate pages.
Is aliasing PG_private acceptable, or should a different bit be used?
- With __GFP_ZERO, the folio is zeroed before mem_cgroup_charge().
If the charge fails (cgroup at limit), the zeroing work is wasted
and the folio is freed and retried at a smaller order. Previously,
zeroing was done after a successful charge. This is inherent to
the __GFP_ZERO approach. Is this acceptable?
- On architectures with aliasing caches, upstream with init_on_alloc
double-zeros user pages: once via kernel_init_pages() in
post_alloc_hook, and again via clear_user_highpage() at the
callsite (because user_alloc_needs_zeroing() returns true).
Our patches eliminate this by zeroing once via folio_zero_user()
in post_alloc_hook. Not a critical fix (people who set init_on_alloc
know they are paying performance) but a nice cleanup anyway.
Test program:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif
#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000
#endif
int main(int argc, char **argv)
{
	unsigned long size;
	int flags = MAP_PRIVATE | MAP_ANONYMOUS;
	void *p;
	int r;
	if (argc < 2) {
		fprintf(stderr, "usage: %s <size_mb> [huge]\n", argv[0]);
		return 1;
	}
	size = atol(argv[1]) * 1024UL * 1024;
	if (argc >= 3 && strcmp(argv[2], "huge") == 0)
		flags |= MAP_HUGETLB;
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	r = madvise(p, size, MADV_POPULATE_WRITE);
	if (r) {
		perror("madvise");
		return 1;
	}
	munmap(p, size);
	return 0;
}
Test script (bench.sh):
#!/bin/bash
# Usage: bench.sh <size_mb> <iterations> [huge]
# Feature negotiation (DEVICE_INIT_REPORTED/ON_INFLATE) is
# handled by QEMU command line flags.
SZ=${1:-256}; ITER=${2:-10}; HUGE=${3:-}
FLUSH=/sys/module/page_reporting/parameters/flush
CSV=/tmp/perf.csv
rmmod virtio_balloon 2>/dev/null
insmod /mnt/share/virtio_balloon.ko
echo 512 > $FLUSH
[ "$HUGE" = "huge" ] && echo $((SZ/2)) > /proc/sys/vm/nr_overcommit_hugepages
rm -f $CSV
echo "=== sz=${SZ}MB iter=$ITER $HUGE ==="
for i in $(seq 1 $ITER); do
echo 3 > /proc/sys/vm/drop_caches
echo 512 > $FLUSH
perf stat -e task-clock,instructions,cache-misses \
-x, -o $CSV --append -- /mnt/share/alloc_once $SZ $HUGE
done
[ "$HUGE" = "huge" ] && echo 0 > /proc/sys/vm/nr_overcommit_hugepages
rmmod virtio_balloon
awk -F, '/^#/||/^$/{next}{v=$1+0;e=$3;gsub(/ /,"",e);s[e]+=v;ss[e]+=v*v;n[e]++}
END{for(e in s){a=s[e]/n[e];d=sqrt(ss[e]/n[e]-a*a);printf " %-16s %10.0f +- %8.0f (n=%d)\n",e,a,d,n[e]}}' $CSV
Compile and run:
gcc -static -O2 -o alloc_once alloc_once.c
bash bench.sh 256 10 # regular pages
bash bench.sh 256 10 huge # hugetlb surplus
Note about Sashiko (sashiko.dev) false positives:
Sashiko's mm-alloc guideline says "Any optimization replacing
clear_user_highpage() with __GFP_ZERO is wrong on [cache-aliasing]
architectures". This is correct for mainline but not for this
series, which threads user_addr through the allocator so that
post_alloc_hook() calls folio_zero_user() with the dcache flush.
Suggested guideline update: add "unless the caller passes a
valid user address (i.e. not USER_ADDR_NONE) to vma_alloc_folio(),
alloc_contig_frozen_pages_user() etc., which reaches
post_alloc_hook() for the dcache flush".
Pre-existing bugs found during review (not fixed, not made worse):
- do_swap_page() returns VM_FAULT_OOM on large-folio swapin race
instead of retrying.
- free_huge_folio() called with refcount==1 on
mem_cgroup_charge_hugetlb failure.
- memfd_alloc_folio() double-decrements resv_huge_pages on error.
- wait_event in virtballoon_free_page_report hangs on broken
virtqueue (pre-existing, same as old single-buffer code).
- tell_host() GFP_KERNEL under balloon_lock risks OOM deadlock.
Changes since v9:
- Fix W=1 kerneldoc warning on alloc_contig_frozen_pages_user_noprof.
- Fix link error on !MMU configs (m68k, arm allnoconfig): move
folio_zero_user stub to new mm/folio_zero.h header.
- Reorder patches: move PG_zeroed tracking and folio_put_zeroed
before __GFP_ZERO conversions, allowing folio_put_zeroed to
handle memcg charge failures.
- Better handle memcg charge failures.
Changes since v8 (address Sashiko v8 review findings):
- Fix mempolicy interleave: combine vm_pgoff and VMA offset into
a single expression before shifting, fixing carry loss for
file-backed VMAs with unaligned vm_pgoff.
- Fix memory-failure: wrap ClearPageHWPoison in retry path with
zone->lock (same race as TestSetPageHWPoison).
- Fix stale comment: "folio_zero_user writes" -> "page zeroing"
in huge_memory.c __folio_mark_uptodate comment.
- Drop rounddown_pow_of_two for page reporting capacity (no-op
for compiler optimization, halves batch size for non-power-of-2).
- Reorder: move "mm: balloon: use put_page_zeroed" before
"virtio_balloon: implement DEVICE_INIT_ON_INFLATE" so the
ClearPageZeroed handling is in place before any page gets
the flag set.
- Various commit log improvements (PowerPC note in aliasing
patch, memory-failure note about other HWPoison calls,
wording fixes).
Changes since v7 (address Sashiko AI review findings):
- Fix dcache flush on VIPT aliasing architectures: add
user_alloc_needs_zeroing() guard in post_alloc_hook to force
folio_zero_user for user pages when cache aliasing requires it.
Host-zeroed pages excluded (!zeroed). Optimization preserved.
- Fix folio_zero_user stub: replace macro with non-inline function
in mm/memory.c to avoid double-evaluation and missing include.
- Fix C89 declaration-after-statement in free_huge_folio.
- Fix CMA __GFP_ZERO: pass through to cma_alloc_frozen_compound
so HPG_zeroed accurately reflects whether page was zeroed.
- Fix big-endian bitmap: use test_bit_le() for inflate_bitmap.
- Fix migratepage: clear PageZeroed on old page before deflation.
- Fix page_reporting flush: overflow-safe loop, add -EINTR on
signal, add code comment explaining double flush_delayed_work.
- Add atomic ClearPageZeroed (CLEARPAGEFLAG) for balloon migration
path where zone->lock is not held.
- Add VM_WARN_ON_ONCE for order>0 without __GFP_COMP in
post_alloc_hook (folio_zero_user requires compound metadata).
- Add _noprof pattern for vma_alloc_zeroed_movable_folio to
preserve memory allocation profiling attribution.
- Add PageReported propagation in split_large_buddy (was missing
from patch 2).
- Add FPI_ZEROED guard: skip PageZeroed when page_poisoning
enabled and init_on_free disabled (poison overwrites zeroes).
- Add DMA alignment comment for inflate_bitmap (ACCESS_PLATFORM
cleared, so not needed now).
- Restore tell_host comment explaining vq buffer assumption.
- Various code comments documenting design decisions.
- Drop __GFP_ZERO from gather_surplus_pages: avoid shifting
zeroing from fault time to reservation time (mmap/fallocate).
Pool pages are zeroed at fault time via alloc_hugetlb_folio.
Fresh surplus allocations at fault time still benefit from
__GFP_ZERO + HPG_zeroed.
- New patch: add alloc_contig_frozen_pages_user API with user_addr
for cache-friendly zeroing in the contiguous allocation path.
- New patch: thread user_addr through gigantic hugetlb allocation
via alloc_contig_frozen_pages_user.
- New patch: replace user_alloc_needs_zeroing() with aliasing-only
checks (cpu_dcache_is_aliasing || cpu_icache_is_aliasing) in the
post_alloc_hook guard. Avoids redundant re-zero on non-aliasing.
- New patch: serialize TestSetPageHWPoison with zone->lock in
memory_failure to fix pre-existing race with non-atomic buddy
flag operations (e.g. page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP).
- New patch: disable VIRTIO_RING_F_INDIRECT_DESC in balloon to
prevent GFP_KERNEL allocation under balloon_lock (OOM deadlock).
- New patch: skip kernel_init_pages for FPI_ZEROED when page
poisoning is not enabled (page already zero, skip redundant work).
Also since v7 (address review by Gregory Price):
- Drop from_pool bool in alloc_hugetlb_folio: use
folio_test_hugetlb_zeroed directly. HPG_zeroed is set by
alloc_surplus_hugetlb_folio for fresh allocations, so the
check handles both pool and fresh pages.
- Drop bool *zeroed output parameter from alloc_hugetlb_folio:
sink zeroing inside the function. When __GFP_ZERO is set and
!folio_test_hugetlb_zeroed, call folio_zero_user internally.
- Rename addr to user_addr in alloc_hugetlb_folio, align
internally with huge_page_mask.
- Add Reviewed-by: Gregory Price tags on reviewed patches.
New patches since v7:
- mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
- mm: add alloc_contig_frozen_pages_user for cache-friendly zeroing
- mm: hugetlb: thread user_addr through gigantic page allocation
- mm: page_alloc: use aliasing checks instead of
user_alloc_needs_zeroing
- virtio_balloon: disable indirect descriptors
- mm: page_alloc: skip kernel_init_pages for FPI_ZEROED when safe
Changes since v6 (address review by Gregory Price):
- Rework hugetlb: use gfp_t parameter instead of bool zero /
bool *zeroed. Sink zeroing inside alloc_hugetlb_folio().
Pass raw fault address (user_addr) for cache-friendly zeroing
on both pool-page and fresh allocation
paths. (Suggested by Gregory Price)
- Reorder compaction_alloc_noprof() to call prep_compound_page
before post_alloc_hook for consistency.
(Suggested by Gregory Price)
- Reorder: interleave fix first, PageReported propagation and
capacity fix moved to front as dependencies.
- Add USER_ADDR_NONE comments in mmap.c and internal.h explaining why -1 is
never a valid userspace address.
- Fix err uninitialized warning in virtballoon_free_page_report().
- Lots of commit log tweaks.
Also in v7:
- Fix hugetlb pool page zeroing to use vmf->real_address
(the actual faulting subpage) instead of vmf->address
(hugepage-aligned), preserving cache-friendly zeroing
locality that upstream had at the callsite.
- Remove dead/broken alloc_hugetlb_folio !CONFIG_HUGETLB_PAGE
stub (returned NULL but callers check IS_ERR).
Changes since v5:
- Rebased onto v7.1-rc2.
- Split alloc_anon_folio and alloc_swap_folio raw fault address
changes into separate patches.
- In virtio, move PAGE_POISON check for DEVICE_INIT_REPORTED
from probe() to validate(), clearing the feature instead of
just gating host_zeroes_pages. Same for confidential
computing check.
- Fix bisectability: FPI_ZEROED definition and usage now in
the same patch.
- Lots of commit log tweaks.
- Reorder: REPORTED before ON_INFLATE.
- Kerneldoc fixes.
Changes since v4:
With virtio spec posted, update to latest spec:
- Add VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED (bit 6) for reporting.
- Add VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE (bit 7) for inflate.
- Per-page virtqueue submission, per-page used_len feedback.
- Balloon migration preserves PageZeroed hint.
- Page_reporting capacity bugfix for small virtqueues.
- PG_zeroed propagation in split_large_buddy.
- Disable both features for confidential computing guests.
- Gate host_zeroes_pages on PAGE_POISON/poison_val: when PAGE_POISON
is negotiated with non-zero poison_val, device fills with poison
not zeros, so host_zeroes_pages must be false.
- Disable ON_INFLATE when PAGE_POISON with non-zero poison_val.
- Bound inflate bitmap reads by used_len from device.
- Move ON_INFLATE poison_val check to validate() for proper
feature negotiation.
- Fix NUMA interleave index for unaligned VMA start (new patch 1).
- Drop vma_alloc_folio_user_addr: with the ilx fix, callers can
pass raw fault address to vma_alloc_folio directly.
- Tested with DEBUG_VM, INIT_ON_ALLOC/FREE enabled.
Changes since v3 (address review by Gregory Price and David Hildenbrand):
- Keep user_addr threading internal: public APIs (__alloc_pages,
__folio_alloc, folio_alloc_mpol) are unchanged. Only internal
functions (__alloc_frozen_pages_noprof, __alloc_pages_mpol) carry
user_addr. This eliminates all API churn for external callers.
- Add vma_alloc_folio_user_addr() (2/22) to separate NUMA policy
address from the zeroing hint address. Fixes NUMA interleave
index corruption when passing unaligned fault address for
higher-order allocations.
- Add per-page zeroed_bitmap to page_reporting_dev_info (17/22).
The driver's report() callback manages the bitmap. Drain
checks it gated by the host_zeroes_pages static key. This
matches the proposed virtio balloon extension at
https://lore.kernel.org/all/cover.1776874126.git.mst@redhat.com/
- Clear PG_zeroed in __isolate_free_page() to prevent the aliased
PG_private flag from leaking to compaction/alloc_contig paths.
- Do not exclude PG_zeroed from PAGE_FLAGS_CHECK_AT_PREP macro.
Instead, __free_pages_prepare() clears it (preventing filesystem
PG_private leaking as false PG_zeroed), and FPI_ZEROED sets it
after prepare. Only buddy merge assertion is relaxed.
- Initialize alloc_context.user_addr in alloc_pages_bulk_noprof.
- Deflate and hugetlb changes are much smaller now. Still, the
patchset can be merged gradually, if desired.
Changes since v2 (address review by Gregory Price and David Hildenbrand):
- v2 used pghint_t / vma_alloc_folio_hints API. v3 switches to
threading user_addr through the page allocator and using __GFP_ZERO,
so post_alloc_hook() can use folio_zero_user() for cache-friendly
zeroing when the user fault address is known.
- Use FPI_ZEROED to set PG_zeroed after __free_pages_prepare() instead
of runtime masking in __free_one_page (further refined in v4).
- Drop redundant page_poisoning_enabled() check from mm core free
path, already guarded at feature negotiation time in
virtio_balloon_validate. The balloon driver keeps its own
page_poisoning_enabled_static() check as defense in depth.
- Split free_frozen_pages_zeroed and put_page_zeroed into separate
patches. David Hildenbrand indicated he intends to rework balloon
pages to be frozen (no refcount), at which point put_page_zeroed
(21/22) can be dropped and the balloon can call
free_frozen_pages_zeroed directly.
- Use HPG_zeroed flag (in hugetlb folio->private) for hugetlb pool
pages instead of PG_zeroed, since pool pages are zeroed long after
buddy allocation and PG_zeroed is consumed at allocation time.
- syzbot CI found a PF_NO_COMPOUND BUG in the v2 pghint_t approach
where __ClearPageZeroed was called on compound hugetlb pages in
free_huge_folio. The v3 HPG_zeroed approach avoids this.
- Remove redundant arch vma_alloc_zeroed_movable_folio overrides
on x86, s390, m68k, and alpha (12/22). Suggested by David
Hildenbrand.
- Updated benchmarking script to compute per-run avg +- stddev
via awk on CSV output.
Changes v1->v2:
- Replaced __GFP_PREZEROED with PG_zeroed page flag (aliased PG_private)
- Added pghint_t type and vma_alloc_folio_hints() API
- Track PG_zeroed across buddy merges and splits
- Added post_alloc_hook integration (single consume/clear point)
- Added hugetlb support (pool pages + memfd)
- Added page_reporting flush parameter for deterministic testing
- Added free_frozen_pages_hint/put_page_hint for balloon deflate path
- Added try_to_claim_block PG_zeroed preservation
- Updated perf numbers with per-iteration flush methodology
Written with assistance from Claude (claude-opus-4-6).
Reviewed by cursor-agent (GPT-5.4-xhigh).
Everything manually read, patchset split and commit logs edited manually.
Michael S. Tsirkin (37):
mm: mempolicy: fix interleave index calculation
mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
mm: page_alloc: propagate PageReported flag across buddy splits
mm: page_reporting: allow driver to set batch capacity
mm: hugetlb: remove dead alloc_hugetlb_folio stub
mm: move vma_alloc_folio_noprof to page_alloc.c
mm: thread user_addr through page allocator for cache-friendly zeroing
mm: add alloc_contig_frozen_pages_user for cache-friendly zeroing
mm: hugetlb: thread user_addr through gigantic page allocation
mm: add folio_zero_user stub for configs without THP/HUGETLBFS
mm: page_alloc: move prep_compound_page before post_alloc_hook
mm: use folio_zero_user for user pages in post_alloc_hook
mm: use __GFP_ZERO in vma_alloc_zeroed_movable_folio
mm: remove arch vma_alloc_zeroed_movable_folio overrides
mm: alloc_anon_folio: pass raw fault address to vma_alloc_folio
mm: alloc_swap_folio: pass raw fault address to vma_alloc_folio
mm: page_reporting: skip redundant zeroing of host-zeroed reported
pages
mm: page_alloc: use aliasing checks instead of
user_alloc_needs_zeroing
mm: page_alloc: clear PG_zeroed on buddy merge if not both zero
mm: page_alloc: preserve PG_zeroed in page_del_and_expand
mm: page_alloc: propagate PG_zeroed in split_large_buddy
mm: add free_frozen_pages_zeroed
mm: page_alloc: skip kernel_init_pages for FPI_ZEROED when safe
mm: add put_page_zeroed and folio_put_zeroed
mm: use __GFP_ZERO in alloc_anon_folio
mm: vma_alloc_anon_folio_pmd: pass raw fault address to
vma_alloc_folio
mm: use __GFP_ZERO in vma_alloc_anon_folio_pmd
mm: hugetlb: add gfp parameter and skip zeroing for zeroed pages
mm: memfd: skip zeroing for zeroed hugetlb pool pages
mm: page_reporting: add per-page zeroed bitmap for host feedback
virtio_balloon: submit reported pages as individual buffers
virtio_balloon: disable indirect descriptors
mm: page_reporting: add flush parameter with page budget
virtio_balloon: skip zeroing for host-zeroed reported pages
virtio_balloon: disable reporting zeroed optimization for confidential
guests
mm: balloon: use put_page_zeroed for zeroed balloon pages
virtio_balloon: implement VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE
arch/alpha/include/asm/page.h | 3 -
arch/m68k/include/asm/page_no.h | 3 -
arch/s390/include/asm/page.h | 3 -
arch/x86/include/asm/page.h | 3 -
drivers/virtio/virtio_balloon.c | 177 ++++++++++++++---
fs/hugetlbfs/inode.c | 3 +-
include/linux/cma.h | 3 +-
include/linux/gfp.h | 18 +-
include/linux/highmem.h | 15 +-
include/linux/hugetlb.h | 18 +-
include/linux/mm.h | 13 ++
include/linux/page-flags.h | 11 ++
include/linux/page_reporting.h | 13 ++
include/uapi/linux/virtio_balloon.h | 2 +
mm/balloon.c | 10 +-
mm/cma.c | 6 +-
mm/compaction.c | 9 +-
mm/folio_zero.h | 18 ++
mm/huge_memory.c | 16 +-
mm/hugetlb.c | 138 ++++++++-----
mm/hugetlb_cma.c | 4 +-
mm/internal.h | 22 ++-
mm/memfd.c | 14 +-
mm/memory-failure.c | 10 +
mm/memory.c | 19 +-
mm/mempolicy.c | 75 +++----
mm/mmap.c | 6 +
mm/page_alloc.c | 297 +++++++++++++++++++++++-----
mm/page_reporting.c | 99 ++++++++--
mm/page_reporting.h | 12 ++
mm/slub.c | 4 +-
mm/swap.c | 20 +-
32 files changed, 792 insertions(+), 272 deletions(-)
create mode 100644 mm/folio_zero.h
--
MST
* [PATCH v10 01/37] mm: mempolicy: fix interleave index calculation
From: Michael S. Tsirkin @ 2026-06-08 8:34 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <cover.1780906288.git.mst@redhat.com>
The NUMA interleave index was computed as two separate terms:
*ilx += vma->vm_pgoff >> order;
*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
This has two problems:
1. When vm_start is not aligned to the folio size, the
subtraction before the shift lets low bits affect the
result via borrows.
2. For file-backed VMAs, shifting vm_pgoff and the VMA
offset independently loses carries between them, giving
wrong chunk indices when vm_pgoff is not aligned to order.
Combine into a single expression that adds vm_pgoff and
the page-granularity VMA offset first, then shifts once:
*ilx += (vma->vm_pgoff +
(addr >> PAGE_SHIFT) -
(vma->vm_start >> PAGE_SHIFT)) >> order;
For anonymous VMAs, vm_pgoff equals vm_start >> PAGE_SHIFT,
so the vm_pgoff and vm_start terms cancel and the result
reduces to addr >> (PAGE_SHIFT + order), same as before.
For file-backed VMAs, the sum vm_pgoff + (addr >> PAGE_SHIFT)
- (vm_start >> PAGE_SHIFT) gives the file page offset of addr.
Shifting by order gives the correct file chunk index.
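The difference between the two formulas can be checked in user space.
The following is a minimal sketch, not kernel code: PAGE_SHIFT is
assumed to be 12, and ilx_old() and ilx_new() are hypothetical helper
names that take vm_pgoff, vm_start, addr and order as plain values.

```c
/* Assumption for illustration only: 4 KiB pages. */
#define PAGE_SHIFT 12

/* Old calculation: the two terms are shifted independently. */
static unsigned long ilx_old(unsigned long vm_pgoff, unsigned long vm_start,
			     unsigned long addr, unsigned int order)
{
	return (vm_pgoff >> order) +
	       ((addr - vm_start) >> (PAGE_SHIFT + order));
}

/* New calculation: sum at page granularity first, then shift once. */
static unsigned long ilx_new(unsigned long vm_pgoff, unsigned long vm_start,
			     unsigned long addr, unsigned int order)
{
	return (vm_pgoff + (addr >> PAGE_SHIFT) -
		(vm_start >> PAGE_SHIFT)) >> order;
}
```

For a file-backed VMA with vm_pgoff = 3, vm_start = 0x10000,
addr = 0x11000 and order = 2, addr is file page 4, which belongs to
chunk 1: ilx_new() returns 1 while ilx_old() returns 0. For an
anonymous VMA with vm_pgoff = vm_start >> PAGE_SHIFT = 0x10 and
addr = 0x13000, ilx_new() returns 4, equal to
addr >> (PAGE_SHIFT + order).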
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Gregory Price <gourry@gourry.net>
---
mm/mempolicy.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4e4421b22b59..d139b074a599 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2048,8 +2048,9 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
pol = get_task_policy(current);
if (pol->mode == MPOL_INTERLEAVE ||
pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
- *ilx += vma->vm_pgoff >> order;
- *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
+ *ilx += (vma->vm_pgoff +
+ (addr >> PAGE_SHIFT) -
+ (vma->vm_start >> PAGE_SHIFT)) >> order;
}
return pol;
}
--
MST
* [PATCH v10 02/37] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Michael S. Tsirkin @ 2026-06-08 8:34 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Miaohe Lin
In-Reply-To: <cover.1780906288.git.mst@redhat.com>
TestSetPageHWPoison() is called without zone->lock, so its atomic
update to page->flags can race with non-atomic flag operations
that run under zone->lock in the buddy allocator.
In particular, __free_pages_prepare() does:
page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
This non-atomic read-modify-write, while correctly excluding
__PG_HWPOISON from the mask, can still lose a concurrent
TestSetPageHWPoison if the read happens before the poison bit
is set and the write happens after. Follow-up patches in this
series add similar non-atomic flag operations as well.
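The lost update can be replayed deterministically in user space. This
is only a single-threaded sketch of the interleaving described above:
the TOY_* bit positions and both function names are hypothetical, and
the "CPU A"/"CPU B" steps are ordinary statements placed in the
problematic order.

```c
/* Hypothetical bit positions, chosen only for this illustration. */
#define TOY_PG_HWPOISON		(1UL << 0)
#define TOY_FLAGS_CHECK_AT_PREP	(1UL << 1)

/*
 * Replay the racy interleaving: the non-atomic "flags &= ~mask" is
 * split into its load and its store, and the poison bit is set in
 * between, so the store overwrites it.
 */
static unsigned long toy_lost_update(unsigned long flags)
{
	unsigned long snapshot;

	snapshot = flags;				/* CPU A: load */
	flags |= TOY_PG_HWPOISON;			/* CPU B: set poison */
	flags = snapshot & ~TOY_FLAGS_CHECK_AT_PREP;	/* CPU A: store */

	return flags;
}

/* With a common lock the two updates run one after the other. */
static unsigned long toy_serialized(unsigned long flags)
{
	flags |= TOY_PG_HWPOISON;		/* CPU B, lock held */
	flags &= ~TOY_FLAGS_CHECK_AT_PREP;	/* CPU A, lock held */

	return flags;
}
```

Starting from flags == TOY_FLAGS_CHECK_AT_PREP, toy_lost_update()
returns 0 (the poison bit is gone), while toy_serialized() returns
TOY_PG_HWPOISON.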
Fix by acquiring zone->lock around TestSetPageHWPoison and
around ClearPageHWPoison in the retry path. This
serializes with all buddy flag manipulation. The cost is
negligible: one lock/unlock in an extremely rare path
(hardware memory errors).
Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
in this file operate on pages already removed from the buddy
allocator or on non-buddy pages (DAX, hugetlb), so they do not
need zone->lock protection.
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
mm/memory-failure.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ee42d4361309..3880486028a1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2348,6 +2348,8 @@ int memory_failure(unsigned long pfn, int flags)
unsigned long page_flags;
bool retry = true;
int hugetlb = 0;
+ struct zone *zone;
+ unsigned long mf_flags;
if (!sysctl_memory_failure_recovery)
panic("Memory failure on page %lx", pfn);
@@ -2390,7 +2392,11 @@ int memory_failure(unsigned long pfn, int flags)
if (hugetlb)
goto unlock_mutex;
+ /* Serialize with non-atomic buddy flag operations */
+ zone = page_zone(p);
+ spin_lock_irqsave(&zone->lock, mf_flags);
if (TestSetPageHWPoison(p)) {
+ spin_unlock_irqrestore(&zone->lock, mf_flags);
res = -EHWPOISON;
if (flags & MF_ACTION_REQUIRED)
res = kill_accessing_process(current, pfn, flags);
@@ -2399,6 +2405,7 @@ int memory_failure(unsigned long pfn, int flags)
action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
goto unlock_mutex;
}
+ spin_unlock_irqrestore(&zone->lock, mf_flags);
/*
* We need/can do nothing about count=0 pages.
@@ -2420,7 +2427,10 @@ int memory_failure(unsigned long pfn, int flags)
} else {
/* We lost the race, try again */
if (retry) {
+ /* Serialize with non-atomic buddy flag operations */
+ spin_lock_irqsave(&zone->lock, mf_flags);
ClearPageHWPoison(p);
+ spin_unlock_irqrestore(&zone->lock, mf_flags);
retry = false;
goto try_again;
}
--
MST
* [PATCH v10 03/37] mm: page_alloc: propagate PageReported flag across buddy splits
From: Michael S. Tsirkin @ 2026-06-08 8:34 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <cover.1780906288.git.mst@redhat.com>
When a reported free page is split via expand() to satisfy a
smaller allocation, the sub-pages placed back on the free lists
lose the PageReported flag. This means they will be unnecessarily
re-reported to the hypervisor in the next reporting cycle, wasting
work.
While I was unable to quantify the performance difference, it is
an obvious waste, even if small.
Propagate the PageReported flag to sub-pages during expand(),
both in page_del_and_expand() and try_to_claim_block(), so
that they are recognized as already-reported.
split_large_buddy() also propagates PageReported via a bool
parameter: the caller saves PageReported before
del_page_from_free_list() clears it, then passes the saved
value. The flag is set after __free_one_page() with a
PageBuddy check, matching the page_reporting_drain() pattern.
Free-path callers pass false (freshly freed pages are never
reported).
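The propagation in expand() can be pictured with a toy model. This is
a sketch under stated assumptions, not the kernel implementation:
struct toy_free_block and toy_expand() are hypothetical names, the free
lists are replaced by a plain output array, and the return value counts
blocks rather than pages.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a block placed back on a free list. */
struct toy_free_block {
	unsigned int order;
	bool reported;
};

/*
 * Split a block of order 'high' down to order 'low'. Each step hands
 * one half back as a free block of the next lower order, and that
 * block inherits the parent's reported state.
 * Returns the number of free blocks written to 'out'.
 */
static unsigned int toy_expand(struct toy_free_block *out, unsigned int low,
			       unsigned int high, bool reported)
{
	unsigned int nr = 0;

	while (high > low) {
		high--;
		out[nr].order = high;
		out[nr].reported = reported;
		nr++;
	}

	return nr;
}
```

Splitting a reported order-3 block to satisfy an order-0 request yields
free blocks of order 2, 1 and 0, all marked reported, so none of them
would be picked up again by the next reporting cycle.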
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
mm/page_alloc.c | 32 +++++++++++++++++++++++++-------
1 file changed, 25 insertions(+), 7 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d49c254174da..8dae5b3f5876 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1502,7 +1502,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
/* Split a multi-block free page into its individual pageblocks. */
static void split_large_buddy(struct zone *zone, struct page *page,
- unsigned long pfn, int order, fpi_t fpi)
+ unsigned long pfn, int order, fpi_t fpi,
+ bool reported)
{
unsigned long end = pfn + (1 << order);
@@ -1517,6 +1518,8 @@ static void split_large_buddy(struct zone *zone, struct page *page,
int mt = get_pfnblock_migratetype(page, pfn);
__free_one_page(page, pfn, zone, order, mt, fpi);
+ if (reported && PageBuddy(page) && buddy_order(page) == order)
+ __SetPageReported(page);
pfn += 1 << order;
if (pfn == end)
break;
@@ -1559,11 +1562,12 @@ static void free_one_page(struct zone *zone, struct page *page,
llist_for_each_entry_safe(p, tmp, llnode, pcp_llist) {
unsigned int p_order = p->private;
- split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
+ split_large_buddy(zone, p, page_to_pfn(p), p_order,
+ fpi_flags, false);
__count_vm_events(PGFREE, 1 << p_order);
}
}
- split_large_buddy(zone, page, pfn, order, fpi_flags);
+ split_large_buddy(zone, page, pfn, order, fpi_flags, false);
spin_unlock_irqrestore(&zone->lock, flags);
__count_vm_events(PGFREE, 1 << order);
@@ -1694,7 +1698,7 @@ struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
* -- nyc
*/
static inline unsigned int expand(struct zone *zone, struct page *page, int low,
- int high, int migratetype)
+ int high, int migratetype, bool reported)
{
unsigned int size = 1 << high;
unsigned int nr_added = 0;
@@ -1716,6 +1720,15 @@ static inline unsigned int expand(struct zone *zone, struct page *page, int low,
__add_to_free_list(&page[size], zone, high, migratetype, false);
set_buddy_order(&page[size], high);
nr_added += size;
+
+ /*
+ * The parent page has been reported to the host. The
+ * sub-pages are part of the same reported block, so mark
+ * them reported too. This avoids re-reporting pages that
+ * the host already knows about.
+ */
+ if (reported)
+ __SetPageReported(&page[size]);
}
return nr_added;
@@ -1726,9 +1739,10 @@ static __always_inline void page_del_and_expand(struct zone *zone,
int high, int migratetype)
{
int nr_pages = 1 << high;
+ bool was_reported = page_reported(page);
__del_page_from_free_list(page, zone, high, migratetype);
- nr_pages -= expand(zone, page, low, high, migratetype);
+ nr_pages -= expand(zone, page, low, high, migratetype, was_reported);
account_freepages(zone, -nr_pages, migratetype);
}
@@ -2116,11 +2130,13 @@ static bool __move_freepages_block_isolate(struct zone *zone,
/* We're a part of a larger buddy */
if (PageBuddy(buddy) && buddy_order(buddy) > pageblock_order) {
int order = buddy_order(buddy);
+ bool reported = PageReported(buddy);
del_page_from_free_list(buddy, zone, order,
get_pfnblock_migratetype(buddy, buddy_pfn));
toggle_pageblock_isolate(page, isolate);
- split_large_buddy(zone, buddy, buddy_pfn, order, FPI_NONE);
+ split_large_buddy(zone, buddy, buddy_pfn, order, FPI_NONE,
+ reported);
return true;
}
@@ -2283,10 +2299,12 @@ try_to_claim_block(struct zone *zone, struct page *page,
/* Take ownership for orders >= pageblock_order */
if (current_order >= pageblock_order) {
unsigned int nr_added;
+ bool was_reported = page_reported(page);
del_page_from_free_list(page, zone, current_order, block_type);
change_pageblock_range(page, current_order, start_type);
- nr_added = expand(zone, page, order, current_order, start_type);
+ nr_added = expand(zone, page, order, current_order, start_type,
+ was_reported);
account_freepages(zone, nr_added, start_type);
return page;
}
--
MST
* [PATCH v10 04/37] mm: page_reporting: allow driver to set batch capacity
From: Michael S. Tsirkin @ 2026-06-08 8:34 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <cover.1780906288.git.mst@redhat.com>
Add a capacity field to page_reporting_dev_info so drivers can
control the maximum number of pages per report batch. This is
useful when the driver needs to reserve virtqueue descriptors for
metadata (e.g., a bitmap buffer) alongside the page buffers.
The value is capped at PAGE_REPORTING_CAPACITY.
If unset (0), defaults to PAGE_REPORTING_CAPACITY.
The virtio_balloon driver sets capacity to the reporting virtqueue
size, letting page_reporting adapt to whatever the device provides.
Note: capacity need not be a power of 2. The DIV_ROUND_UP
in page_reporting_cycle() uses integer division, which the
compiler handles efficiently. Rounding capacity down to a power
of 2 instead would cut the batch size by up to almost half for
non-power-of-2 virtqueue sizes, wasting capacity.
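The clamping and the budget arithmetic can be sketched in user space as
follows. This is an illustration only: the value 32 for
PAGE_REPORTING_CAPACITY is an assumption made for this sketch, and
toy_clamp_capacity() and toy_budget() are hypothetical helper names.

```c
/* Assumed upper bound for this sketch; see mm/page_reporting.h. */
#define PAGE_REPORTING_CAPACITY 32

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Mirror the registration-time rule: 0 or too large means "use the cap". */
static unsigned int toy_clamp_capacity(unsigned int requested)
{
	if (!requested || requested > PAGE_REPORTING_CAPACITY)
		return PAGE_REPORTING_CAPACITY;

	return requested;
}

/* Per-list budget; works for any non-zero capacity, power of 2 or not. */
static unsigned long toy_budget(unsigned long nr_free, unsigned int capacity)
{
	return DIV_ROUND_UP(nr_free, capacity * 16UL);
}
```

Under that assumption, a driver asking for 24 keeps 24, while 0 or 64
both end up as 32. With 1000 free pages the budget is 3 for a capacity
of 24 and 2 for a capacity of 32.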
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
drivers/virtio/virtio_balloon.c | 5 +----
include/linux/page_reporting.h | 3 +++
mm/page_reporting.c | 25 ++++++++++++++-----------
3 files changed, 18 insertions(+), 15 deletions(-)
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f6c2dff33f8a..6a1a610c2cb1 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -1017,10 +1017,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
unsigned int capacity;
capacity = virtqueue_get_vring_size(vb->reporting_vq);
- if (capacity < PAGE_REPORTING_CAPACITY) {
- err = -ENOSPC;
- goto out_unregister_oom;
- }
vb->pr_dev_info.order = PAGE_REPORTING_ORDER_UNSPECIFIED;
@@ -1041,6 +1037,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
vb->pr_dev_info.order = 5;
#endif
+ vb->pr_dev_info.capacity = capacity;
err = page_reporting_register(&vb->pr_dev_info);
if (err)
goto out_unregister_oom;
diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
index 9d4ca5c218a0..5ab5be02fa15 100644
--- a/include/linux/page_reporting.h
+++ b/include/linux/page_reporting.h
@@ -22,6 +22,9 @@ struct page_reporting_dev_info {
/* Minimal order of page reporting */
unsigned int order;
+
+ /* Max pages per report batch (default PAGE_REPORTING_CAPACITY) */
+ unsigned int capacity;
};
/* Tear-down and bring-up for page reporting devices */
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 7418f2e500bb..5b6b17f67131 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -174,10 +174,10 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
* list processed. This should result in us reporting all pages on
* an idle system in about 30 seconds.
*
- * The division here should be cheap since PAGE_REPORTING_CAPACITY
- * should always be a power of 2.
+ * The division here uses integer division; capacity need
+ * not be a power of 2.
*/
- budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
+ budget = DIV_ROUND_UP(area->nr_free, prdev->capacity * 16);
/* loop through free list adding unreported pages to sg list */
list_for_each_entry_safe(page, next, list, lru) {
@@ -222,10 +222,10 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
spin_unlock_irq(&zone->lock);
/* begin processing pages in local list */
- err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
+ err = prdev->report(prdev, sgl, prdev->capacity);
/* reset offset since the full list was reported */
- *offset = PAGE_REPORTING_CAPACITY;
+ *offset = prdev->capacity;
/* update budget to reflect call to report function */
budget--;
@@ -234,7 +234,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
spin_lock_irq(&zone->lock);
/* flush reported pages from the sg list */
- page_reporting_drain(prdev, sgl, PAGE_REPORTING_CAPACITY, !err);
+ page_reporting_drain(prdev, sgl, prdev->capacity, !err);
/*
* Reset next to first entry, the old next isn't valid
@@ -260,13 +260,13 @@ static int
page_reporting_process_zone(struct page_reporting_dev_info *prdev,
struct scatterlist *sgl, struct zone *zone)
{
- unsigned int order, mt, leftover, offset = PAGE_REPORTING_CAPACITY;
+ unsigned int order, mt, leftover, offset = prdev->capacity;
unsigned long watermark;
int err = 0;
/* Generate minimum watermark to be able to guarantee progress */
watermark = low_wmark_pages(zone) +
- (PAGE_REPORTING_CAPACITY << page_reporting_order);
+ (prdev->capacity << page_reporting_order);
/*
* Cancel request if insufficient free memory or if we failed
@@ -290,7 +290,7 @@ page_reporting_process_zone(struct page_reporting_dev_info *prdev,
}
/* report the leftover pages before going idle */
- leftover = PAGE_REPORTING_CAPACITY - offset;
+ leftover = prdev->capacity - offset;
if (leftover) {
sgl = &sgl[offset];
err = prdev->report(prdev, sgl, leftover);
@@ -322,11 +322,11 @@ static void page_reporting_process(struct work_struct *work)
atomic_set(&prdev->state, state);
/* allocate scatterlist to store pages being reported on */
- sgl = kmalloc_objs(*sgl, PAGE_REPORTING_CAPACITY);
+ sgl = kmalloc_objs(*sgl, prdev->capacity);
if (!sgl)
goto err_out;
- sg_init_table(sgl, PAGE_REPORTING_CAPACITY);
+ sg_init_table(sgl, prdev->capacity);
for_each_zone(zone) {
err = page_reporting_process_zone(prdev, sgl, zone);
@@ -377,6 +377,9 @@ int page_reporting_register(struct page_reporting_dev_info *prdev)
page_reporting_order = pageblock_order;
}
+ if (!prdev->capacity || prdev->capacity > PAGE_REPORTING_CAPACITY)
+ prdev->capacity = PAGE_REPORTING_CAPACITY;
+
/* initialize state and work structures */
atomic_set(&prdev->state, PAGE_REPORTING_IDLE);
INIT_DELAYED_WORK(&prdev->work, &page_reporting_process);
--
MST
* [PATCH v10 05/37] mm: hugetlb: remove dead alloc_hugetlb_folio stub
From: Michael S. Tsirkin @ 2026-06-08 8:34 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <cover.1780906288.git.mst@redhat.com>
Remove the !CONFIG_HUGETLB_PAGE stub for alloc_hugetlb_folio().
The stub is dead code: all callers are in mm/hugetlb.c
(CONFIG_HUGETLB_PAGE) or fs/hugetlbfs/inode.c (CONFIG_HUGETLBFS),
and CONFIG_HUGETLB_PAGE is def_bool HUGETLBFS with nothing
selecting it independently.
The stub is also broken: it returns NULL, but all callers check
IS_ERR(folio), so a NULL return would not be caught and would
crash on the subsequent folio dereference.
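Why a NULL return slips past an IS_ERR() check can be shown with a
user-space model of the error-pointer convention. This is a sketch, not
the kernel's definitions: toy_err_ptr() and toy_is_err() are
hypothetical names, and TOY_MAX_ERRNO = 4095 is assumed to match the
kernel's MAX_ERRNO.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match the kernel's MAX_ERRNO. */
#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static void *toy_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Only the top TOY_MAX_ERRNO addresses count as errors; NULL does not. */
static bool toy_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}
```

Here toy_is_err(toy_err_ptr(-12)) is true, but toy_is_err(NULL) is
false, so a caller that only tests for an error pointer would go on to
dereference NULL.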
Remove it now since follow-up patches change the signature of
alloc_hugetlb_folio and would otherwise need to update the
broken stub too.
Reviewed-by: Gregory Price <gourry@gourry.net>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
include/linux/hugetlb.h | 7 -------
1 file changed, 7 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 5957bc25efa8..1f7ae6609e51 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1123,13 +1123,6 @@ static inline void wait_for_freed_hugetlb_folios(void)
{
}
-static inline struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
- unsigned long addr,
- bool cow_from_owner)
-{
- return NULL;
-}
-
static inline struct folio *
alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
nodemask_t *nmask, gfp_t gfp_mask)
--
MST