Linux virtualization list
* [PATCH v5 4/8] nvdimm: virtio_pmem: always wake -ENOSPC waiters
From: Li Chen @ 2026-06-17 12:24 UTC
  To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alison Schofield, virtualization, nvdimm
  Cc: linux-kernel, Li Chen
In-Reply-To: <20260617122442.2118957-1-me@linux.beauty>

virtio_pmem_host_ack() reclaims virtqueue descriptors with
virtqueue_get_buf(). The -ENOSPC waiter wakeup is tied to completing the
returned token. If token completion is skipped for any reason, reclaimed
descriptors may not wake a waiter and the submitter may sleep forever
waiting for a free slot. Always wake one -ENOSPC waiter for each virtqueue
completion before touching the returned token.

Signed-off-by: Li Chen <me@linux.beauty>
---
v2->v3:
- Split out the waiter wakeup ordering change from READ_ONCE()/WRITE_ONCE()
  updates (now patch 4/7), per Pankaj's suggestion.
v3->v4:
- Rebased onto v7.1-rc7 and renumbered after the flush error patches.
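
Illustration only, not part of the patch: a minimal userspace sketch of the
ordering described above, assuming a simple FIFO of waiters. All model_*
names are hypothetical stand-ins, and the skip_completion flag only models
"token completion is skipped for any reason"; it does not exist in the
driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these are not the kernel structures. */
struct model_waiter {
	bool buf_avail;			/* stands in for wq_buf_avail */
	struct model_waiter *next;	/* stands in for the req_list linkage */
};

struct model_token {
	bool done;			/* stands in for req_data->done */
};

struct model_dev {
	struct model_waiter *waiters;	/* FIFO of -ENOSPC waiters */
};

/* Flag the oldest -ENOSPC waiter, if any, and unlink it from the FIFO. */
static void model_wake_one_waiter(struct model_dev *dev)
{
	struct model_waiter *w = dev->waiters;

	if (!w)
		return;

	w->buf_avail = true;
	dev->waiters = w->next;
	w->next = NULL;
}

/*
 * One reclaimed descriptor: wake a waiter first, then complete the token.
 * The waiter is woken even when token completion is skipped.
 */
static void model_host_ack(struct model_dev *dev, struct model_token *tok,
			   bool skip_completion)
{
	model_wake_one_waiter(dev);

	if (!skip_completion)
		tok->done = true;
}
```

In this model, a skipped completion still releases exactly one waiter per
reclaimed descriptor, which is the property the patch is after.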

 drivers/nvdimm/nd_virtio.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index 081370aac6317..16ee5a47b9938 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -9,26 +9,33 @@
 #include "virtio_pmem.h"
 #include "nd.h"
 
+static void virtio_pmem_wake_one_waiter(struct virtio_pmem *vpmem)
+{
+	struct virtio_pmem_request *req_buf;
+
+	if (list_empty(&vpmem->req_list))
+		return;
+
+	req_buf = list_first_entry(&vpmem->req_list,
+				   struct virtio_pmem_request, list);
+	req_buf->wq_buf_avail = true;
+	wake_up(&req_buf->wq_buf);
+	list_del(&req_buf->list);
+}
+
  /* The interrupt handler */
 void virtio_pmem_host_ack(struct virtqueue *vq)
 {
 	struct virtio_pmem *vpmem = vq->vdev->priv;
-	struct virtio_pmem_request *req_data, *req_buf;
+	struct virtio_pmem_request *req_data;
 	unsigned long flags;
 	unsigned int len;
 
 	spin_lock_irqsave(&vpmem->pmem_lock, flags);
 	while ((req_data = virtqueue_get_buf(vq, &len)) != NULL) {
+		virtio_pmem_wake_one_waiter(vpmem);
 		req_data->done = true;
 		wake_up(&req_data->host_acked);
-
-		if (!list_empty(&vpmem->req_list)) {
-			req_buf = list_first_entry(&vpmem->req_list,
-					struct virtio_pmem_request, list);
-			req_buf->wq_buf_avail = true;
-			wake_up(&req_buf->wq_buf);
-			list_del(&req_buf->list);
-		}
 	}
 	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
 }
-- 
2.52.0


* [PATCH v5 3/8] nvdimm: virtio_pmem: use GFP_NOIO for child flush bio
From: Li Chen @ 2026-06-17 12:24 UTC
  To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alison Schofield, virtualization, nvdimm
  Cc: linux-kernel, Li Chen
In-Reply-To: <20260617122442.2118957-1-me@linux.beauty>

async_pmem_flush() can allocate a child flush bio from filesystem flush
and writeback paths. GFP_ATOMIC is unnecessarily restrictive there and can
make the allocation fail under pressure, which then propagates -ENOMEM to
the flush caller.

A local virtio-pmem mkfs sanity test hit a flush failure before this
change:

  wipefs: /dev/pmem0: cannot flush modified buffers: Input/output error
  mkfs.ext4: Input/output error while writing out and closing file system
  nd_region region0: dbg: nvdimm_flush rc=-5

The debug log showed async_pmem_flush() was entered and nvdimm_flush()
returned -EIO. With GFP_NOIO, the same test reached mkfs_rc=0, mount_rc=0,
and umount_rc=0.

Use GFP_NOIO instead. The path may sleep, but it must not recurse into
filesystem I/O reclaim while it is already servicing a flush request.

Signed-off-by: Li Chen <me@linux.beauty>
---
v3->v4:
- New patch.

 drivers/nvdimm/nd_virtio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index 4176046627beb..081370aac6317 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -117,7 +117,7 @@ int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
 	if (bio && bio->bi_iter.bi_sector != -1) {
 		struct bio *child = bio_alloc(bio->bi_bdev, 0,
 					      REQ_OP_WRITE | REQ_PREFLUSH,
-					      GFP_ATOMIC);
+					      GFP_NOIO);
 
 		if (!child)
 			return -ENOMEM;
-- 
2.52.0


* [PATCH v5 2/8] nvdimm: pmem: keep PREFLUSH before data writes
From: Li Chen @ 2026-06-17 12:24 UTC
  To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alison Schofield, virtualization, nvdimm
  Cc: linux-kernel, Li Chen
In-Reply-To: <20260617122442.2118957-1-me@linux.beauty>

pmem_submit_bio() records a REQ_PREFLUSH error, but continues to copy the
bio data and can later overwrite the error with a successful REQ_FUA flush.
That lets data writes run after a failed preflush and can complete the bio
successfully despite the failed ordering barrier.

Run the REQ_PREFLUSH flush synchronously before touching the bio data and
complete the bio with the flush error if it fails. Keep asynchronous flush
chaining for REQ_FUA. At that point, data copy has completed and the parent
bio can wait for the chained flush bio.

Signed-off-by: Li Chen <me@linux.beauty>
---
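Illustration only, not part of the patch: a minimal userspace sketch of the
control flow described above, assuming the flush and copy results can be
passed in as plain errno values. MODEL_PREFLUSH, MODEL_FUA, struct
model_result and model_submit() are hypothetical names; the real code works
on struct bio and blk_status_t.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_PREFLUSH	0x1u	/* stands in for REQ_PREFLUSH */
#define MODEL_FUA	0x2u	/* stands in for REQ_FUA */

struct model_result {
	int status;		/* final errno-style status, 0 on success */
	bool data_copied;	/* whether the data copy step ran */
};

/*
 * preflush_rc, copy_rc and fua_rc are the results the model should pretend
 * each step produced (0 or a negative errno).
 */
static struct model_result model_submit(unsigned int opf, int preflush_rc,
					int copy_rc, int fua_rc)
{
	struct model_result res = { 0, false };

	/* A failed preflush completes the request before any data is touched. */
	if ((opf & MODEL_PREFLUSH) && preflush_rc) {
		res.status = preflush_rc;
		return res;
	}

	res.data_copied = true;
	res.status = copy_rc;

	/* The FUA flush only runs if the data copy itself succeeded. */
	if ((opf & MODEL_FUA) && !res.status)
		res.status = fua_rc;

	return res;
}
```

With a failing preflush the model reports the flush error and never copies
data; a later successful FUA flush can no longer overwrite that error.
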
 drivers/nvdimm/pmem.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 92c67fbbc1c85..05d3de33e2706 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -208,8 +208,14 @@ static void pmem_submit_bio(struct bio *bio)
 	struct pmem_device *pmem = bio->bi_bdev->bd_disk->private_data;
 	struct nd_region *nd_region = to_region(pmem);
 
-	if (bio->bi_opf & REQ_PREFLUSH)
-		ret = nvdimm_flush(nd_region, bio);
+	if (bio->bi_opf & REQ_PREFLUSH) {
+		ret = nvdimm_flush(nd_region, NULL);
+		if (ret) {
+			bio->bi_status = errno_to_blk_status(ret);
+			bio_endio(bio);
+			return;
+		}
+	}
 
 	do_acct = blk_queue_io_stat(bio->bi_bdev->bd_disk->queue);
 	if (do_acct)
@@ -229,7 +235,7 @@ static void pmem_submit_bio(struct bio *bio)
 	if (do_acct)
 		bio_end_io_acct(bio, start);
 
-	if (bio->bi_opf & REQ_FUA)
+	if ((bio->bi_opf & REQ_FUA) && !bio->bi_status)
 		ret = nvdimm_flush(nd_region, bio);
 
 	if (ret)
-- 
2.52.0



* [PATCH v5 1/8] nvdimm: preserve flush callback errors
From: Li Chen @ 2026-06-17 12:24 UTC
  To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alison Schofield, virtualization, nvdimm
  Cc: linux-kernel, Li Chen
In-Reply-To: <20260617122442.2118957-1-me@linux.beauty>

nvdimm_flush() currently converts any non-zero provider flush error to
-EIO. That loses useful errno values from provider callbacks.

A local virtio-pmem mkfs sanity test showed the masking clearly:

  wipefs: /dev/pmem0: cannot flush modified buffers: Input/output error
  mkfs.ext4: Input/output error while writing out and closing file system
  nd_region region0: dbg: nvdimm_flush rc=-5

The virtio-pmem callback can return -ENOMEM when async_pmem_flush() fails
to allocate a child flush bio, but nvdimm_flush() hides that as -EIO before
pmem_submit_bio() converts it to a block status.

Return the provider callback error directly. The generic flush path still
returns 0, and pmem_submit_bio() already handles errno-to-blk_status
conversion for bio completion.

Signed-off-by: Li Chen <me@linux.beauty>
---
v3->v4:
- New patch.
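
Illustration only, not part of the patch: a minimal userspace sketch that
contrasts the old masking behaviour with the new pass-through, assuming a
provider callback that takes no arguments. The model_* names are
hypothetical; a NULL callback stands in for the generic flush path.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical provider callback type: returns 0 or a negative errno. */
typedef int (*model_flush_fn)(void);

static int model_flush_enomem(void)
{
	return -ENOMEM;
}

static int model_flush_ok(void)
{
	return 0;
}

/* Old behaviour: any non-zero provider error collapses to -EIO. */
static int model_flush_masked(model_flush_fn flush)
{
	if (!flush)
		return 0;	/* generic path */

	return flush() ? -EIO : 0;
}

/* New behaviour: the provider's errno is returned unchanged. */
static int model_flush_preserved(model_flush_fn flush)
{
	if (!flush)
		return 0;	/* generic path */

	return flush();
}
```

Here model_flush_masked(model_flush_enomem) yields -EIO, while
model_flush_preserved(model_flush_enomem) yields -ENOMEM.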

 drivers/nvdimm/region_devs.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index e35c2e18518f0..0cd96503c0596 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1114,10 +1114,8 @@ int nvdimm_flush(struct nd_region *nd_region, struct bio *bio)
 
 	if (!nd_region->flush)
 		rc = generic_nvdimm_flush(nd_region);
-	else {
-		if (nd_region->flush(nd_region, bio))
-			rc = -EIO;
-	}
+	else
+		rc = nd_region->flush(nd_region, bio);
 
 	return rc;
 }
-- 
2.52.0


* [PATCH v5 0/8] nvdimm: virtio_pmem: fix request lifetime and converge broken queue failures
From: Li Chen @ 2026-06-17 12:24 UTC
  To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alison Schofield, virtualization, nvdimm
  Cc: linux-kernel

Hi,

The nvdimm flush helper currently converts any non-zero provider flush
callback error to -EIO. That hides useful errno values from providers. For
example, virtio-pmem may fail child flush bio allocation with -ENOMEM, but
that is currently reported as -EIO by nvdimm_flush().

The raw failure seen in the local mkfs sanity test was:

  wipefs: /dev/pmem0: cannot flush modified buffers: Input/output error
  mkfs.ext4: Input/output error while writing out and closing file system
  nd_region region0: dbg: nvdimm_flush rc=-5

The first three patches keep provider flush errors intact, make
pmem_submit_bio() honor a failed REQ_PREFLUSH before copying data, and use
GFP_NOIO for virtio-pmem child flush bio allocation. REQ_PREFLUSH is now
issued synchronously before the data copy. The asynchronous child flush bio is
still used for REQ_FUA, where the data copy has already completed and the
parent bio can be chained to the flush completion.

The rest of the series addresses virtio-pmem request lifetime and broken
virtqueue handling. The virtio-pmem flush path uses a virtqueue cookie/token
to carry a per-request context through completion. Under broken virtqueue /
notify failure conditions, the submitter can return and free the request
object while the host/backend may still complete the published request. The
IRQ completion handler then dereferences freed memory when waking waiters,
which is reported by KASAN as a slab-use-after-free and may manifest as lock
corruption (e.g. "BUG: spinlock already unlocked") without KASAN.

In addition, the flush path has two wait sites: one for virtqueue descriptor
availability (-ENOSPC from virtqueue_add_sgs()) and one for request
completion. If the virtqueue becomes broken, forward progress is no longer
guaranteed and these waiters may sleep indefinitely unless the driver
converges the failure and wakes all wait sites.

This series addresses these issues:

1/8 nvdimm: preserve flush callback errors
Return provider flush callback errors directly from nvdimm_flush().

2/8 nvdimm: pmem: keep PREFLUSH before data writes
Run REQ_PREFLUSH synchronously before copying data and fail the bio if the
flush fails.

3/8 nvdimm: virtio_pmem: use GFP_NOIO for child flush bio
Use GFP_NOIO for the child flush bio allocation.

4/8 nvdimm: virtio_pmem: always wake -ENOSPC waiters
Wake one -ENOSPC waiter for each reclaimed used buffer, decoupled from
token completion.

5/8 nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags
Use READ_ONCE()/WRITE_ONCE() for the wait_event() flags (done and
wq_buf_avail).

6/8 nvdimm: virtio_pmem: refcount requests for token lifetime
Refcount request objects so the token lifetime spans the window where it is
reachable through the virtqueue until completion/drain drops the virtqueue
reference.

7/8 nvdimm: virtio_pmem: converge broken virtqueue to -EIO
Track a device-level broken state to converge broken/notify failures to -EIO:
wake -ENOSPC waiters, fail-fast new requests, and report errors for completed
tokens after the queue is marked broken.

8/8 nvdimm: virtio_pmem: drain requests in freeze
Drain outstanding requests in freeze() after resetting the device so waiters
do not sleep indefinitely and virtqueue_detach_unused_buf() only runs on a
quiesced queue.
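
Illustration only: a minimal userspace sketch of the token lifetime idea in
6/8, assuming one reference for the submitter and one for the virtqueue.
The model_req_* names and the plain int counter are hypothetical; the
actual patch may well use a kernel refcount primitive, and this sketch is
not thread-safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct model_req {
	int refs;	/* plain counter; not atomic in this sketch */
	bool *freed;	/* test hook: set when the object is released */
};

/* Allocate a request holding the submitter's initial reference. */
static struct model_req *model_req_alloc(bool *freed)
{
	struct model_req *req = malloc(sizeof(*req));

	if (!req)
		return NULL;

	req->refs = 1;
	req->freed = freed;
	*freed = false;
	return req;
}

/* Take an extra reference, e.g. while the token is visible to the queue. */
static void model_req_get(struct model_req *req)
{
	req->refs++;
}

/* Drop a reference; the last put releases the object. */
static void model_req_put(struct model_req *req)
{
	assert(req->refs > 0);

	if (--req->refs == 0) {
		*req->freed = true;
		free(req);
	}
}
```

If the submitter bails out early and drops its reference, the object stays
alive until completion or drain drops the queue's reference, so the
completion side never touches freed memory in this model.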

The original repros were on QEMU x86_64 with a virtio-pmem device exported
as /dev/pmem0. For this v5 reroll, I checked that the series applies to
v7.1-rc7 and to next/master at 8d6dbbbe3ba6 ("Add linux-next specific files
for 20260615"). Each commit builds with CONFIG_VIRTIO_PMEM=m, and the series
passes checkpatch.

Thanks,
Li Chen

Changelog:
v4->v5:
- Address review feedback about REQ_PREFLUSH ordering and active virtqueue
  detach.
- Add 2/8 so a failed REQ_PREFLUSH fails the bio before any data copy, and
  make REQ_PREFLUSH use a synchronous provider flush instead of a deferred
  child bio.
- Rework broken-queue handling so runtime failure marking only stops new
  submissions and wakes local -ENOSPC waiters; used/unused token draining is
  done after device reset in remove() and freeze().
- Remove the broken-state shortcut from the host-completion wait so the
  submitter never reads an uninitialized response field.
- Keep the raw broken-virtqueue dmesg in 7/8 while updating the teardown
  rationale.
- Renumber the old virtio-pmem fixes after the new pmem PREFLUSH patch.
v3->v4:
- Rebased the series onto v7.1-rc7 so it applies cleanly to Linux 7.1-rc7.
- Update the allocation site in 6/7 from kmalloc(sizeof(*req_data),
  GFP_KERNEL) to kmalloc_obj(*req_data) to match current nvdimm code.
- Add 1/7 to preserve provider flush callback errors in nvdimm_flush().
- Include the GFP_NOIO child flush bio allocation fix as 2/7.
- Renumber the old request lifetime and broken virtqueue fixes after the two
  new flush error patches.
v2->v3:
- Split patch 1 as suggested by Pankaj Gupta: keep the waiter wakeup
  ordering change in 1/5 and move READ_ONCE()/WRITE_ONCE() updates to
  2/5 (no functional change intended).
- Add log report to commit msg
- Fold the export fix into 4/5 to keep the series bisectable when
  CONFIG_VIRTIO_PMEM=m.
v1->v2: add the export patch to fix compile issue.

Links:
v4: https://lore.kernel.org/all/20260609120726.1714780-1-me@linux.beauty/
v3: https://lore.kernel.org/all/20260226025712.2236279-1-me@linux.beauty/#t
v2: https://lore.kernel.org/all/20251225042915.334117-1-me@linux.beauty/
v1: https://www.spinics.net/lists/kernel/msg5974818.html

Li Chen (8):
  nvdimm: preserve flush callback errors
  nvdimm: pmem: keep PREFLUSH before data writes
  nvdimm: virtio_pmem: use GFP_NOIO for child flush bio
  nvdimm: virtio_pmem: always wake -ENOSPC waiters
  nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags
  nvdimm: virtio_pmem: refcount requests for token lifetime
  nvdimm: virtio_pmem: converge broken virtqueue to -EIO
  nvdimm: virtio_pmem: drain requests in freeze

 drivers/nvdimm/nd_virtio.c   | 163 ++++++++++++++++++++++++++++-------
 drivers/nvdimm/pmem.c        |  12 ++-
 drivers/nvdimm/region_devs.c |   6 +-
 drivers/nvdimm/virtio_pmem.c |  28 +++++-
 drivers/nvdimm/virtio_pmem.h |   7 ++
 5 files changed, 178 insertions(+), 38 deletions(-)

-- 
2.52.0


* Re: [PATCH net-next 1/3] net: busy-poll: introduce sk_tx_busy_loop()
From: Menglong Dong @ 2026-06-17 12:00 UTC
  To: Maciej Fijalkowski
  Cc: Menglong Dong, Jakub Kicinski, jasowang, mst, xuanzhuo, eperezma,
	andrew+netdev, davem, edumazet, pabeni, magnus.karlsson, sdf,
	horms, ast, daniel, hawk, john.fastabend, bjorn, kerneljasonxing,
	netdev, virtualization, linux-kernel, bpf
In-Reply-To: <ajJrckiXEUztBQDz@boxer>

On Wed, Jun 17, 2026 at 5:40 PM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Sun, Jun 14, 2026 at 06:12:46PM +0800, Menglong Dong wrote:
> > On 2026/6/14 02:21, Jakub Kicinski wrote:
> > > On Thu, 11 Jun 2026 15:12:40 +0800 menglong8.dong@gmail.com wrote:
[...]
> >
> > I'm not sure if it is a good idea to introduce the sk_tx_busy_loop().
> > Maybe we can modify the driver instead by using the same NAPI
> > for both data sending and receiving, just like others do. The
> > advantage of introducing sk_tx_busy_loop() is that we can split the
> > data sending and receiving, which may be more efficient.
>
> Would be good if you back your changes by any performance numbers. I
> believe that drivers do tx processing via rx napi as before AF_XDP it was
> only about cleaning up writebacks, AF_XDP added more weight via actual tx
> descriptors submission.
>
> Maybe you can vibe-code virtio-net to work only with rx napi and see what
> are the results.

Hi, Maciej. I have not done such performance testing yet. It's a good and
interesting idea to run such a test on virtio-net, and I'll do it. If there is
no obvious performance difference, I'll modify virtio-net to send data via the
rx napi instead.

>
> Side note/question - Do you have a tx-only use case for AF_XDP ? I am
> planning (for a long time actually) to implement asymmetric AF_XDP
> sockets. Currently for ZC scenarios xsk socket occupies both rx and tx
> queues even when you do rx or tx only.

I think this is an interesting idea, and will be helpful in some cases.
I'm improving the performance of MySQL with AF_XDP. For this case,
tx-only is not suitable, as data reading and writing are both needed.

But for other cases, such as Redis, the workload is mostly reads. In
that case, I think it's a good idea to use such a "tx-only" ZC AF_XDP.

In my case, I don't want AF_XDP to occupy the whole NIC or the whole
queue, so that other users can use the NIC too. However, AF_XDP ZC adds
a little overhead to the skb rx path, as there is an extra data copy.

If such "tx-only" ZC is supported, AF_XDP still performs well in the
read-mostly case without adding overhead for other users.

I haven't used AF_XDP for such a "reading mostly" case yet, so I'm not
sure if I'm right ;)

Thanks!
Menglong Dong

>
> >
> > >
> > > Third, this series does not apply.
> >
> > Ah, I'll rebase this series if a V2 is acceptable.
> >
> > Thanks!
> > Menglong Dong
> >
> > >
> > >
> >
> >
> >
> >


* Re: [PATCH] drm: Consistently define pci_device_ids using named initializers
From: Uwe Kleine-König (The Capable Hub) @ 2026-06-17 10:59 UTC
  To: Thomas Zimmermann
  Cc: Maarten Lankhorst, Maxime Ripard, David Airlie, Simona Vetter,
	Gerd Hoffmann, Markus Schneider-Pargmann, Patrik Jakobsson,
	Jianmin Lv, Qianhai Wu, Huacai Chen, Mingcong Bai, Xi Ruoyao,
	Icenowy Zheng, Dave Airlie, Jocelyn Falempe, dri-devel,
	linux-kernel, virtualization, spice-devel
In-Reply-To: <bc89deb2-2a2e-41c5-8cd9-28b794020972@suse.de>


Hallo Thomas,

On Wed, Jun 17, 2026 at 11:11:32AM +0200, Thomas Zimmermann wrote:
> Am 12.06.26 um 14:10 schrieb Thomas Zimmermann:
> > Hi
> > 
> > Am 04.05.26 um 17:05 schrieb Uwe Kleine-König (The Capable Hub):
> > > The .driver_data member of the various struct pci_device_id arrays were
> > > initialized by list expressions. This isn't easily readable if you're
> > > not into PCI. Using the PCI_DEVICE macro and named initializers is more
> > > explicit and thus easier to parse. Also skip explicit assignments of 0
> > > (which the compiler then takes care of).
> > > 
> > > This change doesn't introduce changes to the compiled pci_device_id
> > > arrays. Tested on x86 and arm64.
> > > 
> > > Signed-off-by: Uwe Kleine-König (The Capable Hub)
> > > <u.kleine-koenig@baylibre.com>
> > 
> > Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
> > 
> > I'll merge the patch into drm-misc-next.
> 
> Merged with a minor change to coding style in gma500.

Ah, you removed spaces from expressions like:

	(long) &cdv_chip_ops

That's fine, thank you.

Best regards
Uwe



* [GIT PULL] virtio,vhost,vdpa: features, fixes
From: Michael S. Tsirkin @ 2026-06-17 10:55 UTC
  To: Linus Torvalds, kvm, virtualization, netdev, linux-kernel, a0yami,
	ammarfaizi2, arnd, chenhuacai, chenhuacai, christfontanez,
	Damir.Shaikhutdinov, david, den, enelsonmoore, eperezma, ethan,
	evg28bur, filip.hejsek, francesco, graf, harald.mommer, jasowang,
	jiri, johan, johannes.thumshirn, lingshan.zhu, luis.hernandez093,
	lulu, mhi, michael.bommarito, mikhail.golubev-ciuchea, mkl, mst,
	mvaralar, nathan, oleg, pawel.moll, physicalmtea, polina.vishneva,
	q.h.hack.winter, rosenp, schalla, shuangyu, stefanha, vattunuru,
	yanlonglong, yichun, yui.washidu, yuka, zhangtianci.1997

The following changes since commit e43ffb69e0438cddd72aaa30898b4dc446f664f8:

  Linux 7.1-rc6 (2026-05-31 15:14:24 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 8cb2c9285e4ce9154f45fb15633ebd45dfd8d9cf:

  can: virtio: Fix comment in UAPI header (2026-06-10 02:17:00 -0400)

----------------------------------------------------------------
virtio,vhost,vdpa: features, fixes

- new virtio CAN driver
- support for LoongArch architecture in fw_cfg
- support for firmware notifications in vdpa/octeon_ep
- support for VFs in virtio core

- fixes, cleanups all over the place, notably
    - vhost: fix vhost_get_avail_idx for a non empty ring
      fixing a significant old perf regression
    - plus READ_ONCE annotations mean virtio ring is now
      free of KCSAN warnings

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

----------------------------------------------------------------
Alexander Graf (1):
      virtio_ring: Add READ_ONCE annotations for device-writable fields

Ammar Faizi (1):
      virtio_pci: fix vq info pointer lookup via wrong index

Arnd Bergmann (1):
      vduse: fix compat handling for VDUSE_IOTLB_GET_FD/VDUSE_VQ_GET_INFO

Christian Fontanez (1):
      virtio: add missing kernel-doc for map and vmap members

Cindy Lu (2):
      vdpa/mlx5: update mlx_features with driver state check
      vdpa/mlx5: update MAC address handling in mlx5_vdpa_set_attr()

Denis V. Lunev (1):
      vhost/vsock: Refuse the connection immediately when guest isn't ready

Ethan Carter Edwards (1):
      virtio_console: Fix spelling mistake "colums" -> "columns"

Ethan Nelson-Moore (1):
      vhost: remove unnecessary module_init/exit functions

Evgenii Burenchev (1):
      vdpa/ifcvf: handle dev_set_name() failure in ifcvf_vdpa_dev_add()

Filip Hejsek (1):
      virtio_console: read size from config space during device init

Huacai Chen (1):
      fw_cfg: Add support for LoongArch architecture

Jason Wang (1):
      VDUSE: avoid leaking information to userspace

Jia Jia (1):
      virtio: rtc: tear down old virtqueues before restore

Johan Hovold (3):
      virtio-mmio: fix device release warning on module unload
      vdpa_sim_blk: switch to dynamic root device
      vdpa_sim_net: switch to dynamic root device

Matias Ezequiel Vara Larsen (1):
      can: virtio: Add virtio CAN driver

Maurice Hieronymus (2):
      virtio-balloon: Destroy mutex before freeing virtio_balloon
      virtio-mem: Destroy mutex before freeing virtio_mem

Michael Bommarito (1):
      hwrng: virtio: clamp device-reported used.len at copy_data()

Michael S. Tsirkin (2):
      vhost: fix vhost_get_avail_idx for a non empty ring
      tools/virtio: fix build for kmalloc_obj API and missing stubs

Nathan Chancellor (1):
      can: virtio: Fix comment in UAPI header

Oleg Nesterov (1):
      vhost_task_create: kill unnecessary .exit_signal initialization

Qihang Tang (2):
      vduse: hold vduse_lock across IDR lookup in open path
      vhost/vdpa: validate virtqueue index in mmap and fault paths

Qing Ming (1):
      vhost/net: complete zerocopy ubufs only once

Rosen Penev (1):
      vdpa/mlx5: Use kvzalloc_flex() for MTT command memory

Srujana Challa (2):
      vdpa/octeon_ep: Fix PF->VF mailbox data address calculation
      vdpa/octeon_ep: fix IRQ-to-ring mapping in interrupt handler

Vamsi Attunuru (2):
      vdpa/octeon_ep: Use 4 bytes for mailbox signature
      vdpa/octeon_ep: Add vDPA device event handling for firmware notifications

Yui Washizu (1):
      virtio: add num_vf callback to virtio_bus

Zhang Tianci (2):
      vduse: Requeue failed read to send_list head
      vduse: Fix race in vduse_dev_msg_sync and vduse_dev_read_iter

longlong yan (1):
      tools/virtio: check mmap return value in vringh_test

 MAINTAINERS                              |    9 +
 drivers/char/hw_random/virtio-rng.c      |   23 +-
 drivers/char/virtio_console.c            |   52 +-
 drivers/firmware/Kconfig                 |    2 +-
 drivers/firmware/qemu_fw_cfg.c           |    2 +-
 drivers/net/can/Kconfig                  |   12 +
 drivers/net/can/Makefile                 |    1 +
 drivers/net/can/virtio_can.c             | 1022 ++++++++++++++++++++++++++++++
 drivers/vdpa/ifcvf/ifcvf_main.c          |   11 +-
 drivers/vdpa/mlx5/core/mr.c              |    7 +-
 drivers/vdpa/octeon_ep/octep_vdpa.h      |   22 +-
 drivers/vdpa/octeon_ep/octep_vdpa_main.c |  131 +++-
 drivers/vdpa/vdpa_sim/vdpa_sim_blk.c     |   24 +-
 drivers/vdpa/vdpa_sim/vdpa_sim_net.c     |   23 +-
 drivers/vdpa/vdpa_user/iova_domain.c     |    2 +-
 drivers/vdpa/vdpa_user/vduse_dev.c       |  197 +++++-
 drivers/vhost/net.c                      |   15 +-
 drivers/vhost/vdpa.c                     |   29 +-
 drivers/vhost/vhost.c                    |   23 +-
 drivers/vhost/vsock.c                    |   16 +
 drivers/virtio/virtio.c                  |    9 +
 drivers/virtio/virtio_balloon.c          |    2 +
 drivers/virtio/virtio_mem.c              |    2 +
 drivers/virtio/virtio_mmio.c             |   26 +-
 drivers/virtio/virtio_pci_common.c       |   10 +-
 drivers/virtio/virtio_ring.c             |   77 ++-
 drivers/virtio/virtio_rtc_driver.c       |   28 +-
 include/linux/virtio.h                   |    2 +
 include/uapi/linux/virtio_can.h          |   78 +++
 include/uapi/linux/virtio_console.h      |    2 +-
 kernel/vhost_task.c                      |    1 -
 tools/virtio/linux/dma-mapping.h         |    2 +
 tools/virtio/linux/err.h                 |    1 +
 tools/virtio/linux/kernel.h              |    6 +
 tools/virtio/vringh_test.c               |    5 +
 35 files changed, 1690 insertions(+), 184 deletions(-)
 create mode 100644 drivers/net/can/virtio_can.c
 create mode 100644 include/uapi/linux/virtio_can.h



* [PATCH v3] vduse: hold vduse_lock across IDR lookup in open path
From: Qihang Tang @ 2026-05-08  9:46 UTC
  To: mst
  Cc: jasowang, w, eperezma, Qihang Tang, kvm, linux-kernel, netdev,
	virtualization
In-Reply-To: <20260418211354.3698-1-q.h.hack.winter@gmail.com>

vduse_dev_open() looks up struct vduse_dev through the IDR and then
acquires dev->lock only after vduse_lock has been dropped.

This leaves a window where a concurrent VDUSE_DESTROY_DEV can remove the
same object from the IDR and free it before the open path locks the
device, leading to a use-after-free.

Close this race by keeping vduse_lock held until dev->lock has been
acquired in the open path, matching the lock ordering already used by
the destroy path.

Fixes: c8a6153b6c59 ("vduse: Introduce VDUSE - vDPA Device in Userspace")
Signed-off-by: Qihang Tang <q.h.hack.winter@gmail.com>
---
v2 -> v3:
- keep vduse_lock held until after dropping dev->lock
in vduse_dev_open()
- add changelog requested in review

v1 -> v2:
- add Fixes tag
- remove helper and inline the locking in
vduse_dev_open()
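
Illustration only, not part of the patch: a minimal userspace sketch of the
locking pattern described above, assuming pthread mutexes in place of
kernel mutexes and a fixed-size table in place of the IDR. All model_*
names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_DEVS 4

struct model_vdev {
	pthread_mutex_t lock;	/* stands in for dev->lock */
	bool connected;
};

/* Stands in for vduse_lock, which protects the lookup table. */
static pthread_mutex_t model_table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct model_vdev *model_table[MODEL_MAX_DEVS];

/*
 * Look up the device and lock it while the table lock is still held, so
 * a concurrent destroy (which would also need the table lock) cannot
 * free the object between the lookup and the device lock.
 */
static int model_open(unsigned int minor)
{
	struct model_vdev *dev;
	int ret = -EBUSY;

	pthread_mutex_lock(&model_table_lock);
	dev = minor < MODEL_MAX_DEVS ? model_table[minor] : NULL;
	if (!dev) {
		pthread_mutex_unlock(&model_table_lock);
		return -ENODEV;
	}

	pthread_mutex_lock(&dev->lock);
	if (!dev->connected) {
		dev->connected = true;
		ret = 0;
	}
	pthread_mutex_unlock(&dev->lock);
	pthread_mutex_unlock(&model_table_lock);

	return ret;
}
```

The first model_open() on a registered slot returns 0, a second one
returns -EBUSY, and an empty slot returns -ENODEV.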

 drivers/vdpa/vdpa_user/vduse_dev.c | 21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 6202f6902fcd..d5c34260ed68 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -1637,26 +1637,18 @@ static int vduse_dev_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
-static struct vduse_dev *vduse_dev_get_from_minor(int minor)
+static int vduse_dev_open(struct inode *inode, struct file *file)
 {
+	int ret = -EBUSY;
 	struct vduse_dev *dev;
 
 	mutex_lock(&vduse_lock);
-	dev = idr_find(&vduse_idr, minor);
-	mutex_unlock(&vduse_lock);
-
-	return dev;
-}
-
-static int vduse_dev_open(struct inode *inode, struct file *file)
-{
-	int ret;
-	struct vduse_dev *dev = vduse_dev_get_from_minor(iminor(inode));
-
-	if (!dev)
+	dev = idr_find(&vduse_idr, iminor(inode));
+	if (!dev) {
+		mutex_unlock(&vduse_lock);
 		return -ENODEV;
+	}
 
-	ret = -EBUSY;
 	mutex_lock(&dev->lock);
 	if (dev->connected)
 		goto unlock;
@@ -1666,6 +1658,7 @@ static int vduse_dev_open(struct inode *inode, struct file *file)
 	file->private_data = dev;
 unlock:
 	mutex_unlock(&dev->lock);
+	mutex_unlock(&vduse_lock);
 
 	return ret;
 }
-- 
2.39.5 (Apple Git-154)



* [PATCH v5] vhost/vdpa: validate virtqueue index in mmap and fault paths
From: Qihang Tang @ 2026-05-08  7:58 UTC
  To: mst
  Cc: jasowang, w, eperezma, Qihang Tang, kvm, linux-kernel, netdev,
	virtualization
In-Reply-To: <20260508063745.90506-1-q.h.hack.winter@gmail.com>

vhost_vdpa_mmap() and vhost_vdpa_fault() use vma->vm_pgoff as a
virtqueue index for get_vq_notification(), but they do not validate
that the index is smaller than v->nvqs.

The ioctl path already performs both a bounds check and
array_index_nospec(), but the mmap/fault path only checks that the
index fits in u16. This allows an out-of-range queue index to reach
driver-specific get_vq_notification() callbacks.

Fix this by extracting a unified vhost_vdpa_get_vq_notification()
helper that validates the queue index against v->nvqs and applies
array_index_nospec() before calling the driver callback. Both the
mmap and fault paths use this helper, and the bounds checking is
consolidated into a single location.

From source inspection, the most defensible impact is out-of-bounds
access in the callback path, potentially leading to invalid PFN
remaps and crash/DoS.

Fixes: ddd89d0a059d ("vhost_vdpa: support doorbell mapping via mmap")
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Qihang Tang <q.h.hack.winter@gmail.com>
---
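Illustration only, not part of the patch: a minimal userspace sketch of the
"bounds check, then clamp" pattern described above. The mask expression
follows the kernel's generic array_index_mask_nospec(); model_index_mask()
and model_get_vq_index() are hypothetical names. It relies on arithmetic
right shift of negative values (as GCC and Clang implement it), and a
userspace build gives no real guarantee against speculation.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/*
 * Branch-free mask: all ones when index < size, zero otherwise.
 * Assumes size is at most LONG_MAX, as the kernel helper does.
 */
static unsigned long model_index_mask(unsigned long index, unsigned long size)
{
	return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
			       (sizeof(long) * CHAR_BIT - 1));
}

/*
 * Validate a queue index taken from an untrusted offset, then clamp it
 * with the mask before it is used to index anything.
 */
static int model_get_vq_index(unsigned long index, unsigned long nvqs,
			      unsigned long *out)
{
	if (index > 65535 || index >= nvqs)
		return -EINVAL;

	*out = index & model_index_mask(index, nvqs);
	return 0;
}
```

Keeping the check and the clamp in one helper means every caller gets both,
which is the consolidation the commit message describes.
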
 drivers/vhost/vdpa.c | 29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 692564b1bcbb..ac55275fa0d0 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1482,16 +1482,32 @@ static int vhost_vdpa_release(struct inode *inode, struct file *filep)
 }
 
 #ifdef CONFIG_MMU
-static vm_fault_t vhost_vdpa_fault(struct vm_fault *vmf)
+static int
+vhost_vdpa_get_vq_notification(struct vhost_vdpa *v, unsigned long index,
+			       struct vdpa_notification_area *notify)
 {
-	struct vhost_vdpa *v = vmf->vma->vm_file->private_data;
 	struct vdpa_device *vdpa = v->vdpa;
 	const struct vdpa_config_ops *ops = vdpa->config;
+
+	if (index > 65535 || index >= v->nvqs)
+		return -EINVAL;
+
+	index = array_index_nospec(index, v->nvqs);
+
+	*notify = ops->get_vq_notification(vdpa, index);
+
+	return 0;
+}
+
+static vm_fault_t vhost_vdpa_fault(struct vm_fault *vmf)
+{
+	struct vhost_vdpa *v = vmf->vma->vm_file->private_data;
 	struct vdpa_notification_area notify;
 	struct vm_area_struct *vma = vmf->vma;
-	u16 index = vma->vm_pgoff;
+	unsigned long index = vma->vm_pgoff;
 
-	notify = ops->get_vq_notification(vdpa, index);
+	if (vhost_vdpa_get_vq_notification(v, index, &notify))
+		return VM_FAULT_SIGBUS;
 
 	return vmf_insert_pfn(vma, vmf->address & PAGE_MASK, PFN_DOWN(notify.addr));
 }
@@ -1514,8 +1530,6 @@ static int vhost_vdpa_mmap(struct file *file, struct vm_area_struct *vma)
 		return -EINVAL;
 	if (vma->vm_flags & VM_READ)
 		return -EINVAL;
-	if (index > 65535)
-		return -EINVAL;
 	if (!ops->get_vq_notification)
 		return -ENOTSUPP;
 
@@ -1523,7 +1537,8 @@ static int vhost_vdpa_mmap(struct file *file, struct vm_area_struct *vma)
 	 * support the doorbell which sits on the page boundary and
 	 * does not share the page with other registers.
 	 */
-	notify = ops->get_vq_notification(vdpa, index);
+	if (vhost_vdpa_get_vq_notification(v, index, &notify))
+		return -EINVAL;
 	if (notify.addr & (PAGE_SIZE - 1))
 		return -EINVAL;
 	if (vma->vm_end - vma->vm_start != notify.size)
-- 
2.39.5 (Apple Git-154)



* Re: [PATCH net-next 1/3] net: busy-poll: introduce sk_tx_busy_loop()
From: Jason Xing @ 2026-06-17 10:19 UTC
  To: Maciej Fijalkowski
  Cc: Menglong Dong, menglong8.dong, Jakub Kicinski, jasowang, mst,
	xuanzhuo, eperezma, andrew+netdev, davem, edumazet, pabeni,
	magnus.karlsson, sdf, horms, ast, daniel, hawk, john.fastabend,
	bjorn, netdev, virtualization, linux-kernel, bpf
In-Reply-To: <ajJrckiXEUztBQDz@boxer>

On Wed, Jun 17, 2026 at 5:40 PM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Sun, Jun 14, 2026 at 06:12:46PM +0800, Menglong Dong wrote:
> > On 2026/6/14 02:21, Jakub Kicinski wrote:
> > > On Thu, 11 Jun 2026 15:12:40 +0800 menglong8.dong@gmail.com wrote:
> > > > For now, we use sk_busy_loop() for both rx and tx path. The sk_busy_loop()
> > > > will call napi_busy_loop() for the specified napi_id. However, some
> > > > nic drivers have tx napi, such as virtio-net. In this case, sk_busy_loop()
> > > > doesn't work, as it can only schedule the NAPI for the rx queue.
> > > >
> > > > Therefore, introduce sk_tx_busy_loop() for the nic drivers that support tx
> > > > napi, which will schedule the tx napi if available.
> > >
> > > First, I thought the only difference with Tx NAPI is that it can't be
> > > busy polled. So if you want to poll an instance don't register it as
> > > a Tx one instead of adding all this "tx polling" stuff in the core?
> >
> > I see. Registering the tx NAPI with netif_napi_add_config() allows us
> > to busy poll it. But we still have two NAPI instances: rx NAPI and tx
> > NAPI. sk_busy_loop() can only busy poll one of them.
> >
> > Before AF_XDP, we didn't need to send packets via tx NAPI, which
> > means that we didn't need to busy poll it.
> >
> > I analyzed how some NIC drivers implement AF_XDP. Some of them
> > check the xsk tx ring of the current queue and send the data in it in
> > the rx NAPI, such as mlx5. Some of them allocate an extra "rxtx" NAPI
> > for the AF_XDP zero-copy queue, which polls both receiving and
> > sending.
> >
> > In the cases above, they do the data sending and receiving for
> > AF_XDP in a single NAPI instance.
> >
> > However, some drivers receive data in the rx NAPI and send data in
> > the tx NAPI for AF_XDP. In this case, we can't use sk_busy_loop() for
> > both the rx path and the tx path, as we need to wake different NAPI
> > instances.
> >
> > >
> > > Second, can this problem happen for any other NIC or is it purely
> > > an artifact of virtio's delayed Tx completion handling?
> >
> > According to my analysis, only the virtio-net and ICSSG drivers have
> > split NAPI for AF_XDP. I don't have an ICSSG NIC, but the codex tells
> > me that it does have the same problem.
> >
> > I'm not sure if it is a good idea to introduce sk_tx_busy_loop().
> > Maybe we can modify the driver instead to use the same NAPI
> > for both data sending and receiving, just like others do. The
> > advantage of introducing sk_tx_busy_loop() is that we can split the
> > data sending and receiving, which may be more efficient.
>
> It would be good to back your changes with performance numbers. I
> believe that drivers do tx processing via rx napi because, before AF_XDP,
> it was only about cleaning up writebacks; AF_XDP added more weight via
> actual tx descriptor submission.
>
> Maybe you can vibe-code virtio-net to work only with rx napi and see what
> the results are.
>
> Side note/question - Do you have a tx-only use case for AF_XDP? I have
> been planning (for a long time actually) to implement asymmetric AF_XDP
> sockets. Currently for ZC scenarios an xsk socket occupies both rx and tx
> queues even when you do rx or tx only.

I use TCP as the userspace protocol, so as far as I know I don't
have any idea how we could apply this. It seems you have a real-world
requirement for it? Interesting.

Thanks,
Jason

>
> >
> > >
> > > Third, this series does not apply.
> >
> > Ah, I'll rebase this series if a V2 is acceptable.
> >
> > Thanks!
> > Menglong Dong
> >
> > >
> > >
> >
> >
> >
> >

^ permalink raw reply

* Re: [PATCH net-next 1/3] net: busy-poll: introduce sk_tx_busy_loop()
From: Maciej Fijalkowski @ 2026-06-17  9:40 UTC (permalink / raw)
  To: Menglong Dong
  Cc: menglong8.dong, Jakub Kicinski, jasowang, mst, xuanzhuo, eperezma,
	andrew+netdev, davem, edumazet, pabeni, magnus.karlsson, sdf,
	horms, ast, daniel, hawk, john.fastabend, bjorn, kerneljasonxing,
	netdev, virtualization, linux-kernel, bpf
In-Reply-To: <TYn10tJ2SIGF1pAhF26DRQ@linux.dev>

On Sun, Jun 14, 2026 at 06:12:46PM +0800, Menglong Dong wrote:
> On 2026/6/14 02:21, Jakub Kicinski wrote:
> > On Thu, 11 Jun 2026 15:12:40 +0800 menglong8.dong@gmail.com wrote:
> > > For now, we use sk_busy_loop() for both rx and tx path. The sk_busy_loop()
> > > will call napi_busy_loop() for the specified napi_id. However, some
> > > nic drivers have tx napi, such as virtio-net. In this case, sk_busy_loop()
> > > doesn't work, as it can only schedule the NAPI for the rx queue.
> > > 
> > > Therefore, introduce sk_tx_busy_loop() for the nic drivers that support tx
> > > napi, which will schedule the tx napi if available.
> > 
> > First, I thought the only difference with Tx NAPI is that it can't be
> > busy polled. So if you want to poll an instance don't register it as 
> > a Tx one instead of adding all this "tx polling" stuff in the core?
> 
> I see. Registering the tx NAPI with netif_napi_add_config() allows us
> to busy poll it. But we still have two NAPI instances: rx NAPI and tx
> NAPI. sk_busy_loop() can only busy poll one of them.
> 
> Before AF_XDP, we didn't need to send packets via tx NAPI, which
> means that we didn't need to busy poll it.
> 
> I analyzed how some NIC drivers implement AF_XDP. Some of them
> check the xsk tx ring of the current queue and send the data in it in
> the rx NAPI, such as mlx5. Some of them allocate an extra "rxtx" NAPI
> for the AF_XDP zero-copy queue, which polls both receiving and
> sending.
> 
> In the cases above, they do the data sending and receiving for
> AF_XDP in a single NAPI instance.
> 
> However, some drivers receive data in the rx NAPI and send data in
> the tx NAPI for AF_XDP. In this case, we can't use sk_busy_loop() for
> both the rx path and the tx path, as we need to wake different NAPI
> instances.
> 
> > 
> > Second, can this problem happen for any other NIC or is it purely 
> > an artifact of virtio's delayed Tx completion handling?
> 
> According to my analysis, only the virtio-net and ICSSG drivers have
> split NAPI for AF_XDP. I don't have an ICSSG NIC, but the codex tells
> me that it does have the same problem.
> 
> I'm not sure if it is a good idea to introduce sk_tx_busy_loop().
> Maybe we can modify the driver instead to use the same NAPI
> for both data sending and receiving, just like others do. The
> advantage of introducing sk_tx_busy_loop() is that we can split the
> data sending and receiving, which may be more efficient.

It would be good to back your changes with performance numbers. I
believe that drivers do tx processing via rx napi because, before AF_XDP,
it was only about cleaning up writebacks; AF_XDP added more weight via
actual tx descriptor submission.

Maybe you can vibe-code virtio-net to work only with rx napi and see what
the results are.

Side note/question - Do you have a tx-only use case for AF_XDP? I have
been planning (for a long time actually) to implement asymmetric AF_XDP
sockets. Currently for ZC scenarios an xsk socket occupies both rx and tx
queues even when you do rx or tx only.

> 
> > 
> > Third, this series does not apply.
> 
> Ah, I'll rebase this series if a V2 is acceptable.
> 
> Thanks!
> Menglong Dong
> 
> > 
> > 
> 
> 
> 
> 

^ permalink raw reply

* Re: [RFC PATCH 2/2] virtio-balloon: add stats push mode
From: David Hildenbrand (Arm) @ 2026-06-17  9:34 UTC (permalink / raw)
  To: Gregory Price
  Cc: virtualization, linux-kernel, kernel-team, mst, jasowang,
	xuanzhuo, eperezma, hannes, surenb, peterz, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, kprateek.nayak
In-Reply-To: <ajFynjvnntxVy2m7@gourry-fedora-PF4VCD3F>

On 6/16/26 17:58, Gregory Price wrote:
> On Tue, Jun 16, 2026 at 05:52:29PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/16/26 16:44, Gregory Price wrote:
>>> That makes sense, although don't you just push the blocking operation
>>> into yet another thread on the host?
>>
>> I think timers are run from the QEMU main thread, so no separate thread just for
>> the timer.
>>
>> And IIRC, there will be no blocking. At least if I understand your concern
>> correctly.
>>
>> balloon_stats_poll_cb() will do a virtqueue_push()+virtio_notify(), which will
>> notify the device. The main thread will continue afterwards doing what a main
>> thread usually does.
>>
>> A VCPU will process the request in the VM and send it back + notify the device.
>>
> 
> Entirely possible I just bungled the interaction then and/or CH's
> interfaces introduce a blocking op that shouldn't.
> 
> Thanks for the feedback, we can probably drop this patch.  Unless
> there's any particular pushback for 1/2, should i leave as-is or
> resubmit separately w/o RFC?

Probably best to send #1 as non-RFC after the merge window.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH] drm: Consistently define pci_device_ids using named initializers
From: Thomas Zimmermann @ 2026-06-17  9:11 UTC (permalink / raw)
  To: Uwe Kleine-König (The Capable Hub), Maarten Lankhorst,
	Maxime Ripard, David Airlie, Simona Vetter, Gerd Hoffmann
  Cc: Markus Schneider-Pargmann, Patrik Jakobsson, Jianmin Lv,
	Qianhai Wu, Huacai Chen, Mingcong Bai, Xi Ruoyao, Icenowy Zheng,
	Dave Airlie, Jocelyn Falempe, dri-devel, linux-kernel,
	virtualization, spice-devel
In-Reply-To: <b13678d8-1159-457a-bcab-ade06bea0cf8@suse.de>



On 12.06.26 at 14:10, Thomas Zimmermann wrote:
> Hi
>
> On 04.05.26 at 17:05, Uwe Kleine-König (The Capable Hub) wrote:
>> The .driver_data member of the various struct pci_device_id arrays were
>> initialized by list expressions. This isn't easily readable if you're
>> not into PCI. Using the PCI_DEVICE macro and named initializers is more
>> explicit and thus easier to parse. Also skip explicit assignments of 0
>> (which the compiler then takes care of).
>>
>> This change doesn't introduce changes to the compiled pci_device_id
>> arrays. Tested on x86 and arm64.
>>
>> Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
>
> Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
>
> I'll merge the patch into drm-misc-next.

Merged with a minor change to coding style in gma500.

>
> Best regards
> Thomas
>
>> ---
>> Hello,
>>
>> The secret plan is to make struct pci_device_id::driver_data an
>> anonymous union (similar to
>> https://lore.kernel.org/all/cover.1776579304.git.u.kleine-koenig@baylibre.com/) 
>>
>> and that requires named initializers. But IMHO it's also a nice cleanup
>> on its own.
>>
>> The anonymous union will allow changes like the following:
>>
>> -    { PCI_DEVICE(0x8086, 0x8108), .driver_data = (long) &psb_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x8108), .driver_data_ptr = &psb_chip_ops },
>>
>> (together with the respective change in the code when the value is
>> used). This gets rid of a bunch of casts and thus slightly improves
>> type safety.
>>
>> Best regards
>> Uwe
>>
>>   drivers/gpu/drm/gma500/psb_drv.c      | 56 +++++++++++++--------------
>>   drivers/gpu/drm/loongson/lsdc_drv.c   |  4 +-
>>   drivers/gpu/drm/mgag200/mgag200_drv.c | 24 ++++++------
>>   drivers/gpu/drm/qxl/qxl_drv.c         | 15 ++++---
>>   4 files changed, 52 insertions(+), 47 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/gma500/psb_drv.c b/drivers/gpu/drm/gma500/psb_drv.c
>> index 005ab7f5355f..039da26ef24d 100644
>> --- a/drivers/gpu/drm/gma500/psb_drv.c
>> +++ b/drivers/gpu/drm/gma500/psb_drv.c
>> @@ -56,36 +56,36 @@ static int psb_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent);
>>    */
>>   static const struct pci_device_id pciidlist[] = {
>>       /* Poulsbo */
>> -    { 0x8086, 0x8108, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &psb_chip_ops },
>> -    { 0x8086, 0x8109, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &psb_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x8108), .driver_data = (long) &psb_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x8109), .driver_data = (long) &psb_chip_ops },
>>       /* Oak Trail */
>> -    { 0x8086, 0x4100, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &oaktrail_chip_ops },
>> -    { 0x8086, 0x4101, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &oaktrail_chip_ops },
>> -    { 0x8086, 0x4102, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &oaktrail_chip_ops },
>> -    { 0x8086, 0x4103, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &oaktrail_chip_ops },
>> -    { 0x8086, 0x4104, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &oaktrail_chip_ops },
>> -    { 0x8086, 0x4105, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &oaktrail_chip_ops },
>> -    { 0x8086, 0x4106, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &oaktrail_chip_ops },
>> -    { 0x8086, 0x4107, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &oaktrail_chip_ops },
>> -    { 0x8086, 0x4108, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &oaktrail_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x4100), .driver_data = (long) &oaktrail_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x4101), .driver_data = (long) &oaktrail_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x4102), .driver_data = (long) &oaktrail_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x4103), .driver_data = (long) &oaktrail_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x4104), .driver_data = (long) &oaktrail_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x4105), .driver_data = (long) &oaktrail_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x4106), .driver_data = (long) &oaktrail_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x4107), .driver_data = (long) &oaktrail_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x4108), .driver_data = (long) &oaktrail_chip_ops },
>>       /* Cedar Trail */
>> -    { 0x8086, 0x0be0, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &cdv_chip_ops },
>> -    { 0x8086, 0x0be1, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &cdv_chip_ops },
>> -    { 0x8086, 0x0be2, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &cdv_chip_ops },
>> -    { 0x8086, 0x0be3, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &cdv_chip_ops },
>> -    { 0x8086, 0x0be4, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &cdv_chip_ops },
>> -    { 0x8086, 0x0be5, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &cdv_chip_ops },
>> -    { 0x8086, 0x0be6, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &cdv_chip_ops },
>> -    { 0x8086, 0x0be7, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &cdv_chip_ops },
>> -    { 0x8086, 0x0be8, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &cdv_chip_ops },
>> -    { 0x8086, 0x0be9, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &cdv_chip_ops },
>> -    { 0x8086, 0x0bea, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &cdv_chip_ops },
>> -    { 0x8086, 0x0beb, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &cdv_chip_ops },
>> -    { 0x8086, 0x0bec, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &cdv_chip_ops },
>> -    { 0x8086, 0x0bed, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &cdv_chip_ops },
>> -    { 0x8086, 0x0bee, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &cdv_chip_ops },
>> -    { 0x8086, 0x0bef, PCI_ANY_ID, PCI_ANY_ID, 0, 0, (long) &cdv_chip_ops },
>> -    { 0, }
>> +    { PCI_DEVICE(0x8086, 0x0be0), .driver_data = (long) &cdv_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x0be1), .driver_data = (long) &cdv_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x0be2), .driver_data = (long) &cdv_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x0be3), .driver_data = (long) &cdv_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x0be4), .driver_data = (long) &cdv_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x0be5), .driver_data = (long) &cdv_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x0be6), .driver_data = (long) &cdv_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x0be7), .driver_data = (long) &cdv_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x0be8), .driver_data = (long) &cdv_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x0be9), .driver_data = (long) &cdv_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x0bea), .driver_data = (long) &cdv_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x0beb), .driver_data = (long) &cdv_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x0bec), .driver_data = (long) &cdv_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x0bed), .driver_data = (long) &cdv_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x0bee), .driver_data = (long) &cdv_chip_ops },
>> +    { PCI_DEVICE(0x8086, 0x0bef), .driver_data = (long) &cdv_chip_ops },
>> +    { }
>>   };
>>   MODULE_DEVICE_TABLE(pci, pciidlist);
>> diff --git a/drivers/gpu/drm/loongson/lsdc_drv.c b/drivers/gpu/drm/loongson/lsdc_drv.c
>> index 1ece1ea42f78..f9f7271ddbff 100644
>> --- a/drivers/gpu/drm/loongson/lsdc_drv.c
>> +++ b/drivers/gpu/drm/loongson/lsdc_drv.c
>> @@ -444,8 +444,8 @@ static const struct dev_pm_ops lsdc_pm_ops = {
>>   };
>>     static const struct pci_device_id lsdc_pciid_list[] = {
>> -    {PCI_VDEVICE(LOONGSON, 0x7a06), CHIP_LS7A1000},
>> -    {PCI_VDEVICE(LOONGSON, 0x7a36), CHIP_LS7A2000},
>> +    { PCI_VDEVICE(LOONGSON, 0x7a06), .driver_data = CHIP_LS7A1000 },
>> +    { PCI_VDEVICE(LOONGSON, 0x7a36), .driver_data = CHIP_LS7A2000 },
>>       { }
>>   };
>> diff --git a/drivers/gpu/drm/mgag200/mgag200_drv.c b/drivers/gpu/drm/mgag200/mgag200_drv.c
>> index a32be27c39e8..8ad4ddb60ee6 100644
>> --- a/drivers/gpu/drm/mgag200/mgag200_drv.c
>> +++ b/drivers/gpu/drm/mgag200/mgag200_drv.c
>> @@ -205,18 +205,18 @@ int mgag200_device_init(struct mga_device *mdev,
>>    */
>>     static const struct pci_device_id mgag200_pciidlist[] = {
>> -    { PCI_VENDOR_ID_MATROX, 0x520, PCI_ANY_ID, PCI_ANY_ID, 0, 0, G200_PCI },
>> -    { PCI_VENDOR_ID_MATROX, 0x521, PCI_ANY_ID, PCI_ANY_ID, 0, 0, G200_AGP },
>> -    { PCI_VENDOR_ID_MATROX, 0x522, PCI_ANY_ID, PCI_ANY_ID, 0, 0, G200_SE_A },
>> -    { PCI_VENDOR_ID_MATROX, 0x524, PCI_ANY_ID, PCI_ANY_ID, 0, 0, G200_SE_B },
>> -    { PCI_VENDOR_ID_MATROX, 0x530, PCI_ANY_ID, PCI_ANY_ID, 0, 0, G200_EV },
>> -    { PCI_VENDOR_ID_MATROX, 0x532, PCI_ANY_ID, PCI_ANY_ID, 0, 0, G200_WB },
>> -    { PCI_VENDOR_ID_MATROX, 0x533, PCI_ANY_ID, PCI_ANY_ID, 0, 0, G200_EH },
>> -    { PCI_VENDOR_ID_MATROX, 0x534, PCI_ANY_ID, PCI_ANY_ID, 0, 0, G200_ER },
>> -    { PCI_VENDOR_ID_MATROX, 0x536, PCI_ANY_ID, PCI_ANY_ID, 0, 0, G200_EW3 },
>> -    { PCI_VENDOR_ID_MATROX, 0x538, PCI_ANY_ID, PCI_ANY_ID, 0, 0, G200_EH3 },
>> -    { PCI_VENDOR_ID_MATROX, 0x53a, PCI_ANY_ID, PCI_ANY_ID, 0, 0, G200_EH5 },
>> -    {0,}
>> +    { PCI_VDEVICE(MATROX, 0x0520), .driver_data = G200_PCI },
>> +    { PCI_VDEVICE(MATROX, 0x0521), .driver_data = G200_AGP },
>> +    { PCI_VDEVICE(MATROX, 0x0522), .driver_data = G200_SE_A },
>> +    { PCI_VDEVICE(MATROX, 0x0524), .driver_data = G200_SE_B },
>> +    { PCI_VDEVICE(MATROX, 0x0530), .driver_data = G200_EV },
>> +    { PCI_VDEVICE(MATROX, 0x0532), .driver_data = G200_WB },
>> +    { PCI_VDEVICE(MATROX, 0x0533), .driver_data = G200_EH },
>> +    { PCI_VDEVICE(MATROX, 0x0534), .driver_data = G200_ER },
>> +    { PCI_VDEVICE(MATROX, 0x0536), .driver_data = G200_EW3 },
>> +    { PCI_VDEVICE(MATROX, 0x0538), .driver_data = G200_EH3 },
>> +    { PCI_VDEVICE(MATROX, 0x053a), .driver_data = G200_EH5 },
>> +    { }
>>   };
>>     MODULE_DEVICE_TABLE(pci, mgag200_pciidlist);
>> diff --git a/drivers/gpu/drm/qxl/qxl_drv.c b/drivers/gpu/drm/qxl/qxl_drv.c
>> index 2bbb1168a3ff..6c3c309b8e4d 100644
>> --- a/drivers/gpu/drm/qxl/qxl_drv.c
>> +++ b/drivers/gpu/drm/qxl/qxl_drv.c
>> @@ -50,11 +50,16 @@
>>   #include "qxl_object.h"
>>     static const struct pci_device_id pciidlist[] = {
>> -    { 0x1b36, 0x100, PCI_ANY_ID, PCI_ANY_ID, PCI_CLASS_DISPLAY_VGA << 8,
>> -      0xffff00, 0 },
>> -    { 0x1b36, 0x100, PCI_ANY_ID, PCI_ANY_ID, PCI_CLASS_DISPLAY_OTHER << 8,
>> -      0xffff00, 0 },
>> -    { 0, 0, 0 },
>> +    {
>> +        PCI_DEVICE(0x1b36, 0x0100),
>> +        .class = PCI_CLASS_DISPLAY_VGA << 8,
>> +        .class_mask = 0xffff00
>> +    }, {
>> +        PCI_DEVICE(0x1b36, 0x0100),
>> +        .class = PCI_CLASS_DISPLAY_OTHER << 8,
>> +        .class_mask = 0xffff00
>> +    },
>> +    { },
>>   };
>>   MODULE_DEVICE_TABLE(pci, pciidlist);
>>
>> base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
>

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)



^ permalink raw reply
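
For readers less familiar with PCI ID tables, the claim in the commit message that the compiled arrays do not change can be modelled in userspace. This is a sketch under assumptions: struct demo_pci_device_id and DEMO_PCI_DEVICE() are hypothetical simplifications of the kernel's struct pci_device_id and PCI_DEVICE(), and the class field is renamed dev_class here. The two entries mirror the qxl change, one written positionally and one with named initializers.

```c
#include <assert.h>

#define DEMO_ANY_ID (~0u)

/* Hypothetical, simplified stand-in for struct pci_device_id. */
struct demo_pci_device_id {
	unsigned int vendor, device;
	unsigned int subvendor, subdevice;
	unsigned int dev_class, class_mask;
	unsigned long driver_data;
};

/* Hypothetical stand-in for PCI_DEVICE(): match any subsystem IDs. */
#define DEMO_PCI_DEVICE(vend, dev) \
	.vendor = (vend), .device = (dev), \
	.subvendor = DEMO_ANY_ID, .subdevice = DEMO_ANY_ID

/* Old style: the reader has to know the member order. */
static const struct demo_pci_device_id demo_positional = {
	0x1b36, 0x100, DEMO_ANY_ID, DEMO_ANY_ID, 0x0300 << 8, 0xffff00, 0
};

/* New style: unnamed members (driver_data here) default to zero. */
static const struct demo_pci_device_id demo_named = {
	DEMO_PCI_DEVICE(0x1b36, 0x0100),
	.dev_class = 0x0300 << 8,
	.class_mask = 0xffff00
};

/* Field-by-field comparison, so struct padding is never inspected. */
static int demo_same_entry(const struct demo_pci_device_id *a,
			   const struct demo_pci_device_id *b)
{
	return a->vendor == b->vendor && a->device == b->device &&
	       a->subvendor == b->subvendor &&
	       a->subdevice == b->subdevice &&
	       a->dev_class == b->dev_class &&
	       a->class_mask == b->class_mask &&
	       a->driver_data == b->driver_data;
}
```

The named form also leaves room for the anonymous-union change mentioned in the cover note, since a union member can only be selected by name.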

* Re: [PATCH] vdpa_sim: fix cleanup after worker creation failure
From: Eugenio Perez Martin @ 2026-06-17  7:10 UTC (permalink / raw)
  To: Linfeng Sun
  Cc: Michael S . Tsirkin, Jason Wang, Xuan Zhuo, virtualization,
	linux-kernel
In-Reply-To: <20260612105054.1850453-1-slf@hdu.edu.cn>

On Fri, Jun 12, 2026 at 12:57 PM Linfeng Sun <slf@hdu.edu.cn> wrote:
>
> vdpasim_create() leaves vdpasim->worker as an ERR_PTR when
> kthread_run_worker() fails. The error path then drops the device
> reference, which releases the partially initialized simulator.
>
> vdpasim_free() unconditionally passes the worker pointer to
> kthread_destroy_worker(), so the ERR_PTR is dereferenced and can
> trigger a general protection fault.
>
> Store the worker error, clear the pointer, and make the release path
> only clean up resources that were successfully initialized before
> the failure.
>

Good catch! A few things to improve, though.

It is missing a Fixes: tag.

> Signed-off-by: Linfeng Sun <slf@hdu.edu.cn>
> ---
>  drivers/vdpa/vdpa_sim/vdpa_sim.c | 27 ++++++++++++++++++---------
>  1 file changed, 18 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> index 8cb1cc2ea139..6a4e28c49d2d 100644
> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> @@ -230,9 +230,12 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr,
>
>         kthread_init_work(&vdpasim->work, vdpasim_work_fn);
>         vdpasim->worker = kthread_run_worker(0, "vDPA sim worker: %s",
> -                                               dev_attr->name);
> -       if (IS_ERR(vdpasim->worker))
> +                                            dev_attr->name);
> +       if (IS_ERR(vdpasim->worker)) {
> +               ret = PTR_ERR(vdpasim->worker);
> +               vdpasim->worker = NULL;
>                 goto err_iommu;
> +       }
>
>         mutex_init(&vdpasim->mutex);
>         spin_lock_init(&vdpasim->iommu_lock);
> @@ -742,18 +745,24 @@ static void vdpasim_free(struct vdpa_device *vdpa)
>         struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
>         int i;
>
> -       kthread_cancel_work_sync(&vdpasim->work);
> -       kthread_destroy_worker(vdpasim->worker);
> +       if (vdpasim->worker) {
> +               kthread_cancel_work_sync(&vdpasim->work);
> +               kthread_destroy_worker(vdpasim->worker);
> +       }
>
> -       for (i = 0; i < vdpasim->dev_attr.nvqs; i++) {
> -               vringh_kiov_cleanup(&vdpasim->vqs[i].out_iov);
> -               vringh_kiov_cleanup(&vdpasim->vqs[i].in_iov);
> +       if (vdpasim->vqs) {
> +               for (i = 0; i < vdpasim->dev_attr.nvqs; i++) {
> +                       vringh_kiov_cleanup(&vdpasim->vqs[i].out_iov);
> +                       vringh_kiov_cleanup(&vdpasim->vqs[i].in_iov);
> +               }
>         }
>
>         vdpasim->dev_attr.free(vdpasim);
>
> -       for (i = 0; i < vdpasim->dev_attr.nas; i++)
> -               vhost_iotlb_reset(&vdpasim->iommu[i]);
> +       if (vdpasim->iommu && vdpasim->iommu_pt) {

It looks to me like this conditional, and the one for `vdpasim->vqs`,
are needed when vdpasim_create() fails on error paths other than the
kthread_run_worker() one, aren't they? For example, if
dma_set_mask_and_coherent() fails, vhost_iotlb_reset() will also be
called. In that sense, it would be great to note this in the patch
message.

Also, vdpasim->iommu[i] should be reset regardless of vdpasim->iommu_pt,
so omit that from the conditional.

Finally, if you found the issue or coded the patch with AI, please
indicate it with the Assisted-by tag.

With these changes applied, feel free to add:

Reviewed-by: Eugenio Pérez <eperezma@redhat.com>

Thanks!

> +               for (i = 0; i < vdpasim->dev_attr.nas; i++)
> +                       vhost_iotlb_reset(&vdpasim->iommu[i]);
> +       }
>         kfree(vdpasim->iommu);
>         kfree(vdpasim->iommu_pt);
>         kfree(vdpasim->vqs);
> --
> 2.43.0
>
>


^ permalink raw reply
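
The pattern under review (save the error code, clear the ERR_PTR, and let the release path skip what was never created) can be sketched in userspace. This is a minimal model, not the vdpa_sim code: the demo_* helpers imitate ERR_PTR()/PTR_ERR()/IS_ERR(), the "worker" is just a malloc'd byte, and demo_destroy_calls exists only so the behaviour can be observed.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_MAX_ERRNO 4095

/* Userspace imitations of ERR_PTR(), PTR_ERR() and IS_ERR(). */
static void *demo_err_ptr(long err) { return (void *)(intptr_t)err; }
static long demo_ptr_err(const void *p) { return (long)(intptr_t)p; }
static int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* Hypothetical device with one optional resource. */
struct demo_sim {
	void *worker; /* NULL means "never created" */
};

static int demo_destroy_calls; /* counts real worker teardowns */

/* Hypothetical worker creation that can be forced to fail. */
static void *demo_run_worker(int fail)
{
	void *w = fail ? NULL : malloc(1);

	return w ? w : demo_err_ptr(-ENOMEM);
}

/* Release path: only tear down what was successfully set up. */
static void demo_free(struct demo_sim *sim)
{
	if (sim->worker) {
		free(sim->worker);
		demo_destroy_calls++;
	}
	free(sim);
}

static int demo_create(int fail_worker, struct demo_sim **out)
{
	struct demo_sim *sim = calloc(1, sizeof(*sim));
	int ret;

	if (!sim)
		return -ENOMEM;

	sim->worker = demo_run_worker(fail_worker);
	if (demo_is_err(sim->worker)) {
		ret = (int)demo_ptr_err(sim->worker); /* keep the errno */
		sim->worker = NULL; /* never hand an ERR_PTR to free */
		demo_free(sim);
		return ret;
	}

	*out = sim;
	return 0;
}

/* Create, then release on success; returns demo_create()'s result. */
static int demo_roundtrip(int fail_worker)
{
	struct demo_sim *sim = NULL;
	int ret = demo_create(fail_worker, &sim);

	if (!ret)
		demo_free(sim);
	return ret;
}
```

The same NULL-means-absent convention is what makes the extra guards raised in the review necessary for the other early error paths.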

* [PATCH] tools/virtio: fix build for kmalloc_obj API and missing stubs
From: Michael S. Tsirkin @ 2026-06-17  6:05 UTC (permalink / raw)
  Cc: Eugenio Pérez, Jason Wang, linux-kernel, Michael S. Tsirkin,
	virtualization, Xuan Zhuo

Add stubs for kmalloc_obj() and kmalloc_objs() to the tools/virtio
test harness, matching the new kernel allocator API. Also add the
DMA_ATTR_CPU_CACHE_CLEAN definition and include kernel.h from err.h
for the unlikely() macro.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 tools/virtio/linux/dma-mapping.h | 2 ++
 tools/virtio/linux/err.h         | 1 +
 tools/virtio/linux/kernel.h      | 6 ++++++
 3 files changed, 9 insertions(+)

diff --git a/tools/virtio/linux/dma-mapping.h b/tools/virtio/linux/dma-mapping.h
index fddfa2fbb276..8d1a16cb20db 100644
--- a/tools/virtio/linux/dma-mapping.h
+++ b/tools/virtio/linux/dma-mapping.h
@@ -60,4 +60,6 @@ enum dma_data_direction {
  */
 #define DMA_MAPPING_ERROR		(~(dma_addr_t)0)
 
+#define DMA_ATTR_CPU_CACHE_CLEAN	(1UL << 11)
+
 #endif
diff --git a/tools/virtio/linux/err.h b/tools/virtio/linux/err.h
index 0943c644a701..b7b4cb516dc9 100644
--- a/tools/virtio/linux/err.h
+++ b/tools/virtio/linux/err.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef ERR_H
 #define ERR_H
+#include <linux/kernel.h>
 #define MAX_ERRNO	4095
 
 #define IS_ERR_VALUE(x) unlikely((x) >= (unsigned long)-MAX_ERRNO)
diff --git a/tools/virtio/linux/kernel.h b/tools/virtio/linux/kernel.h
index 416d02703f61..104abf9d1aee 100644
--- a/tools/virtio/linux/kernel.h
+++ b/tools/virtio/linux/kernel.h
@@ -65,6 +65,12 @@ static inline void *kmalloc_array(unsigned n, size_t s, gfp_t gfp)
 	return kmalloc(n * s, gfp);
 }
 
+#define kmalloc_obj(VAR_OR_TYPE, ...) \
+	((typeof(VAR_OR_TYPE) *)kmalloc(sizeof(typeof(VAR_OR_TYPE)), 0))
+
+#define kmalloc_objs(VAR_OR_TYPE, COUNT, ...) \
+	((typeof(VAR_OR_TYPE) *)kmalloc(sizeof(typeof(VAR_OR_TYPE)) * (COUNT), 0))
+
 static inline void *kzalloc(size_t s, gfp_t gfp)
 {
 	void *p = kmalloc(s, gfp);
-- 
MST


^ permalink raw reply related
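
The two stub macros in the patch above can be tried outside the tree with a sketch like the following. Assumptions: demo_kmalloc() wraps malloc(), DEMO_GFP_KERNEL is a made-up flag value, and __typeof__ is used instead of typeof so the code also builds in strict ISO modes of GCC and Clang. As in the stubs, the trailing flag arguments are accepted and ignored, and the element-count multiplication is not checked for overflow.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned int demo_gfp_t;
#define DEMO_GFP_KERNEL 0u /* hypothetical flag value, ignored below */

/* Hypothetical stand-in for the harness kmalloc(). */
static void *demo_kmalloc(size_t s, demo_gfp_t gfp)
{
	(void)gfp;
	return malloc(s);
}

/* Same shape as the stubs: accept a variable or a type name. */
#define demo_kmalloc_obj(VAR_OR_TYPE, ...) \
	((__typeof__(VAR_OR_TYPE) *)demo_kmalloc( \
		sizeof(__typeof__(VAR_OR_TYPE)), 0))

#define demo_kmalloc_objs(VAR_OR_TYPE, COUNT, ...) \
	((__typeof__(VAR_OR_TYPE) *)demo_kmalloc( \
		sizeof(__typeof__(VAR_OR_TYPE)) * (COUNT), 0))

struct demo_req {
	int id;
	char payload[32];
};

/* Allocates via both macros, touches the memory, frees it again. */
static int demo_alloc_roundtrip(void)
{
	/* Variable form: the type is taken from *one. */
	struct demo_req *one = demo_kmalloc_obj(*one, DEMO_GFP_KERNEL);
	/* Type form with an element count. */
	struct demo_req *many = demo_kmalloc_objs(struct demo_req, 4,
						  DEMO_GFP_KERNEL);
	int ok = one && many;

	if (ok) {
		one->id = 1;
		many[3].id = 4;
		ok = (one->id + many[3].id == 5);
	}
	free(one);
	free(many);
	return ok;
}
```

The cast to a typed pointer is the point of the API: assigning the result to a pointer of the wrong type draws a compiler diagnostic instead of passing silently through void *.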

* Re:Re:Re:Re: [PATCH] virtio_net: disable cb when napi_schedule_prep fails during busy-poll
From: Lange Tang @ 2026-06-17  2:08 UTC (permalink / raw)
  To: xuanzhuo@linux.alibaba.com
  Cc: edumazet@google.com, Jakub Kicinski,
	virtualization@lists.linux.dev, Tang Longjun, jasowang@redhat.com,
	mst@redhat.com
In-Reply-To: <1781592565.1172295-1-xuanzhuo@linux.alibaba.com>

At 2026-06-16 14:49:25, "Xuan Zhuo" <xuanzhuo@linux.alibaba.com> wrote:
>On Tue, 16 Jun 2026 14:07:34 +0800 (CST), Lange Tang <lange_tang@163.com> wrote:
>> At 2026-06-16 11:27:12, "Xuan Zhuo" <xuanzhuo@linux.alibaba.com> wrote:
>> >On Tue, 16 Jun 2026 11:00:29 +0800 (CST), Lange Tang <lange_tang@163.com> wrote:
>> >> At 2026-06-15 18:01:40, "Xuan Zhuo" <xuanzhuo@linux.alibaba.com> wrote:
>> >> >On Mon, 15 Jun 2026 17:45:50 +0800, Longjun Tang <lange_tang@163.com> wrote:
>> >> >> From: Longjun Tang <tanglongjun@kylinos.cn>
>> >> >>
>> >> >> When busy-poll is active, napi_schedule_prep() returns false in
>> >> >> skb_recv_done(), so virtqueue_disable_cb() is skipped. The device
>> >> >> may keep firing irqs until the next poll round reaches
>> >> >> virtqueue_napi_complete(). If cb is enabled under busy-poll case,
>> >> >> it will lead to a large number of spurious interrupts. Explicitly
>> >> >> disable callbacks in this case to prevent spurious interrupts.
>> >> >>
>> >> >> Signed-off-by: Longjun Tang <tanglongjun@kylinos.cn>
>> >> >> ---
>> >> >>  drivers/net/virtio_net.c | 2 ++
>> >> >>  1 file changed, 2 insertions(+)
>> >> >>
>> >> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> >> >> index f4adcfee7a80..6d675fddc59b 100644
>> >> >> --- a/drivers/net/virtio_net.c
>> >> >> +++ b/drivers/net/virtio_net.c
>> >> >> @@ -728,6 +728,8 @@ static void virtqueue_napi_schedule(struct napi_struct *napi,
>> >> >>  	if (napi_schedule_prep(napi)) {
>> >> >>  		virtqueue_disable_cb(vq);
>> >> >>  		__napi_schedule(napi);
>> >> >> +	} else if (test_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state)) {
>> >> >> +		virtqueue_disable_cb(vq);
>> >> >
>> >> >I see, but we should avoid checking NAPI_STATE_IN_BUSY_POLL directly in the
>> >> >drivers. The NIC driver should remain agnostic to busy polling. I think we need
>> >> >a better way, maybe we should rewrite virtqueue_napi_schedule instead.
>> >>
>> >> How about rewrite it like this?
>> >> static void virtqueue_napi_schedule(struct napi_struct *napi,
>> >>                                     struct virtqueue *vq)
>> >> {
>> >>         virtqueue_disable_cb(vq);
>> >>         if (napi_schedule_prep(napi))
>> >>                 __napi_schedule(napi);
>> >> }
>> >> Any comments are welcome.
>> >
>> >
>> >Another CPU could be running NAPI and has just enabled the callbacks (cb).
>> >Meanwhile, this side unconditionally disables the cb. Since NAPI on the other
>> >CPU hasn't exited yet, the subsequent prep on this side fails, leaving no one to
>> >re-enable the cb.
>> >
>> >Thanks.
>>
>> Regarding the case you described: when NAPI on the other CPU exits, virtqueue_napi_complete()
>> will run and re-enable the cb. If there is still unconsumed data in the virtqueue, virtqueue_napi_schedule()
>> will be called again to schedule NAPI.
>>
>> In summary, I think that disable_cb and __napi_schedule within virtqueue_napi_schedule() do not need to be bound together.
>>
>> Any comments are welcome. Thanks.
>
>
><Your code>
>static void virtqueue_napi_schedule(struct napi_struct *napi,
>                                    struct virtqueue *vq)
>{
>
>							       |static bool virtqueue_napi_complete(struct napi_struct *napi,
>							       |				    struct virtqueue *vq, int processed)
>							       |{
>							       |	int opaque;
>							       |
>							       |	opaque = virtqueue_enable_cb_prepare(vq);
>                                                               |
>        virtqueue_disable_cb(vq);                              |
>        if (napi_schedule_prep(napi))                          |
>                __napi_schedule(napi);                         |
>							       |	if (napi_complete_done(napi, processed)) {
>							       |		if (unlikely(virtqueue_poll(vq, opaque)))
>							       |			virtqueue_napi_schedule(napi, vq);
>							       |		else
>							       |			 return true; // return directly
>							       |	} else {
>							       |		virtqueue_disable_cb(vq);
>							       |	}
>							       |
>							       |	return false;
>							       |}
>}
>
>1. new packets (notified by irq) are consumed by napi before virtqueue_napi_complete
>2. poll is not called by irq, maybe xsk wake up. So irq is not disabled.
>
>
>Thanks.

Based on your code analysis above, I get it now. Thanks.
disable_cb and __napi_schedule in virtqueue_napi_schedule() indeed cannot be separated.

Regarding the issue of not being able to disable the cb in a busy-poll context, do you have any suggestions?

Any comments are welcome. Thanks.
>
>
>>
>> >
>> >
>> >> >
>> >> >
>> >> >>  	}
>> >> >>  }
>> >> >>
>> >> >> --
>> >> >> 2.25.1
>> >> >>
>> >>
>>

^ permalink raw reply
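
The interleaving drawn in the side-by-side diagram above can be replayed step by step in a single thread. This is a rough model under assumptions, not the driver: struct demo_state collapses the virtqueue and NAPI state into three fields, all demo_* names are hypothetical, and the ring is assumed to be empty when the completing CPU re-checks it (the packets were already consumed by the running poll).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified state; not the kernel structures. */
struct demo_state {
	bool cb_enabled;     /* device may raise interrupts */
	bool napi_scheduled; /* models NAPI_STATE_SCHED */
	int pending;         /* unconsumed buffers in the ring */
};

/* Models napi_schedule_prep(): fails while a poll already owns NAPI. */
static bool demo_schedule_prep(struct demo_state *s)
{
	if (s->napi_scheduled)
		return false;
	s->napi_scheduled = true;
	return true;
}

/* Interrupt side, in the two variants discussed in the thread. */
static void demo_irq_schedule(struct demo_state *s, bool disable_first)
{
	if (disable_first) {
		/* Proposed rewrite: disable unconditionally. */
		s->cb_enabled = false;
		(void)demo_schedule_prep(s);
	} else if (demo_schedule_prep(s)) {
		/* Current code: disable only if we got to schedule. */
		s->cb_enabled = false;
	}
}

/* Replays the diagram; true means nobody is left to restart rx. */
static bool demo_replay_stalls(bool disable_first)
{
	/* CPU B is inside its poll; callbacks are off, ring is empty. */
	struct demo_state s = {
		.cb_enabled = false, .napi_scheduled = true, .pending = 0
	};

	/* CPU B: virtqueue_enable_cb_prepare() */
	s.cb_enabled = true;

	/* CPU A: a late interrupt reaches virtqueue_napi_schedule() */
	demo_irq_schedule(&s, disable_first);

	/* CPU B: napi_complete_done() succeeds ... */
	s.napi_scheduled = false;
	/* ... and virtqueue_poll() reports nothing new, so it returns. */
	if (s.pending > 0)
		demo_irq_schedule(&s, disable_first);

	/* Stalled: callbacks off and no poll scheduled. */
	return !s.cb_enabled && !s.napi_scheduled;
}
```

In this model the unconditional variant ends with callbacks disabled and NAPI idle, while the existing ordering leaves callbacks enabled so the next interrupt can schedule a poll.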

* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Michael S. Tsirkin @ 2026-06-16 21:35 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Miaohe Lin, Zi Yan, Andrew Morton, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <ab84b317-fecc-4197-be2f-4b4aeba3f4e3@kernel.org>

On Tue, Jun 16, 2026 at 02:18:57PM +0200, David Hildenbrand (Arm) wrote:
> On 6/16/26 13:40, Miaohe Lin wrote:
> > On 2026/6/16 14:56, David Hildenbrand (Arm) wrote:
> >>>
> >>> These non-atomics are defined and used because they want to avoid atomic ops overhead?
> >>> So I'm afraid using rcu read lock in these places would lead to unexpected overhead.
> >>
> >> It should be cheaper than atomics IIUC. Further, I assume that some pages could
> >> batch over multiple such operations (esp. page freeing path when we process tail
> >> pages).
> >>
> >> With !CONFIG_PREEMPT_RCU it's simply preempt_disable()/preempt_enable(), which
> >> is either a NOP or just adjusting the preempt counter of the current thread. Cheap.
> >>
> >> With CONFIG_PREEMPT_RCU we mostly increment current->rcu_read_lock_nesting. But
> >> there might be a function call involved (did not look into the details). So that
> >> variant should be slightly more expensive.
> > 
> > I scanned the code and found rcu_read_unlock_special might be called in some cases.
> > Some expensive ops, e.g. irq_work_queue_on, might be called in some corner cases.
> > So the overhead of rcu read lock might be fluctuating.
> 
> Right. Usually rcu_read_lock+unlock is supposed to be very lightweight, but that
> might not be completely the case with that PREEMPT_RCU thingy ...
> 
> > 
> >>
> >> We'd have to measure what an additional rcu read lock would cost in there. That
> >> should be fairly easy to benchmark.
> > 
> > Sure. We can do that if needed.
> > 
> >>
> >>>
> >>> I think this is a good idea, although there are some remaining issues.
> >>> But such race should be really rare, is it worth all this effort? Could we
> >>> simply aim to resolve, not to be flawless? I.e. could we simply check
> >>> and re-set the hwpoison flag at the end of memory_failure handling to
> >>> simply avoid losing hwpoison flag as a best-effort attempt? Would it be
> >>> acceptable?
> >>
> >> Hacky. Sufficient for the hypervisor to suspend the nonatomic-setting CPU at the
> >> wrong time to still trigger the same behavior.
> > 
> > Right. hypervisor could make the issue easier to trigger...
> > 
> >>
> >> I think, either we fix it properly, or we redesign hwpoison handling to deal
> >> with setting/clearing becoming stale at some random point in the future.
> > 
> > I think your proposal, although there are still some issues to be resolved, is
> > nevertheless a good solution. We could also wait and see if anyone comes up with
> > a better one.
> 
> I wouldn't call it "good" ... it's the only thing I was easily able to come up
> with :)
> 
> The only alternative would be moving the hwpoison bit out of page->flags,
> storing it in a sparse bitmap or sth. like that. It would be a bigger rework and
> I am sure there are issues with that as well.
> 
> -- 
> Cheers,
> 
> David


I had a vague feeling using static keys should be possible somehow,
but could not come up with anything robust.
So - like this? Untested.

--->

mm: memory-failure: use RCU and static key to fix HWPoison flag race

Non-atomic page flag operations (page->flags.f &= ~mask, __set_bit,
__clear_bit) can race with atomic TestSetPageHWPoison() in
memory_failure().  The non-atomic RMW reads flags, memory_failure()
atomically sets HWPoison, then the RMW writes back the old value
without HWPoison -- clobbering the bit.

Fix this by wrapping all non-atomic page flag operations in
rcu_read_lock/rcu_read_unlock via the hwpoison_safe() macro
(CONFIG_MEMORY_FAILURE only, skipped early boot via rcu_is_watching()).
memory_failure() then calls synchronize_rcu() to drain in-flight
non-atomic operations, and retries TestSetPageHWPoison() until the
bit sticks.
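
To make the interleaving concrete, here is a minimal userspace sketch that
replays it in a single thread with C11 atomics. It is not kernel code: the
bit positions and all helper names are hypothetical stand-ins, and the
"drained" variant only models the ordering that synchronize_rcu() is meant
to enforce, not RCU itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PG_HWPOISON_BIT (1UL << 3)  /* hypothetical bit position */
#define PG_OTHER_MASK   (1UL << 0)  /* hypothetical flag cleared non-atomically */

/* Atomic test-and-set, standing in for TestSetPageHWPoison(). */
static bool test_set_hwpoison(_Atomic unsigned long *flags)
{
	return atomic_fetch_or(flags, PG_HWPOISON_BIT) & PG_HWPOISON_BIT;
}

/*
 * Replay the bad interleaving: the non-atomic RMW loads the word, the
 * poison bit is set atomically, then the stale value is stored back.
 * Returns true if the poison bit survived.
 */
static bool replay_lost_update(void)
{
	_Atomic unsigned long flags = PG_OTHER_MASK;
	unsigned long stale;

	stale = atomic_load(&flags);                  /* RMW step 1: read */
	test_set_hwpoison(&flags);                    /* poison set in between */
	atomic_store(&flags, stale & ~PG_OTHER_MASK); /* RMW step 2: write back */

	return atomic_load(&flags) & PG_HWPOISON_BIT;
}

/* Same steps, but the RMW completes before the poison bit is set. */
static bool replay_drained(void)
{
	_Atomic unsigned long flags = PG_OTHER_MASK;
	unsigned long stale;

	stale = atomic_load(&flags);
	atomic_store(&flags, stale & ~PG_OTHER_MASK);
	test_set_hwpoison(&flags);

	return atomic_load(&flags) & PG_HWPOISON_BIT;
}
```

In the first replay the poison bit is lost; in the second it survives,
which is the ordering the drain-and-retry loop tries to guarantee.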

Fixes: 6a46079cf57a ("HWPOISON: The high level memory error handler in the VM v7")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6

---

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 06bbe9eba636..e607a77c1627 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2343,7 +2343,7 @@ int folio_xchg_last_cpupid(struct folio *folio, int cpupid);
 
 static inline void page_cpupid_reset_last(struct page *page)
 {
-	page->flags.f |= LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT;
+	set_page_flags_safe(page, LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
 }
 #endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
 
@@ -2503,8 +2503,8 @@ static inline struct zone *folio_zone(const struct folio *folio)
 #ifdef SECTION_IN_PAGE_FLAGS
 static inline void set_page_section(struct page *page, unsigned long section)
 {
-	page->flags.f &= ~(SECTIONS_MASK << SECTIONS_PGSHIFT);
-	page->flags.f |= (section & SECTIONS_MASK) << SECTIONS_PGSHIFT;
+	clear_page_flags_safe(page, SECTIONS_MASK << SECTIONS_PGSHIFT);
+	set_page_flags_safe(page, (section & SECTIONS_MASK) << SECTIONS_PGSHIFT);
 }
 
 static inline unsigned long memdesc_section(memdesc_flags_t mdf)
@@ -2719,14 +2719,14 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio)
 
 static inline void set_page_zone(struct page *page, enum zone_type zone)
 {
-	page->flags.f &= ~(ZONES_MASK << ZONES_PGSHIFT);
-	page->flags.f |= (zone & ZONES_MASK) << ZONES_PGSHIFT;
+	clear_page_flags_safe(page, ZONES_MASK << ZONES_PGSHIFT);
+	set_page_flags_safe(page, (zone & ZONES_MASK) << ZONES_PGSHIFT);
 }
 
 static inline void set_page_node(struct page *page, unsigned long node)
 {
-	page->flags.f &= ~(NODES_MASK << NODES_PGSHIFT);
-	page->flags.f |= (node & NODES_MASK) << NODES_PGSHIFT;
+	clear_page_flags_safe(page, NODES_MASK << NODES_PGSHIFT);
+	set_page_flags_safe(page, (node & NODES_MASK) << NODES_PGSHIFT);
 }
 
 static inline void set_page_links(struct page *page, enum zone_type zone,
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7223f6f4e2b4..e896d47d0031 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -9,6 +9,7 @@
 #include <linux/types.h>
 #include <linux/bug.h>
 #include <linux/mmdebug.h>
+#include <linux/rcupdate.h>
 #ifndef __GENERATING_BOUNDS_H
 #include <linux/mm_types.h>
 #include <generated/bounds.h>
@@ -404,6 +405,38 @@ static unsigned long *folio_flags(struct folio *folio, unsigned n)
 #define FOLIO_HEAD_PAGE		0
 #define FOLIO_SECOND_PAGE	1
 
+/*
+ * Non-atomic page flag operations (__set_bit, __clear_bit, flags &= ~mask)
+ * can race with atomic TestSetPageHWPoison() in memory_failure().
+ * Wrap non-atomic ops in rcu_read_lock so that synchronize_rcu() in
+ * memory_failure() drains in-flight callers.
+ */
+#ifdef CONFIG_MEMORY_FAILURE
+#define hwpoison_safe(op) do {						\
+	if (rcu_is_watching()) {					\
+		rcu_read_lock();					\
+		op;							\
+		rcu_read_unlock();					\
+	} else {							\
+		op;							\
+	}								\
+} while (0)
+#else
+#define hwpoison_safe(op) do { op; } while (0)
+#endif
+
+static __always_inline void clear_page_flags_safe(struct page *page,
+						  unsigned long mask)
+{
+	hwpoison_safe(page->flags.f &= ~mask);
+}
+
+static __always_inline void set_page_flags_safe(struct page *page,
+						unsigned long mask)
+{
+	hwpoison_safe(page->flags.f |= mask);
+}
+
 /*
  * Macros to create function definitions for page flags
  */
@@ -421,11 +454,11 @@ static __always_inline void folio_clear_##name(struct folio *folio)	\
 
 #define __FOLIO_SET_FLAG(name, page)					\
 static __always_inline void __folio_set_##name(struct folio *folio)	\
-{ __set_bit(PG_##name, folio_flags(folio, page)); }
+{ hwpoison_safe(__set_bit(PG_##name, folio_flags(folio, page))); }
 
 #define __FOLIO_CLEAR_FLAG(name, page)					\
 static __always_inline void __folio_clear_##name(struct folio *folio)	\
-{ __clear_bit(PG_##name, folio_flags(folio, page)); }
+{ hwpoison_safe(__clear_bit(PG_##name, folio_flags(folio, page))); }
 
 #define FOLIO_TEST_SET_FLAG(name, page)					\
 static __always_inline bool folio_test_set_##name(struct folio *folio)	\
@@ -458,12 +491,12 @@ static __always_inline void ClearPage##uname(struct page *page)		\
 #define __SETPAGEFLAG(uname, lname, policy)				\
 __FOLIO_SET_FLAG(lname, FOLIO_##policy)					\
 static __always_inline void __SetPage##uname(struct page *page)		\
-{ __set_bit(PG_##lname, &policy(page, 1)->flags.f); }
+{ hwpoison_safe(__set_bit(PG_##lname, &policy(page, 1)->flags.f)); }
 
 #define __CLEARPAGEFLAG(uname, lname, policy)				\
 __FOLIO_CLEAR_FLAG(lname, FOLIO_##policy)				\
 static __always_inline void __ClearPage##uname(struct page *page)	\
-{ __clear_bit(PG_##lname, &policy(page, 1)->flags.f); }
+{ hwpoison_safe(__clear_bit(PG_##lname, &policy(page, 1)->flags.f)); }
 
 #define TESTSETFLAG(uname, lname, policy)				\
 FOLIO_TEST_SET_FLAG(lname, FOLIO_##policy)				\
@@ -806,7 +839,7 @@ static inline bool PageUptodate(const struct page *page)
 static __always_inline void __folio_mark_uptodate(struct folio *folio)
 {
 	smp_wmb();
-	__set_bit(PG_uptodate, folio_flags(folio, 0));
+	hwpoison_safe(__set_bit(PG_uptodate, folio_flags(folio, 0)));
 }
 
 static __always_inline void folio_mark_uptodate(struct folio *folio)
@@ -1169,7 +1202,7 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
 {
 	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
 	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
-	__clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
+	hwpoison_safe(__clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f));
 }
 
 #ifdef CONFIG_MMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 970e077019b7..da6a0747e4d3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3624,8 +3624,9 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
 		 * unreferenced sub-pages of an anonymous THP: we can simply drop
 		 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
 		 */
-		new_folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
-		new_folio->flags.f |= (folio->flags.f &
+		clear_page_flags_safe(&new_folio->page, PAGE_FLAGS_CHECK_AT_PREP);
+		set_page_flags_safe(&new_folio->page,
+			folio->flags.f &
 				((1L << PG_referenced) |
 				 (1L << PG_swapbacked) |
 				 (1L << PG_swapcache) |
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ee42d4361309..9bc1ad5bffca 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -76,6 +76,44 @@ static int sysctl_enable_soft_offline __read_mostly = 1;
 
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
+/*
+ * Drain any in-flight non-atomic page flag operations that could
+ * clobber a concurrently set HWPoison bit.  Retries until the bit sticks.
+ */
+static void set_hwpoison_drain_rcu(struct page *p)
+{
+	do {
+		synchronize_rcu();
+	} while (!TestSetPageHWPoison(p));
+}
+
+/*
+ * Drain any in-flight non-atomic page flag operations that could
+ * restore the HWPoison bit from stale data.  Retries until it stays clear.
+ */
+static void clear_hwpoison_drain_rcu(struct page *p)
+{
+	do {
+		synchronize_rcu();
+	} while (TestClearPageHWPoison(p));
+}
+
+static bool test_and_set_hwpoison_drain_rcu(struct page *p)
+{
+	bool was_set = TestSetPageHWPoison(p);
+
+	set_hwpoison_drain_rcu(p);
+	return was_set;
+}
+
+static bool test_and_clear_hwpoison_drain_rcu(struct page *p)
+{
+	bool was_set = TestClearPageHWPoison(p);
+
+	clear_hwpoison_drain_rcu(p);
+	return was_set;
+}
+
 static bool hw_memory_failure __read_mostly = false;
 
 static DEFINE_MUTEX(mf_mutex);
@@ -2390,7 +2428,7 @@ int memory_failure(unsigned long pfn, int flags)
 	if (hugetlb)
 		goto unlock_mutex;
 
-	if (TestSetPageHWPoison(p)) {
+	if (test_and_set_hwpoison_drain_rcu(p)) {
 		res = -EHWPOISON;
 		if (flags & MF_ACTION_REQUIRED)
 			res = kill_accessing_process(current, pfn, flags);
@@ -2420,7 +2458,7 @@ int memory_failure(unsigned long pfn, int flags)
 			} else {
 				/* We lost the race, try again */
 				if (retry) {
-					ClearPageHWPoison(p);
+					clear_hwpoison_drain_rcu(p);
 					retry = false;
 					goto try_again;
 				}
@@ -2441,7 +2479,7 @@ int memory_failure(unsigned long pfn, int flags)
 	/* filter pages that are protected from hwpoison test by users */
 	folio_lock(folio);
 	if (hwpoison_filter(p)) {
-		ClearPageHWPoison(p);
+		clear_hwpoison_drain_rcu(p);
 		folio_unlock(folio);
 		folio_put(folio);
 		res = -EOPNOTSUPP;
@@ -2761,7 +2799,7 @@ int unpoison_memory(unsigned long pfn)
 		}
 
 		folio_put(folio);
-		if (TestClearPageHWPoison(p)) {
+		if (test_and_clear_hwpoison_drain_rcu(p)) {
 			folio_put(folio);
 			ret = 0;
 		}
diff --git a/mm/memremap.c b/mm/memremap.c
index 053842d45cb1..c3949fdca5aa 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -494,7 +494,7 @@ void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
 		 * blindly clear bits which could have set my order field here,
 		 * including page head.
 		 */
-		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
+		clear_page_flags_safe(new_page, 0xffUL);	/* Clear possible order, page head */
 
 #ifdef NR_PAGES_IN_LARGE_FOLIO
 		/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d49c254174da..1587acf431f4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1359,7 +1359,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
 		int i;
 
 		if (compound) {
-			page[1].flags.f &= ~PAGE_FLAGS_SECOND;
+			clear_page_flags_safe(&page[1], PAGE_FLAGS_SECOND);
 #ifdef NR_PAGES_IN_LARGE_FOLIO
 			folio->_nr_pages = 0;
 #endif
@@ -1373,7 +1373,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
 					continue;
 				}
 			}
-			(page + i)->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
+			clear_page_flags_safe(page + i, PAGE_FLAGS_CHECK_AT_PREP);
 		}
 	}
 	if (folio_test_anon(folio)) {
@@ -1392,7 +1392,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
 	}
 
 	page_cpupid_reset_last(page);
-	page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
+	clear_page_flags_safe(page, PAGE_FLAGS_CHECK_AT_PREP);
 	page->private = 0;
 	reset_page_owner(page, order);
 	page_table_check_free(page, order);
diff --git a/mm/slub.c b/mm/slub.c
index a2bf3756ca7d..2bfa7e3f8a84 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -617,7 +617,7 @@ static inline void slab_set_pfmemalloc(struct slab *slab)
 
 static inline void __slab_clear_pfmemalloc(struct slab *slab)
 {
-	__clear_bit(SL_pfmemalloc, &slab->flags.f);
+	hwpoison_safe(__clear_bit(SL_pfmemalloc, &slab->flags.f));
 }
 
 /*



* Re: [PATCH net v2 1/2] iov_iter: export iov_iter_restore
From: Jens Axboe @ 2026-06-16 20:47 UTC (permalink / raw)
  To: Octavian Purdila, netdev
  Cc: Alexander Viro, Andrew Morton, Arseniy Krasnov, David S. Miller,
	Eric Dumazet, Eugenio Pérez, Jakub Kicinski, Jason Wang, kvm,
	linux-block, linux-fsdevel, linux-kernel, Michael S. Tsirkin,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Stefano Garzarella,
	virtualization, Xuan Zhuo
In-Reply-To: <20260613000953.467473-2-tavip@google.com>

On 6/12/26 6:09 PM, Octavian Purdila wrote:
> Export iov_iter_restore so that it can be used by modules.
> 
> This is needed by the virtio vsock transport (which can be built as a
> module) to restore the msg_iter state when transmission fails.
> 
> Signed-off-by: Octavian Purdila <tavip@google.com>
> ---
>  lib/iov_iter.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 243662af1af73..067e745f9ef53 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1491,6 +1491,7 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
>  		i->__iov -= state->nr_segs - i->nr_segs;
>  	i->nr_segs = state->nr_segs;
>  }
> +EXPORT_SYMBOL(iov_iter_restore);

I don't have a problem exporting this to modules, but any new export
should be _GPL. So please change it to that.

-- 
Jens Axboe


* [PATCH 6.12 248/261] vsock/virtio: fix potential unbounded skb queue
From: Greg Kroah-Hartman @ 2026-06-16 15:01 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, patches, Eric Dumazet, Arseniy Krasnov,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, virtualization,
	Jakub Kicinski
In-Reply-To: <20260616145044.869532709@linuxfoundation.org>

6.12-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Eric Dumazet <edumazet@google.com>

commit 059b7dbd20a6f0c539a45ddff1573cb8946685b5 upstream.

virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc.

virtio_transport_recv_enqueue() skips coalescing for packets
with VIRTIO_VSOCK_SEQ_EOM.

If fed with packets with len == 0 and VIRTIO_VSOCK_SEQ_EOM,
a very large number of packets can be queued
because vvs->rx_bytes stays at 0.

Fix this by estimating the skb metadata size:

	(Number of skbs in the queue) * SKB_TRUESIZE(0)
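
As a rough illustration of why the estimate bounds the queue, the check can
be modelled in userspace as below. This is a sketch under assumptions, not
the kernel code: FAKE_SKB_TRUESIZE is an arbitrary stand-in for
SKB_TRUESIZE(0), struct fake_vvs and both functions are hypothetical, and
every accepted packet is assumed to become its own skb (as with EOM packets
that skip coalescing).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FAKE_SKB_TRUESIZE 512u  /* assumed stand-in for SKB_TRUESIZE(0) */

struct fake_vvs {
	uint32_t buf_alloc;   /* receive buffer budget */
	uint32_t buf_used;
	uint32_t rx_bytes;
	uint32_t queued_skbs; /* stand-in for skb_queue_len(&vvs->rx_queue) */
};

/* Admission check with the per-skb overhead estimate included. */
static bool fake_inc_rx_pkt(struct fake_vvs *vvs, uint32_t len)
{
	uint64_t skb_overhead =
		((uint64_t)vvs->queued_skbs + 1) * FAKE_SKB_TRUESIZE;

	if (skb_overhead + vvs->buf_used + len > vvs->buf_alloc)
		return false;

	vvs->rx_bytes += len;
	vvs->buf_used += len;
	vvs->queued_skbs++;   /* simplification: one skb per accepted packet */
	return true;
}

/* Count how many zero-length packets are accepted before rejection. */
static unsigned int fake_flood_zero_len(uint32_t buf_alloc)
{
	struct fake_vvs vvs = { .buf_alloc = buf_alloc };
	unsigned int accepted = 0;

	while (accepted < 1000000 && fake_inc_rx_pkt(&vvs, 0))
		accepted++;
	return accepted;
}
```

With a 4096-byte budget and the assumed 512-byte overhead, the model stops
accepting zero-length packets after eight; without the overhead term the
loop would only end at the artificial one-million cap.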

Fixes: 077706165717 ("virtio/vsock: don't use skbuff state to account credit")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: "Eugenio Pérez" <eperezma@redhat.com>
Cc: virtualization@lists.linux.dev
Link: https://patch.msgid.link/20260430122653.554058-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 net/vmw_vsock/virtio_transport_common.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -430,7 +430,9 @@ static int virtio_transport_send_pkt_inf
 static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
 					u32 len)
 {
-	if (vvs->buf_used + len > vvs->buf_alloc)
+	u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0);
+
+	if (skb_overhead + vvs->buf_used + len > vvs->buf_alloc)
 		return false;
 
 	vvs->rx_bytes += len;




* Re: [PATCH v1] s390/virtio_ccw: Also suppress -EINVAL on device detach
From: William Bezenah @ 2026-06-16 16:27 UTC (permalink / raw)
  To: Peter Oberparleiter, Cornelia Huck, Halil Pasic, vneethv
  Cc: linux-s390, farman, hca, gor, agordeev, borntraeger, svens,
	mjrosato, virtualization, kvm, linux-kernel
In-Reply-To: <6356a2be-a49b-4ac4-b52e-dd84bd4ff7b5@linux.ibm.com>

On 6/16/2026 10:53 AM, Peter Oberparleiter wrote:

> On 16.06.2026 11:16, Cornelia Huck wrote:
>> On Tue, Jun 16 2026, Peter Oberparleiter <oberpar@linux.ibm.com> wrote:
>>
>>> On 15.06.2026 23:42, Halil Pasic wrote:
>>>> On Mon, 15 Jun 2026 16:01:55 -0400
>>>> William Bezenah <wbezenah@linux.ibm.com> wrote:
>>>>
>>>>> On 6/15/2026 10:58 AM, Cornelia Huck wrote:
>>>>>> On Mon, Jun 15 2026, Halil Pasic <pasic@linux.ibm.com> wrote:
>>>>>>  
>>>>>>> On Fri, 12 Jun 2026 17:54:07 +0200
>>>>>>> William Bezenah <wbezenah@linux.ibm.com> wrote:
>>>>>>>  
>>>>>>>> Since commit 8c58a229688c ("s390/cio: Do not unregister the
>>>>>>>> subchannel based on DNV"), subchannel behavior following a device
>>>>>>>> detach has been updated and results in -EINVAL being propagated
>>>>>>>> rather than -ENODEV, originating from ccw_device_start_timeout_key()
>>>>>>>> in cio/device_ops. In the end, the virtio driver has no ability to
>>>>>>>> react to the difference between device and subchannel states here,
>>>>>>>> and during detach, both -ENODEV and -EINVAL indicate the device
>>>>>>>> cannot be used and should not be treated as errors requiring
>>>>>>>> attention. Update error handling in virtio_ccw_del_vq() and
>>>>>>>> virtio_ccw_drop_indicator() to suppress -EINVAL in addition to
>>>>>>>> -ENODEV.  
>>>>>>> Hi William!
>>>>>>>
>>>>>>> Are you saying that ccw_device_start() started returning -EINVAL
>>>>>>> since 8c58a229688c ("s390/cio: Do not unregister the subchannel based on
>>>>>>> DNV")? Or did I somehow read the paragraph wrong?
>>>>>>>
>>>>>>> The function ccw_device_start is documented to return:
>>>>>>>  * Returns:                                                                     
>>>>>>>  *  %0, if the operation was successful;                                        
>>>>>>>  *  -%EBUSY, if the device is busy, or status pending;                          
>>>>>>>  *  -%EACCES, if no path specified in @lpm is operational;                      
>>>>>>>  *  -%ENODEV, if the device is not operational. 
>>>>>>> and the commit message does not say a thing about introducing -EINVAL to
>>>>>>> the mix.  
>>>>>> The function may return -EINVAL for non-enabled subchannels
>>>>>> (i.e. pmcw.ena == 0), maybe we get an all-zeroes schib with dnv == 0?
>>>>>> I'd expect it not to be enabled in that case anyway.  
>>>>> Yep, that's at least how I've come to understand what changed. The
>>>>> function ccw_device_start_timeout_key() has always returned -EINVAL
>>>>> for non-enabled subchannels (pmcw.ena == 0), though it's not
>>>>> documented in the header.
>>>> Wasn't this -EINVAL actually introduced by commit:
>>>> 823d494ac111 ("[S390] pm: ccw bus power management callbacks")?
>>> In the context of virtio-ccw added in 2012, an EINVAL return code
>>> introduced in 2009 might be considered "always" :)
>> :)
>>
>> I'm wondering whether we should still expect to hit the "ssch with
>> ena==0" situation, given that pm support has been removed again in the
>> meanwhile. (Well, other than in situations like this, where it is a
>> follow-up to other problems.) IOW, can callers expect not to see
>> -EINVAL, unless they are doing something really stupid?
> As Halil also pointed out, this would be a programming error, either on
> the side of the driver that starts I/O without setting the device
> properly online, or in the common I/O layer (hopefully not, but you
> never know). Having a dedicated return code to identify this situation
> is definitely useful, and we'll also consider documenting it accordingly
> in the function comment.
>
>>>>> What changed with commit 8c58a229688c is that cio_update_schib() now
>>>>> updates the schib even when DNV=0, rather than returning early as it
>>>>> did previously. Somehow this update results in pmcw.ena == 0 in
>>>>> ccw_device_start_timeout_key(). Previously, it saw pmcw.ena == 1 and
>>>>> moved to the condition (cdev->private->state == DEV_STATE_NOT_OPER)
>>>>> where it returned -ENODEV.
>>>> Sounds fishy to me. As far as I understand the DNV takes precedence over
>>>> all other pieces of PMCW.
>>> And you're right about that! The Principles of Operation states (p. 15-4
>>> in SA22-7832-14 [1]) that the contents of all other fields in the PMCW
>>> are unpredictable when DNV is 0, therefore 8c58a229688c is in error.
>>>
>>> I'll work with Vineeth to determine how to fix this issue, potentially
>>> via manually clearing some relevant SCHIB fields instead of copying the
>>> unpredictable results of the STSCH instruction.
>> Can't you zero the whole SCHIB, or do you still need some of the
>> measurement block things for cleanup?
> I faintly remember that there WAS a reason to use the remainder of the
> SCHIB contents because of some unwanted effect that occurred if we
> didn't, but I don't recall the details. We'll need to dig up the
> associated bug report to understand it and determine if we can simply
> clear all of the SCHIB, or need to keep some of the information intact.
>
>>>>> So the commit didn't introduce -EINVAL as a new return value, rather,
>>>>> it changed the subchannel lifecycle such that existing paths now
>>>>> propagate -EINVAL rather than -ENODEV during the device detach
>>>>> scenario.
>>>> I'm not convinced returning -EINVAL in the given situation is the
>>>> right thing to do. Peter, would you mind to chime in?
>>> I tend to agree that an attempt to start I/O for a subchannel that has
>>> DNV 0 should result in ENODEV rather than EINVAL, though the latter is
>>> still valid when a driver tries to start I/O on a subchannel that is not
>>> enabled for I/O.
>>>
>>> We'll make sure to design the fix for 8c58a229688c in a way that ENODEV
>>> will be returned when DNV is 0. Assuming that this is the only situation
>>> where virtio-ccw's ccw_io_helper() receives -EINVAL from
>>> ccw_device_start_timeout_key() during detach, the subject patch should
>>> no longer be necessary.
>> I agree, I'd not expect to get -EINVAL in ccw_io_helper().
> Yeah, this was definitely an unexpected side effect of the DNV commit.
>
FWIW, during my testing, that was indeed the only situation in which
I saw -EINVAL during the detach process. So assuming the right fix
would correctly return ENODEV, I agree this patch is likely
unnecessary.

Thanks all.



* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Michael S. Tsirkin @ 2026-06-16 16:15 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Miaohe Lin, Zi Yan, Andrew Morton, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <e4e123f3-ef32-417f-8369-ff944b6f358d@kernel.org>

On Tue, Jun 16, 2026 at 08:56:42AM +0200, David Hildenbrand (Arm) wrote:
> >>
> >>
> >> Assume that we enlighten all non-atomics to grab the rcu read lock, such as
> > 
> > These non-atomics are defined and used because they want to avoid atomic ops overhead?
> > So I'm afraid using rcu read lock in these places would lead to unexpected overhead.
> 
> It should be cheaper than atomics IIUC. Further, I assume that some pages could
> batch over multiple such operations (esp. page freeing path when we process tail
> pages).

Doubt we have the energy to bother with this much.
We'll have to stick them in the bit manipulation macros if memory failure is
configured and be done with it.

-- 
MST



* [PATCH 6.18 318/325] vsock/virtio: fix potential unbounded skb queue
From: Greg Kroah-Hartman @ 2026-06-16 15:01 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, patches, Eric Dumazet, Arseniy Krasnov,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, virtualization,
	Jakub Kicinski
In-Reply-To: <20260616145057.827196531@linuxfoundation.org>

6.18-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Eric Dumazet <edumazet@google.com>

commit 059b7dbd20a6f0c539a45ddff1573cb8946685b5 upstream.

virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc.

virtio_transport_recv_enqueue() skips coalescing for packets
with VIRTIO_VSOCK_SEQ_EOM.

If fed with packets with len == 0 and VIRTIO_VSOCK_SEQ_EOM,
a very large number of packets can be queued
because vvs->rx_bytes stays at 0.

Fix this by estimating the skb metadata size:

	(Number of skbs in the queue) * SKB_TRUESIZE(0)

Fixes: 077706165717 ("virtio/vsock: don't use skbuff state to account credit")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: "Eugenio Pérez" <eperezma@redhat.com>
Cc: virtualization@lists.linux.dev
Link: https://patch.msgid.link/20260430122653.554058-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 net/vmw_vsock/virtio_transport_common.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -425,7 +425,9 @@ static int virtio_transport_send_pkt_inf
 static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
 					u32 len)
 {
-	if (vvs->buf_used + len > vvs->buf_alloc)
+	u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0);
+
+	if (skb_overhead + vvs->buf_used + len > vvs->buf_alloc)
 		return false;
 
 	vvs->rx_bytes += len;




* Re: [PATCH 3/4] vhost/vsock: suppress EHOSTUNREACH fast-fail during CPR pause
From: Stefano Garzarella @ 2026-06-16 16:13 UTC (permalink / raw)
  To: Andrey Drobyshev
  Cc: linux-kernel, kvm, virtualization, netdev, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den
In-Reply-To: <021a6604-289c-4dd8-a0be-33c7812c0105@virtuozzo.com>

On Tue, Jun 16, 2026 at 06:58:40PM +0300, Andrey Drobyshev wrote:
>On 6/16/26 5:18 PM, Stefano Garzarella wrote:
>> On Fri, Jun 12, 2026 at 07:57:17PM +0300, Andrey Drobyshev wrote:

[...]

>>> static u32 vhost_transport_get_local_cid(void)
>>> @@ -311,11 +312,17 @@ vhost_transport_send_pkt(struct sk_buff *skb, struct net *net)
>>> 	 * the mutex would be too expensive in this hot path, and we already have
>>> 	 * all the outcomes covered: if the backend becomes NULL right after the check,
>>> 	 * vhost_transport_do_send_pkt() will check it under the mutex anyway.
>>> +	 *
>>> +	 * Don't fast-fail if cpr_paused is set, keep queueing skbs instead.
>>> +	 * The kick in vhost_vsock_start() will drain them on resume.
>>> 	 */
>>> 	if (unlikely(!data_race(vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])))) {
>>> -		rcu_read_unlock();
>>> -		kfree_skb(skb);
>>> -		return -EHOSTUNREACH;
>>> +		smp_rmb();	/* pairs with smp_wmb() in start/drop_backends */
>>> +		if (!READ_ONCE(vsock->cpr_paused)) {
>>
>> Can we avoid this which is not really readable and maybe add a single
>> variable to control the fast-fail at all?
>>
>> I mean replacing both cpr_paused + backend-pointer with a single
>> `started` flag: set it to false at open, true on start via
>> smp_store_release(), back to false on normal stop, and leave it true
>> during CPR pause.
>>
>> The reader in send_pkt can do just:
>>
>>      if (!smp_load_acquire(&vsock->started))
>>          return -EHOSTUNREACH;
>>
>> WDYT?
>>
>
>I don't think it's gonna work as suggested.  As I understand, the order
>during CPR migration is:
>
>1) SET_RUNNING(0)
>       -> vhost_vsock_stop()
>           -> vhost_vsock_drop_backends()
>2) RESET_OWNER
>       -> vhost_vsock_drop_backends()
>3) SET_OWNER
>4) SET_RUNNING(1)
>       -> vhost_vsock_start
>           -> for (...) vhost_vq_set_backend()
>
>(Btw I just noticed backends are already NULL at step 2), but that's
>just our CPR case, for any potential RESET_OWNER users it might not be
>the case).
>
>So the race window starts from 1) (not from 2)).  We have no way of
>differentiating whether device is actually being stopped for good, or
>we're in the middle of CPR.  If we set the flag to false on stop as you
>suggested, we'll still hit the -EHOSTUNREACH case eventually, and
>avoiding it is the whole purpose of this patch.
>
>The fast-fail with -EHOSTUNREACH relies on the presence of backends.
>IIUC the backend will only become set after initial SET_RUNNING(1),
>which will only happen once the guest driver writes something to the virtio
>config register, QEMU catches it and calls SET_RUNNING(1).  So we have
>ordering with the guest's actions here, which is logical.  But for our
>issue that means that the only true marker of paused/not paused is the
>presence of backends - and that's why the flag is set in
>vhost_vsock_drop_backends().

Okay, so what about not setting `started` to false in
SET_RUNNING(0)? I mean using it just to track the first SET_RUNNING(1)
(and maybe renaming the variable).

Apart from CPR, when can SET_RUNNING(0) occur?

In the end that was just an optimization; queueing the packet is not
a big issue IMO.
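
To make the idea concrete, here is a minimal userspace sketch using C11
atomics in place of smp_store_release()/smp_load_acquire(). It is only a
model under stated assumptions: `struct vsock_model`, the `ever_started`
name, and all `model_*` helpers are hypothetical and are not part of
drivers/vhost/vsock.c.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct vhost_vsock; only the flag is modeled. */
struct vsock_model {
	atomic_bool ever_started;	/* set on the first SET_RUNNING(1), never cleared */
	int queued;			/* packets left for the send worker to drain */
};

static void model_init(struct vsock_model *v)
{
	atomic_init(&v->ever_started, false);	/* state at open */
	v->queued = 0;
}

/* Models vhost_vsock_start(): publish the flag with release semantics. */
static void model_start(struct vsock_model *v)
{
	atomic_store_explicit(&v->ever_started, true, memory_order_release);
}

/* Models SET_RUNNING(0), including a CPR pause: the flag is left untouched. */
static void model_stop(struct vsock_model *v)
{
	(void)v;
}

/* Models the fast-fail check in send_pkt: fail only before the first start. */
static int model_send_pkt(struct vsock_model *v)
{
	if (!atomic_load_explicit(&v->ever_started, memory_order_acquire))
		return -EHOSTUNREACH;
	v->queued++;	/* otherwise keep queueing, even while stopped */
	return 0;
}
```

In this model a send before the first start fails with -EHOSTUNREACH,
while a send after a later stop is still queued, which is the behavior
wanted during a CPR pause.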

>
>>> +			rcu_read_unlock();
>>> +			kfree_skb(skb);
>>> +			return -EHOSTUNREACH;
>>> +		}
>>
>>
>> That said claude here is reporting a potential issue that I think we
>> should consider:
>>      After VHOST_RESET_OWNER, the guest CID stays in the hash, so
>>      vhost_transport_send_pkt() can still find the vsock, skip the
>>      fast-fail (cpr_paused=true), and call vhost_vq_work_queue() while
>>      vhost_workers_free() is freeing workers without a synchronize_rcu()
>>      — risking a use-after-free. Also, any send_pkt_work queued between
>>      the last flush and worker teardown gets its VHOST_WORK_QUEUED bit
>>      stuck (the vhost task exits without draining), deadlocking
>>      host→guest traffic after restart.
>>
>>      A synchronize_rcu() in vhost_workers_free() between the
>>      rcu_assign_pointer(NULL) loop and the destroy loop would close the
>>      use-after-free, and reinitializing send_pkt_work via
>>      vhost_work_init() after vhost_dev_reset_owner() returns would clear
>>      the stuck QUEUED bit.
>>
>>
>
>Yes, this looks real indeed.  Though I couldn't hit the UAF issue while
>testing host->guest transfer under KASAN.
>
>>> 	}
>>>
>>> 	if (virtio_vsock_skb_reply(skb))
>>> @@ -640,6 +647,9 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
>>> 		mutex_unlock(&vq->mutex);
>>> 	}
>>>
>>> +	smp_wmb();	/* pairs with smp_rmb() in send_pkt */
>>> +	WRITE_ONCE(vsock->cpr_paused, false);
>>> +
>>> 	/* Some packets may have been queued before the device was started,
>>> 	 * let's kick the send worker to send them.
>>> 	 */
>>> @@ -671,6 +681,11 @@ static void vhost_vsock_drop_backends(struct vhost_vsock *vsock)
>>>
>>> 	lockdep_assert_held(&vsock->dev.mutex);
>>>
>>> +	if (vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])) {
>>> +		WRITE_ONCE(vsock->cpr_paused, true);
>>> +		smp_wmb();	/* pairs with smp_rmb() in send_pkt */
>>> +	}
>>
>> Why here and not in vhost_vsock_reset_owner()?
>>
>> Also having this here will set it to true also with
>> VHOST_VSOCK_SET_RUNNING(0), is that right?
>>
>
>That was added here precisely to cover the vhost_vsock_stop() case (see
>above).

I see now, a comment or something in the commit would have helped.

Thanks,
Stefano



* Re: [PATCH 4/4] vhost/vsock: re-scan TX virtqueue on device start
From: Andrey Drobyshev @ 2026-06-16 15:58 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: linux-kernel, kvm, virtualization, netdev, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den
In-Reply-To: <ajFbT6sDESh9FDOl@sgarzare-redhat>

On 6/16/26 5:23 PM, Stefano Garzarella wrote:
> On Fri, Jun 12, 2026 at 07:57:18PM +0300, Andrey Drobyshev wrote:
>> During QEMU CPR live-update (and VHOST_RESET_OWNER in general) the guest
>> keeps running while the host drops and later re-attaches vhost backends.
>> If the guest adds a buffer to the TX virtqueue (guest->host) and kicks
>> while the backend is temporarily NULL (between vhost_vsock_drop_backends()
>> and the next vhost_vsock_start()), then the kick is delivered to the
>> vhost worker, handle_tx_kick() sees a NULL backend and returns, and the
>> kick signal is consumed.  The buffer is then left in the ring.
>>
>> Then upon device start vhost_vsock_start() only re-kicks the RX send
>> worker, never the TX VQ, so the buffer is processed only if the guest
>> happens to kick again.  But if the guest itself is now waiting for data
>> from the host, it will never kick TX VQ again, and we end up in a
>> deadlock.
>>
>> The deadlock is reproduced during active host->guest socat data transfer
>> under multiple consecutive CPR live-updates.
>>
>> To fix this, in vhost_vsock_start(), after kicking the RX send worker, also
>> queue the TX vq poll so any buffers the guest enqueued while we were paused
>> get scanned.
> 
> Again, it seems like we're fixing an issue that existed before this 
> series, but IIUC without support for VHOST_RESET_OWNER, this could never 
> have happened, so the wording should be changed to make it clear that 
> this can happen only with the new VHOST_RESET_OWNER support.
> 
> In addition, this patch must also be applied before the 
> VHOST_RESET_OWNER support or merged into it.
>

Agreed.
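
For reference, the lost-kick scenario from the commit message can be
modeled in userspace roughly as below. This is only a sketch: `struct
tx_model` and the `tx_*` helpers are hypothetical, and the direct call in
tx_start() merely stands in for the worker running after
vhost_poll_queue().

```c
#include <stdbool.h>

/* Hypothetical model of the TX virtqueue; not the real vhost structures. */
struct tx_model {
	bool has_backend;	/* models a non-NULL vq backend */
	int ring_pending;	/* buffers added by the guest, not yet consumed */
	int processed;		/* buffers the host has handled */
};

/* Models handle_tx_kick(): with no backend it returns and the kick is lost. */
static void tx_handle_kick(struct tx_model *tx)
{
	if (!tx->has_backend)
		return;
	tx->processed += tx->ring_pending;
	tx->ring_pending = 0;
}

/* Models the guest adding one buffer to the ring and kicking once. */
static void tx_guest_send(struct tx_model *tx)
{
	tx->ring_pending++;
	tx_handle_kick(tx);
}

/* Models vhost_vsock_start(); rescan_tx selects the patched behavior. */
static void tx_start(struct tx_model *tx, bool rescan_tx)
{
	tx->has_backend = true;
	if (rescan_tx)
		tx_handle_kick(tx);	/* stands in for the TX poll re-queue */
}
```

Without the rescan, a buffer added while the backend was NULL stays in
the ring until the guest kicks again; with the rescan it is picked up as
soon as the device starts.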
>>
>> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>> ---
>> drivers/vhost/vsock.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> index bcaba36becd7..1fcfe71d18be 100644
>> --- a/drivers/vhost/vsock.c
>> +++ b/drivers/vhost/vsock.c
>> @@ -655,6 +655,12 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
>> 	 */
>> 	vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work);
>>
>> +	/*
>> +	 * Some packets might've also been queued in TX VQ.  Re-scan it here,
>> +	 * mirroring the RX send-worker kick above.
>> +	 */
> 
> Can we also mention that this is related to VHOST_RESET_OWNER?
>

Agreed.
> Thanks,
> Stefano
> 
>> +	vhost_poll_queue(&vsock->vqs[VSOCK_VQ_TX].poll);
>> +
>> 	mutex_unlock(&vsock->dev.mutex);
>> 	return 0;
>>
>> -- 
>> 2.47.1
>>
> 

