* [PATCH v7 00/12] nvdimm: virtio_pmem: fix flush/request failure paths
@ 2026-06-30  9:23 Li Chen
  2026-06-30  9:23 ` [PATCH v7 01/12] nvdimm: preserve flush callback -ENOMEM Li Chen
                   ` (12 more replies)
  0 siblings, 13 replies; 14+ messages in thread
From: Li Chen @ 2026-06-30  9:23 UTC
  To: Pankaj Gupta, Vishal Verma, Dave Jiang, Alison Schofield,
	virtualization, nvdimm
  Cc: linux-kernel

Hi,

This series started as a virtio-pmem request lifetime and broken virtqueue
fix, but the rerolls have picked up several related flush-path fixes found
during local testing and review. Since the series is now broader than the
original lifetime bug, this cover letter calls out where the patches came
from.

The nvdimm flush helper maps provider flush failures to -EIO. That should
remain the default for provider/backend failures because host-side errors are
still best reported as generic I/O errors to the guest. However, virtio-pmem
may also fail a guest-local flush request allocation with -ENOMEM before any
request is submitted to the host. Reporting that resource failure as -EIO
makes memory pressure look like media failure.

The raw failure seen in the local mkfs sanity test was:

  wipefs: /dev/pmem0: cannot flush modified buffers: Input/output error
  mkfs.ext4: Input/output error while writing out and closing file system
  nd_region region0: dbg: nvdimm_flush rc=-5

Patch 1 comes from that local failure, with the error policy narrowed after
Pankaj pointed out that host/backend provider errors should not all be exposed
directly to the guest. It now preserves only -ENOMEM and keeps other provider
flush failures mapped to -EIO.
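
The error policy described above can be sketched in plain userspace C. This
is only an illustration under stated assumptions: map_provider_flush_rc() is
a hypothetical helper name, and the actual change is made inline in
nvdimm_flush() rather than through a separate function.

```c
#include <errno.h>

/*
 * Hypothetical helper mirroring the policy in patch 1: success stays 0,
 * a guest-local -ENOMEM is preserved, and every other non-zero provider
 * result collapses to -EIO.
 */
static int map_provider_flush_rc(int rc)
{
	if (rc && rc != -ENOMEM)
		return -EIO;
	return rc;
}
```

With this mapping a provider that returns -EINVAL or -EIO is still reported
as a generic I/O error, while an allocation failure keeps its own errno.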

Patches 2 and 3 come from review of the pmem flush path. Patch 2 keeps a
failed REQ_PREFLUSH from being overwritten after data copy, and patch 3 is the
dataless-bio guard added after the Sashiko review. Patch 4 comes from the
local child flush bio allocation failure, but v7 reworks the v6 synchronous
FUA approach after Pankaj noted that the old child flush bio path completed
asynchronously. This version removes the child bio while keeping parent bio
completion asynchronous: the provider returns NVDIMM_FLUSH_ASYNC, queues
ordered WQ_MEM_RECLAIM work, and completes the parent bio after
virtio_pmem_flush() finishes. Patch 5 is the remaining allocation-policy
follow-up for the actual virtio-pmem flush request object, not for a child
bio.
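
The ownership handoff that patch 4 introduces can be sketched as a small
userspace model. Everything here is a stand-in chosen for illustration:
struct toy_bio is not the kernel's struct bio, FLUSH_ASYNC stands in for
NVDIMM_FLUSH_ASYNC, and a one-slot pointer replaces the ordered
WQ_MEM_RECLAIM workqueue.

```c
#include <errno.h>
#include <stddef.h>

#define FLUSH_ASYNC 1	/* stand-in for NVDIMM_FLUSH_ASYNC */

/* Toy stand-in for a parent bio; not the kernel's struct bio. */
struct toy_bio {
	int status;	/* 0 on success, negative errno on failure */
	int completed;	/* set once by whoever owns completion */
};

/* One-slot stand-in for the ordered workqueue. */
static struct toy_bio *pending;

/* Provider callback: take ownership of completion and defer the flush. */
static int toy_provider_flush(struct toy_bio *bio)
{
	if (pending)
		return -ENOMEM;	/* toy queue is full; caller completes the bio */
	pending = bio;
	return FLUSH_ASYNC;
}

/* Worker: run the (simulated) host flush, then complete the parent. */
static void toy_flush_worker(int host_rc)
{
	struct toy_bio *bio = pending;

	if (!bio)
		return;
	pending = NULL;
	if (host_rc)
		bio->status = host_rc;
	bio->completed = 1;
}

/* Submit path: stop touching the bio once the provider owns it. */
static int toy_submit_fua(struct toy_bio *bio)
{
	int rc = toy_provider_flush(bio);

	if (rc == FLUSH_ASYNC)
		return rc;
	if (rc)
		bio->status = rc;
	bio->completed = 1;
	return rc;
}
```

The point of the model is the single-owner rule: after the callback returns
the async value, only the worker may complete the parent, so the submit path
returns without touching it again.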

Patches 6 and 7 are the older waiter fixes. Patch 6 wakes one -ENOSPC waiter
for each reclaimed used buffer, and patch 7 makes the wait flags explicit
READ_ONCE()/WRITE_ONCE() accesses. Pankaj asked for those changes to be split
across patches, and patch 7 carries his Acked-by.

Patch 8 is the original KASAN use-after-free fix for the request token
lifetime. Patches 9 and 10 are follow-up hardening in the same completion
path: order response publication before the submitter reads resp.ret, and keep
the DMA_FROM_DEVICE response buffer away from CPU-owned request fields. Patch
11 addresses the broken virtqueue / notify failure path reported by LKP and
reproduced locally with fault injection. It also serializes async parent-bio
flush work against broken-state publication, so remove/freeze cannot drain the
workqueue before a racing FUA bio queues new completion work. Patch 12 handles
teardown: it drains requests across freeze/remove and also addresses the
Sashiko-reported req_vq-after-free/NULL-deref class by clearing req_vq after
del_vqs() and making the drain helper tolerate a NULL queue. It also stops the
submit path from checking req_vq after the broken state is visible.
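
The ordering requirement behind patch 9 (publish the response before the
submitter reads resp.ret) can be illustrated with C11 atomics. This is a
userspace analogue only: the struct and function names are hypothetical, and
kernel code would normally express the same pairing with
smp_store_release()/smp_load_acquire() rather than <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy request; field names are illustrative, not the driver's layout. */
struct toy_request {
	uint32_t resp_ret;	/* written by the completion side */
	atomic_bool done;	/* publication flag */
};

/* Completion side: write the response, then publish with release. */
static void toy_complete(struct toy_request *req, uint32_t ret)
{
	req->resp_ret = ret;
	atomic_store_explicit(&req->done, true, memory_order_release);
}

/* Submitter side: read the response only after an acquire load sees done. */
static bool toy_try_read(struct toy_request *req, uint32_t *ret)
{
	if (!atomic_load_explicit(&req->done, memory_order_acquire))
		return false;
	*ret = req->resp_ret;
	return true;
}
```

The release store pairs with the acquire load, so a submitter that observes
done == true is also guaranteed to observe the response value written before
it.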

The original repros were on QEMU x86_64 with a virtio-pmem device exported
as /dev/pmem0. For this v7 reroll, the series applies to v7.1-rc7.

Thanks,
Li Chen

Changelog:
v6->v7:
- Address Pankaj's feedback on nvdimm_flush() error policy.
- Preserve only -ENOMEM from provider flush callbacks and continue to map
  other provider/backend failures to -EIO.
- Address Pankaj's feedback on the FUA flush behavior: replace the v6
  synchronous FUA path with provider-owned asynchronous parent bio completion.
- Add NVDIMM_FLUSH_ASYNC and use ordered WQ_MEM_RECLAIM work to run
  virtio_pmem_flush() and complete the parent bio after the host flush.
- Keep GFP_NOIO for the virtio-pmem request allocation, but no longer describe
  it as a child bio allocation fix.
- Add Pankaj's Acked-by on the READ_ONCE()/WRITE_ONCE() patch.
- Serialize async parent-bio flush work against broken-state publication in
  the broken-virtqueue patch, so remove/freeze cannot drain the workqueue
  before a racing FUA bio queues new completion work.
- Fold the Sashiko-reported req_vq NULL-deref fix into the freeze/remove
  drain patch.
- Update commit messages and this cover letter to describe patch origins.
v5->v6:
- Address Sashiko review feedback:
  - Add a data-loop guard for dataless bios in pmem_submit_bio().
  - Replace the child flush bio allocation with synchronous FUA flushing.
  - Keep GFP_NOIO only for the virtio-pmem request allocation.
  - Publish request completion with release/acquire ordering.
  - Isolate the DMA_FROM_DEVICE response buffer from CPU-owned fields.
  - Wake the in-flight host-completion waiter when marking the queue broken.
- Clear req_vq after del_vqs() and make drain tolerate a NULL queue.
v4->v5:
- Address review feedback about REQ_PREFLUSH ordering and active virtqueue
  detach.
- Add 2/8 so a failed REQ_PREFLUSH fails the bio before any data copy, and
  make REQ_PREFLUSH use a synchronous provider flush instead of a deferred
  child bio.
- Rework broken-queue handling so runtime failure marking only stops new
  submissions and wakes local -ENOSPC waiters; used/unused token draining is
  done after device reset in remove() and freeze().
- Remove the broken-state shortcut from the host-completion wait so the
  submitter never reads an uninitialized response field.
- Keep the raw broken-virtqueue dmesg in 7/8 while updating the teardown
  rationale.
- Renumber the old virtio-pmem fixes after the new pmem PREFLUSH patch.
v3->v4:
- Rebased the series onto v7.1-rc7 so it applies cleanly to Linux 7.1-rc7.
- Update the allocation site in 6/7 from kmalloc(sizeof(*req_data),
  GFP_KERNEL) to kmalloc_obj(*req_data) to match current nvdimm code.
- Add 1/7 to preserve provider flush callback errors in nvdimm_flush().
- Include the GFP_NOIO child flush bio allocation fix as 2/7.
- Renumber the old request lifetime and broken virtqueue fixes after the two
  new flush error patches.
v2->v3:
- Split patch 1 as suggested by Pankaj Gupta: keep the waiter wakeup
  ordering change in 1/5 and move READ_ONCE()/WRITE_ONCE() updates to
  2/5 (no functional change intended).
- Add the log report to the commit message.
- Fold the export fix into 4/5 to keep the series bisectable when
  CONFIG_VIRTIO_PMEM=m.
v1->v2:
- Add the export patch to fix a compile issue.

Links:
v6: https://lore.kernel.org/all/20260621130246.2973254-1-me@linux.beauty/
v5: https://lore.kernel.org/all/20260617122442.2118957-1-me@linux.beauty/
v4: https://lore.kernel.org/all/20260609120726.1714780-1-me@linux.beauty/
v3: https://lore.kernel.org/all/20260226025712.2236279-1-me@linux.beauty/#t
v2: https://lore.kernel.org/all/20251225042915.334117-1-me@linux.beauty/
v1: https://www.spinics.net/lists/kernel/msg5974818.html

Li Chen (12):
  nvdimm: preserve flush callback -ENOMEM
  nvdimm: pmem: keep PREFLUSH before data writes
  nvdimm: pmem: guard data loop for dataless bios
  nvdimm: virtio_pmem: stop allocating child flush bio
  nvdimm: virtio_pmem: use GFP_NOIO for flush requests
  nvdimm: virtio_pmem: always wake -ENOSPC waiters
  nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags
  nvdimm: virtio_pmem: refcount requests for token lifetime
  nvdimm: virtio_pmem: publish done with release/acquire
  nvdimm: virtio_pmem: isolate DMA request buffers
  nvdimm: virtio_pmem: converge broken virtqueue to -EIO
  nvdimm: virtio_pmem: drain requests in freeze

 drivers/nvdimm/nd_virtio.c   | 265 +++++++++++++++++++++++++++++------
 drivers/nvdimm/pmem.c        |  51 ++++---
 drivers/nvdimm/region_devs.c |   5 +-
 drivers/nvdimm/virtio_pmem.c |  65 ++++++++-
 drivers/nvdimm/virtio_pmem.h |  22 ++-
 include/linux/libnvdimm.h    |   9 ++
 6 files changed, 343 insertions(+), 74 deletions(-)

-- 
2.52.0


* [PATCH v7 01/12] nvdimm: preserve flush callback -ENOMEM
  2026-06-30  9:23 [PATCH v7 00/12] nvdimm: virtio_pmem: fix flush/request failure paths Li Chen
@ 2026-06-30  9:23 ` Li Chen
  2026-06-30  9:23 ` [PATCH v7 02/12] nvdimm: pmem: keep PREFLUSH before data writes Li Chen
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Li Chen @ 2026-06-30  9:23 UTC
  To: Pankaj Gupta, Vishal Verma, Dave Jiang, Alison Schofield,
	virtualization, nvdimm
  Cc: linux-kernel, Li Chen

nvdimm_flush() maps provider flush failures to -EIO. Keep that default
because provider callbacks can report host-side or backend failures that
should remain generic I/O errors to the guest.

Guest-side allocation failures should not be reported as I/O errors. In the
virtio-pmem path, the flush request allocation can fail with -ENOMEM before
any request is submitted to the host. Mapping that to -EIO makes resource
pressure look like media failure.

Preserve -ENOMEM from provider callbacks and continue to map other non-zero
provider failures to -EIO. The generic flush path still returns 0, and
pmem_submit_bio() already converts errno values to block status for bio
completion.

Suggested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v7:
- Preserve only -ENOMEM and keep other provider/backend failures mapped to
  -EIO, per Pankaj's feedback.
v3->v4:
- New patch.

 drivers/nvdimm/region_devs.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index e35c2e18518f0..7cd2c2f0d3121 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1115,7 +1115,8 @@ int nvdimm_flush(struct nd_region *nd_region, struct bio *bio)
 	if (!nd_region->flush)
 		rc = generic_nvdimm_flush(nd_region);
 	else {
-		if (nd_region->flush(nd_region, bio))
+		rc = nd_region->flush(nd_region, bio);
+		if (rc && rc != -ENOMEM)
 			rc = -EIO;
 	}
 
-- 
2.52.0


* [PATCH v7 02/12] nvdimm: pmem: keep PREFLUSH before data writes
  2026-06-30  9:23 [PATCH v7 00/12] nvdimm: virtio_pmem: fix flush/request failure paths Li Chen
  2026-06-30  9:23 ` [PATCH v7 01/12] nvdimm: preserve flush callback -ENOMEM Li Chen
@ 2026-06-30  9:23 ` Li Chen
  2026-06-30  9:23 ` [PATCH v7 03/12] nvdimm: pmem: guard data loop for dataless bios Li Chen
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Li Chen @ 2026-06-30  9:23 UTC
  To: Pankaj Gupta, Vishal Verma, Dave Jiang, Alison Schofield,
	virtualization, nvdimm
  Cc: linux-kernel, Li Chen

pmem_submit_bio() records a REQ_PREFLUSH error, but continues to copy the
bio data and can later overwrite the error with a successful REQ_FUA flush.
That lets data writes run after a failed preflush and can complete the bio
successfully despite the failed ordering barrier.

Run the REQ_PREFLUSH flush synchronously before touching the bio data and
complete the bio with the flush error if it fails. Keep asynchronous flush
chaining for REQ_FUA. At that point, data copy has completed and the parent
bio can wait for the chained flush bio.
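
The intended control flow can be modelled in a few lines of userspace C. The
struct and function names below are hypothetical and exist only for this
illustration; the real change is in pmem_submit_bio().

```c
#include <errno.h>

/* Toy request; not the kernel's struct bio. */
struct toy_io {
	int preflush;		/* request carries a PREFLUSH barrier */
	int preflush_rc;	/* simulated result of the barrier flush */
	int data_copied;	/* set if the data phase ran */
	int status;		/* final completion status */
};

/* Fail the request before any data copy if the barrier flush failed. */
static void toy_submit(struct toy_io *io)
{
	if (io->preflush && io->preflush_rc) {
		io->status = io->preflush_rc;
		return;
	}
	io->data_copied = 1;
	io->status = 0;
}
```

A failed barrier therefore leaves data_copied at 0 and the error in status,
which is the property the early return is meant to guarantee.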

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v5:
- New patch.

 drivers/nvdimm/pmem.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 92c67fbbc1c85..05d3de33e2706 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -208,8 +208,14 @@ static void pmem_submit_bio(struct bio *bio)
 	struct pmem_device *pmem = bio->bi_bdev->bd_disk->private_data;
 	struct nd_region *nd_region = to_region(pmem);
 
-	if (bio->bi_opf & REQ_PREFLUSH)
-		ret = nvdimm_flush(nd_region, bio);
+	if (bio->bi_opf & REQ_PREFLUSH) {
+		ret = nvdimm_flush(nd_region, NULL);
+		if (ret) {
+			bio->bi_status = errno_to_blk_status(ret);
+			bio_endio(bio);
+			return;
+		}
+	}
 
 	do_acct = blk_queue_io_stat(bio->bi_bdev->bd_disk->queue);
 	if (do_acct)
@@ -229,7 +235,7 @@ static void pmem_submit_bio(struct bio *bio)
 	if (do_acct)
 		bio_end_io_acct(bio, start);
 
-	if (bio->bi_opf & REQ_FUA)
+	if ((bio->bi_opf & REQ_FUA) && !bio->bi_status)
 		ret = nvdimm_flush(nd_region, bio);
 
 	if (ret)
-- 
2.52.0


* [PATCH v7 03/12] nvdimm: pmem: guard data loop for dataless bios
  2026-06-30  9:23 [PATCH v7 00/12] nvdimm: virtio_pmem: fix flush/request failure paths Li Chen
  2026-06-30  9:23 ` [PATCH v7 01/12] nvdimm: preserve flush callback -ENOMEM Li Chen
  2026-06-30  9:23 ` [PATCH v7 02/12] nvdimm: pmem: keep PREFLUSH before data writes Li Chen
@ 2026-06-30  9:23 ` Li Chen
  2026-06-30  9:23 ` [PATCH v7 04/12] nvdimm: virtio_pmem: stop allocating child flush bio Li Chen
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Li Chen @ 2026-06-30  9:23 UTC
  To: Pankaj Gupta, Vishal Verma, Dave Jiang, Alison Schofield,
	virtualization, nvdimm
  Cc: linux-kernel, Li Chen

pmem_submit_bio() handles flush-only bios before and after the data
loop. Keep dataless bios out of bio_for_each_segment() so the data path
only walks bios that actually carry bvec data.

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v6:
- New patch.

 drivers/nvdimm/pmem.c | 36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 05d3de33e2706..82ee1ddb3a445 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -217,23 +217,29 @@ static void pmem_submit_bio(struct bio *bio)
 		}
 	}
 
-	do_acct = blk_queue_io_stat(bio->bi_bdev->bd_disk->queue);
-	if (do_acct)
-		start = bio_start_io_acct(bio);
-	bio_for_each_segment(bvec, bio, iter) {
-		if (op_is_write(bio_op(bio)))
-			rc = pmem_do_write(pmem, bvec.bv_page, bvec.bv_offset,
-				iter.bi_sector, bvec.bv_len);
-		else
-			rc = pmem_do_read(pmem, bvec.bv_page, bvec.bv_offset,
-				iter.bi_sector, bvec.bv_len);
-		if (rc) {
-			bio->bi_status = rc;
-			break;
+	if (bio_has_data(bio)) {
+		do_acct = blk_queue_io_stat(bio->bi_bdev->bd_disk->queue);
+		if (do_acct)
+			start = bio_start_io_acct(bio);
+		bio_for_each_segment(bvec, bio, iter) {
+			if (op_is_write(bio_op(bio)))
+				rc = pmem_do_write(pmem, bvec.bv_page,
+						   bvec.bv_offset,
+						   iter.bi_sector,
+						   bvec.bv_len);
+			else
+				rc = pmem_do_read(pmem, bvec.bv_page,
+						  bvec.bv_offset,
+						  iter.bi_sector,
+						  bvec.bv_len);
+			if (rc) {
+				bio->bi_status = rc;
+				break;
+			}
 		}
+		if (do_acct)
+			bio_end_io_acct(bio, start);
 	}
-	if (do_acct)
-		bio_end_io_acct(bio, start);
 
 	if ((bio->bi_opf & REQ_FUA) && !bio->bi_status)
 		ret = nvdimm_flush(nd_region, bio);
-- 
2.52.0


* [PATCH v7 04/12] nvdimm: virtio_pmem: stop allocating child flush bio
  2026-06-30  9:23 [PATCH v7 00/12] nvdimm: virtio_pmem: fix flush/request failure paths Li Chen
                   ` (2 preceding siblings ...)
  2026-06-30  9:23 ` [PATCH v7 03/12] nvdimm: pmem: guard data loop for dataless bios Li Chen
@ 2026-06-30  9:23 ` Li Chen
  2026-06-30  9:23 ` [PATCH v7 05/12] nvdimm: virtio_pmem: use GFP_NOIO for flush requests Li Chen
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Li Chen @ 2026-06-30  9:23 UTC
  To: Pankaj Gupta, Vishal Verma, Dave Jiang, Alison Schofield,
	virtualization, nvdimm
  Cc: linux-kernel, Li Chen

pmem_submit_bio() passes the parent bio to nvdimm_flush() for
REQ_FUA. For virtio-pmem this makes async_pmem_flush() allocate
and submit a child PREFLUSH bio chained to the parent.

That child allocation is in the block submit path. Making it
blocking with GFP_NOIO can consume the same global bio mempool that
submit_bio() uses, while making it GFP_ATOMIC can fail under
pressure. A forced failure of the child allocation produced:

virtio_pmem: forcing child bio allocation failure for test
Buffer I/O error on dev pmem0, logical block 0, lost sync page write
EXT4-fs (pmem0): I/O error while writing superblock
EXT4-fs (pmem0): mount failed

Avoid the child bio without turning REQ_FUA into a synchronous
submit-path wait. Let provider flush callbacks return
NVDIMM_FLUSH_ASYNC after taking ownership of parent bio completion.
pmem_submit_bio() returns in that case, and virtio-pmem queues an
ordered WQ_MEM_RECLAIM work item that runs the existing host flush
path and completes the parent bio.

This keeps the asynchronous completion model of the child-bio path
while removing the child bio allocation from the submit path.

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v7:
- Replace synchronous FUA flushing with provider-owned asynchronous parent bio
  completion.
- Add NVDIMM_FLUSH_ASYNC and ordered WQ_MEM_RECLAIM flush work.
Changes in v6:
- Replace the child bio allocation fix with synchronous FUA flushing.

 drivers/nvdimm/nd_virtio.c   | 54 +++++++++++++++++++++++++-----------
 drivers/nvdimm/pmem.c        |  5 +++-
 drivers/nvdimm/region_devs.c |  2 ++
 drivers/nvdimm/virtio_pmem.c | 17 +++++++++++-
 drivers/nvdimm/virtio_pmem.h |  4 +++
 include/linux/libnvdimm.h    |  9 ++++++
 6 files changed, 73 insertions(+), 18 deletions(-)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index 4176046627beb..8e16b7780be1a 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -9,6 +9,12 @@
 #include "virtio_pmem.h"
 #include "nd.h"
 
+struct virtio_pmem_flush_work {
+	struct work_struct work;
+	struct nd_region *nd_region;
+	struct bio *bio;
+};
+
  /* The interrupt handler */
 void virtio_pmem_host_ack(struct virtqueue *vq)
 {
@@ -107,30 +113,46 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 	return err;
 };
 
+static void virtio_pmem_flush_work(struct work_struct *work)
+{
+	struct virtio_pmem_flush_work *flush;
+	int err;
+
+	flush = container_of(work, struct virtio_pmem_flush_work, work);
+	err = virtio_pmem_flush(flush->nd_region);
+	if (err > 0)
+		err = -EIO;
+	if (err)
+		flush->bio->bi_status = errno_to_blk_status(err);
+	bio_endio(flush->bio);
+	kfree(flush);
+}
+
 /* The asynchronous flush callback function */
 int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
 {
-	/*
-	 * Create child bio for asynchronous flush and chain with
-	 * parent bio. Otherwise directly call nd_region flush.
-	 */
-	if (bio && bio->bi_iter.bi_sector != -1) {
-		struct bio *child = bio_alloc(bio->bi_bdev, 0,
-					      REQ_OP_WRITE | REQ_PREFLUSH,
-					      GFP_ATOMIC);
+	struct virtio_device *vdev = nd_region->provider_data;
+	struct virtio_pmem *vpmem = vdev->priv;
+	struct virtio_pmem_flush_work *flush;
+	int err;
 
-		if (!child)
+	if (bio && bio->bi_iter.bi_sector != -1) {
+		flush = kmalloc_obj(*flush, GFP_NOIO);
+		if (!flush)
 			return -ENOMEM;
-		bio_clone_blkg_association(child, bio);
-		child->bi_iter.bi_sector = -1;
-		bio_chain(child, bio);
-		submit_bio(child);
-		return 0;
+
+		INIT_WORK(&flush->work, virtio_pmem_flush_work);
+		flush->nd_region = nd_region;
+		flush->bio = bio;
+		queue_work(vpmem->flush_wq, &flush->work);
+		return NVDIMM_FLUSH_ASYNC;
 	}
-	if (virtio_pmem_flush(nd_region))
+
+	err = virtio_pmem_flush(nd_region);
+	if (err > 0)
 		return -EIO;
 
-	return 0;
+	return err;
 };
 EXPORT_SYMBOL_GPL(async_pmem_flush);
 MODULE_DESCRIPTION("Virtio Persistent Memory Driver");
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 82ee1ddb3a445..30a51c365ce8b 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -241,8 +241,11 @@ static void pmem_submit_bio(struct bio *bio)
 			bio_end_io_acct(bio, start);
 	}
 
-	if ((bio->bi_opf & REQ_FUA) && !bio->bi_status)
+	if ((bio->bi_opf & REQ_FUA) && !bio->bi_status) {
 		ret = nvdimm_flush(nd_region, bio);
+		if (ret == NVDIMM_FLUSH_ASYNC)
+			return;
+	}
 
 	if (ret)
 		bio->bi_status = errno_to_blk_status(ret);
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 7cd2c2f0d3121..c540f1cff9250 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1116,6 +1116,8 @@ int nvdimm_flush(struct nd_region *nd_region, struct bio *bio)
 		rc = generic_nvdimm_flush(nd_region);
 	else {
 		rc = nd_region->flush(nd_region, bio);
+		if (rc > 0)
+			return rc;
 		if (rc && rc != -ENOMEM)
 			rc = -EIO;
 	}
diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
index 77b1966619059..9cf822a6c0c38 100644
--- a/drivers/nvdimm/virtio_pmem.c
+++ b/drivers/nvdimm/virtio_pmem.c
@@ -67,10 +67,17 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
 	mutex_init(&vpmem->flush_lock);
 	vpmem->vdev = vdev;
 	vdev->priv = vpmem;
+	vpmem->flush_wq = alloc_ordered_workqueue("virtio-pmem-flush",
+						  WQ_MEM_RECLAIM);
+	if (!vpmem->flush_wq) {
+		err = -ENOMEM;
+		goto out_err;
+	}
+
 	err = init_vq(vpmem);
 	if (err) {
 		dev_err(&vdev->dev, "failed to initialize virtio pmem vq's\n");
-		goto out_err;
+		goto out_wq;
 	}
 
 	if (virtio_has_feature(vdev, VIRTIO_PMEM_F_SHMEM_REGION)) {
@@ -131,6 +138,8 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
 	nvdimm_bus_unregister(vpmem->nvdimm_bus);
 out_vq:
 	vdev->config->del_vqs(vdev);
+out_wq:
+	destroy_workqueue(vpmem->flush_wq);
 out_err:
 	return err;
 }
@@ -138,14 +147,20 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
 static void virtio_pmem_remove(struct virtio_device *vdev)
 {
 	struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
+	struct virtio_pmem *vpmem = vdev->priv;
 
 	nvdimm_bus_unregister(nvdimm_bus);
+	drain_workqueue(vpmem->flush_wq);
 	vdev->config->del_vqs(vdev);
 	virtio_reset_device(vdev);
+	destroy_workqueue(vpmem->flush_wq);
 }
 
 static int virtio_pmem_freeze(struct virtio_device *vdev)
 {
+	struct virtio_pmem *vpmem = vdev->priv;
+
+	drain_workqueue(vpmem->flush_wq);
 	vdev->config->del_vqs(vdev);
 	virtio_reset_device(vdev);
 
diff --git a/drivers/nvdimm/virtio_pmem.h b/drivers/nvdimm/virtio_pmem.h
index f72cf17f9518f..e6dfc10ce0762 100644
--- a/drivers/nvdimm/virtio_pmem.h
+++ b/drivers/nvdimm/virtio_pmem.h
@@ -15,6 +15,7 @@
 #include <linux/libnvdimm.h>
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
+#include <linux/workqueue.h>
 
 struct virtio_pmem_request {
 	struct virtio_pmem_req req;
@@ -39,6 +40,9 @@ struct virtio_pmem {
 	/* Serialize flush requests to the device. */
 	struct mutex flush_lock;
 
+	/* Complete asynchronous FUA flushes outside the submit path. */
+	struct workqueue_struct *flush_wq;
+
 	/* nvdimm bus registers virtio pmem device */
 	struct nvdimm_bus *nvdimm_bus;
 	struct nvdimm_bus_descriptor nd_desc;
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 28f086c4a1873..d929d83abf3be 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -126,6 +126,15 @@ struct nd_mapping_desc {
 struct bio;
 struct resource;
 struct nd_region;
+
+/*
+ * Provider flush callback return values:
+ *   0: flush completed synchronously
+ *  <0: flush failed
+ *  >0: flush completion was queued and @bio will be completed later
+ */
+#define NVDIMM_FLUSH_ASYNC 1
+
 struct nd_region_desc {
 	struct resource *res;
 	struct nd_mapping_desc *mapping;
-- 
2.52.0


* [PATCH v7 05/12] nvdimm: virtio_pmem: use GFP_NOIO for flush requests
  2026-06-30  9:23 [PATCH v7 00/12] nvdimm: virtio_pmem: fix flush/request failure paths Li Chen
                   ` (3 preceding siblings ...)
  2026-06-30  9:23 ` [PATCH v7 04/12] nvdimm: virtio_pmem: stop allocating child flush bio Li Chen
@ 2026-06-30  9:23 ` Li Chen
  2026-06-30  9:23 ` [PATCH v7 06/12] nvdimm: virtio_pmem: always wake -ENOSPC waiters Li Chen
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Li Chen @ 2026-06-30  9:23 UTC
  To: Pankaj Gupta, Vishal Verma, Dave Jiang, Alison Schofield,
	virtualization, nvdimm
  Cc: linux-kernel, Li Chen

virtio_pmem_flush() can run from pmem_submit_bio() while filesystem IO
is waiting on the flush completion. The request object allocation can
sleep, but it should not enter filesystem or block IO reclaim from this
flush path.

Use GFP_NOIO for the request allocation. The virtqueue descriptor
allocation still uses GFP_ATOMIC because it runs under pmem_lock.

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v7:
- Keep GFP_NOIO for the virtio-pmem request allocation after removing the
  child flush bio path.
Changes in v6:
- New patch; keep GFP_NOIO only for the virtio-pmem request allocation.

 drivers/nvdimm/nd_virtio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index 8e16b7780be1a..a35044afddf34 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -61,7 +61,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 		return -EIO;
 	}
 
-	req_data = kmalloc_obj(*req_data);
+	req_data = kmalloc_obj(*req_data, GFP_NOIO);
 	if (!req_data)
 		return -ENOMEM;
 
-- 
2.52.0


* [PATCH v7 06/12] nvdimm: virtio_pmem: always wake -ENOSPC waiters
  2026-06-30  9:23 [PATCH v7 00/12] nvdimm: virtio_pmem: fix flush/request failure paths Li Chen
                   ` (4 preceding siblings ...)
  2026-06-30  9:23 ` [PATCH v7 05/12] nvdimm: virtio_pmem: use GFP_NOIO for flush requests Li Chen
@ 2026-06-30  9:23 ` Li Chen
  2026-06-30  9:23 ` [PATCH v7 07/12] nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags Li Chen
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Li Chen @ 2026-06-30  9:23 UTC
  To: Pankaj Gupta, Vishal Verma, Dave Jiang, Alison Schofield,
	virtualization, nvdimm
  Cc: linux-kernel, Li Chen

virtio_pmem_host_ack() reclaims virtqueue descriptors with
virtqueue_get_buf(). The -ENOSPC waiter wakeup is tied to completing the
returned token. If token completion is skipped for any reason, reclaimed
descriptors may not wake a waiter and the submitter may sleep forever
waiting for a free slot. Always wake one -ENOSPC waiter for each virtqueue
completion before touching the returned token.
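
The "one wakeup per reclaimed buffer" rule can be sketched with a toy FIFO in
userspace C. The names below are hypothetical, an array replaces the driver's
req_list, and a NULL token models a completion whose token handling is
skipped.

```c
#include <stddef.h>

#define MAX_WAITERS 8

/* Toy FIFO of submitters that saw -ENOSPC; not the driver's list_head. */
struct toy_waiters {
	int avail[MAX_WAITERS];	/* per-waiter "slot available" flag */
	size_t head, tail;	/* head: next to wake, tail: next free entry */
};

/* Returns the index of the woken waiter, or -1 if nobody is waiting. */
static int toy_wake_one(struct toy_waiters *w)
{
	if (w->head == w->tail)
		return -1;
	w->avail[w->head] = 1;
	return (int)w->head++;
}

/*
 * Completion loop: wake one waiter per reclaimed buffer, before looking
 * at the token. A NULL token stands for "completion skipped".
 */
static int toy_host_ack(struct toy_waiters *w, int **tokens, size_t n)
{
	int woken = 0;

	for (size_t i = 0; i < n; i++) {
		if (toy_wake_one(w) >= 0)
			woken++;
		if (tokens[i])
			*tokens[i] = 1;	/* mark the request done */
	}
	return woken;
}
```

Because the wakeup happens first, a reclaimed slot still releases a waiter
even when the token itself is not completed.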

Signed-off-by: Li Chen <me@linux.beauty>
---
v2->v3:
- Split out the waiter wakeup ordering change from READ_ONCE()/WRITE_ONCE()
  updates (now patch 4/7), per Pankaj's suggestion.
v3->v4:
- Rebased onto v7.1-rc7 and renumbered after the flush error patches.

 drivers/nvdimm/nd_virtio.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index a35044afddf34..fcb26a595d7c6 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -15,26 +15,33 @@ struct virtio_pmem_flush_work {
 	struct bio *bio;
 };
 
+static void virtio_pmem_wake_one_waiter(struct virtio_pmem *vpmem)
+{
+	struct virtio_pmem_request *req_buf;
+
+	if (list_empty(&vpmem->req_list))
+		return;
+
+	req_buf = list_first_entry(&vpmem->req_list,
+				   struct virtio_pmem_request, list);
+	req_buf->wq_buf_avail = true;
+	wake_up(&req_buf->wq_buf);
+	list_del(&req_buf->list);
+}
+
  /* The interrupt handler */
 void virtio_pmem_host_ack(struct virtqueue *vq)
 {
 	struct virtio_pmem *vpmem = vq->vdev->priv;
-	struct virtio_pmem_request *req_data, *req_buf;
+	struct virtio_pmem_request *req_data;
 	unsigned long flags;
 	unsigned int len;
 
 	spin_lock_irqsave(&vpmem->pmem_lock, flags);
 	while ((req_data = virtqueue_get_buf(vq, &len)) != NULL) {
+		virtio_pmem_wake_one_waiter(vpmem);
 		req_data->done = true;
 		wake_up(&req_data->host_acked);
-
-		if (!list_empty(&vpmem->req_list)) {
-			req_buf = list_first_entry(&vpmem->req_list,
-					struct virtio_pmem_request, list);
-			req_buf->wq_buf_avail = true;
-			wake_up(&req_buf->wq_buf);
-			list_del(&req_buf->list);
-		}
 	}
 	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
 }
-- 
2.52.0


* [PATCH v7 07/12] nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags
  2026-06-30  9:23 [PATCH v7 00/12] nvdimm: virtio_pmem: fix flush/request failure paths Li Chen
                   ` (5 preceding siblings ...)
  2026-06-30  9:23 ` [PATCH v7 06/12] nvdimm: virtio_pmem: always wake -ENOSPC waiters Li Chen
@ 2026-06-30  9:23 ` Li Chen
  2026-06-30  9:23 ` [PATCH v7 08/12] nvdimm: virtio_pmem: refcount requests for token lifetime Li Chen
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Li Chen @ 2026-06-30  9:23 UTC
  To: Pankaj Gupta, Vishal Verma, Dave Jiang, Alison Schofield,
	virtualization, nvdimm
  Cc: linux-kernel, Li Chen

Use READ_ONCE()/WRITE_ONCE() for the wait_event() flags (done and
wq_buf_avail). They are observed by waiters without pmem_lock, so make
the accesses explicit single loads/stores that the compiler cannot tear,
fuse, or cache across the wait/wake paths.

Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v7:
- Add Pankaj's Acked-by.
v2->v3:
- Split out READ_ONCE()/WRITE_ONCE() updates from patch 3/7 (no functional
  change intended).
v3->v4:
- Rebased onto v7.1-rc7 and renumbered after the flush error patches.

 drivers/nvdimm/nd_virtio.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index fcb26a595d7c6..8c0d4347938a1 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -24,9 +24,9 @@ static void virtio_pmem_wake_one_waiter(struct virtio_pmem *vpmem)
 
 	req_buf = list_first_entry(&vpmem->req_list,
 				   struct virtio_pmem_request, list);
-	req_buf->wq_buf_avail = true;
+	list_del_init(&req_buf->list);
+	WRITE_ONCE(req_buf->wq_buf_avail, true);
 	wake_up(&req_buf->wq_buf);
-	list_del(&req_buf->list);
 }
 
  /* The interrupt handler */
@@ -40,7 +40,7 @@ void virtio_pmem_host_ack(struct virtqueue *vq)
 	spin_lock_irqsave(&vpmem->pmem_lock, flags);
 	while ((req_data = virtqueue_get_buf(vq, &len)) != NULL) {
 		virtio_pmem_wake_one_waiter(vpmem);
-		req_data->done = true;
+		WRITE_ONCE(req_data->done, true);
 		wake_up(&req_data->host_acked);
 	}
 	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
@@ -72,7 +72,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 	if (!req_data)
 		return -ENOMEM;
 
-	req_data->done = false;
+	WRITE_ONCE(req_data->done, false);
 	init_waitqueue_head(&req_data->host_acked);
 	init_waitqueue_head(&req_data->wq_buf);
 	INIT_LIST_HEAD(&req_data->list);
@@ -93,12 +93,12 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 					GFP_ATOMIC)) == -ENOSPC) {
 
 		dev_info(&vdev->dev, "failed to send command to virtio pmem device, no free slots in the virtqueue\n");
-		req_data->wq_buf_avail = false;
+		WRITE_ONCE(req_data->wq_buf_avail, false);
 		list_add_tail(&req_data->list, &vpmem->req_list);
 		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
 
 		/* A host response results in "host_ack" getting called */
-		wait_event(req_data->wq_buf, req_data->wq_buf_avail);
+		wait_event(req_data->wq_buf, READ_ONCE(req_data->wq_buf_avail));
 		spin_lock_irqsave(&vpmem->pmem_lock, flags);
 	}
 	err1 = virtqueue_kick(vpmem->req_vq);
@@ -112,7 +112,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 		err = -EIO;
 	} else {
 		/* A host response results in "host_ack" getting called */
-		wait_event(req_data->host_acked, req_data->done);
+		wait_event(req_data->host_acked, READ_ONCE(req_data->done));
 		err = le32_to_cpu(req_data->resp.ret);
 	}
 
-- 
2.52.0

* [PATCH v7 08/12] nvdimm: virtio_pmem: refcount requests for token lifetime
  2026-06-30  9:23 [PATCH v7 00/12] nvdimm: virtio_pmem: fix flush/request failure paths Li Chen
                   ` (6 preceding siblings ...)
  2026-06-30  9:23 ` [PATCH v7 07/12] nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags Li Chen
@ 2026-06-30  9:23 ` Li Chen
  2026-06-30  9:23 ` [PATCH v7 09/12] nvdimm: virtio_pmem: publish done with release/acquire Li Chen
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Li Chen @ 2026-06-30  9:23 UTC (permalink / raw)
  To: Pankaj Gupta, Vishal Verma, Dave Jiang, Alison Schofield,
	virtualization, nvdimm
  Cc: linux-kernel, stable, Li Chen

KASAN reports slab-use-after-free in __wake_up_common():

BUG: KASAN: slab-use-after-free in __wake_up_common+0x114/0x160
Read of size 8 at addr ffff88810fdcb710 by task swapper/0/0

CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.19.0-next-20260220-00006-g1eae5f204ec3 #4 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014
Call Trace:
 <IRQ>
 dump_stack_lvl+0x6d/0xb0
 print_report+0x170/0x4e2
 ? __pfx__raw_spin_lock_irqsave+0x10/0x10
 ? __virt_addr_valid+0x1dc/0x380
 kasan_report+0xbc/0xf0
 ? __wake_up_common+0x114/0x160
 ? __wake_up_common+0x114/0x160
 __wake_up_common+0x114/0x160
 ? __pfx__raw_spin_lock_irqsave+0x10/0x10
 __wake_up+0x36/0x60
 virtio_pmem_host_ack+0x11d/0x3b0
 ? sched_balance_domains+0x29f/0xb00
 ? __pfx_virtio_pmem_host_ack+0x10/0x10
 ? _raw_spin_lock_irqsave+0x98/0x100
 ? __pfx__raw_spin_lock_irqsave+0x10/0x10
 vring_interrupt+0x1c9/0x5e0
 ? __pfx_vp_interrupt+0x10/0x10
 vp_vring_interrupt+0x87/0x100
 ? __pfx_vp_interrupt+0x10/0x10
 __handle_irq_event_percpu+0x17f/0x550
 ? __pfx__raw_spin_lock+0x10/0x10
 handle_irq_event+0xab/0x1c0
 handle_fasteoi_irq+0x276/0xae0
 __common_interrupt+0x65/0x130
 common_interrupt+0x78/0xa0
 </IRQ>

virtio_pmem_host_ack() wakes a request that has already been freed by the
submitter.

This happens when the request token is still reachable via the virtqueue,
but virtio_pmem_flush() returns and frees it.

Fix the token lifetime by refcounting struct virtio_pmem_request.
virtio_pmem_flush() holds a submitter reference, and the virtqueue holds an
extra reference once the request is queued. The completion path drops the
virtqueue reference, and the submitter drops its reference before
returning.
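
As an illustration only, the two-owner rule can be modelled in userspace
with a plain C11 atomic counter standing in for struct kref. This is a
sketch, not the driver code: lt_req, lt_req_alloc(), lt_req_get() and
lt_req_put() are hypothetical names, and the "freed" flag exists only so
the behaviour can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted request token. */
struct lt_req {
	atomic_int refs;
	bool *freed;	/* set when the last reference is dropped */
};

/* Allocate a token that starts with the submitter reference. */
static struct lt_req *lt_req_alloc(bool *freed)
{
	struct lt_req *req = malloc(sizeof(*req));

	if (!req)
		return NULL;
	atomic_init(&req->refs, 1);
	req->freed = freed;
	*freed = false;
	return req;
}

/* Take the extra "virtqueue" reference once the token is queued. */
static void lt_req_get(struct lt_req *req)
{
	int old = atomic_fetch_add(&req->refs, 1);

	assert(old > 0);	/* never revive a dead token */
}

/* Returns true when this call dropped the last reference and freed req. */
static bool lt_req_put(struct lt_req *req)
{
	bool *freed = req->freed;

	if (atomic_fetch_sub(&req->refs, 1) != 1)
		return false;
	free(req);
	*freed = true;
	return true;
}
```

With this shape, whichever side finishes last frees the token: if the
submitter returns first its put leaves the object alive for the
completion path, and the reverse order works the same way.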

Fixes: 6e84200c0a29 ("virtio-pmem: Add virtio pmem driver")
Cc: stable@vger.kernel.org
Signed-off-by: Li Chen <me@linux.beauty>
---
v2->v3:
- Add raw KASAN report to the patch description.
- Drop timestamps from the embedded report.
v3->v4:
- Rebased onto v7.1-rc7 and renumbered after the flush error patches.

 drivers/nvdimm/nd_virtio.c   | 34 +++++++++++++++++++++++++++++-----
 drivers/nvdimm/virtio_pmem.h |  2 ++
 2 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index 8c0d4347938a1..1cf53f75b1281 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -15,6 +15,14 @@ struct virtio_pmem_flush_work {
 	struct bio *bio;
 };
 
+static void virtio_pmem_req_release(struct kref *kref)
+{
+	struct virtio_pmem_request *req;
+
+	req = container_of(kref, struct virtio_pmem_request, kref);
+	kfree(req);
+}
+
 static void virtio_pmem_wake_one_waiter(struct virtio_pmem *vpmem)
 {
 	struct virtio_pmem_request *req_buf;
@@ -42,6 +50,7 @@ void virtio_pmem_host_ack(struct virtqueue *vq)
 		virtio_pmem_wake_one_waiter(vpmem);
 		WRITE_ONCE(req_data->done, true);
 		wake_up(&req_data->host_acked);
+		kref_put(&req_data->kref, virtio_pmem_req_release);
 	}
 	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
 }
@@ -72,6 +81,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 	if (!req_data)
 		return -ENOMEM;
 
+	kref_init(&req_data->kref);
 	WRITE_ONCE(req_data->done, false);
 	init_waitqueue_head(&req_data->host_acked);
 	init_waitqueue_head(&req_data->wq_buf);
@@ -89,10 +99,23 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 	  * to req_list and wait for host_ack to wake us up when free
 	  * slots are available.
 	  */
-	while ((err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req_data,
-					GFP_ATOMIC)) == -ENOSPC) {
-
-		dev_info(&vdev->dev, "failed to send command to virtio pmem device, no free slots in the virtqueue\n");
+	for (;;) {
+		err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req_data,
+					GFP_ATOMIC);
+		if (!err) {
+			/*
+			 * Take the virtqueue reference while @pmem_lock is
+			 * held so completion cannot run concurrently.
+			 */
+			kref_get(&req_data->kref);
+			break;
+		}
+
+		if (err != -ENOSPC)
+			break;
+
+		dev_info_ratelimited(&vdev->dev,
+				     "failed to send command to virtio pmem device, no free slots in the virtqueue\n");
 		WRITE_ONCE(req_data->wq_buf_avail, false);
 		list_add_tail(&req_data->list, &vpmem->req_list);
 		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
@@ -101,6 +124,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 		wait_event(req_data->wq_buf, READ_ONCE(req_data->wq_buf_avail));
 		spin_lock_irqsave(&vpmem->pmem_lock, flags);
 	}
+
 	err1 = virtqueue_kick(vpmem->req_vq);
 	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
 	/*
@@ -116,7 +140,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 		err = le32_to_cpu(req_data->resp.ret);
 	}
 
-	kfree(req_data);
+	kref_put(&req_data->kref, virtio_pmem_req_release);
 	return err;
 };
 
diff --git a/drivers/nvdimm/virtio_pmem.h b/drivers/nvdimm/virtio_pmem.h
index e6dfc10ce0762..3af92588bd9d1 100644
--- a/drivers/nvdimm/virtio_pmem.h
+++ b/drivers/nvdimm/virtio_pmem.h
@@ -12,12 +12,14 @@
 
 #include <linux/module.h>
 #include <uapi/linux/virtio_pmem.h>
+#include <linux/kref.h>
 #include <linux/libnvdimm.h>
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
 #include <linux/workqueue.h>
 
 struct virtio_pmem_request {
+	struct kref kref;
 	struct virtio_pmem_req req;
 	struct virtio_pmem_resp resp;
 
-- 
2.52.0

* [PATCH v7 09/12] nvdimm: virtio_pmem: publish done with release/acquire
  2026-06-30  9:23 [PATCH v7 00/12] nvdimm: virtio_pmem: fix flush/request failure paths Li Chen
                   ` (7 preceding siblings ...)
  2026-06-30  9:23 ` [PATCH v7 08/12] nvdimm: virtio_pmem: refcount requests for token lifetime Li Chen
@ 2026-06-30  9:23 ` Li Chen
  2026-06-30  9:23 ` [PATCH v7 10/12] nvdimm: virtio_pmem: isolate DMA request buffers Li Chen
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Li Chen @ 2026-06-30  9:23 UTC (permalink / raw)
  To: Pankaj Gupta, Vishal Verma, Dave Jiang, Alison Schofield,
	virtualization, nvdimm
  Cc: linux-kernel, Li Chen

virtio_pmem_host_ack() publishes the device response by setting done and
waking the submitter. The submitter reads resp.ret after wait_event()
observes done.

Use smp_store_release() on done and smp_load_acquire() in the wait
condition so the response read is ordered after completion.
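
A rough userspace analogue of that pairing, assuming C11 atomics in place
of smp_store_release()/smp_load_acquire(), might look like the sketch
below. The ra_* names are hypothetical, and unlike the driver (where the
device writes the response) the payload here is written by the
completing side itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical request: resp_ret is plain data, done is the flag. */
struct ra_req {
	uint32_t resp_ret;
	atomic_bool done;
};

static void ra_req_init(struct ra_req *req)
{
	req->resp_ret = 0;
	atomic_init(&req->done, false);
}

/* Completion side: write the payload, then publish it with release. */
static void ra_signal_done(struct ra_req *req, uint32_t ret)
{
	req->resp_ret = ret;
	atomic_store_explicit(&req->done, true, memory_order_release);
}

/* Submitter side: the acquire load pairs with the release store above. */
static bool ra_req_done(struct ra_req *req)
{
	return atomic_load_explicit(&req->done, memory_order_acquire);
}

/* Read the payload only after done has been observed. */
static uint32_t ra_read_ret(struct ra_req *req)
{
	assert(ra_req_done(req));
	return req->resp_ret;
}
```

A reader that sees done == true through the acquire load is guaranteed
to also see the payload written before the release store.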

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v6:
- New patch.

 drivers/nvdimm/nd_virtio.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index 1cf53f75b1281..e4e4284ae19e5 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -23,6 +23,19 @@ static void virtio_pmem_req_release(struct kref *kref)
 	kfree(req);
 }
 
+static void virtio_pmem_signal_done(struct virtio_pmem_request *req)
+{
+	/* Pairs with smp_load_acquire() in virtio_pmem_req_done(). */
+	smp_store_release(&req->done, true);
+	wake_up(&req->host_acked);
+}
+
+static bool virtio_pmem_req_done(struct virtio_pmem_request *req)
+{
+	/* Pairs with smp_store_release() in virtio_pmem_signal_done(). */
+	return smp_load_acquire(&req->done);
+}
+
 static void virtio_pmem_wake_one_waiter(struct virtio_pmem *vpmem)
 {
 	struct virtio_pmem_request *req_buf;
@@ -48,8 +61,7 @@ void virtio_pmem_host_ack(struct virtqueue *vq)
 	spin_lock_irqsave(&vpmem->pmem_lock, flags);
 	while ((req_data = virtqueue_get_buf(vq, &len)) != NULL) {
 		virtio_pmem_wake_one_waiter(vpmem);
-		WRITE_ONCE(req_data->done, true);
-		wake_up(&req_data->host_acked);
+		virtio_pmem_signal_done(req_data);
 		kref_put(&req_data->kref, virtio_pmem_req_release);
 	}
 	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
@@ -136,7 +148,8 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 		err = -EIO;
 	} else {
 		/* A host response results in "host_ack" getting called */
-		wait_event(req_data->host_acked, READ_ONCE(req_data->done));
+		wait_event(req_data->host_acked,
+			   virtio_pmem_req_done(req_data));
 		err = le32_to_cpu(req_data->resp.ret);
 	}
 
-- 
2.52.0

* [PATCH v7 10/12] nvdimm: virtio_pmem: isolate DMA request buffers
  2026-06-30  9:23 [PATCH v7 00/12] nvdimm: virtio_pmem: fix flush/request failure paths Li Chen
                   ` (8 preceding siblings ...)
  2026-06-30  9:23 ` [PATCH v7 09/12] nvdimm: virtio_pmem: publish done with release/acquire Li Chen
@ 2026-06-30  9:23 ` Li Chen
  2026-06-30  9:23 ` [PATCH v7 11/12] nvdimm: virtio_pmem: converge broken virtqueue to -EIO Li Chen
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Li Chen @ 2026-06-30  9:23 UTC (permalink / raw)
  To: Pankaj Gupta, Vishal Verma, Dave Jiang, Alison Schofield,
	virtualization, nvdimm
  Cc: linux-kernel, Li Chen

The virtio-pmem request object stores wait queues, flags, and list
pointers next to buffers mapped for virtqueue DMA. The response buffer is
mapped DMA_FROM_DEVICE, so non-coherent DMA invalidation must not share a
cache line with CPU-owned fields.

Keep the request buffer outside the DMA-from-device group and wrap only
the response buffer with __dma_from_device_group_begin/end.
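
The layout goal can be pictured with a userspace sketch. It does not
show how the kernel macros are implemented; it assumes a 64-byte cache
line and uses alignas plus explicit padding to get a similar effect. All
iso_* names and the field set are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define ISO_CACHE_LINE 64	/* assumed line size, illustration only */

struct iso_resp {
	uint32_t ret;
};

struct iso_request {
	/* CPU-owned bookkeeping. */
	int refs;
	int done;
	int wq_buf_avail;

	/* Device-readable request, kept outside the isolated group. */
	uint32_t req_type;

	/* Device-written response: starts on its own cache line... */
	alignas(ISO_CACHE_LINE) struct iso_resp resp;
	/* ...and padding keeps anything after it off that line. */
	char pad[ISO_CACHE_LINE - sizeof(struct iso_resp)];
};

static_assert(offsetof(struct iso_request, resp) % ISO_CACHE_LINE == 0,
	      "resp must start on a cache-line boundary");
static_assert(sizeof(struct iso_request) % ISO_CACHE_LINE == 0,
	      "struct must end on a cache-line boundary");
```

With the response alone on its line, invalidating that line for a
device write cannot discard CPU updates to the neighbouring fields.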

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v6:
- New patch.

 drivers/nvdimm/virtio_pmem.h | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/virtio_pmem.h b/drivers/nvdimm/virtio_pmem.h
index 3af92588bd9d1..8843a8b965874 100644
--- a/drivers/nvdimm/virtio_pmem.h
+++ b/drivers/nvdimm/virtio_pmem.h
@@ -10,6 +10,7 @@
 #ifndef _LINUX_VIRTIO_PMEM_H
 #define _LINUX_VIRTIO_PMEM_H
 
+#include <linux/dma-mapping.h>
 #include <linux/module.h>
 #include <uapi/linux/virtio_pmem.h>
 #include <linux/kref.h>
@@ -20,8 +21,6 @@
 
 struct virtio_pmem_request {
 	struct kref kref;
-	struct virtio_pmem_req req;
-	struct virtio_pmem_resp resp;
 
 	/* Wait queue to process deferred work after ack from host */
 	wait_queue_head_t host_acked;
@@ -31,6 +30,11 @@ struct virtio_pmem_request {
 	wait_queue_head_t wq_buf;
 	bool wq_buf_avail;
 	struct list_head list;
+
+	struct virtio_pmem_req req;
+	__dma_from_device_group_begin(resp);
+	struct virtio_pmem_resp resp;
+	__dma_from_device_group_end(resp);
 };
 
 struct virtio_pmem {
-- 
2.52.0

* [PATCH v7 11/12] nvdimm: virtio_pmem: converge broken virtqueue to -EIO
  2026-06-30  9:23 [PATCH v7 00/12] nvdimm: virtio_pmem: fix flush/request failure paths Li Chen
                   ` (9 preceding siblings ...)
  2026-06-30  9:23 ` [PATCH v7 10/12] nvdimm: virtio_pmem: isolate DMA request buffers Li Chen
@ 2026-06-30  9:23 ` Li Chen
  2026-06-30  9:23 ` [PATCH v7 12/12] nvdimm: virtio_pmem: drain requests in freeze Li Chen
  2026-06-30  9:47 ` [PATCH v7 00/12] nvdimm: virtio_pmem: fix flush/request failure paths Pankaj Gupta
  12 siblings, 0 replies; 14+ messages in thread
From: Li Chen @ 2026-06-30  9:23 UTC (permalink / raw)
  To: Pankaj Gupta, Vishal Verma, Dave Jiang, Alison Schofield,
	virtualization, nvdimm
  Cc: linux-kernel, Li Chen

dmesg reports virtqueue failure and device reset:

virtio_pmem virtio2: failed to send command to virtio pmem device, no free slots in the virtqueue
virtio_pmem virtio2: virtio pmem device needs a reset

virtio_pmem_flush() can wait for a free virtqueue descriptor (-ENOSPC).
It can also wait for host completion. If the request virtqueue breaks,
those waiters may never make progress. One example is notify failure from
virtqueue_kick().

Track a device-level broken state and converge the failure to -EIO. New
requests fail fast, -ENOSPC waiters are unlinked and woken, and the
currently submitted request is woken so its host_acked waiter can return
without waiting forever for host completion. Completed requests are forced
to report an error after the queue is marked broken.
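
The intended convergence can be summarized with a small single-threaded
model. This is a sketch under simplifying assumptions (no locking, no
wait queues); the bq_* names are hypothetical and only mirror the rules
stated above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct bq_dev {
	bool broken;
};

struct bq_req {
	bool done;
	uint32_t resp_ret;
};

/*
 * Completion path: a response that arrives after the queue was marked
 * broken is forced to a nonzero (error) status.
 */
static void bq_complete(const struct bq_dev *dev, struct bq_req *req,
			uint32_t host_ret)
{
	req->resp_ret = dev->broken ? 1 : host_ret;
	req->done = true;
}

/* Wait condition: leave the wait on completion or on broken state. */
static bool bq_wait_cond(const struct bq_dev *dev, const struct bq_req *req)
{
	return req->done || dev->broken;
}

/* After the wait: only a completed request reports its response. */
static int bq_flush_result(const struct bq_dev *dev, const struct bq_req *req)
{
	assert(bq_wait_cond(dev, req));
	return req->done ? (int)req->resp_ret : -EIO;
}
```

In this model a waiter released only by the broken flag returns -EIO,
while a request completed after the break returns the forced nonzero
status rather than whatever the host wrote.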

Also serialize async parent-bio flush work against the broken state with
pmem_lock. Remove and freeze then either drain work that was queued
before virtio_pmem_mark_broken(), or later callers fail with -EIO under
the lock and nvdimm_flush() completes the parent bio synchronously
instead of queuing work after the drain point.

Do not detach unused buffers from an active virtqueue. Runtime
broken-queue handling only stops new submissions and wakes local waiters.
Removal resets the device first. It then drains request tokens. After
that, the device no longer owns the buffers when the virtqueue reference
is dropped.

Closes: https://lore.kernel.org/r/202512250116.ewtzlD0g-lkp@intel.com/
Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v7:
- Serialize async parent-bio flush work against broken-state publication so
  remove/freeze cannot drain the workqueue before a racing FUA bio queues new
  completion work.
Changes in v6:
- Wake the in-flight host-completion waiter when marking the queue broken.
- Track req_inflight and clear it on completion/drain paths.
- Return -EIO if the queue breaks before a host response is observed.
Changes in v5:
- Split broken marking from token draining.
- Do not call virtqueue_detach_unused_buf() on an active queue.
- Reset the device before draining tokens in remove().
- Do not let the host-completion wait return only because the device is
  marked broken.
v2->v3:
- Add raw dmesg excerpt to the patch description.
- Drop timestamps from the embedded dmesg.
- Fold the CONFIG_VIRTIO_PMEM=m export fix into this patch.
v3->v4:
- Rebased onto v7.1-rc7 and renumbered after the flush error patches.
- Use kmalloc_obj(*req_data) at the allocation site to match current nvdimm
  code.

 drivers/nvdimm/nd_virtio.c   | 126 +++++++++++++++++++++++++++++++----
 drivers/nvdimm/virtio_pmem.c |  16 ++++-
 drivers/nvdimm/virtio_pmem.h |   8 +++
 3 files changed, 136 insertions(+), 14 deletions(-)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index e4e4284ae19e5..a6820300cbe8f 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -36,6 +36,12 @@ static bool virtio_pmem_req_done(struct virtio_pmem_request *req)
 	return smp_load_acquire(&req->done);
 }
 
+static void virtio_pmem_complete_err(struct virtio_pmem_request *req)
+{
+	req->resp.ret = cpu_to_le32(1);
+	virtio_pmem_signal_done(req);
+}
+
 static void virtio_pmem_wake_one_waiter(struct virtio_pmem *vpmem)
 {
 	struct virtio_pmem_request *req_buf;
@@ -50,6 +56,63 @@ static void virtio_pmem_wake_one_waiter(struct virtio_pmem *vpmem)
 	wake_up(&req_buf->wq_buf);
 }
 
+static void virtio_pmem_wake_all_waiters(struct virtio_pmem *vpmem)
+{
+	struct virtio_pmem_request *req, *tmp;
+
+	list_for_each_entry_safe(req, tmp, &vpmem->req_list, list) {
+		list_del_init(&req->list);
+		WRITE_ONCE(req->wq_buf_avail, true);
+		wake_up(&req->wq_buf);
+	}
+}
+
+static void virtio_pmem_clear_inflight(struct virtio_pmem *vpmem,
+				       struct virtio_pmem_request *req)
+{
+	if (vpmem->req_inflight == req)
+		vpmem->req_inflight = NULL;
+}
+
+static void virtio_pmem_wake_inflight(struct virtio_pmem *vpmem)
+{
+	struct virtio_pmem_request *req = vpmem->req_inflight;
+
+	if (req)
+		wake_up(&req->host_acked);
+}
+
+void virtio_pmem_mark_broken(struct virtio_pmem *vpmem)
+{
+	if (!READ_ONCE(vpmem->broken)) {
+		WRITE_ONCE(vpmem->broken, true);
+		dev_err_once(&vpmem->vdev->dev, "virtqueue is broken\n");
+	}
+
+	virtio_pmem_wake_inflight(vpmem);
+	virtio_pmem_wake_all_waiters(vpmem);
+}
+EXPORT_SYMBOL_GPL(virtio_pmem_mark_broken);
+
+void virtio_pmem_drain(struct virtio_pmem *vpmem)
+{
+	struct virtio_pmem_request *req;
+	unsigned int len;
+
+	while ((req = virtqueue_get_buf(vpmem->req_vq, &len)) != NULL) {
+		virtio_pmem_clear_inflight(vpmem, req);
+		virtio_pmem_complete_err(req);
+		kref_put(&req->kref, virtio_pmem_req_release);
+	}
+
+	while ((req = virtqueue_detach_unused_buf(vpmem->req_vq)) != NULL) {
+		virtio_pmem_clear_inflight(vpmem, req);
+		virtio_pmem_complete_err(req);
+		kref_put(&req->kref, virtio_pmem_req_release);
+	}
+}
+EXPORT_SYMBOL_GPL(virtio_pmem_drain);
+
  /* The interrupt handler */
 void virtio_pmem_host_ack(struct virtqueue *vq)
 {
@@ -60,8 +123,12 @@ void virtio_pmem_host_ack(struct virtqueue *vq)
 
 	spin_lock_irqsave(&vpmem->pmem_lock, flags);
 	while ((req_data = virtqueue_get_buf(vq, &len)) != NULL) {
+		virtio_pmem_clear_inflight(vpmem, req_data);
 		virtio_pmem_wake_one_waiter(vpmem);
-		virtio_pmem_signal_done(req_data);
+		if (READ_ONCE(vpmem->broken))
+			virtio_pmem_complete_err(req_data);
+		else
+			virtio_pmem_signal_done(req_data);
 		kref_put(&req_data->kref, virtio_pmem_req_release);
 	}
 	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
@@ -89,6 +156,9 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 		return -EIO;
 	}
 
+	if (READ_ONCE(vpmem->broken))
+		return -EIO;
+
 	req_data = kmalloc_obj(*req_data, GFP_NOIO);
 	if (!req_data)
 		return -ENOMEM;
@@ -105,13 +175,18 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 	sgs[1] = &ret;
 
 	spin_lock_irqsave(&vpmem->pmem_lock, flags);
-	 /*
-	  * If virtqueue_add_sgs returns -ENOSPC then req_vq virtual
-	  * queue does not have free descriptor. We add the request
-	  * to req_list and wait for host_ack to wake us up when free
-	  * slots are available.
-	  */
+	/*
+	 * If virtqueue_add_sgs returns -ENOSPC then req_vq virtual
+	 * queue does not have free descriptor. We add the request
+	 * to req_list and wait for host_ack to wake us up when free
+	 * slots are available.
+	 */
 	for (;;) {
+		if (READ_ONCE(vpmem->broken)) {
+			err = -EIO;
+			break;
+		}
+
 		err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req_data,
 					GFP_ATOMIC);
 		if (!err) {
@@ -120,6 +195,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 			 * held so completion cannot run concurrently.
 			 */
 			kref_get(&req_data->kref);
+			vpmem->req_inflight = req_data;
 			break;
 		}
 
@@ -133,24 +209,41 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
 
 		/* A host response results in "host_ack" getting called */
-		wait_event(req_data->wq_buf, READ_ONCE(req_data->wq_buf_avail));
+		wait_event(req_data->wq_buf,
+			   READ_ONCE(req_data->wq_buf_avail) ||
+			   READ_ONCE(vpmem->broken));
 		spin_lock_irqsave(&vpmem->pmem_lock, flags);
+
+		if (READ_ONCE(vpmem->broken))
+			break;
 	}
 
-	err1 = virtqueue_kick(vpmem->req_vq);
+	if (err == -EIO || virtqueue_is_broken(vpmem->req_vq))
+		virtio_pmem_mark_broken(vpmem);
+
+	err1 = true;
+	if (!err && !READ_ONCE(vpmem->broken)) {
+		err1 = virtqueue_kick(vpmem->req_vq);
+		if (!err1)
+			virtio_pmem_mark_broken(vpmem);
+	}
 	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
 	/*
 	 * virtqueue_add_sgs failed with error different than -ENOSPC, we can't
 	 * do anything about that.
 	 */
-	if (err || !err1) {
+	if (READ_ONCE(vpmem->broken) || err || !err1) {
 		dev_info(&vdev->dev, "failed to send command to virtio pmem device\n");
 		err = -EIO;
 	} else {
 		/* A host response results in "host_ack" getting called */
 		wait_event(req_data->host_acked,
-			   virtio_pmem_req_done(req_data));
-		err = le32_to_cpu(req_data->resp.ret);
+			   virtio_pmem_req_done(req_data) ||
+			   READ_ONCE(vpmem->broken));
+		if (virtio_pmem_req_done(req_data))
+			err = le32_to_cpu(req_data->resp.ret);
+		else
+			err = -EIO;
 	}
 
 	kref_put(&req_data->kref, virtio_pmem_req_release);
@@ -178,6 +271,7 @@ int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
 	struct virtio_device *vdev = nd_region->provider_data;
 	struct virtio_pmem *vpmem = vdev->priv;
 	struct virtio_pmem_flush_work *flush;
+	unsigned long flags;
 	int err;
 
 	if (bio && bio->bi_iter.bi_sector != -1) {
@@ -188,7 +282,15 @@ int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
 		INIT_WORK(&flush->work, virtio_pmem_flush_work);
 		flush->nd_region = nd_region;
 		flush->bio = bio;
+
+		spin_lock_irqsave(&vpmem->pmem_lock, flags);
+		if (READ_ONCE(vpmem->broken)) {
+			spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+			kfree(flush);
+			return -EIO;
+		}
 		queue_work(vpmem->flush_wq, &flush->work);
+		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
 		return NVDIMM_FLUSH_ASYNC;
 	}
 
diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
index 9cf822a6c0c38..36664a5ea25e3 100644
--- a/drivers/nvdimm/virtio_pmem.c
+++ b/drivers/nvdimm/virtio_pmem.c
@@ -25,6 +25,8 @@ static int init_vq(struct virtio_pmem *vpmem)
 
 	spin_lock_init(&vpmem->pmem_lock);
 	INIT_LIST_HEAD(&vpmem->req_list);
+	vpmem->req_inflight = NULL;
+	WRITE_ONCE(vpmem->broken, false);
 
 	return 0;
 };
@@ -148,11 +150,21 @@ static void virtio_pmem_remove(struct virtio_device *vdev)
 {
 	struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
 	struct virtio_pmem *vpmem = vdev->priv;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vpmem->pmem_lock, flags);
+	virtio_pmem_mark_broken(vpmem);
+	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
 
-	nvdimm_bus_unregister(nvdimm_bus);
 	drain_workqueue(vpmem->flush_wq);
-	vdev->config->del_vqs(vdev);
 	virtio_reset_device(vdev);
+
+	spin_lock_irqsave(&vpmem->pmem_lock, flags);
+	virtio_pmem_drain(vpmem);
+	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+	nvdimm_bus_unregister(nvdimm_bus);
+	vdev->config->del_vqs(vdev);
 	destroy_workqueue(vpmem->flush_wq);
 }
 
diff --git a/drivers/nvdimm/virtio_pmem.h b/drivers/nvdimm/virtio_pmem.h
index 8843a8b965874..0b90777d7658b 100644
--- a/drivers/nvdimm/virtio_pmem.h
+++ b/drivers/nvdimm/virtio_pmem.h
@@ -56,6 +56,12 @@ struct virtio_pmem {
 	/* List to store deferred work if virtqueue is full */
 	struct list_head req_list;
 
+	/* Request currently owned by the virtqueue. */
+	struct virtio_pmem_request *req_inflight;
+
+	/* Fail fast and wake waiters if the request virtqueue is broken. */
+	bool broken;
+
 	/* Synchronize virtqueue data */
 	spinlock_t pmem_lock;
 
@@ -65,5 +71,7 @@ struct virtio_pmem {
 };
 
 void virtio_pmem_host_ack(struct virtqueue *vq);
+void virtio_pmem_mark_broken(struct virtio_pmem *vpmem);
+void virtio_pmem_drain(struct virtio_pmem *vpmem);
 int async_pmem_flush(struct nd_region *nd_region, struct bio *bio);
 #endif
-- 
2.52.0

* [PATCH v7 12/12] nvdimm: virtio_pmem: drain requests in freeze
  2026-06-30  9:23 [PATCH v7 00/12] nvdimm: virtio_pmem: fix flush/request failure paths Li Chen
                   ` (10 preceding siblings ...)
  2026-06-30  9:23 ` [PATCH v7 11/12] nvdimm: virtio_pmem: converge broken virtqueue to -EIO Li Chen
@ 2026-06-30  9:23 ` Li Chen
  2026-06-30  9:47 ` [PATCH v7 00/12] nvdimm: virtio_pmem: fix flush/request failure paths Pankaj Gupta
  12 siblings, 0 replies; 14+ messages in thread
From: Li Chen @ 2026-06-30  9:23 UTC (permalink / raw)
  To: Pankaj Gupta, Vishal Verma, Dave Jiang, Alison Schofield,
	virtualization, nvdimm
  Cc: linux-kernel, Li Chen

virtio_pmem_freeze() currently deletes virtqueues and resets the device
without waking threads waiting for a virtqueue descriptor or a host
completion.

Mark the request virtqueue broken before reset. This makes new submissions
fail fast and lets -ENOSPC waiters leave the wait list. Reset the device
before draining used and unused request tokens, then delete the virtqueues.
This wakes waiters with -EIO. It also keeps the detach call on a quiesced
device.

Clear req_vq after del_vqs(). Make drain tolerate a NULL queue so remove
after freeze does not dereference a stale virtqueue pointer. Also make
virtio_pmem_flush() stop checking req_vq once the broken state is visible.
A waiter woken by freeze/remove can resume after del_vqs() has cleared
req_vq.
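
The NULL-tolerant teardown can be modelled in a few lines. This is a
hypothetical userspace sketch (td_* names, an integer in place of real
request tokens); the workqueue drain and device reset are elided and
only the ordering and the idempotent delete are shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct td_vq {
	int pending;	/* tokens still owned by the queue */
};

struct td_dev {
	struct td_vq *vq;
	bool broken;
	int del_calls;	/* how often the real delete ran */
};

/* Drain tolerates a queue that was already deleted. */
static int td_drain(struct td_dev *dev)
{
	int n;

	if (!dev->vq)
		return 0;
	n = dev->vq->pending;
	dev->vq->pending = 0;
	return n;
}

/* Delete once, then clear the pointer so later callers see NULL. */
static void td_del_vqs(struct td_dev *dev)
{
	if (!dev->vq)
		return;
	dev->del_calls++;
	dev->vq = NULL;
}

/* Shared by the freeze and remove paths in this model. */
static int td_teardown(struct td_dev *dev)
{
	int drained;

	dev->broken = true;	/* new submissions fail fast */
	/* workqueue drain and device reset would happen here */
	drained = td_drain(dev);
	td_del_vqs(dev);
	return drained;
}
```

Calling td_teardown() twice, as in freeze followed by remove, drains
tokens only the first time and never touches the stale queue pointer.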

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v7:
- Stop checking req_vq once the broken state is visible, so a waiter woken
  by freeze/remove does not dereference req_vq after del_vqs() clears it.
Changes in v6:
- Clear req_vq after del_vqs() and make drain tolerate a NULL queue.
Changes in v5:
- Reset the device before draining used and unused request tokens.
- Use the split broken-marking and post-reset drain helpers.
v2->v3:
- No change.
v3->v4:
- Rebased onto v7.1-rc7 and renumbered after the flush error patches.

 drivers/nvdimm/nd_virtio.c   |  5 +++++
 drivers/nvdimm/virtio_pmem.c | 34 +++++++++++++++++++++++++++++-----
 2 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index a6820300cbe8f..3b8be79a20a0f 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -99,6 +99,9 @@ void virtio_pmem_drain(struct virtio_pmem *vpmem)
 	struct virtio_pmem_request *req;
 	unsigned int len;
 
+	if (!vpmem->req_vq)
+		return;
+
 	while ((req = virtqueue_get_buf(vpmem->req_vq, &len)) != NULL) {
 		virtio_pmem_clear_inflight(vpmem, req);
 		virtio_pmem_complete_err(req);
@@ -218,6 +221,8 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
 			break;
 	}
 
+	if (READ_ONCE(vpmem->broken))
+		err = -EIO;
 	if (err == -EIO || virtqueue_is_broken(vpmem->req_vq))
 		virtio_pmem_mark_broken(vpmem);
 
diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
index 36664a5ea25e3..7ee3fb1779f73 100644
--- a/drivers/nvdimm/virtio_pmem.c
+++ b/drivers/nvdimm/virtio_pmem.c
@@ -17,11 +17,16 @@ static struct virtio_device_id id_table[] = {
  /* Initialize virt queue */
 static int init_vq(struct virtio_pmem *vpmem)
 {
+	int err;
+
 	/* single vq */
 	vpmem->req_vq = virtio_find_single_vq(vpmem->vdev,
 					virtio_pmem_host_ack, "flush_queue");
-	if (IS_ERR(vpmem->req_vq))
-		return PTR_ERR(vpmem->req_vq);
+	if (IS_ERR(vpmem->req_vq)) {
+		err = PTR_ERR(vpmem->req_vq);
+		vpmem->req_vq = NULL;
+		return err;
+	}
 
 	spin_lock_init(&vpmem->pmem_lock);
 	INIT_LIST_HEAD(&vpmem->req_list);
@@ -31,6 +36,15 @@ static int init_vq(struct virtio_pmem *vpmem)
 	return 0;
 };
 
+static void virtio_pmem_del_vqs(struct virtio_pmem *vpmem)
+{
+	if (!vpmem->req_vq)
+		return;
+
+	vpmem->vdev->config->del_vqs(vpmem->vdev);
+	vpmem->req_vq = NULL;
+}
+
 static int virtio_pmem_validate(struct virtio_device *vdev)
 {
 	struct virtio_shm_region shm_reg;
@@ -139,7 +153,7 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
 	virtio_reset_device(vdev);
 	nvdimm_bus_unregister(vpmem->nvdimm_bus);
 out_vq:
-	vdev->config->del_vqs(vdev);
+	virtio_pmem_del_vqs(vpmem);
 out_wq:
 	destroy_workqueue(vpmem->flush_wq);
 out_err:
@@ -164,18 +178,28 @@ static void virtio_pmem_remove(struct virtio_device *vdev)
 	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
 
 	nvdimm_bus_unregister(nvdimm_bus);
-	vdev->config->del_vqs(vdev);
+	virtio_pmem_del_vqs(vpmem);
 	destroy_workqueue(vpmem->flush_wq);
 }
 
 static int virtio_pmem_freeze(struct virtio_device *vdev)
 {
 	struct virtio_pmem *vpmem = vdev->priv;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vpmem->pmem_lock, flags);
+	virtio_pmem_mark_broken(vpmem);
+	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
 
 	drain_workqueue(vpmem->flush_wq);
-	vdev->config->del_vqs(vdev);
 	virtio_reset_device(vdev);
 
+	spin_lock_irqsave(&vpmem->pmem_lock, flags);
+	virtio_pmem_drain(vpmem);
+	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+	virtio_pmem_del_vqs(vpmem);
+
 	return 0;
 }
 
-- 
2.52.0

* Re: [PATCH v7 00/12] nvdimm: virtio_pmem: fix flush/request failure paths
  2026-06-30  9:23 [PATCH v7 00/12] nvdimm: virtio_pmem: fix flush/request failure paths Li Chen
                   ` (11 preceding siblings ...)
  2026-06-30  9:23 ` [PATCH v7 12/12] nvdimm: virtio_pmem: drain requests in freeze Li Chen
@ 2026-06-30  9:47 ` Pankaj Gupta
  12 siblings, 0 replies; 14+ messages in thread
From: Pankaj Gupta @ 2026-06-30  9:47 UTC (permalink / raw)
  To: Li Chen
  Cc: Vishal Verma, Dave Jiang, Alison Schofield, virtualization,
	nvdimm, linux-kernel, Michael S . Tsirkin, Dan Williams

+CC Dan's correct email address and MST's email.

> Hi,
>
> This series started as a virtio-pmem request lifetime and broken virtqueue
> fix, but the rerolls have picked up several related flush-path fixes found
> during local testing and review. Since the series is now broader than the
> original lifetime bug, this cover letter calls out where the patches came
> from.
>
> The nvdimm flush helper maps provider flush failures to -EIO. That should
> remain the default for provider/backend failures because host-side errors are
> still best reported as generic I/O errors to the guest. However, virtio-pmem
> may also fail a guest-local flush request allocation with -ENOMEM before any
> request is submitted to the host. Reporting that resource failure as -EIO
> makes memory pressure look like media failure.
>
> The raw failure seen in the local mkfs sanity test was:
>
>   wipefs: /dev/pmem0: cannot flush modified buffers: Input/output error
>   mkfs.ext4: Input/output error while writing out and closing file system
>   nd_region region0: dbg: nvdimm_flush rc=-5
>
> Patch 1 comes from that local failure, with the error policy narrowed after
> Pankaj pointed out that host/backend provider errors should not all be exposed
> directly to the guest. It now preserves only -ENOMEM and keeps other provider
> flush failures mapped to -EIO.
>
> Patches 2 and 3 come from review of the pmem flush path. Patch 2 keeps a
> failed REQ_PREFLUSH from being overwritten after data copy, and patch 3 is the
> dataless-bio guard added after the Sashiko review. Patch 4 comes from the
> local child flush bio allocation failure, but v7 reworks the v6 synchronous
> FUA approach after Pankaj noted that the old child flush bio path completed
> asynchronously. This version removes the child bio while keeping parent bio
> completion asynchronous: the provider returns NVDIMM_FLUSH_ASYNC, queues
> ordered WQ_MEM_RECLAIM work, and completes the parent bio after
> virtio_pmem_flush() finishes. Patch 5 is the remaining allocation-policy
> follow-up for the actual virtio-pmem flush request object, not for a child
> bio.
>
> Patches 6 and 7 are the older waiter fixes. Patch 6 wakes one -ENOSPC waiter
> for each reclaimed used buffer, and patch 7 makes the wait flags explicit
> READ_ONCE()/WRITE_ONCE() accesses. Pankaj asked for those changes to be split
> across patches, and patch 7 carries his Acked-by.
>
> Patch 8 is the original KASAN use-after-free fix for the request token
> lifetime. Patches 9 and 10 are follow-up hardening in the same completion
> path: order response publication before the submitter reads resp.ret, and keep
> the DMA_FROM_DEVICE response buffer away from CPU-owned request fields. Patch
> 11 addresses the broken virtqueue / notify failure path reported by LKP and
> reproduced locally with fault injection. It also serializes async parent-bio
> flush work against broken-state publication, so remove/freeze cannot drain the
> workqueue before a racing FUA bio queues new completion work. Patch 12 handles
> teardown: it drains requests across freeze/remove and also addresses the
> Sashiko-reported req_vq-after-free/NULL-deref class by clearing req_vq after
> del_vqs() and making the drain helper tolerate a NULL queue. It also stops the
> submit path from checking req_vq after the broken state is visible.
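
The ordering requirement behind patch 9 (publish the response before the submitter reads resp.ret) can be sketched with C11 atomics as a userspace analogue. This is a sketch under stated assumptions, not the driver code: `struct demo_req` and the `demo_*` helpers are hypothetical, and kernel code would use the kernel's own release/acquire primitives rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request object: a result field plus a publication flag. */
struct demo_req {
	int ret;		/* result, written by the completion side */
	atomic_bool done;	/* set last, with release semantics */
};

static void demo_req_init(struct demo_req *req)
{
	req->ret = 0;
	atomic_init(&req->done, false);
}

/* Completion side: write the result first, then publish it. */
static void demo_complete(struct demo_req *req, int ret)
{
	req->ret = ret;
	/* Release: the store to ret above is visible before done reads true. */
	atomic_store_explicit(&req->done, true, memory_order_release);
}

/* Submitter side: read the result only after observing done. */
static bool demo_try_read(struct demo_req *req, int *ret)
{
	assert(ret != NULL);

	/* Acquire: pairs with the release store in demo_complete(). */
	if (!atomic_load_explicit(&req->done, memory_order_acquire))
		return false;	/* not published yet, result must not be read */

	*ret = req->ret;
	return true;
}
```

In a real driver the two sides run in different contexts (completion callback and submitter); the sketch only shows which store and which load carry the ordering.
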
>
> The original repros were on QEMU x86_64 with a virtio-pmem device exported
> as /dev/pmem0. For this v7 reroll, the series applies to v7.1-rc7.
>
> Thanks,
> Li Chen
>
> Changelog:
> v6->v7:
> - Address Pankaj's feedback on nvdimm_flush() error policy.
> - Preserve only -ENOMEM from provider flush callbacks and continue to map
>   other provider/backend failures to -EIO.
> - Address Pankaj's feedback on the FUA flush behavior: replace the v6
>   synchronous FUA path with provider-owned asynchronous parent bio completion.
> - Add NVDIMM_FLUSH_ASYNC and use ordered WQ_MEM_RECLAIM work to run
>   virtio_pmem_flush() and complete the parent bio after the host flush.
> - Keep GFP_NOIO for the virtio-pmem request allocation, but no longer describe
>   it as a child bio allocation fix.
> - Add Pankaj's Acked-by on the READ_ONCE()/WRITE_ONCE() patch.
> - Serialize async parent-bio flush work against broken-state publication in
>   the broken-virtqueue patch, so remove/freeze cannot drain the workqueue
>   before a racing FUA bio queues new completion work.
> - Fold the Sashiko-reported req_vq NULL-deref fix into the freeze/remove
>   drain patch.
> - Update commit messages and this cover letter to describe patch origins.
> v5->v6:
> - Address Sashiko review feedback:
>   - Add a data-loop guard for dataless bios in pmem_submit_bio().
>   - Replace the child flush bio allocation with synchronous FUA flushing.
>   - Keep GFP_NOIO only for the virtio-pmem request allocation.
>   - Publish request completion with release/acquire ordering.
>   - Isolate the DMA_FROM_DEVICE response buffer from CPU-owned fields.
>   - Wake the in-flight host-completion waiter when marking the queue broken.
> - Clear req_vq after del_vqs() and make drain tolerate a NULL queue.
> v4->v5:
> - Address review feedback about REQ_PREFLUSH ordering and active virtqueue
>   detach.
> - Add 2/8 so a failed REQ_PREFLUSH fails the bio before any data copy, and
>   make REQ_PREFLUSH use a synchronous provider flush instead of a deferred
>   child bio.
> - Rework broken-queue handling so runtime failure marking only stops new
>   submissions and wakes local -ENOSPC waiters; used/unused token draining is
>   done after device reset in remove() and freeze().
> - Remove the broken-state shortcut from the host-completion wait so the
>   submitter never reads an uninitialized response field.
> - Keep the raw broken-virtqueue dmesg in 7/8 while updating the teardown
>   rationale.
> - Renumber the old virtio-pmem fixes after the new pmem PREFLUSH patch.
> v3->v4:
> - Rebased the series onto v7.1-rc7 so it applies cleanly to Linux 7.1-rc7.
> - Update the allocation site in 6/7 from kmalloc(sizeof(*req_data),
>   GFP_KERNEL) to kmalloc_obj(*req_data) to match current nvdimm code.
> - Add 1/7 to preserve provider flush callback errors in nvdimm_flush().
> - Include the GFP_NOIO child flush bio allocation fix as 2/7.
> - Renumber the old request lifetime and broken virtqueue fixes after the two
>   new flush error patches.
> v2->v3:
> - Split patch 1 as suggested by Pankaj Gupta: keep the waiter wakeup
>   ordering change in 1/5 and move READ_ONCE()/WRITE_ONCE() updates to
>   2/5 (no functional change intended).
> - Add the log report to the commit message.
> - Fold the export fix into 4/5 to keep the series bisectable when
>   CONFIG_VIRTIO_PMEM=m.
> v1->v2:
> - Add the export patch to fix a compile issue.
>
> Links:
> v6: https://lore.kernel.org/all/20260621130246.2973254-1-me@linux.beauty/
> v5: https://lore.kernel.org/all/20260617122442.2118957-1-me@linux.beauty/
> v4: https://lore.kernel.org/all/20260609120726.1714780-1-me@linux.beauty/
> v3: https://lore.kernel.org/all/20260226025712.2236279-1-me@linux.beauty/#t
> v2: https://lore.kernel.org/all/20251225042915.334117-1-me@linux.beauty/
> v1: https://www.spinics.net/lists/kernel/msg5974818.html
>
> Li Chen (12):
>   nvdimm: preserve flush callback -ENOMEM
>   nvdimm: pmem: keep PREFLUSH before data writes
>   nvdimm: pmem: guard data loop for dataless bios
>   nvdimm: virtio_pmem: stop allocating child flush bio
>   nvdimm: virtio_pmem: use GFP_NOIO for flush requests
>   nvdimm: virtio_pmem: always wake -ENOSPC waiters
>   nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags
>   nvdimm: virtio_pmem: refcount requests for token lifetime
>   nvdimm: virtio_pmem: publish done with release/acquire
>   nvdimm: virtio_pmem: isolate DMA request buffers
>   nvdimm: virtio_pmem: converge broken virtqueue to -EIO
>   nvdimm: virtio_pmem: drain requests in freeze
>
>  drivers/nvdimm/nd_virtio.c   | 265 +++++++++++++++++++++++++++++------
>  drivers/nvdimm/pmem.c        |  51 ++++---
>  drivers/nvdimm/region_devs.c |   5 +-
>  drivers/nvdimm/virtio_pmem.c |  65 ++++++++-
>  drivers/nvdimm/virtio_pmem.h |  22 ++-
>  include/linux/libnvdimm.h    |   9 ++
>  6 files changed, 343 insertions(+), 74 deletions(-)
>
> --
> 2.52.0


end of thread, other threads:[~2026-06-30  9:47 UTC | newest]

Thread overview: 14+ messages
2026-06-30  9:23 [PATCH v7 00/12] nvdimm: virtio_pmem: fix flush/request failure paths Li Chen
2026-06-30  9:23 ` [PATCH v7 01/12] nvdimm: preserve flush callback -ENOMEM Li Chen
2026-06-30  9:23 ` [PATCH v7 02/12] nvdimm: pmem: keep PREFLUSH before data writes Li Chen
2026-06-30  9:23 ` [PATCH v7 03/12] nvdimm: pmem: guard data loop for dataless bios Li Chen
2026-06-30  9:23 ` [PATCH v7 04/12] nvdimm: virtio_pmem: stop allocating child flush bio Li Chen
2026-06-30  9:23 ` [PATCH v7 05/12] nvdimm: virtio_pmem: use GFP_NOIO for flush requests Li Chen
2026-06-30  9:23 ` [PATCH v7 06/12] nvdimm: virtio_pmem: always wake -ENOSPC waiters Li Chen
2026-06-30  9:23 ` [PATCH v7 07/12] nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags Li Chen
2026-06-30  9:23 ` [PATCH v7 08/12] nvdimm: virtio_pmem: refcount requests for token lifetime Li Chen
2026-06-30  9:23 ` [PATCH v7 09/12] nvdimm: virtio_pmem: publish done with release/acquire Li Chen
2026-06-30  9:23 ` [PATCH v7 10/12] nvdimm: virtio_pmem: isolate DMA request buffers Li Chen
2026-06-30  9:23 ` [PATCH v7 11/12] nvdimm: virtio_pmem: converge broken virtqueue to -EIO Li Chen
2026-06-30  9:23 ` [PATCH v7 12/12] nvdimm: virtio_pmem: drain requests in freeze Li Chen
2026-06-30  9:47 ` [PATCH v7 00/12] nvdimm: virtio_pmem: fix flush/request failure paths Pankaj Gupta
