* [PATCH v4 1/7] nvdimm: preserve flush callback errors
2026-06-09 12:07 [PATCH v4 0/7] nvdimm: virtio_pmem: fix request lifetime and converge broken queue failures Li Chen
@ 2026-06-09 12:07 ` Li Chen
2026-06-09 12:07 ` [PATCH v4 2/7] nvdimm: virtio_pmem: use GFP_NOIO for child flush bio Li Chen
` (6 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Li Chen @ 2026-06-09 12:07 UTC (permalink / raw)
To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
Alison Schofield, virtualization, nvdimm
Cc: linux-kernel, Li Chen
nvdimm_flush() currently converts any non-zero provider flush error to
-EIO. That loses useful errno values from provider callbacks.
A local virtio-pmem mkfs sanity test showed the masking clearly:
wipefs: /dev/pmem0: cannot flush modified buffers: Input/output error
mkfs.ext4: Input/output error while writing out and closing file system
nd_region region0: dbg: nvdimm_flush rc=-5
The virtio-pmem callback can return -ENOMEM when async_pmem_flush() fails
to allocate a child flush bio, but nvdimm_flush() hides that as -EIO before
pmem_submit_bio() converts it to a block status.
Return the provider callback error directly. The generic flush path still
returns 0, and pmem_submit_bio() already handles errno-to-blk_status
conversion for bio completion.
Signed-off-by: Li Chen <me@linux.beauty>
---
v3->v4:
- New patch.
drivers/nvdimm/region_devs.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index e35c2e18518f0..0cd96503c0596 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1114,10 +1114,8 @@ int nvdimm_flush(struct nd_region *nd_region, struct bio *bio)
if (!nd_region->flush)
rc = generic_nvdimm_flush(nd_region);
- else {
- if (nd_region->flush(nd_region, bio))
- rc = -EIO;
- }
+ else
+ rc = nd_region->flush(nd_region, bio);
return rc;
}
--
2.52.0
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH v4 2/7] nvdimm: virtio_pmem: use GFP_NOIO for child flush bio
2026-06-09 12:07 [PATCH v4 0/7] nvdimm: virtio_pmem: fix request lifetime and converge broken queue failures Li Chen
2026-06-09 12:07 ` [PATCH v4 1/7] nvdimm: preserve flush callback errors Li Chen
@ 2026-06-09 12:07 ` Li Chen
2026-06-09 12:07 ` [PATCH v4 3/7] nvdimm: virtio_pmem: always wake -ENOSPC waiters Li Chen
` (5 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Li Chen @ 2026-06-09 12:07 UTC (permalink / raw)
To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
Alison Schofield, virtualization, nvdimm
Cc: linux-kernel, Li Chen
async_pmem_flush() can allocate a child flush bio from filesystem flush
and writeback paths. GFP_ATOMIC is unnecessarily restrictive there and can
make the allocation fail under pressure, which then propagates -ENOMEM to
the flush caller.
A local virtio-pmem mkfs sanity test hit a flush failure before this
change:
wipefs: /dev/pmem0: cannot flush modified buffers: Input/output error
mkfs.ext4: Input/output error while writing out and closing file system
nd_region region0: dbg: nvdimm_flush rc=-5
The debug log showed async_pmem_flush() was entered and nvdimm_flush()
returned -EIO. With GFP_NOIO, the same test reached mkfs_rc=0, mount_rc=0,
and umount_rc=0.
Use GFP_NOIO instead. The path may sleep, but it must not recurse into
filesystem I/O reclaim while it is already servicing a flush request.
Signed-off-by: Li Chen <me@linux.beauty>
---
v3->v4:
- New patch.
drivers/nvdimm/nd_virtio.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index 4176046627beb..081370aac6317 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -117,7 +117,7 @@ int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
if (bio && bio->bi_iter.bi_sector != -1) {
struct bio *child = bio_alloc(bio->bi_bdev, 0,
REQ_OP_WRITE | REQ_PREFLUSH,
- GFP_ATOMIC);
+ GFP_NOIO);
if (!child)
return -ENOMEM;
--
2.52.0
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH v4 3/7] nvdimm: virtio_pmem: always wake -ENOSPC waiters
2026-06-09 12:07 [PATCH v4 0/7] nvdimm: virtio_pmem: fix request lifetime and converge broken queue failures Li Chen
2026-06-09 12:07 ` [PATCH v4 1/7] nvdimm: preserve flush callback errors Li Chen
2026-06-09 12:07 ` [PATCH v4 2/7] nvdimm: virtio_pmem: use GFP_NOIO for child flush bio Li Chen
@ 2026-06-09 12:07 ` Li Chen
2026-06-09 12:07 ` [PATCH v4 4/7] nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags Li Chen
` (4 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Li Chen @ 2026-06-09 12:07 UTC (permalink / raw)
To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
Alison Schofield, virtualization, nvdimm
Cc: linux-kernel, Li Chen
virtio_pmem_host_ack() reclaims virtqueue descriptors with
virtqueue_get_buf(). The -ENOSPC waiter wakeup is tied to completing the
returned token. If token completion is skipped for any reason, reclaimed
descriptors may not wake a waiter and the submitter may sleep forever
waiting for a free slot. Always wake one -ENOSPC waiter for each virtqueue
completion before touching the returned token.
Signed-off-by: Li Chen <me@linux.beauty>
---
v2->v3:
- Split out the waiter wakeup ordering change from READ_ONCE()/WRITE_ONCE()
updates (now patch 4/7), per Pankaj's suggestion.
v3->v4:
- Rebased onto v7.1-rc7 and renumbered after the flush error patches.
drivers/nvdimm/nd_virtio.c | 25 ++++++++++++++++---------
1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index 081370aac6317..16ee5a47b9938 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -9,26 +9,33 @@
#include "virtio_pmem.h"
#include "nd.h"
+static void virtio_pmem_wake_one_waiter(struct virtio_pmem *vpmem)
+{
+ struct virtio_pmem_request *req_buf;
+
+ if (list_empty(&vpmem->req_list))
+ return;
+
+ req_buf = list_first_entry(&vpmem->req_list,
+ struct virtio_pmem_request, list);
+ req_buf->wq_buf_avail = true;
+ wake_up(&req_buf->wq_buf);
+ list_del(&req_buf->list);
+}
+
/* The interrupt handler */
void virtio_pmem_host_ack(struct virtqueue *vq)
{
struct virtio_pmem *vpmem = vq->vdev->priv;
- struct virtio_pmem_request *req_data, *req_buf;
+ struct virtio_pmem_request *req_data;
unsigned long flags;
unsigned int len;
spin_lock_irqsave(&vpmem->pmem_lock, flags);
while ((req_data = virtqueue_get_buf(vq, &len)) != NULL) {
+ virtio_pmem_wake_one_waiter(vpmem);
req_data->done = true;
wake_up(&req_data->host_acked);
-
- if (!list_empty(&vpmem->req_list)) {
- req_buf = list_first_entry(&vpmem->req_list,
- struct virtio_pmem_request, list);
- req_buf->wq_buf_avail = true;
- wake_up(&req_buf->wq_buf);
- list_del(&req_buf->list);
- }
}
spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
}
--
2.52.0
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH v4 4/7] nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags
2026-06-09 12:07 [PATCH v4 0/7] nvdimm: virtio_pmem: fix request lifetime and converge broken queue failures Li Chen
` (2 preceding siblings ...)
2026-06-09 12:07 ` [PATCH v4 3/7] nvdimm: virtio_pmem: always wake -ENOSPC waiters Li Chen
@ 2026-06-09 12:07 ` Li Chen
2026-06-09 12:07 ` [PATCH v4 5/7] nvdimm: virtio_pmem: refcount requests for token lifetime Li Chen
` (3 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Li Chen @ 2026-06-09 12:07 UTC (permalink / raw)
To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
Alison Schofield, virtualization, nvdimm
Cc: linux-kernel, Li Chen
Use READ_ONCE()/WRITE_ONCE() for the wait_event() flags (done and
wq_buf_avail). They are observed by waiters without pmem_lock, so make
the accesses explicit single loads/stores and avoid compiler
reordering/caching across the wait/wake paths.
Signed-off-by: Li Chen <me@linux.beauty>
---
v2->v3:
- Split out READ_ONCE()/WRITE_ONCE() updates from patch 3/7 (no functional
change intended).
v3->v4:
- Rebased onto v7.1-rc7 and renumbered after the flush error patches.
drivers/nvdimm/nd_virtio.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index 16ee5a47b9938..f8c0604edde51 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -18,9 +18,9 @@ static void virtio_pmem_wake_one_waiter(struct virtio_pmem *vpmem)
req_buf = list_first_entry(&vpmem->req_list,
struct virtio_pmem_request, list);
- req_buf->wq_buf_avail = true;
+ list_del_init(&req_buf->list);
+ WRITE_ONCE(req_buf->wq_buf_avail, true);
wake_up(&req_buf->wq_buf);
- list_del(&req_buf->list);
}
/* The interrupt handler */
@@ -34,7 +34,7 @@ void virtio_pmem_host_ack(struct virtqueue *vq)
spin_lock_irqsave(&vpmem->pmem_lock, flags);
while ((req_data = virtqueue_get_buf(vq, &len)) != NULL) {
virtio_pmem_wake_one_waiter(vpmem);
- req_data->done = true;
+ WRITE_ONCE(req_data->done, true);
wake_up(&req_data->host_acked);
}
spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
@@ -66,7 +66,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
if (!req_data)
return -ENOMEM;
- req_data->done = false;
+ WRITE_ONCE(req_data->done, false);
init_waitqueue_head(&req_data->host_acked);
init_waitqueue_head(&req_data->wq_buf);
INIT_LIST_HEAD(&req_data->list);
@@ -87,12 +87,12 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
GFP_ATOMIC)) == -ENOSPC) {
dev_info(&vdev->dev, "failed to send command to virtio pmem device, no free slots in the virtqueue\n");
- req_data->wq_buf_avail = false;
+ WRITE_ONCE(req_data->wq_buf_avail, false);
list_add_tail(&req_data->list, &vpmem->req_list);
spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
/* A host response results in "host_ack" getting called */
- wait_event(req_data->wq_buf, req_data->wq_buf_avail);
+ wait_event(req_data->wq_buf, READ_ONCE(req_data->wq_buf_avail));
spin_lock_irqsave(&vpmem->pmem_lock, flags);
}
err1 = virtqueue_kick(vpmem->req_vq);
@@ -106,7 +106,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
err = -EIO;
} else {
/* A host response results in "host_ack" getting called */
- wait_event(req_data->host_acked, req_data->done);
+ wait_event(req_data->host_acked, READ_ONCE(req_data->done));
err = le32_to_cpu(req_data->resp.ret);
}
--
2.52.0
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH v4 5/7] nvdimm: virtio_pmem: refcount requests for token lifetime
2026-06-09 12:07 [PATCH v4 0/7] nvdimm: virtio_pmem: fix request lifetime and converge broken queue failures Li Chen
` (3 preceding siblings ...)
2026-06-09 12:07 ` [PATCH v4 4/7] nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags Li Chen
@ 2026-06-09 12:07 ` Li Chen
2026-06-09 12:07 ` [PATCH v4 6/7] nvdimm: virtio_pmem: converge broken virtqueue to -EIO Li Chen
` (2 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Li Chen @ 2026-06-09 12:07 UTC (permalink / raw)
To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
Alison Schofield, virtualization, nvdimm
Cc: linux-kernel, stable, Li Chen
KASAN reports slab-use-after-free in __wake_up_common():
BUG: KASAN: slab-use-after-free in __wake_up_common+0x114/0x160
Read of size 8 at addr ffff88810fdcb710 by task swapper/0/0
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted
6.19.0-next-20260220-00006-g1eae5f204ec3 #4 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux
1.17.0-2-2 04/01/2014
Call Trace:
<IRQ>
dump_stack_lvl+0x6d/0xb0
print_report+0x170/0x4e2
? __pfx__raw_spin_lock_irqsave+0x10/0x10
? __virt_addr_valid+0x1dc/0x380
kasan_report+0xbc/0xf0
? __wake_up_common+0x114/0x160
? __wake_up_common+0x114/0x160
__wake_up_common+0x114/0x160
? __pfx__raw_spin_lock_irqsave+0x10/0x10
__wake_up+0x36/0x60
virtio_pmem_host_ack+0x11d/0x3b0
? sched_balance_domains+0x29f/0xb00
? __pfx_virtio_pmem_host_ack+0x10/0x10
? _raw_spin_lock_irqsave+0x98/0x100
? __pfx__raw_spin_lock_irqsave+0x10/0x10
vring_interrupt+0x1c9/0x5e0
? __pfx_vp_interrupt+0x10/0x10
vp_vring_interrupt+0x87/0x100
? __pfx_vp_interrupt+0x10/0x10
__handle_irq_event_percpu+0x17f/0x550
? __pfx__raw_spin_lock+0x10/0x10
handle_irq_event+0xab/0x1c0
handle_fasteoi_irq+0x276/0xae0
__common_interrupt+0x65/0x130
common_interrupt+0x78/0xa0
</IRQ>
virtio_pmem_host_ack() wakes a request that has already been freed by the
submitter.
This happens when the request token is still reachable via the virtqueue,
but virtio_pmem_flush() returns and frees it.
Fix the token lifetime by refcounting struct virtio_pmem_request.
virtio_pmem_flush() holds a submitter reference, and the virtqueue holds an
extra reference once the request is queued. The completion path drops the
virtqueue reference, and the submitter drops its reference before
returning.
Fixes: 6e84200c0a29 ("virtio-pmem: Add virtio pmem driver")
Cc: stable@vger.kernel.org
Signed-off-by: Li Chen <me@linux.beauty>
---
v2->v3:
- Add raw KASAN report to the patch description.
- Drop timestamps from the embedded report.
v3->v4:
- Rebased onto v7.1-rc7 and renumbered after the flush error patches.
drivers/nvdimm/nd_virtio.c | 34 +++++++++++++++++++++++++++++-----
drivers/nvdimm/virtio_pmem.h | 2 ++
2 files changed, 31 insertions(+), 5 deletions(-)
diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index f8c0604edde51..f5264f6afe44f 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -9,6 +9,14 @@
#include "virtio_pmem.h"
#include "nd.h"
+static void virtio_pmem_req_release(struct kref *kref)
+{
+ struct virtio_pmem_request *req;
+
+ req = container_of(kref, struct virtio_pmem_request, kref);
+ kfree(req);
+}
+
static void virtio_pmem_wake_one_waiter(struct virtio_pmem *vpmem)
{
struct virtio_pmem_request *req_buf;
@@ -36,6 +44,7 @@ void virtio_pmem_host_ack(struct virtqueue *vq)
virtio_pmem_wake_one_waiter(vpmem);
WRITE_ONCE(req_data->done, true);
wake_up(&req_data->host_acked);
+ kref_put(&req_data->kref, virtio_pmem_req_release);
}
spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
}
@@ -66,6 +75,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
if (!req_data)
return -ENOMEM;
+ kref_init(&req_data->kref);
WRITE_ONCE(req_data->done, false);
init_waitqueue_head(&req_data->host_acked);
init_waitqueue_head(&req_data->wq_buf);
@@ -83,10 +93,23 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
* to req_list and wait for host_ack to wake us up when free
* slots are available.
*/
- while ((err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req_data,
- GFP_ATOMIC)) == -ENOSPC) {
-
- dev_info(&vdev->dev, "failed to send command to virtio pmem device, no free slots in the virtqueue\n");
+ for (;;) {
+ err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req_data,
+ GFP_ATOMIC);
+ if (!err) {
+ /*
+ * Take the virtqueue reference while @pmem_lock is
+ * held so completion cannot run concurrently.
+ */
+ kref_get(&req_data->kref);
+ break;
+ }
+
+ if (err != -ENOSPC)
+ break;
+
+ dev_info_ratelimited(&vdev->dev,
+ "failed to send command to virtio pmem device, no free slots in the virtqueue\n");
WRITE_ONCE(req_data->wq_buf_avail, false);
list_add_tail(&req_data->list, &vpmem->req_list);
spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
@@ -95,6 +118,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
wait_event(req_data->wq_buf, READ_ONCE(req_data->wq_buf_avail));
spin_lock_irqsave(&vpmem->pmem_lock, flags);
}
+
err1 = virtqueue_kick(vpmem->req_vq);
spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
/*
@@ -110,7 +134,7 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
err = le32_to_cpu(req_data->resp.ret);
}
- kfree(req_data);
+ kref_put(&req_data->kref, virtio_pmem_req_release);
return err;
};
diff --git a/drivers/nvdimm/virtio_pmem.h b/drivers/nvdimm/virtio_pmem.h
index f72cf17f9518f..1017e498c9b4c 100644
--- a/drivers/nvdimm/virtio_pmem.h
+++ b/drivers/nvdimm/virtio_pmem.h
@@ -12,11 +12,13 @@
#include <linux/module.h>
#include <uapi/linux/virtio_pmem.h>
+#include <linux/kref.h>
#include <linux/libnvdimm.h>
#include <linux/mutex.h>
#include <linux/spinlock.h>
struct virtio_pmem_request {
+ struct kref kref;
struct virtio_pmem_req req;
struct virtio_pmem_resp resp;
--
2.52.0
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH v4 6/7] nvdimm: virtio_pmem: converge broken virtqueue to -EIO
2026-06-09 12:07 [PATCH v4 0/7] nvdimm: virtio_pmem: fix request lifetime and converge broken queue failures Li Chen
` (4 preceding siblings ...)
2026-06-09 12:07 ` [PATCH v4 5/7] nvdimm: virtio_pmem: refcount requests for token lifetime Li Chen
@ 2026-06-09 12:07 ` Li Chen
2026-06-09 12:07 ` [PATCH v4 7/7] nvdimm: virtio_pmem: drain requests in freeze Li Chen
2026-06-12 18:42 ` [PATCH v4 0/7] nvdimm: virtio_pmem: fix request lifetime and converge broken queue failures Alison Schofield
7 siblings, 0 replies; 9+ messages in thread
From: Li Chen @ 2026-06-09 12:07 UTC (permalink / raw)
To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
Alison Schofield, virtualization, nvdimm
Cc: linux-kernel, Li Chen
dmesg reports virtqueue failure and device reset:
virtio_pmem virtio2: failed to send command to
virtio pmem device, no free slots in the virtqueue
virtio_pmem virtio2: virtio pmem device
needs a reset
virtio_pmem_flush() waits for either a free virtqueue descriptor (-ENOSPC)
or a host completion. If the request virtqueue becomes broken (e.g.
virtqueue_kick() notify failure), those waiters may never make progress.
Track a device-level broken state and converge all error paths to -EIO.
Fail fast for new requests, wake all -ENOSPC waiters, and drain/detach
outstanding request tokens to complete them with an error.
Closes: https://lore.kernel.org/r/202512250116.ewtzlD0g-lkp@intel.com/
Signed-off-by: Li Chen <me@linux.beauty>
---
v2->v3:
- Add raw dmesg excerpt to the patch description.
- Drop timestamps from the embedded dmesg.
- Fold the CONFIG_VIRTIO_PMEM=m export fix into this patch.
v3->v4:
- Rebased onto v7.1-rc7 and renumbered after the flush error patches.
- Use kmalloc_obj(*req_data) at the allocation site to match current nvdimm
code.
drivers/nvdimm/nd_virtio.c | 76 +++++++++++++++++++++++++++++++++---
drivers/nvdimm/virtio_pmem.c | 7 ++++
drivers/nvdimm/virtio_pmem.h | 4 ++
3 files changed, 81 insertions(+), 6 deletions(-)
diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index f5264f6afe44f..3f13e234e2f04 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -17,6 +17,18 @@ static void virtio_pmem_req_release(struct kref *kref)
kfree(req);
}
+static void virtio_pmem_signal_done(struct virtio_pmem_request *req)
+{
+ WRITE_ONCE(req->done, true);
+ wake_up(&req->host_acked);
+}
+
+static void virtio_pmem_complete_err(struct virtio_pmem_request *req)
+{
+ req->resp.ret = cpu_to_le32(1);
+ virtio_pmem_signal_done(req);
+}
+
static void virtio_pmem_wake_one_waiter(struct virtio_pmem *vpmem)
{
struct virtio_pmem_request *req_buf;
@@ -31,6 +43,41 @@ static void virtio_pmem_wake_one_waiter(struct virtio_pmem *vpmem)
wake_up(&req_buf->wq_buf);
}
+static void virtio_pmem_wake_all_waiters(struct virtio_pmem *vpmem)
+{
+ struct virtio_pmem_request *req, *tmp;
+
+ list_for_each_entry_safe(req, tmp, &vpmem->req_list, list) {
+ WRITE_ONCE(req->wq_buf_avail, true);
+ wake_up(&req->wq_buf);
+ list_del_init(&req->list);
+ }
+}
+
+void virtio_pmem_mark_broken_and_drain(struct virtio_pmem *vpmem)
+{
+ struct virtio_pmem_request *req;
+ unsigned int len;
+
+ if (READ_ONCE(vpmem->broken))
+ return;
+
+ WRITE_ONCE(vpmem->broken, true);
+ dev_err_once(&vpmem->vdev->dev, "virtqueue is broken\n");
+ virtio_pmem_wake_all_waiters(vpmem);
+
+ while ((req = virtqueue_get_buf(vpmem->req_vq, &len)) != NULL) {
+ virtio_pmem_complete_err(req);
+ kref_put(&req->kref, virtio_pmem_req_release);
+ }
+
+ while ((req = virtqueue_detach_unused_buf(vpmem->req_vq)) != NULL) {
+ virtio_pmem_complete_err(req);
+ kref_put(&req->kref, virtio_pmem_req_release);
+ }
+}
+EXPORT_SYMBOL_GPL(virtio_pmem_mark_broken_and_drain);
+
/* The interrupt handler */
void virtio_pmem_host_ack(struct virtqueue *vq)
{
@@ -42,8 +89,7 @@ void virtio_pmem_host_ack(struct virtqueue *vq)
spin_lock_irqsave(&vpmem->pmem_lock, flags);
while ((req_data = virtqueue_get_buf(vq, &len)) != NULL) {
virtio_pmem_wake_one_waiter(vpmem);
- WRITE_ONCE(req_data->done, true);
- wake_up(&req_data->host_acked);
+ virtio_pmem_signal_done(req_data);
kref_put(&req_data->kref, virtio_pmem_req_release);
}
spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
@@ -71,6 +117,9 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
return -EIO;
}
+ if (READ_ONCE(vpmem->broken))
+ return -EIO;
+
req_data = kmalloc_obj(*req_data);
if (!req_data)
return -ENOMEM;
@@ -115,22 +164,37 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
/* A host response results in "host_ack" getting called */
- wait_event(req_data->wq_buf, READ_ONCE(req_data->wq_buf_avail));
+ wait_event(req_data->wq_buf,
+ READ_ONCE(req_data->wq_buf_avail) ||
+ READ_ONCE(vpmem->broken));
spin_lock_irqsave(&vpmem->pmem_lock, flags);
+
+ if (READ_ONCE(vpmem->broken))
+ break;
}
- err1 = virtqueue_kick(vpmem->req_vq);
+ if (err == -EIO || virtqueue_is_broken(vpmem->req_vq))
+ virtio_pmem_mark_broken_and_drain(vpmem);
+
+ err1 = true;
+ if (!err && !READ_ONCE(vpmem->broken)) {
+ err1 = virtqueue_kick(vpmem->req_vq);
+ if (!err1)
+ virtio_pmem_mark_broken_and_drain(vpmem);
+ }
spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
/*
* virtqueue_add_sgs failed with error different than -ENOSPC, we can't
* do anything about that.
*/
- if (err || !err1) {
+ if (READ_ONCE(vpmem->broken) || err || !err1) {
dev_info(&vdev->dev, "failed to send command to virtio pmem device\n");
err = -EIO;
} else {
/* A host response results in "host_ack" getting called */
- wait_event(req_data->host_acked, READ_ONCE(req_data->done));
+ wait_event(req_data->host_acked,
+ READ_ONCE(req_data->done) ||
+ READ_ONCE(vpmem->broken));
err = le32_to_cpu(req_data->resp.ret);
}
diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
index 77b1966619059..c5caf11a479a7 100644
--- a/drivers/nvdimm/virtio_pmem.c
+++ b/drivers/nvdimm/virtio_pmem.c
@@ -25,6 +25,7 @@ static int init_vq(struct virtio_pmem *vpmem)
spin_lock_init(&vpmem->pmem_lock);
INIT_LIST_HEAD(&vpmem->req_list);
+ WRITE_ONCE(vpmem->broken, false);
return 0;
};
@@ -138,6 +139,12 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
static void virtio_pmem_remove(struct virtio_device *vdev)
{
struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
+ struct virtio_pmem *vpmem = vdev->priv;
+ unsigned long flags;
+
+ spin_lock_irqsave(&vpmem->pmem_lock, flags);
+ virtio_pmem_mark_broken_and_drain(vpmem);
+ spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
nvdimm_bus_unregister(nvdimm_bus);
vdev->config->del_vqs(vdev);
diff --git a/drivers/nvdimm/virtio_pmem.h b/drivers/nvdimm/virtio_pmem.h
index 1017e498c9b4c..e1a46abb9483c 100644
--- a/drivers/nvdimm/virtio_pmem.h
+++ b/drivers/nvdimm/virtio_pmem.h
@@ -48,6 +48,9 @@ struct virtio_pmem {
/* List to store deferred work if virtqueue is full */
struct list_head req_list;
+ /* Fail fast and wake waiters if the request virtqueue is broken. */
+ bool broken;
+
/* Synchronize virtqueue data */
spinlock_t pmem_lock;
@@ -57,5 +60,6 @@ struct virtio_pmem {
};
void virtio_pmem_host_ack(struct virtqueue *vq);
+void virtio_pmem_mark_broken_and_drain(struct virtio_pmem *vpmem);
int async_pmem_flush(struct nd_region *nd_region, struct bio *bio);
#endif
--
2.52.0
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH v4 7/7] nvdimm: virtio_pmem: drain requests in freeze
2026-06-09 12:07 [PATCH v4 0/7] nvdimm: virtio_pmem: fix request lifetime and converge broken queue failures Li Chen
` (5 preceding siblings ...)
2026-06-09 12:07 ` [PATCH v4 6/7] nvdimm: virtio_pmem: converge broken virtqueue to -EIO Li Chen
@ 2026-06-09 12:07 ` Li Chen
2026-06-12 18:42 ` [PATCH v4 0/7] nvdimm: virtio_pmem: fix request lifetime and converge broken queue failures Alison Schofield
7 siblings, 0 replies; 9+ messages in thread
From: Li Chen @ 2026-06-09 12:07 UTC (permalink / raw)
To: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
Alison Schofield, virtualization, nvdimm
Cc: linux-kernel, Li Chen
virtio_pmem_freeze() deletes virtqueues and resets the device without
waking threads waiting for a virtqueue descriptor or a host completion.
Mark the request virtqueue broken and drain outstanding requests under
pmem_lock before teardown so waiters can make progress and return -EIO.
Signed-off-by: Li Chen <me@linux.beauty>
---
v2->v3:
- No change.
v3->v4:
- Rebased onto v7.1-rc7 and renumbered after the flush error patches.
drivers/nvdimm/virtio_pmem.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
index c5caf11a479a7..663a60686fbdb 100644
--- a/drivers/nvdimm/virtio_pmem.c
+++ b/drivers/nvdimm/virtio_pmem.c
@@ -153,6 +153,13 @@ static void virtio_pmem_remove(struct virtio_device *vdev)
static int virtio_pmem_freeze(struct virtio_device *vdev)
{
+ struct virtio_pmem *vpmem = vdev->priv;
+ unsigned long flags;
+
+ spin_lock_irqsave(&vpmem->pmem_lock, flags);
+ virtio_pmem_mark_broken_and_drain(vpmem);
+ spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
vdev->config->del_vqs(vdev);
virtio_reset_device(vdev);
--
2.52.0
^ permalink raw reply related [flat|nested] 9+ messages in thread* Re: [PATCH v4 0/7] nvdimm: virtio_pmem: fix request lifetime and converge broken queue failures
2026-06-09 12:07 [PATCH v4 0/7] nvdimm: virtio_pmem: fix request lifetime and converge broken queue failures Li Chen
` (6 preceding siblings ...)
2026-06-09 12:07 ` [PATCH v4 7/7] nvdimm: virtio_pmem: drain requests in freeze Li Chen
@ 2026-06-12 18:42 ` Alison Schofield
7 siblings, 0 replies; 9+ messages in thread
From: Alison Schofield @ 2026-06-12 18:42 UTC (permalink / raw)
To: Li Chen
Cc: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
virtualization, nvdimm, linux-kernel
On Tue, Jun 09, 2026 at 08:07:14PM +0800, Li Chen wrote:
> Hi,
Hi Li Chen,
In case you missed it, a Sashiko AI review of this set has posted
feedback. Please take a look.
https://sashiko.dev/#/patchset/20260609120726.1714780-1-me%40linux.beauty
-- Alison
>
> The nvdimm flush helper currently converts any non-zero provider flush
> callback error to -EIO. That hides useful errno values from providers. For
> example, virtio-pmem may fail child flush bio allocation with -ENOMEM, but
> that is currently reported as -EIO by nvdimm_flush().
>
> The raw failure seen in the local mkfs sanity test was:
>
> wipefs: /dev/pmem0: cannot flush modified buffers: Input/output error
> mkfs.ext4: Input/output error while writing out and closing file system
> nd_region region0: dbg: nvdimm_flush rc=-5
>
> The first two patches keep provider flush errors intact and make the
> virtio-pmem child flush bio allocation use GFP_NOIO. async_pmem_flush() can
> allocate a child flush bio from filesystem flush and writeback paths;
> GFP_NOIO is a better fit than GFP_ATOMIC there because the allocation may
> sleep but must not recurse into filesystem I/O reclaim. With these changes,
> the same mkfs sanity test reached mkfs_rc=0, mount_rc=0, and umount_rc=0.
>
> The rest of the series addresses virtio-pmem request lifetime and broken
> virtqueue handling. The virtio-pmem flush path uses a virtqueue cookie/token
> to carry a per-request context through completion. Under broken virtqueue /
> notify failure conditions, the submitter can return and free the request
> object while the host/backend may still complete the published request. The
> IRQ completion handler then dereferences freed memory when waking waiters,
> which is reported by KASAN as a slab-use-after-free and may manifest as lock
> corruption (e.g. "BUG: spinlock already unlocked") without KASAN.
>
> In addition, the flush path has two wait sites: one for virtqueue descriptor
> availability (-ENOSPC from virtqueue_add_sgs()) and one for request
> completion. If the virtqueue becomes broken, forward progress is no longer
> guaranteed and these waiters may sleep indefinitely unless the driver
> converges the failure and wakes all wait sites.
>
> This series addresses these issues:
>
> 1/7 nvdimm: preserve flush callback errors
> Return provider flush callback errors directly from nvdimm_flush().
>
> 2/7 nvdimm: virtio_pmem: use GFP_NOIO for child flush bio
> Use GFP_NOIO for the child flush bio allocation.
>
> 3/7 nvdimm: virtio_pmem: always wake -ENOSPC waiters
> Wake one -ENOSPC waiter for each reclaimed used buffer, decoupled from
> token completion.
>
> 4/7 nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags
> Use READ_ONCE()/WRITE_ONCE() for the wait_event() flags (done and
> wq_buf_avail).
>
> 5/7 nvdimm: virtio_pmem: refcount requests for token lifetime
> Refcount request objects so the token lifetime spans the window where it is
> reachable through the virtqueue until completion/drain drops the virtqueue
> reference.
>
> 6/7 nvdimm: virtio_pmem: converge broken virtqueue to -EIO
> Track a device-level broken state to converge broken/notify failures to -EIO:
> wake all waiters and drain/detach outstanding requests to complete them with
> an error, and fail-fast new requests.
>
> 7/7 nvdimm: virtio_pmem: drain requests in freeze
> Drain outstanding requests in freeze() before tearing down virtqueues so
> waiters do not sleep indefinitely.
>
> Testing was done on QEMU x86_64 with a virtio-pmem device exported as
> /dev/pmem0. This v4 series applies to v7.1-rc7, builds with
> CONFIG_VIRTIO_PMEM=m, passes checkpatch, and passed the local repro checks
> with a local-only virtqueue_kick() fault injection. I also checked that it
> applies cleanly to next/master at 6e845bcb78c9 ("Add linux-next specific
> files for 20260605").
>
> Thanks,
> Li Chen
>
> Changelog:
> v3->v4:
> - Rebased the series onto v7.1-rc7 so it applies cleanly to Linux 7.1-rc7.
> - Update the allocation site in 6/7 from kmalloc(sizeof(*req_data),
> GFP_KERNEL) to kmalloc_obj(*req_data) to match current nvdimm code.
> - Add 1/7 to preserve provider flush callback errors in nvdimm_flush().
> - Include the GFP_NOIO child flush bio allocation fix as 2/7.
> - Renumber the old request lifetime and broken virtqueue fixes after the two
> new flush error patches.
> v2->v3:
> - Split patch 1 as suggested by Pankaj Gupta: keep the waiter wakeup
> ordering change in 1/5 and move READ_ONCE()/WRITE_ONCE() updates to
> 2/5 (no functional change intended).
> - Add log report to commit msg
> - Fold the export fix into 4/5 to keep the series bisectable when
> CONFIG_VIRTIO_PMEM=m.
> v1->v2: add the export patch to fix compile issue.
>
> Links:
> v3: https://lore.kernel.org/all/20260226025712.2236279-1-me@linux.beauty/#t
> v2: https://lore.kernel.org/all/20251225042915.334117-1-me@linux.beauty/
> v1: https://www.spinics.net/lists/kernel/msg5974818.html
>
> Li Chen (7):
> nvdimm: preserve flush callback errors
> nvdimm: virtio_pmem: use GFP_NOIO for child flush bio
> nvdimm: virtio_pmem: always wake -ENOSPC waiters
> nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags
> nvdimm: virtio_pmem: refcount requests for token lifetime
> nvdimm: virtio_pmem: converge broken virtqueue to -EIO
> nvdimm: virtio_pmem: drain requests in freeze
>
> drivers/nvdimm/nd_virtio.c | 139 +++++++++++++++++++++++++++++------
> drivers/nvdimm/region_devs.c | 6 +-
> drivers/nvdimm/virtio_pmem.c | 14 ++++
> drivers/nvdimm/virtio_pmem.h | 6 ++
> 4 files changed, 139 insertions(+), 26 deletions(-)
> --
> 2.52.0
^ permalink raw reply [flat|nested] 9+ messages in thread