Linux block layer
* Re: [PATCH] scsi: bsg: copy uring_cmd payload to prevent double-fetch from shared SQE
From: Jens Axboe @ 2026-05-27 16:45 UTC
  To: Caleb Sander Mateos, Rahul Chandelkar
  Cc: James.Bottomley, martin.petersen, fujita.tomonori, linux-scsi,
	linux-block, io-uring
In-Reply-To: <CADUfDZr6LJckoVt2NRfRt3Njs-WAqsg5-QnTDi6xbUDiO950Fw@mail.gmail.com>

On 5/27/26 10:27 AM, Caleb Sander Mateos wrote:
> On Wed, May 27, 2026 at 9:19 AM Rahul Chandelkar <rc@rexion.ai> wrote:
>>
>> On Wed, May 27, 2026 at 10:06:44AM -0600, Jens Axboe wrote:
>>> I don't think this is the right way to fix it, ->sqe should've been
>>> stable upfront if this ends up happening. Can you share your poc with
>>> me? Your trace has been trimmed down way too much to be useful.
>>
>> Agreed that a core-level copy before the inline callback would be the
>> right fix and would eliminate the entire class for every uring_cmd
>> driver. The per-driver copy was meant as a minimal backportable fix
>> for the immediate scsi_bsg path.
>>
>> PoC and full trace below.
>>
>> --- PoC (poc_bsg_toctou.c) ---
>>
>> Build:  gcc -O2 -pthread -static -o poc poc_bsg_toctou.c
>> Usage:  ./poc /dev/bsg/X
>> Needs:  2+ CPUs, io_uring, /dev/bsg/* access
>>
>> The racer thread flips request_len between 16 (passes the <=32 bounds
>> check) and 128 (used by copy_from_user, overflows scmd->cmnd[32]).
>> The overflow payload plants 0xdead000000001000 at the sense_buffer
>> pointer offset (+84 from cmnd[0]). When scsi_queue_rq() does
>> memset(scmd->sense_buffer, 0, SCSI_SENSE_BUFFERSIZE) it faults on the
>> corrupted pointer.
> 
> Then the fix is to use READ_ONCE() to access the SQE fields, right?
> Copying the entire SQE seems like unnecessary overhead. See
> nvme_uring_cmd_io() for prior art.

That is indeed the correct fix.

-- 
Jens Axboe



* Re: [PATCH] scsi: bsg: copy uring_cmd payload to prevent double-fetch from shared SQE
From: Jens Axboe @ 2026-05-27 16:48 UTC
  To: Caleb Sander Mateos, Rahul Chandelkar
  Cc: James.Bottomley, martin.petersen, fujita.tomonori, linux-scsi,
	linux-block, io-uring
In-Reply-To: <07c25a67-54b3-4ecd-bdf1-7ca0cefc8e38@kernel.dk>

On 5/27/26 10:45 AM, Jens Axboe wrote:
> On 5/27/26 10:27 AM, Caleb Sander Mateos wrote:
>> On Wed, May 27, 2026 at 9:19 AM Rahul Chandelkar <rc@rexion.ai> wrote:
>>>
>>> On Wed, May 27, 2026 at 10:06:44AM -0600, Jens Axboe wrote:
>>>> I don't think this is the right way to fix it, ->sqe should've been
>>>> stable upfront if this ends up happening. Can you share your poc with
>>>> me? Your trace has been trimmed down way too much to be useful.
>>>
>>> Agreed that a core-level copy before the inline callback would be the
>>> right fix and would eliminate the entire class for every uring_cmd
>>> driver. The per-driver copy was meant as a minimal backportable fix
>>> for the immediate scsi_bsg path.
>>>
>>> PoC and full trace below.
>>>
>>> --- PoC (poc_bsg_toctou.c) ---
>>>
>>> Build:  gcc -O2 -pthread -static -o poc poc_bsg_toctou.c
>>> Usage:  ./poc /dev/bsg/X
>>> Needs:  2+ CPUs, io_uring, /dev/bsg/* access
>>>
>>> The racer thread flips request_len between 16 (passes the <=32 bounds
>>> check) and 128 (used by copy_from_user, overflows scmd->cmnd[32]).
>>> The overflow payload plants 0xdead000000001000 at the sense_buffer
>>> pointer offset (+84 from cmnd[0]). When scsi_queue_rq() does
>>> memset(scmd->sense_buffer, 0, SCSI_SENSE_BUFFERSIZE) it faults on the
>>> corrupted pointer.
>>
>> Then the fix is to use READ_ONCE() to access the SQE fields, right?
>> Copying the entire SQE seems like unnecessary overhead. See
>> nvme_uring_cmd_io() for prior art.
> 
> That is indeed the correct fix.

To be a bit more clear for the original reporter, in the hopes that they
will send a v2. Doing things like:

	if (cmd->addr)
		validate_addr(cmd->addr);

	[...]

	Use cmd->addr, we already validated it.

Is not safe, as ->addr can change in between the check and the use. All
accesses to SQE-backed data, which includes cmd, should follow the pattern of:

	addr = READ_ONCE(cmd->addr);
	if (addr)
		validate_addr(addr);

	[...]

	Use addr, we already validated it, and it cannot have changed.

Copying 128b in both places is a big hammer, the code just needs to use
the proper access mechanism.
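
A minimal userspace sketch of that pattern, under stated assumptions:
READ_ONCE_SKETCH() is a hypothetical stand-in for the kernel's READ_ONCE()
(built on the GCC/Clang __typeof__ extension), and struct shared_cmd and
copy_cmd_checked() are made-up names for illustration, not the bsg code.
The length is loaded once into a local, and both the bounds check and the
copy use that local:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for the kernel's READ_ONCE(): force a
 * single load through a volatile-qualified lvalue.
 */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical command layout in memory that another thread may write. */
struct shared_cmd {
	uint32_t len;
	uint8_t data[128];
};

#define DST_SIZE 32

/*
 * Copy the payload into a DST_SIZE-byte buffer. The length is read once;
 * the bounds check and the copy both use the local snapshot, so a
 * concurrent writer to cmd->len cannot make them disagree.
 * Returns the number of bytes copied, or -1 if the length is out of range.
 */
static int copy_cmd_checked(uint8_t dst[DST_SIZE], const struct shared_cmd *cmd)
{
	uint32_t len = READ_ONCE_SKETCH(cmd->len);

	if (len == 0 || len > DST_SIZE)
		return -1;

	memcpy(dst, cmd->data, len);
	return (int)len;
}
```

Reading cmd->len a second time after the check is the window the racer in
the PoC relies on; with the local snapshot there is no second read to race
against.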

-- 
Jens Axboe


* Re: [PATCH 1/3] loop: cleanup lo_rw_aio
From: Chaitanya Kulkarni @ 2026-05-27 16:50 UTC
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Ming Lei, Bart Van Assche,
	Caleb Sander Mateos, linux-block@vger.kernel.org,
	linux-nvme@lists.infradead.org
In-Reply-To: <20260527151043.2349900-2-hch@lst.de>

On 5/27/26 08:10, Christoph Hellwig wrote:
> Port over the changes from the zloop driver to remove the need for
> the local bio, bvec and offset variables and clean up the code by
> that.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck




* Re: [PATCH 2/3] nvme-tcp: cleanup nvme_tcp_init_iter
From: Chaitanya Kulkarni @ 2026-05-27 16:51 UTC
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Ming Lei, Bart Van Assche,
	Caleb Sander Mateos, linux-block@vger.kernel.org,
	linux-nvme@lists.infradead.org
In-Reply-To: <20260527151043.2349900-3-hch@lst.de>

On 5/27/26 08:10, Christoph Hellwig wrote:
> Split the two init cases based on code in the zloop driver.  This
> simplifies the code and makes it easier to follow.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>


Looks good.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck




* Re: [PATCH 3/3] bvec: make the bvec_iter helpers inline functions
From: Chaitanya Kulkarni @ 2026-05-27 16:51 UTC
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Ming Lei, Bart Van Assche,
	Caleb Sander Mateos, linux-block@vger.kernel.org,
	linux-nvme@lists.infradead.org
In-Reply-To: <20260527151043.2349900-4-hch@lst.de>

On 5/27/26 08:10, Christoph Hellwig wrote:
> The macros are impossible to follow due to the lack of visual type
> information and all the braces.  Replace them with inline helpers to
> improve on that.  Because the calling conventions are a bit problematic
> with a lot of passing structures by value, all the helpers are marked
> as __always_inline so that they are force inlined.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>


Looks good.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck




* Re: [PATCH v2 4/4] dm crypt: batch all sectors of a bio per crypto request
From: Mikulas Patocka @ 2026-05-27 17:32 UTC
  To: Leonid Ravich
  Cc: Herbert Xu, David S . Miller, Mike Snitzer, Alasdair Kergon,
	Ard Biesheuvel, Eric Biggers, Jens Axboe, Horia Geanta,
	Gilad Ben-Yossef, linux-crypto, dm-devel, linux-block
In-Reply-To: <20260527065021.19525-5-lravich@amazon.com>

Hi

On Wed, 27 May 2026, Leonid Ravich wrote:

> +/*
> + * Multi-data-unit variant of crypt_convert_block_skcipher.  Submits all
> + * remaining sectors of the current bio in one skcipher request whose
> + * data_unit_size is cc->sector_size.  The cipher walks the IV between
> + * data units (see crypto_skcipher_set_data_unit_size()).
> + *
> + * Returns the same set of values as crypt_convert_block_skcipher:
> + *   0 on synchronous success (full chunk processed),
> + *   -EINPROGRESS / -EBUSY on asynchronous dispatch,
> + *   -EAGAIN if the per-bio scatterlist allocation cannot be made.  The
> + *           caller MUST disable multi-data-unit batching for the rest
> + *           of this bio and re-enter the per-sector path, which uses
> + *           only mempool reserves and is therefore safe even on the
> + *           swap-out-to-dm-crypt path under total memory exhaustion.
> + *   negative errno otherwise.
> + *
> + * On success the bio iterators have been advanced by the chunk size.
> + *
> + * Walks the bio with __bio_for_each_bvec so that multi-page folios
> + * produce one scatterlist entry rather than N (one per PAGE_SIZE).
> + */
> +static int crypt_convert_block_skcipher_multi(struct crypt_config *cc,
> +					      struct convert_context *ctx,
> +					      struct skcipher_request *req,
> +					      unsigned int *out_processed)
> +{
> +	const unsigned int sector_size = cc->sector_size;
> +	const gfp_t gfp = GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN;
> +	unsigned int total_in = ctx->iter_in.bi_size;
> +	unsigned int total_out = ctx->iter_out.bi_size;
> +	unsigned int total = min(total_in, total_out);
> +	unsigned int n_sectors;
> +	unsigned int n_sg_in = 0, n_sg_out = 0;
> +	struct dm_crypt_request *dmreq = dmreq_of_req(cc, req);
> +	struct scatterlist *sg_in = NULL, *sg_out = NULL;
> +	struct bvec_iter iter_in, iter_out;
> +	struct bio_vec bv;
> +	u8 *iv, *org_iv;
> +	int r;
> +
> +	if (unlikely(total < sector_size))
> +		return -EIO;
> +	n_sectors = total / sector_size;
> +	total = n_sectors * sector_size;

Division is slow. There should be this:
	n_sectors = total >> cc->sector_shift;


> +     if (unlikely(total < sector_size))
> +             return -EIO;

The condition total < sector_size catches a total that is too small, but it
lets through a total that is larger and unaligned. I think that it should be:

	if (unlikely(total & (sector_size - 1)))
		return -EIO;

(then, we can drop the line "total = n_sectors * sector_size")

ctx->iter_in.bi_size is supposed to be the same as ctx->iter_out.bi_size, 
so do we really need total = min(total_in, total_out)? Should it instead 
warn if they differ? (where the warning would indicate a bug in the code)
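
As a rough standalone sketch of the two suggestions above (shift instead of
division, mask test instead of the size comparison): bytes_to_sectors() is a
hypothetical helper, and sector_shift is assumed here to be log2 of the
sector size in bytes, which may not match how a given driver defines its own
shift field. Note that a total of zero passes the mask test, so a caller that
must reject empty input needs a separate check.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: convert a byte count into whole sectors, assuming
 * sector_size == 1u << sector_shift (a power of two).
 * Returns false if total is not a multiple of the sector size.
 */
static bool bytes_to_sectors(unsigned int total, unsigned int sector_shift,
			     unsigned int *n_sectors)
{
	unsigned int sector_size = 1u << sector_shift;

	/* Any low bit set below the sector size means total is unaligned. */
	if (total & (sector_size - 1))
		return false;

	/* Shift instead of dividing; exact because total is aligned. */
	*n_sectors = total >> sector_shift;
	return true;
}
```

With a 512-byte sector (shift 9), a 700-byte total is rejected by the mask
even though it would have passed a plain total < sector_size comparison.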

Mikulas



* Re: [PATCH] block: blk-zoned: fix zwplug refcount leak on write error path
From: Damien Le Moal @ 2026-05-27 18:06 UTC
  To: Shin'ichiro Kawasaki
  Cc: Haris Iqbal, Wentao Liang, Jens Axboe, linux-block, linux-kernel,
	stable
In-Reply-To: <ahbZRsqHKKbg9PSB@shinmob>

On 2026/05/27 20:47, Shin'ichiro Kawasaki wrote:
> On May 27, 2026 / 08:15, Damien Le Moal wrote:
> [...]
>> Wentao,
>>
>> You clearly did not test this at all because if you had, you would have seen
>> all the warning splats that your patch triggers.
> 
> FYI, the blktests CI run for the patch caught failures at block/017, zbd/004,
> zbd/009 and zbd/012.

Thanks Shin'ichiro. I did a simple manual test issuing an unaligned write with
dd on a zloop device. That was enough to trigger warnings similar to what the CI
reported.

-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()
From: Damien Le Moal @ 2026-05-27 18:11 UTC
  To: Tetsuo Handa, Ming Lei
  Cc: Jens Axboe, Bart Van Assche, Christoph Hellwig, linux-block, LKML,
	Andrew Morton, Linus Torvalds, linux-btrfs, David Sterba,
	linux-fsdevel, Christian Brauner
In-Reply-To: <fbb3edda-f108-4e5b-acf2-266f043f8125@I-love.SAKURA.ne.jp>

On 2026/05/27 20:29, Tetsuo Handa wrote:
> On 2026/05/27 12:00, Ming Lei wrote:
>> On Wed, May 27, 2026 at 10:35:56AM +0900, Tetsuo Handa wrote:
>>> On 2026/05/27 10:20, Ming Lei wrote:
>>>>> Of course we should try to figure out the root cause first, but how can we do?
>>>>
>>>> Definitely unexpected write IO(after umount & loop closed) from btrfs is more serious,
>>>> which may cause data loss, so CC btrfs list and maintainer.
>>>
> 
> I had a conversation with Google AI mode, and received the following response.
> 
> --------------------------------------------------------------------------------
> Technical Analysis: lo_rw_aio Null Pointer Dereference / UAF since v7.1-rc1
> 
> 
> 1. The Root Cause of the Timing Shift
> 
> This regression was introduced during the v7.1-rc1 merge window, primarily exposed by
> Commit 65565ca5f99b ("block: unify the synchronous bi_end_io callbacks"), along with
> helper refactorings like Commit 92c3737a2473 ("block: add a bio_submit_or_kill helper").
> 
> Prior to v7.0, the synchronous I/O completion path inherently contained execution lags (due
> to serialized completion handling and context switches) before notifying upper layers. This
> latency accidentally acted as a natural safety barrier. It ensured that by the time a file
> system completed its final sync_filesystem() and initiated umount, the loop driver's internal
> workqueue (lo_rw_aio) had already finished processing everything.
> 
> In v7.1, the unification and optimization of bi_end_io significantly minimized this latency.
> The filesystem now learns of "I/O completion" much faster. Consequently, highly-concurrent
> execution pipelines like btrfs or jfs proceed rapidly through kill_sb() and blkdev_put(),
> ultimately invoking lo_release() -> __loop_clr_fd() while the loop driver's backend kworker
> is still in the middle of executing the last sub-millisecond asynchronous file-backed I/O
> request.
> 
> 
> 2. Why the Block Layer's Built-in Quiesce/Freeze Fails
> 
> There is an implicit assumption that standard block layer freeze mechanisms (blk_mq_freeze_queue())
> protect the device lifetime during release. However, the v7.1 BIO helper refactoring introduced
> a synchronization gap:
> 
>   1. The filesystem triggers its final metadata or journal updates (e.g., txCommit in jfs or
>      delayed refcount updates in btrfs) right during the unmount/close boundary.
>   2. Due to the optimized execution path, these requests bypass the block layer's active
>      request-tracking metrics at the exact moment blk_mq_freeze_queue() or state validation
>      checks evaluated them as zero.
>   3. The block layer assumes the queue is safe and silent, allowing __loop_clr_fd() to
>      progress and nullify lo->lo_backing_file (or trigger fput()).
>   4. The leaked asynchronous kworker wakes up a fraction of a millisecond too late, attempts
>      to access lo->lo_backing_file or invokes kiocb_end_write() -> file_inode(), leading to
>      either a general protection fault (Null pointer dereference) or a Use-After-Free (UAF).
> 
> 
> 3. Why This Isn't Just an "Unexpected FS Bug"
> 
> While the write I/O originates from file systems like btrfs and jfs post-close, blaming the
> file systems entirely ignores the underlying infrastructure change. The core issue is that the
> block layer altered its synchronization behavior, breaking the barrier contract that
> VFS and file systems historically relied on during the device release path.
> 
> Papering over this inside individual file systems would require adding heavy, duplicated
> barriers inside every single filesystem's unmount path.

It sounds like the VFS unmount call needs to have something that waits for
sync() to complete. Though, it really feels very strange that an FS can complete
unmount without itself ensuring that there are no more IOs in flight. The
generic VFS layer cannot know what the FS needs to flush on unmount, so waiting
on a generic sync might not be enough.

It really feels like this is a btrfs and jfs issue, unless the same can be
reproduced with any file system (XFS, ext4, f2fs, ...).

Just my 2 cents.


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH] rust: block: mq: align init_request numa_node arg with C signature
From: Andreas Hindborg @ 2026-05-27 18:16 UTC
  To: mateusz.nowicki
  Cc: Boqun Feng, Miguel Ojeda, Gary Guo, Björn Roy Baron,
	Benno Lossin, Alice Ryhl, Trevor Gross, Danilo Krummrich,
	Jens Axboe, linux-block, rust-for-linux, linux-kernel
In-Reply-To: <1552076fb3201e5e47ead0989e793472@posteo.net>

mateusz.nowicki@posteo.net writes:

> Hello Andreas,
>
> How can I catch this earlier in the future? I verified the patch by
> compiling 'allyesconfig', but that did not catch the Rust issue.

As Gary says, CONFIG_RUST is silently disabled if the required tools are
not present. Please see the quick start guide for how to get a suitable
compiler on major distros [1].

You can run `make LLVM=1 rustavailable` to check that your tools are set
up correctly.

Best regards,
Andreas Hindborg


[1] https://docs.kernel.org/rust/quick-start.html



* [PATCH v2] scsi: bsg: read io_uring command fields once
From: Rahul Chandelkar @ 2026-05-27 19:17 UTC
  To: rc, James E . J . Bottomley, Martin K . Petersen, Jens Axboe,
	FUJITA Tomonori
  Cc: linux-scsi, linux-block, io-uring, linux-kernel, Yang Xiuwei,
	Bart Van Assche, Caleb Sander Mateos, stable
In-Reply-To: <20260527105931.3950913-1-rc@rexion.ai>

scsi_bsg_uring_cmd() reads struct bsg_uring_cmd fields directly from the
shared mmap'd io_uring SQE.  On the inline execution path, io_uring may
still point at userspace-visible SQE storage, so a concurrent userspace
thread can change fields between validation and use.

request_len is checked against the size of scmd->cmnd, then used again for
scmd->cmd_len and copy_from_user().  If userspace changes request_len after
the bounds check, the later copy can overflow the 32-byte scmd->cmnd
buffer.  Transfer fields are also read again by scsi_bsg_map_user_buffer(),
leaving direction, address and length open to the same race.

Use READ_ONCE() to load each bsg_uring_cmd field needed by
scsi_bsg_uring_cmd() into a local variable, then use those locals for both
validation and execution.  Pass the stable transfer direction, address and
length into scsi_bsg_map_user_buffer() so the helper no longer re-derives
them from the SQE.

This fixes the double-fetch without copying the whole io_uring command
payload.

Tested with KASAN on QEMU (virtio-scsi, 2 vCPUs).  Without this fix, a
two-thread race produces:

  BUG: KASAN: wild-memory-access in scsi_queue_rq+0x4a3/0x58a0
  Write of size 96 at addr dead000000001000 by task poc/67
  Call Trace:
   kasan_report+0xce/0x100
   __asan_memset+0x23/0x50
   scsi_queue_rq+0x4a3/0x58a0
   scsi_bsg_uring_cmd+0x942/0x1570
   io_uring_cmd+0x2f6/0x950
   io_issue_sqe+0xe5/0x22d0

Link: https://lore.kernel.org/all/20260527105931.3950913-1-rc@rexion.ai/T/#u
Fixes: 7b6d3255e7f8 ("scsi: bsg: add io_uring passthrough handler")
Cc: stable@vger.kernel.org
Signed-off-by: Rahul Chandelkar <rc@rexion.ai>
---
Changes in v2:
- Use READ_ONCE() for individual fields instead of memcpying the command
  payload.
- Pass stable transfer parameters to scsi_bsg_map_user_buffer() so it does
  not re-read the SQE.
- Do not carry the Reviewed-by tag from v1 because the implementation
  strategy changed.

 drivers/scsi/scsi_bsg.c | 54 ++++++++++++++++++++++++++---------------
 1 file changed, 35 insertions(+), 19 deletions(-)

diff --git a/drivers/scsi/scsi_bsg.c b/drivers/scsi/scsi_bsg.c
index e80dec53174e..ccbe3d98e4ff 100644
--- a/drivers/scsi/scsi_bsg.c
+++ b/drivers/scsi/scsi_bsg.c
@@ -76,12 +76,10 @@ static enum rq_end_io_ret scsi_bsg_uring_cmd_done(struct request *req,
 
 static int scsi_bsg_map_user_buffer(struct request *req,
 				    struct io_uring_cmd *ioucmd,
-				    unsigned int issue_flags, gfp_t gfp_mask)
+				    unsigned int issue_flags, gfp_t gfp_mask,
+				    bool is_write, u64 buf_addr,
+				    unsigned long buf_len)
 {
-	const struct bsg_uring_cmd *cmd = io_uring_sqe128_cmd(ioucmd->sqe, struct bsg_uring_cmd);
-	bool is_write = cmd->dout_xfer_len > 0;
-	u64 buf_addr = is_write ? cmd->dout_xferp : cmd->din_xferp;
-	unsigned long buf_len = is_write ? cmd->dout_xfer_len : cmd->din_xfer_len;
 	struct iov_iter iter;
 	int ret;
 
@@ -104,26 +102,40 @@ static int scsi_bsg_uring_cmd(struct request_queue *q, struct io_uring_cmd *iouc
 			       unsigned int issue_flags, bool open_for_write)
 {
 	struct scsi_bsg_uring_cmd_pdu *pdu = scsi_bsg_uring_cmd_pdu(ioucmd);
-	const struct bsg_uring_cmd *cmd = io_uring_sqe128_cmd(ioucmd->sqe, struct bsg_uring_cmd);
+	const struct bsg_uring_cmd *cmd =
+		io_uring_sqe128_cmd(ioucmd->sqe, struct bsg_uring_cmd);
 	struct scsi_cmnd *scmd;
 	struct request *req;
 	blk_mq_req_flags_t blk_flags = 0;
 	gfp_t gfp_mask = GFP_KERNEL;
+	u64 request = READ_ONCE(cmd->request);
+	u32 request_len = READ_ONCE(cmd->request_len);
+	u32 protocol = READ_ONCE(cmd->protocol);
+	u32 subprotocol = READ_ONCE(cmd->subprotocol);
+	u32 max_response_len = READ_ONCE(cmd->max_response_len);
+	u64 response = READ_ONCE(cmd->response);
+	u64 dout_xferp = READ_ONCE(cmd->dout_xferp);
+	u32 dout_xfer_len = READ_ONCE(cmd->dout_xfer_len);
+	u32 dout_iovec_count = READ_ONCE(cmd->dout_iovec_count);
+	u64 din_xferp = READ_ONCE(cmd->din_xferp);
+	u32 din_xfer_len = READ_ONCE(cmd->din_xfer_len);
+	u32 din_iovec_count = READ_ONCE(cmd->din_iovec_count);
+	u32 timeout_ms = READ_ONCE(cmd->timeout_ms);
 	int ret;
 
-	if (cmd->protocol != BSG_PROTOCOL_SCSI ||
-	    cmd->subprotocol != BSG_SUB_PROTOCOL_SCSI_CMD)
+	if (protocol != BSG_PROTOCOL_SCSI ||
+	    subprotocol != BSG_SUB_PROTOCOL_SCSI_CMD)
 		return -EINVAL;
 
-	if (!cmd->request || cmd->request_len == 0)
+	if (!request || request_len == 0)
 		return -EINVAL;
 
-	if (cmd->dout_xfer_len && cmd->din_xfer_len) {
+	if (dout_xfer_len && din_xfer_len) {
 		pr_warn_once("BIDI support in bsg has been removed.\n");
 		return -EOPNOTSUPP;
 	}
 
-	if (cmd->dout_iovec_count > 0 || cmd->din_iovec_count > 0)
+	if (dout_iovec_count > 0 || din_iovec_count > 0)
 		return -EOPNOTSUPP;
 
 	if (issue_flags & IO_URING_F_NONBLOCK) {
@@ -131,20 +143,20 @@ static int scsi_bsg_uring_cmd(struct request_queue *q, struct io_uring_cmd *iouc
 		gfp_mask = GFP_NOWAIT;
 	}
 
-	req = scsi_alloc_request(q, cmd->dout_xfer_len ?
+	req = scsi_alloc_request(q, dout_xfer_len ?
 				 REQ_OP_DRV_OUT : REQ_OP_DRV_IN, blk_flags);
 	if (IS_ERR(req))
 		return PTR_ERR(req);
 
 	scmd = blk_mq_rq_to_pdu(req);
-	if (cmd->request_len > sizeof(scmd->cmnd)) {
+	if (request_len > sizeof(scmd->cmnd)) {
 		ret = -EINVAL;
 		goto out_free_req;
 	}
-	scmd->cmd_len = cmd->request_len;
+	scmd->cmd_len = request_len;
 	scmd->allowed = SG_DEFAULT_RETRIES;
 
-	if (copy_from_user(scmd->cmnd, uptr64(cmd->request), cmd->request_len)) {
+	if (copy_from_user(scmd->cmnd, uptr64(request), request_len)) {
 		ret = -EFAULT;
 		goto out_free_req;
 	}
@@ -154,12 +166,18 @@ static int scsi_bsg_uring_cmd(struct request_queue *q, struct io_uring_cmd *iouc
 		goto out_free_req;
 	}
 
-	pdu->response_addr = cmd->response;
-	scmd->sense_len = cmd->max_response_len ?
-		min(cmd->max_response_len, SCSI_SENSE_BUFFERSIZE) : SCSI_SENSE_BUFFERSIZE;
+	pdu->response_addr = response;
+	scmd->sense_len = max_response_len ?
+		min(max_response_len, SCSI_SENSE_BUFFERSIZE) : SCSI_SENSE_BUFFERSIZE;
 
-	if (cmd->dout_xfer_len || cmd->din_xfer_len) {
-		ret = scsi_bsg_map_user_buffer(req, ioucmd, issue_flags, gfp_mask);
+	if (dout_xfer_len || din_xfer_len) {
+		bool is_write = dout_xfer_len > 0;
+		u64 buf_addr = is_write ? dout_xferp : din_xferp;
+		unsigned long buf_len = is_write ? dout_xfer_len : din_xfer_len;
+
+		ret = scsi_bsg_map_user_buffer(req, ioucmd, issue_flags,
+					       gfp_mask, is_write, buf_addr,
+					       buf_len);
 		if (ret)
 			goto out_free_req;
 		pdu->bio = req->bio;
@@ -167,8 +185,8 @@ static int scsi_bsg_uring_cmd(struct request_queue *q, struct io_uring_cmd *iouc
 		pdu->bio = NULL;
 	}
 
-	req->timeout = cmd->timeout_ms ?
-		msecs_to_jiffies(cmd->timeout_ms) : BLK_DEFAULT_SG_TIMEOUT;
+	req->timeout = timeout_ms ?
+		msecs_to_jiffies(timeout_ms) : BLK_DEFAULT_SG_TIMEOUT;
 
 	req->end_io = scsi_bsg_uring_cmd_done;
 	req->end_io_data = ioucmd;
-- 
2.54.0


* Re: [PATCH] block: add a bio_endio_status helper
From: Damien Le Moal @ 2026-05-27 19:41 UTC
  To: Christoph Hellwig, axboe; +Cc: linux-block
In-Reply-To: <20260527151247.2352145-1-hch@lst.de>

On 2026/05/28 0:12, Christoph Hellwig wrote:
> Add a helper that sets bi_status and calls bio_endio(), as that is a very
> common pattern, and convert the core block code over to it.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good, modulo one nit below.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>

> +/**
> + * bio_endio - end I/O on a bio with a specific status

This should be:
 * bio_endio_status - end I/O...

> + * @bio:	bio
> + * @status:	status to set
> + *
> + * Set @bio->bi_status to @status and call bio_endio().
> + **/
> +static inline void bio_endio_status(struct bio *bio, blk_status_t status)
>  {
> -	bio->bi_status = BLK_STS_IOERR;
> +	bio->bi_status = status;
>  	bio_endio(bio);
>  }
>  
> +static inline void bio_io_error(struct bio *bio)
> +{
> +	bio_endio_status(bio, BLK_STS_IOERR);
> +}
> +
>  static inline void bio_wouldblock_error(struct bio *bio)
>  {
> -	bio->bi_status = BLK_STS_AGAIN;
> -	bio_endio(bio);
> +	bio_endio_status(bio, BLK_STS_AGAIN);
>  }
>  
>  /*


-- 
Damien Le Moal
Western Digital Research


* [PATCH] block, nvme: export and use passthrough stats
From: Keith Busch @ 2026-05-28  0:58 UTC
  To: linux-block, linux-nvme; +Cc: axboe, hch, Keith Busch

From: Keith Busch <kbusch@kernel.org>

So stacking drivers can also report passthrough workloads through
iostat.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/blk-mq.c                | 30 ------------------------------
 drivers/nvme/host/multipath.c |  4 +++-
 include/linux/blk-mq.h        | 29 +++++++++++++++++++++++++++++
 3 files changed, 32 insertions(+), 31 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 28c2d931e75ea..c794b70fefe26 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1088,36 +1088,6 @@ static inline void blk_account_io_done(struct request *req, u64 now)
 	}
 }
 
-static inline bool blk_rq_passthrough_stats(struct request *req)
-{
-	struct bio *bio = req->bio;
-
-	if (!blk_queue_passthrough_stat(req->q))
-		return false;
-
-	/* Requests without a bio do not transfer data. */
-	if (!bio)
-		return false;
-
-	/*
-	 * Stats are accumulated in the bdev, so must have one attached to a
-	 * bio to track stats. Most drivers do not set the bdev for passthrough
-	 * requests, but nvme is one that will set it.
-	 */
-	if (!bio->bi_bdev)
-		return false;
-
-	/*
-	 * We don't know what a passthrough command does, but we know the
-	 * payload size and data direction. Ensuring the size is aligned to the
-	 * block size filters out most commands with payloads that don't
-	 * represent sector access.
-	 */
-	if (blk_rq_bytes(req) & (bdev_logical_block_size(bio->bi_bdev) - 1))
-		return false;
-	return true;
-}
-
 static inline void blk_account_io_start(struct request *req)
 {
 	trace_block_io_start(req);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 263161cb8ac06..435fab0be6401 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -175,9 +175,11 @@ void nvme_mpath_start_request(struct request *rq)
 		nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
 	}
 
-	if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq) ||
+	if (!blk_queue_io_stat(disk->queue) ||
 	    (nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
 		return;
+	if (blk_rq_is_passthrough(rq) && !blk_rq_passthrough_stats(rq))
+		return;
 
 	nvme_req(rq)->flags |= NVME_MPATH_IO_STATS;
 	nvme_req(rq)->start_time = bdev_start_io_acct(disk->part0, req_op(rq),
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 18a2388ba581d..8301830ece8b7 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1243,4 +1243,33 @@ static inline int blk_rq_map_sg(struct request *rq, struct scatterlist *sglist)
 }
 void blk_dump_rq_flags(struct request *, char *);
 
+static inline bool blk_rq_passthrough_stats(struct request *req)
+{
+	struct bio *bio = req->bio;
+
+	if (!blk_queue_passthrough_stat(req->q))
+		return false;
+
+	/* Requests without a bio do not transfer data. */
+	if (!bio)
+		return false;
+
+	/*
+	 * Stats are accumulated in the bdev, so must have one attached to a
+	 * bio to track stats. Most drivers do not set the bdev for passthrough
+	 * requests, but nvme is one that will set it.
+	 */
+	if (!bio->bi_bdev)
+		return false;
+
+	/*
+	 * We don't know what a passthrough command does, but we know the
+	 * payload size and data direction. Ensuring the size is aligned to the
+	 * block size filters out most commands with payloads that don't
+	 * represent sector access.
+	 */
+	if (blk_rq_bytes(req) & (bdev_logical_block_size(bio->bi_bdev) - 1))
+		return false;
+	return true;
+}
 #endif /* BLK_MQ_H */
-- 
2.53.0-Meta



* Re: [PATCH] block, nvme: export and use passthrough stats
From: Keith Busch @ 2026-05-28  1:00 UTC
  To: Keith Busch; +Cc: linux-block, linux-nvme, axboe, hch
In-Reply-To: <20260528005848.1517715-1-kbusch@meta.com>

Sorry, please disregard. I pointed send-email to the v1 directory by
mistake.


* [PATCHv3 1/2] block: export passthrough stats enabled
From: Keith Busch @ 2026-05-28  1:00 UTC
  To: linux-block, linux-nvme
  Cc: axboe, hch, Keith Busch, Nilay Shroff, Nitesh Shetty
In-Reply-To: <20260528010041.1533124-1-kbusch@meta.com>

From: Keith Busch <kbusch@kernel.org>

A user can enable io accounting for passthrough requests, so export the
helper that checks if the request should be tracked. This will enable
stacking drivers to report iostats for passthrough workloads. Since
the stacking request_queue may not be the one providing the request, the
API has to add a parameter for the caller to specify which one to check.

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/blk-mq.c         | 32 +-------------------------------
 include/linux/blk-mq.h | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+), 31 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a24175441380e..8ab6fa59f8d54 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1088,43 +1088,13 @@ static inline void blk_account_io_done(struct request *req, u64 now)
 	}
 }
 
-static inline bool blk_rq_passthrough_stats(struct request *req)
-{
-	struct bio *bio = req->bio;
-
-	if (!blk_queue_passthrough_stat(req->q))
-		return false;
-
-	/* Requests without a bio do not transfer data. */
-	if (!bio)
-		return false;
-
-	/*
-	 * Stats are accumulated in the bdev, so must have one attached to a
-	 * bio to track stats. Most drivers do not set the bdev for passthrough
-	 * requests, but nvme is one that will set it.
-	 */
-	if (!bio->bi_bdev)
-		return false;
-
-	/*
-	 * We don't know what a passthrough command does, but we know the
-	 * payload size and data direction. Ensuring the size is aligned to the
-	 * block size filters out most commands with payloads that don't
-	 * represent sector access.
-	 */
-	if (blk_rq_bytes(req) & (bdev_logical_block_size(bio->bi_bdev) - 1))
-		return false;
-	return true;
-}
-
 static inline void blk_account_io_start(struct request *req)
 {
 	trace_block_io_start(req);
 
 	if (!blk_queue_io_stat(req->q))
 		return;
-	if (blk_rq_is_passthrough(req) && !blk_rq_passthrough_stats(req))
+	if (blk_rq_is_passthrough(req) && !blk_rq_passthrough_stats(req, req->q))
 		return;
 
 	req->rq_flags |= RQF_IO_STAT;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 24b4160aeaad3..af878597afb8c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1252,4 +1252,44 @@ static inline int blk_rq_map_sg(struct request *rq, struct scatterlist *sglist)
 }
 void blk_dump_rq_flags(struct request *, char *);
 
+/**
+ * blk_rq_passthrough_stats - check if this request should account stats
+ * @rq: request to check
+ * @q: the queue accumulating the stats
+ *
+ * Note, @q does not necessarily need to be the request_queue that provides
+ * @rq.
+ *
+ * Return: true if stats should be accounted.
+ */
+static inline bool blk_rq_passthrough_stats(struct request *rq,
+					    struct request_queue *q)
+{
+	struct bio *bio = rq->bio;
+
+	if (!blk_queue_passthrough_stat(q))
+		return false;
+
+	/* Requests without a bio do not transfer data. */
+	if (!bio)
+		return false;
+
+	/*
+	 * Stats are accumulated in the bdev, so must have one attached to a
+	 * bio to track stats. Most drivers do not set the bdev for passthrough
+	 * requests, but nvme is one that will set it.
+	 */
+	if (!bio->bi_bdev)
+		return false;
+
+	/*
+	 * We don't know what a passthrough command does, but we know the
+	 * payload size and data direction. Ensuring the size is aligned to the
+	 * block size filters out most commands with payloads that don't
+	 * represent sector access.
+	 */
+	if (blk_rq_bytes(rq) & (bdev_logical_block_size(bio->bi_bdev) - 1))
+		return false;
+	return true;
+}
 #endif /* BLK_MQ_H */
-- 
2.53.0-Meta



* [PATCHv3 2/2] nvme: add support multipath passthrough iostats
From: Keith Busch @ 2026-05-28  1:00 UTC (permalink / raw)
  To: linux-block, linux-nvme
  Cc: axboe, hch, Keith Busch, Nilay Shroff, Nitesh Shetty
In-Reply-To: <20260528010041.1533124-1-kbusch@meta.com>

From: Keith Busch <kbusch@kernel.org>

Don't skip I/O accounting on the multipath head for passthrough commands
if the user has enabled passthrough stats.
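
For readers following along, the accounting decision can be modelled in
plain userspace C. This is only a sketch: the model_* types and functions
are hypothetical stand-ins, not kernel structures, and the already-accounted
NVME_MPATH_IO_STATS check is left out. It assumes the logical block size is
a power of two, which is what makes the mask test an alignment test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects; not the real structures. */
struct model_bdev  { unsigned int logical_block_size; };
struct model_bio   { const struct model_bdev *bdev; };
struct model_queue { bool io_stat; bool passthrough_stat; };
struct model_rq {
	const struct model_bio *bio;
	unsigned int bytes;
	bool passthrough;
};

/* Follows the order of checks in blk_rq_passthrough_stats(rq, q). */
static bool model_passthrough_stats(const struct model_rq *rq,
				    const struct model_queue *q)
{
	unsigned int lbs;

	if (!q->passthrough_stat)
		return false;
	if (!rq->bio)			/* no bio: no data transferred */
		return false;
	if (!rq->bio->bdev)		/* stats are accumulated in the bdev */
		return false;

	lbs = rq->bio->bdev->logical_block_size;
	assert(lbs && !(lbs & (lbs - 1)));	/* assumed power of two */

	/* Only payloads that are a multiple of the block size count. */
	if (rq->bytes & (lbs - 1))
		return false;
	return true;
}

/* Models the gate in nvme_mpath_start_request(); q is the head's queue. */
static bool model_should_account(const struct model_rq *rq,
				 const struct model_queue *q)
{
	if (!q->io_stat)
		return false;
	if (rq->passthrough && !model_passthrough_stats(rq, q))
		return false;
	return true;
}
```

In this model a 4096-byte passthrough request on a 512-byte-block device is
accounted, while a 4100-byte one is filtered out.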

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/ioctl.c     | 9 +++++++++
 drivers/nvme/host/multipath.c | 5 ++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 08889b20e5d8c..664216eece4a6 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -102,8 +102,17 @@ static struct request *nvme_alloc_user_request(struct request_queue *q,
 		struct nvme_command *cmd, blk_opf_t rq_flags,
 		blk_mq_req_flags_t blk_flags)
 {
+	struct nvme_ns *ns = q->queuedata;
 	struct request *req;
 
+	/*
+	 * The NVME_MPATH flag is set only for IO commands sent to a namespace
+	 * with a multipath enabled head. The request is not eligible for
+	 * failover as passthrough requests also append REQ_FAILFAST_DRIVER.
+	 */
+	if (ns && nvme_ns_head_multipath(ns->head))
+		rq_flags |= REQ_NVME_MPATH;
+
 	req = blk_mq_alloc_request(q, nvme_req_op(cmd) | rq_flags, blk_flags);
 	if (IS_ERR(req))
 		return req;
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index ff442bbf2937a..bca8e7c975190 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -175,9 +175,12 @@ void nvme_mpath_start_request(struct request *rq)
 		nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
 	}
 
-	if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq) ||
+	if (!blk_queue_io_stat(disk->queue) ||
 	    (nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
 		return;
+	if (blk_rq_is_passthrough(rq) &&
+	    !blk_rq_passthrough_stats(rq, disk->queue))
+		return;
 
 	nvme_req(rq)->flags |= NVME_MPATH_IO_STATS;
 	nvme_req(rq)->start_time = bdev_start_io_acct(disk->part0, req_op(rq),
-- 
2.53.0-Meta



* [PATCHv3 0/2] block, nvme: enable passthrough iostats
From: Keith Busch @ 2026-05-28  1:00 UTC (permalink / raw)
  To: linux-block, linux-nvme; +Cc: axboe, hch, Keith Busch

From: Keith Busch <kbusch@kernel.org>

v2->v3:

  Added kerneldoc for the exported API

  Added code comment for the passthrough safety

  Added reviews.

Keith Busch (2):
  block: export passthrough stats enabled
  nvme: add support multipath passthrough iostats

 block/blk-mq.c                | 32 +---------------------------
 drivers/nvme/host/ioctl.c     |  9 ++++++++
 drivers/nvme/host/multipath.c |  5 ++++-
 include/linux/blk-mq.h        | 40 +++++++++++++++++++++++++++++++++++
 4 files changed, 54 insertions(+), 32 deletions(-)

-- 
2.53.0-Meta



* [PATCH v3 0/2] zram: fix UAF in zram_bvec_write_partial() and drop dead bio plumbing
From: Cunlong Li @ 2026-05-28  2:48 UTC (permalink / raw)
  To: Minchan Kim, Sergey Senozhatsky, Jens Axboe, Andrew Morton,
	Yisheng Xie
  Cc: Christoph Hellwig, linux-block, linux-mm, linux-kernel,
	Cunlong Li, stable

Patch 1 fixes a use-after-free in zram_bvec_write_partial() that
happens on PAGE_SIZE > 4K configurations when a partial write hits a
ZRAM_WB slot.

Patch 2 is a follow-up cleanup that drops the now-unused bio parameter
from zram_bvec_write_partial() and zram_bvec_write(), no functional
change.

Patch 1 is tagged for stable; patch 2 is not.

Signed-off-by: Cunlong Li <shenxiaogll@gmail.com>
---
Changes in v3:
- Update Fixes: tag to 8e654f8fbff5 ("zram: read page from backing
  device") per Christoph.
- Link to v2: https://lore.kernel.org/r/20260527-zram-v2-0-2fb84b054b5c@gmail.com

Changes in v2:
- Add patch 2: drop the now-unused bio parameter from
  zram_bvec_write_partial() and zram_bvec_write(), per Sergey's
  suggestion on v1.
- Link to v1: https://lore.kernel.org/r/20260527-zram-v1-1-ce1acb2bfaf9@gmail.com

---
Cunlong Li (2):
      zram: fix use-after-free in zram_bvec_write_partial()
      zram: drop unused bio parameter from write helpers

 drivers/block/zram/zram_drv.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
---
base-commit: e8c2f9fdadee7cbc75134dc463c1e0d856d6e5c7
change-id: 20260526-zram-b01425b7e6c6

Best regards,
-- 
Cunlong Li <shenxiaogll@gmail.com>



* [PATCH v3 1/2] zram: fix use-after-free in zram_bvec_write_partial()
From: Cunlong Li @ 2026-05-28  2:48 UTC (permalink / raw)
  To: Minchan Kim, Sergey Senozhatsky, Jens Axboe, Andrew Morton,
	Yisheng Xie
  Cc: Christoph Hellwig, linux-block, linux-mm, linux-kernel,
	Cunlong Li, stable
In-Reply-To: <20260528-zram-v3-0-cab86eef8764@gmail.com>

zram_read_page() picks the sync or async backing device read path
based on whether the parent bio is NULL.  zram_bvec_write_partial()
passes its parent bio down, so for ZRAM_WB slots the read is
dispatched asynchronously and zram_read_page() returns 0 while the
bio is still in flight.  The caller then runs memcpy_from_bvec(),
zram_write_page() and __free_page() on the buffer, leaving the
async read to write into a freed page.

zram_bvec_read_partial() was switched to NULL in commit 4e3c87b9421d
("zram: fix synchronous reads") for the same reason; the
write_partial counterpart was missed.

Fixes: 8e654f8fbff5 ("zram: read page from backing device")
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Cunlong Li <shenxiaogll@gmail.com>
---
 drivers/block/zram/zram_drv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index aebc710f0d6a..b23a8bbb687c 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -2333,7 +2333,7 @@ static int zram_bvec_write_partial(struct zram *zram, struct bio_vec *bvec,
 	if (!page)
 		return -ENOMEM;
 
-	ret = zram_read_page(zram, page, index, bio);
+	ret = zram_read_page(zram, page, index, NULL);
 	if (!ret) {
 		memcpy_from_bvec(page_address(page) + offset, bvec);
 		ret = zram_write_page(zram, page, index);

-- 
2.30.2



* [PATCH v3 2/2] zram: drop unused bio parameter from write helpers
From: Cunlong Li @ 2026-05-28  2:48 UTC (permalink / raw)
  To: Minchan Kim, Sergey Senozhatsky, Jens Axboe, Andrew Morton,
	Yisheng Xie
  Cc: Christoph Hellwig, linux-block, linux-mm, linux-kernel,
	Cunlong Li
In-Reply-To: <20260528-zram-v3-0-cab86eef8764@gmail.com>

After the previous fix, zram_bvec_write_partial() always passes NULL
to zram_read_page() and no longer needs the parent bio.  Mirror the
read side (zram_bvec_read_partial() has not taken a bio since commit
4e3c87b9421d ("zram: fix synchronous reads")) and drop the parameter
from zram_bvec_write_partial() and zram_bvec_write().

No functional change.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Cunlong Li <shenxiaogll@gmail.com>
---
 drivers/block/zram/zram_drv.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index b23a8bbb687c..66347915a2cc 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -2325,7 +2325,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
  * This is a partial IO. Read the full page before writing the changes.
  */
 static int zram_bvec_write_partial(struct zram *zram, struct bio_vec *bvec,
-				   u32 index, int offset, struct bio *bio)
+				   u32 index, int offset)
 {
 	struct page *page = alloc_page(GFP_NOIO);
 	int ret;
@@ -2343,10 +2343,10 @@ static int zram_bvec_write_partial(struct zram *zram, struct bio_vec *bvec,
 }
 
 static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
-			   u32 index, int offset, struct bio *bio)
+			   u32 index, int offset)
 {
 	if (is_partial_io(bvec))
-		return zram_bvec_write_partial(zram, bvec, index, offset, bio);
+		return zram_bvec_write_partial(zram, bvec, index, offset);
 	return zram_write_page(zram, bvec->bv_page, index);
 }
 
@@ -2743,7 +2743,7 @@ static void zram_bio_write(struct zram *zram, struct bio *bio)
 
 		bv.bv_len = min_t(u32, bv.bv_len, PAGE_SIZE - offset);
 
-		if (zram_bvec_write(zram, &bv, index, offset, bio) < 0) {
+		if (zram_bvec_write(zram, &bv, index, offset) < 0) {
 			atomic64_inc(&zram->stats.failed_writes);
 			bio->bi_status = BLK_STS_IOERR;
 			break;

-- 
2.30.2



* Re: [PATCH 1/3] loop: cleanup lo_rw_aio
From: Ming Lei @ 2026-05-28  3:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Bart Van Assche,
	Caleb Sander Mateos, linux-block, linux-nvme
In-Reply-To: <20260527151043.2349900-2-hch@lst.de>

On Wed, May 27, 2026 at 05:10:20PM +0200, Christoph Hellwig wrote:
> Port over the changes from the zloop driver to remove the need for
> the local bio, bvec and offset variables and clean up the code by
> that.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Ming Lei <tom.leiming@gmail.com>

Thanks,
Ming


* Re: [PATCH 2/3] nvme-tcp: cleanup nvme_tcp_init_iter
From: Ming Lei @ 2026-05-28  3:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Bart Van Assche,
	Caleb Sander Mateos, linux-block, linux-nvme
In-Reply-To: <20260527151043.2349900-3-hch@lst.de>

On Wed, May 27, 2026 at 05:10:21PM +0200, Christoph Hellwig wrote:
> Split the two init cases based on code in the zloop driver.  This
> simplifies the code and makes it easier to follow.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Ming Lei <tom.leiming@gmail.com>

Thanks,
Ming


* Re: [PATCH 3/3] bvec: make the bvec_iter helpers inline functions
From: Ming Lei @ 2026-05-28  3:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Bart Van Assche,
	Caleb Sander Mateos, linux-block, linux-nvme
In-Reply-To: <20260527151043.2349900-4-hch@lst.de>

On Wed, May 27, 2026 at 05:10:22PM +0200, Christoph Hellwig wrote:
> The macros are impossible to follow due to the lack of visual type
> information and all the braces.  Replace them with inline helpers to
> improve on that.  Because the calling conventions are a bit problematic
> with a lot of passing structures by value, all the helpers are marked
> as __always_inline so that they are force inlined.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>

Reviewed-by: Ming Lei <tom.leiming@gmail.com>

Thanks,
Ming


* Re: [PATCH v3 0/2] zram: fix UAF in zram_bvec_write_partial() and drop dead bio plumbing
From: Cunlong Li @ 2026-05-28  4:41 UTC (permalink / raw)
  To: Minchan Kim, Sergey Senozhatsky, Jens Axboe, Andrew Morton,
	Yisheng Xie
  Cc: Christoph Hellwig, linux-block, linux-mm, linux-kernel, stable
In-Reply-To: <20260528-zram-v3-0-cab86eef8764@gmail.com>

On Thu, May 28, 2026 at 10:48:43AM +0800, Cunlong Li wrote:
> Patch 1 fixes a use-after-free in zram_bvec_write_partial() that
> happens on PAGE_SIZE > 4K configurations when a partial write hits a
> ZRAM_WB slot.
> 
> Patch 2 is a follow-up cleanup that drops the now-unused bio parameter
> from zram_bvec_write_partial() and zram_bvec_write(), no functional
> change.
> 
> Patch 1 is tagged for stable; patch 2 is not.
> 
> Signed-off-by: Cunlong Li <shenxiaogll@gmail.com>
> ---
> Changes in v3:
> - Update Fixes: tag to 8e654f8fbff5 ("zram: read page from backing
>   device") per Christoph.
> - Link to v2: https://lore.kernel.org/r/20260527-zram-v2-0-2fb84b054b5c@gmail.com
> 
> Changes in v2:
> - Add patch 2: drop the now-unused bio parameter from
>   zram_bvec_write_partial() and zram_bvec_write(), per Sergey's
>   suggestion on v1.
> - Link to v1: https://lore.kernel.org/r/20260527-zram-v1-1-ce1acb2bfaf9@gmail.com
> 
> ---
> Cunlong Li (2):
>       zram: fix use-after-free in zram_bvec_write_partial()
>       zram: drop unused bio parameter from write helpers
> 
>  drivers/block/zram/zram_drv.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> ---
> base-commit: e8c2f9fdadee7cbc75134dc463c1e0d856d6e5c7
> change-id: 20260526-zram-b01425b7e6c6
> 
> Best regards,
> -- 
> Cunlong Li <shenxiaogll@gmail.com>
> 

Test results for reference:

Tested on arm64 16K-page QEMU (Apple M4, HVF) with KASAN enabled,
kernel v7.1-rc5 (base-commit e8c2f9fdadee).  zram0 backed by a loop
file on ext4, fio bs=4k randrw (4 jobs, 120s) against ext4-on-zram0
with a parallel loop triggering idle writeback.

Without the fix, KASAN fires within seconds:

  BUG: KASAN: use-after-free in copy_folio_from_iter_atomic+0x830/0x18e8
  Read of size 16384 at addr ffff8000d1168000 by task kworker/u16:4/321

  Workqueue: loop0 loop_rootcg_workfn
  Call trace:
   memcpy+0x3c/0x9c
   copy_folio_from_iter_atomic+0x830/0x18e8
   generic_perform_write+0x308/0x558
   ext4_buffered_write_iter+0x140/0x438
   ext4_file_write_iter+0x868/0x1004
   lo_rw_aio.isra.0+0x838/0xc94
   loop_process_work+0x2f8/0xdf0
   loop_rootcg_workfn+0x20/0x2c
   process_one_work+0x560/0xc10

  page: refcount:0 mapcount:0

The async backing-device read bio still references the page after
zram_bvec_write_partial() freed it; the loop worker then writes
into freed memory.

With the series applied, the same workload runs clean for two
minutes with no KASAN reports.
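
The ordering problem can be illustrated with a small userspace model. This
is a sketch only: the model_* names are hypothetical, the page is just a
pair of flags, and nothing here touches real zram code. It assumes, as the
commit message describes, that a non-NULL parent bio selects the async path
for ZRAM_WB slots and that other slots are always read synchronously.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the temporary page used for a partial write. */
struct model_page {
	bool read_in_flight;	/* an async backing-device read targets it */
	bool freed;
};

struct model_bio { int unused; };	/* stand-in for a parent bio */

/*
 * Returns 0 on success. For a written-back slot, a non-NULL parent selects
 * the async path, which returns before the data has arrived.
 */
static int model_read_page(struct model_page *page, bool slot_is_wb,
			   struct model_bio *parent)
{
	assert(!page->freed);
	if (slot_is_wb && parent) {
		page->read_in_flight = true;	/* completes later */
		return 0;
	}
	page->read_in_flight = false;	/* sync: done before we return */
	return 0;
}

/* Returns true if the page was freed while a read was still in flight. */
static bool model_write_partial(bool slot_is_wb, struct model_bio *parent)
{
	struct model_page page = { false, false };

	if (model_read_page(&page, slot_is_wb, parent) == 0) {
		/* the copy from the bvec and the page write would run here */
	}
	page.freed = true;		/* models __free_page() */
	return page.read_in_flight;
}
```

Passing the parent bio for a written-back slot reports a free with the read
still pending; passing NULL, as patch 1 does, never does.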



* Re: blktests failures with v7.1-rc1 kernel
From: Shin'ichiro Kawasaki @ 2026-05-28  5:24 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
	linux-scsi@vger.kernel.org, nbd, linux-rdma
In-Reply-To: <c4ddc101-184a-4e4f-82ca-c3123bce5e34@linux.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 2972 bytes --]

On May 25, 2026 / 18:14, Nilay Shroff wrote:
> hi Shinichiro,
> 
> On 4/28/26 2:43 PM, Shin'ichiro Kawasaki wrote:
[...]
> > #1: nvme/005,063 (tcp transport)
> > 
> >      The test cases nvme/005 and 063 fail for tcp transport due to the lockdep
> >      WARN related to the three locks q->q_usage_counter, q->elevator_lock and
> >      set->srcu. The failure was reported first time for nvme/063 and v6.16-rc1
> >      kernel [2].
> > 
> >      Chaitanya provided a fix patch (thanks!), and it is queued for v7.1-rcX tags
> >      [3]. However, nvme/005 and 063 still fail even when I apply the fix patch to
> >      v7.1-rc1 kernel. The call traces of the lockdep WARN are different between
> >      "v7.1-rc1" kernel [4] and "v7.1-rc1+the fix patch" kernel [5]. I guess that
> >      there exist two lockdep problems with similar symptoms and patch [3] fixed
> >      one of them. I guess that still one problem is left.
> > 
> >      [2]https://lore.kernel.org/linux-block/4fdm37so3o4xricdgfosgmohn63aa7wj3ua4e5vpihoamwg3ui@fq42f5q5t5ic/
> >      [3]https://lore.kernel.org/all/20260413171628.6204-1-kch@nvidia.com/
> 
> 
> I looked into this lockdep warning, and it seems that Chaitanya's patch indeed fixes the
> original issue reported in [4]. However, the new warning reported in [5] appears to be a
> separate lockdep splat and, from what I can tell, likely a false positive. There are two
> reasons why I think so:
> 
> 1. The lockdep report suggests that thread #1 is sending data over a TCP socket while
>    another thread #2 is still in the process of establishing that same socket connection.
>    In practice, this should not be possible because request dispatch over the socket can
>    only happen after the connection setup has completed successfully.
> 
> 2. The warning also suggests that while thread #0 is deleting the gendisk and unregistering
>    the corresponding request queue, another thread #5 is concurrently attempting to change
>    the queue elevator. However, once gendisk deletion starts, elevator switching is already
>    inhibited for that queue (see disable_elv_switch()), so the reported locking scenario
>    should not be reachable in practice.
> 
> Based on the above, I suspect this is a lockdep false positive caused by dependency tracking
> across different queue/socket lifecycle phases. We may need to suppress lock dependency tracking
> in some of these paths to avoid the false warning.

Hi Nilay, thank you very much for looking into this. It is good to know
that Chaitanya's patch fixed one problem, and that the other problem looks
like a false positive.

To confirm that this is a "lockdep false positive caused by dependency
tracking across different queue/socket lifecycle phases", I created the
attached patch. It uses dynamic lockdep keys for the sockets of nvme-tcp
controllers. With this patch, the WARN at nvme/005 disappears! I think this
indicates that your suspicion is correct. I will do some more testing and
post the patch.
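
As a rough illustration of why a shared key can produce such a report, here
is a toy userspace model of class-based dependency tracking. It is not
lockdep's algorithm and uses no kernel API: the model_* names are
hypothetical, and the six-lock chain from the report is collapsed into a
single intermediate lock called OTHER. The assumption is only that
dependencies are recorded per lock class rather than per lock instance.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CLASSES 8

/* edge[a][b]: class b was acquired while class a was held. */
struct model_graph { bool edge[MAX_CLASSES][MAX_CLASSES]; };

static void model_record(struct model_graph *g, int held, int acquired)
{
	assert(held >= 0 && held < MAX_CLASSES);
	assert(acquired >= 0 && acquired < MAX_CLASSES);
	g->edge[held][acquired] = true;
}

/* Depth-first search: is 'to' reachable from 'from'? */
static bool model_reachable(const struct model_graph *g, int from, int to,
			    bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < MAX_CLASSES; next++)
		if (g->edge[from][next] && !seen[next] &&
		    model_reachable(g, next, to, seen))
			return true;
	return false;
}

/* A cycle exists if some edge a->b also has a path from b back to a. */
static bool model_has_cycle(const struct model_graph *g)
{
	bool seen[MAX_CLASSES];

	for (int a = 0; a < MAX_CLASSES; a++) {
		for (int b = 0; b < MAX_CLASSES; b++) {
			if (!g->edge[a][b])
				continue;
			memset(seen, 0, sizeof(seen));
			if (model_reachable(g, b, a, seen))
				return true;
		}
	}
	return false;
}

/*
 * Socket A is locked under OTHER while being established; socket B, already
 * in use, takes OTHER under its socket lock. The two never overlap on one
 * socket, but a shared class makes the graph look circular.
 */
static bool model_reports_cycle(int class_a, int class_b)
{
	enum { OTHER = 0 };
	struct model_graph g;

	memset(&g, 0, sizeof(g));
	model_record(&g, OTHER, class_a);	/* OTHER -> sk_lock(A) */
	model_record(&g, class_b, OTHER);	/* sk_lock(B) -> OTHER */
	return model_has_cycle(&g);
}
```

With one class for both sockets the model reports a cycle; with a separate
class per socket, as per-queue dynamic keys would give, it does not.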

[-- Attachment #2: 0001-nvme-tcp-lockdep-use-dynamic-lockdep-keys.patch --]
[-- Type: text/plain, Size: 5676 bytes --]

From 74ae2157712e872711663ebb6cedbb4b0fc8c92a Mon Sep 17 00:00:00 2001
From: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Date: Thu, 28 May 2026 13:52:48 +0900
Subject: [PATCH] nvme-tcp: lockdep: use dynamic lockdep keys

When NVMe-TCP controller setup and teardown are repeated with lockdep
enabled, lockdep reports false positives for the following locks:

  1) &q->elevator_lock        : IO scheduler change context
  2) &q->q_usage_counter(io)  : SCSI disk probe context
  3) fs_reclaim               : CPU hotplug bring-up context
  4) cpu_hotplug_lock         : socket establishment context
  5) sk_lock-AF_INET-NVME     : MQ sched dispatch context for the socket
  6) set->srcu                : NVMe controller delete context

This is a false positive because lockdep confuses lock 4) (socket
establishment) with lock 5) (socket in use) for different socket
instances. The locks belong to different sockets, but lockdep treats
them as the same due to shared static lockdep keys.

Fix this by using dynamically allocated lockdep keys per socket instance
instead of static keys nvme_tcp_sk_key[] and nvme_tcp_slock_key[]. Add
nvme_tcp_sk_key and nvme_tcp_slock_key fields to struct nvme_tcp_queue
and pass them to sock_lock_init_class_and_name() for proper lockdep
tracking. Move nvme_tcp_reclassify_socket() after struct nvme_tcp_queue
definition to avoid "too early" reference compiler errors.

Suggested-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 drivers/nvme/host/tcp.c | 88 +++++++++++++++++++++++------------------
 1 file changed, 49 insertions(+), 39 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 15d36d6a728e..51d496f414a1 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -56,44 +56,6 @@ MODULE_PARM_DESC(tls_handshake_timeout,
 
 static atomic_t nvme_tcp_cpu_queues[NR_CPUS];
 
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-/* lockdep can detect a circular dependency of the form
- *   sk_lock -> mmap_lock (page fault) -> fs locks -> sk_lock
- * because dependencies are tracked for both nvme-tcp and user contexts. Using
- * a separate class prevents lockdep from conflating nvme-tcp socket use with
- * user-space socket API use.
- */
-static struct lock_class_key nvme_tcp_sk_key[2];
-static struct lock_class_key nvme_tcp_slock_key[2];
-
-static void nvme_tcp_reclassify_socket(struct socket *sock)
-{
-	struct sock *sk = sock->sk;
-
-	if (WARN_ON_ONCE(!sock_allow_reclassification(sk)))
-		return;
-
-	switch (sk->sk_family) {
-	case AF_INET:
-		sock_lock_init_class_and_name(sk, "slock-AF_INET-NVME",
-					      &nvme_tcp_slock_key[0],
-					      "sk_lock-AF_INET-NVME",
-					      &nvme_tcp_sk_key[0]);
-		break;
-	case AF_INET6:
-		sock_lock_init_class_and_name(sk, "slock-AF_INET6-NVME",
-					      &nvme_tcp_slock_key[1],
-					      "sk_lock-AF_INET6-NVME",
-					      &nvme_tcp_sk_key[1]);
-		break;
-	default:
-		WARN_ON_ONCE(1);
-	}
-}
-#else
-static void nvme_tcp_reclassify_socket(struct socket *sock) { }
-#endif
-
 enum nvme_tcp_send_state {
 	NVME_TCP_SEND_CMD_PDU = 0,
 	NVME_TCP_SEND_H2C_PDU,
@@ -180,6 +142,11 @@ struct nvme_tcp_queue {
 	void (*state_change)(struct sock *);
 	void (*data_ready)(struct sock *);
 	void (*write_space)(struct sock *);
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lock_class_key nvme_tcp_sk_key;
+	struct lock_class_key nvme_tcp_slock_key;
+#endif
 };
 
 struct nvme_tcp_ctrl {
@@ -207,6 +174,44 @@ static const struct blk_mq_ops nvme_tcp_mq_ops;
 static const struct blk_mq_ops nvme_tcp_admin_mq_ops;
 static int nvme_tcp_try_send(struct nvme_tcp_queue *queue);
 
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+/* lockdep can detect a circular dependency of the form
+ *   sk_lock -> mmap_lock (page fault) -> fs locks -> sk_lock
+ * because dependencies are tracked for both nvme-tcp and user contexts. Using
+ * a separate class prevents lockdep from conflating nvme-tcp socket use with
+ * user-space socket API use.
+ */
+static void nvme_tcp_reclassify_socket(struct nvme_tcp_queue *queue)
+{
+	struct sock *sk = queue->sock->sk;
+
+	lockdep_register_key(&queue->nvme_tcp_sk_key);
+	lockdep_register_key(&queue->nvme_tcp_slock_key);
+
+	if (WARN_ON_ONCE(!sock_allow_reclassification(sk)))
+		return;
+
+	switch (sk->sk_family) {
+	case AF_INET:
+		sock_lock_init_class_and_name(sk, "slock-AF_INET-NVME",
+					      &queue->nvme_tcp_slock_key,
+					      "sk_lock-AF_INET-NVME",
+					      &queue->nvme_tcp_sk_key);
+		break;
+	case AF_INET6:
+		sock_lock_init_class_and_name(sk, "slock-AF_INET6-NVME",
+					      &queue->nvme_tcp_slock_key,
+					      "sk_lock-AF_INET6-NVME",
+					      &queue->nvme_tcp_sk_key);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+	}
+}
+#else
+static void nvme_tcp_reclassify_socket(struct nvme_tcp_queue *queue) { }
+#endif
+
 static inline struct nvme_tcp_ctrl *to_tcp_ctrl(struct nvme_ctrl *ctrl)
 {
 	return container_of(ctrl, struct nvme_tcp_ctrl, ctrl);
@@ -1468,6 +1473,11 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
 	kfree(queue->pdu);
 	mutex_destroy(&queue->send_mutex);
 	mutex_destroy(&queue->queue_lock);
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	lockdep_unregister_key(&queue->nvme_tcp_sk_key);
+	lockdep_unregister_key(&queue->nvme_tcp_slock_key);
+#endif
 }
 
 static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
@@ -1813,7 +1823,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
 	}
 
 	sk_net_refcnt_upgrade(queue->sock->sk);
-	nvme_tcp_reclassify_socket(queue->sock);
+	nvme_tcp_reclassify_socket(queue);
 
 	/* Single syn retry */
 	tcp_sock_set_syncnt(queue->sock->sk, 1);
-- 
2.54.0



* Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()
From: Hillf Danton @ 2026-05-28  5:43 UTC (permalink / raw)
  To: Ming Lei
  Cc: Tetsuo Handa, Jens Axboe, Bart Van Assche, Christoph Hellwig,
	Damien Le Moal, linux-block, LKML, Andrew Morton, Linus Torvalds,
	linux-btrfs, David Sterba, linux-fsdevel, Christian Brauner
In-Reply-To: <ahZeYQ0cLE1i8TGs@fedora>

On Tue, 26 May 2026 22:00:49 -0500 Ming Lei wrote:
>On Wed, May 27, 2026 at 10:35:56AM +0900, Tetsuo Handa wrote:
>> On 2026/05/27 10:20, Ming Lei wrote:
>> >> Of course we should try to figure out the root cause first, but how can we do?
>> > 
>> > Definitely unexpected write IO(after umount & loop closed) from btrfs is more serious,
>> > which may cause data loss, so CC btrfs list and maintainer.
>> 
>> Why do you assume that the culprit is btrfs?
>> 
>> https://syzkaller.appspot.com/bug?extid=bc273027d5643e48e5b3 indicated that
>> this similar race is also happening with jfs.
>
> I just didn't see the above report on jfs.
> 
> It doesn't change anything, the same question still stands: unexpected write IO is issued
> or crosses umount & last closing of loop disk.
>
Given the loop workqueue that triggered the jfs warning, can you explain
why the workqueue in question is NOT flushed when the disk is closed?
