Linux block layer


* Re: [PATCH RFC 2/8] fs: add a global device to super block hash table
From: Christian Brauner @ 2026-06-17  9:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs
In-Reply-To: <20260617062523.GA20041@lst.de>

> No, we don't need a secondary device number to sb mapping.  On the other
> hand we do need the deviceloss, freeze etc upcalls to work for owners
> that are not file systems like mdraid or dm, even if they have been
> slow to pick this up.  The whole idea of the holder ops is to abstract
> away from who holds it instead of adding back the broken hard coding
> of the superblock.  Otherwise you're just badly reinventing get_super.

No, the expanded version works for all device numbers. There's also
no hard-coding. And non-fs users may do whatever they want with their
holder ops ofc. erofs has always had a non-1:1 relationship between
devices and filesystems, and for that case it seems sane. I'm happy to
let the series sit for a bit to gather input and do the security
mediation patches first. The series are complementary.

^ permalink raw reply

* Re: [PATCH v3] rust: add procedural macro for declaring configfs attributes
From: Malte Wechter @ 2026-06-17  9:13 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: Andreas Hindborg, Breno Leitao, Miguel Ojeda, Boqun Feng,
	Gary Guo, Björn Roy Baron, Benno Lossin, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Jens Axboe, Luis Chamberlain,
	Petr Pavlu, Daniel Gomez, Sami Tolvanen, Aaron Tomlin,
	linux-kernel, rust-for-linux, linux-block, linux-modules
In-Reply-To: <CANiq72=RX5V6W+1tj0GHxZusrk5OqYbZ5-xV=wvSssrx_CWXAA@mail.gmail.com>


On 6/13/26 12:41 PM, Miguel Ojeda wrote:
> Hi Malte,
>
> Some quick notes...
>
> On Fri, Jun 12, 2026 at 3:29 PM Malte Wechter <maltewechter@gmail.com> wrote:
>> +/// ```ignore
> Empty /// before examples.
>
>> +///     // This will extract "foo: <field>" into a variable named "foo".
> ` instead of "
>
> i.e. please use Markdown
>
>> +///```
> Missing space indentation
>
>> +/// Expands the following output:
>> +///    let item_type = {
> Missing example block, both at the beginning and the end.
>
> Please double-check by generating the docs and looking at how they
> appear in the browser.
>
> The prefix of the title should likely be `rust: configfs:`.
>
> Thanks!
>
> Cheers,
> Miguel
As of now, doc strings are not generated for private items in the macros
crate. I am moving the `parse_ordered_fields!` macro into
macros/helpers.rs, but this means the doc strings are no longer generated
for the macro. `parse_ordered_fields!` is a larger helper macro, and its
doc strings are relevant and helpful for macro developers who want to
use it.

You can enable documenting private items:

diff --git a/rust/Makefile b/rust/Makefile
index b361bfedfdf0..b4239443307e 100644
--- a/rust/Makefile
+++ b/rust/Makefile
@@ -147,6 +147,7 @@ quiet_cmd_rustdoc = RUSTDOC $(if $(rustdoc_host),H, ) $<
      OBJTREE=$(abspath $(objtree)) \
      $(RUSTDOC) $(filter-out $(skip_flags) --remap-path-scope=%,$(if $(rustdoc_host),$(rust_common_flags),$(rust_flags))) \
          $(rustc_target_flags) -L$(objtree)/$(obj) \
+        --document-private-items \
          -Zunstable-options --generate-link-to-definition \
          --output $(rustdoc_output) \
          --crate-name $(subst rustdoc-,,$@) \

But this causes _all_ private items to be rendered, which is not
ideal. How should I proceed?

Best regards,

Malte



^ permalink raw reply related

* Re: [PATCH v3 3/4] block: drop shared-tag fairness throttling
From: Sumit Saxena @ 2026-06-17  7:32 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Martin K. Petersen, Jens Axboe, James Bottomley, Linux SCSI List,
	linux-block
In-Reply-To: <93a82831-608d-4462-a019-26b3adc7089c@suse.de>

> What tests did you perform?
> I'm pretty sure you see an improvement when having just a few drives,
> but what about having a lot of them (ie tens of drives)?
> The whole point of this was to increase fairness between drives, so
> of course removing it will make an individual drive going faster ...

Initially, we ran tests with 8 drives and saw positive results. However, we
have since completed tests with 16 drives and are seeing performance drops
at higher iodepths (>=128) with this patch.
This appears to be due to the removal of the per-queue throttle
(hctx_may_queue).
We are currently running additional tests to better understand this
behavior. I will provide an update once I have more meaningful data.

Thanks,
Sumit
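
The per-queue throttle mentioned above can be pictured with a small userspace model. This is only a sketch under stated assumptions, not the kernel's hctx_may_queue(): the names demo_fair_depth() and demo_may_queue(), the ceiling division, and the minimum share of 4 tags are all choices made for this illustration, and the tag counts in the example are hypothetical rather than taken from the test setup.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical fair-share model: every active queue may hold at most
 * ceil(total_tags / active_queues) tags, but never fewer than 4.
 * The floor of 4 is an assumption of this sketch.
 */
unsigned int demo_fair_depth(unsigned int total_tags, unsigned int active_queues)
{
	unsigned int depth;

	if (active_queues == 0)
		return total_tags;	/* nobody to share with */

	depth = (total_tags + active_queues - 1) / active_queues;
	return depth < 4 ? 4 : depth;
}

/* A queue may take another tag only while it is below its fair share. */
bool demo_may_queue(unsigned int total_tags, unsigned int active_queues,
		    unsigned int in_flight)
{
	return in_flight < demo_fair_depth(total_tags, active_queues);
}

/* Usage example with made-up numbers: 1024 shared tags. */
void demo_fairness_example(void)
{
	/* 8 active queues: each may hold up to 128 tags. */
	assert(demo_fair_depth(1024, 8) == 128);
	/* 16 active queues: the share halves to 64 tags per queue. */
	assert(demo_fair_depth(1024, 16) == 64);
	assert(!demo_may_queue(1024, 16, 64));
}
```

In this model, doubling the number of active queues halves each queue's share, so a throttle of this kind starts to bite once the per-drive queue depth exceeds the share. Whether that is what the 16-drive results show is still an open question in the thread.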

^ permalink raw reply

* [PATCH 2/2 blktests] src/miniublk: fall back to legacy opcodes on older kernels
From: Sebastian Chlad @ 2026-06-17  7:25 UTC (permalink / raw)
  To: linux-block; +Cc: shinichiro.kawasaki, Sebastian Chlad
In-Reply-To: <20260617072516.6238-1-sebastian.chlad@suse.com>

Try ioctl-encoded ADD_DEV and GET_DEV_INFO first; if either fails,
retry with the legacy raw opcode. After a successful bootstrap
command, derive use_ioctl from UBLK_F_CMD_IOCTL_ENCODE in dev_info.flags
so all subsequent control and IO commands use the mode reported by the
kernel.

Signed-off-by: Sebastian Chlad <sebastian.chlad@suse.com>
---
 src/miniublk.c | 47 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 42 insertions(+), 5 deletions(-)

diff --git a/src/miniublk.c b/src/miniublk.c
index 5a35ca7..494a4ae 100644
--- a/src/miniublk.c
+++ b/src/miniublk.c
@@ -112,6 +112,7 @@ struct ublk_dev {
 	int fds[2];	/* fds[0] points to /dev/ublkcN */
 	int nr_fds;
 	int ctrl_fd;
+	bool use_ioctl;
 	struct io_uring ring;
 };
 
@@ -235,7 +236,7 @@ static inline int ublk_setup_ring(struct io_uring *r, int depth,
 
 static inline void ublk_ctrl_init_cmd(struct ublk_dev *dev,
 		struct io_uring_sqe *sqe,
-		struct ublk_ctrl_cmd_data *data)
+		struct ublk_ctrl_cmd_data *data, __u32 cmd_op)
 {
 	struct ublksrv_ctrl_dev_info *info = &dev->dev_info;
 	struct ublksrv_ctrl_cmd *cmd = (struct ublksrv_ctrl_cmd *)ublk_get_sqe_cmd(sqe);
@@ -255,25 +256,34 @@ static inline void ublk_ctrl_init_cmd(struct ublk_dev *dev,
 	cmd->dev_id = info->dev_id;
 	cmd->queue_id = -1;
 
-	ublk_set_sqe_cmd_op(sqe, data->cmd_op);
+	ublk_set_sqe_cmd_op(sqe, cmd_op);
 
 	io_uring_sqe_set_data(sqe, cmd);
 }
 
+static void ublk_update_ioctl_encoding(struct ublk_dev *dev)
+{
+	dev->use_ioctl = !!(dev->dev_info.flags & UBLK_F_CMD_IOCTL_ENCODE);
+}
+
 static int __ublk_ctrl_cmd(struct ublk_dev *dev,
 		struct ublk_ctrl_cmd_data *data)
 {
 	struct io_uring_sqe *sqe;
 	struct io_uring_cqe *cqe;
+	__u32 cmd_op = data->cmd_op;
 	int ret = -EINVAL;
 
+	if (!dev->use_ioctl)
+		cmd_op = _IOC_NR(cmd_op);
+
 	sqe = io_uring_get_sqe(&dev->ring);
 	if (!sqe) {
 		ublk_err("%s: can't get sqe ret %d\n", __func__, ret);
 		return ret;
 	}
 
-	ublk_ctrl_init_cmd(dev, sqe, data);
+	ublk_ctrl_init_cmd(dev, sqe, data, cmd_op);
 
 	ret = io_uring_submit(&dev->ring);
 	if (ret < 0) {
@@ -321,8 +331,19 @@ int ublk_ctrl_add_dev(struct ublk_dev *dev)
 		.addr = (__u64)&dev->dev_info,
 		.len = sizeof(struct ublksrv_ctrl_dev_info),
 	};
+	int ret;
 
-	return __ublk_ctrl_cmd(dev, &data);
+	ret = __ublk_ctrl_cmd(dev, &data);
+	if (ret < 0) {
+		/* retry with legacy opcode on older kernels */
+		dev->use_ioctl = false;
+		ret = __ublk_ctrl_cmd(dev, &data);
+	}
+
+	if (ret >= 0)
+		ublk_update_ioctl_encoding(dev);
+
+	return ret;
 }
 
 int ublk_ctrl_del_dev(struct ublk_dev *dev)
@@ -343,8 +364,19 @@ int ublk_ctrl_get_info(struct ublk_dev *dev)
 		.addr = (__u64)&dev->dev_info,
 		.len = sizeof(struct ublksrv_ctrl_dev_info),
 	};
+	int ret;
 
-	return __ublk_ctrl_cmd(dev, &data);
+	ret = __ublk_ctrl_cmd(dev, &data);
+	if (ret < 0 && dev->use_ioctl) {
+		/* retry with legacy opcode on older kernels */
+		dev->use_ioctl = false;
+		ret = __ublk_ctrl_cmd(dev, &data);
+	}
+
+	if (ret >= 0)
+		ublk_update_ioctl_encoding(dev);
+
+	return ret;
 }
 
 int ublk_ctrl_set_params(struct ublk_dev *dev,
@@ -453,6 +485,8 @@ static struct ublk_dev *ublk_ctrl_init()
 	struct ublksrv_ctrl_dev_info *info = &dev->dev_info;
 	int ret;
 
+	dev->use_ioctl = true; /* use ioctl opcodes by default */
+
 	dev->ctrl_fd = open(CTRL_DEV, O_RDWR);
 	if (dev->ctrl_fd < 0) {
 		ublk_err("control dev %s can't be opened: %m %d\n", CTRL_DEV, errno);
@@ -628,6 +662,9 @@ static int ublk_queue_io_cmd(struct ublk_queue *q,
 	else
 		cmd_op = UBLK_U_IO_FETCH_REQ;
 
+	if (!q->dev->use_ioctl)
+		cmd_op = _IOC_NR(cmd_op);
+
 	sqe = io_uring_get_sqe(&q->ring);
 	if (!sqe) {
 		ublk_err("%s: run out of sqe %d, tag %d\n",
-- 
2.51.0
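
The bootstrap logic described in this patch can be modelled in userspace without io_uring or a ublk device. The sketch below is an illustration under stated assumptions, not miniublk code: struct demo_dev, demo_submit(), the DEMO_* constants and the fake kernel's error codes are all hypothetical. It tries the ioctl-encoded opcode first, retries with the raw command number, and then takes the final mode from the feature flag the (fake) kernel reports.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <linux/ioctl.h>

struct demo_payload { unsigned long long flags; };

/* Hypothetical stand-ins for a raw opcode, its encoded form and a feature flag. */
#define DEMO_CMD_ADD_DEV		0x04
#define DEMO_U_CMD_ADD_DEV		_IOWR('u', DEMO_CMD_ADD_DEV, struct demo_payload)
#define DEMO_F_CMD_IOCTL_ENCODE		(1ULL << 6)

struct demo_dev {
	bool use_ioctl;			/* mode used for the next command */
	unsigned long long flags;	/* flags reported by the fake kernel */
	bool kernel_accepts_ioctl;	/* fake kernel capabilities */
	bool kernel_accepts_legacy;
};

/* Fake kernel: rejects whichever opcode style it does not support. */
static int demo_submit(struct demo_dev *dev, unsigned int cmd_op)
{
	bool encoded = _IOC_TYPE(cmd_op) != 0;

	if (encoded && !dev->kernel_accepts_ioctl)
		return -EINVAL;		/* error code is arbitrary in this model */
	if (!encoded && !dev->kernel_accepts_legacy)
		return -EOPNOTSUPP;
	dev->flags = dev->kernel_accepts_ioctl ? DEMO_F_CMD_IOCTL_ENCODE : 0;
	return 0;
}

static int demo_ctrl_cmd(struct demo_dev *dev, unsigned int cmd_op)
{
	if (!dev->use_ioctl)
		cmd_op = _IOC_NR(cmd_op);	/* strip down to the raw number */
	return demo_submit(dev, cmd_op);
}

int demo_add_dev(struct demo_dev *dev)
{
	int ret = demo_ctrl_cmd(dev, DEMO_U_CMD_ADD_DEV);

	if (ret < 0 && dev->use_ioctl) {
		/* retry once with the legacy opcode */
		dev->use_ioctl = false;
		ret = demo_ctrl_cmd(dev, DEMO_U_CMD_ADD_DEV);
	}
	if (ret >= 0)	/* trust what the kernel reported */
		dev->use_ioctl = !!(dev->flags & DEMO_F_CMD_IOCTL_ENCODE);
	return ret;
}

/* Usage example: a legacy-only kernel ends up in legacy mode. */
void demo_fallback_example(void)
{
	struct demo_dev old = { .use_ioctl = true, .kernel_accepts_legacy = true };

	assert(demo_add_dev(&old) == 0);
	assert(!old.use_ioctl);
}
```

Guarding the retry with the current mode, as the GET_DEV_INFO path in the patch does, avoids resubmitting the same legacy opcode when the device is already in legacy mode.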


^ permalink raw reply related

* [PATCH 1/2 blktests] src/miniublk: switch to ioctl-encoded ublk commands
From: Sebastian Chlad @ 2026-06-17  7:25 UTC (permalink / raw)
  To: linux-block; +Cc: shinichiro.kawasaki, Sebastian Chlad
In-Reply-To: <20260617072516.6238-1-sebastian.chlad@suse.com>

Kernels built without CONFIG_BLKDEV_UBLK_LEGACY_OPCODES reject the
legacy raw UBLK_CMD_* and UBLK_IO_* opcodes. Switch miniublk to use
the ioctl-encoded UBLK_U_CMD_* and UBLK_U_IO_* variants defined in
linux/ublk_cmd.h instead.

For IO commands, the ioctl-encoded opcode is used for submission while
_IOC_NR() extracts the raw NR bits for build_user_data(), keeping the
user_data tag encoding intact.

Signed-off-by: Sebastian Chlad <sebastian.chlad@suse.com>
---
 src/miniublk.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/src/miniublk.c b/src/miniublk.c
index f98f850..5a35ca7 100644
--- a/src/miniublk.c
+++ b/src/miniublk.c
@@ -294,7 +294,7 @@ static int __ublk_ctrl_cmd(struct ublk_dev *dev,
 int ublk_ctrl_stop_dev(struct ublk_dev *dev)
 {
 	struct ublk_ctrl_cmd_data data = {
-		.cmd_op	= UBLK_CMD_STOP_DEV,
+		.cmd_op	= UBLK_U_CMD_STOP_DEV,
 	};
 
 	return __ublk_ctrl_cmd(dev, &data);
@@ -304,7 +304,7 @@ int ublk_ctrl_start_dev(struct ublk_dev *dev,
 		int daemon_pid)
 {
 	struct ublk_ctrl_cmd_data data = {
-		.cmd_op	= UBLK_CMD_START_DEV,
+		.cmd_op	= UBLK_U_CMD_START_DEV,
 		.flags	= CTRL_CMD_HAS_DATA,
 	};
 
@@ -316,7 +316,7 @@ int ublk_ctrl_start_dev(struct ublk_dev *dev,
 int ublk_ctrl_add_dev(struct ublk_dev *dev)
 {
 	struct ublk_ctrl_cmd_data data = {
-		.cmd_op	= UBLK_CMD_ADD_DEV,
+		.cmd_op	= UBLK_U_CMD_ADD_DEV,
 		.flags	= CTRL_CMD_HAS_BUF,
 		.addr = (__u64)&dev->dev_info,
 		.len = sizeof(struct ublksrv_ctrl_dev_info),
@@ -328,7 +328,7 @@ int ublk_ctrl_add_dev(struct ublk_dev *dev)
 int ublk_ctrl_del_dev(struct ublk_dev *dev)
 {
 	struct ublk_ctrl_cmd_data data = {
-		.cmd_op = UBLK_CMD_DEL_DEV,
+		.cmd_op = UBLK_U_CMD_DEL_DEV,
 		.flags = 0,
 	};
 
@@ -338,7 +338,7 @@ int ublk_ctrl_del_dev(struct ublk_dev *dev)
 int ublk_ctrl_get_info(struct ublk_dev *dev)
 {
 	struct ublk_ctrl_cmd_data data = {
-		.cmd_op	= UBLK_CMD_GET_DEV_INFO,
+		.cmd_op	= UBLK_U_CMD_GET_DEV_INFO,
 		.flags	= CTRL_CMD_HAS_BUF,
 		.addr = (__u64)&dev->dev_info,
 		.len = sizeof(struct ublksrv_ctrl_dev_info),
@@ -351,7 +351,7 @@ int ublk_ctrl_set_params(struct ublk_dev *dev,
 		struct ublk_params *params)
 {
 	struct ublk_ctrl_cmd_data data = {
-		.cmd_op	= UBLK_CMD_SET_PARAMS,
+		.cmd_op	= UBLK_U_CMD_SET_PARAMS,
 		.flags	= CTRL_CMD_HAS_BUF,
 		.addr = (__u64)params,
 		.len = sizeof(*params),
@@ -364,7 +364,7 @@ static int ublk_ctrl_get_params(struct ublk_dev *dev,
 		struct ublk_params *params)
 {
 	struct ublk_ctrl_cmd_data data = {
-		.cmd_op	= UBLK_CMD_GET_PARAMS,
+		.cmd_op	= UBLK_U_CMD_GET_PARAMS,
 		.flags	= CTRL_CMD_HAS_BUF,
 		.addr = (__u64)params,
 		.len = sizeof(*params),
@@ -378,7 +378,7 @@ static int ublk_ctrl_get_params(struct ublk_dev *dev,
 static int ublk_ctrl_start_user_recover(struct ublk_dev *dev)
 {
 	struct ublk_ctrl_cmd_data data = {
-		.cmd_op	= UBLK_CMD_START_USER_RECOVERY,
+		.cmd_op	= UBLK_U_CMD_START_USER_RECOVERY,
 		.flags	= 0,
 	};
 
@@ -389,7 +389,7 @@ static int ublk_ctrl_end_user_recover(struct ublk_dev *dev,
 		int daemon_pid)
 {
 	struct ublk_ctrl_cmd_data data = {
-		.cmd_op	= UBLK_CMD_END_USER_RECOVERY,
+		.cmd_op	= UBLK_U_CMD_END_USER_RECOVERY,
 		.flags	= CTRL_CMD_HAS_DATA,
 	};
 
@@ -624,9 +624,9 @@ static int ublk_queue_io_cmd(struct ublk_queue *q,
 		return 0;
 
 	if (io->flags & UBLKSRV_NEED_COMMIT_RQ_COMP)
-		cmd_op = UBLK_IO_COMMIT_AND_FETCH_REQ;
-	else if (io->flags & UBLKSRV_NEED_FETCH_RQ)
-		cmd_op = UBLK_IO_FETCH_REQ;
+		cmd_op = UBLK_U_IO_COMMIT_AND_FETCH_REQ;
+	else
+		cmd_op = UBLK_U_IO_FETCH_REQ;
 
 	sqe = io_uring_get_sqe(&q->ring);
 	if (!sqe) {
@@ -637,7 +637,7 @@ static int ublk_queue_io_cmd(struct ublk_queue *q,
 
 	cmd = (struct ublksrv_io_cmd *)ublk_get_sqe_cmd(sqe);
 
-	if (cmd_op == UBLK_IO_COMMIT_AND_FETCH_REQ)
+	if (io->flags & UBLKSRV_NEED_COMMIT_RQ_COMP)
 		cmd->result = io->result;
 
 	/* These fields should be written once, never change */
@@ -650,7 +650,7 @@ static int ublk_queue_io_cmd(struct ublk_queue *q,
 	cmd->addr	= (__u64)io->buf_addr;
 	cmd->q_id	= q->q_id;
 
-	user_data = build_user_data(tag, cmd_op, 0, 0);
+	user_data = build_user_data(tag, _IOC_NR(cmd_op), 0, 0);
 	io_uring_sqe_set_data64(sqe, user_data);
 
 	io->flags = 0;
@@ -658,7 +658,7 @@ static int ublk_queue_io_cmd(struct ublk_queue *q,
 	q->cmd_inflight += 1;
 
 	ublk_dbg(UBLK_DBG_IO_CMD, "%s: (qid %d tag %u cmd_op %u) iof %x stopping %d\n",
-			__func__, q->q_id, tag, cmd_op,
+			__func__, q->q_id, tag, _IOC_NR(cmd_op),
 			io->flags, !!(q->state & UBLKSRV_QUEUE_STOPPING));
 	return 1;
 }
-- 
2.51.0
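
For readers unfamiliar with the two opcode styles, the relationship between an ioctl-encoded opcode and its raw command number can be shown in a few lines. This is a minimal sketch assuming a Linux userspace with <linux/ioctl.h>; struct demo_ctrl_cmd and the DEMO_* constants are hypothetical stand-ins chosen for illustration, not the definitions from linux/ublk_cmd.h.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/ioctl.h>	/* _IOWR, _IOC_NR, _IOC_TYPE, _IOC_SIZE, _IOC_DIR */

/* Hypothetical payload; only its size ends up in the encoded opcode. */
struct demo_ctrl_cmd {
	uint32_t dev_id;
	uint16_t queue_id;
	uint16_t len;
	uint64_t addr;
	uint64_t data[2];
};

/* Illustrative raw command number and its ioctl-encoded counterpart. */
#define DEMO_CMD_ADD_DEV	0x04
#define DEMO_U_CMD_ADD_DEV	_IOWR('u', DEMO_CMD_ADD_DEV, struct demo_ctrl_cmd)

/* Recover the raw command number from an opcode of either style. */
unsigned int demo_raw_nr(unsigned int op)
{
	return _IOC_NR(op);
}

/* True when direction, type or size bits are set above the NR field. */
int demo_is_ioctl_encoded(unsigned int op)
{
	return _IOC_DIR(op) != 0 || _IOC_TYPE(op) != 0 || _IOC_SIZE(op) != 0;
}

/* Usage example: both styles map back to the same raw number. */
void demo_encoding_example(void)
{
	assert(demo_raw_nr(DEMO_U_CMD_ADD_DEV) == DEMO_CMD_ADD_DEV);
	assert(demo_raw_nr(DEMO_CMD_ADD_DEV) == DEMO_CMD_ADD_DEV);
	assert(demo_is_ioctl_encoded(DEMO_U_CMD_ADD_DEV));
	assert(!demo_is_ioctl_encoded(DEMO_CMD_ADD_DEV));
}
```

Because _IOC_NR() leaves a raw number unchanged, code that stores the command number in user_data can apply it unconditionally, which is what keeps the tag encoding in the patch the same for both opcode styles.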


^ permalink raw reply related

* [PATCH 0/2 blktests] Update the miniublk to use ioctl opcodes
From: Sebastian Chlad @ 2026-06-17  7:25 UTC (permalink / raw)
  To: linux-block; +Cc: shinichiro.kawasaki, Sebastian Chlad

miniublk currently uses only legacy opcodes. Kernels built without
CONFIG_BLKDEV_UBLK_LEGACY_OPCODES reject them with -EOPNOTSUPP, causing
all ublk tests to fail. The first patch solves the problem by switching to
ioctl-encoded opcodes, and the second patch adds a fallback to legacy
opcodes for testing older kernels.

I tested against an old 6.3 kernel that supports only legacy opcodes, against
a new kernel with both ioctl and legacy opcodes enabled, and against a new
kernel with ioctl opcodes and no support for the legacy ones.

Sebastian Chlad (2):
  src/miniublk: switch to ioctl-encoded ublk commands
  src/miniublk: fall back to legacy opcodes on older kernels

 src/miniublk.c | 77 +++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 57 insertions(+), 20 deletions(-)

-- 
2.51.0

^ permalink raw reply

* Re: [PATCH v2] blk-mq: bound blk_hctx_poll() to one jiffy
From: changfengnan @ 2026-06-17  7:07 UTC (permalink / raw)
  To: Anuj Gupta
  Cc: axboe, hch, kbusch, lidiangang, tom.leiming, nj.shetty, joshi.k,
	anuj1072538, linux-block, Anuj Gupta, Alok Rathore
In-Reply-To: <20260617060850.1244788-1-anuj20.g@samsung.com>

Looks good to me.
Reviewed-by: Fengnan Chang <changfengnan@bytedance.com>

> From: "Anuj Gupta"<anuj20.g@samsung.com>
> Date:  Wed, Jun 17, 2026, 14:15
> Subject:  [PATCH v2] blk-mq: bound blk_hctx_poll() to one jiffy
> To: <axboe@kernel.dk>, <hch@lst.de>, <kbusch@kernel.org>, <lidiangang@bytedance.com>, <changfengnan@bytedance.com>, <tom.leiming@gmail.com>, <nj.shetty@samsung.com>, <joshi.k@samsung.com>, <anuj1072538@gmail.com>
> Cc: <linux-block@vger.kernel.org>, "Anuj Gupta"<anuj20.g@samsung.com>, "Alok Rathore"<alok.rathore@samsung.com>
> blk_hctx_poll() can busy-poll until a completion is found or
> need_resched() becomes true. On preemptible kernels, the scheduler can
> set TIF_NEED_RESCHED on the timer tick and preempt the task at IRQ
> return before the loop condition re-evaluates it. After the context
> switch, the flag is cleared, so the poller can continue spinning instead
> of returning to its caller.
> 
> This can happen with io_uring IOPOLL reads inside iocb_bio_iopoll(),
> which holds the rcu_read_lock() while calling bio_poll(). If another
> poller on the same polled queue drains the available completions, this
> poller may repeatedly find no completions and remain inside the RCU
> read-side critical section long enough to trigger RCU stall reports:
> 
> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> rcu:     Tasks blocked on level-1 rcu_node (CPUs 0-9): P3961
> rcu:     (detected by 3, t=60002 jiffies, g=18533, q=4943 ncpus=20)
> task:fio state:R  running task     stack:0     pid:3961
> Call Trace:
> <TASK>
> ? nvme_poll+0x36/0xa0 [nvme]
> ? blk_hctx_poll+0x39/0x90
> ? blk_mq_poll+0x30/0x60
> ? bio_poll+0x87/0x170
> ? iocb_bio_iopoll+0x32/0x50
> ? io_uring_classic_poll+0x25/0x50
> ? io_do_iopoll+0x216/0x420
> ? __do_sys_io_uring_enter+0x2c7/0x7c0
> 
> Reproducible with:
> 
> fio -filename=/dev/nvme0n1 -direct=1 -size=4g -rw=randread \
> --numjobs=32 -bs=4K -ioengine=io_uring -hipri=1 -iodepth=1 \
> --registerfiles=1 --group_reporting --thread
> 
> Record the starting jiffy and exit the loop once jiffies has advanced.
> This bounds each blk_hctx_poll() invocation while also covering the
> case where the reschedule flag was cleared by the context switch
> before the loop condition could observe it.
> 
> Fixes: f22ecf9c14c1 ("blk-mq: delete task running check in blk_hctx_poll()")
> Suggested-by: Fengnan Chang <changfengnan@bytedance.com>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> Signed-off-by: Alok Rathore <alok.rathore@samsung.com>
> ---
>  block/blk-mq.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4c5c16cce4f8..ae6c5f4b80ce 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -5248,6 +5248,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
>                           struct io_comp_batch *iob, unsigned int flags)
>  {
>          int ret;
> +        unsigned long timeout = jiffies + 1;
>  
>          do {
>                  ret = q->mq_ops->poll(hctx, iob);
> @@ -5258,7 +5259,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
>                  if (ret < 0 || (flags & BLK_POLL_ONESHOT))
>                          break;
>                  cpu_relax();
> -        } while (!need_resched());
> +        } while (!need_resched() && time_before(jiffies, timeout));
>  
>          return 0;
>  }
> -- 
> 2.25.1
> 

^ permalink raw reply

* Re: [PATCH RFC 2/8] fs: add a global device to super block hash table
From: Christoph Hellwig @ 2026-06-17  6:25 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260616-fragil-duktus-nachverfolgen-60f54584c206@brauner>

On Tue, Jun 16, 2026 at 04:59:53PM +0200, Christian Brauner wrote:
> > Err, no.  block devices need to have a specific owner.  If erofs wants
> > to share a device between superblock it needs to come up with an entity
> > that owns the block devices which is not a superblock.
> 
> It already did.
> 
> > IMHO sharing devices between superblocks is a bad idea, but that ship
> > has sailed, but please keep it contained inside of erofs.
> 
> We need a simple device number to superblock mapping anyway and that can
> simply be centralized in the vfs. And it can work with anon device
> numbers and block device numbers uniformly.

No, we don't need a secondary device number to sb mapping.  On the other
hand we do need the deviceloss, freeze etc upcalls to work for owners
that are not file systems like mdraid or dm, even if they have been
slow to pick this up.  The whole idea of the holder ops is to abstract
away from who holds it instead of adding back the broken hard coding
of the superblock.  Otherwise you're just badly reinventing get_super.

If erofs already has an owner entity it just needs custom holder ops for
that.
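
The abstraction argued for here, upcalls that do not care who holds the device, can be sketched as a table of callbacks. This is a userspace illustration only: the demo_* types, the two callbacks shown, and the fs and raid holders are hypothetical and much simpler than the kernel's holder ops.

```c
#include <assert.h>
#include <stddef.h>

struct demo_bdev;

/* Callbacks a holder registers; the block side never looks behind them. */
struct demo_holder_ops {
	void (*mark_dead)(struct demo_bdev *bdev);
	int (*freeze)(struct demo_bdev *bdev);
};

struct demo_bdev {
	void *holder;				/* opaque owner */
	const struct demo_holder_ops *ops;
};

/* Two unrelated kinds of holder (both hypothetical). */
struct demo_sb { int frozen; int shutdown; };
struct demo_raid { int suspended; int degraded; };

static void demo_sb_mark_dead(struct demo_bdev *b)
{
	((struct demo_sb *)b->holder)->shutdown = 1;
}

static int demo_sb_freeze(struct demo_bdev *b)
{
	((struct demo_sb *)b->holder)->frozen = 1;
	return 0;
}

static void demo_raid_mark_dead(struct demo_bdev *b)
{
	((struct demo_raid *)b->holder)->degraded = 1;
}

static int demo_raid_freeze(struct demo_bdev *b)
{
	((struct demo_raid *)b->holder)->suspended = 1;
	return 0;
}

const struct demo_holder_ops demo_fs_ops = {
	.mark_dead = demo_sb_mark_dead,
	.freeze = demo_sb_freeze,
};

const struct demo_holder_ops demo_raid_ops = {
	.mark_dead = demo_raid_mark_dead,
	.freeze = demo_raid_freeze,
};

/* Block-side upcalls: identical code path for every kind of holder. */
void demo_bdev_mark_dead(struct demo_bdev *b)
{
	if (b->ops && b->ops->mark_dead)
		b->ops->mark_dead(b);
}

int demo_bdev_freeze(struct demo_bdev *b)
{
	return (b->ops && b->ops->freeze) ? b->ops->freeze(b) : 0;
}

/* Usage example: the same upcall reaches a filesystem and a raid holder. */
void demo_holder_example(void)
{
	struct demo_sb sb = { 0 };
	struct demo_raid raid = { 0 };
	struct demo_bdev a = { .holder = &sb, .ops = &demo_fs_ops };
	struct demo_bdev b = { .holder = &raid, .ops = &demo_raid_ops };

	demo_bdev_mark_dead(&a);
	demo_bdev_mark_dead(&b);
	assert(sb.shutdown == 1 && raid.degraded == 1);
}
```

In a model like this, an owner that is not a superblock only has to supply its own ops table; no device number to superblock lookup is involved.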

^ permalink raw reply

* Re: [PATCH 0/3] mm/zram: route block swap I/O through swap_ops
From: Christoph Hellwig @ 2026-06-17  6:19 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Jianyue Wu, Christoph Hellwig, Andrew Morton, Chris Li,
	Baoquan He, Nhat Pham, Barry Song, Kairui Song, Kemeng Shi,
	Youngjun Park, Minchan Kim, Jens Axboe, Matthew Wilcox (Oracle),
	Jan Kara, linux-mm, linux-kernel, linux-block, linux-doc,
	Brian Geffon
In-Reply-To: <ajIYFtADxQDq8q1P@google.com>

On Wed, Jun 17, 2026 at 12:46:53PM +0900, Sergey Senozhatsky wrote:
> Those are fantastic questions, thank you for asking them.
> Can we elaborate on zram being a "legacy interface"?

Compression is functionality that fundamentally belongs in the core
swap code, not a virtual block device.  Between the backing-store-less
zswap and the virtual swap layer, the core swap code is now getting
to the point where we don't need to rely on hacks like a compressing
ramdisk.

^ permalink raw reply

* Re: [PATCH 0/3] mm/zram: route block swap I/O through swap_ops
From: Christoph Hellwig @ 2026-06-17  6:17 UTC (permalink / raw)
  To: Jianyue Wu
  Cc: Christoph Hellwig, Andrew Morton, Chris Li, Baoquan He, Nhat Pham,
	Barry Song, Kairui Song, Kemeng Shi, Youngjun Park, Minchan Kim,
	Sergey Senozhatsky, Jens Axboe, Matthew Wilcox (Oracle), Jan Kara,
	linux-mm, linux-kernel, linux-block, linux-doc
In-Reply-To: <CAJxJ_jhK+zkpjhs3YsQ9RoasKYh+E0NweQci0sPAEY1ne5LmBA@mail.gmail.com>

On Wed, Jun 17, 2026 at 11:38:02AM +0800, Jianyue Wu wrote:
> Before I rework or drop the RFC, could you outline how you see that
> core-side model working? In particular:
>   - How should a compressed backend like zram or future block device
>     plug into swap_iocb / swap_ops?

I don't think that is the right layer.  The virtual swap layer that is
currently in the process of being upstreamed is the right level, and
the actual swap devices or swap files are just a dumb backend for what
the higher level code does.

>   - What role do you expect zram to keep while the legacy block interface
>     remains: current block swap only, or something else?

For now we'll need to keep it working as-is.  It is heavily used in
Android and potentially elsewhere.  Once we have zswap fully working
in the virtual swap layer world it might make sense to say never
compress again in zram when REQ_SWAP is set (or maybe a new
REQ_COMPRESSED) so that we can use the core compression code without
breaking existing setups.

^ permalink raw reply

* [PATCH v2] blk-mq: bound blk_hctx_poll() to one jiffy
From: Anuj Gupta @ 2026-06-17  6:08 UTC (permalink / raw)
  To: axboe, hch, kbusch, lidiangang, changfengnan, tom.leiming,
	nj.shetty, joshi.k, anuj1072538
  Cc: linux-block, Anuj Gupta, Alok Rathore
In-Reply-To: <CGME20260617061531epcas5p26e62bfdf2e91b646611191e4451d9843@epcas5p2.samsung.com>

blk_hctx_poll() can busy-poll until a completion is found or
need_resched() becomes true. On preemptible kernels, the scheduler can
set TIF_NEED_RESCHED on the timer tick and preempt the task at IRQ
return before the loop condition re-evaluates it. After the context
switch, the flag is cleared, so the poller can continue spinning instead
of returning to its caller.

This can happen with io_uring IOPOLL reads inside iocb_bio_iopoll(),
which holds the rcu_read_lock() while calling bio_poll(). If another
poller on the same polled queue drains the available completions, this
poller may repeatedly find no completions and remain inside the RCU
read-side critical section long enough to trigger RCU stall reports:

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu:     Tasks blocked on level-1 rcu_node (CPUs 0-9): P3961
rcu:     (detected by 3, t=60002 jiffies, g=18533, q=4943 ncpus=20)
task:fio state:R  running task     stack:0     pid:3961
Call Trace:
<TASK>
? nvme_poll+0x36/0xa0 [nvme]
? blk_hctx_poll+0x39/0x90
? blk_mq_poll+0x30/0x60
? bio_poll+0x87/0x170
? iocb_bio_iopoll+0x32/0x50
? io_uring_classic_poll+0x25/0x50
? io_do_iopoll+0x216/0x420
? __do_sys_io_uring_enter+0x2c7/0x7c0

Reproducible with:

fio -filename=/dev/nvme0n1 -direct=1 -size=4g -rw=randread \
--numjobs=32 -bs=4K -ioengine=io_uring -hipri=1 -iodepth=1 \
--registerfiles=1 --group_reporting --thread

Record the starting jiffy and exit the loop once jiffies has advanced.
This bounds each blk_hctx_poll() invocation while also covering the
case where the reschedule flag was cleared by the context switch
before the loop condition could observe it.

Fixes: f22ecf9c14c1 ("blk-mq: delete task running check in blk_hctx_poll()")
Suggested-by: Fengnan Chang <changfengnan@bytedance.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Alok Rathore <alok.rathore@samsung.com>
---
 block/blk-mq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c5c16cce4f8..ae6c5f4b80ce 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -5248,6 +5248,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			 struct io_comp_batch *iob, unsigned int flags)
 {
 	int ret;
+	unsigned long timeout = jiffies + 1;

 	do {
 		ret = q->mq_ops->poll(hctx, iob);
@@ -5258,7 +5259,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 		if (ret < 0 || (flags & BLK_POLL_ONESHOT))
 			break;
 		cpu_relax();
-	} while (!need_resched());
+	} while (!need_resched() && time_before(jiffies, timeout));

 	return 0;
 }
-- 
2.25.1
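
The loop bound added by this patch relies on time_before() comparing tick counters in a way that survives wraparound. The userspace model below is a sketch under stated assumptions: demo_time_before() mirrors the signed-difference comparison, while struct demo_clock, demo_jiffies() and demo_spin_until_tick() are hypothetical helpers that stand in for the kernel's jiffies counter and for a poll loop that never finds a completion. The cast from unsigned long to long assumes the usual two's-complement behaviour of GCC and Clang.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* True when tick a is earlier than tick b, even across counter wraparound. */
bool demo_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Fake tick source: the counter advances once every step_every reads. */
struct demo_clock {
	unsigned long now;
	unsigned long step_every;
	unsigned long reads;
};

unsigned long demo_jiffies(struct demo_clock *clk)
{
	if (++clk->reads % clk->step_every == 0)
		clk->now++;
	return clk->now;
}

/*
 * Model of the bounded loop: each iteration stands for one poll call that
 * finds nothing. Returns how many iterations ran before the tick advanced.
 */
unsigned long demo_spin_until_tick(struct demo_clock *clk)
{
	unsigned long timeout = demo_jiffies(clk) + 1;
	unsigned long spins = 0;

	do {
		spins++;
	} while (demo_time_before(demo_jiffies(clk), timeout));

	return spins;
}

/* Usage example: the loop still terminates when the counter wraps to 0. */
void demo_poll_bound_example(void)
{
	struct demo_clock clk = { .now = ULONG_MAX, .step_every = 3, .reads = 0 };

	assert(demo_spin_until_tick(&clk) == 2);
	assert(clk.now == 0);
}
```

A plain `jiffies < timeout` comparison would leave this loop after one iteration at the wrap point, because the deadline has wrapped to 0; the signed difference keeps the bound at one tick in every case.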

^ permalink raw reply related

* Re: [PATCH] blk-mq: bound blk_hctx_poll() to one jiffy
From: Anuj Gupta/Anuj Gupta @ 2026-06-17  6:14 UTC (permalink / raw)
  To: Fengnan, axboe, hch, kbusch, lidiangang, tom.leiming, nj.shetty,
	joshi.k, anuj1072538
  Cc: linux-block, Alok Rathore
In-Reply-To: <2e916cee-3a82-47ac-a416-b52a9744cdd5@bytedance.com>

On 6/12/2026 7:23 AM, Fengnan wrote:
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 4c5c16cce4f8..d85fa4a51e79 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -5248,6 +5248,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
>>    			 struct io_comp_batch *iob, unsigned int flags)
>>    {
>>    	int ret;
>> +	unsigned long start = jiffies;
> how about this :
> 
> unsigned long timeout = jiffies + 1;
> ...
> } while (!need_resched() && time_before(jiffies, timeout));

Thanks for taking a look.
These are functionally identical, but your form is the established idiom
elsewhere.
I will switch to that in v2.
--
Anuj

^ permalink raw reply

* Re: [PATCH 0/3] mm/zram: route block swap I/O through swap_ops
From: Sergey Senozhatsky @ 2026-06-17  6:10 UTC (permalink / raw)
  To: Jianyue Wu, Christoph Hellwig
  Cc: Sergey Senozhatsky, Andrew Morton, Chris Li, Baoquan He,
	Nhat Pham, Barry Song, Kairui Song, Kemeng Shi, Youngjun Park,
	Minchan Kim, Jens Axboe, Matthew Wilcox (Oracle), Jan Kara,
	linux-mm, linux-kernel, linux-block, linux-doc, Brian Geffon
In-Reply-To: <CAJxJ_jiM_-a52EOm896FXkdH+wRxjSHJx+MW6b-ewNLVkp4uSw@mail.gmail.com>

Hi,

On (26/06/17 13:44), Jianyue Wu wrote:
> Hello Sergey,
> 
> On Wed, Jun 17, 2026 at 11:46 AM Sergey Senozhatsky
> <senozhatsky@chromium.org> wrote:
> > Can we elaborate on zram being a "legacy interface"?
> My previous wording was ambiguous. Actually I didn't mean it is a
> legacy interface.

Oh, your wording wasn't ambiguous.  I simply forgot to direct my
previous email to Christoph.

^ permalink raw reply

* Re: [PATCH 0/3] mm/zram: route block swap I/O through swap_ops
From: Jianyue Wu @ 2026-06-17  5:44 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Christoph Hellwig, Andrew Morton, Chris Li, Baoquan He, Nhat Pham,
	Barry Song, Kairui Song, Kemeng Shi, Youngjun Park, Minchan Kim,
	Jens Axboe, Matthew Wilcox (Oracle), Jan Kara, linux-mm,
	linux-kernel, linux-block, linux-doc, Brian Geffon
In-Reply-To: <ajIYFtADxQDq8q1P@google.com>

Hello Sergey,

On Wed, Jun 17, 2026 at 11:46 AM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
> Can we elaborate on zram being a "legacy interface"?
My previous wording was ambiguous. Actually I didn't mean it is a
legacy interface.
I was only comparing it with the new compressed swap implementation.
AFAIK, zram is widely used in many products such as Android, automotive and IoT.
Its usage and interface should remain unchanged, as the impact of changing
them would be significant.

Thanks,
Jianyue

^ permalink raw reply

* Re: [PATCH blktests] scsi/009: fix unset bytes_to_write in TEST 8
From: Shin'ichiro Kawasaki @ 2026-06-17  4:29 UTC (permalink / raw)
  To: Sebastian Chlad; +Cc: linux-block, Sebastian Chlad, alan.adamson
In-Reply-To: <20260614181651.11554-2-sebastian.chlad@suse.com>

CC+ Alan,

On Jun 14, 2026 / 20:16, Sebastian Chlad wrote:
> bytes_to_write was never assigned before TEST 8, causing it to pass for
> the wrong reason. Set it to atomic_unit_max_bytes + logical_block_size
> and update the golden output with the expected "pwrite: Invalid argument"
> from xfs_io.
> 
> Signed-off-by: Sebastian Chlad <sebastian.chlad@suse.com>

Thanks. The change looks good to me.

I will wait a few more days in case anyone has an opinion on the change.
FYI: Sebastian posted a similar change for nvme/059 [*].

[*] https://github.com/linux-blktests/blktests/pull/245

^ permalink raw reply

* Re: [PATCH 0/3] mm/zram: route block swap I/O through swap_ops
From: Sergey Senozhatsky @ 2026-06-17  3:46 UTC (permalink / raw)
  To: Jianyue Wu
  Cc: Christoph Hellwig, Andrew Morton, Chris Li, Baoquan He, Nhat Pham,
	Barry Song, Kairui Song, Kemeng Shi, Youngjun Park, Minchan Kim,
	Sergey Senozhatsky, Jens Axboe, Matthew Wilcox (Oracle), Jan Kara,
	linux-mm, linux-kernel, linux-block, linux-doc, Brian Geffon
In-Reply-To: <CAJxJ_jhK+zkpjhs3YsQ9RoasKYh+E0NweQci0sPAEY1ne5LmBA@mail.gmail.com>

Cc-ing Brian

On (26/06/17 11:38), Jianyue Wu wrote:
> > I fear this is going entirely in the wrong direction.
> OK. I was trying to build on your swap_iocb / swap_ops rework
> for the zram swap path, but I take your point that compressed swap can
> be handled more nicely.
> 
> > Yes, we have to keep zram around as a legacy interface for now,
> > but the right place to deal with compressed swap is in the core.
> I agree that compressed swap is better handled in the core, so that not
> only RAM but also the block layer can use it.
> 
> Before I rework or drop the RFC, could you outline how you see that
> core-side model working? In particular:
>   - How should a compressed backend like zram or future block device
>     plug into swap_iocb / swap_ops?
>   - What role do you expect zram to keep while the legacy block interface
>     remains: current block swap only, or something else?

Those are fantastic questions, thank you for asking them.
Can we elaborate on zram being a "legacy interface"?

^ permalink raw reply

* Re: [PATCH 0/3] mm/zram: route block swap I/O through swap_ops
From: Jianyue Wu @ 2026-06-17  3:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Chris Li, Baoquan He, Nhat Pham, Barry Song,
	Kairui Song, Kemeng Shi, Youngjun Park, Minchan Kim,
	Sergey Senozhatsky, Jens Axboe, Matthew Wilcox (Oracle), Jan Kara,
	linux-mm, linux-kernel, linux-block, linux-doc
In-Reply-To: <20260616123646.GB21024@lst.de>

Hi Christoph,

Thanks for the feedback.

> I fear this is going entirely in the wrong direction.
OK. I was trying to build on your swap_iocb / swap_ops rework
for the zram swap path, but I take your point that compressed swap can
be handled more nicely.

> Yes, we have to keep zram around as a legacy interface for now,
> but the right place to deal with compressed swap is in the core.
I agree that compressed swap is better handled in the core, so that not
only RAM but also the block layer can use it.

Before I rework or drop the RFC, could you outline how you see that
core-side model working? In particular:
  - How should a compressed backend like zram or future block device
    plug into swap_iocb / swap_ops?
  - What role do you expect zram to keep while the legacy block interface
    remains: current block swap only, or something else?

I am open to reworking the series toward a core-based approach once
the intended direction is clearer.

Thanks,
Jianyue

^ permalink raw reply

* Re: [PATCH RFC v2 15/18] f2fs: open via dedicated fs bdev helpers
From: Chao Yu @ 2026-06-17  3:17 UTC (permalink / raw)
  To: Christian Brauner, Jan Kara
  Cc: chao, Christoph Hellwig, Jens Axboe, Alexander Viro, linux-block,
	linux-kernel, linux-fsdevel, Carlos Maiolino, linux-xfs,
	Chris Mason, David Sterba, linux-btrfs, Theodore Ts'o,
	linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260616-work-super-bdev_holder_global-v2-15-7df6b864028e@kernel.org>

On 6/16/26 22:08, Christian Brauner wrote:
> Route the extra device opens of a multi-device f2fs through
> fs_bdev_file_open_by_path() so each device is registered against the
> superblock, and convert the matching release in destroy_device_list()
> to fs_bdev_file_release(). The first device aliases the main bdev file
> opened by setup_bdev_super() and is already registered through it.
> 
> f2fs opened its extra devices without holder ops, so a freeze, sync, or
> removal of one of them was never propagated to the superblock.
> Registering them wires those events up: every device now freezes,
> thaws, syncs, and shuts down the filesystem like the main device does.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Acked-by: Chao Yu <chao@kernel.org>

Thanks,

^ permalink raw reply

* Re: [PATCH 2/2] dm-raid1: don't fail the mirror for invalid I/O errors
From: Dr. David Alan Gilbert @ 2026-06-16 23:47 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, dm-devel, linux-block, mpatocka, Vjaceslavs Klimovs
In-Reply-To: <ajHgIWdT0QmeF_t4@kbusch-mbp>

* Keith Busch (kbusch@kernel.org) wrote:
> On Tue, Jun 16, 2026 at 08:09:18PM +0000, Dr. David Alan Gilbert wrote:
> > Interesting; I've got your:
> >   dm-raid1: don't fail the mirror for invalid I/O errors
> >   For DM_IO_BIO requests, do_region() built each destination bio by walking..
> > on top of e21ee273e6fa3879aec9a27251cfce98156e07c4 which is just before 7.1
> >   I've not got your https://lore.kernel.org/linux-block/20260612223205.465913-1-kbusch@meta.com/
> > 
> > root@dalek:/home/dg# lvcreate  --mirrors 1 -L 1G main /dev/sda2 /dev/sdb2
> > root@dalek:/home/dg# mkfs.ext4 /dev/mapper/main-lvol0
> > root@dalek:/home/dg# mount /dev/mapper/main-lvol0 /mnt/tmp/
> > root@dalek:/home/dg# chmod a+rwx /mnt/tmp
> > 
> > dg@dalek:~$ dd if=/dev/zero of=/mnt/tmp/testfile bs=1024k count=1
> > 
> > my two tests are separate tests:
> 
> Goodness, I'm struggling here. Unpatched, I have no problem recreating
> the issues you've described, but everything I've tried with the
> proposals included gets the expected results. I'm running out of ideas
> on replicating your results with hardware I have, but it's getting late,
> so I'll try to have new ideas tomorrow.

OK, no problem - let me know if there's any useful diags I can gather;
would blktrace or function tracing or something help?

Sleep tight!

Dave

-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/

^ permalink raw reply

* Re: [PATCH 2/2] dm-raid1: don't fail the mirror for invalid I/O errors
From: Keith Busch @ 2026-06-16 23:45 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Keith Busch, dm-devel, linux-block, mpatocka, Vjaceslavs Klimovs
In-Reply-To: <ajGtbuJ2kXo1GZ1d@gallifrey>

On Tue, Jun 16, 2026 at 08:09:18PM +0000, Dr. David Alan Gilbert wrote:
> Interesting; I've got your:
>   dm-raid1: don't fail the mirror for invalid I/O errors
>   For DM_IO_BIO requests, do_region() built each destination bio by walking..
> on top of e21ee273e6fa3879aec9a27251cfce98156e07c4 which is just before 7.1
>   I've not got your https://lore.kernel.org/linux-block/20260612223205.465913-1-kbusch@meta.com/
> 
> root@dalek:/home/dg# lvcreate  --mirrors 1 -L 1G main /dev/sda2 /dev/sdb2
> root@dalek:/home/dg# mkfs.ext4 /dev/mapper/main-lvol0
> root@dalek:/home/dg# mount /dev/mapper/main-lvol0 /mnt/tmp/
> root@dalek:/home/dg# chmod a+rwx /mnt/tmp
> 
> dg@dalek:~$ dd if=/dev/zero of=/mnt/tmp/testfile bs=1024k count=1
> 
> my two tests are separate tests:

Goodness, I'm struggling here. Unpatched, I have no problem recreating
the issues you've described, but everything I've tried with the
proposals included gets the expected results. I'm running out of ideas
on replicating your results with hardware I have, but it's getting late,
so I'll try to have new ideas tomorrow.

^ permalink raw reply

* Re: (subset) [PATCH v4 0/3] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO
From: Jens Axboe @ 2026-06-16 20:51 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs, Qu Wenruo
In-Reply-To: <cover.1781597506.git.wqu@suse.com>


On Tue, 16 Jun 2026 17:42:34 +0930, Qu Wenruo wrote:
> [CHANGELOG]
> v4:
> - Follow iomap/block layer code style to avoid lines over 80 chars
> 
> - Reject NOWAIT BOUNCE direct writes inside btrfs
>   The iomap code still allocates memory with GFP_KERNEL in other
>   locations.
>   For now just disable NOWAIT BOUNCE direct writes and let the caller
>   fall back to blocking mode.
> 
> [...]

Applied, thanks!

[1/3] block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write()
      commit: b68d4979c88e31488970373f67ac79b4f6267008
[2/3] block: respect iov_iter::nofault flag in bio_iov_iter_bounce_write()
      commit: d5b58fbb2fd7ac25fcd7e1c14730f998a90b0322

Best regards,
-- 
Jens Axboe




^ permalink raw reply

* Re: [PATCH v4 0/3] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO
From: Jens Axboe @ 2026-06-16 20:48 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo
  Cc: linux-btrfs, linux-block, linux-fsdevel, linux-xfs
In-Reply-To: <ajFFYT10i1dvAy89@infradead.org>

On 6/16/26 6:45 AM, Christoph Hellwig wrote:
> Note: You'll need to include Jens for the block bits to get either an
> ACK or a merge through the block tree.

I can queue 1-2, then Qu can push the btrfs change in once that lands.

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH net v2 1/2] iov_iter: export iov_iter_restore
From: Jens Axboe @ 2026-06-16 20:47 UTC (permalink / raw)
  To: Octavian Purdila, netdev
  Cc: Alexander Viro, Andrew Morton, Arseniy Krasnov, David S. Miller,
	Eric Dumazet, Eugenio Pérez, Jakub Kicinski, Jason Wang, kvm,
	linux-block, linux-fsdevel, linux-kernel, Michael S. Tsirkin,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Stefano Garzarella,
	virtualization, Xuan Zhuo
In-Reply-To: <20260613000953.467473-2-tavip@google.com>

On 6/12/26 6:09 PM, Octavian Purdila wrote:
> Export iov_iter_restore so that it can be used by modules.
> 
> This is needed by the virtio vsock transport (which can be built as a
> module) to restore the msg_iter state when transmission fails.
> 
> Signed-off-by: Octavian Purdila <tavip@google.com>
> ---
>  lib/iov_iter.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 243662af1af73..067e745f9ef53 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1491,6 +1491,7 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
>  		i->__iov -= state->nr_segs - i->nr_segs;
>  	i->nr_segs = state->nr_segs;
>  }
> +EXPORT_SYMBOL(iov_iter_restore);

I don't have a problem exporting this to modules, but any new export
should be _GPL. So please change it to that.

-- 
Jens Axboe

^ permalink raw reply

* Re: Landlock: LANDLOCK_ACCESS_FS_IOCTL_DEV bypass via io_uring IORING_OP_URING_CMD
From: Jens Axboe @ 2026-06-16 20:36 UTC (permalink / raw)
  To: Bryam Vargas, Mickaël Salaün
  Cc: Günther Noack, Paul Moore, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, linux-security-module, io-uring, linux-block,
	linux-nvme, linux-kernel
In-Reply-To: <20260616201633.275067-1-hexlabsecurity@proton.me>

On 6/16/26 2:16 PM, Bryam Vargas wrote:
> Hello Mickaël, and Landlock / io_uring folks,
> 
> A task confined by a Landlock ruleset that grants READ_FILE/WRITE_FILE
> on a block or NVMe character device but withholds
> LANDLOCK_ACCESS_FS_IOCTL_DEV can still reach the device-command
> surface through io_uring IORING_OP_URING_CMD with the IOCTL_DEV check
> bypassed: the request enters the device-command handler (block
> discard, or the NVMe char-device passthrough) where the equivalent
> ioctl(2) is denied. The destructive completion and the NVMe-admin
> surface follow from the code -- see Impact.

I've said this before, but apparently it hasn't been received - this
isn't an io_uring issue. If landlock is missing a hook, then that's on
landlock and they should add it. Other security handlers already have
that. Hence no need to broadcast this to a bunch of lists, it's strictly
a landlock issue.

-- 
Jens Axboe

^ permalink raw reply

* Landlock: LANDLOCK_ACCESS_FS_IOCTL_DEV bypass via io_uring IORING_OP_URING_CMD
From: Bryam Vargas @ 2026-06-16 20:16 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Günther Noack, Paul Moore, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, linux-security-module, io-uring,
	linux-block, linux-nvme, linux-kernel

Hello Mickaël, and Landlock / io_uring folks,

A task confined by a Landlock ruleset that grants READ_FILE/WRITE_FILE on a block
or NVMe character device but withholds LANDLOCK_ACCESS_FS_IOCTL_DEV can still
reach the device-command surface through io_uring IORING_OP_URING_CMD with the
IOCTL_DEV check bypassed: the request enters the device-command handler (block
discard, or the NVMe char-device passthrough) where the equivalent ioctl(2) is
denied. The destructive completion and the NVMe-admin surface follow from the
code -- see Impact.

Affected
--------
Any kernel with CONFIG_SECURITY_LANDLOCK=y and Landlock enabled that supports
LANDLOCK_ACCESS_FS_IOCTL_DEV (Landlock ABI >= 5, since Linux 6.10) and io_uring
uring_cmd for the device class (block BLOCK_URING_CMD_DISCARD; NVMe passthrough).
Confirmed by source inspection on mainline (v7.1-rc7) and reproduced on Linux
7.0.11 (Landlock ABI 8). The confined task needs a writable fd to a device it is
legitimately allowed to use (e.g. a partition/loop device or an NVMe namespace
passed into a container or granted by the ruleset); no CAP is required to reach
the io_uring path. The gap is structural -- Landlock has never registered a
uring_cmd hook -- so it is present from ABI 5 (Linux 6.10) through current
mainline (v7.1-rc7) and is not a regression tied to a single Fixes: commit.

Root cause
----------
On the ioctl(2) path, the syscall handler in fs/ioctl.c calls
security_file_ioctl() (its only call site on the ioctl(2) path) before
dispatching to do_vfs_ioctl(); that reaches Landlock hook_file_ioctl_common(),
which denies a device ioctl unless the file's
allowed_access holds LANDLOCK_ACCESS_FS_IOCTL_DEV (BLKDISCARD/BLKSECDISCARD/
BLKZEROOUT and NVMe passthrough are not in the is_masked_device_ioctl()
allow-list, so they require the right).

io_uring reaches the same device-command surface by a different producer:

  IORING_OP_URING_CMD -> io_uring_cmd()   io_uring/uring_cmd.c
   -> security_uring_cmd(ioucmd)          (the ONLY LSM gate on this path)
   -> file->f_op->uring_cmd()             e.g. blkdev_uring_cmd() / nvme_ns_chr_uring_cmd()

Landlock's LSM_HOOK_INIT list (security/landlock/fs.c, net.c, task.c) registers
file_ioctl/file_ioctl_compat but no uring_cmd hook -- only SELinux
(selinux_uring_cmd) and Smack (smack_uring_cmd) gate this surface -- so
security_uring_cmd() returns 0 for a Landlocked task and hook_file_ioctl /
IOCTL_DEV is never consulted. For block, blkdev_cmd_discard() is then gated only
by BLK_OPEN_WRITE; for NVMe, nvme_ns_chr_uring_cmd() reaches the admin/IO
passthrough with no security_file_ioctl on the path. There is no shared helper
that re-applies the IOCTL_DEV check.

SELinux and Smack hooking uring_cmd while Landlock does not is the coverage
asymmetry; the Landlock documentation describes IOCTL_DEV as gating ioctl(2) but
does not mention io_uring.
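
The asymmetry described above can be illustrated with a toy model. This is
only a sketch, not kernel code: it assumes a simplified dispatch in which a
security_*() entry point calls every registered callback and returns the
first denial, or 0 when no LSM registered the hook. The hook names
(file_ioctl, uring_cmd) come from the text above; Lsm, call_hook and
requires_ioctl_dev are hypothetical names introduced for illustration.

```python
# Toy model of LSM hook dispatch (an assumption-laden sketch, not kernel code).
EACCES = 13


class Lsm:
    """A security module with a table of hook name -> callback."""

    def __init__(self, name, hooks):
        self.name = name
        self.hooks = hooks  # callback(rights) returns 0 or a negative errno


def call_hook(lsms, hook, rights):
    """Return the first non-zero result, or 0 if nobody registered the hook."""
    for lsm in lsms:
        callback = lsm.hooks.get(hook)
        if callback is not None:
            rc = callback(rights)
            if rc != 0:
                return rc
    return 0


def requires_ioctl_dev(rights):
    """Deny unless the file's allowed access includes IOCTL_DEV."""
    return 0 if "IOCTL_DEV" in rights else -EACCES


# Landlock as described in the report: file_ioctl is hooked, uring_cmd is not.
landlock_today = Lsm("landlock", {"file_ioctl": requires_ioctl_dev})
# Hypothetical variant that also gates uring_cmd with the same check.
landlock_with_hook = Lsm(
    "landlock",
    {"file_ioctl": requires_ioctl_dev, "uring_cmd": requires_ioctl_dev},
)

granted = {"READ_FILE", "WRITE_FILE"}  # the rule withholds IOCTL_DEV

ioctl_rc = call_hook([landlock_today], "file_ioctl", granted)
uring_rc = call_hook([landlock_today], "uring_cmd", granted)
fixed_rc = call_hook([landlock_with_hook], "uring_cmd", granted)
print(ioctl_rc, uring_rc, fixed_rc)
```

In this model the ioctl path is denied while the uring_cmd path returns 0
simply because no callback exists for it; registering one closes the gap
without touching the caller.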

Reproducer
----------
A self-contained PoC is available on request (it needs root only to set up a loop
block device and open it; Landlock enforcement is uid-independent, so the
confined child demonstrates the gap regardless of the setup uid). The child
applies a Landlock ruleset handling READ_FILE|WRITE_FILE|IOCTL_DEV with a rule
granting only READ_FILE|WRITE_FILE on the device, then:

  (1) ioctl(fd, BLKDISCARD, range)        -> -EACCES  (Landlock enforces IOCTL_DEV)
  (2) IORING_OP_URING_CMD,
      cmd_op = BLOCK_URING_CMD_DISCARD     -> reaches the block command handler

Observed on Linux 7.0.11 (Landlock ABI 8):

  [1] ioctl(BLKDISCARD)   -> ret=-1 errno=13 (Permission denied)
  [2] uring_cmd(DISCARD)  -> cqe.res=-22 (Invalid argument)

A Landlock denial is always -EACCES; the io_uring path returned -EINVAL, which
originates in a post-authorization check inside the block command handler
(blk_validate_byte_range() in blkdev_cmd_discard()), reached only after
security_uring_cmd() returned 0. So this run demonstrates the authorization
bypass -- the request traversed the LSM gate into the block device-command
handler with no IOCTL_DEV check -- and then failed a parameter check, not an
authorization check. The destructive completion (an authorized discard with a
granularity-aligned range) is the expected behaviour but was not exercised in
this run.

Impact
------
Demonstrated: the LANDLOCK_ACCESS_FS_IOCTL_DEV authorization is bypassed. The
device-command request reaches the block command handler with no Landlock check;
the only remaining gate is BLK_OPEN_WRITE (held, since the policy granted write).
Inferred from the code, not exercised here: an authorized DISCARD with a valid
range completes (DISCARD/secure-erase semantics, destroying on-device data), and
the same missing hook leaves the NVMe char-device uring_cmd surface ungated --
nvme_ns_chr_uring_cmd (namespace device /dev/nvmeXnY) -> nvme_ns_uring_cmd for
NVME_URING_CMD_IO/IO_VEC passthrough, and nvme_dev_uring_cmd (controller device
/dev/nvmeX) for NVME_URING_CMD_ADMIN (format, sanitize, firmware download,
security send) -- both reach f_op->uring_cmd with no Landlock/IOCTL_DEV gate.

So the confirmed finding is a missing authorization (the confined task escapes
its own IOCTL_DEV restriction); the destructive data effect and the NVMe-admin
high-water-mark follow from the code but are not shown in the run above. The
proven authorization bypass alone scores CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:C/C:N/I:H/A:N
(6.5 Medium) -- S:C because the confined task crosses the Landlock policy
boundary it was placed under, I:H because the bypassed path reaches a handler
whose authorized completion modifies device data. With the device command
completing destructively the projected ceiling is
CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:C/C:N/I:H/A:H (8.4 High), the A:H component
reasoned from the source rather than executed. No memory safety is involved.
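
The two base scores quoted above can be re-derived from the vectors. The
following is a minimal sketch of the CVSS v3.1 base-score equations, assuming
the metric weights and the Roundup definition published in the FIRST
specification; base_score and roundup are names chosen here for illustration,
and temporal and environmental metrics are not handled.

```python
import math

# CVSS v3.1 base metric weights (per the FIRST specification).
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.5}


def roundup(value):
    """Round up to one decimal, using integers to avoid float noise."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector):
    """Compute the base score for a 'CVSS:3.1/...' vector string."""
    metrics = dict(part.split(":") for part in vector.split("/")[1:])
    changed = metrics["S"] == "C"
    iss = 1 - (
        (1 - CIA[metrics["C"]]) * (1 - CIA[metrics["I"]]) * (1 - CIA[metrics["A"]])
    )
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    pr = (PR_CHANGED if changed else PR_UNCHANGED)[metrics["PR"]]
    exploitability = 8.22 * AV[metrics["AV"]] * AC[metrics["AC"]] * pr * UI[metrics["UI"]]
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if changed:
        total *= 1.08  # scope-changed multiplier
    return roundup(min(total, 10))


proven = base_score("CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:C/C:N/I:H/A:N")
ceiling = base_score("CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:C/C:N/I:H/A:H")
print(proven, ceiling)
```

Under those assumptions the sketch reproduces the 6.5 and 8.4 figures given
above.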

Suggested direction
-------------------
Have Landlock register a uring_cmd hook that maps the device command to the same
checks the ioctl path applies (IOCTL_DEV, and truncate where relevant), so a
single chokepoint covers every f_op->uring_cmd provider (block, NVMe, ublk, and
any future one). This mirrors how SELinux/Smack already gate this surface.

I am happy to send a patch for this if you would like.

Best regards,

Bryam Vargas
Independent security researcher, HEXLAB S.A.S., Cali, Colombia
hexlabsecurity@proton.me

^ permalink raw reply
