* [PATCH v2 3/3] dm: add documentation for dm-inlinecrypt target
From: Linlin Zhang @ 2026-04-10 13:40 UTC (permalink / raw)
To: linux-block, ebiggers, mpatocka, gmazyland
Cc: linux-kernel, adrianvovk, dm-devel, quic_mdalam, israelr, hch,
axboe
In-Reply-To: <20260410134031.2880675-1-linlin.zhang@oss.qualcomm.com>
This adds the admin-guide documentation for dm-inlinecrypt.
dm-inlinecrypt.rst is the guide to using dm-inlinecrypt.
Signed-off-by: Linlin Zhang <linlin.zhang@oss.qualcomm.com>
---
.../device-mapper/dm-inlinecrypt.rst | 122 ++++++++++++++++++
1 file changed, 122 insertions(+)
create mode 100644 Documentation/admin-guide/device-mapper/dm-inlinecrypt.rst
diff --git a/Documentation/admin-guide/device-mapper/dm-inlinecrypt.rst b/Documentation/admin-guide/device-mapper/dm-inlinecrypt.rst
new file mode 100644
index 000000000000..c302ba73fc38
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-inlinecrypt.rst
@@ -0,0 +1,122 @@
+========
+dm-inlinecrypt
+========
+
+Device-Mapper's "inlinecrypt" target provides transparent encryption of block devices
+using the inline encryption hardware.
+
+For a more detailed description of inline encryption, see:
+https://docs.kernel.org/block/inline-encryption.html
+
+Parameters::
+
+ <cipher> <key> <iv_offset> <device path> \
+ <offset> [<#opt_params> <opt_params>]
+
+<cipher>
+ Encryption cipher type.
+
+ The cipher specifications format is::
+
+ cipher
+
+ Examples::
+
+ aes-xts-plain64
+
+ The cipher type is correspond one-to-one with encryption modes. For
+ instance, the corresponding crypto mode of aes-xts-plain64 is
+ BLK_ENCRYPTION_MODE_AES_256_XTS.
+
+<key>
+ Key used for encryption. It is encoded either as a hexadecimal number
+ or it can be passed as <key_string> prefixed with single colon
+ character (':') for keys residing in kernel keyring service.
+ You can only use key sizes that are valid for the selected cipher.
+ Note that the size in bytes of a valid key must be in bellow range.
+
+ [BLK_CRYPTO_KEY_TYPE_RAW, BLK_CRYPTO_KEY_TYPE_HW_WRAPPED]
+
+<key_string>
+ The kernel keyring key is identified by string in following format:
+ <key_size>:<key_type>:<key_description>.
+
+<key_size>
+ The encryption key size in bytes. The kernel key payload size must match
+ the value passed in <key_size>.
+
+<key_type>
+ Either 'logon', or 'trusted' kernel key type.
+
+<key_description>
+ The kernel keyring key description inlinecrypt target should look for
+ when loading key of <key_type>.
+
+<iv_offset>
+ The IV offset is a sector count that is added to the sector number
+ before creating the IV.
+
+<device path>
+ This is the device that is going to be used as backend and contains the
+ encrypted data. You can specify it as a path like /dev/xxx or a device
+ number <major>:<minor>.
+
+<offset>
+ Starting sector within the device where the encrypted data begins.
+
+<#opt_params>
+ Number of optional parameters. If there are no optional parameters,
+ the optional parameters section can be skipped or #opt_params can be zero.
+ Otherwise #opt_params is the number of following arguments.
+
+ Example of optional parameters section:
+ allow_discards sector_size:4096 iv_large_sectors
+
+allow_discards
+ Block discard requests (a.k.a. TRIM) are passed through the inlinecrypt
+ device. The default is to ignore discard requests.
+
+ WARNING: Assess the specific security risks carefully before enabling this
+ option. For example, allowing discards on encrypted devices may lead to
+ the leak of information about the ciphertext device (filesystem type,
+ used space etc.) if the discarded blocks can be located easily on the
+ device later.
+
+sector_size:<bytes>
+ Use <bytes> as the encryption unit instead of 512 bytes sectors.
+ This option can be in range 512 - 4096 bytes and must be power of two.
+ Virtual device will announce this size as a minimal IO and logical sector.
+
+iv_large_sectors
+ IV generators will use sector number counted in <sector_size> units
+ instead of default 512 bytes sectors.
+
+ For example, if <sector_size> is 4096 bytes, plain64 IV for the second
+ sector will be 8 (without flag) and 1 if iv_large_sectors is present.
+ The <iv_offset> must be multiple of <sector_size> (in 512 bytes units)
+ if this flag is specified.
+
+Example scripts
+===============
+LUKS (Linux Unified Key Setup) is now the preferred way to set up disk
+encryption with dm-inlinecrypt using the 'cryptsetup' utility, see
+https://gitlab.com/cryptsetup/cryptsetup
+
+::
+
+ #!/bin/sh
+ # Create a inlinecrypt device using dmsetup
+ dmsetup create inlinecrypt1 --table "0 `blockdev --getsz $1` inlinecrypt aes-xts-plain64 babebabebabebabebabebabebabebabebabebabebabebabebabebabebabebabe 0 $1 0"
+
+::
+
+ #!/bin/sh
+ # Create a inlinecrypt device using dmsetup when encryption key is stored in keyring service
+ dmsetup create inlinecrypt2 --table "0 `blockdev --getsz $1` inlinecrypt aes-xts-plain64 :64:logon:fde:dminlinecrypt_test_key 0 $1 0"
+
+::
+
+ #!/bin/sh
+ # Create a inlinecrypt device using cryptsetup and LUKS header with default cipher
+ cryptsetup luksFormat $1
+ cryptsetup luksOpen $1 inlinecrypt1
--
2.34.1
^ permalink raw reply related
* [PATCH v2 2/3] dm-inlinecrypt: add target for inline block device encryption
From: Linlin Zhang @ 2026-04-10 13:40 UTC (permalink / raw)
To: linux-block, ebiggers, mpatocka, gmazyland
Cc: linux-kernel, adrianvovk, dm-devel, quic_mdalam, israelr, hch,
axboe
In-Reply-To: <20260410134031.2880675-1-linlin.zhang@oss.qualcomm.com>
From: Eric Biggers <ebiggers@google.com>
Add a new device-mapper target "dm-inlinecrypt" that is similar to
dm-crypt but uses the blk-crypto API instead of the regular crypto API.
This allows it to take advantage of inline encryption hardware such as
that commonly built into UFS host controllers.
The table syntax matches dm-crypt's, but for now only a stripped-down
set of parameters is supported. For example, for now AES-256-XTS is the
only supported cipher.
dm-inlinecrypt is based on Android's dm-default-key with the
controversial passthrough support removed. Note that due to the removal
of passthrough support, use of dm-inlinecrypt in combination with
fscrypt causes double encryption of file contents (similar to dm-crypt +
fscrypt), with the fscrypt layer not being able to use the inline
encryption hardware. This makes dm-inlinecrypt unusable on systems such
as Android that use fscrypt and where a more optimized approach is
needed. It is however suitable as a replacement for dm-crypt.
dm-inlinecrypt supports both keyring key and hex key, the former avoids
the key to be exposed in dm-table message. Similar to dm-default-key in
Android, it will fallabck to the software block crypto once the inline
crypto hardware cannot support the expected cipher.
Test:
dmsetup create inlinecrypt_logon --table "0 `blockdev --getsz $1` \
inlinecrypt aes-xts-plain64 :64:logon:fde:dminlinecrypt_test_key 0 $1 0"
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Linlin Zhang <linlin.zhang@oss.qualcomm.com>
---
drivers/md/Kconfig | 10 +
drivers/md/Makefile | 1 +
drivers/md/dm-inlinecrypt.c | 559 ++++++++++++++++++++++++++++++++++++
3 files changed, 570 insertions(+)
create mode 100644 drivers/md/dm-inlinecrypt.c
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index c58a9a8ea54e..aa541cc22ecc 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -313,6 +313,16 @@ config DM_CRYPT
If unsure, say N.
+config DM_INLINECRYPT
+ tristate "Inline encryption target support"
+ depends on BLK_DEV_DM
+ depends on BLK_INLINE_ENCRYPTION
+ help
+ This device-mapper target is similar to dm-crypt, but it uses the
+ blk-crypto API instead of the regular crypto API. This allows it to
+ take advantage of inline encryption hardware such as that commonly
+ built into UFS host controllers.
+
config DM_SNAPSHOT
tristate "Snapshot target"
depends on BLK_DEV_DM
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index c338cc6fbe2e..517d1f7d8288 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -55,6 +55,7 @@ obj-$(CONFIG_DM_UNSTRIPED) += dm-unstripe.o
obj-$(CONFIG_DM_BUFIO) += dm-bufio.o
obj-$(CONFIG_DM_BIO_PRISON) += dm-bio-prison.o
obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
+obj-$(CONFIG_DM_INLINECRYPT) += dm-inlinecrypt.o
obj-$(CONFIG_DM_DELAY) += dm-delay.o
obj-$(CONFIG_DM_DUST) += dm-dust.o
obj-$(CONFIG_DM_FLAKEY) += dm-flakey.o
diff --git a/drivers/md/dm-inlinecrypt.c b/drivers/md/dm-inlinecrypt.c
new file mode 100644
index 000000000000..b6e98fdf8af1
--- /dev/null
+++ b/drivers/md/dm-inlinecrypt.c
@@ -0,0 +1,559 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright 2024 Google LLC
+ */
+
+#include <linux/blk-crypto.h>
+#include <linux/ctype.h>
+#include <linux/device-mapper.h>
+#include <linux/hex.h>
+#include <linux/module.h>
+#include <keys/user-type.h>
+
+#define DM_MSG_PREFIX "inlinecrypt"
+
+static const struct dm_inlinecrypt_cipher {
+ const char *name;
+ enum blk_crypto_mode_num mode_num;
+} dm_inlinecrypt_ciphers[] = {
+ {
+ .name = "aes-xts-plain64",
+ .mode_num = BLK_ENCRYPTION_MODE_AES_256_XTS,
+ },
+};
+
+/**
+ * struct inlinecrypt_ctx - private data of an inlinecrypt target
+ * @dev: the underlying device
+ * @start: starting sector of the range of @dev which this target actually maps.
+ * For this purpose a "sector" is 512 bytes.
+ * @cipher_string: the name of the encryption algorithm being used
+ * @iv_offset: starting offset for IVs. IVs are generated as if the target were
+ * preceded by @iv_offset 512-byte sectors.
+ * @sector_size: crypto sector size in bytes (usually 4096)
+ * @sector_bits: log2(sector_size)
+ * @key: the encryption key to use
+ * @max_dun: the maximum DUN that may be used (computed from other params)
+ */
+struct inlinecrypt_ctx {
+ struct dm_dev *dev;
+ sector_t start;
+ const char *cipher_string;
+ unsigned int key_size;
+ u64 iv_offset;
+ unsigned int sector_size;
+ unsigned int sector_bits;
+ struct blk_crypto_key key;
+ u64 max_dun;
+};
+
+static const struct dm_inlinecrypt_cipher *
+lookup_cipher(const char *cipher_string)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(dm_inlinecrypt_ciphers); i++) {
+ if (strcmp(cipher_string, dm_inlinecrypt_ciphers[i].name) == 0)
+ return &dm_inlinecrypt_ciphers[i];
+ }
+ return NULL;
+}
+
+static void inlinecrypt_dtr(struct dm_target *ti)
+{
+ struct inlinecrypt_ctx *ctx = ti->private;
+
+ if (ctx->dev) {
+ if (ctx->key.size)
+ blk_crypto_evict_key(ctx->dev->bdev, &ctx->key);
+ dm_put_device(ti, ctx->dev);
+ }
+ kfree_sensitive(ctx->cipher_string);
+ kfree_sensitive(ctx);
+}
+
+static bool contains_whitespace(const char *str)
+{
+ while (*str)
+ if (isspace(*str++))
+ return true;
+ return false;
+}
+
+static int set_key_user(struct key *key, char *bin_key,
+ const unsigned int bin_key_size)
+{
+ const struct user_key_payload *ukp;
+
+ ukp = user_key_payload_locked(key);
+ if (!ukp)
+ return -EKEYREVOKED;
+
+ if (bin_key_size != ukp->datalen)
+ return -EINVAL;
+
+ memcpy(bin_key, ukp->data, bin_key_size);
+
+ return 0;
+}
+
+static int inlinecrypt_get_keyring_key(const char *key_string, u8 *bin_key,
+ const unsigned int bin_key_size)
+{
+ char *key_desc;
+ int ret;
+ struct key_type *type;
+ struct key *key;
+ int (*set_key)(struct key *key, char *bin_key,
+ const unsigned int bin_key_size);
+
+ /*
+ * Reject key_string with whitespace. dm core currently lacks code for
+ * proper whitespace escaping in arguments on DM_TABLE_STATUS path.
+ */
+ if (contains_whitespace(key_string)) {
+ DMERR("whitespace chars not allowed in key string");
+ return -EINVAL;
+ }
+
+ /* look for next ':' separating key_type from key_description */
+ key_desc = strchr(key_string, ':');
+ if (!key_desc || key_desc == key_string || !strlen(key_desc + 1))
+ return -EINVAL;
+
+ if (!strncmp(key_string, "logon:", key_desc - key_string + 1)) {
+ type = &key_type_logon;
+ set_key = set_key_user;
+ } else {
+ return -EINVAL;
+ }
+
+ key = request_key(type, key_desc + 1, NULL);
+ if (IS_ERR(key))
+ return PTR_ERR(key);
+
+ down_read(&key->sem);
+
+ ret = set_key(key, (char *)bin_key, bin_key_size);
+ if (ret < 0) {
+ up_read(&key->sem);
+ key_put(key);
+ return ret;
+ }
+
+ up_read(&key->sem);
+ key_put(key);
+
+ return ret;
+}
+
+static int inlinecrypt_get_key(const char *key_string,
+ u8 key[BLK_CRYPTO_MAX_ANY_KEY_SIZE],
+ const unsigned int key_size)
+{
+ int ret = 0;
+
+ /* ':' means the key is in kernel keyring, short-circuit normal key processing */
+ if (key_string[0] == ':') {
+ if (key_size > BLK_CRYPTO_MAX_ANY_KEY_SIZE) {
+ DMERR("Invalid keysize");
+ return -EINVAL;
+ }
+ /* key string should be :<logon|user>:<key_desc> */
+ ret = inlinecrypt_get_keyring_key(key_string + 1, key, key_size);
+ goto out;
+ }
+
+ if (key_size > 2 * BLK_CRYPTO_MAX_ANY_KEY_SIZE
+ || key_size % 2
+ || !key_size) {
+ DMERR("Invalid keysize");
+ return -EINVAL;
+ }
+ if (hex2bin(key, key_string, key_size) != 0)
+ ret = -EINVAL;
+
+out:
+ return ret;
+}
+
+static int get_key_size(char **key_string)
+{
+ char *colon, dummy;
+ int ret;
+
+ if (*key_string[0] != ':')
+ return strlen(*key_string) >> 1;
+
+ /* look for next ':' in key string */
+ colon = strpbrk(*key_string + 1, ":");
+ if (!colon)
+ return -EINVAL;
+
+ if (sscanf(*key_string + 1, "%u%c", &ret, &dummy) != 2 || dummy != ':')
+ return -EINVAL;
+
+ /* remaining key string should be :<logon|user>:<key_desc> */
+ *key_string = colon;
+
+ return ret;
+}
+
+static int inlinecrypt_ctr_optional(struct dm_target *ti,
+ unsigned int argc, char **argv)
+{
+ struct inlinecrypt_ctx *ctx = ti->private;
+ struct dm_arg_set as;
+ static const struct dm_arg _args[] = {
+ {0, 3, "Invalid number of feature args"},
+ };
+ unsigned int opt_params;
+ const char *opt_string;
+ bool iv_large_sectors = false;
+ char dummy;
+ int err;
+
+ as.argc = argc;
+ as.argv = argv;
+
+ err = dm_read_arg_group(_args, &as, &opt_params, &ti->error);
+ if (err)
+ return err;
+
+ while (opt_params--) {
+ opt_string = dm_shift_arg(&as);
+ if (!opt_string) {
+ ti->error = "Not enough feature arguments";
+ return -EINVAL;
+ }
+ if (!strcmp(opt_string, "allow_discards")) {
+ ti->num_discard_bios = 1;
+ } else if (sscanf(opt_string, "sector_size:%u%c",
+ &ctx->sector_size, &dummy) == 1) {
+ if (ctx->sector_size < SECTOR_SIZE ||
+ ctx->sector_size > 4096 ||
+ !is_power_of_2(ctx->sector_size)) {
+ ti->error = "Invalid sector_size";
+ return -EINVAL;
+ }
+ } else if (!strcmp(opt_string, "iv_large_sectors")) {
+ iv_large_sectors = true;
+ } else {
+ ti->error = "Invalid feature arguments";
+ return -EINVAL;
+ }
+ }
+
+ /* dm-inlinecrypt doesn't implement iv_large_sectors=false. */
+ if (ctx->sector_size != SECTOR_SIZE && !iv_large_sectors) {
+ ti->error = "iv_large_sectors must be specified";
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+/*
+ * Construct an inlinecrypt mapping:
+ * <cipher> [<key>|:<key_size>:<logon>:<key_description>] <iv_offset> <dev_path> <start>
+ *
+ * This syntax matches dm-crypt's, but the set of supported functionality has
+ * been stripped down.
+ */
+static int inlinecrypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+ struct inlinecrypt_ctx *ctx;
+ const struct dm_inlinecrypt_cipher *cipher;
+ u8 raw_key[BLK_CRYPTO_MAX_ANY_KEY_SIZE];
+ unsigned int dun_bytes;
+ unsigned long long tmpll;
+ char dummy;
+ int err;
+
+ if (argc < 5) {
+ ti->error = "Not enough arguments";
+ return -EINVAL;
+ }
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx) {
+ ti->error = "Out of memory";
+ return -ENOMEM;
+ }
+ ti->private = ctx;
+
+ /* <cipher> */
+ ctx->cipher_string = kstrdup(argv[0], GFP_KERNEL);
+ if (!ctx->cipher_string) {
+ ti->error = "Out of memory";
+ err = -ENOMEM;
+ goto bad;
+ }
+ cipher = lookup_cipher(ctx->cipher_string);
+ if (!cipher) {
+ ti->error = "Unsupported cipher";
+ err = -EINVAL;
+ goto bad;
+ }
+
+ /* <key> */
+ ctx->key_size = get_key_size(&argv[1]);
+ if (ctx->key_size < 0) {
+ ti->error = "Cannot parse key size";
+ return -EINVAL;
+ }
+ err = inlinecrypt_get_key(argv[1], raw_key, ctx->key_size);
+ if (err) {
+ ti->error = "Malformed key string";
+ goto bad;
+ }
+
+ /* <iv_offset> */
+ if (sscanf(argv[2], "%llu%c", &ctx->iv_offset, &dummy) != 1) {
+ ti->error = "Invalid iv_offset sector";
+ err = -EINVAL;
+ goto bad;
+ }
+
+ /* <dev_path> */
+ err = dm_get_device(ti, argv[3], dm_table_get_mode(ti->table),
+ &ctx->dev);
+ if (err) {
+ ti->error = "Device lookup failed";
+ goto bad;
+ }
+
+ /* <start> */
+ if (sscanf(argv[4], "%llu%c", &tmpll, &dummy) != 1 ||
+ tmpll != (sector_t)tmpll) {
+ ti->error = "Invalid start sector";
+ err = -EINVAL;
+ goto bad;
+ }
+ ctx->start = tmpll;
+
+ /* optional arguments */
+ ctx->sector_size = SECTOR_SIZE;
+ if (argc > 5) {
+ err = inlinecrypt_ctr_optional(ti, argc - 5, &argv[5]);
+ if (err)
+ goto bad;
+ }
+ ctx->sector_bits = ilog2(ctx->sector_size);
+ if (ti->len & ((ctx->sector_size >> SECTOR_SHIFT) - 1)) {
+ ti->error = "Device size is not a multiple of sector_size";
+ err = -EINVAL;
+ goto bad;
+ }
+
+ ctx->max_dun = (ctx->iv_offset + ti->len - 1) >>
+ (ctx->sector_bits - SECTOR_SHIFT);
+ dun_bytes = DIV_ROUND_UP(fls64(ctx->max_dun), 8);
+
+ err = blk_crypto_init_key(&ctx->key, raw_key, ctx->key_size,
+ BLK_CRYPTO_KEY_TYPE_RAW,
+ cipher->mode_num, dun_bytes,
+ ctx->sector_size);
+ if (err) {
+ ti->error = "Error initializing blk-crypto key";
+ goto bad;
+ }
+
+ err = blk_crypto_start_using_key(ctx->dev->bdev, &ctx->key);
+ if (err) {
+ ti->error = "Error starting to use blk-crypto";
+ goto bad;
+ }
+
+ ti->num_flush_bios = 1;
+
+ err = 0;
+ goto out;
+
+bad:
+ inlinecrypt_dtr(ti);
+out:
+ memzero_explicit(raw_key, sizeof(raw_key));
+ return err;
+}
+
+static int inlinecrypt_map(struct dm_target *ti, struct bio *bio)
+{
+ const struct inlinecrypt_ctx *ctx = ti->private;
+ sector_t sector_in_target;
+ u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE] = {};
+
+ bio_set_dev(bio, ctx->dev->bdev);
+
+ /*
+ * If the bio is a device-level request which doesn't target a specific
+ * sector, there's nothing more to do.
+ */
+ if (bio_sectors(bio) == 0)
+ return DM_MAPIO_REMAPPED;
+
+ /*
+ * The bio should never have an encryption context already, since
+ * dm-inlinecrypt doesn't pass through any inline encryption
+ * capabilities to the layer above it.
+ */
+ if (WARN_ON_ONCE(bio_has_crypt_ctx(bio)))
+ return DM_MAPIO_KILL;
+
+ /* Map the bio's sector to the underlying device. (512-byte sectors) */
+ sector_in_target = dm_target_offset(ti, bio->bi_iter.bi_sector);
+ bio->bi_iter.bi_sector = ctx->start + sector_in_target;
+ /*
+ * If the bio doesn't have any data (e.g. if it's a DISCARD request),
+ * there's nothing more to do.
+ */
+ if (!bio_has_data(bio))
+ return DM_MAPIO_REMAPPED;
+
+ /* Calculate the DUN and enforce data-unit (crypto sector) alignment. */
+ dun[0] = ctx->iv_offset + sector_in_target; /* 512-byte sectors */
+ if (dun[0] & ((ctx->sector_size >> SECTOR_SHIFT) - 1))
+ return DM_MAPIO_KILL;
+ dun[0] >>= ctx->sector_bits - SECTOR_SHIFT; /* crypto sectors */
+
+ /*
+ * This check isn't necessary as we should have calculated max_dun
+ * correctly, but be safe.
+ */
+ if (WARN_ON_ONCE(dun[0] > ctx->max_dun))
+ return DM_MAPIO_KILL;
+
+ bio_crypt_set_ctx(bio, &ctx->key, dun, GFP_NOIO);
+
+ /*
+ * Since we've added an encryption context to the bio and
+ * blk-crypto-fallback may be needed to process it, it's necessary to
+ * use the fallback-aware bio submission code rather than
+ * unconditionally returning DM_MAPIO_REMAPPED.
+ *
+ * To get the correct accounting for a dm target in the case where
+ * __blk_crypto_submit_bio() doesn't take ownership of the bio (returns
+ * true), call __blk_crypto_submit_bio() directly and return
+ * DM_MAPIO_REMAPPED in that case, rather than relying on
+ * blk_crypto_submit_bio() which calls submit_bio() in that case.
+ */
+ if (__blk_crypto_submit_bio(bio))
+ return DM_MAPIO_REMAPPED;
+ return DM_MAPIO_SUBMITTED;
+}
+
+static void inlinecrypt_status(struct dm_target *ti, status_type_t type,
+ unsigned int status_flags, char *result,
+ unsigned int maxlen)
+{
+ const struct inlinecrypt_ctx *ctx = ti->private;
+ unsigned int sz = 0;
+ int num_feature_args = 0;
+
+ switch (type) {
+ case STATUSTYPE_INFO:
+ case STATUSTYPE_IMA:
+ result[0] = '\0';
+ break;
+
+ case STATUSTYPE_TABLE:
+ /*
+ * Warning: like dm-crypt, dm-inlinecrypt includes the key in
+ * the returned table. Userspace is responsible for redacting
+ * the key when needed.
+ */
+ DMEMIT("%s %*phN %llu %s %llu", ctx->cipher_string,
+ ctx->key.size, ctx->key.bytes, ctx->iv_offset,
+ ctx->dev->name, ctx->start);
+ num_feature_args += !!ti->num_discard_bios;
+ if (ctx->sector_size != SECTOR_SIZE)
+ num_feature_args += 2;
+ if (num_feature_args != 0) {
+ DMEMIT(" %d", num_feature_args);
+ if (ti->num_discard_bios)
+ DMEMIT(" allow_discards");
+ if (ctx->sector_size != SECTOR_SIZE) {
+ DMEMIT(" sector_size:%u", ctx->sector_size);
+ DMEMIT(" iv_large_sectors");
+ }
+ }
+ break;
+ }
+}
+
+static int inlinecrypt_prepare_ioctl(struct dm_target *ti,
+ struct block_device **bdev, unsigned int cmd,
+ unsigned long arg, bool *forward)
+{
+ const struct inlinecrypt_ctx *ctx = ti->private;
+ const struct dm_dev *dev = ctx->dev;
+
+ *bdev = dev->bdev;
+
+ /* Only pass ioctls through if the device sizes match exactly. */
+ return ctx->start != 0 || ti->len != bdev_nr_sectors(dev->bdev);
+}
+
+static int inlinecrypt_iterate_devices(struct dm_target *ti,
+ iterate_devices_callout_fn fn,
+ void *data)
+{
+ const struct inlinecrypt_ctx *ctx = ti->private;
+
+ return fn(ti, ctx->dev, ctx->start, ti->len, data);
+}
+
+#ifdef CONFIG_BLK_DEV_ZONED
+static int inlinecrypt_report_zones(struct dm_target *ti,
+ struct dm_report_zones_args *args,
+ unsigned int nr_zones)
+{
+ const struct inlinecrypt_ctx *ctx = ti->private;
+
+ return dm_report_zones(ctx->dev->bdev, ctx->start,
+ ctx->start + dm_target_offset(ti, args->next_sector),
+ args, nr_zones);
+}
+#else
+#define inlinecrypt_report_zones NULL
+#endif
+
+static void inlinecrypt_io_hints(struct dm_target *ti,
+ struct queue_limits *limits)
+{
+ const struct inlinecrypt_ctx *ctx = ti->private;
+ const unsigned int sector_size = ctx->sector_size;
+
+ limits->logical_block_size =
+ max_t(unsigned int, limits->logical_block_size, sector_size);
+ limits->physical_block_size =
+ max_t(unsigned int, limits->physical_block_size, sector_size);
+ limits->io_min = max_t(unsigned int, limits->io_min, sector_size);
+ limits->dma_alignment = limits->logical_block_size - 1;
+}
+
+static struct target_type inlinecrypt_target = {
+ .name = "inlinecrypt",
+ .version = {1, 0, 0},
+ /*
+ * Do not set DM_TARGET_PASSES_CRYPTO, since dm-inlinecrypt consumes the
+ * crypto capability itself.
+ */
+ .features = DM_TARGET_ZONED_HM,
+ .module = THIS_MODULE,
+ .ctr = inlinecrypt_ctr,
+ .dtr = inlinecrypt_dtr,
+ .map = inlinecrypt_map,
+ .status = inlinecrypt_status,
+ .prepare_ioctl = inlinecrypt_prepare_ioctl,
+ .iterate_devices = inlinecrypt_iterate_devices,
+ .report_zones = inlinecrypt_report_zones,
+ .io_hints = inlinecrypt_io_hints,
+};
+
+module_dm(inlinecrypt);
+
+MODULE_AUTHOR("Eric Biggers <ebiggers@google.com>");
+MODULE_AUTHOR("Linlin Zhang <linlin.zhang@oss.qualcomm.com>");
+MODULE_DESCRIPTION(DM_NAME " target for inline encryption");
+MODULE_LICENSE("GPL");
--
2.34.1
^ permalink raw reply related
* [PATCH v2 1/3] block: export blk-crypto symbols required by dm-inlinecrypt
From: Linlin Zhang @ 2026-04-10 13:40 UTC (permalink / raw)
To: linux-block, ebiggers, mpatocka, gmazyland
Cc: linux-kernel, adrianvovk, dm-devel, quic_mdalam, israelr, hch,
axboe
In-Reply-To: <20260410134031.2880675-1-linlin.zhang@oss.qualcomm.com>
From: Eric Biggers <ebiggers@google.com>
bio_crypt_set_ctx(), blk_crypto_init_key(), and
blk_crypto_start_using_key() are needed to use inline encryption; see
Documentation/block/inline-encryption.rst. Export them so that
dm-inlinecrypt can use them. The only reason these weren't exported
before was that inline encryption was previously used only by fs/crypto/
which is built-in code.
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
block/blk-crypto.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/block/blk-crypto.c b/block/blk-crypto.c
index 856d3c5b1fa0..40a99a859748 100644
--- a/block/blk-crypto.c
+++ b/block/blk-crypto.c
@@ -116,6 +116,7 @@ void bio_crypt_set_ctx(struct bio *bio, const struct blk_crypto_key *key,
bio->bi_crypt_context = bc;
}
+EXPORT_SYMBOL_GPL(bio_crypt_set_ctx);
void __bio_crypt_free_ctx(struct bio *bio)
{
@@ -349,6 +350,7 @@ int blk_crypto_init_key(struct blk_crypto_key *blk_key,
return 0;
}
+EXPORT_SYMBOL_GPL(blk_crypto_init_key);
bool blk_crypto_config_supported_natively(struct block_device *bdev,
const struct blk_crypto_config *cfg)
@@ -399,6 +401,7 @@ int blk_crypto_start_using_key(struct block_device *bdev,
}
return blk_crypto_fallback_start_using_mode(key->crypto_cfg.crypto_mode);
}
+EXPORT_SYMBOL_GPL(blk_crypto_start_using_key);
/**
* blk_crypto_evict_key() - Evict a blk_crypto_key from a block_device
--
2.34.1
^ permalink raw reply related
* [PATCH v2 0/3] dm-inlinecrypt: add target for inline block device encryption
From: Linlin Zhang @ 2026-04-10 13:40 UTC (permalink / raw)
To: linux-block, ebiggers, mpatocka, gmazyland
Cc: linux-kernel, adrianvovk, dm-devel, quic_mdalam, israelr, hch,
axboe
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 1582 bytes --]
This patch series is based on Erics work posted at:
https://lore.kernel.org/all/ecbb7ea8-11f6-30c1-ad77-bd984c52ca33@quicinc.com/
Erics patches introduce a new dm target, dm-inlinecrypt, to support inline
block-device encryption. The implementation builds on the work previously done
in Androids dm-default-key, but intentionally drops passthrough support,
as that functionality does not appear likely to be accepted upstream in the
near future. With this limitation, dm-inlinecrypt is positioned as a
practical replacement for dm-crypt, rather than a general passthrough
mechanism.
On top of Erics series, keyring key support is added in dm-inlinecrypt. Thus,
both keyring key and hex key are feasible for dm-inlinecrypt. In addition,
dm-inlinecrypt.rst is added as the user-guide of dm-inlinecrypt.
V1:
https://lore.kernel.org/all/20260304121729.1532469-1-linlin.zhang@oss.qualcomm.com/
Eric Biggers (2):
block: export blk-crypto symbols required by dm-inlinecrypt
dm-inlinecrypt: add target for inline block device encryption
Linlin Zhang (1):
dm: add documentation for dm-inlinecrypt target
.../device-mapper/dm-inlinecrypt.rst | 122 ++++
block/blk-crypto.c | 3 +
drivers/md/Kconfig | 10 +
drivers/md/Makefile | 1 +
drivers/md/dm-inlinecrypt.c | 559 ++++++++++++++++++
5 files changed, 695 insertions(+)
create mode 100644 Documentation/admin-guide/device-mapper/dm-inlinecrypt.rst
create mode 100644 drivers/md/dm-inlinecrypt.c
--
2.34.1
^ permalink raw reply
* Re: [PATCH 4/8] FOLD: block: change the defer in task context interface to be procedural
From: Matthew Wilcox @ 2026-04-10 13:26 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Tal Zussman, Jens Axboe, Christian Brauner, Darrick J. Wong,
Carlos Maiolino, Al Viro, Jan Kara, Dave Chinner, Bart Van Assche,
Gao Xiang, linux-block, linux-kernel, linux-xfs, linux-fsdevel,
linux-mm
In-Reply-To: <20260410061725.GA24667@lst.de>
On Fri, Apr 10, 2026 at 08:17:25AM +0200, Christoph Hellwig wrote:
> On Thu, Apr 09, 2026 at 09:18:50PM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 09, 2026 at 06:02:17PM +0200, Christoph Hellwig wrote:
> > > @@ -1836,9 +1837,7 @@ void bio_endio(struct bio *bio)
> > > }
> > > #endif
> > >
> > > - if (!in_task() && bio_flagged(bio, BIO_COMPLETE_IN_TASK))
> > > - bio_queue_completion(bio);
> > > - else if (bio->bi_end_io)
> > > + if (bio->bi_end_io)
> > > bio->bi_end_io(bio);
> >
> > What I liked about this before is that we had one central place that
> > needed to be changed. This change means that every bi_end_io now needs
> > to check whether the BIO can be completed in its context.
>
> Yes. On the other hand we can actually use it when we don't know if
> we need to offload beforehand, which enabls the two later conversions
> and probably more.
I don't understand why we need to remove _this_ way to defer completions
to take context in order to _add_ the ability to defer completions
inside the bi_end_io handler.
^ permalink raw reply
* Re: [PATCH] ublk: fix tautological comparison warning in ublk_ctrl_reg_buf
From: Jens Axboe @ 2026-04-10 13:04 UTC (permalink / raw)
To: linux-block, Ming Lei; +Cc: Caleb Sander Mateos, kernel test robot
In-Reply-To: <20260410124136.3983429-1-tom.leiming@gmail.com>
On Fri, 10 Apr 2026 20:41:36 +0800, Ming Lei wrote:
> On 32-bit architectures, 'unsigned long size' can never exceed
> UBLK_SHMEM_BUF_SIZE_MAX (1ULL << 32), causing a tautological
> comparison warning. Validate buf_reg.len (__u64) directly before
> using it, and consolidate all input validation into a single check.
>
> Also remove the unnecessary local variables 'addr' and 'size' since
> buf_reg.addr and buf_reg.len can be used directly.
>
> [...]
Applied, thanks!
[1/1] ublk: fix tautological comparison warning in ublk_ctrl_reg_buf
commit: 36446de0c30c62b9d89502fd36c4904996d86ecd
Best regards,
--
Jens Axboe
^ permalink raw reply
* [PATCH] ublk: fix tautological comparison warning in ublk_ctrl_reg_buf
From: Ming Lei @ 2026-04-10 12:41 UTC (permalink / raw)
To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Ming Lei, kernel test robot
On 32-bit architectures, 'unsigned long size' can never exceed
UBLK_SHMEM_BUF_SIZE_MAX (1ULL << 32), causing a tautological
comparison warning. Validate buf_reg.len (__u64) directly before
using it, and consolidate all input validation into a single check.
Also remove the unnecessary local variables 'addr' and 'size' since
buf_reg.addr and buf_reg.len can be used directly.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604101952.3NOzqnu9-lkp@intel.com/
Fixes: 23b3b6f0b584 ("ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support")
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
drivers/block/ublk_drv.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 247c1ce8ce8a..49fb584e392b 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -5330,7 +5330,7 @@ static int ublk_ctrl_reg_buf(struct ublk_device *ub,
{
void __user *argp = (void __user *)(unsigned long)header->addr;
struct ublk_shmem_buf_reg buf_reg;
- unsigned long addr, size, nr_pages;
+ unsigned long nr_pages;
struct page **pages = NULL;
unsigned int gup_flags;
unsigned int memflags;
@@ -5352,14 +5352,12 @@ static int ublk_ctrl_reg_buf(struct ublk_device *ub,
if (buf_reg.reserved)
return -EINVAL;
- addr = buf_reg.addr;
- size = buf_reg.len;
- nr_pages = size >> PAGE_SHIFT;
-
- if (!size || size > UBLK_SHMEM_BUF_SIZE_MAX ||
- !PAGE_ALIGNED(size) || !PAGE_ALIGNED(addr))
+ if (!buf_reg.len || buf_reg.len > UBLK_SHMEM_BUF_SIZE_MAX ||
+ !PAGE_ALIGNED(buf_reg.len) || !PAGE_ALIGNED(buf_reg.addr))
return -EINVAL;
+ nr_pages = buf_reg.len >> PAGE_SHIFT;
+
/* Pin pages before any locks (may sleep) */
pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
if (!pages)
@@ -5369,7 +5367,7 @@ static int ublk_ctrl_reg_buf(struct ublk_device *ub,
if (!(buf_reg.flags & UBLK_SHMEM_BUF_READ_ONLY))
gup_flags |= FOLL_WRITE;
- pinned = pin_user_pages_fast(addr, nr_pages, gup_flags, pages);
+ pinned = pin_user_pages_fast(buf_reg.addr, nr_pages, gup_flags, pages);
if (pinned < 0) {
ret = pinned;
goto err_free_pages;
--
2.53.0
^ permalink raw reply related
* Re: [RFC PATCH 0/2] rust block-mq TagSet flags plumbing and rnull blocking wiring
From: Andreas Hindborg @ 2026-04-10 12:39 UTC (permalink / raw)
To: Wenzhao Liao, axboe, ojeda, linux-block, rust-for-linux
Cc: bjorn3_gh, aliceryhl, lossin, boqun, dakr, gary, sunke, tmgross,
linux-kernel
In-Reply-To: <20260410090829.1409430-1-wenzhaoliao@ruc.edu.cn>
Hi Wenzhao,
"Wenzhao Liao" <wenzhaoliao@ruc.edu.cn> writes:
> This RFC series fills a practical gap in the Rust block-mq abstraction by
> exposing blk_mq_tag_set.flags safely, then wires one in-tree consumer
> (`rnull`) via configfs as a reference.
Patches for this functionality is already submitted to list [1]. Please
take a look if those patches solve your requirement. If you require this
functionality in the kernel, please help by reviewing the patches in
that series.
Best regards,
Andreas Hindborg
[1] https://lore.kernel.org/rust-for-linux/20260216-rnull-v6-19-rc5-send-v1-12-de9a7af4b469@kernel.org/
^ permalink raw reply
* Re: [PATCH 07/13] libmultipath: Add delayed removal support
From: John Garry @ 2026-04-10 11:49 UTC (permalink / raw)
To: Nilay Shroff, Hannes Reinecke, hch, kbusch, sagi, axboe,
martin.petersen, james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
In-Reply-To: <8502c8d6-2958-4c46-bdba-95b91c2ceb10@linux.ibm.com>
On 10/04/2026 11:51, Nilay Shroff wrote:
>> About the module refcounting, as I mentioned earlier it's hard to test
>> this effectively. We could use lsmod to check refcount on nvme ko
>> during the delayed removal window and ensure that it was incremented.
>> I'm not sure if it is robust and whether the complexity is worth it.
>>
> Regarding module refcnt, I think that's easily available
> if we read /sys/module/<mod-name>/refcnt. We may not
> need to parse lsmod output.
Sure, so we would be testing this behaviour:
# echo 40 >
/sys/class/nvme-subsystem/nvme-subsys1/nvme1n1/delayed_removal_secs
# cat /sys/module/nvme_core/refcnt
3
# nvme disconnect -n nvme-test-target
[ 562.533253] nvme nvme1: Removing ctrl: NQN "nvme-test-target"
[ 562.565462] nvme nvme2: Removing ctrl: NQN "nvme-test-target"
NQN:nvme-test-target disconnected 2 controller(s)
# cat /sys/module/nvme_core/refcnt
4
# sleep 40
# cat /sys/module/nvme_core/refcnt
3
#
File /sys/module/nvme_core/refcnt would not exist for when it's builtin,
but I don't think that blktests even support builtin modules.
Cheers
^ permalink raw reply
* Re: [PATCH 07/13] libmultipath: Add delayed removal support
From: Nilay Shroff @ 2026-04-10 10:51 UTC (permalink / raw)
To: John Garry, Hannes Reinecke, hch, kbusch, sagi, axboe,
martin.petersen, james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
In-Reply-To: <8d9c2ee9-0c2b-4c64-badc-b5b0fc1eaf67@oracle.com>
On 4/10/26 3:19 PM, John Garry wrote:
> On 10/04/2026 10:09, Nilay Shroff wrote:
>>>> It seems there may be a race here if we attempt to write to $ns before
>>>> the reconnect has completed in _delayed_nvme_reconnect_ctrl.
>>>>
>>>> If the intention is simply to verify that the controller reconnect occurs
>>>> within the delayed removal window and test pwrite,
>>>
>>> Not exactly. I want to verify that if I write between the disconnect and the reconnect, then we write succeeds.
>>
>> Okay, got it — I think I misunderstood the intention earlier.
>>
>> So the goal here is to verify that if a write is issued during the
>> delayed removal window is in progress (i.e., when there is temporarily
>> no active path), the write should be queued. Once the reconnect succeeds,
>> the queued write should then be unblocked and sent to the target.
>
> Yeah, that's it. Otherwise, the write will be queued but then eventually fail (for no reconnect).
>
>>
>> If this understanding is correct, then this looks like a good test
>> to me.
> thanks
>
> About the module refcounting, as I mentioned earlier it's hard to test this effectively. We could use lsmod to check refcount on nvme ko during the delayed removal window and ensure that it was incremented. I'm not sure if it is robust and whether the complexity is worth it.
>
Regarding module refcnt, I think that's easily available
if we read /sys/module/<mod-name>/refcnt. We may not
need to parse lsmod output.
Thanks,
--Nilay
^ permalink raw reply
* Re: [PATCH 07/13] libmultipath: Add delayed removal support
From: John Garry @ 2026-04-10 9:49 UTC (permalink / raw)
To: Nilay Shroff, Hannes Reinecke, hch, kbusch, sagi, axboe,
martin.petersen, james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
In-Reply-To: <a1d72045-7b0e-4354-8365-f21f03765659@linux.ibm.com>
On 10/04/2026 10:09, Nilay Shroff wrote:
>>> It seems there may be a race here if we attempt to write to $ns before
>>> the reconnect has completed in _delayed_nvme_reconnect_ctrl.
>>>
>>> If the intention is simply to verify that the controller reconnect
>>> occurs
>>> within the delayed removal window and test pwrite,
>>
>> Not exactly. I want to verify that if I write between the disconnect
>> and the reconnect, then we write succeeds.
>
> Okay, got it — I think I misunderstood the intention earlier.
>
> So the goal here is to verify that if a write is issued during the
> delayed removal window is in progress (i.e., when there is temporarily
> no active path), the write should be queued. Once the reconnect succeeds,
> the queued write should then be unblocked and sent to the target.
Yeah, that's it. Otherwise, the write will be queued but then eventually
fail (for no reconnect).
>
> If this understanding is correct, then this looks like a good test
> to me.
thanks
About the module refcounting, as I mentioned earlier it's hard to test
this effectively. We could use lsmod to check refcount on nvme ko during
the delayed removal window and ensure that it was incremented. I'm not
sure if it is robust and whether the complexity is worth it.
Cheers,
John
^ permalink raw reply
* Re: [PATCH 07/13] libmultipath: Add delayed removal support
From: Nilay Shroff @ 2026-04-10 9:09 UTC (permalink / raw)
To: John Garry, Hannes Reinecke, hch, kbusch, sagi, axboe,
martin.petersen, james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
In-Reply-To: <b77d5eab-d50f-4102-8bfb-f907cf39ca56@oracle.com>
On 4/10/26 2:25 PM, John Garry wrote:
> On 10/04/2026 08:06, Nilay Shroff wrote:
>>> # Part b: Ensure writes work for intermittent disconnect
>>> _nvme_connect_subsys
>>>
>>> nvmedev=$(_find_nvme_dev "${def_subsysnqn}")
>>> ns=$(_find_nvme_ns "${def_subsys_uuid}")
>>> echo 10 > "/sys/block/"$ns"/delayed_removal_secs"
>>> bytes_written=$(run_xfs_io_pwritev2 /dev/"$ns" 4096)
>>> if [ "$bytes_written" != 4096 ]; then
>>> echo "could not write successfully initially"
>>> fi
>>> sleep 1
>>> _nvme_disconnect_ctrl "${nvmedev}"
>>> sleep 1
>>> ns=$(_find_nvme_ns "${def_subsys_uuid}")
>>> if [[ "${ns}" = "" ]]; then
>>> echo "could not find ns after disconnect"
>>> fi
>>> _delayed_nvme_reconnect_ctrl &
>>> sleep 1
>>> bytes_written=$(run_xfs_io_pwritev2 /dev/"$ns" 4096)
>>> if [ "$bytes_written" != 4096 ]; then
>>> echo "could not write successfully with reconnect"
>>> fi
>>
>> It seems there may be a race here if we attempt to write to $ns before
>> the reconnect has completed in _delayed_nvme_reconnect_ctrl.
>>
>> If the intention is simply to verify that the controller reconnect occurs
>> within the delayed removal window and test pwrite,
>
> Not exactly. I want to verify that if I write between the disconnect and the reconnect, then we write succeeds.
Okay, got it — I think I misunderstood the intention earlier.
So the goal here is to verify that if a write is issued during the
delayed removal window is in progress (i.e., when there is temporarily
no active path), the write should be queued. Once the reconnect succeeds,
the queued write should then be unblocked and sent to the target.
If this understanding is correct, then this looks like a good test
to me.
Thanks,
--Nilay
^ permalink raw reply
* [RFC PATCH 1/2] rust: block: mq: safely expose TagSet flags
From: Wenzhao Liao @ 2026-04-10 9:08 UTC (permalink / raw)
To: axboe, a.hindborg, ojeda, linux-block, rust-for-linux
Cc: bjorn3_gh, aliceryhl, lossin, boqun, dakr, gary, sunke, tmgross,
linux-kernel
In-Reply-To: <20260410090829.1409430-1-wenzhaoliao@ruc.edu.cn>
TagSet::new() currently hardcodes blk_mq_tag_set.flags to 0.
That prevents Rust block drivers from declaring blk-mq queue flags.
Introduce TagSetFlags as a typed wrapper for BLK_MQ_F_* bits.
Add TagSet::new_with_flags() so drivers can pass flags explicitly.
Keep TagSet::new() as a compatibility wrapper using empty flags.
Re-export TagSetFlags from kernel::block::mq for driver imports.
Build-tested with LLVM=-15 in an out-of-tree rust-next build.
Validation includes vmlinux and drivers/block/rnull/rnull_mod.ko.
Signed-off-by: Wenzhao Liao <wenzhaoliao@ruc.edu.cn>
---
rust/kernel/block/mq.rs | 2 +-
rust/kernel/block/mq/tag_set.rs | 86 ++++++++++++++++++++++++++++++++-
2 files changed, 86 insertions(+), 2 deletions(-)
diff --git a/rust/kernel/block/mq.rs b/rust/kernel/block/mq.rs
index 1fd0d54dd549..799afdf36539 100644
--- a/rust/kernel/block/mq.rs
+++ b/rust/kernel/block/mq.rs
@@ -100,4 +100,4 @@
pub use operations::Operations;
pub use request::Request;
-pub use tag_set::TagSet;
+pub use tag_set::{TagSet, TagSetFlags};
diff --git a/rust/kernel/block/mq/tag_set.rs b/rust/kernel/block/mq/tag_set.rs
index dae9df408a86..72d9bce5b11f 100644
--- a/rust/kernel/block/mq/tag_set.rs
+++ b/rust/kernel/block/mq/tag_set.rs
@@ -16,6 +16,80 @@
use core::{convert::TryInto, marker::PhantomData};
use pin_init::{pin_data, pinned_drop, PinInit};
+/// Flags that control blk-mq tag set behavior.
+///
+/// They can be combined with the operators `|`, `&`, and `!`.
+#[derive(Clone, Copy, PartialEq, Eq)]
+pub struct TagSetFlags(u32);
+
+impl TagSetFlags {
+ /// Returns an empty instance where no flags are set.
+ pub const fn empty() -> Self {
+ Self(0)
+ }
+
+ /// Register as a blocking blk-mq driver device.
+ pub const BLOCKING: Self = Self::new(bindings::BLK_MQ_F_BLOCKING as u32);
+
+ /// Use an underlying blk-mq device for completing I/O.
+ pub const STACKING: Self = Self::new(bindings::BLK_MQ_F_STACKING as u32);
+
+ /// Share hardware contexts between tags.
+ pub const TAG_HCTX_SHARED: Self = Self::new(bindings::BLK_MQ_F_TAG_HCTX_SHARED as u32);
+
+ /// Allocate tags on a round-robin basis.
+ pub const TAG_RR: Self = Self::new(bindings::BLK_MQ_F_TAG_RR as u32);
+
+ /// Disable the I/O scheduler by default.
+ pub const NO_SCHED_BY_DEFAULT: Self =
+ Self::new(bindings::BLK_MQ_F_NO_SCHED_BY_DEFAULT as u32);
+
+ /// Check whether `flags` is contained in `self`.
+ pub fn contains(self, flags: Self) -> bool {
+ (self & flags) == flags
+ }
+
+ pub(crate) const fn as_raw(self) -> u32 {
+ self.0
+ }
+
+ const fn all_bits() -> u32 {
+ Self::BLOCKING.0
+ | Self::STACKING.0
+ | Self::TAG_HCTX_SHARED.0
+ | Self::TAG_RR.0
+ | Self::NO_SCHED_BY_DEFAULT.0
+ }
+
+ const fn new(value: u32) -> Self {
+ Self(value)
+ }
+}
+
+impl core::ops::BitOr for TagSetFlags {
+ type Output = Self;
+
+ fn bitor(self, rhs: Self) -> Self::Output {
+ Self(self.0 | rhs.0)
+ }
+}
+
+impl core::ops::BitAnd for TagSetFlags {
+ type Output = Self;
+
+ fn bitand(self, rhs: Self) -> Self::Output {
+ Self(self.0 & rhs.0)
+ }
+}
+
+impl core::ops::Not for TagSetFlags {
+ type Output = Self;
+
+ fn not(self) -> Self::Output {
+ Self(!self.0 & Self::all_bits())
+ }
+}
+
/// A wrapper for the C `struct blk_mq_tag_set`.
///
/// `struct blk_mq_tag_set` contains a `struct list_head` and so must be pinned.
@@ -37,6 +111,16 @@ pub fn new(
nr_hw_queues: u32,
num_tags: u32,
num_maps: u32,
+ ) -> impl PinInit<Self, error::Error> {
+ Self::new_with_flags(nr_hw_queues, num_tags, num_maps, TagSetFlags::empty())
+ }
+
+ /// Try to create a new tag set with the given blk-mq flags.
+ pub fn new_with_flags(
+ nr_hw_queues: u32,
+ num_tags: u32,
+ num_maps: u32,
+ flags: TagSetFlags,
) -> impl PinInit<Self, error::Error> {
let tag_set: bindings::blk_mq_tag_set = pin_init::zeroed();
let tag_set: Result<_> = core::mem::size_of::<RequestDataWrapper>()
@@ -49,7 +133,7 @@ pub fn new(
numa_node: bindings::NUMA_NO_NODE,
queue_depth: num_tags,
cmd_size,
- flags: 0,
+ flags: flags.as_raw(),
driver_data: core::ptr::null_mut::<crate::ffi::c_void>(),
nr_maps: num_maps,
..tag_set
--
2.34.1
^ permalink raw reply related
* [RFC PATCH 2/2] block: rnull: support BLK_MQ_F_BLOCKING via configfs
From: Wenzhao Liao @ 2026-04-10 9:08 UTC (permalink / raw)
To: axboe, a.hindborg, ojeda, linux-block, rust-for-linux
Cc: bjorn3_gh, aliceryhl, lossin, boqun, dakr, gary, sunke, tmgross,
linux-kernel
In-Reply-To: <20260410090829.1409430-1-wenzhaoliao@ruc.edu.cn>
Add a new configfs boolean attribute named blocking.
Advertise blocking in the rnull features list.
On power-on, map blocking=1 to TagSetFlags::BLOCKING.
Create the tag set with TagSet::new_with_flags().
Keep default blocking=0 to preserve existing behavior.
Like other parameters, blocking writes return -EBUSY while powered on.
Validated with LLVM=-15 out-of-tree builds and QEMU runtime tests.
Signed-off-by: Wenzhao Liao <wenzhaoliao@ruc.edu.cn>
---
drivers/block/rnull/configfs.rs | 32 +++++++++++++++++++++++++++++++-
drivers/block/rnull/rnull.rs | 11 +++++++++--
2 files changed, 40 insertions(+), 3 deletions(-)
diff --git a/drivers/block/rnull/configfs.rs b/drivers/block/rnull/configfs.rs
index 7c2eb5c0b722..a9d46a511340 100644
--- a/drivers/block/rnull/configfs.rs
+++ b/drivers/block/rnull/configfs.rs
@@ -35,7 +35,7 @@ impl AttributeOperations<0> for Config {
fn show(_this: &Config, page: &mut [u8; PAGE_SIZE]) -> Result<usize> {
let mut writer = kernel::str::Formatter::new(page);
- writer.write_str("blocksize,size,rotational,irqmode\n")?;
+ writer.write_str("blocksize,size,rotational,irqmode,blocking\n")?;
Ok(writer.bytes_written())
}
}
@@ -58,6 +58,7 @@ fn make_group(
rotational: 2,
size: 3,
irqmode: 4,
+ blocking: 5,
],
};
@@ -73,6 +74,7 @@ fn make_group(
disk: None,
capacity_mib: 4096,
irq_mode: IRQMode::None,
+ blocking: false,
name: name.try_into()?,
}),
}),
@@ -122,6 +124,7 @@ struct DeviceConfigInner {
rotational: bool,
capacity_mib: u64,
irq_mode: IRQMode,
+ blocking: bool,
disk: Option<GenDisk<NullBlkDevice>>,
}
@@ -152,6 +155,7 @@ fn store(this: &DeviceConfig, page: &[u8]) -> Result {
guard.rotational,
guard.capacity_mib,
guard.irq_mode,
+ guard.blocking,
)?);
guard.powered = true;
} else if guard.powered && !power_op {
@@ -259,3 +263,29 @@ fn store(this: &DeviceConfig, page: &[u8]) -> Result {
Ok(())
}
}
+
+#[vtable]
+impl configfs::AttributeOperations<5> for DeviceConfig {
+ type Data = DeviceConfig;
+
+ fn show(this: &DeviceConfig, page: &mut [u8; PAGE_SIZE]) -> Result<usize> {
+ let mut writer = kernel::str::Formatter::new(page);
+
+ if this.data.lock().blocking {
+ writer.write_str("1\n")?;
+ } else {
+ writer.write_str("0\n")?;
+ }
+
+ Ok(writer.bytes_written())
+ }
+
+ fn store(this: &DeviceConfig, page: &[u8]) -> Result {
+ if this.data.lock().powered {
+ return Err(EBUSY);
+ }
+
+ this.data.lock().blocking = kstrtobool_bytes(page)?;
+ Ok(())
+ }
+}
diff --git a/drivers/block/rnull/rnull.rs b/drivers/block/rnull/rnull.rs
index 0ca8715febe8..d7ebd504d8df 100644
--- a/drivers/block/rnull/rnull.rs
+++ b/drivers/block/rnull/rnull.rs
@@ -11,7 +11,7 @@
mq::{
self,
gen_disk::{self, GenDisk},
- Operations, TagSet,
+ Operations, TagSet, TagSetFlags,
},
},
prelude::*,
@@ -51,8 +51,15 @@ fn new(
rotational: bool,
capacity_mib: u64,
irq_mode: IRQMode,
+ blocking: bool,
) -> Result<GenDisk<Self>> {
- let tagset = Arc::pin_init(TagSet::new(1, 256, 1), GFP_KERNEL)?;
+ let flags = if blocking {
+ TagSetFlags::BLOCKING
+ } else {
+ TagSetFlags::empty()
+ };
+
+ let tagset = Arc::pin_init(TagSet::new_with_flags(1, 256, 1, flags), GFP_KERNEL)?;
let queue_data = Box::new(QueueData { irq_mode }, GFP_KERNEL)?;
--
2.34.1
^ permalink raw reply related
* [RFC PATCH 0/2] rust block-mq TagSet flags plumbing and rnull blocking wiring
From: Wenzhao Liao @ 2026-04-10 9:08 UTC (permalink / raw)
To: axboe, a.hindborg, ojeda, linux-block, rust-for-linux
Cc: bjorn3_gh, aliceryhl, lossin, boqun, dakr, gary, sunke, tmgross,
linux-kernel
This RFC series fills a practical gap in the Rust block-mq abstraction by
exposing blk_mq_tag_set.flags safely, then wires one in-tree consumer
(`rnull`) via configfs as a reference.
Patch 1 introduces `TagSetFlags` and `TagSet::new_with_flags(...)` while
keeping `TagSet::new(...)` for compatibility.
Patch 2 adds a `blocking` configfs attribute to `rnull` and maps it to
`TagSetFlags::BLOCKING` when powering on a device.
Validation summary:
- Out-of-tree build on rust-next (LLVM=-15): defconfig, rustavailable.
- Enabled CONFIG_RUST=y, CONFIG_CONFIGFS_FS=y, CONFIG_BLK_DEV_RUST_NULL=m.
- Built vmlinux and drivers/block/rnull/rnull_mod.ko successfully.
- QEMU runtime test (initramfs automation) verifies:
- `/sys/kernel/config/rnull/features` contains `blocking`;
- writing `blocking` while powered on fails with busy semantics;
- writing `blocking` after power off succeeds again.
Comments on API naming and any preferred follow-up scope are welcome.
Wenzhao Liao (2):
rust: block: mq: safely expose TagSet flags
block: rnull: support BLK_MQ_F_BLOCKING via configfs
drivers/block/rnull/configfs.rs | 32 +++++++++++-
drivers/block/rnull/rnull.rs | 11 ++++-
rust/kernel/block/mq.rs | 2 +-
rust/kernel/block/mq/tag_set.rs | 86 ++++++++++++++++++++++++++++++++-
4 files changed, 126 insertions(+), 5 deletions(-)
base-commit: 3418d862679ac6da0b6bd681b18b3189c4fad20d
--
2.34.1
^ permalink raw reply
* Re: [PATCH 07/13] libmultipath: Add delayed removal support
From: John Garry @ 2026-04-10 8:55 UTC (permalink / raw)
To: Nilay Shroff, Hannes Reinecke, hch, kbusch, sagi, axboe,
martin.petersen, james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
In-Reply-To: <da2bfbb0-70ef-4c3a-a235-1343b4a02489@linux.ibm.com>
On 10/04/2026 08:06, Nilay Shroff wrote:
>> # Part b: Ensure writes work for intermittent disconnect
>> _nvme_connect_subsys
>>
>> nvmedev=$(_find_nvme_dev "${def_subsysnqn}")
>> ns=$(_find_nvme_ns "${def_subsys_uuid}")
>> echo 10 > "/sys/block/"$ns"/delayed_removal_secs"
>> bytes_written=$(run_xfs_io_pwritev2 /dev/"$ns" 4096)
>> if [ "$bytes_written" != 4096 ]; then
>> echo "could not write successfully initially"
>> fi
>> sleep 1
>> _nvme_disconnect_ctrl "${nvmedev}"
>> sleep 1
>> ns=$(_find_nvme_ns "${def_subsys_uuid}")
>> if [[ "${ns}" = "" ]]; then
>> echo "could not find ns after disconnect"
>> fi
>> _delayed_nvme_reconnect_ctrl &
>> sleep 1
>> bytes_written=$(run_xfs_io_pwritev2 /dev/"$ns" 4096)
>> if [ "$bytes_written" != 4096 ]; then
>> echo "could not write successfully with reconnect"
>> fi
>
> It seems there may be a race here if we attempt to write to $ns before
> the reconnect has completed in _delayed_nvme_reconnect_ctrl.
>
> If the intention is simply to verify that the controller reconnect occurs
> within the delayed removal window and test pwrite,
Not exactly. I want to verify that if I write between the disconnect and
the reconnect, then we write succeeds.
> then it may be
> sufficient
> to:
> - perform the reconnect, and
> - then validate the write (pwrite) afterwards.
I think that this is something subtly different.
For your revised test, if we reconnect, we always expect the subsequent
write to succeed even without the delayed removal, so I am not sure what
we achieve.
>
> In that case, we could either:
> - run _delayed_nvme_reconnect_ctrl in the foreground, or
> - open-code the reconnect directly in the script before issuing the write.
>
How would that open-code reconnect look? I was just using the subsystem
connect, which I think is not optimal.
Thanks,
John
^ permalink raw reply
* Re: [PATCH 07/13] libmultipath: Add delayed removal support
From: Nilay Shroff @ 2026-04-10 7:06 UTC (permalink / raw)
To: John Garry, Hannes Reinecke, hch, kbusch, sagi, axboe,
martin.petersen, james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
In-Reply-To: <ccfc867c-e744-42a2-9b22-47245a6c06d7@oracle.com>
On 4/9/26 6:30 PM, John Garry wrote:
> On 09/04/2026 07:37, Nilay Shroff wrote:
>>>
>>> You mean a common blktests testcase, right?
>>>
>>> For NVMe, that test would:
>>> a. try to remove NVMe ko when we have the delayed removal active
>>> b. ensure that we can queue for no path
>>>
>>> I suppose that a common testcase could be possible (with dm mpath), but doesn't dm have its own testsuite?
>>>
>> Yes, I'd add a blktest for 'queue_if_no_path' feature. But as we know we have
>> separate test suite for dm under blktests, I'd first target nvme testcase and
>> then later add another testcase for dm-multipath.
>
> Testing a. is a challenge to be effective, as we would typically not be able to remove the nvme modules anyway due to many other references.
>
> For b, how about something like the following:
>
> set_conditions() {
> _set_nvme_trtype "$@"
> }
>
> _delayed_nvme_reconnect_ctrl() {
> sleep 2
> _nvme_connect_subsys
> }
>
> test() {
> echo "Running ${TEST_NAME}"
>
> _setup_nvmet
>
> local nvmedev
> local ns
> local bytes_written
>
> _nvmet_target_setup
> _nvme_connect_subsys
>
> # Part a: Ensure writes fail when no path returns
> nvmedev=$(_find_nvme_dev "${def_subsysnqn}")
> ns=$(_find_nvme_ns "${def_subsys_uuid}")
> echo 10 > "/sys/block/"$ns"/delayed_removal_secs"
> bytes_written=$(run_xfs_io_pwritev2 /dev/"$ns" 4096)
> if [ "$bytes_written" != 4096 ]; then
> echo "could not write successfully initially"
> fi
> sleep 1
> _nvme_disconnect_ctrl "${nvmedev}"
> sleep 1
> ns=$(_find_nvme_ns "${def_subsys_uuid}")
> if [[ "${ns}" = "" ]]; then
> echo "could not find ns after disconnect"
> fi
> bytes_written=$(run_xfs_io_pwritev2 /dev/"$ns" 4096)
> if [ "$bytes_written" == 4096 ]; then
> echo "wrote successfully after disconnect"
> fi
> sleep 10
> ns=$(_find_nvme_ns "${def_subsys_uuid}")
> if [[ !"${ns}" = "" ]]; then
> echo "found ns after delayed removal"
> fi
>
> #echo "now part 2"
> # Part b: Ensure writes work for intermittent disconnect
> _nvme_connect_subsys
>
> nvmedev=$(_find_nvme_dev "${def_subsysnqn}")
> ns=$(_find_nvme_ns "${def_subsys_uuid}")
> echo 10 > "/sys/block/"$ns"/delayed_removal_secs"
> bytes_written=$(run_xfs_io_pwritev2 /dev/"$ns" 4096)
> if [ "$bytes_written" != 4096 ]; then
> echo "could not write successfully initially"
> fi
> sleep 1
> _nvme_disconnect_ctrl "${nvmedev}"
> sleep 1
> ns=$(_find_nvme_ns "${def_subsys_uuid}")
> if [[ "${ns}" = "" ]]; then
> echo "could not find ns after disconnect"
> fi
> _delayed_nvme_reconnect_ctrl &
> sleep 1
> bytes_written=$(run_xfs_io_pwritev2 /dev/"$ns" 4096)
> if [ "$bytes_written" != 4096 ]; then
> echo "could not write successfully with reconnect"
> fi
It seems there may be a race here if we attempt to write to $ns before
the reconnect has completed in _delayed_nvme_reconnect_ctrl.
If the intention is simply to verify that the controller reconnect occurs
within the delayed removal window and test pwrite, then it may be sufficient
to:
- perform the reconnect, and
- then validate the write (pwrite) afterwards.
In that case, we could either:
- run _delayed_nvme_reconnect_ctrl in the foreground, or
- open-code the reconnect directly in the script before issuing the write.
> sleep 10
> ns=$(_find_nvme_ns "${def_subsys_uuid}")
> if [[ "${ns}" = "" ]]; then
> echo "could not find ns after delayed reconnect"
> fi
>
> # Final tidy-up
> echo 0 > /sys/block/"$ns"/delayed_removal_secs
> nvmedev=$(_find_nvme_dev "${def_subsysnqn}")
> _nvme_disconnect_ctrl "${nvmedev}"
> _nvmet_target_cleanup
>
> echo "Test complete"
> }
Otherwise overall this looks good to me.
Thanks,
--Nilay
^ permalink raw reply
* Re: [PATCH 8/8] RFC: use a TASK_FIFO kthread for read completion support
From: Christoph Hellwig @ 2026-04-10 6:19 UTC (permalink / raw)
To: Tal Zussman
Cc: Christoph Hellwig, Jens Axboe, Matthew Wilcox (Oracle),
Christian Brauner, Darrick J. Wong, Carlos Maiolino, Al Viro,
Jan Kara, Dave Chinner, Bart Van Assche, Gao Xiang, linux-block,
linux-kernel, linux-xfs, linux-fsdevel, linux-mm
In-Reply-To: <2cdaa767-c071-4e84-b9d7-1c944407f5bb@columbia.edu>
On Thu, Apr 09, 2026 at 03:06:47PM -0400, Tal Zussman wrote:
> > -#include <linux/llist.h>
> > +#include <linux/freezer.h>
>
> Why freezer.h and not kthread.h?
I needed freezer.h to try to make the thread freezable, but that didn't
work out. kthread.h seems to be pulled in implicitly.
> > struct bio_complete_batch {
> > - struct llist_head list;
>
> If we go with this approach, we should remove the newly-added bi_llist from
> struct bio too.
Yes.
^ permalink raw reply
* Re: [PATCH 4/8] FOLD: block: change the defer in task context interface to be procedural
From: Christoph Hellwig @ 2026-04-10 6:17 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Hellwig, Tal Zussman, Jens Axboe, Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Al Viro, Jan Kara, Dave Chinner,
Bart Van Assche, Gao Xiang, linux-block, linux-kernel, linux-xfs,
linux-fsdevel, linux-mm
In-Reply-To: <adgJqiA0vivaW7NA@casper.infradead.org>
On Thu, Apr 09, 2026 at 09:18:50PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 09, 2026 at 06:02:17PM +0200, Christoph Hellwig wrote:
> > @@ -1836,9 +1837,7 @@ void bio_endio(struct bio *bio)
> > }
> > #endif
> >
> > - if (!in_task() && bio_flagged(bio, BIO_COMPLETE_IN_TASK))
> > - bio_queue_completion(bio);
> > - else if (bio->bi_end_io)
> > + if (bio->bi_end_io)
> > bio->bi_end_io(bio);
>
> What I liked about this before is that we had one central place that
> needed to be changed. This change means that every bi_end_io now needs
> to check whether the BIO can be completed in its context.
Yes. On the other hand we can actually use it when we don't know if
we need to offload beforehand, which enabls the two later conversions
and probably more.
>
> > +++ b/fs/buffer.c
> > @@ -2673,6 +2673,9 @@ static void end_bio_bh_io_sync(struct bio *bio)
> > {
> > struct buffer_head *bh = bio->bi_private;
> >
> > + if (buffer_dropbehind(bh) && bio_complete_in_task(bio))
> > + return;
>
> I really don't like this. It assumes there's only one reason to
> complete in task context -- whether the buffer belongs to a dropbehind
> folio. I want there to be other reasons. Why would you introduce the
> new BH_dropbehind flag instead of checking BIO_COMPLETE_IN_TASK?
Because there's no BIO_COMPLETE_IN_TASK? left.
> > struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
> >
> > + /* Page cache invalidation cannot be done in irq context. */
> > + if (ioend->io_flags & IOMAP_IOEND_DONTCACHE) {
> > + if (bio_complete_in_task(bio))
> > + return;
> > + }
>
> I thought we agreed to kill off IOMAP_IOEND_DONTCACHE?
Only IFF we don't need it. With this version there is no way to just
remove it.
^ permalink raw reply
* [PATCH blktests 1/1] common/scsi_debug: use _patient_rmmod() to unload scsi_debug
From: Shin'ichiro Kawasaki @ 2026-04-10 5:30 UTC (permalink / raw)
To: linux-block; +Cc: Yi Zhang, Shin'ichiro Kawasaki
The helper _have_scsi_debug_group_number_stats() loads and unloads
scsi_debug to check parameters of scsi_debug. However, the unload is too
soon after the load, then it often fails. To avoid such unload failures,
use _patient_rmmod() to unload scsi_debug.
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://github.com/linux-blktests/blktests/issues/241
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
common/scsi_debug | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/common/scsi_debug b/common/scsi_debug
index 6aa7420..86ea3f4 100644
--- a/common/scsi_debug
+++ b/common/scsi_debug
@@ -57,7 +57,7 @@ _have_scsi_debug_group_number_stats() {
SKIP_REASONS+=("scsi_debug does not support group number statistics")
ret=1
fi
- modprobe -qr scsi_debug >&/dev/null
+ _patient_rmmod scsi_debug >&/dev/null
return ${ret}
}
--
2.49.0
^ permalink raw reply related
* Re: [PATCH 2/2] block: allow different-pgmap pages as separate bvecs in bio_add_page
From: Naman Jain @ 2026-04-10 3:38 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Chaitanya Kulkarni, John Hubbard, Logan Gunthorpe,
linux-kernel, linux-block, Saurabh Sengar, Long Li,
Michael Kelley
In-Reply-To: <20260408060825.GA24532@lst.de>
On 4/8/2026 11:38 AM, Christoph Hellwig wrote:
> Hi Naman,
>
> On Tue, Apr 07, 2026 at 12:38:30PM +0530, Naman Jain wrote:
>>> So the zone_device_pages_have_same_pgmap check should go into
>>> zone_device_pages_compatible and we need to stop building the bio
>>> as well in that case.
>>
>> Ok, so rest all things same, from my last email, but my previous compatible
>> function would look like this:
>>
>> static inline bool zone_device_pages_compatible(const struct page *a,
>> const struct page *b)
>> {
>> if (is_pci_p2pdma_page(a) || is_pci_p2pdma_page(b))
>> return zone_device_pages_have_same_pgmap(a, b);
>> return true;
>> }
>>
>> This would prevent two P2PDMA pages from different pgmaps (different PCI
>> devices) passing the compatible check and both get added to the bio.
>> Please correct me if that is not what you meant. I'll wait for a couple
>> more days and send the next version with this, and we can review this
>> again.
>
> This looks good modulo the tabs vs spaces in the indentation.
Hi,
Thanks.
Regards,
Naman
^ permalink raw reply
* Re: [PATCH v10 13/13] docs: add io_queue flag to isolcpus
From: Ming Lei @ 2026-04-10 2:44 UTC (permalink / raw)
To: Aaron Tomlin
Cc: Ming Lei, axboe, kbusch, hch, sagi, mst, aacraid, James.Bottomley,
martin.petersen, liyihang9, kashyap.desai, sumit.saxena,
shivasharan.srikanteshwara, chandrakanth.patil, sathya.prakash,
sreekanth.reddy, suganath-prabu.subramani, ranjan.kumar,
jinpu.wang, tglx, mingo, peterz, juri.lelli, vincent.guittot,
akpm, maz, ruanjinjie, bigeasy, yphbchou0911, wagi, frederic,
longman, chenridong, hare, kch, steve, sean, chjohnst, neelx,
mproche, linux-block, linux-kernel, virtualization, linux-nvme,
linux-scsi, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
MPT-FusionLinux.pdl
In-Reply-To: <a566smu6morqeefqal23eek4ibezfuiwhs774xtxhyyclpbtsx@uzzwgbzmwdjd>
On Thu, Apr 09, 2026 at 09:45:04PM -0400, Aaron Tomlin wrote:
> On Thu, Apr 09, 2026 at 11:00:09PM +0800, Ming Lei wrote:
> > How can the isolated core be scheduled for running polling task?
> >
> > Who triggered it?
> >
> > > loop waiting for the hardware completion. This would completely monopolise
> > > the core and destroy any real time isolation guarantees without the user
> > > space application ever having requested it.
> >
> > No.
> >
> > IOPOLL queue doesn't have interrupt, and the ->poll() is only run from
> > the submission context. So if you don't submitted polled IO on isolated
> > CPU cores, everything is just fine. This is simpler than irq IO actually.
>
> Yes, you are entirely correct. The ->iopoll() is indeed executed strictly
> within the submission context. In the example below, the file operations
> iopoll callback is iocb_bio_iopoll():
>
> // file->f_op->iopoll(&rw->kiocb, iob, poll_flags)
> iocb_bio_iopoll(&rw->kiocb, iob, poll_flags)
> {
> struct bio *bio
>
> bio = READ_ONCE(kiocb->private)
> if (bio)
> bio_poll(bio, iob, flags)
> if (queue_is_mq(q))
> blk_mq_poll(q, cookie, iob, flags)
> {
> if (!blk_mq_can_poll(q))
> return 0
>
> blk_hctx_poll(q, q->queue_hw_ctx[cookie], iob, flags)
> {
> int ret
>
> do {
> ret = q->mq_ops->poll(hctx, iob)
> if (ret > 0)
> return ret
> if (task_sigpending(current))
> return 1
> if (ret < 0 || (flags & BLK_POLL_ONESHOT))
> break
> cpu_relax()
> } while (!need_resched())
>
> return 0
> }
> }
>
> If an application on an isolated CPU does not explicitly submit a polled
> I/O request, it will not poll. Thank you for correcting me on this.
Great, you finally get the point.
>
> > Can you share one example in which managed irq can't address?
>
> Without io_queue, the block layer maps isolated CPUs to these queues, and
> the device will fire unmanaged interrupts that can freely land on isolated
For unmanaged interrupts, user can set irq affinity on housekeeping cpus
from /proc or kernel command line.
Why is unmanaged interrupts involved with this patchset?
> CPUs, thereby breaking isolation. By applying the constraint via io_queue
> at the block layer, we restrict the hardware queue count and map the
> isolated CPUs to the housekeeping queues, ensuring isolation is maintained
> regardless of whether the driver uses managed interrupts.
>
> Does the above help?
As I mentioned, managed irq already covers it:
- typically application submits IO from housekeeping CPUs, which is mapped
to one hardware, which effective interrupt affinity excludes isolated
CPUs if possible.
I'd suggest to share some real problems you found instead of something
imaginary.
>
> > > >
> > > > IMO, only two differences from this viewpoint:
> > > >
> > > > 1) `io_queue` may reduce nr_hw_queues
> > > >
> > > > 2) when application submits IO from isolated CPUs, `io_queue` can complete
> > > > IO from housekeeping CPUs.
> > >
> > > Acknowledged.
> >
> > Are there other major differences besides the two mentioned above?
>
> I believe the above is sufficient. Please let me know your thoughts.
Both two are small improvement, not bug fixes. However the user has to pay
the cost of potential failing of offlining CPU. Not mention the little
complicated change: `19 files changed, 378 insertions(+), 48 deletions(-)`
But I won't object if you can update the commit log/kernel command line
doc and fix the issue found in review.
Thanks,
Ming
^ permalink raw reply
* Re: [PATCH v2 01/10] ublk: add UBLK_U_CMD_REG_BUF/UNREG_BUF control commands
From: Ming Lei @ 2026-04-10 2:28 UTC (permalink / raw)
To: Caleb Sander Mateos; +Cc: Ming Lei, Jens Axboe, linux-block
In-Reply-To: <CADUfDZqnq1Ls9NdcBXgbAJymPNvcUOa883zSTNMNcaXODGEEEg@mail.gmail.com>
On Thu, Apr 09, 2026 at 02:22:24PM -0700, Caleb Sander Mateos wrote:
> On Thu, Apr 9, 2026 at 5:18 AM Ming Lei <tom.leiming@gmail.com> wrote:
> >
> > On Wed, Apr 08, 2026 at 08:20:12AM -0700, Caleb Sander Mateos wrote:
> > > On Tue, Apr 7, 2026 at 7:23 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > On Tue, Apr 07, 2026 at 12:35:49PM -0700, Caleb Sander Mateos wrote:
> > > > > On Tue, Mar 31, 2026 at 8:32 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > > > >
> > > > > > Add control commands for registering and unregistering shared memory
> > > > > > buffers for zero-copy I/O:
> > > > > >
> > > > > > - UBLK_U_CMD_REG_BUF (0x18): pins pages from userspace, inserts PFN
> > > > > > ranges into a per-device maple tree for O(log n) lookup during I/O.
> > > > > > Buffer pointers are tracked in a per-device xarray. Returns the
> > > > > > assigned buffer index.
> > > > > >
> > > > > > - UBLK_U_CMD_UNREG_BUF (0x19): removes PFN entries and unpins pages.
> > > > > >
> > > > > > Queue freeze/unfreeze is handled internally so userspace need not
> > > > > > quiesce the device during registration.
> > > > > >
> > > > > > Also adds:
> > > > > > - UBLK_IO_F_SHMEM_ZC flag and addr encoding helpers in UAPI header
> > > > > > (16-bit buffer index supporting up to 65536 buffers)
> > > > > > - Data structures (ublk_buf, ublk_buf_range) and xarray/maple tree
> > > > > > - __ublk_ctrl_reg_buf() helper for PFN insertion with error unwinding
> > > > > > - __ublk_ctrl_unreg_buf() helper for cleanup reuse
> > > > > > - ublk_support_shmem_zc() / ublk_dev_support_shmem_zc() stubs
> > > > > > (returning false — feature not enabled yet)
> > > > > >
> > > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > > > ---
> > > > > > drivers/block/ublk_drv.c | 300 ++++++++++++++++++++++++++++++++++
> > > > > > include/uapi/linux/ublk_cmd.h | 72 ++++++++
> > > > > > 2 files changed, 372 insertions(+)
> > > > > >
> > > > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > > > index 71c7c56b38ca..ac6ccc174d44 100644
> > > > > > --- a/drivers/block/ublk_drv.c
> > > > > > +++ b/drivers/block/ublk_drv.c
> > > > > > @@ -46,6 +46,8 @@
> > > > > > #include <linux/kref.h>
> > > > > > #include <linux/kfifo.h>
> > > > > > #include <linux/blk-integrity.h>
> > > > > > +#include <linux/maple_tree.h>
> > > > > > +#include <linux/xarray.h>
> > > > > > #include <uapi/linux/fs.h>
> > > > > > #include <uapi/linux/ublk_cmd.h>
> > > > > >
> > > > > > @@ -58,6 +60,8 @@
> > > > > > #define UBLK_CMD_UPDATE_SIZE _IOC_NR(UBLK_U_CMD_UPDATE_SIZE)
> > > > > > #define UBLK_CMD_QUIESCE_DEV _IOC_NR(UBLK_U_CMD_QUIESCE_DEV)
> > > > > > #define UBLK_CMD_TRY_STOP_DEV _IOC_NR(UBLK_U_CMD_TRY_STOP_DEV)
> > > > > > +#define UBLK_CMD_REG_BUF _IOC_NR(UBLK_U_CMD_REG_BUF)
> > > > > > +#define UBLK_CMD_UNREG_BUF _IOC_NR(UBLK_U_CMD_UNREG_BUF)
> > > > > >
> > > > > > #define UBLK_IO_REGISTER_IO_BUF _IOC_NR(UBLK_U_IO_REGISTER_IO_BUF)
> > > > > > #define UBLK_IO_UNREGISTER_IO_BUF _IOC_NR(UBLK_U_IO_UNREGISTER_IO_BUF)
> > > > > > @@ -289,6 +293,20 @@ struct ublk_queue {
> > > > > > struct ublk_io ios[] __counted_by(q_depth);
> > > > > > };
> > > > > >
> > > > > > +/* Per-registered shared memory buffer */
> > > > > > +struct ublk_buf {
> > > > > > + struct page **pages;
> > > > > > + unsigned int nr_pages;
> > > > > > +};
> > > > > > +
> > > > > > +/* Maple tree value: maps a PFN range to buffer location */
> > > > > > +struct ublk_buf_range {
> > > > > > + unsigned long base_pfn;
> > > > > > + unsigned short buf_index;
> > > > > > + unsigned short flags;
> > > > > > + unsigned int base_offset; /* byte offset within buffer */
> > > > > > +};
> > > > > > +
> > > > > > struct ublk_device {
> > > > > > struct gendisk *ub_disk;
> > > > > >
> > > > > > @@ -323,6 +341,10 @@ struct ublk_device {
> > > > > >
> > > > > > bool block_open; /* protected by open_mutex */
> > > > > >
> > > > > > + /* shared memory zero copy */
> > > > > > + struct maple_tree buf_tree;
> > > > > > + struct xarray bufs_xa;
> > > > > > +
> > > > > > struct ublk_queue *queues[];
> > > > > > };
> > > > > >
> > > > > > @@ -334,6 +356,7 @@ struct ublk_params_header {
> > > > > >
> > > > > > static void ublk_io_release(void *priv);
> > > > > > static void ublk_stop_dev_unlocked(struct ublk_device *ub);
> > > > > > +static void ublk_buf_cleanup(struct ublk_device *ub);
> > > > > > static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
> > > > > > static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
> > > > > > u16 q_id, u16 tag, struct ublk_io *io);
> > > > > > @@ -398,6 +421,16 @@ static inline bool ublk_dev_support_zero_copy(const struct ublk_device *ub)
> > > > > > return ub->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY;
> > > > > > }
> > > > > >
> > > > > > +static inline bool ublk_support_shmem_zc(const struct ublk_queue *ubq)
> > > > > > +{
> > > > > > + return false;
> > > > > > +}
> > > > > > +
> > > > > > +static inline bool ublk_dev_support_shmem_zc(const struct ublk_device *ub)
> > > > > > +{
> > > > > > + return false;
> > > > > > +}
> > > > > > +
> > > > > > static inline bool ublk_support_auto_buf_reg(const struct ublk_queue *ubq)
> > > > > > {
> > > > > > return ubq->flags & UBLK_F_AUTO_BUF_REG;
> > > > > > @@ -1460,6 +1493,7 @@ static blk_status_t ublk_setup_iod(struct ublk_queue *ubq, struct request *req)
> > > > > > iod->op_flags = ublk_op | ublk_req_build_flags(req);
> > > > > > iod->nr_sectors = blk_rq_sectors(req);
> > > > > > iod->start_sector = blk_rq_pos(req);
> > > > > > +
> > > > >
> > > > > nit: unrelated whitespace change?
> > > > >
> > > > > > iod->addr = io->buf.addr;
> > > > > >
> > > > > > return BLK_STS_OK;
> > > > > > @@ -1665,6 +1699,7 @@ static bool ublk_start_io(const struct ublk_queue *ubq, struct request *req,
> > > > > > {
> > > > > > unsigned mapped_bytes = ublk_map_io(ubq, req, io);
> > > > > >
> > > > > > +
> > > > >
> > > > > nit: unrelated whitespace change?
> > > > >
> > > > > > /* partially mapped, update io descriptor */
> > > > > > if (unlikely(mapped_bytes != blk_rq_bytes(req))) {
> > > > > > /*
> > > > > > @@ -4206,6 +4241,7 @@ static void ublk_cdev_rel(struct device *dev)
> > > > > > {
> > > > > > struct ublk_device *ub = container_of(dev, struct ublk_device, cdev_dev);
> > > > > >
> > > > > > + ublk_buf_cleanup(ub);
> > > > > > blk_mq_free_tag_set(&ub->tag_set);
> > > > > > ublk_deinit_queues(ub);
> > > > > > ublk_free_dev_number(ub);
> > > > > > @@ -4625,6 +4661,8 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
> > > > > > mutex_init(&ub->mutex);
> > > > > > spin_lock_init(&ub->lock);
> > > > > > mutex_init(&ub->cancel_mutex);
> > > > > > + mt_init(&ub->buf_tree);
> > > > > > + xa_init_flags(&ub->bufs_xa, XA_FLAGS_ALLOC);
> > > > > > INIT_WORK(&ub->partition_scan_work, ublk_partition_scan_work);
> > > > > >
> > > > > > ret = ublk_alloc_dev_number(ub, header->dev_id);
> > > > > > @@ -5168,6 +5206,260 @@ static int ublk_char_dev_permission(struct ublk_device *ub,
> > > > > > return err;
> > > > > > }
> > > > > >
> > > > > > +/*
> > > > > > + * Drain inflight I/O and quiesce the queue. Freeze drains all inflight
> > > > > > + * requests, quiesce_nowait marks the queue so no new requests dispatch,
> > > > > > + * then unfreeze allows new submissions (which won't dispatch due to
> > > > > > + * quiesce). This keeps freeze and ub->mutex non-nested.
> > > > > > + */
> > > > > > +static void ublk_quiesce_and_release(struct gendisk *disk)
> > > > > > +{
> > > > > > + unsigned int memflags;
> > > > > > +
> > > > > > + memflags = blk_mq_freeze_queue(disk->queue);
> > > > > > + blk_mq_quiesce_queue_nowait(disk->queue);
> > > > > > + blk_mq_unfreeze_queue(disk->queue, memflags);
> > > > > > +}
> > > > > > +
> > > > > > +static void ublk_unquiesce_and_resume(struct gendisk *disk)
> > > > > > +{
> > > > > > + blk_mq_unquiesce_queue(disk->queue);
> > > > > > +}
> > > > > > +
> > > > > > +/*
> > > > > > + * Insert PFN ranges of a registered buffer into the maple tree,
> > > > > > + * coalescing consecutive PFNs into single range entries.
> > > > > > + * Returns 0 on success, negative error with partial insertions unwound.
> > > > > > + */
> > > > > > +/* Erase coalesced PFN ranges from the maple tree for pages [0, nr_pages) */
> > > > > > +static void ublk_buf_erase_ranges(struct ublk_device *ub,
> > > > > > + struct ublk_buf *ubuf,
> > > > > > + unsigned long nr_pages)
> > > > > > +{
> > > > > > + unsigned long i;
> > > > > > +
> > > > > > + for (i = 0; i < nr_pages; ) {
> > > > > > + unsigned long pfn = page_to_pfn(ubuf->pages[i]);
> > > > > > + unsigned long start = i;
> > > > > > +
> > > > > > + while (i + 1 < nr_pages &&
> > > > > > + page_to_pfn(ubuf->pages[i + 1]) == pfn + (i - start) + 1)
> > > > > > + i++;
> > > > > > + i++;
> > > > > > + kfree(mtree_erase(&ub->buf_tree, pfn));
> > > > > > + }
> > > > > > +}
> > > > > > +
> > > > > > +static int __ublk_ctrl_reg_buf(struct ublk_device *ub,
> > > > > > + struct ublk_buf *ubuf, int index,
> > > > > > + unsigned short flags)
> > > > > > +{
> > > > > > + unsigned long nr_pages = ubuf->nr_pages;
> > > > > > + unsigned long i;
> > > > > > + int ret;
> > > > > > +
> > > > > > + for (i = 0; i < nr_pages; ) {
> > > > > > + unsigned long pfn = page_to_pfn(ubuf->pages[i]);
> > > > > > + unsigned long start = i;
> > > > > > + struct ublk_buf_range *range;
> > > > > > +
> > > > > > + /* Find run of consecutive PFNs */
> > > > > > + while (i + 1 < nr_pages &&
> > > > > > + page_to_pfn(ubuf->pages[i + 1]) == pfn + (i - start) + 1)
> > > > > > + i++;
> > > > > > + i++; /* past the last page in this run */
> > > > >
> > > > > Move this increment to the for loop so you don't need the "- 1" in the
> > > > > mtree_insert_range() call?
> > > >
> > > > Good catch!
> > > >
> > > > >
> > > > > > +
> > > > > > + range = kzalloc(sizeof(*range), GFP_KERNEL);
> > > > >
> > > > > Not sure kzalloc() is necessary; all the fields are initialized below
> > > >
> > > > Yeah, kmalloc() is fine, and we shouldn't add more fields to `range` in
> > > > future.
> > > >
> > > > >
> > > > > > + if (!range) {
> > > > > > + ret = -ENOMEM;
> > > > > > + goto unwind;
> > > > > > + }
> > > > > > + range->buf_index = index;
> > > > > > + range->flags = flags;
> > > > > > + range->base_pfn = pfn;
> > > > > > + range->base_offset = start << PAGE_SHIFT;
> > > > > > +
> > > > > > + ret = mtree_insert_range(&ub->buf_tree, pfn,
> > > > > > + pfn + (i - start) - 1,
> > > > > > + range, GFP_KERNEL);
> > > > > > + if (ret) {
> > > > > > + kfree(range);
> > > > > > + goto unwind;
> > > > > > + }
> > > > > > + }
> > > > > > + return 0;
> > > > > > +
> > > > > > +unwind:
> > > > > > + ublk_buf_erase_ranges(ub, ubuf, i);
> > > > > > + return ret;
> > > > > > +}
> > > > > > +
> > > > > > +/*
> > > > > > + * Register a shared memory buffer for zero-copy I/O.
> > > > > > + * Pins pages, builds PFN maple tree, freezes/unfreezes the queue
> > > > > > + * internally. Returns buffer index (>= 0) on success.
> > > > > > + */
> > > > > > +static int ublk_ctrl_reg_buf(struct ublk_device *ub,
> > > > > > + struct ublksrv_ctrl_cmd *header)
> > > > > > +{
> > > > > > + void __user *argp = (void __user *)(unsigned long)header->addr;
> > > > > > + struct ublk_shmem_buf_reg buf_reg;
> > > > > > + unsigned long addr, size, nr_pages;
> > > > >
> > > > > size and nr_pages could be u32
> > > >
> > > > Yeah, it was caused by internal change on `ublk_shmem_buf_reg`.
> > > >
> > > > >
> > > > > > + unsigned int gup_flags;
> > > > > > + struct gendisk *disk;
> > > > > > + struct ublk_buf *ubuf;
> > > > > > + long pinned;
> > > > >
> > > > > pinned could be int to match the return type of pin_user_pages_fast()
> > > >
> > > > OK.
> > > >
> > > > >
> > > > > > + u32 index;
> > > > > > + int ret;
> > > > > > +
> > > > > > + if (!ublk_dev_support_shmem_zc(ub))
> > > > > > + return -EOPNOTSUPP;
> > > > > > +
> > > > > > + memset(&buf_reg, 0, sizeof(buf_reg));
> > > > > > + if (copy_from_user(&buf_reg, argp,
> > > > > > + min_t(size_t, header->len, sizeof(buf_reg))))
> > > > > > + return -EFAULT;
> > > > > > +
> > > > > > + if (buf_reg.flags & ~UBLK_SHMEM_BUF_READ_ONLY)
> > > > > > + return -EINVAL;
> > > > > > +
> > > > > > + addr = buf_reg.addr;
> > > > > > + size = buf_reg.len;
> > > > >
> > > > > nit: don't see much value in these additional variables that are just
> > > > > copies of buf_reg fields
> > > > >
> > > > > > + nr_pages = size >> PAGE_SHIFT;
> > > > > > +
> > > > > > + if (!size || !PAGE_ALIGNED(size) || !PAGE_ALIGNED(addr))
> > > > > > + return -EINVAL;
> > > > > > +
> > > > > > + disk = ublk_get_disk(ub);
> > > > > > + if (!disk)
> > > > > > + return -ENODEV;
> > > > >
> > > > > So buffers can't be registered before the ublk device is started? Is
> > > > > there a reason why that's not possible? Could we just make the
> > > > > ublk_quiesce_and_release() and ublk_unquiesce_and_resume() conditional
> > > > > on disk being non-NULL? I guess we'd have to hold the ublk_device
> > > > > mutex before calling ublk_get_disk() to prevent it from being assigned
> > > > > concurrently.
> > > >
> > > > Here `disk` is used for freeze & quiesce queue.
> > > >
> > > > But the implementation can be a bit more complicated given the dependency
> > > > between ub->mutex and freeze queue should be avoided.
> > > >
> > > > Anyway, it is one nice requirement, I will try to relax the constraint.
> > > >
> > > > >
> > > > > > +
> > > > > > + /* Pin pages before quiescing (may sleep) */
> > > > > > + ubuf = kzalloc(sizeof(*ubuf), GFP_KERNEL);
> > > > > > + if (!ubuf) {
> > > > > > + ret = -ENOMEM;
> > > > > > + goto put_disk;
> > > > > > + }
> > > > > > +
> > > > > > + ubuf->pages = kvmalloc_array(nr_pages, sizeof(*ubuf->pages),
> > > > > > + GFP_KERNEL);
> > > > > > + if (!ubuf->pages) {
> > > > > > + ret = -ENOMEM;
> > > > > > + goto err_free;
> > > > > > + }
> > > > > > +
> > > > > > + gup_flags = FOLL_LONGTERM;
> > > > > > + if (!(buf_reg.flags & UBLK_SHMEM_BUF_READ_ONLY))
> > > > > > + gup_flags |= FOLL_WRITE;
> > > > > > +
> > > > > > + pinned = pin_user_pages_fast(addr, nr_pages, gup_flags, ubuf->pages);
> > > > > > + if (pinned < 0) {
> > > > > > + ret = pinned;
> > > > > > + goto err_free_pages;
> > > > > > + }
> > > > > > + if (pinned != nr_pages) {
> > > > > > + ret = -EFAULT;
> > > > > > + goto err_unpin;
> > > > > > + }
> > > > > > + ubuf->nr_pages = nr_pages;
> > > > > > +
> > > > > > + /*
> > > > > > + * Drain inflight I/O and quiesce the queue so no new requests
> > > > > > + * are dispatched while we modify the maple tree. Keep freeze
> > > > > > + * and mutex non-nested to avoid lock dependency.
> > > > > > + */
> > > > > > + ublk_quiesce_and_release(disk);
> > > > > > +
> > > > > > + mutex_lock(&ub->mutex);
> > > > >
> > > > > Looks like the xarray and maple tree do their own spinlocking, is this needed?
> > > >
> > > > Right, looks it isn't needed now, and it was added from beginning with plain
> > > > array.
> > > >
> > > > >
> > > > > > +
> > > > > > + ret = xa_alloc(&ub->bufs_xa, &index, ubuf, xa_limit_16b, GFP_KERNEL);
> > > > > > + if (ret)
> > > > > > + goto err_unlock;
> > > > > > +
> > > > > > + ret = __ublk_ctrl_reg_buf(ub, ubuf, index, buf_reg.flags);
> > > > > > + if (ret) {
> > > > > > + xa_erase(&ub->bufs_xa, index);
> > > > > > + goto err_unlock;
> > > > > > + }
> > > > > > +
> > > > > > + mutex_unlock(&ub->mutex);
> > > > > > +
> > > > > > + ublk_unquiesce_and_resume(disk);
> > > > > > + ublk_put_disk(disk);
> > > > > > + return index;
> > > > > > +
> > > > > > +err_unlock:
> > > > > > + mutex_unlock(&ub->mutex);
> > > > > > + ublk_unquiesce_and_resume(disk);
> > > > > > +err_unpin:
> > > > > > + unpin_user_pages(ubuf->pages, pinned);
> > > > > > +err_free_pages:
> > > > > > + kvfree(ubuf->pages);
> > > > > > +err_free:
> > > > > > + kfree(ubuf);
> > > > > > +put_disk:
> > > > > > + ublk_put_disk(disk);
> > > > > > + return ret;
> > > > > > +}
> > > > > > +
> > > > > > +static void __ublk_ctrl_unreg_buf(struct ublk_device *ub,
> > > > > > + struct ublk_buf *ubuf)
> > > > > > +{
> > > > > > + ublk_buf_erase_ranges(ub, ubuf, ubuf->nr_pages);
> > > > > > + unpin_user_pages(ubuf->pages, ubuf->nr_pages);
> > > > > > + kvfree(ubuf->pages);
> > > > > > + kfree(ubuf);
> > > > > > +}
> > > > > > +
> > > > > > +static int ublk_ctrl_unreg_buf(struct ublk_device *ub,
> > > > > > + struct ublksrv_ctrl_cmd *header)
> > > > > > +{
> > > > > > + int index = (int)header->data[0];
> > > > > > + struct gendisk *disk;
> > > > > > + struct ublk_buf *ubuf;
> > > > > > +
> > > > > > + if (!ublk_dev_support_shmem_zc(ub))
> > > > > > + return -EOPNOTSUPP;
> > > > > > +
> > > > > > + disk = ublk_get_disk(ub);
> > > > > > + if (!disk)
> > > > > > + return -ENODEV;
> > > > > > +
> > > > > > + /* Drain inflight I/O before modifying the maple tree */
> > > > > > + ublk_quiesce_and_release(disk);
> > > > > > +
> > > > > > + mutex_lock(&ub->mutex);
> > > > > > +
> > > > > > + ubuf = xa_erase(&ub->bufs_xa, index);
> > > > > > + if (!ubuf) {
> > > > > > + mutex_unlock(&ub->mutex);
> > > > > > + ublk_unquiesce_and_resume(disk);
> > > > > > + ublk_put_disk(disk);
> > > > > > + return -ENOENT;
> > > > > > + }
> > > > > > +
> > > > > > + __ublk_ctrl_unreg_buf(ub, ubuf);
> > > > > > +
> > > > > > + mutex_unlock(&ub->mutex);
> > > > > > +
> > > > > > + ublk_unquiesce_and_resume(disk);
> > > > > > + ublk_put_disk(disk);
> > > > > > + return 0;
> > > > > > +}
> > > > > > +
> > > > > > +static void ublk_buf_cleanup(struct ublk_device *ub)
> > > > > > +{
> > > > > > + struct ublk_buf *ubuf;
> > > > > > + unsigned long index;
> > > > > > +
> > > > > > + xa_for_each(&ub->bufs_xa, index, ubuf)
> > > > > > + __ublk_ctrl_unreg_buf(ub, ubuf);
> > > > >
> > > > > This looks quadratic in the number of registered buffers. Can we do a
> > > > > single pass over the xarray and the maple tree?
> > > >
> > > > It can be done two passes: first pass walks maple tree for unpinning
> > > > pages, the 2nd pass is for freeing buffers in xarray, which looks more
> > > > clean.
> > > >
> > > > But the current way can reuse __ublk_ctrl_unreg_buf(), also
> > > > ublk_buf_cleanup() is only called one-shot in device release handler, so we can
> > > > leave it for future optimization.
> > > >
> > > > >
> > > > > > + xa_destroy(&ub->bufs_xa);
> > > > > > + mtree_destroy(&ub->buf_tree);
> > > > > > +}
> > > > > > +
> > > > > > +
> > > > > > +
> > > > > > static int ublk_ctrl_uring_cmd_permission(struct ublk_device *ub,
> > > > > > u32 cmd_op, struct ublksrv_ctrl_cmd *header)
> > > > > > {
> > > > > > @@ -5225,6 +5517,8 @@ static int ublk_ctrl_uring_cmd_permission(struct ublk_device *ub,
> > > > > > case UBLK_CMD_UPDATE_SIZE:
> > > > > > case UBLK_CMD_QUIESCE_DEV:
> > > > > > case UBLK_CMD_TRY_STOP_DEV:
> > > > > > + case UBLK_CMD_REG_BUF:
> > > > > > + case UBLK_CMD_UNREG_BUF:
> > > > > > mask = MAY_READ | MAY_WRITE;
> > > > > > break;
> > > > > > default:
> > > > > > @@ -5350,6 +5644,12 @@ static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd,
> > > > > > case UBLK_CMD_TRY_STOP_DEV:
> > > > > > ret = ublk_ctrl_try_stop_dev(ub);
> > > > > > break;
> > > > > > + case UBLK_CMD_REG_BUF:
> > > > > > + ret = ublk_ctrl_reg_buf(ub, &header);
> > > > > > + break;
> > > > > > + case UBLK_CMD_UNREG_BUF:
> > > > > > + ret = ublk_ctrl_unreg_buf(ub, &header);
> > > > > > + break;
> > > > > > default:
> > > > > > ret = -EOPNOTSUPP;
> > > > > > break;
> > > > > > diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
> > > > > > index a88876756805..52bb9b843d73 100644
> > > > > > --- a/include/uapi/linux/ublk_cmd.h
> > > > > > +++ b/include/uapi/linux/ublk_cmd.h
> > > > > > @@ -57,6 +57,44 @@
> > > > > > _IOWR('u', 0x16, struct ublksrv_ctrl_cmd)
> > > > > > #define UBLK_U_CMD_TRY_STOP_DEV \
> > > > > > _IOWR('u', 0x17, struct ublksrv_ctrl_cmd)
> > > > > > +/*
> > > > > > + * Register a shared memory buffer for zero-copy I/O.
> > > > > > + * Input: ctrl_cmd.addr points to struct ublk_buf_reg (buffer VA + size)
> > > > > > + * ctrl_cmd.len = sizeof(struct ublk_buf_reg)
> > > > > > + * Result: >= 0 is the assigned buffer index, < 0 is error
> > > > > > + *
> > > > > > + * The kernel pins pages from the calling process's address space
> > > > > > + * and inserts PFN ranges into a per-device maple tree. When a block
> > > > > > + * request's pages match registered pages, the driver sets
> > > > > > + * UBLK_IO_F_SHMEM_ZC and encodes the buffer index + offset in addr,
> > > > > > + * allowing the server to access the data via its own mapping of the
> > > > > > + * same shared memory — true zero copy.
> > > > > > + *
> > > > > > + * The memory can be backed by memfd, hugetlbfs, or any GUP-compatible
> > > > > > + * shared mapping. Queue freeze is handled internally.
> > > > > > + *
> > > > > > + * The buffer VA and size are passed via a user buffer (not inline in
> > > > > > + * ctrl_cmd) so that unprivileged devices can prepend the device path
> > > > > > + * to ctrl_cmd.addr without corrupting the VA.
> > > > > > + */
> > > > > > +#define UBLK_U_CMD_REG_BUF \
> > > > > > + _IOWR('u', 0x18, struct ublksrv_ctrl_cmd)
> > > > > > +/*
> > > > > > + * Unregister a shared memory buffer.
> > > > > > + * Input: ctrl_cmd.data[0] = buffer index
> > > > > > + */
> > > > > > +#define UBLK_U_CMD_UNREG_BUF \
> > > > > > + _IOWR('u', 0x19, struct ublksrv_ctrl_cmd)
> > > > > > +
> > > > > > +/* Parameter buffer for UBLK_U_CMD_REG_BUF, pointed to by ctrl_cmd.addr */
> > > > > > +struct ublk_shmem_buf_reg {
> > > > > > + __u64 addr; /* userspace virtual address of shared memory */
> > > > > > + __u32 len; /* buffer size in bytes (page-aligned, max 4GB) */
> > > > > > + __u32 flags;
> > > > > > +};
> > > > > > +
> > > > > > +/* Pin pages without FOLL_WRITE; usable with write-sealed memfd */
> > > > > > +#define UBLK_SHMEM_BUF_READ_ONLY (1U << 0)
> > > > > > /*
> > > > > > * 64bits are enough now, and it should be easy to extend in case of
> > > > > > * running out of feature flags
> > > > > > @@ -370,6 +408,7 @@
> > > > > > /* Disable automatic partition scanning when device is started */
> > > > > > #define UBLK_F_NO_AUTO_PART_SCAN (1ULL << 18)
> > > > > >
> > > > > > +
> > > > > > /* device state */
> > > > > > #define UBLK_S_DEV_DEAD 0
> > > > > > #define UBLK_S_DEV_LIVE 1
> > > > > > @@ -469,6 +508,12 @@ struct ublksrv_ctrl_dev_info {
> > > > > > #define UBLK_IO_F_NEED_REG_BUF (1U << 17)
> > > > > > /* Request has an integrity data buffer */
> > > > > > #define UBLK_IO_F_INTEGRITY (1UL << 18)
> > > > > > +/*
> > > > > > + * I/O buffer is in a registered shared memory buffer. When set, the addr
> > > > > > + * field in ublksrv_io_desc encodes buffer index and byte offset instead
> > > > > > + * of a userspace virtual address.
> > > > > > + */
> > > > > > +#define UBLK_IO_F_SHMEM_ZC (1U << 19)
> > > > > >
> > > > > > /*
> > > > > > * io cmd is described by this structure, and stored in share memory, indexed
> > > > > > @@ -743,4 +788,31 @@ struct ublk_params {
> > > > > > struct ublk_param_integrity integrity;
> > > > > > };
> > > > > >
> > > > > > +/*
> > > > > > + * Shared memory zero-copy addr encoding for UBLK_IO_F_SHMEM_ZC.
> > > > > > + *
> > > > > > + * When UBLK_IO_F_SHMEM_ZC is set, ublksrv_io_desc.addr is encoded as:
> > > > > > + * bits [0:31] = byte offset within the buffer (up to 4GB)
> > > > > > + * bits [32:47] = buffer index (up to 65536)
> > > > > > + * bits [48:63] = reserved (must be zero)
> > > > >
> > > > > I wonder whether the "buffer index" is necessary. Can iod->addr and
> > > > > UBLK_U_CMD_UNREG_BUF refer to the buffer by the virtual address used
> > > > > with UBLK_U_CMD_REG_BUF? Then struct ublksrv_io_desc's addr field
> > > > > would retain its meaning. We would also avoid needing to compare the
> > > > > range buf_index values in ublk_try_buf_match(). And the xarray
> > > > > wouldn't be necessary to allocate buffer indices.
> > > >
> > > > There are several reasons for choosing "buffer index":
> > > >
> > > > - avoid to add extra storage in `struct ublk_buf_range`, extra
> > > > `vaddr` field is required for returning virtual address; otherwise one extra
> > > > lookup from buffer index is needed in fast path of ublk_try_buf_match()
> > >
> > > Yes, struct ublk_buf_range would have to track the virtual address,
> > > but that would replace both buf_index and base_offset. And you could
> > > store flags in the lower bits of the virtual address (since it has to
> > > be page-aligned) if you want to keep it as 16 bytes. I'm not sure what
> > > you mean about "one extra lookup from buffer index"; I was thinking it
> > > would be possible to get rid of buffer indices entirely.
> > >
> > > >
> > > > - I want ublk server to know the difference between shmem_zc buffer and the
> > > > plain IO buffer clearly, both two shouldn't be mixed, otherwise it is easy to
> > > > cause data corruption. For example, client is using buf A, but the
> > > > ublk server fallback code path may be using it at the same time.
> > >
> > > Yes, certainly the ublk server should only use a shared-memory buffer
> > > when an incoming UBLK_IO_F_SHMEM_ZC request specifies it. I don't see
> > > the connection to referring to buffers by virtual address instead of
> > > buffer index, though. The kernel can't prevent the ublk server from
> > > writing to a registered shared memory region it has mapped into its
> > > address space.
> > >
> > > It seems like a simpler interface if iod->addr indicates the virtual
> > > address of the data buffer regardless of the UBLK_IO_F_SHMEM_ZC flag.
> > > Then the ublk server doesn't have to care whether the data is in
> > > shared memory or was copied automatically by the
> > > ublk_map_io()/ublk_unmap_io() fallback path; it just accesses the
> > > memory at iod->addr and lets the kernel take care of the copying (if
> > > necessary).
> > >
> > > >
> > > > - total registered buffers can be limited naturally by `u16` buffer_index
> > >
> > > I don't really see a reason to limit the number of SHMEM_ZC buffers if
> > > they don't consume any per-buffer resources. I think it would make
> > > more sense to limit it based on the _total size_ of registered
> > > buffers, but the page pinning should already provide that.
> > >
> > > >
> > > > - it is proved that buffer index is one nice pattern wrt. buffer
> > > > registration, such as io_uring fixed buffer; and it helps userspace
> > > > to manage multiple buffers, given each one has unique ID.
> > >
> > > If you really prefer the (buf_index, buf_offset) interface, I won't
> > > stand in the way. It just seems to me like the only thing the ublk
> > > server cares about a shared memory buffer is its virtual address, so
> > > that's a more natural way to refer to it.
> >
> > It looks a little simpler from UAPI viewpoint, but storing vaddr in maple
> > tree introduces some implementation issues:
> >
> > - vaddr is bound with task, so ublk device has to track task context, such
> > as, when we allow to register buffer before adding device, the buffer has
> > to be guaranteed to belong to ublk daemon(tracked in future)
>
> Don't all tasks within a process share the same virtual address space?
> Then the threads in the ublk server process should all have the same
> virtual address for the registered buffer. I guess it's theoretically
> possible that a different process would register a shared memory
> region on behalf of the ublk server, and then they could have
> different virtual addresses for it, but that seems like a bizarre use
> case.
I meant processes:
- control command can be issued from any task/processes for
registering/unregistering shared memory, before opening ublk char
device, when it is hard to track ublk daemon
- difference processes are involved in whole ublk device lifetime in
case of recover
>
> >
> > - not like the plain page copy code path, in which the buffer vaddr is
> > always passed in daemon context; but buffer register can be done in any
> > task context.
> >
> > - not get ID for buffer, it could become a bit complicated to handle
> > recover if we store vaddr in mapple tree; vaddr can become invalid in
> > device lifetime, but buf index is device-wide, which can't become obsolete.
>
> Fair point, though I'm not even sure how the new ublk server process
> would be aware of the shared memory regions that belonged to the
> previous process. If they were mapped from an anonymous memfd, the new
> process wouldn't have any way to access them. I think it probably
> makes more sense to unregister the shared memory regions when the ublk
> server process exits.
So far, let's ublk server deal with it.
But we can improve this situation in future.
>
> >
> > Anyway, I don't object to switch to vaddr based UAPI, and we have enough
> > time to think it though and take it before 7.1 release.
> >
> > I will send patchset to cover other review comments first.
>
> Sure, I think you have good reasons for the buf_index representation
> and it doesn't really matter either way. I'm fine keeping this
> interface.
OK.
Thanks,
Ming
^ permalink raw reply
* Re: [PATCH v10 13/13] docs: add io_queue flag to isolcpus
From: Aaron Tomlin @ 2026-04-10 1:45 UTC (permalink / raw)
To: Ming Lei
Cc: axboe, kbusch, hch, sagi, mst, aacraid, James.Bottomley,
martin.petersen, liyihang9, kashyap.desai, sumit.saxena,
shivasharan.srikanteshwara, chandrakanth.patil, sathya.prakash,
sreekanth.reddy, suganath-prabu.subramani, ranjan.kumar,
jinpu.wang, tglx, mingo, peterz, juri.lelli, vincent.guittot,
akpm, maz, ruanjinjie, bigeasy, yphbchou0911, wagi, frederic,
longman, chenridong, hare, kch, steve, sean, chjohnst, neelx,
mproche, linux-block, linux-kernel, virtualization, linux-nvme,
linux-scsi, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
MPT-FusionLinux.pdl, Lei, Ming
In-Reply-To: <CAFj5m9JE5e4DRGbzQFxDdZWU76ZPQ3G+C9JpLu0mhTB6aesZ9g@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 2958 bytes --]
On Thu, Apr 09, 2026 at 11:00:09PM +0800, Ming Lei wrote:
> How can the isolated core be scheduled for running polling task?
>
> Who triggered it?
>
> > loop waiting for the hardware completion. This would completely monopolise
> > the core and destroy any real time isolation guarantees without the user
> > space application ever having requested it.
>
> No.
>
> IOPOLL queue doesn't have interrupt, and the ->poll() is only run from
> the submission context. So if you don't submitted polled IO on isolated
> CPU cores, everything is just fine. This is simpler than irq IO actually.
Yes, you are entirely correct. The ->iopoll() is indeed executed strictly
within the submission context. In the example below, the file operations
iopoll callback is iocb_bio_iopoll():
// file->f_op->iopoll(&rw->kiocb, iob, poll_flags)
iocb_bio_iopoll(&rw->kiocb, iob, poll_flags)
{
struct bio *bio
bio = READ_ONCE(kiocb->private)
if (bio)
bio_poll(bio, iob, flags)
if (queue_is_mq(q))
blk_mq_poll(q, cookie, iob, flags)
{
if (!blk_mq_can_poll(q))
return 0
blk_hctx_poll(q, q->queue_hw_ctx[cookie], iob, flags)
{
int ret
do {
ret = q->mq_ops->poll(hctx, iob)
if (ret > 0)
return ret
if (task_sigpending(current))
return 1
if (ret < 0 || (flags & BLK_POLL_ONESHOT))
break
cpu_relax()
} while (!need_resched())
return 0
}
}
If an application on an isolated CPU does not explicitly submit a polled
I/O request, it will not poll. Thank you for correcting me on this.
> Can you share one example in which managed irq can't address?
Without io_queue, the block layer maps isolated CPUs to these queues, and
the device will fire unmanaged interrupts that can freely land on isolated
CPUs, thereby breaking isolation. By applying the constraint via io_queue
at the block layer, we restrict the hardware queue count and map the
isolated CPUs to the housekeeping queues, ensuring isolation is maintained
regardless of whether the driver uses managed interrupts.
Does the above help?
> > >
> > > IMO, only two differences from this viewpoint:
> > >
> > > 1) `io_queue` may reduce nr_hw_queues
> > >
> > > 2) when application submits IO from isolated CPUs, `io_queue` can complete
> > > IO from housekeeping CPUs.
> >
> > Acknowledged.
>
> Are there other major differences besides the two mentioned above?
I believe the above is sufficient. Please let me know your thoughts.
Kind regards,
--
Aaron Tomlin
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH] block: refactor blkdev_zone_mgmt_ioctl
From: Jens Axboe @ 2026-04-10 1:15 UTC (permalink / raw)
To: dlemoal, Christoph Hellwig; +Cc: linux-block
In-Reply-To: <20260327090032.3722065-1-hch@lst.de>
On Fri, 27 Mar 2026 10:00:32 +0100, Christoph Hellwig wrote:
> Split the zone reset case into a separate helper so that the conditional
> locking goes away.
Applied, thanks!
[1/1] block: refactor blkdev_zone_mgmt_ioctl
commit: 539fb773a3f7c07cf7fd00617f33ed4e33058d72
Best regards,
--
Jens Axboe
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox