* [PATCH 0/2] New zoned loop block device driver
@ 2025-01-06 14:24 Damien Le Moal
2025-01-06 14:24 ` [PATCH 1/2] block: new " Damien Le Moal
` (2 more replies)
0 siblings, 3 replies; 35+ messages in thread
From: Damien Le Moal @ 2025-01-06 14:24 UTC (permalink / raw)
To: Jens Axboe, linux-block; +Cc: Christoph Hellwig
The first patch implements the new "zloop" zoned block device driver
which allows creating zoned block devices using one regular file per
zone as backing storage.
The second patch adds documentation for this driver (overview and usage
examples).
About half of the code of the first patch is from Christoph Hellwig.
Damien Le Moal (2):
block: new zoned loop block device driver
Documentation: Document the new zoned loop block device driver
Documentation/admin-guide/blockdev/index.rst | 1 +
.../admin-guide/blockdev/zoned_loop.rst | 168 +++
MAINTAINERS | 8 +
drivers/block/Kconfig | 16 +
drivers/block/Makefile | 1 +
drivers/block/zloop.c | 1330 +++++++++++++++++
6 files changed, 1524 insertions(+)
create mode 100644 Documentation/admin-guide/blockdev/zoned_loop.rst
create mode 100644 drivers/block/zloop.c
base-commit: 9d89551994a430b50c4fffcb1e617a057fa76e20
--
2.47.1
* [PATCH 1/2] block: new zoned loop block device driver
2025-01-06 14:24 [PATCH 0/2] New zoned loop block device driver Damien Le Moal
@ 2025-01-06 14:24 ` Damien Le Moal
2025-01-06 14:24 ` [PATCH 2/2] Documentation: Document the " Damien Le Moal
2025-01-06 14:54 ` [PATCH 0/2] New " Jens Axboe
2 siblings, 0 replies; 35+ messages in thread
From: Damien Le Moal @ 2025-01-06 14:24 UTC (permalink / raw)
To: Jens Axboe, linux-block; +Cc: Christoph Hellwig
The zoned loop block device driver allows a user to create emulated
zoned block devices using one regular file per zone as backing storage.
Compared to null_blk or scsi_debug, it has the advantage of being able
to emulate large zoned devices without requiring an amount of memory
equal to the capacity of the emulated device. Furthermore, zoned
devices emulated with this driver can be re-started after a host reboot
without any loss of the state of the device zones, which is something
that null_blk and scsi_debug do not support.
This initial implementation is simple and does not support zone resource
limits. That is, the maximum number of open zones and the maximum number
of active zones of a zoned loop block device are always reported as 0
(no limit).
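The reported limits can be checked through sysfs. An illustrative
session, assuming a zloop device named zloop0 exists:
$ cat /sys/block/zloop0/queue/max_open_zones
0
$ cat /sys/block/zloop0/queue/max_active_zones
0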
This driver can be compiled either into the kernel or as a module named
"zloop". Compiling this driver requires the block layer support for
zoned block devices (CONFIG_BLK_DEV_ZONED must be set).
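As an illustration, a minimal kernel configuration fragment enabling the
driver as a module could look like this (a sketch; CONFIG_BLK_DEV_ZONED
availability depends on the base configuration):
CONFIG_BLK_DEV_ZONED=y
CONFIG_BLK_DEV_ZONED_LOOP=m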
Using the zloop driver to create and delete zoned block devices is
done by writing commands to the zoned loop control character device file
(/dev/zloop-control). Creating a device is done with:
$ echo "add [options]" > /dev/zloop-control
The options available for the "add" operation can be listed by reading
the zloop-control device file:
$ cat /dev/zloop-control
add id=%d,capacity_mb=%u,zone_size_mb=%u,zone_capacity_mb=%u,conv_zones=%u,base_dir=%s,nr_queues=%u,queue_depth=%u
remove id=%d
The options available allow controlling the zoned device total
capacity, zone size, zone capacity of sequential zones, total number
of conventional zones, base directory for the zone backing files, number
of I/O queues and the maximum queue depth of the I/O queues.
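For example, a hypothetical invocation creating /dev/zloop0 as a 4 GiB
device with 64 MiB zones, leaving all other options at their default
values (the device zone file directory must exist beforehand):
$ mkdir -p /var/local/zloop/0
$ echo "add id=0,capacity_mb=4096,zone_size_mb=64" > /dev/zloop-control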
Deleting a device is done using the "remove" command:
$ echo "remove id=0" > /dev/zloop-control
This implementation passes various tests using zonefs and fio (t/zbd
tests) and provides a state machine for zone conditions that is
compliant with the T10 ZBC and NVMe ZNS specifications.
Co-developed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
MAINTAINERS | 7 +
drivers/block/Kconfig | 16 +
drivers/block/Makefile | 1 +
drivers/block/zloop.c | 1330 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 1354 insertions(+)
create mode 100644 drivers/block/zloop.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 30cbc3d44cd5..49076a506b6b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -25972,6 +25972,13 @@ L: linux-kernel@vger.kernel.org
S: Maintained
F: arch/x86/kernel/cpu/zhaoxin.c
+ZONED LOOP DEVICE
+M: Damien Le Moal <dlemoal@kernel.org>
+R: Christoph Hellwig <hch@lst.de>
+L: linux-block@vger.kernel.org
+S: Maintained
+F: drivers/block/zloop.c
+
ZONEFS FILESYSTEM
M: Damien Le Moal <dlemoal@kernel.org>
M: Naohiro Aota <naohiro.aota@wdc.com>
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index a97f2c40c640..12af52a77cf8 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -413,4 +413,20 @@ config BLKDEV_UBLK_LEGACY_OPCODES
source "drivers/block/rnbd/Kconfig"
+config BLK_DEV_ZONED_LOOP
+ tristate "Zoned loopback device support"
+ depends on BLK_DEV_ZONED
+ help
+ Saying Y here will allow you to create a zoned block device using
+ regular files for zones (one file per zone). This is useful to test
+ file systems, device mapper and applications that support zoned block
+ devices. To create a zoned loop device, no user utility is needed; a
+ zoned loop device can be created (or re-started) using a command
+ like:
+
+ echo "add id=0,zone_size_mb=256,capacity_mb=41943040,conv_zones=11" > \
+ /dev/zloop-control
+
+ If unsure, say N.
+
endif # BLK_DEV
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 1105a2d4fdcb..097707aca725 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -41,5 +41,6 @@ obj-$(CONFIG_BLK_DEV_RNBD) += rnbd/
obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk/
obj-$(CONFIG_BLK_DEV_UBLK) += ublk_drv.o
+obj-$(CONFIG_BLK_DEV_ZONED_LOOP) += zloop.o
swim_mod-y := swim.o swim_asm.o
diff --git a/drivers/block/zloop.c b/drivers/block/zloop.c
new file mode 100644
index 000000000000..b0b398b9aad0
--- /dev/null
+++ b/drivers/block/zloop.c
@@ -0,0 +1,1330 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2025, Christoph Hellwig.
+ * Copyright (c) 2025, Western Digital Corporation or its affiliates.
+ *
+ * Zoned Loop Device driver - exports a zoned block device using one file per
+ * zone as backing storage.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/blk-mq.h>
+#include <linux/blkzoned.h>
+#include <linux/pagemap.h>
+#include <linux/miscdevice.h>
+#include <linux/falloc.h>
+#include <linux/mutex.h>
+#include <linux/parser.h>
+#include <linux/seq_file.h>
+
+/*
+ * Options for adding (and removing) a device.
+ */
+enum {
+ ZLOOP_OPT_ERR = 0,
+ ZLOOP_OPT_ID = (1 << 0),
+ ZLOOP_OPT_CAPACITY = (1 << 1),
+ ZLOOP_OPT_ZONE_SIZE = (1 << 2),
+ ZLOOP_OPT_ZONE_CAPACITY = (1 << 3),
+ ZLOOP_OPT_NR_CONV_ZONES = (1 << 4),
+ ZLOOP_OPT_BASE_DIR = (1 << 5),
+ ZLOOP_OPT_NR_QUEUES = (1 << 6),
+ ZLOOP_OPT_QUEUE_DEPTH = (1 << 7),
+};
+
+static const match_table_t zloop_opt_tokens = {
+ { ZLOOP_OPT_ID, "id=%d" },
+ { ZLOOP_OPT_CAPACITY, "capacity_mb=%u" },
+ { ZLOOP_OPT_ZONE_SIZE, "zone_size_mb=%u" },
+ { ZLOOP_OPT_ZONE_CAPACITY, "zone_capacity_mb=%u" },
+ { ZLOOP_OPT_NR_CONV_ZONES, "conv_zones=%u" },
+ { ZLOOP_OPT_BASE_DIR, "base_dir=%s" },
+ { ZLOOP_OPT_NR_QUEUES, "nr_queues=%u" },
+ { ZLOOP_OPT_QUEUE_DEPTH, "queue_depth=%u" },
+ { ZLOOP_OPT_ERR, NULL }
+};
+
+/* Default values for the "add" operation. */
+#define ZLOOP_DEF_ID -1
+#define ZLOOP_DEF_ZONE_SIZE ((256ULL * SZ_1M) >> SECTOR_SHIFT)
+#define ZLOOP_DEF_NR_ZONES 64
+#define ZLOOP_DEF_NR_CONV_ZONES 8
+#define ZLOOP_DEF_BASE_DIR "/var/local/zloop"
+#define ZLOOP_DEF_NR_QUEUES 1
+#define ZLOOP_DEF_QUEUE_DEPTH 64
+
+/* Arbitrary limit on the zone size (16GB). */
+#define ZLOOP_MAX_ZONE_SIZE_MB 16384
+
+struct zloop_options {
+ unsigned int mask;
+ int id;
+ sector_t capacity;
+ sector_t zone_size;
+ sector_t zone_capacity;
+ unsigned int nr_conv_zones;
+ char *base_dir;
+ unsigned int nr_queues;
+ unsigned int queue_depth;
+};
+
+/*
+ * Device states.
+ */
+enum {
+ Zlo_creating = 0,
+ Zlo_live,
+ Zlo_deleting,
+};
+
+enum zloop_zone_flags {
+ ZLOOP_ZONE_CONV = 0,
+ ZLOOP_ZONE_SEQ_ERROR,
+};
+
+struct zloop_zone {
+ struct file *file;
+
+ unsigned long flags;
+ struct mutex lock;
+ enum blk_zone_cond cond;
+ sector_t start;
+ sector_t wp;
+
+ gfp_t old_gfp_mask;
+};
+
+struct zloop_device {
+ unsigned int id;
+ unsigned int state;
+
+ struct blk_mq_tag_set tag_set;
+ struct gendisk *disk;
+
+ spinlock_t work_lock;
+ struct workqueue_struct *workqueue;
+ struct work_struct work;
+ struct list_head cmd_list;
+
+ const char *base_dir;
+ struct file *data_dir;
+
+ unsigned int zone_shift;
+ sector_t zone_size;
+ sector_t zone_capacity;
+ unsigned int nr_zones;
+ unsigned int nr_conv_zones;
+
+ struct zloop_zone zones[] __counted_by(nr_zones);
+};
+
+struct zloop_cmd {
+ struct list_head entry;
+ atomic_t ref;
+ sector_t sector;
+ sector_t nr_sectors;
+ long ret;
+ struct kiocb iocb;
+ struct bio_vec *bvec;
+};
+
+static DEFINE_IDR(zloop_index_idr);
+static DEFINE_MUTEX(zloop_ctl_mutex);
+
+static unsigned int rq_zone_no(struct request *rq)
+{
+ struct zloop_device *zlo = rq->q->queuedata;
+
+ return blk_rq_pos(rq) >> zlo->zone_shift;
+}
+
+static int zloop_update_seq_zone(struct zloop_device *zlo, unsigned int zone_no)
+{
+ struct zloop_zone *zone = &zlo->zones[zone_no];
+ struct kstat stat;
+ sector_t file_sectors;
+ int ret;
+
+ lockdep_assert_held(&zone->lock);
+
+ ret = vfs_getattr(&zone->file->f_path, &stat, STATX_SIZE, 0);
+ if (ret < 0) {
+ pr_err("Failed to get zone %u file stat (err=%d)\n",
+ zone_no, ret);
+ set_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags);
+ return ret;
+ }
+
+ file_sectors = stat.size >> SECTOR_SHIFT;
+ if (file_sectors > zlo->zone_capacity) {
+ pr_err("Zone %u file too large (%llu sectors > %llu)\n",
+ zone_no, file_sectors, zlo->zone_capacity);
+ return -EINVAL;
+ }
+
+ if (!file_sectors) {
+ zone->cond = BLK_ZONE_COND_EMPTY;
+ zone->wp = zone->start;
+ } else if (file_sectors == zlo->zone_capacity) {
+ zone->cond = BLK_ZONE_COND_FULL;
+ zone->wp = zone->start + zlo->zone_size;
+ } else {
+ zone->cond = BLK_ZONE_COND_CLOSED;
+ zone->wp = zone->start + file_sectors;
+ }
+
+ return 0;
+}
+
+static int zloop_open_zone(struct zloop_device *zlo, unsigned int zone_no)
+{
+ struct zloop_zone *zone = &zlo->zones[zone_no];
+ int ret = 0;
+
+ if (test_bit(ZLOOP_ZONE_CONV, &zone->flags))
+ return -EIO;
+
+ mutex_lock(&zone->lock);
+
+ if (test_and_clear_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags)) {
+ ret = zloop_update_seq_zone(zlo, zone_no);
+ if (ret)
+ goto unlock;
+ }
+
+ switch (zone->cond) {
+ case BLK_ZONE_COND_EXP_OPEN:
+ break;
+ case BLK_ZONE_COND_EMPTY:
+ case BLK_ZONE_COND_CLOSED:
+ case BLK_ZONE_COND_IMP_OPEN:
+ zone->cond = BLK_ZONE_COND_EXP_OPEN;
+ break;
+ case BLK_ZONE_COND_FULL:
+ default:
+ ret = -EIO;
+ break;
+ }
+
+unlock:
+ mutex_unlock(&zone->lock);
+
+ return ret;
+}
+
+static int zloop_close_zone(struct zloop_device *zlo, unsigned int zone_no)
+{
+ struct zloop_zone *zone = &zlo->zones[zone_no];
+ int ret = 0;
+
+ if (test_bit(ZLOOP_ZONE_CONV, &zone->flags))
+ return -EIO;
+
+ mutex_lock(&zone->lock);
+
+ if (test_and_clear_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags)) {
+ ret = zloop_update_seq_zone(zlo, zone_no);
+ if (ret)
+ goto unlock;
+ }
+
+ switch (zone->cond) {
+ case BLK_ZONE_COND_CLOSED:
+ break;
+ case BLK_ZONE_COND_IMP_OPEN:
+ case BLK_ZONE_COND_EXP_OPEN:
+ if (zone->wp == zone->start)
+ zone->cond = BLK_ZONE_COND_EMPTY;
+ else
+ zone->cond = BLK_ZONE_COND_CLOSED;
+ break;
+ case BLK_ZONE_COND_EMPTY:
+ case BLK_ZONE_COND_FULL:
+ default:
+ ret = -EIO;
+ break;
+ }
+
+unlock:
+ mutex_unlock(&zone->lock);
+
+ return ret;
+}
+
+static int zloop_reset_zone(struct zloop_device *zlo, unsigned int zone_no)
+{
+ struct zloop_zone *zone = &zlo->zones[zone_no];
+
+ if (test_bit(ZLOOP_ZONE_CONV, &zone->flags))
+ return -EIO;
+
+ if (vfs_truncate(&zone->file->f_path, 0)) {
+ set_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags);
+ return -EIO;
+ }
+
+ mutex_lock(&zone->lock);
+ zone->cond = BLK_ZONE_COND_EMPTY;
+ zone->wp = zone->start;
+ clear_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags);
+ mutex_unlock(&zone->lock);
+
+ return 0;
+}
+
+static int zloop_reset_all_zones(struct request *rq)
+{
+ struct zloop_device *zlo = rq->q->queuedata;
+ unsigned int i;
+ int ret;
+
+ for (i = zlo->nr_conv_zones; i < zlo->nr_zones; i++) {
+ ret = zloop_reset_zone(zlo, i);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+static int zloop_finish_zone(struct zloop_device *zlo, unsigned int zone_no)
+{
+ struct zloop_zone *zone = &zlo->zones[zone_no];
+
+ if (test_bit(ZLOOP_ZONE_CONV, &zone->flags))
+ return -EIO;
+
+ if (vfs_truncate(&zone->file->f_path, zlo->zone_size << SECTOR_SHIFT)) {
+ set_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags);
+ return -EIO;
+ }
+
+ mutex_lock(&zone->lock);
+ zone->cond = BLK_ZONE_COND_FULL;
+ zone->wp = zone->start + zlo->zone_size;
+ clear_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags);
+ mutex_unlock(&zone->lock);
+
+ return 0;
+}
+
+static void zloop_put_cmd(struct zloop_cmd *cmd)
+{
+ struct request *rq = blk_mq_rq_from_pdu(cmd);
+
+ if (!atomic_dec_and_test(&cmd->ref))
+ return;
+ kfree(cmd->bvec);
+ cmd->bvec = NULL;
+ if (likely(!blk_should_fake_timeout(rq->q)))
+ blk_mq_complete_request(rq);
+}
+
+static void zloop_rw_complete(struct kiocb *iocb, long ret)
+{
+ struct zloop_cmd *cmd = container_of(iocb, struct zloop_cmd, iocb);
+
+ cmd->ret = ret;
+ zloop_put_cmd(cmd);
+}
+
+static void zloop_rw(struct zloop_cmd *cmd)
+{
+ struct request *rq = blk_mq_rq_from_pdu(cmd);
+ struct zloop_device *zlo = rq->q->queuedata;
+ unsigned int zone_no = rq_zone_no(rq);
+ sector_t sector = blk_rq_pos(rq);
+ sector_t nr_sectors = blk_rq_sectors(rq);
+ bool is_append = req_op(rq) == REQ_OP_ZONE_APPEND;
+ bool is_write = req_op(rq) == REQ_OP_WRITE || is_append;
+ int rw = is_write ? ITER_SOURCE : ITER_DEST;
+ struct req_iterator rq_iter;
+ struct zloop_zone *zone;
+ struct iov_iter iter;
+ struct bio_vec tmp;
+ sector_t zone_end;
+ int nr_bvec = 0;
+ int ret;
+
+ cmd->sector = sector;
+ cmd->nr_sectors = nr_sectors;
+
+ /* We should never get an I/O beyond the device capacity. */
+ if (WARN_ON_ONCE(zone_no >= zlo->nr_zones)) {
+ ret = -EIO;
+ goto out;
+ }
+ zone = &zlo->zones[zone_no];
+ zone_end = zone->start + zlo->zone_capacity;
+
+ /*
+ * The block layer should never send requests that are not fully
+ * contained within the zone.
+ */
+ if (WARN_ON_ONCE(sector + nr_sectors > zone->start + zlo->zone_size)) {
+ ret = -EIO;
+ goto out;
+ }
+
+ if (test_and_clear_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags)) {
+ mutex_lock(&zone->lock);
+ ret = zloop_update_seq_zone(zlo, zone_no);
+ mutex_unlock(&zone->lock);
+ if (ret)
+ goto out;
+ }
+
+ if (!test_bit(ZLOOP_ZONE_CONV, &zone->flags) && is_write) {
+ mutex_lock(&zone->lock);
+
+ if (is_append) {
+ sector = zone->wp;
+ cmd->sector = sector;
+ }
+
+ /*
+ * Write operations must be aligned to the write pointer and
+ * fully contained within the zone capacity.
+ */
+ if (sector != zone->wp || zone->wp + nr_sectors > zone_end) {
+ pr_err("Zone %u: unaligned write: sect %llu, wp %llu\n",
+ zone_no, sector, zone->wp);
+ mutex_unlock(&zone->lock);
+ ret = -EIO;
+ goto out;
+ }
+
+ /* Implicitly open the target zone. */
+ if (zone->cond == BLK_ZONE_COND_CLOSED ||
+ zone->cond == BLK_ZONE_COND_EMPTY)
+ zone->cond = BLK_ZONE_COND_IMP_OPEN;
+
+ /*
+ * Advance the write pointer of sequential zones. If the write
+ * fails, the wp position will be corrected when the next I/O
+ * completes.
+ */
+ zone->wp += nr_sectors;
+ if (zone->wp == zone_end)
+ zone->cond = BLK_ZONE_COND_FULL;
+
+ mutex_unlock(&zone->lock);
+ }
+
+ rq_for_each_bvec(tmp, rq, rq_iter)
+ nr_bvec++;
+
+ if (rq->bio != rq->biotail) {
+ struct bio_vec *bvec;
+
+ cmd->bvec = kmalloc_array(nr_bvec, sizeof(*cmd->bvec), GFP_NOIO);
+ if (!cmd->bvec) {
+ ret = -EIO;
+ goto out;
+ }
+
+ /*
+ * The bios of the request may be started from the middle of
+ * the 'bvec' because of bio splitting, so we can't directly
+ * copy bio->bi_iov_vec to new bvec. The rq_for_each_bvec
+ * API will take care of all details for us.
+ */
+ bvec = cmd->bvec;
+ rq_for_each_bvec(tmp, rq, rq_iter) {
+ *bvec = tmp;
+ bvec++;
+ }
+ iov_iter_bvec(&iter, rw, cmd->bvec, nr_bvec, blk_rq_bytes(rq));
+ } else {
+ /*
+ * Same here, this bio may be started from the middle of the
+ * 'bvec' because of bio splitting, so offset from the bvec
+ * must be passed to iov iterator
+ */
+ iov_iter_bvec(&iter, rw,
+ __bvec_iter_bvec(rq->bio->bi_io_vec, rq->bio->bi_iter),
+ nr_bvec, blk_rq_bytes(rq));
+ iter.iov_offset = rq->bio->bi_iter.bi_bvec_done;
+ }
+
+ cmd->iocb.ki_pos = (sector - zone->start) << SECTOR_SHIFT;
+ cmd->iocb.ki_filp = zone->file;
+ cmd->iocb.ki_complete = zloop_rw_complete;
+ cmd->iocb.ki_flags = IOCB_DIRECT;
+ cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
+
+ if (rw == ITER_SOURCE)
+ ret = zone->file->f_op->write_iter(&cmd->iocb, &iter);
+ else
+ ret = zone->file->f_op->read_iter(&cmd->iocb, &iter);
+out:
+ if (ret != -EIOCBQUEUED)
+ zloop_rw_complete(&cmd->iocb, ret);
+ zloop_put_cmd(cmd);
+}
+
+static void zloop_handle_cmd(struct zloop_cmd *cmd)
+{
+ struct request *rq = blk_mq_rq_from_pdu(cmd);
+ struct zloop_device *zlo = rq->q->queuedata;
+
+ atomic_set(&cmd->ref, 2);
+ cmd->ret = 0;
+
+ switch (req_op(rq)) {
+ case REQ_OP_READ:
+ case REQ_OP_WRITE:
+ case REQ_OP_ZONE_APPEND:
+ /*
+ * zloop_rw() always executes asynchronously or completes
+ * directly.
+ */
+ zloop_rw(cmd);
+ return;
+ case REQ_OP_FLUSH:
+ /*
+ * sync the entire fs containing the zone files instead of
+ * walking all files
+ */
+ if (sync_filesystem(file_inode(zlo->data_dir)->i_sb))
+ cmd->ret = -EIO;
+ break;
+ case REQ_OP_ZONE_RESET:
+ cmd->ret = zloop_reset_zone(zlo, rq_zone_no(rq));
+ break;
+ case REQ_OP_ZONE_RESET_ALL:
+ cmd->ret = zloop_reset_all_zones(rq);
+ break;
+ case REQ_OP_ZONE_FINISH:
+ cmd->ret = zloop_finish_zone(zlo, rq_zone_no(rq));
+ break;
+ case REQ_OP_ZONE_OPEN:
+ cmd->ret = zloop_open_zone(zlo, rq_zone_no(rq));
+ break;
+ case REQ_OP_ZONE_CLOSE:
+ cmd->ret = zloop_close_zone(zlo, rq_zone_no(rq));
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ pr_err("Unsupported operation %d\n", req_op(rq));
+ cmd->ret = -EOPNOTSUPP;
+ break;
+ }
+
+ blk_mq_complete_request(rq);
+}
+
+static struct zloop_cmd *zloop_get_cmd(struct zloop_device *zlo)
+{
+ struct zloop_cmd *cmd;
+
+ spin_lock_irq(&zlo->work_lock);
+ cmd = list_first_entry_or_null(&zlo->cmd_list, struct zloop_cmd, entry);
+ if (cmd)
+ list_del_init(&cmd->entry);
+ spin_unlock_irq(&zlo->work_lock);
+
+ return cmd;
+}
+
+static void zloop_workfn(struct work_struct *work)
+{
+ struct zloop_device *zlo = container_of(work, struct zloop_device, work);
+ int orig_flags = current->flags;
+ struct zloop_cmd *cmd;
+
+ current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
+ while ((cmd = zloop_get_cmd(zlo))) {
+ zloop_handle_cmd(cmd);
+ cond_resched();
+ }
+ current->flags = orig_flags;
+}
+
+static void zloop_complete_rq(struct request *rq)
+{
+ struct zloop_cmd *cmd = blk_mq_rq_to_pdu(rq);
+ struct zloop_device *zlo = rq->q->queuedata;
+ unsigned int zone_no = cmd->sector >> zlo->zone_shift;
+ struct zloop_zone *zone = &zlo->zones[zone_no];
+ blk_status_t sts = BLK_STS_OK;
+
+ switch (req_op(rq)) {
+ case REQ_OP_READ:
+ if (cmd->ret < 0)
+ pr_err("Zone %u: failed read sector %llu, %llu sectors\n",
+ zone_no, cmd->sector, cmd->nr_sectors);
+
+ if (cmd->ret >= 0 && cmd->ret != blk_rq_bytes(rq)) {
+ /* short read */
+ struct bio *bio;
+
+ __rq_for_each_bio(bio, rq)
+ zero_fill_bio(bio);
+ }
+ break;
+ case REQ_OP_WRITE:
+ case REQ_OP_ZONE_APPEND:
+ if (cmd->ret < 0)
+ pr_err("Zone %u: failed %swrite sector %llu, %llu sectors\n",
+ zone_no,
+ req_op(rq) == REQ_OP_WRITE ? "" : "append ",
+ cmd->sector, cmd->nr_sectors);
+
+ if (cmd->ret >= 0 && cmd->ret != blk_rq_bytes(rq)) {
+ pr_err("Zone %u: partial write %ld/%u B\n",
+ zone_no, cmd->ret, blk_rq_bytes(rq));
+ cmd->ret = -EIO;
+ }
+
+ if (cmd->ret < 0 && !test_bit(ZLOOP_ZONE_CONV, &zone->flags)) {
+ /*
+ * A write to a sequential zone file failed: mark the
+ * zone as having an error. This will be corrected and
+ * cleared when the next IO is submitted.
+ */
+ set_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags);
+ }
+ break;
+ default:
+ break;
+ }
+
+ if (cmd->ret < 0)
+ sts = errno_to_blk_status(cmd->ret);
+ blk_mq_end_request(rq, sts);
+}
+
+static blk_status_t zloop_queue_rq(struct blk_mq_hw_ctx *hctx,
+ const struct blk_mq_queue_data *bd)
+{
+ struct request *rq = bd->rq;
+ struct zloop_cmd *cmd = blk_mq_rq_to_pdu(rq);
+ struct zloop_device *zlo = rq->q->queuedata;
+
+ if (zlo->state == Zlo_deleting)
+ return BLK_STS_IOERR;
+
+ blk_mq_start_request(rq);
+
+ spin_lock_irq(&zlo->work_lock);
+ list_add_tail(&cmd->entry, &zlo->cmd_list);
+ queue_work(zlo->workqueue, &zlo->work);
+ spin_unlock_irq(&zlo->work_lock);
+ return BLK_STS_OK;
+}
+
+static const struct blk_mq_ops zloop_mq_ops = {
+ .queue_rq = zloop_queue_rq,
+ .complete = zloop_complete_rq,
+};
+
+static int zloop_open(struct gendisk *disk, blk_mode_t mode)
+{
+ struct zloop_device *zlo = disk->private_data;
+ int ret;
+
+ ret = mutex_lock_killable(&zloop_ctl_mutex);
+ if (ret)
+ return ret;
+
+ if (zlo->state != Zlo_live)
+ ret = -ENXIO;
+ mutex_unlock(&zloop_ctl_mutex);
+ return ret;
+}
+
+static int zloop_report_zones(struct gendisk *disk, sector_t sector,
+ unsigned int nr_zones, report_zones_cb cb, void *data)
+{
+ struct zloop_device *zlo = disk->private_data;
+ struct blk_zone blkz = {};
+ unsigned int first, i;
+ int ret;
+
+ first = disk_zone_no(disk, sector);
+ if (first >= zlo->nr_zones)
+ return 0;
+ nr_zones = min(nr_zones, zlo->nr_zones - first);
+
+ for (i = 0; i < nr_zones; i++) {
+ unsigned int zone_no = first + i;
+ struct zloop_zone *zone = &zlo->zones[zone_no];
+
+ mutex_lock(&zone->lock);
+
+ if (test_and_clear_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags)) {
+ ret = zloop_update_seq_zone(zlo, zone_no);
+ if (ret) {
+ mutex_unlock(&zone->lock);
+ return ret;
+ }
+ }
+
+ blkz.start = zone->start;
+ blkz.len = zlo->zone_size;
+ blkz.wp = zone->wp;
+ blkz.cond = zone->cond;
+ if (test_bit(ZLOOP_ZONE_CONV, &zone->flags)) {
+ blkz.type = BLK_ZONE_TYPE_CONVENTIONAL;
+ blkz.capacity = zlo->zone_size;
+ } else {
+ blkz.type = BLK_ZONE_TYPE_SEQWRITE_REQ;
+ blkz.capacity = zlo->zone_capacity;
+ }
+
+ mutex_unlock(&zone->lock);
+
+ ret = cb(&blkz, i, data);
+ if (ret)
+ return ret;
+ }
+
+ return nr_zones;
+}
+
+static void zloop_free_disk(struct gendisk *disk)
+{
+ struct zloop_device *zlo = disk->private_data;
+ unsigned int i;
+
+ for (i = 0; i < zlo->nr_zones; i++) {
+ struct zloop_zone *zone = &zlo->zones[i];
+
+ mapping_set_gfp_mask(zone->file->f_mapping,
+ zone->old_gfp_mask);
+ fput(zone->file);
+ }
+
+ fput(zlo->data_dir);
+ destroy_workqueue(zlo->workqueue);
+ kfree(zlo->base_dir);
+ kvfree(zlo);
+}
+
+static const struct block_device_operations zloop_fops = {
+ .owner = THIS_MODULE,
+ .open = zloop_open,
+ .report_zones = zloop_report_zones,
+ .free_disk = zloop_free_disk,
+};
+
+__printf(3, 4)
+static struct file *zloop_filp_open_fmt(int oflags, umode_t mode,
+ const char *fmt, ...)
+{
+ struct file *file;
+ va_list ap;
+ char *p;
+
+ va_start(ap, fmt);
+ p = kvasprintf(GFP_KERNEL, fmt, ap);
+ va_end(ap);
+
+ if (!p)
+ return ERR_PTR(-ENOMEM);
+ file = filp_open(p, oflags, mode);
+ kfree(p);
+ return file;
+}
+
+static int zloop_init_zone(struct zloop_device *zlo, unsigned int zone_no,
+ bool restore)
+{
+ struct zloop_zone *zone = &zlo->zones[zone_no];
+ int oflags = O_RDWR | O_DIRECT;
+ struct kstat stat;
+ sector_t file_sectors;
+ int ret;
+
+ mutex_init(&zone->lock);
+ zone->start = (sector_t)zone_no << zlo->zone_shift;
+
+ if (!restore)
+ oflags |= O_CREAT;
+
+ if (zone_no < zlo->nr_conv_zones) {
+ /* Conventional zone file. */
+ set_bit(ZLOOP_ZONE_CONV, &zone->flags);
+ zone->cond = BLK_ZONE_COND_NOT_WP;
+ zone->wp = U64_MAX;
+
+ zone->file = zloop_filp_open_fmt(oflags, 0600, "%s/%u/cnv-%06u",
+ zlo->base_dir, zlo->id, zone_no);
+ if (IS_ERR(zone->file)) {
+ pr_err("Failed to open zone %u file %s/%u/cnv-%06u (err=%ld)",
+ zone_no, zlo->base_dir, zlo->id, zone_no,
+ PTR_ERR(zone->file));
+ return PTR_ERR(zone->file);
+ }
+
+ ret = vfs_getattr(&zone->file->f_path, &stat, STATX_SIZE, 0);
+ if (ret < 0) {
+ pr_err("Failed to get zone %u file stat\n", zone_no);
+ return ret;
+ }
+ file_sectors = stat.size >> SECTOR_SHIFT;
+
+ if (restore && file_sectors != zlo->zone_size) {
+ pr_err("Invalid conventional zone %u file size (%llu sectors != %llu)\n",
+ zone_no, file_sectors, zlo->zone_size);
+ return -EINVAL;
+ }
+
+ ret = vfs_truncate(&zone->file->f_path,
+ zlo->zone_size << SECTOR_SHIFT);
+ if (ret < 0) {
+ pr_err("Failed to truncate zone %u file (err=%d)\n",
+ zone_no, ret);
+ return ret;
+ }
+
+ return 0;
+ }
+
+ /* Sequential zone file. */
+ zone->file = zloop_filp_open_fmt(oflags, 0600, "%s/%u/seq-%06u",
+ zlo->base_dir, zlo->id, zone_no);
+ if (IS_ERR(zone->file)) {
+ pr_err("Failed to open zone %u file %s/%u/seq-%06u (err=%ld)",
+ zone_no, zlo->base_dir, zlo->id, zone_no,
+ PTR_ERR(zone->file));
+ return PTR_ERR(zone->file);
+ }
+
+ mutex_lock(&zone->lock);
+ ret = zloop_update_seq_zone(zlo, zone_no);
+ mutex_unlock(&zone->lock);
+
+ return ret;
+}
+
+static bool zloop_dev_exists(struct zloop_device *zlo)
+{
+ struct file *cnv, *seq;
+ bool exists;
+
+ cnv = zloop_filp_open_fmt(O_RDONLY, 0600, "%s/%u/cnv-%06u",
+ zlo->base_dir, zlo->id, 0);
+ seq = zloop_filp_open_fmt(O_RDONLY, 0600, "%s/%u/seq-%06u",
+ zlo->base_dir, zlo->id, 0);
+ exists = !IS_ERR(cnv) || !IS_ERR(seq);
+
+ if (!IS_ERR(cnv))
+ fput(cnv);
+ if (!IS_ERR(seq))
+ fput(seq);
+
+ return exists;
+}
+
+static int zloop_ctl_add(struct zloop_options *opts)
+{
+ struct queue_limits lim = {
+ .max_hw_sectors = SZ_1M >> SECTOR_SHIFT,
+ .max_hw_zone_append_sectors = SZ_1M >> SECTOR_SHIFT,
+ .chunk_sectors = opts->zone_size,
+ .features = BLK_FEAT_ZONED,
+ };
+ unsigned int nr_zones, i, j;
+ struct zloop_device *zlo;
+ int block_size;
+ int ret = -EINVAL;
+ bool restore;
+
+ __module_get(THIS_MODULE);
+
+ nr_zones = opts->capacity >> ilog2(opts->zone_size);
+ if (opts->nr_conv_zones >= nr_zones) {
+ pr_err("Invalid number of conventional zones %u\n",
+ opts->nr_conv_zones);
+ goto out;
+ }
+
+ zlo = kvzalloc(struct_size(zlo, zones, nr_zones), GFP_KERNEL);
+ if (!zlo) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ zlo->state = Zlo_creating;
+
+ ret = mutex_lock_killable(&zloop_ctl_mutex);
+ if (ret)
+ goto out_free_dev;
+
+ /* Allocate id, if @opts->id >= 0, we're requesting that specific id */
+ if (opts->id >= 0) {
+ ret = idr_alloc(&zloop_index_idr, zlo,
+ opts->id, opts->id + 1, GFP_KERNEL);
+ if (ret == -ENOSPC)
+ ret = -EEXIST;
+ } else {
+ ret = idr_alloc(&zloop_index_idr, zlo, 0, 0, GFP_KERNEL);
+ }
+ mutex_unlock(&zloop_ctl_mutex);
+ if (ret < 0)
+ goto out_free_dev;
+
+ zlo->id = ret;
+ spin_lock_init(&zlo->work_lock);
+ INIT_WORK(&zlo->work, zloop_workfn);
+ INIT_LIST_HEAD(&zlo->cmd_list);
+ zlo->zone_shift = ilog2(opts->zone_size);
+ zlo->zone_size = opts->zone_size;
+ if (opts->zone_capacity)
+ zlo->zone_capacity = opts->zone_capacity;
+ else
+ zlo->zone_capacity = zlo->zone_size;
+ zlo->nr_zones = nr_zones;
+ zlo->nr_conv_zones = opts->nr_conv_zones;
+
+ zlo->workqueue = alloc_workqueue("zloop%d", WQ_UNBOUND | WQ_FREEZABLE,
+ 0, zlo->id);
+ if (!zlo->workqueue) {
+ ret = -ENOMEM;
+ goto out_free_idr;
+ }
+
+ if (opts->base_dir)
+ zlo->base_dir = kstrdup(opts->base_dir, GFP_KERNEL);
+ else
+ zlo->base_dir = kstrdup(ZLOOP_DEF_BASE_DIR, GFP_KERNEL);
+ if (!zlo->base_dir) {
+ ret = -ENOMEM;
+ goto out_destroy_workqueue;
+ }
+
+ zlo->data_dir = zloop_filp_open_fmt(O_RDONLY | O_DIRECTORY, 0, "%s/%u",
+ zlo->base_dir, zlo->id);
+ if (IS_ERR(zlo->data_dir)) {
+ ret = PTR_ERR(zlo->data_dir);
+ pr_warn("Failed to open directory %s/%u (err=%d)\n",
+ zlo->base_dir, zlo->id, ret);
+ goto out_free_base_dir;
+ }
+
+ /* Use the FS block size as the device sector size. */
+ block_size = file_inode(zlo->data_dir)->i_sb->s_blocksize;
+ if (block_size > SZ_4K) {
+ pr_warn("Unsupported FS block size %d B > 4096\n",
+ block_size);
+ ret = -EINVAL;
+ goto out_close_data_dir;
+ }
+ lim.physical_block_size = block_size;
+ lim.logical_block_size = block_size;
+
+ /*
+ * If we already have zone files, we are restoring a device created by a
+ * previous add operation. In this case, zloop_init_zone() will check
+ * that the zone files are consistent with the zone configuration given.
+ */
+ restore = zloop_dev_exists(zlo);
+ for (i = 0; i < nr_zones; i++) {
+ ret = zloop_init_zone(zlo, i, restore);
+ if (ret)
+ goto out_close_files;
+ }
+
+ zlo->tag_set.ops = &zloop_mq_ops;
+ zlo->tag_set.nr_hw_queues = opts->nr_queues;
+ zlo->tag_set.queue_depth = opts->queue_depth;
+ zlo->tag_set.numa_node = NUMA_NO_NODE;
+ zlo->tag_set.cmd_size = sizeof(struct zloop_cmd);
+ zlo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+ zlo->tag_set.driver_data = zlo;
+
+ ret = blk_mq_alloc_tag_set(&zlo->tag_set);
+ if (ret) {
+ pr_err("blk_mq_alloc_tag_set failed (err=%d)\n", ret);
+ goto out_close_files;
+ }
+
+ zlo->disk = blk_mq_alloc_disk(&zlo->tag_set, &lim, zlo);
+ if (IS_ERR(zlo->disk)) {
+ pr_err("blk_mq_alloc_disk failed (err=%d)\n", ret);
+ ret = PTR_ERR(zlo->disk);
+ goto out_cleanup_tags;
+ }
+ zlo->disk->flags = GENHD_FL_NO_PART;
+ zlo->disk->fops = &zloop_fops;
+ zlo->disk->private_data = zlo;
+ sprintf(zlo->disk->disk_name, "zloop%d", zlo->id);
+ set_capacity(zlo->disk, (u64)lim.chunk_sectors * zlo->nr_zones);
+
+ ret = blk_revalidate_disk_zones(zlo->disk);
+ if (ret)
+ goto out_cleanup_disk;
+
+ ret = add_disk(zlo->disk);
+ if (ret) {
+ pr_err("add_disk failed (err=%d)\n", ret);
+ goto out_cleanup_disk;
+ }
+
+ mutex_lock(&zloop_ctl_mutex);
+ zlo->state = Zlo_live;
+ mutex_unlock(&zloop_ctl_mutex);
+
+ pr_info("Added device %d\n", zlo->id);
+
+ return 0;
+
+out_cleanup_disk:
+ put_disk(zlo->disk);
+out_cleanup_tags:
+ blk_mq_free_tag_set(&zlo->tag_set);
+out_close_files:
+ for (j = 0; j < i; j++) {
+ struct zloop_zone *zone = &zlo->zones[j];
+
+ if (!IS_ERR_OR_NULL(zone->file))
+ fput(zone->file);
+ }
+out_close_data_dir:
+ fput(zlo->data_dir);
+out_free_base_dir:
+ kfree(zlo->base_dir);
+out_destroy_workqueue:
+ destroy_workqueue(zlo->workqueue);
+out_free_idr:
+ mutex_lock(&zloop_ctl_mutex);
+ idr_remove(&zloop_index_idr, zlo->id);
+ mutex_unlock(&zloop_ctl_mutex);
+out_free_dev:
+ kvfree(zlo);
+out:
+ module_put(THIS_MODULE);
+ if (ret == -ENOENT)
+ ret = -EINVAL;
+ return ret;
+}
+
+static int zloop_ctl_remove(struct zloop_options *opts)
+{
+ struct zloop_device *zlo;
+ int ret;
+
+ if (!(opts->mask & ZLOOP_OPT_ID)) {
+ pr_err("No ID specified\n");
+ return -EINVAL;
+ }
+
+ ret = mutex_lock_killable(&zloop_ctl_mutex);
+ if (ret)
+ return ret;
+
+ zlo = idr_find(&zloop_index_idr, opts->id);
+ if (!zlo || zlo->state == Zlo_creating) {
+ ret = -ENODEV;
+ } else if (zlo->state == Zlo_deleting) {
+ ret = -EINVAL;
+ } else {
+ idr_remove(&zloop_index_idr, zlo->id);
+ zlo->state = Zlo_deleting;
+ }
+
+ mutex_unlock(&zloop_ctl_mutex);
+ if (ret)
+ return ret;
+
+ del_gendisk(zlo->disk);
+ put_disk(zlo->disk);
+ blk_mq_free_tag_set(&zlo->tag_set);
+
+ pr_info("Removed device %d\n", opts->id);
+
+ module_put(THIS_MODULE);
+
+ return 0;
+}
+
+static int zloop_parse_options(struct zloop_options *opts, const char *buf)
+{
+ substring_t args[MAX_OPT_ARGS];
+ char *options, *o, *p;
+ unsigned int token;
+ int ret = 0;
+
+ /* Set defaults. */
+ opts->mask = 0;
+ opts->id = ZLOOP_DEF_ID;
+ opts->capacity = ZLOOP_DEF_ZONE_SIZE * ZLOOP_DEF_NR_ZONES;
+ opts->zone_size = ZLOOP_DEF_ZONE_SIZE;
+ opts->nr_conv_zones = ZLOOP_DEF_NR_CONV_ZONES;
+ opts->nr_queues = ZLOOP_DEF_NR_QUEUES;
+ opts->queue_depth = ZLOOP_DEF_QUEUE_DEPTH;
+
+ if (!buf)
+ return 0;
+
+ /* Skip leading spaces before the options. */
+ while (isspace(*buf))
+ buf++;
+
+ options = o = kstrdup(buf, GFP_KERNEL);
+ if (!options)
+ return -ENOMEM;
+
+ /* Parse the options, doing only some light invalid value checks. */
+ while ((p = strsep(&o, ",\n")) != NULL) {
+ if (!*p)
+ continue;
+
+ token = match_token(p, zloop_opt_tokens, args);
+ opts->mask |= token;
+ switch (token) {
+ case ZLOOP_OPT_ID:
+ if (match_int(args, &opts->id)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ break;
+ case ZLOOP_OPT_CAPACITY:
+ if (match_uint(args, &token)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (!token) {
+ pr_err("Invalid capacity\n");
+ ret = -EINVAL;
+ goto out;
+ }
+ opts->capacity =
+ ((sector_t)token * SZ_1M) >> SECTOR_SHIFT;
+ break;
+ case ZLOOP_OPT_ZONE_SIZE:
+ if (match_uint(args, &token)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (!token || token > ZLOOP_MAX_ZONE_SIZE_MB ||
+ !is_power_of_2(token)) {
+ pr_err("Invalid zone size %u\n", token);
+ ret = -EINVAL;
+ goto out;
+ }
+ opts->zone_size =
+ ((sector_t)token * SZ_1M) >> SECTOR_SHIFT;
+ break;
+ case ZLOOP_OPT_ZONE_CAPACITY:
+ if (match_uint(args, &token)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (!token) {
+ pr_err("Invalid zone capacity\n");
+ ret = -EINVAL;
+ goto out;
+ }
+ opts->zone_capacity =
+ ((sector_t)token * SZ_1M) >> SECTOR_SHIFT;
+ break;
+ case ZLOOP_OPT_NR_CONV_ZONES:
+ if (match_uint(args, &token)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ opts->nr_conv_zones = token;
+ break;
+ case ZLOOP_OPT_BASE_DIR:
+ p = match_strdup(args);
+ if (!p) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ kfree(opts->base_dir);
+ opts->base_dir = p;
+ break;
+ case ZLOOP_OPT_NR_QUEUES:
+ if (match_uint(args, &token)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (!token) {
+ pr_err("Invalid number of queues\n");
+ ret = -EINVAL;
+ goto out;
+ }
+ opts->nr_queues = min(token, num_online_cpus());
+ break;
+ case ZLOOP_OPT_QUEUE_DEPTH:
+ if (match_uint(args, &token)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (!token) {
+ pr_err("Invalid queue depth\n");
+ ret = -EINVAL;
+ goto out;
+ }
+ opts->queue_depth = token;
+ break;
+ case ZLOOP_OPT_ERR:
+ default:
+ pr_warn("unknown parameter or missing value '%s'\n", p);
+ ret = -EINVAL;
+ goto out;
+ }
+ }
+
+ ret = -EINVAL;
+ if (opts->capacity <= opts->zone_size) {
+ pr_err("Invalid capacity\n");
+ goto out;
+ }
+
+ if (opts->zone_capacity > opts->zone_size) {
+ pr_err("Invalid zone capacity\n");
+ goto out;
+ }
+
+ ret = 0;
+out:
+ kfree(options);
+ return ret;
+}
+
+enum {
+ ZLOOP_CTL_ADD,
+ ZLOOP_CTL_REMOVE,
+};
+
+static struct zloop_ctl_op {
+ int code;
+ const char *name;
+} zloop_ctl_ops[] = {
+ { ZLOOP_CTL_ADD, "add" },
+ { ZLOOP_CTL_REMOVE, "remove" },
+ { -1, NULL },
+};
+
+static ssize_t zloop_ctl_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *pos)
+{
+ struct zloop_options opts = { };
+ struct zloop_ctl_op *op;
+ const char *buf, *opts_buf;
+ int i, ret;
+
+ if (count > PAGE_SIZE)
+ return -ENOMEM;
+
+ buf = memdup_user_nul(ubuf, count);
+ if (IS_ERR(buf))
+ return PTR_ERR(buf);
+
+ for (i = 0; i < ARRAY_SIZE(zloop_ctl_ops); i++) {
+ op = &zloop_ctl_ops[i];
+ if (!op->name) {
+ pr_err("Invalid operation\n");
+ ret = -EINVAL;
+ goto out;
+ }
+ if (!strncmp(buf, op->name, strlen(op->name)))
+ break;
+ }
+
+ if (count <= strlen(op->name))
+ opts_buf = NULL;
+ else
+ opts_buf = buf + strlen(op->name);
+
+ ret = zloop_parse_options(&opts, opts_buf);
+ if (ret) {
+ pr_err("Failed to parse options\n");
+ goto out;
+ }
+
+ switch (op->code) {
+ case ZLOOP_CTL_ADD:
+ ret = zloop_ctl_add(&opts);
+ break;
+ case ZLOOP_CTL_REMOVE:
+ ret = zloop_ctl_remove(&opts);
+ break;
+ default:
+ pr_err("Invalid operation\n");
+ ret = -EINVAL;
+ goto out;
+ }
+
+out:
+ kfree(opts.base_dir);
+ kfree(buf);
+ return ret ? ret : count;
+}
+
+static int zloop_ctl_show(struct seq_file *seq_file, void *private)
+{
+ const struct match_token *tok;
+ int i;
+
+ /* Add operation */
+ seq_printf(seq_file, "%s ", zloop_ctl_ops[0].name);
+ for (i = 0; i < ARRAY_SIZE(zloop_opt_tokens); i++) {
+ tok = &zloop_opt_tokens[i];
+ if (tok->token == ZLOOP_OPT_ERR)
+ break;
+ if (i)
+ seq_putc(seq_file, ',');
+ seq_puts(seq_file, tok->pattern);
+ }
+ seq_putc(seq_file, '\n');
+
+ /* Remove operation */
+ seq_printf(seq_file, "%s ", zloop_ctl_ops[1].name);
+ seq_puts(seq_file, "id=%d\n");
+
+ return 0;
+}
+
+static int zloop_ctl_open(struct inode *inode, struct file *file)
+{
+ file->private_data = NULL;
+ return single_open(file, zloop_ctl_show, NULL);
+}
+
+static int zloop_ctl_release(struct inode *inode, struct file *file)
+{
+ return single_release(inode, file);
+}
+
+static const struct file_operations zloop_ctl_fops = {
+ .owner = THIS_MODULE,
+ .open = zloop_ctl_open,
+ .release = zloop_ctl_release,
+ .write = zloop_ctl_write,
+ .read = seq_read,
+};
+
+static struct miscdevice zloop_misc = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "zloop-control",
+ .fops = &zloop_ctl_fops,
+};
+
+static int __init zloop_init(void)
+{
+ int ret;
+
+ ret = misc_register(&zloop_misc);
+ if (ret) {
+ pr_err("Failed to register misc device: %d\n", ret);
+ return ret;
+ }
+ pr_info("Module loaded\n");
+
+ return 0;
+}
+
+static void __exit zloop_exit(void)
+{
+ misc_deregister(&zloop_misc);
+ idr_destroy(&zloop_index_idr);
+}
+
+module_init(zloop_init);
+module_exit(zloop_exit);
+
+MODULE_DESCRIPTION("Zoned loopback device");
+MODULE_LICENSE("GPL");
--
2.47.1
* [PATCH 2/2] Documentation: Document the new zoned loop block device driver
2025-01-06 14:24 [PATCH 0/2] New zoned loop block device driver Damien Le Moal
2025-01-06 14:24 ` [PATCH 1/2] block: new " Damien Le Moal
@ 2025-01-06 14:24 ` Damien Le Moal
2025-01-06 14:54 ` [PATCH 0/2] New " Jens Axboe
2 siblings, 0 replies; 35+ messages in thread
From: Damien Le Moal @ 2025-01-06 14:24 UTC (permalink / raw)
To: Jens Axboe, linux-block; +Cc: Christoph Hellwig
Introduce the zoned_loop.rst documentation file under
admin-guide/blockdev to document the zoned loop block device driver.
An overview of the driver is provided and its use to create and delete
zoned devices is described.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
Documentation/admin-guide/blockdev/index.rst | 1 +
.../admin-guide/blockdev/zoned_loop.rst | 168 ++++++++++++++++++
MAINTAINERS | 1 +
3 files changed, 170 insertions(+)
create mode 100644 Documentation/admin-guide/blockdev/zoned_loop.rst
diff --git a/Documentation/admin-guide/blockdev/index.rst b/Documentation/admin-guide/blockdev/index.rst
index 957ccf617797..3262397ebe8f 100644
--- a/Documentation/admin-guide/blockdev/index.rst
+++ b/Documentation/admin-guide/blockdev/index.rst
@@ -11,6 +11,7 @@ Block Devices
nbd
paride
ramdisk
+ zoned_loop
zram
drbd/index
diff --git a/Documentation/admin-guide/blockdev/zoned_loop.rst b/Documentation/admin-guide/blockdev/zoned_loop.rst
new file mode 100644
index 000000000000..e437f501b0ae
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/zoned_loop.rst
@@ -0,0 +1,168 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+Zoned Loop Block Device
+=======================
+
+.. Contents:
+
+ 1) Overview
+ 2) Creating a Zoned Device
+ 3) Deleting a Zoned Device
+ 4) Example
+
+
+1) Overview
+-----------
+
+The zoned loop block device driver (zloop) allows a user to create a zoned block
+device using one regular file per zone as backing storage. This driver does not
+directly control any hardware and uses read, write and truncate operations to
+regular files of a file system to emulate a zoned block device.
+
+Using zloop, zoned block devices with a configurable capacity, zone size and
+number of conventional zones can be created. The storage for each zone of the
+device is implemented using a regular file with a maximum size equal to the zone
+size. The size of a file backing a conventional zone is always equal to the zone
+size. The size of a file backing a sequential zone indicates the amount of data
+sequentially written to the file, that is, the size of the file directly
+indicates the position of the write pointer of the zone.
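+
+For example, after sequentially writing 4 MiB from the start of the first
+sequential zone of a device, the zone write pointer position can be obtained
+directly from the size of the zone backing file (an illustrative session,
+assuming device ID 0, the default base directory and the default 8
+conventional zones)::
+
+    $ stat -c %s /var/local/zloop/0/seq-000008
+    4194304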
+
+When resetting a sequential zone, its backing file size is truncated to zero.
+Conversely, for a zone finish operation, the backing file is truncated to the
+zone size. With this, the capacity of a zloop zoned block device can be
+configured to be larger than the storage space available on the backing file
+system. Of course, with such a configuration, writing more data than the
+storage space available on the backing file system will result in write
+errors.
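+
+The effect of zone reset and zone finish operations can likewise be observed
+on the backing files (an illustrative session, assuming device 0 with 64 MiB
+zones and a sequential zone 8 starting at sector 1048576)::
+
+    $ blkzone finish --offset 1048576 --count 1 /dev/zloop0
+    $ stat -c %s /var/local/zloop/0/seq-000008
+    67108864
+    $ blkzone reset --offset 1048576 --count 1 /dev/zloop0
+    $ stat -c %s /var/local/zloop/0/seq-000008
+    0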
+
+The zoned loop block device driver implements a complete zone transition state
+machine. That is, zones can be empty, implicitly opened, explicitly opened,
+closed or full. The current implementation does not support any limits on the
+maximum number of open and active zones.
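+
+Zone transitions can be exercised with standard user tools, for instance with
+util-linux blkzone (an illustrative session, explicitly opening and then
+closing the same sequential zone 8 of device 0 used in the examples above)::
+
+    $ blkzone open --offset 1048576 --count 1 /dev/zloop0
+    $ blkzone close --offset 1048576 --count 1 /dev/zloop0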
+
+No user tools are necessary to create and delete zloop devices.
+
+2) Creating a Zoned Device
+--------------------------
+
+Once the zloop module is loaded (or if zloop is compiled in the kernel), the
+character device file /dev/zloop-control can be used to add a zloop device.
+This is done by writing an "add" command directly to the /dev/zloop-control
+device::
+
+ $ modprobe zloop
+ $ ls -l /dev/zloop*
+ crw-------. 1 root root 10, 123 Jan 6 19:18 /dev/zloop-control
+
+ $ mkdir -p <base directory>/<device ID>
+ $ echo "add [options]" > /dev/zloop-control
+
+The options available for the add command can be listed by reading the
+/dev/zloop-control device::
+
+ $ cat /dev/zloop-control
+ add id=%d,capacity_mb=%u,zone_size_mb=%u,zone_capacity_mb=%u,conv_zones=%u,base_dir=%s,nr_queues=%u,queue_depth=%u
+ remove id=%d
+
+In more detail, the options that can be used with the "add" command are as
+follows.
+
+================ ===========================================================
+id Device number (the X in /dev/zloopX).
+ Default: automatically assigned.
+capacity_mb Device total capacity in MiB. This is always rounded down
+ to the nearest multiple of the zone size.
+ Default: 16384 MiB (16 GiB).
+zone_size_mb Device zone size in MiB. Default: 256 MiB.
+zone_capacity_mb Device zone capacity (must always be equal to or lower than
+ the zone size). Default: zone size.
+conv_zones Total number of conventional zones starting from sector 0.
+ Default: 8.
+base_dir Path to the base directory under which the directory
+ containing the zone files of the device is created.
+ Default: /var/local/zloop.
+ The device directory containing the zone files is always
+ named with the device ID. E.g. the default zone file
+ directory for /dev/zloop0 is /var/local/zloop/0.
+nr_queues Number of I/O queues of the zoned block device. This value is
+ always capped by the number of online CPUs.
+ Default: 1.
+queue_depth Maximum I/O queue depth per I/O queue.
+ Default: 64.
+================ ===========================================================
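+
+As an illustration, the following hypothetical commands create /dev/zloop1
+with its zone files stored under a non-default base directory and with two
+I/O queues::
+
+    $ mkdir -p /mnt/scratch/1
+    $ echo "add id=1,base_dir=/mnt/scratch,nr_queues=2" > /dev/zloop-control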
+
+3) Deleting a Zoned Device
+--------------------------
+
+Deleting an unused zoned loop block device is done by issuing the "remove"
+command to /dev/zloop-control, specifying the ID of the device to remove::
+
+ $ echo "remove id=X" > /dev/zloop-control
+
+The remove command does not have any option.
+
+A zoned device that was removed can be re-added without any change to the
+state of the device zones: the device zones are restored to their last state
+before the device was removed. Re-adding a zoned device after it was removed
+must always be done using the same configuration as when the device was first
+added. If a zone configuration change is detected, an error will be returned
+and the zoned device will not be created.
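+
+For example, removing and then re-adding the device created in the example of
+section 4 below (a sketch; the zone configuration options must match the
+initial "add" command)::
+
+    $ echo "remove id=0" > /dev/zloop-control
+    $ echo "add id=0,capacity_mb=2048,zone_size_mb=64,zone_capacity_mb=63" > /dev/zloop-control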
+
+To fully delete a zoned device, after executing the remove operation, the device
+base directory containing the backing files of the device zones must be deleted.
+
+4) Example
+----------
+
+The following sequence of commands creates a 2 GiB zoned device with 64 MiB
+zones and a zone capacity of 63 MiB::
+
+ $ modprobe zloop
+ $ mkdir -p /var/local/zloop/0
+ $ echo "add capacity_mb=2048,zone_size_mb=64,zone_capacity=63MB" > /dev/zloop-control
+
+For the device created (/dev/zloop0), the zone backing files are all created
+under the default base directory (/var/local/zloop)::
+
+ $ ls -l /var/local/zloop/0
+ total 0
+ -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000000
+ -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000001
+ -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000002
+ -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000003
+ -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000004
+ -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000005
+ -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000006
+ -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000007
+ -rw-------. 1 root root 0 Jan 6 22:23 seq-000008
+ -rw-------. 1 root root 0 Jan 6 22:23 seq-000009
+ ...
+
+The zoned device created (/dev/zloop0) can then be used normally::
+
+ $ lsblk -z
+ NAME ZONED ZONE-SZ ZONE-NR ZONE-AMAX ZONE-OMAX ZONE-APP ZONE-WGRAN
+ zloop0 host-managed 64M 32 0 0 1M 4K
+ $ blkzone report /dev/zloop0
+ start: 0x000000000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
+ start: 0x000020000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
+ start: 0x000040000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
+ start: 0x000060000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
+ start: 0x000080000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
+ start: 0x0000a0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
+ start: 0x0000c0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
+ start: 0x0000e0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
+ start: 0x000100000, len 0x020000, cap 0x01f800, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
+ start: 0x000120000, len 0x020000, cap 0x01f800, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
+ ...
+
+Deleting this device is done using the command::
+
+ $ echo "remove id=0" > /dev/zloop-control
+
+The removed device can be re-added using the same "add" command as when
+the device was first created. To fully delete a zoned device, its backing files
+should also be deleted after executing the remove command::
+
+ $ rm -r /var/local/zloop/0
diff --git a/MAINTAINERS b/MAINTAINERS
index 49076a506b6b..a61b5d2827af 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -25977,6 +25977,7 @@ M: Damien Le Moal <dlemoal@kernel.org>
R: Christoph Hellwig <hch@lst.de>
L: linux-block@vger.kernel.org
S: Maintained
+F: Documentation/admin-guide/blockdev/zoned_loop.rst
F: drivers/block/zloop.c
ZONEFS FILESYSTEM
--
2.47.1
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-06 14:24 [PATCH 0/2] New zoned loop block device driver Damien Le Moal
2025-01-06 14:24 ` [PATCH 1/2] block: new " Damien Le Moal
2025-01-06 14:24 ` [PATCH 2/2] Documentation: Document the " Damien Le Moal
@ 2025-01-06 14:54 ` Jens Axboe
2025-01-06 15:21 ` Christoph Hellwig
2 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2025-01-06 14:54 UTC (permalink / raw)
To: Damien Le Moal, linux-block; +Cc: Christoph Hellwig
On 1/6/25 7:24 AM, Damien Le Moal wrote:
> The first patch implements the new "zloop" zoned block device driver
> which allows creating zoned block devices using one regular file per
> zone as backing storage.
Couldn't we do this with ublk and keep most of this stuff in userspace
rather than need a whole new loop driver?
--
Jens Axboe
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-06 14:54 ` [PATCH 0/2] New " Jens Axboe
@ 2025-01-06 15:21 ` Christoph Hellwig
2025-01-06 15:24 ` Jens Axboe
2025-01-08 2:29 ` Ming Lei
0 siblings, 2 replies; 35+ messages in thread
From: Christoph Hellwig @ 2025-01-06 15:21 UTC (permalink / raw)
To: Jens Axboe; +Cc: Damien Le Moal, linux-block, Christoph Hellwig
On Mon, Jan 06, 2025 at 07:54:05AM -0700, Jens Axboe wrote:
> On 1/6/25 7:24 AM, Damien Le Moal wrote:
> > The first patch implements the new "zloop" zoned block device driver
> > which allows creating zoned block devices using one regular file per
> > zone as backing storage.
>
> Couldn't we do this with ublk and keep most of this stuff in userspace
> rather than need a whole new loop driver?
I'm pretty sure we could do that. But dealing with ublk is a complete
pain, especially when setting it up and tearing it down all the time for
tests, and it would require a lot more code, so why? As-is I can directly
add this to xfstests for the much needed large file system testing that
currently doesn't work for zoned file systems, which was the motivation
why I started writing this code (before Damien gladly took over and
polished it).
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-06 15:21 ` Christoph Hellwig
@ 2025-01-06 15:24 ` Jens Axboe
2025-01-06 15:32 ` Christoph Hellwig
2025-01-08 2:29 ` Ming Lei
1 sibling, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2025-01-06 15:24 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Damien Le Moal, linux-block
On 1/6/25 8:21 AM, Christoph Hellwig wrote:
> On Mon, Jan 06, 2025 at 07:54:05AM -0700, Jens Axboe wrote:
>> On 1/6/25 7:24 AM, Damien Le Moal wrote:
>>> The first patch implements the new "zloop" zoned block device driver
>>> which allows creating zoned block devices using one regular file per
>>> zone as backing storage.
>>
>> Couldn't we do this with ublk and keep most of this stuff in userspace
>> rather than need a whole new loop driver?
>
> I'm pretty sure we could do that. But dealing with ublk is a complete
> pain, especially when setting it up and tearing it down all the time for
> tests, and it would require a lot more code, so why? As-is I can directly
> add this to xfstests for the much needed large file system testing that
> currently doesn't work for zoned file systems, which was the motivation
> why I started writing this code (before Damien gladly took over and
> polished it).
A lot more code where? Not in the kernel. And now we're stuck with a new
driver for a relatively niche use case. Seems like a bad tradeoff to me.
--
Jens Axboe
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-06 15:24 ` Jens Axboe
@ 2025-01-06 15:32 ` Christoph Hellwig
2025-01-06 15:38 ` Jens Axboe
0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2025-01-06 15:32 UTC (permalink / raw)
To: Jens Axboe; +Cc: Christoph Hellwig, Damien Le Moal, linux-block
On Mon, Jan 06, 2025 at 08:24:06AM -0700, Jens Axboe wrote:
> A lot more code where?
Very good and relevant question. Some random new repo that no one knows
about? Not very helpful. xfstests itself? Maybe, but that would just
mean other users have to fork it.
> Not in the kernel. And now we're stuck with a new
> driver for a relatively niche use case. Seems like a bad tradeoff to me.
Seriously, if you can't trust Damien and me to maintain a little driver
using completely standard interfaces without any magic, you'll have
different problems keeping the block layer alive :)
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-06 15:32 ` Christoph Hellwig
@ 2025-01-06 15:38 ` Jens Axboe
2025-01-06 15:44 ` Christoph Hellwig
0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2025-01-06 15:38 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Damien Le Moal, linux-block
On 1/6/25 8:32 AM, Christoph Hellwig wrote:
> On Mon, Jan 06, 2025 at 08:24:06AM -0700, Jens Axboe wrote:
>> A lot more code where?
>
> Very good and relevant question. Some random new repo that no one knows
> about? Not very helpful. xfstests itself? Maybe, but that would just
> mean other users have to fork it.
Why would they have to fork it? Just put it in xfstests itself. These
are very weak reasons, imho.
>> Not in the kernel. And now we're stuck with a new
>> driver for a relatively niche use case. Seems like a bad tradeoff to me.
>
> Seriously, if you can't trust Damien and me to maintain a little driver
> using completely standard interfaces without any magic, you'll have
> different problems keeping the block layer alive :)
Asking "why do we need this driver, when we can accomplish the same with
existing stuff" is a valid question, and I'm a bit puzzled why we can't
just have a reasonable discussion about this. If that simple question
can't be asked, and answered suitably, then something is really amiss.
--
Jens Axboe
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-06 15:38 ` Jens Axboe
@ 2025-01-06 15:44 ` Christoph Hellwig
2025-01-06 17:38 ` Jens Axboe
2025-01-08 2:47 ` Ming Lei
0 siblings, 2 replies; 35+ messages in thread
From: Christoph Hellwig @ 2025-01-06 15:44 UTC (permalink / raw)
To: Jens Axboe; +Cc: Christoph Hellwig, Damien Le Moal, linux-block
On Mon, Jan 06, 2025 at 08:38:26AM -0700, Jens Axboe wrote:
> On 1/6/25 8:32 AM, Christoph Hellwig wrote:
> > On Mon, Jan 06, 2025 at 08:24:06AM -0700, Jens Axboe wrote:
> >> A lot more code where?
> >
> > Very good and relevant question. Some random new repo that no one knows
> > about? Not very helpful. xfstests itself? Maybe, but that would just
> > mean other users have to fork it.
>
> Why would they have to fork it? Just put it in xfstests itself. These
> are very weak reasons, imho.
Because that way other users can't use it. Damien has already mentioned
some.
And someone would actually have to write that hypothetical thing.
> >> Not in the kernel. And now we're stuck with a new
> >> driver for a relatively niche use case. Seems like a bad tradeoff to me.
> >
> > Seriously, if you can't trust Damien and me to maintain a little driver
> > using completely standard interfaces without any magic, you'll have
> > different problems keeping the block layer alive :)
>
> Asking "why do we need this driver, when we can accomplish the same with
> existing stuff"
There is no "existing stuff"
> is a valid question, and I'm a bit puzzled why we can't
> just have a reasonable discussion about this.
I think this is a valid and reasonable discussion. But maybe we're
just not on the same page. I don't know of anything existing and usable,
maybe I've just not found it?
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-06 15:44 ` Christoph Hellwig
@ 2025-01-06 17:38 ` Jens Axboe
2025-01-06 18:05 ` Christoph Hellwig
2025-01-07 1:08 ` Damien Le Moal
2025-01-08 2:47 ` Ming Lei
1 sibling, 2 replies; 35+ messages in thread
From: Jens Axboe @ 2025-01-06 17:38 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Damien Le Moal, linux-block
On 1/6/25 8:44 AM, Christoph Hellwig wrote:
> On Mon, Jan 06, 2025 at 08:38:26AM -0700, Jens Axboe wrote:
>> On 1/6/25 8:32 AM, Christoph Hellwig wrote:
>>> On Mon, Jan 06, 2025 at 08:24:06AM -0700, Jens Axboe wrote:
>>>> A lot more code where?
>>>
>>> Very good and relevant question. Some random new repo that no one knows
>>> about? Not very helpful. xfstests itself? Maybe, but that would just
>>> mean other users have to fork it.
>>
>> Why would they have to fork it? Just put it in xfstests itself. These
>> are very weak reasons, imho.
>
> Because that way other users can't use it. Damien has already mentioned
> some.
If it's actually useful to others, then it can become a standalone
thing. Really nothing new there.
> And someone would actually have to write that hypothetical thing.
That is certainly true.
>>>> Not in the kernel. And now we're stuck with a new
>>>> driver for a relatively niche use case. Seems like a bad tradeoff to me.
>>>
>>> Seriously, if you can't trust Damien and me to maintain a little driver
>>> using completely standard interfaces without any magic, you'll have
>>> different problems keeping the block layer alive :)
>>
>> Asking "why do we need this driver, when we can accomplish the same with
>> existing stuff"
>
> There is no "existing stuff"
Right, that's true on both sides now. Yes this kernel driver has been
written, but in practice there is no existing stuff.
>> is a valid question, and I'm a bit puzzled why we can't
>> just have a reasonable discussion about this.
>
> I think this is a valid and reasonable discussion. But maybe we're
> just not on the same page. I don't know anything existing and usable,
> maybe I've just not found it?
Not that I'm aware of, it was just a suggestion/thought that we could
utilize an existing driver for this, rather than have a separate one.
Yes the proposed one is pretty simple and not large, and maintaining it
isn't a big deal, but it's still a new driver and hence why I was asking
"why can't we just use ublk for this". That also keeps the code mostly
in userspace which is nice, rather than needing kernel changes for new
features, changes, etc.
--
Jens Axboe
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-06 17:38 ` Jens Axboe
@ 2025-01-06 18:05 ` Christoph Hellwig
2025-01-07 21:10 ` Jens Axboe
2025-01-07 1:08 ` Damien Le Moal
1 sibling, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2025-01-06 18:05 UTC (permalink / raw)
To: Jens Axboe; +Cc: Christoph Hellwig, Damien Le Moal, linux-block
On Mon, Jan 06, 2025 at 10:38:24AM -0700, Jens Axboe wrote:
> > just not on the same page. I don't know of anything existing and usable;
> > maybe I've just not found it?
>
> Not that I'm aware of, it was just a suggestion/thought that we could
> utilize an existing driver for this, rather than have a separate one.
> Yes the proposed one is pretty simple and not large, and maintaining it
> isn't a big deal, but it's still a new driver and hence why I was asking
> "why can't we just use ublk for this". That also keeps the code mostly
> in userspace which is nice, rather than needing kernel changes for new
> features, changes, etc.
Well, the reason to do a kernel driver rather than a ublk back end
boils down to a few things:
- writing highly concurrent code is actually a lot simpler in the kernel
than in userspace because we have the right primitives for it
- these primitives tend to actually be a lot faster than those available
in glibc as well
- the double context switch into the kernel and back for a ublk device
backed by a file system will actually show up for some xfstests that
do a lot of synchronous ops
- having an in-tree kernel driver that you just configure / unconfigure
from the shell is a lot easier to use than a daemon that needs to
be running. Especially from xfstests or other test suites that do
a lot of per-test setup and teardown
- the kernel actually has really nice infrastructure for block drivers.
I'm pretty sure doing this in userspace would actually be more
code, while being harder to use and lower performance.
So we could go both ways, but the kernel version was pretty obviously
the preferred one to me. Maybe that's a little biased by doing a lot
of kernel work, and having run into a lot of problems and performance
issues with the SCSI target user backend lately.
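To make the shell-configuration point above concrete, the per-test setup and
teardown can be as small as the following sketch (assuming the option syntax
from the patch; the one-backing-directory-per-device-id layout is an
assumption):

    # A minimal sketch using the patch's "add"/"remove" command syntax.
    zloop_setup() {
            mkdir -p /var/local/zloop/0
            echo "add id=0,capacity_mb=4096,zone_size_mb=256" > /dev/zloop-control
    }

    zloop_teardown() {
            echo "remove id=0" > /dev/zloop-control
    }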
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-06 17:38 ` Jens Axboe
2025-01-06 18:05 ` Christoph Hellwig
@ 2025-01-07 1:08 ` Damien Le Moal
2025-01-07 21:08 ` Jens Axboe
1 sibling, 1 reply; 35+ messages in thread
From: Damien Le Moal @ 2025-01-07 1:08 UTC (permalink / raw)
To: Jens Axboe, Christoph Hellwig; +Cc: linux-block
On 1/7/25 02:38, Jens Axboe wrote:
>> I think this is a valid and reasonable discussion. But maybe we're
>> just not on the same page. I don't know of anything existing and usable;
>> maybe I've just not found it?
>
> Not that I'm aware of, it was just a suggestion/thought that we could
> utilize an existing driver for this, rather than have a separate one.
> Yes the proposed one is pretty simple and not large, and maintaining it
> isn't a big deal, but it's still a new driver and hence why I was asking
> "why can't we just use ublk for this". That also keeps the code mostly
> in userspace which is nice, rather than needing kernel changes for new
> features, changes, etc.
I did consider ublk at some point but did not switch to it because a ublk
backend driver doing the same as zloop in userspace would need a lot more code
to be efficient. And even then, as Christoph already mentioned, performance
would still suffer from the context switches. That performance point was not
the primary stopper, though, as this driver is not intended for production use
but rather to be the simplest possible setup that can be used in CI systems
to test zoned file systems (among other zone-related things).
A kernel-based implementation is simpler and the configuration interface
literally needs only a single echo bash command to add or remove devices. This
allows minimal VM configurations with no dependencies on user tools/libraries to
run these zoned devices, which is what we wanted.
I completely agree about the user-space vs kernel tradeoff you mentioned. I did
consider it but the code simplicity and ease of use in practice won for us and I
chose to stick with the kernel driver approach.
Note that if you are OK with this, I need to send a V2 to correct the Kconfig
description which currently shows an invalid configuration command example.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-07 1:08 ` Damien Le Moal
@ 2025-01-07 21:08 ` Jens Axboe
2025-01-08 5:11 ` Damien Le Moal
2025-01-08 5:44 ` Christoph Hellwig
0 siblings, 2 replies; 35+ messages in thread
From: Jens Axboe @ 2025-01-07 21:08 UTC (permalink / raw)
To: Damien Le Moal, Christoph Hellwig; +Cc: linux-block
On 1/6/25 6:08 PM, Damien Le Moal wrote:
> On 1/7/25 02:38, Jens Axboe wrote:
>>> I think this is a valid and reasonable discussion. But maybe we're
>>> just not on the same page. I don't know anything existing and usable,
>>> maybe I've just not found it?
>>
>> Not that I'm aware of, it was just a suggestion/thought that we could
>> utilize an existing driver for this, rather than have a separate one.
>> Yes the proposed one is pretty simple and not large, and maintaining it
>> isn't a big deal, but it's still a new driver and hence why I was asking
>> "why can't we just use ublk for this". That also keeps the code mostly
>> in userspace which is nice, rather than needing kernel changes for new
>> features, changes, etc.
>
> I did consider ublk at some point but did not switch to it because a
> ublk backend driver doing the same as zloop in userspace would need a
> lot more code to be efficient. And even then, as Christoph already
> mentioned, performance would still suffer from the context
> switches. That performance point was not the primary stopper,
I don't buy this context switch argument at all. Why would it mean more
sleeping? There's absolutely zero reason why a ublk solution wouldn't be at
least as performant as the kernel one.
And why would it need "a lot more code to be efficient"?
> though, as this driver is not intended for production use but rather to
> be the simplest possible setup that can be used in CI systems to test
> zoned file systems (among other zone-related things).
Right, that too.
> A kernel-based implementation is simpler and the configuration
> interface literally needs only a single echo bash command to add or
> remove devices. This allows minimal VM configurations with no
> dependencies on user tools/libraries to run these zoned devices, which
> is what we wanted.
>
> I completely agree about the user-space vs kernel tradeoff you
> mentioned. I did consider it but the code simplicity and ease of use
> in practice won for us and I chose to stick with the kernel driver
> approach.
>
> Note that if you are OK with this, I need to send a V2 to correct the
> Kconfig description which currently shows an invalid configuration
> command example.
Sure, I'm not totally against it, even if I think the arguments are
very weak, and in some places also just wrong. It's not like it's a
huge driver.
--
Jens Axboe
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-06 18:05 ` Christoph Hellwig
@ 2025-01-07 21:10 ` Jens Axboe
2025-01-08 5:49 ` Christoph Hellwig
0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2025-01-07 21:10 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Damien Le Moal, linux-block
On 1/6/25 11:05 AM, Christoph Hellwig wrote:
> On Mon, Jan 06, 2025 at 10:38:24AM -0700, Jens Axboe wrote:
>>> just not on the same page. I don't know of anything existing and usable;
>>> maybe I've just not found it?
>>
>> Not that I'm aware of, it was just a suggestion/thought that we could
>> utilize an existing driver for this, rather than have a separate one.
>> Yes the proposed one is pretty simple and not large, and maintaining it
>> isn't a big deal, but it's still a new driver and hence why I was asking
>> "why can't we just use ublk for this". That also keeps the code mostly
>> in userspace which is nice, rather than needing kernel changes for new
>> features, changes, etc.
>
> Well, the reason to do a kernel driver rather than a ublk back end
> boils down to a few things:
>
> - writing highly concurrent code is actually a lot simpler in the kernel
> than in userspace because we have the right primitives for it
> - these primitives tend to actually be a lot faster than those available
> in glibc as well
That's certainly true.
> - the double context switch into the kernel and back for a ublk device
> backed by a file system will actually show up for some xfstests that
> do a lot of synchronous ops
Like I replied to Damien, that's mostly a bogus argument. If you're
doing sync stuff, you can do that with a single system call. If you're
building up depth, then it doesn't matter.
> - having an in-tree kernel driver that you just configure / unconfigure
> from the shell is a lot easier to use than a daemon that needs to
> be running. Especially from xfstests or other test suites that do
> a lot of per-test setup and teardown
This is always true when it's a new piece of userspace, but not
necessarily true once the use case has been established.
> - the kernel actually has really nice infrastructure for block drivers.
> I'm pretty sure doing this in userspace would actually be more
> code, while being harder to use and lower performance.
That's very handwavy...
> So we could go both ways, but the kernel version was pretty obviously
> the preferred one to me. Maybe that's a little biasses by doing a lot
> of kernel work, and having run into a lot of problems and performance
> issues with the SCSI target user backend lately.
Sure, that is understandable.
--
Jens Axboe
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-06 15:21 ` Christoph Hellwig
2025-01-06 15:24 ` Jens Axboe
@ 2025-01-08 2:29 ` Ming Lei
2025-01-08 5:06 ` Damien Le Moal
2025-01-08 5:47 ` Christoph Hellwig
1 sibling, 2 replies; 35+ messages in thread
From: Ming Lei @ 2025-01-08 2:29 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jens Axboe, Damien Le Moal, linux-block
On Mon, Jan 06, 2025 at 04:21:18PM +0100, Christoph Hellwig wrote:
> On Mon, Jan 06, 2025 at 07:54:05AM -0700, Jens Axboe wrote:
> > On 1/6/25 7:24 AM, Damien Le Moal wrote:
> > > The first patch implements the new "zloop" zoned block device driver
> > > which allows creating zoned block devices using one regular file per
> > > zone as backing storage.
> >
> > Couldn't we do this with ublk and keep most of this stuff in userspace
> > rather than need a whole new loop driver?
>
> I'm pretty sure we could do that. But dealing with ublk is a complete
> pain, especially when setting up and tearing it down all the time for
> testing, and would require a lot more code, so why? As-is I can directly
You can link with libublk or add it to rublk, which already supports a
zoned ramdisk, then install rublk from crates.io directly to set up the
test.
Forking a new loop driver could add much more pain since you may have to
address everything we have fixed for loop; please look at 'git log loop'.
Thanks,
Ming
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-06 15:44 ` Christoph Hellwig
2025-01-06 17:38 ` Jens Axboe
@ 2025-01-08 2:47 ` Ming Lei
2025-01-08 14:10 ` Theodore Ts'o
1 sibling, 1 reply; 35+ messages in thread
From: Ming Lei @ 2025-01-08 2:47 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jens Axboe, Damien Le Moal, linux-block
On Mon, Jan 06, 2025 at 04:44:33PM +0100, Christoph Hellwig wrote:
> On Mon, Jan 06, 2025 at 08:38:26AM -0700, Jens Axboe wrote:
> > On 1/6/25 8:32 AM, Christoph Hellwig wrote:
> > > On Mon, Jan 06, 2025 at 08:24:06AM -0700, Jens Axboe wrote:
> > >> A lot more code where?
> > >
> > > Very good and relevant question. Some random new repo that no one knows
> > > about? Not very helpful. xfstests itself? Maybe, but that would just
> > > means other users have to fork it.
> >
> > Why would they have to fork it? Just put it in xfstests itself. These
> > are very weak reasons, imho.
>
> Because that way other users can't use it. Damien has already mentioned
> some.
- cargo install rublk
- rublk add zoned
Then you can setup xfstests over the ublk/zoned disk, also Fedora 42
starts to ship rublk.
Thanks,
Ming
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-08 2:29 ` Ming Lei
@ 2025-01-08 5:06 ` Damien Le Moal
2025-01-08 8:13 ` Ming Lei
2025-01-08 5:47 ` Christoph Hellwig
1 sibling, 1 reply; 35+ messages in thread
From: Damien Le Moal @ 2025-01-08 5:06 UTC (permalink / raw)
To: Ming Lei, Christoph Hellwig; +Cc: Jens Axboe, linux-block
On 1/8/25 11:29 AM, Ming Lei wrote:
> On Mon, Jan 06, 2025 at 04:21:18PM +0100, Christoph Hellwig wrote:
>> On Mon, Jan 06, 2025 at 07:54:05AM -0700, Jens Axboe wrote:
>>> On 1/6/25 7:24 AM, Damien Le Moal wrote:
>>>> The first patch implements the new "zloop" zoned block device driver
>>>> which allows creating zoned block devices using one regular file per
>>>> zone as backing storage.
>>>
>>> Couldn't we do this with ublk and keep most of this stuff in userspace
>>> rather than need a whole new loop driver?
>>
>> I'm pretty sure we could do that. But dealing with ublk is a complete
>> pain, especially when setting up and tearing it down all the time for
>> testing, and would require a lot more code, so why? As-is I can directly
>
> You can link with libublk or add it to rublk, which already supports a
> zoned ramdisk, then install rublk from crates.io directly to set up the
> test.
Thanks, but memory backing is not what we want. We need to emulate large drives
for FS tests (to catch problems such as overflows), and for that, a file-based
storage backing is better.
> Forking a new loop driver could add much more pain since you may have to
> address everything we have fixed for loop; please look at 'git log loop'.
Which is why Christoph started with the kernel driver approach in the
first place: to avoid such issues/difficulties.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-07 21:08 ` Jens Axboe
@ 2025-01-08 5:11 ` Damien Le Moal
2025-01-08 5:44 ` Christoph Hellwig
1 sibling, 0 replies; 35+ messages in thread
From: Damien Le Moal @ 2025-01-08 5:11 UTC (permalink / raw)
To: Jens Axboe, Christoph Hellwig; +Cc: linux-block
On 1/8/25 6:08 AM, Jens Axboe wrote:
>> A kernel-based implementation is simpler and the configuration
>> interface literally needs only a single echo bash command to add or
>> remove devices. This allows minimal VM configurations with no
>> dependencies on user tools/libraries to run these zoned devices, which
>> is what we wanted.
>>
>> I completely agree about the user-space vs kernel tradeoff you
>> mentioned. I did consider it but the code simplicity and ease of use
>> in practice won for us and I chose to stick with the kernel driver
>> approach.
>>
>> Note that if you are OK with this, I need to send a V2 to correct the
>> Kconfig description which currently shows an invalid configuration
>> command example.
>
> Sure, I'm not totally against it, even if I think the arguments are
> very weak, and in some places also just wrong. It's not like it's a
> huge driver.
I am not going to contest that our arguments are somewhat weak. Yes, if
we spend enough time on it, we could eventually get something workable with
ublk. But with that said, when you spend your days developing and testing
stuff for zoned storage, having a super-easy-to-use emulation setup for VMs
without any userspace dependencies does a world of good for productivity.
That is a strong argument for those involved, I think.
So may I send a V2 to get it queued up?
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-07 21:08 ` Jens Axboe
2025-01-08 5:11 ` Damien Le Moal
@ 2025-01-08 5:44 ` Christoph Hellwig
1 sibling, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2025-01-08 5:44 UTC (permalink / raw)
To: Jens Axboe; +Cc: Damien Le Moal, Christoph Hellwig, linux-block
On Tue, Jan 07, 2025 at 02:08:20PM -0700, Jens Axboe wrote:
> > ublk backend driver doing the same as zloop in userspace would need a
> > lot more code to be efficient. And even then, as Christoph already
> > mentioned, performance would still suffer from the context
> > switches. That performance point was not the primary stopper
>
> I don't buy this context switch argument at all.
The zloop write goes straight from kblockd into the filesystem.
ublk switches to userspace, which goes back to the kernel when the
file system writes. There is a similar double context switch on the
completion side.
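One way to observe this (a rough sketch, not a rigorous benchmark; it assumes
a fio job drives the device and uses the standard perf software event) is to
count context switches while the same synchronous workload runs against each
device in turn:

    # Sketch: compare system-wide context switches for a QD=1 synchronous
    # write job against /dev/zloop0, then against the ublk device.
    perf stat -e context-switches -a -- \
        fio --name=syncwr --filename=/dev/zloop0 --rw=write --bs=4k \
            --iodepth=1 --direct=1 --zonemode=zbd --runtime=30 --time_based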
> Why would it mean more
> sleeping?
?
> There's absolutely zero reason why a ublk solution wouldn't be at
> least as performant as the kernel one.
Well, prove it. From having worked on similar schemes in the past,
I highly doubt it.
> And why would it need "a lot more code to be efficient"?
Because we don't have all the nice locking and event infrastructure
in userspace that we have in the kernel.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-08 2:29 ` Ming Lei
2025-01-08 5:06 ` Damien Le Moal
@ 2025-01-08 5:47 ` Christoph Hellwig
1 sibling, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2025-01-08 5:47 UTC (permalink / raw)
To: Ming Lei; +Cc: Christoph Hellwig, Jens Axboe, Damien Le Moal, linux-block
On Wed, Jan 08, 2025 at 10:29:57AM +0800, Ming Lei wrote:
> You can link with libublk or add it to rublk, which supports ramdisk zone
> already, then install rublk from crates.io directly for setup the
> test.
Ramdisks are nicely supported in null_blk already. And Rust crates
are a massive pain as they tend not to be packaged nicely. Exactly
what I do not want to depend on.
> Forking one new loop could add much more pain since you may have to address
> everything we have fixed for loop, please look at 'git log loop'
The biggest problem with the loop driver is the historic baggage in
the user interface. That's side-stepped by this driver (and even for
conventional devices a loop-ng doing the same might be nice, but that's
a separate story).
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-07 21:10 ` Jens Axboe
@ 2025-01-08 5:49 ` Christoph Hellwig
0 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2025-01-08 5:49 UTC (permalink / raw)
To: Jens Axboe; +Cc: Christoph Hellwig, Damien Le Moal, linux-block
On Tue, Jan 07, 2025 at 02:10:45PM -0700, Jens Axboe wrote:
> > - the double context switch into the kernel and back for a ublk device
> > backed by a file system will actually show up for some xfstests that
> > do a lot of synchronous ops
>
> Like I replied to Damien, that's mostly a bogus argument. If you're
> doing sync stuff, you can do that with a single system call. If you're
> building up depth, then it doesn't matter.
How do I use a single system call to retrieve the request from the
kernel and execute it on the file system after examining it?
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-08 5:06 ` Damien Le Moal
@ 2025-01-08 8:13 ` Ming Lei
2025-01-08 9:09 ` Christoph Hellwig
0 siblings, 1 reply; 35+ messages in thread
From: Ming Lei @ 2025-01-08 8:13 UTC (permalink / raw)
To: Damien Le Moal; +Cc: Christoph Hellwig, Jens Axboe, linux-block
On Wed, Jan 8, 2025 at 1:07 PM Damien Le Moal <dlemoal@kernel.org> wrote:
>
> On 1/8/25 11:29 AM, Ming Lei wrote:
> > On Mon, Jan 06, 2025 at 04:21:18PM +0100, Christoph Hellwig wrote:
> >> On Mon, Jan 06, 2025 at 07:54:05AM -0700, Jens Axboe wrote:
> >>> On 1/6/25 7:24 AM, Damien Le Moal wrote:
> >>>> The first patch implements the new "zloop" zoned block device driver
> >>>> which allows creating zoned block devices using one regular file per
> >>>> zone as backing storage.
> >>>
> >>> Couldn't we do this with ublk and keep most of this stuff in userspace
> >>> rather than need a whole new loop driver?
> >>
> >> I'm pretty sure we could do that. But dealing with ublk is a complete
> >> pain, especially when setting up and tearing it down all the time for
> >> testing, and would require a lot more code, so why? As-is I can directly
> >
> > You can link with libublk or add it to rublk, which already supports a
> > zoned ramdisk, then install rublk from crates.io directly to set up the
> > test.
>
> Thanks, but memory backing is not what we want. We need to emulate large drives
> for FS tests (to catch problems such as overflows), and for that, a file-based
> storage backing is better.
It is backed by virtual memory, which can be big enough because of swap, and
it is also easy to extend to file-backed support since zloop doesn't store
zone metadata, which is actually similar to the ram-backed zoned disk.
Unlike loop, zloop can only serve test purposes, because each zone's
metadata is always reset when adding a new device.
Thanks,
Ming
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-08 8:13 ` Ming Lei
@ 2025-01-08 9:09 ` Christoph Hellwig
2025-01-08 9:39 ` Ming Lei
0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2025-01-08 9:09 UTC (permalink / raw)
To: Ming Lei; +Cc: Damien Le Moal, Christoph Hellwig, Jens Axboe, linux-block
On Wed, Jan 08, 2025 at 04:13:01PM +0800, Ming Lei wrote:
> It is backed by virtual memory, which can be big enough because of swap, and
Good luck getting halfway decent performance out of swapping for a 50TB
data set. Or even a partially filled one, which really is the use case
here, so it might only be a TB or so.
> it is also easy to extend to file-backed support since zloop doesn't store
> zone metadata, which is actually similar to the ram-backed zoned disk.
No, zloop does store the write point in the file size of each zone. That's
sorta the whole point because it enables things like mount and even
power-fail testing.
All of this is mentioned explicitly in the commit logs, documentation and
code comments, so claiming something else here feels a bit uninformed.
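This is easy to check from the shell (a sketch; the /dev/zloop0 node name and
the default backing directory come from the patch, and blkzone is the
util-linux tool):

    # The size of each sequential zone's backing file encodes its write
    # pointer, so zone state survives a remove/add cycle (or a reboot).
    blkzone report /dev/zloop0 | head -3     # note the zone wp values
    echo "remove id=0" > /dev/zloop-control
    echo "add id=0" > /dev/zloop-control     # re-attach the same backing dir
    blkzone report /dev/zloop0 | head -3     # wp values are unchanged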
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-08 9:09 ` Christoph Hellwig
@ 2025-01-08 9:39 ` Ming Lei
2025-01-10 12:34 ` Ming Lei
0 siblings, 1 reply; 35+ messages in thread
From: Ming Lei @ 2025-01-08 9:39 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Damien Le Moal, Jens Axboe, linux-block
On Wed, Jan 08, 2025 at 10:09:12AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 08, 2025 at 04:13:01PM +0800, Ming Lei wrote:
> > It is backed by virtual memory, which can be big enough because of swap, and
>
> Good luck getting halfway decent performance out of swapping for a 50TB
> data set. Or even a partially filled one, which really is the use case
> here, so it might only be a TB or so.
>
> > it is also easy to extend to file-backed support since zloop doesn't store
> > zone metadata, which is actually similar to the ram-backed zoned disk.
>
> No, zloop does store the write point in the file size of each zone. That's
> sorta the whole point because it enables things like mount and even
> power-fail testing.
>
> All of this is mentioned explicitly in the commit logs, documentation and
> code comments, so claiming something else here feels a bit uninformed.
OK, that looks like a smart idea.
It is easy to extend rublk/zoned in this way with io_uring I/O emulation. :-)
Thanks,
Ming
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-08 2:47 ` Ming Lei
@ 2025-01-08 14:10 ` Theodore Ts'o
0 siblings, 0 replies; 35+ messages in thread
From: Theodore Ts'o @ 2025-01-08 14:10 UTC (permalink / raw)
To: Ming Lei; +Cc: Christoph Hellwig, Jens Axboe, Damien Le Moal, linux-block
On Wed, Jan 08, 2025 at 10:47:57AM +0800, Ming Lei wrote:
> > > Why would they have to fork it? Just put it in xfstests itself. These
> > > are very weak reasons, imho.
> >
> > Because that way other users can't use it. Damien has already mentioned
> > some.
>
> - cargo install rublk
> - rublk add zoned
>
> Then you can setup xfstests over the ublk/zoned disk, also Fedora 42
> starts to ship rublk.
Um, I build xfstests on Debian Stable; other people build xfstests on
enterprise Linux distributions (e.g., RHEL).
It'd be really nice if we don't add a Rust dependency to xfstests
anytime soon. Or at least, have a way of skipping tests that have a
Rust dependency if xfstests is built on a system that doesn't have
Rust, and to not add a Rust dependency to existing tests, so that we
don't suddenly lose a lot of test coverage all in the name of adding
Rust....
- Ted
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-08 9:39 ` Ming Lei
@ 2025-01-10 12:34 ` Ming Lei
2025-01-24 9:30 ` Damien Le Moal
0 siblings, 1 reply; 35+ messages in thread
From: Ming Lei @ 2025-01-10 12:34 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Damien Le Moal, Jens Axboe, linux-block
On Wed, Jan 08, 2025 at 05:39:33PM +0800, Ming Lei wrote:
> On Wed, Jan 08, 2025 at 10:09:12AM +0100, Christoph Hellwig wrote:
> > On Wed, Jan 08, 2025 at 04:13:01PM +0800, Ming Lei wrote:
> > > It is backed by virtual memory, which can be big enough because of swap, and
> >
> > Good luck getting halfway decent performance out of swapping for a 50TB
> > data set. Or even a partially filled one, which really is the use case
> > here, so it might only be a TB or so.
> >
> > > it is also easy to extend to file-backed support since zloop doesn't store
> > > zone metadata, which is actually similar to the ram-backed zoned disk.
> >
> > No, zloop does store the write point in the file size of each zone. That's
> > sorta the whole point because it enables things like mount and even
> > power-fail testing.
> >
> > All of this is mentioned explicitly in the commit logs, documentation and
> > code comments, so claiming something else here feels a bit uninformed.
>
> OK, that looks like a smart idea.
>
> It is easy to extend rublk/zoned in this way with io_uring I/O emulation. :-)
Here it is:
https://github.com/ublk-org/rublk/commits/file-backed-zoned/
The top two commits implement the feature via the command line option `--path $zdir`:
[rublk]# git diff --stat=80 HEAD^^...
src/zoned.rs | 397 +++++++++++++++++++++++++++++++++++++++++++++++----------
tests/basic.rs | 49 ++++---
2 files changed, 363 insertions(+), 83 deletions(-)
It takes 280 new LoC:
- supports both ram-backed and file-backed storage
- completely async io_uring I/O emulation for zoned read/write I/O
- includes selftest code running mkfs.btrfs/mount/read & write I/O/umount
Thanks,
Ming
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-10 12:34 ` Ming Lei
@ 2025-01-24 9:30 ` Damien Le Moal
2025-01-24 12:30 ` Ming Lei
0 siblings, 1 reply; 35+ messages in thread
From: Damien Le Moal @ 2025-01-24 9:30 UTC (permalink / raw)
To: Ming Lei, Christoph Hellwig; +Cc: Jens Axboe, linux-block
On 1/10/25 21:34, Ming Lei wrote:
>> It is easy to extend rublk/zoned in this way with io_uring I/O emulation. :-)
>
> Here it is:
>
> https://github.com/ublk-org/rublk/commits/file-backed-zoned/
>
> The top two commits implement the feature via the command line option `--path $zdir`:
>
> [rublk]# git diff --stat=80 HEAD^^...
> src/zoned.rs | 397 +++++++++++++++++++++++++++++++++++++++++++++++----------
> tests/basic.rs | 49 ++++---
> 2 files changed, 363 insertions(+), 83 deletions(-)
>
> It takes 280 new LoC:
>
> - supports both ram-backed and file-backed storage
> - completely async io_uring I/O emulation for zoned read/write I/O
> - includes selftest code running mkfs.btrfs/mount/read & write I/O/umount
Hi Ming,
My apologies for the late reply. Conference travel kept me busy.
Thank you for doing this. I gave it a try and measured the performance for some
write workloads (using current Linus tree which includes the block PR for 6.14).
The zloop results shown here are with a slightly tweaked version (not posted)
that changes to using a work item per command instead of a single work item
for all commands.
1 queue:
========
+----------------------------+-------------------+-------------------+
|                            | ublk (IOPS / BW)  | zloop (IOPS / BW) |
+----------------------------+-------------------+-------------------+
| QD=1, 4K rnd wr, 1 job     | 11.7k / 47.8 MB/s | 15.8k / 53.0 MB/s |
| QD=32, 4K rnd wr, 8 jobs   | 63.4k / 260 MB/s  | 101k / 413 MB/s   |
| QD=32, 128K rnd wr, 1 job  | 5008 / 656 MB/s   | 5993 / 786 MB/s   |
| QD=32, 128K seq wr, 1 job  | 2636 / 346 MB/s   | 5393 / 707 MB/s   |
+----------------------------+-------------------+-------------------+
8 queues:
=========
+----------------------------+-------------------+-------------------+
|                            | ublk (IOPS / BW)  | zloop (IOPS / BW) |
+----------------------------+-------------------+-------------------+
| QD=1, 4K rnd wr, 1 job     | 9699 / 39.7 MB/s  | 16.7k / 68.6 MB/s |
| QD=32, 4K rnd wr, 8 jobs   | 58.2k / 238 MB/s  | 108k / 444 MB/s   |
| QD=32, 128K rnd wr, 1 job  | 4160 / 545 MB/s   | 5715 / 749 MB/s   |
| QD=32, 128K seq wr, 1 job  | 3274 / 429 MB/s   | 5934 / 778 MB/s   |
+----------------------------+-------------------+-------------------+
As you can see, zloop is generally much faster. This shows the best results from
several runs as performance variation from one run to another can be significant
(for both ublk and zloop).
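The exact fio command lines are not included here; a representative invocation
for the QD=32, 4K random write, 8 jobs data point would look something like
this (a sketch; the device name and the use of fio's zbd zone mode are
assumptions):

    fio --name=rndwr --filename=/dev/zloop0 --ioengine=io_uring \
        --rw=randwrite --bs=4k --iodepth=32 --numjobs=8 --direct=1 \
        --zonemode=zbd --runtime=60 --time_based --group_reporting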
But as mentioned before, since this is intended to be a test tool for file
systems, performance is not the primary goal here (though the higher the better
as that shortens test times). Simplicity is. And as Ted also stated, introducing
a ublk and rust dependency in xfstests is far from ideal.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-24 9:30 ` Damien Le Moal
@ 2025-01-24 12:30 ` Ming Lei
2025-01-24 14:20 ` Johannes Thumshirn
2025-01-29 8:10 ` Damien Le Moal
0 siblings, 2 replies; 35+ messages in thread
From: Ming Lei @ 2025-01-24 12:30 UTC (permalink / raw)
To: Damien Le Moal; +Cc: Christoph Hellwig, Jens Axboe, linux-block
On Fri, Jan 24, 2025 at 06:30:19PM +0900, Damien Le Moal wrote:
> On 1/10/25 21:34, Ming Lei wrote:
> >> It is easy to extend rublk/zoned in this way with io_uring I/O emulation. :-)
> >
> > Here it is:
> >
> > https://github.com/ublk-org/rublk/commits/file-backed-zoned/
> >
> > The top two commits implement the feature via the command line option `--path $zdir`:
> >
> > [rublk]# git diff --stat=80 HEAD^^...
> > src/zoned.rs | 397 +++++++++++++++++++++++++++++++++++++++++++++++----------
> > tests/basic.rs | 49 ++++---
> > 2 files changed, 363 insertions(+), 83 deletions(-)
> >
> > It takes 280 new LoC:
> >
> > - supports both ram-backed and file-backed storage
> > - completely async io_uring I/O emulation for zoned read/write I/O
> > - includes selftest code running mkfs.btrfs/mount/read & write I/O/umount
>
> Hi Ming,
>
> My apologies for the late reply. Conference travel kept me busy.
> Thank you for doing this. I gave it a try and measured the performance for some
> write workloads (using current Linus tree which includes the block PR for 6.14).
> The zloop results shown here are with a slightly tweaked version (not posted)
> that changes to using a work item per command instead of a single work item
> for all commands.
>
> 1 queue:
> ========
> +----------------------------+-------------------+-------------------+
> |                            | ublk (IOPS / BW)  | zloop (IOPS / BW) |
> +----------------------------+-------------------+-------------------+
> | QD=1, 4K rnd wr, 1 job     | 11.7k / 47.8 MB/s | 15.8k / 53.0 MB/s |
> | QD=32, 4K rnd wr, 8 jobs   | 63.4k / 260 MB/s  | 101k / 413 MB/s   |
I can't reproduce the above two; I actually observe no obvious difference
between rublk/zoned and zloop in my test VM.
Maybe rublk ran in debug mode, which usually reduces perf by half.
You need to add the device via 'cargo run -r -- add zoned' to use
release mode.
Actually there is just a single io_uring_enter() running in each ublk queue
pthread, so perf should be similar to kernel I/O handling; the main extra
load is from the single syscall kernel/user context switch and the I/O data
copy, and the data copy effect is usually negligible for small I/O sizes (< 64KB).
> | QD=32, 128K rnd wr, 1 job  | 5008 / 656 MB/s   | 5993 / 786 MB/s   |
> | QD=32, 128K seq wr, 1 job  | 2636 / 346 MB/s   | 5393 / 707 MB/s   |
ublk 128K BS may be a little slower since there is one extra copy.
> +----------------------------+-------------------+-------------------+
>
> 8 queues:
> =========
> +----------------------------+-------------------+-------------------+
> |                            | ublk (IOPS / BW)  | zloop (IOPS / BW) |
> +----------------------------+-------------------+-------------------+
> | QD=1, 4K rnd wr, 1 job     | 9699 / 39.7 MB/s  | 16.7k / 68.6 MB/s |
> | QD=32, 4K rnd wr, 8 jobs   | 58.2k / 238 MB/s  | 108k / 444 MB/s   |
> | QD=32, 128K rnd wr, 1 job  | 4160 / 545 MB/s   | 5715 / 749 MB/s   |
> | QD=32, 128K seq wr, 1 job  | 3274 / 429 MB/s   | 5934 / 778 MB/s   |
> +----------------------------+-------------------+-------------------+
>
> As you can see, zloop is generally much faster. This shows the best results from
> several runs as performance variation from one run to another can be significant
> (for both ublk and zloop).
>
> But as mentioned before, since this is intended to be a test tool for file
> systems, performance is not the primary goal here (though the higher the better
> as that shortens test times). Simplicity is. And as Ted also stated, introducing
> a ublk and rust dependency in xfstests is far from ideal.
Simplicity needs to be judged along multiple dimensions; 300 vs. 1500 LoC
already shows something, IMO.
Thanks,
Ming
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-24 12:30 ` Ming Lei
@ 2025-01-24 14:20 ` Johannes Thumshirn
2025-01-29 8:10 ` Damien Le Moal
1 sibling, 0 replies; 35+ messages in thread
From: Johannes Thumshirn @ 2025-01-24 14:20 UTC (permalink / raw)
To: Ming Lei, Damien Le Moal; +Cc: hch, Jens Axboe, linux-block@vger.kernel.org
On 24.01.25 13:30, Ming Lei wrote:
> On Fri, Jan 24, 2025 at 06:30:19PM +0900, Damien Le Moal wrote:
>> But as mentioned before, since this is intended to be a test tool for file
>> systems, performance is not the primary goal here (though the higher the better
>> as that shortens test times). Simplicity is. And as Ted also stated, introducing
>> a ublk and rust dependency in xfstests is far from ideal.
>
> Simplicity needs to be judged along multiple dimensions; 300 vs. 1500 LoC
> already shows something, IMO.
>
To add my $.02 here, if there's another dependency for these tests in
xfstests, they're just going to be skipped by 99.9% of the people running
xfstests.
Also, 300 LoC of Rust code doesn't translate well to C. All kernel developers
know how to debug and write C code; at the moment only a fraction knows
Rust.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-24 12:30 ` Ming Lei
2025-01-24 14:20 ` Johannes Thumshirn
@ 2025-01-29 8:10 ` Damien Le Moal
2025-01-31 3:54 ` Ming Lei
1 sibling, 1 reply; 35+ messages in thread
From: Damien Le Moal @ 2025-01-29 8:10 UTC (permalink / raw)
To: Ming Lei; +Cc: Christoph Hellwig, Jens Axboe, linux-block
[-- Attachment #1: Type: text/plain, Size: 6078 bytes --]
On 1/24/25 21:30, Ming Lei wrote:
>> 1 queue:
>> ========
>> +----------------------------+-------------------+-------------------+
>> |                            | ublk (IOPS / BW)  | zloop (IOPS / BW) |
>> +----------------------------+-------------------+-------------------+
>> | QD=1, 4K rnd wr, 1 job     | 11.7k / 47.8 MB/s | 15.8k / 53.0 MB/s |
>> | QD=32, 4K rnd wr, 8 jobs   | 63.4k / 260 MB/s  | 101k / 413 MB/s   |
>
> I can't reproduce the above two; I actually observe no obvious difference
> between rublk/zoned and zloop in my test VM.
I am using bare-metal machines for these tests as I do not want any
noise from a VM/hypervisor in the numbers. And I did say that this is with a
tweaked version of zloop that I have not posted yet (I was waiting for rc1 to
repost, as a rebase is needed to correct a compilation failure due to the
nomerge tag set flag being removed). I am attaching the patch I used here
(it applies on top of the current Linus tree).
> Maybe rublk ran in debug mode, which usually reduces perf by half.
> You need to add the device via 'cargo run -r -- add zoned' to use
> release mode.
Well, that is not an obvious thing for someone who does not know Rust well. The
README file of rublk also does not mention that. So no, I did not run it like
this. I followed the README and called rublk directly. It would be great to
document that.
> Actually there is just a single io_uring_enter() running in each ublk queue
> pthread, so perf should be similar to kernel I/O handling; the main extra
> load is from the single syscall kernel/user context switch and the I/O data
> copy, and the data copy effect is usually negligible for small I/O sizes (< 64KB).
>
>> | QD=32, 128K rnd wr, 1 job  | 5008 / 656 MB/s   | 5993 / 786 MB/s   |
>> | QD=32, 128K seq wr, 1 job  | 2636 / 346 MB/s   | 5393 / 707 MB/s   |
>
> ublk 128K BS may be a little slower since there is one extra copy.
Here are newer numbers running rublk as you suggested (using cargo run -r).
The backend storage is on an XFS file system using a PCIe gen4 4TB M.2 SSD that
is empty (the FS is empty on start). The emulated zoned disk has a capacity of
512GB with only 256 MB sequential zones (that is, there are 2048
zones/files). Each data point is from a 1min run of fio.
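For reference, the configuration described above corresponds to an add command
along these lines (a sketch using the option names from the patch; the base_dir
path is made up, and nr_queues was 1 or 8 depending on the run):

    mkdir -p /mnt/xfs/zloop/0
    echo "add id=0,capacity_mb=524288,zone_size_mb=256,conv_zones=0,base_dir=/mnt/xfs/zloop,nr_queues=8,queue_depth=128" > /dev/zloop-control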
On an 8-core Intel Xeon test box, which has PCIe gen3 only, I get:
Single queue:
=============
+----------------------------+-------------------+-------------------+
|                            | ublk (IOPS / BW)  | zloop (IOPS / BW) |
+----------------------------+-------------------+-------------------+
| QD=1, 4K rnd wr, 1 job     | 2859 / 11.7 MB/s  | 5535 / 22.7 MB/s  |
| QD=32, 4K rnd wr, 8 jobs   | 24.5k / 100 MB/s  | 24.6k / 101 MB/s  |
| QD=32, 128K rnd wr, 1 job  | 14.9k / 1954 MB/s | 19.6k / 2571 MB/s |
| QD=32, 128K seq wr, 1 job  | 1516 / 199 MB/s   | 10.6k / 1385 MB/s |
+----------------------------+-------------------+-------------------+
8 queues:
=========
+----------------------------+-------------------+-------------------+
|                            | ublk (IOPS / BW)  | zloop (IOPS / BW) |
+----------------------------+-------------------+-------------------+
| QD=1, 4K rnd wr, 1 job     | 5387 / 22.1 MB/s  | 5436 / 22.3 MB/s  |
| QD=32, 4K rnd wr, 8 jobs   | 16.4k / 67.0 MB/s | 26.3k / 108 MB/s  |
| QD=32, 128K rnd wr, 1 job  | 6101 / 800 MB/s   | 19.8k / 2591 MB/s |
| QD=32, 128K seq wr, 1 job  | 3987 / 523 MB/s   | 10.6k / 1391 MB/s |
+----------------------------+-------------------+-------------------+
I have no idea why ublk is generally slower when set up with 8 I/O queues. The
qd=32 4K random write with 8 jobs is generally faster with ublk than zloop, but
that varies. I tracked that down to CPU utilization, which is generally much
better (all CPUs used) with ublk compared to zloop, as zloop is at the mercy of
the workqueue code and how it schedules unbound work items.
Also, I do not understand why the sequential write workload is so much slower
with ublk. That baffles me and I have no explanation.
With a faster PCIe gen4 16-core AMD Epyc-2 machine, I get:
Single queue:
=============
+----------------------------+-------------------+-------------------+
|                            | ublk (IOPS / BW)  | zloop (IOPS / BW) |
+----------------------------+-------------------+-------------------+
| QD=1, 4K rnd wr, 1 job     | 6824 / 28.0 MB/s  | 7320 / 30.0 MB/s  |
| QD=32, 4K rnd wr, 8 jobs   | 50.9k / 208 MB/s  | 41.7k / 171 MB/s  |
| QD=32, 128K rnd wr, 1 job  | 15.6k / 2046 MB/s | 18.5k / 2430 MB/s |
| QD=32, 128K seq wr, 1 job  | 6237 / 818 MB/s   | 22.5k / 2943 MB/s |
+----------------------------+-------------------+-------------------+
8 queues:
=========
+----------------------------+-------------------+-------------------+
|                            | ublk (IOPS / BW)  | zloop (IOPS / BW) |
+----------------------------+-------------------+-------------------+
| QD=1, 4K rnd wr, 1 job     | 6884 / 28.2 MB/s  | 7707 / 31.6 MB/s  |
| QD=32, 4K rnd wr, 8 jobs   | 39.4k / 161 MB/s  | 46.8k / 192 MB/s  |
| QD=32, 128K rnd wr, 1 job  | 12.2k / 1597 MB/s | 18.8k / 2460 MB/s |
| QD=32, 128K seq wr, 1 job  | 6391 / 799 MB/s   | 21.4k / 2802 MB/s |
+----------------------------+-------------------+-------------------+
The same pattern repeats: ublk with 8 queues is slower, but it is faster
with a single queue for the qd=32 4K random write with 8 jobs case.
> Simplicity needs to be judged along multiple dimensions; 300 vs. 1500 LoC
> already shows something, IMO.
Sure. But given the very complicated syntax of Rust, a lower LoC count for
Rust compared to C is a very subjective measure, in my opinion.
I said "simplicity" in the context of using the driver. And rublk is not as
simple to use as zloop, as it needs rust/cargo installed, which is not an
acceptable dependency for xfstests. Furthermore, it is very annoying to have to
change the nofile ulimit to allow rublk to open all the zone files of a large
disk (zloop does not need that).
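For the 2048-zone device above, that means something like the following before
adding the device (a sketch; the path is made up, and the limit just needs to
exceed the number of zone files):

    ulimit -n 4096                         # > 2048 zone backing files
    rublk add zoned --path /mnt/xfs/zdir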
--
Damien Le Moal
Western Digital Research
[-- Attachment #2: 0001-block-new-zoned-loop-block-device-driver.patch --]
[-- Type: text/x-patch, Size: 36750 bytes --]
From 8f1e318c4e4fe05814c3e8cb13992c802a365458 Mon Sep 17 00:00:00 2001
From: Damien Le Moal <dlemoal@kernel.org>
Date: Fri, 20 Dec 2024 07:59:07 +0000
Subject: [PATCH] block: new zoned loop block device driver
The zoned loop block device driver allows a user to create emulated
zoned block devices using one regular file per zone as backing storage.
Compared to null_blk or scsi_debug, it has the advantage of allowing
emulating large zoned devices without requiring the same amount of
memory as the capacity of the emulated device. Furthermore, zoned
devices emulated with this driver can be re-started after a host reboot
without any loss of the state of the device zones, which is something
that null_blk and scsi_debug do not support.
This initial implementation is simple and does not support zone resource
limits. That is, a zoned loop block device's limits for the maximum number
of open zones and maximum number of active zones are always 0.
This driver can be compiled either in-kernel or as a module, named
"zloop". Compilation of this driver depends on the block layer support
for zoned block devices (CONFIG_BLK_DEV_ZONED must be set).
Using the zloop driver to create and delete zoned block devices is
done by writing commands to the zoned loop control character device file
(/dev/zloop-control). Creating a device is done with:
$ echo "add [options]" > /dev/zloop-control
The options available for the "add" operation can be listed by reading
the zloop-control device file:
$ cat /dev/zloop-control
add id=%d,capacity_mb=%u,zone_size_mb=%u,zone_capacity_mb=%u,conv_zones=%u,base_dir=%s,nr_queues=%u,queue_depth=%u
remove id=%d
The options available allow controlling the zoned device total
capacity, zone size, zone capacity of sequential zones, total number
of conventional zones, base directory for the zone backing files, number
of I/O queues and the maximum queue depth of I/O queues.
Deleting a device is done using the "remove" command:
$ echo "remove id=0" > /dev/zloop-control
This implementation passes various tests using zonefs and fio (t/zbd
tests) and provides a state machine for zone conditions that is
compliant with the T10 ZBC and NVMe ZNS specifications.
Co-developed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
MAINTAINERS | 7 +
drivers/block/Kconfig | 16 +
drivers/block/Makefile | 1 +
drivers/block/zloop.c | 1338 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 1362 insertions(+)
create mode 100644 drivers/block/zloop.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 936e80f2c9ce..96325f487b3e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -26152,6 +26152,13 @@ L: linux-kernel@vger.kernel.org
S: Maintained
F: arch/x86/kernel/cpu/zhaoxin.c
+ZONED LOOP DEVICE
+M: Damien Le Moal <dlemoal@kernel.org>
+R: Christoph Hellwig <hch@lst.de>
+L: linux-block@vger.kernel.org
+S: Maintained
+F: drivers/block/zloop.c
+
ZONEFS FILESYSTEM
M: Damien Le Moal <dlemoal@kernel.org>
M: Naohiro Aota <naohiro.aota@wdc.com>
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index a97f2c40c640..abdbe5d49026 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -413,4 +413,20 @@ config BLKDEV_UBLK_LEGACY_OPCODES
source "drivers/block/rnbd/Kconfig"
+config BLK_DEV_ZONED_LOOP
+ tristate "Zoned loopback device support"
+ depends on BLK_DEV_ZONED
+ help
+ Saying Y here will allow you to create a zoned block device using
+ regular files for zones (one file per zone). This is useful to test
+ file systems, device mapper and applications that support zoned block
+ devices. To create a zoned loop device, no user utility is needed; a
+ zoned loop device can be created (or re-started) using a command
+ like:
+
+ echo "add id=0,zone_size_mb=256,capacity_mb=16384,conv_zones=11" > \
+ /dev/zloop-control
+
+ If unsure, say N.
+
endif # BLK_DEV
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 1105a2d4fdcb..097707aca725 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -41,5 +41,6 @@ obj-$(CONFIG_BLK_DEV_RNBD) += rnbd/
obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk/
obj-$(CONFIG_BLK_DEV_UBLK) += ublk_drv.o
+obj-$(CONFIG_BLK_DEV_ZONED_LOOP) += zloop.o
swim_mod-y := swim.o swim_asm.o
diff --git a/drivers/block/zloop.c b/drivers/block/zloop.c
new file mode 100644
index 000000000000..b2341bb13528
--- /dev/null
+++ b/drivers/block/zloop.c
@@ -0,0 +1,1338 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2025, Christoph Hellwig.
+ * Copyright (c) 2025, Western Digital Corporation or its affiliates.
+ *
+ * Zoned Loop Device driver - exports a zoned block device using one file per
+ * zone as backing storage.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/blk-mq.h>
+#include <linux/blkzoned.h>
+#include <linux/pagemap.h>
+#include <linux/miscdevice.h>
+#include <linux/falloc.h>
+#include <linux/mutex.h>
+#include <linux/parser.h>
+#include <linux/seq_file.h>
+
+/*
+ * Options for adding (and removing) a device.
+ */
+enum {
+ ZLOOP_OPT_ERR = 0,
+ ZLOOP_OPT_ID = (1 << 0),
+ ZLOOP_OPT_CAPACITY = (1 << 1),
+ ZLOOP_OPT_ZONE_SIZE = (1 << 2),
+ ZLOOP_OPT_ZONE_CAPACITY = (1 << 3),
+ ZLOOP_OPT_NR_CONV_ZONES = (1 << 4),
+ ZLOOP_OPT_BASE_DIR = (1 << 5),
+ ZLOOP_OPT_NR_QUEUES = (1 << 6),
+ ZLOOP_OPT_QUEUE_DEPTH = (1 << 7),
+ ZLOOP_OPT_BUFFERED_IO = (1 << 8),
+};
+
+static const match_table_t zloop_opt_tokens = {
+ { ZLOOP_OPT_ID, "id=%d" },
+ { ZLOOP_OPT_CAPACITY, "capacity_mb=%u" },
+ { ZLOOP_OPT_ZONE_SIZE, "zone_size_mb=%u" },
+ { ZLOOP_OPT_ZONE_CAPACITY, "zone_capacity_mb=%u" },
+ { ZLOOP_OPT_NR_CONV_ZONES, "conv_zones=%u" },
+ { ZLOOP_OPT_BASE_DIR, "base_dir=%s" },
+ { ZLOOP_OPT_NR_QUEUES, "nr_queues=%u" },
+ { ZLOOP_OPT_QUEUE_DEPTH, "queue_depth=%u" },
+ { ZLOOP_OPT_BUFFERED_IO, "buffered_io" },
+ { ZLOOP_OPT_ERR, NULL }
+};
+
+/* Default values for the "add" operation. */
+#define ZLOOP_DEF_ID -1
+#define ZLOOP_DEF_ZONE_SIZE ((256ULL * SZ_1M) >> SECTOR_SHIFT)
+#define ZLOOP_DEF_NR_ZONES 64
+#define ZLOOP_DEF_NR_CONV_ZONES 8
+#define ZLOOP_DEF_BASE_DIR "/var/local/zloop"
+#define ZLOOP_DEF_NR_QUEUES 1
+#define ZLOOP_DEF_QUEUE_DEPTH 128
+#define ZLOOP_DEF_BUFFERED_IO false
+
+/* Arbitrary limit on the zone size (16GB). */
+#define ZLOOP_MAX_ZONE_SIZE_MB 16384
+
+struct zloop_options {
+ unsigned int mask;
+ int id;
+ sector_t capacity;
+ sector_t zone_size;
+ sector_t zone_capacity;
+ unsigned int nr_conv_zones;
+ char *base_dir;
+ unsigned int nr_queues;
+ unsigned int queue_depth;
+ bool buffered_io;
+};
+
+/*
+ * Device states.
+ */
+enum {
+ Zlo_creating = 0,
+ Zlo_live,
+ Zlo_deleting,
+};
+
+enum zloop_zone_flags {
+ ZLOOP_ZONE_CONV = 0,
+ ZLOOP_ZONE_SEQ_ERROR,
+};
+
+struct zloop_zone {
+ struct file *file;
+
+ unsigned long flags;
+ struct mutex lock;
+ enum blk_zone_cond cond;
+ sector_t start;
+ sector_t wp;
+
+ gfp_t old_gfp_mask;
+};
+
+struct zloop_device {
+ unsigned int id;
+ unsigned int state;
+
+ struct blk_mq_tag_set tag_set;
+ struct gendisk *disk;
+
+ struct workqueue_struct *workqueue;
+ bool buffered_io;
+
+ const char *base_dir;
+ struct file *data_dir;
+
+ unsigned int zone_shift;
+ sector_t zone_size;
+ sector_t zone_capacity;
+ unsigned int nr_zones;
+ unsigned int nr_conv_zones;
+
+ struct zloop_zone zones[] __counted_by(nr_zones);
+};
+
+struct zloop_cmd {
+ struct work_struct work;
+ atomic_t ref;
+ sector_t sector;
+ sector_t nr_sectors;
+ long ret;
+ struct kiocb iocb;
+ struct bio_vec *bvec;
+};
+
+static DEFINE_IDR(zloop_index_idr);
+static DEFINE_MUTEX(zloop_ctl_mutex);
+
+static unsigned int rq_zone_no(struct request *rq)
+{
+ struct zloop_device *zlo = rq->q->queuedata;
+
+ return blk_rq_pos(rq) >> zlo->zone_shift;
+}
+
+static int zloop_update_seq_zone(struct zloop_device *zlo, unsigned int zone_no)
+{
+ struct zloop_zone *zone = &zlo->zones[zone_no];
+ struct kstat stat;
+ sector_t file_sectors;
+ int ret;
+
+ lockdep_assert_held(&zone->lock);
+
+ ret = vfs_getattr(&zone->file->f_path, &stat, STATX_SIZE, 0);
+ if (ret < 0) {
+ pr_err("Failed to get zone %u file stat (err=%d)\n",
+ zone_no, ret);
+ set_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags);
+ return ret;
+ }
+
+ file_sectors = stat.size >> SECTOR_SHIFT;
+ if (file_sectors > zlo->zone_capacity) {
+ pr_err("Zone %u file too large (%llu sectors > %llu)\n",
+ zone_no, file_sectors, zlo->zone_capacity);
+ return -EINVAL;
+ }
+
+ if (!file_sectors) {
+ zone->cond = BLK_ZONE_COND_EMPTY;
+ zone->wp = zone->start;
+ } else if (file_sectors == zlo->zone_capacity) {
+ zone->cond = BLK_ZONE_COND_FULL;
+ zone->wp = zone->start + zlo->zone_size;
+ } else {
+ zone->cond = BLK_ZONE_COND_CLOSED;
+ zone->wp = zone->start + file_sectors;
+ }
+
+ return 0;
+}
+
+static int zloop_open_zone(struct zloop_device *zlo, unsigned int zone_no)
+{
+ struct zloop_zone *zone = &zlo->zones[zone_no];
+ int ret = 0;
+
+ if (test_bit(ZLOOP_ZONE_CONV, &zone->flags))
+ return -EIO;
+
+ mutex_lock(&zone->lock);
+
+ if (test_and_clear_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags)) {
+ ret = zloop_update_seq_zone(zlo, zone_no);
+ if (ret)
+ goto unlock;
+ }
+
+ switch (zone->cond) {
+ case BLK_ZONE_COND_EXP_OPEN:
+ break;
+ case BLK_ZONE_COND_EMPTY:
+ case BLK_ZONE_COND_CLOSED:
+ case BLK_ZONE_COND_IMP_OPEN:
+ zone->cond = BLK_ZONE_COND_EXP_OPEN;
+ break;
+ case BLK_ZONE_COND_FULL:
+ default:
+ ret = -EIO;
+ break;
+ }
+
+unlock:
+ mutex_unlock(&zone->lock);
+
+ return ret;
+}
+
+static int zloop_close_zone(struct zloop_device *zlo, unsigned int zone_no)
+{
+ struct zloop_zone *zone = &zlo->zones[zone_no];
+ int ret = 0;
+
+ if (test_bit(ZLOOP_ZONE_CONV, &zone->flags))
+ return -EIO;
+
+ mutex_lock(&zone->lock);
+
+ if (test_and_clear_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags)) {
+ ret = zloop_update_seq_zone(zlo, zone_no);
+ if (ret)
+ goto unlock;
+ }
+
+ switch (zone->cond) {
+ case BLK_ZONE_COND_CLOSED:
+ break;
+ case BLK_ZONE_COND_IMP_OPEN:
+ case BLK_ZONE_COND_EXP_OPEN:
+ if (zone->wp == zone->start)
+ zone->cond = BLK_ZONE_COND_EMPTY;
+ else
+ zone->cond = BLK_ZONE_COND_CLOSED;
+ break;
+ case BLK_ZONE_COND_EMPTY:
+ case BLK_ZONE_COND_FULL:
+ default:
+ ret = -EIO;
+ break;
+ }
+
+unlock:
+ mutex_unlock(&zone->lock);
+
+ return ret;
+}
+
+static int zloop_reset_zone(struct zloop_device *zlo, unsigned int zone_no)
+{
+ struct zloop_zone *zone = &zlo->zones[zone_no];
+ int ret = 0;
+
+ if (test_bit(ZLOOP_ZONE_CONV, &zone->flags))
+ return -EIO;
+
+ mutex_lock(&zone->lock);
+
+ if (!test_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags) &&
+ zone->cond == BLK_ZONE_COND_EMPTY)
+ goto unlock;
+
+ if (vfs_truncate(&zone->file->f_path, 0)) {
+ set_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags);
+ ret = -EIO;
+ goto unlock;
+ }
+
+ zone->cond = BLK_ZONE_COND_EMPTY;
+ zone->wp = zone->start;
+ clear_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags);
+
+unlock:
+ mutex_unlock(&zone->lock);
+
+ return ret;
+}
+
+static int zloop_reset_all_zones(struct zloop_device *zlo)
+{
+ unsigned int i;
+ int ret;
+
+ for (i = zlo->nr_conv_zones; i < zlo->nr_zones; i++) {
+ ret = zloop_reset_zone(zlo, i);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+static int zloop_finish_zone(struct zloop_device *zlo, unsigned int zone_no)
+{
+ struct zloop_zone *zone = &zlo->zones[zone_no];
+ int ret = 0;
+
+ if (test_bit(ZLOOP_ZONE_CONV, &zone->flags))
+ return -EIO;
+
+ mutex_lock(&zone->lock);
+
+ if (!test_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags) &&
+ zone->cond == BLK_ZONE_COND_FULL)
+ goto unlock;
+
+ if (vfs_truncate(&zone->file->f_path, zlo->zone_size << SECTOR_SHIFT)) {
+ set_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags);
+ ret = -EIO;
+ goto unlock;
+ }
+
+ zone->cond = BLK_ZONE_COND_FULL;
+ zone->wp = zone->start + zlo->zone_size;
+ clear_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags);
+
+unlock:
+ mutex_unlock(&zone->lock);
+
+ return ret;
+}
+
+static void zloop_put_cmd(struct zloop_cmd *cmd)
+{
+ struct request *rq = blk_mq_rq_from_pdu(cmd);
+
+ if (!atomic_dec_and_test(&cmd->ref))
+ return;
+ kfree(cmd->bvec);
+ cmd->bvec = NULL;
+ if (likely(!blk_should_fake_timeout(rq->q)))
+ blk_mq_complete_request(rq);
+}
+
+static void zloop_rw_complete(struct kiocb *iocb, long ret)
+{
+ struct zloop_cmd *cmd = container_of(iocb, struct zloop_cmd, iocb);
+
+ cmd->ret = ret;
+ zloop_put_cmd(cmd);
+}
+
+static void zloop_rw(struct zloop_cmd *cmd)
+{
+ struct request *rq = blk_mq_rq_from_pdu(cmd);
+ struct zloop_device *zlo = rq->q->queuedata;
+ unsigned int zone_no = rq_zone_no(rq);
+ sector_t sector = blk_rq_pos(rq);
+ sector_t nr_sectors = blk_rq_sectors(rq);
+ bool is_append = req_op(rq) == REQ_OP_ZONE_APPEND;
+ bool is_write = req_op(rq) == REQ_OP_WRITE || is_append;
+ int rw = is_write ? ITER_SOURCE : ITER_DEST;
+ struct req_iterator rq_iter;
+ struct zloop_zone *zone;
+ struct iov_iter iter;
+ struct bio_vec tmp;
+ sector_t zone_end;
+ int nr_bvec = 0;
+ int ret;
+
+ atomic_set(&cmd->ref, 2);
+ cmd->sector = sector;
+ cmd->nr_sectors = nr_sectors;
+ cmd->ret = 0;
+
+ /* We should never get an I/O beyond the device capacity. */
+ if (WARN_ON_ONCE(zone_no >= zlo->nr_zones)) {
+ ret = -EIO;
+ goto out;
+ }
+ zone = &zlo->zones[zone_no];
+ zone_end = zone->start + zlo->zone_capacity;
+
+ /*
+ * The block layer should never send requests that are not fully
+ * contained within the zone.
+ */
+ if (WARN_ON_ONCE(sector + nr_sectors > zone->start + zlo->zone_size)) {
+ ret = -EIO;
+ goto out;
+ }
+
+ if (test_and_clear_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags)) {
+ mutex_lock(&zone->lock);
+ ret = zloop_update_seq_zone(zlo, zone_no);
+ mutex_unlock(&zone->lock);
+ if (ret)
+ goto out;
+ }
+
+ if (!test_bit(ZLOOP_ZONE_CONV, &zone->flags) && is_write) {
+ mutex_lock(&zone->lock);
+
+ if (is_append) {
+ sector = zone->wp;
+ cmd->sector = sector;
+ }
+
+ /*
+ * Write operations must be aligned to the write pointer and
+ * fully contained within the zone capacity.
+ */
+ if (sector != zone->wp || zone->wp + nr_sectors > zone_end) {
+ pr_err("Zone %u: unaligned write: sect %llu, wp %llu\n",
+ zone_no, sector, zone->wp);
+ ret = -EIO;
+ goto unlock;
+ }
+
+ /* Implicitly open the target zone. */
+ if (zone->cond == BLK_ZONE_COND_CLOSED ||
+ zone->cond == BLK_ZONE_COND_EMPTY)
+ zone->cond = BLK_ZONE_COND_IMP_OPEN;
+
+ /*
+ * Advance the write pointer of sequential zones. If the write
+ * fails, the wp position will be corrected when the next I/O
+ * completes.
+ */
+ zone->wp += nr_sectors;
+ if (zone->wp == zone_end)
+ zone->cond = BLK_ZONE_COND_FULL;
+ }
+
+ rq_for_each_bvec(tmp, rq, rq_iter)
+ nr_bvec++;
+
+ if (rq->bio != rq->biotail) {
+ struct bio_vec *bvec;
+
+ cmd->bvec = kmalloc_array(nr_bvec, sizeof(*cmd->bvec), GFP_NOIO);
+ if (!cmd->bvec) {
+ ret = -EIO;
+ goto unlock;
+ }
+
+ /*
+ * The bios of the request may be started from the middle of
+ * the 'bvec' because of bio splitting, so we can't directly
+ * copy bio->bi_iov_vec to new bvec. The rq_for_each_bvec
+ * API will take care of all details for us.
+ */
+ bvec = cmd->bvec;
+ rq_for_each_bvec(tmp, rq, rq_iter) {
+ *bvec = tmp;
+ bvec++;
+ }
+ iov_iter_bvec(&iter, rw, cmd->bvec, nr_bvec, blk_rq_bytes(rq));
+ } else {
+ /*
+ * Same here, this bio may be started from the middle of the
+ * 'bvec' because of bio splitting, so offset from the bvec
+ * must be passed to iov iterator
+ */
+ iov_iter_bvec(&iter, rw,
+ __bvec_iter_bvec(rq->bio->bi_io_vec, rq->bio->bi_iter),
+ nr_bvec, blk_rq_bytes(rq));
+ iter.iov_offset = rq->bio->bi_iter.bi_bvec_done;
+ }
+
+ cmd->iocb.ki_pos = (sector - zone->start) << SECTOR_SHIFT;
+ cmd->iocb.ki_filp = zone->file;
+ cmd->iocb.ki_complete = zloop_rw_complete;
+ if (!zlo->buffered_io)
+ cmd->iocb.ki_flags = IOCB_DIRECT;
+ cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
+
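+	/*
+	 * Submit the I/O to the zone backing file. A return value of
+	 * -EIOCBQUEUED means that the I/O was queued asynchronously and that
+	 * zloop_rw_complete() will be called when it completes.
+	 */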
+ if (rw == ITER_SOURCE)
+ ret = zone->file->f_op->write_iter(&cmd->iocb, &iter);
+ else
+ ret = zone->file->f_op->read_iter(&cmd->iocb, &iter);
+unlock:
+ if (!test_bit(ZLOOP_ZONE_CONV, &zone->flags) && is_write)
+ mutex_unlock(&zone->lock);
+out:
+ if (ret != -EIOCBQUEUED)
+ zloop_rw_complete(&cmd->iocb, ret);
+ zloop_put_cmd(cmd);
+}
+
+static void zloop_handle_cmd(struct zloop_cmd *cmd)
+{
+ struct request *rq = blk_mq_rq_from_pdu(cmd);
+ struct zloop_device *zlo = rq->q->queuedata;
+
+ switch (req_op(rq)) {
+ case REQ_OP_READ:
+ case REQ_OP_WRITE:
+ case REQ_OP_ZONE_APPEND:
+ /*
+ * zloop_rw() always executes asynchronously or completes
+ * directly.
+ */
+ zloop_rw(cmd);
+ return;
+ case REQ_OP_FLUSH:
+ /*
+ * Sync the entire FS containing the zone files instead of
+ * walking all files
+ */
+ cmd->ret = sync_filesystem(file_inode(zlo->data_dir)->i_sb);
+ break;
+ case REQ_OP_ZONE_RESET:
+ cmd->ret = zloop_reset_zone(zlo, rq_zone_no(rq));
+ break;
+ case REQ_OP_ZONE_RESET_ALL:
+ cmd->ret = zloop_reset_all_zones(zlo);
+ break;
+ case REQ_OP_ZONE_FINISH:
+ cmd->ret = zloop_finish_zone(zlo, rq_zone_no(rq));
+ break;
+ case REQ_OP_ZONE_OPEN:
+ cmd->ret = zloop_open_zone(zlo, rq_zone_no(rq));
+ break;
+ case REQ_OP_ZONE_CLOSE:
+ cmd->ret = zloop_close_zone(zlo, rq_zone_no(rq));
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ pr_err("Unsupported operation %d\n", req_op(rq));
+ cmd->ret = -EOPNOTSUPP;
+ break;
+ }
+
+ blk_mq_complete_request(rq);
+}
+
+static void zloop_cmd_workfn(struct work_struct *work)
+{
+ struct zloop_cmd *cmd = container_of(work, struct zloop_cmd, work);
+ int orig_flags = current->flags;
+
+ current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
+ zloop_handle_cmd(cmd);
+ current->flags = orig_flags;
+}
+
+static void zloop_complete_rq(struct request *rq)
+{
+ struct zloop_cmd *cmd = blk_mq_rq_to_pdu(rq);
+ struct zloop_device *zlo = rq->q->queuedata;
+ unsigned int zone_no = cmd->sector >> zlo->zone_shift;
+ struct zloop_zone *zone = &zlo->zones[zone_no];
+ blk_status_t sts = BLK_STS_OK;
+
+ switch (req_op(rq)) {
+ case REQ_OP_READ:
+ if (cmd->ret < 0)
+ pr_err("Zone %u: failed read sector %llu, %llu sectors\n",
+ zone_no, cmd->sector, cmd->nr_sectors);
+
+ if (cmd->ret >= 0 && cmd->ret != blk_rq_bytes(rq)) {
+ /* short read */
+ struct bio *bio;
+
+ __rq_for_each_bio(bio, rq)
+ zero_fill_bio(bio);
+ }
+ break;
+ case REQ_OP_WRITE:
+ case REQ_OP_ZONE_APPEND:
+ if (cmd->ret < 0)
+ pr_err("Zone %u: failed %swrite sector %llu, %llu sectors\n",
+ zone_no,
+ req_op(rq) == REQ_OP_WRITE ? "" : "append ",
+ cmd->sector, cmd->nr_sectors);
+
+ if (cmd->ret >= 0 && cmd->ret != blk_rq_bytes(rq)) {
+ pr_err("Zone %u: partial write %ld/%u B\n",
+ zone_no, cmd->ret, blk_rq_bytes(rq));
+ cmd->ret = -EIO;
+ }
+
+ if (cmd->ret < 0 && !test_bit(ZLOOP_ZONE_CONV, &zone->flags)) {
+ /*
+ * A write to a sequential zone file failed: mark the
+ * zone as having an error. This will be corrected and
+ * cleared when the next IO is submitted.
+ */
+ set_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags);
+ break;
+ }
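+		/*
+		 * For zone append, return the effective write position to the
+		 * caller through the request sector.
+		 */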
+ if (req_op(rq) == REQ_OP_ZONE_APPEND)
+ rq->__sector = cmd->sector;
+
+ break;
+ default:
+ break;
+ }
+
+ if (cmd->ret < 0)
+ sts = errno_to_blk_status(cmd->ret);
+ blk_mq_end_request(rq, sts);
+}
+
+static blk_status_t zloop_queue_rq(struct blk_mq_hw_ctx *hctx,
+ const struct blk_mq_queue_data *bd)
+{
+ struct request *rq = bd->rq;
+ struct zloop_cmd *cmd = blk_mq_rq_to_pdu(rq);
+ struct zloop_device *zlo = rq->q->queuedata;
+
+ if (zlo->state == Zlo_deleting)
+ return BLK_STS_IOERR;
+
+ blk_mq_start_request(rq);
+
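+	/*
+	 * All commands are handled in workqueue context: operating on the
+	 * zone backing files may sleep, which is not allowed here.
+	 */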
+ INIT_WORK(&cmd->work, zloop_cmd_workfn);
+ queue_work(zlo->workqueue, &cmd->work);
+
+ return BLK_STS_OK;
+}
+
+static const struct blk_mq_ops zloop_mq_ops = {
+ .queue_rq = zloop_queue_rq,
+ .complete = zloop_complete_rq,
+};
+
+static int zloop_open(struct gendisk *disk, blk_mode_t mode)
+{
+ struct zloop_device *zlo = disk->private_data;
+ int ret;
+
+ ret = mutex_lock_killable(&zloop_ctl_mutex);
+ if (ret)
+ return ret;
+
+ if (zlo->state != Zlo_live)
+ ret = -ENXIO;
+ mutex_unlock(&zloop_ctl_mutex);
+ return ret;
+}
+
+static int zloop_report_zones(struct gendisk *disk, sector_t sector,
+ unsigned int nr_zones, report_zones_cb cb, void *data)
+{
+ struct zloop_device *zlo = disk->private_data;
+ struct blk_zone blkz = {};
+ unsigned int first, i;
+ int ret;
+
+ first = disk_zone_no(disk, sector);
+ if (first >= zlo->nr_zones)
+ return 0;
+ nr_zones = min(nr_zones, zlo->nr_zones - first);
+
+ for (i = 0; i < nr_zones; i++) {
+ unsigned int zone_no = first + i;
+ struct zloop_zone *zone = &zlo->zones[zone_no];
+
+ mutex_lock(&zone->lock);
+
+ if (test_and_clear_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags)) {
+ ret = zloop_update_seq_zone(zlo, zone_no);
+ if (ret) {
+ mutex_unlock(&zone->lock);
+ return ret;
+ }
+ }
+
+ blkz.start = zone->start;
+ blkz.len = zlo->zone_size;
+ blkz.wp = zone->wp;
+ blkz.cond = zone->cond;
+ if (test_bit(ZLOOP_ZONE_CONV, &zone->flags)) {
+ blkz.type = BLK_ZONE_TYPE_CONVENTIONAL;
+ blkz.capacity = zlo->zone_size;
+ } else {
+ blkz.type = BLK_ZONE_TYPE_SEQWRITE_REQ;
+ blkz.capacity = zlo->zone_capacity;
+ }
+
+ mutex_unlock(&zone->lock);
+
+ ret = cb(&blkz, i, data);
+ if (ret)
+ return ret;
+ }
+
+ return nr_zones;
+}
+
+static void zloop_free_disk(struct gendisk *disk)
+{
+ struct zloop_device *zlo = disk->private_data;
+ unsigned int i;
+
+ for (i = 0; i < zlo->nr_zones; i++) {
+ struct zloop_zone *zone = &zlo->zones[i];
+
+ mapping_set_gfp_mask(zone->file->f_mapping,
+ zone->old_gfp_mask);
+ fput(zone->file);
+ }
+
+ fput(zlo->data_dir);
+ destroy_workqueue(zlo->workqueue);
+ kfree(zlo->base_dir);
+ kvfree(zlo);
+}
+
+static const struct block_device_operations zloop_fops = {
+ .owner = THIS_MODULE,
+ .open = zloop_open,
+ .report_zones = zloop_report_zones,
+ .free_disk = zloop_free_disk,
+};
+
+__printf(3, 4)
+static struct file *zloop_filp_open_fmt(int oflags, umode_t mode,
+ const char *fmt, ...)
+{
+ struct file *file;
+ va_list ap;
+ char *p;
+
+ va_start(ap, fmt);
+ p = kvasprintf(GFP_KERNEL, fmt, ap);
+ va_end(ap);
+
+ if (!p)
+ return ERR_PTR(-ENOMEM);
+ file = filp_open(p, oflags, mode);
+ kfree(p);
+ return file;
+}
+
+static int zloop_init_zone(struct zloop_device *zlo, unsigned int zone_no,
+ struct zloop_options *opts, bool restore)
+{
+ struct zloop_zone *zone = &zlo->zones[zone_no];
+ int oflags = O_RDWR;
+ struct kstat stat;
+ sector_t file_sectors;
+ int ret;
+
+ mutex_init(&zone->lock);
+ zone->start = (sector_t)zone_no << zlo->zone_shift;
+
+ if (!restore)
+ oflags |= O_CREAT;
+
+ if (!opts->buffered_io)
+ oflags |= O_DIRECT;
+
+ if (zone_no < zlo->nr_conv_zones) {
+ /* Conventional zone file. */
+ set_bit(ZLOOP_ZONE_CONV, &zone->flags);
+ zone->cond = BLK_ZONE_COND_NOT_WP;
+ zone->wp = U64_MAX;
+
+ zone->file = zloop_filp_open_fmt(oflags, 0600, "%s/%u/cnv-%06u",
+ zlo->base_dir, zlo->id, zone_no);
+ if (IS_ERR(zone->file)) {
+			pr_err("Failed to open zone %u file %s/%u/cnv-%06u (err=%ld)\n",
+ zone_no, zlo->base_dir, zlo->id, zone_no,
+ PTR_ERR(zone->file));
+ return PTR_ERR(zone->file);
+ }
+
+ ret = vfs_getattr(&zone->file->f_path, &stat, STATX_SIZE, 0);
+ if (ret < 0) {
+ pr_err("Failed to get zone %u file stat\n", zone_no);
+ return ret;
+ }
+ file_sectors = stat.size >> SECTOR_SHIFT;
+
+ if (restore && file_sectors != zlo->zone_size) {
+ pr_err("Invalid conventional zone %u file size (%llu sectors != %llu)\n",
+				zone_no, file_sectors, zlo->zone_size);
+			return -EINVAL;
+ }
+
+ ret = vfs_truncate(&zone->file->f_path,
+ zlo->zone_size << SECTOR_SHIFT);
+ if (ret < 0) {
+ pr_err("Failed to truncate zone %u file (err=%d)\n",
+ zone_no, ret);
+ return ret;
+ }
+
+ return 0;
+ }
+
+ /* Sequential zone file. */
+ zone->file = zloop_filp_open_fmt(oflags, 0600, "%s/%u/seq-%06u",
+ zlo->base_dir, zlo->id, zone_no);
+ if (IS_ERR(zone->file)) {
+		pr_err("Failed to open zone %u file %s/%u/seq-%06u (err=%ld)\n",
+ zone_no, zlo->base_dir, zlo->id, zone_no,
+ PTR_ERR(zone->file));
+ return PTR_ERR(zone->file);
+ }
+
+ mutex_lock(&zone->lock);
+ ret = zloop_update_seq_zone(zlo, zone_no);
+ mutex_unlock(&zone->lock);
+
+ return ret;
+}
+
+static bool zloop_dev_exists(struct zloop_device *zlo)
+{
+ struct file *cnv, *seq;
+ bool exists;
+
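+	/*
+	 * Probe for the first conventional or sequential zone file to detect
+	 * a device left over from a previous "add" operation.
+	 */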
+ cnv = zloop_filp_open_fmt(O_RDONLY, 0600, "%s/%u/cnv-%06u",
+ zlo->base_dir, zlo->id, 0);
+ seq = zloop_filp_open_fmt(O_RDONLY, 0600, "%s/%u/seq-%06u",
+ zlo->base_dir, zlo->id, 0);
+ exists = !IS_ERR(cnv) || !IS_ERR(seq);
+
+ if (!IS_ERR(cnv))
+ fput(cnv);
+ if (!IS_ERR(seq))
+ fput(seq);
+
+ return exists;
+}
+
+static int zloop_ctl_add(struct zloop_options *opts)
+{
+ struct queue_limits lim = {
+ .max_hw_sectors = SZ_1M >> SECTOR_SHIFT,
+ .max_hw_zone_append_sectors = SZ_1M >> SECTOR_SHIFT,
+ .chunk_sectors = opts->zone_size,
+ .features = BLK_FEAT_ZONED,
+ };
+ unsigned int nr_zones, i, j;
+ struct zloop_device *zlo;
+ int block_size;
+ int ret = -EINVAL;
+ bool restore;
+
+ __module_get(THIS_MODULE);
+
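+	/*
+	 * The zone size was checked to be a power of 2 when parsing the
+	 * options, so the zone count is simply capacity / zone_size.
+	 */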
+ nr_zones = opts->capacity >> ilog2(opts->zone_size);
+ if (opts->nr_conv_zones >= nr_zones) {
+ pr_err("Invalid number of conventional zones %u\n",
+ opts->nr_conv_zones);
+ goto out;
+ }
+
+ zlo = kvzalloc(struct_size(zlo, zones, nr_zones), GFP_KERNEL);
+ if (!zlo) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ zlo->state = Zlo_creating;
+
+ ret = mutex_lock_killable(&zloop_ctl_mutex);
+ if (ret)
+ goto out_free_dev;
+
+ /* Allocate id, if @opts->id >= 0, we're requesting that specific id */
+ if (opts->id >= 0) {
+ ret = idr_alloc(&zloop_index_idr, zlo,
+ opts->id, opts->id + 1, GFP_KERNEL);
+ if (ret == -ENOSPC)
+ ret = -EEXIST;
+ } else {
+ ret = idr_alloc(&zloop_index_idr, zlo, 0, 0, GFP_KERNEL);
+ }
+ mutex_unlock(&zloop_ctl_mutex);
+ if (ret < 0)
+ goto out_free_dev;
+
+ zlo->id = ret;
+ zlo->zone_shift = ilog2(opts->zone_size);
+ zlo->zone_size = opts->zone_size;
+ if (opts->zone_capacity)
+ zlo->zone_capacity = opts->zone_capacity;
+ else
+ zlo->zone_capacity = zlo->zone_size;
+ zlo->nr_zones = nr_zones;
+ zlo->nr_conv_zones = opts->nr_conv_zones;
+ zlo->buffered_io = opts->buffered_io;
+
+ zlo->workqueue = alloc_workqueue("zloop%d", WQ_UNBOUND | WQ_FREEZABLE,
+ opts->nr_queues * opts->queue_depth, zlo->id);
+ if (!zlo->workqueue) {
+ ret = -ENOMEM;
+ goto out_free_idr;
+ }
+
+ if (opts->base_dir)
+ zlo->base_dir = kstrdup(opts->base_dir, GFP_KERNEL);
+ else
+ zlo->base_dir = kstrdup(ZLOOP_DEF_BASE_DIR, GFP_KERNEL);
+ if (!zlo->base_dir) {
+ ret = -ENOMEM;
+ goto out_destroy_workqueue;
+ }
+
+ zlo->data_dir = zloop_filp_open_fmt(O_RDONLY | O_DIRECTORY, 0, "%s/%u",
+ zlo->base_dir, zlo->id);
+ if (IS_ERR(zlo->data_dir)) {
+ ret = PTR_ERR(zlo->data_dir);
+ pr_warn("Failed to open directory %s/%u (err=%d)\n",
+ zlo->base_dir, zlo->id, ret);
+ goto out_free_base_dir;
+ }
+
+ /* Use the FS block size as the device sector size. */
+ block_size = file_inode(zlo->data_dir)->i_sb->s_blocksize;
+ if (block_size > SZ_4K) {
+ pr_warn("Unsupported FS block size %d B > 4096\n",
+ block_size);
+		ret = -EINVAL;
+		goto out_close_data_dir;
+ }
+ lim.physical_block_size = block_size;
+ lim.logical_block_size = block_size;
+
+ /*
+ * If we already have zone files, we are restoring a device created by a
+ * previous add operation. In this case, zloop_init_zone() will check
+ * that the zone files are consistent with the zone configuration given.
+ */
+ restore = zloop_dev_exists(zlo);
+ for (i = 0; i < nr_zones; i++) {
+ ret = zloop_init_zone(zlo, i, opts, restore);
+ if (ret)
+ goto out_close_files;
+ }
+
+ zlo->tag_set.ops = &zloop_mq_ops;
+ zlo->tag_set.nr_hw_queues = opts->nr_queues;
+ zlo->tag_set.queue_depth = opts->queue_depth;
+ zlo->tag_set.numa_node = NUMA_NO_NODE;
+ zlo->tag_set.cmd_size = sizeof(struct zloop_cmd);
+ zlo->tag_set.driver_data = zlo;
+
+ ret = blk_mq_alloc_tag_set(&zlo->tag_set);
+ if (ret) {
+ pr_err("blk_mq_alloc_tag_set failed (err=%d)\n", ret);
+ goto out_close_files;
+ }
+
+ zlo->disk = blk_mq_alloc_disk(&zlo->tag_set, &lim, zlo);
+ if (IS_ERR(zlo->disk)) {
+		ret = PTR_ERR(zlo->disk);
+		pr_err("blk_mq_alloc_disk failed (err=%d)\n", ret);
+ goto out_cleanup_tags;
+ }
+ zlo->disk->flags = GENHD_FL_NO_PART;
+ zlo->disk->fops = &zloop_fops;
+ zlo->disk->private_data = zlo;
+ sprintf(zlo->disk->disk_name, "zloop%d", zlo->id);
+ set_capacity(zlo->disk, (u64)lim.chunk_sectors * zlo->nr_zones);
+
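+	/*
+	 * Check the zone configuration and set up zone write plugging
+	 * resources before the disk is added.
+	 */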
+	ret = blk_revalidate_disk_zones(zlo->disk);
+	if (ret)
+		goto out_cleanup_disk;
+
+ ret = add_disk(zlo->disk);
+ if (ret) {
+ pr_err("add_disk failed (err=%d)\n", ret);
+ goto out_cleanup_disk;
+ }
+
+ mutex_lock(&zloop_ctl_mutex);
+ zlo->state = Zlo_live;
+ mutex_unlock(&zloop_ctl_mutex);
+
+ pr_info("Added device %d\n", zlo->id);
+
+ return 0;
+
+out_cleanup_disk:
+ put_disk(zlo->disk);
+out_cleanup_tags:
+ blk_mq_free_tag_set(&zlo->tag_set);
+out_close_files:
+ for (j = 0; j < i; j++) {
+ struct zloop_zone *zone = &zlo->zones[j];
+
+ if (!IS_ERR_OR_NULL(zone->file))
+ fput(zone->file);
+ }
+out_close_data_dir:
+ fput(zlo->data_dir);
+out_free_base_dir:
+ kfree(zlo->base_dir);
+out_destroy_workqueue:
+ destroy_workqueue(zlo->workqueue);
+out_free_idr:
+ mutex_lock(&zloop_ctl_mutex);
+ idr_remove(&zloop_index_idr, zlo->id);
+ mutex_unlock(&zloop_ctl_mutex);
+out_free_dev:
+ kvfree(zlo);
+out:
+ module_put(THIS_MODULE);
+ if (ret == -ENOENT)
+ ret = -EINVAL;
+ return ret;
+}
+
+static int zloop_ctl_remove(struct zloop_options *opts)
+{
+ struct zloop_device *zlo;
+ int ret;
+
+ if (!(opts->mask & ZLOOP_OPT_ID)) {
+ pr_err("No ID specified\n");
+ return -EINVAL;
+ }
+
+ ret = mutex_lock_killable(&zloop_ctl_mutex);
+ if (ret)
+ return ret;
+
+ zlo = idr_find(&zloop_index_idr, opts->id);
+ if (!zlo || zlo->state == Zlo_creating) {
+ ret = -ENODEV;
+ } else if (zlo->state == Zlo_deleting) {
+ ret = -EINVAL;
+ } else {
+ idr_remove(&zloop_index_idr, zlo->id);
+ zlo->state = Zlo_deleting;
+ }
+
+ mutex_unlock(&zloop_ctl_mutex);
+ if (ret)
+ return ret;
+
+ del_gendisk(zlo->disk);
+ put_disk(zlo->disk);
+ blk_mq_free_tag_set(&zlo->tag_set);
+
+ pr_info("Removed device %d\n", opts->id);
+
+ module_put(THIS_MODULE);
+
+ return 0;
+}
+
+static int zloop_parse_options(struct zloop_options *opts, const char *buf)
+{
+ substring_t args[MAX_OPT_ARGS];
+ char *options, *o, *p;
+ unsigned int token;
+ int ret = 0;
+
+ /* Set defaults. */
+ opts->mask = 0;
+ opts->id = ZLOOP_DEF_ID;
+ opts->capacity = ZLOOP_DEF_ZONE_SIZE * ZLOOP_DEF_NR_ZONES;
+ opts->zone_size = ZLOOP_DEF_ZONE_SIZE;
+ opts->nr_conv_zones = ZLOOP_DEF_NR_CONV_ZONES;
+ opts->nr_queues = ZLOOP_DEF_NR_QUEUES;
+ opts->queue_depth = ZLOOP_DEF_QUEUE_DEPTH;
+ opts->buffered_io = ZLOOP_DEF_BUFFERED_IO;
+
+ if (!buf)
+ return 0;
+
+ /* Skip leading spaces before the options. */
+ while (isspace(*buf))
+ buf++;
+
+ options = o = kstrdup(buf, GFP_KERNEL);
+ if (!options)
+ return -ENOMEM;
+
+ /* Parse the options, doing only some light invalid value checks. */
+ while ((p = strsep(&o, ",\n")) != NULL) {
+ if (!*p)
+ continue;
+
+ token = match_token(p, zloop_opt_tokens, args);
+ opts->mask |= token;
+ switch (token) {
+ case ZLOOP_OPT_ID:
+ if (match_int(args, &opts->id)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ break;
+ case ZLOOP_OPT_CAPACITY:
+ if (match_uint(args, &token)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (!token) {
+ pr_err("Invalid capacity\n");
+ ret = -EINVAL;
+ goto out;
+ }
+ opts->capacity =
+ ((sector_t)token * SZ_1M) >> SECTOR_SHIFT;
+ break;
+ case ZLOOP_OPT_ZONE_SIZE:
+ if (match_uint(args, &token)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (!token || token > ZLOOP_MAX_ZONE_SIZE_MB ||
+ !is_power_of_2(token)) {
+ pr_err("Invalid zone size %u\n", token);
+ ret = -EINVAL;
+ goto out;
+ }
+ opts->zone_size =
+ ((sector_t)token * SZ_1M) >> SECTOR_SHIFT;
+ break;
+ case ZLOOP_OPT_ZONE_CAPACITY:
+ if (match_uint(args, &token)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (!token) {
+ pr_err("Invalid zone capacity\n");
+ ret = -EINVAL;
+ goto out;
+ }
+ opts->zone_capacity =
+ ((sector_t)token * SZ_1M) >> SECTOR_SHIFT;
+ break;
+ case ZLOOP_OPT_NR_CONV_ZONES:
+ if (match_uint(args, &token)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ opts->nr_conv_zones = token;
+ break;
+ case ZLOOP_OPT_BASE_DIR:
+ p = match_strdup(args);
+ if (!p) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ kfree(opts->base_dir);
+ opts->base_dir = p;
+ break;
+ case ZLOOP_OPT_NR_QUEUES:
+ if (match_uint(args, &token)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (!token) {
+ pr_err("Invalid number of queues\n");
+ ret = -EINVAL;
+ goto out;
+ }
+ opts->nr_queues = min(token, num_online_cpus());
+ break;
+ case ZLOOP_OPT_QUEUE_DEPTH:
+ if (match_uint(args, &token)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (!token) {
+ pr_err("Invalid queue depth\n");
+ ret = -EINVAL;
+ goto out;
+ }
+ opts->queue_depth = token;
+ break;
+ case ZLOOP_OPT_BUFFERED_IO:
+ opts->buffered_io = true;
+ break;
+ case ZLOOP_OPT_ERR:
+ default:
+ pr_warn("unknown parameter or missing value '%s'\n", p);
+ ret = -EINVAL;
+ goto out;
+ }
+ }
+
+ ret = -EINVAL;
+ if (opts->capacity <= opts->zone_size) {
+ pr_err("Invalid capacity\n");
+ goto out;
+ }
+
+ if (opts->zone_capacity > opts->zone_size) {
+ pr_err("Invalid zone capacity\n");
+ goto out;
+ }
+
+ ret = 0;
+out:
+ kfree(options);
+ return ret;
+}
+
+enum {
+ ZLOOP_CTL_ADD,
+ ZLOOP_CTL_REMOVE,
+};
+
+static struct zloop_ctl_op {
+ int code;
+ const char *name;
+} zloop_ctl_ops[] = {
+ { ZLOOP_CTL_ADD, "add" },
+ { ZLOOP_CTL_REMOVE, "remove" },
+ { -1, NULL },
+};
+
+static ssize_t zloop_ctl_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *pos)
+{
+ struct zloop_options opts = { };
+ struct zloop_ctl_op *op;
+ const char *buf, *opts_buf;
+ int i, ret;
+
+ if (count > PAGE_SIZE)
+ return -ENOMEM;
+
+ buf = memdup_user_nul(ubuf, count);
+ if (IS_ERR(buf))
+ return PTR_ERR(buf);
+
+ for (i = 0; i < ARRAY_SIZE(zloop_ctl_ops); i++) {
+ op = &zloop_ctl_ops[i];
+ if (!op->name) {
+ pr_err("Invalid operation\n");
+ ret = -EINVAL;
+ goto out;
+ }
+ if (!strncmp(buf, op->name, strlen(op->name)))
+ break;
+ }
+
+ if (count <= strlen(op->name))
+ opts_buf = NULL;
+ else
+ opts_buf = buf + strlen(op->name);
+
+ ret = zloop_parse_options(&opts, opts_buf);
+ if (ret) {
+ pr_err("Failed to parse options\n");
+ goto out;
+ }
+
+ switch (op->code) {
+ case ZLOOP_CTL_ADD:
+ ret = zloop_ctl_add(&opts);
+ break;
+ case ZLOOP_CTL_REMOVE:
+ ret = zloop_ctl_remove(&opts);
+ break;
+ default:
+ pr_err("Invalid operation\n");
+ ret = -EINVAL;
+ goto out;
+ }
+
+out:
+ kfree(opts.base_dir);
+ kfree(buf);
+ return ret ? ret : count;
+}
+
+static int zloop_ctl_show(struct seq_file *seq_file, void *private)
+{
+ const struct match_token *tok;
+ int i;
+
+ /* Add operation */
+ seq_printf(seq_file, "%s ", zloop_ctl_ops[0].name);
+ for (i = 0; i < ARRAY_SIZE(zloop_opt_tokens); i++) {
+ tok = &zloop_opt_tokens[i];
+ if (!tok->pattern)
+ break;
+ if (i)
+ seq_putc(seq_file, ',');
+ seq_puts(seq_file, tok->pattern);
+ }
+ seq_putc(seq_file, '\n');
+
+ /* Remove operation */
+ seq_puts(seq_file, zloop_ctl_ops[1].name);
+ seq_puts(seq_file, " id=%d\n");
+
+ return 0;
+}
+
+static int zloop_ctl_open(struct inode *inode, struct file *file)
+{
+ file->private_data = NULL;
+ return single_open(file, zloop_ctl_show, NULL);
+}
+
+static int zloop_ctl_release(struct inode *inode, struct file *file)
+{
+ return single_release(inode, file);
+}
+
+static const struct file_operations zloop_ctl_fops = {
+ .owner = THIS_MODULE,
+ .open = zloop_ctl_open,
+ .release = zloop_ctl_release,
+ .write = zloop_ctl_write,
+ .read = seq_read,
+};
+
+static struct miscdevice zloop_misc = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "zloop-control",
+ .fops = &zloop_ctl_fops,
+};
+
+static int __init zloop_init(void)
+{
+ int ret;
+
+ ret = misc_register(&zloop_misc);
+ if (ret) {
+ pr_err("Failed to register misc device: %d\n", ret);
+ return ret;
+ }
+ pr_info("Module loaded\n");
+
+ return 0;
+}
+
+static void __exit zloop_exit(void)
+{
+ misc_deregister(&zloop_misc);
+ idr_destroy(&zloop_index_idr);
+}
+
+module_init(zloop_init);
+module_exit(zloop_exit);
+
+MODULE_DESCRIPTION("Zoned loopback device");
+MODULE_LICENSE("GPL");
--
2.48.1
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-29 8:10 ` Damien Le Moal
@ 2025-01-31 3:54 ` Ming Lei
2025-02-04 3:22 ` Damien Le Moal
0 siblings, 1 reply; 35+ messages in thread
From: Ming Lei @ 2025-01-31 3:54 UTC (permalink / raw)
To: Damien Le Moal; +Cc: Christoph Hellwig, Jens Axboe, linux-block
On Wed, Jan 29, 2025 at 05:10:32PM +0900, Damien Le Moal wrote:
> On 1/24/25 21:30, Ming Lei wrote:
> >> 1 queue:
> >> ========
> >> +----------------------------+-------------------+-------------------+
> >> |                            | ublk (IOPS / BW)  | zloop (IOPS / BW) |
> >> +----------------------------+-------------------+-------------------+
> >> | QD=1, 4K rnd wr, 1 job | 11.7k / 47.8 MB/s | 15.8k / 53.0 MB/s |
> >> | QD=32, 4K rnd wr, 8 jobs | 63.4k / 260 MB/s | 101k / 413 MB/s |
> >
> > I can't reproduce the above two; actually I do not observe an obvious
> > difference between rublk/zoned and zloop in my test VM.
>
> I am using bare-metal machines for these tests as I do not want any
> noise from a VM/hypervisor in the numbers. And I did say that this is with a
> tweaked version of zloop that I have not posted yet (I was waiting for rc1 to
> repost as a rebase is needed to correct a compilation failure due to the
> nomerge tag set flag being removed). I am attaching the patch I used here
> (it applies on top of current Linus tree).
>
> > Maybe rublk works in debug mode, which usually reduces perf by half.
> > And you need to add the device via 'cargo run -r -- add zoned' to use
> > release mode.
>
> Well, that is not an obvious thing for someone who does not know rust well. The
> README file of rublk also does not mention that. So no, I did not run it like
> this. I followed the README and called rublk directly. It would be great to
> document that.
OK, that is fine, and now you can install rublk/zoned with 'cargo
install rublk' directly, which always builds & installs the release-mode
binary.
>
> > Actually there is just a single io_uring_enter() running in each ublk queue
> > pthread, so perf should be similar to kernel IO handling, and the main extra
> > load is from the single syscall kernel/user context switch and IO data copy,
> > and the data copy effect can usually be neglected for small IO sizes (< 64KB).
> >
> >> | QD=32, 128K rnd wr, 1 job | 5008 / 656 MB/s | 5993 / 786 MB/s |
> >> | QD=32, 128K seq wr, 1 job | 2636 / 346 MB/s | 5393 / 707 MB/s |
> >
> > ublk 128K BS may be a little slower since there is one extra copy.
>
> Here are newer numbers running rublk as you suggested (using cargo run -r).
> The backend storage is on an XFS file system using a PCI gen4 4TB M.2 SSD that
> is empty (the FS is empty on start). The emulated zoned disk has a capacity of
> 512GB with 256 MB sequential zones only (that is, there are 2048
> zones/files). Each data point is from a 1min run of fio.
Can you share how you created rublk/zoned and zloop, and the underlying
device info? Queue depth and nr_queues (for both rublk/zloop and the
underlying disk) in particular play a big role.
I will use your settings on real hardware and re-run the test after I
return from the Spring Festival holiday.
>
> On a 8-cores Intel Xeon test box, which has PCI gen 3 only, I get:
>
> Single queue:
> =============
> +----------------------------+-------------------+-------------------+
> |                            | ublk (IOPS / BW)  | zloop (IOPS / BW) |
> +----------------------------+-------------------+-------------------+
> | QD=1, 4K rnd wr, 1 job | 2859 / 11.7 MB/s | 5535 / 22.7 MB/s |
> | QD=32, 4K rnd wr, 8 jobs | 24.5k / 100 MB/s | 24.6k / 101 MB/s |
> | QD=32, 128K rnd wr, 1 job | 14.9k / 1954 MB/s | 19.6k / 2571 MB/s |
> | QD=32, 128K seq wr, 1 job | 1516 / 199 MB/s | 10.6k / 1385 MB/s |
> +----------------------------+-------------------+-------------------+
>
> 8 queues:
> =========
> +----------------------------+-------------------+-------------------+
> |                            | ublk (IOPS / BW)  | zloop (IOPS / BW) |
> +----------------------------+-------------------+-------------------+
> | QD=1, 4K rnd wr, 1 job | 5387 / 22.1 MB/s | 5436 / 22.3 MB/s |
> | QD=32, 4K rnd wr, 8 jobs | 16.4k / 67.0 MB/s | 26.3k / 108 MB/s |
> | QD=32, 128K rnd wr, 1 job | 6101 / 800 MB/s | 19.8k / 2591 MB/s |
> | QD=32, 128K seq wr, 1 job | 3987 / 523 MB/s | 10.6k / 1391 MB/s |
> +----------------------------+-------------------+-------------------+
>
> I have no idea why ublk is generally slower when set up with 8 I/O queues. The
> qd=32 4K random write with 8 jobs is generally faster with ublk than zloop, but
> that varies. I tracked that down to CPU utilization, which is generally much
> better (all CPUs used) with ublk compared to zloop, as zloop is at the mercy of
> the workqueue code and how it schedules unbound work items.
Maybe it is related to queue depth? The default ublk queue depth is
128, 8 jobs actually cause 256 in-flight IOs, and the default ublk
nr_queues is 1.
Another thing I mentioned is that ublk has one extra IO data copy, which
usually slows IO down when the IO size is > 64K.
>
> Also, I do not understand why the sequential write workload is so much slower
> with ublk. That baffles me and I have no explanation.
Yes, it isn't expected, and I will look into it with your setting.
>
> With a faster PCIe gen4 16-core AMD Epyc-2 machine, I get:
>
> Single queue:
> =============
> +----------------------------+-------------------+-------------------+
> |                            | ublk (IOPS / BW)  | zloop (IOPS / BW) |
> +----------------------------+-------------------+-------------------+
> | QD=1, 4K rnd wr, 1 job | 6824 / 28.0 MB/s | 7320 / 30.0 MB/s |
> | QD=32, 4K rnd wr, 8 jobs | 50.9k / 208 MB/s | 41.7k / 171 MB/s |
> | QD=32, 128K rnd wr, 1 job | 15.6k / 2046 MB/s | 18.5k / 2430 MB/s |
> | QD=32, 128K seq wr, 1 job | 6237 / 818 MB/s | 22.5k / 2943 MB/s |
> +----------------------------+-------------------+-------------------+
>
> 8 queues:
> =========
> +----------------------------+-------------------+-------------------+
> |                            | ublk (IOPS / BW)  | zloop (IOPS / BW) |
> +----------------------------+-------------------+-------------------+
> | QD=1, 4K rnd wr, 1 job | 6884 / 28.2 MB/s | 7707 / 31.6 MB/s |
> | QD=32, 4K rnd wr, 8 jobs | 39.4k / 161 MB/s | 46.8k / 192 MB/s |
> | QD=32, 128K rnd wr, 1 job | 12.2k / 1597 MB/s | 18.8k / 2460 MB/s |
> | QD=32, 128K seq wr, 1 job | 6391 / 799 MB/s | 21.4k / 2802 MB/s |
> +----------------------------+-------------------+-------------------+
>
> The same pattern repeats again: ublk with 8 queues is slower, but it is faster
> with a single queue for the qd=32 4K random write with 8 jobs case.
Probably it is caused by good io_uring batch handling, and I guess you
may get a good result with 'QD=32 4K 1 job' too.
>
> > Simplicity needs to be observed from multiple dimensions; 300 vs. 1500 LoC
> > has shown something already, IMO.
>
> Sure. But given the very complicated syntax of rust, a lower LoC count for
> rust compared to C is very subjective in my opinion.
>
> I said "simplicity" in the context of the driver use. And rublk is not as
> simple to use as zloop, as it needs rust/cargo installed, which is not an
> acceptable dependency for xfstests. Furthermore, it is very annoying to have to
xfstests just needs the user to pass the zoned block device, so the same
test can cover any zoned device.
I don't understand why you have to add the zoned device emulation code into
the xfstests test scripts and introduce a device dependency into an
upper-level FS test; that sounds like a layer violation.
I guess you may have missed the point; it actually isn't related to Rust.
The 300 Rust LoC do exactly what the feature needs to do: parsing
parameters and handling file IO. However, most of the 1.5K LoC is like
forking something that already exists in the kernel: it re-implements the
loop driver's dio IO and the zone state machine, ...
> change the nofile ulimit to allow rublk to open all the zone files for a large
> disk (zloop does not need that).
That is easy to deal with; rublk actually supports a max_zones limit,
which zloop does not.
Thanks,
Ming
>
> --
> Damien Le Moal
> Western Digital Research
> From 8f1e318c4e4fe05814c3e8cb13992c802a365458 Mon Sep 17 00:00:00 2001
> From: Damien Le Moal <dlemoal@kernel.org>
> Date: Fri, 20 Dec 2024 07:59:07 +0000
> Subject: [PATCH] block: new zoned loop block device driver
>
> The zoned loop block device driver allows a user to create emulated
> zoned block devices using one regular file per zone as backing storage.
> Compared to null_blk or scsi_debug, it has the advantage of allowing
> emulating large zoned devices without requiring the same amount of
> memory as the capacity of the emulated device. Furthermore, zoned
> devices emulated with this driver can be re-started after a host reboot
> without any loss of the state of the device zones, which is something
> that null_blk and scsi_debug do not support.
>
> This initial implementation is simple and does not support zone resource
> limits. That is, a zoned loop block device limits for the maximum number
> of open zones and maximum number of active zones is always 0.
>
> This driver can be either compiled in-kernel or as a module, named
> "zloop". Compilation of this driver depends on the block layer support
> for zoned block device (CONFIG_BLK_DEV_ZONED must be set).
>
> Using the zloop driver to create and delete zoned block devices is
> done by writing commands to the zoned loop control character device file
> (/dev/zloop-control). Creating a device is done with:
>
> $ echo "add [options]" > /dev/zloop-control
>
> The options available for the "add" operation can be listed by reading
> the zloop-control device file:
>
> $ cat /dev/zloop-control
> add id=%d,capacity_mb=%u,zone_size_mb=%u,zone_capacity_mb=%u,conv_zones=%u,base_dir=%s,nr_queues=%u,queue_depth=%u
> remove id=%d
>
> The options available allow controlling the zoned device total
> capacity, zone size, zone capacity of sequential zones, total number
> of conventional zones, base directory for the zones backing file, number
> of I/O queues and the maximum queue depth of I/O queues.
>
> Deleting a device is done using the "remove" command:
>
> $ echo "remove id=0" > /dev/zloop-control
>
> This implementation passes various tests using zonefs and fio (t/zbd
> tests) and provides a state machine for zone conditions that is
> compliant with the T10 ZBC and NVMe ZNS specifications.
>
> Co-developed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> MAINTAINERS | 7 +
> drivers/block/Kconfig | 16 +
> drivers/block/Makefile | 1 +
> drivers/block/zloop.c | 1338 ++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 1362 insertions(+)
> create mode 100644 drivers/block/zloop.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 936e80f2c9ce..96325f487b3e 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -26152,6 +26152,13 @@ L: linux-kernel@vger.kernel.org
> S: Maintained
> F: arch/x86/kernel/cpu/zhaoxin.c
>
> +ZONED LOOP DEVICE
> +M: Damien Le Moal <dlemoal@kernel.org>
> +R: Christoph Hellwig <hch@lst.de>
> +L: linux-block@vger.kernel.org
> +S: Maintained
> +F: drivers/block/zloop.c
> +
> ZONEFS FILESYSTEM
> M: Damien Le Moal <dlemoal@kernel.org>
> M: Naohiro Aota <naohiro.aota@wdc.com>
> diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
> index a97f2c40c640..abdbe5d49026 100644
> --- a/drivers/block/Kconfig
> +++ b/drivers/block/Kconfig
> @@ -413,4 +413,20 @@ config BLKDEV_UBLK_LEGACY_OPCODES
>
> source "drivers/block/rnbd/Kconfig"
>
> +config BLK_DEV_ZONED_LOOP
> + tristate "Zoned loopback device support"
> + depends on BLK_DEV_ZONED
> + help
> +	  Saying Y here will allow you to create a zoned block device using
> +	  regular files for zones (one file per zone). This is useful to test
> + file systems, device mapper and applications that support zoned block
> + devices. To create a zoned loop device, no user utility is needed, a
> + zoned loop device can be created (or re-started) using a command
> + like:
> +
> + echo "add id=0,zone_size_mb=256,capacity_mb=16384,conv_zones=11" > \
> + /dev/zloop-control
> +
> + If unsure, say N.
> +
> endif # BLK_DEV
> diff --git a/drivers/block/Makefile b/drivers/block/Makefile
> index 1105a2d4fdcb..097707aca725 100644
> --- a/drivers/block/Makefile
> +++ b/drivers/block/Makefile
> @@ -41,5 +41,6 @@ obj-$(CONFIG_BLK_DEV_RNBD) += rnbd/
> obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk/
>
> obj-$(CONFIG_BLK_DEV_UBLK) += ublk_drv.o
> +obj-$(CONFIG_BLK_DEV_ZONED_LOOP) += zloop.o
>
> swim_mod-y := swim.o swim_asm.o
> diff --git a/drivers/block/zloop.c b/drivers/block/zloop.c
> new file mode 100644
> index 000000000000..b2341bb13528
> --- /dev/null
> +++ b/drivers/block/zloop.c
> @@ -0,0 +1,1338 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2025, Christoph Hellwig.
> + * Copyright (c) 2025, Western Digital Corporation or its affiliates.
> + *
> + * Zoned Loop Device driver - exports a zoned block device using one file per
> + * zone as backing storage.
> + */
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/module.h>
> +#include <linux/blk-mq.h>
> +#include <linux/blkzoned.h>
> +#include <linux/pagemap.h>
> +#include <linux/miscdevice.h>
> +#include <linux/falloc.h>
> +#include <linux/mutex.h>
> +#include <linux/parser.h>
> +#include <linux/seq_file.h>
> +
> +/*
> + * Options for adding (and removing) a device.
> + */
> +enum {
> + ZLOOP_OPT_ERR = 0,
> + ZLOOP_OPT_ID = (1 << 0),
> + ZLOOP_OPT_CAPACITY = (1 << 1),
> + ZLOOP_OPT_ZONE_SIZE = (1 << 2),
> + ZLOOP_OPT_ZONE_CAPACITY = (1 << 3),
> + ZLOOP_OPT_NR_CONV_ZONES = (1 << 4),
> + ZLOOP_OPT_BASE_DIR = (1 << 5),
> + ZLOOP_OPT_NR_QUEUES = (1 << 6),
> + ZLOOP_OPT_QUEUE_DEPTH = (1 << 7),
> + ZLOOP_OPT_BUFFERED_IO = (1 << 8),
> +};
> +
> +static const match_table_t zloop_opt_tokens = {
> + { ZLOOP_OPT_ID, "id=%d" },
> + { ZLOOP_OPT_CAPACITY, "capacity_mb=%u" },
> + { ZLOOP_OPT_ZONE_SIZE, "zone_size_mb=%u" },
> + { ZLOOP_OPT_ZONE_CAPACITY, "zone_capacity_mb=%u" },
> + { ZLOOP_OPT_NR_CONV_ZONES, "conv_zones=%u" },
> + { ZLOOP_OPT_BASE_DIR, "base_dir=%s" },
> + { ZLOOP_OPT_NR_QUEUES, "nr_queues=%u" },
> + { ZLOOP_OPT_QUEUE_DEPTH, "queue_depth=%u" },
> + { ZLOOP_OPT_BUFFERED_IO, "buffered_io" },
> + { ZLOOP_OPT_ERR, NULL }
> +};
> +
> +/* Default values for the "add" operation. */
> +#define ZLOOP_DEF_ID -1
> +#define ZLOOP_DEF_ZONE_SIZE ((256ULL * SZ_1M) >> SECTOR_SHIFT)
> +#define ZLOOP_DEF_NR_ZONES 64
> +#define ZLOOP_DEF_NR_CONV_ZONES 8
> +#define ZLOOP_DEF_BASE_DIR "/var/local/zloop"
> +#define ZLOOP_DEF_NR_QUEUES 1
> +#define ZLOOP_DEF_QUEUE_DEPTH 128
> +#define ZLOOP_DEF_BUFFERED_IO false
> +
> +/* Arbitrary limit on the zone size (16GB). */
> +#define ZLOOP_MAX_ZONE_SIZE_MB 16384
> +
> +struct zloop_options {
> + unsigned int mask;
> + int id;
> + sector_t capacity;
> + sector_t zone_size;
> + sector_t zone_capacity;
> + unsigned int nr_conv_zones;
> + char *base_dir;
> + unsigned int nr_queues;
> + unsigned int queue_depth;
> + bool buffered_io;
> +};
> +
> +/*
> + * Device states.
> + */
> +enum {
> + Zlo_creating = 0,
> + Zlo_live,
> + Zlo_deleting,
> +};
> +
> +enum zloop_zone_flags {
> + ZLOOP_ZONE_CONV = 0,
> + ZLOOP_ZONE_SEQ_ERROR,
> +};
> +
> +struct zloop_zone {
> + struct file *file;
> +
> + unsigned long flags;
> + struct mutex lock;
> + enum blk_zone_cond cond;
> + sector_t start;
> + sector_t wp;
> +
> + gfp_t old_gfp_mask;
> +};
> +
> +struct zloop_device {
> + unsigned int id;
> + unsigned int state;
> +
> + struct blk_mq_tag_set tag_set;
> + struct gendisk *disk;
> +
> + struct workqueue_struct *workqueue;
> + bool buffered_io;
> +
> + const char *base_dir;
> + struct file *data_dir;
> +
> + unsigned int zone_shift;
> + sector_t zone_size;
> + sector_t zone_capacity;
> + unsigned int nr_zones;
> + unsigned int nr_conv_zones;
> +
> + struct zloop_zone zones[] __counted_by(nr_zones);
> +};
> +
> +struct zloop_cmd {
> + struct work_struct work;
> + atomic_t ref;
> + sector_t sector;
> + sector_t nr_sectors;
> + long ret;
> + struct kiocb iocb;
> + struct bio_vec *bvec;
> +};
> +
> +static DEFINE_IDR(zloop_index_idr);
> +static DEFINE_MUTEX(zloop_ctl_mutex);
> +
> +static unsigned int rq_zone_no(struct request *rq)
> +{
> + struct zloop_device *zlo = rq->q->queuedata;
> +
> + return blk_rq_pos(rq) >> zlo->zone_shift;
> +}
> +
> +static int zloop_update_seq_zone(struct zloop_device *zlo, unsigned int zone_no)
> +{
> + struct zloop_zone *zone = &zlo->zones[zone_no];
> + struct kstat stat;
> + sector_t file_sectors;
> + int ret;
> +
> + lockdep_assert_held(&zone->lock);
> +
> + ret = vfs_getattr(&zone->file->f_path, &stat, STATX_SIZE, 0);
> + if (ret < 0) {
> + pr_err("Failed to get zone %u file stat (err=%d)\n",
> + zone_no, ret);
> + set_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags);
> + return ret;
> + }
> +
> + file_sectors = stat.size >> SECTOR_SHIFT;
> + if (file_sectors > zlo->zone_capacity) {
> + pr_err("Zone %u file too large (%llu sectors > %llu)\n",
> + zone_no, file_sectors, zlo->zone_capacity);
> + return -EINVAL;
> + }
> +
> + if (!file_sectors) {
> + zone->cond = BLK_ZONE_COND_EMPTY;
> + zone->wp = zone->start;
> + } else if (file_sectors == zlo->zone_capacity) {
> + zone->cond = BLK_ZONE_COND_FULL;
> + zone->wp = zone->start + zlo->zone_size;
> + } else {
> + zone->cond = BLK_ZONE_COND_CLOSED;
> + zone->wp = zone->start + file_sectors;
> + }
> +
> + return 0;
> +}
> +
> +static int zloop_open_zone(struct zloop_device *zlo, unsigned int zone_no)
> +{
> + struct zloop_zone *zone = &zlo->zones[zone_no];
> + int ret = 0;
> +
> + if (test_bit(ZLOOP_ZONE_CONV, &zone->flags))
> + return -EIO;
> +
> + mutex_lock(&zone->lock);
> +
> + if (test_and_clear_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags)) {
> + ret = zloop_update_seq_zone(zlo, zone_no);
> + if (ret)
> + goto unlock;
> + }
> +
> + switch (zone->cond) {
> + case BLK_ZONE_COND_EXP_OPEN:
> + break;
> + case BLK_ZONE_COND_EMPTY:
> + case BLK_ZONE_COND_CLOSED:
> + case BLK_ZONE_COND_IMP_OPEN:
> + zone->cond = BLK_ZONE_COND_EXP_OPEN;
> + break;
> + case BLK_ZONE_COND_FULL:
> + default:
> + ret = -EIO;
> + break;
> + }
> +
> +unlock:
> + mutex_unlock(&zone->lock);
> +
> + return ret;
> +}
> +
> +static int zloop_close_zone(struct zloop_device *zlo, unsigned int zone_no)
> +{
> + struct zloop_zone *zone = &zlo->zones[zone_no];
> + int ret = 0;
> +
> + if (test_bit(ZLOOP_ZONE_CONV, &zone->flags))
> + return -EIO;
> +
> + mutex_lock(&zone->lock);
> +
> + if (test_and_clear_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags)) {
> + ret = zloop_update_seq_zone(zlo, zone_no);
> + if (ret)
> + goto unlock;
> + }
> +
> + switch (zone->cond) {
> + case BLK_ZONE_COND_CLOSED:
> + break;
> + case BLK_ZONE_COND_IMP_OPEN:
> + case BLK_ZONE_COND_EXP_OPEN:
> + if (zone->wp == zone->start)
> + zone->cond = BLK_ZONE_COND_EMPTY;
> + else
> + zone->cond = BLK_ZONE_COND_CLOSED;
> + break;
> + case BLK_ZONE_COND_EMPTY:
> + case BLK_ZONE_COND_FULL:
> + default:
> + ret = -EIO;
> + break;
> + }
> +
> +unlock:
> + mutex_unlock(&zone->lock);
> +
> + return ret;
> +}
> +
> +static int zloop_reset_zone(struct zloop_device *zlo, unsigned int zone_no)
> +{
> + struct zloop_zone *zone = &zlo->zones[zone_no];
> + int ret = 0;
> +
> + if (test_bit(ZLOOP_ZONE_CONV, &zone->flags))
> + return -EIO;
> +
> + mutex_lock(&zone->lock);
> +
> + if (!test_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags) &&
> + zone->cond == BLK_ZONE_COND_EMPTY)
> + goto unlock;
> +
> + if (vfs_truncate(&zone->file->f_path, 0)) {
> + set_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags);
> + ret = -EIO;
> + goto unlock;
> + }
> +
> + zone->cond = BLK_ZONE_COND_EMPTY;
> + zone->wp = zone->start;
> + clear_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags);
> +
> +unlock:
> + mutex_unlock(&zone->lock);
> +
> + return ret;
> +}
> +
> +static int zloop_reset_all_zones(struct zloop_device *zlo)
> +{
> + unsigned int i;
> + int ret;
> +
> + for (i = zlo->nr_conv_zones; i < zlo->nr_zones; i++) {
> + ret = zloop_reset_zone(zlo, i);
> + if (ret)
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> +static int zloop_finish_zone(struct zloop_device *zlo, unsigned int zone_no)
> +{
> + struct zloop_zone *zone = &zlo->zones[zone_no];
> + int ret = 0;
> +
> + if (test_bit(ZLOOP_ZONE_CONV, &zone->flags))
> + return -EIO;
> +
> + mutex_lock(&zone->lock);
> +
> + if (!test_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags) &&
> + zone->cond == BLK_ZONE_COND_FULL)
> + goto unlock;
> +
> + if (vfs_truncate(&zone->file->f_path, zlo->zone_size << SECTOR_SHIFT)) {
> + set_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags);
> + ret = -EIO;
> + goto unlock;
> + }
> +
> + zone->cond = BLK_ZONE_COND_FULL;
> + zone->wp = zone->start + zlo->zone_size;
> + clear_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags);
> +
> +unlock:
> + mutex_unlock(&zone->lock);
> +
> + return ret;
> +}
> +
> +static void zloop_put_cmd(struct zloop_cmd *cmd)
> +{
> + struct request *rq = blk_mq_rq_from_pdu(cmd);
> +
> + if (!atomic_dec_and_test(&cmd->ref))
> + return;
> + kfree(cmd->bvec);
> + cmd->bvec = NULL;
> + if (likely(!blk_should_fake_timeout(rq->q)))
> + blk_mq_complete_request(rq);
> +}
> +
> +static void zloop_rw_complete(struct kiocb *iocb, long ret)
> +{
> + struct zloop_cmd *cmd = container_of(iocb, struct zloop_cmd, iocb);
> +
> + cmd->ret = ret;
> + zloop_put_cmd(cmd);
> +}
> +
> +static void zloop_rw(struct zloop_cmd *cmd)
> +{
> + struct request *rq = blk_mq_rq_from_pdu(cmd);
> + struct zloop_device *zlo = rq->q->queuedata;
> + unsigned int zone_no = rq_zone_no(rq);
> + sector_t sector = blk_rq_pos(rq);
> + sector_t nr_sectors = blk_rq_sectors(rq);
> + bool is_append = req_op(rq) == REQ_OP_ZONE_APPEND;
> + bool is_write = req_op(rq) == REQ_OP_WRITE || is_append;
> + int rw = is_write ? ITER_SOURCE : ITER_DEST;
> + struct req_iterator rq_iter;
> + struct zloop_zone *zone;
> + struct iov_iter iter;
> + struct bio_vec tmp;
> + sector_t zone_end;
> + int nr_bvec = 0;
> + int ret;
> +
> + atomic_set(&cmd->ref, 2);
> + cmd->sector = sector;
> + cmd->nr_sectors = nr_sectors;
> + cmd->ret = 0;
> +
> + /* We should never get an I/O beyond the device capacity. */
> + if (WARN_ON_ONCE(zone_no >= zlo->nr_zones)) {
> + ret = -EIO;
> + goto out;
> + }
> + zone = &zlo->zones[zone_no];
> + zone_end = zone->start + zlo->zone_capacity;
> +
> + /*
> + * The block layer should never send requests that are not fully
> + * contained within the zone.
> + */
> + if (WARN_ON_ONCE(sector + nr_sectors > zone->start + zlo->zone_size)) {
> + ret = -EIO;
> + goto out;
> + }
> +
> + if (test_and_clear_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags)) {
> + mutex_lock(&zone->lock);
> + ret = zloop_update_seq_zone(zlo, zone_no);
> + mutex_unlock(&zone->lock);
> + if (ret)
> + goto out;
> + }
> +
> + if (!test_bit(ZLOOP_ZONE_CONV, &zone->flags) && is_write) {
> + mutex_lock(&zone->lock);
> +
> + if (is_append) {
> + sector = zone->wp;
> + cmd->sector = sector;
> + }
> +
> + /*
> + * Write operations must be aligned to the write pointer and
> + * fully contained within the zone capacity.
> + */
> + if (sector != zone->wp || zone->wp + nr_sectors > zone_end) {
> + pr_err("Zone %u: unaligned write: sect %llu, wp %llu\n",
> + zone_no, sector, zone->wp);
> + ret = -EIO;
> + goto unlock;
> + }
> +
> + /* Implicitly open the target zone. */
> + if (zone->cond == BLK_ZONE_COND_CLOSED ||
> + zone->cond == BLK_ZONE_COND_EMPTY)
> + zone->cond = BLK_ZONE_COND_IMP_OPEN;
> +
> + /*
> + * Advance the write pointer of sequential zones. If the write
> + * fails, the wp position will be corrected when the next I/O
> +	 * completes.
> + */
> + zone->wp += nr_sectors;
> + if (zone->wp == zone_end)
> + zone->cond = BLK_ZONE_COND_FULL;
> + }
> +
> + rq_for_each_bvec(tmp, rq, rq_iter)
> + nr_bvec++;
> +
> + if (rq->bio != rq->biotail) {
> + struct bio_vec *bvec;
> +
> + cmd->bvec = kmalloc_array(nr_bvec, sizeof(*cmd->bvec), GFP_NOIO);
> + if (!cmd->bvec) {
> + ret = -EIO;
> + goto unlock;
> + }
> +
> + /*
> + * The bios of the request may be started from the middle of
> + * the 'bvec' because of bio splitting, so we can't directly
> + * copy bio->bi_iov_vec to new bvec. The rq_for_each_bvec
> + * API will take care of all details for us.
> + */
> + bvec = cmd->bvec;
> + rq_for_each_bvec(tmp, rq, rq_iter) {
> + *bvec = tmp;
> + bvec++;
> + }
> + iov_iter_bvec(&iter, rw, cmd->bvec, nr_bvec, blk_rq_bytes(rq));
> + } else {
> + /*
> + * Same here, this bio may be started from the middle of the
> + * 'bvec' because of bio splitting, so offset from the bvec
> + * must be passed to iov iterator
> + */
> + iov_iter_bvec(&iter, rw,
> + __bvec_iter_bvec(rq->bio->bi_io_vec, rq->bio->bi_iter),
> + nr_bvec, blk_rq_bytes(rq));
> + iter.iov_offset = rq->bio->bi_iter.bi_bvec_done;
> + }
> +
> + cmd->iocb.ki_pos = (sector - zone->start) << SECTOR_SHIFT;
> + cmd->iocb.ki_filp = zone->file;
> + cmd->iocb.ki_complete = zloop_rw_complete;
> + if (!zlo->buffered_io)
> + cmd->iocb.ki_flags = IOCB_DIRECT;
> + cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
> +
> + if (rw == ITER_SOURCE)
> + ret = zone->file->f_op->write_iter(&cmd->iocb, &iter);
> + else
> + ret = zone->file->f_op->read_iter(&cmd->iocb, &iter);
> +unlock:
> + if (!test_bit(ZLOOP_ZONE_CONV, &zone->flags) && is_write)
> + mutex_unlock(&zone->lock);
> +out:
> + if (ret != -EIOCBQUEUED)
> + zloop_rw_complete(&cmd->iocb, ret);
> + zloop_put_cmd(cmd);
> +}
> +
> +static void zloop_handle_cmd(struct zloop_cmd *cmd)
> +{
> + struct request *rq = blk_mq_rq_from_pdu(cmd);
> + struct zloop_device *zlo = rq->q->queuedata;
> +
> + switch (req_op(rq)) {
> + case REQ_OP_READ:
> + case REQ_OP_WRITE:
> + case REQ_OP_ZONE_APPEND:
> + /*
> + * zloop_rw() always executes asynchronously or completes
> + * directly.
> + */
> + zloop_rw(cmd);
> + return;
> + case REQ_OP_FLUSH:
> + /*
> + * Sync the entire FS containing the zone files instead of
> + * walking all files
> + */
> + cmd->ret = sync_filesystem(file_inode(zlo->data_dir)->i_sb);
> + break;
> + case REQ_OP_ZONE_RESET:
> + cmd->ret = zloop_reset_zone(zlo, rq_zone_no(rq));
> + break;
> + case REQ_OP_ZONE_RESET_ALL:
> + cmd->ret = zloop_reset_all_zones(zlo);
> + break;
> + case REQ_OP_ZONE_FINISH:
> + cmd->ret = zloop_finish_zone(zlo, rq_zone_no(rq));
> + break;
> + case REQ_OP_ZONE_OPEN:
> + cmd->ret = zloop_open_zone(zlo, rq_zone_no(rq));
> + break;
> + case REQ_OP_ZONE_CLOSE:
> + cmd->ret = zloop_close_zone(zlo, rq_zone_no(rq));
> + break;
> + default:
> + WARN_ON_ONCE(1);
> + pr_err("Unsupported operation %d\n", req_op(rq));
> + cmd->ret = -EOPNOTSUPP;
> + break;
> + }
> +
> + blk_mq_complete_request(rq);
> +}
> +
> +static void zloop_cmd_workfn(struct work_struct *work)
> +{
> + struct zloop_cmd *cmd = container_of(work, struct zloop_cmd, work);
> + int orig_flags = current->flags;
> +
> + current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
> + zloop_handle_cmd(cmd);
> + current->flags = orig_flags;
> +}
> +
> +static void zloop_complete_rq(struct request *rq)
> +{
> + struct zloop_cmd *cmd = blk_mq_rq_to_pdu(rq);
> + struct zloop_device *zlo = rq->q->queuedata;
> + unsigned int zone_no = cmd->sector >> zlo->zone_shift;
> + struct zloop_zone *zone = &zlo->zones[zone_no];
> + blk_status_t sts = BLK_STS_OK;
> +
> + switch (req_op(rq)) {
> + case REQ_OP_READ:
> + if (cmd->ret < 0)
> + pr_err("Zone %u: failed read sector %llu, %llu sectors\n",
> + zone_no, cmd->sector, cmd->nr_sectors);
> +
> + if (cmd->ret >= 0 && cmd->ret != blk_rq_bytes(rq)) {
> + /* short read */
> + struct bio *bio;
> +
> + __rq_for_each_bio(bio, rq)
> + zero_fill_bio(bio);
> + }
> + break;
> + case REQ_OP_WRITE:
> + case REQ_OP_ZONE_APPEND:
> + if (cmd->ret < 0)
> + pr_err("Zone %u: failed %swrite sector %llu, %llu sectors\n",
> + zone_no,
> + req_op(rq) == REQ_OP_WRITE ? "" : "append ",
> + cmd->sector, cmd->nr_sectors);
> +
> + if (cmd->ret >= 0 && cmd->ret != blk_rq_bytes(rq)) {
> + pr_err("Zone %u: partial write %ld/%u B\n",
> + zone_no, cmd->ret, blk_rq_bytes(rq));
> + cmd->ret = -EIO;
> + }
> +
> + if (cmd->ret < 0 && !test_bit(ZLOOP_ZONE_CONV, &zone->flags)) {
> + /*
> + * A write to a sequential zone file failed: mark the
> + * zone as having an error. This will be corrected and
> + * cleared when the next IO is submitted.
> + */
> + set_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags);
> + break;
> + }
> + if (req_op(rq) == REQ_OP_ZONE_APPEND)
> + rq->__sector = cmd->sector;
> +
> + break;
> + default:
> + break;
> + }
> +
> + if (cmd->ret < 0)
> + sts = errno_to_blk_status(cmd->ret);
> + blk_mq_end_request(rq, sts);
> +}
> +
> +static blk_status_t zloop_queue_rq(struct blk_mq_hw_ctx *hctx,
> + const struct blk_mq_queue_data *bd)
> +{
> + struct request *rq = bd->rq;
> + struct zloop_cmd *cmd = blk_mq_rq_to_pdu(rq);
> + struct zloop_device *zlo = rq->q->queuedata;
> +
> + if (zlo->state == Zlo_deleting)
> + return BLK_STS_IOERR;
> +
> + blk_mq_start_request(rq);
> +
> + INIT_WORK(&cmd->work, zloop_cmd_workfn);
> + queue_work(zlo->workqueue, &cmd->work);
> +
> + return BLK_STS_OK;
> +}
> +
> +static const struct blk_mq_ops zloop_mq_ops = {
> + .queue_rq = zloop_queue_rq,
> + .complete = zloop_complete_rq,
> +};
> +
> +static int zloop_open(struct gendisk *disk, blk_mode_t mode)
> +{
> + struct zloop_device *zlo = disk->private_data;
> + int ret;
> +
> + ret = mutex_lock_killable(&zloop_ctl_mutex);
> + if (ret)
> + return ret;
> +
> + if (zlo->state != Zlo_live)
> + ret = -ENXIO;
> + mutex_unlock(&zloop_ctl_mutex);
> + return ret;
> +}
> +
> +static int zloop_report_zones(struct gendisk *disk, sector_t sector,
> + unsigned int nr_zones, report_zones_cb cb, void *data)
> +{
> + struct zloop_device *zlo = disk->private_data;
> + struct blk_zone blkz = {};
> + unsigned int first, i;
> + int ret;
> +
> + first = disk_zone_no(disk, sector);
> + if (first >= zlo->nr_zones)
> + return 0;
> + nr_zones = min(nr_zones, zlo->nr_zones - first);
> +
> + for (i = 0; i < nr_zones; i++) {
> + unsigned int zone_no = first + i;
> + struct zloop_zone *zone = &zlo->zones[zone_no];
> +
> + mutex_lock(&zone->lock);
> +
> + if (test_and_clear_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags)) {
> + ret = zloop_update_seq_zone(zlo, zone_no);
> + if (ret) {
> + mutex_unlock(&zone->lock);
> + return ret;
> + }
> + }
> +
> + blkz.start = zone->start;
> + blkz.len = zlo->zone_size;
> + blkz.wp = zone->wp;
> + blkz.cond = zone->cond;
> + if (test_bit(ZLOOP_ZONE_CONV, &zone->flags)) {
> + blkz.type = BLK_ZONE_TYPE_CONVENTIONAL;
> + blkz.capacity = zlo->zone_size;
> + } else {
> + blkz.type = BLK_ZONE_TYPE_SEQWRITE_REQ;
> + blkz.capacity = zlo->zone_capacity;
> + }
> +
> + mutex_unlock(&zone->lock);
> +
> + ret = cb(&blkz, i, data);
> + if (ret)
> + return ret;
> + }
> +
> + return nr_zones;
> +}
> +
> +static void zloop_free_disk(struct gendisk *disk)
> +{
> + struct zloop_device *zlo = disk->private_data;
> + unsigned int i;
> +
> + for (i = 0; i < zlo->nr_zones; i++) {
> + struct zloop_zone *zone = &zlo->zones[i];
> +
> + mapping_set_gfp_mask(zone->file->f_mapping,
> + zone->old_gfp_mask);
> + fput(zone->file);
> + }
> +
> + fput(zlo->data_dir);
> + destroy_workqueue(zlo->workqueue);
> + kfree(zlo->base_dir);
> + kvfree(zlo);
> +}
> +
> +static const struct block_device_operations zloop_fops = {
> + .owner = THIS_MODULE,
> + .open = zloop_open,
> + .report_zones = zloop_report_zones,
> + .free_disk = zloop_free_disk,
> +};
> +
> +__printf(3, 4)
> +static struct file *zloop_filp_open_fmt(int oflags, umode_t mode,
> + const char *fmt, ...)
> +{
> + struct file *file;
> + va_list ap;
> + char *p;
> +
> + va_start(ap, fmt);
> + p = kvasprintf(GFP_KERNEL, fmt, ap);
> + va_end(ap);
> +
> + if (!p)
> + return ERR_PTR(-ENOMEM);
> + file = filp_open(p, oflags, mode);
> + kfree(p);
> + return file;
> +}
> +
> +static int zloop_init_zone(struct zloop_device *zlo, unsigned int zone_no,
> + struct zloop_options *opts, bool restore)
> +{
> + struct zloop_zone *zone = &zlo->zones[zone_no];
> + int oflags = O_RDWR;
> + struct kstat stat;
> + sector_t file_sectors;
> + int ret;
> +
> + mutex_init(&zone->lock);
> + zone->start = (sector_t)zone_no << zlo->zone_shift;
> +
> + if (!restore)
> + oflags |= O_CREAT;
> +
> + if (!opts->buffered_io)
> + oflags |= O_DIRECT;
> +
> + if (zone_no < zlo->nr_conv_zones) {
> + /* Conventional zone file. */
> + set_bit(ZLOOP_ZONE_CONV, &zone->flags);
> + zone->cond = BLK_ZONE_COND_NOT_WP;
> + zone->wp = U64_MAX;
> +
> + zone->file = zloop_filp_open_fmt(oflags, 0600, "%s/%u/cnv-%06u",
> + zlo->base_dir, zlo->id, zone_no);
> + if (IS_ERR(zone->file)) {
> +			pr_err("Failed to open zone %u file %s/%u/cnv-%06u (err=%ld)\n",
> + zone_no, zlo->base_dir, zlo->id, zone_no,
> + PTR_ERR(zone->file));
> + return PTR_ERR(zone->file);
> + }
> +
> + ret = vfs_getattr(&zone->file->f_path, &stat, STATX_SIZE, 0);
> + if (ret < 0) {
> + pr_err("Failed to get zone %u file stat\n", zone_no);
> + return ret;
> + }
> + file_sectors = stat.size >> SECTOR_SHIFT;
> +
> + if (restore && file_sectors != zlo->zone_size) {
> + pr_err("Invalid conventional zone %u file size (%llu sectors != %llu)\n",
> +				zone_no, file_sectors, zlo->zone_size);
> +			return -EINVAL;
> + }
> +
> + ret = vfs_truncate(&zone->file->f_path,
> + zlo->zone_size << SECTOR_SHIFT);
> + if (ret < 0) {
> + pr_err("Failed to truncate zone %u file (err=%d)\n",
> + zone_no, ret);
> + return ret;
> + }
> +
> + return 0;
> + }
> +
> + /* Sequential zone file. */
> + zone->file = zloop_filp_open_fmt(oflags, 0600, "%s/%u/seq-%06u",
> + zlo->base_dir, zlo->id, zone_no);
> + if (IS_ERR(zone->file)) {
> +		pr_err("Failed to open zone %u file %s/%u/seq-%06u (err=%ld)\n",
> + zone_no, zlo->base_dir, zlo->id, zone_no,
> + PTR_ERR(zone->file));
> + return PTR_ERR(zone->file);
> + }
> +
> + mutex_lock(&zone->lock);
> + ret = zloop_update_seq_zone(zlo, zone_no);
> + mutex_unlock(&zone->lock);
> +
> + return ret;
> +}
> +
> +static bool zloop_dev_exists(struct zloop_device *zlo)
> +{
> + struct file *cnv, *seq;
> + bool exists;
> +
> + cnv = zloop_filp_open_fmt(O_RDONLY, 0600, "%s/%u/cnv-%06u",
> + zlo->base_dir, zlo->id, 0);
> + seq = zloop_filp_open_fmt(O_RDONLY, 0600, "%s/%u/seq-%06u",
> + zlo->base_dir, zlo->id, 0);
> + exists = !IS_ERR(cnv) || !IS_ERR(seq);
> +
> + if (!IS_ERR(cnv))
> + fput(cnv);
> + if (!IS_ERR(seq))
> + fput(seq);
> +
> + return exists;
> +}
> +
> +static int zloop_ctl_add(struct zloop_options *opts)
> +{
> + struct queue_limits lim = {
> + .max_hw_sectors = SZ_1M >> SECTOR_SHIFT,
> + .max_hw_zone_append_sectors = SZ_1M >> SECTOR_SHIFT,
> + .chunk_sectors = opts->zone_size,
> + .features = BLK_FEAT_ZONED,
> + };
> + unsigned int nr_zones, i, j;
> + struct zloop_device *zlo;
> + int block_size;
> + int ret = -EINVAL;
> + bool restore;
> +
> + __module_get(THIS_MODULE);
> +
> + nr_zones = opts->capacity >> ilog2(opts->zone_size);
> + if (opts->nr_conv_zones >= nr_zones) {
> + pr_err("Invalid number of conventional zones %u\n",
> + opts->nr_conv_zones);
> + goto out;
> + }
> +
> + zlo = kvzalloc(struct_size(zlo, zones, nr_zones), GFP_KERNEL);
> + if (!zlo) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + zlo->state = Zlo_creating;
> +
> + ret = mutex_lock_killable(&zloop_ctl_mutex);
> + if (ret)
> + goto out_free_dev;
> +
> + /* Allocate id, if @opts->id >= 0, we're requesting that specific id */
> + if (opts->id >= 0) {
> + ret = idr_alloc(&zloop_index_idr, zlo,
> + opts->id, opts->id + 1, GFP_KERNEL);
> + if (ret == -ENOSPC)
> + ret = -EEXIST;
> + } else {
> + ret = idr_alloc(&zloop_index_idr, zlo, 0, 0, GFP_KERNEL);
> + }
> + mutex_unlock(&zloop_ctl_mutex);
> + if (ret < 0)
> + goto out_free_dev;
> +
> + zlo->id = ret;
> + zlo->zone_shift = ilog2(opts->zone_size);
> + zlo->zone_size = opts->zone_size;
> + if (opts->zone_capacity)
> + zlo->zone_capacity = opts->zone_capacity;
> + else
> + zlo->zone_capacity = zlo->zone_size;
> + zlo->nr_zones = nr_zones;
> + zlo->nr_conv_zones = opts->nr_conv_zones;
> + zlo->buffered_io = opts->buffered_io;
> +
> + zlo->workqueue = alloc_workqueue("zloop%d", WQ_UNBOUND | WQ_FREEZABLE,
> + opts->nr_queues * opts->queue_depth, zlo->id);
> + if (!zlo->workqueue) {
> + ret = -ENOMEM;
> + goto out_free_idr;
> + }
> +
> + if (opts->base_dir)
> + zlo->base_dir = kstrdup(opts->base_dir, GFP_KERNEL);
> + else
> + zlo->base_dir = kstrdup(ZLOOP_DEF_BASE_DIR, GFP_KERNEL);
> + if (!zlo->base_dir) {
> + ret = -ENOMEM;
> + goto out_destroy_workqueue;
> + }
> +
> + zlo->data_dir = zloop_filp_open_fmt(O_RDONLY | O_DIRECTORY, 0, "%s/%u",
> + zlo->base_dir, zlo->id);
> + if (IS_ERR(zlo->data_dir)) {
> + ret = PTR_ERR(zlo->data_dir);
> + pr_warn("Failed to open directory %s/%u (err=%d)\n",
> + zlo->base_dir, zlo->id, ret);
> + goto out_free_base_dir;
> + }
> +
> + /* Use the FS block size as the device sector size. */
> + block_size = file_inode(zlo->data_dir)->i_sb->s_blocksize;
> +	if (block_size > SZ_4K) {
> +		pr_warn("Unsupported FS block size %d B > 4096\n",
> +			block_size);
> +		ret = -EINVAL;
> +		goto out_close_data_dir;
> +	}
> + lim.physical_block_size = block_size;
> + lim.logical_block_size = block_size;
> +
> + /*
> + * If we already have zone files, we are restoring a device created by a
> + * previous add operation. In this case, zloop_init_zone() will check
> + * that the zone files are consistent with the zone configuration given.
> + */
> + restore = zloop_dev_exists(zlo);
> + for (i = 0; i < nr_zones; i++) {
> + ret = zloop_init_zone(zlo, i, opts, restore);
> + if (ret)
> + goto out_close_files;
> + }
> +
> + zlo->tag_set.ops = &zloop_mq_ops;
> + zlo->tag_set.nr_hw_queues = opts->nr_queues;
> + zlo->tag_set.queue_depth = opts->queue_depth;
> + zlo->tag_set.numa_node = NUMA_NO_NODE;
> + zlo->tag_set.cmd_size = sizeof(struct zloop_cmd);
> + zlo->tag_set.driver_data = zlo;
> +
> + ret = blk_mq_alloc_tag_set(&zlo->tag_set);
> + if (ret) {
> + pr_err("blk_mq_alloc_tag_set failed (err=%d)\n", ret);
> + goto out_close_files;
> + }
> +
> + zlo->disk = blk_mq_alloc_disk(&zlo->tag_set, &lim, zlo);
> + if (IS_ERR(zlo->disk)) {
> + pr_err("blk_mq_alloc_disk failed (err=%d)\n", ret);
> + ret = PTR_ERR(zlo->disk);
> + goto out_cleanup_tags;
> + }
> + zlo->disk->flags = GENHD_FL_NO_PART;
> + zlo->disk->fops = &zloop_fops;
> + zlo->disk->private_data = zlo;
> + sprintf(zlo->disk->disk_name, "zloop%d", zlo->id);
> + set_capacity(zlo->disk, (u64)lim.chunk_sectors * zlo->nr_zones);
> +
> +	ret = blk_revalidate_disk_zones(zlo->disk);
> +	if (ret)
> + goto out_cleanup_disk;
> +
> + ret = add_disk(zlo->disk);
> + if (ret) {
> + pr_err("add_disk failed (err=%d)\n", ret);
> + goto out_cleanup_disk;
> + }
> +
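> +	/* Mark the device live so that it can be removed later. */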
> + mutex_lock(&zloop_ctl_mutex);
> + zlo->state = Zlo_live;
> + mutex_unlock(&zloop_ctl_mutex);
> +
> + pr_info("Added device %d\n", zlo->id);
> +
> + return 0;
> +
> +out_cleanup_disk:
> + put_disk(zlo->disk);
> +out_cleanup_tags:
> + blk_mq_free_tag_set(&zlo->tag_set);
> +out_close_files:
> + for (j = 0; j < i; j++) {
> + struct zloop_zone *zone = &zlo->zones[j];
> +
> + if (!IS_ERR_OR_NULL(zone->file))
> + fput(zone->file);
> + }
> +out_close_data_dir:
> + fput(zlo->data_dir);
> +out_free_base_dir:
> + kfree(zlo->base_dir);
> +out_destroy_workqueue:
> + destroy_workqueue(zlo->workqueue);
> +out_free_idr:
> + mutex_lock(&zloop_ctl_mutex);
> + idr_remove(&zloop_index_idr, zlo->id);
> + mutex_unlock(&zloop_ctl_mutex);
> +out_free_dev:
> + kvfree(zlo);
> +out:
> + module_put(THIS_MODULE);
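> +	/*
> +	 * A missing base or device directory gives -ENOENT: report it as
> +	 * an invalid option.
> +	 */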
> + if (ret == -ENOENT)
> + ret = -EINVAL;
> + return ret;
> +}
> +
> +static int zloop_ctl_remove(struct zloop_options *opts)
> +{
> + struct zloop_device *zlo;
> + int ret;
> +
> + if (!(opts->mask & ZLOOP_OPT_ID)) {
> + pr_err("No ID specified\n");
> + return -EINVAL;
> + }
> +
> + ret = mutex_lock_killable(&zloop_ctl_mutex);
> + if (ret)
> + return ret;
> +
> + zlo = idr_find(&zloop_index_idr, opts->id);
> + if (!zlo || zlo->state == Zlo_creating) {
> + ret = -ENODEV;
> + } else if (zlo->state == Zlo_deleting) {
> + ret = -EINVAL;
> + } else {
> + idr_remove(&zloop_index_idr, zlo->id);
> + zlo->state = Zlo_deleting;
> + }
> +
> + mutex_unlock(&zloop_ctl_mutex);
> + if (ret)
> + return ret;
> +
> + del_gendisk(zlo->disk);
> + put_disk(zlo->disk);
> + blk_mq_free_tag_set(&zlo->tag_set);
> +
> + pr_info("Removed device %d\n", opts->id);
> +
> + module_put(THIS_MODULE);
> +
> + return 0;
> +}
> +
> +static int zloop_parse_options(struct zloop_options *opts, const char *buf)
> +{
> + substring_t args[MAX_OPT_ARGS];
> + char *options, *o, *p;
> + unsigned int token;
> + int ret = 0;
> +
> + /* Set defaults. */
> + opts->mask = 0;
> + opts->id = ZLOOP_DEF_ID;
> + opts->capacity = ZLOOP_DEF_ZONE_SIZE * ZLOOP_DEF_NR_ZONES;
> + opts->zone_size = ZLOOP_DEF_ZONE_SIZE;
> + opts->nr_conv_zones = ZLOOP_DEF_NR_CONV_ZONES;
> + opts->nr_queues = ZLOOP_DEF_NR_QUEUES;
> + opts->queue_depth = ZLOOP_DEF_QUEUE_DEPTH;
> + opts->buffered_io = ZLOOP_DEF_BUFFERED_IO;
> +
> + if (!buf)
> + return 0;
> +
> + /* Skip leading spaces before the options. */
> + while (isspace(*buf))
> + buf++;
> +
> + options = o = kstrdup(buf, GFP_KERNEL);
> + if (!options)
> + return -ENOMEM;
> +
> + /* Parse the options, doing only some light invalid value checks. */
> + while ((p = strsep(&o, ",\n")) != NULL) {
> + if (!*p)
> + continue;
> +
> + token = match_token(p, zloop_opt_tokens, args);
> + opts->mask |= token;
> + switch (token) {
> + case ZLOOP_OPT_ID:
> + if (match_int(args, &opts->id)) {
> + ret = -EINVAL;
> + goto out;
> + }
> + break;
> + case ZLOOP_OPT_CAPACITY:
> + if (match_uint(args, &token)) {
> + ret = -EINVAL;
> + goto out;
> + }
> + if (!token) {
> + pr_err("Invalid capacity\n");
> + ret = -EINVAL;
> + goto out;
> + }
> + opts->capacity =
> + ((sector_t)token * SZ_1M) >> SECTOR_SHIFT;
> + break;
> + case ZLOOP_OPT_ZONE_SIZE:
> + if (match_uint(args, &token)) {
> + ret = -EINVAL;
> + goto out;
> + }
> + if (!token || token > ZLOOP_MAX_ZONE_SIZE_MB ||
> + !is_power_of_2(token)) {
> + pr_err("Invalid zone size %u\n", token);
> + ret = -EINVAL;
> + goto out;
> + }
> + opts->zone_size =
> + ((sector_t)token * SZ_1M) >> SECTOR_SHIFT;
> + break;
> + case ZLOOP_OPT_ZONE_CAPACITY:
> + if (match_uint(args, &token)) {
> + ret = -EINVAL;
> + goto out;
> + }
> + if (!token) {
> + pr_err("Invalid zone capacity\n");
> + ret = -EINVAL;
> + goto out;
> + }
> + opts->zone_capacity =
> + ((sector_t)token * SZ_1M) >> SECTOR_SHIFT;
> + break;
> + case ZLOOP_OPT_NR_CONV_ZONES:
> + if (match_uint(args, &token)) {
> + ret = -EINVAL;
> + goto out;
> + }
> + opts->nr_conv_zones = token;
> + break;
> + case ZLOOP_OPT_BASE_DIR:
> + p = match_strdup(args);
> + if (!p) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + kfree(opts->base_dir);
> + opts->base_dir = p;
> + break;
> + case ZLOOP_OPT_NR_QUEUES:
> + if (match_uint(args, &token)) {
> + ret = -EINVAL;
> + goto out;
> + }
> + if (!token) {
> + pr_err("Invalid number of queues\n");
> + ret = -EINVAL;
> + goto out;
> + }
> + opts->nr_queues = min(token, num_online_cpus());
> + break;
> + case ZLOOP_OPT_QUEUE_DEPTH:
> + if (match_uint(args, &token)) {
> + ret = -EINVAL;
> + goto out;
> + }
> + if (!token) {
> + pr_err("Invalid queue depth\n");
> + ret = -EINVAL;
> + goto out;
> + }
> + opts->queue_depth = token;
> + break;
> + case ZLOOP_OPT_BUFFERED_IO:
> + opts->buffered_io = true;
> + break;
> + case ZLOOP_OPT_ERR:
> + default:
> + pr_warn("unknown parameter or missing value '%s'\n", p);
> + ret = -EINVAL;
> + goto out;
> + }
> + }
> +
> + ret = -EINVAL;
> + if (opts->capacity <= opts->zone_size) {
> + pr_err("Invalid capacity\n");
> + goto out;
> + }
> +
> + if (opts->zone_capacity > opts->zone_size) {
> + pr_err("Invalid zone capacity\n");
> + goto out;
> + }
> +
> + ret = 0;
> +out:
> + kfree(options);
> + return ret;
> +}
> +
> +enum {
> + ZLOOP_CTL_ADD,
> + ZLOOP_CTL_REMOVE,
> +};
> +
> +static struct zloop_ctl_op {
> + int code;
> + const char *name;
> +} zloop_ctl_ops[] = {
> + { ZLOOP_CTL_ADD, "add" },
> + { ZLOOP_CTL_REMOVE, "remove" },
> + { -1, NULL },
> +};
> +
> +static ssize_t zloop_ctl_write(struct file *file, const char __user *ubuf,
> + size_t count, loff_t *pos)
> +{
> + struct zloop_options opts = { };
> + struct zloop_ctl_op *op;
> + const char *buf, *opts_buf;
> + int i, ret;
> +
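> +	/* Limit the command string to at most one page. */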
> + if (count > PAGE_SIZE)
> + return -ENOMEM;
> +
> + buf = memdup_user_nul(ubuf, count);
> + if (IS_ERR(buf))
> + return PTR_ERR(buf);
> +
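> +	/* Match the operation name ("add" or "remove") at the start of the command. */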
> + for (i = 0; i < ARRAY_SIZE(zloop_ctl_ops); i++) {
> + op = &zloop_ctl_ops[i];
> + if (!op->name) {
> + pr_err("Invalid operation\n");
> + ret = -EINVAL;
> + goto out;
> + }
> + if (!strncmp(buf, op->name, strlen(op->name)))
> + break;
> + }
> +
> + if (count <= strlen(op->name))
> + opts_buf = NULL;
> + else
> + opts_buf = buf + strlen(op->name);
> +
> + ret = zloop_parse_options(&opts, opts_buf);
> + if (ret) {
> + pr_err("Failed to parse options\n");
> + goto out;
> + }
> +
> + switch (op->code) {
> + case ZLOOP_CTL_ADD:
> + ret = zloop_ctl_add(&opts);
> + break;
> + case ZLOOP_CTL_REMOVE:
> + ret = zloop_ctl_remove(&opts);
> + break;
> + default:
> + pr_err("Invalid operation\n");
> + ret = -EINVAL;
> + goto out;
> + }
> +
> +out:
> + kfree(opts.base_dir);
> + kfree(buf);
> + return ret ? ret : count;
> +}
> +
> +static int zloop_ctl_show(struct seq_file *seq_file, void *private)
> +{
> + const struct match_token *tok;
> + int i;
> +
> + /* Add operation */
> + seq_printf(seq_file, "%s ", zloop_ctl_ops[0].name);
> + for (i = 0; i < ARRAY_SIZE(zloop_opt_tokens); i++) {
> + tok = &zloop_opt_tokens[i];
> + if (!tok->pattern)
> + break;
> + if (i)
> + seq_putc(seq_file, ',');
> + seq_puts(seq_file, tok->pattern);
> + }
> + seq_putc(seq_file, '\n');
> +
> + /* Remove operation */
> + seq_puts(seq_file, zloop_ctl_ops[1].name);
> + seq_puts(seq_file, " id=%d\n");
> +
> + return 0;
> +}
> +
> +static int zloop_ctl_open(struct inode *inode, struct file *file)
> +{
> + file->private_data = NULL;
> + return single_open(file, zloop_ctl_show, NULL);
> +}
> +
> +static int zloop_ctl_release(struct inode *inode, struct file *file)
> +{
> + return single_release(inode, file);
> +}
> +
> +static const struct file_operations zloop_ctl_fops = {
> + .owner = THIS_MODULE,
> + .open = zloop_ctl_open,
> + .release = zloop_ctl_release,
> + .write = zloop_ctl_write,
> + .read = seq_read,
> +};
> +
> +static struct miscdevice zloop_misc = {
> + .minor = MISC_DYNAMIC_MINOR,
> + .name = "zloop-control",
> + .fops = &zloop_ctl_fops,
> +};
> +
> +static int __init zloop_init(void)
> +{
> + int ret;
> +
> + ret = misc_register(&zloop_misc);
> + if (ret) {
> + pr_err("Failed to register misc device: %d\n", ret);
> + return ret;
> + }
> + pr_info("Module loaded\n");
> +
> + return 0;
> +}
> +
> +static void __exit zloop_exit(void)
> +{
> + misc_deregister(&zloop_misc);
> + idr_destroy(&zloop_index_idr);
> +}
> +
> +module_init(zloop_init);
> +module_exit(zloop_exit);
> +
> +MODULE_DESCRIPTION("Zoned loopback device");
> +MODULE_LICENSE("GPL");
> --
> 2.48.1
>
--
Ming
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-01-31 3:54 ` Ming Lei
@ 2025-02-04 3:22 ` Damien Le Moal
2025-02-05 3:43 ` Ming Lei
0 siblings, 1 reply; 35+ messages in thread
From: Damien Le Moal @ 2025-02-04 3:22 UTC (permalink / raw)
To: Ming Lei; +Cc: Christoph Hellwig, Jens Axboe, linux-block
On 1/31/25 12:54, Ming Lei wrote:
> On Wed, Jan 29, 2025 at 05:10:32PM +0900, Damien Le Moal wrote:
>> On 1/24/25 21:30, Ming Lei wrote:
>>>> 1 queue:
>>>> ========
>>>>                              +-------------------+-------------------+
>>>>                              | ublk (IOPS / BW)  | zloop (IOPS / BW) |
>>>> +----------------------------+-------------------+-------------------+
>>>> | QD=1, 4K rnd wr, 1 job | 11.7k / 47.8 MB/s | 15.8k / 53.0 MB/s |
>>>> | QD=32, 4K rnd wr, 8 jobs | 63.4k / 260 MB/s | 101k / 413 MB/s |
>>>
>>> I can't reproduce the above two, actually not observe obvious difference
>>> between rublk/zoned and zloop in my test VM.
>>
>> I am using bare-metal machines for these tests as I do not want any
>> noise from a VM/hypervisor in the numbers. And I did say that this is with a
>> tweaked version of zloop that I have not posted yet (I was waiting for rc1 to
>> repost as a rebase is needed to correct a compilation failure due to the nomerge
>> tag set flag being removed). I am attaching the patch I used here (it applies
>> on top of current Linus tree)
>>
>>> Maybe rublk works at debug mode, which reduces perf by half usually.
>>> And you need to add device via 'cargo run -r -- add zoned' for using
>>> release mode.
>>
>> Well, that is not an obvious thing for someone who does not know rust well. The
>> README file of rublk also does not mention that. So no, I did not run it like
>> this. I followed the README and call rublk directly. It would be great to
>> document that.
>
> OK, that is fine, and now you can install rublk/zoned with 'cargo
> install rublk' directly, which always build & install the binary of
> release version.
>
>>
>>> Actually there is just single io_uring_enter() running in each ublk queue
>>> pthread, perf should be similar with kernel IO handling, and the main extra
>>> load is from the single syscall kernel/user context switch and IO data copy,
>>> and data copy effect can be neglected in small io size usually(< 64KB).
>>>
>>>> | QD=32, 128K rnd wr, 1 job | 5008 / 656 MB/s | 5993 / 786 MB/s |
>>>> | QD=32, 128K seq wr, 1 job | 2636 / 346 MB/s | 5393 / 707 MB/s |
>>>
>>> ublk 128K BS may be a little slower since there is one extra copy.
>>
>> Here are newer numbers running rublk as you suggested (using cargo run -r).
>> The backend storage is on an XFS file system using a PCI gen4 4TB M.2 SSD that
>> is empty (the FS is empty on start). The emulated zoned disk has a capacity of
>> 512GB with sequential zones only of 256 MB (that is, there are 2048
>> zones/files). Each data point is from a 1min run of fio.
>
> Can you share how you create rublk/zoned and zloop and the underlying
> device info? Especially queue depth and nr_queues(both rublk/zloop &
> underlying disk) plays a big role.
rublk:
cargo run -r -- add zoned --size 524288 --zone-size 256 --conv-zones 0 \
--logical-block-size 4096 --queue ${nrq} --depth 128 \
--path /mnt/zloop/0
zloop:
echo "add conv_zones=0,capacity_mb=524288,zone_size_mb=256,\
base_dir=/mnt/zloop,nr_queues=${nrq},queue_depth=128" > /dev/zloop-control
The backing storage is using XFS on a PCIe Gen4 4TB M.2 SSD (my Xeon machine is
PCIe Gen3 though). This drive has a large enough max_qid to provide one IO queue
pair per CPU for up to 32 CPUs (16-cores / 32-threads).
> I will take your setting on real hardware and re-run the test after I
> return from the Spring Festival holiday.
>
>>
>> On an 8-core Intel Xeon test box, which has PCIe Gen3 only, I get:
>>
>> Single queue:
>> =============
>>                              +-------------------+-------------------+
>>                              | ublk (IOPS / BW)  | zloop (IOPS / BW) |
>> +----------------------------+-------------------+-------------------+
>> | QD=1, 4K rnd wr, 1 job | 2859 / 11.7 MB/s | 5535 / 22.7 MB/s |
>> | QD=32, 4K rnd wr, 8 jobs | 24.5k / 100 MB/s | 24.6k / 101 MB/s |
>> | QD=32, 128K rnd wr, 1 job | 14.9k / 1954 MB/s | 19.6k / 2571 MB/s |
>> | QD=32, 128K seq wr, 1 job | 1516 / 199 MB/s | 10.6k / 1385 MB/s |
>> +----------------------------+-------------------+-------------------+
>>
>> 8 queues:
>> =========
>>                              +-------------------+-------------------+
>>                              | ublk (IOPS / BW)  | zloop (IOPS / BW) |
>> +----------------------------+-------------------+-------------------+
>> | QD=1, 4K rnd wr, 1 job | 5387 / 22.1 MB/s | 5436 / 22.3 MB/s |
>> | QD=32, 4K rnd wr, 8 jobs | 16.4k / 67.0 MB/s | 26.3k / 108 MB/s |
>> | QD=32, 128K rnd wr, 1 job | 6101 / 800 MB/s | 19.8k / 2591 MB/s |
>> | QD=32, 128K seq wr, 1 job | 3987 / 523 MB/s | 10.6k / 1391 MB/s |
>> +----------------------------+-------------------+-------------------+
>>
>> I have no idea why ublk is generally slower when set up with 8 I/O queues. The
>> qd=32 4K random write with 8 jobs is generally faster with ublk than zloop, but
>> that varies. I tracked that down to CPU utilization which is generally much
>> better (all CPUs used) with ublk compared to zloop, as zloop is at the mercy of
>> the workqueue code and how it schedules unbound work items.
>
> Maybe it is related with queue depth? The default ublk queue depth is
> 128, and 8jobs actually causes 256 in-flight IOs, and default ublk nr_queue
> is 1.
See above: both rublk and zloop are set up with the exact same number of queues
and max qd.
> Another thing I mentioned is that ublk has one extra IO data copy, which
> slows IO especially when IO size is > 64K usually.
Yes. I do keep this in mind when looking at the results.
[...]
>>> Simplicity need to be observed from multiple dimensions, 300 vs. 1500 LoC has
>>> shown something already, IMO.
>>
>> Sure. But given the very complicated syntax of rust, a lower LoC for rust
>> compared to C is very subjective in my opinion.
>>
>> I said "simplicity" in the context of the driver use. And rublk is not as
>> simple to use as zloop as it needs rust/cargo installed which is not an
>> acceptable dependency for xfstests. Furthermore, it is very annoying to have to
>
> xfstests just need user to pass the zoned block device, so the same test can
> cover any zoned device.
Sure. But the environment that allows that still needs to have the rust
dependency to pull-in and build rublk before using it to run the tests. That is
more dependencies for a CI system or minimal VMs that are not necessarily based
on a full distro but used to run xfstests.
> I don't understand why you have to add the zoned device emulation code into
> xfstest test script, and introduce the device dependency into upper level FS
> test, which sounds like a layer violation?
The device needs to be prepared before running the tests. See above.
> I guess you may be missing the point; actually it isn't related to Rust.
It is. As mentioned several times now, adding rust as a dependency to allow
minimal test VMs to create an emulated zoned device for running xfstests is not
nice. Sure it is not an unsolvable problem, but still not one that we want to
add to test environments. zloop only needs sh/bash, which is necessarily already
included in any existing test environment because that is what xfstests is
written with.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-02-04 3:22 ` Damien Le Moal
@ 2025-02-05 3:43 ` Ming Lei
2025-02-05 6:07 ` Damien Le Moal
0 siblings, 1 reply; 35+ messages in thread
From: Ming Lei @ 2025-02-05 3:43 UTC (permalink / raw)
To: Damien Le Moal; +Cc: Christoph Hellwig, Jens Axboe, linux-block
On Tue, Feb 04, 2025 at 12:22:53PM +0900, Damien Le Moal wrote:
> On 1/31/25 12:54, Ming Lei wrote:
> > On Wed, Jan 29, 2025 at 05:10:32PM +0900, Damien Le Moal wrote:
> >> On 1/24/25 21:30, Ming Lei wrote:
> >>>> 1 queue:
> >>>> ========
> >>>>                              +-------------------+-------------------+
> >>>>                              | ublk (IOPS / BW)  | zloop (IOPS / BW) |
> >>>> +----------------------------+-------------------+-------------------+
> >>>> | QD=1, 4K rnd wr, 1 job | 11.7k / 47.8 MB/s | 15.8k / 53.0 MB/s |
> >>>> | QD=32, 4K rnd wr, 8 jobs | 63.4k / 260 MB/s | 101k / 413 MB/s |
> >>>
> >>> I can't reproduce the above two, actually not observe obvious difference
> >>> between rublk/zoned and zloop in my test VM.
> >>
> >> I am using bare-metal machines for these tests as I do not want any
> >> noise from a VM/hypervisor in the numbers. And I did say that this is with a
> >> tweaked version of zloop that I have not posted yet (I was waiting for rc1 to
> >> repost as a rebase is needed to correct a compilation failure due to the nomerge
> >> tag set flag being removed). I am attaching the patch I used here (it applies
> >> on top of current Linus tree)
> >>
> >>> Maybe rublk works at debug mode, which reduces perf by half usually.
> >>> And you need to add device via 'cargo run -r -- add zoned' for using
> >>> release mode.
> >>
> >> Well, that is not an obvious thing for someone who does not know rust well. The
> >> README file of rublk also does not mention that. So no, I did not run it like
> >> this. I followed the README and call rublk directly. It would be great to
> >> document that.
> >
> > OK, that is fine, and now you can install rublk/zoned with 'cargo
> > install rublk' directly, which always build & install the binary of
> > release version.
> >
> >>
> >>> Actually there is just single io_uring_enter() running in each ublk queue
> >>> pthread, perf should be similar with kernel IO handling, and the main extra
> >>> load is from the single syscall kernel/user context switch and IO data copy,
> >>> and data copy effect can be neglected in small io size usually(< 64KB).
> >>>
> >>>> | QD=32, 128K rnd wr, 1 job | 5008 / 656 MB/s | 5993 / 786 MB/s |
> >>>> | QD=32, 128K seq wr, 1 job | 2636 / 346 MB/s | 5393 / 707 MB/s |
> >>>
> >>> ublk 128K BS may be a little slower since there is one extra copy.
> >>
> >> Here are newer numbers running rublk as you suggested (using cargo run -r).
> >> The backend storage is on an XFS file system using a PCI gen4 4TB M.2 SSD that
> >> is empty (the FS is empty on start). The emulated zoned disk has a capacity of
> >> 512GB with sequential zones only of 256 MB (that is, there are 2048
> >> zones/files). Each data point is from a 1min run of fio.
> >
> > Can you share how you create rublk/zoned and zloop and the underlying
> > device info? Especially queue depth and nr_queues(both rublk/zloop &
> > underlying disk) plays a big role.
>
> rublk:
>
> cargo run -r -- add zoned --size 524288 --zone-size 256 --conv-zones 0 \
> --logical-block-size 4096 --queue ${nrq} --depth 128 \
> --path /mnt/zloop/0
>
> zloop:
>
> echo "add conv_zones=0,capacity_mb=524288,zone_size_mb=256,\
> base_dir=/mnt/zloop,nr_queues=${nrq},queue_depth=128" > /dev/zloop-control
Zones are actually stateful, so maybe it is better to use standalone backing
directories/files.
>
> The backing storage is using XFS on a PCIe Gen4 4TB M.2 SSD (my Xeon machine is
> PCIe Gen3 though). This drive has a large enough max_qid to provide one IO queue
> pair per CPU for up to 32 CPUs (16-cores / 32-threads).
I just set up XFS over an NVMe drive on real hardware, and still can't reproduce
the big gap in your test results. Kernel is v6.13 with zloop patch v2.

`8 queues` should only make a difference for the "QD=32, 4K rnd wr, 8 jobs" test.
For the other single-job tests, a single queue should perform the same as 8 queues.

The big gap is mainly in the 'QD=32, 128K seq wr, 1 job' test; maybe your local
change improves zloop's request merging? In my test:
- ublk/zoned : 912 MiB/s
- zloop(v2) : 960 MiB/s.
BTW, my test is on btrfs, using the following test script:
fio --size=32G --time_based --bsrange=128K-128K --runtime=40 --numjobs=1 \
--ioengine=libaio --iodepth=32 --directory=./ublk --group_reporting=1 --direct=1 \
--fsync=0 --name=f1 --stonewall --rw=write
>
> > I will take your setting on real hardware and re-run the test after I
> > return from the Spring Festival holiday.
> >
> >>
> >> On an 8-core Intel Xeon test box, which has PCIe Gen3 only, I get:
> >>
> >> Single queue:
> >> =============
> >>                              +-------------------+-------------------+
> >>                              | ublk (IOPS / BW)  | zloop (IOPS / BW) |
> >> +----------------------------+-------------------+-------------------+
> >> | QD=1, 4K rnd wr, 1 job | 2859 / 11.7 MB/s | 5535 / 22.7 MB/s |
> >> | QD=32, 4K rnd wr, 8 jobs | 24.5k / 100 MB/s | 24.6k / 101 MB/s |
> >> | QD=32, 128K rnd wr, 1 job | 14.9k / 1954 MB/s | 19.6k / 2571 MB/s |
> >> | QD=32, 128K seq wr, 1 job | 1516 / 199 MB/s | 10.6k / 1385 MB/s |
> >> +----------------------------+-------------------+-------------------+
> >>
> >> 8 queues:
> >> =========
> >>                              +-------------------+-------------------+
> >>                              | ublk (IOPS / BW)  | zloop (IOPS / BW) |
> >> +----------------------------+-------------------+-------------------+
> >> | QD=1, 4K rnd wr, 1 job | 5387 / 22.1 MB/s | 5436 / 22.3 MB/s |
> >> | QD=32, 4K rnd wr, 8 jobs | 16.4k / 67.0 MB/s | 26.3k / 108 MB/s |
> >> | QD=32, 128K rnd wr, 1 job | 6101 / 800 MB/s | 19.8k / 2591 MB/s |
> >> | QD=32, 128K seq wr, 1 job | 3987 / 523 MB/s | 10.6k / 1391 MB/s |
> >> +----------------------------+-------------------+-------------------+
> >>
> >> I have no idea why ublk is generally slower when set up with 8 I/O queues. The
> >> qd=32 4K random write with 8 jobs is generally faster with ublk than zloop, but
> >> that varies. I tracked that down to CPU utilization which is generally much
> >> better (all CPUs used) with ublk compared to zloop, as zloop is at the mercy of
> >> the workqueue code and how it schedules unbound work items.
> >
> > Maybe it is related with queue depth? The default ublk queue depth is
> > 128, and 8jobs actually causes 256 in-flight IOs, and default ublk nr_queue
> > is 1.
>
> > See above: both rublk and zloop are set up with the exact same number of queues
> and max qd.
>
> > Another thing I mentioned is that ublk has one extra IO data copy, which
> > slows IO especially when IO size is > 64K usually.
>
> Yes. I do keep this in mind when looking at the results.
>
> [...]
>
> >>> Simplicity need to be observed from multiple dimensions, 300 vs. 1500 LoC has
> >>> shown something already, IMO.
> >>
> >> Sure. But given the very complicated syntax of rust, a lower LoC for rust
> >> compared to C is very subjective in my opinion.
> >>
> >> I said "simplicity" in the context of the driver use. And rublk is not as
> >> simple to use as zloop as it needs rust/cargo installed which is not an
> >> acceptable dependency for xfstests. Furthermore, it is very annoying to have to
> >
> > xfstests just need user to pass the zoned block device, so the same test can
> > cover any zoned device.
>
> Sure. But the environment that allows that still needs to have the rust
> dependency to pull-in and build rublk before using it to run the tests. That is
> > more dependencies for a CI system or minimal VMs that are not necessarily based
> on a full distro but used to run xfstests.
OK, it isn't too hard to solve:
- install `cargo` in the distribution if it doesn't exist
- run 'cargo install rublk' if rublk isn't installed
Thanks,
Ming
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-02-05 3:43 ` Ming Lei
@ 2025-02-05 6:07 ` Damien Le Moal
2025-02-06 3:24 ` Ming Lei
0 siblings, 1 reply; 35+ messages in thread
From: Damien Le Moal @ 2025-02-05 6:07 UTC (permalink / raw)
To: Ming Lei; +Cc: Christoph Hellwig, Jens Axboe, linux-block
On 2/5/25 12:43 PM, Ming Lei wrote:
>>> Can you share how you create rublk/zoned and zloop and the underlying
>>> device info? Especially queue depth and nr_queues(both rublk/zloop &
>>> underlying disk) plays a big role.
>>
>> rublk:
>>
>> cargo run -r -- add zoned --size 524288 --zone-size 256 --conv-zones 0 \
>> --logical-block-size 4096 --queue ${nrq} --depth 128 \
>> --path /mnt/zloop/0
>>
>> zloop:
>>
>> echo "add conv_zones=0,capacity_mb=524288,zone_size_mb=256,\
>> base_dir=/mnt/zloop,nr_queues=${nrq},queue_depth=128" > /dev/zloop-control
>
> Zones are actually stateful, so maybe it is better to use standalone backing
> directories/files.
I do not understand what you are saying... I reformat the backing FS and
recreate the same /mnt/zloop/0 directory for every test, to be sure I am not
seeing an artifact from the FS.
>> The backing storage is using XFS on a PCIe Gen4 4TB M.2 SSD (my Xeon machine is
>> PCIe Gen3 though). This drive has a large enough max_qid to provide one IO queue
>> pair per CPU for up to 32 CPUs (16-cores / 32-threads).
>
> I just set up XFS over an NVMe drive on real hardware, and still can't reproduce
> the big gap in your test results. Kernel is v6.13 with zloop patch v2.
>
> `8 queues` should only make a difference for the "QD=32, 4K rnd wr, 8 jobs" test.
> For the other single-job tests, a single queue should perform the same as 8 queues.
>
> The big gap is mainly in the 'QD=32, 128K seq wr, 1 job' test; maybe your local
> change improves zloop's request merging? In my test:
>
> - ublk/zoned : 912 MiB/s
> - zloop(v2) : 960 MiB/s.
>
> BTW, my test is on btrfs, using the following test script:
>
> fio --size=32G --time_based --bsrange=128K-128K --runtime=40 --numjobs=1 \
> --ioengine=libaio --iodepth=32 --directory=./ublk --group_reporting=1 --direct=1 \
> --fsync=0 --name=f1 --stonewall --rw=write
If you add an FS on top of the emulated zoned device, you are testing the FS
perf as much as the backing dev. I focused on the backing dev so I ran fio
directly on top of the emulated drive. E.g.:
fio --name=test --filename=${dev} --rw=randwrite \
--ioengine=libaio --iodepth=32 --direct=1 --bs=4096 \
--zonemode=zbd --numjobs=8 --group_reporting --norandommap \
--cpus_allowed=0-7 --cpus_allowed_policy=split \
--runtime=${runtime} --ramp_time=5 --time_based
(you must use libaio here)
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] New zoned loop block device driver
2025-02-05 6:07 ` Damien Le Moal
@ 2025-02-06 3:24 ` Ming Lei
0 siblings, 0 replies; 35+ messages in thread
From: Ming Lei @ 2025-02-06 3:24 UTC (permalink / raw)
To: Damien Le Moal; +Cc: Christoph Hellwig, Jens Axboe, linux-block
On Wed, Feb 05, 2025 at 03:07:51PM +0900, Damien Le Moal wrote:
> On 2/5/25 12:43 PM, Ming Lei wrote:
> >>> Can you share how you create rublk/zoned and zloop and the underlying
> >>> device info? Especially queue depth and nr_queues(both rublk/zloop &
> >>> underlying disk) plays a big role.
> >>
> >> rublk:
> >>
> >> cargo run -r -- add zoned --size 524288 --zone-size 256 --conv-zones 0 \
> >> --logical-block-size 4096 --queue ${nrq} --depth 128 \
> >> --path /mnt/zloop/0
> >>
> >> zloop:
> >>
> >> echo "add conv_zones=0,capacity_mb=524288,zone_size_mb=256,\
> >> base_dir=/mnt/zloop,nr_queues=${nrq},queue_depth=128" > /dev/zloop-control
> >
> > Zones are actually stateful, so maybe it is better to use standalone backing
> > directories/files.
>
> I do not understand what you are saying... I reformat the backing FS and
> recreate the same /mnt/zloop/0 directory for every test, to be sure I am not
> seeing an artifact from the FS.
I meant that the same backing files are shared between the two devices.
But I guess it may not be a big deal.
>
> >> The backing storage is using XFS on a PCIe Gen4 4TB M.2 SSD (my Xeon machine is
> >> PCIe Gen3 though). This drive has a large enough max_qid to provide one IO queue
> >> pair per CPU for up to 32 CPUs (16-cores / 32-threads).
> >
> > I just set up XFS over an NVMe drive on real hardware, and still can't reproduce
> > the big gap in your test results. Kernel is v6.13 with zloop patch v2.
> >
> > `8 queues` should only make a difference for the "QD=32, 4K rnd wr, 8 jobs" test.
> > For the other single-job tests, a single queue should perform the same as 8 queues.
> >
> > The big gap is mainly in the 'QD=32, 128K seq wr, 1 job' test; maybe your local
> > change improves zloop's request merging? In my test:
> >
> > - ublk/zoned : 912 MiB/s
> > - zloop(v2) : 960 MiB/s.
> >
> > BTW, my test is on btrfs, using the following test script:
> >
> > fio --size=32G --time_based --bsrange=128K-128K --runtime=40 --numjobs=1 \
> > --ioengine=libaio --iodepth=32 --directory=./ublk --group_reporting=1 --direct=1 \
> > --fsync=0 --name=f1 --stonewall --rw=write
>
> If you add an FS on top of the emulated zoned device, you are testing the FS
> perf as much as the backing dev. I focused on the backing dev so I ran fio
> directly on top of the emulated drive. E.g.:
>
> fio --name=test --filename=${dev} --rw=randwrite \
> --ioengine=libaio --iodepth=32 --direct=1 --bs=4096 \
> --zonemode=zbd --numjobs=8 --group_reporting --norandommap \
> --cpus_allowed=0-7 --cpus_allowed_policy=split \
> --runtime=${runtime} --ramp_time=5 --time_based
>
> (you must use libaio here)
Thanks for sharing the '--zonemode=zbd'.
I can reproduce the perf issue with the above script, and the reason is related
to io-uring emulation and zone space pre-allocation.
When an FS write IO needs to allocate space, ->write_iter() returns -EAGAIN
for each io_uring write, so the write always falls back to io-wq, causing
very bad sequential write performance.
It can be fixed[1] simply by pre-allocating space before writing to the
beginning of each seq-zone.
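FWIW, the idea behind [1] boils down to the minimal plain C sketch below.
This is only an illustration, not the actual rublk code: the backing file
path and the 256 MB zone size are made-up examples.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ZONE_SIZE_BYTES	(256UL << 20)	/* example 256 MB zone */

int main(void)
{
	/* Hypothetical backing file of one sequential zone. */
	int fd = open("/mnt/zloop/0/seq-000000", O_WRONLY);

	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}

	/*
	 * Pre-allocate the whole zone file up front so that later writes
	 * do not need block allocation in the FS. It is the allocating
	 * NOWAIT write that makes ->write_iter() return -EAGAIN and
	 * forces io_uring to fall back to io-wq.
	 */
	if (fallocate(fd, 0, 0, ZONE_SIZE_BYTES) < 0) {
		perror("fallocate");
		close(fd);
		return EXIT_FAILURE;
	}

	close(fd);
	return EXIT_SUCCESS;
}

With the zone fully allocated before the first write, the io_uring write
path can complete inline in the submission context.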
The results of my test on real NVMe/XFS follow:
+ ./zfio /dev/zloop0 write 1 40
write /dev/zloop0: jobs 1 io_depth 32 time 40sec
BS 4k: IOPS 171383 BW 685535KiB/s fio_cpu_util(25% 38%)
BS 128k: IOPS 7669 BW 981846KiB/s fio_cpu_util( 5% 11%)
+ ./zfio /dev/ublkb0 write 1 40
write /dev/ublkb0: jobs 1 io_depth 32 time 40sec
BS 4k: IOPS 179861 BW 719448KiB/s fio_cpu_util(29% 42%)
BS 128k: IOPS 7239 BW 926786KiB/s fio_cpu_util( 6% 9%)
+ ./zfio /dev/zloop0 randwrite 1 40
randwrite /dev/zloop0: jobs 1 io_depth 32 time 40sec
BS 4k: IOPS 8909 BW 35642KiB/s fio_cpu_util( 2% 5%)
BS 128k: IOPS 210 BW 27035KiB/s fio_cpu_util( 0% 0%)
+ ./zfio /dev/ublkb0 randwrite 1 40
randwrite /dev/ublkb0: jobs 1 io_depth 32 time 40sec
BS 4k: IOPS 20500 BW 82001KiB/s fio_cpu_util( 5% 12%)
BS 128k: IOPS 5622 BW 719792KiB/s fio_cpu_util( 6% 8%)
[1] https://github.com/ublk-org/rublk/commit/fd01a87abb2f9b8e94c8da24e73683e4bb12659b
[2] `z` (zone fio test script) https://github.com/ublk-org/rublk/blob/main/scripts/zfio
Thanks,
Ming
^ permalink raw reply [flat|nested] 35+ messages in thread
end of thread
Thread overview: 35+ messages
2025-01-06 14:24 [PATCH 0/2] New zoned loop block device driver Damien Le Moal
2025-01-06 14:24 ` [PATCH 1/2] block: new " Damien Le Moal
2025-01-06 14:24 ` [PATCH 2/2] Documentation: Document the " Damien Le Moal
2025-01-06 14:54 ` [PATCH 0/2] New " Jens Axboe
2025-01-06 15:21 ` Christoph Hellwig
2025-01-06 15:24 ` Jens Axboe
2025-01-06 15:32 ` Christoph Hellwig
2025-01-06 15:38 ` Jens Axboe
2025-01-06 15:44 ` Christoph Hellwig
2025-01-06 17:38 ` Jens Axboe
2025-01-06 18:05 ` Christoph Hellwig
2025-01-07 21:10 ` Jens Axboe
2025-01-08 5:49 ` Christoph Hellwig
2025-01-07 1:08 ` Damien Le Moal
2025-01-07 21:08 ` Jens Axboe
2025-01-08 5:11 ` Damien Le Moal
2025-01-08 5:44 ` Christoph Hellwig
2025-01-08 2:47 ` Ming Lei
2025-01-08 14:10 ` Theodore Ts'o
2025-01-08 2:29 ` Ming Lei
2025-01-08 5:06 ` Damien Le Moal
2025-01-08 8:13 ` Ming Lei
2025-01-08 9:09 ` Christoph Hellwig
2025-01-08 9:39 ` Ming Lei
2025-01-10 12:34 ` Ming Lei
2025-01-24 9:30 ` Damien Le Moal
2025-01-24 12:30 ` Ming Lei
2025-01-24 14:20 ` Johannes Thumshirn
2025-01-29 8:10 ` Damien Le Moal
2025-01-31 3:54 ` Ming Lei
2025-02-04 3:22 ` Damien Le Moal
2025-02-05 3:43 ` Ming Lei
2025-02-05 6:07 ` Damien Le Moal
2025-02-06 3:24 ` Ming Lei
2025-01-08 5:47 ` Christoph Hellwig