linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag
@ 2025-06-04  2:08 Zhang Yi
  2025-06-04  2:08 ` [PATCH 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features Zhang Yi
                   ` (10 more replies)
  0 siblings, 11 replies; 28+ messages in thread
From: Zhang Yi @ 2025-06-04  2:08 UTC
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, yi.zhang, chengzhihao1, yukuai3,
	yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Changes since RFC v4:
 - Rebase codes on 6.16-rc1.
 - Add a new queue_limit flag, and change the write_zeroes_unmap sysfs
   interface to RW mode. User can disable the unmap write zeroes
   operation by writing '0' to it when the operation is slow.
 - Modify the documentation of write_zeroes_unmap sysfs interface as
   Martin suggested.
 - Remove the statx interface.
 - Make the bdev and ext4 reject FALLOC_FL_WRITE_ZEROES with -EOPNOTSUPP
   if the block device does not enable the unmap write zeroes
   operation.
Changes since RFC v3:
 - Rebase codes on 6.15-rc2.
 - Add a note in patch 1 to indicate that the unmap write zeros command
   is not always guaranteed as Christoph suggested.
 - Rename bdev_unmap_write_zeroes() helper and move it to patch 1 as
   Christoph suggested.
 - Introduce a new statx attribute flag STATX_ATTR_WRITE_ZEROES_UNMAP as
   Christoph and Christian suggested.
 - Exchange the order of the two patches that modified
   blkdev_fallocate() as Christoph suggested.
Changes since RFC v2:
 - Rebase codes on next-20250314.
 - Add support for nvme multipath.
 - Add support for NVMeT with block device backing.
 - Clear FALLOC_FL_WRITE_ZEROES if dm clear
   limits->max_write_zeroes_sectors.
 - Add the counterpart userspace tools (util-linux and xfs_io) and
   tests (blktests and xfstests); please see below for details.
Changes since RFC v1:
 - Switch to add a new write zeroes operation, FALLOC_FL_WRITE_ZEROES,
   in fallocate, instead of just adding a supported flag to
   FALLOC_FL_ZERO_RANGE.
 - Introduce a new flag BLK_FEAT_WRITE_ZEROES_UNMAP to the block
   device's queue limit features, and implement it on SCSI sd driver,
   NVMe SSD driver and dm driver.
 - Implement FALLOC_FL_WRITE_ZEROES on both the ext4 filesystem and
   block device (bdev).

RFC v4: https://lore.kernel.org/linux-fsdevel/20250421021509.2366003-1-yi.zhang@huaweicloud.com/
RFC v3: https://lore.kernel.org/linux-fsdevel/20250318073545.3518707-1-yi.zhang@huaweicloud.com/
RFC v2: https://lore.kernel.org/linux-fsdevel/20250115114637.2705887-1-yi.zhang@huaweicloud.com/
RFC v1: https://lore.kernel.org/linux-fsdevel/20241228014522.2395187-1-yi.zhang@huaweicloud.com/

The counterpart userspace tools changes and tests are here:
 - util-linux: https://lore.kernel.org/linux-fsdevel/20250318073218.3513262-1-yi.zhang@huaweicloud.com/ 
 - xfsprogs: https://lore.kernel.org/linux-fsdevel/20250318072318.3502037-1-yi.zhang@huaweicloud.com/
 - xfstests: https://lore.kernel.org/linux-fsdevel/20250318072615.3505873-1-yi.zhang@huaweicloud.com/
 - blktests: https://lore.kernel.org/linux-fsdevel/20250318072835.3508696-1-yi.zhang@huaweicloud.com/

Original Description:

Currently, we can use the fallocate command to quickly create a
pre-allocated file. However, on most filesystems, such as ext4 and XFS,
fallocate creates pre-allocated blocks in an unwritten state, and the
FALLOC_FL_ZERO_RANGE flag behaves similarly. The extent state must be
converted to a written state when the user later writes data into this
range, which can trigger numerous metadata changes and consequent
journal I/O. This may lead to significant write amplification and
performance degradation in synchronous write mode. Therefore, we need a
method to create a pre-allocated file with written extents that can be
used for pure overwriting. At the moment, the only method available is
to create an empty file and write zero data into it (for example, using
'dd' with a large block size). However, this method is slow and consumes
a considerable amount of disk bandwidth, so we must pre-allocate files
in advance and cannot add pre-allocated files while user business
services are running.

Fortunately, as flash-based storage devices become more and more widely
used, we can efficiently write zeroes to SSDs using the unmap write
zeroes command if the devices do not write physical zeroes to the
media. For example, if SCSI SSDs support the UNMAP bit or NVMe SSDs
support the DEAC bit[1], the write zeroes command does not write actual
data to the device. Instead, NVMe converts the zeroed range to a
deallocated state, which is fast and consumes almost no disk write
bandwidth. Consequently, this feature gives us a faster method for
creating pre-allocated files with written extents and zeroed data.
However, please note that this may be a best-effort optimization rather
than a mandatory requirement: some devices may partially fall back to
writing physical zeroes due to factors such as receiving unaligned
commands.

This series aims to implement this by:
1. Introduce a new feature BLK_FEAT_WRITE_ZEROES_UNMAP to the block
   device queue limit features, which indicates whether the storage
   device explicitly supports the unmapped write zeroes command. This
   flag should be set to 1 by the driver if the attached disk supports
   this command.

2. Introduce a queue limit flag, BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED,
   along with a corresponding sysfs entry. Users can query the support
   status of the unmap write zeroes operation and disable this operation
   if the write zeroes operation is very slow.

       /sys/block/<disk>/queue/write_zeroes_unmap

3. Introduce a new flag, FALLOC_FL_WRITE_ZEROES, into fallocate.
   Filesystems that support this operation should allocate written
   extents and issue zeroes to the specified range of the device. For
   local block device filesystems, this operation should depend on the
   write_zeroes_unmap operation of the underlying block device. It
   should return -EOPNOTSUPP if the device doesn't enable the unmap
   write zeroes operation.
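
As an illustration of item 2, the sysfs entry could be queried from
userspace roughly as follows. This is only a sketch: the helper names
parse_write_zeroes_unmap() and read_write_zeroes_unmap() are
hypothetical and not part of this series, and the file is assumed to
contain a single decimal value as described above.

```c
#include <stdio.h>

/*
 * Hypothetical helper: interpret the text read from write_zeroes_unmap.
 * Returns 1 if the operation is enabled, 0 if it is disabled or
 * unsupported, and -1 if the text is not a number.
 */
int parse_write_zeroes_unmap(const char *text)
{
	unsigned int val;

	if (text == NULL || sscanf(text, "%u", &val) != 1)
		return -1;
	return val != 0;
}

/*
 * Hypothetical helper: read /sys/block/<disk>/queue/write_zeroes_unmap.
 * Returns -1 if the file cannot be opened, for example on a kernel
 * without this series or for an unknown disk name.
 */
int read_write_zeroes_unmap(const char *disk)
{
	char path[256];
	char buf[16];
	FILE *f;
	int len;
	int ret = -1;

	len = snprintf(path, sizeof(path),
		       "/sys/block/%s/queue/write_zeroes_unmap", disk);
	if (len < 0 || len >= (int)sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	if (fgets(buf, sizeof(buf), f) != NULL)
		ret = parse_write_zeroes_unmap(buf);
	fclose(f);
	return ret;
}
```

A caller would pass the disk name, e.g. read_write_zeroes_unmap("nvme5n1"),
and treat any result other than 1 as "do not expect a fast zeroing path".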

This series implements the BLK_FEAT_WRITE_ZEROES_UNMAP feature and
BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED flag for SCSI, NVMe and
device-mapper drivers, and adds FALLOC_FL_WRITE_ZEROES support for ext4
and raw bdev devices. Any comments are welcome.

I've tested performance with this series on ext4 filesystem on my
machine with an Intel Xeon Gold 6248R CPU, a 7TB KCD61LUL7T68 NVMe SSD
which supports unmap write zeroes command with the Deallocated state
and the DEAC bit. Feel free to give it a try.

0. Ensure the NVMe device supports the Write Zeroes command.

 $ cat /sys/block/nvme5n1/queue/write_zeroes_max_bytes
   8388608
 $ nvme id-ns -H /dev/nvme5n1 | grep -i -A 3 "dlfeat"
   dlfeat  : 25
   [4:4] : 0x1   Guard Field of Deallocated Logical Blocks is set to CRC
                 of The Value Read
   [3:3] : 0x1   Deallocate Bit in the Write Zeroes Command is Supported
   [2:0] : 0x1   Bytes Read From a Deallocated Logical Block and its
                 Metadata are 0x00

1. Compare 'dd' and fallocate with unmap write zeroes; the latter is
   significantly faster than 'dd'.

   Create a 1GB and 10GB zeroed file.
    $ dd if=/dev/zero of=foo bs=2M count=$count oflag=direct
    $ time fallocate -w -l $size bar

    #1G
    dd:                     0.5s
    FALLOC_FL_WRITE_ZEROES: 0.17s

    #10G
    dd:                     5.0s
    FALLOC_FL_WRITE_ZEROES: 1.7s

2. Run fio overwrite and fallocate with unmap write zeroes
   simultaneously, fallocate has little impact on write bandwidth and
   only slightly affects write latency.

 a) Test bandwidth costs.
  $ fio -directory=/test -direct=1 -iodepth=10 -fsync=0 -rw=write \
        -numjobs=10 -bs=2M -ioengine=libaio -size=20G -runtime=20 \
        -fallocate=none -overwrite=1 -group_reportin -name=bw_test

   Without background zero range:
    bw (MiB/s): min= 2068, max= 2280, per=100.00%, avg=2186.40

   With background zero range:
    bw (MiB/s): min= 2056, max= 2308, per=100.00%, avg=2186.20

 b) Test write latency costs.
  $ fio -filename=/test/foo -direct=1 -iodepth=1 -fsync=0 -rw=write \
        -numjobs=1 -bs=4k -ioengine=psync -size=5G -runtime=20 \
        -fallocate=none -overwrite=1 -group_reportin -name=lat_test

   Without background zero range:
   lat (nsec): min=9269, max=71635, avg=9840.65

   With a background zero range:
   lat (usec): min=9, max=982, avg=11.03

3. Compare overwriting in a pre-allocated unwritten file and a written
   file in O_DSYNC mode. Writing to a file with written extents is much
   faster.

  # First mkfs and create a test file according to the three cases
  # below, and then run fio.

  $ fio -filename=/test/foo -direct=1 -iodepth=1 -fdatasync=1 \
        -rw=write -numjobs=1 -bs=4k -ioengine=psync -size=5G \
        -runtime=20 -fallocate=none -group_reportin -name=test

   unwritten file:                 IOPS=20.1k, BW=78.7MiB/s
   unwritten file + fast_commit:   IOPS=42.9k, BW=167MiB/s
   written file:                   IOPS=98.8k, BW=386MiB/s

Thanks,
Yi.

---

[1] https://nvmexpress.org/specifications/
    NVM Command Set Specification, Figure 82 and Figure 114.

Zhang Yi (10):
  block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit
  nvme-multipath: add BLK_FEAT_WRITE_ZEROES_UNMAP support
  nvmet: set WZDS and DRB if device supports BLK_FEAT_WRITE_ZEROES_UNMAP
  scsi: sd: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports unmap
    zeroing mode
  dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support
  fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
  block: factor out common part in blkdev_fallocate()
  block: add FALLOC_FL_WRITE_ZEROES support
  ext4: add FALLOC_FL_WRITE_ZEROES support

 Documentation/ABI/stable/sysfs-block | 20 +++++++++
 block/blk-settings.c                 |  6 +++
 block/blk-sysfs.c                    | 25 +++++++++++
 block/fops.c                         | 44 +++++++++++--------
 drivers/md/dm-table.c                |  7 ++-
 drivers/md/dm.c                      |  1 +
 drivers/nvme/host/core.c             | 21 +++++----
 drivers/nvme/host/multipath.c        |  3 +-
 drivers/nvme/target/io-cmd-bdev.c    |  4 ++
 drivers/scsi/sd.c                    |  5 +++
 fs/ext4/extents.c                    | 66 +++++++++++++++++++++++-----
 fs/open.c                            |  1 +
 include/linux/blkdev.h               | 18 ++++++++
 include/linux/falloc.h               |  3 +-
 include/trace/events/ext4.h          |  3 +-
 include/uapi/linux/falloc.h          | 18 ++++++++
 16 files changed, 201 insertions(+), 44 deletions(-)

-- 
2.46.1



* [PATCH 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  2025-06-04  2:08 [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
@ 2025-06-04  2:08 ` Zhang Yi
  2025-06-11  6:09   ` Christoph Hellwig
  2025-06-04  2:08 ` [PATCH 02/10] nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit Zhang Yi
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 28+ messages in thread
From: Zhang Yi @ 2025-06-04  2:08 UTC
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, yi.zhang, chengzhihao1, yukuai3,
	yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Currently, disks primarily implement the write zeroes command (aka
REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves
physically writing zeros to the disk media (e.g., HDDs), while the
second performs an unmap operation on the logical blocks, effectively
putting them into a deallocated state (e.g., SSDs). The first method is
generally slow, while the second method is typically very fast.

For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting
REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit can accelerate
the write zeroes operation by placing disk blocks into a deallocated
state, which opportunistically avoids writing zeroes to media while
still guaranteeing that subsequent reads from the specified block range
will return zeroed data. This is a best-effort optimization, not a
mandatory requirement: some devices may partially fall back to writing
physical zeroes due to factors such as misalignment or being asked to
clear a block range smaller than the device's internal allocation unit.
Therefore, the speed of this operation is not guaranteed.

It is difficult to determine whether a storage device supports the
unmap write zeroes operation; querying
bdev_limits(bdev)->max_write_zeroes_sectors alone is not enough. First,
add a new queue limit feature, BLK_FEAT_WRITE_ZEROES_UNMAP, to indicate
whether a device supports this unmap write zeroes operation. Then, add
a new counterpart flag, BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED, and a
sysfs entry, which allow users to disable this operation if it is very
slow on some special devices.

Finally, for stacked devices, BLK_FEAT_WRITE_ZEROES_UNMAP must be
supported by both the stacking driver and all underlying devices.
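
The interaction between the feature bit, the disable flag and stacking
can be modelled in a few lines of standalone C. This is a simplified
userspace sketch, not the kernel code: struct limits_model and both
function names are hypothetical, the bit positions are copied from the
blkdev.h hunk in this patch, and taking the minimum of
max_write_zeroes_sectors is an assumption about how the limits are
combined when stacking.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions mirror the definitions added to blkdev.h by this patch. */
#define FEAT_WRITE_ZEROES_UNMAP			(1u << 17)
#define FLAG_WRITE_ZEROES_UNMAP_DISABLED	(1u << 3)

/* Hypothetical, reduced stand-in for struct queue_limits. */
struct limits_model {
	uint32_t features;
	uint32_t flags;
	uint32_t max_write_zeroes_sectors;
};

/* Fold the limits of one bottom device into the stacked (top) limits. */
void stack_limits_model(struct limits_model *top,
			const struct limits_model *bottom)
{
	/* The feature survives only if every bottom device has it. */
	if (!(bottom->features & FEAT_WRITE_ZEROES_UNMAP))
		top->features &= ~FEAT_WRITE_ZEROES_UNMAP;

	/* Assumed: the stacked limit is the smallest of all devices. */
	if (bottom->max_write_zeroes_sectors < top->max_write_zeroes_sectors)
		top->max_write_zeroes_sectors =
			bottom->max_write_zeroes_sectors;

	/* Without write zeroes support there is nothing to unmap. */
	if (!top->max_write_zeroes_sectors)
		top->features &= ~FEAT_WRITE_ZEROES_UNMAP;
}

/* Effective state: supported by the device and not disabled by the user. */
bool unmap_write_zeroes_enabled(const struct limits_model *lim)
{
	return (lim->features & FEAT_WRITE_ZEROES_UNMAP) &&
	       !(lim->flags & FLAG_WRITE_ZEROES_UNMAP_DISABLED);
}
```

The point of keeping the feature and the flag separate is that the
driver owns the former while the user owns the latter, so writing '0'
to the sysfs entry never loses the information that the hardware
supports the command.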

Thanks to Martin K. Petersen for optimizing the documentation of the
write_zeroes_unmap sysfs interface.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 Documentation/ABI/stable/sysfs-block | 20 ++++++++++++++++++++
 block/blk-settings.c                 |  6 ++++++
 block/blk-sysfs.c                    | 25 +++++++++++++++++++++++++
 include/linux/blkdev.h               | 18 ++++++++++++++++++
 4 files changed, 69 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 4ba771b56b3b..8e7d513286c4 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -778,6 +778,26 @@ Description:
 		0, write zeroes is not supported by the device.
 
 
+What:		/sys/block/<disk>/queue/write_zeroes_unmap
+Date:		January 2025
+Contact:	Zhang Yi <yi.zhang@huawei.com>
+Description:
+		[RW] When read, this file will display whether the device has
+		enabled the unmap write zeroes operation. This operation
+		indicates that the device supports zeroing data in a specified
+		block range without incurring the cost of physically writing
+		zeroes to media for each individual block. It implements a
+		zeroing operation which opportunistically avoids writing zeroes
+		to media while still guaranteeing that subsequent reads from the
+		specified block range will return zeroed data. This operation is
+		a best-effort optimization, a device may fall back to physically
+		writing zeroes to media due to other factors such as
+		misalignment or being asked to clear a block range smaller than
+		the device's internal allocation unit. So the speed of this
+		operation is not guaranteed. Writing a value of '0' to this file
+		disables this operation.
+
+
 What:		/sys/block/<disk>/queue/zone_append_max_bytes
 Date:		May 2020
 Contact:	linux-block@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index a000daafbfb4..de99763fd668 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -698,6 +698,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		t->features &= ~BLK_FEAT_NOWAIT;
 	if (!(b->features & BLK_FEAT_POLL))
 		t->features &= ~BLK_FEAT_POLL;
+	if (!(b->features & BLK_FEAT_WRITE_ZEROES_UNMAP))
+		t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
 
 	t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
 
@@ -820,6 +822,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		t->zone_write_granularity = 0;
 		t->max_zone_append_sectors = 0;
 	}
+
+	if (!t->max_write_zeroes_sectors)
+		t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
+
 	blk_stack_atomic_writes_limits(t, b, start);
 
 	return ret;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index b2b9b89d6967..e918b2c93aed 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -457,6 +457,29 @@ static int queue_wc_store(struct gendisk *disk, const char *page,
 	return 0;
 }
 
+static ssize_t queue_write_zeroes_unmap_show(struct gendisk *disk, char *page)
+{
+	return sysfs_emit(page, "%u\n",
+			  blk_queue_write_zeroes_unmap(disk->queue));
+}
+
+static int queue_write_zeroes_unmap_store(struct gendisk *disk,
+		const char *page, size_t count, struct queue_limits *lim)
+{
+	unsigned long val;
+	ssize_t ret;
+
+	ret = queue_var_store(&val, page, count);
+	if (ret < 0)
+		return ret;
+
+	if (val)
+		lim->flags &= ~BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED;
+	else
+		lim->flags |= BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED;
+	return 0;
+}
+
 #define QUEUE_RO_ENTRY(_prefix, _name)			\
 static struct queue_sysfs_entry _prefix##_entry = {	\
 	.attr	= { .name = _name, .mode = 0444 },	\
@@ -514,6 +537,7 @@ QUEUE_LIM_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
 
 QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes");
+QUEUE_LIM_RW_ENTRY(queue_write_zeroes_unmap, "write_zeroes_unmap");
 QUEUE_LIM_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
 
@@ -662,6 +686,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_atomic_write_unit_min_entry.attr,
 	&queue_atomic_write_unit_max_entry.attr,
 	&queue_max_write_zeroes_sectors_entry.attr,
+	&queue_write_zeroes_unmap_entry.attr,
 	&queue_max_zone_append_sectors_entry.attr,
 	&queue_zone_write_granularity_entry.attr,
 	&queue_rotational_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 332b56f323d9..6f1cf97b1f00 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -340,6 +340,9 @@ typedef unsigned int __bitwise blk_features_t;
 #define BLK_FEAT_ATOMIC_WRITES \
 	((__force blk_features_t)(1u << 16))
 
+/* supports unmap write zeroes command */
+#define BLK_FEAT_WRITE_ZEROES_UNMAP	((__force blk_features_t)(1u << 17))
+
 /*
  * Flags automatically inherited when stacking limits.
  */
@@ -360,6 +363,10 @@ typedef unsigned int __bitwise blk_flags_t;
 /* passthrough command IO accounting */
 #define BLK_FLAG_IOSTATS_PASSTHROUGH	((__force blk_flags_t)(1u << 2))
 
+/* disable the unmap write zeroes operation */
+#define BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED \
+					((__force blk_flags_t)(1u << 3))
+
 struct queue_limits {
 	blk_features_t		features;
 	blk_flags_t		flags;
@@ -1378,6 +1385,17 @@ static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
 	return bdev_limits(bdev)->max_write_zeroes_sectors;
 }
 
+static inline bool blk_queue_write_zeroes_unmap(struct request_queue *q)
+{
+	return (q->limits.features & BLK_FEAT_WRITE_ZEROES_UNMAP) &&
+		!(q->limits.flags & BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED);
+}
+
+static inline bool bdev_write_zeroes_unmap(struct block_device *bdev)
+{
+	return blk_queue_write_zeroes_unmap(bdev_get_queue(bdev));
+}
+
 static inline bool bdev_nonrot(struct block_device *bdev)
 {
 	return blk_queue_nonrot(bdev_get_queue(bdev));
-- 
2.46.1



* [PATCH 02/10] nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit
  2025-06-04  2:08 [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
  2025-06-04  2:08 ` [PATCH 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features Zhang Yi
@ 2025-06-04  2:08 ` Zhang Yi
  2025-06-04  2:08 ` [PATCH 03/10] nvme-multipath: add BLK_FEAT_WRITE_ZEROES_UNMAP support Zhang Yi
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Zhang Yi @ 2025-06-04  2:08 UTC
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, yi.zhang, chengzhihao1, yukuai3,
	yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

When the device supports the Write Zeroes command and the DEAC bit, it
indicates that the deallocate bit in the Write Zeroes command is
supported, and the bytes read from a deallocated logical block are
zeroes. This means the device supports unmap Write Zeroes, so set the
BLK_FEAT_WRITE_ZEROES_UNMAP feature to the device's queue limit.
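
The dlfeat condition described above can be checked in isolation. The
following is a minimal sketch using the same bit test as the hunk
below; the macro and function names are hypothetical and exist only
for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the dlfeat fields used by the check. */
#define DLFEAT_READ_MASK	0x7		/* bits 2:0, read behavior */
#define DLFEAT_READ_ZEROES	0x1		/* deallocated blocks read 0x00 */
#define DLFEAT_WZ_DEAC		(1u << 3)	/* bit 3, DEAC supported */

/*
 * True if Write Zeroes with the deallocate bit is supported and reads
 * from deallocated logical blocks are guaranteed to return zeroes.
 */
bool dlfeat_unmap_write_zeroes(uint8_t dlfeat)
{
	return (dlfeat & DLFEAT_READ_MASK) == DLFEAT_READ_ZEROES &&
	       (dlfeat & DLFEAT_WZ_DEAC);
}
```

For the device in the cover letter, dlfeat is 25 (0x19): bits 2:0 are
0x1 and bit 3 is set, so the check passes.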

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f69a232a000a..0ac3dffe2a3d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2421,22 +2421,25 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
 	else
 		lim.write_stream_granularity = 0;
 
-	ret = queue_limits_commit_update(ns->disk->queue, &lim);
-	if (ret) {
-		blk_mq_unfreeze_queue(ns->disk->queue, memflags);
-		goto out;
-	}
-
-	set_capacity_and_notify(ns->disk, capacity);
-
 	/*
 	 * Only set the DEAC bit if the device guarantees that reads from
 	 * deallocated data return zeroes.  While the DEAC bit does not
 	 * require that, it must be a no-op if reads from deallocated data
 	 * do not return zeroes.
 	 */
-	if ((id->dlfeat & 0x7) == 0x1 && (id->dlfeat & (1 << 3)))
+	if ((id->dlfeat & 0x7) == 0x1 && (id->dlfeat & (1 << 3))) {
 		ns->head->features |= NVME_NS_DEAC;
+		if (lim.max_write_zeroes_sectors)
+			lim.features |= BLK_FEAT_WRITE_ZEROES_UNMAP;
+	}
+
+	ret = queue_limits_commit_update(ns->disk->queue, &lim);
+	if (ret) {
+		blk_mq_unfreeze_queue(ns->disk->queue, memflags);
+		goto out;
+	}
+
+	set_capacity_and_notify(ns->disk, capacity);
 	set_disk_ro(ns->disk, nvme_ns_is_readonly(ns, info));
 	set_bit(NVME_NS_READY, &ns->flags);
 	blk_mq_unfreeze_queue(ns->disk->queue, memflags);
-- 
2.46.1



* [PATCH 03/10] nvme-multipath: add BLK_FEAT_WRITE_ZEROES_UNMAP support
  2025-06-04  2:08 [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
  2025-06-04  2:08 ` [PATCH 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features Zhang Yi
  2025-06-04  2:08 ` [PATCH 02/10] nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit Zhang Yi
@ 2025-06-04  2:08 ` Zhang Yi
  2025-06-04  2:08 ` [PATCH 04/10] nvmet: set WZDS and DRB if device supports BLK_FEAT_WRITE_ZEROES_UNMAP Zhang Yi
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Zhang Yi @ 2025-06-04  2:08 UTC
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, yi.zhang, chengzhihao1, yukuai3,
	yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Set the BLK_FEAT_WRITE_ZEROES_UNMAP feature while creating multipath
stacking queue limits by default. This feature shall be disabled if any
attached namespace does not support it.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/multipath.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 878ea8b1a0ac..6a6d827fa7ee 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -745,7 +745,8 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
 	blk_set_stacking_limits(&lim);
 	lim.dma_alignment = 3;
 	lim.features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT |
-		BLK_FEAT_POLL | BLK_FEAT_ATOMIC_WRITES;
+			BLK_FEAT_POLL | BLK_FEAT_ATOMIC_WRITES |
+			BLK_FEAT_WRITE_ZEROES_UNMAP;
 	if (head->ids.csi == NVME_CSI_ZNS)
 		lim.features |= BLK_FEAT_ZONED;
 
-- 
2.46.1



* [PATCH 04/10] nvmet: set WZDS and DRB if device supports BLK_FEAT_WRITE_ZEROES_UNMAP
  2025-06-04  2:08 [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
                   ` (2 preceding siblings ...)
  2025-06-04  2:08 ` [PATCH 03/10] nvme-multipath: add BLK_FEAT_WRITE_ZEROES_UNMAP support Zhang Yi
@ 2025-06-04  2:08 ` Zhang Yi
  2025-06-04  2:08 ` [PATCH 05/10] scsi: sd: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports unmap zeroing mode Zhang Yi
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Zhang Yi @ 2025-06-04  2:08 UTC
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, yi.zhang, chengzhihao1, yukuai3,
	yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Set the WZDS and DRB bits in the namespace dlfeat if the underlying
block device supports BLK_FEAT_WRITE_ZEROES_UNMAP, so that the nvme
target device supports the unmapped write zeroes command.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/target/io-cmd-bdev.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index 83be0657e6df..052da6174548 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -46,6 +46,10 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
 	id->npda = id->npdg;
 	/* NOWS = Namespace Optimal Write Size */
 	id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
+
+	/* Set WZDS and DRB if device supports unmapped write zeroes */
+	if (bdev_write_zeroes_unmap(bdev))
+		id->dlfeat = (1 << 3) | 0x1;
 }
 
 void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
-- 
2.46.1



* [PATCH 05/10] scsi: sd: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports unmap zeroing mode
  2025-06-04  2:08 [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
                   ` (3 preceding siblings ...)
  2025-06-04  2:08 ` [PATCH 04/10] nvmet: set WZDS and DRB if device supports BLK_FEAT_WRITE_ZEROES_UNMAP Zhang Yi
@ 2025-06-04  2:08 ` Zhang Yi
  2025-06-04  2:08 ` [PATCH 06/10] dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support Zhang Yi
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Zhang Yi @ 2025-06-04  2:08 UTC
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, yi.zhang, chengzhihao1, yukuai3,
	yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

When the device supports the Write Zeroes command and the zeroing mode
is set to SD_ZERO_WS16_UNMAP or SD_ZERO_WS10_UNMAP, this means that the
device supports unmap Write Zeroes, so set the corresponding
BLK_FEAT_WRITE_ZEROES_UNMAP feature to the device's queue limit.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/sd.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 3f6e87705b62..c34b7fac876d 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1118,6 +1118,11 @@ static void sd_config_write_same(struct scsi_disk *sdkp,
 	else
 		sdkp->zeroing_mode = SD_ZERO_WRITE;
 
+	if (sdkp->max_ws_blocks &&
+	    (sdkp->zeroing_mode == SD_ZERO_WS16_UNMAP ||
+	     sdkp->zeroing_mode == SD_ZERO_WS10_UNMAP))
+		lim->features |= BLK_FEAT_WRITE_ZEROES_UNMAP;
+
 	if (sdkp->max_ws_blocks &&
 	    sdkp->physical_block_size > logical_block_size) {
 		/*
-- 
2.46.1



* [PATCH 06/10] dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support
  2025-06-04  2:08 [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
                   ` (4 preceding siblings ...)
  2025-06-04  2:08 ` [PATCH 05/10] scsi: sd: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports unmap zeroing mode Zhang Yi
@ 2025-06-04  2:08 ` Zhang Yi
  2025-06-04  2:08 ` [PATCH 07/10] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate Zhang Yi
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Zhang Yi @ 2025-06-04  2:08 UTC
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, yi.zhang, chengzhihao1, yukuai3,
	yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Set the BLK_FEAT_WRITE_ZEROES_UNMAP feature on stacking queue limits by
default. This feature shall be disabled if any underlying device does
not support it.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 drivers/md/dm-table.c | 7 +++++--
 drivers/md/dm.c       | 1 +
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 6b23e777e10e..4d450713b69d 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -599,7 +599,8 @@ int dm_split_args(int *argc, char ***argvp, char *input)
 static void dm_set_stacking_limits(struct queue_limits *limits)
 {
 	blk_set_stacking_limits(limits);
-	limits->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL;
+	limits->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL |
+			    BLK_FEAT_WRITE_ZEROES_UNMAP;
 }
 
 /*
@@ -1851,8 +1852,10 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 		limits->discard_alignment = 0;
 	}
 
-	if (!dm_table_supports_write_zeroes(t))
+	if (!dm_table_supports_write_zeroes(t)) {
 		limits->max_write_zeroes_sectors = 0;
+		limits->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
+	}
 
 	if (!dm_table_supports_secure_erase(t))
 		limits->max_secure_erase_sectors = 0;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 5ab7574c0c76..b59c3dbeaaf1 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1096,6 +1096,7 @@ void disable_write_zeroes(struct mapped_device *md)
 
 	/* device doesn't really support WRITE ZEROES, disable it */
 	limits->max_write_zeroes_sectors = 0;
+	limits->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
 }
 
 static bool swap_bios_limit(struct dm_target *ti, struct bio *bio)
-- 
2.46.1



* [PATCH 07/10] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
  2025-06-04  2:08 [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
                   ` (5 preceding siblings ...)
  2025-06-04  2:08 ` [PATCH 06/10] dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support Zhang Yi
@ 2025-06-04  2:08 ` Zhang Yi
  2025-06-11 15:05   ` Darrick J. Wong
  2025-06-04  2:08 ` [PATCH 08/10] block: factor out common part in blkdev_fallocate() Zhang Yi
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 28+ messages in thread
From: Zhang Yi @ 2025-06-04  2:08 UTC
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, yi.zhang, chengzhihao1, yukuai3,
	yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

With the development of flash-based storage devices, we can quickly
write zeros to SSDs using the WRITE_ZERO command if the devices do not
actually write physical zeroes to the media. Therefore, we can use this
command to quickly preallocate a real all-zero file with written
extents. This approach should be beneficial for subsequent pure
overwriting within this file, as it can save on block allocation and,
consequently, significant metadata changes, which should greatly improve
overwrite performance on certain filesystems.

Therefore, introduce a new operation FALLOC_FL_WRITE_ZEROES to
fallocate. This flag is used to convert a specified range of a file to
zeros by issuing a zeroing operation. Blocks should be allocated for the
regions that span holes in the file, and the entire range is converted
to written extents. If the underlying device supports an offloaded
write zeroes command, the zeroing operation can be accelerated. If it
does not, we currently don't prevent the file system from writing
actual zeros to the device. This gives users a new way to quickly
generate a zeroed file: they no longer need to write zero data to
create a file with written extents.

Users can determine whether a disk supports the unmap write zeroes
operation through querying this sysfs interface:

    /sys/block/<disk>/queue/write_zeroes_unmap

Finally, this flag cannot be specified in conjunction with
FALLOC_FL_KEEP_SIZE, since allocating written extents beyond file EOF is
not permitted. In addition, filesystems that always require out-of-place
writes should not support this flag, since they still need to allocate
new blocks during subsequent overwrites.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/open.c                   |  1 +
 include/linux/falloc.h      |  3 ++-
 include/uapi/linux/falloc.h | 18 ++++++++++++++++++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/open.c b/fs/open.c
index 7828234a7caa..b777e11e5522 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -281,6 +281,7 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		break;
 	case FALLOC_FL_COLLAPSE_RANGE:
 	case FALLOC_FL_INSERT_RANGE:
+	case FALLOC_FL_WRITE_ZEROES:
 		if (mode & FALLOC_FL_KEEP_SIZE)
 			return -EOPNOTSUPP;
 		break;
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 3f49f3df6af5..7c38c6b76b60 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -36,7 +36,8 @@ struct space_resv {
 				 FALLOC_FL_COLLAPSE_RANGE |	\
 				 FALLOC_FL_ZERO_RANGE |		\
 				 FALLOC_FL_INSERT_RANGE |	\
-				 FALLOC_FL_UNSHARE_RANGE)
+				 FALLOC_FL_UNSHARE_RANGE |	\
+				 FALLOC_FL_WRITE_ZEROES)
 
 /* on ia32 l_start is on a 32-bit boundary */
 #if defined(CONFIG_X86_64)
diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
index 5810371ed72b..265aae7ff8c1 100644
--- a/include/uapi/linux/falloc.h
+++ b/include/uapi/linux/falloc.h
@@ -78,4 +78,22 @@
  */
 #define FALLOC_FL_UNSHARE_RANGE		0x40
 
+/*
+ * FALLOC_FL_WRITE_ZEROES is used to convert a specified range of a file to
+ * zeros by issuing a zeroing operation. Blocks should be allocated for the
+ * regions that span holes in the file, and the entire range is converted to
+ * written extents. This flag is beneficial for subsequent pure overwriting
+ * within this range, as it can save on block allocation and, consequently,
+ * significant metadata changes. Therefore, filesystems that always require
+ * out-of-place writes should not support this flag.
+ *
+ * Different filesystems may implement different limitations on the
+ * granularity of the zeroing operation. Most will preferably be accelerated
+ * by submitting write zeroes command if the backing storage supports, which
+ * may not physically write zeros to the media.
+ *
+ * This flag cannot be specified in conjunction with the FALLOC_FL_KEEP_SIZE.
+ */
+#define FALLOC_FL_WRITE_ZEROES		0x80
+
 #endif /* _UAPI_FALLOC_H_ */
-- 
2.46.1
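
[Editorial sketch, not part of the posted patch.] From userspace the new
mode would be requested through fallocate(2). The helpers below are
hypothetical names chosen for illustration; the only values taken from
the patch are the flag value 0x80 and the rule that the flag cannot be
combined with FALLOC_FL_KEEP_SIZE. The mode check is a simplification of
what vfs_fallocate() does, not a copy of it.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>

/* Older uapi headers lack the flag; 0x80 is the value proposed in this patch. */
#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES 0x80
#endif

/*
 * Hypothetical helper: returns non-zero if 'mode' requests write-zeroes
 * without FALLOC_FL_KEEP_SIZE, the combination this patch rejects with
 * -EOPNOTSUPP. Other invalid flag combinations are not checked here.
 */
static int write_zeroes_mode_ok(int mode)
{
	return (mode & FALLOC_FL_WRITE_ZEROES) && !(mode & FALLOC_FL_KEEP_SIZE);
}

/*
 * Hypothetical helper: ask the kernel for a zeroed, written range.
 * Returns 0 on success or a negative errno value on failure.
 */
static int zero_fill_written(int fd, off_t offset, off_t len)
{
	if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, offset, len) == 0)
		return 0;
	return -errno;
}
```

A caller that gets -EOPNOTSUPP back (older kernel, unsupported
filesystem, or a device with unmap write zeroes disabled) could
presumably fall back to FALLOC_FL_ZERO_RANGE or to writing zeros itself.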



* [PATCH 08/10] block: factor out common part in blkdev_fallocate()
  2025-06-04  2:08 [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
                   ` (6 preceding siblings ...)
  2025-06-04  2:08 ` [PATCH 07/10] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate Zhang Yi
@ 2025-06-04  2:08 ` Zhang Yi
  2025-06-04  2:08 ` [PATCH 09/10] block: add FALLOC_FL_WRITE_ZEROES support Zhang Yi
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Zhang Yi @ 2025-06-04  2:08 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, yi.zhang, chengzhihao1, yukuai3,
	yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Only the flags passed to blkdev_issue_zeroout() differ between the two
zeroing branches in blkdev_fallocate(). Therefore, clean up by
factoring out the common part.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/fops.c | 32 ++++++++++++++------------------
 1 file changed, 14 insertions(+), 18 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 1309861d4c2c..e1c921549d28 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -850,6 +850,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	struct block_device *bdev = I_BDEV(inode);
 	loff_t end = start + len - 1;
 	loff_t isize;
+	unsigned int flags;
 	int error;
 
 	/* Fail if we don't recognize the flags. */
@@ -877,34 +878,29 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	inode_lock(inode);
 	filemap_invalidate_lock(inode->i_mapping);
 
-	/*
-	 * Invalidate the page cache, including dirty pages, for valid
-	 * de-allocate mode calls to fallocate().
-	 */
 	switch (mode) {
 	case FALLOC_FL_ZERO_RANGE:
 	case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
-		error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
-		if (error)
-			goto fail;
-
-		error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
-					     len >> SECTOR_SHIFT, GFP_KERNEL,
-					     BLKDEV_ZERO_NOUNMAP);
+		flags = BLKDEV_ZERO_NOUNMAP;
 		break;
 	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
-		error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
-		if (error)
-			goto fail;
-
-		error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
-					     len >> SECTOR_SHIFT, GFP_KERNEL,
-					     BLKDEV_ZERO_NOFALLBACK);
+		flags = BLKDEV_ZERO_NOFALLBACK;
 		break;
 	default:
 		error = -EOPNOTSUPP;
+		goto fail;
 	}
 
+	/*
+	 * Invalidate the page cache, including dirty pages, for valid
+	 * de-allocate mode calls to fallocate().
+	 */
+	error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
+	if (error)
+		goto fail;
+
+	error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
+				     len >> SECTOR_SHIFT, GFP_KERNEL, flags);
  fail:
 	filemap_invalidate_unlock(inode->i_mapping);
 	inode_unlock(inode);
-- 
2.46.1



* [PATCH 09/10] block: add FALLOC_FL_WRITE_ZEROES support
  2025-06-04  2:08 [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
                   ` (7 preceding siblings ...)
  2025-06-04  2:08 ` [PATCH 08/10] block: factor out common part in blkdev_fallocate() Zhang Yi
@ 2025-06-04  2:08 ` Zhang Yi
  2025-06-11  6:10   ` Christoph Hellwig
  2025-06-04  2:08 ` [PATCH 10/10] ext4: " Zhang Yi
  2025-06-10  1:47 ` [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Martin K. Petersen
  10 siblings, 1 reply; 28+ messages in thread
From: Zhang Yi @ 2025-06-04  2:08 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, yi.zhang, chengzhihao1, yukuai3,
	yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Add support for FALLOC_FL_WRITE_ZEROES. If the block device enables the
unmap write zeroes operation, it will issue a write zeroes command;
otherwise the request fails with -EOPNOTSUPP.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 block/fops.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/block/fops.c b/block/fops.c
index e1c921549d28..050c16f5974a 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -841,7 +841,7 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 #define	BLKDEV_FALLOC_FL_SUPPORTED					\
 		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
-		 FALLOC_FL_ZERO_RANGE)
+		 FALLOC_FL_ZERO_RANGE | FALLOC_FL_WRITE_ZEROES)
 
 static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 			     loff_t len)
@@ -856,6 +856,13 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	/* Fail if we don't recognize the flags. */
 	if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
 		return -EOPNOTSUPP;
+	/*
+	 * Don't allow writing zeroes if the device does not enable the
+	 * unmap write zeroes operation.
+	 */
+	if (!bdev_write_zeroes_unmap(bdev) &&
+	    (mode & FALLOC_FL_WRITE_ZEROES))
+		return -EOPNOTSUPP;
 
 	/* Don't go off the end of the device. */
 	isize = bdev_nr_bytes(bdev);
@@ -886,6 +893,9 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
 		flags = BLKDEV_ZERO_NOFALLBACK;
 		break;
+	case FALLOC_FL_WRITE_ZEROES:
+		flags = 0;
+		break;
 	default:
 		error = -EOPNOTSUPP;
 		goto fail;
-- 
2.46.1
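
[Editorial sketch, not part of the posted patch.] After patches 08 and
09, blkdev_fallocate() only has to pick the flags for
blkdev_issue_zeroout() from the fallocate mode. The userspace model
below restates that decision as a pure function. The function name and
the SKETCH_ZERO_* constants are stand-ins invented for illustration
(the kernel's BLKDEV_ZERO_* flags are not available to userspace), and
the 'unmap_enabled' argument plays the role of
bdev_write_zeroes_unmap().

```c
#include <errno.h>
#include <linux/falloc.h>

/* Older uapi headers lack the flag; 0x80 is the value proposed in patch 07. */
#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES 0x80
#endif

/* Illustrative stand-ins for the kernel-internal BLKDEV_ZERO_* flags. */
#define SKETCH_ZERO_NOUNMAP	(1u << 0)
#define SKETCH_ZERO_NOFALLBACK	(1u << 1)

/*
 * Model of the mode switch: on success store the zeroout flags in *flags
 * and return 0; otherwise return -EOPNOTSUPP and leave *flags untouched.
 */
static int zeroout_flags_for_mode(int mode, int unmap_enabled,
				  unsigned int *flags)
{
	/* Write-zeroes is refused unless the device enables unmap write zeroes. */
	if ((mode & FALLOC_FL_WRITE_ZEROES) && !unmap_enabled)
		return -EOPNOTSUPP;

	switch (mode) {
	case FALLOC_FL_ZERO_RANGE:
	case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
		*flags = SKETCH_ZERO_NOUNMAP;
		return 0;
	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
		*flags = SKETCH_ZERO_NOFALLBACK;
		return 0;
	case FALLOC_FL_WRITE_ZEROES:
		*flags = 0;	/* no restriction: unmap and fallback both allowed */
		return 0;
	default:
		return -EOPNOTSUPP;
	}
}
```

Keeping the mapping separate from the truncate and zeroout calls is the
point of the refactor in patch 08: adding a mode becomes a three-line
case.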



* [PATCH 10/10] ext4: add FALLOC_FL_WRITE_ZEROES support
  2025-06-04  2:08 [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
                   ` (8 preceding siblings ...)
  2025-06-04  2:08 ` [PATCH 09/10] block: add FALLOC_FL_WRITE_ZEROES support Zhang Yi
@ 2025-06-04  2:08 ` Zhang Yi
  2025-06-10  1:47 ` [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Martin K. Petersen
  10 siblings, 0 replies; 28+ messages in thread
From: Zhang Yi @ 2025-06-04  2:08 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, yi.zhang, chengzhihao1, yukuai3,
	yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Add support for FALLOC_FL_WRITE_ZEROES if the underlying device enables
the unmap write zeroes operation. This first allocates blocks as
unwritten, then issues a zero command outside of the running journal
handle, and finally converts them to a written state.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c           | 66 ++++++++++++++++++++++++++++++-------
 include/trace/events/ext4.h |  3 +-
 2 files changed, 57 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index b543a46fc809..29ce9f6287d0 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4501,6 +4501,8 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 	struct ext4_map_blocks map;
 	unsigned int credits;
 	loff_t epos, old_size = i_size_read(inode);
+	unsigned int blkbits = inode->i_blkbits;
+	bool alloc_zero = false;
 
 	BUG_ON(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS));
 	map.m_lblk = offset;
@@ -4513,6 +4515,17 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 	if (len <= EXT_UNWRITTEN_MAX_LEN)
 		flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;
 
+	/*
+	 * Do the actual write zero during a running journal transaction
+	 * costs a lot. First allocate an unwritten extent and then
+	 * convert it to written after zeroing it out.
+	 */
+	if (flags & EXT4_GET_BLOCKS_ZERO) {
+		flags &= ~EXT4_GET_BLOCKS_ZERO;
+		flags |= EXT4_GET_BLOCKS_UNWRIT_EXT;
+		alloc_zero = true;
+	}
+
 	/*
 	 * credits to insert 1 extent into extent tree
 	 */
@@ -4549,9 +4562,7 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 		 * allow a full retry cycle for any remaining allocations
 		 */
 		retries = 0;
-		map.m_lblk += ret;
-		map.m_len = len = len - ret;
-		epos = (loff_t)map.m_lblk << inode->i_blkbits;
+		epos = (loff_t)(map.m_lblk + ret) << blkbits;
 		inode_set_ctime_current(inode);
 		if (new_size) {
 			if (epos > new_size)
@@ -4571,6 +4582,21 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 		ret2 = ret3 ? ret3 : ret2;
 		if (unlikely(ret2))
 			break;
+
+		if (alloc_zero &&
+		    (map.m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN))) {
+			ret2 = ext4_issue_zeroout(inode, map.m_lblk, map.m_pblk,
+						  map.m_len);
+			if (likely(!ret2))
+				ret2 = ext4_convert_unwritten_extents(NULL,
+					inode, (loff_t)map.m_lblk << blkbits,
+					(loff_t)map.m_len << blkbits);
+			if (ret2)
+				break;
+		}
+
+		map.m_lblk += ret;
+		map.m_len = len = len - ret;
 	}
 	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
 		goto retry;
@@ -4636,7 +4662,11 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	if (end_lblk > start_lblk) {
 		ext4_lblk_t zero_blks = end_lblk - start_lblk;
 
-		flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | EXT4_EX_NOCACHE);
+		if (mode & FALLOC_FL_WRITE_ZEROES)
+			flags = EXT4_GET_BLOCKS_CREATE_ZERO | EXT4_EX_NOCACHE;
+		else
+			flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN |
+				  EXT4_EX_NOCACHE);
 		ret = ext4_alloc_file_blocks(file, start_lblk, zero_blks,
 					     new_size, flags);
 		if (ret)
@@ -4745,11 +4775,18 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	if (IS_ENCRYPTED(inode) &&
 	    (mode & (FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_INSERT_RANGE)))
 		return -EOPNOTSUPP;
+	/*
+	 * Don't allow writing zeroes if the underlying device does not
+	 * enable the unmap write zeroes operation.
+	 */
+	if (!bdev_write_zeroes_unmap(inode->i_sb->s_bdev) &&
+	    (mode & FALLOC_FL_WRITE_ZEROES))
+		return -EOPNOTSUPP;
 
 	/* Return error if mode is not supported */
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
-		     FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |
-		     FALLOC_FL_INSERT_RANGE))
+		     FALLOC_FL_ZERO_RANGE | FALLOC_FL_COLLAPSE_RANGE |
+		     FALLOC_FL_INSERT_RANGE | FALLOC_FL_WRITE_ZEROES))
 		return -EOPNOTSUPP;
 
 	inode_lock(inode);
@@ -4780,16 +4817,23 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	if (ret)
 		goto out_invalidate_lock;
 
-	if (mode & FALLOC_FL_PUNCH_HOLE)
+	switch (mode & FALLOC_FL_MODE_MASK) {
+	case FALLOC_FL_PUNCH_HOLE:
 		ret = ext4_punch_hole(file, offset, len);
-	else if (mode & FALLOC_FL_COLLAPSE_RANGE)
+		break;
+	case FALLOC_FL_COLLAPSE_RANGE:
 		ret = ext4_collapse_range(file, offset, len);
-	else if (mode & FALLOC_FL_INSERT_RANGE)
+		break;
+	case FALLOC_FL_INSERT_RANGE:
 		ret = ext4_insert_range(file, offset, len);
-	else if (mode & FALLOC_FL_ZERO_RANGE)
+		break;
+	case FALLOC_FL_ZERO_RANGE:
+	case FALLOC_FL_WRITE_ZEROES:
 		ret = ext4_zero_range(file, offset, len, mode);
-	else
+		break;
+	default:
 		ret = -EOPNOTSUPP;
+	}
 
 out_invalidate_lock:
 	filemap_invalidate_unlock(mapping);
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 156908641e68..6f9cf2811733 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -92,7 +92,8 @@ TRACE_DEFINE_ENUM(ES_REFERENCED_B);
 	{ FALLOC_FL_KEEP_SIZE,		"KEEP_SIZE"},		\
 	{ FALLOC_FL_PUNCH_HOLE,		"PUNCH_HOLE"},		\
 	{ FALLOC_FL_COLLAPSE_RANGE,	"COLLAPSE_RANGE"},	\
-	{ FALLOC_FL_ZERO_RANGE,		"ZERO_RANGE"})
+	{ FALLOC_FL_ZERO_RANGE,		"ZERO_RANGE"},		\
+	{ FALLOC_FL_WRITE_ZEROES,	"WRITE_ZEROES"})
 
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_XATTR);
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_CROSS_RENAME);
-- 
2.46.1
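
[Editorial sketch, not part of the posted patch.] The commit message
describes a three-step sequence per mapped chunk: allocate as unwritten,
zero the blocks, then convert to written, advancing the logical block
only afterwards. The toy model below walks through that ordering in
userspace. Every name in it is hypothetical, and the 16-block "file" and
chunked allocator are assumptions made purely so the loop has something
to iterate over; it does not model journalling or error recovery.

```c
#include <stddef.h>

#define TOY_NBLOCKS 16

enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_WRITTEN };

struct toy_file {
	enum blk_state state[TOY_NBLOCKS];
	unsigned char zeroed[TOY_NBLOCKS];
};

/* Toy allocator: maps at most max_chunk blocks per call, as unwritten. */
static size_t toy_alloc_unwritten(struct toy_file *f, size_t lblk,
				  size_t len, size_t max_chunk)
{
	size_t n = len < max_chunk ? len : max_chunk;

	for (size_t i = 0; i < n; i++)
		f->state[lblk + i] = BLK_UNWRITTEN;
	return n;
}

/* Returns 0 on success, -1 if the range is invalid or nothing was mapped. */
static int toy_write_zeroes(struct toy_file *f, size_t lblk, size_t len,
			    size_t max_chunk)
{
	if (lblk > TOY_NBLOCKS || len > TOY_NBLOCKS - lblk)
		return -1;

	while (len) {
		/* Step 1: allocate the next chunk as unwritten. */
		size_t got = toy_alloc_unwritten(f, lblk, len, max_chunk);

		if (!got)
			return -1;
		/* Step 2: zero exactly the chunk that was just mapped. */
		for (size_t i = 0; i < got; i++)
			f->zeroed[lblk + i] = 1;
		/* Step 3: convert the zeroed chunk to written. */
		for (size_t i = 0; i < got; i++)
			f->state[lblk + i] = BLK_WRITTEN;
		/* Only now advance, so steps 2 and 3 saw the chunk's start. */
		lblk += got;
		len -= got;
	}
	return 0;
}
```

The late advance mirrors the hunk that moves the map.m_lblk and
map.m_len updates to the end of the loop body, after the zeroout and
conversion have used the chunk's starting block.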



* Re: [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag
  2025-06-04  2:08 [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
                   ` (9 preceding siblings ...)
  2025-06-04  2:08 ` [PATCH 10/10] ext4: " Zhang Yi
@ 2025-06-10  1:47 ` Martin K. Petersen
  2025-06-16 14:27   ` Christian Brauner
  10 siblings, 1 reply; 28+ messages in thread
From: Martin K. Petersen @ 2025-06-10  1:47 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi, linux-xfs, linux-kernel, hch, tytso, djwong,
	john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, chengzhihao1, yukuai3, yangerkun


Zhang,

> Changes since RFC v4:
>  - Rebase codes on 6.16-rc1.
>  - Add a new queue_limit flag, and change the write_zeroes_unmap sysfs
>    interface to RW mode. User can disable the unmap write zeroes
>    operation by writing '0' to it when the operation is slow.
>  - Modify the documentation of write_zeroes_unmap sysfs interface as
>    Martin suggested.
>  - Remove the statx interface.
>  - Make the bdev and ext4 don't allow to submit FALLOC_FL_WRITE_ZEROES
>    if the block device does not enable the unmap write zeroes operation,
>    it should return -EOPNOTSUPP.

This looks OK to me as long as the fs folks agree on the fallocate()
semantics.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen


* Re: [PATCH 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  2025-06-04  2:08 ` [PATCH 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features Zhang Yi
@ 2025-06-11  6:09   ` Christoph Hellwig
  2025-06-11  7:31     ` Zhang Yi
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2025-06-11  6:09 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi, linux-xfs, linux-kernel, hch, tytso, djwong,
	john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, chengzhihao1, yukuai3, yangerkun

On Wed, Jun 04, 2025 at 10:08:41AM +0800, Zhang Yi wrote:
> +static ssize_t queue_write_zeroes_unmap_show(struct gendisk *disk, char *page)

..

> +static int queue_write_zeroes_unmap_store(struct gendisk *disk,
> +		const char *page, size_t count, struct queue_limits *lim)

We're probably getting close to wanting macros for the sysfs
flags, similar to the one for the features (QUEUE_SYSFS_FEATURE).

No need to do this now, just thinking along.

> +/* supports unmap write zeroes command */
> +#define BLK_FEAT_WRITE_ZEROES_UNMAP	((__force blk_features_t)(1u << 17))


Should this be exposed through sysfs as a read-only value?

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 09/10] block: add FALLOC_FL_WRITE_ZEROES support
  2025-06-04  2:08 ` [PATCH 09/10] block: add FALLOC_FL_WRITE_ZEROES support Zhang Yi
@ 2025-06-11  6:10   ` Christoph Hellwig
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2025-06-11  6:10 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi, linux-xfs, linux-kernel, hch, tytso, djwong,
	john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, chengzhihao1, yukuai3, yangerkun

On Wed, Jun 04, 2025 at 10:08:49AM +0800, Zhang Yi wrote:
> @@ -856,6 +856,13 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
>  	/* Fail if we don't recognize the flags. */
>  	if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
>  		return -EOPNOTSUPP;
> +	/*
> +	 * Don't allow writing zeroes if the device does not enable the
> +	 * unmap write zeroes operation.
> +	 */
> +	if (!bdev_write_zeroes_unmap(bdev) &&
> +	    (mode & FALLOC_FL_WRITE_ZEROES))

Cosmetic nitpick, but I'd turn the check around to check the mode first
as that's easier to read.  The whole check also fits onto a single line:

	if ((mode & FALLOC_FL_WRITE_ZEROES) && !bdev_write_zeroes_unmap(bdev))

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  2025-06-11  6:09   ` Christoph Hellwig
@ 2025-06-11  7:31     ` Zhang Yi
  2025-06-12  4:47       ` Christoph Hellwig
  0 siblings, 1 reply; 28+ messages in thread
From: Zhang Yi @ 2025-06-11  7:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi, linux-xfs, linux-kernel, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, chengzhihao1, yukuai3, yangerkun

On 2025/6/11 14:09, Christoph Hellwig wrote:
> On Wed, Jun 04, 2025 at 10:08:41AM +0800, Zhang Yi wrote:
>> +static ssize_t queue_write_zeroes_unmap_show(struct gendisk *disk, char *page)
> 
> ..
> 
>> +static int queue_write_zeroes_unmap_store(struct gendisk *disk,
>> +		const char *page, size_t count, struct queue_limits *lim)
> 
> We're probably getting close to wanting macros for the sysfs
> flags, similar to the one for the features (QUEUE_SYSFS_FEATURE).
> 
> No need to do this now, just thinking along.

Yes.

> 
>> +/* supports unmap write zeroes command */
>> +#define BLK_FEAT_WRITE_ZEROES_UNMAP	((__force blk_features_t)(1u << 17))
> 
> 
> Should this be exposed through sysfs as a read-only value?

Uh, are you suggesting adding another sysfs interface to expose
this feature?

> 
> Otherwise looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks,
Yi.



* Re: [PATCH 07/10] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
  2025-06-04  2:08 ` [PATCH 07/10] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate Zhang Yi
@ 2025-06-11 15:05   ` Darrick J. Wong
  2025-06-12 11:37     ` Zhang Yi
  0 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2025-06-11 15:05 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi, linux-xfs, linux-kernel, hch, tytso, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, chengzhihao1, yukuai3, yangerkun,
	linux-api

[cc linux-api about a fallocate uapi change]

On Wed, Jun 04, 2025 at 10:08:47AM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> With the development of flash-based storage devices, we can quickly
> write zeros to SSDs using the WRITE_ZERO command if the devices do not
> actually write physical zeroes to the media. Therefore, we can use this
> command to quickly preallocate a real all-zero file with written
> extents. This approach should be beneficial for subsequent pure
> overwriting within this file, as it can save on block allocation and,
> consequently, significant metadata changes, which should greatly improve
> overwrite performance on certain filesystems.
> 
> Therefore, introduce a new operation FALLOC_FL_WRITE_ZEROES to
> fallocate. This flag is used to convert a specified range of a file to
> zeros by issuing a zeroing operation. Blocks should be allocated for the
> regions that span holes in the file, and the entire range is converted
> to written extents. If the underlying device supports the actual offload
> write zeroes command, the process of zeroing out operation can be
> accelerated. If it does not, we currently don't prevent the file system
> from writing actual zeros to the device. This provides users with a new
> method to quickly generate a zeroed file, users no longer need to write
> zero data to create a file with written extents.
> 
> Users can determine whether a disk supports the unmap write zeroes
> operation through querying this sysfs interface:
> 
>     /sys/block/<disk>/queue/write_zeroes_unmap
> 
> Finally, this flag cannot be specified in conjunction with the
> FALLOC_FL_KEEP_SIZE since allocating written extents beyond file EOF is
> not permitted. In addition, filesystems that always require out-of-place
> writes should not support this flag since they still need to allocate
> new blocks during subsequent overwrites.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/open.c                   |  1 +
>  include/linux/falloc.h      |  3 ++-
>  include/uapi/linux/falloc.h | 18 ++++++++++++++++++
>  3 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/open.c b/fs/open.c
> index 7828234a7caa..b777e11e5522 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -281,6 +281,7 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  		break;
>  	case FALLOC_FL_COLLAPSE_RANGE:
>  	case FALLOC_FL_INSERT_RANGE:
> +	case FALLOC_FL_WRITE_ZEROES:
>  		if (mode & FALLOC_FL_KEEP_SIZE)
>  			return -EOPNOTSUPP;
>  		break;
> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> index 3f49f3df6af5..7c38c6b76b60 100644
> --- a/include/linux/falloc.h
> +++ b/include/linux/falloc.h
> @@ -36,7 +36,8 @@ struct space_resv {
>  				 FALLOC_FL_COLLAPSE_RANGE |	\
>  				 FALLOC_FL_ZERO_RANGE |		\
>  				 FALLOC_FL_INSERT_RANGE |	\
> -				 FALLOC_FL_UNSHARE_RANGE)
> +				 FALLOC_FL_UNSHARE_RANGE |	\
> +				 FALLOC_FL_WRITE_ZEROES)
>  
>  /* on ia32 l_start is on a 32-bit boundary */
>  #if defined(CONFIG_X86_64)
> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> index 5810371ed72b..265aae7ff8c1 100644
> --- a/include/uapi/linux/falloc.h
> +++ b/include/uapi/linux/falloc.h
> @@ -78,4 +78,22 @@
>   */
>  #define FALLOC_FL_UNSHARE_RANGE		0x40
>  
> +/*
> + * FALLOC_FL_WRITE_ZEROES is used to convert a specified range of a file to
> + * zeros by issuing a zeroing operation. Blocks should be allocated for the
> + * regions that span holes in the file, and the entire range is converted to
> + * written extents.

I think you could simplify this a bit by talking only about the end
state after a successful call:

"FALLOC_FL_WRITE_ZEROES zeroes a specified file range in such a way that
subsequent writes to that range do not require further changes to file
mapping metadata."

Note that we don't say how the filesystem gets to this goal.  Presumably
the first implementations will send a zeroing operation to the block
device during allocation and the fs will create written mappings, but
there are other ways to get there -- a filesystem could maintain a pool
of pre-zeroed space and hand those out; or it could zero space on
freeing and mounting such that all new mappings can be created as
written even without the block device zeroing operation.

Or you could be running on some carefully engineered system where you
know the storage will always be zeroed at allocation time due to some
other aspect of the system design, e.g. a single-use throwaway cloud vm
where you allocate to the end of the disk and reboot the node.

> + *                  This flag is beneficial for subsequent pure overwriting
> + * within this range, as it can save on block allocation and, consequently,
> + * significant metadata changes. Therefore, filesystems that always require
> + * out-of-place writes should not support this flag.
> + *
> + * Different filesystems may implement different limitations on the
> + * granularity of the zeroing operation. Most will preferably be accelerated
> + * by submitting write zeroes command if the backing storage supports, which
> + * may not physically write zeros to the media.
> + *
> + * This flag cannot be specified in conjunction with the FALLOC_FL_KEEP_SIZE.
> + */
> +#define FALLOC_FL_WRITE_ZEROES		0x80

The rest of the writeup seems fine to me.

--D

> +
>  #endif /* _UAPI_FALLOC_H_ */
> -- 
> 2.46.1
> 
> 


* Re: [PATCH 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  2025-06-11  7:31     ` Zhang Yi
@ 2025-06-12  4:47       ` Christoph Hellwig
  2025-06-12 11:20         ` Zhang Yi
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2025-06-12  4:47 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Christoph Hellwig, linux-fsdevel, linux-ext4, linux-block,
	dm-devel, linux-nvme, linux-scsi, linux-xfs, linux-kernel, tytso,
	djwong, john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki,
	brauner, martin.petersen, yi.zhang, chengzhihao1, yukuai3,
	yangerkun

On Wed, Jun 11, 2025 at 03:31:21PM +0800, Zhang Yi wrote:
> >> +/* supports unmap write zeroes command */
> >> +#define BLK_FEAT_WRITE_ZEROES_UNMAP	((__force blk_features_t)(1u << 17))
> > 
> > 
> > Should this be exposed through sysfs as a read-only value?
> 
> Uh, are you suggesting adding another sysfs interface to expose
> this feature?

That was the idea.  Or do we have another way to report this capability?



* Re: [PATCH 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  2025-06-12  4:47       ` Christoph Hellwig
@ 2025-06-12 11:20         ` Zhang Yi
  2025-06-12 15:03           ` Darrick J. Wong
  0 siblings, 1 reply; 28+ messages in thread
From: Zhang Yi @ 2025-06-12 11:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi, linux-xfs, linux-kernel, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, chengzhihao1, yukuai3, yangerkun

On 2025/6/12 12:47, Christoph Hellwig wrote:
> On Wed, Jun 11, 2025 at 03:31:21PM +0800, Zhang Yi wrote:
>>>> +/* supports unmap write zeroes command */
>>>> +#define BLK_FEAT_WRITE_ZEROES_UNMAP	((__force blk_features_t)(1u << 17))
>>>
>>>
>>> Should this be exposed through sysfs as a read-only value?
>>
>> Uh, are you suggesting adding another sysfs interface to expose
>> this feature?
> 
> That was the idea.  Or do we have another way to report this capability?
> 

Exposing this feature looks useful, but I think adding a new interface
might be somewhat redundant, and it's also difficult to name the new
interface. What about extending this interface to report three states?
When read, it would expose one of the following:

 - none     : the device doesn't support BLK_FEAT_WRITE_ZEROES_UNMAP.
 - enabled  : the device supports BLK_FEAT_WRITE_ZEROES_UNMAP, but the
              BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED is not set.
 - disabled : the device supports BLK_FEAT_WRITE_ZEROES_UNMAP, and the
              BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED is set.

Users can write '0' or '1' to disable or enable this operation if the
state is not 'none'. Thoughts?

Best regards,
Yi.
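
[Editorial sketch, not part of the thread.] If the three-state proposal
above were adopted, a userspace tool could classify the string read from
/sys/block/<disk>/queue/write_zeroes_unmap roughly as follows. The enum
and function names are hypothetical, and the strings "none", "enabled"
and "disabled" are taken from the proposal only; they are not what the
posted series exposes.

```c
#include <string.h>

enum wzu_state {
	WZU_UNKNOWN = -1,	/* anything the proposal does not define */
	WZU_NONE,		/* device lacks BLK_FEAT_WRITE_ZEROES_UNMAP */
	WZU_ENABLED,		/* supported and not disabled */
	WZU_DISABLED,		/* supported but disabled by the user */
};

/* Classify one sysfs value; a trailing newline, if any, is ignored. */
static enum wzu_state parse_write_zeroes_unmap(const char *s)
{
	size_t n = strcspn(s, "\n");

	if (n == 4 && !strncmp(s, "none", n))
		return WZU_NONE;
	if (n == 7 && !strncmp(s, "enabled", n))
		return WZU_ENABLED;
	if (n == 8 && !strncmp(s, "disabled", n))
		return WZU_DISABLED;
	return WZU_UNKNOWN;
}
```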



* Re: [PATCH 07/10] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
  2025-06-11 15:05   ` Darrick J. Wong
@ 2025-06-12 11:37     ` Zhang Yi
  0 siblings, 0 replies; 28+ messages in thread
From: Zhang Yi @ 2025-06-12 11:37 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi, linux-xfs, linux-kernel, hch, tytso, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, chengzhihao1, yukuai3, yangerkun,
	linux-api

On 2025/6/11 23:05, Darrick J. Wong wrote:
> [cc linux-api about a fallocate uapi change]
> 
> On Wed, Jun 04, 2025 at 10:08:47AM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> With the development of flash-based storage devices, we can quickly
>> write zeros to SSDs using the WRITE_ZERO command if the devices do not
>> actually write physical zeroes to the media. Therefore, we can use this
>> command to quickly preallocate a real all-zero file with written
>> extents. This approach should be beneficial for subsequent pure
>> overwriting within this file, as it can save on block allocation and,
>> consequently, significant metadata changes, which should greatly improve
>> overwrite performance on certain filesystems.
>>
>> Therefore, introduce a new operation FALLOC_FL_WRITE_ZEROES to
>> fallocate. This flag is used to convert a specified range of a file to
>> zeros by issuing a zeroing operation. Blocks should be allocated for the
>> regions that span holes in the file, and the entire range is converted
>> to written extents. If the underlying device supports the actual offload
>> write zeroes command, the process of zeroing out operation can be
>> accelerated. If it does not, we currently don't prevent the file system
>> from writing actual zeros to the device. This provides users with a new
>> method to quickly generate a zeroed file, users no longer need to write
>> zero data to create a file with written extents.
>>
>> Users can determine whether a disk supports the unmap write zeroes
>> operation through querying this sysfs interface:
>>
>>     /sys/block/<disk>/queue/write_zeroes_unmap
>>
>> Finally, this flag cannot be specified in conjunction with the
>> FALLOC_FL_KEEP_SIZE since allocating written extents beyond file EOF is
>> not permitted. In addition, filesystems that always require out-of-place
>> writes should not support this flag since they still need to allocate
>> new blocks during subsequent overwrites.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> ---
>>  fs/open.c                   |  1 +
>>  include/linux/falloc.h      |  3 ++-
>>  include/uapi/linux/falloc.h | 18 ++++++++++++++++++
>>  3 files changed, 21 insertions(+), 1 deletion(-)
>>
[...]
>> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
>> index 5810371ed72b..265aae7ff8c1 100644
>> --- a/include/uapi/linux/falloc.h
>> +++ b/include/uapi/linux/falloc.h
>> @@ -78,4 +78,22 @@
>>   */
>>  #define FALLOC_FL_UNSHARE_RANGE		0x40
>>  
>> +/*
>> + * FALLOC_FL_WRITE_ZEROES is used to convert a specified range of a file to
>> + * zeros by issuing a zeroing operation. Blocks should be allocated for the
>> + * regions that span holes in the file, and the entire range is converted to
>> + * written extents.
> 
> I think you could simplify this a bit by talking only about the end
> state after a successful call:
> 
> "FALLOC_FL_WRITE_ZEROES zeroes a specified file range in such a way that
> subsequent writes to that range do not require further changes to file
> mapping metadata."
> 
> Note that we don't say how the filesystem gets to this goal.  Presumably
> the first implementations will send a zeroing operation to the block
> device during allocation and the fs will create written mappings, but
> there are other ways to get there -- a filesystem could maintain a pool
> of pre-zeroed space and hand those out; or it could zero space on
> freeing and mounting such that all new mappings can be created as
> written even without the block device zeroing operation.
> 
> Or you could be running on some carefully engineered system where you
> know the storage will always be zeroed at allocation time due to some
> other aspect of the system design, e.g. a single-use throwaway cloud vm
> where you allocate to the end of the disk and reboot the node.

Indeed, that makes sense to me. It is more generic and abstracts away
the methods by which different file systems may achieve this goal. Thank
you for the suggestion.
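
For reference, the flag constraint stated in the quoted commit message
(the new flag cannot be combined with FALLOC_FL_KEEP_SIZE) can be
sketched as a standalone check. This is only an illustration: the EX_
constants and ex_write_zeroes_mode_ok() are hypothetical names, and the
0x80 value for the new flag is an assumption, since the define itself is
elided from the hunk quoted above:

```c
#include <assert.h>
#include <stdbool.h>

#define EX_FALLOC_FL_KEEP_SIZE		0x01	/* same value as FALLOC_FL_KEEP_SIZE */
#define EX_FALLOC_FL_WRITE_ZEROES	0x80	/* assumed value, not shown in the quoted hunk */

/*
 * Return true if the mode is acceptable as far as the write-zeroes rule
 * is concerned: written extents may not be allocated beyond EOF, so the
 * new flag must not be combined with KEEP_SIZE.
 */
bool ex_write_zeroes_mode_ok(int mode)
{
	if (!(mode & EX_FALLOC_FL_WRITE_ZEROES))
		return true;	/* rule does not apply to other modes */
	return !(mode & EX_FALLOC_FL_KEEP_SIZE);
}
```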

Best regards,
Yi.



* Re: [PATCH 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  2025-06-12 11:20         ` Zhang Yi
@ 2025-06-12 15:03           ` Darrick J. Wong
  2025-06-13  3:15             ` Zhang Yi
  0 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2025-06-12 15:03 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Christoph Hellwig, linux-fsdevel, linux-ext4, linux-block,
	dm-devel, linux-nvme, linux-scsi, linux-xfs, linux-kernel, tytso,
	john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, chengzhihao1, yukuai3, yangerkun

On Thu, Jun 12, 2025 at 07:20:45PM +0800, Zhang Yi wrote:
> On 2025/6/12 12:47, Christoph Hellwig wrote:
> > On Wed, Jun 11, 2025 at 03:31:21PM +0800, Zhang Yi wrote:
> >>>> +/* supports unmap write zeroes command */
> >>>> +#define BLK_FEAT_WRITE_ZEROES_UNMAP	((__force blk_features_t)(1u << 17))
> >>>
> >>>
> >>> Should this be exposed through sysfs as a read-only value?
> >>
> >> Uh, are you suggesting adding another sysfs interface to expose
> >> this feature?
> > 
> > That was the idea.  Or do we have another way to report this capability?
> > 
> 
> Exposing this feature looks useful, but I think adding a new interface
> might be somewhat redundant, and it's also difficult to name the new
> interface. What about extending this interface to include 3 types? When
> read, it exposes the following:
> 
>  - none     : the device doesn't support BLK_FEAT_WRITE_ZEROES_UNMAP.
>  - enabled  : the device supports BLK_FEAT_WRITE_ZEROES_UNMAP, but the
>               BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED is not set.
>  - disabled : the device supports BLK_FEAT_WRITE_ZEROES_UNMAP, and the
>               BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED is set.
> 
> Users can write '0' and '1' to disable and enable this operation if it
> is not 'none', thoughts?

Perhaps it should reuse the enumeration pattern elsewhere in sysfs?
For example,

# cat /sys/block/sda/queue/scheduler
none [mq-deadline]
# echo none > /sys/block/sda/queue/scheduler
# cat /sys/block/sda/queue/scheduler
[none] mq-deadline

(Annoying that this seems to be open-coded wherever it appears...)
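
A rough userspace sketch of what a shared helper for that bracketed
output could look like. format_enum_attr() is a hypothetical name, not
an existing kernel function, and the return convention is an assumption
of the sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a list of option names the way the scheduler attribute does:
 * space-separated, with the active entry wrapped in brackets.
 * Returns the number of bytes written (excluding the NUL), or -1 if
 * the buffer is too small.
 */
int format_enum_attr(char *buf, size_t len, const char *const *names,
		     size_t count, size_t active)
{
	size_t off = 0;

	assert(buf && names && active < count);
	for (size_t i = 0; i < count; i++) {
		/* Separator first, then the name, bracketed if active. */
		int n = snprintf(buf + off, len - off,
				 i == active ? "%s[%s]" : "%s%s",
				 i ? " " : "", names[i]);
		if (n < 0 || (size_t)n >= len - off)
			return -1;
		off += (size_t)n;
	}
	return (int)off;
}
```

Called with the names { "none", "mq-deadline" } and active index 1, it
produces the first line of the example above; with active index 0 it
produces the last one.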

--D

> Best regards,
> Yi.
> 
> 


* Re: [PATCH 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  2025-06-12 15:03           ` Darrick J. Wong
@ 2025-06-13  3:15             ` Zhang Yi
  2025-06-13  5:56               ` Christoph Hellwig
  0 siblings, 1 reply; 28+ messages in thread
From: Zhang Yi @ 2025-06-13  3:15 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, linux-fsdevel, linux-ext4, linux-block,
	dm-devel, linux-nvme, linux-scsi, linux-xfs, linux-kernel, tytso,
	john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, chengzhihao1, yukuai3, yangerkun

On 2025/6/12 23:03, Darrick J. Wong wrote:
> On Thu, Jun 12, 2025 at 07:20:45PM +0800, Zhang Yi wrote:
>> On 2025/6/12 12:47, Christoph Hellwig wrote:
>>> On Wed, Jun 11, 2025 at 03:31:21PM +0800, Zhang Yi wrote:
>>>>>> +/* supports unmap write zeroes command */
>>>>>> +#define BLK_FEAT_WRITE_ZEROES_UNMAP	((__force blk_features_t)(1u << 17))
>>>>>
>>>>>
>>>>> Should this be exposed through sysfs as a read-only value?
>>>>
>>>> Uh, are you suggesting adding another sysfs interface to expose
>>>> this feature?
>>>
>>> That was the idea.  Or do we have another way to report this capability?
>>>
>>
>> Exposing this feature looks useful, but I think adding a new interface
>> might be somewhat redundant, and it's also difficult to name the new
>> interface. What about extending this interface to include 3 types? When
>> read, it exposes the following:
>>
>>  - none     : the device doesn't support BLK_FEAT_WRITE_ZEROES_UNMAP.
>>  - enabled  : the device supports BLK_FEAT_WRITE_ZEROES_UNMAP, but the
>>               BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED is not set.
>>  - disabled : the device supports BLK_FEAT_WRITE_ZEROES_UNMAP, and the
>>               BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED is set.
>>
>> Users can write '0' and '1' to disable and enable this operation if it
>> is not 'none', thoughts?
> 
> Perhaps it should reuse the enumeration pattern elsewhere in sysfs?
> For example,
> 
> # cat /sys/block/sda/queue/scheduler
> none [mq-deadline]
> # echo none > /sys/block/sda/queue/scheduler
> # cat /sys/block/sda/queue/scheduler
> [none] mq-deadline
> 
> (Annoying that this seems to be opencoded wherever it appears...)
> 

Yeah, this solution looks good to me. However, we currently have only
two selections (none and unmap). What if we keep it as is and simply
hide this interface if BLK_FEAT_WRITE_ZEROES_UNMAP is not set, making
it visible only when the device supports this feature? Something like
below:

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index e918b2c93aed..204ee4d5f63f 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -747,6 +747,9 @@ static umode_t queue_attr_visible(struct kobject *kobj, struct attribute *attr,
             attr == &queue_max_active_zones_entry.attr) &&
            !blk_queue_is_zoned(q))
                return 0;
+       if (attr == &queue_write_zeroes_unmap_entry.attr &&
+           !(q->limits.features & BLK_FEAT_WRITE_ZEROES_UNMAP))
+               return 0;

        return attr->mode;
 }

Thanks,
Yi.



* Re: [PATCH 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  2025-06-13  3:15             ` Zhang Yi
@ 2025-06-13  5:56               ` Christoph Hellwig
  2025-06-13 14:54                 ` Darrick J. Wong
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2025-06-13  5:56 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Darrick J. Wong, Christoph Hellwig, linux-fsdevel, linux-ext4,
	linux-block, dm-devel, linux-nvme, linux-scsi, linux-xfs,
	linux-kernel, tytso, john.g.garry, bmarzins, chaitanyak,
	shinichiro.kawasaki, brauner, martin.petersen, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

On Fri, Jun 13, 2025 at 11:15:41AM +0800, Zhang Yi wrote:
> Yeah, this solution looks good to me. However, we currently have only
> two selections (none and unmap). What if we keep it as is and simply
> hide this interface if BLK_FEAT_WRITE_ZEROES_UNMAP is not set, making
> it visible only when the device supports this feature? Something like
> below:

I really hate having all kinds of different interfaces for configurations.
Maybe we should redo this similar to the other hardware/software interfaces
and have a hw_ limit that is exposed by the driver and read-only in
sysfs, and then the user configurable one without _hw.  Setting it to
zero disables the feature.
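
A minimal sketch of that model in plain C, assuming the effective limit
is simply the user value capped by the hardware value. The struct, its
field names, and both helpers are hypothetical and not taken from the
kernel:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical limit pair, modeled on the hw/user split described above.
 * Neither field name is taken from the kernel.
 */
struct wz_unmap_limits {
	unsigned int max_hw_sectors;	/* set by the driver, read-only in sysfs */
	unsigned int max_user_sectors;	/* set by the administrator via sysfs */
};

/* Effective limit: the user value, capped by what the hardware reports. */
unsigned int wz_unmap_effective_sectors(const struct wz_unmap_limits *lim)
{
	assert(lim);
	return lim->max_user_sectors < lim->max_hw_sectors ?
	       lim->max_user_sectors : lim->max_hw_sectors;
}

/* The feature is usable only when the effective limit is non-zero. */
bool wz_unmap_enabled(const struct wz_unmap_limits *lim)
{
	return wz_unmap_effective_sectors(lim) != 0;
}
```

Here a device without the capability reports a hw limit of 0, and an
administrator who writes 0 to the user limit gets the same result, so
one check covers both "unsupported" and "disabled".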



* Re: [PATCH 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  2025-06-13  5:56               ` Christoph Hellwig
@ 2025-06-13 14:54                 ` Darrick J. Wong
  2025-06-14  4:48                   ` Zhang Yi
  0 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2025-06-13 14:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Zhang Yi, linux-fsdevel, linux-ext4, linux-block, dm-devel,
	linux-nvme, linux-scsi, linux-xfs, linux-kernel, tytso,
	john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, chengzhihao1, yukuai3, yangerkun

On Fri, Jun 13, 2025 at 07:56:30AM +0200, Christoph Hellwig wrote:
> On Fri, Jun 13, 2025 at 11:15:41AM +0800, Zhang Yi wrote:
> > Yeah, this solution looks good to me. However, we currently have only
> > two selections (none and unmap). What if we keep it as is and simply
> > hide this interface if BLK_FEAT_WRITE_ZEROES_UNMAP is not set, making
> > it visible only when the device supports this feature? Something like
> > below:
> 
> I really hate having all kinds of different interfaces for configurations.

I really hate the open-coded string parsing nonsense that is sysfs. ;)

> Maybe we should redo this similar to the other hardware/software interfaces
> and have a hw_ limit that is exposed by the driver and read-only in
> sysfs, and then the user configurable one without _hw.  Setting it to
> zero disables the feature.

Yeah, that fits the /sys/block/foo/queue model better.

--D


* Re: [PATCH 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  2025-06-13 14:54                 ` Darrick J. Wong
@ 2025-06-14  4:48                   ` Zhang Yi
  2025-06-16  5:39                     ` Christoph Hellwig
  0 siblings, 1 reply; 28+ messages in thread
From: Zhang Yi @ 2025-06-14  4:48 UTC (permalink / raw)
  To: Darrick J. Wong, Christoph Hellwig
  Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi, linux-xfs, linux-kernel, tytso, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
	martin.petersen, yi.zhang, chengzhihao1, yukuai3, yangerkun

On 2025/6/13 22:54, Darrick J. Wong wrote:
> On Fri, Jun 13, 2025 at 07:56:30AM +0200, Christoph Hellwig wrote:
>> On Fri, Jun 13, 2025 at 11:15:41AM +0800, Zhang Yi wrote:
>>> Yeah, this solution looks good to me. However, we currently have only
>>> two selections (none and unmap). What if we keep it as is and simply
>>> hide this interface if BLK_FEAT_WRITE_ZEROES_UNMAP is not set, making
>>> it visible only when the device supports this feature? Something like
>>> below:
>>
>> I really hate having all kinds of different interfaces for configurations.
> 
> I really hate the open-coded string parsing nonsense that is sysfs. ;)
> 
>> Maybe we should redo this similar to the other hardware/software interfaces
>> and have a hw_ limit that is exposed by the driver and read-only in
>> sysfs, and then the user configurable one without _hw.  Setting it to
>> zero disables the feature.
> 
> Yeah, that fits the /sys/block/foo/queue model better.
> 

OK. Let me confirm: are you both suggesting adding
max_hw_write_zeroes_unmap_sectors and max_write_zeroes_unmap_sectors to
the queue_limits instead of adding BLK_FEAT_WRITE_ZEROES_UNMAP to
queue_limits->features? Something like the following:

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 378d3a1a22fc..14394850863c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -376,7 +376,9 @@ struct queue_limits {
        unsigned int            max_hw_discard_sectors;
        unsigned int            max_user_discard_sectors;
        unsigned int            max_secure_erase_sectors;
-       unsigned int            max_write_zeroes_sectors;
+       unsigned int            max_hw_write_zeroes_sectors;
+       unsigned int            max_hw_write_zeroes_unmap_sectors;
+       unsigned int            max_write_zeroes_unmap_sectors;
        unsigned int            max_hw_zone_append_sectors;
        unsigned int            max_zone_append_sectors;
        unsigned int            discard_granularity;

Besides, we should also rename max_write_zeroes_sectors to
max_hw_write_zeroes_sectors since it is a hardware limitation reported
by the driver.  If the device supports unmap write zeroes,
max_hw_write_zeroes_unmap_sectors should be equal to
max_hw_write_zeroes_sectors, otherwise it should be 0.

Right?

Best regards,
Yi.



* Re: [PATCH 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  2025-06-14  4:48                   ` Zhang Yi
@ 2025-06-16  5:39                     ` Christoph Hellwig
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2025-06-16  5:39 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Darrick J. Wong, Christoph Hellwig, linux-fsdevel, linux-ext4,
	linux-block, dm-devel, linux-nvme, linux-scsi, linux-xfs,
	linux-kernel, tytso, john.g.garry, bmarzins, chaitanyak,
	shinichiro.kawasaki, brauner, martin.petersen, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

On Sat, Jun 14, 2025 at 12:48:26PM +0800, Zhang Yi wrote:
> >> Maybe we should redo this similar to the other hardware/software interfaces
> >> and have a hw_ limit that is exposed by the driver and read-only in
> >> sysfs, and then the user configurable one without _hw.  Setting it to
> >> zero disables the feature.
> > 
> > Yeah, that fits the /sys/block/foo/queue model better.
> > 
> 
> OK, well. Please let me confirm, are you both suggesting adding
> max_hw_write_zeroes_unmap_sectors and max_write_zeroes_unmap_sectors to
> the queue_limits instead of adding BLK_FEAT_WRITE_ZEROES_UNMAP to the
> queue_limits->features. Something like the following.

Yes.

> Besides, we should also rename max_write_zeroes_sectors to
> max_hw_write_zeroes_sectors since it is a hardware limitation reported
> by the driver.  If the device supports unmap write zeroes,
> max_hw_write_zeroes_unmap_sectors should be equal to
> max_hw_write_zeroes_sectors, otherwise it should be 0.

We've only used the hw names where we allow an override or a cap based
on other values.  So far we've not done any of that for
max_write_zeroes_sectors.



* Re: [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag
  2025-06-10  1:47 ` [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Martin K. Petersen
@ 2025-06-16 14:27   ` Christian Brauner
  2025-06-16 16:59     ` Martin K. Petersen
  0 siblings, 1 reply; 28+ messages in thread
From: Christian Brauner @ 2025-06-16 14:27 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Zhang Yi, linux-fsdevel, linux-ext4, linux-block, dm-devel,
	linux-nvme, linux-scsi, linux-xfs, linux-kernel, hch, tytso,
	djwong, john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki,
	yi.zhang, chengzhihao1, yukuai3, yangerkun

On Mon, Jun 09, 2025 at 09:47:13PM -0400, Martin K. Petersen wrote:
> 
> Zhang,
> 
> > Changes since RFC v4:
> >  - Rebase codes on 6.16-rc1.
> >  - Add a new queue_limit flag, and change the write_zeroes_unmap sysfs
> >    interface to RW mode. User can disable the unmap write zeroes
> >    operation by writing '0' to it when the operation is slow.
> >  - Modify the documentation of write_zeroes_unmap sysfs interface as
> >    Martin suggested.
> >  - Remove the statx interface.
> >  - Make the bdev and ext4 don't allow to submit FALLOC_FL_WRITE_ZEROES
> >    if the block device does not enable the unmap write zeroes operation,
> >    it should return -EOPNOTSUPP.
> 
> This looks OK to me as long as the fs folks agree on the fallocate()
> semantics.

That looks overall fine. Should I queue this up in the vfs tree?


* Re: [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag
  2025-06-16 14:27   ` Christian Brauner
@ 2025-06-16 16:59     ` Martin K. Petersen
  2025-06-17  2:25       ` Zhang Yi
  0 siblings, 1 reply; 28+ messages in thread
From: Martin K. Petersen @ 2025-06-16 16:59 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Martin K. Petersen, Zhang Yi, linux-fsdevel, linux-ext4,
	linux-block, dm-devel, linux-nvme, linux-scsi, linux-xfs,
	linux-kernel, hch, tytso, djwong, john.g.garry, bmarzins,
	chaitanyak, shinichiro.kawasaki, yi.zhang, chengzhihao1, yukuai3,
	yangerkun


Christian,

>> This looks OK to me as long as the fs folks agree on the fallocate()
>> semantics.
>
> That looks overall fine. Should I queue this up in the vfs tree?

We're expecting another revision addressing the queue limit sysfs
override. Otherwise I believe it's good to go.

-- 
Martin K. Petersen


* Re: [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag
  2025-06-16 16:59     ` Martin K. Petersen
@ 2025-06-17  2:25       ` Zhang Yi
  0 siblings, 0 replies; 28+ messages in thread
From: Zhang Yi @ 2025-06-17  2:25 UTC (permalink / raw)
  To: Martin K. Petersen, Christian Brauner
  Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi, linux-xfs, linux-kernel, hch, tytso, djwong,
	john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

On 2025/6/17 0:59, Martin K. Petersen wrote:
> 
> Christian,
> 
>>> This looks OK to me as long as the fs folks agree on the fallocate()
>>> semantics.
>>
>> That looks overall fine. Should I queue this up in the vfs tree?
> 
> We're expecting another revision addressing the queue limit sysfs
> override. Otherwise I believe it's good to go.
> 

Yeah, I'm going to revise the queue limits sysfs interface as
Christoph and Darrick suggested and send out v2.

Best regards,
Yi.



end of thread, other threads:[~2025-06-17  2:25 UTC | newest]

Thread overview: 28+ messages
2025-06-04  2:08 [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
2025-06-04  2:08 ` [PATCH 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features Zhang Yi
2025-06-11  6:09   ` Christoph Hellwig
2025-06-11  7:31     ` Zhang Yi
2025-06-12  4:47       ` Christoph Hellwig
2025-06-12 11:20         ` Zhang Yi
2025-06-12 15:03           ` Darrick J. Wong
2025-06-13  3:15             ` Zhang Yi
2025-06-13  5:56               ` Christoph Hellwig
2025-06-13 14:54                 ` Darrick J. Wong
2025-06-14  4:48                   ` Zhang Yi
2025-06-16  5:39                     ` Christoph Hellwig
2025-06-04  2:08 ` [PATCH 02/10] nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit Zhang Yi
2025-06-04  2:08 ` [PATCH 03/10] nvme-multipath: add BLK_FEAT_WRITE_ZEROES_UNMAP support Zhang Yi
2025-06-04  2:08 ` [PATCH 04/10] nvmet: set WZDS and DRB if device supports BLK_FEAT_WRITE_ZEROES_UNMAP Zhang Yi
2025-06-04  2:08 ` [PATCH 05/10] scsi: sd: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports unmap zeroing mode Zhang Yi
2025-06-04  2:08 ` [PATCH 06/10] dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support Zhang Yi
2025-06-04  2:08 ` [PATCH 07/10] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate Zhang Yi
2025-06-11 15:05   ` Darrick J. Wong
2025-06-12 11:37     ` Zhang Yi
2025-06-04  2:08 ` [PATCH 08/10] block: factor out common part in blkdev_fallocate() Zhang Yi
2025-06-04  2:08 ` [PATCH 09/10] block: add FALLOC_FL_WRITE_ZEROES support Zhang Yi
2025-06-11  6:10   ` Christoph Hellwig
2025-06-04  2:08 ` [PATCH 10/10] ext4: " Zhang Yi
2025-06-10  1:47 ` [PATCH 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Martin K. Petersen
2025-06-16 14:27   ` Christian Brauner
2025-06-16 16:59     ` Martin K. Petersen
2025-06-17  2:25       ` Zhang Yi
