linux-block.vger.kernel.org archive mirror
* [RFC PATCH -next v3 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag
@ 2025-03-18  7:35 Zhang Yi
  2025-03-18  7:35 ` [RFC PATCH -next v3 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features Zhang Yi
                   ` (9 more replies)
  0 siblings, 10 replies; 24+ messages in thread
From: Zhang Yi @ 2025-03-18  7:35 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Changes since RFC v2:
 - Rebase the code onto next-20250314.
 - Add support for nvme multipath.
 - Add support for NVMeT with block device backing.
 - Clear BLK_FEAT_WRITE_ZEROES_UNMAP if dm clears
   limits->max_write_zeroes_sectors.
 - Add the counterpart userspace tools (util-linux and xfs_io) and
   tests (blktests and xfstests); please see below for details.
Changes since RFC v1:
 - Switch to add a new write zeroes operation, FALLOC_FL_WRITE_ZEROES,
   in fallocate, instead of just adding a supported flag to
   FALLOC_FL_ZERO_RANGE.
 - Introduce a new flag BLK_FEAT_WRITE_ZEROES_UNMAP to the block
   device's queue limit features, and implement it on SCSI sd driver,
   NVMe SSD driver and dm driver.
 - Implement FALLOC_FL_WRITE_ZEROES on both the ext4 filesystem and
   block device (bdev).

RFC v2: https://lore.kernel.org/linux-fsdevel/20250115114637.2705887-1-yi.zhang@huaweicloud.com/
RFC v1: https://lore.kernel.org/linux-fsdevel/20241228014522.2395187-1-yi.zhang@huaweicloud.com/

The counterpart userspace tools changes and tests are here:
 - util-linux: https://lore.kernel.org/linux-fsdevel/20250318073218.3513262-1-yi.zhang@huaweicloud.com/ 
 - xfsprogs: https://lore.kernel.org/linux-fsdevel/20250318072318.3502037-1-yi.zhang@huaweicloud.com/
 - xfstests: https://lore.kernel.org/linux-fsdevel/20250318072615.3505873-1-yi.zhang@huaweicloud.com/
 - blktests: https://lore.kernel.org/linux-fsdevel/20250318072835.3508696-1-yi.zhang@huaweicloud.com/

Currently, we can use the fallocate command to quickly create a
pre-allocated file. However, on most filesystems, such as ext4 and XFS,
fallocate creates pre-allocated blocks in an unwritten state, and the
FALLOC_FL_ZERO_RANGE flag behaves similarly. The extent state must be
converted to a written state when the user later writes data into this
range, which can trigger numerous metadata changes and consequent
journal I/O. This may lead to significant write amplification and
performance degradation in synchronous write mode. Therefore, we need a
method to create a pre-allocated file with written extents that can be
used for pure overwriting. At the moment, the only method available is
to create an empty file and write zero data into it (for example, using
'dd' with a large block size). However, this method is slow and consumes
a considerable amount of disk bandwidth, so we must pre-allocate files
in advance and cannot add pre-allocated files while user business
services are running.

Fortunately, as flash-based storage devices become more and more widely
used, we can efficiently write zeros to SSDs using the unmap write
zeroes command if the devices do not write physical zeroes to the
media. For example, if SCSI SSDs support the UNMAP bit or NVMe SSDs
support the DEAC bit[1], the write zeroes command does not write actual
data to the device. Instead, the device converts the zeroed range to a
deallocated state, which is fast and consumes almost no disk write
bandwidth. Consequently, this feature can provide us with a faster
method for creating pre-allocated files with written extents and zeroed
data.

This series aims to implement this by:
1. Introduce a new feature BLK_FEAT_WRITE_ZEROES_UNMAP to the block
   device queue limit features, which indicates whether the storage
   device explicitly supports the unmapped write zeroes command. This
   flag should be set to 1 by the driver if the attached disk supports
   this command. Users can check this flag by querying:

       /sys/block/<disk>/queue/write_zeroes_unmap

2. Introduce a new flag FALLOC_FL_WRITE_ZEROES to fallocate.
   Filesystems that support this operation should allocate written
   extents and issue zeroes to the range of the device. If the device
   supports the unmap write zeroes command, the zeroing can be
   accelerated; if not, we currently still allow falling back to
   submitting zeroed data. Users can verify whether the device supports
   the unmap write zeroes command and then decide whether to use it.
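
From user space, the two steps above could be combined roughly as
follows. This is only a minimal sketch: the helper names
(parse_write_zeroes_unmap, disk_supports_unmap_write_zeroes,
prealloc_written) are made up for illustration, the sysfs attribute is
the one proposed by this series, and the 0x80 fallback definition
matches the value proposed in patch 07, which installed uapi headers may
not carry.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for fallocate() */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Value proposed by this series; older uapi headers do not define it. */
#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES	0x80
#endif

/* Hypothetical helper: 1 if the attribute text is "1", otherwise 0. */
int parse_write_zeroes_unmap(const char *text)
{
	assert(text != NULL);
	return text[0] == '1' && (text[1] == '\n' || text[1] == '\0');
}

/*
 * Hypothetical helper: read /sys/block/<disk>/queue/write_zeroes_unmap.
 * Returns 1 if supported, 0 if unsupported or unknown, and -1 if the
 * attribute cannot be read (e.g. on a kernel without this series).
 */
int disk_supports_unmap_write_zeroes(const char *disk)
{
	char path[256], buf[8];
	FILE *f;
	int ret = -1;

	snprintf(path, sizeof(path),
		 "/sys/block/%s/queue/write_zeroes_unmap", disk);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = parse_write_zeroes_unmap(buf);
	fclose(f);
	return ret;
}

/*
 * Hypothetical helper: preallocate len bytes of written, zeroed extents
 * at the start of the file. Returns 0 on success or a negative errno
 * (-EOPNOTSUPP when the kernel or filesystem lacks the flag).
 */
int prealloc_written(const char *file, off_t len)
{
	int fd = open(file, O_CREAT | O_WRONLY, 0644);
	int ret = 0;

	if (fd < 0)
		return -errno;
	if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, 0, len) < 0)
		ret = -errno;
	close(fd);
	return ret;
}
```

A caller might first check disk_supports_unmap_write_zeroes("nvme5n1")
and only call prealloc_written() when it returns 1, since on other
devices the flag may fall back to writing real zeroes.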

This series implements the BLK_FEAT_WRITE_ZEROES_UNMAP flag for SCSI,
NVMe and device-mapper drivers, and adds FALLOC_FL_WRITE_ZEROES
support for ext4 and raw bdev devices. Any comments are welcome.

I've tested performance with this series on an ext4 filesystem on my
machine with an Intel Xeon Gold 6248R CPU and a 7TB KCD61LUL7T68 NVMe
SSD that supports the unmap write zeroes command with the Deallocated
state and the DEAC bit. Feel free to give it a try.

0. Ensure the NVMe device supports the Write Zeroes command.

 $ cat /sys/block/nvme5n1/queue/write_zeroes_max_bytes
   8388608
 $ nvme id-ns -H /dev/nvme5n1 | grep -i -A 3 "dlfeat"
   dlfeat  : 25
   [4:4] : 0x1   Guard Field of Deallocated Logical Blocks is set to CRC
                 of The Value Read
   [3:3] : 0x1   Deallocate Bit in the Write Zeroes Command is Supported
   [2:0] : 0x1   Bytes Read From a Deallocated Logical Block and its
                 Metadata are 0x00
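
The dlfeat value 25 above is 0b11001, i.e. bits 4, 3 and 0 are set. As
a rough illustration, the condition that patch 02 in this series tests
can be written as a small standalone helper. The macro and function
names below are hypothetical; only the bit positions come from the
output above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the DLFEAT bit fields shown in the output above. */
#define DLFEAT_READ_MASK	0x7		/* bits 2:0: read behavior */
#define DLFEAT_READ_ZEROES	0x1		/* deallocated blocks read as 0x00 */
#define DLFEAT_WZ_DEALLOC	(1u << 3)	/* Deallocate bit supported in Write Zeroes */

/* True if Write Zeroes may deallocate and deallocated blocks read as zeroes. */
bool dlfeat_unmap_write_zeroes(uint8_t dlfeat)
{
	return (dlfeat & DLFEAT_READ_MASK) == DLFEAT_READ_ZEROES &&
	       (dlfeat & DLFEAT_WZ_DEALLOC) != 0;
}

/* Self-check against the value reported for the test device above. */
void dlfeat_selfcheck(void)
{
	assert(dlfeat_unmap_write_zeroes(25));		/* 0b11001 */
	assert(!dlfeat_unmap_write_zeroes(0x08));	/* bits 2:0 are not 0x1 */
}
```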

1. Compare 'dd' and fallocate with unmap write zeroes; the latter is
   significantly faster than 'dd'.

   Create a 1GB and a 10GB zeroed file.
    $ dd if=/dev/zero of=foo bs=2M count=$count oflag=direct
    $ time fallocate -w -l $size bar

    #1G
    dd:                     0.5s
    FALLOC_FL_WRITE_ZEROES: 0.17s

    #10G
    dd:                     5.0s
    FALLOC_FL_WRITE_ZEROES: 1.7s

2. Run fio overwrite and fallocate with unmap write zeroes
   simultaneously; fallocate has little impact on write bandwidth and
   only slightly affects write latency.

 a) Test bandwidth costs.
  $ fio -directory=/test -direct=1 -iodepth=10 -fsync=0 -rw=write \
        -numjobs=10 -bs=2M -ioengine=libaio -size=20G -runtime=20 \
        -fallocate=none -overwrite=1 -group_reporting -name=bw_test

   Without background zero range:
    bw (MiB/s): min= 2068, max= 2280, per=100.00%, avg=2186.40

   With background zero range:
    bw (MiB/s): min= 2056, max= 2308, per=100.00%, avg=2186.20

 b) Test write latency costs.
  $ fio -filename=/test/foo -direct=1 -iodepth=1 -fsync=0 -rw=write \
        -numjobs=1 -bs=4k -ioengine=psync -size=5G -runtime=20 \
        -fallocate=none -overwrite=1 -group_reporting -name=lat_test

   Without background zero range:
   lat (nsec): min=9269, max=71635, avg=9840.65

   With background zero range:
   lat (usec): min=9, max=982, avg=11.03

3. Compare overwriting in a pre-allocated unwritten file and a written
   file in O_DSYNC mode. Writing to a file with written extents is much
   faster.

  # First run mkfs and create a test file according to the three cases
  # below, and then run fio.

  $ fio -filename=/test/foo -direct=1 -iodepth=1 -fdatasync=1 \
        -rw=write -numjobs=1 -bs=4k -ioengine=psync -size=5G \
        -runtime=20 -fallocate=none -group_reporting -name=test

   unwritten file:                 IOPS=20.1k, BW=78.7MiB/s
   unwritten file + fast_commit:   IOPS=42.9k, BW=167MiB/s
   written file:                   IOPS=98.8k, BW=386MiB/s

Thanks,
Yi.

---

[1] https://nvmexpress.org/specifications/
    NVM Command Set Specification, section 3.2.8

Zhang Yi (10):
  block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit
  nvme-multipath: add BLK_FEAT_WRITE_ZEROES_UNMAP support
  nvmet: set WZDS and DRB if device supports BLK_FEAT_WRITE_ZEROES_UNMAP
  scsi: sd: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports unmap
    zeroing mode
  dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support
  fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
  block: add FALLOC_FL_WRITE_ZEROES support
  block: factor out common part in blkdev_fallocate()
  ext4: add FALLOC_FL_WRITE_ZEROES support

 Documentation/ABI/stable/sysfs-block | 14 +++++++
 block/blk-settings.c                 |  6 +++
 block/blk-sysfs.c                    |  3 ++
 block/fops.c                         | 37 +++++++++--------
 drivers/md/dm-table.c                |  7 +++-
 drivers/md/dm.c                      |  1 +
 drivers/nvme/host/core.c             | 21 +++++-----
 drivers/nvme/host/multipath.c        |  3 +-
 drivers/nvme/target/io-cmd-bdev.c    |  4 ++
 drivers/scsi/sd.c                    |  5 +++
 fs/ext4/extents.c                    | 59 ++++++++++++++++++++++------
 fs/open.c                            |  1 +
 include/linux/blkdev.h               |  8 ++++
 include/linux/falloc.h               |  3 +-
 include/trace/events/ext4.h          |  3 +-
 include/uapi/linux/falloc.h          | 18 +++++++++
 16 files changed, 149 insertions(+), 44 deletions(-)

-- 
2.46.1



* [RFC PATCH -next v3 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  2025-03-18  7:35 [RFC PATCH -next v3 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
@ 2025-03-18  7:35 ` Zhang Yi
  2025-04-09 10:31   ` Christoph Hellwig
  2025-03-18  7:35 ` [RFC PATCH -next v3 02/10] nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit Zhang Yi
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 24+ messages in thread
From: Zhang Yi @ 2025-03-18  7:35 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Currently, disks primarily implement the write zeroes command (aka
REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves
physically writing zeros to the disk media (e.g., HDDs), while the
second performs an unmap operation on the logical blocks, effectively
putting them into a deallocated state (e.g., SSDs). The first method is
generally slow, while the second method is typically very fast.

For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting
REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit can accelerate
the write zeros operation by placing disk blocks into a deallocated
state. However, it is difficult to ascertain whether the storage device
supports unmap write zeroes. We cannot determine this solely by querying
bdev_limits(bdev)->max_write_zeroes_sectors.

Therefore, add a new queue limit feature, BLK_FEAT_WRITE_ZEROES_UNMAP
and the corresponding sysfs entry, to indicate whether the block device
explicitly supports the unmapped write zeroes command. Each device
driver should set this bit if it is certain that the attached disk
supports this command. If the bit is not set, the disk either does not
support it, or its support status is unknown.

For stacked devices, BLK_FEAT_WRITE_ZEROES_UNMAP must be supported by
both the stacking driver and all underlying devices.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 Documentation/ABI/stable/sysfs-block | 14 ++++++++++++++
 block/blk-settings.c                 |  6 ++++++
 block/blk-sysfs.c                    |  3 +++
 include/linux/blkdev.h               |  3 +++
 4 files changed, 26 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 890cde28bf90..67513c0d9233 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -742,6 +742,20 @@ Description:
 		0, write zeroes is not supported by the device.
 
 
+What:		/sys/block/<disk>/queue/write_zeroes_unmap
+Date:		January 2025
+Contact:	Zhang Yi <yi.zhang@huawei.com>
+Description:
+		[RO] Indicates whether the device explicitly supports the unmap
+		write zeroes operation, in which a single write zeroes request
+		with the unmap bit set zeroes out a range of contiguous blocks
+		on storage by freeing the blocks, rather than writing physical
+		zeroes to the media. If write_zeroes_unmap is 1, the device
+		explicitly supports the unmap write zeroes command. Otherwise,
+		the device either does not support it, or its support status is
+		unknown.
+
+
 What:		/sys/block/<disk>/queue/zone_append_max_bytes
 Date:		May 2020
 Contact:	linux-block@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 6b2dbe645d23..3331d07bd5d9 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -697,6 +697,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		t->features &= ~BLK_FEAT_NOWAIT;
 	if (!(b->features & BLK_FEAT_POLL))
 		t->features &= ~BLK_FEAT_POLL;
+	if (!(b->features & BLK_FEAT_WRITE_ZEROES_UNMAP))
+		t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
 
 	t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
 
@@ -819,6 +821,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		t->zone_write_granularity = 0;
 		t->max_zone_append_sectors = 0;
 	}
+
+	if (!t->max_write_zeroes_sectors)
+		t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
+
 	blk_stack_atomic_writes_limits(t, b, start);
 
 	return ret;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index d584461a1d84..6f00e9a8f8b6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -261,6 +261,7 @@ static ssize_t queue_##_name##_show(struct gendisk *disk, char *page)	\
 
 QUEUE_SYSFS_FEATURE_SHOW(fua, BLK_FEAT_FUA);
 QUEUE_SYSFS_FEATURE_SHOW(dax, BLK_FEAT_DAX);
+QUEUE_SYSFS_FEATURE_SHOW(write_zeroes_unmap, BLK_FEAT_WRITE_ZEROES_UNMAP);
 
 static ssize_t queue_poll_show(struct gendisk *disk, char *page)
 {
@@ -510,6 +511,7 @@ QUEUE_LIM_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
 
 QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes");
+QUEUE_LIM_RO_ENTRY(queue_write_zeroes_unmap, "write_zeroes_unmap");
 QUEUE_LIM_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
 
@@ -656,6 +658,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_atomic_write_unit_min_entry.attr,
 	&queue_atomic_write_unit_max_entry.attr,
 	&queue_max_write_zeroes_sectors_entry.attr,
+	&queue_write_zeroes_unmap_entry.attr,
 	&queue_max_zone_append_sectors_entry.attr,
 	&queue_zone_write_granularity_entry.attr,
 	&queue_rotational_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e39c45bc0a97..5d280c7fba65 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -342,6 +342,9 @@ typedef unsigned int __bitwise blk_features_t;
 #define BLK_FEAT_ATOMIC_WRITES \
 	((__force blk_features_t)(1u << 16))
 
+/* supports unmap write zeroes command */
+#define BLK_FEAT_WRITE_ZEROES_UNMAP	((__force blk_features_t)(1u << 17))
+
 /*
  * Flags automatically inherited when stacking limits.
  */
-- 
2.46.1



* [RFC PATCH -next v3 02/10] nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit
  2025-03-18  7:35 [RFC PATCH -next v3 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
  2025-03-18  7:35 ` [RFC PATCH -next v3 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features Zhang Yi
@ 2025-03-18  7:35 ` Zhang Yi
  2025-03-18  7:35 ` [RFC PATCH -next v3 03/10] nvme-multipath: add BLK_FEAT_WRITE_ZEROES_UNMAP support Zhang Yi
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2025-03-18  7:35 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

When the device supports the Write Zeroes command and the DEAC bit, it
indicates that the deallocate bit in the Write Zeroes command is
supported, and the bytes read from a deallocated logical block are
zeroes. This means the device supports unmap Write Zeroes, so set the
BLK_FEAT_WRITE_ZEROES_UNMAP feature to the device's queue limit.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 drivers/nvme/host/core.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 24c3e1765d49..3af6a50f07ec 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2223,22 +2223,25 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
 	if (!nvme_init_integrity(ns->head, &lim, info))
 		capacity = 0;
 
-	ret = queue_limits_commit_update(ns->disk->queue, &lim);
-	if (ret) {
-		blk_mq_unfreeze_queue(ns->disk->queue, memflags);
-		goto out;
-	}
-
-	set_capacity_and_notify(ns->disk, capacity);
-
 	/*
 	 * Only set the DEAC bit if the device guarantees that reads from
 	 * deallocated data return zeroes.  While the DEAC bit does not
 	 * require that, it must be a no-op if reads from deallocated data
 	 * do not return zeroes.
 	 */
-	if ((id->dlfeat & 0x7) == 0x1 && (id->dlfeat & (1 << 3)))
+	if ((id->dlfeat & 0x7) == 0x1 && (id->dlfeat & (1 << 3))) {
 		ns->head->features |= NVME_NS_DEAC;
+		if (lim.max_write_zeroes_sectors)
+			lim.features |= BLK_FEAT_WRITE_ZEROES_UNMAP;
+	}
+
+	ret = queue_limits_commit_update(ns->disk->queue, &lim);
+	if (ret) {
+		blk_mq_unfreeze_queue(ns->disk->queue, memflags);
+		goto out;
+	}
+
+	set_capacity_and_notify(ns->disk, capacity);
 	set_disk_ro(ns->disk, nvme_ns_is_readonly(ns, info));
 	set_bit(NVME_NS_READY, &ns->flags);
 	blk_mq_unfreeze_queue(ns->disk->queue, memflags);
-- 
2.46.1



* [RFC PATCH -next v3 03/10] nvme-multipath: add BLK_FEAT_WRITE_ZEROES_UNMAP support
  2025-03-18  7:35 [RFC PATCH -next v3 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
  2025-03-18  7:35 ` [RFC PATCH -next v3 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features Zhang Yi
  2025-03-18  7:35 ` [RFC PATCH -next v3 02/10] nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit Zhang Yi
@ 2025-03-18  7:35 ` Zhang Yi
  2025-03-18  7:35 ` [RFC PATCH -next v3 04/10] nvmet: set WZDS and DRB if device supports BLK_FEAT_WRITE_ZEROES_UNMAP Zhang Yi
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2025-03-18  7:35 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Set the BLK_FEAT_WRITE_ZEROES_UNMAP feature while creating multipath
stacking queue limits by default. This feature shall be disabled if any
attached namespace does not support it.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 drivers/nvme/host/multipath.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 2a7635565083..82f12209446f 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -638,7 +638,8 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
 
 	blk_set_stacking_limits(&lim);
 	lim.dma_alignment = 3;
-	lim.features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL;
+	lim.features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL |
+			BLK_FEAT_WRITE_ZEROES_UNMAP;
 	if (head->ids.csi == NVME_CSI_ZNS)
 		lim.features |= BLK_FEAT_ZONED;
 
-- 
2.46.1



* [RFC PATCH -next v3 04/10] nvmet: set WZDS and DRB if device supports BLK_FEAT_WRITE_ZEROES_UNMAP
  2025-03-18  7:35 [RFC PATCH -next v3 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
                   ` (2 preceding siblings ...)
  2025-03-18  7:35 ` [RFC PATCH -next v3 03/10] nvme-multipath: add BLK_FEAT_WRITE_ZEROES_UNMAP support Zhang Yi
@ 2025-03-18  7:35 ` Zhang Yi
  2025-04-09 10:34   ` Christoph Hellwig
  2025-03-18  7:35 ` [RFC PATCH -next v3 05/10] scsi: sd: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports unmap zeroing mode Zhang Yi
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 24+ messages in thread
From: Zhang Yi @ 2025-03-18  7:35 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Set the WZDS and DRB bits in the namespace dlfeat if the underlying
block device supports BLK_FEAT_WRITE_ZEROES_UNMAP, so that the nvme
target device supports the unmapped write zeroes command.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 drivers/nvme/target/io-cmd-bdev.c | 4 ++++
 include/linux/blkdev.h            | 5 +++++
 2 files changed, 9 insertions(+)

diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index 83be0657e6df..0e8b35732492 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -46,6 +46,10 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
 	id->npda = id->npdg;
 	/* NOWS = Namespace Optimal Write Size */
 	id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
+
+	/* Set WZDS and DRB if device supports unmapped write zeroes */
+	if (bdev_unmap_write_zeroes(bdev))
+		id->dlfeat = (1 << 3) | 0x1;
 }
 
 void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5d280c7fba65..836738ab1fa6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1344,6 +1344,11 @@ static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
 	return bdev_limits(bdev)->max_write_zeroes_sectors;
 }
 
+static inline bool bdev_unmap_write_zeroes(struct block_device *bdev)
+{
+	return bdev_limits(bdev)->features & BLK_FEAT_WRITE_ZEROES_UNMAP;
+}
+
 static inline bool bdev_nonrot(struct block_device *bdev)
 {
 	return blk_queue_nonrot(bdev_get_queue(bdev));
-- 
2.46.1



* [RFC PATCH -next v3 05/10] scsi: sd: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports unmap zeroing mode
  2025-03-18  7:35 [RFC PATCH -next v3 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
                   ` (3 preceding siblings ...)
  2025-03-18  7:35 ` [RFC PATCH -next v3 04/10] nvmet: set WZDS and DRB if device supports BLK_FEAT_WRITE_ZEROES_UNMAP Zhang Yi
@ 2025-03-18  7:35 ` Zhang Yi
  2025-03-18  7:35 ` [RFC PATCH -next v3 06/10] dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support Zhang Yi
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2025-03-18  7:35 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

When the device supports the Write Zeroes command and the zeroing mode
is set to SD_ZERO_WS16_UNMAP or SD_ZERO_WS10_UNMAP, this means that the
device supports unmap Write Zeroes, so set the corresponding
BLK_FEAT_WRITE_ZEROES_UNMAP feature to the device's queue limit.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/sd.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 950d8c9fb884..652630b410de 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1118,6 +1118,11 @@ static void sd_config_write_same(struct scsi_disk *sdkp,
 	else
 		sdkp->zeroing_mode = SD_ZERO_WRITE;
 
+	if (sdkp->max_ws_blocks &&
+	    (sdkp->zeroing_mode == SD_ZERO_WS16_UNMAP ||
+	     sdkp->zeroing_mode == SD_ZERO_WS10_UNMAP))
+		lim->features |= BLK_FEAT_WRITE_ZEROES_UNMAP;
+
 	if (sdkp->max_ws_blocks &&
 	    sdkp->physical_block_size > logical_block_size) {
 		/*
-- 
2.46.1



* [RFC PATCH -next v3 06/10] dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support
  2025-03-18  7:35 [RFC PATCH -next v3 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
                   ` (4 preceding siblings ...)
  2025-03-18  7:35 ` [RFC PATCH -next v3 05/10] scsi: sd: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports unmap zeroing mode Zhang Yi
@ 2025-03-18  7:35 ` Zhang Yi
  2025-03-19 19:50   ` Benjamin Marzinski
  2025-03-18  7:35 ` [RFC PATCH -next v3 07/10] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate Zhang Yi
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 24+ messages in thread
From: Zhang Yi @ 2025-03-18  7:35 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Set the BLK_FEAT_WRITE_ZEROES_UNMAP feature on stacking queue limits by
default. This feature shall be disabled if any underlying device does
not support it.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 drivers/md/dm-table.c | 7 +++++--
 drivers/md/dm.c       | 1 +
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 453803f1edf5..d4a483287e26 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -598,7 +598,8 @@ int dm_split_args(int *argc, char ***argvp, char *input)
 static void dm_set_stacking_limits(struct queue_limits *limits)
 {
 	blk_set_stacking_limits(limits);
-	limits->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL;
+	limits->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL |
+			    BLK_FEAT_WRITE_ZEROES_UNMAP;
 }
 
 /*
@@ -1848,8 +1849,10 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 		limits->discard_alignment = 0;
 	}
 
-	if (!dm_table_supports_write_zeroes(t))
+	if (!dm_table_supports_write_zeroes(t)) {
 		limits->max_write_zeroes_sectors = 0;
+		limits->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
+	}
 
 	if (!dm_table_supports_secure_erase(t))
 		limits->max_secure_erase_sectors = 0;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 5ab7574c0c76..b59c3dbeaaf1 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1096,6 +1096,7 @@ void disable_write_zeroes(struct mapped_device *md)
 
 	/* device doesn't really support WRITE ZEROES, disable it */
 	limits->max_write_zeroes_sectors = 0;
+	limits->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
 }
 
 static bool swap_bios_limit(struct dm_target *ti, struct bio *bio)
-- 
2.46.1



* [RFC PATCH -next v3 07/10] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
  2025-03-18  7:35 [RFC PATCH -next v3 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
                   ` (5 preceding siblings ...)
  2025-03-18  7:35 ` [RFC PATCH -next v3 06/10] dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support Zhang Yi
@ 2025-03-18  7:35 ` Zhang Yi
  2025-04-09 10:35   ` Christoph Hellwig
  2025-03-18  7:35 ` [RFC PATCH -next v3 08/10] block: add FALLOC_FL_WRITE_ZEROES support Zhang Yi
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 24+ messages in thread
From: Zhang Yi @ 2025-03-18  7:35 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

With the development of flash-based storage devices, we can quickly
write zeros to SSDs using the WRITE_ZERO command if the devices do not
actually write physical zeroes to the media. Therefore, we can use this
command to quickly preallocate a real all-zero file with written
extents. This approach should be beneficial for subsequent pure
overwriting within this file, as it can save on block allocation and,
consequently, significant metadata changes, which should greatly improve
overwrite performance on certain filesystems.

Therefore, introduce a new operation FALLOC_FL_WRITE_ZEROES to
fallocate. This flag is used to convert a specified range of a file to
zeros by issuing a zeroing operation. Blocks should be allocated for the
regions that span holes in the file, and the entire range is converted
to written extents. If the underlying device supports the actual offload
write zeroes command, the zeroing operation can be accelerated. If it
does not, we currently don't prevent the file system from writing
actual zeros to the device. This provides users with a new method to
quickly generate a zeroed file; users no longer need to write zero data
to create a file with written extents.

Users can check whether the disk supports the unmap write zeroes
command by querying:

    /sys/block/<disk>/queue/write_zeroes_unmap

Finally, this flag should not be specified in conjunction with the
FALLOC_FL_KEEP_SIZE flag, since allocating written extents beyond file
EOF is not permitted. Filesystems that always require out-of-place
writes should not support this flag either, since they still need to
allocate new blocks during subsequent overwrites.
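
The FALLOC_FL_KEEP_SIZE restriction can be illustrated with a reduced
userspace model of the mode check. This is a sketch only: the MODEL_*
names and model_check_mode() are hypothetical, the flag values mirror
include/uapi/linux/falloc.h plus the 0x80 proposed here, and masking out
KEEP_SIZE before the switch is an assumption about how the surrounding
vfs_fallocate() code selects the case.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_FL_KEEP_SIZE	0x01
#define MODEL_FL_COLLAPSE_RANGE	0x08
#define MODEL_FL_INSERT_RANGE	0x20
#define MODEL_FL_WRITE_ZEROES	0x80	/* value proposed by this patch */

/*
 * Reduced model covering only the case touched by the fs/open.c hunk;
 * every other mode is ignored here and simply returns 0.
 */
int model_check_mode(int mode)
{
	switch (mode & ~MODEL_FL_KEEP_SIZE) {
	case MODEL_FL_COLLAPSE_RANGE:
	case MODEL_FL_INSERT_RANGE:
	case MODEL_FL_WRITE_ZEROES:
		/* These operations may not be combined with KEEP_SIZE. */
		if (mode & MODEL_FL_KEEP_SIZE)
			return -EOPNOTSUPP;
		break;
	default:
		break;
	}
	return 0;
}

/* Self-check of the two cases described in the text above. */
void model_check_mode_selfcheck(void)
{
	assert(model_check_mode(MODEL_FL_WRITE_ZEROES) == 0);
	assert(model_check_mode(MODEL_FL_WRITE_ZEROES | MODEL_FL_KEEP_SIZE) ==
	       -EOPNOTSUPP);
}
```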

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/open.c                   |  1 +
 include/linux/falloc.h      |  3 ++-
 include/uapi/linux/falloc.h | 18 ++++++++++++++++++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/open.c b/fs/open.c
index bdbf03f799a1..03b30613d7dc 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -278,6 +278,7 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		break;
 	case FALLOC_FL_COLLAPSE_RANGE:
 	case FALLOC_FL_INSERT_RANGE:
+	case FALLOC_FL_WRITE_ZEROES:
 		if (mode & FALLOC_FL_KEEP_SIZE)
 			return -EOPNOTSUPP;
 		break;
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 3f49f3df6af5..7c38c6b76b60 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -36,7 +36,8 @@ struct space_resv {
 				 FALLOC_FL_COLLAPSE_RANGE |	\
 				 FALLOC_FL_ZERO_RANGE |		\
 				 FALLOC_FL_INSERT_RANGE |	\
-				 FALLOC_FL_UNSHARE_RANGE)
+				 FALLOC_FL_UNSHARE_RANGE |	\
+				 FALLOC_FL_WRITE_ZEROES)
 
 /* on ia32 l_start is on a 32-bit boundary */
 #if defined(CONFIG_X86_64)
diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
index 5810371ed72b..265aae7ff8c1 100644
--- a/include/uapi/linux/falloc.h
+++ b/include/uapi/linux/falloc.h
@@ -78,4 +78,22 @@
  */
 #define FALLOC_FL_UNSHARE_RANGE		0x40
 
+/*
+ * FALLOC_FL_WRITE_ZEROES is used to convert a specified range of a file to
+ * zeros by issuing a zeroing operation. Blocks should be allocated for the
+ * regions that span holes in the file, and the entire range is converted to
+ * written extents. This flag is beneficial for subsequent pure overwriting
+ * within this range, as it can save on block allocation and, consequently,
+ * significant metadata changes. Therefore, filesystems that always require
+ * out-of-place writes should not support this flag.
+ *
+ * Different filesystems may implement different limitations on the
+ * granularity of the zeroing operation. Most will preferably accelerate it
+ * by submitting a write zeroes command if the backing storage supports it,
+ * which may not physically write zeros to the media.
+ *
+ * This flag cannot be specified in conjunction with FALLOC_FL_KEEP_SIZE.
+ */
+#define FALLOC_FL_WRITE_ZEROES		0x80
+
 #endif /* _UAPI_FALLOC_H_ */
-- 
2.46.1
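
For readers who want to try the flag from userspace, the semantics
described in the uapi comment above could be exercised roughly as in the
sketch below. It assumes a kernel carrying this RFC; the 0x80 value is
taken from the patch and is only defined locally when the installed
headers lack it. The helper names write_zeroes_mode_ok() and
write_zeroes_range() are hypothetical and not part of the series.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>

/* Value proposed by this RFC; older uapi headers do not define it. */
#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES 0x80
#endif

/* Mirrors the vfs_fallocate() check in the patch: KEEP_SIZE is rejected. */
static bool write_zeroes_mode_ok(int mode)
{
	return (mode & FALLOC_FL_WRITE_ZEROES) &&
	       !(mode & FALLOC_FL_KEEP_SIZE);
}

/*
 * Hypothetical helper: ask the kernel to turn [off, off + len) into
 * zeroed, written extents. Returns 0 on success or a negative errno;
 * -EOPNOTSUPP means the kernel or the filesystem lacks the flag.
 */
static int write_zeroes_range(int fd, off_t off, off_t len)
{
	assert(len > 0);	/* fallocate() rejects non-positive lengths */
	if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, off, len) < 0)
		return -errno;
	return 0;
}
```

A caller that gets -EOPNOTSUPP back would presumably fall back to
FALLOC_FL_ZERO_RANGE or to writing zeroed buffers itself.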


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH -next v3 08/10] block: add FALLOC_FL_WRITE_ZEROES support
  2025-03-18  7:35 [RFC PATCH -next v3 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
                   ` (6 preceding siblings ...)
  2025-03-18  7:35 ` [RFC PATCH -next v3 07/10] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate Zhang Yi
@ 2025-03-18  7:35 ` Zhang Yi
  2025-03-18  7:35 ` [RFC PATCH -next v3 09/10] block: factor out common part in blkdev_fallocate() Zhang Yi
  2025-03-18  7:35 ` [RFC PATCH -next v3 10/10] ext4: add FALLOC_FL_WRITE_ZEROES support Zhang Yi
  9 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2025-03-18  7:35 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Add support for FALLOC_FL_WRITE_ZEROES. It directly calls
blkdev_issue_zeroout() with flags set to 0. The underlying process will
attempt to use the fastest method for issuing zeroes. First, the block
layer will try to issue a write zeroes command if the storage device
supports it; if not, it will fall back to issuing zeroed data. Then, the
storage device driver may attempt to submit an unmap write zeroes command
if the device supports it; if not, the driver may fall back to
submitting a no-unmap write zeroes command.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 block/fops.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/block/fops.c b/block/fops.c
index be9f1dbea9ce..5a519581d7e3 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -803,7 +803,7 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 #define	BLKDEV_FALLOC_FL_SUPPORTED					\
 		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
-		 FALLOC_FL_ZERO_RANGE)
+		 FALLOC_FL_ZERO_RANGE | FALLOC_FL_WRITE_ZEROES)
 
 static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 			     loff_t len)
@@ -862,6 +862,15 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 					     len >> SECTOR_SHIFT, GFP_KERNEL,
 					     BLKDEV_ZERO_NOFALLBACK);
 		break;
+	case FALLOC_FL_WRITE_ZEROES:
+		error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
+		if (error)
+			goto fail;
+
+		error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
+					     len >> SECTOR_SHIFT, GFP_KERNEL,
+					     0);
+		break;
 	default:
 		error = -EOPNOTSUPP;
 	}
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH -next v3 09/10] block: factor out common part in blkdev_fallocate()
  2025-03-18  7:35 [RFC PATCH -next v3 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
                   ` (7 preceding siblings ...)
  2025-03-18  7:35 ` [RFC PATCH -next v3 08/10] block: add FALLOC_FL_WRITE_ZEROES support Zhang Yi
@ 2025-03-18  7:35 ` Zhang Yi
  2025-04-09 10:36   ` Christoph Hellwig
  2025-03-18  7:35 ` [RFC PATCH -next v3 10/10] ext4: add FALLOC_FL_WRITE_ZEROES support Zhang Yi
  9 siblings, 1 reply; 24+ messages in thread
From: Zhang Yi @ 2025-03-18  7:35 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Only the flags passed to blkdev_issue_zeroout() differ among the three
zeroing branches in blkdev_fallocate(). Therefore, clean up by factoring
out the common part.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 block/fops.c | 40 +++++++++++++++-------------------------
 1 file changed, 15 insertions(+), 25 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 5a519581d7e3..e590c8997689 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -812,6 +812,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	struct block_device *bdev = I_BDEV(inode);
 	loff_t end = start + len - 1;
 	loff_t isize;
+	unsigned int flags;
 	int error;
 
 	/* Fail if we don't recognize the flags. */
@@ -838,43 +839,32 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 
 	filemap_invalidate_lock(inode->i_mapping);
 
-	/*
-	 * Invalidate the page cache, including dirty pages, for valid
-	 * de-allocate mode calls to fallocate().
-	 */
 	switch (mode) {
 	case FALLOC_FL_ZERO_RANGE:
 	case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
-		error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
-		if (error)
-			goto fail;
-
-		error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
-					     len >> SECTOR_SHIFT, GFP_KERNEL,
-					     BLKDEV_ZERO_NOUNMAP);
+		flags = BLKDEV_ZERO_NOUNMAP;
 		break;
 	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
-		error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
-		if (error)
-			goto fail;
-
-		error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
-					     len >> SECTOR_SHIFT, GFP_KERNEL,
-					     BLKDEV_ZERO_NOFALLBACK);
+		flags = BLKDEV_ZERO_NOFALLBACK;
 		break;
 	case FALLOC_FL_WRITE_ZEROES:
-		error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
-		if (error)
-			goto fail;
-
-		error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
-					     len >> SECTOR_SHIFT, GFP_KERNEL,
-					     0);
+		flags = 0;
 		break;
 	default:
 		error = -EOPNOTSUPP;
+		goto fail;
 	}
 
+	/*
+	 * Invalidate the page cache, including dirty pages, for valid
+	 * de-allocate mode calls to fallocate().
+	 */
+	error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
+	if (error)
+		goto fail;
+
+	error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
+				     len >> SECTOR_SHIFT, GFP_KERNEL, flags);
  fail:
 	filemap_invalidate_unlock(inode->i_mapping);
 	return error;
-- 
2.46.1
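
After the refactoring above, each branch only selects the flags passed
to blkdev_issue_zeroout(). A small userspace model of that mapping may
make the three cases easier to compare. The M_* and Z_* names are
stand-ins invented for this sketch; the Z_* values are assumed to match
BLKDEV_ZERO_NOUNMAP and BLKDEV_ZERO_NOFALLBACK, which this thread does
not show.

```c
#include <errno.h>

/* Stand-ins for the uapi fallocate mode bits (values as in falloc.h). */
enum {
	M_KEEP_SIZE    = 0x01,
	M_PUNCH_HOLE   = 0x02,
	M_ZERO_RANGE   = 0x10,
	M_WRITE_ZEROES = 0x80,	/* proposed by this series */
};

/* Stand-ins for the BLKDEV_ZERO_* flags; the values are assumed. */
enum {
	Z_NOUNMAP    = 1 << 0,	/* do not unmap, keep blocks provisioned */
	Z_NOFALLBACK = 1 << 1,	/* do not fall back to writing zero pages */
};

/* Returns 0 and stores the zeroout flags, or -EOPNOTSUPP. */
static int zeroout_flags_for_mode(int mode, unsigned int *flags)
{
	switch (mode) {
	case M_ZERO_RANGE:
	case M_ZERO_RANGE | M_KEEP_SIZE:
		*flags = Z_NOUNMAP;
		return 0;
	case M_PUNCH_HOLE | M_KEEP_SIZE:
		*flags = Z_NOFALLBACK;
		return 0;
	case M_WRITE_ZEROES:
		*flags = 0;	/* let the block layer pick the fastest path */
		return 0;
	default:
		return -EOPNOTSUPP;
	}
}
```

Note that M_WRITE_ZEROES | M_KEEP_SIZE lands in the default case, which
matches the rule that the new flag cannot be combined with KEEP_SIZE.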


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH -next v3 10/10] ext4: add FALLOC_FL_WRITE_ZEROES support
  2025-03-18  7:35 [RFC PATCH -next v3 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
                   ` (8 preceding siblings ...)
  2025-03-18  7:35 ` [RFC PATCH -next v3 09/10] block: factor out common part in blkdev_fallocate() Zhang Yi
@ 2025-03-18  7:35 ` Zhang Yi
  9 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2025-03-18  7:35 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi
  Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Add support for FALLOC_FL_WRITE_ZEROES. This first allocates blocks as
unwritten, then issues a zero command outside of the running journal
handle, and finally converts them to a written state.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c           | 59 ++++++++++++++++++++++++++++++-------
 include/trace/events/ext4.h |  3 +-
 2 files changed, 50 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 1b028be19193..e937a714085c 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4483,6 +4483,8 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 	struct ext4_map_blocks map;
 	unsigned int credits;
 	loff_t epos, old_size = i_size_read(inode);
+	unsigned int blkbits = inode->i_blkbits;
+	bool alloc_zero = false;
 
 	BUG_ON(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS));
 	map.m_lblk = offset;
@@ -4495,6 +4497,17 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 	if (len <= EXT_UNWRITTEN_MAX_LEN)
 		flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;
 
+	/*
+	 * Doing the actual zeroing inside a running journal transaction
+	 * is costly. First allocate an unwritten extent, then convert it
+	 * to written after zeroing it out.
+	 */
+	if (flags & EXT4_GET_BLOCKS_ZERO) {
+		flags &= ~EXT4_GET_BLOCKS_ZERO;
+		flags |= EXT4_GET_BLOCKS_UNWRIT_EXT;
+		alloc_zero = true;
+	}
+
 	/*
 	 * credits to insert 1 extent into extent tree
 	 */
@@ -4531,9 +4544,7 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 		 * allow a full retry cycle for any remaining allocations
 		 */
 		retries = 0;
-		map.m_lblk += ret;
-		map.m_len = len = len - ret;
-		epos = (loff_t)map.m_lblk << inode->i_blkbits;
+		epos = (loff_t)(map.m_lblk + ret) << blkbits;
 		inode_set_ctime_current(inode);
 		if (new_size) {
 			if (epos > new_size)
@@ -4553,6 +4564,21 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 		ret2 = ret3 ? ret3 : ret2;
 		if (unlikely(ret2))
 			break;
+
+		if (alloc_zero &&
+		    (map.m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN))) {
+			ret2 = ext4_issue_zeroout(inode, map.m_lblk, map.m_pblk,
+						  map.m_len);
+			if (likely(!ret2))
+				ret2 = ext4_convert_unwritten_extents(NULL,
+					inode, (loff_t)map.m_lblk << blkbits,
+					(loff_t)map.m_len << blkbits);
+			if (ret2)
+				break;
+		}
+
+		map.m_lblk += ret;
+		map.m_len = len = len - ret;
 	}
 	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
 		goto retry;
@@ -4618,7 +4644,11 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	if (end_lblk > start_lblk) {
 		ext4_lblk_t zero_blks = end_lblk - start_lblk;
 
-		flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | EXT4_EX_NOCACHE);
+		if (mode & FALLOC_FL_WRITE_ZEROES)
+			flags = EXT4_GET_BLOCKS_CREATE_ZERO | EXT4_EX_NOCACHE;
+		else
+			flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN |
+				  EXT4_EX_NOCACHE);
 		ret = ext4_alloc_file_blocks(file, start_lblk, zero_blks,
 					     new_size, flags);
 		if (ret)
@@ -4730,8 +4760,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 
 	/* Return error if mode is not supported */
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
-		     FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |
-		     FALLOC_FL_INSERT_RANGE))
+		     FALLOC_FL_ZERO_RANGE | FALLOC_FL_COLLAPSE_RANGE |
+		     FALLOC_FL_INSERT_RANGE | FALLOC_FL_WRITE_ZEROES))
 		return -EOPNOTSUPP;
 
 	inode_lock(inode);
@@ -4762,16 +4792,23 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	if (ret)
 		goto out_invalidate_lock;
 
-	if (mode & FALLOC_FL_PUNCH_HOLE)
+	switch (mode & FALLOC_FL_MODE_MASK) {
+	case FALLOC_FL_PUNCH_HOLE:
 		ret = ext4_punch_hole(file, offset, len);
-	else if (mode & FALLOC_FL_COLLAPSE_RANGE)
+		break;
+	case FALLOC_FL_COLLAPSE_RANGE:
 		ret = ext4_collapse_range(file, offset, len);
-	else if (mode & FALLOC_FL_INSERT_RANGE)
+		break;
+	case FALLOC_FL_INSERT_RANGE:
 		ret = ext4_insert_range(file, offset, len);
-	else if (mode & FALLOC_FL_ZERO_RANGE)
+		break;
+	case FALLOC_FL_ZERO_RANGE:
+	case FALLOC_FL_WRITE_ZEROES:
 		ret = ext4_zero_range(file, offset, len, mode);
-	else
+		break;
+	default:
 		ret = -EOPNOTSUPP;
+	}
 
 out_invalidate_lock:
 	filemap_invalidate_unlock(mapping);
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 156908641e68..6f9cf2811733 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -92,7 +92,8 @@ TRACE_DEFINE_ENUM(ES_REFERENCED_B);
 	{ FALLOC_FL_KEEP_SIZE,		"KEEP_SIZE"},		\
 	{ FALLOC_FL_PUNCH_HOLE,		"PUNCH_HOLE"},		\
 	{ FALLOC_FL_COLLAPSE_RANGE,	"COLLAPSE_RANGE"},	\
-	{ FALLOC_FL_ZERO_RANGE,		"ZERO_RANGE"})
+	{ FALLOC_FL_ZERO_RANGE,		"ZERO_RANGE"},		\
+	{ FALLOC_FL_WRITE_ZEROES,	"WRITE_ZEROES"})
 
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_XATTR);
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_CROSS_RENAME);
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH -next v3 06/10] dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support
  2025-03-18  7:35 ` [RFC PATCH -next v3 06/10] dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support Zhang Yi
@ 2025-03-19 19:50   ` Benjamin Marzinski
  0 siblings, 0 replies; 24+ messages in thread
From: Benjamin Marzinski @ 2025-03-19 19:50 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi, linux-xfs, linux-kernel, hch, tytso, djwong,
	john.g.garry, chaitanyak, shinichiro.kawasaki, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

On Tue, Mar 18, 2025 at 03:35:41PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Set the BLK_FEAT_WRITE_ZEROES_UNMAP feature on stacking queue limits by
> default. This feature shall be disabled if any underlying device does
> not support it.
> 
Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>

> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  drivers/md/dm-table.c | 7 +++++--
>  drivers/md/dm.c       | 1 +
>  2 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 453803f1edf5..d4a483287e26 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -598,7 +598,8 @@ int dm_split_args(int *argc, char ***argvp, char *input)
>  static void dm_set_stacking_limits(struct queue_limits *limits)
>  {
>  	blk_set_stacking_limits(limits);
> -	limits->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL;
> +	limits->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL |
> +			    BLK_FEAT_WRITE_ZEROES_UNMAP;
>  }
>  
>  /*
> @@ -1848,8 +1849,10 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
>  		limits->discard_alignment = 0;
>  	}
>  
> -	if (!dm_table_supports_write_zeroes(t))
> +	if (!dm_table_supports_write_zeroes(t)) {
>  		limits->max_write_zeroes_sectors = 0;
> +		limits->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
> +	}
>  
>  	if (!dm_table_supports_secure_erase(t))
>  		limits->max_secure_erase_sectors = 0;
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 5ab7574c0c76..b59c3dbeaaf1 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1096,6 +1096,7 @@ void disable_write_zeroes(struct mapped_device *md)
>  
>  	/* device doesn't really support WRITE ZEROES, disable it */
>  	limits->max_write_zeroes_sectors = 0;
> +	limits->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
>  }
>  
>  static bool swap_bios_limit(struct dm_target *ti, struct bio *bio)
> -- 
> 2.46.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH -next v3 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  2025-03-18  7:35 ` [RFC PATCH -next v3 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features Zhang Yi
@ 2025-04-09 10:31   ` Christoph Hellwig
  2025-04-10  3:52     ` Zhang Yi
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2025-04-09 10:31 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi, linux-xfs, linux-kernel, hch, tytso, djwong,
	john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

On Tue, Mar 18, 2025 at 03:35:36PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Currently, disks primarily implement the write zeroes command (aka
> REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves
> physically writing zeros to the disk media (e.g., HDDs), while the
> second performs an unmap operation on the logical blocks, effectively
> putting them into a deallocated state (e.g., SSDs). The first method is
> generally slow, while the second method is typically very fast.
> 
> For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting
> REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit can accelerate
> the write zeros operation by placing disk blocks into

Note that this is a can, not a must.  The NVMe definition of Write
Zeroes is unfortunately pretty stupid.

> +		[RO] Devices that explicitly support the unmap write zeroes
> +		operation in which a single write zeroes request with the unmap
> +		bit set to zero out the range of contiguous blocks on storage
> +		by freeing blocks, rather than writing physical zeroes to the
> +		media.

This is not actually guaranteed for nvme or scsi.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH -next v3 04/10] nvmet: set WZDS and DRB if device supports BLK_FEAT_WRITE_ZEROES_UNMAP
  2025-03-18  7:35 ` [RFC PATCH -next v3 04/10] nvmet: set WZDS and DRB if device supports BLK_FEAT_WRITE_ZEROES_UNMAP Zhang Yi
@ 2025-04-09 10:34   ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2025-04-09 10:34 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi, linux-xfs, linux-kernel, hch, tytso, djwong,
	john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

On Tue, Mar 18, 2025 at 03:35:39PM +0800, Zhang Yi wrote:
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 5d280c7fba65..836738ab1fa6 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1344,6 +1344,11 @@ static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
>  	return bdev_limits(bdev)->max_write_zeroes_sectors;
>  }
>  
> +static inline bool bdev_unmap_write_zeroes(struct block_device *bdev)
> +{
> +	return bdev_limits(bdev)->features & BLK_FEAT_WRITE_ZEROES_UNMAP;

This helper has an odd name.  When in doubt, stick to the name of the
flag instead of reordering the words.

Also no core block code should be added in an nvmet patch, this needs
to go into the first patch.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH -next v3 07/10] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
  2025-03-18  7:35 ` [RFC PATCH -next v3 07/10] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate Zhang Yi
@ 2025-04-09 10:35   ` Christoph Hellwig
  2025-04-09 10:50     ` Christian Brauner
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2025-04-09 10:35 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi, linux-xfs, linux-kernel, hch, tytso, djwong,
	john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

On Tue, Mar 18, 2025 at 03:35:42PM +0800, Zhang Yi wrote:
> Users can check the disk support of unmap write zeroes command by
> querying:
> 
>     /sys/block/<disk>/queue/write_zeroes_unmap

No, that is not in any way a good user interface.  Users need to be
able to query this on a per-file basis.

> Finally, this flag should not be specified in conjunction with the
> FALLOC_FL_KEEP_SIZE since allocating written extents beyond file EOF is
> not permitted, and filesystems that always require out-of-place writes
> should not support this flag since they still need to allocated new
> blocks during subsequent overwrites.

Should not or can't?  You're returning an error if this happens, so it
doesn't look like should is the right word here.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH -next v3 09/10] block: factor out common part in blkdev_fallocate()
  2025-03-18  7:35 ` [RFC PATCH -next v3 09/10] block: factor out common part in blkdev_fallocate() Zhang Yi
@ 2025-04-09 10:36   ` Christoph Hellwig
  2025-04-09 10:36     ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2025-04-09 10:36 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi, linux-xfs, linux-kernel, hch, tytso, djwong,
	john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH -next v3 09/10] block: factor out common part in blkdev_fallocate()
  2025-04-09 10:36   ` Christoph Hellwig
@ 2025-04-09 10:36     ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2025-04-09 10:36 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi, linux-xfs, linux-kernel, hch, tytso, djwong,
	john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

On Wed, Apr 09, 2025 at 12:36:29PM +0200, Christoph Hellwig wrote:
> Looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

.. although this really should go before the previous patch.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH -next v3 07/10] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
  2025-04-09 10:35   ` Christoph Hellwig
@ 2025-04-09 10:50     ` Christian Brauner
  2025-04-18  6:44       ` Zhang Yi
  0 siblings, 1 reply; 24+ messages in thread
From: Christian Brauner @ 2025-04-09 10:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Zhang Yi, linux-fsdevel, linux-ext4, linux-block, dm-devel,
	linux-nvme, linux-scsi, linux-xfs, linux-kernel, tytso, djwong,
	john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

On Wed, Apr 09, 2025 at 12:35:48PM +0200, Christoph Hellwig wrote:
> On Tue, Mar 18, 2025 at 03:35:42PM +0800, Zhang Yi wrote:
> > Users can check the disk support of unmap write zeroes command by
> > querying:
> > 
> >     /sys/block/<disk>/queue/write_zeroes_unmap
> 
> No, that is not in any way a good user interface.  Users need to be
> able to query this on a per-file basis.

Agreed. This should get a statx attribute most likely.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH -next v3 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  2025-04-09 10:31   ` Christoph Hellwig
@ 2025-04-10  3:52     ` Zhang Yi
  2025-04-10  7:15       ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Zhang Yi @ 2025-04-10  3:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi, linux-xfs, linux-kernel, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang, chengzhihao1,
	yukuai3, yangerkun

On 2025/4/9 18:31, Christoph Hellwig wrote:
> On Tue, Mar 18, 2025 at 03:35:36PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Currently, disks primarily implement the write zeroes command (aka
>> REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves
>> physically writing zeros to the disk media (e.g., HDDs), while the
>> second performs an unmap operation on the logical blocks, effectively
>> putting them into a deallocated state (e.g., SSDs). The first method is
>> generally slow, while the second method is typically very fast.
>>
>> For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting
>> REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit can accelerate
>> the write zeros operation by placing disk blocks into
> 
> Note that this is a can, not a must.  The NVMe definition of Write
> Zeroes is unfortunately pretty stupid.
> 
>> +		[RO] Devices that explicitly support the unmap write zeroes
>> +		operation in which a single write zeroes request with the unmap
>> +		bit set to zero out the range of contiguous blocks on storage
>> +		by freeing blocks, rather than writing physical zeroes to the
>> +		media.
> 
> This is not actually guaranteed for nvme or scsi.

Thank you for your review and comments. However, I'm not sure I fully
understand your points. Could you please provide more details?

AFAIK, the NVMe protocol has the following description in the latest
NVM Command Set Specification Figure 82 and Figure 114:

===
Deallocate (DEAC): If this bit is set to ‘1’, then the host is
requesting that the controller deallocate the specified logical blocks.
If this bit is cleared to ‘0’, then the host is not requesting that
the controller deallocate the specified logical blocks...

DLFEAT:
Write Zeroes Deallocation Support (WZDS): If this bit is set to ‘1’,
then the controller supports the Deallocate bit in the Write Zeroes
command for this namespace...
Deallocation Read Behavior (DRB): This field indicates the deallocated
logical block read behavior. For a logical block that is deallocated,
this field indicates the values read from that deallocated logical block
and its metadata (excluding protection information)...

  Value  Definition
  001b   A deallocated logical block returns all bytes cleared to 0h
===

At the same time, the current kernel determines whether to set the
unmap bit when submitting the write zeroes command based on the above
protocol. So I think these rules should be clear now.
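
As an illustration of that rule, the decision can be written as a small
predicate on the DLFEAT byte. This is only a sketch: the macro and
function names are made up here, and the bit positions (DRB in bits 2:0,
WZDS in bit 3) reflect my reading of the specification rather than
anything quoted in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed DLFEAT layout: DRB in bits 2:0, WZDS in bit 3. */
#define DLFEAT_DRB_MASK		0x7u
#define DLFEAT_DRB_ZEROES	0x1u	/* 001b: deallocated blocks read as 0h */
#define DLFEAT_WZDS		(1u << 3)

/*
 * True when the host may set the Deallocate bit in Write Zeroes and
 * still expect zeroes on read: WZDS must be set and DRB must be 001b.
 */
static bool dlfeat_allows_deac(uint8_t dlfeat)
{
	return (dlfeat & DLFEAT_WZDS) &&
	       (dlfeat & DLFEAT_DRB_MASK) == DLFEAT_DRB_ZEROES;
}
```

This only says that the host may request deallocation; it says nothing
about whether the controller actually deallocates.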

Were you saying that what is described in this protocol is not a
mandatory requirement? That is, disks that claim to support the UNMAP
write zeroes command (WZDS=1, DRB=1) may in fact still write actual
zero data to the storage media? Or were you referring to some
irregular disks that do not obey the protocol and mislead users?

Thanks,
Yi.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH -next v3 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  2025-04-10  3:52     ` Zhang Yi
@ 2025-04-10  7:15       ` Christoph Hellwig
  2025-04-10  8:20         ` Keith Busch
  2025-04-10  9:15         ` Zhang Yi
  0 siblings, 2 replies; 24+ messages in thread
From: Christoph Hellwig @ 2025-04-10  7:15 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Christoph Hellwig, linux-fsdevel, linux-ext4, linux-block,
	dm-devel, linux-nvme, linux-scsi, linux-xfs, linux-kernel, tytso,
	djwong, john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki,
	yi.zhang, chengzhihao1, yukuai3, yangerkun

On Thu, Apr 10, 2025 at 11:52:17AM +0800, Zhang Yi wrote:
> 
> Thank you for your review and comments. However, I'm not sure I fully
> understand your points. Could you please provide more details?
> 
> AFAIK, the NVMe protocol has the following description in the latest
> NVM Command Set Specification Figure 82 and Figure 114:
> 
> ===
> Deallocate (DEAC): If this bit is set to ‘1’, then the host is
> requesting that the controller deallocate the specified logical blocks.
> If this bit is cleared to ‘0’, then the host is not requesting that
> the controller deallocate the specified logical blocks...
> 
> DLFEAT:
> Write Zeroes Deallocation Support (WZDS): If this bit is set to ‘1’,
> then the controller supports the Deallocate bit in the Write Zeroes
> command for this namespace...

Yes.  The host is requesting, not the controller shall.  It's not
guaranteed behavior and the controller might as well actually write
zeroes to the media.  That is rather stupid, but still.

Also note that some write zeroes implementations in consumer devices
are really slow even when deallocation is requested so that we had
to blacklist them.

> Were you saying that what is described in this protocol is not a
> mandatory requirement? Which means the disks that claiming to support
> the UNMAP write zeroes command(WZDS=1,DRB=1), but in fact, they still
> write actual zeroes data to the storage media? Or were you referring
> to some irregular disks that do not obey the protocol and mislead
> users?

They are at least allowed to.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH -next v3 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  2025-04-10  7:15       ` Christoph Hellwig
@ 2025-04-10  8:20         ` Keith Busch
  2025-04-10  9:35           ` Zhang Yi
  2025-04-10  9:15         ` Zhang Yi
  1 sibling, 1 reply; 24+ messages in thread
From: Keith Busch @ 2025-04-10  8:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Zhang Yi, linux-fsdevel, linux-ext4, linux-block, dm-devel,
	linux-nvme, linux-scsi, linux-xfs, linux-kernel, tytso, djwong,
	john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang,
	chengzhihao1, yukuai3, yangerkun

On Thu, Apr 10, 2025 at 09:15:59AM +0200, Christoph Hellwig wrote:
> On Thu, Apr 10, 2025 at 11:52:17AM +0800, Zhang Yi wrote:
> > 
> > Thank you for your review and comments. However, I'm not sure I fully
> > understand your points. Could you please provide more details?
> > 
> > AFAIK, the NVMe protocol has the following description in the latest
> > NVM Command Set Specification Figure 82 and Figure 114:
> > 
> > ===
> > > Deallocate (DEAC): If this bit is set to ‘1’, then the host is
> > > requesting that the controller deallocate the specified logical blocks.
> > > If this bit is cleared to ‘0’, then the host is not requesting that
> > > the controller deallocate the specified logical blocks...
> > > 
> > > DLFEAT:
> > > Write Zeroes Deallocation Support (WZDS): If this bit is set to ‘1’,
> > then the controller supports the Deallocate bit in the Write Zeroes
> > command for this namespace...
> 
> Yes.  The host is requesting, not the controller shall.  It's not
> guaranteed behavior and the controller might as well actually write
> zeroes to the media.  That is rather stupid, but still.

I guess some controllers _really_ want specific alignments to
successfully do a proper discard. While still not guaranteed in spec, I
think it is safe to assume a proper deallocation will occur if you align
to NPDA and NPDG. Otherwise, the controller may do a read-modify-write
to ensure zeroes are returned for the requested LBA range on anything
that straddles an implementation specific boundary.
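
To make the alignment point concrete, a caller could shrink the range
inward to the device granularity before requesting deallocation and
treat the unaligned head and tail separately. The sketch below assumes a
single granularity in bytes, gran, already derived from values such as
NPDG and NPDA; the conversion from those fields is not shown, and
align_inward() is a hypothetical helper. Overflow of start + len is not
handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Shrink [start, start + len) inward to multiples of gran bytes.
 * gran need not be a power of two. Returns false when no fully
 * aligned sub-range exists; *astart and *alen are then untouched.
 */
static bool align_inward(uint64_t start, uint64_t len, uint64_t gran,
			 uint64_t *astart, uint64_t *alen)
{
	uint64_t first, last;

	assert(gran != 0);
	first = (start + gran - 1) / gran * gran;	/* round start up */
	last = (start + len) / gran * gran;		/* round end down */
	if (first >= last)
		return false;
	*astart = first;
	*alen = last - first;
	return true;
}
```

For example, a request for bytes 1000..10999 with a 4096-byte
granularity yields the aligned sub-range starting at 4096 with length
4096.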

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH -next v3 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  2025-04-10  7:15       ` Christoph Hellwig
  2025-04-10  8:20         ` Keith Busch
@ 2025-04-10  9:15         ` Zhang Yi
  1 sibling, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2025-04-10  9:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi, linux-xfs, linux-kernel, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang, chengzhihao1,
	yukuai3, yangerkun

On 2025/4/10 15:15, Christoph Hellwig wrote:
> On Thu, Apr 10, 2025 at 11:52:17AM +0800, Zhang Yi wrote:
>>
>> Thank you for your review and comments. However, I'm not sure I fully
>> understand your points. Could you please provide more details?
>>
>> AFAIK, the NVMe protocol has the following description in the latest
>> NVM Command Set Specification Figure 82 and Figure 114:
>>
>> ===
>> Deallocate (DEAC): If this bit is set to ‘1’, then the host is
>> requesting that the controller deallocate the specified logical blocks.
>> If this bit is cleared to ‘0’, then the host is not requesting that
>> the controller deallocate the specified logical blocks...
>>
>> DLFEAT:
>> Write Zeroes Deallocation Support (WZDS): If this bit is set to ‘1’,
>> then the controller supports the Deallocate bit in the Write Zeroes
>> command for this namespace...
> 
> Yes.  The host is requesting, not the controller shall.  It's not
> guaranteed behavior and the controller might as well actually write
> zeroes to the media.  That is rather stupid, but still.

IIUC, DEAC is requested by the host, but the WZDS and DRB bits in
DLFEAT are reported by the controller (no?). The host only issues a
DEAC request when both WZDS and DRB indicate support. So I think that
if the disk controller reports WZDS=1 and DRB=1, the kernel can only
trust it according to the protocol and set the
BLK_FEAT_WRITE_ZEROES_UNMAP flag; the kernel can't, and doesn't need
to, identify disks that deviate from this.
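
The check described above could look roughly like the sketch below. It
is not the code from the patch: the helper name and the macro names are
made up for illustration, and the bit layout (DRB in bits 2:0 with 001b
meaning "reads return zeroes", WZDS in bit 3) is my reading of the
DLFEAT field in the NVM Command Set Specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout of the Identify Namespace DLFEAT byte. */
#define DLFEAT_DRB_MASK   0x07u      /* bits 2:0, Deallocation Read Behavior */
#define DLFEAT_DRB_ZEROES 0x01u      /* deallocated blocks read back as zeroes */
#define DLFEAT_WZDS       (1u << 3)  /* DEAC bit supported in Write Zeroes */

/*
 * Hypothetical helper: true only if the controller reports both that it
 * honours DEAC in Write Zeroes and that deallocated blocks read as zeroes.
 * This reflects what the controller claims, not how fast it actually is.
 */
static bool dlfeat_allows_unmap_zeroes(uint8_t dlfeat)
{
	return (dlfeat & DLFEAT_WZDS) &&
	       (dlfeat & DLFEAT_DRB_MASK) == DLFEAT_DRB_ZEROES;
}
```

With this reading, a DLFEAT value of 0x09 qualifies, while 0x08 (WZDS
set, read behavior not reported) and 0x01 (zeroes on read, no WZDS) do
not.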

> 
> Also note that some write zeroes implementations in consumer devices
> are really slow even when deallocation is requested so that we had
> to blacklist them.

Yes, indeed. For now, the kernel can only rely on what the protocol
specification reports, and there seems to be no better way to
determine a disk's actual behavior. Perhaps we should emphasize in the
documentation that the write_zeroes_unmap flag does not mean the disk
supports 'fast' write zeroes.

Thanks.
Yi.


* Re: [RFC PATCH -next v3 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  2025-04-10  8:20         ` Keith Busch
@ 2025-04-10  9:35           ` Zhang Yi
  0 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2025-04-10  9:35 UTC (permalink / raw)
  To: Keith Busch, Christoph Hellwig
  Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi, linux-xfs, linux-kernel, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang, chengzhihao1,
	yukuai3, yangerkun

On 2025/4/10 16:20, Keith Busch wrote:
> On Thu, Apr 10, 2025 at 09:15:59AM +0200, Christoph Hellwig wrote:
>> On Thu, Apr 10, 2025 at 11:52:17AM +0800, Zhang Yi wrote:
>>>
>>> Thank you for your review and comments. However, I'm not sure I fully
>>> understand your points. Could you please provide more details?
>>>
>>> AFAIK, the NVMe protocol has the following description in the latest
>>> NVM Command Set Specification Figure 82 and Figure 114:
>>>
>>> ===
>>> Deallocate (DEAC): If this bit is set to ‘1’, then the host is
>>> requesting that the controller deallocate the specified logical blocks.
>>> If this bit is cleared to ‘0’, then the host is not requesting that
>>> the controller deallocate the specified logical blocks...
>>>
>>> DLFEAT:
>>> Write Zeroes Deallocation Support (WZDS): If this bit is set to ‘1’,
>>> then the controller supports the Deallocate bit in the Write Zeroes
>>> command for this namespace...
>>
>> Yes.  The host is requesting, not the controller shall.  It's not
>> guaranteed behavior and the controller might as well actually write
>> zeroes to the media.  That is rather stupid, but still.
> 
> I guess some controllers _really_ want specific alignments to
> successfully do a proper discard. While still not guaranteed in spec, I
> think it is safe to assume a proper deallocation will occur if you align
> to NPDA and NPDG. Otherwise, the controller may do a read-modify-write
> to ensure zeroes are returned for the requested LBA range on anything
> that straddles an implementation specific boundary.
> 

I understand. A proper deallocation has certain constraints, but I
guess it should be useful for most scenarios. Thank you for
the explanation.
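
To make the alignment constraint concrete, here is a minimal sketch of
a range check. It assumes the caller has already decoded NPDA and NPDG
from Identify Namespace into plain logical-block counts; the function
name and parameters are hypothetical, and passing the check only makes
a real deallocation more likely, since the spec still does not
guarantee it.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: does [start_lba, start_lba + nr_blocks) start on
 * the preferred deallocate alignment and cover a whole number of
 * preferred deallocate granules?  Both limits are in logical blocks.
 */
static bool range_is_dealloc_aligned(uint64_t start_lba, uint64_t nr_blocks,
				     uint32_t align_blocks, uint32_t gran_blocks)
{
	/* Reject empty ranges and limits the device did not report. */
	if (!align_blocks || !gran_blocks || !nr_blocks)
		return false;

	return start_lba % align_blocks == 0 && nr_blocks % gran_blocks == 0;
}
```

A range that fails this check may straddle an implementation specific
boundary, which is the case where the controller might fall back to a
read-modify-write.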

Thanks,
Yi.


* Re: [RFC PATCH -next v3 07/10] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
  2025-04-09 10:50     ` Christian Brauner
@ 2025-04-18  6:44       ` Zhang Yi
  0 siblings, 0 replies; 24+ messages in thread
From: Zhang Yi @ 2025-04-18  6:44 UTC (permalink / raw)
  To: Christian Brauner, Christoph Hellwig
  Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
	linux-scsi, linux-xfs, linux-kernel, tytso, djwong, john.g.garry,
	bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang, chengzhihao1,
	yukuai3, yangerkun

On 2025/4/9 18:50, Christian Brauner wrote:
> On Wed, Apr 09, 2025 at 12:35:48PM +0200, Christoph Hellwig wrote:
>> On Tue, Mar 18, 2025 at 03:35:42PM +0800, Zhang Yi wrote:
>>> Users can check the disk support of unmap write zeroes command by
>>> querying:
>>>
>>>     /sys/block/<disk>/queue/write_zeroes_unmap
>>
>> No, that is not in any way a good user interface.  Users need to be
>> able to query this on a per-file basis.
> 
> Agreed. This should get a statx attribute most likely.

Sorry for the late reply. Sure, I will add a statx attribute for both
bdev and ext4.
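
For reference, the per-disk sysfs query quoted above could be read
from userspace roughly as follows. This is only a sketch: the helper
name is made up, and it assumes the attribute prints a single integer
(0 or 1), which this thread does not spell out. It also shows the
limitation raised in review, since the caller must already know which
disk backs a given file.

```c
#include <stdio.h>

/*
 * Hypothetical helper: read /sys/block/<disk>/queue/write_zeroes_unmap.
 * Returns 1 if the attribute is non-zero, 0 if it is zero, and -1 if
 * the attribute is missing or cannot be parsed.
 */
static int read_write_zeroes_unmap(const char *disk)
{
	char path[256];
	FILE *f;
	int val;
	int n;

	n = snprintf(path, sizeof(path),
		     "/sys/block/%s/queue/write_zeroes_unmap", disk);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	/* Assumption: the attribute holds one decimal integer. */
	if (fscanf(f, "%d", &val) != 1) {
		fclose(f);
		return -1;
	}
	fclose(f);

	return val ? 1 : 0;
}
```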

Thanks,
Yi.


end of thread, other threads:[~2025-04-18  6:44 UTC | newest]

Thread overview: 24+ messages
2025-03-18  7:35 [RFC PATCH -next v3 00/10] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
2025-03-18  7:35 ` [RFC PATCH -next v3 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features Zhang Yi
2025-04-09 10:31   ` Christoph Hellwig
2025-04-10  3:52     ` Zhang Yi
2025-04-10  7:15       ` Christoph Hellwig
2025-04-10  8:20         ` Keith Busch
2025-04-10  9:35           ` Zhang Yi
2025-04-10  9:15         ` Zhang Yi
2025-03-18  7:35 ` [RFC PATCH -next v3 02/10] nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit Zhang Yi
2025-03-18  7:35 ` [RFC PATCH -next v3 03/10] nvme-multipath: add BLK_FEAT_WRITE_ZEROES_UNMAP support Zhang Yi
2025-03-18  7:35 ` [RFC PATCH -next v3 04/10] nvmet: set WZDS and DRB if device supports BLK_FEAT_WRITE_ZEROES_UNMAP Zhang Yi
2025-04-09 10:34   ` Christoph Hellwig
2025-03-18  7:35 ` [RFC PATCH -next v3 05/10] scsi: sd: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports unmap zeroing mode Zhang Yi
2025-03-18  7:35 ` [RFC PATCH -next v3 06/10] dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support Zhang Yi
2025-03-19 19:50   ` Benjamin Marzinski
2025-03-18  7:35 ` [RFC PATCH -next v3 07/10] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate Zhang Yi
2025-04-09 10:35   ` Christoph Hellwig
2025-04-09 10:50     ` Christian Brauner
2025-04-18  6:44       ` Zhang Yi
2025-03-18  7:35 ` [RFC PATCH -next v3 08/10] block: add FALLOC_FL_WRITE_ZEROES support Zhang Yi
2025-03-18  7:35 ` [RFC PATCH -next v3 09/10] block: factor out common part in blkdev_fallocate() Zhang Yi
2025-04-09 10:36   ` Christoph Hellwig
2025-04-09 10:36     ` Christoph Hellwig
2025-03-18  7:35 ` [RFC PATCH -next v3 10/10] ext4: add FALLOC_FL_WRITE_ZEROES support Zhang Yi
