* [RFC PATCH v4 00/11] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag
@ 2025-04-21 2:14 Zhang Yi
2025-04-21 2:14 ` [RFC PATCH v4 01/11] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features Zhang Yi
` (10 more replies)
0 siblings, 11 replies; 38+ messages in thread
From: Zhang Yi @ 2025-04-21 2:14 UTC (permalink / raw)
To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi
Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
bmarzins, chaitanyak, shinichiro.kawasaki, brauner, yi.zhang,
yi.zhang, chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Changes since RFC v3:
- Rebase the code onto 6.15-rc2.
- Add a note in patch 1 to indicate that the unmap write zeroes command
is not always guaranteed, as Christoph suggested.
- Rename bdev_unmap_write_zeroes() helper and move it to patch 1 as
Christoph suggested.
- Introduce a new statx attribute flag STATX_ATTR_WRITE_ZEROES_UNMAP as
Christoph and Christian suggested.
- Exchange the order of the two patches that modified
blkdev_fallocate() as Christoph suggested.
Changes since RFC v2:
- Rebase the code onto next-20250314.
- Add support for nvme multipath.
- Add support for NVMeT with block device backing.
- Clear FALLOC_FL_WRITE_ZEROES if dm clears
limits->max_write_zeroes_sectors.
- Add the counterpart userspace tools (util-linux and xfs_io) and
tests (blktests and xfstests); please see below for details.
Changes since RFC v1:
- Switch to add a new write zeroes operation, FALLOC_FL_WRITE_ZEROES,
in fallocate, instead of just adding a supported flag to
FALLOC_FL_ZERO_RANGE.
- Introduce a new flag BLK_FEAT_WRITE_ZEROES_UNMAP to the block
device's queue limit features, and implement it on SCSI sd driver,
NVMe SSD driver and dm driver.
- Implement FALLOC_FL_WRITE_ZEROES on both the ext4 filesystem and
block device (bdev).
RFC v3: https://lore.kernel.org/linux-fsdevel/20250318073545.3518707-1-yi.zhang@huaweicloud.com/
RFC v2: https://lore.kernel.org/linux-fsdevel/20250115114637.2705887-1-yi.zhang@huaweicloud.com/
RFC v1: https://lore.kernel.org/linux-fsdevel/20241228014522.2395187-1-yi.zhang@huaweicloud.com/
The corresponding userspace tool changes and tests are here:
- util-linux: https://lore.kernel.org/linux-fsdevel/20250318073218.3513262-1-yi.zhang@huaweicloud.com/
- xfsprogs: https://lore.kernel.org/linux-fsdevel/20250318072318.3502037-1-yi.zhang@huaweicloud.com/
- xfstests: https://lore.kernel.org/linux-fsdevel/20250318072615.3505873-1-yi.zhang@huaweicloud.com/
- blktests: https://lore.kernel.org/linux-fsdevel/20250318072835.3508696-1-yi.zhang@huaweicloud.com/
Currently, we can use the fallocate command to quickly create a
pre-allocated file. However, on most filesystems, such as ext4 and XFS,
fallocate creates pre-allocated blocks in an unwritten state, and the
FALLOC_FL_ZERO_RANGE flag behaves similarly. The extent state must
be converted to a written state when the user later writes data into
this range, which can trigger numerous metadata changes and consequent
journal I/O. This may lead to significant write amplification and
performance degradation in synchronous write mode. Therefore, we need a
method to create a pre-allocated file with written extents that can be
used for pure overwriting. At the moment, the only method available is
to create an empty file and write zero data into it (for example, using
'dd' with a large block size). However, this method is slow and consumes
a considerable amount of disk bandwidth, so we must pre-allocate files
in advance and cannot add pre-allocated files while user business
services are running.
Fortunately, with the development and increasingly wide use of
flash-based storage devices, we can efficiently write zeroes to SSDs
using the unmap write zeroes command if the devices do not write
physical zeroes to the media. For example, if SCSI SSDs support the
UNMAP bit or NVMe SSDs support the DEAC bit[1], the write zeroes command
does not write actual data to the device. Instead, NVMe converts the
zeroed range to a deallocated state, which is fast and consumes
almost no disk write bandwidth. Consequently, this feature can provide
us with a faster method for creating pre-allocated files with written
extents and zeroed data. However, please note that this may be a
best-effort optimization rather than a mandatory requirement: some
devices may partially fall back to writing physical zeroes due to
factors such as receiving unaligned commands.
This series aims to implement this by:
1. Introduce a new feature BLK_FEAT_WRITE_ZEROES_UNMAP to the block
device queue limit features, which indicates whether the storage
device explicitly supports the unmapped write zeroes command. This
flag should be set to 1 by the driver if the attached disk supports
this command. Users can check this flag by querying:
/sys/block/<disk>/queue/write_zeroes_unmap
2. Introduce a new attribute flag STATX_ATTR_WRITE_ZEROES_UNMAP into
statx. Users can determine whether a bdev or a file supports the
unmapped write zeroes command.
3. Introduce a new flag FALLOC_FL_WRITE_ZEROES into fallocate.
Filesystems that support this operation should allocate written
extents and issue zeroes to the range of the device. If the device
supports the unmap write zeroes command, the zeroing can be
accelerated; if not, we currently still allow falling back to
submitting zeroed data. Users can verify whether the device supports
the unmap write zeroes command and then decide whether to use it.
This series implements the BLK_FEAT_WRITE_ZEROES_UNMAP flag for SCSI,
NVMe and device-mapper drivers, and adds FALLOC_FL_WRITE_ZEROES and
STATX_ATTR_WRITE_ZEROES_UNMAP support for ext4 and raw bdev devices. Any
comments are welcome.
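As an illustration of how the interfaces listed above might be used from
userspace (this is not part of the series), here is a minimal C sketch.
The helper names are hypothetical. The value used for
FALLOC_FL_WRITE_ZEROES is an assumption, since the uapi definition is
not shown in this part of the thread; prefer the definition from the
patched <linux/falloc.h> when it is available.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* needed for fallocate(2) in <fcntl.h> */
#endif
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * ASSUMPTION: 0x80 is a placeholder for illustration only. The real
 * value comes from the patched include/uapi/linux/falloc.h.
 */
#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES 0x80
#endif

/*
 * Hypothetical helper: read an attribute file such as
 * /sys/block/<disk>/queue/write_zeroes_unmap.
 * Returns the integer stored in the file (1 or 0 for this attribute),
 * or -1 if the file cannot be opened or parsed.
 */
static int read_write_zeroes_unmap(const char *attr_path)
{
	FILE *f = fopen(attr_path, "r");
	int val = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

/*
 * Hypothetical helper: ask the kernel to allocate written, zeroed
 * extents for the first len bytes of fd.
 * Returns 0 on success, or -1 with errno set on failure.
 */
static int preallocate_written(int fd, off_t len)
{
	return fallocate(fd, FALLOC_FL_WRITE_ZEROES, 0, len);
}
```

A caller could check read_write_zeroes_unmap() first and only call
preallocate_written() when it returns 1. On a kernel or filesystem
without this series, the fallocate() call should simply fail, so the
caller still needs a fallback path.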
I've tested performance with this series on an ext4 filesystem on my
machine with an Intel Xeon Gold 6248R CPU and a 7TB KCD61LUL7T68 NVMe
SSD, which supports the unmap write zeroes command with the Deallocated
state and the DEAC bit. Feel free to give it a try.
0. Ensure the NVMe device supports WRITE_ZERO command.
$ cat /sys/block/nvme5n1/queue/write_zeroes_max_bytes
8388608
$ nvme id-ns -H /dev/nvme5n1 | grep -i -A 3 "dlfeat"
dlfeat : 25
[4:4] : 0x1 Guard Field of Deallocated Logical Blocks is set to CRC
of The Value Read
[3:3] : 0x1 Deallocate Bit in the Write Zeroes Command is Supported
[2:0] : 0x1 Bytes Read From a Deallocated Logical Block and its
Metadata are 0x00
1. Compare 'dd' and fallocate with unmap write zeroes; the latter is
significantly faster than 'dd'.
Create a 1GB and 10GB zeroed file.
$ dd if=/dev/zero of=foo bs=2M count=$count oflag=direct
$ time fallocate -w -l $size bar
#1G
dd: 0.5s
FALLOC_FL_WRITE_ZEROES: 0.17s
#10G
dd: 5.0s
FALLOC_FL_WRITE_ZEROES: 1.7s
2. Run fio overwrite and fallocate with unmap write zeroes
simultaneously, fallocate has little impact on write bandwidth and
only slightly affects write latency.
a) Test bandwidth costs.
$ fio -directory=/test -direct=1 -iodepth=10 -fsync=0 -rw=write \
-numjobs=10 -bs=2M -ioengine=libaio -size=20G -runtime=20 \
-fallocate=none -overwrite=1 -group_reportin -name=bw_test
Without background zero range:
bw (MiB/s): min= 2068, max= 2280, per=100.00%, avg=2186.40
With background zero range:
bw (MiB/s): min= 2056, max= 2308, per=100.00%, avg=2186.20
b) Test write latency costs.
$ fio -filename=/test/foo -direct=1 -iodepth=1 -fsync=0 -rw=write \
-numjobs=1 -bs=4k -ioengine=psync -size=5G -runtime=20 \
-fallocate=none -overwrite=1 -group_reportin -name=lat_test
Without background zero range:
lat (nsec): min=9269, max=71635, avg=9840.65
With a background zero range:
lat (usec): min=9, max=982, avg=11.03
3. Compare overwriting in a pre-allocated unwritten file and a written
file in O_DSYNC mode. Writing to a file with written extents is much
faster.
# First mkfs and create a test file according to below three cases,
# and then run fio.
$ fio -filename=/test/foo -direct=1 -iodepth=1 -fdatasync=1 \
-rw=write -numjobs=1 -bs=4k -ioengine=psync -size=5G \
-runtime=20 -fallocate=none -group_reportin -name=test
unwritten file: IOPS=20.1k, BW=78.7MiB/s
unwritten file + fast_commit: IOPS=42.9k, BW=167MiB/s
written file: IOPS=98.8k, BW=386MiB/s
Thanks,
Yi.
---
[1] https://nvmexpress.org/specifications/
NVM Command Set Specification, Figure 82 and Figure 114.
Zhang Yi (11):
block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit
nvme-multipath: add BLK_FEAT_WRITE_ZEROES_UNMAP support
nvmet: set WZDS and DRB if device supports BLK_FEAT_WRITE_ZEROES_UNMAP
scsi: sd: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports unmap
zeroing mode
dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support
fs: statx add write zeroes unmap attribute
fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
block: factor out common part in blkdev_fallocate()
block: add FALLOC_FL_WRITE_ZEROES support
ext4: add FALLOC_FL_WRITE_ZEROES support
Documentation/ABI/stable/sysfs-block | 18 +++++++++
block/bdev.c | 4 ++
block/blk-settings.c | 6 +++
block/blk-sysfs.c | 3 ++
block/fops.c | 37 +++++++++--------
drivers/md/dm-table.c | 7 +++-
drivers/md/dm.c | 1 +
drivers/nvme/host/core.c | 21 +++++-----
drivers/nvme/host/multipath.c | 3 +-
drivers/nvme/target/io-cmd-bdev.c | 4 ++
drivers/scsi/sd.c | 5 +++
fs/ext4/extents.c | 59 ++++++++++++++++++++++------
fs/ext4/inode.c | 9 +++--
fs/open.c | 1 +
include/linux/blkdev.h | 8 ++++
include/linux/falloc.h | 3 +-
include/trace/events/ext4.h | 3 +-
include/uapi/linux/falloc.h | 18 +++++++++
include/uapi/linux/stat.h | 1 +
19 files changed, 164 insertions(+), 47 deletions(-)
--
2.46.1
* [RFC PATCH v4 01/11] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
2025-04-21 2:14 [RFC PATCH v4 00/11] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
@ 2025-04-21 2:14 ` Zhang Yi
2025-05-05 11:54 ` Christoph Hellwig
2025-05-06 4:21 ` Martin K. Petersen
2025-04-21 2:15 ` [RFC PATCH v4 02/11] nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit Zhang Yi
` (9 subsequent siblings)
10 siblings, 2 replies; 38+ messages in thread
From: Zhang Yi @ 2025-04-21 2:14 UTC (permalink / raw)
To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi
Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
bmarzins, chaitanyak, shinichiro.kawasaki, brauner, yi.zhang,
yi.zhang, chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Currently, disks primarily implement the write zeroes command (aka
REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves
physically writing zeros to the disk media (e.g., HDDs), while the
second performs an unmap operation on the logical blocks, effectively
putting them into a deallocated state (e.g., SSDs). The first method is
generally slow, while the second method is typically very fast.
For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting
REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit can accelerate
the write zeroes operation by placing disk blocks into a deallocated
state (this is best effort, not a mandatory requirement; some devices
may partially fall back to writing physical zeroes due to factors such
as receiving unaligned commands). However, it is difficult to determine
whether the storage device supports unmap write zeroes. We cannot
determine this by querying bdev_limits(bdev)->max_write_zeroes_sectors.
Therefore, add a new queue limit feature, BLK_FEAT_WRITE_ZEROES_UNMAP
and the corresponding sysfs entry, to indicate whether the block device
explicitly supports the unmapped write zeroes command. Each device
driver should set this bit if it is certain that the attached disk
supports this command. If the bit is not set, the disk either does not
support it, or its support status is unknown.
For stacked devices, BLK_FEAT_WRITE_ZEROES_UNMAP should be supported by
both the stacking driver and all underlying devices.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
Documentation/ABI/stable/sysfs-block | 18 ++++++++++++++++++
block/blk-settings.c | 6 ++++++
block/blk-sysfs.c | 3 +++
include/linux/blkdev.h | 8 ++++++++
4 files changed, 35 insertions(+)
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 3879963f0f01..6531cdfcaacf 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -763,6 +763,24 @@ Description:
0, write zeroes is not supported by the device.
+What: /sys/block/<disk>/queue/write_zeroes_unmap
+Date: January 2025
+Contact: Zhang Yi <yi.zhang@huawei.com>
+Description:
+ [RO] Devices that explicitly support the unmap write zeroes
+ operation in which a single write zeroes request with the unmap
+ bit set to zero out the range of contiguous blocks on storage
+ by freeing blocks, rather than writing physical zeroes to the
+ media. If the write_zeroes_unmap is set to 1, this indicates
+ that the device explicitly supports the write zero command.
+ However, this may be a best-effort optimization rather than a
+ mandatory requirement, some devices may partially fall back to
+ writing physical zeroes due to factors such as receiving
+ unaligned commands. If the parameter is set to 0, the device
+ either does not support this operation, or its support status is
+ unknown.
+
+
What: /sys/block/<disk>/queue/zone_append_max_bytes
Date: May 2020
Contact: linux-block@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 6b2dbe645d23..3331d07bd5d9 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -697,6 +697,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
t->features &= ~BLK_FEAT_NOWAIT;
if (!(b->features & BLK_FEAT_POLL))
t->features &= ~BLK_FEAT_POLL;
+ if (!(b->features & BLK_FEAT_WRITE_ZEROES_UNMAP))
+ t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
@@ -819,6 +821,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
t->zone_write_granularity = 0;
t->max_zone_append_sectors = 0;
}
+
+ if (!t->max_write_zeroes_sectors)
+ t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
+
blk_stack_atomic_writes_limits(t, b, start);
return ret;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index a2882751f0d2..7a9c20bd3779 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -261,6 +261,7 @@ static ssize_t queue_##_name##_show(struct gendisk *disk, char *page) \
QUEUE_SYSFS_FEATURE_SHOW(fua, BLK_FEAT_FUA);
QUEUE_SYSFS_FEATURE_SHOW(dax, BLK_FEAT_DAX);
+QUEUE_SYSFS_FEATURE_SHOW(write_zeroes_unmap, BLK_FEAT_WRITE_ZEROES_UNMAP);
static ssize_t queue_poll_show(struct gendisk *disk, char *page)
{
@@ -510,6 +511,7 @@ QUEUE_LIM_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
QUEUE_LIM_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes");
+QUEUE_LIM_RO_ENTRY(queue_write_zeroes_unmap, "write_zeroes_unmap");
QUEUE_LIM_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
QUEUE_LIM_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
@@ -656,6 +658,7 @@ static struct attribute *queue_attrs[] = {
&queue_atomic_write_unit_min_entry.attr,
&queue_atomic_write_unit_max_entry.attr,
&queue_max_write_zeroes_sectors_entry.attr,
+ &queue_write_zeroes_unmap_entry.attr,
&queue_max_zone_append_sectors_entry.attr,
&queue_zone_write_granularity_entry.attr,
&queue_rotational_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e39c45bc0a97..7c8752578e36 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -342,6 +342,9 @@ typedef unsigned int __bitwise blk_features_t;
#define BLK_FEAT_ATOMIC_WRITES \
((__force blk_features_t)(1u << 16))
+/* supports unmap write zeroes command */
+#define BLK_FEAT_WRITE_ZEROES_UNMAP ((__force blk_features_t)(1u << 17))
+
/*
* Flags automatically inherited when stacking limits.
*/
@@ -1341,6 +1344,11 @@ static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
return bdev_limits(bdev)->max_write_zeroes_sectors;
}
+static inline bool bdev_write_zeroes_unmap(struct block_device *bdev)
+{
+ return bdev_limits(bdev)->features & BLK_FEAT_WRITE_ZEROES_UNMAP;
+}
+
static inline bool bdev_nonrot(struct block_device *bdev)
{
return blk_queue_nonrot(bdev_get_queue(bdev));
--
2.46.1
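The stacking rule added to blk_stack_limits() in the patch above can be
modelled in plain C roughly as follows. This is only an illustrative
sketch: struct model_limits and stacked_write_zeroes_unmap() are
hypothetical names, and combining max_write_zeroes_sectors by taking the
minimum is a simplifying assumption that is not shown in the hunks.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two queue_limits fields the rule uses. */
struct model_limits {
	bool write_zeroes_unmap;	/* models BLK_FEAT_WRITE_ZEROES_UNMAP */
	unsigned int max_write_zeroes_sectors;
};

/*
 * Start from the stacking driver's own limits (top) and fold in each
 * underlying device. The feature survives only if every underlying
 * device has it and write zeroes is still supported at all.
 */
static bool stacked_write_zeroes_unmap(struct model_limits top,
				       const struct model_limits *bottom,
				       size_t n)
{
	for (size_t i = 0; i < n; i++) {
		/* Any device without the feature clears it for the stack. */
		if (!bottom[i].write_zeroes_unmap)
			top.write_zeroes_unmap = false;
		/* Assumed: the stacked limit is the minimum of all limits. */
		if (bottom[i].max_write_zeroes_sectors <
		    top.max_write_zeroes_sectors)
			top.max_write_zeroes_sectors =
				bottom[i].max_write_zeroes_sectors;
	}

	/* No write zeroes support means no unmap write zeroes either. */
	if (!top.max_write_zeroes_sectors)
		top.write_zeroes_unmap = false;

	return top.write_zeroes_unmap;
}
```

In this model a single underlying device that lacks the feature, or
that reports a zero write zeroes limit, is enough to clear the feature
on the stacked device.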
* [RFC PATCH v4 02/11] nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit
2025-04-21 2:14 [RFC PATCH v4 00/11] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
2025-04-21 2:14 ` [RFC PATCH v4 01/11] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features Zhang Yi
@ 2025-04-21 2:15 ` Zhang Yi
2025-05-05 11:55 ` Christoph Hellwig
2025-04-21 2:15 ` [RFC PATCH v4 03/11] nvme-multipath: add BLK_FEAT_WRITE_ZEROES_UNMAP support Zhang Yi
` (8 subsequent siblings)
10 siblings, 1 reply; 38+ messages in thread
From: Zhang Yi @ 2025-04-21 2:15 UTC (permalink / raw)
To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi
Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
bmarzins, chaitanyak, shinichiro.kawasaki, brauner, yi.zhang,
yi.zhang, chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
When the device supports the Write Zeroes command and the DEAC bit, it
indicates that the deallocate bit in the Write Zeroes command is
supported, and the bytes read from a deallocated logical block are
zeroes. This means the device supports unmap Write Zeroes, so set the
BLK_FEAT_WRITE_ZEROES_UNMAP feature to the device's queue limit.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
drivers/nvme/host/core.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index b502ac07483b..b2cece376f30 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2223,22 +2223,25 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
if (!nvme_init_integrity(ns->head, &lim, info))
capacity = 0;
- ret = queue_limits_commit_update(ns->disk->queue, &lim);
- if (ret) {
- blk_mq_unfreeze_queue(ns->disk->queue, memflags);
- goto out;
- }
-
- set_capacity_and_notify(ns->disk, capacity);
-
/*
* Only set the DEAC bit if the device guarantees that reads from
* deallocated data return zeroes. While the DEAC bit does not
* require that, it must be a no-op if reads from deallocated data
* do not return zeroes.
*/
- if ((id->dlfeat & 0x7) == 0x1 && (id->dlfeat & (1 << 3)))
+ if ((id->dlfeat & 0x7) == 0x1 && (id->dlfeat & (1 << 3))) {
ns->head->features |= NVME_NS_DEAC;
+ if (lim.max_write_zeroes_sectors)
+ lim.features |= BLK_FEAT_WRITE_ZEROES_UNMAP;
+ }
+
+ ret = queue_limits_commit_update(ns->disk->queue, &lim);
+ if (ret) {
+ blk_mq_unfreeze_queue(ns->disk->queue, memflags);
+ goto out;
+ }
+
+ set_capacity_and_notify(ns->disk, capacity);
set_disk_ro(ns->disk, nvme_ns_is_readonly(ns, info));
set_bit(NVME_NS_READY, &ns->flags);
blk_mq_unfreeze_queue(ns->disk->queue, memflags);
--
2.46.1
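For readers who want to evaluate a DLFEAT value by hand, the condition
used in the hunk above can be sketched as a standalone C helper. The
macro and function names here are hypothetical; only the bit test itself
is taken from the patch. Note that the patch additionally requires a
non-zero lim.max_write_zeroes_sectors before setting the block feature,
which this helper does not model.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits [2:0]: read behaviour of deallocated blocks; 0x1 means zeroes. */
#define DLFEAT_READ_MASK	0x7
#define DLFEAT_READ_ZEROES	0x1
/* Bit 3: the Deallocate bit in the Write Zeroes command is supported. */
#define DLFEAT_WZ_DEAC		(1u << 3)

/* Same test as the one applied to id->dlfeat in the patch above. */
static bool dlfeat_allows_unmap_write_zeroes(uint8_t dlfeat)
{
	return (dlfeat & DLFEAT_READ_MASK) == DLFEAT_READ_ZEROES &&
	       (dlfeat & DLFEAT_WZ_DEAC) != 0;
}
```

With the dlfeat value of 25 reported in the cover letter, both
conditions hold, so the helper returns true.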
* [RFC PATCH v4 03/11] nvme-multipath: add BLK_FEAT_WRITE_ZEROES_UNMAP support
2025-04-21 2:14 [RFC PATCH v4 00/11] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
2025-04-21 2:14 ` [RFC PATCH v4 01/11] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features Zhang Yi
2025-04-21 2:15 ` [RFC PATCH v4 02/11] nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit Zhang Yi
@ 2025-04-21 2:15 ` Zhang Yi
2025-05-05 11:55 ` Christoph Hellwig
2025-04-21 2:15 ` [RFC PATCH v4 04/11] nvmet: set WZDS and DRB if device supports BLK_FEAT_WRITE_ZEROES_UNMAP Zhang Yi
` (7 subsequent siblings)
10 siblings, 1 reply; 38+ messages in thread
From: Zhang Yi @ 2025-04-21 2:15 UTC (permalink / raw)
To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi
Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
bmarzins, chaitanyak, shinichiro.kawasaki, brauner, yi.zhang,
yi.zhang, chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Set the BLK_FEAT_WRITE_ZEROES_UNMAP feature while creating multipath
stacking queue limits by default. This feature shall be disabled if any
attached namespace does not support it.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
drivers/nvme/host/multipath.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 05eccd96d34a..1880ad09559a 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -638,7 +638,8 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
blk_set_stacking_limits(&lim);
lim.dma_alignment = 3;
- lim.features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL;
+ lim.features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL |
+ BLK_FEAT_WRITE_ZEROES_UNMAP;
if (head->ids.csi == NVME_CSI_ZNS)
lim.features |= BLK_FEAT_ZONED;
--
2.46.1
* [RFC PATCH v4 04/11] nvmet: set WZDS and DRB if device supports BLK_FEAT_WRITE_ZEROES_UNMAP
2025-04-21 2:14 [RFC PATCH v4 00/11] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
` (2 preceding siblings ...)
2025-04-21 2:15 ` [RFC PATCH v4 03/11] nvme-multipath: add BLK_FEAT_WRITE_ZEROES_UNMAP support Zhang Yi
@ 2025-04-21 2:15 ` Zhang Yi
2025-05-05 11:56 ` Christoph Hellwig
2025-04-21 2:15 ` [RFC PATCH v4 05/11] scsi: sd: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports unmap zeroing mode Zhang Yi
` (6 subsequent siblings)
10 siblings, 1 reply; 38+ messages in thread
From: Zhang Yi @ 2025-04-21 2:15 UTC (permalink / raw)
To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi
Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
bmarzins, chaitanyak, shinichiro.kawasaki, brauner, yi.zhang,
yi.zhang, chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Set the WZDS and DRB bits in the namespace dlfeat if the underlying
block device supports BLK_FEAT_WRITE_ZEROES_UNMAP, so that the nvme
target device supports the unmapped write zeroes command.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
drivers/nvme/target/io-cmd-bdev.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index 83be0657e6df..052da6174548 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -46,6 +46,10 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
id->npda = id->npdg;
/* NOWS = Namespace Optimal Write Size */
id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
+
+ /* Set WZDS and DRB if device supports unmapped write zeroes */
+ if (bdev_write_zeroes_unmap(bdev))
+ id->dlfeat = (1 << 3) | 0x1;
}
void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
--
2.46.1
* [RFC PATCH v4 05/11] scsi: sd: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports unmap zeroing mode
2025-04-21 2:14 [RFC PATCH v4 00/11] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
` (3 preceding siblings ...)
2025-04-21 2:15 ` [RFC PATCH v4 04/11] nvmet: set WZDS and DRB if device supports BLK_FEAT_WRITE_ZEROES_UNMAP Zhang Yi
@ 2025-04-21 2:15 ` Zhang Yi
2025-04-21 2:15 ` [RFC PATCH v4 06/11] dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support Zhang Yi
` (5 subsequent siblings)
10 siblings, 0 replies; 38+ messages in thread
From: Zhang Yi @ 2025-04-21 2:15 UTC (permalink / raw)
To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi
Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
bmarzins, chaitanyak, shinichiro.kawasaki, brauner, yi.zhang,
yi.zhang, chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
When the device supports the Write Zeroes command and the zeroing mode
is set to SD_ZERO_WS16_UNMAP or SD_ZERO_WS10_UNMAP, this means that the
device supports unmap Write Zeroes, so set the corresponding
BLK_FEAT_WRITE_ZEROES_UNMAP feature to the device's queue limit.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
drivers/scsi/sd.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 950d8c9fb884..652630b410de 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1118,6 +1118,11 @@ static void sd_config_write_same(struct scsi_disk *sdkp,
else
sdkp->zeroing_mode = SD_ZERO_WRITE;
+ if (sdkp->max_ws_blocks &&
+ (sdkp->zeroing_mode == SD_ZERO_WS16_UNMAP ||
+ sdkp->zeroing_mode == SD_ZERO_WS10_UNMAP))
+ lim->features |= BLK_FEAT_WRITE_ZEROES_UNMAP;
+
if (sdkp->max_ws_blocks &&
sdkp->physical_block_size > logical_block_size) {
/*
--
2.46.1
* [RFC PATCH v4 06/11] dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support
2025-04-21 2:14 [RFC PATCH v4 00/11] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
` (4 preceding siblings ...)
2025-04-21 2:15 ` [RFC PATCH v4 05/11] scsi: sd: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports unmap zeroing mode Zhang Yi
@ 2025-04-21 2:15 ` Zhang Yi
2025-04-21 2:15 ` [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute Zhang Yi
` (4 subsequent siblings)
10 siblings, 0 replies; 38+ messages in thread
From: Zhang Yi @ 2025-04-21 2:15 UTC (permalink / raw)
To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi
Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
bmarzins, chaitanyak, shinichiro.kawasaki, brauner, yi.zhang,
yi.zhang, chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Set the BLK_FEAT_WRITE_ZEROES_UNMAP feature on stacking queue limits by
default. This feature shall be disabled if any underlying device does
not support it.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
---
drivers/md/dm-table.c | 7 +++++--
drivers/md/dm.c | 1 +
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 35100a435c88..56141d585ef8 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -598,7 +598,8 @@ int dm_split_args(int *argc, char ***argvp, char *input)
static void dm_set_stacking_limits(struct queue_limits *limits)
{
blk_set_stacking_limits(limits);
- limits->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL;
+ limits->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL |
+ BLK_FEAT_WRITE_ZEROES_UNMAP;
}
/*
@@ -1852,8 +1853,10 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
limits->discard_alignment = 0;
}
- if (!dm_table_supports_write_zeroes(t))
+ if (!dm_table_supports_write_zeroes(t)) {
limits->max_write_zeroes_sectors = 0;
+ limits->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
+ }
if (!dm_table_supports_secure_erase(t))
limits->max_secure_erase_sectors = 0;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 5ab7574c0c76..b59c3dbeaaf1 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1096,6 +1096,7 @@ void disable_write_zeroes(struct mapped_device *md)
/* device doesn't really support WRITE ZEROES, disable it */
limits->max_write_zeroes_sectors = 0;
+ limits->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
}
static bool swap_bios_limit(struct dm_target *ti, struct bio *bio)
--
2.46.1
* [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
2025-04-21 2:14 [RFC PATCH v4 00/11] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
` (5 preceding siblings ...)
2025-04-21 2:15 ` [RFC PATCH v4 06/11] dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support Zhang Yi
@ 2025-04-21 2:15 ` Zhang Yi
2025-05-05 13:22 ` Christoph Hellwig
2025-04-21 2:15 ` [RFC PATCH v4 08/11] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate Zhang Yi
` (3 subsequent siblings)
10 siblings, 1 reply; 38+ messages in thread
From: Zhang Yi @ 2025-04-21 2:15 UTC (permalink / raw)
To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi
Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
bmarzins, chaitanyak, shinichiro.kawasaki, brauner, yi.zhang,
yi.zhang, chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Add a new attribute flag to statx to determine whether a bdev or a file
supports the unmap write zeroes command.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
block/bdev.c | 4 ++++
fs/ext4/inode.c | 9 ++++++---
include/uapi/linux/stat.h | 1 +
3 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/block/bdev.c b/block/bdev.c
index 4844d1e27b6f..29b0e5feb138 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1304,6 +1304,10 @@ void bdev_statx(struct path *path, struct kstat *stat,
queue_atomic_write_unit_max_bytes(bd_queue));
}
+ if (bdev_write_zeroes_unmap(bdev))
+ stat->attributes |= STATX_ATTR_WRITE_ZEROES_UNMAP;
+ stat->attributes_mask |= STATX_ATTR_WRITE_ZEROES_UNMAP;
+
stat->blksize = bdev_io_min(bdev);
blkdev_put_no_open(bdev);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 94c7d2d828a6..38caf2f39c6d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5653,6 +5653,7 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path,
struct inode *inode = d_inode(path->dentry);
struct ext4_inode *raw_inode;
struct ext4_inode_info *ei = EXT4_I(inode);
+ struct block_device *bdev = inode->i_sb->s_bdev;
unsigned int flags;
if ((request_mask & STATX_BTIME) &&
@@ -5672,8 +5673,6 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path,
stat->result_mask |= STATX_DIOALIGN;
if (dio_align == 1) {
- struct block_device *bdev = inode->i_sb->s_bdev;
-
/* iomap defaults */
stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
stat->dio_offset_align = bdev_logical_block_size(bdev);
@@ -5695,6 +5694,9 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path,
generic_fill_statx_atomic_writes(stat, awu_min, awu_max);
}
+ if (S_ISREG(inode->i_mode) && bdev_write_zeroes_unmap(bdev))
+ stat->attributes |= STATX_ATTR_WRITE_ZEROES_UNMAP;
+
flags = ei->i_flags & EXT4_FL_USER_VISIBLE;
if (flags & EXT4_APPEND_FL)
stat->attributes |= STATX_ATTR_APPEND;
@@ -5714,7 +5716,8 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path,
STATX_ATTR_ENCRYPTED |
STATX_ATTR_IMMUTABLE |
STATX_ATTR_NODUMP |
- STATX_ATTR_VERITY);
+ STATX_ATTR_VERITY |
+ STATX_ATTR_WRITE_ZEROES_UNMAP);
generic_fillattr(idmap, request_mask, inode, stat);
return 0;
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index f78ee3670dd5..279ce7e34df7 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -251,6 +251,7 @@ struct statx {
#define STATX_ATTR_VERITY 0x00100000 /* [I] Verity protected file */
#define STATX_ATTR_DAX 0x00200000 /* File is currently in DAX state */
#define STATX_ATTR_WRITE_ATOMIC 0x00400000 /* File supports atomic write operations */
+#define STATX_ATTR_WRITE_ZEROES_UNMAP 0x00800000 /* File supports unmap write zeroes */
#endif /* _UAPI_LINUX_STAT_H */
--
2.46.1
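A userspace consumer of the new attribute would typically look at both
stx_attributes and stx_attributes_mask returned by statx(2). The sketch
below shows one possible way to interpret them. The enum and function
names are hypothetical, and the flag value is copied from the
include/uapi/linux/stat.h hunk above, so it may change in later
revisions of this RFC.

```c
#include <stdint.h>

/* Value copied from the uapi hunk in this patch; this is an RFC value. */
#ifndef STATX_ATTR_WRITE_ZEROES_UNMAP
#define STATX_ATTR_WRITE_ZEROES_UNMAP 0x00800000
#endif

enum wz_unmap_state {
	WZ_UNMAP_UNKNOWN = -1,	/* bit not reported by kernel/filesystem */
	WZ_UNMAP_NO = 0,
	WZ_UNMAP_YES = 1,
};

/*
 * Interpret the stx_attributes and stx_attributes_mask fields of a
 * struct statx. The attribute bit is only meaningful when the same
 * bit is also set in the mask.
 */
static enum wz_unmap_state wz_unmap_from_statx(uint64_t attributes,
					       uint64_t attributes_mask)
{
	if (!(attributes_mask & STATX_ATTR_WRITE_ZEROES_UNMAP))
		return WZ_UNMAP_UNKNOWN;
	return (attributes & STATX_ATTR_WRITE_ZEROES_UNMAP) ?
		WZ_UNMAP_YES : WZ_UNMAP_NO;
}
```

Treating a missing mask bit as "unknown" rather than "no" lets callers
tell an older kernel apart from a device that lacks the feature.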
* [RFC PATCH v4 08/11] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
2025-04-21 2:14 [RFC PATCH v4 00/11] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
` (6 preceding siblings ...)
2025-04-21 2:15 ` [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute Zhang Yi
@ 2025-04-21 2:15 ` Zhang Yi
2025-05-05 13:22 ` Christoph Hellwig
2025-04-21 2:15 ` [RFC PATCH v4 09/11] block: factor out common part in blkdev_fallocate() Zhang Yi
` (2 subsequent siblings)
10 siblings, 1 reply; 38+ messages in thread
From: Zhang Yi @ 2025-04-21 2:15 UTC (permalink / raw)
To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi
Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
bmarzins, chaitanyak, shinichiro.kawasaki, brauner, yi.zhang,
yi.zhang, chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
With the development of flash-based storage devices, we can quickly
write zeros to SSDs using the WRITE_ZERO command if the devices do not
actually write physical zeroes to the media. Therefore, we can use this
command to quickly preallocate a real all-zero file with written
extents. This approach should be beneficial for subsequent pure
overwriting within this file, as it can save on block allocation and,
consequently, significant metadata changes, which should greatly improve
overwrite performance on certain filesystems.
Therefore, introduce a new operation FALLOC_FL_WRITE_ZEROES to
fallocate. This flag is used to convert a specified range of a file to
zeros by issuing a zeroing operation. Blocks should be allocated for the
regions that span holes in the file, and the entire range is converted
to written extents. If the underlying device supports an offloaded
write zeroes command, the zeroing operation can be accelerated. If it
does not, we currently don't prevent the file system from writing actual
zeros to the device. This provides users with a new method to quickly
generate a zeroed file: users no longer need to write zero data to
create a file with written extents.
Users can determine whether a file or a bdev supports the unmap write
zeroes command by calling statx(2) and checking whether the
STATX_ATTR_WRITE_ZEROES_UNMAP flag is set.
Users can also check whether a disk supports the unmap write zeroes
command by querying this sysfs interface:
/sys/block/<disk>/queue/write_zeroes_unmap
Finally, this flag cannot be specified in conjunction with
FALLOC_FL_KEEP_SIZE since allocating written extents beyond file EOF is
not permitted. In addition, filesystems that always require out-of-place
writes should not support this flag since they still need to allocate
new blocks during subsequent overwrites.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
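Note for readers (illustrative only, not part of the patch): the mode
rules described above can be sketched in userspace C. The helper
write_zeroes_mode_check() is hypothetical and mirrors only the new case
added to vfs_fallocate(); the 0x80 fallback is the value proposed in
this RFC and is an assumption for headers that do not define the flag.

```c
#include <errno.h>
#include <linux/falloc.h>

#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES 0x80	/* value proposed in this RFC */
#endif

/*
 * Hypothetical pre-check for a mode that requests write zeroes.
 * Returns 0 if the combination is acceptable, -EOPNOTSUPP otherwise.
 */
static int write_zeroes_mode_check(int mode)
{
	if (!(mode & FALLOC_FL_WRITE_ZEROES))
		return 0;	/* some other operation, not checked here */

	/* Written extents beyond EOF are not permitted. */
	if (mode & FALLOC_FL_KEEP_SIZE)
		return -EOPNOTSUPP;

	/* fallocate modes are mutually exclusive. */
	if (mode & ~FALLOC_FL_WRITE_ZEROES)
		return -EOPNOTSUPP;

	return 0;
}
```

A caller would then issue fallocate(fd, FALLOC_FL_WRITE_ZEROES, offset,
len) and treat EOPNOTSUPP as "this file or filesystem does not support
the operation".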
fs/open.c | 1 +
include/linux/falloc.h | 3 ++-
include/uapi/linux/falloc.h | 18 ++++++++++++++++++
3 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/fs/open.c b/fs/open.c
index a9063cca9911..08b5daaf4df5 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -278,6 +278,7 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
break;
case FALLOC_FL_COLLAPSE_RANGE:
case FALLOC_FL_INSERT_RANGE:
+ case FALLOC_FL_WRITE_ZEROES:
if (mode & FALLOC_FL_KEEP_SIZE)
return -EOPNOTSUPP;
break;
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 3f49f3df6af5..7c38c6b76b60 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -36,7 +36,8 @@ struct space_resv {
FALLOC_FL_COLLAPSE_RANGE | \
FALLOC_FL_ZERO_RANGE | \
FALLOC_FL_INSERT_RANGE | \
- FALLOC_FL_UNSHARE_RANGE)
+ FALLOC_FL_UNSHARE_RANGE | \
+ FALLOC_FL_WRITE_ZEROES)
/* on ia32 l_start is on a 32-bit boundary */
#if defined(CONFIG_X86_64)
diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
index 5810371ed72b..265aae7ff8c1 100644
--- a/include/uapi/linux/falloc.h
+++ b/include/uapi/linux/falloc.h
@@ -78,4 +78,22 @@
*/
#define FALLOC_FL_UNSHARE_RANGE 0x40
+/*
+ * FALLOC_FL_WRITE_ZEROES is used to convert a specified range of a file to
+ * zeros by issuing a zeroing operation. Blocks should be allocated for the
+ * regions that span holes in the file, and the entire range is converted to
+ * written extents. This flag is beneficial for subsequent pure overwriting
+ * within this range, as it can save on block allocation and, consequently,
+ * significant metadata changes. Therefore, filesystems that always require
+ * out-of-place writes should not support this flag.
+ *
+ * Different filesystems may implement different limitations on the
+ * granularity of the zeroing operation. Most will preferably accelerate it
+ * by submitting a write zeroes command if the backing storage supports one,
+ * which may not physically write zeros to the media.
+ *
+ * This flag cannot be specified in conjunction with FALLOC_FL_KEEP_SIZE.
+ */
+#define FALLOC_FL_WRITE_ZEROES 0x80
+
#endif /* _UAPI_FALLOC_H_ */
--
2.46.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC PATCH v4 09/11] block: factor out common part in blkdev_fallocate()
2025-04-21 2:14 [RFC PATCH v4 00/11] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
` (7 preceding siblings ...)
2025-04-21 2:15 ` [RFC PATCH v4 08/11] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate Zhang Yi
@ 2025-04-21 2:15 ` Zhang Yi
2025-04-21 2:15 ` [RFC PATCH v4 10/11] block: add FALLOC_FL_WRITE_ZEROES support Zhang Yi
2025-04-21 2:15 ` [RFC PATCH v4 11/11] ext4: " Zhang Yi
10 siblings, 0 replies; 38+ messages in thread
From: Zhang Yi @ 2025-04-21 2:15 UTC (permalink / raw)
To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi
Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
bmarzins, chaitanyak, shinichiro.kawasaki, brauner, yi.zhang,
yi.zhang, chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Only the flags passed to blkdev_issue_zeroout() differ between the two
zeroing branches in blkdev_fallocate(). Therefore, clean up by
factoring out the common part.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
block/fops.c | 32 ++++++++++++++------------------
1 file changed, 14 insertions(+), 18 deletions(-)
diff --git a/block/fops.c b/block/fops.c
index be9f1dbea9ce..77a5465309e7 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -812,6 +812,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
struct block_device *bdev = I_BDEV(inode);
loff_t end = start + len - 1;
loff_t isize;
+ unsigned int flags;
int error;
/* Fail if we don't recognize the flags. */
@@ -838,34 +839,29 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
filemap_invalidate_lock(inode->i_mapping);
- /*
- * Invalidate the page cache, including dirty pages, for valid
- * de-allocate mode calls to fallocate().
- */
switch (mode) {
case FALLOC_FL_ZERO_RANGE:
case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
- error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
- if (error)
- goto fail;
-
- error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
- len >> SECTOR_SHIFT, GFP_KERNEL,
- BLKDEV_ZERO_NOUNMAP);
+ flags = BLKDEV_ZERO_NOUNMAP;
break;
case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
- error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
- if (error)
- goto fail;
-
- error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
- len >> SECTOR_SHIFT, GFP_KERNEL,
- BLKDEV_ZERO_NOFALLBACK);
+ flags = BLKDEV_ZERO_NOFALLBACK;
break;
default:
error = -EOPNOTSUPP;
+ goto fail;
}
+ /*
+ * Invalidate the page cache, including dirty pages, for valid
+ * de-allocate mode calls to fallocate().
+ */
+ error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
+ if (error)
+ goto fail;
+
+ error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
+ len >> SECTOR_SHIFT, GFP_KERNEL, flags);
fail:
filemap_invalidate_unlock(inode->i_mapping);
return error;
--
2.46.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC PATCH v4 10/11] block: add FALLOC_FL_WRITE_ZEROES support
2025-04-21 2:14 [RFC PATCH v4 00/11] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
` (8 preceding siblings ...)
2025-04-21 2:15 ` [RFC PATCH v4 09/11] block: factor out common part in blkdev_fallocate() Zhang Yi
@ 2025-04-21 2:15 ` Zhang Yi
2025-04-21 2:15 ` [RFC PATCH v4 11/11] ext4: " Zhang Yi
10 siblings, 0 replies; 38+ messages in thread
From: Zhang Yi @ 2025-04-21 2:15 UTC (permalink / raw)
To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi
Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
bmarzins, chaitanyak, shinichiro.kawasaki, brauner, yi.zhang,
yi.zhang, chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Add support for FALLOC_FL_WRITE_ZEROES. It directly calls
blkdev_issue_zeroout() with flags set to 0. The underlying process will
attempt to use the fastest method for issuing zeroes. First, the block
layer will try to issue a write zeroes command if the storage device
supports it; if not, it will fall back to issuing zeroed data. Then, the
storage device driver may attempt to submit an unmap write zeroes command
if the device supports it; if not, the driver may fall back to
submitting a no-unmap write zeroes command.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
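Note for readers (illustrative only, not part of the patch): after the
previous cleanup and this change, the switch in blkdev_fallocate() is
effectively a mapping from fallocate mode to zeroout flags. The sketch
below models that mapping in userspace. SKETCH_ZERO_NOUNMAP and
SKETCH_ZERO_NOFALLBACK are local stand-ins for the kernel-internal
BLKDEV_ZERO_* flags (their values here are assumptions), and
bdev_falloc_flags() is a hypothetical name.

```c
#include <errno.h>
#include <linux/falloc.h>

#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES 0x80	/* value proposed in this RFC */
#endif

/* Stand-ins for BLKDEV_ZERO_NOUNMAP / BLKDEV_ZERO_NOFALLBACK. */
#define SKETCH_ZERO_NOUNMAP	(1u << 0)
#define SKETCH_ZERO_NOFALLBACK	(1u << 1)

/* Map a fallocate mode to zeroout flags; -EOPNOTSUPP if unsupported. */
static int bdev_falloc_flags(int mode, unsigned int *flags)
{
	switch (mode) {
	case FALLOC_FL_ZERO_RANGE:
	case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
		*flags = SKETCH_ZERO_NOUNMAP;	/* keep blocks provisioned */
		return 0;
	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
		*flags = SKETCH_ZERO_NOFALLBACK; /* never write real zeroes */
		return 0;
	case FALLOC_FL_WRITE_ZEROES:
		*flags = 0;	/* let the lower layers pick the fastest way */
		return 0;
	default:
		return -EOPNOTSUPP;
	}
}
```

Passing no flags is the point of the new case: neither unmapping nor
falling back to zeroed data is ruled out.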
block/fops.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/block/fops.c b/block/fops.c
index 77a5465309e7..e590c8997689 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -803,7 +803,7 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
#define BLKDEV_FALLOC_FL_SUPPORTED \
(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | \
- FALLOC_FL_ZERO_RANGE)
+ FALLOC_FL_ZERO_RANGE | FALLOC_FL_WRITE_ZEROES)
static long blkdev_fallocate(struct file *file, int mode, loff_t start,
loff_t len)
@@ -847,6 +847,9 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
flags = BLKDEV_ZERO_NOFALLBACK;
break;
+ case FALLOC_FL_WRITE_ZEROES:
+ flags = 0;
+ break;
default:
error = -EOPNOTSUPP;
goto fail;
--
2.46.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [RFC PATCH v4 11/11] ext4: add FALLOC_FL_WRITE_ZEROES support
2025-04-21 2:14 [RFC PATCH v4 00/11] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
` (9 preceding siblings ...)
2025-04-21 2:15 ` [RFC PATCH v4 10/11] block: add FALLOC_FL_WRITE_ZEROES support Zhang Yi
@ 2025-04-21 2:15 ` Zhang Yi
10 siblings, 0 replies; 38+ messages in thread
From: Zhang Yi @ 2025-04-21 2:15 UTC (permalink / raw)
To: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi
Cc: linux-xfs, linux-kernel, hch, tytso, djwong, john.g.garry,
bmarzins, chaitanyak, shinichiro.kawasaki, brauner, yi.zhang,
yi.zhang, chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Add support for FALLOC_FL_WRITE_ZEROES. This first allocates blocks as
unwritten, then issues a zero command outside of the running journal
handle, and finally converts them to a written state.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
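Note for readers (illustrative only, not part of the patch): the three
steps above (allocate as unwritten, zero outside the journal handle,
convert to written) can be pictured with a toy model. Nothing below is
ext4 code. The block-state array, the one-byte-per-block data array,
TOY_MAX_CHUNK and toy_write_zeroes() are all hypothetical and only show
the per-chunk ordering of the steps.

```c
#include <stddef.h>

enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_WRITTEN };

/* Hypothetical cap on the number of blocks handled per iteration. */
#define TOY_MAX_CHUNK 4

static void toy_write_zeroes(enum blk_state *state, unsigned char *data,
			     size_t start, size_t len)
{
	while (len) {
		size_t n = len < TOY_MAX_CHUNK ? len : TOY_MAX_CHUNK;
		size_t i;

		/* 1. Short transaction: allocate holes as unwritten. */
		for (i = start; i < start + n; i++)
			if (state[i] == BLK_HOLE)
				state[i] = BLK_UNWRITTEN;

		/* 2. No transaction held: zero the mapped blocks. */
		for (i = start; i < start + n; i++)
			data[i] = 0;

		/* 3. Convert the zeroed blocks to written. */
		for (i = start; i < start + n; i++)
			state[i] = BLK_WRITTEN;

		/* Advance only after this chunk is zeroed and converted. */
		start += n;
		len -= n;
	}
}
```

The last step in the loop corresponds to moving the map.m_lblk and
map.m_len updates below the zeroout in the diff, so that the zeroing
still sees the chunk that was just mapped.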
fs/ext4/extents.c | 59 ++++++++++++++++++++++++++++++-------
include/trace/events/ext4.h | 3 +-
2 files changed, 50 insertions(+), 12 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index c616a16a9f36..a147714403af 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4483,6 +4483,8 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
struct ext4_map_blocks map;
unsigned int credits;
loff_t epos, old_size = i_size_read(inode);
+ unsigned int blkbits = inode->i_blkbits;
+ bool alloc_zero = false;
BUG_ON(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS));
map.m_lblk = offset;
@@ -4495,6 +4497,17 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
if (len <= EXT_UNWRITTEN_MAX_LEN)
flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;
+ /*
+ * Doing the actual write zero during a running journal transaction
+ * costs a lot. First allocate an unwritten extent and then
+ * convert it to written after zeroing it out.
+ */
+ if (flags & EXT4_GET_BLOCKS_ZERO) {
+ flags &= ~EXT4_GET_BLOCKS_ZERO;
+ flags |= EXT4_GET_BLOCKS_UNWRIT_EXT;
+ alloc_zero = true;
+ }
+
/*
* credits to insert 1 extent into extent tree
*/
@@ -4531,9 +4544,7 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
* allow a full retry cycle for any remaining allocations
*/
retries = 0;
- map.m_lblk += ret;
- map.m_len = len = len - ret;
- epos = (loff_t)map.m_lblk << inode->i_blkbits;
+ epos = (loff_t)(map.m_lblk + ret) << blkbits;
inode_set_ctime_current(inode);
if (new_size) {
if (epos > new_size)
@@ -4553,6 +4564,21 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
ret2 = ret3 ? ret3 : ret2;
if (unlikely(ret2))
break;
+
+ if (alloc_zero &&
+ (map.m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN))) {
+ ret2 = ext4_issue_zeroout(inode, map.m_lblk, map.m_pblk,
+ map.m_len);
+ if (likely(!ret2))
+ ret2 = ext4_convert_unwritten_extents(NULL,
+ inode, (loff_t)map.m_lblk << blkbits,
+ (loff_t)map.m_len << blkbits);
+ if (ret2)
+ break;
+ }
+
+ map.m_lblk += ret;
+ map.m_len = len = len - ret;
}
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
goto retry;
@@ -4618,7 +4644,11 @@ static long ext4_zero_range(struct file *file, loff_t offset,
if (end_lblk > start_lblk) {
ext4_lblk_t zero_blks = end_lblk - start_lblk;
- flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | EXT4_EX_NOCACHE);
+ if (mode & FALLOC_FL_WRITE_ZEROES)
+ flags = EXT4_GET_BLOCKS_CREATE_ZERO | EXT4_EX_NOCACHE;
+ else
+ flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN |
+ EXT4_EX_NOCACHE);
ret = ext4_alloc_file_blocks(file, start_lblk, zero_blks,
new_size, flags);
if (ret)
@@ -4730,8 +4760,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
/* Return error if mode is not supported */
if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
- FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |
- FALLOC_FL_INSERT_RANGE))
+ FALLOC_FL_ZERO_RANGE | FALLOC_FL_COLLAPSE_RANGE |
+ FALLOC_FL_INSERT_RANGE | FALLOC_FL_WRITE_ZEROES))
return -EOPNOTSUPP;
inode_lock(inode);
@@ -4762,16 +4792,23 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
if (ret)
goto out_invalidate_lock;
- if (mode & FALLOC_FL_PUNCH_HOLE)
+ switch (mode & FALLOC_FL_MODE_MASK) {
+ case FALLOC_FL_PUNCH_HOLE:
ret = ext4_punch_hole(file, offset, len);
- else if (mode & FALLOC_FL_COLLAPSE_RANGE)
+ break;
+ case FALLOC_FL_COLLAPSE_RANGE:
ret = ext4_collapse_range(file, offset, len);
- else if (mode & FALLOC_FL_INSERT_RANGE)
+ break;
+ case FALLOC_FL_INSERT_RANGE:
ret = ext4_insert_range(file, offset, len);
- else if (mode & FALLOC_FL_ZERO_RANGE)
+ break;
+ case FALLOC_FL_ZERO_RANGE:
+ case FALLOC_FL_WRITE_ZEROES:
ret = ext4_zero_range(file, offset, len, mode);
- else
+ break;
+ default:
ret = -EOPNOTSUPP;
+ }
out_invalidate_lock:
filemap_invalidate_unlock(mapping);
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 156908641e68..6f9cf2811733 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -92,7 +92,8 @@ TRACE_DEFINE_ENUM(ES_REFERENCED_B);
{ FALLOC_FL_KEEP_SIZE, "KEEP_SIZE"}, \
{ FALLOC_FL_PUNCH_HOLE, "PUNCH_HOLE"}, \
{ FALLOC_FL_COLLAPSE_RANGE, "COLLAPSE_RANGE"}, \
- { FALLOC_FL_ZERO_RANGE, "ZERO_RANGE"})
+ { FALLOC_FL_ZERO_RANGE, "ZERO_RANGE"}, \
+ { FALLOC_FL_WRITE_ZEROES, "WRITE_ZEROES"})
TRACE_DEFINE_ENUM(EXT4_FC_REASON_XATTR);
TRACE_DEFINE_ENUM(EXT4_FC_REASON_CROSS_RENAME);
--
2.46.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [RFC PATCH v4 01/11] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
2025-04-21 2:14 ` [RFC PATCH v4 01/11] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features Zhang Yi
@ 2025-05-05 11:54 ` Christoph Hellwig
2025-05-06 4:21 ` Martin K. Petersen
1 sibling, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2025-05-05 11:54 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi, linux-xfs, linux-kernel, hch, tytso, djwong,
john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
yi.zhang, chengzhihao1, yukuai3, yangerkun
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC PATCH v4 02/11] nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit
2025-04-21 2:15 ` [RFC PATCH v4 02/11] nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit Zhang Yi
@ 2025-05-05 11:55 ` Christoph Hellwig
0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2025-05-05 11:55 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi, linux-xfs, linux-kernel, hch, tytso, djwong,
john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
yi.zhang, chengzhihao1, yukuai3, yangerkun
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC PATCH v4 03/11] nvme-multipath: add BLK_FEAT_WRITE_ZEROES_UNMAP support
2025-04-21 2:15 ` [RFC PATCH v4 03/11] nvme-multipath: add BLK_FEAT_WRITE_ZEROES_UNMAP support Zhang Yi
@ 2025-05-05 11:55 ` Christoph Hellwig
0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2025-05-05 11:55 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi, linux-xfs, linux-kernel, hch, tytso, djwong,
john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
yi.zhang, chengzhihao1, yukuai3, yangerkun
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC PATCH v4 04/11] nvmet: set WZDS and DRB if device supports BLK_FEAT_WRITE_ZEROES_UNMAP
2025-04-21 2:15 ` [RFC PATCH v4 04/11] nvmet: set WZDS and DRB if device supports BLK_FEAT_WRITE_ZEROES_UNMAP Zhang Yi
@ 2025-05-05 11:56 ` Christoph Hellwig
0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2025-05-05 11:56 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi, linux-xfs, linux-kernel, hch, tytso, djwong,
john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
yi.zhang, chengzhihao1, yukuai3, yangerkun
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
2025-04-21 2:15 ` [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute Zhang Yi
@ 2025-05-05 13:22 ` Christoph Hellwig
2025-05-05 14:29 ` Darrick J. Wong
0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2025-05-05 13:22 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi, linux-xfs, linux-kernel, hch, tytso, djwong,
john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
yi.zhang, chengzhihao1, yukuai3, yangerkun
On Mon, Apr 21, 2025 at 10:15:05AM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> Add a new attribute flag to statx to determine whether a bdev or a file
> supports the unmap write zeroes command.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> block/bdev.c | 4 ++++
> fs/ext4/inode.c | 9 ++++++---
> include/uapi/linux/stat.h | 1 +
> 3 files changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/block/bdev.c b/block/bdev.c
> index 4844d1e27b6f..29b0e5feb138 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -1304,6 +1304,10 @@ void bdev_statx(struct path *path, struct kstat *stat,
> queue_atomic_write_unit_max_bytes(bd_queue));
> }
>
> + if (bdev_write_zeroes_unmap(bdev))
> + stat->attributes |= STATX_ATTR_WRITE_ZEROES_UNMAP;
> + stat->attributes_mask |= STATX_ATTR_WRITE_ZEROES_UNMAP;
Hmm, shouldn't this always be set by stat? But I might just be
really confused what attributes_mask is, and might in fact have
misapplied it in past patches of my own..
Also shouldn't the patches to report the flag go into the bdev/ext4
patches that actually implement the feature for the respective files
to keep bisectability?
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC PATCH v4 08/11] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
2025-04-21 2:15 ` [RFC PATCH v4 08/11] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate Zhang Yi
@ 2025-05-05 13:22 ` Christoph Hellwig
0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2025-05-05 13:22 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi, linux-xfs, linux-kernel, hch, tytso, djwong,
john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
yi.zhang, chengzhihao1, yukuai3, yangerkun
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
2025-05-05 13:22 ` Christoph Hellwig
@ 2025-05-05 14:29 ` Darrick J. Wong
2025-05-06 4:28 ` Zhang Yi
2025-05-06 5:02 ` Christoph Hellwig
0 siblings, 2 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-05-05 14:29 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Zhang Yi, linux-fsdevel, linux-ext4, linux-block, dm-devel,
linux-nvme, linux-scsi, linux-xfs, linux-kernel, tytso,
john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
yi.zhang, chengzhihao1, yukuai3, yangerkun
On Mon, May 05, 2025 at 03:22:08PM +0200, Christoph Hellwig wrote:
> On Mon, Apr 21, 2025 at 10:15:05AM +0800, Zhang Yi wrote:
> > From: Zhang Yi <yi.zhang@huawei.com>
> >
> > Add a new attribute flag to statx to determine whether a bdev or a file
> > supports the unmap write zeroes command.
> >
> > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > ---
> > block/bdev.c | 4 ++++
> > fs/ext4/inode.c | 9 ++++++---
> > include/uapi/linux/stat.h | 1 +
> > 3 files changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/block/bdev.c b/block/bdev.c
> > index 4844d1e27b6f..29b0e5feb138 100644
> > --- a/block/bdev.c
> > +++ b/block/bdev.c
> > @@ -1304,6 +1304,10 @@ void bdev_statx(struct path *path, struct kstat *stat,
> > queue_atomic_write_unit_max_bytes(bd_queue));
> > }
> >
> > + if (bdev_write_zeroes_unmap(bdev))
> > + stat->attributes |= STATX_ATTR_WRITE_ZEROES_UNMAP;
> > + stat->attributes_mask |= STATX_ATTR_WRITE_ZEROES_UNMAP;
>
> Hmm, shouldn't this always be set by stat? But I might just be
> really confused what attributes_mask is, and might in fact have
> misapplied it in past patches of my own..
attributes_mask contains attribute flags known to the filesystem,
whereas attributes contains flags actually set on the file.
"known_attributes" would have been a better name, but that's water under
the bridge. :P
> Also shouldn't the patches to report the flag go into the bdev/ext4
> patches that actually implement the feature for the respective files
> to keep bisectability?
/I/ think so...
--D
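
Note for readers (illustrative only): the distinction above between
attributes_mask (flags known to the filesystem) and attributes (flags
actually set) gives userspace three possible answers. The sketch below
assumes the flag value proposed in this RFC; wz_unmap_from_statx() and
the enum are hypothetical names, and the two arguments are meant to be
stx_attributes and stx_attributes_mask from struct statx.

```c
#include <stdint.h>

#ifndef STATX_ATTR_WRITE_ZEROES_UNMAP
#define STATX_ATTR_WRITE_ZEROES_UNMAP 0x00800000 /* value from this RFC */
#endif

enum wz_unmap_state { WZ_UNMAP_UNKNOWN, WZ_UNMAP_NO, WZ_UNMAP_YES };

static enum wz_unmap_state wz_unmap_from_statx(uint64_t attributes,
					       uint64_t attributes_mask)
{
	/* Bit not in the mask: the filesystem does not know the flag. */
	if (!(attributes_mask & STATX_ATTR_WRITE_ZEROES_UNMAP))
		return WZ_UNMAP_UNKNOWN;

	/* Bit known: its value in attributes is meaningful. */
	return (attributes & STATX_ATTR_WRITE_ZEROES_UNMAP) ?
		WZ_UNMAP_YES : WZ_UNMAP_NO;
}
```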
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC PATCH v4 01/11] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
2025-04-21 2:14 ` [RFC PATCH v4 01/11] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features Zhang Yi
2025-05-05 11:54 ` Christoph Hellwig
@ 2025-05-06 4:21 ` Martin K. Petersen
2025-05-06 7:51 ` Zhang Yi
1 sibling, 1 reply; 38+ messages in thread
From: Martin K. Petersen @ 2025-05-06 4:21 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi, linux-xfs, linux-kernel, hch, tytso, djwong,
john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
yi.zhang, chengzhihao1, yukuai3, yangerkun
Hi Zhang!
> + [RO] Devices that explicitly support the unmap write zeroes
> + operation in which a single write zeroes request with the unmap
> + bit set to zero out the range of contiguous blocks on storage
> + by freeing blocks, rather than writing physical zeroes to the
> + media. If the write_zeroes_unmap is set to 1, this indicates
> + that the device explicitly supports the write zero command.
> + However, this may be a best-effort optimization rather than a
> + mandatory requirement, some devices may partially fall back to
> + writing physical zeroes due to factors such as receiving
> + unaligned commands. If the parameter is set to 0, the device
> + either does not support this operation, or its support status is
> + unknown.
I am not so keen on mixing Write Zeroes (which is NVMe-speak) and Unmap
(which is SCSI). Also, Deallocate and Unmap reflect block provisioning
state on the device but don't really convey what is semantically
important for your proposed change (zeroing speed and/or media wear
reduction).
That said, I'm having a hard time coming up with a better term.
WRITE_ZEROES_OPTIMIZED, maybe? Naming is hard...
For the description, perhaps something like the following which tries to
focus on the block layer semantics without using protocol-specific
terminology?
[RO] This parameter indicates whether a device supports zeroing data in
a specified block range without incurring the cost of physically writing
zeroes to media for each individual block. This operation is a
best-effort optimization, a device may fall back to physically writing
zeroes to media due to other factors such as misalignment or being asked
to clear a block range smaller than the device's internal allocation
unit. If write_zeroes_unmap is set to 1, the device implements a zeroing
operation which opportunistically avoids writing zeroes to media while
still guaranteeing that subsequent reads from the specified block range
will return zeroed data. If write_zeroes_unmap is set to 0, the device
may have to write each logical block to media during a zeroing operation.
--
Martin K. Petersen
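
Note for readers (illustrative only): a userspace check of the queue
attribute discussed above might look like the sketch below. It assumes
the attribute name write_zeroes_unmap from this RFC and that the file
holds a single 0 or 1; read_write_zeroes_unmap() is a hypothetical
helper.

```c
#include <stdio.h>

/*
 * Returns 1 or 0 as read from the attribute, or -1 if the attribute is
 * missing, unreadable, or holds an unexpected value.
 */
static int read_write_zeroes_unmap(const char *disk)
{
	char path[256];
	FILE *f;
	int val = -1;
	int n;

	n = snprintf(path, sizeof(path),
		     "/sys/block/%s/queue/write_zeroes_unmap", disk);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;	/* disk name too long for the buffer */

	f = fopen(path, "r");
	if (!f)
		return -1;	/* no such disk, or kernel lacks the file */

	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);

	return (val == 0 || val == 1) ? val : -1;
}
```

Per the description above, a result of 0 should be read as "may write
physical zeroes", not as a guarantee that it will.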
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
2025-05-05 14:29 ` Darrick J. Wong
@ 2025-05-06 4:28 ` Zhang Yi
2025-05-06 4:39 ` Christoph Hellwig
2025-05-06 5:02 ` Christoph Hellwig
1 sibling, 1 reply; 38+ messages in thread
From: Zhang Yi @ 2025-05-06 4:28 UTC (permalink / raw)
To: Darrick J. Wong, Christoph Hellwig
Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi, linux-xfs, linux-kernel, tytso, john.g.garry,
bmarzins, chaitanyak, shinichiro.kawasaki, brauner, yi.zhang,
chengzhihao1, yukuai3, yangerkun
On 2025/5/5 22:29, Darrick J. Wong wrote:
> On Mon, May 05, 2025 at 03:22:08PM +0200, Christoph Hellwig wrote:
>> On Mon, Apr 21, 2025 at 10:15:05AM +0800, Zhang Yi wrote:
>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>
>>> Add a new attribute flag to statx to determine whether a bdev or a file
>>> supports the unmap write zeroes command.
>>>
>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>> ---
>>> block/bdev.c | 4 ++++
>>> fs/ext4/inode.c | 9 ++++++---
>>> include/uapi/linux/stat.h | 1 +
>>> 3 files changed, 11 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/block/bdev.c b/block/bdev.c
>>> index 4844d1e27b6f..29b0e5feb138 100644
>>> --- a/block/bdev.c
>>> +++ b/block/bdev.c
>>> @@ -1304,6 +1304,10 @@ void bdev_statx(struct path *path, struct kstat *stat,
>>> queue_atomic_write_unit_max_bytes(bd_queue));
>>> }
>>>
>>> + if (bdev_write_zeroes_unmap(bdev))
>>> + stat->attributes |= STATX_ATTR_WRITE_ZEROES_UNMAP;
>>> + stat->attributes_mask |= STATX_ATTR_WRITE_ZEROES_UNMAP;
>>
>> Hmm, shouldn't this always be set by stat? But I might just be
>> really confused what attributes_mask is, and might in fact have
>> misapplied it in past patches of my own..
>
> attributes_mask contains attribute flags known to the filesystem,
> whereas attributes contains flags actually set on the file.
> "known_attributes" would have been a better name, but that's water under
> the bridge. :P
>
>> Also shouldn't the patches to report the flag go into the bdev/ext4
>> patches that actually implement the feature for the respective files
>> to keep bisectability?
>
> /I/ think so...
>
OK, since this statx reporting flag is not strongly tied to
FALLOC_FL_WRITE_ZEROES in vfs_fallocate(), I'll split this patch into
three separate patches.
Thanks,
Yi.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
2025-05-06 4:28 ` Zhang Yi
@ 2025-05-06 4:39 ` Christoph Hellwig
2025-05-06 11:16 ` Zhang Yi
0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2025-05-06 4:39 UTC (permalink / raw)
To: Zhang Yi
Cc: Darrick J. Wong, Christoph Hellwig, linux-fsdevel, linux-ext4,
linux-block, dm-devel, linux-nvme, linux-scsi, linux-xfs,
linux-kernel, tytso, john.g.garry, bmarzins, chaitanyak,
shinichiro.kawasaki, brauner, yi.zhang, chengzhihao1, yukuai3,
yangerkun
On Tue, May 06, 2025 at 12:28:54PM +0800, Zhang Yi wrote:
> OK, since this statx reporting flag is not strongly tied to
> FALLOC_FL_WRITE_ZEROES in vfs_fallocate(), I'll split this patch into
> three separate patches.
I don't think that is the right thing to do. Keep the flag addition
here, and then report it in the ext4 and bdev patches adding
FALLOC_FL_WRITE_ZEROES as the reporting should be consistent with
the added support.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
2025-05-05 14:29 ` Darrick J. Wong
2025-05-06 4:28 ` Zhang Yi
@ 2025-05-06 5:02 ` Christoph Hellwig
2025-05-06 5:36 ` Darrick J. Wong
1 sibling, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2025-05-06 5:02 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Zhang Yi, linux-fsdevel, linux-ext4,
linux-block, dm-devel, linux-nvme, linux-scsi, linux-xfs,
linux-kernel, tytso, john.g.garry, bmarzins, chaitanyak,
shinichiro.kawasaki, brauner, yi.zhang, chengzhihao1, yukuai3,
yangerkun
On Mon, May 05, 2025 at 07:29:45AM -0700, Darrick J. Wong wrote:
> attributes_mask contains attribute flags known to the filesystem,
> whereas attributes contains flags actually set on the file.
> "known_attributes" would have been a better name, but that's water under
> the bridge. :P
Oooh. I think I was very confused at what this patch does, and what
it does seems confused as well.
The patch adds a new flag to the STATX_ATTR_* namespace, which
historically was used for persistent on-disk flags like immutable,
not the STATX_* namespace where I assumed it, and which has no
support mask. Which seems really odd for a pure kernel feature.
Then again it seems to follow STATX_ATTR_WRITE_ATOMIC which seems
just as wrongly placed unless I'm missing something?
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
2025-05-06 5:02 ` Christoph Hellwig
@ 2025-05-06 5:36 ` Darrick J. Wong
2025-05-06 5:47 ` Christoph Hellwig
0 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2025-05-06 5:36 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Zhang Yi, linux-fsdevel, linux-ext4, linux-block, dm-devel,
linux-nvme, linux-scsi, linux-xfs, linux-kernel, tytso,
john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
yi.zhang, chengzhihao1, yukuai3, yangerkun
On Tue, May 06, 2025 at 07:02:39AM +0200, Christoph Hellwig wrote:
> On Mon, May 05, 2025 at 07:29:45AM -0700, Darrick J. Wong wrote:
> > attributes_mask contains attribute flags known to the filesystem,
> > whereas attributes contains flags actually set on the file.
> > "known_attributes" would have been a better name, but that's water under
> > the bridge. :P
>
> Oooh. I think I was very confused at what this patch does, and what
> it does seems confused as well.
>
> The patch adds a new flag to the STATX_ATTR_* namespace, which
> historically was used for persistent on-disk flags like immutable,
> not the STATX_* namespace where I assumed it, and which has no
> support mask. Which seems really odd for a pure kernel feature.
> Then again it seems to follow STATX_ATTR_WRITE_ATOMIC which seems
> just as wrongly placed unless I'm missing something?
I think STATX_* (i.e. not STATX_ATTR_*) flags have two purposes: 1) to
declare that specific fields in struct statx actually have meaning, most
notably in scenarios where zeroes are valid field contents; and 2) if
filling out the field is expensive, userspace can elect not to have it
filled by leaving the bit unset. I don't know how userspace is supposed
to figure out which fields are expensive.
STATX_ATTR_* are supposed to reflect persistent inode state. I think
STATX_ATTR_WRITE_ATOMIC is a (now unremovable) artifact of the era when
we were going to have a new iflag and feature bit for all the new
forcealign functionality. For XFS it's not necessary anymore because we
always have software fallback and the statx::atomic_write_* fields being
nonzero is sufficient to detect the functionality.
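To make the STATX_* versus STATX_ATTR_* distinction above concrete, here is a
minimal userspace sketch of how the three statx bitmasks are typically read.
It is not part of the patch series. The helper names field_valid() and
query_attr() are invented for illustration, and STATX_ATTR_IMMUTABLE is used
in the usage note only because it is a long-standing persistent flag.

```c
#include <linux/stat.h>	/* struct statx, STATX_* and STATX_ATTR_* constants */

enum attr_state { ATTR_UNKNOWN, ATTR_CLEAR, ATTR_SET };

/* A STATX_* bit in stx_mask says the matching struct field was filled in. */
static int field_valid(const struct statx *stx, unsigned int field_bit)
{
	return (stx->stx_mask & field_bit) != 0;
}

/*
 * A STATX_ATTR_* bit is only meaningful if the filesystem lists it in
 * stx_attributes_mask ("known" flags). If it is not listed, a clear bit
 * in stx_attributes means "unknown", not "off".
 */
static enum attr_state query_attr(const struct statx *stx,
				  unsigned long long flag)
{
	if (!(stx->stx_attributes_mask & flag))
		return ATTR_UNKNOWN;
	return (stx->stx_attributes & flag) ? ATTR_SET : ATTR_CLEAR;
}
```

A caller would fill the struct with statx(2) and then ask, for example,
query_attr(&stx, STATX_ATTR_IMMUTABLE). Keeping "unknown" separate from
"clear" is the reason attributes_mask exists at all.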
(I'm confused about the whole premise of /this/ patch -- it's a "fast
zeroing" fallocate flag that causes the *device* to unmap, so that the
filesystem can preallocate and avoid unwritten extent conversions?
What happens if the block device is thinp and it runs out of space?
That seems antithetical to fallocate...)
--D
* Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
2025-05-06 5:36 ` Darrick J. Wong
@ 2025-05-06 5:47 ` Christoph Hellwig
2025-05-06 11:25 ` Zhang Yi
0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2025-05-06 5:47 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Zhang Yi, linux-fsdevel, linux-ext4,
linux-block, dm-devel, linux-nvme, linux-scsi, linux-xfs,
linux-kernel, tytso, john.g.garry, bmarzins, chaitanyak,
shinichiro.kawasaki, brauner, yi.zhang, chengzhihao1, yukuai3,
yangerkun
On Mon, May 05, 2025 at 10:36:54PM -0700, Darrick J. Wong wrote:
> I think STATX_* (i.e. not STATX_ATTR_*) flags have two purposes: 1) to
> declare that specific fields in struct statx actually have meaning, most
> notably in scenarios where zeroes are valid field contents; and 2) if
> filling out the field is expensive, userspace can elect not to have it
> filled by leaving the bit unset. I don't know how userspace is supposed
> to figure out which fields are expensive.
Yes.
> (I'm confused about the whole premise of /this/ patch -- it's a "fast
> zeroing" fallocate flag that causes the *device* to unmap, so that the
> filesystem can preallocate and avoid unwritten extent conversions?
Yes.
> What happens if the block device is thinp and it runs out of space?
> That seems antithetical to fallocate...)
While the original posix_fallocate was about space preallocation, these
days fallocate seems to be more about extent layout and/or fast
zeroing.
I'm not a huge fan of either this or the hardware atomics as they
force a FTL layer world view which is quite ingrained but also
rather stupid, but some folks really want to go down there full
throttle, so..
* Re: [RFC PATCH v4 01/11] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
2025-05-06 4:21 ` Martin K. Petersen
@ 2025-05-06 7:51 ` Zhang Yi
0 siblings, 0 replies; 38+ messages in thread
From: Zhang Yi @ 2025-05-06 7:51 UTC (permalink / raw)
To: Martin K. Petersen
Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi, linux-xfs, linux-kernel, hch, tytso, djwong,
john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
yi.zhang, chengzhihao1, yukuai3, yangerkun
Hi, Martin!
On 2025/5/6 12:21, Martin K. Petersen wrote:
>
> Hi Zhang!
>
>> + [RO] Devices that explicitly support the unmap write zeroes
>> + operation in which a single write zeroes request with the unmap
>> + bit set to zero out the range of contiguous blocks on storage
>> + by freeing blocks, rather than writing physical zeroes to the
>> + media. If the write_zeroes_unmap is set to 1, this indicates
>> + that the device explicitly supports the write zero command.
>> + However, this may be a best-effort optimization rather than a
>> + mandatory requirement, some devices may partially fall back to
>> + writing physical zeroes due to factors such as receiving
>> + unaligned commands. If the parameter is set to 0, the device
>> + either does not support this operation, or its support status is
>> + unknown.
>
> I am not so keen on mixing Write Zeroes (which is NVMe-speak) and Unmap
> (which is SCSI). Also, Deallocate and Unmap reflect block provisioning
> state on the device but don't really convey what is semantically
> important for your proposed change (zeroing speed and/or media wear
> reduction).
>
This flag doesn't strictly guarantee zeroing speed or media wear
reduction optimizations, but rather reflects typical optimization
behavior across most supported devices and cases. Therefore, I propose
using a name that accurately indicates the function of the block device.
However, I can't think of a better name either. Using the name
WRITE_ZEROES_UNMAP seems appropriate to convey that the block device
supports this type of Deallocate and Unmap state.
> That said, I'm having a hard time coming up with a better term.
> WRITE_ZEROES_OPTIMIZED, maybe? Naming is hard...
Using WRITE_ZEROES_OPTIMIZED feels somewhat too generic to me, and
users may not fully grasp the specific optimizations it entails based
on the name.
>
> For the description, perhaps something like the following which tries to
> focus on the block layer semantics without using protocol-specific
> terminology?
>
> [RO] This parameter indicates whether a device supports zeroing data in
> a specified block range without incurring the cost of physically writing
> zeroes to media for each individual block. This operation is a
> best-effort optimization, a device may fall back to physically writing
> zeroes to media due to other factors such as misalignment or being asked
> to clear a block range smaller than the device's internal allocation
> unit. If write_zeroes_unmap is set to 1, the device implements a zeroing
> operation which opportunistically avoids writing zeroes to media while
> still guaranteeing that subsequent reads from the specified block range
> will return zeroed data. If write_zeroes_unmap is set to 0, the device
> may have to write each logical block to media during a zeroing operation.
>
Thank you for optimizing the description, it looks good to me. I'd like
to use this one in my next iteration. :)
Thanks,
Yi.
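For readers who want to check the attribute discussed above from userspace,
a minimal sketch follows. The path /sys/block/<disk>/queue/write_zeroes_unmap
is an assumption based on the attribute name in this RFC and may differ in
whatever is finally merged. The helper names are invented for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Parse the text of a 0/1 sysfs attribute: returns 1, 0, or -1 if malformed. */
static int parse_bool_attr(const char *text)
{
	if (strcmp(text, "1\n") == 0 || strcmp(text, "1") == 0)
		return 1;
	if (strcmp(text, "0\n") == 0 || strcmp(text, "0") == 0)
		return 0;
	return -1;
}

/*
 * Read the (assumed) write_zeroes_unmap attribute for a disk such as "sda".
 * Returns 1 if the device reports the optimized zeroing operation, 0 if it
 * may have to write each block, and -1 if the attribute is missing or
 * unreadable (for example on a kernel without this series).
 */
static int read_write_zeroes_unmap(const char *disk)
{
	char path[256], buf[16];
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path),
		     "/sys/block/%s/queue/write_zeroes_unmap", disk);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return parse_bool_attr(buf);
}
```

Even a result of 1 is only a hint: as the description says, the operation is
a best-effort optimization and the device may still fall back to writing
zeroes for some ranges.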
* Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
2025-05-06 4:39 ` Christoph Hellwig
@ 2025-05-06 11:16 ` Zhang Yi
2025-05-06 12:11 ` Christoph Hellwig
0 siblings, 1 reply; 38+ messages in thread
From: Zhang Yi @ 2025-05-06 11:16 UTC (permalink / raw)
To: Christoph Hellwig, Darrick J. Wong
Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi, linux-xfs, linux-kernel, tytso, john.g.garry,
bmarzins, chaitanyak, shinichiro.kawasaki, brauner, yi.zhang,
chengzhihao1, yukuai3, yangerkun
On 2025/5/6 12:39, Christoph Hellwig wrote:
> On Tue, May 06, 2025 at 12:28:54PM +0800, Zhang Yi wrote:
>> OK, since this statx reporting flag is not strongly tied to
>> FALLOC_FL_WRITE_ZEROES in vfs_fallocate(), I'll split this patch into
>> three separate patches.
>
> I don't think that is the right thing to do. Keep the flag addition
> here, and then report it in the ext4 and bdev patches adding
> FALLOC_FL_WRITE_ZEROES as the reporting should be consistent with
> the added support.
>
Sorry, but I don't understand your suggestion. The
STATX_ATTR_WRITE_ZEROES_UNMAP attribute only indicates whether the bdev
and the block device under the specified file support the unmap write
zeroes command. It does not reflect whether the bdev and the
filesystems support FALLOC_FL_WRITE_ZEROES. The implementation of
FALLOC_FL_WRITE_ZEROES doesn't fully rely on the unmap write zeroes
command now; users simply refer to this attribute flag to determine
whether to use FALLOC_FL_WRITE_ZEROES when preallocating a file.
So, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES don't
have a strong relation. Why do you suggest putting this into the ext4
and bdev patches that add FALLOC_FL_WRITE_ZEROES?
Thanks,
Yi.
* Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
2025-05-06 5:47 ` Christoph Hellwig
@ 2025-05-06 11:25 ` Zhang Yi
2025-05-06 12:10 ` Christoph Hellwig
0 siblings, 1 reply; 38+ messages in thread
From: Zhang Yi @ 2025-05-06 11:25 UTC (permalink / raw)
To: Christoph Hellwig, Darrick J. Wong
Cc: linux-fsdevel, linux-ext4, linux-block, dm-devel, linux-nvme,
linux-scsi, linux-xfs, linux-kernel, tytso, john.g.garry,
bmarzins, chaitanyak, shinichiro.kawasaki, brauner, yi.zhang,
chengzhihao1, yukuai3, yangerkun
On 2025/5/6 13:47, Christoph Hellwig wrote:
> On Mon, May 05, 2025 at 10:36:54PM -0700, Darrick J. Wong wrote:
>> I think STATX_* (i.e. not STATX_ATTR_*) flags have two purposes: 1) to
>> declare that specific fields in struct statx actually have meaning, most
>> notably in scenarios where zeroes are valid field contents; and 2) if
>> filling out the field is expensive, userspace can elect not to have it
>> filled by leaving the bit unset. I don't know how userspace is supposed
>> to figure out which fields are expensive.
>
> Yes.
>
IIUC, it seems I was misled by STATX_ATTR_WRITE_ATOMIC, and adding this
STATX_ATTR_WRITE_ZEROES_UNMAP attribute flag is incorrect. The right
approach should be to add STATX_WRITE_ZEROES_UNMAP, setting it in the
result_mask if the request_mask includes this flag and
bdev_write_zeroes_unmap(bdev) returns true. Something like below. Is
my understanding right?
diff --git a/block/bdev.c b/block/bdev.c
index 4ba48b8735e7..e1367f30dbce 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1303,9 +1303,9 @@ void bdev_statx(const struct path *path, struct kstat *stat, u32 request_mask)
queue_atomic_write_unit_max_bytes(bd_queue));
}
+ if (request_mask & STATX_WRITE_ZEROES_UNMAP &&
+ bdev_write_zeroes_unmap(bdev))
+ stat->result_mask |= STATX_WRITE_ZEROES_UNMAP;
stat->blksize = bdev_io_min(bdev);
Thanks,
Yi.
* Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
2025-05-06 11:25 ` Zhang Yi
@ 2025-05-06 12:10 ` Christoph Hellwig
2025-05-06 15:55 ` Darrick J. Wong
0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2025-05-06 12:10 UTC (permalink / raw)
To: Zhang Yi, dhowells, brauner
Cc: Christoph Hellwig, Darrick J. Wong, linux-fsdevel, linux-ext4,
linux-block, dm-devel, linux-nvme, linux-scsi, linux-xfs,
linux-kernel, tytso, john.g.garry, bmarzins, chaitanyak,
shinichiro.kawasaki, yi.zhang, chengzhihao1, yukuai3, yangerkun
On Tue, May 06, 2025 at 07:25:06PM +0800, Zhang Yi wrote:
> + if (request_mask & STATX_WRITE_ZEROES_UNMAP &&
> + bdev_write_zeroes_unmap(bdev))
> + stat->result_mask |= STATX_WRITE_ZEROES_UNMAP;
That would be my expectation. But then again this area seems to
confuse me a lot, so maybe we'll get Christian or Dave to chime in.
* Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
2025-05-06 11:16 ` Zhang Yi
@ 2025-05-06 12:11 ` Christoph Hellwig
2025-05-07 7:33 ` Zhang Yi
0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2025-05-06 12:11 UTC (permalink / raw)
To: Zhang Yi
Cc: Christoph Hellwig, Darrick J. Wong, linux-fsdevel, linux-ext4,
linux-block, dm-devel, linux-nvme, linux-scsi, linux-xfs,
linux-kernel, tytso, john.g.garry, bmarzins, chaitanyak,
shinichiro.kawasaki, brauner, yi.zhang, chengzhihao1, yukuai3,
yangerkun
On Tue, May 06, 2025 at 07:16:56PM +0800, Zhang Yi wrote:
> Sorry, but I don't understand your suggestion. The
> STATX_ATTR_WRITE_ZEROES_UNMAP attribute only indicates whether the bdev
> and the block device under the specified file support the unmap write
> zeroes command. It does not reflect whether the bdev and the
> filesystems support FALLOC_FL_WRITE_ZEROES. The implementation of
> FALLOC_FL_WRITE_ZEROES doesn't fully rely on the unmap write zeroes
> command now; users simply refer to this attribute flag to determine
> whether to use FALLOC_FL_WRITE_ZEROES when preallocating a file.
> So, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES don't
> have a strong relation. Why do you suggest putting this into the ext4
> and bdev patches that add FALLOC_FL_WRITE_ZEROES?
So what is the point of STATX_ATTR_WRITE_ZEROES_UNMAP?
* Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
2025-05-06 12:10 ` Christoph Hellwig
@ 2025-05-06 15:55 ` Darrick J. Wong
2025-05-07 8:23 ` Zhang Yi
0 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2025-05-06 15:55 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Zhang Yi, dhowells, brauner, linux-fsdevel, linux-ext4,
linux-block, dm-devel, linux-nvme, linux-scsi, linux-xfs,
linux-kernel, tytso, john.g.garry, bmarzins, chaitanyak,
shinichiro.kawasaki, yi.zhang, chengzhihao1, yukuai3, yangerkun
On Tue, May 06, 2025 at 02:10:12PM +0200, Christoph Hellwig wrote:
> On Tue, May 06, 2025 at 07:25:06PM +0800, Zhang Yi wrote:
> > + if (request_mask & STATX_WRITE_ZEROES_UNMAP &&
> > + bdev_write_zeroes_unmap(bdev))
> > + stat->result_mask |= STATX_WRITE_ZEROES_UNMAP;
>
> That would be my expectation. But then again this area seems to
> confuse me a lot, so maybe we'll get Christian or Dave to chime in.
Um... does STATX_WRITE_ZEROES_UNMAP protect a field somewhere?
It might be nice to expose the request alignment granularity/max
size/etc. Or does this flag exist solely to support discovering that
FALLOC_FL_WRITE_ZEROES is supported? In which case, why not discover
its existence by calling fallocate(fd, WRITE_ZEROES, 0, 0) like the
other modes?
--D
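The probing approach suggested above could look roughly like the sketch
below. It is not from the patch series. The fallback value 0x80 for
FALLOC_FL_WRITE_ZEROES is an assumption and should be checked against the
installed <linux/falloc.h>. The helper names are invented. Because
fallocate(2) documents EINVAL for a length of zero or less, the sketch takes
a caller-supplied non-empty range rather than the (0, 0) shorthand.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for fallocate() in <fcntl.h> */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Assumption: value of the proposed flag; prefer the one from the uapi header. */
#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES 0x80
#endif

enum probe_result { PROBE_SUPPORTED, PROBE_UNSUPPORTED, PROBE_ERROR };

/* Map a fallocate() return value and the errno it left to a probe verdict. */
static enum probe_result classify_fallocate(int ret, int err)
{
	if (ret == 0)
		return PROBE_SUPPORTED;
	if (err == EOPNOTSUPP)
		return PROBE_UNSUPPORTED;
	return PROBE_ERROR;	/* ENOSPC, EINVAL, EBADF, ...: no verdict */
}

/*
 * Try the mode on a range of an open, writable file. On success the range
 * really is allocated and zeroed, so only use a range whose contents are
 * disposable (for example a scratch file).
 */
static enum probe_result probe_write_zeroes(int fd, off_t offset, off_t len)
{
	int ret;

	assert(offset >= 0 && len > 0);
	ret = fallocate(fd, FALLOC_FL_WRITE_ZEROES, offset, len);
	return classify_fallocate(ret, errno);
}
```

A result of PROBE_ERROR deliberately gives no verdict, since errors such as
ENOSPC say nothing about whether the mode itself is implemented.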
* Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
2025-05-06 12:11 ` Christoph Hellwig
@ 2025-05-07 7:33 ` Zhang Yi
2025-05-07 21:03 ` Darrick J. Wong
2025-05-08 5:01 ` Christoph Hellwig
0 siblings, 2 replies; 38+ messages in thread
From: Zhang Yi @ 2025-05-07 7:33 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Darrick J. Wong, linux-fsdevel, linux-ext4, linux-block, dm-devel,
linux-nvme, linux-scsi, linux-xfs, linux-kernel, tytso,
john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
yi.zhang, chengzhihao1, yukuai3, yangerkun
On 2025/5/6 20:11, Christoph Hellwig wrote:
> On Tue, May 06, 2025 at 07:16:56PM +0800, Zhang Yi wrote:
>> Sorry, but I don't understand your suggestion. The
>> STATX_ATTR_WRITE_ZEROES_UNMAP attribute only indicates whether the bdev
>> and the block device under the specified file support the unmap write
>> zeroes command. It does not reflect whether the bdev and the
>> filesystems support FALLOC_FL_WRITE_ZEROES. The implementation of
>> FALLOC_FL_WRITE_ZEROES doesn't fully rely on the unmap write zeroes
>> command now; users simply refer to this attribute flag to determine
>> whether to use FALLOC_FL_WRITE_ZEROES when preallocating a file.
>> So, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES don't
>> have a strong relation. Why do you suggest putting this into the ext4
>> and bdev patches that add FALLOC_FL_WRITE_ZEROES?
>
> So what is the point of STATX_ATTR_WRITE_ZEROES_UNMAP?
My idea is not to strictly limit the use of FALLOC_FL_WRITE_ZEROES to
only bdev or files where bdev_unmap_write_zeroes() returns true. In
other words, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES
are not consistent; they are two independent features. Even if some
devices do not set STATX_ATTR_WRITE_ZEROES_UNMAP, users should still be
allowed to call fallocate(FALLOC_FL_WRITE_ZEROES). This is because some
devices and drivers currently cannot reliably ascertain whether they
support the unmap write zero command; however, certain devices, such as
specific cloud storage devices, do support it. Users of these devices
may also wish to use FALLOC_FL_WRITE_ZEROES to expedite the zeroing
process.
Therefore, I think that the current point of
STATX_ATTR_WRITE_ZEROES_UNMAP (possibly STATX_WRITE_ZEROES_UNMAP) should
be to just indicate whether a bdev or file supports the unmap write zero
command (i.e., whether bdev_unmap_write_zeroes() returns true). If we
use standard SCSI and NVMe storage devices, and the
STATX_ATTR_WRITE_ZEROES_UNMAP attribute is set, users can be assured
that FALLOC_FL_WRITE_ZEROES is fast and can choose to use
fallocate(FALLOC_FL_WRITE_ZEROES) immediately.
Would you prefer to make STATX_ATTR_WRITE_ZEROES_UNMAP and
FALLOC_FL_WRITE_ZEROES consistent, which means
fallocate(FALLOC_FL_WRITE_ZEROES) will return -EOPNOTSUPP if the block
device doesn't set STATX_ATTR_WRITE_ZEROES_UNMAP ?
If so, I'd suggest we need to:
1) Remove STATX_ATTR_WRITE_ZEROES_UNMAP since users can check the
existence by calling fallocate(FALLOC_FL_WRITE_ZEROES) directly, this
statx flag seems useless.
2) Make the BLK_FEAT_WRITE_ZEROES_UNMAP sysfs interface RW, allowing
users to adjust the block device's support state according to the
real situation.
Thanks,
Yi.
* Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
2025-05-06 15:55 ` Darrick J. Wong
@ 2025-05-07 8:23 ` Zhang Yi
0 siblings, 0 replies; 38+ messages in thread
From: Zhang Yi @ 2025-05-07 8:23 UTC (permalink / raw)
To: Darrick J. Wong, Christoph Hellwig
Cc: dhowells, brauner, linux-fsdevel, linux-ext4, linux-block,
dm-devel, linux-nvme, linux-scsi, linux-xfs, linux-kernel, tytso,
john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, yi.zhang,
chengzhihao1, yukuai3, yangerkun
On 2025/5/6 23:55, Darrick J. Wong wrote:
> On Tue, May 06, 2025 at 02:10:12PM +0200, Christoph Hellwig wrote:
>> On Tue, May 06, 2025 at 07:25:06PM +0800, Zhang Yi wrote:
>>> + if (request_mask & STATX_WRITE_ZEROES_UNMAP &&
>>> + bdev_write_zeroes_unmap(bdev))
>>> + stat->result_mask |= STATX_WRITE_ZEROES_UNMAP;
>>
>> That would be my expectation. But then again this area seems to
>> confuse me a lot, so maybe we'll get Christian or Dave to chime in.
>
> Um... does STATX_WRITE_ZEROES_UNMAP protect a field somewhere?
> It might be nice to expose the request alignment granularity/max
> size/etc.
I think that simply returning the support state is sufficient at the
moment. __blkdev_issue_write_zeroes() will send write zeroes through
multiple iterations, and there are no specific restrictions on the
parameters provided by users.
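The "multiple iterations" point can be pictured with a simplified userspace
model. This is not the kernel's __blkdev_issue_write_zeroes(); struct chunk
and split_range() are names invented for illustration, and max_sects stands
in for whatever per-request limit the device advertises.

```c
#include <stddef.h>
#include <stdint.h>

struct chunk {
	uint64_t sector;	/* first sector of this request */
	uint64_t nr_sects;	/* length of this request in sectors */
};

/*
 * Split [sector, sector + nr_sects) into requests of at most max_sects
 * each. Writes up to cap entries into out and returns how many were written.
 */
static size_t split_range(uint64_t sector, uint64_t nr_sects,
			  uint64_t max_sects, struct chunk *out, size_t cap)
{
	size_t n = 0;

	if (max_sects == 0)
		return 0;	/* a zero limit means nothing can be issued */

	while (nr_sects && n < cap) {
		uint64_t len = nr_sects < max_sects ? nr_sects : max_sects;

		out[n].sector = sector;
		out[n].nr_sects = len;
		n++;
		sector += len;
		nr_sects -= len;
	}
	return n;
}
```

Under this model the caller may pass any offset and length, and the
splitting loop keeps each request within the limit, which is why no
alignment or size fields would need to be exposed to users.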
> Or does this flag exist solely to support discovering that
> FALLOC_FL_WRITE_ZEROES is supported? In which case, why not discover
> its existence by calling fallocate(fd, WRITE_ZEROES, 0, 0) like the
> other modes?
>
Currently, STATX_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES are
inconsistent: we allow users to call fallocate(FALLOC_FL_WRITE_ZEROES) on
files for which STATX_WRITE_ZEROES_UNMAP is not set. Users can check whether
the device supports unmap write zeroes through STATX_WRITE_ZEROES_UNMAP
and then decide to call fallocate(FALLOC_FL_WRITE_ZEROES) if it is
supported. Please see this explanation for details.
https://lore.kernel.org/linux-fsdevel/20250421021509.2366003-1-yi.zhang@huaweicloud.com/T/#mc1618822bc27d486296216fc1643d5531fee03e1
However, removing STATX_WRITE_ZEROES_UNMAP also seems good to me (perhaps
it would be better). It means we do not allow calling
fallocate(FALLOC_FL_WRITE_ZEROES) if the device does not explicitly
support unmap write zeroes.
Thanks,
Yi.
* Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
2025-05-07 7:33 ` Zhang Yi
@ 2025-05-07 21:03 ` Darrick J. Wong
2025-05-08 5:01 ` Christoph Hellwig
1 sibling, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-05-07 21:03 UTC (permalink / raw)
To: Zhang Yi
Cc: Christoph Hellwig, linux-fsdevel, linux-ext4, linux-block,
dm-devel, linux-nvme, linux-scsi, linux-xfs, linux-kernel, tytso,
john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
yi.zhang, chengzhihao1, yukuai3, yangerkun
On Wed, May 07, 2025 at 03:33:23PM +0800, Zhang Yi wrote:
> On 2025/5/6 20:11, Christoph Hellwig wrote:
> > On Tue, May 06, 2025 at 07:16:56PM +0800, Zhang Yi wrote:
> >> Sorry, but I don't understand your suggestion. The
> >> STATX_ATTR_WRITE_ZEROES_UNMAP attribute only indicates whether the bdev
> >> and the block device under the specified file support the unmap write
> >> zeroes command. It does not reflect whether the bdev and the
> >> filesystems support FALLOC_FL_WRITE_ZEROES. The implementation of
> >> FALLOC_FL_WRITE_ZEROES doesn't fully rely on the unmap write zeroes
> >> command now; users simply refer to this attribute flag to determine
> >> whether to use FALLOC_FL_WRITE_ZEROES when preallocating a file.
> >> So, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES don't
> >> have a strong relation. Why do you suggest putting this into the ext4
> >> and bdev patches that add FALLOC_FL_WRITE_ZEROES?
> >
> > So what is the point of STATX_ATTR_WRITE_ZEROES_UNMAP?
>
> My idea is not to strictly limit the use of FALLOC_FL_WRITE_ZEROES to
> only bdev or files where bdev_unmap_write_zeroes() returns true. In
> other words, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES
> are not consistent; they are two independent features. Even if some
> devices do not set STATX_ATTR_WRITE_ZEROES_UNMAP, users should still be
> allowed to call fallocate(FALLOC_FL_WRITE_ZEROES). This is because some
> devices and drivers currently cannot reliably ascertain whether they
> support the unmap write zero command; however, certain devices, such as
> specific cloud storage devices, do support it. Users of these devices
> may also wish to use FALLOC_FL_WRITE_ZEROES to expedite the zeroing
> process.
>
> Therefore, I think that the current point of
> STATX_ATTR_WRITE_ZEROES_UNMAP (possibly STATX_WRITE_ZEROES_UNMAP) should
> be to just indicate whether a bdev or file supports the unmap write zero
> command (i.e., whether bdev_unmap_write_zeroes() returns true). If we
> use standard SCSI and NVMe storage devices, and the
> STATX_ATTR_WRITE_ZEROES_UNMAP attribute is set, users can be assured
> that FALLOC_FL_WRITE_ZEROES is fast and can choose to use
> fallocate(FALLOC_FL_WRITE_ZEROES) immediately.
>
> Would you prefer to make STATX_ATTR_WRITE_ZEROES_UNMAP and
> FALLOC_FL_WRITE_ZEROES consistent, which means
> fallocate(FALLOC_FL_WRITE_ZEROES) will return -EOPNOTSUPP if the block
> device doesn't set STATX_ATTR_WRITE_ZEROES_UNMAP ?
>
> If so, I'd suggest we need to:
> 1) Remove STATX_ATTR_WRITE_ZEROES_UNMAP since users can check the
> existence by calling fallocate(FALLOC_FL_WRITE_ZEROES) directly, this
> statx flag seems useless.
> 2) Make the BLK_FEAT_WRITE_ZEROES_UNMAP sysfs interface RW, allowing
> users to adjust the block device's support state according to the
> real situation.
Sounds fine to me... ;)
--D
> Thanks,
> Yi.
>
>
* Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
2025-05-07 7:33 ` Zhang Yi
2025-05-07 21:03 ` Darrick J. Wong
@ 2025-05-08 5:01 ` Christoph Hellwig
2025-05-08 12:17 ` Zhang Yi
1 sibling, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2025-05-08 5:01 UTC (permalink / raw)
To: Zhang Yi
Cc: Christoph Hellwig, Darrick J. Wong, linux-fsdevel, linux-ext4,
linux-block, dm-devel, linux-nvme, linux-scsi, linux-xfs,
linux-kernel, tytso, john.g.garry, bmarzins, chaitanyak,
shinichiro.kawasaki, brauner, yi.zhang, chengzhihao1, yukuai3,
yangerkun
On Wed, May 07, 2025 at 03:33:23PM +0800, Zhang Yi wrote:
> On 2025/5/6 20:11, Christoph Hellwig wrote:
> > On Tue, May 06, 2025 at 07:16:56PM +0800, Zhang Yi wrote:
> >> Sorry, but I don't understand your suggestion. The
> >> STATX_ATTR_WRITE_ZEROES_UNMAP attribute only indicates whether the bdev
> >> and the block device under the specified file support the unmap write
> >> zeroes command. It does not reflect whether the bdev and the
> >> filesystems support FALLOC_FL_WRITE_ZEROES. The implementation of
> >> FALLOC_FL_WRITE_ZEROES doesn't fully rely on the unmap write zeroes
> >> command now; users simply refer to this attribute flag to determine
> >> whether to use FALLOC_FL_WRITE_ZEROES when preallocating a file.
> >> So, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES don't
> >> have a strong relation. Why do you suggest putting this into the ext4
> >> and bdev patches that add FALLOC_FL_WRITE_ZEROES?
> >
> > So what is the point of STATX_ATTR_WRITE_ZEROES_UNMAP?
>
> My idea is not to strictly limit the use of FALLOC_FL_WRITE_ZEROES to
> only bdev or files where bdev_unmap_write_zeroes() returns true. In
> other words, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES
> are not consistent; they are two independent features. Even if some
> devices do not set STATX_ATTR_WRITE_ZEROES_UNMAP, users should still be
> allowed to call fallocate(FALLOC_FL_WRITE_ZEROES). This is because some
> devices and drivers currently cannot reliably ascertain whether they
> support the unmap write zero command; however, certain devices, such as
> specific cloud storage devices, do support it. Users of these devices
> may also wish to use FALLOC_FL_WRITE_ZEROES to expedite the zeroing
> process.
What are those "cloud storage devices" where you set it reliably,
i.e. what drivers?
> Therefore, I think that the current point of
> STATX_ATTR_WRITE_ZEROES_UNMAP (possibly STATX_WRITE_ZEROES_UNMAP) should
> be to just indicate whether a bdev or file supports the unmap write zero
> command (i.e., whether bdev_unmap_write_zeroes() returns true). If we
> use standard SCSI and NVMe storage devices, and the
> STATX_ATTR_WRITE_ZEROES_UNMAP attribute is set, users can be assured
> that FALLOC_FL_WRITE_ZEROES is fast and can choose to use
> fallocate(FALLOC_FL_WRITE_ZEROES) immediately.
That's breaking the abstraction again. An attribute must say something
about the specific file, not about some underlying semi-related feature.
> Would you prefer to make STATX_ATTR_WRITE_ZEROES_UNMAP and
> FALLOC_FL_WRITE_ZEROES consistent, which means
> fallocate(FALLOC_FL_WRITE_ZEROES) will return -EOPNOTSUPP if the block
> device doesn't set STATX_ATTR_WRITE_ZEROES_UNMAP ?
Not sure where the block device comes from here, both of these operate
on a file.
> If so, I'd suggest we need to:
> 1) Remove STATX_ATTR_WRITE_ZEROES_UNMAP since users can check the
> existence by calling fallocate(FALLOC_FL_WRITE_ZEROES) directly, this
> statx flag seems useless.
Yes, that was my initial thought.
> 2) Make the BLK_FEAT_WRITE_ZEROES_UNMAP sysfs interface RW, allowing
> users to adjust the block device's support state according to the
> real situation.
No, it's a feature and not a flag.
* Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
2025-05-08 5:01 ` Christoph Hellwig
@ 2025-05-08 12:17 ` Zhang Yi
2025-05-08 20:24 ` Theodore Ts'o
0 siblings, 1 reply; 38+ messages in thread
From: Zhang Yi @ 2025-05-08 12:17 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Darrick J. Wong, linux-fsdevel, linux-ext4, linux-block, dm-devel,
linux-nvme, linux-scsi, linux-xfs, linux-kernel, tytso,
john.g.garry, bmarzins, chaitanyak, shinichiro.kawasaki, brauner,
yi.zhang, chengzhihao1, yukuai3, yangerkun
On 2025/5/8 13:01, Christoph Hellwig wrote:
> On Wed, May 07, 2025 at 03:33:23PM +0800, Zhang Yi wrote:
>> On 2025/5/6 20:11, Christoph Hellwig wrote:
>>> On Tue, May 06, 2025 at 07:16:56PM +0800, Zhang Yi wrote:
>>>> Sorry, but I don't understand your suggestion. The
>>>> STATX_ATTR_WRITE_ZEROES_UNMAP attribute only indicates whether the bdev
>>>> and the block device under the specified file support the unmap write
>>>> zeroes command. It does not reflect whether the bdev and the
>>>> filesystems support FALLOC_FL_WRITE_ZEROES. The implementation of
>>>> FALLOC_FL_WRITE_ZEROES doesn't fully rely on the unmap write zeroes
>>>> command now; users simply refer to this attribute flag to determine
>>>> whether to use FALLOC_FL_WRITE_ZEROES when preallocating a file.
>>>> So, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES don't
>>>> have a strong relation. Why do you suggest putting this into the ext4
>>>> and bdev patches that add FALLOC_FL_WRITE_ZEROES?
>>>
>>> So what is the point of STATX_ATTR_WRITE_ZEROES_UNMAP?
>>
>> My idea is not to strictly limit the use of FALLOC_FL_WRITE_ZEROES to
>> only bdev or files where bdev_unmap_write_zeroes() returns true. In
>> other words, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES
>> are not consistent; they are two independent features. Even if some
>> devices do not set STATX_ATTR_WRITE_ZEROES_UNMAP, users should still be
>> allowed to call fallocate(FALLOC_FL_WRITE_ZEROES). This is because some
>> devices and drivers currently cannot reliably ascertain whether they
>> support the unmap write zero command; however, certain devices, such as
>> specific cloud storage devices, do support it. Users of these devices
>> may also wish to use FALLOC_FL_WRITE_ZEROES to expedite the zeroing
>> process.
>
> What are those "cloud storage devices" where you set it reliably,
> i.e. what drivers?
I don't have these 'cloud storage devices' now, but Ted had mentioned
those cloud-emulated block devices such as Google's Persistent Disk or
Amazon's Elastic Block Store in the thread linked below. I'm not sure if
they can accurately report the BLK_FEAT_WRITE_ZEROES_UNMAP feature; maybe
Ted can give more details.
https://lore.kernel.org/linux-fsdevel/20250106161732.GG1284777@mit.edu/
>
>> Therefore, I think that the current point of
>> STATX_ATTR_WRITE_ZEROES_UNMAP (possibly STATX_WRITE_ZEROES_UNMAP) should
>> be to just indicate whether a bdev or file supports the unmap write zero
>> command (i.e., whether bdev_unmap_write_zeroes() returns true). If we
>> use standard SCSI and NVMe storage devices, and the
>> STATX_ATTR_WRITE_ZEROES_UNMAP attribute is set, users can be assured
>> that FALLOC_FL_WRITE_ZEROES is fast and can choose to use
>> fallocate(FALLOC_FL_WRITE_ZEROES) immediately.
>
> That's breaking the abstraction again. An attribute must say something
> about the specific file, not about some underlying semi-related feature.
OK.
>
>> Would you prefer to make STATX_ATTR_WRITE_ZEROES_UNMAP and
>> FALLOC_FL_WRITE_ZEROES consistent, which means
>> fallocate(FALLOC_FL_WRITE_ZEROES) will return -EOPNOTSUPP if the block
>> device doesn't set STATX_ATTR_WRITE_ZEROES_UNMAP ?
>
> Not sure where the block device comes from here, both of these operate
> on a file.
I am referring to the block device on which the filesystem is mounted.
The support status of the file is directly dependent on this block
device.
>
>> If so, I'd suggest we need to:
>> 1) Remove STATX_ATTR_WRITE_ZEROES_UNMAP since users can check the
>> existence by calling fallocate(FALLOC_FL_WRITE_ZEROES) directly, this
>> statx flag seems useless.
>
> Yes, that was my initial thought.
>
>> 2) Make the BLK_FEAT_WRITE_ZEROES_UNMAP sysfs interface RW, allowing
>> users to adjust the block device's support state according to the
>> real situation.
>
> No, it's a feature and not a flag.
>
I am a bit confused about the feature and the flag. I checked the other
features, and it appears that features such as BLK_FEAT_ROTATIONAL can
be modified. Is this flexibility due to historical reasons or for the
convenience of testing?
Thinking about this again, I suppose we should keep
BLK_FEAT_WRITE_ZEROES_UNMAP read-only and add a new flag,
BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED, to disable
FALLOC_FL_WRITE_ZEROES. Write Zeroes does not guarantee performance,
and some devices may claim to support **UNMAP** Write Zeroes but exhibit
extremely slow write-zeroes speeds, so users may want to be able to
disable it. Thoughts?
Thanks,
Yi.
* Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
2025-05-08 12:17 ` Zhang Yi
@ 2025-05-08 20:24 ` Theodore Ts'o
2025-05-09 12:35 ` Zhang Yi
0 siblings, 1 reply; 38+ messages in thread
From: Theodore Ts'o @ 2025-05-08 20:24 UTC (permalink / raw)
To: Zhang Yi
Cc: Christoph Hellwig, Darrick J. Wong, linux-fsdevel, linux-ext4,
linux-block, dm-devel, linux-nvme, linux-scsi, linux-xfs,
linux-kernel, john.g.garry, bmarzins, chaitanyak,
shinichiro.kawasaki, brauner, yi.zhang, chengzhihao1, yukuai3,
yangerkun
On Thu, May 08, 2025 at 08:17:14PM +0800, Zhang Yi wrote:
> On 2025/5/8 13:01, Christoph Hellwig wrote:
> >>
> >> My idea is not to strictly limit the use of FALLOC_FL_WRITE_ZEROES to
> >> only bdevs or files where bdev_unmap_write_zeroes() returns true. In
> >> other words, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES
> >> are not required to be consistent; they are two independent features.
> >> Even if STATX_ATTR_WRITE_ZEROES_UNMAP is not set for some devices,
> >> users should still be allowed to call
> >> fallocate(FALLOC_FL_WRITE_ZEROES). This is because some devices and
> >> drivers currently cannot reliably ascertain whether they support the
> >> unmap write zeroes command; however, certain devices, such as specific
> >> cloud storage devices, do support it. Users of these devices may also
> >> wish to use FALLOC_FL_WRITE_ZEROES to expedite the zeroing process.
> >
> > What are those "cloud storage devices" where you set it reliably,
> > i.e. what drivers?
>
> I don't have these 'cloud storage devices' now, but Ted mentioned
> cloud-emulated block devices such as Google's Persistent Disk or
> Amazon's Elastic Block Store in the message linked below. I'm not sure
> whether they can accurately report the BLK_FEAT_WRITE_ZEROES_UNMAP
> feature; maybe Ted can give more details.
>
> https://lore.kernel.org/linux-fsdevel/20250106161732.GG1284777@mit.edu/
There's nothing really exotic about what I was referring to in terms
of "cloud storage devices". Perhaps a better way of describing them
is to consider devices such as dm-thin, or a Ceph Block Device, which
are exposed as SCSI or NVMe devices.
The distinction I was trying to make is performance-related. Suppose
you call WRITE_ZEROS on a 14TB region. After the WRITE_ZEROS
completes, a read anywhere in that 14TB region will return zeros.
That's easy. But the question is, when you call WRITE_ZEROS, will the
storage device (a) go away for a day or more before it completes (which
would be the case if it is a traditional spinning rust platter), or
(b) will it be basically instantaneous, because all dm-thin or a Ceph
Block Device needs to do is to delete one or more entries in its mapping
table?
The problem is two-fold. First, there's no way for the kernel to know
whether a storage device will behave as (a) or (b), because SCSI and
other storage specifications say that performance is out of scope.
They only talk about the functional results (afterwards, if you try
to read from the region, you will get zeros), and are utterly silent
about how long it might take. The second problem is that if you are an
application program, there is no way you will be willing to call
fallocate(WRITE_ZEROS, 14TB) if you don't know whether the disk will
go away for a day or whether it will be instantaneous.
But because there is no way for the kernel to know whether WRITE_ZEROS
will be fast or not, how would you expect the kernel to expose
STATX_ATTR_WRITE_ZEROES_UNMAP? Christoph's formulation "breaking the
abstraction" perfectly encapsulates the SCSI specification's position
on the matter, and I agree it's a valid position. It's just not
terribly useful for the application programmer.
Things which some programs/users might want to know or rely upon, but
which are normally quite impossible to find out, are:
* Will the write zero / discard operation take a "reasonable" amount
of time? (Yes, not necessarily well defined, but we know it when
we see it, and hours or days is generally not reasonable.)
* Is the operation reliable --- i.e., is the device allowed to
randomly decide that it won't actually zero the requested blocks (as
is the case of discard) whenever it feels like it?
* Is the operation guaranteed to make the data irretrievable even in
the face of an attacker with low-level access to the device? (And this
is also not necessarily well defined; does the attacker have access
to a scanning electron microscope, or can they do a liquid nitrogen
destructive access of the flash device?)
The UFS (Universal Flash Storage) spec comes the closest to providing
commands that distinguish between these various cases, but for most
storage specifications, like SCSI, it absolutely requires peeking
behind the abstraction barrier defined by the specification, and so
ultimately, the kernel can't know.
About the best you can do is to require manual configuration; perhaps a
config file at the database or userspace cluster file system level
because the system administrator knows --- maybe because the hyperscale
cloud provider has leaned on the storage vendor to tell them under
NDA, storage specs be damned or they won't spend $$$ millions with
that storage vendor --- or because the database administrator discovers
that using fallocate(WRITE_ZEROS) causes performance to tank, so they
manually disable the use of WRITE_ZEROS.
Could this be done in the kernel? Sure. We could have a file, say,
/sys/block/sdXX/queue/write_zeros where the write_zeros file is
writeable, and so the administrator can force-disable WRITE_ZEROS by
writing 0 into the file. And could this be queried via a STATX
attribute? I suppose, although to be honest, I'm used to doing this
by looking at the sysfs files. For example, just recently I coded up
the following:
static int is_rotational (const char *device_name EXT2FS_ATTR((unused)))
{
	int rotational = -1;
#ifdef __linux__
	char path[1024];
	struct stat st;
	FILE *f;

	if ((stat(device_name, &st) < 0) || !S_ISBLK(st.st_mode))
		return -1;

	snprintf(path, sizeof(path), "/sys/dev/block/%d:%d/queue/rotational",
		 major(st.st_rdev), minor(st.st_rdev));
	f = fopen(path, "r");
	if (!f) {
		/* A partition has no queue/ of its own; try the parent disk. */
		snprintf(path, sizeof(path),
			 "/sys/dev/block/%d:%d/../queue/rotational",
			 major(st.st_rdev), minor(st.st_rdev));
		f = fopen(path, "r");
	}
	if (f) {
		if (fscanf(f, "%d", &rotational) != 1)
			rotational = -1;
		fclose(f);
	}
#endif
	return rotational;
}
Easy-peasy! Who needs statx? :-)
- Ted
* Re: [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute
2025-05-08 20:24 ` Theodore Ts'o
@ 2025-05-09 12:35 ` Zhang Yi
0 siblings, 0 replies; 38+ messages in thread
From: Zhang Yi @ 2025-05-09 12:35 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Christoph Hellwig, Darrick J. Wong, linux-fsdevel, linux-ext4,
linux-block, dm-devel, linux-nvme, linux-scsi, linux-xfs,
linux-kernel, john.g.garry, bmarzins, chaitanyak,
shinichiro.kawasaki, brauner, yi.zhang, chengzhihao1, yukuai3,
yangerkun
On 2025/5/9 4:24, Theodore Ts'o wrote:
> On Thu, May 08, 2025 at 08:17:14PM +0800, Zhang Yi wrote:
>> On 2025/5/8 13:01, Christoph Hellwig wrote:
>>>>
>>>> My idea is not to strictly limit the use of FALLOC_FL_WRITE_ZEROES to
>>>> only bdevs or files where bdev_unmap_write_zeroes() returns true. In
>>>> other words, STATX_ATTR_WRITE_ZEROES_UNMAP and FALLOC_FL_WRITE_ZEROES
>>>> are not required to be consistent; they are two independent features.
>>>> Even if STATX_ATTR_WRITE_ZEROES_UNMAP is not set for some devices,
>>>> users should still be allowed to call
>>>> fallocate(FALLOC_FL_WRITE_ZEROES). This is because some devices and
>>>> drivers currently cannot reliably ascertain whether they support the
>>>> unmap write zeroes command; however, certain devices, such as specific
>>>> cloud storage devices, do support it. Users of these devices may also
>>>> wish to use FALLOC_FL_WRITE_ZEROES to expedite the zeroing process.
>>>
>>> What are those "cloud storage devices" where you set it reliably,
>>> i.e. what drivers?
>>
>> I don't have these 'cloud storage devices' now, but Ted mentioned
>> cloud-emulated block devices such as Google's Persistent Disk or
>> Amazon's Elastic Block Store in the message linked below. I'm not sure
>> whether they can accurately report the BLK_FEAT_WRITE_ZEROES_UNMAP
>> feature; maybe Ted can give more details.
>>
>> https://lore.kernel.org/linux-fsdevel/20250106161732.GG1284777@mit.edu/
>
> There's nothing really exotic about what I was referring to in terms
> of "cloud storage devices". Perhaps a better way of describing them
> is to consider devices such as dm-thin, or a Ceph Block Device, which
> are exposed as SCSI or NVMe devices.
OK, then correctly reporting the BLK_FEAT_WRITE_ZEROES_UNMAP feature
should no longer be a major problem. It seems that we do not need to
pay much attention to enabling this feature manually.
>
> The distinction I was trying to make is performance-related. Suppose
> you call WRITE_ZEROS on a 14TB region. After the WRITE_ZEROS
> completes, a read anywhere in that 14TB region will return zeros.
> That's easy. But the question is, when you call WRITE_ZEROS, will the
> storage device (a) go away for a day or more before it completes (which
> would be the case if it is a traditional spinning rust platter), or
> (b) will it be basically instantaneous, because all dm-thin or a Ceph
> Block Device needs to do is to delete one or more entries in its mapping
> table?
Yes.
>
> The problem is two-fold. First, there's no way for the kernel to know
> whether a storage device will behave as (a) or (b), because SCSI and
> other storage specifications say that performance is out of scope.
> They only talk about the functional results (afterwards, if you try
> to read from the region, you will get zeros), and are utterly silent
> about how long it might take. The second problem is that if you are an
> application program, there is no way you will be willing to call
> fallocate(WRITE_ZEROS, 14TB) if you don't know whether the disk will
> go away for a day or whether it will be instantaneous.
>
> But because there is no way for the kernel to know whether WRITE_ZEROS
> will be fast or not, how would you expect the kernel to expose
> STATX_ATTR_WRITE_ZEROES_UNMAP? Christoph's formulation "breaking the
> abstraction" perfectly encapsulates the SCSI specification's position
> on the matter, and I agree it's a valid position. It's just not
> terribly useful for the application programmer.
Yes.
>
> Things which some programs/users might want to know or rely upon, but
> which are normally quite impossible to find out, are:
>
> * Will the write zero / discard operation take a "reasonable" amount
> of time? (Yes, not necessarily well defined, but we know it when
> we see it, and hours or days is generally not reasonable.)
>
> * Is the operation reliable --- i.e., is the device allowed to
> randomly decide that it won't actually zero the requested blocks (as
> is the case of discard) whenever it feels like it?
>
> * Is the operation guaranteed to make the data irretrievable even in
> the face of an attacker with low-level access to the device? (And this
> is also not necessarily well defined; does the attacker have access
> to a scanning electron microscope, or can they do a liquid nitrogen
> destructive access of the flash device?)
Yes.
>
> The UFS (Universal Flash Storage) spec comes the closest to providing
> commands that distinguish between these various cases, but for most
> storage specifications, like SCSI, it absolutely requires peeking
> behind the abstraction barrier defined by the specification, and so
> ultimately, the kernel can't know.
>
> About the best you can do is to require manual configuration; perhaps a
> config file at the database or userspace cluster file system level
> because the system administrator knows --- maybe because the hyperscale
> cloud provider has leaned on the storage vendor to tell them under
> NDA, storage specs be damned or they won't spend $$$ millions with
> that storage vendor --- or because the database administrator discovers
> that using fallocate(WRITE_ZEROS) causes performance to tank, so they
> manually disable the use of WRITE_ZEROS.
Yes, this is indeed what we should consider.
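As a rough illustration of the application-side arrangement Ted
describes, the sketch below tries the fast path only when an
administrator-controlled knob allows it, and otherwise writes zeros by
hand. The use_write_zeroes parameter stands in for a hypothetical
application config option, zero_range() is a made-up helper name, and
the 0x80 flag value is only what this series proposes.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for fallocate() */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Assumption: value proposed by this series, not a released uapi value. */
#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES 0x80
#endif

/*
 * Zero [offset, offset + len) in fd.  If use_write_zeroes is nonzero,
 * try fallocate(FALLOC_FL_WRITE_ZEROES) first and fall back to plain
 * writes only when the kernel reports EOPNOTSUPP.  If it is zero (the
 * administrator has disabled the fast path), always write zeros.
 * Returns 0 on success, -1 on error with errno set.
 */
static int zero_range(int fd, off_t offset, off_t len, int use_write_zeroes)
{
	static const char zeros[65536];	/* zero-initialized buffer */

	assert(fd >= 0 && offset >= 0 && len > 0);

	if (use_write_zeroes) {
		if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, offset, len) == 0)
			return 0;
		if (errno != EOPNOTSUPP)
			return -1;
		/* Not supported here: fall through to the slow path. */
	}

	while (len > 0) {
		size_t chunk = len < (off_t)sizeof(zeros) ?
			       (size_t)len : sizeof(zeros);
		ssize_t n = pwrite(fd, zeros, chunk, offset);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0) {
			errno = EIO;	/* no progress; avoid looping forever */
			return -1;
		}
		offset += n;
		len -= n;
	}
	return 0;
}
```

With the knob in the application, a database administrator who sees
performance tank can turn the fast path off without any kernel support.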
>
> Could this be done in the kernel? Sure. We could have a file, say,
> /sys/block/sdXX/queue/write_zeros where the write_zeros file is
> writeable, and so the administrator can force-disable WRITE_ZEROS by
> writing 0 into the file. And could this be queried via a STATX
> attribute? I suppose, although to be honest, I'm used to doing this
> by looking at the sysfs files. For example, just recently I coded up
> the following:
>
> static int is_rotational (const char *device_name EXT2FS_ATTR((unused)))
> {
> 	int rotational = -1;
> #ifdef __linux__
> 	char path[1024];
> 	struct stat st;
> 	FILE *f;
>
> 	if ((stat(device_name, &st) < 0) || !S_ISBLK(st.st_mode))
> 		return -1;
>
> 	snprintf(path, sizeof(path), "/sys/dev/block/%d:%d/queue/rotational",
> 		 major(st.st_rdev), minor(st.st_rdev));
> 	f = fopen(path, "r");
> 	if (!f) {
> 		snprintf(path, sizeof(path),
> 			 "/sys/dev/block/%d:%d/../queue/rotational",
> 			 major(st.st_rdev), minor(st.st_rdev));
> 		f = fopen(path, "r");
> 	}
> 	if (f) {
> 		if (fscanf(f, "%d", &rotational) != 1)
> 			rotational = -1;
> 		fclose(f);
> 	}
> #endif
> 	return rotational;
> }
>
> Easy-peasy! Who needs statx? :-)
>
Yes, as I replied earlier, I'm going to implement this with a new flag,
BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED, similar to the existing
BLK_FLAG_WRITE_CACHE_DISABLED, and make
/sys/block/<disk>/queue/write_zeroes_unmap read-write. I need to think
more about whether to rename it to 'write_zeroes': that name aligns
well with FALLOC_FL_WRITE_ZEROES, but it does not adequately express
the **UNMAP** semantics.
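For userspace, reading such an attribute could follow the same pattern
as the is_rotational() helper quoted above. The sketch below is a
generalization of that helper, not existing e2fsprogs code:
read_queue_attr() is a hypothetical name, and "write_zeroes_unmap" is
only the attribute name discussed in this thread, so it is an
assumption that a given kernel exposes it.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>	/* major(), minor() */

/*
 * Read an integer attribute from the request queue of a block device,
 * e.g. attr = "rotational", or "write_zeroes_unmap" if the kernel
 * exposes it (an assumption based on this thread).
 * Returns the value, or -1 if device_name is not a block device or the
 * attribute cannot be read.
 */
static int read_queue_attr(const char *device_name, const char *attr)
{
	char path[1024];
	struct stat st;
	FILE *f;
	int value = -1;

	assert(device_name && attr);

	if (stat(device_name, &st) < 0 || !S_ISBLK(st.st_mode))
		return -1;

	snprintf(path, sizeof(path), "/sys/dev/block/%u:%u/queue/%s",
		 major(st.st_rdev), minor(st.st_rdev), attr);
	f = fopen(path, "r");
	if (!f) {
		/* A partition has no queue/ of its own; try the parent disk. */
		snprintf(path, sizeof(path), "/sys/dev/block/%u:%u/../queue/%s",
			 major(st.st_rdev), minor(st.st_rdev), attr);
		f = fopen(path, "r");
	}
	if (f) {
		if (fscanf(f, "%d", &value) != 1)
			value = -1;
		fclose(f);
	}
	return value;
}
```

A caller would treat -1 as "unknown" and fall back to whatever its own
configuration says, since an old kernel simply will not have the file.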
Thank you for your detailed explanation and suggestions!
Best regards.
Yi.
end of thread, other threads:[~2025-05-09 12:36 UTC | newest]
Thread overview: 38+ messages
2025-04-21 2:14 [RFC PATCH v4 00/11] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag Zhang Yi
2025-04-21 2:14 ` [RFC PATCH v4 01/11] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features Zhang Yi
2025-05-05 11:54 ` Christoph Hellwig
2025-05-06 4:21 ` Martin K. Petersen
2025-05-06 7:51 ` Zhang Yi
2025-04-21 2:15 ` [RFC PATCH v4 02/11] nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit Zhang Yi
2025-05-05 11:55 ` Christoph Hellwig
2025-04-21 2:15 ` [RFC PATCH v4 03/11] nvme-multipath: add BLK_FEAT_WRITE_ZEROES_UNMAP support Zhang Yi
2025-05-05 11:55 ` Christoph Hellwig
2025-04-21 2:15 ` [RFC PATCH v4 04/11] nvmet: set WZDS and DRB if device supports BLK_FEAT_WRITE_ZEROES_UNMAP Zhang Yi
2025-05-05 11:56 ` Christoph Hellwig
2025-04-21 2:15 ` [RFC PATCH v4 05/11] scsi: sd: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports unmap zeroing mode Zhang Yi
2025-04-21 2:15 ` [RFC PATCH v4 06/11] dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support Zhang Yi
2025-04-21 2:15 ` [RFC PATCH v4 07/11] fs: statx add write zeroes unmap attribute Zhang Yi
2025-05-05 13:22 ` Christoph Hellwig
2025-05-05 14:29 ` Darrick J. Wong
2025-05-06 4:28 ` Zhang Yi
2025-05-06 4:39 ` Christoph Hellwig
2025-05-06 11:16 ` Zhang Yi
2025-05-06 12:11 ` Christoph Hellwig
2025-05-07 7:33 ` Zhang Yi
2025-05-07 21:03 ` Darrick J. Wong
2025-05-08 5:01 ` Christoph Hellwig
2025-05-08 12:17 ` Zhang Yi
2025-05-08 20:24 ` Theodore Ts'o
2025-05-09 12:35 ` Zhang Yi
2025-05-06 5:02 ` Christoph Hellwig
2025-05-06 5:36 ` Darrick J. Wong
2025-05-06 5:47 ` Christoph Hellwig
2025-05-06 11:25 ` Zhang Yi
2025-05-06 12:10 ` Christoph Hellwig
2025-05-06 15:55 ` Darrick J. Wong
2025-05-07 8:23 ` Zhang Yi
2025-04-21 2:15 ` [RFC PATCH v4 08/11] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate Zhang Yi
2025-05-05 13:22 ` Christoph Hellwig
2025-04-21 2:15 ` [RFC PATCH v4 09/11] block: factor out common part in blkdev_fallocate() Zhang Yi
2025-04-21 2:15 ` [RFC PATCH v4 10/11] block: add FALLOC_FL_WRITE_ZEROES support Zhang Yi
2025-04-21 2:15 ` [RFC PATCH v4 11/11] ext4: " Zhang Yi