public inbox for linux-ext4@vger.kernel.org
* [RFC PATCH 0/2] fallocate: introduce FALLOC_FL_FORCE_ZERO flag
@ 2024-12-28  1:45 Zhang Yi
  2024-12-28  1:45 ` [RFC PATCH 1/2] fs: introduce FALLOC_FL_FORCE_ZERO to fallocate Zhang Yi
  2024-12-28  1:45 ` [RFC PATCH 2/2] ext4: add FALLOC_FL_FORCE_ZERO support Zhang Yi
  0 siblings, 2 replies; 13+ messages in thread
From: Zhang Yi @ 2024-12-28  1:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4
  Cc: linux-kernel, viro, brauner, jack, tytso, djwong, adilger.kernel,
	yi.zhang, yi.zhang, chengzhihao1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Currently, we can use the fallocate command to quickly create a
pre-allocated file. However, on most filesystems, such as ext4 and XFS,
fallocate creates pre-allocated blocks in an unwritten state, and the
FALLOC_FL_ZERO_RANGE flag behaves similarly. The extent state must
be converted to a written state when the user later writes data into
this range, which can trigger numerous metadata changes and consequent
journal I/O. This may lead to significant write amplification and
performance degradation in synchronous write mode. Therefore, we need a
method to create a pre-allocated file with written extents. At the
moment, the only method available is to create an empty file and write
zero data into it (for example, using 'dd' with a large block size).
However, this method is slow and consumes a considerable amount of disk
bandwidth, so we must pre-allocate files in advance and cannot add
pre-allocated files while user business services are running.

Fortunately, as flash-based storage devices become more and more
widely used, we can efficiently write zeros to SSDs using the
WRITE_ZERO command. If SCSI SSDs support the UNMAP bit or NVMe SSDs
support the DEAC bit[1], the WRITE_ZERO command does not write actual
data to the device. Instead, NVMe converts the zeroed range to a
deallocated state, which is fast and consumes almost no disk write
bandwidth. Consequently, this feature can provide us with a faster
method for creating pre-allocated files with written extents and zeroed
data.

This series aims to implement this by introducing a new flag,
FALLOC_FL_FORCE_ZERO, into fallocate, and provides a brief
demonstration on ext4 (note: this is based on my ext4 fallocate refactor
series[2], which hasn't been merged yet) to be used for further
testing and discussion. This flag acts as a modifier for
FALLOC_FL_ZERO_RANGE: it forces the file system to issue zeros and
allocate written extents during the zero range operation. If
the underlying storage supports WRITE_ZERO, the zero range operation
can be accelerated; if not, it falls back to writing zero data, similar
to a direct write.
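
For illustration, a userspace caller of the proposed interface might
look like the following. This is a minimal sketch and not part of the
series: it assumes the RFC's value 0x80 for FALLOC_FL_FORCE_ZERO (which
is not in any released uapi header), and zero_range_written() and
write_zeroes_slow() are hypothetical helper names. When the kernel
rejects the flag, the sketch falls back to writing zeroes by hand.

```c
#define _GNU_SOURCE		/* for fallocate() */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <linux/falloc.h>

/* Assumption: value proposed by this RFC, not a released uapi constant. */
#ifndef FALLOC_FL_FORCE_ZERO
#define FALLOC_FL_FORCE_ZERO 0x80
#endif

/* Fallback path: write real zeroes in chunks of up to 1 MiB. */
static int write_zeroes_slow(int fd, off_t off, off_t len)
{
	static const char zeroes[1 << 20];

	while (len > 0) {
		size_t chunk = len < (off_t)sizeof(zeroes) ?
			       (size_t)len : sizeof(zeroes);
		ssize_t n = pwrite(fd, zeroes, chunk, off);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0) {
			errno = EIO;
			return -1;
		}
		off += n;
		len -= n;
	}
	return 0;
}

/*
 * Zero [off, off + len) and request written extents.
 * Returns 1 if the kernel accepted the RFC flag, 0 if the slow
 * fallback was used, and -1 on error (errno is set).
 */
static int zero_range_written(int fd, off_t off, off_t len)
{
	assert(off >= 0 && len > 0);

	if (fallocate(fd, FALLOC_FL_ZERO_RANGE | FALLOC_FL_FORCE_ZERO,
		      off, len) == 0)
		return 1;
	if (errno != EOPNOTSUPP && errno != EINVAL)
		return -1;
	return write_zeroes_slow(fd, off, len) ? -1 : 0;
}
```

The return value distinguishes the two paths, so a caller can tell
whether the request was handled by the kernel or by the fallback.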

I've modified xfs_io and the fallocate tool in util-linux[3], and tested
performance with this series on an ext4 filesystem on my machine with an
Intel Xeon Gold 6248R CPU and a 7TB KCD61LUL7T68 NVMe SSD which supports
WRITE_ZERO with the Deallocated state and the DEAC bit.

0. Ensure the NVMe device supports WRITE_ZERO command.

 $ cat /sys/block/nvme5n1/queue/write_zeroes_max_bytes
   8388608
 $ nvme id-ns -H /dev/nvme5n1 | grep -i -A 3 "dlfeat"
   dlfeat  : 25
   [4:4] : 0x1   Guard Field of Deallocated Logical Blocks is set to CRC
                 of The Value Read
   [3:3] : 0x1   Deallocate Bit in the Write Zeroes Command is Supported
   [2:0] : 0x1   Bytes Read From a Deallocated Logical Block and its
                 Metadata are 0x00

1. Compare 'dd' and fallocate with force zero range. The zero range is
   significantly faster than 'dd'.

 a) Create a 1GB zeroed file.
  $ dd if=/dev/zero of=foo bs=2M count=512 oflag=direct
    512+0 records in
    512+0 records out
    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.504496 s, 2.1 GB/s

  $ time fallocate -Z -l 1G bar  # -Z is a new option to do actual zero
    real    0m0.171s
    user    0m0.001s
    sys     0m0.003s

 b) Create a 10GB zeroed file.
  $ dd if=/dev/zero of=foo bs=2M count=5120 oflag=direct  
    5120+0 records in
    5120+0 records out
    10737418240 bytes (11 GB, 10 GiB) copied, 5.04009 s, 2.1 GB/s

  $ time fallocate -Z -l 10G bar
    real    0m1.724s
    user    0m0.000s
    sys     0m0.024s

2. Run fio overwrite and fallocate with force zero range simultaneously,
   fallocate has little impact on write bandwidth and only slightly
   affects write latency.

 a) Test bandwidth costs.
  $ fio -directory=/test -direct=1 -iodepth=10 -fsync=0 -rw=write \
        -numjobs=10 -bs=2M -ioengine=libaio -size=20G -runtime=20 \
        -fallocate=none -overwrite=1 -group_reporting -name=bw_test

   Without background zero range:
    bw (MiB/s): min= 2068, max= 2280, per=100.00%, avg=2186.40

   With background zero range:
    bw (MiB/s): min= 2056, max= 2308, per=100.00%, avg=2186.20

 b) Test write latency costs.
  $ fio -filename=/test/foo -direct=1 -iodepth=1 -fsync=0 -rw=write \
        -numjobs=1 -bs=4k -ioengine=psync -size=5G -runtime=20 \
        -fallocate=none -overwrite=1 -group_reporting -name=lat_test

   Without background zero range:
   lat (nsec): min=9269, max=71635, avg=9840.65

   With a background zero range:
   lat (usec): min=9, max=982, avg=11.03

3. Compare overwriting a pre-allocated unwritten file and a written
   file in O_DSYNC mode. Writing to a file with written extents is much
   faster.

  # First run mkfs and create a test file for each of the three cases
  # below, and then run fio.

  $ fio -filename=/test/foo -direct=1 -iodepth=1 -fdatasync=1 \
        -rw=write -numjobs=1 -bs=4k -ioengine=psync -size=5G \
        -runtime=20 -fallocate=none -group_reporting -name=test

   unwritten file:                 IOPS=20.1k, BW=78.7MiB/s
   unwritten file + fast_commit:   IOPS=42.9k, BW=167MiB/s
   written file:                   IOPS=98.8k, BW=386MiB/s
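
The dlfeat value of 25 reported in step 0 can be decoded by hand:
25 = 0b11001, so bits 2:0 are 1 (deallocated blocks read back as 0x00),
bit 3 is 1 (the Deallocate bit in Write Zeroes is supported) and bit 4
is 1 (guard field is the CRC of the value read). A minimal sketch of
that decoding follows; struct dlfeat and both function names are
hypothetical, and only the bit layout is taken from the nvme-cli output
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct dlfeat {
	unsigned int read_behavior;	/* bits 2:0, 1 = reads return 0x00 */
	bool deac_supported;		/* bit 3 */
	bool guard_is_crc;		/* bit 4 */
};

static_assert(sizeof(uint8_t) == 1, "DLFEAT is a single byte");

/* Split the raw DLFEAT byte into its three fields. */
static struct dlfeat decode_dlfeat(uint8_t raw)
{
	struct dlfeat d = {
		.read_behavior = raw & 0x7,
		.deac_supported = (raw >> 3) & 0x1,
		.guard_is_crc = (raw >> 4) & 0x1,
	};

	return d;
}

/*
 * Deallocating is only usable as a zeroing method when the device
 * supports DEAC and deallocated blocks read back as zeroes.
 */
static bool write_zeroes_may_deallocate(uint8_t raw)
{
	struct dlfeat d = decode_dlfeat(raw);

	return d.deac_supported && d.read_behavior == 1;
}
```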

Any comments are welcome.

Thanks,
Yi.

---

[1] https://nvmexpress.org/specifications/
    NVM Command Set Specification, section 3.2.8
[2] https://lore.kernel.org/linux-ext4/20241220011637.1157197-1-yi.zhang@huaweicloud.com/
[3] Here is a simple support of xfs_io and fallocate tool in util-linux.
    Feel free to give it a try.

1. xfs_io

diff --git a/io/prealloc.c b/io/prealloc.c
index 8e968c9f..66ae63d6 100644
--- a/io/prealloc.c
+++ b/io/prealloc.c
@@ -30,6 +30,10 @@
 #define FALLOC_FL_UNSHARE_RANGE 0x40
 #endif
 
+#ifndef FALLOC_FL_FORCE_ZERO
+#define FALLOC_FL_FORCE_ZERO 0x80
+#endif
+
 static cmdinfo_t allocsp_cmd;
 static cmdinfo_t freesp_cmd;
 static cmdinfo_t resvsp_cmd;
@@ -324,16 +328,20 @@ fzero_f(
 	int		mode = FALLOC_FL_ZERO_RANGE;
 	int		c;
 
-	while ((c = getopt(argc, argv, "k")) != EOF) {
+	while ((c = getopt(argc, argv, "kz")) != EOF) {
 		switch (c) {
 		case 'k':
 			mode |= FALLOC_FL_KEEP_SIZE;
 			break;
+		case 'z':
+			mode |= FALLOC_FL_FORCE_ZERO;
+			break;
 		default:
 			command_usage(&fzero_cmd);
 		}
 	}
-        if (optind != argc - 2)
+	if (optind != argc - 2 ||
+	    ((mode & FALLOC_FL_KEEP_SIZE) && (mode & FALLOC_FL_FORCE_ZERO)))
                 return command_usage(&fzero_cmd);
 
 	if (!offset_length(argv[optind], argv[optind + 1], &segment))
@@ -475,7 +483,7 @@ prealloc_init(void)
 	fzero_cmd.argmin = 2;
 	fzero_cmd.argmax = 3;
 	fzero_cmd.flags = CMD_NOMAP_OK | CMD_FOREIGN_OK;
-	fzero_cmd.args = _("[-k] off len");
+	fzero_cmd.args = _("[-k | -z ] off len");
 	fzero_cmd.oneline =
 	_("zeroes space and eliminates holes by preallocating");
 	add_command(&fzero_cmd);

2. util-linux

diff --git a/sys-utils/fallocate.c b/sys-utils/fallocate.c
index ac7c687f2..55627ce4b 100644
--- a/sys-utils/fallocate.c
+++ b/sys-utils/fallocate.c
@@ -66,6 +66,10 @@
 # define FALLOC_FL_INSERT_RANGE		0x20
 #endif
 
+#ifndef FALLOC_FL_FORCE_ZERO
+# define FALLOC_FL_FORCE_ZERO		0x80
+#endif
+
 #include "nls.h"
 #include "strutils.h"
 #include "c.h"
@@ -305,6 +309,7 @@ int main(int argc, char **argv)
 	    { "dig-holes",      no_argument,       NULL, 'd' },
 	    { "insert-range",   no_argument,       NULL, 'i' },
 	    { "zero-range",     no_argument,       NULL, 'z' },
+	    { "force-zero",     no_argument,       NULL, 'Z' },
 	    { "offset",         required_argument, NULL, 'o' },
 	    { "length",         required_argument, NULL, 'l' },
 	    { "posix",          no_argument,       NULL, 'x' },
@@ -313,9 +318,10 @@ int main(int argc, char **argv)
 	};
 
 	static const ul_excl_t excl[] = {	/* rows and cols in ASCII order */
-		{ 'c', 'd', 'p', 'z' },
+		{ 'c', 'd', 'p', 'z', 'Z' },
 		{ 'c', 'n' },
-		{ 'x', 'c', 'd', 'i', 'n', 'p', 'z'},
+		{ 'Z', 'n' },
+		{ 'x', 'c', 'd', 'i', 'n', 'p', 'z', 'Z'},
 		{ 0 }
 	};
 	int excl_st[ARRAY_SIZE(excl)] = UL_EXCL_STATUS_INIT;
@@ -325,7 +331,7 @@ int main(int argc, char **argv)
 	textdomain(PACKAGE);
 	close_stdout_atexit();
 
-	while ((c = getopt_long(argc, argv, "hvVncpdizxl:o:", longopts, NULL))
+	while ((c = getopt_long(argc, argv, "hvVncpdizZxl:o:", longopts, NULL))
 			!= -1) {
 
 		err_exclusive_options(c, longopts, excl, excl_st);
@@ -355,6 +361,9 @@ int main(int argc, char **argv)
 		case 'z':
 			mode |= FALLOC_FL_ZERO_RANGE;
 			break;
+		case 'Z':
+			mode |= FALLOC_FL_ZERO_RANGE | FALLOC_FL_FORCE_ZERO;
+			break;
 		case 'x':
 #ifdef HAVE_POSIX_FALLOCATE
 			posix = 1;



Zhang Yi (2):
  fs: introduce FALLOC_FL_FORCE_ZERO to fallocate
  ext4: add FALLOC_FL_FORCE_ZERO support

 fs/ext4/extents.c           | 42 +++++++++++++++++++++++++++++++------
 fs/open.c                   | 14 ++++++++++---
 include/linux/falloc.h      |  5 ++++-
 include/trace/events/ext4.h |  3 ++-
 include/uapi/linux/falloc.h | 12 +++++++++++
 5 files changed, 65 insertions(+), 11 deletions(-)

-- 
2.39.2



* [RFC PATCH 1/2] fs: introduce FALLOC_FL_FORCE_ZERO to fallocate
  2024-12-28  1:45 [RFC PATCH 0/2] fallocate: introduce FALLOC_FL_FORCE_ZERO flag Zhang Yi
@ 2024-12-28  1:45 ` Zhang Yi
  2025-01-06 11:27   ` Christoph Hellwig
  2024-12-28  1:45 ` [RFC PATCH 2/2] ext4: add FALLOC_FL_FORCE_ZERO support Zhang Yi
  1 sibling, 1 reply; 13+ messages in thread
From: Zhang Yi @ 2024-12-28  1:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4
  Cc: linux-kernel, viro, brauner, jack, tytso, djwong, adilger.kernel,
	yi.zhang, yi.zhang, chengzhihao1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Thanks to the development of flash-based storage devices, we can quickly
write zeros to SSDs using the WRITE_ZERO command. Therefore, we
introduce a new flag FALLOC_FL_FORCE_ZERO to fallocate, which acts as a
modifier for FALLOC_FL_ZERO_RANGE. This flag forces the file
system to issue zeroes and allocate written extents. The process of
zeroing out can be accelerated with the REQ_OP_WRITE_ZEROES operation
when the underlying storage device supports the WRITE_ZERO command and
the UNMAP bit on SCSI SSDs or the DEAC bit on NVMe SSDs.

This provides users with a new method to quickly generate a zeroed file.
Users no longer need to write zero data to create a file with written
extents. Subsequent overwrites of this file range avoid
significant metadata changes, which should greatly improve overwrite
performance on certain filesystems.

This flag must not be used in conjunction with FALLOC_FL_KEEP_SIZE,
since allocating written extents beyond file EOF is not permitted.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/open.c                   | 14 +++++++++++---
 include/linux/falloc.h      |  5 ++++-
 include/uapi/linux/falloc.h | 12 ++++++++++++
 3 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index e6911101fe71..d3afaddfcf27 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -246,7 +246,7 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	if (offset < 0 || len <= 0)
 		return -EINVAL;
 
-	if (mode & ~(FALLOC_FL_MODE_MASK | FALLOC_FL_KEEP_SIZE))
+	if (mode & ~(FALLOC_FL_MODE_MASK | FALLOC_FL_SUPPORT_MASK))
 		return -EOPNOTSUPP;
 
 	/*
@@ -259,15 +259,23 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	switch (mode & FALLOC_FL_MODE_MASK) {
 	case FALLOC_FL_ALLOCATE_RANGE:
 	case FALLOC_FL_UNSHARE_RANGE:
+		if (mode & FALLOC_FL_FORCE_ZERO)
+			return -EOPNOTSUPP;
+		break;
 	case FALLOC_FL_ZERO_RANGE:
+		if ((mode & FALLOC_FL_KEEP_SIZE) &&
+		     (mode & FALLOC_FL_FORCE_ZERO))
+			return -EOPNOTSUPP;
 		break;
 	case FALLOC_FL_PUNCH_HOLE:
-		if (!(mode & FALLOC_FL_KEEP_SIZE))
+		if (!(mode & FALLOC_FL_KEEP_SIZE) ||
+		    (mode & FALLOC_FL_FORCE_ZERO))
 			return -EOPNOTSUPP;
 		break;
 	case FALLOC_FL_COLLAPSE_RANGE:
 	case FALLOC_FL_INSERT_RANGE:
-		if (mode & FALLOC_FL_KEEP_SIZE)
+		if ((mode & FALLOC_FL_KEEP_SIZE) ||
+		    (mode & FALLOC_FL_FORCE_ZERO))
 			return -EOPNOTSUPP;
 		break;
 	default:
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 3f49f3df6af5..75ac063d7eab 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -29,7 +29,8 @@ struct space_resv {
  * Mask of all supported fallocate modes.  Only one can be set at a time.
  *
  * In addition to the mode bit, the mode argument can also encode flags.
- * FALLOC_FL_KEEP_SIZE is the only supported flag so far.
+ * FALLOC_FL_KEEP_SIZE and FALLOC_FL_FORCE_ZERO are the only supported
+ * flags so far.
  */
 #define FALLOC_FL_MODE_MASK	(FALLOC_FL_ALLOCATE_RANGE |	\
 				 FALLOC_FL_PUNCH_HOLE |		\
@@ -37,6 +38,8 @@ struct space_resv {
 				 FALLOC_FL_ZERO_RANGE |		\
 				 FALLOC_FL_INSERT_RANGE |	\
 				 FALLOC_FL_UNSHARE_RANGE)
+#define FALLOC_FL_SUPPORT_MASK	(FALLOC_FL_KEEP_SIZE |		\
+				 FALLOC_FL_FORCE_ZERO)
 
 /* on ia32 l_start is on a 32-bit boundary */
 #if defined(CONFIG_X86_64)
diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
index 5810371ed72b..7c12bcdff7d3 100644
--- a/include/uapi/linux/falloc.h
+++ b/include/uapi/linux/falloc.h
@@ -78,4 +78,16 @@
  */
 #define FALLOC_FL_UNSHARE_RANGE		0x40
 
+/*
+ * FALLOC_FL_FORCE_ZERO should be used in conjunction with FALLOC_FL_ZERO_RANGE.
+ * It forces the file system to issue zeroes and allocate written extents. The
+ * zeroing can be accelerated by the REQ_OP_WRITE_ZEROES operation, and
+ * subsequent overwrites of this range avoid significant metadata changes,
+ * which should improve overwrite performance on such a preallocated
+ * range.
+ *
+ * This flag cannot be used in conjunction with the FALLOC_FL_KEEP_SIZE.
+ */
+#define FALLOC_FL_FORCE_ZERO		0x80
+
 #endif /* _UAPI_FALLOC_H_ */
-- 
2.39.2



* [RFC PATCH 2/2] ext4: add FALLOC_FL_FORCE_ZERO support
  2024-12-28  1:45 [RFC PATCH 0/2] fallocate: introduce FALLOC_FL_FORCE_ZERO flag Zhang Yi
  2024-12-28  1:45 ` [RFC PATCH 1/2] fs: introduce FALLOC_FL_FORCE_ZERO to fallocate Zhang Yi
@ 2024-12-28  1:45 ` Zhang Yi
  1 sibling, 0 replies; 13+ messages in thread
From: Zhang Yi @ 2024-12-28  1:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4
  Cc: linux-kernel, viro, brauner, jack, tytso, djwong, adilger.kernel,
	yi.zhang, yi.zhang, chengzhihao1, yukuai3, yangerkun

From: Zhang Yi <yi.zhang@huawei.com>

Add support for FALLOC_FL_FORCE_ZERO. This first allocates blocks as
unwritten, then issues a zero command outside of the running journal
handle, and finally converts them to a written state.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c           | 42 +++++++++++++++++++++++++++++++------
 include/trace/events/ext4.h |  3 ++-
 2 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 1b028be19193..dcb3ef4ca1d4 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4483,6 +4483,8 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 	struct ext4_map_blocks map;
 	unsigned int credits;
 	loff_t epos, old_size = i_size_read(inode);
+	unsigned int blkbits = inode->i_blkbits;
+	bool alloc_zero = false;
 
 	BUG_ON(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS));
 	map.m_lblk = offset;
@@ -4495,6 +4497,17 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 	if (len <= EXT_UNWRITTEN_MAX_LEN)
 		flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;
 
+	/*
+	 * Doing the actual zeroing inside a running journal transaction
+	 * is expensive. First allocate an unwritten extent and then
+	 * convert it to written after zeroing it out.
+	 */
+	if (flags & EXT4_GET_BLOCKS_ZERO) {
+		flags &= ~EXT4_GET_BLOCKS_ZERO;
+		flags |= EXT4_GET_BLOCKS_UNWRIT_EXT;
+		alloc_zero = true;
+	}
+
 	/*
 	 * credits to insert 1 extent into extent tree
 	 */
@@ -4531,9 +4544,7 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 		 * allow a full retry cycle for any remaining allocations
 		 */
 		retries = 0;
-		map.m_lblk += ret;
-		map.m_len = len = len - ret;
-		epos = (loff_t)map.m_lblk << inode->i_blkbits;
+		epos = (loff_t)(map.m_lblk + ret) << blkbits;
 		inode_set_ctime_current(inode);
 		if (new_size) {
 			if (epos > new_size)
@@ -4553,6 +4564,21 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 		ret2 = ret3 ? ret3 : ret2;
 		if (unlikely(ret2))
 			break;
+
+		if (alloc_zero &&
+		    (map.m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN))) {
+			ret2 = ext4_issue_zeroout(inode, map.m_lblk, map.m_pblk,
+						  map.m_len);
+			if (likely(!ret2))
+				ret2 = ext4_convert_unwritten_extents(NULL,
+					inode, (loff_t)map.m_lblk << blkbits,
+					(loff_t)map.m_len << blkbits);
+			if (ret2)
+				break;
+		}
+
+		map.m_lblk += ret;
+		map.m_len = len = len - ret;
 	}
 	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
 		goto retry;
@@ -4618,7 +4644,11 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	if (end_lblk > start_lblk) {
 		ext4_lblk_t zero_blks = end_lblk - start_lblk;
 
-		flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | EXT4_EX_NOCACHE);
+		if (mode & FALLOC_FL_FORCE_ZERO)
+			flags = EXT4_GET_BLOCKS_CREATE_ZERO | EXT4_EX_NOCACHE;
+		else
+			flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN |
+				  EXT4_EX_NOCACHE);
 		ret = ext4_alloc_file_blocks(file, start_lblk, zero_blks,
 					     new_size, flags);
 		if (ret)
@@ -4730,8 +4760,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 
 	/* Return error if mode is not supported */
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
-		     FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |
-		     FALLOC_FL_INSERT_RANGE))
+		     FALLOC_FL_ZERO_RANGE | FALLOC_FL_FORCE_ZERO |
+		     FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_INSERT_RANGE))
 		return -EOPNOTSUPP;
 
 	inode_lock(inode);
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 156908641e68..1ac29dc637a9 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -92,7 +92,8 @@ TRACE_DEFINE_ENUM(ES_REFERENCED_B);
 	{ FALLOC_FL_KEEP_SIZE,		"KEEP_SIZE"},		\
 	{ FALLOC_FL_PUNCH_HOLE,		"PUNCH_HOLE"},		\
 	{ FALLOC_FL_COLLAPSE_RANGE,	"COLLAPSE_RANGE"},	\
-	{ FALLOC_FL_ZERO_RANGE,		"ZERO_RANGE"})
+	{ FALLOC_FL_ZERO_RANGE,		"ZERO_RANGE"},		\
+	{ FALLOC_FL_FORCE_ZERO,		"FORCE_ZERO"})
 
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_XATTR);
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_CROSS_RENAME);
-- 
2.39.2



* Re: [RFC PATCH 1/2] fs: introduce FALLOC_FL_FORCE_ZERO to fallocate
  2024-12-28  1:45 ` [RFC PATCH 1/2] fs: introduce FALLOC_FL_FORCE_ZERO to fallocate Zhang Yi
@ 2025-01-06 11:27   ` Christoph Hellwig
  2025-01-06 16:17     ` Theodore Ts'o
  2025-01-07 12:38     ` Zhang Yi
  0 siblings, 2 replies; 13+ messages in thread
From: Christoph Hellwig @ 2025-01-06 11:27 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-fsdevel, linux-ext4, linux-kernel, viro, brauner, jack,
	tytso, djwong, adilger.kernel, yi.zhang, chengzhihao1, yukuai3,
	yangerkun, Sai Chaitanya Mitta, linux-xfs

There's a feature request for something similar on the xfs list, so
I guess people are asking for it.

That being said this really should not be a modifier but a separate
operation, as the logic is very different from FALLOC_FL_ZERO_RANGE,
similar to how plain prealloc, hole punch and zero range are different
operations despite all of them resulting in reads of zeroes from the
range.

That will also make it clearer that for files or file systems that
require out-of-place writes this operation should fail instead of doing
pointless multiple writes.

Also please write a man page update clearly specifying the semantics,
especially if this should work or not if there is no write zeroes
offload in the hardware, or if that offload actually writes physical
zeroes to the media or not.

Btw, someone really should clean up the ext4 fallocate code to use
helpers and do the

	switch (mode & FALLOC_FL_MODE_MASK) {
	}

and then use helpers for each mode, which will make these things a lot
more obvious.
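
The dispatch shape described above could look roughly like this. It is a
minimal userspace model under stated assumptions, not ext4 code: the M_*
constants are local copies of the uapi mode bits, and the model_*()
helpers and their return values are hypothetical stand-ins for the
per-mode functions.

```c
#include <assert.h>
#include <errno.h>

/* Local copies of the uapi bits so the sketch builds anywhere. */
enum {
	M_ALLOCATE	= 0x00,
	M_KEEP_SIZE	= 0x01,
	M_PUNCH_HOLE	= 0x02,
	M_COLLAPSE	= 0x08,
	M_ZERO_RANGE	= 0x10,
	M_INSERT	= 0x20,
};
#define M_MODE_MASK (M_PUNCH_HOLE | M_COLLAPSE | M_ZERO_RANGE | M_INSERT)

static_assert((M_MODE_MASK & M_KEEP_SIZE) == 0,
	      "KEEP_SIZE is a flag, not a mode");

/* Hypothetical per-mode helpers; each returns a distinct id on success. */
static long model_allocate(int mode) { (void)mode; return 1; }
static long model_punch_hole(int mode)
{
	/* Per-mode flag rules live next to the mode they apply to. */
	return (mode & M_KEEP_SIZE) ? 2 : -EOPNOTSUPP;
}
static long model_zero_range(int mode) { (void)mode; return 3; }
static long model_collapse(int mode)
{
	return (mode & M_KEEP_SIZE) ? -EOPNOTSUPP : 4;
}
static long model_insert(int mode)
{
	return (mode & M_KEEP_SIZE) ? -EOPNOTSUPP : 5;
}

static long fallocate_dispatch(int mode)
{
	if (mode & ~(M_MODE_MASK | M_KEEP_SIZE))
		return -EOPNOTSUPP;

	switch (mode & M_MODE_MASK) {
	case M_ALLOCATE:
		return model_allocate(mode);
	case M_PUNCH_HOLE:
		return model_punch_hole(mode);
	case M_ZERO_RANGE:
		return model_zero_range(mode);
	case M_COLLAPSE:
		return model_collapse(mode);
	case M_INSERT:
		return model_insert(mode);
	default:
		/* More than one mode bit was set. */
		return -EOPNOTSUPP;
	}
}
```

With this layout, a new zeroing mode would be one more case and one more
helper, rather than extra conditions inside an existing helper.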



* Re: [RFC PATCH 1/2] fs: introduce FALLOC_FL_FORCE_ZERO to fallocate
  2025-01-06 11:27   ` Christoph Hellwig
@ 2025-01-06 16:17     ` Theodore Ts'o
  2025-01-06 16:27       ` Christoph Hellwig
  2025-01-07 11:22       ` Zhang Yi
  2025-01-07 12:38     ` Zhang Yi
  1 sibling, 2 replies; 13+ messages in thread
From: Theodore Ts'o @ 2025-01-06 16:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Zhang Yi, linux-fsdevel, linux-ext4, linux-kernel, viro, brauner,
	jack, djwong, adilger.kernel, yi.zhang, chengzhihao1, yukuai3,
	yangerkun, Sai Chaitanya Mitta, linux-xfs

On Mon, Jan 06, 2025 at 03:27:52AM -0800, Christoph Hellwig wrote:
> There's a feature request for something similar on the xfs list, so
> I guess people are asking for it.

Yeah, I have folks asking for this on the ext4 side as well.

The one caution that I've given to them is that there is no guarantee
what the performance will be for WRITE SAME or equivalent operations,
since the standards documents state that performance is out of scope
for the document.  So in some cases, WRITE SAME might be fast (if for
example it is just adjusting FTL metadata on an SSD, or some similar
thing on cloud-emulated block devices such as Google's Persistent Disk
or Amazon's Elastic Block Store --- what Darrick has called "software
defined storage" for the cloud), but in other hardware deployments,
WRITE SAME might be as slow as writing zeros to an HDD.

This is technically not the kernel's problem, since we can also use
the same mealy-mouthed "performance is out of scope and not the kernel's
concern", but that just transfers the problem to the application
programmers.  I could imagine some kind of tunable with which we can
make the block device pretend that it really doesn't support WRITE
SAME if the performance characteristics are such that it's a Bad Idea
to use it, so that there's a single tunable knob that the system
administrator can reach for, as opposed to PostgreSQL, MySQL, Oracle
Enterprise Database, etc. each having a different way of
configuring whether or not to disable WRITE SAME, but that's not
something we need to decide right away.

> That being said this really should not be a modifier but a separate
> operation, as the logic is very different from FALLOC_FL_ZERO_RANGE,
> similar to how plain prealloc, hole punch and zero range are different
> operations despite all of them resulting in reads of zeroes from the
> range.

Yes.  And we might decide that it should be done using some kind of
ioctl, such as BLKDISCARD, as opposed to a new fallocate operation,
since it really isn't a filesystem metadata operation, just as
BLKDISCARD isn't.  The other side of the argument is that ioctls are
ugly, and maybe all new such operations should be plumbed through via
fallocate as opposed to adding a new ioctl.  I don't have strong
feelings on this, although I *do* believe that whatever interface we
use, whether it be fallocate or ioctl, it should be supported by block
devices and files in a file system, to make life easier for those
databases that want to support running on a raw block device (for
full-page advertisements on the back cover of the Businessweek
magazine) or on files (which is how 99.9% of all real-world users
actually run enterprise databases.  :-)

						- Ted


* Re: [RFC PATCH 1/2] fs: introduce FALLOC_FL_FORCE_ZERO to fallocate
  2025-01-06 16:17     ` Theodore Ts'o
@ 2025-01-06 16:27       ` Christoph Hellwig
  2025-01-06 17:31         ` Darrick J. Wong
  2025-01-07 11:22       ` Zhang Yi
  1 sibling, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2025-01-06 16:27 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Christoph Hellwig, Zhang Yi, linux-fsdevel, linux-ext4,
	linux-kernel, viro, brauner, jack, djwong, adilger.kernel,
	yi.zhang, chengzhihao1, yukuai3, yangerkun, Sai Chaitanya Mitta,
	linux-xfs

On Mon, Jan 06, 2025 at 11:17:32AM -0500, Theodore Ts'o wrote:
> Yes.  And we might decide that it should be done using some kind of
> ioctl, such as BLKDISCARD, as opposed to a new fallocate operation,
> since it really isn't a filesystem metadata operation, just as
> BLKDISCARD isn't.  The other side of the argument is that ioctls are
> ugly, and maybe all new such operations should be plumbed through via
> fallocate as opposed to adding a new ioctl.  I don't have strong
> feelings on this, although I *do* believe that whatever interface we
> use, whether it be fallocate or ioctl, it should be supported by block
> devices and files in a file system, to make life easier for those
> databases that want to support running on a raw block device (for
> full-page advertisements on the back cover of the Businessweek
> magazine) or on files (which is how 99.9% of all real-world users
> actually run enterprise databases.  :-)

If you want the operation to work for files it needs to be routed
through the file system as otherwise you can't make it actually
work coherently.  While you could add a new ioctl that works on a
file, fallocate seems like a much better interface.  Supporting it
on a block device is trivial, as it can mostly (or even entirely
depending on the exact definition of the interface) reuse the existing
zero range / punch hole code.


* Re: [RFC PATCH 1/2] fs: introduce FALLOC_FL_FORCE_ZERO to fallocate
  2025-01-06 16:27       ` Christoph Hellwig
@ 2025-01-06 17:31         ` Darrick J. Wong
  2025-01-06 18:06           ` Christoph Hellwig
  2025-01-07 14:05           ` Zhang Yi
  0 siblings, 2 replies; 13+ messages in thread
From: Darrick J. Wong @ 2025-01-06 17:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Theodore Ts'o, Zhang Yi, linux-fsdevel, linux-ext4,
	linux-kernel, viro, brauner, jack, adilger.kernel, yi.zhang,
	chengzhihao1, yukuai3, yangerkun, Sai Chaitanya Mitta, linux-xfs

On Mon, Jan 06, 2025 at 08:27:49AM -0800, Christoph Hellwig wrote:
> On Mon, Jan 06, 2025 at 11:17:32AM -0500, Theodore Ts'o wrote:
> > Yes.  And we might decide that it should be done using some kind of
> > ioctl, such as BLKDISCARD, as opposed to a new fallocate operation,
> > since it really isn't a filesystem metadata operation, just as
> > BLKDISCARD isn't.  The other side of the argument is that ioctls are
> > ugly, and maybe all new such operations should be plumbed through via
> > fallocate as opposed to adding a new ioctl.  I don't have strong
> > feelings on this, although I *do* believe that whatever interface we
> > use, whether it be fallocate or ioctl, it should be supported by block
> > devices and files in a file system, to make life easier for those
> > databases that want to support running on a raw block device (for
> > full-page advertisements on the back cover of the Businessweek
> > magazine) or on files (which is how 99.9% of all real-world users
> > actually run enterprise databases.  :-)
> 
> If you want the operation to work for files it needs to be routed
> through the file system as otherwise you can't make it actually
> work coherently.  While you could add a new ioctl that works on a
> file, fallocate seems like a much better interface.  Supporting it
> on a block device is trivial, as it can mostly (or even entirely
> depending on the exact definition of the interface) reuse the existing
> zero range / punch hole code.

I think we should wire it up as a new FALLOC_FL_WRITE_ZEROES mode,
document very vigorously that it exists to facilitate pure overwrites
(specifically that it returns EOPNOTSUPP for always-cow files), and not
add more ioctls.

(That said, doesn't BLKZEROOUT already do this for bdevs?)
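
For reference, BLKZEROOUT on a block device takes a pair of 64-bit
values, the start offset and the length in bytes. A minimal sketch of a
caller is shown below; bdev_zeroout() is a hypothetical wrapper name,
and the comment about kernel-side behaviour reflects my understanding
of the block layer rather than anything stated in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* BLKZEROOUT */

/*
 * Zero [start, start + len) on the block device at dev_path.
 * Returns 0 on success and -1 on failure with errno set. The kernel is
 * expected to use a write-zeroes offload when the device has one and to
 * write zero pages otherwise, so the call may be slow.
 */
static int bdev_zeroout(const char *dev_path, uint64_t start, uint64_t len)
{
	uint64_t range[2] = { start, len };
	int fd, ret, saved_errno;

	static_assert(sizeof(range) == 16,
		      "BLKZEROOUT takes two 64-bit values");

	fd = open(dev_path, O_WRONLY);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, BLKZEROOUT, range);
	saved_errno = errno;
	close(fd);
	errno = saved_errno;	/* keep the ioctl's errno, not close()'s */
	return ret;
}
```

On anything that is not a block device, such as a regular file, the
ioctl is expected to fail, which is why a file-level interface still
has to go through the file system.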

--D


* Re: [RFC PATCH 1/2] fs: introduce FALLOC_FL_FORCE_ZERO to fallocate
  2025-01-06 17:31         ` Darrick J. Wong
@ 2025-01-06 18:06           ` Christoph Hellwig
  2025-01-07 14:05           ` Zhang Yi
  1 sibling, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2025-01-06 18:06 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Theodore Ts'o, Zhang Yi, linux-fsdevel,
	linux-ext4, linux-kernel, viro, brauner, jack, adilger.kernel,
	yi.zhang, chengzhihao1, yukuai3, yangerkun, Sai Chaitanya Mitta,
	linux-xfs

On Mon, Jan 06, 2025 at 09:31:33AM -0800, Darrick J. Wong wrote:
> I think we should wire it up as a new FALLOC_FL_WRITE_ZEROES mode,
> document very vigorously that it exists to facilitate pure overwrites
> (specifically that it returns EOPNOTSUPP for always-cow files), and not
> add more ioctls.

That goes into a similar direction to what I'd prefer.

> (That said, doesn't BLKZEROOUT already do this for bdevs?)

Yes. But the same is true for the other fallocate modes on block
devices.


* Re: [RFC PATCH 1/2] fs: introduce FALLOC_FL_FORCE_ZERO to fallocate
  2025-01-06 16:17     ` Theodore Ts'o
  2025-01-06 16:27       ` Christoph Hellwig
@ 2025-01-07 11:22       ` Zhang Yi
  1 sibling, 0 replies; 13+ messages in thread
From: Zhang Yi @ 2025-01-07 11:22 UTC (permalink / raw)
  To: Theodore Ts'o, Christoph Hellwig
  Cc: linux-fsdevel, linux-ext4, linux-kernel, viro, brauner, jack,
	djwong, adilger.kernel, yi.zhang, chengzhihao1, yukuai3,
	yangerkun, Sai Chaitanya Mitta, linux-xfs

On 2025/1/7 0:17, Theodore Ts'o wrote:
> On Mon, Jan 06, 2025 at 03:27:52AM -0800, Christoph Hellwig wrote:
>> There's a feature request for something similar on the xfs list, so
>> I guess people are asking for it.
> 
> Yeah, I have folks asking for this on the ext4 side as well.
> 
> The one caution that I've given to them is that there is no guarantee
> what the performance will be for WRITE SAME or equivalent operations,
> since the standards documents state that performance is out of scope
> for the document.  So in some cases, WRITE SAME might be fast (if for
> example it is just adjusting FTL metadata on an SSD, or some similar
> thing on cloud-emulated block devices such as Google's Persistent Disk
> or Amazon's Elastic Block Store --- what Darrick has called "software
> defined storage" for the cloud), but in other hardware deployments,
> WRITE SAME might be as slow as writing zeros to an HDD.
> 
> This is technically not the kernel's problem, since we can also use
> the same mealy-mouthed "performance is out of scope and not the kernel's
> concern", but that just transfers the problem to the application
> programmers.  I could imagine some kind of tunable with which we can
> make the block device pretend that it really doesn't support WRITE
> SAME if the performance characteristics are such that it's a Bad Idea
> to use it, so that there's a single tunable knob that the system
> administrator can reach for, as opposed to PostgreSQL, MySQL, Oracle
> Enterprise Database, etc. each having a different way of
> configuring whether or not to disable WRITE SAME, but that's not
> something we need to decide right away.

Yes, I completely agree with you. At this time, it is not possible to
determine whether a disk supports fast write zeros just by checking
whether it supports the write_zero command. This is especially true for
some HDDs, which must write actual zeros to the media even though they
claim to support the write_zero command, and that is very slow.

Therefore, I propose that we add a new feature flag, such as
BLK_FEAT_FAST_WRITE_ZERO, to queue->limits.features. This flag should
be set by each disk driver if the attached disk supports fast write
zeros. For instance, the NVMe SSD driver should set this flag if the
given namespace supports NVME_NS_DEAC. Additionally, we can add an
entry in sysfs that allows the user to enable and disable this feature
manually, for corner cases where the driver doesn't know whether the
disk supports it.
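
To illustrate the gap, here is a minimal userspace sketch that checks
the existing write_zeroes_max_bytes queue attribute in sysfs. It only
tells us that the device advertises a write zeroes offload, not that
the offload is fast, which is what the proposed flag would add. The
helper names read_u64_attr() and advertises_write_zeroes() are made up
for this example.

```c
#include <stdio.h>

/*
 * Read an unsigned integer from a sysfs-style attribute file.
 * Returns 0 on success and stores the value in *out, -1 on failure.
 */
static int read_u64_attr(const char *path, unsigned long long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%llu", out) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Returns 1 if the disk advertises a write zeroes offload, 0 if it does
 * not, and -1 if the attribute cannot be read.  A result of 1 says
 * nothing about how fast the offload is.
 */
static int advertises_write_zeroes(const char *disk)
{
	char path[256];
	unsigned long long max_bytes;

	snprintf(path, sizeof(path),
		 "/sys/block/%s/queue/write_zeroes_max_bytes", disk);
	if (read_u64_attr(path, &max_bytes))
		return -1;
	return max_bytes > 0;
}
```

Calling advertises_write_zeroes("sda") on an HDD may well return 1 even
though the zeroing is slow, so an application cannot use this value
alone to decide.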

> 
>> That being said this really should not be a modifier but a separate
>> operation, as the logic is very different from FALLOC_FL_ZERO_RANGE,
>> similar to how plain prealloc, hole punch and zero range are different
>> operations despite all of them resulting in reads of zeroes from the
>> range.
> 
> Yes.  And we might decide that it should be done using some kind of
> ioctl, such as BLKDISCARD, as opposed to a new fallocate operation,
> since it really isn't a filesystem metadata operation, just as
> BLKDISCARD isn't.  The other side of the argument is that ioctls are
> ugly, and maybe all new such operations should be plumbed through via
> fallocate as opposed to adding a new ioctl.  I don't have strong
> feelings on this, although I *do* believe that whatever interface we
> use, whether it be fallocate or ioctl, it should be supported by block
> devices and files in a file system, to make life easier for those
> databases that want to support running on a raw block device (for
> full-page advertisements on the back cover of the Businessweek
> magazine) or on files (which is how 99.9% of all real-world users
> actually run enterprise databases.  :-)
> 

For this part, I still think it would be better to use fallocate.

Thanks,
Yi.




* Re: [RFC PATCH 1/2] fs: introduce FALLOC_FL_FORCE_ZERO to fallocate
  2025-01-06 11:27   ` Christoph Hellwig
  2025-01-06 16:17     ` Theodore Ts'o
@ 2025-01-07 12:38     ` Zhang Yi
  1 sibling, 0 replies; 13+ messages in thread
From: Zhang Yi @ 2025-01-07 12:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-ext4, linux-kernel, viro, brauner, jack,
	tytso, djwong, adilger.kernel, yi.zhang, chengzhihao1, yukuai3,
	yangerkun, Sai Chaitanya Mitta, linux-xfs

On 2025/1/6 19:27, Christoph Hellwig wrote:
> There's a feature request for something similar on the xfs list, so
> I guess people are asking for it.
> 
> That being said this really should not be a modifier but a separate
> operation, as the logic is very different from FALLOC_FL_ZERO_RANGE,
> similar to how plain prealloc, hole punch and zero range are different
> operations despite all of them resulting in reads of zeroes from the
> range.

OK, it seems reasonable to me, and adding a new operation would be
better. There is actually no need to mix it with the current
FALLOC_FL_ZERO_RANGE.

> 
> That will also make it more clear that for files or file systems that
> require out of place writes this operation should fail instead of
> doing pointless multiple writes.
> 
> Also please write a man page update clearly specifying the semantics,
> especially if this should work or not if there is no write zeroes
> offload in the hardware, or if that offload actually writes physical
> zeroes to the media or not.
> 

Sure, thanks for your advice.

Thanks,
Yi.

> Btw, someone really should clean up the ext4 fallocate code to use
> helpers and do the
> 
> 	switch (mode & FALLOC_FL_MODE_MASK) {
> 	}
> 
> and then use helpers for each mode which will make these things a lot
> more obvious.
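
A minimal userspace sketch of the dispatch shape suggested above, not
the actual ext4 code. The SK_* constants mirror the values of the uapi
FALLOC_FL_* flags, SK_MODE_MASK is a local stand-in for the mask in the
quoted snippet (FALLOC_FL_UNSHARE_RANGE is left out for brevity), and
the sk_* helpers are hypothetical stubs that only report which
operation was selected.

```c
#include <errno.h>

/* Values mirror the uapi FALLOC_FL_* flags; redefined to keep this standalone. */
#define SK_ALLOCATE_RANGE	0x00	/* default: plain preallocation */
#define SK_KEEP_SIZE		0x01	/* modifier, not an operation */
#define SK_PUNCH_HOLE		0x02
#define SK_COLLAPSE_RANGE	0x08
#define SK_ZERO_RANGE		0x10
#define SK_INSERT_RANGE		0x20

/* Bits that select an operation, as opposed to modifying one. */
#define SK_MODE_MASK	(SK_PUNCH_HOLE | SK_COLLAPSE_RANGE | \
			 SK_ZERO_RANGE | SK_INSERT_RANGE)

enum sk_op { SK_OP_ALLOC = 1, SK_OP_PUNCH, SK_OP_COLLAPSE,
	     SK_OP_ZERO, SK_OP_INSERT };

/* Hypothetical per-mode helpers; real ones would do the actual work. */
static int sk_alloc_range(int mode)    { (void)mode; return SK_OP_ALLOC; }
static int sk_punch_hole(int mode)     { (void)mode; return SK_OP_PUNCH; }
static int sk_collapse_range(int mode) { (void)mode; return SK_OP_COLLAPSE; }
static int sk_zero_range(int mode)     { (void)mode; return SK_OP_ZERO; }
static int sk_insert_range(int mode)   { (void)mode; return SK_OP_INSERT; }

/* Select exactly one operation; reject combinations of operation bits. */
static int sk_fallocate(int mode)
{
	switch (mode & SK_MODE_MASK) {
	case SK_ALLOCATE_RANGE:
		return sk_alloc_range(mode);
	case SK_PUNCH_HOLE:
		return sk_punch_hole(mode);
	case SK_COLLAPSE_RANGE:
		return sk_collapse_range(mode);
	case SK_ZERO_RANGE:
		return sk_zero_range(mode);
	case SK_INSERT_RANGE:
		return sk_insert_range(mode);
	default:
		/* More than one operation bit was set. */
		return -EOPNOTSUPP;
	}
}
```

With this shape a new operation becomes one more case label and one
more helper, rather than another flag test threaded through an existing
path.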



* Re: [RFC PATCH 1/2] fs: introduce FALLOC_FL_FORCE_ZERO to fallocate
  2025-01-06 17:31         ` Darrick J. Wong
  2025-01-06 18:06           ` Christoph Hellwig
@ 2025-01-07 14:05           ` Zhang Yi
  2025-01-07 16:42             ` Christoph Hellwig
  1 sibling, 1 reply; 13+ messages in thread
From: Zhang Yi @ 2025-01-07 14:05 UTC (permalink / raw)
  To: Darrick J. Wong, Christoph Hellwig
  Cc: Theodore Ts'o, linux-fsdevel, linux-ext4, linux-kernel, viro,
	brauner, jack, adilger.kernel, yi.zhang, chengzhihao1, yukuai3,
	yangerkun, Sai Chaitanya Mitta, linux-xfs

On 2025/1/7 1:31, Darrick J. Wong wrote:
> On Mon, Jan 06, 2025 at 08:27:49AM -0800, Christoph Hellwig wrote:
>> On Mon, Jan 06, 2025 at 11:17:32AM -0500, Theodore Ts'o wrote:
>>> Yes.  And we might decide that it should be done using some kind of
>>> ioctl, such as BLKDISCARD, as opposed to a new fallocate operation,
>>> since it really isn't a filesystem metadata operation, just as
>>> BLKDISCARD isn't.  The other side of the argument is that ioctls are
>>> ugly, and maybe all new such operations should be plumbed through via
>>> fallocate as opposed to adding a new ioctl.  I don't have strong
>>> feelings on this, although I *do* believe that whatever interface we
>>> use, whether it be fallocate or ioctl, it should be supported by block
>>> devices and files in a file system, to make life easier for those
>>> databases that want to support running on a raw block device (for
>>> full-page advertisements on the back cover of the Businessweek
>>> magazine) or on files (which is how 99.9% of all real-world users
>>> actually run enterprise databases.  :-)
>>
>> If you want the operation to work for files it needs to be routed
>> through the file system as otherwise you can't make it actually
>> work coherently.  While you could add a new ioctl that works on a
>> file fallocate seems like a much better interface.  Supporting it
>> on a block device is trivial, as it can mostly (or even entirely
>> depending on the exact definition of the interface) reuse the existing
>> zero range / punch hole code.
> 
> I think we should wire it up as a new FALLOC_FL_WRITE_ZEROES mode,
> document very vigorously that it exists to facilitate pure overwrites
> (specifically that it returns EOPNOTSUPP for always-cow files), and not
> add more ioctls.
> 

Sorry, the "pure overwrites" and "always-cow files" parts confuse me.
This is mainly used to create a new written file range, but it could
also be used to zero out an existing range, so why do you say it exists
to facilitate pure overwrites?

For the "always-cow files", do you mean reflinked files? Could you
please give more details?

Thanks,
Yi.

> (That said, doesn't BLKZEROOUT already do this for bdevs?)
> 
> --D



* Re: [RFC PATCH 1/2] fs: introduce FALLOC_FL_FORCE_ZERO to fallocate
  2025-01-07 14:05           ` Zhang Yi
@ 2025-01-07 16:42             ` Christoph Hellwig
  2025-01-08  1:20               ` Zhang Yi
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2025-01-07 16:42 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Darrick J. Wong, Christoph Hellwig, Theodore Ts'o,
	linux-fsdevel, linux-ext4, linux-kernel, viro, brauner, jack,
	adilger.kernel, yi.zhang, chengzhihao1, yukuai3, yangerkun,
	Sai Chaitanya Mitta, linux-xfs

On Tue, Jan 07, 2025 at 10:05:47PM +0800, Zhang Yi wrote:
> Sorry, the "pure overwrites" and "always-cow files" parts confuse me.
> This is mainly used to create a new written file range, but it could
> also be used to zero out an existing range, so why do you say it exists
> to facilitate pure overwrites?

If you're fine with writes to your file causing block allocations you
can already use the hole punch or preallocate fallocate modes.  No
need to actually send a command to the device.

> 
> For the "always-cow files", do you mean reflinked files? Could you
> please give more details?

reflinked files will require out of place writes for shared blocks.
As will anything on device mapper snapshots.  Or any file on
file systems that write out of place (btrfs, f2fs, nilfs2, the
upcoming zoned xfs mode).
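
From the application side, a new mode that fails with EOPNOTSUPP on
such files means callers need a fallback. A minimal sketch of that
pattern follows. The mode discussed in this thread does not exist yet,
so the existing FALLOC_FL_ZERO_RANGE is used purely as a stand-in (it
may leave unwritten extents, which is exactly what the new mode is
meant to avoid). zero_range() and zero_by_write() are names made up
for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for fallocate() */
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <linux/falloc.h>

/* Slow path: zero [off, off + len) by writing zero-filled buffers. */
static int zero_by_write(int fd, off_t off, off_t len)
{
	static const char zeros[65536];

	while (len > 0) {
		size_t chunk = len < (off_t)sizeof(zeros) ?
			       (size_t)len : sizeof(zeros);
		ssize_t n = pwrite(fd, zeros, chunk, off);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0) {
			/* No progress; bail out instead of looping forever. */
			errno = EIO;
			return -1;
		}
		off += n;
		len -= n;
	}
	return 0;
}

/*
 * Try the fallocate mode first and fall back to plain writes when the
 * file or file system refuses it.  Returns 0 on success, -1 with errno
 * set on failure.
 */
static int zero_range(int fd, off_t off, off_t len)
{
	if (fallocate(fd, FALLOC_FL_ZERO_RANGE, off, len) == 0)
		return 0;
	if (errno != EOPNOTSUPP && errno != ENOSYS)
		return -1;
	return zero_by_write(fd, off, len);
}
```

Either path leaves the range reading back as zeros; only the cost and
the resulting extent state differ.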



* Re: [RFC PATCH 1/2] fs: introduce FALLOC_FL_FORCE_ZERO to fallocate
  2025-01-07 16:42             ` Christoph Hellwig
@ 2025-01-08  1:20               ` Zhang Yi
  0 siblings, 0 replies; 13+ messages in thread
From: Zhang Yi @ 2025-01-08  1:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Theodore Ts'o, linux-fsdevel, linux-ext4,
	linux-kernel, viro, brauner, jack, adilger.kernel, yi.zhang,
	chengzhihao1, yukuai3, yangerkun, Sai Chaitanya Mitta, linux-xfs

On 2025/1/8 0:42, Christoph Hellwig wrote:
> On Tue, Jan 07, 2025 at 10:05:47PM +0800, Zhang Yi wrote:
>> Sorry, the "pure overwrites" and "always-cow files" parts confuse me.
>> This is mainly used to create a new written file range, but it could
>> also be used to zero out an existing range, so why do you say it exists
>> to facilitate pure overwrites?
> 
> If you're fine with writes to your file causing block allocations you
> can already use the hole punch or preallocate fallocate modes.  No
> need to actually send a command to the device.
> 

Okay, I misunderstood your point earlier. This is indeed meant to
prepare the range for subsequent overwrites. Thanks a lot for explaining.

Thanks,
Yi.

>>
>> For the "always-cow files", do you mean reflinked files? Could you
>> please give more details?
> 
> reflinked files will require out of place writes for shared blocks.
> As will anything on device mapper snapshots.  Or any file on
> file systems that write out of place (btrfs, f2fs, nilfs2, the
> upcoming zoned xfs mode).
> 



end of thread, other threads:[~2025-01-08  1:20 UTC | newest]

Thread overview: 13+ messages
2024-12-28  1:45 [RFC PATCH 0/2] fallocate: introduce FALLOC_FL_FORCE_ZERO flag Zhang Yi
2024-12-28  1:45 ` [RFC PATCH 1/2] fs: introduce FALLOC_FL_FORCE_ZERO to fallocate Zhang Yi
2025-01-06 11:27   ` Christoph Hellwig
2025-01-06 16:17     ` Theodore Ts'o
2025-01-06 16:27       ` Christoph Hellwig
2025-01-06 17:31         ` Darrick J. Wong
2025-01-06 18:06           ` Christoph Hellwig
2025-01-07 14:05           ` Zhang Yi
2025-01-07 16:42             ` Christoph Hellwig
2025-01-08  1:20               ` Zhang Yi
2025-01-07 11:22       ` Zhang Yi
2025-01-07 12:38     ` Zhang Yi
2024-12-28  1:45 ` [RFC PATCH 2/2] ext4: add FALLOC_FL_FORCE_ZERO support Zhang Yi