public inbox for linux-xfs@vger.kernel.org
* [PATCH 0/9] introduce defrag to xfs_spaceman
@ 2024-07-09 19:10 Wengang Wang
  2024-07-09 19:10 ` [PATCH 1/9] xfsprogs: introduce defrag command to spaceman Wengang Wang
                   ` (9 more replies)
  0 siblings, 10 replies; 60+ messages in thread
From: Wengang Wang @ 2024-07-09 19:10 UTC (permalink / raw)
  To: linux-xfs; +Cc: wen.gang.wang

This patch set introduces a defrag command to xfs_spaceman. It has the functionality and
features below (also intended for the man page, so please review):

       defrag [-f free_space] [-i idle_time] [-s segment_size] [-n] [-a]
              defrag defragments the specified XFS file online, non-exclusively. The target XFS
              doesn't need to (and must not) be unmounted. While defragmentation is in progress,
              file IOs are served 'in parallel'. The reflink feature must be enabled on the XFS.

              Defragmentation and file IOs

              The target file is virtually divided into many small segments. Segments are the
              smallest units of defragmentation. Each segment is defragmented one by one in a
              lock->defragment->unlock->idle manner. File IOs are blocked while the target file is
              locked and are served during the defragmentation idle time (file is unlocked). Though
              the file IOs can't truly go in parallel, they are not blocked for long. The locking
              time basically depends on the segment size: smaller segments usually take less
              locking time, so IOs are blocked for shorter periods; bigger segments usually need
              more locking time, so IOs are blocked longer. Use the -s and -i options to balance
              defragmentation against IO service.

              Temporary file

              A temporary file is used for the defragmentation. The temporary file is created in
              the same directory as the target file and is named ".xfsdefrag_<pid>". It is a
              sparse file and holds one defragmentation segment at a time. The temporary file is
              removed automatically when defragmentation finishes or is cancelled by ctrl-c. It
              remains if the kernel crashes while defragmentation is going on; in that case, the
              temporary file has to be removed manually.

              Free blocks consumption

              Defragmentation works by (trying to) allocate new (contiguous) blocks, copying the
              data and then freeing the old (non-contiguous) blocks. Usually the number of old
              blocks freed equals the number of newly allocated blocks, so as a final result
              defragmentation doesn't consume free blocks. That holds only if the target file is
              not sharing blocks with other files. If the target file contains shared blocks,
              those shared blocks won't be freed back to the filesystem as they are still owned by
              other files, so defragmentation allocates more blocks than it frees. On an existing
              XFS, free blocks might be over-committed when reflink snapshots were created. To
              avoid driving the XFS into a low-free-blocks state, defragmentation excludes
              (partially) shared segments when the filesystem's free blocks fall below a
              threshold. Check the -f option.
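
The free-space threshold check can be sketched as a pure predicate over the statfs(2) fields; the names below are hypothetical, not the patch's:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper mirroring the -f check: skip (partially) shared
 * segments when free space, f_bsize * f_bavail from statfs(2), is below
 * the configured limit. A limit of 0 disables the check. */
static bool
free_space_below_limit(uint64_t f_bsize, uint64_t f_bavail,
		       uint64_t limit_bytes)
{
	if (limit_bytes == 0)
		return false;
	return f_bsize * f_bavail < limit_bytes;
}
```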

              Safety and consistency

              The defragmented file is guaranteed to be safe and data-consistent across ctrl-c and
              kernel crashes.

              First extent share

              The current kernel runs a routine for each segment defragmentation that detects
              whether the file is sharing blocks. It takes long when the target file contains a
              huge number of extents and the shared ones, if any, are at the end. The First extent
              share feature works around this issue by making the first several blocks shared.
              Seeing that the first blocks are shared, the kernel routine ends quickly. The side
              effect is that the "share" flag would remain on the target file. This feature is
              enabled by default and can be disabled with the -n option.

              extsize and cowextsize

              According to the kernel implementation, extsize and cowextsize can impact
              defragmentation as follows: 1) a non-zero extsize causes separate block allocations
              for each extent in the segment, and those blocks are not contiguous; the segment
              retains the same number of extents after defragmentation (no effect). 2) When
              extsize and/or cowextsize are too big, a lot of pre-allocated blocks remain in
              memory for a while. When new IO lands on those pre-allocated blocks, copy-on-write
              happens and causes the file to become fragmented.
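
The extsize hint discussed above can be read back through the FS_IOC_FSGETXATTR ioctl; a sketch with a hypothetical helper (assuming fsx_extsize is reported in bytes, with a fallback for filesystems that don't support the ioctl):

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* FS_IOC_FSGETXATTR, struct fsxattr */

/* Hypothetical helper: return the file's extent size hint in bytes
 * (fsx_extsize), or -1 if the filesystem doesn't support the ioctl. */
static long
get_extsize_bytes(int fd)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) == -1)
		return -1;	/* e.g. ENOTTY on unsupported filesystems */
	return (long)fsx.fsx_extsize;
}
```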

              Readahead

              Readahead tries to fetch the data blocks for next segment with less locking in
              backgroud during idle time. This feature is disabled by default, use -a to enable it.

              The command takes the following options:
                 -f free_space
                     The threshold of XFS free blocks in MiB. When free blocks are less than this
                     number, (partially) shared segments are excluded from defragmentation. The
                     default is 1024.

                 -i idle_time
                     The time in milliseconds that defragmentation idles after defragmenting a
                     segment and before handling the next. Default number is TOBEDONE.

                 -s segment_size
                     The segment size limit. The minimum is 4MiB and the default is 16MiB.

                 -n  Disable the First extent share feature. Enabled by default.

                 -a  Enable readahead feature, disabled by default.
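
The -s value is converted from MiB on the command line to the 512-byte units used internally and clamped to the 4 MiB minimum; a sketch mirroring the patch's conversion (helper name is hypothetical):

```c
#define MIN_SEGMENT_SIZE_LIMIT	8192	/* 4 MiB in 512-byte units */

/* Hypothetical helper mirroring the -s parsing: MiB on the command line
 * to 512-byte units, clamped to the minimum segment size. */
static int
segment_size_to_sectors(int mib)
{
	int sectors = mib * 1024 * 1024 / 512;

	if (sectors < MIN_SEGMENT_SIZE_LIMIT)
		sectors = MIN_SEGMENT_SIZE_LIMIT;
	return sectors;
}
```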

We tested with a real customer metadump with several different 'idle_time's and found 250ms to be
a good practical sleep time. Here are some numbers from the test:

Test: running defrag on an image file used as the backing store of a block device in a
      virtual machine, while fio runs inside the virtual machine on that block device.
block device type:   NVMe
File size:           200GiB
parameters to defrag: free_space: 1024 idle_time: 250 First_extent_share: enabled readahead: disabled
Defrag run time:     223 minutes
Number of extents:   6745489(before) -> 203571(after)
Fio read latency:    15.72ms(without defrag) -> 14.53ms(during defrag)
Fio write latency:   32.21ms(without defrag) -> 20.03ms(during defrag)


Wengang Wang (9):
  xfsprogs: introduce defrag command to spaceman
  spaceman/defrag: pick up segments from target file
  spaceman/defrag: defrag segments
  spaceman/defrag: ctrl-c handler
  spaceman/defrag: exclude shared segments on low free space
  spaceman/defrag: workaround kernel xfs_reflink_try_clear_inode_flag()
  spaceman/defrag: sleeps between segments
  spaceman/defrag: readahead for better performance
  spaceman/defrag: warn on extsize

 spaceman/Makefile |   2 +-
 spaceman/defrag.c | 788 ++++++++++++++++++++++++++++++++++++++++++++++
 spaceman/init.c   |   1 +
 spaceman/space.h  |   1 +
 4 files changed, 791 insertions(+), 1 deletion(-)
 create mode 100644 spaceman/defrag.c

-- 
2.39.3 (Apple Git-146)


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 1/9] xfsprogs: introduce defrag command to spaceman
  2024-07-09 19:10 [PATCH 0/9] introduce defrag to xfs_spaceman Wengang Wang
@ 2024-07-09 19:10 ` Wengang Wang
  2024-07-09 21:18   ` Darrick J. Wong
  2024-07-09 19:10 ` [PATCH 2/9] spaceman/defrag: pick up segments from target file Wengang Wang
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 60+ messages in thread
From: Wengang Wang @ 2024-07-09 19:10 UTC (permalink / raw)
  To: linux-xfs; +Cc: wen.gang.wang


Non-exclusive defragment
Here we introduce a non-exclusive manner of defragmenting a file,
especially huge files, without blocking IO to it for long.
Non-exclusive defragmentation divides the whole file into small segments.
For each segment, we lock the file, defragment the segment and unlock the file.
Defragmenting a small segment doesn't take long, so file IO requests get
served between segments rather than being blocked for long. We also put a
(user adjustable) idle time between defragmenting two consecutive segments
to balance defragmentation against file IOs.

The first patch in the set checks for valid target files.

A valid defrag target file must:
1. be accessible for read/write
2. be a regular file
3. reside on an XFS filesystem
4. have reflink enabled on the containing XFS. This is not checked
   before starting defragmentation, but an error would be reported
   later.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
---
 spaceman/Makefile |   2 +-
 spaceman/defrag.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++
 spaceman/init.c   |   1 +
 spaceman/space.h  |   1 +
 4 files changed, 201 insertions(+), 1 deletion(-)
 create mode 100644 spaceman/defrag.c

diff --git a/spaceman/Makefile b/spaceman/Makefile
index 1f048d54..9c00b20a 100644
--- a/spaceman/Makefile
+++ b/spaceman/Makefile
@@ -7,7 +7,7 @@ include $(TOPDIR)/include/builddefs
 
 LTCOMMAND = xfs_spaceman
 HFILES = init.h space.h
-CFILES = info.c init.c file.c health.c prealloc.c trim.c
+CFILES = info.c init.c file.c health.c prealloc.c trim.c defrag.c
 LSRCFILES = xfs_info.sh
 
 LLDLIBS = $(LIBXCMD) $(LIBFROG)
diff --git a/spaceman/defrag.c b/spaceman/defrag.c
new file mode 100644
index 00000000..c9732984
--- /dev/null
+++ b/spaceman/defrag.c
@@ -0,0 +1,198 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2024 Oracle.
+ * All Rights Reserved.
+ */
+
+#include "libxfs.h"
+#include <linux/fiemap.h>
+#include <linux/fsmap.h>
+#include "libfrog/fsgeom.h"
+#include "command.h"
+#include "init.h"
+#include "libfrog/paths.h"
+#include "space.h"
+#include "input.h"
+
+/* defrag segment size limit in units of 512 bytes */
+#define MIN_SEGMENT_SIZE_LIMIT 8192 /* 4MiB */
+#define DEFAULT_SEGMENT_SIZE_LIMIT 32768 /* 16MiB */
+static int g_segment_size_lmt = DEFAULT_SEGMENT_SIZE_LIMIT;
+
+/* size of the defrag target file */
+static off_t g_defrag_file_size = 0;
+
+/* stats for the target file extents before defrag */
+struct ext_stats {
+	long	nr_ext_total;
+	long	nr_ext_unwritten;
+	long	nr_ext_shared;
+};
+static struct ext_stats	g_ext_stats;
+
+/*
+ * check if the target is a valid file to defrag
+ * also store file size
+ * returns:
+ * true for yes and false for no
+ */
+static bool
+defrag_check_file(char *path)
+{
+	struct statfs statfs_s;
+	struct stat stat_s;
+
+	if (access(path, F_OK|W_OK) == -1) {
+		if (errno == ENOENT)
+			fprintf(stderr, "file \"%s\" doesn't exist\n", path);
+		else
+			fprintf(stderr, "no access to \"%s\", %s\n", path,
+				strerror(errno));
+		return false;
+	}
+
+	if (stat(path, &stat_s) == -1) {
+		fprintf(stderr, "failed to get file info on \"%s\":  %s\n",
+			path, strerror(errno));
+		return false;
+	}
+
+	g_defrag_file_size = stat_s.st_size;
+
+	if (!S_ISREG(stat_s.st_mode)) {
+		fprintf(stderr, "\"%s\" is not a regular file\n", path);
+		return false;
+	}
+
+	if (statfs(path, &statfs_s) == -1) {
+		fprintf(stderr, "failed to get FS info on \"%s\":  %s\n",
+			path, strerror(errno));
+		return false;
+	}
+
+	if (statfs_s.f_type != XFS_SUPER_MAGIC) {
+		fprintf(stderr, "\"%s\" is not a xfs file\n", path);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * defragment a file
+ * return 0 if successfully done, 1 otherwise
+ */
+static int
+defrag_xfs_defrag(char *file_path) {
+	int	max_clone_us = 0, max_unshare_us = 0, max_punch_us = 0;
+	long	nr_seg_defrag = 0, nr_ext_defrag = 0;
+	int	scratch_fd = -1, defrag_fd = -1;
+	char	tmp_file_path[PATH_MAX+1];
+	char	*defrag_dir;
+	struct fsxattr	fsx;
+	int	ret = 0;
+
+	fsx.fsx_nextents = 0;
+	memset(&g_ext_stats, 0, sizeof(g_ext_stats));
+
+	if (!defrag_check_file(file_path)) {
+		ret = 1;
+		goto out;
+	}
+
+	defrag_fd = open(file_path, O_RDWR);
+	if (defrag_fd == -1) {
+		fprintf(stderr, "Opening %s failed. %s\n", file_path,
+			strerror(errno));
+		ret = 1;
+		goto out;
+	}
+
+	defrag_dir = dirname(file_path);
+	snprintf(tmp_file_path, PATH_MAX, "%s/.xfsdefrag_%d", defrag_dir,
+		getpid());
+	tmp_file_path[PATH_MAX] = 0;
+	scratch_fd = open(tmp_file_path, O_CREAT|O_EXCL|O_RDWR, 0600);
+	if (scratch_fd == -1) {
+		fprintf(stderr, "Opening temporary file %s failed. %s\n",
+			tmp_file_path, strerror(errno));
+		ret = 1;
+		goto out;
+	}
+out:
+	if (scratch_fd != -1) {
+		close(scratch_fd);
+		unlink(tmp_file_path);
+	}
+	if (defrag_fd != -1) {
+		ioctl(defrag_fd, FS_IOC_FSGETXATTR, &fsx);
+		close(defrag_fd);
+	}
+
+	printf("Pre-defrag %ld extents detected, %ld are \"unwritten\","
+		"%ld are \"shared\"\n",
+		g_ext_stats.nr_ext_total, g_ext_stats.nr_ext_unwritten,
+		g_ext_stats.nr_ext_shared);
+	printf("Tried to defragment %ld extents in %ld segments\n",
+		nr_ext_defrag, nr_seg_defrag);
+	printf("Time stats(ms): max clone: %d, max unshare: %d,"
+	       " max punch_hole: %d\n",
+	       max_clone_us/1000, max_unshare_us/1000, max_punch_us/1000);
+	printf("Post-defrag %u extents detected\n", fsx.fsx_nextents);
+	return ret;
+}
+
+
+static void defrag_help(void)
+{
+	printf(_(
+"\n"
+"Defragment files on XFS where reflink is enabled. IOs to the target files \n"
+"can be served during the defragmentation.\n"
+"\n"
+" -s segment_size    -- specify the segment size in MiB, minimum value is 4 \n"
+"                       default is 16\n"));
+}
+
+static cmdinfo_t defrag_cmd;
+
+static int
+defrag_f(int argc, char **argv)
+{
+	int	i;
+	int	c;
+
+	while ((c = getopt(argc, argv, "s:")) != EOF) {
+		switch(c) {
+		case 's':
+			g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
+			if (g_segment_size_lmt < MIN_SEGMENT_SIZE_LIMIT) {
+				g_segment_size_lmt = MIN_SEGMENT_SIZE_LIMIT;
+				printf("Using minimum segment size %d\n",
+					g_segment_size_lmt);
+			}
+			break;
+		default:
+			command_usage(&defrag_cmd);
+			return 1;
+		}
+	}
+
+	for (i = 0; i < filecount; i++)
+		defrag_xfs_defrag(filetable[i].name);
+	return 0;
+}
+void defrag_init(void)
+{
+	defrag_cmd.name		= "defrag";
+	defrag_cmd.altname	= "dfg";
+	defrag_cmd.cfunc	= defrag_f;
+	defrag_cmd.argmin	= 0;
+	defrag_cmd.argmax	= 4;
+	defrag_cmd.args		= "[-s segment_size]";
+	defrag_cmd.flags	= CMD_FLAG_ONESHOT;
+	defrag_cmd.oneline	= _("Defragment XFS files");
+	defrag_cmd.help		= defrag_help;
+
+	add_command(&defrag_cmd);
+}
diff --git a/spaceman/init.c b/spaceman/init.c
index cf1ff3cb..396f965c 100644
--- a/spaceman/init.c
+++ b/spaceman/init.c
@@ -35,6 +35,7 @@ init_commands(void)
 	trim_init();
 	freesp_init();
 	health_init();
+	defrag_init();
 }
 
 static int
diff --git a/spaceman/space.h b/spaceman/space.h
index 723209ed..c288aeb9 100644
--- a/spaceman/space.h
+++ b/spaceman/space.h
@@ -26,6 +26,7 @@ extern void	help_init(void);
 extern void	prealloc_init(void);
 extern void	quit_init(void);
 extern void	trim_init(void);
+extern void	defrag_init(void);
 #ifdef HAVE_GETFSMAP
 extern void	freesp_init(void);
 #else
-- 
2.39.3 (Apple Git-146)



* [PATCH 2/9] spaceman/defrag: pick up segments from target file
  2024-07-09 19:10 [PATCH 0/9] introduce defrag to xfs_spaceman Wengang Wang
  2024-07-09 19:10 ` [PATCH 1/9] xfsprogs: introduce defrag command to spaceman Wengang Wang
@ 2024-07-09 19:10 ` Wengang Wang
  2024-07-09 21:50   ` [PATCH 2/9] spaceman/defrag: pick up segments from target file Darrick J. Wong
  2024-07-15 23:40   ` [PATCH 2/9] spaceman/defrag: pick up segments from target file Dave Chinner
  2024-07-09 19:10 ` [PATCH 3/9] spaceman/defrag: defrag segments Wengang Wang
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 60+ messages in thread
From: Wengang Wang @ 2024-07-09 19:10 UTC (permalink / raw)
  To: linux-xfs; +Cc: wen.gang.wang

Segments are the smallest unit of defragmentation.

A segment
1. can't exceed the size limit
2. contains some extents
3. the contained extents can't be "unwritten"
4. the contained extents must be contiguous in file blocks
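
Rules 1 and 4 amount to a simple predicate when deciding whether the next extent can join the current segment; a sketch with hypothetical names (offsets and lengths in 512-byte units, as in struct getbmapx):

```c
#include <stdbool.h>

/* Hypothetical helper: an extent may extend the segment only if the
 * size limit is respected (rule 1) and, for a non-empty segment, the
 * extent starts exactly where the segment ends, i.e. no hole (rule 4). */
static bool
extent_extends_segment(long long seg_off, long long seg_len,
		       long long ext_off, long long ext_len,
		       long long size_limit)
{
	if (seg_len + ext_len > size_limit)
		return false;
	if (seg_len > 0 && seg_off + seg_len != ext_off)
		return false;
	return true;
}
```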

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
---
 spaceman/defrag.c | 204 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 204 insertions(+)

diff --git a/spaceman/defrag.c b/spaceman/defrag.c
index c9732984..175cf461 100644
--- a/spaceman/defrag.c
+++ b/spaceman/defrag.c
@@ -14,6 +14,32 @@
 #include "space.h"
 #include "input.h"
 
+#define MAPSIZE 512
+/* used to fetch bmap */
+struct getbmapx	g_mapx[MAPSIZE];
+/* current offset of the file in units of 512 bytes, used to fetch bmap */
+static long long 	g_offset = 0;
+/* index to identify the next extent, used to get the next extent */
+static int		g_ext_next_idx = -1;
+
+/*
+ * segment, the smallest unit to defrag
+ * it includes some contiguous extents.
+ * no holes included,
+ * no unwritten extents included
+ * the size is limited by g_segment_size_lmt
+ */
+struct defrag_segment {
+	/* segment offset in units of 512 bytes */
+	long long	ds_offset;
+	/* length of segment in units of 512 bytes */
+	long long	ds_length;
+	/* number of extents in this segment */
+	int		ds_nr;
+	/* flag indicating if segment contains shared blocks */
+	bool		ds_shared;
+};
+
 /* defrag segment size limit in units of 512 bytes */
 #define MIN_SEGMENT_SIZE_LIMIT 8192 /* 4MiB */
 #define DEFAULT_SEGMENT_SIZE_LIMIT 32768 /* 16MiB */
@@ -78,6 +104,165 @@ defrag_check_file(char *path)
 	return true;
 }
 
+/*
+ * get next extent in the file.
+ * Note: next call will get the same extent unless move_next_extent() is called.
+ * returns:
+ * -1:	error happened.
+ * 0:	extent returned
+ * 1:	no more extent left
+ */
+static int
+defrag_get_next_extent(int fd, struct getbmapx *map_out)
+{
+	int err = 0, i;
+
+	/* when no extents are cached in g_mapx, fetch from kernel */
+	if (g_ext_next_idx == -1) {
+		g_mapx[0].bmv_offset = g_offset;
+		g_mapx[0].bmv_length = -1LL;
+		g_mapx[0].bmv_count = MAPSIZE;
+		g_mapx[0].bmv_iflags = BMV_IF_NO_HOLES | BMV_IF_PREALLOC;
+		err = ioctl(fd, XFS_IOC_GETBMAPX, g_mapx);
+		if (err == -1) {
+			perror("XFS_IOC_GETBMAPX failed");
+			goto out;
+		}
+		/* for stats */
+		g_ext_stats.nr_ext_total += g_mapx[0].bmv_entries;
+
+		/* no more extents */
+		if (g_mapx[0].bmv_entries == 0) {
+			err = 1;
+			goto out;
+		}
+
+		/* for stats */
+		for (i = 1; i <= g_mapx[0].bmv_entries; i++) {
+			if (g_mapx[i].bmv_oflags & BMV_OF_PREALLOC)
+				g_ext_stats.nr_ext_unwritten++;
+			if (g_mapx[i].bmv_oflags & BMV_OF_SHARED)
+				g_ext_stats.nr_ext_shared++;
+		}
+
+		g_ext_next_idx = 1;
+		g_offset = g_mapx[g_mapx[0].bmv_entries].bmv_offset +
+				g_mapx[g_mapx[0].bmv_entries].bmv_length;
+	}
+
+	map_out->bmv_offset = g_mapx[g_ext_next_idx].bmv_offset;
+	map_out->bmv_length = g_mapx[g_ext_next_idx].bmv_length;
+	map_out->bmv_oflags = g_mapx[g_ext_next_idx].bmv_oflags;
+out:
+	return err;
+}
+
+/*
+ * move to next extent
+ */
+static void
+defrag_move_next_extent()
+{
+	if (g_ext_next_idx == g_mapx[0].bmv_entries)
+		g_ext_next_idx = -1;
+	else
+		g_ext_next_idx += 1;
+}
+
+/*
+ * check if the given extent is a defrag target.
+ * no need to check for holes as we are using BMV_IF_NO_HOLES
+ */
+static bool
+defrag_is_target(struct getbmapx *mapx)
+{
+	/* unwritten */
+	if (mapx->bmv_oflags & BMV_OF_PREALLOC)
+		return false;
+	return mapx->bmv_length < g_segment_size_lmt;
+}
+
+static bool
+defrag_is_extent_shared(struct getbmapx *mapx)
+{
+	return !!(mapx->bmv_oflags & BMV_OF_SHARED);
+}
+
+/*
+ * get next segment to defragment.
+ * returns:
+ * -1	error happened.
+ * 0	segment returned.
+ * 1	no more segments to return
+ */
+static int
+defrag_get_next_segment(int fd, struct defrag_segment *out)
+{
+	struct getbmapx mapx;
+	int	ret;
+
+	out->ds_offset = 0;
+	out->ds_length = 0;
+	out->ds_nr = 0;
+	out->ds_shared = false;
+
+	do {
+		ret = defrag_get_next_extent(fd, &mapx);
+		if (ret != 0) {
+			/*
+			 * no more extents, return the current segment if it's
+			 * not empty
+			 */
+			if (ret == 1 && out->ds_nr > 0)
+				ret = 0;
+			/* otherwise, an error happened, stop */
+			break;
+		}
+
+		/*
+		 * If the extent is not a defrag target, skip it.
+		 * go to next extent if the segment is empty;
+		 * otherwise return the segment.
+		 */
+		if (!defrag_is_target(&mapx)) {
+			defrag_move_next_extent();
+			if (out->ds_nr == 0)
+				continue;
+			else
+				break;
+		}
+
+		/* check for segment size limitation */
+		if (out->ds_length + mapx.bmv_length > g_segment_size_lmt)
+			break;
+
+		/* the segment is empty now, add this extent to it for sure */
+		if (out->ds_nr == 0) {
+			out->ds_offset = mapx.bmv_offset;
+			goto add_ext;
+		}
+
+		/*
+		 * the segment is not empty, check for a hole since the last
+		 * extent. if a hole exists before this extent, this extent
+		 * can't be added to the segment. return the segment.
+		 */
+		if (out->ds_offset + out->ds_length != mapx.bmv_offset)
+			break;
+
+add_ext:
+		if (defrag_is_extent_shared(&mapx))
+			out->ds_shared = true;
+
+		out->ds_length += mapx.bmv_length;
+		out->ds_nr += 1;
+		defrag_move_next_extent();
+
+	} while (true);
+
+	return ret;
+}
+
 /*
  * defragment a file
  * return 0 if successfully done, 1 otherwise
@@ -92,6 +277,9 @@ defrag_xfs_defrag(char *file_path) {
 	struct fsxattr	fsx;
 	int	ret = 0;
 
+	g_offset = 0;
+	g_ext_next_idx = -1;
+
 	fsx.fsx_nextents = 0;
 	memset(&g_ext_stats, 0, sizeof(g_ext_stats));
 
@@ -119,6 +307,22 @@ defrag_xfs_defrag(char *file_path) {
 		ret = 1;
 		goto out;
 	}
+
+	do {
+		struct defrag_segment segment;
+
+		ret = defrag_get_next_segment(defrag_fd, &segment);
+		/* no more segments, we are done */
+		if (ret == 1) {
+			ret = 0;
+			break;
+		}
+		/* error happened when reading bmap, stop here */
+		if (ret == -1) {
+			ret = 1;
+			break;
+		}
+	} while (true);
 out:
 	if (scratch_fd != -1) {
 		close(scratch_fd);
-- 
2.39.3 (Apple Git-146)



* [PATCH 3/9] spaceman/defrag: defrag segments
  2024-07-09 19:10 [PATCH 0/9] introduce defrag to xfs_spaceman Wengang Wang
  2024-07-09 19:10 ` [PATCH 1/9] xfsprogs: introduce defrag command to spaceman Wengang Wang
  2024-07-09 19:10 ` [PATCH 2/9] spaceman/defrag: pick up segments from target file Wengang Wang
@ 2024-07-09 19:10 ` Wengang Wang
  2024-07-09 21:57   ` Darrick J. Wong
  2024-07-16  0:08   ` Dave Chinner
  2024-07-09 19:10 ` [PATCH 4/9] spaceman/defrag: ctrl-c handler Wengang Wang
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 60+ messages in thread
From: Wengang Wang @ 2024-07-09 19:10 UTC (permalink / raw)
  To: linux-xfs; +Cc: wen.gang.wang

For each segment, the following steps are done to defrag it:

1. share the segment with a temporary file
2. unshare the segment in the target file. The kernel simulates CoW on the
   whole segment to complete the unshare (defrag).
3. release the blocks from the temporary file.
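
Step 3 can be exercised on any filesystem that supports hole punching; a minimal sketch (hypothetical helper, offsets in bytes):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>	/* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper for step 3: return the scratch file's blocks to
 * free space without changing the file size. */
static int
punch_segment(int fd, off_t offset, off_t length)
{
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, length);
}
```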

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
---
 spaceman/defrag.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 114 insertions(+)

diff --git a/spaceman/defrag.c b/spaceman/defrag.c
index 175cf461..9f11e36b 100644
--- a/spaceman/defrag.c
+++ b/spaceman/defrag.c
@@ -263,6 +263,40 @@ add_ext:
 	return ret;
 }
 
+/*
+ * check if the segment exceeds EoF.
+ * fix up the clone range and return true if EoF happens,
+ * return false otherwise.
+ */
+static bool
+defrag_clone_eof(struct file_clone_range *clone)
+{
+	off_t delta;
+
+	delta = clone->src_offset + clone->src_length - g_defrag_file_size;
+	if (delta > 0) {
+		clone->src_length = 0; /* to the end */
+		return true;
+	}
+	return false;
+}
+
+/*
+ * get the time delta since pre_time in microseconds.
+ * pre_time should contain values fetched by gettimeofday()
+ * cur_time is used to store the current time from gettimeofday()
+ */
+static long long
+get_time_delta_us(struct timeval *pre_time, struct timeval *cur_time)
+{
+	long long us;
+
+	gettimeofday(cur_time, NULL);
+	us = (cur_time->tv_sec - pre_time->tv_sec) * 1000000;
+	us += (cur_time->tv_usec - pre_time->tv_usec);
+	return us;
+}
+
 /*
  * defragment a file
  * return 0 if successfully done, 1 otherwise
@@ -273,6 +307,7 @@ defrag_xfs_defrag(char *file_path) {
 	long	nr_seg_defrag = 0, nr_ext_defrag = 0;
 	int	scratch_fd = -1, defrag_fd = -1;
 	char	tmp_file_path[PATH_MAX+1];
+	struct file_clone_range clone;
 	char	*defrag_dir;
 	struct fsxattr	fsx;
 	int	ret = 0;
@@ -296,6 +331,8 @@ defrag_xfs_defrag(char *file_path) {
 		goto out;
 	}
 
+	clone.src_fd = defrag_fd;
+
 	defrag_dir = dirname(file_path);
 	snprintf(tmp_file_path, PATH_MAX, "%s/.xfsdefrag_%d", defrag_dir,
 		getpid());
@@ -309,7 +346,11 @@ defrag_xfs_defrag(char *file_path) {
 	}
 
 	do {
+		struct timeval t_clone, t_unshare, t_punch_hole;
 		struct defrag_segment segment;
+		long long seg_size, seg_off;
+		int time_delta;
+		bool stop;
 
 		ret = defrag_get_next_segment(defrag_fd, &segment);
 		/* no more segments, we are done */
@@ -322,6 +363,79 @@ defrag_xfs_defrag(char *file_path) {
 			ret = 1;
 			break;
 		}
+
+		/* we are done if the segment contains only 1 extent */
+		if (segment.ds_nr < 2)
+			continue;
+
+		/* to bytes */
+		seg_off = segment.ds_offset * 512;
+		seg_size = segment.ds_length * 512;
+
+		clone.src_offset = seg_off;
+		clone.src_length = seg_size;
+		clone.dest_offset = seg_off;
+
+		/* checks for EoF and fix up clone */
+		stop = defrag_clone_eof(&clone);
+		gettimeofday(&t_clone, NULL);
+		ret = ioctl(scratch_fd, FICLONERANGE, &clone);
+		if (ret != 0) {
+			fprintf(stderr, "FICLONERANGE failed %s\n",
+				strerror(errno));
+			break;
+		}
+
+		/* for time stats */
+		time_delta = get_time_delta_us(&t_clone, &t_unshare);
+		if (time_delta > max_clone_us)
+			max_clone_us = time_delta;
+
+		/* for defrag stats */
+		nr_ext_defrag += segment.ds_nr;
+
+		/*
+		 * Force the shared range to be unshared via a copy-on-write
+		 * operation in the file being defragged. This causes the
+		 * file being defragged to have new extents allocated
+		 * and the data to be copied over and written out.
+		 */
+		ret = fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, seg_off,
+				seg_size);
+		if (ret != 0) {
+			fprintf(stderr, "UNSHARE_RANGE failed %s\n",
+				strerror(errno));
+			break;
+		}
+
+		/* for time stats */
+		time_delta = get_time_delta_us(&t_unshare, &t_punch_hole);
+		if (time_delta > max_unshare_us)
+			max_unshare_us = time_delta;
+
+		/*
+		 * Punch out the original extents we shared to the
+		 * scratch file so they are returned to free space.
+		 */
+		ret = fallocate(scratch_fd,
+			FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, seg_off,
+			seg_size);
+		if (ret != 0) {
+			fprintf(stderr, "PUNCH_HOLE failed %s\n",
+				strerror(errno));
+			break;
+		}
+
+		/* for defrag stats */
+		nr_seg_defrag += 1;
+
+		/* for time stats */
+		time_delta = get_time_delta_us(&t_punch_hole, &t_clone);
+		if (time_delta > max_punch_us)
+			max_punch_us = time_delta;
+
+		if (stop)
+			break;
 	} while (true);
 out:
 	if (scratch_fd != -1) {
-- 
2.39.3 (Apple Git-146)



* [PATCH 4/9] spaceman/defrag: ctrl-c handler
  2024-07-09 19:10 [PATCH 0/9] introduce defrag to xfs_spaceman Wengang Wang
                   ` (2 preceding siblings ...)
  2024-07-09 19:10 ` [PATCH 3/9] spaceman/defrag: defrag segments Wengang Wang
@ 2024-07-09 19:10 ` Wengang Wang
  2024-07-09 21:08   ` Darrick J. Wong
  2024-07-09 19:10 ` [PATCH 5/9] spaceman/defrag: exclude shared segments on low free space Wengang Wang
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 60+ messages in thread
From: Wengang Wang @ 2024-07-09 19:10 UTC (permalink / raw)
  To: linux-xfs; +Cc: wen.gang.wang

Add this handler so that an interrupted defrag still exits cleanly, i.e. it
1. reports the stats
2. removes the temporary file

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
---
 spaceman/defrag.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/spaceman/defrag.c b/spaceman/defrag.c
index 9f11e36b..61e47a43 100644
--- a/spaceman/defrag.c
+++ b/spaceman/defrag.c
@@ -297,6 +297,13 @@ get_time_delta_us(struct timeval *pre_time, struct timeval *cur_time)
 	return us;
 }
 
+static volatile bool usedKilled = false;
+void defrag_sigint_handler(int dummy)
+{
+	usedKilled = true;
+	printf("Please wait until current segment is defragmented\n");
+};
+
 /*
  * defragment a file
  * return 0 if successfully done, 1 otherwise
@@ -345,6 +352,8 @@ defrag_xfs_defrag(char *file_path) {
 		goto out;
 	}
 
+	signal(SIGINT, defrag_sigint_handler);
+
 	do {
 		struct timeval t_clone, t_unshare, t_punch_hole;
 		struct defrag_segment segment;
@@ -434,7 +443,7 @@ defrag_xfs_defrag(char *file_path) {
 		if (time_delta > max_punch_us)
 			max_punch_us = time_delta;
 
-		if (stop)
+		if (stop || usedKilled)
 			break;
 	} while (true);
 out:
-- 
2.39.3 (Apple Git-146)



* [PATCH 5/9] spaceman/defrag: exclude shared segments on low free space
  2024-07-09 19:10 [PATCH 0/9] introduce defrag to xfs_spaceman Wengang Wang
                   ` (3 preceding siblings ...)
  2024-07-09 19:10 ` [PATCH 4/9] spaceman/defrag: ctrl-c handler Wengang Wang
@ 2024-07-09 19:10 ` Wengang Wang
  2024-07-09 21:05   ` Darrick J. Wong
  2024-07-09 19:10 ` [PATCH 6/9] spaceman/defrag: workaround kernel xfs_reflink_try_clear_inode_flag() Wengang Wang
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 60+ messages in thread
From: Wengang Wang @ 2024-07-09 19:10 UTC (permalink / raw)
  To: linux-xfs; +Cc: wen.gang.wang

On some XFS filesystems, free blocks are over-committed by reflink copies,
and there are not enough free blocks if CoW happens to all the shared blocks.

This defrag tool excludes shared segments when free space is under a threshold.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
---
 spaceman/defrag.c | 46 +++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 43 insertions(+), 3 deletions(-)

diff --git a/spaceman/defrag.c b/spaceman/defrag.c
index 61e47a43..f8e6713c 100644
--- a/spaceman/defrag.c
+++ b/spaceman/defrag.c
@@ -304,6 +304,29 @@ void defrag_sigint_handler(int dummy)
 	printf("Please wait until current segment is defragmented\n");
 };
 
+/*
+ * limitation of filesystem free space in bytes.
+ * when filesystem has less free space than this number, segments which contain
+ * shared extents are skipped. 1GiB by default
+ */
+static long	g_limit_free_bytes = 1024 * 1024 * 1024;
+
+/*
+ * check if the free space in the FS is less than the _limit_
+ * return true if so, false otherwise
+ */
+static bool
+defrag_fs_limit_hit(int fd)
+{
+	struct statfs statfs_s;
+
+	if (g_limit_free_bytes <= 0)
+		return false;
+
+	fstatfs(fd, &statfs_s);
+	return statfs_s.f_bsize * statfs_s.f_bavail < g_limit_free_bytes;
+}
+
 /*
  * defragment a file
  * return 0 if successfully done, 1 otherwise
@@ -377,6 +400,15 @@ defrag_xfs_defrag(char *file_path) {
 		if (segment.ds_nr < 2)
 			continue;
 
+		/*
+		 * When the segment is (partially) shared, defrag would
+		 * consume free blocks. We check the limit of FS free blocks
+		 * and skip defragmenting this segment in case the limit is
+		 * reached.
+		 */
+		if (segment.ds_shared && defrag_fs_limit_hit(defrag_fd))
+			continue;
+
 		/* to bytes */
 		seg_off = segment.ds_offset * 512;
 		seg_size = segment.ds_length * 512;
@@ -478,7 +510,11 @@ static void defrag_help(void)
 "can be served during the defragmentation.\n"
 "\n"
 " -s segment_size    -- specify the segment size in MiB, minimum value is 4 \n"
-"                       default is 16\n"));
+"                       default is 16\n"
+" -f free_space      -- specify shrethod of the XFS free space in MiB, when\n"
+"                       XFS free space is lower than that, shared segments \n"
+"                       are excluded from defragmentation, 1024 by default\n"
+	));
 }
 
 static cmdinfo_t defrag_cmd;
@@ -489,7 +525,7 @@ defrag_f(int argc, char **argv)
 	int	i;
 	int	c;
 
-	while ((c = getopt(argc, argv, "s:")) != EOF) {
+	while ((c = getopt(argc, argv, "s:f:")) != EOF) {
 		switch(c) {
 		case 's':
 			g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
@@ -499,6 +535,10 @@ defrag_f(int argc, char **argv)
 					g_segment_size_lmt);
 			}
 			break;
+		case 'f':
+			g_limit_free_bytes = atol(optarg) * 1024 * 1024;
+			break;
+
 		default:
 			command_usage(&defrag_cmd);
 			return 1;
@@ -516,7 +556,7 @@ void defrag_init(void)
 	defrag_cmd.cfunc	= defrag_f;
 	defrag_cmd.argmin	= 0;
 	defrag_cmd.argmax	= 4;
-	defrag_cmd.args		= "[-s segment_size]";
+	defrag_cmd.args		= "[-s segment_size] [-f free_space]";
 	defrag_cmd.flags	= CMD_FLAG_ONESHOT;
 	defrag_cmd.oneline	= _("Defragment XFS files");
 	defrag_cmd.help		= defrag_help;
-- 
2.39.3 (Apple Git-146)


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 6/9] spaceman/defrag: workaround kernel xfs_reflink_try_clear_inode_flag()
  2024-07-09 19:10 [PATCH 0/9] introduce defrag to xfs_spaceman Wengang Wang
                   ` (4 preceding siblings ...)
  2024-07-09 19:10 ` [PATCH 5/9] spaceman/defrag: exclude shared segments on low free space Wengang Wang
@ 2024-07-09 19:10 ` Wengang Wang
  2024-07-09 20:51   ` Darrick J. Wong
                     ` (2 more replies)
  2024-07-09 19:10 ` [PATCH 7/9] spaceman/defrag: sleeps between segments Wengang Wang
                   ` (3 subsequent siblings)
  9 siblings, 3 replies; 60+ messages in thread
From: Wengang Wang @ 2024-07-09 19:10 UTC (permalink / raw)
  To: linux-xfs; +Cc: wen.gang.wang

xfs_reflink_try_clear_inode_flag() takes a very long time when the file has a
huge number of extents and none of them are shared.

Workaround:
share the first real extent with the scratch file so that
xfs_reflink_try_clear_inode_flag() returns quickly, saving CPU time and
speeding up defrag significantly.
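The core of the workaround can be sketched as below. `SHARE_MAX_SIZE` and `OFFSET_1PB` come from the patch; `clamp_share_len` and `share_prefix` are illustrative names, not the patch's own functions:

```c
#include <assert.h>
#include <linux/fs.h>		/* FICLONERANGE, struct file_clone_range */
#include <sys/ioctl.h>

#define SHARE_MAX_SIZE	32768		  /* share at most 32 KiB */
#define OFFSET_1PB	0x4000000000000LL /* park the clone far past EOF */

/* Clamp the shared length so only a tiny prefix is ever cloned. */
static long long
clamp_share_len(long long ext_len)
{
	return ext_len > SHARE_MAX_SIZE ? SHARE_MAX_SIZE : ext_len;
}

/*
 * Clone a small prefix of the target file into the scratch file at a
 * far-away offset, so the target inode always has at least one shared
 * extent and xfs_reflink_try_clear_inode_flag() bails out quickly.
 */
static int
share_prefix(int src_fd, int scratch_fd, long long off, long long len)
{
	struct file_clone_range clone = {
		.src_fd = src_fd,
		.src_offset = off,
		.src_length = clamp_share_len(len),
		.dest_offset = OFFSET_1PB + off,
	};

	return ioctl(scratch_fd, FICLONERANGE, &clone);
}
```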

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
---
 spaceman/defrag.c | 174 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 172 insertions(+), 2 deletions(-)

diff --git a/spaceman/defrag.c b/spaceman/defrag.c
index f8e6713c..b5c5b187 100644
--- a/spaceman/defrag.c
+++ b/spaceman/defrag.c
@@ -327,6 +327,155 @@ defrag_fs_limit_hit(int fd)
 	return statfs_s.f_bsize * statfs_s.f_bavail < g_limit_free_bytes;
 }
 
+static bool g_enable_first_ext_share = true;
+
+static int
+defrag_get_first_real_ext(int fd, struct getbmapx *mapx)
+{
+	int			err;
+
+	while (1) {
+		err = defrag_get_next_extent(fd, mapx);
+		if (err)
+			break;
+
+		defrag_move_next_extent();
+		if (!(mapx->bmv_oflags & BMV_OF_PREALLOC))
+			break;
+	}
+	return err;
+}
+
+static __u64 g_share_offset = -1ULL;
+static __u64 g_share_len = 0ULL;
+#define SHARE_MAX_SIZE 32768  /* 32KiB */
+
+/* share the first real extent with the scratch file */
+static void
+defrag_share_first_extent(int defrag_fd, int scratch_fd)
+{
+#define OFFSET_1PB 0x4000000000000LL
+	struct file_clone_range clone;
+	struct getbmapx mapx;
+	int	err;
+
+	if (g_enable_first_ext_share == false)
+		return;
+
+	err = defrag_get_first_real_ext(defrag_fd, &mapx);
+	if (err)
+		return;
+
+	clone.src_fd = defrag_fd;
+	clone.src_offset = mapx.bmv_offset * 512;
+	clone.src_length = mapx.bmv_length * 512;
+	/* share at most SHARE_MAX_SIZE bytes */
+	if (clone.src_length > SHARE_MAX_SIZE)
+		clone.src_length = SHARE_MAX_SIZE;
+	clone.dest_offset = OFFSET_1PB + clone.src_offset;
+	/* if the first extent reaches EoF, there is no need to share */
+	if (clone.src_offset + clone.src_length >= g_defrag_file_size)
+		return;
+	err = ioctl(scratch_fd, FICLONERANGE, &clone);
+	if (err != 0) {
+		fprintf(stderr, "cloning first extent failed: %s\n",
+			strerror(errno));
+		return;
+	}
+
+	/* save the offset and length for re-sharing */
+	g_share_offset = clone.src_offset;
+	g_share_len = clone.src_length;
+}
+
+/* re-share the blocks we shared previously if they are no longer shared */
+static void
+defrag_reshare_blocks_in_front(int defrag_fd, int scratch_fd)
+{
+#define NR_GET_EXT 9
+	struct getbmapx mapx[NR_GET_EXT];
+	struct file_clone_range clone;
+	__u64	new_share_len;
+	int	idx, err;
+
+	if (g_enable_first_ext_share == false)
+		return;
+
+	if (g_share_len == 0ULL)
+		return;
+
+	/*
+	 * Check whether the previous sharing still exists; we are done
+	 * if it does, even partially.
+	 */
+	mapx[0].bmv_offset = g_share_offset;
+	mapx[0].bmv_length = g_share_len;
+	mapx[0].bmv_count = NR_GET_EXT;
+	mapx[0].bmv_iflags = BMV_IF_NO_HOLES | BMV_IF_PREALLOC;
+	err = ioctl(defrag_fd, XFS_IOC_GETBMAPX, mapx);
+	if (err) {
+		fprintf(stderr, "XFS_IOC_GETBMAPX failed %s\n",
+			strerror(errno));
+		/* won't try share again */
+		g_share_len = 0ULL;
+		return;
+	}
+
+	if (mapx[0].bmv_entries == 0) {
+		/* the shared blocks all became a hole; won't try to share again */
+		g_share_len = 0ULL;
+		return;
+	}
+
+	if (g_share_offset != 512 * mapx[1].bmv_offset) {
+		/* the first shared block became a hole; won't try to share again */
+		g_share_len = 0ULL;
+		return;
+	}
+
+	/* check at most the first NR_GET_EXT - 1 extents */
+	for (idx = 1; idx <= mapx[0].bmv_entries; idx++) {
+		if (mapx[idx].bmv_oflags & BMV_OF_SHARED) {
+			/* some blocks still shared, done */
+			return;
+		}
+	}
+
+	/*
+	 * The previously shared blocks are no longer shared; re-share them.
+	 * Deallocate the blocks in the scratch file first.
+	 */
+	err = fallocate(scratch_fd,
+		FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,
+		OFFSET_1PB + g_share_offset, g_share_len);
+	if (err != 0) {
+		fprintf(stderr, "punch hole failed %s\n",
+			strerror(errno));
+		g_share_len = 0;
+		return;
+	}
+
+	new_share_len = 512 * mapx[1].bmv_length;
+	if (new_share_len > SHARE_MAX_SIZE)
+		new_share_len = SHARE_MAX_SIZE;
+
+	clone.src_fd = defrag_fd;
+	/* keep starting offset unchanged */
+	clone.src_offset = g_share_offset;
+	clone.src_length = new_share_len;
+	clone.dest_offset = OFFSET_1PB + clone.src_offset;
+
+	err = ioctl(scratch_fd, FICLONERANGE, &clone);
+	if (err) {
+		fprintf(stderr, "FICLONERANGE failed %s\n",
+			strerror(errno));
+		g_share_len = 0;
+		return;
+	}
+
+	g_share_len = new_share_len;
+}
+
 /*
  * defragment a file
  * return 0 if successfully done, 1 otherwise
@@ -377,6 +526,12 @@ defrag_xfs_defrag(char *file_path) {
 
 	signal(SIGINT, defrag_sigint_handler);
 
+	/*
+	 * Share the first extent to work around the kernel spending time
+	 * in xfs_reflink_try_clear_inode_flag().
+	 */
+	defrag_share_first_extent(defrag_fd, scratch_fd);
+
 	do {
 		struct timeval t_clone, t_unshare, t_punch_hole;
 		struct defrag_segment segment;
@@ -454,6 +609,15 @@ defrag_xfs_defrag(char *file_path) {
 		if (time_delta > max_unshare_us)
 			max_unshare_us = time_delta;
 
+		/*
+		 * If unshare took more than 1 second, the time was most
+		 * likely spent checking whether the file still shares any
+		 * extents.  Re-share the blocks at the front of the file
+		 * to keep that from happening again.
+		 */
+		if (time_delta > 1000000)
+			defrag_reshare_blocks_in_front(defrag_fd, scratch_fd);
+
 		/*
 		 * Punch out the original extents we shared to the
 		 * scratch file so they are returned to free space.
@@ -514,6 +678,8 @@ static void defrag_help(void)
 " -f free_space      -- specify shrethod of the XFS free space in MiB, when\n"
 "                       XFS free space is lower than that, shared segments \n"
 "                       are excluded from defragmentation, 1024 by default\n"
+" -n                 -- disable the \"share first extent\" featue, it's\n"
+"                       enabled by default to speed up\n"
 	));
 }
 
@@ -525,7 +691,7 @@ defrag_f(int argc, char **argv)
 	int	i;
 	int	c;
 
-	while ((c = getopt(argc, argv, "s:f:")) != EOF) {
+	while ((c = getopt(argc, argv, "s:f:n")) != EOF) {
 		switch(c) {
 		case 's':
 			g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
@@ -539,6 +705,10 @@ defrag_f(int argc, char **argv)
 			g_limit_free_bytes = atol(optarg) * 1024 * 1024;
 			break;
 
+		case 'n':
+			g_enable_first_ext_share = false;
+			break;
+
 		default:
 			command_usage(&defrag_cmd);
 			return 1;
@@ -556,7 +726,7 @@ void defrag_init(void)
 	defrag_cmd.cfunc	= defrag_f;
 	defrag_cmd.argmin	= 0;
 	defrag_cmd.argmax	= 4;
-	defrag_cmd.args		= "[-s segment_size] [-f free_space]";
+	defrag_cmd.args		= "[-s segment_size] [-f free_space] [-n]";
 	defrag_cmd.flags	= CMD_FLAG_ONESHOT;
 	defrag_cmd.oneline	= _("Defragment XFS files");
 	defrag_cmd.help		= defrag_help;
-- 
2.39.3 (Apple Git-146)


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 7/9] spaceman/defrag: sleeps between segments
  2024-07-09 19:10 [PATCH 0/9] introduce defrag to xfs_spaceman Wengang Wang
                   ` (5 preceding siblings ...)
  2024-07-09 19:10 ` [PATCH 6/9] spaceman/defrag: workaround kernel xfs_reflink_try_clear_inode_flag() Wengang Wang
@ 2024-07-09 19:10 ` Wengang Wang
  2024-07-09 20:46   ` Darrick J. Wong
  2024-07-09 19:10 ` [PATCH 8/9] spaceman/defrag: readahead for better performance Wengang Wang
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 60+ messages in thread
From: Wengang Wang @ 2024-07-09 19:10 UTC (permalink / raw)
  To: linux-xfs; +Cc: wen.gang.wang

Let the user control the sleep time between segments (while the file is
unlocked) to balance defrag performance against file IO servicing time.
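The pacing can be sketched roughly as below (names are illustrative, not the patch's own): sleep the configured idle time between segments, but credit the time already spent punching holes in the scratch file, since the target file is not locked during that phase anyway.

```c
#include <assert.h>
#include <unistd.h>

/*
 * Compute how long to sleep before the next segment: idle_us minus
 * the time already spent punching holes (file unlocked), never
 * negative.
 */
static long
next_sleep_us(long idle_us, long punch_hole_us)
{
	long sleep_us = idle_us - punch_hole_us;

	return sleep_us > 0 ? sleep_us : 0;
}

static void
idle_between_segments(long idle_us, long punch_hole_us)
{
	long us = next_sleep_us(idle_us, punch_hole_us);

	if (us > 0)
		usleep(us);
}
```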

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
---
 spaceman/defrag.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/spaceman/defrag.c b/spaceman/defrag.c
index b5c5b187..415fe9c2 100644
--- a/spaceman/defrag.c
+++ b/spaceman/defrag.c
@@ -311,6 +311,9 @@ void defrag_sigint_handler(int dummy)
  */
 static long	g_limit_free_bytes = 1024 * 1024 * 1024;
 
+/* sleep time in us between segments, overridden by the -i parameter */
+static int		g_idle_time = 250 * 1000;
+
 /*
  * check if the free space in the FS is less than the _limit_
  * return true if so, false otherwise
@@ -487,6 +490,7 @@ defrag_xfs_defrag(char *file_path) {
 	int	scratch_fd = -1, defrag_fd = -1;
 	char	tmp_file_path[PATH_MAX+1];
 	struct file_clone_range clone;
+	int	sleep_time_us = 0;
 	char	*defrag_dir;
 	struct fsxattr	fsx;
 	int	ret = 0;
@@ -574,6 +578,9 @@ defrag_xfs_defrag(char *file_path) {
 
 		/* checks for EoF and fix up clone */
 		stop = defrag_clone_eof(&clone);
+		if (sleep_time_us > 0)
+			usleep(sleep_time_us);
+
 		gettimeofday(&t_clone, NULL);
 		ret = ioctl(scratch_fd, FICLONERANGE, &clone);
 		if (ret != 0) {
@@ -587,6 +594,10 @@ defrag_xfs_defrag(char *file_path) {
 		if (time_delta > max_clone_us)
 			max_clone_us = time_delta;
 
+		/* sleep if the clone took more than 500ms (slow FS) */
+		if (time_delta >= 500000 && g_idle_time > 0)
+			usleep(g_idle_time);
+
 		/* for defrag stats */
 		nr_ext_defrag += segment.ds_nr;
 
@@ -641,6 +652,12 @@ defrag_xfs_defrag(char *file_path) {
 
 		if (stop || usedKilled)
 			break;
+
+		/*
+		 * The target file is not locked while punching holes in the
+		 * scratch file, so subtract that time from the idle time.
+		 */
+		sleep_time_us = g_idle_time - time_delta;
 	} while (true);
 out:
 	if (scratch_fd != -1) {
@@ -678,6 +695,7 @@ static void defrag_help(void)
 " -f free_space      -- specify shrethod of the XFS free space in MiB, when\n"
 "                       XFS free space is lower than that, shared segments \n"
 "                       are excluded from defragmentation, 1024 by default\n"
+" -i idle_time       -- time in ms to be idle between segments, 250ms by default\n"
 " -n                 -- disable the \"share first extent\" featue, it's\n"
 "                       enabled by default to speed up\n"
 	));
@@ -691,7 +709,7 @@ defrag_f(int argc, char **argv)
 	int	i;
 	int	c;
 
-	while ((c = getopt(argc, argv, "s:f:n")) != EOF) {
+	while ((c = getopt(argc, argv, "s:f:ni:")) != EOF) {
 		switch(c) {
 		case 's':
 			g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
@@ -709,6 +727,10 @@ defrag_f(int argc, char **argv)
 			g_enable_first_ext_share = false;
 			break;
 
+		case 'i':
+			g_idle_time = atoi(optarg) * 1000;
+			break;
+
 		default:
 			command_usage(&defrag_cmd);
 			return 1;
@@ -726,7 +748,7 @@ void defrag_init(void)
 	defrag_cmd.cfunc	= defrag_f;
 	defrag_cmd.argmin	= 0;
 	defrag_cmd.argmax	= 4;
-	defrag_cmd.args		= "[-s segment_size] [-f free_space] [-n]";
+	defrag_cmd.args		= "[-s segment_size] [-f free_space] [-i idle_time] [-n]";
 	defrag_cmd.flags	= CMD_FLAG_ONESHOT;
 	defrag_cmd.oneline	= _("Defragment XFS files");
 	defrag_cmd.help		= defrag_help;
-- 
2.39.3 (Apple Git-146)


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 8/9] spaceman/defrag: readahead for better performance
  2024-07-09 19:10 [PATCH 0/9] introduce defrag to xfs_spaceman Wengang Wang
                   ` (6 preceding siblings ...)
  2024-07-09 19:10 ` [PATCH 7/9] spaceman/defrag: sleeps between segments Wengang Wang
@ 2024-07-09 19:10 ` Wengang Wang
  2024-07-09 20:27   ` Darrick J. Wong
  2024-07-16  0:56   ` Dave Chinner
  2024-07-09 19:10 ` [PATCH 9/9] spaceman/defrag: warn on extsize Wengang Wang
  2024-07-15 23:03 ` [PATCH 0/9] introduce defrag to xfs_spaceman Dave Chinner
  9 siblings, 2 replies; 60+ messages in thread
From: Wengang Wang @ 2024-07-09 19:10 UTC (permalink / raw)
  To: linux-xfs; +Cc: wen.gang.wang

Readahead holds the file lock for a shorter time than unsharing the file
via ioctl does.  Do readahead while defrag sleeps, for better defrag
performance and thus more file IO time.
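An illustrative wrapper for the readahead step is sketched below (`segment_readahead` is a hypothetical name; the patch's own function is `defrag_readahead()` in the diff). readahead(2) is Linux-specific and only a hint, so a failure is reported but is not fatal to the defrag run:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Populate the page cache for the next segment while the file is
 * unlocked, so the following unshare spends less time reading under
 * the lock.
 */
static int
segment_readahead(int fd, off_t off, size_t count)
{
	if (readahead(fd, off, count) < 0) {
		fprintf(stderr, "readahead failed: %s\n", strerror(errno));
		return -1;
	}
	return 0;
}
```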

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
---
 spaceman/defrag.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/spaceman/defrag.c b/spaceman/defrag.c
index 415fe9c2..ab8508bb 100644
--- a/spaceman/defrag.c
+++ b/spaceman/defrag.c
@@ -331,6 +331,18 @@ defrag_fs_limit_hit(int fd)
 }
 
 static bool g_enable_first_ext_share = true;
+static bool g_readahead = false;
+
+static void defrag_readahead(int defrag_fd, off64_t offset, size_t count)
+{
+	if (!g_readahead || g_idle_time <= 0)
+		return;
+
+	if (readahead(defrag_fd, offset, count) < 0) {
+		fprintf(stderr, "readahead failed: %s, errno=%d\n",
+			strerror(errno), errno);
+	}
+}
 
 static int
 defrag_get_first_real_ext(int fd, struct getbmapx *mapx)
@@ -578,6 +590,8 @@ defrag_xfs_defrag(char *file_path) {
 
 		/* checks for EoF and fix up clone */
 		stop = defrag_clone_eof(&clone);
+		defrag_readahead(defrag_fd, seg_off, seg_size);
+
 		if (sleep_time_us > 0)
 			usleep(sleep_time_us);
 
@@ -698,6 +712,7 @@ static void defrag_help(void)
 " -i idle_time       -- time in ms to be idle between segments, 250ms by default\n"
 " -n                 -- disable the \"share first extent\" featue, it's\n"
 "                       enabled by default to speed up\n"
+" -a                 -- do readahead to speed up defrag, disabled by default\n"
 	));
 }
 
@@ -709,7 +724,7 @@ defrag_f(int argc, char **argv)
 	int	i;
 	int	c;
 
-	while ((c = getopt(argc, argv, "s:f:ni:")) != EOF) {
+	while ((c = getopt(argc, argv, "s:f:ni:a")) != EOF) {
 		switch(c) {
 		case 's':
 			g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
@@ -731,6 +746,10 @@ defrag_f(int argc, char **argv)
 			g_idle_time = atoi(optarg) * 1000;
 			break;
 
+		case 'a':
+			g_readahead = true;
+			break;
+
 		default:
 			command_usage(&defrag_cmd);
 			return 1;
-- 
2.39.3 (Apple Git-146)


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 9/9] spaceman/defrag: warn on extsize
  2024-07-09 19:10 [PATCH 0/9] introduce defrag to xfs_spaceman Wengang Wang
                   ` (7 preceding siblings ...)
  2024-07-09 19:10 ` [PATCH 8/9] spaceman/defrag: readahead for better performance Wengang Wang
@ 2024-07-09 19:10 ` Wengang Wang
  2024-07-09 20:21   ` Darrick J. Wong
  2024-07-15 23:03 ` [PATCH 0/9] introduce defrag to xfs_spaceman Dave Chinner
  9 siblings, 1 reply; 60+ messages in thread
From: Wengang Wang @ 2024-07-09 19:10 UTC (permalink / raw)
  To: linux-xfs; +Cc: wen.gang.wang

According to the current kernel implementation, a non-zero extsize hint
might affect the result of defragmentation.
Print a warning if a non-zero extsize is set on the file.
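The check amounts to the sketch below (names are illustrative; the patch's version is inline in `defrag_xfs_defrag()`). On XFS, `fsx_extsize` as returned by FS_IOC_FSGETXATTR is the extent size hint in bytes:

```c
#include <assert.h>
#include <linux/fs.h>		/* FS_IOC_FSGETXATTR, struct fsxattr */
#include <stdbool.h>
#include <stdio.h>
#include <sys/ioctl.h>

/*
 * A non-zero extent size hint may make the kernel round allocations to
 * the hint, which can undo part of the defragmentation.
 */
static bool
extsize_may_affect_defrag(unsigned int fsx_extsize)
{
	return fsx_extsize != 0;
}

/* Read the fsxattr of the target file and warn when a hint is set. */
static int
warn_on_extsize(int fd, const char *path)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;
	if (extsize_may_affect_defrag(fsx.fsx_extsize))
		fprintf(stderr, "%s: extsize hint of %u bytes may affect defrag\n",
			path, fsx.fsx_extsize);
	return 0;
}
```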

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
---
 spaceman/defrag.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/spaceman/defrag.c b/spaceman/defrag.c
index ab8508bb..b6b89dd9 100644
--- a/spaceman/defrag.c
+++ b/spaceman/defrag.c
@@ -526,6 +526,18 @@ defrag_xfs_defrag(char *file_path) {
 		goto out;
 	}
 
+	if (ioctl(defrag_fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
+		fprintf(stderr, "FSGETXATTR failed %s\n",
+			strerror(errno));
+		ret = 1;
+		goto out;
+	}
+
+	if (fsx.fsx_extsize != 0)
+		fprintf(stderr, "%s has extsize %u set. That might affect defrag "
+			"according to the kernel implementation\n",
+			file_path, fsx.fsx_extsize);
+
 	clone.src_fd = defrag_fd;
 
 	defrag_dir = dirname(file_path);
-- 
2.39.3 (Apple Git-146)


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 9/9] spaceman/defrag: warn on extsize
  2024-07-09 19:10 ` [PATCH 9/9] spaceman/defrag: warn on extsize Wengang Wang
@ 2024-07-09 20:21   ` Darrick J. Wong
  2024-07-11 23:36     ` Wengang Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2024-07-09 20:21 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs

On Tue, Jul 09, 2024 at 12:10:28PM -0700, Wengang Wang wrote:
> According to current kernel implemenation, non-zero extsize might affect
> the result of defragmentation.
> Just print a warning on that if non-zero extsize is set on file.

I'm not sure what's the point of warning vaguely about extent size
hints?  I'd have thought that would help reduce the number of extents;
is that not the case?

> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
>  spaceman/defrag.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> index ab8508bb..b6b89dd9 100644
> --- a/spaceman/defrag.c
> +++ b/spaceman/defrag.c
> @@ -526,6 +526,18 @@ defrag_xfs_defrag(char *file_path) {
>  		goto out;
>  	}
>  
> +       if (ioctl(defrag_fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
> +               fprintf(stderr, "FSGETXATTR failed %s\n",
> +                       strerror(errno));

Also we usually indent continuations by two tabs (not one) so that the
continuation is more obvious:

		fprintf(stderr, "FSGETXATTR failed %s\n",
				strerror(errno));

> +               ret = 1;
> +               goto out;
> +       }
> +
> +       if (fsx.fsx_extsize != 0)
> +               fprintf(stderr, "%s has extsize set %d. That might affect defrag "
> +                       "according to kernel implementation\n",

Format strings in userspace printf calls should be wrapped so that
gettext can provide translated versions:

	fprintf(stderr, _("%s has extsize...\n"), file_path...);

(I know, xfsprogs isn't as consistent as it probably ought to be...)

--D

> +                       file_path, fsx.fsx_extsize);
> +
>  	clone.src_fd = defrag_fd;
>  
>  	defrag_dir = dirname(file_path);
> -- 
> 2.39.3 (Apple Git-146)
> 
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 8/9] spaceman/defrag: readahead for better performance
  2024-07-09 19:10 ` [PATCH 8/9] spaceman/defrag: readahead for better performance Wengang Wang
@ 2024-07-09 20:27   ` Darrick J. Wong
  2024-07-11 23:29     ` Wengang Wang
  2024-07-16  0:56   ` Dave Chinner
  1 sibling, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2024-07-09 20:27 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs

On Tue, Jul 09, 2024 at 12:10:27PM -0700, Wengang Wang wrote:
> Reading ahead take less lock on file compared to "unshare" the file via ioctl.
> Do readahead when defrag sleeps for better defrag performace and thus more
> file IO time.
> 
> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
>  spaceman/defrag.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> index 415fe9c2..ab8508bb 100644
> --- a/spaceman/defrag.c
> +++ b/spaceman/defrag.c
> @@ -331,6 +331,18 @@ defrag_fs_limit_hit(int fd)
>  }
>  
>  static bool g_enable_first_ext_share = true;
> +static bool g_readahead = false;
> +
> +static void defrag_readahead(int defrag_fd, off64_t offset, size_t count)
> +{
> +	if (!g_readahead || g_idle_time <= 0)
> +		return;
> +
> +	if (readahead(defrag_fd, offset, count) < 0) {
> +		fprintf(stderr, "readahead failed: %s, errno=%d\n",
> +			strerror(errno), errno);

Why is it worth reporting if readahead fails?  Won't the unshare also
fail?  I'm also wondering why we wouldn't want readahead all the time?

--D

> +	}
> +}
>  
>  static int
>  defrag_get_first_real_ext(int fd, struct getbmapx *mapx)
> @@ -578,6 +590,8 @@ defrag_xfs_defrag(char *file_path) {
>  
>  		/* checks for EoF and fix up clone */
>  		stop = defrag_clone_eof(&clone);
> +		defrag_readahead(defrag_fd, seg_off, seg_size);
> +
>  		if (sleep_time_us > 0)
>  			usleep(sleep_time_us);
>  
> @@ -698,6 +712,7 @@ static void defrag_help(void)
>  " -i idle_time       -- time in ms to be idle between segments, 250ms by default\n"
>  " -n                 -- disable the \"share first extent\" featue, it's\n"
>  "                       enabled by default to speed up\n"
> +" -a                 -- do readahead to speed up defrag, disabled by default\n"
>  	));
>  }
>  
> @@ -709,7 +724,7 @@ defrag_f(int argc, char **argv)
>  	int	i;
>  	int	c;
>  
> -	while ((c = getopt(argc, argv, "s:f:ni")) != EOF) {
> +	while ((c = getopt(argc, argv, "s:f:nia")) != EOF) {
>  		switch(c) {
>  		case 's':
>  			g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
> @@ -731,6 +746,10 @@ defrag_f(int argc, char **argv)
>  			g_idle_time = atoi(optarg) * 1000;
>  			break;
>  
> +		case 'a':
> +			g_readahead = true;
> +			break;
> +
>  		default:
>  			command_usage(&defrag_cmd);
>  			return 1;
> -- 
> 2.39.3 (Apple Git-146)
> 
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 7/9] spaceman/defrag: sleeps between segments
  2024-07-09 19:10 ` [PATCH 7/9] spaceman/defrag: sleeps between segments Wengang Wang
@ 2024-07-09 20:46   ` Darrick J. Wong
  2024-07-11 23:26     ` Wengang Wang
  2024-07-11 23:30     ` Wengang Wang
  0 siblings, 2 replies; 60+ messages in thread
From: Darrick J. Wong @ 2024-07-09 20:46 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs

On Tue, Jul 09, 2024 at 12:10:26PM -0700, Wengang Wang wrote:
> Let user contol the time to sleep between segments (file unlocked) to
> balance defrag performance and file IO servicing time.
> 
> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
>  spaceman/defrag.c | 26 ++++++++++++++++++++++++--
>  1 file changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> index b5c5b187..415fe9c2 100644
> --- a/spaceman/defrag.c
> +++ b/spaceman/defrag.c
> @@ -311,6 +311,9 @@ void defrag_sigint_handler(int dummy)
>   */
>  static long	g_limit_free_bytes = 1024 * 1024 * 1024;
>  
> +/* sleep time in us between segments, overwritten by paramter */
> +static int		g_idle_time = 250 * 1000;
> +
>  /*
>   * check if the free space in the FS is less than the _limit_
>   * return true if so, false otherwise
> @@ -487,6 +490,7 @@ defrag_xfs_defrag(char *file_path) {
>  	int	scratch_fd = -1, defrag_fd = -1;
>  	char	tmp_file_path[PATH_MAX+1];
>  	struct file_clone_range clone;
> +	int	sleep_time_us = 0;
>  	char	*defrag_dir;
>  	struct fsxattr	fsx;
>  	int	ret = 0;
> @@ -574,6 +578,9 @@ defrag_xfs_defrag(char *file_path) {
>  
>  		/* checks for EoF and fix up clone */
>  		stop = defrag_clone_eof(&clone);
> +		if (sleep_time_us > 0)
> +			usleep(sleep_time_us);
> +
>  		gettimeofday(&t_clone, NULL);
>  		ret = ioctl(scratch_fd, FICLONERANGE, &clone);
>  		if (ret != 0) {
> @@ -587,6 +594,10 @@ defrag_xfs_defrag(char *file_path) {
>  		if (time_delta > max_clone_us)
>  			max_clone_us = time_delta;
>  
> +		/* sleeps if clone cost more than 500ms, slow FS */

Why half a second?  I sense that what you're getting at is that you want
to limit file io latency spikes in other programs by relaxing the defrag
program, right?  But the help screen doesn't say anything about "only if
the clone lasts more than 500ms".

> +		if (time_delta >= 500000 && g_idle_time > 0)
> +			usleep(g_idle_time);

These days, I wonder if it makes more sense to provide a CPU utilization
target and let the kernel figure out how much sleeping that is:

$ systemd-run -p 'CPUQuota=60%' xfs_spaceman -c 'defrag' /path/to/file

The tradeoff here is that we as application writers no longer have to
implement these clunky sleeps ourselves, but then one has to turn on cpu
accounting in systemd (if there even /is/ a systemd).  Also I suppose we
don't want this program getting throttled while it's holding a file
lock.

--D

> +
>  		/* for defrag stats */
>  		nr_ext_defrag += segment.ds_nr;
>  
> @@ -641,6 +652,12 @@ defrag_xfs_defrag(char *file_path) {
>  
>  		if (stop || usedKilled)
>  			break;
> +
> +		/*
> +		 * no lock on target file when punching hole from scratch file,
> +		 * so minus the time used for punching hole
> +		 */
> +		sleep_time_us = g_idle_time - time_delta;
>  	} while (true);
>  out:
>  	if (scratch_fd != -1) {
> @@ -678,6 +695,7 @@ static void defrag_help(void)
>  " -f free_space      -- specify shrethod of the XFS free space in MiB, when\n"
>  "                       XFS free space is lower than that, shared segments \n"
>  "                       are excluded from defragmentation, 1024 by default\n"
> +" -i idle_time       -- time in ms to be idle between segments, 250ms by default\n"
>  " -n                 -- disable the \"share first extent\" featue, it's\n"
>  "                       enabled by default to speed up\n"
>  	));
> @@ -691,7 +709,7 @@ defrag_f(int argc, char **argv)
>  	int	i;
>  	int	c;
>  
> -	while ((c = getopt(argc, argv, "s:f:n")) != EOF) {
> +	while ((c = getopt(argc, argv, "s:f:ni")) != EOF) {
>  		switch(c) {
>  		case 's':
>  			g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
> @@ -709,6 +727,10 @@ defrag_f(int argc, char **argv)
>  			g_enable_first_ext_share = false;
>  			break;
>  
> +		case 'i':
> +			g_idle_time = atoi(optarg) * 1000;

Should we complain if optarg is non-integer garbage?  Or if g_idle_time
is larger than 1s?

--D

> +			break;
> +
>  		default:
>  			command_usage(&defrag_cmd);
>  			return 1;
> @@ -726,7 +748,7 @@ void defrag_init(void)
>  	defrag_cmd.cfunc	= defrag_f;
>  	defrag_cmd.argmin	= 0;
>  	defrag_cmd.argmax	= 4;
> -	defrag_cmd.args		= "[-s segment_size] [-f free_space] [-n]";
> +	defrag_cmd.args		= "[-s segment_size] [-f free_space] [-i idle_time] [-n]";
>  	defrag_cmd.flags	= CMD_FLAG_ONESHOT;
>  	defrag_cmd.oneline	= _("Defragment XFS files");
>  	defrag_cmd.help		= defrag_help;
> -- 
> 2.39.3 (Apple Git-146)
> 
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 6/9] spaceman/defrag: workaround kernel xfs_reflink_try_clear_inode_flag()
  2024-07-09 19:10 ` [PATCH 6/9] spaceman/defrag: workaround kernel xfs_reflink_try_clear_inode_flag() Wengang Wang
@ 2024-07-09 20:51   ` Darrick J. Wong
  2024-07-11 23:11     ` Wengang Wang
  2024-07-16  0:25   ` Dave Chinner
  2024-07-31 22:25   ` Dave Chinner
  2 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2024-07-09 20:51 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs

On Tue, Jul 09, 2024 at 12:10:25PM -0700, Wengang Wang wrote:
> xfs_reflink_try_clear_inode_flag() takes very long in case file has huge number
> of extents and none of the extents are shared.
> 
> workaround:
> share the first real extent so that xfs_reflink_try_clear_inode_flag() returns
> quickly to save cpu times and speed up defrag significantly.

I wonder if a better solution would be to change xfs_reflink_unshare
only to try to clear the reflink iflag if offset/len cover the entire
file?  It's a pity we can't set time budgets on fallocate requests.

--D

> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
>  spaceman/defrag.c | 174 +++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 172 insertions(+), 2 deletions(-)
> 
> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> index f8e6713c..b5c5b187 100644
> --- a/spaceman/defrag.c
> +++ b/spaceman/defrag.c
> @@ -327,6 +327,155 @@ defrag_fs_limit_hit(int fd)
>  	return statfs_s.f_bsize * statfs_s.f_bavail < g_limit_free_bytes;
>  }
>  
> +static bool g_enable_first_ext_share = true;
> +
> +static int
> +defrag_get_first_real_ext(int fd, struct getbmapx *mapx)
> +{
> +	int			err;
> +
> +	while (1) {
> +		err = defrag_get_next_extent(fd, mapx);
> +		if (err)
> +			break;
> +
> +		defrag_move_next_extent();
> +		if (!(mapx->bmv_oflags & BMV_OF_PREALLOC))
> +			break;
> +	}
> +	return err;
> +}
> +
> +static __u64 g_share_offset = -1ULL;
> +static __u64 g_share_len = 0ULL;
> +#define SHARE_MAX_SIZE 32768  /* 32KiB */
> +
> +/* share the first real extent with scrach */
> +static void
> +defrag_share_first_extent(int defrag_fd, int scratch_fd)
> +{
> +#define OFFSET_1PB 0x4000000000000LL
> +	struct file_clone_range clone;
> +	struct getbmapx mapx;
> +	int	err;
> +
> +	if (g_enable_first_ext_share == false)
> +		return;
> +
> +	err = defrag_get_first_real_ext(defrag_fd, &mapx);
> +	if (err)
> +		return;
> +
> +	clone.src_fd = defrag_fd;
> +	clone.src_offset = mapx.bmv_offset * 512;
> +	clone.src_length = mapx.bmv_length * 512;
> +	/* shares at most SHARE_MAX_SIZE length */
> +	if (clone.src_length > SHARE_MAX_SIZE)
> +		clone.src_length = SHARE_MAX_SIZE;
> +	clone.dest_offset = OFFSET_1PB + clone.src_offset;
> +	/* if the first is extent is reaching the EoF, no need to share */
> +	if (clone.src_offset + clone.src_length >= g_defrag_file_size)
> +		return;
> +	err = ioctl(scratch_fd, FICLONERANGE, &clone);
> +	if (err != 0) {
> +		fprintf(stderr, "cloning first extent failed: %s\n",
> +			strerror(errno));
> +		return;
> +	}
> +
> +	/* safe the offset and length for re-share */
> +	g_share_offset = clone.src_offset;
> +	g_share_len = clone.src_length;
> +}
> +
> +/* re-share the blocks we shared previous if then are no longer shared */
> +static void
> +defrag_reshare_blocks_in_front(int defrag_fd, int scratch_fd)
> +{
> +#define NR_GET_EXT 9
> +	struct getbmapx mapx[NR_GET_EXT];
> +	struct file_clone_range clone;
> +	__u64	new_share_len;
> +	int	idx, err;
> +
> +	if (g_enable_first_ext_share == false)
> +		return;
> +
> +	if (g_share_len == 0ULL)
> +		return;
> +
> +	/*
> +	 * check if previous shareing still exist
> +	 * we are done if (partially) so.
> +	 */
> +	mapx[0].bmv_offset = g_share_offset;
> +	mapx[0].bmv_length = g_share_len;
> +	mapx[0].bmv_count = NR_GET_EXT;
> +	mapx[0].bmv_iflags = BMV_IF_NO_HOLES | BMV_IF_PREALLOC;
> +	err = ioctl(defrag_fd, XFS_IOC_GETBMAPX, mapx);
> +	if (err) {
> +		fprintf(stderr, "XFS_IOC_GETBMAPX failed %s\n",
> +			strerror(errno));
> +		/* won't try share again */
> +		g_share_len = 0ULL;
> +		return;
> +	}
> +
> +	if (mapx[0].bmv_entries == 0) {
> +		/* shared blocks all became hole, won't try share again */
> +		g_share_len = 0ULL;
> +		return;
> +	}
> +
> +	if (g_share_offset != 512 * mapx[1].bmv_offset) {
> +		/* first shared block became hole, won't try share again */
> +		g_share_len = 0ULL;
> +		return;
> +	}
> +
> +	/* we check up to only the first NR_GET_EXT - 1 extents */
> +	for (idx = 1; idx <= mapx[0].bmv_entries; idx++) {
> +		if (mapx[idx].bmv_oflags & BMV_OF_SHARED) {
> +			/* some blocks still shared, done */
> +			return;
> +		}
> +	}
> +
> +	/*
> +	 * The previously shared blocks are no longer shared, re-share.
> +	 * deallocate the blocks in the scratch file first
> +	 */
> +	err = fallocate(scratch_fd,
> +		FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,
> +		OFFSET_1PB + g_share_offset, g_share_len);
> +	if (err != 0) {
> +		fprintf(stderr, "punch hole failed %s\n",
> +			strerror(errno));
> +		g_share_len = 0;
> +		return;
> +	}
> +
> +	new_share_len = 512 * mapx[1].bmv_length;
> +	if (new_share_len > SHARE_MAX_SIZE)
> +		new_share_len = SHARE_MAX_SIZE;
> +
> +	clone.src_fd = defrag_fd;
> +	/* keep starting offset unchanged */
> +	clone.src_offset = g_share_offset;
> +	clone.src_length = new_share_len;
> +	clone.dest_offset = OFFSET_1PB + clone.src_offset;
> +
> +	err = ioctl(scratch_fd, FICLONERANGE, &clone);
> +	if (err) {
> +		fprintf(stderr, "FICLONERANGE failed %s\n",
> +			strerror(errno));
> +		g_share_len = 0;
> +		return;
> +	}
> +
> +	g_share_len = new_share_len;
> +}
> +
>  /*
>   * defragment a file
>   * return 0 if successfully done, 1 otherwise
> @@ -377,6 +526,12 @@ defrag_xfs_defrag(char *file_path) {
>  
>  	signal(SIGINT, defrag_sigint_handler);
>  
> +	/*
> +	 * share the first extent to work around kernel consuming time
> +	 * in xfs_reflink_try_clear_inode_flag()
> +	 */
> +	defrag_share_first_extent(defrag_fd, scratch_fd);
> +
>  	do {
>  		struct timeval t_clone, t_unshare, t_punch_hole;
>  		struct defrag_segment segment;
> @@ -454,6 +609,15 @@ defrag_xfs_defrag(char *file_path) {
>  		if (time_delta > max_unshare_us)
>  			max_unshare_us = time_delta;
>  
> +		/*
> +		 * if unshare took more than 1 second, the time was most
> +		 * likely spent checking whether the file still shares
> +		 * any extents. to avoid that happening again, re-share
> +		 * the blocks in front as a workaround.
> +		 */
> +		if (time_delta > 1000000)
> +			defrag_reshare_blocks_in_front(defrag_fd, scratch_fd);
> +
>  		/*
>  		 * Punch out the original extents we shared to the
>  		 * scratch file so they are returned to free space.
> @@ -514,6 +678,8 @@ static void defrag_help(void)
>  " -f free_space      -- specify threshold of the XFS free space in MiB, when\n"
>  "                       XFS free space is lower than that, shared segments \n"
>  "                       are excluded from defragmentation, 1024 by default\n"
> +" -n                 -- disable the \"share first extent\" feature; it's\n"
> +"                       enabled by default to speed up defragmentation\n"
>  	));
>  }
>  
> @@ -525,7 +691,7 @@ defrag_f(int argc, char **argv)
>  	int	i;
>  	int	c;
>  
> -	while ((c = getopt(argc, argv, "s:f:")) != EOF) {
> +	while ((c = getopt(argc, argv, "s:f:n")) != EOF) {
>  		switch(c) {
>  		case 's':
>  			g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
> @@ -539,6 +705,10 @@ defrag_f(int argc, char **argv)
>  			g_limit_free_bytes = atol(optarg) * 1024 * 1024;
>  			break;
>  
> +		case 'n':
> +			g_enable_first_ext_share = false;
> +			break;
> +
>  		default:
>  			command_usage(&defrag_cmd);
>  			return 1;
> @@ -556,7 +726,7 @@ void defrag_init(void)
>  	defrag_cmd.cfunc	= defrag_f;
>  	defrag_cmd.argmin	= 0;
>  	defrag_cmd.argmax	= 4;
> -	defrag_cmd.args		= "[-s segment_size] [-f free_space]";
> +	defrag_cmd.args		= "[-s segment_size] [-f free_space] [-n]";
>  	defrag_cmd.flags	= CMD_FLAG_ONESHOT;
>  	defrag_cmd.oneline	= _("Defragment XFS files");
>  	defrag_cmd.help		= defrag_help;
> -- 
> 2.39.3 (Apple Git-146)
> 
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 5/9] spaceman/defrag: exclude shared segments on low free space
  2024-07-09 19:10 ` [PATCH 5/9] spaceman/defrag: exclude shared segments on low free space Wengang Wang
@ 2024-07-09 21:05   ` Darrick J. Wong
  2024-07-11 23:08     ` Wengang Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2024-07-09 21:05 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs

On Tue, Jul 09, 2024 at 12:10:24PM -0700, Wengang Wang wrote:
> On some XFS, free blocks are over-committed to reflink copies.
> And those free blocks are not enough if CoW happens to all the shared blocks.

Hmmm.  I think what you're trying to do here is avoid running a
filesystem out of space because it defragmented files A, B, ... Z, each
of which previously shared the same chunk of storage but now they don't
because this defragger unshared them to reduce the extent count in those
files.  Right?

In that case, I wonder if it's a good idea to touch shared extents at
all?  Someone set those files to share space, that's probably a better
performance optimization than reducing extent count.

That said, you /could/ also use GETFSMAP to find all the other owners of
a shared extent.  Then you can reflink the same extent to a scratch
file, copy the contents to a new region in the scratch file, and use
FIDEDUPERANGE on each of A..Z to remap the new region into those files.
Assuming the new region has fewer mappings than the old one it was
copied from, you'll defragment A..Z while preserving the sharing factor.

I say that because I've written such a thing before; look for
csp_evac_dedupe_fsmap in
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/commit/?h=defrag-freespace&id=785d2f024e31a0d0f52b04073a600f9139ef0b21
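For reference, a minimal sketch of the FIDEDUPERANGE argument such an approach would build. The struct and field names come from linux/fs.h; the helper itself, its parameters, and the surrounding strategy are illustrative, not part of the posted patch:

```c
#include <assert.h>
#include <linux/fs.h>
#include <stdlib.h>

/*
 * Build the FIDEDUPERANGE argument for deduping one destination range
 * against src_fd's bytes at [src_offset, src_offset + length).
 * struct file_dedupe_range carries a flexible array of
 * file_dedupe_range_info entries, one per destination fd.
 */
static struct file_dedupe_range *
build_dedupe_arg(__u64 src_offset, __u64 length, int dest_fd,
		 __u64 dest_offset)
{
	struct file_dedupe_range *arg;

	arg = calloc(1, sizeof(*arg) +
			sizeof(struct file_dedupe_range_info));
	if (!arg)
		return NULL;
	arg->src_offset = src_offset;
	arg->src_length = length;
	arg->dest_count = 1;
	arg->info[0].dest_fd = dest_fd;
	arg->info[0].dest_offset = dest_offset;
	/*
	 * The caller does ioctl(src_fd, FIDEDUPERANGE, arg) and then
	 * checks arg->info[0].status: 0 on success, or
	 * FILE_DEDUPE_RANGE_DIFFERS if the contents didn't match.
	 */
	return arg;
}
```

Unlike FICLONERANGE, the kernel verifies that source and destination contents are byte-identical before remapping, which is what preserves correctness when remapping many owners.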

> This defrag tool would exclude shared segments when free space is under shrethold.

"threshold"

--D

> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
>  spaceman/defrag.c | 46 +++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 43 insertions(+), 3 deletions(-)
> 
> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> index 61e47a43..f8e6713c 100644
> --- a/spaceman/defrag.c
> +++ b/spaceman/defrag.c
> @@ -304,6 +304,29 @@ void defrag_sigint_handler(int dummy)
>  	printf("Please wait until current segment is defragmented\n");
>  };
>  
> +/*
> + * limitation of filesystem free space in bytes.
> + * when filesystem has less free space than this number, segments which contain
> + * shared extents are skipped. 1GiB by default
> + */
> +static long	g_limit_free_bytes = 1024 * 1024 * 1024;
> +
> +/*
> + * check if the free space in the FS is less than the _limit_
> + * return true if so, false otherwise
> + */
> +static bool
> +defrag_fs_limit_hit(int fd)
> +{
> +	struct statfs statfs_s;
> +
> +	if (g_limit_free_bytes <= 0)
> +		return false;
> +
> +	fstatfs(fd, &statfs_s);
> +	return statfs_s.f_bsize * statfs_s.f_bavail < g_limit_free_bytes;
> +}
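(A hedged sketch of this check rewritten on top of fstatvfs(3), which also checks the return value that the fstatfs() call above ignores; the global mirrors the patch's, everything else is illustrative:)

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/statvfs.h>

/* mirrors the patch's g_limit_free_bytes; 1GiB by default */
static long long g_limit_free_bytes = 1024LL * 1024 * 1024;

/*
 * Return true when the filesystem's available space is below the
 * limit.  A failed fstatvfs() also counts as "limit hit" so we err on
 * the side of not consuming more free space.
 */
static bool
defrag_fs_limit_hit(int fd)
{
	struct statvfs st;

	if (g_limit_free_bytes <= 0)
		return false;
	if (fstatvfs(fd, &st) != 0)
		return true;
	/* f_frsize is the fragment size f_bavail is counted in */
	return (long long)st.f_frsize * st.f_bavail < g_limit_free_bytes;
}
```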
> +
>  /*
>   * defragment a file
>   * return 0 if successfully done, 1 otherwise
> @@ -377,6 +400,15 @@ defrag_xfs_defrag(char *file_path) {
>  		if (segment.ds_nr < 2)
>  			continue;
>  
> +		/*
> +		 * When the segment is (partially) shared, defrag would
> +		 * consume free blocks. We check the limit of FS free blocks
> +		 * and skip defragmenting this segment in case the limit is
> +		 * reached.
> +		 */
> +		if (segment.ds_shared && defrag_fs_limit_hit(defrag_fd))
> +			continue;
> +
>  		/* to bytes */
>  		seg_off = segment.ds_offset * 512;
>  		seg_size = segment.ds_length * 512;
> @@ -478,7 +510,11 @@ static void defrag_help(void)
>  "can be served during the defragmentation.\n"
>  "\n"
>  " -s segment_size    -- specify the segment size in MiB, minimum value is 4 \n"
> -"                       default is 16\n"));
> +"                       default is 16\n"
> +" -f free_space      -- specify threshold of the XFS free space in MiB, when\n"
> +"                       XFS free space is lower than that, shared segments \n"
> +"                       are excluded from defragmentation, 1024 by default\n"
> +	));
>  }
>  
>  static cmdinfo_t defrag_cmd;
> @@ -489,7 +525,7 @@ defrag_f(int argc, char **argv)
>  	int	i;
>  	int	c;
>  
> -	while ((c = getopt(argc, argv, "s:")) != EOF) {
> +	while ((c = getopt(argc, argv, "s:f:")) != EOF) {
>  		switch(c) {
>  		case 's':
>  			g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
> @@ -499,6 +535,10 @@ defrag_f(int argc, char **argv)
>  					g_segment_size_lmt);
>  			}
>  			break;
> +		case 'f':
> +			g_limit_free_bytes = atol(optarg) * 1024 * 1024;
> +			break;
> +
>  		default:
>  			command_usage(&defrag_cmd);
>  			return 1;
> @@ -516,7 +556,7 @@ void defrag_init(void)
>  	defrag_cmd.cfunc	= defrag_f;
>  	defrag_cmd.argmin	= 0;
>  	defrag_cmd.argmax	= 4;
> -	defrag_cmd.args		= "[-s segment_size]";
> +	defrag_cmd.args		= "[-s segment_size] [-f free_space]";
>  	defrag_cmd.flags	= CMD_FLAG_ONESHOT;
>  	defrag_cmd.oneline	= _("Defragment XFS files");
>  	defrag_cmd.help		= defrag_help;
> -- 
> 2.39.3 (Apple Git-146)
> 
> 


* Re: [PATCH 4/9] spaceman/defrag: ctrl-c handler
  2024-07-09 19:10 ` [PATCH 4/9] spaceman/defrag: ctrl-c handler Wengang Wang
@ 2024-07-09 21:08   ` Darrick J. Wong
  2024-07-11 22:58     ` Wengang Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2024-07-09 21:08 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs

On Tue, Jul 09, 2024 at 12:10:23PM -0700, Wengang Wang wrote:
> Add this handler so that interrupting the defrag with ctrl-c still:
> 1. reports the stats
> 2. removes the temporary file
> 
> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
>  spaceman/defrag.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> index 9f11e36b..61e47a43 100644
> --- a/spaceman/defrag.c
> +++ b/spaceman/defrag.c
> @@ -297,6 +297,13 @@ get_time_delta_us(struct timeval *pre_time, struct timeval *cur_time)
>  	return us;
>  }
>  
> +static volatile bool usedKilled = false;
> +void defrag_sigint_handler(int dummy)
> +{
> +	usedKilled = true;

Not sure why some of these variables are camelCase and others not.
Or why this global variable doesn't have a g_ prefix like the others?

> +	printf("Please wait until current segment is defragmented\n");

Is it actually safe to call printf from a signal handler?  Handlers must
be very careful about what they call -- regreSSHion was a result of
openssh not getting this right.

(Granted spaceman isn't as critical...)

Also would you rather SIGINT merely terminate the spaceman process?  I
think the file locks drop on termination, right?

--D

> +};
> +
>  /*
>   * defragment a file
>   * return 0 if successfully done, 1 otherwise
> @@ -345,6 +352,8 @@ defrag_xfs_defrag(char *file_path) {
>  		goto out;
>  	}
>  
> +	signal(SIGINT, defrag_sigint_handler);
> +
>  	do {
>  		struct timeval t_clone, t_unshare, t_punch_hole;
>  		struct defrag_segment segment;
> @@ -434,7 +443,7 @@ defrag_xfs_defrag(char *file_path) {
>  		if (time_delta > max_punch_us)
>  			max_punch_us = time_delta;
>  
> -		if (stop)
> +		if (stop || usedKilled)
>  			break;
>  	} while (true);
>  out:
> -- 
> 2.39.3 (Apple Git-146)
> 
> 


* Re: [PATCH 1/9] xfsprogs: introduce defrag command to spaceman
  2024-07-09 19:10 ` [PATCH 1/9] xfsprogs: introduce defrag command to spaceman Wengang Wang
@ 2024-07-09 21:18   ` Darrick J. Wong
  2024-07-11 21:54     ` Wengang Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2024-07-09 21:18 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs

On Tue, Jul 09, 2024 at 12:10:20PM -0700, Wengang Wang wrote:
> 
> Non-exclusive defragment
> Here we are introducing the non-exclusive manner to defragment a file,
> especially for huge files, without blocking IO to it long.
> Non-exclusive defragmentation divides the whole file into small segments.
> For each segment, we lock the file, defragment the segment and unlock the file.
> Defragmenting the small segment doesn’t take long. File IO requests can get
> served between defragmenting segments before blocked long.  Also we put
> (user adjustable) idle time between defragmenting two consecutive segments to
> balance the defragmentation and file IOs.
> 
> The first patch in the set checks for valid target files
> 
> Valid target files to defrag must:
> 1. be accessible for read/write
> 2. be regular files
> 3. be in XFS filesystem
> 4. the containing XFS has reflink enabled. This is not checked
>    before starting defragmentation, but error would be reported
>    later.
> 
> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
>  spaceman/Makefile |   2 +-
>  spaceman/defrag.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++
>  spaceman/init.c   |   1 +
>  spaceman/space.h  |   1 +
>  4 files changed, 201 insertions(+), 1 deletion(-)
>  create mode 100644 spaceman/defrag.c
> 
> diff --git a/spaceman/Makefile b/spaceman/Makefile
> index 1f048d54..9c00b20a 100644
> --- a/spaceman/Makefile
> +++ b/spaceman/Makefile
> @@ -7,7 +7,7 @@ include $(TOPDIR)/include/builddefs
>  
>  LTCOMMAND = xfs_spaceman
>  HFILES = init.h space.h
> -CFILES = info.c init.c file.c health.c prealloc.c trim.c
> +CFILES = info.c init.c file.c health.c prealloc.c trim.c defrag.c
>  LSRCFILES = xfs_info.sh
>  
>  LLDLIBS = $(LIBXCMD) $(LIBFROG)
> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> new file mode 100644
> index 00000000..c9732984
> --- /dev/null
> +++ b/spaceman/defrag.c
> @@ -0,0 +1,198 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2024 Oracle.
> + * All Rights Reserved.
> + */
> +
> +#include "libxfs.h"
> +#include <linux/fiemap.h>
> +#include <linux/fsmap.h>
> +#include "libfrog/fsgeom.h"
> +#include "command.h"
> +#include "init.h"
> +#include "libfrog/paths.h"
> +#include "space.h"
> +#include "input.h"
> +
> +/* defrag segment size limit in units of 512 bytes */
> +#define MIN_SEGMENT_SIZE_LIMIT 8192 /* 4MiB */
> +#define DEFAULT_SEGMENT_SIZE_LIMIT 32768 /* 16MiB */
> +static int g_segment_size_lmt = DEFAULT_SEGMENT_SIZE_LIMIT;
> +
> +/* size of the defrag target file */
> +static off_t g_defrag_file_size = 0;
> +
> +/* stats for the target file extents before defrag */
> +struct ext_stats {
> +	long	nr_ext_total;
> +	long	nr_ext_unwritten;
> +	long	nr_ext_shared;
> +};
> +static struct ext_stats	g_ext_stats;
> +
> +/*
> + * check if the target is a valid file to defrag
> + * also store file size
> + * returns:
> + * true for yes and false for no
> + */
> +static bool
> +defrag_check_file(char *path)
> +{
> +	struct statfs statfs_s;
> +	struct stat stat_s;
> +
> +	if (access(path, F_OK|W_OK) == -1) {
> +		if (errno == ENOENT)
> +			fprintf(stderr, "file \"%s\" doesn't exist\n", path);
> +		else
> +			fprintf(stderr, "no access to \"%s\", %s\n", path,
> +				strerror(errno));
> +		return false;
> +	}
> +
> +	if (stat(path, &stat_s) == -1) {
> +		fprintf(stderr, "failed to get file info on \"%s\":  %s\n",
> +			path, strerror(errno));
> +		return false;
> +	}
> +
> +	g_defrag_file_size = stat_s.st_size;
> +
> +	if (!S_ISREG(stat_s.st_mode)) {
> +		fprintf(stderr, "\"%s\" is not a regular file\n", path);
> +		return false;
> +	}
> +
> +	if (statfs(path, &statfs_s) == -1) {

statfs is deprecated, please use fstatvfs.

> +		fprintf(stderr, "failed to get FS info on \"%s\":  %s\n",
> +			path, strerror(errno));
> +		return false;
> +	}
> +
> +	if (statfs_s.f_type != XFS_SUPER_MAGIC) {
> +		fprintf(stderr, "\"%s\" is not an XFS file\n", path);
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +/*
> + * defragment a file
> + * return 0 if successfully done, 1 otherwise
> + */
> +static int
> +defrag_xfs_defrag(char *file_path) {

defrag_xfs_path() ?

> +	int	max_clone_us = 0, max_unshare_us = 0, max_punch_us = 0;
> +	long	nr_seg_defrag = 0, nr_ext_defrag = 0;
> +	int	scratch_fd = -1, defrag_fd = -1;
> +	char	tmp_file_path[PATH_MAX+1];
> +	char	*defrag_dir;
> +	struct fsxattr	fsx;
> +	int	ret = 0;
> +
> +	fsx.fsx_nextents = 0;
> +	memset(&g_ext_stats, 0, sizeof(g_ext_stats));
> +
> +	if (!defrag_check_file(file_path)) {
> +		ret = 1;
> +		goto out;
> +	}
> +
> +	defrag_fd = open(file_path, O_RDWR);
> +	if (defrag_fd == -1) {

Not sure why you check the path before opening it -- all those file and
statvfs attributes that you collect there can change (or the entire fs
can get unmounted) before you've pinned the fs by opening the file.

> +		fprintf(stderr, "Opening %s failed. %s\n", file_path,
> +			strerror(errno));
> +		ret = 1;
> +		goto out;
> +	}
> +
> +	defrag_dir = dirname(file_path);
> +	snprintf(tmp_file_path, PATH_MAX, "%s/.xfsdefrag_%d", defrag_dir,
> +		getpid());
> +	tmp_file_path[PATH_MAX] = 0;
> +	scratch_fd = open(tmp_file_path, O_CREAT|O_EXCL|O_RDWR, 0600);

O_TMPFILE?  Then you don't have to do this .xfsdefrag_XXX stuff.
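A sketch of the O_TMPFILE route, with a mkstemp()+unlink() fallback for kernels or filesystems that lack it; the helper name and directory argument are illustrative:

```c
#define _GNU_SOURCE	/* O_TMPFILE */
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Open an anonymous scratch file in dir_path.  With O_TMPFILE the
 * file never appears in the namespace, so there is nothing to unlink
 * on exit (or left behind after a crash).
 */
static int
open_scratch_file(const char *dir_path)
{
	char	path[PATH_MAX];
	int	fd;

#ifdef O_TMPFILE
	fd = open(dir_path, O_TMPFILE | O_RDWR, 0600);
	if (fd >= 0)
		return fd;
#endif
	/* fallback: named temporary file, unlinked immediately */
	snprintf(path, sizeof(path), "%s/.xfsdefrag_XXXXXX", dir_path);
	fd = mkstemp(path);
	if (fd >= 0)
		unlink(path);
	return fd;
}
```

Creating it in the target's directory keeps the scratch file on the same filesystem, which FICLONERANGE requires.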

> +	if (scratch_fd == -1) {
> +		fprintf(stderr, "Opening temporary file %s failed. %s\n",
> +			tmp_file_path, strerror(errno));
> +		ret = 1;
> +		goto out;
> +	}
> +out:
> +	if (scratch_fd != -1) {
> +		close(scratch_fd);
> +		unlink(tmp_file_path);
> +	}
> +	if (defrag_fd != -1) {
> +		ioctl(defrag_fd, FS_IOC_FSGETXATTR, &fsx);
> +		close(defrag_fd);
> +	}
> +
> +	printf("Pre-defrag %ld extents detected, %ld are \"unwritten\","
> +		"%ld are \"shared\"\n",
> +		g_ext_stats.nr_ext_total, g_ext_stats.nr_ext_unwritten,
> +		g_ext_stats.nr_ext_shared);
> +	printf("Tried to defragment %ld extents in %ld segments\n",
> +		nr_ext_defrag, nr_seg_defrag);
> +	printf("Time stats(ms): max clone: %d, max unshare: %d,"
> +	       " max punch_hole: %d\n",
> +	       max_clone_us/1000, max_unshare_us/1000, max_punch_us/1000);
> +	printf("Post-defrag %u extents detected\n", fsx.fsx_nextents);
> +	return ret;
> +}
> +
> +
> +static void defrag_help(void)
> +{
> +	printf(_(
> +"\n"
> +"Defragemnt files on XFS where reflink is enabled. IOs to the target files \n"

"Defragment"

> +"can be served during the defragmentation.\n"
> +"\n"
> +" -s segment_size    -- specify the segment size in MiB, minimum value is 4 \n"
> +"                       default is 16\n"));
> +}
> +
> +static cmdinfo_t defrag_cmd;
> +
> +static int
> +defrag_f(int argc, char **argv)
> +{
> +	int	i;
> +	int	c;
> +
> +	while ((c = getopt(argc, argv, "s:")) != EOF) {
> +		switch(c) {
> +		case 's':
> +			g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
> +			if (g_segment_size_lmt < MIN_SEGMENT_SIZE_LIMIT) {
> +				g_segment_size_lmt = MIN_SEGMENT_SIZE_LIMIT;
> +				printf("Using minimum segment size %d\n",
> +					g_segment_size_lmt);
> +			}
> +			break;
> +		default:
> +			command_usage(&defrag_cmd);
> +			return 1;
> +		}
> +	}
> +
> +	for (i = 0; i < filecount; i++)
> +		defrag_xfs_defrag(filetable[i].name);

Pass in the whole filetable[i] and then you've already got an open fd
and some validation that it's an xfs filesystem.

> +	return 0;
> +}
> +void defrag_init(void)
> +{
> +	defrag_cmd.name		= "defrag";
> +	defrag_cmd.altname	= "dfg";
> +	defrag_cmd.cfunc	= defrag_f;
> +	defrag_cmd.argmin	= 0;
> +	defrag_cmd.argmax	= 4;
> +	defrag_cmd.args		= "[-s segment_size]";
> +	defrag_cmd.flags	= CMD_FLAG_ONESHOT;

IIRC if you don't set CMD_FLAG_FOREIGN_OK then the command processor
won't let this command get run against a non-xfs file.

--D

> +	defrag_cmd.oneline	= _("Defragment XFS files");
> +	defrag_cmd.help		= defrag_help;
> +
> +	add_command(&defrag_cmd);
> +}
> diff --git a/spaceman/init.c b/spaceman/init.c
> index cf1ff3cb..396f965c 100644
> --- a/spaceman/init.c
> +++ b/spaceman/init.c
> @@ -35,6 +35,7 @@ init_commands(void)
>  	trim_init();
>  	freesp_init();
>  	health_init();
> +	defrag_init();
>  }
>  
>  static int
> diff --git a/spaceman/space.h b/spaceman/space.h
> index 723209ed..c288aeb9 100644
> --- a/spaceman/space.h
> +++ b/spaceman/space.h
> @@ -26,6 +26,7 @@ extern void	help_init(void);
>  extern void	prealloc_init(void);
>  extern void	quit_init(void);
>  extern void	trim_init(void);
> +extern void	defrag_init(void);
>  #ifdef HAVE_GETFSMAP
>  extern void	freesp_init(void);
>  #else
> -- 
> 2.39.3 (Apple Git-146)
> 
> 


* Re: [PATCH 2/9] spaceman/defrag: pick up segments from target file
  2024-07-09 19:10 ` [PATCH 2/9] spaceman/defrag: pick up segments from target file Wengang Wang
@ 2024-07-09 21:50   ` Darrick J. Wong
  2024-07-11 22:37     ` Wengang Wang
  2024-07-15 23:40   ` [PATCH 2/9] spaceman/defrag: pick up segments from target file Dave Chinner
  1 sibling, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2024-07-09 21:50 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs

On Tue, Jul 09, 2024 at 12:10:21PM -0700, Wengang Wang wrote:
> segments are the smallest unit to defragment.
> 
> A segment
> 1. Can't exceed size limit

What size limit?  Do you mean a segment can't extend beyond EOF?  Or did
you actually mean RLIMIT_FSIZE?

> 2. contains some extents
> 3. the contained extents can't be "unwritten"
> 4. the contained extents must be contiguous in file blocks

As in the segment cannot contain sparse holes?

I think what I"m reading here is that a segment cannot extend beyond EOF
and must be completely filled with written extent mappings?

Is there an upper limit on the number of mappings per segment?

> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
>  spaceman/defrag.c | 204 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 204 insertions(+)
> 
> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> index c9732984..175cf461 100644
> --- a/spaceman/defrag.c
> +++ b/spaceman/defrag.c
> @@ -14,6 +14,32 @@
>  #include "space.h"
>  #include "input.h"
>  
> +#define MAPSIZE 512
> +/* used to fetch bmap */
> +struct getbmapx	g_mapx[MAPSIZE];

Each of these global arrays increases the bss segment size, which
increases the overall footprint of xfs_spaceman, even when it's not
being used to defragment files.

Could you switch this data to be dynamically allocated at the start of
defrag_f and freed at the end?

> +/* current offset of the file in units of 512 bytes, used to fetch bmap */
> +static long long 	g_offset = 0;

Unnecessary space after the 'long long'.

> +/* index to indentify next extent, used to get next extent */
> +static int		g_ext_next_idx = -1;
> +
> +/*
> + * segment, the smallest unit to defrag
> + * it includes some contiguous extents.
> + * no holes included,
> + * no unwritten extents included
> + * the size is limited by g_segment_size_lmt
> + */
> +struct defrag_segment {
> +	/* segment offset in units of 512 bytes */
> +	long long	ds_offset;
> +	/* length of segment in units of 512 bytes */
> +	long long	ds_length;
> +	/* number of extents in this segment */
> +	int		ds_nr;
> +	/* flag indicating if segment contains shared blocks */
> +	bool		ds_shared;

Maybe g_mapx belongs in here?  Wait, does a bunch of contiguous written
bmapx records comprise a segment, or is a segment created from (possibly
a subsection of) a particular written bmapx record?

> +};
> +
>  /* defrag segment size limit in units of 512 bytes */
>  #define MIN_SEGMENT_SIZE_LIMIT 8192 /* 4MiB */
>  #define DEFAULT_SEGMENT_SIZE_LIMIT 32768 /* 16MiB */
> @@ -78,6 +104,165 @@ defrag_check_file(char *path)
>  	return true;
>  }
>  
> +/*
> + * get next extent in the file.
> + * Note: next call will get the same extent unless move_next_extent() is called.
> + * returns:
> + * -1:	error happened.
> + * 0:	extent returned
> + * 1:	no more extent left
> + */
> +static int
> +defrag_get_next_extent(int fd, struct getbmapx *map_out)
> +{
> +	int err = 0, i;
> +
> +	/* when no extents are cached in g_mapx, fetch from kernel */
> +	if (g_ext_next_idx == -1) {
> +		g_mapx[0].bmv_offset = g_offset;
> +		g_mapx[0].bmv_length = -1LL;
> +		g_mapx[0].bmv_count = MAPSIZE;
> +		g_mapx[0].bmv_iflags = BMV_IF_NO_HOLES | BMV_IF_PREALLOC;
> +		err = ioctl(fd, XFS_IOC_GETBMAPX, g_mapx);
> +		if (err == -1) {
> +			perror("XFS_IOC_GETBMAPX failed");
> +			goto out;
> +		}
> +		/* for stats */
> +		g_ext_stats.nr_ext_total += g_mapx[0].bmv_entries;
> +
> +		/* no more extents */
> +		if (g_mapx[0].bmv_entries == 0) {
> +			err = 1;
> +			goto out;
> +		}
> +
> +		/* for stats */
> +		for (i = 1; i <= g_mapx[0].bmv_entries; i++) {
> +			if (g_mapx[i].bmv_oflags & BMV_OF_PREALLOC)
> +				g_ext_stats.nr_ext_unwritten++;
> +			if (g_mapx[i].bmv_oflags & BMV_OF_SHARED)
> +				g_ext_stats.nr_ext_shared++;
> +		}
> +
> +		g_ext_next_idx = 1;
> +		g_offset = g_mapx[g_mapx[0].bmv_entries].bmv_offset +
> +				g_mapx[g_mapx[0].bmv_entries].bmv_length;
> +	}

Huh.  AFAICT, g_ext_next_idx/g_mapx effectively act as a cursor over the
mappings for this file segment.  In that case, shouldn't
defrag_move_next_extent (which actually advances the cursor) be in
charge of grabbing bmapx records from the kernel, and
defrag_get_next_extent be the trivial helper to pass mappings to the
consumer of the bmapx objects (aka defrag_get_next_segment)?

> +
> +	map_out->bmv_offset = g_mapx[g_ext_next_idx].bmv_offset;
> +	map_out->bmv_length = g_mapx[g_ext_next_idx].bmv_length;
> +	map_out->bmv_oflags = g_mapx[g_ext_next_idx].bmv_oflags;
> +out:
> +	return err;
> +}
> +
> +/*
> + * move to next extent
> + */
> +static void
> +defrag_move_next_extent(void)
> +{
> +	if (g_ext_next_idx == g_mapx[0].bmv_entries)
> +		g_ext_next_idx = -1;
> +	else
> +		g_ext_next_idx += 1;
> +}
> +
> +/*
> + * check if the given extent is a defrag target.
> + * no need to check for holes as we are using BMV_IF_NO_HOLES
> + */
> +static bool
> +defrag_is_target(struct getbmapx *mapx)
> +{
> +	/* unwritten */
> +	if (mapx->bmv_oflags & BMV_OF_PREALLOC)
> +		return false;
> +	return mapx->bmv_length < g_segment_size_lmt;
> +}
> +
> +static bool
> +defrag_is_extent_shared(struct getbmapx *mapx)
> +{
> +	return !!(mapx->bmv_oflags & BMV_OF_SHARED);
> +}
> +
> +/*
> + * get next segment to defragment.
> + * returns:
> + * -1	error happened.
> + * 0	segment returned.
> + * 1	no more segments to return
> + */
> +static int
> +defrag_get_next_segment(int fd, struct defrag_segment *out)
> +{
> +	struct getbmapx mapx;
> +	int	ret;
> +
> +	out->ds_offset = 0;
> +	out->ds_length = 0;
> +	out->ds_nr = 0;
> +	out->ds_shared = false;
> +
> +	do {
> +		ret = defrag_get_next_extent(fd, &mapx);
> +		if (ret != 0) {
> +			/*
> +			 * no more extents, return the current segment if
> +			 * it's not empty
> +			 */
> +			if (ret == 1 && out->ds_nr > 0)
> +				ret = 0;
> +			/* otherwise, an error happened, stop */
> +			break;
> +		}
> +
> +		/*
> +		 * If the extent is not a defrag target, skip it.
> +		 * go to next extent if the segment is empty;
> +		 * otherwise return the segment.
> +		 */
> +		if (!defrag_is_target(&mapx)) {
> +			defrag_move_next_extent();
> +			if (out->ds_nr == 0)
> +				continue;
> +			else
> +				break;
> +		}
> +
> +		/* check for segment size limitation */
> +		if (out->ds_length + mapx.bmv_length > g_segment_size_lmt)
> +			break;
> +
> +		/* the segment is empty now, add this extent to it for sure */
> +		if (out->ds_nr == 0) {
> +			out->ds_offset = mapx.bmv_offset;
> +			goto add_ext;
> +		}
> +
> +		/*
> +		 * the segment is not empty, check for a hole since the last
> +		 * extent. if a hole exists before this extent, it can't be
> +		 * added to the segment. return the segment
> +		 */
> +		if (out->ds_offset + out->ds_length != mapx.bmv_offset)
> +			break;
> +
> +add_ext:
> +		if (defrag_is_extent_shared(&mapx))
> +			out->ds_shared = true;
> +
> +		out->ds_length += mapx.bmv_length;
> +		out->ds_nr += 1;

OH, ok.  So we walk the mappings for a file.  If we can identify a run
of contiguous written mappings, we define a segment to be the file range
described by that run, up to whatever the maximum is (~4-16M).  Each of
these segments is defragmented (somehow).  Is that correct?
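The rule as reconstructed here -- accumulate contiguous written mappings until the size cap, stopping at holes and unwritten extents -- can be modeled apart from the getbmapx plumbing. The structs below are illustrative stand-ins for the patch's types, not its actual code:

```c
#include <assert.h>
#include <stdbool.h>

/* illustrative stand-in for a getbmapx record (units of 512 bytes) */
struct mapping {
	long long	off;
	long long	len;
	bool		written;	/* false => unwritten/prealloc */
};

struct segment {
	long long	off;
	long long	len;
	int		nr;		/* number of mappings inside */
};

/*
 * Build one segment starting at maps[*idx]: a run of contiguous
 * written mappings whose total length stays within seg_limit.
 * Advances *idx past the consumed mappings; returns seg->nr.
 */
static int
build_segment(const struct mapping *maps, int nr_maps, int *idx,
	      long long seg_limit, struct segment *seg)
{
	seg->off = 0;
	seg->len = 0;
	seg->nr = 0;

	while (*idx < nr_maps) {
		const struct mapping *m = &maps[*idx];

		/* non-targets (unwritten, or already large) end a segment */
		if (!m->written || m->len >= seg_limit) {
			(*idx)++;
			if (seg->nr > 0)
				break;
			continue;
		}
		if (seg->len + m->len > seg_limit)
			break;		/* segment would exceed the cap */
		if (seg->nr > 0 && seg->off + seg->len != m->off)
			break;		/* hole before this mapping */
		if (seg->nr == 0)
			seg->off = m->off;
		seg->len += m->len;
		seg->nr++;
		(*idx)++;
	}
	return seg->nr;
}
```

Calling this in a loop yields one segment per run of contiguous written mappings, mirroring how defrag_get_next_segment() consumes extents.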

> +		defrag_move_next_extent();
> +
> +	} while (true);
> +
> +	return ret;
> +}
> +
>  /*
>   * defragment a file
>   * return 0 if successfully done, 1 otherwise
> @@ -92,6 +277,9 @@ defrag_xfs_defrag(char *file_path) {
>  	struct fsxattr	fsx;
>  	int	ret = 0;
>  
> +	g_offset = 0;
> +	g_ext_next_idx = -1;
> +
>  	fsx.fsx_nextents = 0;
>  	memset(&g_ext_stats, 0, sizeof(g_ext_stats));
>  
> @@ -119,6 +307,22 @@ defrag_xfs_defrag(char *file_path) {
>  		ret = 1;
>  		goto out;
>  	}
> +
> +	do {
> +		struct defrag_segment segment;
> +
> +		ret = defrag_get_next_segment(defrag_fd, &segment);
> +		/* no more segments, we are done */
> +		if (ret == 1) {
> +			ret = 0;
> +			break;
> +		}

If you reverse the polarity of the 0/1 return values (aka return 1 if
there is a segment and 0 if there is none) then you can shorten this
loop to:

	struct defrag_segment	segment;
	int			ret;

	while ((ret = defrag_get_next_segment(...)) == 1) {
		/* process segment */
	}

	return ret;

--D

> +		/* error happened when reading bmap, stop here */
> +		if (ret == -1) {
> +			ret = 1;
> +			break;
> +		}
> +	} while (true);
>  out:
>  	if (scratch_fd != -1) {
>  		close(scratch_fd);
> -- 
> 2.39.3 (Apple Git-146)
> 
> 


* Re: [PATCH 3/9] spaceman/defrag: defrag segments
  2024-07-09 19:10 ` [PATCH 3/9] spaceman/defrag: defrag segments Wengang Wang
@ 2024-07-09 21:57   ` Darrick J. Wong
  2024-07-11 22:49     ` Wengang Wang
  2024-07-16  0:08   ` Dave Chinner
  1 sibling, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2024-07-09 21:57 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs

On Tue, Jul 09, 2024 at 12:10:22PM -0700, Wengang Wang wrote:
> For each segment, the following steps are done trying to defrag it:
> 
> 1. share the segment with a temporary file
> 2. unshare the segment in the target file. The kernel simulates CoW on the
>    whole segment to complete the unshare (defrag).
> 3. release the blocks from the temporary file.
> 
> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
>  spaceman/defrag.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 114 insertions(+)
> 
> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> index 175cf461..9f11e36b 100644
> --- a/spaceman/defrag.c
> +++ b/spaceman/defrag.c
> @@ -263,6 +263,40 @@ add_ext:
>  	return ret;
>  }
>  
> +/*
> + * check if the segment exceeds EoF.
> + * fix up the clone range and return true if EoF happens,
> + * return false otherwise.
> + */
> +static bool
> +defrag_clone_eof(struct file_clone_range *clone)
> +{
> +	off_t delta;
> +
> +	delta = clone->src_offset + clone->src_length - g_defrag_file_size;
> +	if (delta > 0) {
> +		clone->src_length = 0; // to the end
> +		return true;
> +	}
> +	return false;
> +}
> +
> +/*
> + * get the time delta since pre_time in microseconds.
> + * pre_time should contain the value fetched by gettimeofday();
> + * cur_time is used to store the current time from gettimeofday()
> + */
> +static long long
> +get_time_delta_us(struct timeval *pre_time, struct timeval *cur_time)
> +{
> +	long long us;
> +
> +	gettimeofday(cur_time, NULL);
> +	us = (cur_time->tv_sec - pre_time->tv_sec) * 1000000;
> +	us += (cur_time->tv_usec - pre_time->tv_usec);
> +	return us;
> +}
> +
>  /*
>   * defragment a file
>   * return 0 if successfully done, 1 otherwise
> @@ -273,6 +307,7 @@ defrag_xfs_defrag(char *file_path) {
>  	long	nr_seg_defrag = 0, nr_ext_defrag = 0;
>  	int	scratch_fd = -1, defrag_fd = -1;
>  	char	tmp_file_path[PATH_MAX+1];
> +	struct file_clone_range clone;
>  	char	*defrag_dir;
>  	struct fsxattr	fsx;
>  	int	ret = 0;

Now that I see this, you might want to straighten up the lines:

	struct fsxattr	fsx = { };
	long		nr_seg_defrag = 0, nr_ext_defrag = 0;

etc.  Note the "= { }" bit that means you don't have to memset them to
zero explicitly.

> @@ -296,6 +331,8 @@ defrag_xfs_defrag(char *file_path) {
>  		goto out;
>  	}
>  
> +	clone.src_fd = defrag_fd;
> +
>  	defrag_dir = dirname(file_path);
>  	snprintf(tmp_file_path, PATH_MAX, "%s/.xfsdefrag_%d", defrag_dir,
>  		getpid());
> @@ -309,7 +346,11 @@ defrag_xfs_defrag(char *file_path) {
>  	}
>  
>  	do {
> +		struct timeval t_clone, t_unshare, t_punch_hole;
>  		struct defrag_segment segment;
> +		long long seg_size, seg_off;
> +		int time_delta;
> +		bool stop;
>  
>  		ret = defrag_get_next_segment(defrag_fd, &segment);
>  		/* no more segments, we are done */
> @@ -322,6 +363,79 @@ defrag_xfs_defrag(char *file_path) {
>  			ret = 1;
>  			break;
>  		}
> +
> +		/* we are done if the segment contains only 1 extent */
> +		if (segment.ds_nr < 2)
> +			continue;
> +
> +		/* to bytes */
> +		seg_off = segment.ds_offset * 512;
> +		seg_size = segment.ds_length * 512;
> +
> +		clone.src_offset = seg_off;
> +		clone.src_length = seg_size;
> +		clone.dest_offset = seg_off;
> +
> +		/* checks for EoF and fix up clone */
> +		stop = defrag_clone_eof(&clone);
> +		gettimeofday(&t_clone, NULL);
> +		ret = ioctl(scratch_fd, FICLONERANGE, &clone);

Hm, should the top-level defrag_f function check in the
filetable[i].fsgeom structure that the fs supports reflink?

> +		if (ret != 0) {
> +			fprintf(stderr, "FICLONERANGE failed %s\n",
> +				strerror(errno));

Might be useful to include the file_path in the error message:

/opt/a: FICLONERANGE failed Software caused connection abort

(maybe also put a semicolon before the strerror message?)

> +			break;
> +		}
> +
> +		/* for time stats */
> +		time_delta = get_time_delta_us(&t_clone, &t_unshare);
> +		if (time_delta > max_clone_us)
> +			max_clone_us = time_delta;
> +
> +		/* for defrag stats */
> +		nr_ext_defrag += segment.ds_nr;
> +
> +		/*
> +		 * For the shared range to be unshared via a copy-on-write
> +		 * operation in the file to be defragged. This causes the
> +		 * file needing to be defragged to have new extents allocated
> +		 * and the data to be copied over and written out.
> +		 */
> +		ret = fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, seg_off,
> +				seg_size);
> +		if (ret != 0) {
> +			fprintf(stderr, "UNSHARE_RANGE failed %s\n",
> +				strerror(errno));
> +			break;
> +		}
> +
> +		/* for time stats */
> +		time_delta = get_time_delta_us(&t_unshare, &t_punch_hole);
> +		if (time_delta > max_unshare_us)
> +			max_unshare_us = time_delta;
> +
> +		/*
> +		 * Punch out the original extents we shared to the
> +		 * scratch file so they are returned to free space.
> +		 */
> +		ret = fallocate(scratch_fd,
> +			FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, seg_off,
> +			seg_size);

Indentation here (two tabs for a continuation).  Or just ftruncate
scratch_fd to zero bytes?  I think you have to do that for the EOF stuff
to work, right?

--D

> +		if (ret != 0) {
> +			fprintf(stderr, "PUNCH_HOLE failed %s\n",
> +				strerror(errno));
> +			break;
> +		}
> +
> +		/* for defrag stats */
> +		nr_seg_defrag += 1;
> +
> +		/* for time stats */
> +		time_delta = get_time_delta_us(&t_punch_hole, &t_clone);
> +		if (time_delta > max_punch_us)
> +			max_punch_us = time_delta;
> +
> +		if (stop)
> +			break;
>  	} while (true);
>  out:
>  	if (scratch_fd != -1) {
> -- 
> 2.39.3 (Apple Git-146)
> 
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/9] xfsprogs: introduce defrag command to spaceman
  2024-07-09 21:18   ` Darrick J. Wong
@ 2024-07-11 21:54     ` Wengang Wang
  2024-07-15 21:30       ` Wengang Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Wengang Wang @ 2024-07-11 21:54 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs@vger.kernel.org

Hi Darrick,
Thanks for the review; please see my replies inline.

> On Jul 9, 2024, at 2:18 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:20PM -0700, Wengang Wang wrote:
>> Content-Type: text/plain; charset=UTF-8
>> Content-Transfer-Encoding: 8bit
>> 
>> Non-exclusive defragment
>> Here we are introducing the non-exclusive manner to defragment a file,
>> especially for huge files, without blocking IO to it long.
>> Non-exclusive defragmentation divides the whole file into small segments.
>> For each segment, we lock the file, defragment the segment and unlock the file.
>> Defragmenting the small segment doesn’t take long. File IO requests can get
>> served between defragmenting segments before blocked long.  Also we put
>> (user adjustable) idle time between defragmenting two consecutive segments to
>> balance the defragmentation and file IOs.
>> 
>> The first patch in the set checks for valid target files
>> 
>> Valid target files to defrag must:
>> 1. be accessible for read/write
>> 2. be regular files
>> 3. be in XFS filesystem
>> 4. the containing XFS has reflink enabled. This is not checked
>>   before starting defragmentation, but error would be reported
>>   later.
>> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>> ---
>> spaceman/Makefile |   2 +-
>> spaceman/defrag.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++
>> spaceman/init.c   |   1 +
>> spaceman/space.h  |   1 +
>> 4 files changed, 201 insertions(+), 1 deletion(-)
>> create mode 100644 spaceman/defrag.c
>> 
>> diff --git a/spaceman/Makefile b/spaceman/Makefile
>> index 1f048d54..9c00b20a 100644
>> --- a/spaceman/Makefile
>> +++ b/spaceman/Makefile
>> @@ -7,7 +7,7 @@ include $(TOPDIR)/include/builddefs
>> 
>> LTCOMMAND = xfs_spaceman
>> HFILES = init.h space.h
>> -CFILES = info.c init.c file.c health.c prealloc.c trim.c
>> +CFILES = info.c init.c file.c health.c prealloc.c trim.c defrag.c
>> LSRCFILES = xfs_info.sh
>> 
>> LLDLIBS = $(LIBXCMD) $(LIBFROG)
>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>> new file mode 100644
>> index 00000000..c9732984
>> --- /dev/null
>> +++ b/spaceman/defrag.c
>> @@ -0,0 +1,198 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (c) 2024 Oracle.
>> + * All Rights Reserved.
>> + */
>> +
>> +#include "libxfs.h"
>> +#include <linux/fiemap.h>
>> +#include <linux/fsmap.h>
>> +#include "libfrog/fsgeom.h"
>> +#include "command.h"
>> +#include "init.h"
>> +#include "libfrog/paths.h"
>> +#include "space.h"
>> +#include "input.h"
>> +
>> +/* defrag segment size limit in units of 512 bytes */
>> +#define MIN_SEGMENT_SIZE_LIMIT 8192 /* 4MiB */
>> +#define DEFAULT_SEGMENT_SIZE_LIMIT 32768 /* 16MiB */
>> +static int g_segment_size_lmt = DEFAULT_SEGMENT_SIZE_LIMIT;
>> +
>> +/* size of the defrag target file */
>> +static off_t g_defrag_file_size = 0;
>> +
>> +/* stats for the target file extents before defrag */
>> +struct ext_stats {
>> + long nr_ext_total;
>> + long nr_ext_unwritten;
>> + long nr_ext_shared;
>> +};
>> +static struct ext_stats g_ext_stats;
>> +
>> +/*
>> + * check if the target is a valid file to defrag
>> + * also store file size
>> + * returns:
>> + * true for yes and false for no
>> + */
>> +static bool
>> +defrag_check_file(char *path)
>> +{
>> + struct statfs statfs_s;
>> + struct stat stat_s;
>> +
>> + if (access(path, F_OK|W_OK) == -1) {
>> + if (errno == ENOENT)
>> + fprintf(stderr, "file \"%s\" doesn't exist\n", path);
>> + else
>> + fprintf(stderr, "no access to \"%s\", %s\n", path,
>> + strerror(errno));
>> + return false;
>> + }
>> +
>> + if (stat(path, &stat_s) == -1) {
>> + fprintf(stderr, "failed to get file info on \"%s\":  %s\n",
>> + path, strerror(errno));
>> + return false;
>> + }
>> +
>> + g_defrag_file_size = stat_s.st_size;
>> +
>> + if (!S_ISREG(stat_s.st_mode)) {
>> + fprintf(stderr, "\"%s\" is not a regular file\n", path);
>> + return false;
>> + }
>> +
>> + if (statfs(path, &statfs_s) == -1) {
> 
> statfs is deprecated, please use fstatvfs.

OK, will move to fstatvfs.

> 
>> + fprintf(stderr, "failed to get FS info on \"%s\":  %s\n",
>> + path, strerror(errno));
>> + return false;
>> + }
>> +
>> + if (statfs_s.f_type != XFS_SUPER_MAGIC) {
>> + fprintf(stderr, "\"%s\" is not a xfs file\n", path);
>> + return false;
>> + }
>> +
>> + return true;
>> +}
>> +
>> +/*
>> + * defragment a file
>> + * return 0 if successfully done, 1 otherwise
>> + */
>> +static int
>> +defrag_xfs_defrag(char *file_path) {
> 
> defrag_xfs_path() ?

OK.
> 
>> + int max_clone_us = 0, max_unshare_us = 0, max_punch_us = 0;
>> + long nr_seg_defrag = 0, nr_ext_defrag = 0;
>> + int scratch_fd = -1, defrag_fd = -1;
>> + char tmp_file_path[PATH_MAX+1];
>> + char *defrag_dir;
>> + struct fsxattr fsx;
>> + int ret = 0;
>> +
>> + fsx.fsx_nextents = 0;
>> + memset(&g_ext_stats, 0, sizeof(g_ext_stats));
>> +
>> + if (!defrag_check_file(file_path)) {
>> + ret = 1;
>> + goto out;
>> + }
>> +
>> + defrag_fd = open(file_path, O_RDWR);
>> + if (defrag_fd == -1) {
> 
> Not sure why you check the path before opening it -- all those file and
> statvfs attributes that you collect there can change (or the entire fs
> gets unmounted) until you've pinned the fs by opening the file.

The idea came from internal reviews: we want to report explicit reasons why
defrag failed. Those reasons include:
1) whether the user has permission to access the target file
2) whether the specified path exists (when moving to spaceman, spaceman takes care of it)
3) whether the specified path is a regular file
4) whether the target file is an XFS file

Things might change between the check and the open, but that is a very rare case, and the user is
responsible for such a change rather than this tool.

> 
>> + fprintf(stderr, "Opening %s failed. %s\n", file_path,
>> + strerror(errno));
>> + ret = 1;
>> + goto out;
>> + }
>> +
>> + defrag_dir = dirname(file_path);
>> + snprintf(tmp_file_path, PATH_MAX, "%s/.xfsdefrag_%d", defrag_dir,
>> + getpid());
>> + tmp_file_path[PATH_MAX] = 0;
>> + scratch_fd = open(tmp_file_path, O_CREAT|O_EXCL|O_RDWR, 0600);
> 
> O_TMPFILE?  Then you don't have to do this .xfsdefrag_XXX stuff.
> 

My first version used O_TMPFILE, but clone failed somehow (I don’t remember the details).
I retried O_TMPFILE and it’s working now, so I will move to O_TMPFILE.

>> + if (scratch_fd == -1) {
>> + fprintf(stderr, "Opening temporary file %s failed. %s\n",
>> + tmp_file_path, strerror(errno));
>> + ret = 1;
>> + goto out;
>> + }
>> +out:
>> + if (scratch_fd != -1) {
>> + close(scratch_fd);
>> + unlink(tmp_file_path);
>> + }
>> + if (defrag_fd != -1) {
>> + ioctl(defrag_fd, FS_IOC_FSGETXATTR, &fsx);
>> + close(defrag_fd);
>> + }
>> +
>> + printf("Pre-defrag %ld extents detected, %ld are \"unwritten\","
>> + "%ld are \"shared\"\n",
>> + g_ext_stats.nr_ext_total, g_ext_stats.nr_ext_unwritten,
>> + g_ext_stats.nr_ext_shared);
>> + printf("Tried to defragment %ld extents in %ld segments\n",
>> + nr_ext_defrag, nr_seg_defrag);
>> + printf("Time stats(ms): max clone: %d, max unshare: %d,"
>> +        " max punch_hole: %d\n",
>> +        max_clone_us/1000, max_unshare_us/1000, max_punch_us/1000);
>> + printf("Post-defrag %u extents detected\n", fsx.fsx_nextents);
>> + return ret;
>> +}
>> +
>> +
>> +static void defrag_help(void)
>> +{
>> + printf(_(
>> +"\n"
>> +"Defragemnt files on XFS where reflink is enabled. IOs to the target files \n"
> 
> "Defragment"

OK.

> 
>> +"can be served durning the defragmentations.\n"
>> +"\n"
>> +" -s segment_size    -- specify the segment size in MiB, minmum value is 4 \n"
>> +"                       default is 16\n"));
>> +}
>> +
>> +static cmdinfo_t defrag_cmd;
>> +
>> +static int
>> +defrag_f(int argc, char **argv)
>> +{
>> + int i;
>> + int c;
>> +
>> + while ((c = getopt(argc, argv, "s:")) != EOF) {
>> + switch(c) {
>> + case 's':
>> + g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
>> + if (g_segment_size_lmt < MIN_SEGMENT_SIZE_LIMIT) {
>> + g_segment_size_lmt = MIN_SEGMENT_SIZE_LIMIT;
>> + printf("Using minimium segment size %d\n",
>> + g_segment_size_lmt);
>> + }
>> + break;
>> + default:
>> + command_usage(&defrag_cmd);
>> + return 1;
>> + }
>> + }
>> +
>> + for (i = 0; i < filecount; i++)
>> + defrag_xfs_defrag(filetable[i].name);
> 
> Pass in the whole filetable[i] and then you've already got an open fd
> and some validation that it's an xfs filesystem.

Good to know.
> 
>> + return 0;
>> +}
>> +void defrag_init(void)
>> +{
>> + defrag_cmd.name = "defrag";
>> + defrag_cmd.altname = "dfg";
>> + defrag_cmd.cfunc = defrag_f;
>> + defrag_cmd.argmin = 0;
>> + defrag_cmd.argmax = 4;
>> + defrag_cmd.args = "[-s segment_size]";
>> + defrag_cmd.flags = CMD_FLAG_ONESHOT;
> 
> IIRC if you don't set CMD_FLAG_FOREIGN_OK then the command processor
> won't let this command get run against a non-xfs file.
> 

OK.

Thanks,
Wengang

> --D
> 
>> + defrag_cmd.oneline = _("Defragment XFS files");
>> + defrag_cmd.help = defrag_help;
>> +
>> + add_command(&defrag_cmd);
>> +}
>> diff --git a/spaceman/init.c b/spaceman/init.c
>> index cf1ff3cb..396f965c 100644
>> --- a/spaceman/init.c
>> +++ b/spaceman/init.c
>> @@ -35,6 +35,7 @@ init_commands(void)
>> trim_init();
>> freesp_init();
>> health_init();
>> + defrag_init();
>> }
>> 
>> static int
>> diff --git a/spaceman/space.h b/spaceman/space.h
>> index 723209ed..c288aeb9 100644
>> --- a/spaceman/space.h
>> +++ b/spaceman/space.h
>> @@ -26,6 +26,7 @@ extern void help_init(void);
>> extern void prealloc_init(void);
>> extern void quit_init(void);
>> extern void trim_init(void);
>> +extern void defrag_init(void);
>> #ifdef HAVE_GETFSMAP
>> extern void freesp_init(void);
>> #else
>> -- 
>> 2.39.3 (Apple Git-146)



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/9] spaceman/defrag: pick up segments from target file
  2024-07-09 21:50   ` [PATCH 2/9] spaceman/defrag: pick up segments from target file Darrick J. Wong
@ 2024-07-11 22:37     ` Wengang Wang
  0 siblings, 0 replies; 60+ messages in thread
From: Wengang Wang @ 2024-07-11 22:37 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs@vger.kernel.org



> On Jul 9, 2024, at 2:50 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:21PM -0700, Wengang Wang wrote:
>> segments are the smallest unit to defragment.
>> 
>> A segment
>> 1. Can't exceed size limit
> 
> What size limit?  Do you mean a segment can't extend beyond EOF?  Or did
> you actually mean RLIMIT_FSIZE?

We defrag by sharing the (non-contiguous) blocks with a temporary file and then UNSHAREing
them to get contiguous blocks. The size limit is the limit on the range for which we issue a single
UNSHARE ioctl call. We put some contiguous extents (mappings) together as a segment for one
UNSHARE call. We don’t want that range to be too big, since it would hold the IO lock on the file
for a long time. The default size limit is 16MiB.

> 
>> 2. contains some extents
>> 3. the contained extents can't be "unwritten"
>> 4. the contained extents must be contigous in file blocks
> 
> As in the segment cannot contain sparse holes?

Holes are meaningless for unshare calls, so they are not included in segments.

> 
> I think what I"m reading here is that a segment cannot extend beyond EOF
> and must be completely filled with written extent mappings?

Yes, a segment, as the unit/range for UNSHARE:
1) the extents included in the segment must be contiguous by file offset
2) it contains no holes
3) it contains no unwritten extents; unwritten extents are meaningless for UNSHARE.
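The admission rules above (written only, contiguous, under the size limit) can be modeled as a small standalone predicate; the names and the struct layout below are hypothetical simplifications of the patch's defrag_get_next_segment logic, with offsets and lengths in 512-byte units as in the patch:

```c
#include <assert.h>
#include <stdbool.h>

struct ext { long long off, len; bool unwritten; };
struct seg { long long off, len; int nr; };

static const long long seg_limit = 32768;	/* 16MiB in 512-byte units */

/* Try to add an extent to a segment; return false if it must start a
 * new segment (unwritten, over the size limit, or not contiguous). */
static bool
seg_admit(struct seg *s, const struct ext *e)
{
	if (e->unwritten)			/* rule: no unwritten extents */
		return false;
	if (s->len + e->len > seg_limit)	/* rule: respect the size limit */
		return false;
	if (s->nr && s->off + s->len != e->off)	/* rule: no holes inside */
		return false;
	if (!s->nr)
		s->off = e->off;		/* first extent fixes the offset */
	s->len += e->len;
	s->nr++;
	return true;
}
```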

> 
> Is there an upper limit on the number of mappings per segment?
> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>> ---
>> spaceman/defrag.c | 204 ++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 204 insertions(+)
>> 
>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>> index c9732984..175cf461 100644
>> --- a/spaceman/defrag.c
>> +++ b/spaceman/defrag.c
>> @@ -14,6 +14,32 @@
>> #include "space.h"
>> #include "input.h"
>> 
>> +#define MAPSIZE 512
>> +/* used to fetch bmap */
>> +struct getbmapx g_mapx[MAPSIZE];
> 
> Each of these global arrays increases the bss segment size, which
> increases the overall footprint of xfs_spaceman, even when it's not
> being used to defragment files.
> 
> Could you switch this data to be dynamically allocated at the start of
> defrag_f and freed at the end?
> 

Yes, I can.
Though I am wondering whether the binary size is that much of a concern.
After adding defrag, xfs_spaceman has a size of 239K; I don’t think that’s too big.

$ ll -h spaceman/xfs_spaceman
-rwxrwxr-x 1 ubuntu ubuntu 239K Jul 10 17:20 spaceman/xfs_spaceman*



>> +/* current offset of the file in units of 512 bytes, used to fetch bmap */
>> +static long long  g_offset = 0;
> 
> Unnecessary space after the 'long long'.
> 

ok.

>> +/* index to indentify next extent, used to get next extent */
>> +static int g_ext_next_idx = -1;
>> +
>> +/*
>> + * segment, the smallest unit to defrag
>> + * it includes some contiguous extents.
>> + * no holes included,
>> + * no unwritten extents included
>> + * the size is limited by g_segment_size_lmt
>> + */
>> +struct defrag_segment {
>> + /* segment offset in units of 512 bytes */
>> + long long ds_offset;
>> + /* length of segment in units of 512 bytes */
>> + long long ds_length;
>> + /* number of extents in this segment */
>> + int ds_nr;
>> + /* flag indicating if segment contains shared blocks */
>> + bool ds_shared;
> 
> Maybe g_mapx belongs in here?  Wait, does a bunch of contiguous written
> bmapx records comprise a segment, or is a segment created from (possibly
> a subection of) a particular written bmapx record?

g_mapx is used to fetch the mappings from the beginning of the file to the end.
It:
1) can include at most 511 (MAPSIZE-1) extents/mappings
2) can fill more than one segment
3) can be part of a segment

We walk through the extents/mappings to form segments, and g_mapx is the buffer
used during that walk. For how a segment is formed, please see the other functions.

> 
>> +};
>> +
>> /* defrag segment size limit in units of 512 bytes */
>> #define MIN_SEGMENT_SIZE_LIMIT 8192 /* 4MiB */
>> #define DEFAULT_SEGMENT_SIZE_LIMIT 32768 /* 16MiB */
>> @@ -78,6 +104,165 @@ defrag_check_file(char *path)
>> return true;
>> }
>> 
>> +/*
>> + * get next extent in the file.
>> + * Note: next call will get the same extent unless move_next_extent() is called.
>> + * returns:
>> + * -1: error happened.
>> + * 0: extent returned
>> + * 1: no more extent left
>> + */
>> +static int
>> +defrag_get_next_extent(int fd, struct getbmapx *map_out)
>> +{
>> + int err = 0, i;
>> +
>> + /* when no extents are cached in g_mapx, fetch from kernel */
>> + if (g_ext_next_idx == -1) {
>> + g_mapx[0].bmv_offset = g_offset;
>> + g_mapx[0].bmv_length = -1LL;
>> + g_mapx[0].bmv_count = MAPSIZE;
>> + g_mapx[0].bmv_iflags = BMV_IF_NO_HOLES | BMV_IF_PREALLOC;
>> + err = ioctl(fd, XFS_IOC_GETBMAPX, g_mapx);
>> + if (err == -1) {
>> + perror("XFS_IOC_GETBMAPX failed");
>> + goto out;
>> + }
>> + /* for stats */
>> + g_ext_stats.nr_ext_total += g_mapx[0].bmv_entries;
>> +
>> + /* no more extents */
>> + if (g_mapx[0].bmv_entries == 0) {
>> + err = 1;
>> + goto out;
>> + }
>> +
>> + /* for stats */
>> + for (i = 1; i <= g_mapx[0].bmv_entries; i++) {
>> + if (g_mapx[i].bmv_oflags & BMV_OF_PREALLOC)
>> + g_ext_stats.nr_ext_unwritten++;
>> + if (g_mapx[i].bmv_oflags & BMV_OF_SHARED)
>> + g_ext_stats.nr_ext_shared++;
>> + }
>> +
>> + g_ext_next_idx = 1;
>> + g_offset = g_mapx[g_mapx[0].bmv_entries].bmv_offset +
>> + g_mapx[g_mapx[0].bmv_entries].bmv_length;
>> + }
> 
> Huh.  AFAICT, g_ext_next_idx/g_mapx effectively act as a cursor over the
> mappings for this file segment.  

Yes, right.

> In that case, shouldn't
> defrag_move_next_extent (which actually advances the cursor) be in
> charge of grabbing bmapx records from the kernel, and
> defrag_get_next_extent be the trivial helper to pass mappings to the
> consumer of the bmapx objects (aka defrag_get_next_segment)?
> 

defrag_move_next_extent() moves the cursor, while defrag_get_next_extent()
picks up the extent at the cursor without moving it.
There is a case where, if we added the current extent (returned by
defrag_get_next_extent()) to the current segment, that segment would become too
big. In that case we don’t add the extent to the current segment but add it to
the next segment instead; for the next segment, defrag_get_next_extent() is
called to fetch the extent in question again.
In summary, the cases where we move the cursor are:
1) the extent at the cursor is not a defrag target (unwritten or too big)
2) the extent was added to the current segment.
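This peek/advance split can be shown with a minimal cursor model; the names below are hypothetical, but the behavior mirrors the description above — get() returns the extent at the cursor without consuming it, so a rejected extent is seen again for the next segment, and move() is the only operation that advances:

```c
#include <assert.h>
#include <stdbool.h>

struct cursor {
	const int	*items;	/* stand-in for the bmapx records */
	int		nr;
	int		idx;
};

/* Peek: report the extent at the cursor; the cursor does not move. */
static bool
cur_get(const struct cursor *c, int *out)
{
	if (c->idx >= c->nr)
		return false;		/* no more extents */
	*out = c->items[c->idx];
	return true;
}

/* Advance: the only way the cursor moves forward. */
static void
cur_move(struct cursor *c)
{
	c->idx++;
}
```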

>> +
>> + map_out->bmv_offset = g_mapx[g_ext_next_idx].bmv_offset;
>> + map_out->bmv_length = g_mapx[g_ext_next_idx].bmv_length;
>> + map_out->bmv_oflags = g_mapx[g_ext_next_idx].bmv_oflags;
>> +out:
>> + return err;
>> +}
>> +
>> +/*
>> + * move to next extent
>> + */
>> +static void
>> +defrag_move_next_extent()
>> +{
>> + if (g_ext_next_idx == g_mapx[0].bmv_entries)
>> + g_ext_next_idx = -1;
>> + else
>> + g_ext_next_idx += 1;
>> +}
>> +
>> +/*
>> + * check if the given extent is a defrag target.
>> + * no need to check for holes as we are using BMV_IF_NO_HOLES
>> + */
>> +static bool
>> +defrag_is_target(struct getbmapx *mapx)
>> +{
>> + /* unwritten */
>> + if (mapx->bmv_oflags & BMV_OF_PREALLOC)
>> + return false;
>> + return mapx->bmv_length < g_segment_size_lmt;
>> +}
>> +
>> +static bool
>> +defrag_is_extent_shared(struct getbmapx *mapx)
>> +{
>> + return !!(mapx->bmv_oflags & BMV_OF_SHARED);
>> +}
>> +
>> +/*
>> + * get next segment to defragment.
>> + * returns:
>> + * -1 error happened.
>> + * 0 segment returned.
>> + * 1 no more segments to return
>> + */
>> +static int
>> +defrag_get_next_segment(int fd, struct defrag_segment *out)
>> +{
>> + struct getbmapx mapx;
>> + int ret;
>> +
>> + out->ds_offset = 0;
>> + out->ds_length = 0;
>> + out->ds_nr = 0;
>> + out->ds_shared = false;
>> +
>> + do {
>> + ret = defrag_get_next_extent(fd, &mapx);
>> + if (ret != 0) {
>> + /*
>> +  * no more extetns, return current segment if its not
>> +  * empty
>> + */
>> + if (ret == 1 && out->ds_nr > 0)
>> + ret = 0;
>> + /* otherwise, error heppened, stop */
>> + break;
>> + }
>> +
>> + /*
>> +  * If the extent is not a defrag target, skip it.
>> +  * go to next extent if the segment is empty;
>> +  * otherwise return the segment.
>> +  */
>> + if (!defrag_is_target(&mapx)) {
>> + defrag_move_next_extent();
>> + if (out->ds_nr == 0)
>> + continue;
>> + else
>> + break;
>> + }
>> +
>> + /* check for segment size limitation */
>> + if (out->ds_length + mapx.bmv_length > g_segment_size_lmt)
>> + break;
>> +
>> + /* the segment is empty now, add this extent to it for sure */
>> + if (out->ds_nr == 0) {
>> + out->ds_offset = mapx.bmv_offset;
>> + goto add_ext;
>> + }
>> +
>> + /*
>> +  * the segment is not empty, check for hole since the last exent
>> +  * if a hole exist before this extent, this extent can't be
>> +  * added to the segment. return the segment
>> +  */
>> + if (out->ds_offset + out->ds_length != mapx.bmv_offset)
>> + break;
>> +
>> +add_ext:
>> + if (defrag_is_extent_shared(&mapx))
>> + out->ds_shared = true;
>> +
>> + out->ds_length += mapx.bmv_length;
>> + out->ds_nr += 1;
> 
> OH, ok.  So we walk the mappings for a file.  If we can identify a run
> of contiguous written mappings, we define a segment to be the file range
> described by that run, up to whatever the maximum is (~4-16M).  Each of
> these segments is defragmented (somehow).  Is that correct?

Yes, correct.

>> + defrag_move_next_extent();
>> +
>> + } while (true);
>> +
>> + return ret;
>> +}
>> +
>> /*
>>  * defragment a file
>>  * return 0 if successfully done, 1 otherwise
>> @@ -92,6 +277,9 @@ defrag_xfs_defrag(char *file_path) {
>> struct fsxattr fsx;
>> int ret = 0;
>> 
>> + g_offset = 0;
>> + g_ext_next_idx = -1;
>> +
>> fsx.fsx_nextents = 0;
>> memset(&g_ext_stats, 0, sizeof(g_ext_stats));
>> 
>> @@ -119,6 +307,22 @@ defrag_xfs_defrag(char *file_path) {
>> ret = 1;
>> goto out;
>> }
>> +
>> + do {
>> + struct defrag_segment segment;
>> +
>> + ret = defrag_get_next_segment(defrag_fd, &segment);
>> + /* no more segments, we are done */
>> + if (ret == 1) {
>> + ret = 0;
>> + break;
>> + }
> 
> If you reverse the polarity of the 0/1 return values (aka return 1 if
> there is a segment and 0 if there is none) then you can shorten this
> loop to:
> 
> struct defrag_segment segment;
> int ret;
> 
> while ((ret = defrag_get_next_segment(...)) == 1) {
> /* process segment */
> }
> 
> return ret;
> 

Yes, that might be better. What I was thinking is that “0” usually means a ‘good’ return :D
I will consider this.

Thanks,
Wengang

> --D
> 
>> + /* error happened when reading bmap, stop here */
>> + if (ret == -1) {
>> + ret = 1;
>> + break;
>> + }
>> + } while (true);
>> out:
>> if (scratch_fd != -1) {
>> close(scratch_fd);
>> -- 
>> 2.39.3 (Apple Git-146)



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/9] spaceman/defrag: defrag segments
  2024-07-09 21:57   ` Darrick J. Wong
@ 2024-07-11 22:49     ` Wengang Wang
  2024-07-12 19:07       ` Wengang Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Wengang Wang @ 2024-07-11 22:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs@vger.kernel.org



> On Jul 9, 2024, at 2:57 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:22PM -0700, Wengang Wang wrote:
>> For each segment, the following steps are done trying to defrag it:
>> 
>> 1. share the segment with a temporary file
>> 2. unshare the segment in the target file. kernel simulates Cow on the whole
>>   segment complete the unshare (defrag).
>> 3. release blocks from the tempoary file.
>> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>> ---
>> spaceman/defrag.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 114 insertions(+)
>> 
>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>> index 175cf461..9f11e36b 100644
>> --- a/spaceman/defrag.c
>> +++ b/spaceman/defrag.c
>> @@ -263,6 +263,40 @@ add_ext:
>> return ret;
>> }
>> 
>> +/*
>> + * check if the segment exceeds EoF.
>> + * fix up the clone range and return true if EoF happens,
>> + * return false otherwise.
>> + */
>> +static bool
>> +defrag_clone_eof(struct file_clone_range *clone)
>> +{
>> + off_t delta;
>> +
>> + delta = clone->src_offset + clone->src_length - g_defrag_file_size;
>> + if (delta > 0) {
>> + clone->src_length = 0; // to the end
>> + return true;
>> + }
>> + return false;
>> +}
>> +
>> +/*
>> + * get the time delta since pre_time in ms.
>> + * pre_time should contains values fetched by gettimeofday()
>> + * cur_time is used to store current time by gettimeofday()
>> + */
>> +static long long
>> +get_time_delta_us(struct timeval *pre_time, struct timeval *cur_time)
>> +{
>> + long long us;
>> +
>> + gettimeofday(cur_time, NULL);
>> + us = (cur_time->tv_sec - pre_time->tv_sec) * 1000000;
>> + us += (cur_time->tv_usec - pre_time->tv_usec);
>> + return us;
>> +}
>> +
>> /*
>>  * defragment a file
>>  * return 0 if successfully done, 1 otherwise
>> @@ -273,6 +307,7 @@ defrag_xfs_defrag(char *file_path) {
>> long nr_seg_defrag = 0, nr_ext_defrag = 0;
>> int scratch_fd = -1, defrag_fd = -1;
>> char tmp_file_path[PATH_MAX+1];
>> + struct file_clone_range clone;
>> char *defrag_dir;
>> struct fsxattr fsx;
>> int ret = 0;
> 
> Now that I see this, you might want to straighten up the lines:
> 
> struct fsxattr fsx = { };
> long nr_seg_defrag = 0, nr_ext_defrag = 0;
> 
> etc.  Note the "= { }" bit that means you don't have to memset them to
> zero explicitly.

Nice!

> 
>> @@ -296,6 +331,8 @@ defrag_xfs_defrag(char *file_path) {
>> goto out;
>> }
>> 
>> + clone.src_fd = defrag_fd;
>> +
>> defrag_dir = dirname(file_path);
>> snprintf(tmp_file_path, PATH_MAX, "%s/.xfsdefrag_%d", defrag_dir,
>> getpid());
>> @@ -309,7 +346,11 @@ defrag_xfs_defrag(char *file_path) {
>> }
>> 
>> do {
>> + struct timeval t_clone, t_unshare, t_punch_hole;
>> struct defrag_segment segment;
>> + long long seg_size, seg_off;
>> + int time_delta;
>> + bool stop;
>> 
>> ret = defrag_get_next_segment(defrag_fd, &segment);
>> /* no more segments, we are done */
>> @@ -322,6 +363,79 @@ defrag_xfs_defrag(char *file_path) {
>> ret = 1;
>> break;
>> }
>> +
>> + /* we are done if the segment contains only 1 extent */
>> + if (segment.ds_nr < 2)
>> + continue;
>> +
>> + /* to bytes */
>> + seg_off = segment.ds_offset * 512;
>> + seg_size = segment.ds_length * 512;
>> +
>> + clone.src_offset = seg_off;
>> + clone.src_length = seg_size;
>> + clone.dest_offset = seg_off;
>> +
>> + /* checks for EoF and fix up clone */
>> + stop = defrag_clone_eof(&clone);
>> + gettimeofday(&t_clone, NULL);
>> + ret = ioctl(scratch_fd, FICLONERANGE, &clone);
> 
> Hm, should the top-level defrag_f function check in the
> filetable[i].fsgeom structure that the fs supports reflink?

Yes, good to know.

> 
>> + if (ret != 0) {
>> + fprintf(stderr, "FICLONERANGE failed %s\n",
>> + strerror(errno));
> 
> Might be useful to include the file_path in the error message:
> 
> /opt/a: FICLONERANGE failed Software caused connection abort
> 
> (maybe also put a semicolon before the strerror message?)

OK.

> 
>> + break;
>> + }
>> +
>> + /* for time stats */
>> + time_delta = get_time_delta_us(&t_clone, &t_unshare);
>> + if (time_delta > max_clone_us)
>> + max_clone_us = time_delta;
>> +
>> + /* for defrag stats */
>> + nr_ext_defrag += segment.ds_nr;
>> +
>> + /*
>> +  * For the shared range to be unshared via a copy-on-write
>> +  * operation in the file to be defragged. This causes the
>> +  * file needing to be defragged to have new extents allocated
>> +  * and the data to be copied over and written out.
>> +  */
>> + ret = fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, seg_off,
>> + seg_size);
>> + if (ret != 0) {
>> + fprintf(stderr, "UNSHARE_RANGE failed %s\n",
>> + strerror(errno));
>> + break;
>> + }
>> +
>> + /* for time stats */
>> + time_delta = get_time_delta_us(&t_unshare, &t_punch_hole);
>> + if (time_delta > max_unshare_us)
>> + max_unshare_us = time_delta;
>> +
>> + /*
>> +  * Punch out the original extents we shared to the
>> +  * scratch file so they are returned to free space.
>> +  */
>> + ret = fallocate(scratch_fd,
>> + FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, seg_off,
>> + seg_size);
> 
> Indentation here (two tabs for a continuation).  

OK.

> Or just ftruncate
> scratch_fd to zero bytes?  I think you have to do that for the EOF stuff
> to work, right?
> 

I’d truncate the UNSHARE range only in the loop.
EOF stuff would be truncated on (O_TMPFILE) file close.
The EOF stuff would be used for another purpose, see 
[PATCH 6/9] spaceman/defrag: workaround kernel

Thanks,
Wengang

> --D
> 
>> + if (ret != 0) {
>> + fprintf(stderr, "PUNCH_HOLE failed %s\n",
>> + strerror(errno));
>> + break;
>> + }
>> +
>> + /* for defrag stats */
>> + nr_seg_defrag += 1;
>> +
>> + /* for time stats */
>> + time_delta = get_time_delta_us(&t_punch_hole, &t_clone);
>> + if (time_delta > max_punch_us)
>> + max_punch_us = time_delta;
>> +
>> + if (stop)
>> + break;
>> } while (true);
>> out:
>> if (scratch_fd != -1) {
>> -- 
>> 2.39.3 (Apple Git-146)



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 4/9] spaceman/defrag: ctrl-c handler
  2024-07-09 21:08   ` Darrick J. Wong
@ 2024-07-11 22:58     ` Wengang Wang
  2024-07-15 22:56       ` Darrick J. Wong
  0 siblings, 1 reply; 60+ messages in thread
From: Wengang Wang @ 2024-07-11 22:58 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs@vger.kernel.org



> On Jul 9, 2024, at 2:08 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:23PM -0700, Wengang Wang wrote:
>> Add this handler to break the defrag better, so it has
>> 1. the stats reporting
>> 2. remove the temporary file
>> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>> ---
>> spaceman/defrag.c | 11 ++++++++++-
>> 1 file changed, 10 insertions(+), 1 deletion(-)
>> 
>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>> index 9f11e36b..61e47a43 100644
>> --- a/spaceman/defrag.c
>> +++ b/spaceman/defrag.c
>> @@ -297,6 +297,13 @@ get_time_delta_us(struct timeval *pre_time, struct timeval *cur_time)
>> return us;
>> }
>> 
>> +static volatile bool usedKilled = false;
>> +void defrag_sigint_handler(int dummy)
>> +{
>> + usedKilled = true;
> 
> Not sure why some of these variables are camelCase and others not.
> Or why this global variable doesn't have a g_ prefix like the others?
> 

Yep, will change it to g_user_killed.

>> + printf("Please wait until current segment is defragmented\n");
> 
> Is it actually safe to call printf from a signal handler?  Handlers must
> be very careful about what they call -- regreSSHion was a result of
> openssh not getting this right.
> 
> (Granted spaceman isn't as critical...)
> 

Since the UNSHARE ioctl takes time, the process may really keep running for a while
after the user's kill. The message is just a quick response to the user; it doesn't
have any real functionality. If it's not safe, we can remove the message.
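For reference, a minimal async-signal-safe sketch (hypothetical names, not the patch itself) that keeps the quick feedback: write(2) is on the POSIX async-signal-safe list, while printf() is not, and sig_atomic_t is the type guaranteed safe to set from a handler:

```c
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t g_user_killed;

/*
 * Hypothetical sketch: printf() is not async-signal-safe, but write(2)
 * is, so the quick "please wait" feedback can be kept safely.
 */
static void defrag_sigint_handler(int dummy)
{
	static const char msg[] =
		"Please wait until current segment is defragmented\n";

	(void)dummy;
	g_user_killed = 1;
	/* best effort; ignore short writes in this sketch */
	(void)write(STDERR_FILENO, msg, sizeof(msg) - 1);
}
```

The main loop would then poll g_user_killed between segments, exactly as the patch does with its flag.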


> Also would you rather SIGINT merely terminate the spaceman process?  I
> think the file locks drop on termination, right?

Another purpose of the handler is that I want to show stats like the ones below even when the process is killed:

Pre-defrag 54699 extents detected, 0 are "unwritten",0 are "shared"
Tried to defragment 54697 extents (939511808 bytes) in 57 segments
Time stats(ms): max clone: 33, max unshare: 2254, max punch_hole: 286
Post-defrag 12617 extents detected

Thanks,
Wengang

> 
> --D
> 
>> +};
>> +
>> /*
>>  * defragment a file
>>  * return 0 if successfully done, 1 otherwise
>> @@ -345,6 +352,8 @@ defrag_xfs_defrag(char *file_path) {
>> goto out;
>> }
>> 
>> + signal(SIGINT, defrag_sigint_handler);
>> +
>> do {
>> struct timeval t_clone, t_unshare, t_punch_hole;
>> struct defrag_segment segment;
>> @@ -434,7 +443,7 @@ defrag_xfs_defrag(char *file_path) {
>> if (time_delta > max_punch_us)
>> max_punch_us = time_delta;
>> 
>> - if (stop)
>> + if (stop || usedKilled)
>> break;
>> } while (true);
>> out:
>> -- 
>> 2.39.3 (Apple Git-146)




* Re: [PATCH 5/9] spaceman/defrag: exclude shared segments on low free space
  2024-07-09 21:05   ` Darrick J. Wong
@ 2024-07-11 23:08     ` Wengang Wang
  2024-07-15 22:58       ` Darrick J. Wong
  0 siblings, 1 reply; 60+ messages in thread
From: Wengang Wang @ 2024-07-11 23:08 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs@vger.kernel.org



> On Jul 9, 2024, at 2:05 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:24PM -0700, Wengang Wang wrote:
>> On some XFS, free blocks are over-committed to reflink copies.
>> And those free blocks are not enough if CoW happens to all the shared blocks.
> 
> Hmmm.  I think what you're trying to do here is avoid running a
> filesystem out of space because it defragmented files A, B, ... Z, each
> of which previously shared the same chunk of storage but now they don't
> because this defragger unshared them to reduce the extent count in those
> files.  Right?
> 

Yes.

> In that case, I wonder if it's a good idea to touch shared extents at
> all?  Someone set those files to share space, that's probably a better
> performance optimization than reducing extent count.

The question is:
Are the shared parts something that will be overwritten frequently?
If they are, copy-on-write would make those shared parts fragmented.
In that case we should defragment those parts; otherwise the defrag might not
defragment anything at all.
If instead the shared parts are not subject to frequent overwrites,
they are expected to remain in big extents, and choosing a proper segment size
would skip them.

But yes, we can add an option to simply skip those shared extents.

> 
> That said, you /could/ also use GETFSMAP to find all the other owners of
> a shared extent.  Then you can reflink the same extent to a scratch
> file, copy the contents to a new region in the scratch file, and use
> FIEDEDUPERANGE on each of A..Z to remap the new region into those files.
> Assuming the new region has fewer mappings than the old one it was
> copied from, you'll defragment A..Z while preserving the sharing factor.

Isn't that unsafe? Things may change after GETFSMAP.

> 
> I say that because I've written such a thing before; look for
> csp_evac_dedupe_fsmap in
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/commit/?h=defrag-freespace&id=785d2f024e31a0d0f52b04073a600f9139ef0b21
> 
>> This defrag tool would exclude shared segments when free space is under shrethold.
> 
> "threshold"

OK.

Thanks
Wengang
> 
> --D
> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>> ---
>> spaceman/defrag.c | 46 +++++++++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 43 insertions(+), 3 deletions(-)
>> 
>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>> index 61e47a43..f8e6713c 100644
>> --- a/spaceman/defrag.c
>> +++ b/spaceman/defrag.c
>> @@ -304,6 +304,29 @@ void defrag_sigint_handler(int dummy)
>> printf("Please wait until current segment is defragmented\n");
>> };
>> 
>> +/*
>> + * limitation of filesystem free space in bytes.
>> + * when filesystem has less free space than this number, segments which contain
>> + * shared extents are skipped. 1GiB by default
>> + */
>> +static long g_limit_free_bytes = 1024 * 1024 * 1024;
>> +
>> +/*
>> + * check if the free space in the FS is less than the _limit_
>> + * return true if so, false otherwise
>> + */
>> +static bool
>> +defrag_fs_limit_hit(int fd)
>> +{
>> + struct statfs statfs_s;
>> +
>> + if (g_limit_free_bytes <= 0)
>> + return false;
>> +
>> + fstatfs(fd, &statfs_s);
>> + return statfs_s.f_bsize * statfs_s.f_bavail < g_limit_free_bytes;
>> +}
>> +
>> /*
>>  * defragment a file
>>  * return 0 if successfully done, 1 otherwise
>> @@ -377,6 +400,15 @@ defrag_xfs_defrag(char *file_path) {
>> if (segment.ds_nr < 2)
>> continue;
>> 
>> + /*
>> + * When the segment is (partially) shared, defrag would
>> + * consume free blocks. We check the limit of FS free blocks
>> + * and skip defragmenting this segment in case the limit is
>> + * reached.
>> + */
>> + if (segment.ds_shared && defrag_fs_limit_hit(defrag_fd))
>> + continue;
>> +
>> /* to bytes */
>> seg_off = segment.ds_offset * 512;
>> seg_size = segment.ds_length * 512;
>> @@ -478,7 +510,11 @@ static void defrag_help(void)
>> "can be served durning the defragmentations.\n"
>> "\n"
>> " -s segment_size    -- specify the segment size in MiB, minmum value is 4 \n"
>> -"                       default is 16\n"));
>> +"                       default is 16\n"
>> +" -f free_space      -- specify shrethod of the XFS free space in MiB, when\n"
>> +"                       XFS free space is lower than that, shared segments \n"
>> +"                       are excluded from defragmentation, 1024 by default\n"
>> + ));
>> }
>> 
>> static cmdinfo_t defrag_cmd;
>> @@ -489,7 +525,7 @@ defrag_f(int argc, char **argv)
>> int i;
>> int c;
>> 
>> - while ((c = getopt(argc, argv, "s:")) != EOF) {
>> + while ((c = getopt(argc, argv, "s:f:")) != EOF) {
>> switch(c) {
>> case 's':
>> g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
>> @@ -499,6 +535,10 @@ defrag_f(int argc, char **argv)
>> g_segment_size_lmt);
>> }
>> break;
>> + case 'f':
>> + g_limit_free_bytes = atol(optarg) * 1024 * 1024;
>> + break;
>> +
>> default:
>> command_usage(&defrag_cmd);
>> return 1;
>> @@ -516,7 +556,7 @@ void defrag_init(void)
>> defrag_cmd.cfunc = defrag_f;
>> defrag_cmd.argmin = 0;
>> defrag_cmd.argmax = 4;
>> - defrag_cmd.args = "[-s segment_size]";
>> + defrag_cmd.args = "[-s segment_size] [-f free_space]";
>> defrag_cmd.flags = CMD_FLAG_ONESHOT;
>> defrag_cmd.oneline = _("Defragment XFS files");
>> defrag_cmd.help = defrag_help;
>> -- 
>> 2.39.3 (Apple Git-146)
>> 
>> 
> 



* Re: [PATCH 6/9] spaceman/defrag: workaround kernel xfs_reflink_try_clear_inode_flag()
  2024-07-09 20:51   ` Darrick J. Wong
@ 2024-07-11 23:11     ` Wengang Wang
  0 siblings, 0 replies; 60+ messages in thread
From: Wengang Wang @ 2024-07-11 23:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs@vger.kernel.org



> On Jul 9, 2024, at 1:51 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:25PM -0700, Wengang Wang wrote:
>> xfs_reflink_try_clear_inode_flag() takes very long in case file has huge number
>> of extents and none of the extents are shared.
>> 
>> workaround:
>> share the first real extent so that xfs_reflink_try_clear_inode_flag() returns
>> quickly to save cpu times and speed up defrag significantly.
> 
> I wonder if a better solution would be to change xfs_reflink_unshare
> only to try to clear the reflink iflag if offset/len cover the entire
> file?  It's a pity we can't set time budgets on fallocate requests.

Yep.
Anyway that change, if it ever happens, would be in the kernel.
The -n option can be used to disable this workaround in defrag.

Thanks,
Wengang

> 
> --D
> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>> ---
>> spaceman/defrag.c | 174 +++++++++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 172 insertions(+), 2 deletions(-)
>> 
>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>> index f8e6713c..b5c5b187 100644
>> --- a/spaceman/defrag.c
>> +++ b/spaceman/defrag.c
>> @@ -327,6 +327,155 @@ defrag_fs_limit_hit(int fd)
>> return statfs_s.f_bsize * statfs_s.f_bavail < g_limit_free_bytes;
>> }
>> 
>> +static bool g_enable_first_ext_share = true;
>> +
>> +static int
>> +defrag_get_first_real_ext(int fd, struct getbmapx *mapx)
>> +{
>> + int err;
>> +
>> + while (1) {
>> + err = defrag_get_next_extent(fd, mapx);
>> + if (err)
>> + break;
>> +
>> + defrag_move_next_extent();
>> + if (!(mapx->bmv_oflags & BMV_OF_PREALLOC))
>> + break;
>> + }
>> + return err;
>> +}
>> +
>> +static __u64 g_share_offset = -1ULL;
>> +static __u64 g_share_len = 0ULL;
>> +#define SHARE_MAX_SIZE 32768  /* 32KiB */
>> +
>> +/* share the first real extent with scrach */
>> +static void
>> +defrag_share_first_extent(int defrag_fd, int scratch_fd)
>> +{
>> +#define OFFSET_1PB 0x4000000000000LL
>> + struct file_clone_range clone;
>> + struct getbmapx mapx;
>> + int err;
>> +
>> + if (g_enable_first_ext_share == false)
>> + return;
>> +
>> + err = defrag_get_first_real_ext(defrag_fd, &mapx);
>> + if (err)
>> + return;
>> +
>> + clone.src_fd = defrag_fd;
>> + clone.src_offset = mapx.bmv_offset * 512;
>> + clone.src_length = mapx.bmv_length * 512;
>> + /* shares at most SHARE_MAX_SIZE length */
>> + if (clone.src_length > SHARE_MAX_SIZE)
>> + clone.src_length = SHARE_MAX_SIZE;
>> + clone.dest_offset = OFFSET_1PB + clone.src_offset;
>> + /* if the first is extent is reaching the EoF, no need to share */
>> + if (clone.src_offset + clone.src_length >= g_defrag_file_size)
>> + return;
>> + err = ioctl(scratch_fd, FICLONERANGE, &clone);
>> + if (err != 0) {
>> + fprintf(stderr, "cloning first extent failed: %s\n",
>> + strerror(errno));
>> + return;
>> + }
>> +
>> + /* safe the offset and length for re-share */
>> + g_share_offset = clone.src_offset;
>> + g_share_len = clone.src_length;
>> +}
>> +
>> +/* re-share the blocks we shared previous if then are no longer shared */
>> +static void
>> +defrag_reshare_blocks_in_front(int defrag_fd, int scratch_fd)
>> +{
>> +#define NR_GET_EXT 9
>> + struct getbmapx mapx[NR_GET_EXT];
>> + struct file_clone_range clone;
>> + __u64 new_share_len;
>> + int idx, err;
>> +
>> + if (g_enable_first_ext_share == false)
>> + return;
>> +
>> + if (g_share_len == 0ULL)
>> + return;
>> +
>> + /*
>> + * check if previous shareing still exist
>> + * we are done if (partially) so.
>> + */
>> + mapx[0].bmv_offset = g_share_offset;
>> + mapx[0].bmv_length = g_share_len;
>> + mapx[0].bmv_count = NR_GET_EXT;
>> + mapx[0].bmv_iflags = BMV_IF_NO_HOLES | BMV_IF_PREALLOC;
>> + err = ioctl(defrag_fd, XFS_IOC_GETBMAPX, mapx);
>> + if (err) {
>> + fprintf(stderr, "XFS_IOC_GETBMAPX failed %s\n",
>> + strerror(errno));
>> + /* won't try share again */
>> + g_share_len = 0ULL;
>> + return;
>> + }
>> +
>> + if (mapx[0].bmv_entries == 0) {
>> + /* shared blocks all became hole, won't try share again */
>> + g_share_len = 0ULL;
>> + return;
>> + }
>> +
>> + if (g_share_offset != 512 * mapx[1].bmv_offset) {
>> + /* first shared block became hole, won't try share again */
>> + g_share_len = 0ULL;
>> + return;
>> + }
>> +
>> + /* we check up to only the first NR_GET_EXT - 1 extents */
>> + for (idx = 1; idx <= mapx[0].bmv_entries; idx++) {
>> + if (mapx[idx].bmv_oflags & BMV_OF_SHARED) {
>> + /* some blocks still shared, done */
>> + return;
>> + }
>> + }
>> +
>> + /*
>> + * The previously shared blocks are no longer shared, re-share.
>> + * deallocate the blocks in scrath file first
>> + */
>> + err = fallocate(scratch_fd,
>> + FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,
>> + OFFSET_1PB + g_share_offset, g_share_len);
>> + if (err != 0) {
>> + fprintf(stderr, "punch hole failed %s\n",
>> + strerror(errno));
>> + g_share_len = 0;
>> + return;
>> + }
>> +
>> + new_share_len = 512 * mapx[1].bmv_length;
>> + if (new_share_len > SHARE_MAX_SIZE)
>> + new_share_len = SHARE_MAX_SIZE;
>> +
>> + clone.src_fd = defrag_fd;
>> + /* keep starting offset unchanged */
>> + clone.src_offset = g_share_offset;
>> + clone.src_length = new_share_len;
>> + clone.dest_offset = OFFSET_1PB + clone.src_offset;
>> +
>> + err = ioctl(scratch_fd, FICLONERANGE, &clone);
>> + if (err) {
>> + fprintf(stderr, "FICLONERANGE failed %s\n",
>> + strerror(errno));
>> + g_share_len = 0;
>> + return;
>> + }
>> +
>> + g_share_len = new_share_len;
>> + }
>> +
>> /*
>>  * defragment a file
>>  * return 0 if successfully done, 1 otherwise
>> @@ -377,6 +526,12 @@ defrag_xfs_defrag(char *file_path) {
>> 
>> signal(SIGINT, defrag_sigint_handler);
>> 
>> + /*
>> + * share the first extent to work around kernel consuming time
>> + * in xfs_reflink_try_clear_inode_flag()
>> + */
>> + defrag_share_first_extent(defrag_fd, scratch_fd);
>> +
>> do {
>> struct timeval t_clone, t_unshare, t_punch_hole;
>> struct defrag_segment segment;
>> @@ -454,6 +609,15 @@ defrag_xfs_defrag(char *file_path) {
>> if (time_delta > max_unshare_us)
>> max_unshare_us = time_delta;
>> 
>> + /*
>> + * if unshare used more than 1 second, time is very possibly
>> + * used in checking if the file is sharing extents now.
>> + * to avoid that happen again we re-share the blocks in front
>> + * to workaround that.
>> + */
>> + if (time_delta > 1000000)
>> + defrag_reshare_blocks_in_front(defrag_fd, scratch_fd);
>> +
>> /*
>> * Punch out the original extents we shared to the
>> * scratch file so they are returned to free space.
>> @@ -514,6 +678,8 @@ static void defrag_help(void)
>> " -f free_space      -- specify shrethod of the XFS free space in MiB, when\n"
>> "                       XFS free space is lower than that, shared segments \n"
>> "                       are excluded from defragmentation, 1024 by default\n"
>> +" -n                 -- disable the \"share first extent\" featue, it's\n"
>> +"                       enabled by default to speed up\n"
>> ));
>> }
>> 
>> @@ -525,7 +691,7 @@ defrag_f(int argc, char **argv)
>> int i;
>> int c;
>> 
>> - while ((c = getopt(argc, argv, "s:f:")) != EOF) {
>> + while ((c = getopt(argc, argv, "s:f:n")) != EOF) {
>> switch(c) {
>> case 's':
>> g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
>> @@ -539,6 +705,10 @@ defrag_f(int argc, char **argv)
>> g_limit_free_bytes = atol(optarg) * 1024 * 1024;
>> break;
>> 
>> + case 'n':
>> + g_enable_first_ext_share = false;
>> + break;
>> +
>> default:
>> command_usage(&defrag_cmd);
>> return 1;
>> @@ -556,7 +726,7 @@ void defrag_init(void)
>> defrag_cmd.cfunc = defrag_f;
>> defrag_cmd.argmin = 0;
>> defrag_cmd.argmax = 4;
>> - defrag_cmd.args = "[-s segment_size] [-f free_space]";
>> + defrag_cmd.args = "[-s segment_size] [-f free_space] [-n]";
>> defrag_cmd.flags = CMD_FLAG_ONESHOT;
>> defrag_cmd.oneline = _("Defragment XFS files");
>> defrag_cmd.help = defrag_help;
>> -- 
>> 2.39.3 (Apple Git-146)
>> 
>> 



* Re: [PATCH 7/9] spaceman/defrag: sleeps between segments
  2024-07-09 20:46   ` Darrick J. Wong
@ 2024-07-11 23:26     ` Wengang Wang
  2024-07-11 23:30     ` Wengang Wang
  1 sibling, 0 replies; 60+ messages in thread
From: Wengang Wang @ 2024-07-11 23:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs@vger.kernel.org



> On Jul 9, 2024, at 1:46 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:26PM -0700, Wengang Wang wrote:
>> Let user contol the time to sleep between segments (file unlocked) to
>> balance defrag performance and file IO servicing time.
>> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>> ---
>> spaceman/defrag.c | 26 ++++++++++++++++++++++++--
>> 1 file changed, 24 insertions(+), 2 deletions(-)
>> 
>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>> index b5c5b187..415fe9c2 100644
>> --- a/spaceman/defrag.c
>> +++ b/spaceman/defrag.c
>> @@ -311,6 +311,9 @@ void defrag_sigint_handler(int dummy)
>>  */
>> static long g_limit_free_bytes = 1024 * 1024 * 1024;
>> 
>> +/* sleep time in us between segments, overwritten by paramter */
>> +static int g_idle_time = 250 * 1000;
>> +
>> /*
>>  * check if the free space in the FS is less than the _limit_
>>  * return true if so, false otherwise
>> @@ -487,6 +490,7 @@ defrag_xfs_defrag(char *file_path) {
>> int scratch_fd = -1, defrag_fd = -1;
>> char tmp_file_path[PATH_MAX+1];
>> struct file_clone_range clone;
>> + int sleep_time_us = 0;
>> char *defrag_dir;
>> struct fsxattr fsx;
>> int ret = 0;
>> @@ -574,6 +578,9 @@ defrag_xfs_defrag(char *file_path) {
>> 
>> /* checks for EoF and fix up clone */
>> stop = defrag_clone_eof(&clone);
>> + if (sleep_time_us > 0)
>> + usleep(sleep_time_us);
>> +
>> gettimeofday(&t_clone, NULL);
>> ret = ioctl(scratch_fd, FICLONERANGE, &clone);
>> if (ret != 0) {
>> @@ -587,6 +594,10 @@ defrag_xfs_defrag(char *file_path) {
>> if (time_delta > max_clone_us)
>> max_clone_us = time_delta;
>> 
>> + /* sleeps if clone cost more than 500ms, slow FS */
> 
> Why half a second?  I sense that what you're getting at is that you want
> to limit file io latency spikes in other programs by relaxing the defrag
> program, right?  But the help screen doesn't say anything about "only if
> the clone lasts more than 500ms".

This is an optional sleep for a very slow FS where CLONE takes long.
Actually, we didn't hit this case in our local tests.

The main sleep is above:

 40 +               if (sleep_time_us > 0)
 41 +                       usleep(sleep_time_us);

> 
>> + if (time_delta >= 500000 && g_idle_time > 0)
>> + usleep(g_idle_time);
> 
> These days, I wonder if it makes more sense to provide a CPU utilization
> target and let the kernel figure out how much sleeping that is:
> 
> $ systemd-run -p 'CPUQuota=60%' xfs_spaceman -c 'defrag' /path/to/file
> 
> The tradeoff here is that we as application writers no longer have to
> implement these clunky sleeps ourselves, but then one has to turn on cpu
> accounting in systemd (if there even /is/ a systemd).  Also I suppose we
> don't want this program getting throttled while it's holding a file
> lock.
> 

Yes, we want as little locking time as possible.

The slowness mainly comes from slow disks. In our tests, when the page cache
is empty, CPU usage is only about 6% on my VM (the real physical machines would
use spindle disks). It would be higher for NVMe.
I'd like to give the user a way to strike a balance, e.g. a longer sleep time
between segments so that IO is serviced better.
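The idle-time accounting can be sketched as a small pure helper (hypothetical name, not in the patch): the per-segment sleep is the configured idle budget minus the time the target file already spent unlocked, e.g. while punching holes in the scratch file:

```c
/*
 * Hypothetical sketch of the idle-time accounting: sleep for the
 * configured idle budget minus the time the target file was already
 * unlocked, clamped at zero so a slow punch-hole never yields a
 * negative sleep.
 */
static long segment_sleep_us(long idle_budget_us, long unlocked_us)
{
	long sleep_us = idle_budget_us - unlocked_us;

	return sleep_us > 0 ? sleep_us : 0;
}
```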

Thanks,
Wengang

> --D
> 
>> +
>> /* for defrag stats */
>> nr_ext_defrag += segment.ds_nr;
>> 
>> @@ -641,6 +652,12 @@ defrag_xfs_defrag(char *file_path) {
>> 
>> if (stop || usedKilled)
>> break;
>> +
>> + /*
>> +  * no lock on target file when punching hole from scratch file,
>> +  * so minus the time used for punching hole
>> +  */
>> + sleep_time_us = g_idle_time - time_delta;
>> } while (true);
>> out:
>> if (scratch_fd != -1) {
>> @@ -678,6 +695,7 @@ static void defrag_help(void)
>> " -f free_space      -- specify shrethod of the XFS free space in MiB, when\n"
>> "                       XFS free space is lower than that, shared segments \n"
>> "                       are excluded from defragmentation, 1024 by default\n"
>> +" -i idle_time       -- time in ms to be idle between segments, 250ms by default\n"
>> " -n                 -- disable the \"share first extent\" featue, it's\n"
>> "                       enabled by default to speed up\n"
>> ));
>> @@ -691,7 +709,7 @@ defrag_f(int argc, char **argv)
>> int i;
>> int c;
>> 
>> - while ((c = getopt(argc, argv, "s:f:n")) != EOF) {
>> + while ((c = getopt(argc, argv, "s:f:ni")) != EOF) {
>> switch(c) {
>> case 's':
>> g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
>> @@ -709,6 +727,10 @@ defrag_f(int argc, char **argv)
>> g_enable_first_ext_share = false;
>> break;
>> 
>> + case 'i':
>> + g_idle_time = atoi(optarg) * 1000;
> 
> Should we complain if optarg is non-integer garbage?  Or if g_idle_time
> is larger than 1s?
> 
> --D
> 
>> + break;
>> +
>> default:
>> command_usage(&defrag_cmd);
>> return 1;
>> @@ -726,7 +748,7 @@ void defrag_init(void)
>> defrag_cmd.cfunc = defrag_f;
>> defrag_cmd.argmin = 0;
>> defrag_cmd.argmax = 4;
>> - defrag_cmd.args = "[-s segment_size] [-f free_space] [-n]";
>> + defrag_cmd.args = "[-s segment_size] [-f free_space] [-i idle_time] [-n]";
>> defrag_cmd.flags = CMD_FLAG_ONESHOT;
>> defrag_cmd.oneline = _("Defragment XFS files");
>> defrag_cmd.help = defrag_help;
>> -- 
>> 2.39.3 (Apple Git-146)




* Re: [PATCH 8/9] spaceman/defrag: readahead for better performance
  2024-07-09 20:27   ` Darrick J. Wong
@ 2024-07-11 23:29     ` Wengang Wang
  0 siblings, 0 replies; 60+ messages in thread
From: Wengang Wang @ 2024-07-11 23:29 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs@vger.kernel.org



> On Jul 9, 2024, at 1:27 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:27PM -0700, Wengang Wang wrote:
>> Reading ahead take less lock on file compared to "unshare" the file via ioctl.
>> Do readahead when defrag sleeps for better defrag performace and thus more
>> file IO time.
>> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>> ---
>> spaceman/defrag.c | 21 ++++++++++++++++++++-
>> 1 file changed, 20 insertions(+), 1 deletion(-)
>> 
>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>> index 415fe9c2..ab8508bb 100644
>> --- a/spaceman/defrag.c
>> +++ b/spaceman/defrag.c
>> @@ -331,6 +331,18 @@ defrag_fs_limit_hit(int fd)
>> }
>> 
>> static bool g_enable_first_ext_share = true;
>> +static bool g_readahead = false;
>> +
>> +static void defrag_readahead(int defrag_fd, off64_t offset, size_t count)
>> +{
>> + if (!g_readahead || g_idle_time <= 0)
>> + return;
>> +
>> + if (readahead(defrag_fd, offset, count) < 0) {
>> + fprintf(stderr, "readahead failed: %s, errno=%d\n",
>> + strerror(errno), errno);
> 
> Why is it worth reporting if readahead fails?  Won't the unshare also
> fail?  

Yes, if readahead fails, I think the later unshare should fail too.
I just want to capture the error as soon as possible, though readahead is not critical.

> I'm also wondering why we wouldn't want readahead all the time?

As per our tests, readahead on NVMe didn't behave better,
so I'd make it an option.

Thanks,
Wengang

> 
> --D
> 
>> + }
>> +}
>> 
>> static int
>> defrag_get_first_real_ext(int fd, struct getbmapx *mapx)
>> @@ -578,6 +590,8 @@ defrag_xfs_defrag(char *file_path) {
>> 
>> /* checks for EoF and fix up clone */
>> stop = defrag_clone_eof(&clone);
>> + defrag_readahead(defrag_fd, seg_off, seg_size);
>> +
>> if (sleep_time_us > 0)
>> usleep(sleep_time_us);
>> 
>> @@ -698,6 +712,7 @@ static void defrag_help(void)
>> " -i idle_time       -- time in ms to be idle between segments, 250ms by default\n"
>> " -n                 -- disable the \"share first extent\" featue, it's\n"
>> "                       enabled by default to speed up\n"
>> +" -a                 -- do readahead to speed up defrag, disabled by default\n"
>> ));
>> }
>> 
>> @@ -709,7 +724,7 @@ defrag_f(int argc, char **argv)
>> int i;
>> int c;
>> 
>> - while ((c = getopt(argc, argv, "s:f:ni")) != EOF) {
>> + while ((c = getopt(argc, argv, "s:f:nia")) != EOF) {
>> switch(c) {
>> case 's':
>> g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
>> @@ -731,6 +746,10 @@ defrag_f(int argc, char **argv)
>> g_idle_time = atoi(optarg) * 1000;
>> break;
>> 
>> + case 'a':
>> + g_readahead = true;
>> + break;
>> +
>> default:
>> command_usage(&defrag_cmd);
>> return 1;
>> -- 
>> 2.39.3 (Apple Git-146)




* Re: [PATCH 7/9] spaceman/defrag: sleeps between segments
  2024-07-09 20:46   ` Darrick J. Wong
  2024-07-11 23:26     ` Wengang Wang
@ 2024-07-11 23:30     ` Wengang Wang
  1 sibling, 0 replies; 60+ messages in thread
From: Wengang Wang @ 2024-07-11 23:30 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs@vger.kernel.org



> On Jul 9, 2024, at 1:46 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:26PM -0700, Wengang Wang wrote:
>> Let user contol the time to sleep between segments (file unlocked) to
>> balance defrag performance and file IO servicing time.
>> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>> ---
>> spaceman/defrag.c | 26 ++++++++++++++++++++++++--
>> 1 file changed, 24 insertions(+), 2 deletions(-)
>> 
>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>> index b5c5b187..415fe9c2 100644
>> --- a/spaceman/defrag.c
>> +++ b/spaceman/defrag.c
>> @@ -311,6 +311,9 @@ void defrag_sigint_handler(int dummy)
>>  */
>> static long g_limit_free_bytes = 1024 * 1024 * 1024;
>> 
>> +/* sleep time in us between segments, overwritten by paramter */
>> +static int g_idle_time = 250 * 1000;
>> +
>> /*
>>  * check if the free space in the FS is less than the _limit_
>>  * return true if so, false otherwise
>> @@ -487,6 +490,7 @@ defrag_xfs_defrag(char *file_path) {
>> int scratch_fd = -1, defrag_fd = -1;
>> char tmp_file_path[PATH_MAX+1];
>> struct file_clone_range clone;
>> + int sleep_time_us = 0;
>> char *defrag_dir;
>> struct fsxattr fsx;
>> int ret = 0;
>> @@ -574,6 +578,9 @@ defrag_xfs_defrag(char *file_path) {
>> 
>> /* checks for EoF and fix up clone */
>> stop = defrag_clone_eof(&clone);
>> + if (sleep_time_us > 0)
>> + usleep(sleep_time_us);
>> +
>> gettimeofday(&t_clone, NULL);
>> ret = ioctl(scratch_fd, FICLONERANGE, &clone);
>> if (ret != 0) {
>> @@ -587,6 +594,10 @@ defrag_xfs_defrag(char *file_path) {
>> if (time_delta > max_clone_us)
>> max_clone_us = time_delta;
>> 
>> + /* sleeps if clone cost more than 500ms, slow FS */
> 
> Why half a second?  I sense that what you're getting at is that you want
> to limit file io latency spikes in other programs by relaxing the defrag
> program, right?  But the help screen doesn't say anything about "only if
> the clone lasts more than 500ms".
> 
>> + if (time_delta >= 500000 && g_idle_time > 0)
>> + usleep(g_idle_time);
> 
> These days, I wonder if it makes more sense to provide a CPU utilization
> target and let the kernel figure out how much sleeping that is:
> 
> $ systemd-run -p 'CPUQuota=60%' xfs_spaceman -c 'defrag' /path/to/file
> 
> The tradeoff here is that we as application writers no longer have to
> implement these clunky sleeps ourselves, but then one has to turn on cpu
> accounting in systemd (if there even /is/ a systemd).  Also I suppose we
> don't want this program getting throttled while it's holding a file
> lock.
> 
> --D
> 
>> +
>> /* for defrag stats */
>> nr_ext_defrag += segment.ds_nr;
>> 
>> @@ -641,6 +652,12 @@ defrag_xfs_defrag(char *file_path) {
>> 
>> if (stop || usedKilled)
>> break;
>> +
>> + /*
>> +  * no lock on target file when punching hole from scratch file,
>> +  * so minus the time used for punching hole
>> +  */
>> + sleep_time_us = g_idle_time - time_delta;
>> } while (true);
>> out:
>> if (scratch_fd != -1) {
>> @@ -678,6 +695,7 @@ static void defrag_help(void)
>> " -f free_space      -- specify shrethod of the XFS free space in MiB, when\n"
>> "                       XFS free space is lower than that, shared segments \n"
>> "                       are excluded from defragmentation, 1024 by default\n"
>> +" -i idle_time       -- time in ms to be idle between segments, 250ms by default\n"
>> " -n                 -- disable the \"share first extent\" featue, it's\n"
>> "                       enabled by default to speed up\n"
>> ));
>> @@ -691,7 +709,7 @@ defrag_f(int argc, char **argv)
>> int i;
>> int c;
>> 
>> - while ((c = getopt(argc, argv, "s:f:n")) != EOF) {
>> + while ((c = getopt(argc, argv, "s:f:ni")) != EOF) {
>> switch(c) {
>> case 's':
>> g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
>> @@ -709,6 +727,10 @@ defrag_f(int argc, char **argv)
>> g_enable_first_ext_share = false;
>> break;
>> 
>> + case 'i':
>> + g_idle_time = atoi(optarg) * 1000;
> 
> Should we complain if optarg is non-integer garbage?  Or if g_idle_time
> is larger than 1s?

It's the user's responsibility :D.

Thanks,
Wengang
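That said, if we did want to reject non-integer garbage as suggested, a strict parser along these lines (hypothetical helper, not in the patch) would do it in place of atoi():

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical sketch: strictly parse a non-negative decimal argument,
 * rejecting empty strings, trailing garbage, and out-of-range values.
 * Returns the parsed value, or -1 on any parse error.
 */
static long parse_nonneg_arg(const char *arg)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(arg, &end, 10);
	if (errno != 0 || end == arg || *end != '\0' || val < 0)
		return -1;
	return val;
}
```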

> 
> --D
> 
>> + break;
>> +
>> default:
>> command_usage(&defrag_cmd);
>> return 1;
>> @@ -726,7 +748,7 @@ void defrag_init(void)
>> defrag_cmd.cfunc = defrag_f;
>> defrag_cmd.argmin = 0;
>> defrag_cmd.argmax = 4;
>> - defrag_cmd.args = "[-s segment_size] [-f free_space] [-n]";
>> + defrag_cmd.args = "[-s segment_size] [-f free_space] [-i idle_time] [-n]";
>> defrag_cmd.flags = CMD_FLAG_ONESHOT;
>> defrag_cmd.oneline = _("Defragment XFS files");
>> defrag_cmd.help = defrag_help;
>> -- 
>> 2.39.3 (Apple Git-146)




* Re: [PATCH 9/9] spaceman/defrag: warn on extsize
  2024-07-09 20:21   ` Darrick J. Wong
@ 2024-07-11 23:36     ` Wengang Wang
  2024-07-16  0:29       ` Dave Chinner
  0 siblings, 1 reply; 60+ messages in thread
From: Wengang Wang @ 2024-07-11 23:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs@vger.kernel.org



> On Jul 9, 2024, at 1:21 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:28PM -0700, Wengang Wang wrote:
>> According to current kernel implemenation, non-zero extsize might affect
>> the result of defragmentation.
>> Just print a warning on that if non-zero extsize is set on file.
> 
> I'm not sure what's the point of warning vaguely about extent size
> hints?  I'd have thought that would help reduce the number of extents;
> is that not the case?

Not exactly.

For the same 1G file with about 54K extents:

With a 16K extsize, its extent count drops to 13K after defrag.
With a 0 extsize, its extent count drops to 22 after defrag.

The above was tested with UEK6 (5.4 kernel). I can get the numbers on mainline if you want.


> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>> ---
>> spaceman/defrag.c | 12 ++++++++++++
>> 1 file changed, 12 insertions(+)
>> 
>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>> index ab8508bb..b6b89dd9 100644
>> --- a/spaceman/defrag.c
>> +++ b/spaceman/defrag.c
>> @@ -526,6 +526,18 @@ defrag_xfs_defrag(char *file_path) {
>> goto out;
>> }
>> 
>> +       if (ioctl(defrag_fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
>> +               fprintf(stderr, "FSGETXATTR failed %s\n",
>> +                       strerror(errno));
> 
> Also we usually indent continuations by two tabs (not one) so that the
> continuation is more obvious:
> 
> fprintf(stderr, "FSGETXATTR failed %s\n",
> strerror(errno));

OK.

> 
>> +               ret = 1;
>> +               goto out;
>> +       }
>> +
>> +       if (fsx.fsx_extsize != 0)
>> +               fprintf(stderr, "%s has extsize set %d. That might affect defrag "
>> +                       "according to kernel implementation\n",
> 
> Format strings in userspace printf calls should be wrapped so that
> gettext can provide translated versions:
> 
> fprintf(stderr, _("%s has extsize...\n"), file_path...);
> 
> (I know, xfsprogs isn't as consistent as it probably ought to be...)
> 

OK.

Thanks,
Wengang
> --D
> 
>> +                       file_path, fsx.fsx_extsize);
>> +
>> clone.src_fd = defrag_fd;
>> 
>> defrag_dir = dirname(file_path);
>> -- 
>> 2.39.3 (Apple Git-146)




* Re: [PATCH 3/9] spaceman/defrag: defrag segments
  2024-07-11 22:49     ` Wengang Wang
@ 2024-07-12 19:07       ` Wengang Wang
  2024-07-15 22:42         ` Darrick J. Wong
  0 siblings, 1 reply; 60+ messages in thread
From: Wengang Wang @ 2024-07-12 19:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs@vger.kernel.org



> On Jul 11, 2024, at 3:49 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
> 
> 
> 
>> On Jul 9, 2024, at 2:57 PM, Darrick J. Wong <djwong@kernel.org> wrote:
>> 
>> On Tue, Jul 09, 2024 at 12:10:22PM -0700, Wengang Wang wrote:
>>> For each segment, the following steps are done trying to defrag it:
>>> 
>>> 1. share the segment with a temporary file
>>> 2. unshare the segment in the target file. The kernel performs CoW on the
>>>  whole segment to complete the unshare (defrag).
>>> 3. release blocks from the temporary file.
>>> 
>>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>>> ---
>>> spaceman/defrag.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 114 insertions(+)
>>> 
>>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>>> index 175cf461..9f11e36b 100644
>>> --- a/spaceman/defrag.c
>>> +++ b/spaceman/defrag.c
>>> @@ -263,6 +263,40 @@ add_ext:
>>> return ret;
>>> }
>>> 
>>> +/*
>>> + * check if the segment exceeds EoF.
>>> + * fix up the clone range and return true if EoF happens,
>>> + * return false otherwise.
>>> + */
>>> +static bool
>>> +defrag_clone_eof(struct file_clone_range *clone)
>>> +{
>>> + off_t delta;
>>> +
>>> + delta = clone->src_offset + clone->src_length - g_defrag_file_size;
>>> + if (delta > 0) {
>>> + clone->src_length = 0; // to the end
>>> + return true;
>>> + }
>>> + return false;
>>> +}
>>> +
>>> +/*
>>> + * get the time delta since pre_time in us.
>>> + * pre_time should contain values fetched by gettimeofday();
>>> + * cur_time is used to store the current time from gettimeofday().
>>> + */
>>> +static long long
>>> +get_time_delta_us(struct timeval *pre_time, struct timeval *cur_time)
>>> +{
>>> + long long us;
>>> +
>>> + gettimeofday(cur_time, NULL);
>>> + us = (cur_time->tv_sec - pre_time->tv_sec) * 1000000;
>>> + us += (cur_time->tv_usec - pre_time->tv_usec);
>>> + return us;
>>> +}
>>> +
>>> /*
>>> * defragment a file
>>> * return 0 if successfully done, 1 otherwise
>>> @@ -273,6 +307,7 @@ defrag_xfs_defrag(char *file_path) {
>>> long nr_seg_defrag = 0, nr_ext_defrag = 0;
>>> int scratch_fd = -1, defrag_fd = -1;
>>> char tmp_file_path[PATH_MAX+1];
>>> + struct file_clone_range clone;
>>> char *defrag_dir;
>>> struct fsxattr fsx;
>>> int ret = 0;
>> 
>> Now that I see this, you might want to straighten up the lines:
>> 
>> struct fsxattr fsx = { };
>> long nr_seg_defrag = 0, nr_ext_defrag = 0;
>> 
>> etc.  Note the "= { }" bit that means you don't have to memset them to
>> zero explicitly.
> 
> Nice!
> 
>> 
>>> @@ -296,6 +331,8 @@ defrag_xfs_defrag(char *file_path) {
>>> goto out;
>>> }
>>> 
>>> + clone.src_fd = defrag_fd;
>>> +
>>> defrag_dir = dirname(file_path);
>>> snprintf(tmp_file_path, PATH_MAX, "%s/.xfsdefrag_%d", defrag_dir,
>>> getpid());
>>> @@ -309,7 +346,11 @@ defrag_xfs_defrag(char *file_path) {
>>> }
>>> 
>>> do {
>>> + struct timeval t_clone, t_unshare, t_punch_hole;
>>> struct defrag_segment segment;
>>> + long long seg_size, seg_off;
>>> + int time_delta;
>>> + bool stop;
>>> 
>>> ret = defrag_get_next_segment(defrag_fd, &segment);
>>> /* no more segments, we are done */
>>> @@ -322,6 +363,79 @@ defrag_xfs_defrag(char *file_path) {
>>> ret = 1;
>>> break;
>>> }
>>> +
>>> + /* we are done if the segment contains only 1 extent */
>>> + if (segment.ds_nr < 2)
>>> + continue;
>>> +
>>> + /* to bytes */
>>> + seg_off = segment.ds_offset * 512;
>>> + seg_size = segment.ds_length * 512;
>>> +
>>> + clone.src_offset = seg_off;
>>> + clone.src_length = seg_size;
>>> + clone.dest_offset = seg_off;
>>> +
>>> + /* checks for EoF and fix up clone */
>>> + stop = defrag_clone_eof(&clone);
>>> + gettimeofday(&t_clone, NULL);
>>> + ret = ioctl(scratch_fd, FICLONERANGE, &clone);
>> 
>> Hm, should the top-level defrag_f function check in the
>> filetable[i].fsgeom structure that the fs supports reflink?
> 
> Yes, good to know.

It seems that xfs_fsop_geom doesn’t know about reflink?

Thanks,
Wengang 

> 
>> 
>>> + if (ret != 0) {
>>> + fprintf(stderr, "FICLONERANGE failed %s\n",
>>> + strerror(errno));
>> 
>> Might be useful to include the file_path in the error message:
>> 
>> /opt/a: FICLONERANGE failed Software caused connection abort
>> 
>> (maybe also put a semicolon before the strerror message?)
> 
> OK.
> 
>> 
>>> + break;
>>> + }
>>> +
>>> + /* for time stats */
>>> + time_delta = get_time_delta_us(&t_clone, &t_unshare);
>>> + if (time_delta > max_clone_us)
>>> + max_clone_us = time_delta;
>>> +
>>> + /* for defrag stats */
>>> + nr_ext_defrag += segment.ds_nr;
>>> +
>>> + /*
>>> +  * For the shared range to be unshared via a copy-on-write
>>> +  * operation in the file to be defragged. This causes the
>>> +  * file needing to be defragged to have new extents allocated
>>> +  * and the data to be copied over and written out.
>>> +  */
>>> + ret = fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, seg_off,
>>> + seg_size);
>>> + if (ret != 0) {
>>> + fprintf(stderr, "UNSHARE_RANGE failed %s\n",
>>> + strerror(errno));
>>> + break;
>>> + }
>>> +
>>> + /* for time stats */
>>> + time_delta = get_time_delta_us(&t_unshare, &t_punch_hole);
>>> + if (time_delta > max_unshare_us)
>>> + max_unshare_us = time_delta;
>>> +
>>> + /*
>>> +  * Punch out the original extents we shared to the
>>> +  * scratch file so they are returned to free space.
>>> +  */
>>> + ret = fallocate(scratch_fd,
>>> + FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, seg_off,
>>> + seg_size);
>> 
>> Indentation here (two tabs for a continuation).  
> 
> OK.
> 
>> Or just ftruncate
>> scratch_fd to zero bytes?  I think you have to do that for the EOF stuff
>> to work, right?
>> 
> 
> I’d truncate the UNSHARE range only in the loop.
> EOF stuff would be truncated on (O_TMPFILE) file close.
> The EOF stuff would be used for another purpose, see 
> [PATCH 6/9] spaceman/defrag: workaround kernel
> 
> Thanks,
> Wengang
> 
>> --D
>> 
>>> + if (ret != 0) {
>>> + fprintf(stderr, "PUNCH_HOLE failed %s\n",
>>> + strerror(errno));
>>> + break;
>>> + }
>>> +
>>> + /* for defrag stats */
>>> + nr_seg_defrag += 1;
>>> +
>>> + /* for time stats */
>>> + time_delta = get_time_delta_us(&t_punch_hole, &t_clone);
>>> + if (time_delta > max_punch_us)
>>> + max_punch_us = time_delta;
>>> +
>>> + if (stop)
>>> + break;
>>> } while (true);
>>> out:
>>> if (scratch_fd != -1) {
>>> -- 
>>> 2.39.3 (Apple Git-146)




* Re: [PATCH 1/9] xfsprogs: introduce defrag command to spaceman
  2024-07-11 21:54     ` Wengang Wang
@ 2024-07-15 21:30       ` Wengang Wang
  2024-07-15 22:44         ` Darrick J. Wong
  0 siblings, 1 reply; 60+ messages in thread
From: Wengang Wang @ 2024-07-15 21:30 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs@vger.kernel.org



> On Jul 11, 2024, at 2:54 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
> 
> Hi Darrick,
> Thanks for the review, please see my replies inline.
> 
>> On Jul 9, 2024, at 2:18 PM, Darrick J. Wong <djwong@kernel.org> wrote:
>> 
>> On Tue, Jul 09, 2024 at 12:10:20PM -0700, Wengang Wang wrote:
>>> Content-Type: text/plain; charset=UTF-8
>>> Content-Transfer-Encoding: 8bit
>>> 
>>> Non-exclusive defragment
>>> Here we are introducing the non-exclusive manner to defragment a file,
>>> especially for huge files, without blocking IO to it long.
>>> Non-exclusive defragmentation divides the whole file into small segments.
>>> For each segment, we lock the file, defragment the segment and unlock the file.
>>> Defragmenting a small segment doesn’t take long. File IO requests can get
>>> served between defragmenting segments instead of being blocked for long.  Also we put
>>> a (user adjustable) idle time between defragmenting two consecutive segments to
>>> balance the defragmentation against file IOs.
>>> 
>>> The first patch in the set checks for valid target files
>>> 
>>> Valid target files to defrag must:
>>> 1. be accessible for read/write
>>> 2. be regular files
>>> 3. be in XFS filesystem
>>> 4. the containing XFS has reflink enabled. This is not checked
>>>  before starting defragmentation, but error would be reported
>>>  later.
>>> 
>>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>>> ---
>>> spaceman/Makefile |   2 +-
>>> spaceman/defrag.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++
>>> spaceman/init.c   |   1 +
>>> spaceman/space.h  |   1 +
>>> 4 files changed, 201 insertions(+), 1 deletion(-)
>>> create mode 100644 spaceman/defrag.c
>>> 
>>> diff --git a/spaceman/Makefile b/spaceman/Makefile
>>> index 1f048d54..9c00b20a 100644
>>> --- a/spaceman/Makefile
>>> +++ b/spaceman/Makefile
>>> @@ -7,7 +7,7 @@ include $(TOPDIR)/include/builddefs
>>> 
>>> LTCOMMAND = xfs_spaceman
>>> HFILES = init.h space.h
>>> -CFILES = info.c init.c file.c health.c prealloc.c trim.c
>>> +CFILES = info.c init.c file.c health.c prealloc.c trim.c defrag.c
>>> LSRCFILES = xfs_info.sh
>>> 
>>> LLDLIBS = $(LIBXCMD) $(LIBFROG)
>>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>>> new file mode 100644
>>> index 00000000..c9732984
>>> --- /dev/null
>>> +++ b/spaceman/defrag.c
>>> @@ -0,0 +1,198 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/*
>>> + * Copyright (c) 2024 Oracle.
>>> + * All Rights Reserved.
>>> + */
>>> +
>>> +#include "libxfs.h"
>>> +#include <linux/fiemap.h>
>>> +#include <linux/fsmap.h>
>>> +#include "libfrog/fsgeom.h"
>>> +#include "command.h"
>>> +#include "init.h"
>>> +#include "libfrog/paths.h"
>>> +#include "space.h"
>>> +#include "input.h"
>>> +
>>> +/* defrag segment size limit in units of 512 bytes */
>>> +#define MIN_SEGMENT_SIZE_LIMIT 8192 /* 4MiB */
>>> +#define DEFAULT_SEGMENT_SIZE_LIMIT 32768 /* 16MiB */
>>> +static int g_segment_size_lmt = DEFAULT_SEGMENT_SIZE_LIMIT;
>>> +
>>> +/* size of the defrag target file */
>>> +static off_t g_defrag_file_size = 0;
>>> +
>>> +/* stats for the target file extents before defrag */
>>> +struct ext_stats {
>>> + long nr_ext_total;
>>> + long nr_ext_unwritten;
>>> + long nr_ext_shared;
>>> +};
>>> +static struct ext_stats g_ext_stats;
>>> +
>>> +/*
>>> + * check if the target is a valid file to defrag
>>> + * also store file size
>>> + * returns:
>>> + * true for yes and false for no
>>> + */
>>> +static bool
>>> +defrag_check_file(char *path)
>>> +{
>>> + struct statfs statfs_s;
>>> + struct stat stat_s;
>>> +
>>> + if (access(path, F_OK|W_OK) == -1) {
>>> + if (errno == ENOENT)
>>> + fprintf(stderr, "file \"%s\" doesn't exist\n", path);
>>> + else
>>> + fprintf(stderr, "no access to \"%s\", %s\n", path,
>>> + strerror(errno));
>>> + return false;
>>> + }
>>> +
>>> + if (stat(path, &stat_s) == -1) {
>>> + fprintf(stderr, "failed to get file info on \"%s\":  %s\n",
>>> + path, strerror(errno));
>>> + return false;
>>> + }
>>> +
>>> + g_defrag_file_size = stat_s.st_size;
>>> +
>>> + if (!S_ISREG(stat_s.st_mode)) {
>>> + fprintf(stderr, "\"%s\" is not a regular file\n", path);
>>> + return false;
>>> + }
>>> +
>>> + if (statfs(path, &statfs_s) == -1) {
>> 
>> statfs is deprecated, please use fstatvfs.
> 
> OK, will move to fstatvfs.
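For reference, a small sketch of fstatvfs() on an already-open fd; note that struct statvfs has no f_type field, so the XFS-magic check still needs another mechanism (e.g. the geometry ioctl) — this only shows the non-deprecated stat call:

```c
#include <stdio.h>
#include <sys/statvfs.h>

/* Return the preferred I/O block size of the filesystem backing fd,
 * or -1 on error. */
static long long
fs_block_size(int fd)
{
	struct statvfs sv;

	if (fstatvfs(fd, &sv) < 0)
		return -1;
	return (long long)sv.f_bsize;
}
```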
> 
>> 
>>> + fprintf(stderr, "failed to get FS info on \"%s\":  %s\n",
>>> + path, strerror(errno));
>>> + return false;
>>> + }
>>> +
>>> + if (statfs_s.f_type != XFS_SUPER_MAGIC) {
>>> + fprintf(stderr, "\"%s\" is not a xfs file\n", path);
>>> + return false;
>>> + }
>>> +
>>> + return true;
>>> +}
>>> +
>>> +/*
>>> + * defragment a file
>>> + * return 0 if successfully done, 1 otherwise
>>> + */
>>> +static int
>>> +defrag_xfs_defrag(char *file_path) {
>> 
>> defrag_xfs_path() ?
> 
> OK.
>> 
>>> + int max_clone_us = 0, max_unshare_us = 0, max_punch_us = 0;
>>> + long nr_seg_defrag = 0, nr_ext_defrag = 0;
>>> + int scratch_fd = -1, defrag_fd = -1;
>>> + char tmp_file_path[PATH_MAX+1];
>>> + char *defrag_dir;
>>> + struct fsxattr fsx;
>>> + int ret = 0;
>>> +
>>> + fsx.fsx_nextents = 0;
>>> + memset(&g_ext_stats, 0, sizeof(g_ext_stats));
>>> +
>>> + if (!defrag_check_file(file_path)) {
>>> + ret = 1;
>>> + goto out;
>>> + }
>>> +
>>> + defrag_fd = open(file_path, O_RDWR);
>>> + if (defrag_fd == -1) {
>> 
>> Not sure why you check the path before opening it -- all those file and
>> statvfs attributes that you collect there can change (or the entire fs
>> gets unmounted) until you've pinned the fs by opening the file.
> 
> The idea comes from internal reviews hoping some explicit reasons why
> Defrag failed. Those reasons include: 
> 1) if user has permission to access the target file.
> 2) if the specified path exists (when moving to spaceman, spaceman takes care of it)
> 3) if the specified path is a regular file
> 4) if the target file is an XFS file
> 
> Things might change between checking and opening, but that’s a very rare case and the user is
> responsible for that change rather than this tool.
> 
>> 
>>> + fprintf(stderr, "Opening %s failed. %s\n", file_path,
>>> + strerror(errno));
>>> + ret = 1;
>>> + goto out;
>>> + }
>>> +
>>> + defrag_dir = dirname(file_path);
>>> + snprintf(tmp_file_path, PATH_MAX, "%s/.xfsdefrag_%d", defrag_dir,
>>> + getpid());
>>> + tmp_file_path[PATH_MAX] = 0;
>>> + scratch_fd = open(tmp_file_path, O_CREAT|O_EXCL|O_RDWR, 0600);
>> 
>> O_TMPFILE?  Then you don't have to do this .xfsdefrag_XXX stuff.
>> 
> 
> My first version used O_TMPFILE, but clone failed somehow (I don’t remember the details).
> I retried O_TMPFILE and it’s working now, so I will move to O_TMPFILE.
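For reference, a hedged sketch of the O_TMPFILE variant agreed on here; `open_scratch()` is a hypothetical name, and O_TMPFILE requires Linux 3.11+ plus filesystem support (XFS has it):

```c
#define _GNU_SOURCE /* for O_TMPFILE */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Open an anonymous scratch file in dir_path. The inode has no name, so
 * there is no unlink() on the cleanup path and no stale .xfsdefrag_<pid>
 * left behind if the tool crashes; blocks are freed on the last close().
 */
static int
open_scratch(const char *dir_path)
{
	int fd = open(dir_path, O_TMPFILE | O_RDWR, 0600);

	if (fd < 0)
		fprintf(stderr, "O_TMPFILE open in %s failed: %s\n",
				dir_path, strerror(errno));
	return fd;
}
```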
> 
>>> + if (scratch_fd == -1) {
>>> + fprintf(stderr, "Opening temporary file %s failed. %s\n",
>>> + tmp_file_path, strerror(errno));
>>> + ret = 1;
>>> + goto out;
>>> + }
>>> +out:
>>> + if (scratch_fd != -1) {
>>> + close(scratch_fd);
>>> + unlink(tmp_file_path);
>>> + }
>>> + if (defrag_fd != -1) {
>>> + ioctl(defrag_fd, FS_IOC_FSGETXATTR, &fsx);
>>> + close(defrag_fd);
>>> + }
>>> +
>>> + printf("Pre-defrag %ld extents detected, %ld are \"unwritten\","
>>> + "%ld are \"shared\"\n",
>>> + g_ext_stats.nr_ext_total, g_ext_stats.nr_ext_unwritten,
>>> + g_ext_stats.nr_ext_shared);
>>> + printf("Tried to defragment %ld extents in %ld segments\n",
>>> + nr_ext_defrag, nr_seg_defrag);
>>> + printf("Time stats(ms): max clone: %d, max unshare: %d,"
>>> +        " max punch_hole: %d\n",
>>> +        max_clone_us/1000, max_unshare_us/1000, max_punch_us/1000);
>>> + printf("Post-defrag %u extents detected\n", fsx.fsx_nextents);
>>> + return ret;
>>> +}
>>> +
>>> +
>>> +static void defrag_help(void)
>>> +{
>>> + printf(_(
>>> +"\n"
>>> +"Defragemnt files on XFS where reflink is enabled. IOs to the target files \n"
>> 
>> "Defragment"
> 
> OK.
> 
>> 
>>> +"can be served durning the defragmentations.\n"
>>> +"\n"
>>> +" -s segment_size    -- specify the segment size in MiB, minmum value is 4 \n"
>>> +"                       default is 16\n"));
>>> +}
>>> +
>>> +static cmdinfo_t defrag_cmd;
>>> +
>>> +static int
>>> +defrag_f(int argc, char **argv)
>>> +{
>>> + int i;
>>> + int c;
>>> +
>>> + while ((c = getopt(argc, argv, "s:")) != EOF) {
>>> + switch(c) {
>>> + case 's':
>>> + g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
>>> + if (g_segment_size_lmt < MIN_SEGMENT_SIZE_LIMIT) {
>>> + g_segment_size_lmt = MIN_SEGMENT_SIZE_LIMIT;
>>> + printf("Using minimium segment size %d\n",
>>> + g_segment_size_lmt);
>>> + }
>>> + break;
>>> + default:
>>> + command_usage(&defrag_cmd);
>>> + return 1;
>>> + }
>>> + }
>>> +
>>> + for (i = 0; i < filecount; i++)
>>> + defrag_xfs_defrag(filetable[i].name);
>> 
>> Pass in the whole filetable[i] and then you've already got an open fd
>> and some validation that it's an xfs filesystem.
> 

filetable[i].xfd.fd doesn’t work well: UNSHARE returns “Bad file descriptor”; I suspect that fd is opened read-only.

So I have to open the file again for writing.

Thanks,
Wengang

> Good to know.
>> 
>>> + return 0;
>>> +}
>>> +void defrag_init(void)
>>> +{
>>> + defrag_cmd.name = "defrag";
>>> + defrag_cmd.altname = "dfg";
>>> + defrag_cmd.cfunc = defrag_f;
>>> + defrag_cmd.argmin = 0;
>>> + defrag_cmd.argmax = 4;
>>> + defrag_cmd.args = "[-s segment_size]";
>>> + defrag_cmd.flags = CMD_FLAG_ONESHOT;
>> 
>> IIRC if you don't set CMD_FLAG_FOREIGN_OK then the command processor
>> won't let this command get run against a non-xfs file.
>> 
> 
> OK.
> 
> Thanks,
> Wengang
> 
>> --D
>> 
>>> + defrag_cmd.oneline = _("Defragment XFS files");
>>> + defrag_cmd.help = defrag_help;
>>> +
>>> + add_command(&defrag_cmd);
>>> +}
>>> diff --git a/spaceman/init.c b/spaceman/init.c
>>> index cf1ff3cb..396f965c 100644
>>> --- a/spaceman/init.c
>>> +++ b/spaceman/init.c
>>> @@ -35,6 +35,7 @@ init_commands(void)
>>> trim_init();
>>> freesp_init();
>>> health_init();
>>> + defrag_init();
>>> }
>>> 
>>> static int
>>> diff --git a/spaceman/space.h b/spaceman/space.h
>>> index 723209ed..c288aeb9 100644
>>> --- a/spaceman/space.h
>>> +++ b/spaceman/space.h
>>> @@ -26,6 +26,7 @@ extern void help_init(void);
>>> extern void prealloc_init(void);
>>> extern void quit_init(void);
>>> extern void trim_init(void);
>>> +extern void defrag_init(void);
>>> #ifdef HAVE_GETFSMAP
>>> extern void freesp_init(void);
>>> #else
>>> -- 
>>> 2.39.3 (Apple Git-146)




* Re: [PATCH 3/9] spaceman/defrag: defrag segments
  2024-07-12 19:07       ` Wengang Wang
@ 2024-07-15 22:42         ` Darrick J. Wong
  0 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2024-07-15 22:42 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs@vger.kernel.org

On Fri, Jul 12, 2024 at 07:07:01PM +0000, Wengang Wang wrote:
> 
> 
> > On Jul 11, 2024, at 3:49 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
> > 
> > 
> > 
> >> On Jul 9, 2024, at 2:57 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> >> 
> >> On Tue, Jul 09, 2024 at 12:10:22PM -0700, Wengang Wang wrote:
> >>> For each segment, the following steps are done trying to defrag it:
> >>> 
> >>> 1. share the segment with a temporary file
> >>> 2. unshare the segment in the target file. The kernel performs CoW on the
> >>>  whole segment to complete the unshare (defrag).
> >>> 3. release blocks from the temporary file.
> >>> 
> >>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> >>> ---
> >>> spaceman/defrag.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++
> >>> 1 file changed, 114 insertions(+)
> >>> 
> >>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> >>> index 175cf461..9f11e36b 100644
> >>> --- a/spaceman/defrag.c
> >>> +++ b/spaceman/defrag.c

<snip>

> >>> @@ -322,6 +363,79 @@ defrag_xfs_defrag(char *file_path) {
> >>> ret = 1;
> >>> break;
> >>> }
> >>> +
> >>> + /* we are done if the segment contains only 1 extent */
> >>> + if (segment.ds_nr < 2)
> >>> + continue;
> >>> +
> >>> + /* to bytes */
> >>> + seg_off = segment.ds_offset * 512;
> >>> + seg_size = segment.ds_length * 512;
> >>> +
> >>> + clone.src_offset = seg_off;
> >>> + clone.src_length = seg_size;
> >>> + clone.dest_offset = seg_off;
> >>> +
> >>> + /* checks for EoF and fix up clone */
> >>> + stop = defrag_clone_eof(&clone);
> >>> + gettimeofday(&t_clone, NULL);
> >>> + ret = ioctl(scratch_fd, FICLONERANGE, &clone);
> >> 
> >> Hm, should the top-level defrag_f function check in the
> >> filetable[i].fsgeom structure that the fs supports reflink?
> > 
> > Yes, good to know.
> 
> It seems that xfs_fsop_geom doesn’t know about reflink?

XFS_FSOP_GEOM_FLAGS_REFLINK ?

--D
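A minimal sketch of that check; the stand-in struct and the flag value (1 << 20) mirror the kernel's xfs_fs.h at the time of writing and should be taken from the real headers rather than redefined as done here for self-containment:

```c
#include <stdbool.h>

/* Stand-ins so the sketch is self-contained; the real definitions live in
 * the kernel's xfs_fs.h and the geometry is filled in by the GEOMETRY ioctl. */
#define XFS_FSOP_GEOM_FLAGS_REFLINK (1 << 20) /* files can share blocks */

struct fsop_geom_stub {
	unsigned int flags;
};

static bool
fs_supports_reflink(const struct fsop_geom_stub *fsgeom)
{
	return (fsgeom->flags & XFS_FSOP_GEOM_FLAGS_REFLINK) != 0;
}
```

In defrag_f this could gate the whole run by testing the flags word in filetable[i].fsgeom before attempting FICLONERANGE.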

> Thanks,
> Wengang 
> 
> > 
> >> 
> >>> + if (ret != 0) {
> >>> + fprintf(stderr, "FICLONERANGE failed %s\n",
> >>> + strerror(errno));
> >> 
> >> Might be useful to include the file_path in the error message:
> >> 
> >> /opt/a: FICLONERANGE failed Software caused connection abort
> >> 
> >> (maybe also put a semicolon before the strerror message?)
> > 
> > OK.
> > 
> >> 
> >>> + break;
> >>> + }
> >>> +
> >>> + /* for time stats */
> >>> + time_delta = get_time_delta_us(&t_clone, &t_unshare);
> >>> + if (time_delta > max_clone_us)
> >>> + max_clone_us = time_delta;
> >>> +
> >>> + /* for defrag stats */
> >>> + nr_ext_defrag += segment.ds_nr;
> >>> +
> >>> + /*
> >>> +  * For the shared range to be unshared via a copy-on-write
> >>> +  * operation in the file to be defragged. This causes the
> >>> +  * file needing to be defragged to have new extents allocated
> >>> +  * and the data to be copied over and written out.
> >>> +  */
> >>> + ret = fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, seg_off,
> >>> + seg_size);
> >>> + if (ret != 0) {
> >>> + fprintf(stderr, "UNSHARE_RANGE failed %s\n",
> >>> + strerror(errno));
> >>> + break;
> >>> + }
> >>> +
> >>> + /* for time stats */
> >>> + time_delta = get_time_delta_us(&t_unshare, &t_punch_hole);
> >>> + if (time_delta > max_unshare_us)
> >>> + max_unshare_us = time_delta;
> >>> +
> >>> + /*
> >>> +  * Punch out the original extents we shared to the
> >>> +  * scratch file so they are returned to free space.
> >>> +  */
> >>> + ret = fallocate(scratch_fd,
> >>> + FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, seg_off,
> >>> + seg_size);
> >> 
> >> Indentation here (two tabs for a continuation).  
> > 
> > OK.
> > 
> >> Or just ftruncate
> >> scratch_fd to zero bytes?  I think you have to do that for the EOF stuff
> >> to work, right?
> >> 
> > 
> > I’d truncate the UNSHARE range only in the loop.
> > EOF stuff would be truncated on (O_TMPFILE) file close.
> > The EOF stuff would be used for another purpose, see 
> > [PATCH 6/9] spaceman/defrag: workaround kernel
> > 
> > Thanks,
> > Wengang
> > 
> >> --D
> >> 
> >>> + if (ret != 0) {
> >>> + fprintf(stderr, "PUNCH_HOLE failed %s\n",
> >>> + strerror(errno));
> >>> + break;
> >>> + }
> >>> +
> >>> + /* for defrag stats */
> >>> + nr_seg_defrag += 1;
> >>> +
> >>> + /* for time stats */
> >>> + time_delta = get_time_delta_us(&t_punch_hole, &t_clone);
> >>> + if (time_delta > max_punch_us)
> >>> + max_punch_us = time_delta;
> >>> +
> >>> + if (stop)
> >>> + break;
> >>> } while (true);
> >>> out:
> >>> if (scratch_fd != -1) {
> >>> -- 
> >>> 2.39.3 (Apple Git-146)
> 
> 


* Re: [PATCH 1/9] xfsprogs: introduce defrag command to spaceman
  2024-07-15 21:30       ` Wengang Wang
@ 2024-07-15 22:44         ` Darrick J. Wong
  0 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2024-07-15 22:44 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs@vger.kernel.org

On Mon, Jul 15, 2024 at 09:30:42PM +0000, Wengang Wang wrote:
> 
> 
> > On Jul 11, 2024, at 2:54 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
> > 
> > Hi Darrick,
> > Thanks for the review, please see my replies inline.
> > 
> >> On Jul 9, 2024, at 2:18 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> >> 
> >> On Tue, Jul 09, 2024 at 12:10:20PM -0700, Wengang Wang wrote:
> >>> Content-Type: text/plain; charset=UTF-8
> >>> Content-Transfer-Encoding: 8bit
> >>> 
> >>> Non-exclusive defragment
> >>> Here we are introducing the non-exclusive manner to defragment a file,
> >>> especially for huge files, without blocking IO to it long.
> >>> Non-exclusive defragmentation divides the whole file into small segments.
> >>> For each segment, we lock the file, defragment the segment and unlock the file.
> > >>> Defragmenting a small segment doesn’t take long. File IO requests can get
> > >>> served between defragmenting segments instead of being blocked for long.  Also we put
> > >>> a (user adjustable) idle time between defragmenting two consecutive segments to
> > >>> balance the defragmentation against file IOs.
> >>> 
> >>> The first patch in the set checks for valid target files
> >>> 
> >>> Valid target files to defrag must:
> >>> 1. be accessible for read/write
> >>> 2. be regular files
> >>> 3. be in XFS filesystem
> >>> 4. the containing XFS has reflink enabled. This is not checked
> >>>  before starting defragmentation, but error would be reported
> >>>  later.
> >>> 
> >>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> >>> ---
> >>> spaceman/Makefile |   2 +-
> >>> spaceman/defrag.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++
> >>> spaceman/init.c   |   1 +
> >>> spaceman/space.h  |   1 +
> >>> 4 files changed, 201 insertions(+), 1 deletion(-)
> >>> create mode 100644 spaceman/defrag.c
> >>> 
> >>> diff --git a/spaceman/Makefile b/spaceman/Makefile
> >>> index 1f048d54..9c00b20a 100644
> >>> --- a/spaceman/Makefile
> >>> +++ b/spaceman/Makefile
> >>> @@ -7,7 +7,7 @@ include $(TOPDIR)/include/builddefs
> >>> 
> >>> LTCOMMAND = xfs_spaceman
> >>> HFILES = init.h space.h
> >>> -CFILES = info.c init.c file.c health.c prealloc.c trim.c
> >>> +CFILES = info.c init.c file.c health.c prealloc.c trim.c defrag.c
> >>> LSRCFILES = xfs_info.sh
> >>> 
> >>> LLDLIBS = $(LIBXCMD) $(LIBFROG)
> >>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> >>> new file mode 100644
> >>> index 00000000..c9732984
> >>> --- /dev/null
> >>> +++ b/spaceman/defrag.c
> >>> @@ -0,0 +1,198 @@
> >>> +// SPDX-License-Identifier: GPL-2.0
> >>> +/*
> >>> + * Copyright (c) 2024 Oracle.
> >>> + * All Rights Reserved.
> >>> + */
> >>> +
> >>> +#include "libxfs.h"
> >>> +#include <linux/fiemap.h>
> >>> +#include <linux/fsmap.h>
> >>> +#include "libfrog/fsgeom.h"
> >>> +#include "command.h"
> >>> +#include "init.h"
> >>> +#include "libfrog/paths.h"
> >>> +#include "space.h"
> >>> +#include "input.h"
> >>> +
> >>> +/* defrag segment size limit in units of 512 bytes */
> >>> +#define MIN_SEGMENT_SIZE_LIMIT 8192 /* 4MiB */
> >>> +#define DEFAULT_SEGMENT_SIZE_LIMIT 32768 /* 16MiB */
> >>> +static int g_segment_size_lmt = DEFAULT_SEGMENT_SIZE_LIMIT;
> >>> +
> >>> +/* size of the defrag target file */
> >>> +static off_t g_defrag_file_size = 0;
> >>> +
> >>> +/* stats for the target file extents before defrag */
> >>> +struct ext_stats {
> >>> + long nr_ext_total;
> >>> + long nr_ext_unwritten;
> >>> + long nr_ext_shared;
> >>> +};
> >>> +static struct ext_stats g_ext_stats;
> >>> +
> >>> +/*
> >>> + * check if the target is a valid file to defrag
> >>> + * also store file size
> >>> + * returns:
> >>> + * true for yes and false for no
> >>> + */
> >>> +static bool
> >>> +defrag_check_file(char *path)
> >>> +{
> >>> + struct statfs statfs_s;
> >>> + struct stat stat_s;
> >>> +
> >>> + if (access(path, F_OK|W_OK) == -1) {
> >>> + if (errno == ENOENT)
> >>> + fprintf(stderr, "file \"%s\" doesn't exist\n", path);
> >>> + else
> >>> + fprintf(stderr, "no access to \"%s\", %s\n", path,
> >>> + strerror(errno));
> >>> + return false;
> >>> + }
> >>> +
> >>> + if (stat(path, &stat_s) == -1) {
> >>> + fprintf(stderr, "failed to get file info on \"%s\":  %s\n",
> >>> + path, strerror(errno));
> >>> + return false;
> >>> + }
> >>> +
> >>> + g_defrag_file_size = stat_s.st_size;
> >>> +
> >>> + if (!S_ISREG(stat_s.st_mode)) {
> >>> + fprintf(stderr, "\"%s\" is not a regular file\n", path);
> >>> + return false;
> >>> + }
> >>> +
> >>> + if (statfs(path, &statfs_s) == -1) {
> >> 
> >> statfs is deprecated, please use fstatvfs.
> > 
> > OK, will move to fstatvfs.
> > 
> >> 
> >>> + fprintf(stderr, "failed to get FS info on \"%s\":  %s\n",
> >>> + path, strerror(errno));
> >>> + return false;
> >>> + }
> >>> +
> >>> + if (statfs_s.f_type != XFS_SUPER_MAGIC) {
> >>> + fprintf(stderr, "\"%s\" is not a xfs file\n", path);
> >>> + return false;
> >>> + }
> >>> +
> >>> + return true;
> >>> +}
> >>> +
> >>> +/*
> >>> + * defragment a file
> >>> + * return 0 if successfully done, 1 otherwise
> >>> + */
> >>> +static int
> >>> +defrag_xfs_defrag(char *file_path) {
> >> 
> >> defrag_xfs_path() ?
> > 
> > OK.
> >> 
> >>> + int max_clone_us = 0, max_unshare_us = 0, max_punch_us = 0;
> >>> + long nr_seg_defrag = 0, nr_ext_defrag = 0;
> >>> + int scratch_fd = -1, defrag_fd = -1;
> >>> + char tmp_file_path[PATH_MAX+1];
> >>> + char *defrag_dir;
> >>> + struct fsxattr fsx;
> >>> + int ret = 0;
> >>> +
> >>> + fsx.fsx_nextents = 0;
> >>> + memset(&g_ext_stats, 0, sizeof(g_ext_stats));
> >>> +
> >>> + if (!defrag_check_file(file_path)) {
> >>> + ret = 1;
> >>> + goto out;
> >>> + }
> >>> +
> >>> + defrag_fd = open(file_path, O_RDWR);
> >>> + if (defrag_fd == -1) {
> >> 
> >> Not sure why you check the path before opening it -- all those file and
> >> statvfs attributes that you collect there can change (or the entire fs
> >> gets unmounted) until you've pinned the fs by opening the file.
> > 
> > The idea comes from internal reviews hoping some explicit reasons why
> > Defrag failed. Those reasons include: 
> > 1) if user has permission to access the target file.
> > 2) if the specified path exists (when moving to spaceman, spaceman takes care of it)
> > 3) if the specified path is a regular file
> > 4) if the target file is an XFS file
> > 
> > Things might change between checking and opening, but that’s a very rare case and the user is
> > responsible for that change rather than this tool.
> > 
> >> 
> >>> + fprintf(stderr, "Opening %s failed. %s\n", file_path,
> >>> + strerror(errno));
> >>> + ret = 1;
> >>> + goto out;
> >>> + }
> >>> +
> >>> + defrag_dir = dirname(file_path);
> >>> + snprintf(tmp_file_path, PATH_MAX, "%s/.xfsdefrag_%d", defrag_dir,
> >>> + getpid());
> >>> + tmp_file_path[PATH_MAX] = 0;
> >>> + scratch_fd = open(tmp_file_path, O_CREAT|O_EXCL|O_RDWR, 0600);
> >> 
> >> O_TMPFILE?  Then you don't have to do this .xfsdefrag_XXX stuff.
> >> 
> > 
> > My first version used O_TMPFILE, but clone failed somehow (I don’t remember the details).
> > I retried O_TMPFILE and it’s working now, so I will move to O_TMPFILE.
> > 
> >>> + if (scratch_fd == -1) {
> >>> + fprintf(stderr, "Opening temporary file %s failed. %s\n",
> >>> + tmp_file_path, strerror(errno));
> >>> + ret = 1;
> >>> + goto out;
> >>> + }
> >>> +out:
> >>> + if (scratch_fd != -1) {
> >>> + close(scratch_fd);
> >>> + unlink(tmp_file_path);
> >>> + }
> >>> + if (defrag_fd != -1) {
> >>> + ioctl(defrag_fd, FS_IOC_FSGETXATTR, &fsx);
> >>> + close(defrag_fd);
> >>> + }
> >>> +
> >>> + printf("Pre-defrag %ld extents detected, %ld are \"unwritten\","
> >>> + "%ld are \"shared\"\n",
> >>> + g_ext_stats.nr_ext_total, g_ext_stats.nr_ext_unwritten,
> >>> + g_ext_stats.nr_ext_shared);
> >>> + printf("Tried to defragment %ld extents in %ld segments\n",
> >>> + nr_ext_defrag, nr_seg_defrag);
> >>> + printf("Time stats(ms): max clone: %d, max unshare: %d,"
> >>> +        " max punch_hole: %d\n",
> >>> +        max_clone_us/1000, max_unshare_us/1000, max_punch_us/1000);
> >>> + printf("Post-defrag %u extents detected\n", fsx.fsx_nextents);
> >>> + return ret;
> >>> +}
> >>> +
> >>> +
> >>> +static void defrag_help(void)
> >>> +{
> >>> + printf(_(
> >>> +"\n"
> >>> +"Defragemnt files on XFS where reflink is enabled. IOs to the target files \n"
> >> 
> >> "Defragment"
> > 
> > OK.
> > 
> >> 
> >>> +"can be served durning the defragmentations.\n"
> >>> +"\n"
> >>> +" -s segment_size    -- specify the segment size in MiB, minmum value is 4 \n"
> >>> +"                       default is 16\n"));
> >>> +}
> >>> +
> >>> +static cmdinfo_t defrag_cmd;
> >>> +
> >>> +static int
> >>> +defrag_f(int argc, char **argv)
> >>> +{
> >>> + int i;
> >>> + int c;
> >>> +
> >>> + while ((c = getopt(argc, argv, "s:")) != EOF) {
> >>> + switch(c) {
> >>> + case 's':
> >>> + g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
> >>> + if (g_segment_size_lmt < MIN_SEGMENT_SIZE_LIMIT) {
> >>> + g_segment_size_lmt = MIN_SEGMENT_SIZE_LIMIT;
> >>> + printf("Using minimium segment size %d\n",
> >>> + g_segment_size_lmt);
> >>> + }
> >>> + break;
> >>> + default:
> >>> + command_usage(&defrag_cmd);
> >>> + return 1;
> >>> + }
> >>> + }
> >>> +
> >>> + for (i = 0; i < filecount; i++)
> >>> + defrag_xfs_defrag(filetable[i].name);
> >> 
> >> Pass in the whole filetable[i] and then you've already got an open fd
> >> and some validation that it's an xfs filesystem.
> > 
> 
> filetable[i].xfd.fd doesn’t work well: UNSHARE returns “Bad file descriptor”; I suspect that fd is opened read-only.
> 
> So I have to open the file again for writing.

Ah, ok.  In that case, after you reopen the file, you ought to stat both
of them and check that st_dev/st_ino match.

(Or just change spaceman to be able to open files O_RDWR?)

--D

> Thanks,
> Wengang
> 
> > Good to know.
> >> 
> >>> + return 0;
> >>> +}
> >>> +void defrag_init(void)
> >>> +{
> >>> + defrag_cmd.name = "defrag";
> >>> + defrag_cmd.altname = "dfg";
> >>> + defrag_cmd.cfunc = defrag_f;
> >>> + defrag_cmd.argmin = 0;
> >>> + defrag_cmd.argmax = 4;
> >>> + defrag_cmd.args = "[-s segment_size]";
> >>> + defrag_cmd.flags = CMD_FLAG_ONESHOT;
> >> 
> >> IIRC if you don't set CMD_FLAG_FOREIGN_OK then the command processor
> >> won't let this command get run against a non-xfs file.
> >> 
> > 
> > OK.
> > 
> > Thanks,
> > Wengang
> > 
> >> --D
> >> 
> >>> + defrag_cmd.oneline = _("Defragment XFS files");
> >>> + defrag_cmd.help = defrag_help;
> >>> +
> >>> + add_command(&defrag_cmd);
> >>> +}
> >>> diff --git a/spaceman/init.c b/spaceman/init.c
> >>> index cf1ff3cb..396f965c 100644
> >>> --- a/spaceman/init.c
> >>> +++ b/spaceman/init.c
> >>> @@ -35,6 +35,7 @@ init_commands(void)
> >>> trim_init();
> >>> freesp_init();
> >>> health_init();
> >>> + defrag_init();
> >>> }
> >>> 
> >>> static int
> >>> diff --git a/spaceman/space.h b/spaceman/space.h
> >>> index 723209ed..c288aeb9 100644
> >>> --- a/spaceman/space.h
> >>> +++ b/spaceman/space.h
> >>> @@ -26,6 +26,7 @@ extern void help_init(void);
> >>> extern void prealloc_init(void);
> >>> extern void quit_init(void);
> >>> extern void trim_init(void);
> >>> +extern void defrag_init(void);
> >>> #ifdef HAVE_GETFSMAP
> >>> extern void freesp_init(void);
> >>> #else
> >>> -- 
> >>> 2.39.3 (Apple Git-146)
> 
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 4/9] spaceman/defrag: ctrl-c handler
  2024-07-11 22:58     ` Wengang Wang
@ 2024-07-15 22:56       ` Darrick J. Wong
  2024-07-16 16:21         ` Wengang Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2024-07-15 22:56 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs@vger.kernel.org

On Thu, Jul 11, 2024 at 10:58:02PM +0000, Wengang Wang wrote:
> 
> 
> > On Jul 9, 2024, at 2:08 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> > 
> > On Tue, Jul 09, 2024 at 12:10:23PM -0700, Wengang Wang wrote:
> >> Add this handler to break the defrag better, so it has
> >> 1. the stats reporting
> >> 2. remove the temporary file
> >> 
> >> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> >> ---
> >> spaceman/defrag.c | 11 ++++++++++-
> >> 1 file changed, 10 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> >> index 9f11e36b..61e47a43 100644
> >> --- a/spaceman/defrag.c
> >> +++ b/spaceman/defrag.c
> >> @@ -297,6 +297,13 @@ get_time_delta_us(struct timeval *pre_time, struct timeval *cur_time)
> >> return us;
> >> }
> >> 
> >> +static volatile bool usedKilled = false;
> >> +void defrag_sigint_handler(int dummy)
> >> +{
> >> + usedKilled = true;
> > 
> > Not sure why some of these variables are camelCase and others not.
> > Or why this global variable doesn't have a g_ prefix like the others?
> > 
> 
> Yep, will change it to g_user_killed.
> 
> >> + printf("Please wait until current segment is defragmented\n");
> > 
> > Is it actually safe to call printf from a signal handler?  Handlers must
> > be very careful about what they call -- regreSSHion was a result of
> > openssh not getting this right.
> > 
> > (Granted spaceman isn't as critical...)
> > 
> 
> As the UNSHARE ioctl takes time, the process would really stall for a while
> after the user’s kill. The message is used as a quick response to the user; it doesn’t
> actually have any functionality. If it’s not safe, we can remove the message.

$ man signal-safety
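
For illustration, an async-signal-safe version of the handler would set a `sig_atomic_t` flag and emit the message with write(2) (which is on the signal-safety(7) list) instead of printf(3) — a sketch, not the patch's code:

```c
#include <signal.h>
#include <unistd.h>

/* sig_atomic_t is the only type guaranteed safe to write from a handler */
static volatile sig_atomic_t g_user_killed;

static void defrag_sigint_handler(int sig)
{
	static const char msg[] =
		"Please wait until current segment is defragmented\n";

	(void)sig;
	g_user_killed = 1;
	/* write(2) is async-signal-safe; printf(3) is not */
	(void)write(STDOUT_FILENO, msg, sizeof(msg) - 1);
}
```

The main loop then polls `g_user_killed` between segments, exactly where the patch checks `stop`.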

> > Also would you rather SIGINT merely terminate the spaceman process?  I
> > think the file locks drop on termination, right?
> 
> Another purpose of the handler is that I want to show the stats like below even process is killed:
> 
> Pre-defrag 54699 extents detected, 0 are "unwritten",0 are "shared"
> Tried to defragment 54697 extents (939511808 bytes) in 57 segments
> Time stats(ms): max clone: 33, max unshare: 2254, max punch_hole: 286
> Post-defrag 12617 extents detected

Ah, ok.

> Thanks,
> Wengang
> 
> > 
> > --D
> > 
> >> +};
> >> +
> >> /*
> >>  * defragment a file
> >>  * return 0 if successfully done, 1 otherwise
> >> @@ -345,6 +352,8 @@ defrag_xfs_defrag(char *file_path) {
> >> goto out;
> >> }
> >> 
> >> + signal(SIGINT, defrag_sigint_handler);
> >> +
> >> do {
> >> struct timeval t_clone, t_unshare, t_punch_hole;
> >> struct defrag_segment segment;
> >> @@ -434,7 +443,7 @@ defrag_xfs_defrag(char *file_path) {
> >> if (time_delta > max_punch_us)
> >> max_punch_us = time_delta;
> >> 
> >> - if (stop)
> >> + if (stop || usedKilled)
> >> break;
> >> } while (true);
> >> out:
> >> -- 
> >> 2.39.3 (Apple Git-146)
> 
> 

* Re: [PATCH 5/9] spaceman/defrag: exclude shared segments on low free space
  2024-07-11 23:08     ` Wengang Wang
@ 2024-07-15 22:58       ` Darrick J. Wong
  0 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2024-07-15 22:58 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs@vger.kernel.org

On Thu, Jul 11, 2024 at 11:08:39PM +0000, Wengang Wang wrote:
> 
> 
> > On Jul 9, 2024, at 2:05 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> > 
> > On Tue, Jul 09, 2024 at 12:10:24PM -0700, Wengang Wang wrote:
> >> On some XFS, free blocks are over-committed to reflink copies.
> >> And those free blocks are not enough if CoW happens to all the shared blocks.
> > 
> > Hmmm.  I think what you're trying to do here is avoid running a
> > filesystem out of space because it defragmented files A, B, ... Z, each
> > of which previously shared the same chunk of storage but now they don't
> > because this defragger unshared them to reduce the extent count in those
> > files.  Right?
> > 
> 
> Yes.
> 
> > In that case, I wonder if it's a good idea to touch shared extents at
> > all?  Someone set those files to share space, that's probably a better
> > performance optimization than reducing extent count.
> 
> The question is:
> Are the shared parts something that will be overwritten frequently?
> If they are, copy-on-write would make those shared parts fragmented.
> In that case we should defrag those parts; otherwise the defrag might not defrag at all.
> If the shared parts are not subject to frequent overwrites,
> they are expected to remain in big extents, and choosing a proper segment size
> would skip those.
> 
> But yes, we can add an option to simply skip those shared extents.

Good enough for now, I think. :)

> > 
> > That said, you /could/ also use GETFSMAP to find all the other owners of
> > a shared extent.  Then you can reflink the same extent to a scratch
> > file, copy the contents to a new region in the scratch file, and use
> > FIEDEDUPERANGE on each of A..Z to remap the new region into those files.
> > Assuming the new region has fewer mappings than the old one it was
> > copied from, you'll defragment A..Z while preserving the sharing factor.
> 
> That’s not safe? Things may change after GETFSMAP.

It is if after you reflink the same extent to a scratch file, you then
check that what was reflinked into that scratch file is the same space
that you thought you were cloning.  If not, truncate the scratch file
and try the GETFSMAP again.

The dedupe should be safe because it doesn't remap unless the contents
match.

--D

> > 
> > I say that because I've written such a thing before; look for
> > csp_evac_dedupe_fsmap in
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/commit/?h=defrag-freespace&id=785d2f024e31a0d0f52b04073a600f9139ef0b21
> > 
> >> This defrag tool would exclude shared segments when free space is under shrethold.
> > 
> > "threshold"
> 
> OK.
> 
> Thanks
> Wengang
> > 
> > --D
> > 
> >> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> >> ---
> >> spaceman/defrag.c | 46 +++++++++++++++++++++++++++++++++++++++++++---
> >> 1 file changed, 43 insertions(+), 3 deletions(-)
> >> 
> >> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> >> index 61e47a43..f8e6713c 100644
> >> --- a/spaceman/defrag.c
> >> +++ b/spaceman/defrag.c
> >> @@ -304,6 +304,29 @@ void defrag_sigint_handler(int dummy)
> >> printf("Please wait until current segment is defragmented\n");
> >> };
> >> 
> >> +/*
> >> + * limitation of filesystem free space in bytes.
> >> + * when filesystem has less free space than this number, segments which contain
> >> + * shared extents are skipped. 1GiB by default
> >> + */
> >> +static long g_limit_free_bytes = 1024 * 1024 * 1024;
> >> +
> >> +/*
> >> + * check if the free space in the FS is less than the _limit_
> >> + * return true if so, false otherwise
> >> + */
> >> +static bool
> >> +defrag_fs_limit_hit(int fd)
> >> +{
> >> + struct statfs statfs_s;
> >> +
> >> + if (g_limit_free_bytes <= 0)
> >> + return false;
> >> +
> >> + fstatfs(fd, &statfs_s);
> >> + return statfs_s.f_bsize * statfs_s.f_bavail < g_limit_free_bytes;
> >> +}
> >> +
> >> /*
> >>  * defragment a file
> >>  * return 0 if successfully done, 1 otherwise
> >> @@ -377,6 +400,15 @@ defrag_xfs_defrag(char *file_path) {
> >> if (segment.ds_nr < 2)
> >> continue;
> >> 
> >> + /*
> >> + * When the segment is (partially) shared, defrag would
> >> + * consume free blocks. We check the limit of FS free blocks
> >> + * and skip defragmenting this segment in case the limit is
> >> + * reached.
> >> + */
> >> + if (segment.ds_shared && defrag_fs_limit_hit(defrag_fd))
> >> + continue;
> >> +
> >> /* to bytes */
> >> seg_off = segment.ds_offset * 512;
> >> seg_size = segment.ds_length * 512;
> >> @@ -478,7 +510,11 @@ static void defrag_help(void)
> >> "can be served durning the defragmentations.\n"
> >> "\n"
> >> " -s segment_size    -- specify the segment size in MiB, minmum value is 4 \n"
> >> -"                       default is 16\n"));
> >> +"                       default is 16\n"
> >> +" -f free_space      -- specify shrethod of the XFS free space in MiB, when\n"
> >> +"                       XFS free space is lower than that, shared segments \n"
> >> +"                       are excluded from defragmentation, 1024 by default\n"
> >> + ));
> >> }
> >> 
> >> static cmdinfo_t defrag_cmd;
> >> @@ -489,7 +525,7 @@ defrag_f(int argc, char **argv)
> >> int i;
> >> int c;
> >> 
> >> - while ((c = getopt(argc, argv, "s:")) != EOF) {
> >> + while ((c = getopt(argc, argv, "s:f:")) != EOF) {
> >> switch(c) {
> >> case 's':
> >> g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
> >> @@ -499,6 +535,10 @@ defrag_f(int argc, char **argv)
> >> g_segment_size_lmt);
> >> }
> >> break;
> >> + case 'f':
> >> + g_limit_free_bytes = atol(optarg) * 1024 * 1024;
> >> + break;
> >> +
> >> default:
> >> command_usage(&defrag_cmd);
> >> return 1;
> >> @@ -516,7 +556,7 @@ void defrag_init(void)
> >> defrag_cmd.cfunc = defrag_f;
> >> defrag_cmd.argmin = 0;
> >> defrag_cmd.argmax = 4;
> >> - defrag_cmd.args = "[-s segment_size]";
> >> + defrag_cmd.args = "[-s segment_size] [-f free_space]";
> >> defrag_cmd.flags = CMD_FLAG_ONESHOT;
> >> defrag_cmd.oneline = _("Defragment XFS files");
> >> defrag_cmd.help = defrag_help;
> >> -- 
> >> 2.39.3 (Apple Git-146)
> >> 
> >> 
> > 
> 

* Re: [PATCH 0/9] introduce defrag to xfs_spaceman
  2024-07-09 19:10 [PATCH 0/9] introduce defrag to xfs_spaceman Wengang Wang
                   ` (8 preceding siblings ...)
  2024-07-09 19:10 ` [PATCH 9/9] spaceman/defrag: warn on extsize Wengang Wang
@ 2024-07-15 23:03 ` Dave Chinner
  2024-07-16 19:45   ` Wengang Wang
  9 siblings, 1 reply; 60+ messages in thread
From: Dave Chinner @ 2024-07-15 23:03 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs

[ Please keep documentation text to 80 columns. ] 

[ Please run documentation through a spell checker - there are too
many typos in this document to point them all out... ]

On Tue, Jul 09, 2024 at 12:10:19PM -0700, Wengang Wang wrote:
> This patch set introduces defrag to xfs_spaceman command. It has the functionality and
> features below (also subject to be added to man page, so please review):

What's the use case for this?

>        defrag [-f free_space] [-i idle_time] [-s segment_size] [-n] [-a]
>               defrag defragments the specified XFS file online non-exclusively. The target XFS

What's "non-exclusively" mean? How is this different to what xfs_fsr
does?

>               doesn't have to (and must not) be unmunted.  When defragmentation is in progress, file
>               IOs are served 'in parallel'.  reflink feature must be enabled in the XFS.

xfs_fsr allows IO to occur in parallel to defrag.

>               Defragmentation and file IOs
> 
>               The target file is virtually devided into many small segments. Segments are the
>               smallest units for defragmentation. Each segment is defragmented one by one in a
>               lock->defragment->unlock->idle manner.

Userspace can't easily lock the file to prevent concurrent access.
So I'm not sure what you are referring to here.

>               File IOs are blocked when the target file is locked and are served during the
>               defragmentation idle time (file is unlocked).

What file IOs are being served in parallel? The defragmentation IO?
something else?

>               Though
>               the file IOs can't really go in parallel, they are not blocked long. The locking time
>               basically depends on the segment size. Smaller segments usually take less locking time
>               and thus IOs are blocked shorterly, bigger segments usually need more locking time and
>               IOs are blocked longer. Check -s and -i options to balance the defragmentation and IO
>               service.

How is a user supposed to know what the correct values are for their
storage, files, and workload? Algorithms should auto tune, not
require users and administrators to use trial and error to find the
best numbers to feed a given operation.

>               Temporary file
> 
>               A temporary file is used for the defragmentation. The temporary file is created in the
>               same directory as the target file is and is named ".xfsdefrag_<pid>". It is a sparse
>               file and contains a defragmentation segment at a time. The temporary file is removed
>               automatically when defragmentation is done or is cancelled by ctrl-c. It remains in
>               case kernel crashes when defragmentation is going on. In that case, the temporary file
>               has to be removed manaully.

O_TMPFILE, as Darrick has already requested.

> 
>               Free blocks consumption
> 
>               Defragmenation works by (trying) allocating new (contiguous) blocks, copying data and
>               then freeing old (non-contig) blocks. Usually the number of old blocks to free equals
>               to the number the newly allocated blocks. As a finally result, defragmentation doesn't
>               consume free blocks.  Well, that is true if the target file is not sharing blocks with
>               other files.

This is really hard to read. Defragmentation will -always- consume
free space while it is in progress. It will always release the
temporary space it consumes when it completes.

>               In case the target file contains shared blocks, those shared blocks won't
>               be freed back to filesystem as they are still owned by other files. So defragmenation
>               allocates more blocks than it frees.

So this is doing an unshare operation as well as defrag? That seems
... suboptimal. The whole point of sharing blocks is to minimise
disk usage for duplicated data.

>               For existing XFS, free blocks might be over-
>               committed when reflink snapshots were created. To avoid causing the XFS running into
>               low free blocks state, this defragmentation excludes (partially) shared segments when
>               the file system free blocks reaches a shreshold. Check the -f option.

Again, how is the user supposed to know when they need to do this?
If the answer is "they should always avoid defrag on low free
space", then why is this an option?

>               Safty and consistency
> 
>               The defragmentation file is guanrantted safe and data consistent for ctrl-c and kernel
>               crash.

Which file is the "defragmentation file"? The source or the temp
file?

>               First extent share
> 
>               Current kernel has routine for each segment defragmentation detecting if the file is
>               sharing blocks.

I have no idea what this means, or what interface this refers to.

>               It takes long in case the target file contains huge number of extents
>               and the shared ones, if there is, are at the end. The First extent share feature works
>               around above issue by making the first serveral blocks shared. Seeing the first blocks
>               are shared, the kernel routine ends quickly. The side effect is that the "share" flag
>               would remain traget file. This feature is enabled by default and can be disabled by -n
>               option.

And from this description, I have no idea what this is doing, what
problem it is trying to work around, or why we'd want to share
blocks out of a file to speed up detection of whether there are
shared blocks in the file. This description doesn't make any sense
to me because I don't know what interface you are actually having
performance issues with. Please reference the kernel code that is
problematic, and explain why the existing kernel code is problematic
and cannot be fixed.

>               extsize and cowextsize
> 
>               According to kernel implementation, extsize and cowextsize could have following impacts
>               to defragmentation: 1) non-zero extsize causes separated block allocations for each
>               extent in the segment and those blocks are not contiguous.

Extent size hints do no such thing. They simply provide extent
alignment guidelines and do not affect things like contiguous or
multi-block allocation lengths.

>               The segment remains same
>               number of extents after defragmention (no effect).  2) When extsize and/or cowextsize
>               are too big, a lot of pre-allocated blocks remain in memory for a while. When new IO
>               comes to whose pre-allocated blocks  Copy on Write happens and causes the file
>               fragmented.

extsize based unwritten extents won't cause COW or cause
fragmentation because they aren't shared and they are contiguous.
I suspect that your definition of "fragmented" isn't taking into
account that unwritten-written-unwritten over a contiguous range
is *not* fragmentation. It's just a contiguous extent in different
states, and this should really not be touched/changed by
defragmentation.

check out xfs_fsr: it ensures that the pattern of unwritten/written
blocks in the defragmented file is identical to the source. i.e. it
preserves preallocation because the application/fs config wants it
to be there....

>               Readahead
> 
>               Readahead tries to fetch the data blocks for next segment with less locking in
>               backgroud during idle time. This feature is disabled by default, use -a to enable it.

What are you reading ahead into? Kernel page cache or user buffers?
Either way, it's hardly what I'd call "idle time" if the defrag
process is using it to issue lots of read IO...


>               The command takes the following options:
>                  -f free_space
>                      The shreshold of XFS free blocks in MiB. When free blocks are less than this
>                      number, (partially) shared segments are excluded from defragmentation. Default
>                      number is 1024

When you are down to 4MB of free space in the filesystem, you
shouldn't even be trying to run defrag because all the free space
that will be left in the filesystem is single blocks. I would have
expected this sort of number to be in a percentage of capacity,
defaulting to something like 5% (which is where we start running low
space algorithms in the kernel).
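
The percentage-based check Dave suggests might look like the sketch below (hypothetical helper; in spaceman the inputs would come from fstatfs(2)'s `f_bavail` and `f_blocks`):

```c
#include <stdbool.h>

/*
 * Illustrative alternative to a fixed byte threshold: treat the
 * filesystem as low on space when available blocks drop below a
 * percentage of capacity (5% mirrors where the kernel starts its
 * low-space algorithms).
 */
static bool low_free_space(unsigned long long bavail,
			   unsigned long long blocks,
			   unsigned int pct)
{
	if (blocks == 0)
		return true;	/* degenerate fs: call it low */
	return bavail * 100 < blocks * (unsigned long long)pct;
}
```

A percentage scales with filesystem size, so the same default behaves sensibly on both a 10GiB and a 10TiB filesystem, which a fixed 1024MiB cutoff does not.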

>                  -i idle_time
>                      The time in milliseconds, defragmentation enters idle state for this long after
>                      defragmenting a segment and before handing the next. Default number is TOBEDONE.

Yeah, I don't think this is something anyone would be expected to
use or tune. If an idle time is needed, the defrag application
should be selecting this itself.
> 
>                  -s segment_size
>                      The size limitation in bytes of segments. Minimium number is 4MiB, default
>                      number is 16MiB.

Why were these numbers chosen? What happens if the file has ~32MB
sized extents and the user wants the file to be returned to a single
large contiguous extent it possible? i.e. how is the user supposed
to know how to set this for any given file without first having
examined the exact pattern of fragmentations in the file?

>                  -n  Disable the First extent share feature. Enabled by default.

So confusing.  Is the "feature disable flag" enabled by default, or
is the feature enabled by default?

>                  -a  Enable readahead feature, disabled by default.

Same confusion, but opposite logic.

I would highly recommend that you get a native english speaker to
review, spell and grammar check the documentation before the next
time you post it.

> We tested with real customer metadump with some different 'idle_time's and found 250ms is good pratice
> sleep time. Here comes some number of the test:
> 
> Test: running of defrag on the image file which is used for the back end of a block device in a
>       virtual machine. At the same time, fio is running at the same time inside virtual machine
>       on that block device.
> block device type:   NVME
> File size:           200GiB
> paramters to defrag: free_space: 1024 idle_time: 250 First_extent_share: enabled readahead: disabled
> Defrag run time:     223 minutes
> Number of extents:   6745489(before) -> 203571(after)

So an average extent size of ~32kB before, ~1MB after? How many of
these are shared extents?

Runtime is 13380secs, so if we copied 200GiB in that time, the
defrag ran at 16MB/s. That's not very fast.

What's the CPU utilisation of the defrag task and kernel side
processing? What is the difference between "first_extent_share"
enabled and disabled (both performance numbers and CPU usage)?

> Fio read latency:    15.72ms(without defrag) -> 14.53ms(during defrag)
> Fio write latency:   32.21ms(without defrag) -> 20.03ms(during defrag)

So the IO latency is *lower* when defrag is running? That doesn't
make any sense, unless the fio throughput is massively reduced while
defrag is running.  What's the throughput change in the fio
workload? What's the change in worst case latency for the fio
workload? i.e. post the actual fio results so we can see the whole
picture of the behaviour, not just a single cherry-picked number.

Really, though, I have to ask: why is this an xfs_spaceman command
and not something built into the existing online defrag program
we have (xfs_fsr)?

I'm sure I'll have more questions as I go through the code - I'll
start at the userspace IO engine part of the patchset so I have some
idea of what the defrag algorithm actually is...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 2/9] spaceman/defrag: pick up segments from target file
  2024-07-09 19:10 ` [PATCH 2/9] spaceman/defrag: pick up segments from target file Wengang Wang
  2024-07-09 21:50   ` [PATCH 2/9] spaceman/defrag: pick up segments from target fileOM Darrick J. Wong
@ 2024-07-15 23:40   ` Dave Chinner
  2024-07-16 20:23     ` Wengang Wang
  1 sibling, 1 reply; 60+ messages in thread
From: Dave Chinner @ 2024-07-15 23:40 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs

On Tue, Jul 09, 2024 at 12:10:21PM -0700, Wengang Wang wrote:
> segments are the smallest unit to defragment.
> 
> A segment
> 1. Can't exceed size limit
> 2. contains some extents
> 3. the contained extents can't be "unwritten"
> 4. the contained extents must be contigous in file blocks
> 
> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
>  spaceman/defrag.c | 204 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 204 insertions(+)
> 
> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> index c9732984..175cf461 100644
> --- a/spaceman/defrag.c
> +++ b/spaceman/defrag.c
> @@ -14,6 +14,32 @@
>  #include "space.h"
>  #include "input.h"
>  
> +#define MAPSIZE 512
> +/* used to fetch bmap */
> +struct getbmapx	g_mapx[MAPSIZE];
> +/* current offset of the file in units of 512 bytes, used to fetch bmap */
> +static long long 	g_offset = 0;
> +/* index to indentify next extent, used to get next extent */
> +static int		g_ext_next_idx = -1;

Please do not prefix global variables with "g_". This is not
useful, and simply makes the code hard to read.

That said, it is much better to pass these as function parameters so
they are specific to the mapping context and so are inherently
thread safe.

> +/*
> + * segment, the smallest unit to defrag
> + * it includes some contiguous extents.
> + * no holes included,
> + * no unwritten extents included
> + * the size is limited by g_segment_size_lmt
> + */

I have no idea what this comment is trying to tell me.

> +struct defrag_segment {
> +	/* segment offset in units of 512 bytes */
> +	long long	ds_offset;
> +	/* length of segment in units of 512 bytes */
> +	long long	ds_length;
> +	/* number of extents in this segment */
> +	int		ds_nr;
> +	/* flag indicating if segment contains shared blocks */
> +	bool		ds_shared;
> +};
> +
>  /* defrag segment size limit in units of 512 bytes */
>  #define MIN_SEGMENT_SIZE_LIMIT 8192 /* 4MiB */
>  #define DEFAULT_SEGMENT_SIZE_LIMIT 32768 /* 16MiB */
> @@ -78,6 +104,165 @@ defrag_check_file(char *path)
>  	return true;
>  }
>  
> +/*
> + * get next extent in the file.
> + * Note: next call will get the same extent unless move_next_extent() is called.
> + * returns:
> + * -1:	error happened.
> + * 0:	extent returned
> + * 1:	no more extent left
> + */
> +static int
> +defrag_get_next_extent(int fd, struct getbmapx *map_out)
> +{
> +	int err = 0, i;
> +
> +	/* when no extents are cached in g_mapx, fetch from kernel */
> +	if (g_ext_next_idx == -1) {
> +		g_mapx[0].bmv_offset = g_offset;
> +		g_mapx[0].bmv_length = -1LL;
> +		g_mapx[0].bmv_count = MAPSIZE;
> +		g_mapx[0].bmv_iflags = BMV_IF_NO_HOLES | BMV_IF_PREALLOC;
> +		err = ioctl(fd, XFS_IOC_GETBMAPX, g_mapx);
> +		if (err == -1) {
> +			perror("XFS_IOC_GETBMAPX failed");
> +			goto out;
> +		}
> +		/* for stats */
> +		g_ext_stats.nr_ext_total += g_mapx[0].bmv_entries;
> +
> +		/* no more extents */
> +		if (g_mapx[0].bmv_entries == 0) {
> +			err = 1;
> +			goto out;
> +		}
> +
> +		/* for stats */
> +		for (i = 1; i <= g_mapx[0].bmv_entries; i++) {
> +			if (g_mapx[i].bmv_oflags & BMV_OF_PREALLOC)
> +				g_ext_stats.nr_ext_unwritten++;
> +			if (g_mapx[i].bmv_oflags & BMV_OF_SHARED)
> +				g_ext_stats.nr_ext_shared++;
> +		}
> +
> +		g_ext_next_idx = 1;
> +		g_offset = g_mapx[g_mapx[0].bmv_entries].bmv_offset +
> +				g_mapx[g_mapx[0].bmv_entries].bmv_length;
> +	}
> +
> +	map_out->bmv_offset = g_mapx[g_ext_next_idx].bmv_offset;
> +	map_out->bmv_length = g_mapx[g_ext_next_idx].bmv_length;
> +	map_out->bmv_oflags = g_mapx[g_ext_next_idx].bmv_oflags;
> +out:
> +	return err;
> +}

Ok, so the global variables are just a bmap cache. That's a problem,
because this cache is stale the moment XFS_IOC_GETBMAPX returns to
userspace. Iterating it to decide exactly what to do next will
race with ongoing file modifications and so it's not going to be
accurate....

> +
> +/*
> + * move to next extent
> + */
> +static void
> +defrag_move_next_extent()
> +{
> +	if (g_ext_next_idx == g_mapx[0].bmv_entries)
> +		g_ext_next_idx = -1;
> +	else
> +		g_ext_next_idx += 1;
> +}
> +
> +/*
> + * check if the given extent is a defrag target.
> + * no need to check for holes as we are using BMV_IF_NO_HOLES
> + */
> +static bool
> +defrag_is_target(struct getbmapx *mapx)
> +{
> +	/* unwritten */
> +	if (mapx->bmv_oflags & BMV_OF_PREALLOC)
> +		return false;
> +	return mapx->bmv_length < g_segment_size_lmt;
> +}
> +
> +static bool
> +defrag_is_extent_shared(struct getbmapx *mapx)
> +{
> +	return !!(mapx->bmv_oflags & BMV_OF_SHARED);
> +}
> +
> +/*
> + * get next segment to defragment.
> + * returns:
> + * -1	error happened.
> + * 0	segment returned.
> + * 1	no more segments to return
> + */
> +static int
> +defrag_get_next_segment(int fd, struct defrag_segment *out)
> +{
> +	struct getbmapx mapx;
> +	int	ret;
> +
> +	out->ds_offset = 0;
> +	out->ds_length = 0;
> +	out->ds_nr = 0;
> +	out->ds_shared = false;

out->ds_nr is never set to anything but zero in this patch.

> +
> +	do {
> +		ret = defrag_get_next_extent(fd, &mapx);
> +		if (ret != 0) {
> +			/*
> +			 * no more extetns, return current segment if its not

Typos everywhere.

> +			 * empty
> +			*/
> +			if (ret == 1 && out->ds_nr > 0)
> +				ret = 0;
> +			/* otherwise, error heppened, stop */
> +			break;
> +		}

> +
> +		/*
> +		 * If the extent is not a defrag target, skip it.
> +		 * go to next extent if the segment is empty;
> +		 * otherwise return the segment.
> +		 */
> +		if (!defrag_is_target(&mapx)) {
> +			defrag_move_next_extent();
> +			if (out->ds_nr == 0)
> +				continue;
> +			else
> +				break;
> +		}
> +
> +		/* check for segment size limitation */
> +		if (out->ds_length + mapx.bmv_length > g_segment_size_lmt)
> +			break;
> +
> +		/* the segment is empty now, add this extent to it for sure */
> +		if (out->ds_nr == 0) {
> +			out->ds_offset = mapx.bmv_offset;
> +			goto add_ext;
> +		}

So this is essentially a filter for the getbmapx output that strips
away unwritten extents and anything outside/larger than the target
range.

> +
> +		/*
> +		 * the segment is not empty, check for hole since the last exent
> +		 * if a hole exist before this extent, this extent can't be
> +		 * added to the segment. return the segment
> +		 */
> +		if (out->ds_offset + out->ds_length != mapx.bmv_offset)
> +			break;
> +
> +add_ext:

Why do you need a goto for this logic?

		/*
		 * the segment is not empty, check for hole since the last exent
		 * if a hole exist before this extent, this extent can't be
		 * added to the segment. return the segment
		 */
		if (out->ds_nr) {
			if (out->ds_offset + out->ds_length != mapx.bmv_offset)
				break;
		} else {
			out->ds_offset = mapx.bmv_offset;
		}

> +		if (defrag_is_extent_shared(&mapx))
> +			out->ds_shared = true;
> +
> +		out->ds_length += mapx.bmv_length;
> +		out->ds_nr += 1;
> +		defrag_move_next_extent();
> +
> +	} while (true);

> +
> +	return ret;
> +}
> +
>  /*
>   * defragment a file
>   * return 0 if successfully done, 1 otherwise
> @@ -92,6 +277,9 @@ defrag_xfs_defrag(char *file_path) {
>  	struct fsxattr	fsx;
>  	int	ret = 0;
>  
> +	g_offset = 0;
> +	g_ext_next_idx = -1;
> +
>  	fsx.fsx_nextents = 0;
>  	memset(&g_ext_stats, 0, sizeof(g_ext_stats));
>  
> @@ -119,6 +307,22 @@ defrag_xfs_defrag(char *file_path) {
>  		ret = 1;
>  		goto out;
>  	}
> +
> +	do {
> +		struct defrag_segment segment;
> +
> +		ret = defrag_get_next_segment(defrag_fd, &segment);
> +		/* no more segments, we are done */
> +		if (ret == 1) {
> +			ret = 0;
> +			break;
> +		}
> +		/* error happened when reading bmap, stop here */
> +		if (ret == -1) {
> +			ret = 1;
> +			break;
> +		}

Ternary return values are nasty. Return a negative errno when an
error occurs, and -ENODATA when there are no more segments.
Then you have

		if (ret < 0) {
			if (ret == -ENODATA)
				exit_value = 0;
			else
				exit_value = 1;
			break;
		}

> +	} while (true);

Not a fan of do {} while(true) loops.

With the above error handling changes, this becomes:

	do {
		struct defrag_segment segment;

		ret = defrag_get_next_segment(defrag_fd, &segment);
	} while (ret == 0);

	if (ret == 0 || ret == -ENODATA)
		exit_value = 0;
	else
		exit_value = 1;


Ok, so this is a linear iteration of all extents in the file that
filters extents for the specific "segment" that is going to be
processed. I still have no idea why fixed length segments are
important, but "linear extent scan for filtering" seems somewhat
expensive.

Indeed, if you used FIEMAP, you can pass a minimum
segment length to filter out all the small extents. Iterating that
extent list means all the ranges you need to defrag are in the holes
of the returned mapping information. This would be much faster
than an entire linear mapping to find all the regions with small
extents that need defrag. The second step could then be doing a
fine grained mapping of each region that we now know either contains
fragmented data or holes....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/9] spaceman/defrag: defrag segments
  2024-07-09 19:10 ` [PATCH 3/9] spaceman/defrag: defrag segments Wengang Wang
  2024-07-09 21:57   ` Darrick J. Wong
@ 2024-07-16  0:08   ` Dave Chinner
  2024-07-18 18:06     ` Wengang Wang
  1 sibling, 1 reply; 60+ messages in thread
From: Dave Chinner @ 2024-07-16  0:08 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs

On Tue, Jul 09, 2024 at 12:10:22PM -0700, Wengang Wang wrote:
> For each segment, the following steps are done trying to defrag it:
> 
> 1. share the segment with a temporary file
> 2. unshare the segment in the target file. The kernel simulates CoW on the
>    whole segment to complete the unshare (defrag).
> 3. release blocks from the temporary file.
> 
> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
>  spaceman/defrag.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 114 insertions(+)
> 
> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> index 175cf461..9f11e36b 100644
> --- a/spaceman/defrag.c
> +++ b/spaceman/defrag.c
> @@ -263,6 +263,40 @@ add_ext:
>  	return ret;
>  }
>  
> +/*
> + * check if the segment exceeds EoF.
> + * fix up the clone range and return true if EoF happens,
> + * return false otherwise.
> + */
> +static bool
> +defrag_clone_eof(struct file_clone_range *clone)
> +{
> +	off_t delta;
> +
> +	delta = clone->src_offset + clone->src_length - g_defrag_file_size;
> +	if (delta > 0) {
> +		clone->src_length = 0; // to the end
> +		return true;
> +	}
> +	return false;
> +}
> +
> +/*
> + * get the time delta since pre_time in us.
> + * pre_time should contain values fetched by gettimeofday()
> + * cur_time is used to store current time by gettimeofday()
> + */
> +static long long
> +get_time_delta_us(struct timeval *pre_time, struct timeval *cur_time)
> +{
> +	long long us;
> +
> +	gettimeofday(cur_time, NULL);
> +	us = (cur_time->tv_sec - pre_time->tv_sec) * 1000000;
> +	us += (cur_time->tv_usec - pre_time->tv_usec);
> +	return us;
> +}
> +
>  /*
>   * defragment a file
>   * return 0 if successfully done, 1 otherwise
> @@ -273,6 +307,7 @@ defrag_xfs_defrag(char *file_path) {
>  	long	nr_seg_defrag = 0, nr_ext_defrag = 0;
>  	int	scratch_fd = -1, defrag_fd = -1;
>  	char	tmp_file_path[PATH_MAX+1];
> +	struct file_clone_range clone;
>  	char	*defrag_dir;
>  	struct fsxattr	fsx;
>  	int	ret = 0;
> @@ -296,6 +331,8 @@ defrag_xfs_defrag(char *file_path) {
>  		goto out;
>  	}
>  
> +	clone.src_fd = defrag_fd;
> +
>  	defrag_dir = dirname(file_path);

Just a note: can you please call this the "source fd", not the
"defrag_fd"? defrag_fd could mean either the source or the
temporary scratch file we use as the defrag destination.

>  	snprintf(tmp_file_path, PATH_MAX, "%s/.xfsdefrag_%d", defrag_dir,
>  		getpid());
> @@ -309,7 +346,11 @@ defrag_xfs_defrag(char *file_path) {
>  	}
>  
>  	do {
> +		struct timeval t_clone, t_unshare, t_punch_hole;
>  		struct defrag_segment segment;
> +		long long seg_size, seg_off;
> +		int time_delta;
> +		bool stop;
>  
>  		ret = defrag_get_next_segment(defrag_fd, &segment);
>  		/* no more segments, we are done */
> @@ -322,6 +363,79 @@ defrag_xfs_defrag(char *file_path) {
>  			ret = 1;
>  			break;
>  		}
> +
> +		/* we are done if the segment contains only 1 extent */
> +		if (segment.ds_nr < 2)
> +			continue;
> +
> +		/* to bytes */
> +		seg_off = segment.ds_offset * 512;
> +		seg_size = segment.ds_length * 512;

Ugh. Do this in the mapping code that gets the extent info. Have it
return bytes. Or just use FIEMAP because it uses byte ranges to
begin with.

> +
> +		clone.src_offset = seg_off;
> +		clone.src_length = seg_size;
> +		clone.dest_offset = seg_off;
> +
> +		/* checks for EoF and fix up clone */
> +		stop = defrag_clone_eof(&clone);

Ok, so we copy the segment map into clone args, and ...

> +		gettimeofday(&t_clone, NULL);
> +		ret = ioctl(scratch_fd, FICLONERANGE, &clone);
> +		if (ret != 0) {
> +			fprintf(stderr, "FICLONERANGE failed %s\n",
> +				strerror(errno));
> +			break;
> +		}

clone the source to the scratch file. This blocks writes to the
source file while it is in progress, but allows reads to pass
through the source file as data is not changing.


> +		/* for time stats */
> +		time_delta = get_time_delta_us(&t_clone, &t_unshare);
> +		if (time_delta > max_clone_us)
> +			max_clone_us = time_delta;
> +
> +		/* for defrag stats */
> +		nr_ext_defrag += segment.ds_nr;
> +
> +		/*
> +		 * For the shared range to be unshared via a copy-on-write
> +		 * operation in the file to be defragged. This causes the
> +		 * file needing to be defragged to have new extents allocated
> +		 * and the data to be copied over and written out.
> +		 */
> +		ret = fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, seg_off,
> +				seg_size);
> +		if (ret != 0) {
> +			fprintf(stderr, "UNSHARE_RANGE failed %s\n",
> +				strerror(errno));
> +			break;
> +		}

And now we unshare the source file. This blocks all IO to the source
file.

Ok, so this is the fundamental problem this whole "segmented
defrag" is trying to work around: FALLOC_FL_UNSHARE_RANGE blocks
all read and write IO whilst it is in progress.

We had this same problem with FICLONERANGE taking snapshots of VM
files - we changed the locking to take shared IO locks to allow
reads to run while the clone was in progress. Because the Oracle VM
infrastructure uses a sidecar to redirect writes while a snapshot
(clone) was in progress, no VM IO got blocked while the clone was in
progress and so the applications inside the VM never even noticed a
clone was taking place.

Why isn't the same infrastructure being used here?

FALLOC_FL_UNSHARE_RANGE is not changing data, nor is it freeing any
data blocks. Yes, we are re-writing the data somewhere else, but
in that case the original data is still intact in its original
location on disk and not being freed.

Hence if a read races with UNSHARE, it will hit a referenced extent
containing the correct data regardless of whether it is in the old
or new file. Hence we can likely use shared IO locking for UNSHARE,
just like we do for FICLONERANGE.

At this point, if the Oracle VM infrastructure uses the sidecar
write channel whilst the defrag is in progress, this whole algorithm
simply becomes "for regions with extents smaller than X, clone and
unshare the region".

The whole need for "idle time" goes away. The need for segment size
control largely goes away. The need to tune the defrag algorithm to
avoid IO latency and/or throughput issues goes away.

> +
> +		/* for time stats */
> +		time_delta = get_time_delta_us(&t_unshare, &t_punch_hole);
> +		if (time_delta > max_unshare_us)
> +			max_unshare_us = time_delta;
> +
> +		/*
> +		 * Punch out the original extents we shared to the
> +		 * scratch file so they are returned to free space.
> +		 */
> +		ret = fallocate(scratch_fd,
> +			FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, seg_off,
> +			seg_size);
> +		if (ret != 0) {
> +			fprintf(stderr, "PUNCH_HOLE failed %s\n",
> +				strerror(errno));
> +			break;
> +		}

This is unnecessary if there is lots of free space. You can leave
this to the very end of defrag so that the source file defrag
operation isn't slowed down by cleaning up all the fragmented
extents....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 6/9] spaceman/defrag: workaround kernel xfs_reflink_try_clear_inode_flag()
  2024-07-09 19:10 ` [PATCH 6/9] spaceman/defrag: workaround kernel xfs_reflink_try_clear_inode_flag() Wengang Wang
  2024-07-09 20:51   ` Darrick J. Wong
@ 2024-07-16  0:25   ` Dave Chinner
  2024-07-18 18:24     ` Wengang Wang
  2024-07-31 22:25   ` Dave Chinner
  2 siblings, 1 reply; 60+ messages in thread
From: Dave Chinner @ 2024-07-16  0:25 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs

On Tue, Jul 09, 2024 at 12:10:25PM -0700, Wengang Wang wrote:
> xfs_reflink_try_clear_inode_flag() takes very long in case the file has a huge
> number of extents and none of the extents are shared.

Got a kernel profile showing how bad it is?

> 
> workaround:
> share the first real extent so that xfs_reflink_try_clear_inode_flag() returns
> quickly to save cpu times and speed up defrag significantly.

That's nasty.

Let's fix the kernel code, not work around it in userspace.

I mean, it would be really easy to store if an extent is shared in
the iext btree record for the extent. If we do an unshare operation,
just do a single "find shared extents" pass on the extent tree and
mark all the extents that are shared as shared.  Then set a flag on
the data fork saying it is tracking shared extents, and so when we
share/unshare extents in that inode from then on, we set/clear that
flag in the iext record. (i.e. it's an in-memory equivalent of the
UNWRITTEN state flag).

Then after the first unshare, checking for nothing being shared is a
walk of the iext btree over the given range, not a refcountbt
walk. That should be much faster.

And we could make it even faster by adding a "shared extents"
counter to the inode fork. i.e. the first scan that sets the flags
also counts the shared extents, and we maintain that as we maintain
the in-memory extent flags....

That makes the cost of xfs_reflink_try_clear_inode_flag() basically
go to zero in these sorts of workloads. IMO, this is a much better
solution to the problem than hacking around it in userspace...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 9/9] spaceman/defrag: warn on extsize
  2024-07-11 23:36     ` Wengang Wang
@ 2024-07-16  0:29       ` Dave Chinner
  2024-07-22 18:01         ` Wengang Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Dave Chinner @ 2024-07-16  0:29 UTC (permalink / raw)
  To: Wengang Wang; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org

On Thu, Jul 11, 2024 at 11:36:28PM +0000, Wengang Wang wrote:
> 
> 
> > On Jul 9, 2024, at 1:21 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> > 
> > On Tue, Jul 09, 2024 at 12:10:28PM -0700, Wengang Wang wrote:
> >> According to current kernel implemenation, non-zero extsize might affect
> >> the result of defragmentation.
> >> Just print a warning on that if non-zero extsize is set on file.
> > 
> > I'm not sure what's the point of warning vaguely about extent size
> > hints?  I'd have thought that would help reduce the number of extents;
> > is that not the case?
> 
> Not exactly.
> 
> Same 1G file with about 54K extents,
> 
> The one with 16K extsize, after defrag, it’s extents drops to 13K.
> And the one with 0 extsize, after defrag, it’s extents dropped to 22.

extsize should not affect file contiguity like this at all. Are you
measuring fragmentation correctly? i.e. a contiguous region from a
larger extsize allocation that results in bmap/fiemap output of
three extents in an unwritten/written/unwritten pattern is not fragmentation.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 8/9] spaceman/defrag: readahead for better performance
  2024-07-09 19:10 ` [PATCH 8/9] spaceman/defrag: readahead for better performance Wengang Wang
  2024-07-09 20:27   ` Darrick J. Wong
@ 2024-07-16  0:56   ` Dave Chinner
  2024-07-18 18:40     ` Wengang Wang
  1 sibling, 1 reply; 60+ messages in thread
From: Dave Chinner @ 2024-07-16  0:56 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs

On Tue, Jul 09, 2024 at 12:10:27PM -0700, Wengang Wang wrote:
> Reading ahead takes less locking on the file compared to "unshare" via ioctl.
> Do readahead when defrag sleeps, for better defrag performance and thus more
> file IO time.
> 
> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
>  spaceman/defrag.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> index 415fe9c2..ab8508bb 100644
> --- a/spaceman/defrag.c
> +++ b/spaceman/defrag.c
> @@ -331,6 +331,18 @@ defrag_fs_limit_hit(int fd)
>  }
>  
>  static bool g_enable_first_ext_share = true;
> +static bool g_readahead = false;
> +
> +static void defrag_readahead(int defrag_fd, off64_t offset, size_t count)
> +{
> +	if (!g_readahead || g_idle_time <= 0)
> +		return;
> +
> +	if (readahead(defrag_fd, offset, count) < 0) {
> +		fprintf(stderr, "readahead failed: %s, errno=%d\n",
> +			strerror(errno), errno);

This doesn't do what you think it does. readahead() only queues the
first readahead chunk of the range given (a few pages at most). It
does not cause readahead on the entire range, wait for page cache
population, nor report IO errors that might have occurred during
readahead.

There's almost no value to making this syscall, especially if the
app is about to trigger a sequential read for the whole range.
Readahead will occur naturally during that read operation (i.e. the
UNSHARE copy), and the read will return IO errors unlike
readahead().

If you want the page cache pre-populated before the unshare
operation is done, then you need to use mmap() and
madvise(MADV_POPULATE_READ). This will read the whole region into
the page cache as if it was a sequential read, wait for it to
complete and return any IO errors that might have occurred during
the read.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 4/9] spaceman/defrag: ctrl-c handler
  2024-07-15 22:56       ` Darrick J. Wong
@ 2024-07-16 16:21         ` Wengang Wang
  0 siblings, 0 replies; 60+ messages in thread
From: Wengang Wang @ 2024-07-16 16:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs@vger.kernel.org



> On Jul 15, 2024, at 3:56 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Thu, Jul 11, 2024 at 10:58:02PM +0000, Wengang Wang wrote:
>> 
>> 
>>> On Jul 9, 2024, at 2:08 PM, Darrick J. Wong <djwong@kernel.org> wrote:
>>> 
>>> On Tue, Jul 09, 2024 at 12:10:23PM -0700, Wengang Wang wrote:
>>>> Add this handler to break the defrag better, so it has
>>>> 1. the stats reporting
>>>> 2. remove the temporary file
>>>> 
>>>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>>>> ---
>>>> spaceman/defrag.c | 11 ++++++++++-
>>>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>>> 
>>>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>>>> index 9f11e36b..61e47a43 100644
>>>> --- a/spaceman/defrag.c
>>>> +++ b/spaceman/defrag.c
>>>> @@ -297,6 +297,13 @@ get_time_delta_us(struct timeval *pre_time, struct timeval *cur_time)
>>>> return us;
>>>> }
>>>> 
>>>> +static volatile bool usedKilled = false;
>>>> +void defrag_sigint_handler(int dummy)
>>>> +{
>>>> + usedKilled = true;
>>> 
>>> Not sure why some of these variables are camelCase and others not.
>>> Or why this global variable doesn't have a g_ prefix like the others?
>>> 
>> 
>> Yep, will change it to g_user_killed.
>> 
>>>> + printf("Please wait until current segment is defragmented\n");
>>> 
>>> Is it actually safe to call printf from a signal handler?  Handlers must
>>> be very careful about what they call -- regreSSHion was a result of
>>> openssh not getting this right.
>>> 
>>> (Granted spaceman isn't as critical...)
>>> 
>> 
>> As the UNSHARE ioctl takes time, the process would really stall for a while
>> after the user's kill. The message is used as a quick response to the user; it
>> doesn't actually have any functionality. If it's not safe, we can remove it.
> 
> $ man signal-safety

Yep, will remove the print.

Thanks,
Wengang


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/9] introduce defrag to xfs_spaceman
  2024-07-15 23:03 ` [PATCH 0/9] introduce defrag to xfs_spaceman Dave Chinner
@ 2024-07-16 19:45   ` Wengang Wang
  2024-07-31  2:51     ` Dave Chinner
  0 siblings, 1 reply; 60+ messages in thread
From: Wengang Wang @ 2024-07-16 19:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs@vger.kernel.org



> On Jul 15, 2024, at 4:03 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> [ Please keep documentation text to 80 columns. ] 
> 

Yes. This is not a patch. I copied it from the man 8 output.
It will be limited to 80 columns when sent as a patch.

> [ Please run documentation through a spell checker - there are too
> many typos in this document to point them all out... ]

OK.

> 
> On Tue, Jul 09, 2024 at 12:10:19PM -0700, Wengang Wang wrote:
>> This patch set introduces defrag to xfs_spaceman command. It has the functionality and
>> features below (also subject to be added to man page, so please review):
> 
> What's the use case for this?

This is the user space defrag as you suggested previously.

Please see the previous conversation for your reference: 
https://patchwork.kernel.org/project/xfs/cover/20231214170530.8664-1-wen.gang.wang@oracle.com/

COPY STARTS —————————————> 
I am copying your last comment there:

On Tue, Dec 19, 2023 at 09:17:31PM +0000, Wengang Wang wrote:
> Hi Dave,
> Yes, the user space defrag works and satisfies my requirement (almost no change from your example code).

That's good to know :)

> Let me know if you want it in xfsprog.

Yes, i think adding it as an xfs_spaceman command would be a good
way for this defrag feature to be maintained for anyone who has need
for it.

-Dave.
<———————————————— COPY ENDS

> 
>>       defrag [-f free_space] [-i idle_time] [-s segment_size] [-n] [-a]
>>              defrag defragments the specified XFS file online non-exclusively. The target XFS
> 
> What's "non-exclusively" mean? How is this different to what xfs_fsr
> does?
> 

I think you'll have seen the difference while reviewing more of this set.
Well, if I read the xfs_fsr code correctly, though xfs_fsr allows parallel
writes, it looks to have a problem(?). As I read the code, xfs_fsr does the
following to defrag one file:
1) preallocate blocks to a temporary file, hoping the temporary file gets the
    same number of blocks as the file under defrag but with fewer extents.
2) copy data blocks from the file under defrag to the temporary file.
3) switch the extents between the two files.

For stage 2, the data is NOT copied in an atomic manner. As an example, suppose
two read->write pairs are needed to complete the data copy, that is:
    copy range 1 (read range 1 from the file under defrag to the temporary file)
    copy range 2

What if a new write comes to range 1 of the file under defrag after copying
range 1 is done? After the defrag (xfs_fsr) finishes, will the new write be
lost?

I didn't look into the extent-switch code, so I don't know whether it checks
that the two files have the same data contents. But even if it does, it would
be pretty slow with the file locked.


>>              doesn't have to (and must not) be unmunted.  When defragmentation is in progress, file
>>              IOs are served 'in parallel'.  reflink feature must be enabled in the XFS.
> 
> xfs_fsr allows IO to occur in parallel to defrag.

Pls see my concern above.

> 
>>              Defragmentation and file IOs
>> 
>>              The target file is virtually devided into many small segments. Segments are the
>>              smallest units for defragmentation. Each segment is defragmented one by one in a
>>              lock->defragment->unlock->idle manner.
> 
> Userspace can't easily lock the file to prevent concurrent access.
So I'm not sure what you are referring to here.

The manner doesn't describe just what is done in user space, but the whole
thing across both user space and kernel space. The tool defrags a file segment
by segment; the lock->defragment->unlock is done by the kernel in response to
the FALLOC_FL_UNSHARE_RANGE request from user space.

> 
>>              File IOs are blocked when the target file is locked and are served during the
>>              defragmentation idle time (file is unlocked).
> 
> What file IOs are being served in parallel? The defragmentation IO?
> something else?

Here the file IOs mean the IO requests from user space applications, including
the virtual machine engine.

> 
>>              Though
>>              the file IOs can't really go in parallel, they are not blocked long. The locking time
>>              basically depends on the segment size. Smaller segments usually take less locking time
>>              and thus IOs are blocked shorterly, bigger segments usually need more locking time and
>>              IOs are blocked longer. Check -s and -i options to balance the defragmentation and IO
>>              service.
> 
> How is a user supposed to know what the correct values are for their
> storage, files, and workload? Algorithms should auto tune, not
> require users and administrators to use trial and error to find the
> best numbers to feed a given operation.

In my opinion, users need a way to control this according to their use case.
Any algorithm will restrict what the user wants to do. Say the user wants the
defrag done as quickly as possible, regardless of the resources it takes (CPU,
IO and so on), while the production system is in a maintenance window; but when
the production system is busy, the user wants the defrag to use fewer
resources. Another example: the kernel (algorithm) never knows the maximum IO
latency the user applications can tolerate. But if you have some algorithms,
please share.

And we provide default numbers for the options; they come from test practice,
though users might need to change them for their own use case.

> 
>>              Temporary file
>> 
>>              A temporary file is used for the defragmentation. The temporary file is created in the
>>              same directory as the target file is and is named ".xfsdefrag_<pid>". It is a sparse
>>              file and contains a defragmentation segment at a time. The temporary file is removed
>>              automatically when defragmentation is done or is cancelled by ctrl-c. It remains in
>>              case kernel crashes when defragmentation is going on. In that case, the temporary file
>>              has to be removed manaully.
> 
> O_TMPFILE, as Darrick has already requested.

OK. Will use it.
> 
>> 
>>              Free blocks consumption
>> 
>>              Defragmenation works by (trying) allocating new (contiguous) blocks, copying data and
>>              then freeing old (non-contig) blocks. Usually the number of old blocks to free equals
>>              to the number the newly allocated blocks. As a finally result, defragmentation doesn't
>>              consume free blocks.  Well, that is true if the target file is not sharing blocks with
>>              other files.
> 
> This is really hard to read. Defragmentation will -always- consume
> free space while it is progress. It will always release the
> temporary space it consumes when it completes.

I don’t think it’s always free blocks when it releases the temporary file. When the blocks were
Original shared before defrag, the blocks won’t be freed.

> 
>>              In case the target file contains shared blocks, those shared blocks won't
>>              be freed back to filesystem as they are still owned by other files. So defragmenation
>>              allocates more blocks than it frees.
> 
> So this is doing an unshare operation as well as defrag? That seems
> ... suboptimal. The whole point of sharing blocks is to minimise
> disk usage for duplicated data.

That depends on the user's needs. If users think defrag is the first priority,
it is; if users don't think the disk saving is the most important thing, it is
not. No matter what developers think.
What's more, reflink (or sharing blocks) is not only used to minimize disk
usage; sometimes it's used as a way to take snapshots, and those snapshots
might not stay long.

And what's more, the unshare operation is what you suggested :D


> 
>>              For existing XFS, free blocks might be over-
>>              committed when reflink snapshots were created. To avoid causing the XFS running into
>>              low free blocks state, this defragmentation excludes (partially) shared segments when
>>              the file system free blocks reaches a shreshold. Check the -f option.
> 
> Again, how is the user supposed to know when they need to do this?
> If the answer is "they should always avoid defrag on low free
> space", then why is this an option?

I didn’t say "they should always avoid defrag on low free space”. And even we can’t say how low is
Not tolerated by user, that depends on user use case. Though it’s an option, it has the default value
Of 1GB. If users don’t set this option, that is "always avoid defrag on low free space”.


> 
>>              Safty and consistency
>> 
>>              The defragmentation file is guanrantted safe and data consistent for ctrl-c and kernel
>>              crash.
> 
> Which file is the "defragmentation file"? The source or the temp
> file?

I don’t think there is "source concept" here. There is no data copy between files.
“The defragmentation file” means the file under defrag, I will change it to “The file under defrag”.
I don’t think users care about the temporary file at all.


> 
>>              First extent share
>> 
>>              Current kernel has routine for each segment defragmentation detecting if the file is
>>              sharing blocks.
> 
> I have no idea what this means, or what interface this refers to.
> 
>>              It takes long in case the target file contains huge number of extents
>>              and the shared ones, if there is, are at the end. The First extent share feature works
>>              around above issue by making the first serveral blocks shared. Seeing the first blocks
>>              are shared, the kernel routine ends quickly. The side effect is that the "share" flag
>>              would remain traget file. This feature is enabled by default and can be disabled by -n
>>              option.
> 
> And from this description, I have no idea what this is doing, what
> problem it is trying to work around, or why we'd want to share
> blocks out of a file to speed up detection of whether there are
> shared blocks in the file. This description doesn't make any sense
> to me because I don't know what interface you are actually having
> performance issues with. Please reference the kernel code that is
> problematic, and explain why the existing kernel code is problematic
> and cannot be fixed.

I mentioned the kernel function name in patch 6. It is xfs_reflink_try_clear_inode_flag().

> 
>>              extsize and cowextsize
>> 
>>              According to kernel implementation, extsize and cowextsize could have following impacts
>>              to defragmentation: 1) non-zero extsize causes separated block allocations for each
>>              extent in the segment and those blocks are not contiguous.
> 
> Extent size hints do no such thing. The simply provide extent
> alignment guidelines and do not affect things like contiguous or
> multi-block allocation lengths.

Extsize really makes the allocation length aligned, but it affects more than
that. When extsize is set, the allocation is not a delayed allocation.
xfs_reflink_unshare() does one allocation per extent; for a defrag segment
containing N extents, there are N allocations.

> 
>>              The segment remains same
>>              number of extents after defragmention (no effect).  2) When extsize and/or cowextsize
>>              are too big, a lot of pre-allocated blocks remain in memory for a while. When new IO
>>              comes to whose pre-allocated blocks  Copy on Write happens and causes the file
>>              fragmented.
> 
> extsize based unwritten extents won't cause COW or cause
> fragmentation because they aren't shared and they are contiguous.
> I suspect that your definition of "fragmented" isn't taking into
> account that unwritten-written-unwritten over a contiguous range
> is *not* fragmentation. It's just a contiguous extent in different
> states, and this should really not be touched/changed by
> defragmentation.

Are you sure about that? In my opinion, taking buffered write as an example:
during writeback, when the target block is found in the CoW fork, copy-on-write
just happens no matter whether the block is really shared or not. Consider this
simple example:
1) A file contains 4 blocks. File blocks 0, 1 and 2 are shared and block 3 is
    not shared. Extsize on this file is 4 blocks.
2) A writeback comes for file blocks 0, 1 and 2.
3) Seeing those 3 blocks are shared, the kernel pre-allocates blocks in the
    CoW fork. Extsize being 4 blocks, after alignment, 4 (unwritten) blocks
    are allocated in the CoW fork.
4) Data is written to 3 of the blocks in the CoW fork. In the IO-done callback,
    those 3 blocks in the CoW fork are moved to the data fork and the original
    3 blocks in the data fork are freed.

The copy-on-write is done, right?
But remember, there is 1 unwritten block left in the CoW fork.
If a new writeback now comes for file block 3, the kernel sees there is a file
block 3 in the CoW fork, and a new copy-on-write happens.
 

> 
> check out xfs_fsr: it ensures that the pattern of unwritten/written
> blocks in the defragmented file is identical to the source. i.e. it
> preserves preallocation because the application/fs config wants it
> to be there....
> 
>>              Readahead
>> 
>>              Readahead tries to fetch the data blocks for next segment with less locking in
>>              backgroud during idle time. This feature is disabled by default, use -a to enable it.
> 
> What are you reading ahead into? Kernel page cache or user buffers?
Kernel page cache.
> Either way, it's hardly what I'd call "idle time" if the defrag
> process is using it to issue lots of read IO...
> 

During the “idle time”, the file is not (IOLOCK) locked, though disk fetching might be happening.

> 
>>              The command takes the following options:
>>                 -f free_space
>>                     The shreshold of XFS free blocks in MiB. When free blocks are less than this
>>                     number, (partially) shared segments are excluded from defragmentation. Default
>>                     number is 1024
> 
> When you are down to 4MB of free space in the filesystem, you
> shouldn't even be trying to run defrag because all the free space
> that will be left in the filesystem is single blocks. I would have
> expected this sort of number to be in a percentage of capacity,
> defaulting to something like 5% (which is where we start running low
> space algorithms in the kernel).

I would like to leave this to the user. When a user does defrag on a
low-free-space system, it won't cause problems for the filesystem itself; at
most the defrag fails during unshare when allocating blocks. You can't prevent
the user from writing to a new file when the system is low on free space
either.

I don't think a percentage is a good idea. Say, for a 10TiB filesystem, 5% is
512GiB, which is plenty to do things with; and for a small one, say a 512MiB
filesystem, 5% is about 25MiB, which is too little. In the above cases,
limiting by a percentage would either prevent the user from doing something
that could be done without any problem, or allow the user to do something that
might cause problems. I'd think specifying a fixed safe size is better.


> 
>>                 -i idle_time
>>                     The time in milliseconds, defragmentation enters idle state for this long after
>>                     defragmenting a segment and before handing the next. Default number is TOBEDONE.
> 
> Yeah, I don't think this is something anyone would be expected to
> use or tune. If an idle time is needed, the defrag application
> should be selecting this itself.

I don’t think so; see my explanation above.

>> 
>>                 -s segment_size
>>                     The size limitation in bytes of segments. Minimium number is 4MiB, default
>>                     number is 16MiB.
> 
> Why were these numbers chosen? What happens if the file has ~32MB
> sized extents and the user wants the file to be returned to a single
> large contiguous extent if possible? i.e. how is the user supposed
> to know how to set this for any given file without first having
> examined the exact pattern of fragmentations in the file?

Why would a customer want the file returned to a single large contiguous extent?
A 32MB extent is pretty good to me; I haven't heard any customer complain about 32MB extents...
And whether we can defragment extents into a larger one doesn't depend only on the tool itself;
it also depends on the state of the filesystem, say how fragmented the free space is, or the AG size.

The 16MiB default was selected according to our tests based on a customer metadump. With a 16MiB
segment size, the defrag result is very good and the IO latency is acceptable too. With the default
16MiB segment size, 32MB extents are excluded from defrag.

If you have a better default size, we can use that.

> 
>>                 -n  Disable the First extent share feature. Enabled by default.
> 
> So confusing.  Is the "feature disable flag" enabled by default, or
> is the feature enabled by default?

Will change it to the following if that's clearer:
The "First extent share" feature is enabled by default. Use -n to disable it.

> 
>>                 -a  Enable readahead feature, disabled by default.
> 
> Same confusion, but opposite logic.
> 
> I would highly recommend that you get a native english speaker to
> review, spell and grammar check the documentation before the next
> time you post it.

OK, will try to do so.

> 
>> We tested with real customer metadump with some different 'idle_time's and found 250ms is good pratice
>> sleep time. Here comes some number of the test:
>> 
>> Test: running of defrag on the image file which is used for the back end of a block device in a
>>      virtual machine. At the same time, fio is running at the same time inside virtual machine
>>      on that block device.
>> block device type:   NVME
>> File size:           200GiB
>> paramters to defrag: free_space: 1024 idle_time: 250 First_extent_share: enabled readahead: disabled
>> Defrag run time:     223 minutes
>> Number of extents:   6745489(before) -> 203571(after)
> 
> So an average extent size of ~32kB before, 100MB after? How much of
> these are shared extents?

Zero shared extents, but there are some unwritten ones.
The stats for a similar run look like this:
Pre-defrag 6654460 extents detected, 112228 are "unwritten", 0 are "shared"
Tried to defragment 6393352 extents (181000359936 bytes) in 26032 segments
Time stats (ms): max clone: 31, max unshare: 300, max punch_hole: 66
Post-defrag 282659 extents detected

> 
> Runtime is 13380secs, so if we copied 200GiB in that time, the
> defrag ran at 16MB/s. That's not very fast.
> 

We are chasing the balance of defrag and parallel IO latency.

> What's the CPU utilisation of the defrag task and kernel side
> processing? What is the difference between "first_extent_share"
> enabled and disabled (both performance numbers and CPU usage)?

On my test VM (spindle-based disk, I think), CPU usage is about 6% for
the defrag command; the kernel-side processing is much lower.
I didn't pay much attention to the CPU usage when "first_extent_share" was disabled,
but I think that caused very high CPU usage.

> 
>> Fio read latency:    15.72ms(without defrag) -> 14.53ms(during defrag)
>> Fio write latency:   32.21ms(without defrag) -> 20.03ms(during defrag)
> 
> So the IO latency is *lower* when defrag is running? That doesn't
> make any sense, unless the fio throughput is massively reduced while
> defrag is running.  

That’s reasonable: for the segments that have already been defragmented, the page cache remains populated.
 

> What's the throughput change in the fio
> workload? What's the change in worst case latency for the fio
> workload? i.e. post the actual fio results so we can see the whole
> picture of the behaviour, not just a single cherry-picked number.

Let me see if we have that saved.

> 
> Really, though, I have to ask: why is this an xfs_spaceman command
> and not something built into the existing online defrag program
> we have (xfs_fsr)?
> 

Quotation from previous conversation:
"""
> Let me know if you want it in xfsprog.

Yes, i think adding it as an xfs_spaceman command would be a good
way for this defrag feature to be maintained for anyone who has need
for it.

-Dave.
"""

Thanks,
Wengang

> I'm sure I'll have more questions as I go through the code - I'll
> start at the userspace IO engine part of the patchset so I have some
> idea of what the defrag algorithm actually is...
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/9] spaceman/defrag: pick up segments from target file
  2024-07-15 23:40   ` [PATCH 2/9] spaceman/defrag: pick up segments from target file Dave Chinner
@ 2024-07-16 20:23     ` Wengang Wang
  2024-07-17  4:11       ` Dave Chinner
  0 siblings, 1 reply; 60+ messages in thread
From: Wengang Wang @ 2024-07-16 20:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs@vger.kernel.org



> On Jul 15, 2024, at 4:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:21PM -0700, Wengang Wang wrote:
>> segments are the smallest unit to defragment.
>> 
>> A segment
>> 1. Can't exceed size limit
>> 2. contains some extents
>> 3. the contained extents can't be "unwritten"
>> 4. the contained extents must be contigous in file blocks
>> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>> ---
>> spaceman/defrag.c | 204 ++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 204 insertions(+)
>> 
>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>> index c9732984..175cf461 100644
>> --- a/spaceman/defrag.c
>> +++ b/spaceman/defrag.c
>> @@ -14,6 +14,32 @@
>> #include "space.h"
>> #include "input.h"
>> 
>> +#define MAPSIZE 512
>> +/* used to fetch bmap */
>> +struct getbmapx g_mapx[MAPSIZE];
>> +/* current offset of the file in units of 512 bytes, used to fetch bmap */
>> +static long long  g_offset = 0;
>> +/* index to indentify next extent, used to get next extent */
>> +static int g_ext_next_idx = -1;
> 
> Please do not prefix global variables with "g_". This is not
> useful, and simply makes the code hard to read.
> 
> That said, it is much better to pass these as function parameters so
> they are specific to the mapping context and so are inherently
> thread safe.
> 

OK, I will try to move them to function parameters, but I do see global variables used elsewhere in xfsprogs.

>> +/*
>> + * segment, the smallest unit to defrag
>> + * it includes some contiguous extents.
>> + * no holes included,
>> + * no unwritten extents included
>> + * the size is limited by g_segment_size_lmt
>> + */
> 
> I have no idea what this comment is trying to tell me.
OK.

> 
>> +struct defrag_segment {
>> + /* segment offset in units of 512 bytes */
>> + long long ds_offset;
>> + /* length of segment in units of 512 bytes */
>> + long long ds_length;
>> + /* number of extents in this segment */
>> + int ds_nr;
>> + /* flag indicating if segment contains shared blocks */
>> + bool ds_shared;
>> +};
>> +
>> /* defrag segment size limit in units of 512 bytes */
>> #define MIN_SEGMENT_SIZE_LIMIT 8192 /* 4MiB */
>> #define DEFAULT_SEGMENT_SIZE_LIMIT 32768 /* 16MiB */
>> @@ -78,6 +104,165 @@ defrag_check_file(char *path)
>> return true;
>> }
>> 
>> +/*
>> + * get next extent in the file.
>> + * Note: next call will get the same extent unless move_next_extent() is called.
>> + * returns:
>> + * -1: error happened.
>> + * 0: extent returned
>> + * 1: no more extent left
>> + */
>> +static int
>> +defrag_get_next_extent(int fd, struct getbmapx *map_out)
>> +{
>> + int err = 0, i;
>> +
>> + /* when no extents are cached in g_mapx, fetch from kernel */
>> + if (g_ext_next_idx == -1) {
>> + g_mapx[0].bmv_offset = g_offset;
>> + g_mapx[0].bmv_length = -1LL;
>> + g_mapx[0].bmv_count = MAPSIZE;
>> + g_mapx[0].bmv_iflags = BMV_IF_NO_HOLES | BMV_IF_PREALLOC;
>> + err = ioctl(fd, XFS_IOC_GETBMAPX, g_mapx);
>> + if (err == -1) {
>> + perror("XFS_IOC_GETBMAPX failed");
>> + goto out;
>> + }
>> + /* for stats */
>> + g_ext_stats.nr_ext_total += g_mapx[0].bmv_entries;
>> +
>> + /* no more extents */
>> + if (g_mapx[0].bmv_entries == 0) {
>> + err = 1;
>> + goto out;
>> + }
>> +
>> + /* for stats */
>> + for (i = 1; i <= g_mapx[0].bmv_entries; i++) {
>> + if (g_mapx[i].bmv_oflags & BMV_OF_PREALLOC)
>> + g_ext_stats.nr_ext_unwritten++;
>> + if (g_mapx[i].bmv_oflags & BMV_OF_SHARED)
>> + g_ext_stats.nr_ext_shared++;
>> + }
>> +
>> + g_ext_next_idx = 1;
>> + g_offset = g_mapx[g_mapx[0].bmv_entries].bmv_offset +
>> + g_mapx[g_mapx[0].bmv_entries].bmv_length;
>> + }
>> +
>> + map_out->bmv_offset = g_mapx[g_ext_next_idx].bmv_offset;
>> + map_out->bmv_length = g_mapx[g_ext_next_idx].bmv_length;
>> + map_out->bmv_oflags = g_mapx[g_ext_next_idx].bmv_oflags;
>> +out:
>> + return err;
>> +}
> 
> Ok, so the global variables are just a bmap cache. That's a problem,
> because this cache is stale the moment XFS_IOC_GETBMAPX returns to
> userspace. Iterating it to decide exactly waht to do next will
> race with ongoing file modifications and so it's not going to be
> accurate....

Yes, there is a race.
But even if we defrag based on stale extents, there is no harm to the file under defrag,
though it might not produce a good defrag result.


> 
>> +
>> +/*
>> + * move to next extent
>> + */
>> +static void
>> +defrag_move_next_extent()
>> +{
>> + if (g_ext_next_idx == g_mapx[0].bmv_entries)
>> + g_ext_next_idx = -1;
>> + else
>> + g_ext_next_idx += 1;
>> +}
>> +
>> +/*
>> + * check if the given extent is a defrag target.
>> + * no need to check for holes as we are using BMV_IF_NO_HOLES
>> + */
>> +static bool
>> +defrag_is_target(struct getbmapx *mapx)
>> +{
>> + /* unwritten */
>> + if (mapx->bmv_oflags & BMV_OF_PREALLOC)
>> + return false;
>> + return mapx->bmv_length < g_segment_size_lmt;
>> +}
>> +
>> +static bool
>> +defrag_is_extent_shared(struct getbmapx *mapx)
>> +{
>> + return !!(mapx->bmv_oflags & BMV_OF_SHARED);
>> +}
>> +
>> +/*
>> + * get next segment to defragment.
>> + * returns:
>> + * -1 error happened.
>> + * 0 segment returned.
>> + * 1 no more segments to return
>> + */
>> +static int
>> +defrag_get_next_segment(int fd, struct defrag_segment *out)
>> +{
>> + struct getbmapx mapx;
>> + int ret;
>> +
>> + out->ds_offset = 0;
>> + out->ds_length = 0;
>> + out->ds_nr = 0;
>> + out->ds_shared = false;
> 
> out->ds_nr is never set to anything but zero in this patch.
> 

It’s set at line 211 in the raw patch.

206 +add_ext:
207 +               if (defrag_is_extent_shared(&mapx))
208 +                       out->ds_shared = true;
209 +
210 +               out->ds_length += mapx.bmv_length;
211 +               out->ds_nr += 1;
212 +               defrag_move_next_extent();


>> +
>> + do {
>> + ret = defrag_get_next_extent(fd, &mapx);
>> + if (ret != 0) {
>> + /*
>> +  * no more extetns, return current segment if its not
> 
> Typos everywhere.

OK.

> 
>> +  * empty
>> + */
>> + if (ret == 1 && out->ds_nr > 0)
>> + ret = 0;
>> + /* otherwise, error heppened, stop */
>> + break;
>> + }
> 
>> +
>> + /*
>> +  * If the extent is not a defrag target, skip it.
>> +  * go to next extent if the segment is empty;
>> +  * otherwise return the segment.
>> +  */
>> + if (!defrag_is_target(&mapx)) {
>> + defrag_move_next_extent();
>> + if (out->ds_nr == 0)
>> + continue;
>> + else
>> + break;
>> + }
>> +
>> + /* check for segment size limitation */
>> + if (out->ds_length + mapx.bmv_length > g_segment_size_lmt)
>> + break;
>> +
>> + /* the segment is empty now, add this extent to it for sure */
>> + if (out->ds_nr == 0) {
>> + out->ds_offset = mapx.bmv_offset;
>> + goto add_ext;
>> + }
> 
> So this is essentially a filter for the getbmapx output that strips
> away unwritten extents and anything outside/larger than the target
> range.

yes.

> 
>> +
>> + /*
>> +  * the segment is not empty, check for hole since the last exent
>> +  * if a hole exist before this extent, this extent can't be
>> +  * added to the segment. return the segment
>> +  */
>> + if (out->ds_offset + out->ds_length != mapx.bmv_offset)
>> + break;
>> +
>> +add_ext:
> 
> Why do you need a goto for this logic?
> 
> /*
>  * the segment is not empty, check for hole since the last exent
>  * if a hole exist before this extent, this extent can't be
>  * added to the segment. return the segment
>  */
> if (out->ds_nr) {
> if (out->ds_offset + out->ds_length != mapx.bmv_offset)
> break;
> } else {
> out->ds_offset = mapx.bmv_offset;
> }
> 

The above code also works.
Using "goto add_ext" saved an "if" inside an "if", which made the code clearer to me,
but I can change it to what you suggested.

>> + if (defrag_is_extent_shared(&mapx))
>> + out->ds_shared = true;
>> +
>> + out->ds_length += mapx.bmv_length;
>> + out->ds_nr += 1;
>> + defrag_move_next_extent();
>> +
>> + } while (true);
> 
>> +
>> + return ret;
>> +}
>> +
>> /*
>>  * defragment a file
>>  * return 0 if successfully done, 1 otherwise
>> @@ -92,6 +277,9 @@ defrag_xfs_defrag(char *file_path) {
>> struct fsxattr fsx;
>> int ret = 0;
>> 
>> + g_offset = 0;
>> + g_ext_next_idx = -1;
>> +
>> fsx.fsx_nextents = 0;
>> memset(&g_ext_stats, 0, sizeof(g_ext_stats));
>> 
>> @@ -119,6 +307,22 @@ defrag_xfs_defrag(char *file_path) {
>> ret = 1;
>> goto out;
>> }
>> +
>> + do {
>> + struct defrag_segment segment;
>> +
>> + ret = defrag_get_next_segment(defrag_fd, &segment);
>> + /* no more segments, we are done */
>> + if (ret == 1) {
>> + ret = 0;
>> + break;
>> + }
>> + /* error happened when reading bmap, stop here */
>> + if (ret == -1) {
>> + ret = 1;
>> + break;
>> + }
> 
> ternary return values are nasty. Return a negative errno when a
> an error occurs, and -ENODATA when there are no more segments.
> Then you have
> 
> if (ret < 0) {
> if (ret == -ENODATA)
> exit_value = 0;
> else
> exit_value = 1;
> break;
> }

Agreed.
I have modified defrag_get_next_segment() and defrag_get_next_extent() for V2,
making them return -1 on error and 0 on success; "no more extents/segments"
is indicated by a returned length of zero. I will send them out.

> 
>> + } while (true);
> 
> Not a fan of do {} while(true) loops.
> 
> WIth the above error handling changes, this becomes:
> 
> do {
> struct defrag_segment segment;
> 
> ret = defrag_get_next_segment(defrag_fd, &segment);
> } while (ret == 0);
> 
> if (ret == 0 || ret == -ENODATA)
> exit_value = 0;
> else
> exit_value = 1;
> 

Yes, I am making "big" changes to defrag_get_next_segment()/defrag_get_next_extent().
Please review them when I send them out.

> 
> Ok, so this is a linear iteration of all extents in the file that
> filters extents for the specific "segment" that is going to be
> processed. I still have no idea why fixed length segments are
> important, but "linear extent scan for filtering" seems somewhat
> expensive.

Hm… they are not actually fixed-length segments; a segment's size just can't exceed the
limit, so segment.ds_length <= LIMIT.
Larger segments take longer (with the file locked) to defrag. The segment size limit is a way
to balance defrag against the latency of parallel IO.


> 
> Indeed, if you used FIEMAP, you can pass a minimum
> segment length to filter out all the small extents. Iterating that
> extent list means all the ranges you need to defrag are in the holes
> of the returned mapping information. This would be much faster
> than an entire linear mapping to find all the regions with small
> extents that need defrag. The second step could then be doing a
> fine grained mapping of each region that we now know either contains
> fragmented data or holes....

Hm… just a question here:
In your approach, say you set the filter length to 2048; all extents of 2048 blocks or fewer
are to be defragmented.
What if the extent layout (lengths in blocks) is like this:

1.    1
2.    2049
3.    2
4.    2050
5.    1
6.    2051

In the above case, do you defrag or not?

As I understand the situation, the performance of defrag itself is not a critical concern here.

Thanks,
Wengang



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/9] spaceman/defrag: pick up segments from target file
  2024-07-16 20:23     ` Wengang Wang
@ 2024-07-17  4:11       ` Dave Chinner
  2024-07-18 19:03         ` Wengang Wang
                           ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Dave Chinner @ 2024-07-17  4:11 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs@vger.kernel.org

On Tue, Jul 16, 2024 at 08:23:35PM +0000, Wengang Wang wrote:
> > Ok, so this is a linear iteration of all extents in the file that
> > filters extents for the specific "segment" that is going to be
> > processed. I still have no idea why fixed length segments are
> > important, but "linear extent scan for filtering" seems somewhat
> > expensive.
> 
> Hm… fixed length segments — actually not fixed length segments, but segment
> size can’t exceed the limitation.  So segment.ds_length <=  LIMIT.

Which is effectively fixed length segments....

> Larger segment take longer time (with filed locked) to defrag. The
> segment size limit is a way to balance the defrag and the parallel
> IO latency.

Yes, I know why you've done it. These were the same arguments made a
while back for a new way of cloning files on XFS. We solved those
problems just with a small change to the locking, and didn't need
new ioctls or lots of new code just to solve the "clone blocks
concurrent IO" problem.

I'm looking at this from exactly the same POV. The code presented is
doing lots of complex, unusable stuff to work around the fact that
UNSHARE blocks concurrent IO. I don't see any difference between
CLONE and UNSHARE from the IO perspective - if anything UNSHARE can
have looser rules than CLONE, because a concurrent write will either
do the COW of a shared block itself, or hit the exclusive block that
has already been unshared.

So if we fix these locking issues in the kernel, then the whole need
for working around the IO concurrency problems with UNSHARE goes
away and the userspace code becomes much, much simpler.

> > Indeed, if you used FIEMAP, you can pass a minimum
> > segment length to filter out all the small extents. Iterating that
> > extent list means all the ranges you need to defrag are in the holes
> > of the returned mapping information. This would be much faster
> > than an entire linear mapping to find all the regions with small
> > extents that need defrag. The second step could then be doing a
> > fine grained mapping of each region that we now know either contains
> > fragmented data or holes....
> 
> Hm… just a question here:
> As your way, say you set the filter length to 2048, all extents with 2048 or less blocks are to defragmented.
> What if the extent layout is like this:
> 
> 1.    1
> 2.    2049
> 3.    2
> 4.    2050
> 5.    1
> 6.    2051
> 
> In above case, do you do defrag or not?

The filtering presenting in the patch above will not defrag any of
this with a 2048 block segment side, because the second extent in
each segment extend beyond the configured max segment length. IOWs,
it ends up with a single extent per "2048 block segment", and that
won't get defragged with the current algorithm.

As it is, this really isn't a common fragmentation pattern for a
file that does not contain shared extents, so I wouldn't expect to
ever need to decide if this needs to be defragged or not.

However, it is exactly the layout I would expect to see for cloned
and modified filesystem image files.  That is, the common layout for
such a "cloned from golden image" Vm images is this:

1.    1		written
2.    2049	shared
3.    2		written
4.    2050	shared
5.    1		written
6.    2051	shared

i.e. there are large chunks of contiguous shared extents between the
small individual COW block modifications that have been made to
customise the image for the deployed VM.

Either way, if the segment/filter length is 2048 blocks, then this
isn't a pattern that should be defragmented. If the segment/filter
length is 4096 or larger, then yes, this pattern should definitely
be defragmented.

> As I understand the situation, performance of defrag it’s self is
> not a critical concern here.

Sure, but implementing a low performing, high CPU consumption,
entirely single threaded defragmentation model that requires
specific tuning in every different environment it is run in doesn't
seem like the best idea to me.

I'm trying to work out if there is a faster, simpler way of
achieving the same goal....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/9] spaceman/defrag: defrag segments
  2024-07-16  0:08   ` Dave Chinner
@ 2024-07-18 18:06     ` Wengang Wang
  0 siblings, 0 replies; 60+ messages in thread
From: Wengang Wang @ 2024-07-18 18:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs@vger.kernel.org



> On Jul 15, 2024, at 5:08 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:22PM -0700, Wengang Wang wrote:
>> For each segment, the following steps are done trying to defrag it:
>> 
>> 1. share the segment with a temporary file
>> 2. unshare the segment in the target file. kernel simulates Cow on the whole
>>   segment complete the unshare (defrag).
>> 3. release blocks from the tempoary file.
>> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>> ---
>> spaceman/defrag.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 114 insertions(+)
>> 
>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>> index 175cf461..9f11e36b 100644
>> --- a/spaceman/defrag.c
>> +++ b/spaceman/defrag.c
>> @@ -263,6 +263,40 @@ add_ext:
>> return ret;
>> }
>> 
>> +/*
>> + * check if the segment exceeds EoF.
>> + * fix up the clone range and return true if EoF happens,
>> + * return false otherwise.
>> + */
>> +static bool
>> +defrag_clone_eof(struct file_clone_range *clone)
>> +{
>> + off_t delta;
>> +
>> + delta = clone->src_offset + clone->src_length - g_defrag_file_size;
>> + if (delta > 0) {
>> + clone->src_length = 0; // to the end
>> + return true;
>> + }
>> + return false;
>> +}
>> +
>> +/*
>> + * get the time delta since pre_time in ms.
>> + * pre_time should contains values fetched by gettimeofday()
>> + * cur_time is used to store current time by gettimeofday()
>> + */
>> +static long long
>> +get_time_delta_us(struct timeval *pre_time, struct timeval *cur_time)
>> +{
>> + long long us;
>> +
>> + gettimeofday(cur_time, NULL);
>> + us = (cur_time->tv_sec - pre_time->tv_sec) * 1000000;
>> + us += (cur_time->tv_usec - pre_time->tv_usec);
>> + return us;
>> +}
>> +
>> /*
>>  * defragment a file
>>  * return 0 if successfully done, 1 otherwise
>> @@ -273,6 +307,7 @@ defrag_xfs_defrag(char *file_path) {
>> long nr_seg_defrag = 0, nr_ext_defrag = 0;
>> int scratch_fd = -1, defrag_fd = -1;
>> char tmp_file_path[PATH_MAX+1];
>> + struct file_clone_range clone;
>> char *defrag_dir;
>> struct fsxattr fsx;
>> int ret = 0;
>> @@ -296,6 +331,8 @@ defrag_xfs_defrag(char *file_path) {
>> goto out;
>> }
>> 
>> + clone.src_fd = defrag_fd;
>> +
>> defrag_dir = dirname(file_path);
> 
> Just a note: can you please call this the "source fd", not the
> "defrag_fd"? defrag_fd could mean either the source or the
> temporary scratch file we use as the defrag destination.

I have no problem changing that. I was just thinking that no data movement happens from
one file to another: defrag_fd means the file under defrag, and the temp file uses
scratch_fd.

> 
>> snprintf(tmp_file_path, PATH_MAX, "%s/.xfsdefrag_%d", defrag_dir,
>> getpid());
>> @@ -309,7 +346,11 @@ defrag_xfs_defrag(char *file_path) {
>> }
>> 
>> do {
>> + struct timeval t_clone, t_unshare, t_punch_hole;
>> struct defrag_segment segment;
>> + long long seg_size, seg_off;
>> + int time_delta;
>> + bool stop;
>> 
>> ret = defrag_get_next_segment(defrag_fd, &segment);
>> /* no more segments, we are done */
>> @@ -322,6 +363,79 @@ defrag_xfs_defrag(char *file_path) {
>> ret = 1;
>> break;
>> }
>> +
>> + /* we are done if the segment contains only 1 extent */
>> + if (segment.ds_nr < 2)
>> + continue;
>> +
>> + /* to bytes */
>> + seg_off = segment.ds_offset * 512;
>> + seg_size = segment.ds_length * 512;
> 
> Ugh. Do this in the mapping code that gets the extent info. Have it
> return bytes. Or just use FIEMAP because it uses byte ranges to
> begin with.

OK.

> 
>> +
>> + clone.src_offset = seg_off;
>> + clone.src_length = seg_size;
>> + clone.dest_offset = seg_off;
>> +
>> + /* checks for EoF and fix up clone */
>> + stop = defrag_clone_eof(&clone);
> 
> Ok, so we copy the segment map into clone args, and ...
> 
>> + gettimeofday(&t_clone, NULL);
>> + ret = ioctl(scratch_fd, FICLONERANGE, &clone);
>> + if (ret != 0) {
>> + fprintf(stderr, "FICLONERANGE failed %s\n",
>> + strerror(errno));
>> + break;
>> + }
> 
> clone the source to the scratch file. This blocks writes to the
> source file while it is in progress, but allows reads to pass
> through the source file as data is not changing.
> 
> 
>> + /* for time stats */
>> + time_delta = get_time_delta_us(&t_clone, &t_unshare);
>> + if (time_delta > max_clone_us)
>> + max_clone_us = time_delta;
>> +
>> + /* for defrag stats */
>> + nr_ext_defrag += segment.ds_nr;
>> +
>> + /*
>> +  * For the shared range to be unshared via a copy-on-write
>> +  * operation in the file to be defragged. This causes the
>> +  * file needing to be defragged to have new extents allocated
>> +  * and the data to be copied over and written out.
>> +  */
>> + ret = fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, seg_off,
>> + seg_size);
>> + if (ret != 0) {
>> + fprintf(stderr, "UNSHARE_RANGE failed %s\n",
>> + strerror(errno));
>> + break;
>> + }
> 
> And now we unshare the source file. This blocks all IO to the source
> file.

Yes, even fetching data from disk is done while the IO lock is held.
I am wondering if we can fetch the data without holding the IO lock; that is also why
I added readahead in a later patch.

> 
> Ok, so this is the fundamental problem this whole "segmented
> defrag" is trying to work around: FALLOC_FL_UNSHARE_RANGE blocks
> all read and write IO whilst it is in progress.
> 
> We had this same problem with FICLONERANGE taking snapshots of VM
> files - we changed the locking to take shared IO locks to allow
> reads to run while the clone was in progress. Because the Oracle Vm
> infrastructure uses a sidecar to redirect writes while a snapshot
> (clone) was in progress, no VM IO got blocked while the clone was in
> progress and so the applications inside the VM never even noticed a
> clone was taking place.
> 
> Why isn't the same infrastructure being used here?
> 
> FALLOC_FL_UNSHARE_RANGE is not changing data, nor is it freeing any
> data blocks. Yes, we are re-writing the data somewhere else, but
> in that case the original data is still intact in it's original
> location on disk and not being freed.
> 
> Hence if a read races with UNSHARE, it will hit a referenced extent
> containing the correct data regardless of whether it is in the old
> or new file. Hence we can likely use shared IO locking for UNSHARE,
> just like we do for FICLONERANGE.
> 
> At this point, if the Oracle VM infrastructure uses the sidecar
> write channel whilst the defrag is in progress, this whole algorithm
> simply becomes "for regions with extents smaller than X, clone and
> unshare the region".
> 
> The whole need for "idle time" goes away. The need for segment size
> control largely goes away. The need to tune the defrag algorithm to
> avoid IO latency and/or throughput issues goes away.
> 

As we tested, the write-redirecting doesn’t work well.


>> +
>> + /* for time stats */
>> + time_delta = get_time_delta_us(&t_unshare, &t_punch_hole);
>> + if (time_delta > max_unshare_us)
>> + max_unshare_us = time_delta;
>> +
>> + /*
>> +  * Punch out the original extents we shared to the
>> +  * scratch file so they are returned to free space.
>> +  */
>> + ret = fallocate(scratch_fd,
>> + FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, seg_off,
>> + seg_size);
>> + if (ret != 0) {
>> + fprintf(stderr, "PUNCH_HOLE failed %s\n",
>> + strerror(errno));
>> + break;
>> + }
> 
> This is unnecessary if there is lots of free space. You can leave
> this to the very end of defrag so that the source file defrag
> operation isn't slowed down by cleaning up all the fragmented
> extents....

Yes. Two considerations here:
1. Space usage, as you mentioned: the temp file might grow huge on a low-space system.
2. Punching holes in the temp file doesn't lock the file under defrag, so with per-segment
PUNCH_HOLE we need a little less sleep time for each segment.

Thanks,
Wengang


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 6/9] spaceman/defrag: workaround kernel xfs_reflink_try_clear_inode_flag()
  2024-07-16  0:25   ` Dave Chinner
@ 2024-07-18 18:24     ` Wengang Wang
  0 siblings, 0 replies; 60+ messages in thread
From: Wengang Wang @ 2024-07-18 18:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs@vger.kernel.org



> On Jul 15, 2024, at 5:25 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:25PM -0700, Wengang Wang wrote:
>> xfs_reflink_try_clear_inode_flag() takes very long in case file has huge number
>> of extents and none of the extents are shared.
> 
> Got a kernel profile showing how bad it is?

It was more than 1.5 seconds (with 6.4 million extents) when I added debug code to measure it.

> 
>> 
>> workaround:
>> share the first real extent so that xfs_reflink_try_clear_inode_flag() returns
>> quickly to save cpu times and speed up defrag significantly.
> 
> That's nasty.
> 
> Let's fix the kernel code, not work around it in userspace.
> 
> I mean, it would be really easy to store if an extent is shared in
> the iext btree record for the extent. If we do an unshare operation,
> just do a single "find shared extents" pass on the extent tree and
> mark all the extents that are shared as shared.  Then set a flag on
> the data fork saying it is tracking shared extents, and so when we
> share/unshare extents in that inode from then on, we set/clear that
> flag in the iext record. (i.e. it's an in-memory equivalent of the
> UNWRITTEN state flag).
> 
> Then after the first unshare, checking for nothing being shared is a
> walk of the iext btree over the given range, not a refcountbt
> walk. That should be much faster.
> 
> And we could make it even faster by adding a "shared extents"
> counter to the inode fork. i.e. the first scan that sets the flags
> also counts the shared extents, and we maintain that as we maintain
> the iin memory extent flags....
> 
> That makes the cost of xfs_reflink_try_clear_inode_flag() basically
> go to zero in these sorts of workloads. IMO, this is a much better
> solution to the problem than hacking around it in userspace...
> 

Yes, fixing it in the kernel is the best way to go.
Well, one consideration is that the customers don’t run the upstream kernel.
They might run a much older version. And some customers don’t want kernel
upgrades unless there are security issues.
So can we have both?
1. Try to fix it in the kernel, and
2. Keep the workaround in the defrag userspace?

Thanks,
Wengang
> 


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 8/9] spaceman/defrag: readahead for better performance
  2024-07-16  0:56   ` Dave Chinner
@ 2024-07-18 18:40     ` Wengang Wang
  2024-07-31  3:10       ` Dave Chinner
  0 siblings, 1 reply; 60+ messages in thread
From: Wengang Wang @ 2024-07-18 18:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs@vger.kernel.org



> On Jul 15, 2024, at 5:56 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Tue, Jul 09, 2024 at 12:10:27PM -0700, Wengang Wang wrote:
>> Reading ahead takes less lock time on the file compared to "unshare" via ioctl.
>> Do readahead when defrag sleeps for better defrag performance and thus more
>> file IO time.
>> 
>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>> ---
>> spaceman/defrag.c | 21 ++++++++++++++++++++-
>> 1 file changed, 20 insertions(+), 1 deletion(-)
>> 
>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>> index 415fe9c2..ab8508bb 100644
>> --- a/spaceman/defrag.c
>> +++ b/spaceman/defrag.c
>> @@ -331,6 +331,18 @@ defrag_fs_limit_hit(int fd)
>> }
>> 
>> static bool g_enable_first_ext_share = true;
>> +static bool g_readahead = false;
>> +
>> +static void defrag_readahead(int defrag_fd, off64_t offset, size_t count)
>> +{
>> +	if (!g_readahead || g_idle_time <= 0)
>> +		return;
>> +
>> +	if (readahead(defrag_fd, offset, count) < 0) {
>> +		fprintf(stderr, "readahead failed: %s, errno=%d\n",
>> +			strerror(errno), errno);
> 
> This doesn't do what you think it does. readahead() only queues the
> first readahead chunk of the range given (a few pages at most). It
> does not cause readahead on the entire range, wait for page cache
> population, nor report IO errors that might have occurred during
> readahead.

Is it a bug? As per the man page it should try to read _count_ bytes:

DESCRIPTION
       readahead() initiates readahead on a file so that subsequent reads from that file will be satisfied from the cache, and not block on disk I/O (assuming the readahead was initiated early enough and that
       other activity on the system did not in the meantime flush pages from the cache).

       The fd argument is a file descriptor identifying the file which is to be read.  The offset argument specifies the starting point from which data is to be read and count specifies the number of bytes to
       be read.  I/O is performed in whole pages, so that offset is effectively rounded down to a page boundary and bytes are read up to the next page boundary greater than or equal to (offset+count).  reada‐
       head() does not read beyond the end of the file.  The file offset of the open file description referred to by fd is left unchanged.

> 
> There's almost no value to making this syscall, especially if the
> app is about to trigger a sequential read for the whole range.
> Readahead will occur naturally during that read operation (i.e. the
> UNSHARE copy), and the read will return IO errors unlike
> readahead().
> 
> If you want the page cache pre-populated before the unshare
> operation is done, then you need to use mmap() and
> madvise(MADV_POPULATE_READ). This will read the whole region into
> the page cache as if it was a sequential read, wait for it to
> complete and return any IO errors that might have occurred during
> the read.

As you know, in the unshare path, fetching data from disk is done while the IO lock is held.
(I am wondering if we can improve that.)
The main purpose of using readahead is to reduce the (IO) lock hold time when fetching
data from disk. Can we achieve that by using mmap() and madvise()?

Thanks,
Wengang



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/9] spaceman/defrag: pick up segments from target file
  2024-07-17  4:11       ` Dave Chinner
@ 2024-07-18 19:03         ` Wengang Wang
  2024-07-19  4:59           ` Dave Chinner
  2024-07-19  4:01         ` Christoph Hellwig
  2024-07-24 19:22         ` Wengang Wang
  2 siblings, 1 reply; 60+ messages in thread
From: Wengang Wang @ 2024-07-18 19:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs@vger.kernel.org



> On Jul 16, 2024, at 9:11 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Tue, Jul 16, 2024 at 08:23:35PM +0000, Wengang Wang wrote:
>>> Ok, so this is a linear iteration of all extents in the file that
>>> filters extents for the specific "segment" that is going to be
>>> processed. I still have no idea why fixed length segments are
>>> important, but "linear extent scan for filtering" seems somewhat
>>> expensive.
>> 
>> Hm… fixed length segments — actually not fixed length segments, but segment
>> size can’t exceed the limitation.  So segment.ds_length <=  LIMIT.
> 
> Which is effectively fixed length segments....
> 
>> Larger segments take longer time (with the file locked) to defrag. The
>> segment size limit is a way to balance the defrag and the parallel
>> IO latency.
> 
> Yes, I know why you've done it. These were the same arguments made a
> while back for a new way of cloning files on XFS. We solved those
> problems just with a small change to the locking, and didn't need
> new ioctls or lots of new code just to solve the "clone blocks
> concurrent IO" problem.

I didn’t check the code history, but I think you solved the problem
by allowing reads to proceed while cloning is in progress? Correct me if I'm wrong.
The problem we hit is a (heartbeat) write timeout.

> 
> I'm looking at this from exactly the same POV. The code presented is
> doing lots of complex, unusable stuff to work around the fact that
> UNSHARE blocks concurrent IO. I don't see any difference between
> CLONE and UNSHARE from the IO perspective - if anything UNSHARE can
> have looser rules than CLONE, because a concurrent write will either
> do the COW of a shared block itself, or hit the exclusive block that
> has already been unshared.
> 
> So if we fix these locking issues in the kernel, then the whole need
> for working around the IO concurrency problems with UNSHARE goes
> away and the userspace code becomes much, much simpler.
> 
>>> Indeed, if you used FIEMAP, you can pass a minimum
>>> segment length to filter out all the small extents. Iterating that
>>> extent list means all the ranges you need to defrag are in the holes
>>> of the returned mapping information. This would be much faster
>>> than an entire linear mapping to find all the regions with small
>>> extents that need defrag. The second step could then be doing a
>>> fine grained mapping of each region that we now know either contains
>>> fragmented data or holes....
>> 
>> Hm… just a question here:
>> As your way, say you set the filter length to 2048, all extents with 2048 or fewer blocks are to be defragmented.
>> What if the extent layout is like this:
>> 
>> 1.    1
>> 2.    2049
>> 3.    2
>> 4.    2050
>> 5.    1
>> 6.    2051
>> 
>> In above case, do you do defrag or not?
> 
> The filtering presented in the patch above will not defrag any of
> this with a 2048 block segment size, because the second extent in
> each segment extends beyond the configured max segment length. IOWs,
> it ends up with a single extent per "2048 block segment", and that
> won't get defragged with the current algorithm.
> 
> As it is, this really isn't a common fragmentation pattern for a
> file that does not contain shared extents, so I wouldn't expect to
> ever need to decide if this needs to be defragged or not.
> 
> However, it is exactly the layout I would expect to see for cloned
> and modified filesystem image files.  That is, the common layout for
> such a "cloned from golden image" Vm images is this:
> 
> 1.    1 written
> 2.    2049 shared
> 3.    2 written
> 4.    2050 shared
> 5.    1 written
> 6.    2051 shared
> 
> i.e. there are large chunks of contiguous shared extents between the
> small individual COW block modifications that have been made to
> customise the image for the deployed VM.
> 
> Either way, if the segment/filter length is 2048 blocks, then this
> isn't a pattern that should be defragmented. If the segment/filter
> length is 4096 or larger, then yes, this pattern should definitely
> be defragmented.

Yes, true. We should focus on real case layout.

> 
>> As I understand the situation, performance of defrag itself is
>> not a critical concern here.
> 
> Sure, but implementing a low performing, high CPU consumption,
> entirely single threaded defragmentation model that requires
> specific tuning in every different environment it is run in doesn't
> seem like the best idea to me.
> 
> I'm trying to work out if there is a faster, simpler way of
> achieving the same goal....
> 

Great!

Wengang


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/9] spaceman/defrag: pick up segments from target file
  2024-07-17  4:11       ` Dave Chinner
  2024-07-18 19:03         ` Wengang Wang
@ 2024-07-19  4:01         ` Christoph Hellwig
  2024-07-24 19:22         ` Wengang Wang
  2 siblings, 0 replies; 60+ messages in thread
From: Christoph Hellwig @ 2024-07-19  4:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Wengang Wang, linux-xfs@vger.kernel.org

On Wed, Jul 17, 2024 at 02:11:16PM +1000, Dave Chinner wrote:
> Yes, I know why you've done it. These were the same arguments made a
> while back for a new way of cloning files on XFS. We solved those
> problems just with a small change to the locking, and didn't need
> new ioctls or lots of new code just to solve the "clone blocks
> concurrent IO" problem.
> 
> I'm looking at this from exactly the same POV. The code presented is
> doing lots of complex, unusable stuff to work around the fact that
> UNSHARE blocks concurrent IO. I don't see any difference between
> CLONE and UNSHARE from the IO perspective - if anything UNSHARE can
> have looser rules than CLONE, because a concurrent write will either
> do the COW of a shared block itself, or hit the exclusive block that
> has already been unshared.
> 
> So if we fix these locking issues in the kernel, then the whole need
> for working around the IO concurrency problems with UNSHARE goes
> away and the userspace code becomes much, much simpler.

Btw, the main problem with unshare isn't just locking, but that it is
extremely inefficient.  It synchronously reads one block at a time,
which makes it very, very slow.  That's purely a kernel implementation
detail, of course.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/9] spaceman/defrag: pick up segments from target file
  2024-07-18 19:03         ` Wengang Wang
@ 2024-07-19  4:59           ` Dave Chinner
  0 siblings, 0 replies; 60+ messages in thread
From: Dave Chinner @ 2024-07-19  4:59 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs@vger.kernel.org

On Thu, Jul 18, 2024 at 07:03:40PM +0000, Wengang Wang wrote:
> 
> 
> > On Jul 16, 2024, at 9:11 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > On Tue, Jul 16, 2024 at 08:23:35PM +0000, Wengang Wang wrote:
> >>> Ok, so this is a linear iteration of all extents in the file that
> >>> filters extents for the specific "segment" that is going to be
> >>> processed. I still have no idea why fixed length segments are
> >>> important, but "linear extent scan for filtering" seems somewhat
> >>> expensive.
> >> 
> >> Hm… fixed length segments — actually not fixed length segments, but segment
> >> size can’t exceed the limitation.  So segment.ds_length <=  LIMIT.
> > 
> > Which is effectively fixed length segments....
> > 
> >> Larger segments take longer time (with the file locked) to defrag. The
> >> segment size limit is a way to balance the defrag and the parallel
> >> IO latency.
> > 
> > Yes, I know why you've done it. These were the same arguments made a
> > while back for a new way of cloning files on XFS. We solved those
> > problems just with a small change to the locking, and didn't need
> > new ioctls or lots of new code just to solve the "clone blocks
> > concurrent IO" problem.
> 
> I didn’t check the code history, but I think you solved the problem
> by allowing reads to proceed while cloning is in progress? Correct me if I'm wrong.
> The problem we hit is a (heartbeat) write timeout.

The reason this worked (allowing shared reads through and not
writes) was that the VM infrastructure this was being done for uses
a sidecar write channel to redirect writes while a clone is being
done. i.e. writes are not blocked by the clone in progress because
they are being done to a different file.

When the clone completes, those writes are folded back into the
original image file. e.g. see the `qemu-img commit -b <backing file>
<file with delta writes>` which will fold writes to a sidecar write
file back into the original backing file that was just cloned....

What I'm suggesting is that when you run an backing file
defragmentation, you use the same sidecar write setup as cloning
whilst the defrag is done. Reads go straight through to the backing
file, and writes get written to a delta write file. When the defrag
is done the delta write file gets folded back into the backing file.

But for this to work, UNSHARE needs to use shared read locking so
that read IO can be directed through the file at the same time as
the UNSHARE is running. If this works for CLONE to avoid read and
write blocking whilst the operation is in progress, the same
mechanism should be able to be used for UNSHARE, too. At this point
defrag using CLONE+UNSHARE shouldn't ever block read IO and
shouldn't block write IO for any significant period of time,
either...

-Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 9/9] spaceman/defrag: warn on extsize
  2024-07-16  0:29       ` Dave Chinner
@ 2024-07-22 18:01         ` Wengang Wang
  2024-07-30 22:43           ` Dave Chinner
  0 siblings, 1 reply; 60+ messages in thread
From: Wengang Wang @ 2024-07-22 18:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org



> On Jul 15, 2024, at 5:29 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Thu, Jul 11, 2024 at 11:36:28PM +0000, Wengang Wang wrote:
>> 
>> 
>>> On Jul 9, 2024, at 1:21 PM, Darrick J. Wong <djwong@kernel.org> wrote:
>>> 
>>> On Tue, Jul 09, 2024 at 12:10:28PM -0700, Wengang Wang wrote:
>>>> According to current kernel implementation, non-zero extsize might affect
>>>> the result of defragmentation.
>>>> Just print a warning on that if non-zero extsize is set on file.
>>> 
>>> I'm not sure what's the point of warning vaguely about extent size
>>> hints?  I'd have thought that would help reduce the number of extents;
>>> is that not the case?
>> 
>> Not exactly.
>> 
>> Same 1G file with about 54K extents,
>> 
>> The one with 16K extsize, after defrag, it’s extents drops to 13K.
>> And the one with 0 extsize, after defrag, it’s extents dropped to 22.
> 
> extsize should not affect file contiguity like this at all. Are you
> measuring fragmentation correctly? i.e. a contiguous region from a
> larger extsize allocation that results in a bmap/fiemap output of
> three extents in an unwritten/written/unwritten pattern is not fragmentation.

I was using FS_IOC_FSGETXATTR to get the number of extents (fsx.fsx_nextents).
So if the kernel doesn’t lie, I got it correctly. There were no unwritten extents in the files to defrag.

(As I mentioned somewhere else), though extsize is mainly used to align the number
of blocks, it breaks delayed allocation. In the unshare path, N allocations are performed
for the N extents respectively in the segment to be defragmented.

Thanks,
Wengang

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/9] spaceman/defrag: pick up segments from target file
  2024-07-17  4:11       ` Dave Chinner
  2024-07-18 19:03         ` Wengang Wang
  2024-07-19  4:01         ` Christoph Hellwig
@ 2024-07-24 19:22         ` Wengang Wang
  2024-07-30 22:13           ` Dave Chinner
  2 siblings, 1 reply; 60+ messages in thread
From: Wengang Wang @ 2024-07-24 19:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs@vger.kernel.org



> On Jul 16, 2024, at 9:11 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Tue, Jul 16, 2024 at 08:23:35PM +0000, Wengang Wang wrote:
>>> Ok, so this is a linear iteration of all extents in the file that
>>> filters extents for the specific "segment" that is going to be
>>> processed. I still have no idea why fixed length segments are
>>> important, but "linear extent scan for filtering" seems somewhat
>>> expensive.
>> 
>> Hm… fixed length segments — actually not fixed length segments, but segment
>> size can’t exceed the limitation.  So segment.ds_length <=  LIMIT.
> 
> Which is effectively fixed length segments....
> 
>> Larger segments take longer time (with the file locked) to defrag. The
>> segment size limit is a way to balance the defrag and the parallel
>> IO latency.
> 
> Yes, I know why you've done it. These were the same arguments made a
> while back for a new way of cloning files on XFS. We solved those
> problems just with a small change to the locking, and didn't need
> new ioctls or lots of new code just to solve the "clone blocks
> concurrent IO" problem.
> 
> I'm looking at this from exactly the same POV. The code presented is
> doing lots of complex, unusable stuff to work around the fact that
> UNSHARE blocks concurrent IO. I don't see any difference between
> CLONE and UNSHARE from the IO perspective - if anything UNSHARE can
> have looser rules than CLONE, because a concurrent write will either
> do the COW of a shared block itself, or hit the exclusive block that
> has already been unshared.
> 
> So if we fix these locking issues in the kernel, then the whole need
> for working around the IO concurrency problems with UNSHARE goes
> away and the userspace code becomes much, much simpler.
> 
>>> Indeed, if you used FIEMAP, you can pass a minimum
>>> segment length to filter out all the small extents. Iterating that
>>> extent list means all the ranges you need to defrag are in the holes
>>> of the returned mapping information. This would be much faster
>>> than an entire linear mapping to find all the regions with small
>>> extents that need defrag. The second step could then be doing a
>>> fine grained mapping of each region that we now know either contains
>>> fragmented data or holes....


Where can we pass a minimum segment length to filter out things?
I don’t see that in the fiemap structure:

A fiemap request is encoded within struct fiemap:

struct fiemap {
	__u64 fm_start;          /* logical offset (inclusive) at
	                          * which to start mapping (in) */
	__u64 fm_length;         /* logical length of mapping which
	                          * userspace cares about (in) */
	__u32 fm_flags;          /* FIEMAP_FLAG_* flags for request (in/out) */
	__u32 fm_mapped_extents; /* number of extents that were
	                          * mapped (out) */
	__u32 fm_extent_count;   /* size of fm_extents array (in) */
	__u32 fm_reserved;
	struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */
};

Thanks,
Wengang

>> 
>> Hm… just a question here:
>> As your way, say you set the filter length to 2048, all extents with 2048 or fewer blocks are to be defragmented.
>> What if the extent layout is like this:
>> 
>> 1.    1
>> 2.    2049
>> 3.    2
>> 4.    2050
>> 5.    1
>> 6.    2051
>> 
>> In above case, do you do defrag or not?
> 
> The filtering presented in the patch above will not defrag any of
> this with a 2048 block segment size, because the second extent in
> each segment extends beyond the configured max segment length. IOWs,
> it ends up with a single extent per "2048 block segment", and that
> won't get defragged with the current algorithm.
> 
> As it is, this really isn't a common fragmentation pattern for a
> file that does not contain shared extents, so I wouldn't expect to
> ever need to decide if this needs to be defragged or not.
> 
> However, it is exactly the layout I would expect to see for cloned
> and modified filesystem image files.  That is, the common layout for
> such a "cloned from golden image" Vm images is this:
> 
> 1.    1 written
> 2.    2049 shared
> 3.    2 written
> 4.    2050 shared
> 5.    1 written
> 6.    2051 shared
> 
> i.e. there are large chunks of contiguous shared extents between the
> small individual COW block modifications that have been made to
> customise the image for the deployed VM.
> 
> Either way, if the segment/filter length is 2048 blocks, then this
> isn't a pattern that should be defragmented. If the segment/filter
> length is 4096 or larger, then yes, this pattern should definitely
> be defragmented.
> 
>> As I understand the situation, performance of defrag itself is
>> not a critical concern here.
> 
> Sure, but implementing a low performing, high CPU consumption,
> entirely single threaded defragmentation model that requires
> specific tuning in every different environment it is run in doesn't
> seem like the best idea to me.
> 
> I'm trying to work out if there is a faster, simpler way of
> achieving the same goal....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/9] spaceman/defrag: pick up segments from target file
  2024-07-24 19:22         ` Wengang Wang
@ 2024-07-30 22:13           ` Dave Chinner
  0 siblings, 0 replies; 60+ messages in thread
From: Dave Chinner @ 2024-07-30 22:13 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs@vger.kernel.org

On Wed, Jul 24, 2024 at 07:22:25PM +0000, Wengang Wang wrote:
> > On Jul 16, 2024, at 9:11 PM, Dave Chinner <david@fromorbit.com> wrote:
> >>> Indeed, if you used FIEMAP, you can pass a minimum
> >>> segment length to filter out all the small extents. Iterating that
> >>> extent list means all the ranges you need to defrag are in the holes
> >>> of the returned mapping information. This would be much faster
> >>> than an entire linear mapping to find all the regions with small
> >>> extents that need defrag. The second step could then be doing a
> >>> fine grained mapping of each region that we now know either contains
> >>> fragmented data or holes....
> 
> 
> Where can we pass a minimum segment length to filter out things?
> I don’t see that in the fiemap structure:

Oh, sorry, too many similar APIs - it's FITRIM that has a minimum
length filter built into it. It's something we could add if we
really need it.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 9/9] spaceman/defrag: warn on extsize
  2024-07-22 18:01         ` Wengang Wang
@ 2024-07-30 22:43           ` Dave Chinner
  0 siblings, 0 replies; 60+ messages in thread
From: Dave Chinner @ 2024-07-30 22:43 UTC (permalink / raw)
  To: Wengang Wang; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org

On Mon, Jul 22, 2024 at 06:01:08PM +0000, Wengang Wang wrote:
> 
> 
> > On Jul 15, 2024, at 5:29 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > On Thu, Jul 11, 2024 at 11:36:28PM +0000, Wengang Wang wrote:
> >> 
> >> 
> >>> On Jul 9, 2024, at 1:21 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> >>> 
> >>> On Tue, Jul 09, 2024 at 12:10:28PM -0700, Wengang Wang wrote:
> >>>> According to current kernel implementation, non-zero extsize might affect
> >>>> the result of defragmentation.
> >>>> Just print a warning on that if non-zero extsize is set on file.
> >>> 
> >>> I'm not sure what's the point of warning vaguely about extent size
> >>> hints?  I'd have thought that would help reduce the number of extents;
> >>> is that not the case?
> >> 
> >> Not exactly.
> >> 
> >> Same 1G file with about 54K extents,
> >> 
> >> The one with 16K extsize, after defrag, it’s extents drops to 13K.
> >> And the one with 0 extsize, after defrag, it’s extents dropped to 22.
> > 
> > extsize should not affect file contiguity like this at all. Are you
> > measuring fragmentation correctly? i.e. a contiguous region from a
> > larger extsize allocation that results in a bmap/fiemap output of
> > three extents in an unwritten/written/unwritten pattern is not fragmentation.
> 
> I was using FS_IOC_FSGETXATTR to get the number of extents (fsx.fsx_nextents).
> So if the kernel doesn’t lie, I got it correctly. There were no unwritten extents in the files to defrag.

The kernel is not lying, and you've misunderstood what the kernel is
reporting as an extent. The kernel reports the count of -individual
extent records- it maintains, not the count of contiguous regions it
is mapping. Have a look at the implementation of fsx.fsx_nextents in
xfs_fill_fsxattr():

	if (ifp && !xfs_need_iread_extents(ifp))
		fa->fsx_nextents = xfs_iext_count(ifp);
	else
		fa->fsx_nextents = xfs_ifork_nextents(ifp);


We have:

inline xfs_extnum_t xfs_iext_count(struct xfs_ifork *ifp)
{
        return ifp->if_bytes / sizeof(struct xfs_iext_rec);
}

Which is the number of in-memory extents for the inode fork. Not
only does that include unwritten extent records, it includes delayed
allocation extents that don't even exist on disk.

And if we haven't read the extent list in from disk, we use:

static inline xfs_extnum_t xfs_ifork_nextents(struct xfs_ifork *ifp)
{
        if (!ifp)
                return 0;
        return ifp->if_nextents;
}

Which is a count of the on-disk extents for the inode fork which
counts both written and unwritten extent records.

IOWs, both of these functions count unwritten extents as separate
extents to written extents, even if they are contiguous.  That means
a single contiguous extent with an unwritten region in the middle of
it:

	0	1	2	3
	+WWWWWWW+UUUUUUU+WWWWWWW+

Is reported as three extent records - {0,1,W}, {1,1,U}, {2,1,W} -
and so fsx.fsx_nextents will report 3 extents despite the fact that
file is *not* fragmented at all.

Hence interpreting fsx.fsx_nextents as a number that accurately
reflects actual extent fragmentation levels is incorrect. If you
have a sparse file or mixed written/unwritten regions, the extent
count will be much higher than expected but it does not indicate
that the file is fragmented at all.

Applications need to look at the actual extent map that is returned
from FIEMAP to determine if there is significant fragmentation that
can be addressed, not just the raw extent count.

> (As I mentioned somewhere else), though extsize is mainly used to
> align the number of blocks, it breaks delayed allocation.
> In the unshare path, N allocations are performed for the N
> extents respectively in the segment to be defragmented.

That's largely irrelevant to the issue at hand.  If there is
sufficient free space in the filesystem, the allocator will first
attempt and succeed at contiguous allocation. Hence the size of each
allocation is irrelevant as they will be laid out contiguously given
sufficient large contiguous free space.

Indeed, this is how allocation for direct IO works, and it doesn't
have problems with fragmentation of files for single threaded
sequential IO for the same reasons....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/9] introduce defrag to xfs_spaceman
  2024-07-16 19:45   ` Wengang Wang
@ 2024-07-31  2:51     ` Dave Chinner
  2024-08-02 18:14       ` Wengang Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Dave Chinner @ 2024-07-31  2:51 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs@vger.kernel.org

On Tue, Jul 16, 2024 at 07:45:37PM +0000, Wengang Wang wrote:
> 
> 
> > On Jul 15, 2024, at 4:03 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > [ Please keep documentation text to 80 columns. ] 
> > 
> 
> Yes. This is not a patch. I copied it from the man 8 output.
> It will be limited to 80 columns when sent as a patch.
> 
> > [ Please run documentation through a spell checker - there are too
> > many typos in this document to point them all out... ]
> 
> OK.
> 
> > 
> > On Tue, Jul 09, 2024 at 12:10:19PM -0700, Wengang Wang wrote:
> >> This patch set introduces defrag to xfs_spaceman command. It has the functionality and
> >> features below (also subject to be added to man page, so please review):
> > 
> > What's the use case for this?
> 
> This is the user space defrag as you suggested previously.
> 
> Please see the previous conversation for your reference: 
> https://patchwork.kernel.org/project/xfs/cover/20231214170530.8664-1-wen.gang.wang@oracle.com/

That's exactly what you should have put in the cover letter!

The cover letter is not for documenting the user interface of a new
tool - that's what the patch in the patch set for the new man page
should be doing.

The cover letter should contain references to past patch sets and
discussions on the topic. The cover letter shoudl also contain a
changelog that documents what is different in this new version of
the patch set so reviewers know what you've changed since they last
looked at it.

IOWs, the cover letter is for explaining the use case, why the
functionality is needed, important design/implementation decisions
and the history of the patchset. It's meant to inform and remind
readers of what has already happened to get to this point.

> COPY STARTS —————————————> 
> I am copying your last comment there:
> 
> On Tue, Dec 19, 2023 at 09:17:31PM +0000, Wengang Wang wrote:
> > Hi Dave,
> > Yes, the user space defrag works and satisfies my requirement (almost no change from your example code).
> 
> That's good to know :)
> 
> > Let me know if you want it in xfsprog.
> 
> Yes, i think adding it as an xfs_spaceman command would be a good
> way for this defrag feature to be maintained for anyone who has need
> for it.

Sure, I might have said that 6 months ago. When presented with a
completely new implementation in a new context months later, I might
see things differently.  Everyone is allowed to change their mind,
opinions and theories as circumstances, evidence and contexts
change.

Indeed when I look at this:

> >>       defrag [-f free_space] [-i idle_time] [-s segment_size] [-n] [-a]
> >>              defrag defragments the specified XFS file online non-exclusively. The target XFS

I didn't expect anything nearly as complex and baroque as this. All
I was expecting was something like this to defrag a single range of
a file:

	xfs_spaceman -c "defrag <offset> <length>" <file>

As the control command, and then functionality for
automated/periodic scanning and defrag would still end up being
co-ordinated by the existing xfs_fsr code.

> > What's "non-exclusively" mean? How is this different to what xfs_fsr
> > does?
> > 
> 
> I think you have seen the difference as you reviewed more of this set.
> Well, if I read the xfs_fsr code correctly, though xfs_fsr allows parallel writes, it looks like it has a problem(?)
> As I read the code, xfs_fsr does the following to defrag one file:
> 1) preallocate blocks for a temporary file, hoping the temporary file gets the same number of
>     blocks as the file under defrag but with fewer extents.
> 2) copy data blocks from the file under defrag to the temporary file.
> 3) switch the extents between the two files.
> 
> For stage 2, it’s NOT copying the data blocks in an atomic manner. Take an example: two
> read->write pairs are needed to complete the data copy, that is
>     Copy range 1 (read range 1 from the file under defrag, write it to the temporary file)
>     Copy range 2

I wasn't asking you to explain to me how the xfs_fsr algorithm
works. What I was asking for was a definition of what
"non-exclusively" means.

What xfs_fsr currently does meets my definition of "non-exclusive" - it does
not rely on or require exclusive access to the file being
defragmented except for the atomic extent swap at the end. However,
using FICLONE/UNSHARE does require exclusive access to the file being
defragmented for the entirety of those operations, so I don't have
any real idea of why this new algorithm is explicitly described as
"non-exclusive".

Defining terms so everyone has a common understanding is important.

Indeed, given that we now have XFS_IOC_EXCHANGE_RANGE, I'm
definitely starting to wonder if clone/unshare is actually the best
way to do this now.  I think we could make xfs_fsr do iterative
small file region defrag using XFS_IOC_EXCHANGE_RANGE instead of
'whole file at once' as it does now. If we were also to make fsr
aware of shared extents ...

> > 
> >>              Defragmentation and file IOs
> >> 
> >>              The target file is virtually devided into many small segments. Segments are the
> >>              smallest units for defragmentation. Each segment is defragmented one by one in a
> >>              lock->defragment->unlock->idle manner.
> > 
> > Userspace can't easily lock the file to prevent concurrent access.
> > So I'mnot sure what you are refering to here.
> 
> The manner is not simply meant what is done at user space, but a whole thing in both user space
> and kernel space.  The tool defrag a file segment by segment. The lock->defragment->unlock
> Is done by kernel in responding to the FALLOC_FL_UNSHARE_RANGE request from user space.

I'm still not sure what locking you are trying to describe. There
are multiple layers of locking in the kernel, and we use them
differently. Indeed, the algorithm you have described is actually

	FICLONERANGE
	IOLOCK shared
	ILOCK exclusive
	remap_file_range()
	IUNLOCK exclusive
	IOUNLOCK shared

	.....

	UNSHARE_RANGE
	IOLOCK exclusive
	MMAPLOCK exclusive
	<drain DIO in flight>
	ILOCK exclusive
	unshare_range()
	IUNLOCK exclusive
	MMAPUNLOCK exclusive
	IOUNLOCK exclusive

And so there isn't a single "lock -> defrag -> unlock" context
occurring - there are multiple independent operations that have
different kernel side locking contexts and there are no userspace
side file locking contexts, either.
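To be concrete, the userspace side of the algorithm is just these two
calls per segment (a sketch only - error handling elided, helper names
mine, not from the patch set):

```c
/* Sketch of the two userspace operations whose kernel-side locking is
 * traced above: FICLONERANGE shares a segment of the file under defrag
 * into the temp file, then FALLOC_FL_UNSHARE_RANGE breaks the sharing
 * again with (hopefully contiguous) newly allocated blocks.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>	/* FALLOC_FL_UNSHARE_RANGE */
#include <linux/fs.h>		/* FICLONERANGE, struct file_clone_range */
#include <sys/ioctl.h>
#include <unistd.h>

/* Clone [off, off+len) of src_fd into tmp_fd at the same offset.
 * Returns 0 on success, -1 with errno set on failure. */
static int clone_segment(int src_fd, int tmp_fd, off_t off, off_t len)
{
	struct file_clone_range fcr = {
		.src_fd      = src_fd,
		.src_offset  = (__u64)off,
		.src_length  = (__u64)len,
		.dest_offset = (__u64)off,
	};

	return ioctl(tmp_fd, FICLONERANGE, &fcr);
}

/* Unshare [off, off+len) of the file under defrag, triggering the
 * second (exclusively locked) kernel path traced above. */
static int unshare_segment(int fd, off_t off, off_t len)
{
	return fallocate(fd, FALLOC_FL_UNSHARE_RANGE, off, len);
}
```

Note that both operations need a reflink-enabled filesystem to
succeed; on anything else they fail with EOPNOTSUPP.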

> > 
> >>              File IOs are blocked when the target file is locked and are served during the
> >>              defragmentation idle time (file is unlocked).
> > 
> > What file IOs are being served in parallel? The defragmentation IO?
> > something else?
> 
> Here the file IOs means the IOs request from user space applications including virtual machine
> Engine.
> 
> > 
> >>              Though
> >>              the file IOs can't really go in parallel, they are not blocked long. The locking time
> >>              basically depends on the segment size. Smaller segments usually take less locking time
> >>              and thus IOs are blocked shorterly, bigger segments usually need more locking time and
> >>              IOs are blocked longer. Check -s and -i options to balance the defragmentation and IO
> >>              service.
> > 
> > How is a user supposed to know what the correct values are for their
> > storage, files, and workload? Algorithms should auto tune, not
> > require users and administrators to use trial and error to find the
> > best numbers to feed a given operation.
> 
> In my option, user would need a way to control this according to their use case.
> Any algorithms will restrict what user want to do.
> Say, user want the defrag done as quick as possible regardless the resources it takes (CPU, IO and so on)
> when the production system is in a maintenance window. But when the production system is busy
> User want the defrag use less resources.

That's not for the defrag program to implement. That's what we use
resource control groups for. Things like memcgs, block IO cgroups,
scheduler cgroups, etc. Administrators are used to restricting the
resources used by applications with generic admin tools; asking them
to learn how some random admin tool does its own resource
utilisation restriction that requires careful hand tuning for -one
off admin events- is not the right way to solve this problem.

We should be making the admin tool go as fast as possible and
consume as much resources as are available. This makes it fast out
of the box, and lets the admins restrict the IO rate, CPU and memory
usage to bring it down to an acceptable resource usage level for
admin tasks on their systems.

> Another example, kernel (algorithms) never knows the maximum IO latency the user applications tolerate.
> But if you have some algorithms, please share.

As I said - make it as fast and low latency as reasonably possible.
If you have less than 10ms IO latency SLAs, the application isn't
going to be running on sparse, software defined storage that may
require hundreds of milliseconds of IO pauses during admin tasks.
Hence design to a max fixed IO latency (say 100ms) and make the
functionality run as fast as possible within that latency window.

If people need lower latency SLAs, then they shouldn't be running
that application on sparse, COW based VM images. This is not a
problem a defrag utility should be trying to solve.

> >>              Free blocks consumption
> >> 
> >>              Defragmenation works by (trying) allocating new (contiguous) blocks, copying data and
> >>              then freeing old (non-contig) blocks. Usually the number of old blocks to free equals
> >>              to the number the newly allocated blocks. As a finally result, defragmentation doesn't
> >>              consume free blocks.  Well, that is true if the target file is not sharing blocks with
> >>              other files.
> > 
> > This is really hard to read. Defragmentation will -always- consume
> > free space while it is progress. It will always release the
> > temporary space it consumes when it completes.
> 
> I don’t think it’s always free blocks when it releases the temporary file. When the blocks were
> Original shared before defrag, the blocks won’t be freed.

I didn't make myself clear. If the blocks shared to the temp file
are owned exclusively by the source file (i.e. they were COW'd from
shared extents at some time in the past), then that is space
that is temporarily required by the defragmentation process. UNSHARE
creates a second, permanent copy of those blocks in the source file
and closing the temp file then makes the original exclusively
owned blocks go away.

IOWs, defrag can temporarily consume an entire extra file's worth of
space between the UNSHARE starting and the freeing of the temporary
file when we are done with it. Freeing the temp file -always-
releases this extra space, though I note that the implementation is
to hole-punch it away after each segment has been processed.
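For reference, that per-segment hole punch is the standard fallocate()
operation (sketch only, error handling elided):

```c
/* Release the temp file's copy of a segment back to the filesystem
 * once the segment has been processed, so the extra space defrag
 * consumes is given back incrementally rather than only when the temp
 * file is finally unlinked.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

/* Punch out [off, off+len) of tmp_fd without changing the file size.
 * Returns 0 on success, -1 with errno set on failure (e.g. EOPNOTSUPP
 * on filesystems without hole punch support). */
static int punch_segment(int tmp_fd, off_t off, off_t len)
{
	return fallocate(tmp_fd,
			 FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 off, len);
}
```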

> > 
> >>              In case the target file contains shared blocks, those shared blocks won't
> >>              be freed back to filesystem as they are still owned by other files. So defragmenation
> >>              allocates more blocks than it frees.
> > 
> > So this is doing an unshare operation as well as defrag? That seems
> > ... suboptimal. The whole point of sharing blocks is to minimise
> > disk usage for duplicated data.
> 
> That depends on user's need. If users think defrag is the first
> priority, it is.  If users don’t think the disk
> saving is the most important, it is not. No matter what developers think.
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That's pretty ... dismissive.

I mean, you're flat out wrong. You make the assumption that a user
knows exactly how every file that every application in their system
has been created and knows exactly how best to defragment it.

That's just .... wrong.

Users and admins do not have intimate knowledge of how their
applications do their stuff, and a lot of them don't even know
that their systems are using file clones (i.e. reflink copies)
instead of data copies extensively these days.

That is completely the wrong way to approach administration
tools. 

Our defragmentation policy for xfs_fsr is to leave the structure of
the file as intact as possible. That means we replicate unwritten
regions in the defragmented file. We actually -defragment unwritten
extents- in xfs_fsr, not just written extents, and we do that
because we have to assume that the unwritten extents exist for a
good reason.

We don't expect the admin to make a decision as to whether unwritten
extents should be replicated or defragged - we make the assumption
that either the application or the admin has asked for them to exist
in the first place.

It is similar for defragmenting files that are largely made up of shared
extents. That layout exists for a reason, and it's not the place of
the defragmentation operation to unilaterally decide layout policies
for the admin and/or application that is using files with shared
extents.

Hence the defrag operation must preserve the *observed intent* of
the source file layout as much as possible and not require the admin
or user to be sufficiently informed to make the right decision one
way or another. We must attempt to preserve the status quo.

Hence if the file is largely shared, we must not unshare the entire
file to defragment it unless that is the only way to reduce the
fragmentation (e.g. resolve small interleaved shared and unshared
extents). If there are reasonable sized shared extents, we should be
leaving them alone and not unsharing them just to reduce the extent
count by a handful of extents.

> What’s more, reflink (or sharing blocks) is not only used to minimize disk usage. Sometimes it’s
> Used as way to take snapshots. And those snapshots might won’t stay long.

Yes, I know this. It doesn't change anything to do with how we
defragment a file that contains shared blocks.

If you don't want the snapshot(s) to affect defragmentation, then
don't run defrag while the snapshots are present. Otherwise, we
want defrag to retain as much sharing between the snapshots and
the source file because *minimising the space used by snapshots* is
the whole point of using file clones for snapshots in the first
place!

> And what’s more is that, the unshare operation is what you suggested :D   

I suggested it as a mechanism to defrag regions of shared files with
excessive fragmentation. I was not suggesting that "defrag ==
unshare".

> >>              For existing XFS, free blocks might be over-
> >>              committed when reflink snapshots were created. To avoid causing the XFS running into
> >>              low free blocks state, this defragmentation excludes (partially) shared segments when
> >>              the file system free blocks reaches a shreshold. Check the -f option.
> > 
> > Again, how is the user supposed to know when they need to do this?
> > If the answer is "they should always avoid defrag on low free
> > space", then why is this an option?
> 
> I didn’t say "they should always avoid defrag on low free space”. And even we can’t say how low is
> Not tolerated by user, that depends on user use case. Though it’s an option, it has the default value
> Of 1GB. If users don’t set this option, that is "always avoid defrag on low free space”.

You didn't answer my question: how is the user supposed to know
when they should set this?

And, again, the followup question is: why does this need to be
built into the defrag tool?

From a policy perspective, caring about the amount of free space in
the filesystem isn't the job of a defragmentation operation. It
should simply abort if it gets an ENOSPC error or fails to improve
the layout of the file in question. Indeed, if it is obvious that
there may not be enough free space in the filesystem to begin with,
then don't run the defrag operation at all.

This is how xfs_fsr works - it tries to preallocate all the space it
will need before it starts moving data. If it fails to preallocate
all the space, it aborts. If it fails to find large enough
contiguous free spaces to improve the layout of the file, it aborts.

IOWs, xfs_fsr policy is that it doesn't care about the amount of
free space in the filesystem, it just cares if the result will
improve the layout of the file.  That's basically how any online
background defrag operation should work - if the new
layout is worse than the existing layout, or there isn't space for
the new layout to be allocated, just abort.


> >>              Safty and consistency
> >> 
> >>              The defragmentation file is guanrantted safe and data consistent for ctrl-c and kernel
> >>              crash.
> > 
> > Which file is the "defragmentation file"? The source or the temp
> > file?
> 
> I don’t think there is "source concept" here. There is no data copy between files.
> “The defragmentation file” means the file under defrag, I will change it to “The file under defrag”.
> I don’t think users care about the temporary file at all.

Define the terms you use rather than assuming the reader
understands both the terminology you are using and the context in
which you are using them.

.....

> > 
> >>              The command takes the following options:
> >>                 -f free_space
> >>                     The shreshold of XFS free blocks in MiB. When free blocks are less than this
> >>                     number, (partially) shared segments are excluded from defragmentation. Default
> >>                     number is 1024
> > 
> > When you are down to 4MB of free space in the filesystem, you
> > shouldn't even be trying to run defrag because all the free space
> > that will be left in the filesystem is single blocks. I would have
> > expected this sort of number to be in a percentage of capacity,
> > defaulting to something like 5% (which is where we start running low
> > space algorithms in the kernel).
> 
> I would like leave this to user.

Again: How is the user going to know what to set this to? What
problem is this avoiding that requires the user to change this in
any way?

> When user is doing defrag on low free space system, it won’t cause
> Problem to file system its self. At most the defrag fails during unshare when allocating blocks.

Why would we even allow a user to run defrag near ENOSPC? It is a
well known problem that finding contiguous free space when we are close
to ENOSPC is difficult and so defrag often is unable to improve the
situation when we are within a few percent of the filesystem being
full.

It is also a well known problem that defragmentation at low free
space trades off contiguous free space for fragmented free space.
Hence when we are at low free space, defrag makes the free space
fragmentation worse, which then results in all allocation in the
filesystem getting worse and more fragmented. This is something we
absolutely should be trying to avoid.

This is one of the reasons xfs_fsr tries to lay out the entire
file before doing any IO - when the filesystem is about 95% full,
it's common for the
new layout to be worse than the original file's layout because there
isn't sufficient contiguous free space to improve the layout.

IOWs, running defragmentation when we are above 95% full is actively
harmful to the longevity of the filesystem. Hence, on a fundamental
level, having a low space threshold in a defragmentation tool is
simply wrong - defragmentation should simply not be run when the
filesystem is anywhere near full.

.....

> >> 
> >>                 -s segment_size
> >>                     The size limitation in bytes of segments. Minimium number is 4MiB, default
> >>                     number is 16MiB.
> > 
> > Why were these numbers chosen? What happens if the file has ~32MB
> > sized extents and the user wants the file to be returned to a single
> > large contiguous extent it possible? i.e. how is the user supposed
> > to know how to set this for any given file without first having
> > examined the exact pattern of fragmentations in the file?
> 
> Why customer want the file to be returned to a single large contiguous extent?
> A 32MB extent is pretty good to me.  I didn’t here any customer
> complain about 32MB extents…

There's a much wider world out there than just Oracle customers.
Just because you aren't aware of other use cases that exist, it
doesn't mean they don't exist. I know they exist, hence my question.

For example, extent size hints are used to guarantee that the data
is aligned to the underlying storage correctly, and very large
contiguous extents are required to avoid excessive seeks during
sequential reads that result in critical SLA failures. Hence if a
file is poorly laid out in this situation, defrag needs to return it
to as few, maximally sized extents as it can. How does a user know
what they'd need to set this segment size field to and so achieve
the result they need?

> And you know, whether we can defrag extents to a large one depends on not only the tool it’s self.
> It’s depends on the status of the filesystem, say if the filesystem is very fragmented too, say the AG size..
> 
> The 16MB is selected according our tests basing on a customer metadump. With 16MB segment size,
> The the defrag result is very good and the IO latency is acceptable too.  With the default 16MB segment
> Size, 32MB extent is excluded from defrag.

Exactly my point: you have written a solution that works for a
single filesystem in a single environment.  However, the solution is
so specific to the single problem you need to solve that it is not
clear whether that functionality or defaults are valid outside of
the specific problem case you've written it for and tested it on.

> If you have better default size, we can use that.

I'm not convinced that fixed size "segments" is even the right way
to approach this problem. What needs to be done is dependent on the
extent layout of the file, not how extents fit over some arbitrary
fixed segment map....
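i.e. I'd expect something driven by the file's actual extent map, read
via FIEMAP. As a rough sketch of the sort of extent walk that would
drive it (hypothetical helper, not from the patch set; the "small
extent" threshold is an arbitrary illustration):

```c
/* Walk the file's extent map and count the extents that look like
 * defrag candidates: written, unshared extents smaller than
 * small_limit bytes. Shared and unwritten extents are skipped so the
 * existing layout intent is preserved.
 */
#define _GNU_SOURCE
#include <linux/fiemap.h>
#include <linux/fs.h>		/* FS_IOC_FIEMAP */
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

#define FM_BATCH	64	/* extents fetched per FIEMAP call */

/* Returns the candidate count, or -1 on error (e.g. the filesystem
 * does not support FIEMAP). */
static long count_defrag_candidates(int fd, __u64 small_limit)
{
	struct fiemap *fm;
	long candidates = 0;
	__u64 start = 0;
	int last = 0;

	fm = calloc(1, sizeof(*fm) + FM_BATCH * sizeof(struct fiemap_extent));
	if (!fm)
		return -1;

	while (!last) {
		unsigned int i;

		memset(fm, 0, sizeof(*fm));
		fm->fm_start = start;
		fm->fm_length = ~0ULL;
		fm->fm_extent_count = FM_BATCH;

		if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
			free(fm);
			return -1;
		}
		if (fm->fm_mapped_extents == 0)
			break;

		for (i = 0; i < fm->fm_mapped_extents; i++) {
			struct fiemap_extent *fe = &fm->fm_extents[i];

			/* resume the next batch after this extent */
			start = fe->fe_logical + fe->fe_length;
			if (fe->fe_flags & FIEMAP_EXTENT_LAST)
				last = 1;
			if (fe->fe_flags & (FIEMAP_EXTENT_SHARED |
					    FIEMAP_EXTENT_UNWRITTEN))
				continue;
			if (fe->fe_length < small_limit)
				candidates++;
		}
	}
	free(fm);
	return candidates;
}
```

Work units would then be runs of adjacent candidate extents, however
large or small they happen to be, not fixed 16MB windows.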

> >> We tested with real customer metadump with some different 'idle_time's and found 250ms is good pratice
> >> sleep time. Here comes some number of the test:
> >> 
> >> Test: running of defrag on the image file which is used for the back end of a block device in a
> >>      virtual machine. At the same time, fio is running at the same time inside virtual machine
> >>      on that block device.
> >> block device type:   NVME
> >> File size:           200GiB
> >> paramters to defrag: free_space: 1024 idle_time: 250 First_extent_share: enabled readahead: disabled
> >> Defrag run time:     223 minutes
> >> Number of extents:   6745489(before) -> 203571(after)
> > 
> > So and average extent size of ~32kB before, 100MB after? How much of
> > these are shared extents?
> 
> Zero shared extents, but there are some unwritten ones.
> A similar run stats is like this:
> Pre-defrag 6654460 extents detected, 112228 are "unwritten",0 are "shared” 
> Tried to defragment 6393352 extents (181000359936 bytes) in 26032 segments Time stats(ms): max clone: 31, max unshare: 300, max punch_hole: 66
> Post-defrag 282659 extents detected
> 
> > 
> > Runtime is 13380secs, so if we copied 200GiB in that time, the
> > defrag ran at 16MB/s. That's not very fast.
> > 
> 
> We are chasing the balance of defrag and parallel IO latency.

My point is that stuff like CLONE and UNSHARE should be able to run
much, much faster than this, even if some of the time is left idle
for other IO.

i.e. we can clone extents at about 100,000/s. We can copy data
through the page cache at 7-8GB/s on NVMe devices.

A full clone of the 6.6 million extents should only take about
a minute.

A full page cache copy of the 200GB cloned file (i.e. via read/write
syscalls) should easily run at >1GB/s, and so only take a couple of
minutes to run.

IOWs, the actual IO and metadata modification side of things is
really only about 5 minutes worth of CPU and IO.

Hence this defrag operation is roughly 100x slower than we should be
able to run it at.  We should be able to run it at close to those
speeds whilst still allowing concurrent read access to the file.

If an admin then wants it to run at 16MB/s, it can be throttled
to that speed using cgroups, ionice, etc.

i.e. I think you are trying to solve too many unnecessary problems
here and not addressing the one thing it should do: defrag a file as
fast and efficiently as possible.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 8/9] spaceman/defrag: readahead for better performance
  2024-07-18 18:40     ` Wengang Wang
@ 2024-07-31  3:10       ` Dave Chinner
  2024-08-02 18:31         ` Wengang Wang
  0 siblings, 1 reply; 60+ messages in thread
From: Dave Chinner @ 2024-07-31  3:10 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs@vger.kernel.org

On Thu, Jul 18, 2024 at 06:40:46PM +0000, Wengang Wang wrote:
> 
> 
> > On Jul 15, 2024, at 5:56 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > On Tue, Jul 09, 2024 at 12:10:27PM -0700, Wengang Wang wrote:
> >> Reading ahead take less lock on file compared to "unshare" the file via ioctl.
> >> Do readahead when defrag sleeps for better defrag performace and thus more
> >> file IO time.
> >> 
> >> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> >> ---
> >> spaceman/defrag.c | 21 ++++++++++++++++++++-
> >> 1 file changed, 20 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> >> index 415fe9c2..ab8508bb 100644
> >> --- a/spaceman/defrag.c
> >> +++ b/spaceman/defrag.c
> >> @@ -331,6 +331,18 @@ defrag_fs_limit_hit(int fd)
> >> }
> >> 
> >> static bool g_enable_first_ext_share = true;
> >> +static bool g_readahead = false;
> >> +
> >> +static void defrag_readahead(int defrag_fd, off64_t offset, size_t count)
> >> +{
> >> + if (!g_readahead || g_idle_time <= 0)
> >> + return;
> >> +
> >> + if (readahead(defrag_fd, offset, count) < 0) {
> >> + fprintf(stderr, "readahead failed: %s, errno=%d\n",
> >> + strerror(errno), errno);
> > 
> > This doesn't do what you think it does. readahead() only queues the
> > first readahead chunk of the range given (a few pages at most). It
> > does not cause readahead on the entire range, wait for page cache
> > population, nor report IO errors that might have occurred during
> > readahead.
> 
> Is it a bug?

No.

> As per the man page it should try to read _count_ bytes:

No it doesn't. It says:

> 
> DESCRIPTION
>        readahead() initiates readahead on a file
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It says it -initiates- readahead. It doesn't mean it waits for
readahead to complete or that it will readahead the whole range.
It just starts readahead.

> > There's almost no value to making this syscall, especially if the
> > app is about to trigger a sequential read for the whole range.
> > Readahead will occur naturally during that read operation (i.e. the
> > UNSHARE copy), and the read will return IO errors unlike
> > readahead().
> > 
> > If you want the page cache pre-populated before the unshare
> > operation is done, then you need to use mmap() and
> > madvise(MADV_POPULATE_READ). This will read the whole region into
> > the page cache as if it was a sequential read, wait for it to
> > complete and return any IO errors that might have occurred during
> > the read.
> 
> As you know in the unshare path, fetching data from disk is done when IO is locked.
> (I am wondering if we can improve that.)

Christoph pointed that out and some potential fixes back in the
original discussion:

https://lore.kernel.org/linux-xfs/ZXvQ0YDfHBuvLXbY@infradead.org/

> The main purpose of using readahead is that I want less (IO) lock time when fetching
> data from disk. Can we achieve that by using mmap and madvise()?

Maybe, but you're still adding complexity to userspace as a
workaround for a kernel issue we should be fixing.
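For completeness, the mmap()/madvise(MADV_POPULATE_READ) mechanism I
suggested above looks something like this (sketch only; needs Linux
5.14 or later):

```c
/* Fault [off, off+len) of the file into the page cache as if it was
 * sequentially read, waiting for IO completion and reporting IO
 * errors - unlike readahead(2), which only kicks off the first
 * readahead window and reports nothing.
 */
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ	22	/* Linux 5.14+ */
#endif

/* Returns 0 on success, -1 with errno set on failure. */
static int populate_range(int fd, off_t off, size_t len)
{
	void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
	int ret;

	if (p == MAP_FAILED)
		return -1;
	ret = madvise(p, len, MADV_POPULATE_READ);
	munmap(p, len);
	return ret;
}
```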

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 6/9] spaceman/defrag: workaround kernel xfs_reflink_try_clear_inode_flag()
  2024-07-09 19:10 ` [PATCH 6/9] spaceman/defrag: workaround kernel xfs_reflink_try_clear_inode_flag() Wengang Wang
  2024-07-09 20:51   ` Darrick J. Wong
  2024-07-16  0:25   ` Dave Chinner
@ 2024-07-31 22:25   ` Dave Chinner
  2 siblings, 0 replies; 60+ messages in thread
From: Dave Chinner @ 2024-07-31 22:25 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs

On Tue, Jul 09, 2024 at 12:10:25PM -0700, Wengang Wang wrote:
> xfs_reflink_try_clear_inode_flag() takes very long in case file has huge number
> of extents and none of the extents are shared.
> 
> workaround:
> share the first real extent so that xfs_reflink_try_clear_inode_flag() returns
> quickly to save cpu times and speed up defrag significantly.
> 
> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
>  spaceman/defrag.c | 174 +++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 172 insertions(+), 2 deletions(-)

I had some insight on this late last night. The source of the issue
is that both the kernel and the defrag algorithm are walking
forwards across the file. Hence as we get to higher offsets in the
file during defrag which unshares shared ranges, we are moving the
first shared range to be higher in the file.

Hence the act of unsharing the file in ascending offset order
results in the ascending offset search for shared extents done by
the kernel growing in time.

The solution to this is to make defrag work backwards through the
file, so it leaves the low offset shared extents intact for the
kernel to find until the defrag process unshares them. At which
point the kernel will clear the reflink flag and the searching
stops.

IOWs, we either need to change the kernel code to do reverse order
shared extent searching, or change the defrag operation to work in
reverse sequential order, and then the performance problem relating
to unsharing having to determine if the file being defragged is
still shared or not goes away.
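In outline, the reverse walk is trivial (hypothetical helpers, byte
units, file_size > 0 assumed):

```c
/* Walk the file's segments from the tail towards offset zero, so the
 * low-offset shared extents stay in place until last and the kernel's
 * ascending shared-extent search in
 * xfs_reflink_try_clear_inode_flag() stays cheap.
 */
#include <sys/types.h>

/* Start offset of the last segment of a file_size-byte file. */
static off_t last_segment_start(off_t file_size, off_t seg_size)
{
	return ((file_size - 1) / seg_size) * seg_size;
}

/* Invoke process() on each segment in descending offset order; only
 * the final (tail) segment may be shorter than seg_size. */
static void defrag_reverse(off_t file_size, off_t seg_size,
			   void (*process)(off_t start, off_t len))
{
	off_t start = last_segment_start(file_size, seg_size);

	for (;;) {
		off_t len = file_size - start;

		if (len > seg_size)
			len = seg_size;
		process(start, len);
		if (start == 0)
			break;
		start -= seg_size;
	}
}
```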

That said, I still think that the fact the defrag is completely
unsharing the source file is wrong. If we leave shared extents
intact as we defrag the file, then this problem doesn't need solving
at all because xfs_reflink_try_clear_inode_flag() will hit the same
shared extent near the front of the file every time...

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 0/9] introduce defrag to xfs_spaceman
  2024-07-31  2:51     ` Dave Chinner
@ 2024-08-02 18:14       ` Wengang Wang
  0 siblings, 0 replies; 60+ messages in thread
From: Wengang Wang @ 2024-08-02 18:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs@vger.kernel.org



> On Jul 30, 2024, at 7:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Tue, Jul 16, 2024 at 07:45:37PM +0000, Wengang Wang wrote:
>> 
>> 
>>> On Jul 15, 2024, at 4:03 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> 
>>> [ Please keep documentation text to 80 columns. ] 
>>> 
>> 
>> Yes. This is not a patch. I copied it from the man 8 output.
>> It will be limited to 80 columns when sent as a patch.
>> 
>>> [ Please run documentation through a spell checker - there are too
>>> many typos in this document to point them all out... ]
>> 
>> OK.
>> 
>>> 
>>> On Tue, Jul 09, 2024 at 12:10:19PM -0700, Wengang Wang wrote:
>>>> This patch set introduces defrag to xfs_spaceman command. It has the functionality and
>>>> features below (also subject to be added to man page, so please review):
>>> 
>>> What's the use case for this?
>> 
>> This is the user space defrag as you suggested previously.
>> 
>> Please see the previous conversation for your reference: 
>> https://patchwork.kernel.org/project/xfs/cover/20231214170530.8664-1-wen.gang.wang@oracle.com/
> 
> That's exactly what you should have put in the cover letter!
> 
> The cover letter is not for documenting the user interface of a new
> tool - that's what the patch in the patch set for the new man page
> should be doing.
> 
> The cover letter should contain references to past patch sets and
> discussions on the topic. The cover letter shoudl also contain a
> changelog that documents what is different in this new version of
> the patch set so reviewers know what you've changed since they last
> looked at it.
> 
> IOWs, the cover letter for explaining the use case, why the
> functionality is needed, important design/implementation decisions
> and the history of the patchset. It's meant to inform and remind
> readers of what has already happened to get to this point.
> 
>> COPY STARTS —————————————> 
>> I am copying your last comment there:
>> 
>> On Tue, Dec 19, 2023 at 09:17:31PM +0000, Wengang Wang wrote:
>>> Hi Dave,
>>> Yes, the user space defrag works and satisfies my requirement (almost no change from your example code).
>> 
>> That's good to know :)
>> 
>>> Let me know if you want it in xfsprog.
>> 
>> Yes, i think adding it as an xfs_spaceman command would be a good
>> way for this defrag feature to be maintained for anyone who has need
>> for it.
> 
> Sure, I might have said that 6 months ago. When presented with a
> completely new implementation in a new context months later, I might
> see things differently.  Everyone is allowed to change their mind,
> opinions and theories as circumstances, evidence and contexts
> change.
> 
> Indeed when I look at this:
> 
>>>>      defrag [-f free_space] [-i idle_time] [-s segment_size] [-n] [-a]
>>>>             defrag defragments the specified XFS file online non-exclusively. The target XFS
> 
> I didn't expect anything nearly as complex and baroque as this. All
> I was expecting was something like this to defrag a single range of
> a file:
> 
> xfs_spaceman -c "defrag <offset> <length>" <file>
> 
> As the control command, and then functionality for
> automated/periodic scanning and defrag would still end up being
> co-ordinated by the existing xfs_fsr code.
> 
>>> What's "non-exclusively" mean? How is this different to what xfs_fsr
>>> does?
>>> 
>> 
>> I think you have seen the difference when you reviewing more of this set.
>> Well, if I read xfs_fsr code correctly, though xfs_fsr allow parallel writes, it looks have a problem(?)
>> As I read the code, Xfs_fsr do the followings to defrag one file:
>> 1) preallocating blocks to a temporary file hoping the temporary get same number of blocks as the
>>    file under defrag with with less extents.
>> 2) copy data blocks from the file under defrag to the temporary file.
>> 3) switch the extents between the two files.
>> 
>> For stage 2, it’s NOT copying data blocks in atomic manner. Take an example: there need two
>> Read->write pair to complete the data copy, that is
>>    Copy range 1 (read range 1 from the file under defrag to the temporary file)
>>    Copy range 2
> 
> I wasn't asking you to explain to me how the xfs_fsr algorithm
> works. What I was asking for was a definition of what
> "non-exclusively" means.
> 
> What xfs_fsr currently does meets my definition of "non-exclusive" - it does
> not rely on or require exclusive access to the file being
> defragmented except for the atomic extent swap at the end. However,
> using FICLONE/UNSHARE does require exclusive access to the file be
> defragmented for the entirity of those operations, so I don't have
> any real idea of why this new algorithm is explicitly described as
> "non-exclusive".
> 
> Defining terms so everyone has a common understanding is important.
> 
> Indeed, Given that we now have XFS_IOC_EXCHANGE_RANGE, I'm
> definitely starting to wonder if clone/unshare is actually the best
> way to do this now.  I think we could make xfs_fsr do iterative
> small file region defrag using XFS_IOC_EXCHANGE_RANGE instead of
> 'whole file at once' as it does now. If we were also to make fsr
> aware of shared extents 
> 
>>> 
>>>>             Defragmentation and file IOs
>>>> 
>>>>             The target file is virtually devided into many small segments. Segments are the
>>>>             smallest units for defragmentation. Each segment is defragmented one by one in a
>>>>             lock->defragment->unlock->idle manner.
>>> 
>>> Userspace can't easily lock the file to prevent concurrent access,
>>> so I'm not sure what you are referring to here.
>> 
>> The "manner" does not simply mean what is done in user space; it describes the whole flow
>> across user space and kernel space. The tool defrags the file segment by segment. The
>> lock->defragment->unlock is done by the kernel in response to the FALLOC_FL_UNSHARE_RANGE
>> request from user space.
> 
> I'm still not sure what locking you are trying to describe. There
> are multiple layers of locking in the kernel, and we use them
> differently. Indeed, the algorithm you have described is actually
> 
> FICLONERANGE
> IOLOCK shared
> ILOCK exclusive
> remap_file_range()
> IUNLOCK exclusive
> IOUNLOCK shared
> 
> .....
> 
> UNSHARE_RANGE
> IOLOCK exclusive
> MMAPLOCK exclusive
> <drain DIO in flight>
> ILOCK exclusive
> unshare_range()
> IUNLOCK exclusive
> MMAPUNLOCK exclusive
> IOUNLOCK shared
> 
> And so there isn't a single "lock -> defrag -> unlock" context
> occurring - there are multiple independent operations that have
> different kernel side locking contexts and there are no userspace
> side file locking contexts, either.
> 
>>> 
>>>>             File IOs are blocked when the target file is locked and are served during the
>>>>             defragmentation idle time (file is unlocked).
>>> 
>>> What file IOs are being served in parallel? The defragmentation IO?
>>> something else?
>> 
>> Here "file IOs" means the IO requests from user space applications, including the virtual
>> machine engine.
>> 
>>> 
>>>>             Though
>>>>             the file IOs can't really go in parallel, they are not blocked long. The locking time
>>>>             basically depends on the segment size. Smaller segments usually take less locking time
>>>>             and thus block IOs for shorter periods; bigger segments usually need more locking time and
>>>>             IOs are blocked longer. Check -s and -i options to balance the defragmentation and IO
>>>>             service.
>>> 
>>> How is a user supposed to know what the correct values are for their
>>> storage, files, and workload? Algorithms should auto tune, not
>>> require users and administrators to use trial and error to find the
>>> best numbers to feed a given operation.
>> 
>> In my opinion, users need a way to control this according to their use case.
>> Any algorithm will restrict what users want to do.
>> Say a user wants the defrag done as quickly as possible, regardless of the resources it takes
>> (CPU, IO and so on), when the production system is in a maintenance window. But when the
>> production system is busy, the user wants the defrag to use fewer resources.
> 
> That's not for the defrag program to implement. That's what we use
> resource control groups for. Things like memcgs, block IO cgroups,
> scheduler cgroups, etc. Administrators are used to restricting the
> resources used by applications with generic admin tools; asking them
> to learn how some random admin tool does its own resource
> utilisation restriction that requires careful hand tuning for -one
> off admin events- is not the right way to solve this problem.
> 
> We should be making the admin tool go as fast as possible and
> consume as much resources as are available. This makes it fast out
> of the box, and lets the admins restrict the IO rate, CPU and memory
> usage to bring it down to an acceptable resource usage level for
> admin tasks on their systems.
> 
>> Another example, kernel (algorithms) never knows the maximum IO latency the user applications tolerate.
>> But if you have some algorithms, please share.
> 
> As I said - make it as fast and low latency as reasonably possible.
> If you have less than 10ms IO latency SLAs, the application isn't
> going to be running on sparse, software defined storage that may
> require hundreds of milliseconds of IO pauses during admin tasks.
> Hence design to a max fixed IO latency (say 100ms) and make the
> functionality run as fast as possible within that latency window.
> 
> If people need lower latency SLAs, then they shouldn't be running
> that application on sparse, COW based VM images. This is not a
> problem a defrag utility should be trying to solve.
> 
>>>>             Free blocks consumption
>>>> 
>>>>             Defragmentation works by (trying to) allocate new (contiguous) blocks, copying data and
>>>>             then freeing the old (non-contiguous) blocks. Usually the number of old blocks freed
>>>>             equals the number of newly allocated blocks. As a final result, defragmentation doesn't
>>>>             consume free blocks.  Well, that is true if the target file is not sharing blocks with
>>>>             other files.
>>> 
>>> This is really hard to read. Defragmentation will -always- consume
>>> free space while it is progress. It will always release the
>>> temporary space it consumes when it completes.
>> 
>> I don’t think it always frees blocks when it releases the temporary file. When the blocks were
>> originally shared before defrag, the blocks won’t be freed.
> 
> I didn't make myself clear. If the blocks shared to the temp file
> are owned exclusively by the source file (i.e. they were COW'd from
> shared extents at some time in the past), then that is space
> that is temporarily required by the defragmentation process. UNSHARE
> creates a second, permanent copy of those blocks in the source file
> and closing the temp file then makes the original exclusively
> owned blocks go away.
> 
> IOWs, defrag can temporarily consume an entire extra file's worth of
> space between the UNSHARE starting and the freeing of the temporary
> file when we are done with it. Freeing the temp file -always-
> releases this extra space, though I note that the implementation is
> to hole-punch it away after each segment has been processed.
> 
>>> 
>>>>             In case the target file contains shared blocks, those shared blocks won't
>>>>             be freed back to the filesystem as they are still owned by other files. So defragmentation
>>>>             allocates more blocks than it frees.
>>> 
>>> So this is doing an unshare operation as well as defrag? That seems
>>> ... suboptimal. The whole point of sharing blocks is to minimise
>>> disk usage for duplicated data.
>> 
>> That depends on the user's needs. If users think defrag is the first
>> priority, it is.  If users don’t think the disk
>> saving is the most important, it is not. No matter what developers think.
>                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> That's pretty ... dismissive.
> 
> I mean, you're flat out wrong. You make the assumption that a user
> knows exactly how every file created by every application on their
> system is laid out and knows exactly how best to defragment it.
> 
> That's just .... wrong.
> 
> Users and admins do not have intimate knowledge of how their
> applications do their stuff, and a lot of them don't even know
> that their systems are using file clones (i.e. reflink copies)
> instead of data copies extensively these days.
> 
> That is completely the wrong way to approach administration
> tools. 
> 
> Our defragmentation policy for xfs_fsr is to leave the structure of
> the file as intact as possible. That means we replicate unwritten
> regions in the defragmented file. We actually -defragment unwritten
> extents- in xfs_fsr, not just written extents, and we do that
> because we have to assume that the unwritten extents exist for a
> good reason.
> 
> We don't expect the admin to make a decision as to whether unwritten
> extents should be replicated or defragged - we make the assumption
> that either the application or the admin has asked for them to exist
> in the first place.
> 
> It is similar for defragmenting files that are largely made up of shared
> extents. That layout exists for a reason, and it's not the place of
> the defragmentation operation to unilaterally decide layout policies
> for the admin and/or application that is using files with shared
> extents.
> 
> Hence the defrag operation must preserve the *observed intent* of
> the source file layout as much as possible and not require the admin
> or user to be sufficiently informed to make the right decision one
> way or another. We must attempt to preserve the status quo.
> 
> Hence if the file is largely shared, we must not unshare the entire
> file to defragment it unless that is the only way to reduce the
> fragmentation (e.g. resolve small interleaved shared and unshared
> extents). If there are reasonable sized shared extents, we should be
> leaving them alone and not unsharing them just to reduce the extent
> count by a handful of extents.
> 
>> What’s more, reflink (or sharing blocks) is not only used to minimize disk usage. Sometimes it’s
>> used as a way to take snapshots, and those snapshots might not stay around long.
> 
> Yes, I know this. It doesn't change anything to do with how we
> defragment a file that contains shared blocks.
> 
> If you don't want the snapshot(s) to affect defragmentation, then
> don't run defrag while the snapshots are present. Otherwise, we
> want defrag to retain as much sharing between the snapshots and
> the source file because *minimising the space used by snapshots* is
> the whole point of using file clones for snapshots in the first
> place!
> 
>> And what’s more is that, the unshare operation is what you suggested :D   
> 
> I suggested it as a mechanism to defrag regions of shared files with
> excessive fragmentation. I was not suggesting that "defrag ==
> unshare".
> 
>>>>             For existing XFS filesystems, free blocks might be over-
>>>>             committed when reflink snapshots were created. To avoid causing the XFS to run into
>>>>             a low-free-blocks state, this defragmentation excludes (partially) shared segments when
>>>>             the file system's free blocks drop below a threshold. Check the -f option.
>>> 
>>> Again, how is the user supposed to know when they need to do this?
>>> If the answer is "they should always avoid defrag on low free
>>> space", then why is this an option?
>> 
>> I didn’t say "they should always avoid defrag on low free space”. And we can’t even say how low
>> is intolerable for the user; that depends on the use case. Though it’s an option, it has a default
>> value of 1GB. If users don’t set this option, that is "always avoid defrag on low free space”.
> 
> You didn't answer my question: how is the user supposed to know
> when they should set this?
> 
> And, again, the followup question is: why does this need to be
> built into the defrag tool?
> 
> From a policy perspective, caring about the amount of free space in
> the filesystem isn't the job of a defragmentation operation. It
> should simply abort if it gets an ENOSPC error or fails to improve
> the layout of the file in question. Indeed, if it is obvious that
> there may not be enough free space in the filesystem to begin with,
> then don't run the defrag operation at all.
> 
> This is how xfs_fsr works - it tries to preallocate all the space it
> will need before it starts moving data. If it fails to preallocate
> all the space, it aborts. If it fails to find large enough
> contiguous free spaces to improve the layout of the file, it aborts.
> 
> IOWs, xfs_fsr policy is that it doesn't care about the amount of
> free space in the filesystem, it just cares if the result will
> improve the layout of the file.  That's basically how any online
> background defrag operation should work - if the new
> layout is worse than the existing layout, or there isn't space for
> the new layout to be allocated, just abort.
> 
> 
>>>>             Safety and consistency
>>>> 
>>>>             The defragmentation file is guaranteed to be safe and data-consistent across ctrl-c
>>>>             and kernel crash.
>>> 
>>> Which file is the "defragmentation file"? The source or the temp
>>> file?
>> 
>> I don’t think there is a "source" concept here. There is no data copy between files.
>> “The defragmentation file” means the file under defrag; I will change it to “the file under defrag”.
>> I don’t think users care about the temporary file at all.
> 
> Define the terms you use rather than assuming the reader
> understands both the terminology you are using and the context in
> which you are using them.
> 
> .....
> 
>>> 
>>>>             The command takes the following options:
>>>>                -f free_space
>>>>                    The threshold of XFS free blocks in MiB. When free blocks are less than this
>>>>                    number, (partially) shared segments are excluded from defragmentation. The default
>>>>                    is 1024.
>>> 
>>> When you are down to 4MB of free space in the filesystem, you
>>> shouldn't even be trying to run defrag because all the free space
>>> that will be left in the filesystem is single blocks. I would have
>>> expected this sort of number to be in a percentage of capacity,
>>> defaulting to something like 5% (which is where we start running low
>>> space algorithms in the kernel).
>> 
>> I would like to leave this to the user.
> 
> Again: How is the user going to know what to set this to? What
> problem is this avoiding that requires the user to change this in
> any way.
> 
>> When a user runs defrag on a low-free-space system, it won’t cause
>> problems for the file system itself. At worst the defrag fails during unshare when allocating blocks.
> 
> Why would we even allow a user to run defrag near ENOSPC? It is a
> well known problem that finding contiguous free space when we are close
> to ENOSPC is difficult, and so defrag is often unable to improve the
> situation when we are within a few percent of the filesystem being
> full.
> 
> It is also a well known problem that defragmentation at low free
> space trades off contiguous free space for fragmented free space.
> Hence when we are at low free space, defrag makes the free space
> fragmentation worse, which then results in all allocation in the
> filesystem getting worse and more fragmented. This is something we
> absolutely should be trying to avoid.
> 
> This is one of the reasons xfs_fsr tries to lay out the entire
> file before doing any IO - when about 95% full, it's common for the
> new layout to be worse than the original file's layout because there
> isn't sufficient contiguous free space to improve the layout.
> 
> IOWs, running defragmentation when we are above 95% full is actively
> harmful to the longevity of the filesystem. Hence, on a fundamental
> level, having a low space threshold in a defragmentation tool is
> simply wrong - defragmentation should simply not be run when the
> filesystem is anywhere near full.
> 
> .....
> 
>>>> 
>>>>                -s segment_size
>>>>                    The size limit in bytes of segments. The minimum is 4MiB; the default
>>>>                    is 16MiB.
>>> 
>>> Why were these numbers chosen? What happens if the file has ~32MB
>>> sized extents and the user wants the file to be returned to a single
>>> large contiguous extent it possible? i.e. how is the user supposed
>>> to know how to set this for any given file without first having
>>> examined the exact pattern of fragmentations in the file?
>> 
>> Why would a customer want the file to be returned to a single large contiguous extent?
>> A 32MB extent is pretty good to me.  I didn’t hear any customer
>> complain about 32MB extents…
> 
> There's a much wider world out there than just Oracle customers.
> Just because you aren't aware of other use cases that exist, it
> doesn't mean they don't exist. I know they exist, hence my question.
> 
> For example, extent size hints are used to guarantee that the data
> is aligned to the underlying storage correctly, and very large
> contiguous extents are required to avoid excessive seeks during
> sequential reads that result in critical SLA failures. Hence if a
> file is poorly laid out in this situation, defrag needs to return it
> to as few, maximally sized extents as it can. How does a user know
> what they'd need to set this segment size field to and so achieve
> the result they need?
> 
>> And you know, whether we can defrag extents into a large one depends not only on the tool itself.
>> It depends on the status of the filesystem too - say, whether the filesystem is very fragmented, or the AG size..
>> 
>> The 16MB was selected according to our tests based on a customer metadump. With a 16MB segment size,
>> the defrag result is very good and the IO latency is acceptable too.  With the default 16MB segment
>> size, 32MB extents are excluded from defrag.
> 
> Exactly my point: you have written a solution that works for a
> single filesystem in a single environment.  However, the solution is
> so specific to the single problem you need to solve that it is not
> clear whether that functionality or defaults are valid outside of
> the specific problem case you've written it for and tested it on.
> 
>> If you have better default size, we can use that.
> 
> I'm not convinced that fixed size "segments" is even the right way
> to approach this problem. What needs to be done is dependent on the
> extent layout of the file, not how extents fit over some arbitrary
> fixed segment map....
> 
>>>> We tested with a real customer metadump with some different 'idle_time's and found 250ms is a good
>>>> practical sleep time. Here come some numbers from the test:
>>>> 
>>>> Test: running defrag on the image file which is used as the back end of a block device in a
>>>>     virtual machine, while fio is running inside the virtual machine on that
>>>>     block device.
>>>> block device type:   NVME
>>>> File size:           200GiB
>>>> parameters to defrag: free_space: 1024 idle_time: 250 First_extent_share: enabled readahead: disabled
>>>> Defrag run time:     223 minutes
>>>> Number of extents:   6745489(before) -> 203571(after)
>>> 
>>> So an average extent size of ~32kB before, 100MB after? How many of
>>> these are shared extents?
>> 
>> Zero shared extents, but there are some unwritten ones.
>> A similar run stats is like this:
>> Pre-defrag 6654460 extents detected, 112228 are "unwritten", 0 are "shared"
>> Tried to defragment 6393352 extents (181000359936 bytes) in 26032 segments
>> Time stats (ms): max clone: 31, max unshare: 300, max punch_hole: 66
>> Post-defrag 282659 extents detected
>> 
>>> 
>>> Runtime is 13380secs, so if we copied 200GiB in that time, the
>>> defrag ran at 16MB/s. That's not very fast.
>>> 
>> 
>> We are chasing the balance of defrag and parallel IO latency.
> 
> My point is that stuff like CLONE and UNSHARE should be able to run
> much, much faster than this, even if some of the time is left idle
> for other IO.
> 
> i.e. we can clone extents at about 100,000/s. We can copy data
> through the page cache at 7-8GB/s on NVMe devices.
> 
> A full clone of the 6.6 million extents should only take about
> a minute.
> 
> A full page cache copy of the 200GB cloned file (i.e. via read/write
> syscalls) should easily run at >1GB/s, and so only take a couple of
> minutes to run.
> 
> IOWs, the actual IO and metadata modification side of things is
> really only about 5 minutes worth of CPU and IO.
> 
> Hence this defrag operation is roughly 100x slower than we should be
> able to run it at.  We should be able to run it at close to those
> speeds whilst still allowing concurrent read access to the file.
> 
> If an admin then wants it to run at 16MB/s, it can be throttled
> to that speed using cgroups, ionice, etc.
> 
> i.e. I think you are trying to solve too many unnecessary problems
> here and not addressing the one thing it should do: defrag a file as
> fast and efficiently as possible.
> 

Thanks for all above replies.
For the performance, I am still thinking that the bottleneck is the synchronous page-by-page disk reads.

Yes, I have to address/work around some things in the kernel.
Let’s expect the related kernel fixes and then simplify the user space defrag code.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 8/9] spaceman/defrag: readahead for better performance
  2024-07-31  3:10       ` Dave Chinner
@ 2024-08-02 18:31         ` Wengang Wang
  0 siblings, 0 replies; 60+ messages in thread
From: Wengang Wang @ 2024-08-02 18:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs@vger.kernel.org



> On Jul 30, 2024, at 8:10 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Thu, Jul 18, 2024 at 06:40:46PM +0000, Wengang Wang wrote:
>> 
>> 
>>> On Jul 15, 2024, at 5:56 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> 
>>> On Tue, Jul 09, 2024 at 12:10:27PM -0700, Wengang Wang wrote:
>>>> Reading ahead takes less locking on the file compared to "unshare"-ing the file via ioctl.
>>>> Do readahead while defrag sleeps for better defrag performance and thus more
>>>> file IO time.
>>>> 
>>>> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
>>>> ---
>>>> spaceman/defrag.c | 21 ++++++++++++++++++++-
>>>> 1 file changed, 20 insertions(+), 1 deletion(-)
>>>> 
>>>> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
>>>> index 415fe9c2..ab8508bb 100644
>>>> --- a/spaceman/defrag.c
>>>> +++ b/spaceman/defrag.c
>>>> @@ -331,6 +331,18 @@ defrag_fs_limit_hit(int fd)
>>>> }
>>>> 
>>>> static bool g_enable_first_ext_share = true;
>>>> +static bool g_readahead = false;
>>>> +
>>>> +static void defrag_readahead(int defrag_fd, off64_t offset, size_t count)
>>>> +{
>>>> +	if (!g_readahead || g_idle_time <= 0)
>>>> +		return;
>>>> +
>>>> +	if (readahead(defrag_fd, offset, count) < 0) {
>>>> +		fprintf(stderr, "readahead failed: %s, errno=%d\n",
>>>> +			strerror(errno), errno);
>>> 
>>> This doesn't do what you think it does. readahead() only queues the
>>> first readahead chunk of the range given (a few pages at most). It
>>> does not cause readahead on the entire range, wait for page cache
>>> population, nor report IO errors that might have occurred during
>>> readahead.
>> 
>> Is it a bug?
> 
> No.
> 
>> As per the man page it should try to read _count_ bytes:
> 
> No it doesn't. It says:
> 
>> 
>> DESCRIPTION
>>       readahead() initiates readahead on a file
>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> It says it -initiates- readahead. It doesn't mean it waits for
> readahead to complete or that it will readahead the whole range.
> It just starts readahead.
> 

I know it’s an asynchronous operation. But given enough time, it should complete.

# without considering your idea of completing the defrag as fast as possible
As we tested with a 16MiB segment size limit, we were seeing the unshare take about
350 ms including the disk data reads. And we are using a 250 ms default sleep/idle time.
My idea is that during the 250 ms sleep time, a lot of block reads should complete.


>>> There's almost no value to making this syscall, especially if the
>>> app is about to trigger a sequential read for the whole range.
>>> Readahead will occur naturally during that read operation (i.e. the
>>> UNSHARE copy), and the read will return IO errors unlike
>>> readahead().
>>> 
>>> If you want the page cache pre-populated before the unshare
>>> operation is done, then you need to use mmap() and
>>> madvise(MADV_POPULATE_READ). This will read the whole region into
>>> the page cache as if it was a sequential read, wait for it to
>>> complete and return any IO errors that might have occurred during
>>> the read.
>> 
>> As you know, in the unshare path, fetching data from disk is done while the IO lock is held.
>> (I am wondering if we can improve that.)
> 
> Christoph pointed that out and some potential fixes back in the
> original discussion:
> 
> https://lore.kernel.org/linux-xfs/ZXvQ0YDfHBuvLXbY@infradead.org/

Yes, I read that. Thanks Christoph for that.

> 
>> The main purpose of using readahead is that I want less (IO) lock time when fetching
>> data from disk. Can we achieve that by using mmap and madvise()?
> 
> Maybe, but you're still adding complexity to userspace as a work
> around for a kernel issue we should be fixing.
> 

Yes, true. But in the case where the kernel doesn't have the related fixes and we have to use defrag,
working around it is the way to go, right?

Thanks,
Wengang



^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2024-08-02 18:32 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-09 19:10 [PATCH 0/9] introduce defrag to xfs_spaceman Wengang Wang
2024-07-09 19:10 ` [PATCH 1/9] xfsprogs: introduce defrag command to spaceman Wengang Wang
2024-07-09 21:18   ` Darrick J. Wong
2024-07-11 21:54     ` Wengang Wang
2024-07-15 21:30       ` Wengang Wang
2024-07-15 22:44         ` Darrick J. Wong
2024-07-09 19:10 ` [PATCH 2/9] spaceman/defrag: pick up segments from target file Wengang Wang
2024-07-09 21:50   ` [PATCH 2/9] spaceman/defrag: pick up segments from target fileOM Darrick J. Wong
2024-07-11 22:37     ` Wengang Wang
2024-07-15 23:40   ` [PATCH 2/9] spaceman/defrag: pick up segments from target file Dave Chinner
2024-07-16 20:23     ` Wengang Wang
2024-07-17  4:11       ` Dave Chinner
2024-07-18 19:03         ` Wengang Wang
2024-07-19  4:59           ` Dave Chinner
2024-07-19  4:01         ` Christoph Hellwig
2024-07-24 19:22         ` Wengang Wang
2024-07-30 22:13           ` Dave Chinner
2024-07-09 19:10 ` [PATCH 3/9] spaceman/defrag: defrag segments Wengang Wang
2024-07-09 21:57   ` Darrick J. Wong
2024-07-11 22:49     ` Wengang Wang
2024-07-12 19:07       ` Wengang Wang
2024-07-15 22:42         ` Darrick J. Wong
2024-07-16  0:08   ` Dave Chinner
2024-07-18 18:06     ` Wengang Wang
2024-07-09 19:10 ` [PATCH 4/9] spaceman/defrag: ctrl-c handler Wengang Wang
2024-07-09 21:08   ` Darrick J. Wong
2024-07-11 22:58     ` Wengang Wang
2024-07-15 22:56       ` Darrick J. Wong
2024-07-16 16:21         ` Wengang Wang
2024-07-09 19:10 ` [PATCH 5/9] spaceman/defrag: exclude shared segments on low free space Wengang Wang
2024-07-09 21:05   ` Darrick J. Wong
2024-07-11 23:08     ` Wengang Wang
2024-07-15 22:58       ` Darrick J. Wong
2024-07-09 19:10 ` [PATCH 6/9] spaceman/defrag: workaround kernel xfs_reflink_try_clear_inode_flag() Wengang Wang
2024-07-09 20:51   ` Darrick J. Wong
2024-07-11 23:11     ` Wengang Wang
2024-07-16  0:25   ` Dave Chinner
2024-07-18 18:24     ` Wengang Wang
2024-07-31 22:25   ` Dave Chinner
2024-07-09 19:10 ` [PATCH 7/9] spaceman/defrag: sleeps between segments Wengang Wang
2024-07-09 20:46   ` Darrick J. Wong
2024-07-11 23:26     ` Wengang Wang
2024-07-11 23:30     ` Wengang Wang
2024-07-09 19:10 ` [PATCH 8/9] spaceman/defrag: readahead for better performance Wengang Wang
2024-07-09 20:27   ` Darrick J. Wong
2024-07-11 23:29     ` Wengang Wang
2024-07-16  0:56   ` Dave Chinner
2024-07-18 18:40     ` Wengang Wang
2024-07-31  3:10       ` Dave Chinner
2024-08-02 18:31         ` Wengang Wang
2024-07-09 19:10 ` [PATCH 9/9] spaceman/defrag: warn on extsize Wengang Wang
2024-07-09 20:21   ` Darrick J. Wong
2024-07-11 23:36     ` Wengang Wang
2024-07-16  0:29       ` Dave Chinner
2024-07-22 18:01         ` Wengang Wang
2024-07-30 22:43           ` Dave Chinner
2024-07-15 23:03 ` [PATCH 0/9] introduce defrag to xfs_spaceman Dave Chinner
2024-07-16 19:45   ` Wengang Wang
2024-07-31  2:51     ` Dave Chinner
2024-08-02 18:14       ` Wengang Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox