From: Dave Chinner <david@fromorbit.com>
To: Wengang Wang <wen.gang.wang@oracle.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 3/9] spaceman/defrag: defrag segments
Date: Tue, 16 Jul 2024 10:08:09 +1000 [thread overview]
Message-ID: <ZpW56ccnA26iYI1U@dread.disaster.area> (raw)
In-Reply-To: <20240709191028.2329-4-wen.gang.wang@oracle.com>
On Tue, Jul 09, 2024 at 12:10:22PM -0700, Wengang Wang wrote:
> For each segment, the following steps are done trying to defrag it:
>
> 1. share the segment with a temporary file
> 2. unshare the segment in the target file. kernel simulates Cow on the whole
> segment complete the unshare (defrag).
> 3. release blocks from the tempoary file.
>
> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> ---
> spaceman/defrag.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 114 insertions(+)
>
> diff --git a/spaceman/defrag.c b/spaceman/defrag.c
> index 175cf461..9f11e36b 100644
> --- a/spaceman/defrag.c
> +++ b/spaceman/defrag.c
> @@ -263,6 +263,40 @@ add_ext:
> return ret;
> }
>
> +/*
> + * check if the segment exceeds EoF.
> + * fix up the clone range and return true if EoF happens,
> + * return false otherwise.
> + */
> +static bool
> +defrag_clone_eof(struct file_clone_range *clone)
> +{
> + off_t delta;
> +
> + delta = clone->src_offset + clone->src_length - g_defrag_file_size;
> + if (delta > 0) {
> + clone->src_length = 0; // to the end
> + return true;
> + }
> + return false;
> +}
> +
> +/*
> + * get the time delta since pre_time in ms.
> + * pre_time should contains values fetched by gettimeofday()
> + * cur_time is used to store current time by gettimeofday()
> + */
> +static long long
> +get_time_delta_us(struct timeval *pre_time, struct timeval *cur_time)
> +{
> + long long us;
> +
> + gettimeofday(cur_time, NULL);
> + us = (cur_time->tv_sec - pre_time->tv_sec) * 1000000;
> + us += (cur_time->tv_usec - pre_time->tv_usec);
> + return us;
> +}
> +
> /*
> * defragment a file
> * return 0 if successfully done, 1 otherwise
> @@ -273,6 +307,7 @@ defrag_xfs_defrag(char *file_path) {
> long nr_seg_defrag = 0, nr_ext_defrag = 0;
> int scratch_fd = -1, defrag_fd = -1;
> char tmp_file_path[PATH_MAX+1];
> + struct file_clone_range clone;
> char *defrag_dir;
> struct fsxattr fsx;
> int ret = 0;
> @@ -296,6 +331,8 @@ defrag_xfs_defrag(char *file_path) {
> goto out;
> }
>
> + clone.src_fd = defrag_fd;
> +
> defrag_dir = dirname(file_path);
Just a note: can you please call this the "source fd", not the
"defrag_fd"? defrag_fd could mean either the source or the
temporary scratch file we use as the defrag destination.
> snprintf(tmp_file_path, PATH_MAX, "%s/.xfsdefrag_%d", defrag_dir,
> getpid());
> @@ -309,7 +346,11 @@ defrag_xfs_defrag(char *file_path) {
> }
>
> do {
> + struct timeval t_clone, t_unshare, t_punch_hole;
> struct defrag_segment segment;
> + long long seg_size, seg_off;
> + int time_delta;
> + bool stop;
>
> ret = defrag_get_next_segment(defrag_fd, &segment);
> /* no more segments, we are done */
> @@ -322,6 +363,79 @@ defrag_xfs_defrag(char *file_path) {
> ret = 1;
> break;
> }
> +
> + /* we are done if the segment contains only 1 extent */
> + if (segment.ds_nr < 2)
> + continue;
> +
> + /* to bytes */
> + seg_off = segment.ds_offset * 512;
> + seg_size = segment.ds_length * 512;
Ugh. Do this in the mapping code that gets the extent info. Have it
return bytes. Or just use FIEMAP because it uses byte ranges to
begin with.
> +
> + clone.src_offset = seg_off;
> + clone.src_length = seg_size;
> + clone.dest_offset = seg_off;
> +
> + /* checks for EoF and fix up clone */
> + stop = defrag_clone_eof(&clone);
Ok, so we copy the segment map into clone args, and ...
> + gettimeofday(&t_clone, NULL);
> + ret = ioctl(scratch_fd, FICLONERANGE, &clone);
> + if (ret != 0) {
> + fprintf(stderr, "FICLONERANGE failed %s\n",
> + strerror(errno));
> + break;
> + }
clone the source to the scratch file. This blocks writes to the
source file while it is in progress, but allows reads to pass
through the source file as data is not changing.
> + /* for time stats */
> + time_delta = get_time_delta_us(&t_clone, &t_unshare);
> + if (time_delta > max_clone_us)
> + max_clone_us = time_delta;
> +
> + /* for defrag stats */
> + nr_ext_defrag += segment.ds_nr;
> +
> + /*
> + * For the shared range to be unshared via a copy-on-write
> + * operation in the file to be defragged. This causes the
> + * file needing to be defragged to have new extents allocated
> + * and the data to be copied over and written out.
> + */
> + ret = fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, seg_off,
> + seg_size);
> + if (ret != 0) {
> + fprintf(stderr, "UNSHARE_RANGE failed %s\n",
> + strerror(errno));
> + break;
> + }
And now we unshare the source file. This blocks all IO to the source
file.
Ok, so this is the fundamental problem this whole "segmented
defrag" is trying to work around: FALLOC_FL_UNSHARE_RANGE blocks
all read and write IO whilst it is in progress.
We had this same problem with FICLONERANGE taking snapshots of VM
files - we changed the locking to take shared IO locks to allow
reads to run while the clone was in progress. Because the Oracle Vm
infrastructure uses a sidecar to redirect writes while a snapshot
(clone) was in progress, no VM IO got blocked while the clone was in
progress and so the applications inside the VM never even noticed a
clone was taking place.
Why isn't the same infrastructure being used here?
FALLOC_FL_UNSHARE_RANGE is not changing data, nor is it freeing any
data blocks. Yes, we are re-writing the data somewhere else, but
in that case the original data is still intact in it's original
location on disk and not being freed.
Hence if a read races with UNSHARE, it will hit a referenced extent
containing the correct data regardless of whether it is in the old
or new file. Hence we can likely use shared IO locking for UNSHARE,
just like we do for FICLONERANGE.
At this point, if the Oracle VM infrastructure uses the sidecar
write channel whilst the defrag is in progress, this whole algorithm
simply becomes "for regions with extents smaller than X, clone and
unshare the region".
The whole need for "idle time" goes away. The need for segment size
control largely goes away. The need to tune the defrag algorithm to
avoid IO latency and/or throughput issues goes away.
> +
> + /* for time stats */
> + time_delta = get_time_delta_us(&t_unshare, &t_punch_hole);
> + if (time_delta > max_unshare_us)
> + max_unshare_us = time_delta;
> +
> + /*
> + * Punch out the original extents we shared to the
> + * scratch file so they are returned to free space.
> + */
> + ret = fallocate(scratch_fd,
> + FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, seg_off,
> + seg_size);
> + if (ret != 0) {
> + fprintf(stderr, "PUNCH_HOLE failed %s\n",
> + strerror(errno));
> + break;
> + }
This is unnecessary if there is lots of free space. You can leave
this to the very end of defrag so that the source file defrag
operation isn't slowed down by cleaning up all the fragmented
extents....
-Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2024-07-16 0:08 UTC|newest]
Thread overview: 60+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-07-09 19:10 [PATCH 0/9] introduce defrag to xfs_spaceman Wengang Wang
2024-07-09 19:10 ` [PATCH 1/9] xfsprogs: introduce defrag command to spaceman Wengang Wang
2024-07-09 21:18 ` Darrick J. Wong
2024-07-11 21:54 ` Wengang Wang
2024-07-15 21:30 ` Wengang Wang
2024-07-15 22:44 ` Darrick J. Wong
2024-07-09 19:10 ` [PATCH 2/9] spaceman/defrag: pick up segments from target file Wengang Wang
2024-07-09 21:50 ` [PATCH 2/9] spaceman/defrag: pick up segments from target fileOM Darrick J. Wong
2024-07-11 22:37 ` Wengang Wang
2024-07-15 23:40 ` [PATCH 2/9] spaceman/defrag: pick up segments from target file Dave Chinner
2024-07-16 20:23 ` Wengang Wang
2024-07-17 4:11 ` Dave Chinner
2024-07-18 19:03 ` Wengang Wang
2024-07-19 4:59 ` Dave Chinner
2024-07-19 4:01 ` Christoph Hellwig
2024-07-24 19:22 ` Wengang Wang
2024-07-30 22:13 ` Dave Chinner
2024-07-09 19:10 ` [PATCH 3/9] spaceman/defrag: defrag segments Wengang Wang
2024-07-09 21:57 ` Darrick J. Wong
2024-07-11 22:49 ` Wengang Wang
2024-07-12 19:07 ` Wengang Wang
2024-07-15 22:42 ` Darrick J. Wong
2024-07-16 0:08 ` Dave Chinner [this message]
2024-07-18 18:06 ` Wengang Wang
2024-07-09 19:10 ` [PATCH 4/9] spaceman/defrag: ctrl-c handler Wengang Wang
2024-07-09 21:08 ` Darrick J. Wong
2024-07-11 22:58 ` Wengang Wang
2024-07-15 22:56 ` Darrick J. Wong
2024-07-16 16:21 ` Wengang Wang
2024-07-09 19:10 ` [PATCH 5/9] spaceman/defrag: exclude shared segments on low free space Wengang Wang
2024-07-09 21:05 ` Darrick J. Wong
2024-07-11 23:08 ` Wengang Wang
2024-07-15 22:58 ` Darrick J. Wong
2024-07-09 19:10 ` [PATCH 6/9] spaceman/defrag: workaround kernel xfs_reflink_try_clear_inode_flag() Wengang Wang
2024-07-09 20:51 ` Darrick J. Wong
2024-07-11 23:11 ` Wengang Wang
2024-07-16 0:25 ` Dave Chinner
2024-07-18 18:24 ` Wengang Wang
2024-07-31 22:25 ` Dave Chinner
2024-07-09 19:10 ` [PATCH 7/9] spaceman/defrag: sleeps between segments Wengang Wang
2024-07-09 20:46 ` Darrick J. Wong
2024-07-11 23:26 ` Wengang Wang
2024-07-11 23:30 ` Wengang Wang
2024-07-09 19:10 ` [PATCH 8/9] spaceman/defrag: readahead for better performance Wengang Wang
2024-07-09 20:27 ` Darrick J. Wong
2024-07-11 23:29 ` Wengang Wang
2024-07-16 0:56 ` Dave Chinner
2024-07-18 18:40 ` Wengang Wang
2024-07-31 3:10 ` Dave Chinner
2024-08-02 18:31 ` Wengang Wang
2024-07-09 19:10 ` [PATCH 9/9] spaceman/defrag: warn on extsize Wengang Wang
2024-07-09 20:21 ` Darrick J. Wong
2024-07-11 23:36 ` Wengang Wang
2024-07-16 0:29 ` Dave Chinner
2024-07-22 18:01 ` Wengang Wang
2024-07-30 22:43 ` Dave Chinner
2024-07-15 23:03 ` [PATCH 0/9] introduce defrag to xfs_spaceman Dave Chinner
2024-07-16 19:45 ` Wengang Wang
2024-07-31 2:51 ` Dave Chinner
2024-08-02 18:14 ` Wengang Wang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZpW56ccnA26iYI1U@dread.disaster.area \
--to=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
--cc=wen.gang.wang@oracle.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox