From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Wengang Wang <wen.gang.wang@oracle.com>, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 0/9] xfs file non-exclusive online defragment
Date: Fri, 15 Dec 2023 14:15:02 +1100
Message-ID: <ZXvEtvRm1rkT03Sb@dread.disaster.area>
In-Reply-To: <20231214213502.GI361584@frogsfrogsfrogs>

On Thu, Dec 14, 2023 at 01:35:02PM -0800, Darrick J. Wong wrote:
> On Thu, Dec 14, 2023 at 09:05:21AM -0800, Wengang Wang wrote:
> > Background:
> > We have the existing xfs_fsr tool which do defragment for files. It has the
> > following features:
> > 1. Defragment is implemented by file copying.
> > 2. The copy (to a temporary file) is exclusive. The source file is locked
> > during the copy (to a temporary file) and all IO requests are blocked
> > before the copy is done.
> > 3. The copy could take long time for huge files with IO blocked.
> > 4. The copy requires as many free blocks as the source file has.
> > If the source is huge, say it’s 1TiB, it’s hard to require the file
> > system to have another 1TiB free.
> >
> > The use case in concern is that the XFS files are used as images files for
> > Virtual Machines.
> > 1. The image files are huge, they can reach hundreds of GiB and even to TiB.
> > 2. Backups are made via reflink copies, and CoW makes the files badly fragmented.
> > 3. fragmentation make reflink copies super slow.
> > 4. during the reflink copy, all IO requests to the file are blocked for super
> > long time. That makes timeout in VM and the timeout lead to disaster.
> >
> > This feature aims to:
> > 1. reduce the file fragmentation making future reflink (much) faster and
> > 2. at the same time, defragmentation works in non-exclusive manner, it doesn’t
> > block file IOs long.
> >
> > Non-exclusive defragment
> > Here we are introducing the non-exclusive manner to defragment a file,
> > especially for huge files, without blocking IO to it long. Non-exclusive
> > defragmentation divides the whole file into small pieces. For each piece,
> > we lock the file, defragment the piece and unlock the file. Defragmenting
> > the small piece doesn’t take long. File IO requests can get served between
> > pieces before blocked long. Also we put (user adjustable) idle time between
> > defragmenting two consecutive pieces to balance the defragmentation and file IOs.
> > So though the defragmentation could take longer than xfs_fsr, it balances
> > defragmentation and file IOs.
>
> I'm kinda surprised you don't just turn on alwayscow mode, use an
> iomap_funshare-like function to read in and dirty pagecache (which will
> hopefully create a new large cow fork mapping) and then flush it all
> back out with writeback. Then you don't need all this state tracking,
> kthreads management, and copying file data through the buffer cache.
> Wouldn't that be a lot simpler?

Hmmm. I don't think it needs any kernel code to be written at all.
I think we can do atomic section-by-section, crash-safe active file
defrag from userspace like this:

	scratch_fd = open(O_TMPFILE);
	defrag_fd = open("file-to-be-defragged");

	while (offset < target_size) {

		/*
		 * share a range of the file to be defragged into
		 * the scratch file.
		 */
		args.src_fd = defrag_fd;
		args.src_offset = offset;
		args.src_length = length;
		args.dest_offset = offset;
		ioctl(scratch_fd, FICLONERANGE, &args);

		/*
		 * Force the shared range to be unshared via a
		 * copy-on-write operation in the file to be
		 * defragged. This causes the file needing to be
		 * defragged to have new extents allocated and the
		 * data to be copied over and written out.
		 */
		fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, offset, length);
		fdatasync(defrag_fd);

		/*
		 * Punch out the original extents we shared to the
		 * scratch file so they are returned to free space.
		 */
		fallocate(scratch_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			  offset, length);

		/* move onto next region */
		offset += length;
	}

As long as the length is large enough for the unshare to create a
large contiguous delalloc region for the COW, I think this would
likely achieve the desired "non-exclusive" defrag requirement.

If we were to implement this as, say, an xfs_spaceman operation,
then all the user-controlled policy bits (like inter-chunk delays,
chunk sizes, etc.) just become command line parameters for the
defrag command...

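
To illustrate what "policy as command line parameters" could look
like, here is a rough sketch of option parsing for the chunk size and
the inter-chunk delay. Everything in it is an assumption made for
illustration: the struct, the function names, the -c and -d option
letters and the 16MiB default are hypothetical and are not
xfs_spaceman's actual interface.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical policy knobs for a userspace defrag command. */
struct defrag_policy {
	long long chunk_bytes;	/* size of each unshare chunk */
	long delay_ms;		/* idle time between chunks */
};

/*
 * Parse "-c <chunk bytes>" and "-d <delay ms>".
 * Returns 0 on success, -1 on a bad option or value.
 */
int parse_policy(int argc, char **argv, struct defrag_policy *p)
{
	int c;

	assert(p != NULL);
	p->chunk_bytes = 16LL << 20;	/* assumed default, not from the thread */
	p->delay_ms = 0;

	optind = 1;	/* allow the parser to be called more than once */
	while ((c = getopt(argc, argv, "c:d:")) != -1) {
		char *end;

		switch (c) {
		case 'c':
			p->chunk_bytes = strtoll(optarg, &end, 0);
			if (end == optarg || *end != '\0' || p->chunk_bytes <= 0)
				return -1;
			break;
		case 'd':
			p->delay_ms = strtol(optarg, &end, 0);
			if (end == optarg || *end != '\0' || p->delay_ms < 0)
				return -1;
			break;
		default:
			return -1;
		}
	}
	return 0;
}

/* Sleep between two chunks so foreground IO gets a chance to run. */
void inter_chunk_delay(const struct defrag_policy *p)
{
	struct timespec ts = {
		.tv_sec = p->delay_ms / 1000,
		.tv_nsec = (p->delay_ms % 1000) * 1000000L,
	};

	if (p->delay_ms > 0)
		nanosleep(&ts, NULL);
}
```

The defrag loop would then call inter_chunk_delay() after each chunk
and use chunk_bytes as the length, so tuning the balance between
defrag progress and foreground IO needs no kernel-side state at all.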
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com