Re: [PATCH 0/9] xfs file non-exclusive online defragment

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Wengang Wang <wen.gang.wang@oracle.com>, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 0/9] xfs file non-exclusive online defragment
Date: Fri, 15 Dec 2023 14:15:02 +1100	[thread overview]
Message-ID: <ZXvEtvRm1rkT03Sb@dread.disaster.area> (raw)
In-Reply-To: <20231214213502.GI361584@frogsfrogsfrogs>

On Thu, Dec 14, 2023 at 01:35:02PM -0800, Darrick J. Wong wrote:
> On Thu, Dec 14, 2023 at 09:05:21AM -0800, Wengang Wang wrote:
> > Background:
> > We have the existing xfs_fsr tool which do defragment for files. It has the
> > following features:
> > 1. Defragment is implemented by file copying.
> > 2. The copy (to a temporary file) is exclusive. The source file is locked
> >    during the copy (to a temporary file) and all IO requests are blocked
> >    before the copy is done.
> > 3. The copy could take long time for huge files with IO blocked.
> > 4. The copy requires as many free blocks as the source file has.
> >    If the source is huge, say it’s 1TiB,  it’s hard to require the file
> >    system to have another 1TiB free.
> > 
> > The use case in concern is that the XFS files are used as images files for
> > Virtual Machines.
> > 1. The image files are huge, they can reach hundreds of GiB and even to TiB.
> > 2. Backups are made via reflink copies, and CoW makes the files badly fragmented.
> > 3. fragmentation make reflink copies super slow.
> > 4. during the reflink copy, all IO requests to the file are blocked for super
> >    long time. That makes timeout in VM and the timeout lead to disaster.
> > 
> > This feature aims to:
> > 1. reduce the file fragmentation making future reflink (much) faster and
> > 2. at the same time,  defragmentation works in non-exclusive manner, it doesn’t
> >    block file IOs long.
> > 
> > Non-exclusive defragment
> > Here we are introducing the non-exclusive manner to defragment a file,
> > especially for huge files, without blocking IO to it long. Non-exclusive
> > defragmentation divides the whole file into small pieces. For each piece,
> > we lock the file, defragment the piece and unlock the file. Defragmenting
> > the small piece doesn’t take long. File IO requests can get served between
> > pieces before blocked long.  Also we put (user adjustable) idle time between
> > defragmenting two consecutive pieces to balance the defragmentation and file IOs.
> > So though the defragmentation could take longer than xfs_fsr,  it balances
> > defragmentation and file IOs.
> 
> I'm kinda surprised you don't just turn on alwayscow mode, use an
> iomap_funshare-like function to read in and dirty pagecache (which will
> hopefully create a new large cow fork mapping) and then flush it all
> back out with writeback.  Then you don't need all this state tracking,
> kthreads management, and copying file data through the buffer cache.
> Wouldn't that be a lot simpler?

Hmmm. I don't think it needs any kernel code to be written at all.
I think we can do atomic section-by-section, crash-safe active file
defrag from userspace like this:

	scratch_fd = open(O_TMPFILE);
	defrag_fd = open("file-to-be-dfragged");

	while (offset < target_size) {

		/*
		 * share a range of the file to be defragged into
		 * the scratch file.
		 */
		args.src_fd = defrag_fd;
		args.src_offset = offset;
		args.src_len = length;
		args.dst_offset = offset;
		ioctl(scratch_fd, FICLONERANGE, args);

		/*
		 * For the shared range to be unshared via a
		 * copy-on-write operation in the file to be
		 * defragged. This causes the file needing to be
		 * defragged to have new extents allocated and the
		 * data to be copied over and written out.
		 */
		fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, offset, length);
		fdatasync(defrag_fd);

		/*
		 * Punch out the original extents we shared to the
		 * scratch file so they are returned to free space.
		 */
		fallocate(scratch_fd, FALLOC_FL_PUNCH, offset, length);

		/* move onto next region */
		offset += length;
	};

As long as the length is large enough for the unshare to create a
large contiguous delalloc region for the COW, I think this would
likely acheive the desired "non-exclusive" defrag requirement.

If we were to implement this as, say, and xfs_spaceman operation
then all the user controlled policy bits (like inter chunk delays,
chunk sizes, etc) then just becomes command line parameters for the
defrag command...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

next prev parent reply	other threads:[~2023-12-15  3:15 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-14 17:05 [PATCH 0/9] xfs file non-exclusive online defragment Wengang Wang
2023-12-14 17:05 ` [PATCH 1/9] xfs: defrag: introduce strucutures and numbers Wengang Wang
2023-12-15  5:35   ` kernel test robot
2023-12-14 17:05 ` [PATCH 2/9] xfs: defrag: initialization and cleanup Wengang Wang
2023-12-15 14:09   ` kernel test robot
2023-12-14 17:05 ` [PATCH 3/9] xfs: defrag implement stop/suspend/resume/status Wengang Wang
2023-12-14 17:05 ` [PATCH 4/9] xfs: defrag: allocate/cleanup defragmentation Wengang Wang
2023-12-14 17:05 ` [PATCH 5/9] xfs: defrag: process some cases in xfs_defrag_process Wengang Wang
2023-12-14 17:05 ` [PATCH 6/9] xfs: defrag: piece picking up Wengang Wang
2023-12-14 17:05 ` [PATCH 7/9] xfs: defrag: guarantee contigurous blocks in cow fork Wengang Wang
2023-12-14 17:05 ` [PATCH 8/9] xfs: defrag: copy data from old blocks to new blocks Wengang Wang
2023-12-14 17:05 ` [PATCH 9/9] xfs: defrag: map " Wengang Wang
2023-12-14 21:35 ` [PATCH 0/9] xfs file non-exclusive online defragment Darrick J. Wong
2023-12-15  3:15   ` Dave Chinner [this message]
2023-12-15 17:07     ` Wengang Wang
2023-12-15 17:30       ` Darrick J. Wong
2023-12-15 20:03         ` Wengang Wang
2023-12-15 20:20       ` Dave Chinner
2023-12-18 16:27         ` Wengang Wang
2023-12-19 21:17           ` Wengang Wang
2023-12-19 21:29             ` Dave Chinner
2023-12-19 22:23               ` Wengang Wang
2023-12-15  4:06   ` Christoph Hellwig
2023-12-15 16:48     ` Wengang Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZXvEtvRm1rkT03Sb@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=wen.gang.wang@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.