Re: [PATCH 0/9] xfs file non-exclusive online defragment

public inbox for linux-xfs@vger.kernel.org

From: "Darrick J. Wong" <djwong@kernel.org>
To: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>,
	"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>
Subject: Re: [PATCH 0/9] xfs file non-exclusive online defragment
Date: Fri, 15 Dec 2023 09:30:19 -0800
Message-ID: <20231215173019.GO361584@frogsfrogsfrogs>
In-Reply-To: <97269730-511F-438B-9840-59CAF7997FC2@oracle.com>

On Fri, Dec 15, 2023 at 05:07:36PM +0000, Wengang Wang wrote:
> 
> 
> > On Dec 14, 2023, at 7:15 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > On Thu, Dec 14, 2023 at 01:35:02PM -0800, Darrick J. Wong wrote:
> >> On Thu, Dec 14, 2023 at 09:05:21AM -0800, Wengang Wang wrote:
> >>> Background:
> >>> The existing xfs_fsr tool defragments files. It has the
> >>> following features:
> >>> 1. Defragment is implemented by file copying.
> >>> 2. The copy (to a temporary file) is exclusive. The source file is locked
> >>>   during the copy (to a temporary file) and all IO requests are blocked
> >>>   before the copy is done.
> >>> 3. The copy can take a long time for huge files, with IO blocked throughout.
> >>> 4. The copy requires as many free blocks as the source file has.
> >>>   If the source is huge, say 1TiB, it’s hard to require the file
> >>>   system to have another 1TiB free.
> >>> 
> >>> The use case of concern is XFS files used as image files for
> >>> Virtual Machines.
> >>> 1. The image files are huge; they can reach hundreds of GiB or even TiB.
> >>> 2. Backups are made via reflink copies, and CoW makes the files badly fragmented.
> >>> 3. Fragmentation makes reflink copies very slow.
> >>> 4. During the reflink copy, all IO requests to the file are blocked for a very
> >>>   long time. That causes timeouts in the VM, and the timeouts lead to disaster.
> >>> 
> >>> This feature aims to:
> >>> 1. reduce file fragmentation, making future reflink copies (much) faster, and
> >>> 2. at the same time, defragment in a non-exclusive manner that doesn’t
> >>>   block file IO for long.
> >>> 
> >>> Non-exclusive defragment
> >>> Here we introduce a non-exclusive way to defragment a file,
> >>> especially a huge file, without blocking IO to it for long. Non-exclusive
> >>> defragmentation divides the whole file into small pieces. For each piece,
> >>> we lock the file, defragment the piece and unlock the file. Defragmenting
> >>> a small piece doesn’t take long, so file IO requests can be served between
> >>> pieces instead of being blocked for a long time. We also put a (user
> >>> adjustable) idle time between defragmenting two consecutive pieces.
> >>> So although the defragmentation could take longer than xfs_fsr, it balances
> >>> defragmentation against file IO.
> >> 
> >> I'm kinda surprised you don't just turn on alwayscow mode, use an
> >> iomap_funshare-like function to read in and dirty pagecache (which will
> >> hopefully create a new large cow fork mapping) and then flush it all
> >> back out with writeback.  Then you don't need all this state tracking,
> >> kthreads management, and copying file data through the buffer cache.
> >> Wouldn't that be a lot simpler?
> > 
> > Hmmm. I don't think it needs any kernel code to be written at all.
> > I think we can do atomic section-by-section, crash-safe active file
> > defrag from userspace like this:
> > 
> > scratch_fd = open(O_TMPFILE);
> > defrag_fd = open("file-to-be-defragged");
> > 
> > while (offset < target_size) {
> > 
> > /*
> >  * share a range of the file to be defragged into
> >  * the scratch file.
> >  */
> > args.src_fd = defrag_fd;
> > args.src_offset = offset;
> > args.src_len = length;
> > args.dst_offset = offset;
> > ioctl(scratch_fd, FICLONERANGE, &args);
> > 
> > /*
> >  * Force the shared range to be unshared via a
> >  * copy-on-write operation in the file to be
> >  * defragged. This causes the file needing to be
> >  * defragged to have new extents allocated and the
> >  * data to be copied over and written out.
> >  */
> > fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, offset, length);
> > fdatasync(defrag_fd);
> > 
> > /*
> >  * Punch out the original extents we shared to the
> >  * scratch file so they are returned to free space.
> >  */
> > fallocate(scratch_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, length);

You could even set args.dst_offset = 0 and ftruncate here.

But yes, this is a better suggestion than adding more kernel code.

> > /* move onto next region */
> > offset += length;
> > };
> > 
> > As long as the length is large enough for the unshare to create a
> > large contiguous delalloc region for the COW, I think this would
> > likely achieve the desired "non-exclusive" defrag requirement.
> > 
> > If we were to implement this as, say, an xfs_spaceman operation,
> > then all the user controlled policy bits (like inter chunk delays,
> > chunk sizes, etc) just become command line parameters for the
> > defrag command...
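
For reference, here is a rough C sketch of the loop quoted above, using the
"dest offset 0 plus ftruncate" variant. It is an illustration under
assumptions, not tested xfsprogs code: defrag_file(), next_chunk_len(),
scratch_dir, chunk and idle_ms are made-up names, the scratch directory is
assumed to live on the same reflink-enabled filesystem as the target file,
and chunk is assumed to be a multiple of the filesystem block size. The
struct file_clone_range field names (src_length, dest_offset) are the ones
from <linux/fs.h>, which differ slightly from the pseudocode.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for O_TMPFILE and fallocate() */
#endif
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>	/* FALLOC_FL_UNSHARE_RANGE */
#include <linux/fs.h>		/* FICLONERANGE, struct file_clone_range */
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: length of the next chunk, clamped at end of file. */
off_t next_chunk_len(off_t offset, off_t size, off_t chunk)
{
	off_t left = size - offset;

	return left < chunk ? left : chunk;
}

/*
 * Hypothetical driver: defragment @path chunk by chunk.
 * Assumes @scratch_dir is on the same reflink-enabled filesystem and that
 * @chunk is a multiple of the fs block size.
 * Returns 0 on success, -1 on error with errno set.
 */
int defrag_file(const char *path, const char *scratch_dir,
		off_t chunk, unsigned int idle_ms)
{
	struct timespec idle = {
		.tv_sec = idle_ms / 1000,
		.tv_nsec = (long)(idle_ms % 1000) * 1000000L,
	};
	struct stat st;
	int defrag_fd, scratch_fd, saved, ret = -1;
	off_t offset = 0;

	if (chunk <= 0) {
		errno = EINVAL;
		return -1;
	}
	defrag_fd = open(path, O_RDWR);
	if (defrag_fd < 0)
		return -1;
	scratch_fd = open(scratch_dir, O_TMPFILE | O_RDWR, 0600);
	if (scratch_fd < 0)
		goto out_defrag;
	if (fstat(defrag_fd, &st) < 0)
		goto out_scratch;

	while (offset < st.st_size) {
		off_t len = next_chunk_len(offset, st.st_size, chunk);
		struct file_clone_range args = {
			.src_fd = defrag_fd,
			.src_offset = offset,
			.src_length = len,
			.dest_offset = 0,
		};

		/* Share this range of the file into the scratch file. */
		if (ioctl(scratch_fd, FICLONERANGE, &args) < 0)
			goto out_scratch;
		/* Unshare it again: CoW allocates new extents and copies. */
		if (fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, offset, len) < 0)
			goto out_scratch;
		if (fdatasync(defrag_fd) < 0)
			goto out_scratch;
		/* Drop the old extents held by the scratch file. */
		if (ftruncate(scratch_fd, 0) < 0)
			goto out_scratch;

		offset += len;
		nanosleep(&idle, NULL);	/* let other IO in between chunks */
	}
	ret = 0;

out_scratch:
	saved = errno;
	close(scratch_fd);
	errno = saved;
out_defrag:
	saved = errno;
	close(defrag_fd);
	errno = saved;
	return ret;
}
```

Something like defrag_file("/vm/disk.img", "/vm", 64 << 20, 10) would walk
the image in 64MiB chunks with a 10ms pause between chunks. Both numbers are
arbitrary here; they are the kind of policy knobs that would become command
line parameters.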
> 
> 
> Ha, the idea from user space is very interesting!
> So far I have the following thoughts:
> 1). Does FICLONERANGE/FALLOC_FL_UNSHARE_RANGE/FALLOC_FL_PUNCH_HOLE work
> on a FS without reflink enabled?

It does not.

That said, for your use case (reflinked vm disk images that fragment over
time) that won't be an issue.  For non-reflink filesystems, there are
fewer chances for extreme fragmentation due to the lack of COW.

> 2). What if there is a big hole in the file to be defragmented? Will
> it cause block allocation and writing of zeroed blocks?

FUNSHARE ignores holes.

> 3). If a big range of the file is already good (not much fragmented),
> ‘defrag’ on that range is not necessary.

Yep, so you'd have to check the bmap/fiemap output first to identify
areas that are more fragmented than you'd like.
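
A minimal sketch of that check, assuming FIEMAP is good enough for the
purpose: calling FS_IOC_FIEMAP with fm_extent_count set to 0 makes the
kernel report only the number of extents mapped in the range.
count_extents(), range_needs_defrag() and the min_avg_extent threshold are
hypothetical names and policy, not anything that exists in xfsprogs.

```c
#include <linux/fiemap.h>	/* struct fiemap, FIEMAP_FLAG_SYNC */
#include <linux/fs.h>		/* FS_IOC_FIEMAP */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper: number of extents mapping [offset, offset + len)
 * of @fd, or -1 on error.  With fm_extent_count == 0 the kernel fills in
 * fm_mapped_extents without returning any extent records.
 */
long count_extents(int fd, uint64_t offset, uint64_t len)
{
	struct fiemap fm;

	memset(&fm, 0, sizeof(fm));
	fm.fm_start = offset;
	fm.fm_length = len;
	fm.fm_flags = FIEMAP_FLAG_SYNC;	/* flush dirty data before mapping */
	fm.fm_extent_count = 0;

	if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0)
		return -1;
	return (long)fm.fm_mapped_extents;
}

/*
 * Hypothetical policy: a range is worth defragmenting if it has more than
 * one extent and the average extent is smaller than @min_avg_extent bytes.
 * Holes are not subtracted from @len, so the average is overestimated for
 * sparse ranges; a real tool would walk the extent records instead.
 */
bool range_needs_defrag(long nr_extents, uint64_t len, uint64_t min_avg_extent)
{
	if (nr_extents <= 1)
		return false;
	return len / (uint64_t)nr_extents < min_avg_extent;
}
```

A defrag loop could call count_extents() on each chunk and skip the
clone/unshare step whenever range_needs_defrag() says the chunk is already
in good shape.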

> 4). The user space defrag can’t use a try-lock mode to give IO requests
> priority. I am not sure if this is very important.
> 
> Maybe we can work with xfs_bmap to get extents info and skip good
> extents and holes to help case 2) and 3).

Yeah, that sounds necessary.

--D

> I will figure the above out.
> Again, the idea is so amazing, I didn’t realize it.
> 
> Thanks,
> Wengang
>
