Re: [RFC] Ext3 online defrag

From: David Chinner <dgc@sgi.com>
To: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: David Chinner <dgc@sgi.com>, Jeff Garzik <jeff@garzik.org>,
	Alex Tomas <alex@clusterfs.com>, Theodore Tso <tytso@mit.edu>,
	Jan Kara <jack@suse.cz>,
	linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org
Subject: Re: [RFC] Ext3 online defrag
Date: Wed, 25 Oct 2006 02:01:28 +1000
Message-ID: <20061024160128.GF11034@melbourne.sgi.com>
In-Reply-To: <1161701502.20134.17.camel@kleikamp.austin.ibm.com>

On Tue, Oct 24, 2006 at 09:51:41AM -0500, Dave Kleikamp wrote:
> On Tue, 2006-10-24 at 23:59 +1000, David Chinner wrote:
> > On Tue, Oct 24, 2006 at 12:14:33AM -0400, Jeff Garzik wrote:
> > > On Mon, Oct 23, 2006 at 06:31:40PM +0400, Alex Tomas wrote:
> > > > isn't that a kernel responsibility to find/allocate target blocks?
> > > > wouldn't it be better to specify desirable target group and minimal
> > > > acceptable chunk of free blocks?
> > > 
> > > The kernel doesn't have enough knowledge to know whether or not the
> > > defragger prefers one blkdev location over another.
> > > 
> > > When you are trying to consolidate blocks, you must specify the
> > > destination as well as source blocks.
> > > 
> > > Certainly, to prevent corruption and other nastiness, you must fail if
> > > the destination isn't available...
> > 
> > That's the wrong way to look at it. If you want the userspace
> > process to specify a location, then you should preallocate it first
> > before doing anything else. There is no need to clutter a simple
> > data mover interface with all sorts of unnecessary error handling.
> 
> You are implying that the 2-step interface, creating a new inode then
> swapping the contents, is the only way to implement this.

No, it's not the only way to implement it, but it seems the cleanest
way to me when you have to consider crash recovery. With a temporary
inode, you can create it, hold a reference and then unlink it so
that any crash at that point will free the inode and any extents
it has on it.
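
As a userspace illustration of the create, hold a reference, then
unlink pattern (a minimal sketch, not the proposed kernel interface;
the helper name `open_unlinked_temp` is hypothetical), the following
shows that an unlinked file stays usable through its open descriptor
and leaves no name behind that would need cleaning up:

```python
import os
import tempfile


def open_unlinked_temp(directory):
    """Create a temporary file, keep its descriptor, and unlink its name.

    Hypothetical helper: once the name is gone, the inode and its blocks
    are released by the filesystem when the last reference is dropped.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    os.unlink(path)  # inode now has no directory entry, only our reference
    return fd, path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        fd, path = open_unlinked_temp(d)
        os.write(fd, b"staged copy of the data")
        print("link count:", os.fstat(fd).st_nlink)
        print("name still visible:", os.path.exists(path))
        os.close(fd)  # last reference dropped, space is freed
```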

The only way I can see anything different working is to have the
filesystem hold the extents somewhere internally, in a way that gives
us the same recovery guarantees while we copy the data and insert the
new extents.  This is obviously a filesystem-specific solution and is
more complex to implement than a swap extent transaction. It
probably also needs on-disk format changes to support properly....

> > Once you've separated the destination allocation from the data
> > mover, the mover is basically a splice copy from source to
> > destination, an fsync and then an atomic swap blocks/extents operation.
> > Most of this code is generic, and a per-fs swap-extents vector
> > could be easily provided for the one bit that is not....
> 
> The benefit of having such a simple data mover is negated by moving the
> complexity into the allocator.

What complexity does it introduce that the allocator doesn't already
have, or wouldn't need to provide anyway for the single-call
interface to work?

> A single interface that would move a part of a file at a time has the
> advantage that a large file which is only fragmented in a few areas does
> not need to be completely moved.

And the two-step process can do exactly this as well - splice can
work on any offset within the file...
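
To make the partial-move point concrete, here is a rough userspace
sketch that copies only one byte range of a file into a destination
file at the same offset and then syncs it. It uses plain
pread/pwrite as a stand-in for splice, and the function name
`copy_range` is an assumption for illustration; the per-filesystem
extent swap that would follow is not shown:

```python
import os


def copy_range(src_fd, dst_fd, offset, length, chunk=1 << 16):
    """Copy `length` bytes at `offset` from src_fd to the same offset in dst_fd.

    Illustrative stand-in for the splice-based copy: only the requested
    range is read and written, the rest of the source is never touched.
    """
    copied = 0
    while copied < length:
        buf = os.pread(src_fd, min(chunk, length - copied), offset + copied)
        if not buf:  # hit end of file before `length` bytes
            break
        written = 0
        while written < len(buf):
            written += os.pwrite(dst_fd, buf[written:], offset + copied + written)
        copied += len(buf)
    os.fsync(dst_fd)  # data must be stable before any extent swap
    return copied
```

Called as `copy_range(src_fd, tmp_fd, 4096, 8192)`, this would touch
only the 8 KiB starting at offset 4096 and leave the rest of a large
file alone.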

> > The allocation interface, OTOH, is anything but simple and is really
> > a filesystem specific interface. Seems logical to me to separate
> > the two. 
> 
> So what then is the benefit of having a simple generic data mover if
> every file system needs to implement its own interface to allocate a
> copy of the data?

I assume you meant "....allocate the space to store the copy of the data."

The allocation interface needs to be extensible independently of the
data mover interface. XFS already exposes allocation ioctls to
userspace for preallocation, and we've got plans to extend this
further to allow userspace-controlled allocation for smart defrag
tools for XFS. Tying allocation to the data mover just makes the
interface less flexible and harder to do anything smart with....
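
For illustration only, preallocation as a step of its own might look
like the sketch below. It uses the generic posix_fallocate call
rather than the XFS ioctls mentioned above, so it reserves space but
gives no control over where that space lands; the helper name
`preallocate` is hypothetical:

```python
import os
import tempfile


def preallocate(fd, length):
    """Reserve `length` bytes for fd before any data is moved.

    Allocation policy lives in this step, so it can change without
    touching whatever copies the data afterwards.
    """
    os.posix_fallocate(fd, 0, length)
    return os.fstat(fd).st_size


if __name__ == "__main__":
    with tempfile.TemporaryFile() as tmp:
        size = preallocate(tmp.fileno(), 1 << 20)
        print("destination size after preallocation:", size)
```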

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
