From: David Chinner <dgc@sgi.com>
To: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: David Chinner <dgc@sgi.com>, Jeff Garzik <jeff@garzik.org>,
Alex Tomas <alex@clusterfs.com>, Theodore Tso <tytso@mit.edu>,
Jan Kara <jack@suse.cz>,
linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org
Subject: Re: [RFC] Ext3 online defrag
Date: Wed, 25 Oct 2006 11:18:53 +1000
Message-ID: <20061025011853.GQ8394166@melbourne.sgi.com>
In-Reply-To: <1161707186.20134.26.camel@kleikamp.austin.ibm.com>

On Tue, Oct 24, 2006 at 11:26:26AM -0500, Dave Kleikamp wrote:
> On Wed, 2006-10-25 at 02:01 +1000, David Chinner wrote:
> > On Tue, Oct 24, 2006 at 09:51:41AM -0500, Dave Kleikamp wrote:
> > > On Tue, 2006-10-24 at 23:59 +1000, David Chinner wrote:
> > > > That's the wrong way to look at it. If you want the userspace
> > > > process to specify a location, then you should preallocate it first
> > > > before doing anything else. There is no need to clutter a simple
> > > > data mover interface with all sorts of unnecessary error handling.
> > >
> > > You are implying that the 2-step interface, creating a new inode then
> > > swapping the contents, is the only way to implement this.
> >
> > No, it's not the only way to implement it, but it seems the cleanest
> > way to me when you have to consider crash recovery. With a temporary
> > inode, you can create it, hold a reference and then unlink it so
> > that any crash at that point will free the inode and any extents
> > it has on it.
> >
> > The only way I can see anything different working is having the
> > filesystem hold extents somewhere internally that provides us the
> > same recovery guarantees while we copy the data and insert the new
> > extents. This is obviously a filesystem specific solution and is
> > more complex to implement than a swap extent transaction. It
> > probably also needs on-disk format changes to support properly....
>
> This is definitely filesystem-dependent. I would think allocating an
> extent would be like any other allocation done by the filesystem, and
> there are already recovery mechanisms for that.

Yes, the allocation would be the same, but that isn't the problem
I was talking about.

The problem is holding a reference to the extent once it has been
allocated while it is having the data copied into it (i.e. before it
is swapped with the original extents) and then holding the original
extents until they are freed. These references need to be
persistent so they can be freed correctly during crash recovery,
i.e. roll back the allocation if the extent swap has not been
logged, or free the original blocks if the extent swap has been
logged.

The obvious way to do this is to use an unlinked (orphan) inode....
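
As a rough userspace illustration of the unlinked-inode idea (a sketch only,
not the in-kernel mechanism under discussion): a file that is created, kept
open, and then unlinked has no name left, so its blocks are released as soon
as the last reference goes away. The helper name make_orphan_file and the
payload are hypothetical, chosen only for this example.

```python
import os
import tempfile


def make_orphan_file(directory: str, payload: bytes) -> tuple[int, int, bytes]:
    """Create a file, keep it open, and unlink it so only the fd references it.

    Returns (fd, link_count, data_read_back). The caller must close fd.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    # Drop the only name. The inode now lives solely through the open fd,
    # so it is reclaimed when the fd is closed or the process dies.
    os.unlink(path)
    os.write(fd, payload)
    os.fsync(fd)
    link_count = os.fstat(fd).st_nlink
    data = os.pread(fd, len(payload), 0)
    return fd, link_count, data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        fd, nlink, data = make_orphan_file(d, b"staged extent data")
        print(nlink, data, os.listdir(d))
        os.close(fd)  # last reference gone: the space is freed here
```

The point of the sketch is only the lifetime rule: nothing has to remember to
clean up the staging space, because no name ever points at it after the
unlink.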
> > > > Once you've separated the destination allocation from the data
> > > > mover, the mover is basically a splice copy from source to
> > > > destination, an fsync and then an atomic swap blocks/extents operation.
> > > > Most of this code is generic, and a per-fs swap-extents vector
> > > > could be easily provided for the one bit that is not....
> > >
> > > The benefit of having such a simple data mover is negated by moving the
> > > complexity into the allocator.
> >
> > What complexity does it introduce that the allocator doesn't already
> > have or needs to provide for the single call interface to work?
>
> I don't see it as any more or less complex than a single interface.
Ok, I thought I was missing something there.
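
The three mover steps from the quoted text above (copy, fsync, atomic swap)
can be sketched in userspace as follows. This is an analogy under stated
assumptions, not the proposed kernel interface: os.replace() (an atomic
rename) stands in for the per-filesystem swap-extents operation, which means
the file ends up with a new inode rather than keeping the old one, and the
staging file has a name, so it lacks the orphan-inode crash cleanup described
earlier. The function name move_data is hypothetical.

```python
import os
import shutil
import tempfile


def move_data(src_path: str) -> None:
    """Copy src into a staging file, flush it, then atomically switch it in."""
    directory = os.path.dirname(os.path.abspath(src_path))
    # The destination is created before the mover runs. A real defragmenter
    # would preallocate it wherever it wants the data to land.
    fd, staging_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as dst, open(src_path, "rb") as src:
            shutil.copyfileobj(src, dst)    # 1. copy source to destination
            dst.flush()
            os.fsync(dst.fileno())          # 2. make the copy durable
        shutil.copymode(src_path, staging_path)
        os.replace(staging_path, src_path)  # 3. atomic switch (rename stand-in)
    except BaseException:
        # Remove the staging file if anything failed before the switch.
        if os.path.exists(staging_path):
            os.unlink(staging_path)
        raise
```

Note that none of the steps knows or cares how the destination space was
chosen, which is the separation being argued for.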
> > The allocation interface needs to be able to be extended
> > independently of the data mover interface. XFS already exposes
> > allocation ioctls to userspace for preallocation and we've got plans
> > to extend this further to allow userspace controlled allocation for
> > smart defrag tools for XFS. Tying allocation to the data mover
> > just makes the interface less flexible and harder to do anything
> > smart with....
>
> Okay. It would be nice to standardize the interface so we don't have
> every filesystem introducing new ioctls.

Well, that will be an interesting challenge. I'm sure that there
is a common subset that all filesystems can implement, e.g. per-file
preallocation (something like XFS's allocate/reserve/free space
ioctls) to provide kernel support for posix_fallocate(), etc.
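
For reference, this is roughly what the common subset looks like from
userspace today, using Python's os.posix_fallocate() wrapper. It is a minimal
sketch: the helper name preallocate is hypothetical, and the call may fail on
filesystems or C libraries that do not support it.

```python
import os
import tempfile


def preallocate(path: str, length: int) -> int:
    """Reserve `length` bytes for `path` and return the resulting file size."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Ask for storage covering the byte range [0, length) to be
        # allocated up front, so later writes there cannot run out of space.
        os.posix_fallocate(fd, 0, length)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(preallocate(os.path.join(d, "prealloc.bin"), 1 << 20))
```

The interface says how much to allocate and for which byte range, and
nothing about where on disk the blocks should go.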

However, we may end up exposing enough of XFS's current allocation
semantics to do things like telling the filesystem to allocate in
allocation group 6, near block number 0x32482 within the AG, falling
back to searching for the nearest match to the size requirement,
failing that looking for something larger than the minimum size
specified, and then failing if it can't find a match in that AG.

That makes little sense to any filesystem but XFS, which is really
why I think that the smarter allocation interfaces are going to
remain filesystem-specific....
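
To make the fallback chain above concrete, here is a toy model of it over a
list of free extents in a single allocation group. Everything in it is an
assumption made for illustration: the Extent type, the allocate_in_ag name,
the window parameter used to define "near", and the tie-breaking rules. It is
one reading of the description, not XFS's allocator.

```python
from typing import NamedTuple


class Extent(NamedTuple):
    start: int   # first block, relative to the allocation group
    length: int  # number of blocks


def allocate_in_ag(free, near, want, minimum, window=1024):
    """Pick space from `free` (a list of Extent) or return None."""
    # 1. Big enough and starting within `window` blocks of the hint.
    nearby = [e for e in free
              if e.length >= want and abs(e.start - near) <= window]
    if nearby:
        best = min(nearby, key=lambda e: abs(e.start - near))
        return Extent(best.start, want)
    # 2. Anywhere in the AG: the closest size match that still fits.
    fits = [e for e in free if e.length >= want]
    if fits:
        best = min(fits, key=lambda e: e.length - want)
        return Extent(best.start, want)
    # 3. Settle for less than requested, but not below the minimum.
    partial = [e for e in free if e.length >= minimum]
    if partial:
        best = max(partial, key=lambda e: e.length)
        return Extent(best.start, best.length)
    # 4. Nothing suitable in this AG: fail rather than look elsewhere.
    return None


if __name__ == "__main__":
    free = [Extent(0x100, 8), Extent(0x32400, 64), Extent(0x90000, 16)]
    print(allocate_in_ag(free, near=0x32482, want=32, minimum=4))
```

Even this toy needs an AG-relative block hint, a wanted size, and a minimum
size, all of which assume an AG-based layout.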
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group