From: David Chinner <dgc@sgi.com>
To: Theodore Tso <tytso@mit.edu>
Cc: David Chinner <dgc@sgi.com>, Jeff Garzik <jeff@garzik.org>,
	Alex Tomas <alex@clusterfs.com>, Jan Kara <jack@suse.cz>,
	linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org
Subject: Re: [RFC] Ext3 online defrag
Date: Wed, 25 Oct 2006 12:09:27 +1000
Message-ID: <20061025020927.GS8394166@melbourne.sgi.com>
In-Reply-To: <20061024194416.GB16087@thunk.org>

On Tue, Oct 24, 2006 at 03:44:16PM -0400, Theodore Tso wrote:
> On Tue, Oct 24, 2006 at 11:59:28PM +1000, David Chinner wrote:
> > That's the wrong way to look at it. If you want the userspace
> > process to specify a location, then you should preallocate it first
> > before doing anything else. There is no need to clutter a simple
> > data mover interface with all sorts of unnecessary error handling.
> 
> This is doable, but it adds a huge amount of complexity before we
> could implement on-line defragmentation.
> 
> First of all, we would need a way of allowing userspace to specify
> which blocks should be used in the preallocation.

Not initially. Create a file, and call posix_fallocate() on it.
Later, the filesystem can provide something that the defrag tool can
use for fine-grained control of where the preallocated blocks are on
disk.
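
As a rough illustration of the "create a file, and call
posix_fallocate() on it" step, here is a minimal user-space sketch. The
helper name preallocate_file() is made up for this example and the
caller picks the path; it gives no control over where the blocks land on
disk, which is exactly the limitation noted above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create (or open) the file at `path` and reserve `len` bytes of disk
 * space for it. Returns an open file descriptor, or -1 on failure.
 * preallocate_file() is a hypothetical helper, not an existing API.
 */
int preallocate_file(const char *path, off_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* posix_fallocate() returns an error number; it does not set errno. */
    if (posix_fallocate(fd, 0, len) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

After a successful call the file is at least `len` bytes long and later
writes into that range should not fail for lack of space.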

> Secondly, we would need a way of marking blocks as "preallocated but
> not pre-zeroed"; otherwise we would have to zero out all of the blocks
> in order to assure security (don't want userspace programs seeing the
> previous contents of the data blocks), only to do the copy and the
> extents vector swap.

The unlinked inode method avoids this problem because no user space
process can see the inode to open it. Also, posix_fallocate() zeroes
the disk blocks, so even this simple approach protects against data
exposure.
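
The unlinked inode method can be sketched from user space as follows.
This is only an illustration under assumptions: the helper names
(create_unlinked_prealloc, link_count, reads_back_zero) and the
"defrag-XXXXXX" template are invented for the example, and the check only
shows that the preallocated range reads back as zeroes, not how a given
filesystem achieves that.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a temp file in `dir`, drop its only name, then preallocate `len`
 * bytes. Returns an fd for the now-nameless inode, or -1 on failure. */
int create_unlinked_prealloc(const char *dir, off_t len)
{
    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/defrag-XXXXXX", dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    /* After unlink() no other process can open the inode by name;
     * it stays alive only as long as this descriptor is open. */
    if (unlink(path) != 0 || posix_fallocate(fd, 0, len) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Number of directory entries still pointing at the inode, or -1. */
long link_count(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 ? (long)st.st_nlink : -1;
}

/* 1 if the first `len` bytes read back as zero, 0 if not, -1 on error. */
int reads_back_zero(int fd, off_t len)
{
    unsigned char buf[4096];
    off_t off = 0;

    while (off < len) {
        size_t want = sizeof(buf);
        if ((off_t)want > len - off)
            want = (size_t)(len - off);
        ssize_t got = pread(fd, buf, want, off);
        if (got <= 0)
            return -1;
        for (ssize_t i = 0; i < got; i++)
            if (buf[i] != 0)
                return 0;
        off += got;
    }
    return 1;
}
```

On Linux, link_count() should report 0 for the returned descriptor, and
reads_back_zero() should report 1 for the preallocated range.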

So, now all that remains for an initial implementation is the swap
extents transaction and the data mover syscall.

For a smart, fast implementation, I agree that you need unwritten
extents (which XFS already has), then a fast filesystem
implementation of posix_fallocate() that utilises unwritten extents
(which XFS already has), and finally another interface that allows
you to allocate unwritten extents in an arbitrary location within
the filesystem (which no filesystem currently has).

> That's a huge amount of work, and while the above two features can be
> useful for other things, it's not clear it's worth it to require this
> as the only way to implement on-line defragging.  You're right that
> it's a way of making things be more generic, but it means that each
> filesystem needs to have a huge amount of additional complexity and
> potential filesystem format changes before they could take advantage
> of this general framework.  

I disagree - it's not a huge amount of work to get something
working and to solidify the generic interfaces, and the only format
change is a new transaction. Any filesystem that supports the swap
extent/blocks method would then work better than XFS's current
online defrag tool, which currently uses neither preallocation
nor splice.....

> (For example, you'd never be able to do this with the FAT filesystem,
> or the ext2 or ext3 filesystems; it would work for ext4 only *after*
> we implement the above mentioned new features and the associated
> filesystem format changes.)

Sure, but they can use the slow, unoptimised posix_fallocate() method
for allocating disk space....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
