linux-fsdevel.vger.kernel.org archive mirror
From: Andreas Dilger <adilger@clusterfs.com>
To: Theodore Tso <tytso@mit.edu>
Cc: David Chinner <dgc@sgi.com>, Jeff Garzik <jeff@garzik.org>,
	Alex Tomas <alex@clusterfs.com>, Jan Kara <jack@suse.cz>,
	linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org
Subject: Re: [RFC] Ext3 online defrag
Date: Tue, 24 Oct 2006 17:00:20 -0600
Message-ID: <20061024230020.GZ3509@schatzie.adilger.int>
In-Reply-To: <20061024194416.GB16087@thunk.org>

On Oct 24, 2006  15:44 -0400, Theodore Tso wrote:
> First of all, we would need a way of allowing userspace to specify
> which blocks should be used in the preallocation.

Presumably it could do this in the same way it will be specifying
which blocks to relocate in the defragger - by passing an extent.
You would be required to pass the file offset for which to preallocate,
and optionally an extent for the on-disk allocation itself (if none is
supplied the kernel will allocate the best extent it can).
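
To make the shape of such a request concrete, here is a minimal sketch in
Python. It is a toy model, not a proposed kernel interface: the names
Extent, PreallocRequest and resolve are hypothetical, and "best extent"
is assumed here to mean a simple best fit over a list of free extents.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Extent:
    start: int   # first on-disk block
    length: int  # number of blocks


@dataclass(frozen=True)
class PreallocRequest:
    file_offset: int                  # logical block offset in the file (required)
    count: int                        # number of blocks to preallocate
    on_disk: Optional[Extent] = None  # optional placement; None lets the allocator choose


def resolve(req: PreallocRequest, free: List[Extent]) -> Extent:
    """Return the on-disk extent to allocate, or raise ValueError."""
    if req.on_disk is not None:
        want = req.on_disk
        if want.length != req.count:
            raise ValueError("extent length does not match block count")
        # Sanity check: the caller-supplied extent must lie inside free space.
        for f in free:
            if f.start <= want.start and want.start + want.length <= f.start + f.length:
                return want
        raise ValueError("requested blocks are not free")
    # No extent supplied: assumed policy is best fit (smallest free extent that fits).
    candidates = [f for f in free if f.length >= req.count]
    if not candidates:
        raise ValueError("no free extent is large enough")
    best = min(candidates, key=lambda f: f.length)
    return Extent(best.start, req.count)
```

The only point of the sketch is that the file offset is mandatory while
the on-disk extent is a hint that has to be validated before use.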

> Secondly, we would need a way of marking blocks as "preallocated but
> not pre-zeroed"; otherwise we would have to zero out all of the blocks
> in order to assure security (don't want userspace programs seeing the
> previous contents of the data blocks), only to do the copy and the
> extents vector swap.

This could be mitigated by having the preallocation be done (in the
defragment case) against a temporary inode in the orphan list (as
the initial patch did) so if there is a crash it will be released.
The temporary inode will not be linked into the namespace so it cannot
be read - only used to hold preallocation.  If this were a write-only
file handle then we should be OK?

For defragger purposes this would need:

- "allocate new temporary inode" (VFS + fs, returns write-only fh if
   fs can't properly handle uninitialized extents, or doesn't request
   full-extent zeroing)

   for each extent to defragment {
	- "preallocate extents on temp inode" (fs specific internals)
	- "copy data from orig to temp at offset X" (VFS, splice or
	   e.g. sys_copyfile(src, dst, offset, count) which Linus agreed
	   to at KS '05 for network filesystems)
	- "migrate copied extent to original inode" (fs specific internals)
   }

- "free temporary inode" (just close of temp fh, frees unmigrated extents).

I don't think this is much more work than implementing all of this
functionality as part of a monolithic online defrag function, assuming
we don't require full-file copies in order to do defrag.
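
The per-extent sequence above can be sketched as a toy in-memory model.
This is only an illustration under stated assumptions: ToyFS and all of
its methods are hypothetical names, an inode is modelled as a dict from
logical block to physical block, and locking, journalling and page
flushing are ignored entirely.

```python
class ToyFS:
    """In-memory stand-in for a filesystem: a block array plus a free set."""

    def __init__(self, nblocks):
        self.disk = [None] * nblocks
        self.free = set(range(nblocks))

    def create(self, placement, payloads):
        """Build an inode (logical block -> physical block) at a given placement."""
        inode = {}
        for logical, (phys, data) in enumerate(zip(placement, payloads)):
            self.free.remove(phys)
            self.disk[phys] = data
            inode[logical] = phys
        return inode

    def preallocate(self, temp, offset, count, start):
        """Reserve `count` contiguous blocks at `start` on the temp inode."""
        wanted = range(start, start + count)
        # Sanity check on the caller-supplied on-disk extent.
        if not all(b in self.free for b in wanted):
            raise ValueError("requested extent is not free")
        for i, phys in enumerate(wanted):
            self.free.remove(phys)
            temp[offset + i] = phys

    def copy(self, orig, temp, offset, count):
        """Copy data from the original blocks into the preallocated ones."""
        for logical in range(offset, offset + count):
            self.disk[temp[logical]] = self.disk[orig[logical]]

    def migrate(self, orig, temp, offset, count):
        """Move the copied extent to the original inode and free the old blocks."""
        for logical in range(offset, offset + count):
            old = orig[logical]
            orig[logical] = temp.pop(logical)
            self.disk[old] = None
            self.free.add(old)

    def release(self, temp):
        """Closing the temp inode frees any blocks that were never migrated."""
        for phys in temp.values():
            self.disk[phys] = None
            self.free.add(phys)
        temp.clear()


def defragment(fs, orig, extents):
    """extents: iterable of (file offset, block count, on-disk start)."""
    temp = {}  # temporary inode, never linked into the namespace
    try:
        for offset, count, start in extents:
            fs.preallocate(temp, offset, count, start)
            fs.copy(orig, temp, offset, count)
            fs.migrate(orig, temp, offset, count)
    finally:
        fs.release(temp)
```

Because each extent is migrated as soon as it is copied, a failure part
way through leaves the original inode valid, and releasing the temporary
inode returns whatever was preallocated but not yet migrated.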

> (For example, you'd never be able to do this with the FAT filesystem,
> or the ext2 or ext3 filesystems; it would work for ext4 only *after*
> we implement the above mentioned new features and the associated
> filesystem format changes.)

Well, ext4 already has stub support for "allocated but uninitialized"
extents.  But regardless, I think if we structure the operations as
above we don't need to do very much crazy stuff.  It just boils down
to exposing some fs internals (create open-unlink inode, block allocation
with sanity check if on-disk extents are given) via new userspace methods,
and one new bit of code (extent migration with sanity check).

Virtually all of the VFS bits are generally useful and it doesn't require
any funky ability on the part of the filesystem in order to work.  We
don't need this to be super performant, so it can do as much locking &
page flushing as it needs to get things correct.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

