linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Theodore Tso <tytso@mit.edu>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org
Subject: Re: [RFC] Ext3 online defrag
Date: Mon, 23 Oct 2006 10:16:41 -0400	[thread overview]
Message-ID: <20061023141641.GA29649@thunk.org> (raw)
In-Reply-To: <20061023122710.GA12034@atrey.karlin.mff.cuni.cz>

On Mon, Oct 23, 2006 at 02:27:10PM +0200, Jan Kara wrote:
>   Hello,
> 
>   I've written a simple patch implementing ext3 ioctl for file
> relocation. Basically you call ioctl on a file, give it list of blocks
> and it relocates the file into given blocks (provided they are still
> free). The idea is to use it as a kernel part of ext3 online
> defragmenter (or generally disk access optimizer). Now I don't have the
> user space part that finds larger runs of free blocks and so on so that
> it can really be used as a defragmenter. I just send this as a kind of
> proof-of-concept to hear some comments. Attached is also a simple
> program that demonstrates the use of the ioctl.

As a suggestion, I would pass the inode number and inode generation
number into the ext3_file_mode_data array:

struct ext3_file_move_data {
	int extents;
	struct ext3_reloc_extent __user *ext_array;
};

This will be much more efficient for the userspace relocator, since it
won't need to translate from an inode number to a pathname, and then
try to open the file before relocating it.

I'd also use an explicit 64-bit block numbers type so that we don't
have to worry about the ABI changing when we support 64-bit block
numbers.


The other problem I see with this patch is that there will be cache
coherency problems between the buffer cache and the page cache.  I
think you will want to pull the data blocks of the file into the page
cache, and then write them out from the page cache, and only *then*
update the indirect blocks and commit the transaction.

So what needs to happen is the following:

1) Validate that inode and generation number.  Make sure the new
(destination) blocks passed in are valid and not in use.  Allocate
them to prevent anyone else from using those blocks.

2) Pull the blocks into the page cache (if they are not already
there), and the write them out to the new location on disk.  If any of
the I/O's fail, abort.

3) Update the indirect blocks or extent tree to point at the newly
allocated and copied data blocks.

In the current patch, it looks like you add the inode being relocated
to the orphan list, and then update the direct/indirect blocks first
--- and if you fail the inode gets truncated.  That's bad since we
don't want to lose any data if we crash in the middle of the defrag
operation....

Great to see that you're working on this problem!  I'd love to see
this functionality into ext4.

Regards,

						- Ted

P.S.  There is also the question of whether we'll be able to get this
interface past the ioctl() police, but the atomicity requirements of
such an interface are a poster child for why we really, REALLY, can't
do this via a sysfs interface.  We might be forced to create a new
filesystem, or create a pseudo inode which we open via a magic
pathname, though.  That in my opinion is uglier than an ioctl, but the
ioctl police really don't like the problem of needing to maintain
32/64 bit translation functions, and this interface would surely cause
problems for the x86_64 and PPC platforms, since they have to support
32-bit and 64-bit system ABI's.



       reply	other threads:[~2006-10-23 14:16 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20061023122710.GA12034@atrey.karlin.mff.cuni.cz>
2006-10-23 14:16 ` Theodore Tso [this message]
2006-10-23 14:31   ` [RFC] Ext3 online defrag Alex Tomas
2006-10-23 14:48     ` Andreas Dilger
2006-10-23 14:55       ` Jan Kara
2006-10-23 14:51     ` Jan Kara
2006-10-23 15:01     ` Eric Sandeen
2006-10-24  4:14     ` Jeff Garzik
2006-10-24 13:59       ` David Chinner
2006-10-24 14:51         ` Dave Kleikamp
2006-10-24 16:01           ` David Chinner
2006-10-24 16:26             ` Dave Kleikamp
2006-10-25  1:18               ` David Chinner
2006-10-25  2:30                 ` Barry Naujok
2006-10-25  2:42                   ` Jeff Garzik
2006-10-25  4:27                     ` David Chinner
2006-10-25  4:48                       ` Jeff Garzik
2006-10-25  5:38                         ` David Chinner
2006-10-25  6:01                           ` Jeff Garzik
2006-10-25  8:11                             ` David Chinner
2006-10-25 17:00                               ` Jeff Garzik
2006-10-26  1:40                                 ` David Chinner
2006-10-26  3:33                                   ` Theodore Tso
2006-10-26  6:36                                     ` David Chinner
2006-10-26 13:37                                       ` Theodore Tso
2006-10-26 14:40                                         ` Dave Kleikamp
2006-10-26 11:37                                   ` Jan Kara
2006-10-27  1:32                                     ` David Chinner
2006-10-24 14:52         ` Eric Sandeen
2006-10-24 19:44         ` Theodore Tso
2006-10-24 20:31           ` Russell Cattelan
2006-10-24 23:00           ` Andreas Dilger
2006-10-25 14:54             ` Jan Kara
2006-10-25 17:02               ` Jeff Garzik
2006-10-25 17:58                 ` Jan Kara
2006-10-25 18:08                   ` Jeff Garzik
2006-10-25 18:25                     ` Jan Kara
2006-10-25 18:33                       ` Jeff Garzik
2006-10-26  9:30               ` Andreas Dilger
2006-10-25  2:09           ` David Chinner
2006-10-23 14:45   ` Jan Kara
2006-10-23 15:14   ` Andreas Dilger
2006-10-23 16:03     ` Jan Kara
2006-10-23 17:29       ` Andreas Dilger
2006-10-25 18:36         ` Jan Kara
2006-10-25 18:41           ` Jeff Garzik
2006-10-26 15:25             ` Jörn Engel
2006-10-24  4:13 ` Jeff Garzik
2006-10-24  4:21 ` Chris Wedgwood
2006-10-24 10:09   ` Jan Kara
2006-10-27  7:23 sho
2006-10-27  7:44 ` Alex Tomas
2006-10-27 13:53   ` Eric Sandeen
2006-10-27 14:05     ` Alex Tomas
2006-10-27 14:24       ` Eric Sandeen
2006-10-27 14:39         ` Alex Tomas
2006-11-15  9:54   ` Takashi Sato

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20061023141641.GA29649@thunk.org \
    --to=tytso@mit.edu \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).