From: Jan Kara <jack@suse.cz>
To: Theodore Tso <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org
Subject: Re: [RFC] Ext3 online defrag
Date: Mon, 23 Oct 2006 16:45:15 +0200 [thread overview]
Message-ID: <20061023144515.GC19623@atrey.karlin.mff.cuni.cz> (raw)
In-Reply-To: <20061023141641.GA29649@thunk.org>
Hello,
> > I've written a simple patch implementing ext3 ioctl for file
> > relocation. Basically you call ioctl on a file, give it list of blocks
> > and it relocates the file into given blocks (provided they are still
> > free). The idea is to use it as a kernel part of ext3 online
> > defragmenter (or generally disk access optimizer). Now I don't have the
> > user space part that finds larger runs of free blocks and so on so that
> > it can really be used as a defragmenter. I just send this as a kind of
> > proof-of-concept to hear some comments. Attached is also a simple
> > program that demonstrates the use of the ioctl.
>
> As a suggestion, I would pass the inode number and inode generation
> number into the ext3_file_mode_data array:
>
> struct ext3_file_move_data {
> int extents;
> struct ext3_reloc_extent __user *ext_array;
> };
>
> This will be much more efficient for the userspace relocator, since it
> won't need to translate from an inode number to a pathname, and then
> try to open the file before relocating it.
Hmm, I was also thinking about it. Probably you're right. It just
seemed elegant to call ioctl on a file and *plop* it's relocated ;).
> I'd also use an explicit 64-bit block numbers type so that we don't
> have to worry about the ABI changing when we support 64-bit block
> numbers.
Right, will fix.
> The other problem I see with this patch is that there will be cache
> coherency problems between the buffer cache and the page cache. I
> think you will want to pull the data blocks of the file into the page
> cache, and then write them out from the page cache, and only *then*
> update the indirect blocks and commit the transaction.
Hmm, I thought I got this right. We build a new tree, copy all data to
it (no writes happen so trees remain consistent), we switch block
pointers from inode. So from now on, any get_block() will correctly
return new block number and block will be read from disk (hmm, probably
I'm missing sync after writing out all the data). Now we call
invalidate_inode_pages2() so all buffers mapped to old blocks are freed
from memory. So there should not be problems with this... OTOH doing the
data copy via page-cache (of the temporarily set-up inode) should not be
a big problem either and we can avoid one sync which should be a win.
> So what needs to happen is the following:
>
> 1) Validate that inode and generation number. Make sure the new
> (destination) blocks passed in are valid and not in use. Allocate
> them to prevent anyone else from using those blocks.
>
> 2) Pull the blocks into the page cache (if they are not already
> there), and the write them out to the new location on disk. If any of
> the I/O's fail, abort.
>
> 3) Update the indirect blocks or extent tree to point at the newly
> allocated and copied data blocks.
>
> In the current patch, it looks like you add the inode being relocated
> to the orphan list, and then update the direct/indirect blocks first
No, I create temporary inode that holds allocated blocks and that is
added to the orphan list. Hence if we crash in the middle of relocation,
all blocks are correctly freed.
> --- and if you fail the inode gets truncated. That's bad since we
> don't want to lose any data if we crash in the middle of the defrag
> operation....
>
> Great to see that you're working on this problem! I'd love to see
> this functionality into ext4.
Thanks for comments.
> P.S. There is also the question of whether we'll be able to get this
> interface past the ioctl() police, but the atomicity requirements of
> such an interface are a poster child for why we really, REALLY, can't
> do this via a sysfs interface. We might be forced to create a new
> filesystem, or create a pseudo inode which we open via a magic
> pathname, though. That in my opinion is uglier than an ioctl, but the
> ioctl police really don't like the problem of needing to maintain
> 32/64 bit translation functions, and this interface would surely cause
> problems for the x86_64 and PPC platforms, since they have to support
> 32-bit and 64-bit system ABI's.
Umm, yes. I'm open to suggestions with respect to which interface to
choose. ioctl() was just the easiest to code ;).
Bye
Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
next prev parent reply other threads:[~2006-10-23 14:45 UTC|newest]
Thread overview: 56+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20061023122710.GA12034@atrey.karlin.mff.cuni.cz>
2006-10-23 14:16 ` [RFC] Ext3 online defrag Theodore Tso
2006-10-23 14:31 ` Alex Tomas
2006-10-23 14:48 ` Andreas Dilger
2006-10-23 14:55 ` Jan Kara
2006-10-23 14:51 ` Jan Kara
2006-10-23 15:01 ` Eric Sandeen
2006-10-24 4:14 ` Jeff Garzik
2006-10-24 13:59 ` David Chinner
2006-10-24 14:51 ` Dave Kleikamp
2006-10-24 16:01 ` David Chinner
2006-10-24 16:26 ` Dave Kleikamp
2006-10-25 1:18 ` David Chinner
2006-10-25 2:30 ` Barry Naujok
2006-10-25 2:42 ` Jeff Garzik
2006-10-25 4:27 ` David Chinner
2006-10-25 4:48 ` Jeff Garzik
2006-10-25 5:38 ` David Chinner
2006-10-25 6:01 ` Jeff Garzik
2006-10-25 8:11 ` David Chinner
2006-10-25 17:00 ` Jeff Garzik
2006-10-26 1:40 ` David Chinner
2006-10-26 3:33 ` Theodore Tso
2006-10-26 6:36 ` David Chinner
2006-10-26 13:37 ` Theodore Tso
2006-10-26 14:40 ` Dave Kleikamp
2006-10-26 11:37 ` Jan Kara
2006-10-27 1:32 ` David Chinner
2006-10-24 14:52 ` Eric Sandeen
2006-10-24 19:44 ` Theodore Tso
2006-10-24 20:31 ` Russell Cattelan
2006-10-24 23:00 ` Andreas Dilger
2006-10-25 14:54 ` Jan Kara
2006-10-25 17:02 ` Jeff Garzik
2006-10-25 17:58 ` Jan Kara
2006-10-25 18:08 ` Jeff Garzik
2006-10-25 18:25 ` Jan Kara
2006-10-25 18:33 ` Jeff Garzik
2006-10-26 9:30 ` Andreas Dilger
2006-10-25 2:09 ` David Chinner
2006-10-23 14:45 ` Jan Kara [this message]
2006-10-23 15:14 ` Andreas Dilger
2006-10-23 16:03 ` Jan Kara
2006-10-23 17:29 ` Andreas Dilger
2006-10-25 18:36 ` Jan Kara
2006-10-25 18:41 ` Jeff Garzik
2006-10-26 15:25 ` Jörn Engel
2006-10-24 4:13 ` Jeff Garzik
2006-10-24 4:21 ` Chris Wedgwood
2006-10-24 10:09 ` Jan Kara
2006-10-27 7:23 sho
2006-10-27 7:44 ` Alex Tomas
2006-10-27 13:53 ` Eric Sandeen
2006-10-27 14:05 ` Alex Tomas
2006-10-27 14:24 ` Eric Sandeen
2006-10-27 14:39 ` Alex Tomas
2006-11-15 9:54 ` Takashi Sato
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20061023144515.GC19623@atrey.karlin.mff.cuni.cz \
--to=jack@suse.cz \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).