From: David Chinner <dgc@sgi.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org
Subject: Re: [RFC] Defragmentation interface
Date: Fri, 3 Nov 2006 09:59:53 +1100
Message-ID: <20061102225953.GF8394166@melbourne.sgi.com>
In-Reply-To: <20061102143929.GA8607@atrey.karlin.mff.cuni.cz>
On Thu, Nov 02, 2006 at 03:39:29PM +0100, Jan Kara wrote:
> Hi,
>
> from the thread following my patch implementing ext3 online
> defragmentation, I found out that probably the only (and definitely the
> biggest) issue is the interface. Some want it to be common enough that
> we can profit from common tools across several filesystems; others object
> that some applications, e.g. a defragmenter, need to know something about
> ext3 internals to work reasonably well. Moreover, ioctl() is ugly and has
> some compatibility issues; on the other hand, ext2meta is too low-level
> and fs-specific, and it would be hard to make any reasonable application
> crash-safe...
> So in this email I try to propose an interface which should hopefully
> address most of the concerns. The type of the interface is sysfs-like
> (idea taken from ext2meta) - that has a few advantages:
> - no 32/64-bit compatibility issues
> - easily extensible
> - generally nice ;)
- complex
- over-engineered
- little common code between filesystems
BTW, does use of sysfs mean ASCII encoding of all the data
passing between kernel and userspace?
> Each filesystem willing to support this interface implements special
> filesystem (e.g. ext3meta, XFSmeta, ...) and admin/defrag-tool mounts it
> to some directory.
- not useful for wider audiences like applications that would like
to direct allocation
> There are parts of this interface which should be
> common for all filesystems (so that tools don't have to care about
> particular filesystem and still get some useful results), other parts
> are fs-specific. Here is basic structure I propose:
>
> meta/features
> - bitmap of features supported by the interface (ext2/3-like) so that
> the tool can verify whether it understands the interface and doesn't
> mess with it otherwise
- will grow very large, very quickly if it has to support all the
different quirks of different filesystems.
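To make that concrete, a tool-side check against such a bitmap might
look like the sketch below. The meta/features path, the hex encoding
and the FEAT_* bits are all hypothetical - the proposal defines none
of them:

	#include <stdio.h>

	#define FEAT_FREE_BLOCKS	0x01	/* hypothetical bit */
	#define FEAT_DATA_RELOC		0x02	/* hypothetical bit */
	#define FEATS_UNDERSTOOD	(FEAT_FREE_BLOCKS | FEAT_DATA_RELOC)

	/* return 0 only if we understand every feature bit advertised */
	int check_features(const char *metadir)
	{
		char path[4096];
		unsigned long feats;
		FILE *f;

		snprintf(path, sizeof(path), "%s/features", metadir);
		f = fopen(path, "r");
		if (!f)
			return -1;
		if (fscanf(f, "%lx", &feats) != 1) {
			fclose(f);
			return -1;
		}
		fclose(f);
		/* unknown bits == don't mess with it */
		return (feats & ~FEATS_UNDERSTOOD) ? -1 : 0;
	}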
> meta/allocation/free_blocks
> - RO file - if you read from fpos F, you'll get a list of extents
> describing areas with free blocks (as many as fits into supplied
> buffer) starting from block F. Fpos of your file descriptor is
> shifted to the first unreported free block.
- linear search properties == Bad. (think fs sizes of hundreds of
terabytes - XFS is already deployed with filesystems of this size)
- cannot use smart requests like give me free blocks near X,
in AG Y or Z, etc.
- some filesystems have more than one data area - e.g. XFS has the
realtime volume.
- every time you fail an allocation, you need to reread this file.
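For concreteness, here is roughly what a consumer of that file would
have to do, assuming a binary encoding and a hypothetical extent
record (the proposal pins down neither):

	#include <unistd.h>
	#include <sys/types.h>

	struct free_extent {			/* hypothetical layout */
		unsigned long long start;	/* first free block */
		unsigned long long len;		/* run length in blocks */
	};

	/* fetch free extents starting at block F; fpos is interpreted
	 * as a block number by the proposed interface */
	ssize_t get_free_extents(int fd, off_t F,
				 struct free_extent *buf, size_t nr)
	{
		ssize_t ret;

		if (lseek(fd, F, SEEK_SET) < 0)
			return -1;
		ret = read(fd, buf, nr * sizeof(*buf));
		if (ret < 0)
			return -1;
		/* you get whatever comes next in linear order; there
		 * is no way to ask for blocks near X or in AG Y */
		return ret / sizeof(*buf);
	}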
> meta/super/blocksize
> - filesystem block size
ioctl(FIGETBSZ) (usage sketched below).
Also:
- some filesystems can use different block sizes for different
structures (e.g. XFS directory blocks can be larger than the fsb)
- stripe unit and stripe width need to be exposed so the defrag tool
can make correct placement decisions.
- extent size hints, etc.
Hence this will require the super/ directory to be extensible
via a filesystem-specific interface.
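For reference, the existing call is a one-int-out ioctl:

	#include <sys/ioctl.h>
	#include <linux/fs.h>		/* FIGETBSZ */

	int fs_blocksize(int fd)
	{
		int bsz;

		if (ioctl(fd, FIGETBSZ, &bsz) < 0)
			return -1;
		return bsz;
	}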
> meta/super/id
> - filesystem ID (for paranoid tools to verify that they are accessing
> really the right meta-filesystem)
- UUID, please.
> meta/nodes/<ident>
> - this should be a directory containing things specific for a fs-object
> with identification <ident>. In case of ext3 these would be inode
> numbers, I guess this should be plausible also for XFS and others
> but I'm open to suggestions...
> - directory contains the following:
> alloc_goal
> - block number with current allocation goal
The kernel has to store this across syscalls until you write into
data/alloc? That sounds dangerous...
> data/extents
> - if you read from this file, you get a list of extents describing
> data blocks (and holes) of the file. The listing starts at logical
> block fpos. Fpos is shifted to the first unreported data block.
ioctl(FIBMAP)
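FIBMAP maps one logical block per call (an int in/out, and it needs
CAP_SYS_RAWIO), but it exists today:

	#include <sys/ioctl.h>
	#include <linux/fs.h>		/* FIBMAP */

	/* map logical block 'lblk' of an open file to a physical
	 * block; returns 0 for a hole */
	long fs_map_block(int fd, int lblk)
	{
		int blk = lblk;		/* in: logical, out: physical */

		if (ioctl(fd, FIBMAP, &blk) < 0)
			return -1;
		return blk;
	}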
> data/alloc
> - you write there a number L and fs allocates L blocks to a file
> (preferably from alloc_goal) starting from file-block fpos. Fpos
> is shifted after the last block allocated in this call.
You seek to the position you want (in blocks or bytes?), then write
a number into the file (in blocks or bytes)? That's messy compared
to a function call with an offset and length in it....
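Spelled out, the proposed dance would be something like this
(assuming the count is written as ASCII decimal and fpos is in
blocks - the proposal specifies neither):

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	/* ask for 'nblocks' blocks at file-block 'fblock' via the
	 * proposed data/alloc file: the offset travels via seek,
	 * the length via write */
	int alloc_blocks(int alloc_fd, off_t fblock, unsigned long nblocks)
	{
		char buf[32];

		if (lseek(alloc_fd, fblock, SEEK_SET) < 0)
			return -1;
		snprintf(buf, sizeof(buf), "%lu", nblocks);
		if (write(alloc_fd, buf, strlen(buf)) != (ssize_t)strlen(buf))
			return -1;
		return 0;
	}

Two channels for what is really one (offset, length) pair.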
> data/reloc
> - you write there <ident> and relocation of data happens as follows:
> All blocks that are allocated both in original file and <ident>
> are relocated to <ident>. Write returns number of relocated
> blocks.
You can only relocate to a new inode (which in XFS will change
the inode number)? What happens if there are blocks in duplicate
offsets in both inodes? What happens if all the blocks aren't
relocated - how do you handle this?
Let me get this straight - the interface you propose for
moving data about is:
read and process extents into an internal structure
find range where you want to relocate
find free space you want to relocate into
write desired block to alloc_goal
seek to allocation offset in data/alloc
write length into data/alloc
allocate new inode
write new inode number into data/reloc to relocate blocks
What I proposed:
	ioctl(src, FIBMAP, &blk);	/* map extents, one block per call */
	/* find range to relocate */
	tmp = open(tmpname, O_CREAT | O_RDWR, 0600);
	unlink(tmpname);		/* anonymous temp file, gone on close */
	fs_get_free_list(src, policy, list);		/* proposed syscall */
	/* select free extent to use */
	fs_allocate_space(tmp, list[X], off, len);	/* proposed syscall */
	fs_move_data(src, tmp, off, len);		/* proposed syscall */
	close(tmp);
	close(src);
So the process is pretty close to the same except the interface I
proposed does not change the location of the inode holding the data.
The major difference is that one implementation requires 3 new
generically useful syscalls, and the other requires every filesystem
to implement a metadata filesystem and requires root privileges
to use.
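The thread doesn't nail down prototypes for those three calls;
one plausible shape, purely as a sketch:

	#include <sys/types.h>

	struct fs_alloc_policy;		/* near block X, in AG Y, ... */

	struct fs_extent {			/* hypothetical */
		unsigned long long start;	/* block number */
		unsigned long long len;		/* block count */
	};

	/* fill 'list' with up to 'nr' free extents on fd's filesystem
	 * matching 'policy' */
	int fs_get_free_list(int fd, struct fs_alloc_policy *policy,
			     struct fs_extent *list, int nr);

	/* allocate the file range [off, off+len) of 'fd' from the
	 * given free extent */
	int fs_allocate_space(int fd, struct fs_extent *ext,
			      off_t off, size_t len);

	/* move/exchange the data blocks backing [off, off+len)
	 * from 'src' to 'dst' */
	int fs_move_data(int src, int dst, off_t off, size_t len);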
> metadata/
> - this directory is fs-specific, contains fs block pointers and
> similar. Here I describe what I'd like to have for ext3.
Nothing really useful for XFS here unless we start talking
about btree defragmentation and attribute fork optimisation,
etc. We really don't need a sysfs interface for this, just
an additional fs_move_metadata() type of call....
hmmm - how do you support objects in the filesystem not attached
to inodes (e.g. the freespace and inode btrees in XFS)? What sort
of interface would they use?
> This is all that is needed for my purposes. Any comments welcome.
Then your purpose is explicitly data defragmentation? If that is
the case, I still fail to see any need for a new metadata fs for
every filesystem to support this.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group