linux-fsdevel.vger.kernel.org archive mirror
From: jim owens <jowens@hp.com>
To: Andreas Dilger <adilger@sun.com>
Cc: Mark Fasheh <mfasheh@suse.com>, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 1/5] vfs: vfs-level fiemap interface
Date: Thu, 05 Jun 2008 17:35:04 -0400
Message-ID: <48485C08.4040409@hp.com>
In-Reply-To: <20080605051829.GA18021@webber.adilger.int>

Andreas Dilger wrote:
> 
> So, I think we need another __u64 in the fiemap_extent which is
> fe_loglength, and rename fe_length to fe_physlength.

As some people guessed, this line in my earlier post [PATCH 0/5]:
 > experience with a non-linux filesystem that has a similar API
refers to AdvFS on Tru64.

I was asked to provide more information about Tru64's equivalent
to fiemap.  I believe the person who asked wanted to get closure
on the fiemap definition, but I'm probably going to just throw
more gasoline around and light the match with this :)

My earlier post said how I thought our code worked, and as
usual if I describe something without looking at the code,
when I really go look at it I find it does something else
and I'm saying "damn, I didn't think it was that ugly".

Well, it probably is ugly, and it is really 4 different interfaces,
but after thinking about it I realized the 4 interface designs are
KISS defensible as being optimal for their intended use.
Here is the 10-year-old "most used API" for userspace code:

#define F_GETMAP        21      /* retrieve a file's sparseness map */

struct extentmapentry {
         unsigned long offset;
         unsigned long size;
};

struct extentmap {
         unsigned long arraysize;
         unsigned long numextents;
         unsigned long offset;
         struct extentmapentry *extent;
};

fcntl(fileno, F_GETMAP, &extentmap)

Backup/dump tools call this fcntl() to retrieve the sparseness map
of an AdvFS or UFS file.  NFS and CD filesystems return an error.

Its intent is to return the LOGICAL extent map of a file, without
regard to the physical extent map of the file.  Multiple extents
will only be returned if the file in question is a sparse file.
All logically contiguous extents will be collapsed into a single
extent. FYI, "longs" are 64 bits on Tru64.

The extentmapentry.offset is byte-in-file and extentmapentry.size
is bytes-in-extent and only allocated data extents are returned
so there is no need for "extent type".  The extentmapentry is
designed to be small so that minimum memory is required when the
file is highly sparse-fragmented.
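
To make the "collapsed into a single extent" rule concrete, here is
a minimal sketch of the merge step. This is NOT the Tru64 code; the
helper name collapse_extents() and the assumption that the input is
sorted by offset and non-overlapping are mine, for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as the entry structure quoted in this post. */
struct extentmapentry {
        unsigned long offset;   /* byte in file */
        unsigned long size;     /* bytes in extent */
};

/*
 * HYPOTHETICAL helper: merge allocated ranges that are logically
 * adjacent (one ends exactly where the next begins), regardless of
 * where they live on disk.  Assumes in[] is sorted by offset and
 * non-overlapping, and that out[] has room for n entries.
 * Returns the number of entries written to out[].
 */
static size_t collapse_extents(const struct extentmapentry *in, size_t n,
                               struct extentmapentry *out)
{
        size_t m = 0;

        assert(n == 0 || (in != NULL && out != NULL));
        for (size_t i = 0; i < n; i++) {
                if (in[i].size == 0)
                        continue;       /* nothing allocated, skip */
                if (m > 0 && out[m - 1].offset + out[m - 1].size == in[i].offset)
                        out[m - 1].size += in[i].size;  /* contiguous: grow */
                else
                        out[m++] = in[i];               /* hole before it: new entry */
        }
        return m;
}
```

With four allocated ranges where the first two touch and the last two
touch, the result is two entries, so only a genuinely sparse file
produces more than one.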

extentmap.arraysize is really max_extents on input; on output it is
"how_many_more" extents are present after the "numextents" (out)
entries returned in the *extent output array.  The part (so ugly) I
forgot is that extentmap.offset is NOT a "starting byte in file", it
is a "skip over this many data extents".  That is not an intuitive API,
but then I realized it is precisely the best for a backup program.

The backup always reads the complete file from 0..filesize, it
wants to duplicate sparse as sparse (or at least not read it
from the disk), and it needs to use a reasonably sized extent array,
so it needs to walk forward in a loop, as in get 4 extents in
one call (0..3, 4..7, 8..11).  So extentmap.offset as an index
into the file's logical map makes sense and you don't need to
worry about start-at-byte-in-file not being an extent start.
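
The walk-forward loop can be sketched as below. Since F_GETMAP only
exists on Tru64, sim_getmap() is a HYPOTHETICAL in-memory stand-in
for fcntl(fd, F_GETMAP, &map) that follows the semantics as I
described them above (arraysize is capacity in and remaining count
out, offset is an extent index to skip to). walk_extents() and the
16-entry buffer are likewise just for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct extentmapentry {
        unsigned long offset;   /* byte in file */
        unsigned long size;     /* bytes in extent */
};

struct extentmap {
        unsigned long arraysize;   /* in: capacity of extent[]; out: extents remaining */
        unsigned long numextents;  /* out: entries filled in */
        unsigned long offset;      /* in: number of data extents to skip */
        struct extentmapentry *extent;
};

/* HYPOTHETICAL stand-in for the fcntl(); serves extents from file[]. */
static int sim_getmap(const struct extentmapentry *file, unsigned long total,
                      struct extentmap *map)
{
        assert(file != NULL && map != NULL && map->extent != NULL);
        unsigned long skip = map->offset < total ? map->offset : total;
        unsigned long n = total - skip;

        if (n > map->arraysize)
                n = map->arraysize;
        for (unsigned long i = 0; i < n; i++)
                map->extent[i] = file[skip + i];
        map->numextents = n;
        map->arraysize = total - skip - n;      /* "how_many_more" */
        return 0;
}

/* Backup-style walk: fetch up to 'batch' extents per call and add up
 * the data bytes a real backup would read.  Counts calls in *calls. */
static unsigned long walk_extents(const struct extentmapentry *file,
                                  unsigned long total, unsigned long batch,
                                  unsigned long *calls)
{
        struct extentmapentry buf[16];
        struct extentmap map;
        unsigned long bytes = 0, skip = 0;

        assert(batch > 0 && batch <= 16 && calls != NULL);
        *calls = 0;
        for (;;) {
                map.arraysize = batch;
                map.numextents = 0;
                map.offset = skip;      /* extent index, not a byte offset */
                map.extent = buf;
                if (sim_getmap(file, total, &map) != 0)
                        break;
                (*calls)++;
                for (unsigned long i = 0; i < map.numextents; i++)
                        bytes += buf[i].size;   /* backup would read this range */
                skip += map.numextents;
                if (map.numextents == 0 || map.arraysize == 0)
                        break;          /* nothing more to fetch */
        }
        return bytes;
}
```

Because the cursor is an extent count, the caller never has to work out
whether a byte offset lands in the middle of an extent; it just adds
numextents to the skip value and calls again.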

A program that wanted to optimize random reads to a sparse file
could do it using this API, though not as easily as if it had
the start_byte input parameter.

I'm not going to bore you with the other 3 interfaces that are
only supported in AdvFS to retrieve RAW extent maps for the
cluster and filesystem administrative tools.  These are the only
interfaces that return extent device allocation because normal
applications including backup need to do their data access
through the filesystem.  The bottom line is that information
that is filesystem-specific is only really valuable to tools that
are filesystem-specific.

=== LIGHT THE MATCH ===

- I don't want linux to implement Tru64 F_GETMAP for fiemap!

- The lesson is that a simple design covers the major use and
   other complicated needs are done somewhere else.

- I have talked to Mark and he has tools waiting to use the
   features he originally designed into fiemap... but every
   day there is a new flag or return field added "just in case".

- I know "memory is cheap", but we still seem to run out of it
   so expanding every return structure for data that may only be
   useful to a specific filesystem seems like a bad idea.

- A simplified filesystem-independent version and a separate
   complex-as-you-want filesystem-dependent API might be better,
   for example:

   * We can't even agree what "device" is.
   * What good is "encrypted" or "compressed" without "how"?

=== THROW ON MORE GASOLINE ===

Subject:    [RFC] add FIEMAP ioctl to efficiently map file allocation
From:       Andreas Dilger <adilger () clusterfs ! com>
Date:       2007-04-12 11:05:50
Message-ID: 20070412110550.GM5967 () schatzie ! adilger ! int

we additionally need to get the mapping over the network so it needs to
be efficient in terms of how data is passed, and how easily it can be
extracted from the filesystem.

jim
