From: Jamie Lokier <jamie@shareable.org>
To: Theodore Tso <tytso@mit.edu>
Cc: jim owens <jowens@hp.com>, Dave Chinner <david@fromorbit.com>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
Date: Wed, 9 Jul 2008 02:50:38 +0100
Message-ID: <20080709015038.GA10728@shareable.org>
In-Reply-To: <20080708143045.GA11389@mit.edu>

Theodore Tso wrote:
> Let's take a step back and ask ourselves what tools will want to do with
> FIEMAP in the first place, shall we?
> 
> As far as I know, it's basically only useful for bootloaders like lilo
> and to a limited extent grub (for its stage2 loader) and for debugging
> tools that are interested in knowing how fragmented a file might be.
> I can't think of any other really good uses, anyway.  Someone want to
> enlighten me?

Yes:

   1. Databases.  FIEMAP indicates where O_DIRECT will probably access.

      a. I/O strategy.  Database engines can use this as a hint to
         reduce seeks and increase the speed of large or many
         concurrent queries.  Merely emitting thousands of AIOs and
         letting the kernel elevator sort them out is not as good:
         higher-level optimisations are possible, and in any case
         AIO and the elevator have their own limitations.

      b. The hints can also guide new data allocation, or reorganisation.

   2. Filesystems in user space, e.g. NTFS-3G.  See above.

   3. Virtual machines use compact representations of large virtual
      disks.  Some of them add COW capabilities.  Both types are
      effectively filesystems-in-a-file.  See above.

   4. Programs which read data from lots of files, but don't care
      about the order, can reduce seeking if they can FIEMAP all the
      files and read the data in roughly block order (without getting
      too pedantic about it).  E.g. something which indexes the
      content of /home.  (Related: See my (little used) "treescan"
      program which is sometimes much faster than "find" for scanning
      names and stat() information, due mostly to seek optimisation.)
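
A minimal sketch of point 4, assuming the FIEMAP interface as it was
later merged into Linux (FS_IOC_FIEMAP, struct fiemap and struct
fiemap_extent from <linux/fiemap.h>).  The helper names
first_physical_offset and order_for_reading are made up for
illustration.  The physical offsets are used purely as an ordering
hint, never to touch the blocks directly:

```python
import fcntl
import struct

# FS_IOC_FIEMAP is _IOWR('f', 11, struct fiemap) in the merged interface.
FS_IOC_FIEMAP = 0xC020660B
# struct fiemap header: fm_start, fm_length, fm_flags,
# fm_mapped_extents, fm_extent_count, fm_reserved.
FIEMAP_HDR = struct.Struct("=QQLLLL")
# struct fiemap_extent: fe_logical, fe_physical, fe_length,
# two reserved u64, fe_flags, three reserved u32.
FIEMAP_EXTENT = struct.Struct("=QQQQQLLLL")


def first_physical_offset(path, max_extents=32):
    """Physical byte offset of the first mapped extent, or None.

    Hypothetical helper: None means "no hint available" (empty file,
    unreadable file, or a filesystem without FIEMAP support).
    """
    buf = bytearray(FIEMAP_HDR.size + max_extents * FIEMAP_EXTENT.size)
    # Map the whole file: start 0, length ~0ULL, no flags.
    FIEMAP_HDR.pack_into(buf, 0, 0, 0xFFFFFFFFFFFFFFFF, 0, 0, max_extents, 0)
    try:
        with open(path, "rb") as f:
            fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, buf)
    except OSError:
        return None
    mapped = FIEMAP_HDR.unpack_from(buf, 0)[3]
    if mapped == 0:
        return None
    return FIEMAP_EXTENT.unpack_from(buf, FIEMAP_HDR.size)[1]


def order_for_reading(paths, locate=first_physical_offset):
    """Sort paths by approximate on-disk position; unknown ones go last."""
    located = [(locate(p), p) for p in paths]
    located.sort(key=lambda t: (t[0] is None, t[0] or 0, t[1]))
    return [p for _, p in located]
```

An indexer would then open and read the files in the order returned by
order_for_reading(paths).  A stale or approximate offset only costs an
extra seek, which is why exact values are not needed for this use.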

In all these uses, I notice that the _exact_ values are _not_ required.
It is enough that they are usually accurate enough to use as I/O
hints.

It would make sense, I think, to merge this with the other work being
done on I/O hints, for RAIDs and other media with sub-structure.

> However, how many filesystems beyond reiserfs3 actually will move a
> file around on disk once it has been mapped to specific disk blocks
> and written to disk?  Does XFS do this?  I didn't think so.  If it
> does, then for bootloaders like LILO it will also need a flag that
> prevents a block from being moved around.

Isn't "chattr +t" effectively a suitable generic flag for that, even
though it doesn't exactly say so in the manual?

Btw, I imagine quite a few future filesystems will move data around on
disk once it is mapped.  Probably not the majority.

> There are however plenty of filesystems (XFS, ext4, etc.) that play
> the delayed allocation game, where the FIEMAP information returned
> could change from "location not yet determined on disk" to "here's
> where we decided to put it on disk".  And I assume that's what the
> SYNC flag does, right?  So it's really just syntactic sugar for doing
> fsync; get fiemap; check to see if an unmapped extent was still
> returned (due to a race condition; if so, go back and repeat the fsync
> and then retry the fiemap loop).

I think you said two different things there.  "Here's where we decided
to put it" is not the same as "we _have_ put it here".  So sync is
stronger than removing delalloc extents.  (There's also a middle
strength where data is all committed, but not necessarily atomically
with getting all the extents at once).
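
For illustration, the fsync-then-retry loop described in the quoted
text could look roughly like the sketch below.  It assumes the
interface as later merged (FS_IOC_FIEMAP, FIEMAP_EXTENT_DELALLOC); the
function names read_extents and stable_extents and the retry limit are
invented for the example.  Note that it only gives the "middle
strength" above: the data is committed by fsync, but not atomically
with fetching the extents.

```python
import fcntl
import os
import struct

FS_IOC_FIEMAP = 0xC020660B          # _IOWR('f', 11, struct fiemap)
FIEMAP_EXTENT_DELALLOC = 0x4        # location not yet allocated
FIEMAP_HDR = struct.Struct("=QQLLLL")
FIEMAP_EXTENT = struct.Struct("=QQQQQLLLL")


def read_extents(fd, max_extents=64):
    """Return up to max_extents (logical, physical, length, flags) tuples."""
    buf = bytearray(FIEMAP_HDR.size + max_extents * FIEMAP_EXTENT.size)
    FIEMAP_HDR.pack_into(buf, 0, 0, 0xFFFFFFFFFFFFFFFF, 0, 0, max_extents, 0)
    fcntl.ioctl(fd, FS_IOC_FIEMAP, buf)
    mapped = FIEMAP_HDR.unpack_from(buf, 0)[3]
    extents = []
    for i in range(mapped):
        f = FIEMAP_EXTENT.unpack_from(buf, FIEMAP_HDR.size + i * FIEMAP_EXTENT.size)
        extents.append((f[0], f[1], f[2], f[5]))
    return extents


def stable_extents(fd, fetch=read_extents, sync=os.fsync, max_tries=5):
    """fsync, fetch the map, and retry while any extent is still delalloc.

    A concurrent writer can dirty the file between the fsync and the
    FIEMAP call, so one pass is not guaranteed to be enough.
    """
    for _ in range(max_tries):
        sync(fd)
        extents = fetch(fd)
        if not any(flags & FIEMAP_EXTENT_DELALLOC for *_, flags in extents):
            return extents
    raise RuntimeError("file kept changing; no fully allocated map obtained")
```

In the merged interface, setting FIEMAP_FLAG_SYNC in fm_flags asks the
kernel to sync the file before mapping it, which would replace the
explicit fsync here but not the retry.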

I'm not sure which semantics the XFS utilities need.  If they don't
access the raw blocks directly, they don't really need sync, they just
need "here's where we decided to put it".  If they do access raw
blocks directly, they need that xfs_freeze stuff too, at which point
it's using XFS ioctls anyway, so it begs the question of whether it
should be using FIEMAP at all.

> So I think perhaps the talking-at-cross-purposes is that Jim is
> thinking about how to support filesystems that will in fact relocate
> file data on disk (for example, as part of an online shrink or when
> moving a file from one volume to another in a filesystem like advfs or
> btrfs), and other folks have been assuming a simpler world where data
> is either mapped to a location on disk or still in a delayed
> allocation state.

There was a flag FIEMAP_EXTENT_NO_DIRECT which should presumably be
set on filesystems where data is not mapped at stable (or even single)
blocks.

That's why I suggested requiring that _not_ setting
FIEMAP_EXTENT_NO_DIRECT (really, define its complement!) should mean
"the data is at this physical location _only while no process modifies
the file_".  Filesystems with stable data locations, and some which
move the file only when it's modified, could unset the flag.  Other
filesystems (maybe including BTRFS) would always set it.  But that
suggestion was not really understood at the time.
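
As a rough illustration of those semantics from a consumer's side: the
sketch below filters an extent list down to the extents a tool may
read from the raw device.  The bit value given to NO_DIRECT is a
placeholder chosen for the example, not the value from the patch
series, and directly_addressable is a made-up name.

```python
from enum import IntFlag


class ProposedExtentFlags(IntFlag):
    # Placeholder bit value, for illustration only; it is NOT taken from
    # the FIEMAP patches.
    NO_DIRECT = 0x1


def directly_addressable(extents):
    """Keep the (logical, physical, length, flags) tuples without NO_DIRECT.

    Under the suggested rule, the physical offsets of the returned
    extents are trustworthy only while no process modifies the file.
    """
    return [e for e in extents if not e[3] & ProposedExtentFlags.NO_DIRECT]
```

A filesystem that may relocate data at any time would set the flag on
every extent, so such a tool would get an empty list and fall back to
ordinary reads.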

Otherwise, if you think that no useful program will access the blocks
directly, then why do we have !FIEMAP_EXTENT_NO_DIRECT at all?  And
what does it mean?

-- Jamie
