From: Jamie Lokier <jamie@shareable.org>
To: Andreas Dilger <adilger@sun.com>
Cc: jim owens <jowens@hp.com>,
linux-fsdevel@vger.kernel.org, mfasheh@suse.com
Subject: Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
Date: Fri, 4 Jul 2008 12:28:20 +0100
Message-ID: <20080704112820.GA29484@shareable.org>
In-Reply-To: <20080704084920.GP6239@webber.adilger.int>
Andreas Dilger wrote:
> On Jul 03, 2008 16:17 +0100, Jamie Lokier wrote:
> > jim owens wrote:
> > > FIEMAP_EXTENT_NO_BYPASS
> > >
> > > As in "you can't bypass the filesystem" to directly access it.
> >
> > Can we also commit to this, when FIEMAP_EXTENT_NO_BYPASS is *not* set:
> >
> > 1. The data is at fe_physical, and *will not move* so long as nothing
> > modifies *that particular file*?
> >
> > 2. Both reading *and writing* the file bypassing the filesystem are ok.
>
> I don't think any such guarantee can be made. What if the file is
> truncated and rewritten after the FIEMAP is called?
That is prohibited by "so long as nothing modifies that particular file".
That's the entire point of 1! :-)
> The filesystem can't guarantee that will not happen.
The filesystem's guarantee has to be _conditional_ on nothing _else_
modifying the file. That includes writing, truncating, and extending.
It's not the filesystem's job to prevent those things.
What I'm saying is that some filesystems will move data blocks _even
when no process touches the file containing those blocks_. E.g. some
filesystems do garbage collection in the background - even when
nothing touches any file. Some filesystems clone data blocks for COW.
There are many other imaginable reasons.
Clearly, any program that "gets away with it" by using FIEMAP to get a
block map and then accessing the disk directly is less reliable on
those filesystems. It would be good to reflect that somehow.
The obvious way, to my mind, is for filesystems which don't keep data
positions stable while a file is not being modified to set the flag
which says "this extent should not be accessed directly"
(whatever it is called :-).
> I think the only way to make sure of constant mapping is to call
> FIEMAP before and after the blocks are read.
No, that is clearly unsafe. They can change twice, ending up back at
the same positions, but different in between. That's even likely,
with some modern filesystem techniques.
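To make the double-change case concrete, here is a toy sketch in C. It
models the physical address of one extent as a hypothetical timeline of
samples; the array, the function names and the addresses are all made up
for illustration and are not part of any FIEMAP interface. Comparing only
the first and last samples reports "unchanged" even though the data lived
somewhere else in between:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: physical[i] is where the filesystem keeps one
 * extent at time step i. Nothing here talks to a real filesystem. */

/* The "FIEMAP before and after" check: compare only the two endpoints. */
int endpoints_match(const uint64_t *physical, size_t before, size_t after)
{
    return physical[before] == physical[after];
}

/* What a direct reader actually needs: no move at any step in between. */
int stable_throughout(const uint64_t *physical, size_t before, size_t after)
{
    for (size_t i = before; i < after; i++) {
        if (physical[i] != physical[i + 1])
            return 0;
    }
    return 1;
}
```

With a timeline such as {4096, 8192, 4096} (moved away, then moved back),
endpoints_match() returns 1 while stable_throughout() returns 0, so a read
issued at the middle step would have hit the wrong blocks.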
> > The reason for 2 is that some filesystems checksum the data and/or
> > replicate it, and the data won't be readable if you write to it directly.
>
> EEEEEK. The _intent_ of FIEMAP is mostly for reporting fragmentation,
> and possibly to allow a "generic" defragmenter to be written. At an
> outside stretch I could imagine some tools like "dump" wanting direct
> read access to the file data.
Potentially useful other cases are providing good information to
assist access patterns and block allocation for things like databases,
filesystems-in-a-file, and virtual-disks-in-a-non-flat-file. Those
are all variations on reporting fragmentation, and don't require the
information to be absolutely stable or correct.
> Directly writing underneath a filesystem is major bad news and will
> likely corrupt the filesystem because you can never be sure that there
> aren't dirty pages in the page cache that will overwrite your "direct"
> write, or that your write isn't racy with an unlink or truncate.
You're right. It's a fair point and should be clarified, because I
hadn't thought of it ;-)
Btw, you can be sure there aren't dirty pages, if you have done
fsync() or sync_file_range() at some time in the past, and you are
_sure_ no other process is accessing the file. (Otoh, I'm not sure if
some funky COW implementations would complicate that.)
However, that still leaves a gaping lack of coherency in that the
filesystem may have clean cached pages not matching what is written to
disk. So, you're absolutely right: NO WRITING.
You must do fsync() anyway, and ensure nobody is modifying the file,
if you're going to read correct data from FIEMAP blocks.
Ok, then I'll remove point 2 and add these:
- FIEMAP extents are _not_ safe for writing data directly!
Page cache coherency affects all filesystems. Checksums and
replication are also involved with some filesystems. All
writing should go through the filesystem itself.
- If reading data directly, do fsync() before FIEMAP, and be
absolutely sure no process modifies the file between
fsync+FIEMAP and reading the blocks, and that the
FIEMAP_EXTENT_NO_DIRECT flag is not set. It is the
application's responsibility to ensure no other process modifies
the file.
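The read-side rule above could be sketched roughly as follows. The flag
names in this thread (FIEMAP_EXTENT_NO_BYPASS, FIEMAP_EXTENT_NO_DIRECT)
were still proposals, so this sketch assumes the <linux/fiemap.h> header
that later shipped and approximates "should not be accessed directly" with
a mask of that header's flags. The mask and both function names are
choices made for this example, not an agreed interface:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>      /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>  /* struct fiemap, struct fiemap_extent */

/* Assumption of this sketch: any of these flags means the bytes at
 * fe_physical are not plain, block-aligned file data. */
#define UNSAFE_FOR_DIRECT_READ \
    (FIEMAP_EXTENT_UNKNOWN | FIEMAP_EXTENT_DELALLOC | \
     FIEMAP_EXTENT_ENCODED | FIEMAP_EXTENT_DATA_ENCRYPTED | \
     FIEMAP_EXTENT_NOT_ALIGNED | FIEMAP_EXTENT_DATA_INLINE | \
     FIEMAP_EXTENT_DATA_TAIL | FIEMAP_EXTENT_UNWRITTEN)

/* Returns 1 if no "do not read directly" flag is set on the extent. */
int extent_ok_for_direct_read(uint32_t fe_flags)
{
    return (fe_flags & UNSAFE_FOR_DIRECT_READ) == 0;
}

/* fsync, then FIEMAP, then reject the whole map if any extent is unsafe.
 * Copies at most max extents into out (a file with more extents is only
 * partially mapped). Returns the extent count, or -1 with errno set.
 * The caller must still guarantee nobody modifies the file afterwards. */
int map_for_direct_read(int fd, struct fiemap_extent *out, unsigned max)
{
    assert(out != NULL && max > 0);

    if (fsync(fd) != 0)                      /* flush dirty pages first */
        return -1;

    size_t size = sizeof(struct fiemap) +
                  (size_t)max * sizeof(struct fiemap_extent);
    struct fiemap *fm = calloc(1, size);
    if (fm == NULL)
        return -1;

    fm->fm_start = 0;
    fm->fm_length = FIEMAP_MAX_OFFSET;       /* map the whole file */
    fm->fm_extent_count = max;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) != 0) {
        int saved = errno;
        free(fm);
        errno = saved;
        return -1;
    }

    unsigned n = fm->fm_mapped_extents;
    for (unsigned i = 0; i < n; i++) {
        if (!extent_ok_for_direct_read(fm->fm_extents[i].fe_flags)) {
            free(fm);
            errno = ENOTSUP;                 /* go through the filesystem */
            return -1;
        }
    }

    memcpy(out, fm->fm_extents, (size_t)n * sizeof(*out));
    free(fm);
    return (int)n;
}
```

A caller would open the file, call map_for_direct_read(fd, extents, 32),
and only then read fe_length bytes at fe_physical from the block device.
Nothing in the sketch writes to the device, and it cannot detect a
filesystem that moves blocks in the background without setting a flag.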
-- Jamie