[PATCH 0/4] Fiemap, an extent mapping ioctl


* [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
@ 2008-06-25 22:18 Mark Fasheh
  2008-06-26  3:03 ` Andreas Dilger
                   ` (4 more replies)
  0 siblings, 5 replies; 70+ messages in thread
From: Mark Fasheh @ 2008-06-25 22:18 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Andreas Dilger, Kalpak Shah, Eric Sandeen, Josef Bacik

Hello,

	The following patches are the latest attempt at implementing a
fiemap ioctl, which can be used by userspace software to get extent
information for an inode in an efficient manner.

	These patches are against 2.6.26-rc3, though they probably apply
fine against Linus' latest tree. The fs patches are much more complete this
time around, and the vfs patch has been trimmed down.

	An updated version of my ioctl wrapper test program is available at:

   http://www.kernel.org/pub/linux/kernel/people/mfasheh/fiemap/tests/

	A couple of notes regarding the VFS patch:

	Firstly, most behavior-changing fm_flags have been removed. We're
left with SYNC and XATTR now. This is a very good thing because, frankly, I
think fiemap should be targeted as a straightforward and relatively
uncomplicated API for exposing extents as they appear on disk. Think "one
notch above extent-based FIBMAP replacement". There's a flip side to this -
'complicated' file systems should be free to implement their own
complementary ioctls where there is a unique need that FIEMAP does not
address. Things like non-trivial device mappings and encryption specifics
(beyond 'this extent is encrypted') don't belong here.

	The passing back of unknown flags still exists - it was part of the
original patches I took over. I'm not quite sure I'm happy about this
feature. On one hand, it's nice to cleanly manage minor API revisions, but
on the other hand I wonder if we're actually going to use this
functionality. The good news is that any app which simply doesn't care about
micro-managing the set of passed flags can ignore the whole thing and
just key on the error code.

Full changelog from my last posting:

* Removed FIEMAP_FLAG_HSM_READ and FIEMAP_FLAG_LUN_ORDER.

* FIEMAP_FLAG_NUM_EXTENTS also no longer exists. Instead, users simply pass
  an fm_extent_count of zero. This now functions much like getxattr.

* Updated doc describing the interface changes.

* Added fiemap_check_flags() which checks against the current set of
  *understood* flags. File systems have been updated to use this.

* Added a flag for use with block-based file systems, FIEMAP_EXTENT_MERGED.

* Whether FIEMAP_EXTENT_SECONDARY sets FIEMAP_EXTENT_UNKNOWN is now up to
  the file system.

* Added myself to authors line in fiemap.h.

* Ext4 and generic block based implementations are both updated. Thanks to
  Eric and Josef for doing this.

Below this I will include the contents of fiemap.txt to make it convenient
for folks to get details on the API.

In the meantime, let the circus begin...
	--Mark

============
Fiemap Ioctl
============

The fiemap ioctl is an efficient method for userspace to get file
extent mappings. Instead of block-by-block mapping (such as bmap), fiemap
returns a list of extents.

Request Basics
--------------

A fiemap request is encoded within struct fiemap:

struct fiemap {
	__u64	fm_start;	 /* logical offset (inclusive) at
				  * which to start mapping (in) */
	__u64	fm_length;	 /* logical length of mapping which
				  * userspace cares about (in) */
	__u32	fm_flags;	 /* FIEMAP_FLAG_* flags for request (in/out) */
	__u32	fm_mapped_extents; /* number of extents that were
				    * mapped (out) */
	__u32	fm_extent_count; /* size of fm_extents array (in) */
	__u32	fm_reserved;
	struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */
};

fm_start and fm_length specify the logical range within the file
which the process would like mappings for. Extents returned mirror
those on disk - that is, the logical offset of the 1st returned extent
may start before fm_start, and the range covered by the last returned
extent may end after fm_start + fm_length. All offsets and lengths are
in bytes.

Certain flags to modify the way in which mappings are looked up can be
set in fm_flags. If the kernel doesn't understand some particular
flags, it will return EBADR and the contents of fm_flags will contain
the set of flags which caused the error. If the kernel is compatible
with all flags passed, the contents of fm_flags will be unmodified.
It is up to userspace to determine whether rejection of a particular
flag is fatal to its operation. This scheme is intended to allow the
fiemap interface to grow in the future but without losing
compatibility with old software.
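
As a rough sketch of how a caller might act on a rejected request (this is
not part of the patches; the flag values and the helper names
fiemap_rejection_is_fatal() and fiemap_retry_flags() are made up for
illustration), userspace can compare the flags handed back in fm_flags
against the ones it cannot live without:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values only (assumption); take the real ones from fiemap.h. */
#define FIEMAP_FLAG_SYNC	0x00000001u
#define FIEMAP_FLAG_XATTR	0x00000002u

/*
 * Hypothetical helper: after the ioctl failed with EBADR, 'rejected' is
 * the value the kernel left in fm_flags and 'required' is the subset of
 * the requested flags the caller cannot do without.  Returns 1 if the
 * rejection is fatal, 0 if the caller may retry without those flags.
 */
int fiemap_rejection_is_fatal(uint32_t rejected, uint32_t required)
{
	return (rejected & required) != 0;
}

/* Flags to use on a retry: everything requested minus what was rejected. */
uint32_t fiemap_retry_flags(uint32_t requested, uint32_t rejected)
{
	/* Only flags that were actually passed in can be handed back. */
	assert((rejected & ~requested) == 0);
	return requested & ~rejected;
}
```

A tool that wants xattr mappings would treat a rejected FIEMAP_FLAG_XATTR as
fatal, while a rejected FIEMAP_FLAG_SYNC could arguably be dropped and the
request retried.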

fm_extent_count specifies the number of elements in the fm_extents[] array
that can be used to return extents.  If fm_extent_count is zero, then the
fm_extents[] array is ignored (no extents will be returned), and the
fm_mapped_extents count will hold the number of extents needed in
fm_extents[] to hold the file's current mapping.  Note that there is
nothing to prevent the file from changing between calls to FIEMAP.
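
The count-then-allocate pattern described above could look roughly like
this on the userspace side. The structures are local copies of the ones
shown in this document, and fiemap_request_size() and
fiemap_request_alloc() are hypothetical helper names, not something the
patches provide. The ioctl call itself is left out:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Local copies of the structures shown in this document. */
struct fiemap_extent {
	uint64_t fe_logical;
	uint64_t fe_physical;
	uint64_t fe_length;
	uint32_t fe_flags;
	uint32_t fe_device;
};

struct fiemap {
	uint64_t fm_start;
	uint64_t fm_length;
	uint32_t fm_flags;
	uint32_t fm_mapped_extents;
	uint32_t fm_extent_count;
	uint32_t fm_reserved;
	struct fiemap_extent fm_extents[];	/* C99 flexible array member */
};

/* Bytes needed for a request that can receive 'count' extents. */
size_t fiemap_request_size(uint32_t count)
{
	/* Guard against size_t overflow on small-address-space targets. */
	assert(count <= (SIZE_MAX - sizeof(struct fiemap)) /
			sizeof(struct fiemap_extent));
	return sizeof(struct fiemap) +
	       (size_t)count * sizeof(struct fiemap_extent);
}

/*
 * Hypothetical helper: allocate a zeroed request for the range starting
 * at 'start' and 'length' bytes long, with room for 'count' extents.
 * Pass count == 0 for the initial "how many extents?" call, then allocate
 * again using the fm_mapped_extents value that call returned.
 */
struct fiemap *fiemap_request_alloc(uint64_t start, uint64_t length,
				    uint32_t flags, uint32_t count)
{
	struct fiemap *fm = calloc(1, fiemap_request_size(count));

	if (!fm)
		return NULL;
	fm->fm_start = start;
	fm->fm_length = length;
	fm->fm_flags = flags;
	fm->fm_extent_count = count;
	return fm;
}
```

Because the file can change between the two calls, the second call may
still come back with a full array; callers should not assume the first
count is final.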

Currently, there are three flags which can be set in fm_flags:

* FIEMAP_FLAG_SYNC
If this flag is set, the kernel will sync the file before mapping extents.

* FIEMAP_FLAG_XATTR
If this flag is set, the extents returned will describe the inode's
extended attribute lookup tree, instead of its data tree.

Extent Mapping
--------------

Extent information is returned within the embedded fm_extents array
which userspace must allocate along with the fiemap structure. The
number of elements in the fm_extents[] array should be passed via
fm_extent_count. The number of extents mapped by the kernel will be
returned via fm_mapped_extents. If the number of fiemap_extent structures
allocated is less than would be required to map the requested range,
the maximum number of extents that can be mapped in the fm_extents[]
array will be returned and fm_mapped_extents will be equal to
fm_extent_count. In that case, the last extent in the array will not
complete the requested range and will not have the FIEMAP_EXTENT_LAST
flag set (see the next section on extent flags).
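
One way a caller might decide whether another call is needed is sketched
below. The helper name fiemap_need_more() and the FIEMAP_EXTENT_LAST bit
value are assumptions for illustration; the structure is a local copy of
struct fiemap_extent from the next paragraph:

```c
#include <assert.h>
#include <stdint.h>

#define FIEMAP_EXTENT_LAST	0x00000001u	/* illustrative value (assumption) */

/* Local copy of the extent structure shown in this document. */
struct fiemap_extent {
	uint64_t fe_logical;
	uint64_t fe_physical;
	uint64_t fe_length;
	uint32_t fe_flags;
	uint32_t fe_device;
};

/*
 * Hypothetical helper: inspect the extents returned by one call.
 * Returns 1 and stores the offset for the next request in *next_start
 * if another call may be needed, or 0 if the mapping is complete.
 */
int fiemap_need_more(const struct fiemap_extent *ext, uint32_t mapped,
		     uint32_t extent_count, uint64_t *next_start)
{
	const struct fiemap_extent *last;

	assert(mapped <= extent_count);
	if (mapped == 0)
		return 0;	/* nothing mapped in the requested range */
	last = &ext[mapped - 1];
	if (last->fe_flags & FIEMAP_EXTENT_LAST)
		return 0;	/* end of file reached */
	if (mapped < extent_count)
		return 0;	/* array not full: the range is covered */
	/* Array filled up without FIEMAP_EXTENT_LAST: continue after it. */
	*next_start = last->fe_logical + last->fe_length;
	return 1;
}
```

When continuing, the caller would also shrink fm_length by the amount
already covered so that the request still ends at the same offset.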

Each extent is described by a single fiemap_extent structure as
returned in fm_extents.

struct fiemap_extent {
	__u64	fe_logical;  /* logical offset in bytes for the start of
			      * the extent */
	__u64	fe_physical; /* physical offset in bytes for the start
			      * of the extent */
	__u64	fe_length;   /* length in bytes for the extent */
	__u32	fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
	__u32	fe_device;   /* device number for extent */
};

All offsets and lengths are in bytes and mirror those on disk.  It is valid
for an extent's logical offset to start before the request or its logical
length to extend past the request.  Unless FIEMAP_EXTENT_NOT_ALIGNED is
returned, fe_logical, fe_physical, and fe_length will be aligned to the
block size of the file system.  With the exception of extents flagged as
FIEMAP_EXTENT_MERGED, adjacent extents will not be merged.

The fe_flags field contains flags which describe the extent returned.
A special flag, FIEMAP_EXTENT_LAST is always set on the last extent in
the file so that the process making fiemap calls can determine when no
more extents are available, without having to call the ioctl again.

Some flags are intentionally vague and will always be set in the
presence of other more specific flags. This way a program looking for
a general property does not have to know all existing and future flags
which imply that property.

For example, if FIEMAP_EXTENT_DATA_INLINE or FIEMAP_EXTENT_DATA_TAIL
are set, FIEMAP_EXTENT_NOT_ALIGNED will also be set. A program looking
for inline or tail-packed data can key on the specific flag. Software
which simply cares not to try operating on non-aligned extents
however, can just key on FIEMAP_EXTENT_NOT_ALIGNED, and not have to
worry about all present and future flags which might imply unaligned
data. Note that the opposite is not true - it would be valid for
FIEMAP_EXTENT_NOT_ALIGNED to appear alone.

* FIEMAP_EXTENT_LAST
This is the last extent in the file. A mapping attempt past this
extent will return nothing.

* FIEMAP_EXTENT_UNKNOWN
The location of this extent is currently unknown. This may indicate
the data is stored on an inaccessible volume or that no storage has
been allocated for the file yet.

* FIEMAP_EXTENT_DELALLOC
  - This will also set FIEMAP_EXTENT_UNKNOWN.
Delayed allocation - while there is data for this extent, its
physical location has not been allocated yet.

* FIEMAP_EXTENT_NO_DIRECT
Direct access to the data in this extent is illegal or will have
undefined results.

* FIEMAP_EXTENT_SECONDARY
The data for this extent is in secondary storage (e.g. HSM).  If the
data is not also in the filesystem, then FIEMAP_EXTENT_NO_DIRECT
should also be set.

* FIEMAP_EXTENT_NET
  - This will also set FIEMAP_EXTENT_NO_DIRECT
The data for this extent is not stored in a locally-accessible device.

* FIEMAP_EXTENT_DATA_COMPRESSED
  - This will also set FIEMAP_EXTENT_NO_DIRECT
The data in this extent has been compressed by the file system.

* FIEMAP_EXTENT_DATA_ENCRYPTED
  - This will also set FIEMAP_EXTENT_NO_DIRECT
The data in this extent has been encrypted by the file system.

* FIEMAP_EXTENT_NOT_ALIGNED
Extent offsets and length are not guaranteed to be block aligned.

* FIEMAP_EXTENT_DATA_INLINE
  - This will also set FIEMAP_EXTENT_NOT_ALIGNED
Data is located within a metadata block.

* FIEMAP_EXTENT_DATA_TAIL
  - This will also set FIEMAP_EXTENT_NOT_ALIGNED
Data is packed into a block with data from other files.

* FIEMAP_EXTENT_UNWRITTEN
Unwritten extent - the extent is allocated but its data has not been
initialized.  This indicates the extent's data will be all zero.

* FIEMAP_EXTENT_MERGED
This will be set when a file does not support extents, i.e., it uses a block
based addressing scheme.  Since returning an extent for each block back to
userspace would be highly inefficient, the kernel will try to merge most
adjacent blocks into 'extents'.
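
The "this will also set" notes in the list above can be summarized in a
few lines of C. This is only a model: the bit values are invented for the
example, and fiemap_add_implied_flags() and fiemap_extent_is_plain() are
hypothetical names. FIEMAP_EXTENT_SECONDARY is left out because what it
implies is up to the file system:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit assignments (assumption); real values live in fiemap.h. */
#define FIEMAP_EXTENT_UNKNOWN		0x0002u
#define FIEMAP_EXTENT_DELALLOC		0x0004u
#define FIEMAP_EXTENT_NO_DIRECT		0x0008u
#define FIEMAP_EXTENT_NET		0x0010u
#define FIEMAP_EXTENT_DATA_COMPRESSED	0x0020u
#define FIEMAP_EXTENT_DATA_ENCRYPTED	0x0040u
#define FIEMAP_EXTENT_NOT_ALIGNED	0x0080u
#define FIEMAP_EXTENT_DATA_INLINE	0x0100u
#define FIEMAP_EXTENT_DATA_TAIL		0x0200u

/* Add the general flag implied by each specific flag that is set. */
uint32_t fiemap_add_implied_flags(uint32_t flags)
{
	if (flags & FIEMAP_EXTENT_DELALLOC)
		flags |= FIEMAP_EXTENT_UNKNOWN;
	if (flags & (FIEMAP_EXTENT_NET | FIEMAP_EXTENT_DATA_COMPRESSED |
		     FIEMAP_EXTENT_DATA_ENCRYPTED))
		flags |= FIEMAP_EXTENT_NO_DIRECT;
	if (flags & (FIEMAP_EXTENT_DATA_INLINE | FIEMAP_EXTENT_DATA_TAIL))
		flags |= FIEMAP_EXTENT_NOT_ALIGNED;

	/* The invariant userspace relies on: inline data is never "aligned". */
	assert(!(flags & FIEMAP_EXTENT_DATA_INLINE) ||
	       (flags & FIEMAP_EXTENT_NOT_ALIGNED));
	return flags;
}

/*
 * Hypothetical predicate for a tool that only wants ordinary,
 * block-aligned extents whose fe_physical can be used directly.
 * It keys on the general flags only.
 */
int fiemap_extent_is_plain(uint32_t flags)
{
	return !(flags & (FIEMAP_EXTENT_UNKNOWN | FIEMAP_EXTENT_NO_DIRECT |
			  FIEMAP_EXTENT_NOT_ALIGNED));
}
```

The predicate never mentions DELALLOC, NET, DATA_INLINE and friends, which
is the point of having the general flags: it keeps working if new specific
flags are added later.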

VFS -> File System Implementation
---------------------------------

File systems wishing to support fiemap must implement a ->fiemap callback on
their inode_operations structure. The fs ->fiemap call is responsible for
defining its set of supported fiemap flags, and calling a helper function on
each discovered extent:

struct inode_operations {
       ...

       int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
                     u64 len);

       ...
};

->fiemap is passed struct fiemap_extent_info which describes the
fiemap request:

struct fiemap_extent_info {
	unsigned int fi_flags;		/* Flags as passed from user */
	unsigned int fi_extents_mapped;	/* Number of mapped extents */
	unsigned int fi_extents_max;	/* Size of fiemap_extent array */
	struct fiemap_extent *fi_extents_start;	/* Start of fiemap_extent array */
};

It is intended that the file system should not need to access any of this
structure directly.

Flag checking should be done at the beginning of the ->fiemap callback via the
fiemap_check_flags() helper:

int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);

The fieinfo structure should be passed in as received from ioctl_fiemap(). The
set of fiemap flags which the fs understands should be passed via fs_flags. If
fiemap_check_flags finds invalid user flags, it will place the bad values in
fieinfo->fi_flags and return -EBADR. If the file system gets -EBADR from
fiemap_check_flags(), it should immediately exit, returning that error back to
ioctl_fiemap().
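
The behavior described above can be modelled in plain userspace C as
follows. This is a sketch of the documented semantics, not the code from
the VFS patch; the model_ names, the cut-down structure and the flag
values are assumptions:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EBADR
#define EBADR 53	/* common Linux value, for an errno.h that lacks it */
#endif

/* Illustrative bit values (assumption); take the real ones from fiemap.h. */
#define FIEMAP_FLAG_SYNC	0x00000001u
#define FIEMAP_FLAG_XATTR	0x00000002u

/* Cut-down stand-in for struct fiemap_extent_info. */
struct model_extent_info {
	unsigned int fi_flags;	/* flags as passed from the user */
};

/* Model of fiemap_check_flags(): reject anything outside fs_flags. */
int model_check_flags(struct model_extent_info *fieinfo, uint32_t fs_flags)
{
	uint32_t bad;

	assert(fieinfo != NULL);
	bad = fieinfo->fi_flags & ~fs_flags;
	if (bad) {
		fieinfo->fi_flags = bad;	/* report only the offenders */
		return -EBADR;
	}
	return 0;
}

/* Shape of a ->fiemap callback for a fs that understands only SYNC. */
int model_fs_fiemap(struct model_extent_info *fieinfo)
{
	int ret = model_check_flags(fieinfo, FIEMAP_FLAG_SYNC);

	if (ret)
		return ret;	/* exit immediately, as described above */
	/* ... walk the extents and report each one ... */
	return 0;
}
```

On the error path fi_flags holds only the flags that were not understood,
which is what ends up in fm_flags for userspace to inspect.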

For each extent in the request range, the file system should call
the helper function, fiemap_fill_next_extent():

int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
			    u64 phys, u64 len, u32 flags, u32 dev);

fiemap_fill_next_extent() will use the passed values to populate the
next free extent in the fm_extents array. 'General' extent flags will
automatically be set from specific flags on behalf of the calling file
system so that the userspace API is not broken.

fiemap_fill_next_extent() returns 0 on success, and 1 when the
user-supplied fm_extents array is full. If an error is encountered
while copying the extent to user memory, -EFAULT will be returned.
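
To make the return-value convention concrete, here is a userspace model of
the helper and of the loop a file system might wrap around it. The model_
names are hypothetical, a plain store stands in for the copy to user
memory (so the -EFAULT path is not exercised), and returning 1 together
with the extent that fills the array is an assumption about the exact
behavior:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copy of the extent structure shown in this document. */
struct fiemap_extent {
	uint64_t fe_logical;
	uint64_t fe_physical;
	uint64_t fe_length;
	uint32_t fe_flags;
	uint32_t fe_device;
};

/* Cut-down stand-in for struct fiemap_extent_info. */
struct model_extent_info {
	unsigned int fi_extents_mapped;
	unsigned int fi_extents_max;
	struct fiemap_extent *fi_extents_start;
};

/* Returns 0 on success, 1 once the array is full. */
int model_fill_next_extent(struct model_extent_info *info, uint64_t logical,
			   uint64_t phys, uint64_t len, uint32_t flags,
			   uint32_t dev)
{
	struct fiemap_extent *dst;

	assert(info != NULL);
	if (info->fi_extents_max == 0) {	/* counting mode */
		info->fi_extents_mapped++;
		return 0;
	}
	if (info->fi_extents_mapped >= info->fi_extents_max)
		return 1;
	dst = &info->fi_extents_start[info->fi_extents_mapped++];
	dst->fe_logical = logical;
	dst->fe_physical = phys;
	dst->fe_length = len;
	dst->fe_flags = flags;
	dst->fe_device = dev;
	return info->fi_extents_mapped == info->fi_extents_max;
}

/* Caller loop: stop quietly on 1 (array full), pass real errors through. */
int model_map_extents(struct model_extent_info *info,
		      const struct fiemap_extent *src, unsigned int n)
{
	unsigned int i;
	int ret = 0;

	for (i = 0; i < n; i++) {
		ret = model_fill_next_extent(info, src[i].fe_logical,
					     src[i].fe_physical,
					     src[i].fe_length,
					     src[i].fe_flags,
					     src[i].fe_device);
		if (ret)
			break;
	}
	return ret < 0 ? ret : 0;
}
```

A return of 1 is not an error from the caller's point of view; it only
tells the file system to stop walking extents.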


* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-25 22:18 [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2 Mark Fasheh
@ 2008-06-26  3:03 ` Andreas Dilger
  2008-06-26  9:36 ` Jamie Lokier
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 70+ messages in thread
From: Andreas Dilger @ 2008-06-26  3:03 UTC (permalink / raw)
  To: Mark Fasheh
  Cc: linux-fsdevel, Andreas Dilger, Kalpak Shah, Eric Sandeen,
	Josef Bacik

On Jun 25, 2008  15:18 -0700, Mark Fasheh wrote:
> Currently, there are three flags which can be set in fm_flags:
> 
> * FIEMAP_FLAG_SYNC
> If this flag is set, the kernel will sync the file before mapping extents.
> 
> * FIEMAP_FLAG_XATTR
> If this flag is set, the extents returned will describe the inodes
> extended attribute lookup tree, instead of it's data tree.

(minor) "Currently, there are two flags ...", or to avoid this becoming
wrong again in the future just "The flags which can be set in fm_flags are:"

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-25 22:18 [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2 Mark Fasheh
  2008-06-26  3:03 ` Andreas Dilger
@ 2008-06-26  9:36 ` Jamie Lokier
  2008-06-26 10:24   ` Andreas Dilger
  2008-06-26 14:03 ` Eric Sandeen
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 70+ messages in thread
From: Jamie Lokier @ 2008-06-26  9:36 UTC (permalink / raw)
  To: Mark Fasheh
  Cc: linux-fsdevel, Andreas Dilger, Kalpak Shah, Eric Sandeen,
	Josef Bacik

Mark Fasheh wrote:
> * FIEMAP_FLAG_SYNC
> If this flag is set, the kernel will sync the file before mapping extents.

Is there a reason why fsync() before calling FIEMAP is unsuitable?

> * FIEMAP_FLAG_XATTR
> If this flag is set, the extents returned will describe the inodes
> extended attribute lookup tree, instead of it's data tree.

What is this for?  The meaning of the xattr tree sounds rather
filesystem specific to me.

> * FIEMAP_EXTENT_NO_DIRECT
> Direct access to the data in this extent is illegal or will have
> undefined results.
...
> * FIEMAP_EXTENT_NET
>   - This will also set FIEMAP_EXTENT_NO_DIRECT
> The data for this extent is not stored in a locally-accessible device.

Does this _always_ set FIEMAP_EXTENT_NO_DIRECT?  Some network
filesystems do support O_DIRECT access - NFS comes to mind.

(I'm assuming 'direct access' means O_DIRECT).

> * FIEMAP_EXTENT_DATA_ENCRYPTED
>   - This will also set FIEMAP_EXTENT_NO_DIRECT
> The data in this extent has been encrypted by the file system.

I don't think encryption necessarily rules out O_DIRECT.  It'll depend
how I/O is implemented by that filesystem.

> * FIEMAP_EXTENT_DATA_INLINE
>   This will also set FIEMAP_EXTENT_NOT_ALIGNED
> Data is located within a meta data block.

This seems like it would always set FIEMAP_EXTENT_NO_DIRECT :-)

(Generally, won't FIEMAP_EXTENT_NOT_ALIGNED always set
FIEMAP_EXTENT_NO_DIRECT?)

> * FIEMAP_EXTENT_DATA_TAIL
>   This will also set FIEMAP_EXTENT_NOT_ALIGNED
> Data is packed into a block with data from other files.

Maybe this too.

-- Jamie


* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-26  9:36 ` Jamie Lokier
@ 2008-06-26 10:24   ` Andreas Dilger
  2008-06-26 11:37     ` Anton Altaparmakov
  2008-06-26 12:19     ` Jamie Lokier
  0 siblings, 2 replies; 70+ messages in thread
From: Andreas Dilger @ 2008-06-26 10:24 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Mark Fasheh, linux-fsdevel, Andreas Dilger, Kalpak Shah,
	Eric Sandeen, Josef Bacik

On Jun 26, 2008  10:36 +0100, Jamie Lokier wrote:
> Mark Fasheh wrote:
> > * FIEMAP_FLAG_SYNC
> > If this flag is set, the kernel will sync the file before mapping extents.
> 
> Is there a reason why fsync() before calling FIEMAP is unsuitable?

This was added because the xfsbmap operation always did an fsync before
returning the extents.  I don't think it is strictly required, but it
isn't harmful either.

> > * FIEMAP_FLAG_XATTR
> > If this flag is set, the extents returned will describe the inodes
> > extended attribute lookup tree, instead of it's data tree.
> 
> What is this for?  The meaning of the xattr tree sounds rather
> filesystem specific to me.

This is to return the location of the xattr blocks for the inode.

> > * FIEMAP_EXTENT_NO_DIRECT
> > Direct access to the data in this extent is illegal or will have
> > undefined results.
> ...
> > * FIEMAP_EXTENT_NET
> >   - This will also set FIEMAP_EXTENT_NO_DIRECT
> > The data for this extent is not stored in a locally-accessible device.
> 
> Does this _always_ set FIEMAP_EXTENT_NO_DIRECT?  Some network
> filesystems do support O_DIRECT access - NFS comes to mind.
> 
> (I'm assuming 'direct access' means O_DIRECT).

"NO_DIRECT" has nothing to do with "O_DIRECT".  It just means that,
per the description a few lines earlier, direct access to the file
data is impossible (e.g. for lilo or another tool which thinks it can
open "dev" and seek to "fe_physical" to read the data), or at best
will have undefined results (e.g. you may get encrypted or compressed
data back, or it is on the far side of a network interface).

> > * FIEMAP_EXTENT_DATA_ENCRYPTED
> >   - This will also set FIEMAP_EXTENT_NO_DIRECT
> > The data in this extent has been encrypted by the file system.
> 
> I don't think encryption necessarily rules out O_DIRECT.  It'll depend
> how I/O is implemented by that filesystem.
> 
> > * FIEMAP_EXTENT_DATA_INLINE
> >   This will also set FIEMAP_EXTENT_NOT_ALIGNED
> > Data is located within a meta data block.
> 
> This seems like it would always set FIEMAP_EXTENT_NO_DIRECT :-)
> 
> (Generally, won't FIEMAP_EXTENT_NOT_ALIGNED always set
> FIEMAP_EXTENT_NO_DIRECT?)
> 
> > * FIEMAP_EXTENT_DATA_TAIL
> >   This will also set FIEMAP_EXTENT_NOT_ALIGNED
> > Data is packed into a block with data from other files.
> 
> Maybe this too.

The rest of these comments seem based on the previous misunderstanding.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-26 10:24   ` Andreas Dilger
@ 2008-06-26 11:37     ` Anton Altaparmakov
  2008-06-26 12:19     ` Jamie Lokier
  1 sibling, 0 replies; 70+ messages in thread
From: Anton Altaparmakov @ 2008-06-26 11:37 UTC (permalink / raw)
  To: Jamie Lokier, Andreas Dilger
  Cc: Mark Fasheh, linux-fsdevel, Andreas Dilger, Kalpak Shah,
	Eric Sandeen, Josef Bacik

Hi,

On 26 Jun 2008, at 11:24, Andreas Dilger wrote:
> On Jun 26, 2008  10:36 +0100, Jamie Lokier wrote:
>> Mark Fasheh wrote:
>>> * FIEMAP_FLAG_XATTR
>>> If this flag is set, the extents returned will describe the inodes
>>> extended attribute lookup tree, instead of it's data tree.
>>
>> What is this for?  The meaning of the xattr tree sounds rather
>> filesystem specific to me.
>
> This is to return the location of the xattr blocks for the inode.

Jamie is completely right that this is file system specific.  It only
has a meaning for file systems which use an "xattr tree", whatever that
is.  At a guess that is just XFS?  It seems a bit odd to have such a
file system specific flag in a generic interface.  On the other hand,
given the resistance to exposing named streams properly in Linux, I
guess this is the only thing you can do to get this information, so I
have no objections.

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/



* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-26 10:24   ` Andreas Dilger
  2008-06-26 11:37     ` Anton Altaparmakov
@ 2008-06-26 12:19     ` Jamie Lokier
  2008-06-26 13:16       ` Dave Chinner
  2008-06-26 17:17       ` Andreas Dilger
  1 sibling, 2 replies; 70+ messages in thread
From: Jamie Lokier @ 2008-06-26 12:19 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Mark Fasheh, linux-fsdevel, Andreas Dilger, Kalpak Shah,
	Eric Sandeen, Josef Bacik

Andreas Dilger wrote:
> > Is there a reason why fsync() before calling FIEMAP is unsuitable?
> 
> This was added because the xfsbmap operation always did an fsync before
> returning the extents.  I don't think it is strictly required, but it
> isn't harmful either.

It's not harmful but suggests it might do something important -
e.g. provide atomicity between the fsync and getting extents.

Can the documentation make it clear that it's exactly equivalent to
calling fsync() before - or, if that's not true, explain the difference?

> > > * FIEMAP_FLAG_XATTR
> > > If this flag is set, the extents returned will describe the inodes
> > > extended attribute lookup tree, instead of it's data tree.
> > 
> > What is this for?  The meaning of the xattr tree sounds rather
> > filesystem specific to me.
> 
> This is to return the location of the xattr blocks for the inode.

Some filesystems will store xattrs as metadata - in exactly the same
way as, say, the inode itself, its permissions, mappings etc.

I'm not sure why xattrs get special treatment, compared with a
hypothetical FIEMAP_FLAG_METADATA for example, indicating which
physical blocks contain the inode itself, or its other auxiliary
information.

(Aside: If there was a way to get physical block address for inodes
(without retrieving the inodes, using only the name) I know at least
one program that would benefit from that - it sorts stat() calls by
estimated inode block, which greatly speeds up scanning a large
filesystem.  I realise FIEMAP isn't an appropriate interface for
that.)

> > (I'm assuming 'direct access' means O_DIRECT).
> 
> "NO_DIRECT" has nothing to do with "O_DIRECT".  It just means that,
> per the description a few lines earlier, direct access to the file
> data is impossible (i.e. for lilo or other tool which thinks it can
> open "dev" and seek to "fe_physical" to read the data), or at best
> will have undefined results (e.g. you may get encrypted or compressed
> data back, or it is on the far side of a network interface).

Ok.  This wasn't clear, as 'direct access' means O_DIRECT elsewhere -
and some programs which use FIEMAP are likely to be the same ones
which use O_DIRECT.

Maybe calling it 'physical addressing' or something like that?

Because the field is called 'fe_physical', I'm thinking
FIEMAP_EXTENT_PHYSICAL is a much clearer flag name.  Also reversing
the sense, so it's _set_ when 'fe_physical' is a valid quantity.

(A flag FIEMAP_EXTENT_O_DIRECT to indicate when O_DIRECT access will
work sounds useful too, and quite easy to implement, btw.)

-- Jamie


* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-26 12:19     ` Jamie Lokier
@ 2008-06-26 13:16       ` Dave Chinner
  2008-06-26 13:27         ` Jamie Lokier
  2008-06-26 13:48         ` Eric Sandeen
  2008-06-26 17:17       ` Andreas Dilger
  1 sibling, 2 replies; 70+ messages in thread
From: Dave Chinner @ 2008-06-26 13:16 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Andreas Dilger, Mark Fasheh, linux-fsdevel, Andreas Dilger,
	Kalpak Shah, Eric Sandeen, Josef Bacik

On Thu, Jun 26, 2008 at 01:19:51PM +0100, Jamie Lokier wrote:
> Andreas Dilger wrote:
> > > Is there a reason why fsync() before calling FIEMAP is unsuitable?
> > 
> > This was added because the xfsbmap operation always did an fsync before
> > returning the extents.  I don't think it is strictly required, but it
> > isn't harmful either.
> 
> It's not harmful but suggests it might do something important -
> e.g. provide atomicity between the fsync and getting extends.

It does precisely that.

> > > > * FIEMAP_FLAG_XATTR
> > > > If this flag is set, the extents returned will describe the inodes
> > > > extended attribute lookup tree, instead of it's data tree.
> > > 
> > > What is this for?  The meaning of the xattr tree sounds rather
> > > filesystem specific to me.
> > 
> > This is to return the location of the xattr blocks for the inode.
> 
> Some filesystems will store xattrs as metadata - in exactly the same
> as, say, the inode itself, it's permissions, mappings etc.
> 
> I'm not sure why xattrs get special treatment, compared with a
> hypothetical FIEMAP_FLAG_METADATA for example, indicating which
> physical blocks contain the inode itself, or it's other auxiliary
> information.

Because xattrs tend to contain user data, not metadata?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-26 13:16       ` Dave Chinner
@ 2008-06-26 13:27         ` Jamie Lokier
  2008-06-26 13:48         ` Eric Sandeen
  1 sibling, 0 replies; 70+ messages in thread
From: Jamie Lokier @ 2008-06-26 13:27 UTC (permalink / raw)
  To: Andreas Dilger, Mark Fasheh, linux-fsdevel, Andreas Dilger,
	Kalpak Shah, Eric Sandeen <san

Dave Chinner wrote:
> > It's not harmful but suggests it might do something important -
> > e.g. provide atomicity between the fsync and getting extends.
> 
> It does precisely that.

Ok - so nobody can modify the file in between?  Is that useful, given
the file can be modified as soon as FIEMAP returns anyway?  I suppose
it does ensure all the instantiated data blocks will be allocated on
disk in the returned extents.

-- Jamie


* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-26 13:16       ` Dave Chinner
  2008-06-26 13:27         ` Jamie Lokier
@ 2008-06-26 13:48         ` Eric Sandeen
  2008-06-26 14:16           ` Jamie Lokier
  1 sibling, 1 reply; 70+ messages in thread
From: Eric Sandeen @ 2008-06-26 13:48 UTC (permalink / raw)
  To: Jamie Lokier, Andreas Dilger, Mark Fasheh, linux-fsdevel,
	Andreas Dilger, Kalpak Shah <Ka

Dave Chinner wrote:
> On Thu, Jun 26, 2008 at 01:19:51PM +0100, Jamie Lokier wrote:
>> Andreas Dilger wrote:
>>>> Is there a reason why fsync() before calling FIEMAP is unsuitable?
>>> This was added because the xfsbmap operation always did an fsync before
>>> returning the extents.  I don't think it is strictly required, but it
>>> isn't harmful either.
>> It's not harmful but suggests it might do something important -
>> e.g. provide atomicity between the fsync and getting extends.
> 
> It does precisely that.

xfs does via the bmap ioctl, but the generic fiemap implementation does
not.  It probably should be removed from the vfs level:

+	if (fieinfo.fi_flags & FIEMAP_FLAG_SYNC)
+		filemap_write_and_wait(inode->i_mapping);

and let the filesystem handle it in an atomic way?

>>>>> * FIEMAP_FLAG_XATTR
>>>>> If this flag is set, the extents returned will describe the inodes
>>>>> extended attribute lookup tree, instead of it's data tree.
>>>> What is this for?  The meaning of the xattr tree sounds rather
>>>> filesystem specific to me.
>>> This is to return the location of the xattr blocks for the inode.
>> Some filesystems will store xattrs as metadata - in exactly the same
>> as, say, the inode itself, it's permissions, mappings etc.
>>
>> I'm not sure why xattrs get special treatment, compared with a
>> hypothetical FIEMAP_FLAG_METADATA for example, indicating which
>> physical blocks contain the inode itself, or it's other auxiliary
>> information.
> 
> Because xattrs tend to contain user data, not metadata?

Agreed, I don't see anything particularly strange about returning the
xattr mapping, and to me it's not particularly fs specific  (well, the
detailed format of the on-disk data might be, i.e. the layout of names &
values within that blob of data, but it is still user data... I guess
it's something of a grey area).

-Eric


* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-26 13:48         ` Eric Sandeen
@ 2008-06-26 14:16           ` Jamie Lokier
  2008-06-26 16:56             ` Andreas Dilger
  0 siblings, 1 reply; 70+ messages in thread
From: Jamie Lokier @ 2008-06-26 14:16 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Andreas Dilger, Mark Fasheh, linux-fsdevel, Andreas Dilger,
	Kalpak Shah, Josef Bacik

> > Because xattrs tend to contain user data, not metadata?
> 
> Agreed, I don't see anything particularly strange about returning the
> xattr mapping, and to me it's not particularly fs specific  (well, the
> detailed format of the on-disk data might be, i.e. the layout of names &
> values within that blob of data, but it is still user data... I guess
> it's something of a grey area).

I'm thinking that some filesystems won't store it as a 'blob' at all,
but as, for example, leaves in a whole-fs tree structure on the same
footing as permissions, size, etc.

I don't see a problem with the xattr feature - it might be useful on
several filesystems.  (For my own program which tries to call stat()
in disk layout order to reduce seeks, knowing xattr blocks _would_ be
useful, as it could try to get xattrs in disk layout order too).

I just wanted to bring up that "all the xattrs of one inode" aren't
necessarily a blob of data in the same way as ordinary contents.
(Just in case it's being assumed that it is.)

-- Jamie


* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-26 14:16           ` Jamie Lokier
@ 2008-06-26 16:56             ` Andreas Dilger
  2008-06-29 19:12               ` Anton Altaparmakov
  0 siblings, 1 reply; 70+ messages in thread
From: Andreas Dilger @ 2008-06-26 16:56 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Eric Sandeen, Mark Fasheh, linux-fsdevel, Andreas Dilger,
	Kalpak Shah, Josef Bacik

On Jun 26, 2008  15:16 +0100, Jamie Lokier wrote:
> > > Because xattrs tend to contain user data, not metadata?
> > 
> > Agreed, I don't see anything particularly strange about returning the
> > xattr mapping, and to me it's not particularly fs specific  (well, the
> > detailed format of the on-disk data might be, i.e. the layout of names &
> > values within that blob of data, but it is still user data... I guess
> > it's something of a grey area).
> 
> I'm thinking that some filesystems won't store it as a 'blob' at all,
> but as, for example, leaves in a whole-fs tree structure on the same
> footing as permissions, size, etc.
> 
> I don't see a problem with the xattr feature - it might be useful on
> several filesystems.  (For my own program which tries to call stat()
> in disk layout order to reduce seeks, knowing xattr blocks _would_ be
> useful, as it could try to get xattrs in disk layout order too).
> 
> I just wanted to bring up that "all the xattrs of one inode" aren't
> necessarily a blob of data in the same way as ordinary contents.
> (Just in case it's being assumed that it is.)

It doesn't need to be a "blob", per se.  The physical addresses should
really represent where the xattrs are stored on disk, regardless of
whether they are stored in a separate block, or in the inode, or in the
leaves of a filesystem-wide tree.  There can be multiple blocks/extents
returned for an XATTR request (as ext4 and ext3 eventually will allow).

The logical offset of an xattr doesn't make so much sense, but I don't
think that is harmful.  I'd suggest that multiple xattrs be returned
in the order that a name search would be done, but I don't think it
really matters.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-26 16:56             ` Andreas Dilger
@ 2008-06-29 19:12               ` Anton Altaparmakov
  2008-06-29 21:45                 ` Dave Chinner
  2008-07-02  6:33                 ` Andreas Dilger
  0 siblings, 2 replies; 70+ messages in thread
From: Anton Altaparmakov @ 2008-06-29 19:12 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Jamie Lokier, Eric Sandeen, Mark Fasheh, linux-fsdevel,
	Andreas Dilger, Kalpak Shah, Josef Bacik

On 26 Jun 2008, at 17:56, Andreas Dilger wrote:
> On Jun 26, 2008  15:16 +0100, Jamie Lokier wrote:
>>>> Because xattrs tend to contain user data, not metadata?
>>>
>>> Agreed, I don't see anything particularly strange about returning the
>>> xattr mapping, and to me it's not particularly fs specific  (well, the
>>> detailed format of the on-disk data might be, i.e. the layout of names &
>>> values within that blob of data, but it is still user data... I guess
>>> it's something of a grey area).
>>
>> I'm thinking that some filesystems won't store it as a 'blob' at all,
>> but as, for example, leaves in a whole-fs tree structure on the same
>> footing as permissions, size, etc.
>>
>> I don't see a problem with the xattr feature - it might be useful on
>> several filesystems.  (For my own program which tries to call stat()
>> in disk layout order to reduce seeks, knowing xattr blocks _would_ be
>> useful, as it could try to get xattrs in disk layout order too).
>>
>> I just wanted to bring up that "all the xattrs of one inode" aren't
>> necessarily a blob of data in the same way as ordinary contents.
>> (Just in case it's being assumed that it is.)
>
> It doesn't need to be a "blob", per se.  The physical addresses should
> really represent where the xattrs are stored on disk, regardless of
> whether it is stored in a separate block, or in the inode, or in the
> leaves of a filesystem-wide tree.  There can be multiple blocks/extents
> returned for an XATTR request (as ext4 and ext3 eventually will allow).
>
> The logical offset of an xattr doesn't make so much sense, but I don't
> think that is harmful.  I'd suggest that multiple xattrs be returned
> in the order that a name search would be done, but I don't think it
> really matters.


But how would you return multiple xattrs if some of them are stored
inside the on-disk inode structure, some are stored in a single
extent, and some are stored in lots of extents, i.e. some have
"proper", block-aligned mappings and some don't?  This is the case for
NTFS, where each xattr is stored as a named stream and each named
stream is treated in exactly the same way as the file data itself
(which is simply an unnamed named stream, i.e. a named stream with a
filename length of zero).  Thus each xattr is stored independently and,
depending on their sizes, you can end up with multiple xattrs inside
the same on-disk block, and you can also end up with a huge xattr that
has a really large number of extents (the maximum size of each
xattr/named stream in NTFS is 2^63-1 bytes, which is really rather
big)...

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/



* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-29 19:12               ` Anton Altaparmakov
@ 2008-06-29 21:45                 ` Dave Chinner
  2008-06-30 22:57                   ` Jamie Lokier
  2008-07-02  6:33                 ` Andreas Dilger
  1 sibling, 1 reply; 70+ messages in thread
From: Dave Chinner @ 2008-06-29 21:45 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Andreas Dilger, Jamie Lokier, Eric Sandeen, Mark Fasheh,
	linux-fsdevel, Andreas Dilger, Kalpak Shah, Josef Bacik

On Sun, Jun 29, 2008 at 08:12:32PM +0100, Anton Altaparmakov wrote:
> On 26 Jun 2008, at 17:56, Andreas Dilger wrote:
>> On Jun 26, 2008  15:16 +0100, Jamie Lokier wrote:
>>>>> Because xattrs tend to contain user data, not metadata?
>>>>
>>>> Agreed, I don't see anything particularly strange about returning the
>>>> xattr mapping, and to me it's not particularly fs specific  (well, the
>>>> detailed format of the on-disk data might be, i.e. the layout of names &
>>>> values within that blob of data, but it is still user data... I guess
>>>> it's something of a grey area).
>>>
>>> I'm thinking that some filesystems won't store it as a 'blob' at all,
>>> but as, for example, leaves in a whole-fs tree structure on the same
>>> footing as permissions, size, etc.
>>>
>>> I don't see a problem with the xattr feature - it might be useful on
>>> several filesystems.  (For my own program which tries to call stat()
>>> in disk layout order to reduce seeks, knowing xattr blocks _would_ be
>>> useful, as it could try to get xattrs in disk layout order too).
>>>
>>> I just wanted to bring up that "all the xattrs of one inode" aren't
>>> necessarily a blob of data in the same way as ordinary contents.
>>> (Just in case it's being assumed that it is.)
>>
>> It doesn't need to be a "blob", per se.  The physical addresses should
>> really represent where the xattrs are stored on disk, regardless of
>> whether it is stored in a separate block, or in the inode, or in the
>> leaves of a filesystem-wide tree.  There can be multiple blocks/extents
>> returned for an XATTR request (as ext4 and ext3 eventually will allow).
>>
>> The logical offset of an xattr doesn't make so much sense, but I don't
>> think that is harmful.  I'd suggest that multiple xattrs be returned
>> in the order that a name search would be done, but I don't think it
>> really matters.
>
> But how would you return multiple xattrs if some of them are stored  
> inside the on-disk inode structure, some are stored in a single extent, 
> and some are stored in lots of extents, i.e. some have "proper", 
> block-aligned mappings and some don't.

For xfs_bmap we don't care - we just return the extent map of the tree.
i.e. you can't find out the location of an individual xattr without
doing lots more filesystem specific decoding. If you have large
xattrs and a fragmented tree, then you've got problems. Basically,
the flag is not to indicate getting the mapping of a specific
xattr, but that of the entire set of xattr data. If you know the
offset and length of the xattr, then you can get it specifically,
but to do that you need to know about the internals of the
filesystem....

FWIW, this is exactly the same case as getting the extent map of
directory data (I use xfs_bmap all the time for this) - you know
where the blocks are, but without completely decoding the directory
structure you have no idea where inside that map a given entry is.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-29 21:45                 ` Dave Chinner
@ 2008-06-30 22:57                   ` Jamie Lokier
  2008-06-30 23:07                     ` Mark Fasheh
  0 siblings, 1 reply; 70+ messages in thread
From: Jamie Lokier @ 2008-06-30 22:57 UTC (permalink / raw)
  To: Anton Altaparmakov, Andreas Dilger, Eric Sandeen, Mark Fasheh,
	linux-fsdevel, Andreas 

Dave Chinner wrote:

> If you know the offset and length of the xattr, then you can get it
> specifically, but to do that you need to know about the internals of
> the filesystem....
>
> FWIW, this is exactly the same case as getting the extent map of
> directory data (I use xfs_bmap all the time for this) - you know
> where the blocks are, but without completely decoding the directory
> structure you have no idea where inside that map a given entry is.

There's a thought.  What does "offset" mean for directory data and
FIEMAP?  (And xattrs, but let's ignore that, it's less important).

What's the appropriate thing to return for FIEMAP on a directory, on a
filesystem which doesn't store directories as a blob of data with
contiguous offsets?

E.g. directory offsets (readdir) do mean something, but there's no
guarantee that directory offsets 0..N-1 correspond to extents
covering exactly N bytes.

Would it return extent data similar to a file with large holes?

-- Jamie


* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-30 22:57                   ` Jamie Lokier
@ 2008-06-30 23:07                     ` Mark Fasheh
  2008-07-01  2:01                       ` Brad Boyer
  0 siblings, 1 reply; 70+ messages in thread
From: Mark Fasheh @ 2008-06-30 23:07 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Anton Altaparmakov, Andreas Dilger, Eric Sandeen, linux-fsdevel,
	Andreas Dilger, Kalpak Shah, Josef Bacik

On Mon, Jun 30, 2008 at 11:57:31PM +0100, Jamie Lokier wrote:
> > FWIW, this is exactly the same case as getting the extent map of
> > directory data (I use xfs_bmap all the time for this) - you know
> > where the blocks are, but without completely decoding the directory
> > structure you have no idea where inside that map a given entry is.
> 
> There's a thought.  What does "offset" mean for directory data and
> FIEMAP?  (And xattrs, but let's ignore that, it's less important).
> 
> What's the appropriate thing to return for FIEMAP on a directory, on a
> filesystem which doesn't store directories as a blob of data with
> contiguous offsets?
> 
> E.g. directory offsets (readdir) do mean something, but there's no
> guarantee that directory offsets 0..N-1 correspond to extents
> covering exactly N bytes.
> 
> Would it return extent data similar to a file with large holes?

Yes. FIEMAP is not in the business of interpreting individual directory
entries.

You can look at the Ocfs2 or ext4 patches for examples, but directory
extents are simply treated like file extents. In the case of Ocfs2, where
small directories can be stored inside of the inode meta data, the blob is
returned as a single extent, with the appropriate descriptor flags set
(FIEMAP_EXTENT_DATA_INLINE in particular).
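
As a rough illustration of how a caller might treat such an extent,
here is a minimal Python sketch. The flag values are the ones from the
upstream <linux/fiemap.h> and are an assumption here, since the values
in this patch round may differ. The Extent tuple and describe() helper
are hypothetical names used only for this example.

```python
from collections import namedtuple

# Flag bits as defined in the upstream <linux/fiemap.h>; assumed here,
# since the values in this patch round may differ.
FIEMAP_EXTENT_LAST = 0x0001
FIEMAP_EXTENT_NOT_ALIGNED = 0x0100
FIEMAP_EXTENT_DATA_INLINE = 0x0200

# Hypothetical userspace view of one returned extent (sizes in bytes).
Extent = namedtuple("Extent", "logical physical length flags")


def describe(extent):
    """Return a short label saying how the extent's data is stored."""
    if extent.flags & FIEMAP_EXTENT_DATA_INLINE:
        # Data lives inside the inode; physical points into metadata.
        return "inline"
    if extent.flags & FIEMAP_EXTENT_NOT_ALIGNED:
        return "unaligned"
    return "regular"


# A small directory stored inside the inode comes back as one extent.
# The offsets and lengths below are made up for the example.
small_dir = Extent(0, 0x4200, 96,
                   FIEMAP_EXTENT_DATA_INLINE | FIEMAP_EXTENT_NOT_ALIGNED
                   | FIEMAP_EXTENT_LAST)
big_dir = Extent(0, 1 << 20, 4096, FIEMAP_EXTENT_LAST)
```

The caller never looks at directory entries; it only learns where the
bytes live and whether they can be treated as ordinary block data.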
	--Mark

--
Mark Fasheh


* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-30 23:07                     ` Mark Fasheh
@ 2008-07-01  2:01                       ` Brad Boyer
  2008-07-02  6:38                         ` Andreas Dilger
  0 siblings, 1 reply; 70+ messages in thread
From: Brad Boyer @ 2008-07-01  2:01 UTC (permalink / raw)
  To: Mark Fasheh
  Cc: Jamie Lokier, Anton Altaparmakov, Andreas Dilger, Eric Sandeen,
	linux-fsdevel, Andreas Dilger, Kalpak Shah, Josef Bacik

On Mon, Jun 30, 2008 at 04:07:41PM -0700, Mark Fasheh wrote:
> Yes. FIEMAP is not in the business of interpreting individual directory
> entries.
> 
> You can look at the Ocfs2 or ext4 patches for examples, but directory
> extents are simply treated like file extents. In the case of Ocfs2, where
> small directories can be stored inside of the inode meta data, the blob is
> returned as a single extent, with the appropriate descriptor flags set
> (FIEMAP_EXTENT_DATA_INLINE in particular).

What would you expect as reasonable behavior for an FS type that
doesn't have distinct storage for directories? On HFS and HFS+, the
directory information is completely synthetic based on the parent
ID of each file. A directory has no actual data dedicated to it 
other than the basic metadata that would be in the inode in ext3,
and readdir just walks the catalog tree and finds all the entries
that say they have the directory you want as a parent. They are
sorted that way, so it's not as bad as it sounds for performance.

Would it be reasonable for a filesystem like this to just say that
a directory has no extents at all? You can open a directory and
seek to an offset, but that doesn't logically map to any place
on the disk. Is this going to cause problems?

	Brad Boyer
	flar@allandria.com


* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-01  2:01                       ` Brad Boyer
@ 2008-07-02  6:38                         ` Andreas Dilger
  0 siblings, 0 replies; 70+ messages in thread
From: Andreas Dilger @ 2008-07-02  6:38 UTC (permalink / raw)
  To: Brad Boyer
  Cc: Mark Fasheh, Jamie Lokier, Anton Altaparmakov, Eric Sandeen,
	linux-fsdevel, Andreas Dilger, Kalpak Shah, Josef Bacik

On Jun 30, 2008  19:01 -0700, Brad Boyer wrote:
> What would you expect as reasonable behavior for an FS type that
> doesn't have distinct storage for directories? On HFS and HFS+, the
> directory information is completely synthetic based on the parent
> ID of each file. A directory has no actual data dedicated to it 
> other than the basic metadata that would be in the inode in ext3,
> and readdir just walks the catalog tree and finds all the entries
> that say they have the directory you want as a parent. They are
> sorted that way, so it's not as bad as it sounds for performance.
> 
> Would it be reasonable for a filesystem like this to just say that
> a directory has no extents at all? You can open a directory and
> seek to an offset, but that doesn't logically map to any place
> on the disk. Is this going to cause problems?

Even though the directory data isn't stored in a separate "file",
the inodes that make up the "directory" still consume space on disk.
If I were writing a FIEMAP handler for such a filesystem, I'd probably
return the byte ranges of the "catalog tree" that match the directory
"parent".

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-29 19:12               ` Anton Altaparmakov
  2008-06-29 21:45                 ` Dave Chinner
@ 2008-07-02  6:33                 ` Andreas Dilger
  2008-07-02 14:26                   ` Jamie Lokier
  1 sibling, 1 reply; 70+ messages in thread
From: Andreas Dilger @ 2008-07-02  6:33 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Jamie Lokier, Eric Sandeen, Mark Fasheh, linux-fsdevel,
	Andreas Dilger, Kalpak Shah, Josef Bacik

On Jun 29, 2008  20:12 +0100, Anton Altaparmakov wrote:
>> It doesn't need to be a "blob", per se.  The physical addresses should
>> really represent where the xattrs are stored on disk, regardless of
>> whether it is stored in a separate block, or in the inode, or in the
>> leaves of a filesystem-wide tree.  There can be multiple blocks/extents
>> returned for an XATTR request (as ext4 and ext3 eventually will allow).
>
> But how would you return multiple xattrs if some of them are stored  
> inside the on-disk inode structure, some are stored in a single extent, 
> and some are stored in lots of extents, i.e. some have "proper", 
> block-aligned mappings and some don't.

The ext4 code can also have both in-inode and external xattr data.
If the inode has data in both locations it will return two extents,
each one with a separate set of flags.

> This is the case for NTFS where 
> each xattr is stored as a named stream and each named stream is treated 
> in exactly the same way as the file data itself (which is simply an 
> unnamed named stream, i.e. a named stream with a filename length of zero) 
> thus each xattr is stored independently and depending on their sizes you 
> can end up with multiple xattrs inside the same on-disk block and you can 
> also end up with a huge xattr that has a really large number of extents 
> (the maximum size of each xattr/named stream in NTFS is 2^63-1 bytes 
> which is really rather big)...

The XATTR request is really only intended for the "small" getxattr data.
If there are large "xattrs" (i.e. data forks) then I'd suggest instead
using openat() to open the data fork and calling FIEMAP on that directly.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-02  6:33                 ` Andreas Dilger
@ 2008-07-02 14:26                   ` Jamie Lokier
  0 siblings, 0 replies; 70+ messages in thread
From: Jamie Lokier @ 2008-07-02 14:26 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Anton Altaparmakov, Eric Sandeen, Mark Fasheh, linux-fsdevel,
	Andreas Dilger, Kalpak Shah, Josef Bacik

Andreas Dilger wrote:
> If there are large "xattrs" (i.e. data forks) then I'd suggest instead
> using openat() to open the data fork and calling FIEMAP on that directly.

Ooh.  Can you openat() with a file as dirfd?  Equivalently, can you
open "file/fork"?

That sounds like the "file-as-directory" feature Reiser tried to get
into the kernel some time back.  Wasn't it rejected after a long
discussion?

-- Jamie


* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-26 12:19     ` Jamie Lokier
  2008-06-26 13:16       ` Dave Chinner
@ 2008-06-26 17:17       ` Andreas Dilger
  1 sibling, 0 replies; 70+ messages in thread
From: Andreas Dilger @ 2008-06-26 17:17 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Mark Fasheh, linux-fsdevel, Andreas Dilger, Kalpak Shah,
	Eric Sandeen, Josef Bacik

On Jun 26, 2008  13:19 +0100, Jamie Lokier wrote:
> Andreas Dilger wrote:
> > "NO_DIRECT" has nothing to do with "O_DIRECT".  It just means that,
> > per the description a few lines earlier, direct access to the file
> > data is impossible (i.e. for lilo or other tool which thinks it can
> > open "dev" and seek to "fe_physical" to read the data), or at best
> > will have undefined results (e.g. you may get encrypted or compressed
> > data back, or it is on the far side of a network interface).
> 
> Ok.  This wasn't clear, as 'direct access' means O_DIRECT elsewhere -
> and some programs which use FIEMAP are likely to be the same ones
> which use O_DIRECT.
> 
> Maybe calling it 'physical addressing' or something like that?
> 
> Because the field is called 'fe_physical', I'm thinking
> FIEMAP_EXTENT_PHYSICAL is a much clearer flag name.  Also reversing
> the sense, so it's _set_ when 'fe_physical' is a valid quantity.

Well, in most cases the physical addresses are valid (except if
FIEMAP_EXTENT_NET), but the point of NO_DIRECT is that if some
application were to read the data directly from disk (e.g. lilo,
or dump) it won't necessarily get the data it expects.

I agree the name is a bit confusing, and maybe a clarification should
be made w.r.t. the fact it has nothing to do with O_DIRECT, but I
can't think of a better name for it.  The NO_DIRECT flag will normally
have another qualifier that explains why it isn't directly accessible,
but apps which don't care WHY it isn't accessible don't need to check
for each of those flags.
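
A minimal Python sketch of that usage pattern follows. The bit values
are placeholders, not the values from the patch, and the
DATA_ENCRYPTED qualifier name and both helper functions are
assumptions made for illustration.

```python
# Placeholder bit values for illustration only; the real values are
# whatever the patch's fiemap header defines.
FIEMAP_EXTENT_NO_DIRECT = 0x1
FIEMAP_EXTENT_NET = 0x2             # qualifier: data is across a network
FIEMAP_EXTENT_DATA_ENCRYPTED = 0x4  # qualifier name assumed, not from the patch


def can_read_from_device(fe_flags):
    """True if reading the device at fe_physical should yield file data."""
    # One test suffices: every qualifier is expected to come with NO_DIRECT.
    return not (fe_flags & FIEMAP_EXTENT_NO_DIRECT)


def why_not_direct(fe_flags):
    """Optional detail for callers that do care about the reason."""
    reasons = []
    if fe_flags & FIEMAP_EXTENT_NET:
        reasons.append("network")
    if fe_flags & FIEMAP_EXTENT_DATA_ENCRYPTED:
        reasons.append("encrypted")
    return reasons
```

A tool like lilo or dump would only need can_read_from_device(); a
diagnostic tool could additionally call why_not_direct().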

> (A flag FIEMAP_EXTENT_O_DIRECT to indicate when O_DIRECT access will
> work sounds useful too, and quite easy to implement, btw.)

There is already the FIEMAP_EXTENT_NOT_ALIGNED flag, which is the
opposite of what you ask for - it marks extents that are not properly
aligned to block boundaries:

	* FIEMAP_EXTENT_NOT_ALIGNED
	Extent offsets and length are not guaranteed to be block aligned.


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-25 22:18 [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2 Mark Fasheh
  2008-06-26  3:03 ` Andreas Dilger
  2008-06-26  9:36 ` Jamie Lokier
@ 2008-06-26 14:03 ` Eric Sandeen
  2008-06-27  1:41   ` Dave Chinner
  2008-06-26 14:04 ` Dave Kleikamp
  2008-07-03 14:37 ` jim owens
  4 siblings, 1 reply; 70+ messages in thread
From: Eric Sandeen @ 2008-06-26 14:03 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-fsdevel, Andreas Dilger, Kalpak Shah, Josef Bacik

Mark Fasheh wrote:
> Hello,
> 
> 	The following patches are the latest attempt at implementing a
> fiemap ioctl, which can be used by userspace software to get extent
> information for an inode in an efficient manner.
> 
> 	These patches are against 2.6.26-rc3, though they probably apply
> fine against Linus' latest tree. The fs patches are much more complete this
> time around, and the vfs patch has been trimmed down.
> 
> 	An updated version of my ioctl wrapper test program is available at:
> 
>    http://www.kernel.org/pub/linux/kernel/people/mfasheh/fiemap/tests/
> 
> 	A couple of notes regarding the VFS patch:
> 
> 	Firstly, most behavior-changing fm_flags have been removed. We're
> left with SYNC and XATTR now. This is a very good thing because frankly, I
> think fiemap should be targeted as a straight-forward and relatively
> uncomplicated API for exposing extents as they appear on disk. Think "one
> notch above extent-based FIBMAP replacement".

So Mark's gonna hate me for this 'cause I was acting all resigned last
night, but I have to throw this out (sorry Mark!)

Right now the interface seems to be about returning details of the
filesystem's accounting of the on-disk layout, as opposed to just a
simple mapping.  As 2 examples:

1) If you have 8 contiguous 128M extents for a 1G file, currently the
interface will (or may) give you back 8 extents for the entire file,
even though the file is 100% unfragmented, because that reflects the
details of the filesystem's internal accounting.

2) Further, if you ask for a mapping of that file between 100M and 200M
(logical), you will (or may) get back 2 extents between 0M and 256M
because again, that is how the filesystem is tracking the layout internally.

(compare with a simple mapping-only interface which would return a
single range from 0 to 1G, or from 100M to 200M).
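
The difference between the two behaviours can be sketched in Python,
assuming extents are plain (logical, physical, length) tuples in bytes.
The physical base address and both function names are made up for the
example.

```python
MiB = 1024 * 1024

# Eight physically contiguous 128M extents making up a 1G file, as
# (logical, physical, length) in bytes. The physical base is made up.
PHYS_BASE = 512 * MiB
extents = [(i * 128 * MiB, PHYS_BASE + i * 128 * MiB, 128 * MiB)
           for i in range(8)]


def raw_mapping(extents, start, length):
    """Return every extent overlapping [start, start+length), untrimmed."""
    end = start + length
    return [e for e in extents if e[0] < end and e[0] + e[2] > start]


def simple_mapping(extents, start, length):
    """Return one (logical, physical, length) range clipped to the request.

    Only valid when the overlapping extents are physically contiguous
    and cover the whole request, which holds for the example file above.
    """
    hits = raw_mapping(extents, start, length)
    first_logical, first_physical, _ = hits[0]
    skip = start - first_logical
    return (start, first_physical + skip, length)
```

For the 100M to 200M request, raw_mapping() returns the two extents
covering 0M to 256M, while simple_mapping() returns a single 100M range.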

Either approach has its merits, depending on what you want the interface
to do I suppose.  Maybe it should even be (gasp) another flag to switch
between one or the other?  (merge & trim extents vs. distinct & full
extents?)

For filesystem debugging work I see the value in returning some details
of the filesystem's internal representation of the layout.  For a
mapping interface, I think it complicates things for the caller.

In the end I can live with either as long as we're explicit about it,
but I think it's worth pointing out.

-Eric


* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-26 14:03 ` Eric Sandeen
@ 2008-06-27  1:41   ` Dave Chinner
  2008-06-27  9:41     ` Jamie Lokier
  0 siblings, 1 reply; 70+ messages in thread
From: Dave Chinner @ 2008-06-27  1:41 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Mark Fasheh, linux-fsdevel, Andreas Dilger, Kalpak Shah,
	Josef Bacik

On Thu, Jun 26, 2008 at 09:03:49AM -0500, Eric Sandeen wrote:
> Mark Fasheh wrote:
> > Hello,
> > 
> > 	The following patches are the latest attempt at implementing a
> > fiemap ioctl, which can be used by userspace software to get extent
> > information for an inode in an efficient manner.
> > 
> > 	These patches are against 2.6.26-rc3, though they probably apply
> > fine against Linus' latest tree. The fs patches are much more complete this
> > time around, and the vfs patch has been trimmed down.
> > 
> > 	An updated version of my ioctl wrapper test program is available at:
> > 
> >    http://www.kernel.org/pub/linux/kernel/people/mfasheh/fiemap/tests/
> > 
> > 	A couple of notes regarding the VFS patch:
> > 
> > 	Firstly, most behavior-changing fm_flags have been removed. We're
> > left with SYNC and XATTR now. This is a very good thing because frankly, I
> > think fiemap should be targeted as a straight-forward and relatively
> > uncomplicated API for exposing extents as they appear on disk. Think "one
> > notch above extent-based FIBMAP replacement".
> 
> So Mark's gonna hate me for this 'cause I was acting all resigned last
> night, but I have to throw this out (sorry Mark!)
> 
> Right now the interface seems to be about returning details of the
> filesystem's accounting of the on-disk layout, as opposed to just a
> simple mapping.  As 2 examples:

IMO we shouldn't complicate the kernel implementation - if the user
wants to see merged extents, merge them in userspace. If the user
wants trimmed extents, do it in userspace. If the user wants
every raw extent, then nothing else needs to be done...
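
A minimal sketch of what that userspace post-processing could look
like, in Python, assuming the raw extents have already been fetched
into (logical, physical, length) tuples in bytes and sorted by logical
offset. Both function names are hypothetical, and a real tool would
also want to compare extent flags before merging.

```python
def merge_extents(extents):
    """Coalesce extents that are adjacent both logically and physically.

    extents: list of (logical, physical, length) tuples sorted by logical.
    """
    merged = []
    for logical, physical, length in extents:
        if merged:
            m_log, m_phys, m_len = merged[-1]
            if m_log + m_len == logical and m_phys + m_len == physical:
                # Extends the previous extent on disk and in the file.
                merged[-1] = (m_log, m_phys, m_len + length)
                continue
        merged.append((logical, physical, length))
    return merged


def trim_extents(extents, start, length):
    """Clip extents to the logical byte range [start, start + length)."""
    end = start + length
    trimmed = []
    for logical, physical, ext_len in extents:
        lo = max(logical, start)
        hi = min(logical + ext_len, end)
        if lo < hi:
            # Shift the physical address by the amount clipped off the front.
            trimmed.append((lo, physical + (lo - logical), hi - lo))
    return trimmed
```

Callers that want every raw extent simply skip both steps.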

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-27  1:41   ` Dave Chinner
@ 2008-06-27  9:41     ` Jamie Lokier
  2008-06-27 10:01       ` Dave Chinner
  2008-06-27 22:48       ` Andreas Dilger
  0 siblings, 2 replies; 70+ messages in thread
From: Jamie Lokier @ 2008-06-27  9:41 UTC (permalink / raw)
  To: Eric Sandeen, Mark Fasheh, linux-fsdevel, Andreas Dilger,
	Kalpak Shah, Josef Bacik <jba

Dave Chinner wrote:
> > Right now the interface seems to be about returning details of the
> > filesystem's accounting of the on-disk layout, as opposed to just a
> > simple mapping.  As 2 examples:
> 
> IMO we shouldn't complicate the kernel implementation - if the user
> wants to see merged extents, merge them in userspace. If the user
> wants trimmed extents, do it in userspace. If the user wants
> every raw extent, then nothing else needs to be done...

Agree, in general, but for merging a couple of things spring to mind:

   - The block based implementation must merge the filesystem's
     internal representation, i.e. each block.  Not doing this would
     return a prohibitively large list of block-size extents.

   - The filesystem's internal representation may have _many_ more
     extents than the contiguous layout.  E.g. a 4GiB file might have
     65536 x 64kiB extents on some filesystem, or 1 extent when
     merged.  Is it ever useful to return the much larger list?

-- Jamie


* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-27  9:41     ` Jamie Lokier
@ 2008-06-27 10:01       ` Dave Chinner
  2008-06-27 10:32         ` Jamie Lokier
  2008-06-27 22:48       ` Andreas Dilger
  1 sibling, 1 reply; 70+ messages in thread
From: Dave Chinner @ 2008-06-27 10:01 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Eric Sandeen, Mark Fasheh, linux-fsdevel, Andreas Dilger,
	Kalpak Shah, Josef Bacik

On Fri, Jun 27, 2008 at 10:41:56AM +0100, Jamie Lokier wrote:
> Dave Chinner wrote:
> > > Right now the interface seems to be about returning details of the
> > > filesystem's accounting of the on-disk layout, as opposed to just a
> > > simple mapping.  As 2 examples:
> > 
> > IMO we shouldn't complicate the kernel implementation - if the user
> > wants to see merged extents, merge them in userspace. If the user
> > wants trimmed extents, do it in userspace. If the user wants
> > every raw extent, then nothing else needs to be done...
> 
> Agree, in general, but for merging a couple of things spring to mind:
> 
>    - The block based implementation must merge the filesystem's
>      internal representation, i.e. each block.  Not doing this would
>      return a prohibitively large list of block-size extents.

Given there's a massage layer between the bmap and extent based fs's
already, it could be handled there.

>    - The filesystem's internal representation may have _many_ more
>      extents than the contiguous layout.  E.g. a 4GiB file might have
>      65536 x 64kiB extents on some filesystem, or 1 extent when
>      merged.  Is it ever useful to return the much larger list?

64k max extent size? I'd consider that a block based filesystem,
not an extent based filesystem....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-27 10:01       ` Dave Chinner
@ 2008-06-27 10:32         ` Jamie Lokier
  0 siblings, 0 replies; 70+ messages in thread
From: Jamie Lokier @ 2008-06-27 10:32 UTC (permalink / raw)
  To: Eric Sandeen, Mark Fasheh, linux-fsdevel, Andreas Dilger,
	Kalpak Shah, Josef Bacik <jba

Dave Chinner wrote:
> On Fri, Jun 27, 2008 at 10:41:56AM +0100, Jamie Lokier wrote:
> > > > Right now the interface seems to be about returning details of the
> > > > filesystem's accounting of the on-disk layout, as opposed to just a
> > > > simple mapping.  As 2 examples:
...
> >    - The filesystem's internal representation may have _many_ more
> >      extents than the contiguous layout.  E.g. a 4GiB file might have
> >      65536 x 64kiB extents on some filesystem, or 1 extent when
> >      merged.  Is it ever useful to return the much larger list?
> 
> 64k max extent size? I'd consider that a block based filesystem,
> > not an extent based filesystem....

It's only block based if the size is fixed.

Besides it's just an example.  If "real" extent based filesystems
never split their contiguous file extents for internal processing
reasons, I don't know what we're talking about :-)

-- Jamie


* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-27  9:41     ` Jamie Lokier
  2008-06-27 10:01       ` Dave Chinner
@ 2008-06-27 22:48       ` Andreas Dilger
  2008-06-28  4:21         ` Eric Sandeen
  1 sibling, 1 reply; 70+ messages in thread
From: Andreas Dilger @ 2008-06-27 22:48 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Eric Sandeen, Mark Fasheh, linux-fsdevel, Andreas Dilger,
	Kalpak Shah, Josef Bacik

On Jun 27, 2008  10:41 +0100, Jamie Lokier wrote:
> Dave Chinner wrote:
> > > Right now the interface seems to be about returning details of the
> > > filesystem's accounting of the on-disk layout, as opposed to just a
> > > simple mapping.  As 2 examples:
> > 
> > IMO we shouldn't complicate the kernel implementation - if the user
> > wants to see merged extents, merge them in userspace. If the user
> > wants trimmed extents, do it in userspace. If the user wants
> > every raw extent, then nothing else needs to be done...
> 
> Agree, in general, but for merging a couple of things spring to mind:
> 
>    - The block based implementation must merge the filesystem's
>      internal representation, i.e. each block.  Not doing this would
>      return a prohibitively large list of block-size extents.

This is already done in the generic_fiemap() implementation for block-based
filesystems (ext2, ext3, ext4 with block-mapped files).  In that case it
sets the FIEMAP_EXTENT_MERGED flag in the returned extent.
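
The idea can be sketched in Python as follows. This is not the
generic_fiemap() code itself, only an illustration of coalescing a
per-block map into extents. The block_map input format and the function
name are assumptions, the MERGED value is the one from the upstream
header and may differ in this patch round, and flagging every extent as
MERGED is a simplification.

```python
FIEMAP_EXTENT_MERGED = 0x1000  # upstream value; assumed for this round


def blocks_to_extents(block_map, block_size):
    """Coalesce a per-block mapping into extents.

    block_map: list of physical block numbers indexed by logical block,
    with None for holes (as repeated FIBMAP-style lookups would give).
    Returns (logical, physical, length, flags) tuples in bytes.
    """
    extents = []
    run_start = None  # logical block where the current run begins
    for lblk, pblk in enumerate(block_map + [None]):  # sentinel ends last run
        if run_start is not None:
            expected = block_map[run_start] + (lblk - run_start)
            if pblk == expected:
                continue  # run keeps going
            extents.append((run_start * block_size,
                            block_map[run_start] * block_size,
                            (lblk - run_start) * block_size,
                            FIEMAP_EXTENT_MERGED))
            run_start = None
        if pblk is not None:
            run_start = lblk
    return extents
```

A six-block file with one hole, such as [10, 11, 12, None, 40, 41],
comes back as two extents instead of five single-block entries.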

>    - The filesystem's internal representation may have _many_ more
>      extents than the contiguous layout.  E.g. a 4GiB file might have
>      65536 x 64kiB extents on some filesystem, or 1 extent when
>      merged.  Is it ever useful to return the much larger list?

It seems unlikely to consider this an "extent based" filesystem, and
I'd treat it as a block-based filesystem internally and merge it...
Yes, for ext4 it must split the extents at 128MB boundaries (1 group),
but because of the on-disk metadata it isn't yet possible to allocate
larger extents anyways.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-27 22:48       ` Andreas Dilger
@ 2008-06-28  4:21         ` Eric Sandeen
  2008-07-02  6:26           ` Andreas Dilger
  0 siblings, 1 reply; 70+ messages in thread
From: Eric Sandeen @ 2008-06-28  4:21 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Jamie Lokier, Mark Fasheh, linux-fsdevel, Andreas Dilger,
	Kalpak Shah, Josef Bacik

Andreas Dilger wrote:

>>    - The filesystem's internal representation may have _many_ more
>>      extents than the contiguous layout.  E.g. a 4GiB file might have
>>      65536 x 64kiB extents on some filesystem, or 1 extent when
>>      merged.  Is it ever useful to return the much larger list?
> 
> It seems unlikely to consider this an "extent based" filesystem, and
> I'd treat it as a block-based filesystem internally and merge it...
> Yes, for ext4 it must split the extents at 128MB boundaries (1 group),
> but because of the on-disk metadata it isn't yet possible to allocate
> larger extents anyways.

aside: even w/ metadata out of the way the ext4 extent format is still
capped at 128M per extent right, even if they're contiguous.  Which led
me to the little thought experiment about hm, is this 1G file 1 extent or
8 and what should fiemap return...

-Eric


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-28  4:21         ` Eric Sandeen
@ 2008-07-02  6:26           ` Andreas Dilger
  2008-07-02 14:28             ` Jamie Lokier
  0 siblings, 1 reply; 70+ messages in thread
From: Andreas Dilger @ 2008-07-02  6:26 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jamie Lokier, Mark Fasheh, linux-fsdevel, Andreas Dilger,
	Kalpak Shah, Josef Bacik

On Jun 27, 2008  23:21 -0500, Eric Sandeen wrote:
> even w/ metadata out of the way the ext4 extent format is still
> capped at 128M per extent right, even if they're contiguous.  Which led
> me to the little thought experiment about hm, is this 1G file 1 extent or
> 8 and what should fiemap return...

The current FIEMAP code will return 8 extents, which mirrors the number
of physical extents on disk.  This could be important to the person
looking at the file, as it will be clear that there are too many
extents to fit into the inode and the extent tree will include an index
block.
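
The arithmetic behind the "8 extents" figure can be sketched as follows. The
128 MiB cap is the one quoted in this thread; the figure of 4 extents stored
directly in the inode is an assumption about the ext4 on-disk layout (a
60-byte i_block holding a 12-byte header and 12-byte entries), and the names
are hypothetical.

```python
MIB = 1 << 20
GIB = 1 << 30

MAX_EXTENT_BYTES = 128 * MIB   # per-extent cap quoted in the thread
INLINE_EXTENTS = 4             # assumed number of extents that fit in the inode


def extents_needed(file_size: int, max_extent: int = MAX_EXTENT_BYTES) -> int:
    """Minimum number of extents for a fully contiguous file (ceiling division)."""
    return -(-file_size // max_extent)


def needs_index_block(file_size: int) -> bool:
    """True when the extents no longer fit in the inode itself."""
    return extents_needed(file_size) > INLINE_EXTENTS
```

Under these assumptions extents_needed(1 * GIB) is 8, which exceeds the 4
inline slots, so a contiguous 1G file would already need an index block.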

I was a bit on the fence about this, but I agree with David that if
userspace has more information (e.g. all of the on-disk extent data)
it can always merge it, but it can't "un-merge" the data returned
from the kernel.
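
A rough sketch of the userspace side of that argument: given the raw extent
list, merging is a few lines, whereas the reverse is impossible once the
kernel has merged. The tuple layout (logical offset, physical offset, length,
all in bytes) and the function name are hypothetical.

```python
from typing import List, Tuple

# (logical_offset, physical_offset, length), all in bytes.
Extent = Tuple[int, int, int]


def merge_extents(extents: List[Extent]) -> List[Extent]:
    """Merge extents that are adjacent both logically and physically.

    The input must be sorted by logical offset, as a mapping call
    would normally return it.
    """
    merged: List[Extent] = []
    for logical, physical, length in extents:
        if merged:
            mlog, mphys, mlen = merged[-1]
            if mlog + mlen == logical and mphys + mlen == physical:
                merged[-1] = (mlog, mphys, mlen + length)
                continue
        merged.append((logical, physical, length))
    return merged
```

Feeding it eight contiguous 128M extents yields a single 1G extent, while the
caller still has the original eight entries if it wants them.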

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-02  6:26           ` Andreas Dilger
@ 2008-07-02 14:28             ` Jamie Lokier
  2008-07-02 21:20               ` Mark Fasheh
  0 siblings, 1 reply; 70+ messages in thread
From: Jamie Lokier @ 2008-07-02 14:28 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Eric Sandeen, Mark Fasheh, linux-fsdevel, Andreas Dilger,
	Kalpak Shah, Josef Bacik

Andreas Dilger wrote:
> On Jun 27, 2008  23:21 -0500, Eric Sandeen wrote:
> > even w/ metadata out of the way the ext4 extent format is still
> > capped at 128M per extent right, even if they're contiguous.  Which led
> > me to the little thought experiment about hm, is this 1G file 1 extent or
> > 8 and what should fiemap return...
> 
> The current FIEMAP code will return 8 extents, which mirrors the number
> of physical extents on disk.  This could be important to the person
> looking at the file, as it will be clear that there are too many
> extents to fit into the inode and the extent tree will include an index
> block.
> 
> I was a bit on the fence about this, but I agree with David that if
> userspace has more information (e.g. all of the on-disk extent data)
> it can always merge it, but it can't "un-merge" the data returned
> from the kernel.

Ok, wouldn't it be appropriate to include the extent of the index block
as well, somewhere, so that userspace can see which parts of the disk
must be read to read the file?

(As you say, more information...)

-- Jamie

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-02 14:28             ` Jamie Lokier
@ 2008-07-02 21:20               ` Mark Fasheh
  2008-07-03 14:45                 ` Jamie Lokier
  0 siblings, 1 reply; 70+ messages in thread
From: Mark Fasheh @ 2008-07-02 21:20 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Andreas Dilger, Eric Sandeen, linux-fsdevel, Andreas Dilger,
	Kalpak Shah, Josef Bacik

On Wed, Jul 02, 2008 at 03:28:54PM +0100, Jamie Lokier wrote:
> Ok, wouldn't it be appropriate to include the extent of the index block
> as well, somewhere, so that userspace can see which parts of the disk
> must be read to read the file?

Mapping meta data doesn't belong in fiemap - we can come up with a dedicated
interface for that at a future date.
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-02 21:20               ` Mark Fasheh
@ 2008-07-03 14:45                 ` Jamie Lokier
  0 siblings, 0 replies; 70+ messages in thread
From: Jamie Lokier @ 2008-07-03 14:45 UTC (permalink / raw)
  To: Mark Fasheh
  Cc: Andreas Dilger, Eric Sandeen, linux-fsdevel, Andreas Dilger,
	Kalpak Shah, Josef Bacik

Mark Fasheh wrote:
> On Wed, Jul 02, 2008 at 03:28:54PM +0100, Jamie Lokier wrote:
> > Ok, wouldn't it be appropriate to include the extent of the index block
> > as well, somewhere, so that userspace can see which parts of the disk
> > must be read to read the file?
> 
> Mapping meta data doesn't belong in fiemap - we can come up with a dedicated
> interface for that at a future date.

I agree, but it seems inconsistent with the xattr / directory thing.

Because that *is* mapping metadata.  They may have no data blocks;
it's been suggested they return extents covering metadata locations only.

It seems the same use-cases where you might want metadata location
from xattrs and directories apply equally to metadata locations for a
file.  I.e. knowing which parts of a disk will be accessed.

-- Jamie

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-25 22:18 [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2 Mark Fasheh
                   ` (2 preceding siblings ...)
  2008-06-26 14:03 ` Eric Sandeen
@ 2008-06-26 14:04 ` Dave Kleikamp
  2008-06-26 14:15   ` Eric Sandeen
  2008-06-26 17:01   ` Andreas Dilger
  2008-07-03 14:37 ` jim owens
  4 siblings, 2 replies; 70+ messages in thread
From: Dave Kleikamp @ 2008-06-26 14:04 UTC (permalink / raw)
  To: Mark Fasheh
  Cc: linux-fsdevel, Andreas Dilger, Kalpak Shah, Eric Sandeen,
	Josef Bacik

> 	Firstly, most behavior-changing fm_flags have been removed. We're
> left with SYNC and XATTR now. This is a very good thing because frankly, I
> think fiemap should be targeted as a straight-forward and relatively
> uncomplicated API for exposing extents as they appear on disk. Think "one
> notch above extent-based FIBMAP replacement". There's a flip side to this -
> 'complicated' file systems should be free to implement their own
> complementary ioctls where there is a unique need that FIEMAP does not
> address. Things like non-trivial device mappings, encryption specifics
> (beyond 'this extent is encrypted'), don't belong here.

Keeping the SYNC and XATTR flags seems contradictory to above statement.
If more complicated filesystems are encouraged to implement their own
ioctls for non-generic things, then why does the new syscall need to
support xfs-specific things so that you can supersede its existing
ioctl?

Honestly, I can see XATTR used generically, even though most filesystems
don't store the XATTR as a tree.  (jfs stores it in a single extent.)
SYNC really doesn't look like it belongs, and it's only there so that
the new ioctl acts like the xfs ioctl.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-26 14:04 ` Dave Kleikamp
@ 2008-06-26 14:15   ` Eric Sandeen
  2008-06-26 14:27     ` Dave Kleikamp
  2008-06-26 17:01   ` Andreas Dilger
  1 sibling, 1 reply; 70+ messages in thread
From: Eric Sandeen @ 2008-06-26 14:15 UTC (permalink / raw)
  To: Dave Kleikamp
  Cc: Mark Fasheh, linux-fsdevel, Andreas Dilger, Kalpak Shah,
	Josef Bacik

Dave Kleikamp wrote:
>> 	Firstly, most behavior-changing fm_flags have been removed. We're
>> left with SYNC and XATTR now. This is a very good thing because frankly, I
>> think fiemap should be targeted as a straight-forward and relatively
>> uncomplicated API for exposing extents as they appear on disk. Think "one
>> notch above extent-based FIBMAP replacement". There's a flip side to this -
>> 'complicated' file systems should be free to implement their own
>> complementary ioctls where there is a unique need that FIEMAP does not
>> address. Things like non-trivial device mappings, encryption specifics
>> (beyond 'this extent is encrypted'), don't belong here.
> 
> Keeping the SYNC and XATTR flags seems contradictory to above statement.
> If more complicated filesystems are encouraged to implement their own
> ioctls for non-generic things, then why does the new syscall need to
> support xfs-specific things so that you can supersede its existing
> ioctl?

I don't think either of these are xfs-specific at all.

> Honestly, I can see XATTR used generically, even though most filesystems
> don't store the XATTR as a tree.  (jfs stores it in a single extent.)

That's fine, so, you return that one extent.  I'm not sure what it has
to do with whether or not it's a tree?

> SYNC really doesn't look like it belongs, and it's only there so that
> the new ioctl acts like the xfs ioctl.

I disagree, while it may have been inspired by the xfs behavior, it's
not at all xfs specific.

If a filesystem implements delalloc, you may want to know which ranges
are still delalloc in the fiemap output, or you may want to put them on
disk and know the actual physical location.  And if you want a snapshot
of an actual, consistent layout of the file at a point in time, then you
need an atomic sync+map - for any filesystem.
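
To make the two choices concrete, here is a sketch of how a caller might
build the request buffer with or without the sync flag. The layout and the
flag values are assumptions taken from the fiemap header as later merged in
mainline (a 32-byte request header followed by 56-byte extent records); this
patch round may differ. The helper name is hypothetical, and the buffer is
only built here, not passed to the ioctl.

```python
import struct

# Assumed values from the mainline fiemap header; may not match this patch round.
FIEMAP_FLAG_SYNC = 0x1    # flush the file before mapping
FIEMAP_FLAG_XATTR = 0x2   # map the extended attribute area instead of file data

# fm_start, fm_length (u64); fm_flags, fm_mapped_extents,
# fm_extent_count, fm_reserved (u32).
FIEMAP_HEADER = struct.Struct("=QQIIII")
FIEMAP_EXTENT_SIZE = 56   # assumed size of one extent record


def build_fiemap_request(start: int, length: int, flags: int,
                         extent_count: int) -> bytearray:
    """Return a zeroed request buffer with room for extent_count extents."""
    buf = bytearray(FIEMAP_HEADER.size + FIEMAP_EXTENT_SIZE * extent_count)
    FIEMAP_HEADER.pack_into(buf, 0, start, length, flags, 0, extent_count, 0)
    return buf
```

A caller that wants to see which ranges are still delalloc would pass flags
of 0; one that wants them pushed to disk first would pass FIEMAP_FLAG_SYNC.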

(this is all assuming that this is not just a bog-simple "mapping only"
interface, per my other email...)

-Eric

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-26 14:15   ` Eric Sandeen
@ 2008-06-26 14:27     ` Dave Kleikamp
  2008-07-02 23:48       ` jim owens
  0 siblings, 1 reply; 70+ messages in thread
From: Dave Kleikamp @ 2008-06-26 14:27 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Mark Fasheh, linux-fsdevel, Andreas Dilger, Kalpak Shah,
	Josef Bacik

On Thu, 2008-06-26 at 09:15 -0500, Eric Sandeen wrote:
> Dave Kleikamp wrote:
> >> 	Firstly, most behavior-changing fm_flags have been removed. We're
> >> left with SYNC and XATTR now. This is a very good thing because frankly, I
> >> think fiemap should be targeted as a straight-forward and relatively
> >> uncomplicated API for exposing extents as they appear on disk. Think "one
> >> notch above extent-based FIBMAP replacement". There's a flip side to this -
> >> 'complicated' file systems should be free to implement their own
> >> complementary ioctls where there is a unique need that FIEMAP does not
> >> address. Things like non-trivial device mappings, encryption specifics
> >> (beyond 'this extent is encrypted'), don't belong here.
> > 
> > Keeping the SYNC and XATTR flags seems contradictory to above statement.
> > If more complicated filesystems are encouraged to implement their own
> > ioctls for non-generic things, then why does the new syscall need to
> > support xfs-specific things so that you can supersede its existing
> > ioctl?
> 
> I don't think either of these are xfs-specific at all.
> 
> > Honestly, I can see XATTR used generically, even though most filesystems
> > don't store the XATTR as a tree.  (jfs stores it in a single extent.)
> 
> That's fine, so, you return that one extent.  I'm not sure what it has
> to do with whether or not it's a tree?

Okay.  I'm fine with that.

> > SYNC really doesn't look like it belongs, and it's only there so that
> > the new ioctl acts like the xfs ioctl.
> 
> I disagree, while it may have been inspired by the xfs behavior, it's
> not at all xfs specific.
> 
> If a filesystem implements delalloc, you may want to know which ranges
> are still delalloc in the fiemap output, or you may want to put them on
> disk and know the actual physical location.  And if you want a snapshot
> of an actual, consistent layout of the file at a point in time, then you
> need an atomic sync+map - for any filesystem.

This makes sense.  In fact, I could see always doing the sync if there
are delalloc blocks to ensure that the location of the blocks will
always be returned.

> (this is all assuming that this is not just a bog-simple "mapping only"
> interface, per my other email...)

I guess I was put off by Andreas' response that FIEMAP_FLAG_SYNC is
there because xfsbmap had it "isn't harmful either".  This seemed a bit
weak, but I see that there is a better justification than just that.
-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-26 14:27     ` Dave Kleikamp
@ 2008-07-02 23:48       ` jim owens
  2008-07-03 11:17         ` Dave Chinner
  0 siblings, 1 reply; 70+ messages in thread
From: jim owens @ 2008-07-02 23:48 UTC (permalink / raw)
  To: linux-fsdevel

I'm back from vacation and ready to cause fiemap() trouble.

Dave Kleikamp wrote:
> On Thu, 2008-06-26 at 09:15 -0500, Eric Sandeen wrote:
> 
>>>SYNC really doesn't look like it belongs, and it's only there so that
>>>the new ioctl acts like the xfs ioctl.
>>
>>I disagree, while it may have been inspired by the xfs behavior, it's
>>not at all xfs specific.
>>
>>If a filesystem implements delalloc, you may want to know which ranges
>>are still delalloc in the fiemap output, or you may want to put them on
>>disk and know the actual physical location.  And if you want a snapshot
>>of an actual, consistent layout of the file at a point in time, then you
>>need an atomic sync+map - for any filesystem.
> 
> This makes sense.  In fact, I could see always doing the sync if there
> are delalloc blocks to ensure that the location of the blocks will
> always be returned.
> 
> I guess I was put off by Andreas' response that FIEMAP_FLAG_SYNC is
> there because xfsbmap had it "isn't harmful either".  This seemed a bit
> weak, but I see that there is a better justification than just that.

I say IT IS HARMFUL to have the FIEMAP_FLAG_SYNC.

The email trail points out how this so-called atomic sync+map
will lead programmers to write bad code because it leads them
to think there is some valuable guarantee of consistency by
using the SYNC flag.  This is not true.

The fiemap by itself is equivalent in all cases to reading
multiple disk blocks, while someone else is writing some
random subset of the same blocks.  You have data, but it is
not a clean "before" or "after" picture.

The only way to get a true useful snapshot is to have a
set of commands doing:
    freeze_metadata()
    read_metadata()
    ... userspace operate on metadata ...
    unfreeze_metadata()

If you are going to define fiemap to have an internal
freeze_metadata(), then I say that is even MORE HARMFUL
because it makes every (de)allocate/(de)compress/move
code path take a giant lock just so fiemap can get a
static picture that encompasses all in-range extents.

And that static picture can be invalid the moment the
giant lock is released.

jim

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-02 23:48       ` jim owens
@ 2008-07-03 11:17         ` Dave Chinner
  2008-07-03 12:11           ` jim owens
  2008-07-03 12:21           ` jim owens
  0 siblings, 2 replies; 70+ messages in thread
From: Dave Chinner @ 2008-07-03 11:17 UTC (permalink / raw)
  To: jim owens; +Cc: linux-fsdevel

On Wed, Jul 02, 2008 at 07:48:07PM -0400, jim owens wrote:
> I'm back from vacation and ready to cause fiemap() trouble.
>
> Dave Kleikamp wrote:
>> On Thu, 2008-06-26 at 09:15 -0500, Eric Sandeen wrote:
>>
>>>> SYNC really doesn't look like it belongs, and it's only there so that
>>>> the new ioctl acts like the xfs ioctl.
>>>
>>> I disagree, while it may have been inspired by the xfs behavior, it's
>>> not at all xfs specific.
>>>
>>> If a filesystem implements delalloc, you may want to know which ranges
>>> are still delalloc in the fiemap output, or you may want to put them on
>>> disk and know the actual physical location.  And if you want a snapshot
>>> of an actual, consistent layout of the file at a point in time, then you
>>> need an atomic sync+map - for any filesystem.
>>
>> This makes sense.  In fact, I could see always doing the sync if there
>> are delalloc blocks to ensure that the location of the blocks will
>> always be returned.
>>
>> I guess I was put off by Andreas' response that FIEMAP_FLAG_SYNC is
>> there because xfsbmap had it "isn't harmful either".  This seemed a bit
>> weak, but I see that there is a better justification than just that.
>
> I say IT IS HARMFUL to have the FIEMAP_FLAG_SYNC.
>
> The email trail points out how this so-called atomic sync+map
> will lead programmers to write bad code because it leads them
> to think there is some valuable guarantee of consistency by
> using the SYNC flag.  This is not true.

xfs_bmap provides an atomic sync and mapping. If the
FIEMAP_FLAG_SYNC is pushed down to the filesystem, then XFS
and all other filesystems can provide that same atomicity if
desired.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-03 11:17         ` Dave Chinner
@ 2008-07-03 12:11           ` jim owens
  2008-07-03 22:51             ` Dave Chinner
  2008-07-03 12:21           ` jim owens
  1 sibling, 1 reply; 70+ messages in thread
From: jim owens @ 2008-07-03 12:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel

Dave Chinner wrote:

> xfs_bmap provides an atomic sync and mapping. If the
> FIEMAP_FLAG_SYNC is pushed down to the filesystem, then XFS
> and all other filesystems can provide that same atomicity if
> desired.

That is exactly what I was afraid of.

We are back to the "because XFS has it" argument.

But many other filesystems won't be able to provide
atomicity without normal operation performance being
reduced.  And I say we don't want fiemap to hurt
normal operation so fiemap should not impose even
an implied need for atomicity because programmers
will expect it and code for it.

If XFS users want atomic SYNC, they can use xfs_bmap,
or if XFS wants, it can always sync in its fiemap.

jim

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-03 12:11           ` jim owens
@ 2008-07-03 22:51             ` Dave Chinner
  2008-07-04  8:31               ` Andreas Dilger
                                 ` (3 more replies)
  0 siblings, 4 replies; 70+ messages in thread
From: Dave Chinner @ 2008-07-03 22:51 UTC (permalink / raw)
  To: jim owens; +Cc: linux-fsdevel

On Thu, Jul 03, 2008 at 08:11:38AM -0400, jim owens wrote:
> Dave Chinner wrote:
>
>> xfs_bmap provides an atomic sync and mapping. If the
>> FIEMAP_FLAG_SYNC is pushed down to the filesystem, then XFS
>> and all other filesystems can provide that same atomicity if
>> desired.
>
> That is exactly what I was afraid of.

Why? Have you ever used an extent mapping interface before?

> We are back to the "because XFS has it" argument.

Given that one of the original desires was to have the fiemap ioctl
*replace the XFS ioctl*, I'm kinda getting sick of people saying "we
don't want this because it's only an XFS feature" despite the
fact that what we are doing here is re-implementing long standing XFS
interfaces. XFS is going to have to maintain two extent mapping
interfaces forever more because we don't have everything in fiemap
that we need to replace the XFS ioctl.

The point of this SYNC flag is to ensure that you get nothing other
than blocks mapped to disk - no delalloc regions, etc. The only sane
way to do that is an atomic 'sync+map' operation. This is not a
filesystem specific feature - it's what the SYNC flag should be
defined as providing.

The fact that it's only implemented in XFS right now has absolutely
*zero* consideration in determining this feature is necessary or
not. The fact that the only existing extent mapping interface in
Linux already defines it this way and it is in use by existing
userspace utilities that *expect this semantic* is much, much more
important.

Speaking of which, some of the features in fiemap are currently
OCFS2 specific, some that are Lustre(!) specific or not even used
by current filesystems or implementations of fiemap. However,
nobody is complaining that they should be ripped out just because
they are only implemented in filesystem X....

FWIW, if ext4 had this atomic sync+map (which it could do), would
you still be complaining about it?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-03 22:51             ` Dave Chinner
@ 2008-07-04  8:31               ` Andreas Dilger
  2008-07-04 12:13               ` Jamie Lokier
                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 70+ messages in thread
From: Andreas Dilger @ 2008-07-04  8:31 UTC (permalink / raw)
  To: jim owens, linux-fsdevel

On Jul 04, 2008  08:51 +1000, Dave Chinner wrote:
> The fact that it's only implemented in XFS right now has absolutely
> *zero* consideration in determining this feature is necessary or
> not. The fact that the only existing extent mapping interface in
> Linux already defines it this way and it is in use by existing
> userspace utilities that *expect this semantic* is much, much more
> important.
> 
> FWIW, if ext4 had this atomic sync+map (which it could do), would
> you still be complaining about it?

Actually, the ext4 ->fiemap() method DOES grab a lock that will prevent
the file layout changing during the mapping, assuming the extent array
is large enough.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-03 22:51             ` Dave Chinner
  2008-07-04  8:31               ` Andreas Dilger
@ 2008-07-04 12:13               ` Jamie Lokier
  2008-07-07  7:40                 ` Dave Chinner
  2008-07-07 21:16               ` jim owens
  2008-07-07 22:02               ` jim owens
  3 siblings, 1 reply; 70+ messages in thread
From: Jamie Lokier @ 2008-07-04 12:13 UTC (permalink / raw)
  To: jim owens, linux-fsdevel

Dave Chinner wrote:
> The point of this SYNC flag is to ensure that you get nothing other
> than blocks mapped to disk - no delalloc regions, etc. The only sane
> way to do that is an atomic 'sync+map' operation. This is not a
> filesystem specific feature - it's what the SYNC flag should be
> defined as providing.

Wait a minute.

I think Jim, and you Dave, have imagined different use-cases
for FIEMAP - and that's the reason for this difference of opinion.

The two use-cases are:

    1. To get a detailed fragmentation report, which is guidance (and
       can only be guidance: it may be invalid the moment it's returned).

    2. To get a block mapping suitable for _reading_ those blocks from
       the physical device directly (e.g. LILO).

For 1, atomic 'sync+map' does make sense, if you want the report to
not have any delalloc extents, and you want to operate on files which
are being modified by other processes.

For 2, Jim appears to be correct that atomic 'sync+map' is not useful.
You can only read blocks if the mapping remains stable after returning
it, which means the application _must_ ensure no process is modifying
the file, and that it's on a filesystem which doesn't arbitrarily move
blocks when it feels like it.  Given that,
'make_sure_nothing_modifies; atomic(sync + map); read data;
ok_you_can_modify' is no different from 'make_sure_nothing_modifies;
fsync(); map; read data; ok_you_can_modify'.

> The fact that it's only implemented in XFS right now has absolutely
> *zero* consideration in determining this feature is necessary or
> not.

You're right, but that's not what Jim's arguing.  He's saying the
feature isn't necessary since it provides no _dependable_ semantic
guarantees, and therefore arguments to keep it are for legacy
compatibility alone.  That may be reason enough to keep it, though.

However, he's mistaken.  You've explained that it does provide a
guarantee: the resulting map will be valid for a consistent snapshot
of the file at some instant in time during the FIEMAP call.  In other
words, with concurrent modifiers, atomic sync+map ensures no delalloc
regions (is there anything else?) in the map, while fsync() + map gets
close but does not ensure it.  But either way, with concurrent
modifiers, you can only use the result for guidance, in a
fragmentation report, so is preventing delalloc regions actually
useful?  Maybe it is.  It would be good to see an example, though.

Dave, can you give an actual situation where you have seen atomic
'sync+map' used with XFS where it is necessary for an application to
behave correctly?  I'm having trouble thinking of one, other than "the
current app code doesn't know what to do with a delalloc extent".

I'm thinking when those programs are updated to use the new interface,
wouldn't it be better to update them to handle delalloc extents
(treating them as "region unknown" in fragmentation reports), because
some filesystems won't support atomic sync+map anyway?
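
A sketch of what that might look like in a reporting tool, assuming each
extent arrives as (logical, physical, length, flags) in bytes and that
delalloc extents carry a FIEMAP_EXTENT_DELALLOC bit. The value 0x4 is taken
from the header as later merged in mainline and may not match this patch
round; the function name and the report fields are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

FIEMAP_EXTENT_DELALLOC = 0x4  # assumed flag value, see note above

# (logical_offset, physical_offset, length, flags)
Extent = Tuple[int, int, int, int]


def fragmentation_report(extents: List[Extent]) -> Dict[str, int]:
    """Count on-disk fragments, treating delalloc ranges as 'region unknown'."""
    fragments = 0
    mapped_bytes = 0
    unknown_bytes = 0
    prev_end: Optional[int] = None  # physical end of the previous mapped extent
    for _logical, physical, length, flags in extents:
        if flags & FIEMAP_EXTENT_DELALLOC:
            # No physical location yet: account for it separately and
            # do not assume contiguity across it.
            unknown_bytes += length
            prev_end = None
            continue
        mapped_bytes += length
        if prev_end != physical:
            fragments += 1
        prev_end = physical + length
    return {
        "fragments": fragments,
        "mapped_bytes": mapped_bytes,
        "unknown_bytes": unknown_bytes,
    }
```

The report stays meaningful without any sync: the delalloc ranges are simply
listed as bytes whose placement is not yet known.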

Finally, if the real intent here is "the returned map for
fragmentation report shall not include any delalloc extents", perhaps
that should be the request flag instead?  There are other ways to
ensure that which don't require blocking concurrent modifications, for
potentially a significant time (esp. block-based filesystems).  We
like lock-free algorithms these days, if the results are suitable.

-- Jamie

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-04 12:13               ` Jamie Lokier
@ 2008-07-07  7:40                 ` Dave Chinner
  2008-07-07 16:53                   ` Jamie Lokier
  0 siblings, 1 reply; 70+ messages in thread
From: Dave Chinner @ 2008-07-07  7:40 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: jim owens, linux-fsdevel

On Fri, Jul 04, 2008 at 01:13:25PM +0100, Jamie Lokier wrote:
> Dave Chinner wrote:
> > The point of this SYNC flag is to ensure that you get nothing other
> > than blocks mapped to disk - no delalloc regions, etc. The only sane
> > way to do that is an atomic 'sync+map' operation. This is not a
> > filesystem specific feature - it's what the SYNC flag should be
> > defined as providing.
> 
> Wait a minute.
> 
> I think Jim, and you Dave, have imagined different use-cases
> for FIEMAP - and that's the reason for this difference of opinion.
> 
> The two use-cases are:
> 
>     1. To get a detailed fragmentation report, which is guidance (and
>        can only be guidance: it may be invalid the moment it's returned).
> 
>     2. To get a block mapping suitable for _reading_ those blocks from
>        the physical device directly (e.g. LILO).
> 
> For 1, atomic 'sync+map' does make sense, if you want the report to
> not have any delalloc extents, and you want to operate on files which
> are being modified by other processes.
> 
> For 2, Jim appears to be correct that atomic 'sync+map' is not useful.
> You can only read blocks if the mapping remains stable after returning
> it, which means the application _must_ ensure no process is modifying
> the file, and that it's on a filesystem which doesn't arbitrarily move
> blocks when it feels like it.  Given that,
> 'make_sure_nothing_modifies; atomic(sync + map); read data;
> ok_you_can_modify' is no different from 'make_sure_nothing_modifies;
> fsync(); map; read data; ok_you_can_modify'.

Like:

# xfs_freeze -f <mntpt>
# xfs_bmap -vvp <file>
# <do something nasty with direct block access>
# xfs_freeze -u <mntpt>

> You've explained that it does provide a
> guarantee: the resulting map will be valid for a consistent snapshot
> of the file at some instant in time during the FIEMAP call.  In other
> words, with concurrent modifiers, atomic sync+map ensures no delalloc
> regions (is there anything else?) in the map, while fsync() + map gets
> close but does not ensure it.

Synchronisation with direct I/O, ensures unwritten extent conversion
completion with concurrent async direct I/O before mapping, space
preallocation, etc.

> Dave, can you give an actual situation where you have seen atomic
> 'sync+map' used with XFS where it is necessary for an application to
> behave correctly?

The only applications that use the XFS ioctls are
xfs utilities, and they tend to work around the assumption
that the mapping operation returns a consistent map at the
time the call was made.

> I'm having trouble thinking of one, other than "the
> current app code doesn't know what to do with a delalloc extent".

No, the XFS utilities want to know mappings, not delalloc extents -
i.e. they want to know where on disk stuff is, not where in memory
it is.  That being said, there have been times when I've wanted to
know what ranges of the file were on disk or in memory when
analysing problems...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-07  7:40                 ` Dave Chinner
@ 2008-07-07 16:53                   ` Jamie Lokier
  2008-07-07 22:51                     ` Dave Chinner
  0 siblings, 1 reply; 70+ messages in thread
From: Jamie Lokier @ 2008-07-07 16:53 UTC (permalink / raw)
  To: jim owens, linux-fsdevel

Dave Chinner wrote:
> Like:
> 
> # xfs_freeze -f <mntpt>
> # xfs_bmap -vvp <file>
> # <do something nasty with direct block access>
> # xfs_freeze -u <mntpt>

^^^ Oh, exactly the sort of thing which led to this quote from Andreas
Dilger on why FIEMAP is *not* suitable for this, on generic filesystems :-)

    "EEEEEK [...] Directly writing underneath a filesystem is major
    bad news and will likely corrupt the filesystem because [page
    cache reasons]."

Besides, if you're using the currently-XFS-specific freeze capability,
a simple fsync() before the xfs_bmap, inside the freeze, will be
sufficient won't it?

-- Jamie

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-07 16:53                   ` Jamie Lokier
@ 2008-07-07 22:51                     ` Dave Chinner
  0 siblings, 0 replies; 70+ messages in thread
From: Dave Chinner @ 2008-07-07 22:51 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: jim owens, linux-fsdevel

On Mon, Jul 07, 2008 at 05:53:54PM +0100, Jamie Lokier wrote:
> Dave Chinner wrote:
> > Like:
> > 
> > # xfs_freeze -f <mntpt>
> > # xfs_bmap -vvp <file>
> > # <do something nasty with direct block access>
> > # xfs_freeze -u <mntpt>
> 
> ^^^ Oh, exactly the sort of thing which led to this quote from Andreas
> Dilger on why FIEMAP is *not* suitable for this, on generic filesystems :-)
> 
>     "EEEEEK [...] Directly writing underneath a filesystem is major
>     bad news and will likely corrupt the filesystem because [page
>     cache reasons]."

Yes - that's why I said "Do something nasty". IIRC, grub does
exactly this to try to work around the fact that it mixes raw
disk access with mounted filesystems....

FWIW, I've used the output of xfs_bmap for direct block *read*
access in the past - it's very handy for tracking down corruption
problems as a result of hardware misdirecting writes.

> Besides, if you're using the currently-XFS-specific freeze capability,

<sigh>

Freeze is not XFS specific.

http://marc.info/?l=linux-fsdevel&m=121482849815436&w=2
http://marc.info/?l=linux-kernel&m=121511426713644&w=2

> a simple fsync() before the xfs_bmap, inside the freeze, will be
> sufficient won't it?

The freeze makes sync redundant. 

[ As an aside, not many people around here seem to grok what
'freezing the filesystem' really means. It means 'put the filesystem
in a 100% consistent state on disk and prevent further modification
until unfrozen'. It provides guarantees that sync doesn't as sync
only needs to guarantee the filesystem is in a *recoverable* state
on disk at a single point in time. ]

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-03 22:51             ` Dave Chinner
  2008-07-04  8:31               ` Andreas Dilger
  2008-07-04 12:13               ` Jamie Lokier
@ 2008-07-07 21:16               ` jim owens
  2008-07-08  3:01                 ` Dave Chinner
  2008-07-07 22:02               ` jim owens
  3 siblings, 1 reply; 70+ messages in thread
From: jim owens @ 2008-07-07 21:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel

Dave Chinner wrote:

> Why? Have you ever used an extent mapping interface before?

YES - for the past 9 years in Tru64 unix.

Interfaces that were used by tools to defragment, print extmaps,
capture metadata for support, scan for fs errors, perform backups
with directIO through the fs, move file extents between devices,
allow a stacked cluster filesystem to read/write directly to
the raw device from other Tru64 nodes, allow an IBM Tivoli client
to read/write directly to the raw SAN storage from anywhere,
(and probably some uses I've forgotten).

> Given that one of the original desires was to have the fiemap ioctl
> *replace the XFS ioctl*, I'm kinda getting sick of people saying "we
> don't want this because it's only an XFS feature" despite the
> fact that what we are doing here is re-implementing long-standing XFS
> interfaces.

I don't have a problem with a feature that is implemented
for XFS if we can properly define the use of that feature.

I have been saying since I started talking on this forum that
I'm a real Linux newbie.  But I have been doing other kernels
so long they now add "old" to my "mean bastard" designation :)

What I have been trying to point out in the fiemap discussion
is all learned from past mistakes.  The words I'm using may be
part of the problem, or maybe people are too set on defending
an existing implementation to think it through.

So please think carefully about what I say in the next
emails about the design issues and consider that maybe
XFS made a mistake or two along the way.  If the mistake
is just that I don't understand it, please explain.

jim

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-07 21:16               ` jim owens
@ 2008-07-08  3:01                 ` Dave Chinner
  0 siblings, 0 replies; 70+ messages in thread
From: Dave Chinner @ 2008-07-08  3:01 UTC (permalink / raw)
  To: jim owens; +Cc: linux-fsdevel

On Mon, Jul 07, 2008 at 05:16:50PM -0400, jim owens wrote:
> Dave Chinner wrote:
>
>> Why? Have you ever used an extent mapping interface before?
>
> YES - for the past 9 years in Tru64 unix.
>
> Interfaces that were used by tools to defragment, print extmaps,
> capture metadata for support,
> perform backups
> with directIO through the fs, move file extents between devices,

Those are simple 'give me a mapping' cases - syncing first
is appropriate for all those cases. Making it atomic so you only
get blocks mapped to disk is also appropriate for all these
cases....

> scan for fs errors,

Deep filesystem specific knowledge is needed to do that properly, so
it's not really a use case we've considered for a generic fiemap
API.

> allow a stacked cluster filesystem to read/write directly to
> the raw device from other Tru64 nodes,

Oh, that's just gross.

> allow an IBM Tivoli client
> to read/write directly to the raw SAN storage from anywhere,
> (and probably some uses I've forgot).

yes, that's a fairly common thing to do from a read-only snapshot
of the filesystem. IOWs, the filesystem image is not changing so

	- block mapping is stable
	- no concurrent modifications

The application has provided sane access synchronisation to the
blocks, so it's not needed in fiemap. Great - you're an expert in
storage and you know that your application doesn't need to use
FIEMAP_FLAG_SYNC....

> What I have been trying to point out in the fiemap discussion
> is all learned from past mistakes. 

You're saying that like it's a bad thing. Learning from mistakes
is how we improve things.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-03 22:51             ` Dave Chinner
                                 ` (2 preceding siblings ...)
  2008-07-07 21:16               ` jim owens
@ 2008-07-07 22:02               ` jim owens
  2008-07-09  2:03                 ` Jamie Lokier
  3 siblings, 1 reply; 70+ messages in thread
From: jim owens @ 2008-07-07 22:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel

Dave Chinner wrote:

> The point of this SYNC flag is to ensure that you get nothing other
> than blocks mapped to disk - no delalloc regions, etc. The only sane
> way to do that is an atomic 'sync+map' operation. This is not a
> filesystem specific feature - it's what the SYNC flag should be
> defined as providing.

If the real need is to force allocation then the flag should
be something like FIEMAP_FLAG_ALLOC and not need to do fsync
or any data flush, just ensure there is assigned storage.

> Linux already defines it this way and it is in use by existing
> userspace utilities that *expect this semantic* is much, much more
> important.

From you and Anton, I understand the only critical semantic
is to never get back a delalloc from xfs.  But, I still don't
see the critical need when in a later email you say:

> The only application that uses the XFS ioctls are
> xfs utilities, and they tend to work around the assumption
> that the mapping operation returns a consistent map at the
> time the call was made.

So... the situation is you are saying you must keep it
the same for the utilities that are XFS-specific and that
you must change anyway to use fiemap.  As Jamie said, just
put in the code to skip any unknown extents.
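
A minimal sketch of what "skip any unknown extents" could look like in a
tool, assuming the struct fiemap_extent layout and FIEMAP_EXTENT_* flag
names from the mainline <linux/fiemap.h> header (which may differ from the
patch set under discussion here). keep_mapped_extents() is a hypothetical
helper for illustration, not code from any XFS utility.

```c
#include <stddef.h>
#include <linux/fiemap.h>   /* struct fiemap_extent, FIEMAP_EXTENT_* */

/* Flags meaning "no usable on-disk location yet". */
#define NO_DISK_LOCATION (FIEMAP_EXTENT_UNKNOWN | FIEMAP_EXTENT_DELALLOC)

/*
 * Copy the extents that have a known physical location from in[] to out[]
 * (out must have room for n entries) and return how many were kept.
 * Hypothetical helper for this sketch only.
 */
size_t keep_mapped_extents(const struct fiemap_extent *in, size_t n,
                           struct fiemap_extent *out)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        if (in[i].fe_flags & NO_DISK_LOCATION)
            continue;               /* delalloc or otherwise unplaced: skip */
        out[kept++] = in[i];
    }
    return kept;
}
```

A tool written this way never needs the kernel to force allocation; it just
reports or processes the extents that already have a location.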

Before replying, wait for the next email that says why
I think those utilities have semantic problems too.

jim

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-07 22:02               ` jim owens
@ 2008-07-09  2:03                 ` Jamie Lokier
  0 siblings, 0 replies; 70+ messages in thread
From: Jamie Lokier @ 2008-07-09  2:03 UTC (permalink / raw)
  To: jim owens; +Cc: Dave Chinner, linux-fsdevel

jim owens wrote:
> Dave Chinner wrote:
> 
> >The point of this SYNC flag is to ensure that you get nothing other
> >than blocks mapped to disk - no delalloc regions, etc. The only sane
> >way to do that is an atomic 'sync+map' operation. This is not a
> >filesystem specific feature - it's what the SYNC flag should be
> >defined as providing.
> 
> If the real need is to force allocation then the flag should
> be something like FIEMAP_FLAG_ALLOC and not need to do fsync
> or any data flush, just ensure there is assigned storage.

See also the huge time for fsync on ext3 in some circumstances (and
why they had to patch Firefox 3 because of it).  On ext3,
FIEMAP_FLAG_ALLOC would be a no-op, but sync can take a long time.

If the utilities using FIEMAP just need "no delalloc extents", they
really should use an ALLOC flag, if only because forcing writeback,
which may take a long time, is not what they are trying to accomplish.

-- Jamie

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-03 11:17         ` Dave Chinner
  2008-07-03 12:11           ` jim owens
@ 2008-07-03 12:21           ` jim owens
  2008-07-03 12:42             ` Andi Kleen
  2008-07-04 20:32             ` Anton Altaparmakov
  1 sibling, 2 replies; 70+ messages in thread
From: jim owens @ 2008-07-03 12:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel

Dave Chinner wrote:

> xfs_bmap provides an atomic sync and mapping.

By the way, I still fail to see how doing fiemap-with-SYNC
in XFS has any more value than doing fsync(), fiemap().

In both cases the returned extent information is only
guaranteed if there are 0 other threads changing the file.

(Well OK, it is 1 less system call being made)

jim

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-03 12:21           ` jim owens
@ 2008-07-03 12:42             ` Andi Kleen
  2008-07-04 20:32             ` Anton Altaparmakov
  1 sibling, 0 replies; 70+ messages in thread
From: Andi Kleen @ 2008-07-03 12:42 UTC (permalink / raw)
  To: jim owens; +Cc: Dave Chinner, linux-fsdevel

jim owens <jowens@hp.com> writes:
>
> (Well OK, it is 1 less system call being made)

In Linux it's normally safe to assume that system calls are reasonably
fast and you don't need to go to extreme measures to minimize
them.

-Andi

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-03 12:21           ` jim owens
  2008-07-03 12:42             ` Andi Kleen
@ 2008-07-04 20:32             ` Anton Altaparmakov
  2008-07-05 10:49               ` Jamie Lokier
  2008-07-07 23:01               ` jim owens
  1 sibling, 2 replies; 70+ messages in thread
From: Anton Altaparmakov @ 2008-07-04 20:32 UTC (permalink / raw)
  To: jim owens; +Cc: Dave Chinner, linux-fsdevel

Hi,

On 3 Jul 2008, at 13:21, jim owens wrote:
> Dave Chinner wrote:
>> xfs_bmap provides an atomic sync and mapping.
>
> By the way, I still fail to see how doing fiemap-with-SYNC
> in XFS has any more value than doing fsync(), fiemap().
>
> In both cases the returned extent information is only
> guaranteed if there are 0 other threads changing the file.

That's simply wrong.

fsync() followed by fiemap() means that another process/thread can  
write to the file in between the two system calls so that the fiemap()  
call will return things like delayed allocation regions, etc, which  
the caller may not want / may not even be able to know what to do with.

fiemap-with-SYNC does not suffer from this problem because the caller  
is guaranteed that the sync will flush everything to disk and then  
this state will be returned by the fiemap call.
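
The two call sequences being compared can be sketched as below. This
assumes the interface as it appears in the mainline <linux/fs.h> and
<linux/fiemap.h> headers (FS_IOC_FIEMAP, FIEMAP_FLAG_SYNC), which may not
match the patches in this thread exactly. map_file() and MAX_EXTENTS are
made up for the example, and whether the sync and the map are atomic
against concurrent writers is up to the filesystem.

```c
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>       /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>   /* struct fiemap, FIEMAP_FLAG_SYNC */

#define MAX_EXTENTS 32      /* arbitrary buffer size for this sketch */

/* Map the whole file; returns a malloc'd struct fiemap or NULL on error. */
struct fiemap *map_file(int fd, unsigned int flags)
{
    size_t sz = sizeof(struct fiemap) +
                MAX_EXTENTS * sizeof(struct fiemap_extent);
    struct fiemap *fm = calloc(1, sz);

    if (!fm)
        return NULL;
    fm->fm_start = 0;
    fm->fm_length = FIEMAP_MAX_OFFSET;  /* map to end of file */
    fm->fm_flags = flags;
    fm->fm_extent_count = MAX_EXTENTS;
    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
        free(fm);
        return NULL;
    }
    return fm;
}

/* Two system calls: a writer can dirty the file between them. */
struct fiemap *map_after_fsync(int fd)
{
    if (fsync(fd) < 0)
        return NULL;
    return map_file(fd, 0);
}

/* One system call: the kernel syncs the file before mapping it. */
struct fiemap *map_with_sync_flag(int fd)
{
    return map_file(fd, FIEMAP_FLAG_SYNC);
}
```

Either function can return NULL on a filesystem without FIEMAP support, so
callers need a fallback path.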

It is completely irrelevant whether the information is still valid  
after the fiemap returns.  Any application that calls fiemap and then  
goes and reads or writes those blocks on disk is totally brain damaged  
and should be sent to bitrot hell.  fiemap is about information not  
about direct access to disk by user space.  That is what O_DIRECT is  
for...  For example all you need to do as a malicious process/user is  
to catch some application that uses fiemap + then writes to disk and  
open() the file it does this on and do a truncate(0) on it whilst the  
application is writing to the disk.  With some luck and good design  
you could get the application to overwrite /etc/shadow or to read the  
ssh private key, etc at least on some file systems...

> (Well OK, it is 1 less system call being made)

That is basically irrelevant on Linux.

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-04 20:32             ` Anton Altaparmakov
@ 2008-07-05 10:49               ` Jamie Lokier
  2008-07-05 21:44                 ` Anton Altaparmakov
  2008-07-07 23:01               ` jim owens
  1 sibling, 1 reply; 70+ messages in thread
From: Jamie Lokier @ 2008-07-05 10:49 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: jim owens, Dave Chinner, linux-fsdevel

Anton Altaparmakov wrote:
> Any application that calls fiemap and then goes and reads or writes
> those blocks on disk is totally brain damaged and should be sent to
> bitrot hell.  fiemap is about information not about direct access to
> disk by user space.

So why FIEMAP_EXTENT_NO_DIRECT "Direct access to the data in this
extent is illegal or will have undefined results" in the patches?

- Jamie

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-05 10:49               ` Jamie Lokier
@ 2008-07-05 21:44                 ` Anton Altaparmakov
  0 siblings, 0 replies; 70+ messages in thread
From: Anton Altaparmakov @ 2008-07-05 21:44 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: jim owens, Dave Chinner, linux-fsdevel

On Sat, 5 Jul 2008, Jamie Lokier wrote:
> Anton Altaparmakov wrote:
> > Any application that calls fiemap and then goes and reads or writes
> > those blocks on disk is totally brain damaged and should be sent to
> > bitrot hell.  fiemap is about information not about direct access to
> > disk by user space.
> 
> So why FIEMAP_EXTENT_NO_DIRECT "Direct access to the data in this
> extent is illegal or will have undefined results" in the patches?

Because sadly there are some applications with insane requirements. lilo 
is one example and swap files in the kernel is another example.  )-:

I think this is still the wrong way to do it and grub is a better answer 
for the problem lilo has and the kernel swap could just do O_DIRECT writes 
(though file system re-entry and locking will be a problem that will need 
solving somehow) but those applications exist already and given it is a 
one-bit flag to let them know whether it is relatively safe to do direct
access we might as well have it...  I mean if you make the kernel image
only writable by the root user, and you assume the root user is not going
to modify the file without re-running lilo then doing direct read from
disk is fine as long as the file system does not do online defragmentation
or any other block-moving operations.  After all that is how
lilo works now and it is what causes the machine to fail to boot if you
replace the kernel image with a new kernel and do not re-run lilo...  It
doesn't change the fact that I think it is a crazy thing to do.  And on
some file systems the NO_DIRECT flag would be set for all files (because
they perform online defragmentation) and then things like lilo and kernel
swap would know that they cannot work on those file systems so given we 
know people will use the interface for direct access even though it is an 
evil thing to do we should IMHO have the flag to make it a little safer 
for them to do it.
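
A sketch of how a lilo-style tool might consult the per-extent flags
before trusting a mapping for raw reads. FIEMAP_EXTENT_NO_DIRECT is the
name used in these patches; the mainline <linux/fiemap.h> that I know of
carries FIEMAP_EXTENT_ENCODED and FIEMAP_EXTENT_NOT_ALIGNED instead, so
the sketch uses those. Which flags count as blockers is an assumption of
this example, and both function names are hypothetical.

```c
#include <stdbool.h>
#include <linux/fiemap.h>   /* struct fiemap_extent, FIEMAP_EXTENT_* */

/*
 * Flags that make raw block access to an extent meaningless or unsafe.
 * The selection is this sketch's choice, not something the thread settles.
 */
#define DIRECT_BLOCKERS (FIEMAP_EXTENT_UNKNOWN  | \
                         FIEMAP_EXTENT_DELALLOC | \
                         FIEMAP_EXTENT_ENCODED  | \
                         FIEMAP_EXTENT_NOT_ALIGNED)

/* True if this single extent carries none of the blocking flags. */
bool extent_allows_direct_read(const struct fiemap_extent *fe)
{
    return (fe->fe_flags & DIRECT_BLOCKERS) == 0;
}

/* True only if every extent of the file passes the check above. */
bool file_allows_direct_read(const struct fiemap_extent *ext, unsigned int n)
{
    for (unsigned int i = 0; i < n; i++)
        if (!extent_allows_direct_read(&ext[i]))
            return false;
    return true;
}
```

A passing check says nothing about whether the blocks stay put afterwards;
that still depends on the file not being rewritten or moved.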

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-04 20:32             ` Anton Altaparmakov
  2008-07-05 10:49               ` Jamie Lokier
@ 2008-07-07 23:01               ` jim owens
  2008-07-08  1:51                 ` Dave Chinner
  1 sibling, 1 reply; 70+ messages in thread
From: jim owens @ 2008-07-07 23:01 UTC (permalink / raw)
  To: Anton Altaparmakov, Dave Chinner; +Cc: linux-fsdevel

Anton Altaparmakov wrote:

> It is completely irrelevant whether the information is still valid  
> after the fiemap returns.

So if that is true, any XFS utility that does more than PRINT
the extent map based on doing JUST a fiemap is subject to
erroneous results.

I agree with everyone who says that to do useful work with
the output of fiemap, you need a set of syscall functions
that have this effect:

    mandatory_exclusive_file_lock();
      [optional] fsync(); or force_allocation();
    fiemap();
      [do ugly userspace stuff]
    release_mandatory_exclusive_file_lock();

Without the locking steps, any code that acts on the
fiemap output is just guessing, and if XFS utilities
do unlocked fiemap, it doesn't matter that they have
forced an atomic fsync, their extent map is no more
valid than the non-atomic case.  So why bother having
it allocate and sync storage (besides so you don't
have to add code to handle unknown extent types)?

Dave Chinner wrote:
> On Fri, Jul 04, 2008 at 01:13:25PM +0100, Jamie Lokier wrote:
>>You can only read blocks if the mapping remains stable after returning
>>it, which means the application _must_ ensure no process is modifying
>>the file, and that it's on a filesystem which doesn't arbitrarily move
>>blocks when it feels like it.
> 
> Like:
> 
> # xfs_freeze -f <mntpt>
> # xfs_bmap -vvp <file>
> # <do something nasty with direct block access>
> # xfs_freeze -u <mntpt>
> 
>>You've explained that it does provide a
>>guarantee: the resulting map will be valid for a consistent snapshot
>>of the file at some instant in time during the FIEMAP call.  In other
>>words, with concurrent modifiers, atomic sync+map ensures no delalloc
>>regions (is there anything else?) in the map, while fsync() + map gets
>>close but does not ensure it.
> 
> Synchronisation with direct I/O, ensures unwritten extent conversion
> completion with concurrent async direct I/O before mapping, space
> preallocation, etc.

So the sequence above seems to match my locked sequence and
only needs the fsync() instead of counting on fiemap-with-sync.

However, I will point out that the FREEZE-FILESYSTEM commands
(which I assume is your semantic as it is using <mntpt>) I am
used to using do not allow any metadata changes on the storage.
This is because the device snapshot code needs it stable.

So if xfs_bmap and fiemap() are expected to ignore freeze and
change metadata to do allocations, that is semantically incorrect too.

jim

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-07 23:01               ` jim owens
@ 2008-07-08  1:51                 ` Dave Chinner
  2008-07-08 13:02                   ` jim owens
  0 siblings, 1 reply; 70+ messages in thread
From: Dave Chinner @ 2008-07-08  1:51 UTC (permalink / raw)
  To: jim owens; +Cc: Anton Altaparmakov, linux-fsdevel

On Mon, Jul 07, 2008 at 07:01:24PM -0400, jim owens wrote:
> Anton Altaparmakov wrote:
>
>> It is completely irrelevant whether the information is still valid   
>> after the fiemap returns.
>
> So if that is true, any XFS utility that does more than PRINT
> the extent map based on doing JUST a fiemap is subject to
> erroneous results.

No, that's an incorrect conclusion.

The fact that the file can change *after* the mapping is done is
taken into account by the application. The XFS utilities assume
that the mapping is atomic, but can change after the mapping
has been taken.

Indeed, there's usually bigger issues to deal with than this e.g.
defrag has to deal with the file not changing for the entire copy
process, not just the mapping part. e.g., xfs_fsr uses atomic
primitives and inode change detection to avoid the need to exclusively
lock out all other access whilst doing the defragmentation.....

> I agree with everyone who says that to do useful work with
> the output of fiemap, you need a set of syscall functions
> that have this effect:
>
>    mandatory_exclusive_file_lock();
>      [optional] fsync(); or force_allocation();
>    fiemap();
>      [do ugly userspace stuff]
>    release_mandatory_exclusive_file_lock();

Yes, but file locking and application level synchronisation is
outside the scope of the fiemap syscall. I'm not disagreeing that
this is not needed, just that such application level synchronisation
has no direct relevance to the fiemap API.

OTOH, an atomic sync+map is relevant to fiemap as this is the only API
that can provide it. We often do stuff with atomic primitives to
avoid unnecessary and/or expensive locking and that's all this is -
an atomic mapping primitive. You may not consider it useful, but
some of us do....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-08  1:51                 ` Dave Chinner
@ 2008-07-08 13:02                   ` jim owens
  2008-07-08 14:03                     ` jim owens
  2008-07-08 14:30                     ` Theodore Tso
  0 siblings, 2 replies; 70+ messages in thread
From: jim owens @ 2008-07-08 13:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel

Dave Chinner wrote:

> Yes, but file locking and application level synchronisation is
> outside the scope of the fiemap syscall. I'm not disagreeing that
> this is not needed, just that such application level synchronisation
> has no direct relevance to the fiemap API.

At least we agree on something.  I keep talking about locking
only to say that true data consistency requires using some other
system locking mechanism around fiemap().  The relevance is that
without these other mechanisms the data must be assumed inconsistent.

> OTOH, an atomic sync+map is relevant to fiemap as this is the only API
> that can provide it. We often do stuff with atomic primitives to
> avoid unnecessary and/or expensive locking and that's all this is -
> an atomic mapping primitive. You may not consider it useful, but
> some of us do....

My objection is that I still have not heard a consistent
logical argument and set of semantics that apply to ALL
filesystems with an explanation of how a NEW tool would
use this feature.  I would be happy to have the SYNC flag
with its current semantic for XFS if we redefine it as:

* FIEMAP_B_STUPID
*    may provide a more complete extent map on some
*    filesystems at the expense of using more resources

So users understand what they really are doing and
don't think an atomic fsync provides good data.

Here is the summary of the SYNC supporters' argument:

1) XFS tools can't deal with delalloc.

2) XFS tools know returned data is crap and handle that.

3) XFS needs to obsolete and replace XFS specific api.

4) XFS tools must be recoded for #3, and already have
    extensive logic for #2 to make rational decisions
    with bad data... BUT it is impossible to have
    the XFS tools fixed so #1 is also handled, thus
    we must have the fiemap SYNC flag.

AAARRRRRGGGG!

Does anyone else see why I just don't freak'n get it!

This is the truth:  FIEMAP DATA IS UNSTABLE

NEW TOOLS MUST DEAL WITH THAT INSTABILITY.

jim

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-08 13:02                   ` jim owens
@ 2008-07-08 14:03                     ` jim owens
  2008-07-08 14:39                       ` jim owens
  2008-07-08 14:30                     ` Theodore Tso
  1 sibling, 1 reply; 70+ messages in thread
From: jim owens @ 2008-07-08 14:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel

Here is one last point that confuses me so please
explain what I have misunderstood:

- The XFS bmap api ALWAYS does SYNC, which is why
   XFS tools are coded that way to expect it.

- XFS could implement fiemap the same way to always sync.

- And the only use in XFS for NO-SYNC is that Dave might
   like to sometime see unmodified extent data.

Sounds like a good reason to confuse users with
the FIEMAP_FLAG_SYNC and make other filesystems do
extra work doesn't it?

jim

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-08 14:03                     ` jim owens
@ 2008-07-08 14:39                       ` jim owens
  0 siblings, 0 replies; 70+ messages in thread
From: jim owens @ 2008-07-08 14:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel

Assuming I am correct in at least part of my understanding
on why there is

 > * FIEMAP_FLAG_SYNC
 > If this flag is set, the kernel will sync the file before mapping extents.

I propose we replace it with the following flag that should
provide everything XFS needs:

* FIEMAP_FLAG_ANY
* Advisory only - If this flag is set, the filesystem can
* ignore normal consistency checks when returning data.

Then with the flag==0, XFS does the locked-atomic-sync and
with the flag==1, XFS works like other filesystems that
don't do the atomic stuff.

jim

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-08 13:02                   ` jim owens
  2008-07-08 14:03                     ` jim owens
@ 2008-07-08 14:30                     ` Theodore Tso
  2008-07-09  1:50                       ` Jamie Lokier
  1 sibling, 1 reply; 70+ messages in thread
From: Theodore Tso @ 2008-07-08 14:30 UTC (permalink / raw)
  To: jim owens; +Cc: Dave Chinner, linux-fsdevel

On Tue, Jul 08, 2008 at 09:02:52AM -0400, jim owens wrote:
> Here is the summary of the SYNC supporters argument:
>
> 1) XFS tools can't deal with delalloc.
>
> 2) XFS tools know returned data is crap and handle that.
>
> 3) XFS needs to obsolete and replace XFS specific api.
>
> 4) XFS tools must be recoded for #3, and already have
>    extensive logic for #2 to make rational decisions
>    with bad data... BUT it is impossible to have
>    the XFS tools fixed so #1 is also handled, thus
>    we must have the fiemap SYNC flag.
>
> AAARRRRRGGGG!
>
> Does anyone else see why I just don't freak'n get it!
>
> This is the truth:  FIEMAP DATA IS UNSTABLE
>
> NEW TOOLS MUST DEAL WITH THAT INSTABILITY.

Let's take a step back and ask ourselves what tools will want to do with
FIEMAP in the first place, shall we?

As far as I know, it's basically only useful for bootloaders like lilo
and to a limited extent grub (for its stage2 loader) and for debugging
tools that are interested in knowing how fragmented a file might be.
I can't think of any other really good uses, anyway.  Someone want to
enlighten me?

For bootloaders, where the information is going to be stashed
somewhere permanent, for that class of filesystems which might
reorganize data after it has been mapped once, you need some magic
file flag which nails down the file.  Basically, a "don't you dare
move this" flag.  This is implemented for reiserfs3, since it will
move a file around once it has been placed on disk.

However, how many filesystems beyond reiserfs3 actually will move a
file around on disk once it has been mapped to specific disk blocks
and written to disk?  Does XFS do this?  I didn't think so.  If it
does, then for bootloaders like LILO it will also need a flag that
prevents a block from being moved around.

There are however plenty of filesystems (XFS, ext4, etc.) that play
the delayed allocation game, where the FIEMAP information returned
could change from "location not yet determined on disk" to "here's
where we decided to put it on disk".  And I assume that's what the
SYNC flag does, right?  So it's really just syntactic sugar for doing
fsync; get fiemap; check to see if an unmapped extent was still
returned (due to a race condition; if so, go back and repeat the fsync
and then retry the fiemap loop).
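
The fsync / fiemap / check / retry loop described above might look like
the following in userspace. This is only a sketch under assumptions: it
uses the mainline <linux/fs.h> and <linux/fiemap.h> names, treats DELALLOC
and UNKNOWN as "not yet placed", and the helper names, MAX_EXTENTS and
MAX_TRIES are invented for the example.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>       /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>   /* struct fiemap, FIEMAP_EXTENT_* */

#define MAX_EXTENTS 64      /* files with more extents are mapped partially */
#define MAX_TRIES   5       /* give up rather than loop forever */

/* True if any returned extent still lacks an on-disk location. */
bool has_unplaced_extent(const struct fiemap *fm)
{
    for (unsigned int i = 0; i < fm->fm_mapped_extents; i++)
        if (fm->fm_extents[i].fe_flags &
            (FIEMAP_EXTENT_DELALLOC | FIEMAP_EXTENT_UNKNOWN))
            return true;
    return false;
}

/*
 * fsync, map, and retry while a racing writer leaves unplaced extents.
 * Returns a malloc'd map, or NULL on error or after MAX_TRIES attempts.
 */
struct fiemap *fsync_and_map(int fd)
{
    size_t sz = sizeof(struct fiemap) +
                MAX_EXTENTS * sizeof(struct fiemap_extent);
    struct fiemap *fm = malloc(sz);

    if (!fm)
        return NULL;
    for (int attempt = 0; attempt < MAX_TRIES; attempt++) {
        if (fsync(fd) < 0)
            break;
        memset(fm, 0, sz);
        fm->fm_length = FIEMAP_MAX_OFFSET;
        fm->fm_extent_count = MAX_EXTENTS;
        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
            break;
        if (!has_unplaced_extent(fm))
            return fm;              /* every extent has a location */
    }
    free(fm);
    return NULL;
}
```

With a persistent writer the loop can exhaust MAX_TRIES, which is the cost
of doing this outside the kernel.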

So I think perhaps the talking-at-cross-purposes is that Jim is
thinking about how to support filesystems that will in fact relocate
file data on disk (for example, as part of an online shrink or when
moving a file from one volume to another in a filesystem like advfs or
btrfs), and other folks have been assuming a simpler world where data
is either mapped to a location on disk or still in a delayed
allocation state.

In order to support filesystems where file data can move, then indeed
you need the kind of userspace locking that Jim was talking about ---
except what applications really need that kind of reliable information
about exactly where the file data blocks live on disk?  Again, the only
ones I can think of require utter stability, because the location will be
stashed in some location for use by a bootloader, or something that
needs to run before the filesystem is up and running.  So there, what
is needed is out of scope of FIEMAP, and it's probably a flag which
nails the file to a specific location on disk.  And if such a file is
present, it will prevent hot-removal of a volume from a filesystem
group, and it may interfere with a hot-shrink operation --- but that's
as it should be.  Since otherwise, it would break the bootloader.

Does this make sense?

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-08 14:30                     ` Theodore Tso
@ 2008-07-09  1:50                       ` Jamie Lokier
  0 siblings, 0 replies; 70+ messages in thread
From: Jamie Lokier @ 2008-07-09  1:50 UTC (permalink / raw)
  To: Theodore Tso; +Cc: jim owens, Dave Chinner, linux-fsdevel

Theodore Tso wrote:
> Let's take a step back and ask ourselves what tools will want to do with
> FIEMAP in the first place, shall we?
> 
> As far as I know, it's basically only useful for bootloaders like lilo
> and to a limited extent grub (for its stage2 loader) and for debugging
> tools that are interested in knowing how fragmented a file might be.
> I can't think of any other really good uses, anyway.  Someone want to
> enlighten me?

Yes:

   1. Databases.  FIEMAP indicates where O_DIRECT will probably access.

      a. I/O strategy.  Database engines can use this as a hint to
         reduce seeks and increase speed of large or many concurrent
         queries.  Merely trying to emit thousands of AIOs and letting
         the kernel elevator do it is not as good, as there are higher
         level optimisations possible, and in any case AIO and the
         elevator have limitations.

      b. The hints can also guide new data allocation, or reorganisation.

   2. Filesystems in user space, e.g. NTFS-3G.  See above.

   3. Virtual machines use compact representations of large virtual
      disks.  Some of them add COW capabilities.  Both types are
      effectively filesystems-in-a-file.  See above.

   4. Programs which read data from lots of files, but don't care
      about the order, can reduce seeking if they can FIEMAP all the
      files and read the data in roughly block order (without getting
      too pedantic about it).  E.g. something which indexes the
      content of /home.  (Related: See my (little used) "treescan"
      program which is sometimes much faster than "find" for scanning
      names and stat() information, due mostly to seek optimisation.)

In all these uses, I notice that the _exact_ values are _not_ required.
It is enough that they are usually accurate enough to use as I/O
hints.
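
As an illustration of use 4 above, the "read in roughly block order" idea
could be sketched in C as follows.  This is only a sketch under stated
assumptions: it relies on the FS_IOC_FIEMAP ioctl and the struct fiemap
layout from <linux/fiemap.h> as later merged into mainline, which may
differ from the patches under discussion here, and struct file_pos,
first_physical() and sort_by_physical() are hypothetical names.

```c
#include <linux/fiemap.h>
#include <linux/fs.h>      /* FS_IOC_FIEMAP */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical record: a file and the physical start of its first extent. */
struct file_pos {
    const char *path;
    uint64_t    physical;
};

/* Ask the filesystem for the first extent only.
 * Returns 0 on success, -1 on error or if no extent is mapped. */
int first_physical(int fd, uint64_t *physical)
{
    size_t sz = sizeof(struct fiemap) + sizeof(struct fiemap_extent);
    struct fiemap *fm = calloc(1, sz);
    int ret = -1;

    if (!fm)
        return -1;
    fm->fm_start = 0;
    fm->fm_length = FIEMAP_MAX_OFFSET;   /* whole file */
    fm->fm_extent_count = 1;             /* room for one extent */
    if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents > 0) {
        *physical = fm->fm_extents[0].fe_physical;
        ret = 0;
    }
    free(fm);
    return ret;
}

/* Order records by physical offset; used purely as a seek-reducing hint. */
static int cmp_physical(const void *a, const void *b)
{
    const struct file_pos *x = a, *y = b;
    return (x->physical > y->physical) - (x->physical < y->physical);
}

void sort_by_physical(struct file_pos *files, size_t n)
{
    qsort(files, n, sizeof(*files), cmp_physical);
}
```

A caller would fill one struct file_pos per file with first_physical(),
sort the array, and then read the files in that order through the
filesystem.  Since the offsets are only hints, a stale value costs a seek
rather than correctness.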

It would make sense, I think, to merge this with the other work being
done on I/O hints, for RAIDs and other media with sub-structure.

> However, how many filesystems beyond reiserfs3 actually will move a
> file around on disk once it has been mapped to specific disk blocks
> and written to disk?  Does XFS do this?  I didn't think so.  If it
> does, then for bootloaders like LILO it will also need a flag that
> prevents a block from being moved around.

Isn't "chattr +t" effectively a suitable generic flag for that, even
though it doesn't exactly say so in the manual?

Btw, I imagine quite a few future filesystems will move data around on
disk once it is mapped.  Probably not the majority.

> There are however plenty of filesystems (XFS, ext4, etc.) that play
> the delayed allocation game, where the FIEMAP information returned
> could change from "location not yet determined on disk" to "here's
> where we decided to put it on disk".  And I assume that's what the
> SYNC flag does, right?  So it's really just syntactic sugar for doing
> fsync; get fiemap; check to see if an unmapped extent was still
> returned (due to a race condition; if so, go back and repeat the fsync
> and then retry the fiemap loop).

I think you said two different things there.  "Here's where we decided
to put it" is not the same as "we _have_ put it here".  So sync is
stronger than removing delalloc extents.  (There's also a middle
strength where data is all committed, but not necessarily atomically
with getting all the extents at once).
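
For reference, the "fsync; get fiemap; check; retry" loop described in the
quoted text might look roughly like this in C.  It is a sketch, not the
patch's implementation: it assumes the mainline <linux/fiemap.h> names
(FS_IOC_FIEMAP, FIEMAP_EXTENT_DELALLOC), and any_delalloc() and
map_after_sync() are hypothetical helpers.  It only removes delalloc
extents from the result; it does not make the mapping an atomic snapshot.

```c
#include <linux/fiemap.h>
#include <linux/fs.h>      /* FS_IOC_FIEMAP */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns 1 if any of the n extents is still in delayed allocation. */
int any_delalloc(const struct fiemap_extent *ext, unsigned int n)
{
    for (unsigned int i = 0; i < n; i++)
        if (ext[i].fe_flags & FIEMAP_EXTENT_DELALLOC)
            return 1;
    return 0;
}

/* fsync, map, and retry while delalloc extents are still reported.
 * Maps at most max_extents extents.  Returns a malloc'd struct fiemap
 * (caller frees), or NULL on error or if max_tries attempts all raced
 * with new delayed allocations. */
struct fiemap *map_after_sync(int fd, unsigned int max_extents, int max_tries)
{
    size_t sz = sizeof(struct fiemap) +
                (size_t)max_extents * sizeof(struct fiemap_extent);
    struct fiemap *fm = malloc(sz);

    if (!fm)
        return NULL;
    for (int t = 0; t < max_tries; t++) {
        if (fsync(fd) < 0)
            break;
        memset(fm, 0, sz);
        fm->fm_length = FIEMAP_MAX_OFFSET;
        fm->fm_extent_count = max_extents;
        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
            break;
        if (!any_delalloc(fm->fm_extents, fm->fm_mapped_extents))
            return fm;   /* every returned extent has a location */
    }
    free(fm);
    return NULL;
}
```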

I'm not sure which semantics the XFS utilities need.  If they don't
access the raw blocks directly, they don't really need sync, they just
need "here's where we decided to put it".  If they do access raw
blocks directly, they need that xfs_freeze stuff too, at which point
it's using XFS ioctls anyway, so it begs the question of whether it
should be using FIEMAP at all.

> So I think perhaps the talking-at-cross-purposes is that Jim is
> thinking about how to support filesystems that will in fact relocate
> file data on disk (for example, as part of an online shrink or when
> moving a file from one volume to another in a filesystem like advfs or
> btrfs), and other folks have been assuming a simpler world where data
> is either mapped to a location on disk or still in a delayed
> allocation state.

There was a flag FIEMAP_EXTENT_NO_DIRECT which should presumably be
set on filesystems where data is not mapped at stable (or even single)
blocks.

That's why I suggested requiring that _not_ setting
FIEMAP_EXTENT_NO_DIRECT (really, define its complement!) should mean
"the data is at this physical location _only while no process modifies
the file_".  Filesystems with stable data locations, and some which
move the file only when it's modified, could unset the flag.  Other
filesystems (maybe including BTRFS) would always set it.  But that
suggestion was not really understood at the time.

Otherwise, if you think that no useful program will access the blocks
directly, then why do we have !FIEMAP_EXTENT_NO_DIRECT at all?  And
what does it mean?

-- Jamie

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-26 14:04 ` Dave Kleikamp
  2008-06-26 14:15   ` Eric Sandeen
@ 2008-06-26 17:01   ` Andreas Dilger
  1 sibling, 0 replies; 70+ messages in thread
From: Andreas Dilger @ 2008-06-26 17:01 UTC (permalink / raw)
  To: Dave Kleikamp
  Cc: Mark Fasheh, linux-fsdevel, Andreas Dilger, Kalpak Shah,
	Eric Sandeen, Josef Bacik

On Jun 26, 2008  09:04 -0500, Dave Kleikamp wrote:
> Honestly, I can see XATTR used generically, even though most filesystems
> don't store the XATTR as a tree.  (jfs stores it in a single extent.)
> SYNC really doesn't look like it belongs, and it's only there so that
> the new ioctl acts like the xfs ioctl.

I think the use of "tree" in the XATTR description is a bit misleading.
It doesn't really matter how the xattrs are laid out on disk, or how
they are addressed internally to the filesystem.  The "extents" returned
by FIEMAP will just be a list of {physical,logical start}+length ranges
that map the one (or more) locations on disk where the xattrs for this
inode are stored.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-06-25 22:18 [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2 Mark Fasheh
                   ` (3 preceding siblings ...)
  2008-06-26 14:04 ` Dave Kleikamp
@ 2008-07-03 14:37 ` jim owens
  2008-07-03 15:17   ` Jamie Lokier
                     ` (2 more replies)
  4 siblings, 3 replies; 70+ messages in thread
From: jim owens @ 2008-07-03 14:37 UTC (permalink / raw)
  To: linux-fsdevel, mfasheh

As Jamie pointed out, this:

> * FIEMAP_EXTENT_NO_DIRECT
> Direct access to the data in this extent is illegal or will have
> undefined results.

will confuse and mislead many people.  And as Andreas said,
the fe_physical and fe_device can be valid without the ability
to directly access the data.  I suggest we call this:

   FIEMAP_EXTENT_NO_BYPASS

As in "you can't bypass the filesystem" to directly access it.

Based on what Andreas said the NO_DIRECT (NO_BYPASS) means,
I disagree with these:

> * FIEMAP_EXTENT_SECONDARY
> The data for this extent is in secondary storage (e.g. HSM).  If the
> data is not also in the filesystem, then FIEMAP_EXTENT_NO_DIRECT
> should also be set.

First, "If the data is not also in the filesystem" has no
relevance.  You are either providing the fe_physical and
fe_device for the HSM which may or may not allow direct
access, or you are providing the location in the filesystem
of the primary copy in which case you are not setting
FIEMAP_EXTENT_SECONDARY.  I don't think you want to return
BOTH the fiemap_extent primary and secondary as this will
confuse the hell out of everyone.

Or, as was said before, maybe HSM should wait until we
know what it really needs.

> * FIEMAP_EXTENT_NET
>   - This will also set FIEMAP_EXTENT_NO_DIRECT
> The data for this extent is not stored in a locally-accessible device.
> 
> * FIEMAP_EXTENT_DATA_COMPRESSED
>   - This will also set FIEMAP_EXTENT_NO_DIRECT
> The data in this extent has been compressed by the file system.
> 
> * FIEMAP_EXTENT_DATA_ENCRYPTED
>   - This will also set FIEMAP_EXTENT_NO_DIRECT
> The data in this extent has been encrypted by the file system.

None of the above are always NO_DIRECT.  Certainly you can
read the raw compressed/encrypted data (a backup program might)
and even a netdev might be accessed if you know how.

I think you need a different top-level flag that covers
"things which are not raw block data" so I would change this:

> * FIEMAP_EXTENT_NOT_ALIGNED
> Extent offsets and length are not guaranteed to be block aligned.

To be something like:

    FIEMAP_EXTENT_ENCODED

> * FIEMAP_EXTENT_DATA_INLINE
>   This will also set FIEMAP_EXTENT_NOT_ALIGNED
> Data is located within a meta data block.
> 
> * FIEMAP_EXTENT_DATA_TAIL
>   This will also set FIEMAP_EXTENT_NOT_ALIGNED
> Data is packed into a block with data from other files.

With FIEMAP_EXTENT_ENCODED being the top flag set for
FIEMAP_EXTENT_NET, FIEMAP_EXTENT_DATA_COMPRESSED,
FIEMAP_EXTENT_DATA_ENCRYPTED, FIEMAP_EXTENT_DATA_INLINE,
and FIEMAP_EXTENT_DATA_TAIL.  Then FIEMAP_EXTENT_NO_BYPASS
may or may not also be set and would say "can I get to
the physical data", and separately "do I need to do
special processing on it" would be FIEMAP_EXTENT_ENCODED.
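
To make the two independent questions concrete, here is a small C sketch
of the layering proposed above.  Everything in it is hypothetical: the
EX_* names stand in for the proposed FIEMAP_EXTENT_* flags, and the bit
values are made up for illustration, since nothing in this thread fixes
them.

```c
#include <assert.h>
#include <stdint.h>

/* Made-up bit values, for illustration only. */
#define EX_NO_BYPASS        0x0001u  /* cannot reach data on the device  */
#define EX_ENCODED          0x0002u  /* umbrella: not plain block data   */
#define EX_NET              0x0010u
#define EX_DATA_COMPRESSED  0x0020u
#define EX_DATA_ENCRYPTED   0x0040u
#define EX_DATA_INLINE      0x0080u
#define EX_DATA_TAIL        0x0100u

/* Filesystem side: setting any detail flag also sets the umbrella flag. */
uint32_t ex_add_detail(uint32_t flags, uint32_t detail)
{
    return flags | detail | EX_ENCODED;
}

/* Application side: "can I get to the physical data?" */
int ex_can_reach_physical(uint32_t flags)
{
    return (flags & EX_NO_BYPASS) == 0;
}

/* Application side: "do I need to do special processing on it?" */
int ex_needs_processing(uint32_t flags)
{
    return (flags & EX_ENCODED) != 0;
}

/* A simple block-only tool wants both answers to be favourable. */
int ex_plain_blocks(uint32_t flags)
{
    return ex_can_reach_physical(flags) && !ex_needs_processing(flags);
}
```

Under this scheme a backup tool could still read a compressed extent raw
(reachable, but needing processing), while a tool that only understands
plain blocks would skip it after a single umbrella check.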

> * FIEMAP_EXTENT_UNWRITTEN
> Unwritten extent - the extent is allocated but its data has not been
> initialized.  This indicates the extent's data will be all zero.

This should say "will be all zero if read through the filesystem
but the contents are undefined if read directly."

jim

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-03 14:37 ` jim owens
@ 2008-07-03 15:17   ` Jamie Lokier
  2008-07-04  8:49     ` Andreas Dilger
  2008-07-03 23:00   ` Dave Chinner
  2008-07-04  9:00   ` Andreas Dilger
  2 siblings, 1 reply; 70+ messages in thread
From: Jamie Lokier @ 2008-07-03 15:17 UTC (permalink / raw)
  To: jim owens; +Cc: linux-fsdevel, mfasheh

jim owens wrote:
>   FIEMAP_EXTENT_NO_BYPASS
> 
> As in "you can't bypass the filesystem" to directly access it.

Can we also commit to this, when FIEMAP_EXTENT_NO_BYPASS is *not* set:

   1. The data is at fe_physical, and *will not move* so long as nothing
      modifies *that particular file*?

   2. Both reading *and writing* the file bypassing the filesystem are ok.

The reason for 1 is that some filesystems may move data when _other_
files are modified.  Heck, they may do so when other files are simply
read, or at some random whim.  Those filesystems would set
FIEMAP_EXTENT_NO_BYPASS (except for files with chattr 't'), because
fe_physical does not correspond to a *stable* location which a program
can subsequently use.

The reason for 2 is that some filesystems checksum the data and/or
replicate it, and the file won't be readable if you write to it directly.

In both these cases, O_DIRECT may *sometimes* work even though
directly accessing the physical device is unreliable.  So
FIEMAP_EXTENT_NO_BYPASS may be treated as "access the file through the
filesystem, use O_DIRECT if you still want direct access, and fall
back to ordinary file access if that doesn't work".
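
The fallback policy in the last paragraph could be sketched in C like
this.  It is an illustration only: open_prefer_direct() is a hypothetical
helper, and it assumes a Linux/glibc system where open(2) fails with
EINVAL when the filesystem does not support O_DIRECT.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE        /* exposes O_DIRECT on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>        /* close(), for callers */

#ifndef O_DIRECT
#define O_DIRECT 0         /* not exposed here: degrade to ordinary access */
#endif

/* Open a file for reading through the filesystem, preferring O_DIRECT.
 * On return *direct is 1 if the descriptor uses O_DIRECT, else 0.
 * Returns the descriptor, or -1 if the file cannot be opened at all. */
int open_prefer_direct(const char *path, int *direct)
{
    int fd = open(path, O_RDONLY | O_DIRECT);

    if (fd >= 0) {
        *direct = (O_DIRECT != 0);
        return fd;
    }
    *direct = 0;
    if (errno != EINVAL)   /* anything else is a real failure */
        return -1;
    return open(path, O_RDONLY);   /* ordinary, cached file access */
}
```

Note that a successful O_DIRECT open still imposes alignment requirements
on the buffer, offset and length of each read, so a caller has to handle
that separately.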

-- Jamie

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-03 15:17   ` Jamie Lokier
@ 2008-07-04  8:49     ` Andreas Dilger
  2008-07-04 11:28       ` Jamie Lokier
  0 siblings, 1 reply; 70+ messages in thread
From: Andreas Dilger @ 2008-07-04  8:49 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: jim owens, linux-fsdevel, mfasheh

On Jul 03, 2008  16:17 +0100, Jamie Lokier wrote:
> jim owens wrote:
> >   FIEMAP_EXTENT_NO_BYPASS
> > 
> > As in "you can't bypass the filesystem" to directly access it.
> 
> Can we also commit to this, when FIEMAP_EXTENT_NO_BYPASS is *not* set:
> 
>    1. The data is at fe_physical, and *will not move* so long as nothing
>       modifies *that particular file*?
> 
>    2. Both reading *and writing* the file bypassing the filesystem are ok.

I don't think any such guarantee can be made.  What if the file is
truncated and rewritten after the FIEMAP is called?  The filesystem
can't guarantee that will not happen. I think the only way to make
sure of constant mapping is to call FIEMAP before and after the blocks
are read.

> The reason for 2 is that some filesystems checksum the data and/or
> replicate it, and won't be readable if you write to it directly.

EEEEEK.  The _intent_ of FIEMAP is mostly for reporting fragmentation,
and possibly to allow a "generic" defragmenter to be written.  At an
outside stretch I could imagine some tools like "dump" wanting direct
read access to the file data.

Directly writing underneath a filesystem is major bad news and will
likely corrupt the filesystem because you can never be sure that there
aren't dirty pages in the page cache that will overwrite your "direct"
write, or that your write isn't racy with an unlink or truncate.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-04  8:49     ` Andreas Dilger
@ 2008-07-04 11:28       ` Jamie Lokier
  0 siblings, 0 replies; 70+ messages in thread
From: Jamie Lokier @ 2008-07-04 11:28 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: jim owens, linux-fsdevel, mfasheh

Andreas Dilger wrote:
> On Jul 03, 2008  16:17 +0100, Jamie Lokier wrote:
> > jim owens wrote:
> > >   FIEMAP_EXTENT_NO_BYPASS
> > > 
> > > As in "you can't bypass the filesystem" to directly access it.
> > 
> > Can we also commit to this, when FIEMAP_EXTENT_NO_BYPASS is *not* set:
> > 
> >    1. The data is at fe_physical, and *will not move* so long as nothing
> >       modifies *that particular file*?
> > 
> >    2. Both reading *and writing* the file bypassing the filesystem are ok.
> 
> I don't think any such guarantee can be made.  What if the file is
> truncated and rewritten after the FIEMAP is called?

That is prohibited by "so long as nothing modifies that particular file".
That's the entire point of 1! :-)

> The filesystem can't guarantee that will not happen.

The filesystem's guarantee has to be _conditional_ on nothing _else_
modifying the file.  That includes writing, truncating, and extending.
It's not the filesystem's job to prevent those things.

What I'm saying is that some filesystems will move data blocks _even
when no process touches the file containing those blocks_.  E.g. some
filesystems do garbage collection in the background - even when
nothing touches any file.  Some filesystems clone data blocks for COW.
There are many imaginable other reasons.

Clearly, any program that "gets away with it" by using FIEMAP to get a
block map and then accessing the disk directly, is less reliable with
those filesystems.  It would be good to reflect that somehow.

The obvious way to my mind is for those filesystems which don't have
stable data positions, when a file is not being modified, to set the
flag which says "this extent should not be accessed directly"
(whatever it is called :-).

> I think the only way to make sure of constant mapping is to call
> FIEMAP before and after the blocks are read.

No, that is clearly unsafe.  They can change twice, ending up back at
the same positions, but different in between.  That's even likely,
with some modern filesystem techniques.

> > The reason for 2 is that some filesystems checksum the data and/or
> > replicate it, and won't be readable if you write to it directly.
> 
> EEEEEK.  The _intent_ of FIEMAP is mostly for reporting fragmentation,
> and possibly to allow a "generic" defragmenter to be written.  At an
> outside stretch I could imagine some tools like "dump" wanting direct
> read access to the file data.

Potentially useful other cases are providing good information to
assist access patterns and block allocation for things like databases,
filesystems-in-a-file, and virtual-disks-in-a-non-flat-file.  Those
are all variations on reporting fragmentation, and don't require the
information to be absolutely stable or correct.

> Directly writing underneath a filesystem is major bad news and will
> likely corrupt the filesystem because you can never be sure that there
> aren't dirty pages in the page cache that will overwrite your "direct"
> write, or that your write isn't racy with an unlink or truncate.

You're right.  It's a fair point, should be clarified, because I
hadn't thought of it ;-)

Btw, you can be sure there aren't dirty pages, if you have done
fsync() or sync_file_range() at some time in the past, and you are
_sure_ no other process is accessing the file.  (Otoh, I'm not sure if
some funky COW implementations would complicate that.)

However, that still leaves a gaping lack of coherency in that the
filesystem may have clean cached pages not matching what is written to
disk.  So, you're absolutely right: NO WRITING.

You must do fsync() anyway, and ensure nobody is modifying the file,
if you're going to read correct data from FIEMAP blocks.

Ok, then I'll remove point 2 and add these:

    - FIEMAP extents are _not_ safe for writing data directly!
      Page cache coherency affects all filesystems.  Checksums and
      replication are also involved with some filesystems.  All
      writing should go through the filesystem itself.

    - If reading data directly, do fsync() before FIEMAP, and be
      absolutely sure no process modifies the file between
      fsync+FIEMAP and reading the blocks, and that the
      FIEMAP_EXTENT_NO_DIRECT flag is not set.  It is the
      application's responsibility to ensure no other process modifies
      the file.
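
As a rough illustration of the flag check in the second rule, an
application might filter extents like this before reading blocks itself.
This is a sketch under assumptions: FIEMAP_EXTENT_NO_DIRECT is the name
used in this thread, whereas the code below uses the extent flags from
the mainline <linux/fiemap.h> as a stand-in, and the choice of which
flags block direct reads (DIRECT_READ_BLOCKERS) is my own guess, as is
the helper name extent_directly_readable().

```c
#include <linux/fiemap.h>
#include <assert.h>

/* Assumed deny-list: extents with any of these flags are not plain,
 * initialised, block-aligned data at a known location. */
#define DIRECT_READ_BLOCKERS (FIEMAP_EXTENT_UNKNOWN     | \
                              FIEMAP_EXTENT_DELALLOC    | \
                              FIEMAP_EXTENT_ENCODED     | \
                              FIEMAP_EXTENT_NOT_ALIGNED | \
                              FIEMAP_EXTENT_UNWRITTEN)

/* Returns 1 if the flags alone do not rule out reading the extent
 * directly.  The caller must still fsync() before FIEMAP and guarantee
 * that nothing modifies the file in the meantime. */
int extent_directly_readable(const struct fiemap_extent *e)
{
    return (e->fe_flags & DIRECT_READ_BLOCKERS) == 0;
}
```

The check covers reads only; per the first rule, writes always go through
the filesystem.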

-- Jamie

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-03 14:37 ` jim owens
  2008-07-03 15:17   ` Jamie Lokier
@ 2008-07-03 23:00   ` Dave Chinner
  2008-07-04  9:00   ` Andreas Dilger
  2 siblings, 0 replies; 70+ messages in thread
From: Dave Chinner @ 2008-07-03 23:00 UTC (permalink / raw)
  To: jim owens; +Cc: linux-fsdevel, mfasheh

On Thu, Jul 03, 2008 at 10:37:36AM -0400, jim owens wrote:
> As Jamie pointed out, this:
> I disagree with these:
>
>> * FIEMAP_EXTENT_SECONDARY
>> The data for this extent is in secondary storage (e.g. HSM).  If the
>> data is not also in the filesystem, then FIEMAP_EXTENT_NO_DIRECT
>> should also be set.
....
> Or, as was said before, maybe HSM should wait until we
> know what it really needs.

Given that other flags for HSM interfacing have already been removed
(i.e. the 'don't recall HSM resident extents to map them' flag) this
serves little purpose.

As to what HSM needs, we know exactly what it needs - that's one of
the things XFS has been working intimately with for years and years.
Those needs fed into the original fiemap interface design and this
flag was one of them (as was the 'don't read' flag that has already
been removed).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-03 14:37 ` jim owens
  2008-07-03 15:17   ` Jamie Lokier
  2008-07-03 23:00   ` Dave Chinner
@ 2008-07-04  9:00   ` Andreas Dilger
  2008-07-07 23:28     ` jim owens
  2008-07-08  0:06     ` jim owens
  2 siblings, 2 replies; 70+ messages in thread
From: Andreas Dilger @ 2008-07-04  9:00 UTC (permalink / raw)
  To: jim owens; +Cc: linux-fsdevel, mfasheh

On Jul 03, 2008  10:37 -0400, jim owens wrote:
>> * FIEMAP_EXTENT_NO_DIRECT
>> Direct access to the data in this extent is illegal or will have
>> undefined results.
>
> will confuse and mislead many people.  And as Andreas said,
> the fe_physical and fe_device can be valid without the ability
> to directly access the data.  I suggest we call this:
>
>   FIEMAP_EXTENT_NO_BYPASS
>
> As in "you can't bypass the filesystem" to directly access it.
>
> Based on what Andreas said the NO_DIRECT (NO_BYPASS) means,
> I disagree with these:
>
>> * FIEMAP_EXTENT_NET
>>   - This will also set FIEMAP_EXTENT_NO_DIRECT
>> The data for this extent is not stored in a locally-accessible device.
>>
>> * FIEMAP_EXTENT_DATA_COMPRESSED
>>   - This will also set FIEMAP_EXTENT_NO_DIRECT
>> The data in this extent has been compressed by the file system.
>>
>> * FIEMAP_EXTENT_DATA_ENCRYPTED
>>   - This will also set FIEMAP_EXTENT_NO_DIRECT
>> The data in this extent has been encrypted by the file system.
>
> None of the above are always NO_DIRECT.  Certainly you can
> read the raw compressed/encrypted data (a backup program might)
> and even a netdev might be accessed if you know how.

I don't see that calling this "NO_BYPASS" is significantly different
than calling it "NO_DIRECT".  You can "bypass" that flag just as
easily, the point is that you may get garbage out of it, don't do that.
I don't think anyone writing an application will be seriously confused.


>> * FIEMAP_EXTENT_NOT_ALIGNED
>> Extent offsets and length are not guaranteed to be block aligned.
>
> To be something like:
>
>    FIEMAP_EXTENT_ENCODED

That doesn't make sense either.  "NOT_ALIGNED" just means that it
isn't a full filesystem block, but it isn't necessarily "encoded"
in any way.

>> * FIEMAP_EXTENT_DATA_INLINE
>>   This will also set FIEMAP_EXTENT_NOT_ALIGNED
>> Data is located within a meta data block.
>>
>> * FIEMAP_EXTENT_DATA_TAIL
>>   This will also set FIEMAP_EXTENT_NOT_ALIGNED
>> Data is packed into a block with data from other files.
>
> With FIEMAP_EXTENT_ENCODED being the top flag set for
> FIEMAP_EXTENT_NET, FIEMAP_EXTENT_DATA_COMPRESSED,
> FIEMAP_EXTENT_DATA_ENCRYPTED, FIEMAP_EXTENT_DATA_INLINE,
> and FIEMAP_EXTENT_DATA_TAIL.  Then FIEMAP_EXTENT_NO_BYPASS
> may or may not also be set and would say "can I get to
> the physical data", and separately "do I need to do
> special processing on it" would be FIEMAP_EXTENT_ENCODED.

It seems to me that "ENCODED" has no relation to "NET" or "DATA_TAIL"
or "DATA_INLINE".

>> * FIEMAP_EXTENT_UNWRITTEN
>> Unwritten extent - the extent is allocated but its data has not been
>> initialized.  This indicates the extent's data will be all zero.
>
> This should say "will be all zero if read through the filesystem
> but the contents are undefined if read directly."

That does make sense and should be updated.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-04  9:00   ` Andreas Dilger
@ 2008-07-07 23:28     ` jim owens
  2008-07-09  1:53       ` Jamie Lokier
  2008-07-08  0:06     ` jim owens
  1 sibling, 1 reply; 70+ messages in thread
From: jim owens @ 2008-07-07 23:28 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel

Andreas Dilger wrote:

> I don't see that calling this "NO_BYPASS" is significantly different
> than calling it "NO_DIRECT".  You can "bypass" that flag just as
> easily, the point is that you may get garbage out of it, don't do that.

I agree that the name of the flag does not change what we intend
it to do... which is to say the storage can not be reached by
directly accessing the physical device.

> I don't think anyone writing an application will be seriously confused.

If you can say that with a straight face, you must not have
had much experience with commercial application developers.

No matter how carefully you try to explain it, "NO_DIRECT" is
going to be confused with the O_DIRECT feature.

What I'm saying is that we should find some other name for
the flag than "NO_DIRECT" because it is easier than trying
to explain away the confusion.  Any other suggestions?

jim

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-07 23:28     ` jim owens
@ 2008-07-09  1:53       ` Jamie Lokier
  2008-07-09 15:01         ` jim owens
  0 siblings, 1 reply; 70+ messages in thread
From: Jamie Lokier @ 2008-07-09  1:53 UTC (permalink / raw)
  To: jim owens; +Cc: Andreas Dilger, linux-fsdevel

jim owens wrote:
> What I'm saying is that we should find some other name for
> the flag than "NO_DIRECT" because it is easier than trying
> to explain away the confusion.  Any other suggestions?

I proposed "PHYSICAL" because it corresponds with the name of the
fe_physical field.

Following this thread, I'm thinking it would be better to change the
sense of the flag, too, from NO_DIRECT to DIRECT.  I.e. only set when
access to the physical device is usable.

-- Jamie

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-09  1:53       ` Jamie Lokier
@ 2008-07-09 15:01         ` jim owens
  0 siblings, 0 replies; 70+ messages in thread
From: jim owens @ 2008-07-09 15:01 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Andreas Dilger, linux-fsdevel

Jamie Lokier wrote:
> jim owens wrote:
> 
>>What I'm saying is that we should find some other name for
>>the flag than "NO_DIRECT" because it is easier than trying
>>to explain away the confusion.  Any other suggestions?
> 
> I proposed "PHYSICAL" because it corresponds with the name of the
> fe_physical field.
> 
> Following this thread, I'm thinking it would be better to change the
> sense of the flag, too, from NO_DIRECT to DIRECT.  I.e. only set when
> access to the physical device is usable.

I like your logic.  Instead of the NO_DIRECT (my NO_BYPASS)
this would seem to be better:

* FIEMAP_FLAG_PHYSICAL
*   If this flag is set, the physical device region defined by
*   the tuple (fe_device, fe_physical, fe_length) is directly
*   accessible outside the filesystem.
*

This also fixes Andreas's legitimate original objection to having
the flag named "PHYSICAL".

jim

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
  2008-07-04  9:00   ` Andreas Dilger
  2008-07-07 23:28     ` jim owens
@ 2008-07-08  0:06     ` jim owens
  1 sibling, 0 replies; 70+ messages in thread
From: jim owens @ 2008-07-08  0:06 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel

Andreas Dilger wrote:

> That doesn't make sense either.  "NOT_ALIGNED" just means that it
> isn't a full filesystem block, but it isn't necessarily "encoded"
> in any way.

I fully agree (don't let anyone know I said that) the word
encoded is bad, but I was trying to produce a set of top
level flags that minimize the checks applications need.

>>With FIEMAP_EXTENT_ENCODED being the top flag set for
>>FIEMAP_EXTENT_NET, FIEMAP_EXTENT_DATA_COMPRESSED,
>>FIEMAP_EXTENT_DATA_ENCRYPTED, FIEMAP_EXTENT_DATA_INLINE,
>>and FIEMAP_EXTENT_DATA_TAIL.  Then FIEMAP_EXTENT_NO_BYPASS
>>may or may not also be set and would say "can I get to
>>the physical data", and separately "do I need to do
>>special processing on it" would be FIEMAP_EXTENT_ENCODED.
>  
> It seems to me that "ENCODED" has no relation to "NET" or "DATA_TAIL"
> or "DATA_INLINE".

The point is that a simple app that only processes blocks
would not handle any of the above extents.

When I proposed this I was thinking about layered flag checking,
but now I'm starting to wonder if having top categories actually
helps applications now or for the future.  It seems they still
need to check "are there any flags set that I don't know", rather
than checking groupings.

So now I don't see that FIEMAP_EXTENT_NOT_ALIGNED or "ENCODED"
has any value.
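
The "are there any flags set that I don't know" check mentioned above is
a one-line mask test.  A minimal C sketch, with a hypothetical helper
name and a made-up APP_KNOWN_FLAGS value standing in for whatever set of
extent flags a given application was written to handle:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: the flag bits this particular application understands. */
#define APP_KNOWN_FLAGS 0x0000000fu

/* Returns 1 if fe_flags contains any bit outside the known set, in which
 * case the application should refuse to interpret the extent. */
int extent_has_unknown_flags(uint32_t fe_flags, uint32_t known)
{
    return (fe_flags & ~known) != 0;
}
```

Unlike a grouping flag, this allow-list keeps working when new flags are
added later: an older application simply sees an unknown bit and backs
off.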

jim

^ permalink raw reply	[flat|nested] 70+ messages in thread

end of thread, other threads:[~2008-07-09 15:01 UTC | newest]

Thread overview: 70+ messages
2008-06-25 22:18 [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2 Mark Fasheh
2008-06-26  3:03 ` Andreas Dilger
2008-06-26  9:36 ` Jamie Lokier
2008-06-26 10:24   ` Andreas Dilger
2008-06-26 11:37     ` Anton Altaparmakov
2008-06-26 12:19     ` Jamie Lokier
2008-06-26 13:16       ` Dave Chinner
2008-06-26 13:27         ` Jamie Lokier
2008-06-26 13:48         ` Eric Sandeen
2008-06-26 14:16           ` Jamie Lokier
2008-06-26 16:56             ` Andreas Dilger
2008-06-29 19:12               ` Anton Altaparmakov
2008-06-29 21:45                 ` Dave Chinner
2008-06-30 22:57                   ` Jamie Lokier
2008-06-30 23:07                     ` Mark Fasheh
2008-07-01  2:01                       ` Brad Boyer
2008-07-02  6:38                         ` Andreas Dilger
2008-07-02  6:33                 ` Andreas Dilger
2008-07-02 14:26                   ` Jamie Lokier
2008-06-26 17:17       ` Andreas Dilger
2008-06-26 14:03 ` Eric Sandeen
2008-06-27  1:41   ` Dave Chinner
2008-06-27  9:41     ` Jamie Lokier
2008-06-27 10:01       ` Dave Chinner
2008-06-27 10:32         ` Jamie Lokier
2008-06-27 22:48       ` Andreas Dilger
2008-06-28  4:21         ` Eric Sandeen
2008-07-02  6:26           ` Andreas Dilger
2008-07-02 14:28             ` Jamie Lokier
2008-07-02 21:20               ` Mark Fasheh
2008-07-03 14:45                 ` Jamie Lokier
2008-06-26 14:04 ` Dave Kleikamp
2008-06-26 14:15   ` Eric Sandeen
2008-06-26 14:27     ` Dave Kleikamp
2008-07-02 23:48       ` jim owens
2008-07-03 11:17         ` Dave Chinner
2008-07-03 12:11           ` jim owens
2008-07-03 22:51             ` Dave Chinner
2008-07-04  8:31               ` Andreas Dilger
2008-07-04 12:13               ` Jamie Lokier
2008-07-07  7:40                 ` Dave Chinner
2008-07-07 16:53                   ` Jamie Lokier
2008-07-07 22:51                     ` Dave Chinner
2008-07-07 21:16               ` jim owens
2008-07-08  3:01                 ` Dave Chinner
2008-07-07 22:02               ` jim owens
2008-07-09  2:03                 ` Jamie Lokier
2008-07-03 12:21           ` jim owens
2008-07-03 12:42             ` Andi Kleen
2008-07-04 20:32             ` Anton Altaparmakov
2008-07-05 10:49               ` Jamie Lokier
2008-07-05 21:44                 ` Anton Altaparmakov
2008-07-07 23:01               ` jim owens
2008-07-08  1:51                 ` Dave Chinner
2008-07-08 13:02                   ` jim owens
2008-07-08 14:03                     ` jim owens
2008-07-08 14:39                       ` jim owens
2008-07-08 14:30                     ` Theodore Tso
2008-07-09  1:50                       ` Jamie Lokier
2008-06-26 17:01   ` Andreas Dilger
2008-07-03 14:37 ` jim owens
2008-07-03 15:17   ` Jamie Lokier
2008-07-04  8:49     ` Andreas Dilger
2008-07-04 11:28       ` Jamie Lokier
2008-07-03 23:00   ` Dave Chinner
2008-07-04  9:00   ` Andreas Dilger
2008-07-07 23:28     ` jim owens
2008-07-09  1:53       ` Jamie Lokier
2008-07-09 15:01         ` jim owens
2008-07-08  0:06     ` jim owens
