[RFC][PATCH 0/5] Fiemap, an extent mapping ioctl

linux-fsdevel.vger.kernel.org archive mirror

* [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
@ 2008-05-25  0:01 Mark Fasheh
  2008-05-25 19:42 ` Christoph Hellwig
  0 siblings, 1 reply; 35+ messages in thread
From: Mark Fasheh @ 2008-05-25  0:01 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Andreas Dilger, Kalpak Shah, Eric Sandeen, Josef Bacik

Hello,

	The following patches are the latest attempt at implementing a
fiemap ioctl, which can be used by userspace software to get extent
information for an inode in an efficient manner.

	These patches are against Linus' latest tree. While the core vfs
patch seems to be approaching feature-completeness, most of the series
should still be considered as being incomplete. The fs patches in particular
need some more attention. I think there's enough here however, that it makes
sense to start posting to fsdevel for general comments.

	Testing so far has been light, typically consisting of me running a
bare-bones ioctl wrapper program by hand:

   http://www.kernel.org/pub/linux/kernel/people/mfasheh/fiemap/tests/

	We definitely need some more rigorous testing software, which I
believe Eric is working on. Additionally, a port of the 'filefrag'
application still needs to be completed.

	A lot has changed since the last fiemap patch was posted. Mostly,
the vfs<->fs api is more fleshed out, with suitable abstractions and helper
functions to aid implementation of ->fiemap. Some checks were added in the
vfs patch to catch things like overflow, fs limits checks, etc. Automatic
trimming of the request happens now so the fs doesn't have to worry about
ranges being larger than it can handle.

	Some changes were also made to the user API with the goal of
simplifying things so that it was easier for client file systems to
implement a callback. My hope is that a simpler API means file systems will
provide ->fiemap() quicker, and will be less likely to return results that
are wrong, or worse, slightly different from other implementations.

- Except for 'fm_flags', the various in/out fields on struct fiemap got
  turned into a single 'out' field - the number of mapped extents
  (fm_mapped_extents). This gives the kernel side dealing with struct fiemap
  fewer 'moving parts' to deal with.

- Extent flags were cleaned up, and some new ones got added.

- Instead of forcing the user to add up all extent lengths before a given
  one to figure out its logical offset, an 'fe_logical' field was added to
  fiemap_extent. This is a lot more obvious and straightforward in my
  opinion, and is well worth the tradeoff of a few bytes. It also obviates
  the need to describe holes, as their existence is easily implied now. Also,
  fm_start and fm_length no longer have to be 'out' variables, which goes
  back to the 1st listed change.

- Handling of incompatible flags was simplified to just return -EBADR and
  the set of not-understood flags in fm_flags.

- Documentation/filesystems/fiemap.txt has been added in the 1st patch.
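
To illustrate the fe_logical point in the list above: once every extent
carries its own logical offset, a hole is simply the gap between the end of
one extent and the start of the next. The following is a minimal userspace
sketch, not code from the patches. The struct extent_span type and the
hole_bytes_between() name are made up for this example, and the input is
assumed to be sorted by logical offset with no overlaps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: only the two fiemap_extent fields needed here. */
struct extent_span {
	uint64_t logical;	/* corresponds to fe_logical */
	uint64_t length;	/* corresponds to fe_length */
};

/*
 * Sum the bytes that fall between consecutive extents, i.e. the holes.
 * Assumes 'ext' is sorted by logical offset and extents do not overlap.
 */
uint64_t hole_bytes_between(const struct extent_span *ext, size_t n)
{
	uint64_t holes = 0;
	size_t i;

	for (i = 1; i < n; i++) {
		uint64_t prev_end = ext[i - 1].logical + ext[i - 1].length;

		/* Sorted, non-overlapping input is a precondition. */
		assert(ext[i].logical >= prev_end);
		holes += ext[i].logical - prev_end;
	}
	return holes;
}
```

No separate "hole" record is needed; the gap falls out of the subtraction.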

Below this I will include the contents of fiemap.txt to make it more
convenient for folks to get details on the API.
	--Mark

Fiemap Ioctl
============

The fiemap ioctl is an efficient method for userspace to get file
extent mappings. Instead of block-by-block mapping (such as bmap), fiemap
returns a list of extents.

Request Basics
--------------

A fiemap request is encoded within struct fiemap:

struct fiemap {
	__u64	fm_start;	 /* logical offset (inclusive) at
				  * which to start mapping (in) */
	__u64	fm_length;	 /* logical length of mapping which
				  * userspace cares about (in) */
	__u32	fm_flags;	 /* FIEMAP_FLAG_* flags for request (in) */
	__u32	fm_extent_count; /* size of fm_extents array (in) */
	__u32	fm_mapped_extents; /* number of extents that were
				    * mapped (out) */
	__u32	fm_reserved;
	struct fiemap_extent	fm_extents[0];
};

fm_start and fm_length specify the logical range within the file
which the process would like mappings for. Extents returned mirror
those on disk - that is, the logical offset of the 1st returned extent
may start before fm_start, and the range covered by the last returned
extent may end after fm_start + fm_length. All offsets and lengths are
in bytes.

Certain flags to modify the way in which mappings are looked up can be
set in fm_flags. If the kernel doesn't understand some particular
flags, it will return EBADR and the contents of fm_flags will contain
the set of flags which caused the error. If the kernel is compatible
with all flags passed, the contents of fm_flags will be unmodified.
It is up to userspace to determine whether rejection of a particular
flag is fatal to its operation. This scheme is intended to allow the
fiemap interface to grow in the future but without losing
compatibility with old software.
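
One way userspace might act on a rejected request is sketched below. This
is only an illustration of the scheme described above, not part of the
patches: the helper name fiemap_retry_flags() and the notion of an
"optional" mask are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide which flags to retry with after the kernel
 * failed the request with EBADR and left the rejected flags in fm_flags.
 *
 * requested: flags originally passed in fm_flags
 * rejected:  contents of fm_flags after the failed call
 * optional:  flags the caller is able to do without
 *
 * Returns 0 and stores the reduced flag set in *retry if every rejected
 * flag was optional, or -1 if a flag the caller depends on was rejected.
 */
int fiemap_retry_flags(uint32_t requested, uint32_t rejected,
		       uint32_t optional, uint32_t *retry)
{
	assert(retry != NULL);

	if (rejected & ~optional)
		return -1;	/* rejection is fatal for this caller */

	*retry = requested & ~rejected;
	return 0;
}
```

The caller would then reissue the ioctl with the value stored in *retry.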

Currently, there are five flags which can be set in fm_flags:

* FIEMAP_FLAG_NUM_EXTENTS
If this flag is set, extent information will not be returned via the
fm_extents array and the value of fm_extent_count will be
ignored. Instead, the total number of extents covering the range will
be returned via fm_mapped_extents. This is useful for programs which
only want to count the number of extents in a file, but don't care
about the actual extent layout.

* FIEMAP_FLAG_SYNC
If this flag is set, the kernel will sync the file before mapping extents.

* FIEMAP_FLAG_HSM_READ
If the extent is offline, retrieve it before mapping and do not flag
it as FIEMAP_EXTENT_SECONDARY. This flag has no effect if the file
system does not support HSM.

* FIEMAP_FLAG_XATTR
If this flag is set, the extents returned will describe the inode's
extended attribute lookup tree, instead of its data tree.

* FIEMAP_FLAG_LUN_ORDER
If the file system stripes file data, this will return contiguous
regions of physical allocation, sorted by LUN. Logical offsets may not
make sense if this flag is passed. If the file system does not support
multiple LUNs, this flag will be ignored.

Extent Mapping
--------------

Note that all of this is ignored if FIEMAP_FLAG_NUM_EXTENTS is set.

Extent information is returned within the embedded fm_extents array
which userspace must allocate along with the fiemap structure. The
total number of fiemap_extents available should be passed via
fm_extent_count. The number of extents mapped by the kernel will be returned via
fm_mapped_extents. If the number of fiemap_extents allocated is less
than would be required to map the requested range, the maximum number
of extents that can be mapped in available memory will be returned and
fm_mapped_extents will be equal to fm_extent_count. In that case, the
last extent in the array will not complete the requested range and
will not have the FIEMAP_EXTENT_LAST flag set (see the next section on
extent flags).
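
The allocation and the "array filled up" case described above could be
handled roughly as follows. This is a sketch under stated assumptions: the
structures are local copies of the layouts given in this document, the
value of SKETCH_FIEMAP_EXTENT_LAST is a placeholder rather than the value
used by the patches, and fiemap_alloc() and fiemap_next_start() are
hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Local copies of the structures as laid out in this document. */
struct fiemap_extent {
	uint64_t fe_logical;
	uint64_t fe_physical;
	uint64_t fe_length;
	uint32_t fe_flags;
	uint32_t fe_lun;
};

struct fiemap {
	uint64_t fm_start;
	uint64_t fm_length;
	uint32_t fm_flags;
	uint32_t fm_extent_count;
	uint32_t fm_mapped_extents;
	uint32_t fm_reserved;
	struct fiemap_extent fm_extents[];
};

/* Placeholder bit value, chosen for this example only. */
#define SKETCH_FIEMAP_EXTENT_LAST 0x1u

/* Allocate a zeroed request with room for 'count' extents. */
struct fiemap *fiemap_alloc(uint32_t count)
{
	struct fiemap *fm;

	fm = calloc(1, sizeof(*fm) + (size_t)count * sizeof(fm->fm_extents[0]));
	if (fm)
		fm->fm_extent_count = count;
	return fm;
}

/*
 * After a call has filled 'fm', decide whether another call is needed.
 * Returns 1 and stores the next fm_start in *next if the array filled up
 * before the last extent of the file was seen, 0 otherwise.
 */
int fiemap_next_start(const struct fiemap *fm, uint64_t *next)
{
	const struct fiemap_extent *last;

	assert(fm->fm_mapped_extents <= fm->fm_extent_count);
	if (fm->fm_mapped_extents == 0)
		return 0;

	last = &fm->fm_extents[fm->fm_mapped_extents - 1];
	if (last->fe_flags & SKETCH_FIEMAP_EXTENT_LAST)
		return 0;	/* end of file reached */
	if (fm->fm_mapped_extents < fm->fm_extent_count)
		return 0;	/* request satisfied with room to spare */

	*next = last->fe_logical + last->fe_length;
	return 1;
}
```

A caller looping over a large file would set fm_start to *next, shrink
fm_length accordingly, and repeat until fiemap_next_start() returns 0.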

Each extent is described by a single fiemap_extent structure as
returned in fm_extents.

struct fiemap_extent {
	__u64	fe_logical;/* logical offset in bytes for the start of
			    * the extent */
	__u64	fe_physical; /* physical offset in bytes for the start
			      * of the extent */
	__u64	fe_length; /* length in bytes for the extent */
	__u32	fe_flags;  /* returned FIEMAP_EXTENT_* flags for the extent */
	__u32	fe_lun;	   /* logical device number for extent (starting at 0)*/
};

All offsets and lengths are in bytes and mirror those on disk - it is
valid for an extent's logical offset to start before the request or
its logical length to extend past the request. Unless
FIEMAP_EXTENT_NOT_ALIGNED is returned, fe_logical, fe_physical and
fe_length will be aligned to the block size of the file system.
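
Because returned extents may stick out on either side of the request, a
program that only cares about the requested bytes has to clip them itself.
A minimal sketch follows; extent_clip() is a hypothetical helper, not
something the patches provide, and it assumes that neither end offset
overflows 64 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Clip the extent [fe_logical, fe_logical + fe_length) to the requested
 * range [start, start + length). Returns 1 and stores the overlapping
 * part in *off and *len, or returns 0 if the two ranges do not overlap.
 */
int extent_clip(uint64_t fe_logical, uint64_t fe_length,
		uint64_t start, uint64_t length,
		uint64_t *off, uint64_t *len)
{
	uint64_t ext_end = fe_logical + fe_length;
	uint64_t req_end = start + length;
	uint64_t lo = fe_logical > start ? fe_logical : start;
	uint64_t hi = ext_end < req_end ? ext_end : req_end;

	assert(off != NULL && len != NULL);

	if (lo >= hi)
		return 0;

	*off = lo;
	*len = hi - lo;
	return 1;
}
```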

The fe_flags field contains flags which describe the extent
returned. A special flag, FIEMAP_EXTENT_LAST is always set on the last
extent in the file so that the process making fiemap calls can
determine when no more extents are available.

Some flags are intentionally vague and will always be set in the
presence of other more specific flags. This way a program looking for
a general property does not have to know all existing and future flags
which imply that property.

For example, if FIEMAP_EXTENT_DATA_INLINE or FIEMAP_EXTENT_DATA_TAIL
are set, FIEMAP_EXTENT_NOT_ALIGNED will also be set. A program looking
for inline or tail-packed data can key on the specific flag. Software
which simply cares not to try operating on non-aligned extents
however, can just key on FIEMAP_EXTENT_NOT_ALIGNED, and not have to
worry about all present and future flags which might imply unaligned
data. Note that the opposite is not true - it would be valid for
FIEMAP_EXTENT_NOT_ALIGNED to appear alone.

* FIEMAP_EXTENT_LAST
This is the last extent in the file. A mapping attempt past this
extent will return nothing.

* FIEMAP_EXTENT_UNKNOWN
The location of this extent is currently unknown. This may indicate
the data is stored on an inaccessible volume or that no storage has
been allocated for the file yet.

* FIEMAP_EXTENT_SECONDARY
  - This will also set FIEMAP_EXTENT_UNKNOWN.
The data for this extent is in secondary storage.

* FIEMAP_EXTENT_DELALLOC
  - This will also set FIEMAP_EXTENT_UNKNOWN.
Delayed allocation - while there is data for this extent, its
physical location has not been allocated yet.

* FIEMAP_EXTENT_NO_DIRECT
Direct access to the data in this extent is illegal or will have
undefined results.

* FIEMAP_EXTENT_NET
  - This will also set FIEMAP_EXTENT_NO_DIRECT
The data for this extent is not stored in a locally-accessible device.

* FIEMAP_EXTENT_DATA_COMPRESSED
  - This will also set FIEMAP_EXTENT_NO_DIRECT
The data in this extent has been compressed by the file system.

* FIEMAP_EXTENT_DATA_ENCRYPTED
  - This will also set FIEMAP_EXTENT_NO_DIRECT
The data in this extent has been encrypted by the file system.

* FIEMAP_EXTENT_NOT_ALIGNED
Extent offsets and length are not guaranteed to be block aligned.

* FIEMAP_EXTENT_DATA_INLINE
  - This will also set FIEMAP_EXTENT_NOT_ALIGNED
Data is located within a meta data block.

* FIEMAP_EXTENT_DATA_TAIL
  - This will also set FIEMAP_EXTENT_NOT_ALIGNED
Data is packed into a block with data from other files.

* FIEMAP_EXTENT_UNWRITTEN
Unwritten extent - the extent is allocated but its data has not been
initialized.
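
The "This will also set" relationships in the list above can be summarized
as a small function. This is only a sketch: the SK_* names and their bit
positions are placeholders invented for this example and are not the
values defined by the patches; only the implications themselves are taken
from the list.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bit assignments, for illustration only. */
enum {
	SK_LAST            = 1u << 0,
	SK_UNKNOWN         = 1u << 1,
	SK_SECONDARY       = 1u << 2,
	SK_DELALLOC        = 1u << 3,
	SK_NO_DIRECT       = 1u << 4,
	SK_NET             = 1u << 5,
	SK_DATA_COMPRESSED = 1u << 6,
	SK_DATA_ENCRYPTED  = 1u << 7,
	SK_NOT_ALIGNED     = 1u << 8,
	SK_DATA_INLINE     = 1u << 9,
	SK_DATA_TAIL       = 1u << 10,
	SK_UNWRITTEN       = 1u << 11,
};

/* Add every general flag implied by the specific flags in 'flags'. */
uint32_t fiemap_imply_flags(uint32_t flags)
{
	uint32_t out = flags;

	if (flags & (SK_SECONDARY | SK_DELALLOC))
		out |= SK_UNKNOWN;
	if (flags & (SK_NET | SK_DATA_COMPRESSED | SK_DATA_ENCRYPTED))
		out |= SK_NO_DIRECT;
	if (flags & (SK_DATA_INLINE | SK_DATA_TAIL))
		out |= SK_NOT_ALIGNED;

	/* Flags are only ever added, never cleared. */
	assert((out & flags) == flags);
	return out;
}
```

The implication runs one way only: passing SK_NOT_ALIGNED on its own
returns it unchanged, matching the note that the general flag may appear
alone.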

VFS -> File System Implementation
---------------------------------

File systems wishing to support fiemap must implement a ->fiemap
callback (on struct inode_operations):

struct inode_operations {
       ...

       int (*fiemap) (struct inode *, struct fiemap_extent_info *, u64 start,
                      u64 len);

->fiemap is passed struct fiemap_extent_info which describes the
fiemap request:

struct fiemap_extent_info {
	unsigned int	fi_flags;		/* Flags as passed from user */
	unsigned int	fi_extents_mapped;	/* Number of mapped extents */
	unsigned int	fi_extents_max;		/* Size of fiemap_extent array */
	char		*fi_extents_start;	/* Start of fiemap_extent array */
};

It is intended that the file system should only need to access
fi_flags directly. The callback checks fi_flags to adjust its
behavior; any flags which the file system cannot handle can be written
back into fieinfo->fi_flags. In this case, the file system *must* return
-EBADR so that ioctl_fiemap() can write them into the userspace
buffer.

For each extent in the request range, the file system should call
the helper function, fiemap_fill_next_extent():

int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
			    u64 phys, u64 len, u32 flags, u32 lun);

fiemap_fill_next_extent() will use the passed values to populate the
next free extent in the fm_extents array. 'General' extent flags will
automatically be set from specific flags on behalf of the calling file
system so that the userspace API is not broken.

fiemap_fill_next_extent() returns 0 on success, and 1 when the
user-supplied fm_extents array is full. If an error is encountered
while copying the extent to user memory, -EFAULT will be returned.

If the request has the FIEMAP_FLAG_NUM_EXTENTS flag set, then calling
this helper is not necessary and fi_extents_mapped can be set
directly.
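
The calling convention above can be modelled in userspace to show the shape
of a ->fiemap callback. Everything prefixed with sketch_ is hypothetical:
the real helper copies each extent to user memory (and can return -EFAULT,
which is not modelled here), and the exact point at which it starts
returning 1 is an assumption of this sketch. The extent structure is a
local copy of the layout given earlier in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct fiemap_extent {
	uint64_t fe_logical;
	uint64_t fe_physical;
	uint64_t fe_length;
	uint32_t fe_flags;
	uint32_t fe_lun;
};

/* Userspace stand-in for struct fiemap_extent_info. */
struct sketch_extent_info {
	unsigned int fi_extents_mapped;
	unsigned int fi_extents_max;
	struct fiemap_extent *fi_extents_start;
};

/* Returns 0 while there is still room, 1 once the array is full. */
int sketch_fill_next_extent(struct sketch_extent_info *info, uint64_t logical,
			    uint64_t phys, uint64_t len, uint32_t flags,
			    uint32_t lun)
{
	struct fiemap_extent fe;

	assert(info != NULL && info->fi_extents_start != NULL);
	if (info->fi_extents_mapped >= info->fi_extents_max)
		return 1;

	memset(&fe, 0, sizeof(fe));
	fe.fe_logical = logical;
	fe.fe_physical = phys;
	fe.fe_length = len;
	fe.fe_flags = flags;
	fe.fe_lun = lun;
	info->fi_extents_start[info->fi_extents_mapped++] = fe;

	return info->fi_extents_mapped == info->fi_extents_max;
}

/*
 * Shape of a callback: walk the extents overlapping [start, start + len)
 * and stop as soon as the helper returns non-zero. 'src' stands in for
 * the file system's own extent records, sorted by logical offset.
 */
int sketch_fiemap(const struct fiemap_extent *src, unsigned int nr,
		  struct sketch_extent_info *info, uint64_t start, uint64_t len)
{
	unsigned int i;
	int ret;

	for (i = 0; i < nr; i++) {
		if (src[i].fe_logical + src[i].fe_length <= start)
			continue;	/* entirely before the request */
		if (src[i].fe_logical >= start + len)
			break;		/* entirely after the request */

		ret = sketch_fill_next_extent(info, src[i].fe_logical,
					      src[i].fe_physical,
					      src[i].fe_length,
					      src[i].fe_flags, src[i].fe_lun);
		if (ret < 0)
			return ret;	/* error from the helper */
		if (ret == 1)
			break;		/* user array is full */
	}
	return 0;
}
```

Note that extents are passed through whole, without clipping, which matches
the rule that returned extents mirror those on disk.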

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-25  0:01 [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl Mark Fasheh
@ 2008-05-25 19:42 ` Christoph Hellwig
  2008-05-25 20:59   ` Brad Boyer
                     ` (3 more replies)
  0 siblings, 4 replies; 35+ messages in thread
From: Christoph Hellwig @ 2008-05-25 19:42 UTC (permalink / raw)
  To: Mark Fasheh
  Cc: linux-fsdevel, Andreas Dilger, Kalpak Shah, Eric Sandeen,
	Josef Bacik

On Sat, May 24, 2008 at 05:01:48PM -0700, Mark Fasheh wrote:
> * FIEMAP_FLAG_HSM_READ
> If the extent is offline, retrieve it before mapping and do not flag
> it as FIEMAP_EXTENT_SECONDARY. This flag has no effect if the file
> system does not support HSM.

Given that there's no HSM support in mainline this should not be added.
It'll be useful once we add proper HSM support, though :)

> * FIEMAP_FLAG_LUN_ORDER
> If the file system stripes file data, this will return contiguous
> regions of physical allocation, sorted by LUN. Logical offsets may not
> make sense if this flag is passed. If the file system does not support
> multiple LUNs, this flag will be ignored.

A LUN doesn't make any sense in filesystem context.  That's a
scsi-centric acronym that doesn't even make sense in a scsi-centric
filesystem universe because a LUN can of course contain multiple
partitions.  It's also extremely ill-defined when using volume managers.

There are also no filesystems that actually support a single file on
multiple devices in mainline, the only filesystem that supports multiple
data devices at all (XFS) requires each file to be on a single device.

Once we have a filesystem with real multiple data device support like
btrfs or a future XFS version we can worry about this and define
a different ioctl for it.

> 
> Each extent is described by a single fiemap_extent structure as
> returned in fm_extents.
> 
> struct fiemap_extent {
> 	__u64	fe_logical;/* logical offset in bytes for the start of
> 			    * the extent */
> 	__u64	fe_physical; /* physical offset in bytes for the start
> 			      * of the extent */
> 	__u64	fe_length; /* length in bytes for the extent */
> 	__u32	fe_flags;  /* returned FIEMAP_EXTENT_* flags for the extent */
> 	__u32	fe_lun;	   /* logical device number for extent (starting at 0)*/

Again this lun thing is horribly ill-defined.  There is no such thing
as a logical device number in our filesystem terminology.

> struct fiemap_extent_info {
> 	unsigned int	fi_flags;		/* Flags as passed from user */
> 	unsigned int	fi_extents_mapped;	/* Number of mapped extents */
> 	unsigned int	fi_extents_max;		/* Size of fiemap_extent array */
> 	char		*fi_extents_start;	/* Start of fiemap_extent array */
> };

Why is this passed as a structure instead of individual arguments?
Also why isn't fi_extents_start properly typed?

> If the request has the FIEMAP_FLAG_NUM_EXTENTS flag set, then calling
> this helper is not necessary and fi_extents_mapped can be set
> directly.

Sounds like the count number of extents request should be a separate
ioctl and separate filesystem entry point instead of overloading FIEMAP.

Just define a simple FIECOUNT ioctl.


* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-25 19:42 ` Christoph Hellwig
@ 2008-05-25 20:59   ` Brad Boyer
  2008-05-26 10:59   ` Andreas Dilger
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 35+ messages in thread
From: Brad Boyer @ 2008-05-25 20:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mark Fasheh, linux-fsdevel, Andreas Dilger, Kalpak Shah,
	Eric Sandeen, Josef Bacik

On Sun, May 25, 2008 at 03:42:03PM -0400, Christoph Hellwig wrote:
> > * FIEMAP_FLAG_LUN_ORDER
> > If the file system stripes file data, this will return contiguous
> > regions of physical allocation, sorted by LUN. Logical offsets may not
> > make sense if this flag is passed. If the file system does not support
> > multiple LUNs, this flag will be ignored.
> 
> A LUN doesn't make any sense in filesystem context.  That's a
> scsi-centric acronym that doesn't even make sense in a scsi-centric
> filesystem universe because a LUN can of course contain multiple
> partitions.  It's also extremly ill-defined when using volume managers.
> 
> There's also no filesystems that actually support a single file on
> multiple device in mainline, the only filesystem that supports multiple
> data devices at all (XFS) requires each file to be on a single device.
> 
> Once we have a filesystem with real multiple data device support like
> btrfs or a future XFS version we can worry about this and defined
> a different ioctl for it.
> 
> > 
> > Each extent is described by a single fiemap_extent structure as
> > returned in fm_extents.
> > 
> > struct fiemap_extent {
> > 	__u64	fe_logical;/* logical offset in bytes for the start of
> > 			    * the extent */
> > 	__u64	fe_physical; /* physical offset in bytes for the start
> > 			      * of the extent */
> > 	__u64	fe_length; /* length in bytes for the extent */
> > 	__u32	fe_flags;  /* returned FIEMAP_EXTENT_* flags for the extent */
> > 	__u32	fe_lun;	   /* logical device number for extent (starting at 0)*/
> 
> Again this lun thing is horribly ill-defined.  There is no such thing
> as a logic device number in our filesystem terminology.

The vxfs filesystem is capable of having each extent in a file on a
different device. However, I don't think freevxfs supports that.

I do agree that calling each one a "lun" is not really meaningful.

	Brad Boyer
	flar@allandria.com



* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-25 19:42 ` Christoph Hellwig
  2008-05-25 20:59   ` Brad Boyer
@ 2008-05-26 10:59   ` Andreas Dilger
  2008-05-26 18:04     ` Brad Boyer
  2008-05-27 16:45     ` Christoph Hellwig
  2008-05-27 13:48   ` Chris Mason
  2008-05-27 18:56   ` Mark Fasheh
  3 siblings, 2 replies; 35+ messages in thread
From: Andreas Dilger @ 2008-05-26 10:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mark Fasheh, linux-fsdevel, Andreas Dilger, Kalpak Shah,
	Eric Sandeen, Josef Bacik

On May 25, 2008  15:42 -0400, Christoph Hellwig wrote:
> On Sat, May 24, 2008 at 05:01:48PM -0700, Mark Fasheh wrote:
> > * FIEMAP_FLAG_HSM_READ
> > If the extent is offline, retrieve it before mapping and do not flag
> > it as FIEMAP_EXTENT_SECONDARY. This flag has no effect if the file
> > system does not support HSM.
> 
> Given that there's no HSM support in mainline this should not be added.
> It'll be useful once we add proper HSM support, though :)

This was added at the request of David for XFS, because the XFS bmap
ioctl defaults to reading in extents from HSM.  I don't have any
attachment to it myself.

> > * FIEMAP_FLAG_LUN_ORDER
> > If the file system stripes file data, this will return contiguous
> > regions of physical allocation, sorted by LUN. Logical offsets may not
> > make sense if this flag is passed. If the file system does not support
> > multiple LUNs, this flag will be ignored.
> 
> A LUN doesn't make any sense in filesystem context.  That's a
> scsi-centric acronym that doesn't even make sense in a scsi-centric
> filesystem universe because a LUN can of course contain multiple
> partitions.  It's also extremly ill-defined when using volume managers.

What else do you propose calling this?  It isn't a LUN in the SCSI sense
of course, but there is definitely a need to be able to identify multiple
disks.  Regardless of whether there is a single disk or multiple disks
involved, it is generally called a LUN.  It is better than calling it
a "disk" or a "partition".

> There's also no filesystems that actually support a single file on
> multiple device in mainline, the only filesystem that supports multiple
> data devices at all (XFS) requires each file to be on a single device.
> 
> Once we have a filesystem with real multiple data device support like
> btrfs or a future XFS version we can worry about this and defined
> a different ioctl for it.

I don't see why we need a different ioctl for mapping extents on a
filesystem that supports direct access to multiple disks.  Having one
mechanism that returns the file mapping is much simpler for user
space applications (filefrag, cp, tar, gzip, etc) than having to use
different ioctls for different backing filesystems.

> > Each extent is described by a single fiemap_extent structure as
> > returned in fm_extents.
> > 
> > struct fiemap_extent {
> > 	__u64	fe_logical;/* logical offset in bytes for the start of
> > 			    * the extent */
> > 	__u64	fe_physical; /* physical offset in bytes for the start
> > 			      * of the extent */
> > 	__u64	fe_length; /* length in bytes for the extent */
> > 	__u32	fe_flags;  /* returned FIEMAP_EXTENT_* flags for the extent */
> > 	__u32	fe_lun;	   /* logical device number for extent (starting at 0)*/
> 
> Again this lun thing is horribly ill-defined.  There is no such thing
> as a logic device number in our filesystem terminology.

Propose a better name then, but the need for it will not go away.  This
is needed for Lustre, btrfs, pNFS, etc.  The whole point of developing
this API and getting input from all of the main filesystems was to have
a single common interface that could be used by all filesystems.

> > struct fiemap_extent_info {
> > 	unsigned int	fi_flags;		/* Flags as passed from user */
> > 	unsigned int	fi_extents_mapped;	/* Number of mapped extents */
> > 	unsigned int	fi_extents_max;		/* Size of fiemap_extent array */
> > 	char		*fi_extents_start;	/* Start of fiemap_extent array */
> > };
> 
> Why is this passes a structure instead of individual arguments?

Saves on passing this around as arguments on the stack?  Also, for ext4
there is an iterator function which needs a private data struct passed,
and it doesn't make sense to require duplicating all of this information
again.

> Also why isn't fi_extents_start properly typed?

I was wondering about that, I'm not sure why Mark implemented it that
way.  I would have thought that it should be a struct fiemap_extent *.
I thought maybe to allow for misaligned userspace pointers, but I'm
not sure.

> > If the request has the FIEMAP_FLAG_NUM_EXTENTS flag set, then calling
> > this helper is not necessary and fi_extents_mapped can be set
> > directly.
> 
> Sounds like the count number of extents request should be a separate
> ioctl and separate filesystem entry point instead of overloading FIEMAP.

I don't see that at all.  The operations that the filesystem has to do
are basically the same whether it is counting extents or returning them.
All that would result from having separate ioctl and filesystem methods
would be a lot of code duplication.

The fiemap_fill_next_extent() call will handle the NUM_EXTENTS operation
internally, and the filesystem code doesn't need to special case this
at all.  The only time the NUM_EXTENTS case would be handled by the
filesystem specially would be if it tracks the count of extents itself
for some reason.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-26 10:59   ` Andreas Dilger
@ 2008-05-26 18:04     ` Brad Boyer
  2008-05-27 16:45     ` Christoph Hellwig
  1 sibling, 0 replies; 35+ messages in thread
From: Brad Boyer @ 2008-05-26 18:04 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Christoph Hellwig, Mark Fasheh, linux-fsdevel, Andreas Dilger,
	Kalpak Shah, Eric Sandeen, Josef Bacik

On Mon, May 26, 2008 at 04:59:44AM -0600, Andreas Dilger wrote:
> On May 25, 2008  15:42 -0400, Christoph Hellwig wrote:
> > A LUN doesn't make any sense in filesystem context.  That's a
> > scsi-centric acronym that doesn't even make sense in a scsi-centric
> > filesystem universe because a LUN can of course contain multiple
> > partitions.  It's also extremly ill-defined when using volume managers.
> 
> What else do you propose calling this?  It isn't a LUN in the SCSI sense
> of course, but there is definitely a need to be able to identify multiple
> disks.  Regardless of whether there is a single disk or multiple disks
> involved, it is generally called a LUN.  It is a better than calling it
> a "disk" or a "partition".

How about "device"? It's more generic than LUN but also doesn't imply
a particular implementation. A volume manager exports individual
volumes as devices. Regular partitions on a disk are also exported
to user space as devices. I don't think anyone would get confused and
think a filesystem would be using a non-storage device.

	Brad Boyer
	flar@allandria.com



* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-25 19:42 ` Christoph Hellwig
  2008-05-25 20:59   ` Brad Boyer
  2008-05-26 10:59   ` Andreas Dilger
@ 2008-05-27 13:48   ` Chris Mason
  2008-05-27 16:21     ` Eric Sandeen
  2008-05-27 16:52     ` jim owens
  2008-05-27 18:56   ` Mark Fasheh
  3 siblings, 2 replies; 35+ messages in thread
From: Chris Mason @ 2008-05-27 13:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mark Fasheh, linux-fsdevel, Andreas Dilger, Kalpak Shah,
	Eric Sandeen, Josef Bacik

On Sunday 25 May 2008, Christoph Hellwig wrote:

Thanks for doing this Mark ;)

> On Sat, May 24, 2008 at 05:01:48PM -0700, Mark Fasheh wrote:
> > * FIEMAP_FLAG_HSM_READ
> > If the extent is offline, retrieve it before mapping and do not flag
> > it as FIEMAP_EXTENT_SECONDARY. This flag has no effect if the file
> > system does not support HSM.
>
> Given that there's no HSM support in mainline this should not be added.
> It'll be useful once we add proper HSM support, though :)
>

The HSM flag doesn't hurt, and it allows the people actually shipping hsm 
patches to use fiemap without extending the api themselves.  Reserving the 
flag isn't a bad idea.

> > * FIEMAP_FLAG_LUN_ORDER
> > If the file system stripes file data, this will return contiguous
> > regions of physical allocation, sorted by LUN. Logical offsets may not
> > make sense if this flag is passed. If the file system does not support
> > multiple LUNs, this flag will be ignored.
>
> A LUN doesn't make any sense in filesystem context.  That's a
> scsi-centric acronym that doesn't even make sense in a scsi-centric
> filesystem universe because a LUN can of course contain multiple
> partitions.  It's also extremly ill-defined when using volume managers.
>
> There's also no filesystems that actually support a single file on
> multiple device in mainline, the only filesystem that supports multiple
> data devices at all (XFS) requires each file to be on a single device.
>
> Once we have a filesystem with real multiple data device support like
> btrfs or a future XFS version we can worry about this and defined
> a different ioctl for it.
>

For btrfs I would return the logical extents via fiemap (just like the file 
were on lvm) and make btrfs specific ioctls for details about where the file 
actually lived.

fiemap alone isn't a great way to describe raid levels or complex storage 
topologies.  To include physical information I would also have to encode the 
raid level used and information about all the devices the data is replicated 
on (raid1/10)

-chris


* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-27 13:48   ` Chris Mason
@ 2008-05-27 16:21     ` Eric Sandeen
  2008-05-27 16:47       ` Christoph Hellwig
  2008-05-27 16:52     ` jim owens
  1 sibling, 1 reply; 35+ messages in thread
From: Eric Sandeen @ 2008-05-27 16:21 UTC (permalink / raw)
  To: Chris Mason
  Cc: Christoph Hellwig, Mark Fasheh, linux-fsdevel, Andreas Dilger,
	Kalpak Shah, Eric Sandeen, Josef Bacik

Chris Mason wrote:
> On Sunday 25 May 2008, Christoph Hellwig wrote:
> 
> Thanks for doing this Mark ;)
> 
>> On Sat, May 24, 2008 at 05:01:48PM -0700, Mark Fasheh wrote:
>>> * FIEMAP_FLAG_HSM_READ
>>> If the extent is offline, retrieve it before mapping and do not flag
>>> it as FIEMAP_EXTENT_SECONDARY. This flag has no effect if the file
>>> system does not support HSM.
>> Given that there's no HSM support in mainline this should not be added.
>> It'll be useful once we add proper HSM support, though :)
>>
> 
> The HSM flag doesn't hurt, and it allows the people actually shipping hsm 
> patches to use fiemap without extending the api themselves.  Reserving the 
> flag isn't a bad idea.

Here I agree.  HSM is a generic enough concept, and I think this
interface's API w.r.t. HSM is well-enough defined that there's no reason
not to go ahead & put it in now, IMHO.

-Eric


* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-26 10:59   ` Andreas Dilger
  2008-05-26 18:04     ` Brad Boyer
@ 2008-05-27 16:45     ` Christoph Hellwig
  2008-05-27 21:10       ` Mark Fasheh
  1 sibling, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2008-05-27 16:45 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Christoph Hellwig, Mark Fasheh, linux-fsdevel, Andreas Dilger,
	Kalpak Shah, Eric Sandeen, Josef Bacik

On Mon, May 26, 2008 at 04:59:44AM -0600, Andreas Dilger wrote:
> > A LUN doesn't make any sense in filesystem context.  That's a
> > scsi-centric acronym that doesn't even make sense in a scsi-centric
> > filesystem universe because a LUN can of course contain multiple
> > partitions.  It's also extremly ill-defined when using volume managers.
> 
> What else do you propose calling this?  It isn't a LUN in the SCSI sense
> of course, but there is definitely a need to be able to identify multiple
> disks.  Regardless of whether there is a single disk or multiple disks
> involved, it is generally called a LUN.  It is a better than calling it
> a "disk" or a "partition".

See below, because the naming really depends on defining the semantics
of this field.

> I don't see why we need a different ioctl for mapping extents on a
> filesystem that support direct access to multiple disks.  Having one
> mechanism that returns the file mapping is much more simple for user
> space applications (filefrag, cp, tar, gzip, etc) than having to use
> different ioctls for different backing filesystems.

Well, we could add a dev field that contains the dev_t for the
underlying block device.  That would work for the current XFS realtime
device as well as for my work to map different XFS AGs to different
devices.  It wouldn't work for btrfs with integrated raid code where
a single extent can span multiple underlying devices, the same probably
applies to pNFS.

> > Why is this passes a structure instead of individual arguments?
> 
> Saves on passing this around as arguments on the stack?  Also, for ext4
> there is an iterator function which needs a private data struct passed,
> and it doesn't make sense to require duplicating all of this information
> again.

Ok.

> > > If the request has the FIEMAP_FLAG_NUM_EXTENTS flag set, then calling
> > > this helper is not necessary and fi_extents_mapped can be set
> > > directly.
> > 
> > Sounds like the count number of extents request should be a separate
> > ioctl and separate filesystem entry point instead of overloading FIEMAP.
> 
> I don't see that at all.  The operations that the filesystem has to do
> are basically the same whether it is counting extents or returning them.
> All that would result from having separate ioctl and filesystem methods
> would be a lot of code duplication.
> 
> The fiemap_fill_next_extents() call will handle the NUM_EXTENTS operation
> internally, and the filesystem code doesn't need to special case this
> at all.  The only time the NUM_EXTENTS case would be handled by the
> filesystem specially would be if it tracks the count of extents itself
> for some reason.

It just special cases it.  As does for example the ext4 handler.  Please
keep the API simple instead of overloading it already from the start.
And when you look at the ext4 implementation with the extent walker it's
pretty simple to implement a fiecount by having a second callback with
a trivial shared helper.



* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-27 16:21     ` Eric Sandeen
@ 2008-05-27 16:47       ` Christoph Hellwig
  2008-05-27 20:34         ` Joel Becker
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2008-05-27 16:47 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Chris Mason, Christoph Hellwig, Mark Fasheh, linux-fsdevel,
	Andreas Dilger, Kalpak Shah, Eric Sandeen, Josef Bacik

On Tue, May 27, 2008 at 11:21:21AM -0500, Eric Sandeen wrote:
> > The HSM flag doesn't hurt, and it allows the people actually shipping hsm 
> > patches to use fiemap without extending the api themselves.  Reserving the 
> > flag isn't a bad idea.
> 
> Here I agree.  HSM is a generic enough concept, and I think this
> interface's API w.r.t. HSM is well-enough defined that there's no reason
> not to go ahead & put it in now, IMHO.

But there is no such thing as HSM support anywhere near mainline.  Call
me a dickhead, but I'm 100% against adding anything helping HSM until
people get their act together to actually add HSM support.  It's
something really useful that we should have, and not something that
should be in really grotty out of tree codebases.



* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-27 13:48   ` Chris Mason
  2008-05-27 16:21     ` Eric Sandeen
@ 2008-05-27 16:52     ` jim owens
  2008-05-27 17:19       ` Chris Mason
  1 sibling, 1 reply; 35+ messages in thread
From: jim owens @ 2008-05-27 16:52 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Chris Mason, Christoph Hellwig, Mark Fasheh, Andreas Dilger,
	Kalpak Shah, Eric Sandeen, Josef Bacik

For what it is worth, a few comments from a newbie who has
experience with a non-linux filesystem that has a similar API
and supports files spread across multiple devices.

Mark Fasheh wrote:
> 
> * FIEMAP_FLAG_LUN_ORDER
> If the file system stripes file data, this will return contiguous
> regions of physical allocation, sorted by LUN. Logical offsets may not
> make sense if this flag is passed. If the file system does not support
> multiple LUNs, this flag will be ignored.

This should return an error (ENOTSUPPORTED ?) if the FS does
not support multiple devices OR does not support sort-by-lun-order
so the caller does not count on the info being sorted.  Even an FS
that supports multiple devices per file may be unable to sort it
by on-disk-order without consuming an ugly set of resources.

Christoph Hellwig wrote:

>>	__u32	fe_lun;	   /* logical device number for extent (starting at 0)*/
> 
> 
> Again this lun thing is horribly ill-defined.  There is no such thing
> as a logic device number in our filesystem terminology.

I agree that LUN is confusing.  In my opinion the words "logical"
and "number" are overused and meaningless.  As Brad suggested,
"device" would be preferable, or "unit", but unfortunately every
word I can think of has some other definition too :)

Our term was "volume"... an awful designation.

Chris Mason wrote:

> For btrfs I would return the logical extents via fiemap (just like the file 
> were on lvm) and make btrfs specific ioctls for details about where the file 
> actually lived.
> 
> fiemap alone isn't a great way to describe raid levels or complex storage 
> topologies.  To include physical information I would also have to encode the 
> raid level used and information about all the devices the data is replicated 
> on (raid1/10)

fiemap by itself is useful for programs that want to determine
how fragmented a file is or where sparse areas are to skip.

At least one more generic API is needed to enumerate the device number
to device (path name, inode, socket, ... ?).  In our case this was only
used for clusters.

For the complex case you describe, it might be possible to have
an "enumerate" api that could be used to traverse each layer for
more detail.  I hope this is done generically by someone.

A final thought on this:

> 	__u32	fe_lun;	   /* logical device number for extent (starting at 0)*/

While the flags field can be used to tell the validity of this
number, we found that starting at 0 was not a good practice.
We started at 1 so 0 was always a not-valid.  One way this can
be useful is if you have delayed allocation, you can indicate
"intended device" with a non-0 number.  Of course other values
such as max_int could be termed "invalid" instead.

Another point to document is whether this number is a contiguous
series (1, 2, 3,... N) defining the location based on the current
device list or is possibly a sparse (1, 2, 6) series because the
FS tracks devices that have been removed.  In our implementation
both views were present for different consumers.  The sparse
series was native and the contiguous series a translation.
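
To make the two numbering views concrete, here is a minimal sketch in C
of translating a native (sparse) device ID into its position in the
current device list, with 0 reserved as "not valid". The names
DEV_INVALID and dev_native_to_contiguous() are hypothetical and exist
only for this illustration; nothing here is taken from the fiemap
patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical convention for this sketch: 0 never names a real device. */
#define DEV_INVALID 0u

/*
 * devs[] holds the native IDs of the devices currently in the
 * filesystem, e.g. {1, 2, 6} after devices 3..5 were removed.
 * Returns the 1-based position of native_id in that list, or
 * DEV_INVALID if the ID is 0 or no longer present.
 */
uint32_t dev_native_to_contiguous(const uint32_t *devs, size_t ndevs,
                                  uint32_t native_id)
{
    assert(devs != NULL || ndevs == 0);

    if (native_id == DEV_INVALID)
        return DEV_INVALID;

    for (size_t i = 0; i < ndevs; i++) {
        if (devs[i] == native_id)
            return (uint32_t)(i + 1);
    }
    return DEV_INVALID;
}
```

With a device list of {1, 2, 6}, native ID 6 maps to contiguous index 3,
while a removed ID such as 4 maps to DEV_INVALID. Whichever view an API
exposes, the documentation would need to say which one it is.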

jim


* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-27 16:52     ` jim owens
@ 2008-05-27 17:19       ` Chris Mason
  2008-05-28 16:09         ` Andreas Dilger
  0 siblings, 1 reply; 35+ messages in thread
From: Chris Mason @ 2008-05-27 17:19 UTC (permalink / raw)
  To: jim owens
  Cc: linux-fsdevel, Christoph Hellwig, Mark Fasheh, Andreas Dilger,
	Kalpak Shah, Eric Sandeen, Josef Bacik

On Tuesday 27 May 2008, jim owens wrote:
> For what it is worth, a few comments from a newbie who has
> experience with a non-linux filesystem that has a similar API
> and supports files spread across multiple devices.
>
> Mark Fasheh wrote:
> > * FIEMAP_FLAG_LUN_ORDER
> > If the file system stripes file data, this will return contiguous
> > regions of physical allocation, sorted by LUN. Logical offsets may not
> > make sense if this flag is passed. If the file system does not support
> > multiple LUNs, this flag will be ignored.
>
> This should return an error (ENOTSUPPORTED ?) if the FS does
> not support multiple devices OR does not support sort-by-lun-order
> so the caller does not count on the info being sorted.  Even an FS
> that supports multiple devices per file may be unable to sort it
> by on-disk-order without consuming an ugly set of resources.

That's a good point, I couldn't provide 100% sorted output even if I wanted 
to.

>
> Christoph Hellwig wrote:
> >>	__u32	fe_lun;	   /* logical device number for extent (starting at 0)*/
> >
> > Again this lun thing is horribly ill-defined.  There is no such thing
> > as a logic device number in our filesystem terminology.
>
> I agree that LUN is confusing.  In my opinion the words "logical"
> and "number" are overused and meaningless.  As Brad suggested,
> "device" would be preferable, or "unit", but unfortunately every
> word I can think of has some other definition too :)
>
> Our term was "volume"... an awful designation.
>
> Chris Mason wrote:
> > For btrfs I would return the logical extents via fiemap (just like the
> > file were on lvm) and make btrfs specific ioctls for details about where
> > the file actually lived.
> >
> > fiemap alone isn't a great way to describe raid levels or complex storage
> > topologies.  To include physical information I would also have to encode
> > the raid level used and information about all the devices the data is
> > replicated on (raid1/10)
>
> fiemap by itself is useful for programs that want to determine
> how fragmented a file is or where sparse areas are to skip.

Yes, and since it has no concurrency semantics, use outside of that quickly 
gets difficult.  fibmap is used by lilo, and reiserfs needed a special ioctl 
that said i've-called-fibmap-please-don't-move-these-bytes and prevented 
tail packing.

>
> At least one more generic API is needed to enumerate the device number
> to device (path name, inode, socket, ... ?).  In our case this was only
> used for clusters.
>
> For the complex case you describe, it might be possible to have
> an "enumerate" api that could be used to traverse each layer for
> more detail.  I hope this is done generically by someone.
>

It would be especially interesting if the enumerate API actually went all the 
way down to the lvm/md layers as well.

> A final thought on this:
> > 	__u32	fe_lun;	   /* logical device number for extent (starting at 0)*/
>
> While the flags field can be used to tell the validity of this
> number, we found that starting at 0 was not a good practice.
> We started at 1 so 0 was always a not-valid.  One way this can
> be useful is if you have delayed allocation, you can indicate
> "intended device" with a non-0 number.  Of course other values
> such as max_int could be termed "invalid" instead.

I use 0 as not-valid as well.  The original intent was 0 meant 
logical-block-number, signaling additional lookups were needed.  But I 
haven't found a good use case for that yet.

>
> Another point to document is whether this number is a contiguous
> series (1, 2, 3,... N) defining the location based on the current
> device list or is possibly a sparse (1, 2, 6) series because the
> FS tracks devices that have been removed.  In our implementation
> both views were present for different consumers.  The sparse
> series was native and the contiguous series a translation.

Interesting, I've been presenting the sparse representation only.

-chris


* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-25 19:42 ` Christoph Hellwig
                     ` (2 preceding siblings ...)
  2008-05-27 13:48   ` Chris Mason
@ 2008-05-27 18:56   ` Mark Fasheh
  2008-05-27 20:31     ` Joel Becker
  3 siblings, 1 reply; 35+ messages in thread
From: Mark Fasheh @ 2008-05-27 18:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, Andreas Dilger, Kalpak Shah, Eric Sandeen,
	Josef Bacik

On Sun, May 25, 2008 at 03:42:03PM -0400, Christoph Hellwig wrote:

[ I'll get back to you regarding luns and the HSM flag ]

> > struct fiemap_extent_info {
> > 	unsigned int	fi_flags;		/* Flags as passed from user */
> > 	unsigned int	fi_extents_mapped;	/* Number of mapped extents */
> > 	unsigned int	fi_extents_max;		/* Size of fiemap_extent array */
> > 	char		*fi_extents_start;	/* Start of fiemap_extent array */
> > };
> 
> Why is this passed a structure instead of individual arguments?

[ The structure vs args question seems to have been addressed elsewhere in
this thread ]

> Also why isn't fi_extents_start properly typed?

There's no good reason, my brain just wrote it that way. I can type it
properly in my next patch.


> > If the request has the FIEMAP_FLAG_NUM_EXTENTS flag set, then calling
> > this helper is not necessary and fi_extents_mapped can be set
> > directly.
> 
> Sounds like the count number of extents request should be a separate
> ioctl and separate filesystem entry point instead of overloading FIEMAP.
> 
> Just define a simple FIECOUNT ioctl.

Having it as a separate, simpler ioctl is fine with me. As it is, we're sort
of shoe-horning it into a structure which is more optimized for returning
actual extent data. The flag breaks lots of rules for that structure, such
as no actual fm_extents needing to be allocated, fm_extent_count is
magically ignored for that call, etc. So the simpler API to userspace is a
win, IMHO.

What about the back-end though? This is pretty transparently handled in
fiemap_fill_next_extent() and many file systems (Ocfs2 included) would just
have ->fiecount callbacks that are nearly identical to their
->fiemap...
	--Mark

--
Mark Fasheh


* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-27 18:56   ` Mark Fasheh
@ 2008-05-27 20:31     ` Joel Becker
  2008-05-27 20:49       ` Mark Fasheh
                         ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Joel Becker @ 2008-05-27 20:31 UTC (permalink / raw)
  To: Mark Fasheh
  Cc: Christoph Hellwig, linux-fsdevel, Andreas Dilger, Kalpak Shah,
	Eric Sandeen, Josef Bacik

On Tue, May 27, 2008 at 11:56:22AM -0700, Mark Fasheh wrote:
> > > If the request has the FIEMAP_FLAG_NUM_EXTENTS flag set, then calling
> > > this helper is not necessary and fi_extents_mapped can be set
> > > directly.
> > 
> > Sounds like the count number of extents request should be a separate
> > ioctl and separate filesystem entry point instead of overloading FIEMAP.
> > 
> > Just define a simple FIECOUNT ioctl.
> 
> What about the back-end though? This is pretty transparently handled in
> fiemap_fill_next_extent() and many file systems (Ocfs2 included) would just
> have ->fiecount callbacks that are nearly identical to their
> ->fiemap...

	Provide generic_fiemap_fiecount() that does the operation in
terms of ->fiemap().  Then filesystems like ocfs2 can just do .fiecount
= generic_fiemap_fiecount.
	I agree with Christoph that it seems a bit overloaded when done
as a special case of FIEMAP.
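
	As a rough illustration of "counting in terms of ->fiemap()", here
is a userspace model in C. It is only a sketch under stated assumptions:
struct extent_info, fill_next_extent(), toy_fiemap() and the signature
given to generic_fiemap_fiecount() are all hypothetical stand-ins, not
the kernel interfaces from the patch set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; not the structures from the patches. */
struct extent {
    unsigned long long logical, physical, length;
};

struct extent_info {
    unsigned int mapped;   /* extents seen so far */
    unsigned int max;      /* capacity of out[]; unused when out == NULL */
    struct extent *out;    /* NULL means "count only" */
};

/* Shared helper: store one extent, or just count it. Returns 1 when full. */
int fill_next_extent(struct extent_info *fi, const struct extent *e)
{
    if (fi->out != NULL) {
        if (fi->mapped >= fi->max)
            return 1;
        fi->out[fi->mapped] = *e;
    }
    fi->mapped++;
    return 0;
}

/* Shape of a per-filesystem mapping callback in this model. */
typedef int (*fiemap_fn)(const void *file, struct extent_info *fi);

/* Generic count: run the mapping callback with no output buffer. */
int generic_fiemap_fiecount(fiemap_fn fiemap, const void *file,
                            unsigned int *count)
{
    struct extent_info fi = { 0, 0, NULL };
    int err;

    assert(count != NULL);
    err = fiemap(file, &fi);
    if (err < 0)
        return err;
    *count = fi.mapped;
    return 0;
}

/* Toy "filesystem": a file is just a fixed array of extents. */
struct toy_file {
    const struct extent *extents;
    size_t nr;
};

int toy_fiemap(const void *file, struct extent_info *fi)
{
    const struct toy_file *f = file;

    for (size_t i = 0; i < f->nr; i++) {
        if (fill_next_extent(fi, &f->extents[i]))
            break;   /* caller's buffer is full */
    }
    return 0;
}
```

	In this model the filesystem writes one walker and the count entry
point is a thin wrapper, which is the trade-off being argued over: a
simpler user API at the cost of one more layer in the back end.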

Joel

-- 

"This is the end, beautiful friend.
 This is the end, my only friend the end
 Of our elaborate plans, the end
 Of everything that stands, the end
 No safety or surprise, the end
 I'll never look into your eyes again."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-27 16:47       ` Christoph Hellwig
@ 2008-05-27 20:34         ` Joel Becker
  0 siblings, 0 replies; 35+ messages in thread
From: Joel Becker @ 2008-05-27 20:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Eric Sandeen, Chris Mason, Mark Fasheh, linux-fsdevel,
	Andreas Dilger, Kalpak Shah, Eric Sandeen, Josef Bacik

On Tue, May 27, 2008 at 12:47:30PM -0400, Christoph Hellwig wrote:
> On Tue, May 27, 2008 at 11:21:21AM -0500, Eric Sandeen wrote:
> > Here I agree.  HSM is a generic enough concept, and I think this
> > interface's API w.r.t. HSM is well-enough defined that there's no reason
> > not to go ahead & put it in now, IMHO.
> 
> But there is no such thing as HSM support anywhere near mainline.  Call
> me a dickhead, but I'm 100% against adding anything helping HSM until
> people get their act together to actually add HSM support.  It's
> something really useful that we should have, and not something that
> should be in really grotty out of tree codebases.

	I would love to see HSM in mainline too, but this is a single
flag that belongs in a well-defined interface.  It isn't the place to
fight over.

Joel

-- 

Life's Little Instruction Book #226

	"When someone hugs you, let them be the first to let go."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-27 20:31     ` Joel Becker
@ 2008-05-27 20:49       ` Mark Fasheh
  2008-05-28  5:14       ` Christoph Hellwig
  2008-05-28 16:02       ` Andreas Dilger
  2 siblings, 0 replies; 35+ messages in thread
From: Mark Fasheh @ 2008-05-27 20:49 UTC (permalink / raw)
  To: Joel Becker
  Cc: Christoph Hellwig, linux-fsdevel, Andreas Dilger, Kalpak Shah,
	Eric Sandeen, Josef Bacik

On Tue, May 27, 2008 at 01:31:24PM -0700, Joel Becker wrote:
> On Tue, May 27, 2008 at 11:56:22AM -0700, Mark Fasheh wrote:
> > > > If the request has the FIEMAP_FLAG_NUM_EXTENTS flag set, then calling
> > > > this helper is not necessary and fi_extents_mapped can be set
> > > > directly.
> > > 
> > > Sounds like the count number of extents request should be a separate
> > > ioctl and separate filesystem entry point instead of overloading FIEMAP.
> > > 
> > > Just define a simple FIECOUNT ioctl.
> > 
> > What about the back-end though? This is pretty transparently handled in
> > fiemap_fill_next_extent() and many file systems (Ocfs2 included) would just
> > have ->fiecount callbacks that are nearly identical to their
> > ->fiemap...
> 
> 	Provide generic_fiemap_fiecount() that does the operation in
> terms of ->fiemap().  Then filesystems like ocfs2 can just do .fiecount
> = generic_fiemap_fiecount.

Right - that's essentially what I'm getting at. Some of Christoph's comments
later in this thread seemed geared towards implementation too, so I was just
wondering...
	--Mark

--
Mark Fasheh


* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-27 16:45     ` Christoph Hellwig
@ 2008-05-27 21:10       ` Mark Fasheh
  0 siblings, 0 replies; 35+ messages in thread
From: Mark Fasheh @ 2008-05-27 21:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andreas Dilger, linux-fsdevel, Andreas Dilger, Kalpak Shah,
	Eric Sandeen, Josef Bacik

On Tue, May 27, 2008 at 12:45:46PM -0400, Christoph Hellwig wrote:
> > I don't see why we need a different ioctl for mapping extents on a
> > filesystem that supports direct access to multiple disks.  Having one
> > mechanism that returns the file mapping is much simpler for user
> > space applications (filefrag, cp, tar, gzip, etc) than having to use
> > different ioctls for different backing filesystems.
> 
> Well, we could add a dev field that contains the dev_t for the
> underlying block device.  That would work for the current XFS realtime
> device as well as for my work to map different XFS AGs to different
> devices.  It wouldn't work for btrfs with integrated raid code where
> a single extent can span multiple underlying devices, the same probably
> applies to pnfs.

Dev_t seems reasonable to me.

Anything more complicated than just wanting a simple device identifier
needs a more dedicated API.
	--Mark

--
Mark Fasheh


* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-27 20:31     ` Joel Becker
  2008-05-27 20:49       ` Mark Fasheh
@ 2008-05-28  5:14       ` Christoph Hellwig
  2008-05-28 16:02       ` Andreas Dilger
  2 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2008-05-28  5:14 UTC (permalink / raw)
  To: Joel Becker
  Cc: Mark Fasheh, Christoph Hellwig, linux-fsdevel, Andreas Dilger,
	Kalpak Shah, Eric Sandeen, Josef Bacik

On Tue, May 27, 2008 at 01:31:24PM -0700, Joel Becker wrote:
> 	Provide generic_fiemap_fiecount() that does the operation in
> terms of ->fiemap().  Then filesystems like ocfs2 can just do .fiecount
> = generic_fiemap_fiecount.

Yes, exactly.  Although I suspect for most filesystems an iterator-based
implementation like the one for ext4 in the patchset would actually be
cleaner.

I like these iterators; they helped clean up a lot of messy interfaces
in XFS.


* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-27 20:31     ` Joel Becker
  2008-05-27 20:49       ` Mark Fasheh
  2008-05-28  5:14       ` Christoph Hellwig
@ 2008-05-28 16:02       ` Andreas Dilger
  2008-05-28 17:04         ` Joel Becker
  2008-05-29  5:55         ` Christoph Hellwig
  2 siblings, 2 replies; 35+ messages in thread
From: Andreas Dilger @ 2008-05-28 16:02 UTC (permalink / raw)
  To: Joel Becker
  Cc: Mark Fasheh, Christoph Hellwig, linux-fsdevel, Andreas Dilger,
	Kalpak Shah, Eric Sandeen, Josef Bacik

On May 27, 2008  13:31 -0700, Joel Becker wrote:
> On Tue, May 27, 2008 at 11:56:22AM -0700, Mark Fasheh wrote:
> > > > If the request has the FIEMAP_FLAG_NUM_EXTENTS flag set, then calling
> > > > this helper is not necessary and fi_extents_mapped can be set
> > > > directly.
> > > 
> > > Sounds like the count number of extents request should be a separate
> > > ioctl and separate filesystem entry point instead of overloading FIEMAP.
> > > 
> > > Just define a simple FIECOUNT ioctl.
> > 
> > What about the back-end though? This is pretty transparently handled in
> > fiemap_fill_next_extent() and many file systems (Ocfs2 included) would just
> > have ->fiecount callbacks that are nearly identical to their
> > ->fiemap...
> 
> 	Provide generic_fiemap_fiecount() that does the operation in
> terms of ->fiemap().  Then filesystems like ocfs2 can just do .fiecount
> = generic_fiemap_fiecount.
> 	I agree with Christoph that it seems a bit overloaded when done
> as a special case of FIEMAP.

The question is whether there are any (or many) filesystems that will NOT
implement ->fiecount() as a wrapper of ->fiemap()?  At that point the
"simplification" of the API means that there is actually more code and
layers being called for the simple version.

I'm not saying I'm against having a FIECOUNT ioctl, we're probably just
going to use an updated filefrag like everyone else, but it doesn't seem
like a net reduction in code anywhere.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-27 17:19       ` Chris Mason
@ 2008-05-28 16:09         ` Andreas Dilger
  2008-05-28 16:33           ` Chris Mason
  2008-05-29 13:01           ` Christoph Hellwig
  0 siblings, 2 replies; 35+ messages in thread
From: Andreas Dilger @ 2008-05-28 16:09 UTC (permalink / raw)
  To: Chris Mason
  Cc: jim owens, linux-fsdevel, Christoph Hellwig, Mark Fasheh,
	Andreas Dilger, Kalpak Shah, Eric Sandeen, Josef Bacik

On May 27, 2008  13:19 -0400, Chris Mason wrote:
> On Tuesday 27 May 2008, jim owens wrote:
> > For what it is worth, a few comments from a newbie who has
> > experience with a non-linux filesystem that has a similar API
> > and supports files spread across multiple devices.
> >
> > Mark Fasheh wrote:
> > > * FIEMAP_FLAG_LUN_ORDER
> > > If the file system stripes file data, this will return contiguous
> > > regions of physical allocation, sorted by LUN. Logical offsets may not
> > > make sense if this flag is passed. If the file system does not support
> > > multiple LUNs, this flag will be ignored.
> >
> > This should return an error (ENOTSUPPORTED ?) if the FS does
> > not support multiple devices OR does not support sort-by-lun-order
> > so the caller does not count on the info being sorted.  Even an FS
> > that supports multiple devices per file may be unable to sort it
> > by on-disk-order without consuming an ugly set of resources.
> 
> That's a good point, I couldn't provide 100% sorted output even if I wanted 
> to.

I'm OK with this also.  The only reason I thought "simple" filesystems
(i.e. single-lun) should ignore FLAG_LUN_ORDER is so that tools like
filefrag can always try with LUN_ORDER and in most cases still get a
mapping returned.  If the filesystem doesn't care about LUN_ORDER, no
harm done, because all of the extents live on a single LUN anyways.  If
a multi-device filesystem doesn't want to implement LUN_ORDER, returning
-EBADR is perfectly acceptable because the application will retry without
the unsupported flags (LUN_ORDER in this case) and get the logical file
offset order data returned.
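
The retry behaviour described above can be sketched in C as follows.
This is a model under stated assumptions, not the real ioctl path: the
flag value FLAG_LUN_ORDER, the map_fn callback type and both functions
are hypothetical, and the callback stands in for whatever call actually
issues the mapping request.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#ifndef EBADR
#define EBADR 53   /* Linux value; defined here only for portability */
#endif

#define FLAG_LUN_ORDER 0x1u   /* hypothetical flag value for this sketch */

/* Hypothetical mapping call: returns 0 on success or a negative errno. */
typedef int (*map_fn)(unsigned int flags, void *ctx);

/*
 * Try the request with the caller's flags first. If it is rejected with
 * -EBADR and LUN_ORDER was requested, retry once in logical-offset order.
 * On success, *used_flags reports the flags that were accepted.
 */
int map_with_fallback(map_fn map, void *ctx, unsigned int flags,
                      unsigned int *used_flags)
{
    int err;

    assert(used_flags != NULL);
    err = map(flags, ctx);
    if (err == -EBADR && (flags & FLAG_LUN_ORDER)) {
        flags &= ~FLAG_LUN_ORDER;
        err = map(flags, ctx);
    }
    if (err == 0)
        *used_flags = flags;
    return err;
}

/* Stand-in for a multi-device filesystem that lacks LUN_ORDER support. */
int map_no_lun_order(unsigned int flags, void *ctx)
{
    int *calls = ctx;

    (*calls)++;
    return (flags & FLAG_LUN_ORDER) ? -EBADR : 0;
}
```

A tool like filefrag would then inspect the accepted flags to know
whether the extents it received are sorted by device or by logical
offset.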

For Lustre, it is completely inefficient to return data in non-LUN_ORDER,
because it is doing RAID-0 striping of the file data across data servers.
A 100MB 2-stripe file with 1MB stripes would have to return 100 extents,
even if the file data is allocated contiguously on disk in the backing
filesystems in two 50MB chunks.  With LUN_ORDER it will return 2 extents
and the user can see much more clearly that the file is laid out well.
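
The arithmetic behind those numbers can be written out as a small C
sketch. It assumes plain round-robin RAID-0 striping and that each
server's share of the file is physically contiguous; the function names
are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Extents reported in logical-offset order: with more than one stripe,
 * consecutive stripe-sized chunks live on different servers, so every
 * chunk becomes its own extent.
 */
uint64_t extents_logical_order(uint64_t file_size, uint64_t stripe_size,
                               uint32_t stripe_count)
{
    uint64_t chunks;

    assert(stripe_size > 0 && stripe_count > 0);
    chunks = (file_size + stripe_size - 1) / stripe_size;
    if (stripe_count == 1)
        return chunks ? 1 : 0;   /* one device, one contiguous run */
    return chunks;
}

/*
 * Extents reported in LUN order: one contiguous run per server that
 * holds any data at all.
 */
uint64_t extents_lun_order(uint64_t file_size, uint64_t stripe_size,
                           uint32_t stripe_count)
{
    uint64_t chunks;

    assert(stripe_size > 0 && stripe_count > 0);
    chunks = (file_size + stripe_size - 1) / stripe_size;
    return chunks < stripe_count ? chunks : stripe_count;
}
```

For a 100MB file with 1MB stripes over 2 servers, the first function
gives 100 and the second gives 2, matching the figures above.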

> > Christoph Hellwig wrote:
> > >>	__u32	fe_lun;	   /* logical device number for extent (starting at 0)*/
> > >
> > > Again this lun thing is horribly ill-defined.  There is no such thing
> > > as a logic device number in our filesystem terminology.
> >
> > I agree that LUN is confusing.  In my opinion the words "logical"
> > and "number" are overused and meaningless.  As Brad suggested,
> > "device" would be preferable, or "unit", but unfortunately every
> > word I can think of has some other definition too :)

Calling it "device" instead of "LUN" for this is fine...

> > Christoph Hellwig wrote:
> > > Well, we could add a dev field that contains the dev_t for the
> > > underlying block device.  That would work for the current XFS realtime
> > > device as well as for my work to map different XFS AGs to different
> > > devices.  It wouldn't work for btrfs with integrated raid code where
> > > a single extent can span multiple underlying devices, the same probably
> > > applies to pnfs.

... but I don't think it should necessarily be _required_ to return a
real "dev_t" (major, minor) device.  For network filesystems this is
meaningless.  If it is possible for FIEMAP_EXTENT_NET to signal that the
device is not a local/physical device (where a dev_t has no meaning),
and simply allow an enumeration [0, 1, 2, ...] of the logical devices
then I think this is reasonable.  The mapping of logical devices to
servers is available separately with a Lustre-specific ioctl.

This passes more information for filesystems that have local devices
while not breaking the functionality for network filesystems and could
be used as an efficient replacement for lilo's use of FIBMAP.

> > Chris Mason wrote:
> > > For btrfs I would return the logical extents via fiemap (just like the
> > > file were on lvm) and make btrfs specific ioctls for details about where
> > > the file actually lived.
> > >
> > > fiemap alone isn't a great way to describe raid levels or complex storage
> > > topologies.  To include physical information I would also have to encode
> > > the raid level used and information about all the devices the data is
> > > replicated on (raid1/10)
> >
> > fiemap by itself is useful for programs that want to determine
> > how fragmented a file is or where sparse areas are to skip.

For RAID1/10 you can return multiple logical->physical extent mappings
for the same logical range of the file with different "device" IDs.  You
could do the same for RAID5 returning each of the data and parity chunks
with "NO_DIRECT" if desired (maybe only on the parity extent, or don't
return the parity extent at all).  The spec does not require that the
returned extents be non-overlapping.

In fact Mark, Eric, and I were discussing the ability to request mappings
for metadata blocks in addition to the data blocks.  The metadata blocks
would also overlap the data blocks (with FLAG_METADATA set in the
metadata extent) so that it is possible to return to the client (if
requested) the inode block with [0-EOF] mapping, indirect blocks with
their corresponding data mappings, and the file data blocks.

This came up in the context of ext4 trying to visualize different
metadata placement algorithms and would be very useful information.
It might also be useful for filesystem defragmentation utilities.

> Yes, and since it has no concurrency semantics, use outside of that quickly 
> gets difficult.  fibmap is used by lilo, and reiserfs needs a special ioctl 
> that said i've-called-fibmap-please-don't-move-these-bytes that prevented 
> tail packing.

Wasn't that turned into an ext3-like SETFLAGS ioctl for "NOTAIL" on
the inode?

My point of view is that FIEMAP is a file layout visualization API that
could also be used in certain cases for direct data access.  Since any
direct access of data returned by FIEMAP is inherently racy (as is
FIBMAP), I'm less concerned with the mappings being fully consistent,
and more concerned with providing the maximum amount of information.

Any application using FIEMAP for direct data access (e.g. dump of
some kind) either has to guard against races itself by verifying the
mapping again afterward, or for uses like lilo trust that the admin
is doing the right thing.  That isn't a new issue with FIEMAP vs FIBMAP.

> > A final thought on this:
> > > 	__u32	fe_lun;	   /* logical device number for extent (starting at 0)*/
> >
> > While the flags field can be used to tell the validity of this
> > number, we found that starting at 0 was not a good practice.
> > We started at 1 so 0 was always a not-valid.  One way this can
> > be useful is if you have delayed allocation, you can indicate
> > "intended device" with a non-0 number.  Of course other values
> > such as max_int could be termed "invalid" instead.
> 
> I use 0 as not-valid as well.  The original intent was 0 meant 
> logical-block-number, signaling additional lookups were needed.  But I 
> haven't found a good use case for that yet.

I would prefer that the fe_lun (or fe_device as is now preferred)
be at least somewhat implementation-specific.  For local filesystems,
returning the device number seems reasonable and would mean that "0"
is not a valid return value, but I'd prefer to allow this to be an index
number for Lustre or other non-local filesystems, and in that case
"0" be a valid device index number.  Since there are already flags for
unallocated and unknown extents, I don't think we should depend on
fe_device == 0 to have a special meaning for network filesystems.

> > Another point to document is whether this number is a contiguous
> > series (1, 2, 3,... N) defining the location based on the current
> > device list or is possibly a sparse (1, 2, 6) series because the
> > FS tracks devices that have been removed.  In our implementation
> > both views were present for different consumers.  The sparse
> > series was native and the contiguous series a translation.
> 
> Interesting, I've been presenting the sparse representation only.

For Lustre, the fe_lun/fe_device returned for a file extent would
indicate the base-0 server index on which each of the file fragments
resides.  A given file will normally be striped over a subset of the
data servers, so it would be normal to get extents returned that
are a sparse subset of all available data servers.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-28 16:09         ` Andreas Dilger
@ 2008-05-28 16:33           ` Chris Mason
  2008-05-29 22:01             ` Andreas Dilger
  2008-05-29 13:01           ` Christoph Hellwig
  1 sibling, 1 reply; 35+ messages in thread
From: Chris Mason @ 2008-05-28 16:33 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: jim owens, linux-fsdevel, Christoph Hellwig, Mark Fasheh,
	Andreas Dilger, Kalpak Shah, Eric Sandeen, Josef Bacik

On Wednesday 28 May 2008, Andreas Dilger wrote:
> On May 27, 2008  13:19 -0400, Chris Mason wrote:
> > On Tuesday 27 May 2008, jim owens wrote:
> > > For what it is worth, a few comments from a newbie who has
> > > experience with a non-linux filesystem that has a similar API
> > > and supports files spread across multiple devices.
> > >
> > > Mark Fasheh wrote:
> > > > * FIEMAP_FLAG_LUN_ORDER
> > > > If the file system stripes file data, this will return contiguous
> > > > regions of physical allocation, sorted by LUN. Logical offsets may
> > > > not make sense if this flag is passed. If the file system does not
> > > > support multiple LUNs, this flag will be ignored.
> > >
> > > This should return an error (ENOTSUPPORTED ?) if the FS does
> > > not support multiple devices OR does not support sort-by-lun-order
> > > so the caller does not count on the info being sorted.  Even an FS
> > > that supports multiple devices per file may be unable to sort it
> > > by on-disk-order without consuming an ugly set of resources.
> >
> > That's a good point, I couldn't provide 100% sorted output even if I
> > wanted to.
>
> I'm OK with this also.  The only reason I thought "simple" filesystems
> (i.e. single-lun) should ignore FLAG_LUN_ORDER is so that tools like
> filefrag can always try with LUN_ORDER and in most cases still get a
> mapping returned.  If the filesystem doesn't care about LUN_MAPPING, no
> harm done, because all of the extents live on a single LUN anyways.  If
> a multi-device filesystem doesn't want to implement LUN_ORDER, returning
> -EBADR is perfectly acceptable because the application will retry without
> the unsupported flags (LUN_ORDER in this case) and get the logical file
> offset order data returned.
>
> For Lustre, it is completely inefficient to return data in non-LUN_ORDER,
> because it is doing RAID-0 striping of the file data across data servers.
> A 100MB 2-stripe file with 1MB stripes would have to return 100 extents,
> even if the file data is allocated contiguously on disk in the backing
> filesystems in two 50MB chunks.  With LUN_ORDER it will return 2 extents
> and the user can see much more clearly that the file is laid out well.

Ah, so Lustre doesn't have a logical address layer at all?  In my case the
files contain pointers to contiguous logical extents, and the lower layers of
the FS figure out whether that is raid0/1/10 or whatever future crud I toss in.

If the logical extents are contiguous it is safe to assume the lower end is 
also contiguous.

[ huge snip ;) ]

> My point of view is that FIEMAP is a file layout visualization API that
> could also be used in certain cases for direct data access.  Since any
> direct access of data returned by FIEMAP is inherently racy (as is
> FIBMAP), I'm less concerned with the mappings being fully consistent,
> and more concerned with providing the maximum amount of information.
>
> Any application using FIEMAP for direct data access (e.g. dump of
> some kind) either has to guard against races itself by verifying the
> mapping again afterward, or for uses like lilo trust that the admin
> is doing the right thing.  That isn't a new issue with FIEMAP vs FIBMAP.

So, I'm a big fan of better layout visualization and creating APIs to improve 
it.  At some point we need to take a step back and ask if those apis are 
better left to other tools instead of heaping them all into fiemap.  

The advantage of dropping the lun support from fiemap and pushing it into a 
new ioctl/syscall is that we can determine the underlying storage topology 
for any logical block on the device, including those underneath md/dm without 
worrying about a backing file.

And then we can get interesting information about stripe widths, preferred IO 
sizes etc etc.

[ lots of other stuff that makes good sense snipped too ]

-chris

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-28 16:02       ` Andreas Dilger
@ 2008-05-28 17:04         ` Joel Becker
  2008-05-29  0:51           ` Dave Chinner
  2008-05-29  1:17           ` Andreas Dilger
  2008-05-29  5:55         ` Christoph Hellwig
  1 sibling, 2 replies; 35+ messages in thread
From: Joel Becker @ 2008-05-28 17:04 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Mark Fasheh, Christoph Hellwig, linux-fsdevel, Andreas Dilger,
	Kalpak Shah, Eric Sandeen, Josef Bacik

On Wed, May 28, 2008 at 10:02:01AM -0600, Andreas Dilger wrote:
> On May 27, 2008  13:31 -0700, Joel Becker wrote:
> > 	Provide generic_fiemap_fiecount() that does the operation in
> > terms of ->fiemap().  Then filesystems like ocfs2 can just do .fiecount
> > = generic_fiemap_fiecount.
> > 	I agree with Christoph that it seems a bit overloaded when done
> > as a special case of FIEMAP.
> 
> The question is whether there are any (or many) filesystems that will NOT
> implement ->fiecount() as a wrapper of ->fiemap()?  At that point the
> "simplification" of the API means that there is actually more code and
> layers being called for the simple version.
> 
> I'm not saying I'm against having a FIECOUNT ioctl, we're probably just
> going to use an updated filefrag like everyone else, but it doesn't seem
> like a net reduction in code anywhere.

	It's not about net reduction of code.  It's about a readable and
understandable interface.  "Pass this array of extent structures we'll
ignore if you set this special flag" is pretty ugly.  Calling FIECOUNT
separately is nice.  How it is implemented in the kernel is a whole
'nother ball of wax - maybe we don't have ->fiecount() and always
implement FIECOUNT in terms of a ->fiemap() walk.  Doesn't matter.

Joel

-- 

"Hey mister if you're gonna walk on water,
 Could you drop a line my way?"

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-28 17:04         ` Joel Becker
@ 2008-05-29  0:51           ` Dave Chinner
  2008-05-29 13:02             ` Christoph Hellwig
  2008-05-29  1:17           ` Andreas Dilger
  1 sibling, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2008-05-29  0:51 UTC (permalink / raw)
  To: Joel Becker
  Cc: Andreas Dilger, Mark Fasheh, Christoph Hellwig, linux-fsdevel,
	Andreas Dilger, Kalpak Shah, Eric Sandeen, Josef Bacik

On Wed, May 28, 2008 at 10:04:19AM -0700, Joel Becker wrote:
> On Wed, May 28, 2008 at 10:02:01AM -0600, Andreas Dilger wrote:
> > On May 27, 2008  13:31 -0700, Joel Becker wrote:
> > > 	Provide generic_fiemap_fiecount() that does the operation in
> > > terms of ->fiemap().  Then filesystems like ocfs2 can just do .fiecount
> > > = generic_fiemap_fiecount.
> > > 	I agree with Christoph that it seems a bit overloaded when done
> > > as a special case of FIEMAP.
> > 
> > The question is whether there are any (or many) filesystems that will NOT
> > implement ->fiecount() as a wrapper of ->fiemap()?  At that point the
> > "simplification" of the API means that there is actually more code and
> > layers being called for the simple version.
> > 
> > I'm not saying I'm against having a FIECOUNT ioctl, we're probably just
> > going to use an updated filefrag like everyone else, but it doesn't seem
> > like a net reduction in code anywhere.
> 
> 	It's not about net reduction of code.  It's about a readable and
> understandable interface.  "Pass this array of extent structures we'll
> ignore if you set this special flag" is pretty ugly.  Calling FIECOUNT
> separately is nice.  How it is implemented in the kernel is a whole
> 'nother ball of wax - maybe we don't have ->fiecount() and always
> implement FIECOUNT in terms of a ->fiemap() walk.  Doesn't matter.

XFS has XFS_IOC_FSGETXATTR which can return the number of extents
on an inode. It's a total count, not a range count, so it's a bit
different to FIECOUNT and as such does not require walking the
extent list to retrieve (extent count is in the inode itself).

It's still not that straightforward, as you have to encode count,
offset and length into a structure to pass into the ioctl, i.e.
is it really that much simpler and cleaner than just adding an
extra flag to FIEMAP?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-28 17:04         ` Joel Becker
  2008-05-29  0:51           ` Dave Chinner
@ 2008-05-29  1:17           ` Andreas Dilger
  1 sibling, 0 replies; 35+ messages in thread
From: Andreas Dilger @ 2008-05-29  1:17 UTC (permalink / raw)
  To: Joel Becker
  Cc: Mark Fasheh, Christoph Hellwig, linux-fsdevel, Andreas Dilger,
	Kalpak Shah, Eric Sandeen, Josef Bacik

On May 28, 2008  10:04 -0700, Joel Becker wrote:
> 	It's not about net reduction of code.  It's about a readable and
> understandable interface.  "Pass this array of extent structures we'll
> ignore if you set this special flag" is pretty ugly.  Calling FIECOUNT
> separately is nice.  How it is implemented in the kernel is a whole
> 'nother ball of wax - maybe we don't have ->fiecount() and always
> implement FIECOUNT in terms of a ->fiemap() walk.  Doesn't matter.

In the FIEMAP + NUM_EXTENTS count, you don't need to pass the array of
structures, just the header.  I don't think this is an onerous interface:

	struct fiemap fiemap = { .fm_start = 0, .fm_length = ~0ULL,
				 .fm_num_extents = 0,
				 .fm_flags = FIEMAP_FLAG_NUM_EXTENTS };

	rc = ioctl(fd, FIEMAP, &fiemap);

	if (rc == 0)
		num_extents = fiemap.fm_mapped_extents;

We wouldn't need to even specify .fm_num_extents, if the VFS handler
doesn't check its validity for FIEMAP_FLAG_NUM_EXTENTS (contrary to
my recent change proposal).
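
For comparison, here is a minimal sketch of the same "count only" call
written against the struct fiemap layout found in the Linux UAPI header
<linux/fiemap.h>.  It is an illustration, not part of the patches under
discussion: the names used there (FS_IOC_FIEMAP, fm_extent_count,
FIEMAP_MAX_OFFSET) differ from the draft names in the snippet above, and
the helper name count_file_extents() is made up for this example.  With
fm_extent_count set to 0 only the header is passed and the count comes
back in fm_mapped_extents.

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>      /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>  /* struct fiemap, FIEMAP_MAX_OFFSET */

/*
 * Hypothetical helper: return the number of extents mapping the whole
 * file behind fd, or -1 with errno set if the ioctl fails (for example
 * EBADF for a bad descriptor, or an error from a filesystem that does
 * not implement FIEMAP).
 */
static long count_file_extents(int fd)
{
	struct fiemap fm;

	memset(&fm, 0, sizeof(fm));
	fm.fm_start = 0;
	fm.fm_length = FIEMAP_MAX_OFFSET;  /* cover the whole file */
	fm.fm_extent_count = 0;            /* header only, no extent array */

	if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0)
		return -1;

	return (long)fm.fm_mapped_extents;
}
```

A caller would typically use the returned count to size the fm_extents
array for a second call, keeping in mind that the file can change between
the two calls.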

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-28 16:02       ` Andreas Dilger
  2008-05-28 17:04         ` Joel Becker
@ 2008-05-29  5:55         ` Christoph Hellwig
  1 sibling, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2008-05-29  5:55 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Joel Becker, Mark Fasheh, Christoph Hellwig, linux-fsdevel,
	Andreas Dilger, Kalpak Shah, Eric Sandeen, Josef Bacik

On Wed, May 28, 2008 at 10:02:01AM -0600, Andreas Dilger wrote:
> The question is whether there are any (or many) filesystems that will NOT
> implement ->fiecount() as a wrapper of ->fiemap()?  At that point the
> "simplification" of the API means that there is actually more code and
> layers being called for the simple version.

ext4, for example, as is obvious from the code posted in this thread.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-28 16:09         ` Andreas Dilger
  2008-05-28 16:33           ` Chris Mason
@ 2008-05-29 13:01           ` Christoph Hellwig
  2008-05-29 20:17             ` Andreas Dilger
  1 sibling, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2008-05-29 13:01 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Chris Mason, jim owens, linux-fsdevel, Christoph Hellwig,
	Mark Fasheh, Andreas Dilger, Kalpak Shah, Eric Sandeen,
	Josef Bacik

On Wed, May 28, 2008 at 10:09:31AM -0600, Andreas Dilger wrote:
> ... but I don't think it should necessarily be _required_ to return a
> real "dev_t" (major, minor) device.  For network filesystems this is
> meaningless.  If it is possible for FIEMAP_EXTENT_NET to signal that the
> device is not a local/physical device (where a dev_t has no meaning),
> and simply allow an enumeration [0, 1, 2, ...] of the logical devices
> then I think this is reasonable.  The mapping of logical devices to
> servers is available separately with a Lustre-specific ioctl.
> 
> This passes more information for filesystems that have local devices
> while not breaking the functionality for network filesystems and could
> be used as an efficient replacement for lilo's use of FIBMAP.

A dev_t actually means something for the only in-tree users of
this interface, so there's no point making this interface worse for
some long-term out-of-tree code.  And it's not as if you can't simply
allocate multiple anonymous blockdevices for your networked filesystems,
similar to the one already used for st_dev.

> For RAID1/10 you can return multiple logical->physical extent mappings
> for the same logical range of the file with different "device" IDs.  You
> could do the same for RAID5 returning each of the data and parity chunks
> with "NO_DIRECT" if desired (maybe only on the parity extent, or don't
> return the parity extent at all).  The spec does not require that the
> returned extents be non-overlapping.

Umm, no.  That just makes the interface too complicated.  I can bet
you that userspace programmers will generally only test their code
with simple filesystems and all hell will break loose when they get these
multiple ranges.  Especially as that's a very unnatural interface.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-29  0:51           ` Dave Chinner
@ 2008-05-29 13:02             ` Christoph Hellwig
  2008-05-29 15:33               ` jim owens
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2008-05-29 13:02 UTC (permalink / raw)
  To: Joel Becker, Andreas Dilger, Mark Fasheh, Christoph Hellwig,
	linux-fsdevel, Andreas

On Thu, May 29, 2008 at 10:51:34AM +1000, Dave Chinner wrote:
> XFS has XFS_IOC_FSGETXATTR which can return the number of extents
> on an inode. It's a total count, not a range count, so it's a bit
> different to FIECOUNT and as such does not require walking the
> extent list to retrieve (extent count is in the inode itself).

What use is there in getting the extent count for a range?  I'd rather
do it only per-file like the xfs ioctl.

> It's still not that straight forward as you have to encode count,
> offset and length into a structure to pass into the ioctl. i.e.
> is it really that much simpler and cleaner than just adding a
> extra flag to FIEMAP?

Yes :)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-29 13:02             ` Christoph Hellwig
@ 2008-05-29 15:33               ` jim owens
  2008-05-29 15:53                 ` Jamie Lokier
  2008-05-29 18:56                 ` Joel Becker
  0 siblings, 2 replies; 35+ messages in thread
From: jim owens @ 2008-05-29 15:33 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Mark Fasheh, linux-fsdevel

Christoph Hellwig wrote:
> 
> What use is there in getting the extent count for a range?  I'd rather
> do it only per-file like the xfs ioctl.

I'll answer that from practical experience.  Our api equivalents:

  	__u64	fm_start;	 /* logical offset (inclusive) at
				  * which to start mapping (in) */
	__u64	fm_length;	 /* logical length of mapping which
				  * userspace cares about (in) */
	__u32	fm_extent_count; /* size of fm_extents array (in) */
	__u32	fm_mapped_extents; /* number of extents that were
				    * mapped (out) */

... note it has no flags field and no separate ioctl_extent_count.

"fm_extent_count" is
    IN  == max_extents to return.
    OUT == number of extents remaining in-range after fm_mapped_extents

Pass in fm_extent_count==0 and you get OUT number of extent entries
within your fm_start + fm_length range.  Which you can use to set
your malloc size because the FS can have massive extent counts :(

This is why it was done.  In practice this was only used by kernel
callers because most application developers simply looped with a
fixed buffer and adjusted fm_start.  Dumber applications kept
doubling their malloc to get all extents at once... or core dump :)

I'm not saying there is a good reason to do it this way
in linux, just why someone else did it that way.
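
The in/out behaviour described above can be modelled in a few lines.
This is only a sketch over an in-memory table: struct ext and
map_range() are hypothetical names invented for the illustration, and
the "remaining" count is returned through a separate pointer rather than
by overwriting the input field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory extent record, for illustration only. */
struct ext {
	uint64_t logical;  /* starting file offset in bytes */
	uint64_t length;   /* length in bytes */
};

/*
 * Copy up to max_out extents that overlap [start, start + length) from
 * the table `all` into `out`.  Returns the number copied ("mapped") and
 * stores in *remaining how many in-range extents did not fit.  With
 * max_out == 0 nothing is copied and *remaining is the full in-range
 * count, which the caller can use to size its buffer.
 */
static uint32_t map_range(const struct ext *all, size_t n,
			  uint64_t start, uint64_t length,
			  struct ext *out, uint32_t max_out,
			  uint32_t *remaining)
{
	uint64_t end = start + length;
	uint32_t mapped = 0, left = 0;

	assert(end >= start);  /* caller must not wrap the range */

	for (size_t i = 0; i < n; i++) {
		if (all[i].logical + all[i].length <= start ||
		    all[i].logical >= end)
			continue;  /* entirely outside the requested range */
		if (mapped < max_out)
			out[mapped++] = all[i];
		else
			left++;
	}
	*remaining = left;
	return mapped;
}
```

A fixed-buffer caller would loop, advancing start past the last extent
returned, until *remaining comes back as 0.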

jim

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-29 15:33               ` jim owens
@ 2008-05-29 15:53                 ` Jamie Lokier
  2008-05-29 18:56                 ` Joel Becker
  1 sibling, 0 replies; 35+ messages in thread
From: Jamie Lokier @ 2008-05-29 15:53 UTC (permalink / raw)
  To: jim owens; +Cc: Christoph Hellwig, Mark Fasheh, linux-fsdevel

jim owens wrote:
> This is why it was done.  In practice this was only used by kernel
> callers because most application developers simply looped with a
> fixed buffer and adjusted fm_start.  Dumber applications kept
> doubling their malloc to get all extents at once... or core dump :)

I'm seeing that a database, VM, or filesystem-in-a-file app (anything
like that using huge files) may want to optimise its I/O scheduling
and allocation pattern according to the estimated layout of its open
files... and some of those would like to be robust and not core dump
when presented with really large files! :) And not core dumping does
not mean "abandon the strategy when too large" either - it means have
a strategy which scales :-)

I'm also seeing block devices potentially offering a similar or even
the same interface, now that block devices (LVM) have extents,
underlying allocation strategies, etc., much the same as modern
filesystems.  Perhaps LVM itself could support FIEMAP.  From a database
engine's point of view, there isn't really much difference any more
except differing APIs and limitations.

-- Jamie

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-29 15:33               ` jim owens
  2008-05-29 15:53                 ` Jamie Lokier
@ 2008-05-29 18:56                 ` Joel Becker
  2008-05-29 21:41                   ` Andreas Dilger
  1 sibling, 1 reply; 35+ messages in thread
From: Joel Becker @ 2008-05-29 18:56 UTC (permalink / raw)
  To: jim owens; +Cc: Christoph Hellwig, Mark Fasheh, linux-fsdevel

On Thu, May 29, 2008 at 11:33:09AM -0400, jim owens wrote:
> Christoph Hellwig wrote:
>>
>> What use is there in getting the extent count for a range?  I'd rather
>> do it only per-file like the xfs ioctl.
>
> I'll answer that from practical experience.  Our api equivalents:
>
>  	__u64	fm_start;	 /* logical offset (inclusive) at
> 				  * which to start mapping (in) */
> 	__u64	fm_length;	 /* logical length of mapping which
> 				  * userspace cares about (in) */
> 	__u32	fm_extent_count; /* size of fm_extents array (in) */
> 	__u32	fm_mapped_extents; /* number of extents that were
> 				    * mapped (out) */
>
> ... note it has no flags field and no separate ioctl_extent_count.
>
> "fm_extent_count" is
>    IN  == max_extents to return.
>    OUT == number of extents remaining in-range after fm_mapped_extents
>
> Pass in fm_extent_count==0 and you get OUT number of extent entries
> within your fm_start + fm_length range.  Which you can use to set
> your malloc size because the FS can have massive extent counts :(

	See, that's a very natural API.  I'd be much happier with that -
"extent_count == 0" is consistent with no extent structures passed in.
It also fits with many other APIs that do it the same way (eg, snprintf).

Joel

-- 

"I don't know anything about music. In my line you don't have
 to."
        - Elvis Presley

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-29 13:01           ` Christoph Hellwig
@ 2008-05-29 20:17             ` Andreas Dilger
  0 siblings, 0 replies; 35+ messages in thread
From: Andreas Dilger @ 2008-05-29 20:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, jim owens, linux-fsdevel, Mark Fasheh,
	Andreas Dilger, Kalpak Shah, Eric Sandeen, Josef Bacik

On May 29, 2008  09:01 -0400, Christoph Hellwig wrote:
> On Wed, May 28, 2008 at 10:09:31AM -0600, Andreas Dilger wrote:
> > ... but I don't think it should necessarily be _required_ to return a
> > real "dev_t" (major, minor) device.  For network filesystems this is
> > meaningless.  If it is possible for FIEMAP_EXTENT_NET to signal that the
> > device is not a local/physical device (where a dev_t has no meaning),
> > and simply allow an enumeration [0, 1, 2, ...] of the logical devices
> > then I think this is reasonable.  The mapping of logical devices to
> > servers is available separately with a Lustre-specific ioctl.
> > 
> > This passes more information for filesystems that have local devices
> > while not breaking the functionality for network filesystems and could
> > be used as an efficient replacement for lilo's use of FIBMAP.
> 
> A dev_t actually means something for the only in-tree users of
> this interface, so there's no point making this interface worse for
> some long-term out of tree code.  And it's not like you simply can't
> allow multiple anonymous blockdevices for your networked filesystems
> similar to the one used for st_dev already.

But requiring that 1500 anonymous blockdevices (== number of storage
targets) be created at mount time, while exporting some
varying-over-reboot and inconsistent-across-clients random-value dev_t
for network filesystems, just for the possibility that the client is
going to do FIEMAP, isn't making the interface better either...

Getting devices of [0x1908afed, 0x4058204b] back from FIEMAP of a file
on one client, and [0x4bac5821, 0x0abefd63] on another client is pretty
useless compared to devices [2, 4], which have very clear meanings,
will always be the same across all clients, and the same across reboots.

> > For RAID1/10 you can return multiple logical->physical extent mappings
> > for the same logical range of the file with different "device" IDs.  You
> > could do the same for RAID5 returning each of the data and parity chunks
> > with "NO_DIRECT" if desired (maybe only on the parity extent, or don't
> > return the parity extent at all).  The spec does not require that the
> > returned extents be non-overlapping.
> 
> > Umm, no.  That just makes the interface too complicated.  I can bet
> > you that userspace programmers will generally only test their code
> > with simple filesystems and all hell will break loose when they get these
> > multiple ranges.  Especially as that's a very unnatural interface.

The metadata information isn't exposed to callers by default, they have
to request it explicitly with e.g. FIEMAP_FLAG_METADATA.  For the most
common use cases, applications/users will care about:
a) for cp/tar/dd/etc they only want to know where there are holes.  This
   is available in the most simple instance of FIEMAP (no flags).
b) for "fiemap" the user will want to know whether there are large or
   small contiguous allocations/fragmentation, or just the extent count.
c) for sophisticated users (e.g. filesystem developers, performance tuning)
   they want to know both the extent information, the metadata layout, and
   possibly the mapping all the way down to the platters

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-29 18:56                 ` Joel Becker
@ 2008-05-29 21:41                   ` Andreas Dilger
  2008-05-29 21:47                     ` Joel Becker
  0 siblings, 1 reply; 35+ messages in thread
From: Andreas Dilger @ 2008-05-29 21:41 UTC (permalink / raw)
  To: Joel Becker; +Cc: jim owens, Christoph Hellwig, Mark Fasheh, linux-fsdevel

On May 29, 2008  11:56 -0700, Joel Becker wrote:
> On Thu, May 29, 2008 at 11:33:09AM -0400, jim owens wrote:
> > Christoph Hellwig wrote:
> >> What use is there in getting the extent count for a range?  I'd rather
> >> do it only per-file like the xfs ioctl.
> >
> > I'll answer that from practical experience.  Our api equivalents:
> >
> >  	__u64	fm_start;	 /* logical offset (inclusive) at
> > 				  * which to start mapping (in) */
> > 	__u64	fm_length;	 /* logical length of mapping which
> > 				  * userspace cares about (in) */
> > 	__u32	fm_extent_count; /* size of fm_extents array (in) */
> > 	__u32	fm_mapped_extents; /* number of extents that were
> > 				    * mapped (out) */
> >
> > ... note it has no flags field and no separate ioctl_extent_count.
> >
> > "fm_extent_count" is
> >    IN  == max_extents to return.
> >    OUT == number of extents remaining in-range after fm_mapped_extents
> >
> > Pass in fm_extent_count==0 and you get OUT number of extent entries
> > within your fm_start + fm_length range.  Which you can use to set
> > your malloc size because the FS can have massive extent counts :(
> 
> 	See, that's a very natural API.  I'd be much happier with that -
> "extent_count == 0" is consistent with no extent structures passed in.
> It also fits with many other API that do it the same way (eg, snprintf).

So, to clarify, you are suggesting that FIEMAP_FLAG_NUM_EXTENTS isn't
needed, and returning the extent count should just be detected by calling
FIEMAP with fm_extent_count == 0?  I'm OK with that also, since calling
it with fm_extent_count == 0 doesn't make sense otherwise.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-29 21:41                   ` Andreas Dilger
@ 2008-05-29 21:47                     ` Joel Becker
  2008-05-29 23:20                       ` Andreas Dilger
  0 siblings, 1 reply; 35+ messages in thread
From: Joel Becker @ 2008-05-29 21:47 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: jim owens, Christoph Hellwig, Mark Fasheh, linux-fsdevel

On Thu, May 29, 2008 at 03:41:23PM -0600, Andreas Dilger wrote:
> So, to clarify, you are suggesting that FIEMAP_FLAG_NUM_EXTENTS isn't
> needed, and returning the extent count should just be detected by calling
> FIEMAP with fm_extent_count == 0?  I'm OK with that also, since calling
> it with fm_extent_count == 0 doesn't make sense otherwise.

	I'd always set fm_extent_count to whatever covers the range.
So, like snprintf(3), you always know when you were truncated and by how
much.  Then, passing fm_extent_count=0 in is just a special case of it.
Like estimating a printf buffer with len = snprintf(NULL, 0, fmt, ...);
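
The snprintf(3) idiom referred to here looks roughly like the sketch
below.  The function name and the format string are made up for the
illustration; the point is only that the first call writes nothing and
reports the length needed, and the second call fills a buffer of exactly
that size.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Format "<name>: <count> extents" into a freshly allocated string.
 * Returns NULL on failure; the caller frees the result.
 */
static char *format_extent_summary(const char *name, unsigned count)
{
	/* Pass 1: size 0 means nothing is written, only measured. */
	int len = snprintf(NULL, 0, "%s: %u extents", name, count);
	if (len < 0)
		return NULL;

	char *buf = malloc((size_t)len + 1);  /* +1 for the terminating NUL */
	if (!buf)
		return NULL;

	/* Pass 2: same arguments, now with room for the whole string. */
	int written = snprintf(buf, (size_t)len + 1, "%s: %u extents",
			       name, count);
	assert(written == len && strlen(buf) == (size_t)len);
	return buf;
}
```

The return value always describes the full output, so a caller that
passed a short buffer can tell it was truncated and by how much.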

Joel

-- 

"Reader, suppose you were an idiot.  And suppose you were a member of
 Congress.  But I repeat myself."
	- Mark Twain

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-28 16:33           ` Chris Mason
@ 2008-05-29 22:01             ` Andreas Dilger
  2008-05-30 13:37               ` Chris Mason
  0 siblings, 1 reply; 35+ messages in thread
From: Andreas Dilger @ 2008-05-29 22:01 UTC (permalink / raw)
  To: Chris Mason
  Cc: jim owens, linux-fsdevel, Christoph Hellwig, Mark Fasheh,
	Andreas Dilger, Kalpak Shah, Eric Sandeen, Josef Bacik

On May 28, 2008  12:33 -0400, Chris Mason wrote:
> On Wednesday 28 May 2008, Andreas Dilger wrote:
> > For Lustre, it is completely inefficient to return data in non-LUN_ORDER,
> > because it is doing RAID-0 striping of the file data across data servers.
> > A 100MB 2-stripe file with 1MB stripes would have to return 100 extents,
> > even if the file data is allocated contiguously on disk in the backing
> > filesystems in two 50MB chunks.  With LUN_ORDER it will return 2 extents
> > > and the user can see much more clearly that the file is laid out well.
> 
> Ah, so lustre doesn't have a logical address layer at all?  In my case the 
> files contain pointers to contiguous logical extent and the lower layers of 
> the FS figure out that is raid0/1/10 or whatever future crud I toss in.
> 
> If the logical extents are contiguous it is safe to assume the lower end is 
> also contiguous.

Well, Lustre has a logical address layer on a per-file basis, but the
layout maps from the file offsets to multiple object offsets.  There is
no "flat" logical device in the background which file allocations are
coming from, because the API provided to the client is based only on
objects and offsets, and there may be multiple objects that map into a
single file via some striping.  That is currently RAID-0 across objects,
but it might be RAID-1/5/6 or something else in the future.  With the
RAID-0 layout, the logical file offsets round-robin across the multiple
objects with a certain stripe size (default 1MB).
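
As a rough illustration of that round-robin mapping, and of why a
logical-order walk of a 100MB 2-stripe file reports 100 extents while a
per-object walk reports 2, consider the sketch below.  It is not Lustre
code: the struct and function names are invented, and it assumes every
object is allocated contiguously in its backing filesystem.

```c
#include <assert.h>
#include <stdint.h>

/* Location of one logical byte inside a RAID-0 striped file. */
struct stripe_loc {
	uint32_t object;  /* index of the object holding the byte */
	uint64_t offset;  /* byte offset within that object */
};

/* Round-robin mapping: stripe_size-byte chunks rotate across objects. */
static struct stripe_loc raid0_map(uint64_t logical, uint64_t stripe_size,
				   uint32_t stripe_count)
{
	assert(stripe_size > 0 && stripe_count > 0);

	uint64_t chunk = logical / stripe_size;  /* global chunk number */
	struct stripe_loc loc = {
		.object = (uint32_t)(chunk % stripe_count),
		.offset = (chunk / stripe_count) * stripe_size +
			  logical % stripe_size,
	};
	return loc;
}

/*
 * Extents seen when walking [0, file_size) in logical order: a new
 * extent starts whenever the next chunk does not directly continue the
 * previous one inside the same object.
 */
static uint64_t logical_order_extents(uint64_t file_size,
				      uint64_t stripe_size,
				      uint32_t stripe_count)
{
	uint64_t extents = 0;
	struct stripe_loc prev = { 0, 0 };

	for (uint64_t off = 0; off < file_size; off += stripe_size) {
		struct stripe_loc cur = raid0_map(off, stripe_size,
						  stripe_count);
		if (off == 0 || cur.object != prev.object ||
		    cur.offset != prev.offset + stripe_size)
			extents++;
		prev = cur;
	}
	return extents;
}

/* Extents seen per object: one for each object that holds any data. */
static uint64_t lun_order_extents(uint64_t file_size, uint64_t stripe_size,
				  uint32_t stripe_count)
{
	uint64_t chunks = (file_size + stripe_size - 1) / stripe_size;

	return chunks < stripe_count ? chunks : stripe_count;
}
```

For a 100MB file with 1MB stripes over 2 objects the first function
yields 100 and the second yields 2, matching the numbers quoted above.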

It sounds like you actually have the same setup with btrfs (if it is at
all like ZFS) that file blocks map onto multiple disks, and there may
be multiple copies of the data (RAID-1/10).

What a user/administrator really cares about in the end is whether
the files are allocated contiguously within the objects on the server
filesystems.  If we were to run filefrag (with FIEMAP support) on a
Lustre file without LUN_ORDER, or maybe a RAID-5 btrfs file, it would
return a list of extents, each broken up at smaller boundaries, and it
will convey the wrong idea of how the file is laid out physically.

If we run fiemap (with LUN_ORDER) what will happen is we get the larger
(hopefully) extents that are actually contiguously allocated in the
backing filesystems.  Since this is a network object-based filesystem,
we don't really care about the _actual_ file offset->device block number
layout as much as the overall picture of file fragmentation and layout.

> > My point of view is that FIEMAP is a file layout visualization API that
> > could also be used in certain cases for direct data access.  Since any
> > direct access of data returned by FIEMAP is inherently racy (as is
> > FIBMAP), I'm less concerned with the mappings being fully consistent,
> > and more concerned with providing the maximum amount of information.
> >
> > Any application using FIEMAP for direct data access (e.g. dump of
> > some kind) either has to guard against races itself by verifying the
> > mapping again afterward, or for uses like lilo trust that the admin
> > is doing the right thing.  That isn't a new issue with FIEMAP vs FIBMAP.
> 
> So, I'm a big fan of better layout visualization and creating APIs to improve 
> it.  At some point we need to take a step back and ask if those apis are 
> better left to other tools instead of heaping them all into fiemap.  
> The advantage of dropping the lun support from fiemap and pushing it into a 
> new ioctl/syscall is that we can determine the underlying storage topology 
> for any logical block on the device, including those underneath md/dm without 
> worrying about a backing file.

Argh, that would make FIEMAP basically unsuitable for a multi-device
filesystem like Lustre and pNFS and the future direction of XFS (I think),
and btrfs IMHO, or even ZFS.  There just isn't a single address space
that files can be mapped to.

A major reason I proposed FIEMAP to linux-fsdevel in the first
place, instead of just keeping it internal to Lustre and maybe ext4,
is because it is a generally useful interface for efficiently determining
file layout information, and isn't tied to block devices like FIBMAP is.

It is useful for many different reasons like cp/tar to skip holes
(not using the physical offset information, just the logical extents
and flags like UNWRITTEN) to avoid reading empty parts of the file,
layout visualization, maybe defrag, etc.
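
To make the hole-skipping case concrete, a copy tool only needs the
logical offsets and lengths.  The sketch below works on a hypothetical
in-memory list (struct lext and find_holes() are names invented here),
assumes the extents are sorted by logical offset and do not overlap, and
ignores flags such as UNWRITTEN for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical logical-only extent; physical offsets are not needed. */
struct lext {
	uint64_t logical;  /* starting file offset in bytes */
	uint64_t length;   /* length in bytes */
};

/*
 * Report the gaps in [0, file_size) that no extent covers.  Up to
 * max_holes gaps are stored in `holes`; the return value is the total
 * number of gaps found, which may be larger than max_holes.
 */
static size_t find_holes(const struct lext *ext, size_t n,
			 uint64_t file_size,
			 struct lext *holes, size_t max_holes)
{
	uint64_t pos = 0;  /* end of the data seen so far */
	size_t found = 0;

	for (size_t i = 0; i <= n && pos < file_size; i++) {
		/* Treat end-of-file as a final zero-length extent. */
		uint64_t next = (i < n && ext[i].logical < file_size)
				? ext[i].logical : file_size;

		assert(next >= pos);  /* input must be sorted, no overlap */
		if (next > pos) {
			if (found < max_holes) {
				holes[found].logical = pos;
				holes[found].length = next - pos;
			}
			found++;
		}
		if (i < n)
			pos = ext[i].logical + ext[i].length;
	}
	return found;
}
```

A cp or tar implementation would read only the ranges between the
reported holes and seek over the rest.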

Dropping lun/device support, and removing all of the flexibility of the
FIEMAP interface design, is IMHO killing the whole reason I proposed
FIEMAP in the first place.

> And then we can get interesting information about stripe widths, preferred
> IO sizes etc etc.

I agree that this part is somewhat orthogonal.  In the vast majority
of cases stripe width, stripe count, IO size, etc can be encapsulated
into a small number of parameters and does not need to be specified on a
per-block or per-extent basis.  The actual layout of the file (returned
by FIEMAP) is a natural consequence of these parameters, as they (may)
influence the filesystem in making allocation decisions.  Returning the
metadata layout as part of FIEMAP makes sense to me, because it boils
down to logical->{physical,device} ranges in the end.  Returning the
generic file layout parameters doesn't make sense, on the other hand.

Lustre (and in fact several of the HPC filesystem vendors) would like to
come up with a common API (virtual xattr?) to be able to extract and
restore the generic layout information but that is for a separate email.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-29 21:47                     ` Joel Becker
@ 2008-05-29 23:20                       ` Andreas Dilger
  0 siblings, 0 replies; 35+ messages in thread
From: Andreas Dilger @ 2008-05-29 23:20 UTC (permalink / raw)
  To: Joel Becker; +Cc: jim owens, Christoph Hellwig, Mark Fasheh, linux-fsdevel

On May 29, 2008  14:47 -0700, Joel Becker wrote:
> On Thu, May 29, 2008 at 03:41:23PM -0600, Andreas Dilger wrote:
> > So, to clarify, you are suggesting that FIEMAP_FLAG_NUM_EXTENTS isn't
> > needed, and returning the extent count should just be detected by calling
> > FIEMAP with fm_extent_count == 0?  I'm OK with that also, since calling
> > it with fm_extent_count == 0 doesn't make sense otherwise.
> 
> 	I'd always set fm_extent_count to whatever covers the range.
> So, like snprintf(3), you always know when you were truncated and by how
> much.  Then, passing fm_extent_count=0 in is just a special case of it.
> Like estimating a printf buffer with len = snprintf(NULL, 0, fmt, ...);

I totally agree, I just didn't remember to include the start/end range
in my email, though I think they are necessary.
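
A rough sketch of the snprintf-like calling convention discussed above.
Everything here is an assumption for illustration: struct ext, map_range()
and map_all() are hypothetical and operate on an in-memory array, not on the
ioctl from the patches.  The point is only that the call always reports how
many extents cover the range, so a capacity of 0 becomes a pure count query.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

struct ext {			/* hypothetical extent record */
	unsigned long long logical;
	unsigned long long length;
};

/*
 * Hypothetical mapper with snprintf-like semantics: copy at most cap
 * extents overlapping [start, start + len) into out, but always return
 * how many extents overlap that range in total.
 */
static size_t map_range(const struct ext *all, size_t n_all,
			unsigned long long start, unsigned long long len,
			struct ext *out, size_t cap)
{
	size_t total = 0;

	assert(out != NULL || cap == 0);
	for (size_t i = 0; i < n_all; i++) {
		if (all[i].logical + all[i].length <= start ||
		    all[i].logical >= start + len)
			continue;
		if (total < cap)
			out[total] = all[i];
		total++;
	}
	return total;
}

/* Two-pass use: probe with cap == 0, then fetch into a right-sized buffer. */
static struct ext *map_all(const struct ext *all, size_t n_all,
			   unsigned long long start, unsigned long long len,
			   size_t *count)
{
	size_t need = map_range(all, n_all, start, len, NULL, 0);
	struct ext *buf = malloc((need ? need : 1) * sizeof(*buf));

	if (buf == NULL)
		return NULL;
	*count = map_range(all, n_all, start, len, buf, need);
	return buf;
}

/* The snprintf(3) idiom the analogy refers to: size first, then format. */
static char *format_count(size_t n)
{
	int len = snprintf(NULL, 0, "%zu extents", n);
	char *s = len < 0 ? NULL : malloc((size_t)len + 1);

	if (s != NULL)
		snprintf(s, (size_t)len + 1, "%zu extents", n);
	return s;
}
```

A caller whose buffer was too small sees a return value larger than the
capacity it passed in, and so knows both that the result was truncated and
by how much.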

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
  2008-05-29 22:01             ` Andreas Dilger
@ 2008-05-30 13:37               ` Chris Mason
  0 siblings, 0 replies; 35+ messages in thread
From: Chris Mason @ 2008-05-30 13:37 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: jim owens, linux-fsdevel, Christoph Hellwig, Mark Fasheh,
	Andreas Dilger, Kalpak Shah, Eric Sandeen, Josef Bacik

On Thursday 29 May 2008, Andreas Dilger wrote:
> On May 28, 2008  12:33 -0400, Chris Mason wrote:
> > On Wednesday 28 May 2008, Andreas Dilger wrote:
> > > For Lustre, it is completely inefficient to return data in
> > > non-LUN_ORDER, because it is doing RAID-0 striping of the file data
> > > across data servers. A 100MB 2-stripe file with 1MB stripes would have
> > > to return 100 extents, even if the file data is allocated contiguously
> > > on disk in the backing filesystems in two 50MB chunks.  With LUN_ORDER
> > > it will return 2 extents and the user can see much more clearly that
> > > the file is laid out well.
> >
> > Ah, so lustre doesn't have a logical address layer at all?  In my case
> > the files contain pointers to contiguous logical extents and the lower
> > layers of the FS figure out whether that is raid0/1/10 or whatever future crud I
> > toss in.
> >
> > If the logical extents are contiguous it is safe to assume the lower end
> > is also contiguous.
>
> Well, Lustre has a logical address layer on a per-file basis, but the
> layout maps from the file offsets to multiple object offsets.  There is
> no "flat" logical device in the background which file allocations are
> coming from, because the API provided to the client is based only on
> objects and offsets, and there may be multiple objects that map into a
> single file via some striping.  That is currently RAID-0 across objects,
> but it might be RAID-1/5/6 or something else in the future.  With the
> RAID-0 layout, the logical file offsets round-robin across the multiple
> objects with a certain stripe size (default 1MB).
>
> It sounds like you actually have the same setup with btrfs (if it is at
> all like ZFS) that file blocks map onto multiple disks, and there may
> be multiple copies of the data (RAID-1/10).
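
A small sketch of the round-robin RAID-0 mapping and the extent counts from
the quoted example (100MB file, 2 stripes, 1MB stripe size).  The struct and
function names are hypothetical, and the counts assume each object is
allocated contiguously on its backing filesystem.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical RAID-0 layout description; the names are illustrative only. */
struct layout {
	uint64_t stripe_size;	/* bytes per stripe unit, e.g. 1MB */
	uint32_t stripe_count;	/* number of objects the file is striped over */
};

struct obj_addr {
	uint32_t object;	/* which object (LUN) holds the byte */
	uint64_t offset;	/* byte offset within that object */
};

/* Round-robin mapping from a file offset to an object and object offset. */
static struct obj_addr map_offset(const struct layout *l, uint64_t file_off)
{
	struct obj_addr a;
	uint64_t unit;

	assert(l->stripe_size > 0 && l->stripe_count > 0);
	unit = file_off / l->stripe_size;
	a.object = (uint32_t)(unit % l->stripe_count);
	a.offset = (unit / l->stripe_count) * l->stripe_size +
		   file_off % l->stripe_size;
	return a;
}

/* Extents reported when walking the file in logical (file offset) order. */
static uint64_t extents_logical_order(const struct layout *l, uint64_t size)
{
	uint64_t count = 0;
	struct obj_addr prev = {0, 0};

	for (uint64_t off = 0; off < size; off += l->stripe_size) {
		struct obj_addr cur = map_offset(l, off);

		/* A new extent starts unless this unit continues the previous one. */
		if (count == 0 || cur.object != prev.object ||
		    cur.offset != prev.offset + l->stripe_size)
			count++;
		prev = cur;
	}
	return count;
}

/* Extents reported per object (LUN order): one per object that holds data. */
static uint64_t extents_lun_order(const struct layout *l, uint64_t size)
{
	uint64_t units = (size + l->stripe_size - 1) / l->stripe_size;

	return units < l->stripe_count ? units : l->stripe_count;
}
```

With stripe_size = 1MB and stripe_count = 2, a 100MB file yields 100 extents
in logical order and 2 in LUN order, matching the numbers quoted above.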

In my case, all pointers to extents (both metadata blocks and file data) 
reference a logical address space.  So, even for raid10 or raid5/6 if I ever 
code it, there is a central place that does translation from 
logical->physical block(s).

The disk format supports multiple (2^64) such namespaces but that isn't being 
used yet.

>
>
> What a user/administrator really cares about in the end is whether
> the files are allocated contiguously within the objects on the server
> filesystems.  If we were to run filefrag (with FIEMAP support) on a
> Lustre file without LUN_ORDER, or maybe a RAID-5 btrfs file, it would
> return a list of extents, each broken up at smaller boundaries, and it
> will convey the wrong idea of how the file is laid out physically.
>

For Btrfs, it'll always return the logical extents, and because the storage is
grouped in relatively large chunks (~1GB or more), this is sufficient for
measuring fragmentation.

But, if lustre doesn't have this kind of logical backing store, I think it is 
reason enough to keep the lun interface.  I know lots of people are against 
adding interfaces to the kernel for out of tree projects, but the per-file 
logical mapping you describe is a very reasonable way to design things, and 
we might as well leave it in for future use.

> Dropping lun/device support, and removing all of the flexibility of the
> FIEMAP interface design, is IMHO killing the whole reason I proposed
> FIEMAP in the first place.

My goal isn't to remove the flexibility from the interface design; it is just
to ask if all of this functionality needs to be in one ioctl.  At least the
device number / lun bit makes sense now (Mark, if you keep it, please don't
make this a dev_t).  Thanks for the extra details.

-chris



Thread overview: 35+ messages
2008-05-25  0:01 [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl Mark Fasheh
2008-05-25 19:42 ` Christoph Hellwig
2008-05-25 20:59   ` Brad Boyer
2008-05-26 10:59   ` Andreas Dilger
2008-05-26 18:04     ` Brad Boyer
2008-05-27 16:45     ` Christoph Hellwig
2008-05-27 21:10       ` Mark Fasheh
2008-05-27 13:48   ` Chris Mason
2008-05-27 16:21     ` Eric Sandeen
2008-05-27 16:47       ` Christoph Hellwig
2008-05-27 20:34         ` Joel Becker
2008-05-27 16:52     ` jim owens
2008-05-27 17:19       ` Chris Mason
2008-05-28 16:09         ` Andreas Dilger
2008-05-28 16:33           ` Chris Mason
2008-05-29 22:01             ` Andreas Dilger
2008-05-30 13:37               ` Chris Mason
2008-05-29 13:01           ` Christoph Hellwig
2008-05-29 20:17             ` Andreas Dilger
2008-05-27 18:56   ` Mark Fasheh
2008-05-27 20:31     ` Joel Becker
2008-05-27 20:49       ` Mark Fasheh
2008-05-28  5:14       ` Christoph Hellwig
2008-05-28 16:02       ` Andreas Dilger
2008-05-28 17:04         ` Joel Becker
2008-05-29  0:51           ` Dave Chinner
2008-05-29 13:02             ` Christoph Hellwig
2008-05-29 15:33               ` jim owens
2008-05-29 15:53                 ` Jamie Lokier
2008-05-29 18:56                 ` Joel Becker
2008-05-29 21:41                   ` Andreas Dilger
2008-05-29 21:47                     ` Joel Becker
2008-05-29 23:20                       ` Andreas Dilger
2008-05-29  1:17           ` Andreas Dilger
2008-05-29  5:55         ` Christoph Hellwig
