Hole Punching V3

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Hole Punching V3
@ 2010-11-18  1:46 Josef Bacik
  2010-11-18  1:46 ` [PATCH 1/6] fs: add hole punching to fallocate Josef Bacik
                   ` (6 more replies)
  0 siblings, 7 replies; 35+ messages in thread
From: Josef Bacik @ 2010-11-18  1:46 UTC (permalink / raw)
  To: david, linux-kernel, linux-btrfs, linux-ext4, linux-fsdevel, xfs,
	cmm, cluster-dev

This is version 3 of the hole punching series I've been posting.  Not much has
changed, the history is below

V2->V3
-FALLOC_FL_PUNCH_HOLE must also have FALLOC_FL_KEEP_SIZE in order to work
-formatting fixes

V1->V2
-Hole punching doesn't change file size
-Fixed the mode checks in ext4/btrfs/gfs2 so they do what they are supposed to

I've updated my local copies of the xfsprogs patches I have to test this to use
KEEP_SIZE and PUNCH_HOLE together, I'll post them after it looks like these
patches are good to go, including the manpage update.  The xfstest I wrote ran
fine both on xfs and btrfs (failing on btrfs obviously).  Thanks,

Josef

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-18  1:46 Hole Punching V3 Josef Bacik
@ 2010-11-18  1:46 ` Josef Bacik
  2010-11-18 23:43   ` Jan Kara
  2010-11-18  1:46 ` [PATCH 2/6] XFS: handle hole punching via fallocate properly Josef Bacik
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Josef Bacik @ 2010-11-18  1:46 UTC (permalink / raw)
  To: david, linux-kernel, linux-btrfs, linux-ext4, linux-fsdevel, xfs,
	cmm, cluster-dev

Hole punching has already been implemented by XFS and OCFS2, and has the
potential to be implemented on both BTRFS and EXT4 so we need a generic way to
get to this feature.  The simplest way in my mind is to add FALLOC_FL_PUNCH_HOLE
to fallocate() since it already looks like the normal fallocate() operation.
I've tested this patch with XFS and BTRFS to make sure XFS did what it's
supposed to do and that BTRFS failed like it was supposed to.  Thank you,

Signed-off-by: Josef Bacik <josef@redhat.com>
---
 fs/open.c              |    7 ++++++-
 include/linux/falloc.h |    1 +
 2 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 4197b9e..5b6ef7e 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -223,7 +223,12 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -EINVAL;
 
 	/* Return error if mode is not supported */
-	if (mode && !(mode & FALLOC_FL_KEEP_SIZE))
+	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+		return -EOPNOTSUPP;
+
+	/* Punch hole must have keep size set */
+	if ((mode & FALLOC_FL_PUNCH_HOLE) &&
+	    !(mode & FALLOC_FL_KEEP_SIZE))
 		return -EOPNOTSUPP;
 
 	if (!(file->f_mode & FMODE_WRITE))
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 3c15510..73e0b62 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -2,6 +2,7 @@
 #define _FALLOC_H_
 
 #define FALLOC_FL_KEEP_SIZE	0x01 /* default is extend size */
+#define FALLOC_FL_PUNCH_HOLE	0x02 /* de-allocates range */
 
 #ifdef __KERNEL__
 
-- 
1.6.6.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-18  1:46 ` [PATCH 1/6] fs: add hole punching to fallocate Josef Bacik
@ 2010-11-18 23:43   ` Jan Kara
  0 siblings, 0 replies; 35+ messages in thread
From: Jan Kara @ 2010-11-18 23:43 UTC (permalink / raw)
  To: Josef Bacik
  Cc: david, linux-kernel, linux-btrfs, linux-ext4, linux-fsdevel, xfs,
	cmm, cluster-devel, joel.becker, jack

On Wed 17-11-10 20:46:15, Josef Bacik wrote:
> Hole punching has already been implemented by XFS and OCFS2, and has the
> potential to be implemented on both BTRFS and EXT4 so we need a generic way to
> get to this feature.  The simplest way in my mind is to add FALLOC_FL_PUNCH_HOLE
> to fallocate() since it already looks like the normal fallocate() operation.
> I've tested this patch with XFS and BTRFS to make sure XFS did what it's
> supposed to do and that BTRFS failed like it was supposed to.  Thank you,
  Looks nice now. Acked-by: Jan Kara <jack@suse.cz>

									Honza
> 
> Signed-off-by: Josef Bacik <josef@redhat.com>
> ---
>  fs/open.c              |    7 ++++++-
>  include/linux/falloc.h |    1 +
>  2 files changed, 7 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/open.c b/fs/open.c
> index 4197b9e..5b6ef7e 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -223,7 +223,12 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  		return -EINVAL;
>  
>  	/* Return error if mode is not supported */
> -	if (mode && !(mode & FALLOC_FL_KEEP_SIZE))
> +	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> +		return -EOPNOTSUPP;
> +
> +	/* Punch hole must have keep size set */
> +	if ((mode & FALLOC_FL_PUNCH_HOLE) &&
> +	    !(mode & FALLOC_FL_KEEP_SIZE))
>  		return -EOPNOTSUPP;
>  
>  	if (!(file->f_mode & FMODE_WRITE))
> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> index 3c15510..73e0b62 100644
> --- a/include/linux/falloc.h
> +++ b/include/linux/falloc.h
> @@ -2,6 +2,7 @@
>  #define _FALLOC_H_
>  
>  #define FALLOC_FL_KEEP_SIZE	0x01 /* default is extend size */
> +#define FALLOC_FL_PUNCH_HOLE	0x02 /* de-allocates range */
>  
>  #ifdef __KERNEL__
>  
> -- 
> 1.6.6.1
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 2/6] XFS: handle hole punching via fallocate properly
  2010-11-18  1:46 Hole Punching V3 Josef Bacik
  2010-11-18  1:46 ` [PATCH 1/6] fs: add hole punching to fallocate Josef Bacik
@ 2010-11-18  1:46 ` Josef Bacik
  2010-11-18  1:46 ` [PATCH 3/6] Ocfs2: " Josef Bacik
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 35+ messages in thread
From: Josef Bacik @ 2010-11-18  1:46 UTC (permalink / raw)
  To: david, linux-kernel, linux-btrfs, linux-ext4, linux-fsdevel, xfs,
	cmm, cluster-dev

This patch simply allows XFS to handle the hole punching flag in fallocate
properly.  I've tested this with a little program that does a bunch of random
hole punching with FL_KEEP_SIZE and without it to make sure it does the right
thing.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
---
 fs/xfs/linux-2.6/xfs_iops.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
index 96107ef..07d2164 100644
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -516,6 +516,7 @@ xfs_vn_fallocate(
 	loff_t		new_size = 0;
 	xfs_flock64_t	bf;
 	xfs_inode_t	*ip = XFS_I(inode);
+	int		cmd = XFS_IOC_RESVSP;
 
 	/* preallocation on directories not yet supported */
 	error = -ENODEV;
@@ -528,6 +529,9 @@ xfs_vn_fallocate(
 
 	xfs_ilock(ip, XFS_IOLOCK_EXCL);
 
+	if (mode & FALLOC_FL_PUNCH_HOLE)
+		cmd = XFS_IOC_UNRESVSP;
+
 	/* check the new inode size is valid before allocating */
 	if (!(mode & FALLOC_FL_KEEP_SIZE) &&
 	    offset + len > i_size_read(inode)) {
@@ -537,8 +541,7 @@ xfs_vn_fallocate(
 			goto out_unlock;
 	}
 
-	error = -xfs_change_file_space(ip, XFS_IOC_RESVSP, &bf,
-				       0, XFS_ATTR_NOLOCK);
+	error = -xfs_change_file_space(ip, cmd, &bf, 0, XFS_ATTR_NOLOCK);
 	if (error)
 		goto out_unlock;
 
-- 
1.6.6.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 3/6] Ocfs2: handle hole punching via fallocate properly
  2010-11-18  1:46 Hole Punching V3 Josef Bacik
  2010-11-18  1:46 ` [PATCH 1/6] fs: add hole punching to fallocate Josef Bacik
  2010-11-18  1:46 ` [PATCH 2/6] XFS: handle hole punching via fallocate properly Josef Bacik
@ 2010-11-18  1:46 ` Josef Bacik
  2010-11-18  1:46 ` [PATCH 4/6] Ext4: fail if we try to use hole punch Josef Bacik
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 35+ messages in thread
From: Josef Bacik @ 2010-11-18  1:46 UTC (permalink / raw)
  To: david, linux-kernel, linux-btrfs, linux-ext4, linux-fsdevel, xfs,
	cmm, cluster-dev

This patch just makes ocfs2 use its UNRESERVP ioctl when we get the hole punch
flag in fallocate.  I didn't test it, but it seems simple enough.  Thanks,

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
---
 fs/ocfs2/file.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 77b4c04..ad23a18 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1992,6 +1992,7 @@ static long ocfs2_fallocate(struct inode *inode, int mode, loff_t offset,
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
 	struct ocfs2_space_resv sr;
 	int change_size = 1;
+	int cmd = OCFS2_IOC_RESVSP64;
 
 	if (!ocfs2_writes_unwritten_extents(osb))
 		return -EOPNOTSUPP;
@@ -2002,12 +2003,15 @@ static long ocfs2_fallocate(struct inode *inode, int mode, loff_t offset,
 	if (mode & FALLOC_FL_KEEP_SIZE)
 		change_size = 0;
 
+	if (mode & FALLOC_FL_PUNCH_HOLE)
+		cmd = OCFS2_IOC_UNRESVSP64;
+
 	sr.l_whence = 0;
 	sr.l_start = (s64)offset;
 	sr.l_len = (s64)len;
 
-	return __ocfs2_change_file_space(NULL, inode, offset,
-					 OCFS2_IOC_RESVSP64, &sr, change_size);
+	return __ocfs2_change_file_space(NULL, inode, offset, cmd, &sr,
+					 change_size);
 }
 
 int ocfs2_check_range_for_refcount(struct inode *inode, loff_t pos,
-- 
1.6.6.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 4/6] Ext4: fail if we try to use hole punch
  2010-11-18  1:46 Hole Punching V3 Josef Bacik
                   ` (2 preceding siblings ...)
  2010-11-18  1:46 ` [PATCH 3/6] Ocfs2: " Josef Bacik
@ 2010-11-18  1:46 ` Josef Bacik
  2010-11-18  1:46 ` [PATCH 5/6] Btrfs: " Josef Bacik
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 35+ messages in thread
From: Josef Bacik @ 2010-11-18  1:46 UTC (permalink / raw)
  To: david, linux-kernel, linux-btrfs, linux-ext4, linux-fsdevel, xfs,
	cmm, cluster-dev

Ext4 doesn't have the ability to punch holes yet, so make sure we return
EOPNOTSUPP if we try to use hole punching through fallocate.  This support can
be added later.  Thanks,

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
---
 fs/ext4/extents.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0554c48..35bca73 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3622,6 +3622,10 @@ long ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
 	struct ext4_map_blocks map;
 	unsigned int credits, blkbits = inode->i_blkbits;
 
+	/* We only support the FALLOC_FL_KEEP_SIZE mode */
+	if (mode && (mode != FALLOC_FL_KEEP_SIZE))
+		return -EOPNOTSUPP;
+
 	/*
 	 * currently supporting (pre)allocate mode for extent-based
 	 * files _only_
-- 
1.6.6.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 5/6] Btrfs: fail if we try to use hole punch
  2010-11-18  1:46 Hole Punching V3 Josef Bacik
                   ` (3 preceding siblings ...)
  2010-11-18  1:46 ` [PATCH 4/6] Ext4: fail if we try to use hole punch Josef Bacik
@ 2010-11-18  1:46 ` Josef Bacik
  2010-11-18  1:46 ` [PATCH 6/6] Gfs2: " Josef Bacik
  2011-01-03 21:57 ` Hole Punching V3 Josef Bacik
  6 siblings, 0 replies; 35+ messages in thread
From: Josef Bacik @ 2010-11-18  1:46 UTC (permalink / raw)
  To: david, linux-kernel, linux-btrfs, linux-ext4, linux-fsdevel, xfs,
	cmm, cluster-dev

Btrfs doesn't have the ability to punch holes yet, so make sure we return
EOPNOTSUPP if we try to use hole punching through fallocate.  This support can
be added later.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
---
 fs/btrfs/inode.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 78877d7..6f08892 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6936,6 +6936,10 @@ static long btrfs_fallocate(struct inode *inode, int mode,
 	alloc_start = offset & ~mask;
 	alloc_end =  (offset + len + mask) & ~mask;
 
+	/* We only support the FALLOC_FL_KEEP_SIZE mode */
+	if (mode && (mode != FALLOC_FL_KEEP_SIZE))
+		return -EOPNOTSUPP;
+
 	/*
 	 * wait for ordered IO before we have any locks.  We'll loop again
 	 * below with the locks held.
-- 
1.6.6.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 6/6] Gfs2: fail if we try to use hole punch
  2010-11-18  1:46 Hole Punching V3 Josef Bacik
                   ` (4 preceding siblings ...)
  2010-11-18  1:46 ` [PATCH 5/6] Btrfs: " Josef Bacik
@ 2010-11-18  1:46 ` Josef Bacik
  2011-01-03 21:57 ` Hole Punching V3 Josef Bacik
  6 siblings, 0 replies; 35+ messages in thread
From: Josef Bacik @ 2010-11-18  1:46 UTC (permalink / raw)
  To: david, linux-kernel, linux-btrfs, linux-ext4, linux-fsdevel, xfs,
	cmm, cluster-dev

Gfs2 doesn't have the ability to punch holes yet, so make sure we return
EOPNOTSUPP if we try to use hole punching through fallocate.  This support can
be added later.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
---
 fs/gfs2/ops_inode.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/fs/gfs2/ops_inode.c b/fs/gfs2/ops_inode.c
index 12cbea7..65cb5f6 100644
--- a/fs/gfs2/ops_inode.c
+++ b/fs/gfs2/ops_inode.c
@@ -1439,6 +1439,10 @@ static long gfs2_fallocate(struct inode *inode, int mode, loff_t offset,
 	loff_t next = (offset + len - 1) >> sdp->sd_sb.sb_bsize_shift;
 	next = (next + 1) << sdp->sd_sb.sb_bsize_shift;
 
+	/* We only support the FALLOC_FL_KEEP_SIZE mode */
+	if (mode && (mode != FALLOC_FL_KEEP_SIZE))
+		return -EOPNOTSUPP;
+
 	offset = (offset >> sdp->sd_sb.sb_bsize_shift) <<
 		 sdp->sd_sb.sb_bsize_shift;
 
-- 
1.6.6.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: Hole Punching V3
  2010-11-18  1:46 Hole Punching V3 Josef Bacik
                   ` (5 preceding siblings ...)
  2010-11-18  1:46 ` [PATCH 6/6] Gfs2: " Josef Bacik
@ 2011-01-03 21:57 ` Josef Bacik
  6 siblings, 0 replies; 35+ messages in thread
From: Josef Bacik @ 2011-01-03 21:57 UTC (permalink / raw)
  To: Josef Bacik
  Cc: david, linux-kernel, linux-btrfs, linux-ext4, linux-fsdevel, xfs,
	cmm, cluster-devel, joel.becker, jack, akpm, torvalds

On Wed, Nov 17, 2010 at 08:46:14PM -0500, Josef Bacik wrote:
> This is version 3 of the hole punching series I've been posting.  Not much has
> changed, the history is below
> 
> V2->V3
> -FALLOC_FL_PUNCH_HOLE must also have FALLOC_FL_KEEP_SIZE in order to work
> -formatting fixes
> 
> V1->V2
> -Hole punching doesn't change file size
> -Fixed the mode checks in ext4/btrfs/gfs2 so they do what they are supposed to
> 
> I've updated my local copies of the xfsprogs patches I have to test this to use
> KEEP_SIZE and PUNCH_HOLE together, I'll post them after it looks like these
> patches are good to go, including the manpage update.  The xfstest I wrote ran
> fine both on xfs and btrfs (failing on btrfs obviously).  Thanks,
> 

I'd like to try and get this into the next merge window, it seems everybody is
happy with it so far, any other comments?  Provided everybody is ok with it, how
would you like me to send it to you Linus?  Would you prefer a pull request or
will you just pull the patches off the mailinglist?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Hole Punching V2
@ 2010-11-15 17:05 Josef Bacik
  2010-11-15 17:05 ` [PATCH 1/6] fs: add hole punching to fallocate Josef Bacik
  0 siblings, 1 reply; 35+ messages in thread
From: Josef Bacik @ 2010-11-15 17:05 UTC (permalink / raw)
  To: david, linux-kernel, linux-btrfs, linux-ext4, linux-fsdevel, xfs,
	cmm, cluster-dev

This is version 2 of the hole punching series I posted last week.  The following
things have changed

-Hole punching doesn't change file size
-Fixed the mode checks in ext4/btrfs/gfs2 so they do what they are supposed to

I posted updates to xfstests and xfsprogs in order to test this new interface,
and ran the test on xfs to make sure hole punching worked properly for xfs and I
tested it on btrfs to make sure it failed properly.  This series also adds
support for doing hole punching to ocfs2, but I did not test this part of it,
albiet it's an obvious fix so it should work fine.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-15 17:05 Hole Punching V2 Josef Bacik
@ 2010-11-15 17:05 ` Josef Bacik
  2010-11-16 11:16   ` Jan Kara
  0 siblings, 1 reply; 35+ messages in thread
From: Josef Bacik @ 2010-11-15 17:05 UTC (permalink / raw)
  To: david, linux-kernel, linux-btrfs, linux-ext4, linux-fsdevel, xfs,
	cmm, cluster-dev

Hole punching has already been implemented by XFS and OCFS2, and has the
potential to be implemented on both BTRFS and EXT4 so we need a generic way to
get to this feature.  The simplest way in my mind is to add FALLOC_FL_PUNCH_HOLE
to fallocate() since it already looks like the normal fallocate() operation.
I've tested this patch with XFS and BTRFS to make sure XFS did what it's
supposed to do and that BTRFS failed like it was supposed to.  Thank you,

Signed-off-by: Josef Bacik <josef@redhat.com>
---
 fs/open.c              |    2 +-
 include/linux/falloc.h |    1 +
 2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 4197b9e..ab8dedf 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -223,7 +223,7 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -EINVAL;
 
 	/* Return error if mode is not supported */
-	if (mode && !(mode & FALLOC_FL_KEEP_SIZE))
+	if (mode && (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)))
 		return -EOPNOTSUPP;
 
 	if (!(file->f_mode & FMODE_WRITE))
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 3c15510..851cba2 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -2,6 +2,7 @@
 #define _FALLOC_H_
 
 #define FALLOC_FL_KEEP_SIZE	0x01 /* default is extend size */
+#define FALLOC_FL_PUNCH_HOLE	0X02 /* de-allocates range */
 
 #ifdef __KERNEL__
 
-- 
1.6.6.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-15 17:05 ` [PATCH 1/6] fs: add hole punching to fallocate Josef Bacik
@ 2010-11-16 11:16   ` Jan Kara
  2010-11-16 11:43     ` Jan Kara
  2010-11-16 12:53     ` Josef Bacik
  0 siblings, 2 replies; 35+ messages in thread
From: Jan Kara @ 2010-11-16 11:16 UTC (permalink / raw)
  To: Josef Bacik
  Cc: david, linux-kernel, linux-btrfs, linux-ext4, linux-fsdevel, xfs,
	cmm, cluster-devel, ocfs2-devel

On Mon 15-11-10 12:05:18, Josef Bacik wrote:
> diff --git a/fs/open.c b/fs/open.c
> index 4197b9e..ab8dedf 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -223,7 +223,7 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  		return -EINVAL;
>  
>  	/* Return error if mode is not supported */
> -	if (mode && !(mode & FALLOC_FL_KEEP_SIZE))
> +	if (mode && (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)))
  Why not just:
if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) ?

> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> index 3c15510..851cba2 100644
> --- a/include/linux/falloc.h
> +++ b/include/linux/falloc.h
> @@ -2,6 +2,7 @@
>  #define _FALLOC_H_
>  
>  #define FALLOC_FL_KEEP_SIZE	0x01 /* default is extend size */
> +#define FALLOC_FL_PUNCH_HOLE	0X02 /* de-allocates range */
                                 ^ use lowercase 'x' please...


								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-16 11:16   ` Jan Kara
@ 2010-11-16 11:43     ` Jan Kara
  2010-11-16 12:52       ` Josef Bacik
  2010-11-16 12:53     ` Josef Bacik
  1 sibling, 1 reply; 35+ messages in thread
From: Jan Kara @ 2010-11-16 11:43 UTC (permalink / raw)
  To: Josef Bacik
  Cc: david, linux-kernel, linux-btrfs, linux-ext4, linux-fsdevel, xfs,
	cmm, cluster-devel, ocfs2-devel

On Tue 16-11-10 12:16:11, Jan Kara wrote:
> On Mon 15-11-10 12:05:18, Josef Bacik wrote:
> > diff --git a/fs/open.c b/fs/open.c
> > index 4197b9e..ab8dedf 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -223,7 +223,7 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> >  		return -EINVAL;
> >  
> >  	/* Return error if mode is not supported */
> > -	if (mode && !(mode & FALLOC_FL_KEEP_SIZE))
> > +	if (mode && (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)))
>   Why not just:
> if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) ?
  And BTW, since FALLOC_FL_PUNCH_HOLE does not change the file size, should
not we enforce that FALLOC_FL_KEEP_SIZE is / is not set? I don't mind too
much which way but keeping it ambiguous (ignored) in the interface usually
proves as a bad idea in future when we want to further extend the interface...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-16 11:43     ` Jan Kara
@ 2010-11-16 12:52       ` Josef Bacik
  2010-11-16 13:14         ` Jan Kara
  0 siblings, 1 reply; 35+ messages in thread
From: Josef Bacik @ 2010-11-16 12:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: Josef Bacik, david, linux-kernel, linux-btrfs, linux-ext4,
	linux-fsdevel, xfs, cmm, cluster-devel, ocfs2-devel

On Tue, Nov 16, 2010 at 12:43:46PM +0100, Jan Kara wrote:
> On Tue 16-11-10 12:16:11, Jan Kara wrote:
> > On Mon 15-11-10 12:05:18, Josef Bacik wrote:
> > > diff --git a/fs/open.c b/fs/open.c
> > > index 4197b9e..ab8dedf 100644
> > > --- a/fs/open.c
> > > +++ b/fs/open.c
> > > @@ -223,7 +223,7 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> > >  		return -EINVAL;
> > >  
> > >  	/* Return error if mode is not supported */
> > > -	if (mode && !(mode & FALLOC_FL_KEEP_SIZE))
> > > +	if (mode && (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)))
> >   Why not just:
> > if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) ?
>   And BTW, since FALLOC_FL_PUNCH_HOLE does not change the file size, should
> not we enforce that FALLOC_FL_KEEP_SIZE is / is not set? I don't mind too
> much which way but keeping it ambiguous (ignored) in the interface usually
> proves as a bad idea in future when we want to further extend the interface...
>

Yeah I went back and forth on this.  KEEP_SIZE won't change the behavior of
PUNCH_HOLE since PUNCH_HOLE implicitly means keep the size.  I figured since its
"mode" and not "flags" it would be ok to make either way accepted, but if you
prefer PUNCH_HOLE means you have to have KEEP_SIZE set then I'm cool with that,
just let me know one way or the other.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-16 12:52       ` Josef Bacik
@ 2010-11-16 13:14         ` Jan Kara
  2010-11-17  0:22           ` Andreas Dilger
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Kara @ 2010-11-16 13:14 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Jan Kara, david, linux-kernel, linux-btrfs, linux-ext4,
	linux-fsdevel, xfs, cmm, cluster-devel, ocfs2-devel

On Tue 16-11-10 07:52:50, Josef Bacik wrote:
> On Tue, Nov 16, 2010 at 12:43:46PM +0100, Jan Kara wrote:
> > On Tue 16-11-10 12:16:11, Jan Kara wrote:
> > > On Mon 15-11-10 12:05:18, Josef Bacik wrote:
> > > > diff --git a/fs/open.c b/fs/open.c
> > > > index 4197b9e..ab8dedf 100644
> > > > --- a/fs/open.c
> > > > +++ b/fs/open.c
> > > > @@ -223,7 +223,7 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> > > >  		return -EINVAL;
> > > >  
> > > >  	/* Return error if mode is not supported */
> > > > -	if (mode && !(mode & FALLOC_FL_KEEP_SIZE))
> > > > +	if (mode && (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)))
> > >   Why not just:
> > > if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) ?
> >   And BTW, since FALLOC_FL_PUNCH_HOLE does not change the file size, should
> > not we enforce that FALLOC_FL_KEEP_SIZE is / is not set? I don't mind too
> > much which way but keeping it ambiguous (ignored) in the interface usually
> > proves as a bad idea in future when we want to further extend the interface...
> >
> 
> Yeah I went back and forth on this.  KEEP_SIZE won't change the behavior of
> PUNCH_HOLE since PUNCH_HOLE implicitly means keep the size.  I figured since its
> "mode" and not "flags" it would be ok to make either way accepted, but if you
> prefer PUNCH_HOLE means you have to have KEEP_SIZE set then I'm cool with that,
> just let me know one way or the other.  Thanks,
  I was wondering about 'mode' vs 'flags' as well. The manpage says:
The mode argument determines the operation to be performed on the given
range.  Currently only one flag is supported for mode...
  So we call it "mode" but speak about "flags"? Seems a bit inconsistent.
I'd maybe lean a bit at the "flags" side and just make sure that
only one of FALLOC_FL_KEEP_SIZE, FALLOC_FL_PUNCH_HOLE is set (interpreting
FALLOC_FL_KEEP_SIZE as allocate blocks beyond i_size). But I'm not sure
what others think.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-16 13:14         ` Jan Kara
@ 2010-11-17  0:22           ` Andreas Dilger
  2010-11-17  2:11             ` Dave Chinner
  0 siblings, 1 reply; 35+ messages in thread
From: Andreas Dilger @ 2010-11-17  0:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: Josef Bacik, david, linux-kernel, linux-btrfs, linux-ext4,
	linux-fsdevel, xfs, cmm, cluster-devel, ocfs2-devel

On 2010-11-16, at 07:14, Jan Kara wrote:
>> Yeah I went back and forth on this.  KEEP_SIZE won't change the behavior of PUNCH_HOLE since PUNCH_HOLE implicitly means keep the size.  I figured since its "mode" and not "flags" it would be ok to make either way accepted, but if you prefer PUNCH_HOLE means you have to have KEEP_SIZE set then I'm cool with that, just let me know one way or the other.
> 
> So we call it "mode" but speak about "flags"? Seems a bit inconsistent.  
> I'd maybe lean a bit at the "flags" side and just make sure that only one of FALLOC_FL_KEEP_SIZE, FALLOC_FL_PUNCH_HOLE is set (interpreting FALLOC_FL_KEEP_SIZE as allocate blocks beyond i_size). But I'm not sure what others think.

IMHO, it makes more sense for consistency and "get what users expect" that these be treated as flags.  Some users will want KEEP_SIZE, but in other cases it may make sense that a hole punch at the end of a file should shrink the file (i.e. the opposite of an append).

Cheers, Andreas






^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-17  0:22           ` Andreas Dilger
@ 2010-11-17  2:11             ` Dave Chinner
  2010-11-17  2:28               ` Josef Bacik
  2010-11-17  9:19               ` Andreas Dilger
  0 siblings, 2 replies; 35+ messages in thread
From: Dave Chinner @ 2010-11-17  2:11 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Jan Kara, Josef Bacik, linux-kernel, linux-btrfs, linux-ext4,
	linux-fsdevel, xfs, cmm, cluster-devel, ocfs2-devel

On Tue, Nov 16, 2010 at 06:22:47PM -0600, Andreas Dilger wrote:
> On 2010-11-16, at 07:14, Jan Kara wrote:
> >> Yeah I went back and forth on this.  KEEP_SIZE won't change the
> >> behavior of PUNCH_HOLE since PUNCH_HOLE implicitly means keep
> >> the size.  I figured since its "mode" and not "flags" it would
> >> be ok to make either way accepted, but if you prefer PUNCH_HOLE
> >> means you have to have KEEP_SIZE set then I'm cool with that,
> >> just let me know one way or the other.
> > 
> > So we call it "mode" but speak about "flags"? Seems a bit
> > inconsistent.  I'd maybe lean a bit at the "flags" side and just
> > make sure that only one of FALLOC_FL_KEEP_SIZE,
> > FALLOC_FL_PUNCH_HOLE is set (interpreting FALLOC_FL_KEEP_SIZE as
> > allocate blocks beyond i_size). But I'm not sure what others
> > think.
> 
> IMHO, it makes more sense for consistency and "get what users
> expect" that these be treated as flags.  Some users will want
> KEEP_SIZE, but in other cases it may make sense that a hole punch
> at the end of a file should shrink the file (i.e. the opposite of
> an append).

What's wrong with ftruncate() for this?

There's plenty of open questions about the interface if we allow
hole punching to change the file size. e.g. where do we set the EOF
(offset or offset+len)?  What do we do with the rest of the blocks
that are now beyond EOF?  We weren't asked to punch them out, so do
we leave them behind? What if we are leaving written blocks beyond
EOF - does any filesystem other than XFS support that (i.e. are we
introducing different behaviour on different filesystems)?  And what
happens if the offset is beyond EOF? Do we extend the file, and if
so why wouldn't you just use ftruncate() instead?

IMO, allowing hole punching to change the file size makes it much
more complicated and hence less likely to simply do what the user
expects. It also is harder to implement and testing becomes much
more intricate. From that perspective, it does not seem desirable to
me...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-17  2:11             ` Dave Chinner
@ 2010-11-17  2:28               ` Josef Bacik
  2010-11-17  2:34                 ` Josef Bacik
  2010-11-17  9:19               ` Andreas Dilger
  1 sibling, 1 reply; 35+ messages in thread
From: Josef Bacik @ 2010-11-17  2:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andreas Dilger, Jan Kara, Josef Bacik, linux-kernel, linux-btrfs,
	linux-ext4, linux-fsdevel, xfs, cmm, cluster-devel, ocfs2-devel

On Wed, Nov 17, 2010 at 01:11:50PM +1100, Dave Chinner wrote:
> On Tue, Nov 16, 2010 at 06:22:47PM -0600, Andreas Dilger wrote:
> > On 2010-11-16, at 07:14, Jan Kara wrote:
> > >> Yeah I went back and forth on this.  KEEP_SIZE won't change the
> > >> behavior of PUNCH_HOLE since PUNCH_HOLE implicitly means keep
> > >> the size.  I figured since its "mode" and not "flags" it would
> > >> be ok to make either way accepted, but if you prefer PUNCH_HOLE
> > >> means you have to have KEEP_SIZE set then I'm cool with that,
> > >> just let me know one way or the other.
> > > 
> > > So we call it "mode" but speak about "flags"? Seems a bit
> > > inconsistent.  I'd maybe lean a bit at the "flags" side and just
> > > make sure that only one of FALLOC_FL_KEEP_SIZE,
> > > FALLOC_FL_PUNCH_HOLE is set (interpreting FALLOC_FL_KEEP_SIZE as
> > > allocate blocks beyond i_size). But I'm not sure what others
> > > think.
> > 
> > IMHO, it makes more sense for consistency and "get what users
> > expect" that these be treated as flags.  Some users will want
> > KEEP_SIZE, but in other cases it may make sense that a hole punch
> > at the end of a file should shrink the file (i.e. the opposite of
> > an append).
> 
> What's wrong with ftruncate() for this?
> 
> There's plenty of open questions about the interface if we allow
> hole punching to change the file size. e.g. where do we set the EOF
> (offset or offset+len)?  What do we do with the rest of the blocks
> that are now beyond EOF?  We weren't asked to punch them out, so do
> we leave them behind? What if we are leaving written blocks beyond
> EOF - does any filesystem other than XFS support that (i.e. are we
> introducing different behaviour on different filesystems)?  And what
> happens if the offset is beyond EOF? Do we extend the file, and if
> so why wouldn't you just use ftruncate() instead?
> 
> IMO, allowing hole punching to change the file size makes it much
> more complicated and hence less likely to simply do what the user
> expects. It also is harder to implement and testing becomes much
> more intricate. From that perspective, it does not seem desirable to
> me...
> 

FWIW I agree with Dave, the only question at this point is do we force users to
specify KEEP_SIZE with PUNCH_HOLE?  On one hand it makes the interface a bit
more consistent, on the other hand it makes the documentation a little weird

"We have mode here, but if you want to use PUNCH_HOLE you also have to specify
KEEP_SIZE, so really it's like a flags field it's just named poorly"

I have no strong opinions the other way so if nobody else does then I'll just do
it Jan's way.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-17  2:28               ` Josef Bacik
@ 2010-11-17  2:34                 ` Josef Bacik
  2010-11-17  9:30                   ` Andreas Dilger
  0 siblings, 1 reply; 35+ messages in thread
From: Josef Bacik @ 2010-11-17  2:34 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Dave Chinner, Andreas Dilger, Jan Kara, linux-kernel, linux-btrfs,
	linux-ext4, linux-fsdevel, xfs, cmm, cluster-devel, ocfs2-devel

On Tue, Nov 16, 2010 at 09:28:14PM -0500, Josef Bacik wrote:
> On Wed, Nov 17, 2010 at 01:11:50PM +1100, Dave Chinner wrote:
> > On Tue, Nov 16, 2010 at 06:22:47PM -0600, Andreas Dilger wrote:
> > > On 2010-11-16, at 07:14, Jan Kara wrote:
> > > >> Yeah I went back and forth on this.  KEEP_SIZE won't change the
> > > >> behavior of PUNCH_HOLE since PUNCH_HOLE implicitly means keep
> > > >> the size.  I figured since its "mode" and not "flags" it would
> > > >> be ok to make either way accepted, but if you prefer PUNCH_HOLE
> > > >> means you have to have KEEP_SIZE set then I'm cool with that,
> > > >> just let me know one way or the other.
> > > > 
> > > > So we call it "mode" but speak about "flags"? Seems a bit
> > > > inconsistent.  I'd maybe lean a bit at the "flags" side and just
> > > > make sure that only one of FALLOC_FL_KEEP_SIZE,
> > > > FALLOC_FL_PUNCH_HOLE is set (interpreting FALLOC_FL_KEEP_SIZE as
> > > > allocate blocks beyond i_size). But I'm not sure what others
> > > > think.
> > > 
> > > IMHO, it makes more sense for consistency and "get what users
> > > expect" that these be treated as flags.  Some users will want
> > > KEEP_SIZE, but in other cases it may make sense that a hole punch
> > > at the end of a file should shrink the file (i.e. the opposite of
> > > an append).
> > 
> > What's wrong with ftruncate() for this?
> > 
> > There's plenty of open questions about the interface if we allow
> > hole punching to change the file size. e.g. where do we set the EOF
> > (offset or offset+len)?  What do we do with the rest of the blocks
> > that are now beyond EOF?  We weren't asked to punch them out, so do
> > we leave them behind? What if we are leaving written blocks beyond
> > EOF - does any filesystem other than XFS support that (i.e. are we
> > introducing different behaviour on different filesystems)?  And what
> > happens if the offset is beyond EOF? Do we extend the file, and if
> > so why wouldn't you just use ftruncate() instead?
> > 
> > IMO, allowing hole punching to change the file size makes it much
> > more complicated and hence less likely to simply do what the user
> > expects. It also is harder to implement and testing becomes much
> > more intricate. From that perspective, it does not seem desirable to
> > me...
> > 
> 
> FWIW I agree with Dave, the only question at this point is do we force users to
> specify KEEP_SIZE with PUNCH_HOLE?  On one hand it makes the interface a bit
> more consistent, on the other hand it makes the documentation a little weird
> 
> "We have mode here, but if you want to use PUNCH_HOLE you also have to specify
> KEEP_SIZE, so really it's like a flags field it's just named poorly"
> 
> I have no strong opinions the other way so if nobody else does then I'll just do
> it Jan's way.  Thanks,
>

Sorry child induced sleep deprevation bleeding in there, that should read "I
have no strong opinions one way or the other."  Sheesh,

Josef 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-17  2:34                 ` Josef Bacik
@ 2010-11-17  9:30                   ` Andreas Dilger
  0 siblings, 0 replies; 35+ messages in thread
From: Andreas Dilger @ 2010-11-17  9:30 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Dave Chinner, Jan Kara, linux-kernel, linux-btrfs, linux-ext4,
	linux-fsdevel, xfs, cmm, cluster-devel, ocfs2-devel

On 2010-11-16, at 20:34, Josef Bacik wrote:
> FWIW I agree with Dave, the only question at this point is do we force users to specify KEEP_SIZE with PUNCH_HOLE?  On one hand it makes the interface a bit more consistent, on the other hand it makes the documentation a little weird
> 
> "We have mode here, but if you want to use PUNCH_HOLE you also have to specify KEEP_SIZE, so really it's like a flags field it's just named poorly"

Even if this is the case, and we decide today that PUNCH_HOLE without KEEP_SIZE is not desirable to implement, it would be better to just return -EOPNOTSUPP if both flags are not set than assume one or the other is what the user wanted.  That allows the ability to implement this in the future without breaking every application, while if it is assumed that KEEP_SIZE is always implicit there will never be a way to add that functionality without something awful like a separate CHANGE_SIZE flag for PUNCH_HOLE.

One option is to define FALLOC_FL_PUNCH_HOLE as 0x3 (so that KEEP_SIZE is always passed) and in the future we can define some new flag name like TRUNCATE_HOLE (or whatever) that is 0x2 only.

Cheers, Andreas

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-17  2:11             ` Dave Chinner
  2010-11-17  2:28               ` Josef Bacik
@ 2010-11-17  9:19               ` Andreas Dilger
  1 sibling, 0 replies; 35+ messages in thread
From: Andreas Dilger @ 2010-11-17  9:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Josef Bacik, linux-kernel, linux-btrfs, linux-ext4,
	linux-fsdevel, xfs, cmm, cluster-devel, ocfs2-devel

On 2010-11-16, at 20:11, Dave Chinner wrote:
> On Tue, Nov 16, 2010 at 06:22:47PM -0600, Andreas Dilger wrote:
>> IMHO, it makes more sense for consistency and "get what users
>> expect" that these be treated as flags.  Some users will want
>> KEEP_SIZE, but in other cases it may make sense that a hole punch
>> at the end of a file should shrink the file (i.e. the opposite of
>> an append).
> 
> What's wrong with ftruncate() for this?

It makes the API usage from applications more consistent.  It would be inconvenient, for example, if applications had to use a different system call if they were writing in the middle of the file vs. at the end, wouldn't it?

Similarly, if multiple threads are appending vs. punching (let's assume non-overlapping regions, for sanity, like a producer/consumer model punching out completed records) then using ftruncate() to remove the last record and shrink the file would require locking the whole file from userspace (unlike the append, which does this in the kernel), or risk discarding unprocessed data beyond the record that was punched out.

> There's plenty of open questions about the interface if we allow
> hole punching to change the file size. e.g. where do we set the EOF
> (offset or offset+len)?

I would think it natural that the new size is the start of the region, like an "anti-write" (where write sets the size at the end of the added bytes).

>  What do we do with the rest of the blocks that are now beyond EOF?
> We weren't asked to punch them out, so do we leave them behind?

I definitely think they should be left as is.  If they were in the punched-out range, they would be deallocated, and if they are beyond EOF they will remain as they are - we didn't ask to remove them unless the punched-out range went to ~0ULL (which would make it equivalent to an ftruncate()).

> What if we are leaving written blocks beyond EOF - does any filesystem other than XFS support that (i.e. are we introducing different behaviour on different filesystems)?

I'm not sure I understand what a "written block beyond EOF" means.  How can there be data beyond EOF?  I think the KEEP_SIZE flag is only relevant if the punch is spanning EOF, like the opposite of a write that is spanning EOF.  If KEEP_SIZE is set, then it leaves the size unchanged, and if unset and punch spans EOF it reduces the file size.  If the punch is not at EOF it doesn't change the file size, just like a write that is not at EOF.

> And what happens if the offset is beyond EOF? Do we extend the file, and if so why wouldn't you just use ftruncate() instead?

Even if the effects were the same, it makes sense because applications may be using fallocate(PUNCH_HOLE) to punch out records, and having them special case the use of ftruncate() to get certain semantics at the end of the file adds needless complexity.

Cheers, Andreas

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-16 11:16   ` Jan Kara
  2010-11-16 11:43     ` Jan Kara
@ 2010-11-16 12:53     ` Josef Bacik
  1 sibling, 0 replies; 35+ messages in thread
From: Josef Bacik @ 2010-11-16 12:53 UTC (permalink / raw)
  To: Jan Kara
  Cc: Josef Bacik, david, linux-kernel, linux-btrfs, linux-ext4,
	linux-fsdevel, xfs, cmm, cluster-devel, ocfs2-devel

On Tue, Nov 16, 2010 at 12:16:11PM +0100, Jan Kara wrote:
> On Mon 15-11-10 12:05:18, Josef Bacik wrote:
> > diff --git a/fs/open.c b/fs/open.c
> > index 4197b9e..ab8dedf 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -223,7 +223,7 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> >  		return -EINVAL;
> >  
> >  	/* Return error if mode is not supported */
> > -	if (mode && !(mode & FALLOC_FL_KEEP_SIZE))
> > +	if (mode && (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)))
>   Why not just:
> if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) ?
>

Good point, thank you.
 
> > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > index 3c15510..851cba2 100644
> > --- a/include/linux/falloc.h
> > +++ b/include/linux/falloc.h
> > @@ -2,6 +2,7 @@
> >  #define _FALLOC_H_
> >  
> >  #define FALLOC_FL_KEEP_SIZE	0x01 /* default is extend size */
> > +#define FALLOC_FL_PUNCH_HOLE	0X02 /* de-allocates range */
>                                  ^ use lowercase 'x' please...
>

Argh bitten by caps-lock again.  Thanks I'll fix this up,

Josef 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 1/6] fs: add hole punching to fallocate
@ 2010-11-08 20:32 Josef Bacik
  2010-11-09  1:12 ` Dave Chinner
  0 siblings, 1 reply; 35+ messages in thread
From: Josef Bacik @ 2010-11-08 20:32 UTC (permalink / raw)
  To: linux-kernel, linux-btrfs, linux-ext4, linux-fsdevel, xfs,
	joel.becker, cmm, cluster-

Hole punching has already been implemented by XFS and OCFS2, and has the
potential to be implemented on both BTRFS and EXT4 so we need a generic way to
get to this feature.  The simplest way in my mind is to add FALLOC_FL_PUNCH_HOLE
to fallocate() since it already looks like the normal fallocate() operation.
I've tested this patch with XFS and BTRFS to make sure XFS did what it's
supposed to do and that BTRFS failed like it was supposed to.  Thank you,

Signed-off-by: Josef Bacik <josef@redhat.com>
---
 fs/open.c              |    2 +-
 include/linux/falloc.h |    1 +
 2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 4197b9e..ab8dedf 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -223,7 +223,7 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -EINVAL;
 
 	/* Return error if mode is not supported */
-	if (mode && !(mode & FALLOC_FL_KEEP_SIZE))
+	if (mode && (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)))
 		return -EOPNOTSUPP;
 
 	if (!(file->f_mode & FMODE_WRITE))
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 3c15510..851cba2 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -2,6 +2,7 @@
 #define _FALLOC_H_
 
 #define FALLOC_FL_KEEP_SIZE	0x01 /* default is extend size */
+#define FALLOC_FL_PUNCH_HOLE	0X02 /* de-allocates range */
 
 #ifdef __KERNEL__
 
-- 
1.6.6.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-08 20:32 Josef Bacik
@ 2010-11-09  1:12 ` Dave Chinner
  2010-11-09  2:10   ` Josef Bacik
                     ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Dave Chinner @ 2010-11-09  1:12 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-kernel, linux-btrfs, linux-ext4, linux-fsdevel, xfs,
	joel.becker, cmm, cluster-devel

On Mon, Nov 08, 2010 at 03:32:02PM -0500, Josef Bacik wrote:
> Hole punching has already been implemented by XFS and OCFS2, and has the
> potential to be implemented on both BTRFS and EXT4 so we need a generic way to
> get to this feature.  The simplest way in my mind is to add FALLOC_FL_PUNCH_HOLE
> to fallocate() since it already looks like the normal fallocate() operation.
> I've tested this patch with XFS and BTRFS to make sure XFS did what it's
> supposed to do and that BTRFS failed like it was supposed to.  Thank you,
> 
> Signed-off-by: Josef Bacik <josef@redhat.com>
> ---
>  fs/open.c              |    2 +-
>  include/linux/falloc.h |    1 +
>  2 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/open.c b/fs/open.c
> index 4197b9e..ab8dedf 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -223,7 +223,7 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  		return -EINVAL;
>  
>  	/* Return error if mode is not supported */
> -	if (mode && !(mode & FALLOC_FL_KEEP_SIZE))
> +	if (mode && (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)))
>  		return -EOPNOTSUPP;
>  
>  	if (!(file->f_mode & FMODE_WRITE))
> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> index 3c15510..851cba2 100644
> --- a/include/linux/falloc.h
> +++ b/include/linux/falloc.h
> @@ -2,6 +2,7 @@
>  #define _FALLOC_H_
>  
>  #define FALLOC_FL_KEEP_SIZE	0x01 /* default is extend size */
> +#define FALLOC_FL_PUNCH_HOLE	0X02 /* de-allocates range */

Hole punching was not included originally in fallocate() for a
variety of reasons. IIRC, they were along the lines of:

	1 de-allocating of blocks in an allocation syscall is wrong.
	  People wanted a new syscall for this functionality.
	2 no glibc interface needs it
	3 at the time, only XFS supported punching holes, so there
	  is not need to support it in a generic interface
	4 the use cases presented were not considered compelling
	  enough to justify the additional complexity (!)

In the end, I gave up arguing for it to be included because just
getting the FALLOC_FL_KEEP_SIZE functionality was a hard enough
battle.

Anyway, #3 isn't the case any more, #4 was just an excuse not to
support anything ext4 couldn't do and lots of apps are calling
fallocate directly (because glibc can't use FALLOC_FL_KEEP_SIZE) so
#2 isn't an issue, either. I guess that leaves #1 to be debated;
I don't think there is any problem with doing what you propose.

What I will suggest is that this requires a generic xfstest to be
written and support added to xfs_io to enable that test (and others)
to issue hole punches. Something along the lines of test 242 which I
wrote for testing all the edge case of XFS_IOC_ZERO_RANGE (*) would be
good.

Cheers,

Dave.

(*) fallocate() version:
http://git.kernel.org/?p=linux/kernel/git/dgc/xfsdev.git;a=commitdiff;h=45f3e1831e3abc8bd12ec1e6c548f73a8dd9e36d
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-09  1:12 ` Dave Chinner
@ 2010-11-09  2:10   ` Josef Bacik
  2010-11-09  3:30   ` Ted Ts'o
  2010-11-09 20:51   ` Josef Bacik
  2 siblings, 0 replies; 35+ messages in thread
From: Josef Bacik @ 2010-11-09  2:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Josef Bacik, linux-kernel, linux-btrfs, linux-ext4, linux-fsdevel,
	xfs, joel.becker, cmm, cluster-devel

On Tue, Nov 09, 2010 at 12:12:22PM +1100, Dave Chinner wrote:
> On Mon, Nov 08, 2010 at 03:32:02PM -0500, Josef Bacik wrote:
> > Hole punching has already been implemented by XFS and OCFS2, and has the
> > potential to be implemented on both BTRFS and EXT4 so we need a generic way to
> > get to this feature.  The simplest way in my mind is to add FALLOC_FL_PUNCH_HOLE
> > to fallocate() since it already looks like the normal fallocate() operation.
> > I've tested this patch with XFS and BTRFS to make sure XFS did what it's
> > supposed to do and that BTRFS failed like it was supposed to.  Thank you,
> > 
> > Signed-off-by: Josef Bacik <josef@redhat.com>
> > ---
> >  fs/open.c              |    2 +-
> >  include/linux/falloc.h |    1 +
> >  2 files changed, 2 insertions(+), 1 deletions(-)
> > 
> > diff --git a/fs/open.c b/fs/open.c
> > index 4197b9e..ab8dedf 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -223,7 +223,7 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> >  		return -EINVAL;
> >  
> >  	/* Return error if mode is not supported */
> > -	if (mode && !(mode & FALLOC_FL_KEEP_SIZE))
> > +	if (mode && (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)))
> >  		return -EOPNOTSUPP;
> >  
> >  	if (!(file->f_mode & FMODE_WRITE))
> > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > index 3c15510..851cba2 100644
> > --- a/include/linux/falloc.h
> > +++ b/include/linux/falloc.h
> > @@ -2,6 +2,7 @@
> >  #define _FALLOC_H_
> >  
> >  #define FALLOC_FL_KEEP_SIZE	0x01 /* default is extend size */
> > +#define FALLOC_FL_PUNCH_HOLE	0X02 /* de-allocates range */
> 
> Hole punching was not included originally in fallocate() for a
> variety of reasons. IIRC, they were along the lines of:
> 
> 	1 de-allocating of blocks in an allocation syscall is wrong.
> 	  People wanted a new syscall for this functionality.
> 	2 no glibc interface needs it
> 	3 at the time, only XFS supported punching holes, so there
> 	  is not need to support it in a generic interface
> 	4 the use cases presented were not considered compelling
> 	  enough to justify the additional complexity (!)
> 
> In the end, I gave up arguing for it to be included because just
> getting the FALLOC_FL_KEEP_SIZE functionality was a hard enough
> battle.
> 
> Anyway, #3 isn't the case any more, #4 was just an excuse not to
> support anything ext4 couldn't do and lots of apps are calling
> fallocate directly (because glibc can't use FALLOC_FL_KEEP_SIZE) so
> #2 isn't an issue, either. I guess that leaves #1 to be debated;
> I don't think there is any problem with doing what you propose.
>
> What I will suggest is that this requires a generic xfstest to be
> written and support added to xfs_io to enable that test (and others)
> to issue hole punches. Something along the lines of test 242 which I
> wrote for testing all the edge case of XFS_IOC_ZERO_RANGE (*) would be
> good.

Sounds good.  Do you want me to build my PUNCH_HOLE patch ontop of your
ZERO_RANGE patch?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-09  1:12 ` Dave Chinner
  2010-11-09  2:10   ` Josef Bacik
@ 2010-11-09  3:30   ` Ted Ts'o
  2010-11-09  4:42     ` Dave Chinner
  2010-11-09 20:51   ` Josef Bacik
  2 siblings, 1 reply; 35+ messages in thread
From: Ted Ts'o @ 2010-11-09  3:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Josef Bacik, linux-kernel, linux-btrfs, linux-ext4, linux-fsdevel,
	xfs, joel.becker, cmm, cluster-devel

On Tue, Nov 09, 2010 at 12:12:22PM +1100, Dave Chinner wrote:
> Hole punching was not included originally in fallocate() for a
> variety of reasons. IIRC, they were along the lines of:
> 
> 	1 de-allocating of blocks in an allocation syscall is wrong.
> 	  People wanted a new syscall for this functionality.
> 	2 no glibc interface needs it
> 	3 at the time, only XFS supported punching holes, so there
> 	  is not need to support it in a generic interface
> 	4 the use cases presented were not considered compelling
> 	  enough to justify the additional complexity (!)
> 
> In the end, I gave up arguing for it to be included because just
> getting the FALLOC_FL_KEEP_SIZE functionality was a hard enough
> battle.
> 
> Anyway, #3 isn't the case any more, #4 was just an excuse not to
> support anything ext4 couldn't do and lots of apps are calling
> fallocate directly (because glibc can't use FALLOC_FL_KEEP_SIZE) so
> #2 isn't an issue, either.

I don't recall anyone arguing #4 because of ext4, but I get very tired
of the linux-fsdevel bike-shed painting parties, so I often will
concede whatever is necessary just to get the !@#! interface in,
assuming we could add more flags later....

glibc does support fallocate(), BTW; it's just posix_fallocate() that
doesn't use FALLOC_FL_KEEP_SIZE.

> I guess that leaves #1 to be debated;
> I don't think there is any problem with doing what you propose.

I don't have a problem either.

As a completely separate proposal, what do people think about an
FALLOCATE_FL_ZEROIZE after which time the blocks are allocated, but
reading from them returns zero.  This could be done either by (a)
sending a discard in the case of devices where discard_zeros_data is
true and discard_granularty is less than the fs block size, or (b) by
setting the uninitialized flag in the extent tree.

						- Ted

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-09  3:30   ` Ted Ts'o
@ 2010-11-09  4:42     ` Dave Chinner
  2010-11-09 21:41       ` Ted Ts'o
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2010-11-09  4:42 UTC (permalink / raw)
  To: Ted Ts'o, Josef Bacik, linux-kernel, linux-btrfs, linux-ext4,
	linux-fsdevel, xfs

On Mon, Nov 08, 2010 at 10:30:38PM -0500, Ted Ts'o wrote:
> On Tue, Nov 09, 2010 at 12:12:22PM +1100, Dave Chinner wrote:
> > Hole punching was not included originally in fallocate() for a
> > variety of reasons. IIRC, they were along the lines of:
> > 
> > 	1 de-allocating of blocks in an allocation syscall is wrong.
> > 	  People wanted a new syscall for this functionality.
....
> > I guess that leaves #1 to be debated;
> > I don't think there is any problem with doing what you propose.
> 
> I don't have a problem either.
> 
> As a completely separate proposal, what do people think about an
> FALLOCATE_FL_ZEROIZE after which time the blocks are allocated, but
> reading from them returns zero.

That's exactly the new XFS_IOC_ZERO_RANGE ioctl in 2.6.36 does
(commit 447223520520b17d3b6d0631aa4838fbaf8eddb4 "xfs: Introduce
XFS_IOC_ZERO_RANGE") The git commit I pointed to in the last email
is the rudimentary fallocate() interface support I have for that
code which goes along with an xfs_io patch I have. Given that there
seems to be interest for this operation, I'll flesh it out into a
proper patch....

> This could be done either by (a)
> sending a discard in the case of devices where discard_zeros_data is
> true and discard_granularty is less than the fs block size, or (b) by
> setting the uninitialized flag in the extent tree.

Implementation is up to the filesystem. However, XFS does (b)
because:

	1) it was extremely simple to implement (one of the
	   advantages of having an exceedingly complex allocation
	   interface to begin with :P)
	2) conversion is atomic, fast and reliable
	3) it is independent of the underlying storage; and
	4) reads of unwritten extents operate at memory speed,
	   not disk speed.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-09  4:42     ` Dave Chinner
@ 2010-11-09 21:41       ` Ted Ts'o
  2010-11-09 21:53         ` Jan Kara
  2010-11-09 23:40         ` Dave Chinner
  0 siblings, 2 replies; 35+ messages in thread
From: Ted Ts'o @ 2010-11-09 21:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Josef Bacik, linux-kernel, linux-btrfs, linux-ext4, linux-fsdevel,
	xfs, joel.becker, cmm, cluster-devel

On Tue, Nov 09, 2010 at 03:42:42PM +1100, Dave Chinner wrote:
> Implementation is up to the filesystem. However, XFS does (b)
> because:
> 
> 	1) it was extremely simple to implement (one of the
> 	   advantages of having an exceedingly complex allocation
> 	   interface to begin with :P)
> 	2) conversion is atomic, fast and reliable
> 	3) it is independent of the underlying storage; and
> 	4) reads of unwritten extents operate at memory speed,
> 	   not disk speed.

Yeah, I was thinking that using a device-style TRIM might be better
since future attempts to write to it won't require a separate seek to
modify the extent tree.  But yeah, there are a bunch of advantages of
simply mutating the extent tree.

While we're on the subject of changes to fallocate, what do people
think of FALLOC_FL_EXPOSE_OLD_DATA, which requires either root
privileges or (if capabilities are in use) CAP_DAC_OVERRIDE &&
CAP_MAC_OVERRIDE && CAP_SYS_ADMIN.  This would allow a trusted process
to fallocate blocks with the extent already marked initialized.  I've
had two requests for such functionality for ext4 already.  

(Take for example a trusted cluster filesystem backend that checks the
object checksum before returning any data to the user; and if the
check fails the cluster file system will try to use some other replica
stored on some other server.)

    		     	  		    	 - Ted

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-09 21:41       ` Ted Ts'o
@ 2010-11-09 21:53         ` Jan Kara
  2010-11-09 23:40         ` Dave Chinner
  1 sibling, 0 replies; 35+ messages in thread
From: Jan Kara @ 2010-11-09 21:53 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Dave Chinner, Josef Bacik, linux-kernel, linux-btrfs, linux-ext4,
	linux-fsdevel, xfs, joel.becker, cmm, cluster-devel

On Tue 09-11-10 16:41:47, Ted Ts'o wrote:
> On Tue, Nov 09, 2010 at 03:42:42PM +1100, Dave Chinner wrote:
> > Implementation is up to the filesystem. However, XFS does (b)
> > because:
> > 
> > 	1) it was extremely simple to implement (one of the
> > 	   advantages of having an exceedingly complex allocation
> > 	   interface to begin with :P)
> > 	2) conversion is atomic, fast and reliable
> > 	3) it is independent of the underlying storage; and
> > 	4) reads of unwritten extents operate at memory speed,
> > 	   not disk speed.
> 
> Yeah, I was thinking that using a device-style TRIM might be better
> since future attempts to write to it won't require a separate seek to
> modify the extent tree.  But yeah, there are a bunch of advantages of
> simply mutating the extent tree.
> 
> While we're on the subject of changes to fallocate, what do people
> think of FALLOC_FL_EXPOSE_OLD_DATA, which requires either root
> privileges or (if capabilities are in use) CAP_DAC_OVERRIDE &&
> CAP_MAC_OVERRIDE && CAP_SYS_ADMIN.  This would allow a trusted process
> to fallocate blocks with the extent already marked initialized.  I've
> had two requests for such functionality for ext4 already.  
> 
> (Take for example a trusted cluster filesystem backend that checks the
> object checksum before returning any data to the user; and if the
> check fails the cluster file system will try to use some other replica
> stored on some other server.)
  Hum, could you elaborate a bit? I fail to see how above fallocate() flag
could be used to help solving this problem... Just curious...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-09 21:41       ` Ted Ts'o
  2010-11-09 21:53         ` Jan Kara
@ 2010-11-09 23:40         ` Dave Chinner
  2011-01-11 21:13           ` Lawrence Greenfield
  1 sibling, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2010-11-09 23:40 UTC (permalink / raw)
  To: Ted Ts'o, Josef Bacik, linux-kernel, linux-btrfs, linux-ext4,
	linux-fsdevel, xfs

On Tue, Nov 09, 2010 at 04:41:47PM -0500, Ted Ts'o wrote:
> On Tue, Nov 09, 2010 at 03:42:42PM +1100, Dave Chinner wrote:
> > Implementation is up to the filesystem. However, XFS does (b)
> > because:
> > 
> > 	1) it was extremely simple to implement (one of the
> > 	   advantages of having an exceedingly complex allocation
> > 	   interface to begin with :P)
> > 	2) conversion is atomic, fast and reliable
> > 	3) it is independent of the underlying storage; and
> > 	4) reads of unwritten extents operate at memory speed,
> > 	   not disk speed.
> 
> Yeah, I was thinking that using a device-style TRIM might be better
> since future attempts to write to it won't require a separate seek to
> modify the extent tree.  But yeah, there are a bunch of advantages of
> simply mutating the extent tree.
> 
> While we're on the subject of changes to fallocate, what do people
> think of FALLOC_FL_EXPOSE_OLD_DATA, which requires either root
> privileges or (if capabilities are in use) CAP_DAC_OVERRIDE &&
> CAP_MAC_OVERRIDE && CAP_SYS_ADMIN.  This would allow a trusted process
> to fallocate blocks with the extent already marked initialized.  I've
> had two requests for such functionality for ext4 already.  

We removed that ability from XFS about three years ago because it's
a massive security hole. e.g. what happens if the file is world
readable, even though the process that called
FALLOC_FL_EXPOSE_OLD_DATA was privileged and was allowed to expose
such data? Or the file is chmod 777 after being exposed?

The historical reason for such behaviour existing in XFS was that in
1997 the CPU and IO latency cost of unwritten extent conversion was
significant, so users with real physical security (i.e. marines with
guns) were able to make use of fast preallocation with no conversion
overhead without caring about the security implications. These days,
the performance overhead of unwritten extent conversion is minimal -
I generally can't measure a difference in IO performance as a result
of it - so there is simply no good reaѕon for leaving such a gaping
security hole in the system.

If anyone wants to read the underlying data, then use fiemap to map
the physical blocks and read it directly from the block device. That
requires root privileges but does not open any new stale data
exposure problems....

> (Take for example a trusted cluster filesystem backend that checks the
> object checksum before returning any data to the user; and if the
> check fails the cluster file system will try to use some other replica
> stored on some other server.)

IOWs, all they want to do is avoid the unwritten extent conversion
overhead. Time has shown that a bad security/performance tradeoff
decision was made 13 years ago in XFS, so I see little reason to
repeat it for ext4 today....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-09 23:40         ` Dave Chinner
@ 2011-01-11 21:13           ` Lawrence Greenfield
  2011-01-11 21:30             ` Ted Ts'o
  2011-01-12 12:44             ` Dave Chinner
  0 siblings, 2 replies; 35+ messages in thread
From: Lawrence Greenfield @ 2011-01-11 21:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ted Ts'o, Josef Bacik, linux-kernel, linux-btrfs, linux-ext4,
	linux-fsdevel, xfs, joel.becker, cmm, cluster-devel

On Tue, Nov 9, 2010 at 6:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Nov 09, 2010 at 04:41:47PM -0500, Ted Ts'o wrote:
>> On Tue, Nov 09, 2010 at 03:42:42PM +1100, Dave Chinner wrote:
>> > Implementation is up to the filesystem. However, XFS does (b)
>> > because:
>> >
>> >     1) it was extremely simple to implement (one of the
>> >        advantages of having an exceedingly complex allocation
>> >        interface to begin with :P)
>> >     2) conversion is atomic, fast and reliable
>> >     3) it is independent of the underlying storage; and
>> >     4) reads of unwritten extents operate at memory speed,
>> >        not disk speed.
>>
>> Yeah, I was thinking that using a device-style TRIM might be better
>> since future attempts to write to it won't require a separate seek to
>> modify the extent tree.  But yeah, there are a bunch of advantages of
>> simply mutating the extent tree.
>>
>> While we're on the subject of changes to fallocate, what do people
>> think of FALLOC_FL_EXPOSE_OLD_DATA, which requires either root
>> privileges or (if capabilities are in use) CAP_DAC_OVERRIDE &&
>> CAP_MAC_OVERRIDE && CAP_SYS_ADMIN.  This would allow a trusted process
>> to fallocate blocks with the extent already marked initialized.  I've
>> had two requests for such functionality for ext4 already.
>
> We removed that ability from XFS about three years ago because it's
> a massive security hole. e.g. what happens if the file is world
> readable, even though the process that called
> FALLOC_FL_EXPOSE_OLD_DATA was privileged and was allowed to expose
> such data? Or the file is chmod 777 after being exposed?
>
> The historical reason for such behaviour existing in XFS was that in
> 1997 the CPU and IO latency cost of unwritten extent conversion was
> significant, so users with real physical security (i.e. marines with
> guns) were able to make use of fast preallocation with no conversion
> overhead without caring about the security implications. These days,
> the performance overhead of unwritten extent conversion is minimal -
> I generally can't measure a difference in IO performance as a result
> of it - so there is simply no good reaѕon for leaving such a gaping
> security hole in the system.
>
> If anyone wants to read the underlying data, then use fiemap to map
> the physical blocks and read it directly from the block device. That
> requires root privileges but does not open any new stale data
> exposure problems....
>
>> (Take for example a trusted cluster filesystem backend that checks the
>> object checksum before returning any data to the user; and if the
>> check fails the cluster file system will try to use some other replica
>> stored on some other server.)
>
> IOWs, all they want to do is avoid the unwritten extent conversion
> overhead. Time has shown that a bad security/performance tradeoff
> decision was made 13 years ago in XFS, so I see little reason to
> repeat it for ext4 today....

I'd make use of FALLOC_FL_EXPOSE_OLD_DATA. It's not the CPU overhead
of extent conversion. It's that extent conversion causes more metadata
operations than what you'd have otherwise, which means systems that
want to use O_DIRECT and make sure the data doesn't go away either
have to write O_DIRECT|O_DSYNC or need to call fdatasync().

cluster file system implementor,
Larry

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2011-01-11 21:13           ` Lawrence Greenfield
@ 2011-01-11 21:30             ` Ted Ts'o
  2011-01-12 11:48               ` Dave Chinner
  2011-01-12 12:44             ` Dave Chinner
  1 sibling, 1 reply; 35+ messages in thread
From: Ted Ts'o @ 2011-01-11 21:30 UTC (permalink / raw)
  To: Lawrence Greenfield
  Cc: Dave Chinner, Josef Bacik, linux-kernel, linux-btrfs, linux-ext4,
	linux-fsdevel, xfs, joel.becker, cmm, cluster-devel

On Tue, Jan 11, 2011 at 04:13:42PM -0500, Lawrence Greenfield wrote:
> > IOWs, all they want to do is avoid the unwritten extent conversion
> > overhead. Time has shown that a bad security/performance tradeoff
> > decision was made 13 years ago in XFS, so I see little reason to
> > repeat it for ext4 today....

I suspect things may have changed somewhat; both in terms of
requirements and nature of cluter file systems, and the performance of
various storage systems (including PCIe-attached flash devices).  

> I'd make use of FALLOC_FL_EXPOSE_OLD_DATA. It's not the CPU overhead
> of extent conversion. It's that extent conversion causes more metadata
> operations than what you'd have otherwise, which means systems that
> want to use O_DIRECT and make sure the data doesn't go away either
> have to write O_DIRECT|O_DSYNC or need to call fdatasync().
> 
> cluster file system implementor,

One possibility might be to make it an optional feature which is only
enabled via a mount option.  That way someone would have to explicit
ask for this feature two ways (via a new flag to fallocate) and a
mount option.

It might not make sense for XFS, but for people who are using ext4 as
the local storage file system back-end, and are doing all sorts of
things to get the best performance, including disabling the journal, I
suspect it really would make sense.  So it could always be an
optional-to-implement flag, that not all file systems should feel
obliged to implement.

					- Ted

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2011-01-11 21:30             ` Ted Ts'o
@ 2011-01-12 11:48               ` Dave Chinner
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Chinner @ 2011-01-12 11:48 UTC (permalink / raw)
  To: Ted Ts'o, Lawrence Greenfield, Josef Bacik, linux-kernel,
	linux-btrfs, linux-ext4

On Tue, Jan 11, 2011 at 04:30:07PM -0500, Ted Ts'o wrote:
> On Tue, Jan 11, 2011 at 04:13:42PM -0500, Lawrence Greenfield wrote:
> > > IOWs, all they want to do is avoid the unwritten extent conversion
> > > overhead. Time has shown that a bad security/performance tradeoff
> > > decision was made 13 years ago in XFS, so I see little reason to
> > > repeat it for ext4 today....
> 
> I suspect things may have changed somewhat; both in terms of
> requirements and nature of cluter file systems, and the performance of
> various storage systems (including PCIe-attached flash devices).  

We can throw 1000x more CPU power and memory at the problem than
we could 13 years ago. IOW the system balance hasn't changed (even
considering pci-e SSDs) compared to 13 years. Hence if it was a bad
tradeoff 13 years ago, it's still a bad tradeoff today.

> > I'd make use of FALLOC_FL_EXPOSE_OLD_DATA. It's not the CPU overhead
> > of extent conversion. It's that extent conversion causes more metadata
> > operations than what you'd have otherwise, which means systems that
> > want to use O_DIRECT and make sure the data doesn't go away either
> > have to write O_DIRECT|O_DSYNC or need to call fdatasync().
> > cluster file system implementor,
> 
> One possibility might be to make it an optional feature which is only
> enabled via a mount option.  That way someone would have to explicit
> ask for this feature two ways (via a new flag to fallocate) and a
> mount option.

Proliferation of mount options just to enable feature X of API Y for
filesystem Z is not a good idea. Either you enable it via the
fallocate API or you don't allow it at all.

> It might not make sense for XFS, but for people who are using ext4
> as the local storage file system back-end,

How does this differ from a local filesystem? Are you talking about
storage nodes for clustered/cloudy storage?

If so, I know of quite a few places that use XFS for this purpose
and they all seem to measure storage in petabytes made up of small
boxes containing anywhere between 30-100TB each. The only request
for additional preallocation functionality I've got from people
running such applications recently is for XFS_IOC_ZERO_RANGE. This
is quite relevant, because that specifically converts allocated
extents to unwritten extents. i.e. they like to be able to
efficiently re-initialise allocated space to zeros rather than
have it contain stale data.

> and are doing all sorts of things to get the best performance,
> including disabling the journal, I suspect it really would make
> sense.

That's not really a convincing argument for a new interface that
needs to be maintained forever.

> So it could always be an
> optional-to-implement flag, that not all file systems should feel
> obliged to implement.

It could, but it still needs better justification.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2011-01-11 21:13           ` Lawrence Greenfield
  2011-01-11 21:30             ` Ted Ts'o
@ 2011-01-12 12:44             ` Dave Chinner
  2011-01-28 18:13               ` Ric Wheeler
  1 sibling, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2011-01-12 12:44 UTC (permalink / raw)
  To: Lawrence Greenfield
  Cc: Ted Ts'o, Josef Bacik, linux-kernel, linux-btrfs, linux-ext4,
	linux-fsdevel, xfs, joel.becker, cmm, cluster-devel

On Tue, Jan 11, 2011 at 04:13:42PM -0500, Lawrence Greenfield wrote:
> On Tue, Nov 9, 2010 at 6:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> > The historical reason for such behaviour existing in XFS was that in
> > 1997 the CPU and IO latency cost of unwritten extent conversion was
> > significant,

.....

> >> (Take for example a trusted cluster filesystem backend that checks the
> >> object checksum before returning any data to the user; and if the
> >> check fails the cluster file system will try to use some other replica
> >> stored on some other server.)
> >
> > IOWs, all they want to do is avoid the unwritten extent conversion
> > overhead. Time has shown that a bad security/performance tradeoff
> > decision was made 13 years ago in XFS, so I see little reason to
> > repeat it for ext4 today....
> 
> I'd make use of FALLOC_FL_EXPOSE_OLD_DATA. It's not the CPU overhead
> of extent conversion. It's that extent conversion causes more metadata
> operations than what you'd have otherwise,

Yes, that's the "IO latency" part of the cost I mentioned above.

> which means systems that
> want to use O_DIRECT and make sure the data doesn't go away either
> have to write O_DIRECT|O_DSYNC or need to call fdatasync().

Seriously, we tell application writers _all the time_ that they
*must* use fsync/fdatasync to guarantee their data is on stable
storage and that they cannot rely on side-effects of filesystem or
storage specific behaviours (like ext3 ordered mode) to do that job
for them.

You're suggesting that by introducing FALLOC_FL_EXPOSE_OLD_DATA,
applications can rely on filesystem/storage specific behaviour to
guarantee data is on stable storage without the use of
fdatasync/fsync. Wht you describe is definitely storage specific,
because volatile write caches still needs the fdatasync to issue a
cache flush.

Do you see the same conflict here that I do?

> cluster file system implementor

Which one?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2011-01-12 12:44             ` Dave Chinner
@ 2011-01-28 18:13               ` Ric Wheeler
  0 siblings, 0 replies; 35+ messages in thread
From: Ric Wheeler @ 2011-01-28 18:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Lawrence Greenfield, Ted Ts'o, Josef Bacik, linux-kernel,
	linux-btrfs, linux-ext4, linux-fsdevel, xfs, joel.becker, cmm,
	cluster-devel

On 01/12/2011 07:44 AM, Dave Chinner wrote:
> On Tue, Jan 11, 2011 at 04:13:42PM -0500, Lawrence Greenfield wrote:
>> On Tue, Nov 9, 2010 at 6:40 PM, Dave Chinner<david@fromorbit.com>  wrote:
>>> The historical reason for such behaviour existing in XFS was that in
>>> 1997 the CPU and IO latency cost of unwritten extent conversion was
>>> significant,
> .....
>
>>>> (Take for example a trusted cluster filesystem backend that checks the
>>>> object checksum before returning any data to the user; and if the
>>>> check fails the cluster file system will try to use some other replica
>>>> stored on some other server.)
>>> IOWs, all they want to do is avoid the unwritten extent conversion
>>> overhead. Time has shown that a bad security/performance tradeoff
>>> decision was made 13 years ago in XFS, so I see little reason to
>>> repeat it for ext4 today....
>> I'd make use of FALLOC_FL_EXPOSE_OLD_DATA. It's not the CPU overhead
>> of extent conversion. It's that extent conversion causes more metadata
>> operations than what you'd have otherwise,
> Yes, that's the "IO latency" part of the cost I mentioned above.
>
>> which means systems that
>> want to use O_DIRECT and make sure the data doesn't go away either
>> have to write O_DIRECT|O_DSYNC or need to call fdatasync().
> Seriously, we tell application writers _all the time_ that they
> *must* use fsync/fdatasync to guarantee their data is on stable
> storage and that they cannot rely on side-effects of filesystem or
> storage specific behaviours (like ext3 ordered mode) to do that job
> for them.
>
> You're suggesting that by introducing FALLOC_FL_EXPOSE_OLD_DATA,
> applications can rely on filesystem/storage specific behaviour to
> guarantee data is on stable storage without the use of
> fdatasync/fsync. Wht you describe is definitely storage specific,
> because volatile write caches still needs the fdatasync to issue a
> cache flush.
>
> Do you see the same conflict here that I do?
>

The very concept seems quite "non-enterprise".  I also agree that the cost of 
maintaining extra mount options (and code) for something that no sane end user 
would ever do seems to be a loss.

Why wouldn't you want to convert the punched hole to an unwritten extent?

Thanks!

Ric

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] fs: add hole punching to fallocate
  2010-11-09  1:12 ` Dave Chinner
  2010-11-09  2:10   ` Josef Bacik
  2010-11-09  3:30   ` Ted Ts'o
@ 2010-11-09 20:51   ` Josef Bacik
  2 siblings, 0 replies; 35+ messages in thread
From: Josef Bacik @ 2010-11-09 20:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Josef Bacik, linux-kernel, linux-btrfs, linux-ext4, linux-fsdevel,
	xfs, joel.becker, cmm, cluster-devel

On Tue, Nov 09, 2010 at 12:12:22PM +1100, Dave Chinner wrote:
> On Mon, Nov 08, 2010 at 03:32:02PM -0500, Josef Bacik wrote:
> > Hole punching has already been implemented by XFS and OCFS2, and has the
> > potential to be implemented on both BTRFS and EXT4 so we need a generic way to
> > get to this feature.  The simplest way in my mind is to add FALLOC_FL_PUNCH_HOLE
> > to fallocate() since it already looks like the normal fallocate() operation.
> > I've tested this patch with XFS and BTRFS to make sure XFS did what it's
> > supposed to do and that BTRFS failed like it was supposed to.  Thank you,
> > 
> > Signed-off-by: Josef Bacik <josef@redhat.com>
> > ---
> >  fs/open.c              |    2 +-
> >  include/linux/falloc.h |    1 +
> >  2 files changed, 2 insertions(+), 1 deletions(-)
> > 
> > diff --git a/fs/open.c b/fs/open.c
> > index 4197b9e..ab8dedf 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -223,7 +223,7 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> >  		return -EINVAL;
> >  
> >  	/* Return error if mode is not supported */
> > -	if (mode && !(mode & FALLOC_FL_KEEP_SIZE))
> > +	if (mode && (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)))
> >  		return -EOPNOTSUPP;
> >  
> >  	if (!(file->f_mode & FMODE_WRITE))
> > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > index 3c15510..851cba2 100644
> > --- a/include/linux/falloc.h
> > +++ b/include/linux/falloc.h
> > @@ -2,6 +2,7 @@
> >  #define _FALLOC_H_
> >  
> >  #define FALLOC_FL_KEEP_SIZE	0x01 /* default is extend size */
> > +#define FALLOC_FL_PUNCH_HOLE	0X02 /* de-allocates range */
> 
> Hole punching was not included originally in fallocate() for a
> variety of reasons. IIRC, they were along the lines of:
> 
> 	1 de-allocating of blocks in an allocation syscall is wrong.
> 	  People wanted a new syscall for this functionality.
> 	2 no glibc interface needs it
> 	3 at the time, only XFS supported punching holes, so there
> 	  is not need to support it in a generic interface
> 	4 the use cases presented were not considered compelling
> 	  enough to justify the additional complexity (!)
> 
> In the end, I gave up arguing for it to be included because just
> getting the FALLOC_FL_KEEP_SIZE functionality was a hard enough
> battle.
> 
> Anyway, #3 isn't the case any more, #4 was just an excuse not to
> support anything ext4 couldn't do and lots of apps are calling
> fallocate directly (because glibc can't use FALLOC_FL_KEEP_SIZE) so
> #2 isn't an issue, either. I guess that leaves #1 to be debated;
> I don't think there is any problem with doing what you propose.
> 
> What I will suggest is that this requires a generic xfstest to be
> written and support added to xfs_io to enable that test (and others)
> to issue hole punches. Something along the lines of test 242 which I
> wrote for testing all the edge case of XFS_IOC_ZERO_RANGE (*) would be
> good.
> 

So this was relatively simple, adding a flag to falloc for xfs_io and such.  Got
a test going and it worked great on XFS.  Then I went to make sure it worked on
non-XFS, and thats where I've run into pain.  Turns out xfs_io -c "bmap" only
works on XFS.  So I thought to myself "well how hard could it be to make this
thing use fiemap?", hahaha I'm an idiot.  So I've been adding a xfs_io -c
"fiemap" that spits things out similar to bmap, and it will probably be tomorrow
when I finish it.

So good news is my simple patches seem to work just fine for hole-punch, bad
news is its going to take me another day to have all the infrastructure to test
it on non-XFS filesystems.

Also did you want me to rebase my patches on your fallocate() version of
ZERO_RANGE?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2011-01-28 18:13 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-11-18  1:46 Hole Punching V3 Josef Bacik
2010-11-18  1:46 ` [PATCH 1/6] fs: add hole punching to fallocate Josef Bacik
2010-11-18 23:43   ` Jan Kara
2010-11-18  1:46 ` [PATCH 2/6] XFS: handle hole punching via fallocate properly Josef Bacik
2010-11-18  1:46 ` [PATCH 3/6] Ocfs2: " Josef Bacik
2010-11-18  1:46 ` [PATCH 4/6] Ext4: fail if we try to use hole punch Josef Bacik
2010-11-18  1:46 ` [PATCH 5/6] Btrfs: " Josef Bacik
2010-11-18  1:46 ` [PATCH 6/6] Gfs2: " Josef Bacik
2011-01-03 21:57 ` Hole Punching V3 Josef Bacik
  -- strict thread matches above, loose matches on Subject: below --
2010-11-15 17:05 Hole Punching V2 Josef Bacik
2010-11-15 17:05 ` [PATCH 1/6] fs: add hole punching to fallocate Josef Bacik
2010-11-16 11:16   ` Jan Kara
2010-11-16 11:43     ` Jan Kara
2010-11-16 12:52       ` Josef Bacik
2010-11-16 13:14         ` Jan Kara
2010-11-17  0:22           ` Andreas Dilger
2010-11-17  2:11             ` Dave Chinner
2010-11-17  2:28               ` Josef Bacik
2010-11-17  2:34                 ` Josef Bacik
2010-11-17  9:30                   ` Andreas Dilger
2010-11-17  9:19               ` Andreas Dilger
2010-11-16 12:53     ` Josef Bacik
2010-11-08 20:32 Josef Bacik
2010-11-09  1:12 ` Dave Chinner
2010-11-09  2:10   ` Josef Bacik
2010-11-09  3:30   ` Ted Ts'o
2010-11-09  4:42     ` Dave Chinner
2010-11-09 21:41       ` Ted Ts'o
2010-11-09 21:53         ` Jan Kara
2010-11-09 23:40         ` Dave Chinner
2011-01-11 21:13           ` Lawrence Greenfield
2011-01-11 21:30             ` Ted Ts'o
2011-01-12 11:48               ` Dave Chinner
2011-01-12 12:44             ` Dave Chinner
2011-01-28 18:13               ` Ric Wheeler
2010-11-09 20:51   ` Josef Bacik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).