linux-fsdevel.vger.kernel.org archive mirror
* [RFC] xfs: fake fallocate success for always CoW inodes
@ 2025-11-06 13:35 Hans Holmberg
  2025-11-06 13:48 ` Florian Weimer
  0 siblings, 1 reply; 28+ messages in thread
From: Hans Holmberg @ 2025-11-06 13:35 UTC (permalink / raw)
  To: linux-xfs
  Cc: Carlos Maiolino, Dave Chinner, Darrick J . Wong,
	Christoph Hellwig, linux-fsdevel, linux-kernel, libc-alpha,
	Hans Holmberg

We don't support preallocations for CoW inodes and we currently fail
with -EOPNOTSUPP, but this causes an issue for users of glibc's
posix_fallocate[1]. If fallocate fails, posix_fallocate falls back on
writing actual data into the range to try to allocate blocks that way.
That does not actually guarantee anything for CoW inodes however as we
write out of place.

So, for this case, users of posix_fallocate will end up writing data
unnecessarily AND be left with a broken promise of being able to
overwrite the range without ending up with -ENOSPC.

So, to avoid the useless data copy that just increases the risk of
-ENOSPC, warn the user and fake that the allocation was successful.

User space using fallocate[2] for preallocation will now be notified of
the missing support for CoW inodes via a logged warning instead of via
the return value. This is not great, but having posix_fallocate write
useless data and still not guarantee overwrites is arguably worse.

A mount option to choose between these two evils would be good to add,
but we would need to agree on the default value first.

[1] https://man7.org/linux/man-pages/man3/posix_fallocate.3.html
[2] https://man7.org/linux/man-pages/man2/fallocate.2.html

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
---
 fs/xfs/xfs_bmap_util.c | 15 ++++++++++++++-
 fs/xfs/xfs_file.c      |  7 -------
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 06ca11731e43..ff7f6aa41fc8 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -659,8 +659,21 @@ xfs_alloc_file_space(
 	xfs_bmbt_irec_t		imaps[1], *imapp;
 	int			error;
 
-	if (xfs_is_always_cow_inode(ip))
+	/*
+	 * If always_cow mode we can't use preallocations and thus should not
+	 * create them.
+	 */
+	if (xfs_is_always_cow_inode(ip)) {
+		/*
+		 * Instead of failing the fallocate, pretend it was successful
+		 * to avoid glibc posix_fallocate to fall back on writing actual
+		 * data that won't guarantee that the range can be overwritten
+		 * either.
+		 */
+		xfs_warn_once(mp,
+"Always CoW inodes do not support preallocations, faking fallocate success.");
 		return 0;
+	}
 
 	trace_xfs_alloc_file_space(ip);
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 2702fef2c90c..91e2693873c0 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1312,13 +1312,6 @@ xfs_falloc_allocate_range(
 	loff_t			new_size = 0;
 	int			error;
 
-	/*
-	 * If always_cow mode we can't use preallocations and thus should not
-	 * create them.
-	 */
-	if (xfs_is_always_cow_inode(XFS_I(inode)))
-		return -EOPNOTSUPP;
-
 	error = xfs_falloc_newsize(file, mode, offset, len, &new_size);
 	if (error)
 		return error;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-06 13:35 [RFC] xfs: fake fallocate success for always CoW inodes Hans Holmberg
@ 2025-11-06 13:48 ` Florian Weimer
  2025-11-06 13:52   ` Christoph Hellwig
  0 siblings, 1 reply; 28+ messages in thread
From: Florian Weimer @ 2025-11-06 13:48 UTC (permalink / raw)
  To: Hans Holmberg
  Cc: linux-xfs, Carlos Maiolino, Dave Chinner, Darrick J . Wong,
	Christoph Hellwig, linux-fsdevel, linux-kernel, libc-alpha

* Hans Holmberg:

> We don't support preallocations for CoW inodes and we currently fail
> with -EOPNOTSUPP, but this causes an issue for users of glibc's
> posix_fallocate[1]. If fallocate fails, posix_fallocate falls back on
> writing actual data into the range to try to allocate blocks that way.
> That does not actually guarantee anything for CoW inodes however as we
> write out of place.

Why doesn't fallocate trigger the copy instead?  Isn't this what the
user is requesting?

Thanks,
Florian


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-06 13:48 ` Florian Weimer
@ 2025-11-06 13:52   ` Christoph Hellwig
  2025-11-06 14:42     ` Matthew Wilcox
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2025-11-06 13:52 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Hans Holmberg, linux-xfs, Carlos Maiolino, Dave Chinner,
	Darrick J . Wong, Christoph Hellwig, linux-fsdevel, linux-kernel,
	libc-alpha

On Thu, Nov 06, 2025 at 02:48:12PM +0100, Florian Weimer wrote:
> * Hans Holmberg:
> 
> > We don't support preallocations for CoW inodes and we currently fail
> > with -EOPNOTSUPP, but this causes an issue for users of glibc's
> > posix_fallocate[1]. If fallocate fails, posix_fallocate falls back on
> > writing actual data into the range to try to allocate blocks that way.
> > That does not actually guarantee anything for CoW inodes however as we
> > write out of place.
> 
> Why doesn't fallocate trigger the copy instead?  Isn't this what the
> user is requesting?

What copy?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-06 13:52   ` Christoph Hellwig
@ 2025-11-06 14:42     ` Matthew Wilcox
  2025-11-06 14:46       ` Christoph Hellwig
  2025-11-06 16:31       ` Florian Weimer
  0 siblings, 2 replies; 28+ messages in thread
From: Matthew Wilcox @ 2025-11-06 14:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Florian Weimer, Hans Holmberg, linux-xfs, Carlos Maiolino,
	Dave Chinner, Darrick J . Wong, linux-fsdevel, linux-kernel,
	libc-alpha

On Thu, Nov 06, 2025 at 02:52:12PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 06, 2025 at 02:48:12PM +0100, Florian Weimer wrote:
> > * Hans Holmberg:
> > 
> > > We don't support preallocations for CoW inodes and we currently fail
> > > with -EOPNOTSUPP, but this causes an issue for users of glibc's
> > > posix_fallocate[1]. If fallocate fails, posix_fallocate falls back on
> > > writing actual data into the range to try to allocate blocks that way.
> > > That does not actually guarantee anything for CoW inodes however as we
> > > write out of place.
> > 
> > Why doesn't fallocate trigger the copy instead?  Isn't this what the
> > user is requesting?
> 
> What copy?

I believe Florian is thinking of CoW in the sense of "share while read
only, then you have a mutable block allocation", rather than the
WAFL (or SMR) sense of "we always put writes in a new location".

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-06 14:42     ` Matthew Wilcox
@ 2025-11-06 14:46       ` Christoph Hellwig
  2025-11-11  8:31         ` Hans Holmberg
  2025-11-06 16:31       ` Florian Weimer
  1 sibling, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2025-11-06 14:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Florian Weimer, Hans Holmberg, linux-xfs,
	Carlos Maiolino, Dave Chinner, Darrick J . Wong, linux-fsdevel,
	linux-kernel, libc-alpha

On Thu, Nov 06, 2025 at 02:42:30PM +0000, Matthew Wilcox wrote:
> On Thu, Nov 06, 2025 at 02:52:12PM +0100, Christoph Hellwig wrote:
> > On Thu, Nov 06, 2025 at 02:48:12PM +0100, Florian Weimer wrote:
> > > * Hans Holmberg:
> > > 
> > > > We don't support preallocations for CoW inodes and we currently fail
> > > > with -EOPNOTSUPP, but this causes an issue for users of glibc's
> > > > posix_fallocate[1]. If fallocate fails, posix_fallocate falls back on
> > > > writing actual data into the range to try to allocate blocks that way.
> > > > That does not actually guarantee anything for CoW inodes however as we
> > > > write out of place.
> > > 
> > > Why doesn't fallocate trigger the copy instead?  Isn't this what the
> > > user is requesting?
> > 
> > What copy?
> 
> I believe Florian is thinking of CoW in the sense of "share while read
> only, then you have a mutable block allocation", rather than the
> WAFL (or SMR) sense of "we always put writes in a new location".

Note that the glibc posix_fallocate(3) fallback will never copy anyway.
It does a racy and somewhat broken check for whether there is already
data, and if it thinks there isn't, it writes zeroes.  Which is the
wrong thing for just about every use case imaginable.  And the only
thing to stop it from doing that is to implement fallocate(2) and
return success.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-06 14:42     ` Matthew Wilcox
  2025-11-06 14:46       ` Christoph Hellwig
@ 2025-11-06 16:31       ` Florian Weimer
  2025-11-06 17:05         ` Christoph Hellwig
  1 sibling, 1 reply; 28+ messages in thread
From: Florian Weimer @ 2025-11-06 16:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Hans Holmberg, linux-xfs, Carlos Maiolino,
	Dave Chinner, Darrick J . Wong, linux-fsdevel, linux-kernel,
	libc-alpha

* Matthew Wilcox:

> On Thu, Nov 06, 2025 at 02:52:12PM +0100, Christoph Hellwig wrote:
>> On Thu, Nov 06, 2025 at 02:48:12PM +0100, Florian Weimer wrote:
>> > * Hans Holmberg:
>> > 
>> > > We don't support preallocations for CoW inodes and we currently fail
>> > > with -EOPNOTSUPP, but this causes an issue for users of glibc's
>> > > posix_fallocate[1]. If fallocate fails, posix_fallocate falls back on
>> > > writing actual data into the range to try to allocate blocks that way.
>> > > That does not actually guarantee anything for CoW inodes however as we
>> > > write out of place.
>> > 
>> > Why doesn't fallocate trigger the copy instead?  Isn't this what the
>> > user is requesting?
>> 
>> What copy?
>
> I believe Florian is thinking of CoW in the sense of "share while read
> only, then you have a mutable block allocation", rather than the
> WAFL (or SMR) sense of "we always put writes in a new location".

Ahh.  That's a new aspect to the discussion that was previously lost on
me.  Previous discussions focused on cases where the kernel couldn't do
the pre-population operation safely even though it was beneficial from
an application perspective.  And not cases where the operation was
meaningless because of the way the file system was implemented.

(Pre-allocating CoW space as part of fallocate appears to be difficult
because I don't see how to surface this space usage to applications and
administrators.)

It's been a few years, I think, and maybe we should drop the allocation
logic from posix_fallocate in glibc?  Assuming that it's implemented
everywhere it makes sense?  There are more always-CoW, compressing file
systems these days, so applications just have to come to terms with the
fact that even after posix_fallocate, writes can still fail, and not
just because of media errors.  So maybe posix_fallocate isn't that
meaningful anymore.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-06 16:31       ` Florian Weimer
@ 2025-11-06 17:05         ` Christoph Hellwig
  2025-11-08 12:30           ` Florian Weimer
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2025-11-06 17:05 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Matthew Wilcox, Christoph Hellwig, Hans Holmberg, linux-xfs,
	Carlos Maiolino, Dave Chinner, Darrick J . Wong, linux-fsdevel,
	linux-kernel, libc-alpha

On Thu, Nov 06, 2025 at 05:31:28PM +0100, Florian Weimer wrote:
> It's been a few years, I think, and maybe we should drop the allocation
> logic from posix_fallocate in glibc?  Assuming that it's implemented
> everywhere it makes sense?

I really think it should go away.  If it turns out we find cases where
it was useful we can try to implement a zeroing fallocate in the kernel
for the file system where people want it.  gfs2 for example currently
has such an implementation, and we could have somewhat generic library
version of it.

> There are more always-CoW, compressing file
> systems these days, so applications just have to come to terms with the
> fact that even after posix_fallocate, writes can still fail, and not
> just because of media errors.

Yes.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-06 17:05         ` Christoph Hellwig
@ 2025-11-08 12:30           ` Florian Weimer
  2025-11-09 22:15             ` Dave Chinner
  2025-11-10  9:31             ` Christoph Hellwig
  0 siblings, 2 replies; 28+ messages in thread
From: Florian Weimer @ 2025-11-08 12:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Florian Weimer, Matthew Wilcox, Hans Holmberg, linux-xfs,
	Carlos Maiolino, Dave Chinner, Darrick J . Wong, linux-fsdevel,
	linux-kernel, libc-alpha

* Christoph Hellwig:

> On Thu, Nov 06, 2025 at 05:31:28PM +0100, Florian Weimer wrote:
>> It's been a few years, I think, and maybe we should drop the allocation
>> logic from posix_fallocate in glibc?  Assuming that it's implemented
>> everywhere it makes sense?
>
> I really think it should go away.  If it turns out we find cases where
> it was useful we can try to implement a zeroing fallocate in the kernel
> for the file system where people want it.  gfs2 for example currently
> has such an implementation, and we could have somewhat generic library
> version of it.

Sorry, I remember now where this got stuck the last time.

This program:

#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int
main(void)
{
  FILE *fp = tmpfile();
  if (fp == NULL)
    abort();
  int fd = fileno(fp);
  posix_fallocate(fd, 0, 1);
  char *p = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  *p = 1;
}

should not crash even if the file system does not support fallocate.
I hope we can agree on that.  I expect avoiding SIGBUS errors because
of insufficient file size is a common use case for posix_fallocate.
This use is not really an optimization, it's required to get mmap
working properly.

If we can get an fallocate mode that we can use as a fallback to
increase the file size with a zero flag argument, we can definitely
use that in posix_fallocate (replacing the fallback path on kernels
that support it).  All local file systems should be able to implement
that (but perhaps not efficiently).  Basically, what we need here is a
non-destructive ftruncate.

Maybe add two flags, one for the ftruncate replacement, and one that
instructs the file system that the range will be used with mmap soon?
I expect this could be useful information to the file system.  We
wouldn't use it in posix_fallocate, but applications calling fallocate
directly might.

Christoph, is this something you could help with?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-08 12:30           ` Florian Weimer
@ 2025-11-09 22:15             ` Dave Chinner
  2025-11-10  5:27               ` Florian Weimer
  2025-11-10  9:37               ` Christoph Hellwig
  2025-11-10  9:31             ` Christoph Hellwig
  1 sibling, 2 replies; 28+ messages in thread
From: Dave Chinner @ 2025-11-09 22:15 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Christoph Hellwig, Florian Weimer, Matthew Wilcox, Hans Holmberg,
	linux-xfs, Carlos Maiolino, Darrick J . Wong, linux-fsdevel,
	linux-kernel, libc-alpha

On Sat, Nov 08, 2025 at 01:30:18PM +0100, Florian Weimer wrote:
> * Christoph Hellwig:
> 
> > On Thu, Nov 06, 2025 at 05:31:28PM +0100, Florian Weimer wrote:
> >> It's been a few years, I think, and maybe we should drop the allocation
> >> logic from posix_fallocate in glibc?  Assuming that it's implemented
> >> everywhere it makes sense?
> >
> > I really think it should go away.  If it turns out we find cases where
> > it was useful we can try to implement a zeroing fallocate in the kernel
> > for the file system where people want it.

This is what the shiny new FALLOC_FL_WRITE_ZEROS command is supposed
to provide. We don't have widespread support in filesystems for it
yet, though.

> > gfs2 for example currently
> > has such an implementation, and we could have somewhat generic library
> > version of it.

Yup, seems like an iomap iter loop would be pretty trivial to
abstract from that...

> Sorry, I remember now where this got stuck the last time.
> 
> This program:
> 
> #include <fcntl.h>
> #include <stddef.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> 
> int
> main(void)
> {
>   FILE *fp = tmpfile();
>   if (fp == NULL)
>     abort();
>   int fd = fileno(fp);
>   posix_fallocate(fd, 0, 1);
>   char *p = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>   *p = 1;
> }
> 
> should not crash even if the file system does not support fallocate.

I think that's buggy application code.

Failing to check the return value of a library call that documents
EOPNOTSUPP as a valid error is a bug. IOWs, the above code *should*
SIGBUS on the mmap access, because it failed to verify that the file
extension operation actually worked.

I mean, if this was "ftruncate(1); mmap(); *p =1" and ftruncate()
failed and so SIGBUS was delivered, there would be no doubt that
this is an application bug. Why should we treat errors returned
by fallocate() and/or posix_fallocate() any differently here?
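
A minimal sketch of the checked variant of the quoted program, assuming
a glibc/Linux environment.  The names map_prepared_file() and
run_example() are hypothetical and do not come from the thread.  Note
that posix_fallocate() returns an error number directly and does not
set errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/*
 * Extend fd so that it covers at least `len` bytes, then map it
 * shared and writable.  Returns 0 and stores the mapping in *out on
 * success, otherwise returns an error number.
 */
int
map_prepared_file(int fd, size_t len, char **out)
{
  assert(len > 0 && out != NULL);

  /* posix_fallocate() reports failure via its return value. */
  int err = posix_fallocate(fd, 0, (off_t) len);
  if (err != 0)
    return err;                 /* e.g. EOPNOTSUPP or ENOSPC */

  char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (p == MAP_FAILED)
    return errno;

  *out = p;
  return 0;
}

/* The program from the thread, with every result checked. */
int
run_example(void)
{
  FILE *fp = tmpfile();
  if (fp == NULL)
    return errno;

  char *p;
  int err = map_prepared_file(fileno(fp), 1, &p);
  if (err == 0) {
    *p = 1;                     /* only reached if the extension worked */
    munmap(p, 1);
  }
  fclose(fp);
  return err;
}
```

With this shape the store through the mapping is only reached once the
file is known to be long enough, so a failed extension surfaces as an
error code rather than as SIGBUS.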

> I hope we can agree on that.  I expect avoiding SIGBUS errors because
> of insufficient file size is a common use case for posix_fallocate.
> This use is not really an optimization, it's required to get mmap
> working properly.
> 
> If we can get an fallocate mode that we can use as a fallback to
> increase the file size with a zero flag argument, we can definitely

The fallocate() API already supports that, in two different ways:
FALLOC_FL_ZERO_RANGE and FALLOC_FL_WRITE_ZEROS.

But, again, not all filesystems support these, so userspace has to
be prepared to receive -EOPNOTSUPP from these calls. Hence userspace
has to do the right thing for posix_fallocate() if you want to
ensure that it always extends the file size even when fallocate()
calls fail...

> use that in posix_fallocate (replacing the fallback path on kernels
> that support it).  All local file systems should be able to implement
> that (but perhaps not efficiently).  Basically, what we need here is a
> non-destructive ftruncate.

You aren't going to get support for such new commands on existing
kernels, so userspace is still going to have to code the ftruncate()
fallback itself for the desired behaviour to be provided
consistently to applications.

As such, I don't see any reason for the fallocate() syscall
providing some whacky "ftruncate() in all but name" mode.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-09 22:15             ` Dave Chinner
@ 2025-11-10  5:27               ` Florian Weimer
  2025-11-10  9:38                 ` Christoph Hellwig
  2025-11-10 20:28                 ` Dave Chinner
  2025-11-10  9:37               ` Christoph Hellwig
  1 sibling, 2 replies; 28+ messages in thread
From: Florian Weimer @ 2025-11-10  5:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Matthew Wilcox, Hans Holmberg, linux-xfs,
	Carlos Maiolino, Darrick J . Wong, linux-fsdevel, linux-kernel,
	libc-alpha

* Dave Chinner:

> On Sat, Nov 08, 2025 at 01:30:18PM +0100, Florian Weimer wrote:
>> * Christoph Hellwig:
>> 
>> > On Thu, Nov 06, 2025 at 05:31:28PM +0100, Florian Weimer wrote:
>> >> It's been a few years, I think, and maybe we should drop the allocation
>> >> logic from posix_fallocate in glibc?  Assuming that it's implemented
>> >> everywhere it makes sense?
>> >
>> > I really think it should go away.  If it turns out we find cases where
>> > it was useful we can try to implement a zeroing fallocate in the kernel
>> > for the file system where people want it.
>
> This is what the shiny new FALLOC_FL_WRITE_ZEROS command is supposed
> to provide. We don't have widespread support in filesystems for it
> yet, though.
>
>> > gfs2 for example currently
>> > has such an implementation, and we could have somewhat generic library
>> > version of it.
>
> Yup, seems like an iomap iter loop would be pretty trivial to
> abstract from that...
>
>> Sorry, I remember now where this got stuck the last time.
>> 
>> This program:
>> 
>> #include <fcntl.h>
>> #include <stddef.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <sys/mman.h>
>> 
>> int
>> main(void)
>> {
>>   FILE *fp = tmpfile();
>>   if (fp == NULL)
>>     abort();
>>   int fd = fileno(fp);
>>   posix_fallocate(fd, 0, 1);
>>   char *p = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>>   *p = 1;
>> }
>> 
>> should not crash even if the file system does not support fallocate.
>
> I think that's buggy application code.
>
> Failing to check the return value of a library call that documents
> EOPNOTSUPP as a valid error is a bug. IOWs, the above code *should*
> SIGBUS on the mmap access, because it failed to verify that the file
> extension operation actually worked.

Sorry, I made the example confusing.

How would the application deal with failure due to lack of fallocate
support?  It would have to do a pwrite, like posix_fallocate does
today, or maybe ftruncate.  This is why I think removing the fallback
from posix_fallocate completely is mostly pointless.

>> I hope we can agree on that.  I expect avoiding SIGBUS errors because
>> of insufficient file size is a common use case for posix_fallocate.
>> This use is not really an optimization, it's required to get mmap
>> working properly.
>> 
>> If we can get an fallocate mode that we can use as a fallback to
>> increase the file size with a zero flag argument, we can definitely
>
> The fallocate() API already supports that, in two different ways:
> FALLOC_FL_ZERO_RANGE and FALLOC_FL_WRITE_ZEROS.

Neither is appropriate for posix_fallocate because they are as
destructive as the existing fallback.

> But, again, not all filesystems support these, so userspace has to
> be prepared to receive -EOPNOTSUPP from these calls. Hence userspace
> has to do the right thing for posix_fallocate() if you want to
> ensure that it always extends the file size even when fallocate()
> calls fail...

Sure, but eventually, we may get into a better situation.

>> use that in posix_fallocate (replacing the fallback path on kernels
>> that support it).  All local file systems should be able to implement
>> that (but perhaps not efficiently).  Basically, what we need here is a
>> non-destructive ftruncate.
>
> You aren't going to get support for such new commands on existing
> kernels, so userspace is still going to have to code the ftruncate()
> fallback itself for the desired behaviour to be provided
> consistently to applications.
>
> As such, I don't see any reason for the fallocate() syscall
> providing some whacky "ftruncate() in all but name" mode.

Please reconsider.  If we start fixing this, we'll eventually be in a
position where the glibc fallback code never runs.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-08 12:30           ` Florian Weimer
  2025-11-09 22:15             ` Dave Chinner
@ 2025-11-10  9:31             ` Christoph Hellwig
  2025-11-10  9:48               ` truncatat? was, " Christoph Hellwig
  2025-11-10  9:49               ` Florian Weimer
  1 sibling, 2 replies; 28+ messages in thread
From: Christoph Hellwig @ 2025-11-10  9:31 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Christoph Hellwig, Florian Weimer, Matthew Wilcox, Hans Holmberg,
	linux-xfs, Carlos Maiolino, Dave Chinner, Darrick J . Wong,
	linux-fsdevel, linux-kernel, libc-alpha

On Sat, Nov 08, 2025 at 01:30:18PM +0100, Florian Weimer wrote:
> main(void)
> {
>   FILE *fp = tmpfile();
>   if (fp == NULL)
>     abort();
>   int fd = fileno(fp);
>   posix_fallocate(fd, 0, 1);
>   char *p = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>   *p = 1;
> }
> 
> should not crash even if the file system does not support fallocate.
> I hope we can agree on that.  I expect avoiding SIGBUS errors because
> of insufficient file size is a common use case for posix_fallocate.
> This use is not really an optimization, it's required to get mmap
> working properly.

That's a weird use of posix_fallocate, but if an interface to increase
the file size without the chance of reducing it is useful that's for
sure something we could add.

> If we can get an fallocate mode that we can use as a fallback to
> increase the file size with a zero flag argument, we can definitely
> use that in posix_fallocate (replacing the fallback path on kernels
> that support it).  All local file systems should be able to implement
> that (but perhaps not efficiently).  Basically, what we need here is a
> non-destructive ftruncate.

fallocate seems like an odd interface choice for that, but given that
(f)truncate doesn't have a flags argument that might still be the
least unexpected version.

> Maybe add two flags, one for the ftruncate replacement, and one that
> instructs the file system that the range will be used with mmap soon?
> I expect this could be useful information to the file system.  We
> wouldn't use it in posix_fallocate, but applications calling fallocate
> directly might.

What do you think a "to be used with mmap" flag could be useful for
in the file system?  For file systems mmap I/O isn't very different
from other use cases.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-09 22:15             ` Dave Chinner
  2025-11-10  5:27               ` Florian Weimer
@ 2025-11-10  9:37               ` Christoph Hellwig
  2025-11-10  9:44                 ` Florian Weimer
  2025-11-10 21:33                 ` Dave Chinner
  1 sibling, 2 replies; 28+ messages in thread
From: Christoph Hellwig @ 2025-11-10  9:37 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Florian Weimer, Christoph Hellwig, Florian Weimer, Matthew Wilcox,
	Hans Holmberg, linux-xfs, Carlos Maiolino, Darrick J . Wong,
	linux-fsdevel, linux-kernel, libc-alpha

On Mon, Nov 10, 2025 at 09:15:50AM +1100, Dave Chinner wrote:
> On Sat, Nov 08, 2025 at 01:30:18PM +0100, Florian Weimer wrote:
> > * Christoph Hellwig:
> > 
> > > On Thu, Nov 06, 2025 at 05:31:28PM +0100, Florian Weimer wrote:
> > >> It's been a few years, I think, and maybe we should drop the allocation
> > >> logic from posix_fallocate in glibc?  Assuming that it's implemented
> > >> everywhere it makes sense?
> > >
> > > I really think it should go away.  If it turns out we find cases where
> > > it was useful we can try to implement a zeroing fallocate in the kernel
> > > for the file system where people want it.
> 
> This is what the shiny new FALLOC_FL_WRITE_ZEROS command is supposed
> to provide. We don't have widespread support in filesystems for it
> yet, though.

Not really.  FALLOC_FL_WRITE_ZEROS does hardware-offloaded zeroing.
I.e., it does the same thing as the just-write-zeroes approach of the
current glibc fallback and is just as bad for the same reasons.  It
also is something that doesn't make any sense to support in a write
out of place file system.

> Failing to check the return value of a library call that documents
> EOPNOTSUPP as a valid error is a bug. IOWs, the above code *should*
> SIGBUS on the mmap access, because it failed to verify that the file
> extension operation actually worked.
> 
> I mean, if this was "ftruncate(1); mmap(); *p =1" and ftruncate()
> failed and so SIGBUS was delivered, there would be no doubt that
> this is an application bug. Why should we treat errors returned
> by fallocate() and/or posix_fallocate() any differently here?

I think what Florian wants (although I might be misunderstanding him)
is an interface that will increase the file size up to the passed in
size, but never reduce it and lose data.

> > If we can get an fallocate mode that we can use as a fallback to
> > increase the file size with a zero flag argument, we can definitely
> 
> The fallocate() API already supports that, in two different ways:
> FALLOC_FL_ZERO_RANGE and FALLOC_FL_WRITE_ZEROS.

They are both quite different as they both zero the entire passed in
range, even if it already contains data, which is completely different
from the posix_fallocate or fallocate FALLOC_FL_ALLOCATE_RANGE semantics
that leave any existing data intact.
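
The difference can be observed from user space with a small sketch
like the following, assuming Linux with glibc.  The helper names
byte_after_fallocate() and report_mode() are hypothetical.  Mode 0 is
the default allocate-range behaviour; either call may fail with
EOPNOTSUPP on file systems that do not implement the mode.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define RANGE_LEN 4096

/*
 * Write the byte 'x' at offset 0, call fallocate() with `mode` over
 * the first RANGE_LEN bytes, and return the byte now stored at
 * offset 0.  Returns -1 if any step fails; errno is set by the
 * failing system call (for example EOPNOTSUPP from fallocate()).
 */
int
byte_after_fallocate(int fd, int mode)
{
  char c = 'x';

  assert(fd >= 0);
  if (pwrite(fd, &c, 1, 0) != 1)
    return -1;
  if (fallocate(fd, mode, 0, RANGE_LEN) != 0)
    return -1;
  if (pread(fd, &c, 1, 0) != 1)
    return -1;
  return (unsigned char) c;
}

/* Print what happened to the existing byte for one fallocate mode. */
void
report_mode(int fd, const char *name, int mode)
{
  int r = byte_after_fallocate(fd, mode);

  if (r < 0)
    printf("%s: failed: %s\n", name, strerror(errno));
  else
    printf("%s: byte 0 is now 0x%02x\n", name, r);
}
```

Calling report_mode(fd, "allocate", 0) and then
report_mode(fd, "zero range", FALLOC_FL_ZERO_RANGE) on the same file
shows whether the previously written byte survives each mode.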


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-10  5:27               ` Florian Weimer
@ 2025-11-10  9:38                 ` Christoph Hellwig
  2025-11-10 10:03                   ` Florian Weimer
  2025-11-10 20:28                 ` Dave Chinner
  1 sibling, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2025-11-10  9:38 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Dave Chinner, Christoph Hellwig, Matthew Wilcox, Hans Holmberg,
	linux-xfs, Carlos Maiolino, Darrick J . Wong, linux-fsdevel,
	linux-kernel, libc-alpha

On Mon, Nov 10, 2025 at 06:27:41AM +0100, Florian Weimer wrote:
> Sorry, I made the example confusing.
> 
> How would the application deal with failure due to lack of fallocate
> support?  It would have to do a pwrite, like posix_fallocate does
> today, or maybe ftruncate.  This is why I think removing the fallback
> from posix_fallocate completely is mostly pointless.

In general it would ftruncate.  If it thinks it can't work without
preallocation at all the application will fail, as again the lack
of posix_fallocate means that space can't be preallocated.
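
A rough sketch of such an application-side fallback, assuming Linux
with glibc and an application that only needs the file size rather
than reserved blocks.  The names ensure_file_size() and
demo_final_size() are hypothetical.  The fstat()/ftruncate() pair is
not atomic, so a concurrent writer extending the file in between could
still lose data.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Make sure fd is at least `size` bytes long without shrinking it.
 * Tries fallocate(2) first; if that reports EOPNOTSUPP, falls back to
 * ftruncate(2), but only when the file is currently shorter than
 * `size`.  Returns 0 on success or an error number.
 */
int
ensure_file_size(int fd, off_t size)
{
  assert(size > 0);

  if (fallocate(fd, 0, 0, size) == 0)
    return 0;
  if (errno != EOPNOTSUPP)
    return errno;

  struct stat st;
  if (fstat(fd, &st) != 0)
    return errno;
  if (st.st_size >= size)
    return 0;                   /* already large enough, leave it alone */
  if (ftruncate(fd, size) != 0)
    return errno;
  return 0;
}

/*
 * Demo: grow a temporary file to 8192 bytes, then ask for 4096.
 * Returns the final file size, or -1 on any failure.
 */
off_t
demo_final_size(void)
{
  FILE *fp = tmpfile();
  if (fp == NULL)
    return -1;

  int fd = fileno(fp);
  struct stat st;
  off_t result = -1;

  if (ensure_file_size(fd, 8192) == 0 &&
      ensure_file_size(fd, 4096) == 0 &&
      fstat(fd, &st) == 0)
    result = st.st_size;

  fclose(fp);
  return result;
}
```

The second, smaller request leaves the file at its larger size, which
is the "never reduce" property discussed in this thread; the fallback
path provides it only in the absence of concurrent size changes.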


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-10  9:37               ` Christoph Hellwig
@ 2025-11-10  9:44                 ` Florian Weimer
  2025-11-10 21:33                 ` Dave Chinner
  1 sibling, 0 replies; 28+ messages in thread
From: Florian Weimer @ 2025-11-10  9:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Matthew Wilcox, Hans Holmberg, linux-xfs,
	Carlos Maiolino, Darrick J . Wong, linux-fsdevel, linux-kernel,
	libc-alpha

* Christoph Hellwig:

> I think what Florian wants (although I might be misunderstanding him)
> is an interface that will increase the file size up to the passed in
> size, but never reduce it and lose data.

Exactly.  Thank you for the succinct summary.

Florian


^ permalink raw reply	[flat|nested] 28+ messages in thread

* truncatat? was, Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-10  9:31             ` Christoph Hellwig
@ 2025-11-10  9:48               ` Christoph Hellwig
  2025-11-10 10:00                 ` Florian Weimer
  2025-11-10  9:49               ` Florian Weimer
  1 sibling, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2025-11-10  9:48 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Christoph Hellwig, Florian Weimer, Matthew Wilcox, Hans Holmberg,
	linux-xfs, Carlos Maiolino, Dave Chinner, Darrick J . Wong,
	linux-fsdevel, linux-kernel, linux-api, libc-alpha

On Mon, Nov 10, 2025 at 10:31:40AM +0100, Christoph Hellwig wrote:
> fallocate seems like an odd interface choice for that, but given that
> (f)truncate doesn't have a flags argument that might still be the
> least unexpected version.
> 
> > Maybe add two flags, one for the ftruncate replacement, and one that
> > instructs the file system that the range will be used with mmap soon?
> > I expect this could be useful information to the file system.  We
> > wouldn't use it in posix_fallocate, but applications calling fallocate
> > directly might.
> 
> What do you think "to be used with mmap" flag could be useful for
> in the file system?  For file systems mmap I/O isn't very different
> from other use cases.

The usual way to pass extra flags was the flags argument of the *at syscalls.
truncate doesn't have that, and I wonder if there would be uses for
that?  Because if so that feels like the right way to add that feature.
OTOH a quick internet search only pointed to a single question about it,
which was related to other confusion in the use of (f)truncate.

While adding a new system call can be rather cumbersome, the advantage
would be that we could implement the "only increase file size" flag
in common code and it would work on all file systems for kernels that
support the system call.
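
For comparison, a minimal sketch of what userspace has to do today to get
"only increase file size" behaviour, assuming plain fstat() and ftruncate().
The helper name grow_only_truncate is hypothetical and not an existing libc
or kernel interface.  The two calls are not atomic, which is the gap a
kernel-side flag would close.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: extend the file to at least new_size, but never
 * shrink it.  Returns 0 on success, -1 with errno set on failure.
 *
 * The fstat()/ftruncate() pair is racy: if another writer extends the
 * file past new_size between the two calls, ftruncate() cuts that data
 * off again.
 */
static int grow_only_truncate(int fd, off_t new_size)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return -1;
	if (st.st_size >= new_size)
		return 0;	/* already large enough, leave the data alone */
	return ftruncate(fd, new_size);
}
```

Calling grow_only_truncate(fd, 4096) on an empty file extends it to 4096
bytes; a later grow_only_truncate(fd, 100) leaves the size unchanged.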


* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-10  9:31             ` Christoph Hellwig
  2025-11-10  9:48               ` truncatat? was, " Christoph Hellwig
@ 2025-11-10  9:49               ` Florian Weimer
  2025-11-10  9:52                 ` Christoph Hellwig
  1 sibling, 1 reply; 28+ messages in thread
From: Florian Weimer @ 2025-11-10  9:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, Hans Holmberg, linux-xfs, Carlos Maiolino,
	Dave Chinner, Darrick J . Wong, linux-fsdevel, linux-kernel,
	libc-alpha

* Christoph Hellwig:

>> Maybe add two flags, one for the ftruncate replacement, and one that
>> instructs the file system that the range will be used with mmap soon?
>> I expect this could be useful information to the file system.  We
>> wouldn't use it in posix_fallocate, but applications calling fallocate
>> directly might.
>
> What do you think "to be used with mmap" flag could be useful for
> in the file system?  For file systems mmap I/O isn't very different
> from other use cases.

I'm not a file system developer. 8-)

The original concern was about a large file download tool that didn't
download in sequence.  It wrote to a memory mapping directly, in
somewhat random order.  And was observed to cause truly bad
fragmentation in practice.  Maybe this is something for posix_fadvise.

Thanks,
Florian



* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-10  9:49               ` Florian Weimer
@ 2025-11-10  9:52                 ` Christoph Hellwig
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2025-11-10  9:52 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Christoph Hellwig, Matthew Wilcox, Hans Holmberg, linux-xfs,
	Carlos Maiolino, Dave Chinner, Darrick J . Wong, linux-fsdevel,
	linux-kernel, libc-alpha

On Mon, Nov 10, 2025 at 10:49:04AM +0100, Florian Weimer wrote:
> >> Maybe add two flags, one for the ftruncate replacement, and one that
> >> instructs the file system that the range will be used with mmap soon?
> >> I expect this could be useful information to the file system.  We
> >> wouldn't use it in posix_fallocate, but applications calling fallocate
> >> directly might.
> >
> > What do you think "to be used with mmap" flag could be useful for
> > in the file system?  For file systems mmap I/O isn't very different
> > from other use cases.
> 
> I'm not a file system developer. 8-)
> 
> The original concern was about a large file download tool that didn't
> download in sequence.  It wrote to a memory mapping directly, in
> somewhat random order.  And was observed to cause truly bad
> fragmentation in practice.  Maybe this is something for posix_fadvise.

In general smart allocators (both the classic XFS allocator, and the
zoned one we're talking about here) take the file offset into account
when allocating blocks.  Additionally the VM writeback code usually
avoids writing back out of order unless writeback is forced by an
f(data)sync or memory pressure.  So it should not be needed here,
although I won't hold my hand into the fire that fallocate won't help
with simpler allocators or really degenerate I/O patterns, but
there is nothing mmap-specific about that.

> 
> Thanks,
> Florian
---end quoted text---


* Re: truncatat? was, Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-10  9:48               ` truncatat? was, " Christoph Hellwig
@ 2025-11-10 10:00                 ` Florian Weimer
  0 siblings, 0 replies; 28+ messages in thread
From: Florian Weimer @ 2025-11-10 10:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, Hans Holmberg, linux-xfs, Carlos Maiolino,
	Dave Chinner, Darrick J . Wong, linux-fsdevel, linux-kernel,
	linux-api, libc-alpha

* Christoph Hellwig:

> On Mon, Nov 10, 2025 at 10:31:40AM +0100, Christoph Hellwig wrote:
>> fallocate seems like an odd interface choice for that, but given that
>> (f)truncate doesn't have a flags argument that might still be the
>> least unexpected version.
>> 
>> > Maybe add two flags, one for the ftruncate replacement, and one that
>> > instructs the file system that the range will be used with mmap soon?
>> > I expect this could be useful information to the file system.  We
>> > wouldn't use it in posix_fallocate, but applications calling fallocate
>> > directly might.
>> 
>> What do you think "to be used with mmap" flag could be useful for
>> in the file system?  For file systems mmap I/O isn't very different
>> from other use cases.
>
> The usual way to pass extra flags was the flags argument of the *at syscalls.
> truncate doesn't have that, and I wonder if there would be uses for
> that?  Because if so that feels like the right way to add that feature.
> OTOH a quick internet search only pointed to a single question about it,
> which was related to other confusion in the use of (f)truncate.
>
> While adding a new system call can be rather cumbersome, the advantage
> would be that we could implement the "only increase file size" flag
> in common code and it would work on all file systems for kernels that
> support the system call.

There are some references to ftruncateat:

  <https://codesearch.debian.net/search?q=ftruncateat&literal=1>

I don't have a particularly strong opinion on the choice of interface.
I can't find anything in the Austin Group tracker that suggests that
they are considering standardizing ftruncateat without a flags argument.

Thanks,
Florian



* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-10  9:38                 ` Christoph Hellwig
@ 2025-11-10 10:03                   ` Florian Weimer
  0 siblings, 0 replies; 28+ messages in thread
From: Florian Weimer @ 2025-11-10 10:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Matthew Wilcox, Hans Holmberg, linux-xfs,
	Carlos Maiolino, Darrick J . Wong, linux-fsdevel, linux-kernel,
	libc-alpha

* Christoph Hellwig:

> On Mon, Nov 10, 2025 at 06:27:41AM +0100, Florian Weimer wrote:
>> Sorry, I made the example confusing.
>> 
>> How would the application deal with failure due to lack of fallocate
>> support?  It would have to do a pwrite, like posix_fallocate does
>> today, or maybe ftruncate.  This is why I think removing the fallback
>> from posix_fallocate completely is mostly pointless.
>
> In general it would ftruncate.  If it thinks it can't work without
> preallocation at all the application will fail, as again the lack
> of posix_fallocate means that space can't be preallocated.

Hmm.  It's not a 1:1 replacement: someone really needs to understand the
code and see what the appropriate way to deal with the situation is.  Of
course the posix_fallocate fallback path (or an application-level
equivalent) has the potential for data loss, too.  It's just a different
trade-off.
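
One possible shape of such application-level handling, as a sketch only: it
assumes the application needs the file size for mmap and can live without an
ENOSPC guarantee.  The helper name ensure_size_for_mmap is hypothetical.  It
calls fallocate(2) directly, so the glibc write fallback never runs, and
settles for a sparse, grow-only ftruncate when preallocation is unsupported.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: make sure the file is at least `size` bytes long
 * (size must be > 0).  Returns 0 on success, -1 with errno set on failure.
 */
static int ensure_size_for_mmap(int fd, off_t size)
{
	struct stat st;

	/* Mode 0 allocates the range and extends the file if needed. */
	if (fallocate(fd, 0, 0, size) == 0)
		return 0;
	if (errno != EOPNOTSUPP)
		return -1;	/* real failure, e.g. ENOSPC or EBADF */

	/*
	 * No preallocation on this file system.  Extend the size only,
	 * without ever shrinking.  Later writes may still hit ENOSPC, and
	 * the fstat()/ftruncate() pair is racy against concurrent extenders.
	 */
	if (fstat(fd, &st) < 0)
		return -1;
	if (st.st_size >= size)
		return 0;
	return ftruncate(fd, size);
}
```

An application that cannot tolerate ENOSPC on later writes would instead
treat EOPNOTSUPP as fatal, which is the other half of the trade-off.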

Thanks,
Florian



* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-10  5:27               ` Florian Weimer
  2025-11-10  9:38                 ` Christoph Hellwig
@ 2025-11-10 20:28                 ` Dave Chinner
  2025-11-11  8:56                   ` Christoph Hellwig
  1 sibling, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2025-11-10 20:28 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Christoph Hellwig, Matthew Wilcox, Hans Holmberg, linux-xfs,
	Carlos Maiolino, Darrick J . Wong, linux-fsdevel, linux-kernel,
	libc-alpha

On Mon, Nov 10, 2025 at 06:27:41AM +0100, Florian Weimer wrote:
> * Dave Chinner:
> 
> > On Sat, Nov 08, 2025 at 01:30:18PM +0100, Florian Weimer wrote:
> >> * Christoph Hellwig:
> >> 
> >> > On Thu, Nov 06, 2025 at 05:31:28PM +0100, Florian Weimer wrote:
> >> >> It's been a few years, I think, and maybe we should drop the allocation
> >> >> logic from posix_fallocate in glibc?  Assuming that it's implemented
> >> >> everywhere it makes sense?
> >> >
> >> > I really think it should go away.  If it turns out we find cases where
> >> > it was useful we can try to implement a zeroing fallocate in the kernel
> >> > for the file system where people want it.
> >
> > This is what the shiny new FALLOC_FL_WRITE_ZEROES command is supposed
> > to provide. We don't have widespread support in filesystems for it
> > yet, though.
> >
> >> > gfs2 for example currently
> >> > has such an implementation, and we could have somewhat generic library
> >> > version of it.
> >
> > Yup, seems like a iomap iter loop would be pretty trivial to
> > abstract from that...
> >
> >> Sorry, I remember now where this got stuck the last time.
> >> 
> >> This program:
> >> 
> >> #include <fcntl.h>
> >> #include <stddef.h>
> >> #include <stdio.h>
> >> #include <stdlib.h>
> >> #include <sys/mman.h>
> >> 
> >> int
> >> main(void)
> >> {
> >>   FILE *fp = tmpfile();
> >>   if (fp == NULL)
> >>     abort();
> >>   int fd = fileno(fp);
> >>   posix_fallocate(fd, 0, 1);
> >>   char *p = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> >>   *p = 1;
> >> }
> >> 
> >> should not crash even if the file system does not support fallocate.
> >
> > I think that's buggy application code.
> >
> > Failing to check the return value of a library call that documents
> > EOPNOTSUPP as a valid error is a bug. IOWs, the above code *should*
> > SIGBUS on the mmap access, because it failed to verify that the file
> > extension operation actually worked.
> 
> Sorry, I made the example confusing.
> 
> How would the application deal with failure due to lack of fallocate
> support?  It would have to do a pwrite, like posix_fallocate does
> today, or maybe ftruncate.  This is why I think removing the fallback
> from posix_fallocate completely is mostly pointless.
> 
> >> I hope we can agree on that.  I expect avoiding SIGBUS errors because
> >> of insufficient file size is a common use case for posix_fallocate.
> >> This use is not really an optimization, it's required to get mmap
> >> working properly.
> >> 
> >> If we can get an fallocate mode that we can use as a fallback to
> >> increase the file size with a zero flag argument, we can definitely
> >
> > The fallocate() API already supports that, in two different ways:
> > FALLOC_FL_ZERO_RANGE and FALLOC_FL_WRITE_ZEROES.
> 
> Neither is appropriate for posix_fallocate because they are as
> destructive as the existing fallback.

You suggested we should consider "implement a zeroing fallocate",
and I've simply pointed out that it already exists. That is simply:

	fallocate(fd, FALLOC_FL_WRITE_ZEROES, old_eof, new_eof - old_eof)

You didn't say that you wanted something that isn't potentially
destructive when a buggy application allows multiple file extension
operations to be performed concurrently.
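
Spelled out as a sketch, assuming old_eof is taken from fstat() and new_eof
is the caller's target size.  The helper name extend_with_written_zeroes is
hypothetical, and the fallback #define is an assumption for builds against
older headers that lack the flag.  Kernels or file systems without support
are expected to fail the call, typically with EOPNOTSUPP.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/falloc.h>

#ifndef FALLOC_FL_WRITE_ZEROES
/* Assumption: value as in recent uapi headers; older headers lack it. */
#define FALLOC_FL_WRITE_ZEROES 0x80
#endif

/*
 * Hypothetical helper: extend the file from its current EOF to new_eof
 * with written zeroes.  Only the range beyond the old EOF is touched, so
 * existing data survives unless another extender races with this call.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int extend_with_written_zeroes(int fd, off_t new_eof)
{
	struct stat st;
	off_t old_eof;

	if (fstat(fd, &st) < 0)
		return -1;
	old_eof = st.st_size;
	if (new_eof <= old_eof)
		return 0;	/* nothing to extend */
	return fallocate(fd, FALLOC_FL_WRITE_ZEROES, old_eof,
			 new_eof - old_eof);
}
```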

> > You aren't going to get support for such new commands on existing
> > kernels, so userspace is still going to have to code the ftruncate()
> > fallback itself for the desired behaviour to be provided
> > consistently to applications.
> >
> > As such, I don't see any reason for the fallocate() syscall
> > providing some whacky "ftruncate() in all but name" mode.
> 
> Please reconsider.  If we start fixing this, we'll eventually be in a
> position where the glibc fallback code never runs.

Providing non-destructive, "truncate up only" file extension
semantics through fallocate() is exactly what
FALLOC_FL_ALLOCATE_RANGE provides.

Oh, wait, we started down this path because the "fake" success patch
didn't implement the correct ALLOCATE_RANGE semantics. i.e. the
proposed patch is buggy because it doesn't implement the externally
visible file size change semantics of a successful operation.

IOWs, there is no need for a new API here - just for filesystems to
correctly implement the file extension semantics of
FALLOC_FL_ALLOCATE_RANGE if they are going to return success without
having performed physical allocation.

IOWs, I have no problems with COW filesystems not doing
preallocation, but if they are going to return success they still
need to perform all the non-allocation parts of fallocate()
operations correctly.

Again, I don't see a need for a new API here to provide
non-destructive "truncate up only" semantics as we already have
those semantics built into the ALLOCATE_RANGE operation...
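
As a sketch of the externally visible contract in question, the
non-allocation parts of a successful default-mode fallocate() can be checked
from userspace.  The helper name allocate_range_checked is hypothetical, the
fallback #define is an assumption for older headers that do not name the
default mode, and using EIO to report a broken contract is an arbitrary
choice for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef FALLOC_FL_ALLOCATE_RANGE
/* Assumption: the default mode is 0; newer headers give it this name. */
#define FALLOC_FL_ALLOCATE_RANGE 0x00
#endif

/*
 * Hypothetical checker: run fallocate() in the default mode and verify
 * that success also made the file cover [offset, offset + len).
 * Returns 0 on success, -1 with errno set on failure.
 */
static int allocate_range_checked(int fd, off_t offset, off_t len)
{
	struct stat st;

	if (fallocate(fd, FALLOC_FL_ALLOCATE_RANGE, offset, len) < 0)
		return -1;	/* e.g. EOPNOTSUPP or ENOSPC */
	if (fstat(fd, &st) < 0)
		return -1;
	if (st.st_size < offset + len) {
		/* Success was reported but the file was not extended. */
		errno = EIO;
		return -1;
	}
	return 0;
}
```

Data already present in the range is expected to stay intact across the
call, which is what separates this mode from the zeroing ones.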

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-10  9:37               ` Christoph Hellwig
  2025-11-10  9:44                 ` Florian Weimer
@ 2025-11-10 21:33                 ` Dave Chinner
  2025-11-11  9:04                   ` Christoph Hellwig
  2025-11-11  9:30                   ` Florian Weimer
  1 sibling, 2 replies; 28+ messages in thread
From: Dave Chinner @ 2025-11-10 21:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Florian Weimer, Florian Weimer, Matthew Wilcox, Hans Holmberg,
	linux-xfs, Carlos Maiolino, Darrick J . Wong, linux-fsdevel,
	linux-kernel, libc-alpha

On Mon, Nov 10, 2025 at 10:37:01AM +0100, Christoph Hellwig wrote:
> On Mon, Nov 10, 2025 at 09:15:50AM +1100, Dave Chinner wrote:
> > On Sat, Nov 08, 2025 at 01:30:18PM +0100, Florian Weimer wrote:
> > > * Christoph Hellwig:
> > > 
> > > > On Thu, Nov 06, 2025 at 05:31:28PM +0100, Florian Weimer wrote:
> > > >> It's been a few years, I think, and maybe we should drop the allocation
> > > >> logic from posix_fallocate in glibc?  Assuming that it's implemented
> > > >> everywhere it makes sense?
> > > >
> > > > I really think it should go away.  If it turns out we find cases where
> > > > it was useful we can try to implement a zeroing fallocate in the kernel
> > > > for the file system where people want it.
> > 
> > > This is what the shiny new FALLOC_FL_WRITE_ZEROES command is supposed
> > > to provide. We don't have widespread support in filesystems for it
> > yet, though.
> 
> Not really.  FALLOC_FL_WRITE_ZEROES does hardware-offloaded zeroing.

That is not required functionality - it is an implementation
optimisation.

WRITE_ZEROES requires that the subsequent write must not need to
perform filesystem metadata updates to guarantee data integrity.
How the filesystem implements that is up to the filesystem....

> I.e., it does the same thing as the just-write-zeroes approach of the
> current glibc fallback and is just as bad for the same reasons.

No, it is not like the current glibc posix_fallocate() fallback.
That is a compatibility slow-path, not an IO path performance
optimisation.

i.e. WRITE_ZEROES is for applications that overwrite in place and
are very sensitive to IO latency.  The zeroing is done
in a context that is not performance sensitive, and it results in
much lower long tail latencies in the performance sensitive IO
paths.

WRITE_ZEROES is a more efficient way of running
FALLOC_FL_ALLOCATE_RANGE and then writing zeroes to convert the range
from unwritten to written extents because it allows the kernel to
use hardware offloads if they are available.

Applications that need pure overwrite behaviour are not going to be
using COW files or storage that requires always-COW IO paths in the
filesystems (e.g. on zoned storage hardware).

Hence we just don't care that:

> It
> also is something that doesn't make any sense to support in a write
> out of place file system.

... COW files cannot support WRITE_ZEROES functionality because
optimisations for overwrite-in-place aren't valid for COW-based
IO...

> > Failing to check the return value of a library call that documents
> > EOPNOTSUPP as a valid error is a bug. IOWs, the above code *should*
> > SIGBUS on the mmap access, because it failed to verify that the file
> > extension operation actually worked.
> > 
> > I mean, if this was "ftruncate(1); mmap(); *p =1" and ftruncate()
> > failed and so SIGBUS was delivered, there would be no doubt that
> > this is an application bug. Why should we treat errors returned
> > by fallocate() and/or posix_fallocate() any different here?
> 
> I think what Florian wants (although I might be misunderstanding him)
> is an interface that will increase the file size up to the passed in
> size, but never reduce it and lose data.

Ah, that's not a "zeroing fallocate()" like was suggested. These are
the existing FALLOC_FL_ALLOCATE_RANGE file extension semantics.

AFAICT, this is exactly what the proposed patch implements - it
short circuits the bit we can't guarantee (ENOSPC prevention via
preallocation) but retains all the other aspects (non-destructive
truncate up) when it returns success.

I don't see how a glibc posix_fallocate() fallback that does a
non-destructive truncate up through some new interface is any better
than just having the filesystem implement ALLOCATE_RANGE without the
ENOSPC guarantees in the first place?

> > > If we can get an fallocate mode that we can use as a fallback to
> > > increase the file size with a zero flag argument, we can definitely
> > 
> > > The fallocate() API already supports that, in two different ways:
> > > FALLOC_FL_ZERO_RANGE and FALLOC_FL_WRITE_ZEROES.
> 
> They are both quite different as they both zero the entire passed in
> range, even if it already contains data, which is completely different
> from the posix_fallocate or fallocate FALLOC_FL_ALLOCATE_RANGE semantics
> that leave any existing data intact.

Yes. However:

	fallocate(fd, FALLOC_FL_WRITE_ZEROES, old_eof, new_eof - old_eof);

is exactly the "zeroing truncate up" operation that was being
suggested. It will not overwrite any existing data, except if the
application is racing other file extension operations with this one.
In which case, the application is buggy, not the fallocate() code.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-06 14:46       ` Christoph Hellwig
@ 2025-11-11  8:31         ` Hans Holmberg
  2025-11-11  9:05           ` hch
  0 siblings, 1 reply; 28+ messages in thread
From: Hans Holmberg @ 2025-11-11  8:31 UTC (permalink / raw)
  To: hch, Florian Weimer
  Cc: linux-xfs@vger.kernel.org, Carlos Maiolino, Dave Chinner,
	Darrick J . Wong, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, libc-alpha@sourceware.org,
	Matthew Wilcox

On 06/11/2025 15:46, Christoph Hellwig wrote:
> On Thu, Nov 06, 2025 at 02:42:30PM +0000, Matthew Wilcox wrote:
>> On Thu, Nov 06, 2025 at 02:52:12PM +0100, Christoph Hellwig wrote:
>>> On Thu, Nov 06, 2025 at 02:48:12PM +0100, Florian Weimer wrote:
>>>> * Hans Holmberg:
>>>>
>>>>> We don't support preallocations for CoW inodes and we currently fail
>>>>> with -EOPNOTSUPP, but this causes an issue for users of glibc's
>>>>> posix_fallocate[1]. If fallocate fails, posix_fallocate falls back on
>>>>> writing actual data into the range to try to allocate blocks that way.
>>>>> That does not actually guarantee anything for CoW inodes however as we
>>>>> write out of place.
>>>> Why doesn't fallocate trigger the copy instead?  Isn't this what the
>>>> user is requesting?
>>> What copy?
>> I believe Florian is thinking of CoW in the sense of "share while read
>> only, then you have a mutable block allocation", rather than the
>> WAFL (or SMR) sense of "we always put writes in a new location".
> Note that the glibc posix_fallocate(3) fallback will never copy anyway.
> It does a racy and somewhat broken check whether there is already
> data, and if it thinks there isn't it writes zeroes.  Which is the
> wrong thing for just about every use case imaginable.  And the only
> thing to stop it from doing that is to implement fallocate(2) and
> return success.

Instead of returning success in fallocate(2), could we instead return
a distinct error code that would tell the caller that:

The optimized allocation is not supported, AND there is no use trying to
preallocate data using writes?

EUSELESS would be nice to have, but that is not available.

Then posix_fallocate could fail with -EINVAL (which looks legit according
to the man page "the underlying filesystem does not support the operation")
or skip the writes and return success (whatever is preferable)



* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-10 20:28                 ` Dave Chinner
@ 2025-11-11  8:56                   ` Christoph Hellwig
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2025-11-11  8:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Florian Weimer, Christoph Hellwig, Matthew Wilcox, Hans Holmberg,
	linux-xfs, Carlos Maiolino, Darrick J . Wong, linux-fsdevel,
	linux-kernel, libc-alpha

On Tue, Nov 11, 2025 at 07:28:20AM +1100, Dave Chinner wrote:
> IOWs, I have no problems with COW filesystems not doing
> preallocation, but if they are going to return success they still
> need to perform all the non-allocation parts of fallocate()
> operations correctly.
> 
> Again, I don't see a need for a new API here to provide
> non-destructive "truncate up only" semantics as we already have
> those semantics built into the ALLOCATE_RANGE operation...

The problem is that it loses the ability of an intelligent application using
the low-level Linux API to probe what is there.  That might not be a
major issue, but it is an issue at least.


* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-10 21:33                 ` Dave Chinner
@ 2025-11-11  9:04                   ` Christoph Hellwig
  2025-11-11  9:30                   ` Florian Weimer
  1 sibling, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2025-11-11  9:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Florian Weimer, Florian Weimer, Matthew Wilcox,
	Hans Holmberg, linux-xfs, Carlos Maiolino, Darrick J . Wong,
	linux-fsdevel, linux-kernel, libc-alpha

On Tue, Nov 11, 2025 at 08:33:34AM +1100, Dave Chinner wrote:
> > Not really.  FALLOC_FL_WRITE_ZEROES does hardware-offloaded zeroing.
> 
> That is not required functionality - it is an implementation
> optimisation.

It's also the reason why it exists.

> WRITE_ZEROES requires that the subsequent write must not need to
> perform filesystem metadata updates to guarantee data integrity.
> How the filesystem implements that is up to the filesystem....

No, it can't require that.  But it is optimizing for that.

> > I think what Florian wants (although I might be misunderstanding him)
> > is an interface that will increase the file size up to the passed in
> > size, but never reduce it and lose data.
> 
> Ah, that's not a "zeroing fallocate()" like was suggested. These are
> the existing FALLOC_FL_ALLOCATE_RANGE file extension semantics.

Yes, just without allocating.

> AFAICT, this is exactly what the proposed patch implements - it
> short circuits the bit we can't guarantee (ENOSPC prevention via
> preallocation) but retains all the other aspects (non-destructive
> truncate up) when it returns success.

Yes.

> I don't see how a glibc posix_fallocate() fallback that does a
> non-destructive truncate up through some new interface is any better
> than just having the filesystem implement ALLOCATE_RANGE without the
> ENOSPC guarantees in the first place?

For one because applications specifically probing the low-level Linux
system call will find out what is supported or not.  And Linux fallocate
has always failed when not supporting the exact semantics, while
posix_fallocate in glibc always had a (fairly broken) fallback and thus
applications can somewhat reasonably expect it to not fail.

> > They are both quite different as they both zero the entire passed in
> > range, even if it already contains data, which is completely different
> > from the posix_fallocate or fallocate FALLOC_FL_ALLOCATE_RANGE semantics
> > that leave any existing data intact.
> 
> Yes. However:
> 
> 	fallocate(fd, FALLOC_FL_WRITE_ZEROES, old_eof, new_eof - old_eof);
> 
> is exactly the "zeroing truncate up" operation that was being
> suggested. It will not overwrite any existing data, except if the
> application is racing other file extension operations with this one.

FALLOC_FL_WRITE_ZEROES is defined to zero the entire range.
FALLOC_FL_ALLOCATE_RANGE or a truncate up do not zero existing data.



* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-11  8:31         ` Hans Holmberg
@ 2025-11-11  9:05           ` hch
  2025-11-11  9:50             ` Florian Weimer
  0 siblings, 1 reply; 28+ messages in thread
From: hch @ 2025-11-11  9:05 UTC (permalink / raw)
  To: Hans Holmberg
  Cc: hch, Florian Weimer, linux-xfs@vger.kernel.org, Carlos Maiolino,
	Dave Chinner, Darrick J . Wong, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, libc-alpha@sourceware.org,
	Matthew Wilcox

On Tue, Nov 11, 2025 at 08:31:30AM +0000, Hans Holmberg wrote:
> Instead of returning success in fallocate(2), could we instead return
> a distinct error code that would tell the caller that:
> 
> The optimized allocation is not supported, AND there is no use trying to
> preallocate data using writes?
> 
> EUSELESS would be nice to have, but that is not available.
> 
> Then posix_fallocate could fail with -EINVAL (which looks legit according
> to the man page "the underlying filesystem does not support the operation")
> or skip the writes and return success (whatever is preferable)

The problem is that the existing direct callers of fallocate(2),
including all currently released glibc versions, do not expect that
return value.


* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-10 21:33                 ` Dave Chinner
  2025-11-11  9:04                   ` Christoph Hellwig
@ 2025-11-11  9:30                   ` Florian Weimer
  1 sibling, 0 replies; 28+ messages in thread
From: Florian Weimer @ 2025-11-11  9:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Matthew Wilcox, Hans Holmberg, linux-xfs,
	Carlos Maiolino, Darrick J . Wong, linux-fsdevel, linux-kernel,
	libc-alpha

* Dave Chinner:

> I don't see how a glibc posix_fallocate() fallback that does a
> non-destructive truncate up through some new interface is any better
> than just having the filesystem implement ALLOCATE_RANGE without the
> ENOSPC guarantees in the first place?

It's better because you don't have to get consensus among all file
system developers that implementing ALLOCATE_RANGE as a non-destructive
truncate is acceptable.  Even if it means that future writes to the range
can fail with ENOSPC, contrary to what POSIX requires for
posix_fallocate.

Thanks,
Florian



* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-11  9:05           ` hch
@ 2025-11-11  9:50             ` Florian Weimer
  2025-11-11 13:40               ` hch
  0 siblings, 1 reply; 28+ messages in thread
From: Florian Weimer @ 2025-11-11  9:50 UTC (permalink / raw)
  To: hch
  Cc: Hans Holmberg, linux-xfs@vger.kernel.org, Carlos Maiolino,
	Dave Chinner, Darrick J . Wong, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, libc-alpha@sourceware.org,
	Matthew Wilcox

> On Tue, Nov 11, 2025 at 08:31:30AM +0000, Hans Holmberg wrote:
>> Instead of returning success in fallocate(2), could we instead return
>> a distinct error code that would tell the caller that:
>> 
>> The optimized allocation is not supported, AND there is no use trying to
>> preallocate data using writes?
>> 
>> EUSELESS would be nice to have, but that is not available.
>> 
>> Then posix_fallocate could fail with -EINVAL (which looks legit according
>> to the man page "the underlying filesystem does not support the operation")
>> or skip the writes and return success (whatever is preferable)
>
> The problem is that the existing direct callers of fallocate(2),
> including all currently released glibc versions, do not expect that
> return value.

That could be covered by putting a flag into the mode argument of
fallocate that triggers the new behavior.

Thanks,
Florian



* Re: [RFC] xfs: fake fallocate success for always CoW inodes
  2025-11-11  9:50             ` Florian Weimer
@ 2025-11-11 13:40               ` hch
  0 siblings, 0 replies; 28+ messages in thread
From: hch @ 2025-11-11 13:40 UTC (permalink / raw)
  To: Florian Weimer
  Cc: hch, Hans Holmberg, linux-xfs@vger.kernel.org, Carlos Maiolino,
	Dave Chinner, Darrick J . Wong, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, libc-alpha@sourceware.org,
	Matthew Wilcox

On Tue, Nov 11, 2025 at 10:50:13AM +0100, Florian Weimer wrote:
> > On Tue, Nov 11, 2025 at 08:31:30AM +0000, Hans Holmberg wrote:
> >> Instead of returning success in fallocate(2), could we instead return
> >> a distinct error code that would tell the caller that:
> >> 
> >> The optimized allocation is not supported, AND there is no use trying to
> >> preallocate data using writes?
> >> 
> >> EUSELESS would be nice to have, but that is not available.
> >> 
> >> Then posix_fallocate could fail with -EINVAL (which looks legit according
> >> to the man page "the underlying filesystem does not support the operation")
> >> or skip the writes and return success (whatever is preferable)
> >
> > The problem is that the existing direct callers of fallocate(2),
> > including all currently released glibc versions, do not expect that
> > return value.
> 
> That could be covered by putting a flag into the mode argument of
> fallocate that triggers the new behavior.

Which basically makes it a new mode, just encoded as a flag for all
purposes ;-)




Thread overview: 28+ messages
2025-11-06 13:35 [RFC] xfs: fake fallocate success for always CoW inodes Hans Holmberg
2025-11-06 13:48 ` Florian Weimer
2025-11-06 13:52   ` Christoph Hellwig
2025-11-06 14:42     ` Matthew Wilcox
2025-11-06 14:46       ` Christoph Hellwig
2025-11-11  8:31         ` Hans Holmberg
2025-11-11  9:05           ` hch
2025-11-11  9:50             ` Florian Weimer
2025-11-11 13:40               ` hch
2025-11-06 16:31       ` Florian Weimer
2025-11-06 17:05         ` Christoph Hellwig
2025-11-08 12:30           ` Florian Weimer
2025-11-09 22:15             ` Dave Chinner
2025-11-10  5:27               ` Florian Weimer
2025-11-10  9:38                 ` Christoph Hellwig
2025-11-10 10:03                   ` Florian Weimer
2025-11-10 20:28                 ` Dave Chinner
2025-11-11  8:56                   ` Christoph Hellwig
2025-11-10  9:37               ` Christoph Hellwig
2025-11-10  9:44                 ` Florian Weimer
2025-11-10 21:33                 ` Dave Chinner
2025-11-11  9:04                   ` Christoph Hellwig
2025-11-11  9:30                   ` Florian Weimer
2025-11-10  9:31             ` Christoph Hellwig
2025-11-10  9:48               ` truncatat? was, " Christoph Hellwig
2025-11-10 10:00                 ` Florian Weimer
2025-11-10  9:49               ` Florian Weimer
2025-11-10  9:52                 ` Christoph Hellwig
