posix_fallocate behavior in glibc

* posix_fallocate behavior in glibc
@ 2024-06-26  6:01 Christoph Hellwig
  2024-07-29 15:09 ` Christoph Hellwig
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2024-06-26  6:01 UTC (permalink / raw)
  To: libc-hacker, linux-fsdevel; +Cc: Trond Myklebust

Hi all,

Trond brought the glibc posix_fallocate behavior to my attention.

As a refresher, this is how Open Group defines posix_fallocate:

   The posix_fallocate() function shall ensure that any required storage
   for regular file data starting at offset and continuing for len bytes
   is allocated on the file system storage media. If posix_fallocate()
   returns successfully, subsequent writes to the specified file data
   shall not fail due to the lack of free space on the file system
   storage media.

The glibc implementation in sysdeps/posix/posix_fallocate.c, which is
also used by sysdeps/unix/sysv/linux/posix_fallocate.c as a fallback if
the fallocate syscall returns EOPNOTSUPP, is implemented by doing
single-byte writes at intervals of min(f.f_bsize, 4096).
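
For readers who have not seen the fallback, the pattern can be sketched
roughly as follows.  This is a simplified illustration and not the glibc
source: the helper names touch_byte and emulate_fallocate are made up,
and only the stride of min(f.f_bsize, 4096) is taken from the
description above.

```c
#include <errno.h>
#include <fcntl.h>      /* open() flags, for callers of this sketch */
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/vfs.h>    /* fstatfs() */
#include <unistd.h>

/* Hypothetical helper: rewrite the single byte at pos with its current
   value (zero past EOF) so that the file system has to back it. */
static int touch_byte(int fd, off_t pos)
{
    unsigned char c = 0;
    ssize_t n = pread(fd, &c, 1, pos);   /* reads 0 bytes past EOF, c stays 0 */

    if (n < 0)
        return errno;
    n = pwrite(fd, &c, 1, pos);
    if (n < 0)
        return errno;
    return n == 1 ? 0 : EIO;
}

/* Hypothetical sketch of the write-loop emulation.  Returns 0 or an
   error number, like posix_fallocate().  Not atomic: data written by
   another process between the pread and the pwrite can be overwritten. */
static int emulate_fallocate(int fd, off_t offset, off_t len)
{
    struct statfs f;

    if (offset < 0 || len <= 0)
        return EINVAL;
    if (fstatfs(fd, &f) != 0)
        return errno;

    /* Stride as described in the thread: min(f.f_bsize, 4096). */
    off_t step = (f.f_bsize > 0 && f.f_bsize < 4096) ? (off_t)f.f_bsize : 4096;
    off_t end = offset + len;

    for (off_t pos = offset; pos < end; pos += step) {
        int err = touch_byte(fd, pos);
        if (err != 0)
            return err;
    }
    /* Touch the last byte so the file size reaches offset + len. */
    return touch_byte(fd, end - 1);
}
```

The loop only proves that each touched block could be written once.  It
reserves nothing for later writes on a file system that places every
write at a new location.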

This assumes that writes to a file guarantee allocating space for future
writes.  Such an assumption is false for write-out-of-place file systems,
which have been around since at least the early 1990s but have become
a lot more common in the last decade.  Native Linux examples are
all file systems sitting on zoned devices, where this is required
behavior, but also the nilfs2 file system or the LFS mode in f2fs.
On top of that it is fairly common for storage systems exposing
network file system access.

How can we get rid of this glibc fallback that turns the implementations
non-conformant and increases write amplification for no good reason?

* Re: posix_fallocate behavior in glibc
  2024-06-26  6:01 posix_fallocate behavior in glibc Christoph Hellwig
@ 2024-07-29 15:09 ` Christoph Hellwig
  2024-07-29 15:11   ` Sam James
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2024-07-29 15:09 UTC (permalink / raw)
  To: libc-hacker, linux-fsdevel; +Cc: Trond Myklebust

Hi dear glibc maintainers,

any comments or ideas on how to get glibc out of the behavior of
making file systems non-conformant by adding a broken wrapper?

* Re: posix_fallocate behavior in glibc
  2024-07-29 15:09 ` Christoph Hellwig
@ 2024-07-29 15:11   ` Sam James
  0 siblings, 0 replies; 22+ messages in thread
From: Sam James @ 2024-07-29 15:11 UTC (permalink / raw)
  To: hch; +Cc: libc-hacker, linux-fsdevel, trondmy

Hi,

Please write to libc-alpha@.

thanks,
sam

* posix_fallocate behavior in glibc
@ 2024-07-29 16:09 Christoph Hellwig
  2024-07-29 17:23 ` Paul Eggert
  2024-07-29 17:57 ` Florian Weimer
  0 siblings, 2 replies; 22+ messages in thread
From: Christoph Hellwig @ 2024-07-29 16:09 UTC (permalink / raw)
  To: libc-alpha, linux-fsdevel; +Cc: Trond Myklebust

Hi glibc hackers,

Trond brought the glibc posix_fallocate behavior to my attention.

As a refresher, this is how Open Group defines posix_fallocate:

   The posix_fallocate() function shall ensure that any required storage
   for regular file data starting at offset and continuing for len bytes
   is allocated on the file system storage media. If posix_fallocate()
   returns successfully, subsequent writes to the specified file data
   shall not fail due to the lack of free space on the file system
   storage media.

The glibc implementation in sysdeps/posix/posix_fallocate.c, which is
also used by sysdeps/unix/sysv/linux/posix_fallocate.c as a fallback if
the fallocate syscall returns EOPNOTSUPP, is implemented by doing
single-byte writes at intervals of min(f.f_bsize, 4096).

This assumes that writes to a file guarantee allocating space for future
writes.  Such an assumption is false for write-out-of-place file systems,
which have been around since at least the early 1990s but have become
a lot more common in the last decade.  Native Linux examples are
all file systems sitting on zoned devices, where this is required
behavior, but also the nilfs2 file system or the LFS mode in f2fs.
On top of that it is fairly common for storage systems exposing
network file system access.

How can we get rid of this glibc fallback that turns the implementations
non-conformant and increases write amplification for no good reason?

* Re: posix_fallocate behavior in glibc
  2024-07-29 16:09 Christoph Hellwig
@ 2024-07-29 17:23 ` Paul Eggert
  2024-07-29 17:43   ` Christoph Hellwig
  2024-07-29 17:57 ` Florian Weimer
  1 sibling, 1 reply; 22+ messages in thread
From: Paul Eggert @ 2024-07-29 17:23 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Trond Myklebust, libc-alpha, linux-fsdevel

On 2024-07-29 09:09, Christoph Hellwig wrote:

> How can we get rid of this glibc fallback that turns the implementations
> non-conformant and increases write amplification for no good reason?

The simplest solution would be to remove the broken fallback in glibc. A 
more conservative possibility would be to use the fallback only on file 
system types that lack native fallocate and where the fallback is known 
to work (which are they?). Perhaps you could propose a patch along 
either line.
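
A minimal sketch of the more conservative option, assuming the decision
is keyed on the f_type value reported by fstatfs().  The helper names
are hypothetical, and the single entry in the allow-list is only a
placeholder: which file system types, if any, belong there is exactly
the open question.

```c
#include <linux/magic.h>   /* MINIX_SUPER_MAGIC */
#include <stddef.h>
#include <sys/vfs.h>       /* fstatfs() */

/* Placeholder allow-list of file system types on which the write-loop
   emulation is assumed to be acceptable.  Illustrative only. */
static const long emulation_ok[] = { MINIX_SUPER_MAGIC };

/* Returns 1 if f_type is on the allow-list, 0 otherwise. */
static int type_allows_emulation(long f_type)
{
    for (size_t i = 0; i < sizeof emulation_ok / sizeof emulation_ok[0]; i++)
        if (emulation_ok[i] == f_type)
            return 1;
    return 0;
}

/* Returns 1 if emulation may be used for fd, 0 if not, and -1 with
   errno set if the file system type could not be determined. */
static int fd_allows_emulation(int fd)
{
    struct statfs f;

    if (fstatfs(fd, &f) != 0)
        return -1;
    return type_allows_emulation((long)f.f_type);
}
```

With such a check the fallback would run only after the fallocate
syscall failed with EOPNOTSUPP and fd_allows_emulation() returned 1.
In every other case EOPNOTSUPP would be passed on to the caller.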

* Re: posix_fallocate behavior in glibc
  2024-07-29 17:23 ` Paul Eggert
@ 2024-07-29 17:43   ` Christoph Hellwig
  2024-07-29 17:54     ` Adhemerval Zanella Netto
       [not found]     ` <CAPBLoAf11hM0PLhqPG5gUyivU9U1manpOOhDWCPugUmWc1VVUw@mail.gmail.com>
  0 siblings, 2 replies; 22+ messages in thread
From: Christoph Hellwig @ 2024-07-29 17:43 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Christoph Hellwig, Trond Myklebust, libc-alpha, linux-fsdevel

Hi Paul,

thanks for the answer.  I don't have a current glibc assignment, so me
directly sending a patch is probably not productive.

I don't really know which file systems benefit from doing zeroing
operations - after all, this requires writing the data twice, which is
usually a bad idea unless offset by extremely suboptimal allocation
behavior for small allocations, which got fixed in most file systems
people actually use.  So candidates where it actually would be useful
might be things like hfsplus.  But these are often used on cheap
consumer media, where the double write will meaningfully cause
additional write and erase cycles, harming the device lifetime and
long-term performance.

Note that the kernel has a few implementations of fallocate that are
basically a slightly more optimized implementation of this pattern
(fat, gfs2), so some maintainers thought it useful at least for
some workloads and use cases.

* Re: posix_fallocate behavior in glibc
  2024-07-29 17:43   ` Christoph Hellwig
@ 2024-07-29 17:54     ` Adhemerval Zanella Netto
       [not found]     ` <CAPBLoAf11hM0PLhqPG5gUyivU9U1manpOOhDWCPugUmWc1VVUw@mail.gmail.com>
  1 sibling, 0 replies; 22+ messages in thread
From: Adhemerval Zanella Netto @ 2024-07-29 17:54 UTC (permalink / raw)
  To: Christoph Hellwig, Paul Eggert, Florian Weimer
  Cc: Trond Myklebust, libc-alpha, linux-fsdevel



On 29/07/24 14:43, Christoph Hellwig wrote:
> Hi Paul,
> 
> thanks for the answer.  I don't have a current glibc assignment, so me
> directly sending a patch is probably not productive.
> 
> I don't really know which file systems benefit from doing zeroing
> operations - after all, this requires writing the data twice, which is
> usually a bad idea unless offset by extremely suboptimal allocation
> behavior for small allocations, which got fixed in most file systems
> people actually use.  So candidates where it actually would be useful
> might be things like hfsplus.  But these are often used on cheap
> consumer media, where the double write will meaningfully cause
> additional write and erase cycles, harming the device lifetime and
> long-term performance.
> 
> Note that the kernel has a few implementations of fallocate that are
> basically a slightly more optimized implementation of this pattern
> (fat, gfs2), so some maintainers thought it useful at least for
> some workloads and use cases.

We already discussed this some years ago, when some bugs were marked
as WONTFIX:

* Bug 6865 - fallback posix_fallocate() implementation is racy
* Bug 18515 - posix_fallocate disastrous fallback behavior is no longer mandated by POSIX and should be fixed
* Bug 15661 - posix_fallocate fallback code buggy and dangerous

Florian even sent a patch to remove the posix_fallocate implementations [1],
which generated a long thread of potential pitfalls of the fallback
removal [2]. 

Florian and Carlos, has anything changed in this behavior?


[1] https://sourceware.org/legacy-ml/libc-alpha/2015-04/msg00309.html
[2] https://sourceware.org/legacy-ml/libc-alpha/2015-05/msg00058.html

* Re: posix_fallocate behavior in glibc
  2024-07-29 16:09 Christoph Hellwig
  2024-07-29 17:23 ` Paul Eggert
@ 2024-07-29 17:57 ` Florian Weimer
  2024-07-29 18:44   ` Christoph Hellwig
  1 sibling, 1 reply; 22+ messages in thread
From: Florian Weimer @ 2024-07-29 17:57 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: libc-alpha, linux-fsdevel, Trond Myklebust

* Christoph Hellwig:

> The glibc implementation in sysdeps/posix/posix_fallocate.c, which is
> also used by sysdeps/unix/sysv/linux/posix_fallocate.c as a fallback if
> the fallocate syscall returns EOPNOTSUPP, is implemented by doing
> single-byte writes at intervals of min(f.f_bsize, 4096).

> How can we get rid of this glibc fallback that turns the implementations
> non-conformant and increases write amplification for no good reason?

When does the kernel return EOPNOTSUPP these days?  We do not even do
the fallback for EPERM/ENOSYS, the errors that might be encountered in
containers.

Last time I looked at this I concluded that it does not make sense to
push this write loop from glibc to the applications.  That's what would
happen if we had a new version of posix_fallocate that didn't do those
writes.  We also updated the manual:

  Storage Allocation
  <https://sourceware.org/glibc/manual/latest/html_node/Storage-Allocation.html>

As mentioned, if an application doesn't want fallback behavior, it can
call fallocate directly.
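
A minimal sketch of what calling fallocate directly might look like in
an application, assuming Linux and glibc.  The helper name
reserve_native and its three-way return value are invented for this
illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* fallocate() is a Linux-specific extension */
#endif
#include <errno.h>
#include <fcntl.h>      /* fallocate(), open() */
#include <sys/types.h>

/* Hypothetical helper: reserve [offset, offset + len) with the native
   syscall only, with no user-space emulation.
   Returns 1 if space was reserved, 0 if the file system does not
   support preallocation, and -1 with errno set on any other failure. */
static int reserve_native(int fd, off_t offset, off_t len)
{
    if (fallocate(fd, 0, offset, len) == 0)
        return 1;
    if (errno == EOPNOTSUPP)
        return 0;       /* the caller decides how to proceed */
    return -1;
}
```

Unlike posix_fallocate, fallocate reports failure through errno, and
mode 0 extends the file size when the range reaches past end of file.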

Thanks,
Florian

* Re: posix_fallocate behavior in glibc
  2024-07-29 17:57 ` Florian Weimer
@ 2024-07-29 18:44   ` Christoph Hellwig
  2024-07-29 18:52     ` Florian Weimer
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2024-07-29 18:44 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Christoph Hellwig, libc-alpha, linux-fsdevel, Trond Myklebust

On Mon, Jul 29, 2024 at 07:57:54PM +0200, Florian Weimer wrote:
> When does the kernel return EOPNOTSUPP these days?

In common code whenever the file system does not implement the
fallocate file operation, and various file systems can also
return it from inside the method if the feature is not actually
supported for the particular file system or file it is called on.

> Last time I looked at this I concluded that it does not make sense to
> push this write loop from glibc to the applications.  That's what would
> happen if we had a new version of posix_fallocate that didn't do those
> writes.  We also updated the manual:

That assumes that the loop is the right thing to do for file systems not
supporting fallocate.  That is generally the wrong thing to do, and
spectacularly wrong for file systems that write out of place.

> As mentioned, if an application doesn't want fallback behavior, it can
> call fallocate directly.

The applications might not know about glibc/Linux implementation details
and expect posix_fallocate to either fail if it can't be supported or
actually give the guarantees it is supposed to provide, which this
"fallback" doesn't actually do for the not entirely uncommon case of a
file system that is writing out of place.

* Re: posix_fallocate behavior in glibc
       [not found]     ` <CAPBLoAf11hM0PLhqPG5gUyivU9U1manpOOhDWCPugUmWc1VVUw@mail.gmail.com>
@ 2024-07-29 18:45       ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2024-07-29 18:45 UTC (permalink / raw)
  To: Cristian Rodríguez
  Cc: Christoph Hellwig, Paul Eggert, Trond Myklebust, libc-alpha,
	linux-fsdevel

On Mon, Jul 29, 2024 at 02:40:00PM -0400, Cristian Rodríguez wrote:
> Do you mean glibc copyright assignment?  DCO is now ok.

Oh, I completely missed that.  Thanks for the heads-up.

* Re: posix_fallocate behavior in glibc
  2024-07-29 18:44   ` Christoph Hellwig
@ 2024-07-29 18:52     ` Florian Weimer
  2024-07-29 19:01       ` Christoph Hellwig
  2024-07-29 23:53       ` Dave Chinner
  0 siblings, 2 replies; 22+ messages in thread
From: Florian Weimer @ 2024-07-29 18:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: libc-alpha, linux-fsdevel, Trond Myklebust

* Christoph Hellwig:

> On Mon, Jul 29, 2024 at 07:57:54PM +0200, Florian Weimer wrote:
>> When does the kernel return EOPNOTSUPP these days?
>
> In common code whenever the file system does not implement the
> fallocate file operation, and various file systems can also
> return it from inside the method if the feature is not actually
> supported for the particular file system or file it is called on.
>
>> Last time I looked at this I concluded that it does not make sense to
>> push this write loop from glibc to the applications.  That's what would
>> happen if we had a new version of posix_fallocate that didn't do those
>> writes.  We also updated the manual:
>
> That assumes that the loop is the right thing to do for file systems not
> supporting fallocate.  That is generally the wrong thing to do, and
> spectacularly wrong for file systems that write out of place.

In this case, the file system could return another error code besides
EOPNOTSUPP.  There's a difference between “no one bothered to implement
this” and “this can't be implemented correctly”, and it could be
reflected in the error code.

> The applications might not know about glibc/Linux implementation details
> and expect posix_fallocate to either fail if it can't be supported or
> actually give the guarantees it is supposed to provide, which this
> "fallback" doesn't actually do for the not entirely uncommon case of a 
> file system that is writing out of place.

I think people are aware that with thin provisioning and whatnot, even a
successful fallocate call doesn't mean that there's sufficient space to
complete the actual write.

Thanks,
Florian


* Re: posix_fallocate behavior in glibc
  2024-07-29 18:52     ` Florian Weimer
@ 2024-07-29 19:01       ` Christoph Hellwig
  2024-07-29 19:23         ` Florian Weimer
  2024-07-29 23:53       ` Dave Chinner
  1 sibling, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2024-07-29 19:01 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Christoph Hellwig, libc-alpha, linux-fsdevel, Trond Myklebust

On Mon, Jul 29, 2024 at 08:52:00PM +0200, Florian Weimer wrote:
> > supporting fallocate.  That is generally the wrong thing to do, and
> > spectacularly wrong for file systems that write out of place.
> 
> In this case, the file system could return another error code besides
> EOPNOTSUPP.

What error code would that be and how do applications know about it?

> There's a difference between “no one bothered to implement
> this” and “this can't be implemented correctly”, and it could be
> reflected in the error code.

posix_fallocate can't be correctly implemented in userspace, which
is part of the problem.

> > The applications might not know about glibc/Linux implementation details
> > and expect posix_fallocate to either fail if it can't be supported or
> > actually give the guarantees it is supposed to provide, which this
> > "fallback" doesn't actually do for the not entirely uncommon case of a 
> > file system that is writing out of place.
> 
> I think people are aware that with thin provisioning and whatnot, even a
> successful fallocate call doesn't mean that there's sufficient space to
> complete the actual write.

With a correctly implemented fallocate the guarantees in the standard
actually work properly.  Even if the underlying block device is thinly
provisioned and makes a write fail due to lack of space in the block
device, this will shut down the file system entirely rather than
return -ENOSPC.


* Re: posix_fallocate behavior in glibc
  2024-07-29 19:01       ` Christoph Hellwig
@ 2024-07-29 19:23         ` Florian Weimer
  2024-07-30 15:47           ` Christoph Hellwig
  0 siblings, 1 reply; 22+ messages in thread
From: Florian Weimer @ 2024-07-29 19:23 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: libc-alpha, linux-fsdevel, Trond Myklebust

* Christoph Hellwig:

> On Mon, Jul 29, 2024 at 08:52:00PM +0200, Florian Weimer wrote:
>> > supporting fallocate.  That is generally the wrong thing to do, and
>> > spectacularly wrong for file systems that write out of place.
>> 
>> In this case, the file system could return another error code besides
>> EOPNOTSUPP.
>
> What error code would that be and how do applications know about it?

Anything that's not EOPNOTSUPP will do.  EMEDIUMTYPE or ENOTBLK might do
it.  Any of the many STREAMS error codes could also be re-used quite
safely because Linux doesn't do STREAMS.

If you remove the fallback code, applications need to be taught about
EOPNOTSUPP, so that doesn't really make a difference.

Still it needs testing.  It's possible that key software doesn't expect
posix_fallocate to fail.

Thanks,
Florian

* Re: posix_fallocate behavior in glibc
  2024-07-29 18:52     ` Florian Weimer
  2024-07-29 19:01       ` Christoph Hellwig
@ 2024-07-29 23:53       ` Dave Chinner
  1 sibling, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2024-07-29 23:53 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Christoph Hellwig, libc-alpha, linux-fsdevel, Trond Myklebust

On Mon, Jul 29, 2024 at 08:52:00PM +0200, Florian Weimer wrote:
> * Christoph Hellwig:
> 
> > On Mon, Jul 29, 2024 at 07:57:54PM +0200, Florian Weimer wrote:
> >> When does the kernel return EOPNOTSUPP these days?
> >
> > In common code whenever the file system does not implement the
> > fallocate file operation, and various file systems can also
> > return it from inside the method if the feature is not actually
> > supported for the particular file system or file it is called on.
> >
> >> Last time I looked at this I concluded that it does not make sense to
> >> push this write loop from glibc to the applications.  That's what would
> >> happen if we had a new version of posix_fallocate that didn't do those
> >> writes.  We also updated the manual:
> >
> > That assumes that the loop is the right thing to do for file systems not
> > > supporting fallocate.  That is generally the wrong thing to do, and
> > spectacularly wrong for file systems that write out of place.
> 
> In this case, the file system could return another error code besides
> EOPNOTSUPP.  There's a difference between “no one bothered to implement
> this” and “this can't be implemented correctly”, and it could be
> reflected in the error code.

Huh. EOPNOTSUPP is explicitly stated as a valid "not implemented in
kernel or userspace" error return in the man page.

EOPNOTSUPP
      The  filesystem  containing the file referred to by fd does
      not support this operation.  This error code can be returned
      by C libraries that don't perform the emulation shown in
      NOTES, such as musl libc.

Hence libc-independent applications already have to handle
EOPNOTSUPP being returned by posix_fallocate() and implement their
own fallback if they really need it.
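
A sketch of what such handling could look like in a libc-independent
application.  The wrapper name try_preallocate and the choice to treat
EOPNOTSUPP as a soft failure are assumptions made for this example, not
something the man page prescribes.

```c
#include <errno.h>
#include <fcntl.h>      /* posix_fallocate(), open() */
#include <sys/types.h>

/* Hypothetical wrapper: try to reserve [offset, offset + len).
   *reserved is set to 1 only if the call succeeded.
   Returns 0 if the caller may continue (reserved or not supported),
   otherwise the error number from posix_fallocate(). */
static int try_preallocate(int fd, off_t offset, off_t len, int *reserved)
{
    /* posix_fallocate() returns the error number and does not set errno. */
    int err = posix_fallocate(fd, offset, len);

    *reserved = (err == 0);
    if (err == EOPNOTSUPP)
        return 0;       /* no reservation available: carry on without one */
    return err;
}
```

An application that continues without a reservation then has to be
prepared for ENOSPC on later writes, which portable code needs anyway.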

Having documented behaviour of "glibc does emulation so badly you
need to work around it" is really not a strong position from which
to advocate that the kernel needs to change behaviour....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: posix_fallocate behavior in glibc
  2024-07-29 19:23         ` Florian Weimer
@ 2024-07-30 15:47           ` Christoph Hellwig
  2024-07-30 16:11             ` Paul Eggert
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2024-07-30 15:47 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Christoph Hellwig, libc-alpha, linux-fsdevel, Trond Myklebust

On Mon, Jul 29, 2024 at 09:23:22PM +0200, Florian Weimer wrote:
> Anything that's not EOPNOTSUPP will do.  EMEDIUMTYPE or ENOTBLK might do
> it.  Any of the many STREAMS error codes could also be re-used quite
> safely because Linux doesn't do STREAMS.

Huh?  EOPNOTSUP(P) is the standard error code in Posix for operation
not supported, and clearly documented as such in the Linux man page
(for musl).  A totally random new error code doesn't really help us.

> If you remove the fallback code, applications need to be taught about
> EOPNOTSUPP, so that doesn't really make a difference.

Portable software can't assume that posix_fallocate actually does
anything.


* Re: posix_fallocate behavior in glibc
  2024-07-30 15:47           ` Christoph Hellwig
@ 2024-07-30 16:11             ` Paul Eggert
  2024-07-30 16:20               ` Christoph Hellwig
  0 siblings, 1 reply; 22+ messages in thread
From: Paul Eggert @ 2024-07-30 16:11 UTC (permalink / raw)
  To: Christoph Hellwig, Florian Weimer
  Cc: libc-alpha, linux-fsdevel, Trond Myklebust

On 2024-07-30 08:47, Christoph Hellwig wrote:
> On Mon, Jul 29, 2024 at 09:23:22PM +0200, Florian Weimer wrote:
>> Anything that's not EOPNOTSUPP will do.  EMEDIUMTYPE or ENOTBLK might do
>> it.  Any of the many STREAMS error codes could also be re-used quite
>> safely because Linux doesn't do STREAMS.
> 
> Huh?  EOPNOTSUP(P) is the standard error code in Posix for operation
> not supported, and clearly documented as such in the Linux man page
> (for musl).  A totally random new error code doesn't really help us.

It would help glibc distinguish the following cases:

A. file systems whose internal structure supports the semantics of 
posix_fallocate, and where user-mode code can approximate those 
semantics by writing zeros, but where that feature has not been 
implemented in the kernel's file system code so the system call 
currently fails with EOPNOTSUPP.

B. file systems whose internal structure cannot support the semantics of 
posix_fallocate and you cannot approximate them, and where the system 
call currently fails with EOPNOTSUPP.

Florian is proposing that different error numbers be returned for (A) vs 
(B) so that glibc posix_fallocate can treat the cases differently.

>> If you remove the fallback code, applications need to be taught about
>> EOPNOTSUPP, so that doesn't really make a difference.
> 
> Portable software can't assume that posix_fallocate actually does
> anything.

Of course, but this issue is about whether glibc posix_fallocate should 
continue to work on type (A) file systems. It's understandable that 
glibc maintainers would be reluctant to mess with that longstanding 
behavior.

* Re: posix_fallocate behavior in glibc
  2024-07-30 16:11             ` Paul Eggert
@ 2024-07-30 16:20               ` Christoph Hellwig
  2024-07-30 17:03                 ` Florian Weimer
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2024-07-30 16:20 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Christoph Hellwig, Florian Weimer, libc-alpha, linux-fsdevel,
	Trond Myklebust

On Tue, Jul 30, 2024 at 09:11:17AM -0700, Paul Eggert wrote:
> It would help glibc distinguish the following cases:
>
> A. file systems whose internal structure supports the semantics of 
> posix_fallocate, and where user-mode code can approximate those semantics 
> by writing zeros, but where that feature has not been implemented in the 
> kernel's file system code so the system call currently fails with 
> EOPNOTSUPP.
>
> B. file systems whose internal structure cannot support the semantics of 
> posix_fallocate and you cannot approximate them, and where the system call 
> currently fails with EOPNOTSUPP.

As mentioned earlier in the thread case a) are basically legacy / foreign
OS compatibility file systems (minix, sysfs, hfs/hfsplus).  They are
probably not something that people actually use posix_fallocate on.
The only relevant exception is probably ext4 in ext2/ext3 mode, where
the latter might still have users left running real workloads on it
and not using it for usb disks or VM images.

> Florian is proposing that different error numbers be returned for (A) vs 
> (B) so that glibc posix_fallocate can treat the cases differently.

The problem with a new error code is that it will leak out to the
application when using a new kernel and an old glibc.  If we want to skin
the cat that way a better way might be to expose this kind of information
through a statx flag or a similar interface.

* Re: posix_fallocate behavior in glibc
  2024-07-30 16:20               ` Christoph Hellwig
@ 2024-07-30 17:03                 ` Florian Weimer
  2024-07-30 17:08                   ` Christoph Hellwig
                                     ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Florian Weimer @ 2024-07-30 17:03 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Paul Eggert, libc-alpha, linux-fsdevel, Trond Myklebust

* Christoph Hellwig:

> On Tue, Jul 30, 2024 at 09:11:17AM -0700, Paul Eggert wrote:
>> It would help glibc distinguish the following cases:
>>
>> A. file systems whose internal structure supports the semantics of 
>> posix_fallocate, and where user-mode code can approximate those semantics 
>> by writing zeros, but where that feature has not been implemented in the 
>> kernel's file system code so the system call currently fails with 
>> EOPNOTSUPP.
>>
>> B. file systems whose internal structure cannot support the semantics of 
>> posix_fallocate and you cannot approximate them, and where the system call 
>> currently fails with EOPNOTSUPP.
>
> As mentioned earlier in the thread case a) are basically legacy / foreign
> OS compatibility file systems (minix, sysfs, hfs/hfsplus).  They are
> probably not something that people actually use posix_fallocate on.

It's more about a file copying tool doing this by default on behalf of
the users (perhaps Midnight Commander?).  If I recall, posix_fallocate
is also used by file-sharing clients, and those might be used with
external storage media that have older file systems.

> The only relevant exception is probably ext4 in ext2/ext3 mode, where
> the latter might still have users left running real workloads on it
> and not using it for usb disks or VM images.

Why doesn't the kernel perform allocation in these cases?  There doesn't
seem to be a file-system-specific reason why it's impossible to do.

At the very least, we should have a variant of ftruncate that never
truncates, likely under the fallocate umbrella.  It seems that that's
how posix_fallocate is used sometimes, for avoiding SIGBUS with mmap.
To these use cases, whether extents are allocated or not does not
matter.
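
For illustration, this is roughly what applications can do today in user
space to get a grow-only ftruncate.  The helper name grow_to is
hypothetical.  The sketch is not atomic: if another process extends the
file between the fstat and the ftruncate, the ftruncate can still shrink
it, and that gap is what a kernel-side operation would close.

```c
#include <fcntl.h>      /* open() flags, for callers of this sketch */
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: make the file at least min_size bytes long and
   never shorter.  No blocks are allocated; the new range is a hole.
   Returns 0 on success, -1 with errno set on failure. */
static int grow_to(int fd, off_t min_size)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    if (st.st_size >= min_size)
        return 0;               /* already large enough, leave it alone */
    return ftruncate(fd, min_size);
}
```

This covers only the SIGBUS raised when a mapping is accessed beyond end
of file.  It makes no promise about free space.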

>> Florian is proposing that different error numbers be returned for (A) vs 
>> (B) so that glibc posix_fallocate can treat the cases differently.
>
> The problem with a new error code is that it will leak out to the
> application when using a new kernel and an old glibc.

If we removed the fallback code from glibc today, it would just be
EOPNOTSUPP that leaks to applications, so it's structurally the same
issue.  The error codes that glibc's posix_fallocate can produce are all
different (unless write on the file fails with EOPNOTSUPP in the kernel,
but that would be quite unexpected).  EOPNOTSUPP would be equally
surprising.

Thanks,
Florian

* Re: posix_fallocate behavior in glibc
  2024-07-30 17:03                 ` Florian Weimer
@ 2024-07-30 17:08                   ` Christoph Hellwig
  2024-07-30 17:29                     ` Florian Weimer
  2024-07-30 17:52                   ` Mark Wielaard
  2024-07-31  2:32                   ` Theodore Ts'o
  2 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2024-07-30 17:08 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Christoph Hellwig, Paul Eggert, libc-alpha, linux-fsdevel,
	Trond Myklebust

On Tue, Jul 30, 2024 at 07:03:50PM +0200, Florian Weimer wrote:
> > The only relevant exception is probably ext4 in ext2/ext3 mode, where
> > the latter might still have users left running real workloads on it
> > and not using it for usb disks or VM images.
> 
> Why doesn't the kernel perform allocation in these cases?  There doesn't
> seem to be a file-system-specific reason why it's impossible to do.

Because in general it's a really stupid idea.  You don't get a better
allocation pattern, but you are writing every block twice, making things
significantly slower and wearing the device out in the process if it
is flash based.

> At the very least, we should have a variant of ftruncate that never
> truncates, likely under the fallocate umbrella.  It seems that that's
> how posix_fallocate is used sometimes, for avoiding SIGBUS with mmap.
> To these use cases, whether extents are allocated or not does not
> matter.

I don't see how that is related.

> If we removed the fallback code from glibc today, it would just be
> EOPNOTSUPP that leaks to applications, so it's structurally the same
> issue.

Not really.  EOPNOTSUPP is a valid error code that has historically
been returned by other operating systems and even other libc
implementations for Linux.


* Re: posix_fallocate behavior in glibc
  2024-07-30 17:08                   ` Christoph Hellwig
@ 2024-07-30 17:29                     ` Florian Weimer
  0 siblings, 0 replies; 22+ messages in thread
From: Florian Weimer @ 2024-07-30 17:29 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Paul Eggert, libc-alpha, linux-fsdevel, Trond Myklebust

* Christoph Hellwig:

> On Tue, Jul 30, 2024 at 07:03:50PM +0200, Florian Weimer wrote:
>> > The only relevant exception is probably ext4 in ext2/ext3 mode, where
>> > the latter might still have users left running real workloads on it
>> > and not using it for usb disks or VM images.
>> 
>> Why doesn't the kernel perform allocation in these cases?  There doesn't
>> seem to be a file-system-specific reason why it's impossible to do.
>
> Because in general it's a really stupid idea.  You don't get a better
> allocation pattern, but you are writing every block twice, making things
> significantly slower and wearing the device out in the process if it
> is flash based.

I would assume the applications that do pre-allocation before mmap with
random writes had a good reason to do it even when it was slow.

>> At the very least, we should have a variant of ftruncate that never
>> truncates, likely under the fallocate umbrella.  It seems that that's
>> how posix_fallocate is used sometimes, for avoiding SIGBUS with mmap.
>> To these use cases, whether extents are allocated or not does not
>> matter.
>
> I don't see how that is related.

Open file, posix_fallocate to the desired size, then use mmap, seems to
be somewhat common.  More often, people use ftruncate, but that can
unexpectedly shrink the file.
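
As an illustration of the "ftruncate that never truncates" idea, here is a
minimal C sketch, assuming a plain fstat-then-ftruncate sequence is
acceptable.  The helper names extend_file_no_truncate and open_at_least are
hypothetical, not an existing API, and the size check is racy if another
process resizes the file concurrently.  It only moves the end of file, so it
avoids SIGBUS from touching pages past EOF but reserves no space.

```c
#define _POSIX_C_SOURCE 200809L   /* for ftruncate() under strict modes */
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: grow fd to at least min_size bytes, never shrink it.
   Returns 0 on success or an errno value on failure. */
int extend_file_no_truncate(int fd, off_t min_size)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return errno;
    if (st.st_size >= min_size)
        return 0;                 /* already large enough; leave it alone */
    if (ftruncate(fd, min_size) != 0)
        return errno;
    return 0;
}

/* Hypothetical usage: open (or create) a file that is at least min_size
   bytes long, ready to be passed to mmap().  Returns the fd or -1. */
int open_at_least(const char *path, off_t min_size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd < 0)
        return -1;
    if (extend_file_no_truncate(fd, min_size) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Because nothing is allocated, a write through the mapping can still raise
SIGBUS when the file system runs out of space.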

>> If we removed the fallback code from glibc today, it would just be
>> EOPNOTSUPP that leaks to applications, so it's structurally the same
>> issue.
>
> Not really.  EOPNOTSUPP is a valid error code that has historically
> been returned by other operating systems and even other libc
> implementations for Linux.

I don't see EOPNOTSUPP handling code in Ceph, Beanstalk, Bitcoin Core,
or Transmission.  Most of them seem to just ignore errors (except
perhaps Ceph).  This might not be a problem in the end, but it seems
that existing software (even portable software) does not check for
EOPNOTSUPP.

Thanks,
Florian


* Re: posix_fallocate behavior in glibc
  2024-07-30 17:03                 ` Florian Weimer
  2024-07-30 17:08                   ` Christoph Hellwig
@ 2024-07-30 17:52                   ` Mark Wielaard
  2024-07-31  2:32                   ` Theodore Ts'o
  2 siblings, 0 replies; 22+ messages in thread
From: Mark Wielaard @ 2024-07-30 17:52 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Christoph Hellwig, Paul Eggert, libc-alpha, linux-fsdevel,
	Trond Myklebust

Hi,

On Tue, Jul 30, 2024 at 07:03:50PM +0200, Florian Weimer wrote:
> At the very least, we should have a variant of ftruncate that never
> truncates, likely under the fallocate umbrella.  It seems that that's
> how posix_fallocate is used sometimes, for avoiding SIGBUS with mmap.
> To these use cases, whether extents are allocated or not does not
> matter.

This is how/why elfutils libelf uses posix_fallocate when using
ELF_C_RDWR_MMAP. The comment for it says:

      /* When using mmap we want to make sure the file content is
         really there. Only using ftruncate might mean the file is
         extended, but space isn't allocated yet.  This might cause a
         SIGBUS once we write into the mmapped space and the disk is
         full.  In glibc posix_fallocate is required to extend the
         file and allocate enough space even if the underlying
         filesystem would normally return EOPNOTSUPP.  But other
         implementations might not work as expected.  And the glibc
         fallback case might fail (with unexpected errnos) in some cases.
         So we only report an error when the call fails and errno is
         ENOSPC. Otherwise we ignore the error and treat it as just hint.  */
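
The policy described in that comment could be sketched roughly as follows.
This is not the elfutils source; preallocate_as_hint is a hypothetical name,
and the only facts relied on are that posix_fallocate returns the error
number directly (it does not set errno) and that ENOSPC is the one failure
treated as fatal.

```c
#define _POSIX_C_SOURCE 200809L   /* for posix_fallocate() under strict modes */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical helper: try to preallocate the first `size` bytes of fd.
   Returns ENOSPC when the disk is full and 0 in every other case, so any
   other failure (EOPNOTSUPP, EINVAL, EBADF, ...) is treated as "the hint
   was not honoured" rather than as an error. */
int preallocate_as_hint(int fd, off_t size)
{
    /* posix_fallocate reports failure through its return value. */
    int err = posix_fallocate(fd, 0, size);

    return err == ENOSPC ? ENOSPC : 0;
}
```

A caller would abort the mmap-based write path only when this returns
ENOSPC, and otherwise proceed as if no preallocation had been requested.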

Cheers,

Mark

* Re: posix_fallocate behavior in glibc
  2024-07-30 17:03                 ` Florian Weimer
  2024-07-30 17:08                   ` Christoph Hellwig
  2024-07-30 17:52                   ` Mark Wielaard
@ 2024-07-31  2:32                   ` Theodore Ts'o
  2 siblings, 0 replies; 22+ messages in thread
From: Theodore Ts'o @ 2024-07-31  2:32 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Christoph Hellwig, Paul Eggert, libc-alpha, linux-fsdevel,
	Trond Myklebust

On Tue, Jul 30, 2024 at 07:03:50PM +0200, Florian Weimer wrote:
> 
> At the very least, we should have a variant of ftruncate that never
> truncates, likely under the fallocate umbrella.  It seems that that's
> how posix_fallocate is used sometimes, for avoiding SIGBUS with mmap.
> To these use cases, whether extents are allocated or not does not
> matter.

Personally, what I advise any application authors I come across is
simply to avoid using posix_fallocate(2) altogether; the
semantics are totally broken, as is common with anything mandated by a
committee that was trying to satisfy multiple legacy Unix
implementations.  And so, relying on it is just going to be fraught.

What I tell them to do instead is to use the Linux fallocate(2) system
call directly, which is well-defined, and if the file system doesn't
support fallocate, and fallocate(2) returns EOPNOTSUPP, that the userspace
application should either accept the fact it won't be able to allocate
the space, or if it really needs to avoid things like the SIGBUS with
mmap(2), to have the userspace application do the zero-fill writes
itself.
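
A minimal C sketch of that advice, under stated assumptions: fallocate(2)
is tried first with mode 0, and only when it fails with EOPNOTSUPP does the
application fall back to writing zeros itself.  The names reserve_range and
zero_fill_range are hypothetical.  The fallback overwrites the range, so it
assumes the range holds no data worth keeping (for example a freshly created
file), and on write-out-of-place file systems it guarantees nothing about
later writes, as discussed earlier in the thread.

```c
#define _GNU_SOURCE               /* for fallocate(2) */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write zeros over [offset, offset + len).
   Destroys any existing data in that range.  Returns 0 or an errno value. */
int zero_fill_range(int fd, off_t offset, off_t len)
{
    static const char zeros[65536];

    while (len > 0) {
        size_t chunk = len < (off_t) sizeof zeros ? (size_t) len : sizeof zeros;
        ssize_t n = pwrite(fd, zeros, chunk, offset);

        if (n < 0) {
            if (errno == EINTR)
                continue;         /* interrupted before writing; retry */
            return errno;
        }
        if (n == 0)
            return EIO;           /* no progress; avoid looping forever */
        offset += n;
        len -= n;
    }
    return 0;
}

/* Hypothetical helper: ask the kernel to allocate the range, and zero-fill
   by hand only if the file system does not support fallocate at all. */
int reserve_range(int fd, off_t offset, off_t len)
{
    if (fallocate(fd, 0, offset, len) == 0)
        return 0;
    if (errno != EOPNOTSUPP)
        return errno;             /* e.g. ENOSPC: report it to the caller */
    return zero_fill_range(fd, offset, len);
}
```

An application that does not need the mmap guarantee can skip the fallback
and simply accept the EOPNOTSUPP result.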

So honestly, is it worth it to try "fixing" posix_fallocate(2)?  Just
tell people to avoid it like the plague....  That way, we don't have
to worry about breaking existing legacy applications.

If we are going to stick with the existing Linux fallocate(2) system
call, then the problem is trying to have the system mind-read about
what the application writer really was trying to get when they call
fallocate(2) --- are they trying to avoid SIGBUS with mmap?  Or are
they trying to guarantee that any writes to that file range will never
fail with ENOSPC (even in the face of something like dm-thin being in
the storage stack)?  And so the solution is simple; we can define new
flag bits to the fallocate(2) system call to make it be explicit
exactly what the application is requesting of the system.

Adding new fallocate(2) flag bits seems to be a more general solution
than adding a new ftruncate(2) variant.

In addition, we can also add a new flag which requests that the file system
pass the allocation request down to the thin provisioned storage
(assuming that this is something that is supported).  Although I'm
not sure how much this matters; after all, for decades there have been
thin-provisioned NetApp storage appliances where fallocate(2) or
posix_fallocate(2) wouldn't necessarily guarantee that a thin-provisioned
device won't run out of space on a write(2), and application authors
seem to have been willing to live with it.  Still, if people really
want this to work, even in the face of a file system which supports
copy-on-write cloned ranges, then presumably this new fallocate(2)
system call with the "never shall a write fail with ENOSPC" bit set,
can also snap the COW region as well.  It's important, though, that
this be done using a new fallocate(2) flag, as opposed to having this
magically be added to the existing fallocate(2) system call, since
that will likely cause surprises for some applications.

     	  	       		     - Ted

Thread overview: 22+ messages
2024-06-26  6:01 posix_fallocate behavior in glibc Christoph Hellwig
2024-07-29 15:09 ` Christoph Hellwig
2024-07-29 15:11   ` Sam James
  -- strict thread matches above, loose matches on Subject: below --
2024-07-29 16:09 Christoph Hellwig
2024-07-29 17:23 ` Paul Eggert
2024-07-29 17:43   ` Christoph Hellwig
2024-07-29 17:54     ` Adhemerval Zanella Netto
     [not found]     ` <CAPBLoAf11hM0PLhqPG5gUyivU9U1manpOOhDWCPugUmWc1VVUw@mail.gmail.com>
2024-07-29 18:45       ` Christoph Hellwig
2024-07-29 17:57 ` Florian Weimer
2024-07-29 18:44   ` Christoph Hellwig
2024-07-29 18:52     ` Florian Weimer
2024-07-29 19:01       ` Christoph Hellwig
2024-07-29 19:23         ` Florian Weimer
2024-07-30 15:47           ` Christoph Hellwig
2024-07-30 16:11             ` Paul Eggert
2024-07-30 16:20               ` Christoph Hellwig
2024-07-30 17:03                 ` Florian Weimer
2024-07-30 17:08                   ` Christoph Hellwig
2024-07-30 17:29                     ` Florian Weimer
2024-07-30 17:52                   ` Mark Wielaard
2024-07-31  2:32                   ` Theodore Ts'o
2024-07-29 23:53       ` Dave Chinner
