* Question on slow fallocate
@ 2023-06-22 5:34 Masahiko Sawada
2023-06-22 7:44 ` Wang Yugui
2023-06-23 0:47 ` Dave Chinner
0 siblings, 2 replies; 20+ messages in thread
From: Masahiko Sawada @ 2023-06-22 5:34 UTC (permalink / raw)
To: linux-xfs
[-- Attachment #1: Type: text/plain, Size: 1054 bytes --]
Hi all,
When testing PostgreSQL, I found a performance degradation. After some
investigation, it ultimately came down to the attached simple C
program, and it turned out that the degradation happens only on the
xfs filesystem (it happens on neither ext3 nor ext4). In short, the
program alternates between two ways of extending a file: (1) calling
posix_fallocate() to extend by 8192 bytes and (2) calling pwrite() to
extend by 8192 bytes. If I do only (1) or only (2), the program
completes in 2 sec, but if I alternate between (1) and (2), it takes
90 sec.
$ gcc -o test test.c
$ time ./test test.1 1
total 200000
fallocate 200000
filewrite 0
real 0m1.305s
user 0m0.050s
sys 0m1.255s
$ time ./test test.2 2
total 200000
fallocate 100000
filewrite 100000
real 1m29.222s
user 0m0.139s
sys 0m3.139s
Why does it take so long in the latter case? And are there any
workarounds or configuration changes to deal with it?
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
[-- Attachment #2: test.c --]
[-- Type: application/octet-stream, Size: 952 bytes --]
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>	/* S_IRWXU */

int
main(int argc, char **argv)
{
	char	*filename;
	int	ratio;
	char	block[8192] = {0};
	int	fd;
	off_t	total_len = 0;
	int	n_fallocate = 0;
	int	n_filewrite = 0;
	int	i;

	if (argc < 3)
	{
		fprintf(stderr, "usage: %s <filename> <ratio>\n", argv[0]);
		return 1;
	}
	filename = argv[1];
	ratio = atoi(argv[2]);

	fd = open(filename, O_RDWR | O_CREAT, S_IRWXU);
	if (fd < 0)
	{
		fprintf(stderr, "could not open file %s: %m\n", filename);
		return 1;
	}

	/*
	 * Extend the file by 8192 bytes, 200000 times.  Every ratio-th
	 * iteration extends via posix_fallocate(); the rest extend via
	 * pwrite().  With ratio == 1 every extension is a fallocate;
	 * with ratio == 2 the two methods alternate.
	 */
	for (i = 0; i < 200000; i++)
	{
		int ret;

		if (ratio != 0 && i % ratio == 0)
		{
			/* posix_fallocate returns an error number, not -1/errno */
			ret = posix_fallocate(fd, total_len, 8192);
			if (ret != 0)
			{
				fprintf(stderr, "posix_fallocate failed: %d\n", ret);
				return 1;
			}
			n_fallocate++;
		}
		else
		{
			if (pwrite(fd, block, 8192, total_len) != 8192)
			{
				fprintf(stderr, "pwrite failed: %m\n");
				return 1;
			}
			n_filewrite++;
		}
		total_len += 8192;
	}
	printf("total\t%d\n", i);
	printf("fallocate\t%d\n", n_fallocate);
	printf("filewrite\t%d\n", n_filewrite);
	close(fd);
	return 0;
}
^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question on slow fallocate
  2023-06-22  5:34 Question on slow fallocate Masahiko Sawada
@ 2023-06-22  7:44 ` Wang Yugui
  2023-06-22  8:18   ` Masahiko Sawada
  2023-06-23  0:47 ` Dave Chinner
  1 sibling, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2023-06-22  7:44 UTC (permalink / raw)
To: Masahiko Sawada; +Cc: linux-xfs

Hi,

> Hi all,
>
> When testing PostgreSQL, I found a performance degradation. After some
> investigation, it ultimately reached the attached simple C program and
> turned out that the performance degradation happens on only the xfs
> filesystem (doesn't happen on neither ext3 nor ext4). In short, the
> program alternately does two things to extend a file (1) call
> posix_fallocate() to extend by 8192 bytes and (2) call pwrite() to
> extend by 8192 bytes. If I do only either (1) or (2), the program is
> completed in 2 sec, but if I do (1) and (2) alternatively, it is
> completed in 90 sec.
>
> $ gcc -o test test.c
> $ time ./test test.1 1
> total 200000
> fallocate 200000
> filewrite 0
>
> real 0m1.305s
> user 0m0.050s
> sys 0m1.255s
>
> $ time ./test test.2 2
> total 200000
> fallocate 100000
> filewrite 100000
>
> real 1m29.222s
> user 0m0.139s
> sys 0m3.139s
>
> Why does it take so long in the latter case? and are there any
> workaround or configuration changes to deal with it?

I tested it on xfs, linux 6.1.35 and 6.4-rc7; the result is almost
the same.

$ time ./test test.1 1
real 0m1.382s

$ time ./test test.2 2
real 0m9.262s

Linux kernel version, please?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/06/22

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Question on slow fallocate
  2023-06-22  7:44 ` Wang Yugui
@ 2023-06-22  8:18   ` Masahiko Sawada
  0 siblings, 0 replies; 20+ messages in thread
From: Masahiko Sawada @ 2023-06-22  8:18 UTC (permalink / raw)
To: Wang Yugui; +Cc: linux-xfs

On Thu, Jun 22, 2023 at 4:44 PM Wang Yugui <wangyugui@e16-tech.com> wrote:
>
> Hi,
>
> [...]
>
> I test it on xfs linux 6.1.35 and 6.4-rc7
>
> the result is almost same.
>
> $ time ./test test.1 1
> real 0m1.382s
>
> $ time ./test test.2 2
> real 0m9.262s
>
> linunx kernel version please.

I test it on:

$ uname -r
6.1.29-50.88.amzn2023.x86_64

and

$ uname -r
4.18.0-372.9.1.el8.x86_64

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Question on slow fallocate
  2023-06-22  5:34 Question on slow fallocate Masahiko Sawada
  2023-06-22  7:44 ` Wang Yugui
@ 2023-06-23  0:47 ` Dave Chinner
  2023-06-23  8:29   ` Ritesh Harjani
                     ` (2 more replies)
  1 sibling, 3 replies; 20+ messages in thread
From: Dave Chinner @ 2023-06-23  0:47 UTC (permalink / raw)
To: Masahiko Sawada; +Cc: linux-xfs

On Thu, Jun 22, 2023 at 02:34:18PM +0900, Masahiko Sawada wrote:
> Hi all,
>
> When testing PostgreSQL, I found a performance degradation. After some
> investigation, it ultimately reached the attached simple C program and
> turned out that the performance degradation happens on only the xfs
> filesystem (doesn't happen on neither ext3 nor ext4). In short, the
> program alternately does two things to extend a file (1) call
> posix_fallocate() to extend by 8192 bytes

This is a well known anti-pattern - it always causes problems. Do
not do this.

> and (2) call pwrite() to
> extend by 8192 bytes. If I do only either (1) or (2), the program is
> completed in 2 sec, but if I do (1) and (2) alternatively, it is
> completed in 90 sec.

Well, yes. Using fallocate to extend the file has very different
constraints to using pwrite to extend the file.

> $ gcc -o test test.c
> $ time ./test test.1 1
> total 200000
> fallocate 200000
> filewrite 0

No data is written here, so this is just a series of 8kB allocations
and file size extension operations. There are no constraints here
because it is a pure metadata operation.

> real 0m1.305s
> user 0m0.050s
> sys 0m1.255s
>
> $ time ./test test.2 2
> total 200000
> fallocate 100000
> filewrite 100000
>
> real 1m29.222s
> user 0m0.139s
> sys 0m3.139s

Now we have fallocate extending the file and doing unwritten extent
allocation, followed by writing into that unwritten extent, which
then does unwritten extent conversion.

This introduces data vs metadata update ordering constraints to the
workload.

The problem here is the "truncate up" operation that fallocate does
to move the file size.
The "truncate up" is going to move the on-disk file size to the end
of the fallocated range via a journal transaction, and so it will
expose the range of the previous write as containing valid data.

However, the previous data write is still only in memory and not on
disk. The result of journalling the file size change is that if we
crash after the size change is made but the data is not on disk, we
end up with lost data - the file contains zeros (or NULLs) where the
in-memory data previously existed.

Go google for "NULL file data exposure" and you'll see this is a
problem we fixed in ~2006, caused by extending the file size on disk
without first having written all the in-memory data into the file.
And even though we fixed the problem over 15 years ago, we still hear
people today saying "XFS overwrites user data with NULLs!" as their
reason for never using XFS, even though this was never true in the
first place.

The result of users demanding that we prevent poorly written
applications from losing their data is that users get poor
performance when their applications are poorly written. i.e. they do
something that triggers the data integrity ordering constraints that
users demand we work within.

So, how to avoid the problem?
With 'posix_fallocate(fd, total_len, 8192);':

$ rm /mnt/scratch/foo ; time ./fwtest /mnt/scratch/foo 1
total 200000
fallocate 200000
filewrite 0

real 0m2.557s
user 0m0.025s
sys 0m2.531s

$ rm /mnt/scratch/foo ; time ./fwtest /mnt/scratch/foo 2
total 200000
fallocate 100000
filewrite 100000

real 0m39.564s
user 0m0.117s
sys 0m7.535s

With 'fallocate(fd, FALLOC_FL_KEEP_SIZE, total_len, 8192);':

$ rm /mnt/scratch/foo ; time ./fwtest /mnt/scratch/foo 1
total 200000
fallocate 200000
filewrite 0

real 0m2.269s
user 0m0.037s
sys 0m2.233s

$ rm /mnt/scratch/foo ; time ./fwtest /mnt/scratch/foo 2
total 200000
fallocate 100000
filewrite 100000

real 0m1.068s
user 0m0.028s
sys 0m1.040s

Yup, just stop fallocate() from extending the file size and leave
that to the pwrite() call that actually writes the data into the
file.

As it is, using fallocate/pwrite like the test does is a well known
anti-pattern:

	error = fallocate(fd, off, len);
	if (error == ENOSPC) {
		/* abort write!!! */
	}
	error = pwrite(fd, off, len);
	ASSERT(error != ENOSPC);
	if (error) {
		/* handle error */
	}

Why does the code need a call to fallocate() here to prevent ENOSPC
in the pwrite() call? The answer is that it *doesn't need to use
fallocate() here*. That is, the fallocate() ENOSPC check before the
space is allocated is exactly the same as the ENOSPC check done in
the pwrite() call to see if there is space for the write to proceed.

IOWs, the fallocate() call is *completely redundant*, yet it is
actively harmful to performance in the short term (as per the issue
in this thread) as well as being harmful to file fragmentation levels
and filesystem longevity, because it prevents the filesystem from
optimising away unnecessary allocations. i.e. it defeats delayed
allocation, which allows the filesystem to combine lots of small
sequential write() calls into a single big contiguous extent
allocation when the data is getting written to disk.
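For reference, a sketch of the KEEP_SIZE variant of the test's allocation path (this is an illustration following the call written above, not the exact patch used to produce the numbers; the helper name is made up here):

```c
#define _GNU_SOURCE		/* fallocate() is Linux-specific */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/falloc.h>	/* FALLOC_FL_KEEP_SIZE */

/*
 * Preallocate 'len' bytes at offset 'off' without moving the file
 * size.  The later pwrite() into this range extends i_size itself,
 * so the fallocate() call no longer triggers the data-vs-size
 * ordering flush described above.
 */
static int
prealloc_keep_size(int fd, off_t off, off_t len)
{
	return fallocate(fd, FALLOC_FL_KEEP_SIZE, off, len);
}
```
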
IOWs, using fallocate() in the way described in this test is a sign
of application developers not understanding what preallocation
actually does and the situations where it actually provides some kind
of benefit.

i.e. fallocate() is intended to allow applications to preallocate
space in large chunks long before it is needed, and still have it
available when the application actually needs to write to it. e.g.
preallocate 10MB at a time, and not have to run fallocate again until
the existing preallocated chunk is entirely used up by the next
thousand-odd 8KB writes that extend the file.

Using fallocate() as a replacement for "truncate up before write" is
*not a recommended use*.

> Why does it take so long in the latter case? and are there any
> workaround or configuration changes to deal with it?

Let pwrite() do the file extension because it natively handles data
vs metadata ordering without having to flush data to disk and wait
for it. i.e. do not use fallocate() as if it were ftruncate().

Also, do not use posix_fallocate() - it gives you no control over how
preallocation is done; use fallocate() directly. And if you must use
fallocate() before a write, use fallocate(fd, FALLOC_FL_KEEP_SIZE,
off, len) so that the file extension is done by the pwrite(), to
avoid any metadata/data ordering constraints that might exist with
non-data-write-related file size changes.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
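To make the chunked-preallocation usage recommended above concrete, here is a sketch (the 10MB chunk size is from the message above; the helper name and bookkeeping are illustrative, not from the thread):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/falloc.h>

#define PREALLOC_CHUNK	(10 * 1024 * 1024)	/* 10MB at a time */

/*
 * Make sure at least 'need' bytes of space are reserved at 'off'.
 * '*alloc_end' tracks the end of the current reservation; the file
 * size is only ever moved by the writes themselves, so most writes
 * pay no fallocate() cost at all.
 */
static int
ensure_prealloc(int fd, off_t off, off_t need, off_t *alloc_end)
{
	if (off + need <= *alloc_end)
		return 0;	/* still inside the current reservation */
	if (fallocate(fd, FALLOC_FL_KEEP_SIZE, *alloc_end, PREALLOC_CHUNK) < 0)
		return -1;	/* ENOSPC etc. - caller aborts the write */
	*alloc_end += PREALLOC_CHUNK;
	return 0;
}
```

With this shape, the ENOSPC check happens once per 10MB chunk rather than once per 8KB write, and the size extension stays with the writes, so neither the ordering flush nor delayed allocation is defeated.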
* Re: Question on slow fallocate
  2023-06-23  0:47 ` Dave Chinner
@ 2023-06-23  8:29   ` Ritesh Harjani
  2023-06-23 10:07     ` Dave Chinner
  2023-06-26  3:17   ` Masahiko Sawada
  2023-07-11 22:28   ` Andres Freund
  2 siblings, 1 reply; 20+ messages in thread
From: Ritesh Harjani @ 2023-06-23  8:29 UTC (permalink / raw)
To: Dave Chinner, Masahiko Sawada; +Cc: linux-xfs

Dave Chinner <david@fromorbit.com> writes:

> On Thu, Jun 22, 2023 at 02:34:18PM +0900, Masahiko Sawada wrote:
>
> [...]
>
> Now we have fallocate extending the file and doing unwritten extent
> allocation, followed by writing into that unwritten extent which
> then does unwritten extent conversion.
> This introduces data vs metadata update ordering constraints to the
> workload.
>
> The problem here in that the "truncate up" operation that
> fallocate is doing to move the file size. The "truncate up" is going
> to move the on-disk file size to the end of the fallocated range via
> a journal transaction, and so it will expose the range of the
> previous write as containing valid data.
>
> However, the previous data write is still only in memory and not on
> disk. The result of journalling the file size change is that if we
> crash after the size change is made but the data is not on disk,
> we end up with lost data - the file contains zeros (or NULLs) where
> the in memory data previously existed.
>
> Go google for "NULL file data exposure" and you'll see this is a
> problem we fixed in ~2006, caused by extending the file size on disk
> without first having written all the in-memory data into the file.

I guess here is the <patch> you are speaking of. So this prevents
exposing nulls within a file in case of a crash.

I guess the behavior is not the same with ext4. ext4 does not seem to
be doing filemap_write_and_wait_range() if the new i_disksize is more
than oldsize. So then I think ext4 must be ok if, in case of a crash,
the file has nulls in between. That's why I think the observation of
slow performance is not seen in ext4.

A few queries -

- If the user doesn't issue a flush and the system crashes, then it
  is not expected that the file will have all the data anyway, right?
- Also, about the "data/inode size update order" which you mention in
  this patch: is this something that all filesystems should follow?
- I was wondering what exactly it breaks that applications depend
  upon? Because not all filesystems tend to follow this practice,
  right?

Thanks for the detailed explanation! I got interested in this thread
after looking at your explanation and since the thread mentions this
happens with postgres.

-ritesh

<patch>
[XFS] Fix inode size update before data write in xfs_setattr

When changing the file size by a truncate() call, we log the change
in the inode size. However, we do not flush any outstanding data that
might not have been written to disk, thereby violating the data/inode
size update order. This can leave files full of NULLs on crash. Hence
if we are truncating the file, flush any unwritten data that may lie
between the current on disk inode size and the new inode size that is
being logged to ensure that ordering is preserved.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Question on slow fallocate
  2023-06-23  8:29   ` Ritesh Harjani
@ 2023-06-23 10:07     ` Dave Chinner
  2023-06-23 11:49       ` Ritesh Harjani
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2023-06-23 10:07 UTC (permalink / raw)
To: Ritesh Harjani; +Cc: Masahiko Sawada, linux-xfs

On Fri, Jun 23, 2023 at 01:59:58PM +0530, Ritesh Harjani wrote:
> Dave Chinner <david@fromorbit.com> writes:
>
> [...]
>
> I guess here is the <patch> you are speaking of. So this prevents from
> exposing nulls within a file in case of a crash.

Well, we're not really "exposing NULLs". No data got written before
the crash, so a read from that range after a crash will find a hole
or unwritten extents in the file and return zeros.

> I guess the behavior is not the same with ext4. ext4 does not seem to be
> doing filemap_write_and_wait_range() if the new i_disksize is more than
> oldsize.
> That's why I think the observation of slow
> performance is not seen in ext4.

ext4 also has a similar problem where crashes can lead to files full
of zeroes, and many of the mitigations they use were copied from the
XFS mitigations for the same problem. However, ext4 has a completely
different way of handling failures after truncate (via an orphan
list, IIRC), so it doesn't need to actually write the data to avoid
potential stale data exposure issues.

> Few queres-
> - If the user doesn't issue a flush and if the system crashes, then
>   anyways it is not expected that the file will have all the data right?

Correct.

> - Also is that "data/inode size update order" which you are mentioning in
>   this patch. Is this something that all filesystems should follow?

No, it's the specific fix for the inode size update ordering problem
that led to user-visible symptoms after a crash. We avoid the problem
in two ways now - first, we always journal inode size updates, and
second, we always write dependent data before we journal said size
updates.

> - I was wondering what exactly it breaks which the applications depend
>   upon? Because not all filesystems tend to follow this practice right?

The filesystems didn't break anything - applications failed to write
and/or overwrite data safely, and when they did this, data got lost.

However, because the same type of failure didn't result in data loss
on ext3, the data loss was considered by users and application
developers to be a filesystem bug, rather than the inevitable result
of an application failing to ensure the user's data was actually
written to the filesystem in a crash-safe manner.

i.e. users and application developers demanded that filesystems be
omnipotent and provide a higher level of data integrity than the
application/user asks them to provide. The result is that we provided
the higher level of data integrity that users demanded, but it came
at a cost....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Question on slow fallocate
  2023-06-23 10:07     ` Dave Chinner
@ 2023-06-23 11:49       ` Ritesh Harjani
  2023-06-23 20:04         ` Eric Sandeen
  0 siblings, 1 reply; 20+ messages in thread
From: Ritesh Harjani @ 2023-06-23 11:49 UTC (permalink / raw)
To: Dave Chinner; +Cc: Masahiko Sawada, linux-xfs

Dave Chinner <david@fromorbit.com> writes:

> On Fri, Jun 23, 2023 at 01:59:58PM +0530, Ritesh Harjani wrote:
>
> [...]
>
>> I guess here is the <patch> you are speaking of. So this prevents from
>> exposing nulls within a file in case of a crash.
>
> Well, we're not really "exposing NULLs". No data got written before
> the crash, so a read from that range after a crash will find a hole
> or unwritten extents in the file and return zeros.

Yes, I agree. I meant the "null file problem".

>> I guess the behavior is not the same with ext4. ext4 does not seem to be
>> doing filemap_write_and_wait_range() if the new i_disksize is more than
>> oldsize.
>> So then I think ext4 must be ok if in case of a crash the
>> file has nulls in between. That's why I think the observation of slow
>> performance is not seen in ext4.
>
> ext4 also has a similar problem issue where crashes can lead to
> files full of zeroes, and many of the mitigations they use were
> copied from the XFS mitigations for the same problem. However, ext4
> has a completely different way of handling failures after truncate
> (via an orphan list, IIRC) so it doesn't need to actually write
> the data to avoid potential stale data exposure issues.

Sorry, but I still haven't understood the real problem here for which
XFS does filemap_write_and_wait_range(). Is it a stale data exposure
problem?

Now, in this code here in fs/xfs/xfs_iops.c we refer to the problem
as "expose ourselves to the null files problem". What is the "expose
ourselves to the null files problem" here, for which we do
filemap_write_and_wait_range()?

	/*
	 * We are going to log the inode size change in this transaction so
	 * any previous writes that are beyond the on disk EOF and the new
	 * EOF that have not been written out need to be written here.  If we
	 * do not write the data out, we expose ourselves to the null files
	 * problem. Note that this includes any block zeroing we did above;
	 * otherwise those blocks may not be zeroed after a crash.
	 */
	if (did_zeroing ||
	    (newsize > ip->i_disk_size && oldsize != ip->i_disk_size)) {
		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
						ip->i_disk_size, newsize - 1);
		if (error)
			return error;
	}

Talking about ext4: it handles truncates to a file using orphan
handling, yes. In case the truncate operation spans multiple txns and
the crash happens, say, in the middle of a txn, then the subsequent
crash recovery will truncate the blocks spanning i_disksize.

But we aren't discussing shrinking here, right? We are doing pwrite
followed by fallocate to grow the file size. With pwrite we use
delalloc, so the blocks only get allocated at writeback time, and
with fallocate we allocate unwritten extents, so there should be no
stale data exposure problem in this case, right?

Hence my question was mainly to understand what "expose ourselves to
the null files problem" means in XFS.

Thanks!
-ritesh

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Question on slow fallocate
  2023-06-23 11:49       ` Ritesh Harjani
@ 2023-06-23 20:04         ` Eric Sandeen
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Sandeen @ 2023-06-23 20:04 UTC (permalink / raw)
To: Ritesh Harjani (IBM), Dave Chinner; +Cc: Masahiko Sawada, linux-xfs

On 6/23/23 6:49 AM, Ritesh Harjani (IBM) wrote:

> Sorry, but I still haven't understood the real problem here for which
> XFS does filemap_write_and_wait_range(). Is it a stale data exposure
> problem?

(Hopefully I get this right by trying to be helpful, here. It's been
a while).

Not really. IIRC the original problem was that the file size could
get updated (transactionally) before the delayed allocation and IO
happened at writeback time, leaving a hole before EOF where buffered
writes had failed to land before a crash. This is what people
originally called the "NULL files problem" because reading the hole
post-crash returned zeros. It wasn't stale data, it was no data.

Some commits that dealt with this explain it fairly well, I think:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c32676eea19ce29cb74dba0f97b085e83f6b8915
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ba87ea699ebd9dd577bf055ebc4a98200e337542

> Now, in this code here in fs/xfs/xfs_iops.c we refer to the problem as
> "expose ourselves to the null files problem".
> What is the "expose ourselves to the null files problem here"
> for which we do filemap_write_and_wait_range()?
>
> 	/*
> 	 * We are going to log the inode size change in this transaction so
> 	 * any previous writes that are beyond the on disk EOF and the new
> 	 * EOF that have not been written out need to be written here.  If we

i.e. force the writeback of any pending buffered IO into the hole
created up to the new EOF

> 	 * do not write the data out, we expose ourselves to the null files
> 	 * problem. Note that this includes any block zeroing we did above;
> 	 * otherwise those blocks may not be zeroed after a crash.

and I suppose this relates a little to stale data; IIRC this is
referring to zeroing partial blocks past the old EOF.

> 	 */
> 	if (did_zeroing ||
> 	    (newsize > ip->i_disk_size && oldsize != ip->i_disk_size)) {
> 		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> 						ip->i_disk_size, newsize - 1);
> 		if (error)
> 			return error;
> 	}
>
> Talking about ext4, it handles truncates to a file using orphan
> handline, yes. In case if the truncate operation spans multiple txns and
> if the crash happens say in the middle of a txn, then the subsequent crash
> recovery will truncate the blocks spanning i_disksize.
>
> But we aren't discussing shrinking here right. We are doing pwrite
> followed by fallocate to grow the file size. With pwrite we use delalloc
> so the blocks only get allocated during writeback time and with
> fallocate we will allocate unwritten extents, so there should be no
> stale data expose problem in this case right?

yeah, it's not a stale data problem. I think that the extended EOF
created by fallocate is being treated exactly the same as if we had
extended it with ftruncate(). Indeed, replacing the posix_fallocate
with ftruncate to the same size in the test program results in a
similarly slow run, slightly faster probably because unwritten
conversion doesn't have to happen in that case.

> Hence my question was to mainly understand what does "expose ourselves to
> the null files problem" means in XFS?

Hopefully the above explains it; that said, I'm not sure this is
anything more than academically interesting. As Dave mentioned,
fallocating tiny space and then writing into it is not at all the
recommended or efficient use of fallocate.

The one thing I'm not remembering exactly here is why we have the
heuristic that a truncate up requires flushing all pending data
behind it.
I *think* it's because most users knew enough to expect buffered writes could be lost on a crash, but they expected to see valid data up to the on-disk EOF post-crash. Without this heuristic, they'd get some valid data that made it out followed by a hole ("NULLS") up to the new EOF, and they Did Not Like It. -Eric ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question on slow fallocate 2023-06-23 0:47 ` Dave Chinner 2023-06-23 8:29 ` Ritesh Harjani @ 2023-06-26 3:17 ` Masahiko Sawada 2023-06-26 15:32 ` Eric Sandeen 2023-07-11 22:28 ` Andres Freund 2 siblings, 1 reply; 20+ messages in thread From: Masahiko Sawada @ 2023-06-26 3:17 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Fri, Jun 23, 2023 at 9:47 AM Dave Chinner <david@fromorbit.com> wrote: > > On Thu, Jun 22, 2023 at 02:34:18PM +0900, Masahiko Sawada wrote: > > Hi all, > > > > When testing PostgreSQL, I found a performance degradation. After some > > investigation, it ultimately reached the attached simple C program and > > turned out that the performance degradation happens on only the xfs > > filesystem (doesn't happen on neither ext3 nor ext4). In short, the > > program alternately does two things to extend a file (1) call > > posix_fallocate() to extend by 8192 bytes > > This is a well known anti-pattern - it always causes problems. Do > not do this. > > > and (2) call pwrite() to > > extend by 8192 bytes. If I do only either (1) or (2), the program is > > completed in 2 sec, but if I do (1) and (2) alternatively, it is > > completed in 90 sec. > > Well, yes. Using fallocate to extend the file has very different > constraints to using pwrite to extend the file. > > > $ gcc -o test test.c > > $ time ./test test.1 1 > > total 200000 > > fallocate 200000 > > filewrite 0 > > No data is written here, so this is just a series of 8kB allocations > and file size extension operations. There are no constraints here > because it is a pure metadata operation. > > > real 0m1.305s > > user 0m0.050s > > sys 0m1.255s > > > > $ time ./test test.2 2 > > total 200000 > > fallocate 100000 > > filewrite 100000 > > > > real 1m29.222s > > user 0m0.139s > > sys 0m3.139s > > Now we have fallocate extending the file and doing unwritten extent > allocation, followed by writing into that unwritten extent which > then does unwritten extent conversion. 
>
> This introduces data vs metadata update ordering constraints to the
> workload.
>
> The problem here is the "truncate up" operation that
> fallocate is doing to move the file size. The "truncate up" is going
> to move the on-disk file size to the end of the fallocated range via
> a journal transaction, and so it will expose the range of the
> previous write as containing valid data.
>
> However, the previous data write is still only in memory and not on
> disk. The result of journalling the file size change is that if we
> crash after the size change is made but the data is not on disk,
> we end up with lost data - the file contains zeros (or NULLs) where
> the in-memory data previously existed.
>
> Go google for "NULL file data exposure" and you'll see this is a
> problem we fixed in ~2006, caused by extending the file size on disk
> without first having written all the in-memory data into the file.
> And even though we fixed the problem over 15 years ago, we still
> hear people today saying "XFS overwrites user data with NULLs!" as
> their reason for never using XFS, even though this was never true in
> the first place...
>
> The result of users demanding that we prevent poorly written
> applications from losing their data is that users get poor
> performance when their applications are poorly written. i.e. they do
> something that triggers the data integrity ordering constraints that
> users demand we work within.

Thank you for the detailed explanation.

> So, how to avoid the problem?
>
> With 'posix_fallocate(fd, total_len, 8192);':
>
> $ rm /mnt/scratch/foo ; time ./fwtest /mnt/scratch/foo 1
> total 200000
> fallocate 200000
> filewrite 0
>
> real 0m2.557s
> user 0m0.025s
> sys 0m2.531s
>
> $ rm /mnt/scratch/foo ; time ./fwtest /mnt/scratch/foo 2
> total 200000
> fallocate 100000
> filewrite 100000
>
> real 0m39.564s
> user 0m0.117s
> sys 0m7.535s
>
> With 'fallocate(fd, FALLOC_FL_KEEP_SIZE, total_len, 8192);':
>
> $ rm /mnt/scratch/foo ; time ./fwtest /mnt/scratch/foo 1
> total 200000
> fallocate 200000
> filewrite 0
>
> real 0m2.269s
> user 0m0.037s
> sys 0m2.233s
>
> $ rm /mnt/scratch/foo ; time ./fwtest /mnt/scratch/foo 2
> total 200000
> fallocate 100000
> filewrite 100000
>
> real 0m1.068s
> user 0m0.028s
> sys 0m1.040s
>
> Yup, just stop fallocate() from extending the file size and leave
> that to the pwrite() call that actually writes the data into the
> file.
>
> As it is, using fallocate/pwrite like the test does is a well known
> anti-pattern:
>
> error = fallocate(fd, off, len);
> if (error == ENOSPC) {
>         /* abort write!!! */
> }
> error = pwrite(fd, off, len);
> ASSERT(error != ENOSPC);
> if (error) {
>         /* handle error */
> }

The test.c and what PostgreSQL does are slightly different from the above
pattern, actually: it calls fallocate and pwrite for different 8kB blocks.
For example, it calls fallocate to extend the file from 0 bytes to 8192
bytes, and then calls pwrite to extend the file from 8192 bytes to 16384
bytes. But it's also not a recommended use, right?

> Why does the code need a call to fallocate() here to prevent ENOSPC in the
> pwrite() call?
>
> The answer here is that it *doesn't need to use fallocate() here*.
> That is, the fallocate() ENOSPC check before the space is allocated
> is exactly the same as the ENOSPC check done in the pwrite() call to
> see if there is space for the write to proceed.
>
> IOWs, the fallocate() call is *completely redundant*, yet it is
> actively harmful to performance in the short term (as per the
> issue in this thread) as well as being harmful for file fragmentation
> levels and filesystem longevity, because it prevents the filesystem
> from optimising away unnecessary allocations. i.e. it defeats
> delayed allocation, which allows the filesystem to combine lots of
> small sequential write() calls into a single big contiguous extent
> allocation when the data is getting written to disk.
>
> IOWs, using fallocate() in the way described in this test is a sign
> of application developers not understanding what preallocation
> actually does and the situations where it actually provides some
> kind of benefit.
>
> i.e. fallocate() is intended to allow applications to preallocate
> space in large chunks long before it is needed, and still have it
> available when the application actually needs to write to it. e.g.
> preallocate 10MB at a time, and not have to run fallocate again until
> the existing preallocated chunk is entirely used up by the next
> thousand 8KB writes that extend the file.
>
> Using fallocate() as a replacement for "truncate up before write" is
> *not a recommended use*.

FYI, to share the background of what PostgreSQL does: when bulk insertions
into one table are running concurrently, one process extends the
underlying files depending on how many concurrent processes are waiting to
extend. The more processes wait, the more 8kB blocks are appended. In the
current implementation, if the process needs to extend the table by more
than 8 blocks (i.e. 64kB) it uses posix_fallocate(), otherwise it uses
pwrite() (see the code[1] for details). We don't use fallocate() for small
extensions as it's slow on some filesystems. Therefore, if a
bulk-insertion process tries to extend the table by say 5~10 blocks many
times, it could use posix_fallocate() and pwrite() alternately, which led
to the slow performance as I reported.
>
> > Why does it take so long in the latter case? and are there any
> > workaround or configuration changes to deal with it?
>
> Let pwrite() do the file extension because it natively handles data
> vs metadata ordering without having to flush data to disk and wait
> for it. i.e. do not use fallocate() as if it is ftruncate(). Also,
> do not use posix_fallocate() - it gives you no control over how
> preallocation is done; use fallocate() directly. And if you must use
> fallocate() before a write, use fallocate(fd, FALLOC_FL_KEEP_SIZE,
> off, len) so that the file extension is done by the pwrite(), to
> avoid any metadata/data ordering constraints that might exist with
> non-data-write-related file size changes.

Thanks. Wang Yugui reported that this slow performance seems not to happen
on newer kernel versions, but is that right?

Fortunately, this behavior is still in beta (PG16 beta). I will discuss
alternative solutions in the PostgreSQL community.

Regards,

[1] https://github.com/postgres/postgres/blob/master/src/backend/storage/smgr/md.c#L577

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question on slow fallocate 2023-06-26 3:17 ` Masahiko Sawada @ 2023-06-26 15:32 ` Eric Sandeen 2023-06-27 15:50 ` Masahiko Sawada 0 siblings, 1 reply; 20+ messages in thread From: Eric Sandeen @ 2023-06-26 15:32 UTC (permalink / raw) To: Masahiko Sawada, Dave Chinner; +Cc: linux-xfs On 6/25/23 10:17 PM, Masahiko Sawada wrote: > FYI, to share the background of what PostgreSQL does, when > bulk-insertions into one table are running concurrently, one process > extends the underlying files depending on how many concurrent > processes are waiting to extend. The more processes wait, the more 8kB > blocks are appended. As the current implementation, if the process > needs to extend the table by more than 8 blocks (i.e. 64kB) it uses > posix_fallocate(), otherwise it uses pwrites() (see the code[1] for > details). We don't use fallocate() for small extensions as it's slow > on some filesystems. Therefore, if a bulk-insertion process tries to > extend the table by say 5~10 blocks many times, it could use > poxis_fallocate() and pwrite() alternatively, which led to the slow > performance as I reported. To what end? What problem is PostgreSQL trying to solve with this scheme? I might be missing something but it seems like you've described the "what" in detail, but no "why." -Eric ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question on slow fallocate 2023-06-26 15:32 ` Eric Sandeen @ 2023-06-27 15:50 ` Masahiko Sawada 2023-06-27 16:12 ` Eric Sandeen 0 siblings, 1 reply; 20+ messages in thread
From: Masahiko Sawada @ 2023-06-27 15:50 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Dave Chinner, linux-xfs

On Tue, Jun 27, 2023 at 12:32 AM Eric Sandeen <sandeen@sandeen.net> wrote:
>
> On 6/25/23 10:17 PM, Masahiko Sawada wrote:
> > FYI, to share the background of what PostgreSQL does, when
> > bulk-insertions into one table are running concurrently, one process
> > extends the underlying files depending on how many concurrent
> > processes are waiting to extend. The more processes wait, the more 8kB
> > blocks are appended. In the current implementation, if the process
> > needs to extend the table by more than 8 blocks (i.e. 64kB) it uses
> > posix_fallocate(), otherwise it uses pwrite() (see the code[1] for
> > details). We don't use fallocate() for small extensions as it's slow
> > on some filesystems. Therefore, if a bulk-insertion process tries to
> > extend the table by say 5~10 blocks many times, it could use
> > posix_fallocate() and pwrite() alternately, which led to the slow
> > performance as I reported.
>
> To what end? What problem is PostgreSQL trying to solve with this
> scheme? I might be missing something but it seems like you've described
> the "what" in detail, but no "why."

It's for better scalability. Since the process that wants to extend the
table needs to hold an exclusive lock on the table, we need to minimize
the work while holding the lock.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question on slow fallocate 2023-06-27 15:50 ` Masahiko Sawada @ 2023-06-27 16:12 ` Eric Sandeen 2023-06-28 4:56 ` Christoph Hellwig 2023-07-11 22:49 ` Andres Freund 0 siblings, 2 replies; 20+ messages in thread From: Eric Sandeen @ 2023-06-27 16:12 UTC (permalink / raw) To: Masahiko Sawada; +Cc: Dave Chinner, linux-xfs On 6/27/23 10:50 AM, Masahiko Sawada wrote: > On Tue, Jun 27, 2023 at 12:32 AM Eric Sandeen <sandeen@sandeen.net> wrote: >> >> On 6/25/23 10:17 PM, Masahiko Sawada wrote: >>> FYI, to share the background of what PostgreSQL does, when >>> bulk-insertions into one table are running concurrently, one process >>> extends the underlying files depending on how many concurrent >>> processes are waiting to extend. The more processes wait, the more 8kB >>> blocks are appended. As the current implementation, if the process >>> needs to extend the table by more than 8 blocks (i.e. 64kB) it uses >>> posix_fallocate(), otherwise it uses pwrites() (see the code[1] for >>> details). We don't use fallocate() for small extensions as it's slow >>> on some filesystems. Therefore, if a bulk-insertion process tries to >>> extend the table by say 5~10 blocks many times, it could use >>> poxis_fallocate() and pwrite() alternatively, which led to the slow >>> performance as I reported. >> >> To what end? What problem is PostgreSQL trying to solve with this >> scheme? I might be missing something but it seems like you've described >> the "what" in detail, but no "why." > > It's for better scalability. SInce the process who wants to extend the > table needs to hold an exclusive lock on the table, we need to > minimize the work while holding the lock. Ok, but what is the reason for zeroing out the blocks prior to them being written with real data? I'm wondering what the core requirement here is for the zeroing, either via fallocate (which btw posix_fallocate does not guarantee) or pwrites of zeros. Thanks, -Eric ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question on slow fallocate 2023-06-27 16:12 ` Eric Sandeen @ 2023-06-28 4:56 ` Christoph Hellwig 2023-07-11 22:49 ` Andres Freund 1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2023-06-28 4:56 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Masahiko Sawada, Dave Chinner, linux-xfs

On Tue, Jun 27, 2023 at 11:12:01AM -0500, Eric Sandeen wrote:
> Ok, but what is the reason for zeroing out the blocks prior to them being
> written with real data? I'm wondering what the core requirement here is for
> the zeroing, either via fallocate (which btw posix_fallocate does not
> guarantee) or pwrites of zeros.

Note that even a plain truncate will zero the data visible to the user.

I could see this trying to reduce fragmentation by making sure the whole
file extension is allocated together instead of split up as the different
processes write their areas. But is there data showing this fragmentation
even happens and actually hurts?

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question on slow fallocate 2023-06-27 16:12 ` Eric Sandeen 2023-06-28 4:56 ` Christoph Hellwig @ 2023-07-11 22:49 ` Andres Freund 2023-07-19 7:25 ` Dave Chinner 1 sibling, 1 reply; 20+ messages in thread From: Andres Freund @ 2023-07-11 22:49 UTC (permalink / raw) To: Eric Sandeen; +Cc: Masahiko Sawada, Dave Chinner, linux-xfs Hi, On 2023-06-27 11:12:01 -0500, Eric Sandeen wrote: > On 6/27/23 10:50 AM, Masahiko Sawada wrote: > > On Tue, Jun 27, 2023 at 12:32 AM Eric Sandeen <sandeen@sandeen.net> wrote: > > > > > > On 6/25/23 10:17 PM, Masahiko Sawada wrote: > > > > FYI, to share the background of what PostgreSQL does, when > > > > bulk-insertions into one table are running concurrently, one process > > > > extends the underlying files depending on how many concurrent > > > > processes are waiting to extend. The more processes wait, the more 8kB > > > > blocks are appended. As the current implementation, if the process > > > > needs to extend the table by more than 8 blocks (i.e. 64kB) it uses > > > > posix_fallocate(), otherwise it uses pwrites() (see the code[1] for > > > > details). We don't use fallocate() for small extensions as it's slow > > > > on some filesystems. Therefore, if a bulk-insertion process tries to > > > > extend the table by say 5~10 blocks many times, it could use > > > > poxis_fallocate() and pwrite() alternatively, which led to the slow > > > > performance as I reported. > > > > > > To what end? What problem is PostgreSQL trying to solve with this > > > scheme? I might be missing something but it seems like you've described > > > the "what" in detail, but no "why." > > > > It's for better scalability. SInce the process who wants to extend the > > table needs to hold an exclusive lock on the table, we need to > > minimize the work while holding the lock. > > Ok, but what is the reason for zeroing out the blocks prior to them being > written with real data? 
I'm wondering what the core requirement here is for
> the zeroing, either via fallocate (which btw posix_fallocate does not
> guarantee) or pwrites of zeros.

The goal is to avoid ENOSPC at a later time. We do this before filling our
own in-memory buffer pool with pages containing new contents. If we have
dirty pages in our buffer that we can't write out due to ENOSPC, we're in
trouble, because we can't checkpoint. Which typically will make the ENOSPC
situation worse, because we also can't remove WAL / journal files without
the checkpoint having succeeded. Of course a successful fallocate() /
pwrite() doesn't guarantee that much on a COW filesystem, but there's not
much we can do about that, to my knowledge.

Using fallocate() for small extensions is problematic because it (a)
causes fragmentation and (b) disables delayed allocation; using pwrite()
is also bad because the kernel will have to write out those dirty pages
full of zeroes - very often we won't write out the page with "real
content" before the kernel decides to do so. Hence using a heuristic to
choose between the two. I think all that's needed here is a bit of tuning
of the heuristic, possibly adding some "history" awareness.

If we could opt into delayed allocation while avoiding ENOSPC for a
certain length, it'd be perfect, but I don't think that's possible today?

We're also working on using DIO FWIW, where using fallocate() is just
about mandatory...

Greetings,

Andres Freund

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question on slow fallocate 2023-07-11 22:49 ` Andres Freund @ 2023-07-19 7:25 ` Dave Chinner 2023-07-19 20:29 ` Andres Freund 0 siblings, 1 reply; 20+ messages in thread From: Dave Chinner @ 2023-07-19 7:25 UTC (permalink / raw) To: Andres Freund; +Cc: Eric Sandeen, Masahiko Sawada, linux-xfs On Tue, Jul 11, 2023 at 03:49:11PM -0700, Andres Freund wrote: > Hi, > > On 2023-06-27 11:12:01 -0500, Eric Sandeen wrote: > > On 6/27/23 10:50 AM, Masahiko Sawada wrote: > > > On Tue, Jun 27, 2023 at 12:32 AM Eric Sandeen <sandeen@sandeen.net> wrote: > > > > > > > > On 6/25/23 10:17 PM, Masahiko Sawada wrote: > > > > > FYI, to share the background of what PostgreSQL does, when > > > > > bulk-insertions into one table are running concurrently, one process > > > > > extends the underlying files depending on how many concurrent > > > > > processes are waiting to extend. The more processes wait, the more 8kB > > > > > blocks are appended. As the current implementation, if the process > > > > > needs to extend the table by more than 8 blocks (i.e. 64kB) it uses > > > > > posix_fallocate(), otherwise it uses pwrites() (see the code[1] for > > > > > details). We don't use fallocate() for small extensions as it's slow > > > > > on some filesystems. Therefore, if a bulk-insertion process tries to > > > > > extend the table by say 5~10 blocks many times, it could use > > > > > poxis_fallocate() and pwrite() alternatively, which led to the slow > > > > > performance as I reported. > > > > > > > > To what end? What problem is PostgreSQL trying to solve with this > > > > scheme? I might be missing something but it seems like you've described > > > > the "what" in detail, but no "why." > > > > > > It's for better scalability. SInce the process who wants to extend the > > > table needs to hold an exclusive lock on the table, we need to > > > minimize the work while holding the lock. 
> >
> > Ok, but what is the reason for zeroing out the blocks prior to them being
> > written with real data? I'm wondering what the core requirement here is for
> > the zeroing, either via fallocate (which btw posix_fallocate does not
> > guarantee) or pwrites of zeros.
>
> The goal is to avoid ENOSPC at a later time. We do this before filling our own
> in-memory buffer pool with pages containing new contents. If we have dirty
> pages in our buffer that we can't write out due to ENOSPC, we're in trouble,
> because we can't checkpoint. Which typically will make the ENOSPC situation
> worse, because we also can't remove WAL / journal files without the checkpoint
> having succeeded. Of course a successful fallocate() / pwrite() doesn't
> guarantee that much on a COW filesystem, but there's not much we can do about
> that, to my knowledge.

Yup, which means you're screwed on XFS, ZFS and btrfs right now, and
also bcachefs when people start using it.

> Using fallocate() for small extensions is problematic because it (a) causes
> fragmentation and (b) disables delayed allocation; using pwrite() is also bad
> because the kernel will have to write out those dirty pages full of zeroes -
> very often we won't write out the page with "real content" before the kernel
> decides to do so.

Yes, that's why we allow fallocate() to preallocate space that extends
beyond the current EOF. i.e. for optimising layouts on append-based
workloads. posix_fallocate() does not allow that - it forces file size
extension, whilst a raw fallocate(FALLOC_FL_KEEP_SIZE) call will allow
preallocation anywhere beyond EOF without changing the file size.

IOWs, with FALLOC_FL_KEEP_SIZE you don't have to initialise buffer space
in memory to cover the preallocated space until you actually need to
extend the file and write to it. i.e. use fallocate(FALLOC_FL_KEEP_SIZE)
to preallocate chunks megabytes beyond the current EOF and then grow into
them with normal extending pwrite() calls.
When that preallocated space is used up, preallocate another large chunk
beyond EOF and continue onwards extending the file with your small
write()s...

> Hence using a heuristic to choose between the two. I think all that's needed
> here is a bit of tuning of the heuristic, possibly adding some "history"
> awareness.

No heuristics needed: just use FALLOC_FL_KEEP_SIZE and preallocate large
chunks beyond EOF each time. It works for both cases equally well, which
results in less code and is easier to understand. AFAIC, nobody should
ever use posix_fallocate() - it's impossible to know what it is doing
under the covers, or even know when it fails to provide you with any
guarantee at all (e.g. COW files).

> If we could opt into delayed allocation while avoiding ENOSPC for a certain
> length, it'd be perfect, but I don't think that's possible today?

Nope. Not desirable, either, because we currently need to have dirty data
in the page cache over delalloc regions.

> We're also working on using DIO FWIW, where using fallocate() is just about
> mandatory...

No, no it isn't. fallocate() is even more important to avoid with DIO
than buffered IO because fallocate() completely serialises *all* IO to
the file. That's the last thing you want with DIO given the only reason
for using DIO is to maximise IO concurrency and minimise IO latency to
individual files.

If you want to minimise fragmentation with DIO workloads, then you
should be using extent size hints of an appropriate size. That will
align and size extents to the hint regardless of fallocate/write
ranges, hence this controls worst case fragmentation effectively.

If you want enospc guarantees for future writes, then large, infrequent
fallocate(FALLOC_FL_KEEP_SIZE) calls should be used. Do not use this
mechanism as an anti-fragmentation mechanism, that's what extent size
hints are for. Use fallocate() as *little as possible*.
In my experience, fine grained management of file space by userspace applications via fallocate() is nothing but a recipe for awful performance, highly variable IO latency, bad file fragmentation, and poor filesystem aging characteristics. Just don't do it. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question on slow fallocate 2023-07-19 7:25 ` Dave Chinner @ 2023-07-19 20:29 ` Andres Freund 2023-07-19 20:38 ` Eric Sandeen 0 siblings, 1 reply; 20+ messages in thread From: Andres Freund @ 2023-07-19 20:29 UTC (permalink / raw) To: Dave Chinner; +Cc: Eric Sandeen, Masahiko Sawada, linux-xfs Hi, On 2023-07-19 17:25:37 +1000, Dave Chinner wrote: > On Tue, Jul 11, 2023 at 03:49:11PM -0700, Andres Freund wrote: > > The goal is to avoid ENOSPC at a later time. We do this before filling our own > > in-memory buffer pool with pages containing new contents. If we have dirty > > pages in our buffer that we can't write out due to ENOSPC, we're in trouble, > > because we can't checkpoint. Which typically will make the ENOSPC situation > > worse, because we also can't remove WAL / journal files without the checkpoint > > having succeeded. Of course a successful fallocate() / pwrite() doesn't > > guarantee that much on a COW filesystem, but there's not much we can do about > > that, to my knowledge. > > Yup, which means you're screwed on XFS, ZFS and btrfs right now, and > also bcachefs when people start using it. I'd be happy to hear of a better alternative... fallocate() should avoid ENOSPC on XFS unless snapshots trigger COW on a write, correct? > > Using fallocate() for small extensions is problematic because it a) causes > > We're also working on using DIO FWIW, where using fallocate() is just about > > mandatory... > > No, no it isn't. fallocate() is even more important to avoid with > DIO than buffered IO because fallocate() completely serialises *all* > IO to the file. That's the last thing you want with DIO given the > only reason for using DIO is to maximising IO concurrency and > minimise IO latency to individual files. Not using any form of preallocation (potentially via extent size hints as you mention below), when multiple files are being appended to simultaneously with DIO, does lead to terrifying levels of fragmentation on xfs. 
On a newly initialized xfs (mkfs.xfs version 6.3.0, 6.5.0-rc2):

rm -f fragtest-* && fio --minimal --name fragtest-1 --buffered=0 --filesize=128MB --fallocate=none --rw write --bs=$((4096*4)) --nrfiles=10
filefrag fragtest-1.0.*
fragtest-1.0.1: 8192 extents found
fragtest-1.0.2: 8192 extents found
fragtest-1.0.3: 8192 extents found
fragtest-1.0.4: 8192 extents found
fragtest-1.0.5: 8192 extents found
fragtest-1.0.6: 8192 extents found
fragtest-1.0.7: 8192 extents found
fragtest-1.0.8: 8192 extents found
fragtest-1.0.9: 8192 extents found

On a more "aged" filesystem, it's not quite as regular, but still above 7k
extents for all files. Similarly, if I use io_uring for more concurrent
IOs, there's a bit less fragmentation, presumably because sometimes two
IOs for the same file happen back to back.

Of course just writing four blocks at a time is a bit extreme - I wanted
to showcase the issue here - but even with somewhat bigger writes, the
problem is still severe. Writing multiple files at the same time is
extremely common for us (think of a table and its indexes, or multiple
partitions of a table being filled concurrently).

It looks to me like, with a single file being written, each write only
allocates a small extent, but the extent can be extended in subsequent
writes. But when 2+ files are being written, that rarely is possible,
because the space was already used for the other file(s).

> If you want to minimise fragmentation with DIO workloads, then you
> should be using extent size hints of an appropriate size. That will
> align and size extents to the hint regardless of fallocate/write
> ranges, hence this controls worst case fragmentation effectively.

That might be an option, but I'm not sure how realistic it is. Looks like
one can't adjust the extsize for a file with existing contents, if I see
this correctly. We don't know ahead of time how large the data will be,
so we can't just configure a large extsize and be done with that.
Given the above fragmentation behaviour, and the fact that extsizes can't
be adjusted, I don't really see how we can get away from using fallocate()
to avoid fragmentation. Then there's also the issue of extsize being
xfs-specific, without corresponding features in other filesystems...

> If you want enospc guarantees for future writes, then large,
> infrequent fallocate(FALLOC_FL_KEEP_SIZE) calls should be used. Do
> not use this mechanism as an anti-fragmentation mechanism, that's
> what extent size hints are for.

Is there documentation about extent size hints anywhere beyond the
paragraphs in ioctl_xfs_fsgetxattr(2)? I didn't find much...

> Use fallocate() as *little as possible*.
>
> In my experience, fine grained management of file space by userspace
> applications via fallocate() is nothing but a recipe for awful
> performance, highly variable IO latency, bad file fragmentation, and
> poor filesystem aging characteristics. Just don't do it.

I'd like to avoid it, but so far experience has shown that that causes
plenty of issues as well.

Somewhat tangential: I still would like a fallocate() option that actually
zeroes out new extents (via "write zeroes", if supported), rather than
just setting them up as unwritten extents. Not for "data" files, but for
WAL/journal files. Unwritten extent "conversion", or actually extending
the file, makes durable journal writes via O_DSYNC or fdatasync() unusably
slow. So one has to overwrite the file with zeroes "manually" - even
though "write zeroes" would often be more efficient.
rm -f durable-*; fio --buffered=0 --filesize=32MB --fallocate=1 --rw write --bs=$((8192)) --nrfiles=1 --ioengine io_uring --iodepth 16 --sync dsync --name durable-overwrite --overwrite 1 --name durable-nooverwrite --overwrite 0 --stonewall --name durable-nofallocate --overwrite 0 --fallocate 0 --stonewall

slow-ish nvme:

Run status group 0 (all jobs):
  WRITE: bw=45.1MiB/s (47.3MB/s), 45.1MiB/s-45.1MiB/s (47.3MB/s-47.3MB/s), io=32.0MiB (33.6MB), run=710-710msec
Run status group 1 (all jobs):
  WRITE: bw=3224KiB/s (3302kB/s), 3224KiB/s-3224KiB/s (3302kB/s-3302kB/s), io=32.0MiB (33.6MB), run=10163-10163msec
Run status group 2 (all jobs):
  WRITE: bw=2660KiB/s (2724kB/s), 2660KiB/s-2660KiB/s (2724kB/s-2724kB/s), io=32.0MiB (33.6MB), run=12320-12320msec

fast nvme:

Run status group 0 (all jobs):
  WRITE: bw=1600MiB/s (1678MB/s), 1600MiB/s-1600MiB/s (1678MB/s-1678MB/s), io=32.0MiB (33.6MB), run=20-20msec
Run status group 1 (all jobs):
  WRITE: bw=356MiB/s (373MB/s), 356MiB/s-356MiB/s (373MB/s-373MB/s), io=32.0MiB (33.6MB), run=90-90msec
Run status group 2 (all jobs):
  WRITE: bw=260MiB/s (273MB/s), 260MiB/s-260MiB/s (273MB/s-273MB/s), io=32.0MiB (33.6MB), run=123-123msec

Greetings,

Andres Freund

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question on slow fallocate 2023-07-19 20:29 ` Andres Freund @ 2023-07-19 20:38 ` Eric Sandeen 2023-07-19 20:49 ` Eric Sandeen 0 siblings, 1 reply; 20+ messages in thread From: Eric Sandeen @ 2023-07-19 20:38 UTC (permalink / raw) To: Andres Freund, Dave Chinner; +Cc: Eric Sandeen, Masahiko Sawada, linux-xfs On 7/19/23 3:29 PM, Andres Freund wrote: > Somewhat tangential: I still would like a fallocate() option that actually > zeroes out new extents (via "write zeroes", if supported), rather than just > setting them up as unwritten extents. Nor for "data" files, but for > WAL/journal files. Like this? fallocate(2): Zeroing file space Specifying the FALLOC_FL_ZERO_RANGE flag (available since Linux 3.15) in mode zeros space in the byte range starting at offset and continuing for len bytes. Under the covers, that uses efficient zeroing methods when available. -Eric ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question on slow fallocate
  2023-07-19 20:38           ` Eric Sandeen
@ 2023-07-19 20:49             ` Eric Sandeen
  2023-07-19 22:23               ` Andres Freund
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Sandeen @ 2023-07-19 20:49 UTC (permalink / raw)
To: sandeen, Andres Freund, Dave Chinner; +Cc: Masahiko Sawada, linux-xfs

On 7/19/23 3:38 PM, Eric Sandeen wrote:
> On 7/19/23 3:29 PM, Andres Freund wrote:
> > Somewhat tangential: I still would like a fallocate() option that
> > actually zeroes out new extents (via "write zeroes", if supported),
> > rather than just setting them up as unwritten extents. Not for "data"
> > files, but for WAL/journal files.
>
> Like this?
>
> fallocate(2):
>
>    Zeroing file space
>        Specifying the FALLOC_FL_ZERO_RANGE flag (available since
>        Linux 3.15) in mode zeros space in the byte range starting at
>        offset and continuing for len bytes.
>
> Under the covers, that uses efficient zeroing methods when available.
>
> -Eric

Hm sorry, it's been a while. Maybe I'm wrong about this; I know it does
efficient zeroing when pointed at a block device but I guess I've confused
myself about what happens on a filesystem like XFS that supports unwritten
extents.

-Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Question on slow fallocate
  2023-07-19 20:49             ` Eric Sandeen
@ 2023-07-19 22:23               ` Andres Freund
  0 siblings, 0 replies; 20+ messages in thread
From: Andres Freund @ 2023-07-19 22:23 UTC (permalink / raw)
To: Eric Sandeen; +Cc: sandeen, Dave Chinner, Masahiko Sawada, linux-xfs

Hi,

On 2023-07-19 15:49:59 -0500, Eric Sandeen wrote:
> On 7/19/23 3:38 PM, Eric Sandeen wrote:
> > On 7/19/23 3:29 PM, Andres Freund wrote:
> > > Somewhat tangential: I still would like a fallocate() option that
> > > actually zeroes out new extents (via "write zeroes", if supported),
> > > rather than just setting them up as unwritten extents. Not for
> > > "data" files, but for WAL/journal files.
> >
> > Like this?
> >
> > fallocate(2):
> >
> >    Zeroing file space
> >        Specifying the FALLOC_FL_ZERO_RANGE flag (available since
> >        Linux 3.15) in mode zeros space in the byte range starting at
> >        offset and continuing for len bytes.
> >
> > Under the covers, that uses efficient zeroing methods when available.
> >
> > -Eric
>
> Hm sorry, it's been a while. Maybe I'm wrong about this; I know it does
> efficient zeroing when pointed at a block device but I guess I've confused
> myself about what happens on a filesystem like XFS that supports unwritten
> extents.

Yea, it's documented to use unwritten extents:

       Zeroing is done within the filesystem preferably by converting the
       range into unwritten extents.  This approach means that the specified
       range will not be physically zeroed out on the device (except for
       partial blocks at the either end of the range), and I/O is
       (otherwise) required only to update metadata.

and an experiment confirms that:

$ dd if=/dev/zero of=test bs=1MB count=1
$ filefrag -v test
Filesystem type is: 58465342
File size of test is 1000000 (245 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     244:    6104864..   6105108:    245:             last,eof
test: 1 extent found
$ fallocate -z -o 0 -l $((4096*128)) test
$ filefrag -v test
Filesystem type is: 58465342
File size of test is 1000000 (245 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     127:    6105210..   6105337:    128:             unwritten
   1:      128..     244:    6104992..   6105108:    117:    6105338: last,eof

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Question on slow fallocate
  2023-06-23  0:47 ` Dave Chinner
  2023-06-23  8:29   ` Ritesh Harjani
  2023-06-26  3:17   ` Masahiko Sawada
@ 2023-07-11 22:28   ` Andres Freund
  2 siblings, 0 replies; 20+ messages in thread
From: Andres Freund @ 2023-07-11 22:28 UTC (permalink / raw)
To: Dave Chinner; +Cc: Masahiko Sawada, linux-xfs

Hi,

On 2023-06-23 10:47:43 +1000, Dave Chinner wrote:
> On Thu, Jun 22, 2023 at 02:34:18PM +0900, Masahiko Sawada wrote:
> > Hi all,
> >
> > When testing PostgreSQL, I found a performance degradation. After some
> > investigation, it ultimately reached the attached simple C program and
> > turned out that the performance degradation happens only on the xfs
> > filesystem (it happens on neither ext3 nor ext4). In short, the
> > program alternately does two things to extend a file: (1) call
> > posix_fallocate() to extend by 8192 bytes
>
> This is a well known anti-pattern - it always causes problems. Do
> not do this.

Postgres' actual behaviour is more complicated than Sawada-san's test
program. We either fallocate() multiple pages, or we use pwritev() to
extend by fewer pages. I think Sawada-san wrote it when trying to narrow
down a performance issue to the "problematic" interaction, perhaps
simplifying the real workload too much.

> As it is, using fallocate/pwrite like the test does is a well known
> anti-pattern:
>
>	error = fallocate(fd, off, len);
>	if (error == ENOSPC) {
>		/* abort write!!! */
>	}
>	error = pwrite(fd, off, len);
>	ASSERT(error != ENOSPC);
>	if (error) {
>		/* handle error */
>	}
>
> Why does the code need a call to fallocate() here to prevent ENOSPC in
> the pwrite() call?

The reason we need either fallocate() or pwrite() is to ensure we can
later write out the page from Postgres' buffer pool without hitting
ENOSPC (of course that's still not reliable for all filesystems...). We
don't want to use *write() for larger amounts of data, because that ends
up with the kernel actually needing to write out those pages, and there
never is any content in those newly extended pages. So for small file
extensions we use writes, and when it's more bulk work, we use
fallocate().

Having a dirty page in our buffer pool that we can't write out due to
ENOSPC is bad, as it prevents our checkpoints from ever succeeding. We
then either need to "crash" and replay the journal, or we can't
checkpoint, with all the issues that entails.

The performance issue at hand came about because the workload flipped
between extending by fallocate() and extending by write(); part of the
heuristic choosing between the two is the contention on the lock
protecting file extensions.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 20+ messages in thread
end of thread, other threads:[~2023-07-19 22:24 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2023-06-22  5:34 Question on slow fallocate Masahiko Sawada
2023-06-22  7:44 ` Wang Yugui
2023-06-22  8:18   ` Masahiko Sawada
2023-06-23  0:47 ` Dave Chinner
2023-06-23  8:29   ` Ritesh Harjani
2023-06-23 10:07     ` Dave Chinner
2023-06-23 11:49       ` Ritesh Harjani
2023-06-23 20:04         ` Eric Sandeen
2023-06-26  3:17   ` Masahiko Sawada
2023-06-26 15:32     ` Eric Sandeen
2023-06-27 15:50       ` Masahiko Sawada
2023-06-27 16:12         ` Eric Sandeen
2023-06-28  4:56           ` Christoph Hellwig
2023-07-11 22:49             ` Andres Freund
2023-07-19  7:25               ` Dave Chinner
2023-07-19 20:29                 ` Andres Freund
2023-07-19 20:38                   ` Eric Sandeen
2023-07-19 20:49                     ` Eric Sandeen
2023-07-19 22:23                       ` Andres Freund
2023-07-11 22:28   ` Andres Freund
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox