* Question on slow fallocate
@ 2023-06-22 5:34 Masahiko Sawada
2023-06-22 7:44 ` Wang Yugui
2023-06-23 0:47 ` Dave Chinner
0 siblings, 2 replies; 20+ messages in thread
From: Masahiko Sawada @ 2023-06-22 5:34 UTC (permalink / raw)
To: linux-xfs
[-- Attachment #1: Type: text/plain, Size: 1054 bytes --]
Hi all,
When testing PostgreSQL, I found a performance degradation. After some
investigation, it ultimately came down to the attached simple C
program, and it turned out that the degradation happens only on the
xfs filesystem (it happens on neither ext3 nor ext4). In short, the
program alternates between two ways of extending a file: (1) calling
posix_fallocate() to extend by 8192 bytes and (2) calling pwrite() to
extend by 8192 bytes. If I do only (1) or only (2), the program
completes in 2 sec, but if I alternate between (1) and (2), it takes
90 sec.
$ gcc -o test test.c
$ time ./test test.1 1
total 200000
fallocate 200000
filewrite 0
real 0m1.305s
user 0m0.050s
sys 0m1.255s
$ time ./test test.2 2
total 200000
fallocate 100000
filewrite 100000
real 1m29.222s
user 0m0.139s
sys 0m3.139s
Why does it take so long in the latter case? And are there any
workarounds or configuration changes to deal with it?
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
[-- Attachment #2: test.c --]
[-- Type: application/octet-stream, Size: 952 bytes --]
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>	/* S_IRWXU */

int
main(int argc, char **argv)
{
	char	*filename;
	int	ratio;
	char	block[8192] = {0};
	int	fd;
	off_t	total_len = 0;
	int	n_fallocate = 0;
	int	n_filewrite = 0;
	int	i;

	if (argc < 3)
	{
		fprintf(stderr, "usage: %s <filename> <ratio>\n", argv[0]);
		return 1;
	}
	filename = argv[1];
	ratio = atoi(argv[2]);

	fd = open(filename, O_RDWR | O_CREAT, S_IRWXU);
	if (fd < 0)
	{
		fprintf(stderr, "could not open file %s: %m\n", filename);
		return 1;
	}

	/*
	 * Extend the file by 8192 bytes, 200000 times.  Every ratio-th
	 * iteration extends via posix_fallocate(); the rest extend via
	 * pwrite().  With ratio == 1 every extension is a fallocate;
	 * with ratio == 2 the two methods alternate.
	 */
	for (i = 0; i < 200000; i++)
	{
		int ret;

		if (ratio != 0 && i % ratio == 0)
		{
			/* posix_fallocate returns an error number, not -1/errno */
			ret = posix_fallocate(fd, total_len, 8192);
			if (ret != 0)
			{
				fprintf(stderr, "posix_fallocate failed: %d\n", ret);
				return 1;
			}
			n_fallocate++;
		}
		else
		{
			if (pwrite(fd, block, 8192, total_len) != 8192)
			{
				fprintf(stderr, "pwrite failed: %m\n");
				return 1;
			}
			n_filewrite++;
		}
		total_len += 8192;
	}
	printf("total\t%d\n", i);
	printf("fallocate\t%d\n", n_fallocate);
	printf("filewrite\t%d\n", n_filewrite);
	close(fd);
	return 0;
}
^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question on slow fallocate
  2023-06-22  5:34 Question on slow fallocate Masahiko Sawada
@ 2023-06-22  7:44 ` Wang Yugui
  2023-06-22  8:18   ` Masahiko Sawada
  2023-06-23  0:47 ` Dave Chinner
  1 sibling, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2023-06-22  7:44 UTC (permalink / raw)
To: Masahiko Sawada; +Cc: linux-xfs

Hi,

> Hi all,
>
> When testing PostgreSQL, I found a performance degradation. After some
> investigation, it ultimately reached the attached simple C program and
> turned out that the performance degradation happens on only the xfs
> filesystem (doesn't happen on neither ext3 nor ext4). In short, the
> program alternately does two things to extend a file (1) call
> posix_fallocate() to extend by 8192 bytes and (2) call pwrite() to
> extend by 8192 bytes. If I do only either (1) or (2), the program is
> completed in 2 sec, but if I do (1) and (2) alternatively, it is
> completed in 90 sec.
>
> $ gcc -o test test.c
> $ time ./test test.1 1
> total 200000
> fallocate 200000
> filewrite 0
>
> real 0m1.305s
> user 0m0.050s
> sys 0m1.255s
>
> $ time ./test test.2 2
> total 200000
> fallocate 100000
> filewrite 100000
>
> real 1m29.222s
> user 0m0.139s
> sys 0m3.139s
>
> Why does it take so long in the latter case? and are there any
> workaround or configuration changes to deal with it?

I tested it on xfs, linux 6.1.35 and 6.4-rc7; the result is almost
the same.

$ time ./test test.1 1
real 0m1.382s

$ time ./test test.2 2
real 0m9.262s

Linux kernel version, please?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/06/22

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Question on slow fallocate
  2023-06-22  7:44 ` Wang Yugui
@ 2023-06-22  8:18   ` Masahiko Sawada
  0 siblings, 0 replies; 20+ messages in thread
From: Masahiko Sawada @ 2023-06-22  8:18 UTC (permalink / raw)
To: Wang Yugui; +Cc: linux-xfs

On Thu, Jun 22, 2023 at 4:44 PM Wang Yugui <wangyugui@e16-tech.com> wrote:
>
> Hi,
>
> [...]
>
> I test it on xfs linux 6.1.35 and 6.4-rc7
>
> the result is almost same.
>
> $ time ./test test.1 1
> real 0m1.382s
>
> $ time ./test test.2 2
> real 0m9.262s
>
> linunx kernel version please.

I test it on:

$ uname -r
6.1.29-50.88.amzn2023.x86_64

and

$ uname -r
4.18.0-372.9.1.el8.x86_64

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Question on slow fallocate
  2023-06-22  5:34 Question on slow fallocate Masahiko Sawada
  2023-06-22  7:44 ` Wang Yugui
@ 2023-06-23  0:47 ` Dave Chinner
  2023-06-23  8:29   ` Ritesh Harjani
                     ` (2 more replies)
  1 sibling, 3 replies; 20+ messages in thread
From: Dave Chinner @ 2023-06-23  0:47 UTC (permalink / raw)
To: Masahiko Sawada; +Cc: linux-xfs

On Thu, Jun 22, 2023 at 02:34:18PM +0900, Masahiko Sawada wrote:
> Hi all,
>
> When testing PostgreSQL, I found a performance degradation. After some
> investigation, it ultimately reached the attached simple C program and
> turned out that the performance degradation happens on only the xfs
> filesystem (doesn't happen on neither ext3 nor ext4). In short, the
> program alternately does two things to extend a file (1) call
> posix_fallocate() to extend by 8192 bytes

This is a well known anti-pattern - it always causes problems. Do
not do this.

> and (2) call pwrite() to
> extend by 8192 bytes. If I do only either (1) or (2), the program is
> completed in 2 sec, but if I do (1) and (2) alternatively, it is
> completed in 90 sec.

Well, yes. Using fallocate to extend the file has very different
constraints to using pwrite to extend the file.

> $ gcc -o test test.c
> $ time ./test test.1 1
> total 200000
> fallocate 200000
> filewrite 0

No data is written here, so this is just a series of 8kB allocations
and file size extension operations. There are no constraints here
because it is a pure metadata operation.

> real 0m1.305s
> user 0m0.050s
> sys 0m1.255s
>
> $ time ./test test.2 2
> total 200000
> fallocate 100000
> filewrite 100000
>
> real 1m29.222s
> user 0m0.139s
> sys 0m3.139s

Now we have fallocate extending the file and doing unwritten extent
allocation, followed by writing into that unwritten extent, which
then does unwritten extent conversion.

This introduces data vs metadata update ordering constraints to the
workload.

The problem here is the "truncate up" operation that fallocate does
to move the file size.
The "truncate up" is going to move the on-disk file size to the end
of the fallocated range via a journal transaction, and so it will
expose the range of the previous write as containing valid data.

However, the previous data write is still only in memory and not on
disk. The result of journalling the file size change is that if we
crash after the size change is made but the data is not on disk, we
end up with lost data - the file contains zeros (or NULLs) where the
in-memory data previously existed.

Go google for "NULL file data exposure" and you'll see this is a
problem we fixed in ~2006, caused by extending the file size on disk
without first having written all the in-memory data into the file.
And even though we fixed the problem over 15 years ago, we still hear
people today saying "XFS overwrites user data with NULLs!" as their
reason for never using XFS, even though this was never true in the
first place.

The result of users demanding that we prevent poorly written
applications from losing their data is that users get poor
performance when their applications are poorly written. i.e. they do
something that triggers the data integrity ordering constraints that
users demand we work within.

So, how to avoid the problem?
With 'posix_fallocate(fd, total_len, 8192);':

$ rm /mnt/scratch/foo ; time ./fwtest /mnt/scratch/foo 1
total 200000
fallocate 200000
filewrite 0

real 0m2.557s
user 0m0.025s
sys 0m2.531s

$ rm /mnt/scratch/foo ; time ./fwtest /mnt/scratch/foo 2
total 200000
fallocate 100000
filewrite 100000

real 0m39.564s
user 0m0.117s
sys 0m7.535s

With 'fallocate(fd, FALLOC_FL_KEEP_SIZE, total_len, 8192);':

$ rm /mnt/scratch/foo ; time ./fwtest /mnt/scratch/foo 1
total 200000
fallocate 200000
filewrite 0

real 0m2.269s
user 0m0.037s
sys 0m2.233s

$ rm /mnt/scratch/foo ; time ./fwtest /mnt/scratch/foo 2
total 200000
fallocate 100000
filewrite 100000

real 0m1.068s
user 0m0.028s
sys 0m1.040s

Yup, just stop fallocate() from extending the file size and leave
that to the pwrite() call that actually writes the data into the
file.

As it is, using fallocate/pwrite like the test does is a well known
anti-pattern:

	error = fallocate(fd, off, len);
	if (error == ENOSPC) {
		/* abort write!!! */
	}
	error = pwrite(fd, off, len);
	ASSERT(error != ENOSPC);
	if (error) {
		/* handle error */
	}

Why does the code need a call to fallocate() here to prevent ENOSPC
in the pwrite() call? The answer is that it *doesn't need to use
fallocate() here*. That is, the fallocate() ENOSPC check before the
space is allocated is exactly the same as the ENOSPC check done in
the pwrite() call to see if there is space for the write to proceed.

IOWs, the fallocate() call is *completely redundant*, yet it is
actively harmful to performance in the short term (as per the issue
in this thread) as well as being harmful to file fragmentation levels
and filesystem longevity, because it prevents the filesystem from
optimising away unnecessary allocations. i.e. it defeats delayed
allocation, which allows the filesystem to combine lots of small
sequential write() calls into a single big contiguous extent
allocation when the data is getting written to disk.
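For reference, a sketch of the KEEP_SIZE variant of the test's allocation path (this is an illustration following the call written above, not the exact patch used to produce the numbers; the helper name is made up here):

```c
#define _GNU_SOURCE		/* fallocate() is Linux-specific */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/falloc.h>	/* FALLOC_FL_KEEP_SIZE */

/*
 * Preallocate 'len' bytes at offset 'off' without moving the file
 * size.  The later pwrite() into this range extends i_size itself,
 * so the fallocate() call no longer triggers the data-vs-size
 * ordering flush described above.
 */
static int
prealloc_keep_size(int fd, off_t off, off_t len)
{
	return fallocate(fd, FALLOC_FL_KEEP_SIZE, off, len);
}
```
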
IOWs, using fallocate() in the way described in this test is a sign
of application developers not understanding what preallocation
actually does and the situations where it actually provides some kind
of benefit.

i.e. fallocate() is intended to allow applications to preallocate
space in large chunks long before it is needed, and still have it
available when the application actually needs to write to it. e.g.
preallocate 10MB at a time, and not have to run fallocate again until
the existing preallocated chunk is entirely used up by the next
thousand-odd 8KB writes that extend the file.

Using fallocate() as a replacement for "truncate up before write" is
*not a recommended use*.

> Why does it take so long in the latter case? and are there any
> workaround or configuration changes to deal with it?

Let pwrite() do the file extension because it natively handles data
vs metadata ordering without having to flush data to disk and wait
for it. i.e. do not use fallocate() as if it were ftruncate().

Also, do not use posix_fallocate() - it gives you no control over how
preallocation is done; use fallocate() directly. And if you must use
fallocate() before a write, use fallocate(fd, FALLOC_FL_KEEP_SIZE,
off, len) so that the file extension is done by the pwrite(), to
avoid any metadata/data ordering constraints that might exist with
non-data-write-related file size changes.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
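To make the chunked-preallocation usage recommended above concrete, here is a sketch (the 10MB chunk size is from the message above; the helper name and bookkeeping are illustrative, not from the thread):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/falloc.h>

#define PREALLOC_CHUNK	(10 * 1024 * 1024)	/* 10MB at a time */

/*
 * Make sure at least 'need' bytes of space are reserved at 'off'.
 * '*alloc_end' tracks the end of the current reservation; the file
 * size is only ever moved by the writes themselves, so most writes
 * pay no fallocate() cost at all.
 */
static int
ensure_prealloc(int fd, off_t off, off_t need, off_t *alloc_end)
{
	if (off + need <= *alloc_end)
		return 0;	/* still inside the current reservation */
	if (fallocate(fd, FALLOC_FL_KEEP_SIZE, *alloc_end, PREALLOC_CHUNK) < 0)
		return -1;	/* ENOSPC etc. - caller aborts the write */
	*alloc_end += PREALLOC_CHUNK;
	return 0;
}
```

With this shape, the ENOSPC check happens once per 10MB chunk rather than once per 8KB write, and the size extension stays with the writes, so neither the ordering flush nor delayed allocation is defeated.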
* Re: Question on slow fallocate
  2023-06-23  0:47 ` Dave Chinner
@ 2023-06-23  8:29   ` Ritesh Harjani
  2023-06-23 10:07     ` Dave Chinner
  2023-06-26  3:17   ` Masahiko Sawada
  2023-07-11 22:28   ` Andres Freund
  2 siblings, 1 reply; 20+ messages in thread
From: Ritesh Harjani @ 2023-06-23  8:29 UTC (permalink / raw)
To: Dave Chinner, Masahiko Sawada; +Cc: linux-xfs

Dave Chinner <david@fromorbit.com> writes:

> On Thu, Jun 22, 2023 at 02:34:18PM +0900, Masahiko Sawada wrote:
>
> [...]
>
> Now we have fallocate extending the file and doing unwritten extent
> allocation, followed by writing into that unwritten extent which
> then does unwritten extent conversion.
> This introduces data vs metadata update ordering constraints to the
> workload.
>
> The problem here in that the "truncate up" operation that
> fallocate is doing to move the file size. The "truncate up" is going
> to move the on-disk file size to the end of the fallocated range via
> a journal transaction, and so it will expose the range of the
> previous write as containing valid data.
>
> However, the previous data write is still only in memory and not on
> disk. The result of journalling the file size change is that if we
> crash after the size change is made but the data is not on disk,
> we end up with lost data - the file contains zeros (or NULLs) where
> the in memory data previously existed.
>
> Go google for "NULL file data exposure" and you'll see this is a
> problem we fixed in ~2006, caused by extending the file size on disk
> without first having written all the in-memory data into the file.

I guess here is the <patch> you are speaking of. So this prevents
exposing nulls within a file in case of a crash.

I guess the behavior is not the same with ext4. ext4 does not seem to
be doing filemap_write_and_wait_range() if the new i_disksize is more
than oldsize. So then I think ext4 must be ok if, in case of a crash,
the file has nulls in between. That's why I think the observation of
slow performance is not seen in ext4.

A few queries -

- If the user doesn't issue a flush and the system crashes, then it
  is not expected that the file will have all the data anyway, right?
- Also, about the "data/inode size update order" which you mention in
  this patch: is this something that all filesystems should follow?
- I was wondering what exactly it breaks that applications depend
  upon? Because not all filesystems tend to follow this practice,
  right?

Thanks for the detailed explanation! I got interested in this thread
after looking at your explanation and since the thread mentions this
happens with postgres.

-ritesh

<patch>
[XFS] Fix inode size update before data write in xfs_setattr

When changing the file size by a truncate() call, we log the change
in the inode size. However, we do not flush any outstanding data that
might not have been written to disk, thereby violating the data/inode
size update order. This can leave files full of NULLs on crash. Hence
if we are truncating the file, flush any unwritten data that may lie
between the current on disk inode size and the new inode size that is
being logged to ensure that ordering is preserved.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Question on slow fallocate
  2023-06-23  8:29   ` Ritesh Harjani
@ 2023-06-23 10:07     ` Dave Chinner
  2023-06-23 11:49       ` Ritesh Harjani
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2023-06-23 10:07 UTC (permalink / raw)
To: Ritesh Harjani; +Cc: Masahiko Sawada, linux-xfs

On Fri, Jun 23, 2023 at 01:59:58PM +0530, Ritesh Harjani wrote:
> Dave Chinner <david@fromorbit.com> writes:
>
> [...]
>
> I guess here is the <patch> you are speaking of. So this prevents from
> exposing nulls within a file in case of a crash.

Well, we're not really "exposing NULLs". No data got written before
the crash, so a read from that range after a crash will find a hole
or unwritten extents in the file and return zeros.

> I guess the behavior is not the same with ext4. ext4 does not seem to be
> doing filemap_write_and_wait_range() if the new i_disksize is more than
> oldsize.
> That's why I think the observation of slow
> performance is not seen in ext4.

ext4 also has a similar problem where crashes can lead to files full
of zeroes, and many of the mitigations they use were copied from the
XFS mitigations for the same problem. However, ext4 has a completely
different way of handling failures after truncate (via an orphan
list, IIRC), so it doesn't need to actually write the data to avoid
potential stale data exposure issues.

> Few queres-
> - If the user doesn't issue a flush and if the system crashes, then
>   anyways it is not expected that the file will have all the data right?

Correct.

> - Also is that "data/inode size update order" which you are mentioning in
>   this patch. Is this something that all filesystems should follow?

No, it's the specific fix for the inode size update ordering problem
that led to user-visible symptoms after a crash. We avoid the problem
in two ways now - first, we always journal inode size updates, and
second, we always write dependent data before we journal said size
updates.

> - I was wondering what exactly it breaks which the applications depend
>   upon? Because not all filesystems tend to follow this practice right?

The filesystems didn't break anything - applications failed to write
and/or overwrite data safely, and when they did this, data got lost.

However, because the same type of failure didn't result in data loss
on ext3, the data loss was considered by users and application
developers to be a filesystem bug, rather than the inevitable result
of an application failing to ensure the user's data was actually
written to the filesystem in a crash-safe manner.

i.e. users and application developers demanded that filesystems be
omnipotent and provide a higher level of data integrity than the
application/user asks them to provide. The result is that we provided
the higher level of data integrity that users demanded, but it came
at a cost....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Question on slow fallocate
  2023-06-23 10:07     ` Dave Chinner
@ 2023-06-23 11:49       ` Ritesh Harjani
  2023-06-23 20:04         ` Eric Sandeen
  0 siblings, 1 reply; 20+ messages in thread
From: Ritesh Harjani @ 2023-06-23 11:49 UTC (permalink / raw)
To: Dave Chinner; +Cc: Masahiko Sawada, linux-xfs

Dave Chinner <david@fromorbit.com> writes:

> On Fri, Jun 23, 2023 at 01:59:58PM +0530, Ritesh Harjani wrote:
>
> [...]
>
>> I guess here is the <patch> you are speaking of. So this prevents from
>> exposing nulls within a file in case of a crash.
>
> Well, we're not really "exposing NULLs". No data got written before
> the crash, so a read from that range after a crash will find a hole
> or unwritten extents in the file and return zeros.

Yes, I agree. I meant the "null file problem".

>> I guess the behavior is not the same with ext4. ext4 does not seem to be
>> doing filemap_write_and_wait_range() if the new i_disksize is more than
>> oldsize.
>> So then I think ext4 must be ok if in case of a crash the
>> file has nulls in between. That's why I think the observation of slow
>> performance is not seen in ext4.
>
> ext4 also has a similar problem issue where crashes can lead to
> files full of zeroes, and many of the mitigations they use were
> copied from the XFS mitigations for the same problem. However, ext4
> has a completely different way of handling failures after truncate
> (via an orphan list, IIRC) so it doesn't need to actually write
> the data to avoid potential stale data exposure issues.

Sorry, but I still haven't understood the real problem here for which
XFS does filemap_write_and_wait_range(). Is it a stale data exposure
problem?

Now, in this code here in fs/xfs/xfs_iops.c we refer to the problem
as "expose ourselves to the null files problem". What is the "expose
ourselves to the null files problem" here, for which we do
filemap_write_and_wait_range()?

	/*
	 * We are going to log the inode size change in this transaction so
	 * any previous writes that are beyond the on disk EOF and the new
	 * EOF that have not been written out need to be written here.  If we
	 * do not write the data out, we expose ourselves to the null files
	 * problem. Note that this includes any block zeroing we did above;
	 * otherwise those blocks may not be zeroed after a crash.
	 */
	if (did_zeroing ||
	    (newsize > ip->i_disk_size && oldsize != ip->i_disk_size)) {
		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
						ip->i_disk_size, newsize - 1);
		if (error)
			return error;
	}

Talking about ext4: it handles truncates to a file using orphan
handling, yes. In case the truncate operation spans multiple txns and
the crash happens, say, in the middle of a txn, then the subsequent
crash recovery will truncate the blocks spanning i_disksize.

But we aren't discussing shrinking here, right? We are doing pwrite
followed by fallocate to grow the file size. With pwrite we use
delalloc, so the blocks only get allocated at writeback time, and
with fallocate we allocate unwritten extents, so there should be no
stale data exposure problem in this case, right?

Hence my question was mainly to understand what "expose ourselves to
the null files problem" means in XFS.

Thanks!
-ritesh

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Question on slow fallocate
  2023-06-23 11:49       ` Ritesh Harjani
@ 2023-06-23 20:04         ` Eric Sandeen
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Sandeen @ 2023-06-23 20:04 UTC (permalink / raw)
To: Ritesh Harjani (IBM), Dave Chinner; +Cc: Masahiko Sawada, linux-xfs

On 6/23/23 6:49 AM, Ritesh Harjani (IBM) wrote:

> Sorry, but I still haven't understood the real problem here for which
> XFS does filemap_write_and_wait_range(). Is it a stale data exposure
> problem?

(Hopefully I get this right by trying to be helpful, here. It's been
a while).

Not really. IIRC the original problem was that the file size could
get updated (transactionally) before the delayed allocation and IO
happened at writeback time, leaving a hole before EOF where buffered
writes had failed to land before a crash. This is what people
originally called the "NULL files problem" because reading the hole
post-crash returned zeros. It wasn't stale data, it was no data.

Some commits that dealt with this explain it fairly well, I think:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c32676eea19ce29cb74dba0f97b085e83f6b8915
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ba87ea699ebd9dd577bf055ebc4a98200e337542

> Now, in this code here in fs/xfs/xfs_iops.c we refer to the problem as
> "expose ourselves to the null files problem".
> What is the "expose ourselves to the null files problem here"
> for which we do filemap_write_and_wait_range()?
>
> 	/*
> 	 * We are going to log the inode size change in this transaction so
> 	 * any previous writes that are beyond the on disk EOF and the new
> 	 * EOF that have not been written out need to be written here.  If we

i.e. force the writeback of any pending buffered IO into the hole
created up to the new EOF

> 	 * do not write the data out, we expose ourselves to the null files
> 	 * problem. Note that this includes any block zeroing we did above;
> 	 * otherwise those blocks may not be zeroed after a crash.

and I suppose this relates a little to stale data; IIRC this is
referring to zeroing partial blocks past the old EOF.

> 	 */
> 	if (did_zeroing ||
> 	    (newsize > ip->i_disk_size && oldsize != ip->i_disk_size)) {
> 		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> 						ip->i_disk_size, newsize - 1);
> 		if (error)
> 			return error;
> 	}
>
> Talking about ext4, it handles truncates to a file using orphan
> handline, yes. In case if the truncate operation spans multiple txns and
> if the crash happens say in the middle of a txn, then the subsequent crash
> recovery will truncate the blocks spanning i_disksize.
>
> But we aren't discussing shrinking here right. We are doing pwrite
> followed by fallocate to grow the file size. With pwrite we use delalloc
> so the blocks only get allocated during writeback time and with
> fallocate we will allocate unwritten extents, so there should be no
> stale data expose problem in this case right?

yeah, it's not a stale data problem. I think that the extended EOF
created by fallocate is being treated exactly the same as if we had
extended it with ftruncate(). Indeed, replacing the posix_fallocate
with ftruncate to the same size in the test program results in a
similarly slow run, slightly faster probably because unwritten
conversion doesn't have to happen in that case.

> Hence my question was to mainly understand what does "expose ourselves to
> the null files problem" means in XFS?

Hopefully the above explains it; that said, I'm not sure this is
anything more than academically interesting. As Dave mentioned,
fallocating tiny space and then writing into it is not at all the
recommended or efficient use of fallocate.

The one thing I'm not remembering exactly here is why we have the
heuristic that a truncate up requires flushing all pending data
behind it.
I *think* it's because most users knew enough to expect buffered writes could be lost on a crash, but they expected to see valid data up to the on-disk EOF post-crash. Without this heuristic, they'd get some valid data that made it out followed by a hole ("NULLS") up to the new EOF, and they Did Not Like It. -Eric ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question on slow fallocate 2023-06-23 0:47 ` Dave Chinner 2023-06-23 8:29 ` Ritesh Harjani @ 2023-06-26 3:17 ` Masahiko Sawada 2023-06-26 15:32 ` Eric Sandeen 2023-07-11 22:28 ` Andres Freund 2 siblings, 1 reply; 20+ messages in thread From: Masahiko Sawada @ 2023-06-26 3:17 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Fri, Jun 23, 2023 at 9:47 AM Dave Chinner <david@fromorbit.com> wrote: > > On Thu, Jun 22, 2023 at 02:34:18PM +0900, Masahiko Sawada wrote: > > Hi all, > > > > When testing PostgreSQL, I found a performance degradation. After some > > investigation, it ultimately reached the attached simple C program and > > turned out that the performance degradation happens on only the xfs > > filesystem (doesn't happen on neither ext3 nor ext4). In short, the > > program alternately does two things to extend a file (1) call > > posix_fallocate() to extend by 8192 bytes > > This is a well known anti-pattern - it always causes problems. Do > not do this. > > > and (2) call pwrite() to > > extend by 8192 bytes. If I do only either (1) or (2), the program is > > completed in 2 sec, but if I do (1) and (2) alternatively, it is > > completed in 90 sec. > > Well, yes. Using fallocate to extend the file has very different > constraints to using pwrite to extend the file. > > > $ gcc -o test test.c > > $ time ./test test.1 1 > > total 200000 > > fallocate 200000 > > filewrite 0 > > No data is written here, so this is just a series of 8kB allocations > and file size extension operations. There are no constraints here > because it is a pure metadata operation. > > > real 0m1.305s > > user 0m0.050s > > sys 0m1.255s > > > > $ time ./test test.2 2 > > total 200000 > > fallocate 100000 > > filewrite 100000 > > > > real 1m29.222s > > user 0m0.139s > > sys 0m3.139s > > Now we have fallocate extending the file and doing unwritten extent > allocation, followed by writing into that unwritten extent which > then does unwritten extent conversion. 
>
> This introduces data vs metadata update ordering constraints to the
> workload.
>
> The problem here is the "truncate up" operation that
> fallocate is doing to move the file size. The "truncate up" is going
> to move the on-disk file size to the end of the fallocated range via
> a journal transaction, and so it will expose the range of the
> previous write as containing valid data.
>
> However, the previous data write is still only in memory and not on
> disk. The result of journalling the file size change is that if we
> crash after the size change is made but the data is not on disk,
> we end up with lost data - the file contains zeros (or NULLs) where
> the in-memory data previously existed.
>
> Go google for "NULL file data exposure" and you'll see this is a
> problem we fixed in ~2006, caused by extending the file size on disk
> without first having written all the in-memory data into the file.
> And even though we fixed the problem over 15 years ago, we still
> hear people today saying "XFS overwrites user data with NULLs!" as
> their reason for never using XFS, even though this was never true in
> the first place...
>
> The result of users demanding that we prevent poorly written
> applications from losing their data is that users get poor
> performance when their applications are poorly written. i.e. they do
> something that triggers the data integrity ordering constraints that
> users demand we work within.

Thank you for the detailed explanation.

> So, how to avoid the problem?
>
> With 'posix_fallocate(fd, total_len, 8192);':
>
> $ rm /mnt/scratch/foo ; time ./fwtest /mnt/scratch/foo 1
> total 200000
> fallocate 200000
> filewrite 0
>
> real 0m2.557s
> user 0m0.025s
> sys 0m2.531s
>
> $ rm /mnt/scratch/foo ; time ./fwtest /mnt/scratch/foo 2
> total 200000
> fallocate 100000
> filewrite 100000
>
> real 0m39.564s
> user 0m0.117s
> sys 0m7.535s
>
> With 'fallocate(fd, FALLOC_FL_KEEP_SIZE, total_len, 8192);':
>
> $ rm /mnt/scratch/foo ; time ./fwtest /mnt/scratch/foo 1
> total 200000
> fallocate 200000
> filewrite 0
>
> real 0m2.269s
> user 0m0.037s
> sys 0m2.233s
>
> $ rm /mnt/scratch/foo ; time ./fwtest /mnt/scratch/foo 2
> total 200000
> fallocate 100000
> filewrite 100000
>
> real 0m1.068s
> user 0m0.028s
> sys 0m1.040s
>
> Yup, just stop fallocate() from extending the file size and leave
> that to the pwrite() call that actually writes the data into the
> file.
>
> As it is, using fallocate/pwrite like the test does is a well known
> anti-pattern:
>
> error = fallocate(fd, off, len);
> if (error == ENOSPC) {
>         /* abort write!!! */
> }
> error = pwrite(fd, off, len);
> ASSERT(error != ENOSPC);
> if (error) {
>         /* handle error */
> }

The test.c and what PostgreSQL does are slightly different from the above
pattern, actually: it calls fallocate and pwrite for different 8kB blocks.
For example, it calls fallocate to extend the file from 0 bytes to 8192
bytes, and then calls pwrite to extend the file from 8192 bytes to 16384
bytes. But it's also not a recommended use, right?

> Why does the code need a call to fallocate() here to prevent ENOSPC in the
> pwrite() call?
>
> The answer here is that it *doesn't need to use fallocate() here*.
> That is, the fallocate() ENOSPC check before the space is allocated
> is exactly the same as the ENOSPC check done in the pwrite() call to
> see if there is space for the write to proceed.
>
> IOWs, the fallocate() call is *completely redundant*, yet it is
> actively harmful to performance in the short term (as per the
> issue in this thread) as well as being harmful for file fragmentation
> levels and filesystem longevity, because it prevents the filesystem
> from optimising away unnecessary allocations. i.e. it defeats
> delayed allocation, which allows the filesystem to combine lots of
> small sequential write() calls into a single big contiguous extent
> allocation when the data is getting written to disk.
>
> IOWs, using fallocate() in the way described in this test is a sign
> of application developers not understanding what preallocation
> actually does and the situations where it actually provides some
> kind of benefit.
>
> i.e. fallocate() is intended to allow applications to preallocate
> space in large chunks long before it is needed, and still have it
> available when the application actually needs to write to it. e.g.
> preallocate 10MB at a time, and not have to run fallocate again until
> the existing preallocated chunk is entirely used up by the next
> thousand 8KB writes that extend the file.
>
> Using fallocate() as a replacement for "truncate up before write" is
> *not a recommended use*.

FYI, to share the background of what PostgreSQL does: when bulk insertions
into one table are running concurrently, one process extends the
underlying files depending on how many concurrent processes are waiting to
extend. The more processes wait, the more 8kB blocks are appended. In the
current implementation, if the process needs to extend the table by more
than 8 blocks (i.e. 64kB) it uses posix_fallocate(), otherwise it uses
pwrite() (see the code[1] for details). We don't use fallocate() for small
extensions as it's slow on some filesystems. Therefore, if a
bulk-insertion process tries to extend the table by say 5~10 blocks many
times, it could use posix_fallocate() and pwrite() alternately, which led
to the slow performance as I reported.
>
> > Why does it take so long in the latter case? and are there any
> > workaround or configuration changes to deal with it?
>
> Let pwrite() do the file extension because it natively handles data
> vs metadata ordering without having to flush data to disk and wait
> for it. i.e. do not use fallocate() as if it is ftruncate(). Also,
> do not use posix_fallocate() - it gives you no control over how
> preallocation is done; use fallocate() directly. And if you must use
> fallocate() before a write, use fallocate(fd, FALLOC_FL_KEEP_SIZE,
> off, len) so that the file extension is done by the pwrite(), to
> avoid any metadata/data ordering constraints that might exist with
> non-data-write-related file size changes.

Thanks. Wang Yugui reported that this slow performance seems not to happen
on newer kernel versions, but is that right?

Fortunately, this behavior is still in beta (PG16 beta). I will discuss
alternative solutions in the PostgreSQL community.

Regards,

[1] https://github.com/postgres/postgres/blob/master/src/backend/storage/smgr/md.c#L577

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question on slow fallocate 2023-06-26 3:17 ` Masahiko Sawada @ 2023-06-26 15:32 ` Eric Sandeen 2023-06-27 15:50 ` Masahiko Sawada 0 siblings, 1 reply; 20+ messages in thread From: Eric Sandeen @ 2023-06-26 15:32 UTC (permalink / raw) To: Masahiko Sawada, Dave Chinner; +Cc: linux-xfs On 6/25/23 10:17 PM, Masahiko Sawada wrote: > FYI, to share the background of what PostgreSQL does, when > bulk-insertions into one table are running concurrently, one process > extends the underlying files depending on how many concurrent > processes are waiting to extend. The more processes wait, the more 8kB > blocks are appended. As the current implementation, if the process > needs to extend the table by more than 8 blocks (i.e. 64kB) it uses > posix_fallocate(), otherwise it uses pwrites() (see the code[1] for > details). We don't use fallocate() for small extensions as it's slow > on some filesystems. Therefore, if a bulk-insertion process tries to > extend the table by say 5~10 blocks many times, it could use > poxis_fallocate() and pwrite() alternatively, which led to the slow > performance as I reported. To what end? What problem is PostgreSQL trying to solve with this scheme? I might be missing something but it seems like you've described the "what" in detail, but no "why." -Eric ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question on slow fallocate 2023-06-26 15:32 ` Eric Sandeen @ 2023-06-27 15:50 ` Masahiko Sawada 2023-06-27 16:12 ` Eric Sandeen 0 siblings, 1 reply; 20+ messages in thread
From: Masahiko Sawada @ 2023-06-27 15:50 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Dave Chinner, linux-xfs

On Tue, Jun 27, 2023 at 12:32 AM Eric Sandeen <sandeen@sandeen.net> wrote:
>
> On 6/25/23 10:17 PM, Masahiko Sawada wrote:
> > FYI, to share the background of what PostgreSQL does, when
> > bulk-insertions into one table are running concurrently, one process
> > extends the underlying files depending on how many concurrent
> > processes are waiting to extend. The more processes wait, the more 8kB
> > blocks are appended. In the current implementation, if the process
> > needs to extend the table by more than 8 blocks (i.e. 64kB) it uses
> > posix_fallocate(), otherwise it uses pwrite() (see the code[1] for
> > details). We don't use fallocate() for small extensions as it's slow
> > on some filesystems. Therefore, if a bulk-insertion process tries to
> > extend the table by say 5~10 blocks many times, it could use
> > posix_fallocate() and pwrite() alternately, which led to the slow
> > performance as I reported.
>
> To what end? What problem is PostgreSQL trying to solve with this
> scheme? I might be missing something but it seems like you've described
> the "what" in detail, but no "why."

It's for better scalability. Since the process that wants to extend the
table needs to hold an exclusive lock on the table, we need to minimize
the work while holding the lock.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question on slow fallocate 2023-06-27 15:50 ` Masahiko Sawada @ 2023-06-27 16:12 ` Eric Sandeen 2023-06-28 4:56 ` Christoph Hellwig 2023-07-11 22:49 ` Andres Freund 0 siblings, 2 replies; 20+ messages in thread From: Eric Sandeen @ 2023-06-27 16:12 UTC (permalink / raw) To: Masahiko Sawada; +Cc: Dave Chinner, linux-xfs On 6/27/23 10:50 AM, Masahiko Sawada wrote: > On Tue, Jun 27, 2023 at 12:32 AM Eric Sandeen <sandeen@sandeen.net> wrote: >> >> On 6/25/23 10:17 PM, Masahiko Sawada wrote: >>> FYI, to share the background of what PostgreSQL does, when >>> bulk-insertions into one table are running concurrently, one process >>> extends the underlying files depending on how many concurrent >>> processes are waiting to extend. The more processes wait, the more 8kB >>> blocks are appended. As the current implementation, if the process >>> needs to extend the table by more than 8 blocks (i.e. 64kB) it uses >>> posix_fallocate(), otherwise it uses pwrites() (see the code[1] for >>> details). We don't use fallocate() for small extensions as it's slow >>> on some filesystems. Therefore, if a bulk-insertion process tries to >>> extend the table by say 5~10 blocks many times, it could use >>> poxis_fallocate() and pwrite() alternatively, which led to the slow >>> performance as I reported. >> >> To what end? What problem is PostgreSQL trying to solve with this >> scheme? I might be missing something but it seems like you've described >> the "what" in detail, but no "why." > > It's for better scalability. SInce the process who wants to extend the > table needs to hold an exclusive lock on the table, we need to > minimize the work while holding the lock. Ok, but what is the reason for zeroing out the blocks prior to them being written with real data? I'm wondering what the core requirement here is for the zeroing, either via fallocate (which btw posix_fallocate does not guarantee) or pwrites of zeros. Thanks, -Eric ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question on slow fallocate 2023-06-27 16:12 ` Eric Sandeen @ 2023-06-28 4:56 ` Christoph Hellwig 2023-07-11 22:49 ` Andres Freund 1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2023-06-28 4:56 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Masahiko Sawada, Dave Chinner, linux-xfs

On Tue, Jun 27, 2023 at 11:12:01AM -0500, Eric Sandeen wrote:
> Ok, but what is the reason for zeroing out the blocks prior to them being
> written with real data? I'm wondering what the core requirement here is for
> the zeroing, either via fallocate (which btw posix_fallocate does not
> guarantee) or pwrites of zeros.

Note that even a plain truncate will zero the data visible to the user.

I could see this trying to reduce fragmentation by making sure the whole
file extension is allocated together instead of split up as the different
processes write their areas. But is there data showing this fragmentation
even happens and actually hurts?

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question on slow fallocate 2023-06-27 16:12 ` Eric Sandeen 2023-06-28 4:56 ` Christoph Hellwig @ 2023-07-11 22:49 ` Andres Freund 2023-07-19 7:25 ` Dave Chinner 1 sibling, 1 reply; 20+ messages in thread From: Andres Freund @ 2023-07-11 22:49 UTC (permalink / raw) To: Eric Sandeen; +Cc: Masahiko Sawada, Dave Chinner, linux-xfs Hi, On 2023-06-27 11:12:01 -0500, Eric Sandeen wrote: > On 6/27/23 10:50 AM, Masahiko Sawada wrote: > > On Tue, Jun 27, 2023 at 12:32 AM Eric Sandeen <sandeen@sandeen.net> wrote: > > > > > > On 6/25/23 10:17 PM, Masahiko Sawada wrote: > > > > FYI, to share the background of what PostgreSQL does, when > > > > bulk-insertions into one table are running concurrently, one process > > > > extends the underlying files depending on how many concurrent > > > > processes are waiting to extend. The more processes wait, the more 8kB > > > > blocks are appended. As the current implementation, if the process > > > > needs to extend the table by more than 8 blocks (i.e. 64kB) it uses > > > > posix_fallocate(), otherwise it uses pwrites() (see the code[1] for > > > > details). We don't use fallocate() for small extensions as it's slow > > > > on some filesystems. Therefore, if a bulk-insertion process tries to > > > > extend the table by say 5~10 blocks many times, it could use > > > > poxis_fallocate() and pwrite() alternatively, which led to the slow > > > > performance as I reported. > > > > > > To what end? What problem is PostgreSQL trying to solve with this > > > scheme? I might be missing something but it seems like you've described > > > the "what" in detail, but no "why." > > > > It's for better scalability. SInce the process who wants to extend the > > table needs to hold an exclusive lock on the table, we need to > > minimize the work while holding the lock. > > Ok, but what is the reason for zeroing out the blocks prior to them being > written with real data? 
I'm wondering what the core requirement here is for
> the zeroing, either via fallocate (which btw posix_fallocate does not
> guarantee) or pwrites of zeros.

The goal is to avoid ENOSPC at a later time. We do this before filling our
own in-memory buffer pool with pages containing new contents. If we have
dirty pages in our buffer that we can't write out due to ENOSPC, we're in
trouble, because we can't checkpoint. Which typically will make the ENOSPC
situation worse, because we also can't remove WAL / journal files without
the checkpoint having succeeded. Of course a successful fallocate() /
pwrite() doesn't guarantee that much on a COW filesystem, but there's not
much we can do about that, to my knowledge.

Using fallocate() for small extensions is problematic because it (a)
causes fragmentation and (b) disables delayed allocation; using pwrite()
is also bad because the kernel will have to write out those dirty pages
full of zeroes - very often we won't write out the page with "real
content" before the kernel decides to do so. Hence using a heuristic to
choose between the two. I think all that's needed here is a bit of tuning
of the heuristic, possibly adding some "history" awareness.

If we could opt into delayed allocation while avoiding ENOSPC for a
certain length, it'd be perfect, but I don't think that's possible today?

We're also working on using DIO FWIW, where using fallocate() is just
about mandatory...

Greetings,

Andres Freund

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question on slow fallocate 2023-07-11 22:49 ` Andres Freund @ 2023-07-19 7:25 ` Dave Chinner 2023-07-19 20:29 ` Andres Freund 0 siblings, 1 reply; 20+ messages in thread From: Dave Chinner @ 2023-07-19 7:25 UTC (permalink / raw) To: Andres Freund; +Cc: Eric Sandeen, Masahiko Sawada, linux-xfs On Tue, Jul 11, 2023 at 03:49:11PM -0700, Andres Freund wrote: > Hi, > > On 2023-06-27 11:12:01 -0500, Eric Sandeen wrote: > > On 6/27/23 10:50 AM, Masahiko Sawada wrote: > > > On Tue, Jun 27, 2023 at 12:32 AM Eric Sandeen <sandeen@sandeen.net> wrote: > > > > > > > > On 6/25/23 10:17 PM, Masahiko Sawada wrote: > > > > > FYI, to share the background of what PostgreSQL does, when > > > > > bulk-insertions into one table are running concurrently, one process > > > > > extends the underlying files depending on how many concurrent > > > > > processes are waiting to extend. The more processes wait, the more 8kB > > > > > blocks are appended. As the current implementation, if the process > > > > > needs to extend the table by more than 8 blocks (i.e. 64kB) it uses > > > > > posix_fallocate(), otherwise it uses pwrites() (see the code[1] for > > > > > details). We don't use fallocate() for small extensions as it's slow > > > > > on some filesystems. Therefore, if a bulk-insertion process tries to > > > > > extend the table by say 5~10 blocks many times, it could use > > > > > poxis_fallocate() and pwrite() alternatively, which led to the slow > > > > > performance as I reported. > > > > > > > > To what end? What problem is PostgreSQL trying to solve with this > > > > scheme? I might be missing something but it seems like you've described > > > > the "what" in detail, but no "why." > > > > > > It's for better scalability. SInce the process who wants to extend the > > > table needs to hold an exclusive lock on the table, we need to > > > minimize the work while holding the lock. 
> >
> > Ok, but what is the reason for zeroing out the blocks prior to them being
> > written with real data? I'm wondering what the core requirement here is for
> > the zeroing, either via fallocate (which btw posix_fallocate does not
> > guarantee) or pwrites of zeros.
>
> The goal is to avoid ENOSPC at a later time. We do this before filling our own
> in-memory buffer pool with pages containing new contents. If we have dirty
> pages in our buffer that we can't write out due to ENOSPC, we're in trouble,
> because we can't checkpoint. Which typically will make the ENOSPC situation
> worse, because we also can't remove WAL / journal files without the checkpoint
> having succeeded. Of course a successful fallocate() / pwrite() doesn't
> guarantee that much on a COW filesystem, but there's not much we can do about
> that, to my knowledge.

Yup, which means you're screwed on XFS, ZFS and btrfs right now, and
also bcachefs when people start using it.

> Using fallocate() for small extensions is problematic because it (a) causes
> fragmentation and (b) disables delayed allocation; using pwrite() is also bad
> because the kernel will have to write out those dirty pages full of zeroes -
> very often we won't write out the page with "real content" before the kernel
> decides to do so.

Yes, that's why we allow fallocate() to preallocate space that extends
beyond the current EOF. i.e. for optimising layouts on append-based
workloads. posix_fallocate() does not allow that - it forces file size
extension, whilst a raw fallocate(FALLOC_FL_KEEP_SIZE) call will allow
preallocation anywhere beyond EOF without changing the file size.

IOWs, with FALLOC_FL_KEEP_SIZE you don't have to initialise buffer space
in memory to cover the preallocated space until you actually need to
extend the file and write to it. i.e. use fallocate(FALLOC_FL_KEEP_SIZE)
to preallocate chunks megabytes beyond the current EOF and then grow into
them with normal extending pwrite() calls.
When that preallocated space is used up, preallocate another large chunk
beyond EOF and continue onwards extending the file with your small
write()s...

> Hence using a heuristic to choose between the two. I think all that's needed
> here is a bit of tuning of the heuristic, possibly adding some "history"
> awareness.

No heuristics needed: just use FALLOC_FL_KEEP_SIZE and preallocate large
chunks beyond EOF each time. It works for both cases equally well, which
results in less code and is easier to understand. AFAIC, nobody should
ever use posix_fallocate() - it's impossible to know what it is doing
under the covers, or even know when it fails to provide you with any
guarantee at all (e.g. COW files).

> If we could opt into delayed allocation while avoiding ENOSPC for a certain
> length, it'd be perfect, but I don't think that's possible today?

Nope. Not desirable, either, because we currently need to have dirty data
in the page cache over delalloc regions.

> We're also working on using DIO FWIW, where using fallocate() is just about
> mandatory...

No, no it isn't. fallocate() is even more important to avoid with DIO
than buffered IO because fallocate() completely serialises *all* IO to
the file. That's the last thing you want with DIO given the only reason
for using DIO is to maximise IO concurrency and minimise IO latency to
individual files.

If you want to minimise fragmentation with DIO workloads, then you
should be using extent size hints of an appropriate size. That will
align and size extents to the hint regardless of fallocate/write
ranges, hence this controls worst case fragmentation effectively.

If you want enospc guarantees for future writes, then large, infrequent
fallocate(FALLOC_FL_KEEP_SIZE) calls should be used. Do not use this
mechanism as an anti-fragmentation mechanism, that's what extent size
hints are for. Use fallocate() as *little as possible*.
In my experience, fine grained management of file space by userspace applications via fallocate() is nothing but a recipe for awful performance, highly variable IO latency, bad file fragmentation, and poor filesystem aging characteristics. Just don't do it. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question on slow fallocate 2023-07-19 7:25 ` Dave Chinner @ 2023-07-19 20:29 ` Andres Freund 2023-07-19 20:38 ` Eric Sandeen 0 siblings, 1 reply; 20+ messages in thread From: Andres Freund @ 2023-07-19 20:29 UTC (permalink / raw) To: Dave Chinner; +Cc: Eric Sandeen, Masahiko Sawada, linux-xfs Hi, On 2023-07-19 17:25:37 +1000, Dave Chinner wrote: > On Tue, Jul 11, 2023 at 03:49:11PM -0700, Andres Freund wrote: > > The goal is to avoid ENOSPC at a later time. We do this before filling our own > > in-memory buffer pool with pages containing new contents. If we have dirty > > pages in our buffer that we can't write out due to ENOSPC, we're in trouble, > > because we can't checkpoint. Which typically will make the ENOSPC situation > > worse, because we also can't remove WAL / journal files without the checkpoint > > having succeeded. Of course a successful fallocate() / pwrite() doesn't > > guarantee that much on a COW filesystem, but there's not much we can do about > > that, to my knowledge. > > Yup, which means you're screwed on XFS, ZFS and btrfs right now, and > also bcachefs when people start using it. I'd be happy to hear of a better alternative... fallocate() should avoid ENOSPC on XFS unless snapshots trigger COW on a write, correct? > > Using fallocate() for small extensions is problematic because it a) causes > > We're also working on using DIO FWIW, where using fallocate() is just about > > mandatory... > > No, no it isn't. fallocate() is even more important to avoid with > DIO than buffered IO because fallocate() completely serialises *all* > IO to the file. That's the last thing you want with DIO given the > only reason for using DIO is to maximising IO concurrency and > minimise IO latency to individual files. Not using any form of preallocation (potentially via extent size hints as you mention below), when multiple files are being appended to simultaneously with DIO, does lead to terrifying levels of fragmentation on xfs. 
On a newly initialized xfs (mkfs.xfs version 6.3.0, 6.5.0-rc2):

rm -f fragtest-* && fio --minimal --name fragtest-1 --buffered=0 --filesize=128MB --fallocate=none --rw write --bs=$((4096*4)) --nrfiles=10
filefrag fragtest-1.0.*
fragtest-1.0.1: 8192 extents found
fragtest-1.0.2: 8192 extents found
fragtest-1.0.3: 8192 extents found
fragtest-1.0.4: 8192 extents found
fragtest-1.0.5: 8192 extents found
fragtest-1.0.6: 8192 extents found
fragtest-1.0.7: 8192 extents found
fragtest-1.0.8: 8192 extents found
fragtest-1.0.9: 8192 extents found

On a more "aged" filesystem, it's not quite as regular, but still above 7k
extents for all files. Similarly, if I use io_uring for more concurrent
IOs, there's a bit less fragmentation, presumably because sometimes two
IOs for the same file happen back to back.

Of course just writing four blocks at a time is a bit extreme - I wanted
to showcase the issue here - but even with somewhat bigger writes, the
problem is still severe. Writing multiple files at the same time is
extremely common for us (think of a table and its indexes, or multiple
partitions of a table being filled concurrently).

It looks to me like, with a single file being written, each write only
allocates a small extent, but the extent can be extended in subsequent
writes. But when 2+ files are being written, that rarely is possible,
because the space was already used for the other file(s).

> If you want to minimise fragmentation with DIO workloads, then you
> should be using extent size hints of an appropriate size. That will
> align and size extents to the hint regardless of fallocate/write
> ranges, hence this controls worst case fragmentation effectively.

That might be an option, but I'm not sure how realistic it is. Looks like
one can't adjust the extsize for a file with existing contents, if I see
this correctly. We don't know ahead of time how large the data will be,
so we can't just configure a large extsize and be done with that.
Given the above fragmentation behaviour, and the fact that extsizes can't
be adjusted, I don't really see how we can get away from using fallocate()
to avoid fragmentation. Then there's also the issue of extsize being
xfs-specific, without corresponding features in other filesystems...

> If you want enospc guarantees for future writes, then large,
> infrequent fallocate(FALLOC_FL_KEEP_SIZE) calls should be used. Do
> not use this mechanism as an anti-fragmentation mechanism, that's
> what extent size hints are for.

Is there documentation about extent size hints anywhere beyond the
paragraphs in ioctl_xfs_fsgetxattr(2)? I didn't find much...

> Use fallocate() as *little as possible*.
>
> In my experience, fine grained management of file space by userspace
> applications via fallocate() is nothing but a recipe for awful
> performance, highly variable IO latency, bad file fragmentation, and
> poor filesystem aging characteristics. Just don't do it.

I'd like to avoid it, but so far experience has shown that that causes
plenty of issues as well.

Somewhat tangential: I still would like a fallocate() option that actually
zeroes out new extents (via "write zeroes", if supported), rather than
just setting them up as unwritten extents. Not for "data" files, but for
WAL/journal files. Unwritten extent "conversion", or actually extending
the file, makes durable journal writes via O_DSYNC or fdatasync() unusably
slow. So one has to overwrite the file with zeroes "manually" - even
though "write zeroes" would often be more efficient.
rm -f durable-*; fio --buffered=0 --filesize=32MB --fallocate=1 --rw write --bs=$((8192)) --nrfiles=1 --ioengine io_uring --iodepth 16 --sync dsync --name durable-overwrite --overwrite 1 --name durable-nooverwrite --overwrite 0 --stonewall --name durable-nofallocate --overwrite 0 --fallocate 0 --stonewall

slow-ish nvme:

Run status group 0 (all jobs):
  WRITE: bw=45.1MiB/s (47.3MB/s), 45.1MiB/s-45.1MiB/s (47.3MB/s-47.3MB/s), io=32.0MiB (33.6MB), run=710-710msec
Run status group 1 (all jobs):
  WRITE: bw=3224KiB/s (3302kB/s), 3224KiB/s-3224KiB/s (3302kB/s-3302kB/s), io=32.0MiB (33.6MB), run=10163-10163msec
Run status group 2 (all jobs):
  WRITE: bw=2660KiB/s (2724kB/s), 2660KiB/s-2660KiB/s (2724kB/s-2724kB/s), io=32.0MiB (33.6MB), run=12320-12320msec

fast nvme:

Run status group 0 (all jobs):
  WRITE: bw=1600MiB/s (1678MB/s), 1600MiB/s-1600MiB/s (1678MB/s-1678MB/s), io=32.0MiB (33.6MB), run=20-20msec
Run status group 1 (all jobs):
  WRITE: bw=356MiB/s (373MB/s), 356MiB/s-356MiB/s (373MB/s-373MB/s), io=32.0MiB (33.6MB), run=90-90msec
Run status group 2 (all jobs):
  WRITE: bw=260MiB/s (273MB/s), 260MiB/s-260MiB/s (273MB/s-273MB/s), io=32.0MiB (33.6MB), run=123-123msec

Greetings,

Andres Freund

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question on slow fallocate 2023-07-19 20:29 ` Andres Freund @ 2023-07-19 20:38 ` Eric Sandeen 2023-07-19 20:49 ` Eric Sandeen 0 siblings, 1 reply; 20+ messages in thread From: Eric Sandeen @ 2023-07-19 20:38 UTC (permalink / raw) To: Andres Freund, Dave Chinner; +Cc: Eric Sandeen, Masahiko Sawada, linux-xfs On 7/19/23 3:29 PM, Andres Freund wrote: > Somewhat tangential: I still would like a fallocate() option that actually > zeroes out new extents (via "write zeroes", if supported), rather than just > setting them up as unwritten extents. Nor for "data" files, but for > WAL/journal files. Like this? fallocate(2): Zeroing file space Specifying the FALLOC_FL_ZERO_RANGE flag (available since Linux 3.15) in mode zeros space in the byte range starting at offset and continuing for len bytes. Under the covers, that uses efficient zeroing methods when available. -Eric ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question on slow fallocate
  2023-07-19 20:38           ` Eric Sandeen
@ 2023-07-19 20:49             ` Eric Sandeen
  2023-07-19 22:23               ` Andres Freund
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Sandeen @ 2023-07-19 20:49 UTC (permalink / raw)
To: sandeen, Andres Freund, Dave Chinner; +Cc: Masahiko Sawada, linux-xfs

On 7/19/23 3:38 PM, Eric Sandeen wrote:
> On 7/19/23 3:29 PM, Andres Freund wrote:
> > Somewhat tangential: I still would like a fallocate() option that
> > actually zeroes out new extents (via "write zeroes", if supported),
> > rather than just setting them up as unwritten extents. Not for "data"
> > files, but for WAL/journal files.
>
> Like this?
>
> fallocate(2):
>
>    Zeroing file space
>        Specifying the FALLOC_FL_ZERO_RANGE flag (available since
>        Linux 3.15) in mode zeros space in the byte range starting at
>        offset and continuing for len bytes.
>
> Under the covers, that uses efficient zeroing methods when available.
>
> -Eric

Hm sorry, it's been a while. Maybe I'm wrong about this; I know it does
efficient zeroing when pointed at a block device but I guess I've confused
myself about what happens on a filesystem like XFS that supports unwritten
extents.

-Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Question on slow fallocate
  2023-07-19 20:49             ` Eric Sandeen
@ 2023-07-19 22:23               ` Andres Freund
  0 siblings, 0 replies; 20+ messages in thread
From: Andres Freund @ 2023-07-19 22:23 UTC (permalink / raw)
To: Eric Sandeen; +Cc: sandeen, Dave Chinner, Masahiko Sawada, linux-xfs

Hi,

On 2023-07-19 15:49:59 -0500, Eric Sandeen wrote:
> On 7/19/23 3:38 PM, Eric Sandeen wrote:
> > On 7/19/23 3:29 PM, Andres Freund wrote:
> > > Somewhat tangential: I still would like a fallocate() option that
> > > actually zeroes out new extents (via "write zeroes", if supported),
> > > rather than just setting them up as unwritten extents. Not for
> > > "data" files, but for WAL/journal files.
> >
> > Like this?
> >
> > fallocate(2):
> >
> >    Zeroing file space
> >        Specifying the FALLOC_FL_ZERO_RANGE flag (available since
> >        Linux 3.15) in mode zeros space in the byte range starting at
> >        offset and continuing for len bytes.
> >
> > Under the covers, that uses efficient zeroing methods when available.
> >
> > -Eric
>
> Hm sorry, it's been a while. Maybe I'm wrong about this; I know it does
> efficient zeroing when pointed at a block device but I guess I've confused
> myself about what happens on a filesystem like XFS that supports unwritten
> extents.

Yea, it's documented to use unwritten extents:

       Zeroing is done within the filesystem preferably by converting the
       range into unwritten extents.  This approach means that the specified
       range will not be physically zeroed out on the device (except for
       partial blocks at the either end of the range), and I/O is
       (otherwise) required only to update metadata.

and an experiment confirms that:

$ dd if=/dev/zero of=test bs=1MB count=1
$ filefrag -v test
Filesystem type is: 58465342
File size of test is 1000000 (245 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     244:    6104864..   6105108:    245:             last,eof
test: 1 extent found
$ fallocate -z -o 0 -l $((4096*128)) test
$ filefrag -v test
Filesystem type is: 58465342
File size of test is 1000000 (245 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     127:    6105210..   6105337:    128:             unwritten
   1:      128..     244:    6104992..   6105108:    117:    6105338: last,eof

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Question on slow fallocate
  2023-06-23  0:47 ` Dave Chinner
  2023-06-23  8:29   ` Ritesh Harjani
  2023-06-26  3:17   ` Masahiko Sawada
@ 2023-07-11 22:28   ` Andres Freund
  2 siblings, 0 replies; 20+ messages in thread
From: Andres Freund @ 2023-07-11 22:28 UTC (permalink / raw)
To: Dave Chinner; +Cc: Masahiko Sawada, linux-xfs

Hi,

On 2023-06-23 10:47:43 +1000, Dave Chinner wrote:
> On Thu, Jun 22, 2023 at 02:34:18PM +0900, Masahiko Sawada wrote:
> > Hi all,
> >
> > When testing PostgreSQL, I found a performance degradation. After some
> > investigation, it ultimately reached the attached simple C program and
> > turned out that the performance degradation happens only on the xfs
> > filesystem (it happens on neither ext3 nor ext4). In short, the
> > program alternately does two things to extend a file: (1) call
> > posix_fallocate() to extend by 8192 bytes
>
> This is a well known anti-pattern - it always causes problems. Do
> not do this.

Postgres' actual behaviour is more complicated than Sawada-san's test
program. We either fallocate() multiple pages, or we use pwritev() to
extend by fewer pages. I think Sawada-san wrote it when trying to narrow
down a performance issue to the "problematic" interaction, perhaps
simplifying the real workload too much.

> As it is, using fallocate/pwrite like the test does is a well known
> anti-pattern:
>
>	error = fallocate(fd, off, len);
>	if (error == ENOSPC) {
>		/* abort write!!! */
>	}
>	error = pwrite(fd, off, len);
>	ASSERT(error != ENOSPC);
>	if (error) {
>		/* handle error */
>	}
>
> Why does the code need a call to fallocate() here to prevent ENOSPC in
> the pwrite() call?

The reason we need either fallocate() or pwrite() is to ensure we can
later write out the page from Postgres' buffer pool without hitting
ENOSPC (of course that's still not reliable for all filesystems...). We
don't want to use *write() for larger amounts of data, because that ends
up with the kernel actually needing to write out those pages, and there
never is any content in those newly extended pages. So for small file
extensions we use writes, and when it's more bulk work, we use
fallocate().

Having a dirty page in our buffer pool that we can't write out due to
ENOSPC is bad, as it prevents our checkpoints from ever succeeding. We
then either need to "crash" and replay the journal, or we can't
checkpoint, with all the issues that entails.

The performance issue at hand came about because the workload flipped
between extending by fallocate() and extending by write(); part of the
heuristic choosing between the two is the contention on the lock
protecting file extensions.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 20+ messages in thread
end of thread, other threads:[~2023-07-19 22:24 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2023-06-22  5:34 Question on slow fallocate Masahiko Sawada
2023-06-22  7:44 ` Wang Yugui
2023-06-22  8:18   ` Masahiko Sawada
2023-06-23  0:47 ` Dave Chinner
2023-06-23  8:29   ` Ritesh Harjani
2023-06-23 10:07     ` Dave Chinner
2023-06-23 11:49       ` Ritesh Harjani
2023-06-23 20:04         ` Eric Sandeen
2023-06-26  3:17   ` Masahiko Sawada
2023-06-26 15:32     ` Eric Sandeen
2023-06-27 15:50       ` Masahiko Sawada
2023-06-27 16:12         ` Eric Sandeen
2023-06-28  4:56           ` Christoph Hellwig
2023-07-11 22:49             ` Andres Freund
2023-07-19  7:25               ` Dave Chinner
2023-07-19 20:29                 ` Andres Freund
2023-07-19 20:38                   ` Eric Sandeen
2023-07-19 20:49                     ` Eric Sandeen
2023-07-19 22:23                       ` Andres Freund
2023-07-11 22:28   ` Andres Freund
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox