* Do I have to fsync after aio_write finishes (with fallocate preallocation) ?
From: Shawn @ 2022-11-29 19:20 UTC
To: linux-xfs

Hello all,

I implemented a write workload by sequentially appending to the file
end using libaio aio_write in O_DIRECT mode (with proper offset and
buffer address alignment). When I reach a 1MB boundary I call
fallocate() to extend the file.

I need to protect the write from various failures such as disk unplug
/ power failure. The bottom line is, once I ack a write-complete,
the user must be able to read it back later after a disk/power failure
and recovery.

In my understanding, fallocate() will preallocate disk space for the
file, and I can call fsync to make sure the file metadata about this
new space is persisted when fallocate returns. Once aio_write returns
the data is in the disk. So it seems I don't need fsync after
aio-write completion, because (1) the data is in disk, and (2) the
file metadata to address the disk blocks is in disk.

On the other hand, it seems XFS always does a delayed allocation
which might break my assumption that file=>disk space mapping is
persisted by fallocate.

I can improve the data-in-disk format to carry proper header/footer to
detect a broken write when scanning the file after a disk/power
failure.

Given all those above, do I still need a fsync() after aio_write
completion in XFS to protect data persistence?

Thanks all for your input!

regards,
Shawn
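The header/footer idea in the message above could look roughly like the following. This is only a sketch: the magic value, the field layout, and the names encode_record() and scan_records() are made up for illustration and are not from the thread. Note that framing like this only detects a torn or missing record during the post-crash scan; it does nothing to make an acknowledged write durable.

```python
import struct
import zlib

MAGIC = 0x52454331                # hypothetical record marker, chosen arbitrarily
HEADER = struct.Struct("<III")    # magic, payload length, CRC32 of the payload
FOOTER = struct.Struct("<I")      # CRC32 repeated, so a torn tail is detectable


def encode_record(payload: bytes) -> bytes:
    """Frame a payload with a header and a footer."""
    crc = zlib.crc32(payload)
    return HEADER.pack(MAGIC, len(payload), crc) + payload + FOOTER.pack(crc)


def scan_records(blob: bytes) -> list:
    """Return the payloads of all intact records, stopping at the first bad one."""
    records = []
    pos = 0
    while pos + HEADER.size <= len(blob):
        magic, length, crc = HEADER.unpack_from(blob, pos)
        end = pos + HEADER.size + length + FOOTER.size
        if magic != MAGIC or end > len(blob):
            break  # unwritten (zeroed) space or a truncated record
        start = pos + HEADER.size
        payload = blob[start:start + length]
        (tail_crc,) = FOOTER.unpack_from(blob, end - FOOTER.size)
        if zlib.crc32(payload) != crc or tail_crc != crc:
            break  # torn or corrupted record
        records.append(payload)
        pos = end
    return records
```

After a crash, the application would read the file back, keep whatever scan_records() returns, and treat everything after the last intact record as never written.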
* Re: Do I have to fsync after aio_write finishes (with fallocate preallocation) ?
From: Darrick J. Wong @ 2022-11-29 21:17 UTC
To: Shawn; +Cc: linux-xfs

On Tue, Nov 29, 2022 at 11:20:05AM -0800, Shawn wrote:
> Given all those above, do I still need a fsync() after aio_write
> completion in XFS to protect data persistence?

Yes.

The only time you don't is if you're performing an O_SYNC write to a
part of a file that you've already written (and fsync'd) that's
entirely below EOF and you've arranged that the filesystem will never
COW or otherwise require metadata updates.

Hey, at least aio_fsync works now...

--D
* Re: Do I have to fsync after aio_write finishes (with fallocate preallocation) ?
From: Dave Chinner @ 2022-11-29 21:34 UTC
To: Shawn; +Cc: linux-xfs

On Tue, Nov 29, 2022 at 11:20:05AM -0800, Shawn wrote:
> Hello all,
> I implemented a write workload by sequentially appending to the file
> end using libaio aio_write in O_DIRECT mode (with proper offset and
> buffer address alignment). When I reach a 1MB boundary I call
> fallocate() to extend the file.

Ah, yet another fallocate anti-pattern.

Firstly, friends don't let friends use fallocate() with AIO+DIO.

fallocate() serialises all IO to that file - it waits for existing
IO to complete, and prevents new IO from being issued until the
fallocate() operation completes. It is a completely synchronous
operation and it does not play well with non-blocking IO paths (AIO
or io_uring). Put simply: fallocate() is an IO performance and
scalability killer.

If you need to *allocate* in aligned 1MB chunks, then use extent
size hints to tell the filesystem to allocate 1MB aligned chunks
when it does IO. This does not serialise all IO to the file like
fallocate does; it achieves exactly the same result as using
fallocate to extend the file, yet the application doesn't need to
know anything about controlling file layout.

Further, using DIO write() calls to extend the file rather than
fallocate() or ftruncate() also means that there will always be data
right up to the end of the file. That's because XFS will not update
the file size on extension until the IO has completed, and making
the file size extension persistent (i.e. journalling it) doesn't
happen until the data has been made persistent via device cache
flushes.

IOWs, if the file has been extended by a write IO, then XFS has
*guaranteed* that the data written to that extended region has been
persisted to disk before the size extension is persisted.

> I need to protect the write from various failures such as disk unplug
> / power failure. The bottom line is, once I ack a write-complete,
> the user must be able to read it back later after a disk/power failure
> and recovery.

Fallocate() does not provide data integrity guarantees. The
application needs to use O_DSYNC/RWF_DSYNC IO controls to tell the
filesystem to provide data integrity guarantees.

> In my understanding, fallocate() will preallocate disk space for the
> file, and I can call fsync to make sure the file metadata about this
> new space is persisted when fallocate returns.

Yes, but it just contains zeros, so if it is missing after a
crash, what does it matter? It just looks like the file wasn't
extended, and the application has to be able to recover from that
situation already, yes?

> Once aio_write returns
> the data is in the disk. So it seems I don't need fsync after
> aio-write completion, because (1) the data is in disk, and (2) the
> file metadata to address the disk blocks is in disk.

Wrong. Direct IO does not guarantee persistence when the
write()/aio_write() completes. Even with direct IO, the data can be
held in volatile caches in the storage stack and the data is not
guaranteed to be persistent until directed by the application to be
made persistent.

> On the other hand, it seems XFS always does a delayed allocation
> which might break my assumption that file=>disk space mapping is
> persisted by fallocate.

Wrong on many levels. The first is the same as above - fallocate()
does not provide any data persistence guarantees.

Secondly, DIO writes do not use delayed allocation because they
can't - we have to issue the IO immediately, so there's nothing that
can be delayed. IOWs, delayed allocation is only done for buffered
IO. This is true for delayed allocation on both ext4 and btrfs as
well.

Further, on XFS, buffered writes into preallocated space from
fallocate() do not use delayed allocation either - the space is
already allocated, so there's nothing to allocate and hence nothing
to delay!

To drive the point home even further: if you use extent size
hints with buffered writes, then this also turns off delayed
allocation and instead uses immediate allocation just like DIO
writes to preallocate the aligned extent around the range being
written.

Lastly, if you write an fallocate() based algorithm that works
"well" on XFS, there's every chance it's going to absolutely suck on
a different filesystem (e.g. btrfs) because different filesystems
have very different allocation policies and interact with
preallocation very differently.

IOWs, there's a major step between knowing what concepts like
delayed allocation and preallocation do versus understanding the
complex policies that filesystems weave around these concepts to
make general purpose workloads perform optimally in most
situations....

> I can improve the data-in-disk format to carry proper header/footer to
> detect a broken write when scanning the file after a disk/power
> failure.
>
> Given all those above, do I still need a fsync() after aio_write
> completion in XFS to protect data persistence?

Regardless of the filesystem, applications *always* need to use
fsync/fdatasync/O_SYNC/O_DSYNC/RWF_DSYNC to guarantee data
persistence. The filesystem doesn't provide any persistence
guarantees in the absence of these application directives -
guaranteeing user data integrity is the responsibility of the
application manipulating the user data, not the filesystem.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
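To make the "application directive" point above concrete, here is a minimal sketch of a write that is only acknowledged once the data has been pushed to stable storage. It uses a plain synchronous os.pwritev() with RWF_DSYNC rather than libaio, and it leaves out O_DIRECT (which needs aligned buffers) so that it runs on any filesystem; both are simplifications on my part. The helper name durable_pwrite() and the fallback to fdatasync() are assumptions for illustration, not something prescribed in the thread.

```python
import errno
import os


def durable_pwrite(fd: int, data: bytes, offset: int) -> int:
    """Write data at offset; return only after it was flushed to stable storage.

    Returns the number of bytes written. Short writes are possible and are
    left for the caller to handle.
    """
    rwf_dsync = getattr(os, "RWF_DSYNC", None)
    if rwf_dsync is not None and hasattr(os, "pwritev"):
        try:
            # Per-write equivalent of opening the file with O_DSYNC.
            return os.pwritev(fd, [data], offset, rwf_dsync)
        except OSError as exc:
            # Only fall back when the flag itself is not supported here.
            if exc.errno not in (errno.EOPNOTSUPP, errno.EINVAL, errno.ENOSYS):
                raise
    written = os.pwrite(fd, data, offset)
    os.fdatasync(fd)  # explicit flush: data plus the metadata needed to read it
    return written
```

An AIO-based writer would follow the same rule: acknowledge the write to the user only after a completion that carried a data integrity flag, or after a later fsync/fdatasync has completed.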
* Re: Do I have to fsync after aio_write finishes (with fallocate preallocation) ?
From: Shawn @ 2023-08-21 19:01 UTC
To: Dave Chinner; +Cc: linux-xfs

Hello Dave,
Thank you for your detailed reply. That fallocate() thing makes a lot of sense.

I want to figure out the default extent size in my env. But
"xfs_info" doesn't seem to output it? (See below output)

Also, I want to use this cmd to set the default extent size hint, is
this correct?
$ sudo mkfs.xfs -d extszinherit=256    <== the data block is 4KB, so 256 is 1MB.

$ sudo xfs_info /dev/nvme3n1
meta-data=/dev/nvme3n1           isize=512    agcount=4, agsize=117210902 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=468843606, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=228927, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

regards,
Shawn
* Re: Do I have to fsync after aio_write finishes (with fallocate preallocation) ?
From: Dave Chinner @ 2023-08-26 3:48 UTC
To: Shawn; +Cc: linux-xfs

On Mon, Aug 21, 2023 at 12:01:27PM -0700, Shawn wrote:
> Hello Dave,
> Thank you for your detailed reply. That fallocate() thing makes a lot of sense.
>
> I want to figure out the default extent size in my evn. But
> "xfs_info" doesn't seem to output it? (See below output)

Extent size hints are an inode property, not a filesystem geometry
property. xfs_info only queries the latter, it knows nothing about
the former.

# xfs_io -c 'stat' </path/to/mnt>

will tell you the default extent size hint that will be
inherited by newly created sub-directories and files
(fsxattr.extsize).

> Also, I want to use this cmd to set the default extent size hint, is
> this correct?
> $ sudo mkfs.xfs -d extszinherit=256 <== the data block is 4KB, so
> 256 is 1MB.

Yes.

-Dave.
--
Dave Chinner
david@fromorbit.com
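The arithmetic behind the extszinherit=256 value confirmed above is just the desired hint divided by the filesystem block size, since the option is given in filesystem blocks. A small sketch, with a made-up helper name:

```python
def extszinherit_blocks(hint_bytes: int, block_size: int = 4096) -> int:
    """Convert an extent size hint in bytes to a count of filesystem blocks."""
    if block_size <= 0 or hint_bytes <= 0:
        raise ValueError("sizes must be positive")
    if hint_bytes % block_size:
        raise ValueError("hint must be a whole number of filesystem blocks")
    return hint_bytes // block_size


# 1 MiB hint on a 4 KiB block filesystem, as in the thread.
print(extszinherit_blocks(1024 * 1024))
```

With the 4096-byte data block size reported by xfs_info earlier in the thread, a 1 MiB hint works out to 256 blocks.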
* Re: Do I have to fsync after aio_write finishes (with fallocate preallocation) ?
From: Shawn @ 2023-08-27 1:09 UTC
To: Dave Chinner; +Cc: linux-xfs

xfs_io shows "extsize" as 0. The data bsize is always 4096. What's
the implication of a 0 extsize?

$ sudo xfs_io -c 'stat' /mnt/S48BNW0K700192T/
fd.path = "/mnt/S48BNW0K700192T/"
fd.flags = non-sync,non-direct,read-write
stat.ino = 64
stat.type = directory
stat.size = 81
stat.blocks = 0
fsxattr.xflags = 0x0 [--------------]
fsxattr.projid = 0
fsxattr.extsize = 0    <==== 0
fsxattr.nextents = 0
fsxattr.naextents = 0
dioattr.mem = 0x200
dioattr.miniosz = 512
dioattr.maxiosz = 2147483136
* Re: Do I have to fsync after aio_write finishes (with fallocate preallocation) ?
From: Dave Chinner @ 2023-08-28 1:01 UTC
To: Shawn; +Cc: linux-xfs

On Sat, Aug 26, 2023 at 06:09:13PM -0700, Shawn wrote:
> xfs_io shows "extsize" as 0. The data bsize is always 4096. What's
> the implication of a 0 extsize?
>
> $ sudo xfs_io -c 'stat' /mnt/S48BNW0K700192T/
> fd.path = "/mnt/S48BNW0K700192T/"
> fd.flags = non-sync,non-direct,read-write
> stat.ino = 64
> stat.type = directory
> stat.size = 81
> stat.blocks = 0
> fsxattr.xflags = 0x0 [--------------]
> fsxattr.projid = 0
> fsxattr.extsize = 0 <==== 0
> fsxattr.nextents = 0
> fsxattr.naextents = 0
> dioattr.mem = 0x200
> dioattr.miniosz = 512
> dioattr.maxiosz = 2147483136

There are no xflags set, meaning the XFS_DIFLAG_EXTSZINHERIT is not
set on the directory so nothing will inherit the extsize from the
directory at creation time. An extsize of zero is the default "don't
do any non-default extent size alignment" (i.e. align to stripe
parameters if the filesystem has them set, but nothing else.)

If this is the root directory of a mounted filesystem, it means the
extent size hint was not set by mkfs, and it hasn't been set
manually via xfs_io after mount, either.

-Dave.
--
Dave Chinner
david@fromorbit.com
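For anyone who wants to check the same fields from a program instead of xfs_io, the values shown as fsxattr.xflags and fsxattr.extsize come from the FS_IOC_FSGETXATTR ioctl. The sketch below decodes that structure. The constants are copied from my reading of <linux/fs.h> and assume the generic ioctl number encoding (as on x86-64 and arm64), so treat them as assumptions to verify against your headers. FS_XFLAG_EXTSZINHERIT is, as far as I know, the ioctl-level counterpart of the on-disk XFS_DIFLAG_EXTSZINHERIT flag mentioned above. The function names are made up.

```python
import fcntl
import os
import struct

# struct fsxattr: xflags, extsize, nextents, projid, cowextsize, 8 pad bytes
FSXATTR = struct.Struct("=IIIII8x")
FS_IOC_FSGETXATTR = 0x801C581F      # _IOR('X', 31, struct fsxattr), assumed encoding
FS_XFLAG_EXTSIZE = 0x00000800       # extent size hint is set on this file
FS_XFLAG_EXTSZINHERIT = 0x00001000  # new children inherit this directory's hint


def decode_fsxattr(raw: bytes) -> dict:
    """Pick the extent-size-related fields out of a raw struct fsxattr."""
    xflags, extsize, _nextents, _projid, _cowextsize = FSXATTR.unpack(raw)
    return {
        "extsize": extsize,  # bytes; 0 means no hint
        "has_extsize": bool(xflags & FS_XFLAG_EXTSIZE),
        "inherits_extsize": bool(xflags & FS_XFLAG_EXTSZINHERIT),
    }


def query_extsize(path: str) -> dict:
    """Ask the filesystem for the fsxattr of path (raises OSError if unsupported)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        raw = fcntl.ioctl(fd, FS_IOC_FSGETXATTR, bytes(FSXATTR.size))
    finally:
        os.close(fd)
    return decode_fsxattr(raw)
```

On the directory from the output above, query_extsize() would be expected to report an extsize of 0 with both flags clear, matching the "xflags = 0x0" line; on a filesystem made with an inherited hint, the root directory should show inherits_extsize set and the hint in bytes.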
* Re: Do I have to fsync after aio_write finishes (with fallocate preallocation) ?
From: Shawn @ 2023-09-01 1:06 UTC
To: Dave Chinner; +Cc: linux-xfs

Hi Dave,
If ext size hint is not set at all, what's the default extent size
alignment if the FS doesn't do striping (which is my case)?
* Re: Do I have to fsync after aio_write finishes (with fallocate preallocation) ?
From: Dave Chinner @ 2023-09-01 3:47 UTC
To: Shawn; +Cc: linux-xfs

On Thu, Aug 31, 2023 at 06:06:23PM -0700, Shawn wrote:
> Hi Dave,
> If ext size hint is not set at all, what's the default extent size
> alignment if the FS doesn't do striping (which is my case)?

No alignment. XFS will allocate exact sized extents for the
writes being issued...

-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Do I have to fsync after aio_write finishes (with fallocate preallocation) ?
From: Shawn @ 2023-09-01 23:50 UTC
To: Dave Chinner; +Cc: linux-xfs

On Thu, Aug 31, 2023 at 8:47 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Aug 31, 2023 at 06:06:23PM -0700, Shawn wrote:
> > Hi Dave,
> > If ext size hint is not set at all, what's the default extent size
> > alignment if the FS doesn't do striping (which is my case)?
>
> No alignment. XFS will allocate exact sized extents for the
> writes being issued...

=> Seems this can explain why io_submit() latency was very high for
small aio_write (4KB, 12KB, etc). When I did fallocate() 1MB space
before moving on to the next 1MB chunk, then io_submit latency becomes
normal. So ext size hint can achieve the same effect as fallocate()
in this case, might be even better.