* stat() on btrfs reports the st_blocks with delay (data loss in archivers)
@ 2016-07-02 7:18 Pavel Raiskup
2016-07-04 19:35 ` [Bug-tar] " Andreas Dilger
2016-07-11 14:41 ` David Sterba
0 siblings, 2 replies; 21+ messages in thread
From: Pavel Raiskup @ 2016-07-02 7:18 UTC (permalink / raw)
To: linux-btrfs; +Cc: bug-tar
There are optimizations in archivers (tar, rsync, ...) that rely on up2date
st_blocks info. For example, in GNU tar there is optimization check [1]
whether the 'st_size' reports more data than the 'st_blocks' can hold --> then
tar considers that file is sparse (and does additional steps).
It looks like btrfs doesn't show correct value in 'st_blocks' until the data
are synced. ATM, there happens that:
a) some "tool" creates sparse file
b) that tool does not sync explicitly and exits ..
c) tar is called immediately after that to archive the sparse file
d) tar considers [2] the file is completely sparse (because st_blocks is
zero) and archives no data. Here comes data loss.
Because we fixed 'btrfs' to report non-zero 'st_blocks' when the file data is
small and is in-lined (no real data blocks) -- I consider this is too bug in
btrfs worth fixing.
[1] http://git.savannah.gnu.org/cgit/paxutils.git/tree/lib/system.h?id=ec72abd9dd63bbff4534ec77e97b1a6cadfc3cf8#n392
[2] http://git.savannah.gnu.org/cgit/tar.git/tree/src/sparse.c?id=ac065c57fdc1788a2769fb119ed0c8146e1b9dd6#n273
Tested on kernel:
kernel-4.5.7-300.fc24.x86_64
Originally reported here, reproducer available there:
https://bugzilla.redhat.com/show_bug.cgi?id=1352061
Pavel
^ permalink raw reply [flat|nested] 21+ messages in thread* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers) 2016-07-02 7:18 stat() on btrfs reports the st_blocks with delay (data loss in archivers) Pavel Raiskup @ 2016-07-04 19:35 ` Andreas Dilger 2016-07-05 9:28 ` Joerg Schilling 2016-07-07 8:08 ` Pavel Raiskup 2016-07-11 14:41 ` David Sterba 1 sibling, 2 replies; 21+ messages in thread From: Andreas Dilger @ 2016-07-04 19:35 UTC (permalink / raw) To: Pavel Raiskup; +Cc: linux-btrfs, bug-tar [-- Attachment #1: Type: text/plain, Size: 2427 bytes --] On Jul 2, 2016, at 1:18 AM, Pavel Raiskup <praiskup@redhat.com> wrote: > > There are optimizations in archivers (tar, rsync, ...) that rely on up2date > st_blocks info. For example, in GNU tar there is optimization check [1] > whether the 'st_size' reports more data than the 'st_blocks' can hold --> then > tar considers that file is sparse (and does additional steps). > > It looks like btrfs doesn't show correct value in 'st_blocks' until the data > are synced. ATM, there happens that: > > a) some "tool" creates sparse file > b) that tool does not sync explicitly and exits .. > c) tar is called immediately after that to archive the sparse file > d) tar considers [2] the file is completely sparse (because st_blocks is > zero) and archives no data. Here comes data loss. > > Because we fixed 'btrfs' to report non-zero 'st_blocks' when the file data is > small and is in-lined (no real data blocks) -- I consider this is too bug in > btrfs worth fixing. We had a similar problem with both ext4 and Lustre - the client was reporting zero blocks due to delayed allocation until data was written to disk. While those problems were fixed in the filesystem to report an estimate of the block count before any blocks were actually written to disk, it seems like this may be a problem that will come up again with other filesystems in the future. I think in addition to fixing btrfs (because it needs to work with existing tar/rsync/etc. tools) it makes sense to *also* fix the heuristics of tar to handle this situation more robustly. One option is if st_blocks == 0 then tar should also check if st_mtime is less than 60s in the past, and if yes then it should call fsync() on the file to flush any unwritten data to disk, or assume the file is not sparse and read the whole file, so that it doesn't incorrectly assume that the file is sparse and skip archiving the file data. Cheers, Andreas > > [1] http://git.savannah.gnu.org/cgit/paxutils.git/tree/lib/system.h?id=ec72abd9dd63bbff4534ec77e97b1a6cadfc3cf8#n392 > [2] http://git.savannah.gnu.org/cgit/tar.git/tree/src/sparse.c?id=ac065c57fdc1788a2769fb119ed0c8146e1b9dd6#n273 > > Tested on kernel: > kernel-4.5.7-300.fc24.x86_64 > > Originally reported here, reproducer available there: > https://bugzilla.redhat.com/show_bug.cgi?id=1352061 > > Pavel > > Cheers, Andreas [-- Attachment #2: Message signed with OpenPGP using GPGMail --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers) 2016-07-04 19:35 ` [Bug-tar] " Andreas Dilger @ 2016-07-05 9:28 ` Joerg Schilling 2016-07-06 11:37 ` Austin S. Hemmelgarn 2016-07-07 8:08 ` Pavel Raiskup 1 sibling, 1 reply; 21+ messages in thread From: Joerg Schilling @ 2016-07-05 9:28 UTC (permalink / raw) To: praiskup, adilger; +Cc: linux-btrfs, bug-tar Andreas Dilger <adilger@dilger.ca> wrote: > I think in addition to fixing btrfs (because it needs to work with existing > tar/rsync/etc. tools) it makes sense to *also* fix the heuristics of tar > to handle this situation more robustly. One option is if st_blocks == 0 then > tar should also check if st_mtime is less than 60s in the past, and if yes > then it should call fsync() on the file to flush any unwritten data to disk, > or assume the file is not sparse and read the whole file, so that it doesn't > incorrectly assume that the file is sparse and skip archiving the file data. A broken filesystem is a broken filesystem. If you try to change gtar to work around a specific problem, it may fail in other situations. Jörg -- EMail:joerg@schily.net (home) Jörg Schilling D-13353 Berlin joerg.schilling@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/' ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers) 2016-07-05 9:28 ` Joerg Schilling @ 2016-07-06 11:37 ` Austin S. Hemmelgarn 2016-07-06 11:49 ` Joerg Schilling 0 siblings, 1 reply; 21+ messages in thread From: Austin S. Hemmelgarn @ 2016-07-06 11:37 UTC (permalink / raw) To: Joerg Schilling, praiskup, adilger; +Cc: linux-btrfs, bug-tar On 2016-07-05 05:28, Joerg Schilling wrote: > Andreas Dilger <adilger@dilger.ca> wrote: > >> I think in addition to fixing btrfs (because it needs to work with existing >> tar/rsync/etc. tools) it makes sense to *also* fix the heuristics of tar >> to handle this situation more robustly. One option is if st_blocks == 0 then >> tar should also check if st_mtime is less than 60s in the past, and if yes >> then it should call fsync() on the file to flush any unwritten data to disk, >> or assume the file is not sparse and read the whole file, so that it doesn't >> incorrectly assume that the file is sparse and skip archiving the file data. > > A broken filesystem is a broken filesystem. > > If you try to change gtar to work around a specific problem, it may fail in > other situations. The problem with this is that tar is assuming things that are not guaranteed to be true. There is absolutely nothing that says that st_blocks has to be non-zero if there's data in the file. In fact, the behavior that BTRFS used to have of reporting st_blocks to be 0 for files entirely inlined in the metadata is absolutely correct given the description of the field by POSIX, because there _are_ no blocks allocated to the file (because the metadata block is technically equivalent to the inode, which isn't counted by st_blocks). This is yet another example of an old interface (in this case, sparse file detection) being short-sighted (read in this case as non-existent). The proper fix for this is that tar (and anything else that handles sparse files differently) should be parsing the file regardless. It has to anyway for a normal sparse file to figure out where the sparse regions are, and optimizing for a file that's completely sparse (and therefore probably pre-allocated with fallocate) is not all that reasonable considering that this is going to be a very rare case in normal usage. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers) 2016-07-06 11:37 ` Austin S. Hemmelgarn @ 2016-07-06 11:49 ` Joerg Schilling 2016-07-06 14:43 ` Antonio Diaz Diaz 0 siblings, 1 reply; 21+ messages in thread From: Joerg Schilling @ 2016-07-06 11:49 UTC (permalink / raw) To: praiskup, ahferroin7, adilger; +Cc: linux-btrfs, bug-tar "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote: > > A broken filesystem is a broken filesystem. > > > > If you try to change gtar to work around a specific problem, it may fail in > > other situations. > The problem with this is that tar is assuming things that are not > guaranteed to be true. There is absolutely nothing that says that > st_blocks has to be non-zero if there's data in the file. In fact, the This is not true: POSIX requires st_blocks to be != 0 in case that the file contains data. > behavior that BTRFS used to have of reporting st_blocks to be 0 for > files entirely inlined in the metadata is absolutely correct given the > description of the field by POSIX, because there _are_ no blocks > allocated to the file (because the metadata block is technically > equivalent to the inode, which isn't counted by st_blocks). This is yet > another example of an old interface (in this case, sparse file > detection) being short-sighted (read in this case as non-existent). The internal state of a file system is irrelevant. The only thing that counts is the user space view and if a file contains data (read succeeds in user space), it needs to report st_blocks != 0. > The proper fix for this is that tar (and anything else that handles > sparse files differently) should be parsing the file regardless. It has > to anyway for a normal sparse file to figure out where the sparse > regions are, and optimizing for a file that's completely sparse (and > therefore probably pre-allocated with fallocate) is not all that > reasonable considering that this is going to be a very rare case in > normal usage. This does not help. Even on a decent OS (e.g. Solaris since Summer 2005) and a decent tar implementation (star) that supports SEEK_HOLE since Summer 2005, this method will not work for all filesystems as there may be old filesystem implementations and as there may be NFS... For this reason, star still checks st_blocks in case that SEEK_HOLE did not work. Jörg -- EMail:joerg@schily.net (home) Jörg Schilling D-13353 Berlin joerg.schilling@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/' ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers) 2016-07-06 11:49 ` Joerg Schilling @ 2016-07-06 14:43 ` Antonio Diaz Diaz 2016-07-06 14:53 ` Joerg Schilling 0 siblings, 1 reply; 21+ messages in thread From: Antonio Diaz Diaz @ 2016-07-06 14:43 UTC (permalink / raw) To: Joerg Schilling; +Cc: ahferroin7, adilger, linux-btrfs, bug-tar Joerg Schilling wrote: > POSIX requires st_blocks to be != 0 in case that the file contains data. Please, could you provide a reference? I can't find such requirement at http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html Thanks. Antonio. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers) 2016-07-06 14:43 ` Antonio Diaz Diaz @ 2016-07-06 14:53 ` Joerg Schilling 2016-07-06 15:01 ` Paul Eggert 2016-07-06 15:12 ` Austin S. Hemmelgarn 0 siblings, 2 replies; 21+ messages in thread From: Joerg Schilling @ 2016-07-06 14:53 UTC (permalink / raw) To: antonio; +Cc: linux-btrfs, bug-tar, ahferroin7, adilger Antonio Diaz Diaz <antonio@gnu.org> wrote: > Joerg Schilling wrote: > > POSIX requires st_blocks to be != 0 in case that the file contains data. > > Please, could you provide a reference? I can't find such requirement at > http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html blkcnt_t st_blocks Number of blocks allocated for this object. It should be obvious that a file that offers content also has allocated blocks. Blocks are "allocated" when the OS decides whether the new data will fit on the medium. The fact that some filesystems may have data in a cache but not yet on the medium does not matter here. This is how UNIX worked since st_block has been introduced nearly 40 years ago. A new filesystem cannot introduce new rules just because people believe it would save time. Jörg -- EMail:joerg@schily.net (home) Jörg Schilling D-13353 Berlin joerg.schilling@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/' ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers) 2016-07-06 14:53 ` Joerg Schilling @ 2016-07-06 15:01 ` Paul Eggert 2016-07-06 15:09 ` Joerg Schilling 2016-07-06 15:12 ` Austin S. Hemmelgarn 1 sibling, 1 reply; 21+ messages in thread From: Paul Eggert @ 2016-07-06 15:01 UTC (permalink / raw) To: Joerg Schilling, antonio; +Cc: adilger, ahferroin7, linux-btrfs, bug-tar On 07/06/2016 04:53 PM, Joerg Schilling wrote: > Antonio Diaz Diaz<antonio@gnu.org> wrote: > >> >Joerg Schilling wrote: >>> > >POSIX requires st_blocks to be != 0 in case that the file contains data. >> > >> >Please, could you provide a reference? I can't find such requirement at >> >http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html > blkcnt_t st_blocks Number of blocks allocated for this object. This doesn't require that st_blocks must be nonzero if the file contains nonzero data, any more that it requires that st_blocks must be nonzero if the file contains zero data. In either case, metadata outside the scope of st_blocks might contain enough information for the file system to represent all the file's data. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers) 2016-07-06 15:01 ` Paul Eggert @ 2016-07-06 15:09 ` Joerg Schilling 2016-07-06 15:11 ` Paul Eggert 0 siblings, 1 reply; 21+ messages in thread From: Joerg Schilling @ 2016-07-06 15:09 UTC (permalink / raw) To: eggert, antonio; +Cc: linux-btrfs, bug-tar, ahferroin7, adilger Paul Eggert <eggert@cs.ucla.edu> wrote: > On 07/06/2016 04:53 PM, Joerg Schilling wrote: > > Antonio Diaz Diaz<antonio@gnu.org> wrote: > > > >> >Joerg Schilling wrote: > >>> > >POSIX requires st_blocks to be != 0 in case that the file contains data. > >> > > >> >Please, could you provide a reference? I can't find such requirement at > >> >http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html > > blkcnt_t st_blocks Number of blocks allocated for this object. > > This doesn't require that st_blocks must be nonzero if the file contains > nonzero data, any more that it requires that st_blocks must be nonzero > if the file contains zero data. In either case, metadata outside the > scope of st_blocks might contain enough information for the file system > to represent all the file's data. In other words, you concur that a delayed assignment of the "correct" value for st_blocks while the contend of the file does not change is not permitted. Jörg -- EMail:joerg@schily.net (home) Jörg Schilling D-13353 Berlin joerg.schilling@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/' ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers) 2016-07-06 15:09 ` Joerg Schilling @ 2016-07-06 15:11 ` Paul Eggert 0 siblings, 0 replies; 21+ messages in thread From: Paul Eggert @ 2016-07-06 15:11 UTC (permalink / raw) To: Joerg Schilling, antonio; +Cc: linux-btrfs, bug-tar, ahferroin7, adilger On 07/06/2016 05:09 PM, Joerg Schilling wrote: > you concur that a delayed assignment of the "correct" value for > st_blocks while the contend of the file does not change is not permitted. I'm not sure I agree even with that. A file system may undergo garbage collection and compaction, for instance, in which a file's data do not change but its internal representation does. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers) 2016-07-06 14:53 ` Joerg Schilling 2016-07-06 15:01 ` Paul Eggert @ 2016-07-06 15:12 ` Austin S. Hemmelgarn 2016-07-06 15:22 ` Joerg Schilling 1 sibling, 1 reply; 21+ messages in thread From: Austin S. Hemmelgarn @ 2016-07-06 15:12 UTC (permalink / raw) To: Joerg Schilling, antonio; +Cc: linux-btrfs, bug-tar, adilger On 2016-07-06 10:53, Joerg Schilling wrote: > Antonio Diaz Diaz <antonio@gnu.org> wrote: > >> Joerg Schilling wrote: >>> POSIX requires st_blocks to be != 0 in case that the file contains data. >> >> Please, could you provide a reference? I can't find such requirement at >> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html > > blkcnt_t st_blocks Number of blocks allocated for this object. > > It should be obvious that a file that offers content also has allocated blocks. What you mean then is that POSIX _implies_ that this is the case, but does not say whether or not it is required. There are all kinds of counterexamples to this too, procfs is a POSIX compliant filesystem (every POSIX certified system has it), yet does not display the behavior that you expect, every single file in /proc for example reports 0 for both st_blocks and st_size, and yet all of them very obviously have content. > > Blocks are "allocated" when the OS decides whether the new data will fit on the > medium. The fact that some filesystems may have data in a cache but not yet on > the medium does not matter here. This is how UNIX worked since st_block has > been introduced nearly 40 years ago. Tradition is the corpse of wisdom. Backwards comparability is a problem just as much as a good thing. In all seriousness though, this started out because stuff wasn't cached to anywhere near the degree it is today, and there was no such thing as delayed allocation. When you said to write, the filesystem allocated the blocks, regardless of when it actually wrote the data. IOW, the behavior that GNU tar is relying on is an implementation detail, not an API. Just like df, this breaks under modern designs, not because they chose to break it, but because it wasn't designed for use with such implementations. In the case of tar and similar things though, I'd argue that it's not sensible to special case files that are 'sparse', it should store any long enough run of zeroes as a sparse region, then provide an option to say to not make those files sparse when restored. > > A new filesystem cannot introduce new rules just because people believe it would > save time. Saying the file has no blocks when there are no blocks allocated for it is not to 'save time', it's absolutely accurate. Suppose SVR4 UFS had a way to pack file data into the inode if it was small enough. In that case, it woulod be perfectly reasonable to return 0 for st_blocks because the inode table in UFS is a fixed pre-allocated structure, and therefore nothing is allocated to the file itself except the inode. The same applies in the case of a file packed into it's own metadata block on BTRFS, nothing is allocated to that file beyond the metadata block it has to have to store the inode. In the case of delayed allocation where the file hasn't been flushed, there is nothing allocated, so st_blocks based on a strict interpretation of it's description in POSIX _should_ be 0, because nothing is allocated yet. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers) 2016-07-06 15:12 ` Austin S. Hemmelgarn @ 2016-07-06 15:22 ` Joerg Schilling 2016-07-06 16:05 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 21+ messages in thread From: Joerg Schilling @ 2016-07-06 15:22 UTC (permalink / raw) To: antonio, ahferroin7; +Cc: linux-btrfs, bug-tar, adilger "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote: > > It should be obvious that a file that offers content also has allocated blocks. > What you mean then is that POSIX _implies_ that this is the case, but > does not say whether or not it is required. There are all kinds of > counterexamples to this too, procfs is a POSIX compliant filesystem > (every POSIX certified system has it), yet does not display the behavior > that you expect, every single file in /proc for example reports 0 for > both st_blocks and st_size, and yet all of them very obviously have content. You are mistaken. stat /proc/$$/as File: `/proc/6518/as' Size: 2793472 Blocks: 5456 IO Block: 512 regular file Device: 5440000h/88342528d Inode: 7557 Links: 1 Access: (0600/-rw-------) Uid: ( xx/ joerg) Gid: ( xx/ bs) Access: 2016-07-06 16:33:15.660224934 +0200 Modify: 2016-07-06 16:33:15.660224934 +0200 Change: 2016-07-06 16:33:15.660224934 +0200 stat /proc/$$/auxv File: `/proc/6518/auxv' Size: 168 Blocks: 1 IO Block: 512 regular file Device: 5440000h/88342528d Inode: 7568 Links: 1 Access: (0400/-r--------) Uid: ( xx/ joerg) Gid: ( xx/ bs) Access: 2016-07-06 16:33:15.660224934 +0200 Modify: 2016-07-06 16:33:15.660224934 +0200 Change: 2016-07-06 16:33:15.660224934 +0200 Any correct implementation of /proc returns the expected numbers in st_size as well as in st_blocks. > In all seriousness though, this started out because stuff wasn't cached > to anywhere near the degree it is today, and there was no such thing as > delayed allocation. When you said to write, the filesystem allocated > the blocks, regardless of when it actually wrote the data. IOW, the > behavior that GNU tar is relying on is an implementation detail, not an > API. Just like df, this breaks under modern designs, not because they > chose to break it, but because it wasn't designed for use with such > implementations. This seems to be a strange interpretation if what a standard is. > > A new filesystem cannot introduce new rules just because people believe it would > > save time. > Saying the file has no blocks when there are no blocks allocated for it > is not to 'save time', it's absolutely accurate. Suppose SVR4 UFS had a > way to pack file data into the inode if it was small enough. In that > case, it woulod be perfectly reasonable to return 0 for st_blocks > because the inode table in UFS is a fixed pre-allocated structure, and Given that inode size is 128, such a change would not break things as the heuristics would not imply a sparse file here. > therefore nothing is allocated to the file itself except the inode. The > same applies in the case of a file packed into it's own metadata block > on BTRFS, nothing is allocated to that file beyond the metadata block it > has to have to store the inode. In the case of delayed allocation where > the file hasn't been flushed, there is nothing allocated, so st_blocks > based on a strict interpretation of it's description in POSIX _should_ > be 0, because nothing is allocated yet. Now you know why BTRFS is still an incomplete filesystem. In a few years when it turns 10, this may change. People who implement filesystems of course need to learn that they need to hide implementation details from the official user space interfaces. Jörg -- EMail:joerg@schily.net (home) Jörg Schilling D-13353 Berlin joerg.schilling@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/' ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers) 2016-07-06 15:22 ` Joerg Schilling @ 2016-07-06 16:05 ` Austin S. Hemmelgarn 2016-07-06 16:11 ` Austin S. Hemmelgarn 2016-07-06 16:33 ` Joerg Schilling 0 siblings, 2 replies; 21+ messages in thread From: Austin S. Hemmelgarn @ 2016-07-06 16:05 UTC (permalink / raw) To: Joerg Schilling, antonio; +Cc: linux-btrfs, bug-tar, adilger On 2016-07-06 11:22, Joerg Schilling wrote: > "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote: > >>> It should be obvious that a file that offers content also has allocated blocks. >> What you mean then is that POSIX _implies_ that this is the case, but >> does not say whether or not it is required. There are all kinds of >> counterexamples to this too, procfs is a POSIX compliant filesystem >> (every POSIX certified system has it), yet does not display the behavior >> that you expect, every single file in /proc for example reports 0 for >> both st_blocks and st_size, and yet all of them very obviously have content. > > You are mistaken. > > stat /proc/$$/as > File: `/proc/6518/as' > Size: 2793472 Blocks: 5456 IO Block: 512 regular file > Device: 5440000h/88342528d Inode: 7557 Links: 1 > Access: (0600/-rw-------) Uid: ( xx/ joerg) Gid: ( xx/ bs) > Access: 2016-07-06 16:33:15.660224934 +0200 > Modify: 2016-07-06 16:33:15.660224934 +0200 > Change: 2016-07-06 16:33:15.660224934 +0200 > > stat /proc/$$/auxv > File: `/proc/6518/auxv' > Size: 168 Blocks: 1 IO Block: 512 regular file > Device: 5440000h/88342528d Inode: 7568 Links: 1 > Access: (0400/-r--------) Uid: ( xx/ joerg) Gid: ( xx/ bs) > Access: 2016-07-06 16:33:15.660224934 +0200 > Modify: 2016-07-06 16:33:15.660224934 +0200 > Change: 2016-07-06 16:33:15.660224934 +0200 > > Any correct implementation of /proc returns the expected numbers in st_size as > well as in st_blocks. Odd, because I get 0 for both values on all the files in /proc/self and all the top level files on all kernels I tested prior to sending that e-mail, for reference, they include: * A direct clone of HEAD on torvalds/linux * 4.6.3 mainline * 4.1.27 mainline * 4.6.3 mainline with a small number of local patches on top * 4.1.19+ from the Raspberry Pi foundation * 4.4.6-gentoo (mainline with Gentoo patches on top) * 4.5.5-linode69 (not certain about the patches on top) It's probably notable that I don't see /proc/$PID/as on any of these systems, which implies you're running some significantly different kernel version to begin with, and therefore it's not unreasonable to assume that what you see is because of some misguided patch that got added to allow tar to archive /proc. > >> In all seriousness though, this started out because stuff wasn't cached >> to anywhere near the degree it is today, and there was no such thing as >> delayed allocation. When you said to write, the filesystem allocated >> the blocks, regardless of when it actually wrote the data. IOW, the >> behavior that GNU tar is relying on is an implementation detail, not an >> API. Just like df, this breaks under modern designs, not because they >> chose to break it, but because it wasn't designed for use with such >> implementations. > > This seems to be a strange interpretation if what a standard is. Except what I'm talking about is the _interpretation_ of the standard, not the standard itself. I said nothing about the standard, all it requires is that st_blocks be the number of 512 byte blocks allocated by the filesystem for the file. There is nothing in there about it having to reflect the expected size of the allocated content on disk. In fact, there's technically nothing in there about how to handle sparse files either. To further explain what I'm trying to say, here's a rough description of what happens in SVR4 UFS (and other non-delayed allocation filesystems) when you issue a write: 1. The number of new blocks needed to fulfill the write request is calculated. 2. If this number is greater than 0, that many new blocks are allocated, and st_blocks for that file is functionally updated (I don't recall if it was dynamically calculated per call or not) 3. At some indeterminate point in the future, the decision is made to flush the cache. 4. The data is written to the appropriate place in the file. By comparison, in a delayed allocation scenario, 3 happens before 1 and 2. 1 and 2 obviously have to be strictly ordered WRT each other and 4, but based on the POSIX standard, 3 does not have to be strictly ordered with regards to any of them (although it is illogical to have it between 1 and 2 or after 4). Because it is not required by the standard to have 3 be strictly ordered and the ordering isn't part of the API itself, where it happens in the sequence is an implementation detail. > >>> A new filesystem cannot introduce new rules just because people believe it would >>> save time. >> Saying the file has no blocks when there are no blocks allocated for it >> is not to 'save time', it's absolutely accurate. Suppose SVR4 UFS had a >> way to pack file data into the inode if it was small enough. In that >> case, it woulod be perfectly reasonable to return 0 for st_blocks >> because the inode table in UFS is a fixed pre-allocated structure, and > > Given that inode size is 128, such a change would not break things as the > heuristics would not imply a sparse file here. OK, so change the heuristic then so that there's a reasonable limit on small files, or better yet, just get rid of it, as it was introduced to save time itself. The case it optimizes for (large sparse files) is mostly irrelevant, because you have to parse the file to figure out what's sparse anyway, and anyone who's dealing with completely sparse files has other potential issues to deal with (I'm actually curious if you know of some legitimate reason to copy a completely empty file in a backup anyway, they're almost always either pre-allocated but unused files which will just get reallocated by whatever allocated them in the first place, lock-files, or something similar in some other way). Regardless, the correct way on current Linux systems to determine file sparseness is SEEK_DATA and SEEK_HOLE. You have to read any data that's there anyway, so alternating SEEK_DATA and SEEK_HOLE is necessary to find where the data is that you need to read. If the first SEEK_HOLE returns the end of the file, then you know it's not sparse. > >> therefore nothing is allocated to the file itself except the inode. The >> same applies in the case of a file packed into it's own metadata block >> on BTRFS, nothing is allocated to that file beyond the metadata block it >> has to have to store the inode. In the case of delayed allocation where >> the file hasn't been flushed, there is nothing allocated, so st_blocks >> based on a strict interpretation of it's description in POSIX _should_ >> be 0, because nothing is allocated yet. > > Now you know why BTRFS is still an incomplete filesystem. In a few years when > it turns 10, this may change. People who implement filesystems of course need > to learn that they need to hide implementation details from the official user > space interfaces. So in other words you think we should be lying about how much is actually allocated on disk and thus violating the standard directly (and yes, ext4 and everyone else who does this with delayed allocation _is_ strictly speaking violating the standard, because _nothing_ is allocated yet)? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers) 2016-07-06 16:05 ` Austin S. Hemmelgarn @ 2016-07-06 16:11 ` Austin S. Hemmelgarn 2016-07-06 16:33 ` Joerg Schilling 1 sibling, 0 replies; 21+ messages in thread From: Austin S. Hemmelgarn @ 2016-07-06 16:11 UTC (permalink / raw) To: Joerg Schilling, antonio; +Cc: linux-btrfs, bug-tar, adilger On 2016-07-06 12:05, Austin S. Hemmelgarn wrote: > On 2016-07-06 11:22, Joerg Schilling wrote: >> "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote: >> >>>> It should be obvious that a file that offers content also has >>>> allocated blocks. >>> What you mean then is that POSIX _implies_ that this is the case, but >>> does not say whether or not it is required. There are all kinds of >>> counterexamples to this too, procfs is a POSIX compliant filesystem >>> (every POSIX certified system has it), yet does not display the behavior >>> that you expect, every single file in /proc for example reports 0 for >>> both st_blocks and st_size, and yet all of them very obviously have >>> content. >> >> You are mistaken. >> >> stat /proc/$$/as >> File: `/proc/6518/as' >> Size: 2793472 Blocks: 5456 IO Block: 512 regular file >> Device: 5440000h/88342528d Inode: 7557 Links: 1 >> Access: (0600/-rw-------) Uid: ( xx/ joerg) Gid: ( xx/ bs) >> Access: 2016-07-06 16:33:15.660224934 +0200 >> Modify: 2016-07-06 16:33:15.660224934 +0200 >> Change: 2016-07-06 16:33:15.660224934 +0200 >> >> stat /proc/$$/auxv >> File: `/proc/6518/auxv' >> Size: 168 Blocks: 1 IO Block: 512 regular file >> Device: 5440000h/88342528d Inode: 7568 Links: 1 >> Access: (0400/-r--------) Uid: ( xx/ joerg) Gid: ( xx/ bs) >> Access: 2016-07-06 16:33:15.660224934 +0200 >> Modify: 2016-07-06 16:33:15.660224934 +0200 >> Change: 2016-07-06 16:33:15.660224934 +0200 >> >> Any correct implementation of /proc returns the expected numbers in >> st_size as >> well as in st_blocks. > Odd, because I get 0 for both values on all the files in /proc/self and > all the top level files on all kernels I tested prior to sending that > e-mail, for reference, they include: > * A direct clone of HEAD on torvalds/linux > * 4.6.3 mainline > * 4.1.27 mainline > * 4.6.3 mainline with a small number of local patches on top > * 4.1.19+ from the Raspberry Pi foundation > * 4.4.6-gentoo (mainline with Gentoo patches on top) > * 4.5.5-linode69 (not certain about the patches on top) Further ones I've now tested that behave like the others listed above: * 2.4.20-8 from RedHat 9 * 2.6.18-1.2798.fc6 from Fedora Core 6 * 3.11.10-301.fc20 from Fedora 20 IOW, it looks like whatever you're running is an exception here. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers) 2016-07-06 16:05 ` Austin S. Hemmelgarn 2016-07-06 16:11 ` Austin S. Hemmelgarn @ 2016-07-06 16:33 ` Joerg Schilling 2016-07-06 17:35 ` Andreas Dilger 1 sibling, 1 reply; 21+ messages in thread From: Joerg Schilling @ 2016-07-06 16:33 UTC (permalink / raw) To: antonio, ahferroin7; +Cc: linux-btrfs, bug-tar, adilger "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote: > On 2016-07-06 11:22, Joerg Schilling wrote: > > "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote: > > > >>> It should be obvious that a file that offers content also has allocated blocks. > >> What you mean then is that POSIX _implies_ that this is the case, but > >> does not say whether or not it is required. There are all kinds of > >> counterexamples to this too, procfs is a POSIX compliant filesystem > >> (every POSIX certified system has it), yet does not display the behavior > >> that you expect, every single file in /proc for example reports 0 for > >> both st_blocks and st_size, and yet all of them very obviously have content. > > > > You are mistaken. > > > > stat /proc/$$/as > > File: `/proc/6518/as' > > Size: 2793472 Blocks: 5456 IO Block: 512 regular file > > Device: 5440000h/88342528d Inode: 7557 Links: 1 > > Access: (0600/-rw-------) Uid: ( xx/ joerg) Gid: ( xx/ bs) > > Access: 2016-07-06 16:33:15.660224934 +0200 > > Modify: 2016-07-06 16:33:15.660224934 +0200 > > Change: 2016-07-06 16:33:15.660224934 +0200 > > > > stat /proc/$$/auxv > > File: `/proc/6518/auxv' > > Size: 168 Blocks: 1 IO Block: 512 regular file > > Device: 5440000h/88342528d Inode: 7568 Links: 1 > > Access: (0400/-r--------) Uid: ( xx/ joerg) Gid: ( xx/ bs) > > Access: 2016-07-06 16:33:15.660224934 +0200 > > Modify: 2016-07-06 16:33:15.660224934 +0200 > > Change: 2016-07-06 16:33:15.660224934 +0200 > > > > Any correct implementation of /proc returns the expected numbers in st_size as > > well as in st_blocks. > Odd, because I get 0 for both values on all the files in /proc/self and > all the top level files on all kernels I tested prior to sending that I tested this with an official PROCFS-2 implementation that was written by the inventor of the PROC filesystem (Roger Faulkner) who as a sad news pased away last weekend. You may have done your tests on an inofficial procfs implementation.... > > Now you know why BTRFS is still an incomplete filesystem. In a few years when > > it turns 10, this may change. People who implement filesystems of course need > > to learn that they need to hide implementation details from the official user > > space interfaces. > So in other words you think we should be lying about how much is > actually allocated on disk and thus violating the standard directly (and > yes, ext4 and everyone else who does this with delayed allocation _is_ > strictly speaking violating the standard, because _nothing_ is allocated > yet)? If it returns 0, it would be lying or it would be wrong anyway as it did not check fpe available space. Also note that I mentioned already that the priciple availability of SEEK_HOLE does not help as there is e.g. NFS... Jörg -- EMail:joerg@schily.net (home) Jörg Schilling D-13353 Berlin joerg.schilling@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/' ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers) 2016-07-06 16:33 ` Joerg Schilling @ 2016-07-06 17:35 ` Andreas Dilger 0 siblings, 0 replies; 21+ messages in thread From: Andreas Dilger @ 2016-07-06 17:35 UTC (permalink / raw) To: Joerg Schilling; +Cc: antonio, ahferroin7, linux-btrfs, bug-tar [-- Attachment #1: Type: text/plain, Size: 3137 bytes --] > On Jul 6, 2016, at 10:33 AM, Joerg Schilling <Joerg.Schilling@fokus.fraunhofer.de> wrote: > > "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote: > >> On 2016-07-06 11:22, Joerg Schilling wrote: >>> >>> >>> You are mistaken. >>> >>> stat /proc/$$/as >>> File: `/proc/6518/as' >>> Size: 2793472 Blocks: 5456 IO Block: 512 regular file >>> Device: 5440000h/88342528d Inode: 7557 Links: 1 >>> Access: (0600/-rw-------) Uid: ( xx/ joerg) Gid: ( xx/ bs) >>> Access: 2016-07-06 16:33:15.660224934 +0200 >>> Modify: 2016-07-06 16:33:15.660224934 +0200 >>> Change: 2016-07-06 16:33:15.660224934 +0200 >>> >>> stat /proc/$$/auxv >>> File: `/proc/6518/auxv' >>> Size: 168 Blocks: 1 IO Block: 512 regular file >>> Device: 5440000h/88342528d Inode: 7568 Links: 1 >>> Access: (0400/-r--------) Uid: ( xx/ joerg) Gid: ( xx/ bs) >>> Access: 2016-07-06 16:33:15.660224934 +0200 >>> Modify: 2016-07-06 16:33:15.660224934 +0200 >>> Change: 2016-07-06 16:33:15.660224934 +0200 >>> >>> Any correct implementation of /proc returns the expected numbers in >>> st_size as well as in st_blocks. >> >> Odd, because I get 0 for both values on all the files in /proc/self and >> all the top level files on all kernels I tested prior to sending that > > I tested this with an official PROCFS-2 implementation that was written by > the inventor of the PROC filesystem (Roger Faulkner) who as a sad news pased > away last weekend. > > You may have done your tests on an inofficial procfs implementation.... So, what you are saying is that you don't care about star working properly on Linux, because it has an "inofficial" procfs implementation, while Solaris has an "official" implementation? >>> Now you know why BTRFS is still an incomplete filesystem. In a few years >>> when it turns 10, this may change. People who implement filesystems of >>> course need to learn that they need to hide implementation details from >>> the official user space interfaces. >> >> So in other words you think we should be lying about how much is >> actually allocated on disk and thus violating the standard directly (and >> yes, ext4 and everyone else who does this with delayed allocation _is_ >> strictly speaking violating the standard, because _nothing_ is allocated >> yet)? > > If it returns 0, it would be lying or it would be wrong anyway as it did not > check fpe available space. > > Also note that I mentioned already that the priciple availability of SEEK_HOLE > does not help as there is e.g. NFS... So, it's OK that NFS is not POSIX compliant in various ways, and star will deal with it, but you aren't willing to fix a heuristic used by star for a behaviour that is unspecified by POSIX but has caused users to lose data when archiving from several modern filesystems? That's fine, so long as GNU tar is fixed to use the safe fallback in such cases (i.e. trying to archive data from files that are newly created, even if they report st_blocks == 0). Cheers, Andreas [-- Attachment #2: Message signed with OpenPGP using GPGMail --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers) 2016-07-04 19:35 ` [Bug-tar] " Andreas Dilger 2016-07-05 9:28 ` Joerg Schilling @ 2016-07-07 8:08 ` Pavel Raiskup 1 sibling, 0 replies; 21+ messages in thread From: Pavel Raiskup @ 2016-07-07 8:08 UTC (permalink / raw) To: bug-tar; +Cc: Andreas Dilger, linux-btrfs On Monday, July 4, 2016 1:35:25 PM CEST Andreas Dilger wrote: > I think in addition to fixing btrfs (because it needs to work with existing > tar/rsync/etc. tools) it makes sense to *also* fix the heuristics of tar to > handle this situation more robustly. What I was rather thinking about is to remove the [2] heuristic. As there is now SEEK_HOLE implemented, the need for that check "completely sparse files" might be considered less useful. With [1], I'm not sure -- is it that bad to face some false positive there? (it is documented that tar shouldn't be run concurrently with other processes writing to archived files .., and waiting for flush here is probably a very similar race condition). > One option is if st_blocks == 0 then tar should also check if st_mtime is > less than 60s in the past, and if yes then it should call fsync() on the > file to flush any unwritten data to disk , or assume the file is not sparse > and read the whole file, so that it doesn't incorrectly assume that the file > is sparse and skip archiving the file data. The reported fact 'st_blocks != 0' doesn't mean that the fsync() call is not needed, so I'm not 100% we should special-case the 'st_blocks == 0' files. -- As this effectively breaks tar's testsuite on btrfs, could we also explicitly sync in 'genfile'? Pavel > > Cheers, Andreas > > > [1] http://git.savannah.gnu.org/cgit/paxutils.git/tree/lib/system.h?id=ec72abd9dd63bbff4534ec77e97b1a6cadfc3cf8#n392 > > [2] http://git.savannah.gnu.org/cgit/tar.git/tree/src/sparse.c?id=ac065c57fdc1788a2769fb119ed0c8146e1b9dd6#n273 ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: stat() on btrfs reports the st_blocks with delay (data loss in archivers) 2016-07-02 7:18 stat() on btrfs reports the st_blocks with delay (data loss in archivers) Pavel Raiskup 2016-07-04 19:35 ` [Bug-tar] " Andreas Dilger @ 2016-07-11 14:41 ` David Sterba 2016-07-11 15:00 ` Chris Mason 1 sibling, 1 reply; 21+ messages in thread From: David Sterba @ 2016-07-11 14:41 UTC (permalink / raw) To: Pavel Raiskup; +Cc: linux-btrfs, bug-tar On Sat, Jul 02, 2016 at 09:18:07AM +0200, Pavel Raiskup wrote: > There are optimizations in archivers (tar, rsync, ...) that rely on up2date > st_blocks info. For example, in GNU tar there is optimization check [1] > whether the 'st_size' reports more data than the 'st_blocks' can hold --> then > tar considers that file is sparse (and does additional steps). > > It looks like btrfs doesn't show correct value in 'st_blocks' until the data > are synced. ATM, there happens that: > > a) some "tool" creates sparse file > b) that tool does not sync explicitly and exits .. > c) tar is called immediately after that to archive the sparse file > d) tar considers [2] the file is completely sparse (because st_blocks is > zero) and archives no data. Here comes data loss. > > Because we fixed 'btrfs' to report non-zero 'st_blocks' when the file data is > small and is in-lined (no real data blocks) -- I consider this is too bug in > btrfs worth fixing. > > [1] http://git.savannah.gnu.org/cgit/paxutils.git/tree/lib/system.h?id=ec72abd9dd63bbff4534ec77e97b1a6cadfc3cf8#n392 > [2] http://git.savannah.gnu.org/cgit/tar.git/tree/src/sparse.c?id=ac065c57fdc1788a2769fb119ed0c8146e1b9dd6#n273 > > Tested on kernel: > kernel-4.5.7-300.fc24.x86_64 > > Originally reported here, reproducer available there: > https://bugzilla.redhat.com/show_bug.cgi?id=1352061 The reproducer works for me here. So far I found: * the btrfs implementation of stat.st_blocks (btrfs_getattr) includes the 'delayed allocated' bytes, so there is not a problem in principle (http://lxr.free-electrons.com/source/fs/btrfs/inode.c#L9372) * calling fsync on the sparsefile will produce the expected result * a short delay between ./binary and 'stat' will also produce correct result, 0.5 seconds worked for me -- so it IMO proves it's a race between writing and reporting the data * I'm not yet sure where the delay between write and synced 'inode->delalloc_bytes' comes from * I think that st_blocks accounting can be wrong anyway, if the file is mmap-ed and not msync-ed, I'm writing a reproducer for this case ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: stat() on btrfs reports the st_blocks with delay (data loss in archivers) 2016-07-11 14:41 ` David Sterba @ 2016-07-11 15:00 ` Chris Mason 2016-07-11 15:16 ` David Sterba 0 siblings, 1 reply; 21+ messages in thread From: Chris Mason @ 2016-07-11 15:00 UTC (permalink / raw) To: dsterba, Pavel Raiskup, linux-btrfs, bug-tar On 07/11/2016 10:41 AM, David Sterba wrote: > On Sat, Jul 02, 2016 at 09:18:07AM +0200, Pavel Raiskup wrote: >> There are optimizations in archivers (tar, rsync, ...) that rely on up2date >> st_blocks info. For example, in GNU tar there is optimization check [1] >> whether the 'st_size' reports more data than the 'st_blocks' can hold --> then >> tar considers that file is sparse (and does additional steps). >> >> It looks like btrfs doesn't show correct value in 'st_blocks' until the data >> are synced. ATM, there happens that: >> >> a) some "tool" creates sparse file >> b) that tool does not sync explicitly and exits .. >> c) tar is called immediately after that to archive the sparse file >> d) tar considers [2] the file is completely sparse (because st_blocks is >> zero) and archives no data. Here comes data loss. >> >> Because we fixed 'btrfs' to report non-zero 'st_blocks' when the file data is >> small and is in-lined (no real data blocks) -- I consider this is too bug in >> btrfs worth fixing. >> >> [1] http://git.savannah.gnu.org/cgit/paxutils.git/tree/lib/system.h?id=ec72abd9dd63bbff4534ec77e97b1a6cadfc3cf8#n392 >> [2] http://git.savannah.gnu.org/cgit/tar.git/tree/src/sparse.c?id=ac065c57fdc1788a2769fb119ed0c8146e1b9dd6#n273 >> >> Tested on kernel: >> kernel-4.5.7-300.fc24.x86_64 >> >> Originally reported here, reproducer available there: >> https://bugzilla.redhat.com/show_bug.cgi?id=1352061 > > The reproducer works for me here. So far I found: > > * the btrfs implementation of stat.st_blocks (btrfs_getattr) includes > the 'delayed allocated' bytes, so there is not a problem in principle > (https://urldefense.proofpoint.com/v2/url?u=http-3A__lxr.free-2Delectrons.com_source_fs_btrfs_inode.c-23L9372&d=CwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=9QPtTAxcitoznaWRKKHoEQ&m=Y07_PApD0zOC-eaM4Hq-oxAwNlktnY8bDo7LD2GbXRo&s=Is_ZqFiL7a4sVWAB1k1ZuAgbNMK-sZ1gcU7oLtyfKoY&e= ) > > * calling fsync on the sparsefile will produce the expected result > > * a short delay between ./binary and 'stat' will also produce correct > result, 0.5 seconds worked for me -- so it IMO proves it's a race > between writing and reporting the data > > * I'm not yet sure where the delay between write and synced > 'inode->delalloc_bytes' comes from > > * I think that st_blocks accounting can be wrong anyway, if the file is > mmap-ed and not msync-ed, I'm writing a reproducer for this case On my test box running current linux git, things work fine if I run the reproducer once. But if I leave it running in a loop long enough for writeback to kick in, I trigger it. The reproducer has a loop in there where it is adding delalloc writes and truncating them away. What should be happening is that we're leaving some delalloc bits set past EOF, which makes us skip bumping inode->delalloc_bytes during the new write. I can kind of confirm this by changing the reproducer to stat directly after the write call. Normally st_blocks is never zero. But if I leave it running in a loop for 30 seconds or so, I eventually get st_block zero directly after the write(). If I change the C program to unlink the file on exit, running the binary over and over again works every time. So, the real bug is that we're letting some delalloc stat hang around after the truncate, probably related to IO in progress. We do already account for delalloc in what we return to stat, but there's a corner case involving truncate where we screw it up. -chris ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: stat() on btrfs reports the st_blocks with delay (data loss in archivers) 2016-07-11 15:00 ` Chris Mason @ 2016-07-11 15:16 ` David Sterba 2016-07-11 17:30 ` Chris Mason 0 siblings, 1 reply; 21+ messages in thread From: David Sterba @ 2016-07-11 15:16 UTC (permalink / raw) To: Chris Mason; +Cc: dsterba, Pavel Raiskup, linux-btrfs, bug-tar On Mon, Jul 11, 2016 at 11:00:55AM -0400, Chris Mason wrote: > So, the real bug is that we're letting some delalloc stat hang around > after the truncate, probably related to IO in progress. We do already > account for delalloc in what we return to stat, but there's a corner > case involving truncate where we screw it up. So the original testcase: > >> a) some "tool" creates sparse file > >> b) that tool does not sync explicitly and exits .. > >> c) tar is called immediately after that to archive the sparse file > >> d) tar considers [2] the file is completely sparse (because st_blocks is > >> zero) and archives no data. Here comes data loss. will not happen. The application would basically have to mimick the provided reproducer script and do the truncate/write loop and be lucky enough to let tar hit the short race window. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: stat() on btrfs reports the st_blocks with delay (data loss in archivers) 2016-07-11 15:16 ` David Sterba @ 2016-07-11 17:30 ` Chris Mason 0 siblings, 0 replies; 21+ messages in thread From: Chris Mason @ 2016-07-11 17:30 UTC (permalink / raw) To: dsterba, Pavel Raiskup, linux-btrfs, bug-tar On 07/11/2016 11:16 AM, David Sterba wrote: > On Mon, Jul 11, 2016 at 11:00:55AM -0400, Chris Mason wrote: >> So, the real bug is that we're letting some delalloc stat hang around >> after the truncate, probably related to IO in progress. We do already >> account for delalloc in what we return to stat, but there's a corner >> case involving truncate where we screw it up. > > So the original testcase: > >>>> a) some "tool" creates sparse file >>>> b) that tool does not sync explicitly and exits .. >>>> c) tar is called immediately after that to archive the sparse file >>>> d) tar considers [2] the file is completely sparse (because st_blocks is >>>> zero) and archives no data. Here comes data loss. > > will not happen. The application would basically have to mimick the > provided reproducer script and do the truncate/write loop and be lucky > enough to let tar hit the short race window. > Looking harder there is a race window that can trigger this without the truncate loop: 1) application calls write(), we make the pages delalloc (in-ram st_blocks goes up) 2) VM calls write_cache_pages, we go find a contiguous delalloc range 3) We call cow_file_range on the locked range of pages 4) cow_file_range clears the delalloc bits (in-ram st_blocks goes down) < ----- race begins here -----> 5) The io is started 6) The IO completes and extents are inserted into the metadata 7) the on disk/in-ram st_blocks goes up < ---- race ends here ----> This makes a ton more sense than leaking delalloc bits. -chris ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2016-07-11 17:31 UTC | newest] Thread overview: 21+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-07-02 7:18 stat() on btrfs reports the st_blocks with delay (data loss in archivers) Pavel Raiskup 2016-07-04 19:35 ` [Bug-tar] " Andreas Dilger 2016-07-05 9:28 ` Joerg Schilling 2016-07-06 11:37 ` Austin S. Hemmelgarn 2016-07-06 11:49 ` Joerg Schilling 2016-07-06 14:43 ` Antonio Diaz Diaz 2016-07-06 14:53 ` Joerg Schilling 2016-07-06 15:01 ` Paul Eggert 2016-07-06 15:09 ` Joerg Schilling 2016-07-06 15:11 ` Paul Eggert 2016-07-06 15:12 ` Austin S. Hemmelgarn 2016-07-06 15:22 ` Joerg Schilling 2016-07-06 16:05 ` Austin S. Hemmelgarn 2016-07-06 16:11 ` Austin S. Hemmelgarn 2016-07-06 16:33 ` Joerg Schilling 2016-07-06 17:35 ` Andreas Dilger 2016-07-07 8:08 ` Pavel Raiskup 2016-07-11 14:41 ` David Sterba 2016-07-11 15:00 ` Chris Mason 2016-07-11 15:16 ` David Sterba 2016-07-11 17:30 ` Chris Mason
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).