stat() on btrfs reports the st_blocks with delay (data loss in archivers)

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* stat() on btrfs reports the st_blocks with delay (data loss in archivers)
@ 2016-07-02  7:18 Pavel Raiskup
  2016-07-04 19:35 ` [Bug-tar] " Andreas Dilger
  2016-07-11 14:41 ` David Sterba
  0 siblings, 2 replies; 21+ messages in thread
From: Pavel Raiskup @ 2016-07-02  7:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: bug-tar

There are optimizations in archivers (tar, rsync, ...) that rely on up2date
st_blocks info.  For example, in GNU tar there is optimization check [1]
whether the 'st_size' reports more data than the 'st_blocks' can hold --> then
tar considers that file is sparse (and does additional steps).

It looks like btrfs doesn't show correct value in 'st_blocks' until the data
are synced.  ATM, there happens that:

    a) some "tool" creates sparse file
    b) that tool does not sync explicitly and exits ..
    c) tar is called immediately after that to archive the sparse file
    d) tar considers [2] the file is completely sparse (because st_blocks is
       zero) and archives no data.  Here comes data loss.

Because we fixed 'btrfs' to report non-zero 'st_blocks' when the file data is
small and is in-lined (no real data blocks) -- I consider this is too bug in
btrfs worth fixing.

[1] http://git.savannah.gnu.org/cgit/paxutils.git/tree/lib/system.h?id=ec72abd9dd63bbff4534ec77e97b1a6cadfc3cf8#n392
[2] http://git.savannah.gnu.org/cgit/tar.git/tree/src/sparse.c?id=ac065c57fdc1788a2769fb119ed0c8146e1b9dd6#n273

Tested on kernel:
kernel-4.5.7-300.fc24.x86_64

Originally reported here, reproducer available there:
https://bugzilla.redhat.com/show_bug.cgi?id=1352061

Pavel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
  2016-07-02  7:18 stat() on btrfs reports the st_blocks with delay (data loss in archivers) Pavel Raiskup
@ 2016-07-04 19:35 ` Andreas Dilger
  2016-07-05  9:28   ` Joerg Schilling
  2016-07-07  8:08   ` Pavel Raiskup
  2016-07-11 14:41 ` David Sterba
  1 sibling, 2 replies; 21+ messages in thread
From: Andreas Dilger @ 2016-07-04 19:35 UTC (permalink / raw)
  To: Pavel Raiskup; +Cc: linux-btrfs, bug-tar

[-- Attachment #1: Type: text/plain, Size: 2427 bytes --]

On Jul 2, 2016, at 1:18 AM, Pavel Raiskup <praiskup@redhat.com> wrote:
> 
> There are optimizations in archivers (tar, rsync, ...) that rely on up2date
> st_blocks info.  For example, in GNU tar there is optimization check [1]
> whether the 'st_size' reports more data than the 'st_blocks' can hold --> then
> tar considers that file is sparse (and does additional steps).
> 
> It looks like btrfs doesn't show correct value in 'st_blocks' until the data
> are synced.  ATM, there happens that:
> 
>    a) some "tool" creates sparse file
>    b) that tool does not sync explicitly and exits ..
>    c) tar is called immediately after that to archive the sparse file
>    d) tar considers [2] the file is completely sparse (because st_blocks is
>       zero) and archives no data.  Here comes data loss.
> 
> Because we fixed 'btrfs' to report non-zero 'st_blocks' when the file data is
> small and is in-lined (no real data blocks) -- I consider this is too bug in
> btrfs worth fixing.

We had a similar problem with both ext4 and Lustre - the client was reporting
zero blocks due to delayed allocation until data was written to disk.  While
those problems were fixed in the filesystem to report an estimate of the block
count before any blocks were actually written to disk, it seems like this may
be a problem that will come up again with other filesystems in the future.

I think in addition to fixing btrfs (because it needs to work with existing
tar/rsync/etc. tools) it makes sense to *also* fix the heuristics of tar
to handle this situation more robustly.  One option is if st_blocks == 0 then
tar should also check if st_mtime is less than 60s in the past, and if yes
then it should call fsync() on the file to flush any unwritten data to disk,
or assume the file is not sparse and read the whole file, so that it doesn't
incorrectly assume that the file is sparse and skip archiving the file data.

Cheers, Andreas

> 
> [1] http://git.savannah.gnu.org/cgit/paxutils.git/tree/lib/system.h?id=ec72abd9dd63bbff4534ec77e97b1a6cadfc3cf8#n392
> [2] http://git.savannah.gnu.org/cgit/tar.git/tree/src/sparse.c?id=ac065c57fdc1788a2769fb119ed0c8146e1b9dd6#n273
> 
> Tested on kernel:
> kernel-4.5.7-300.fc24.x86_64
> 
> Originally reported here, reproducer available there:
> https://bugzilla.redhat.com/show_bug.cgi?id=1352061
> 
> Pavel
> 
> 


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
  2016-07-04 19:35 ` [Bug-tar] " Andreas Dilger
@ 2016-07-05  9:28   ` Joerg Schilling
  2016-07-06 11:37     ` Austin S. Hemmelgarn
  2016-07-07  8:08   ` Pavel Raiskup
  1 sibling, 1 reply; 21+ messages in thread
From: Joerg Schilling @ 2016-07-05  9:28 UTC (permalink / raw)
  To: praiskup, adilger; +Cc: linux-btrfs, bug-tar

Andreas Dilger <adilger@dilger.ca> wrote:

> I think in addition to fixing btrfs (because it needs to work with existing
> tar/rsync/etc. tools) it makes sense to *also* fix the heuristics of tar
> to handle this situation more robustly.  One option is if st_blocks == 0 then
> tar should also check if st_mtime is less than 60s in the past, and if yes
> then it should call fsync() on the file to flush any unwritten data to disk,
> or assume the file is not sparse and read the whole file, so that it doesn't
> incorrectly assume that the file is sparse and skip archiving the file data.

A broken filesystem is a broken filesystem.

If you try to change gtar to work around a specific problem, it may fail in 
other situations.

Jörg

-- 
 EMail:joerg@schily.net                    (home) Jörg Schilling D-13353 Berlin
       joerg.schilling@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/'

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
  2016-07-05  9:28   ` Joerg Schilling
@ 2016-07-06 11:37     ` Austin S. Hemmelgarn
  2016-07-06 11:49       ` Joerg Schilling
  0 siblings, 1 reply; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-07-06 11:37 UTC (permalink / raw)
  To: Joerg Schilling, praiskup, adilger; +Cc: linux-btrfs, bug-tar

On 2016-07-05 05:28, Joerg Schilling wrote:
> Andreas Dilger <adilger@dilger.ca> wrote:
>
>> I think in addition to fixing btrfs (because it needs to work with existing
>> tar/rsync/etc. tools) it makes sense to *also* fix the heuristics of tar
>> to handle this situation more robustly.  One option is if st_blocks == 0 then
>> tar should also check if st_mtime is less than 60s in the past, and if yes
>> then it should call fsync() on the file to flush any unwritten data to disk,
>> or assume the file is not sparse and read the whole file, so that it doesn't
>> incorrectly assume that the file is sparse and skip archiving the file data.
>
> A broken filesystem is a broken filesystem.
>
> If you try to change gtar to work around a specific problem, it may fail in
> other situations.
The problem with this is that tar is assuming things that are not 
guaranteed to be true.  There is absolutely nothing that says that 
st_blocks has to be non-zero if there's data in the file.  In fact, the 
behavior that BTRFS used to have of reporting st_blocks to be 0 for 
files entirely inlined in the metadata is absolutely correct given the 
description of the field by POSIX, because there _are_ no blocks 
allocated to the file (because the metadata block is technically 
equivalent to the inode, which isn't counted by st_blocks).  This is yet 
another example of an old interface (in this case, sparse file 
detection) being short-sighted (read in this case as non-existent).

The proper fix for this is that tar (and anything else that handles 
sparse files differently) should be parsing the file regardless.  It has 
to anyway for a normal sparse file to figure out where the sparse 
regions are, and optimizing for a file that's completely sparse (and 
therefore probably pre-allocated with fallocate) is not all that 
reasonable considering that this is going to be a very rare case in 
normal usage.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
  2016-07-06 11:37     ` Austin S. Hemmelgarn
@ 2016-07-06 11:49       ` Joerg Schilling
  2016-07-06 14:43         ` Antonio Diaz Diaz
  0 siblings, 1 reply; 21+ messages in thread
From: Joerg Schilling @ 2016-07-06 11:49 UTC (permalink / raw)
  To: praiskup, ahferroin7, adilger; +Cc: linux-btrfs, bug-tar

"Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:

> > A broken filesystem is a broken filesystem.
> >
> > If you try to change gtar to work around a specific problem, it may fail in
> > other situations.
> The problem with this is that tar is assuming things that are not 
> guaranteed to be true.  There is absolutely nothing that says that 
> st_blocks has to be non-zero if there's data in the file.  In fact, the 

This is not true: POSIX requires st_blocks to be != 0 in case that the file 
contains data.

> behavior that BTRFS used to have of reporting st_blocks to be 0 for 
> files entirely inlined in the metadata is absolutely correct given the 
> description of the field by POSIX, because there _are_ no blocks 
> allocated to the file (because the metadata block is technically 
> equivalent to the inode, which isn't counted by st_blocks).  This is yet 
> another example of an old interface (in this case, sparse file 
> detection) being short-sighted (read in this case as non-existent).

The internal state of a file system is irrelevant. The only thing that counts 
is the user space view and if a file contains data (read succeeds in user 
space), it needs to report st_blocks != 0.

> The proper fix for this is that tar (and anything else that handles 
> sparse files differently) should be parsing the file regardless.  It has 
> to anyway for a normal sparse file to figure out where the sparse 
> regions are, and optimizing for a file that's completely sparse (and 
> therefore probably pre-allocated with fallocate) is not all that 
> reasonable considering that this is going to be a very rare case in 
> normal usage.

This does not help.

Even on a decent OS (e.g. Solaris since Summer 2005) and a decent tar 
implementation (star) that supports SEEK_HOLE since Summer 2005, this method 
will not work for all filesystems as there may be old filesystem 
implementations and as there may be NFS...

For this reason, star still checks st_blocks in case that SEEK_HOLE did not 
work.

Jörg

-- 
 EMail:joerg@schily.net                    (home) Jörg Schilling D-13353 Berlin
       joerg.schilling@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/'

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
  2016-07-06 11:49       ` Joerg Schilling
@ 2016-07-06 14:43         ` Antonio Diaz Diaz
  2016-07-06 14:53           ` Joerg Schilling
  0 siblings, 1 reply; 21+ messages in thread
From: Antonio Diaz Diaz @ 2016-07-06 14:43 UTC (permalink / raw)
  To: Joerg Schilling; +Cc: ahferroin7, adilger, linux-btrfs, bug-tar

Joerg Schilling wrote:
> POSIX requires st_blocks to be != 0 in case that the file contains data.

Please, could you provide a reference? I can't find such requirement at 
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html


Thanks.
Antonio.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
  2016-07-06 14:43         ` Antonio Diaz Diaz
@ 2016-07-06 14:53           ` Joerg Schilling
  2016-07-06 15:01             ` Paul Eggert
  2016-07-06 15:12             ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 21+ messages in thread
From: Joerg Schilling @ 2016-07-06 14:53 UTC (permalink / raw)
  To: antonio; +Cc: linux-btrfs, bug-tar, ahferroin7, adilger

Antonio Diaz Diaz <antonio@gnu.org> wrote:

> Joerg Schilling wrote:
> > POSIX requires st_blocks to be != 0 in case that the file contains data.
>
> Please, could you provide a reference? I can't find such requirement at 
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html

	blkcnt_t st_blocks      Number of blocks allocated for this object.

It should be obvious that a file that offers content also has allocated blocks.

Blocks are "allocated" when the OS decides whether the new data will fit on the 
medium. The fact that some filesystems may have data in a cache but not yet on 
the medium does not matter here. This is how UNIX worked since st_block has 
been introduced nearly 40 years ago. 

A new filesystem cannot introduce new rules just because people believe it would 
save time.

Jörg

-- 
 EMail:joerg@schily.net                    (home) Jörg Schilling D-13353 Berlin
       joerg.schilling@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/'

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
  2016-07-06 14:53           ` Joerg Schilling
@ 2016-07-06 15:01             ` Paul Eggert
  2016-07-06 15:09               ` Joerg Schilling
  2016-07-06 15:12             ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 21+ messages in thread
From: Paul Eggert @ 2016-07-06 15:01 UTC (permalink / raw)
  To: Joerg Schilling, antonio; +Cc: adilger, ahferroin7, linux-btrfs, bug-tar

On 07/06/2016 04:53 PM, Joerg Schilling wrote:
> Antonio Diaz Diaz<antonio@gnu.org>  wrote:
>
>> >Joerg Schilling wrote:
>>> > >POSIX requires st_blocks to be != 0 in case that the file contains data.
>> >
>> >Please, could you provide a reference? I can't find such requirement at
>> >http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html
> 	blkcnt_t st_blocks      Number of blocks allocated for this object.

This doesn't require that st_blocks must be nonzero if the file contains 
nonzero data, any more that it requires that st_blocks must be nonzero 
if the file contains zero data. In either case, metadata outside the 
scope of st_blocks might contain enough information for the file system 
to represent all the file's data.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
  2016-07-06 15:01             ` Paul Eggert
@ 2016-07-06 15:09               ` Joerg Schilling
  2016-07-06 15:11                 ` Paul Eggert
  0 siblings, 1 reply; 21+ messages in thread
From: Joerg Schilling @ 2016-07-06 15:09 UTC (permalink / raw)
  To: eggert, antonio; +Cc: linux-btrfs, bug-tar, ahferroin7, adilger

Paul Eggert <eggert@cs.ucla.edu> wrote:

> On 07/06/2016 04:53 PM, Joerg Schilling wrote:
> > Antonio Diaz Diaz<antonio@gnu.org>  wrote:
> >
> >> >Joerg Schilling wrote:
> >>> > >POSIX requires st_blocks to be != 0 in case that the file contains data.
> >> >
> >> >Please, could you provide a reference? I can't find such requirement at
> >> >http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html
> > 	blkcnt_t st_blocks      Number of blocks allocated for this object.
>
> This doesn't require that st_blocks must be nonzero if the file contains 
> nonzero data, any more that it requires that st_blocks must be nonzero 
> if the file contains zero data. In either case, metadata outside the 
> scope of st_blocks might contain enough information for the file system 
> to represent all the file's data.

In other words, you concur that a delayed assignment of the "correct" value for 
st_blocks while the contend of the file does not change is not permitted.

Jörg

-- 
 EMail:joerg@schily.net                    (home) Jörg Schilling D-13353 Berlin
       joerg.schilling@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/'

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
  2016-07-06 15:09               ` Joerg Schilling
@ 2016-07-06 15:11                 ` Paul Eggert
  0 siblings, 0 replies; 21+ messages in thread
From: Paul Eggert @ 2016-07-06 15:11 UTC (permalink / raw)
  To: Joerg Schilling, antonio; +Cc: linux-btrfs, bug-tar, ahferroin7, adilger

On 07/06/2016 05:09 PM, Joerg Schilling wrote:
> you concur that a delayed assignment of the "correct" value for
> st_blocks while the contend of the file does not change is not permitted.
I'm not sure I agree even with that. A file system may undergo garbage 
collection and compaction, for instance, in which a file's data do not 
change but its internal representation does.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
  2016-07-06 14:53           ` Joerg Schilling
  2016-07-06 15:01             ` Paul Eggert
@ 2016-07-06 15:12             ` Austin S. Hemmelgarn
  2016-07-06 15:22               ` Joerg Schilling
  1 sibling, 1 reply; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-07-06 15:12 UTC (permalink / raw)
  To: Joerg Schilling, antonio; +Cc: linux-btrfs, bug-tar, adilger

On 2016-07-06 10:53, Joerg Schilling wrote:
> Antonio Diaz Diaz <antonio@gnu.org> wrote:
>
>> Joerg Schilling wrote:
>>> POSIX requires st_blocks to be != 0 in case that the file contains data.
>>
>> Please, could you provide a reference? I can't find such requirement at
>> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html
>
> 	blkcnt_t st_blocks      Number of blocks allocated for this object.
>
> It should be obvious that a file that offers content also has allocated blocks.
What you mean then is that POSIX _implies_ that this is the case, but 
does not say whether or not it is required.  There are all kinds of 
counterexamples to this too, procfs is a POSIX compliant filesystem 
(every POSIX certified system has it), yet does not display the behavior 
that you expect, every single file in /proc for example reports 0 for 
both st_blocks and st_size, and yet all of them very obviously have content.
>
> Blocks are "allocated" when the OS decides whether the new data will fit on the
> medium. The fact that some filesystems may have data in a cache but not yet on
> the medium does not matter here. This is how UNIX worked since st_block has
> been introduced nearly 40 years ago.
Tradition is the corpse of wisdom.  Backwards comparability is a problem 
just as much as a good thing.

In all seriousness though, this started out because stuff wasn't cached 
to anywhere near the degree it is today, and there was no such thing as 
delayed allocation.  When you said to write, the filesystem allocated 
the blocks, regardless of when it actually wrote the data.  IOW, the 
behavior that GNU tar is relying on is an implementation detail, not an 
API.  Just like df, this breaks under modern designs, not because they 
chose to break it, but because it wasn't designed for use with such 
implementations.

In the case of tar and similar things though, I'd argue that it's not 
sensible to special case files that are 'sparse', it should store any 
long enough run of zeroes as a sparse region, then provide an option to 
say to not make those files sparse when restored.
>
> A new filesystem cannot introduce new rules just because people believe it would
> save time.
Saying the file has no blocks when there are no blocks allocated for it 
is not to 'save time', it's absolutely accurate.  Suppose SVR4 UFS had a 
way to pack file data into the inode if it was small enough.  In that 
case, it woulod be perfectly reasonable to return 0 for st_blocks 
because the inode table in UFS is a fixed pre-allocated structure, and 
therefore nothing is allocated to the file itself except the inode.  The 
same applies in the case of a file packed into it's own metadata block 
on BTRFS, nothing is allocated to that file beyond the metadata block it 
has to have to store the inode.  In the case of delayed allocation where 
the file hasn't been flushed, there is nothing allocated, so st_blocks 
based on a strict interpretation of it's description in POSIX _should_ 
be 0, because nothing is allocated yet.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
  2016-07-06 15:12             ` Austin S. Hemmelgarn
@ 2016-07-06 15:22               ` Joerg Schilling
  2016-07-06 16:05                 ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 21+ messages in thread
From: Joerg Schilling @ 2016-07-06 15:22 UTC (permalink / raw)
  To: antonio, ahferroin7; +Cc: linux-btrfs, bug-tar, adilger

"Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:

> > It should be obvious that a file that offers content also has allocated blocks.
> What you mean then is that POSIX _implies_ that this is the case, but 
> does not say whether or not it is required.  There are all kinds of 
> counterexamples to this too, procfs is a POSIX compliant filesystem 
> (every POSIX certified system has it), yet does not display the behavior 
> that you expect, every single file in /proc for example reports 0 for 
> both st_blocks and st_size, and yet all of them very obviously have content.

You are mistaken.

stat /proc/$$/as
  File: `/proc/6518/as'
  Size: 2793472         Blocks: 5456       IO Block: 512    regular file
Device: 5440000h/88342528d      Inode: 7557        Links: 1
Access: (0600/-rw-------)  Uid: (   xx/   joerg)   Gid: (  xx/      bs)
Access: 2016-07-06 16:33:15.660224934 +0200
Modify: 2016-07-06 16:33:15.660224934 +0200
Change: 2016-07-06 16:33:15.660224934 +0200

stat /proc/$$/auxv
  File: `/proc/6518/auxv'
  Size: 168             Blocks: 1          IO Block: 512    regular file
Device: 5440000h/88342528d      Inode: 7568        Links: 1
Access: (0400/-r--------)  Uid: (   xx/   joerg)   Gid: (  xx/      bs)
Access: 2016-07-06 16:33:15.660224934 +0200
Modify: 2016-07-06 16:33:15.660224934 +0200
Change: 2016-07-06 16:33:15.660224934 +0200

Any correct implementation of /proc returns the expected numbers in st_size as 
well as in st_blocks.

> In all seriousness though, this started out because stuff wasn't cached 
> to anywhere near the degree it is today, and there was no such thing as 
> delayed allocation.  When you said to write, the filesystem allocated 
> the blocks, regardless of when it actually wrote the data.  IOW, the 
> behavior that GNU tar is relying on is an implementation detail, not an 
> API.  Just like df, this breaks under modern designs, not because they 
> chose to break it, but because it wasn't designed for use with such 
> implementations.

This seems to be a strange interpretation if what a standard is.

> > A new filesystem cannot introduce new rules just because people believe it would
> > save time.
> Saying the file has no blocks when there are no blocks allocated for it 
> is not to 'save time', it's absolutely accurate.  Suppose SVR4 UFS had a 
> way to pack file data into the inode if it was small enough.  In that 
> case, it woulod be perfectly reasonable to return 0 for st_blocks 
> because the inode table in UFS is a fixed pre-allocated structure, and 

Given that inode size is 128, such a change would not break things as the 
heuristics would not imply a sparse file here.

> therefore nothing is allocated to the file itself except the inode.  The 
> same applies in the case of a file packed into it's own metadata block 
> on BTRFS, nothing is allocated to that file beyond the metadata block it 
> has to have to store the inode.  In the case of delayed allocation where 
> the file hasn't been flushed, there is nothing allocated, so st_blocks 
> based on a strict interpretation of it's description in POSIX _should_ 
> be 0, because nothing is allocated yet.

Now you know why BTRFS is still an incomplete filesystem. In a few years when 
it turns 10, this may change. People who implement filesystems of course need 
to learn that they need to hide implementation details from the official user 
space interfaces.

Jörg

-- 
 EMail:joerg@schily.net                    (home) Jörg Schilling D-13353 Berlin
       joerg.schilling@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/'

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
  2016-07-06 15:22               ` Joerg Schilling
@ 2016-07-06 16:05                 ` Austin S. Hemmelgarn
  2016-07-06 16:11                   ` Austin S. Hemmelgarn
  2016-07-06 16:33                   ` Joerg Schilling
  0 siblings, 2 replies; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-07-06 16:05 UTC (permalink / raw)
  To: Joerg Schilling, antonio; +Cc: linux-btrfs, bug-tar, adilger

On 2016-07-06 11:22, Joerg Schilling wrote:
> "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:
>
>>> It should be obvious that a file that offers content also has allocated blocks.
>> What you mean then is that POSIX _implies_ that this is the case, but
>> does not say whether or not it is required.  There are all kinds of
>> counterexamples to this too, procfs is a POSIX compliant filesystem
>> (every POSIX certified system has it), yet does not display the behavior
>> that you expect, every single file in /proc for example reports 0 for
>> both st_blocks and st_size, and yet all of them very obviously have content.
>
> You are mistaken.
>
> stat /proc/$$/as
>   File: `/proc/6518/as'
>   Size: 2793472         Blocks: 5456       IO Block: 512    regular file
> Device: 5440000h/88342528d      Inode: 7557        Links: 1
> Access: (0600/-rw-------)  Uid: (   xx/   joerg)   Gid: (  xx/      bs)
> Access: 2016-07-06 16:33:15.660224934 +0200
> Modify: 2016-07-06 16:33:15.660224934 +0200
> Change: 2016-07-06 16:33:15.660224934 +0200
>
> stat /proc/$$/auxv
>   File: `/proc/6518/auxv'
>   Size: 168             Blocks: 1          IO Block: 512    regular file
> Device: 5440000h/88342528d      Inode: 7568        Links: 1
> Access: (0400/-r--------)  Uid: (   xx/   joerg)   Gid: (  xx/      bs)
> Access: 2016-07-06 16:33:15.660224934 +0200
> Modify: 2016-07-06 16:33:15.660224934 +0200
> Change: 2016-07-06 16:33:15.660224934 +0200
>
> Any correct implementation of /proc returns the expected numbers in st_size as
> well as in st_blocks.
Odd, because I get 0 for both values on all the files in /proc/self and 
all the top level files on all kernels I tested prior to sending that 
e-mail, for reference, they include:
* A direct clone of HEAD on torvalds/linux
* 4.6.3 mainline
* 4.1.27 mainline
* 4.6.3 mainline with a small number of local patches on top
* 4.1.19+ from the Raspberry Pi foundation
* 4.4.6-gentoo (mainline with Gentoo patches on top)
* 4.5.5-linode69 (not certain about the patches on top)
It's probably notable that I don't see /proc/$PID/as on any of these 
systems, which implies you're running some significantly different 
kernel version to begin with, and therefore it's not unreasonable to 
assume that what you see is because of some misguided patch that got 
added to allow tar to archive /proc.
>
>> In all seriousness though, this started out because stuff wasn't cached
>> to anywhere near the degree it is today, and there was no such thing as
>> delayed allocation.  When you said to write, the filesystem allocated
>> the blocks, regardless of when it actually wrote the data.  IOW, the
>> behavior that GNU tar is relying on is an implementation detail, not an
>> API.  Just like df, this breaks under modern designs, not because they
>> chose to break it, but because it wasn't designed for use with such
>> implementations.
>
> This seems to be a strange interpretation if what a standard is.
Except what I'm talking about is the _interpretation_ of the standard, 
not the standard itself.  I said nothing about the standard, all it 
requires is that st_blocks be the number of 512 byte blocks allocated by 
the filesystem for the file.  There is nothing in there about it having 
to reflect the expected size of the allocated content on disk.  In fact, 
there's technically nothing in there about how to handle sparse files 
either.

To further explain what I'm trying to say, here's a rough description of 
what happens in SVR4 UFS (and other non-delayed allocation filesystems) 
when you issue a write:
1. The number of new blocks needed to fulfill the write request is 
calculated.
2. If this number is greater than 0, that many new blocks are allocated, 
and st_blocks for that file is functionally updated (I don't recall if 
it was dynamically calculated per call or not)
3. At some indeterminate point in the future, the decision is made to 
flush the cache.
4. The data is written to the appropriate place in the file.

By comparison, in a delayed allocation scenario, 3 happens before 1 and 
2.  1 and 2 obviously have to be strictly ordered WRT each other and 4, 
but based on the POSIX standard, 3 does not have to be strictly ordered 
with regards to any of them (although it is illogical to have it between 
1 and 2 or after 4).  Because it is not required by the standard to have 
3 be strictly ordered and the ordering isn't part of the API itself, 
where it happens in the sequence is an implementation detail.
>
>>> A new filesystem cannot introduce new rules just because people believe it would
>>> save time.
>> Saying the file has no blocks when there are no blocks allocated for it
>> is not to 'save time', it's absolutely accurate.  Suppose SVR4 UFS had a
>> way to pack file data into the inode if it was small enough.  In that
>> case, it woulod be perfectly reasonable to return 0 for st_blocks
>> because the inode table in UFS is a fixed pre-allocated structure, and
>
> Given that inode size is 128, such a change would not break things as the
> heuristics would not imply a sparse file here.
OK, so change the heuristic then so that there's a reasonable limit on 
small files, or better yet, just get rid of it, as it was introduced to 
save time itself.  The case it optimizes for (large sparse files) is 
mostly irrelevant, because you have to parse the file to figure out 
what's sparse anyway, and anyone who's dealing with completely sparse 
files has other potential issues to deal with (I'm actually curious if 
you know of some legitimate reason to copy a completely empty file in a 
backup anyway, they're almost always either pre-allocated but unused 
files which will just get reallocated by whatever allocated them in the 
first place, lock-files, or something similar in some other way). 
Regardless, the correct way on current Linux systems to determine file 
sparseness is SEEK_DATA and SEEK_HOLE.  You have to read any data that's 
there anyway, so alternating SEEK_DATA and SEEK_HOLE is necessary to 
find where the data is that you need to read.  If the first SEEK_HOLE 
returns the end of the file, then you know it's not sparse.
>
>> therefore nothing is allocated to the file itself except the inode.  The
>> same applies in the case of a file packed into it's own metadata block
>> on BTRFS, nothing is allocated to that file beyond the metadata block it
>> has to have to store the inode.  In the case of delayed allocation where
>> the file hasn't been flushed, there is nothing allocated, so st_blocks
>> based on a strict interpretation of it's description in POSIX _should_
>> be 0, because nothing is allocated yet.
>
> Now you know why BTRFS is still an incomplete filesystem. In a few years when
> it turns 10, this may change. People who implement filesystems of course need
> to learn that they need to hide implementation details from the official user
> space interfaces.
So in other words you think we should be lying about how much is 
actually allocated on disk and thus violating the standard directly (and 
yes, ext4 and everyone else who does this with delayed allocation _is_ 
strictly speaking violating the standard, because _nothing_ is allocated 
yet)?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
  2016-07-06 16:05                 ` Austin S. Hemmelgarn
@ 2016-07-06 16:11                   ` Austin S. Hemmelgarn
  2016-07-06 16:33                   ` Joerg Schilling
  1 sibling, 0 replies; 21+ messages in thread
From: Austin S. Hemmelgarn @ 2016-07-06 16:11 UTC (permalink / raw)
  To: Joerg Schilling, antonio; +Cc: linux-btrfs, bug-tar, adilger

On 2016-07-06 12:05, Austin S. Hemmelgarn wrote:
> On 2016-07-06 11:22, Joerg Schilling wrote:
>> "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:
>>
>>>> It should be obvious that a file that offers content also has
>>>> allocated blocks.
>>> What you mean then is that POSIX _implies_ that this is the case, but
>>> does not say whether or not it is required.  There are all kinds of
>>> counterexamples to this too, procfs is a POSIX compliant filesystem
>>> (every POSIX certified system has it), yet does not display the behavior
>>> that you expect, every single file in /proc for example reports 0 for
>>> both st_blocks and st_size, and yet all of them very obviously have
>>> content.
>>
>> You are mistaken.
>>
>> stat /proc/$$/as
>>   File: `/proc/6518/as'
>>   Size: 2793472         Blocks: 5456       IO Block: 512    regular file
>> Device: 5440000h/88342528d      Inode: 7557        Links: 1
>> Access: (0600/-rw-------)  Uid: (   xx/   joerg)   Gid: (  xx/      bs)
>> Access: 2016-07-06 16:33:15.660224934 +0200
>> Modify: 2016-07-06 16:33:15.660224934 +0200
>> Change: 2016-07-06 16:33:15.660224934 +0200
>>
>> stat /proc/$$/auxv
>>   File: `/proc/6518/auxv'
>>   Size: 168             Blocks: 1          IO Block: 512    regular file
>> Device: 5440000h/88342528d      Inode: 7568        Links: 1
>> Access: (0400/-r--------)  Uid: (   xx/   joerg)   Gid: (  xx/      bs)
>> Access: 2016-07-06 16:33:15.660224934 +0200
>> Modify: 2016-07-06 16:33:15.660224934 +0200
>> Change: 2016-07-06 16:33:15.660224934 +0200
>>
>> Any correct implementation of /proc returns the expected numbers in
>> st_size as
>> well as in st_blocks.
> Odd, because I get 0 for both values on all the files in /proc/self and
> all the top level files on all kernels I tested prior to sending that
> e-mail, for reference, they include:
> * A direct clone of HEAD on torvalds/linux
> * 4.6.3 mainline
> * 4.1.27 mainline
> * 4.6.3 mainline with a small number of local patches on top
> * 4.1.19+ from the Raspberry Pi foundation
> * 4.4.6-gentoo (mainline with Gentoo patches on top)
> * 4.5.5-linode69 (not certain about the patches on top)
Further ones I've now tested that behave like the others listed above:
* 2.4.20-8 from RedHat 9
* 2.6.18-1.2798.fc6 from Fedora Core 6
* 3.11.10-301.fc20 from Fedora 20

IOW, it looks like whatever you're running is an exception here.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
  2016-07-06 16:05                 ` Austin S. Hemmelgarn
  2016-07-06 16:11                   ` Austin S. Hemmelgarn
@ 2016-07-06 16:33                   ` Joerg Schilling
  2016-07-06 17:35                     ` Andreas Dilger
  1 sibling, 1 reply; 21+ messages in thread
From: Joerg Schilling @ 2016-07-06 16:33 UTC (permalink / raw)
  To: antonio, ahferroin7; +Cc: linux-btrfs, bug-tar, adilger

"Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:

> On 2016-07-06 11:22, Joerg Schilling wrote:
> > "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:
> >
> >>> It should be obvious that a file that offers content also has allocated blocks.
> >> What you mean then is that POSIX _implies_ that this is the case, but
> >> does not say whether or not it is required.  There are all kinds of
> >> counterexamples to this too, procfs is a POSIX compliant filesystem
> >> (every POSIX certified system has it), yet does not display the behavior
> >> that you expect, every single file in /proc for example reports 0 for
> >> both st_blocks and st_size, and yet all of them very obviously have content.
> >
> > You are mistaken.
> >
> > stat /proc/$$/as
> >   File: `/proc/6518/as'
> >   Size: 2793472         Blocks: 5456       IO Block: 512    regular file
> > Device: 5440000h/88342528d      Inode: 7557        Links: 1
> > Access: (0600/-rw-------)  Uid: (   xx/   joerg)   Gid: (  xx/      bs)
> > Access: 2016-07-06 16:33:15.660224934 +0200
> > Modify: 2016-07-06 16:33:15.660224934 +0200
> > Change: 2016-07-06 16:33:15.660224934 +0200
> >
> > stat /proc/$$/auxv
> >   File: `/proc/6518/auxv'
> >   Size: 168             Blocks: 1          IO Block: 512    regular file
> > Device: 5440000h/88342528d      Inode: 7568        Links: 1
> > Access: (0400/-r--------)  Uid: (   xx/   joerg)   Gid: (  xx/      bs)
> > Access: 2016-07-06 16:33:15.660224934 +0200
> > Modify: 2016-07-06 16:33:15.660224934 +0200
> > Change: 2016-07-06 16:33:15.660224934 +0200
> >
> > Any correct implementation of /proc returns the expected numbers in st_size as
> > well as in st_blocks.
> Odd, because I get 0 for both values on all the files in /proc/self and 
> all the top level files on all kernels I tested prior to sending that 

I tested this with an official PROCFS-2 implementation that was written by 
the inventor of the PROC filesystem (Roger Faulkner) who as a sad news pased 
away last weekend.

You may have done your tests on an inofficial procfs implementation....

> > Now you know why BTRFS is still an incomplete filesystem. In a few years when
> > it turns 10, this may change. People who implement filesystems of course need
> > to learn that they need to hide implementation details from the official user
> > space interfaces.
> So in other words you think we should be lying about how much is 
> actually allocated on disk and thus violating the standard directly (and 
> yes, ext4 and everyone else who does this with delayed allocation _is_ 
> strictly speaking violating the standard, because _nothing_ is allocated 
> yet)?

If it returns 0, it would be lying or it would be wrong anyway as it did not 
check fpe available space.

Also note that I mentioned already that the priciple availability of SEEK_HOLE 
does not help as there is e.g. NFS...

Jörg

-- 
 EMail:joerg@schily.net                    (home) Jörg Schilling D-13353 Berlin
       joerg.schilling@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/'

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
  2016-07-06 16:33                   ` Joerg Schilling
@ 2016-07-06 17:35                     ` Andreas Dilger
  0 siblings, 0 replies; 21+ messages in thread
From: Andreas Dilger @ 2016-07-06 17:35 UTC (permalink / raw)
  To: Joerg Schilling; +Cc: antonio, ahferroin7, linux-btrfs, bug-tar

[-- Attachment #1: Type: text/plain, Size: 3137 bytes --]


> On Jul 6, 2016, at 10:33 AM, Joerg Schilling <Joerg.Schilling@fokus.fraunhofer.de> wrote:
> 
> "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:
> 
>> On 2016-07-06 11:22, Joerg Schilling wrote:
>>> 
>>> 
>>> You are mistaken.
>>> 
>>> stat /proc/$$/as
>>>  File: `/proc/6518/as'
>>>  Size: 2793472         Blocks: 5456       IO Block: 512    regular file
>>> Device: 5440000h/88342528d      Inode: 7557        Links: 1
>>> Access: (0600/-rw-------)  Uid: (   xx/   joerg)   Gid: (  xx/      bs)
>>> Access: 2016-07-06 16:33:15.660224934 +0200
>>> Modify: 2016-07-06 16:33:15.660224934 +0200
>>> Change: 2016-07-06 16:33:15.660224934 +0200
>>> 
>>> stat /proc/$$/auxv
>>>  File: `/proc/6518/auxv'
>>>  Size: 168             Blocks: 1          IO Block: 512    regular file
>>> Device: 5440000h/88342528d      Inode: 7568        Links: 1
>>> Access: (0400/-r--------)  Uid: (   xx/   joerg)   Gid: (  xx/      bs)
>>> Access: 2016-07-06 16:33:15.660224934 +0200
>>> Modify: 2016-07-06 16:33:15.660224934 +0200
>>> Change: 2016-07-06 16:33:15.660224934 +0200
>>> 
>>> Any correct implementation of /proc returns the expected numbers in
>>> st_size as well as in st_blocks.
>> 
>> Odd, because I get 0 for both values on all the files in /proc/self and
>> all the top level files on all kernels I tested prior to sending that
> 
> I tested this with an official PROCFS-2 implementation that was written by
> the inventor of the PROC filesystem (Roger Faulkner) who as a sad news pased
> away last weekend.
> 
> You may have done your tests on an inofficial procfs implementation....

So, what you are saying is that you don't care about star working properly
on Linux, because it has an "inofficial" procfs implementation, while Solaris
has an "official" implementation?

>>> Now you know why BTRFS is still an incomplete filesystem. In a few years
>>> when it turns 10, this may change. People who implement filesystems of
>>> course need to learn that they need to hide implementation details from
>>> the official user space interfaces.
>> 
>> So in other words you think we should be lying about how much is
>> actually allocated on disk and thus violating the standard directly (and
>> yes, ext4 and everyone else who does this with delayed allocation _is_
>> strictly speaking violating the standard, because _nothing_ is allocated
>> yet)?
> 
> If it returns 0, it would be lying or it would be wrong anyway as it did not
> check fpe available space.
> 
> Also note that I mentioned already that the priciple availability of SEEK_HOLE
> does not help as there is e.g. NFS...

So, it's OK that NFS is not POSIX compliant in various ways, and star will
deal with it, but you aren't willing to fix a heuristic used by star for a
behaviour that is unspecified by POSIX but has caused users to lose data
when archiving from several modern filesystems?

That's fine, so long as GNU tar is fixed to use the safe fallback in such
cases (i.e. trying to archive data from files that are newly created, even
if they report st_blocks == 0).

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
  2016-07-04 19:35 ` [Bug-tar] " Andreas Dilger
  2016-07-05  9:28   ` Joerg Schilling
@ 2016-07-07  8:08   ` Pavel Raiskup
  1 sibling, 0 replies; 21+ messages in thread
From: Pavel Raiskup @ 2016-07-07  8:08 UTC (permalink / raw)
  To: bug-tar; +Cc: Andreas Dilger, linux-btrfs

On Monday, July 4, 2016 1:35:25 PM CEST Andreas Dilger wrote:
> I think in addition to fixing btrfs (because it needs to work with existing
> tar/rsync/etc. tools) it makes sense to *also* fix the heuristics of tar to
> handle this situation more robustly.

What I was rather thinking about is to remove the [2] heuristic.  As there is
now SEEK_HOLE implemented, the need for that check "completely sparse files"
might be considered less useful.  With [1], I'm not sure -- is it that bad to
face some false positive there? (it is documented that tar shouldn't be run
concurrently with other processes writing to archived files .., and waiting for
flush here is probably a very similar race condition).

> One option is if st_blocks == 0 then tar should also check if st_mtime is
> less than 60s in the past, and if yes then it should call fsync() on the
> file to flush any unwritten data to disk , or assume the file is not sparse
> and read the whole file, so that it doesn't incorrectly assume that the file
> is sparse and skip archiving the file data.

The reported fact 'st_blocks != 0' doesn't mean that the fsync() call is not
needed, so I'm not 100% we should special-case the 'st_blocks == 0' files.

--

As this effectively breaks tar's testsuite on btrfs, could we also explicitly
sync in 'genfile'?

Pavel

> 
> Cheers, Andreas
> 
> > [1] http://git.savannah.gnu.org/cgit/paxutils.git/tree/lib/system.h?id=ec72abd9dd63bbff4534ec77e97b1a6cadfc3cf8#n392
> > [2] http://git.savannah.gnu.org/cgit/tar.git/tree/src/sparse.c?id=ac065c57fdc1788a2769fb119ed0c8146e1b9dd6#n273

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: stat() on btrfs reports the st_blocks with delay (data loss in archivers)
  2016-07-02  7:18 stat() on btrfs reports the st_blocks with delay (data loss in archivers) Pavel Raiskup
  2016-07-04 19:35 ` [Bug-tar] " Andreas Dilger
@ 2016-07-11 14:41 ` David Sterba
  2016-07-11 15:00   ` Chris Mason
  1 sibling, 1 reply; 21+ messages in thread
From: David Sterba @ 2016-07-11 14:41 UTC (permalink / raw)
  To: Pavel Raiskup; +Cc: linux-btrfs, bug-tar

On Sat, Jul 02, 2016 at 09:18:07AM +0200, Pavel Raiskup wrote:
> There are optimizations in archivers (tar, rsync, ...) that rely on up2date
> st_blocks info.  For example, in GNU tar there is optimization check [1]
> whether the 'st_size' reports more data than the 'st_blocks' can hold --> then
> tar considers that file is sparse (and does additional steps).
> 
> It looks like btrfs doesn't show correct value in 'st_blocks' until the data
> are synced.  ATM, there happens that:
> 
>     a) some "tool" creates sparse file
>     b) that tool does not sync explicitly and exits ..
>     c) tar is called immediately after that to archive the sparse file
>     d) tar considers [2] the file is completely sparse (because st_blocks is
>        zero) and archives no data.  Here comes data loss.
> 
> Because we fixed 'btrfs' to report non-zero 'st_blocks' when the file data is
> small and is in-lined (no real data blocks) -- I consider this is too bug in
> btrfs worth fixing.
> 
> [1] http://git.savannah.gnu.org/cgit/paxutils.git/tree/lib/system.h?id=ec72abd9dd63bbff4534ec77e97b1a6cadfc3cf8#n392
> [2] http://git.savannah.gnu.org/cgit/tar.git/tree/src/sparse.c?id=ac065c57fdc1788a2769fb119ed0c8146e1b9dd6#n273
> 
> Tested on kernel:
> kernel-4.5.7-300.fc24.x86_64
> 
> Originally reported here, reproducer available there:
> https://bugzilla.redhat.com/show_bug.cgi?id=1352061

The reproducer works for me here. So far I found:

* the btrfs implementation of stat.st_blocks (btrfs_getattr) includes
  the 'delayed allocated' bytes, so there is not a problem in principle
  (http://lxr.free-electrons.com/source/fs/btrfs/inode.c#L9372)

* calling fsync on the sparsefile will produce the expected result

* a short delay between ./binary and 'stat' will also produce correct
  result, 0.5 seconds worked for me -- so it IMO proves it's a race
  between writing and reporting the data

* I'm not yet sure where the delay between write and synced
  'inode->delalloc_bytes' comes from

* I think that st_blocks accounting can be wrong anyway, if the file is
  mmap-ed and not msync-ed, I'm writing a reproducer for this case

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: stat() on btrfs reports the st_blocks with delay (data loss in archivers)
  2016-07-11 14:41 ` David Sterba
@ 2016-07-11 15:00   ` Chris Mason
  2016-07-11 15:16     ` David Sterba
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Mason @ 2016-07-11 15:00 UTC (permalink / raw)
  To: dsterba, Pavel Raiskup, linux-btrfs, bug-tar



On 07/11/2016 10:41 AM, David Sterba wrote:
> On Sat, Jul 02, 2016 at 09:18:07AM +0200, Pavel Raiskup wrote:
>> There are optimizations in archivers (tar, rsync, ...) that rely on up2date
>> st_blocks info.  For example, in GNU tar there is optimization check [1]
>> whether the 'st_size' reports more data than the 'st_blocks' can hold --> then
>> tar considers that file is sparse (and does additional steps).
>>
>> It looks like btrfs doesn't show correct value in 'st_blocks' until the data
>> are synced.  ATM, there happens that:
>>
>>     a) some "tool" creates sparse file
>>     b) that tool does not sync explicitly and exits ..
>>     c) tar is called immediately after that to archive the sparse file
>>     d) tar considers [2] the file is completely sparse (because st_blocks is
>>        zero) and archives no data.  Here comes data loss.
>>
>> Because we fixed 'btrfs' to report non-zero 'st_blocks' when the file data is
>> small and is in-lined (no real data blocks) -- I consider this is too bug in
>> btrfs worth fixing.
>>
>> [1] http://git.savannah.gnu.org/cgit/paxutils.git/tree/lib/system.h?id=ec72abd9dd63bbff4534ec77e97b1a6cadfc3cf8#n392
>> [2] http://git.savannah.gnu.org/cgit/tar.git/tree/src/sparse.c?id=ac065c57fdc1788a2769fb119ed0c8146e1b9dd6#n273
>>
>> Tested on kernel:
>> kernel-4.5.7-300.fc24.x86_64
>>
>> Originally reported here, reproducer available there:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1352061
>
> The reproducer works for me here. So far I found:
>
> * the btrfs implementation of stat.st_blocks (btrfs_getattr) includes
>   the 'delayed allocated' bytes, so there is not a problem in principle
>   (https://urldefense.proofpoint.com/v2/url?u=http-3A__lxr.free-2Delectrons.com_source_fs_btrfs_inode.c-23L9372&d=CwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=9QPtTAxcitoznaWRKKHoEQ&m=Y07_PApD0zOC-eaM4Hq-oxAwNlktnY8bDo7LD2GbXRo&s=Is_ZqFiL7a4sVWAB1k1ZuAgbNMK-sZ1gcU7oLtyfKoY&e= )
>
> * calling fsync on the sparsefile will produce the expected result
>
> * a short delay between ./binary and 'stat' will also produce correct
>   result, 0.5 seconds worked for me -- so it IMO proves it's a race
>   between writing and reporting the data
>
> * I'm not yet sure where the delay between write and synced
>   'inode->delalloc_bytes' comes from
>
> * I think that st_blocks accounting can be wrong anyway, if the file is
>   mmap-ed and not msync-ed, I'm writing a reproducer for this case

On my test box running current linux git, things work fine if I run the 
reproducer once.  But if I leave it running in a loop long enough for 
writeback to kick in, I trigger it.

The reproducer has a loop in there where it is adding delalloc writes 
and truncating them away.  What should be happening is that we're 
leaving some delalloc bits set past EOF, which makes us skip bumping 
inode->delalloc_bytes during the new write.

I can kind of confirm this by changing the reproducer to stat directly 
after the write call.  Normally st_blocks is never zero.  But if I leave 
it running in a loop for 30 seconds or so, I eventually get st_block 
zero directly after the write().

If I change the C program to unlink the file on exit, running the binary 
over and over again works every time.

So, the real bug is that we're letting some delalloc stat hang around 
after the truncate, probably related to IO in progress.  We do already 
account for delalloc in what we return to stat, but there's a corner 
case involving truncate where we screw it up.

-chris

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: stat() on btrfs reports the st_blocks with delay (data loss in archivers)
  2016-07-11 15:00   ` Chris Mason
@ 2016-07-11 15:16     ` David Sterba
  2016-07-11 17:30       ` Chris Mason
  0 siblings, 1 reply; 21+ messages in thread
From: David Sterba @ 2016-07-11 15:16 UTC (permalink / raw)
  To: Chris Mason; +Cc: dsterba, Pavel Raiskup, linux-btrfs, bug-tar

On Mon, Jul 11, 2016 at 11:00:55AM -0400, Chris Mason wrote:
> So, the real bug is that we're letting some delalloc stat hang around 
> after the truncate, probably related to IO in progress.  We do already 
> account for delalloc in what we return to stat, but there's a corner 
> case involving truncate where we screw it up.

So the original testcase:

> >>     a) some "tool" creates sparse file
> >>     b) that tool does not sync explicitly and exits ..
> >>     c) tar is called immediately after that to archive the sparse file
> >>     d) tar considers [2] the file is completely sparse (because st_blocks is
> >>        zero) and archives no data.  Here comes data loss.

will not happen. The application would basically have to mimick the
provided reproducer script and do the truncate/write loop and be lucky
enough to let tar hit the short race window.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: stat() on btrfs reports the st_blocks with delay (data loss in archivers)
  2016-07-11 15:16     ` David Sterba
@ 2016-07-11 17:30       ` Chris Mason
  0 siblings, 0 replies; 21+ messages in thread
From: Chris Mason @ 2016-07-11 17:30 UTC (permalink / raw)
  To: dsterba, Pavel Raiskup, linux-btrfs, bug-tar



On 07/11/2016 11:16 AM, David Sterba wrote:
> On Mon, Jul 11, 2016 at 11:00:55AM -0400, Chris Mason wrote:
>> So, the real bug is that we're letting some delalloc stat hang around
>> after the truncate, probably related to IO in progress.  We do already
>> account for delalloc in what we return to stat, but there's a corner
>> case involving truncate where we screw it up.
>
> So the original testcase:
>
>>>>     a) some "tool" creates sparse file
>>>>     b) that tool does not sync explicitly and exits ..
>>>>     c) tar is called immediately after that to archive the sparse file
>>>>     d) tar considers [2] the file is completely sparse (because st_blocks is
>>>>        zero) and archives no data.  Here comes data loss.
>
> will not happen. The application would basically have to mimick the
> provided reproducer script and do the truncate/write loop and be lucky
> enough to let tar hit the short race window.
>

Looking harder there is a race window that can trigger this without the 
truncate loop:

1) application calls write(), we make the pages delalloc (in-ram 
st_blocks goes up)
2) VM calls write_cache_pages, we go find a contiguous delalloc range
3) We call cow_file_range on the locked range of pages
4) cow_file_range clears the delalloc bits (in-ram st_blocks goes down)

< ----- race begins here ----->

5) The io is started
6) The IO completes and extents are inserted into the metadata
7) the on disk/in-ram st_blocks goes up

< ---- race ends here ---->

This makes a ton more sense than leaking delalloc bits.

-chris

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2016-07-11 17:31 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-07-02  7:18 stat() on btrfs reports the st_blocks with delay (data loss in archivers) Pavel Raiskup
2016-07-04 19:35 ` [Bug-tar] " Andreas Dilger
2016-07-05  9:28   ` Joerg Schilling
2016-07-06 11:37     ` Austin S. Hemmelgarn
2016-07-06 11:49       ` Joerg Schilling
2016-07-06 14:43         ` Antonio Diaz Diaz
2016-07-06 14:53           ` Joerg Schilling
2016-07-06 15:01             ` Paul Eggert
2016-07-06 15:09               ` Joerg Schilling
2016-07-06 15:11                 ` Paul Eggert
2016-07-06 15:12             ` Austin S. Hemmelgarn
2016-07-06 15:22               ` Joerg Schilling
2016-07-06 16:05                 ` Austin S. Hemmelgarn
2016-07-06 16:11                   ` Austin S. Hemmelgarn
2016-07-06 16:33                   ` Joerg Schilling
2016-07-06 17:35                     ` Andreas Dilger
2016-07-07  8:08   ` Pavel Raiskup
2016-07-11 14:41 ` David Sterba
2016-07-11 15:00   ` Chris Mason
2016-07-11 15:16     ` David Sterba
2016-07-11 17:30       ` Chris Mason

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).