From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: pwm <pwm@iapetus.neab.net>
Cc: Hugo Mills <hugo@carfax.org.uk>, linux-btrfs@vger.kernel.org
Subject: Re: Massive loss of disk space
Date: Tue, 1 Aug 2017 13:04:40 -0400	[thread overview]
Message-ID: <e5295664-b9b8-a895-8f85-1428838c3f6a@gmail.com> (raw)
In-Reply-To: <alpine.DEB.2.02.1708011839340.31126@iapetus.neab.net>

On 2017-08-01 12:50, pwm wrote:
> I did a temporary patch of the snapraid code to start fallocate() from 
> the previous parity file size.
Like I said though, it's BTRFS that's misbehaving here, not snapraid. 
I'm going to try to get some further discussion about this here on the 
mailing list, and hopefully it will get fixed in BTRFS (I would try to do 
so myself, but I'm at best a novice at C, and not well versed in kernel 
code).
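
For anyone wanting to apply the same kind of workaround, a minimal sketch 
of that approach in C might look like the following (grow_file() is a 
hypothetical helper for illustration, not the actual snapraid patch):

#define _GNU_SOURCE   /* for the fallocate() declaration */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

/* Extend-only workaround: allocate just the range past the current end
 * of the file, so BTRFS is never asked to allocate space for extents
 * that already exist.  Hypothetical sketch, not the snapraid patch. */
static int grow_file(int fd, off_t target_size)
{
    struct stat st;

    if (fstat(fd, &st))
        return -1;
    if (st.st_size >= target_size)
        return 0;  /* already at least as large as requested */
    return fallocate(fd, 0, st.st_size, target_size - st.st_size);
}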
> 
> Finally have a snapraid sync up and running. Looks good, but will take 
> quite a while before I can try a scrub command to double-check everything.
> 
> Thanks for the help.
Glad I could be helpful!
> 
> /Per W
> 
> On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote:
> 
>> On 2017-08-01 11:24, pwm wrote:
>>> Yes, the test code is as below - trying to match what snapraid tries 
>>> to do:
>>>
>>> #define _GNU_SOURCE   /* needed for the fallocate() declaration */
>>> #include <sys/types.h>
>>> #include <sys/stat.h>
>>> #include <fcntl.h>
>>> #include <stdio.h>
>>> #include <string.h>
>>> #include <unistd.h>
>>> #include <errno.h>
>>>
>>> int main() {
>>>      int fd = open("/mnt/snap_04/snapraid.parity",O_NOFOLLOW|O_RDWR);
>>>      if (fd < 0) {
>>>          printf("Failed opening parity file [%s]\n",strerror(errno));
>>>          return 1;
>>>      }
>>>
>>>      off_t filesize = 5151751667712ull;
>>>      int res;
>>>
>>>      struct stat statbuf;
>>>      if (fstat(fd,&statbuf)) {
>>>          printf("Failed stat [%s]\n",strerror(errno));
>>>          close(fd);
>>>          return 1;
>>>      }
>>>
>>>      printf("Original file size is  %llu bytes\n",
>>>             (unsigned long long)statbuf.st_size);
>>>      printf("Trying to grow file to %llu bytes\n",
>>>             (unsigned long long)filesize);
>>>
>>>      res = fallocate(fd,0,0,filesize);
>>>      if (res) {
>>>          printf("Failed fallocate [%s]\n",strerror(errno));
>>>          close(fd);
>>>          return 1;
>>>      }
>>>
>>>      if (fsync(fd)) {
>>>          printf("Failed fsync [%s]\n",strerror(errno));
>>>          close(fd);
>>>          return 1;
>>>      }
>>>
>>>      close(fd);
>>>      return 0;
>>> }
>>>
>>> So the call doesn't make use of the previous file size as offset for 
>>> the extension.
>>>
>>> int fallocate(int fd, int mode, off_t offset, off_t len);
>>>
>>> What you are implying here is that if the fallocate() call is 
>>> modified to:
>>>
>>>    res = fallocate(fd,0,old_size,new_size-old_size);
>>>
>>> then everything should work as expected?
>> Based on what I've seen testing on my end, yes, that should cause 
>> things to work correctly.  That said, given what snapraid does, the 
>> fact that they call fallocate covering the full desired size of the 
>> file is correct usage (the point is to make behavior deterministic, 
>> and calling it on the whole file makes sure that the file isn't 
>> sparse, which can impact performance).
>>
>> Given that calling fallocate() to extend a file without worrying about 
>> an offset is a legitimate use case, and that both ext4 and XFS (and, I 
>> suspect, almost every other Linux filesystem) work correctly in this 
>> situation, I'd argue that the behavior of BTRFS here is incorrect.
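
(As a side note, a quick way to see what the whole-file call is guarding 
against is to check whether a file is actually sparse. A rough sketch of 
such a check follows; looks_sparse() is a hypothetical helper, and note 
that compression on BTRFS also lowers the block count, so treat this as 
illustrative only:)

#include <sys/stat.h>

/* Rough sparseness heuristic: st_blocks counts 512-byte units
 * regardless of the filesystem block size, so a fully allocated
 * file should have st_blocks * 512 >= st_size.  Compressed files
 * on BTRFS can violate this, so it is illustrative only. */
static int looks_sparse(const struct stat *st)
{
    return (long long)st->st_blocks * 512 < (long long)st->st_size;
}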
>>>
>>> /Per W
>>>
>>> On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote:
>>>
>>>> On 2017-08-01 10:47, Austin S. Hemmelgarn wrote:
>>>>> On 2017-08-01 10:39, pwm wrote:
>>>>>> Thanks for the links and suggestions.
>>>>>>
>>>>>> I tried your suggestions, but they didn't solve the underlying 
>>>>>> problem.
>>>>>>
>>>>>>
>>>>>>
>>>>>> pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04
>>>>>> Dumping filters: flags 0x1, state 0x0, force is off
>>>>>>    DATA (flags 0x2): balancing, usage=20
>>>>>> Done, had to relocate 4596 out of 9317 chunks
>>>>>>
>>>>>>
>>>>>> pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft 
>>>>>> /mnt/snap_04/
>>>>>> Done, had to relocate 2 out of 4721 chunks
>>>>>>
>>>>>>
>>>>>> pwm@europium:~$ sudo btrfs fi df /mnt/snap_04
>>>>>> Data, single: total=4.60TiB, used=4.59TiB
>>>>>> System, DUP: total=40.00MiB, used=512.00KiB
>>>>>> Metadata, DUP: total=6.50GiB, used=4.81GiB
>>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>>>
>>>>>>
>>>>>> pwm@europium:~$ sudo btrfs fi show /mnt/snap_04
>>>>>> Label: 'snap_04'  uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>>>>>>          Total devices 1 FS bytes used 4.60TiB
>>>>>>          devid    1 size 9.09TiB used 4.61TiB path /dev/sdg1
>>>>>>
>>>>>>
>>>>>> So now device 1 usage is down from 9.09TiB to 4.61TiB.
>>>>>>
>>>>>> But if I try to use fallocate() to grow the large parity file, it 
>>>>>> fails immediately. I wrote a little helper program that focuses just 
>>>>>> on fallocate(), instead of having to run snapraid with lots of 
>>>>>> unknown additional actions being performed.
>>>>>>
>>>>>>
>>>>>> Original file size is  5050486226944 bytes
>>>>>> Trying to grow file to 5151751667712 bytes
>>>>>> Failed fallocate [No space left on device]
>>>>>>
>>>>>>
>>>>>>
>>>>>> And the result afterwards shows 'used' has jumped up to 9.09TiB again.
>>>>>>
>>>>>> root@europium:/mnt# btrfs fi show snap_04
>>>>>> Label: 'snap_04'  uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>>>>>>          Total devices 1 FS bytes used 4.60TiB
>>>>>>          devid    1 size 9.09TiB used 9.09TiB path /dev/sdg1
>>>>>>
>>>>>> root@europium:/mnt# btrfs fi df /mnt/snap_04/
>>>>>> Data, single: total=9.08TiB, used=4.59TiB
>>>>>> System, DUP: total=40.00MiB, used=992.00KiB
>>>>>> Metadata, DUP: total=6.50GiB, used=4.81GiB
>>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>>>
>>>>>>
>>>>>> It's almost like the file system has decided that it needs to 
>>>>>> make a snapshot and store two complete copies of the file, which 
>>>>>> is obviously not going to work with a file larger than 50% of the 
>>>>>> file system.
>>>>> I think I _might_ understand what's going on here.  Is that test 
>>>>> program calling fallocate using the desired total size of the file, 
>>>>> or just trying to allocate the range beyond the end to extend the 
>>>>> file?  I've seen issues with the first case on BTRFS before, and 
>>>>> I'm starting to think that BTRFS might actually be trying to allocate 
>>>>> the exact amount of space requested by fallocate, even if part of 
>>>>> the range is already allocated space.
>>>>
>>>> OK, I just did a dead simple test by hand, and it looks like I was 
>>>> right. The method I used to check this is as follows:
>>>> 1. Create and mount a reasonably small filesystem (I used an 8G 
>>>> temporary LV for this, a file would work too though).
>>>> 2. Using dd or a similar tool, create a test file that takes up half 
>>>> of the size of the filesystem.  It is important that this _not_ be 
>>>> fallocated, but just written out.
>>>> 3. Use `fallocate -l` to try and extend the size of the file beyond 
>>>> half the size of the filesystem.
>>>>
>>>> For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it 
>>>> will succeed with no error.  Based on this and some low-level 
>>>> inspection, it looks like BTRFS treats the full range of the 
>>>> fallocate call as unallocated, and thus is trying to allocate space 
>>>> for regions of that range that are already allocated.
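
The same hand test is easy to reproduce from a small C program as well; 
a sketch along these lines, where the path and sizes are placeholders 
for whatever scratch filesystem is being used:

#define _GNU_SOURCE   /* for the fallocate() declaration */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>

int main(void)
{
    const off_t half = 4ll * 1024 * 1024 * 1024;  /* ~half of an 8G fs */
    const size_t chunk = 1 << 20;                 /* write 1MiB at a time */

    /* placeholder path on the scratch filesystem */
    int fd = open("/mnt/test/file", O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Step 2: fill half the filesystem with plain writes, no fallocate */
    char *buf = calloc(1, chunk);
    if (!buf) { close(fd); return 1; }
    for (off_t written = 0; written < half; written += chunk) {
        if (write(fd, buf, chunk) != (ssize_t)chunk) {
            perror("write");
            return 1;
        }
    }

    /* Step 3: fallocate the full desired size, past half the fs.
     * Expectation from the test above: -ENOSPC on BTRFS, success on
     * ext4 and XFS. */
    if (fallocate(fd, 0, 0, half + (1ll << 30)))
        printf("fallocate failed [%s]\n", strerror(errno));
    else
        printf("fallocate succeeded\n");

    free(buf);
    close(fd);
    return 0;
}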
>>>>
>>>>>>
>>>>>> There was no issue at all growing the parity file on the other 
>>>>>> parity disk, which is why I wonder if there is some undetected 
>>>>>> file system corruption.
>>>>>>
>>>>
>>
>>


Thread overview: 26+ messages
2017-08-01 11:43 Massive loss of disk space pwm
2017-08-01 12:20 ` Hugo Mills
2017-08-01 14:39   ` pwm
2017-08-01 14:47     ` Austin S. Hemmelgarn
2017-08-01 15:00       ` Austin S. Hemmelgarn
2017-08-01 15:24         ` pwm
2017-08-01 15:45           ` Austin S. Hemmelgarn
2017-08-01 16:50             ` pwm
2017-08-01 17:04               ` Austin S. Hemmelgarn [this message]
2017-08-02 17:52         ` Goffredo Baroncelli
2017-08-02 19:10           ` Austin S. Hemmelgarn
2017-08-02 21:05             ` Goffredo Baroncelli
2017-08-03 11:39               ` Austin S. Hemmelgarn
2017-08-03 16:37                 ` Goffredo Baroncelli
2017-08-03 17:23                   ` Austin S. Hemmelgarn
2017-08-04 14:45                     ` Goffredo Baroncelli
2017-08-04 15:05                       ` Austin S. Hemmelgarn
2017-08-03  3:48           ` Duncan
2017-08-03 11:44           ` Marat Khalili
2017-08-03 11:52             ` Austin S. Hemmelgarn
2017-08-03 16:01             ` Goffredo Baroncelli
2017-08-03 17:15               ` Marat Khalili
2017-08-03 17:25                 ` Austin S. Hemmelgarn
2017-08-03 22:51               ` pwm
2017-08-02  4:14       ` Duncan
2017-08-02 11:18         ` Austin S. Hemmelgarn
