From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: kreijack@inwind.it, pwm <pwm@iapetus.neab.net>,
	Hugo Mills <hugo@carfax.org.uk>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Massive loss of disk space
Date: Thu, 3 Aug 2017 13:23:12 -0400	[thread overview]
Message-ID: <8344dc9f-d213-b2d8-5b6c-5c1a54041ef1@gmail.com> (raw)
In-Reply-To: <cab4df59-a5ce-9944-22cb-367173dab108@inwind.it>

On 2017-08-03 12:37, Goffredo Baroncelli wrote:
> On 2017-08-03 13:39, Austin S. Hemmelgarn wrote:
>> On 2017-08-02 17:05, Goffredo Baroncelli wrote:
>>> On 2017-08-02 21:10, Austin S. Hemmelgarn wrote:
>>>> On 2017-08-02 13:52, Goffredo Baroncelli wrote:
>>>>> Hi,
>>>>>
>>> [...]
>>>
>>>>> consider the following scenario:
>>>>>
>>>>> a) create a 2GB file
>>>>> b) fallocate -o 1GB -l 2GB
>>>>> c) write from 1GB to 3GB
>>>>>
>>>>> after b), the expectation is that c) always succeeds [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space, because there could be a small time window where both the old and the new data exist on the disk.
>>>
>>>> There is also an expectation, based on pretty much every other FS in existence, that calling fallocate() on a range that is already in use is a (possibly expensive) no-op, and by extension that using fallocate() with an offset of 0, like an ftruncate() call, will succeed as long as the new size fits.
>>>
>>> The man page of fallocate doesn't guarantee that.
>>>
>>> Unfortunately, in a COW filesystem the assumption that an allocated area may simply be overwritten is not true.
>>>
>>> Let me say it in other words: as a general rule, if you want to _write_ something in a COW filesystem, you need space. It doesn't matter whether you are *over-writing* existing data or *appending* to a file.
>> Yes, you need space, but you don't need _all_ the space.  For a file that already has data in it, you only _need_ as much space as the largest chunk of data that can be written at once at a low level, because the moment that first write finishes, the space that was used in the file for that region is freed, and the next write can go there.  Put a bit differently, you only need to allocate what isn't allocated in the region, and then a bit more to handle the initial write to the file.
>>
>> Also, as I said below, _THIS WORKS ON ZFS_.  That immediately means that a CoW filesystem _does not_ need to behave like BTRFS is.
> 
> It seems that ZFS on Linux doesn't support fallocate
> 
> see https://github.com/zfsonlinux/zfs/issues/326
> 
> So I think that you are referring to posix_fallocate() and ZFS on Solaris, which I can't test, so I can't comment.
Both Solaris and FreeBSD (I've got a FreeNAS system at work that I 
checked on).

That said, I'm starting to wonder whether failing fallocate() calls 
that can't reserve space is actually the right thing to do here after 
all.  Aside from this, we don't reserve metadata space for checksums 
and similar things for the eventual writes (so it's possible to get 
-ENOSPC on a write to an fallocate'd region anyway because of metadata 
exhaustion), and splitting extents can also cause such a write to 
fail, so it's perfectly possible for the fallocate assumption to not 
hold on BTRFS.  The irony of this is that if you're in a situation 
where you actually need to reserve space, you're more likely to fail 
(because if you actually _need_ to reserve the space, your filesystem 
may already be mostly full, and therefore any of the above issues may 
occur).
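
To make that concrete, here's a minimal sketch of the sequence from 
earlier in this thread (the path and sizes are placeholders, and I 
haven't run this exact program): fallocate() over a partly populated 
file, then overwrite the reserved range.  The point is that on BTRFS 
the writes below can still return -ENOSPC even though the fallocate() 
succeeded, because of checksum/metadata growth and extent splitting:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	const off_t gib = 1024LL * 1024 * 1024;
	/* Placeholder path; assume a 2GiB file already exists here. */
	int fd = open("/mnt/test/file", O_RDWR);
	if (fd < 0) { perror("open"); return 1; }

	/* b) reserve 1GiB..3GiB; mode 0 also extends i_size */
	if (fallocate(fd, 0, 1 * gib, 2 * gib) < 0) {
		perror("fallocate");
		return 1;
	}

	/* c) overwrite 1GiB..3GiB; after b) the expectation is that
	 * this never fails for lack of space */
	char *buf = malloc(1 << 20);
	if (!buf) return 1;
	memset(buf, 0xab, 1 << 20);
	for (off_t off = 1 * gib; off < 3 * gib; off += 1 << 20) {
		if (pwrite(fd, buf, 1 << 20, off) < 0) {
			fprintf(stderr, "pwrite at %lld: %s\n",
				(long long)off, strerror(errno));
			return 1;
		}
	}
	return 0;
}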

On the specific note of splitting extents, the following will probably 
fail on BTRFS as well when done with a large enough FS (the turnover 
point ends up being the point at which 256MiB isn't enough space to 
account for all the extents), but will succeed on ZFS:
1. Create filesystem and mount it.  On BTRFS, make sure autodefrag is 
off (this makes it fail more reliably, but is not essential for it to fail).
2. Use fallocate to allocate as large a file as possible (in the BTRFS 
case, try for the size of the filesystem minus 544MiB: 512MiB for the 
metadata chunk and 32MiB for the system chunk).
3. Write half the file using 1MB blocks, skipping 1MB of space 
between each block (so every other 1MB of space is actually written to).
4. Write the other half of the file by filling in the holes.

The net effect of this is to split the single large fallocate'd extent 
into a very large number of 1MB extents, which in turn eats up lots of 
metadata space and will eventually exhaust it.  While this specific 
exercise requires a large filesystem, more general real-world 
situations exist where this can happen (and I have had this happen 
before); a rough sketch of the procedure is below.
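
Roughly, steps 2-4 look like this (the mount point and file size are 
made up, and I'm writing this from memory rather than from a tested 
reproducer; whether the fill-in pass actually hits -ENOSPC depends on 
how much metadata space the filesystem has):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Write 1MiB at every second 1MiB boundary, starting at 'start'. */
static int write_stride(int fd, char *buf, off_t size, off_t start)
{
	for (off_t off = start; off < size; off += 2 << 20) {
		if (pwrite(fd, buf, 1 << 20, off) < 0) {
			fprintf(stderr, "pwrite at %lld: %s\n",
				(long long)off, strerror(errno));
			return -1;
		}
	}
	return 0;
}

int main(void)
{
	/* Step 2: pick (filesystem size - 544MiB) by hand; 100GiB is
	 * just a stand-in value here. */
	const off_t size = 100LL * 1024 * 1024 * 1024;
	int fd = open("/mnt/test/bigfile", O_RDWR | O_CREAT, 0644);
	if (fd < 0) { perror("open"); return 1; }
	if (fallocate(fd, 0, 0, size) < 0) { perror("fallocate"); return 1; }

	char *buf = malloc(1 << 20);
	if (!buf) return 1;
	memset(buf, 0x5a, 1 << 20);

	/* Step 3: every other 1MiB block */
	if (write_stride(fd, buf, size, 0) < 0) return 1;
	/* Step 4: fill in the holes; this is what splits the one big
	 * fallocate'd extent into ~100k 1MiB extents */
	if (write_stride(fd, buf, size, 1 << 20) < 0) return 1;
	return 0;
}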
> 
> [...]
>>> In terms of a COW filesystem, you need the space of a) + the space of b)
>> No, that is only required if the entire file needs to be written atomically.  There is some maximal size atomic write that BTRFS can perform as a single operation at a low level (I'm not sure if this is equal to the block size, or larger, but it doesn't matter much; either way, I'm talking about the largest chunk of data it will write to a disk in a single operation before updating metadata to point to that new data).
> 
> To the best of my knowledge there is only a time limit: IIRC a transaction is closed every 30 seconds. If you are able to fill the filesystem in this time window, you are in trouble.
Even with that, it's still possible to implement the method I outlined 
by defining such a limit and forcing a transaction commit when that 
limit is hit.  I'm also not entirely convinced that the transaction is 
the limiting factor here (I was under the impression that the 
transaction just updates the top level metadata to point to the new tree 
of metadata).
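
From userspace, the equivalent of that idea looks roughly like the 
sketch below: overwrite a region in bounded chunks and force the 
filesystem to flush between chunks, so that only about one chunk's 
worth of old and new data ever has to coexist.  Whether syncfs() 
alone is enough for BTRFS to actually release the superseded extents 
before the next chunk is written is an assumption on my part, not 
something I've verified:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK ((off_t)64 << 20)	/* arbitrary per-flush write limit */

/* Overwrite [start, start + len) in CHUNK-sized pieces, flushing the
 * filesystem after each piece so the old copy of that piece can be
 * freed before the next one is dirtied. */
static int overwrite_bounded(int fd, off_t start, off_t len)
{
	char *buf = malloc(CHUNK);
	if (!buf) return -1;
	memset(buf, 0xcd, CHUNK);
	for (off_t off = start; off < start + len; off += CHUNK) {
		off_t left = start + len - off;
		size_t n = left < CHUNK ? (size_t)left : (size_t)CHUNK;
		if (pwrite(fd, buf, n, off) < 0 || syncfs(fd) < 0) {
			perror("pwrite/syncfs");
			free(buf);
			return -1;
		}
	}
	free(buf);
	return 0;
}

int main(void)
{
	int fd = open("/mnt/test/file", O_RDWR);	/* placeholder path */
	if (fd < 0) { perror("open"); return 1; }
	/* overwrite the 1GiB..3GiB range from the earlier example */
	return overwrite_bounded(fd, (off_t)1 << 30, (off_t)2 << 30) ? 1 : 0;
}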


Thread overview: 26+ messages
2017-08-01 11:43 Massive loss of disk space pwm
2017-08-01 12:20 ` Hugo Mills
2017-08-01 14:39   ` pwm
2017-08-01 14:47     ` Austin S. Hemmelgarn
2017-08-01 15:00       ` Austin S. Hemmelgarn
2017-08-01 15:24         ` pwm
2017-08-01 15:45           ` Austin S. Hemmelgarn
2017-08-01 16:50             ` pwm
2017-08-01 17:04               ` Austin S. Hemmelgarn
2017-08-02 17:52         ` Goffredo Baroncelli
2017-08-02 19:10           ` Austin S. Hemmelgarn
2017-08-02 21:05             ` Goffredo Baroncelli
2017-08-03 11:39               ` Austin S. Hemmelgarn
2017-08-03 16:37                 ` Goffredo Baroncelli
2017-08-03 17:23                   ` Austin S. Hemmelgarn [this message]
2017-08-04 14:45                     ` Goffredo Baroncelli
2017-08-04 15:05                       ` Austin S. Hemmelgarn
2017-08-03  3:48           ` Duncan
2017-08-03 11:44           ` Marat Khalili
2017-08-03 11:52             ` Austin S. Hemmelgarn
2017-08-03 16:01             ` Goffredo Baroncelli
2017-08-03 17:15               ` Marat Khalili
2017-08-03 17:25                 ` Austin S. Hemmelgarn
2017-08-03 22:51               ` pwm
2017-08-02  4:14       ` Duncan
2017-08-02 11:18         ` Austin S. Hemmelgarn
