Re: Massive loss of disk space - Austin S. Hemmelgarn


From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: kreijack@inwind.it, pwm <pwm@iapetus.neab.net>,
	Hugo Mills <hugo@carfax.org.uk>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Massive loss of disk space
Date: Fri, 4 Aug 2017 11:05:14 -0400
Message-ID: <4d5470d6-f2d9-04ab-7005-e3febd3775f0@gmail.com>
In-Reply-To: <e19345c3-87c7-e248-fa4c-9dd608680640@inwind.it>

On 2017-08-04 10:45, Goffredo Baroncelli wrote:
> On 2017-08-03 19:23, Austin S. Hemmelgarn wrote:
>> On 2017-08-03 12:37, Goffredo Baroncelli wrote:
>>> On 2017-08-03 13:39, Austin S. Hemmelgarn wrote:
> [...]
> 
>>>> Also, as I said below, _THIS WORKS ON ZFS_.  That immediately means that a CoW filesystem _does not_ need to behave the way BTRFS does.
>>>
>>> It seems that ZFS on linux doesn't support fallocate
>>>
>>> see https://github.com/zfsonlinux/zfs/issues/326
>>>
>>> So I think that you are referring to a posix_fallocate and ZFS on solaris, which I can't test so I can't comment.
>> Both Solaris and FreeBSD (I've got a FreeNAS system at work that I checked on).
> 
> For fun I checked the FreeBSD source and the ZFS source. To me it seems that ZFS on FreeBSD doesn't implement posix_fallocate() (VOP_ALLOCATE in FreeBSD jargon), but instead relies on the FreeBSD default one.
> 
> 	http://fxr.watson.org/fxr/source/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L7212
> 
> Following the chain of function pointers
> 
> 	http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=10#L110
> 
> it seems that the freebsd vop_allocate() is implemented in vop_stdallocate()
> 
> 	http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=excerpts#L912
> 
> which simply calls read() and write() on the range [offset...offset+len), which for a "conventional" filesystem ensures the block allocation. Of course, it is an expensive solution.
> 
> So I think (but I am not familiar with FreeBSD) that ZFS doesn't implement a real posix_fallocate but tries to simulate it. Of course this don't
From a practical perspective though, posix_fallocate() doesn't matter,
because almost everything uses the native fallocate call if at all
possible.  As you mention, FreeBSD is emulating it, but that 'emulation'
provides behavior that is close enough to what is required that it
doesn't matter.  For perspective, posix_fallocate() is emulated on
Linux too; see my reply below to your later comment about
posix_fallocate() on BTRFS.
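
To make the read()/write() style of emulation concrete, here is a minimal
sketch of the idea. It is not the FreeBSD or glibc code. The names
emulate_allocate() and touch_byte() are made up for illustration, the 4096
byte cap on the step is an assumption, and overflow of offset + len is not
checked.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Rewrite one byte in place; at or past EOF this writes a zero byte. */
static int touch_byte(int fd, off_t pos)
{
	char c = 0;
	ssize_t n = pread(fd, &c, 1, pos);	/* returns 0 at or past EOF */

	if (n < 0)
		return errno;
	n = pwrite(fd, &c, 1, pos);
	if (n < 0)
		return errno;
	return n == 1 ? 0 : EIO;
}

/*
 * Hypothetical helper: force allocation of [offset, offset + len) by
 * touching one byte per block.  Returns 0 or an errno value.
 */
int emulate_allocate(int fd, off_t offset, off_t len)
{
	struct stat st;
	off_t pos, end, step;
	int err;

	if (offset < 0 || len <= 0)
		return EINVAL;
	if (fstat(fd, &st) != 0)
		return errno;
	/* Assumed cap: the preferred I/O size may exceed the allocation unit. */
	step = st.st_blksize;
	if (step <= 0 || step > 4096)
		step = 4096;
	end = offset + len;

	for (pos = offset; pos < end; pos += step) {
		err = touch_byte(fd, pos);
		if (err)
			return err;
	}
	/* The last byte may sit in a block the loop did not reach. */
	return touch_byte(fd, end - 1);
}
```

Existing data is read back and rewritten unchanged, so the call is safe on
a file that already has content, at the cost of real I/O for every block.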

Internally, ZFS also keeps _some_ space reserved so it doesn't get wedged
like BTRFS does when near full, and it doesn't do the whole data versus
metadata segregation crap, so from a practical perspective, what
FreeBSD's ZFS implementation does is sufficient because of the internal
structure and handling of writes in ZFS.
> 
> 
>>
>> That said, I'm starting to wonder if just failing fallocate() calls to allocate space is actually the right thing to do here after all.  Aside from this, we don't reserve metadata space for checksums and similar things for the eventual writes (so it's possible to get -ENOSPC on a write to an fallocate'ed region anyway because of metadata exhaustion), and splitting extents can also cause it to fail, so it's perfectly possible for the fallocate assumption to not hold on BTRFS.
> 
> posix_fallocate in BTRFS is not reliable for another reason. This syscall guarantees that a BG is allocated, but I think that the allocated BG is available to all processes, so a parallel process may exhaust all the available space before the first process uses it.
As mentioned above, posix_fallocate() is emulated in libc on Linux by 
calling the regular fallocate() if the FS supports it (which BTRFS 
does), or by writing out data like FreeBSD does in the kernel if the FS 
doesn't support fallocate().  IOW, posix_fallocate() has the exact same 
issues on BTRFS as Linux's fallocate() syscall does.
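
The order of operations described above can be sketched in userspace like
this. It is illustrative only, since libc already does the equivalent
inside posix_fallocate(); the helper name preallocate() is made up.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* exposes Linux fallocate() in <fcntl.h> */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical helper: returns 0 on success or an errno value. */
int preallocate(int fd, off_t offset, off_t len)
{
	/* Mode 0 allocates the range and extends the file if needed. */
	if (fallocate(fd, 0, offset, len) == 0)
		return 0;
	if (errno != EOPNOTSUPP)
		return errno;
	/*
	 * The filesystem has no native support, so let libc emulate it.
	 * posix_fallocate() returns the error number directly.
	 */
	return posix_fallocate(fd, offset, len);
}
```

A return of 0 here only means the allocation call succeeded.  On BTRFS a
later write() into the range can still fail with ENOSPC, so callers need
to keep checking write errors either way.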
> 
> My opinion is that BTRFS is not reliable when the space is exhausted, so it needs to work with some amount of disk space free. The size of this free space should be O(2*size_of_biggest_write), and for operations like fallocate this means O(2*length).
Again, this arises from how we handle writes.  If we were to track
blocks that have had fallocate called on them and only use those (for
the first write at least) for writes to the file that had fallocate
called on it (as well as breaking reflinks on them when fallocate is
called), then we could get away with just using the size of the biggest
write plus a little bit more space for _data_, but even then we need
space for metadata (which we don't appear to track right now).
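
For anyone who wants to apply the "twice the length" rule of thumb quoted
above before calling fallocate, a rough userspace check might look like
this.  The helper name has_headroom() is made up, the factor of 2 comes
straight from the quoted suggestion, and statvfs() only reports an
estimate of free data space on BTRFS, so it says nothing about the
metadata problem mentioned above.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <sys/statvfs.h>

/*
 * Hypothetical check: returns 1 if the filesystem holding `path` reports
 * at least 2 * len bytes available to unprivileged users, 0 if it does
 * not, and -1 if statvfs() fails.
 */
int has_headroom(const char *path, uint64_t len)
{
	struct statvfs sv;
	uint64_t avail;

	if (statvfs(path, &sv) != 0)
		return -1;
	if (len > UINT64_MAX / 2)	/* 2 * len would overflow */
		return 0;
	avail = (uint64_t)sv.f_bavail * (uint64_t)sv.f_frsize;
	return avail >= 2 * len;
}
```

This is advisory at best: another process can consume the space between
the check and the allocation, which is the same race raised earlier in
the thread.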
> 
> I think it is no coincidence that the fallocate implemented by ZFSONLINUX works only with the FALLOC_FL_PUNCH_HOLE mode.
> 
> https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zpl_file.c#L662
> [...]
> /*
>   * The only flag combination which matches the behavior of zfs_space()
>   * is FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE.  The FALLOC_FL_PUNCH_HOLE
>   * flag was introduced in the 2.6.38 kernel.
>   */
> #if defined(HAVE_FILE_FALLOCATE) || defined(HAVE_INODE_FALLOCATE)
> long
> zpl_fallocate_common(struct inode *ip, int mode, loff_t offset, loff_t len)
> {
> 	int error = -EOPNOTSUPP;
> 
> #if defined(FALLOC_FL_PUNCH_HOLE) && defined(FALLOC_FL_KEEP_SIZE)
> 	cred_t *cr = CRED();
> 	flock64_t bf;
> 	loff_t olen;
> 	fstrans_cookie_t cookie;
> 
> 	if (mode != (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> 		return (error);
> 
> [...]
>
