From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: kreijack@inwind.it, pwm <pwm@iapetus.neab.net>,
Hugo Mills <hugo@carfax.org.uk>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Massive loss of disk space
Date: Fri, 4 Aug 2017 11:05:14 -0400
Message-ID: <4d5470d6-f2d9-04ab-7005-e3febd3775f0@gmail.com>
In-Reply-To: <e19345c3-87c7-e248-fa4c-9dd608680640@inwind.it>
On 2017-08-04 10:45, Goffredo Baroncelli wrote:
> On 2017-08-03 19:23, Austin S. Hemmelgarn wrote:
>> On 2017-08-03 12:37, Goffredo Baroncelli wrote:
>>> On 2017-08-03 13:39, Austin S. Hemmelgarn wrote:
> [...]
>
>>>> Also, as I said below, _THIS WORKS ON ZFS_. That immediately means that a CoW filesystem _does not_ need to behave like BTRFS is.
>>>
>>> It seems that ZFS on linux doesn't support fallocate
>>>
>>> see https://github.com/zfsonlinux/zfs/issues/326
>>>
>>> So I think that you are referring to a posix_fallocate and ZFS on solaris, which I can't test so I can't comment.
>> Both Solaris and FreeBSD (I've got a FreeNAS system at work I checked on).
>
> For fun I checked the FreeBSD source and the ZFS source. It seems to me that ZFS on FreeBSD doesn't implement posix_fallocate() (VOP_ALLOCATE in FreeBSD jargon), but instead relies on the FreeBSD default one.
>
> http://fxr.watson.org/fxr/source/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L7212
>
> Following the chain of function pointers
>
> http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=10#L110
>
> it seems that the freebsd vop_allocate() is implemented in vop_stdallocate()
>
> http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=excerpts#L912
>
> which simply calls read() and write() on the range [offset...offset+len), which for a "conventional" filesystem ensures block allocation. Of course, it is an expensive solution.
>
> So I think (but I am not familiar with FreeBSD) that ZFS doesn't implement a real posix_fallocate but tries to simulate it. Of course this don't
From a practical perspective though, posix_fallocate() doesn't matter,
because almost everything uses the native fallocate call if at all
possible. As you mention, FreeBSD is emulating it, but that 'emulation'
provides behavior that is close enough to what is required that it
doesn't matter. As a matter of perspective, posix_fallocate() is
emulated on Linux too, see my reply below to your later comment about
posix_fallocate() on BTRFS.

Internally ZFS also keeps _some_ space reserved so it doesn't get wedged
like BTRFS does when near full, and they don't do the whole data versus
metadata segregation crap, so from a practical perspective, what
FreeBSD's ZFS implementation does is sufficient because of the internal
structure and handling of writes in ZFS.
>
>
>>
>> That said, I'm starting to wonder if just failing fallocate() calls to allocate space is actually the right thing to do here after all. Aside from this, we don't reserve metadata space for checksums and similar things for the eventual writes (so it's possible to get -ENOSPC on a write to an fallocate'ed region anyway because of metadata exhaustion), and splitting extents can also cause it to fail, so it's perfectly possible for the fallocate assumption to not hold on BTRFS.
>
> posix_fallocate in BTRFS is not reliable for another reason. This syscall guarantees that a BG is allocated, but I think that the allocated BG is available to all processes, so a parallel process may exhaust all the available space before the first process uses it.
As mentioned above, posix_fallocate() is emulated in libc on Linux by
calling the regular fallocate() if the FS supports it (which BTRFS
does), or by writing out data like FreeBSD does in the kernel if the FS
doesn't support fallocate(). IOW, posix_fallocate() has the exact same
issues on BTRFS as Linux's fallocate() syscall does.
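
The strategy described above (try the native call, fall back to touching
every block only when the FS says it can't do it) can be sketched roughly
like this. It is an illustration, not the glibc or FreeBSD source: the
names emulated_posix_fallocate() and allocate_by_writing() are made up
for the example, and the fallback only guarantees allocation on a
"conventional" (non-CoW) filesystem.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc only declares fallocate() with this set */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical fallback: read one byte from every block in
 * [offset, offset + len) and write the same byte back, which forces a
 * conventional filesystem to allocate the block without changing data.
 * Returns 0 on success or an errno value. */
static int allocate_by_writing(int fd, off_t offset, off_t len)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return errno;

    off_t blk = st.st_blksize > 0 ? (off_t)st.st_blksize : 4096;
    off_t end = offset + len;
    assert(blk > 0);

    /* Grow the file first so every byte we touch lies inside it. */
    if (st.st_size < end && ftruncate(fd, end) != 0)
        return errno;

    for (off_t pos = offset; pos < end; pos = (pos / blk + 1) * blk) {
        char c = 0;
        if (pread(fd, &c, 1, pos) < 0)
            return errno;
        if (pwrite(fd, &c, 1, pos) != 1)
            return errno ? errno : EIO;
    }
    return 0;
}

/* Hypothetical posix_fallocate() look-alike: returns 0 or an errno value. */
int emulated_posix_fallocate(int fd, off_t offset, off_t len)
{
    if (offset < 0 || len <= 0)
        return EINVAL;
    if (fallocate(fd, 0, offset, len) == 0)
        return 0;
    if (errno != EOPNOTSUPP)
        return errno; /* e.g. ENOSPC: report it, don't paper over it */
    return allocate_by_writing(fd, offset, len);
}
```

Note that the fallback is only entered on EOPNOTSUPP, so on BTRFS (which
supports fallocate()) the first branch is the one that runs, and any
shortcomings of the native call carry straight through.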
>
> My opinion is that BTRFS is not reliable when the space is exhausted, so it needs to work with some amount of free disk space. The size of this free space should be O(2*size_of_biggest_write), and for operations like fallocate this means O(2*length).
Again, this arises from how we handle writes. If we were to track
blocks that have had fallocate called on them, use only those blocks
(for the first write at least) for writes to the file they were
allocated for, and break reflinks on them when fallocate is called,
then we could get away with just the size of the biggest write plus a
little bit more space for _data_. Even then we would need space for
metadata (which we don't appear to track right now).
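
For what it's worth, the 2*length rule of thumb quoted above can be
approximated from userspace before calling fallocate(). A minimal
sketch, assuming statvfs() numbers are good enough: has_headroom() is a
made-up name, the free-space figure BTRFS reports is only an estimate,
and it says nothing about metadata space, so a positive answer is a hint
and not a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/statvfs.h>

/* Hypothetical helper: returns 1 if the filesystem containing `path`
 * reports at least 2 * len bytes available to unprivileged users,
 * 0 if it does not, and -1 if statvfs() fails. */
int has_headroom(const char *path, uint64_t len)
{
    struct statvfs vfs;

    assert(path != NULL);
    if (statvfs(path, &vfs) != 0)
        return -1;

    uint64_t avail = (uint64_t)vfs.f_bavail * (uint64_t)vfs.f_frsize;

    /* Compare avail / 2 with len instead of avail with 2 * len so a
     * huge len cannot overflow the multiplication. */
    return avail / 2 >= len ? 1 : 0;
}
```

Something like has_headroom("/mnt/data", length) before the fallocate()
call would be the intended use, with the path being wherever the file
lives. Another process can still eat the space between the check and
the allocation.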
>
> I think it is no coincidence that the fallocate implemented by ZFSONLINUX works only in FALLOC_FL_PUNCH_HOLE mode.
>
> https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zpl_file.c#L662
> [...]
> /*
>  * The only flag combination which matches the behavior of zfs_space()
>  * is FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE. The FALLOC_FL_PUNCH_HOLE
>  * flag was introduced in the 2.6.38 kernel.
>  */
> #if defined(HAVE_FILE_FALLOCATE) || defined(HAVE_INODE_FALLOCATE)
> long
> zpl_fallocate_common(struct inode *ip, int mode, loff_t offset, loff_t len)
> {
>     int error = -EOPNOTSUPP;
>
> #if defined(FALLOC_FL_PUNCH_HOLE) && defined(FALLOC_FL_KEEP_SIZE)
>     cred_t *cr = CRED();
>     flock64_t bf;
>     loff_t olen;
>     fstrans_cookie_t cookie;
>
>     if (mode != (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
>         return (error);
>
> [...]
>
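
For reference, the one flag combination the quoted code accepts looks
like this from userspace. This is just a sketch of the Linux
fallocate(2) call: punch_hole() is a made-up wrapper name, and whether
the call succeeds depends on the filesystem and kernel in use.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc only declares fallocate() with this set */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h> /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <sys/types.h>

/* Hypothetical wrapper: deallocate [offset, offset + len) while leaving
 * the file size unchanged. Returns 0 on success or an errno value
 * (EOPNOTSUPP if the filesystem cannot punch holes). */
int punch_hole(int fd, off_t offset, off_t len)
{
    assert(fd >= 0);
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, len) != 0)
        return errno;
    return 0;
}
```

With the code quoted above, any other mode (including a plain
fallocate(fd, 0, ...) preallocation request) gets -EOPNOTSUPP back,
which is exactly the case where libc falls back to writing out data.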