From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: Adam Borowski <kilobyte@angband.pl>
Cc: Marat Khalili <mkh@rqc.ru>, Duncan <1i5t5.duncan@cox.net>,
linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: qemu-kvm VM died during partial raid1 problems of btrfs
Date: Tue, 12 Sep 2017 16:12:32 -0400 [thread overview]
Message-ID: <e169336d-db31-276f-1d71-bf897cf10d4b@gmail.com> (raw)
In-Reply-To: <20170912200057.3mrgtahlvszkg334@angband.pl>
On 2017-09-12 16:00, Adam Borowski wrote:
> On Tue, Sep 12, 2017 at 03:11:52PM -0400, Austin S. Hemmelgarn wrote:
>> On 2017-09-12 14:43, Adam Borowski wrote:
>>> On Tue, Sep 12, 2017 at 01:36:48PM -0400, Austin S. Hemmelgarn wrote:
>>>> On 2017-09-12 13:21, Adam Borowski wrote:
>>>>> There's fallocate -d, but that for some reason touches mtime which makes
>>>>> rsync go again. This can be handled manually but is still not nice.
>>>
>>> Yeah, the underlying ioctl does modify the file, it's merely fallocate -d
>>> calling it on regions that are already zero. The ioctl doesn't know that,
>>> so fallocate would have to restore the mtime by itself.
>>>
>>> There's also another problem: such a check + ioctl are racey. Unlike defrag
>>> or FILE_EXTENT_SAME, you can't thus use it on a file that's in use (or could
>>> suddenly become in use). Fixing this would need kernel support, either as
>>> FILE_EXTENT_SAME with /dev/zero or as a new mode of fallocate.
>> A new fallocate mode would be more likely. Adding special code to the
>> EXTENT_SAME ioctl and then requiring implementation on filesystems that
>> don't otherwise support it is not likely to get anywhere. A new fallocate
>> mode though would be easy, especially considering that a naive
>> implementation is easy
>
> Sounds like a good idea. If we go this way, there's a question about
> interface: there's choice between:
> A) check if the whole range is zero, if even a single bit is one, abort
> B) dig many holes, with a given granulation (perhaps left to the
> filesystem's choice)
> or even both. The former is more consistent with FILE_EXTENT_SAME, the
> latter can be smarter (like, digging a 4k hole is bad for fragmentation but
> replacing a whole extent, no matter how small, is always a win).
The first. It's more flexible, and the logic required for the second
option is policy, which should not be in the kernel. Matching the
EXTENT_SAME semantics would probably also make the implementation
significantly easier, and with some minor work might give a trivial
implementation for any FS that already supports that ioctl.
>
>> That said, I'm not 100% certain if it's necessary. Intentionally calling
>> fallocate on a file in use is not something most people are going to do
>> normally anyway, since there is already a TOCTOU race in the fallocate -d
>> implementation as things are right now.
>
> _Current_ fallocate -d suffers from races, the whole gain from doing this
> kernel-side would be eliminating those races. Use cases about the same as
> FILE_EXTENT_SAME: you don't need to stop the world. Heck, as I mentioned
> before, it conceptually _is_ FILE_EXTENT_SAME with /dev/null, other than
> your (good) point about non-btrfs non-xfs.
I meant we shouldn't worry about a race involving the mtime check given
that there's an existing race inherent in the ioctl already.
>
>>> For now, though, I wonder -- should we send fine folks at util-linux a patch
>>> to make fallocate -d restore mtime, either always or on an option?
>> It would need to be an option, because it also suffers from a TOCTOU race
>> (other things might have changed the mtime while you were punching holes),
>> and it breaks from existing behavior. I think such an option would be
>> useful, but not universally (for example, I don't care if the mtime on my VM
>> images changes, as it typically matches the current date and time since the
>> VM's are running constantly other than when doing maintenance like punching
>> holes in the images).
>
> Noted. Both Marat's and my use cases, though, involve VMs that are off most
> of the time, and at least for me, turned on only to test something.
> Touching mtime makes rsync run again, and it's freaking _slow_: worse than
> 40 minutes for a 40GB VM (source:SSD target:deduped HDD).
40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if
you're going direct to a hard drive. I get better performance than that
on my somewhat pathetic NUC based storage cluster (I get roughly 20 MB/s
there, but it's for archival storage so I don't really care). I'm
actually curious what the exact rsync command you are using is (you can
obviously redact paths as you see fit), as the only way I can think of
that it should be that slow is if you're using both --checksum (but if
you're using this, you can tell rsync to skip the mtime check, and that
issue goes away) and --inplace, _and_ your HDD is slow to begin with.
next prev parent reply other threads:[~2017-09-12 20:12 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-09-12 8:02 qemu-kvm VM died during partial raid1 problems of btrfs Marat Khalili
2017-09-12 8:25 ` Timofey Titovets
2017-09-12 8:42 ` Marat Khalili
2017-09-12 9:21 ` Timofey Titovets
2017-09-12 9:29 ` Marat Khalili
2017-09-12 9:35 ` Timofey Titovets
2017-09-12 10:01 ` Duncan
2017-09-12 10:32 ` Adam Borowski
2017-09-12 10:39 ` Marat Khalili
2017-09-12 11:01 ` Timofey Titovets
2017-09-12 11:12 ` Adam Borowski
2017-09-12 11:17 ` Timofey Titovets
2017-09-12 11:26 ` Marat Khalili
2017-09-12 17:21 ` Adam Borowski
2017-09-12 17:36 ` Austin S. Hemmelgarn
2017-09-12 18:43 ` Adam Borowski
2017-09-12 18:47 ` Christoph Hellwig
2017-09-12 19:12 ` Austin S. Hemmelgarn
2017-09-12 19:11 ` Austin S. Hemmelgarn
2017-09-12 20:00 ` Adam Borowski
2017-09-12 20:12 ` Austin S. Hemmelgarn [this message]
2017-09-12 21:13 ` Adam Borowski
2017-09-13 0:52 ` Timofey Titovets
2017-09-13 12:55 ` Austin S. Hemmelgarn
2017-09-13 12:21 ` Austin S. Hemmelgarn
2017-09-18 11:53 ` Adam Borowski
2017-09-13 14:47 ` Martin Raiber
2017-09-13 15:25 ` Austin S. Hemmelgarn
2017-09-12 11:09 ` Roman Mamedov
2017-09-13 13:23 ` Chris Murphy
2017-09-13 14:15 ` Marat Khalili
2017-09-13 17:52 ` Goffredo Baroncelli
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=e169336d-db31-276f-1d71-bf897cf10d4b@gmail.com \
--to=ahferroin7@gmail.com \
--cc=1i5t5.duncan@cox.net \
--cc=kilobyte@angband.pl \
--cc=linux-btrfs@vger.kernel.org \
--cc=mkh@rqc.ru \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).