From: Paolo Bonzini <pbonzini@redhat.com>
To: Fam Zheng <famz@redhat.com>
Cc: Eric Blake <eblake@redhat.com>, Kevin Wolf <kwolf@redhat.com>,
qemu-block@nongnu.org, qemu-devel@nongnu.org,
Max Reitz <mreitz@redhat.com>,
Lukas Czerner <lczerner@redhat.com>,
Dave Chinner <dchinner@redhat.com>,
P@draigBrady.com, Niels de Vos <ndevos@redhat.com>
Subject: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
Date: Wed, 20 Jul 2016 05:19:37 -0400 (EDT) [thread overview]
Message-ID: <1796238868.8815050.1469006377577.JavaMail.zimbra@redhat.com> (raw)
In-Reply-To: <20160720073836.GF10539@ad.usersys.redhat.com>
Adding ext4 and XFS guys (Lukas and Dave respectively). As a quick recap, the
issue here is the semantics of FIEMAP and SEEK_HOLE/SEEK_DATA, which we use in
"qemu-img map". This command prints metadata about a virtual disk image---which
in the case of a raw image amounts to detecting holes and unwritten extents.
First, it seems like SEEK_HOLE and SEEK_DATA really should be "SEEK_NONZERO" and
"SEEK_ZERO", on both ext4 and XFS. You can see that unwritten extents are
reported by "qemu-img map" as holes:
$ dd if=/dev/urandom of=test.img bs=1M count=100
$ fallocate -z -o 10M -l 10M test.img
$ du -h test.img
$ qemu-img map --output=json test.img
[{ "start": 0, "length": 10485760, "depth": 0, "zero": false, "data": true, "offset": 0},
{ "start": 10485760, "length": 10485760, "depth": 0, "zero": true, "data": false, "offset": 10485760},
{ "start": 20971520, "length": 83886080, "depth": 0, "zero": false, "data": true, "offset": 20971520}]
On the second line, zero=true data=false identifies a hole. The right output
would either have zero=true data=true (unwritten extent) or just
[{ "start": 0, "length": 104857600, "depth": 0, "zero": false, "data": true, "offset": 0},
since the zero flag is advisory (it doesn't try to detect zeroes beyond what the
filesystem says).
This leads to the second question, which is about FIEMAP and FIEMAP_FLAG_SYNC in
particular. Until 2014, QEMU used FIEMAP to implement "qemu-img map". I
resurrected that cod eand indeed it works:
$ dd if=/dev/urandom of=test.img bs=1M count=100
$ du -h test.img
$ fallocate -z -o 10M -l 10M test.img
$ ./qemu-img map --output=json test.img
[{ "start": 0, "length": 10485760, "depth": 0, "zero": false, "data": true, "offset": 0},
{ "start": 10485760, "length": 10485760, "depth": 0, "zero": true, "data": true, "offset": 10485760},
{ "start": 20971520, "length": 83886080, "depth": 0, "zero": false, "data": true, "offset": 20971520}]
This time qemu-img correctly reports an unwritten extent on the second line. It
reports correctly holes too; continuing the previous example:
$ fallocate -p -o 20M -l 10M test.img
$ ./qemu-img map --output=json test.img
[{ "start": 0, "length": 10485760, "depth": 0, "zero": false, "data": true, "offset": 0},
{ "start": 10485760, "length": 10485760, "depth": 0, "zero": true, "data": true, "offset": 10485760},
{ "start": 20971520, "length": 10485760, "depth": 0, "zero": true, "data": false, "offset": 20971520},
{ "start": 31457280, "length": 73400320, "depth": 0, "zero": false, "data": true, "offset": 31457280}]
Notice that you have data=true on the second line (unwritten extent) but
data=false (hole) on the third.
The reason why we disabled FIEMAP was a combination of a corruption and performance
issue. The data corruption bug was at https://bugs.launchpad.net/qemu/+bug/1368815
and it was reported on Ubuntu Trusty (kernel 3.13 based on the release notes at
https://wiki.ubuntu.com/TrustyTahr/ReleaseNotes). We corrected that by using
FIEMAP_FLAG_SYNC, based on a similar patch to coreutils. This turned out to be too
slow, so we dropped FIEMAP altogether.
However, today I found kernel commit 91dd8c114499 ("ext4: prevent race while walking
extent tree for fiemap", 2012-11-28) whose commit message says:
Moreover the extent currently in delayed allocation might be allocated
after we search the extent tree and before we search extent status tree
delayed buffers resulting in those delayed buffers being completely
missed, even though completely written and allocated.
This seems pretty much like our data corruption bug; it would mean that
using FIEMAP_FLAG_SYNC was only working around a bug and delayed allocations
_should_ be reported as usual by FIEMAP.
Except that the commit went in kernel 3.8 and as said above Trusty had 3.13.
So either there are other bugs, or my understanding of the commit is not correct.
So the questions for Lukas and Dave are:
1) is it expected that SEEK_HOLE skips unwritten extents? If not, would
it be acceptable to introduce Linux-specific SEEK_ZERO/SEEK_NONZERO, which
would be similar to what SEEK_HOLE/SEEK_DATA do now?
2) for FIEMAP do we really need FIEMAP_FLAG_SYNC? And if not, for what
filesystems and kernel releases is it really not needed?
Thanks,
Paolo
next prev parent reply other threads:[~2016-07-20 9:19 UTC|newest]
Thread overview: 55+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-07-19 4:07 [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Eric Blake
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 01/14] nbd: Fix bad flag detection on server Eric Blake
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 02/14] nbd: Add qemu-nbd -D for human-readable description Eric Blake
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 03/14] nbd: Limit nbdflags to 16 bits Eric Blake
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 04/14] nbd: Treat flags vs. command type as separate fields Eric Blake
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 05/14] nbd: Share common reply-sending code in server Eric Blake
2016-07-19 5:10 ` Fam Zheng
2016-07-19 14:52 ` Eric Blake
2016-07-20 4:39 ` Fam Zheng
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 06/14] nbd: Send message along with server NBD_REP_ERR errors Eric Blake
2016-07-19 5:15 ` Fam Zheng
2016-10-11 15:12 ` Eric Blake
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 07/14] nbd: Share common option-sending code in client Eric Blake
2016-07-19 5:31 ` Fam Zheng
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 08/14] nbd: Let server know when client gives up negotiation Eric Blake
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 09/14] nbd: Let client skip portions of server reply Eric Blake
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 10/14] nbd: Less allocation during NBD_OPT_LIST Eric Blake
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 11/14] nbd: Support shorter handshake Eric Blake
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 12/14] nbd: Improve server handling of shutdown requests Eric Blake
2016-07-19 4:08 ` [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server Eric Blake
2016-07-19 6:21 ` Fam Zheng
2016-07-19 15:28 ` Eric Blake
2016-07-19 15:45 ` Paolo Bonzini
2016-07-20 3:34 ` Fam Zheng
2016-07-20 3:47 ` Eric Blake
2016-07-20 4:37 ` Fam Zheng
2016-07-20 7:09 ` Paolo Bonzini
2016-07-20 7:38 ` Fam Zheng
2016-07-20 8:16 ` Paolo Bonzini
2016-07-20 9:04 ` Fam Zheng
2016-07-20 9:19 ` Paolo Bonzini [this message]
2016-07-20 12:30 ` [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server) Dave Chinner
2016-07-20 13:35 ` Niels de Vos
2016-07-21 11:43 ` Dave Chinner
2016-07-21 12:31 ` Pádraig Brady
2016-07-21 13:15 ` Dave Chinner
2016-07-20 13:40 ` Paolo Bonzini
2016-07-21 12:41 ` Dave Chinner
2016-07-21 13:01 ` Pádraig Brady
2016-07-21 14:23 ` Paolo Bonzini
2016-07-22 8:58 ` Dave Chinner
2016-07-22 10:41 ` Paolo Bonzini
2018-02-15 16:40 ` Vladimir Sementsov-Ogievskiy
2018-02-15 16:42 ` Paolo Bonzini
2018-04-18 14:25 ` Vladimir Sementsov-Ogievskiy
2018-04-18 14:41 ` [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC Eric Blake
2016-08-18 13:50 ` [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server Vladimir Sementsov-Ogievskiy
2016-08-18 13:52 ` Paolo Bonzini
2016-07-19 4:08 ` [Qemu-devel] [PATCH v5 14/14] nbd: Implement NBD_CMD_WRITE_ZEROES on client Eric Blake
2016-07-19 6:24 ` Fam Zheng
2016-07-19 15:31 ` Eric Blake
2016-07-19 6:33 ` [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Fam Zheng
2016-07-19 8:53 ` Paolo Bonzini
2016-07-19 15:33 ` Eric Blake
2016-07-19 15:41 ` Paolo Bonzini
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1796238868.8815050.1469006377577.JavaMail.zimbra@redhat.com \
--to=pbonzini@redhat.com \
--cc=P@draigBrady.com \
--cc=dchinner@redhat.com \
--cc=eblake@redhat.com \
--cc=famz@redhat.com \
--cc=kwolf@redhat.com \
--cc=lczerner@redhat.com \
--cc=mreitz@redhat.com \
--cc=ndevos@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).