From: Paolo Bonzini <pbonzini@redhat.com>
To: Dave Chinner <dchinner@redhat.com>
Cc: Fam Zheng <famz@redhat.com>, Eric Blake <eblake@redhat.com>,
Kevin Wolf <kwolf@redhat.com>,
qemu-block@nongnu.org, qemu-devel@nongnu.org,
Max Reitz <mreitz@redhat.com>,
Lukas Czerner <lczerner@redhat.com>,
P@draigBrady.com, Niels de Vos <ndevos@redhat.com>
Subject: Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
Date: Thu, 21 Jul 2016 10:23:48 -0400 (EDT) [thread overview]
Message-ID: <1072047668.9469836.1469111028358.JavaMail.zimbra@redhat.com> (raw)
In-Reply-To: <20160721124119.GR2031@devil.localdomain>
> > 1) avoid copying zero data, to keep the copy process efficient. For this,
> > SEEK_HOLE/SEEK_DATA are enough.
> >
> > 2) copy file contents while preserving the allocation state of the file's
> > extents.
>
> Which is /very difficult/ to do safely and reliably.
> i.e. the use of fiemap to duplicate the exact layout of a file
> from userspace is only posisble if you can /guarantee/ the source
> file has not changed in any way during the copy operation at the
> pointin time you finalise the destination data copy.
We don't do exactly that, exactly because it's messy when you have
concurrent accesses (which shouldn't be done but you never know). When
doing a copy, we use(d to use) FIEMAP the same way as you'd use lseek,
querying one extent at a time. If you proceed this way, all of these
can cause the same races:
- pread(ofs=10MB, len=10MB) returns all zeroes, so the 10MB..20MB is
not copied
- pread(ofs=10MB, len=10MB) returns non-zero data, so the 10MB..20MB is
copied
- lseek(SEEK_DATA, 10MB) returns 20MB, so the 10MB..20MB area is not
copied
- lseek(SEEK_HOLE, 10MB) returns 20MB, so the 10MB..20MB area is
copied
- ioctl(FIEMAP at 10MB) returns an extent starting at 20MB, so
the 10MB..20MB area is not copied
- ioctl(FIEMAP at 10MB) returns an unwritten extent at 10MB..20MB,
so the 10MB..20MB area is FALLOC_FL_ZERO_RANGE-ed on the destination
- ioctl(FIEMAP at 10MB) returns an extent at 10MB..20MB, so
the 10MB..20MB area is copied
But unfortunately in the FIEMAP case you'd need FIEMAP_FLAG_SYNC. It
would be nice to have a way to do this kind of query without syncing
and with expressiveness that is at least close to FIEMAP's results.
We don't need need the full power of FIEMAP---as mentioned above we
only query one extent---so having a new ioctl would be fine too!
I'm putting a sample program after my signature for reference.
> IOWs, I think you should be looking to optimise file copies by using
> copy_file_range() and getting filesystems to do exactly what you
> need. Using FIEMAP, fallocate and moving data through userspace
> won't ever be reliable without special filesystem help (that only
> exists for XFS right now), nor will it enable the application to
> transparently use smart storage protocols and hardware when it is
> present on user systems....
We want to do that. We also want to present SCSI extended copy to the
guest, and map that to copy_file_range() on the host if possible. However,
there are going to be cases where we need to do low-level queries. For
example, it is on my todo list to implement lseek(SEEK_HOLE/SEEK_DATA)
on /dev/sd* devices as a GET LBA STATUS SCSI command. Dually, QEMU would
convert the GET LBA STATUS command to lseek or FIEMAP. QEMU only does a
simple conversion of commands and results, it's up to the guest to avoid
races.
It is of course the best if the guest can use SCSI extended copy, but we
cannot guarantee that.
Paolo
/*
* Skeleton that walks the extents of a file (e.g. to copy
* each part while preserving the allocation structure).
*
* Usage: ./fiemap FILENAME
*/
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <errno.h>
// #define UNSAFE
#define MIN(a, b) ((a) < (b) ? (a) : (b))
enum {
BDRV_BLOCK_ZERO = 1,
BDRV_BLOCK_DATA = 2
};
/* Probe the area at [start, start+length) and return whether it begins with a
* hole (BDRV_BLOCK_ZERO), an unwritten extent (BDRV_BLOCK_ZERO|BDRV_BLOCK_DATA)
* or allocated data (BDRV_BLOCK_DATA). The return value applies to the
* [start, *next) range. *next should be used as the "start" value in a
* subsequent call to get_next_fiemap.
*
* If successful, return the kind of data at offset start in the file.
* If start is beyond the end of file, return -ENXIO.
* Otherwise, return -errno.
*/
int get_next_fiemap(int fd, off_t start, off_t length, off_t *next)
{
struct {
struct fiemap fm;
struct fiemap_extent fe;
} f;
#if UNSAFE
f.fm.fm_flags = 0;
#else
f.fm.fm_flags = FIEMAP_FLAG_SYNC;
#endif
f.fm.fm_start = start;
f.fm.fm_length = length;
f.fm.fm_extent_count = 1;
f.fm.fm_reserved = 0;
if (ioctl(fd, FS_IOC_FIEMAP, &f) == -1) {
return -errno;
}
if (f.fm.fm_mapped_extents == 0) {
/*
* No extents found, any data must be at start + length or later.
* Check if we're at end of file, and if not, return a hole as large
* as the size we probed; splitting a hole across multiple calls is okay.
*/
off_t filelen = lseek(fd, 0, SEEK_END);
if (start >= filelen) {
return -ENXIO;
}
/* Pay attention to overflows. */
*next = start + MIN(length, filelen - start);
return BDRV_BLOCK_ZERO;
}
if (f.fe.fe_logical > f.fm.fm_start) {
/*
* Found an extent, but not at the probed point. We're in a hole,
* return it and we'll get to that extent on the next call.
*/
*next = f.fe.fe_logical;
return BDRV_BLOCK_ZERO;
}
/* Found an extent, and we're inside it. */
*next = f.fe.fe_logical + f.fe.fe_length;
if (f.fe.fe_flags & FIEMAP_EXTENT_UNWRITTEN) {
return BDRV_BLOCK_DATA|BDRV_BLOCK_ZERO;
} else {
return BDRV_BLOCK_DATA;
}
}
int main(int argc, char **argv)
{
int fd = open(argv[1], O_RDONLY);
if (fd < 0) {
perror("open");
exit(1);
}
off_t start = 0, next;
off_t length = 8 * 1024 * 1024;
int ret;
while ((ret = get_next_fiemap(fd, start, length, &next)) >= 0) {
printf("%-15ld %-15ld%s%s\n",
start, next - start,
(ret & BDRV_BLOCK_DATA) ? " data" : "",
(ret & BDRV_BLOCK_ZERO) ? " zero" : "");
/* ... copy or fallocate the destination here ... */
start = next;
}
if (ret != -ENXIO) {
perror("FIEMAP");
exit(1);
}
exit(0);
}
next prev parent reply other threads:[~2016-07-21 14:24 UTC|newest]
Thread overview: 55+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-07-19 4:07 [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Eric Blake
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 01/14] nbd: Fix bad flag detection on server Eric Blake
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 02/14] nbd: Add qemu-nbd -D for human-readable description Eric Blake
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 03/14] nbd: Limit nbdflags to 16 bits Eric Blake
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 04/14] nbd: Treat flags vs. command type as separate fields Eric Blake
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 05/14] nbd: Share common reply-sending code in server Eric Blake
2016-07-19 5:10 ` Fam Zheng
2016-07-19 14:52 ` Eric Blake
2016-07-20 4:39 ` Fam Zheng
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 06/14] nbd: Send message along with server NBD_REP_ERR errors Eric Blake
2016-07-19 5:15 ` Fam Zheng
2016-10-11 15:12 ` Eric Blake
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 07/14] nbd: Share common option-sending code in client Eric Blake
2016-07-19 5:31 ` Fam Zheng
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 08/14] nbd: Let server know when client gives up negotiation Eric Blake
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 09/14] nbd: Let client skip portions of server reply Eric Blake
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 10/14] nbd: Less allocation during NBD_OPT_LIST Eric Blake
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 11/14] nbd: Support shorter handshake Eric Blake
2016-07-19 4:07 ` [Qemu-devel] [PATCH v5 12/14] nbd: Improve server handling of shutdown requests Eric Blake
2016-07-19 4:08 ` [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server Eric Blake
2016-07-19 6:21 ` Fam Zheng
2016-07-19 15:28 ` Eric Blake
2016-07-19 15:45 ` Paolo Bonzini
2016-07-20 3:34 ` Fam Zheng
2016-07-20 3:47 ` Eric Blake
2016-07-20 4:37 ` Fam Zheng
2016-07-20 7:09 ` Paolo Bonzini
2016-07-20 7:38 ` Fam Zheng
2016-07-20 8:16 ` Paolo Bonzini
2016-07-20 9:04 ` Fam Zheng
2016-07-20 9:19 ` [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server) Paolo Bonzini
2016-07-20 12:30 ` Dave Chinner
2016-07-20 13:35 ` Niels de Vos
2016-07-21 11:43 ` Dave Chinner
2016-07-21 12:31 ` Pádraig Brady
2016-07-21 13:15 ` Dave Chinner
2016-07-20 13:40 ` Paolo Bonzini
2016-07-21 12:41 ` Dave Chinner
2016-07-21 13:01 ` Pádraig Brady
2016-07-21 14:23 ` Paolo Bonzini [this message]
2016-07-22 8:58 ` Dave Chinner
2016-07-22 10:41 ` Paolo Bonzini
2018-02-15 16:40 ` Vladimir Sementsov-Ogievskiy
2018-02-15 16:42 ` Paolo Bonzini
2018-04-18 14:25 ` Vladimir Sementsov-Ogievskiy
2018-04-18 14:41 ` [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC Eric Blake
2016-08-18 13:50 ` [Qemu-devel] [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server Vladimir Sementsov-Ogievskiy
2016-08-18 13:52 ` Paolo Bonzini
2016-07-19 4:08 ` [Qemu-devel] [PATCH v5 14/14] nbd: Implement NBD_CMD_WRITE_ZEROES on client Eric Blake
2016-07-19 6:24 ` Fam Zheng
2016-07-19 15:31 ` Eric Blake
2016-07-19 6:33 ` [Qemu-devel] [PATCH for-2.7 v5 00/14] nbd: efficient write zeroes Fam Zheng
2016-07-19 8:53 ` Paolo Bonzini
2016-07-19 15:33 ` Eric Blake
2016-07-19 15:41 ` Paolo Bonzini
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1072047668.9469836.1469111028358.JavaMail.zimbra@redhat.com \
--to=pbonzini@redhat.com \
--cc=P@draigBrady.com \
--cc=dchinner@redhat.com \
--cc=eblake@redhat.com \
--cc=famz@redhat.com \
--cc=kwolf@redhat.com \
--cc=lczerner@redhat.com \
--cc=mreitz@redhat.com \
--cc=ndevos@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).