From: Michael Roth <mdroth@linux.vnet.ibm.com>
To: qemu-devel@nongnu.org
Cc: qemu-stable@nongnu.org, Eric Blake <eblake@redhat.com>,
Kevin Wolf <kwolf@redhat.com>
Subject: [Qemu-devel] [PATCH 05/55] block: Perform copy-on-read in loop
Date: Wed, 6 Dec 2017 13:15:58 -0600 [thread overview]
Message-ID: <20171206191648.18208-6-mdroth@linux.vnet.ibm.com> (raw)
In-Reply-To: <20171206191648.18208-1-mdroth@linux.vnet.ibm.com>
From: Eric Blake <eblake@redhat.com>
Improve our braindead copy-on-read implementation. Pre-patch,
we have multiple issues:
- we create a bounce buffer and perform a write for the entire
request, even if the active image already has 99% of the
clusters occupied, and really only needs to copy-on-read the
remaining 1% of the clusters
- our bounce buffer was as large as the read request, and can
needlessly exhaust our memory by using double the memory of
the request size (the original request plus our bounce buffer),
rather than a capped maximum overhead beyond the original
- if a driver has a max_transfer limit, we are bypassing the
normal code in bdrv_aligned_preadv() that fragments to that
limit, and instead attempt to read the entire buffer from the
driver in one go, which some drivers may assert on
- a client can request a large request of nearly 2G such that
rounding the request out to cluster boundaries results in a
byte count larger than 2G. While this cannot exceed 32 bits,
it DOES have some follow-on problems:
-- the call to bdrv_driver_pread() can assert for exceeding
BDRV_REQUEST_MAX_BYTES, if the driver is old and lacks
.bdrv_co_preadv
-- if the buffer is all zeroes, the subsequent call to
bdrv_co_do_pwrite_zeroes is a no-op due to a negative size,
which means we did not actually copy on read
Fix all of these issues by breaking up the action into a loop,
where each iteration is capped to sane limits. Also, querying
the allocation status allows us to optimize: when data is
already present in the active layer, we don't need to bounce.
Note that the code has a telling comment that copy-on-read
should probably be a filter driver rather than a bolt-on hack
in io.c; but that remains a task for another day.
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
(cherry picked from commit cb2e28780c7080af489e72227683fe374f05022d)
Conflicts:
block/io.c
* remove context dep on d855ebcd3
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
---
block/io.c | 118 ++++++++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 81 insertions(+), 37 deletions(-)
diff --git a/block/io.c b/block/io.c
index 26003814eb..fce856ea8a 100644
--- a/block/io.c
+++ b/block/io.c
@@ -34,6 +34,9 @@
#define NOT_DONE 0x7fffffff /* used while emulated sync operation in progress */
+/* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */
+#define MAX_BOUNCE_BUFFER (32768 << BDRV_SECTOR_BITS)
+
static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
int64_t offset, int bytes, BdrvRequestFlags flags);
@@ -945,11 +948,14 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
BlockDriver *drv = bs->drv;
struct iovec iov;
- QEMUIOVector bounce_qiov;
+ QEMUIOVector local_qiov;
int64_t cluster_offset;
unsigned int cluster_bytes;
size_t skip_bytes;
int ret;
+ int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
+ BDRV_REQUEST_MAX_BYTES);
+ unsigned int progress = 0;
/* FIXME We cannot require callers to have write permissions when all they
* are doing is a read request. If we did things right, write permissions
@@ -961,52 +967,94 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
// assert(child->perm & (BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE));
/* Cover entire cluster so no additional backing file I/O is required when
- * allocating cluster in the image file.
+ * allocating cluster in the image file. Note that this value may exceed
+ * BDRV_REQUEST_MAX_BYTES (even when the original read did not), which
+ * is one reason we loop rather than doing it all at once.
*/
bdrv_round_to_clusters(bs, offset, bytes, &cluster_offset, &cluster_bytes);
+ skip_bytes = offset - cluster_offset;
trace_bdrv_co_do_copy_on_readv(bs, offset, bytes,
cluster_offset, cluster_bytes);
- iov.iov_len = cluster_bytes;
- iov.iov_base = bounce_buffer = qemu_try_blockalign(bs, iov.iov_len);
+ bounce_buffer = qemu_try_blockalign(bs,
+ MIN(MIN(max_transfer, cluster_bytes),
+ MAX_BOUNCE_BUFFER));
if (bounce_buffer == NULL) {
ret = -ENOMEM;
goto err;
}
- qemu_iovec_init_external(&bounce_qiov, &iov, 1);
+ while (cluster_bytes) {
+ int64_t pnum;
- ret = bdrv_driver_preadv(bs, cluster_offset, cluster_bytes,
- &bounce_qiov, 0);
- if (ret < 0) {
- goto err;
- }
+ ret = bdrv_is_allocated(bs, cluster_offset,
+ MIN(cluster_bytes, max_transfer), &pnum);
+ if (ret < 0) {
+ /* Safe to treat errors in querying allocation as if
+ * unallocated; we'll probably fail again soon on the
+ * read, but at least that will set a decent errno.
+ */
+ pnum = MIN(cluster_bytes, max_transfer);
+ }
- if (drv->bdrv_co_pwrite_zeroes &&
- buffer_is_zero(bounce_buffer, iov.iov_len)) {
- /* FIXME: Should we (perhaps conditionally) be setting
- * BDRV_REQ_MAY_UNMAP, if it will allow for a sparser copy
- * that still correctly reads as zero? */
- ret = bdrv_co_do_pwrite_zeroes(bs, cluster_offset, cluster_bytes, 0);
- } else {
- /* This does not change the data on the disk, it is not necessary
- * to flush even in cache=writethrough mode.
- */
- ret = bdrv_driver_pwritev(bs, cluster_offset, cluster_bytes,
- &bounce_qiov, 0);
- }
+ assert(skip_bytes < pnum);
- if (ret < 0) {
- /* It might be okay to ignore write errors for guest requests. If this
- * is a deliberate copy-on-read then we don't want to ignore the error.
- * Simply report it in all cases.
- */
- goto err;
- }
+ if (ret <= 0) {
+ /* Must copy-on-read; use the bounce buffer */
+ iov.iov_base = bounce_buffer;
+ iov.iov_len = pnum = MIN(pnum, MAX_BOUNCE_BUFFER);
+ qemu_iovec_init_external(&local_qiov, &iov, 1);
- skip_bytes = offset - cluster_offset;
- qemu_iovec_from_buf(qiov, 0, bounce_buffer + skip_bytes, bytes);
+ ret = bdrv_driver_preadv(bs, cluster_offset, pnum,
+ &local_qiov, 0);
+ if (ret < 0) {
+ goto err;
+ }
+
+ if (drv->bdrv_co_pwrite_zeroes &&
+ buffer_is_zero(bounce_buffer, pnum)) {
+ /* FIXME: Should we (perhaps conditionally) be setting
+ * BDRV_REQ_MAY_UNMAP, if it will allow for a sparser copy
+ * that still correctly reads as zero? */
+ ret = bdrv_co_do_pwrite_zeroes(bs, cluster_offset, pnum, 0);
+ } else {
+ /* This does not change the data on the disk, it is not
+ * necessary to flush even in cache=writethrough mode.
+ */
+ ret = bdrv_driver_pwritev(bs, cluster_offset, pnum,
+ &local_qiov, 0);
+ }
+
+ if (ret < 0) {
+ /* It might be okay to ignore write errors for guest
+ * requests. If this is a deliberate copy-on-read
+ * then we don't want to ignore the error. Simply
+ * report it in all cases.
+ */
+ goto err;
+ }
+
+ qemu_iovec_from_buf(qiov, progress, bounce_buffer + skip_bytes,
+ pnum - skip_bytes);
+ } else {
+ /* Read directly into the destination */
+ qemu_iovec_init(&local_qiov, qiov->niov);
+ qemu_iovec_concat(&local_qiov, qiov, progress, pnum - skip_bytes);
+ ret = bdrv_driver_preadv(bs, offset + progress, local_qiov.size,
+ &local_qiov, 0);
+ qemu_iovec_destroy(&local_qiov);
+ if (ret < 0) {
+ goto err;
+ }
+ }
+
+ cluster_offset += pnum;
+ cluster_bytes -= pnum;
+ progress += pnum - skip_bytes;
+ skip_bytes = 0;
+ }
+ ret = 0;
err:
qemu_vfree(bounce_buffer);
@@ -1212,9 +1260,6 @@ int coroutine_fn bdrv_co_readv(BdrvChild *child, int64_t sector_num,
return bdrv_co_do_readv(child, sector_num, nb_sectors, qiov, 0);
}
-/* Maximum buffer for write zeroes fallback, in bytes */
-#define MAX_WRITE_ZEROES_BOUNCE_BUFFER (32768 << BDRV_SECTOR_BITS)
-
static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
int64_t offset, int bytes, BdrvRequestFlags flags)
{
@@ -1229,8 +1274,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
bs->bl.request_alignment);
- int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
- MAX_WRITE_ZEROES_BOUNCE_BUFFER);
+ int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer, MAX_BOUNCE_BUFFER);
assert(alignment % bs->bl.request_alignment == 0);
head = offset % alignment;
--
2.11.0
next prev parent reply other threads:[~2017-12-06 19:18 UTC|newest]
Thread overview: 56+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-12-06 19:15 [Qemu-devel] [PATCH 00/55] Patch Round-up for stable 2.10.2, freeze on 2017-12-13 Michael Roth
2017-12-06 19:15 ` [Qemu-devel] [PATCH 01/55] hw/ppc: CAS reset on early device hotplug Michael Roth
2017-12-06 19:15 ` [Qemu-devel] [PATCH 02/55] hw/usb/bus: Remove bad object_unparent() from usb_try_create_simple() Michael Roth
2017-12-06 19:15 ` [Qemu-devel] [PATCH 03/55] block/mirror: check backing in bdrv_mirror_top_flush Michael Roth
2017-12-06 19:15 ` [Qemu-devel] [PATCH 04/55] kvmclock: use the updated system_timer_msr Michael Roth
2017-12-06 19:15 ` Michael Roth [this message]
2017-12-06 19:15 ` [Qemu-devel] [PATCH 06/55] exec: Explicitly export target AS from address_space_translate_internal Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 07/55] memory: Open code FlatView rendering Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 08/55] memory: Move FlatView allocation to a helper Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 09/55] memory: Move AddressSpaceDispatch from AddressSpace to FlatView Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 10/55] memory: Remove AddressSpace pointer from AddressSpaceDispatch Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 11/55] memory: avoid "resurrection" of dead FlatViews Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 12/55] memory: Switch memory from using AddressSpace to FlatView Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 13/55] memory: Cleanup after switching " Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 14/55] memory: Rename mem_begin/mem_commit/mem_add helpers Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 15/55] memory: Store physical root MR in FlatView Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 16/55] memory: Alloc dispatch tree where topology is generared Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 17/55] memory: Move address_space_update_ioeventfds Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 18/55] memory: Share FlatView's and dispatch trees between address spaces Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 19/55] memory: Do not allocate FlatView in address_space_init Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 20/55] memory: Get rid of address_space_init_shareable Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 21/55] memory: Create FlatView directly Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 22/55] memory: trace FlatView creation and destruction Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 23/55] memory: seek FlatView sharing candidates among children subregions Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 24/55] memory: Share special empty FlatView Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 25/55] exec: add page_mask for flatview_do_translate Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 26/55] exec: simplify address_space_get_iotlb_entry Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 27/55] memory: fix off-by-one error in memory_region_notify_one() Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 28/55] hw/sd: fix out-of-bounds check for multi block reads Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 29/55] qcow2: Fix unaligned preallocated truncation Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 30/55] qcow2: Always execute preallocate() in a coroutine Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 31/55] iotests: Add cluster_size=64k to 125 Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 32/55] nios2: define tcg_env Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 33/55] io: monitor encoutput buffer size from websocket GSource Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 34/55] ppc: fix setting of compat mode Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 35/55] translate.c: Fix usermode big-endian AArch32 LDREXD and STREXD Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 36/55] hw/intc/arm_gicv3_its: Don't abort on table save failure Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 37/55] net/socket: fix coverity issue Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 38/55] net: fix check for number of parameters to -netdev socket Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 39/55] nbd/client: Use error_prepend() correctly Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 40/55] util/stats64: Fix min/max comparisons Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 41/55] virtio: Add queue interface to restore avail index from vring used index Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 42/55] vhost: restore avail index from vring used index on disconnection Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 43/55] hw/ppc: clear pending_events on machine reset Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 44/55] spapr: reset DRCs after devices Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 45/55] scripts/make-release: ship u-boot source as a tarball Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 46/55] block/nfs: fix nfs_client_open for filesize greater than 1TB Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 47/55] virtio-net: don't touch virtqueue if vm is stopped Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 48/55] nbd/server: CVE-2017-15119 Reject options larger than 32M Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 49/55] nbd/server: CVE-2017-15118 Stack smash on large export name Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 50/55] vhost: fix error check in vhost_verify_ring_mappings() Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 51/55] nbd/server: fix nbd_negotiate_handle_info Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 52/55] nbd-client: Refuse read-only client with BDRV_O_RDWR Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 53/55] nbd/client: Don't hard-disconnect on ESHUTDOWN from server Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 54/55] vga: drop line_offset variable Michael Roth
2017-12-06 19:16 ` [Qemu-devel] [PATCH 55/55] vga: handle cirrus vbe mode wraparounds Michael Roth
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20171206191648.18208-6-mdroth@linux.vnet.ibm.com \
--to=mdroth@linux.vnet.ibm.com \
--cc=eblake@redhat.com \
--cc=kwolf@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=qemu-stable@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).