From: Stefan Hajnoczi
To: "Denis V. Lunev"
Cc: qemu-devel@nongnu.org, qemu-block@nongnu.org, vsementsov@virtuozzo.com,
    Fam Zheng, Kevin Wolf, Max Reitz, Jeff Cody, Eric Blake
Date: Wed, 15 Jun 2016 10:48:20 +0100
Subject: Re: [Qemu-devel] [PATCH 8/9] mirror: use synch scheme for drive mirror
Message-ID: <20160615094820.GD26488@stefanha-x1.localdomain>
In-Reply-To: <1465917916-22348-9-git-send-email-den@openvz.org>
References: <1465917916-22348-1-git-send-email-den@openvz.org>
 <1465917916-22348-9-git-send-email-den@openvz.org>

On Tue, Jun 14, 2016 at 06:25:15PM +0300, Denis V. Lunev wrote:
> Block commit of the active image to the backing store on a slow disk
> could never end. For example, with a guest running the following loop:
>     while true; do
>         dd bs=1k count=1 if=/dev/zero of=x
>     done
> on top of slow storage, the operation cannot complete within a
> reasonable amount of time:
>     virsh blockcommit rhel7 sda --active --shallow
>     virsh qemu-monitor-event
>     virsh qemu-monitor-command rhel7 \
>         '{"execute":"block-job-complete",\
>           "arguments":{"device":"drive-scsi0-0-0-0"} }'
>     virsh qemu-monitor-event
> The completion event is never received.
>
> This problem cannot be fixed easily with the current architecture. We
> should either prohibit guest writes (which keep making the dirty bitmap
> dirty) or switch to a synchronous scheme.
>
> This patch implements the latter. It adds a mirror_before_write_notify
> callback. With it, all data written by the guest is synchronously
> written to the mirror target as well. The problem is only partially
> solved, though: we still need to switch from bdrv_dirty_bitmap to a
> plain hbitmap. That will be done in the next patch.
>
> Signed-off-by: Denis V. Lunev
> Reviewed-by: Vladimir Sementsov-Ogievskiy
> CC: Stefan Hajnoczi
> CC: Fam Zheng
> CC: Kevin Wolf
> CC: Max Reitz
> CC: Jeff Cody
> CC: Eric Blake
> ---
>  block/mirror.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 78 insertions(+)
>
> diff --git a/block/mirror.c b/block/mirror.c
> index 7471211..086256c 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -58,6 +58,9 @@ typedef struct MirrorBlockJob {
>      QSIMPLEQ_HEAD(, MirrorBuffer) buf_free;
>      int buf_free_count;
>
> +    NotifierWithReturn before_write;
> +    CoQueue dependent_writes;
> +
>      unsigned long *in_flight_bitmap;
>      int in_flight;
>      int sectors_in_flight;
> @@ -125,6 +128,7 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
>      g_free(op->buf);

qemu_vfree() must be used for qemu_blockalign() memory.
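
Something along these lines, as an untested sketch (the helper name and
the comment wording are mine, not from the patch):

    static void mirror_op_free(MirrorOp *op)
    {
        /* op->buf comes from qemu_try_blockalign(), so it must be
         * released with qemu_vfree(), not g_free(). */
        qemu_vfree(op->buf);
        /* The MirrorOp itself comes from g_new(), so g_free() is right. */
        g_free(op);
    }

The general rule is that the free function must match the allocator:
g_new()/g_malloc() pairs with g_free(), while qemu_blockalign()-family
memory pairs with qemu_vfree().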

>      g_free(op);
>
> +    qemu_co_queue_restart_all(&s->dependent_writes);
>      if (s->waiting_for_io) {
>          qemu_coroutine_enter(s->common.co, NULL);
>      }
> @@ -511,6 +515,74 @@ static void mirror_exit(BlockJob *job, void *opaque)
>      bdrv_unref(src);
>  }
>
> +static int coroutine_fn mirror_before_write_notify(
> +    NotifierWithReturn *notifier, void *opaque)
> +{
> +    MirrorBlockJob *s = container_of(notifier, MirrorBlockJob, before_write);
> +    BdrvTrackedRequest *req = opaque;
> +    MirrorOp *op;
> +    int sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
> +    int64_t sector_num = req->offset >> BDRV_SECTOR_BITS;
> +    int nb_sectors = req->bytes >> BDRV_SECTOR_BITS;
> +    int64_t end_sector = sector_num + nb_sectors;
> +    int64_t aligned_start, aligned_end;
> +
> +    if (req->type != BDRV_TRACKED_DISCARD && req->type != BDRV_TRACKED_WRITE) {
> +        /* this is neither a discard nor a write, we do not care */
> +        return 0;
> +    }
> +
> +    while (1) {
> +        bool waited = false;
> +        int64_t sn;
> +
> +        for (sn = sector_num; sn < end_sector; sn += sectors_per_chunk) {
> +            int64_t chunk = sn / sectors_per_chunk;
> +            if (test_bit(chunk, s->in_flight_bitmap)) {
> +                trace_mirror_yield_in_flight(s, chunk, s->in_flight);
> +                qemu_co_queue_wait(&s->dependent_writes);
> +                waited = true;
> +            }
> +        }
> +
> +        if (!waited) {
> +            break;
> +        }
> +    }
> +
> +    aligned_start = QEMU_ALIGN_UP(sector_num, sectors_per_chunk);
> +    aligned_end = QEMU_ALIGN_DOWN(sector_num + nb_sectors, sectors_per_chunk);
> +    if (aligned_end > aligned_start) {
> +        bdrv_reset_dirty_bitmap(s->dirty_bitmap, aligned_start,
> +                                aligned_end - aligned_start);
> +    }
> +
> +    if (req->type == BDRV_TRACKED_DISCARD) {
> +        mirror_do_zero_or_discard(s, sector_num, nb_sectors, true);
> +        return 0;
> +    }
> +
> +    s->in_flight++;
> +    s->sectors_in_flight += nb_sectors;
> +
> +    /* Allocate a MirrorOp that is used as an AIO callback. */
> +    op = g_new(MirrorOp, 1);
> +    op->s = s;
> +    op->sector_num = sector_num;
> +    op->nb_sectors = nb_sectors;
> +    op->buf = qemu_try_blockalign(blk_bs(s->target), req->qiov->size);
> +    if (op->buf == NULL) {
> +        g_free(op);
> +        return -ENOMEM;
> +    }
> +    qemu_iovec_init(&op->qiov, req->qiov->niov);
> +    qemu_iovec_clone(&op->qiov, req->qiov, op->buf);

Now op->qiov's iovec[] array is equivalent to req->qiov's but points
into op->buf.  However, the data is never copied from req->qiov into
op->buf, so junk gets written to the target!

> +    blk_aio_pwritev(s->target, req->offset, &op->qiov, 0,
> +                    mirror_write_complete, op);
> +    return 0;
> +}

The commit message and description claim this is synchronous, but it is
not: async requests are still being generated by guest I/O.

There is no rate limiting if s->target is slower than bs.  In that case
the queue of AIO requests (including their bounce buffers) keeps
growing.  The guest will eventually exhaust host memory, or AIO
submission will fail (e.g. when the Linux AIO maximum request count is
reached).

If you want this to be synchronous, you have to yield the coroutine
until the request completes.  Synchronous writes increase guest
latency, so this cannot become the new default.

A different solution is to detect when the dirty bitmap reaches a
minimum threshold and then employ I/O throttling on bs.  That way the
guest experiences no vcpu/network downtime and I/O performance only
drops during the convergence phase.
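
To illustrate the "yield until the request completes" point, a rough,
untested sketch of a truly synchronous path: since the notifier is
already declared coroutine_fn, it could issue the target write with the
coroutine API and simply not return until that write has finished
(error policy for target failures left aside here):

    static int coroutine_fn mirror_before_write_notify(
        NotifierWithReturn *notifier, void *opaque)
    {
        MirrorBlockJob *s = container_of(notifier, MirrorBlockJob,
                                         before_write);
        BdrvTrackedRequest *req = opaque;
        int ret;

        /* (dirty bitmap reset and in-flight serialisation as in the
         * patch above, omitted here) */

        /* Mirror the guest data to the target and do not let the guest
         * request complete until the target write has completed. */
        ret = blk_co_pwritev(s->target, req->offset, req->bytes,
                             req->qiov, 0);
        return ret;
    }

Because the guest request does not complete until blk_co_pwritev()
returns, no bounce buffer (and no qemu_iovec_clone() or data copy) is
needed, and the number of in-flight target writes is naturally bounded
by the guest's own queue depth.  But as said above, the added latency
means this cannot simply become the default behaviour.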

Stefan