From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:35601) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1cdjwu-0004hk-Rp for qemu-devel@nongnu.org; Tue, 14 Feb 2017 15:43:48 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1cdjwq-0004Ye-WB for qemu-devel@nongnu.org; Tue, 14 Feb 2017 15:43:44 -0500 Received: from mx-v6.kamp.de ([2a02:248:0:51::16]:47170 helo=mx01.kamp.de) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1cdjwq-0004Wg-Ge for qemu-devel@nongnu.org; Tue, 14 Feb 2017 15:43:40 -0500 Date: Tue, 14 Feb 2017 21:43:02 +0100 Message-ID: <1045307195-43255@kerio.kamp.de> In-Reply-To: <20170214162012.GC6189@noname.redhat.com> MIME-Version: 1.0 From: Peter Lieven Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [RFC PATCH V3] qemu-img: make convert async List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Kevin Wolf Cc: qemu-devel@nongnu.org, qemu-block@nongnu.org, mreitz@redhat.com, ct@flyingcircus.io Von: Kevin Wolf =20 An: Peter Lieven =20 Kopie: , , , =20 Gesendet: 14.02.2017 17:20=20 Betreff: Re: [RFC PATCH V3] qemu-img: make convert async=20 Am 14.02.2017 um 14:39 hat Peter Lieven geschrieben:=20 > this is something I have been thinking about for almost 2 years now.=20 > we heavily have the following two use cases when using qemu-img convert.=20 > =20 > a) reading from NFS and writing to iSCSI for deploying templates=20 > b) reading from iSCSI and writing to NFS for backups=20 > =20 > In both processes we use libiscsi and libnfs so we have no kernel pagecac= he.=20 > As qemu-img convert is implemented with sync operations that means we=20 > read one buffer and then write it. No parallelism and each sync request=20 > takes as long as it takes until it is completed.=20 > =20 > This is version 3 of the approach using coroutine worker "threads".=20 > =20 > So far I have the following runtimes when reading an uncompressed QCOW2 f= rom=20 > NFS and writing it to iSCSI (raw):=20 > =20 > qemu-img (master)=20 > =C2=A0nfs -> iscsi 22.8 secs=20 > =C2=A0nfs -> ram =C2=A0 11.7 secs=20 > =C2=A0ram -> iscsi 12.3 secs=20 > =20 > qemu-img-async=20 > =C2=A0nfs -> iscsi 12.3 secs=20 > =C2=A0nfs -> ram =C2=A0 10.5 secs=20 > =C2=A0ram -> iscsi 11.7 secs=20 > =20 > Comments appreciated.=20 > =20 > Thank you,=20 > Peter=20 > =20 > Signed-off-by: Peter Lieven =20 > ---=20 > v2->v3: - updated stats in the commit msg from a host with a better netwo= rk card=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 - only wake up the coroutine that is acutally= waiting for a write to complete.=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 this was not only overhead, but also b= reaking at least linux AIO.=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 - fix coding style complaints=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 - rename some variables and structs=20 > =20 > v1->v2: - using coroutine as worker "threads". [Max]=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 - keeping the request queue as otherwise it h= appens=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 that we wait on BLK_ZERO chunks while = keeping the write order.=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 it also avoids redundant calls to get_= block_status and helps=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 to skip some conditions for fully allo= cated imaged (!s->min_sparse)=20 > =20 > =C2=A0qemu-img.c | 213 +++++++++++++++++++++++++++++++++++++++++---------= -----------=20 > =C2=A01 file changed, 145 insertions(+), 68 deletions(-)=20 > =20 > diff --git a/qemu-img.c b/qemu-img.c=20 > index cff22e3..970863f 100644=20 > --- a/qemu-img.c=20 > +++ b/qemu-img.c=20 > @@ -1448,6 +1448,16 @@ enum ImgConvertBlockStatus {=20 > =C2=A0 =C2=A0 =C2=A0BLK_BACKING_FILE,=20 > =C2=A0};=20 > =C2=A0=20 > +typedef struct ImgConvertRequest {=20 > + =C2=A0 =C2=A0int64_t sector_num;=20 > + =C2=A0 =C2=A0enum ImgConvertBlockStatus status;=20 > + =C2=A0 =C2=A0int nb_sectors;=20 > + =C2=A0 =C2=A0QSIMPLEQ_ENTRY(ImgConvertRequest) next;=20 > +} ImgConvertRequest;=20 =20 If I understand your code correctly, you first go through the image and=20 create an ImgConvertRequest for every chunk with the same allocation=20 status (well, actually not every chunk, but that may be a bug, see=20 below).=20 =20 So let's assume a 1 TB image with every other cluster allocated:=20 =20 =C2=A0 =C2=A01T / (64k * 2) =3D 8M=20 =20 A list with 8 million entries (requiring 8 million mallocs) feels=20 relatively large. Not counting malloc overhead (which might be=20 considerable here), I think we're at 192 MB memory usage for this case.=20 Not as bad as I was afraid, but not really nice either. And 1 TB isn't=20 even close to the maximum image size.=20 =20 Is it really necessary to create this list at the beginning? Why can't=20 we just get the next chunk as the old code used to?=20 In my initial version with AIO invoking convert_iterate_sectors caused recursive entries into my queue_fill functions. If I have fixed worker thre= ads or only a few coroutines for read/write whatsever this might work without. I will have a look. In my test cases I don't have such heavily fragmented images. I was at a few hundred entries at most. The other issue I feared was that the "cache" in convert_iterate_sectors does not work correctly. But if the calls to this function are still in inc= reasing sector order, it should work. =20 > +/* XXX: this should be a cmdline parameter */=20 > +#define NUM_COROUTINES 8=20 > +=20 > =C2=A0typedef struct ImgConvertState {=20 > =C2=A0 =C2=A0 =C2=A0BlockBackend **src;=20 > =C2=A0 =C2=A0 =C2=A0int64_t *src_sectors;=20 > @@ -1455,6 +1465,8 @@ typedef struct ImgConvertState {=20 > =C2=A0 =C2=A0 =C2=A0int64_t src_cur_offset;=20 > =C2=A0 =C2=A0 =C2=A0int64_t total_sectors;=20 > =C2=A0 =C2=A0 =C2=A0int64_t allocated_sectors;=20 > + =C2=A0 =C2=A0int64_t allocated_done;=20 > + =C2=A0 =C2=A0int64_t wr_offs;=20 > =C2=A0 =C2=A0 =C2=A0enum ImgConvertBlockStatus status;=20 > =C2=A0 =C2=A0 =C2=A0int64_t sector_next_status;=20 > =C2=A0 =C2=A0 =C2=A0BlockBackend *target;=20 > @@ -1464,11 +1476,16 @@ typedef struct ImgConvertState {=20 > =C2=A0 =C2=A0 =C2=A0int min_sparse;=20 > =C2=A0 =C2=A0 =C2=A0size_t cluster_sectors;=20 > =C2=A0 =C2=A0 =C2=A0size_t buf_sectors;=20 > + =C2=A0 =C2=A0Coroutine *co[NUM_COROUTINES];=20 > + =C2=A0 =C2=A0int64_t wait_sector_num[NUM_COROUTINES];=20 > + =C2=A0 =C2=A0QSIMPLEQ_HEAD(, ImgConvertRequest) queue;=20 > + =C2=A0 =C2=A0int ret;=20 > =C2=A0} ImgConvertState;=20 > =C2=A0=20 > =C2=A0static void convert_select_part(ImgConvertState *s, int64_t sector_= num)=20 > =C2=A0{=20 > - =C2=A0 =C2=A0assert(sector_num >=3D s->src_cur_offset);=20 > + =C2=A0 =C2=A0s->src_cur_offset =3D 0;=20 > + =C2=A0 =C2=A0s->src_cur =3D 0;=20 > =C2=A0 =C2=A0 =C2=A0while (sector_num - s->src_cur_offset >=3D s->src_sec= tors[s->src_cur]) {=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0s->src_cur_offset +=3D s->src_sectors[s= ->src_cur];=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0s->src_cur++;=20 =20 Is this necessary because we can effectively go backwards now when one=20 coroutine works already on the next part, but another one is still=20 processing the previous one? Worth a comment?=20 Exactly that is the problem. I will add a comment. =20 > @@ -1544,11 +1561,13 @@ static int convert_iteration_sectors(ImgConvertSt= ate *s, int64_t sector_num)=20 > =C2=A0 =C2=A0 =C2=A0return n;=20 > =C2=A0}=20 > =C2=A0=20 > -static int convert_read(ImgConvertState *s, int64_t sector_num, int nb_s= ectors,=20 > - =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0uint8_t *buf)=20 > +static int convert_co_read(ImgConvertState *s, ImgConvertRequest *req,=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 uint8_t *buf, QEMUIOVector *qiov)=20 =20 This is a weird interface: You want to get a created, but empty qiov=20 from the caller. There's no real point in that. The traditional approch=20 is having a struct iovec on the stack here locally and then using=20 qemu_iovec_init_external().=20 The idea was to avoid the qemu_iovec every time because this involves a mal= loc, but I admit its not nice. Maybe I find a better approach. =20 > =C2=A0{=20 > =C2=A0 =C2=A0 =C2=A0int n;=20 > =C2=A0 =C2=A0 =C2=A0int ret;=20 > + =C2=A0 =C2=A0int64_t sector_num =3D req->sector_num;=20 > + =C2=A0 =C2=A0int nb_sectors =3D req->nb_sectors;=20 > =C2=A0=20 > =C2=A0 =C2=A0 =C2=A0assert(nb_sectors <=3D s->buf_sectors);=20 > =C2=A0 =C2=A0 =C2=A0while (nb_sectors > 0) {=20 > @@ -1562,10 +1581,13 @@ static int convert_read(ImgConvertState *s, int64= _t sector_num, int nb_sectors,=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0blk =3D s->src[s->src_cur];=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0bs_sectors =3D s->src_sectors[s->src_cu= r];=20 > =C2=A0=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0qemu_iovec_reset(qiov);=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0n =3D MIN(nb_sectors, bs_sectors - (sec= tor_num - s->src_cur_offset));=20 > - =C2=A0 =C2=A0 =C2=A0 =C2=A0ret =3D blk_pread(blk,=20 > - =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0(sector_num - s->src_cur_offset) << BDRV_SECTOR_BITS,=20 > - =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0buf, n << BDRV_SECTOR_BITS);=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0qemu_iovec_add(qiov, buf, n << BDRV_SECTOR_B= ITS);=20 > +=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0ret =3D blk_co_preadv(=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0blk, (sector_num= - s->src_cur_offset) << BDRV_SECTOR_BITS,=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0n << BDRV_SECTOR= _BITS, qiov, 0);=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if (ret < 0) {=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0return ret;=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}=20 > @@ -1578,15 +1600,18 @@ static int convert_read(ImgConvertState *s, int64= _t sector_num, int nb_sectors,=20 > =C2=A0 =C2=A0 =C2=A0return 0;=20 > =C2=A0}=20 > =C2=A0=20 > -static int convert_write(ImgConvertState *s, int64_t sector_num, int nb_= sectors,=20 > - =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 const uint8_t *buf)=20 > +=20 > +static int convert_co_write(ImgConvertState *s, ImgConvertRequest *req,=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0uint8_t *buf, QEMUIOVector *qiov)=20 =20 Same thing here.=20 =20 Note that convert_write() used to be called unconditionally. This is not=20 the case any more because you don't create ImgConvertRequests=20 unconditionally. In particular, BLK_ZERO is skipped without -S 0 instead=20 of doing a blk_co_pwrite_zeroes(). This is a bug. The assertion in the=20 BLK_BACKING_FILE branch of the switch is also dead code now. You are right. This needs to be fixed. =20 =20 > =C2=A0{=20 > =C2=A0 =C2=A0 =C2=A0int ret;=20 > + =C2=A0 =C2=A0int64_t sector_num =3D req->sector_num;=20 > + =C2=A0 =C2=A0int nb_sectors =3D req->nb_sectors;=20 > =C2=A0=20 > =C2=A0 =C2=A0 =C2=A0while (nb_sectors > 0) {=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0int n =3D nb_sectors;=20 > -=20 > - =C2=A0 =C2=A0 =C2=A0 =C2=A0switch (s->status) {=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0qemu_iovec_reset(qiov);=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0switch (req->status) {=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0case BLK_BACKING_FILE:=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0/* If we have a backing f= ile, leave clusters unallocated that are=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 * unallocated in the sou= rce image, so that the backing file is=20 > @@ -1607,9 +1632,10 @@ static int convert_write(ImgConvertState *s, int64= _t sector_num, int nb_sectors,=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0break;=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}=20 > =C2=A0=20 > - =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ret =3D blk_pwri= te_compressed(s->target,=20 > - =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0sector_num << BDRV_SECTOR_BITS,=20 > - =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0buf, n << BDRV_SECTOR_BITS);=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0qemu_iovec_add(q= iov, buf, n << BDRV_SECTOR_BITS);=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ret =3D blk_co_p= writev(s->target, sector_num << BDRV_SECTOR_BITS,=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 n << BDRV_SECTOR_BI= TS, qiov,=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 BDRV_REQ_WRITE_COMP= RESSED);=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if (ret < 0= ) {=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0return ret;=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}=20 > @@ -1622,8 +1648,9 @@ static int convert_write(ImgConvertState *s, int64_= t sector_num, int nb_sectors,=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if (!s->min_sparse ||=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0is_allocate= d_sectors_min(buf, n, &n, s->min_sparse))=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0{=20 > - =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ret =3D blk_pwri= te(s->target, sector_num << BDRV_SECTOR_BITS,=20 > - =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 buf, n << BDRV_SECTOR_BITS, 0);=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0qemu_iovec_add(q= iov, buf, n << BDRV_SECTOR_BITS);=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ret =3D blk_co_p= writev(s->target, sector_num << BDRV_SECTOR_BITS,=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 n << BDRV_SECTOR_BI= TS, qiov, 0);=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if (ret < 0= ) {=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0return ret;=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}=20 > @@ -1635,8 +1662,9 @@ static int convert_write(ImgConvertState *s, int64_= t sector_num, int nb_sectors,=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if (s->has_zero_init) {=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0break;=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}=20 > - =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ret =3D blk_pwrite_zeroes(s->t= arget, sector_num << BDRV_SECTOR_BITS,=20 > - =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0n << BDRV_SECTOR_BIT= S, 0);=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ret =3D blk_co_pwrite_zeroes(s= ->target,=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 sector_num <= < BDRV_SECTOR_BITS,=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 n << BDRV_SE= CTOR_BITS, 0);=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if (ret < 0) {=20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0return ret;= =20 > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}=20 > @@ -1651,12 +1679,92 @@ static int convert_write(ImgConvertState *s, int6= 4_t sector_num, int nb_sectors,=20 > =C2=A0 =C2=A0 =C2=A0return 0;=20 > =C2=A0}=20 > =C2=A0=20 > -static int convert_do_copy(ImgConvertState *s)=20 > +static void convert_co_do_copy(void *opaque)=20 > =C2=A0{=20 > + =C2=A0 =C2=A0ImgConvertState *s =3D opaque;=20 > =C2=A0 =C2=A0 =C2=A0uint8_t *buf =3D NULL;=20 > - =C2=A0 =C2=A0int64_t sector_num, allocated_done;=20 > + =C2=A0 =C2=A0int ret, i;=20 > + =C2=A0 =C2=A0ImgConvertRequest *req, *next_req;=20 > + =C2=A0 =C2=A0QEMUIOVector qiov;=20 > + =C2=A0 =C2=A0int index =3D -1;=20 > +=20 > + =C2=A0 =C2=A0for (i =3D 0; i < NUM_COROUTINES; i++) {=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0if (s->co[i] =3D=3D qemu_coroutine_self()) {= =20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0index =3D i;=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0break;=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0}=20 > + =C2=A0 =C2=A0}=20 > + =C2=A0 =C2=A0assert(index >=3D 0);=20 > +=20 > + =C2=A0 =C2=A0qemu_iovec_init(&qiov, 1);=20 > + =C2=A0 =C2=A0buf =3D blk_blockalign(s->target, s->buf_sectors * BDRV_SE= CTOR_SIZE);=20 > +=20 > + =C2=A0 =C2=A0while (s->ret =3D=3D -EINPROGRESS && (req =3D QSIMPLEQ_FIR= ST(&s->queue))) {=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0QSIMPLEQ_REMOVE_HEAD(&s->queue, next);=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0next_req =3D QSIMPLEQ_FIRST(&s->queue);=20 > +=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0s->allocated_done +=3D req->nb_sectors;=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0qemu_progress_print(100.0 * s->allocated_don= e / s->allocated_sectors,=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A00);=20 > +=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0if (req->status =3D=3D BLK_DATA) {=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ret =3D convert_co_read(s, req= , buf, &qiov);=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if (ret < 0) {=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0error_report("er= ror while reading sector %" PRId64=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 ": %s", req->sector_num, strerror(-ret));=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0s->ret =3D ret;=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0goto out;=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0}=20 > +=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0/* keep writes in order */=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0while (s->wr_offs !=3D req->sector_num) {=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if (s->ret !=3D -EINPROGRESS) = {=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0goto out;=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0s->wait_sector_num[index] =3D = req->sector_num;=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0qemu_coroutine_yield();=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0}=20 > + =C2=A0 =C2=A0 =C2=A0 =C2=A0s->wait_sector_num[index] =3D -1;=20 =20 Okay, so may this is the most important part of the patch.=20 =20 If we need to keep writes in order, we can only have a single coroutine=20 that is currently writing. Then how likely is it that more than one=20 other coroutine could actually be reading at the same time rather than=20 just waiting for the current writer to be finished?=20 =20 My assumption is that it effectively degenerates to exactly one reader=20 and one writer, which means that we don't actually need a command line=20 option to configure the number of coroutines, and that our default of 8=20 is already higher than useful.=20 I will check the result for 4 and 2 coroutines. My observation or fear was that a read and write that are effectively no-ops (BLK_ZERO with s->has_zer= o_init) could block the queue and avoid to read ahead. =20 The big question is then if the code could become simpler if we can=20 assume exactly two coroutines. Possibly even not two coroutines that do=20 the same thing, but one that reads and another one that writes,=20 coordinated with the usual coroutine locking mechanisms. (I haven't=20 thought enough about it to be sure that it would be an improvement, but=20 it doesn't feel too unlikely.)=20 =20 On the other hand, if keeping writes in order is only actually needed in=20 specific cases, other cases could benefit from actual parallelism, even=20 though this is not implemented in this patch.=20 I tried to disable it. It has had no influence for my test scenario. But wi= th the current approach such a switch as well as a switch to define the number of workers could easily be implemented. Having multiple readers (and maybe = writers) should increase performance if there is latency involved. Before I rewrite everything I would propose to try to change the code to avoid to build the request queue upfront and analyze what a change in nu= mber of coroutines has in the current and the version without request queue? If we would have a read and a write coroutine and maybe a third (to calcula= te the request size with convert_iterate_sectors) my idea would be to have a limited reque= st queue. That is passed from coroutine to coroutine. sth like: =C2=A0queue 1: free request object =C2=A0fifo queue 2: request objects ready for reading =C2=A0fifo queue 3: request objects ready for writing. The first coroutine would fetch the a free request object and calculate nb_= sectors and status and put it into queue 2. Then the reader coroutine would fetch the next read object in order and per= form the read. Once finisched it puts the object into the queue 3. Then the writer coroutine handles the write and when the write is finished = it puts it back into into the freelist (queue 1). Thanks for your comments, Peter =