From: Peter Lieven
Date: Thu, 04 Dec 2014 15:42:21 +0100
Message-ID: <548072CD.5000302@kamp.de>
References: <1417530788-15172-1-git-send-email-pl@kamp.de> <1417530788-15172-4-git-send-email-pl@kamp.de> <20141204141201.GC5129@noname.redhat.com>
In-Reply-To: <20141204141201.GC5129@noname.redhat.com>
Subject: Re: [Qemu-devel] [RFC PATCH 3/3] virtio-blk: introduce multiread
To: Kevin Wolf
Cc: famz@redhat.com, benoit@irqsave.net, ming.lei@canonical.com, armbru@redhat.com, qemu-devel@nongnu.org, stefanha@redhat.com, pbonzini@redhat.com, mreitz@redhat.com

On 04.12.2014 15:12, Kevin Wolf wrote:
> Am 02.12.2014 um 15:33 hat Peter Lieven geschrieben:
>> this patch finally introduces multiread support to virtio-blk. While
>> multiwrite support has been there for a long time, read support was missing.
>>
>> To achieve this the patch does several things which might need further
>> explanation:
>>
>> - the whole merge and multireq logic is moved from block.c into
>> virtio-blk. This move is a preparation for directly creating a
>> coroutine out of virtio-blk.
>>
>> - requests are only merged if they are strictly sequential, and are no
>> longer sorted.
>> This simplification decreases overhead and reduces
>> latency. It will also merge some requests which were unmergeable before.
>>
>> The old algorithm took up to 32 requests, sorted them and tried to merge
>> them. The outcome was anything between 1 and 32 requests. In the case of
>> 32 requests there were 31 requests unnecessarily delayed.
>>
>> On the other hand, let's imagine e.g. 16 unmergeable requests followed
>> by 32 mergeable requests. The latter 32 requests would have been split
>> into two merged requests of 16.
>>
>> Lastly, the simplified logic allows for a fast path if we have only a
>> single request in the multirequest. In this case the request is sent as
>> an ordinary request without multireq callbacks.
>>
>> As a first benchmark I installed Ubuntu 14.04.1 on a ramdisk. The number of
>> merged requests is in the same order while the latency is slightly decreased.
>> One should not read too much into the numbers because the number of wr_requests
>> is highly fluctuant. I hope the numbers show that this patch is at least
>> not causing too much harm:
>>
>> Cmdline:
>> qemu-system-x86_64 -m 1024 -smp 2 -enable-kvm -cdrom ubuntu-14.04.1-server-amd64.iso \
>>  -drive if=virtio,file=/tmp/ubuntu.raw -monitor stdio
>>
>> Before:
>> virtio0: rd_bytes=150979072 wr_bytes=2591138816 rd_operations=18475 wr_operations=69216
>> flush_operations=15343 wr_total_time_ns=26969283701 rd_total_time_ns=1018449432
>> flush_total_time_ns=562596819 rd_merged=0 wr_merged=10453
>>
>> After:
>> virtio0: rd_bytes=148112896 wr_bytes=2845834240 rd_operations=17535 wr_operations=72197
>> flush_operations=15760 wr_total_time_ns=26104971623 rd_total_time_ns=1012926283
>> flush_total_time_ns=564414752 rd_merged=517 wr_merged=9859
>>
>> Some first numbers of improved read performance while booting:
>>
>> The Ubuntu 14.04.1 vServer from above:
>> virtio0: rd_bytes=86158336 wr_bytes=688128 rd_operations=5851 wr_operations=70
>> flush_operations=4 wr_total_time_ns=10886705 rd_total_time_ns=416688595
>> flush_total_time_ns=288776 rd_merged=1297 wr_merged=2
>>
>> Windows 2012R2:
>> virtio0: rd_bytes=176559104 wr_bytes=61859840 rd_operations=7200 wr_operations=360
>> flush_operations=68 wr_total_time_ns=34344992718 rd_total_time_ns=134386844669
>> flush_total_time_ns=18115517 rd_merged=641 wr_merged=216
>>
>> Signed-off-by: Peter Lieven
>> ---
>>  hw/block/dataplane/virtio-blk.c |   10 +-
>>  hw/block/virtio-blk.c           |  222 ++++++++++++++++++++++++---------------
>>  include/hw/virtio/virtio-blk.h  |   23 ++--
>>  3 files changed, 156 insertions(+), 99 deletions(-)
>>
>> diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
>> index 1222a37..aa4ad91 100644
>> --- a/hw/block/dataplane/virtio-blk.c
>> +++ b/hw/block/dataplane/virtio-blk.c
>> @@ -96,9 +96,8 @@ static void handle_notify(EventNotifier *e)
>>      event_notifier_test_and_clear(&s->host_notifier);
>>      blk_io_plug(s->conf->conf.blk);
>>      for (;;) {
>> -        MultiReqBuffer mrb = {
>> -            .num_writes = 0,
>> -        };
>> +        MultiReqBuffer mrb_rd = {};
>> +        MultiReqBuffer mrb_wr = {.is_write = 1};
>>          int ret;
>>
>>          /* Disable guest->host notifies to avoid unnecessary vmexits */
>> @@ -117,10 +116,11 @@ static void handle_notify(EventNotifier *e)
>>                                                    req->elem.in_num,
>>                                                    req->elem.index);
>>
>> -            virtio_blk_handle_request(req, &mrb);
>> +            virtio_blk_handle_request(req, &mrb_wr, &mrb_rd);
>>          }
>>
>> -        virtio_submit_multiwrite(s->conf->conf.blk, &mrb);
>> +        virtio_submit_multireq(s->conf->conf.blk, &mrb_wr);
>> +        virtio_submit_multireq(s->conf->conf.blk, &mrb_rd);
>>
>>          if (likely(ret == -EAGAIN)) { /* vring emptied */
>>              /* Re-enable guest->host notifies and stop processing the vring.
>> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
>> index 490f961..f522bfc 100644
>> --- a/hw/block/virtio-blk.c
>> +++ b/hw/block/virtio-blk.c
>> @@ -22,12 +22,15 @@
>>  #include "dataplane/virtio-blk.h"
>>  #include "migration/migration.h"
>>  #include "block/scsi.h"
>> +#include "block/block_int.h"

> No.
:-)

> We could either expose the information that you need through
> BlockBackend, or wait until your automatic request splitting logic is
> implemented in block.c and the information isn't needed here any more.

I will add something to block.c and follow up with the request splitting
logic later. It will still make sense to know the limit here. Otherwise
we merge something that we split again afterwards.

>
>> #ifdef __linux__
>> # include <scsi/sg.h>
>> #endif
>> #include "hw/virtio/virtio-bus.h"
>> #include "hw/virtio/virtio-access.h"
>>
>> +/* #define DEBUG_MULTIREQ */
>> +
>> VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
>> {
>>     VirtIOBlockReq *req = g_slice_new(VirtIOBlockReq);
>> @@ -88,6 +91,11 @@ static void virtio_blk_rw_complete(void *opaque, int ret)
>>
>>     trace_virtio_blk_rw_complete(req, ret);
>>
>> +#ifdef DEBUG_MULTIREQ
>> +    printf("virtio_blk_rw_complete p %p ret %d\n",
>> +           req, ret);
>> +#endif

> In the final version, please use tracepoints instead of printfs.

Of course. This is just my debug code.
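To illustrate why knowing the limit matters for the merge decision, here is a rough standalone sketch (hypothetical `Batch` struct and `batch_can_merge()` helper, not the actual QEMU code; the real check would read bs->bl.max_transfer_length):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the state gathered per merged submission. */
typedef struct {
    int      num_reqs;    /* requests collected so far */
    int64_t  sector_num;  /* first sector of the batch */
    int      nb_sectors;  /* total length of the batch in sectors */
} Batch;

/* Decide whether a new request may be merged into the current batch.
 * Merging past the backend's maximum transfer length would only be
 * split again by the block layer afterwards, so refuse it here.
 * max_transfer_length == 0 means "no limit known". */
static int batch_can_merge(const Batch *b, int64_t sector_num,
                           int nb_sectors, int max_transfer_length)
{
    if (b->num_reqs == 0) {
        return 1;   /* an empty batch accepts the first request */
    }
    if (b->sector_num + b->nb_sectors != sector_num) {
        return 0;   /* not strictly sequential */
    }
    if (max_transfer_length &&
        b->nb_sectors + nb_sectors > max_transfer_length) {
        return 0;   /* would exceed the backend limit */
    }
    return 1;
}
```

An empty batch accepts the first request unconditionally; splitting a single oversized request is the separate follow-up mentioned above.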
>
>>     if (ret) {
>>         int p = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
>>         bool is_read = !(p & VIRTIO_BLK_T_OUT);
>> @@ -257,24 +265,63 @@ static void virtio_blk_handle_scsi(VirtIOBlockReq *req)
>>     virtio_blk_free_request(req);
>> }
>>
>> -void virtio_submit_multiwrite(BlockBackend *blk, MultiReqBuffer *mrb)
>> +static void virtio_multireq_cb(void *opaque, int ret)
>> +{
>> +    MultiReqBuffer *mrb = opaque;
>> +    int i;
>> +#ifdef DEBUG_MULTIREQ
>> +    printf("virtio_multireq_cb: p %p sector_num %ld nb_sectors %d is_write %d num_reqs %d\n",
>> +           mrb, mrb->sector_num, mrb->nb_sectors, mrb->is_write, mrb->num_reqs);
>> +#endif
>> +    for (i = 0; i < mrb->num_reqs; i++) {
>> +        virtio_blk_rw_complete(mrb->reqs[i], ret);
>> +    }
>> +
>> +    qemu_iovec_destroy(&mrb->qiov);
>> +    g_free(mrb);
>> +}
>> +
>> +void virtio_submit_multireq(BlockBackend *blk, MultiReqBuffer *mrb0)
>> {
>> -    int i, ret;
>> +    MultiReqBuffer *mrb = NULL;
>>
>> -    if (!mrb->num_writes) {
>> +    if (!mrb0->num_reqs) {
>>         return;
>>     }
>>
>> -    ret = blk_aio_multiwrite(blk, mrb->blkreq, mrb->num_writes);
>> -    if (ret != 0) {
>> -        for (i = 0; i < mrb->num_writes; i++) {
>> -            if (mrb->blkreq[i].error) {
>> -                virtio_blk_rw_complete(mrb->blkreq[i].opaque, -EIO);
>> -            }
>> +    if (mrb0->num_reqs == 1) {
>> +        if (mrb0->is_write) {
>> +            blk_aio_writev(blk, mrb0->sector_num, &mrb0->reqs[0]->qiov, mrb0->nb_sectors,
>> +                           virtio_blk_rw_complete, mrb0->reqs[0]);
>> +        } else {
>> +            blk_aio_readv(blk, mrb0->sector_num, &mrb0->reqs[0]->qiov, mrb0->nb_sectors,
>> +                          virtio_blk_rw_complete, mrb0->reqs[0]);
>>         }
>> +        qemu_iovec_destroy(&mrb0->qiov);
>> +        mrb0->num_reqs = 0;
>> +        return;
>> +    }
>> +
>> +    mrb = g_malloc(sizeof(MultiReqBuffer));
>> +    memcpy(mrb, mrb0, sizeof(MultiReqBuffer));
>> +    mrb0->num_reqs = 0;

> We probably want to avoid allocations as much as we can. If at all
> possible, we don't want to memcpy() either.
>
> Most parts of MultiReqBuffer don't seem to be actually needed during the
> request and in the callback any more. Essentially, all that is needed is
> the qiov and a way to find all the requests that belong to the same
> MultiReqBuffer, so that they can all be completed.
>
> We could create a linked list by having a VirtIOBlockReq *next in the
> request struct. For the qiov we can probably just use the field in the
> first request (i.e. we extend the qiov of the first request).

That's a good idea and should work. I will look into this. But where
would you store the merged qiov? It looks like I would need a small
struct which holds the qiov and the pointer to req[0]. This can then be
used as the opaque value for the callback.

> With these requirements removed, MultiReqBuffer can live on the stack
> and doesn't need to be copied to the heap.
>
> With the additional patch that you sent me, the next few lines read:
>
> +    qemu_iovec_init(&mrb->qiov, mrb->num_reqs);
> +
> +    for (i = 0; i < mrb->num_reqs; i++) {
> +        qemu_iovec_concat(&mrb->qiov, &mrb->reqs[i]->qiov, 0, mrb->reqs[i]->qiov.size);
> +    }
> +    assert(mrb->nb_sectors == mrb->qiov.size / BDRV_SECTOR_SIZE);
>
> The long line is more than 80 characters, but probably more importantly,
> mrb->num_reqs is a really bad estimation for niov. If the guest uses any
> vectored I/O, we're sure to get reallocations. What you really want to
> use here is the old mrb->qiov.niov (which is an abuse to store in the
> qiov, see below).

Of course, my fault. I was so happy to have an exact value for the number
of iovecs that I put it in the wrong variable ;-)

> (I also wonder whether it would make sense to keep qiovs allocated when
> putting VirtIOBlockReqs into the pool, but that's not for this series.)
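Coming back to the linked-list idea, here is a rough standalone sketch of how I understand the proposal (hypothetical `Req` type, not the actual VirtIOBlockReq; the qiov and error handling are left out). The head pointer is all the completion callback needs to walk the chain:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a request; the real struct would carry the
 * qiov of the first request and the usual virtio-blk fields. */
typedef struct Req {
    struct Req *next;   /* next request in the same merged submission */
    int         status; /* completion status, filled in by the callback */
} Req;

/* Append a request to the chain rooted at *head (O(1) via *tail). */
static void multireq_add(Req **head, Req **tail, Req *req)
{
    req->next = NULL;
    if (*head == NULL) {
        *head = req;
    } else {
        (*tail)->next = req;
    }
    *tail = req;
}

/* Single AIO callback: complete every request in the chain.  The next
 * pointer is fetched before completing, so the callback may free reqs. */
static void multireq_complete(Req *head, int ret)
{
    while (head) {
        Req *next = head->next;
        head->status = ret;
        head = next;
    }
}
```

With this shape, no heap-allocated MultiReqBuffer copy is needed; the buffer on the stack only gathers requests, and the chain plus the first request's qiov survive until completion.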
> >> + >> +#ifdef DEBUG_MULTIREQ >> + printf("virtio_submit_multireq: p %p sector_num %ld nb_sectors %d is_write %d num_reqs %d\n", >> + mrb, mrb->sector_num, mrb->nb_sectors, mrb->is_write, mrb->num_reqs); >> +#endif >> + >> + if (mrb->is_write) { >> + blk_aio_writev(blk, mrb->sector_num, &mrb->qiov, mrb->nb_sectors, >> + virtio_multireq_cb, mrb); >> + } else { >> + blk_aio_readv(blk, mrb->sector_num, &mrb->qiov, mrb->nb_sectors, >> + virtio_multireq_cb, mrb); >> } >> >> - mrb->num_writes = 0; >> + block_acct_merge_done(blk_get_stats(blk), >> + mrb->is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ, >> + mrb->num_reqs - 1); >> } >> >> static void virtio_blk_handle_flush(VirtIOBlockReq *req, MultiReqBuffer *mrb) >> @@ -283,9 +330,9 @@ static void virtio_blk_handle_flush(VirtIOBlockReq *req, MultiReqBuffer *mrb) >> BLOCK_ACCT_FLUSH); >> >> /* >> - * Make sure all outstanding writes are posted to the backing device. >> + * Make sure all outstanding requests are posted to the backing device. >> */ >> - virtio_submit_multiwrite(req->dev->blk, mrb); >> + virtio_submit_multireq(req->dev->blk, mrb); >> blk_aio_flush(req->dev->blk, virtio_blk_flush_complete, req); >> } >> >> @@ -308,61 +355,8 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev, >> return true; >> } >> >> -static void virtio_blk_handle_write(VirtIOBlockReq *req, MultiReqBuffer *mrb) >> -{ >> - BlockRequest *blkreq; >> - uint64_t sector; >> - >> - sector = virtio_ldq_p(VIRTIO_DEVICE(req->dev), &req->out.sector); >> - >> - trace_virtio_blk_handle_write(req, sector, req->qiov.size / 512); >> - >> - if (!virtio_blk_sect_range_ok(req->dev, sector, req->qiov.size)) { >> - virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR); >> - virtio_blk_free_request(req); >> - return; >> - } >> - >> - block_acct_start(blk_get_stats(req->dev->blk), &req->acct, req->qiov.size, >> - BLOCK_ACCT_WRITE); >> - >> - if (mrb->num_writes == VIRTIO_BLK_MAX_MERGE_REQS) { >> - virtio_submit_multiwrite(req->dev->blk, mrb); >> - } >> - >> - 
blkreq = &mrb->blkreq[mrb->num_writes]; >> - blkreq->sector = sector; >> - blkreq->nb_sectors = req->qiov.size / BDRV_SECTOR_SIZE; >> - blkreq->qiov = &req->qiov; >> - blkreq->cb = virtio_blk_rw_complete; >> - blkreq->opaque = req; >> - blkreq->error = 0; >> - >> - mrb->num_writes++; >> -} >> - >> -static void virtio_blk_handle_read(VirtIOBlockReq *req) >> -{ >> - uint64_t sector; >> - >> - sector = virtio_ldq_p(VIRTIO_DEVICE(req->dev), &req->out.sector); >> - >> - trace_virtio_blk_handle_read(req, sector, req->qiov.size / 512); >> - >> - if (!virtio_blk_sect_range_ok(req->dev, sector, req->qiov.size)) { >> - virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR); >> - virtio_blk_free_request(req); >> - return; >> - } >> - >> - block_acct_start(blk_get_stats(req->dev->blk), &req->acct, req->qiov.size, >> - BLOCK_ACCT_READ); >> - blk_aio_readv(req->dev->blk, sector, &req->qiov, >> - req->qiov.size / BDRV_SECTOR_SIZE, >> - virtio_blk_rw_complete, req); >> -} > Handling reads and writes in a common place makes sense, but the code is > long enough that I would still keep it in a separate function, like > virtio_blk_handle_rw(). 
Sure.

>
>> -void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>> +void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb_wr,
>> +                               MultiReqBuffer *mrb_rd)
>> {
>>     uint32_t type;
>>     struct iovec *in_iov = req->elem.in_sg;
>> @@ -397,7 +391,7 @@ void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>>     type = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
>>
>>     if (type & VIRTIO_BLK_T_FLUSH) {
>> -        virtio_blk_handle_flush(req, mrb);
>> +        virtio_blk_handle_flush(req, mrb_wr);
>>     } else if (type & VIRTIO_BLK_T_SCSI_CMD) {
>>         virtio_blk_handle_scsi(req);
>>     } else if (type & VIRTIO_BLK_T_GET_ID) {
>> @@ -414,13 +408,71 @@ void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>>         iov_from_buf(in_iov, in_num, 0, serial, size);
>>         virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
>>         virtio_blk_free_request(req);
>> -    } else if (type & VIRTIO_BLK_T_OUT) {
>> -        qemu_iovec_init_external(&req->qiov, iov, out_num);
>> -        virtio_blk_handle_write(req, mrb);
>> -    } else if (type == VIRTIO_BLK_T_IN || type == VIRTIO_BLK_T_BARRIER) {
>> -        /* VIRTIO_BLK_T_IN is 0, so we can't just & it. */
>> -        qemu_iovec_init_external(&req->qiov, in_iov, in_num);
>> -        virtio_blk_handle_read(req);
>> +    } else if (type & VIRTIO_BLK_T_OUT || type == VIRTIO_BLK_T_IN || type == VIRTIO_BLK_T_BARRIER) {
>> +        bool is_write = type & VIRTIO_BLK_T_OUT;
>> +        MultiReqBuffer *mrb = is_write ? mrb_wr : mrb_rd;
>> +        int64_t sector_num = virtio_ldq_p(VIRTIO_DEVICE(req->dev), &req->out.sector);
>> +        BlockDriverState *bs = blk_bs(req->dev->blk);
>> +        int nb_sectors = 0;
>> +        int merge = 1;
>> +
>> +        if (!virtio_blk_sect_range_ok(req->dev, sector_num, req->qiov.size)) {
>> +            virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
>> +            virtio_blk_free_request(req);
>> +            return;
>> +        }
>> +
>> +        if (is_write) {
>> +            qemu_iovec_init_external(&req->qiov, iov, out_num);
>> +            trace_virtio_blk_handle_write(req, sector_num, req->qiov.size / BDRV_SECTOR_SIZE);
>> +        } else {
>> +            qemu_iovec_init_external(&req->qiov, in_iov, in_num);
>> +            trace_virtio_blk_handle_read(req, sector_num, req->qiov.size / BDRV_SECTOR_SIZE);
>> +        }
>> +
>> +        nb_sectors = req->qiov.size / BDRV_SECTOR_SIZE;
>> +#ifdef DEBUG_MULTIREQ
>> +        printf("virtio_blk_handle_request p %p sector_num %ld nb_sectors %d is_write %d\n",
>> +               req, sector_num, nb_sectors, is_write);
>> +#endif
>> +
>> +        block_acct_start(blk_get_stats(req->dev->blk), &req->acct, req->qiov.size,
>> +                         is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
>> +
>> +        /* merge would exceed maximum number of requests or IOVs */
>> +        if (mrb->num_reqs == MAX_MERGE_REQS ||
>> +            mrb->qiov.niov + req->qiov.niov + 1 > IOV_MAX) {
>> +            merge = 0;
>> +        }

> Again with your latest patch applied on top:
>
> You're abusing mrb->qiov.niov while gathering requests. mrb->qiov is
> never initialised and you don't actually add iovs anywhere, but you just
> increase qiov.niov without the associated data.
>
> Why don't you make it a simple int instead of a QEMUIOVector?

I was trying to use the same struct for collecting the requests and
building the callback opaque. I will split that.
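For the split, something like this standalone sketch is what I have in mind (hypothetical `MergeState` and `SKETCH_IOV_MAX` stand-ins, not the actual patch): the iovec count lives in a plain int next to num_reqs instead of inside a dummy QEMUIOVector:

```c
#include <assert.h>

#define SKETCH_IOV_MAX 1024   /* stand-in for the system IOV_MAX */
#define MAX_MERGE_REQS 32

/* Hypothetical gathering state: count iovec entries in a plain int
 * instead of abusing an uninitialised QEMUIOVector's niov field. */
typedef struct {
    int num_reqs;  /* requests gathered so far */
    int niov;      /* total iovec entries gathered so far */
} MergeState;

/* Refuse to merge when the combined submission would exhaust the
 * request slots or exceed the iovec limit of the backend syscall. */
static int merge_fits(const MergeState *ms, int req_niov)
{
    if (ms->num_reqs == MAX_MERGE_REQS) {
        return 0;
    }
    if (ms->niov + req_niov > SKETCH_IOV_MAX) {
        return 0;
    }
    return 1;
}

/* Account a request into the gathering state. */
static void merge_add(MergeState *ms, int req_niov)
{
    ms->num_reqs++;
    ms->niov += req_niov;
}
```

The real qiov is then only initialised and concatenated at submission time, with the exact niov known in advance, so qemu_iovec_concat() never reallocates.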
>
>> +
>> +        /* merge would exceed maximum transfer length of backend device */
>> +        if (bs->bl.max_transfer_length &&
>> +            mrb->nb_sectors + nb_sectors > bs->bl.max_transfer_length) {
>> +            merge = 0;
>> +        }
>> +
>> +        /* requests are not sequential */
>> +        if (mrb->num_reqs && mrb->sector_num + mrb->nb_sectors != sector_num) {
>> +            merge = 0;
>> +        }
>> +
>> +        if (!merge) {
>> +            virtio_submit_multireq(req->dev->blk, mrb);
>> +        }
>> +
>> +        if (mrb->num_reqs == 0) {
>> +            qemu_iovec_init(&mrb->qiov, MAX_MERGE_REQS);
>> +            mrb->sector_num = sector_num;
>> +            mrb->nb_sectors = 0;
>> +        }
>> +
>> +        qemu_iovec_concat(&mrb->qiov, &req->qiov, 0, req->qiov.size);
>> +        mrb->nb_sectors += req->qiov.size / BDRV_SECTOR_SIZE;
>> +
>> +        assert(mrb->nb_sectors == mrb->qiov.size / BDRV_SECTOR_SIZE);
>> +
>> +        mrb->reqs[mrb->num_reqs] = req;
>> +        mrb->num_reqs++;
>>     } else {
>>         virtio_blk_req_complete(req, VIRTIO_BLK_S_UNSUPP);
>>         virtio_blk_free_request(req);

> Otherwise it looks as if it could work.

Thanks for your comments,
Peter