From: Peter Lieven <pl@kamp.de>
To: Max Reitz <mreitz@redhat.com>, qemu-devel@nongnu.org
Cc: kwolf@redhat.com, famz@redhat.com, benoit@irqsave.net,
ming.lei@canonical.com, armbru@redhat.com, stefanha@redhat.com,
pbonzini@redhat.com
Subject: Re: [Qemu-devel] [PATCH v3 4/4] virtio-blk: introduce multiread
Date: Fri, 30 Jan 2015 22:05:32 +0100 [thread overview]
Message-ID: <54CBF21C.6040102@kamp.de> (raw)
In-Reply-To: <54CBBC6C.5030205@redhat.com>
Am 30.01.2015 um 18:16 schrieb Max Reitz:
> On 2015-01-30 at 09:33, Peter Lieven wrote:
>> this patch finally introduces multiread support to virtio-blk. While
>> multiwrite support was there for a long time, read support was missing.
>>
>> The complete merge logic is moved into virtio-blk.c which has
>> been the only user of request merging ever since. This is required
>> to be able to merge chunks of requests and immediately invoke callbacks
>> for those requests. Secondly, this is required to switch to
>> direct invocation of coroutines which is planned at a later stage.
>>
>> The following benchmarks show the performance of running fio with
>> 4 worker threads on a local ram disk. The numbers show the average
>> of 10 test runs after 1 run as warmup phase.
>>
>> | 4k | 64k | 4k
>> MB/s | rd seq | rd rand | rd seq | rd rand | wr seq | wr rand
>> --------------+--------+---------+--------+---------+--------+--------
>> master | 1221 | 1187 | 4178 | 4114 | 1745 | 1213
>> multiread | 1829 | 1189 | 4639 | 4110 | 1894 | 1216
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> hw/block/dataplane/virtio-blk.c | 8 +-
>> hw/block/virtio-blk.c | 288 +++++++++++++++++++++++++++-------------
>> include/hw/virtio/virtio-blk.h | 19 +--
>> trace-events | 1 +
>> 4 files changed, 210 insertions(+), 106 deletions(-)
>>
>> diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
>> index 39c5d71..93472e9 100644
>> --- a/hw/block/dataplane/virtio-blk.c
>> +++ b/hw/block/dataplane/virtio-blk.c
>> @@ -96,9 +96,7 @@ static void handle_notify(EventNotifier *e)
>> event_notifier_test_and_clear(&s->host_notifier);
>> blk_io_plug(s->conf->conf.blk);
>> for (;;) {
>> - MultiReqBuffer mrb = {
>> - .num_writes = 0,
>> - };
>> + MultiReqBuffer mrb = {};
>> int ret;
>> /* Disable guest->host notifies to avoid unnecessary vmexits */
>> @@ -120,7 +118,9 @@ static void handle_notify(EventNotifier *e)
>> virtio_blk_handle_request(req, &mrb);
>> }
>> - virtio_submit_multiwrite(s->conf->conf.blk, &mrb);
>> + if (mrb.num_reqs) {
>> + virtio_submit_multireq(s->conf->conf.blk, &mrb);
>> + }
>> if (likely(ret == -EAGAIN)) { /* vring emptied */
>> /* Re-enable guest->host notifies and stop processing the vring.
>> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
>> index e04adb8..77890a0 100644
>> --- a/hw/block/virtio-blk.c
>> +++ b/hw/block/virtio-blk.c
>> @@ -34,6 +34,7 @@ VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
>> req->dev = s;
>> req->qiov.size = 0;
>> req->next = NULL;
>> + req->mr_next = NULL;
>> return req;
>> }
>> @@ -84,20 +85,29 @@ static int virtio_blk_handle_rw_error(VirtIOBlockReq *req, int error,
>> static void virtio_blk_rw_complete(void *opaque, int ret)
>> {
>> - VirtIOBlockReq *req = opaque;
>> + VirtIOBlockReq *next = opaque;
>> - trace_virtio_blk_rw_complete(req, ret);
>> + while (next) {
>> + VirtIOBlockReq *req = next;
>> + next = req->mr_next;
>> + trace_virtio_blk_rw_complete(req, ret);
>> - if (ret) {
>> - int p = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
>> - bool is_read = !(p & VIRTIO_BLK_T_OUT);
>> - if (virtio_blk_handle_rw_error(req, -ret, is_read))
>> - return;
>> - }
>> + if (req->qiov.nalloc != -1) {
>> + qemu_iovec_destroy(&req->qiov);
>> + }
>> - virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
>> - block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
>> - virtio_blk_free_request(req);
>> + if (ret) {
>> + int p = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
>> + bool is_read = !(p & VIRTIO_BLK_T_OUT);
>> + if (virtio_blk_handle_rw_error(req, -ret, is_read)) {
>> + continue;
>> + }
>> + }
>> +
>> + virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
>> + block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
>> + virtio_blk_free_request(req);
>> + }
>> }
>> static void virtio_blk_flush_complete(void *opaque, int ret)
>> @@ -291,24 +301,125 @@ static void virtio_blk_handle_scsi(VirtIOBlockReq *req)
>> }
>> }
>> -void virtio_submit_multiwrite(BlockBackend *blk, MultiReqBuffer *mrb)
>> +static inline void
>
> Is the "inline" really worth it? This function is rather long...
I tried really hard to not add a regression for random reads and the difference
between good and bad here is sometimes just a matter of a few small changes.
If gcc inlines on its own anyway would you mind to keep the inline? The function
is actually only called from 2 places.
>
> (To my surprise, gcc actually does inline the function)
>
>> +virtio_submit_multireq2(BlockBackend *blk, MultiReqBuffer *mrb,
>
> Hm, that's not a very descriptive name... Maybe "submit_merged_requests" or something like that (it's static, so you can drop the virtio_ prefix if you want to) would be better?
Sure, you are right.
>
>> + int start, int num_reqs, int niov)
>> {
>> - int i, ret;
>> + QEMUIOVector *qiov = &mrb->reqs[start]->qiov;
>> + int64_t sector_num = mrb->reqs[start]->sector_num;
>> + int nb_sectors = mrb->reqs[start]->qiov.size / BDRV_SECTOR_SIZE;
>> + bool is_write = mrb->is_write;
>> +
>> + if (num_reqs > 1) {
>> + int i;
>> + struct iovec *_iov = qiov->iov;
>> + int _niov = qiov->niov;
>
> Identifiers with leading underscores are considered reserved, I'd rather avoid using them.
Okay. I will use tmp_iov and tmp_niov instead.
>
>> +
>> + qemu_iovec_init(qiov, niov);
>> +
>> + for (i = 0; i < _niov; i++) {
>> + qemu_iovec_add(qiov, _iov[i].iov_base, _iov[i].iov_len);
>> + }
>> - if (!mrb->num_writes) {
>> + for (i = start + 1; i < start + num_reqs; i++) {
>> + qemu_iovec_concat(qiov, &mrb->reqs[i]->qiov, 0,
>> + mrb->reqs[i]->qiov.size);
>> + mrb->reqs[i - 1]->mr_next = mrb->reqs[i];
>> + nb_sectors += mrb->reqs[i]->qiov.size / BDRV_SECTOR_SIZE;
>> + }
>> + assert(nb_sectors == qiov->size / BDRV_SECTOR_SIZE);
>> +
>> + trace_virtio_blk_submit_multireq(mrb, start, num_reqs, sector_num,
>> + nb_sectors, is_write);
>> + block_acct_merge_done(blk_get_stats(blk),
>> + is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ,
>> + num_reqs - 1);
>> + }
>> +
>> + if (is_write) {
>> + blk_aio_writev(blk, sector_num, qiov, nb_sectors,
>> + virtio_blk_rw_complete, mrb->reqs[start]);
>> + } else {
>> + blk_aio_readv(blk, sector_num, qiov, nb_sectors,
>> + virtio_blk_rw_complete, mrb->reqs[start]);
>> + }
>> +}
>> +
>> +static int virtio_multireq_compare(const void *a, const void *b)
>> +{
>> + const VirtIOBlockReq *req1 = *(VirtIOBlockReq **)a,
>> + *req2 = *(VirtIOBlockReq **)b;
>> +
>> + /*
>> + * Note that we can't simply subtract sector_num1 from sector_num2
>> + * here as that could overflow the return value.
>> + */
>> + if (req1->sector_num > req2->sector_num) {
>> + return 1;
>> + } else if (req1->sector_num < req2->sector_num) {
>> + return -1;
>> + } else {
>> + return 0;
>> + }
>> +}
>> +
>> +void virtio_submit_multireq(BlockBackend *blk, MultiReqBuffer *mrb)
>> +{
>> + int i = 0, start = 0, num_reqs = 0, niov = 0, nb_sectors = 0,
>> + max_xfer_len = 0;
>
> Albeit not necessary, starting a new "int" line won't take any more space (and would look nicer to me, probably).
okay, I don't mind.
>
>> + int64_t sector_num = 0;
>> +
>> + if (mrb->num_reqs == 1) {
>> + virtio_submit_multireq2(blk, mrb, 0, 1, -1);
>> + mrb->num_reqs = 0;
>> return;
>> }
>> - ret = blk_aio_multiwrite(blk, mrb->blkreq, mrb->num_writes);
>> - if (ret != 0) {
>> - for (i = 0; i < mrb->num_writes; i++) {
>> - if (mrb->blkreq[i].error) {
>> - virtio_blk_rw_complete(mrb->blkreq[i].opaque, -EIO);
>> + max_xfer_len = blk_get_max_transfer_length(mrb->reqs[0]->dev->blk);
>> + max_xfer_len = MIN_NON_ZERO(max_xfer_len, INT_MAX);
>> +
>> + qsort(mrb->reqs, mrb->num_reqs, sizeof(*mrb->reqs),
>> + &virtio_multireq_compare);
>> +
>> + for (i = 0; i < mrb->num_reqs; i++) {
>> + VirtIOBlockReq *req = mrb->reqs[i];
>> + if (num_reqs > 0) {
>> + bool merge = true;
>> +
>> + /* merge would exceed maximum number of IOVs */
>> + if (niov + req->qiov.niov + 1 > IOV_MAX) {
>
> Hm, why the +1?
A really good question. I copied this piece from the old merge routine. It seems
definetely wrong.
>
>> + merge = false;
>> + }
>> +
>> + /* merge would exceed maximum transfer length of backend device */
>> + if (req->qiov.size / BDRV_SECTOR_SIZE + nb_sectors > max_xfer_len) {
>> + merge = false;
>> + }
>> +
>> + /* requests are not sequential */
>> + if (sector_num + nb_sectors != req->sector_num) {
>> + merge = false;
>> + }
>> +
>> + if (!merge) {
>> + virtio_submit_multireq2(blk, mrb, start, num_reqs, niov);
>> + num_reqs = 0;
>> }
>> }
>> +
>> + if (num_reqs == 0) {
>> + sector_num = req->sector_num;
>> + nb_sectors = niov = 0;
>> + start = i;
>> + }
>> +
>> + nb_sectors += req->qiov.size / BDRV_SECTOR_SIZE;
>> + niov += req->qiov.niov;
>> + num_reqs++;
>> }
>> - mrb->num_writes = 0;
>> + virtio_submit_multireq2(blk, mrb, start, num_reqs, niov);
>> + mrb->num_reqs = 0;
>> }
>> static void virtio_blk_handle_flush(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>> @@ -319,7 +430,9 @@ static void virtio_blk_handle_flush(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>> /*
>> * Make sure all outstanding writes are posted to the backing device.
>> */
>> - virtio_submit_multiwrite(req->dev->blk, mrb);
>> + if (mrb->is_write && mrb->num_reqs > 0) {
>> + virtio_submit_multireq(req->dev->blk, mrb);
>> + }
>> blk_aio_flush(req->dev->blk, virtio_blk_flush_complete, req);
>> }
>> @@ -329,6 +442,9 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev,
>> uint64_t nb_sectors = size >> BDRV_SECTOR_BITS;
>> uint64_t total_sectors;
>> + if (nb_sectors > INT_MAX) {
>> + return false;
>> + }
>> if (sector & dev->sector_mask) {
>> return false;
>> }
>> @@ -342,60 +458,6 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev,
>> return true;
>> }
>> -static void virtio_blk_handle_write(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>> -{
>> - BlockRequest *blkreq;
>> - uint64_t sector;
>> -
>> - sector = virtio_ldq_p(VIRTIO_DEVICE(req->dev), &req->out.sector);
>> -
>> - trace_virtio_blk_handle_write(req, sector, req->qiov.size / 512);
>> -
>> - if (!virtio_blk_sect_range_ok(req->dev, sector, req->qiov.size)) {
>> - virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
>> - virtio_blk_free_request(req);
>> - return;
>> - }
>> -
>> - block_acct_start(blk_get_stats(req->dev->blk), &req->acct, req->qiov.size,
>> - BLOCK_ACCT_WRITE);
>> -
>> - if (mrb->num_writes == VIRTIO_BLK_MAX_MERGE_REQS) {
>> - virtio_submit_multiwrite(req->dev->blk, mrb);
>> - }
>> -
>> - blkreq = &mrb->blkreq[mrb->num_writes];
>> - blkreq->sector = sector;
>> - blkreq->nb_sectors = req->qiov.size / BDRV_SECTOR_SIZE;
>> - blkreq->qiov = &req->qiov;
>> - blkreq->cb = virtio_blk_rw_complete;
>> - blkreq->opaque = req;
>> - blkreq->error = 0;
>> -
>> - mrb->num_writes++;
>> -}
>> -
>> -static void virtio_blk_handle_read(VirtIOBlockReq *req)
>> -{
>> - uint64_t sector;
>> -
>> - sector = virtio_ldq_p(VIRTIO_DEVICE(req->dev), &req->out.sector);
>> -
>> - trace_virtio_blk_handle_read(req, sector, req->qiov.size / 512);
>> -
>> - if (!virtio_blk_sect_range_ok(req->dev, sector, req->qiov.size)) {
>> - virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
>> - virtio_blk_free_request(req);
>> - return;
>> - }
>> -
>> - block_acct_start(blk_get_stats(req->dev->blk), &req->acct, req->qiov.size,
>> - BLOCK_ACCT_READ);
>> - blk_aio_readv(req->dev->blk, sector, &req->qiov,
>> - req->qiov.size / BDRV_SECTOR_SIZE,
>> - virtio_blk_rw_complete, req);
>> -}
>> -
>> void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>> {
>> uint32_t type;
>> @@ -430,11 +492,54 @@ void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>> type = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
>> - if (type & VIRTIO_BLK_T_FLUSH) {
>> + switch (type & 0xff) {
>
> Somehow I'd prefer type & ~VIRTIO_BLK_T_BARRIER, but this is correct, too.
The problem is that type encodes operations and flags and there is no official mask to distinguish them in
the specs defined. If anytime a new flag is defined it will break the switch statement. I could add a macro
for the 0xff in the headers if you like?
>
>> + case VIRTIO_BLK_T_IN:
>> + case VIRTIO_BLK_T_OUT:
>> + {
>> + bool is_write = type & VIRTIO_BLK_T_OUT;
>> + req->sector_num = virtio_ldq_p(VIRTIO_DEVICE(req->dev),
>> + &req->out.sector);
>> +
>> + if (!virtio_blk_sect_range_ok(req->dev, req->sector_num,
>> + req->qiov.size)) {
>> + virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
>> + virtio_blk_free_request(req);
>> + return;
>> + }
>> +
>> + if (is_write) {
>> + qemu_iovec_init_external(&req->qiov, iov, out_num);
>> + trace_virtio_blk_handle_write(req, req->sector_num,
>> + req->qiov.size / BDRV_SECTOR_SIZE);
>> + } else {
>> + qemu_iovec_init_external(&req->qiov, in_iov, in_num);
>> + trace_virtio_blk_handle_read(req, req->sector_num,
>> + req->qiov.size / BDRV_SECTOR_SIZE);
>> + }
>
> Before this patch, req->qiov (and subsequently req->qiov.size) was initialized before virtio_blk_sect_range_ok() was called (with req->qiov.size as an argument). Is it fine to change that?
Oops, thats definetly a mistake. Half of the checks in virtio_blk_sect_range_ok won't work with the size set.
>
>> +
>> + block_acct_start(blk_get_stats(req->dev->blk),
>> + &req->acct, req->qiov.size,
>> + is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
>> +
>> + /* merge would exceed maximum number of requests or IO direction
>> + * changes */
>> + if (mrb->num_reqs > 0 && (mrb->num_reqs == VIRTIO_BLK_MAX_MERGE_REQS ||
>> + is_write != mrb->is_write)) {
>> + virtio_submit_multireq(req->dev->blk, mrb);
>> + }
>> +
>> + mrb->reqs[mrb->num_reqs++] = req;
>
> Somehow I'd like an assert(mrb->num_reqs < VIRTIO_BLK_MAX_MERGE_REQS) before this. Feel free to ignore my request.
This can't happen and the check for this is right above. Of what scenario do you think of that could overflow num_reqs?
>
>> + mrb->is_write = is_write;
>> + break;
>> + }
>> + case VIRTIO_BLK_T_FLUSH:
>
> I think VIRTIO_BLK_T_FLUSH_OUT is missing (that's the reason it was "type & VIRTIO_BLK_T_..." before instead of "type == VIRTIO_BLK_T_...").
VIRTIO_BLK_T_FLUSH_OUT is deprecated and not defined in the headers of qemu anymore. I can add it, but then I would also add VIRTIO_BLK_T_SCSI_CMD_OUT.
>
>> virtio_blk_handle_flush(req, mrb);
>> - } else if (type & VIRTIO_BLK_T_SCSI_CMD) {
>> + break;
>> + case VIRTIO_BLK_T_SCSI_CMD:
>
> And VIRTIO_BLK_T_SCSI_CMD_OUT here.
See above. I will define them in the headers.
>
>
> The overall logic of this patch seems good to me (although I had some trouble following which QIOVs had which purpose, which were allocated and which were not, and so on...).
If you tell me at which point it was most difficult I can try to add comments.
Thanks for reviewing,
Peter
next prev parent reply other threads:[~2015-01-30 21:05 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-01-30 14:32 [Qemu-devel] [PATCH v3 0/4] virtio-blk: add multiread support Peter Lieven
2015-01-30 14:32 ` [Qemu-devel] [PATCH v3 1/4] block: add accounting for merged requests Peter Lieven
2015-01-30 15:53 ` Max Reitz
2015-01-30 14:33 ` [Qemu-devel] [PATCH v3 2/4] hw/virtio-blk: add a constant for max number of " Peter Lieven
2015-01-30 16:01 ` Max Reitz
2015-01-30 14:33 ` [Qemu-devel] [PATCH v3 3/4] block-backend: expose bs->bl.max_transfer_length Peter Lieven
2015-01-30 16:04 ` Max Reitz
2015-01-30 14:33 ` [Qemu-devel] [PATCH v3 4/4] virtio-blk: introduce multiread Peter Lieven
2015-01-30 17:16 ` Max Reitz
2015-01-30 21:05 ` Peter Lieven [this message]
2015-01-30 21:15 ` Kevin Wolf
2015-01-31 19:46 ` Peter Lieven
2015-01-30 21:30 ` Max Reitz
2015-01-31 19:45 ` Peter Lieven
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=54CBF21C.6040102@kamp.de \
--to=pl@kamp.de \
--cc=armbru@redhat.com \
--cc=benoit@irqsave.net \
--cc=famz@redhat.com \
--cc=kwolf@redhat.com \
--cc=ming.lei@canonical.com \
--cc=mreitz@redhat.com \
--cc=pbonzini@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=stefanha@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.