All of lore.kernel.org
 help / color / mirror / Atom feed
From: Anthony Liguori <anthony@codemonkey.ws>
To: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC] block-queue: Delay and batch metadata writes
Date: Mon, 20 Sep 2010 09:56:38 -0500	[thread overview]
Message-ID: <4C977626.4040806@codemonkey.ws> (raw)
In-Reply-To: <4C977028.3050602@codemonkey.ws>

On 09/20/2010 09:31 AM, Anthony Liguori wrote:
>> If we delay the operation and get three of these sequences queued before
>> actually executing, we end up with the following result, saving two 
>> syncs:
>>
>> 1. Update refcount table (req 1)
>> 2. Update refcount table (req 2)
>> 3. Update refcount table (req 3)
>> 4. bdrv_flush
>> 5. Update L2 entry (req 1)
>> 6. Update L2 entry (req 2)
>> 7. Update L2 entry (req 3)
>>
>> This patch only commits a sync if either the guests has requested a 
>> flush or if
>> a certain number of requests in the queue, so usually we batch more 
>> than just
>> three requests.
>>
>> I didn't run any detailed benchmarks but just tried what happens with
>> installation time of a Fedora 13 guest, and while git master takes 
>> about 40-50%
>> longer than before the metadata syncs, we get most of it back with 
>> blkqueue.
>>
>> Of course, in this state the code is not correct, but it's correct 
>> enough to
>> try and have qcow2 run on a file backend. Some remaining problems are:
>>
>> - There's no locking between the worker thread and other functions 
>> accessing
>>    the same backend driver. Should be fine for file, but probably not 
>> for other
>>    backends.
>>
>> - Error handling doesn't really exist. If something goes wrong with 
>> writing
>>    metadata we can't fail the guest request any more because it's long
>>    completed. Losing this data is actually okay, the guest hasn't 
>> flushed yet.
>>
>>    However, we need to be able to fail a flush, and we also need some 
>> way to
>>    handle errors transparently. This probably means that we have to 
>> stop the VM
>>    and let the user fix things so that we can retry. The only other 
>> way would be
>>    to shut down the VM and end up in the same situation as with a 
>> host crash.
>>
>>    Or maybe it would even be enough to start failing all new requests.
>>
>> - The Makefile integration is obviously very wrong, too. It worked 
>> for me good
>>    enough, but you need to be aware when block-queue.o is compiled with
>>    RUN_TESTS and when it isn't. The tests need to be split out properly.
>>
>> They are certainly fixable and shouldn't have any major impact on 
>> performance,
>> so that's just a matter of doing it.
>>
>> Kevin
>>
>> ---
>>   Makefile               |    3 +
>>   Makefile.objs          |    1 +
>>   block-queue.c          |  720 
>> ++++++++++++++++++++++++++++++++++++++++++++++++
>>   block-queue.h          |   49 ++++
>>   block/qcow2-cluster.c  |   28 +-
>>   block/qcow2-refcount.c |   44 ++--
>>   block/qcow2.c          |   14 +
>>   block/qcow2.h          |    4 +
>>   qemu-thread.c          |   13 +
>>   qemu-thread.h          |    1 +
>>   10 files changed, 838 insertions(+), 39 deletions(-)
>>   create mode 100644 block-queue.c
>>   create mode 100644 block-queue.h
>>
>> diff --git a/Makefile b/Makefile
>> index ab91d42..0202dc6 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -125,6 +125,9 @@ qemu-nbd$(EXESUF): qemu-nbd.o qemu-tool.o 
>> qemu-error.o $(trace-obj-y) $(block-ob
>>
>>   qemu-io$(EXESUF): qemu-io.o cmd.o qemu-tool.o qemu-error.o 
>> $(trace-obj-y) $(block-obj-y) $(qobject-obj-y)
>>
>> +block-queue$(EXESUF): QEMU_CFLAGS += -DRUN_TESTS
>> +block-queue$(EXESUF): qemu-tool.o qemu-error.o qemu-thread.o 
>> $(block-obj-y) $(qobject-obj-y)
>> +
>>   qemu-img-cmds.h: $(SRC_PATH)/qemu-img-cmds.hx
>>       $(call quiet-command,sh $(SRC_PATH)/hxtool -h<  $< >  $@,"  
>> GEN   $@")
>>
>> diff --git a/Makefile.objs b/Makefile.objs
>> index 3ef6d80..e97a246 100644
>> --- a/Makefile.objs
>> +++ b/Makefile.objs
>> @@ -9,6 +9,7 @@ qobject-obj-y += qerror.o
>>
>>   block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o 
>> module.o
>>   block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o
>> +block-obj-y += qemu-thread.o block-queue.o
>>   block-obj-$(CONFIG_POSIX) += posix-aio-compat.o
>>   block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
>>
>> diff --git a/block-queue.c b/block-queue.c
>> new file mode 100644
>> index 0000000..13579a7
>> --- /dev/null
>> +++ b/block-queue.c
>> @@ -0,0 +1,720 @@
>> +/*
>> + * QEMU System Emulator
>> + *
>> + * Copyright (c) 2010 Kevin Wolf<kwolf@redhat.com>
>> + *
>> + * Permission is hereby granted, free of charge, to any person 
>> obtaining a copy
>> + * of this software and associated documentation files (the 
>> "Software"), to deal
>> + * in the Software without restriction, including without limitation 
>> the rights
>> + * to use, copy, modify, merge, publish, distribute, sublicense, 
>> and/or sell
>> + * copies of the Software, and to permit persons to whom the 
>> Software is
>> + * furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice shall be 
>> included in
>> + * all copies or substantial portions of the Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 
>> EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 
>> MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT 
>> SHALL
>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES 
>> OR OTHER
>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, 
>> ARISING FROM,
>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 
>> DEALINGS IN
>> + * THE SOFTWARE.
>> + */
>> +
>> +#include<signal.h>
>> +
>> +#include "qemu-common.h"
>> +#include "qemu-queue.h"
>> +#include "qemu-thread.h"
>> +#include "qemu-barrier.h"
>> +#include "block.h"
>> +#include "block-queue.h"
>> +
>> +enum blkqueue_req_type {
>> +    REQ_TYPE_WRITE,
>> +    REQ_TYPE_BARRIER,
>> +};
>> +
>> +typedef struct BlockQueueRequest {
>> +    enum blkqueue_req_type type;
>> +
>> +    uint64_t    offset;
>> +    void*       buf;
>> +    uint64_t    size;
>> +    unsigned    section;
>> +
>> +    QTAILQ_ENTRY(BlockQueueRequest) link;
>> +    QSIMPLEQ_ENTRY(BlockQueueRequest) link_section;
>> +} BlockQueueRequest;
>> +
>> +struct BlockQueue {
>> +    BlockDriverState*   bs;
>> +
>> +    QemuThread          thread;
>> +    bool                thread_done;
>> +    QemuMutex           lock;
>> +    QemuMutex           flush_lock;
>> +    QemuCond            cond;
>> +
>> +    int                 barriers_requested;
>> +    int                 barriers_submitted;
>> +    int                 queue_size;
>> +
>> +    QTAILQ_HEAD(bq_queue_head, BlockQueueRequest) queue;
>> +    QSIMPLEQ_HEAD(, BlockQueueRequest) sections;
>> +};
>> +
>> +static void *blkqueue_thread(void *bq);
>> +
>> +BlockQueue *blkqueue_create(BlockDriverState *bs)
>> +{
>> +    BlockQueue *bq = qemu_mallocz(sizeof(BlockQueue));
>> +    bq->bs = bs;
>> +
>> +    QTAILQ_INIT(&bq->queue);
>> +    QSIMPLEQ_INIT(&bq->sections);
>> +
>> +    qemu_mutex_init(&bq->lock);
>> +    qemu_mutex_init(&bq->flush_lock);
>> +    qemu_cond_init(&bq->cond);
>> +
>> +    bq->thread_done = false;
>> +    qemu_thread_create(&bq->thread, blkqueue_thread, bq);
>> +
>> +    return bq;
>> +}
>> +
>> +void blkqueue_init_context(BlockQueueContext* context, BlockQueue *bq)
>> +{
>> +    context->bq = bq;
>> +    context->section = 0;
>> +}
>> +
>> +void blkqueue_destroy(BlockQueue *bq)
>> +{
>> +    bq->thread_done = true;
>> +    qemu_cond_signal(&bq->cond);
>> +    qemu_thread_join(&bq->thread);
>> +
>> +    blkqueue_flush(bq);
>> +
>> +    fprintf(stderr, "blkqueue_destroy: %d/%d barriers left\n",
>> +        bq->barriers_submitted, bq->barriers_requested);
>> +
>> +    qemu_mutex_destroy(&bq->lock);
>> +    qemu_mutex_destroy(&bq->flush_lock);
>> +    qemu_cond_destroy(&bq->cond);
>> +
>> +    assert(QTAILQ_FIRST(&bq->queue) == NULL);
>> +    assert(QSIMPLEQ_FIRST(&bq->sections) == NULL);
>> +    qemu_free(bq);
>> +}
>> +
>> +int blkqueue_pread(BlockQueueContext *context, uint64_t offset, void 
>> *buf,
>> +    uint64_t size)
>> +{
>> +    BlockQueue *bq = context->bq;
>> +    BlockQueueRequest *req;
>> +    int ret;
>> +
>> +    /*
>> +     * First check if there are any pending writes for the same 
>> data. Reverse
>> +     * order to return data written by the latest write.
>> +     */
>> +    QTAILQ_FOREACH_REVERSE(req,&bq->queue, bq_queue_head, link) {
>> +        uint64_t end = offset + size;
>> +        uint64_t req_end = req->offset + req->size;
>> +        uint8_t *read_buf = buf;
>> +        uint8_t *req_buf = req->buf;
>> +
>> +        /* We're only interested in queued writes */
>> +        if (req->type != REQ_TYPE_WRITE) {
>> +            continue;
>> +        }
>> +
>> +        /*
>> +         * If we read from a write in the queue (i.e. our read 
>> overlaps the
>> +         * write request), our next write probably depends on this 
>> write, so
>> +         * let's move forward to its section.
>> +         */
>> +        if (end>  req->offset&&  offset<  req_end) {
>> +            context->section = MAX(context->section, req->section);
>> +        }
>> +
>> +        /* How we continue, depends on the kind of overlap we have */
>> +        if ((offset>= req->offset)&&  (end<= req_end)) {
>> +            /* Completely contained in the write request */
>> +            memcpy(buf,&req_buf[offset - req->offset], size);
>> +            return 0;
>> +        } else if ((end>= req->offset)&&  (end<= req_end)) {
>> +            /* Overlap in the end of the read request */
>> +            assert(offset<  req->offset);
>> +            memcpy(&read_buf[req->offset - offset], req_buf, end - 
>> req->offset);
>> +            size = req->offset - offset;
>> +        } else if ((offset>= req->offset)&&  (offset<  req_end)) {
>> +            /* Overlap in the start of the read request */
>> +            assert(end>  req_end);
>> +            memcpy(read_buf,&req_buf[offset - req->offset], req_end 
>> - offset);
>> +            buf = read_buf =&read_buf[req_end - offset];
>> +            offset = req_end;
>> +            size = end - req_end;
>> +        } else if ((req->offset>= offset)&&  (req_end<= end)) {
>> +            /*
>> +             * The write request is completely contained in the read 
>> request.
>> +             * memcpy the data from the write request here, continue 
>> with the
>> +             * data before the write request and handle the data 
>> after the
>> +             * write request with a recursive call.
>> +             */
>> +            memcpy(&read_buf[req->offset - offset], req_buf, req_end 
>> - req->offset);
>> +            size = req->offset - offset;
>> +            blkqueue_pread(context, req_end,&read_buf[req_end - 
>> offset], end - req_end);
>> +        }
>> +    }
>> +
>> +    /* The requested is not written in the queue, read it from disk */
>> +    ret = bdrv_pread(bq->bs, offset, buf, size);
>> +    if (ret<  0) {
>> +        return ret;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +int blkqueue_pwrite(BlockQueueContext *context, uint64_t offset, 
>> void *buf,
>> +    uint64_t size)
>> +{
>> +    BlockQueue *bq = context->bq;
>> +    BlockQueueRequest *section_req;
>> +
>> +    /* Create request structure */
>> +    BlockQueueRequest *req = qemu_malloc(sizeof(*req));
>> +    req->type       = REQ_TYPE_WRITE;
>> +    req->offset     = offset;
>> +    req->size       = size;
>> +    req->buf        = qemu_malloc(size);
>> +    req->section    = context->section;
>> +    memcpy(req->buf, buf, size);
>> +
>> +    /*
>> +     * Find the right place to insert it into the queue:
>> +     * Right before the barrier that closes the current section.
>> +     */
>> +    qemu_mutex_lock(&bq->lock);
>> +    QSIMPLEQ_FOREACH(section_req,&bq->sections, link_section) {
>> +        if (section_req->section>= req->section) {
>> +            req->section = section_req->section;
>> +            context->section = section_req->section;
>> +            QTAILQ_INSERT_BEFORE(section_req, req, link);
>> +            bq->queue_size++;
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    /* If there was no barrier, just put it at the end. */
>> +    QTAILQ_INSERT_TAIL(&bq->queue, req, link);
>> +    bq->queue_size++;
>> +    qemu_cond_signal(&bq->cond);
>> +
>> +out:
>> +    qemu_mutex_unlock(&bq->lock);
>> +    return 0;
>> +}
>> +
>> +int blkqueue_barrier(BlockQueueContext *context)
>> +{
>> +    BlockQueue *bq = context->bq;
>> +    BlockQueueRequest *section_req;
>> +
>> +    bq->barriers_requested++;
>> +
>> +    /* Create request structure */
>> +    BlockQueueRequest *req = qemu_malloc(sizeof(*req));
>> +    req->type       = REQ_TYPE_BARRIER;
>> +    req->section    = context->section;
>> +    req->buf        = NULL;
>> +
>> +    /* Find another barrier to merge with. */
>> +    qemu_mutex_lock(&bq->lock);
>> +    QSIMPLEQ_FOREACH(section_req,&bq->sections, link_section) {
>> +        if (section_req->section>= req->section) {
>> +            req->section = section_req->section;
>> +            context->section = section_req->section + 1;
>> +            qemu_free(req);
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    /*
>> +     * If there wasn't a barrier for the same section yet, insert a 
>> new one at
>> +     * the end.
>> +     */
>> +    QTAILQ_INSERT_TAIL(&bq->queue, req, link);
>> +    QSIMPLEQ_INSERT_TAIL(&bq->sections, req, link_section);
>> +    bq->queue_size++;
>> +    context->section++;
>> +    qemu_cond_signal(&bq->cond);
>> +
>> +    bq->barriers_submitted++;
>> +
>> +out:
>> +    qemu_mutex_unlock(&bq->lock);
>> +    return 0;
>> +}
>> +
>> +/*
>> + * Caller needs to hold the bq->lock mutex
>> + */
>> +static BlockQueueRequest *blkqueue_pop(BlockQueue *bq)
>> +{
>> +    BlockQueueRequest *req;
>> +
>> +    req = QTAILQ_FIRST(&bq->queue);
>> +    if (req == NULL) {
>> +        goto out;
>> +    }
>> +
>> +    QTAILQ_REMOVE(&bq->queue, req, link);
>> +    bq->queue_size--;
>> +
>> +    if (req->type == REQ_TYPE_BARRIER) {
>> +        assert(QSIMPLEQ_FIRST(&bq->sections) == req);
>> +        QSIMPLEQ_REMOVE_HEAD(&bq->sections, link_section);
>> +    }
>> +
>> +out:
>> +    return req;
>> +}
>> +
>> +static void blkqueue_free_request(BlockQueueRequest *req)
>> +{
>> +    qemu_free(req->buf);
>> +    qemu_free(req);
>> +}
>> +
>> +static void blkqueue_process_request(BlockQueue *bq)
>> +{
>> +    BlockQueueRequest *req;
>> +    BlockQueueRequest *req2;
>> +    int ret;
>> +
>> +    /*
>> +     * Note that we leave the request in the queue while we process 
>> it. No
>> +     * other request will be queued before this one and we have only 
>> one thread
>> +     * that processes the queue, so afterwards it will still be the 
>> first
>> +     * request. (Not true for barriers in the first position, but we 
>> can handle
>> +     * that)
>> +     */
>> +    req = QTAILQ_FIRST(&bq->queue);
>> +    if (req == NULL) {
>> +        return;
>> +    }
>> +
>> +    switch (req->type) {
>> +        case REQ_TYPE_WRITE:
>> +            ret = bdrv_pwrite(bq->bs, req->offset, req->buf, 
>> req->size);
>> +            if (ret<  0) {
>> +                /* TODO Error reporting! */
>> +                return;
>> +            }
>> +            break;
>> +        case REQ_TYPE_BARRIER:
>> +            bdrv_flush(bq->bs);
>> +            break;
>> +    }
>> +
>> +    /*
>> +     * Only remove the request from the queue when it's written, so 
>> that reads
>> +     * always access the right data.
>> +     */
>> +    qemu_mutex_lock(&bq->lock);
>> +    req2 = QTAILQ_FIRST(&bq->queue);
>> +    if (req == req2) {
>> +        blkqueue_pop(bq);
>> +        blkqueue_free_request(req);
>> +    } else {
>> +        /*
>> +         * If it's a barrier and something has been queued before 
>> it, just
>> +         * leave it in the queue and flush once again later.
>> +         */
>> +        assert(req->type == REQ_TYPE_BARRIER);
>> +        bq->barriers_submitted++;
>> +    }
>> +    qemu_mutex_unlock(&bq->lock);
>> +}
>> +
>> +struct blkqueue_flush_aiocb {
>> +    BlockQueue *bq;
>> +    BlockDriverCompletionFunc *cb;
>> +    void *opaque;
>> +};
>> +
>> +static void *blkqueue_aio_flush_thread(void *opaque)
>> +{
>> +    struct blkqueue_flush_aiocb *acb = opaque;
>> +
>> +    /* Process any left over requests */
>> +    blkqueue_flush(acb->bq);
>> +
>> +    acb->cb(acb->opaque, 0);
>> +    qemu_free(acb);
>> +
>> +    return NULL;
>> +}
>> +
>> +void blkqueue_aio_flush(BlockQueue *bq, BlockDriverCompletionFunc *cb,
>> +    void *opaque)
>> +{
>> +    struct blkqueue_flush_aiocb *acb;
>> +
>> +    acb = qemu_malloc(sizeof(*acb));
>> +    acb->bq = bq;
>> +    acb->cb = cb;
>> +    acb->opaque = opaque;
>> +
>> +    qemu_thread_create(NULL, blkqueue_aio_flush_thread, acb);
>> +}
>> +
>> +void blkqueue_flush(BlockQueue *bq)
>> +{
>> +    qemu_mutex_lock(&bq->flush_lock);
>> +
>> +    /* Process any left over requests */
>> +    while (QTAILQ_FIRST(&bq->queue)) {
>> +        blkqueue_process_request(bq);
>> +    }
>> +
>> +    qemu_mutex_unlock(&bq->flush_lock);
>> +}
>> +
>> +static void *blkqueue_thread(void *_bq)
>> +{
>> +    BlockQueue *bq = _bq;
>> +#ifndef RUN_TESTS
>> +    BlockQueueRequest *req;
>> +#endif
>> +
>> +    qemu_mutex_lock(&bq->flush_lock);
>> +    while (!bq->thread_done) {
>> +        barrier();

A barrier shouldn't be needed here.

>> +#ifndef RUN_TESTS
>> +        req = QTAILQ_FIRST(&bq->queue);
>> +
>> +        /* Don't process barriers, we only do that on flushes */
>> +        if (req&&  (req->type != REQ_TYPE_BARRIER || 
>> bq->queue_size>  42)) {
>> +            blkqueue_process_request(bq);
>> +        } else {
>> +            qemu_cond_wait(&bq->cond,&bq->flush_lock);
>> +        }


The normal pattern for this is:

while (!condition) {
     qemu_cond_wait(&cond, &lock);
}
process_request()

It's generally best not to deviate from this pattern in terms of code 
readability.

A less invasive way of doing this (assuming we're okay with it from a 
correctness perspective) is to make use of qemu_aio_wait() as a 
replacement for qemu_mutex_lock() and shift the pread/pwrite calls to 
bdrv_aio_write/bdrv_aio_read.

IOW, blkqueue_pwrite stages a request via bdrv_aio_write().  
blkqueue_pread() either returns a cached read or it does a 
bdrv_pread().  The blkqueue_flush() call will then do qemu_aio_wait() to 
wait for all pending I/Os to complete.

Regards,

Anthony Liguori

  reply	other threads:[~2010-09-20 14:57 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-09-20 13:56 [Qemu-devel] [RFC] block-queue: Delay and batch metadata writes Kevin Wolf
2010-09-20 14:31 ` Anthony Liguori
2010-09-20 14:56   ` Anthony Liguori [this message]
2010-09-20 15:33     ` Kevin Wolf
2010-09-20 15:48       ` Anthony Liguori
2010-09-20 15:08   ` Kevin Wolf
2010-09-20 15:33     ` Avi Kivity
2010-09-20 15:38       ` Avi Kivity
2010-09-20 15:46       ` Kevin Wolf
2010-09-20 15:40     ` Anthony Liguori
2010-09-20 15:55       ` Kevin Wolf
2010-09-20 16:34         ` Anthony Liguori
2010-09-20 15:51     ` Anthony Liguori
2010-09-20 16:05       ` Avi Kivity
2010-09-21  9:13       ` Kevin Wolf

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4C977626.4040806@codemonkey.ws \
    --to=anthony@codemonkey.ws \
    --cc=kwolf@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.