From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4C977626.4040806@codemonkey.ws>
Date: Mon, 20 Sep 2010 09:56:38 -0500
From: Anthony Liguori
Subject: Re: [Qemu-devel] [RFC] block-queue: Delay and batch metadata writes
References: <1284991010-10951-1-git-send-email-kwolf@redhat.com> <4C977028.3050602@codemonkey.ws>
In-Reply-To: <4C977028.3050602@codemonkey.ws>
To: Kevin Wolf
Cc: qemu-devel@nongnu.org

On 09/20/2010 09:31 AM, Anthony Liguori wrote:
>> If we delay the operation and get three of these sequences queued before
>> actually executing, we end up with the following result, saving two
>> syncs:
>>
>> 1. Update refcount table (req 1)
>> 2. Update refcount table (req 2)
>> 3. Update refcount table (req 3)
>> 4. bdrv_flush
>> 5. Update L2 entry (req 1)
>> 6. Update L2 entry (req 2)
>> 7. Update L2 entry (req 3)
>>
>> This patch only commits a sync if either the guest has requested a flush
>> or a certain number of requests is in the queue, so usually we batch more
>> than just three requests.
>>
>> I didn't run any detailed benchmarks, but just tried what happens with the
>> installation time of a Fedora 13 guest: while git master takes about 40-50%
>> longer than it did before the metadata syncs, we get most of it back with
>> blkqueue.
>>
>> Of course, in this state the code is not correct, but it's correct enough
>> to try it and have qcow2 run on a file backend. Some remaining problems are:
>>
>> - There's no locking between the worker thread and other functions
>>   accessing the same backend driver. Should be fine for file, but probably
>>   not for other backends.
>>
>> - Error handling doesn't really exist. If something goes wrong with writing
>>   metadata, we can't fail the guest request any more because it has long
>>   since completed. Losing this data is actually okay; the guest hasn't
>>   flushed yet.
>>
>>   However, we need to be able to fail a flush, and we also need some way to
>>   handle errors transparently. This probably means that we have to stop the
>>   VM and let the user fix things so that we can retry. The only other way
>>   would be to shut down the VM and end up in the same situation as with a
>>   host crash.
>>
>>   Or maybe it would even be enough to start failing all new requests.
>>
>> - The Makefile integration is obviously very wrong, too. It worked well
>>   enough for me, but you need to be aware of when block-queue.o is compiled
>>   with RUN_TESTS and when it isn't. The tests need to be split out properly.
>>
>> They are certainly fixable and shouldn't have any major impact on
>> performance, so that's just a matter of doing it.
>> >> Kevin >> >> --- >> Makefile | 3 + >> Makefile.objs | 1 + >> block-queue.c | 720 >> ++++++++++++++++++++++++++++++++++++++++++++++++ >> block-queue.h | 49 ++++ >> block/qcow2-cluster.c | 28 +- >> block/qcow2-refcount.c | 44 ++-- >> block/qcow2.c | 14 + >> block/qcow2.h | 4 + >> qemu-thread.c | 13 + >> qemu-thread.h | 1 + >> 10 files changed, 838 insertions(+), 39 deletions(-) >> create mode 100644 block-queue.c >> create mode 100644 block-queue.h >> >> diff --git a/Makefile b/Makefile >> index ab91d42..0202dc6 100644 >> --- a/Makefile >> +++ b/Makefile >> @@ -125,6 +125,9 @@ qemu-nbd$(EXESUF): qemu-nbd.o qemu-tool.o >> qemu-error.o $(trace-obj-y) $(block-ob >> >> qemu-io$(EXESUF): qemu-io.o cmd.o qemu-tool.o qemu-error.o >> $(trace-obj-y) $(block-obj-y) $(qobject-obj-y) >> >> +block-queue$(EXESUF): QEMU_CFLAGS += -DRUN_TESTS >> +block-queue$(EXESUF): qemu-tool.o qemu-error.o qemu-thread.o >> $(block-obj-y) $(qobject-obj-y) >> + >> qemu-img-cmds.h: $(SRC_PATH)/qemu-img-cmds.hx >> $(call quiet-command,sh $(SRC_PATH)/hxtool -h< $< > $@," >> GEN $@") >> >> diff --git a/Makefile.objs b/Makefile.objs >> index 3ef6d80..e97a246 100644 >> --- a/Makefile.objs >> +++ b/Makefile.objs >> @@ -9,6 +9,7 @@ qobject-obj-y += qerror.o >> >> block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o >> module.o >> block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o >> +block-obj-y += qemu-thread.o block-queue.o >> block-obj-$(CONFIG_POSIX) += posix-aio-compat.o >> block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o >> >> diff --git a/block-queue.c b/block-queue.c >> new file mode 100644 >> index 0000000..13579a7 >> --- /dev/null >> +++ b/block-queue.c >> @@ -0,0 +1,720 @@ >> +/* >> + * QEMU System Emulator >> + * >> + * Copyright (c) 2010 Kevin Wolf >> + * >> + * Permission is hereby granted, free of charge, to any person >> obtaining a copy >> + * of this software and associated documentation files (the >> "Software"), to deal >> + * in the Software without restriction, including without limitation >> the rights >> + * to use, copy, modify, merge, publish, distribute, sublicense, >> and/or sell >> + * copies of the Software, and to permit persons to whom the >> Software is >> + * furnished to do so, subject to the following conditions: >> + * >> + * The above copyright notice and this permission notice shall be >> included in >> + * all copies or substantial portions of the Software. >> + * >> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, >> EXPRESS OR >> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF >> MERCHANTABILITY, >> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT >> SHALL >> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES >> OR OTHER >> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, >> ARISING FROM, >> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER >> DEALINGS IN >> + * THE SOFTWARE. 
>> + */
>> +
>> +#include
>> +
>> +#include "qemu-common.h"
>> +#include "qemu-queue.h"
>> +#include "qemu-thread.h"
>> +#include "qemu-barrier.h"
>> +#include "block.h"
>> +#include "block-queue.h"
>> +
>> +enum blkqueue_req_type {
>> +    REQ_TYPE_WRITE,
>> +    REQ_TYPE_BARRIER,
>> +};
>> +
>> +typedef struct BlockQueueRequest {
>> +    enum blkqueue_req_type type;
>> +
>> +    uint64_t offset;
>> +    void* buf;
>> +    uint64_t size;
>> +    unsigned section;
>> +
>> +    QTAILQ_ENTRY(BlockQueueRequest) link;
>> +    QSIMPLEQ_ENTRY(BlockQueueRequest) link_section;
>> +} BlockQueueRequest;
>> +
>> +struct BlockQueue {
>> +    BlockDriverState* bs;
>> +
>> +    QemuThread thread;
>> +    bool thread_done;
>> +    QemuMutex lock;
>> +    QemuMutex flush_lock;
>> +    QemuCond cond;
>> +
>> +    int barriers_requested;
>> +    int barriers_submitted;
>> +    int queue_size;
>> +
>> +    QTAILQ_HEAD(bq_queue_head, BlockQueueRequest) queue;
>> +    QSIMPLEQ_HEAD(, BlockQueueRequest) sections;
>> +};
>> +
>> +static void *blkqueue_thread(void *bq);
>> +
>> +BlockQueue *blkqueue_create(BlockDriverState *bs)
>> +{
>> +    BlockQueue *bq = qemu_mallocz(sizeof(BlockQueue));
>> +    bq->bs = bs;
>> +
>> +    QTAILQ_INIT(&bq->queue);
>> +    QSIMPLEQ_INIT(&bq->sections);
>> +
>> +    qemu_mutex_init(&bq->lock);
>> +    qemu_mutex_init(&bq->flush_lock);
>> +    qemu_cond_init(&bq->cond);
>> +
>> +    bq->thread_done = false;
>> +    qemu_thread_create(&bq->thread, blkqueue_thread, bq);
>> +
>> +    return bq;
>> +}
>> +
>> +void blkqueue_init_context(BlockQueueContext* context, BlockQueue *bq)
>> +{
>> +    context->bq = bq;
>> +    context->section = 0;
>> +}
>> +
>> +void blkqueue_destroy(BlockQueue *bq)
>> +{
>> +    bq->thread_done = true;
>> +    qemu_cond_signal(&bq->cond);
>> +    qemu_thread_join(&bq->thread);
>> +
>> +    blkqueue_flush(bq);
>> +
>> +    fprintf(stderr, "blkqueue_destroy: %d/%d barriers left\n",
>> +        bq->barriers_submitted, bq->barriers_requested);
>> +
>> +    qemu_mutex_destroy(&bq->lock);
>> +    qemu_mutex_destroy(&bq->flush_lock);
>> +    qemu_cond_destroy(&bq->cond);
>> +
>> +    assert(QTAILQ_FIRST(&bq->queue) == NULL);
>> +    assert(QSIMPLEQ_FIRST(&bq->sections) == NULL);
>> +    qemu_free(bq);
>> +}
>> +
>> +int blkqueue_pread(BlockQueueContext *context, uint64_t offset, void *buf,
>> +    uint64_t size)
>> +{
>> +    BlockQueue *bq = context->bq;
>> +    BlockQueueRequest *req;
>> +    int ret;
>> +
>> +    /*
>> +     * First check if there are any pending writes for the same data. Reverse
>> +     * order to return data written by the latest write.
>> +     */
>> +    QTAILQ_FOREACH_REVERSE(req, &bq->queue, bq_queue_head, link) {
>> +        uint64_t end = offset + size;
>> +        uint64_t req_end = req->offset + req->size;
>> +        uint8_t *read_buf = buf;
>> +        uint8_t *req_buf = req->buf;
>> +
>> +        /* We're only interested in queued writes */
>> +        if (req->type != REQ_TYPE_WRITE) {
>> +            continue;
>> +        }
>> +
>> +        /*
>> +         * If we read from a write in the queue (i.e. our read overlaps the
>> +         * write request), our next write probably depends on this write, so
>> +         * let's move forward to its section.
>> +         */
>> +        if (end > req->offset && offset < req_end) {
>> +            context->section = MAX(context->section, req->section);
>> +        }
>> +
>> +        /* How we continue depends on the kind of overlap we have */
>> +        if ((offset >= req->offset) && (end <= req_end)) {
>> +            /* Completely contained in the write request */
>> +            memcpy(buf, &req_buf[offset - req->offset], size);
>> +            return 0;
>> +        } else if ((end >= req->offset) && (end <= req_end)) {
>> +            /* Overlap in the end of the read request */
>> +            assert(offset < req->offset);
>> +            memcpy(&read_buf[req->offset - offset], req_buf, end - req->offset);
>> +            size = req->offset - offset;
>> +        } else if ((offset >= req->offset) && (offset < req_end)) {
>> +            /* Overlap in the start of the read request */
>> +            assert(end > req_end);
>> +            memcpy(read_buf, &req_buf[offset - req->offset], req_end - offset);
>> +            buf = read_buf = &read_buf[req_end - offset];
>> +            offset = req_end;
>> +            size = end - req_end;
>> +        } else if ((req->offset >= offset) && (req_end <= end)) {
>> +            /*
>> +             * The write request is completely contained in the read request.
>> +             * memcpy the data from the write request here, continue with the
>> +             * data before the write request and handle the data after the
>> +             * write request with a recursive call.
>> +             */
>> +            memcpy(&read_buf[req->offset - offset], req_buf, req_end - req->offset);
>> +            size = req->offset - offset;
>> +            blkqueue_pread(context, req_end, &read_buf[req_end - offset], end - req_end);
>> +        }
>> +    }
>> +
>> +    /* The requested data is not in the queue, read it from disk */
>> +    ret = bdrv_pread(bq->bs, offset, buf, size);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +int blkqueue_pwrite(BlockQueueContext *context, uint64_t offset, void *buf,
>> +    uint64_t size)
>> +{
>> +    BlockQueue *bq = context->bq;
>> +    BlockQueueRequest *section_req;
>> +
>> +    /* Create request structure */
>> +    BlockQueueRequest *req = qemu_malloc(sizeof(*req));
>> +    req->type = REQ_TYPE_WRITE;
>> +    req->offset = offset;
>> +    req->size = size;
>> +    req->buf = qemu_malloc(size);
>> +    req->section = context->section;
>> +    memcpy(req->buf, buf, size);
>> +
>> +    /*
>> +     * Find the right place to insert it into the queue:
>> +     * Right before the barrier that closes the current section.
>> +     */
>> +    qemu_mutex_lock(&bq->lock);
>> +    QSIMPLEQ_FOREACH(section_req, &bq->sections, link_section) {
>> +        if (section_req->section >= req->section) {
>> +            req->section = section_req->section;
>> +            context->section = section_req->section;
>> +            QTAILQ_INSERT_BEFORE(section_req, req, link);
>> +            bq->queue_size++;
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    /* If there was no barrier, just put it at the end. */
>> +    QTAILQ_INSERT_TAIL(&bq->queue, req, link);
>> +    bq->queue_size++;
>> +    qemu_cond_signal(&bq->cond);
>> +
>> +out:
>> +    qemu_mutex_unlock(&bq->lock);
>> +    return 0;
>> +}
>> +
>> +int blkqueue_barrier(BlockQueueContext *context)
>> +{
>> +    BlockQueue *bq = context->bq;
>> +    BlockQueueRequest *section_req;
>> +
>> +    bq->barriers_requested++;
>> +
>> +    /* Create request structure */
>> +    BlockQueueRequest *req = qemu_malloc(sizeof(*req));
>> +    req->type = REQ_TYPE_BARRIER;
>> +    req->section = context->section;
>> +    req->buf = NULL;
>> +
>> +    /* Find another barrier to merge with. */
>> +    qemu_mutex_lock(&bq->lock);
>> +    QSIMPLEQ_FOREACH(section_req, &bq->sections, link_section) {
>> +        if (section_req->section >= req->section) {
>> +            req->section = section_req->section;
>> +            context->section = section_req->section + 1;
>> +            qemu_free(req);
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    /*
>> +     * If there wasn't a barrier for the same section yet, insert a new one at
>> +     * the end.
>> +     */
>> +    QTAILQ_INSERT_TAIL(&bq->queue, req, link);
>> +    QSIMPLEQ_INSERT_TAIL(&bq->sections, req, link_section);
>> +    bq->queue_size++;
>> +    context->section++;
>> +    qemu_cond_signal(&bq->cond);
>> +
>> +    bq->barriers_submitted++;
>> +
>> +out:
>> +    qemu_mutex_unlock(&bq->lock);
>> +    return 0;
>> +}
>> +
>> +/*
>> + * Caller needs to hold the bq->lock mutex
>> + */
>> +static BlockQueueRequest *blkqueue_pop(BlockQueue *bq)
>> +{
>> +    BlockQueueRequest *req;
>> +
>> +    req = QTAILQ_FIRST(&bq->queue);
>> +    if (req == NULL) {
>> +        goto out;
>> +    }
>> +
>> +    QTAILQ_REMOVE(&bq->queue, req, link);
>> +    bq->queue_size--;
>> +
>> +    if (req->type == REQ_TYPE_BARRIER) {
>> +        assert(QSIMPLEQ_FIRST(&bq->sections) == req);
>> +        QSIMPLEQ_REMOVE_HEAD(&bq->sections, link_section);
>> +    }
>> +
>> +out:
>> +    return req;
>> +}
>> +
>> +static void blkqueue_free_request(BlockQueueRequest *req)
>> +{
>> +    qemu_free(req->buf);
>> +    qemu_free(req);
>> +}
>> +
>> +static void blkqueue_process_request(BlockQueue *bq)
>> +{
>> +    BlockQueueRequest *req;
>> +    BlockQueueRequest *req2;
>> +    int ret;
>> +
>> +    /*
>> +     * Note that we leave the request in the queue while we process it. No
>> +     * other request will be queued before this one and we have only one thread
>> +     * that processes the queue, so afterwards it will still be the first
>> +     * request. (Not true for barriers in the first position, but we can handle
>> +     * that)
>> +     */
>> +    req = QTAILQ_FIRST(&bq->queue);
>> +    if (req == NULL) {
>> +        return;
>> +    }
>> +
>> +    switch (req->type) {
>> +        case REQ_TYPE_WRITE:
>> +            ret = bdrv_pwrite(bq->bs, req->offset, req->buf, req->size);
>> +            if (ret < 0) {
>> +                /* TODO Error reporting! */
>> +                return;
>> +            }
>> +            break;
>> +        case REQ_TYPE_BARRIER:
>> +            bdrv_flush(bq->bs);
>> +            break;
>> +    }
>> +
>> +    /*
>> +     * Only remove the request from the queue when it's written, so that reads
>> +     * always access the right data.
>> +     */
>> +    qemu_mutex_lock(&bq->lock);
>> +    req2 = QTAILQ_FIRST(&bq->queue);
>> +    if (req == req2) {
>> +        blkqueue_pop(bq);
>> +        blkqueue_free_request(req);
>> +    } else {
>> +        /*
>> +         * If it's a barrier and something has been queued before it, just
>> +         * leave it in the queue and flush once again later.
>> +         */
>> +        assert(req->type == REQ_TYPE_BARRIER);
>> +        bq->barriers_submitted++;
>> +    }
>> +    qemu_mutex_unlock(&bq->lock);
>> +}
>> +
>> +struct blkqueue_flush_aiocb {
>> +    BlockQueue *bq;
>> +    BlockDriverCompletionFunc *cb;
>> +    void *opaque;
>> +};
>> +
>> +static void *blkqueue_aio_flush_thread(void *opaque)
>> +{
>> +    struct blkqueue_flush_aiocb *acb = opaque;
>> +
>> +    /* Process any left over requests */
>> +    blkqueue_flush(acb->bq);
>> +
>> +    acb->cb(acb->opaque, 0);
>> +    qemu_free(acb);
>> +
>> +    return NULL;
>> +}
>> +
>> +void blkqueue_aio_flush(BlockQueue *bq, BlockDriverCompletionFunc *cb,
>> +    void *opaque)
>> +{
>> +    struct blkqueue_flush_aiocb *acb;
>> +
>> +    acb = qemu_malloc(sizeof(*acb));
>> +    acb->bq = bq;
>> +    acb->cb = cb;
>> +    acb->opaque = opaque;
>> +
>> +    qemu_thread_create(NULL, blkqueue_aio_flush_thread, acb);
>> +}
>> +
>> +void blkqueue_flush(BlockQueue *bq)
>> +{
>> +    qemu_mutex_lock(&bq->flush_lock);
>> +
>> +    /* Process any left over requests */
>> +    while (QTAILQ_FIRST(&bq->queue)) {
>> +        blkqueue_process_request(bq);
>> +    }
>> +
>> +    qemu_mutex_unlock(&bq->flush_lock);
>> +}
>> +
>> +static void *blkqueue_thread(void *_bq)
>> +{
>> +    BlockQueue *bq = _bq;
>> +#ifndef RUN_TESTS
>> +    BlockQueueRequest *req;
>> +#endif
>> +
>> +    qemu_mutex_lock(&bq->flush_lock);
>> +    while (!bq->thread_done) {
>> +        barrier();

A barrier shouldn't be needed here.

>> +#ifndef RUN_TESTS
>> +        req = QTAILQ_FIRST(&bq->queue);
>> +
>> +        /* Don't process barriers, we only do that on flushes */
>> +        if (req && (req->type != REQ_TYPE_BARRIER || bq->queue_size > 42)) {
>> +            blkqueue_process_request(bq);
>> +        } else {
>> +            qemu_cond_wait(&bq->cond, &bq->flush_lock);
>> +        }

The normal pattern for this is:

while (!condition) {
    qemu_cond_wait(&cond, &lock);
}

process_request()

It's generally best not to deviate from this pattern in terms of code
readability.

A less invasive way of doing this (assuming we're okay with it from a
correctness perspective) is to make use of qemu_aio_wait() as a replacement
for qemu_mutex_lock() and shift the pread/pwrite calls to
bdrv_aio_write/bdrv_aio_read.

IOW, blkqueue_pwrite stages a request via bdrv_aio_write().
blkqueue_pread() either returns a cached read or it does a bdrv_pread().
The blkqueue_flush() call will then do qemu_aio_wait() to wait for all
pending I/Os to complete.

Regards,

Anthony Liguori
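
To make the suggested structure concrete, here is a minimal, untested sketch
of blkqueue_thread() reshaped around that canonical wait loop. It only
illustrates the pattern: it reuses the names from the patch above (bq->cond,
bq->flush_lock, bq->queue, blkqueue_process_request) and keeps the patch's
queue-size threshold purely as an example value.

/* Sketch only: the worker loop restructured around "wait until there is
 * work we are allowed to do, then do it". Names come from the patch. */
static void *blkqueue_thread(void *_bq)
{
    BlockQueue *bq = _bq;
    BlockQueueRequest *req;

    qemu_mutex_lock(&bq->flush_lock);
    while (!bq->thread_done) {
        /* Wait until there is a request the worker may process */
        while (!bq->thread_done) {
            req = QTAILQ_FIRST(&bq->queue);
            if (req && (req->type != REQ_TYPE_BARRIER ||
                        bq->queue_size > 42)) {
                break;
            }
            qemu_cond_wait(&bq->cond, &bq->flush_lock);
        }

        if (!bq->thread_done) {
            /* Writes are submitted here; lone barriers wait for a flush */
            blkqueue_process_request(bq);
        }
    }
    qemu_mutex_unlock(&bq->flush_lock);

    return NULL;
}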
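
For the less invasive aio-based variant, a rough sketch under the 2010-era
block layer interfaces (bdrv_aio_writev(), qemu_aio_wait()) could look like
the following. blkqueue_write_cb(), blkqueue_pwrite_aio(), blkqueue_flush_aio()
and the in_flight counter are hypothetical names used only for illustration;
buffer lifetime, error handling and non-sector-aligned metadata updates are
deliberately glossed over.

/* Sketch only: stage writes as aio requests and drain them from the main
 * loop instead of using a worker thread. */
static void blkqueue_write_cb(void *opaque, int ret)
{
    BlockQueue *bq = opaque;

    /* TODO error handling, as discussed in the cover letter */
    bq->in_flight--;       /* hypothetical counter, not in the patch */
}

static int blkqueue_pwrite_aio(BlockQueueContext *context, uint64_t offset,
                               void *buf, uint64_t size)
{
    BlockQueue *bq = context->bq;
    QEMUIOVector *qiov = qemu_malloc(sizeof(*qiov));

    /* Assumes sector-aligned offset/size for simplicity; real metadata
     * updates would need read-modify-write or byte-level helpers. */
    qemu_iovec_init(qiov, 1);
    qemu_iovec_add(qiov, buf, size);

    bq->in_flight++;
    bdrv_aio_writev(bq->bs, offset >> 9, qiov, size >> 9,
                    blkqueue_write_cb, bq);
    return 0;
}

static void blkqueue_flush_aio(BlockQueue *bq)
{
    /* Let the main loop run completions until everything has been written,
     * then flush the host cache. */
    while (bq->in_flight > 0) {
        qemu_aio_wait();
    }
    bdrv_flush(bq->bs);
}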