From: Anthony Liguori <anthony@codemonkey.ws>
To: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC] block-queue: Delay and batch metadata writes
Date: Mon, 20 Sep 2010 09:56:38 -0500 [thread overview]
Message-ID: <4C977626.4040806@codemonkey.ws> (raw)
In-Reply-To: <4C977028.3050602@codemonkey.ws>
On 09/20/2010 09:31 AM, Anthony Liguori wrote:
>> If we delay the operation and get three of these sequences queued before
>> actually executing, we end up with the following result, saving two
>> syncs:
>>
>> 1. Update refcount table (req 1)
>> 2. Update refcount table (req 2)
>> 3. Update refcount table (req 3)
>> 4. bdrv_flush
>> 5. Update L2 entry (req 1)
>> 6. Update L2 entry (req 2)
>> 7. Update L2 entry (req 3)
>>
>> This patch only commits a sync if either the guests has requested a
>> flush or if
>> a certain number of requests in the queue, so usually we batch more
>> than just
>> three requests.
>>
>> I didn't run any detailed benchmarks but just tried what happens with
>> installation time of a Fedora 13 guest, and while git master takes
>> about 40-50%
>> longer than before the metadata syncs, we get most of it back with
>> blkqueue.
>>
>> Of course, in this state the code is not correct, but it's correct
>> enough to
>> try and have qcow2 run on a file backend. Some remaining problems are:
>>
>> - There's no locking between the worker thread and other functions
>> accessing
>> the same backend driver. Should be fine for file, but probably not
>> for other
>> backends.
>>
>> - Error handling doesn't really exist. If something goes wrong with
>> writing
>> metadata we can't fail the guest request any more because it's long
>> completed. Losing this data is actually okay, the guest hasn't
>> flushed yet.
>>
>> However, we need to be able to fail a flush, and we also need some
>> way to
>> handle errors transparently. This probably means that we have to
>> stop the VM
>> and let the user fix things so that we can retry. The only other
>> way would be
>> to shut down the VM and end up in the same situation as with a
>> host crash.
>>
>> Or maybe it would even be enough to start failing all new requests.
>>
>> - The Makefile integration is obviously very wrong, too. It worked
>> for me good
>> enough, but you need to be aware when block-queue.o is compiled with
>> RUN_TESTS and when it isn't. The tests need to be split out properly.
>>
>> They are certainly fixable and shouldn't have any major impact on
>> performance,
>> so that's just a matter of doing it.
>>
>> Kevin
>>
>> ---
>> Makefile | 3 +
>> Makefile.objs | 1 +
>> block-queue.c | 720
>> ++++++++++++++++++++++++++++++++++++++++++++++++
>> block-queue.h | 49 ++++
>> block/qcow2-cluster.c | 28 +-
>> block/qcow2-refcount.c | 44 ++--
>> block/qcow2.c | 14 +
>> block/qcow2.h | 4 +
>> qemu-thread.c | 13 +
>> qemu-thread.h | 1 +
>> 10 files changed, 838 insertions(+), 39 deletions(-)
>> create mode 100644 block-queue.c
>> create mode 100644 block-queue.h
>>
>> diff --git a/Makefile b/Makefile
>> index ab91d42..0202dc6 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -125,6 +125,9 @@ qemu-nbd$(EXESUF): qemu-nbd.o qemu-tool.o
>> qemu-error.o $(trace-obj-y) $(block-ob
>>
>> qemu-io$(EXESUF): qemu-io.o cmd.o qemu-tool.o qemu-error.o
>> $(trace-obj-y) $(block-obj-y) $(qobject-obj-y)
>>
>> +block-queue$(EXESUF): QEMU_CFLAGS += -DRUN_TESTS
>> +block-queue$(EXESUF): qemu-tool.o qemu-error.o qemu-thread.o
>> $(block-obj-y) $(qobject-obj-y)
>> +
>> qemu-img-cmds.h: $(SRC_PATH)/qemu-img-cmds.hx
>> $(call quiet-command,sh $(SRC_PATH)/hxtool -h< $< > $@,"
>> GEN $@")
>>
>> diff --git a/Makefile.objs b/Makefile.objs
>> index 3ef6d80..e97a246 100644
>> --- a/Makefile.objs
>> +++ b/Makefile.objs
>> @@ -9,6 +9,7 @@ qobject-obj-y += qerror.o
>>
>> block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o
>> module.o
>> block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o
>> +block-obj-y += qemu-thread.o block-queue.o
>> block-obj-$(CONFIG_POSIX) += posix-aio-compat.o
>> block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
>>
>> diff --git a/block-queue.c b/block-queue.c
>> new file mode 100644
>> index 0000000..13579a7
>> --- /dev/null
>> +++ b/block-queue.c
>> @@ -0,0 +1,720 @@
>> +/*
>> + * QEMU System Emulator
>> + *
>> + * Copyright (c) 2010 Kevin Wolf<kwolf@redhat.com>
>> + *
>> + * Permission is hereby granted, free of charge, to any person
>> obtaining a copy
>> + * of this software and associated documentation files (the
>> "Software"), to deal
>> + * in the Software without restriction, including without limitation
>> the rights
>> + * to use, copy, modify, merge, publish, distribute, sublicense,
>> and/or sell
>> + * copies of the Software, and to permit persons to whom the
>> Software is
>> + * furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice shall be
>> included in
>> + * all copies or substantial portions of the Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
>> EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
>> MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
>> SHALL
>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES
>> OR OTHER
>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
>> ARISING FROM,
>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
>> DEALINGS IN
>> + * THE SOFTWARE.
>> + */
>> +
>> +#include<signal.h>
>> +
>> +#include "qemu-common.h"
>> +#include "qemu-queue.h"
>> +#include "qemu-thread.h"
>> +#include "qemu-barrier.h"
>> +#include "block.h"
>> +#include "block-queue.h"
>> +
>> +enum blkqueue_req_type {
>> + REQ_TYPE_WRITE,
>> + REQ_TYPE_BARRIER,
>> +};
>> +
>> +typedef struct BlockQueueRequest {
>> + enum blkqueue_req_type type;
>> +
>> + uint64_t offset;
>> + void* buf;
>> + uint64_t size;
>> + unsigned section;
>> +
>> + QTAILQ_ENTRY(BlockQueueRequest) link;
>> + QSIMPLEQ_ENTRY(BlockQueueRequest) link_section;
>> +} BlockQueueRequest;
>> +
>> +struct BlockQueue {
>> + BlockDriverState* bs;
>> +
>> + QemuThread thread;
>> + bool thread_done;
>> + QemuMutex lock;
>> + QemuMutex flush_lock;
>> + QemuCond cond;
>> +
>> + int barriers_requested;
>> + int barriers_submitted;
>> + int queue_size;
>> +
>> + QTAILQ_HEAD(bq_queue_head, BlockQueueRequest) queue;
>> + QSIMPLEQ_HEAD(, BlockQueueRequest) sections;
>> +};
>> +
>> +static void *blkqueue_thread(void *bq);
>> +
>> +BlockQueue *blkqueue_create(BlockDriverState *bs)
>> +{
>> + BlockQueue *bq = qemu_mallocz(sizeof(BlockQueue));
>> + bq->bs = bs;
>> +
>> + QTAILQ_INIT(&bq->queue);
>> + QSIMPLEQ_INIT(&bq->sections);
>> +
>> + qemu_mutex_init(&bq->lock);
>> + qemu_mutex_init(&bq->flush_lock);
>> + qemu_cond_init(&bq->cond);
>> +
>> + bq->thread_done = false;
>> + qemu_thread_create(&bq->thread, blkqueue_thread, bq);
>> +
>> + return bq;
>> +}
>> +
>> +void blkqueue_init_context(BlockQueueContext* context, BlockQueue *bq)
>> +{
>> + context->bq = bq;
>> + context->section = 0;
>> +}
>> +
>> +void blkqueue_destroy(BlockQueue *bq)
>> +{
>> + bq->thread_done = true;
>> + qemu_cond_signal(&bq->cond);
>> + qemu_thread_join(&bq->thread);
>> +
>> + blkqueue_flush(bq);
>> +
>> + fprintf(stderr, "blkqueue_destroy: %d/%d barriers left\n",
>> + bq->barriers_submitted, bq->barriers_requested);
>> +
>> + qemu_mutex_destroy(&bq->lock);
>> + qemu_mutex_destroy(&bq->flush_lock);
>> + qemu_cond_destroy(&bq->cond);
>> +
>> + assert(QTAILQ_FIRST(&bq->queue) == NULL);
>> + assert(QSIMPLEQ_FIRST(&bq->sections) == NULL);
>> + qemu_free(bq);
>> +}
>> +
>> +int blkqueue_pread(BlockQueueContext *context, uint64_t offset, void
>> *buf,
>> + uint64_t size)
>> +{
>> + BlockQueue *bq = context->bq;
>> + BlockQueueRequest *req;
>> + int ret;
>> +
>> + /*
>> + * First check if there are any pending writes for the same
>> data. Reverse
>> + * order to return data written by the latest write.
>> + */
>> + QTAILQ_FOREACH_REVERSE(req,&bq->queue, bq_queue_head, link) {
>> + uint64_t end = offset + size;
>> + uint64_t req_end = req->offset + req->size;
>> + uint8_t *read_buf = buf;
>> + uint8_t *req_buf = req->buf;
>> +
>> + /* We're only interested in queued writes */
>> + if (req->type != REQ_TYPE_WRITE) {
>> + continue;
>> + }
>> +
>> + /*
>> + * If we read from a write in the queue (i.e. our read
>> overlaps the
>> + * write request), our next write probably depends on this
>> write, so
>> + * let's move forward to its section.
>> + */
>> + if (end> req->offset&& offset< req_end) {
>> + context->section = MAX(context->section, req->section);
>> + }
>> +
>> + /* How we continue, depends on the kind of overlap we have */
>> + if ((offset>= req->offset)&& (end<= req_end)) {
>> + /* Completely contained in the write request */
>> + memcpy(buf,&req_buf[offset - req->offset], size);
>> + return 0;
>> + } else if ((end>= req->offset)&& (end<= req_end)) {
>> + /* Overlap in the end of the read request */
>> + assert(offset< req->offset);
>> + memcpy(&read_buf[req->offset - offset], req_buf, end -
>> req->offset);
>> + size = req->offset - offset;
>> + } else if ((offset>= req->offset)&& (offset< req_end)) {
>> + /* Overlap in the start of the read request */
>> + assert(end> req_end);
>> + memcpy(read_buf,&req_buf[offset - req->offset], req_end
>> - offset);
>> + buf = read_buf =&read_buf[req_end - offset];
>> + offset = req_end;
>> + size = end - req_end;
>> + } else if ((req->offset>= offset)&& (req_end<= end)) {
>> + /*
>> + * The write request is completely contained in the read
>> request.
>> + * memcpy the data from the write request here, continue
>> with the
>> + * data before the write request and handle the data
>> after the
>> + * write request with a recursive call.
>> + */
>> + memcpy(&read_buf[req->offset - offset], req_buf, req_end
>> - req->offset);
>> + size = req->offset - offset;
>> + blkqueue_pread(context, req_end,&read_buf[req_end -
>> offset], end - req_end);
>> + }
>> + }
>> +
>> + /* The requested is not written in the queue, read it from disk */
>> + ret = bdrv_pread(bq->bs, offset, buf, size);
>> + if (ret< 0) {
>> + return ret;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +int blkqueue_pwrite(BlockQueueContext *context, uint64_t offset,
>> void *buf,
>> + uint64_t size)
>> +{
>> + BlockQueue *bq = context->bq;
>> + BlockQueueRequest *section_req;
>> +
>> + /* Create request structure */
>> + BlockQueueRequest *req = qemu_malloc(sizeof(*req));
>> + req->type = REQ_TYPE_WRITE;
>> + req->offset = offset;
>> + req->size = size;
>> + req->buf = qemu_malloc(size);
>> + req->section = context->section;
>> + memcpy(req->buf, buf, size);
>> +
>> + /*
>> + * Find the right place to insert it into the queue:
>> + * Right before the barrier that closes the current section.
>> + */
>> + qemu_mutex_lock(&bq->lock);
>> + QSIMPLEQ_FOREACH(section_req,&bq->sections, link_section) {
>> + if (section_req->section>= req->section) {
>> + req->section = section_req->section;
>> + context->section = section_req->section;
>> + QTAILQ_INSERT_BEFORE(section_req, req, link);
>> + bq->queue_size++;
>> + goto out;
>> + }
>> + }
>> +
>> + /* If there was no barrier, just put it at the end. */
>> + QTAILQ_INSERT_TAIL(&bq->queue, req, link);
>> + bq->queue_size++;
>> + qemu_cond_signal(&bq->cond);
>> +
>> +out:
>> + qemu_mutex_unlock(&bq->lock);
>> + return 0;
>> +}
>> +
>> +int blkqueue_barrier(BlockQueueContext *context)
>> +{
>> + BlockQueue *bq = context->bq;
>> + BlockQueueRequest *section_req;
>> +
>> + bq->barriers_requested++;
>> +
>> + /* Create request structure */
>> + BlockQueueRequest *req = qemu_malloc(sizeof(*req));
>> + req->type = REQ_TYPE_BARRIER;
>> + req->section = context->section;
>> + req->buf = NULL;
>> +
>> + /* Find another barrier to merge with. */
>> + qemu_mutex_lock(&bq->lock);
>> + QSIMPLEQ_FOREACH(section_req,&bq->sections, link_section) {
>> + if (section_req->section>= req->section) {
>> + req->section = section_req->section;
>> + context->section = section_req->section + 1;
>> + qemu_free(req);
>> + goto out;
>> + }
>> + }
>> +
>> + /*
>> + * If there wasn't a barrier for the same section yet, insert a
>> new one at
>> + * the end.
>> + */
>> + QTAILQ_INSERT_TAIL(&bq->queue, req, link);
>> + QSIMPLEQ_INSERT_TAIL(&bq->sections, req, link_section);
>> + bq->queue_size++;
>> + context->section++;
>> + qemu_cond_signal(&bq->cond);
>> +
>> + bq->barriers_submitted++;
>> +
>> +out:
>> + qemu_mutex_unlock(&bq->lock);
>> + return 0;
>> +}
>> +
>> +/*
>> + * Caller needs to hold the bq->lock mutex
>> + */
>> +static BlockQueueRequest *blkqueue_pop(BlockQueue *bq)
>> +{
>> + BlockQueueRequest *req;
>> +
>> + req = QTAILQ_FIRST(&bq->queue);
>> + if (req == NULL) {
>> + goto out;
>> + }
>> +
>> + QTAILQ_REMOVE(&bq->queue, req, link);
>> + bq->queue_size--;
>> +
>> + if (req->type == REQ_TYPE_BARRIER) {
>> + assert(QSIMPLEQ_FIRST(&bq->sections) == req);
>> + QSIMPLEQ_REMOVE_HEAD(&bq->sections, link_section);
>> + }
>> +
>> +out:
>> + return req;
>> +}
>> +
>> +static void blkqueue_free_request(BlockQueueRequest *req)
>> +{
>> + qemu_free(req->buf);
>> + qemu_free(req);
>> +}
>> +
>> +static void blkqueue_process_request(BlockQueue *bq)
>> +{
>> + BlockQueueRequest *req;
>> + BlockQueueRequest *req2;
>> + int ret;
>> +
>> + /*
>> + * Note that we leave the request in the queue while we process
>> it. No
>> + * other request will be queued before this one and we have only
>> one thread
>> + * that processes the queue, so afterwards it will still be the
>> first
>> + * request. (Not true for barriers in the first position, but we
>> can handle
>> + * that)
>> + */
>> + req = QTAILQ_FIRST(&bq->queue);
>> + if (req == NULL) {
>> + return;
>> + }
>> +
>> + switch (req->type) {
>> + case REQ_TYPE_WRITE:
>> + ret = bdrv_pwrite(bq->bs, req->offset, req->buf,
>> req->size);
>> + if (ret< 0) {
>> + /* TODO Error reporting! */
>> + return;
>> + }
>> + break;
>> + case REQ_TYPE_BARRIER:
>> + bdrv_flush(bq->bs);
>> + break;
>> + }
>> +
>> + /*
>> + * Only remove the request from the queue when it's written, so
>> that reads
>> + * always access the right data.
>> + */
>> + qemu_mutex_lock(&bq->lock);
>> + req2 = QTAILQ_FIRST(&bq->queue);
>> + if (req == req2) {
>> + blkqueue_pop(bq);
>> + blkqueue_free_request(req);
>> + } else {
>> + /*
>> + * If it's a barrier and something has been queued before
>> it, just
>> + * leave it in the queue and flush once again later.
>> + */
>> + assert(req->type == REQ_TYPE_BARRIER);
>> + bq->barriers_submitted++;
>> + }
>> + qemu_mutex_unlock(&bq->lock);
>> +}
>> +
>> +struct blkqueue_flush_aiocb {
>> + BlockQueue *bq;
>> + BlockDriverCompletionFunc *cb;
>> + void *opaque;
>> +};
>> +
>> +static void *blkqueue_aio_flush_thread(void *opaque)
>> +{
>> + struct blkqueue_flush_aiocb *acb = opaque;
>> +
>> + /* Process any left over requests */
>> + blkqueue_flush(acb->bq);
>> +
>> + acb->cb(acb->opaque, 0);
>> + qemu_free(acb);
>> +
>> + return NULL;
>> +}
>> +
>> +void blkqueue_aio_flush(BlockQueue *bq, BlockDriverCompletionFunc *cb,
>> + void *opaque)
>> +{
>> + struct blkqueue_flush_aiocb *acb;
>> +
>> + acb = qemu_malloc(sizeof(*acb));
>> + acb->bq = bq;
>> + acb->cb = cb;
>> + acb->opaque = opaque;
>> +
>> + qemu_thread_create(NULL, blkqueue_aio_flush_thread, acb);
>> +}
>> +
>> +void blkqueue_flush(BlockQueue *bq)
>> +{
>> + qemu_mutex_lock(&bq->flush_lock);
>> +
>> + /* Process any left over requests */
>> + while (QTAILQ_FIRST(&bq->queue)) {
>> + blkqueue_process_request(bq);
>> + }
>> +
>> + qemu_mutex_unlock(&bq->flush_lock);
>> +}
>> +
>> +static void *blkqueue_thread(void *_bq)
>> +{
>> + BlockQueue *bq = _bq;
>> +#ifndef RUN_TESTS
>> + BlockQueueRequest *req;
>> +#endif
>> +
>> + qemu_mutex_lock(&bq->flush_lock);
>> + while (!bq->thread_done) {
>> + barrier();
A barrier shouldn't be needed here.
>> +#ifndef RUN_TESTS
>> + req = QTAILQ_FIRST(&bq->queue);
>> +
>> + /* Don't process barriers, we only do that on flushes */
>> + if (req&& (req->type != REQ_TYPE_BARRIER ||
>> bq->queue_size> 42)) {
>> + blkqueue_process_request(bq);
>> + } else {
>> + qemu_cond_wait(&bq->cond,&bq->flush_lock);
>> + }
The normal pattern for this is:
while (!condition) {
qemu_cond_wait(&cond, &lock);
}
process_request()
It's generally best not to deviate from this pattern in terms of code
readability.
A less invasive way of doing this (assuming we're okay with it from a
correctness perspective) is to make use of qemu_aio_wait() as a
replacement for qemu_mutex_lock() and shift the pread/pwrite calls to
bdrv_aio_write/bdrv_aio_read.
IOW, blkqueue_pwrite stages a request via bdrv_aio_write().
blkqueue_pread() either returns a cached read or it does a
bdrv_pread(). The blkqueue_flush() call will then do qemu_aio_wait() to
wait for all pending I/Os to complete.
Regards,
Anthony Liguori
next prev parent reply other threads:[~2010-09-20 14:57 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-09-20 13:56 [Qemu-devel] [RFC] block-queue: Delay and batch metadata writes Kevin Wolf
2010-09-20 14:31 ` Anthony Liguori
2010-09-20 14:56 ` Anthony Liguori [this message]
2010-09-20 15:33 ` Kevin Wolf
2010-09-20 15:48 ` Anthony Liguori
2010-09-20 15:08 ` Kevin Wolf
2010-09-20 15:33 ` Avi Kivity
2010-09-20 15:38 ` Avi Kivity
2010-09-20 15:46 ` Kevin Wolf
2010-09-20 15:40 ` Anthony Liguori
2010-09-20 15:55 ` Kevin Wolf
2010-09-20 16:34 ` Anthony Liguori
2010-09-20 15:51 ` Anthony Liguori
2010-09-20 16:05 ` Avi Kivity
2010-09-21 9:13 ` Kevin Wolf
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4C977626.4040806@codemonkey.ws \
--to=anthony@codemonkey.ws \
--cc=kwolf@redhat.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).