From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4C977626.4040806@codemonkey.ws>
Date: Mon, 20 Sep 2010 09:56:38 -0500
From: Anthony Liguori
Subject: Re: [Qemu-devel] [RFC] block-queue: Delay and batch metadata writes
References: <1284991010-10951-1-git-send-email-kwolf@redhat.com> <4C977028.3050602@codemonkey.ws>
In-Reply-To: <4C977028.3050602@codemonkey.ws>
To: Kevin Wolf
Cc: qemu-devel@nongnu.org

On 09/20/2010 09:31 AM, Anthony Liguori wrote:
>> If we delay the operation and get three of these sequences queued before
>> actually executing, we end up with the following result, saving two
>> syncs:
>>
>> 1. Update refcount table (req 1)
>> 2. Update refcount table (req 2)
>> 3. Update refcount table (req 3)
>> 4. bdrv_flush
>> 5. Update L2 entry (req 1)
>> 6. Update L2 entry (req 2)
>> 7. Update L2 entry (req 3)
>>
>> This patch only commits a sync if either the guest has requested a flush
>> or a certain number of requests is in the queue, so usually we batch more
>> than just three requests.
>>
>> I didn't run any detailed benchmarks, but just tried what happens with the
>> installation time of a Fedora 13 guest: while git master takes about 40-50%
>> longer than it did before the metadata syncs, we get most of it back with
>> blkqueue.
>>
>> Of course, in this state the code is not correct, but it's correct enough
>> to try it and have qcow2 run on a file backend. Some remaining problems are:
>>
>> - There's no locking between the worker thread and other functions
>>   accessing the same backend driver. Should be fine for file, but probably
>>   not for other backends.
>>
>> - Error handling doesn't really exist. If something goes wrong with writing
>>   metadata, we can't fail the guest request any more because it has long
>>   since completed. Losing this data is actually okay; the guest hasn't
>>   flushed yet.
>>
>>   However, we need to be able to fail a flush, and we also need some way to
>>   handle errors transparently. This probably means that we have to stop the
>>   VM and let the user fix things so that we can retry. The only other way
>>   would be to shut down the VM and end up in the same situation as with a
>>   host crash.
>>
>>   Or maybe it would even be enough to start failing all new requests.
>>
>> - The Makefile integration is obviously very wrong, too. It worked well
>>   enough for me, but you need to be aware of when block-queue.o is compiled
>>   with RUN_TESTS and when it isn't. The tests need to be split out properly.
>>
>> They are certainly fixable and shouldn't have any major impact on
>> performance, so that's just a matter of doing it.
>> >> Kevin >> >> --- >> Makefile | 3 + >> Makefile.objs | 1 + >> block-queue.c | 720 >> ++++++++++++++++++++++++++++++++++++++++++++++++ >> block-queue.h | 49 ++++ >> block/qcow2-cluster.c | 28 +- >> block/qcow2-refcount.c | 44 ++-- >> block/qcow2.c | 14 + >> block/qcow2.h | 4 + >> qemu-thread.c | 13 + >> qemu-thread.h | 1 + >> 10 files changed, 838 insertions(+), 39 deletions(-) >> create mode 100644 block-queue.c >> create mode 100644 block-queue.h >> >> diff --git a/Makefile b/Makefile >> index ab91d42..0202dc6 100644 >> --- a/Makefile >> +++ b/Makefile >> @@ -125,6 +125,9 @@ qemu-nbd$(EXESUF): qemu-nbd.o qemu-tool.o >> qemu-error.o $(trace-obj-y) $(block-ob >> >> qemu-io$(EXESUF): qemu-io.o cmd.o qemu-tool.o qemu-error.o >> $(trace-obj-y) $(block-obj-y) $(qobject-obj-y) >> >> +block-queue$(EXESUF): QEMU_CFLAGS += -DRUN_TESTS >> +block-queue$(EXESUF): qemu-tool.o qemu-error.o qemu-thread.o >> $(block-obj-y) $(qobject-obj-y) >> + >> qemu-img-cmds.h: $(SRC_PATH)/qemu-img-cmds.hx >> $(call quiet-command,sh $(SRC_PATH)/hxtool -h< $< > $@," >> GEN $@") >> >> diff --git a/Makefile.objs b/Makefile.objs >> index 3ef6d80..e97a246 100644 >> --- a/Makefile.objs >> +++ b/Makefile.objs >> @@ -9,6 +9,7 @@ qobject-obj-y += qerror.o >> >> block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o >> module.o >> block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o >> +block-obj-y += qemu-thread.o block-queue.o >> block-obj-$(CONFIG_POSIX) += posix-aio-compat.o >> block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o >> >> diff --git a/block-queue.c b/block-queue.c >> new file mode 100644 >> index 0000000..13579a7 >> --- /dev/null >> +++ b/block-queue.c >> @@ -0,0 +1,720 @@ >> +/* >> + * QEMU System Emulator >> + * >> + * Copyright (c) 2010 Kevin Wolf >> + * >> + * Permission is hereby granted, free of charge, to any person >> obtaining a copy >> + * of this software and associated documentation files (the >> "Software"), to deal >> + * in the Software without restriction, including without limitation >> the rights >> + * to use, copy, modify, merge, publish, distribute, sublicense, >> and/or sell >> + * copies of the Software, and to permit persons to whom the >> Software is >> + * furnished to do so, subject to the following conditions: >> + * >> + * The above copyright notice and this permission notice shall be >> included in >> + * all copies or substantial portions of the Software. >> + * >> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, >> EXPRESS OR >> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF >> MERCHANTABILITY, >> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT >> SHALL >> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES >> OR OTHER >> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, >> ARISING FROM, >> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER >> DEALINGS IN >> + * THE SOFTWARE. 
>> + */
>> +
>> +#include
>> +
>> +#include "qemu-common.h"
>> +#include "qemu-queue.h"
>> +#include "qemu-thread.h"
>> +#include "qemu-barrier.h"
>> +#include "block.h"
>> +#include "block-queue.h"
>> +
>> +enum blkqueue_req_type {
>> +    REQ_TYPE_WRITE,
>> +    REQ_TYPE_BARRIER,
>> +};
>> +
>> +typedef struct BlockQueueRequest {
>> +    enum blkqueue_req_type type;
>> +
>> +    uint64_t offset;
>> +    void* buf;
>> +    uint64_t size;
>> +    unsigned section;
>> +
>> +    QTAILQ_ENTRY(BlockQueueRequest) link;
>> +    QSIMPLEQ_ENTRY(BlockQueueRequest) link_section;
>> +} BlockQueueRequest;
>> +
>> +struct BlockQueue {
>> +    BlockDriverState* bs;
>> +
>> +    QemuThread thread;
>> +    bool thread_done;
>> +    QemuMutex lock;
>> +    QemuMutex flush_lock;
>> +    QemuCond cond;
>> +
>> +    int barriers_requested;
>> +    int barriers_submitted;
>> +    int queue_size;
>> +
>> +    QTAILQ_HEAD(bq_queue_head, BlockQueueRequest) queue;
>> +    QSIMPLEQ_HEAD(, BlockQueueRequest) sections;
>> +};
>> +
>> +static void *blkqueue_thread(void *bq);
>> +
>> +BlockQueue *blkqueue_create(BlockDriverState *bs)
>> +{
>> +    BlockQueue *bq = qemu_mallocz(sizeof(BlockQueue));
>> +    bq->bs = bs;
>> +
>> +    QTAILQ_INIT(&bq->queue);
>> +    QSIMPLEQ_INIT(&bq->sections);
>> +
>> +    qemu_mutex_init(&bq->lock);
>> +    qemu_mutex_init(&bq->flush_lock);
>> +    qemu_cond_init(&bq->cond);
>> +
>> +    bq->thread_done = false;
>> +    qemu_thread_create(&bq->thread, blkqueue_thread, bq);
>> +
>> +    return bq;
>> +}
>> +
>> +void blkqueue_init_context(BlockQueueContext* context, BlockQueue *bq)
>> +{
>> +    context->bq = bq;
>> +    context->section = 0;
>> +}
>> +
>> +void blkqueue_destroy(BlockQueue *bq)
>> +{
>> +    bq->thread_done = true;
>> +    qemu_cond_signal(&bq->cond);
>> +    qemu_thread_join(&bq->thread);
>> +
>> +    blkqueue_flush(bq);
>> +
>> +    fprintf(stderr, "blkqueue_destroy: %d/%d barriers left\n",
>> +        bq->barriers_submitted, bq->barriers_requested);
>> +
>> +    qemu_mutex_destroy(&bq->lock);
>> +    qemu_mutex_destroy(&bq->flush_lock);
>> +    qemu_cond_destroy(&bq->cond);
>> +
>> +    assert(QTAILQ_FIRST(&bq->queue) == NULL);
>> +    assert(QSIMPLEQ_FIRST(&bq->sections) == NULL);
>> +    qemu_free(bq);
>> +}
>> +
>> +int blkqueue_pread(BlockQueueContext *context, uint64_t offset, void *buf,
>> +    uint64_t size)
>> +{
>> +    BlockQueue *bq = context->bq;
>> +    BlockQueueRequest *req;
>> +    int ret;
>> +
>> +    /*
>> +     * First check if there are any pending writes for the same data. Reverse
>> +     * order to return data written by the latest write.
>> +     */
>> +    QTAILQ_FOREACH_REVERSE(req, &bq->queue, bq_queue_head, link) {
>> +        uint64_t end = offset + size;
>> +        uint64_t req_end = req->offset + req->size;
>> +        uint8_t *read_buf = buf;
>> +        uint8_t *req_buf = req->buf;
>> +
>> +        /* We're only interested in queued writes */
>> +        if (req->type != REQ_TYPE_WRITE) {
>> +            continue;
>> +        }
>> +
>> +        /*
>> +         * If we read from a write in the queue (i.e. our read overlaps the
>> +         * write request), our next write probably depends on this write, so
>> +         * let's move forward to its section.
>> +         */
>> +        if (end > req->offset && offset < req_end) {
>> +            context->section = MAX(context->section, req->section);
>> +        }
>> +
>> +        /* How we continue depends on the kind of overlap we have */
>> +        if ((offset >= req->offset) && (end <= req_end)) {
>> +            /* Completely contained in the write request */
>> +            memcpy(buf, &req_buf[offset - req->offset], size);
>> +            return 0;
>> +        } else if ((end >= req->offset) && (end <= req_end)) {
>> +            /* Overlap in the end of the read request */
>> +            assert(offset < req->offset);
>> +            memcpy(&read_buf[req->offset - offset], req_buf, end - req->offset);
>> +            size = req->offset - offset;
>> +        } else if ((offset >= req->offset) && (offset < req_end)) {
>> +            /* Overlap in the start of the read request */
>> +            assert(end > req_end);
>> +            memcpy(read_buf, &req_buf[offset - req->offset], req_end - offset);
>> +            buf = read_buf = &read_buf[req_end - offset];
>> +            offset = req_end;
>> +            size = end - req_end;
>> +        } else if ((req->offset >= offset) && (req_end <= end)) {
>> +            /*
>> +             * The write request is completely contained in the read request.
>> +             * memcpy the data from the write request here, continue with the
>> +             * data before the write request and handle the data after the
>> +             * write request with a recursive call.
>> +             */
>> +            memcpy(&read_buf[req->offset - offset], req_buf, req_end - req->offset);
>> +            size = req->offset - offset;
>> +            blkqueue_pread(context, req_end, &read_buf[req_end - offset], end - req_end);
>> +        }
>> +    }
>> +
>> +    /* The requested data is not in the queue, read it from disk */
>> +    ret = bdrv_pread(bq->bs, offset, buf, size);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +int blkqueue_pwrite(BlockQueueContext *context, uint64_t offset, void *buf,
>> +    uint64_t size)
>> +{
>> +    BlockQueue *bq = context->bq;
>> +    BlockQueueRequest *section_req;
>> +
>> +    /* Create request structure */
>> +    BlockQueueRequest *req = qemu_malloc(sizeof(*req));
>> +    req->type = REQ_TYPE_WRITE;
>> +    req->offset = offset;
>> +    req->size = size;
>> +    req->buf = qemu_malloc(size);
>> +    req->section = context->section;
>> +    memcpy(req->buf, buf, size);
>> +
>> +    /*
>> +     * Find the right place to insert it into the queue:
>> +     * Right before the barrier that closes the current section.
>> +     */
>> +    qemu_mutex_lock(&bq->lock);
>> +    QSIMPLEQ_FOREACH(section_req, &bq->sections, link_section) {
>> +        if (section_req->section >= req->section) {
>> +            req->section = section_req->section;
>> +            context->section = section_req->section;
>> +            QTAILQ_INSERT_BEFORE(section_req, req, link);
>> +            bq->queue_size++;
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    /* If there was no barrier, just put it at the end. */
>> +    QTAILQ_INSERT_TAIL(&bq->queue, req, link);
>> +    bq->queue_size++;
>> +    qemu_cond_signal(&bq->cond);
>> +
>> +out:
>> +    qemu_mutex_unlock(&bq->lock);
>> +    return 0;
>> +}
>> +
>> +int blkqueue_barrier(BlockQueueContext *context)
>> +{
>> +    BlockQueue *bq = context->bq;
>> +    BlockQueueRequest *section_req;
>> +
>> +    bq->barriers_requested++;
>> +
>> +    /* Create request structure */
>> +    BlockQueueRequest *req = qemu_malloc(sizeof(*req));
>> +    req->type = REQ_TYPE_BARRIER;
>> +    req->section = context->section;
>> +    req->buf = NULL;
>> +
>> +    /* Find another barrier to merge with. */
>> +    qemu_mutex_lock(&bq->lock);
>> +    QSIMPLEQ_FOREACH(section_req, &bq->sections, link_section) {
>> +        if (section_req->section >= req->section) {
>> +            req->section = section_req->section;
>> +            context->section = section_req->section + 1;
>> +            qemu_free(req);
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    /*
>> +     * If there wasn't a barrier for the same section yet, insert a new one at
>> +     * the end.
>> +     */
>> +    QTAILQ_INSERT_TAIL(&bq->queue, req, link);
>> +    QSIMPLEQ_INSERT_TAIL(&bq->sections, req, link_section);
>> +    bq->queue_size++;
>> +    context->section++;
>> +    qemu_cond_signal(&bq->cond);
>> +
>> +    bq->barriers_submitted++;
>> +
>> +out:
>> +    qemu_mutex_unlock(&bq->lock);
>> +    return 0;
>> +}
>> +
>> +/*
>> + * Caller needs to hold the bq->lock mutex
>> + */
>> +static BlockQueueRequest *blkqueue_pop(BlockQueue *bq)
>> +{
>> +    BlockQueueRequest *req;
>> +
>> +    req = QTAILQ_FIRST(&bq->queue);
>> +    if (req == NULL) {
>> +        goto out;
>> +    }
>> +
>> +    QTAILQ_REMOVE(&bq->queue, req, link);
>> +    bq->queue_size--;
>> +
>> +    if (req->type == REQ_TYPE_BARRIER) {
>> +        assert(QSIMPLEQ_FIRST(&bq->sections) == req);
>> +        QSIMPLEQ_REMOVE_HEAD(&bq->sections, link_section);
>> +    }
>> +
>> +out:
>> +    return req;
>> +}
>> +
>> +static void blkqueue_free_request(BlockQueueRequest *req)
>> +{
>> +    qemu_free(req->buf);
>> +    qemu_free(req);
>> +}
>> +
>> +static void blkqueue_process_request(BlockQueue *bq)
>> +{
>> +    BlockQueueRequest *req;
>> +    BlockQueueRequest *req2;
>> +    int ret;
>> +
>> +    /*
>> +     * Note that we leave the request in the queue while we process it. No
>> +     * other request will be queued before this one and we have only one thread
>> +     * that processes the queue, so afterwards it will still be the first
>> +     * request. (Not true for barriers in the first position, but we can handle
>> +     * that)
>> +     */
>> +    req = QTAILQ_FIRST(&bq->queue);
>> +    if (req == NULL) {
>> +        return;
>> +    }
>> +
>> +    switch (req->type) {
>> +        case REQ_TYPE_WRITE:
>> +            ret = bdrv_pwrite(bq->bs, req->offset, req->buf, req->size);
>> +            if (ret < 0) {
>> +                /* TODO Error reporting! */
>> +                return;
>> +            }
>> +            break;
>> +        case REQ_TYPE_BARRIER:
>> +            bdrv_flush(bq->bs);
>> +            break;
>> +    }
>> +
>> +    /*
>> +     * Only remove the request from the queue when it's written, so that reads
>> +     * always access the right data.
>> +     */
>> +    qemu_mutex_lock(&bq->lock);
>> +    req2 = QTAILQ_FIRST(&bq->queue);
>> +    if (req == req2) {
>> +        blkqueue_pop(bq);
>> +        blkqueue_free_request(req);
>> +    } else {
>> +        /*
>> +         * If it's a barrier and something has been queued before it, just
>> +         * leave it in the queue and flush once again later.
>> +         */
>> +        assert(req->type == REQ_TYPE_BARRIER);
>> +        bq->barriers_submitted++;
>> +    }
>> +    qemu_mutex_unlock(&bq->lock);
>> +}
>> +
>> +struct blkqueue_flush_aiocb {
>> +    BlockQueue *bq;
>> +    BlockDriverCompletionFunc *cb;
>> +    void *opaque;
>> +};
>> +
>> +static void *blkqueue_aio_flush_thread(void *opaque)
>> +{
>> +    struct blkqueue_flush_aiocb *acb = opaque;
>> +
>> +    /* Process any left over requests */
>> +    blkqueue_flush(acb->bq);
>> +
>> +    acb->cb(acb->opaque, 0);
>> +    qemu_free(acb);
>> +
>> +    return NULL;
>> +}
>> +
>> +void blkqueue_aio_flush(BlockQueue *bq, BlockDriverCompletionFunc *cb,
>> +    void *opaque)
>> +{
>> +    struct blkqueue_flush_aiocb *acb;
>> +
>> +    acb = qemu_malloc(sizeof(*acb));
>> +    acb->bq = bq;
>> +    acb->cb = cb;
>> +    acb->opaque = opaque;
>> +
>> +    qemu_thread_create(NULL, blkqueue_aio_flush_thread, acb);
>> +}
>> +
>> +void blkqueue_flush(BlockQueue *bq)
>> +{
>> +    qemu_mutex_lock(&bq->flush_lock);
>> +
>> +    /* Process any left over requests */
>> +    while (QTAILQ_FIRST(&bq->queue)) {
>> +        blkqueue_process_request(bq);
>> +    }
>> +
>> +    qemu_mutex_unlock(&bq->flush_lock);
>> +}
>> +
>> +static void *blkqueue_thread(void *_bq)
>> +{
>> +    BlockQueue *bq = _bq;
>> +#ifndef RUN_TESTS
>> +    BlockQueueRequest *req;
>> +#endif
>> +
>> +    qemu_mutex_lock(&bq->flush_lock);
>> +    while (!bq->thread_done) {
>> +        barrier();

A barrier shouldn't be needed here.

>> +#ifndef RUN_TESTS
>> +        req = QTAILQ_FIRST(&bq->queue);
>> +
>> +        /* Don't process barriers, we only do that on flushes */
>> +        if (req && (req->type != REQ_TYPE_BARRIER || bq->queue_size > 42)) {
>> +            blkqueue_process_request(bq);
>> +        } else {
>> +            qemu_cond_wait(&bq->cond, &bq->flush_lock);
>> +        }

The normal pattern for this is:

while (!condition) {
    qemu_cond_wait(&cond, &lock);
}

process_request()

It's generally best not to deviate from this pattern in terms of code
readability.

A less invasive way of doing this (assuming we're okay with it from a
correctness perspective) is to make use of qemu_aio_wait() as a replacement
for qemu_mutex_lock() and shift the pread/pwrite calls to
bdrv_aio_write/bdrv_aio_read.

IOW, blkqueue_pwrite stages a request via bdrv_aio_write().
blkqueue_pread() either returns a cached read or it does a bdrv_pread().
The blkqueue_flush() call will then do qemu_aio_wait() to wait for all
pending I/Os to complete.

Regards,

Anthony Liguori
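
To make the suggested structure concrete, here is a minimal, untested sketch
of blkqueue_thread() reshaped around that canonical wait loop. It only
illustrates the pattern: it reuses the names from the patch above (bq->cond,
bq->flush_lock, bq->queue, blkqueue_process_request) and keeps the patch's
queue-size threshold purely as an example value.

/* Sketch only: the worker loop restructured around "wait until there is
 * work we are allowed to do, then do it". Names come from the patch. */
static void *blkqueue_thread(void *_bq)
{
    BlockQueue *bq = _bq;
    BlockQueueRequest *req;

    qemu_mutex_lock(&bq->flush_lock);
    while (!bq->thread_done) {
        /* Wait until there is a request the worker may process */
        while (!bq->thread_done) {
            req = QTAILQ_FIRST(&bq->queue);
            if (req && (req->type != REQ_TYPE_BARRIER ||
                        bq->queue_size > 42)) {
                break;
            }
            qemu_cond_wait(&bq->cond, &bq->flush_lock);
        }

        if (!bq->thread_done) {
            /* Writes are submitted here; lone barriers wait for a flush */
            blkqueue_process_request(bq);
        }
    }
    qemu_mutex_unlock(&bq->flush_lock);

    return NULL;
}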
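
For the less invasive aio-based variant, a rough sketch under the 2010-era
block layer interfaces (bdrv_aio_writev(), qemu_aio_wait()) could look like
the following. blkqueue_write_cb(), blkqueue_pwrite_aio(), blkqueue_flush_aio()
and the in_flight counter are hypothetical names used only for illustration;
buffer lifetime, error handling and non-sector-aligned metadata updates are
deliberately glossed over.

/* Sketch only: stage writes as aio requests and drain them from the main
 * loop instead of using a worker thread. */
static void blkqueue_write_cb(void *opaque, int ret)
{
    BlockQueue *bq = opaque;

    /* TODO error handling, as discussed in the cover letter */
    bq->in_flight--;       /* hypothetical counter, not in the patch */
}

static int blkqueue_pwrite_aio(BlockQueueContext *context, uint64_t offset,
                               void *buf, uint64_t size)
{
    BlockQueue *bq = context->bq;
    QEMUIOVector *qiov = qemu_malloc(sizeof(*qiov));

    /* Assumes sector-aligned offset/size for simplicity; real metadata
     * updates would need read-modify-write or byte-level helpers. */
    qemu_iovec_init(qiov, 1);
    qemu_iovec_add(qiov, buf, size);

    bq->in_flight++;
    bdrv_aio_writev(bq->bs, offset >> 9, qiov, size >> 9,
                    blkqueue_write_cb, bq);
    return 0;
}

static void blkqueue_flush_aio(BlockQueue *bq)
{
    /* Let the main loop run completions until everything has been written,
     * then flush the host cache. */
    while (bq->in_flight > 0) {
        qemu_aio_wait();
    }
    bdrv_flush(bq->bs);
}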