From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4CF91581.40705@redhat.com>
Date: Fri, 03 Dec 2010 17:06:25 +0100
From: Kevin Wolf
MIME-Version: 1.0
References: <1290597076-18952-1-git-send-email-stefanha@linux.vnet.ibm.com> <1290597076-18952-5-git-send-email-stefanha@linux.vnet.ibm.com>
In-Reply-To: <1290597076-18952-5-git-send-email-stefanha@linux.vnet.ibm.com>
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
Subject: [Qemu-devel] Re: [PATCH v5 4/5] qed: Read/write support
List-Id: qemu-devel.nongnu.org
To: Stefan Hajnoczi
Cc: Anthony Liguori, Eric Blake, qemu-devel@nongnu.org, Avi Kivity

On 24.11.2010 12:11, Stefan Hajnoczi wrote:
> This patch implements the read/write state machine. Operations are
> fully asynchronous and multiple operations may be active at any time.
>
> Allocating writes lock tables to ensure metadata updates do not
> interfere with each other. If two allocating writes need to update the
> same L2 table they will run sequentially. If two allocating writes need
> to update different L2 tables they will run in parallel.
>
> Signed-off-by: Stefan Hajnoczi
> ---
>  Makefile.objs    |    1 +
>  block/qed-lock.c |  124 ++++++++++
>  block/qed.c      |  667 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  block/qed.h      |   43 ++++
>  trace-events     |   10 +
>  5 files changed, 843 insertions(+), 2 deletions(-)
>  create mode 100644 block/qed-lock.c
>
> diff --git a/block/qed.c b/block/qed.c
> index cd1bead..1513763 100644
> --- a/block/qed.c
> +++ b/block/qed.c
> @@ -12,8 +12,26 @@
>   *
>   */
>
> +#include "trace.h"
>  #include "qed.h"
>
> +static void qed_aio_cancel(BlockDriverAIOCB *blockacb)
> +{
> +    QEDAIOCB *acb = (QEDAIOCB *)blockacb;
> +    bool finished = false;
> +
> +    /* Wait for the request to finish */
> +    acb->finished = &finished;
> +    while (!finished) {
> +        qemu_aio_wait();
> +    }
> +}

Hm, you don't even try to cancel? I wonder how useful the individual
bdrv_aio_cancel implementations actually are when nobody implements
cancellation. It seems to be pretty hard to do it right.

Maybe we should consider implementing a default bdrv_aio_cancel
implementation in block.c that waits for completion?

> +/**
> + * Construct an iovec array for a given length
> + *
> + * @acb: I/O request
> + * @len: Maximum number of bytes
> + *
> + * This function can be called several times to build subset iovec
> + * arrays of acb->qiov.
> + * For example:
> + *
> + *    acb->qiov->iov[] = {{0x100000, 1024},
> + *                        {0x200000, 1024}}
> + *
> + *    qed_acb_build_qiov(acb, 512) =>
> + *                       {{0x100000, 512}}
> + *
> + *    qed_acb_build_qiov(acb, 1024) =>
> + *                       {{0x100200, 512},
> + *                        {0x200000, 512}}
> + *
> + *    qed_acb_build_qiov(acb, 512) =>
> + *                       {{0x200200, 512}}
> + */
> +static void qed_acb_build_qiov(QEDAIOCB *acb, size_t len)
> +{
> +    struct iovec *iov_end = &acb->qiov->iov[acb->qiov->niov];
> +    size_t iov_offset = acb->cur_iov_offset;
> +    struct iovec *iov = acb->cur_iov;
> +
> +    while (iov != iov_end && len > 0) {
> +        size_t nbytes = MIN(iov->iov_len - iov_offset, len);
> +
> +        qemu_iovec_add(&acb->cur_qiov, iov->iov_base + iov_offset, nbytes);
> +        iov_offset += nbytes;
> +        len -= nbytes;
> +
> +        if (iov_offset >= iov->iov_len) {
> +            iov_offset = 0;
> +            iov++;
> +        }
> +    }
> +
> +    /* Stash state for next time */
> +    acb->cur_iov = iov;
> +    acb->cur_iov_offset = iov_offset;
> +}

Wouldn't it be much simpler to just store the offset into acb->qiov and
to use qemu_iovec_copy to get the subset qiov?

> +/**
> + * Write data to the image file
> + */
> +static void qed_aio_write_main(void *opaque, int ret)
> +{
> +    QEDAIOCB *acb = opaque;
> +    BDRVQEDState *s = acb_to_s(acb);
> +    uint64_t offset = acb->cur_cluster;
> +    BlockDriverCompletionFunc *next_fn;
> +    BlockDriverAIOCB *file_acb;
> +
> +    trace_qed_aio_write_main(s, acb, ret, offset, acb->cur_qiov.size);

Why does the trace use a different offset...

> +
> +    if (ret) {
> +        qed_aio_complete(acb, ret);
> +        return;
> +    }
> +
> +    if (acb->find_cluster_ret == QED_CLUSTER_FOUND) {
> +        next_fn = qed_aio_next_io;
> +    } else {
> +        if (s->bs->backing_hd) {
> +            next_fn = qed_aio_write_flush_before_l2_update;
> +        } else {
> +            next_fn = qed_aio_write_l2_update;
> +        }
> +    }
> +
> +    offset += qed_offset_into_cluster(s, acb->cur_pos);

...than the write uses?
I missed this line at first because offset is initialized above, so I
didn't expect that this was only half of its initialization.

> +    BLKDBG_EVENT(s->bs->file, BLKDBG_WRITE_AIO);
> +    file_acb = bdrv_aio_writev(s->bs->file, offset / BDRV_SECTOR_SIZE,
> +                               &acb->cur_qiov,
> +                               acb->cur_qiov.size / BDRV_SECTOR_SIZE,
> +                               next_fn, acb);
> +    if (!file_acb) {
> +        qed_aio_complete(acb, -EIO);
> +    }
> +}
> +
> +/**
> + * Populate back untouched region of new data cluster
> + */
> +static void qed_aio_write_postfill(void *opaque, int ret)
> +{
> +    QEDAIOCB *acb = opaque;
> +    BDRVQEDState *s = acb_to_s(acb);
> +    uint64_t start = acb->cur_pos + acb->cur_qiov.size;
> +    uint64_t len = qed_start_of_cluster(s, start + s->header.cluster_size - 1) - start;
> +    uint64_t offset = acb->cur_cluster + qed_offset_into_cluster(s, acb->cur_pos) + acb->cur_qiov.size;

This looks like more than 80 characters.

Kevin