From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <1291121332-10588-3-git-send-email-kwolf@redhat.com>
References: <1291121332-10588-1-git-send-email-kwolf@redhat.com>
 <1291121332-10588-3-git-send-email-kwolf@redhat.com>
Date: Fri, 3 Dec 2010 09:44:28 +0000
Message-ID: 
From: Stefan Hajnoczi
Content-Type: text/plain; charset=ISO-8859-1
Subject: [Qemu-devel] Re: [RFC PATCH v3 2/4] Add block-queue
List-Id: qemu-devel.nongnu.org
To: Kevin Wolf
Cc: qemu-devel@nongnu.org

On Tue, Nov 30, 2010 at 12:48 PM, Kevin Wolf wrote:
> +/*
> + * Adds a write request to the queue.
> + */
> +int blkqueue_pwrite(BlockQueueContext *context, uint64_t offset, void *buf,
> +    uint64_t size)
> +{
> +    BlockQueue *bq = context->bq;
> +    BlockQueueRequest *section_req;
> +    bool completed;
> +
> +    /* Don't use the queue for writethrough images */
> +    if ((bq->bs->open_flags & WRITEBACK_MODES) == 0) {
> +        return bdrv_pwrite(bq->bs, offset, buf, size);
> +    }
> +
> +    /* First check if there are any pending writes for the same data. */
> +    DPRINTF("--        pwrite: [%#lx + %ld]\n", offset, size);
> +    completed = blkqueue_check_queue_overlap(context, &bq->queue, &offset,
> +        &buf, &size, &blkqueue_pwrite, &pwrite_handle_overlap,
> +        context->section);
> +
> +    if (completed) {
> +        return 0;
> +    }
> +
> +    /* Create request structure */
> +    BlockQueueRequest *req = qemu_malloc(sizeof(*req));
> +    QLIST_INIT(&req->acbs);
> +    req->type       = REQ_TYPE_WRITE;
> +    req->bq         = bq;
> +    req->offset     = offset;
> +    req->size       = size;
> +    req->buf        = qemu_malloc(size);

qemu_blockalign() -- data buffers handed to the block layer should be
allocated with qemu_blockalign() so they meet the alignment
requirements of the underlying image (needed e.g. for O_DIRECT).

> +static int insert_barrier(BlockQueueContext *context, BlockQueueAIOCB *acb)
> +{
> +    BlockQueue *bq = context->bq;
> +    BlockQueueRequest *section_req;
> +
> +    bq->barriers_requested++;
> +
> +    /* Create request structure */
> +    BlockQueueRequest *req = qemu_malloc(sizeof(*req));
> +    QLIST_INIT(&req->acbs);
> +    req->type       = REQ_TYPE_BARRIER;
> +    req->bq         = bq;
> +    req->section    = context->section;
> +    req->buf        = NULL;
> +
> +    /* Find another barrier to merge with. */
> +    QSIMPLEQ_FOREACH(section_req, &bq->sections, link_section) {
> +        if (section_req->section >= req->section) {
> +
> +            /*
> +             * If acb is set, the intention of the barrier request is to flush
> +             * the complete queue and notify the caller when all requests have
> +             * been processed. To achieve this, we may only merge with the very
> +             * last request in the queue.
> +             */
> +            if (acb && QTAILQ_NEXT(section_req, link)) {
> +                continue;
> +            }
> +
> +            req->section = section_req->section;
> +            context->section = section_req->section + 1;
> +            qemu_free(req);
> +            req = section_req;
> +            goto out;
> +        }
> +    }

I think the search for an existing barrier should be moved above the
req allocation.  It's more work to allocate req, do an unnecessary
req->section = section_req->section, and then free it again.

> +
> +    /*
> +     * If there wasn't a barrier for the same section yet, insert a new one at
> +     * the end.
> +     */
> +    DPRINTF("queue-ins flush: %p\n", req);
> +    QTAILQ_INSERT_TAIL(&bq->queue, req, link);
> +    QSIMPLEQ_INSERT_TAIL(&bq->sections, req, link_section);
> +    bq->queue_size++;
> +    context->section++;
> +
> +    bq->barriers_submitted++;
> +
> +    /*
> +     * At this point, req is either the newly inserted request, or a previously
> +     * existing barrier with which the current request has been merged.
> +     *
> +     * Insert the ACB in the list of that request so that the callback is
> +     * called when the request has completed.
> +     */
> +out:
> +    if (acb) {
> +        QLIST_INSERT_HEAD(&req->acbs, acb, link);

Is there a reason to insert at the head and not append to the tail?
Preserving order is usually good.

> +/*
> + * If there are any blkqueue_aio_flush callbacks pending, call them with ret
> + * as the error code and remove them from the queue.
> + *
> + * If keep_queue is false, all requests are removed from the queue
> + */
> +static void blkqueue_fail_flush(BlockQueue *bq, int ret, bool keep_queue)
> +{
> +    BlockQueueRequest *req, *next_req;
> +    BlockQueueAIOCB *acb, *next_acb;
> +
> +    QTAILQ_FOREACH_SAFE(req, &bq->queue, link, next_req) {
> +
> +        /* Call and remove registered callbacks */
> +        QLIST_FOREACH_SAFE(acb, &req->acbs, link, next_acb) {
> +            acb->common.cb(acb->common.opaque, ret);
> +            qemu_free(acb);

qemu_aio_release() -- ACBs allocated with qemu_aio_get() must be
released with qemu_aio_release(), not qemu_free().

> +        }
> +        QLIST_INIT(&req->acbs);
> +
> +        /* If requested, remove the request itself */
> +        if (!keep_queue) {
> +            QTAILQ_REMOVE(&bq->queue, req, link);

bq->queue_size--;

> +            if (req->type == REQ_TYPE_BARRIER) {
> +                QSIMPLEQ_REMOVE(&bq->sections, req, BlockQueueRequest,
> +                    link_section);
> +            }

Now free the request?

> +        }
> +    }
> +
> +    /* Make sure that blkqueue_flush stops running */
> +    bq->flushing = ret;
> +}
> +
> +static void blkqueue_process_request_cb(void *opaque, int ret)
> +{
> +    BlockQueueRequest *req = opaque;
> +    BlockQueue *bq = req->bq;
> +    BlockQueueAIOCB *acb, *next;
> +
> +    DPRINTF("  done    req:    %p [%#lx + %ld]\n", req, req->offset, req->size);
> +
> +    /* Remove from in-flight list */
> +    QTAILQ_REMOVE(&bq->in_flight, req, link);
> +    bq->in_flight_num--;
> +
> +    /*
> +     * Error handling gets a bit complicated, because we have already completed
> +     * the requests that went wrong. There are two ways of dealing with this:
> +     *
> +     * 1. With werror=stop we can put the request back into the queue and stop
> +     *    the VM. When the user continues the VM, the request is retried.
> +     *
> +     * 2. In other cases we need to return an error on the next bdrv_flush. The
> +     *    caller must cope with the fact that he doesn't know which of the
> +     *    requests succeeded (i.e. invalidate all caches)
> +     *
> +     * If we're in an blkqueue_aio_flush, we must return an error in both
> +     * cases. If we stop the VM, we can clear bq->errno immediately again.
> +     * Otherwise, it's cleared in bdrv_(aio_)flush.
> +     */
> +    if (ret < 0) {
> +        if (bq->error_ret != -ENOSPC) {
> +            bq->error_ret = ret;
> +        }
> +    }
> +
> +    /* Call any callbacks attached to the request (see blkqueue_aio_flush) */
> +    QLIST_FOREACH_SAFE(acb, &req->acbs, link, next) {
> +        acb->common.cb(acb->common.opaque, bq->error_ret);
> +        qemu_free(acb);

qemu_aio_release()

> +    }
> +    QLIST_INIT(&req->acbs);
> +
> +    /* Handle errors in the VM stop case */
> +    if (ret < 0) {
> +        bool keep_queue = bq->error_handler(bq->error_opaque, ret);
> +
> +        /* Fail any flushes that may wait for the queue to become empty */
> +        blkqueue_fail_flush(bq, bq->error_ret, keep_queue);
> +
> +        if (keep_queue) {
> +            /* Reinsert request into the queue */
> +            QTAILQ_INSERT_HEAD(&bq->queue, req, link);
> +            if (req->type == REQ_TYPE_BARRIER) {
> +                QSIMPLEQ_INSERT_HEAD(&bq->sections, req, link_section);
> +            }
> +
> +            /* Clear the error to restore a normal state after 'cont' */
> +            bq->error_ret = 0;
> +
> +            return;
> +        }
> +    }
> +
> +    /* Cleanup */
> +    blkqueue_free_request(req);
> +
> +    /* Check if there are more requests to submit */
> +    blkqueue_process_request(bq);
> +}
> +
> +/*
> + * Checks if the first request on the queue can run. If so, remove it from the
> + * queue, submit the request and put it onto the queue of in-flight requests.
> + *
> + * Returns 0 if a request has been submitted, -1 if no request can run or an
> + * error has occurred.
> + */

If we want specific error codes:
-EAGAIN no request can run
-EINVAL, -EIO, ... specific errors

> +static int blkqueue_submit_request(BlockQueue *bq)
> +{
> +    BlockDriverAIOCB *acb;
> +    BlockQueueRequest *req;
> +
> +    /*
> +     * If we had an error, we must not submit new requests from another
> +     * section or may we get ordering problems. In fact, not submitting any new
> +     * requests looks like a good idea in this case.
> +     */
> +    if (bq->error_ret) {
> +        return -1;
> +    }
> +
> +    /* Fetch a request */
> +    req = QTAILQ_FIRST(&bq->queue);
> +    if (req == NULL) {
> +        return -1;
> +    }
> +
> +    /* Writethrough images aren't supposed to have any queue entries */
> +    assert((bq->bs->open_flags & WRITEBACK_MODES) != 0);
> +
> +    /*
> +     * We need to wait for completion before we can submit new requests:
> +     * 1. If we're currently processing a barrier, or the new request is a
> +     *    barrier, we need to guarantee this barrier semantics.
> +     * 2. We must make sure that newer writes cannot pass older ones.
> +     */
> +    if (bq->in_flight_num > 0) {

I don't understand why we refuse to do more work on in_flight_num > 0.
This function increments in_flight_num which means
blkqueue_process_request() will only ever submit one request at a
time?
Why does blkqueue_process_request() loop then?  I'm missing something.

> +        return -1;
> +    }
> +
> +    /* Process barriers only if the queue is long enough */
> +    if (!bq->flushing) {
> +        if (req->type == REQ_TYPE_BARRIER && bq->queue_size < 50) {
> +            return -1;
> +        }

Not sure about this.  Need to think about the patch more but this
check looks like it could have consequences like starvation.  I guess
that's why you check for !bq->flushing.

> +static void blkqueue_aio_cancel(BlockDriverAIOCB *blockacb)
> +{
> +    BlockQueueAIOCB *acb = (BlockQueueAIOCB*) blockacb;
> +
> +    /*
> +     * We can't cancel the flush any more, but that doesn't hurt. We just
> +     * need to make sure that we don't call the callback when it completes.
> +     */
> +    QLIST_REMOVE(acb, link);
> +    qemu_free(acb);

qemu_aio_release()

> +}
> +
> +/*
> + * Inserts a barrier at the end of the queue (or merges with an existing
> + * barrier there). Once the barrier has completed, the callback is called.
> + */
> +BlockDriverAIOCB* blkqueue_aio_flush(BlockQueueContext *context,
> +    BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    BlockQueueAIOCB *acb;
> +    BlockDriverState *bs = context->bq->bs;
> +    int ret;
> +
> +    /* Don't use the queue for writethrough images */
> +    if ((bs->open_flags & WRITEBACK_MODES) == 0) {
> +        return bdrv_aio_flush(bs, cb, opaque);
> +    }
> +
> +    /* Insert a barrier into the queue */
> +    acb = qemu_aio_get(&blkqueue_aio_pool, NULL, cb, opaque);
> +
> +    ret = insert_barrier(context, acb);
> +    if (ret < 0) {

This return path is broken:

> +        cb(opaque, ret);

Missing BH.  The callback must not be invoked directly from the
submission path; it should be scheduled from a bottom half so the
caller never sees the completion before this function has returned.

> +        qemu_free(acb);

qemu_aio_release(acb);

If we want to invoke cb (via a BH) then we shouldn't release acb.
If we do release it we need to return NULL but shouldn't invoke cb.

> +    }
> +
> +    return &acb->common;
> +}
> +
> +/*
> + * Flushes the queue (i.e. disables waiting for new requests to be batched) and
> + * waits until all requests in the queue have completed.
> + *
> + * Note that unlike blkqueue_aio_flush this does not call bdrv_flush().
> + */
> +int blkqueue_flush(BlockQueue *bq)
> +{
> +    int res = 0;
> +
> +    bq->flushing = 1;
> +
> +    /* Process any left over requests */
> +    while ((bq->flushing > 0) &&
> +        (bq->in_flight_num || QTAILQ_FIRST(&bq->queue)))
> +    {
> +        blkqueue_process_request(bq);
> +        qemu_aio_wait();
> +    }
> +
> +    /*
> +     * bq->flushing contains the error if it could be handled by stopping the
> +     * VM, error_ret contains it if we're not allowed to do this.
> +     */
> +    if (bq->error_ret < 0) {
> +        res = bq->error_ret;
> +
> +        /*
> +         * Wait for AIO requests, so that the queue is really unused after
> +         * blkqueue_flush() and the caller can destroy it
> +         */
> +        if (res < 0) {

We already know bq->error_ret < 0.  This comparison is always true.

Stefan