From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <1291121332-10588-3-git-send-email-kwolf@redhat.com>
References: <1291121332-10588-1-git-send-email-kwolf@redhat.com>
 <1291121332-10588-3-git-send-email-kwolf@redhat.com>
Date: Fri, 3 Dec 2010 09:44:28 +0000
Message-ID: 
From: Stefan Hajnoczi
Content-Type: text/plain; charset=ISO-8859-1
Subject: [Qemu-devel] Re: [RFC PATCH v3 2/4] Add block-queue
List-Id: qemu-devel.nongnu.org
To: Kevin Wolf
Cc: qemu-devel@nongnu.org

On Tue, Nov 30, 2010 at 12:48 PM, Kevin Wolf wrote:
> +/*
> + * Adds a write request to the queue.
> + */
> +int blkqueue_pwrite(BlockQueueContext *context, uint64_t offset, void *buf,
> +    uint64_t size)
> +{
> +    BlockQueue *bq = context->bq;
> +    BlockQueueRequest *section_req;
> +    bool completed;
> +
> +    /* Don't use the queue for writethrough images */
> +    if ((bq->bs->open_flags & WRITEBACK_MODES) == 0) {
> +        return bdrv_pwrite(bq->bs, offset, buf, size);
> +    }
> +
> +    /* First check if there are any pending writes for the same data. */
> +    DPRINTF("--        pwrite: [%#lx + %ld]\n", offset, size);
> +    completed = blkqueue_check_queue_overlap(context, &bq->queue, &offset,
> +        &buf, &size, &blkqueue_pwrite, &pwrite_handle_overlap,
> +        context->section);
> +
> +    if (completed) {
> +        return 0;
> +    }
> +
> +    /* Create request structure */
> +    BlockQueueRequest *req = qemu_malloc(sizeof(*req));
> +    QLIST_INIT(&req->acbs);
> +    req->type       = REQ_TYPE_WRITE;
> +    req->bq         = bq;
> +    req->offset     = offset;
> +    req->size       = size;
> +    req->buf        = qemu_malloc(size);

qemu_blockalign() -- data buffers handed to the block layer should be
allocated with qemu_blockalign() so they meet the alignment
requirements of the underlying image (needed e.g. for O_DIRECT).

> +static int insert_barrier(BlockQueueContext *context, BlockQueueAIOCB *acb)
> +{
> +    BlockQueue *bq = context->bq;
> +    BlockQueueRequest *section_req;
> +
> +    bq->barriers_requested++;
> +
> +    /* Create request structure */
> +    BlockQueueRequest *req = qemu_malloc(sizeof(*req));
> +    QLIST_INIT(&req->acbs);
> +    req->type       = REQ_TYPE_BARRIER;
> +    req->bq         = bq;
> +    req->section    = context->section;
> +    req->buf        = NULL;
> +
> +    /* Find another barrier to merge with. */
> +    QSIMPLEQ_FOREACH(section_req, &bq->sections, link_section) {
> +        if (section_req->section >= req->section) {
> +
> +            /*
> +             * If acb is set, the intention of the barrier request is to flush
> +             * the complete queue and notify the caller when all requests have
> +             * been processed. To achieve this, we may only merge with the very
> +             * last request in the queue.
> +             */
> +            if (acb && QTAILQ_NEXT(section_req, link)) {
> +                continue;
> +            }
> +
> +            req->section = section_req->section;
> +            context->section = section_req->section + 1;
> +            qemu_free(req);
> +            req = section_req;
> +            goto out;
> +        }
> +    }

I think the search for an existing barrier should be moved above the
req allocation.  It's more work to allocate req, do an unnecessary
req->section = section_req->section, and then free it again.

> +
> +    /*
> +     * If there wasn't a barrier for the same section yet, insert a new one at
> +     * the end.
> +     */
> +    DPRINTF("queue-ins flush: %p\n", req);
> +    QTAILQ_INSERT_TAIL(&bq->queue, req, link);
> +    QSIMPLEQ_INSERT_TAIL(&bq->sections, req, link_section);
> +    bq->queue_size++;
> +    context->section++;
> +
> +    bq->barriers_submitted++;
> +
> +    /*
> +     * At this point, req is either the newly inserted request, or a previously
> +     * existing barrier with which the current request has been merged.
> +     *
> +     * Insert the ACB in the list of that request so that the callback is
> +     * called when the request has completed.
> +     */
> +out:
> +    if (acb) {
> +        QLIST_INSERT_HEAD(&req->acbs, acb, link);

Is there a reason to insert at the head and not append to the tail?
Preserving order is usually good.

> +/*
> + * If there are any blkqueue_aio_flush callbacks pending, call them with ret
> + * as the error code and remove them from the queue.
> + *
> + * If keep_queue is false, all requests are removed from the queue
> + */
> +static void blkqueue_fail_flush(BlockQueue *bq, int ret, bool keep_queue)
> +{
> +    BlockQueueRequest *req, *next_req;
> +    BlockQueueAIOCB *acb, *next_acb;
> +
> +    QTAILQ_FOREACH_SAFE(req, &bq->queue, link, next_req) {
> +
> +        /* Call and remove registered callbacks */
> +        QLIST_FOREACH_SAFE(acb, &req->acbs, link, next_acb) {
> +            acb->common.cb(acb->common.opaque, ret);
> +            qemu_free(acb);

qemu_aio_release() -- ACBs allocated with qemu_aio_get() must be
released with qemu_aio_release(), not qemu_free().

> +        }
> +        QLIST_INIT(&req->acbs);
> +
> +        /* If requested, remove the request itself */
> +        if (!keep_queue) {
> +            QTAILQ_REMOVE(&bq->queue, req, link);

bq->queue_size--;

> +            if (req->type == REQ_TYPE_BARRIER) {
> +                QSIMPLEQ_REMOVE(&bq->sections, req, BlockQueueRequest,
> +                    link_section);
> +            }

Now free the request?

> +        }
> +    }
> +
> +    /* Make sure that blkqueue_flush stops running */
> +    bq->flushing = ret;
> +}
> +
> +static void blkqueue_process_request_cb(void *opaque, int ret)
> +{
> +    BlockQueueRequest *req = opaque;
> +    BlockQueue *bq = req->bq;
> +    BlockQueueAIOCB *acb, *next;
> +
> +    DPRINTF("  done    req:    %p [%#lx + %ld]\n", req, req->offset, req->size);
> +
> +    /* Remove from in-flight list */
> +    QTAILQ_REMOVE(&bq->in_flight, req, link);
> +    bq->in_flight_num--;
> +
> +    /*
> +     * Error handling gets a bit complicated, because we have already completed
> +     * the requests that went wrong. There are two ways of dealing with this:
> +     *
> +     * 1. With werror=stop we can put the request back into the queue and stop
> +     *    the VM. When the user continues the VM, the request is retried.
> +     *
> +     * 2. In other cases we need to return an error on the next bdrv_flush. The
> +     *    caller must cope with the fact that he doesn't know which of the
> +     *    requests succeeded (i.e. invalidate all caches)
> +     *
> +     * If we're in an blkqueue_aio_flush, we must return an error in both
> +     * cases. If we stop the VM, we can clear bq->errno immediately again.
> +     * Otherwise, it's cleared in bdrv_(aio_)flush.
> +     */
> +    if (ret < 0) {
> +        if (bq->error_ret != -ENOSPC) {
> +            bq->error_ret = ret;
> +        }
> +    }
> +
> +    /* Call any callbacks attached to the request (see blkqueue_aio_flush) */
> +    QLIST_FOREACH_SAFE(acb, &req->acbs, link, next) {
> +        acb->common.cb(acb->common.opaque, bq->error_ret);
> +        qemu_free(acb);

qemu_aio_release()

> +    }
> +    QLIST_INIT(&req->acbs);
> +
> +    /* Handle errors in the VM stop case */
> +    if (ret < 0) {
> +        bool keep_queue = bq->error_handler(bq->error_opaque, ret);
> +
> +        /* Fail any flushes that may wait for the queue to become empty */
> +        blkqueue_fail_flush(bq, bq->error_ret, keep_queue);
> +
> +        if (keep_queue) {
> +            /* Reinsert request into the queue */
> +            QTAILQ_INSERT_HEAD(&bq->queue, req, link);
> +            if (req->type == REQ_TYPE_BARRIER) {
> +                QSIMPLEQ_INSERT_HEAD(&bq->sections, req, link_section);
> +            }
> +
> +            /* Clear the error to restore a normal state after 'cont' */
> +            bq->error_ret = 0;
> +
> +            return;
> +        }
> +    }
> +
> +    /* Cleanup */
> +    blkqueue_free_request(req);
> +
> +    /* Check if there are more requests to submit */
> +    blkqueue_process_request(bq);
> +}
> +
> +/*
> + * Checks if the first request on the queue can run. If so, remove it from the
> + * queue, submit the request and put it onto the queue of in-flight requests.
> + *
> + * Returns 0 if a request has been submitted, -1 if no request can run or an
> + * error has occurred.
> + */

If we want specific error codes:
-EAGAIN no request can run
-EINVAL, -EIO, ... specific errors

> +static int blkqueue_submit_request(BlockQueue *bq)
> +{
> +    BlockDriverAIOCB *acb;
> +    BlockQueueRequest *req;
> +
> +    /*
> +     * If we had an error, we must not submit new requests from another
> +     * section or may we get ordering problems. In fact, not submitting any new
> +     * requests looks like a good idea in this case.
> +     */
> +    if (bq->error_ret) {
> +        return -1;
> +    }
> +
> +    /* Fetch a request */
> +    req = QTAILQ_FIRST(&bq->queue);
> +    if (req == NULL) {
> +        return -1;
> +    }
> +
> +    /* Writethrough images aren't supposed to have any queue entries */
> +    assert((bq->bs->open_flags & WRITEBACK_MODES) != 0);
> +
> +    /*
> +     * We need to wait for completion before we can submit new requests:
> +     * 1. If we're currently processing a barrier, or the new request is a
> +     *    barrier, we need to guarantee this barrier semantics.
> +     * 2. We must make sure that newer writes cannot pass older ones.
> +     */
> +    if (bq->in_flight_num > 0) {

I don't understand why we refuse to do more work on in_flight_num > 0.
This function increments in_flight_num which means
blkqueue_process_request() will only ever submit one request at a
time?
Why does blkqueue_process_request() loop then?  I'm missing something.

> +        return -1;
> +    }
> +
> +    /* Process barriers only if the queue is long enough */
> +    if (!bq->flushing) {
> +        if (req->type == REQ_TYPE_BARRIER && bq->queue_size < 50) {
> +            return -1;
> +        }

Not sure about this.  Need to think about the patch more but this
check looks like it could have consequences like starvation.  I guess
that's why you check for !bq->flushing.

> +static void blkqueue_aio_cancel(BlockDriverAIOCB *blockacb)
> +{
> +    BlockQueueAIOCB *acb = (BlockQueueAIOCB*) blockacb;
> +
> +    /*
> +     * We can't cancel the flush any more, but that doesn't hurt. We just
> +     * need to make sure that we don't call the callback when it completes.
> +     */
> +    QLIST_REMOVE(acb, link);
> +    qemu_free(acb);

qemu_aio_release()

> +}
> +
> +/*
> + * Inserts a barrier at the end of the queue (or merges with an existing
> + * barrier there). Once the barrier has completed, the callback is called.
> + */
> +BlockDriverAIOCB* blkqueue_aio_flush(BlockQueueContext *context,
> +    BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    BlockQueueAIOCB *acb;
> +    BlockDriverState *bs = context->bq->bs;
> +    int ret;
> +
> +    /* Don't use the queue for writethrough images */
> +    if ((bs->open_flags & WRITEBACK_MODES) == 0) {
> +        return bdrv_aio_flush(bs, cb, opaque);
> +    }
> +
> +    /* Insert a barrier into the queue */
> +    acb = qemu_aio_get(&blkqueue_aio_pool, NULL, cb, opaque);
> +
> +    ret = insert_barrier(context, acb);
> +    if (ret < 0) {

This return path is broken:

> +        cb(opaque, ret);

Missing BH.  The callback must not be invoked directly from the
submission path; it should be scheduled from a bottom half so the
caller never sees the completion before this function has returned.

> +        qemu_free(acb);

qemu_aio_release(acb);

If we want to invoke cb (via a BH) then we shouldn't release acb.
If we do release it we need to return NULL but shouldn't invoke cb.

> +    }
> +
> +    return &acb->common;
> +}
> +
> +/*
> + * Flushes the queue (i.e. disables waiting for new requests to be batched) and
> + * waits until all requests in the queue have completed.
> + *
> + * Note that unlike blkqueue_aio_flush this does not call bdrv_flush().
> + */
> +int blkqueue_flush(BlockQueue *bq)
> +{
> +    int res = 0;
> +
> +    bq->flushing = 1;
> +
> +    /* Process any left over requests */
> +    while ((bq->flushing > 0) &&
> +        (bq->in_flight_num || QTAILQ_FIRST(&bq->queue)))
> +    {
> +        blkqueue_process_request(bq);
> +        qemu_aio_wait();
> +    }
> +
> +    /*
> +     * bq->flushing contains the error if it could be handled by stopping the
> +     * VM, error_ret contains it if we're not allowed to do this.
> +     */
> +    if (bq->error_ret < 0) {
> +        res = bq->error_ret;
> +
> +        /*
> +         * Wait for AIO requests, so that the queue is really unused after
> +         * blkqueue_flush() and the caller can destroy it
> +         */
> +        if (res < 0) {

We already know bq->error_ret < 0.  This comparison is always true.

Stefan