From: Anthony Liguori
In-Reply-To: <50751FAB.1000201@redhat.com>
References: <1348577763-12920-1-git-send-email-pbonzini@redhat.com>
 <20121008113932.GB16332@stefanha-thinkpad.redhat.com> <5072CE54.8020208@redhat.com>
 <20121009090811.GB13775@stefanha-thinkpad.redhat.com> <877gqzn0xc.fsf@codemonkey.ws>
 <50743D91.4010900@redhat.com> <87391n8xmq.fsf@codemonkey.ws>
 <5074502F.5030706@redhat.com> <87ehl7lcxu.fsf@codemonkey.ws>
 <50751FAB.1000201@redhat.com>
Date: Wed, 10 Oct 2012 07:25:12 -0500
Message-ID: <87haq2ldjr.fsf@codemonkey.ws>
Subject: Re: [Qemu-devel] Block I/O outside the QEMU global mutex was "Re: [RFC PATCH 00/17] Support for multiple "AIO contexts""
To: Paolo Bonzini
Cc: Kevin Wolf, Stefan Hajnoczi, Ping Fan Liu, qemu-devel@nongnu.org, Avi Kivity

Paolo Bonzini writes:

> On 09/10/2012 20:26, Anthony Liguori wrote:
>> Paolo Bonzini writes:
>>
>>> On 09/10/2012 17:37, Anthony Liguori wrote:
>>>>>>>> In the very short term, I can imagine an aio fastpath that was only
>>>>>>>> implemented in terms of the device API.  We could have a slow path
>>>>>>>> that acquired the BQL.
>>>>>>
>>>>>> Not sure I follow.
>>>>
>>>> As long as the ioeventfd thread can acquire qemu_mutex in order to call
>>>> bdrv_* functions.  The new device-only API could do this under the
>>>> covers for everything but the linux-aio fast path initially.
>>>
>>> Ok, so it's about the locking.  I'm not even sure we need locking if we
>>> have cooperative multitasking.  For example, if bdrv_aio_readv/writev
>>> is called from a VCPU thread, it can just schedule a bottom half for
>>> itself in the appropriate AioContext.  Similarly for block jobs.
>>
>> Okay, let's separate out the two issues here though.  One is whether we
>> need a device-specific block API.  The second is whether we should
>> shortcut to a fast path in the short term and go after a fully unlocked
>> bdrv_ layer in the long (shortish?) term.
>>
>> So let's talk about your proposal...
>>
>>> The only part where I'm not sure how it would work is bdrv_read/write,
>>> because of the strange "qemu_aio_wait() calls select with a lock taken".
>>> Maybe we can just forbid synchronous I/O if you set a non-default
>>> AioContext.
>>
>> Not sure how practical that is.  There is an awful lot of sync I/O
>> still left.
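
A minimal sketch of the "schedule a bottom half in the appropriate
AioContext" idea above, assuming the aio_bh_new() API from the AioContext
series; the DeferredReq struct and dev_submit_in_ctx() helper are
hypothetical names for illustration only, not code from the series:

    typedef struct DeferredReq {
        BlockDriverState *bs;
        int64_t sector_num;
        QEMUIOVector *qiov;
        int nb_sectors;
        BlockDriverCompletionFunc *cb;
        void *opaque;
        QEMUBH *bh;
    } DeferredReq;

    /* Runs in the device's AioContext, so no global lock is needed
     * around the block-layer call. */
    static void deferred_readv_bh(void *opaque)
    {
        DeferredReq *req = opaque;

        bdrv_aio_readv(req->bs, req->sector_num, req->qiov,
                       req->nb_sectors, req->cb, req->opaque);
        qemu_bh_delete(req->bh);
        g_free(req);
    }

    /* Called from a VCPU thread; ctx is the AioContext the device
     * runs in. */
    static void dev_submit_in_ctx(AioContext *ctx, DeferredReq *req)
    {
        req->bh = aio_bh_new(ctx, deferred_readv_bh, req);
        qemu_bh_schedule(req->bh);
    }
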
>
> Hmm, yeah, perhaps we need to bite the bullet and use a recursive lock.
> The lock would be taken by:
>
> - sync I/O ops
>
> - monitor commands that currently call bdrv_drain_all
>
> - aio_poll when calling bottom halves or handlers
>
> The rest of the proposal however would stand (especially with reference
> to block jobs).
>
> I think we can proceed incrementally.  The first obvious step is to
> s/qemu_bh_new/aio_bh_new/ in the whole block layer (including the
> CoQueue stuff), which would also help fix the qemu-char bug that Jan
> reported.
>
>>> This would be entirely hidden in the block layer.  For example, the
>>> following does it for bdrv_aio_readv/writev:
>>>
>>> diff --git a/block.c b/block.c
>>> index e95f613..7165e82 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -3712,15 +3712,6 @@ static AIOPool bdrv_em_co_aio_pool = {
>>>      .cancel = bdrv_aio_co_cancel_em,
>>>  };
>>>
>>> -static void bdrv_co_em_bh(void *opaque)
>>> -{
>>> -    BlockDriverAIOCBCoroutine *acb = opaque;
>>> -
>>> -    acb->common.cb(acb->common.opaque, acb->req.error);
>>> -    qemu_bh_delete(acb->bh);
>>> -    qemu_aio_release(acb);
>>> -}
>>> -
>>>  /* Invoke bdrv_co_do_readv/bdrv_co_do_writev */
>>>  static void coroutine_fn bdrv_co_do_rw(void *opaque)
>>>  {
>>> @@ -3735,8 +3726,17 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque)
>>>                      acb->req.nb_sectors, acb->req.qiov, 0);
>>>      }
>>>
>>> -    acb->bh = qemu_bh_new(bdrv_co_em_bh, acb);
>>> -    qemu_bh_schedule(acb->bh);
>>> +    acb->common.cb(acb->common.opaque, acb->req.error);
>>> +    qemu_aio_release(acb);
>>> +}
>>> +
>>> +static void bdrv_co_em_bh(void *opaque)
>>> +{
>>> +    BlockDriverAIOCBCoroutine *acb = opaque;
>>> +    Coroutine *co = qemu_coroutine_create(bdrv_co_do_rw);
>>> +
>>> +    qemu_bh_delete(acb->bh);
>>> +    qemu_coroutine_enter(co, acb);
>>>  }
>>>
>>>  static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
>>> @@ -3756,8 +3756,8 @@ static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
>>>      acb->req.qiov = qiov;
>>>      acb->is_write = is_write;
>>>
>>> -    co = qemu_coroutine_create(bdrv_co_do_rw);
>>> -    qemu_coroutine_enter(co, acb);
>>> +    acb->bh = qemu_bh_new(bdrv_co_em_bh, acb);
>>> +    qemu_bh_schedule(acb->bh);
>>>
>>>      return &acb->common;
>>>  }
>>>
>>>
>>> Then we can add a bdrv_aio_readv/writev_unlocked API to the protocols,
>>> which would run outside the bottom half and provide the desired fast
>>> path.
>>
>> This works for some of the block layer, I think.  How does this interact
>> with thread pools for AIO?
>>
>> But this wouldn't work well with things like NBD or curl, right?  What's
>> the plan there?
>
> NBD uses coroutines; curl can use the non-unlocked
> bdrv_aio_readv/writev.  In both cases they would execute in the
> dataplane thread.  qcow2-over-raw would also execute its read/write code
> entirely from the dataplane thread, for example.

Does that mean that we'd stop processing the queue if we're waiting for
an I/O completion to handle metadata operations?

If that's the case, it will probably hurt performance overall.

Regards,

Anthony Liguori

>
> Paolo
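
For reference, a sketch of the recursive lock Paolo suggests above (taken
by sync I/O ops, monitor commands that call bdrv_drain_all, and aio_poll
when dispatching bottom halves or handlers), using a plain pthread
recursive mutex; the block_lock()/block_unlock() names are hypothetical,
an illustration of the idea rather than the eventual implementation:

    #include <pthread.h>

    static pthread_mutex_t block_rec_lock;

    static void block_lock_init(void)
    {
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        /* Recursive, so that e.g. a sync bdrv_read() that ends up in
         * aio_poll(), which in turn dispatches a bottom half that
         * re-enters the block layer, does not deadlock on the lock it
         * already holds. */
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
        pthread_mutex_init(&block_rec_lock, &attr);
        pthread_mutexattr_destroy(&attr);
    }

    /* Taken by sync I/O ops, bdrv_drain_all-style monitor commands, and
     * aio_poll() around bottom-half/handler dispatch. */
    static void block_lock(void)
    {
        pthread_mutex_lock(&block_rec_lock);
    }

    static void block_unlock(void)
    {
        pthread_mutex_unlock(&block_rec_lock);
    }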