From: Anthony Liguori
In-Reply-To: <50751FAB.1000201@redhat.com>
References: <1348577763-12920-1-git-send-email-pbonzini@redhat.com>
 <20121008113932.GB16332@stefanha-thinkpad.redhat.com> <5072CE54.8020208@redhat.com>
 <20121009090811.GB13775@stefanha-thinkpad.redhat.com> <877gqzn0xc.fsf@codemonkey.ws>
 <50743D91.4010900@redhat.com> <87391n8xmq.fsf@codemonkey.ws>
 <5074502F.5030706@redhat.com> <87ehl7lcxu.fsf@codemonkey.ws>
 <50751FAB.1000201@redhat.com>
Date: Wed, 10 Oct 2012 07:25:12 -0500
Message-ID: <87haq2ldjr.fsf@codemonkey.ws>
Subject: Re: [Qemu-devel] Block I/O outside the QEMU global mutex was "Re: [RFC PATCH 00/17] Support for multiple "AIO contexts""
To: Paolo Bonzini
Cc: Kevin Wolf, Stefan Hajnoczi, Ping Fan Liu, qemu-devel@nongnu.org, Avi Kivity

Paolo Bonzini writes:

> On 09/10/2012 20:26, Anthony Liguori wrote:
>> Paolo Bonzini writes:
>>
>>> On 09/10/2012 17:37, Anthony Liguori wrote:
>>>>>>>> In the very short term, I can imagine an aio fastpath that was only
>>>>>>>> implemented in terms of the device API.  We could have a slow path
>>>>>>>> that acquired the BQL.
>>>>>>
>>>>>> Not sure I follow.
>>>>
>>>> As long as the ioeventfd thread can acquire qemu_mutex in order to call
>>>> bdrv_* functions.  The new device-only API could do this under the
>>>> covers for everything but the linux-aio fast path initially.
>>>
>>> Ok, so it's about the locking.  I'm not even sure we need locking if we
>>> have cooperative multitasking.  For example, if bdrv_aio_readv/writev
>>> is called from a VCPU thread, it can just schedule a bottom half for
>>> itself in the appropriate AioContext.  Similarly for block jobs.
>>
>> Okay, let's separate out the two issues here though.  One is whether we
>> need a device-specific block API.  The second is whether we should
>> shortcut to a fast path in the short term and go after a fully unlocked
>> bdrv_ layer in the long (shortish?) term.
>>
>> So let's talk about your proposal...
>>
>>> The only part where I'm not sure how it would work is bdrv_read/write,
>>> because of the strange "qemu_aio_wait() calls select with a lock taken".
>>> Maybe we can just forbid synchronous I/O if you set a non-default
>>> AioContext.
>>
>> Not sure how practical that is.  There is an awful lot of sync I/O
>> still left.
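
A minimal sketch of the "schedule a bottom half in the appropriate
AioContext" idea above, assuming the aio_bh_new() API from the AioContext
series; the DeferredReq struct and dev_submit_in_ctx() helper are
hypothetical names for illustration only, not code from the series:

    typedef struct DeferredReq {
        BlockDriverState *bs;
        int64_t sector_num;
        QEMUIOVector *qiov;
        int nb_sectors;
        BlockDriverCompletionFunc *cb;
        void *opaque;
        QEMUBH *bh;
    } DeferredReq;

    /* Runs in the device's AioContext, so no global lock is needed
     * around the block-layer call. */
    static void deferred_readv_bh(void *opaque)
    {
        DeferredReq *req = opaque;

        bdrv_aio_readv(req->bs, req->sector_num, req->qiov,
                       req->nb_sectors, req->cb, req->opaque);
        qemu_bh_delete(req->bh);
        g_free(req);
    }

    /* Called from a VCPU thread; ctx is the AioContext the device
     * runs in. */
    static void dev_submit_in_ctx(AioContext *ctx, DeferredReq *req)
    {
        req->bh = aio_bh_new(ctx, deferred_readv_bh, req);
        qemu_bh_schedule(req->bh);
    }
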
>
> Hmm, yeah, perhaps we need to bite the bullet and use a recursive lock.
> The lock would be taken by:
>
> - sync I/O ops
>
> - monitor commands that currently call bdrv_drain_all
>
> - aio_poll when calling bottom halves or handlers
>
> The rest of the proposal however would stand (especially with reference
> to block jobs).
>
> I think we can proceed incrementally.  The first obvious step is to
> s/qemu_bh_new/aio_bh_new/ in the whole block layer (including the
> CoQueue stuff), which would also help fix the qemu-char bug that Jan
> reported.
>
>>> This would be entirely hidden in the block layer.  For example, the
>>> following does it for bdrv_aio_readv/writev:
>>>
>>> diff --git a/block.c b/block.c
>>> index e95f613..7165e82 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -3712,15 +3712,6 @@ static AIOPool bdrv_em_co_aio_pool = {
>>>      .cancel = bdrv_aio_co_cancel_em,
>>>  };
>>>
>>> -static void bdrv_co_em_bh(void *opaque)
>>> -{
>>> -    BlockDriverAIOCBCoroutine *acb = opaque;
>>> -
>>> -    acb->common.cb(acb->common.opaque, acb->req.error);
>>> -    qemu_bh_delete(acb->bh);
>>> -    qemu_aio_release(acb);
>>> -}
>>> -
>>>  /* Invoke bdrv_co_do_readv/bdrv_co_do_writev */
>>>  static void coroutine_fn bdrv_co_do_rw(void *opaque)
>>>  {
>>> @@ -3735,8 +3726,17 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque)
>>>                      acb->req.nb_sectors, acb->req.qiov, 0);
>>>      }
>>>
>>> -    acb->bh = qemu_bh_new(bdrv_co_em_bh, acb);
>>> -    qemu_bh_schedule(acb->bh);
>>> +    acb->common.cb(acb->common.opaque, acb->req.error);
>>> +    qemu_aio_release(acb);
>>> +}
>>> +
>>> +static void bdrv_co_em_bh(void *opaque)
>>> +{
>>> +    BlockDriverAIOCBCoroutine *acb = opaque;
>>> +    Coroutine *co = qemu_coroutine_create(bdrv_co_do_rw);
>>> +
>>> +    qemu_bh_delete(acb->bh);
>>> +    qemu_coroutine_enter(co, acb);
>>>  }
>>>
>>>  static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
>>> @@ -3756,8 +3756,8 @@ static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
>>>      acb->req.qiov = qiov;
>>>      acb->is_write = is_write;
>>>
>>> -    co = qemu_coroutine_create(bdrv_co_do_rw);
>>> -    qemu_coroutine_enter(co, acb);
>>> +    acb->bh = qemu_bh_new(bdrv_co_em_bh, acb);
>>> +    qemu_bh_schedule(acb->bh);
>>>
>>>      return &acb->common;
>>>  }
>>>
>>>
>>> Then we can add a bdrv_aio_readv/writev_unlocked API to the protocols,
>>> which would run outside the bottom half and provide the desired fast
>>> path.
>>
>> This works for some of the block layer, I think.  How does this interact
>> with thread pools for AIO?
>>
>> But this wouldn't work well with things like NBD or curl, right?  What's
>> the plan there?
>
> NBD uses coroutines; curl can use the non-unlocked
> bdrv_aio_readv/writev.  In both cases they would execute in the
> dataplane thread.  qcow2-over-raw would also execute its read/write code
> entirely from the dataplane thread, for example.

Does that mean that we'd stop processing the queue if we're waiting for
an I/O completion to handle metadata operations?

If that's the case, it will probably hurt performance overall.

Regards,

Anthony Liguori

>
> Paolo
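
For reference, a sketch of the recursive lock Paolo suggests above (taken
by sync I/O ops, monitor commands that call bdrv_drain_all, and aio_poll
when dispatching bottom halves or handlers), using a plain pthread
recursive mutex; the block_lock()/block_unlock() names are hypothetical,
an illustration of the idea rather than the eventual implementation:

    #include <pthread.h>

    static pthread_mutex_t block_rec_lock;

    static void block_lock_init(void)
    {
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        /* Recursive, so that e.g. a sync bdrv_read() that ends up in
         * aio_poll(), which in turn dispatches a bottom half that
         * re-enters the block layer, does not deadlock on the lock it
         * already holds. */
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
        pthread_mutex_init(&block_rec_lock, &attr);
        pthread_mutexattr_destroy(&attr);
    }

    /* Taken by sync I/O ops, bdrv_drain_all-style monitor commands, and
     * aio_poll() around bottom-half/handler dispatch. */
    static void block_lock(void)
    {
        pthread_mutex_lock(&block_rec_lock);
    }

    static void block_unlock(void)
    {
        pthread_mutex_unlock(&block_rec_lock);
    }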