From: Kevin Wolf
Date: Thu, 06 Sep 2012 12:29:30 +0200
To: Paolo Bonzini
Cc: Anthony Liguori, Anand Avati, Stefan Hajnoczi, Vijay Bellur, Amar Tumballi, qemu-devel@nongnu.org, Blue Swirl, bharata@linux.vnet.ibm.com
Subject: Re: [Qemu-devel] [PATCH v6 2/2] block: Support GlusterFS as a QEMU block backend
Message-ID: <50487B0A.6010908@redhat.com>
In-Reply-To: <50487881.9030207@redhat.com>
References: <20120809130010.GA7960@in.ibm.com> <20120809130216.GC7960@in.ibm.com> <20120905074106.GA28080@in.ibm.com> <20120905095431.GB28080@in.ibm.com> <50484F63.5050406@redhat.com> <504867A9.3050808@redhat.com> <50486F24.2010106@redhat.com> <504875F6.9090302@redhat.com> <50487881.9030207@redhat.com>

On 06.09.2012 12:18, Paolo Bonzini wrote:
> On 06/09/2012 12:07, Kevin Wolf wrote:
>>> The AIOCB is already invalid at the time the callback is entered, so
>>> we could release it before the call. However, not all implementations
>>> of AIO are ready for that and I'm not really in the mood for large
>>> scale refactoring...
>>
>> By the way, what I'd really want to see in the end is to get rid of
>> qemu_aio_flush() and replace it by .bdrv_drain() callbacks in each
>> BlockDriver. The way we're doing it today is a layering violation.
>
> That's quite difficult. Completion of an I/O operation can trigger
> another I/O operation on another block device, and so on until we go
> back to the first device (think of a hypothetical RAID-5 device).

You always have a tree of BDSes, and children should only ever trigger
completion of I/O operations in their parents. Am I missing anything?

>> Doesn't change anything about this problem, though. So the options
>> that we have are:
>>
>> 1. Delay the callback using a BH. Doing this in each driver is ugly.
>>    But is there actually more than one possible callback in today's
>>    coroutine world? I only see bdrv_co_io_em_complete(), which could
>>    reenter the coroutine from a BH.
>
> Easy and safe, but it feels a bit like a timebomb. Also, I'm not
> entirely sure of _why_ the bottom half works. :)

Hm, "safe" and "time bomb" are contradictory in my book. :-)

The bottom half works because we're not reentering the qcow2_create
coroutine immediately, so the gluster AIO callback can complete all of
its cleanup work without being interrupted by code that might wait on
this particular request and create a deadlock that way.

>> 2. Delay the callback by just calling it later when the cleanup has
>>    been completed and .io_flush() can return 0. You say that it's
>>    hard to implement for some drivers, except if the AIOCBs are
>>    leaked until the end of functions like qcow2_create().
>
> ... which is what we do in posix-aio-compat.c; nobody screamed so far.

True.
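To make the BH variant concrete, a rough sketch of the deferral (the
GlusterAIOCB layout and the function names here are illustrative, not
taken from the actual patch; it assumes QEMU's qemu_bh_new() /
qemu_bh_schedule() / qemu_bh_delete() API and glosses over the
cross-thread hand-off from the glfs callback, which the patch handles
with a pipe):

    typedef struct GlusterAIOCB {
        BlockDriverAIOCB common;    /* must be first: carries cb/opaque */
        int ret;                    /* result to report to the caller */
        QEMUBH *bh;                 /* deferral point for the callback */
    } GlusterAIOCB;

    /* Runs from the main loop, after the gluster completion path has
     * fully unwound. */
    static void qemu_gluster_complete_aio(void *opaque)
    {
        GlusterAIOCB *acb = opaque;

        qemu_bh_delete(acb->bh);
        acb->bh = NULL;

        /* All gluster-side cleanup has finished by now, so reentering
         * the waiting coroutine here cannot deadlock on this request. */
        acb->common.cb(acb->common.opaque, acb->ret);
        qemu_aio_release(acb);
    }

    /* Invoked by libgfapi when the request completes; instead of
     * calling the completion callback directly, only schedule a BH. */
    static void gluster_finish_aiocb(struct glfs_fd *fd, ssize_t ret,
                                     void *arg)
    {
        GlusterAIOCB *acb = arg;

        acb->ret = ret;
        acb->bh = qemu_bh_new(qemu_gluster_complete_aio, acb);
        qemu_bh_schedule(acb->bh);
    }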
Would be easy to fix in posix-aio-compat, though. Or can a callback
expect that the AIOCB is still valid?

> Not really hard, it just has to be assessed for each driver
> separately. We can just do it in gluster and refactor it later.

Okay, so let's keep it as an option for now.

>> 3. Add a delay only later in functions like bdrv_drain_all() that
>>    assume that the request has completed. Are there more of this
>>    type? AIOCBs are leaked until a bdrv_drain_all() call. Does it
>>    work with draining specific BDSes instead of everything?
>>
>> Unless I forgot some important point, it almost looks like option 1
>> is the easiest and safest.
>
> I agree with your opinion, but I would feel better if I understood
> better why it works. (2) can be done easily in each driver (no
> ugliness) and refactored later.

I think option 2 must be done in each driver by design, or do you see
even a theoretical way to do it generically?

Kevin
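For comparison, option 2 would roughly amount to reordering the
completion path so that all driver bookkeeping (including whatever
.io_flush() inspects) is done before the guest-visible callback runs.
A sketch, reusing the GlusterAIOCB from above; the BDRVGlusterState
field and the helper names are assumptions for illustration, not the
patch's actual code:

    /* Registered as the .io_flush() handler for the gluster fd;
     * qemu_aio_flush() keeps waiting while this returns nonzero. */
    static int qemu_gluster_aio_flush_cb(void *opaque)
    {
        BDRVGlusterState *s = opaque;

        return s->qemu_aio_count > 0;
    }

    static void qemu_gluster_complete_aio(GlusterAIOCB *acb,
                                          BDRVGlusterState *s)
    {
        /* Finish the driver-side cleanup first, so that .io_flush()
         * already reports the request as gone... */
        s->qemu_aio_count--;

        /* ...and only then invoke the callback. The AIOCB stays valid
         * (i.e. is "leaked") until the callback has returned. */
        acb->common.cb(acb->common.opaque, acb->ret);
        qemu_aio_release(acb);
    }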