From: Anthony Liguori
Date: Thu, 07 Oct 2010 16:55:09 -0500
Subject: Re: [Qemu-devel] [PATCH] ceph/rbd block driver for qemu-kvm (v4)
Message-ID: <4CAE41BD.2070508@codemonkey.ws>
References: <20100802194631.GA4923@chb-desktop> <20100803201407.GD1475@chb-desktop>
 <4CADD567.9010606@codemonkey.ws> <4CAE13BA.70707@codemonkey.ws>
 <4CAE24C5.8030007@codemonkey.ws> <4CAE35C5.2010809@codemonkey.ws>
To: Yehuda Sadeh Weinraub
Cc: Kevin Wolf, ceph-devel@vger.kernel.org, qemu-devel@nongnu.org,
 kvm@vger.kernel.org, Christian Brunner

On 10/07/2010 04:49 PM, Yehuda Sadeh Weinraub wrote:
> On Thu, Oct 7, 2010 at 2:04 PM, Anthony Liguori wrote:
>
>> On 10/07/2010 03:47 PM, Yehuda Sadeh Weinraub wrote:
>>
>>>> How is that possible? Are the callbacks delivered in the context of a
>>>> different thread? If so, don't you need locking?
>>>>
>>> Not sure I'm completely following you. The callbacks are delivered in
>>> the context of a different thread, but won't run concurrently.
>>>
>> Concurrently with what? How do you prevent them from running concurrently
>> with qemu?
>>
> There are two types of callbacks. The first is for rados aio
> completions, and the second is the one added later for the fd glue
> layer.

This is a bad architecture for something like qemu. You could create a
pipe and use the pipe to signal to qemu, on the same principle as an
eventfd. Ideally, you would do this in the library itself.

Regards,

Anthony Liguori
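For concreteness, a minimal, self-contained sketch of that pattern follows.
It uses plain POSIX calls and illustrative names only (it is not the rbd
driver's code): a worker thread stands in for the librados callback thread
and only write()s one byte per completed aio to a pipe, while the main
thread owns the read end, waits with poll(), and does the completion
accounting in its own context.

/*
 * Sketch of pipe-based completion signalling: the "library" thread never
 * calls into the main loop directly, it only writes to a pipe.  The main
 * thread drains the pipe and runs the completion work itself.
 */
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_AIOS 5

static int completion_pipe[2];      /* [0] = read end, [1] = write end */

/* Runs in the "library" thread: signal one finished aio per byte. */
static void *aio_worker(void *arg)
{
    int i;

    (void)arg;
    for (i = 0; i < NUM_AIOS; i++) {
        uint8_t one = 1;

        usleep(100 * 1000);         /* pretend the I/O took a while */
        if (write(completion_pipe[1], &one, sizeof(one)) < 0) {
            perror("write");
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    int completed = 0;

    if (pipe(completion_pipe) < 0) {
        perror("pipe");
        return 1;
    }
    pthread_create(&tid, NULL, aio_worker, NULL);

    /* The "main loop": wake up only when the worker signals completions. */
    while (completed < NUM_AIOS) {
        struct pollfd pfd = { .fd = completion_pipe[0], .events = POLLIN };
        uint8_t buf[16];
        ssize_t n;

        if (poll(&pfd, 1, -1) < 0) {
            if (errno == EINTR) {
                continue;
            }
            break;
        }
        n = read(completion_pipe[0], buf, sizeof(buf));
        if (n > 0) {
            completed += n;         /* completion work runs in this thread */
            printf("completed %d of %d aios\n", completed, NUM_AIOS);
        }
    }
    pthread_join(tid, NULL);
    return 0;
}

Compile with -pthread. Doing this inside librados/librbd itself, as
suggested above, would give every caller a completion fd without each one
having to reinvent the glue layer.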
> The first callback, called by librados whenever an aio completes, runs in
> the context of a single librados thread:
>
> +static void rbd_finish_aiocb(rados_completion_t c, RADOSCB *rcb)
> +{
> +    RBDAIOCB *acb = rcb->acb;
>
> rcb is per single aio: it was created before and will be destroyed
> here, whereas acb is shared between a few aios and was created before
> the first aio was issued.
>
> +    int64_t r;
> +    uint64_t buf = 1;
> +    int i;
> +
> +    acb->aiocnt--;
>
> acb->aiocnt has been set before initiating all the aios, so it's ok to
> touch it now. The same goes for the other acb fields.
>
> +    r = rados_aio_get_return_value(c);
> +    rados_aio_release(c);
> +    if (acb->write) {
> +        if (r < 0) {
> +            acb->ret = r;
> +            acb->error = 1;
> +        } else if (!acb->error) {
> +            acb->ret += rcb->segsize;
> +        }
> +    } else {
> +        if (r == -ENOENT) {
> +            memset(rcb->buf, 0, rcb->segsize);
> +            if (!acb->error) {
> +                acb->ret += rcb->segsize;
> +            }
> +        } else if (r < 0) {
> +            acb->ret = r;
> +            acb->error = 1;
> +        } else if (r < rcb->segsize) {
> +            memset(rcb->buf + r, 0, rcb->segsize - r);
> +            if (!acb->error) {
> +                acb->ret += rcb->segsize;
> +            }
> +        } else if (!acb->error) {
> +            acb->ret += r;
> +        }
> +    }
> +    if (write(acb->s->efd, &buf, sizeof(buf)) < 0)
>
> This will wake up the io_read().
>
> +        error_report("failed writing to acb->s->efd\n");
> +    qemu_free(rcb);
> +    i = 0;
> +    if (!acb->aiocnt && acb->bh) {
> +        qemu_bh_schedule(acb->bh);
>
> This is the only qemu-related call in here; it seems safe to call it.
>
> +    }
> +}
>
> The scheduled bh function will be called only after all the aios that
> belong to this specific aio set are done, so the following seems ok,
> as there are no more references to the acb:
>
> +static void rbd_aio_bh_cb(void *opaque)
> +{
> +    RBDAIOCB *acb = opaque;
> +    uint64_t buf = 1;
> +
> +    if (!acb->write) {
> +        qemu_iovec_from_buffer(acb->qiov, acb->bounce, acb->qiov->size);
> +    }
> +    qemu_vfree(acb->bounce);
> +    acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
> +    qemu_bh_delete(acb->bh);
> +    acb->bh = NULL;
> +
> +    if (write(acb->s->efd, &buf, sizeof(buf)) < 0)
> +        error_report("failed writing to acb->s->efd\n");
> +    qemu_aio_release(acb);
> +}
>
> Now, the second type is the io_read() handler, in which we have our glue
> fd. We send a uint64 for each completed io:
>
> +static void rbd_aio_completion_cb(void *opaque)
> +{
> +    BDRVRBDState *s = opaque;
> +
> +    uint64_t val;
> +    ssize_t ret;
> +
> +    do {
> +        if ((ret = read(s->efd, &val, sizeof(val))) > 0) {
> +            s->qemu_aio_count -= val;
>
> There is an issue here with s->qemu_aio_count, which needs to be
> protected by a mutex. Other than that, it just reads from s->efd.
>
> +        }
> +    } while (ret < 0 && errno == EINTR);
> +
> +    return;
> +}
> +
> +static int rbd_aio_flush_cb(void *opaque)
> +{
> +    BDRVRBDState *s = opaque;
> +
> +    return (s->qemu_aio_count > 0);
>
> Same here as with the previous one: it needs a mutex around s->qemu_aio_count.
>
> +}
>
>> If you saw lock ups, I bet that's what it was from.
>>
> As I explained before, without the fd glue layer there was no fd
> associated with our block device, so there was no way for qemu to check
> whether all the aios had been flushed, which didn't work well when doing
> migration/savevm.
>
> Thanks,
> Yehuda
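On the s->qemu_aio_count locking noted above, here is a small sketch of
the pattern being asked for, with a hypothetical AioCounter struct and
helper names rather than the actual driver fix: every increment, decrement,
and read of the counter takes the same pthread mutex, so the flush callback
sees a consistent value no matter which thread last updated it.

/* Sketch only: hypothetical helpers, not the rbd driver's code. */
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    int qemu_aio_count;      /* aios still in flight */
} AioCounter;

void aio_counter_init(AioCounter *c)
{
    pthread_mutex_init(&c->lock, NULL);
    c->qemu_aio_count = 0;
}

/* Called where the driver submits aios. */
void aio_counter_add(AioCounter *c, int n)
{
    pthread_mutex_lock(&c->lock);
    c->qemu_aio_count += n;
    pthread_mutex_unlock(&c->lock);
}

/* Called from the completion path once n aios have finished. */
void aio_counter_sub(AioCounter *c, int n)
{
    pthread_mutex_lock(&c->lock);
    c->qemu_aio_count -= n;
    pthread_mutex_unlock(&c->lock);
}

/* What a flush callback would return: are any aios still in flight? */
bool aio_counter_busy(AioCounter *c)
{
    bool busy;

    pthread_mutex_lock(&c->lock);
    busy = c->qemu_aio_count > 0;
    pthread_mutex_unlock(&c->lock);
    return busy;
}

If all updates to the counter end up happening in the qemu thread, as with
the pipe scheme sketched earlier, the lock becomes unnecessary; it is only
needed while two threads touch the counter.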