From: Anthony Liguori
Date: Thu, 07 Oct 2010 16:55:09 -0500
Subject: Re: [Qemu-devel] [PATCH] ceph/rbd block driver for qemu-kvm (v4)
Message-ID: <4CAE41BD.2070508@codemonkey.ws>
References: <20100802194631.GA4923@chb-desktop> <20100803201407.GD1475@chb-desktop>
 <4CADD567.9010606@codemonkey.ws> <4CAE13BA.70707@codemonkey.ws>
 <4CAE24C5.8030007@codemonkey.ws> <4CAE35C5.2010809@codemonkey.ws>
To: Yehuda Sadeh Weinraub
Cc: Kevin Wolf, ceph-devel@vger.kernel.org, qemu-devel@nongnu.org,
 kvm@vger.kernel.org, Christian Brunner

On 10/07/2010 04:49 PM, Yehuda Sadeh Weinraub wrote:
> On Thu, Oct 7, 2010 at 2:04 PM, Anthony Liguori wrote:
>
>> On 10/07/2010 03:47 PM, Yehuda Sadeh Weinraub wrote:
>>
>>>> How is that possible? Are the callbacks delivered in the context of a
>>>> different thread? If so, don't you need locking?
>>>>
>>> Not sure I'm completely following you. The callbacks are delivered in
>>> the context of a different thread, but won't run concurrently.
>>>
>> Concurrently with what? How do you prevent them from running concurrently
>> with qemu?
>>
> There are two types of callbacks. The first is for rados aio
> completions, and the second is the one added later for the fd glue
> layer.

This is a bad architecture for something like qemu. You could create a
pipe and use the pipe to signal to qemu, on the same principle as an
eventfd. Ideally, you would do this in the library itself.

Regards,

Anthony Liguori
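For concreteness, a minimal, self-contained sketch of that pattern follows.
It uses plain POSIX calls and illustrative names only (it is not the rbd
driver's code): a worker thread stands in for the librados callback thread
and only write()s one byte per completed aio to a pipe, while the main
thread owns the read end, waits with poll(), and does the completion
accounting in its own context.

/*
 * Sketch of pipe-based completion signalling: the "library" thread never
 * calls into the main loop directly, it only writes to a pipe.  The main
 * thread drains the pipe and runs the completion work itself.
 */
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_AIOS 5

static int completion_pipe[2];      /* [0] = read end, [1] = write end */

/* Runs in the "library" thread: signal one finished aio per byte. */
static void *aio_worker(void *arg)
{
    int i;

    (void)arg;
    for (i = 0; i < NUM_AIOS; i++) {
        uint8_t one = 1;

        usleep(100 * 1000);         /* pretend the I/O took a while */
        if (write(completion_pipe[1], &one, sizeof(one)) < 0) {
            perror("write");
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    int completed = 0;

    if (pipe(completion_pipe) < 0) {
        perror("pipe");
        return 1;
    }
    pthread_create(&tid, NULL, aio_worker, NULL);

    /* The "main loop": wake up only when the worker signals completions. */
    while (completed < NUM_AIOS) {
        struct pollfd pfd = { .fd = completion_pipe[0], .events = POLLIN };
        uint8_t buf[16];
        ssize_t n;

        if (poll(&pfd, 1, -1) < 0) {
            if (errno == EINTR) {
                continue;
            }
            break;
        }
        n = read(completion_pipe[0], buf, sizeof(buf));
        if (n > 0) {
            completed += n;         /* completion work runs in this thread */
            printf("completed %d of %d aios\n", completed, NUM_AIOS);
        }
    }
    pthread_join(tid, NULL);
    return 0;
}

Compile with -pthread. Doing this inside librados/librbd itself, as
suggested above, would give every caller a completion fd without each one
having to reinvent the glue layer.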
> The first callback, called by librados whenever an aio completes, runs in
> the context of a single librados thread:
>
> +static void rbd_finish_aiocb(rados_completion_t c, RADOSCB *rcb)
> +{
> +    RBDAIOCB *acb = rcb->acb;
>
> rcb is per single aio: it was created before and will be destroyed
> here, whereas acb is shared between a few aios and was created before
> the first aio was issued.
>
> +    int64_t r;
> +    uint64_t buf = 1;
> +    int i;
> +
> +    acb->aiocnt--;
>
> acb->aiocnt has been set before initiating all the aios, so it's ok to
> touch it now. The same goes for the other acb fields.
>
> +    r = rados_aio_get_return_value(c);
> +    rados_aio_release(c);
> +    if (acb->write) {
> +        if (r < 0) {
> +            acb->ret = r;
> +            acb->error = 1;
> +        } else if (!acb->error) {
> +            acb->ret += rcb->segsize;
> +        }
> +    } else {
> +        if (r == -ENOENT) {
> +            memset(rcb->buf, 0, rcb->segsize);
> +            if (!acb->error) {
> +                acb->ret += rcb->segsize;
> +            }
> +        } else if (r < 0) {
> +            acb->ret = r;
> +            acb->error = 1;
> +        } else if (r < rcb->segsize) {
> +            memset(rcb->buf + r, 0, rcb->segsize - r);
> +            if (!acb->error) {
> +                acb->ret += rcb->segsize;
> +            }
> +        } else if (!acb->error) {
> +            acb->ret += r;
> +        }
> +    }
> +    if (write(acb->s->efd, &buf, sizeof(buf)) < 0)
>
> This will wake up the io_read().
>
> +        error_report("failed writing to acb->s->efd\n");
> +    qemu_free(rcb);
> +    i = 0;
> +    if (!acb->aiocnt && acb->bh) {
> +        qemu_bh_schedule(acb->bh);
>
> This is the only qemu-related call in here; it seems safe to call it.
>
> +    }
> +}
>
> The scheduled bh function will be called only after all the aios that
> belong to this specific aio set are done, so the following seems ok,
> as there are no more references to the acb:
>
> +static void rbd_aio_bh_cb(void *opaque)
> +{
> +    RBDAIOCB *acb = opaque;
> +    uint64_t buf = 1;
> +
> +    if (!acb->write) {
> +        qemu_iovec_from_buffer(acb->qiov, acb->bounce, acb->qiov->size);
> +    }
> +    qemu_vfree(acb->bounce);
> +    acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
> +    qemu_bh_delete(acb->bh);
> +    acb->bh = NULL;
> +
> +    if (write(acb->s->efd, &buf, sizeof(buf)) < 0)
> +        error_report("failed writing to acb->s->efd\n");
> +    qemu_aio_release(acb);
> +}
>
> Now, the second type is the io_read() handler, in which we have our glue
> fd. We send a uint64 for each completed io:
>
> +static void rbd_aio_completion_cb(void *opaque)
> +{
> +    BDRVRBDState *s = opaque;
> +
> +    uint64_t val;
> +    ssize_t ret;
> +
> +    do {
> +        if ((ret = read(s->efd, &val, sizeof(val))) > 0) {
> +            s->qemu_aio_count -= val;
>
> There is an issue here with s->qemu_aio_count, which needs to be
> protected by a mutex. Other than that, it just reads from s->efd.
>
> +        }
> +    } while (ret < 0 && errno == EINTR);
> +
> +    return;
> +}
> +
> +static int rbd_aio_flush_cb(void *opaque)
> +{
> +    BDRVRBDState *s = opaque;
> +
> +    return (s->qemu_aio_count > 0);
>
> Same here as with the previous one: it needs a mutex around s->qemu_aio_count.
>
> +}
>
>> If you saw lock ups, I bet that's what it was from.
>>
> As I explained before, without the fd glue layer there was no fd
> associated with our block device, so there was no way for qemu to check
> whether all the aios had been flushed, which didn't work well when doing
> migration/savevm.
>
> Thanks,
> Yehuda
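On the s->qemu_aio_count locking noted above, here is a small sketch of
the pattern being asked for, with a hypothetical AioCounter struct and
helper names rather than the actual driver fix: every increment, decrement,
and read of the counter takes the same pthread mutex, so the flush callback
sees a consistent value no matter which thread last updated it.

/* Sketch only: hypothetical helpers, not the rbd driver's code. */
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    int qemu_aio_count;      /* aios still in flight */
} AioCounter;

void aio_counter_init(AioCounter *c)
{
    pthread_mutex_init(&c->lock, NULL);
    c->qemu_aio_count = 0;
}

/* Called where the driver submits aios. */
void aio_counter_add(AioCounter *c, int n)
{
    pthread_mutex_lock(&c->lock);
    c->qemu_aio_count += n;
    pthread_mutex_unlock(&c->lock);
}

/* Called from the completion path once n aios have finished. */
void aio_counter_sub(AioCounter *c, int n)
{
    pthread_mutex_lock(&c->lock);
    c->qemu_aio_count -= n;
    pthread_mutex_unlock(&c->lock);
}

/* What a flush callback would return: are any aios still in flight? */
bool aio_counter_busy(AioCounter *c)
{
    bool busy;

    pthread_mutex_lock(&c->lock);
    busy = c->qemu_aio_count > 0;
    pthread_mutex_unlock(&c->lock);
    return busy;
}

If all updates to the counter end up happening in the qemu thread, as with
the pipe scheme sketched earlier, the lock becomes unnecessary; it is only
needed while two threads touch the counter.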