* [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb
From: Asias He @ 2013-08-21  2:02 UTC
To: qemu-devel
Cc: Kevin Wolf, Vijay Bellur, Stefan Hajnoczi, Bharata B Rao, Asias He,
    MORITA Kazutaka

In block/gluster.c, we have

gluster_finish_aiocb
{
    if (retval != sizeof(acb)) {
        qemu_mutex_lock_iothread(); /* We are in gluster thread context */
        ...
        qemu_mutex_unlock_iothread();
    }
}

qemu tools, e.g. qemu-img, might race here because
qemu_mutex_{lock,unlock}_iothread are a nop operation and
gluster_finish_aiocb is in the gluster thread context.

To fix, we introduce our own mutex for qemu tools.

Signed-off-by: Asias He <asias@redhat.com>
---
 stubs/iothread-lock.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/stubs/iothread-lock.c b/stubs/iothread-lock.c
index 5d8aca1..d5c6dec 100644
--- a/stubs/iothread-lock.c
+++ b/stubs/iothread-lock.c
@@ -1,10 +1,21 @@
 #include "qemu-common.h"
 #include "qemu/main-loop.h"
 
+static QemuMutex qemu_tools_mutex;
+static pthread_once_t qemu_tools_once = PTHREAD_ONCE_INIT;
+
+static void qemu_tools_mutex_init(void)
+{
+    qemu_mutex_init(&qemu_tools_mutex);
+}
+
 void qemu_mutex_lock_iothread(void)
 {
+    pthread_once(&qemu_tools_once, qemu_tools_mutex_init);
+    qemu_mutex_lock(&qemu_tools_mutex);
 }
 
 void qemu_mutex_unlock_iothread(void)
 {
+    qemu_mutex_unlock(&qemu_tools_mutex);
 }
-- 
1.8.3.1
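For illustration, a minimal standalone sketch of the pattern the stub relies on: lazy, thread-safe, one-time initialization of a lock that any thread (including a library callback thread) may be the first to take. It uses plain POSIX threads instead of QEMU's QemuMutex wrappers, and all names are invented for the example:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t tools_mutex;
    static pthread_once_t tools_once = PTHREAD_ONCE_INIT;

    static void tools_mutex_init(void)
    {
        /* Runs exactly once, no matter which thread gets here first. */
        pthread_mutex_init(&tools_mutex, NULL);
    }

    static void lock_iothread_stub(void)
    {
        pthread_once(&tools_once, tools_mutex_init);
        pthread_mutex_lock(&tools_mutex);
    }

    static void unlock_iothread_stub(void)
    {
        pthread_mutex_unlock(&tools_mutex);
    }

    static void *worker(void *arg)
    {
        lock_iothread_stub();
        printf("worker %ld inside the critical section\n", (long)arg);
        unlock_iothread_stub();
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];

        for (long i = 0; i < 4; i++) {
            pthread_create(&t[i], NULL, worker, (void *)i);
        }
        for (int i = 0; i < 4; i++) {
            pthread_join(t[i], NULL);
        }
        return 0;
    }

pthread_once() guarantees the initializer runs exactly once even when several workers race to the first lock, which is what the stub needs because the qemu tools have no main-loop thread that could have initialized the mutex earlier.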
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb 2013-08-21 2:02 [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb Asias He @ 2013-08-21 8:16 ` Paolo Bonzini 2013-08-22 9:50 ` Asias He 2013-08-21 15:24 ` Stefan Hajnoczi 1 sibling, 1 reply; 22+ messages in thread From: Paolo Bonzini @ 2013-08-21 8:16 UTC (permalink / raw) To: Asias He Cc: Kevin Wolf, Vijay Bellur, qemu-devel, Stefan Hajnoczi, Bharata B Rao, MORITA Kazutaka Il 21/08/2013 04:02, Asias He ha scritto: > In block/gluster.c, we have > > gluster_finish_aiocb > { > if (retval != sizeof(acb)) { > qemu_mutex_lock_iothread(); /* We are in gluster thread context */ > ... > qemu_mutex_unlock_iothread(); > } > } > > qemu tools, e.g. qemu-img, might race here because > qemu_mutex_{lock,unlock}_iothread are a nop operation and > gluster_finish_aiocb is in the gluster thread context. > > To fix, we introduce our own mutex for qemu tools. > > Signed-off-by: Asias He <asias@redhat.com> > --- > stubs/iothread-lock.c | 11 +++++++++++ > 1 file changed, 11 insertions(+) > > diff --git a/stubs/iothread-lock.c b/stubs/iothread-lock.c > index 5d8aca1..d5c6dec 100644 > --- a/stubs/iothread-lock.c > +++ b/stubs/iothread-lock.c > @@ -1,10 +1,21 @@ > #include "qemu-common.h" > #include "qemu/main-loop.h" > > +static QemuMutex qemu_tools_mutex; > +static pthread_once_t qemu_tools_once = PTHREAD_ONCE_INIT; Doesn't work on Windows, but you can just add __attribute__((__constructor__)) to qemu_tools_mutex_init. Paolo > +static void qemu_tools_mutex_init(void) > +{ > + qemu_mutex_init(&qemu_tools_mutex); > +} > + > void qemu_mutex_lock_iothread(void) > { > + pthread_once(&qemu_tools_once, qemu_tools_mutex_init); > + qemu_mutex_lock(&qemu_tools_mutex); > } > > void qemu_mutex_unlock_iothread(void) > { > + qemu_mutex_unlock(&qemu_tools_mutex); > } > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb 2013-08-21 8:16 ` Paolo Bonzini @ 2013-08-22 9:50 ` Asias He 2013-08-22 9:51 ` Paolo Bonzini 0 siblings, 1 reply; 22+ messages in thread From: Asias He @ 2013-08-22 9:50 UTC (permalink / raw) To: Paolo Bonzini Cc: Kevin Wolf, Vijay Bellur, qemu-devel, Stefan Hajnoczi, Bharata B Rao, MORITA Kazutaka On Wed, Aug 21, 2013 at 10:16:02AM +0200, Paolo Bonzini wrote: > Il 21/08/2013 04:02, Asias He ha scritto: > > In block/gluster.c, we have > > > > gluster_finish_aiocb > > { > > if (retval != sizeof(acb)) { > > qemu_mutex_lock_iothread(); /* We are in gluster thread context */ > > ... > > qemu_mutex_unlock_iothread(); > > } > > } > > > > qemu tools, e.g. qemu-img, might race here because > > qemu_mutex_{lock,unlock}_iothread are a nop operation and > > gluster_finish_aiocb is in the gluster thread context. > > > > To fix, we introduce our own mutex for qemu tools. > > > > Signed-off-by: Asias He <asias@redhat.com> > > --- > > stubs/iothread-lock.c | 11 +++++++++++ > > 1 file changed, 11 insertions(+) > > > > diff --git a/stubs/iothread-lock.c b/stubs/iothread-lock.c > > index 5d8aca1..d5c6dec 100644 > > --- a/stubs/iothread-lock.c > > +++ b/stubs/iothread-lock.c > > @@ -1,10 +1,21 @@ > > #include "qemu-common.h" > > #include "qemu/main-loop.h" > > > > +static QemuMutex qemu_tools_mutex; > > +static pthread_once_t qemu_tools_once = PTHREAD_ONCE_INIT; > > Doesn't work on Windows, but you can just add Hmm, Any reasons, why it does not work on Windows? > __attribute__((__constructor__)) to qemu_tools_mutex_init. __attribute__((__constructor__)) works on Windows? > Paolo > > > +static void qemu_tools_mutex_init(void) > > +{ > > + qemu_mutex_init(&qemu_tools_mutex); > > +} > > + > > void qemu_mutex_lock_iothread(void) > > { > > + pthread_once(&qemu_tools_once, qemu_tools_mutex_init); > > + qemu_mutex_lock(&qemu_tools_mutex); > > } > > > > void qemu_mutex_unlock_iothread(void) > > { > > + qemu_mutex_unlock(&qemu_tools_mutex); > > } > > > -- Asias ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb 2013-08-22 9:50 ` Asias He @ 2013-08-22 9:51 ` Paolo Bonzini 2013-08-23 8:32 ` Asias He 0 siblings, 1 reply; 22+ messages in thread From: Paolo Bonzini @ 2013-08-22 9:51 UTC (permalink / raw) To: Asias He Cc: Kevin Wolf, Vijay Bellur, qemu-devel, Stefan Hajnoczi, Bharata B Rao, MORITA Kazutaka Il 22/08/2013 11:50, Asias He ha scritto: > On Wed, Aug 21, 2013 at 10:16:02AM +0200, Paolo Bonzini wrote: >> Il 21/08/2013 04:02, Asias He ha scritto: >>> In block/gluster.c, we have >>> >>> gluster_finish_aiocb >>> { >>> if (retval != sizeof(acb)) { >>> qemu_mutex_lock_iothread(); /* We are in gluster thread context */ >>> ... >>> qemu_mutex_unlock_iothread(); >>> } >>> } >>> >>> qemu tools, e.g. qemu-img, might race here because >>> qemu_mutex_{lock,unlock}_iothread are a nop operation and >>> gluster_finish_aiocb is in the gluster thread context. >>> >>> To fix, we introduce our own mutex for qemu tools. >>> >>> Signed-off-by: Asias He <asias@redhat.com> >>> --- >>> stubs/iothread-lock.c | 11 +++++++++++ >>> 1 file changed, 11 insertions(+) >>> >>> diff --git a/stubs/iothread-lock.c b/stubs/iothread-lock.c >>> index 5d8aca1..d5c6dec 100644 >>> --- a/stubs/iothread-lock.c >>> +++ b/stubs/iothread-lock.c >>> @@ -1,10 +1,21 @@ >>> #include "qemu-common.h" >>> #include "qemu/main-loop.h" >>> >>> +static QemuMutex qemu_tools_mutex; >>> +static pthread_once_t qemu_tools_once = PTHREAD_ONCE_INIT; >> >> Doesn't work on Windows, but you can just add > > Hmm, Any reasons, why it does not work on Windows? There are no pthreads on Windows. There is an emulation library, but we're not using it. >> __attribute__((__constructor__)) to qemu_tools_mutex_init. > > __attribute__((__constructor__)) works on Windows? Yes, it is part of the runtime library's support for C++. We use it already, see include/qemu/module.h. Paolo >> Paolo >> >>> +static void qemu_tools_mutex_init(void) >>> +{ >>> + qemu_mutex_init(&qemu_tools_mutex); >>> +} >>> + >>> void qemu_mutex_lock_iothread(void) >>> { >>> + pthread_once(&qemu_tools_once, qemu_tools_mutex_init); >>> + qemu_mutex_lock(&qemu_tools_mutex); >>> } >>> >>> void qemu_mutex_unlock_iothread(void) >>> { >>> + qemu_mutex_unlock(&qemu_tools_mutex); >>> } >>> >> > ^ permalink raw reply [flat|nested] 22+ messages in thread
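A standalone sketch of the constructor-based initialization being suggested (illustrative names only; plain pthreads stand in for QemuMutex, and for a plain pthread mutex a static initializer would of course do; the attribute matters when initialization has to run real code, as qemu_mutex_init does):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t tools_mutex;

    /* The C runtime (including MinGW's) runs constructors before main(). */
    static void tools_mutex_init(void) __attribute__((constructor));

    static void tools_mutex_init(void)
    {
        pthread_mutex_init(&tools_mutex, NULL);
    }

    static void lock_iothread_stub(void)
    {
        pthread_mutex_lock(&tools_mutex);
    }

    static void unlock_iothread_stub(void)
    {
        pthread_mutex_unlock(&tools_mutex);
    }

    int main(void)
    {
        lock_iothread_stub();
        printf("mutex was ready before main() started\n");
        unlock_iothread_stub();
        return 0;
    }

Because the constructor runs before main(), the mutex exists before any thread can reach the lock stub, so no pthread_once (and hence no pthreads dependency) is required.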
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb 2013-08-22 9:51 ` Paolo Bonzini @ 2013-08-23 8:32 ` Asias He 2013-08-23 9:05 ` Paolo Bonzini 0 siblings, 1 reply; 22+ messages in thread From: Asias He @ 2013-08-23 8:32 UTC (permalink / raw) To: Paolo Bonzini Cc: Kevin Wolf, Vijay Bellur, qemu-devel, Stefan Hajnoczi, Bharata B Rao, MORITA Kazutaka On Thu, Aug 22, 2013 at 11:51:00AM +0200, Paolo Bonzini wrote: > Il 22/08/2013 11:50, Asias He ha scritto: > > On Wed, Aug 21, 2013 at 10:16:02AM +0200, Paolo Bonzini wrote: > >> Il 21/08/2013 04:02, Asias He ha scritto: > >>> In block/gluster.c, we have > >>> > >>> gluster_finish_aiocb > >>> { > >>> if (retval != sizeof(acb)) { > >>> qemu_mutex_lock_iothread(); /* We are in gluster thread context */ > >>> ... > >>> qemu_mutex_unlock_iothread(); > >>> } > >>> } > >>> > >>> qemu tools, e.g. qemu-img, might race here because > >>> qemu_mutex_{lock,unlock}_iothread are a nop operation and > >>> gluster_finish_aiocb is in the gluster thread context. > >>> > >>> To fix, we introduce our own mutex for qemu tools. > >>> > >>> Signed-off-by: Asias He <asias@redhat.com> > >>> --- > >>> stubs/iothread-lock.c | 11 +++++++++++ > >>> 1 file changed, 11 insertions(+) > >>> > >>> diff --git a/stubs/iothread-lock.c b/stubs/iothread-lock.c > >>> index 5d8aca1..d5c6dec 100644 > >>> --- a/stubs/iothread-lock.c > >>> +++ b/stubs/iothread-lock.c > >>> @@ -1,10 +1,21 @@ > >>> #include "qemu-common.h" > >>> #include "qemu/main-loop.h" > >>> > >>> +static QemuMutex qemu_tools_mutex; > >>> +static pthread_once_t qemu_tools_once = PTHREAD_ONCE_INIT; > >> > >> Doesn't work on Windows, but you can just add > > > > Hmm, Any reasons, why it does not work on Windows? > > There are no pthreads on Windows. > > There is an emulation library, but we're not using it. This one: http://www.sourceware.org/pthreads-win32 ? I found other places use pthread_create. e.g. [qemu]$ git grep pthread_create audio/audio_pt_int.c: err = pthread_create (&p->thread, NULL, func, opaque); audio/audio_pt_int.c: efunc = "pthread_create"; clone_func, &info); qemu-nbd.c: pthread_create(&show_parts_thread, NULL, show_parts, device); qemu-nbd.c: ret = pthread_create(&client_thread, NULL, nbd_client_thread, device); These bits are also not working on Windows? > >> __attribute__((__constructor__)) to qemu_tools_mutex_init. > > > > __attribute__((__constructor__)) works on Windows? > > Yes, it is part of the runtime library's support for C++. We use it > already, see include/qemu/module.h. Hmm, nice. > Paolo > > >> Paolo > >> > >>> +static void qemu_tools_mutex_init(void) > >>> +{ > >>> + qemu_mutex_init(&qemu_tools_mutex); > >>> +} > >>> + > >>> void qemu_mutex_lock_iothread(void) > >>> { > >>> + pthread_once(&qemu_tools_once, qemu_tools_mutex_init); > >>> + qemu_mutex_lock(&qemu_tools_mutex); > >>> } > >>> > >>> void qemu_mutex_unlock_iothread(void) > >>> { > >>> + qemu_mutex_unlock(&qemu_tools_mutex); > >>> } > >>> > >> > > > -- Asias ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb 2013-08-23 8:32 ` Asias He @ 2013-08-23 9:05 ` Paolo Bonzini 0 siblings, 0 replies; 22+ messages in thread From: Paolo Bonzini @ 2013-08-23 9:05 UTC (permalink / raw) To: Asias He Cc: Kevin Wolf, Vijay Bellur, qemu-devel, Stefan Hajnoczi, Bharata B Rao, MORITA Kazutaka Il 23/08/2013 10:32, Asias He ha scritto: > On Thu, Aug 22, 2013 at 11:51:00AM +0200, Paolo Bonzini wrote: >> Il 22/08/2013 11:50, Asias He ha scritto: >>> On Wed, Aug 21, 2013 at 10:16:02AM +0200, Paolo Bonzini wrote: >>>> Il 21/08/2013 04:02, Asias He ha scritto: >>>>> In block/gluster.c, we have >>>>> >>>>> gluster_finish_aiocb >>>>> { >>>>> if (retval != sizeof(acb)) { >>>>> qemu_mutex_lock_iothread(); /* We are in gluster thread context */ >>>>> ... >>>>> qemu_mutex_unlock_iothread(); >>>>> } >>>>> } >>>>> >>>>> qemu tools, e.g. qemu-img, might race here because >>>>> qemu_mutex_{lock,unlock}_iothread are a nop operation and >>>>> gluster_finish_aiocb is in the gluster thread context. >>>>> >>>>> To fix, we introduce our own mutex for qemu tools. >>>>> >>>>> Signed-off-by: Asias He <asias@redhat.com> >>>>> --- >>>>> stubs/iothread-lock.c | 11 +++++++++++ >>>>> 1 file changed, 11 insertions(+) >>>>> >>>>> diff --git a/stubs/iothread-lock.c b/stubs/iothread-lock.c >>>>> index 5d8aca1..d5c6dec 100644 >>>>> --- a/stubs/iothread-lock.c >>>>> +++ b/stubs/iothread-lock.c >>>>> @@ -1,10 +1,21 @@ >>>>> #include "qemu-common.h" >>>>> #include "qemu/main-loop.h" >>>>> >>>>> +static QemuMutex qemu_tools_mutex; >>>>> +static pthread_once_t qemu_tools_once = PTHREAD_ONCE_INIT; >>>> >>>> Doesn't work on Windows, but you can just add >>> >>> Hmm, Any reasons, why it does not work on Windows? >> >> There are no pthreads on Windows. >> >> There is an emulation library, but we're not using it. > > This one: http://www.sourceware.org/pthreads-win32 ? Yes. > I found other places use pthread_create. e.g. > > [qemu]$ git grep pthread_create > audio/audio_pt_int.c: err = pthread_create (&p->thread, NULL, func, > opaque); > audio/audio_pt_int.c: efunc = "pthread_create"; > clone_func, &info); > qemu-nbd.c: pthread_create(&show_parts_thread, NULL, show_parts, > device); > qemu-nbd.c: ret = pthread_create(&client_thread, NULL, > nbd_client_thread, device); > > These bits are also not working on Windows? They are not used on Windows, no. qemu-nbd is not compiled, audio uses audio/audio_win_int.c. Paolo >>>> __attribute__((__constructor__)) to qemu_tools_mutex_init. >>> >>> __attribute__((__constructor__)) works on Windows? >> >> Yes, it is part of the runtime library's support for C++. We use it >> already, see include/qemu/module.h. > > Hmm, nice. > >> Paolo >> >>>> Paolo >>>> >>>>> +static void qemu_tools_mutex_init(void) >>>>> +{ >>>>> + qemu_mutex_init(&qemu_tools_mutex); >>>>> +} >>>>> + >>>>> void qemu_mutex_lock_iothread(void) >>>>> { >>>>> + pthread_once(&qemu_tools_once, qemu_tools_mutex_init); >>>>> + qemu_mutex_lock(&qemu_tools_mutex); >>>>> } >>>>> >>>>> void qemu_mutex_unlock_iothread(void) >>>>> { >>>>> + qemu_mutex_unlock(&qemu_tools_mutex); >>>>> } >>>>> >>>> >>> >> > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb
From: Stefan Hajnoczi @ 2013-08-21 15:24 UTC
To: Asias He
Cc: Kevin Wolf, Vijay Bellur, qemu-devel, Stefan Hajnoczi, Bharata B Rao,
    MORITA Kazutaka

On Wed, Aug 21, 2013 at 10:02:47AM +0800, Asias He wrote:
> In block/gluster.c, we have
>
> gluster_finish_aiocb
> {
>     if (retval != sizeof(acb)) {
>         qemu_mutex_lock_iothread(); /* We are in gluster thread context */
>         ...
>         qemu_mutex_unlock_iothread();
>     }
> }
>
> qemu tools, e.g. qemu-img, might race here because
> qemu_mutex_{lock,unlock}_iothread are a nop operation and
> gluster_finish_aiocb is in the gluster thread context.
>
> To fix, we introduce our own mutex for qemu tools.

I think we need to look more closely at the error code path:

    acb->ret = ret;
    retval = qemu_write_full(s->fds[GLUSTER_FD_WRITE], &acb, sizeof(acb));
    if (retval != sizeof(acb)) {
        /*
         * Gluster AIO callback thread failed to notify the waiting
         * QEMU thread about IO completion.
         *
         * Complete this IO request and make the disk inaccessible for
         * subsequent reads and writes.
         */
        error_report("Gluster failed to notify QEMU about IO completion");

        qemu_mutex_lock_iothread(); /* We are in gluster thread context */
        acb->common.cb(acb->common.opaque, -EIO);
        qemu_aio_release(acb);
        close(s->fds[GLUSTER_FD_READ]);
        close(s->fds[GLUSTER_FD_WRITE]);

Is it safe to close the fds?  There is a race here:

1. Another thread opens a new file descriptor and gets GLUSTER_FD_READ or
   GLUSTER_FD_WRITE's old fd value.
2. Another gluster thread invokes the callback and does
   qemu_write_full(s->fds[GLUSTER_FD_WRITE], ...).

Since the mutex doesn't protect s->fds[] this is possible.

Maybe a simpler solution for request completion is:

1. Linked list of completed acbs.
2. Mutex to protect the linked list.
3. EventNotifier to signal iothread.

Then this function becomes:

static void gluster_finish_aiocb(struct glfs_fd *fd, ssize_t ret, void *arg)
{
    GlusterAIOCB *acb = arg;
    BlockDriverState *bs = acb->common.bs;
    BDRVGlusterState *s = bs->opaque;
    int retval;

    acb->ret = ret;

    qemu_mutex_lock(&s->finish_list_lock);
    QSIMPLEQ_INSERT_TAIL(&s->finish_list, acb, list);
    qemu_mutex_unlock(&s->finish_list_lock);

    event_notifier_set(&s->finish_list_notifier);
}

Stefan
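The completion scheme sketched above (completed-request list + mutex + a wakeup primitive the main loop watches) can be tried outside QEMU with a self-contained program. In this sketch an ordinary pipe stands in for the EventNotifier, the structure and function names are invented for the example, and the "main loop" is just a read loop: worker threads only append to the list and write one wakeup byte, and the reader drains the whole list on each wakeup.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    typedef struct Completion {
        int ret;                      /* result of the fake I/O request */
        struct Completion *next;
    } Completion;

    static pthread_mutex_t finish_lock = PTHREAD_MUTEX_INITIALIZER;
    static Completion *finish_list;   /* completed but not yet delivered */
    static int wakeup_fd[2];          /* pipe standing in for EventNotifier */

    /* Called from worker ("gluster") threads. */
    static void finish_request(int ret)
    {
        Completion *c = malloc(sizeof(*c));
        char b = 0;

        c->ret = ret;
        pthread_mutex_lock(&finish_lock);
        c->next = finish_list;
        finish_list = c;
        pthread_mutex_unlock(&finish_lock);

        /* One byte is enough: the reader drains everything it finds. */
        if (write(wakeup_fd[1], &b, 1) < 0) {
            perror("write");
        }
    }

    static void *worker(void *arg)
    {
        finish_request((int)(long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[8];
        int done = 0;

        if (pipe(wakeup_fd) < 0) {
            return 1;
        }
        for (long i = 0; i < 8; i++) {
            pthread_create(&t[i], NULL, worker, (void *)i);
        }

        /* "Main loop": block on the wakeup fd, then complete all queued acbs. */
        while (done < 8) {
            char buf[16];
            Completion *list;

            if (read(wakeup_fd[0], buf, sizeof(buf)) < 0) {
                perror("read");
            }

            pthread_mutex_lock(&finish_lock);
            list = finish_list;
            finish_list = NULL;
            pthread_mutex_unlock(&finish_lock);

            while (list) {
                Completion *c = list;
                list = c->next;
                printf("completed request, ret=%d\n", c->ret);
                free(c);
                done++;
            }
        }
        for (int i = 0; i < 8; i++) {
            pthread_join(t[i], NULL);
        }
        return 0;
    }

Because requests are inserted under the mutex before the wakeup is written, no completion can be lost even if several wakeup bytes are coalesced into one read; that is what makes the per-request write of the acb pointer in the current gluster code unnecessary.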
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb 2013-08-21 15:24 ` Stefan Hajnoczi @ 2013-08-21 15:40 ` Paolo Bonzini 2013-08-22 5:59 ` Bharata B Rao 2013-08-23 6:48 ` Bharata B Rao 0 siblings, 2 replies; 22+ messages in thread From: Paolo Bonzini @ 2013-08-21 15:40 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Kevin Wolf, Vijay Bellur, qemu-devel, Stefan Hajnoczi, Bharata B Rao, Asias He, MORITA Kazutaka Il 21/08/2013 17:24, Stefan Hajnoczi ha scritto: > On Wed, Aug 21, 2013 at 10:02:47AM +0800, Asias He wrote: >> In block/gluster.c, we have >> >> gluster_finish_aiocb >> { >> if (retval != sizeof(acb)) { >> qemu_mutex_lock_iothread(); /* We are in gluster thread context */ >> ... >> qemu_mutex_unlock_iothread(); >> } >> } >> >> qemu tools, e.g. qemu-img, might race here because >> qemu_mutex_{lock,unlock}_iothread are a nop operation and >> gluster_finish_aiocb is in the gluster thread context. >> >> To fix, we introduce our own mutex for qemu tools. > > I think we need to look more closely at the error code path: > > acb->ret = ret; > retval = qemu_write_full(s->fds[GLUSTER_FD_WRITE], &acb, sizeof(acb)); > if (retval != sizeof(acb)) { > /* > * Gluster AIO callback thread failed to notify the waiting > * QEMU thread about IO completion. > * > * Complete this IO request and make the disk inaccessible for > * subsequent reads and writes. > */ > error_report("Gluster failed to notify QEMU about IO completion"); > > qemu_mutex_lock_iothread(); /* We are in gluster thread context */ > acb->common.cb(acb->common.opaque, -EIO); > qemu_aio_release(acb); > close(s->fds[GLUSTER_FD_READ]); > close(s->fds[GLUSTER_FD_WRITE]); > > Is it safe to close the fds? There is a race here: > > 1. Another thread opens a new file descriptor and gets GLUSTER_FD_READ or > GLUSTER_FD_WRITE's old fd value. > 2. Another gluster thread invokes the callback and does > qemu_write_full(s->fds[GLUSTER_FD_WRITE], ...). > > Since the mutex doesn't protect s->fds[] this is possible. > > Maybe a simpler solution for request completion is: > > 1. Linked list of completed acbs. > 2. Mutex to protect the linked list. > 3. EventNotifier to signal iothread. We could just use a bottom half, too. Add a bottom half to acb, schedule it in gluster_finish_aiocb, delete it in the bottom half's own callback. Paolo > Then this function becomes: > > static void gluster_finish_aiocb(struct glfs_fd *fd, ssize_t ret, void *arg) > { > GlusterAIOCB *acb = arg; > BlockDriverState *bs = acb->common.bs; > BDRVGlusterState *s = bs->opaque; > int retval; > > acb->ret = ret; > > qemu_mutex_lock(&s->finish_list_lock); > QSIMPLEQ_INSERT_TAIL(&s->finish_list, acb, list); > qemu_mutex_unlock(&s->finish_list_lock); > > event_notifier_set(&s->finish_list_notifier); > } > > Stefan > > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb 2013-08-21 15:40 ` Paolo Bonzini @ 2013-08-22 5:59 ` Bharata B Rao 2013-08-22 7:48 ` Stefan Hajnoczi 2013-08-23 6:48 ` Bharata B Rao 1 sibling, 1 reply; 22+ messages in thread From: Bharata B Rao @ 2013-08-22 5:59 UTC (permalink / raw) To: Paolo Bonzini Cc: Kevin Wolf, Vijay Bellur, Stefan Hajnoczi, qemu-devel, Stefan Hajnoczi, Asias He, MORITA Kazutaka On Wed, Aug 21, 2013 at 05:40:11PM +0200, Paolo Bonzini wrote: > Il 21/08/2013 17:24, Stefan Hajnoczi ha scritto: > > On Wed, Aug 21, 2013 at 10:02:47AM +0800, Asias He wrote: > >> In block/gluster.c, we have > >> > >> gluster_finish_aiocb > >> { > >> if (retval != sizeof(acb)) { > >> qemu_mutex_lock_iothread(); /* We are in gluster thread context */ > >> ... > >> qemu_mutex_unlock_iothread(); > >> } > >> } > >> > >> qemu tools, e.g. qemu-img, might race here because > >> qemu_mutex_{lock,unlock}_iothread are a nop operation and > >> gluster_finish_aiocb is in the gluster thread context. > >> > >> To fix, we introduce our own mutex for qemu tools. > > > > I think we need to look more closely at the error code path: > > > > acb->ret = ret; > > retval = qemu_write_full(s->fds[GLUSTER_FD_WRITE], &acb, sizeof(acb)); > > if (retval != sizeof(acb)) { > > /* > > * Gluster AIO callback thread failed to notify the waiting > > * QEMU thread about IO completion. > > * > > * Complete this IO request and make the disk inaccessible for > > * subsequent reads and writes. > > */ > > error_report("Gluster failed to notify QEMU about IO completion"); > > > > qemu_mutex_lock_iothread(); /* We are in gluster thread context */ > > acb->common.cb(acb->common.opaque, -EIO); > > qemu_aio_release(acb); > > close(s->fds[GLUSTER_FD_READ]); > > close(s->fds[GLUSTER_FD_WRITE]); > > > > Is it safe to close the fds? There is a race here: > > > > 1. Another thread opens a new file descriptor and gets GLUSTER_FD_READ or > > GLUSTER_FD_WRITE's old fd value. > > 2. Another gluster thread invokes the callback and does > > qemu_write_full(s->fds[GLUSTER_FD_WRITE], ...). > > > > Since the mutex doesn't protect s->fds[] this is possible. > > > > Maybe a simpler solution for request completion is: > > > > 1. Linked list of completed acbs. > > 2. Mutex to protect the linked list. > > 3. EventNotifier to signal iothread. > > We could just use a bottom half, too. Add a bottom half to acb, > schedule it in gluster_finish_aiocb, delete it in the bottom half's own > callback. gluster_finish_aiocb gets called from gluster thread, is it safe to create and schedule a bh from such a thread ? In my first implementation (http://lists.gnu.org/archive/html/qemu-devel/2012-06/msg01748.html), I was using a BH from qemu read side thread (the thread that would respond to pipe write from gluster callback thread). That implementation was based on rbd and I later dropped the BH part since it looked like a round about way of completing the aio when we are already using the pipe mechanism for aio completion. Regards, Bharata. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb 2013-08-22 5:59 ` Bharata B Rao @ 2013-08-22 7:48 ` Stefan Hajnoczi 2013-08-22 9:06 ` Paolo Bonzini 0 siblings, 1 reply; 22+ messages in thread From: Stefan Hajnoczi @ 2013-08-22 7:48 UTC (permalink / raw) To: Bharata B Rao Cc: Kevin Wolf, Vijay Bellur, Stefan Hajnoczi, qemu-devel, Paolo Bonzini, Asias He, MORITA Kazutaka On Thu, Aug 22, 2013 at 11:29:47AM +0530, Bharata B Rao wrote: > On Wed, Aug 21, 2013 at 05:40:11PM +0200, Paolo Bonzini wrote: > > Il 21/08/2013 17:24, Stefan Hajnoczi ha scritto: > > > On Wed, Aug 21, 2013 at 10:02:47AM +0800, Asias He wrote: > > >> In block/gluster.c, we have > > >> > > >> gluster_finish_aiocb > > >> { > > >> if (retval != sizeof(acb)) { > > >> qemu_mutex_lock_iothread(); /* We are in gluster thread context */ > > >> ... > > >> qemu_mutex_unlock_iothread(); > > >> } > > >> } > > >> > > >> qemu tools, e.g. qemu-img, might race here because > > >> qemu_mutex_{lock,unlock}_iothread are a nop operation and > > >> gluster_finish_aiocb is in the gluster thread context. > > >> > > >> To fix, we introduce our own mutex for qemu tools. > > > > > > I think we need to look more closely at the error code path: > > > > > > acb->ret = ret; > > > retval = qemu_write_full(s->fds[GLUSTER_FD_WRITE], &acb, sizeof(acb)); > > > if (retval != sizeof(acb)) { > > > /* > > > * Gluster AIO callback thread failed to notify the waiting > > > * QEMU thread about IO completion. > > > * > > > * Complete this IO request and make the disk inaccessible for > > > * subsequent reads and writes. > > > */ > > > error_report("Gluster failed to notify QEMU about IO completion"); > > > > > > qemu_mutex_lock_iothread(); /* We are in gluster thread context */ > > > acb->common.cb(acb->common.opaque, -EIO); > > > qemu_aio_release(acb); > > > close(s->fds[GLUSTER_FD_READ]); > > > close(s->fds[GLUSTER_FD_WRITE]); > > > > > > Is it safe to close the fds? There is a race here: > > > > > > 1. Another thread opens a new file descriptor and gets GLUSTER_FD_READ or > > > GLUSTER_FD_WRITE's old fd value. > > > 2. Another gluster thread invokes the callback and does > > > qemu_write_full(s->fds[GLUSTER_FD_WRITE], ...). > > > > > > Since the mutex doesn't protect s->fds[] this is possible. > > > > > > Maybe a simpler solution for request completion is: > > > > > > 1. Linked list of completed acbs. > > > 2. Mutex to protect the linked list. > > > 3. EventNotifier to signal iothread. > > > > We could just use a bottom half, too. Add a bottom half to acb, > > schedule it in gluster_finish_aiocb, delete it in the bottom half's own > > callback. > > gluster_finish_aiocb gets called from gluster thread, is it safe to create > and schedule a bh from such a thread ? > > In my first implementation (http://lists.gnu.org/archive/html/qemu-devel/2012-06/msg01748.html), I was using a BH from qemu read side thread (the thread > that would respond to pipe write from gluster callback thread). That > implementation was based on rbd and I later dropped the BH part since it > looked like a round about way of completing the aio when we are already using > the pipe mechanism for aio completion. Recent patches made creating and scheduling a BH thread-safe. I think Paolo's idea is better than mine. Stefan ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb 2013-08-22 7:48 ` Stefan Hajnoczi @ 2013-08-22 9:06 ` Paolo Bonzini 2013-08-22 9:55 ` Bharata B Rao 0 siblings, 1 reply; 22+ messages in thread From: Paolo Bonzini @ 2013-08-22 9:06 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Kevin Wolf, Vijay Bellur, Stefan Hajnoczi, qemu-devel, Bharata B Rao, Asias He, MORITA Kazutaka Il 22/08/2013 09:48, Stefan Hajnoczi ha scritto: >> > gluster_finish_aiocb gets called from gluster thread, is it safe to create >> > and schedule a bh from such a thread ? >> > >> > In my first implementation (http://lists.gnu.org/archive/html/qemu-devel/2012-06/msg01748.html), I was using a BH from qemu read side thread (the thread >> > that would respond to pipe write from gluster callback thread). That >> > implementation was based on rbd and I later dropped the BH part since it >> > looked like a round about way of completing the aio when we are already using >> > the pipe mechanism for aio completion. > Recent patches made creating and scheduling a BH thread-safe. I thought scheduling BHs was always thread safe? > I think Paolo's idea is better than mine. Both are fine, they are just different. Mine is simpler because it leaves list management to the BH code. But since we are at it, we should simplify the code and uniformly use a bottom half for both successful and erroneous completions. This applies to both ideas. Maybe an even simpler patch would be to just abort if the GLUSTER_FD_WRITE write fails. Under what circumstances could it happen? Paolo ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb 2013-08-22 9:06 ` Paolo Bonzini @ 2013-08-22 9:55 ` Bharata B Rao 2013-08-22 10:00 ` Paolo Bonzini 0 siblings, 1 reply; 22+ messages in thread From: Bharata B Rao @ 2013-08-22 9:55 UTC (permalink / raw) To: Paolo Bonzini Cc: Kevin Wolf, Vijay Bellur, Stefan Hajnoczi, qemu-devel, Stefan Hajnoczi, Asias He, MORITA Kazutaka On Thu, Aug 22, 2013 at 11:06:47AM +0200, Paolo Bonzini wrote: > Il 22/08/2013 09:48, Stefan Hajnoczi ha scritto: > >> > gluster_finish_aiocb gets called from gluster thread, is it safe to create > >> > and schedule a bh from such a thread ? > >> > > >> > In my first implementation (http://lists.gnu.org/archive/html/qemu-devel/2012-06/msg01748.html), I was using a BH from qemu read side thread (the thread > >> > that would respond to pipe write from gluster callback thread). That > >> > implementation was based on rbd and I later dropped the BH part since it > >> > looked like a round about way of completing the aio when we are already using > >> > the pipe mechanism for aio completion. > > Recent patches made creating and scheduling a BH thread-safe. > > I thought scheduling BHs was always thread safe? > > > I think Paolo's idea is better than mine. > > Both are fine, they are just different. Mine is simpler because it > leaves list management to the BH code. > > But since we are at it, we should simplify the code and uniformly use a > bottom half for both successful and erroneous completions. This applies > to both ideas. > > Maybe an even simpler patch would be to just abort if the > GLUSTER_FD_WRITE write fails. This was the first apporach I had. I used to abort when writes to pipe fail. But there were concerns raised about handling the failures gracefully and hence we ended up doing all that error handling of completing the aio with -EIO, closing the pipe and making the disk inaccessible. > Under what circumstances could it happen? Not very sure, I haven't seen that happening. I had to manually inject faults to test this error path and verify the graceful recovery. Regards, Bharata. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb
From: Paolo Bonzini @ 2013-08-22 10:00 UTC
To: bharata
Cc: Kevin Wolf, Vijay Bellur, Stefan Hajnoczi, qemu-devel, Stefan Hajnoczi,
    Asias He, MORITA Kazutaka

Il 22/08/2013 11:55, Bharata B Rao ha scritto:
> This was the first apporach I had. I used to abort when writes to pipe
> fail. But there were concerns raised about handling the failures gracefully
> and hence we ended up doing all that error handling of completing the aio
> with -EIO, closing the pipe and making the disk inaccessible.
>
>>> Under what circumstances could it happen?
>
> Not very sure, I haven't seen that happening. I had to manually inject
> faults to test this error path and verify the graceful recovery.

Looking at write(2), it looks like it is impossible:

EAGAIN or EWOULDBLOCK
        can't happen, blocking file descriptor

EBADF, EPIPE
        shouldn't happen since the device is drained before
        calling qemu_gluster_close.

EDESTADDRREQ, EDQUOT, EFBIG, EIO, ENOSPC
        cannot happen for pipes

EFAULT
        abort would be fine

EINTR
        handled by qemu_write_full

EINVAL
        cannot happen (unless the pipe is closed and the file
        descriptor recycled, but see EBADF above)

The pipe(7) man page doesn't seem to add any more errors.

Paolo
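Of the cases listed above, the EPIPE/EBADF family is the one that is easy to reproduce in isolation: a short write to a healthy blocking pipe either completes in full or blocks, while a write after the read end has been closed fails. A small standalone demo, not QEMU code:

    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        char byte = 'x';
        ssize_t n;

        signal(SIGPIPE, SIG_IGN);          /* report EPIPE instead of dying */

        if (pipe(fds) < 0) {
            return 1;
        }

        /* With the reader open, a small blocking write either completes in
         * full or blocks; it does not fail part-way. */
        n = write(fds[1], &byte, 1);
        printf("write with reader open:   %zd\n", n);

        /* Once the read end is gone, the same write fails. */
        close(fds[0]);
        n = write(fds[1], &byte, 1);
        printf("write with reader closed: %zd (%s)\n",
               n, n < 0 ? strerror(errno) : "ok");
        return 0;
    }

In other words, these errors only arise once the descriptors have been closed, which is exactly the state the error path itself creates for any other in-flight callback and is one reason Stefan flags those close() calls.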
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb 2013-08-22 10:00 ` Paolo Bonzini @ 2013-08-22 10:28 ` Bharata B Rao 2013-08-22 11:15 ` Paolo Bonzini 0 siblings, 1 reply; 22+ messages in thread From: Bharata B Rao @ 2013-08-22 10:28 UTC (permalink / raw) To: Paolo Bonzini Cc: Kevin Wolf, Vijay Bellur, Stefan Hajnoczi, qemu-devel, Stefan Hajnoczi, Asias He, MORITA Kazutaka On Thu, Aug 22, 2013 at 12:00:48PM +0200, Paolo Bonzini wrote: > Il 22/08/2013 11:55, Bharata B Rao ha scritto: > > This was the first apporach I had. I used to abort when writes to pipe > > fail. But there were concerns raised about handling the failures gracefully > > and hence we ended up doing all that error handling of completing the aio > > with -EIO, closing the pipe and making the disk inaccessible. > > > >> > Under what circumstances could it happen? > > Not very sure, I haven't seen that happening. I had to manually inject > > faults to test this error path and verify the graceful recovery. > > Looking at write(2), it looks like it is impossible > > EAGAIN or EWOULDBLOCK > can't happen, blocking file descriptor > > EBADF, EPIPE > shouldn't happen since the device is drained before > calling qemu_gluster_close. > > EDESTADDRREQ, EDQUOT, EFBIG, EIO, ENOSPC > cannot happen for pipes > > EFAULT > abort would be fine In the case where we have separate system and data disks and if error (EFAULT) happens for the data disk, don't we want to keep the VM up by gracefully disabling IO to the data disk ? I remember this was one of the motivations to handle this failure. Regards, Bharata. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb 2013-08-22 10:28 ` Bharata B Rao @ 2013-08-22 11:15 ` Paolo Bonzini 2013-08-22 13:25 ` Bharata B Rao 0 siblings, 1 reply; 22+ messages in thread From: Paolo Bonzini @ 2013-08-22 11:15 UTC (permalink / raw) To: bharata Cc: Kevin Wolf, Vijay Bellur, Stefan Hajnoczi, qemu-devel, Stefan Hajnoczi, Asias He, MORITA Kazutaka Il 22/08/2013 12:28, Bharata B Rao ha scritto: > On Thu, Aug 22, 2013 at 12:00:48PM +0200, Paolo Bonzini wrote: >> Il 22/08/2013 11:55, Bharata B Rao ha scritto: >>> This was the first apporach I had. I used to abort when writes to pipe >>> fail. But there were concerns raised about handling the failures gracefully >>> and hence we ended up doing all that error handling of completing the aio >>> with -EIO, closing the pipe and making the disk inaccessible. >>> >>>>> Under what circumstances could it happen? >>> Not very sure, I haven't seen that happening. I had to manually inject >>> faults to test this error path and verify the graceful recovery. >> >> Looking at write(2), it looks like it is impossible >> >> EAGAIN or EWOULDBLOCK >> can't happen, blocking file descriptor >> >> EBADF, EPIPE >> shouldn't happen since the device is drained before >> calling qemu_gluster_close. >> >> EDESTADDRREQ, EDQUOT, EFBIG, EIO, ENOSPC >> cannot happen for pipes >> >> EFAULT >> abort would be fine > > In the case where we have separate system and data disks and if error (EFAULT) > happens for the data disk, don't we want to keep the VM up by gracefully > disabling IO to the data disk ? EFAULT means the buffer address is invalid, I/O error would be EIO, but... > I remember this was one of the motivations to > handle this failure. ... this write is on the pipe, not on a disk. Paolo ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb 2013-08-22 11:15 ` Paolo Bonzini @ 2013-08-22 13:25 ` Bharata B Rao 2013-08-22 13:27 ` Paolo Bonzini 0 siblings, 1 reply; 22+ messages in thread From: Bharata B Rao @ 2013-08-22 13:25 UTC (permalink / raw) To: Paolo Bonzini Cc: Kevin Wolf, Vijay Bellur, Stefan Hajnoczi, qemu-devel, Stefan Hajnoczi, Asias He, MORITA Kazutaka On Thu, Aug 22, 2013 at 01:15:59PM +0200, Paolo Bonzini wrote: > Il 22/08/2013 12:28, Bharata B Rao ha scritto: > > On Thu, Aug 22, 2013 at 12:00:48PM +0200, Paolo Bonzini wrote: > >> Il 22/08/2013 11:55, Bharata B Rao ha scritto: > >>> This was the first apporach I had. I used to abort when writes to pipe > >>> fail. But there were concerns raised about handling the failures gracefully > >>> and hence we ended up doing all that error handling of completing the aio > >>> with -EIO, closing the pipe and making the disk inaccessible. > >>> > >>>>> Under what circumstances could it happen? > >>> Not very sure, I haven't seen that happening. I had to manually inject > >>> faults to test this error path and verify the graceful recovery. > >> > >> Looking at write(2), it looks like it is impossible > >> > >> EAGAIN or EWOULDBLOCK > >> can't happen, blocking file descriptor > >> > >> EBADF, EPIPE > >> shouldn't happen since the device is drained before > >> calling qemu_gluster_close. > >> > >> EDESTADDRREQ, EDQUOT, EFBIG, EIO, ENOSPC > >> cannot happen for pipes > >> > >> EFAULT > >> abort would be fine > > > > In the case where we have separate system and data disks and if error (EFAULT) > > happens for the data disk, don't we want to keep the VM up by gracefully > > disabling IO to the data disk ? > > EFAULT means the buffer address is invalid, I/O error would be EIO, but... > > > I remember this was one of the motivations to > > handle this failure. > > ... this write is on the pipe, not on a disk. Right. Failure to complete the write on the pipe means that IO done to the disk didn't complete and hence to the VM it is essentially a disk IO failure. That's the reason we return -EIO and make the disk inaccessible when this failure happens. My question was if it is ok to abort the VM when IO to one of the disks fails ? But, if you think it is not worth handling such errors then may be we can drop this elaborate and race-prone error recovery and just abort. Regards, Bharata. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb 2013-08-22 13:25 ` Bharata B Rao @ 2013-08-22 13:27 ` Paolo Bonzini 2013-08-22 14:01 ` Bharata B Rao 0 siblings, 1 reply; 22+ messages in thread From: Paolo Bonzini @ 2013-08-22 13:27 UTC (permalink / raw) To: bharata Cc: Kevin Wolf, Vijay Bellur, Stefan Hajnoczi, qemu-devel, Stefan Hajnoczi, Asias He, MORITA Kazutaka Il 22/08/2013 15:25, Bharata B Rao ha scritto: > On Thu, Aug 22, 2013 at 01:15:59PM +0200, Paolo Bonzini wrote: >> Il 22/08/2013 12:28, Bharata B Rao ha scritto: >>> On Thu, Aug 22, 2013 at 12:00:48PM +0200, Paolo Bonzini wrote: >>>> Il 22/08/2013 11:55, Bharata B Rao ha scritto: >>>>> This was the first apporach I had. I used to abort when writes to pipe >>>>> fail. But there were concerns raised about handling the failures gracefully >>>>> and hence we ended up doing all that error handling of completing the aio >>>>> with -EIO, closing the pipe and making the disk inaccessible. >>>>> >>>>>>> Under what circumstances could it happen? >>>>> Not very sure, I haven't seen that happening. I had to manually inject >>>>> faults to test this error path and verify the graceful recovery. >>>> >>>> Looking at write(2), it looks like it is impossible >>>> >>>> EAGAIN or EWOULDBLOCK >>>> can't happen, blocking file descriptor >>>> >>>> EBADF, EPIPE >>>> shouldn't happen since the device is drained before >>>> calling qemu_gluster_close. >>>> >>>> EDESTADDRREQ, EDQUOT, EFBIG, EIO, ENOSPC >>>> cannot happen for pipes >>>> >>>> EFAULT >>>> abort would be fine >>> >>> In the case where we have separate system and data disks and if error (EFAULT) >>> happens for the data disk, don't we want to keep the VM up by gracefully >>> disabling IO to the data disk ? >> >> EFAULT means the buffer address is invalid, I/O error would be EIO, but... >> >>> I remember this was one of the motivations to >>> handle this failure. >> >> ... this write is on the pipe, not on a disk. > > Right. Failure to complete the write on the pipe means that IO done to the > disk didn't complete and hence to the VM it is essentially a disk IO failure. The question is, can the write to the pipe actually fail? Not just "in practice not" according to the documented errors, it seems to me that it cannot. > That's the reason we return -EIO and make the disk inaccessible when this > failure happens. > > My question was if it is ok to abort the VM when IO to one of the disks fails ? Absolutely not, but here the code seems dead to me. Paolo > But, if you think it is not worth handling such errors then may be we can drop > this elaborate and race-prone error recovery and just abort. > > Regards, > Bharata. > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb 2013-08-22 13:27 ` Paolo Bonzini @ 2013-08-22 14:01 ` Bharata B Rao 2013-08-22 14:52 ` Paolo Bonzini 0 siblings, 1 reply; 22+ messages in thread From: Bharata B Rao @ 2013-08-22 14:01 UTC (permalink / raw) To: Paolo Bonzini Cc: Kevin Wolf, Vijay Bellur, Stefan Hajnoczi, qemu-devel, Stefan Hajnoczi, Asias He, MORITA Kazutaka On Thu, Aug 22, 2013 at 03:27:35PM +0200, Paolo Bonzini wrote: > >>>> Looking at write(2), it looks like it is impossible > >>>> > >>>> EAGAIN or EWOULDBLOCK > >>>> can't happen, blocking file descriptor > >>>> > >>>> EBADF, EPIPE > >>>> shouldn't happen since the device is drained before > >>>> calling qemu_gluster_close. > >>>> > >>>> EDESTADDRREQ, EDQUOT, EFBIG, EIO, ENOSPC > >>>> cannot happen for pipes > >>>> > >>>> EFAULT > >>>> abort would be fine > >>> > >>> In the case where we have separate system and data disks and if error (EFAULT) > >>> happens for the data disk, don't we want to keep the VM up by gracefully > >>> disabling IO to the data disk ? > >> > >> EFAULT means the buffer address is invalid, I/O error would be EIO, but... > >> > >>> I remember this was one of the motivations to > >>> handle this failure. > >> > >> ... this write is on the pipe, not on a disk. > > > > Right. Failure to complete the write on the pipe means that IO done to the > > disk didn't complete and hence to the VM it is essentially a disk IO failure. > > The question is, can the write to the pipe actually fail? Not just "in > practice not" according to the documented errors, it seems to me that it > cannot. May be I am dragging this a bit, but since we are at it, let me make one last observation here :) The buffer in question here is the GlusterAIOCB pointer that gets passed back and forth between QEMU and gluster thro' glfs_pwritev_async and associated callback gluster_finish_aiocb. Isn't there a possibility that gluster will not give us back the same pointer during callback due to some errors on the gluster side ? Unlikely but possible ? Regards, Bharata. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb 2013-08-22 14:01 ` Bharata B Rao @ 2013-08-22 14:52 ` Paolo Bonzini 0 siblings, 0 replies; 22+ messages in thread From: Paolo Bonzini @ 2013-08-22 14:52 UTC (permalink / raw) To: bharata Cc: Kevin Wolf, Vijay Bellur, Stefan Hajnoczi, qemu-devel, Stefan Hajnoczi, Asias He, MORITA Kazutaka Il 22/08/2013 16:01, Bharata B Rao ha scritto: >>>> EFAULT means the buffer address is invalid, I/O error would be EIO, but... >>>> >>>>> I remember this was one of the motivations to >>>>> handle this failure. >>>> >>>> ... this write is on the pipe, not on a disk. >>> >>> Right. Failure to complete the write on the pipe means that IO done to the >>> disk didn't complete and hence to the VM it is essentially a disk IO failure. >> >> The question is, can the write to the pipe actually fail? Not just "in >> practice not" according to the documented errors, it seems to me that it >> cannot. > > May be I am dragging this a bit, but since we are at it, let me make one last > observation here :) > > The buffer in question here is the GlusterAIOCB pointer that gets passed > back and forth between QEMU and gluster thro' glfs_pwritev_async and associated > callback gluster_finish_aiocb. No, it's not: retval = qemu_write_full(s->fds[GLUSTER_FD_WRITE], &acb, sizeof(acb)); The pointer that is passed to qemu_write_full is on the stack; it's the address of the acb variable. The _content_ of the buffer (which the kernel doesn't look at, so it's not relevant for generating EFAULT) is the GlusterAIOCB pointer. > Isn't there a possibility that gluster will > not give us back the same pointer during callback due to some errors on the > gluster side ? Unlikely but possible ? Then we would have already crashed on the line before: acb->ret = ret; which writes to the hypothetically corrupted pointer. But we cannot protect against memory corruption, or against bugs that violate the API contract at such a fundamental level. QEMU will SIGSEGV, but it wouldn't be the first time. Paolo ^ permalink raw reply [flat|nested] 22+ messages in thread
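The distinction being made here can be shown with a runnable toy (not QEMU code): the buffer handed to write() is the address of the local variable that holds the pointer, so only that address has to be valid; the pointer value inside it is just payload bytes and is never dereferenced by the kernel.

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        void *acb = (void *)0xdeadbeef;   /* deliberately bogus pointer value */
        void *received = NULL;
        ssize_t n;

        if (pipe(fds) < 0) {
            return 1;
        }

        /* write() copies sizeof(acb) bytes starting at &acb, a valid stack
         * address; the pointer value being transferred is never dereferenced,
         * so even a garbage value goes through and cannot cause EFAULT. */
        n = write(fds[1], &acb, sizeof(acb));
        printf("wrote %zd bytes\n", n);

        n = read(fds[0], &received, sizeof(received));
        printf("read %zd bytes, pointer value %p\n", n, received);
        return 0;
    }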
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb
From: Bharata B Rao @ 2013-08-23  6:48 UTC
To: Paolo Bonzini
Cc: Kevin Wolf, Vijay Bellur, Stefan Hajnoczi, qemu-devel, Stefan Hajnoczi,
    Asias He, MORITA Kazutaka

On Wed, Aug 21, 2013 at 05:40:11PM +0200, Paolo Bonzini wrote:
>
> We could just use a bottom half, too.  Add a bottom half to acb,
> schedule it in gluster_finish_aiocb, delete it in the bottom half's own
> callback.

I tried this approach (the patch at the end of this mail), but see this
occasional errors, doesn't happen always. Any clues on what am I missing here ?

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffcbfff700 (LWP 11301)]
0x000055555597156b in event_notifier_set (e=0xa4)
    at util/event_notifier-posix.c:97
97          ret = write(e->wfd, &value, sizeof(value));
(gdb) bt
#0  0x000055555597156b in event_notifier_set (e=0xa4)
    at util/event_notifier-posix.c:97
#1  0x00005555555dd3cd in aio_notify (ctx=0x0) at async.c:233
#2  0x00005555555dd00f in qemu_bh_schedule (bh=0x7fff300008e0) at async.c:128
#3  0x00005555556002da in gluster_finish_aiocb (fd=0x5555562ae5d0, ret=4096,
    arg=0x7fffd00419c0) at block/gluster.c:409
#4  0x00007ffff5e70cdc in glfs_io_async_cbk (ret=4096, frame=0x0,
    data=0x555556443ee0) at glfs-fops.c:567
#5  0x00007ffff5c2843e in synctask_wrap (old_task=0x555556365940)
    at syncop.c:133
#6  0x00007ffff3b03370 in ?? () from /lib64/libc.so.6
#7  0x0000000000000000 in ?? ()
(gdb) up
#1  0x00005555555dd3cd in aio_notify (ctx=0x0) at async.c:233
233         event_notifier_set(&ctx->notifier);
(gdb) up
#2  0x00005555555dd00f in qemu_bh_schedule (bh=0x7fff300008e0) at async.c:128
128         aio_notify(bh->ctx);
(gdb) p *bh
$1 = {ctx = 0x0, cb = 0x5555555ffdcd <qemu_gluster_aio_bh>, opaque =
    0x7fffd00419c0, next = 0x555556345e70, scheduled = false, idle = false,
    deleted = true}


gluster: Use BH mechanism for AIO completion

Replace the existing pipe based AIO completion handling with a BH based
method.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 block/gluster.c | 69 ++++++-------------------------------------------------
 1 file changed, 8 insertions(+), 61 deletions(-)

diff --git a/block/gluster.c b/block/gluster.c
index 46f36f8..598b335 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -30,7 +30,6 @@ typedef struct GlusterAIOCB {
 
 typedef struct BDRVGlusterState {
     struct glfs *glfs;
-    int fds[2];
     struct glfs_fd *fd;
     int event_reader_pos;
     GlusterAIOCB *event_acb;
@@ -231,12 +230,13 @@ out:
     return NULL;
 }
 
-static void qemu_gluster_complete_aio(GlusterAIOCB *acb, BDRVGlusterState *s)
+static void qemu_gluster_aio_bh(void *opaque)
 {
     int ret;
+    GlusterAIOCB *acb = opaque;
     bool *finished = acb->finished;
     BlockDriverCompletionFunc *cb = acb->common.cb;
-    void *opaque = acb->common.opaque;
+    void *cb_opaque = acb->common.opaque;
 
     if (!acb->ret || acb->ret == acb->size) {
         ret = 0; /* Success */
@@ -246,33 +246,15 @@ static void qemu_gluster_complete_aio(GlusterAIOCB *acb, BDRVGlusterState *s)
         ret = -EIO; /* Partial read/write - fail it */
     }
 
+    qemu_bh_delete(acb->bh);
     qemu_aio_release(acb);
-    cb(opaque, ret);
+
+    cb(cb_opaque, ret);
     if (finished) {
         *finished = true;
     }
 }
 
-static void qemu_gluster_aio_event_reader(void *opaque)
-{
-    BDRVGlusterState *s = opaque;
-    ssize_t ret;
-
-    do {
-        char *p = (char *)&s->event_acb;
-
-        ret = read(s->fds[GLUSTER_FD_READ], p + s->event_reader_pos,
-                   sizeof(s->event_acb) - s->event_reader_pos);
-        if (ret > 0) {
-            s->event_reader_pos += ret;
-            if (s->event_reader_pos == sizeof(s->event_acb)) {
-                s->event_reader_pos = 0;
-                qemu_gluster_complete_aio(s->event_acb, s);
-            }
-        }
-    } while (ret < 0 && errno == EINTR);
-}
-
 /* TODO Convert to fine grained options */
 static QemuOptsList runtime_opts = {
     .name = "gluster",
@@ -329,17 +311,7 @@ static int qemu_gluster_open(BlockDriverState *bs, QDict *options,
     s->fd = glfs_open(s->glfs, gconf->image, open_flags);
     if (!s->fd) {
         ret = -errno;
-        goto out;
-    }
-
-    ret = qemu_pipe(s->fds);
-    if (ret < 0) {
-        ret = -errno;
-        goto out;
     }
-    fcntl(s->fds[GLUSTER_FD_READ], F_SETFL, O_NONBLOCK);
-    qemu_aio_set_fd_handler(s->fds[GLUSTER_FD_READ],
-                            qemu_gluster_aio_event_reader, NULL, s);
 
 out:
     qemu_opts_del(opts);
@@ -417,31 +389,10 @@ static const AIOCBInfo gluster_aiocb_info = {
 static void gluster_finish_aiocb(struct glfs_fd *fd, ssize_t ret, void *arg)
 {
     GlusterAIOCB *acb = (GlusterAIOCB *)arg;
-    BlockDriverState *bs = acb->common.bs;
-    BDRVGlusterState *s = bs->opaque;
-    int retval;
 
     acb->ret = ret;
-    retval = qemu_write_full(s->fds[GLUSTER_FD_WRITE], &acb, sizeof(acb));
-    if (retval != sizeof(acb)) {
-        /*
-         * Gluster AIO callback thread failed to notify the waiting
-         * QEMU thread about IO completion.
-         *
-         * Complete this IO request and make the disk inaccessible for
-         * subsequent reads and writes.
-         */
-        error_report("Gluster failed to notify QEMU about IO completion");
-
-        qemu_mutex_lock_iothread(); /* We are in gluster thread context */
-        acb->common.cb(acb->common.opaque, -EIO);
-        qemu_aio_release(acb);
-        close(s->fds[GLUSTER_FD_READ]);
-        close(s->fds[GLUSTER_FD_WRITE]);
-        qemu_aio_set_fd_handler(s->fds[GLUSTER_FD_READ], NULL, NULL, NULL);
-        bs->drv = NULL; /* Make the disk inaccessible */
-        qemu_mutex_unlock_iothread();
-    }
+    acb->bh = qemu_bh_new(qemu_gluster_aio_bh, acb);
+    qemu_bh_schedule(acb->bh);
 }
 
 static BlockDriverAIOCB *qemu_gluster_aio_rw(BlockDriverState *bs,
@@ -592,10 +543,6 @@ static void qemu_gluster_close(BlockDriverState *bs)
 {
     BDRVGlusterState *s = bs->opaque;
 
-    close(s->fds[GLUSTER_FD_READ]);
-    close(s->fds[GLUSTER_FD_WRITE]);
-    qemu_aio_set_fd_handler(s->fds[GLUSTER_FD_READ], NULL, NULL, NULL);
-
     if (s->fd) {
         glfs_close(s->fd);
         s->fd = NULL;
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb
From: Paolo Bonzini @ 2013-08-23  7:33 UTC
To: bharata
Cc: Kevin Wolf, Vijay Bellur, Stefan Hajnoczi, qemu-devel, Stefan Hajnoczi,
    Asias He, MORITA Kazutaka

Il 23/08/2013 08:48, Bharata B Rao ha scritto:
> On Wed, Aug 21, 2013 at 05:40:11PM +0200, Paolo Bonzini wrote:
>>
>> We could just use a bottom half, too.  Add a bottom half to acb,
>> schedule it in gluster_finish_aiocb, delete it in the bottom half's own
>> callback.
>
> I tried this approach (the patch at the end of this mail), but see this
> occasional errors, doesn't happen always. Any clues on what am I missing here ?
>
> [...]
>
> (gdb) p *bh
> $1 = {ctx = 0x0, cb = 0x5555555ffdcd <qemu_gluster_aio_bh>, opaque =
>     0x7fffd00419c0, next = 0x555556345e70, scheduled = false, idle = false,
>     deleted = true}

This looks like a use-after-free, with bh->ctx corrupted when freeing
the bottom half.  But it's not at all obvious how it can happen.

I suggest using MALLOC_PERTURB_=42 to check this theory (if it is
correct, most fields will be something like 0x2a2a2a2a2a2a2a2a).  But I
don't see anything clearly wrong in the patch...  Thus perhaps it is
simpler to just remove the unreachable error handling code.

Paolo

> [Bharata's patch quoted in full; see the previous message.]
* Re: [Qemu-devel] [PATCH] block: Fix race in gluster_finish_aiocb
From: Bharata B Rao @ 2013-08-23  8:11 UTC
To: Paolo Bonzini
Cc: Kevin Wolf, Vijay Bellur, Stefan Hajnoczi, qemu-devel, Stefan Hajnoczi,
    Asias He, MORITA Kazutaka

On Fri, Aug 23, 2013 at 09:33:21AM +0200, Paolo Bonzini wrote:
>> (gdb) p *bh
>> $1 = {ctx = 0x0, cb = 0x5555555ffdcd <qemu_gluster_aio_bh>, opaque =
>>     0x7fffd00419c0, next = 0x555556345e70, scheduled = false, idle = false,
>>     deleted = true}
>
> This looks like a use-after-free, with bh->ctx corrupted when freeing
> the bottom half.  But it's not at all obvious how it can happen.
>
> I suggest using MALLOC_PERTURB_=42 to check this theory (if it is
> correct, most fields will be something like 0x2a2a2a2a2a2a2a2a).  But I
> don't see anything clearly wrong in the patch...  Thus perhaps it is
> simpler to just remove the unreachable error handling code.

(gdb) p *bh
$1 = {ctx = 0x0, cb = 0x2a2a2a2a2a2a2a2a, opaque = 0x2a2a2a2a2a2a2a2a,
    next = 0x2a2a2a2a2a2a2a2a, scheduled = false, idle = false,
    deleted = true}

Maybe, as noted above, I should just remove the unreachable error
handling code for now.

Regards,
Bharata.
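The 0x2a2a2a2a2a2a2a2a pattern is glibc's malloc perturbation at work: with MALLOC_PERTURB_=42 in the environment, free() fills released blocks with the byte 42 (0x2a), so reads through a dangling pointer become obvious. A deliberately broken standalone demo, using an invented struct rather than QEMU's real bottom-half type, and best built without optimization (e.g. gcc -O0):

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        long payload[4];   /* stand-in for the fields of a freed object */
    } FakeBH;

    int main(void)
    {
        FakeBH *bh = malloc(sizeof(*bh));

        for (int i = 0; i < 4; i++) {
            bh->payload[i] = 0x1111111111111111L;
        }
        free(bh);

        /* Deliberate use-after-free, for demonstration only.  With
         * MALLOC_PERTURB_=42 glibc overwrites freed memory with 0x2a bytes
         * (the allocator may recycle the first word or two for its own
         * free-list links). */
        printf("payload[2]=0x%lx payload[3]=0x%lx\n",
               bh->payload[2], bh->payload[3]);
        return 0;
    }

Run as MALLOC_PERTURB_=42 ./demo, the payload words print as 0x2a2a2a2a2a2a2a2a, while the leading word of a freed block may instead show the allocator's own free-list pointer, which is consistent with ctx reading as 0x0 in the gdb dumps above; without the environment variable the fields keep their old contents and the corruption stays invisible.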