From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jim Fehlig Subject: Re: [PATCH 00/12] libxl: fork: SIGCHLD flexibility Date: Sun, 26 Jan 2014 22:22:28 -0700 Message-ID: <52E5ED14.2090005@suse.com> References: <1389975845-1195-1-git-send-email-ian.jackson@eu.citrix.com> <52D9AECF.6050309@suse.com> <52DD678F.3070504@suse.com> <21214.37402.648941.864060@mariner.uk.xensource.com> <52DF57E2.2090602@suse.com> <52E09513.6060603@suse.com> <21216.62800.746512.422459@mariner.uk.xensource.com> <52E1EB97.4080007@suse.com> <21218.24466.92095.134875@mariner.uk.xensource.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <21218.24466.92095.134875@mariner.uk.xensource.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Ian Jackson Cc: xen-devel@lists.xensource.com, Ian Campbell List-Id: xen-devel@lists.xenproject.org Ian Jackson wrote: > Jim Fehlig writes ("Re: [Xen-devel] [PATCH 00/12] libxl: fork: SIGCHLD flexibility"): > >> BTW, I only see the crash when the save/restore script is running. I >> stopped the other scripts and domains, running only save/restore on a >> single domain, and see the crash rather quickly (within 10 iterations). >> > > I'll look at the libvirt code, but: > > With a recurring timeout, how can you ever know it's cancelled ? > There might be threads out there, which don't hold any locks, which > are in the process of executing a callback for a timeout. That might > be arbitrarily delayed from the pov of the rest of the program. > > E.g.: > > Thread A Thread B > > invoke some libxl operation > X do some libxl stuff > X register timeout (libxl) > XV record timeout info > X do some more libxl stuff > ... > X do some more libxl stuff > X deregister timeout (libxl internal) > X converted to request immediate timeout > XV record new timeout info > X release libvirt event loop lock > entering libvirt event loop > V observe timeout is immediate > V need to do callback > call libxl driver > > entering libvirt event loop > V observe timeout is immediate > V need to do callback > call libxl driver > call libxl > X libxl sees timeout is live > X libxl does libxl stuff > libxl driver deregisters > V record lack of timeout > free driver's timeout struct > call libxl > X libxl sees timeout is dead > X libxl does nothing > libxl driver deregisters > V CRASH due to deregistering > V already-deregistered timeout > > If this is how things are, then I think there is no sane way to use > libvirt's timeouts (!) > Looking at libvirt's default event loop impl, and the current libxl driver code, I think this is how things are :-/. But maybe you have just described a bug in the libxl driver. In the timer callback, libxlDomainObjPrivate is locked, the timeout is disabled in libvirt event loop, libxlDomainObjPrivate is unlocked, and libxl_osevent_occurred_timeout is called. Could the issue be solved by checking if the timeout is still valid in the callback, while holding a lock on libxlDomainObjPrivate? The first thread running the callback could mark the timeout invalid before releasing the lock and calling libxl_osevent_occurred_timeout. After acquiring the libxlDomainObjPrivate lock, subsequent threads running the callback would see the timer is invalid and return. Regards, Jim