From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bruce Edge
Subject: Re: domU and dom0 hung with Xen console interrupt binding showing in-flight=1, (---M)
Date: Thu, 19 Aug 2010 06:42:36 -0700
References: <4C6BBA480200007800010949@vpn.id2.novell.com>
Sender: xen-devel-bounces@lists.xensource.com
To: Keir Fraser
Cc: Xen-devel, Jan Beulich
List-Id: xen-devel@lists.xenproject.org

On Wed, Aug 18, 2010 at 2:40 AM, Keir Fraser wrote:

> On 18/08/2010 09:47, "Jan Beulich" wrote:
>
> > Yes, that was what I was trying to hint at, but I wasn't sure whether
> > calling ->end() here has any unintended side effects and/or requires
> > any extra care (like preventing a subsequent guest initiated EOI to
> > call ->end() again).
>
> Oh you can't naively call ->end() from the time-out handler. You would need
> to do something like this in irq_guest_eoi_timer_fn:
>  spin_lock(&desc->lock);
>  if ( (desc->status & IRQ_GUEST) &&
>       (action->ack_type == ACKTYPE_EOI) ) {
>    cpu_eoi_map = action->cpu_eoi_map;
>    spin_unlock(&desc->lock);
>    on_selected_cpus(&cpu_eoi_map, set_eoi_ready, desc, 0);
>    spin_lock(&desc->lock);
>  }
>  _irq_guest_eoi(desc);
>  spin_unlock(&desc->lock);
>
> I don't think the IRQ_GUEST_EOI_PENDING flag or any of that stuff is needed
> for the ACKTYPE_EOI case. I'd make the handling of that, calling of
> ->disable/->enable and so on, dependent on ACKTYPE_NONE.
>
> > While looking at this I came across another thing I don't understand:
> > __pirq_guest_eoi(), for the ACKTYPE_EOI case, calls __set_eoi_ready()
> > in a cpu_test_and_clear() conditional, but __set_eoi_ready() bails
> > out if it finds !cpu_test_and_clear() on the same bitmap - what's the
> > point of calling __set_eoi_ready() here then (or what am I missing)?
>
> __pirq_guest_eoi() acts on a private on-stack copy of cpu_eoi_map. This is
> because on_selected_cpus() cannot be called with desc->lock held. But as
> soon as desc->lock is released, the desc->action structure can be freed by
> another CPU, so it would be invalid to reference action->cpu_eoi_map
> directly after desc->lock is released.
>
>  -- Keir

Is there any more information that I can provide that would be helpful in
diagnosing the direct cause and the appropriate fix?
Possibly adding instrumentation or trace code to detect the trigger conditions?
This is very repeatable on our target systems after a few hours of load.

Thanks

-Bruce
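The private-copy pattern described in the quoted reply (take the lock, copy cpu_eoi_map, drop the lock, make the cross-CPU call on the copy, retake the lock) can be illustrated outside the hypervisor. What follows is only a minimal single-threaded sketch under stated assumptions, not Xen code: every name in it (desc_sim, action_sim, cpumask_sim_t, cross_call_sim, timeout_eoi_sketch) is a hypothetical stand-in, the "lock" is a plain flag, and the concurrent free of the action structure is simulated by clearing the pointer inside the cross-call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; these are NOT Xen's types. */
typedef uint64_t cpumask_sim_t;        /* one bit per simulated CPU */

struct action_sim {
    cpumask_sim_t cpu_eoi_map;         /* CPUs that still owe an EOI */
};

struct desc_sim {
    int locked;                        /* 1 while the simulated lock is held */
    struct action_sim *action;         /* may vanish once the lock is dropped */
    unsigned int eoi_count;            /* EOIs delivered, for checking */
};

static void desc_lock(struct desc_sim *d)   { assert(!d->locked); d->locked = 1; }
static void desc_unlock(struct desc_sim *d) { assert(d->locked);  d->locked = 0; }

/*
 * Stand-in for a cross-CPU call. It asserts that the lock is not held,
 * and it simulates another CPU freeing the action while we are unlocked.
 */
static void cross_call_sim(const cpumask_sim_t *mask, struct desc_sim *d)
{
    assert(!d->locked);
    d->action = NULL;                  /* simulated concurrent free */

    for (unsigned int cpu = 0; cpu < 64; cpu++)
        if (*mask & ((cpumask_sim_t)1 << cpu))
            d->eoi_count++;            /* "deliver" one EOI per set bit */
}

/* Snapshot under the lock, drop it, act on the private copy, retake it. */
static unsigned int timeout_eoi_sketch(struct desc_sim *d)
{
    cpumask_sim_t snapshot;

    desc_lock(d);
    if (d->action != NULL) {
        snapshot = d->action->cpu_eoi_map;   /* copy while still locked */
        desc_unlock(d);
        cross_call_sim(&snapshot, d);        /* never touches d->action */
        desc_lock(d);
    }
    desc_unlock(d);
    return d->eoi_count;
}
```

Reading d->action->cpu_eoi_map after the unlock, rather than the snapshot, would dereference a NULL pointer in this simulation, which is the user-space analogue of the use-after-free the quoted explanation warns about.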
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel