From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jeremy Fitzhardinge
Subject: Re: [GIT PULL] Fix lost interrupt race in Xen event channels
Date: Fri, 27 Aug 2010 16:43:44 -0700
Message-ID: <4C784DB0.3000103@goop.org>
References: <4C743B2C.8070208@goop.org>
 <4C74E7C802000078000120C0@vpn.id2.novell.com>
 <4C7558E0.1060806@goop.org>
 <4C7629D10200007800012387@vpn.id2.novell.com>
 <4C769736.4050409@goop.org>
 <4C7799EB020000780001276F@vpn.id2.novell.com>
 <1282941781.26797.386.camel@agari.van.xensource.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
In-Reply-To: <1282941781.26797.386.camel@agari.van.xensource.com>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Daniel Stodden
Cc: "Xen-devel@lists.xensource.com", Tom Kopec, Jan Beulich
List-Id: xen-devel@lists.xenproject.org

On 08/27/2010 01:43 PM, Daniel Stodden wrote:
> On Fri, 2010-08-27 at 04:56 -0400, Jan Beulich wrote:
>>>>> On 26.08.10 at 18:32, Jeremy Fitzhardinge wrote:
>>> On 08/25/2010 11:46 PM, Jan Beulich wrote:
>>>> >>> On 25.08.10 at 19:54, Jeremy Fitzhardinge wrote:
>>>>> Note that this patch is specifically for upstream Xen, which doesn't
>>>>> have any pirq support in it at present.
>>>> I understand that, but saw that you had paralleling changes to the
>>>> pirq handling in your Dom0 tree.
>>>>
>>>>> However, I did consider using fasteoi, but I couldn't see how to make
>>>>> it work. The problem is that it only does a single call into the
>>>>> irq_chip for EOI after calling the interrupt handler, but there is no
>>>>> call beforehand to ack the interrupt (which means clear the event flag
>>>>> in our case). This leads to a race where an event can be lost after the
>>>>> interrupt handler has returned, but before the event flag has been
>>>>> cleared (because Xen won't set pending or call the upcall function if
>>>>> the event is already set). I guess I could pre-clear the event in the
>>>>> upcall function, but I'm not sure that's any better.
>>>> That's precisely what we're doing.
>>> You mean pre-clearing the event? OK.
>>>
>>> But aren't you still subject to the bug the switch to handle_edge_irq fixed?
>>>
>>> With handle_fasteoi_irq:
>>>
>>> cpu A                    cpu B
>>> get event
>> mask and clear event
> Argh. Right, I guess that's my fault, I was the one who came up with the
> PENDING theory, but indeed I failed to see the event masking bits.
>
> However, please read on.
>
>>> set INPROGRESS
>>> call action
>>> :
>>> :
>>>
>>> :                        get event
>> Cannot happen, event is masked (i.e. all that would happen is
>> that the event occurrence would be logged evtchn_pending).
>>
>>> :                        INPROGRESS set? -> EOI, return
>>> :
>>> action returns
>>> clear INPROGRESS
>>> EOI
>> unmask event, checking for whether the event got re-bound (and
>> doing the unmask through a hypercall if necessary), thus re-raising
>> the event in any case
> Yes. I agree. So let's come up with a new theory. Right now I'm still
> looking at xen/next. Correct me if I'm mistaken:
>
> mask_ack_pirq will:
> 1. chip->mask
> 2. chip->ack
>
> Where chip->ack will:
> 1. move_native_irq
> 2. clear_evtchn.
>
> Now if you look into move_native_irq, it will:
> 1. chip->mask (gratuitous)
> 2. move
> 3. chip->unmask (aiiiiiie).
>
> That explains why edge_irq still fixed the problem.

Good point. I guess the simplest fix in that case would have been to use
move_masked_irq()...

The current fix is not wrong, so we can leave it as-is upstream for
now. But I think I will try Jan's idea about masking/clearing in the
event upcall then using fasteoi as the handler.

    J
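
For anyone following the sequence above, here is a rough toy model of it in Python. It is only a sketch, not kernel code: Channel, mask_ack and simulate are made-up names, the "move" step is a no-op, and it assumes that an event delivered while INPROGRESS is set is consumed by the "EOI, return" path without the action running. The unmask_in_move flag selects between a move_native_irq-like step (mask, move, unmask) and a move_masked_irq-like step (move only, leaving the caller's mask in place).

```python
class Channel:
    """Toy event channel: one pending bit and one mask bit (hypothetical)."""

    def __init__(self):
        self.pending = False
        self.masked = False


def mask_ack(ch, unmask_in_move):
    """Toy version of the mask_ack sequence listed in the thread."""
    ch.masked = True            # 1. chip->mask
    # 2. chip->ack, first step: migrate the irq
    if unmask_in_move:
        ch.masked = True        # move_native_irq-like: mask (gratuitous)
        ch.masked = False       # ... move, then unmask
    # else: move_masked_irq-like, the caller's mask is left alone
    ch.pending = False          # 2. chip->ack, second step: clear_evtchn


def simulate(unmask_in_move):
    """Return (times the action ran, whether the second event was lost)."""
    ch = Channel()
    handled = 0
    lost = False

    ch.pending = True           # first event, taken by cpu A
    mask_ack(ch, unmask_in_move)
    # cpu A: set INPROGRESS, call action

    ch.pending = True           # second event arrives while the action runs
    if not ch.masked:
        # Delivered to cpu B, which sees INPROGRESS: EOI and return
        # (assumed here to consume the event without running the action).
        ch.pending = False
        lost = True

    handled += 1                # cpu A: action returns, clear INPROGRESS

    ch.masked = False           # cpu A: EOI unmasks the event
    if ch.pending:              # a still-pending event is re-raised
        ch.pending = False
        handled += 1
    return handled, lost


if __name__ == "__main__":
    print("unmask inside move:", simulate(True))
    print("stay masked:       ", simulate(False))
```

In this model the second event is only dropped when the move step unmasks the channel behind the caller's back; with the mask held until EOI, the event stays pending and is re-raised on unmask.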