From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from [140.186.70.92] (port=46040 helo=eggs.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1OIOGX-0001tY-Ov for qemu-devel@nongnu.org; Sat, 29 May 2010 11:48:08 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69) (envelope-from ) id 1OIIu2-0006e2-4i for qemu-devel@nongnu.org; Sat, 29 May 2010 06:04:27 -0400 Received: from mail-pv0-f173.google.com ([74.125.83.173]:42960) by eggs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1OIIu1-0006ds-Md for qemu-devel@nongnu.org; Sat, 29 May 2010 06:04:26 -0400 Received: by pvg12 with SMTP id 12so936065pvg.4 for ; Sat, 29 May 2010 03:04:24 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <4C00E257.8070409@web.de> References: <4BFC3028.6030303@codemonkey.ws> <4BFC44D2.2090608@web.de> <4BFD8010.3010001@web.de> <20100527061335.GE5474@redhat.com> <20100528073135.GC17805@redhat.com> <20100528204746.GC3604@redhat.com> <4C00C921.2060809@web.de> <4C00E257.8070409@web.de> From: Blue Swirl Date: Sat, 29 May 2010 10:04:04 +0000 Message-ID: Subject: Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable List-Id: qemu-devel.nongnu.org List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Jan Kiszka Cc: qemu-devel@nongnu.org, Gleb Natapov , Juan Quintela On Sat, May 29, 2010 at 9:45 AM, Jan Kiszka wrote: > Blue Swirl wrote: >> 2010/5/29 Jan Kiszka : >>> Gleb Natapov wrote: >>>> On Fri, May 28, 2010 at 08:06:45PM +0000, Blue Swirl wrote: >>>>> 2010/5/28 Gleb Natapov : >>>>>> On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote: >>>>>>> 2010/5/27 Gleb Natapov : >>>>>>>> On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote: >>>>>>>>> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka w= rote: >>>>>>>>>> Blue Swirl wrote: >>>>>>>>>>> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka = wrote: >>>>>>>>>>>> Anthony Liguori wrote: >>>>>>>>>>>>> On 05/25/2010 02:09 PM, Blue Swirl wrote: >>>>>>>>>>>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka =C2=A0wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> From: Jan Kiszka >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This allows to communicate potential IRQ coalescing during = delivery from >>>>>>>>>>>>>>> the sink back to the source. Targets that support IRQ coale= scing >>>>>>>>>>>>>>> workarounds need to register handlers that return the appro= priate >>>>>>>>>>>>>>> QEMU_IRQ_* code, and they have to propergate the code acros= s all IRQ >>>>>>>>>>>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESC= ED, it can >>>>>>>>>>>>>>> apply its workaround. If multiple sinks exist, the source m= ay only >>>>>>>>>>>>>>> consider an IRQ coalesced if all other sinks either report >>>>>>>>>>>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> No real devices are interested whether any of their output l= ines are >>>>>>>>>>>>>> even connected. This would introduce a new signal type, bidi= rectional >>>>>>>>>>>>>> multi-level, which is not correct. >>>>>>>>>>>>>> >>>>>>>>>>>>> I don't think it's really an issue of correct, but I wouldn't= disagree >>>>>>>>>>>>> to a suggestion that we ought to introduce a new signal type = for this >>>>>>>>>>>>> type of bidirectional feedback. =C2=A0Maybe it's qemu_coalesc= ed_irq and has a >>>>>>>>>>>>> similar interface as qemu_irq. >>>>>>>>>>>> A separate type would complicate the delivery of the feedback = value >>>>>>>>>>>> across GPIO pins (as Paul requested for the RTC->HPET routing)= . >>>>>>>>>>>> >>>>>>>>>>>>>> I think the real solution to coalescing is put the logic ins= ide one >>>>>>>>>>>>>> device, in this case APIC because it has the information abo= ut irq >>>>>>>>>>>>>> delivery. APIC could monitor incoming RTC irqs for frequency >>>>>>>>>>>>>> information and whether they get delivered or not. If not, a= n internal >>>>>>>>>>>>>> timer is installed which injects the lost irqs. >>>>>>>>>>>> That won't fly as the IRQs will already arrive at the APIC wit= h a >>>>>>>>>>>> sufficiently high jitter. At the bare minimum, you need to tel= l the >>>>>>>>>>>> interrupt controller about the fact that a particular IRQ shou= ld be >>>>>>>>>>>> delivered at a specific regular rate. For this, you also need = a generic >>>>>>>>>>>> interface - nothing really "won". >>>>>>>>>>> OK, let's simplify: just reinject at next possible chance. No n= eed to >>>>>>>>>>> monitor or tell anything. >>>>>>>>>> There are guests that won't like this (I know of one in-house, b= ut >>>>>>>>>> others may even have more examples), specifically if you end up = firing >>>>>>>>>> multiple IRQs in a row due to a longer backlog. For that reason,= the RTC >>>>>>>>>> spreads the reinjection according to the current rate. >>>>>>>>> Then reinject with a constant delay, or next CPU exit. Such buggy >>>>>>>> If guest's time frequency is the same as host time frequency you c= an't >>>>>>>> reinject with constant delay. That is why current code mixes two >>>>>>>> approaches: reinject M interrupts in a raw then delay. >>>>>>> This approach can be also used by APIC-only version. >>>>>>> >>>>>> I don't know what APIC-only version you are talking about. I haven't >>>>>> seen the code and I don't understand hand waving, sorry. >>>>> There is no code, because we're still at architecture design stage. >>>>> >>>> Try to write test code to understand the problem better. >>>> >>>>>>>>> guests could also be assisted with special handling (like win2k >>>>>>>>> install hack), for example guest instructions could be counted >>>>>>>>> (approximately, for example using TB size or TSC) and only inject >>>>>>>>> after at least N instructions have passed. >>>>>>>> Guest instructions cannot be easily counted in KVM (it can be done= more >>>>>>>> or less reliably using perf counters, may be). >>>>>>> Aren't there any debug registers or perf counters, which can genera= te >>>>>>> an interrupt after some number of instructions have been executed? >>>>>> Don't think debug registers have something like that and they are >>>>>> available for guest use anyway. Perf counters differs greatly from C= PU >>>>>> to CPU (even between two CPUs of the same manufacturer), and we want= to >>>>>> keep using them for profiling guests. And I don't see what problem i= t >>>>>> will solve anyway that can be solved by simple delay between irq >>>>>> reinjection. >>>>> This would allow counting the executed instructions and limit it. Thu= s >>>>> we could emulate a 500MHz CPU on a 2GHz CPU more accurately. >>>>> >>>> Why would you want to limit number of instruction executed by guest if >>>> CPU has nothing else to do anyway? The problem occurs not when we have >>>> spare cycles so give to a guest, but in opposite case. >>>> >>>>>>>>>> And even if the rate did not matter, the APIC woult still have t= o now >>>>>>>>>> about the fact that an IRQ is really periodic and does not only = appear >>>>>>>>>> as such for a certain interval. This really does not sound like >>>>>>>>>> simplifying things or even make them cleaner. >>>>>>>>> It would, the voodoo would be contained only in APIC, RTC would b= e >>>>>>>>> just like any other device. With the bidirectional irqs, this voo= doo >>>>>>>>> would probably eventually spread to many other devices. The logic= al >>>>>>>>> conclusion of that would be a system where all devices would be >>>>>>>>> careful not to disturb the guest at wrong moment because that wou= ld >>>>>>>>> trigger a bug. >>>>>>>>> >>>>>>>> This voodoo will be so complex and unreliable that it will make RT= C hack >>>>>>>> pale in comparison (and I still don't see how you are going to mak= e it >>>>>>>> actually work). >>>>>>> Implement everything inside APIC: only coalescing and reinjection. >>>>>> APIC has zero info needed to implement reinjection correctly as was >>>>>> shown to you several time in this thread and you simply keep ignorin= g >>>>>> it. >>>>> On the contrary, APIC is actually the only source of the IRQ ack >>>>> information. RTC hack would not work without APIC (or the >>>>> bidirectional IRQ) passing this info to RTC. >>>>> >>>>> What APIC doesn't have now is the timer frequency or period info. Thi= s >>>>> is known by RTC and also higher levels managing the clocks. >>>>> >>>> So APIC has one bit of information and RTC everything else. The curren= t >>>> approach (and proposed patch) brings this one bit of information to RT= C, >>>> you are arguing that RTC should be able to communicate all its info to >>>> APIC. Sorry I don't see that your way has any advantage. Just more >>>> complex interface and it is much easier to get it wrong for other time >>>> sources. >>>> >>>>> I keep ignoring the idea that the current model, where both RTC and >>>>> APIC must somehow work together to make coalescing work, is the only >>>>> possible just because it is committed and it happens to work in some >>>>> cases. It would be much better to concentrate this to one place, APIC >>>>> or preferably higher level where it may benefit other timers too. >>>>> Provided of course that the other models can be made to work. >>>>> >>>> So write the code and show us. You haven't show any evidence that RTC = is >>>> the wrong place. RTC knows when interrupt was acknowledge to RTC, it >>>> know when clock frequency changes, it know when device reset happened. >>>> APIC knows only that interrupt was coalesced. It doesn't even know tha= t >>>> it may be masked by a guest in IOAPIC (interrupts delivered while they >>>> are masked not considered coalesced). Time source knows only when >>>> frequency changes and may be when device reset happens if timer is >>>> stopped by device on reset. So RTC is actually a sweet spot if you wan= t >>>> to minimize amount of info you need to pass between various layers. >>> This is - according to my current understanding - the proposed >>> alternative architecture: >>> >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0.---------------. >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0| de-coalescing | >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0| =C2=A0 =C2=A0 logic =C2=A0 =C2=A0 | >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0'---------------' >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0^ =C2=A0 | >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 perio= d,| =C2=A0 |IRQ >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 coalesced| = =C2=A0 |(or tuned VM clock) >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0(yes/n= o)| =C2=A0 v >>> .-------. =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0.--------. = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 .-------. >>> | =C2=A0RTC =C2=A0|-----IRQ----->| router |-----IRQ---->| APIC =C2=A0| >>> '-------' =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0'--------' = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 '-------' >>> =C2=A0 =C2=A0| =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0^ =C2=A0 =C2=A0| =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 ^ >>> =C2=A0 =C2=A0| =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0| =C2=A0 =C2=A0| =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 | >>> =C2=A0 =C2=A0'-------period-------' =C2=A0 =C2=A0'------period-------' >> >> The period information is already known by the higher level clock >> management, > > The timer device models program the next event of some qemu-timer. There > is no tag attached explaining the timer subsystem or anyone else the > reason behind this programming. Yes, but why would we care? All timers in the system besides RTC should be affected likewise, this would in fact be the benefit from handling the problem at higher level. >> it should be available (or made available) to >> de-coalescing logic (which should probably be located close to other >> clock stuff) but otherwise there shouldn't be a need to pass it >> around. The tuned VM clock of course only affects the RTC. Otherwise >> correct. >> >>> And here is what this patch implements (except for the not yet factored >>> out de-coalescing logic): >>> >>> =C2=A0 .---------------. >>> =C2=A0 | de-coalescing | >>> =C2=A0 | =C2=A0 =C2=A0 logic =C2=A0 =C2=A0 | >>> =C2=A0 '---------------' >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 ^ =C2=A0 | >>> =C2=A0period,| =C2=A0 |next IRQ date >>> coalesced| =C2=A0 |(or tuned VM clock) >>> =C2=A0(yes/no)| =C2=A0 v >>> =C2=A0 =C2=A0 =C2=A0 .-------. =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0.--------. =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 .-------. >>> =C2=A0 =C2=A0 =C2=A0 | =C2=A0RTC =C2=A0|-----IRQ----->| router |-----IR= Q---->| APIC =C2=A0| >>> =C2=A0 =C2=A0 =C2=A0 '-------' =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0'--------' =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 '-------' >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 ^ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0| =C2=A0 =C2=A0^ =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 | >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 | =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0| =C2=A0 =C2=A0| =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 | >>> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 '------coalesced-----' =C2=A0 =C2=A0= '-----coalesced-----' >> >> Why not route the coalesced signal directly from APIC to de-coalescing >> logic? Then our designs would match! > > Because the de-coalescing logic will not be able to associate it with > the period information delivered by the RTC (without apply ugly hacks). > >> >>> I still don't see how the alternative is supposed to simplify our life >>> or improve the efficiency of the de-coalescing workaround. It's rather >>> problematic like Gleb pointed out: The de-coalescing logic needs to be >>> informed about periodicity changes that can only be delivered along >>> IRQs. So what to do with the backlog when the timer is stopped? >> >> What happens with the current design? Gleb only mentioned the >> frequency change, I thought that was not so big problem. But I don't >> think this case should be allowed happen at all, it can't exist on >> real HW. >> >>> Regarding an improved de-coalescing logic: As its final design is widel= y >>> independent of how we collect the information, it could perfectly be >>> done after we laid the elementary foundation. >> >> True. But what is the foundation, do we need bidirectional IRQs or >> not? I still think current IRQs should be enough. > > Then explain how you want to do the association between IRQ injection > and IRQ period. Both of my diagrams require an additional channel > between source and sink, either forward (tag IRQ event with its period) > or backward (report injection result - aka feedback IRQs). We currently > don't have such channel but the APIC hack (apic_reset/get_irq_delivered). > > Jan > >