From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <668883E4-DAA5-4D79-BB3C-2DE9D859C659@suse.de>
References: <1315989802-18753-1-git-send-email-agraf@suse.de>
	<1315989802-18753-25-git-send-email-agraf@suse.de>
	<14529F4D-D8AC-4097-8DF8-5F13EDCCC77F@suse.de>
	<4E7769D0.3090909@freescale.com>
	<1CECB54D-1FED-4AC2-B86B-8082CCFE001F@suse.de>
	<4E810883.4010405@freescale.com>
	<668883E4-DAA5-4D79-BB3C-2DE9D859C659@suse.de>
From: Blue Swirl
Date: Tue, 27 Sep 2011 17:20:08 +0000
Content-Type: text/plain; charset=UTF-8
Subject: Re: [Qemu-devel] [PATCH 24/58] PPC: E500: Add PV spinning code
To: Alexander Graf
Cc: Scott Wood, Yoder Stuart-B08248, qemu-ppc@nongnu.org,
	qemu-devel Developers, Aurelien Jarno

On Tue, Sep 27, 2011 at 5:03 PM, Alexander Graf wrote:
>
> On 27.09.2011, at 18:53, Blue Swirl wrote:
>
>> On Tue, Sep 27, 2011 at 3:59 PM, Alexander Graf wrote:
>>>
>>> On 27.09.2011, at 17:50, Blue Swirl wrote:
>>>
>>>> On Mon, Sep 26, 2011 at 11:19 PM, Scott Wood wrote:
>>>>> On 09/24/2011 05:00 AM, Alexander Graf wrote:
>>>>>> On 24.09.2011, at 10:44, Blue Swirl wrote:
>>>>>>> On Sat, Sep 24, 2011 at 8:03 AM, Alexander Graf wrote:
>>>>>>>> On 24.09.2011, at 09:41, Blue Swirl wrote:
>>>>>>>>> On Mon, Sep 19, 2011 at 4:12 PM, Scott Wood wrote:
>>>>>>>>>> The goal with the spin table stuff, suboptimal as it is, was
>>>>>>>>>> something that would work on any powerpc implementation. Other
>>>>>>>>>> implementation-specific release mechanisms are allowed, and are
>>>>>>>>>> indicated by a property in the cpu node, but only if the loader
>>>>>>>>>> knows that the OS supports it.
>>>>>>>>>>
>>>>>>>>>>> IIUC the spec that includes these bits is not finalized yet. It
>>>>>>>>>>> is however in use on all u-boot versions for e500 that I'm aware
>>>>>>>>>>> of, and it is the method Linux uses to bring up secondary CPUs.
>>>>>>>>>>
>>>>>>>>>> It's in ePAPR 1.0, which has been out for a while now. ePAPR 1.1
>>>>>>>>>> was just released, which clarifies some things such as WIMG.
>>>>>>>>>>
>>>>>>>>>>> Stuart / Scott, do you have any pointers to documentation where
>>>>>>>>>>> the spinning is explained?
>>>>>>>>>>
>>>>>>>>>> https://www.power.org/resources/downloads/Power_ePAPR_APPROVED_v1.1.pdf
>>>>>>>>>
>>>>>>>>> Chapter 5.5.2 describes the table. This is actually an interface
>>>>>>>>> between OS and Open Firmware, obviously there can't be a real
>>>>>>>>> hardware device that magically loads r3 etc.
>>>>>
>>>>> Not Open Firmware, but rather an ePAPR-compliant loader.
>>>>
>>>> 'boot program to client program interface definition'.
>>>>
>>>>>>>>> The device method would break abstraction layers,
>>>>>
>>>>> Which abstraction layers?
>>>>
>>>> QEMU system emulation emulates hardware, not software. Hardware
>>>> devices don't touch CPU registers.
>>>
>>> The great part about this emulated device is that it's basically guest
>>> software running in host context. To the guest, it's not a device in
>>> the ordinary sense, such as vmport, but rather the same as software
>>> running on another core, just that the other core isn't running any
>>> software.
>>>
>>> Sure, if you consider this a device, it does break abstraction layers.
>>> Just consider it as host running guest code, then it makes sense :).
>>>
>>>>
>>>>>>>>> it's much like vmport stuff in x86. Using a hypercall would be a
>>>>>>>>> small improvement. Instead it should be possible to implement a
>>>>>>>>> small boot ROM which puts the secondary CPUs into a managed halt
>>>>>>>>> state without spinning, then the boot CPU could send an IPI to a
>>>>>>>>> halted CPU to wake it up based on the spin table, just like real
>>>>>>>>> HW would do.
>>>>>
>>>>> The spin table, with no IPI or halt state, is what real HW does (or
>>>>> rather, what software does on real HW) today. It's ugly and
>>>>> inefficient but it should work everywhere. Anything else would be
>>>>> dependent on a specific HW implementation.
>>>>
>>>> Yes. Hardware doesn't ever implement the spin table.
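(For readers joining here: the spin table is just a per-CPU structure in
memory plus a polling loop. Roughly the following, going from memory of
ePAPR 1.1 chapter 5.5.2 rather than quoting it; the layout should match
the spec's offsets, but the struct and function names are mine:)

  #include <stdint.h>

  /* One entry per secondary CPU. ENTRY_ADDR starts out as 1, so bit 0
   * doubles as the "keep holding" flag. */
  struct epapr_spin_table {
      uint64_t entry_addr;   /* physical entry point, bit 0 = hold */
      uint64_t r3;           /* handed to the OS in r3 on release */
      uint32_t rsvd;
      uint32_t pir;          /* value for the Processor ID Register */
  };

  /* What each secondary does while waiting, as pseudo-C; in practice
   * it is a few instructions of asm in the boot program. */
  static void secondary_hold(volatile struct epapr_spin_table *t)
  {
      while (t->entry_addr & 1)
          ;   /* plain busy loop on memory, no doze/nap involved */
      /* Released: jump to entry_addr with r3 loaded. */
      ((void (*)(uint64_t))(uintptr_t)(t->entry_addr & ~1ULL))(t->r3);
  }

The boot CPU releases a secondary by filling in r3 and pir and finally
storing the entry point with bit 0 clear; that last store is the event
the patch under discussion turns into a wakeup.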
>>>>>>>>> On Sparc32 OpenBIOS this is something like a few lines of ASM on
>>>>>>>>> both sides.
>>>>>>>>
>>>>>>>> That sounds pretty close to what I had implemented in v1. Back
>>>>>>>> then the only comment was to do it using this method from Scott.
>>>>>
>>>>> I had some comments on the actual v1 implementation as well. :-)
>>>>>
>>>>>>>> So we have the choice between having code inside the guest that
>>>>>>>> spins, maybe even only checks every x ms by programming a timer,
>>>>>>>> or we can try to make an event out of the memory write. V1 was the
>>>>>>>> former, v2 (this one) is the latter. This version performs a lot
>>>>>>>> better and is easier to understand.
>>>>>>>
>>>>>>> The abstraction layers should not be broken lightly, I suppose some
>>>>>>> performance or laziness^Wlocal optimization reasons were behind
>>>>>>> vmport design too. The ideal way to solve this could be to detect a
>>>>>>> spinning CPU and optimize that for all architectures, though that
>>>>>>> could be tricky (if a CPU remains in the same TB for extended
>>>>>>> periods, inspect the TB: if it performs a loop with a single load
>>>>>>> instruction, replace the load by a special wait operation for any
>>>>>>> memory stores to that page).
>>>>>
>>>>> How's that going to work with KVM?
>>>>>
>>>>>> In fact, the whole way we load kernels today is pretty much wrong.
>>>>>> We should rather do it similarly to OpenBIOS, where firmware always
>>>>>> loads and then pulls the kernel from QEMU using a PV interface. At
>>>>>> that point, we would have to implement such an optimization as you
>>>>>> suggest. Or implement a hypercall :).
>>>>>
>>>>> I think the current approach is more usable for most purposes. If
>>>>> you start U-Boot instead of a kernel, how do you pass information on
>>>>> from the user (kernel, rfs, etc)? Require the user to create flash
>>>>> images[1]?
>>>>
>>>> No, for example OpenBIOS gets the kernel command line from the fw_cfg
>>>> device.
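(To illustrate, the guest side of fw_cfg is tiny. A sketch from memory:
the 0x510/0x511 ports are the x86 mapping, sparc wires the same two
registers into MMIO, and the key values are worth double-checking
against hw/fw_cfg.h:)

  #include <stdint.h>

  #define FW_CFG_CTL_PORT      0x510  /* 16-bit selector register */
  #define FW_CFG_DATA_PORT     0x511  /* 8-bit data register */
  #define FW_CFG_CMDLINE_SIZE  0x14   /* from memory, check fw_cfg.h */
  #define FW_CFG_CMDLINE_DATA  0x15

  /* outw(port, val) and inb(port) assumed supplied by the firmware. */
  extern void outw(uint16_t port, uint16_t val);
  extern uint8_t inb(uint16_t port);

  static void fw_cfg_read(uint16_t key, void *buf, uint32_t len)
  {
      uint8_t *p = buf;

      outw(FW_CFG_CTL_PORT, key);       /* select the item... */
      while (len--)
          *p++ = inb(FW_CFG_DATA_PORT); /* ...then stream its bytes */
  }

Fetching the command line is then two reads: FW_CFG_CMDLINE_SIZE gives a
little-endian 32-bit length, FW_CFG_CMDLINE_DATA the string itself.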
>>>>> Maybe that's a useful mode of operation in some cases, but I don't
>>>>> think we should be slavishly bound to it. Think of the current
>>>>> approach as something between whole-system and userspace emulation.
>>>>
>>>> This is similar to the ARM, M68k and Xtensa semi-hosting modes, but
>>>> below kernel level rather than at it. Perhaps this mode should be
>>>> enabled with the -semihosting flag or a new flag. Then the bare metal
>>>> version could be run without the flag.
>>>
>>> and then we'd have 2 implementations for running in system emulation
>>> mode and need to maintain both. I don't think that scales very well.
>>
>> No, but such hacks are not common.
>>
>>>>
>>>>> Where does the device tree come from? How do you tell the guest
>>>>> about what devices it has, especially in virtualization scenarios
>>>>> with non-PCI passthrough devices, or custom qdev instantiations?
>>>>>
>>>>>> But at least we'd always be running the same guest software stack.
>>>>>
>>>>> No we wouldn't. Any U-Boot that runs under QEMU would have to be
>>>>> heavily modified, unless we want to implement a ton of random device
>>>>> emulation, at least one extra memory translation layer (LAWs,
>>>>> localbus windows, CCSRBAR, and such), hacks to allow locked cache
>>>>> lines to operate despite a lack of backing store, etc.
>>>>
>>>> I'd say HW emulation business as usual. Now with the new memory API,
>>>> it should be possible to emulate the caches with line locking and
>>>> TLBs etc.; this was not previously possible. IIRC implementing locked
>>>> cache lines would allow x86 to boot unmodified coreboot.
>>>
>>> So how would you emulate cache lines with line locking on KVM?
>>
>> The cache would be an MMIO device which registers to handle all memory
>> space. Configuring the cache controller changes how the device
>> operates. Put this device between CPU and memory and other devices.
>> Performance would probably be horrible, so the CPU should disable the
>> device automatically after some time.
>
> So how would you execute code on this region then? :)

Easy, fix QEMU to allow executing from MMIO. (Yeah, I forgot about that).

>>
>>> However, we already have a number of hacks in SeaBIOS to run in QEMU,
>>> so I don't see an issue in adding a few here and there in u-boot. The
>>> memory pressure is a real issue though. I'm not sure how we'd manage
>>> that one. Maybe we could try and reuse the host u-boot binary? heh
>>
>> I don't think SeaBIOS breaks layering except for fw_cfg.
>
> I'm not saying we're breaking layering there. I'm saying that changing
> u-boot is not so bad, since it's the same as we do with SeaBIOS. It was
> an argument in favor of your position.

Never mind then ;-)

>> For extremely memory limited situations, perhaps QEMU (or Native KVM
>> Tool for a lean and mean version) could be run without glibc, inside
>> the kernel or even interfacing directly with the hypervisor. I'd also
>> continue making it possible to disable building unused devices and
>> features.
>
> I'm pretty sure you're not the only one with that goal ;).

Great, let's do it.
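P.S. To spell out conceptually what the v2 "event out of the memory
write" approach amounts to (pseudo-code from my reading of the thread,
not the actual patch; all helper names are made up):

  #include <stdint.h>

  struct spin_slot {               /* mirrors the guest-visible table */
      uint64_t entry_addr;         /* bit 0 = hold, as above */
      uint64_t r3;
      uint32_t rsvd;
      uint32_t pir;
  };

  /* Hypothetical CPU-control helpers standing in for QEMU internals. */
  extern void cpu_set_gpr(int cpu, int gpr, uint64_t val);
  extern void cpu_set_pc(int cpu, uint64_t pc);
  extern void cpu_unhalt(int cpu);

  /* Called on each store the boot CPU makes to the spin-table page.
   * The secondaries sit halted instead of spinning, so the store that
   * clears the hold bit is itself the wakeup; no guest busy loop. */
  static void spin_table_written(struct spin_slot *table, uint64_t off)
  {
      int cpu = off / sizeof(struct spin_slot);
      struct spin_slot *slot = &table[cpu];

      if (!(slot->entry_addr & 1)) {
          cpu_set_gpr(cpu, 3, slot->r3);
          cpu_set_pc(cpu, slot->entry_addr & ~1ULL);
          cpu_unhalt(cpu);
      }
  }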