From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <668883E4-DAA5-4D79-BB3C-2DE9D859C659@suse.de>
References: <1315989802-18753-1-git-send-email-agraf@suse.de>
	<1315989802-18753-25-git-send-email-agraf@suse.de>
	<14529F4D-D8AC-4097-8DF8-5F13EDCCC77F@suse.de>
	<4E7769D0.3090909@freescale.com>
	<1CECB54D-1FED-4AC2-B86B-8082CCFE001F@suse.de>
	<4E810883.4010405@freescale.com>
	<668883E4-DAA5-4D79-BB3C-2DE9D859C659@suse.de>
From: Blue Swirl
Date: Tue, 27 Sep 2011 17:20:08 +0000
Content-Type: text/plain; charset=UTF-8
Subject: Re: [Qemu-devel] [PATCH 24/58] PPC: E500: Add PV spinning code
To: Alexander Graf
Cc: Scott Wood, Yoder Stuart-B08248, qemu-ppc@nongnu.org,
	qemu-devel Developers, Aurelien Jarno

On Tue, Sep 27, 2011 at 5:03 PM, Alexander Graf wrote:
>
> On 27.09.2011, at 18:53, Blue Swirl wrote:
>
>> On Tue, Sep 27, 2011 at 3:59 PM, Alexander Graf wrote:
>>>
>>> On 27.09.2011, at 17:50, Blue Swirl wrote:
>>>
>>>> On Mon, Sep 26, 2011 at 11:19 PM, Scott Wood wrote:
>>>>> On 09/24/2011 05:00 AM, Alexander Graf wrote:
>>>>>> On 24.09.2011, at 10:44, Blue Swirl wrote:
>>>>>>> On Sat, Sep 24, 2011 at 8:03 AM, Alexander Graf wrote:
>>>>>>>> On 24.09.2011, at 09:41, Blue Swirl wrote:
>>>>>>>>> On Mon, Sep 19, 2011 at 4:12 PM, Scott Wood wrote:
>>>>>>>>>> The goal with the spin table stuff, suboptimal as it is, was
>>>>>>>>>> something that would work on any powerpc implementation. Other
>>>>>>>>>> implementation-specific release mechanisms are allowed, and are
>>>>>>>>>> indicated by a property in the cpu node, but only if the loader
>>>>>>>>>> knows that the OS supports it.
>>>>>>>>>>
>>>>>>>>>>> IIUC the spec that includes these bits is not finalized yet. It
>>>>>>>>>>> is however in use on all u-boot versions for e500 that I'm aware
>>>>>>>>>>> of, and it is the method Linux uses to bring up secondary CPUs.
>>>>>>>>>>
>>>>>>>>>> It's in ePAPR 1.0, which has been out for a while now. ePAPR 1.1
>>>>>>>>>> was just released, which clarifies some things such as WIMG.
>>>>>>>>>>
>>>>>>>>>>> Stuart / Scott, do you have any pointers to documentation where
>>>>>>>>>>> the spinning is explained?
>>>>>>>>>>
>>>>>>>>>> https://www.power.org/resources/downloads/Power_ePAPR_APPROVED_v1.1.pdf
>>>>>>>>>
>>>>>>>>> Chapter 5.5.2 describes the table. This is actually an interface
>>>>>>>>> between OS and Open Firmware, obviously there can't be a real
>>>>>>>>> hardware device that magically loads r3 etc.
>>>>>
>>>>> Not Open Firmware, but rather an ePAPR-compliant loader.
>>>>
>>>> 'boot program to client program interface definition'.
>>>>
>>>>>>>>> The device method would break abstraction layers,
>>>>>
>>>>> Which abstraction layers?
>>>>
>>>> QEMU system emulation emulates hardware, not software. Hardware
>>>> devices don't touch CPU registers.
>>>
>>> The great part about this emulated device is that it's basically guest
>>> software running in host context. To the guest, it's not a device in
>>> the ordinary sense, such as vmport, but rather the same as software
>>> running on another core, just that the other core isn't running any
>>> software.
>>>
>>> Sure, if you consider this a device, it does break abstraction layers.
>>> Just consider it as host running guest code, then it makes sense :).
>>>
>>>>
>>>>>>>>> it's much like vmport stuff in x86. Using a hypercall would be a
>>>>>>>>> small improvement. Instead it should be possible to implement a
>>>>>>>>> small boot ROM which puts the secondary CPUs into a managed halt
>>>>>>>>> state without spinning, then the boot CPU could send an IPI to a
>>>>>>>>> halted CPU to wake it up based on the spin table, just like real
>>>>>>>>> HW would do.
>>>>>
>>>>> The spin table, with no IPI or halt state, is what real HW does (or
>>>>> rather, what software does on real HW) today. It's ugly and
>>>>> inefficient but it should work everywhere. Anything else would be
>>>>> dependent on a specific HW implementation.
>>>>
>>>> Yes. Hardware doesn't ever implement the spin table.
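(For readers joining here: the spin table is just a per-CPU structure in
memory plus a polling loop. Roughly the following, going from memory of
ePAPR 1.1 chapter 5.5.2 rather than quoting it; the layout should match
the spec's offsets, but the struct and function names are mine:)

  #include <stdint.h>

  /* One entry per secondary CPU. ENTRY_ADDR starts out as 1, so bit 0
   * doubles as the "keep holding" flag. */
  struct epapr_spin_table {
      uint64_t entry_addr;   /* physical entry point, bit 0 = hold */
      uint64_t r3;           /* handed to the OS in r3 on release */
      uint32_t rsvd;
      uint32_t pir;          /* value for the Processor ID Register */
  };

  /* What each secondary does while waiting, as pseudo-C; in practice
   * it is a few instructions of asm in the boot program. */
  static void secondary_hold(volatile struct epapr_spin_table *t)
  {
      while (t->entry_addr & 1)
          ;   /* plain busy loop on memory, no doze/nap involved */
      /* Released: jump to entry_addr with r3 loaded. */
      ((void (*)(uint64_t))(uintptr_t)(t->entry_addr & ~1ULL))(t->r3);
  }

The boot CPU releases a secondary by filling in r3 and pir and finally
storing the entry point with bit 0 clear; that last store is the event
the patch under discussion turns into a wakeup.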
>>>>>>>>> On Sparc32 OpenBIOS this is something like a few lines of ASM on
>>>>>>>>> both sides.
>>>>>>>>
>>>>>>>> That sounds pretty close to what I had implemented in v1. Back
>>>>>>>> then the only comment was to do it using this method from Scott.
>>>>>
>>>>> I had some comments on the actual v1 implementation as well. :-)
>>>>>
>>>>>>>> So we have the choice between having code inside the guest that
>>>>>>>> spins, maybe even only checks every x ms by programming a timer,
>>>>>>>> or we can try to make an event out of the memory write. V1 was the
>>>>>>>> former, v2 (this one) is the latter. This version performs a lot
>>>>>>>> better and is easier to understand.
>>>>>>>
>>>>>>> The abstraction layers should not be broken lightly, I suppose some
>>>>>>> performance or laziness^Wlocal optimization reasons were behind
>>>>>>> vmport design too. The ideal way to solve this could be to detect a
>>>>>>> spinning CPU and optimize that for all architectures, though that
>>>>>>> could be tricky (if a CPU remains in the same TB for extended
>>>>>>> periods, inspect the TB: if it performs a loop with a single load
>>>>>>> instruction, replace the load by a special wait operation for any
>>>>>>> memory stores to that page).
>>>>>
>>>>> How's that going to work with KVM?
>>>>>
>>>>>> In fact, the whole way we load kernels today is pretty much wrong.
>>>>>> We should rather do it similarly to OpenBIOS, where firmware always
>>>>>> loads and then pulls the kernel from QEMU using a PV interface. At
>>>>>> that point, we would have to implement such an optimization as you
>>>>>> suggest. Or implement a hypercall :).
>>>>>
>>>>> I think the current approach is more usable for most purposes. If
>>>>> you start U-Boot instead of a kernel, how do you pass information on
>>>>> from the user (kernel, rfs, etc)? Require the user to create flash
>>>>> images[1]?
>>>>
>>>> No, for example OpenBIOS gets the kernel command line from the fw_cfg
>>>> device.
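(To illustrate, the guest side of fw_cfg is tiny. A sketch from memory:
the 0x510/0x511 ports are the x86 mapping, sparc wires the same two
registers into MMIO, and the key values are worth double-checking
against hw/fw_cfg.h:)

  #include <stdint.h>

  #define FW_CFG_CTL_PORT      0x510  /* 16-bit selector register */
  #define FW_CFG_DATA_PORT     0x511  /* 8-bit data register */
  #define FW_CFG_CMDLINE_SIZE  0x14   /* from memory, check fw_cfg.h */
  #define FW_CFG_CMDLINE_DATA  0x15

  /* outw(port, val) and inb(port) assumed supplied by the firmware. */
  extern void outw(uint16_t port, uint16_t val);
  extern uint8_t inb(uint16_t port);

  static void fw_cfg_read(uint16_t key, void *buf, uint32_t len)
  {
      uint8_t *p = buf;

      outw(FW_CFG_CTL_PORT, key);       /* select the item... */
      while (len--)
          *p++ = inb(FW_CFG_DATA_PORT); /* ...then stream its bytes */
  }

Fetching the command line is then two reads: FW_CFG_CMDLINE_SIZE gives a
little-endian 32-bit length, FW_CFG_CMDLINE_DATA the string itself.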
>>>>> Maybe that's a useful mode of operation in some cases, but I don't
>>>>> think we should be slavishly bound to it. Think of the current
>>>>> approach as something between whole-system and userspace emulation.
>>>>
>>>> This is similar to the ARM, M68k and Xtensa semi-hosting modes, but
>>>> below kernel level rather than at it. Perhaps this mode should be
>>>> enabled with the -semihosting flag or a new flag. Then the bare metal
>>>> version could be run without the flag.
>>>
>>> and then we'd have 2 implementations for running in system emulation
>>> mode and need to maintain both. I don't think that scales very well.
>>
>> No, but such hacks are not common.
>>
>>>>
>>>>> Where does the device tree come from? How do you tell the guest
>>>>> about what devices it has, especially in virtualization scenarios
>>>>> with non-PCI passthrough devices, or custom qdev instantiations?
>>>>>
>>>>>> But at least we'd always be running the same guest software stack.
>>>>>
>>>>> No we wouldn't. Any U-Boot that runs under QEMU would have to be
>>>>> heavily modified, unless we want to implement a ton of random device
>>>>> emulation, at least one extra memory translation layer (LAWs,
>>>>> localbus windows, CCSRBAR, and such), hacks to allow locked cache
>>>>> lines to operate despite a lack of backing store, etc.
>>>>
>>>> I'd say HW emulation business as usual. Now with the new memory API,
>>>> it should be possible to emulate the caches with line locking and
>>>> TLBs etc.; this was not previously possible. IIRC implementing locked
>>>> cache lines would allow x86 to boot unmodified coreboot.
>>>
>>> So how would you emulate cache lines with line locking on KVM?
>>
>> The cache would be an MMIO device which registers to handle all memory
>> space. Configuring the cache controller changes how the device
>> operates. Put this device between CPU and memory and other devices.
>> Performance would probably be horrible, so the CPU should disable the
>> device automatically after some time.
>
> So how would you execute code on this region then? :)

Easy, fix QEMU to allow executing from MMIO. (Yeah, I forgot about that).

>>
>>> However, we already have a number of hacks in SeaBIOS to run in QEMU,
>>> so I don't see an issue in adding a few here and there in u-boot. The
>>> memory pressure is a real issue though. I'm not sure how we'd manage
>>> that one. Maybe we could try and reuse the host u-boot binary? heh
>>
>> I don't think SeaBIOS breaks layering except for fw_cfg.
>
> I'm not saying we're breaking layering there. I'm saying that changing
> u-boot is not so bad, since it's the same as we do with SeaBIOS. It was
> an argument in favor of your position.

Never mind then ;-)

>> For extremely memory limited situations, perhaps QEMU (or Native KVM
>> Tool for a lean and mean version) could be run without glibc, inside
>> the kernel or even interfacing directly with the hypervisor. I'd also
>> continue making it possible to disable building unused devices and
>> features.
>
> I'm pretty sure you're not the only one with that goal ;).

Great, let's do it.
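P.S. To spell out conceptually what the v2 "event out of the memory
write" approach amounts to (pseudo-code from my reading of the thread,
not the actual patch; all helper names are made up):

  #include <stdint.h>

  struct spin_slot {               /* mirrors the guest-visible table */
      uint64_t entry_addr;         /* bit 0 = hold, as above */
      uint64_t r3;
      uint32_t rsvd;
      uint32_t pir;
  };

  /* Hypothetical CPU-control helpers standing in for QEMU internals. */
  extern void cpu_set_gpr(int cpu, int gpr, uint64_t val);
  extern void cpu_set_pc(int cpu, uint64_t pc);
  extern void cpu_unhalt(int cpu);

  /* Called on each store the boot CPU makes to the spin-table page.
   * The secondaries sit halted instead of spinning, so the store that
   * clears the hold bit is itself the wakeup; no guest busy loop. */
  static void spin_table_written(struct spin_slot *table, uint64_t off)
  {
      int cpu = off / sizeof(struct spin_slot);
      struct spin_slot *slot = &table[cpu];

      if (!(slot->entry_addr & 1)) {
          cpu_set_gpr(cpu, 3, slot->r3);
          cpu_set_pc(cpu, slot->entry_addr & ~1ULL);
          cpu_unhalt(cpu);
      }
  }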