LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH 0/13] Hypervisor-mode KVM on POWER7
From: Avi Kivity @ 2011-05-17 11:42 UTC (permalink / raw)
  To: Alexander Graf
  Cc: linuxppc-dev@ozlabs.org, Paul Mackerras, kvm@vger.kernel.org
In-Reply-To: <14CFAA70-1747-4902-8CC1-4BE924CAD031@suse.de>

On 05/17/2011 02:38 PM, Alexander Graf wrote:
> >
> >  What would be the path for these patches to get upstream?  Would this
> >  stuff normally go through Avi's tree?  There is a bit of a
> >  complication in that they are based on Ben's next branch.  Would Avi
> >  pull Ben's next branch, or would they go in via Ben's tree?
>
> Usually the ppc tree gets merged into Avi's tree and goes on from there. When we have interdependencies, we can certainly do it differently though. We can also shove them through Ben's tree this time around, as there are more dependencies on ppc code than KVM code.
>

Yes, both options are fine.  If it goes through kvm.git I can merge 
Ben's tree (provided it is append-only) and apply the kvm-ppc patches on 
top.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply

* Re: [PATCH 0/13] Hypervisor-mode KVM on POWER7
From: Alexander Graf @ 2011-05-17 11:38 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev@ozlabs.org, kvm@vger.kernel.org
In-Reply-To: <20110517111536.GC7924@brick.ozlabs.ibm.com>

Am 17.05.2011 um 13:15 schrieb Paul Mackerras <paulus@samba.org>:

> On Tue, May 17, 2011 at 11:46:34AM +0200, Alexander Graf wrote:
>=20
>> Very nice patches indeed :). Is there any way I can test them? I
>> don't like pulling code that I couldn't run anywhere yet.
>=20
> I can understand that, but unfortunately there are no machines
> available outside of IBM at this stage that can run this stuff.
>=20
> Do you think that the earlier patches in the series could go in for
> 2.6.40?  Patches 1 to 8 don't require hypervisor mode and can be
> tested on a G5 or whatever.

Oh, I can and will certainly test if the code breaks pr mode, but that doesn=
't mean it will do the right thing on HV machines :o.

>=20
> What would be the path for these patches to get upstream?  Would this
> stuff normally go through Avi's tree?  There is a bit of a
> complication in that they are based on Ben's next branch.  Would Avi
> pull Ben's next branch, or would they go in via Ben's tree?

Usually the ppc tree gets merged into Avi's tree and goes on from there. Whe=
n we have interdependencies, we can certainly do it differently though. We c=
an also shove them through Ben's tree this time around, as there are more de=
pendencies on ppc code than KVM code.

>=20
> Thanks,
> Paul.

^ permalink raw reply

* Re: [PATCH 13/13] kvm/powerpc: Allow book3s_hv guests to use SMT processor modes
From: Alexander Graf @ 2011-05-17 11:36 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev@ozlabs.org, kvm@vger.kernel.org
In-Reply-To: <20110517104422.GB7924@brick.ozlabs.ibm.com>



Am 17.05.2011 um 12:44 schrieb Paul Mackerras <paulus@samba.org>:

> On Tue, May 17, 2011 at 10:21:56AM +0200, Alexander Graf wrote:
>>=20
>> On 11.05.2011, at 12:46, Paul Mackerras wrote:
>>=20
>>> -#define KVM_MAX_VCPUS 1
>>> +#define KVM_MAX_VCPUS        NR_CPUS
>>> +#define KVM_THREADS_PER_CORE    4
>>=20
>> So what if POWER8 (or whatever it will be called) comes along with 8
>> threads per core? Would that change the userspace interface?
>=20
> The idea is that userspace queries the KVM_CAP_PPC_SMT capability and
> the value it gets back is the number of vcpus per vcore.  It then
> allocates vcpu numbers based on that.
>=20
> If a CPU came along with more than 4 threads per core then we'd have
> to change that define in the kernel, but that won't affect the
> userspace API.

Ah, I see :). Tht's exactly why documentation is so important - with proper d=
ocumentation you wouldn't need to explain those things to me :)

>=20
>>> +    /* wait for secondary threads to get back to nap mode */
>>> +    spin_lock(&vc->lock);
>>> +    if (vc->nap_count < vc->n_woken)
>>> +        kvmppc_wait_for_nap(vc);
>>=20
>> So you're taking the vcore wide lock and wait for other CPUs to set
>> themselves to nap? Not sure I fully understand this. Why would
>> another thread want to go to nap mode when it's 100% busy?
>=20
> It's more about waiting for the other hardware threads to have
> finished writing their vcpu state to memory.  Currently those threads
> then go to nap mode, but they could in fact poll instead for a bit,
> so that name is possible a bit misleading, I agree.

Just so I understand the scheme: One vcpu needs to go to MMU mode in KVM, it=
 then sends IPIs to stop the other threads and finally we return from this w=
ait here?

>=20
>>> +    cmpwi    r12,0x980
>>> +    beq    40f
>>> +    cmpwi    r3,0x100
>>=20
>> good old use define comment :)
>=20
> Yep, OK. :)
>=20
>> Maybe I also missed the point here, but how does this correlate with
>> Linux threads? Is each vcpu running in its own Linux thread? How
>> does the scheduling happen? IIUC the host only sees a single thread
>> per core and then distributes the vcpus to the respective host
>> threads.
>=20
> Each vcpu has its own Linux thread, but while the vcore is running,
> all but one of them are sleeping.  The thing is that since the host is
> running with each core single-threaded, one Linux thread is enough to
> run 4 vcpus.  So when we decide we can run the vcore, the vcpu thread
> that discovered that we can now run the vcore takes the responsibility
> to run it.  That involves sending an IPI to the other hardware threads
> to wake them up and get them to each run a vcpu.  Then the vcpu thread
> that is running the vcore dives into the guest switch code itself.  It
> synchronizes with the other threads and does the partition switch, and
> then they all enter the guest.
>=20
> We thought about various schemes to cope with the hardware restriction
> that all hardware threads in a core have to be in the same partition
> (at least whenever the MMU is on).  This is the least messy scheme we
> could come up with.  I'd be happy to discuss the alternatives if you
> like.

Oh, I'm certainly fine with the scheme :). I would just like to understand i=
t and see it documented somewhere, as it's slightly unintuitive.

Also, this scheme might confuse the host scheduler for a bit, as it might mi=
grate threads to other host CPUs while it would prove beneficial for cache u=
sage to keep them local. But since the scheduler doesn't know about the corr=
elation between the threads, it can't be clever about it.

Alex

>=20

^ permalink raw reply

* Re: [PATCH 0/13] Hypervisor-mode KVM on POWER7
From: Paul Mackerras @ 2011-05-17 11:15 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linuxppc-dev, kvm
In-Reply-To: <764651B6-6DDA-49DC-AA95-52CD86FF82D9@suse.de>

On Tue, May 17, 2011 at 11:46:34AM +0200, Alexander Graf wrote:

> Very nice patches indeed :). Is there any way I can test them? I
> don't like pulling code that I couldn't run anywhere yet.

I can understand that, but unfortunately there are no machines
available outside of IBM at this stage that can run this stuff.

Do you think that the earlier patches in the series could go in for
2.6.40?  Patches 1 to 8 don't require hypervisor mode and can be
tested on a G5 or whatever.

What would be the path for these patches to get upstream?  Would this
stuff normally go through Avi's tree?  There is a bit of a
complication in that they are based on Ben's next branch.  Would Avi
pull Ben's next branch, or would they go in via Ben's tree?

Thanks,
Paul.

^ permalink raw reply

* Re: [PATCH 13/13] kvm/powerpc: Allow book3s_hv guests to use SMT processor modes
From: Paul Mackerras @ 2011-05-17 10:44 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linuxppc-dev, kvm
In-Reply-To: <F90D2118-7694-4AB0-9C17-8B449D14B9DF@suse.de>

On Tue, May 17, 2011 at 10:21:56AM +0200, Alexander Graf wrote:
> 
> On 11.05.2011, at 12:46, Paul Mackerras wrote:
> 
> > -#define KVM_MAX_VCPUS 1
> > +#define KVM_MAX_VCPUS		NR_CPUS
> > +#define KVM_THREADS_PER_CORE	4
> 
> So what if POWER8 (or whatever it will be called) comes along with 8
> threads per core? Would that change the userspace interface?

The idea is that userspace queries the KVM_CAP_PPC_SMT capability and
the value it gets back is the number of vcpus per vcore.  It then
allocates vcpu numbers based on that.

If a CPU came along with more than 4 threads per core then we'd have
to change that define in the kernel, but that won't affect the
userspace API.

> > +	/* wait for secondary threads to get back to nap mode */
> > +	spin_lock(&vc->lock);
> > +	if (vc->nap_count < vc->n_woken)
> > +		kvmppc_wait_for_nap(vc);
> 
> So you're taking the vcore wide lock and wait for other CPUs to set
> themselves to nap? Not sure I fully understand this. Why would
> another thread want to go to nap mode when it's 100% busy?

It's more about waiting for the other hardware threads to have
finished writing their vcpu state to memory.  Currently those threads
then go to nap mode, but they could in fact poll instead for a bit,
so that name is possible a bit misleading, I agree.

> > +	cmpwi	r12,0x980
> > +	beq	40f
> > +	cmpwi	r3,0x100
> 
> good old use define comment :)

Yep, OK. :)

> Maybe I also missed the point here, but how does this correlate with
> Linux threads? Is each vcpu running in its own Linux thread? How
> does the scheduling happen? IIUC the host only sees a single thread
> per core and then distributes the vcpus to the respective host
> threads.

Each vcpu has its own Linux thread, but while the vcore is running,
all but one of them are sleeping.  The thing is that since the host is
running with each core single-threaded, one Linux thread is enough to
run 4 vcpus.  So when we decide we can run the vcore, the vcpu thread
that discovered that we can now run the vcore takes the responsibility
to run it.  That involves sending an IPI to the other hardware threads
to wake them up and get them to each run a vcpu.  Then the vcpu thread
that is running the vcore dives into the guest switch code itself.  It
synchronizes with the other threads and does the partition switch, and
then they all enter the guest.

We thought about various schemes to cope with the hardware restriction
that all hardware threads in a core have to be in the same partition
(at least whenever the MMU is on).  This is the least messy scheme we
could come up with.  I'd be happy to discuss the alternatives if you
like.

Paul.

^ permalink raw reply

* Re: [PATCH 11/13] kvm/powerpc: Handle some PAPR hcalls in the kernel
From: Paul Mackerras @ 2011-05-17 10:28 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linuxppc-dev, kvm
In-Reply-To: <2C53DD5C-4F74-4FC4-9C65-4F401AE9AF78@suse.de>

On Tue, May 17, 2011 at 09:54:35AM +0200, Alexander Graf wrote:

> Not sure I like the name - when is it used? :)

When the real-mode hcall handler decides it can't handle the hcall and
wants to pass it up.

> Also, if it's not in the PAPR, the guest should never receive it, right?

Right.  It's purely an internal value.

> > +	/* See if this is something we can handle in real mode */
> > +	cmpwi	r12,0xc00
> 
> use the define please

OK

> > +	beq	hcall_real_mode
> 
> This is simply a hcall helper, as the name suggests. So the comment
> is slightly misleading - it should rather read like "Try to handle
> hypercalls in real mode".

Hmmm, OK, that's what it was trying to indicate.  I could add "try" to
the name if you like.

> > +	cmpdi	r3,H_TOO_HARD
> > +	beq	hcall_real_fallback
> 
> Ah, very good. Please mark the constant as "for internal use only"
> then, as that's certainly fine :).

Cool, will do.

> > +	li	r12,0xc00
> 
> use the define please :)

OK.

Thanks,
Paul.

^ permalink raw reply

* Re: [PATCH 10/13] kvm/powerpc: Add support for Book3S processors in hypervisor mode
From: Alexander Graf @ 2011-05-17 10:17 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Linuxppc-dev, kvm-ppc, KVM list
In-Reply-To: <20110516055809.GA3590@drongo>


On 16.05.2011, at 07:58, Paul Mackerras wrote:

> On Sun, May 15, 2011 at 11:58:12PM +0200, Alexander Graf wrote:
>>=20
>> On 11.05.2011, at 12:44, Paul Mackerras wrote:
>=20
>>> +#ifdef CONFIG_KVM_BOOK3S_NONHV
>>=20
>> I really liked how you called the .c file _pr - why call it NONHV =
now?
>=20
> I agree, CONFIG_KVM_BOOK3S_PR would be better, I'll change it.
>=20
>>> diff --git a/arch/powerpc/include/asm/paca.h =
b/arch/powerpc/include/asm/paca.h
>>> index 7412676..8dba5f6 100644
>>> --- a/arch/powerpc/include/asm/paca.h
>>> +++ b/arch/powerpc/include/asm/paca.h
>>> @@ -149,6 +149,16 @@ struct paca_struct {
>>> #ifdef CONFIG_KVM_BOOK3S_HANDLER
>>> 	/* We use this to store guest state in */
>>> 	struct kvmppc_book3s_shadow_vcpu shadow_vcpu;
>>> +#ifdef CONFIG_KVM_BOOK3S_64_HV
>>> +	struct kvm_vcpu *kvm_vcpu;
>>> +	u64 dabr;
>>> +	u64 host_mmcr[3];
>>> +	u32 host_pmc[6];
>>> +	u64 host_purr;
>>> +	u64 host_spurr;
>>> +	u64 host_dscr;
>>> +	u64 dec_expires;
>>=20
>> Hrm. I'd say either push those into shadow_vcpu for HV mode or get
>> rid of the shadow_vcpu reference. I'd probably prefer the former.
>=20
> These are fields that are pieces of host state that we need to save
> and restore across execution of a guest; they don't apply to any
> specific guest or vcpu.  That's why I didn't put them in shadow_vcpu,
> which is specifically for one vcpu in one guest. =20
>=20
> Given that book3s_pr copies the shadow_vcpu into and out of the paca,
> I thought it best not to add fields to it that are only live while we
> are in the guest.  True, these fields only exist for book3s_hv, but if
> we later on make it possible to select book3s_pr vs. book3s_hv at
> runtime, we won't want to be copying these fields into and out of the
> paca when book3s_pr is active.
>=20
> Maybe we need another struct, kvm_host_state or something like that,
> to save this sort of state.

Yeah, just put them into a different struct then. I don't want to =
clutter the PACA struct with kvm fields all over :).

>=20
>>> @@ -65,6 +98,7 @@ config KVM_440
>>> 	bool "KVM support for PowerPC 440 processors"
>>> 	depends on EXPERIMENTAL && 44x
>>> 	select KVM
>>> +	select KVM_MMIO
>>=20
>> e500 should also select MMIO, no?
>=20
> Good point, I'll fix that.
>=20
>>> +long kvmppc_alloc_hpt(struct kvm *kvm)
>>> +{
>>> +	unsigned long hpt;
>>> +	unsigned long lpid;
>>> +
>>> +	hpt =3D =
__get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT|__GFP_NOWARN,
>>> +			       HPT_ORDER - PAGE_SHIFT);
>>=20
>> This would end up failing quite often, no?
>=20
> In practice it seems to be OK, possibly because the machines we're
> testing this on have plenty of memory.  Maybe we should get qemu to
> allocate the HPT using hugetlbfs so the memory will come from the
> reserved page pool.  It does need to be physically contiguous and
> aligned on a multiple of its size -- that's a hardware requirement.

Yes, I'd certainly prefer to see qemu allocate it. That'd also make =
things easier for migration later, as we still have access to the hpt. =
But then again - we can't really reuse the mappings there anyways, as =
they'll all be host mappings. Phew. Have you given that some thought =
yet? We can probably just ignore non-bolted entries - but the bolted =
ones need to be transferred.

Also, if we have qemu allocate the hpt memory, qemu can modify the =
mappings and thus break out. Bleks. It should definitely not be able to =
write to the hpt after it's actually being used by kvm.

>=20
>>> +	kvm->arch.sdr1 =3D __pa(hpt) | (HPT_ORDER - 18);
>>> +	kvm->arch.lpid =3D lpid;
>>> +	kvm->arch.host_sdr1 =3D mfspr(SPRN_SDR1);
>>> +	kvm->arch.host_lpid =3D mfspr(SPRN_LPID);
>>> +	kvm->arch.host_lpcr =3D mfspr(SPRN_LPCR);
>>=20
>> How do these correlate with the guest's hv mmu? I'd like to keep the
>> code clean enough so we can potentially use it for PR mode as well =
:).=20
>=20
> The host SDR1 and LPID are different from the guest's.  That is, the
> guest has its own HPT which is quite separate from the host's.  The
> host values could be saved in global variables, though; there's no
> real need for each VM to have its own copy, except that doing it this
> way simplifies the low-level assembly code a little.

Well, my point is that I tried to separate the MMU implementation for =
the PR KVM stuff. What I'd like to see at the end of the day would be a =
"guest" hv implementation file that I could plug into PR kvm and have =
the MMU rolling by using the exact same code as the HV code. Only the =
backend would be different. Maybe there's some valid technical reason to =
not do it, but I haven't come across any yet :).

>=20
>>> +	/* First see what page size we have */
>>> +	psize =3D user_page_size(mem->userspace_addr);
>>> +	/* For now, only allow 16MB pages */
>>=20
>> The reason to go for 16MB pages is because of the host mmu code, not
>> the guest hv mmu. So please at least #ifdef the code to HV so we
>> document that correlation.=20
>=20
> I'm not sure what you mean by that.  The reason for going for 16MB
> pages initially is for performance (this way the guest can use 16MB
> pages for its linear mapping) and to avoid having to deal with the
> pages being paged or swapped on the host side.  Could you explain the
> "because of the host mmu code" part of your comment further?

If you consider a split as I described above, an HV guest running on PR =
KVM doesn't have to be mapped linearly. But then again - it could. In =
fact, it might even make sense. Hrm. Very good point! We could also just =
flip SDR1 in PR mode and reuse the exact same code as the HV mode.

However, my point here was that when we don't flip SDR1 but go through a =
shadow hpt, we don't have to map it to 16MB pages.

>=20
> What would adding #ifdef CONFIG_KVM_BOOK3S_64_HV achieve in a file
> whose name ends in _hv.c and which only gets compiled when
> CONFIG_KVM_BOOK3S_64_HV=3Dy?
>=20
>>> +int kvmppc_mmu_hv_init(void)
>>> +{
>>> +	if (!cpu_has_feature(CPU_FTR_HVMODE_206))
>>> +		return 0;
>>=20
>> Return 0 for "it doesn't work" might not be the right exit code ;).
>=20
> Good point, I'll make it -ENXIO or something.

Anything < 0 sounds good to me :). -EINVAL is probably the most sensible =
one here.

>=20
>>> +static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, =
gva_t eaddr,
>>> +				struct kvmppc_pte *gpte, bool data)
>>> +{
>>> +	return -ENOENT;
>>=20
>> Can't you just call the normal book3s_64 mmu code here? Without
>> xlate, doing ppc_ld doesn't work, which means that reading out the
>> faulting guest instruction breaks. We'd also need it for PR mode :).=20=

>=20
> With book3s_hv we currently have no situations where we need to read
> out the faulting guest instruction. =20
>=20
> We could use the normal code here if we had the guest HPT mapped into
> the qemu address space, which it currently isn't -- but should be.  It
> hasn't been a priority to fix because with book3s_hv we currently have
> no situations where we need to read out the faulting guest
> instruction.

Makes sense. We might however need it in case we ever use this code in =
PR mode :). I don't fully remember if you shoved around that code, but =
in case fetching the last instruction fails (which IIRC it can for you =
too), a manual load through this function gets issued.

>=20
>>> +void kvmppc_set_pvr(struct kvm_vcpu *vcpu, u32 pvr)
>>> +{
>>> +	vcpu->arch.pvr =3D pvr;
>>> +	kvmppc_mmu_book3s_hv_init(vcpu);
>>=20
>> No need for you to do it depending on pvr. You can just do the mmu
>> initialization on normal init :).
>=20
> OK, I'll do that.
>=20
>>> +	case BOOK3S_INTERRUPT_PROGRAM:
>>> +	{
>>> +		ulong flags;
>>> +		/*
>>> +		 * Normally program interrupts are delivered directly
>>> +		 * to the guest by the hardware, but we can get here
>>> +		 * as a result of a hypervisor emulation interrupt
>>> +		 * (e40) getting turned into a 700 by BML RTAS.
>>=20
>> Not sure I fully understand what's going on there. Mind to explain? =
:)
>=20
> Recent versions of the architecture don't actually deliver a 0x700
> interrupt to the OS on an illegal instruction; instead they generate
> an 0xe40 interrupt to the hypervisor in case the hypervisor wants to
> emulate the instruction.  If the hypervisor doesn't want to do that
> it's supposed to synthesize a 0x700 interrupt to the OS.
>=20
> When we're running this stuff under our BML (Bare Metal Linux)
> framework in the lab, we use a small RTAS implementation that gets
> loaded by the BML loader, and one of the things that this RTAS does is
> to patch the 0xe40 vector to make the 0xe40 interrupt come in via the
> 0x700 vector, in case the kernel you're running under BML hasn't been
> updated to have an 0xe40 handler.
>=20
> I could just remove that case, in fact.

Yeah, the main issue I see here is that there's no hardware that could =
run this code. Whatever the first hardware that would be able to =
requires should be used here.

>=20
>>> +	case BOOK3S_INTERRUPT_SYSCALL:
>>> +	{
>>> +		/* hcall - punt to userspace */
>>> +		int i;
>>> +
>>=20
>> Do we really want to accept sc from pr=3D1? I'd just reject them =
straight away.
>=20
> Good idea, I'll do that.
>=20
>>> +	case BOOK3S_INTERRUPT_H_DATA_STORAGE:
>>> +		vcpu->arch.dsisr =3D vcpu->arch.fault_dsisr;
>>> +		vcpu->arch.dear =3D vcpu->arch.fault_dar;
>>> +		kvmppc_inject_interrupt(vcpu, =
BOOK3S_INTERRUPT_DATA_STORAGE, 0);
>>> +		r =3D RESUME_GUEST;
>>> +		break;
>>> +	case BOOK3S_INTERRUPT_H_INST_STORAGE:
>>> +		kvmppc_inject_interrupt(vcpu, =
BOOK3S_INTERRUPT_INST_STORAGE,
>>> +					0x08000000);
>>=20
>> What do these do? I thought the guest gets DSI and ISI directly?
>=20
> It does for accesses with relocation (IR/DR) on, but because we have
> enabled VRMA mode (Virtualized Real Mode Area) in the LPCR, we get
> these interrupts to the hypervisor if the guest does a bad real-mode
> access.  If we also turned on Virtualized Partition Memory (VPM) mode
> in the LPCR, then all ISI/DSI in the guest come through to the
> hypervisor using these vectors in case the hypervisor wants to do any
> paging/swapping of guest memory.  I plan to do that later when we
> support using 4k/64k pages for guest memory.

I see - please put that explanation in a comment there ;).

>=20
>>> +	/* default to book3s_64 (power7) */
>>> +	vcpu->arch.pvr =3D 0x3f0200;
>>=20
>> Wouldn't it make sense to simply default to the host pvr? Not sure -
>> maybe it's not :).
>=20
> Sounds sensible, actually.  In fact the hypervisor can't spoof the PVR
> for the guest, that is, the guest can read the real PVR value and
> there's no way the hypervisor can stop it.

If it can't spoof it at all, then yes, use the host pvr.

In fact, thinking about this, how does userspace know which mode the =
kernel uses? It should be able to fail if we're trying to run a -M mac99 =
on this code ;).

>=20
>>> +	flush_fp_to_thread(current);
>>=20
>> Do those with fine with preemption enabled?
>=20
> Yes.  If a preemption happens, it will flush the FP/VMX/VSX registers
> out to the thread_struct, and then any of these explicit calls that
> happen after the preemption will do nothing.

Ah, the flushes themselves disable preemption during the flush :). Then =
it makes sense.

>=20
>>> +	/*
>>> +	 * Put whatever is in the decrementer into the
>>> +	 * hypervisor decrementer.
>>> +	 */
>>> +	mfspr	r8,SPRN_DEC
>>> +	mftb	r7
>>> +	mtspr	SPRN_HDEC,r8
>>=20
>> Can't we just always use HDEC on the host? That's save us from all
>> the swapping here.
>=20
> The problem is that there is only one HDEC per core, so using it would
> become complicated when the host is running in SMT4 or SMT2 mode.

I see.

>=20
>>> +	extsw	r8,r8
>>> +	add	r8,r8,r7
>>> +	std	r8,PACA_KVM_DECEXP(r13)
>>=20
>> Why is dec+hdec on vcpu_run decexp? What exactly does this store?
>=20
> R7 here is the current (or very recent, anyway) timebase value, so
> this stores the timebase value at which the host decrementer would get
> to zero.
>=20
>>> +	lwz	r3, PACA_HOST_PMC(r13)
>>> +	lwz	r4, PACA_HOST_PMC + 4(r13)
>>> +	lwz	r5, PACA_HOST_PMC + 4(r13)
>>=20
>> copy&paste error?
>=20
> Yes, thanks.
>=20
>>> +.global kvmppc_handler_lowmem_trampoline
>>> +kvmppc_handler_lowmem_trampoline:
>>> +	cmpwi	r12,0x500
>>> +	beq	1f
>>> +	cmpwi	r12,0x980
>>> +	beq	1f
>>=20
>> What?
>>=20
>> 1) use the macros please
>=20
> OK
>=20
>> 2) why?
>=20
> The external interrupt and hypervisor decrementer interrupt handlers
> expect the interrupted PC and MSR to be in HSRR0/1 rather than
> SRR0/1.  I could remove the case for 0x980 since we don't call the
> linux handler for HDEC interrupts any more.

Ah, makes sense. This also deserves a comment :)

>=20
>>> +	/* Check if HDEC expires soon */
>>> +	mfspr	r3,SPRN_HDEC
>>> +	cmpwi	r3,10
>>> +	li	r12,0x980
>>=20
>> define
>=20
> OK.
>=20
>>> +	mr	r9,r4
>>> +	blt	hdec_soon
>>=20
>> Is it faster to do the check than to save the check and take the
>> odds? Also, maybe we should rather do the check in preemptible
>> code that could just directly pass the time slice on.
>=20
> I do the check there because I was having problems where, if the HDEC
> goes negative before we do the partition switch, we would occasionally
> not get the HDEC interrupt at all until the next time HDEC went
> negative, ~ 8.4 seconds later.

Yikes - so HDEC is edge and doesn't even keep the interrupt line up? =
That sounds like a serious hardware limitation. What if you only use =
HDEC and it triggers while interrupts are off in a critical section? Is =
the hardware really that broken?

>=20
>>> +	/* See if this is a leftover HDEC interrupt */
>>> +	cmpwi	r12,0x980
>>=20
>> define
>=20
> OK.
>=20
>>> +	bne	2f
>>> +	mfspr	r3,SPRN_HDEC
>>> +	cmpwi	r3,0
>>> +	bge	ignore_hdec
>>> +2:
>>> +
>>> +	/* Check for mediated interrupts (could be done earlier really =
...) */
>>> +	cmpwi	r12,0x500
>>=20
>> define
>=20
> OK.
>=20
>>> +	bne+	1f
>>> +	ld	r5,VCPU_LPCR(r9)
>>> +	andi.	r0,r11,MSR_EE
>>> +	beq	1f
>>> +	andi.	r0,r5,LPCR_MER
>>> +	bne	bounce_ext_interrupt
>>=20
>> So there's no need for the external check that directly goes into
>> the Linux handler code on full-fledged exits?
>=20
> No, we still need that; ordinary external interrupts come to the
> hypervisor regardless of the guest's MSR.EE setting.
>=20
> The MER bit (Mediated External Request) is a way for the hypervisor to
> know when the guest sets its MSR.EE bit.  If an event happens that
> means the host wants to give a guest vcpu an external interrupt, but
> the guest vcpu has MSR.EE =3D 0, then the host can't deliver the
> simulated external interrupt.  Instead it sets LPCR.MER, which has the
> effect that the hardware will deliver an external interrupt (to the
> hypervisor) when running in the guest and it has MSR.EE =3D 1.
>=20
> So, when we get an external interrupt, LPCR.MER =3D 1 and MSR.EE =3D =
1,
> we need to synthesize an external interrupt in the guest.  Doing it
> here means that we don't need to do the full partition switch out to
> the host and back again.

Ah, special functionality then. Please comment this in the code, so the =
unknowledgeable reader (me) knows what this is about.

>=20
>>> +	/* Read the guest SLB and save it away */
>>> +	li	r6,0
>>> +	addi	r7,r9,VCPU_SLB
>>> +	li	r5,0
>>> +1:	slbmfee	r8,r6
>>> +	andis.	r0,r8,SLB_ESID_V@h
>>> +	beq	2f
>>> +	add	r8,r8,r6		/* put index in */
>>> +	slbmfev	r3,r6
>>> +	std	r8,VCPU_SLB_E(r7)
>>> +	std	r3,VCPU_SLB_V(r7)
>>> +	addi	r7,r7,VCPU_SLB_SIZE
>>> +	addi	r5,r5,1
>>> +2:	addi	r6,r6,1
>>> +	cmpwi	r6,32
>>=20
>> I don't like how the 32 is hardcoded here. Better create a define
>> for it and use the same in the init code.
>=20
> Sure.  In fact I probably should use vcpu->arch.slb_nr here.

and bctr? yup :)

>=20
>>> +hdec_soon:
>>> +	/* Switch back to host partition */
>>> +	ld	r4,VCPU_KVM(r9)		/* pointer to struct kvm */
>>> +	ld	r6,KVM_HOST_SDR1(r4)
>>> +	lwz	r7,KVM_HOST_LPID(r4)
>>> +	li	r8,0x3ff		/* switch to reserved LPID */
>>=20
>> is it reserved by ISA? Either way, hard-coding the constant without
>> a name is not nice :).
>=20
> Actually, it just has to be an LPID value that isn't ever used for
> running a real guest (or the host).  I'll make a name for it.
>=20
>>> +	lis	r8,0x7fff
>>=20
>> INT_MAX@h might be more readable.
>=20
> OK.
>=20
>>> int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
>>> {
>>> -	return !(v->arch.shared->msr & MSR_WE) ||
>>> +#ifndef CONFIG_KVM_BOOK3S_64_HV
>>> +	return !(kvmppc_get_msr(v) & MSR_WE) ||
>>> 	       !!(v->arch.pending_exceptions);
>>> +#else
>>> +	return 1;
>>=20
>> So what happens if the guest sets MSR_WE? It just stays in guest
>> context idling? That'd be pretty bad for scheduling on the host.
>=20
> The MSR_WE bit doesn't exist on POWER7 (or any of POWER[4567], in
> fact).  If the guest wants to idle it calls the H_CEDE hcall.

Ah, interesting. Didn't know :).

>=20
>>> @@ -184,12 +188,14 @@ int kvm_dev_ioctl_check_extension(long ext)
>>> 	case KVM_CAP_PPC_IRQ_LEVEL:
>>> 	case KVM_CAP_ENABLE_CAP:
>>> 	case KVM_CAP_PPC_OSI:
>>> +#ifndef CONFIG_KVM_BOOK3S_64_HV
>>=20
>> You also don't do OSI on HV :).
>=20
> Good point, I'll fix that.
>=20
>>> @@ -496,8 +510,11 @@ int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu =
*vcpu, struct kvm_interrupt *irq)
>>> 	if (waitqueue_active(&vcpu->wq)) {
>>> 		wake_up_interruptible(&vcpu->wq);
>>> 		vcpu->stat.halt_wakeup++;
>>> +#ifdef CONFIG_KVM_BOOK3S_64_HV
>>> +	} else if (vcpu->cpu !=3D -1) {
>>> +		smp_send_reschedule(vcpu->cpu);
>>=20
>> Shouldn't this be done for non-HV too? The only reason we don't do
>> it yet is because we don't do SMP, no?
>=20
> I didn't know why you didn't do it for non-HV, so I didn't change it
> for that case.  If you say it's OK, I'll change it (we'll need to set
> vcpu->cpu in the vcpu_load/unload code for book3s_pr too, then).

Sure, go ahead and set it.


Alex

^ permalink raw reply

* Re: [PATCH 0/13] Hypervisor-mode KVM on POWER7
From: Alexander Graf @ 2011-05-17  9:46 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, kvm
In-Reply-To: <20110511103443.GA2837@brick.ozlabs.ibm.com>


On 11.05.2011, at 12:34, Paul Mackerras wrote:

> The following series of patches enable KVM to exploit the hardware
> hypervisor mode on 64-bit Power ISA Book3S machines.  At present only
> POWER7 is supported, but it would be easy to add other processors.
>=20
> Running the KVM host in hypervisor mode means that the guest can use
> both supervisor mode and user mode.  That means that the guest can
> execute supervisor-privilege instructions and access supervisor-
> privilege registers.  In addition the hardware directs most exceptions
> to the guest.  Thus we don't need to emulate any instructions in the
> host.  Generally, the only times we need to exit the guest are when it
> does a hypercall or when an external interrupt or host timer
> (decrementer) interrupt occurs.
>=20
> The focus of this KVM implementation is to run guests that use the
> PAPR (Power Architecture Platform Requirements) paravirtualization
> interface, which is the interface supplied by PowerVM on IBM pSeries
> machines.
>=20
> These patches are against Ben Herrenschmidt's next branch in his tree
> at git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc.git.

Very nice patches indeed :). Is there any way I can test them? I don't =
like pulling code that I couldn't run anywhere yet.


Alex

^ permalink raw reply

* Re: [PATCH 12/13] kvm/powerpc: Accelerate H_PUT_TCE by implementing it in real mode
From: Alexander Graf @ 2011-05-17  9:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, Paul Mackerras, kvm
In-Reply-To: <1305624945.2781.21.camel@pasglop>


On 17.05.2011, at 11:35, Benjamin Herrenschmidt wrote:

> On Tue, 2011-05-17 at 11:31 +0200, Alexander Graf wrote:
>> On 17.05.2011, at 11:11, Benjamin Herrenschmidt wrote:
>>=20
>>> On Tue, 2011-05-17 at 10:01 +0200, Alexander Graf wrote:
>>>> I'm not sure I fully understand how this is supposed to work. If =
the
>>>> tables are kept inside the kernel, how does userspace get to know
>>>> where to DMA to?
>>>=20
>>> The guest gets a dma range from the device-tree which is the range =
of
>>> device-side dma addresses it can use that correspond to the table.
>>>=20
>>> The guest kernel uses the normal linux iommu space allocator to =
allocate
>>> space in that region and uses H_PUT_TCE to populate the =
corresponding
>>> table entries.
>>>=20
>>> This is the same interface that is used for "real" iommu's with PCI
>>> devices btw.
>>=20
>> I'm still slightly puzzled here :). IIUC the main point of an IOMMU =
is for the kernel
>> to change where device accesses actually go to. So device DMAs =
address A, goes through
>> the IOMMU, in reality accesses address B.
>=20
> Right :-)
>=20
>> Now, how do we tell the devices implemented in qemu that they're =
supposed to DMA to
>> address B instead of A if the mapping table is kept in-kernel?
>=20
> Oh, bcs qemu mmaps the table :-)

That's the piece to the puzzle I was missing. Please document that =
interface properly - it needs to be rock stable :)


Alex

^ permalink raw reply

* Re: [PATCH 12/13] kvm/powerpc: Accelerate H_PUT_TCE by implementing it in real mode
From: Benjamin Herrenschmidt @ 2011-05-17  9:35 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linuxppc-dev, Paul Mackerras, kvm
In-Reply-To: <E483EC5C-5F4A-4783-88D2-1728BD902037@suse.de>

On Tue, 2011-05-17 at 11:31 +0200, Alexander Graf wrote:
> On 17.05.2011, at 11:11, Benjamin Herrenschmidt wrote:
> 
> > On Tue, 2011-05-17 at 10:01 +0200, Alexander Graf wrote:
> >> I'm not sure I fully understand how this is supposed to work. If the
> >> tables are kept inside the kernel, how does userspace get to know
> >> where to DMA to?
> > 
> > The guest gets a dma range from the device-tree which is the range of
> > device-side dma addresses it can use that correspond to the table.
> > 
> > The guest kernel uses the normal linux iommu space allocator to allocate
> > space in that region and uses H_PUT_TCE to populate the corresponding
> > table entries.
> > 
> > This is the same interface that is used for "real" iommu's with PCI
> > devices btw.
> 
> I'm still slightly puzzled here :). IIUC the main point of an IOMMU is for the kernel
> to change where device accesses actually go to. So device DMAs address A, goes through
> the IOMMU, in reality accesses address B.

Right :-)

> Now, how do we tell the devices implemented in qemu that they're supposed to DMA to
> address B instead of A if the mapping table is kept in-kernel?

Oh, bcs qemu mmaps the table :-)

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH 12/13] kvm/powerpc: Accelerate H_PUT_TCE by implementing it in real mode
From: Alexander Graf @ 2011-05-17  9:31 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, Paul Mackerras, kvm
In-Reply-To: <1305623510.2781.20.camel@pasglop>

On 17.05.2011, at 11:11, Benjamin Herrenschmidt wrote:

> On Tue, 2011-05-17 at 10:01 +0200, Alexander Graf wrote:
>> I'm not sure I fully understand how this is supposed to work. If the
>> tables are kept inside the kernel, how does userspace get to know
>> where to DMA to?
>=20
> The guest gets a dma range from the device-tree which is the range of
> device-side dma addresses it can use that correspond to the table.
>=20
> The guest kernel uses the normal linux iommu space allocator to =
allocate
> space in that region and uses H_PUT_TCE to populate the corresponding
> table entries.
>=20
> This is the same interface that is used for "real" iommu's with PCI
> devices btw.

I'm still slightly puzzled here :). IIUC the main point of an IOMMU is =
for the kernel to change where device accesses actually go to. So device =
DMAs address A, goes through the IOMMU, in reality accesses address B.

Now, how do we tell the devices implemented in qemu that they're =
supposed to DMA to address B instead of A if the mapping table is kept =
in-kernel?

Alex

^ permalink raw reply

* Re: [PATCH 12/13] kvm/powerpc: Accelerate H_PUT_TCE by implementing it in real mode
From: Benjamin Herrenschmidt @ 2011-05-17  9:11 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linuxppc-dev, Paul Mackerras, kvm
In-Reply-To: <E9A07C6B-00BC-490F-95E7-5B025A626C98@suse.de>

On Tue, 2011-05-17 at 10:01 +0200, Alexander Graf wrote:
> I'm not sure I fully understand how this is supposed to work. If the
> tables are kept inside the kernel, how does userspace get to know
> where to DMA to?

The guest gets a dma range from the device-tree which is the range of
device-side dma addresses it can use that correspond to the table.

The guest kernel uses the normal linux iommu space allocator to allocate
space in that region and uses H_PUT_TCE to populate the corresponding
table entries.

This is the same interface that is used for "real" iommu's with PCI
devices btw.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH 13/13] kvm/powerpc: Allow book3s_hv guests to use SMT processor modes
From: Alexander Graf @ 2011-05-17  8:21 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, kvm
In-Reply-To: <20110511104656.GN2837@brick.ozlabs.ibm.com>


On 11.05.2011, at 12:46, Paul Mackerras wrote:

> This lifts the restriction that book3s_hv guests can only run one
> hardware thread per core, and allows them to use up to 4 threads
> per core on POWER7.  The host still has to run single-threaded.
>=20
> This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
> capability.
>=20
> To use this, the host kernel should be booted with all threads
> active, and then all the secondary threads should be offlined.
> This will put the secondary threads into nap mode.  KVM will then
> wake them from nap mode and use them for running guest code (while
> they are still offline).  To wake the secondary threads, we send
> them an IPI using a new xics_wake_cpu() function, implemented in
> arch/powerpc/sysdev/xics/icp-native.c.  In other words, at this stage
> we assume that the platform has a XICS interrupt controller and
> we are using icp-native.c to drive it.  Since the woken thread will
> need to acknowledge and clear the IPI, we also export the base
> physical address of the XICS registers using kvmppc_set_xics_phys()
> for use in the low-level KVM book3s code.
>=20
> When a vcpu is created, it is assigned to a virtual core (vcore).
> The vcore number is obtained by dividing the vcpu number by 4,
> since we assume at most 4 threads per core.  Thus, if qemu wishes
> to run the guest in single-threaded mode, it should make all vcpu
> numbers be multiples of 4.
>=20
> We distinguish three states of a vcpu: runnable (i.e., ready to =
execute
> the guest), blocked (that is, idle), and busy in host.  We currently
> implement a policy that the vcore can run only when all its threads
> are runnable or blocked.  This way, if a vcpu needs to execute =
elsewhere
> in the kernel or in qemu, it can do so without being starved of CPU
> by the other vcpus.
>=20
> When a vcore starts to run, it executes in the context of one of the
> vcpu threads.  The other vcpu threads all go to sleep and stay asleep
> until something happens requiring the vcpu thread to return to qemu,
> or to wake up to run the vcore (this can happen when another vcpu
> thread goes from busy in host state to blocked).
>=20
> It can happen that a vcpu goes from blocked to runnable state (e.g.
> because of an interrupt), and the vcore it belongs to is already
> running.  In that case it can start to run immediately as long as
> the none of the vcpus in the vcore have started to exit the guest.
> We send the next free thread in the vcore an IPI to get it to start
> to execute the guest.  It synchronizes with the other threads via
> the vcore->entry_exit_count field to make sure that it doesn't go
> into the guest if the other vcpus are exiting by the time that it
> is ready to actually enter the guest.
>=20
> Note that there is no fixed relationship between the hardware thread
> number and the vcpu number.  Hardware threads are assigned to vcpus
> as they become runnable, so we will always use the lower-numbered
> hardware threads in preference to higher-numbered threads if not all
> the vcpus in the vcore are runnable, regardless of which vcpus are
> runnable.
>=20
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
> arch/powerpc/include/asm/kvm.h          |    1 +
> arch/powerpc/include/asm/kvm_host.h     |   45 +++++-
> arch/powerpc/include/asm/kvm_ppc.h      |   13 ++
> arch/powerpc/include/asm/paca.h         |    2 +
> arch/powerpc/kernel/asm-offsets.c       |    6 +
> arch/powerpc/kernel/exceptions-64s.S    |   31 +++-
> arch/powerpc/kernel/idle_power7.S       |    2 -
> arch/powerpc/kvm/Kconfig                |    2 +-
> arch/powerpc/kvm/book3s_hv.c            |  266 =
+++++++++++++++++++++++++++++--
> arch/powerpc/kvm/book3s_hv_rmhandlers.S |  157 ++++++++++++++++++-
> arch/powerpc/sysdev/xics/icp-native.c   |    7 +
> include/linux/kvm.h                     |    3 +
> 12 files changed, 506 insertions(+), 29 deletions(-)
>=20
> diff --git a/arch/powerpc/include/asm/kvm.h =
b/arch/powerpc/include/asm/kvm.h
> index a9e641b..624c872 100644
> --- a/arch/powerpc/include/asm/kvm.h
> +++ b/arch/powerpc/include/asm/kvm.h
> @@ -24,6 +24,7 @@
>=20
> /* Select powerpc specific features in <linux/kvm.h> */
> #define __KVM_HAVE_SPAPR_TCE
> +#define __KVM_HAVE_PPC_SMT
>=20
> struct kvm_regs {
> 	__u64 pc;
> diff --git a/arch/powerpc/include/asm/kvm_host.h =
b/arch/powerpc/include/asm/kvm_host.h
> index cda183e..a2085da 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -25,10 +25,15 @@
> #include <linux/interrupt.h>
> #include <linux/types.h>
> #include <linux/kvm_types.h>
> +#include <linux/threads.h>
> +#include <linux/spinlock.h>
> #include <linux/kvm_para.h>
> #include <asm/kvm_asm.h>
> +#include <asm/processor.h>
>=20
> -#define KVM_MAX_VCPUS 1
> +#define KVM_MAX_VCPUS		NR_CPUS
> +#define KVM_THREADS_PER_CORE	4

So what if POWER8 (or whatever it will be called) comes along with 8 =
threads per core? Would that change the userspace interface?

> +#define KVM_MAX_VCORES		(KVM_MAX_VCPUS / =
KVM_THREADS_PER_CORE)
> #define KVM_MEMORY_SLOTS 32
> /* memory slots that does not exposed to userspace */
> #define KVM_PRIVATE_MEM_SLOTS 4
> @@ -165,9 +170,36 @@ struct kvm_arch {
> 	unsigned long host_sdr1;
> 	int tlbie_lock;
> 	unsigned short last_vcpu[NR_CPUS];
> +	struct kvmppc_vcore *vcores[KVM_MAX_VCORES];
> 	struct list_head spapr_tce_tables;
> };
>=20
> +/*
> + * Struct for a virtual core.
> + * Note: entry_exit_count combines an entry count in the bottom 8 =
bits
> + * and an exit count in the next 8 bits.  This is so that we can
> + * atomically increment the entry count iff the exit count is 0
> + * without taking the lock.
> + */
> +struct kvmppc_vcore {
> +	int n_runnable;
> +	int n_blocked;
> +	int num_threads;
> +	int entry_exit_count;
> +	int n_woken;
> +	int nap_count;
> +	u16 pcpu;
> +	u8 vcore_running;
> +	u8 in_guest;
> +	struct kvm_vcpu *runnable_threads[KVM_THREADS_PER_CORE];
> +	struct task_struct *run_task[KVM_THREADS_PER_CORE];
> +	struct kvm_run *kvm_run[KVM_THREADS_PER_CORE];
> +	spinlock_t lock;
> +};
> +
> +#define VCORE_ENTRY_COUNT(vc)	((vc)->entry_exit_count & 0xff)
> +#define VCORE_EXIT_COUNT(vc)	((vc)->entry_exit_count >> 8)
> +
> struct kvmppc_pte {
> 	ulong eaddr;
> 	u64 vpage;
> @@ -362,10 +394,21 @@ struct kvm_vcpu_arch {
> 	struct slb_shadow *slb_shadow;
> 	struct dtl *dtl;
> 	struct dtl *dtl_end;
> +
> +	struct kvmppc_vcore *vcore;
> +	int ret;
> 	int trap;
> +	int state;
> +	int ptid;
> +	wait_queue_head_t cpu_run;
> +
> 	struct kvm_vcpu_arch_shared *shared;
> 	unsigned long magic_page_pa; /* phys addr to map the magic page =
to */
> 	unsigned long magic_page_ea; /* effect. addr to map the magic =
page to */
> };
>=20
> +#define KVMPPC_VCPU_BUSY_IN_HOST	0
> +#define KVMPPC_VCPU_BLOCKED		1
> +#define KVMPPC_VCPU_RUNNABLE		2
> +
> #endif /* __POWERPC_KVM_HOST_H__ */
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h =
b/arch/powerpc/include/asm/kvm_ppc.h
> index de683fa..ed75975 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -33,6 +33,9 @@
> #else
> #include <asm/kvm_booke.h>
> #endif
> +#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
> +#include <asm/paca.h>
> +#endif
>=20
> enum emulation_result {
> 	EMULATE_DONE,         /* no further processing */
> @@ -159,4 +162,14 @@ static inline u32 kvmppc_set_field(u64 inst, int =
msb, int lsb, int value)
> 	return r;
> }
>=20
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +static inline void kvmppc_set_xics_phys(int cpu, unsigned long addr)
> +{
> +	paca[cpu].xics_phys =3D addr;
> +}
> +#else
> +static inline void kvmppc_set_xics_phys(int cpu, unsigned long addr)
> +{}
> +#endif
> +
> #endif /* __POWERPC_KVM_PPC_H__ */
> diff --git a/arch/powerpc/include/asm/paca.h =
b/arch/powerpc/include/asm/paca.h
> index 8dba5f6..8b6628c 100644
> --- a/arch/powerpc/include/asm/paca.h
> +++ b/arch/powerpc/include/asm/paca.h
> @@ -151,6 +151,8 @@ struct paca_struct {
> 	struct kvmppc_book3s_shadow_vcpu shadow_vcpu;
> #ifdef CONFIG_KVM_BOOK3S_64_HV
> 	struct kvm_vcpu *kvm_vcpu;
> +	struct kvmppc_vcore *kvm_vcore;
> +	unsigned long xics_phys;
> 	u64 dabr;
> 	u64 host_mmcr[3];
> 	u32 host_pmc[6];
> diff --git a/arch/powerpc/kernel/asm-offsets.c =
b/arch/powerpc/kernel/asm-offsets.c
> index fd56f14..1fee2cf 100644
> --- a/arch/powerpc/kernel/asm-offsets.c
> +++ b/arch/powerpc/kernel/asm-offsets.c
> @@ -204,12 +204,14 @@ int main(void)
> 	DEFINE(PACA_KVM_SVCPU, offsetof(struct paca_struct, =
shadow_vcpu));
> #ifdef CONFIG_KVM_BOOK3S_64_HV
> 	DEFINE(PACA_KVM_VCPU, offsetof(struct paca_struct, kvm_vcpu));
> +	DEFINE(PACA_KVM_VCORE, offsetof(struct paca_struct, kvm_vcore));
> 	DEFINE(PACA_HOST_MMCR, offsetof(struct paca_struct, host_mmcr));
> 	DEFINE(PACA_HOST_PMC, offsetof(struct paca_struct, host_pmc));
> 	DEFINE(PACA_HOST_PURR, offsetof(struct paca_struct, host_purr));
> 	DEFINE(PACA_HOST_SPURR, offsetof(struct paca_struct, =
host_spurr));
> 	DEFINE(PACA_HOST_DSCR, offsetof(struct paca_struct, host_dscr));
> 	DEFINE(PACA_DABR, offsetof(struct paca_struct, dabr));
> +	DEFINE(PACA_XICS_PHYS, offsetof(struct paca_struct, xics_phys));
> 	DEFINE(PACA_KVM_DECEXP, offsetof(struct paca_struct, =
dec_expires));
> #endif
> #endif /* CONFIG_KVM_BOOK3S_64_HANDLER */
> @@ -478,6 +480,10 @@ int main(void)
> 	DEFINE(VCPU_FAULT_DAR, offsetof(struct kvm_vcpu, =
arch.fault_dar));
> 	DEFINE(VCPU_LAST_INST, offsetof(struct kvm_vcpu, =
arch.last_inst));
> 	DEFINE(VCPU_TRAP, offsetof(struct kvm_vcpu, arch.trap));
> +	DEFINE(VCPU_PTID, offsetof(struct kvm_vcpu, arch.ptid));
> +	DEFINE(VCORE_ENTRY_EXIT, offsetof(struct kvmppc_vcore, =
entry_exit_count));
> +	DEFINE(VCORE_NAP_COUNT, offsetof(struct kvmppc_vcore, =
nap_count));
> +	DEFINE(VCORE_IN_GUEST, offsetof(struct kvmppc_vcore, in_guest));
> 	DEFINE(VCPU_SVCPU, offsetof(struct kvmppc_vcpu_book3s, =
shadow_vcpu) -
> 			   offsetof(struct kvmppc_vcpu_book3s, vcpu));
> 	DEFINE(SVCPU_HOST_R1, offsetof(struct kvmppc_book3s_shadow_vcpu, =
host_r1));
> diff --git a/arch/powerpc/kernel/exceptions-64s.S =
b/arch/powerpc/kernel/exceptions-64s.S
> index 80c6456..803dcff 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -49,19 +49,32 @@ BEGIN_FTR_SECTION
> 	 * state loss at this time.
> 	 */
> 	mfspr	r13,SPRN_SRR1
> -	rlwinm	r13,r13,47-31,30,31
> -	cmpwi	cr0,r13,1
> -	bne	1f
> -	b	.power7_wakeup_noloss
> -1:	cmpwi	cr0,r13,2
> -	bne	1f
> -	b	.power7_wakeup_loss
> +	rlwinm.	r13,r13,47-31,30,31
> +	beq	9f
> +
> +	/* waking up from powersave (nap) state */
> +	cmpwi	cr1,r13,2
> 	/* Total loss of HV state is fatal, we could try to use the
> 	 * PIR to locate a PACA, then use an emergency stack etc...
> 	 * but for now, let's just stay stuck here
> 	 */
> -1:	cmpwi	cr0,r13,3
> -	beq	.
> +	bgt	cr1,.
> +	GET_PACA(r13)
> +
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +	lbz	r0,PACAPROCSTART(r13)
> +	cmpwi	r0,0x80
> +	bne	1f
> +	li	r0,0
> +	stb	r0,PACAPROCSTART(r13)
> +	b	kvm_start_guest
> +1:
> +#endif
> +
> +	beq	cr1,2f
> +	b	.power7_wakeup_noloss
> +2:	b	.power7_wakeup_loss
> +9:
> END_FTR_SECTION_IFSET(CPU_FTR_HVMODE_206)
> #endif /* CONFIG_PPC_P7_NAP */
> 	EXCEPTION_PROLOG_PSERIES(PACA_EXGEN, system_reset_common, =
EXC_STD,
> diff --git a/arch/powerpc/kernel/idle_power7.S =
b/arch/powerpc/kernel/idle_power7.S
> index f8f0bc7..3a70845 100644
> --- a/arch/powerpc/kernel/idle_power7.S
> +++ b/arch/powerpc/kernel/idle_power7.S
> @@ -73,7 +73,6 @@ _GLOBAL(power7_idle)
> 	b	.
>=20
> _GLOBAL(power7_wakeup_loss)
> -	GET_PACA(r13)
> 	ld	r1,PACAR1(r13)
> 	REST_NVGPRS(r1)
> 	REST_GPR(2, r1)
> @@ -87,7 +86,6 @@ _GLOBAL(power7_wakeup_loss)
> 	rfid
>=20
> _GLOBAL(power7_wakeup_noloss)
> -	GET_PACA(r13)
> 	ld	r1,PACAR1(r13)
> 	ld	r4,_MSR(r1)
> 	ld	r5,_NIP(r1)
> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> index 6ff191b..27d8e36 100644
> --- a/arch/powerpc/kvm/Kconfig
> +++ b/arch/powerpc/kvm/Kconfig
> @@ -58,7 +58,7 @@ config KVM_BOOK3S_64
>=20
> config KVM_BOOK3S_64_HV
> 	bool "KVM support for POWER7 using hypervisor mode in host"
> -	depends on EXPERIMENTAL && PPC_BOOK3S_64
> +	depends on EXPERIMENTAL && PPC_BOOK3S_64 && PPC_PSERIES
> 	select KVM
> 	select KVM_BOOK3S_64
> 	---help---
> diff --git a/arch/powerpc/kvm/book3s_hv.c =
b/arch/powerpc/kvm/book3s_hv.c
> index eed2c10..ce006c0 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -50,6 +50,7 @@
> void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> {
> 	local_paca->kvm_vcpu =3D vcpu;
> +	local_paca->kvm_vcore =3D vcpu->arch.vcore;
> 	vcpu->cpu =3D cpu;
> }
>=20
> @@ -58,6 +59,9 @@ void kvmppc_core_vcpu_put(struct kvm_vcpu *vcpu)
> 	vcpu->cpu =3D -1;
> }
>=20
> +static void kvmppc_vcpu_blocked(struct kvm_vcpu *vcpu);
> +static void kvmppc_vcpu_unblocked(struct kvm_vcpu *vcpu);
> +
> void kvmppc_vcpu_block(struct kvm_vcpu *vcpu)
> {
> 	u64 now;
> @@ -75,11 +79,15 @@ void kvmppc_vcpu_block(struct kvm_vcpu *vcpu)
> 			      HRTIMER_MODE_REL);
> 	}
>=20
> +	kvmppc_vcpu_blocked(vcpu);
> +
> 	kvm_vcpu_block(vcpu);
> 	vcpu->stat.halt_wakeup++;
>=20
> 	if (vcpu->arch.dec_expires !=3D ~(u64)0)
> 		hrtimer_try_to_cancel(&vcpu->arch.dec_timer);
> +
> +	kvmppc_vcpu_unblocked(vcpu);
> }
>=20
> void kvmppc_set_msr(struct kvm_vcpu *vcpu, u64 msr)
> @@ -415,6 +423,8 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct =
kvm *kvm, unsigned int id)
> {
> 	struct kvm_vcpu *vcpu;
> 	int err =3D -ENOMEM;
> +	int core;
> +	struct kvmppc_vcore *vcore;
> 	unsigned long lpcr;
>=20
> 	vcpu =3D kzalloc(sizeof(struct kvm_vcpu), GFP_KERNEL);
> @@ -443,6 +453,37 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct =
kvm *kvm, unsigned int id)
> 	lpcr |=3D LPCR_VPM0 | LPCR_VRMA_L | (4UL << LPCR_DPFD_SH) | =
LPCR_HDICE;
> 	vcpu->arch.lpcr =3D lpcr;
>=20
> +	/*
> +	 * Some vcpus may start out in stopped state.  If we initialize
> +	 * them to busy-in-host state they will stop other vcpus in the
> +	 * vcore from running.  Instead we initialize them to blocked
> +	 * state, effectively considering them to be stopped until we
> +	 * see the first run ioctl for them.
> +	 */
> +	vcpu->arch.state =3D KVMPPC_VCPU_BLOCKED;
> +
> +	init_waitqueue_head(&vcpu->arch.cpu_run);
> +	core =3D id / KVM_THREADS_PER_CORE;
> +
> +	mutex_lock(&kvm->lock);
> +	vcore =3D kvm->arch.vcores[core];
> +	if (!vcore) {
> +		vcore =3D kzalloc(sizeof(struct kvmppc_vcore), =
GFP_KERNEL);
> +		if (vcore)
> +			spin_lock_init(&vcore->lock);
> +		kvm->arch.vcores[core] =3D vcore;
> +	}
> +	mutex_unlock(&kvm->lock);
> +
> +	if (!vcore)
> +		goto free_vcpu;
> +
> +	spin_lock(&vcore->lock);
> +	++vcore->num_threads;
> +	++vcore->n_blocked;
> +	spin_unlock(&vcore->lock);
> +	vcpu->arch.vcore =3D vcore;
> +
> 	return vcpu;
>=20
> free_vcpu:
> @@ -457,37 +498,238 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu =
*vcpu)
> 	kfree(vcpu);
> }
>=20
> +static void kvmppc_vcpu_blocked(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmppc_vcore *vc =3D vcpu->arch.vcore;
> +
> +	spin_lock(&vc->lock);
> +	vcpu->arch.state =3D KVMPPC_VCPU_BLOCKED;
> +	++vc->n_blocked;
> +	if (vc->n_runnable > 0 &&
> +	    vc->n_runnable + vc->n_blocked =3D=3D vc->num_threads)
> +		wake_up(&vc->runnable_threads[0]->arch.cpu_run);
> +	spin_unlock(&vc->lock);
> +}
> +
> +static void kvmppc_vcpu_unblocked(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmppc_vcore *vc =3D vcpu->arch.vcore;
> +
> +	spin_lock(&vc->lock);
> +	vcpu->arch.state =3D KVMPPC_VCPU_BUSY_IN_HOST;
> +	--vc->n_blocked;
> +	spin_unlock(&vc->lock);
> +}
> +
> extern int __kvmppc_vcore_entry(struct kvm_run *kvm_run, struct =
kvm_vcpu *vcpu);
> +extern void xics_wake_cpu(int cpu);
> +
> +static void kvmppc_remove_runnable(struct kvmppc_vcore *vc, int ptid)
> +{
> +	int i;
> +	struct kvm_vcpu *vcpu;
> +
> +	vcpu =3D vc->runnable_threads[ptid];
> +	if (vcpu->arch.state !=3D KVMPPC_VCPU_RUNNABLE)
> +		return;
> +	vcpu->arch.state =3D KVMPPC_VCPU_BUSY_IN_HOST;
> +	--vc->n_runnable;
> +	for (i =3D ptid; i < vc->n_runnable; ++i) {
> +		vcpu =3D vc->runnable_threads[i+1];
> +		vc->runnable_threads[i] =3D vcpu;
> +		vcpu->arch.ptid =3D i;
> +		vc->run_task[i] =3D vc->run_task[i+1];
> +		vc->kvm_run[i] =3D vc->kvm_run[i+1];
> +	}
> +}
> +
> +static void kvmppc_start_thread(struct kvm_vcpu *vcpu)
> +{
> +	int cpu;
> +	struct paca_struct *tpaca;
> +	struct kvmppc_vcore *vc =3D vcpu->arch.vcore;
> +
> +	cpu =3D vc->pcpu + vcpu->arch.ptid;
> +	tpaca =3D &paca[cpu];
> +	tpaca->kvm_vcpu =3D vcpu;
> +	tpaca->kvm_vcore =3D vc;
> +	smp_wmb();
> +	if (vcpu->arch.ptid) {
> +		tpaca->cpu_start =3D 0x80;
> +		tpaca->shadow_vcpu.in_guest =3D KVM_GUEST_MODE_GUEST;
> +		wmb();
> +		xics_wake_cpu(cpu);
> +		++vc->n_woken;
> +	}
> +}
> +
> +static void kvmppc_wait_for_nap(struct kvmppc_vcore *vc)
> +{
> +	int i;
> +
> +	HMT_low();
> +	i =3D 0;
> +	while (vc->nap_count < vc->n_woken) {
> +		if (++i >=3D 1000000) {
> +			pr_err("kvmppc_wait_for_nap timeout %d %d\n",
> +			       vc->nap_count, vc->n_woken);
> +			break;
> +		}
> +		cpu_relax();
> +	}
> +	HMT_medium();
> +}
>=20
> -static int kvmppc_run_vcpu(struct kvm_run *run, struct kvm_vcpu =
*vcpu)
> +static int kvmppc_run_core(struct kvmppc_vcore *vc)
> {
> +	struct kvm_vcpu *vcpu;
> +	long i, ret;
> 	u64 now;
>=20
> +	/* don't start if any threads have a signal pending */
> +	for (i =3D 0; i < vc->n_runnable; ++i) {
> +		if (signal_pending(vc->run_task[i]))
> +			return 0;
> +	}
> +
> +	vc->n_woken =3D 0;
> +	vc->nap_count =3D 0;
> +	vc->entry_exit_count =3D 0;
> +	vc->vcore_running =3D 1;
> +	vc->in_guest =3D 0;
> +	vc->pcpu =3D smp_processor_id();
> +	for (i =3D 0; i < vc->n_runnable; ++i)
> +		kvmppc_start_thread(vc->runnable_threads[i]);
> +	vcpu =3D vc->runnable_threads[0];
> +
> +	spin_unlock(&vc->lock);
> +
> +	kvm_guest_enter();
> +	__kvmppc_vcore_entry(NULL, vcpu);
> +
> +	/* wait for secondary threads to get back to nap mode */
> +	spin_lock(&vc->lock);
> +	if (vc->nap_count < vc->n_woken)
> +		kvmppc_wait_for_nap(vc);

So you're taking the vcore wide lock and wait for other CPUs to set =
themselves to nap? Not sure I fully understand this. Why would another =
thread want to go to nap mode when it's 100% busy?

> +	vc->vcore_running =3D 2;
> +	spin_unlock(&vc->lock);
> +
> +	/* make sure updates to secondary vcpu structs are visible now =
*/
> +	smp_mb();
> +	kvm_guest_exit();
> +
> +	preempt_enable();
> +	kvm_resched(vcpu);
> +
> +	now =3D get_tb();
> +	for (i =3D 0; i < vc->n_runnable; ++i) {
> +		vcpu =3D vc->runnable_threads[i];
> +		/* cancel pending dec exception if dec is positive */
> +		if (now < vcpu->arch.dec_expires &&
> +		    kvmppc_core_pending_dec(vcpu))
> +			kvmppc_core_dequeue_dec(vcpu);
> +		if (!vcpu->arch.trap) {
> +			if (signal_pending(vc->run_task[i])) {
> +				vc->kvm_run[i]->exit_reason =3D =
KVM_EXIT_INTR;
> +				vcpu->arch.ret =3D -EINTR;
> +			}
> +			continue;		/* didn't get to run */
> +		}
> +		ret =3D kvmppc_handle_exit(vc->kvm_run[i], vcpu, =
vc->run_task[i]);
> +		vcpu->arch.ret =3D ret;
> +		vcpu->arch.trap =3D 0;
> +	}
> +
> +	preempt_disable();
> +	spin_lock(&vc->lock);
> +
> +	vc->vcore_running =3D 0;
> +	for (i =3D 0; i < vc->n_runnable; ) {
> +		vcpu =3D vc->runnable_threads[i];
> +		if (vcpu->arch.ret !=3D RESUME_GUEST) {
> +			kvmppc_remove_runnable(vc, i);
> +			wake_up(&vcpu->arch.cpu_run);
> +		} else
> +			++i;
> +	}
> +
> +	return 1;
> +}
> +
> +static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu =
*vcpu)
> +{
> +	int ptid;
> +	int wait_state;
> +	struct kvmppc_vcore *vc;
> +	DEFINE_WAIT(wait);
> +
> +	/* No need to go into the guest when all we do is going out */
> 	if (signal_pending(current)) {
> -		run->exit_reason =3D KVM_EXIT_INTR;
> +		kvm_run->exit_reason =3D KVM_EXIT_INTR;
> 		return -EINTR;
> 	}
>=20
> +	kvm_run->exit_reason =3D 0;
> +	vcpu->arch.ret =3D RESUME_GUEST;
> +	vcpu->arch.trap =3D 0;
> +
> 	flush_fp_to_thread(current);
> 	flush_altivec_to_thread(current);
> 	flush_vsx_to_thread(current);
> 	preempt_disable();
>=20
> -	kvm_guest_enter();
> +	/*
> +	 * Synchronize with other threads in this virtual core
> +	 */
> +	vc =3D vcpu->arch.vcore;
> +	spin_lock(&vc->lock);
> +	/* This happens the first time this is called for a vcpu */
> +	if (vcpu->arch.state =3D=3D KVMPPC_VCPU_BLOCKED)
> +		--vc->n_blocked;
> +	vcpu->arch.state =3D KVMPPC_VCPU_RUNNABLE;
> +	ptid =3D vc->n_runnable;
> +	vc->runnable_threads[ptid] =3D vcpu;
> +	vc->run_task[ptid] =3D current;
> +	vc->kvm_run[ptid] =3D kvm_run;
> +	vcpu->arch.ptid =3D ptid;
> +	++vc->n_runnable;
> +
> +	wait_state =3D TASK_INTERRUPTIBLE;
> +	while (vcpu->arch.state =3D=3D KVMPPC_VCPU_RUNNABLE) {
> +		if (signal_pending(current)) {
> +			if (!vc->vcore_running) {
> +				kvm_run->exit_reason =3D KVM_EXIT_INTR;
> +				vcpu->arch.ret =3D -EINTR;
> +				break;
> +			}
> +			/* have to wait for vcore to stop executing =
guest */
> +			wait_state =3D TASK_UNINTERRUPTIBLE;
> +			smp_send_reschedule(vc->pcpu);
> +		}
>=20
> -	__kvmppc_vcore_entry(NULL, vcpu);
> +		if (!vc->vcore_running &&
> +		    vc->n_runnable + vc->n_blocked =3D=3D =
vc->num_threads) {
> +			/* we can run now */
> +			if (kvmppc_run_core(vc))
> +				continue;
> +		}
>=20
> -	kvm_guest_exit();
> +		if (vc->vcore_running =3D=3D 1 && VCORE_EXIT_COUNT(vc) =
=3D=3D 0)
> +			kvmppc_start_thread(vcpu);
>=20
> -	preempt_enable();
> -	kvm_resched(vcpu);
> +		/* wait for other threads to come in, or wait for vcore =
*/
> +		prepare_to_wait(&vcpu->arch.cpu_run, &wait, wait_state);
> +		spin_unlock(&vc->lock);
> +		schedule();
> +		finish_wait(&vcpu->arch.cpu_run, &wait);
> +		spin_lock(&vc->lock);
> +	}
>=20
> -	now =3D get_tb();
> -	/* cancel pending dec exception if dec is positive */
> -	if (now < vcpu->arch.dec_expires && =
kvmppc_core_pending_dec(vcpu))
> -		kvmppc_core_dequeue_dec(vcpu);
> +	if (vcpu->arch.state =3D=3D KVMPPC_VCPU_RUNNABLE)
> +		kvmppc_remove_runnable(vc, vcpu->arch.ptid);
> +	spin_unlock(&vc->lock);
>=20
> -	return kvmppc_handle_exit(run, vcpu, current);
> +	return vcpu->arch.ret;
> }
>=20
> int kvmppc_vcpu_run(struct kvm_run *run, struct kvm_vcpu *vcpu)
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S =
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index 95f6386..f1d3779 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -111,6 +111,32 @@ kvmppc_trampoline_enter:
>  *                                                                     =
       *
>  =
**************************************************************************=
***/
>=20
> +#define XICS_XIRR		4
> +#define XICS_QIRR		0xc
> +
> +/*
> + * We come in here when wakened from nap mode on a secondary hw =
thread.
> + * Relocation is off and most register values are lost.
> + * r13 points to the PACA.
> + */
> +	.globl	kvm_start_guest
> +kvm_start_guest:
> +	ld	r1,PACAEMERGSP(r13)
> +	subi	r1,r1,STACK_FRAME_OVERHEAD
> +
> +	/* get vcpu pointer */
> +	ld	r4, PACA_KVM_VCPU(r13)
> +
> +	/* We got here with an IPI; clear it */
> +	ld	r5, PACA_XICS_PHYS(r13)
> +	li	r0, 0xff
> +	li	r6, XICS_QIRR
> +	li	r7, XICS_XIRR
> +	lwzcix	r8, r5, r7		/* ack the interrupt */
> +	sync
> +	stbcix	r0, r5, r6		/* clear it */
> +	stwcix	r8, r5, r7		/* EOI it */
> +
> .global kvmppc_handler_trampoline_enter
> kvmppc_handler_trampoline_enter:
>=20
> @@ -229,7 +255,20 @@ kvmppc_handler_trampoline_enter:
> 	slbia
> 	ptesync
>=20
> -	/* Switch to guest partition. */
> +	/* Increment entry count iff exit count is zero. */
> + 	ld	r5,PACA_KVM_VCORE(r13)
> +	addi	r9,r5,VCORE_ENTRY_EXIT
> +21:	lwarx	r3,0,r9
> +	cmpwi	r3,0x100		/* any threads starting to exit? =
*/
> +	bge	secondary_too_late	/* if so we're too late to the =
party */
> +	addi	r3,r3,1
> +	stwcx.	r3,0,r9
> +	bne	21b
> +
> +	/* Primary thread switches to guest partition. */
> +	lwz	r6,VCPU_PTID(r4)
> +	cmpwi	r6,0
> +	bne	20f
> 	ld	r9,VCPU_KVM(r4)		/* pointer to struct kvm */
> 	ld	r6,KVM_SDR1(r9)
> 	lwz	r7,KVM_LPID(r9)
> @@ -239,7 +278,15 @@ kvmppc_handler_trampoline_enter:
> 	mtspr	SPRN_SDR1,r6		/* switch to partition page =
table */
> 	mtspr	SPRN_LPID,r7
> 	isync
> -	ld	r8,VCPU_LPCR(r4)
> +	li	r0,1
> +	stb	r0,VCORE_IN_GUEST(r5)	/* signal secondaries to =
continue */
> +	b	10f
> +
> +	/* Secondary threads wait for primary to have done partition =
switch */
> +20:	lbz	r0,VCORE_IN_GUEST(r5)
> +	cmpwi	r0,0
> +	beq	20b
> +10:	ld	r8,VCPU_LPCR(r4)
> 	mtspr	SPRN_LPCR,r8
> 	isync
>=20
> @@ -254,10 +301,12 @@ kvmppc_handler_trampoline_enter:
> 	 * Invalidate the TLB if we could possibly have stale TLB
> 	 * entries for this partition on this core due to the use
> 	 * of tlbiel.
> +	 * XXX maybe only need this on primary thread?
> 	 */
> 	ld	r9,VCPU_KVM(r4)		/* pointer to struct kvm */
> 	lwz	r5,VCPU_VCPUID(r4)
> 	lhz	r6,PACAPACAINDEX(r13)
> +	rldimi	r6,r5,0,62		/* XXX map as if threads 1:1 p:v =
*/
> 	lhz	r8,VCPU_LAST_CPU(r4)
> 	sldi	r7,r6,1			/* see if this is the same vcpu =
*/
> 	add	r7,r7,r9		/* as last ran on this pcpu */
> @@ -540,8 +589,51 @@ hcall_real_cont:
> 	ptesync
>=20
> hdec_soon:
> -	/* Switch back to host partition */
> +	/* Increment the threads-exiting-guest count in the 0xff00
> +	   bits of vcore->entry_exit_count */
> +	lwsync
> +	ld	r5,PACA_KVM_VCORE(r13)
> +	addi	r6,r5,VCORE_ENTRY_EXIT
> +41:	lwarx	r3,0,r6
> +	addi	r0,r3,0x100
> +	stwcx.	r0,0,r6
> +	bne	41b
> +
> +	/* If this is not a HDEC interrupt and there are other threads,
> +	   and we were the first thread to take an interrupt,
> +	   set HDEC to 0 to pull the other threads out of the guest. */
> +	cmpwi	r12,0x980
> +	beq	40f
> +	cmpwi	r3,0x100

good old use define comment :)

> +	bge	40f
> +	cmpwi	r3,1
> +	ble	40f
> +	li	r0,0
> +	mtspr	SPRN_HDEC,r0
> +40:
> +
> +	/* Secondary threads wait for primary to do partition switch */
> 	ld	r4,VCPU_KVM(r9)		/* pointer to struct kvm */
> +	ld	r5,PACA_KVM_VCORE(r13)
> +	lwz	r3,VCPU_PTID(r9)
> +	cmpwi	r3,0
> +	beq	15f
> +	HMT_LOW
> +13:	lbz	r3,VCORE_IN_GUEST(r5)
> +	cmpwi	r3,0
> +	bne	13b
> +	HMT_MEDIUM
> +	b	16f
> +
> +	/* Primary thread waits for all the secondaries to exit guest */
> +15:	lwz	r3,VCORE_ENTRY_EXIT(r5)
> +	srwi	r0,r3,8
> +	clrldi	r3,r3,56
> +	cmpw	r3,r0
> +	bne	15b
> +	isync
> +
> +	/* Primary thread switches back to host partition */
> 	ld	r6,KVM_HOST_SDR1(r4)
> 	lwz	r7,KVM_HOST_LPID(r4)
> 	li	r8,0x3ff		/* switch to reserved LPID */
> @@ -550,10 +642,12 @@ hdec_soon:
> 	mtspr	SPRN_SDR1,r6		/* switch to partition page =
table */
> 	mtspr	SPRN_LPID,r7
> 	isync
> +	li	r0,0
> +	stb	r0,VCORE_IN_GUEST(r5)
> 	lis	r8,0x7fff
> 	mtspr	SPRN_HDEC,r8
>=20
> -	ld	r8,KVM_HOST_LPCR(r4)
> +16:	ld	r8,KVM_HOST_LPCR(r4)
> 	mtspr	SPRN_LPCR,r8
> 	isync
>=20
> @@ -662,6 +756,11 @@ hdec_soon:
> 	mr	r3, r9
> 	bl	.kvmppc_save_fp
>=20
> +	/* Secondary threads go off to take a nap */
> +	lwz	r0,VCPU_PTID(r3)
> +	cmpwi	r0,0
> +	bne	secondary_nap
> +
> 	/* RFI into the highmem handler */
> 	mfmsr	r7
> 	ori	r7, r7, MSR_IR|MSR_DR|MSR_RI|MSR_ME	/* Enable paging =
*/
> @@ -807,3 +906,53 @@ _GLOBAL(kvmppc_h_set_dabr)
> 	mtspr	SPRN_DABR,r4
> 	li	r3,0
> 	blr
> +
> +secondary_too_late:
> +	ld	r5,PACA_KVM_VCORE(r13)
> +	HMT_LOW
> +13:	lbz	r3,VCORE_IN_GUEST(r5)
> +	cmpwi	r3,0
> +	bne	13b
> +	HMT_MEDIUM
> +	ld	r11,PACA_SLBSHADOWPTR(r13)
> +
> +	.rept	SLB_NUM_BOLTED
> +	ld	r5,SLBSHADOW_SAVEAREA(r11)
> +	ld	r6,SLBSHADOW_SAVEAREA+8(r11)
> +	andis.	r7,r5,SLB_ESID_V@h
> +	beq	1f
> +	slbmte	r6,r5
> +1:	addi	r11,r11,16
> +	.endr
> +	b	50f
> +
> +secondary_nap:
> +	/* Clear any pending IPI */
> +50:	ld	r5, PACA_XICS_PHYS(r13)
> +	li	r0, 0xff
> +	li	r6, XICS_QIRR
> +	stbcix	r0, r5, r6
> +
> +	/* increment the nap count and then go to nap mode */
> +	ld	r4, PACA_KVM_VCORE(r13)
> +	addi	r4, r4, VCORE_NAP_COUNT
> +	lwsync				/* make previous updates visible =
*/
> +51:	lwarx	r3, 0, r4
> +	addi	r3, r3, 1
> +	stwcx.	r3, 0, r4
> +	bne	51b
> +	isync
> +
> +	mfspr	r4, SPRN_LPCR
> +	li	r0, LPCR_PECE
> +	andc	r4, r4, r0
> +	ori	r4, r4, LPCR_PECE0	/* exit nap on interrupt */
> +	mtspr	SPRN_LPCR, r4
> +	li	r0, 0
> +	std	r0, SHADOW_VCPU_OFF + SVCPU_SCRATCH0(r13)
> +	ptesync
> +	ld	r0, SHADOW_VCPU_OFF + SVCPU_SCRATCH0(r13)
> +1:	cmpd	r0, r0
> +	bne	1b
> +	nap
> +	b	.
> diff --git a/arch/powerpc/sysdev/xics/icp-native.c =
b/arch/powerpc/sysdev/xics/icp-native.c
> index be5e3d7..01149c0 100644
> --- a/arch/powerpc/sysdev/xics/icp-native.c
> +++ b/arch/powerpc/sysdev/xics/icp-native.c
> @@ -23,6 +23,7 @@
> #include <asm/irq.h>
> #include <asm/errno.h>
> #include <asm/xics.h>
> +#include <asm/kvm_ppc.h>
>=20
> struct icp_ipl {
> 	union {
> @@ -142,6 +143,11 @@ static inline void icp_native_do_message(int cpu, =
int msg)
> 	icp_native_set_qirr(cpu, IPI_PRIORITY);
> }
>=20
> +void xics_wake_cpu(int cpu)
> +{
> +	icp_native_set_qirr(cpu, IPI_PRIORITY);
> +}
> +
> static void icp_native_message_pass(int target, int msg)
> {
> 	unsigned int i;
> @@ -204,6 +210,7 @@ static int __init icp_native_map_one_cpu(int =
hw_id, unsigned long addr,
> 	}
>=20
> 	icp_native_regs[cpu] =3D ioremap(addr, size);
> +	kvmppc_set_xics_phys(cpu, addr);
> 	if (!icp_native_regs[cpu]) {
> 		pr_warning("icp_native: Failed ioremap for CPU %d, "
> 			   "interrupt server #0x%x, addr %#lx\n",
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index 3d3cdf1..952d556 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -550,6 +550,9 @@ struct kvm_ppc_pvinfo {
> #ifdef __KVM_HAVE_SPAPR_TCE
> #define KVM_CAP_SPAPR_TCE 60
> #endif
> +#ifdef __KVM_HAVE_PPC_SMT
> +#define KVM_CAP_PPC_SMT 61
> +#endif
>=20
> #ifdef KVM_CAP_IRQ_ROUTING
>=20
> --=20
> 1.7.4.4
>=20

Maybe I also missed the point here, but how does this correlate with =
Linux threads? Is each vcpu running in its own Linux thread? How does =
the scheduling happen? IIUC the host only sees a single thread per core =
and then distributes the vcpus to the respective host threads.

Alex

^ permalink raw reply

* Re: [PATCH 12/13] kvm/powerpc: Accelerate H_PUT_TCE by implementing it in real mode
From: Alexander Graf @ 2011-05-17  8:01 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, kvm
In-Reply-To: <20110511104615.GM2837@brick.ozlabs.ibm.com>


On 11.05.2011, at 12:46, Paul Mackerras wrote:

> From: David Gibson <dwg@au1.ibm.com>
>=20
> This improves I/O performance for guests using the PAPR =
paravirtualization
> interface by making the H_PUT_TCE hcall faster, by implementing it in
> real mode.  H_PUT_TCE is used for updating virtual IOMMU tables, and =
is
> used both for virtual I/O and for real I/O in the PAPR interface.
>=20
> Since this moves the IOMMU tables into the kernel, we define a new
> KVM_CREATE_SPAPR_TCE ioctl to allow qemu to create the tables.
> The ioctl returns a file descriptor which can be used to mmap the
> newly created table.
>=20
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
> arch/powerpc/include/asm/kvm.h           |    9 +++
> arch/powerpc/include/asm/kvm_book3s_64.h |    2 +
> arch/powerpc/include/asm/kvm_host.h      |    9 +++
> arch/powerpc/include/asm/kvm_ppc.h       |    2 +
> arch/powerpc/kvm/Makefile                |    3 +-
> arch/powerpc/kvm/book3s_64_vio_hv.c      |   73 +++++++++++++++++++
> arch/powerpc/kvm/book3s_hv.c             |  116 =
+++++++++++++++++++++++++++++-
> arch/powerpc/kvm/book3s_hv_rmhandlers.S  |    2 +-
> arch/powerpc/kvm/powerpc.c               |   18 +++++
> include/linux/kvm.h                      |    5 ++

This one definitely needs documentation :).

> 10 files changed, 236 insertions(+), 3 deletions(-)
> create mode 100644 arch/powerpc/kvm/book3s_64_vio_hv.c
>=20
> diff --git a/arch/powerpc/include/asm/kvm.h =
b/arch/powerpc/include/asm/kvm.h
> index 18ea696..a9e641b 100644
> --- a/arch/powerpc/include/asm/kvm.h
> +++ b/arch/powerpc/include/asm/kvm.h
> @@ -22,6 +22,9 @@
>=20
> #include <linux/types.h>
>=20
> +/* Select powerpc specific features in <linux/kvm.h> */
> +#define __KVM_HAVE_SPAPR_TCE
> +
> struct kvm_regs {
> 	__u64 pc;
> 	__u64 cr;
> @@ -88,4 +91,10 @@ struct kvm_guest_debug_arch {
> #define KVM_INTERRUPT_UNSET	-2U
> #define KVM_INTERRUPT_SET_LEVEL	-3U
>=20
> +/* for KVM_CAP_SPAPR_TCE */
> +struct kvm_create_spapr_tce {
> +	__u64 liobn;
> +	__u32 window_size;
> +};
> +
> #endif /* __LINUX_KVM_POWERPC_H */
> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h =
b/arch/powerpc/include/asm/kvm_book3s_64.h
> index 4cadd61..e1a096b 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
> @@ -25,4 +25,6 @@ static inline struct kvmppc_book3s_shadow_vcpu =
*to_svcpu(struct kvm_vcpu *vcpu)
> 	return &get_paca()->shadow_vcpu;
> }
>=20
> +#define SPAPR_TCE_SHIFT		12
> +
> #endif /* __ASM_KVM_BOOK3S_64_H__ */
> diff --git a/arch/powerpc/include/asm/kvm_host.h =
b/arch/powerpc/include/asm/kvm_host.h
> index af6703e..cda183e 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -144,6 +144,14 @@ struct kvmppc_pginfo {
> 	atomic_t refcnt;
> };
>=20
> +struct kvmppc_spapr_tce_table {
> +	struct list_head list;
> +	struct kvm *kvm;
> +	u64 liobn;
> +	u32 window_size;
> +	struct page *pages[0];
> +};
> +
> struct kvm_arch {
> 	unsigned long hpt_virt;
> 	unsigned long ram_npages;
> @@ -157,6 +165,7 @@ struct kvm_arch {
> 	unsigned long host_sdr1;
> 	int tlbie_lock;
> 	unsigned short last_vcpu[NR_CPUS];
> +	struct list_head spapr_tce_tables;
> };
>=20
> struct kvmppc_pte {
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h =
b/arch/powerpc/include/asm/kvm_ppc.h
> index b4ee11a..de683fa 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -117,6 +117,8 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
> extern void kvmppc_map_vrma(struct kvm *kvm,
> 			    struct kvm_userspace_memory_region *mem);
> extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> +extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> +				struct kvm_create_spapr_tce *args);
> extern int kvmppc_core_init_vm(struct kvm *kvm);
> extern void kvmppc_core_destroy_vm(struct kvm *kvm);
> extern int kvmppc_core_prepare_memory_region(struct kvm *kvm,
> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> index 37c1a60..8ba062f 100644
> --- a/arch/powerpc/kvm/Makefile
> +++ b/arch/powerpc/kvm/Makefile
> @@ -59,7 +59,8 @@ kvm-book3s_64_hv-objs :=3D \
> 	book3s.o \
> 	book3s_hv.o \
> 	book3s_hv_interrupts.o \
> -	book3s_64_mmu_hv.o
> +	book3s_64_mmu_hv.o \
> +	book3s_64_vio_hv.o
> kvm-objs-$(CONFIG_KVM_BOOK3S_64_HV) :=3D $(kvm-book3s_64_hv-objs)
>=20
> kvm-book3s_32-objs :=3D \
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c =
b/arch/powerpc/kvm/book3s_64_vio_hv.c
> new file mode 100644
> index 0000000..ea0f8c5
> --- /dev/null
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -0,0 +1,73 @@
> +/*
> + * This program is free software; you can redistribute it and/or =
modify
> + * it under the terms of the GNU General Public License, version 2, =
as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  =
02110-1301, USA.
> + *
> + * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
> + * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
> + */
> +
> +#include <linux/types.h>
> +#include <linux/string.h>
> +#include <linux/kvm.h>
> +#include <linux/kvm_host.h>
> +#include <linux/highmem.h>
> +#include <linux/gfp.h>
> +#include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +#include <linux/list.h>
> +
> +#include <asm/tlbflush.h>
> +#include <asm/kvm_ppc.h>
> +#include <asm/kvm_book3s.h>
> +#include <asm/mmu-hash64.h>
> +#include <asm/hvcall.h>
> +#include <asm/synch.h>
> +#include <asm/ppc-opcode.h>
> +#include <asm/kvm_host.h>
> +#include <asm/udbg.h>
> +
> +#define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
> +

It would be great to somehow mark code that runs in real mode as such - =
either by an attribute in the function header or by a simple comment.

> +long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> +		      unsigned long ioba, unsigned long tce)
> +{
> +	struct kvm *kvm =3D vcpu->kvm;
> +	struct kvmppc_spapr_tce_table *stt;
> +
> +	/* udbg_printf("H_PUT_TCE(): liobn=3D0x%lx ioba=3D0x%lx, =
tce=3D0x%lx\n", */
> +	/* 	    liobn, ioba, tce); */
> +
> +	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
> +		if (stt->liobn =3D=3D liobn) {
> +			unsigned long idx =3D ioba >> SPAPR_TCE_SHIFT;
> +			struct page *page;
> +			u64 *tbl;
> +
> +			/* udbg_printf("H_PUT_TCE: liobn 0x%lx =3D> =
stt=3D%p  window_size=3D0x%x\n", */
> +			/* 	    liobn, stt, stt->window_size); */
> +			if (ioba >=3D stt->window_size)
> +				return H_PARAMETER;
> +
> +			page =3D stt->pages[idx / TCES_PER_PAGE];
> +			tbl =3D (u64 *)page_address(page);
> +
> +			/* FIXME: Need to validate the TCE itself */
> +			/* udbg_printf("tce @ %p\n", &tbl[idx % =
TCES_PER_PAGE]); */
> +			tbl[idx % TCES_PER_PAGE] =3D tce;
> +			return H_SUCCESS;
> +		}
> +	}
> +
> +	/* Didn't find the liobn, punt it to userspace */
> +	return H_TOO_HARD;
> +}
> diff --git a/arch/powerpc/kvm/book3s_hv.c =
b/arch/powerpc/kvm/book3s_hv.c
> index 377a35a..eed2c10 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -506,6 +506,116 @@ int kvmppc_vcpu_run(struct kvm_run *run, struct =
kvm_vcpu *vcpu)
> 	return r;
> }
>=20
> +static long kvmppc_stt_npages(unsigned long window_size)
> +{
> +	return ALIGN((window_size >> SPAPR_TCE_SHIFT)
> +		     * sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
> +}
> +
> +static void release_spapr_tce_table(struct kvmppc_spapr_tce_table =
*stt)
> +{
> +	struct kvm *kvm =3D stt->kvm;
> +	int i;
> +
> +	mutex_lock(&kvm->lock);
> +	list_del(&stt->list);
> +	for (i =3D 0; i < kvmppc_stt_npages(stt->window_size); i++)
> +		__free_page(stt->pages[i]);
> +	kfree(stt);
> +	mutex_unlock(&kvm->lock);
> +
> +	kvm_put_kvm(kvm);
> +}
> +
> +static int kvm_spapr_tce_fault(struct vm_area_struct *vma, struct =
vm_fault *vmf)
> +{
> +	struct kvmppc_spapr_tce_table *stt =3D =
vma->vm_file->private_data;
> +	struct page *page;
> +
> +	if (vmf->pgoff >=3D kvmppc_stt_npages(stt->window_size))
> +		return VM_FAULT_SIGBUS;
> +
> +	page =3D stt->pages[vmf->pgoff];
> +	get_page(page);
> +	vmf->page =3D page;
> +	return 0;
> +}
> +
> +static const struct vm_operations_struct kvm_spapr_tce_vm_ops =3D {
> +	.fault =3D kvm_spapr_tce_fault,
> +};
> +
> +static int kvm_spapr_tce_mmap(struct file *file, struct =
vm_area_struct *vma)
> +{
> +	vma->vm_ops =3D &kvm_spapr_tce_vm_ops;
> +	return 0;
> +}
> +
> +static int kvm_spapr_tce_release(struct inode *inode, struct file =
*filp)
> +{
> +	struct kvmppc_spapr_tce_table *stt =3D filp->private_data;
> +
> +	release_spapr_tce_table(stt);
> +	return 0;
> +}
> +
> +static struct file_operations kvm_spapr_tce_fops =3D {
> +	.mmap           =3D kvm_spapr_tce_mmap,
> +	.release	=3D kvm_spapr_tce_release,
> +};
> +
> +long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> +				   struct kvm_create_spapr_tce *args)
> +{
> +	struct kvmppc_spapr_tce_table *stt =3D NULL;
> +	long npages;
> +	int ret =3D -ENOMEM;
> +	int i;
> +
> +	/* Check this LIOBN hasn't been previously allocated */
> +	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
> +		if (stt->liobn =3D=3D args->liobn)
> +			return -EBUSY;
> +	}
> +
> +	npages =3D kvmppc_stt_npages(args->window_size);
> +
> +	stt =3D kzalloc(sizeof(*stt) + npages* sizeof(struct page *),
> +		      GFP_KERNEL);
> +	if (!stt)
> +		goto fail;
> +
> +	stt->liobn =3D args->liobn;
> +	stt->window_size =3D args->window_size;
> +	stt->kvm =3D kvm;
> +
> +	for (i =3D 0; i < npages; i++) {
> +		stt->pages[i] =3D alloc_page(GFP_KERNEL | __GFP_ZERO);
> +		if (!stt->pages[i])
> +			goto fail;
> +	}
> +
> +	kvm_get_kvm(kvm);
> +
> +	mutex_lock(&kvm->lock);
> +	list_add(&stt->list, &kvm->arch.spapr_tce_tables);
> +
> +	mutex_unlock(&kvm->lock);
> +
> +	return anon_inode_getfd("kvm-spapr-tce", &kvm_spapr_tce_fops,
> +				stt, O_RDONLY);
> +
> +fail:
> +	if (stt) {
> +		for (i =3D 0; i < npages; i++)
> +			if (stt->pages[i])
> +				__free_page(stt->pages[i]);
> +
> +		kfree(stt);
> +	}
> +	return ret;
> +}
> +
> int kvmppc_core_prepare_memory_region(struct kvm *kvm,
> 				struct kvm_userspace_memory_region *mem)
> {
> @@ -527,13 +637,17 @@ int kvmppc_core_init_vm(struct kvm *kvm)
>=20
> 	/* Allocate hashed page table */
> 	r =3D kvmppc_alloc_hpt(kvm);
> +	if (r)
> +		return r;
>=20
> -	return r;
> +	INIT_LIST_HEAD(&kvm->arch.spapr_tce_tables);
> +	return 0;
> }
>=20
> void kvmppc_core_destroy_vm(struct kvm *kvm)
> {
> 	kvmppc_free_hpt(kvm);
> +	WARN_ON(!list_empty(&kvm->arch.spapr_tce_tables));
> }
>=20
> /* These are stubs for now */
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S =
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index e8a8f3c..95f6386 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -722,7 +722,7 @@ hcall_real_table:
> 	.long	0		/* 0x14 - H_CLEAR_REF */
> 	.long	.kvmppc_h_protect - hcall_real_table
> 	.long	0		/* 0x1c - H_GET_TCE */
> -	.long	0		/* 0x20 - H_SET_TCE */
> +	.long	.kvmppc_h_put_tce - hcall_real_table
> 	.long	0		/* 0x24 - H_SET_SPRG0 */
> 	.long	.kvmppc_h_set_dabr - hcall_real_table
> 	.long	0		/* 0x2c */
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 7bfe413..10f777a 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -196,6 +196,11 @@ int kvm_dev_ioctl_check_extension(long ext)
> 		r =3D KVM_COALESCED_MMIO_PAGE_OFFSET;
> 		break;
> #endif
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +	case KVM_CAP_SPAPR_TCE:
> +		r =3D 1;
> +		break;
> +#endif
> 	default:
> 		r =3D 0;
> 		break;
> @@ -628,6 +633,19 @@ long kvm_arch_vm_ioctl(struct file *filp,
>=20
> 		break;
> 	}
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +	case KVM_CREATE_SPAPR_TCE: {
> +		struct kvm_create_spapr_tce create_tce;
> +		struct kvm *kvm =3D filp->private_data;
> +
> +		r =3D -EFAULT;
> +		if (copy_from_user(&create_tce, argp, =
sizeof(create_tce)))
> +			goto out;
> +		r =3D kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
> +		goto out;
> +	}

I'm not sure I fully understand how this is supposed to work. If the =
tables are kept inside the kernel, how does userspace get to know where =
to DMA to?


Alex

^ permalink raw reply

* Re: [PATCH 11/13] kvm/powerpc: Handle some PAPR hcalls in the kernel
From: Alexander Graf @ 2011-05-17  7:54 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, kvm
In-Reply-To: <20110511104530.GL2837@brick.ozlabs.ibm.com>


On 11.05.2011, at 12:45, Paul Mackerras wrote:

> This adds the infrastructure for handling PAPR hcalls in the kernel,
> either early in the guest exit path while we are still in real mode,
> or later once the MMU has been turned back on and we are in the full
> kernel context.  The advantage of handling hcalls in real mode if
> possible is that we avoid two partition switches -- and this will
> become more important when we support SMT4 guests, since a partition
> switch means we have to pull all of the threads in the core out of
> the guest.  The disadvantage is that we can only access the kernel
> linear mapping, not anything vmalloced or ioremapped, since the MMU
> is off.
>=20
> This also adds code to handle the following hcalls in real mode:
>=20
> H_ENTER       Add an HPTE to the hashed page table
> H_REMOVE      Remove an HPTE from the hashed page table
> H_READ        Read HPTEs from the hashed page table
> H_PROTECT     Change the protection bits in an HPTE
> H_BULK_REMOVE Remove up to 4 HPTEs from the hashed page table
> H_SET_DABR    Set the data address breakpoint register
>=20
> Plus code to handle the following hcalls in the kernel:
>=20
> H_CEDE        Idle the vcpu until an interrupt or H_PROD hcall arrives
> H_PROD        Wake up a ceded vcpu
> H_REGISTER_VPA Register a virtual processor area (VPA)
>=20
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
> arch/powerpc/include/asm/hvcall.h       |    5 +
> arch/powerpc/include/asm/kvm_host.h     |   11 +
> arch/powerpc/include/asm/kvm_ppc.h      |    1 +
> arch/powerpc/kernel/asm-offsets.c       |    2 +
> arch/powerpc/kvm/book3s_64_mmu_hv.c     |  342 =
+++++++++++++++++++++++++++++++
> arch/powerpc/kvm/book3s_hv.c            |  170 +++++++++++++++-
> arch/powerpc/kvm/book3s_hv_rmhandlers.S |  150 +++++++++++++-
> arch/powerpc/kvm/powerpc.c              |    2 +-
> 8 files changed, 679 insertions(+), 4 deletions(-)
>=20
> diff --git a/arch/powerpc/include/asm/hvcall.h =
b/arch/powerpc/include/asm/hvcall.h
> index 8edec71..a226002 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -29,6 +29,10 @@
> #define H_LONG_BUSY_ORDER_100_SEC	9905  /* Long busy, hint that =
100sec \
> 						 is a good time to retry =
*/
> #define H_LONG_BUSY_END_RANGE		9905  /* End of long busy range =
*/
> +
> +/* Hacked in for HV-aware kvm support */
> +#define H_TOO_HARD	9999

Not sure I like the name - when is it used? :)
Also, if it's not in the PAPR, the guest should never receive it, right?

> +
> #define H_HARDWARE	-1	/* Hardware error */
> #define H_FUNCTION	-2	/* Function not supported */
> #define H_PRIVILEGE	-3	/* Caller not privileged */
> @@ -100,6 +104,7 @@
> #define H_PAGE_SET_ACTIVE	H_PAGE_STATE_CHANGE
> #define H_AVPN			(1UL<<(63-32))	/* An avpn is =
provided as a sanity test */
> #define H_ANDCOND		(1UL<<(63-33))
> +#define H_LOCAL			(1UL<<(63-35))
> #define H_ICACHE_INVALIDATE	(1UL<<(63-40))	/* icbi, etc.  (ignored =
for IO pages) */
> #define H_ICACHE_SYNCHRONIZE	(1UL<<(63-41))	/* dcbst, icbi, etc =
(ignored for IO pages */
> #define H_ZERO_PAGE		(1UL<<(63-48))	/* zero the page before =
mapping (ignored for IO pages) */
> diff --git a/arch/powerpc/include/asm/kvm_host.h =
b/arch/powerpc/include/asm/kvm_host.h
> index ec62365..af6703e 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -59,6 +59,10 @@ struct kvm;
> struct kvm_run;
> struct kvm_vcpu;
>=20
> +struct lppaca;
> +struct slb_shadow;
> +struct dtl;
> +
> struct kvm_vm_stat {
> 	u32 remote_tlb_flush;
> };
> @@ -341,7 +345,14 @@ struct kvm_vcpu_arch {
> 	u64 dec_expires;
> 	unsigned long pending_exceptions;
> 	u16 last_cpu;
> +	u8 ceded;
> +	u8 prodded;
> 	u32 last_inst;
> +
> +	struct lppaca *vpa;
> +	struct slb_shadow *slb_shadow;
> +	struct dtl *dtl;
> +	struct dtl *dtl_end;
> 	int trap;
> 	struct kvm_vcpu_arch_shared *shared;
> 	unsigned long magic_page_pa; /* phys addr to map the magic page =
to */
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h =
b/arch/powerpc/include/asm/kvm_ppc.h
> index cd9ad96..b4ee11a 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -116,6 +116,7 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
> 				struct kvm_userspace_memory_region =
*mem);
> extern void kvmppc_map_vrma(struct kvm *kvm,
> 			    struct kvm_userspace_memory_region *mem);
> +extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> extern int kvmppc_core_init_vm(struct kvm *kvm);
> extern void kvmppc_core_destroy_vm(struct kvm *kvm);
> extern int kvmppc_core_prepare_memory_region(struct kvm *kvm,
> diff --git a/arch/powerpc/kernel/asm-offsets.c =
b/arch/powerpc/kernel/asm-offsets.c
> index 49e97fd..fd56f14 100644
> --- a/arch/powerpc/kernel/asm-offsets.c
> +++ b/arch/powerpc/kernel/asm-offsets.c
> @@ -189,6 +189,7 @@ int main(void)
> 	DEFINE(LPPACADECRINT, offsetof(struct lppaca, =
int_dword.fields.decr_int));
> 	DEFINE(LPPACA_PMCINUSE, offsetof(struct lppaca, =
pmcregs_in_use));
> 	DEFINE(LPPACA_DTLIDX, offsetof(struct lppaca, dtl_idx));
> +	DEFINE(LPPACA_YIELDCOUNT, offsetof(struct lppaca, yield_count));
> 	DEFINE(PACA_DTL_RIDX, offsetof(struct paca_struct, dtl_ridx));
> #endif /* CONFIG_PPC_STD_MMU_64 */
> 	DEFINE(PACAEMERGSP, offsetof(struct paca_struct, emergency_sp));
> @@ -467,6 +468,7 @@ int main(void)
> 	DEFINE(VCPU_DEC, offsetof(struct kvm_vcpu, arch.dec));
> 	DEFINE(VCPU_DEC_EXPIRES, offsetof(struct kvm_vcpu, =
arch.dec_expires));
> 	DEFINE(VCPU_LPCR, offsetof(struct kvm_vcpu, arch.lpcr));
> +	DEFINE(VCPU_VPA, offsetof(struct kvm_vcpu, arch.vpa));
> 	DEFINE(VCPU_MMCR, offsetof(struct kvm_vcpu, arch.mmcr));
> 	DEFINE(VCPU_PMC, offsetof(struct kvm_vcpu, arch.pmc));
> 	DEFINE(VCPU_SLB, offsetof(struct kvm_vcpu, arch.slb));
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c =
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> index 52d1be1..623caae 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> @@ -219,6 +219,348 @@ void kvmppc_map_vrma(struct kvm *kvm, struct =
kvm_userspace_memory_region *mem)
> 	}
> }
>=20
> +#define HPTE_V_HVLOCK	0x40UL
> +
> +static inline long lock_hpte(unsigned long *hpte, unsigned long bits)
> +{
> +	unsigned long tmp, old;
> +
> +	asm volatile("	ldarx	%0,0,%2\n"
> +		     "	and.	%1,%0,%3\n"
> +		     "	bne	2f\n"
> +		     "	ori	%0,%0,%4\n"
> +		     "  stdcx.	%0,0,%2\n"
> +		     "	beq+	2f\n"
> +		     "	li	%1,%3\n"
> +		     "2:	isync"
> +		     : "=3D&r" (tmp), "=3D&r" (old)
> +		     : "r" (hpte), "r" (bits), "i" (HPTE_V_HVLOCK)
> +		     : "cc", "memory");
> +	return old =3D=3D 0;
> +}
> +
> +long kvmppc_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
> +		    long pte_index, unsigned long pteh, unsigned long =
ptel)
> +{
> +	unsigned long porder;
> +	struct kvm *kvm =3D vcpu->kvm;
> +	unsigned long i, lpn, pa;
> +	unsigned long *hpte;
> +
> +	/* only handle 4k, 64k and 16M pages for now */
> +	porder =3D 12;
> +	if (pteh & HPTE_V_LARGE) {
> +		if ((ptel & 0xf000) =3D=3D 0x1000) {
> +			/* 64k page */
> +			porder =3D 16;
> +		} else if ((ptel & 0xff000) =3D=3D 0) {
> +			/* 16M page */
> +			porder =3D 24;
> +			/* lowest AVA bit must be 0 for 16M pages */
> +			if (pteh & 0x80)
> +				return H_PARAMETER;
> +		} else
> +			return H_PARAMETER;
> +	}
> +	lpn =3D (ptel & HPTE_R_RPN) >> kvm->arch.ram_porder;
> +	if (lpn >=3D kvm->arch.ram_npages || porder > =
kvm->arch.ram_porder)
> +		return H_PARAMETER;
> +	pa =3D kvm->arch.ram_pginfo[lpn].pfn << PAGE_SHIFT;
> +	if (!pa)
> +		return H_PARAMETER;
> +	/* Check WIMG */
> +	if ((ptel & HPTE_R_WIMG) !=3D HPTE_R_M &&
> +	    (ptel & HPTE_R_WIMG) !=3D (HPTE_R_W | HPTE_R_I | HPTE_R_M))
> +		return H_PARAMETER;
> +	pteh &=3D ~0x60UL;
> +	ptel &=3D ~(HPTE_R_PP0 - kvm->arch.ram_psize);
> +	ptel |=3D pa;
> +	if (pte_index >=3D (HPT_NPTEG << 3))
> +		return H_PARAMETER;
> +	if (likely((flags & H_EXACT) =3D=3D 0)) {
> +		pte_index &=3D ~7UL;
> +		hpte =3D (unsigned long *)(kvm->arch.hpt_virt + =
(pte_index << 4));
> +		for (i =3D 0; ; ++i) {
> +			if (i =3D=3D 8)
> +				return H_PTEG_FULL;
> +			if ((*hpte & HPTE_V_VALID) =3D=3D 0 &&
> +			    lock_hpte(hpte, HPTE_V_HVLOCK | =
HPTE_V_VALID))
> +				break;
> +			hpte +=3D 2;
> +		}
> +	} else {
> +		i =3D 0;
> +		hpte =3D (unsigned long *)(kvm->arch.hpt_virt + =
(pte_index << 4));
> +		if (!lock_hpte(hpte, HPTE_V_HVLOCK | HPTE_V_VALID))
> +			return H_PTEG_FULL;
> +	}
> +	hpte[1] =3D ptel;
> +	eieio();
> +	hpte[0] =3D pteh;
> +	asm volatile("ptesync" : : : "memory");
> +	atomic_inc(&kvm->arch.ram_pginfo[lpn].refcnt);
> +	vcpu->arch.gpr[4] =3D pte_index + i;
> +	return H_SUCCESS;
> +}
> +
> +static unsigned long compute_tlbie_rb(unsigned long v, unsigned long =
r,
> +				      unsigned long pte_index)
> +{
> +	unsigned long rb, va_low;
> +
> +	rb =3D (v & ~0x7fUL) << 16;		/* AVA field */
> +	va_low =3D pte_index >> 3;
> +	if (v & HPTE_V_SECONDARY)
> +		va_low =3D ~va_low;
> +	/* xor vsid from AVA */
> +	if (!(v & HPTE_V_1TB_SEG))
> +		va_low ^=3D v >> 12;
> +	else
> +		va_low ^=3D v >> 24;
> +	va_low &=3D 0x7ff;
> +	if (v & HPTE_V_LARGE) {
> +		rb |=3D 1;			/* L field */
> +		if (r & 0xff000) {
> +			/* non-16MB large page, must be 64k */
> +			/* (masks depend on page size) */
> +			rb |=3D 0x1000;		/* page encoding in LP =
field */
> +			rb |=3D (va_low & 0x7f) << 16; /* 7b of VA in =
AVA/LP field */
> +			rb |=3D (va_low & 0xfe);	/* AVAL field =
(P7 doesn't seem to care) */
> +		}
> +	} else {
> +		/* 4kB page */
> +		rb |=3D (va_low & 0x7ff) << 12;	/* remaining 11b of VA =
*/
> +	}
> +	rb |=3D (v >> 54) & 0x300;		/* B field */
> +	return rb;
> +}
> +
> +#define LOCK_TOKEN	(*(u32 *)(&get_paca()->lock_token))
> +
> +static inline int try_lock_tlbie(unsigned int *lock)
> +{
> +	unsigned int tmp, old;
> +	unsigned int token =3D LOCK_TOKEN;
> +
> +	asm volatile("1:lwarx	%1,0,%2\n"
> +		     "	cmpwi	cr0,%1,0\n"
> +		     "	bne	2f\n"
> +		     "  stwcx.	%3,0,%2\n"
> +		     "	bne-	1b\n"
> +		     "  isync\n"
> +		     "2:"
> +		     : "=3D&r" (tmp), "=3D&r" (old)
> +		     : "r" (lock), "r" (token)
> +		     : "cc", "memory");
> +	return old =3D=3D 0;
> +}
> +
> +long kvmppc_h_remove(struct kvm_vcpu *vcpu, unsigned long flags,
> +		     unsigned long pte_index, unsigned long avpn,
> +		     unsigned long va)
> +{
> +	struct kvm *kvm =3D vcpu->kvm;
> +	unsigned long *hpte;
> +	unsigned long v, r, rb;
> +
> +	if (pte_index >=3D (HPT_NPTEG << 3))
> +		return H_PARAMETER;
> +	hpte =3D (unsigned long *)(kvm->arch.hpt_virt + (pte_index << =
4));
> +	while (!lock_hpte(hpte, HPTE_V_HVLOCK))
> +		cpu_relax();
> +	if ((hpte[0] & HPTE_V_VALID) =3D=3D 0 ||
> +	    ((flags & H_AVPN) && (hpte[0] & ~0x7fUL) !=3D avpn) ||
> +	    ((flags & H_ANDCOND) && (hpte[0] & avpn) !=3D 0)) {
> +		hpte[0] &=3D ~HPTE_V_HVLOCK;
> +		return H_NOT_FOUND;
> +	}
> +	if (atomic_read(&kvm->online_vcpus) =3D=3D 1)
> +		flags |=3D H_LOCAL;
> +	vcpu->arch.gpr[4] =3D v =3D hpte[0] & ~HPTE_V_HVLOCK;
> +	vcpu->arch.gpr[5] =3D r =3D hpte[1];
> +	rb =3D compute_tlbie_rb(v, r, pte_index);
> +	hpte[0] =3D 0;
> +	if (!(flags & H_LOCAL)) {
> +		while(!try_lock_tlbie(&kvm->arch.tlbie_lock))
> +			cpu_relax();
> +		asm volatile("ptesync" : : : "memory");
> +		asm volatile(PPC_TLBIE(%1,%0)"; eieio; tlbsync"
> +			     : : "r" (rb), "r" (kvm->arch.lpid));
> +		asm volatile("ptesync" : : : "memory");
> +		kvm->arch.tlbie_lock =3D 0;
> +	} else {
> +		asm volatile("ptesync" : : : "memory");
> +		asm volatile("tlbiel %0" : : "r" (rb));
> +		asm volatile("ptesync" : : : "memory");
> +	}
> +	return H_SUCCESS;
> +}
> +
> +long kvmppc_h_bulk_remove(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm *kvm =3D vcpu->kvm;
> +	unsigned long *args =3D &vcpu->arch.gpr[4];
> +	unsigned long *hp, tlbrb[4];
> +	long int i, found;
> +	long int n_inval =3D 0;
> +	unsigned long flags, req, pte_index;
> +	long int local =3D 0;
> +	long int ret =3D H_SUCCESS;
> +
> +	if (atomic_read(&kvm->online_vcpus) =3D=3D 1)
> +		local =3D 1;
> +	for (i =3D 0; i < 4; ++i) {
> +		pte_index =3D args[i * 2];
> +		flags =3D pte_index >> 56;
> +		pte_index &=3D ((1ul << 56) - 1);
> +		req =3D flags >> 6;
> +		flags &=3D 3;
> +		if (req =3D=3D 3)
> +			break;
> +		if (req !=3D 1 || flags =3D=3D 3 ||
> +		    pte_index >=3D (HPT_NPTEG << 3)) {
> +			/* parameter error */
> +			args[i * 2] =3D ((0xa0 | flags) << 56) + =
pte_index;
> +			ret =3D H_PARAMETER;
> +			break;
> +		}
> +		hp =3D (unsigned long *)(kvm->arch.hpt_virt + (pte_index =
<< 4));
> +		while (!lock_hpte(hp, HPTE_V_HVLOCK))
> +			cpu_relax();
> +		found =3D 0;
> +		if (hp[0] & HPTE_V_VALID) {
> +			switch (flags & 3) {
> +			case 0:		/* absolute */
> +				found =3D 1;
> +				break;
> +			case 1:		/* andcond */
> +				if (!(hp[0] & args[i * 2 + 1]))
> +					found =3D 1;
> +				break;
> +			case 2:		/* AVPN */
> +				if ((hp[0] & ~0x7fUL) =3D=3D args[i * 2 =
+ 1])
> +					found =3D 1;
> +				break;
> +			}
> +		}
> +		if (!found) {
> +			hp[0] &=3D ~HPTE_V_HVLOCK;
> +			args[i * 2] =3D ((0x90 | flags) << 56) + =
pte_index;
> +			continue;
> +		}
> +		/* insert R and C bits from PTE */
> +		flags |=3D (hp[1] >> 5) & 0x0c;
> +		args[i * 2] =3D ((0x80 | flags) << 56) + pte_index;
> +		tlbrb[n_inval++] =3D compute_tlbie_rb(hp[0], hp[1], =
pte_index);
> +		hp[0] =3D 0;
> +	}
> +	if (n_inval =3D=3D 0)
> +		return ret;
> +
> +	if (!local) {
> +		while(!try_lock_tlbie(&kvm->arch.tlbie_lock))
> +			cpu_relax();
> +		asm volatile("ptesync" : : : "memory");
> +		for (i =3D 0; i < n_inval; ++i)
> +			asm volatile(PPC_TLBIE(%1,%0)
> +				     : : "r" (tlbrb[i]), "r" =
(kvm->arch.lpid));
> +		asm volatile("eieio; tlbsync; ptesync" : : : "memory");
> +		kvm->arch.tlbie_lock =3D 0;
> +	} else {
> +		asm volatile("ptesync" : : : "memory");
> +		for (i =3D 0; i < n_inval; ++i)
> +			asm volatile("tlbiel %0" : : "r" (tlbrb[i]));
> +		asm volatile("ptesync" : : : "memory");
> +	}
> +	return ret;
> +}
> +
> +long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags,
> +		      unsigned long pte_index, unsigned long avpn,
> +		      unsigned long va)
> +{
> +	struct kvm *kvm =3D vcpu->kvm;
> +	unsigned long *hpte;
> +	unsigned long v, r, rb;
> +
> +	if (pte_index >=3D (HPT_NPTEG << 3))
> +		return H_PARAMETER;
> +	hpte =3D (unsigned long *)(kvm->arch.hpt_virt + (pte_index << =
4));
> +	while (!lock_hpte(hpte, HPTE_V_HVLOCK))
> +		cpu_relax();
> +	if ((hpte[0] & HPTE_V_VALID) =3D=3D 0 ||
> +	    ((flags & H_AVPN) && (hpte[0] & ~0x7fUL) !=3D avpn)) {
> +		hpte[0] &=3D ~HPTE_V_HVLOCK;
> +		return H_NOT_FOUND;
> +	}
> +	if (atomic_read(&kvm->online_vcpus) =3D=3D 1)
> +		flags |=3D H_LOCAL;
> +	v =3D hpte[0];
> +	r =3D hpte[1] & ~(HPTE_R_PP0 | HPTE_R_PP | HPTE_R_N |
> +			HPTE_R_KEY_HI | HPTE_R_KEY_LO);
> +	r |=3D (flags << 55) & HPTE_R_PP0;
> +	r |=3D (flags << 48) & HPTE_R_KEY_HI;
> +	r |=3D flags & (HPTE_R_PP | HPTE_R_N | HPTE_R_KEY_LO);
> +	rb =3D compute_tlbie_rb(v, r, pte_index);
> +	hpte[0] =3D v & ~HPTE_V_VALID;
> +	if (!(flags & H_LOCAL)) {
> +		while(!try_lock_tlbie(&kvm->arch.tlbie_lock))
> +			cpu_relax();
> +		asm volatile("ptesync" : : : "memory");
> +		asm volatile(PPC_TLBIE(%1,%0)"; eieio; tlbsync"
> +			     : : "r" (rb), "r" (kvm->arch.lpid));
> +		asm volatile("ptesync" : : : "memory");
> +		kvm->arch.tlbie_lock =3D 0;
> +	} else {
> +		asm volatile("ptesync" : : : "memory");
> +		asm volatile("tlbiel %0" : : "r" (rb));
> +		asm volatile("ptesync" : : : "memory");
> +	}
> +	hpte[1] =3D r;
> +	eieio();
> +	hpte[0] =3D v & ~HPTE_V_HVLOCK;
> +	asm volatile("ptesync" : : : "memory");
> +	return H_SUCCESS;
> +}
> +
> +static unsigned long reverse_xlate(struct kvm *kvm, unsigned long =
realaddr)
> +{
> +	long int i;
> +	unsigned long offset, rpn;
> +
> +	offset =3D realaddr & (kvm->arch.ram_psize - 1);
> +	rpn =3D (realaddr - offset) >> PAGE_SHIFT;
> +	for (i =3D 0; i < kvm->arch.ram_npages; ++i)
> +		if (rpn =3D=3D kvm->arch.ram_pginfo[i].pfn)
> +			return (i << PAGE_SHIFT) + offset;
> +	return HPTE_R_RPN;	/* all 1s in the RPN field */
> +}
> +
> +long kvmppc_h_read(struct kvm_vcpu *vcpu, unsigned long flags,
> +		   unsigned long pte_index)
> +{
> +	struct kvm *kvm =3D vcpu->kvm;
> +	unsigned long *hpte, r;
> +	int i, n =3D 1;
> +
> +	if (pte_index >=3D (HPT_NPTEG << 3))
> +		return H_PARAMETER;
> +	if (flags & H_READ_4) {
> +		pte_index &=3D ~3;
> +		n =3D 4;
> +	}
> +	for (i =3D 0; i < n; ++i, ++pte_index) {
> +		hpte =3D (unsigned long *)(kvm->arch.hpt_virt + =
(pte_index << 4));
> +		r =3D hpte[1];
> +		if ((flags & H_R_XLATE) && (hpte[0] & HPTE_V_VALID))
> +			r =3D reverse_xlate(kvm, r & HPTE_R_RPN) |
> +				(r & ~HPTE_R_RPN);
> +		vcpu->arch.gpr[4 + i * 2] =3D hpte[0];
> +		vcpu->arch.gpr[5 + i * 2] =3D r;
> +	}
> +	return H_SUCCESS;
> +}
> +
> int kvmppc_mmu_hv_init(void)
> {
> 	if (!cpu_has_feature(CPU_FTR_HVMODE_206))
> diff --git a/arch/powerpc/kvm/book3s_hv.c =
b/arch/powerpc/kvm/book3s_hv.c
> index f6b7cd1..377a35a 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -126,6 +126,158 @@ void kvmppc_dump_regs(struct kvm_vcpu *vcpu)
> 	       vcpu->arch.last_inst);
> }
>=20
> +struct kvm_vcpu *kvmppc_find_vcpu(struct kvm *kvm, int id)
> +{
> +	int r;
> +	struct kvm_vcpu *v, *ret =3D NULL;
> +
> +	mutex_lock(&kvm->lock);
> +	kvm_for_each_vcpu(r, v, kvm) {
> +		if (v->vcpu_id =3D=3D id) {
> +			ret =3D v;
> +			break;
> +		}
> +	}
> +	mutex_unlock(&kvm->lock);
> +	return ret;
> +}
> +
> +static void init_vpa(struct kvm_vcpu *vcpu, struct lppaca *vpa)
> +{
> +	vpa->shared_proc =3D 1;
> +	vpa->yield_count =3D 1;
> +}
> +
> +static unsigned long do_h_register_vpa(struct kvm_vcpu *vcpu,
> +				       unsigned long flags,
> +				       unsigned long vcpuid, unsigned =
long vpa)
> +{
> +	struct kvm *kvm =3D vcpu->kvm;
> +	unsigned long pg_index, ra, len;
> +	unsigned long pg_offset;
> +	void *va;
> +	struct kvm_vcpu *tvcpu;
> +
> +	tvcpu =3D kvmppc_find_vcpu(kvm, vcpuid);
> +	if (!tvcpu)
> +		return H_PARAMETER;
> +
> +	flags >>=3D 63 - 18;
> +	flags &=3D 7;
> +	if (flags =3D=3D 0 || flags =3D=3D 4)
> +		return H_PARAMETER;
> +	if (flags < 4) {
> +		if (vpa & 0x7f)
> +			return H_PARAMETER;
> +		/* registering new area; convert logical addr to real */
> +		pg_index =3D vpa >> kvm->arch.ram_porder;
> +		pg_offset =3D vpa & (kvm->arch.ram_psize - 1);
> +		if (pg_index >=3D kvm->arch.ram_npages)
> +			return H_PARAMETER;
> +		if (kvm->arch.ram_pginfo[pg_index].pfn =3D=3D 0)
> +			return H_PARAMETER;
> +		ra =3D kvm->arch.ram_pginfo[pg_index].pfn << PAGE_SHIFT;
> +		ra |=3D pg_offset;
> +		va =3D __va(ra);
> +		if (flags <=3D 1)
> +			len =3D *(unsigned short *)(va + 4);
> +		else
> +			len =3D *(unsigned int *)(va + 4);
> +		if (pg_offset + len > kvm->arch.ram_psize)
> +			return H_PARAMETER;
> +		switch (flags) {
> +		case 1:		/* register VPA */
> +			if (len < 640)
> +				return H_PARAMETER;
> +			tvcpu->arch.vpa =3D va;
> +			init_vpa(vcpu, va);
> +			break;
> +		case 2:		/* register DTL */
> +			if (len < 48)
> +				return H_PARAMETER;
> +			if (!tvcpu->arch.vpa)
> +				return H_RESOURCE;
> +			len -=3D len % 48;
> +			tvcpu->arch.dtl =3D va;
> +			tvcpu->arch.dtl_end =3D va + len;
> +			break;
> +		case 3:		/* register SLB shadow buffer */
> +			if (len < 8)
> +				return H_PARAMETER;
> +			if (!tvcpu->arch.vpa)
> +				return H_RESOURCE;
> +			tvcpu->arch.slb_shadow =3D va;
> +			len =3D (len - 16) / 16;
> +			tvcpu->arch.slb_shadow =3D va;
> +			break;
> +		}
> +	} else {
> +		switch (flags) {
> +		case 5:		/* unregister VPA */
> +			if (tvcpu->arch.slb_shadow || tvcpu->arch.dtl)
> +				return H_RESOURCE;
> +			tvcpu->arch.vpa =3D NULL;
> +			break;
> +		case 6:		/* unregister DTL */
> +			tvcpu->arch.dtl =3D NULL;
> +			break;
> +		case 7:		/* unregister SLB shadow buffer */
> +			tvcpu->arch.slb_shadow =3D NULL;
> +			break;
> +		}
> +	}
> +	return H_SUCCESS;
> +}
> +
> +int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long req =3D kvmppc_get_gpr(vcpu, 3);
> +	unsigned long target, ret =3D H_SUCCESS;
> +	struct kvm_vcpu *tvcpu;
> +
> +	switch (req) {
> +	case H_CEDE:
> +		vcpu->arch.msr |=3D MSR_EE;
> +		vcpu->arch.ceded =3D 1;
> +		smp_mb();
> +		if (!vcpu->arch.prodded)
> +			kvmppc_vcpu_block(vcpu);
> +		else
> +			vcpu->arch.prodded =3D 0;
> +		smp_mb();
> +		vcpu->arch.ceded =3D 0;
> +		break;
> +	case H_PROD:
> +		target =3D kvmppc_get_gpr(vcpu, 4);
> +		tvcpu =3D kvmppc_find_vcpu(vcpu->kvm, target);
> +		if (!tvcpu) {
> +			ret =3D H_PARAMETER;
> +			break;
> +		}
> +		tvcpu->arch.prodded =3D 1;
> +		smp_mb();
> +		if (vcpu->arch.ceded) {
> +			if (waitqueue_active(&vcpu->wq)) {
> +				wake_up_interruptible(&vcpu->wq);
> +				vcpu->stat.halt_wakeup++;
> +			}
> +		}
> +		break;
> +	case H_CONFER:
> +		break;
> +	case H_REGISTER_VPA:
> +		ret =3D do_h_register_vpa(vcpu, kvmppc_get_gpr(vcpu, 4),
> +					kvmppc_get_gpr(vcpu, 5),
> +					kvmppc_get_gpr(vcpu, 6));
> +		break;
> +	default:
> +		return RESUME_HOST;
> +	}
> +	kvmppc_set_gpr(vcpu, 3, ret);
> +	vcpu->arch.hcall_needed =3D 0;
> +	return RESUME_GUEST;
> +}
> +
> static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu =
*vcpu,
> 			      struct task_struct *tsk)
> {
> @@ -307,7 +459,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
>=20
> extern int __kvmppc_vcore_entry(struct kvm_run *kvm_run, struct =
kvm_vcpu *vcpu);
>=20
> -int kvmppc_vcpu_run(struct kvm_run *run, struct kvm_vcpu *vcpu)
> +static int kvmppc_run_vcpu(struct kvm_run *run, struct kvm_vcpu =
*vcpu)
> {
> 	u64 now;
>=20
> @@ -338,6 +490,22 @@ int kvmppc_vcpu_run(struct kvm_run *run, struct =
kvm_vcpu *vcpu)
> 	return kvmppc_handle_exit(run, vcpu, current);
> }
>=20
> +int kvmppc_vcpu_run(struct kvm_run *run, struct kvm_vcpu *vcpu)
> +{
> +	int r;
> +
> +	do {
> +		r =3D kvmppc_run_vcpu(run, vcpu);
> +
> +		if (run->exit_reason =3D=3D KVM_EXIT_PAPR_HCALL &&
> +		    !(vcpu->arch.msr & MSR_PR)) {
> +			r =3D kvmppc_pseries_do_hcall(vcpu);
> +			kvmppc_core_deliver_interrupts(vcpu);
> +		}
> +	} while (r =3D=3D RESUME_GUEST);
> +	return r;
> +}
> +
> int kvmppc_core_prepare_memory_region(struct kvm *kvm,
> 				struct kvm_userspace_memory_region *mem)
> {
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S =
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index 813b01c..e8a8f3c 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -195,6 +195,14 @@ kvmppc_handler_trampoline_enter:
> 	/* Save R1 in the PACA */
> 	std	r1, PACA_KVM_SVCPU + SVCPU_HOST_R1(r13)
>=20
> +	/* Increment yield count if they have a VPA */
> +	ld	r3, VCPU_VPA(r4)
> +	cmpdi	r3, 0
> +	beq	25f
> +	lwz	r5, LPPACA_YIELDCOUNT(r3)
> +	addi	r5, r5, 1
> +	stw	r5, LPPACA_YIELDCOUNT(r3)
> +25:
> 	/* Load up DAR and DSISR */
> 	ld	r5, VCPU_DAR(r4)
> 	lwz	r6, VCPU_DSISR(r4)
> @@ -432,6 +440,10 @@ kvmppc_interrupt:
> 	cmpwi	r3,0
> 	bge	ignore_hdec
> 2:
> +	/* See if this is something we can handle in real mode */
> +	cmpwi	r12,0xc00

use the define please

> +	beq	hcall_real_mode

This is simply a hcall helper, as the name suggests. So the comment is =
slightly misleading - it should rather read like "Try to handle =
hypercalls in real mode".

> +hcall_real_cont:
>=20
> 	/* Check for mediated interrupts (could be done earlier really =
...) */
> 	cmpwi	r12,0x500
> @@ -607,13 +619,28 @@ hdec_soon:
> 	std	r5, VCPU_SPRG2(r9)
> 	std	r6, VCPU_SPRG3(r9)
>=20
> -	/* Save PMU registers */
> +	/* Increment yield count if they have a VPA */
> +	ld	r8, VCPU_VPA(r9)	/* do they have a VPA? */
> +	cmpdi	r8, 0
> +	beq	25f
> +	lwz	r3, LPPACA_YIELDCOUNT(r8)
> +	addi	r3, r3, 1
> +	stw	r3, LPPACA_YIELDCOUNT(r8)
> +25:
> +	/* Save PMU registers if requested */
> +	/* r8 and cr0.eq are live here */
> 	li	r3, 1
> 	sldi	r3, r3, 31		/* MMCR0_FC (freeze counters) =
bit */
> 	mfspr	r4, SPRN_MMCR0		/* save MMCR0 */
> 	mtspr	SPRN_MMCR0, r3		/* freeze all counters, disable =
ints */
> 	isync
> -	mfspr	r5, SPRN_MMCR1
> +	beq	21f			/* if no VPA, save PMU stuff =
anyway */
> +	lbz	r7, LPPACA_PMCINUSE(r8)
> +	cmpwi	r7, 0			/* did they ask for PMU stuff to =
be saved? */
> +	bne	21f
> +	std	r3, VCPU_MMCR(r9)	/* if not, set saved MMCR0 to FC =
*/
> +	b	22f
> +21:	mfspr	r5, SPRN_MMCR1
> 	mfspr	r6, SPRN_MMCRA
> 	std	r4, VCPU_MMCR(r9)
> 	std	r5, VCPU_MMCR + 8(r9)
> @@ -650,6 +677,119 @@ kvmppc_handler_trampoline_exit_end:
> 	mfspr	r7,SPRN_HDSISR
> 	b	7b
>=20
> +	.globl	hcall_real_mode
> +hcall_real_mode:
> +	ld	r3,VCPU_GPR(r3)(r9)
> +	andi.	r0,r11,MSR_PR
> +	bne	hcall_real_cont
> +	clrrdi	r3,r3,2
> +	cmpldi	r3,hcall_real_table_end - hcall_real_table
> +	bge	hcall_real_cont
> +	LOAD_REG_ADDR(r4, hcall_real_table)
> +	lwzx	r3,r3,r4
> +	cmpwi	r3,0
> +	beq	hcall_real_cont
> +	add	r3,r3,r4
> +	mtctr	r3
> +	mr	r3,r9		/* get vcpu pointer */
> +	ld	r4,VCPU_GPR(r4)(r9)
> +	bctrl
> +	cmpdi	r3,H_TOO_HARD
> +	beq	hcall_real_fallback

Ah, very good. Please mark the constant as "for internal use only" then, =
as that's certainly fine :).

> +	ld	r4,PACA_KVM_VCPU(r13)
> +	std	r3,VCPU_GPR(r3)(r4)
> +	ld	r10,VCPU_PC(r4)
> +	ld	r11,VCPU_MSR(r4)
> +	b	fast_guest_return
> +
> +	/* We've attempted a real mode hcall, but it's punted it back
> +	 * to userspace.  We need to restore some clobbered volatiles
> +	 * before resuming the pass-it-to-qemu path */
> +hcall_real_fallback:
> +	li	r12,0xc00

use the define please :)


Alex

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: James Morris @ 2011-05-17  2:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, Paul Mackerras, Eric Paris,
	H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390, Russell King,
	x86, Linus Torvalds, Ingo Molnar, kees.cook, Serge E. Hallyn,
	Peter Zijlstra, Steven Rostedt, Tejun Heo, Thomas Gleixner,
	linux-arm-kernel, Michal Marek, Michal Simek, Will Drewry,
	linuxppc-dev, linux-kernel, Ralf Baechle, Paul Mundt,
	Martin Schwidefsky, linux390, Andrew Morton, agl, David S. Miller
In-Reply-To: <20110516150837.GA513@elte.hu>

On Mon, 16 May 2011, Ingo Molnar wrote:

> > Not really.
> > 
> > Firstly, what is the security goal of these restrictions? [...]
> 
> To do what i described above? Namely:
> 
>  " Sandboxed code should only be allowed to open files in /home/sandbox/, /lib/
>    and /usr/lib/ "

These are access rules, they don't really describe a high-level security 
goal.  How do you know it's ok to open everything in these directories?


- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply

* Re: book to learn ppc assembly and architecture
From: Michael Neuling @ 2011-05-16 23:43 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: s shaiju, linuxppc-dev
In-Reply-To: <1305589123.2781.15.camel@pasglop>

In message <1305589123.2781.15.camel@pasglop> you wrote:
> On Mon, 2011-05-16 at 16:37 +1000, Michael Neuling wrote:
> > > what  is  the  best  book  to  learn  assembly  and  architecture .
> > 
> > Reading the architecture books with a nice cup of tea.
> > 
> > http://www.power.org/resources/downloads/PowerISA_V2.06B_V2_PUBLIC.pdf
> > 
> 
> A slightly less steep approach might be to get yourself the programmer
> manuals of some older powerpc chips, like the freescale PEMs, and then
> move on to the full ISA

True but a well aged Shiraz goes better with them than the cup of tea.

Mikey

^ permalink raw reply

* Re: book to learn ppc assembly and architecture
From: Benjamin Herrenschmidt @ 2011-05-16 23:38 UTC (permalink / raw)
  To: Michael Neuling; +Cc: s shaiju, linuxppc-dev
In-Reply-To: <9136.1305527852@neuling.org>

On Mon, 2011-05-16 at 16:37 +1000, Michael Neuling wrote:
> > what  is  the  best  book  to  learn  assembly  and  architecture .
> 
> Reading the architecture books with a nice cup of tea.
> 
> http://www.power.org/resources/downloads/PowerISA_V2.06B_V2_PUBLIC.pdf
> 

A slightly less steep approach might be to get yourself the programmer
manuals of some older powerpc chips, like the freescale PEMs, and then
move on to the full ISA

Cheers,
Ben.

^ permalink raw reply

* Re: Updating the define_machine() structure at runtime?
From: Benjamin Herrenschmidt @ 2011-05-16 23:37 UTC (permalink / raw)
  To: Timur Tabi; +Cc: linuxppc-dev
In-Reply-To: <4DD14F15.5070902@freescale.com>

On Mon, 2011-05-16 at 11:21 -0500, Timur Tabi wrote:
> I would like to get some opinions on the following.
> 
> I'm working on extending our existing board definition files to support our
> ePAPR reference hypervisor.  Because our hypervisor provides a virtual PIC, we
> need to have different functions for define_machine() structure that related to
> the PIC, but most everything else can stay the same.

 .../...

> static int __init p4080_ds_probe(void)
> {
> 	...
> #ifdef CONFIG_FSL_HYPERVISOR
> 	if (of_flat_dt_is_compatible(root, "fsl,P4080DS-hv")) {
> 		mach_p4080_ds.get_irq = ehv_pic_get_irq;
> 		...
> 		return 1;
> 	} else
> 		return 0;
> #endif	

One approach i've been taking more often nowadays is to have the PIC
code setup the ppc_md hooks itself.

You don't need to do it in probe(), just in setup_arch or init_IRQ,
wherever you do your PIC discovery, have it set it up.

For example, xics sets get_irq itself, which is necessary because the
xics code itself can have multiple variants.

We could probably make mpic.c do the same when MPIC_PRIMARY is set (or
keep that a wrapper for now).

Cheers,
Ben.

^ permalink raw reply

* [PATCH] powerpc: remove last piece of GEMINI
From: Sebastian Andrzej Siewior @ 2011-05-16 18:58 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, Adrian Bunk, Paul Mackerras

It seems that Adrian is getting old. He removed almost everything of
GEMINI in commit c53653130 ("[POWERPC] Remove the broken Gemini
support") except this piece.

Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
---
 arch/powerpc/kernel/head_32.S |   13 -------------
 1 files changed, 0 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index c5c24be..0e49de0 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -805,19 +805,6 @@ _ENTRY(copy_and_flush)
 	blr
 
 #ifdef CONFIG_SMP
-#ifdef CONFIG_GEMINI
-	.globl	__secondary_start_gemini
-__secondary_start_gemini:
-        mfspr   r4,SPRN_HID0
-        ori     r4,r4,HID0_ICFI
-        li      r3,0
-        ori     r3,r3,HID0_ICE
-        andc    r4,r4,r3
-        mtspr   SPRN_HID0,r4
-        sync
-        b       __secondary_start
-#endif /* CONFIG_GEMINI */
-
 	.globl __secondary_start_mpc86xx
 __secondary_start_mpc86xx:
 	mfspr	r3, SPRN_PIR
-- 
1.7.4.4

^ permalink raw reply related

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Steven Rostedt @ 2011-05-16 17:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	Ralf Baechle, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, James Morris, Linus Torvalds, Ingo Molnar,
	linux-arm-kernel, kees.cook, Serge E. Hallyn, Martin Schwidefsky,
	Thomas Gleixner, Roland McGrath, Michal Marek, Michal Simek,
	Will Drewry, linuxppc-dev, linux-kernel, Eric Paris, Paul Mundt,
	Tejun Heo, linux390, Andrew Morton, agl, David S. Miller
In-Reply-To: <20110516165249.GB10929@elte.hu>

On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote:
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > I'm a bit nervous about the 'active' role of (trace_)events, because of the 
> > way multiple callbacks can be registered. How would:
> > 
> > 	err = event_x();
> > 	if (err == -EACCESS) {
> > 
> > be handled? [...]
> 
> The default behavior would be something obvious: to trigger all callbacks and 
> use the first non-zero return value.


But how do we know which callback that was from? There's no ordering of
what callbacks are called first.

> 
> > [...] Would we need a way to prioritize which call back gets the return 
> > value? One way I guess would be to add a check_event option, where you pass 
> > in an ENUM of the event you want:
> > 
> > 	event_x();
> > 	err = check_event_x(MYEVENT);
> > 
> > If something registered itself as "MYEVENT" to event_x, then you get the 
> > return code of MYEVENT. If the MYEVENT was not registered, a -ENODEV or 
> > something could be returned. I'm sure we could even optimize it such a way if 
> > no active events have been registered to event_x, that check_event_x() will 
> > return -ENODEV without any branches.
> 
> I would keep it simple and extensible - that way we can complicate it when the 
> need arises! :)

The above is rather trivial to implement. I don't think it complicates
anything.

-- Steve

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Ingo Molnar @ 2011-05-16 16:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	Ralf Baechle, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, James Morris, Linus Torvalds, Ingo Molnar,
	linux-arm-kernel, kees.cook, Serge E. Hallyn, Martin Schwidefsky,
	Thomas Gleixner, Roland McGrath, Michal Marek, Michal Simek,
	Will Drewry, linuxppc-dev, linux-kernel, Eric Paris, Paul Mundt,
	Tejun Heo, linux390, Andrew Morton, agl, David S. Miller
In-Reply-To: <1305563026.5456.19.camel@gandalf.stny.rr.com>


* Steven Rostedt <rostedt@goodmis.org> wrote:

> I'm a bit nervous about the 'active' role of (trace_)events, because of the 
> way multiple callbacks can be registered. How would:
> 
> 	err = event_x();
> 	if (err == -EACCESS) {
> 
> be handled? [...]

The default behavior would be something obvious: to trigger all callbacks and 
use the first non-zero return value.

> [...] Would we need a way to prioritize which call back gets the return 
> value? One way I guess would be to add a check_event option, where you pass 
> in an ENUM of the event you want:
> 
> 	event_x();
> 	err = check_event_x(MYEVENT);
> 
> If something registered itself as "MYEVENT" to event_x, then you get the 
> return code of MYEVENT. If the MYEVENT was not registered, a -ENODEV or 
> something could be returned. I'm sure we could even optimize it such a way if 
> no active events have been registered to event_x, that check_event_x() will 
> return -ENODEV without any branches.

I would keep it simple and extensible - that way we can complicate it when the 
need arises! :)

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Steven Rostedt @ 2011-05-16 16:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	Ralf Baechle, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, James Morris, Linus Torvalds, Ingo Molnar,
	linux-arm-kernel, kees.cook, Serge E. Hallyn, Martin Schwidefsky,
	Thomas Gleixner, Roland McGrath, Michal Marek, Michal Simek,
	Will Drewry, linuxppc-dev, linux-kernel, Eric Paris, Paul Mundt,
	Tejun Heo, linux390, Andrew Morton, agl, David S. Miller
In-Reply-To: <20110513145737.GC32688@elte.hu>

On Fri, 2011-05-13 at 16:57 +0200, Ingo Molnar wrote:

> > > > Then there's the whole indirection argument, if you don't need
> > > > indirection, its often better to not use it, I myself much prefer code
> > > > to look like:
> > > > 
> > > >    foo1(bar);
> > > >    foo2(bar);
> > > >    foo3(bar);
> > > > 
> > > > Than:
> > > > 
> > > >    foo_notifier(bar);
> > > > 
> > > > Simply because its much clearer who all are involved without me having
> > > > to grep around to see who registers for foo_notifier and wth they do
> > > > with it. It also makes it much harder to sneak in another user, whereas
> > > > its nearly impossible to find new notifier users.
> > > > 
> > > > Its also much faster, no extra memory accesses, no indirect function
> > > > calls, no other muck.
> > > 
> > > But i suspect this question has been settled, given the fact that even pure 
> > > observer events need and already process a chain of events? Am i missing 
> > > something about your argument?
> > 
> > I'm saying that there's reasons to not use notifiers passive or active.
> > 
> > Mostly the whole notifier/indirection muck comes up once you want
> > modules to make use of the thing, because then you need dynamic
> > management of the callback list.
> 
> But your argument assumes that we'd have a chain of functions to call, like 
> regular notifiers.
> 
> While the natural model here would be to have a list of registered event 
> structs for that point, with different filters but basically the same callback 
> mechanism (a call into the filter engine in essence).
> 
> Also note that the common case would be no event registered - and we'd 
> automatically optimize that case via the existing jump labels optimization.

I agree that I prefer the "notifier" type over having direct function
calls. Yes, it's easier to read and figure out what functions are
called, but notifiers can be optimized for the default case where
nothing is called (jump-label nop).

> 
> > (Then again, I'm fairly glad we don't have explicit callbacks in kernel/cpu.c 
> > for all the cpu-hotplug callbacks :-)
> > 
> > Anyway, I oppose for the existing events to gain an active role.
> 
> Why if 'being active' is optional and useful?

I'm a bit nervous about the 'active' role of (trace_)events, because of
the way multiple callbacks can be registered. How would:

	err = event_x();
	if (err == -EACCESS) {

be handled? Would we need a way to prioritize which call back gets the
return value? One way I guess would be to add a check_event option,
where you pass in an ENUM of the event you want:

	event_x();
	err = check_event_x(MYEVENT);

If something registered itself as "MYEVENT" to event_x, then you get the
return code of MYEVENT. If the MYEVENT was not registered, a -ENODEV or
something could be returned. I'm sure we could even optimize it such a
way if no active events have been registered to event_x, that
check_event_x() will return -ENODEV without any branches.

-- Steve

^ permalink raw reply

* Updating the define_machine() structure at runtime?
From: Timur Tabi @ 2011-05-16 16:21 UTC (permalink / raw)
  To: linuxppc-dev

I would like to get some opinions on the following.

I'm working on extending our existing board definition files to support our
ePAPR reference hypervisor.  Because our hypervisor provides a virtual PIC, we
need to have different functions for define_machine() structure that related to
the PIC, but most everything else can stay the same.

define_machine(p4080_hv) {
	.name			= "P4080DS HV",
	.probe			= p4080hv_probe,
	.setup_arch		= corenet_ds_setup_arch,
	.init_IRQ		= fsl_hv_pic_init,
#ifdef CONFIG_PCI
	.pcibios_fixup_bus	= fsl_pcibios_fixup_bus,
#endif
	.get_irq		= ehv_pic_get_irq,
	.restart		= fsl_hv_restart,
	.power_off		= fsl_hv_halt,
	.halt			= fsl_hv_halt,
	.calibrate_decr		= generic_calibrate_decr,
	.progress		= udbg_progress,
};

Rather than create two such structures for each board that can support the
hypervisor (one with HV support, one without), I would like to do the following
instead.  What do you all think?

define_machine(p4080_ds) {
	.name			= "P4080 DS",
	.probe			= p4080_ds_probe,
	...
	.get_irq		= mpic_get_coreint_irq,
};

static int __init p4080_ds_probe(void)
{
	...
#ifdef CONFIG_FSL_HYPERVISOR
	if (of_flat_dt_is_compatible(root, "fsl,P4080DS-hv")) {
		mach_p4080_ds.get_irq = ehv_pic_get_irq;
		...
		return 1;
	} else
		return 0;
#endif	

-- 
Timur Tabi
Linux kernel developer at Freescale

^ permalink raw reply

* [PATCH 2/2] powerpc/5200: dts: digsy_mtc.dts: update to add can, pci, serial and spi
From: Anatolij Gustschin @ 2011-05-16 16:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: devicetree-discuss, Heiko Schocher
In-Reply-To: <1305561764-5942-1-git-send-email-agust@denx.de>

Add new nodes to describe more hardware the board is
equipped with:
 - two can nodes for SJA1000 on localbus
 - pci node to support Coral-PA graphics controller
 - serial node for SC28L92 DUART on localbus
 - spi node for MSP430 device

Also correct i2c eeprom node name.

Signed-off-by: Heiko Schocher <hs@denx.de>
Signed-off-by: Anatolij Gustschin <agust@denx.de>
---
 arch/powerpc/boot/dts/digsy_mtc.dts |   50 ++++++++++++++++++++++++++++++++--
 1 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/boot/dts/digsy_mtc.dts b/arch/powerpc/boot/dts/digsy_mtc.dts
index 27bd267..e205d17 100644
--- a/arch/powerpc/boot/dts/digsy_mtc.dts
+++ b/arch/powerpc/boot/dts/digsy_mtc.dts
@@ -38,6 +38,14 @@
 			status = "disabled";
 		};
 
+		spi@f00 {
+			msp430@0 {
+				compatible = "spidev";
+				spi-max-frequency = <32000>;
+				reg = <0>;
+			};
+		};
+
 		psc@2000 {		// PSC1
 			status = "disabled";
 		};
@@ -73,11 +81,16 @@
 		};
 
 		i2c@3d00 {
-			rtc@50 {
+			eeprom@50 {
 				compatible = "at,24c08";
 				reg = <0x50>;
 			};
 
+			rtc@56 {
+				compatible = "mc,rv3029c2";
+				reg = <0x56>;
+			};
+
 			rtc@68 {
 				compatible = "dallas,ds1339";
 				reg = <0x68>;
@@ -90,11 +103,22 @@
 	};
 
 	pci@f0000d00 {
-		status = "disabled";
+		interrupt-map-mask = <0xf800 0 0 7>;
+		interrupt-map = <0xc000 0 0 1 &mpc5200_pic 0 0 3
+				 0xc000 0 0 2 &mpc5200_pic 0 0 3
+				 0xc000 0 0 3 &mpc5200_pic 0 0 3
+				 0xc000 0 0 4 &mpc5200_pic 0 0 3>;
+		clock-frequency = <0>; // From boot loader
+		interrupts = <2 8 0 2 9 0 2 10 0>;
+		bus-range = <0 0>;
+		ranges = <0x42000000 0 0x80000000 0x80000000 0 0x10000000
+			  0x02000000 0 0x90000000 0x90000000 0 0x10000000
+			  0x01000000 0 0x00000000 0xa0000000 0 0x01000000>;
 	};
 
 	localbus {
-		ranges = <0 0 0xff000000 0x1000000>;
+		ranges = <0 0 0xff000000 0x1000000
+			  4 0 0x60000000 0x0001000>;
 
 		// 16-bit flash device at LocalPlus Bus CS0
 		flash@0,0 {
@@ -122,5 +146,25 @@
 				reg = <0x00f00000 0x100000>;
 			};
 		};
+
+		can@4,0 {
+			compatible = "nxp,sja1000";
+			reg = <4 0x000 0x80>;
+			nxp,external-clock-frequency = <24000000>;
+			interrupts = <1 2 3>; // Level-low
+		};
+
+		can@4,100 {
+			compatible = "nxp,sja1000";
+			reg = <4 0x100 0x80>;
+			nxp,external-clock-frequency = <24000000>;
+			interrupts = <1 2 3>;  // Level-low
+		};
+
+		serial@4,200 {
+			compatible = "nxp,sc28l92";
+			reg = <4 0x200 0x10>;
+			interrupts = <1 3 3>;
+		};
 	};
 };
-- 
1.7.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox