LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH 2/5] KVM: PPC: iommu: Add missing kvm_iommu_map_pages/kvm_iommu_unmap_pages
From: Alexey Kardashevskiy @ 2013-05-07  0:49 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Alexander Graf, kvm-ppc, linux-kernel, Paul Mackerras,
	linux-pci, linuxppc-dev, David Gibson
In-Reply-To: <1367874436.2868.90.camel@ul30vt.home>

On 05/07/2013 07:07 AM, Alex Williamson wrote:
> On Mon, 2013-05-06 at 17:21 +1000, aik@ozlabs.ru wrote:
>> From: Alexey Kardashevskiy <aik@ozlabs.ru>
>>
>> The IOMMU API implements groups creating/deletion, device binding
>> and IOMMU map/unmap operations.
>>
>> The PowerPC implementation uses most of the API except map/unmap
>> operations, which are implemented on POWER using hypercalls.
>>
>> However, in order to link a kernel with the CONFIG_IOMMU_API enabled,
>> the empty kvm_iommu_map_pages/kvm_iommu_unmap_pages have to be
>> defined, so this defines them.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> Cc: David Gibson <david@gibson.dropbear.id.au>
>> Signed-off-by: Paul Mackerras <paulus@samba.org>
>> ---
>>  arch/powerpc/include/asm/kvm_host.h |   14 ++++++++++++++
>>  1 file changed, 14 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index b6a047e..c025d91 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -603,4 +603,18 @@ struct kvm_vcpu_arch {
>>  
>>  #define __KVM_HAVE_ARCH_WQP
>>  
>> +#ifdef CONFIG_IOMMU_API
>> +/* POWERPC does not use IOMMU API for mapping/unmapping */
>> +static inline int kvm_iommu_map_pages(struct kvm *kvm,
>> +		struct kvm_memory_slot *slot)
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline void kvm_iommu_unmap_pages(struct kvm *kvm,
>> +		struct kvm_memory_slot *slot)
>> +{
>> +}
>> +#endif /* CONFIG_IOMMU_API */
>> +
>>  #endif /* __POWERPC_KVM_HOST_H__ */
> 
> This is no longer needed, Gleb applied my patch for 3.10 that make all
> of KVM device assignment dependent on a build config option and the top
> level kvm_host.h now includes this when that is not set.  Thanks,

Cannot find it, could you point me please where it is on github or
git.kernel.org? Thanks.


-- 
Alexey

^ permalink raw reply

* Re: [v1][KVM][PATCH 1/1] kvm:ppc: enable doorbell exception with E500MC
From: tiejun.chen @ 2013-05-07  1:33 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linuxppc-dev, kvm, kvm-ppc
In-Reply-To: <5187C521.2070904@suse.de>

On 05/06/2013 10:58 PM, Alexander Graf wrote:
> On 05/06/2013 04:53 AM, Tiejun Chen wrote:
>> Actually E500MC also support doorbell exception, and CONFIG_PPC_E500MC
>> can cover BOOK3E/BOOK3E_64 as well.
>>
>> Signed-off-by: Tiejun Chen<tiejun.chen@windriver.com>
>> ---
>>   arch/powerpc/kvm/booke.c |    2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
>> index 1020119..dc1f590 100644
>> --- a/arch/powerpc/kvm/booke.c
>> +++ b/arch/powerpc/kvm/booke.c
>> @@ -795,7 +795,7 @@ static void kvmppc_restart_interrupt(struct kvm_vcpu *vcpu,
>>           kvmppc_fill_pt_regs(&regs);
>>           timer_interrupt(&regs);
>>           break;
>> -#if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_BOOK3E_64)
>> +#if defined(CONFIG_PPC_E500MC)
>
> I suppose you mean CONFIG_KVM_E500MC here? Why didn't this work for you before?

This works for me.

Here I just mean currently CONFIG_PPC_E500MC is always selected no matter what 
CONFIG_PPC_FSL_BOOK3E or CONFIG_PPC_BOOK3E_64 is enabled. And especially, this 
is already in the arch/powerpc/kvm/booke.c file, so I think one #ifdef 
(CONFIG_PPC_E500MC) is enough and also makes sense.

> The ifdef above should cover the same range of CPUs.

Or furthermore, the #ifdef CONFIG_PPC_DOORBELL is reasonable to cover this.

Tiejun

^ permalink raw reply

* Invalid perf_branch_entry.to entries question
From: Michael Neuling @ 2013-05-07  1:35 UTC (permalink / raw)
  To: Peter Zijlstra, eranian; +Cc: Linux PPC dev, linux-kernel, Anshuman Khandual

Peter & Stephane,

We are plumbing the POWER8 Branch History Rolling Buffer (BHRB) into
struct perf_branch_entry.

Sometimes on POWER8 we may not be able to fill out the "to" address.  We
initially thought of just making this 0, but it's feasible that this
could be a valid address to branch to. 

The other logical value to indicate an invalid entry would be all 1s
which is not possible (on POWER at least).

Do you guys have a preference as to what we should use as an invalid
entry?  This would have some consequences for the userspace tool also.

The alternative would be to add a flag alongside mispred/predicted to
indicate the validity of the "to" address.

Mikey

^ permalink raw reply

* Re: [PATCH 2/5] KVM: PPC: iommu: Add missing kvm_iommu_map_pages/kvm_iommu_unmap_pages
From: Alex Williamson @ 2013-05-07  1:42 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Alexander Graf, kvm-ppc, linux-kernel, Paul Mackerras,
	linux-pci, linuxppc-dev, David Gibson
In-Reply-To: <51884FA8.6010400@ozlabs.ru>

On Tue, 2013-05-07 at 10:49 +1000, Alexey Kardashevskiy wrote:
> On 05/07/2013 07:07 AM, Alex Williamson wrote:
> > On Mon, 2013-05-06 at 17:21 +1000, aik@ozlabs.ru wrote:
> >> From: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>
> >> The IOMMU API implements groups creating/deletion, device binding
> >> and IOMMU map/unmap operations.
> >>
> >> The PowerPC implementation uses most of the API except map/unmap
> >> operations, which are implemented on POWER using hypercalls.
> >>
> >> However, in order to link a kernel with the CONFIG_IOMMU_API enabled,
> >> the empty kvm_iommu_map_pages/kvm_iommu_unmap_pages have to be
> >> defined, so this defines them.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> Cc: David Gibson <david@gibson.dropbear.id.au>
> >> Signed-off-by: Paul Mackerras <paulus@samba.org>
> >> ---
> >>  arch/powerpc/include/asm/kvm_host.h |   14 ++++++++++++++
> >>  1 file changed, 14 insertions(+)
> >>
> >> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> >> index b6a047e..c025d91 100644
> >> --- a/arch/powerpc/include/asm/kvm_host.h
> >> +++ b/arch/powerpc/include/asm/kvm_host.h
> >> @@ -603,4 +603,18 @@ struct kvm_vcpu_arch {
> >>  
> >>  #define __KVM_HAVE_ARCH_WQP
> >>  
> >> +#ifdef CONFIG_IOMMU_API
> >> +/* POWERPC does not use IOMMU API for mapping/unmapping */
> >> +static inline int kvm_iommu_map_pages(struct kvm *kvm,
> >> +		struct kvm_memory_slot *slot)
> >> +{
> >> +	return 0;
> >> +}
> >> +
> >> +static inline void kvm_iommu_unmap_pages(struct kvm *kvm,
> >> +		struct kvm_memory_slot *slot)
> >> +{
> >> +}
> >> +#endif /* CONFIG_IOMMU_API */
> >> +
> >>  #endif /* __POWERPC_KVM_HOST_H__ */
> > 
> > This is no longer needed, Gleb applied my patch for 3.10 that make all
> > of KVM device assignment dependent on a build config option and the top
> > level kvm_host.h now includes this when that is not set.  Thanks,
> 
> Cannot find it, could you point me please where it is on github or
> git.kernel.org? Thanks.

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2a5bab1004729f3302c776e53ee7c895b98bb1ce

^ permalink raw reply

* Re: [RFC][KVM][PATCH 1/1] kvm:ppc:booke-64: soft-disable interrupts
From: tiejun.chen @ 2013-05-07  1:56 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev, agraf, kvm-ppc, kvm
In-Reply-To: <1367884257.3398.9@snotra>

On 05/07/2013 07:50 AM, Scott Wood wrote:
> On 05/05/2013 10:13:17 PM, tiejun.chen wrote:
>> On 05/06/2013 11:10 AM, Tiejun Chen wrote:
>>> For the external interrupt, the decrementer exception and the doorbell
>>> excpetion, we also need to soft-disable interrupts while doing as host
>>> interrupt handlers since the DO_KVM hook is always performed to skip
>>> EXCEPTION_COMMON then miss this original chance with the 'ints' (INTS_DISABLE).
>
> http://patchwork.ozlabs.org/patch/241344/
> http://patchwork.ozlabs.org/patch/241412/
>
> :-)

I'm observing the same behaviour as well:

	WARN_ON_ONCE(!irqs_disabled());

>
>>> Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
>>> ---
>>>   arch/powerpc/kvm/bookehv_interrupts.S |    9 +++++++++
>>>   1 file changed, 9 insertions(+)
>>>
>>> diff --git a/arch/powerpc/kvm/bookehv_interrupts.S
>>> b/arch/powerpc/kvm/bookehv_interrupts.S
>>> index e8ed7d6..2fd62bf 100644
>>> --- a/arch/powerpc/kvm/bookehv_interrupts.S
>>> +++ b/arch/powerpc/kvm/bookehv_interrupts.S
>>> @@ -33,6 +33,8 @@
>>>
>>>   #ifdef CONFIG_64BIT
>>>   #include <asm/exception-64e.h>
>>> +#include <asm/hw_irq.h>
>>> +#include <asm/irqflags.h>
>>>   #else
>>>   #include "../kernel/head_booke.h" /* for THREAD_NORMSAVE() */
>>>   #endif
>>> @@ -469,6 +471,13 @@ _GLOBAL(kvmppc_resume_host)
>>>       PPC_LL    r3, HOST_RUN(r1)
>>>       mr    r5, r14 /* intno */
>>>       mr    r14, r4 /* Save vcpu pointer. */
>>> +#ifdef CONFIG_64BIT
>>> +    /* Should we soft-disable interrupts? */
>>> +    andi.    r6, r5, BOOKE_INTERRUPT_EXTERNAL | BOOKE_INTERRUPT_DECREMENTER
>>> | BOOKE_INTERRUPT_DOORBELL
>>> +    beq    skip_soft_dis
>>> +    SOFT_DISABLE_INTS(r7,r8)
>>> +skip_soft_dis:
>>> +#endif
>
> Why wouldn't we always disable them?  kvmppc_handle_exit() will enable
> interrupts when it's ready.

This only disable soft interrupt for kvmppc_restart_interrupt() that restarts 
interrupts if they were meant for the host:

a. SOFT_DISABLE_INTS() only for BOOKE_INTERRUPT_EXTERNAL | 
BOOKE_INTERRUPT_DECREMENTER | BOOKE_INTERRUPT_DOORBELL

b. bl      kvmppc_handle_exit

c. kvmppc_handle_exit()
{
         int r = RESUME_HOST;
         int s;

         /* update before a new last_exit_type is rewritten */
         kvmppc_update_timing_stats(vcpu);

         /* restart interrupts if they were meant for the host */
         kvmppc_restart_interrupt(vcpu, exit_nr);

         local_irq_enable();	==> Enable again.
....

And shouldn't we handle kvmppc_restart_interrupt() like the original HOST flow?

#define MASKABLE_EXCEPTION(trapnum, intnum, label, hdlr, ack)           \
         START_EXCEPTION(label);                                         \
         NORMAL_EXCEPTION_PROLOG(trapnum, intnum, PROLOG_ADDITION_MASKABLE)\
         EXCEPTION_COMMON(trapnum, PACA_EXGEN, *INTS_DISABLE*)             \
	...

So I think this should be reasonable :)

Tiejun

^ permalink raw reply

* Re: [RFC][KVM][PATCH 1/1] kvm:ppc:booke-64: soft-disable interrupts
From: Scott Wood @ 2013-05-07  2:06 UTC (permalink / raw)
  To: tiejun.chen; +Cc: linuxppc-dev, agraf, kvm-ppc, kvm
In-Reply-To: <51885F49.6060605@windriver.com>

On 05/06/2013 08:56:25 PM, tiejun.chen wrote:
> On 05/07/2013 07:50 AM, Scott Wood wrote:
>> On 05/05/2013 10:13:17 PM, tiejun.chen wrote:
>>> On 05/06/2013 11:10 AM, Tiejun Chen wrote:
>>>> For the external interrupt, the decrementer exception and the =20
>>>> doorbell
>>>> excpetion, we also need to soft-disable interrupts while doing as =20
>>>> host
>>>> interrupt handlers since the DO_KVM hook is always performed to =20
>>>> skip
>>>> EXCEPTION_COMMON then miss this original chance with the 'ints' =20
>>>> (INTS_DISABLE).
>>=20
>> http://patchwork.ozlabs.org/patch/241344/
>> http://patchwork.ozlabs.org/patch/241412/
>>=20
>> :-)
>=20
> I'm observing the same behaviour as well:
>=20
> 	WARN_ON_ONCE(!irqs_disabled());

So, could you explain the benefits of your approach over what's being =20
discussed in those threads?

>> Why wouldn't we always disable them?  kvmppc_handle_exit() will =20
>> enable
>> interrupts when it's ready.
>=20
> This only disable soft interrupt for kvmppc_restart_interrupt() that =20
> restarts interrupts if they were meant for the host:
>=20
> a. SOFT_DISABLE_INTS() only for BOOKE_INTERRUPT_EXTERNAL | =20
> BOOKE_INTERRUPT_DECREMENTER | BOOKE_INTERRUPT_DOORBELL

Those aren't the only exceptions that can end up going to the host.  We =20
could get a TLB miss that results in a heavyweight MMIO exit, etc.

And I'd rather see any fix for this problem stay out of the asm code.

> b. bl      kvmppc_handle_exit
>=20
> c. kvmppc_handle_exit()
> {
>         int r =3D RESUME_HOST;
>         int s;
>=20
>         /* update before a new last_exit_type is rewritten */
>         kvmppc_update_timing_stats(vcpu);
>=20
>         /* restart interrupts if they were meant for the host */
>         kvmppc_restart_interrupt(vcpu, exit_nr);
>=20
>         local_irq_enable();	=3D=3D> Enable again.
> ....
>=20
> And shouldn't we handle kvmppc_restart_interrupt() like the original =20
> HOST flow?
>=20
> #define MASKABLE_EXCEPTION(trapnum, intnum, label, hdlr, =20
> ack)           \
>         =20
> START_EXCEPTION(label);                                         \
>         NORMAL_EXCEPTION_PROLOG(trapnum, intnum, =20
> PROLOG_ADDITION_MASKABLE)\
>         EXCEPTION_COMMON(trapnum, PACA_EXGEN, =20
> *INTS_DISABLE*)             \
> 	...

Could you elaborate on what you mean?

-Scott=

^ permalink raw reply

* [PATCH] powerpc: Fix  MAX_STACK_TRACE_ENTRIES too low warning again
From: Li Zhong @ 2013-05-07  2:32 UTC (permalink / raw)
  To: PowerPC email list; +Cc: Paul Mackerras

Saw this warning again, and this time from the ret_from_fork path. 

It seems we could clear the back chain earlier in copy_thread(), which
could cover both path, and also fix potential lockdep usage in
schedule_tail(), or exception occurred before we clear the back chain.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/entry_32.S |    2 --
 arch/powerpc/kernel/entry_64.S |    2 --
 arch/powerpc/kernel/process.c  |    1 +
 3 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index e514de5..d22e73e 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -439,8 +439,6 @@ ret_from_fork:
 ret_from_kernel_thread:
 	REST_NVGPRS(r1)
 	bl	schedule_tail
-	li	r3,0
-	stw	r3,0(r1)
 	mtlr	r14
 	mr	r3,r15
 	PPC440EP_ERR42
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 3fe5259..48e8a86 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -376,8 +376,6 @@ _GLOBAL(ret_from_fork)
 _GLOBAL(ret_from_kernel_thread)
 	bl	.schedule_tail
 	REST_NVGPRS(r1)
-	li	r3,0
-	std	r3,0(r1)
 	ld	r14, 0(r14)
 	mtlr	r14
 	mr	r3,r15
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index ceb4e7b..2c9fc5e 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -973,6 +973,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	 */
 	sp -= sizeof(struct pt_regs);
 	kregs = (struct pt_regs *) sp;
+	kregs->gpr[1] = 0;
 	sp -= STACK_FRAME_OVERHEAD;
 	p->thread.ksp = sp;
 	p->thread.ksp_limit = (unsigned long)task_stack_page(p) +
-- 
1.7.1

^ permalink raw reply related

* Re: [RFC][KVM][PATCH 1/1] kvm:ppc:booke-64: soft-disable interrupts
From: tiejun.chen @ 2013-05-07  2:43 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev, agraf, kvm-ppc, kvm
In-Reply-To: <1367892390.3398.12@snotra>

On 05/07/2013 10:06 AM, Scott Wood wrote:
> On 05/06/2013 08:56:25 PM, tiejun.chen wrote:
>> On 05/07/2013 07:50 AM, Scott Wood wrote:
>>> On 05/05/2013 10:13:17 PM, tiejun.chen wrote:
>>>> On 05/06/2013 11:10 AM, Tiejun Chen wrote:
>>>>> For the external interrupt, the decrementer exception and the doorbell
>>>>> excpetion, we also need to soft-disable interrupts while doing as host
>>>>> interrupt handlers since the DO_KVM hook is always performed to skip
>>>>> EXCEPTION_COMMON then miss this original chance with the 'ints'
>>>>> (INTS_DISABLE).
>>>
>>> http://patchwork.ozlabs.org/patch/241344/
>>> http://patchwork.ozlabs.org/patch/241412/
>>>
>>> :-)
>>
>> I'm observing the same behaviour as well:
>>
>>     WARN_ON_ONCE(!irqs_disabled());
>
> So, could you explain the benefits of your approach over what's being discussed
> in those threads?

They're a long thread so I think I need to take time to see :)

>
>>> Why wouldn't we always disable them?  kvmppc_handle_exit() will enable
>>> interrupts when it's ready.
>>
>> This only disable soft interrupt for kvmppc_restart_interrupt() that restarts
>> interrupts if they were meant for the host:
>>
>> a. SOFT_DISABLE_INTS() only for BOOKE_INTERRUPT_EXTERNAL |
>> BOOKE_INTERRUPT_DECREMENTER | BOOKE_INTERRUPT_DOORBELL
>
> Those aren't the only exceptions that can end up going to the host.  We could
> get a TLB miss that results in a heavyweight MMIO exit, etc.

This is like host handler, so I'm just disabling soft interrupt during 
kvmppc_restart_interrupt() for Doorbell interrupt/Decrementer Interrupt/External 
Input Interrupt.

I don't see anything should be disabled for any TLB exception in host handler.

>
> And I'd rather see any fix for this problem stay out of the asm code.

We already have an appropriate SOFT_DISABLE_INTS so I think we can take this 
easily :)

>
>> b. bl      kvmppc_handle_exit
>>
>> c. kvmppc_handle_exit()
>> {
>>         int r = RESUME_HOST;
>>         int s;
>>
>>         /* update before a new last_exit_type is rewritten */
>>         kvmppc_update_timing_stats(vcpu);
>>
>>         /* restart interrupts if they were meant for the host */
>>         kvmppc_restart_interrupt(vcpu, exit_nr);
>>
>>         local_irq_enable();    ==> Enable again.
>> ....
>>
>> And shouldn't we handle kvmppc_restart_interrupt() like the original HOST flow?
>>
>> #define MASKABLE_EXCEPTION(trapnum, intnum, label, hdlr, ack)           \
>>         START_EXCEPTION(label);                                         \
>>         NORMAL_EXCEPTION_PROLOG(trapnum, intnum, PROLOG_ADDITION_MASKABLE)\
>>         EXCEPTION_COMMON(trapnum, PACA_EXGEN, *INTS_DISABLE*)             \
>>     ...
>
> Could you elaborate on what you mean?

In host handler, we always use MASKABLE_EXCEPTION() to define-to-handle some 
exceptions: Doorbell interrupt/Decrementer Interrupt/External Input Interrupt:

#define MASKABLE_EXCEPTION(trapnum, intnum, label, hdlr, ack)           \
         START_EXCEPTION(label);                                         \
         NORMAL_EXCEPTION_PROLOG(trapnum, intnum, PROLOG_ADDITION_MASKABLE)\
         EXCEPTION_COMMON(trapnum, PACA_EXGEN, *INTS_DISABLE*)             \

This would call INTS_DISABLE, which is equal to SOFT_DISABLE_INTS(), to disable 
soft interrupt before call all associated handlers: 
do_IRQ()/timer_interrupt()/doorbell_exception().

But DO_KVM hook always skips INTS_DISABLE.

So I think we also need to do INTS_DISABLE for kvmppc_restart_interrupt() since 
actually that restarts interrupts for the host with a similar way as they are 
called by host.

Tiejun

^ permalink raw reply

* Re: [RFC][KVM][PATCH 1/1] kvm:ppc:booke-64: soft-disable interrupts
From: Scott Wood @ 2013-05-07  3:04 UTC (permalink / raw)
  To: tiejun.chen; +Cc: linuxppc-dev, agraf, kvm-ppc, kvm
In-Reply-To: <51886A59.80807@windriver.com>

On 05/06/2013 09:43:37 PM, tiejun.chen wrote:
> On 05/07/2013 10:06 AM, Scott Wood wrote:
>> On 05/06/2013 08:56:25 PM, tiejun.chen wrote:
>>> On 05/07/2013 07:50 AM, Scott Wood wrote:
>>>> On 05/05/2013 10:13:17 PM, tiejun.chen wrote:
>>>>> On 05/06/2013 11:10 AM, Tiejun Chen wrote:
>>>>>> For the external interrupt, the decrementer exception and the =20
>>>>>> doorbell
>>>>>> excpetion, we also need to soft-disable interrupts while doing =20
>>>>>> as host
>>>>>> interrupt handlers since the DO_KVM hook is always performed to =20
>>>>>> skip
>>>>>> EXCEPTION_COMMON then miss this original chance with the 'ints'
>>>>>> (INTS_DISABLE).
>>>>=20
>>>> http://patchwork.ozlabs.org/patch/241344/
>>>> http://patchwork.ozlabs.org/patch/241412/
>>>>=20
>>>> :-)
>>>=20
>>> I'm observing the same behaviour as well:
>>>=20
>>>     WARN_ON_ONCE(!irqs_disabled());
>>=20
>> So, could you explain the benefits of your approach over what's =20
>> being discussed
>> in those threads?
>=20
> They're a long thread so I think I need to take time to see :)
>=20
>>=20
>>>> Why wouldn't we always disable them?  kvmppc_handle_exit() will =20
>>>> enable
>>>> interrupts when it's ready.
>>>=20
>>> This only disable soft interrupt for kvmppc_restart_interrupt() =20
>>> that restarts
>>> interrupts if they were meant for the host:
>>>=20
>>> a. SOFT_DISABLE_INTS() only for BOOKE_INTERRUPT_EXTERNAL |
>>> BOOKE_INTERRUPT_DECREMENTER | BOOKE_INTERRUPT_DOORBELL
>>=20
>> Those aren't the only exceptions that can end up going to the host.  =20
>> We could
>> get a TLB miss that results in a heavyweight MMIO exit, etc.
>=20
> This is like host handler, so I'm just disabling soft interrupt =20
> during kvmppc_restart_interrupt() for Doorbell interrupt/Decrementer =20
> Interrupt/External Input Interrupt.
>=20
> I don't see anything should be disabled for any TLB exception in host =20
> handler.

Every exception type needs consistent lazy EE state once we hit the =20
local_irq_enable() (or any other C code that cares).

Plus, if you're going to add code to make something conditional, you =20
should have a good reason for making it conditional.  Being more like =20
the 64-bit host handler for its own sake isn't good enough, especially =20
if you introduce differences between 32 and 64 bit in the process.

>> And I'd rather see any fix for this problem stay out of the asm code.
>=20
> We already have an appropriate SOFT_DISABLE_INTS so I think we can =20
> take this easily :)

An unconditional call to hard_irq_disable() at the beginning of =20
kvmppc_handle_exit() should suffice.  I already have this in the next =20
version of my patch that I'll be posting shortly.

Note that we need this on 32-bit as well, so that trace_hardirqs_off() =20
gets called.

-Scott=

^ permalink raw reply

* Re: [PATCH] kvm/ppc/booke64: Hard disable interrupts when entering the guest
From: Scott Wood @ 2013-05-07  3:05 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Alexander Graf, kvm-ppc, Mihai Caraman, Paul Mackerras,
	linuxppc-dev
In-Reply-To: <1367884994.29496.28.camel@pasglop>

On 05/06/2013 07:03:14 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2013-05-06 at 18:53 -0500, Scott Wood wrote:
> >
> > > Ie. The last stage of entry will hard enable, so they should be
> > > soft-enabled too... if not, latency trackers will consider the =20
> whole
> > > guest periods as "interrupt disabled"...
> >
> > OK... I guess we already have that problem on 32-bit as well?
>=20
> 32-bit doesn't do lazy disable, so the situation is a lot easier =20
> there.

Right, but it still currently enters the guest with interrupts marked =20
as disabled, so we'd have the same latency tracker issue.

> Another problem is that hard_irq_disable() doesn't call
> trace_hardirqs_off()... We might want to fix that:
>=20
> static inline void hard_irq_disable(void)
> {
> 	__hard_irq_disable();
> 	if (get_paca()->soft_enabled)
> 		trace_hardirqs_off();
> 	get_paca()->soft_enabled =3D 0;
> 	get_paca()->irq_happened |=3D PACA_IRQ_HARD_DIS;
> }

Is it possible there are places that assume the current behavior?

> > We also don't want PACA_IRQ_HARD_DIS to be cleared the way
> > prep_irq_for_idle() does, because that's what lets the
> > local_irq_enable() do the hard-enabling after we exit the guest.
>=20
> Then set it again. Don't leave the kernel in a state where =20
> soft_enabled
> is 1 and irq_happened is non-zero. It might work in the specific KVM
> case we are looking at now because we know we are coming back via KVM
> exit and putting things right again but it's fragile, somebody will =20
> come
> back and break it, etc...

KVM is a pretty special case -- at least on booke, it's required that =20
all exits from guest state go through the KVM exception code.  I think =20
it's less likely that that changes, than something breaks in the code =20
to fix up lazy ee state (especially since we've already seen the latter =20
happen).

I'll give it a shot, though.

> If necessary, create (or improve existing) helpers that do the right
> state adjustement. The cost of a couple of byte stores is negligible,
> I'd rather you make sure everything remains in sync at all times.

My concern was mainly about complexity -- it seemed simpler to just say =20
that the during guest execution, CPU is in a special state that is not =20
visible to anything that cares about lazy EE.  The fact that EE can =20
actually be *off* and we still take the interrupt supports its =20
specialness. :-)

-Scott=

^ permalink raw reply

* Re: [PATCH 2/5] KVM: PPC: iommu: Add missing kvm_iommu_map_pages/kvm_iommu_unmap_pages
From: Alexey Kardashevskiy @ 2013-05-07  3:28 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Alexander Graf, kvm-ppc, linux-kernel, Paul Mackerras,
	linux-pci, linuxppc-dev, David Gibson
In-Reply-To: <1367890979.2868.93.camel@ul30vt.home>

On 05/07/2013 11:42 AM, Alex Williamson wrote:
> On Tue, 2013-05-07 at 10:49 +1000, Alexey Kardashevskiy wrote:
>> On 05/07/2013 07:07 AM, Alex Williamson wrote:
>>> On Mon, 2013-05-06 at 17:21 +1000, aik@ozlabs.ru wrote:
>>>> From: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>
>>>> The IOMMU API implements groups creating/deletion, device binding
>>>> and IOMMU map/unmap operations.
>>>>
>>>> The PowerPC implementation uses most of the API except map/unmap
>>>> operations, which are implemented on POWER using hypercalls.
>>>>
>>>> However, in order to link a kernel with the CONFIG_IOMMU_API enabled,
>>>> the empty kvm_iommu_map_pages/kvm_iommu_unmap_pages have to be
>>>> defined, so this defines them.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>>> Signed-off-by: Paul Mackerras <paulus@samba.org>
>>>> ---
>>>>  arch/powerpc/include/asm/kvm_host.h |   14 ++++++++++++++
>>>>  1 file changed, 14 insertions(+)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>>>> index b6a047e..c025d91 100644
>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>> @@ -603,4 +603,18 @@ struct kvm_vcpu_arch {
>>>>  
>>>>  #define __KVM_HAVE_ARCH_WQP
>>>>  
>>>> +#ifdef CONFIG_IOMMU_API
>>>> +/* POWERPC does not use IOMMU API for mapping/unmapping */
>>>> +static inline int kvm_iommu_map_pages(struct kvm *kvm,
>>>> +		struct kvm_memory_slot *slot)
>>>> +{
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static inline void kvm_iommu_unmap_pages(struct kvm *kvm,
>>>> +		struct kvm_memory_slot *slot)
>>>> +{
>>>> +}
>>>> +#endif /* CONFIG_IOMMU_API */
>>>> +
>>>>  #endif /* __POWERPC_KVM_HOST_H__ */
>>>
>>> This is no longer needed, Gleb applied my patch for 3.10 that make all
>>> of KVM device assignment dependent on a build config option and the top
>>> level kvm_host.h now includes this when that is not set.  Thanks,
>>
>> Cannot find it, could you point me please where it is on github or
>> git.kernel.org? Thanks.
> 
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2a5bab1004729f3302c776e53ee7c895b98bb1ce


Yes, I confirm, this is patch is not need any more. Thanks!



-- 
Alexey

^ permalink raw reply

* [PATCH] powerpc: fix numa distance for form0 device tree
From: Michael Ellerman @ 2013-05-07  3:49 UTC (permalink / raw)
  To: stable; +Cc: linuxppc-dev, Anton Blanchard

From: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>

Commit 7122beeee7bc1757682049780179d7c216dd1c83 upstream.

The following commit breaks numa distance setup for old powerpc
systems that use form0 encoding in device tree.

commit 41eab6f88f24124df89e38067b3766b7bef06ddb
powerpc/numa: Use form 1 affinity to setup node distance

Device tree node /rtas/ibm,associativity-reference-points would
index into /cpus/PowerPCxxxx/ibm,associativity based on form0 or
form1 encoding detected by ibm,architecture-vec-5 property.

All modern systems use form1 and current kernel code is correct.
However, on older systems with form0 encoding, the numa distance
will get hard coded as LOCAL_DISTANCE for all nodes.  This causes
task scheduling anomaly since scheduler will skip building numa
level domain (topmost domain with all cpus) if all numa distances
are same.  (value of 'level' in sched_init_numa() will remain 0)

Prior to the above commit:
((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)

Restoring compatible behavior with this patch for old powerpc systems
with device tree where numa distance are encoded as form0.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
---
 arch/powerpc/mm/numa.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index bba87ca..6a252c4 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -201,7 +201,7 @@ int __node_distance(int a, int b)
 	int distance = LOCAL_DISTANCE;
 
 	if (!form1_affinity)
-		return distance;
+		return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
 
 	for (i = 0; i < distance_ref_points_depth; i++) {
 		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH] kvm/ppc/booke64: Hard disable interrupts when entering the guest
From: Benjamin Herrenschmidt @ 2013-05-07  3:53 UTC (permalink / raw)
  To: Scott Wood
  Cc: kvm, Alexander Graf, kvm-ppc, Mihai Caraman, Paul Mackerras,
	linuxppc-dev
In-Reply-To: <1367895926.3398.14@snotra>

On Mon, 2013-05-06 at 22:05 -0500, Scott Wood wrote:
> On 05/06/2013 07:03:14 PM, Benjamin Herrenschmidt wrote:
> > On Mon, 2013-05-06 at 18:53 -0500, Scott Wood wrote:
> > >
> > > > Ie. The last stage of entry will hard enable, so they should be
> > > > soft-enabled too... if not, latency trackers will consider the  
> > whole
> > > > guest periods as "interrupt disabled"...
> > >
> > > OK... I guess we already have that problem on 32-bit as well?
> > 
> > 32-bit doesn't do lazy disable, so the situation is a lot easier  
> > there.
> 
> Right, but it still currently enters the guest with interrupts marked  
> as disabled, so we'd have the same latency tracker issue.
> 
> > Another problem is that hard_irq_disable() doesn't call
> > trace_hardirqs_off()... We might want to fix that:
> > 
> > static inline void hard_irq_disable(void)
> > {
> > 	__hard_irq_disable();
> > 	if (get_paca()->soft_enabled)
> > 		trace_hardirqs_off();
> > 	get_paca()->soft_enabled = 0;
> > 	get_paca()->irq_happened |= PACA_IRQ_HARD_DIS;
> > }
> 
> Is it possible there are places that assume the current behavior?

There aren't many callers, I think this should be safe. Most
callers call it with interrupts already soft disabled, so that
should be a nop in these cases (idle for example).

But I can give it a quick spin today on a machine or two.

> > > We also don't want PACA_IRQ_HARD_DIS to be cleared the way
> > > prep_irq_for_idle() does, because that's what lets the
> > > local_irq_enable() do the hard-enabling after we exit the guest.
> > 
> > Then set it again. Don't leave the kernel in a state where  
> > soft_enabled
> > is 1 and irq_happened is non-zero. It might work in the specific KVM
> > case we are looking at now because we know we are coming back via KVM
> > exit and putting things right again but it's fragile, somebody will  
> > come
> > back and break it, etc...
> 
> KVM is a pretty special case -- at least on booke, it's required that  
> all exits from guest state go through the KVM exception code.  I think  
> it's less likely that that changes, than something breaks in the code  
> to fix up lazy ee state (especially since we've already seen the latter  
> happen).
> 
> I'll give it a shot, though.
> 
> > If necessary, create (or improve existing) helpers that do the right
> > state adjustement. The cost of a couple of byte stores is negligible,
> > I'd rather you make sure everything remains in sync at all times.
> 
> My concern was mainly about complexity -- it seemed simpler to just say  
> that the during guest execution, CPU is in a special state that is not  
> visible to anything that cares about lazy EE.  The fact that EE can  
> actually be *off* and we still take the interrupt supports its  
> specialness. :-)

Yeah ... sort of :-)

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH] powerpc: fix numa distance for form0 device tree
From: Greg KH @ 2013-05-07  4:06 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev, Anton Blanchard, stable
In-Reply-To: <1367898574-20594-1-git-send-email-michael@ellerman.id.au>

On Tue, May 07, 2013 at 01:49:34PM +1000, Michael Ellerman wrote:
> From: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> 
> Commit 7122beeee7bc1757682049780179d7c216dd1c83 upstream.
> 
> The following commit breaks numa distance setup for old powerpc
> systems that use form0 encoding in device tree.
> 
> commit 41eab6f88f24124df89e38067b3766b7bef06ddb
> powerpc/numa: Use form 1 affinity to setup node distance
> 
> Device tree node /rtas/ibm,associativity-reference-points would
> index into /cpus/PowerPCxxxx/ibm,associativity based on form0 or
> form1 encoding detected by ibm,architecture-vec-5 property.
> 
> All modern systems use form1 and current kernel code is correct.
> However, on older systems with form0 encoding, the numa distance
> will get hard coded as LOCAL_DISTANCE for all nodes.  This causes
> task scheduling anomaly since scheduler will skip building numa
> level domain (topmost domain with all cpus) if all numa distances
> are same.  (value of 'level' in sched_init_numa() will remain 0)
> 
> Prior to the above commit:
> ((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)
> 
> Restoring compatible behavior with this patch for old powerpc systems
> with device tree where numa distance are encoded as form0.
> 
> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
> ---
>  arch/powerpc/mm/numa.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

What stable tree should this be applied to?

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH] powerpc: fix numa distance for form0 device tree
From: Michael Ellerman @ 2013-05-07  4:23 UTC (permalink / raw)
  To: Greg KH; +Cc: linuxppc-dev, Anton Blanchard, stable
In-Reply-To: <20130507040614.GA19525@kroah.com>

On Mon, May 06, 2013 at 09:06:15PM -0700, Greg KH wrote:
> On Tue, May 07, 2013 at 01:49:34PM +1000, Michael Ellerman wrote:
> > From: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> > 
> > Commit 7122beeee7bc1757682049780179d7c216dd1c83 upstream.
> > 
> > The following commit breaks numa distance setup for old powerpc
> > systems that use form0 encoding in device tree.
> > 
> > commit 41eab6f88f24124df89e38067b3766b7bef06ddb
> > powerpc/numa: Use form 1 affinity to setup node distance
> > 
> > Device tree node /rtas/ibm,associativity-reference-points would
> > index into /cpus/PowerPCxxxx/ibm,associativity based on form0 or
> > form1 encoding detected by ibm,architecture-vec-5 property.
> > 
> > All modern systems use form1 and current kernel code is correct.
> > However, on older systems with form0 encoding, the numa distance
> > will get hard coded as LOCAL_DISTANCE for all nodes.  This causes
> > task scheduling anomaly since scheduler will skip building numa
> > level domain (topmost domain with all cpus) if all numa distances
> > are same.  (value of 'level' in sched_init_numa() will remain 0)
> > 
> > Prior to the above commit:
> > ((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)
> > 
> > Restoring compatible behavior with this patch for old powerpc systems
> > with device tree where numa distance are encoded as form0.
> > 
> > Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> > Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
> > ---
> >  arch/powerpc/mm/numa.c |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> What stable tree should this be applied to?

All of them please.

cheers

^ permalink raw reply

* [PATCH] powerpc: Make CONFIG_RTAS_PROC depend on CONFIG_PROC_FS
From: Michael Ellerman @ 2013-05-07  4:43 UTC (permalink / raw)
  To: linuxppc-dev

We are getting build errors with CONFIG_PROC_FS=n:

arch/powerpc/kernel/rtas_flash.c
   In function 'rtas_flash_init':
  745:33: error: unused variable 'f' [-Werror=unused-variable]

But rtas_flash.c should not be built when CONFIG_PROC_FS=n, beacause all
it does is provide a /proc interface to the RTAS flash routines.

CONFIG_RTAS_FLASH already depends on CONFIG_RTAS_PROC, to indicate that
it depends on the RTAS proc support, but CONFIG_RTAS_PROC does not
depend on CONFIG_PROC_FS. So fix that.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
---
 arch/powerpc/platforms/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index 34d224b..632444f 100644
--- a/arch/powerpc/platforms/Kconfig
+++ b/arch/powerpc/platforms/Kconfig
@@ -128,7 +128,7 @@ config PPC_RTAS_DAEMON
 
 config RTAS_PROC
 	bool "Proc interface to RTAS"
-	depends on PPC_RTAS
+	depends on PPC_RTAS && PROC_FS
 	default y
 
 config RTAS_FLASH
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH 2/6] KVM: PPC: Add support for multiple-TCE hcalls
From: David Gibson @ 2013-05-07  5:05 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Joerg Roedel, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, linuxppc-dev
In-Reply-To: <1367825157-27231-3-git-send-email-aik@ozlabs.ru>

[-- Attachment #1: Type: text/plain, Size: 1350 bytes --]

On Mon, May 06, 2013 at 05:25:53PM +1000, Alexey Kardashevskiy wrote:
> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
> H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
> devices or emulated PCI.  These calls allow adding multiple entries
> (up to 512) into the TCE table in one call which saves time on
> transition to/from real mode.
> 
> This adds a guest physical to host real address converter
> and calls the existing H_PUT_TCE handler. The converting function
> is going to be fully utilized by upcoming VFIO supporting patches.
> 
> This also implements the KVM_CAP_PPC_MULTITCE capability,
> so in order to support the functionality of this patch, QEMU
> needs to query for this capability and set the "hcall-multi-tce"
> hypertas property only if the capability is present, otherwise
> there will be serious performance degradation.
> 
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: Paul Mackerras <paulus@samba.org>

Fwiw, it would be nice to get this patch merged, regardless of the
rest of the VFIO/powerpc patches.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH 2/3] powerpc/powernv: Disable IO space for PCI buses
From: Gavin Shan @ 2013-05-07  5:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: bhelgaas, linux-pci, linuxppc-dev, yinghai, Gavin Shan
In-Reply-To: <1367875982.15842.58.camel@pasglop>

On Tue, May 07, 2013 at 07:33:02AM +1000, Benjamin Herrenschmidt wrote:
>On Mon, 2013-05-06 at 21:44 +0800, Gavin Shan wrote:
>> The patch intends to set the special flag (PCI_BUS_FLAGS_NO_IO) for
>> root buses so PCI core will skip assignment for IO stuff. Besides,
>> we also clear the IO resources on all PCI devices for PHB3.
>
>Why the new hook ? Can't this be detected simply because there is
>no aperture in the pci_host_bridge with IORESOURCE_IO set in the flags ?
>

Yeah, Ben. I've changed it accordingly in v2 and the problem is that we
need the proposed patches from Yinghai or mine :-)

Thanks,
Gavin 

^ permalink raw reply

* Re: [PATCH 3/3] powerpc/powernv: Don't configure IO window on PHB3
From: Gavin Shan @ 2013-05-07  5:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: yinghai, linux-pci, bhelgaas, linuxppc-dev, Gavin Shan
In-Reply-To: <1367876070.15842.60.camel@pasglop>

On Tue, May 07, 2013 at 07:34:30AM +1000, Benjamin Herrenschmidt wrote:
>On Mon, 2013-05-06 at 21:44 +0800, Gavin Shan wrote:
>> We needn't configure IO windows for the corresponding PEs on PHB3
>> since that doesn't support IO.
>
>Here too, no need for such a flag, just check that
>pci_controller->io_resource.flags is 0.
>
>BTW. Please work on top of the patch I sent already that avoids adding
>bogus resources to pci_host_bridge when their flags are 0. I'll send
>it to Linus today.
>

I've changed it accordingly on top of your patches in v2 :-)

Thanks,
Gavin

^ permalink raw reply

* Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
From: David Gibson @ 2013-05-07  5:29 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Joerg Roedel, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, linuxppc-dev
In-Reply-To: <1367825157-27231-6-git-send-email-aik@ozlabs.ru>

[-- Attachment #1: Type: text/plain, Size: 21989 bytes --]

On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests without passing them to QEMU, which should
> save time on switching to QEMU and back.
> 
> Both real and virtual modes are supported - whenever the kernel
> fails to handle TCE request, it passes it to the virtual mode.
> If it the virtual mode handlers fail, then the request is passed
> to the user mode, for example, to QEMU.
> 
> This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
> a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
> in-kernel handling of IOMMU map/unmap.
> 
> This adds a special case for huge pages (16MB).  The reference
> counting cannot be easily done for such pages in real mode (when
> MMU is off) so we added a list of huge pages.  It is populated in
> virtual mode and get_page is called just once per a huge page.
> Real mode handlers check if the requested page is huge and in the list,
> then no reference counting is done, otherwise an exit to virtual mode
> happens.  The list is released at KVM exit.  At the moment the fastest
> card available for tests uses up to 9 huge pages so walking through this
> list is not very expensive.  However this can change and we may want
> to optimize this.
> 
> This also adds the virt_only parameter to the KVM module
> for debug and performance check purposes.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> 
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
>  Documentation/virtual/kvm/api.txt   |   28 ++++
>  arch/powerpc/include/asm/kvm_host.h |    2 +
>  arch/powerpc/include/asm/kvm_ppc.h  |    2 +
>  arch/powerpc/include/uapi/asm/kvm.h |    7 +
>  arch/powerpc/kvm/book3s_64_vio.c    |  242 ++++++++++++++++++++++++++++++++++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++++++++++++++++++++++++++
>  arch/powerpc/kvm/powerpc.c          |   12 ++
>  include/uapi/linux/kvm.h            |    2 +
>  8 files changed, 485 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index f621cd6..2039767 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously
>  valid entries found.
>  
>  
> +4.79 KVM_CREATE_SPAPR_TCE_IOMMU
> +
> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
> +Architectures: powerpc
> +Type: vm ioctl
> +Parameters: struct kvm_create_spapr_tce_iommu (in)
> +Returns: 0 on success, -1 on error
> +
> +This creates a link between IOMMU group and a hardware TCE (translation
> +control entry) table. This link lets the host kernel know what IOMMU
> +group (i.e. TCE table) to use for the LIOBN number passed with
> +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
> +
> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
> +struct kvm_create_spapr_tce_iommu {
> +	__u64 liobn;
> +	__u32 iommu_id;

Wouldn't it be more in keeping 

> +	__u32 flags;
> +};
> +
> +No flag is supported at the moment.
> +
> +When the guest issues TCE call on a liobn for which a TCE table has been
> +registered, the kernel will handle it in real mode, updating the hardware
> +TCE table. TCE table calls for other liobns will cause a vm exit and must
> +be handled by userspace.
> +
> +
>  5. The kvm_run structure
>  ------------------------
>  
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 36ceb0d..2b70cbc 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
>  	struct kvm *kvm;
>  	u64 liobn;
>  	u32 window_size;
> +	bool virtmode_only;

I see this is now initialized from the global parameter, but I think
it would be better to just check the global (debug) parameter
directly, rather than duplicating it here.

> +	struct iommu_group *grp;    /* used for IOMMU groups */
>  	struct page *pages[0];
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index d501246..bdfa140 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce *args);
> +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> +				struct kvm_create_spapr_tce_iommu *args);
>  extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>  		struct kvm_vcpu *vcpu, unsigned long liobn);
>  extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> index 681b314..b67d44b 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -291,6 +291,13 @@ struct kvm_create_spapr_tce {
>  	__u32 window_size;
>  };
>  
> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
> +struct kvm_create_spapr_tce_iommu {
> +	__u64 liobn;
> +	__u32 iommu_id;
> +	__u32 flags;
> +};
> +
>  /* for KVM_ALLOCATE_RMA */
>  struct kvm_allocate_rma {
>  	__u64 rma_size;
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 643ac1e..98cf949 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -27,6 +27,9 @@
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/pci.h>
> +#include <linux/iommu.h>
> +#include <linux/module.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -38,10 +41,19 @@
>  #include <asm/kvm_host.h>
>  #include <asm/udbg.h>
>  #include <asm/iommu.h>
> +#include <asm/tce.h>
> +
> +#define DRIVER_VERSION	"0.1"
> +#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
> +#define DRIVER_DESC	"POWERPC KVM driver"

Really?

>  
>  #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
>  #define ERROR_ADDR      (~(unsigned long)0x0)
>  
> +static bool kvmppc_tce_virt_only = false;
> +module_param_named(virt_only, kvmppc_tce_virt_only, bool, S_IRUGO | S_IWUSR);
> +MODULE_PARM_DESC(virt_only, "Disable realmode handling of IOMMU map/unmap");
> +
>  /*
>   * TCE tables handlers.
>   */
> @@ -58,8 +70,13 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
>  
>  	mutex_lock(&kvm->lock);
>  	list_del(&stt->list);
> -	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
> -		__free_page(stt->pages[i]);
> +#ifdef CONFIG_IOMMU_API
> +	if (stt->grp) {
> +		iommu_group_put(stt->grp);
> +	} else
> +#endif
> +		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
> +			__free_page(stt->pages[i]);
>  	kfree(stt);
>  	mutex_unlock(&kvm->lock);
>  
> @@ -155,9 +172,127 @@ fail:
>  	return ret;
>  }
>  
> +#ifdef CONFIG_IOMMU_API
> +static const struct file_operations kvm_spapr_tce_iommu_fops = {
> +	.release	= kvm_spapr_tce_release,
> +};
> +
> +long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> +		struct kvm_create_spapr_tce_iommu *args)
> +{
> +	struct kvmppc_spapr_tce_table *tt = NULL;
> +	struct iommu_group *grp;
> +	struct iommu_table *tbl;
> +
> +	/* Find an IOMMU table for the given ID */
> +	grp = iommu_group_get_by_id(args->iommu_id);
> +	if (!grp)
> +		return -ENXIO;
> +
> +	tbl = iommu_group_get_iommudata(grp);
> +	if (!tbl)
> +		return -ENXIO;
> +
> +	/* Check this LIOBN hasn't been previously allocated */
> +	list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) {
> +		if (tt->liobn == args->liobn)
> +			return -EBUSY;
> +	}
> +
> +	tt = kzalloc(sizeof(*tt), GFP_KERNEL);
> +	if (!tt)
> +		return -ENOMEM;
> +
> +	tt->liobn = args->liobn;
> +	tt->kvm = kvm;
> +	tt->virtmode_only = kvmppc_tce_virt_only;
> +	tt->grp = grp;
> +
> +	kvm_get_kvm(kvm);
> +
> +	mutex_lock(&kvm->lock);
> +	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
> +
> +	mutex_unlock(&kvm->lock);
> +
> +	pr_debug("LIOBN=%llX hooked to IOMMU %d, flags=%u\n",
> +			args->liobn, args->iommu_id, args->flags);
> +
> +	return anon_inode_getfd("kvm-spapr-tce-iommu",
> +			&kvm_spapr_tce_iommu_fops, tt, O_RDWR);
> +}
> +#else
> +long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> +		struct kvm_create_spapr_tce_iommu *args)
> +{
> +	return -ENOSYS;
> +}
> +#endif /* CONFIG_IOMMU_API */
> +
> +#ifdef CONFIG_IOMMU_API
>  /*
>   * Virtual mode handling of IOMMU map/unmap.
>   */
> +static int clear_tce_virt_mode(struct iommu_table *tbl,
> +		unsigned long ioba, unsigned long tce_value,
> +		unsigned long npages)
> +{
> +	int ret;
> +	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
> +
> +	ret = iommu_tce_clear_param_check(tbl, ioba, tce_value, npages);
> +	if (ret)
> +		return ret;
> +
> +	ret = iommu_clear_tces_and_put_pages(tbl, entry, npages);
> +	if (ret < 0)
> +		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
> +				__func__, ioba, tce_value, ret);
> +
> +	return ret;
> +}
> +
> +static int put_tce_virt_mode(struct kvmppc_spapr_tce_table *tt,
> +		struct iommu_table *tbl,
> +		unsigned long ioba, unsigned long tce,
> +		pte_t pte, unsigned long pg_size)
> +{
> +	int ret;
> +	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
> +
> +	ret = iommu_tce_put_param_check(tbl, ioba, tce);
> +	if (ret)
> +		return ret;
> +
> +	/* System page size case, easy to handle */
> +	if (pg_size == PAGE_SIZE)
> +		return iommu_put_tce_user_mode(tbl, entry, tce);
> +
> +	return -EAGAIN;
> +}
> +
> +static pte_t va_to_linux_pte(struct kvm_vcpu *vcpu,
> +		unsigned long hva, bool writing, unsigned long *pg_sizep)
> +{
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +	/* Find out the page pte and size if requested */
> +	pte_t pte;
> +	unsigned long pg_size = 0;
> +
> +	pte = lookup_linux_pte(vcpu->arch.pgdir, hva,
> +			writing, &pg_size);
> +	if (!pte_present(pte))
> +		return 0;
> +
> +	*pg_sizep = pg_size;
> +
> +	return pte;
> +#else
> +	return 0;
> +#endif
> +}
> +#endif /* CONFIG_IOMMU_API */
> +
>  /* Converts guest physical address into host virtual */
>  static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
>  		unsigned long gpa)
> @@ -188,6 +323,43 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
>  	if (!tt)
>  		return H_TOO_HARD;
>  
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
> +			unsigned long hpa, pg_size = 0;
> +			pte_t pte;
> +
> +			hpa = get_virt_address(vcpu, tce);
> +			if (hpa == ERROR_ADDR)
> +				return -EFAULT;
> +
> +			pte = va_to_linux_pte(vcpu, hpa, tce & TCE_PCI_WRITE,
> +					&pg_size);
> +			if (!pte)
> +				return -EFAULT;
> +
> +			ret = put_tce_virt_mode(tt, tbl, ioba, hpa,
> +					pte, pg_size);
> +		} else {
> +			ret = clear_tce_virt_mode(tbl, ioba, 0, 1);
> +		}
> +		iommu_flush_tce(tbl);
> +
> +		WARN_ON(ret == -EAGAIN);
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
>  }
> @@ -213,6 +385,52 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	if (tces == ERROR_ADDR)
>  		return H_TOO_HARD;
>  
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret = 0;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		for (i = 0; i < npages; ++i) {
> +			unsigned long hpa, pg_size = 0;
> +			pte_t pte = 0;
> +			unsigned long tce;
> +			unsigned long ptce = tces + i * sizeof(unsigned long);
> +
> +			if (get_user(tce, (unsigned long __user *)ptce))
> +				break;
> +
> +			hpa = get_virt_address(vcpu, tce);
> +			if (hpa == ERROR_ADDR)
> +				return -EFAULT;
> +
> +			pte = va_to_linux_pte(vcpu, hpa,
> +					tce & TCE_PCI_WRITE, &pg_size);
> +			if (!pte)
> +				return -EFAULT;
> +
> +			ret = put_tce_virt_mode(tt, tbl,
> +					ioba + (i << IOMMU_PAGE_SHIFT),
> +					hpa, pte, pg_size);
> +			if (ret)
> +				break;
> +		}
> +		if (ret)
> +			clear_tce_virt_mode(tbl, ioba, 0, i);
> +
> +		iommu_flush_tce(tbl);
> +
> +		WARN_ON(ret == -EAGAIN);
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>  		return H_PARAMETER;
> @@ -253,6 +471,26 @@ long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (!tt)
>  		return H_TOO_HARD;
>  
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		ret = clear_tce_virt_mode(tbl, ioba,
> +				tce_value, npages);
> +
> +		WARN_ON(ret == -EAGAIN);
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>  		return H_PARAMETER;
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 55fdf7a..c5e5905 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -26,6 +26,7 @@
>  #include <linux/slab.h>
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
> +#include <linux/iommu.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -161,6 +162,85 @@ static unsigned long get_real_address(struct kvm_vcpu *vcpu,
>  	return hwaddr;
>  }
>  
> +#ifdef CONFIG_IOMMU_API
> +static int clear_tce_real_mode(struct iommu_table *tbl,
> +		unsigned long ioba,
> +		unsigned long tce_value, unsigned long npages)
> +{
> +	int ret;
> +	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
> +
> +	ret = iommu_tce_clear_param_check(tbl, ioba, tce_value, npages);
> +	if (ret)
> +		return ret;
> +
> +	for ( ; npages; --npages, ++entry) {
> +		struct page *page;
> +		unsigned long oldtce;
> +
> +		oldtce = iommu_clear_tce(tbl, entry);
> +		if (!oldtce)
> +			continue;
> +
> +		page = realmode_pfn_to_page(oldtce >> PAGE_SHIFT);
> +		if (!page) {
> +			ret = -EAGAIN;
> +			break;
> +		}
> +
> +		if (oldtce & TCE_PCI_WRITE)
> +			SetPageDirty(page);
> +
> +		ret = realmode_put_page(page);
> +		if (ret)
> +			break;
> +	}
> +	/* if (ret < 0)
> +		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
> +				__func__, ioba, tce_value, ret); */
> +
> +	return ret;
> +}
> +
> +static int put_tce_real_mode(struct kvmppc_spapr_tce_table *tt,
> +		struct iommu_table *tbl,
> +		unsigned long ioba, unsigned long tce,
> +		pte_t pte, unsigned long pg_size)
> +{
> +	int ret;
> +	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
> +	struct page *page = NULL;
> +	enum dma_data_direction direction = iommu_tce_direction(tce);
> +
> +	ret = iommu_tce_put_param_check(tbl, ioba, tce);
> +	if (ret)
> +		return ret;
> +
> +	if (pg_size != PAGE_SIZE)
> +		return -EAGAIN;
> +
> +	/* Small page case, find page struct to increment a counter */
> +	page = realmode_pfn_to_page(tce >> PAGE_SHIFT);
> +	if (!page)
> +		return -EAGAIN;
> +
> +	ret = realmode_get_page(page);
> +	if (ret)
> +		return ret;
> +
> +	/* tce_build accepts virtual addresses */
> +	ret = iommu_tce_build(tbl, entry, (unsigned long) __va(tce), direction);
> +	if (ret)
> +		realmode_put_page(page);
> +
> +	/* if (ret < 0)
> +		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
> +				__func__, ioba, tce, ret); */
> +
> +	return ret;
> +}
> +#endif /* CONFIG_IOMMU_API */
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
> @@ -171,6 +251,44 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (!tt)
>  		return H_TOO_HARD;
>  
> +	if (tt->virtmode_only)
> +		return H_TOO_HARD;
> +
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
> +			unsigned long hpa, pg_size = 0;
> +			pte_t pte = 0;
> +
> +			hpa = get_real_address(vcpu, tce, tce & TCE_PCI_WRITE,
> +					&pte, &pg_size);
> +			if (hpa == ERROR_ADDR)
> +				return H_TOO_HARD;
> +
> +			ret = put_tce_real_mode(tt, tbl, ioba,
> +					hpa, pte, pg_size);
> +		} else {
> +			ret = clear_tce_real_mode(tbl, ioba, 0, 1);
> +		}
> +		iommu_flush_tce(tbl);
> +
> +		if (ret == -EAGAIN)
> +			return H_TOO_HARD;
> +
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	return kvmppc_emulated_h_put_tce(tt, ioba, tce);
>  }
> @@ -192,10 +310,58 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	if (!tt)
>  		return H_TOO_HARD;
>  
> +	if (tt->virtmode_only)
> +		return H_TOO_HARD;
> +
>  	tces = get_real_address(vcpu, tce_list, false, NULL, NULL);
>  	if (tces == ERROR_ADDR)
>  		return H_TOO_HARD;
>  
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret = 0;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		for (i = 0; i < npages; ++i) {
> +			unsigned long hpa, pg_size = 0;
> +			pte_t pte = 0;
> +			unsigned long tce;
> +			unsigned long ptce = tces + i * sizeof(unsigned long);
> +
> +			if (get_user(tce, (unsigned long __user *)ptce))
> +				break;
> +
> +			hpa = get_real_address(vcpu, tce,
> +					tce & TCE_PCI_WRITE,
> +					&pte, &pg_size);
> +			if (hpa == ERROR_ADDR)
> +				ret = -EAGAIN;
> +			else
> +				ret = put_tce_real_mode(tt, tbl,
> +						ioba + (i << IOMMU_PAGE_SHIFT),
> +						hpa, pte, pg_size);
> +			if (ret)
> +				break;
> +		}
> +		if (ret)
> +			clear_tce_real_mode(tbl, ioba, 0, i);
> +
> +		iommu_flush_tce(tbl);
> +
> +		if (ret == -EAGAIN)
> +			return H_TOO_HARD;
> +
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>  		return H_PARAMETER;
> @@ -236,6 +402,32 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (!tt)
>  		return H_TOO_HARD;
>  
> +	if (tt->virtmode_only)
> +		return H_TOO_HARD;
> +
> +#ifdef CONFIG_IOMMU_API
> +	if (tt->grp) {
> +		long ret;
> +		struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
> +
> +		/* Return error if the group is being destroyed */
> +		if (!tbl)
> +			return H_RESCINDED;
> +
> +		ret = clear_tce_real_mode(tbl, ioba,
> +				tce_value, npages);
> +		iommu_flush_tce(tbl);
> +
> +		if (ret == -EAGAIN)
> +			return H_TOO_HARD;
> +
> +		if (ret < 0)
> +			return H_PARAMETER;
> +
> +		return H_SUCCESS;
> +	}
> +#endif
> +
>  	/* Emulated IO */
>  	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
>  		return H_PARAMETER;
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index b7ad589..269b0f6 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -385,6 +385,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>  		break;
>  #endif
>  	case KVM_CAP_SPAPR_MULTITCE:
> +	case KVM_CAP_SPAPR_TCE_IOMMU:
>  		r = 1;
>  		break;
>  	default:
> @@ -935,6 +936,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
>  		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
>  		goto out;
>  	}
> +	case KVM_CREATE_SPAPR_TCE_IOMMU: {
> +		struct kvm_create_spapr_tce_iommu create_tce_iommu;
> +		struct kvm *kvm = filp->private_data;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&create_tce_iommu, argp,
> +				sizeof(create_tce_iommu)))
> +			goto out;
> +		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
> +		goto out;
> +	}
>  #endif /* CONFIG_PPC_BOOK3S_64 */
>  
>  #ifdef CONFIG_KVM_BOOK3S_64_HV
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6c04da1..161e1d3 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -641,6 +641,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_PPC_RTAS (0x100000 + 87)
>  #define KVM_CAP_SPAPR_XICS (0x100000 + 88)
>  #define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
> +#define KVM_CAP_SPAPR_TCE_IOMMU (0x110000 + 90)
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -885,6 +886,7 @@ struct kvm_s390_ucas_mapping {
>  #define KVM_PPC_GET_HTAB_FD	  _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
>  /* Available with KVM_CAP_PPC_RTAS */
>  #define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO,  0xdc, struct kvm_rtas_token_args)
> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct kvm_create_spapr_tce_iommu)
>  
>  /*
>   * ioctls for vcpu fds

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
From: Alexey Kardashevskiy @ 2013-05-07  5:51 UTC (permalink / raw)
  To: David Gibson
  Cc: kvm, Joerg Roedel, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, linuxppc-dev
In-Reply-To: <20130507052933.GP13692@truffula.fritz.box>

On 05/07/2013 03:29 PM, David Gibson wrote:
> On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests without passing them to QEMU, which should
>> save time on switching to QEMU and back.
>>
>> Both real and virtual modes are supported - whenever the kernel
>> fails to handle TCE request, it passes it to the virtual mode.
>> If it the virtual mode handlers fail, then the request is passed
>> to the user mode, for example, to QEMU.
>>
>> This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
>> a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
>> in-kernel handling of IOMMU map/unmap.
>>
>> This adds a special case for huge pages (16MB).  The reference
>> counting cannot be easily done for such pages in real mode (when
>> MMU is off) so we added a list of huge pages.  It is populated in
>> virtual mode and get_page is called just once per a huge page.
>> Real mode handlers check if the requested page is huge and in the list,
>> then no reference counting is done, otherwise an exit to virtual mode
>> happens.  The list is released at KVM exit.  At the moment the fastest
>> card available for tests uses up to 9 huge pages so walking through this
>> list is not very expensive.  However this can change and we may want
>> to optimize this.
>>
>> This also adds the virt_only parameter to the KVM module
>> for debug and performance check purposes.
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>
>> Cc: David Gibson <david@gibson.dropbear.id.au>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> Signed-off-by: Paul Mackerras <paulus@samba.org>
>> ---
>>  Documentation/virtual/kvm/api.txt   |   28 ++++
>>  arch/powerpc/include/asm/kvm_host.h |    2 +
>>  arch/powerpc/include/asm/kvm_ppc.h  |    2 +
>>  arch/powerpc/include/uapi/asm/kvm.h |    7 +
>>  arch/powerpc/kvm/book3s_64_vio.c    |  242 ++++++++++++++++++++++++++++++++++-
>>  arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++++++++++++++++++++++++++
>>  arch/powerpc/kvm/powerpc.c          |   12 ++
>>  include/uapi/linux/kvm.h            |    2 +
>>  8 files changed, 485 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>> index f621cd6..2039767 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously
>>  valid entries found.
>>  
>>  
>> +4.79 KVM_CREATE_SPAPR_TCE_IOMMU
>> +
>> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
>> +Architectures: powerpc
>> +Type: vm ioctl
>> +Parameters: struct kvm_create_spapr_tce_iommu (in)
>> +Returns: 0 on success, -1 on error
>> +
>> +This creates a link between IOMMU group and a hardware TCE (translation
>> +control entry) table. This link lets the host kernel know what IOMMU
>> +group (i.e. TCE table) to use for the LIOBN number passed with
>> +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
>> +
>> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
>> +struct kvm_create_spapr_tce_iommu {
>> +	__u64 liobn;
>> +	__u32 iommu_id;
> 
> Wouldn't it be more in keeping 


pardon?



>> +	__u32 flags;
>> +};
>> +
>> +No flag is supported at the moment.
>> +
>> +When the guest issues TCE call on a liobn for which a TCE table has been
>> +registered, the kernel will handle it in real mode, updating the hardware
>> +TCE table. TCE table calls for other liobns will cause a vm exit and must
>> +be handled by userspace.
>> +
>> +
>>  5. The kvm_run structure
>>  ------------------------
>>  
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index 36ceb0d..2b70cbc 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
>>  	struct kvm *kvm;
>>  	u64 liobn;
>>  	u32 window_size;
>> +	bool virtmode_only;
> 
> I see this is now initialized from the global parameter, but I think
> it would be better to just check the global (debug) parameter
> directly, rather than duplicating it here.


The global parameter is in kvm.ko and the struct above is in the real mode
part which cannot go to the module.



>> +	struct iommu_group *grp;    /* used for IOMMU groups */
>>  	struct page *pages[0];
>>  };
>>  
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>> index d501246..bdfa140 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
>>  
>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  				struct kvm_create_spapr_tce *args);
>> +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
>> +				struct kvm_create_spapr_tce_iommu *args);
>>  extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>>  		struct kvm_vcpu *vcpu, unsigned long liobn);
>>  extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>> index 681b314..b67d44b 100644
>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>> @@ -291,6 +291,13 @@ struct kvm_create_spapr_tce {
>>  	__u32 window_size;
>>  };
>>  
>> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
>> +struct kvm_create_spapr_tce_iommu {
>> +	__u64 liobn;
>> +	__u32 iommu_id;
>> +	__u32 flags;
>> +};
>> +
>>  /* for KVM_ALLOCATE_RMA */
>>  struct kvm_allocate_rma {
>>  	__u64 rma_size;
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>> index 643ac1e..98cf949 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -27,6 +27,9 @@
>>  #include <linux/hugetlb.h>
>>  #include <linux/list.h>
>>  #include <linux/anon_inodes.h>
>> +#include <linux/pci.h>
>> +#include <linux/iommu.h>
>> +#include <linux/module.h>
>>  
>>  #include <asm/tlbflush.h>
>>  #include <asm/kvm_ppc.h>
>> @@ -38,10 +41,19 @@
>>  #include <asm/kvm_host.h>
>>  #include <asm/udbg.h>
>>  #include <asm/iommu.h>
>> +#include <asm/tce.h>
>> +
>> +#define DRIVER_VERSION	"0.1"
>> +#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
>> +#define DRIVER_DESC	"POWERPC KVM driver"
> 
> Really?


What is wrong here?



-- 
Alexey

^ permalink raw reply

* Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
From: David Gibson @ 2013-05-07  6:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Joerg Roedel, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, linuxppc-dev
In-Reply-To: <51889663.5060909@ozlabs.ru>

[-- Attachment #1: Type: text/plain, Size: 7585 bytes --]

On Tue, May 07, 2013 at 03:51:31PM +1000, Alexey Kardashevskiy wrote:
> On 05/07/2013 03:29 PM, David Gibson wrote:
> > On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests without passing them to QEMU, which should
> >> save time on switching to QEMU and back.
> >>
> >> Both real and virtual modes are supported - whenever the kernel
> >> fails to handle TCE request, it passes it to the virtual mode.
> >> If it the virtual mode handlers fail, then the request is passed
> >> to the user mode, for example, to QEMU.
> >>
> >> This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
> >> a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
> >> in-kernel handling of IOMMU map/unmap.
> >>
> >> This adds a special case for huge pages (16MB).  The reference
> >> counting cannot be easily done for such pages in real mode (when
> >> MMU is off) so we added a list of huge pages.  It is populated in
> >> virtual mode and get_page is called just once per a huge page.
> >> Real mode handlers check if the requested page is huge and in the list,
> >> then no reference counting is done, otherwise an exit to virtual mode
> >> happens.  The list is released at KVM exit.  At the moment the fastest
> >> card available for tests uses up to 9 huge pages so walking through this
> >> list is not very expensive.  However this can change and we may want
> >> to optimize this.
> >>
> >> This also adds the virt_only parameter to the KVM module
> >> for debug and performance check purposes.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> >>
> >> Cc: David Gibson <david@gibson.dropbear.id.au>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> Signed-off-by: Paul Mackerras <paulus@samba.org>
> >> ---
> >>  Documentation/virtual/kvm/api.txt   |   28 ++++
> >>  arch/powerpc/include/asm/kvm_host.h |    2 +
> >>  arch/powerpc/include/asm/kvm_ppc.h  |    2 +
> >>  arch/powerpc/include/uapi/asm/kvm.h |    7 +
> >>  arch/powerpc/kvm/book3s_64_vio.c    |  242 ++++++++++++++++++++++++++++++++++-
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++++++++++++++++++++++++++
> >>  arch/powerpc/kvm/powerpc.c          |   12 ++
> >>  include/uapi/linux/kvm.h            |    2 +
> >>  8 files changed, 485 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> >> index f621cd6..2039767 100644
> >> --- a/Documentation/virtual/kvm/api.txt
> >> +++ b/Documentation/virtual/kvm/api.txt
> >> @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously
> >>  valid entries found.
> >>  
> >>  
> >> +4.79 KVM_CREATE_SPAPR_TCE_IOMMU
> >> +
> >> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
> >> +Architectures: powerpc
> >> +Type: vm ioctl
> >> +Parameters: struct kvm_create_spapr_tce_iommu (in)
> >> +Returns: 0 on success, -1 on error
> >> +
> >> +This creates a link between IOMMU group and a hardware TCE (translation
> >> +control entry) table. This link lets the host kernel know what IOMMU
> >> +group (i.e. TCE table) to use for the LIOBN number passed with
> >> +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
> >> +
> >> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
> >> +struct kvm_create_spapr_tce_iommu {
> >> +	__u64 liobn;
> >> +	__u32 iommu_id;
> > 
> > Wouldn't it be more in keeping 
> 
> 
> pardon?

Sorry, I was going to suggest a change, but then realised it wasn't
actually any better than what you have now.

> >> +	__u32 flags;
> >> +};
> >> +
> >> +No flag is supported at the moment.
> >> +
> >> +When the guest issues TCE call on a liobn for which a TCE table has been
> >> +registered, the kernel will handle it in real mode, updating the hardware
> >> +TCE table. TCE table calls for other liobns will cause a vm exit and must
> >> +be handled by userspace.
> >> +
> >> +
> >>  5. The kvm_run structure
> >>  ------------------------
> >>  
> >> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> >> index 36ceb0d..2b70cbc 100644
> >> --- a/arch/powerpc/include/asm/kvm_host.h
> >> +++ b/arch/powerpc/include/asm/kvm_host.h
> >> @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
> >>  	struct kvm *kvm;
> >>  	u64 liobn;
> >>  	u32 window_size;
> >> +	bool virtmode_only;
> > 
> > I see this is now initialized from the global parameter, but I think
> > it would be better to just check the global (debug) parameter
> > directly, rather than duplicating it here.
> 
> 
> The global parameter is in kvm.ko and the struct above is in the real mode
> part which cannot go to the module.

Ah, ok.  I'm half inclined to just drop the virtmode_only thing
entirely.

> >> +	struct iommu_group *grp;    /* used for IOMMU groups */
> >>  	struct page *pages[0];
> >>  };
> >>  
> >> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >> index d501246..bdfa140 100644
> >> --- a/arch/powerpc/include/asm/kvm_ppc.h
> >> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> >> @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
> >>  
> >>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  				struct kvm_create_spapr_tce *args);
> >> +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
> >> +				struct kvm_create_spapr_tce_iommu *args);
> >>  extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
> >>  		struct kvm_vcpu *vcpu, unsigned long liobn);
> >>  extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
> >> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> >> index 681b314..b67d44b 100644
> >> --- a/arch/powerpc/include/uapi/asm/kvm.h
> >> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> >> @@ -291,6 +291,13 @@ struct kvm_create_spapr_tce {
> >>  	__u32 window_size;
> >>  };
> >>  
> >> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
> >> +struct kvm_create_spapr_tce_iommu {
> >> +	__u64 liobn;
> >> +	__u32 iommu_id;
> >> +	__u32 flags;
> >> +};
> >> +
> >>  /* for KVM_ALLOCATE_RMA */
> >>  struct kvm_allocate_rma {
> >>  	__u64 rma_size;
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >> index 643ac1e..98cf949 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >> @@ -27,6 +27,9 @@
> >>  #include <linux/hugetlb.h>
> >>  #include <linux/list.h>
> >>  #include <linux/anon_inodes.h>
> >> +#include <linux/pci.h>
> >> +#include <linux/iommu.h>
> >> +#include <linux/module.h>
> >>  
> >>  #include <asm/tlbflush.h>
> >>  #include <asm/kvm_ppc.h>
> >> @@ -38,10 +41,19 @@
> >>  #include <asm/kvm_host.h>
> >>  #include <asm/udbg.h>
> >>  #include <asm/iommu.h>
> >> +#include <asm/tce.h>
> >> +
> >> +#define DRIVER_VERSION	"0.1"
> >> +#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
> >> +#define DRIVER_DESC	"POWERPC KVM driver"
> > 
> > Really?
> 
> 
> What is wrong here?

Well, it seems entirely unrelated to the rest of the changes, and not
obviously accurate.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
From: Alexey Kardashevskiy @ 2013-05-07  6:27 UTC (permalink / raw)
  To: David Gibson
  Cc: kvm, Joerg Roedel, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, linuxppc-dev
In-Reply-To: <20130507060240.GQ13692@truffula.fritz.box>

On 05/07/2013 04:02 PM, David Gibson wrote:
> On Tue, May 07, 2013 at 03:51:31PM +1000, Alexey Kardashevskiy wrote:
>> On 05/07/2013 03:29 PM, David Gibson wrote:
>>> On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
>>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>>>> and H_STUFF_TCE requests without passing them to QEMU, which should
>>>> save time on switching to QEMU and back.
>>>>
>>>> Both real and virtual modes are supported - whenever the kernel
>>>> fails to handle TCE request, it passes it to the virtual mode.
>>>> If it the virtual mode handlers fail, then the request is passed
>>>> to the user mode, for example, to QEMU.
>>>>
>>>> This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
>>>> a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
>>>> in-kernel handling of IOMMU map/unmap.
>>>>
>>>> This adds a special case for huge pages (16MB).  The reference
>>>> counting cannot be easily done for such pages in real mode (when
>>>> MMU is off) so we added a list of huge pages.  It is populated in
>>>> virtual mode and get_page is called just once per a huge page.
>>>> Real mode handlers check if the requested page is huge and in the list,
>>>> then no reference counting is done, otherwise an exit to virtual mode
>>>> happens.  The list is released at KVM exit.  At the moment the fastest
>>>> card available for tests uses up to 9 huge pages so walking through this
>>>> list is not very expensive.  However this can change and we may want
>>>> to optimize this.
>>>>
>>>> This also adds the virt_only parameter to the KVM module
>>>> for debug and performance check purposes.
>>>>
>>>> Tests show that this patch increases transmission speed from 220MB/s
>>>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>>>
>>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> Signed-off-by: Paul Mackerras <paulus@samba.org>
>>>> ---
>>>>  Documentation/virtual/kvm/api.txt   |   28 ++++
>>>>  arch/powerpc/include/asm/kvm_host.h |    2 +
>>>>  arch/powerpc/include/asm/kvm_ppc.h  |    2 +
>>>>  arch/powerpc/include/uapi/asm/kvm.h |    7 +
>>>>  arch/powerpc/kvm/book3s_64_vio.c    |  242 ++++++++++++++++++++++++++++++++++-
>>>>  arch/powerpc/kvm/book3s_64_vio_hv.c |  192 +++++++++++++++++++++++++++
>>>>  arch/powerpc/kvm/powerpc.c          |   12 ++
>>>>  include/uapi/linux/kvm.h            |    2 +
>>>>  8 files changed, 485 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>>>> index f621cd6..2039767 100644
>>>> --- a/Documentation/virtual/kvm/api.txt
>>>> +++ b/Documentation/virtual/kvm/api.txt
>>>> @@ -2127,6 +2127,34 @@ written, then `n_invalid' invalid entries, invalidating any previously
>>>>  valid entries found.
>>>>  
>>>>  
>>>> +4.79 KVM_CREATE_SPAPR_TCE_IOMMU
>>>> +
>>>> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
>>>> +Architectures: powerpc
>>>> +Type: vm ioctl
>>>> +Parameters: struct kvm_create_spapr_tce_iommu (in)
>>>> +Returns: 0 on success, -1 on error
>>>> +
>>>> +This creates a link between IOMMU group and a hardware TCE (translation
>>>> +control entry) table. This link lets the host kernel know what IOMMU
>>>> +group (i.e. TCE table) to use for the LIOBN number passed with
>>>> +H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
>>>> +
>>>> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
>>>> +struct kvm_create_spapr_tce_iommu {
>>>> +	__u64 liobn;
>>>> +	__u32 iommu_id;
>>>
>>> Wouldn't it be more in keeping 
>>
>>
>> pardon?
> 
> Sorry, I was going to suggest a change, but then realised it wasn't
> actually any better than what you have now.
> 
>>>> +	__u32 flags;
>>>> +};
>>>> +
>>>> +No flag is supported at the moment.
>>>> +
>>>> +When the guest issues TCE call on a liobn for which a TCE table has been
>>>> +registered, the kernel will handle it in real mode, updating the hardware
>>>> +TCE table. TCE table calls for other liobns will cause a vm exit and must
>>>> +be handled by userspace.
>>>> +
>>>> +
>>>>  5. The kvm_run structure
>>>>  ------------------------
>>>>  
>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>>>> index 36ceb0d..2b70cbc 100644
>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>> @@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
>>>>  	struct kvm *kvm;
>>>>  	u64 liobn;
>>>>  	u32 window_size;
>>>> +	bool virtmode_only;
>>>
>>> I see this is now initialized from the global parameter, but I think
>>> it would be better to just check the global (debug) parameter
>>> directly, rather than duplicating it here.
>>
>>
>> The global parameter is in kvm.ko and the struct above is in the real mode
>> part which cannot go to the module.
> 
> Ah, ok.  I'm half inclined to just drop the virtmode_only thing
> entirely.
> 
>>>> +	struct iommu_group *grp;    /* used for IOMMU groups */
>>>>  	struct page *pages[0];
>>>>  };
>>>>  
>>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>>>> index d501246..bdfa140 100644
>>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>>> @@ -139,6 +139,8 @@ extern void kvmppc_xics_free(struct kvm *kvm);
>>>>  
>>>>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>  				struct kvm_create_spapr_tce *args);
>>>> +extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
>>>> +				struct kvm_create_spapr_tce_iommu *args);
>>>>  extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>>>>  		struct kvm_vcpu *vcpu, unsigned long liobn);
>>>>  extern long kvmppc_emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
>>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>>>> index 681b314..b67d44b 100644
>>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>>>> @@ -291,6 +291,13 @@ struct kvm_create_spapr_tce {
>>>>  	__u32 window_size;
>>>>  };
>>>>  
>>>> +/* for KVM_CAP_SPAPR_TCE_IOMMU */
>>>> +struct kvm_create_spapr_tce_iommu {
>>>> +	__u64 liobn;
>>>> +	__u32 iommu_id;
>>>> +	__u32 flags;
>>>> +};
>>>> +
>>>>  /* for KVM_ALLOCATE_RMA */
>>>>  struct kvm_allocate_rma {
>>>>  	__u64 rma_size;
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>>>> index 643ac1e..98cf949 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>> @@ -27,6 +27,9 @@
>>>>  #include <linux/hugetlb.h>
>>>>  #include <linux/list.h>
>>>>  #include <linux/anon_inodes.h>
>>>> +#include <linux/pci.h>
>>>> +#include <linux/iommu.h>
>>>> +#include <linux/module.h>
>>>>  
>>>>  #include <asm/tlbflush.h>
>>>>  #include <asm/kvm_ppc.h>
>>>> @@ -38,10 +41,19 @@
>>>>  #include <asm/kvm_host.h>
>>>>  #include <asm/udbg.h>
>>>>  #include <asm/iommu.h>
>>>> +#include <asm/tce.h>
>>>> +
>>>> +#define DRIVER_VERSION	"0.1"
>>>> +#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
>>>> +#define DRIVER_DESC	"POWERPC KVM driver"
>>>
>>> Really?
>>
>>
>> What is wrong here?
> 
> Well, it seems entirely unrelated to the rest of the changes, 


The patch adds a module parameter so I had to add those DRIVER_xxx.


> and not obviously accurate.

Let's fix it then. How? Paul signed it...



-- 
Alexey

^ permalink raw reply

* Re: [PATCH 5/6] KVM: PPC: Add support for IOMMU in-kernel handling
From: David Gibson @ 2013-05-07  6:54 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Joerg Roedel, Alexander Graf, kvm-ppc, Alex Williamson,
	Paul Mackerras, linuxppc-dev
In-Reply-To: <51889EE5.5050807@ozlabs.ru>

[-- Attachment #1: Type: text/plain, Size: 1021 bytes --]

On Tue, May 07, 2013 at 04:27:49PM +1000, Alexey Kardashevskiy wrote:
> On 05/07/2013 04:02 PM, David Gibson wrote:
> > On Tue, May 07, 2013 at 03:51:31PM +1000, Alexey Kardashevskiy wrote:
> >> On 05/07/2013 03:29 PM, David Gibson wrote:
> >>> On Mon, May 06, 2013 at 05:25:56PM +1000, Alexey Kardashevskiy wrote:
[snip]
> >>>> +#define DRIVER_VERSION	"0.1"
> >>>> +#define DRIVER_AUTHOR	"Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>"
> >>>> +#define DRIVER_DESC	"POWERPC KVM driver"
> >>>
> >>> Really?
> >>
> >>
> >> What is wrong here?
> > 
> > Well, it seems entirely unrelated to the rest of the changes, 
> 
> 
> The patch adds a module parameter so I had to add those DRIVER_xxx.

Ah, ok.

> > and not obviously accurate.
> 
> Let's fix it then. How? Paul signed it...

Fair enough then.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* [PATCH] powerpc: Make hard_irq_disable() do the right thing vs. irq tracing
From: Benjamin Herrenschmidt @ 2013-05-07  7:04 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Scott Wood, Mihai Caraman, kvm-ppc, Tiejun Chen

If hard_irq_disable() is called while interrupts are already soft-disabled
(which is the most common case) all is already well.

However you can (and in some cases want) to call it while everything is
enabled (to make sure you don't get a lazy even, for example before entry
into KVM guests) and in this case we need to inform the irq tracer that
the irqs are going off.

We have to change the inline into a macro to avoid an include circular
dependency hell hole.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

Tested on pseries, Scott, I don't expect a problem with that patch especially
since most callers already are soft disabled, so I'll merge it now along with
my other pending stuff and you can simplify your KVM one.

diff --git a/arch/powerpc/include/asm/hw_irq.h b/arch/powerpc/include/asm/hw_irq.h
index e45c494..d615b28 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -95,15 +95,13 @@ static inline bool arch_irqs_disabled(void)
 #define __hard_irq_disable()	__mtmsrd(local_paca->kernel_msr, 1)
 #endif
 
-static inline void hard_irq_disable(void)
-{
-	__hard_irq_disable();
-	get_paca()->soft_enabled = 0;
-	get_paca()->irq_happened |= PACA_IRQ_HARD_DIS;
-}
-
-/* include/linux/interrupt.h needs hard_irq_disable to be a macro */
-#define hard_irq_disable	hard_irq_disable
+#define hard_irq_disable()	do {			\
+	__hard_irq_disable();				\
+	if (local_paca->soft_enabled)			\
+		trace_hardirqs_off();			\
+	get_paca()->soft_enabled = 0;			\
+	get_paca()->irq_happened |= PACA_IRQ_HARD_DIS;	\
+} while(0)
 
 static inline bool lazy_irq_pending(void)
 {

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox