LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH] KVM: PPC: BOOK3S: HV: Don't try to allocate from kernel page allocator for hash page table.
From: Alexander Graf @ 2014-05-06  7:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linuxppc-dev@lists.ozlabs.org, paulus@samba.org, Aneesh Kumar K.V,
	kvm-ppc@vger.kernel.org, kvm@vger.kernel.org
In-Reply-To: <1399360775.20388.112.camel@pasglop>


On 06.05.14 09:19, Benjamin Herrenschmidt wrote:
> On Tue, 2014-05-06 at 09:05 +0200, Alexander Graf wrote:
>> On 06.05.14 02:06, Benjamin Herrenschmidt wrote:
>>> On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
>>>> Isn't this a greater problem? We should start swapping before we hit
>>>> the point where non movable kernel allocation fails, no?
>>> Possibly but the fact remains, this can be avoided by making sure that
>>> if we create a CMA reserve for KVM, then it uses it rather than using
>>> the rest of main memory for hash tables.
>> So why were we preferring non-CMA memory before? Considering that Aneesh
>> introduced that logic in fa61a4e3 I suppose this was just a mistake?
> I assume so.
>
>>>> The fact that KVM uses a good number of normal kernel pages is maybe
>>>> suboptimal, but shouldn't be a critical problem.
>>> The point is that we explicitly reserve those pages in CMA for use
>>> by KVM for that specific purpose, but the current code tries first
>>> to get them out of the normal pool.
>>>
>>> This is not an optimal behaviour and is what Aneesh patches are
>>> trying to fix.
>> I agree, and I agree that it's worth it to make better use of our
>> resources. But we still shouldn't crash.
> Well, Linux hitting out of memory conditions has never been a happy
> story :-)
>
>> However, reading through this thread I think I've slowly grasped what
>> the problem is. The hugetlbfs size calculation.
> Not really.
>
>> I guess something in your stack overreserves huge pages because it
>> doesn't account for the fact that some part of system memory is already
>> reserved for CMA.
> Either that or simply Linux runs out because we dirty too fast...
> really, Linux has never been good at dealing with OO situations,
> especially when things like network drivers and filesystems try to do
> ATOMIC or NOIO allocs...
>   
>> So the underlying problem is something completely orthogonal. The patch
>> body as is is fine, but the patch description should simply say that we
>> should prefer the CMA region because it's already reserved for us for
>> this purpose and we make better use of our available resources that way.
> No.
>
> We give a chunk of memory to hugetlbfs, it's all good and fine.
>
> Whatever remains is split between CMA and the normal page allocator.
>
> Without Aneesh latest patch, when creating guests, KVM starts allocating
> it's hash tables from the latter instead of CMA (we never allocate from
> hugetlb pool afaik, only guest pages do that, not hash tables).
>
> So we exhaust the page allocator and get linux into OOM conditions
> while there's plenty of space in CMA. But the kernel cannot use CMA for
> it's own allocations, only to back user pages, which we don't care about
> because our guest pages are covered by our hugetlb reserve :-)

Yes. Write that in the patch description and I'm happy ;).


Alex

^ permalink raw reply

* Re: [PATCH 5/6] powerpc/corenet: Add DPAA FMan support to the SoC device tree(s)
From: Joakim Tjernlund @ 2014-05-06  7:40 UTC (permalink / raw)
  To: Emil Medve
  Cc: Scott Wood, devicetree, Kanetkar Shruti-B44454,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <5368811A.3060609@Freescale.com>

"Linuxppc-dev" 
<linuxppc-dev-bounces+joakim.tjernlund=transmode.se@lists.ozlabs.org> 
wrote on 2014/05/06 08:28:42:
> 

.....

> 
> > That said,
> > I don't object to having a way to label a PHY as attached via TBI if
> > that's useful.  I'm giving a mild, non-nacking (given the history)
> > objection to using device_type for that (given other history).
> 
> Personally, I think that TBI PHY support is a bit messy but I don't have
> bandwidth to deal with that. The TBI PHY should be handled as a regular
> PHY and right now is a special case

Yes please! We will use the TBI as the only PHY in 1000BASE-X mode so
naturally we want to see the TBI as its own PHY and monitor its link 
status, AN etc.

 Jocke

^ permalink raw reply

* Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest
From: Alexander Graf @ 2014-05-06  9:12 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, kvm-ppc, kvm
In-Reply-To: <1399224616-25142-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

On 05/04/2014 07:30 PM, Aneesh Kumar K.V wrote:
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>   arch/powerpc/include/asm/kvm_book3s_64.h | 146 ++++++++++++++++++++++++++-----
>   arch/powerpc/kvm/book3s_hv.c             |   7 ++
>   2 files changed, 130 insertions(+), 23 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
> index 51388befeddb..f03ea8f90576 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
> @@ -77,34 +77,122 @@ static inline long try_lock_hpte(unsigned long *hpte, unsigned long bits)
>   	return old == 0;
>   }
>   
> +static inline int __hpte_actual_psize(unsigned int lp, int psize)
> +{
> +	int i, shift;
> +	unsigned int mask;
> +
> +	/* start from 1 ignoring MMU_PAGE_4K */
> +	for (i = 1; i < MMU_PAGE_COUNT; i++) {
> +
> +		/* invalid penc */
> +		if (mmu_psize_defs[psize].penc[i] == -1)
> +			continue;
> +		/*
> +		 * encoding bits per actual page size
> +		 *        PTE LP     actual page size
> +		 *    rrrr rrrz		>=8KB
> +		 *    rrrr rrzz		>=16KB
> +		 *    rrrr rzzz		>=32KB
> +		 *    rrrr zzzz		>=64KB
> +		 * .......
> +		 */
> +		shift = mmu_psize_defs[i].shift - LP_SHIFT;
> +		if (shift > LP_BITS)
> +			shift = LP_BITS;
> +		mask = (1 << shift) - 1;
> +		if ((lp & mask) == mmu_psize_defs[psize].penc[i])
> +			return i;
> +	}
> +	return -1;
> +}
> +
>   static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,
>   					     unsigned long pte_index)
>   {
> -	unsigned long rb, va_low;
> +	int b_size, a_size;
> +	unsigned int penc;
> +	unsigned long rb = 0, va_low, sllp;
> +	unsigned int lp = (r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
> +
> +	if (!(v & HPTE_V_LARGE)) {
> +		/* both base and actual psize is 4k */
> +		b_size = MMU_PAGE_4K;
> +		a_size = MMU_PAGE_4K;
> +	} else {
> +		for (b_size = 0; b_size < MMU_PAGE_COUNT; b_size++) {
> +
> +			/* valid entries have a shift value */
> +			if (!mmu_psize_defs[b_size].shift)
> +				continue;
>   
> +			a_size = __hpte_actual_psize(lp, b_size);
> +			if (a_size != -1)
> +				break;
> +		}
> +	}
> +	/*
> +	 * Ignore the top 14 bits of va
> +	 * v have top two bits covering segment size, hence move
> +	 * by 16 bits, Also clear the lower HPTE_V_AVPN_SHIFT (7) bits.
> +	 * AVA field in v also have the lower 23 bits ignored.
> +	 * For base page size 4K we need 14 .. 65 bits (so need to
> +	 * collect extra 11 bits)
> +	 * For others we need 14..14+i
> +	 */
> +	/* This covers 14..54 bits of va*/
>   	rb = (v & ~0x7fUL) << 16;		/* AVA field */
> +	/*
> +	 * AVA in v had cleared lower 23 bits. We need to derive
> +	 * that from pteg index
> +	 */
>   	va_low = pte_index >> 3;
>   	if (v & HPTE_V_SECONDARY)
>   		va_low = ~va_low;
> -	/* xor vsid from AVA */
> +	/*
> +	 * get the vpn bits from va_low using reverse of hashing.
> +	 * In v we have va with 23 bits dropped and then left shifted
> +	 * HPTE_V_AVPN_SHIFT (7) bits. Now to find vsid we need
> +	 * right shift it with (SID_SHIFT - (23 - 7))
> +	 */
>   	if (!(v & HPTE_V_1TB_SEG))
> -		va_low ^= v >> 12;
> +		va_low ^= v >> (SID_SHIFT - 16);
>   	else
> -		va_low ^= v >> 24;
> +		va_low ^= v >> (SID_SHIFT_1T - 16);
>   	va_low &= 0x7ff;
> -	if (v & HPTE_V_LARGE) {
> -		rb |= 1;			/* L field */
> -		if (cpu_has_feature(CPU_FTR_ARCH_206) &&
> -		    (r & 0xff000)) {
> -			/* non-16MB large page, must be 64k */
> -			/* (masks depend on page size) */
> -			rb |= 0x1000;		/* page encoding in LP field */
> -			rb |= (va_low & 0x7f) << 16; /* 7b of VA in AVA/LP field */
> -			rb |= ((va_low << 4) & 0xf0);	/* AVAL field (P7 doesn't seem to care) */
> -		}
> -	} else {
> -		/* 4kB page */
> -		rb |= (va_low & 0x7ff) << 12;	/* remaining 11b of VA */
> +
> +	switch (b_size) {
> +	case MMU_PAGE_4K:
> +		sllp = ((mmu_psize_defs[a_size].sllp & SLB_VSID_L) >> 6) |
> +			((mmu_psize_defs[a_size].sllp & SLB_VSID_LP) >> 4);
> +		rb |= sllp << 5;	/*  AP field */
> +		rb |= (va_low & 0x7ff) << 12;	/* remaining 11 bits of AVA */
> +		break;
> +	default:
> +	{
> +		int aval_shift;
> +		/*
> +		 * remaining 7bits of AVA/LP fields
> +		 * Also contain the rr bits of LP
> +		 */
> +		rb |= (va_low & 0x7f) << 16;
> +		/*
> +		 * Now clear not needed LP bits based on actual psize
> +		 */
> +		rb &= ~((1ul << mmu_psize_defs[a_size].shift) - 1);
> +		/*
> +		 * AVAL field 58..77 - base_page_shift bits of va
> +		 * we have space for 58..64 bits, Missing bits should
> +		 * be zero filled. +1 is to take care of L bit shift
> +		 */
> +		aval_shift = 64 - (77 - mmu_psize_defs[b_size].shift) + 1;
> +		rb |= ((va_low << aval_shift) & 0xfe);
> +
> +		rb |= 1;		/* L field */
> +		penc = mmu_psize_defs[b_size].penc[a_size];
> +		rb |= penc << 12;	/* LP field */
> +		break;
> +	}
>   	}
>   	rb |= (v >> 54) & 0x300;		/* B field */
>   	return rb;
> @@ -112,14 +200,26 @@ static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,
>   
>   static inline unsigned long hpte_page_size(unsigned long h, unsigned long l)
>   {
> +	int size, a_size;
> +	/* Look at the 8 bit LP value */
> +	unsigned int lp = (l >> LP_SHIFT) & ((1 << LP_BITS) - 1);
> +
>   	/* only handle 4k, 64k and 16M pages for now */
>   	if (!(h & HPTE_V_LARGE))
> -		return 1ul << 12;		/* 4k page */
> -	if ((l & 0xf000) == 0x1000 && cpu_has_feature(CPU_FTR_ARCH_206))
> -		return 1ul << 16;		/* 64k page */
> -	if ((l & 0xff000) == 0)
> -		return 1ul << 24;		/* 16M page */
> -	return 0;				/* error */
> +		return 1ul << 12;
> +	else {
> +		for (size = 0; size < MMU_PAGE_COUNT; size++) {
> +			/* valid entries have a shift value */
> +			if (!mmu_psize_defs[size].shift)
> +				continue;
> +
> +			a_size = __hpte_actual_psize(lp, size);

a_size as psize is probably a slightly confusing namer. Just call it 
a_psize.

So if I understand this patch correctly, it simply introduces logic to 
handle page sizes other than 4k, 64k, 16M by analyzing the actual page 
size field in the HPTE. Mind to explain why exactly that enables us to 
use THP?

What exactly is the flow if the pages are not backed by huge pages? What 
is the flow when they start to get backed by huge pages?

> +			if (a_size != -1)
> +				return 1ul << mmu_psize_defs[a_size].shift;
> +		}
> +
> +	}
> +	return 0;
>   }
>   
>   static inline unsigned long hpte_rpn(unsigned long ptel, unsigned long psize)
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 8227dba5af0f..a38d3289320a 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -1949,6 +1949,13 @@ static void kvmppc_add_seg_page_size(struct kvm_ppc_one_seg_page_size **sps,
>   	 * support pte_enc here
>   	 */
>   	(*sps)->enc[0].pte_enc = def->penc[linux_psize];
> +	/*
> +	 * Add 16MB MPSS support
> +	 */
> +	if (linux_psize != MMU_PAGE_16M) {
> +		(*sps)->enc[1].page_shift = 24;
> +		(*sps)->enc[1].pte_enc = def->penc[MMU_PAGE_16M];
> +	}

So this basically indicates that every segment (except for the 16MB one) 
can also handle 16MB MPSS page sizes? I suppose you want to remove the 
comment in kvm_vm_ioctl_get_smmu_info_hv() that says we don't do MPSS here.

Can we also ensure that every system we run on can do MPSS?


Alex

^ permalink raw reply

* Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest
From: Benjamin Herrenschmidt @ 2014-05-06  9:26 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linuxppc-dev, paulus, Aneesh Kumar K.V, kvm-ppc, kvm
In-Reply-To: <5368A78D.4070509@suse.de>

On Tue, 2014-05-06 at 11:12 +0200, Alexander Graf wrote:

> So if I understand this patch correctly, it simply introduces logic to 
> handle page sizes other than 4k, 64k, 16M by analyzing the actual page 
> size field in the HPTE. Mind to explain why exactly that enables us to 
> use THP?
>
> What exactly is the flow if the pages are not backed by huge pages? What 
> is the flow when they start to get backed by huge pages?

The hypervisor doesn't care about segments ... but it needs to properly
decode the page size requested by the guest, if anything, to issue the
right form of tlbie instruction.

The encoding in the HPTE for a 16M page inside a 64K segment is
different than the encoding for a 16M in a 16M segment, this is done so
that the encoding carries both information, which allows broadcast
tlbie to properly find the right set in the TLB for invalidations among
others.

So from a KVM perspective, we don't know whether the guest is doing THP
or something else (Linux calls it THP but all we care here is that this
is MPSS, another guest than Linux might exploit that differently).

What we do know is that if we advertise MPSS, we need to decode the page
sizes encoded in the HPTE so that we know what we are dealing with in
H_ENTER and can do the appropriate TLB invalidations in H_REMOVE &
evictions.

> > +			if (a_size != -1)
> > +				return 1ul << mmu_psize_defs[a_size].shift;
> > +		}
> > +
> > +	}
> > +	return 0;
> >   }
> >   
> >   static inline unsigned long hpte_rpn(unsigned long ptel, unsigned long psize)
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index 8227dba5af0f..a38d3289320a 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -1949,6 +1949,13 @@ static void kvmppc_add_seg_page_size(struct kvm_ppc_one_seg_page_size **sps,
> >   	 * support pte_enc here
> >   	 */
> >   	(*sps)->enc[0].pte_enc = def->penc[linux_psize];
> > +	/*
> > +	 * Add 16MB MPSS support
> > +	 */
> > +	if (linux_psize != MMU_PAGE_16M) {
> > +		(*sps)->enc[1].page_shift = 24;
> > +		(*sps)->enc[1].pte_enc = def->penc[MMU_PAGE_16M];
> > +	}
> 
> So this basically indicates that every segment (except for the 16MB one) 
> can also handle 16MB MPSS page sizes? I suppose you want to remove the 
> comment in kvm_vm_ioctl_get_smmu_info_hv() that says we don't do MPSS here.

I haven't reviewed the code there, make sure it will indeed do a
different encoding for every combination of segment/actual page size.

> Can we also ensure that every system we run on can do MPSS?

P7 and P8 are identical in that regard. However 970 doesn't do MPSS so
let's make sure we get that right.

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest
From: Alexander Graf @ 2014-05-06  9:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linuxppc-dev, paulus, Aneesh Kumar K.V, kvm-ppc, kvm
In-Reply-To: <1399368400.18906.9.camel@pasglop>

On 05/06/2014 11:26 AM, Benjamin Herrenschmidt wrote:
> On Tue, 2014-05-06 at 11:12 +0200, Alexander Graf wrote:
>
>> So if I understand this patch correctly, it simply introduces logic to
>> handle page sizes other than 4k, 64k, 16M by analyzing the actual page
>> size field in the HPTE. Mind to explain why exactly that enables us to
>> use THP?
>>
>> What exactly is the flow if the pages are not backed by huge pages? What
>> is the flow when they start to get backed by huge pages?
> The hypervisor doesn't care about segments ... but it needs to properly
> decode the page size requested by the guest, if anything, to issue the
> right form of tlbie instruction.
>
> The encoding in the HPTE for a 16M page inside a 64K segment is
> different than the encoding for a 16M in a 16M segment, this is done so
> that the encoding carries both information, which allows broadcast
> tlbie to properly find the right set in the TLB for invalidations among
> others.
>
> So from a KVM perspective, we don't know whether the guest is doing THP
> or something else (Linux calls it THP but all we care here is that this
> is MPSS, another guest than Linux might exploit that differently).

Ugh. So we're just talking about a guest using MPSS here? Not about the 
host doing THP? I must've missed that part.

>
> What we do know is that if we advertise MPSS, we need to decode the page
> sizes encoded in the HPTE so that we know what we are dealing with in
> H_ENTER and can do the appropriate TLB invalidations in H_REMOVE &
> evictions.

Yes. That makes a lot of sense. So this patch really is all about 
enabling MPSS support for 16MB pages. No more, no less.

>
>>> +			if (a_size != -1)
>>> +				return 1ul << mmu_psize_defs[a_size].shift;
>>> +		}
>>> +
>>> +	}
>>> +	return 0;
>>>    }
>>>    
>>>    static inline unsigned long hpte_rpn(unsigned long ptel, unsigned long psize)
>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>>> index 8227dba5af0f..a38d3289320a 100644
>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>>> @@ -1949,6 +1949,13 @@ static void kvmppc_add_seg_page_size(struct kvm_ppc_one_seg_page_size **sps,
>>>    	 * support pte_enc here
>>>    	 */
>>>    	(*sps)->enc[0].pte_enc = def->penc[linux_psize];
>>> +	/*
>>> +	 * Add 16MB MPSS support
>>> +	 */
>>> +	if (linux_psize != MMU_PAGE_16M) {
>>> +		(*sps)->enc[1].page_shift = 24;
>>> +		(*sps)->enc[1].pte_enc = def->penc[MMU_PAGE_16M];
>>> +	}
>> So this basically indicates that every segment (except for the 16MB one)
>> can also handle 16MB MPSS page sizes? I suppose you want to remove the
>> comment in kvm_vm_ioctl_get_smmu_info_hv() that says we don't do MPSS here.
> I haven't reviewed the code there, make sure it will indeed do a
> different encoding for every combination of segment/actual page size.
>
>> Can we also ensure that every system we run on can do MPSS?
> P7 and P8 are identical in that regard. However 970 doesn't do MPSS so
> let's make sure we get that right.

yes. When / if people can easily get their hands on p7/p8 bare metal 
systems I'll be more than happy to remove 970 support as well, but for 
now it's probably good to keep in.


Alex

^ permalink raw reply

* Re: [PATCH V3 2/2] powerpc/pseries: init fault_around_order for pseries
From: Ingo Molnar @ 2014-05-06 11:29 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-arch, riel, Madhavan Srinivasan, dave.hansen, peterz, x86,
	linux-kernel, linux-mm, ak, paulus, mgorman, Linus Torvalds, akpm,
	linuxppc-dev, kirill.shutemov
In-Reply-To: <87d2fz47tg.fsf@rustcorp.com.au>


* Rusty Russell <rusty@rustcorp.com.au> wrote:

> Ingo Molnar <mingo@kernel.org> writes:
> > * Madhavan Srinivasan <maddy@linux.vnet.ibm.com> wrote:
> >
> >> Performance data for different FAULT_AROUND_ORDER values from 4 socket
> >> Power7 system (128 Threads and 128GB memory). perf stat with repeat of 5
> >> is used to get the stddev values. Test ran in v3.14 kernel (Baseline) and
> >> v3.15-rc1 for different fault around order values.
> >> 
> >> FAULT_AROUND_ORDER      Baseline        1               3               4               5               8
> >> 
> >> Linux build (make -j64)
> >> minor-faults            47,437,359      35,279,286      25,425,347      23,461,275      22,002,189      21,435,836
> >> times in seconds        347.302528420   344.061588460   340.974022391   348.193508116   348.673900158   350.986543618
> >>  stddev for time        ( +-  1.50% )   ( +-  0.73% )   ( +-  1.13% )   ( +-  1.01% )   ( +-  1.89% )   ( +-  1.55% )
> >>  %chg time to baseline                  -0.9%           -1.8%           0.2%            0.39%           1.06%
> >
> > Probably too noisy.
> 
> A little, but 3 still looks like the winner.
> 
> >> Linux rebuild (make -j64)
> >> minor-faults            941,552         718,319         486,625         440,124         410,510         397,416
> >> times in seconds        30.569834718    31.219637539    31.319370649    31.434285472    31.972367174    31.443043580
> >>  stddev for time        ( +-  1.07% )   ( +-  0.13% )   ( +-  0.43% )   ( +-  0.18% )   ( +-  0.95% )   ( +-  0.58% )
> >>  %chg time to baseline                  2.1%            2.4%            2.8%            4.58%           2.85%
> >
> > Here it looks like a speedup. Optimal value: 5+.
> 
> No, lower time is better.  Baseline (no faultaround) wins.
> 
> 
> etc.

ah, yeah, you are right. Brainfart of the week...

> It's not a huge surprise that a 64k page arch wants a smaller value 
> than a 4k system.  But I agree: I don't see much upside for FAO > 0, 
> but I do see downside.
> 
> Most extreme results:
> Order 1: 2% loss on recompile.  10% win 4% loss on seq.  9% loss random.
> Order 3: 2% loss on recompile.  6% win 5% loss on seq.  14% loss on random.
> Order 4: 2.8% loss on recompile. 10% win 7% loss on seq.  9% loss on random.
> 
> > I'm starting to suspect that maybe workloads ought to be given a 
> > choice in this matter, via madvise() or such.
> 
> I really don't think they'll be able to use it; it'll change far too 
> much with machine and kernel updates. [...]

Do we know that?

> [...] I think we should apply patch
> #1 (with fixes) to make it a variable, then set it to 0 for PPC.

Ok, agreed - at least until contrary data comes around.

Thanks,

	Ingo

^ permalink raw reply

* Re: Build regressions/improvements in v3.15-rc4
From: Geert Uytterhoeven @ 2014-05-06 12:15 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org
  Cc: the arch/x86 maintainers, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <1399377878-29995-1-git-send-email-geert@linux-m68k.org>

On Tue, May 6, 2014 at 2:04 PM, Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> JFYI, when comparing v3.15-rc4[1]  to v3.15-rc3[3], the summaries are:
>   - build errors: +7/-1

  + /scratch/kisskb/src/arch/powerpc/include/asm/fixmap.h: error:
overflow in enumeration values  CC      drivers/hwmon/smsc47m192.o:
=> 51:2
  + /scratch/kisskb/src/arch/powerpc/include/asm/fixmap.h: error:
overflow in enumeration values  CC [M]  drivers/usb/gadget/f_rndis.o:
=> 51:2
  + /scratch/kisskb/src/arch/powerpc/include/asm/fixmap.h: error:
overflow in enumeration values:  => 51:2

powerpc-randconfig (looks scary, is CONFIG_HIGHMEM=y broken on ppc?)

  + /scratch/kisskb/src/arch/powerpc/kernel/head_44x.S: Error: invalid
operands (*ABS* and *UND* sections) for `|':  => 686, 603
  + /scratch/kisskb/src/arch/powerpc/mm/tlb_nohash_low.S: Error:
unsupported relocation against PPC47x_TLBE_SIZE:  => 113

powerpc-randconfig

  + /scratch/kisskb/src/arch/powerpc/platforms/powernv/setup.c: error:
implicit declaration of function 'get_hard_smp_processor_id'
[-Werror=implicit-function-declaration]:  => 179:4

ppc64_defconfig+UP

Lemme guess: If CONFIG_SMP=n, <linux/smp.h> does not include <asm/smp.h>,
so it needs an explicit #include <asm/smp.h>?

  + error: initramfs.c: undefined reference to `__stack_chk_guard':
=> .init.text+0x19dc)

x86_64-randconfig

> [1] http://kisskb.ellerman.id.au/kisskb/head/7449/ (all 119 configs)
> [3] http://kisskb.ellerman.id.au/kisskb/head/7427/ (all 119 configs)

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* Re: [PATCH V4] POWERPC: BOOK3S: KVM: Use the saved dar value and generic make_dsisr
From: Aneesh Kumar K.V @ 2014-05-06 14:06 UTC (permalink / raw)
  To: Alexander Graf, Paul Mackerras; +Cc: linuxppc-dev, kvm, kvm-ppc
In-Reply-To: <536887EF.2070201@suse.de>

Alexander Graf <agraf@suse.de> writes:

> On 06.05.14 02:41, Paul Mackerras wrote:
>> On Mon, May 05, 2014 at 01:19:30PM +0200, Alexander Graf wrote:
>>> On 05/04/2014 07:21 PM, Aneesh Kumar K.V wrote:
>>>> +#ifdef CONFIG_PPC_BOOK3S_64
>>>> +	return vcpu->arch.fault_dar;
>>> How about PA6T and G5s?
>> G5 sets DAR on an alignment interrupt.
>>
>> As for PA6T, I don't know for sure, but if it doesn't, ordinary
>> alignment interrupts wouldn't be handled properly, since the code in
>> arch/powerpc/kernel/align.c assumes DAR contains the address being
>> accessed on all PowerPC CPUs.
>
> Now that's a good point. If we simply behave like Linux, I'm fine. This 
> definitely deserves a comment on the #ifdef in the code.
>

Will update and send V5

-aneesh

^ permalink raw reply

* Re: [PATCH V4] POWERPC: BOOK3S: KVM: Use the saved dar value and generic make_dsisr
From: Aneesh Kumar K.V @ 2014-05-06 14:12 UTC (permalink / raw)
  To: Alexander Graf, Paul Mackerras; +Cc: linuxppc-dev, kvm, kvm-ppc
In-Reply-To: <536887EF.2070201@suse.de>

Alexander Graf <agraf@suse.de> writes:

> On 06.05.14 02:41, Paul Mackerras wrote:
>> On Mon, May 05, 2014 at 01:19:30PM +0200, Alexander Graf wrote:
>>> On 05/04/2014 07:21 PM, Aneesh Kumar K.V wrote:
>>>> +#ifdef CONFIG_PPC_BOOK3S_64
>>>> +	return vcpu->arch.fault_dar;
>>> How about PA6T and G5s?
>> G5 sets DAR on an alignment interrupt.
>>
>> As for PA6T, I don't know for sure, but if it doesn't, ordinary
>> alignment interrupts wouldn't be handled properly, since the code in
>> arch/powerpc/kernel/align.c assumes DAR contains the address being
>> accessed on all PowerPC CPUs.
>
> Now that's a good point. If we simply behave like Linux, I'm fine. This 
> definitely deserves a comment on the #ifdef in the code.


How about ?

#ifdef CONFIG_PPC_BOOK3S_64
	/*
	 * Linux always expect a valid  dar as per alignment
	 * interrupt handling code (fix_alignment()). Don't compute the dar
	 * value here, instead used the saved dar value. Right now we restrict
	 * this only for BOOK3S-64.
	 */
	return vcpu->arch.fault_dar;
#else


-aneesh

^ permalink raw reply

* Re: [PATCH] KVM: PPC: BOOK3S: HV: Don't try to allocate from kernel page allocator for hash page table.
From: Aneesh Kumar K.V @ 2014-05-06 14:20 UTC (permalink / raw)
  To: Alexander Graf, Benjamin Herrenschmidt
  Cc: linuxppc-dev@lists.ozlabs.org, paulus@samba.org,
	kvm@vger.kernel.org, kvm-ppc@vger.kernel.org
In-Reply-To: <53688D89.1070201@suse.de>

Alexander Graf <agraf@suse.de> writes:

> On 06.05.14 09:19, Benjamin Herrenschmidt wrote:
>> On Tue, 2014-05-06 at 09:05 +0200, Alexander Graf wrote:
>>> On 06.05.14 02:06, Benjamin Herrenschmidt wrote:
>>>> On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
>>>>> Isn't this a greater problem? We should start swapping before we hit
>>>>> the point where non movable kernel allocation fails, no?
>>>> Possibly but the fact remains, this can be avoided by making sure that
>>>> if we create a CMA reserve for KVM, then it uses it rather than using
>>>> the rest of main memory for hash tables.
>>> So why were we preferring non-CMA memory before? Considering that Aneesh
>>> introduced that logic in fa61a4e3 I suppose this was just a mistake?
>> I assume so.

....
...

>>
>> Whatever remains is split between CMA and the normal page allocator.
>>
>> Without Aneesh latest patch, when creating guests, KVM starts allocating
>> it's hash tables from the latter instead of CMA (we never allocate from
>> hugetlb pool afaik, only guest pages do that, not hash tables).
>>
>> So we exhaust the page allocator and get linux into OOM conditions
>> while there's plenty of space in CMA. But the kernel cannot use CMA for
>> it's own allocations, only to back user pages, which we don't care about
>> because our guest pages are covered by our hugetlb reserve :-)
>
> Yes. Write that in the patch description and I'm happy ;).
>

How about the below:

Current KVM code first try to allocate hash page table from the normal
page allocator before falling back to the CMA reserve region. One of the
side effects of that is, we could exhaust the page allocator and get
linux into OOM conditions while we still have plenty of space in CMA. 

Fix this by trying the CMA reserve region first and then falling back
to normal page allocator if we fail to get enough memory from CMA
reserve area.

-aneesh

^ permalink raw reply

* Re: [PATCH V4] POWERPC: BOOK3S: KVM: Use the saved dar value and generic make_dsisr
From: Alexander Graf @ 2014-05-06 14:21 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Paul Mackerras, linuxppc-dev, kvm-ppc, kvm
In-Reply-To: <87zjivq9mb.fsf@linux.vnet.ibm.com>

On 05/06/2014 04:12 PM, Aneesh Kumar K.V wrote:
> Alexander Graf <agraf@suse.de> writes:
>
>> On 06.05.14 02:41, Paul Mackerras wrote:
>>> On Mon, May 05, 2014 at 01:19:30PM +0200, Alexander Graf wrote:
>>>> On 05/04/2014 07:21 PM, Aneesh Kumar K.V wrote:
>>>>> +#ifdef CONFIG_PPC_BOOK3S_64
>>>>> +	return vcpu->arch.fault_dar;
>>>> How about PA6T and G5s?
>>> G5 sets DAR on an alignment interrupt.
>>>
>>> As for PA6T, I don't know for sure, but if it doesn't, ordinary
>>> alignment interrupts wouldn't be handled properly, since the code in
>>> arch/powerpc/kernel/align.c assumes DAR contains the address being
>>> accessed on all PowerPC CPUs.
>> Now that's a good point. If we simply behave like Linux, I'm fine. This
>> definitely deserves a comment on the #ifdef in the code.
>
> How about ?
>
> #ifdef CONFIG_PPC_BOOK3S_64
> 	/*
> 	 * Linux always expect a valid  dar as per alignment
> 	 * interrupt handling code (fix_alignment()). Don't compute the dar
> 	 * value here, instead used the saved dar value. Right now we restrict
> 	 * this only for BOOK3S-64.
> 	 */

/* Linux's fix_alignment() assumes that DAR is valid, so can we */


Alex

> 	return vcpu->arch.fault_dar;
> #else
>
>
> -aneesh
>

^ permalink raw reply

* Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest
From: Aneesh Kumar K.V @ 2014-05-06 14:23 UTC (permalink / raw)
  To: Alexander Graf; +Cc: paulus, linuxppc-dev, kvm-ppc, kvm
In-Reply-To: <5368A78D.4070509@suse.de>

Alexander Graf <agraf@suse.de> writes:

> On 05/04/2014 07:30 PM, Aneesh Kumar K.V wrote:
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

....
....

>>   static inline unsigned long hpte_page_size(unsigned long h, unsigned long l)
>>   {
>> +	int size, a_size;
>> +	/* Look at the 8 bit LP value */
>> +	unsigned int lp = (l >> LP_SHIFT) & ((1 << LP_BITS) - 1);
>> +
>>   	/* only handle 4k, 64k and 16M pages for now */
>>   	if (!(h & HPTE_V_LARGE))
>> -		return 1ul << 12;		/* 4k page */
>> -	if ((l & 0xf000) == 0x1000 && cpu_has_feature(CPU_FTR_ARCH_206))
>> -		return 1ul << 16;		/* 64k page */
>> -	if ((l & 0xff000) == 0)
>> -		return 1ul << 24;		/* 16M page */
>> -	return 0;				/* error */
>> +		return 1ul << 12;
>> +	else {
>> +		for (size = 0; size < MMU_PAGE_COUNT; size++) {
>> +			/* valid entries have a shift value */
>> +			if (!mmu_psize_defs[size].shift)
>> +				continue;
>> +
>> +			a_size = __hpte_actual_psize(lp, size);
>
> a_size as psize is probably a slightly confusing namer. Just call it 
> a_psize.

Will update.

>
> So if I understand this patch correctly, it simply introduces logic to 
> handle page sizes other than 4k, 64k, 16M by analyzing the actual page 
> size field in the HPTE. Mind to explain why exactly that enables us to 
> use THP?
>
> What exactly is the flow if the pages are not backed by huge pages? What 
> is the flow when they start to get backed by huge pages?
>
>> +			if (a_size != -1)
>> +				return 1ul << mmu_psize_defs[a_size].shift;
>> +		}
>> +
>> +	}
>> +	return 0;
>>   }
>>   
>>   static inline unsigned long hpte_rpn(unsigned long ptel, unsigned long psize)
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index 8227dba5af0f..a38d3289320a 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -1949,6 +1949,13 @@ static void kvmppc_add_seg_page_size(struct kvm_ppc_one_seg_page_size **sps,
>>   	 * support pte_enc here
>>   	 */
>>   	(*sps)->enc[0].pte_enc = def->penc[linux_psize];
>> +	/*
>> +	 * Add 16MB MPSS support
>> +	 */
>> +	if (linux_psize != MMU_PAGE_16M) {
>> +		(*sps)->enc[1].page_shift = 24;
>> +		(*sps)->enc[1].pte_enc = def->penc[MMU_PAGE_16M];
>> +	}
>
> So this basically indicates that every segment (except for the 16MB one) 
> can also handle 16MB MPSS page sizes? I suppose you want to remove the 
> comment in kvm_vm_ioctl_get_smmu_info_hv() that says we don't do MPSS
> here.

Will do

>
> Can we also ensure that every system we run on can do MPSS?
>

Will do

-aneesh

^ permalink raw reply

* Re: [PATCH] KVM: PPC: BOOK3S: HV: Don't try to allocate from kernel page allocator for hash page table.
From: Alexander Graf @ 2014-05-06 14:25 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: paulus@samba.org, linuxppc-dev@lists.ozlabs.org,
	kvm-ppc@vger.kernel.org, kvm@vger.kernel.org
In-Reply-To: <87wqdzq98f.fsf@linux.vnet.ibm.com>

On 05/06/2014 04:20 PM, Aneesh Kumar K.V wrote:
> Alexander Graf <agraf@suse.de> writes:
>
>> On 06.05.14 09:19, Benjamin Herrenschmidt wrote:
>>> On Tue, 2014-05-06 at 09:05 +0200, Alexander Graf wrote:
>>>> On 06.05.14 02:06, Benjamin Herrenschmidt wrote:
>>>>> On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
>>>>>> Isn't this a greater problem? We should start swapping before we hit
>>>>>> the point where non movable kernel allocation fails, no?
>>>>> Possibly but the fact remains, this can be avoided by making sure that
>>>>> if we create a CMA reserve for KVM, then it uses it rather than using
>>>>> the rest of main memory for hash tables.
>>>> So why were we preferring non-CMA memory before? Considering that Aneesh
>>>> introduced that logic in fa61a4e3 I suppose this was just a mistake?
>>> I assume so.
> ....
> ...
>
>>> Whatever remains is split between CMA and the normal page allocator.
>>>
>>> Without Aneesh latest patch, when creating guests, KVM starts allocating
>>> it's hash tables from the latter instead of CMA (we never allocate from
>>> hugetlb pool afaik, only guest pages do that, not hash tables).
>>>
>>> So we exhaust the page allocator and get linux into OOM conditions
>>> while there's plenty of space in CMA. But the kernel cannot use CMA for
>>> it's own allocations, only to back user pages, which we don't care about
>>> because our guest pages are covered by our hugetlb reserve :-)
>> Yes. Write that in the patch description and I'm happy ;).
>>
> How about the below:
>
> Current KVM code first try to allocate hash page table from the normal
> page allocator before falling back to the CMA reserve region. One of the
> side effects of that is, we could exhaust the page allocator and get
> linux into OOM conditions while we still have plenty of space in CMA.
>
> Fix this by trying the CMA reserve region first and then falling back
> to normal page allocator if we fail to get enough memory from CMA
> reserve area.

Fix the grammar (I've spotted a good number of mistakes), then this 
should do. Please also improve the headline.


Alex

^ permalink raw reply

* Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest
From: Aneesh Kumar K.V @ 2014-05-06 14:25 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, Alexander Graf, kvm-ppc, kvm
In-Reply-To: <20140506042058.GA25186@drongo>

Paul Mackerras <paulus@samba.org> writes:

> On Mon, May 05, 2014 at 08:17:00PM +0530, Aneesh Kumar K.V wrote:
>> Alexander Graf <agraf@suse.de> writes:
>> 
>> > On 05/04/2014 07:30 PM, Aneesh Kumar K.V wrote:
>> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> >
>> > No patch description, no proper explanations anywhere why you're doing 
>> > what. All of that in a pretty sensitive piece of code. There's no way 
>> > this patch can go upstream in its current form.
>> >
>> 
>> Sorry about being vague. Will add a better commit message. The goal is
>> to export MPSS support to guest if the host support the same. MPSS
>> support is exported via penc encoding in "ibm,segment-page-sizes". The
>> actual format can be found at htab_dt_scan_page_sizes. When the guest
>> memory is backed by hugetlbfs we expose the penc encoding the host
>> support to guest via kvmppc_add_seg_page_size. 
>
> In a case like this it's good to assume the reader doesn't know very
> much about Power CPUs, and probably isn't familiar with acronyms such
> as MPSS.  The patch needs an introductory paragraph explaining that on
> recent IBM Power CPUs, while the hashed page table is looked up using
> the page size from the segmentation hardware (i.e. the SLB), it is
> possible to have the HPT entry indicate a larger page size.  Thus for
> example it is possible to put a 16MB page in a 64kB segment, but since
> the hash lookup is done using a 64kB page size, it may be necessary to
> put multiple entries in the HPT for a single 16MB page.  This
> capability is called mixed page-size segment (MPSS).  With MPSS,
> there are two relevant page sizes: the base page size, which is the
> size used in searching the HPT, and the actual page size, which is the
> size indicated in the HPT entry.  Note that the actual page size is
> always >= base page size.

I will update the commit message with the above details

-aneesh

^ permalink raw reply

* [PATCH V5] KVM: PPC: BOOK3S: Use the saved dar value and generic make_dsisr
From: Aneesh Kumar K.V @ 2014-05-06 14:30 UTC (permalink / raw)
  To: agraf, benh, paulus; +Cc: linuxppc-dev, kvm, kvm-ppc, Aneesh Kumar K.V

Although it's optional IBM POWER cpus always had DAR value set on
alignment interrupt. So don't try to compute these values.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
* Changes from V4
 * Update comments around using fault_dar

 arch/powerpc/include/asm/disassemble.h | 34 +++++++++++++++++++++++++
 arch/powerpc/kernel/align.c            | 34 +------------------------
 arch/powerpc/kvm/book3s_emulate.c      | 46 ++++++----------------------------
 3 files changed, 43 insertions(+), 71 deletions(-)

diff --git a/arch/powerpc/include/asm/disassemble.h b/arch/powerpc/include/asm/disassemble.h
index 856f8deb557a..6330a61b875a 100644
--- a/arch/powerpc/include/asm/disassemble.h
+++ b/arch/powerpc/include/asm/disassemble.h
@@ -81,4 +81,38 @@ static inline unsigned int get_oc(u32 inst)
 {
 	return (inst >> 11) & 0x7fff;
 }
+
+#define IS_XFORM(inst)	(get_op(inst)  == 31)
+#define IS_DSFORM(inst)	(get_op(inst) >= 56)
+
+/*
+ * Create a DSISR value from the instruction
+ */
+static inline unsigned make_dsisr(unsigned instr)
+{
+	unsigned dsisr;
+
+
+	/* bits  6:15 --> 22:31 */
+	dsisr = (instr & 0x03ff0000) >> 16;
+
+	if (IS_XFORM(instr)) {
+		/* bits 29:30 --> 15:16 */
+		dsisr |= (instr & 0x00000006) << 14;
+		/* bit     25 -->    17 */
+		dsisr |= (instr & 0x00000040) << 8;
+		/* bits 21:24 --> 18:21 */
+		dsisr |= (instr & 0x00000780) << 3;
+	} else {
+		/* bit      5 -->    17 */
+		dsisr |= (instr & 0x04000000) >> 12;
+		/* bits  1: 4 --> 18:21 */
+		dsisr |= (instr & 0x78000000) >> 17;
+		/* bits 30:31 --> 12:13 */
+		if (IS_DSFORM(instr))
+			dsisr |= (instr & 0x00000003) << 18;
+	}
+
+	return dsisr;
+}
 #endif /* __ASM_PPC_DISASSEMBLE_H__ */
diff --git a/arch/powerpc/kernel/align.c b/arch/powerpc/kernel/align.c
index 94908af308d8..34f55524d456 100644
--- a/arch/powerpc/kernel/align.c
+++ b/arch/powerpc/kernel/align.c
@@ -25,14 +25,13 @@
 #include <asm/cputable.h>
 #include <asm/emulated_ops.h>
 #include <asm/switch_to.h>
+#include <asm/disassemble.h>
 
 struct aligninfo {
 	unsigned char len;
 	unsigned char flags;
 };
 
-#define IS_XFORM(inst)	(((inst) >> 26) == 31)
-#define IS_DSFORM(inst)	(((inst) >> 26) >= 56)
 
 #define INVALID	{ 0, 0 }
 
@@ -192,37 +191,6 @@ static struct aligninfo aligninfo[128] = {
 };
 
 /*
- * Create a DSISR value from the instruction
- */
-static inline unsigned make_dsisr(unsigned instr)
-{
-	unsigned dsisr;
-
-
-	/* bits  6:15 --> 22:31 */
-	dsisr = (instr & 0x03ff0000) >> 16;
-
-	if (IS_XFORM(instr)) {
-		/* bits 29:30 --> 15:16 */
-		dsisr |= (instr & 0x00000006) << 14;
-		/* bit     25 -->    17 */
-		dsisr |= (instr & 0x00000040) << 8;
-		/* bits 21:24 --> 18:21 */
-		dsisr |= (instr & 0x00000780) << 3;
-	} else {
-		/* bit      5 -->    17 */
-		dsisr |= (instr & 0x04000000) >> 12;
-		/* bits  1: 4 --> 18:21 */
-		dsisr |= (instr & 0x78000000) >> 17;
-		/* bits 30:31 --> 12:13 */
-		if (IS_DSFORM(instr))
-			dsisr |= (instr & 0x00000003) << 18;
-	}
-
-	return dsisr;
-}
-
-/*
  * The dcbz (data cache block zero) instruction
  * gives an alignment fault if used on non-cacheable
  * memory.  We handle the fault mainly for the
diff --git a/arch/powerpc/kvm/book3s_emulate.c b/arch/powerpc/kvm/book3s_emulate.c
index 99d40f8977e8..6bbdb3d1ec77 100644
--- a/arch/powerpc/kvm/book3s_emulate.c
+++ b/arch/powerpc/kvm/book3s_emulate.c
@@ -569,48 +569,17 @@ unprivileged:
 
 u32 kvmppc_alignment_dsisr(struct kvm_vcpu *vcpu, unsigned int inst)
 {
-	u32 dsisr = 0;
-
-	/*
-	 * This is what the spec says about DSISR bits (not mentioned = 0):
-	 *
-	 * 12:13		[DS]	Set to bits 30:31
-	 * 15:16		[X]	Set to bits 29:30
-	 * 17			[X]	Set to bit 25
-	 *			[D/DS]	Set to bit 5
-	 * 18:21		[X]	Set to bits 21:24
-	 *			[D/DS]	Set to bits 1:4
-	 * 22:26			Set to bits 6:10 (RT/RS/FRT/FRS)
-	 * 27:31			Set to bits 11:15 (RA)
-	 */
-
-	switch (get_op(inst)) {
-	/* D-form */
-	case OP_LFS:
-	case OP_LFD:
-	case OP_STFD:
-	case OP_STFS:
-		dsisr |= (inst >> 12) & 0x4000;	/* bit 17 */
-		dsisr |= (inst >> 17) & 0x3c00; /* bits 18:21 */
-		break;
-	/* X-form */
-	case 31:
-		dsisr |= (inst << 14) & 0x18000; /* bits 15:16 */
-		dsisr |= (inst << 8)  & 0x04000; /* bit 17 */
-		dsisr |= (inst << 3)  & 0x03c00; /* bits 18:21 */
-		break;
-	default:
-		printk(KERN_INFO "KVM: Unaligned instruction 0x%x\n", inst);
-		break;
-	}
-
-	dsisr |= (inst >> 16) & 0x03ff; /* bits 22:31 */
-
-	return dsisr;
+	return make_dsisr(inst);
 }
 
 ulong kvmppc_alignment_dar(struct kvm_vcpu *vcpu, unsigned int inst)
 {
+#ifdef CONFIG_PPC_BOOK3S_64
+	/*
+	 * Linux's fix_alignment() assumes that DAR is valid, so can we
+	 */
+	return vcpu->arch.fault_dar;
+#else
 	ulong dar = 0;
 	ulong ra = get_ra(inst);
 	ulong rb = get_rb(inst);
@@ -635,4 +604,5 @@ ulong kvmppc_alignment_dar(struct kvm_vcpu *vcpu, unsigned int inst)
 	}
 
 	return dar;
+#endif
 }
-- 
1.9.1

^ permalink raw reply related

* [PATCH] powerpc/fsl: Updated corenet-cf compatible string for corenet1-cf chips
From: Diana Craciun @ 2014-05-06 14:56 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: scottwood, devicetree, Diana Craciun

From: Diana Craciun <Diana.Craciun@freescale.com>

Updated the device trees according to the corenet-cf
binding definition.

Signed-off-by: Diana Craciun <Diana.Craciun@freescale.com>
---
 arch/powerpc/boot/dts/fsl/p2041si-post.dtsi | 2 +-
 arch/powerpc/boot/dts/fsl/p3041si-post.dtsi | 2 +-
 arch/powerpc/boot/dts/fsl/p4080si-post.dtsi | 2 +-
 arch/powerpc/boot/dts/fsl/p5020si-post.dtsi | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/p2041si-post.dtsi b/arch/powerpc/boot/dts/fsl/p2041si-post.dtsi
index e2987a3..b5daa4c 100644
--- a/arch/powerpc/boot/dts/fsl/p2041si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/p2041si-post.dtsi
@@ -246,7 +246,7 @@
 	};
 
 	corenet-cf@18000 {
-		compatible = "fsl,corenet-cf";
+		compatible = "fsl,corenet1-cf", "fsl,corenet-cf";
 		reg = <0x18000 0x1000>;
 		interrupts = <16 2 1 31>;
 		fsl,ccf-num-csdids = <32>;
diff --git a/arch/powerpc/boot/dts/fsl/p3041si-post.dtsi b/arch/powerpc/boot/dts/fsl/p3041si-post.dtsi
index 7af6d45..5abd1fc 100644
--- a/arch/powerpc/boot/dts/fsl/p3041si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/p3041si-post.dtsi
@@ -273,7 +273,7 @@
 	};
 
 	corenet-cf@18000 {
-		compatible = "fsl,corenet-cf";
+		compatible = "fsl,corenet1-cf", "fsl,corenet-cf";
 		reg = <0x18000 0x1000>;
 		interrupts = <16 2 1 31>;
 		fsl,ccf-num-csdids = <32>;
diff --git a/arch/powerpc/boot/dts/fsl/p4080si-post.dtsi b/arch/powerpc/boot/dts/fsl/p4080si-post.dtsi
index 2415e1f..bf0e7c9 100644
--- a/arch/powerpc/boot/dts/fsl/p4080si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/p4080si-post.dtsi
@@ -281,7 +281,7 @@
 	};
 
 	corenet-cf@18000 {
-		compatible = "fsl,corenet-cf";
+		compatible = "fsl,corenet1-cf", "fsl,corenet-cf";
 		reg = <0x18000 0x1000>;
 		interrupts = <16 2 1 31>;
 		fsl,ccf-num-csdids = <32>;
diff --git a/arch/powerpc/boot/dts/fsl/p5020si-post.dtsi b/arch/powerpc/boot/dts/fsl/p5020si-post.dtsi
index 2985de4..f7ca9f4 100644
--- a/arch/powerpc/boot/dts/fsl/p5020si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/p5020si-post.dtsi
@@ -278,7 +278,7 @@
 	};
 
 	corenet-cf@18000 {
-		compatible = "fsl,corenet-cf";
+		compatible = "fsl,corenet1-cf", "fsl,corenet-cf";
 		reg = <0x18000 0x1000>;
 		interrupts = <16 2 1 31>;
 		fsl,ccf-num-csdids = <32>;
-- 
1.7.11.7

^ permalink raw reply related

* Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest
From: Aneesh Kumar K.V @ 2014-05-06 15:06 UTC (permalink / raw)
  To: Alexander Graf, Benjamin Herrenschmidt; +Cc: linuxppc-dev, paulus, kvm, kvm-ppc
In-Reply-To: <5368ADE3.1050503@suse.de>

Alexander Graf <agraf@suse.de> writes:

> On 05/06/2014 11:26 AM, Benjamin Herrenschmidt wrote:
>> On Tue, 2014-05-06 at 11:12 +0200, Alexander Graf wrote:
>>

.....


I updated the commit message as below. Let me know if this is ok.

    KVM: PPC: BOOK3S: HV: THP support for guest
    
    On recent IBM Power CPUs, while the hashed page table is looked up using
    the page size from the segmentation hardware (i.e. the SLB), it is
    possible to have the HPT entry indicate a larger page size.  Thus for
    example it is possible to put a 16MB page in a 64kB segment, but since
    the hash lookup is done using a 64kB page size, it may be necessary to
    put multiple entries in the HPT for a single 16MB page.  This
    capability is called mixed page-size segment (MPSS).  With MPSS,
    there are two relevant page sizes: the base page size, which is the
    size used in searching the HPT, and the actual page size, which is the
    size indicated in the HPT entry. [ Note that the actual page size is
    always >= base page size ].
    
    We advertise MPSS feature to guest only if the host CPU supports the
    same. We use "ibm,segment-page-sizes" device tree node to advertise
    the MPSS support. The penc encoding indicate whether we support
    a specific combination of base page size and actual page size
    in the same segment. It is also the value used in the L|LP encoding
    of HPTE entry.
    
    In-order to support MPSS in guest, KVM need to handle the below details
    * advertise MPSS via ibm,segment-page-sizes
    * Decode the base and actual page size correctly from the HPTE entry
      so that we know what we are dealing with in H_ENTER and and can do
      the appropriate TLB invalidation in H_REMOVE and evictions.
    


>
> yes. When / if people can easily get their hands on p7/p8 bare metal 
> systems I'll be more than happy to remove 970 support as well, but for 
> now it's probably good to keep in.
>

This should handle that.

+	/*
+	 * Add 16MB MPSS support if host supports it
+	 */
+	if (linux_psize != MMU_PAGE_16M && def->penc[MMU_PAGE_16M] != -1) {
+		(*sps)->enc[1].page_shift = 24;
+		(*sps)->enc[1].pte_enc = def->penc[MMU_PAGE_16M];
+	}
 	(*sps)++;

-aneesh

^ permalink raw reply

* Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest
From: Alexander Graf @ 2014-05-06 15:23 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, kvm-ppc, kvm
In-Reply-To: <87oazbq73t.fsf@linux.vnet.ibm.com>

On 05/06/2014 05:06 PM, Aneesh Kumar K.V wrote:
> Alexander Graf <agraf@suse.de> writes:
>
>> On 05/06/2014 11:26 AM, Benjamin Herrenschmidt wrote:
>>> On Tue, 2014-05-06 at 11:12 +0200, Alexander Graf wrote:
>>>
> .....
>
>
> I updated the commit message as below. Let me know if this is ok.
>
>      KVM: PPC: BOOK3S: HV: THP support for guest

This has nothing to do with THP.

>      
>      On recent IBM Power CPUs, while the hashed page table is looked up using
>      the page size from the segmentation hardware (i.e. the SLB), it is
>      possible to have the HPT entry indicate a larger page size.  Thus for
>      example it is possible to put a 16MB page in a 64kB segment, but since
>      the hash lookup is done using a 64kB page size, it may be necessary to
>      put multiple entries in the HPT for a single 16MB page.  This
>      capability is called mixed page-size segment (MPSS).  With MPSS,
>      there are two relevant page sizes: the base page size, which is the
>      size used in searching the HPT, and the actual page size, which is the
>      size indicated in the HPT entry. [ Note that the actual page size is
>      always >= base page size ].
>      
>      We advertise MPSS feature to guest only if the host CPU supports the
>      same. We use "ibm,segment-page-sizes" device tree node to advertise
>      the MPSS support. The penc encoding indicate whether we support
>      a specific combination of base page size and actual page size
>      in the same segment. It is also the value used in the L|LP encoding
>      of HPTE entry.
>      
>      In-order to support MPSS in guest, KVM need to handle the below details
>      * advertise MPSS via ibm,segment-page-sizes
>      * Decode the base and actual page size correctly from the HPTE entry
>        so that we know what we are dealing with in H_ENTER and and can do

Which code path exactly changes for H_ENTER?

>        the appropriate TLB invalidation in H_REMOVE and evictions.

Apart from the grammar (which is pretty broken for the part that is not 
copied from Paul) and the subject line this sounds quite reasonable.


Alex

^ permalink raw reply

* RE: [PATCH v2 1/4] KVM: PPC: e500mc: Revert "add load inst fixup"
From: mihai.caraman @ 2014-05-06 15:48 UTC (permalink / raw)
  To: Alexander Graf
  Cc: linuxppc-dev@lists.ozlabs.org, kvm@vger.kernel.org,
	kvm-ppc@vger.kernel.org
In-Reply-To: <76DD19C1-EBCD-485A-B847-21304903E139@suse.de>

> -----Original Message-----
> From: Alexander Graf [mailto:agraf@suse.de]
> Sent: Sunday, May 04, 2014 1:14 AM
> To: Caraman Mihai Claudiu-B02008
> Cc: kvm-ppc@vger.kernel.org; kvm@vger.kernel.org; linuxppc-
> dev@lists.ozlabs.org
> Subject: Re: [PATCH v2 1/4] KVM: PPC: e500mc: Revert "add load inst
> fixup"
>=20
>=20
>=20
> Am 03.05.2014 um 01:14 schrieb "mihai.caraman@freescale.com"
> <mihai.caraman@freescale.com>:
>=20
> >> From: Alexander Graf <agraf@suse.de>
> >> Sent: Friday, May 2, 2014 12:24 PM

> > This was the first idea that sprang to my mind inspired from how DO_KVM
> > is hooked on PR. I actually did a simple POC for e500mc/e5500, but this
> will
> > not work on e6500 which has shared IVORs between HW threads.
>=20
> What if we combine the ideas? On read we flip the IVOR to a separate
> handler that checks for a field in the PACA. Only if that field is set,
> we treat the fault as kvm fault, otherwise we jump into the normal
> handler.
>=20
> I suppose we'd have to also take a lock to make sure we don't race with
> the other thread when it wants to also read a guest instruction, but you
> get the idea.

This might be a solution for TLB eviction but not for execute-but-not-read
entries which requires access from host context.

>=20
> I have no idea whether this would be any faster, it's more of a
> brainstorming thing really. But regardless this patch set would be a move
> into the right direction.
>=20
> Btw, do we have any guarantees that we don't get scheduled away before we
> run kvmppc_get_last_inst()? If we run on a different core we can't read
> the inst anymore. Hrm.

It was your suggestion to move the logic from kvmppc_handle_exit() irq
disabled area to kvmppc_get_last_inst():

http://git.freescale.com/git/cgit.cgi/ppc/sdk/linux.git/tree/arch/powerpc/k=
vm/booke.c

Still, what is wrong if we get scheduled on another core? We will emulate
again and the guest will populate the TLB on the new core.

-Mike

^ permalink raw reply

* Re: [PATCH v2 1/4] KVM: PPC: e500mc: Revert "add load inst fixup"
From: Alexander Graf @ 2014-05-06 15:54 UTC (permalink / raw)
  To: mihai.caraman@freescale.com
  Cc: linuxppc-dev@lists.ozlabs.org, kvm@vger.kernel.org,
	kvm-ppc@vger.kernel.org
In-Reply-To: <887e9723335940fdae0073000ba2d299@BY2PR03MB508.namprd03.prod.outlook.com>

On 05/06/2014 05:48 PM, mihai.caraman@freescale.com wrote:
>> -----Original Message-----
>> From: Alexander Graf [mailto:agraf@suse.de]
>> Sent: Sunday, May 04, 2014 1:14 AM
>> To: Caraman Mihai Claudiu-B02008
>> Cc: kvm-ppc@vger.kernel.org; kvm@vger.kernel.org; linuxppc-
>> dev@lists.ozlabs.org
>> Subject: Re: [PATCH v2 1/4] KVM: PPC: e500mc: Revert "add load inst
>> fixup"
>>
>>
>>
>> Am 03.05.2014 um 01:14 schrieb "mihai.caraman@freescale.com"
>> <mihai.caraman@freescale.com>:
>>
>>>> From: Alexander Graf <agraf@suse.de>
>>>> Sent: Friday, May 2, 2014 12:24 PM
>>> This was the first idea that sprang to my mind inspired from how DO_KVM
>>> is hooked on PR. I actually did a simple POC for e500mc/e5500, but this
>> will
>>> not work on e6500 which has shared IVORs between HW threads.
>> What if we combine the ideas? On read we flip the IVOR to a separate
>> handler that checks for a field in the PACA. Only if that field is set,
>> we treat the fault as kvm fault, otherwise we jump into the normal
>> handler.
>>
>> I suppose we'd have to also take a lock to make sure we don't race with
>> the other thread when it wants to also read a guest instruction, but you
>> get the idea.
> This might be a solution for TLB eviction but not for execute-but-not-read
> entries which requires access from host context.

Good point :).

>
>> I have no idea whether this would be any faster, it's more of a
>> brainstorming thing really. But regardless this patch set would be a move
>> into the right direction.
>>
>> Btw, do we have any guarantees that we don't get scheduled away before we
>> run kvmppc_get_last_inst()? If we run on a different core we can't read
>> the inst anymore. Hrm.
> It was your suggestion to move the logic from kvmppc_handle_exit() irq
> disabled area to kvmppc_get_last_inst():
>
> http://git.freescale.com/git/cgit.cgi/ppc/sdk/linux.git/tree/arch/powerpc/kvm/booke.c
>
> Still, what is wrong if we get scheduled on another core? We will emulate
> again and the guest will populate the TLB on the new core.

Yes, it means we have to get the EMULATE_AGAIN code paths correct :). It 
also means we lose some performance with preemptive kernel configurations.


Alex

^ permalink raw reply

* [PATCH V2] KVM: PPC: BOOK3S: HV: Prefer CMA region for hash page table allocation
From: Aneesh Kumar K.V @ 2014-05-06 15:54 UTC (permalink / raw)
  To: agraf, benh, paulus; +Cc: linuxppc-dev, kvm, kvm-ppc, Aneesh Kumar K.V

Today when KVM tries to reserve memory for the hash page table it
allocates from the normal page allocator first. If that fails it
falls back to CMA's reserved region. One of the side effects of
this is that we could end up exhausting the page allocator and
get linux into OOM conditions while we still have plenty of space
available in CMA.

This patch addresses this issue by first trying hash page table
allocation from CMA's reserved region before falling back to the normal
page allocator. So if we run out of memory, we really are out of memory.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
* Changes from V1
  * Update commit message 

 arch/powerpc/kvm/book3s_64_mmu_hv.c | 23 ++++++-----------------
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index fb25ebc0af0c..f32896ffd784 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -52,7 +52,7 @@ static void kvmppc_rmap_reset(struct kvm *kvm);
 
 long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
 {
-	unsigned long hpt;
+	unsigned long hpt = 0;
 	struct revmap_entry *rev;
 	struct page *page = NULL;
 	long order = KVM_DEFAULT_HPT_ORDER;
@@ -64,22 +64,11 @@ long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
 	}
 
 	kvm->arch.hpt_cma_alloc = 0;
-	/*
-	 * try first to allocate it from the kernel page allocator.
-	 * We keep the CMA reserved for failed allocation.
-	 */
-	hpt = __get_free_pages(GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT |
-			       __GFP_NOWARN, order - PAGE_SHIFT);
-
-	/* Next try to allocate from the preallocated pool */
-	if (!hpt) {
-		VM_BUG_ON(order < KVM_CMA_CHUNK_ORDER);
-		page = kvm_alloc_hpt(1 << (order - PAGE_SHIFT));
-		if (page) {
-			hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
-			kvm->arch.hpt_cma_alloc = 1;
-		} else
-			--order;
+	VM_BUG_ON(order < KVM_CMA_CHUNK_ORDER);
+	page = kvm_alloc_hpt(1 << (order - PAGE_SHIFT));
+	if (page) {
+		hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
+		kvm->arch.hpt_cma_alloc = 1;
 	}
 
 	/* Lastly try successively smaller sizes from the page allocator */
-- 
1.9.1

^ permalink raw reply related

* Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest
From: Aneesh Kumar K.V @ 2014-05-06 16:08 UTC (permalink / raw)
  To: Alexander Graf; +Cc: paulus, linuxppc-dev, kvm-ppc, kvm
In-Reply-To: <5368FE66.5040809@suse.de>

Alexander Graf <agraf@suse.de> writes:

> On 05/06/2014 05:06 PM, Aneesh Kumar K.V wrote:
>> Alexander Graf <agraf@suse.de> writes:
>>
>>> On 05/06/2014 11:26 AM, Benjamin Herrenschmidt wrote:
>>>> On Tue, 2014-05-06 at 11:12 +0200, Alexander Graf wrote:
>>>>
>> .....
>>
>>
>> I updated the commit message as below. Let me know if this is ok.
>>
>>      KVM: PPC: BOOK3S: HV: THP support for guest
>
> This has nothing to do with THP.

THP support in guest depend on KVM advertising MPSS feature. We already
have rest of the changes needed to support transparent huge pages
upstream. (We do support THP with PowerVM LPAR already). The primary
motivation of this patch is to enable THP in powerkvm guest. 

>
>>      
>>      On recent IBM Power CPUs, while the hashed page table is looked up using
>>      the page size from the segmentation hardware (i.e. the SLB), it is
>>      possible to have the HPT entry indicate a larger page size.  Thus for
>>      example it is possible to put a 16MB page in a 64kB segment, but since
>>      the hash lookup is done using a 64kB page size, it may be necessary to
>>      put multiple entries in the HPT for a single 16MB page.  This
>>      capability is called mixed page-size segment (MPSS).  With MPSS,
>>      there are two relevant page sizes: the base page size, which is the
>>      size used in searching the HPT, and the actual page size, which is the
>>      size indicated in the HPT entry. [ Note that the actual page size is
>>      always >= base page size ].
>>      
>>      We advertise MPSS feature to guest only if the host CPU supports the
>>      same. We use "ibm,segment-page-sizes" device tree node to advertise
>>      the MPSS support. The penc encoding indicate whether we support
>>      a specific combination of base page size and actual page size
>>      in the same segment. It is also the value used in the L|LP encoding
>>      of HPTE entry.
>>      
>>      In-order to support MPSS in guest, KVM need to handle the below details
>>      * advertise MPSS via ibm,segment-page-sizes
>>      * Decode the base and actual page size correctly from the HPTE entry
>>        so that we know what we are dealing with in H_ENTER and and can do
>
> Which code path exactly changes for H_ENTER?

There is no real code path changes. Any code path that use
hpte_page_size() is impacted. We return actual page size there. 

>
>>        the appropriate TLB invalidation in H_REMOVE and evictions.
>
> Apart from the grammar (which is pretty broken for the part that is not 
> copied from Paul) and the subject line this sounds quite reasonable.
>

Wll try to fix.

-aneesh

^ permalink raw reply

* Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest
From: Alexander Graf @ 2014-05-06 16:18 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, kvm-ppc, kvm
In-Reply-To: <87ha52ritd.fsf@linux.vnet.ibm.com>

On 05/06/2014 06:08 PM, Aneesh Kumar K.V wrote:
> Alexander Graf <agraf@suse.de> writes:
>
>> On 05/06/2014 05:06 PM, Aneesh Kumar K.V wrote:
>>> Alexander Graf <agraf@suse.de> writes:
>>>
>>>> On 05/06/2014 11:26 AM, Benjamin Herrenschmidt wrote:
>>>>> On Tue, 2014-05-06 at 11:12 +0200, Alexander Graf wrote:
>>>>>
>>> .....
>>>
>>>
>>> I updated the commit message as below. Let me know if this is ok.
>>>
>>>       KVM: PPC: BOOK3S: HV: THP support for guest
>> This has nothing to do with THP.
> THP support in guest depend on KVM advertising MPSS feature. We already
> have rest of the changes needed to support transparent huge pages
> upstream. (We do support THP with PowerVM LPAR already). The primary
> motivation of this patch is to enable THP in powerkvm guest.

But KVM doesn't care. KVM cares about MPSS. It's like saying "Support 
fork()" in a subject line while your patch implements page faults.

>
>>>       
>>>       On recent IBM Power CPUs, while the hashed page table is looked up using
>>>       the page size from the segmentation hardware (i.e. the SLB), it is
>>>       possible to have the HPT entry indicate a larger page size.  Thus for
>>>       example it is possible to put a 16MB page in a 64kB segment, but since
>>>       the hash lookup is done using a 64kB page size, it may be necessary to
>>>       put multiple entries in the HPT for a single 16MB page.  This
>>>       capability is called mixed page-size segment (MPSS).  With MPSS,
>>>       there are two relevant page sizes: the base page size, which is the
>>>       size used in searching the HPT, and the actual page size, which is the
>>>       size indicated in the HPT entry. [ Note that the actual page size is
>>>       always >= base page size ].
>>>       
>>>       We advertise MPSS feature to guest only if the host CPU supports the
>>>       same. We use "ibm,segment-page-sizes" device tree node to advertise
>>>       the MPSS support. The penc encoding indicate whether we support
>>>       a specific combination of base page size and actual page size
>>>       in the same segment. It is also the value used in the L|LP encoding
>>>       of HPTE entry.
>>>       
>>>       In-order to support MPSS in guest, KVM need to handle the below details
>>>       * advertise MPSS via ibm,segment-page-sizes
>>>       * Decode the base and actual page size correctly from the HPTE entry
>>>         so that we know what we are dealing with in H_ENTER and and can do
>> Which code path exactly changes for H_ENTER?
> There is no real code path changes. Any code path that use
> hpte_page_size() is impacted. We return actual page size there.

Ah, I see :).

>
>>>         the appropriate TLB invalidation in H_REMOVE and evictions.
>> Apart from the grammar (which is pretty broken for the part that is not
>> copied from Paul) and the subject line this sounds quite reasonable.
>>
> Wll try to fix.

Awesome. Thanks a lot!


Alex

^ permalink raw reply

* Re: [PATCH] powerpc/fsl: Updated corenet-cf compatible string for corenet1-cf chips
From: Scott Wood @ 2014-05-06 16:57 UTC (permalink / raw)
  To: Diana Craciun; +Cc: devicetree, linuxppc-dev
In-Reply-To: <1399388176-25257-1-git-send-email-diana.craciun@freescale.com>

On Tue, 2014-05-06 at 17:56 +0300, Diana Craciun wrote:
> From: Diana Craciun <Diana.Craciun@freescale.com>
> 
> Updated the device trees according to the corenet-cf
> binding definition.
> 
> Signed-off-by: Diana Craciun <Diana.Craciun@freescale.com>
> ---
>  arch/powerpc/boot/dts/fsl/p2041si-post.dtsi | 2 +-
>  arch/powerpc/boot/dts/fsl/p3041si-post.dtsi | 2 +-
>  arch/powerpc/boot/dts/fsl/p4080si-post.dtsi | 2 +-
>  arch/powerpc/boot/dts/fsl/p5020si-post.dtsi | 2 +-
>  4 files changed, 4 insertions(+), 4 deletions(-)

Oops, I meant to include this in the patch I sent, but forgot to squash
the two patches together. :-P

Where's p5040?

-Scott

^ permalink raw reply

* [PATCH V2] KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest
From: Aneesh Kumar K.V @ 2014-05-06 18:01 UTC (permalink / raw)
  To: agraf, benh, paulus; +Cc: linuxppc-dev, kvm, kvm-ppc, Aneesh Kumar K.V

On recent IBM Power CPUs, while the hashed page table is looked up using
the page size from the segmentation hardware (i.e. the SLB), it is
possible to have the HPT entry indicate a larger page size.  Thus for
example it is possible to put a 16MB page in a 64kB segment, but since
the hash lookup is done using a 64kB page size, it may be necessary to
put multiple entries in the HPT for a single 16MB page.  This
capability is called mixed page-size segment (MPSS).  With MPSS,
there are two relevant page sizes: the base page size, which is the
size used in searching the HPT, and the actual page size, which is the
size indicated in the HPT entry. [ Note that the actual page size is
always >= base page size ].

We use "ibm,segment-page-sizes" device tree node to advertise
the MPSS support to PAPR guest. The penc encoding indicates whether
we support a specific combination of base page size and actual
page size in the same segment. We also use the penc value in the
LP encoding of HPTE entry.

This patch exposes MPSS support to KVM guest by advertising the
feature via "ibm,segment-page-sizes". It also adds the necessary changes
to decode the base page size and the actual page size correctly from the
HPTE entry.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
Changes from V1:
* Update commit message
* Rename variables as per review feedback

 arch/powerpc/include/asm/kvm_book3s_64.h | 146 ++++++++++++++++++++++++++-----
 arch/powerpc/kvm/book3s_hv.c             |   7 ++
 2 files changed, 130 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 51388befeddb..fddb72b48ce9 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -77,34 +77,122 @@ static inline long try_lock_hpte(unsigned long *hpte, unsigned long bits)
 	return old == 0;
 }
 
+static inline int __hpte_actual_psize(unsigned int lp, int psize)
+{
+	int i, shift;
+	unsigned int mask;
+
+	/* start from 1 ignoring MMU_PAGE_4K */
+	for (i = 1; i < MMU_PAGE_COUNT; i++) {
+
+		/* invalid penc */
+		if (mmu_psize_defs[psize].penc[i] == -1)
+			continue;
+		/*
+		 * encoding bits per actual page size
+		 *        PTE LP     actual page size
+		 *    rrrr rrrz		>=8KB
+		 *    rrrr rrzz		>=16KB
+		 *    rrrr rzzz		>=32KB
+		 *    rrrr zzzz		>=64KB
+		 * .......
+		 */
+		shift = mmu_psize_defs[i].shift - LP_SHIFT;
+		if (shift > LP_BITS)
+			shift = LP_BITS;
+		mask = (1 << shift) - 1;
+		if ((lp & mask) == mmu_psize_defs[psize].penc[i])
+			return i;
+	}
+	return -1;
+}
+
 static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,
 					     unsigned long pte_index)
 {
-	unsigned long rb, va_low;
+	int b_psize, a_psize;
+	unsigned int penc;
+	unsigned long rb = 0, va_low, sllp;
+	unsigned int lp = (r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
+
+	if (!(v & HPTE_V_LARGE)) {
+		/* both base and actual psize is 4k */
+		b_psize = MMU_PAGE_4K;
+		a_psize = MMU_PAGE_4K;
+	} else {
+		for (b_psize = 0; b_psize < MMU_PAGE_COUNT; b_psize++) {
+
+			/* valid entries have a shift value */
+			if (!mmu_psize_defs[b_psize].shift)
+				continue;
 
+			a_psize = __hpte_actual_psize(lp, b_psize);
+			if (a_psize != -1)
+				break;
+		}
+	}
+	/*
+	 * Ignore the top 14 bits of va
+	 * v have top two bits covering segment size, hence move
+	 * by 16 bits, Also clear the lower HPTE_V_AVPN_SHIFT (7) bits.
+	 * AVA field in v also have the lower 23 bits ignored.
+	 * For base page size 4K we need 14 .. 65 bits (so need to
+	 * collect extra 11 bits)
+	 * For others we need 14..14+i
+	 */
+	/* This covers 14..54 bits of va*/
 	rb = (v & ~0x7fUL) << 16;		/* AVA field */
+	/*
+	 * AVA in v had cleared lower 23 bits. We need to derive
+	 * that from pteg index
+	 */
 	va_low = pte_index >> 3;
 	if (v & HPTE_V_SECONDARY)
 		va_low = ~va_low;
-	/* xor vsid from AVA */
+	/*
+	 * get the vpn bits from va_low using reverse of hashing.
+	 * In v we have va with 23 bits dropped and then left shifted
+	 * HPTE_V_AVPN_SHIFT (7) bits. Now to find vsid we need
+	 * right shift it with (SID_SHIFT - (23 - 7))
+	 */
 	if (!(v & HPTE_V_1TB_SEG))
-		va_low ^= v >> 12;
+		va_low ^= v >> (SID_SHIFT - 16);
 	else
-		va_low ^= v >> 24;
+		va_low ^= v >> (SID_SHIFT_1T - 16);
 	va_low &= 0x7ff;
-	if (v & HPTE_V_LARGE) {
-		rb |= 1;			/* L field */
-		if (cpu_has_feature(CPU_FTR_ARCH_206) &&
-		    (r & 0xff000)) {
-			/* non-16MB large page, must be 64k */
-			/* (masks depend on page size) */
-			rb |= 0x1000;		/* page encoding in LP field */
-			rb |= (va_low & 0x7f) << 16; /* 7b of VA in AVA/LP field */
-			rb |= ((va_low << 4) & 0xf0);	/* AVAL field (P7 doesn't seem to care) */
-		}
-	} else {
-		/* 4kB page */
-		rb |= (va_low & 0x7ff) << 12;	/* remaining 11b of VA */
+
+	switch (b_psize) {
+	case MMU_PAGE_4K:
+		sllp = ((mmu_psize_defs[a_psize].sllp & SLB_VSID_L) >> 6) |
+			((mmu_psize_defs[a_psize].sllp & SLB_VSID_LP) >> 4);
+		rb |= sllp << 5;	/*  AP field */
+		rb |= (va_low & 0x7ff) << 12;	/* remaining 11 bits of AVA */
+		break;
+	default:
+	{
+		int aval_shift;
+		/*
+		 * remaining 7bits of AVA/LP fields
+		 * Also contain the rr bits of LP
+		 */
+		rb |= (va_low & 0x7f) << 16;
+		/*
+		 * Now clear not needed LP bits based on actual psize
+		 */
+		rb &= ~((1ul << mmu_psize_defs[a_psize].shift) - 1);
+		/*
+		 * AVAL field 58..77 - base_page_shift bits of va
+		 * we have space for 58..64 bits, Missing bits should
+		 * be zero filled. +1 is to take care of L bit shift
+		 */
+		aval_shift = 64 - (77 - mmu_psize_defs[b_psize].shift) + 1;
+		rb |= ((va_low << aval_shift) & 0xfe);
+
+		rb |= 1;		/* L field */
+		penc = mmu_psize_defs[b_psize].penc[a_psize];
+		rb |= penc << 12;	/* LP field */
+		break;
+	}
 	}
 	rb |= (v >> 54) & 0x300;		/* B field */
 	return rb;
@@ -112,14 +200,26 @@ static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,
 
 static inline unsigned long hpte_page_size(unsigned long h, unsigned long l)
 {
+	int size, a_psize;
+	/* Look at the 8 bit LP value */
+	unsigned int lp = (l >> LP_SHIFT) & ((1 << LP_BITS) - 1);
+
 	/* only handle 4k, 64k and 16M pages for now */
 	if (!(h & HPTE_V_LARGE))
-		return 1ul << 12;		/* 4k page */
-	if ((l & 0xf000) == 0x1000 && cpu_has_feature(CPU_FTR_ARCH_206))
-		return 1ul << 16;		/* 64k page */
-	if ((l & 0xff000) == 0)
-		return 1ul << 24;		/* 16M page */
-	return 0;				/* error */
+		return 1ul << 12;
+	else {
+		for (size = 0; size < MMU_PAGE_COUNT; size++) {
+			/* valid entries have a shift value */
+			if (!mmu_psize_defs[size].shift)
+				continue;
+
+			a_psize = __hpte_actual_psize(lp, size);
+			if (a_psize != -1)
+				return 1ul << mmu_psize_defs[a_psize].shift;
+		}
+
+	}
+	return 0;
 }
 
 static inline unsigned long hpte_rpn(unsigned long ptel, unsigned long psize)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 8227dba5af0f..f5d4b9d77dc9 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1949,6 +1949,13 @@ static void kvmppc_add_seg_page_size(struct kvm_ppc_one_seg_page_size **sps,
 	 * support pte_enc here
 	 */
 	(*sps)->enc[0].pte_enc = def->penc[linux_psize];
+	/*
+	 * Add 16MB MPSS support if host supports it
+	 */
+	if (linux_psize != MMU_PAGE_16M && def->penc[MMU_PAGE_16M] != -1) {
+		(*sps)->enc[1].page_shift = 24;
+		(*sps)->enc[1].pte_enc = def->penc[MMU_PAGE_16M];
+	}
 	(*sps)++;
 }
 
-- 
1.9.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox