LinuxPPC-Dev Archive on lore.kernel.org


* Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest
From: Aneesh Kumar K.V @ 2014-05-06 14:25 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, Alexander Graf, kvm-ppc, kvm
In-Reply-To: <20140506042058.GA25186@drongo>

Paul Mackerras <paulus@samba.org> writes:

> On Mon, May 05, 2014 at 08:17:00PM +0530, Aneesh Kumar K.V wrote:
>> Alexander Graf <agraf@suse.de> writes:
>> 
>> > On 05/04/2014 07:30 PM, Aneesh Kumar K.V wrote:
>> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> >
>> > No patch description, no proper explanations anywhere why you're doing 
>> > what. All of that in a pretty sensitive piece of code. There's no way 
>> > this patch can go upstream in its current form.
>> >
>> 
>> Sorry about being vague. Will add a better commit message. The goal is
>> to export MPSS support to the guest if the host supports it. MPSS
>> support is exported via the penc encoding in "ibm,segment-page-sizes". The
>> actual format can be found at htab_dt_scan_page_sizes. When the guest
>> memory is backed by hugetlbfs we expose the penc encodings the host
>> supports to the guest via kvmppc_add_seg_page_size.
>
> In a case like this it's good to assume the reader doesn't know very
> much about Power CPUs, and probably isn't familiar with acronyms such
> as MPSS.  The patch needs an introductory paragraph explaining that on
> recent IBM Power CPUs, while the hashed page table is looked up using
> the page size from the segmentation hardware (i.e. the SLB), it is
> possible to have the HPT entry indicate a larger page size.  Thus for
> example it is possible to put a 16MB page in a 64kB segment, but since
> the hash lookup is done using a 64kB page size, it may be necessary to
> put multiple entries in the HPT for a single 16MB page.  This
> capability is called mixed page-size segment (MPSS).  With MPSS,
> there are two relevant page sizes: the base page size, which is the
> size used in searching the HPT, and the actual page size, which is the
> size indicated in the HPT entry.  Note that the actual page size is
> always >= base page size.

I will update the commit message with the above details.

-aneesh
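
The base/actual page size relationship described above can be put into numbers. The following is a minimal sketch, not code from the patch: mpss_max_hpt_entries() is a hypothetical helper, and it assumes that each base-page-sized piece of the larger page may need its own HPT entry, which gives an upper bound rather than the number of entries actually inserted.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the patch): upper bound on the number
 * of HPT entries a single page can need when the hash lookup uses the
 * base page size but the HPT entry describes a larger actual page size.
 * Both sizes are given as shifts (log2 of the size in bytes).
 */
static unsigned long mpss_max_hpt_entries(unsigned int base_shift,
					  unsigned int actual_shift)
{
	/* The actual page size is always >= the base page size. */
	assert(actual_shift >= base_shift);

	/* One entry per base-page-sized piece of the actual page. */
	return 1UL << (actual_shift - base_shift);
}
```

For the 16MB page in a 64kB segment mentioned above, mpss_max_hpt_entries(16, 24) gives 256; when base and actual size are equal the bound is 1.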


* Re: [PATCH] KVM: PPC: BOOK3S: HV: Don't try to allocate from kernel page allocator for hash page table.
From: Alexander Graf @ 2014-05-06 14:25 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: paulus@samba.org, linuxppc-dev@lists.ozlabs.org,
	kvm-ppc@vger.kernel.org, kvm@vger.kernel.org
In-Reply-To: <87wqdzq98f.fsf@linux.vnet.ibm.com>

On 05/06/2014 04:20 PM, Aneesh Kumar K.V wrote:
> Alexander Graf <agraf@suse.de> writes:
>
>> On 06.05.14 09:19, Benjamin Herrenschmidt wrote:
>>> On Tue, 2014-05-06 at 09:05 +0200, Alexander Graf wrote:
>>>> On 06.05.14 02:06, Benjamin Herrenschmidt wrote:
>>>>> On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
>>>>>> Isn't this a greater problem? We should start swapping before we hit
>>>>>> the point where non movable kernel allocation fails, no?
>>>>> Possibly but the fact remains, this can be avoided by making sure that
>>>>> if we create a CMA reserve for KVM, then it uses it rather than using
>>>>> the rest of main memory for hash tables.
>>>> So why were we preferring non-CMA memory before? Considering that Aneesh
>>>> introduced that logic in fa61a4e3 I suppose this was just a mistake?
>>> I assume so.
> ....
> ...
>
>>> Whatever remains is split between CMA and the normal page allocator.
>>>
>>> Without Aneesh's latest patch, when creating guests, KVM starts allocating
>>> its hash tables from the latter instead of CMA (we never allocate from
>>> hugetlb pool afaik, only guest pages do that, not hash tables).
>>>
>>> So we exhaust the page allocator and get Linux into OOM conditions
>>> while there's plenty of space in CMA. But the kernel cannot use CMA for
>>> its own allocations, only to back user pages, which we don't care about
>>> because our guest pages are covered by our hugetlb reserve :-)
>> Yes. Write that in the patch description and I'm happy ;).
>>
> How about the below:
>
> Current KVM code first try to allocate hash page table from the normal
> page allocator before falling back to the CMA reserve region. One of the
> side effects of that is, we could exhaust the page allocator and get
> linux into OOM conditions while we still have plenty of space in CMA.
>
> Fix this by trying the CMA reserve region first and then falling back
> to normal page allocator if we fail to get enough memory from CMA
> reserve area.

Fix the grammar (I've spotted a good number of mistakes), then this 
should do. Please also improve the headline.


Alex


* Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest
From: Aneesh Kumar K.V @ 2014-05-06 14:23 UTC (permalink / raw)
  To: Alexander Graf; +Cc: paulus, linuxppc-dev, kvm-ppc, kvm
In-Reply-To: <5368A78D.4070509@suse.de>

Alexander Graf <agraf@suse.de> writes:

> On 05/04/2014 07:30 PM, Aneesh Kumar K.V wrote:
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

....
....

>>   static inline unsigned long hpte_page_size(unsigned long h, unsigned long l)
>>   {
>> +	int size, a_size;
>> +	/* Look at the 8 bit LP value */
>> +	unsigned int lp = (l >> LP_SHIFT) & ((1 << LP_BITS) - 1);
>> +
>>   	/* only handle 4k, 64k and 16M pages for now */
>>   	if (!(h & HPTE_V_LARGE))
>> -		return 1ul << 12;		/* 4k page */
>> -	if ((l & 0xf000) == 0x1000 && cpu_has_feature(CPU_FTR_ARCH_206))
>> -		return 1ul << 16;		/* 64k page */
>> -	if ((l & 0xff000) == 0)
>> -		return 1ul << 24;		/* 16M page */
>> -	return 0;				/* error */
>> +		return 1ul << 12;
>> +	else {
>> +		for (size = 0; size < MMU_PAGE_COUNT; size++) {
>> +			/* valid entries have a shift value */
>> +			if (!mmu_psize_defs[size].shift)
>> +				continue;
>> +
>> +			a_size = __hpte_actual_psize(lp, size);
>
> a_size as psize is probably a slightly confusing name. Just call it 
> a_psize.

Will update.

>
> So if I understand this patch correctly, it simply introduces logic to 
> handle page sizes other than 4k, 64k, 16M by analyzing the actual page 
> size field in the HPTE. Mind to explain why exactly that enables us to 
> use THP?
>
> What exactly is the flow if the pages are not backed by huge pages? What 
> is the flow when they start to get backed by huge pages?
>
>> +			if (a_size != -1)
>> +				return 1ul << mmu_psize_defs[a_size].shift;
>> +		}
>> +
>> +	}
>> +	return 0;
>>   }
>>   
>>   static inline unsigned long hpte_rpn(unsigned long ptel, unsigned long psize)
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index 8227dba5af0f..a38d3289320a 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -1949,6 +1949,13 @@ static void kvmppc_add_seg_page_size(struct kvm_ppc_one_seg_page_size **sps,
>>   	 * support pte_enc here
>>   	 */
>>   	(*sps)->enc[0].pte_enc = def->penc[linux_psize];
>> +	/*
>> +	 * Add 16MB MPSS support
>> +	 */
>> +	if (linux_psize != MMU_PAGE_16M) {
>> +		(*sps)->enc[1].page_shift = 24;
>> +		(*sps)->enc[1].pte_enc = def->penc[MMU_PAGE_16M];
>> +	}
>
> So this basically indicates that every segment (except for the 16MB one) 
> can also handle 16MB MPSS page sizes? I suppose you want to remove the 
> comment in kvm_vm_ioctl_get_smmu_info_hv() that says we don't do MPSS
> here.

Will do

>
> Can we also ensure that every system we run on can do MPSS?
>

Will do

-aneesh


* Re: [PATCH V4] POWERPC: BOOK3S: KVM: Use the saved dar value and generic make_dsisr
From: Alexander Graf @ 2014-05-06 14:21 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Paul Mackerras, linuxppc-dev, kvm-ppc, kvm
In-Reply-To: <87zjivq9mb.fsf@linux.vnet.ibm.com>

On 05/06/2014 04:12 PM, Aneesh Kumar K.V wrote:
> Alexander Graf <agraf@suse.de> writes:
>
>> On 06.05.14 02:41, Paul Mackerras wrote:
>>> On Mon, May 05, 2014 at 01:19:30PM +0200, Alexander Graf wrote:
>>>> On 05/04/2014 07:21 PM, Aneesh Kumar K.V wrote:
>>>>> +#ifdef CONFIG_PPC_BOOK3S_64
>>>>> +	return vcpu->arch.fault_dar;
>>>> How about PA6T and G5s?
>>> G5 sets DAR on an alignment interrupt.
>>>
>>> As for PA6T, I don't know for sure, but if it doesn't, ordinary
>>> alignment interrupts wouldn't be handled properly, since the code in
>>> arch/powerpc/kernel/align.c assumes DAR contains the address being
>>> accessed on all PowerPC CPUs.
>> Now that's a good point. If we simply behave like Linux, I'm fine. This
>> definitely deserves a comment on the #ifdef in the code.
>
> How about ?
>
> #ifdef CONFIG_PPC_BOOK3S_64
> 	/*
> 	 * Linux always expect a valid  dar as per alignment
> 	 * interrupt handling code (fix_alignment()). Don't compute the dar
> 	 * value here, instead used the saved dar value. Right now we restrict
> 	 * this only for BOOK3S-64.
> 	 */

/* Linux's fix_alignment() assumes that DAR is valid, so can we */


Alex

> 	return vcpu->arch.fault_dar;
> #else
>
>
> -aneesh
>


* Re: [PATCH] KVM: PPC: BOOK3S: HV: Don't try to allocate from kernel page allocator for hash page table.
From: Aneesh Kumar K.V @ 2014-05-06 14:20 UTC (permalink / raw)
  To: Alexander Graf, Benjamin Herrenschmidt
  Cc: linuxppc-dev@lists.ozlabs.org, paulus@samba.org,
	kvm@vger.kernel.org, kvm-ppc@vger.kernel.org
In-Reply-To: <53688D89.1070201@suse.de>

Alexander Graf <agraf@suse.de> writes:

> On 06.05.14 09:19, Benjamin Herrenschmidt wrote:
>> On Tue, 2014-05-06 at 09:05 +0200, Alexander Graf wrote:
>>> On 06.05.14 02:06, Benjamin Herrenschmidt wrote:
>>>> On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
>>>>> Isn't this a greater problem? We should start swapping before we hit
>>>>> the point where non movable kernel allocation fails, no?
>>>> Possibly but the fact remains, this can be avoided by making sure that
>>>> if we create a CMA reserve for KVM, then it uses it rather than using
>>>> the rest of main memory for hash tables.
>>> So why were we preferring non-CMA memory before? Considering that Aneesh
>>> introduced that logic in fa61a4e3 I suppose this was just a mistake?
>> I assume so.

....
...

>>
>> Whatever remains is split between CMA and the normal page allocator.
>>
>> Without Aneesh's latest patch, when creating guests, KVM starts allocating
>> its hash tables from the latter instead of CMA (we never allocate from
>> hugetlb pool afaik, only guest pages do that, not hash tables).
>>
>> So we exhaust the page allocator and get Linux into OOM conditions
>> while there's plenty of space in CMA. But the kernel cannot use CMA for
>> its own allocations, only to back user pages, which we don't care about
>> because our guest pages are covered by our hugetlb reserve :-)
>
> Yes. Write that in the patch description and I'm happy ;).
>

How about the below:

Current KVM code first try to allocate hash page table from the normal
page allocator before falling back to the CMA reserve region. One of the
side effects of that is, we could exhaust the page allocator and get
linux into OOM conditions while we still have plenty of space in CMA. 

Fix this by trying the CMA reserve region first and then falling back
to normal page allocator if we fail to get enough memory from CMA
reserve area.

-aneesh
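
The allocation order proposed in the commit message above can be sketched in plain C. This is only an illustration: struct mem_pools and alloc_hpt() are hypothetical names that model each pool as a simple byte budget; they are not the kernel's CMA or page allocator interfaces.

```c
#include <stddef.h>

/* Where a (simulated) hash page table allocation ended up. */
enum hpt_source {
	HPT_FROM_CMA,
	HPT_FROM_PAGE_ALLOCATOR,
	HPT_ALLOC_FAILED,
};

/* Hypothetical model: each pool is just a count of free bytes. */
struct mem_pools {
	size_t cma_free;
	size_t page_alloc_free;
};

/*
 * Try the CMA reserve first and fall back to the normal page allocator
 * only if CMA cannot satisfy the request.
 */
static enum hpt_source alloc_hpt(struct mem_pools *pools, size_t size)
{
	if (pools->cma_free >= size) {
		pools->cma_free -= size;
		return HPT_FROM_CMA;
	}
	if (pools->page_alloc_free >= size) {
		pools->page_alloc_free -= size;
		return HPT_FROM_PAGE_ALLOCATOR;
	}
	return HPT_ALLOC_FAILED;
}
```

With this ordering the page allocator is only touched once the CMA budget is used up, which is the behaviour the proposed description asks for.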


* Re: [PATCH V4] POWERPC: BOOK3S: KVM: Use the saved dar value and generic make_dsisr
From: Aneesh Kumar K.V @ 2014-05-06 14:12 UTC (permalink / raw)
  To: Alexander Graf, Paul Mackerras; +Cc: linuxppc-dev, kvm, kvm-ppc
In-Reply-To: <536887EF.2070201@suse.de>

Alexander Graf <agraf@suse.de> writes:

> On 06.05.14 02:41, Paul Mackerras wrote:
>> On Mon, May 05, 2014 at 01:19:30PM +0200, Alexander Graf wrote:
>>> On 05/04/2014 07:21 PM, Aneesh Kumar K.V wrote:
>>>> +#ifdef CONFIG_PPC_BOOK3S_64
>>>> +	return vcpu->arch.fault_dar;
>>> How about PA6T and G5s?
>> G5 sets DAR on an alignment interrupt.
>>
>> As for PA6T, I don't know for sure, but if it doesn't, ordinary
>> alignment interrupts wouldn't be handled properly, since the code in
>> arch/powerpc/kernel/align.c assumes DAR contains the address being
>> accessed on all PowerPC CPUs.
>
> Now that's a good point. If we simply behave like Linux, I'm fine. This 
> definitely deserves a comment on the #ifdef in the code.


How about ?

#ifdef CONFIG_PPC_BOOK3S_64
	/*
	 * Linux always expect a valid  dar as per alignment
	 * interrupt handling code (fix_alignment()). Don't compute the dar
	 * value here, instead used the saved dar value. Right now we restrict
	 * this only for BOOK3S-64.
	 */
	return vcpu->arch.fault_dar;
#else


-aneesh


* Re: [PATCH V4] POWERPC: BOOK3S: KVM: Use the saved dar value and generic make_dsisr
From: Aneesh Kumar K.V @ 2014-05-06 14:06 UTC (permalink / raw)
  To: Alexander Graf, Paul Mackerras; +Cc: linuxppc-dev, kvm, kvm-ppc
In-Reply-To: <536887EF.2070201@suse.de>

Alexander Graf <agraf@suse.de> writes:

> On 06.05.14 02:41, Paul Mackerras wrote:
>> On Mon, May 05, 2014 at 01:19:30PM +0200, Alexander Graf wrote:
>>> On 05/04/2014 07:21 PM, Aneesh Kumar K.V wrote:
>>>> +#ifdef CONFIG_PPC_BOOK3S_64
>>>> +	return vcpu->arch.fault_dar;
>>> How about PA6T and G5s?
>> G5 sets DAR on an alignment interrupt.
>>
>> As for PA6T, I don't know for sure, but if it doesn't, ordinary
>> alignment interrupts wouldn't be handled properly, since the code in
>> arch/powerpc/kernel/align.c assumes DAR contains the address being
>> accessed on all PowerPC CPUs.
>
> Now that's a good point. If we simply behave like Linux, I'm fine. This 
> definitely deserves a comment on the #ifdef in the code.
>

Will update and send V5

-aneesh


* Re: Build regressions/improvements in v3.15-rc4
From: Geert Uytterhoeven @ 2014-05-06 12:15 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org
  Cc: the arch/x86 maintainers, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <1399377878-29995-1-git-send-email-geert@linux-m68k.org>

On Tue, May 6, 2014 at 2:04 PM, Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> JFYI, when comparing v3.15-rc4[1]  to v3.15-rc3[3], the summaries are:
>   - build errors: +7/-1

  + /scratch/kisskb/src/arch/powerpc/include/asm/fixmap.h: error: overflow in enumeration values  CC      drivers/hwmon/smsc47m192.o:  => 51:2
  + /scratch/kisskb/src/arch/powerpc/include/asm/fixmap.h: error: overflow in enumeration values  CC [M]  drivers/usb/gadget/f_rndis.o:  => 51:2
  + /scratch/kisskb/src/arch/powerpc/include/asm/fixmap.h: error: overflow in enumeration values:  => 51:2

powerpc-randconfig (looks scary, is CONFIG_HIGHMEM=y broken on ppc?)

  + /scratch/kisskb/src/arch/powerpc/kernel/head_44x.S: Error: invalid operands (*ABS* and *UND* sections) for `|':  => 686, 603
  + /scratch/kisskb/src/arch/powerpc/mm/tlb_nohash_low.S: Error: unsupported relocation against PPC47x_TLBE_SIZE:  => 113

powerpc-randconfig

  + /scratch/kisskb/src/arch/powerpc/platforms/powernv/setup.c: error: implicit declaration of function 'get_hard_smp_processor_id' [-Werror=implicit-function-declaration]:  => 179:4

ppc64_defconfig+UP

Lemme guess: If CONFIG_SMP=n, <linux/smp.h> does not include <asm/smp.h>,
so it needs an explicit #include <asm/smp.h>?

  + error: initramfs.c: undefined reference to `__stack_chk_guard':  => .init.text+0x19dc)

x86_64-randconfig

> [1] http://kisskb.ellerman.id.au/kisskb/head/7449/ (all 119 configs)
> [3] http://kisskb.ellerman.id.au/kisskb/head/7427/ (all 119 configs)

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


* Re: [PATCH V3 2/2] powerpc/pseries: init fault_around_order for pseries
From: Ingo Molnar @ 2014-05-06 11:29 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-arch, riel, Madhavan Srinivasan, dave.hansen, peterz, x86,
	linux-kernel, linux-mm, ak, paulus, mgorman, Linus Torvalds, akpm,
	linuxppc-dev, kirill.shutemov
In-Reply-To: <87d2fz47tg.fsf@rustcorp.com.au>


* Rusty Russell <rusty@rustcorp.com.au> wrote:

> Ingo Molnar <mingo@kernel.org> writes:
> > * Madhavan Srinivasan <maddy@linux.vnet.ibm.com> wrote:
> >
> >> Performance data for different FAULT_AROUND_ORDER values from 4 socket
> >> Power7 system (128 Threads and 128GB memory). perf stat with repeat of 5
> >> is used to get the stddev values. Test ran in v3.14 kernel (Baseline) and
> >> v3.15-rc1 for different fault around order values.
> >> 
> >> FAULT_AROUND_ORDER      Baseline        1               3               4               5               8
> >> 
> >> Linux build (make -j64)
> >> minor-faults            47,437,359      35,279,286      25,425,347      23,461,275      22,002,189      21,435,836
> >> times in seconds        347.302528420   344.061588460   340.974022391   348.193508116   348.673900158   350.986543618
> >>  stddev for time        ( +-  1.50% )   ( +-  0.73% )   ( +-  1.13% )   ( +-  1.01% )   ( +-  1.89% )   ( +-  1.55% )
> >>  %chg time to baseline                  -0.9%           -1.8%           0.2%            0.39%           1.06%
> >
> > Probably too noisy.
> 
> A little, but 3 still looks like the winner.
> 
> >> Linux rebuild (make -j64)
> >> minor-faults            941,552         718,319         486,625         440,124         410,510         397,416
> >> times in seconds        30.569834718    31.219637539    31.319370649    31.434285472    31.972367174    31.443043580
> >>  stddev for time        ( +-  1.07% )   ( +-  0.13% )   ( +-  0.43% )   ( +-  0.18% )   ( +-  0.95% )   ( +-  0.58% )
> >>  %chg time to baseline                  2.1%            2.4%            2.8%            4.58%           2.85%
> >
> > Here it looks like a speedup. Optimal value: 5+.
> 
> No, lower time is better.  Baseline (no faultaround) wins.
> 
> 
> etc.

ah, yeah, you are right. Brainfart of the week...

> It's not a huge surprise that a 64k page arch wants a smaller value 
> than a 4k system.  But I agree: I don't see much upside for FAO > 0, 
> but I do see downside.
> 
> Most extreme results:
> Order 1: 2% loss on recompile.  10% win 4% loss on seq.  9% loss random.
> Order 3: 2% loss on recompile.  6% win 5% loss on seq.  14% loss on random.
> Order 4: 2.8% loss on recompile. 10% win 7% loss on seq.  9% loss on random.
> 
> > I'm starting to suspect that maybe workloads ought to be given a 
> > choice in this matter, via madvise() or such.
> 
> I really don't think they'll be able to use it; it'll change far too 
> much with machine and kernel updates. [...]

Do we know that?

> [...] I think we should apply patch
> #1 (with fixes) to make it a variable, then set it to 0 for PPC.

Ok, agreed - at least until contrary data comes around.

Thanks,

	Ingo
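
The "%chg time to baseline" rows quoted above, and the remark that a 64k page architecture wants a smaller fault-around order, can both be checked with a few lines of C. This is a rough sketch: pct_change() and fault_around_bytes() are hypothetical helper names, and the second one assumes that a fault-around order of N means a window of 2^N pages.

```c
/* Percentage change of a measurement relative to its baseline. */
static double pct_change(double baseline, double value)
{
	return (value - baseline) / baseline * 100.0;
}

/*
 * Bytes covered by one fault-around window, assuming the window spans
 * 2^order pages of size 2^page_shift bytes.
 */
static unsigned long fault_around_bytes(unsigned int page_shift,
					unsigned int order)
{
	return 1UL << (page_shift + order);
}
```

Using the quoted build times, pct_change(347.302528420, 344.061588460) comes out at roughly -0.9, matching the order 1 column of the first table, and a positive result means the run was slower than the baseline. Under the stated assumption, order 4 covers 64kB with 4k pages but 1MB with 64k pages.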


* Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest
From: Alexander Graf @ 2014-05-06  9:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linuxppc-dev, paulus, Aneesh Kumar K.V, kvm-ppc, kvm
In-Reply-To: <1399368400.18906.9.camel@pasglop>

On 05/06/2014 11:26 AM, Benjamin Herrenschmidt wrote:
> On Tue, 2014-05-06 at 11:12 +0200, Alexander Graf wrote:
>
>> So if I understand this patch correctly, it simply introduces logic to
>> handle page sizes other than 4k, 64k, 16M by analyzing the actual page
>> size field in the HPTE. Mind to explain why exactly that enables us to
>> use THP?
>>
>> What exactly is the flow if the pages are not backed by huge pages? What
>> is the flow when they start to get backed by huge pages?
> The hypervisor doesn't care about segments ... but it needs to properly
> decode the page size requested by the guest, if anything, to issue the
> right form of tlbie instruction.
>
> The encoding in the HPTE for a 16M page inside a 64K segment is
> different from the encoding for a 16M page in a 16M segment. This is done
> so that the encoding carries both pieces of information, which allows
> broadcast tlbie to properly find the right set in the TLB for
> invalidations, among other things.
>
> So from a KVM perspective, we don't know whether the guest is doing THP
> or something else (Linux calls it THP but all we care about here is that
> this is MPSS; a guest other than Linux might exploit that differently).

Ugh. So we're just talking about a guest using MPSS here? Not about the 
host doing THP? I must've missed that part.

>
> What we do know is that if we advertise MPSS, we need to decode the page
> sizes encoded in the HPTE so that we know what we are dealing with in
> H_ENTER and can do the appropriate TLB invalidations in H_REMOVE &
> evictions.

Yes. That makes a lot of sense. So this patch really is all about 
enabling MPSS support for 16MB pages. No more, no less.

>
>>> +			if (a_size != -1)
>>> +				return 1ul << mmu_psize_defs[a_size].shift;
>>> +		}
>>> +
>>> +	}
>>> +	return 0;
>>>    }
>>>    
>>>    static inline unsigned long hpte_rpn(unsigned long ptel, unsigned long psize)
>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>>> index 8227dba5af0f..a38d3289320a 100644
>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>>> @@ -1949,6 +1949,13 @@ static void kvmppc_add_seg_page_size(struct kvm_ppc_one_seg_page_size **sps,
>>>    	 * support pte_enc here
>>>    	 */
>>>    	(*sps)->enc[0].pte_enc = def->penc[linux_psize];
>>> +	/*
>>> +	 * Add 16MB MPSS support
>>> +	 */
>>> +	if (linux_psize != MMU_PAGE_16M) {
>>> +		(*sps)->enc[1].page_shift = 24;
>>> +		(*sps)->enc[1].pte_enc = def->penc[MMU_PAGE_16M];
>>> +	}
>> So this basically indicates that every segment (except for the 16MB one)
>> can also handle 16MB MPSS page sizes? I suppose you want to remove the
>> comment in kvm_vm_ioctl_get_smmu_info_hv() that says we don't do MPSS here.
> I haven't reviewed the code there, make sure it will indeed do a
> different encoding for every combination of segment/actual page size.
>
>> Can we also ensure that every system we run on can do MPSS?
> P7 and P8 are identical in that regard. However 970 doesn't do MPSS so
> let's make sure we get that right.

yes. When / if people can easily get their hands on p7/p8 bare metal 
systems I'll be more than happy to remove 970 support as well, but for 
now it's probably good to keep in.


Alex


* Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest
From: Benjamin Herrenschmidt @ 2014-05-06  9:26 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linuxppc-dev, paulus, Aneesh Kumar K.V, kvm-ppc, kvm
In-Reply-To: <5368A78D.4070509@suse.de>

On Tue, 2014-05-06 at 11:12 +0200, Alexander Graf wrote:

> So if I understand this patch correctly, it simply introduces logic to 
> handle page sizes other than 4k, 64k, 16M by analyzing the actual page 
> size field in the HPTE. Mind to explain why exactly that enables us to 
> use THP?
>
> What exactly is the flow if the pages are not backed by huge pages? What 
> is the flow when they start to get backed by huge pages?

The hypervisor doesn't care about segments ... but it needs to properly
decode the page size requested by the guest, if anything, to issue the
right form of tlbie instruction.

The encoding in the HPTE for a 16M page inside a 64K segment is
different from the encoding for a 16M page in a 16M segment. This is done
so that the encoding carries both pieces of information, which allows
broadcast tlbie to properly find the right set in the TLB for
invalidations, among other things.

So from a KVM perspective, we don't know whether the guest is doing THP
or something else (Linux calls it THP but all we care about here is that
this is MPSS; a guest other than Linux might exploit that differently).

What we do know is that if we advertise MPSS, we need to decode the page
sizes encoded in the HPTE so that we know what we are dealing with in
H_ENTER and can do the appropriate TLB invalidations in H_REMOVE &
evictions.

> > +			if (a_size != -1)
> > +				return 1ul << mmu_psize_defs[a_size].shift;
> > +		}
> > +
> > +	}
> > +	return 0;
> >   }
> >   
> >   static inline unsigned long hpte_rpn(unsigned long ptel, unsigned long psize)
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index 8227dba5af0f..a38d3289320a 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -1949,6 +1949,13 @@ static void kvmppc_add_seg_page_size(struct kvm_ppc_one_seg_page_size **sps,
> >   	 * support pte_enc here
> >   	 */
> >   	(*sps)->enc[0].pte_enc = def->penc[linux_psize];
> > +	/*
> > +	 * Add 16MB MPSS support
> > +	 */
> > +	if (linux_psize != MMU_PAGE_16M) {
> > +		(*sps)->enc[1].page_shift = 24;
> > +		(*sps)->enc[1].pte_enc = def->penc[MMU_PAGE_16M];
> > +	}
> 
> So this basically indicates that every segment (except for the 16MB one) 
> can also handle 16MB MPSS page sizes? I suppose you want to remove the 
> comment in kvm_vm_ioctl_get_smmu_info_hv() that says we don't do MPSS here.

I haven't reviewed the code there, make sure it will indeed do a
different encoding for every combination of segment/actual page size.

> Can we also ensure that every system we run on can do MPSS?

P7 and P8 are identical in that regard. However 970 doesn't do MPSS so
let's make sure we get that right.

Cheers,
Ben.
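
The two review points above (advertise a 16MB actual page size for every non-16MB segment, but only on hosts that can do MPSS) can be sketched as follows. This is a simplified illustration, not the kernel code: the demo_* structures only mimic the fields the quoted hunk touches, host_has_mpss is a hypothetical flag standing in for whatever capability check the final patch uses, and the encoding values passed in are placeholders.

```c
#include <stdbool.h>

#define DEMO_MAX_ENC 8	/* hypothetical array bound */

/* One (actual page size, HPTE encoding) pair for a segment. */
struct demo_page_enc {
	unsigned int page_shift;
	unsigned int pte_enc;
};

/* Simplified stand-in for the per-segment page size description. */
struct demo_seg_page_size {
	unsigned int page_shift;	/* base page size of the segment */
	struct demo_page_enc enc[DEMO_MAX_ENC];
};

/*
 * Fill in the encodings for one segment base page size and return how
 * many entries were written. The 16MB MPSS entry is added only when the
 * host supports MPSS and the base size is not already 16MB.
 */
static int demo_add_seg_page_size(struct demo_seg_page_size *sps,
				  unsigned int base_shift,
				  unsigned int base_penc,
				  unsigned int penc_16m,
				  bool host_has_mpss)
{
	int n = 0;

	sps->page_shift = base_shift;
	sps->enc[n].page_shift = base_shift;
	sps->enc[n].pte_enc = base_penc;
	n++;

	if (host_has_mpss && base_shift != 24) {
		sps->enc[n].page_shift = 24;
		sps->enc[n].pte_enc = penc_16m;
		n++;
	}
	return n;
}
```

A host without MPSS (the 970 case raised above) would pass host_has_mpss as false and end up with only the base entry per segment.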


* Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest
From: Alexander Graf @ 2014-05-06  9:12 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, kvm-ppc, kvm
In-Reply-To: <1399224616-25142-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

On 05/04/2014 07:30 PM, Aneesh Kumar K.V wrote:
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>   arch/powerpc/include/asm/kvm_book3s_64.h | 146 ++++++++++++++++++++++++++-----
>   arch/powerpc/kvm/book3s_hv.c             |   7 ++
>   2 files changed, 130 insertions(+), 23 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
> index 51388befeddb..f03ea8f90576 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
> @@ -77,34 +77,122 @@ static inline long try_lock_hpte(unsigned long *hpte, unsigned long bits)
>   	return old == 0;
>   }
>   
> +static inline int __hpte_actual_psize(unsigned int lp, int psize)
> +{
> +	int i, shift;
> +	unsigned int mask;
> +
> +	/* start from 1 ignoring MMU_PAGE_4K */
> +	for (i = 1; i < MMU_PAGE_COUNT; i++) {
> +
> +		/* invalid penc */
> +		if (mmu_psize_defs[psize].penc[i] == -1)
> +			continue;
> +		/*
> +		 * encoding bits per actual page size
> +		 *        PTE LP     actual page size
> +		 *    rrrr rrrz		>=8KB
> +		 *    rrrr rrzz		>=16KB
> +		 *    rrrr rzzz		>=32KB
> +		 *    rrrr zzzz		>=64KB
> +		 * .......
> +		 */
> +		shift = mmu_psize_defs[i].shift - LP_SHIFT;
> +		if (shift > LP_BITS)
> +			shift = LP_BITS;
> +		mask = (1 << shift) - 1;
> +		if ((lp & mask) == mmu_psize_defs[psize].penc[i])
> +			return i;
> +	}
> +	return -1;
> +}
> +
>   static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,
>   					     unsigned long pte_index)
>   {
> -	unsigned long rb, va_low;
> +	int b_size, a_size;
> +	unsigned int penc;
> +	unsigned long rb = 0, va_low, sllp;
> +	unsigned int lp = (r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
> +
> +	if (!(v & HPTE_V_LARGE)) {
> +		/* both base and actual psize is 4k */
> +		b_size = MMU_PAGE_4K;
> +		a_size = MMU_PAGE_4K;
> +	} else {
> +		for (b_size = 0; b_size < MMU_PAGE_COUNT; b_size++) {
> +
> +			/* valid entries have a shift value */
> +			if (!mmu_psize_defs[b_size].shift)
> +				continue;
>   
> +			a_size = __hpte_actual_psize(lp, b_size);
> +			if (a_size != -1)
> +				break;
> +		}
> +	}
> +	/*
> +	 * Ignore the top 14 bits of va
> +	 * v have top two bits covering segment size, hence move
> +	 * by 16 bits, Also clear the lower HPTE_V_AVPN_SHIFT (7) bits.
> +	 * AVA field in v also have the lower 23 bits ignored.
> +	 * For base page size 4K we need 14 .. 65 bits (so need to
> +	 * collect extra 11 bits)
> +	 * For others we need 14..14+i
> +	 */
> +	/* This covers 14..54 bits of va*/
>   	rb = (v & ~0x7fUL) << 16;		/* AVA field */
> +	/*
> +	 * AVA in v had cleared lower 23 bits. We need to derive
> +	 * that from pteg index
> +	 */
>   	va_low = pte_index >> 3;
>   	if (v & HPTE_V_SECONDARY)
>   		va_low = ~va_low;
> -	/* xor vsid from AVA */
> +	/*
> +	 * get the vpn bits from va_low using reverse of hashing.
> +	 * In v we have va with 23 bits dropped and then left shifted
> +	 * HPTE_V_AVPN_SHIFT (7) bits. Now to find vsid we need
> +	 * right shift it with (SID_SHIFT - (23 - 7))
> +	 */
>   	if (!(v & HPTE_V_1TB_SEG))
> -		va_low ^= v >> 12;
> +		va_low ^= v >> (SID_SHIFT - 16);
>   	else
> -		va_low ^= v >> 24;
> +		va_low ^= v >> (SID_SHIFT_1T - 16);
>   	va_low &= 0x7ff;
> -	if (v & HPTE_V_LARGE) {
> -		rb |= 1;			/* L field */
> -		if (cpu_has_feature(CPU_FTR_ARCH_206) &&
> -		    (r & 0xff000)) {
> -			/* non-16MB large page, must be 64k */
> -			/* (masks depend on page size) */
> -			rb |= 0x1000;		/* page encoding in LP field */
> -			rb |= (va_low & 0x7f) << 16; /* 7b of VA in AVA/LP field */
> -			rb |= ((va_low << 4) & 0xf0);	/* AVAL field (P7 doesn't seem to care) */
> -		}
> -	} else {
> -		/* 4kB page */
> -		rb |= (va_low & 0x7ff) << 12;	/* remaining 11b of VA */
> +
> +	switch (b_size) {
> +	case MMU_PAGE_4K:
> +		sllp = ((mmu_psize_defs[a_size].sllp & SLB_VSID_L) >> 6) |
> +			((mmu_psize_defs[a_size].sllp & SLB_VSID_LP) >> 4);
> +		rb |= sllp << 5;	/*  AP field */
> +		rb |= (va_low & 0x7ff) << 12;	/* remaining 11 bits of AVA */
> +		break;
> +	default:
> +	{
> +		int aval_shift;
> +		/*
> +		 * remaining 7bits of AVA/LP fields
> +		 * Also contain the rr bits of LP
> +		 */
> +		rb |= (va_low & 0x7f) << 16;
> +		/*
> +		 * Now clear not needed LP bits based on actual psize
> +		 */
> +		rb &= ~((1ul << mmu_psize_defs[a_size].shift) - 1);
> +		/*
> +		 * AVAL field 58..77 - base_page_shift bits of va
> +		 * we have space for 58..64 bits, Missing bits should
> +		 * be zero filled. +1 is to take care of L bit shift
> +		 */
> +		aval_shift = 64 - (77 - mmu_psize_defs[b_size].shift) + 1;
> +		rb |= ((va_low << aval_shift) & 0xfe);
> +
> +		rb |= 1;		/* L field */
> +		penc = mmu_psize_defs[b_size].penc[a_size];
> +		rb |= penc << 12;	/* LP field */
> +		break;
> +	}
>   	}
>   	rb |= (v >> 54) & 0x300;		/* B field */
>   	return rb;
> @@ -112,14 +200,26 @@ static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,
>   
>   static inline unsigned long hpte_page_size(unsigned long h, unsigned long l)
>   {
> +	int size, a_size;
> +	/* Look at the 8 bit LP value */
> +	unsigned int lp = (l >> LP_SHIFT) & ((1 << LP_BITS) - 1);
> +
>   	/* only handle 4k, 64k and 16M pages for now */
>   	if (!(h & HPTE_V_LARGE))
> -		return 1ul << 12;		/* 4k page */
> -	if ((l & 0xf000) == 0x1000 && cpu_has_feature(CPU_FTR_ARCH_206))
> -		return 1ul << 16;		/* 64k page */
> -	if ((l & 0xff000) == 0)
> -		return 1ul << 24;		/* 16M page */
> -	return 0;				/* error */
> +		return 1ul << 12;
> +	else {
> +		for (size = 0; size < MMU_PAGE_COUNT; size++) {
> +			/* valid entries have a shift value */
> +			if (!mmu_psize_defs[size].shift)
> +				continue;
> +
> +			a_size = __hpte_actual_psize(lp, size);

a_size as psize is probably a slightly confusing name. Just call it 
a_psize.

So if I understand this patch correctly, it simply introduces logic to 
handle page sizes other than 4k, 64k, 16M by analyzing the actual page 
size field in the HPTE. Mind to explain why exactly that enables us to 
use THP?

What exactly is the flow if the pages are not backed by huge pages? What 
is the flow when they start to get backed by huge pages?

> +			if (a_size != -1)
> +				return 1ul << mmu_psize_defs[a_size].shift;
> +		}
> +
> +	}
> +	return 0;
>   }
>   
>   static inline unsigned long hpte_rpn(unsigned long ptel, unsigned long psize)
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 8227dba5af0f..a38d3289320a 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -1949,6 +1949,13 @@ static void kvmppc_add_seg_page_size(struct kvm_ppc_one_seg_page_size **sps,
>   	 * support pte_enc here
>   	 */
>   	(*sps)->enc[0].pte_enc = def->penc[linux_psize];
> +	/*
> +	 * Add 16MB MPSS support
> +	 */
> +	if (linux_psize != MMU_PAGE_16M) {
> +		(*sps)->enc[1].page_shift = 24;
> +		(*sps)->enc[1].pte_enc = def->penc[MMU_PAGE_16M];
> +	}

So this basically indicates that every segment (except for the 16MB one) 
can also handle 16MB MPSS page sizes? I suppose you want to remove the 
comment in kvm_vm_ioctl_get_smmu_info_hv() that says we don't do MPSS here.

Can we also ensure that every system we run on can do MPSS?


Alex

^ permalink raw reply

* Re: [PATCH 5/6] powerpc/corenet: Add DPAA FMan support to the SoC device tree(s)
From: Joakim Tjernlund @ 2014-05-06  7:40 UTC (permalink / raw)
  To: Emil Medve
  Cc: Scott Wood, devicetree, Kanetkar Shruti-B44454,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <5368811A.3060609@Freescale.com>

"Linuxppc-dev" 
<linuxppc-dev-bounces+joakim.tjernlund=transmode.se@lists.ozlabs.org> 
wrote on 2014/05/06 08:28:42:
> 

.....

> 
> > That said,
> > I don't object to having a way to label a PHY as attached via TBI if
> > that's useful.  I'm giving a mild, non-nacking (given the history)
> > objection to using device_type for that (given other history).
> 
> Personally, I think that TBI PHY support is a bit messy but I don't have
> bandwidth to deal with that. The TBI PHY should be handled as a regular
> PHY and right now it is a special case

Yes please! We will use the TBI as the only PHY in 1000BASE-X mode so
naturally we want to see the TBI as its own PHY and monitor its link 
status, AN etc.

 Jocke

^ permalink raw reply

* Re: [PATCH] KVM: PPC: BOOK3S: HV: Don't try to allocate from kernel page allocator for hash page table.
From: Alexander Graf @ 2014-05-06  7:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linuxppc-dev@lists.ozlabs.org, paulus@samba.org, Aneesh Kumar K.V,
	kvm-ppc@vger.kernel.org, kvm@vger.kernel.org
In-Reply-To: <1399360775.20388.112.camel@pasglop>


On 06.05.14 09:19, Benjamin Herrenschmidt wrote:
> On Tue, 2014-05-06 at 09:05 +0200, Alexander Graf wrote:
>> On 06.05.14 02:06, Benjamin Herrenschmidt wrote:
>>> On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
>>>> Isn't this a greater problem? We should start swapping before we hit
>>>> the point where non movable kernel allocation fails, no?
>>> Possibly but the fact remains, this can be avoided by making sure that
>>> if we create a CMA reserve for KVM, then it uses it rather than using
>>> the rest of main memory for hash tables.
>> So why were we preferring non-CMA memory before? Considering that Aneesh
>> introduced that logic in fa61a4e3 I suppose this was just a mistake?
> I assume so.
>
>>>> The fact that KVM uses a good number of normal kernel pages is maybe
>>>> suboptimal, but shouldn't be a critical problem.
>>> The point is that we explicitly reserve those pages in CMA for use
>>> by KVM for that specific purpose, but the current code tries first
>>> to get them out of the normal pool.
>>>
>>> This is not an optimal behaviour and is what Aneesh patches are
>>> trying to fix.
>> I agree, and I agree that it's worth it to make better use of our
>> resources. But we still shouldn't crash.
> Well, Linux hitting out of memory conditions has never been a happy
> story :-)
>
>> However, reading through this thread I think I've slowly grasped what
>> the problem is. The hugetlbfs size calculation.
> Not really.
>
>> I guess something in your stack overreserves huge pages because it
>> doesn't account for the fact that some part of system memory is already
>> reserved for CMA.
> Either that or simply Linux runs out because we dirty too fast...
> really, Linux has never been good at dealing with OOM situations,
> especially when things like network drivers and filesystems try to do
> ATOMIC or NOIO allocs...
>   
>> So the underlying problem is something completely orthogonal. The patch
>> body as is is fine, but the patch description should simply say that we
>> should prefer the CMA region because it's already reserved for us for
>> this purpose and we make better use of our available resources that way.
> No.
>
> We give a chunk of memory to hugetlbfs, it's all good and fine.
>
> Whatever remains is split between CMA and the normal page allocator.
>
> Without Aneesh's latest patch, when creating guests, KVM starts allocating
> its hash tables from the latter instead of CMA (we never allocate from
> the hugetlb pool afaik, only guest pages do that, not hash tables).
>
> So we exhaust the page allocator and get Linux into OOM conditions
> while there's plenty of space in CMA. But the kernel cannot use CMA for
> its own allocations, only to back user pages, which we don't care about
> because our guest pages are covered by our hugetlb reserve :-)

Yes. Write that in the patch description and I'm happy ;).


Alex

^ permalink raw reply

* Re: [PATCH] KVM: PPC: BOOK3S: HV: Don't try to allocate from kernel page allocator for hash page table.
From: Benjamin Herrenschmidt @ 2014-05-06  7:19 UTC (permalink / raw)
  To: Alexander Graf
  Cc: linuxppc-dev@lists.ozlabs.org, paulus@samba.org, Aneesh Kumar K.V,
	kvm-ppc@vger.kernel.org, kvm@vger.kernel.org
In-Reply-To: <536889C6.1050603@suse.de>

On Tue, 2014-05-06 at 09:05 +0200, Alexander Graf wrote:
> On 06.05.14 02:06, Benjamin Herrenschmidt wrote:
> > On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
> >> Isn't this a greater problem? We should start swapping before we hit
> >> the point where non movable kernel allocation fails, no?
> > Possibly but the fact remains, this can be avoided by making sure that
> > if we create a CMA reserve for KVM, then it uses it rather than using
> > the rest of main memory for hash tables.
> 
> So why were we preferring non-CMA memory before? Considering that Aneesh 
> introduced that logic in fa61a4e3 I suppose this was just a mistake?

I assume so.

> >> The fact that KVM uses a good number of normal kernel pages is maybe
> >> suboptimal, but shouldn't be a critical problem.
> > The point is that we explicitly reserve those pages in CMA for use
> > by KVM for that specific purpose, but the current code tries first
> > to get them out of the normal pool.
> >
> > This is not an optimal behaviour and is what Aneesh patches are
> > trying to fix.
> 
> I agree, and I agree that it's worth it to make better use of our 
> resources. But we still shouldn't crash.

Well, Linux hitting out of memory conditions has never been a happy
story :-)

> However, reading through this thread I think I've slowly grasped what 
> the problem is. The hugetlbfs size calculation.

Not really.

> I guess something in your stack overreserves huge pages because it 
> doesn't account for the fact that some part of system memory is already 
> reserved for CMA.

Either that or simply Linux runs out because we dirty too fast...
really, Linux has never been good at dealing with OOM situations,
especially when things like network drivers and filesystems try to do
ATOMIC or NOIO allocs...
 
> So the underlying problem is something completely orthogonal. The patch 
> body as is is fine, but the patch description should simply say that we 
> should prefer the CMA region because it's already reserved for us for 
> this purpose and we make better use of our available resources that way.

No.

We give a chunk of memory to hugetlbfs, it's all good and fine.

Whatever remains is split between CMA and the normal page allocator.

Without Aneesh's latest patch, when creating guests, KVM starts allocating
its hash tables from the latter instead of CMA (we never allocate from
the hugetlb pool afaik, only guest pages do that, not hash tables).

So we exhaust the page allocator and get Linux into OOM conditions
while there's plenty of space in CMA. But the kernel cannot use CMA for
its own allocations, only to back user pages, which we don't care about
because our guest pages are covered by our hugetlb reserve :-)
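
The difference in allocation order described above can be modelled with
two page counters. This is a toy sketch, not the KVM code: struct pool,
pool_take and both alloc_hpt_* functions are hypothetical names, the
"old" order follows the description that the current code tries the
normal pool first, and the "new" order assumes (from the patch subject)
that the hash table comes from the CMA reserve only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a memory pool: just a count of free pages. */
struct pool {
	unsigned long free_pages;
};

/* Take npages from the pool if available; report whether it succeeded. */
static bool pool_take(struct pool *p, unsigned long npages)
{
	assert(p);
	if (p->free_pages < npages)
		return false;
	p->free_pages -= npages;
	return true;
}

/* Old order: normal page allocator first, CMA reserve only as a fallback. */
static bool alloc_hpt_old(struct pool *cma, struct pool *normal,
			  unsigned long npages)
{
	return pool_take(normal, npages) || pool_take(cma, npages);
}

/* Assumed new order: hash tables come from the CMA reserve only. */
static bool alloc_hpt_new(struct pool *cma, unsigned long npages)
{
	return pool_take(cma, npages);
}
```

With the old order every guest creation eats into the normal pool until
it is empty, even though the CMA reserve is untouched; with the new
order the normal pool is left alone for kernel allocations.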

> All the bits about pinning, numa, libvirt and whatnot don't really 
> matter and are just details that led Aneesh to find this non-optimal 
> allocation.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH RFC 00/22] EEH Support for VFIO PCI devices on PowerKVM guest
From: Benjamin Herrenschmidt @ 2014-05-06  7:14 UTC (permalink / raw)
  To: Alexander Graf
  Cc: kvm, aik, Gavin Shan, kvm-ppc, Alex Williamson, qiudayu,
	linuxppc-dev
In-Reply-To: <536887A3.30703@suse.de>

On Tue, 2014-05-06 at 08:56 +0200, Alexander Graf wrote:
> > For the error injection, I guess I have to put the logic token management
> > into QEMU and error injection request will be handled by QEMU and then
> > routed to host kernel via additional syscall as we did for pSeries.
> 
> Yes, start off without in-kernel XICS so everything simply lives in 
> QEMU. Then add callbacks into the in-kernel XICS to inject these 
> interrupts if we don't have wide enough interfaces already.

It's got nothing to do with XICS ... :-)

But yes, we can route everything via qemu for now. Later we'll need
at least one of the calls to have a "direct" path, and we should probably
strive to make it real mode if that's possible: it's the one that
Linux will call whenever an MMIO returns all f's to check if the
underlying PE is frozen.

But we can do that as a second stage.

In fact going via VFIO ioctl's does make the whole security and
translation model much simpler initially.
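
The "all f's" check mentioned above can be sketched as follows. This is
an illustration only: struct pe_state, query_pe_frozen and
mmio_read32_hit_frozen_pe are hypothetical names, and the query function
merely stands in for whatever call the guest would make to ask about the
PE state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a PE: its frozen flag plus a query counter. */
struct pe_state {
	bool frozen;
	unsigned int queries;	/* how often the state was asked for */
};

/* Stand-in for the (potentially expensive) "is this PE frozen?" call. */
static bool query_pe_frozen(struct pe_state *pe)
{
	assert(pe);
	pe->queries++;
	return pe->frozen;
}

/*
 * After a 32-bit MMIO read: only a value of all ones is suspicious, so
 * only then is the PE state queried. Returns true if the read hit a
 * frozen PE.
 */
static bool mmio_read32_hit_frozen_pe(uint32_t val, struct pe_state *pe)
{
	if (val != UINT32_MAX)
		return false;	/* ordinary data, nothing to check */
	return query_pe_frozen(pe);
}
```

Because a device can legitimately return all ones, the query runs on
every such read, which is why this is the call that would benefit from
a fast path.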

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH] KVM: PPC: BOOK3S: HV: Don't try to allocate from kernel page allocator for hash page table.
From: Alexander Graf @ 2014-05-06  7:05 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linuxppc-dev@lists.ozlabs.org, paulus@samba.org, Aneesh Kumar K.V,
	kvm-ppc@vger.kernel.org, kvm@vger.kernel.org
In-Reply-To: <1399334797.20388.71.camel@pasglop>

On 06.05.14 02:06, Benjamin Herrenschmidt wrote:
> On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
>> Isn't this a greater problem? We should start swapping before we hit
>> the point where non movable kernel allocation fails, no?
> Possibly but the fact remains, this can be avoided by making sure that
> if we create a CMA reserve for KVM, then it uses it rather than using
> the rest of main memory for hash tables.

So why were we preferring non-CMA memory before? Considering that Aneesh 
introduced that logic in fa61a4e3 I suppose this was just a mistake?

>> The fact that KVM uses a good number of normal kernel pages is maybe
>> suboptimal, but shouldn't be a critical problem.
> The point is that we explicitly reserve those pages in CMA for use
> by KVM for that specific purpose, but the current code tries first
> to get them out of the normal pool.
>
> This is not an optimal behaviour and is what Aneesh patches are
> trying to fix.

I agree, and I agree that it's worth it to make better use of our 
resources. But we still shouldn't crash.

However, reading through this thread I think I've slowly grasped what 
the problem is. The hugetlbfs size calculation.

I guess something in your stack overreserves huge pages because it 
doesn't account for the fact that some part of system memory is already 
reserved for CMA.

So the underlying problem is something completely orthogonal. The patch 
body as is is fine, but the patch description should simply say that we 
should prefer the CMA region because it's already reserved for us for 
this purpose and we make better use of our available resources that way.

All the bits about pinning, numa, libvirt and whatnot don't really 
matter and are just details that led Aneesh to find this non-optimal 
allocation.

Alex

^ permalink raw reply

* Re: [PATCH V4] POWERPC: BOOK3S: KVM: Use the saved dar value and generic make_dsisr
From: Alexander Graf @ 2014-05-06  6:57 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, Aneesh Kumar K.V, kvm-ppc, kvm
In-Reply-To: <20140506004133.GA12595@iris.ozlabs.ibm.com>


On 06.05.14 02:41, Paul Mackerras wrote:
> On Mon, May 05, 2014 at 01:19:30PM +0200, Alexander Graf wrote:
>> On 05/04/2014 07:21 PM, Aneesh Kumar K.V wrote:
>>> +#ifdef CONFIG_PPC_BOOK3S_64
>>> +	return vcpu->arch.fault_dar;
>> How about PA6T and G5s?
> G5 sets DAR on an alignment interrupt.
>
> As for PA6T, I don't know for sure, but if it doesn't, ordinary
> alignment interrupts wouldn't be handled properly, since the code in
> arch/powerpc/kernel/align.c assumes DAR contains the address being
> accessed on all PowerPC CPUs.

Now that's a good point. If we simply behave like Linux, I'm fine. This 
definitely deserves a comment on the #ifdef in the code.


Alex

^ permalink raw reply

* Re: [PATCH RFC 00/22] EEH Support for VFIO PCI devices on PowerKVM guest
From: Alexander Graf @ 2014-05-06  6:56 UTC (permalink / raw)
  To: Gavin Shan, Alex Williamson; +Cc: kvm, aik, kvm-ppc, qiudayu, linuxppc-dev
In-Reply-To: <20140506042622.GA24228@shangw>


On 06.05.14 06:26, Gavin Shan wrote:
> On Mon, May 05, 2014 at 08:00:12AM -0600, Alex Williamson wrote:
>> On Mon, 2014-05-05 at 13:56 +0200, Alexander Graf wrote:
>>> On 05/05/2014 03:27 AM, Gavin Shan wrote:
>>>> The series of patches intends to support EEH for PCI devices, which have been
>>>> passed through to PowerKVM based guest via VFIO. The implementation is
>>>> straightforward based on the issues or problems we have to resolve to support
>>>> EEH for PowerKVM based guest.
>>>>
>>>> - Emulation for EEH RTAS requests. Thankfully, we already have infrastructure
>>>>     to emulate XICS. Without introducing new mechanism, we just extend that
>>>>     existing infrastructure to support EEH RTAS emulation. EEH RTAS requests
>>>>     initiated from guest are posted to host where the requests get handled or
>>>>     delivered to underlying firmware for further handling. For that, the host kernel
>>>>     has to maintain the PCI address (host domain/bus/slot/function to guest's
>>>>     PHB BUID/bus/slot/function) mapping via KVM VFIO device. The address mapping
>>>>     will be built when initializing VFIO device in QEMU and destroyed when the
>>>>     VFIO device in QEMU is going offline, or the VM is destroyed.
>>> Do you also expose all those interfaces to user space? VFIO is as much
>>> about user space device drivers as it is about device assignment.
>>>
> Yep, all the interfaces are exported to user space.
>
>>> I would like to first see an implementation that doesn't touch KVM
>>> emulation code at all but instead routes everything through QEMU. As a
>>> second step we can then accelerate performance critical paths inside of KVM.
>>>
> Ok. I'll change the implementation. However, QEMU still has to
> poll/push information from/to the host kernel. So the best place for that
> would be tce_iommu_driver_ops::ioctl as EEH is a Power-specific feature.
>
> For the error injection, I guess I have to put the logic token management
> into QEMU and error injection request will be handled by QEMU and then
> routed to host kernel via additional syscall as we did for pSeries.

Yes, start off without in-kernel XICS so everything simply lives in 
QEMU. Then add callbacks into the in-kernel XICS to inject these 
interrupts if we don't have wide enough interfaces already.



Alex

^ permalink raw reply

* Re: [PATCH 5/6] powerpc/corenet: Add DPAA FMan support to the SoC device tree(s)
From: Emil Medve @ 2014-05-06  6:28 UTC (permalink / raw)
  To: Scott Wood
  Cc: devicetree, Kanetkar Shruti-B44454, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <1399332886.15726.161.camel@snotra.buserror.net>

Hello Scott,


On 05/05/2014 06:34 PM, Scott Wood wrote:
> On Sun, 2014-05-04 at 05:59 -0500, Emil Medve wrote:
>> Hello Scott,
>>
>>
>> On 04/21/2014 05:14 PM, Scott Wood wrote:
>>> On Fri, 2014-04-18 at 07:21 -0500, Shruti Kanetkar wrote:
>>>> FMan 1 Gb/s MACs (dTSEC and mEMAC) have support for SGMII PHYs.
>>>> Add support for the internal SerDes TBI PHYs
>>>>
>>>> Based on prior work by Andy Fleming <afleming@gmail.com>
>>>>
>>>> Signed-off-by: Shruti Kanetkar <Shruti@Freescale.com>
>>>> ---
>>>>  arch/powerpc/boot/dts/fsl/b4860si-post.dtsi |  28 +++++
>>>>  arch/powerpc/boot/dts/fsl/b4si-post.dtsi    |  51 +++++++++
>>>>  arch/powerpc/boot/dts/fsl/p1023si-post.dtsi |  14 +++
>>>>  arch/powerpc/boot/dts/fsl/p2041si-post.dtsi |  64 ++++++++++++
>>>>  arch/powerpc/boot/dts/fsl/p3041si-post.dtsi |  64 ++++++++++++
>>>>  arch/powerpc/boot/dts/fsl/p4080si-post.dtsi | 104 +++++++++++++++++++
>>>>  arch/powerpc/boot/dts/fsl/p5020si-post.dtsi |  64 ++++++++++++
>>>>  arch/powerpc/boot/dts/fsl/p5040si-post.dtsi | 128 +++++++++++++++++++++++
>>>>  arch/powerpc/boot/dts/fsl/t4240si-post.dtsi | 154 ++++++++++++++++++++++++++++
>>>>  9 files changed, 671 insertions(+)
>>>>
>>>> diff --git a/arch/powerpc/boot/dts/fsl/b4860si-post.dtsi b/arch/powerpc/boot/dts/fsl/b4860si-post.dtsi
>>>> index cbc354b..45b0ff5 100644
>>>> --- a/arch/powerpc/boot/dts/fsl/b4860si-post.dtsi
>>>> +++ b/arch/powerpc/boot/dts/fsl/b4860si-post.dtsi
>>>> @@ -172,6 +172,34 @@
>>>>  		compatible = "fsl,b4860-rcpm", "fsl,qoriq-rcpm-2.0";
>>>>  	};
>>>>  
>>>> +/include/ "qoriq-fman3-0-1g-4.dtsi"
>>>> +/include/ "qoriq-fman3-0-1g-5.dtsi"
>>>> +/include/ "qoriq-fman3-0-10g-0.dtsi"
>>>> +/include/ "qoriq-fman3-0-10g-1.dtsi"
>>>> +	fman@400000 {
>>>> +		ethernet@e8000 {
>>>> +			tbi-handle = <&tbi4>;
>>>> +		};
>>>
>>> Binding needed
>>>
>>> Where is the "reg" for these unit addresses?
>>
>> As I said, the bulk of the FMan work comes from another team. Here we
>> need just enough to hook up the MDIO and PHY nodes.
> 
> Unit addresses must match reg.  No reg, no unit address.

We can add a 'reg' property, but we really don't want to clash with the
team that is working on upstreaming the FMan/MAC bindings and drivers
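
The rule being discussed (a node's unit address must agree with its
'reg') can be checked mechanically. The following is a minimal sketch
under simplifying assumptions: the helper name
unit_address_matches_reg is made up, it only handles a single
hexadecimal unit address, and it compares against the first address
cell of 'reg' passed in as a plain number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if node_name has the form
 * "name@<hex>" and <hex> equals reg_base (the first address in the
 * node's 'reg' property). A name without '@' returns false.
 */
static bool unit_address_matches_reg(const char *node_name,
				     unsigned long reg_base)
{
	const char *at;
	char *end;
	unsigned long unit;

	assert(node_name);
	at = strchr(node_name, '@');
	if (!at || at[1] == '\0')
		return false;

	/* Unit addresses are written in hex without a size suffix. */
	unit = strtoul(at + 1, &end, 16);
	return *end == '\0' && unit == reg_base;
}
```

For the nodes quoted in this thread, "mdio@f1000" with
reg = <0xf1000 0x1000> passes, while "ethernet@e8000" as posted has no
'reg' to compare against at all, which is the objection.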

>> I'd really like to be able to make progress on this without waiting
>> for that moment in time we can get the entire FMan binding in place
> 
> Why is the fman binding such a big deal?
> 
>>>> +		mdio@e9000 {
>>>> +			tbi4: tbi-phy@8 {
>>>> +				reg = <0x8>;
>>>> +				device_type = "tbi-phy";
>>>> +			};
>>>> +		};
>>>
>>> Binding needed for tbi-phy device_type
>>
>> I guess that's fair (BTW, you accepted tbi-phy nodes/device-type before
>> without a binding)
> 
> It's existing practice on eTSEC.  FMan seemed like an opportunity to
> avoid carrying cruft forward.

The 1 Gb/s MDIO block is not FMan specific. As I said, it is the same block
as in eTSEC. That's part of the reason we're trying to upstream this
independently of the FMan stuff. So, don't think FMan, think MDIO

>>> Why are we using device_type at all for this?
>>
>> That's what the upstream driver is looking for.
> 
> Drivers should look for what the binding says -- not the other way
> around.

Yeah yeah. Nobody likes it, but the driver is/describes the de facto binding

On a constructive note, the Ethernet PHY code doesn't do device tree
based probing so no compatibles are used at all. So device_type is used
to convey a TBI PHY

>>  Anyway, most days PHYs can be discovered so they don't use/need
>> compatible properties. That's I guess part of the reason we don't have
>> bindings for them PHY nodes
> 
> I don't see why there couldn't be a compatible that describes the
> standard programming interface.

Because it can be detected at runtime and I guess stuff like that should
stay out of the device tree. I'm using PCI as an analogy here

>> However, what you can't discover is how they are wired to the MAC(s) so
>> we still need some nodes in the device tree to convey that. Also, when
>> looking for a specific kind of PHY, such as TBI, device_type works
>> easier then parsing compatibles from various vendors or so
> 
> Don't you find the TBI by following the tbi-handle property?

When the MAC "attaches" to the PHY the tbi-handle is followed. But the
MDIO/PHY code/driver(s) doesn't quite "see" the tbi-handle as it's
outside the MDIO/PHY nodes

> That said,
> I don't object to having a way to label a PHY as attached via TBI if
> that's useful.  I'm giving a mild, non-nacking (given the history)
> objection to using device_type for that (given other history).

Personally, I think that TBI PHY support is a bit messy but I don't have
bandwidth to deal with that. The TBI PHY should be handled as a regular
PHY and right now it is a special case


Cheers,

^ permalink raw reply

* Re: [PATCH 4/6] powerpc/corenet: Create the dts components for the DPAA FMan
From: Emil Medve @ 2014-05-06  5:54 UTC (permalink / raw)
  To: Scott Wood; +Cc: devicetree, Shruti Kanetkar, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <1399332359.15726.154.camel@snotra.buserror.net>

Hello Scott,

On 05/05/2014 06:25 PM, Scott Wood wrote:
> On Sat, 2014-05-03 at 05:02 -0500, Emil Medve wrote:
>> Hello Scott,
>>
>>
>> On 04/21/2014 05:11 PM, Scott Wood wrote:
>>> On Fri, 2014-04-18 at 07:21 -0500, Shruti Kanetkar wrote:
>>>> +fman@400000 {
>>>> +	mdio@f1000 {
>>>> +		#address-cells = <1>;
>>>> +		#size-cells = <0>;
>>>> +		compatible = "fsl,fman-xmdio";
>>>> +		reg = <0xf1000 0x1000>;
>>>> +	};
>>>> +};
>>>
>>> I'd like to see a complete fman binding before we start adding pieces.
>>
>> The driver for the FMan 10 Gb/s MDIO has upstreamed a couple of years
>> ago: '9f35a73 net/fsl: introduce Freescale 10G MDIO driver', granted
>> without a binding writeup.
> 
> Pushing driver code through the netdev tree does not establish device
> tree ABI.  Binding documents and dts files do.

Sure, ideally and formally. But upstreaming a driver represents, if
nothing else, a statement of intent to observe a device tree ABI. Via
the SDK, FSL customers are using the device tree ABI the driver de facto
establishes. I guess a driver that makes it upstream can establish a
device tree ABI

We'll re-spin adding the binding document

>> This patch series should probably include a
>> binding blurb. However, let's not gate this patchset on a complete
>> binding for the FMan
> 
> I at least want to see enough of the FMan binding to have confidence
> that what we're adding now is correct.

I'm not sure what you're looking for. The nodes we're adding are
describing a very common CCSR space interface for quite common device blocks

>> As you know we don't own the FMan work and the FMan work is... not ready
>> for upstreaming.
> 
> I'm not asking for a driver, just a binding that describes hardware.  Is
> there any reason why the fman node needs to be anywhere near as
> complicated as it is in the SDK, if we're limiting it to actual hardware
> description?

Is this a trick question? :-) Of course it doesn't need to be more
complicated than actual hardware. But, to repeat myself, said
description is not... ready and I don't know when it will be. Somebody
else owns pushing the bulk of FMan upstream and I'd rather not step on
their turf quite like this

> Do we really need to have nodes for all the sub-blocks?

Definitely no, and internally I'm pushing to clean that up. However, you
surely remember we've been pushing from the early days of P4080 and it's
been, to put it optimistically, slow

>> In an attempt to make some sort of progress we've
>> decided to upstream the pieces that are less controversial and MDIO is
>> an obvious candidate
>>
>>>> +fman@400000 {
>>>> +	mdio0: mdio@e1120 {
>>>> +		#address-cells = <1>;
>>>> +		#size-cells = <0>;
>>>> +		compatible = "fsl,fman-mdio";
>>>> +		reg = <0xe1120 0xee0>;
>>>> +	};
>>>> +};
>>>
>>> What is the difference between "fsl,fman-mdio" and "fsl,fman-xmdio"?  I
>>> don't see the latter on the list of compatibles in patch 3/6.
>>
>> 'fsl,fman-mdio' is the 1 Gb/s MDIO (Clause 22 only). 'fsl,fman-xmdio' is
>> the 10 Gb/s MDIO (Clause 45 only). We can respin this patch wi
>>
> 
> "respin this patch wi..."?

Not sure where the end of that sentence went. I meant we'll re-spin with
a binding for the 10 Gb/s MDIO block

>> I believe 'fsl,fman-mdio' (and others on that list) was added
>> gratuitously as the FMan MDIO is completely compatible with the
>> eTSEC/gianfar MDIO driver, but we can deal with that later
> 
> It's still good to identify the specific device, even if it's believed
> to be 100% compatible.

Are you suggesting we create new compatibles for every instance/integration
of a hardware block even though it is identical to an earlier hardware
integration? Well, I guess that's been done and now we have about 8
different compatibles that convey no real difference at all

> Plus, IIRC there's been enough badness in the
> eTSEC MDIO binding that it'd be good to steer clear of it.

Hmm... I guess we can leave things as they are. I wasn't going to touch
this just now anyway

>>> Within each category, is the exact fman version discoverable from the
>>> mdio registers?
>>
>> No, but that's irrelevant as that's not the difference between the two
>> compatibles
> 
> It's relevant because it means the compatible string should have a block
> version number in it, or at least some other way in the MDIO node to
> indicate the block version.

The 1 Gb/s MDIO block doesn't track a version of its own and from a
programming interface perspective it has no visible difference since
eTSEC. The 10 Gb/s MDIO doesn't track a version of its own either and
across the existing FMan versions is identical from a programming
interface perspective

I guess we can append a 'v1.0' to the MDIO compatible(s). However, given
the SDK we'll have to support the compatibles the (already upstream)
drivers support. Dealing with all that legacy is going to be so tedious

>>>> +fman@500000 {
>>>> +	#address-cells = <1>;
>>>> +	#size-cells = <1>;
>>>> +	compatible = "simple-bus";
>>>
>>> Why is this simple-bus?
>>
>> Because that's the translation type for the FMan sub-nodes.
> 
> What do you mean by "translation type"?

I mean address translation across buses

>> We need it now to get the MDIO nodes probed
> 
> No.  "simple-bus" is stating an attribute of the hardware, that the
> child nodes represent simple memory-mapped devices that can be used
> without special bus knowledge.  I don't think that applies here.

Yes it does. The FMan sub-nodes are "simple memory-mapped devices that
can be used without special bus knowledge". Perhaps you're thinking
about the PHY devices on the MDIO bus

> You can get the MDIO node probed without misusing simple-bus by adding
> the fman node's compatible to the probe list in the kernel code.

I think that's gratuitous and it's been done gratuitously in the past
for CCSR space (sub-)nodes

> This sort of thing is why I want to see what the rest of the fman
> binding will look like.
> 
>>  and we'll need it later to probe other nodes/devices that will have
>> standalone drivers: MAC, MURAM. etc. 
> 
> How are they truly standalone?

I meant that they have individual drivers and they are not handled by
the high-level FMan driver

> They exist in service to the greater
> entity that is fman.  They presumably work together in some fashion.

Some blocks can work independently. The MURAM is an example and it seems
the existing CPM/QE MURAM code allows it to be used as regular memory.
The MDIO block could handle PHY(s) for other MACs in the system.

>>>> +	/* mdio nodes for fman v3 @ 0x500000 */
>>>> +	mdio@fc000 {
>>>> +		#address-cells = <1>;
>>>> +		#size-cells = <0>;
>>>> +		reg = <0xfc000 0x1000>;
>>>> +	};
>>>> +
>>>> +	mdio@fd000 {
>>>> +		#address-cells = <1>;
>>>> +		#size-cells = <0>;
>>>> +		reg = <0xfd000 0x1000>;
>>>> +	};
>>>> +};
>>>
>>> Where's the compatible?  Why is this file different from all the others?
>>
>> The FMan v3 MDIO block (supports both Clause 22/45) is compatible with
>> the FMan v2 10 Gb/s MDIO (the xgmac-mdio driver). However, the driver
>> needs a small clean-up patch (still in internal review) that will get it
>> working for FMan v3 MDIO.
> 
> This suggests that it is not 100% backwards compatible.

It is. The code is just not everything it should be

Cheers,

>>  With that patch will add the compatible to these nodes. However, we
>> need these nodes now for the board level MDIO bus muxing support
>> (included in this patchset)
> 
> If you need these nodes now then add the compatible property now.
> 
> -Scott

^ permalink raw reply

* Re: [PATCH RFC 00/22] EEH Support for VFIO PCI devices on PowerKVM guest
From: Gavin Shan @ 2014-05-06  4:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, aik, Alexander Graf, kvm-ppc, Gavin Shan, qiudayu,
	linuxppc-dev
In-Reply-To: <1399298412.24318.521.camel@ul30vt.home>

On Mon, May 05, 2014 at 08:00:12AM -0600, Alex Williamson wrote:
>On Mon, 2014-05-05 at 13:56 +0200, Alexander Graf wrote:
>> On 05/05/2014 03:27 AM, Gavin Shan wrote:
>> > The series of patches intends to support EEH for PCI devices, which have been
>> > passed through to PowerKVM based guest via VFIO. The implementation is
>> > straightforward based on the issues or problems we have to resolve to support
>> > EEH for PowerKVM based guest.
>> >
>> > - Emulation for EEH RTAS requests. Thankfully, we already have infrastructure
>> >    to emulate XICS. Without introducing new mechanism, we just extend that
>> >    existing infrastructure to support EEH RTAS emulation. EEH RTAS requests
>> >    initiated from guest are posted to host where the requests get handled or
>> >    delivered to underlying firmware for further handling. For that, the host kernel
>> >    has to maintain the PCI address (host domain/bus/slot/function to guest's
>> >    PHB BUID/bus/slot/function) mapping via KVM VFIO device. The address mapping
>> >    will be built when initializing VFIO device in QEMU and destroyed when the
>> >    VFIO device in QEMU is going offline, or the VM is destroyed.
>> 
>> Do you also expose all those interfaces to user space? VFIO is as much 
>> about user space device drivers as it is about device assignment.
>> 

Yep, all the interfaces are exported to user space. 

>> I would like to first see an implementation that doesn't touch KVM 
>> emulation code at all but instead routes everything through QEMU. As a 
>> second step we can then accelerate performance critical paths inside of KVM.
>> 

Ok. I'll change the implementation. However, QEMU still has to
poll/push information from/to the host kernel. So the best place for that
would be tce_iommu_driver_ops::ioctl as EEH is a Power-specific feature.

For the error injection, I guess I have to put the logic token management
into QEMU and error injection request will be handled by QEMU and then
routed to host kernel via additional syscall as we did for pSeries.

>> That way we ensure that user space device drivers have all the power 
>> over a device they need to drive it.
>
>+1
>

Thanks,
Gavin

^ permalink raw reply

* Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest
From: Paul Mackerras @ 2014-05-06  4:20 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, Alexander Graf, kvm-ppc, kvm
In-Reply-To: <87lhug9taz.fsf@linux.vnet.ibm.com>

On Mon, May 05, 2014 at 08:17:00PM +0530, Aneesh Kumar K.V wrote:
> Alexander Graf <agraf@suse.de> writes:
> 
> > On 05/04/2014 07:30 PM, Aneesh Kumar K.V wrote:
> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> >
> > No patch description, no proper explanations anywhere why you're doing 
> > what. All of that in a pretty sensitive piece of code. There's no way 
> > this patch can go upstream in its current form.
> >
> 
> Sorry about being vague. Will add a better commit message. The goal is
> to export MPSS support to guest if the host support the same. MPSS
> support is exported via penc encoding in "ibm,segment-page-sizes". The
> actual format can be found at htab_dt_scan_page_sizes. When the guest
> memory is backed by hugetlbfs we expose the penc encoding the host
> support to guest via kvmppc_add_seg_page_size. 

In a case like this it's good to assume the reader doesn't know very
much about Power CPUs, and probably isn't familiar with acronyms such
as MPSS.  The patch needs an introductory paragraph explaining that on
recent IBM Power CPUs, while the hashed page table is looked up using
the page size from the segmentation hardware (i.e. the SLB), it is
possible to have the HPT entry indicate a larger page size.  Thus for
example it is possible to put a 16MB page in a 64kB segment, but since
the hash lookup is done using a 64kB page size, it may be necessary to
put multiple entries in the HPT for a single 16MB page.  This
capability is called mixed page-size segment (MPSS).  With MPSS,
there are two relevant page sizes: the base page size, which is the
size used in searching the HPT, and the actual page size, which is the
size indicated in the HPT entry.  Note that the actual page size is
always >= base page size.
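
To make the 16MB-in-64kB example concrete, here is a minimal sketch of
the arithmetic only. The helper name and the shift constants are
hypothetical and are not taken from the patch or from the kernel; the
sketch assumes page sizes are powers of two expressed as shifts. It
computes how many base-page-sized slices one actual page spans, which
would be an upper bound on the number of HPT entries for that page if
every slice were entered separately.

```c
#include <assert.h>
#include <stdint.h>

/* Page sizes expressed as shifts: 64kB = 2^16 bytes, 16MB = 2^24 bytes. */
#define SHIFT_64K 16u
#define SHIFT_16M 24u

/*
 * Hypothetical helper, not a kernel function: number of base-page-sized
 * slices covered by one actual page. With MPSS the actual page size is
 * always >= the base page size, so the difference of shifts is >= 0.
 */
static inline uint64_t base_pages_per_actual_page(unsigned int base_shift,
                                                  unsigned int actual_shift)
{
	assert(actual_shift >= base_shift);      /* actual >= base */
	assert(actual_shift - base_shift < 64);  /* keep the shift defined */
	return UINT64_C(1) << (actual_shift - base_shift);
}
```

Calling base_pages_per_actual_page(SHIFT_64K, SHIFT_16M) gives 256,
i.e. a 16MB page spans 256 slices of 64kB. When base and actual size
are equal the result is 1, which is the non-MPSS case.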

> Now the challenge to THP support is to make sure that our henter,
> hremove etc decode base page size and actual page size correctly
> from the hash table entry values. Most of the changes is to do that.
> Rest of the stuff is already handled by kvm. 
> 
> NOTE: It is much easier to read the code after applying the patch rather
> than reading the diff. I have added comments around each steps in the
> code.

Paul.

^ permalink raw reply

* Re: [PATCH v3] powerpc/fsl: Added binding for Freescale CoreNet coherency fabric (CCF)
From: Scott Wood @ 2014-05-06  2:22 UTC (permalink / raw)
  To: Kumar Gala; +Cc: Diana Craciun, devicetree, linuxppc-dev
In-Reply-To: <1FE471CC-2024-4B39-89E2-FBACF8F18A9B@kernel.crashing.org>

On Mon, 2014-05-05 at 21:12 -0500, Kumar Gala wrote:
> On May 5, 2014, at 10:58 AM, Diana Craciun <diana.craciun@freescale.com> wrote:
> 
> > From: Diana Craciun <Diana.Craciun@freescale.com>
> > 
> > The CoreNet coherency fabric is a fabric-oriented, conectivity
> > infrastructure that enables the implementation of coherent, multicore
> > systems. The CCF acts as a central interconnect for cores,
> > platform-level caches, memory subsystem, peripheral devices and I/O host
> > bridges in the system.
> > 
> > Signed-off-by: Diana Craciun <Diana.Craciun@freescale.com>
> > ---
> > v3:
> > 	- added port ID mapping
> > 	- removed fsl,corenetx-cf
> > 
> > .../devicetree/bindings/powerpc/fsl/ccf.txt        | 42 ++++++++++++++++++++++
> > .../devicetree/bindings/powerpc/fsl/cpus.txt       |  8 +++++
> > .../devicetree/bindings/powerpc/fsl/pamu.txt       |  8 +++++
> > 3 files changed, 58 insertions(+)
> > create mode 100644 Documentation/devicetree/bindings/powerpc/fsl/ccf.txt
> 
> [snip]
> 
> > --- a/Documentation/devicetree/bindings/powerpc/fsl/cpus.txt
> > +++ b/Documentation/devicetree/bindings/powerpc/fsl/cpus.txt
> > @@ -20,3 +20,11 @@ PROPERTIES
> > 	a property named fsl,eref-[CAT], where [CAT] is the abbreviated category
> > 	name with all uppercase letters converted to lowercase, indicates that
> > 	the category is supported by the implementation.
> > +
> > +	- fsl,portid-mapping : <u32>
> > +	The Coherency Subdomain ID Port Mapping Registers and Snoop ID Port Mapping
> > +	registers which are part of the CoreNet Coherency fabric (CCF) provide a
> > +	CoreNet Coherency Subdomain ID/CoreNet Snoop ID to cpu mapping functions.
> > +	Certain bits from these registers should be set if the coresponding CPU
> > +	should be snooped. This property defines a bitmask which selects the bit that
> > +	should be set if this cpu should be snooped.
> 
> Under what cases can software not figure out how to set this based on the PAMUs in the DT?

How would it go about doing that?

Besides the difference between corenet1-cf and corenet2-cf, on
corenet1-cf the position of the PAMU bits depends on the number of CPUs
that the chip was designed for.  This may be different from the number
of CPUs that are actually present (e.g. p4040, or AMP).  It's also a
complication that IMHO is asking for trouble, versus straightforwardly
recording information that is present in a table in the manual.
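
As a rough illustration of how such a recorded bitmask could be
consumed, here is a minimal sketch. The helper name is hypothetical,
and reading the property from the device tree and accessing the actual
CCF registers are deliberately left out; it only shows the set/clear
step on a 32-bit register value, using the 0xf80000 mask from the PAMU
example in the patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: given the current value
 * of a port mapping register and the node's fsl,portid-mapping mask,
 * return the new register value. The mask bits are set when the node
 * should be snooped and cleared otherwise; all other bits are kept.
 */
static inline uint32_t apply_portid_mapping(uint32_t reg, uint32_t mask,
                                            int snoop)
{
	return snoop ? (reg | mask) : (reg & ~mask);
}
```

With this approach software needs no knowledge of how many CPUs the
chip was designed for: it applies whatever mask the node carries.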

-Scott

^ permalink raw reply

* Re: [PATCH v3] powerpc/fsl: Added binding for Freescale CoreNet coherency fabric (CCF)
From: Kumar Gala @ 2014-05-06  2:12 UTC (permalink / raw)
  To: Diana Craciun; +Cc: scottwood, devicetree, linuxppc-dev
In-Reply-To: <1399305499-6612-1-git-send-email-diana.craciun@freescale.com>


On May 5, 2014, at 10:58 AM, Diana Craciun <diana.craciun@freescale.com> wrote:

> From: Diana Craciun <Diana.Craciun@freescale.com>
>
> The CoreNet coherency fabric is a fabric-oriented, conectivity
> infrastructure that enables the implementation of coherent, multicore
> systems. The CCF acts as a central interconnect for cores,
> platform-level caches, memory subsystem, peripheral devices and I/O host
> bridges in the system.
>
> Signed-off-by: Diana Craciun <Diana.Craciun@freescale.com>
> ---
> v3:
> 	- added port ID mapping
> 	- removed fsl,corenetx-cf
>
> .../devicetree/bindings/powerpc/fsl/ccf.txt        | 42 ++++++++++++++++++++++
> .../devicetree/bindings/powerpc/fsl/cpus.txt       |  8 +++++
> .../devicetree/bindings/powerpc/fsl/pamu.txt       |  8 +++++
> 3 files changed, 58 insertions(+)
> create mode 100644 Documentation/devicetree/bindings/powerpc/fsl/ccf.txt

[snip]

> --- a/Documentation/devicetree/bindings/powerpc/fsl/cpus.txt
> +++ b/Documentation/devicetree/bindings/powerpc/fsl/cpus.txt
> @@ -20,3 +20,11 @@ PROPERTIES
> 	a property named fsl,eref-[CAT], where [CAT] is the abbreviated category
> 	name with all uppercase letters converted to lowercase, indicates that
> 	the category is supported by the implementation.
> +
> +	- fsl,portid-mapping : <u32>
> +	The Coherency Subdomain ID Port Mapping Registers and Snoop ID Port Mapping
> +	registers which are part of the CoreNet Coherency fabric (CCF) provide a
> +	CoreNet Coherency Subdomain ID/CoreNet Snoop ID to cpu mapping functions.
> +	Certain bits from these registers should be set if the coresponding CPU
> +	should be snooped. This property defines a bitmask which selects the bit that
> +	should be set if this cpu should be snooped.

Under what cases can software not figure out how to set this based on the PAMUs in the DT?

> diff --git a/Documentation/devicetree/bindings/powerpc/fsl/pamu.txt b/Documentation/devicetree/bindings/powerpc/fsl/pamu.txt
> index 1f5e329..827c637 100644
> --- a/Documentation/devicetree/bindings/powerpc/fsl/pamu.txt
> +++ b/Documentation/devicetree/bindings/powerpc/fsl/pamu.txt
> @@ -26,6 +26,13 @@ Required properties:
> 		  A standard property.
> - #size-cells	: <u32>
> 		  A standard property.
> +- fsl,portid-mapping : <u32>
> +	The Coherency Subdomain ID Port Mapping Registers and Snoop ID Port Mapping
> +	registers which are part of the CoreNet Coherency fabric (CCF) provide a
> +	CoreNet Coherency Subdomain ID/CoreNet Snoop ID to pamu mapping functions.
> +	Certain bits from these registers should be set if PAMUs should be snooped.
> +	This property defines a bitmask which selects the bits that should be set
> +	if PAMUs should be snooped.
>
> Optional properties:
> - reg		: <prop-encoded-array>
> @@ -88,6 +95,7 @@ Example:
> 		compatible = "fsl,pamu-v1.0", "fsl,pamu";
> 		reg = <0x20000 0x5000>;
> 		ranges = <0 0x20000 0x5000>;
> +		fsl,portid-mapping = <0xf80000>;
> 		#address-cells = <1>;
> 		#size-cells = <1>;
> 		interrupts = <
> --
> 1.7.11.7
>
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

^ permalink raw reply
