LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH 2/3] irq: Add hw continuous IRQs map to virtual continuous IRQs support
From: Michael Ellerman @ 2013-03-05  2:23 UTC (permalink / raw)
  To: Mike Qiu; +Cc: tglx, linuxppc-dev, linux-kernel
In-Reply-To: <1358235536-32741-3-git-send-email-qiudayu@linux.vnet.ibm.com>

On Tue, Jan 15, 2013 at 03:38:55PM +0800, Mike Qiu wrote:
> Adding a function irq_create_mapping_many() which can associate
> multiple MSIs to a continous irq mapping.
> 
> This is needed to enable multiple MSI support for pSeries.
> 
> Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
> ---
>  include/linux/irq.h       |    2 +
>  include/linux/irqdomain.h |    3 ++
>  kernel/irq/irqdomain.c    |   61 +++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 66 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/irq.h b/include/linux/irq.h
> index 60ef45b..e00a7ec 100644
> --- a/include/linux/irq.h
> +++ b/include/linux/irq.h
> @@ -592,6 +592,8 @@ int __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
>  #define irq_alloc_desc_from(from, node)		\
>  	irq_alloc_descs(-1, from, 1, node)
>  
> +#define irq_alloc_desc_n(nevc, node)		\
> +	irq_alloc_descs(-1, 0, nevc, node)

This has been superseeded by irq_alloc_descs_from(), which is the right
way to do it.

> diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
> index 0d5b17b..831dded 100644
> --- a/include/linux/irqdomain.h
> +++ b/include/linux/irqdomain.h
> @@ -168,6 +168,9 @@ extern int irq_create_strict_mappings(struct irq_domain *domain,
>  				      unsigned int irq_base,
>  				      irq_hw_number_t hwirq_base, int count);
>  
> +extern int irq_create_mapping_many(struct irq_domain *domain,
> +					irq_hw_number_t hwirq_base, int count);
> +
>  static inline int irq_create_identity_mapping(struct irq_domain *host,
>  					      irq_hw_number_t hwirq)
>  {
> diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
> index 96f3a1d..38648e6 100644
> --- a/kernel/irq/irqdomain.c
> +++ b/kernel/irq/irqdomain.c
> @@ -636,6 +636,67 @@ int irq_create_strict_mappings(struct irq_domain *domain, unsigned int irq_base,
>  }
>  EXPORT_SYMBOL_GPL(irq_create_strict_mappings);
>  
> +/**
> + * irq_create_mapping_many - Map a range of hw IRQs to a range of virtual IRQs
> + * @domain: domain owning the interrupt range
> + * @hwirq_base: beginning of continuous hardware IRQ range
> + * @count: Number of interrupts to map

For multiple-MSI the allocated interrupt numbers must be a power-of-2,
and must be naturally aligned. I don't /think/ that's a requirement for
the virtual numbers, but it's probably best that we do it anyway.

So this API needs to specify that it will give you back a power-of-2
block that is naturally aligned - otherwise you can't use it for MSI.

> + * This routine is used for allocating and mapping a range of hardware
> + * irqs to virtual IRQs where the virtual irq numbers are not at pre-defined
> + * locations.

This comment doesn't make sense to me.

> + *
> + * Greater than 0 is returned upon success, while any failure to establish a
> + * static mapping is treated as an error.
> + */
> +int irq_create_mapping_many(struct irq_domain *domain,
> +		irq_hw_number_t hwirq_base, int count)
> +{
> +	int ret, irq_base;
> +	int virq, i;
> +
> +	pr_debug("irq_create_mapping(0x%p, 0x%lx)\n", domain, hwirq_base);


I'd like to see this whole function rewritten to reduce the duplication
vs irq_create_mapping(). I don't see any reason why this can't be the
core routine, and irq_create_mapping() becomes a caller of it, passing a
count of 1 ?

> +	/* Look for default domain if nececssary */
> +	if (!domain)
> +		domain = irq_default_domain;
> +	if (!domain) {
> +		pr_warn("irq_create_mapping called for NULL domain, hwirq=%lx\n"
> +			, hwirq_base);
> +		WARN_ON(1);
> +		return 0;
> +	}
> +	pr_debug("-> using domain @%p\n", domain);
> +
> +	/* For IRQ_DOMAIN_MAP_LEGACY, get the first virtual interrupt number */
> +	if (domain->revmap_type == IRQ_DOMAIN_MAP_LEGACY)
> +		return irq_domain_legacy_revmap(domain, hwirq_base);

The above doesn't work.

> +	/* Check if mapping already exists */
> +	for (i = 0; i < count; i++) {
> +		virq = irq_find_mapping(domain, hwirq_base+i);
> +		if (virq) {
> +			pr_debug("existing mapping on virq %d,"
> +					" now dispose it first\n", virq);
> +			irq_dispose_mapping(virq);

You might have just disposed of someone elses mapping, we shouldn't do
that. It should be an error to the caller.

cheers

^ permalink raw reply

* Re: [PATCH -V1 06/24] powerpc: Reduce PTE table memory wastage
From: Paul Mackerras @ 2013-03-05  2:12 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, linux-mm
In-Reply-To: <874ngr2zz1.fsf@linux.vnet.ibm.com>

On Mon, Mar 04, 2013 at 04:28:42PM +0530, Aneesh Kumar K.V wrote:
> Paul Mackerras <paulus@samba.org> writes:
> 
> > The other general comment I have is that it's not really clear when a
> > page will be on the mm->context.pgtable_list and when it won't.  I
> > would like to see an invariant that says something like "the page is
> > on the pgtable_list if and only if (page->_mapcount & FRAG_MASK) is
> > neither 0 nor FRAG_MASK".  But that doesn't seem to be the case
> > exactly, and I can't see any consistent rule, which makes me think
> > there are going to be bugs in corner cases.
> >
> 
> 
> I added the below comment when initializing the list.
> 
> +#ifdef CONFIG_PPC_64K_PAGES
> +       /*
> +        * Used to support 4K PTE fragment. The pages are added to list,
> +        * when we have free framents in the page. We track the whether
> +        * a page frament is available using page._mapcount. A value of
> +        * zero indicate none of the fragments are used and page can be
> +        * freed. A value of FRAG_MASK indicate all the fragments are used
> +        * and hence the page will be removed from the below list.
> +        */
> +       INIT_LIST_HEAD(&init_mm.context.pgtable_list);
> +#endif
> 
> I am not sure about why you say there is no consistent rule. Can you
> elaborate on that ?

Well, sometimes you take things off the list when mask == 0, and
sometimes when (mask & FRAG_MASK) == 0.  So it's not clear whether the
page is supposed to be on the list when (mask & FRAG_MASK) == 0 but
mask != 0.  If you stated in a comment what the rule was supposed to
be then reviewers could check whether your code implemented that rule.
Also, if you had a consistent rule you could more easily add debug
code to check that the rule was being followed.

Also, that comment above doesn't say anything about the upper bits and
whether they have any influence on whether the page should be on the
list or not.

> > Consider, for example, the case where a page has two fragments still
> > in use, and one of them gets queued up by RCU for freeing via a call
> > to page_table_free_rcu, and then the other one gets freed through
> > page_table_free().  Neither the call to page_table_free_rcu nor the
> > call to page_table_free will take the page off the list AFAICS, and
> > then __page_table_free_rcu() will free the page while it's still on
> > the pgtable_list.
> 
> The last one that ends up doing atomic_xor_bits which cause the mapcount
> to go zero, will take the page off the list and free the page. 

No, look at the example again.  page_table_free_rcu() won't take it
off the list because it uses the (mask & FRAG_MASK) == 0 test, which
fails (one fragment is still in use).  page_table_free() won't take it
off the list because it uses the mask == 0 test, which also fails (one
fragment is still waiting for the RCU grace period).  Finally,
__page_table_free_rcu() doesn't take it off the list, it just frees
the page.  Oops. :)

Paul.

^ permalink raw reply

* Re: [PATCH -V1 09/24] powerpc: Decode the pte-lp-encoding bits correctly.
From: Paul Mackerras @ 2013-03-05  2:02 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, linux-mm
In-Reply-To: <87y5e31jem.fsf@linux.vnet.ibm.com>

On Mon, Mar 04, 2013 at 05:11:53PM +0530, Aneesh Kumar K.V wrote:
> Paul Mackerras <paulus@samba.org> writes:
> >> +static inline int hpte_actual_psize(struct hash_pte *hptep, int psize)
> >> +{
> >> +	unsigned int mask;
> >> +	int i, penc, shift;
> >> +	/* Look at the 8 bit LP value */
> >> +	unsigned int lp = (hptep->r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
> >> +
> >> +	penc = 0;
> >> +	for (i = 0; i < MMU_PAGE_COUNT; i++) {
> >> +		/* valid entries have a shift value */
> >> +		if (!mmu_psize_defs[i].shift)
> >> +			continue;
> >> +
> >> +		/* encoding bits per actual page size */
> >> +		shift = mmu_psize_defs[i].shift - 11;
> >> +		if (shift > 9)
> >> +			shift = 9;
> >> +		mask = (1 << shift) - 1;
> >> +		if ((lp & mask) == mmu_psize_defs[psize].penc[i])
> >> +			return i;
> >> +	}
> >> +	return -1;
> >> +}
> >
> > This doesn't look right to me.  First, it's not clear what the 11 and
> > 9 refer to, and I think the 9 should be LP_BITS (i.e. 8).  Secondly,
> > the mask for the comparison needs to depend on the actual page size
> > not the base page size.
> 
> That 11 should be 12.That depends on the fact that we have below mapping

And the 12 should be LP_SHIFT, shouldn't it?

>  rrrr rrrz 	≥8KB
> 
> Yes, that 9 should be LP_BITs. 
> 
> We are generating mask based on actual page size above (variable i in
> the for loop).

OK, yes, you're right.

> > I don't see where in this function you set the penc[] elements for
> > invalid actual page sizes to -1.
> 
> We do the below
> 
> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -125,7 +125,7 @@ static struct mmu_psize_def mmu_psize_defaults_old[] = {
>         [MMU_PAGE_4K] = {
>                 .shift  = 12,
>                 .sllp   = 0,
> -               .penc   = 0,
> +               .penc   = { [0 ... MMU_PAGE_COUNT - 1] = -1 },
>                 .avpnm  = 0,

Yes, which sets them for the entries you initialize, but not for the
others.  For example, the entry for MMU_PAGE_64K will initially be all
zeroes.  Then we find an entry in the ibm,segment-page-sizes property
for 64k pages, so we set mmu_psize_defs[MMU_PAGE_64K].shift to 16,
making that entry valid, but we never set any of the .penc[] entries
to -1, leading your other code to think that it can do (say) 1M pages
in a 64k segment using an encoding of 0.

Also, I noticed that the code in the if (base_idx < 0) statement is
wrong.  It needs to advance prop (and decrease size) by 2 * lpnum,
not just 2.

Paul.

^ permalink raw reply

* Re: [PATCH] drivers/tty/hvc: using strlcpy instead of strncpy
From: Chen Gang @ 2013-03-05  1:58 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: Greg KH, tklauser, wfp5p, linuxppc-dev, alan
In-Reply-To: <512F5FDC.3060000@suse.cz>

于 2013年02月28日 21:47, Jiri Slaby 写道:
>>>   when strlen(&pi->location_code[0]) == HVCS_CLC_LENGTH + 2
> It cannot, pi->location_code is defined as char[HVCS_CLC_LENGTH + 1].
> 

  really, it is, I did not notice it.

  but I still prefer to modify it, but the patch should be changed
  such as:
    subject: beautify code: deleting useless judging code.
    comments: src buf len and dest buf len are the same, strcpy is better.
    contents: using strcpy instead of strncpy, and delete judging code.

  is it ok ?

BTW:
  sorry for my reply is too late, and did not notify it, originally before.
    I have to do some urgent things, during these days.
      my father had a serious heart disease, and is in hospital.
      during these days, most of my time has to be in hospital.
      (God Bless, and thank Jesus Christ, my father is safe, now).
    within my company (Asianux), I also have something to do.

  excuse me:
    within this week, maybe can not get my mail reply, too.




-- 
Chen Gang

Asianux Corporation

^ permalink raw reply

* [PATCH v2] powerpc: Wireup the kcmp syscall to sys_ni
From: Tony Breeds @ 2013-03-05  1:57 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, LinuxPPC-dev; +Cc: Stephen Rothwell
In-Reply-To: <1362443994-26580-1-git-send-email-tony@bakeyournoodle.com>

Since kmp takes 2 unsigned long args there should be a compat wrapper.
Since one isn't provided I think it's safer just to hook this up to not
implemented.  If we need it later we can do it properly then.

Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
---
Changes since v1:
 - Add comment to make finding the syscall later - thanks sfr

 arch/powerpc/include/asm/systbl.h      | 1 +
 arch/powerpc/include/asm/unistd.h      | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 535b6d8..ebbec52 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -358,3 +358,4 @@ SYSCALL_SPU(setns)
 COMPAT_SYS(process_vm_readv)
 COMPAT_SYS(process_vm_writev)
 SYSCALL(finit_module)
+SYSCALL(ni_syscall) /* sys_kcmp */
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index f25b5c4..1487f0f 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include <uapi/asm/unistd.h>
 
 
-#define __NR_syscalls		354
+#define __NR_syscalls		355
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index 8c478c6..74cb4d7 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -376,6 +376,7 @@
 #define __NR_process_vm_readv	351
 #define __NR_process_vm_writev	352
 #define __NR_finit_module	353
+#define __NR_kcmp		354
 
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
1.8.1.4

^ permalink raw reply related

* Re: [PATCH -V1 07/24] powerpc: Add size argument to pgtable_cache_add
From: Paul Mackerras @ 2013-03-05  1:50 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, linux-mm
In-Reply-To: <871ubv2zsv.fsf@linux.vnet.ibm.com>

On Mon, Mar 04, 2013 at 04:32:24PM +0530, Aneesh Kumar K.V wrote:
> 
> Now with table_size argument, the first arg is no more the shift value,
> rather it is index into the array. Hence i changed the variable name. I
> will split that patch to make it easy for review.

OK, so you're saying that the simple relation between index and the
size of the objects in PGT_CACHE(index) no longer holds.  That worries
me, because now, what guarantees that two callers won't use the same
index value with two different sizes?  And what guarantees that we
won't have two callers using different index values but the same size
(which wouldn't be a disaster but would be a waste of space)?

I think it would be preferable to keep the relation between shift and
the size of the objects and just arrange to use a different shift
value for the pmd objects when you need to.

Paul.

^ permalink raw reply

* Re: [PATCH] powerpc: Wireup the kcmp syscall to sys_ni
From: Stephen Rothwell @ 2013-03-05  1:47 UTC (permalink / raw)
  To: Tony Breeds; +Cc: LinuxPPC-dev
In-Reply-To: <1362443994-26580-1-git-send-email-tony@bakeyournoodle.com>

[-- Attachment #1: Type: text/plain, Size: 599 bytes --]

Hi Tony,

On Tue,  5 Mar 2013 11:39:54 +1100 Tony Breeds <tony@bakeyournoodle.com> wrote:
>
> diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
> index 535b6d8..289bb52 100644
> --- a/arch/powerpc/include/asm/systbl.h
> +++ b/arch/powerpc/include/asm/systbl.h
> @@ -358,3 +358,4 @@ SYSCALL_SPU(setns)
>  COMPAT_SYS(process_vm_readv)
>  COMPAT_SYS(process_vm_writev)
>  SYSCALL(finit_module)
> +SYSCALL(ni_syscall)

It is probably worth commenting what syscall should be there.

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [PATCH] kexec/ppc: Fix kernel program entry point while changing the load addr
From: Simon Horman @ 2013-03-05  1:40 UTC (permalink / raw)
  To: Suzuki K. Poulose
  Cc: Matthew McClintock, Sebastian Andrzej Siewior, kexec,
	linuxppc-dev
In-Reply-To: <51343788.7070503@in.ibm.com>

On Mon, Mar 04, 2013 at 11:26:24AM +0530, Suzuki K. Poulose wrote:
> On 03/04/2013 07:11 AM, Simon Horman wrote:
> >[ Cc: linuxppc-dev@lists.ozlabs.org ]
> >
> >On Sun, Mar 03, 2013 at 01:06:00PM +0530, Suzuki K. Poulose wrote:
> >>From: Suzuki K. Poulose <suzuki@in.ibm.com>
> >>
> >>uImage probe fills the entry point (ep) based on the load_addr
> >>from the uImage headers. If we change the load_addr, we should
> >>accordingly update the entry point.
> >>
> >>For ELF, calculate the offset of e_entry from the virtual start
> >>address and add it to the physical start address to find the
> >>physical address of kernel entry.
> >>
> >>i.e,
> >>   pa (e_entry) = pa(phdr[0].p_vaddr) + (e_entry - phdr[0].p_vaddr)
> >>                = kernel_addr + (e_entry - phdr[0].p_vaddr)
> >
> >Would it be possible for someone to provide a review of this change?
> To make it bit more clear :
> 
> The entry point of the kernel is usually at 0 offset from the first
> PT_LOAD section. The current code makes this assumption and uses the
> pa(phdr[0].p_vaddr) as the kernel entry.
> 
> But this *may* not be true always, in such a case the kexec would fail.
> While I fixed the uImage case, I thought it would be better to
> handle the same case in ELF.
> 
> Btw, this calculation is not specific to ppc32.

Thanks, I have applied the patch.

^ permalink raw reply

* Re: [PATCH V2] lglock: add read-preference local-global rwlock
From: Michel Lespinasse @ 2013-03-05  1:37 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Lai Jiangshan, linux-doc, peterz, fweisbec, mingo, linux-arch,
	linux, xiaoguangrong, wangyun, paulmck, nikunj, linux-pm, rusty,
	rostedt, rjw, namhyung, tglx, linux-arm-kernel, Lai Jiangshan,
	netdev, linux-kernel, vincent.guittot, sbw, Srivatsa S. Bhat, tj,
	akpm, linuxppc-dev
In-Reply-To: <20130303174056.GA30176@redhat.com>

On Sun, Mar 3, 2013 at 9:40 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> However, I still think that FALLBACK_BASE only adds the unnecessary
>> complications. But even if I am right this is subjective of course, please
>> feel free to ignore.

Would it help if I sent out that version (without FALLBACK_BASE) as a
formal proposal ?

> Hmm. But then I do not understand the lglock annotations. Obviously,
> rwlock_acquire_read() in lg_local_lock() can't even detect the simplest
> deadlock, say, lg_local_lock(LOCK) + lg_local_lock(LOCK). Not to mention
> spin_lock(X) + lg_local_lock(Y) vs lg_local_lock(Y) + spin_lock(X).
>
> OK, I understand that it is not easy to make these annotations correct...

I am going to send out a proposal to fix the existing lglock
annotations and detect the two cases you noticed. It's actually not
that hard :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply

* Re: [PATCH] ppc32: Fix compile of sha1-powerpc-asm.S
From: Christian Kujau @ 2013-03-05  1:23 UTC (permalink / raw)
  To: Tony Breeds; +Cc: LinuxPPC-dev
In-Reply-To: <1361845205-30606-1-git-send-email-tony@bakeyournoodle.com>

On Tue, 26 Feb 2013 at 13:20, Tony Breeds wrote:
> When building with CRYPTO_SHA1_PPC enabled we fail with:
> ---
> powerpc/crypto/sha1-powerpc-asm.S: Assembler messages:
> powerpc/crypto/sha1-powerpc-asm.S:116: Error: can't resolve `0' {*ABS* section} - `STACKFRAMESIZE' {*UND* section}
> powerpc/crypto/sha1-powerpc-asm.S:116: Error: expression too complex
> powerpc/crypto/sha1-powerpc-asm.S:178: Error: unsupported relocation against STACKFRAMESIZE
> ---
> 
> Use INT_FRAME_SIZE instead.
> 
> Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>

Thanks for the fix! Ran into this as well, with your patch 3.9-rc1 
compiles again (and it even boots :-))

Tested-by: Christian Kujau <lists@nerdbynature.de>

$ grep -A10 sha1 /proc/crypto
name         : sha1
driver       : sha1-powerpc
module       : kernel
priority     : 0
refcnt       : 1
selftest     : passed
type         : shash
blocksize    : 64
digestsize   : 20


> ---
>  arch/powerpc/crypto/sha1-powerpc-asm.S | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> FWIW the SHA1_PPC makes about a 20% difference on my 32bit board
> 
> diff --git a/arch/powerpc/crypto/sha1-powerpc-asm.S b/arch/powerpc/crypto/sha1-powerpc-asm.S
> index a5f8264..125e165 100644
> --- a/arch/powerpc/crypto/sha1-powerpc-asm.S
> +++ b/arch/powerpc/crypto/sha1-powerpc-asm.S
> @@ -113,7 +113,7 @@
>  	STEPUP4((t)+16, fn)
>  
>  _GLOBAL(powerpc_sha_transform)
> -	PPC_STLU r1,-STACKFRAMESIZE(r1)
> +	PPC_STLU r1,-INT_FRAME_SIZE(r1)
>  	SAVE_8GPRS(14, r1)
>  	SAVE_10GPRS(22, r1)
>  
> @@ -175,5 +175,5 @@ _GLOBAL(powerpc_sha_transform)
>  
>  	REST_8GPRS(14, r1)
>  	REST_10GPRS(22, r1)
> -	addi	r1,r1,STACKFRAMESIZE
> +	addi	r1,r1,INT_FRAME_SIZE
>  	blr
> -- 
> 1.8.1.2
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 

-- 
BOFH excuse #21:

POSIX compliance problem

^ permalink raw reply

* [PATCH] powerpc: Wireup the kcmp syscall to sys_ni
From: Tony Breeds @ 2013-03-05  0:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, LinuxPPC-dev; +Cc: Stephen Rothwell

Since kmp takes 2 unsigned long args there should be a compat wrapper.
Since one isn't provided I think it's safer just to hook this up to not
implemented.  If we need it later we can do it properly then.

Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
---
 arch/powerpc/include/asm/systbl.h      | 1 +
 arch/powerpc/include/asm/unistd.h      | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 535b6d8..289bb52 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -358,3 +358,4 @@ SYSCALL_SPU(setns)
 COMPAT_SYS(process_vm_readv)
 COMPAT_SYS(process_vm_writev)
 SYSCALL(finit_module)
+SYSCALL(ni_syscall)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index f25b5c4..1487f0f 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include <uapi/asm/unistd.h>
 
 
-#define __NR_syscalls		354
+#define __NR_syscalls		355
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index 8c478c6..74cb4d7 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -376,6 +376,7 @@
 #define __NR_process_vm_readv	351
 #define __NR_process_vm_writev	352
 #define __NR_finit_module	353
+#define __NR_kcmp		354
 
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
1.8.1.4

^ permalink raw reply related

* Re: [PATCH 0/3] Enable multiple MSI feature in pSeries
From: Michael Ellerman @ 2013-03-05  0:28 UTC (permalink / raw)
  To: Mike Qiu; +Cc: tglx, linuxppc-dev, linux-kernel
In-Reply-To: <513411AD.5010408@linux.vnet.ibm.com>

On Mon, Mar 04, 2013 at 11:14:53AM +0800, Mike Qiu wrote:
> 于 2013/3/1 11:54, Michael Ellerman 写道:
> >On Fri, Mar 01, 2013 at 11:08:45AM +0800, Mike wrote:
> >>Hi all
> >>
> >>Any comments? or any questions about my patchset?
> >You were going to get some performance numbers that show a definite
> >benefit for using more than one MSI.

> Yes, but my patch just enable the kernel to support this feature, whether
> to use it depens on the device driver.

Sure, but we don't add code just for fun, so unless there's a good
reason to add the feature - like better performance - we won't bother.

> And this feature has been merged to the kernel for X86 for a long time.
> See commit: 5ca72c4f7c412c2002363218901eba5516c476b1
> 51906e779f2b13b38f8153774c4c7163d412ffd9

That commit was merged in 3.9-rc1, ie. a few days ago, so no it has not been
in x86 for a long time.

That code removes the need for your first patch, which is a good start.
Please send a new version using irq_set_msi_desc_off().

cheers

^ permalink raw reply

* Re: [PATCH V4] powerpc/85xx: Add machine check handler to fix PCIe erratum on mpc85xx
From: Scott Wood @ 2013-03-04 23:45 UTC (permalink / raw)
  To: Stuart Yoder; +Cc: linuxppc-dev, Jia Hongtao
In-Reply-To: <CALRxmdBjb+TrqZ_XdNjH=oDcdYtVJ5=Q71dcCkKodpe7WCHDpA@mail.gmail.com>

On 03/04/2013 10:16:10 AM, Stuart Yoder wrote:
> On Mon, Mar 4, 2013 at 2:40 AM, Jia Hongtao <B38951@freescale.com> =20
> wrote:
> > A PCIe erratum of mpc85xx may causes a core hang when a link of PCIe
> > goes down. when the link goes down, Non-posted transactions issued
> > via the ATMU requiring completion result in an instruction stall.
> > At the same time a machine-check exception is generated to the core
> > to allow further processing by the handler. We implements the =20
> handler
> > which skips the instruction caused the stall.
>=20
> Can you explain at a high level how just skipping an instruction =20
> solves
> anything?   If you just skip a load/store and continue like nothing is
> wrong, isn't your system possibly in a really bad state.

If the instruction was a load, we probably at least want to fill the =20
destination register with 0xffffffff or similar.

> And if the core is already hung, due to the PCI link going down, isn't
> it too late?   How does skipping help?

Maybe the machine check unhangs the core?

Is there an erratum number for this?

-Scott=

^ permalink raw reply

* Re: [PATCH -V1 06/24] powerpc: Reduce PTE table memory wastage
From: Benjamin Herrenschmidt @ 2013-03-04 23:36 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, Paul Mackerras, linux-mm
In-Reply-To: <874ngr2zz1.fsf@linux.vnet.ibm.com>

On Mon, 2013-03-04 at 16:28 +0530, Aneesh Kumar K.V wrote:
> I added the below comment when initializing the list.
> 
> +#ifdef CONFIG_PPC_64K_PAGES
> +       /*
> +        * Used to support 4K PTE fragment. The pages are added to list,
> +        * when we have free framents in the page. We track the whether
> +        * a page frament is available using page._mapcount. A value of
> +        * zero indicate none of the fragments are used and page can be
> +        * freed. A value of FRAG_MASK indicate all the fragments are used
> +        * and hence the page will be removed from the below list.
> +        */
> +       INIT_LIST_HEAD(&init_mm.context.pgtable_list);
> +#endif
> 
> I am not sure about why you say there is no consistent rule. Can you
> elaborate on that ?

Do you really need that list ? I assume it's meant to allow you to find
free frags when allocating but my worry is that you'll end up losing
quite a bit of node locality of PTE pages....

It may or may not work but can you investigate doing things differently
here ? The idea I want you to consider is to always allocate a full
page, but make the relationship of the fragments to PTE pages fixed. IE.
the fragment in the page is a function of the VA.

Basically, the algorithm for allocation is roughly:

 - Walk the tree down to the PMD ptr (* that can be improved with a
generic change, see below)

 - Check if any of the neighbouring PMDs is populated. If yes, you have
your page and pick the appropriate fragment based on the VA

 - If not, allocate and populate

On free, similarly, you checked if all neighbouring PMDs have been
cleared, in which case you can fire off the page for RCU freeing.

(*) By changing pte_alloc_one to take the PMD ptr (which the call side
has right at hand) you can avoid the tree lookup.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH -V1 09/24] powerpc: Decode the pte-lp-encoding bits correctly.
From: Aneesh Kumar K.V @ 2013-03-04 18:11 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, linux-mm
In-Reply-To: <87vc971iwd.fsf@linux.vnet.ibm.com>

"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> writes:

> Paul Mackerras <paulus@samba.org> writes:
>
>> On Tue, Feb 26, 2013 at 01:34:59PM +0530, Aneesh Kumar K.V wrote:
>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>>>=20
>>> +static inline int hpte_actual_psize(struct hash_pte *hptep, int psize)
>>> +{
>>> +	unsigned int mask;
>>> +	int i, penc, shift;
>>> +	/* Look at the 8 bit LP value */
>>> +	unsigned int lp =3D (hptep->r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
>>> +
>>> +	penc =3D 0;
>>> +	for (i =3D 0; i < MMU_PAGE_COUNT; i++) {
>>> +		/* valid entries have a shift value */
>>> +		if (!mmu_psize_defs[i].shift)
>>> +			continue;
>>> +
>>> +		/* encoding bits per actual page size */
>>> +		shift =3D mmu_psize_defs[i].shift - 11;
>>> +		if (shift > 9)
>>> +			shift =3D 9;
>>> +		mask =3D (1 << shift) - 1;
>>> +		if ((lp & mask) =3D=3D mmu_psize_defs[psize].penc[i])
>>> +			return i;
>>> +	}
>>> +	return -1;
>>> +}
>>
>> This doesn't look right to me.  First, it's not clear what the 11 and
>> 9 refer to, and I think the 9 should be LP_BITS (i.e. 8).  Secondly,
>> the mask for the comparison needs to depend on the actual page size
>> not the base page size.
>
> How about the below. I am yet to test this in user space.=20

I needed to special case 4K case. This seems to work fine with the test.

static inline int hpte_actual_psize(struct hash_pte *hptep, int psize)
{
	unsigned int mask;
	int i, penc, shift;
	/* Look at the 8 bit LP value */
	unsigned int lp =3D (hptep->r >> LP_SHIFT) & ((1 << LP_BITS) - 1);

	/* First check if it is large page */
	if (!(hptep->v & HPTE_V_LARGE))
		return MMU_PAGE_4K;

	penc =3D 0;
	for (i =3D 1; i < MMU_PAGE_COUNT; i++) {
		/* valid entries have a shift value */
		if (!mmu_psize_defs[i].shift)
			continue;
		/*
		 * encoding bits per actual page size
		 *        PTE LP     actual page size
		 *    rrrr rrrz		=E2=89=A58KB
		 *    rrrr rrzz		=E2=89=A516KB
		 *    rrrr rzzz		=E2=89=A532KB
		 *    rrrr zzzz		=E2=89=A564KB
		 * .......
		 */
		shift =3D mmu_psize_defs[i].shift -
				mmu_psize_defs[MMU_PAGE_4K].shift;
		if (shift > LP_BITS)
			shift =3D LP_BITS;
		mask =3D (1 << shift) - 1;
		if ((lp & mask) =3D=3D mmu_psize_defs[psize].penc[i])
			return i;
	}
	return -1;
}

^ permalink raw reply

* RE: [PATCH V4] powerpc/85xx: Add machine check handler to fix PCIe erratum on mpc85xx
From: David Laight @ 2013-03-04 17:15 UTC (permalink / raw)
  To: Jia Hongtao, linuxppc-dev, galak; +Cc: B07421
In-Reply-To: <1362386425-20149-1-git-send-email-B38951@freescale.com>

> A PCIe erratum of mpc85xx may causes a core hang when a link of PCIe
> goes down. when the link goes down, Non-posted transactions issued
> via the ATMU requiring completion result in an instruction stall.
> At the same time a machine-check exception is generated to the core
> to allow further processing by the handler. We implements the handler
> which skips the instruction caused the stall.

Just skipping the instruction doesn't seem a good idea.
But I suspect that re-initialising the PCI interface is also
almost impossible.

Does the mpc83xx have the same errata?
We've seen machine-check faults using the CSB bridge on an 83xx
doing a 'pio' access after a PEX_DMA transfer to certain target
addresses stalls - software gives up waiting for the dma.
The target is an fpga, nothing is mapped at those
addesses - but we'd expect to get ~0u back as happens on other
slave windows.

I also remember some problems with single word DMA.

	David

^ permalink raw reply

* RE: [PATCH 6/6 v8] iommu/fsl: Freescale PAMU driver and IOMMU API implementation.
From: Yoder Stuart-B08248 @ 2013-03-04 16:55 UTC (permalink / raw)
  To: Sethi Varun-B16395
  Cc: Wood Scott-B07421, iommu@lists.linux-foundation.org,
	linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
	Joerg Roedel
In-Reply-To: <C5ECD7A89D1DC44195F34B25E172658D3D12D9@039-SN2MPN1-013.039d.mgd.msft.net>



> -----Original Message-----
> From: Sethi Varun-B16395
> Sent: Monday, March 04, 2013 5:31 AM
> To: Stuart Yoder
> Cc: iommu@lists.linux-foundation.org; linuxppc-dev@lists.ozlabs.org; linu=
x-kernel@vger.kernel.org; Wood
> Scott-B07421; Joerg Roedel; Yoder Stuart-B08248
> Subject: RE: [PATCH 6/6 v8] iommu/fsl: Freescale PAMU driver and IOMMU AP=
I implementation.
>=20
>=20
>=20
> > -----Original Message-----
> > From: Stuart Yoder [mailto:b08248@gmail.com]
> > Sent: Saturday, March 02, 2013 4:58 AM
> > To: Sethi Varun-B16395
> > Cc: iommu@lists.linux-foundation.org; linuxppc-dev@lists.ozlabs.org;
> > linux-kernel@vger.kernel.org; Wood Scott-B07421; Joerg Roedel; Yoder
> > Stuart-B08248
> > Subject: Re: [PATCH 6/6 v8] iommu/fsl: Freescale PAMU driver and IOMMU
> > API implementation.
> >
> > On Mon, Feb 18, 2013 at 6:52 AM, Varun Sethi <Varun.Sethi@freescale.com=
>
> > wrote:
> > [cut]
> > > +static phys_addr_t get_phys_addr(struct fsl_dma_domain *dma_domain,
> > > +unsigned long iova) {
> > > +       u32 win_cnt =3D dma_domain->win_cnt;
> > > +       struct dma_window *win_ptr =3D
> > > +                               &dma_domain->win_arr[0];
> > > +       struct iommu_domain_geometry *geom;
> > > +
> > > +       geom =3D &dma_domain->iommu_domain->geometry;
> > > +
> > > +       if (!win_cnt || !dma_domain->geom_size) {
> > > +               pr_err("Number of windows/geometry not configured for
> > the domain\n");
> > > +               return 0;
> > > +       }
> > > +
> > > +       if (win_cnt > 1) {
> > > +               u64 subwin_size;
> > > +               unsigned long subwin_iova;
> > > +               u32 wnd;
> > > +
> > > +               subwin_size =3D dma_domain->geom_size >> ilog2(win_cn=
t);
> >
> > Could it be just geom_size / win_cnt ??
> [Sethi Varun-B16395] You would run in to linking issues with respect to u=
64 division on 32 bit kernels.
>=20
> >
> > > +               subwin_iova =3D iova & ~(subwin_size - 1);
> > > +               wnd =3D (subwin_iova - geom->aperture_start) >>
> > ilog2(subwin_size);
> > > +               win_ptr =3D &dma_domain->win_arr[wnd];
> > > +       }
> > > +
> > > +       if (win_ptr->valid)
> > > +               return (win_ptr->paddr + (iova & (win_ptr->size -
> > > + 1)));
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +static int map_liodn_subwins(int liodn, struct fsl_dma_domain
> > > +*dma_domain)
> >
> > Just call it map_subwins().  They are just sub windows, not "liodn sub
> > windows".
> >
> [Sethi Varun-B16395] This would be called per liodn.
>
> > [cut]
> >
> > > +static int map_liodn_win(int liodn, struct fsl_dma_domain
> > > +*dma_domain)
> >
> > Call it map_win().
> [Sethi Varun-B16395] This would again be called per liodn.

On the function names-- function names are typically named
with a noun describing some object and a verb describing
the action...and sometimes a subsystem identifier:
    kmem_cache + zalloc
    spin + lock

I don't think inserting liodn in the function name to indicates that it=20
operates per liodn makes it more clear and reads a little awkward to me:

    map + liodn_win

...it's almost like there is a "liodn_win" object.

I think plain map_win() is more clear and the function prototype makes
it pretty clear that you are operating on an liodn.
=20
> > [cut]
> > > +static  bool check_pci_ctl_endpt_part(struct pci_controller *pci_ctl=
)
> > > +{
> > > +       u32 version;
> > > +
> > > +       /* Check the PCI controller version number by readding BRR1
> > register */
> > > +       version =3D in_be32(pci_ctl->cfg_addr + (PCI_FSL_BRR1 >> 2));
> > > +       version &=3D PCI_FSL_BRR1_VER;
> > > +       /* If PCI controller version is >=3D 0x204 we can partition
> > endpoints*/
> > > +       if (version >=3D 0x204)
> > > +               return 1;
> > > +
> > > +       return 0;
> > > +}
> >
> > Can we just use the compatible string here to identify the different
> > kinds of PCI
> > controllers?   Reading the actual device registers is ugly right now
> > because
> > you are #defining the BRR1 stuff in a generic powerpc header.
> >
> [Sethi Varun-B16395] hmmm......, I would have to check for various differ=
ent compatible strings in that
> case. May be I can move the defines to a different file. What if I move t=
hem to some other header?

Don't personally feel too strongly about version register or header...but i=
t should be some fsl
PCI header that you define those regs.

You'll either need to check for a hardware version number or a compatible,
so a compatible check may be just as easy...look for "fsl,qoriq-pcie-v2.4"?

Stuart

^ permalink raw reply

* Re: [PATCH V4] powerpc/85xx: Add machine check handler to fix PCIe erratum on mpc85xx
From: Stuart Yoder @ 2013-03-04 16:16 UTC (permalink / raw)
  To: Jia Hongtao; +Cc: B07421, linuxppc-dev
In-Reply-To: <1362386425-20149-1-git-send-email-B38951@freescale.com>

On Mon, Mar 4, 2013 at 2:40 AM, Jia Hongtao <B38951@freescale.com> wrote:
> A PCIe erratum of mpc85xx may causes a core hang when a link of PCIe
> goes down. when the link goes down, Non-posted transactions issued
> via the ATMU requiring completion result in an instruction stall.
> At the same time a machine-check exception is generated to the core
> to allow further processing by the handler. We implements the handler
> which skips the instruction caused the stall.

Can you explain at a high level how just skipping an instruction solves
anything?   If you just skip a load/store and continue like nothing is
wrong, isn't your system possibly in a really bad state.

And if the core is already hung, due to the PCI link going down, isn't
it too late?   How does skipping help?

Stuart

^ permalink raw reply

* Re: [PATCH -V1 09/24] powerpc: Decode the pte-lp-encoding bits correctly.
From: Aneesh Kumar K.V @ 2013-03-04 11:52 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, linux-mm
In-Reply-To: <20130304054848.GE27523@drongo>

Paul Mackerras <paulus@samba.org> writes:

> On Tue, Feb 26, 2013 at 01:34:59PM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>>=20
>> +static inline int hpte_actual_psize(struct hash_pte *hptep, int psize)
>> +{
>> +	unsigned int mask;
>> +	int i, penc, shift;
>> +	/* Look at the 8 bit LP value */
>> +	unsigned int lp =3D (hptep->r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
>> +
>> +	penc =3D 0;
>> +	for (i =3D 0; i < MMU_PAGE_COUNT; i++) {
>> +		/* valid entries have a shift value */
>> +		if (!mmu_psize_defs[i].shift)
>> +			continue;
>> +
>> +		/* encoding bits per actual page size */
>> +		shift =3D mmu_psize_defs[i].shift - 11;
>> +		if (shift > 9)
>> +			shift =3D 9;
>> +		mask =3D (1 << shift) - 1;
>> +		if ((lp & mask) =3D=3D mmu_psize_defs[psize].penc[i])
>> +			return i;
>> +	}
>> +	return -1;
>> +}
>
> This doesn't look right to me.  First, it's not clear what the 11 and
> 9 refer to, and I think the 9 should be LP_BITS (i.e. 8).  Secondly,
> the mask for the comparison needs to depend on the actual page size
> not the base page size.

How about the below. I am yet to test this in user space.=20

static inline int hpte_actual_psize(struct hash_pte *hptep, int psize)
{
	unsigned int mask;
	int i, penc, shift;
	/* Look at the 8 bit LP value */
	unsigned int lp =3D (hptep->r >> LP_SHIFT) & ((1 << LP_BITS) - 1);

	penc =3D 0;
	for (i =3D 0; i < MMU_PAGE_COUNT; i++) {
		/* valid entries have a shift value */
		if (!mmu_psize_defs[i].shift)
			continue;
		/*
		 * encoding bits per actual page size
		 *        PTE LP     actual page size
		 *    rrrr rrrz		=E2=89=A58KB
		 *    rrrr rrzz		=E2=89=A516KB
		 *    rrrr rzzz		=E2=89=A532KB
		 *    rrrr zzzz		=E2=89=A564KB
		 * .......
		 */
		shift =3D mmu_psize_defs[i].shift -
				mmu_psize_defs[MMU_PAGE_4K].shift;
		if (shift > LP_BITS)
			shift =3D LP_BITS;
		mask =3D (1 << shift) - 1;
		if ((lp & mask) =3D=3D mmu_psize_defs[psize].penc[i])
			return i;
	}
	return -1;
}

^ permalink raw reply

* Re: [PATCH -V1 09/24] powerpc: Decode the pte-lp-encoding bits correctly.
From: Aneesh Kumar K.V @ 2013-03-04 11:41 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, linux-mm
In-Reply-To: <20130304054848.GE27523@drongo>

Paul Mackerras <paulus@samba.org> writes:

> On Tue, Feb 26, 2013 at 01:34:59PM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>>=20
>> We look at both the segment base page size and actual page size and store
>> the pte-lp-encodings in an array per base page size.
>>=20
>> We also update all relevant functions to take actual page size argument
>> so that we can use the correct PTE LP encoding in HPTE. This should also
>> get the basic Multiple Page Size per Segment (MPSS) support. This is nee=
ded
>> to enable THP on ppc64.
>
> Mostly looks OK, comments below...
>
>> +/*
>> + * HPTE LP details
>> + */
>> +#define LP_SHIFT	12
>> +#define LP_BITS		8
>> +#define LP_MASK(i)	((0xFF >> (i)) << LP_SHIFT)
>
> The reader might be wondering at this point what "LP" is; be kind and
> make it "large page" in the comment for them.

Fixed.

>
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index 71d0c90..48f6d99 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -1515,7 +1515,7 @@ static void kvmppc_add_seg_page_size(struct kvm_pp=
c_one_seg_page_size **sps,
>>  	(*sps)->page_shift =3D def->shift;
>>  	(*sps)->slb_enc =3D def->sllp;
>>  	(*sps)->enc[0].page_shift =3D def->shift;
>> -	(*sps)->enc[0].pte_enc =3D def->penc;
>> +	(*sps)->enc[0].pte_enc =3D def->penc[linux_psize];
>>  	(*sps)++;
>>  }
>
> This will only return the entries where actual page size =3D=3D base page
> size, which basically means that KVM guests won't be able to use
> MPSS.  We will need to return multiple entries in that case.

I did that as a the follow up patch.

[PATCH -V1 10/24] powerpc: Return all the valid pte ecndoing in
KVM_PPC_GET_SMMU_INFO ioct

>
>> +static inline int hpte_actual_psize(struct hash_pte *hptep, int psize)
>> +{
>> +	unsigned int mask;
>> +	int i, penc, shift;
>> +	/* Look at the 8 bit LP value */
>> +	unsigned int lp =3D (hptep->r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
>> +
>> +	penc =3D 0;
>> +	for (i =3D 0; i < MMU_PAGE_COUNT; i++) {
>> +		/* valid entries have a shift value */
>> +		if (!mmu_psize_defs[i].shift)
>> +			continue;
>> +
>> +		/* encoding bits per actual page size */
>> +		shift =3D mmu_psize_defs[i].shift - 11;
>> +		if (shift > 9)
>> +			shift =3D 9;
>> +		mask =3D (1 << shift) - 1;
>> +		if ((lp & mask) =3D=3D mmu_psize_defs[psize].penc[i])
>> +			return i;
>> +	}
>> +	return -1;
>> +}
>
> This doesn't look right to me.  First, it's not clear what the 11 and
> 9 refer to, and I think the 9 should be LP_BITS (i.e. 8).  Secondly,
> the mask for the comparison needs to depend on the actual page size
> not the base page size.

That 11 should be 12.That depends on the fact that we have below mapping
 rrrr rrrz 	=E2=89=A58KB

Yes, that 9 should be LP_BITs.=20

We are generating mask based on actual page size above (variable i in
the for loop).


>
> I strongly suggest you pull out this code together with
> native_hpte_insert into a little userspace test program that runs
> through all the possible page size combinations, creating an HPTE and
> then decoding it with hpte_actual_psize() to check that you get back
> the correct actual page size.
>

will do.

>>  static void hpte_decode(struct hash_pte *hpte, unsigned long slot,
>> -			int *psize, int *ssize, unsigned long *vpn)
>> +			int *psize, int *apsize, int *ssize, unsigned long *vpn)
>>  {
>>  	unsigned long avpn, pteg, vpi;
>>  	unsigned long hpte_r =3D hpte->r;
>>  	unsigned long hpte_v =3D hpte->v;
>>  	unsigned long vsid, seg_off;
>> -	int i, size, shift, penc;
>> +	int i, size, a_size =3D MMU_PAGE_4K, shift, penc;
>>=20=20
>>  	if (!(hpte_v & HPTE_V_LARGE))
>>  		size =3D MMU_PAGE_4K;
>> @@ -395,12 +422,13 @@ static void hpte_decode(struct hash_pte *hpte, uns=
igned long slot,
>>  			/* valid entries have a shift value */
>>  			if (!mmu_psize_defs[size].shift)
>>  				continue;
>> -
>> -			if (penc =3D=3D mmu_psize_defs[size].penc)
>> -				break;
>> +			for (a_size =3D 0; a_size < MMU_PAGE_COUNT; a_size++)
>> +				if (penc =3D=3D mmu_psize_defs[size].penc[a_size])
>> +					goto out;
>
> Once again I don't think this is correct, since the number of bits in
> the page size encoding depends on the page size.  In fact the
> calculation of penc in that function looks completely bogus to me (not
> that that is code that you have written or modified, but it looks to
> me like it needs fixing).

I am fixing that in the later patch

powerpc: Fix hpte_decode to use the correct decoding for page sizes

But that will also need fixing as you suggested above.

>
>>  static int __init htab_dt_scan_page_sizes(unsigned long node,
>>  					  const char *uname, int depth,
>>  					  void *data)
>> @@ -294,60 +318,57 @@ static int __init htab_dt_scan_page_sizes(unsigned=
 long node,
>>  		size /=3D 4;
>>  		cur_cpu_spec->mmu_features &=3D ~(MMU_FTR_16M_PAGE);
>>  		while(size > 0) {
>> -			unsigned int shift =3D prop[0];
>> +			unsigned int base_shift =3D prop[0];
>>  			unsigned int slbenc =3D prop[1];
>>  			unsigned int lpnum =3D prop[2];
>> -			unsigned int lpenc =3D 0;
>>  			struct mmu_psize_def *def;
>> -			int idx =3D -1;
>> +			int idx, base_idx;
>>=20=20
>>  			size -=3D 3; prop +=3D 3;
>> -			while(size > 0 && lpnum) {
>> -				if (prop[0] =3D=3D shift)
>> -					lpenc =3D prop[1];
>> +			base_idx =3D get_idx_from_shift(base_shift);
>> +			if (base_idx < 0) {
>> +				/*
>> +				 * skip the pte encoding also
>> +				 */
>>  				prop +=3D 2; size -=3D 2;
>> -				lpnum--;
>> +				continue;
>>  			}
>> -			switch(shift) {
>> -			case 0xc:
>> -				idx =3D MMU_PAGE_4K;
>> -				break;
>> -			case 0x10:
>> -				idx =3D MMU_PAGE_64K;
>> -				break;
>> -			case 0x14:
>> -				idx =3D MMU_PAGE_1M;
>> -				break;
>> -			case 0x18:
>> -				idx =3D MMU_PAGE_16M;
>> +			def =3D &mmu_psize_defs[base_idx];
>> +			if (base_idx =3D=3D MMU_PAGE_16M)
>>  				cur_cpu_spec->mmu_features |=3D MMU_FTR_16M_PAGE;
>> -				break;
>> -			case 0x22:
>> -				idx =3D MMU_PAGE_16G;
>> -				break;
>> -			}
>> -			if (idx < 0)
>> -				continue;
>> -			def =3D &mmu_psize_defs[idx];
>> -			def->shift =3D shift;
>> -			if (shift <=3D 23)
>> +
>> +			def->shift =3D base_shift;
>> +			if (base_shift <=3D 23)
>>  				def->avpnm =3D 0;
>>  			else
>> -				def->avpnm =3D (1 << (shift - 23)) - 1;
>> +				def->avpnm =3D (1 << (base_shift - 23)) - 1;
>>  			def->sllp =3D slbenc;
>> -			def->penc =3D lpenc;
>> -			/* We don't know for sure what's up with tlbiel, so
>> +			/*
>> +			 * We don't know for sure what's up with tlbiel, so
>>  			 * for now we only set it for 4K and 64K pages
>>  			 */
>> -			if (idx =3D=3D MMU_PAGE_4K || idx =3D=3D MMU_PAGE_64K)
>> +			if (base_idx =3D=3D MMU_PAGE_4K || base_idx =3D=3D MMU_PAGE_64K)
>>  				def->tlbiel =3D 1;
>>  			else
>>  				def->tlbiel =3D 0;
>>=20=20
>> -			DBG(" %d: shift=3D%02x, sllp=3D%04lx, avpnm=3D%08lx, "
>> -			    "tlbiel=3D%d, penc=3D%d\n",
>> -			    idx, shift, def->sllp, def->avpnm, def->tlbiel,
>> -			    def->penc);
>> +			while (size > 0 && lpnum) {
>> +				unsigned int shift =3D prop[0];
>> +				unsigned int penc  =3D prop[1];
>> +
>> +				prop +=3D 2; size -=3D 2;
>> +				lpnum--;
>> +
>> +				idx =3D get_idx_from_shift(shift);
>> +				if (idx < 0)
>> +					continue;
>> +
>> +				def->penc[idx] =3D penc;
>> +				DBG(" %d: shift=3D%02x, sllp=3D%04lx, "
>> +				    "avpnm=3D%08lx, tlbiel=3D%d, penc=3D%d\n",
>> +				    idx, shift, def->sllp, def->avpnm,
>> +				    def->tlbiel, def->penc[idx]);
>> +			}
>
> I don't see where in this function you set the penc[] elements for
> invalid actual page sizes to -1.

We do the below

--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -125,7 +125,7 @@ static struct mmu_psize_def mmu_psize_defaults_old[] =
=3D {
        [MMU_PAGE_4K] =3D {
                .shift  =3D 12,
                .sllp   =3D 0,
-               .penc   =3D 0,
+               .penc   =3D { [0 ... MMU_PAGE_COUNT - 1] =3D -1 },
                .avpnm  =3D 0,

-aneesh

^ permalink raw reply

* RE: [PATCH 6/6 v8] iommu/fsl: Freescale PAMU driver and IOMMU API implementation.
From: Sethi Varun-B16395 @ 2013-03-04 11:31 UTC (permalink / raw)
  To: Stuart Yoder
  Cc: Wood Scott-B07421, Joerg Roedel, linux-kernel@vger.kernel.org,
	Yoder Stuart-B08248, iommu@lists.linux-foundation.org,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <CALRxmdDEccTaS2Houuns44DoQ_8wUbXAMiXQ8w0Ks9R2F2u-Ag@mail.gmail.com>



> -----Original Message-----
> From: Stuart Yoder [mailto:b08248@gmail.com]
> Sent: Saturday, March 02, 2013 4:58 AM
> To: Sethi Varun-B16395
> Cc: iommu@lists.linux-foundation.org; linuxppc-dev@lists.ozlabs.org;
> linux-kernel@vger.kernel.org; Wood Scott-B07421; Joerg Roedel; Yoder
> Stuart-B08248
> Subject: Re: [PATCH 6/6 v8] iommu/fsl: Freescale PAMU driver and IOMMU
> API implementation.
>=20
> On Mon, Feb 18, 2013 at 6:52 AM, Varun Sethi <Varun.Sethi@freescale.com>
> wrote:
> [cut]
> > +static phys_addr_t get_phys_addr(struct fsl_dma_domain *dma_domain,
> > +unsigned long iova) {
> > +       u32 win_cnt =3D dma_domain->win_cnt;
> > +       struct dma_window *win_ptr =3D
> > +                               &dma_domain->win_arr[0];
> > +       struct iommu_domain_geometry *geom;
> > +
> > +       geom =3D &dma_domain->iommu_domain->geometry;
> > +
> > +       if (!win_cnt || !dma_domain->geom_size) {
> > +               pr_err("Number of windows/geometry not configured for
> the domain\n");
> > +               return 0;
> > +       }
> > +
> > +       if (win_cnt > 1) {
> > +               u64 subwin_size;
> > +               unsigned long subwin_iova;
> > +               u32 wnd;
> > +
> > +               subwin_size =3D dma_domain->geom_size >> ilog2(win_cnt)=
;
>=20
> Could it be just geom_size / win_cnt ??
[Sethi Varun-B16395] You would run in to linking issues with respect to u64=
 division on 32 bit kernels.

>=20
> > +               subwin_iova =3D iova & ~(subwin_size - 1);
> > +               wnd =3D (subwin_iova - geom->aperture_start) >>
> ilog2(subwin_size);
> > +               win_ptr =3D &dma_domain->win_arr[wnd];
> > +       }
> > +
> > +       if (win_ptr->valid)
> > +               return (win_ptr->paddr + (iova & (win_ptr->size -
> > + 1)));
> > +
> > +       return 0;
> > +}
> > +
> > +static int map_liodn_subwins(int liodn, struct fsl_dma_domain
> > +*dma_domain)
>=20
> Just call it map_subwins().  They are just sub windows, not "liodn sub
> windows".
>=20
[Sethi Varun-B16395] This would be called per liodn.

> [cut]
>=20
> > +static int map_liodn_win(int liodn, struct fsl_dma_domain
> > +*dma_domain)
>=20
> Call it map_win().
[Sethi Varun-B16395] This would again be called per liodn.

>=20
> [cut]
> > +static struct fsl_dma_domain *iommu_alloc_dma_domain(void) {
> > +       struct fsl_dma_domain *domain;
> > +
> > +       domain =3D kmem_cache_zalloc(fsl_pamu_domain_cache, GFP_KERNEL)=
;
> > +       if (!domain)
> > +               return NULL;
> > +
> > +       domain->stash_id =3D ~(u32)0;
> > +       domain->snoop_id =3D ~(u32)0;
> > +       domain->win_cnt =3D max_subwindow_count;
>=20
> To align with my previous comments on fsl_pamu.c, I think instead of
> referencing a global variable (in fsl_pamu.c) you should be making an
> accessor API call here to get the max subwindow _count.
>=20
[Sethi Varun-B16395] ok, I will make a accessor API call.

> > +       domain->geom_size =3D 0;
> > +
> > +       INIT_LIST_HEAD(&domain->devices);
> > +
> > +       spin_lock_init(&domain->domain_lock);
> > +
> > +       return domain;
> > +}
> > +
> > +static inline struct device_domain_info *find_domain(struct device
> > +*dev) {
> > +       return dev->archdata.iommu_domain; }
> > +
> > +static void remove_domain_ref(struct device_domain_info *info, u32
> > +win_cnt) {
> > +               list_del(&info->link);
> > +               spin_lock(&iommu_lock);
> > +               if (win_cnt)
> > +                       pamu_free_subwins(info->liodn);
> > +               pamu_disable_liodn(info->liodn);
> > +               spin_unlock(&iommu_lock);
> > +               spin_lock(&device_domain_lock);
> > +               info->dev->archdata.iommu_domain =3D NULL;
> > +               kmem_cache_free(iommu_devinfo_cache, info);
> > +               spin_unlock(&device_domain_lock); }
>=20
> The above function is literally removing the _device_ reference from the
> domain.
> The name implies that it is removing a "domain reference".   Suggestion
> is
> to call it "remove_device_ref".
>=20
> Also, the whitespace is messed up there.  You have 2 tabs instead of 1.
[Sethi Varun-B16395]  ok will make the change.
>=20
> > +static void destroy_domain(struct fsl_dma_domain *dma_domain) {
> > +       struct device_domain_info *info;
> > +
> > +       /* Dissociate all the devices from this domain */
> > +       while (!list_empty(&dma_domain->devices)) {
> > +               info =3D list_entry(dma_domain->devices.next,
> > +                       struct device_domain_info, link);
> > +               remove_domain_ref(info, dma_domain->win_cnt);
> > +       }
> > +}
>=20
> This function is removing all devices from a domain...maybe to be
> consistent with my suggestion below on detach_domain(), call this
> detach_all_devices().
>  We have 2 functions
> doing almost the same thing....one detaches a single device, one detaches
> all devices.
> The current names "destroy_domain" and "detach_domain" are not as clear
> to me.
>=20
[Sethi Varun-B16395]ok=20

> > +static void detach_domain(struct device *dev, struct fsl_dma_domain
> > +*dma_domain) {
> > +       struct device_domain_info *info;
> > +       struct list_head *entry, *tmp;
> > +       unsigned long flags;
> > +
> > +       spin_lock_irqsave(&dma_domain->domain_lock, flags);
> > +       /* Remove the device from the domain device list */
> > +       if (!list_empty(&dma_domain->devices)) {
> > +               list_for_each_safe(entry, tmp, &dma_domain->devices) {
> > +                       info =3D list_entry(entry, struct
> device_domain_info, link);
> > +                       if (info->dev =3D=3D dev)
> > +                               remove_domain_ref(info, dma_domain-
> >win_cnt);
> > +               }
> > +       }
> > +       spin_unlock_irqrestore(&dma_domain->domain_lock, flags); }
>=20
> This function is not "detaching a domain", but is detaching a device.
>  Call it detach_device().
>=20
[Sethi Varun-B16395] ok, will address this.

> > +static void attach_domain(struct fsl_dma_domain *dma_domain, int
> > +liodn, struct device *dev) {
>=20
> Same thing here.   This is not attaching a domain, but attaching a
> device.  Call it attach_device.
>=20
[Sethi Varun-B16395] ok.

> > +       struct device_domain_info *info, *old_domain_info;
> > +
> > +       spin_lock(&device_domain_lock);
> > +       /*
> > +        * Check here if the device is already attached to domain or
> not.
> > +        * If the device is already attached to a domain detach it.
> > +        */
> > +       old_domain_info =3D find_domain(dev);
> > +       if (old_domain_info && old_domain_info->domain !=3D dma_domain)=
 {
> > +               spin_unlock(&device_domain_lock);
> > +               detach_domain(dev, old_domain_info->domain);
> > +               spin_lock(&device_domain_lock);
> > +       }
> > +
> > +       info =3D kmem_cache_zalloc(iommu_devinfo_cache, GFP_KERNEL);
> > +
> > +       info->dev =3D dev;
> > +       info->liodn =3D liodn;
> > +       info->domain =3D dma_domain;
> > +
> > +       list_add(&info->link, &dma_domain->devices);
> > +       /*
> > +        * In case of devices with multiple LIODNs just store
> > +        * the info for the first LIODN as all
> > +        * LIODNs share the same domain
> > +        */
> > +       if (!old_domain_info)
> > +               dev->archdata.iommu_domain =3D info;
> > +       spin_unlock(&device_domain_lock);
> > +
> > +}
> > +
>=20
> [cut]
> > +/* Configure geometry settings for all LIODNs associated with domain
> > +*/ static int configure_domain(struct fsl_dma_domain *dma_domain,
> > +                           struct iommu_domain_geometry *geom_attr,
> > +                           u32 win_cnt)
>=20
> This function is not configuring the iommu domain...which is a concept in
> the Linux driver, it is taking the domain geometry and setting up the
> PAMU tables for all LIODNs currently in the domain.
>=20
> Maybe it would help if you used a prefix like "pamu" or "paact" to
> identify functions that operate on the actual PAMU tables.
>=20
> maybe:  pamu_set_domain_geometry()
>=20
[Sethi Varun-B16395] ok.
> > +{
> > +       struct device_domain_info *info;
> > +       int ret =3D 0;
> > +
> > +       if (!list_empty(&dma_domain->devices)) {
> > +               list_for_each_entry(info, &dma_domain->devices, link) {
> > +                       ret =3D configure_liodn(info->liodn, info->dev,
> dma_domain,
> > +                                             geom_attr, win_cnt);
>=20
> ...and following the above naming convention, call this (?):
> pamu_set_liodn
[Sethi Varun-B16395] ok.

>=20
> > +                       if (ret)
> > +                               break;
> > +               }
> > +       }
> > +
> > +       return ret;
> > +}
> > +
>=20
> [cut]
> > +static int fsl_pamu_window_enable(struct iommu_domain *domain, u32
> wnd_nr,
> > +                                 phys_addr_t paddr, u64 size, int
> > +iommu_prot) {
> > +       struct fsl_dma_domain *dma_domain =3D domain->priv;
> > +       struct dma_window *wnd;
> > +       int prot =3D 0;
> > +       int ret;
> > +       unsigned long flags;
> > +       u64 win_size;
> > +
> > +       if (iommu_prot & IOMMU_READ)
> > +               prot |=3D PAACE_AP_PERMS_QUERY;
> > +       if (iommu_prot & IOMMU_WRITE)
> > +               prot |=3D PAACE_AP_PERMS_UPDATE;
> > +
> > +       spin_lock_irqsave(&dma_domain->domain_lock, flags);
> > +       if (!dma_domain->win_arr) {
> > +               pr_err("Number of windows not configured\n");
> > +               spin_unlock_irqrestore(&dma_domain->domain_lock,
> flags);
> > +               return -ENODEV;
> > +       }
> > +
> > +       if (wnd_nr >=3D dma_domain->win_cnt) {
> > +               pr_err("Invalid window index\n");
> > +               spin_unlock_irqrestore(&dma_domain->domain_lock,
> flags);
> > +               return -EINVAL;
> > +       }
> > +
> > +       win_size =3D dma_domain->geom_size >>
> > + ilog2(dma_domain->win_cnt);
>=20
> Isn't size just geom_size / win_cnt.   Not sure why you do the ilog2()
> and right shift...
> We already validated that  win_cnt is power of 2 when it was set.
>=20
[Sethi Varun-B16395] problem with u64 division.

> > +       if (size > win_size) {
> > +               pr_err("Invalid window size \n");
> > +               spin_unlock_irqrestore(&dma_domain->domain_lock,
> flags);
> > +               return -EINVAL;
> > +       }
> > +
> > +       if (dma_domain->win_cnt =3D=3D 1) {
> > +               if (dma_domain->enabled) {
> > +                       pr_err("Disable the window before updating the
> mapping\n");
> > +                       spin_unlock_irqrestore(&dma_domain-
> >domain_lock, flags);
> > +                       return -EBUSY;
> > +               }
> > +
> > +               ret =3D check_size(size, domain-
> >geometry.aperture_start);
> > +               if (ret) {
> > +                       pr_err("Aperture start not aligned to the
> size\n");
> > +                       spin_unlock_irqrestore(&dma_domain-
> >domain_lock, flags);
> > +                       return -EINVAL;
> > +               }
> > +       }
>=20
> Why is win_cnt=3D=3D1 a special case?    Would the geometry not have been
> verified
> when it was originally set up?
>=20
[Sethi Varun-B16395] Yes, but in case of a single window you can still have=
 the flexibility of specifying a different size range. But the start addres=
s should still be aligned to the new size.
> Also, do you need a check here to verify if the geometry has been set.
> Is it a requirement to set the geometry prior to window enable?
[Sethi Varun-B16395] That is already checked with the subwindow array check=
.

>=20
> > +       wnd =3D &dma_domain->win_arr[wnd_nr];
> > +       if (!wnd->valid) {
> > +               wnd->paddr =3D paddr;
> > +               wnd->size =3D size;
> > +               wnd->prot =3D prot;
> > +
> > +               ret =3D update_domain_mapping(dma_domain, wnd_nr);
> > +               if (!ret) {
> > +                       wnd->valid =3D 1;
> > +                       dma_domain->mapped++;
> > +               }
> > +       } else {
> > +               pr_err("Disable the window before updating the
> mapping\n");
> > +               ret =3D -EBUSY;
> > +       }
> > +
> > +       spin_unlock_irqrestore(&dma_domain->domain_lock, flags);
> > +
> > +       return ret;
> > +}
> > +
> > +/*
>=20
> [cut]
> > +static int fsl_pamu_attach_device(struct iommu_domain *domain,
> > +                                 struct device *dev) {
> > +       struct fsl_dma_domain *dma_domain =3D domain->priv;
> > +       const u32 *prop;
> > +       u32 prop_cnt;
> > +       int len, ret =3D 0;
> > +       struct pci_dev *pdev =3D NULL;
> > +       struct pci_controller *pci_ctl;
> > +
> > +       /*
> > +        * Hack to make attach device work for the PCI devices. Simply
> assign the
> > +        * the LIODN for the PCI controller to the PCI device.
> > +        */
>=20
> Instead of "Simply assign...", perhaps say instead:  Use the LIODN for
> the PCI controller when attaching a PCI device.
[Sethi Varun-B16395] ok.

>=20
> > +       if (dev->bus =3D=3D &pci_bus_type) {
> > +               pdev =3D to_pci_dev(dev);
> > +               pci_ctl =3D pci_bus_to_host(pdev->bus);
> > +               /*
> > +                * make dev point to pci controller device
> > +                * so we can get the LIODN programmed by
> > +                * u-boot;
> > +                */
> > +               dev =3D pci_ctl->parent;
> > +       }
> > +
> > +       prop =3D of_get_property(dev->of_node, "fsl,liodn", &len);
>=20
> suggestion: name "prop" to be "liodn" and "prop_cnt" to be "liodn_cnt".
> That will be more clear.
[Sethi Varun-B16395] ok.

>=20
> > +       if (prop) {
> > +               prop_cnt =3D len / sizeof(u32);
> > +               ret =3D handle_attach_device(dma_domain, dev,
> > +                                        prop, prop_cnt);
> > +       } else {
> > +               pr_err("missing fsl,liodn property at %s\n",
> > +                         dev->of_node->full_name);
> > +                       ret =3D -EINVAL;
> > +       }
> > +
> > +       return ret;
> > +}
> > +
> > +static void fsl_pamu_detach_device(struct iommu_domain *domain,
> > +                                     struct device *dev) {
> > +       struct fsl_dma_domain *dma_domain =3D domain->priv;
> > +       const u32 *prop;
> > +       int len;
> > +       struct pci_dev *pdev =3D NULL;
> > +       struct pci_controller *pci_ctl;
> > +
> > +       /*
> > +        * Hack to make detach device work for the PCI devices. Simply
> assign the
> > +        * the LIODN for the PCI controller to the PCI device.
> > +        */
> > +       if (dev->bus =3D=3D &pci_bus_type) {
> > +               pdev =3D to_pci_dev(dev);
> > +               pci_ctl =3D pci_bus_to_host(pdev->bus);
> > +               /*
> > +                * make dev point to pci controller device
> > +                * so we can get the LIODN programmed by
> > +                * u-boot;
> > +                */
> > +               dev =3D pci_ctl->parent;
> > +       }
> > +
> > +       prop =3D of_get_property(dev->of_node, "fsl,liodn", &len);
> > +       if (prop)
> > +               detach_domain(dev, dma_domain);
> > +       else
> > +               pr_err("missing fsl,liodn property at %s\n",
> > +                         dev->of_node->full_name); }
>=20
> I understand why you need the LIODN when attaching, but why do you get it
> in the detatch function.   You are not using "prop" here.
[Sethi Varun-B16395] Just a sanity check.

>=20
> [cut]
> > +static  bool check_pci_ctl_endpt_part(struct pci_controller *pci_ctl)
> > +{
> > +       u32 version;
> > +
> > +       /* Check the PCI controller version number by readding BRR1
> register */
> > +       version =3D in_be32(pci_ctl->cfg_addr + (PCI_FSL_BRR1 >> 2));
> > +       version &=3D PCI_FSL_BRR1_VER;
> > +       /* If PCI controller version is >=3D 0x204 we can partition
> endpoints*/
> > +       if (version >=3D 0x204)
> > +               return 1;
> > +
> > +       return 0;
> > +}
>=20
> Can we just use the compatible string here to identify the different
> kinds of PCI
> controllers?   Reading the actual device registers is ugly right now
> because
> you are #defining the BRR1 stuff in a generic powerpc header.
>=20
[Sethi Varun-B16395] hmmm......, I would have to check for various differen=
t compatible strings in that case. May be I can move the defines to a diffe=
rent file. What if I move them to some other header?

> > +static int fsl_pamu_add_device(struct device *dev) {
> > +       struct iommu_group *group =3D NULL;
> > +       struct pci_dev *pdev;
> > +       struct pci_dev *bridge, *dma_pdev =3D NULL;
> > +       struct pci_controller *pci_ctl;
> > +       int ret;
> > +
> > +       /*
> > +        * For platform devices we allocate a separate group for
> > +        * each of the devices.
> > +        */
> > +       if (dev->bus =3D=3D &pci_bus_type) {
> > +               bool pci_endpt_part;
>=20
> That variable name is not clear, the abbreviations are not natural.  I
> would just call it pci_endpoint_partitioning.
[Sethi Varun-B16395] ok.

-Varun

^ permalink raw reply

* Re: [PATCH -V1 07/24] powerpc: Add size argument to pgtable_cache_add
From: Aneesh Kumar K.V @ 2013-03-04 11:02 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, linux-mm
In-Reply-To: <20130304051340.GC27523@drongo>

Paul Mackerras <paulus@samba.org> writes:

> On Tue, Feb 26, 2013 at 01:34:57PM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>> 
>> We will use this later with THP changes to request for pmd table of double the size.
>> THP code does PTE page allocation along with large page request and deposit them
>> for later use. This is to ensure that we won't have any failures when we split
>> huge pages to regular pages.
>> 
>> On powerpc we want to use the deposited PTE page for storing hash pte slot and
>> secondary bit information for the HPTEs. Hence we save them in the second half
>> of the pmd table.
>
> Looks OK, but you should explain why you made the wholesale change of
> "shift" to "index".  Is there some important semantic difference, or
> do you just prefer the "index" name for some reason?
>

Now with table_size argument, the first arg is no more the shift value,
rather it is index into the array. Hence i changed the variable name. I
will split that patch to make it easy for review.

-aneesh

^ permalink raw reply

* Re: [PATCH -V1 06/24] powerpc: Reduce PTE table memory wastage
From: Aneesh Kumar K.V @ 2013-03-04 10:58 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, linux-mm
In-Reply-To: <20130304045853.GB27523@drongo>

Paul Mackerras <paulus@samba.org> writes:

> On Tue, Feb 26, 2013 at 01:34:56PM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>> 
>> We allocate one page for the last level of linux page table. With THP and
>> large page size of 16MB, that would mean we are be wasting large part
>> of that page. To map 16MB area, we only need a PTE space of 2K with 64K
>> page size. This patch reduce the space wastage by sharing the page
>> allocated for the last level of linux page table with multiple pmd
>> entries. We call these smaller chunks PTE page fragments and allocated
>> page, PTE page. We use the page->_mapcount as bitmap to indicate which
>> PTE fragments are free.
>> 
>> page->_mapcount is divided into two halves. The upper half is used for
>> tracking the freed page framents in the RCU grace period.
>> 
>> In order to support systems which doesn't have 64K HPTE support, we also
>> add another 2K to PTE page fragment. The second half of the PTE fragments
>> is used for storing slot and secondary bit information of an HPTE. With this
>> we now have a 4K PTE fragment.
>> 
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>
> This one has taken me hours to review.  Perhaps it's partly because of
> the way that diff has matched things up, but it's difficult to see
> what's moved where, what's common code that is now the 4k page case,
> etc.  For example, pmd_alloc_one() and pmd_free() are unchanged, but
> the diff shows them as removed in one place and added in another.

I updated the previous patch "powerpc: Move the pte free routines from
common header" to duplicate the 4K and 64K definitions. That helped in
making the diff better. I have inlined the resulting patch below. Let
know if this one looks better.

>
> The other general comment I have is that it's not really clear when a
> page will be on the mm->context.pgtable_list and when it won't.  I
> would like to see an invariant that says something like "the page is
> on the pgtable_list if and only if (page->_mapcount & FRAG_MASK) is
> neither 0 nor FRAG_MASK".  But that doesn't seem to be the case
> exactly, and I can't see any consistent rule, which makes me think
> there are going to be bugs in corner cases.
>


I added the below comment when initializing the list.

+#ifdef CONFIG_PPC_64K_PAGES
+       /*
+        * Used to support 4K PTE fragment. The pages are added to list,
+        * when we have free framents in the page. We track the whether
+        * a page frament is available using page._mapcount. A value of
+        * zero indicate none of the fragments are used and page can be
+        * freed. A value of FRAG_MASK indicate all the fragments are used
+        * and hence the page will be removed from the below list.
+        */
+       INIT_LIST_HEAD(&init_mm.context.pgtable_list);
+#endif

I am not sure about why you say there is no consistent rule. Can you
elaborate on that ?

> Consider, for example, the case where a page has two fragments still
> in use, and one of them gets queued up by RCU for freeing via a call
> to page_table_free_rcu, and then the other one gets freed through
> page_table_free().  Neither the call to page_table_free_rcu nor the
> call to page_table_free will take the page off the list AFAICS, and
> then __page_table_free_rcu() will free the page while it's still on
> the pgtable_list.

The last one that ends up doing atomic_xor_bits which cause the mapcount
to go zero, will take the page off the list and free the page. 

>
> More specific comments below...
>
>> -static inline void pgtable_free(void *table, unsigned index_size)
>> -{
>> -	if (!index_size)
>> -		free_page((unsigned long)table);
>> -	else {
>> -		BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
>> -		kmem_cache_free(PGT_CACHE(index_size), table);
>> -	}
>> -}
>
> This is still used in the UP case, both for 4k and 64k, and UP configs
> now fail to build.
>
>>  static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
>>  {
>>  	free_page((unsigned long)pte);
>> @@ -156,7 +118,12 @@ static inline void __tlb_remove_table(void *_table)
>>  	void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
>>  	unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
>>  
>> -	pgtable_free(table, shift);
>> +	if (!shift)
>> +		free_page((unsigned long)table);
>> +	else {
>> +		BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
>> +		kmem_cache_free(PGT_CACHE(shift), table);
>> +	}
>
> Any particular reason for open-coding pgtable_free() here?
>
>> +/*
>> + * we support 15 fragments per PTE page. This is limited by how many
>> + * bits we can pack in page->_mapcount. We use the first half for
>> + * tracking the usage for rcu page table free.
>> + */
>> +#define FRAG_MASK_BITS	15
>> +#define FRAG_MASK ((1 << FRAG_MASK_BITS) - 1)
>
> Atomic_t variables are 32-bit, so you really should be able to make
> FRAG_MASK_BITS be 16 and avoid wasting the last fragment of each page.
>


commit ff1af6b4b5c90223bac43052436aae943bae1104
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date:   Fri Oct 12 10:54:07 2012 +0530

    powerpc: Reduce PTE table memory wastage
    
    We allocate one page for the last level of linux page table. With THP and
    large page size of 16MB, that would mean we are be wasting large part
    of that page. To map 16MB area, we only need a PTE space of 2K with 64K
    page size. This patch reduce the space wastage by sharing the page
    allocated for the last level of linux page table with multiple pmd
    entries. We call these smaller chunks PTE page fragments and allocated
    page, PTE page. We use the page->_mapcount as bitmap to indicate which
    PTE fragments are free.
    
    page->_mapcount is divided into two halves. The upper half is used for
    tracking the freed page framents in the RCU grace period.
    
    In order to support systems which doesn't have 64K HPTE support, we also
    add another 2K to PTE page fragment. The second half of the PTE fragments
    is used for storing slot and secondary bit information of an HPTE. With this
    we now have a 4K PTE fragment.
    
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

diff --git a/arch/powerpc/include/asm/mmu-book3e.h b/arch/powerpc/include/asm/mmu-book3e.h
index 99d43e0..ffae629 100644
--- a/arch/powerpc/include/asm/mmu-book3e.h
+++ b/arch/powerpc/include/asm/mmu-book3e.h
@@ -231,6 +231,10 @@ typedef struct {
 	u64 high_slices_psize;  /* 4 bits per slice for now */
 	u16 user_psize;         /* page size index */
 #endif
+#ifdef CONFIG_PPC_64K_PAGES
+	/* for 4K PTE fragment support */
+	struct list_head pgtable_list;
+#endif
 } mm_context_t;
 
 /* Page size definitions, common between 32 and 64-bit
diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
index 35bb51e..feac737 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -498,6 +498,10 @@ typedef struct {
 	unsigned long acop;	/* mask of enabled coprocessor types */
 	unsigned int cop_pid;	/* pid value used with coprocessors */
 #endif /* CONFIG_PPC_ICSWX */
+#ifdef CONFIG_PPC_64K_PAGES
+	/* for 4K PTE fragment support */
+	struct list_head pgtable_list;
+#endif
 } mm_context_t;
 
 
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index f072e97..38e7ff6 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -378,7 +378,11 @@ void arch_free_page(struct page *page, int order);
 
 struct vm_area_struct;
 
+#ifdef CONFIG_PPC_64K_PAGES
+typedef pte_t *pgtable_t;
+#else
 typedef struct page *pgtable_t;
+#endif
 
 #include <asm-generic/memory_model.h>
 #endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/pgalloc-64.h b/arch/powerpc/include/asm/pgalloc-64.h
index cdbf555..3b62636 100644
--- a/arch/powerpc/include/asm/pgalloc-64.h
+++ b/arch/powerpc/include/asm/pgalloc-64.h
@@ -150,6 +150,18 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
 
 #else /* if CONFIG_PPC_64K_PAGES */
 
+extern unsigned long *page_table_alloc(struct mm_struct *, unsigned long);
+extern void page_table_free(struct mm_struct *, unsigned long *);
+#ifdef CONFIG_SMP
+extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift);
+extern void __tlb_remove_table(void *_table);
+#else
+static inline void pgtable_free_tlb(struct mmu_gather *tlb,
+				    void *table, int shift)
+{
+	pgtable_free(table, shift);
+}
+#endif
 #define pud_populate(mm, pud, pmd)	pud_set(pud, (unsigned long)pmd)
 
 static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd,
@@ -161,90 +173,42 @@ static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd,
 static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 				pgtable_t pte_page)
 {
-	pmd_populate_kernel(mm, pmd, page_address(pte_page));
+	pmd_set(pmd, (unsigned long)pte_page);
 }
 
 static inline pgtable_t pmd_pgtable(pmd_t pmd)
 {
-	return pmd_page(pmd);
+	return (pgtable_t)(pmd_val(pmd) & -sizeof(pte_t)*PTRS_PER_PTE);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 					  unsigned long address)
 {
-	return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO);
+	return (pte_t *)page_table_alloc(mm, address);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
-				      unsigned long address)
+					unsigned long address)
 {
-	struct page *page;
-	pte_t *pte;
-
-	pte = pte_alloc_one_kernel(mm, address);
-	if (!pte)
-		return NULL;
-	page = virt_to_page(pte);
-	pgtable_page_ctor(page);
-	return page;
+	return (pgtable_t)page_table_alloc(mm, address);
 }
 
 static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 {
-	free_page((unsigned long)pte);
+	page_table_free(mm, (unsigned long *)pte);
 }
 
 static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
 {
-	pgtable_page_dtor(ptepage);
-	__free_page(ptepage);
+	page_table_free(mm, (unsigned long *)ptepage);
 }
 
-static inline void pgtable_free(void *table, unsigned index_size)
-{
-	if (!index_size)
-		free_page((unsigned long)table);
-	else {
-		BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
-		kmem_cache_free(PGT_CACHE(index_size), table);
-	}
-}
-
-#ifdef CONFIG_SMP
-static inline void pgtable_free_tlb(struct mmu_gather *tlb,
-				    void *table, int shift)
-{
-	unsigned long pgf = (unsigned long)table;
-	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
-	pgf |= shift;
-	tlb_remove_table(tlb, (void *)pgf);
-}
-
-static inline void __tlb_remove_table(void *_table)
-{
-	void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
-	unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
-
-	pgtable_free(table, shift);
-}
-#else /* !CONFIG_SMP */
-static inline void pgtable_free_tlb(struct mmu_gather *tlb,
-				    void *table, int shift)
-{
-	pgtable_free(table, shift);
-}
-#endif /* CONFIG_SMP */
-
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
 				  unsigned long address)
 {
-	struct page *page = page_address(table);
-
 	tlb_flush_pgtable(tlb, address);
-	pgtable_page_dtor(page);
-	pgtable_free_tlb(tlb, page, 0);
+	pgtable_free_tlb(tlb, table, 0);
 }
-
 #endif /* CONFIG_PPC_64K_PAGES */
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
@@ -258,7 +222,6 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 	kmem_cache_free(PGT_CACHE(PMD_INDEX_SIZE), pmd);
 }
 
-
 #define __pmd_free_tlb(tlb, pmd, addr)		      \
 	pgtable_free_tlb(tlb, pmd, PMD_INDEX_SIZE)
 #ifndef CONFIG_PPC_64K_PAGES
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 6da881b..442f858 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -575,7 +575,17 @@ void __init setup_arch(char **cmdline_p)
 	init_mm.end_code = (unsigned long) _etext;
 	init_mm.end_data = (unsigned long) _edata;
 	init_mm.brk = klimit;
-	
+#ifdef CONFIG_PPC_64K_PAGES
+	/*
+	 * Used to support 4K PTE fragment. The pages are added to list,
+	 * when we have free framents in the page. We track the whether
+	 * a page frament is available using page._mapcount. A value of
+	 * zero indicate none of the fragments are used and page can be
+	 * freed. A value of FRAG_MASK indicate all the fragments are used
+	 * and hence the page will be removed from the below list.
+	 */
+	INIT_LIST_HEAD(&init_mm.context.pgtable_list);
+#endif
 	irqstack_early_init();
 	exc_lvl_early_init();
 	emergency_stack_init();
diff --git a/arch/powerpc/mm/mmu_context_hash64.c b/arch/powerpc/mm/mmu_context_hash64.c
index 59cd773..474b9af 100644
--- a/arch/powerpc/mm/mmu_context_hash64.c
+++ b/arch/powerpc/mm/mmu_context_hash64.c
@@ -86,6 +86,9 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 	spin_lock_init(mm->context.cop_lockp);
 #endif /* CONFIG_PPC_ICSWX */
 
+#ifdef CONFIG_PPC_64K_PAGES
+	INIT_LIST_HEAD(&mm->context.pgtable_list);
+#endif
 	return 0;
 }
 
@@ -97,13 +100,37 @@ void __destroy_context(int context_id)
 }
 EXPORT_SYMBOL_GPL(__destroy_context);
 
+#ifdef CONFIG_PPC_64K_PAGES
+static void destroy_pagetable_list(struct mm_struct *mm)
+{
+	struct page *page;
+	struct list_head *item, *tmp;
+
+	list_for_each_safe(item, tmp, &mm->context.pgtable_list) {
+		page = list_entry(item, struct page, lru);
+		list_del(&page->lru);
+		pgtable_page_dtor(page);
+		atomic_set(&page->_mapcount, -1);
+		__free_page(page);
+	}
+}
+#else
+static inline void destroy_pagetable_list(struct mm_struct *mm)
+{
+	return;
+}
+#endif
+
 void destroy_context(struct mm_struct *mm)
 {
+
 #ifdef CONFIG_PPC_ICSWX
 	drop_cop(mm->context.acop, mm);
 	kfree(mm->context.cop_lockp);
 	mm->context.cop_lockp = NULL;
 #endif /* CONFIG_PPC_ICSWX */
+
+	destroy_pagetable_list(mm);
 	__destroy_context(mm->context.id);
 	subpage_prot_free(mm);
 	mm->context.id = MMU_NO_CONTEXT;
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index e212a27..2a49044 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -337,3 +337,193 @@ EXPORT_SYMBOL(__ioremap_at);
 EXPORT_SYMBOL(iounmap);
 EXPORT_SYMBOL(__iounmap);
 EXPORT_SYMBOL(__iounmap_at);
+
+#ifdef CONFIG_PPC_64K_PAGES
+/*
+ * we support 16 fragments per PTE page. This is limited by how many
+ * bits we can pack in page->_mapcount. We use the first half for
+ * tracking the usage for rcu page table free.
+ */
+#define FRAG_MASK_BITS	16
+#define FRAG_MASK ((1 << FRAG_MASK_BITS) - 1)
+/*
+ * We use a 2K PTE page fragment and another 2K for storing
+ * real_pte_t hash index
+ */
+#define PTE_FRAG_SIZE (2 * PTRS_PER_PTE * sizeof(pte_t))
+
+static inline unsigned int atomic_xor_bits(atomic_t *v, unsigned int bits)
+{
+	unsigned int old, new;
+
+	do {
+		old = atomic_read(v);
+		new = old ^ bits;
+	} while (atomic_cmpxchg(v, old, new) != old);
+	return new;
+}
+
+unsigned long *page_table_alloc(struct mm_struct *mm, unsigned long vmaddr)
+{
+	struct page *page;
+	unsigned int mask, bit;
+	unsigned long *table;
+
+	spin_lock(&mm->page_table_lock);
+	mask = FRAG_MASK;
+	if (!list_empty(&mm->context.pgtable_list)) {
+		page = list_first_entry(&mm->context.pgtable_list,
+					struct page, lru);
+		table = (unsigned long *) page_address(page);
+		mask = atomic_read(&page->_mapcount);
+		/*
+		 * Update with the higher order mask bits accumulated,
+		 * added as a part of rcu free.
+		 */
+		mask = mask | (mask >> FRAG_MASK_BITS);
+	}
+	if ((mask & FRAG_MASK) == FRAG_MASK) {
+		spin_unlock(&mm->page_table_lock);
+		page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
+		if (!page)
+			return NULL;
+		pgtable_page_ctor(page);
+		atomic_set(&page->_mapcount, 1);
+		table = (unsigned long *) page_address(page);
+		spin_lock(&mm->page_table_lock);
+		INIT_LIST_HEAD(&page->lru);
+		list_add(&page->lru, &mm->context.pgtable_list);
+	} else {
+		/* The second half is used for real_pte_t hindex */
+		for (bit = 1; mask & bit; bit <<= 1)
+			table = (unsigned long *)((char *)table + PTE_FRAG_SIZE);
+
+		mask = atomic_xor_bits(&page->_mapcount, bit);
+		/*
+		 * We have taken up all the space, remove this from
+		 * the list, we will add it back when we have a free slot
+		 */
+		if ((mask & FRAG_MASK) == FRAG_MASK)
+			list_del_init(&page->lru);
+	}
+	spin_unlock(&mm->page_table_lock);
+	/*
+	 * zero out the newly allocated area, this make sure we don't
+	 * see the old left over pte values
+	 */
+	memset(table, 0, PTE_FRAG_SIZE);
+	return table;
+}
+
+void page_table_free(struct mm_struct *mm, unsigned long *table)
+{
+	struct page *page;
+	unsigned int bit, mask;
+
+	/* Free 4K page table fragment of a 64K page */
+	page = virt_to_page(table);
+	bit = 1 << ((__pa(table) & ~PAGE_MASK) / PTE_FRAG_SIZE);
+	spin_lock(&mm->page_table_lock);
+	mask = atomic_xor_bits(&page->_mapcount, bit);
+	if (mask == 0)
+		list_del(&page->lru);
+	else if (mask & FRAG_MASK) {
+		/*
+		 * Add the page table page to pgtable_list so that
+		 * the free fragment can be used by the next alloc
+		 */
+		list_del_init(&page->lru);
+		list_add(&page->lru, &mm->context.pgtable_list);
+	}
+	spin_unlock(&mm->page_table_lock);
+	if (mask == 0) {
+		pgtable_page_dtor(page);
+		atomic_set(&page->_mapcount, -1);
+		__free_page(page);
+	}
+}
+
+#ifdef CONFIG_SMP
+static void __page_table_free_rcu(void *table)
+{
+	unsigned int bit;
+	struct page *page;
+	/*
+	 * this is a PTE page free 4K page table
+	 * fragment of a 64K page.
+	 */
+	page = virt_to_page(table);
+	bit = 1 << ((__pa(table) & ~PAGE_MASK) / PTE_FRAG_SIZE);
+	bit <<= FRAG_MASK_BITS;
+	/*
+	 * clear the higher half and if nobody used the page in
+	 * between, even lower half would be zero.
+	 */
+	if (atomic_xor_bits(&page->_mapcount, bit) == 0) {
+		pgtable_page_dtor(page);
+		atomic_set(&page->_mapcount, -1);
+		__free_page(page);
+	}
+}
+
+static void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table)
+{
+	struct page *page;
+	struct mm_struct *mm;
+	unsigned int bit, mask;
+
+	mm = tlb->mm;
+	/* Free 4K page table fragment of a 64K page */
+	page = virt_to_page(table);
+	bit = 1 << ((__pa(table) & ~PAGE_MASK) / PTE_FRAG_SIZE);
+	spin_lock(&mm->page_table_lock);
+	/*
+	 * stash the actual mask in higher half, and clear the lower half
+	 * and selectively, add remove from pgtable list
+	 */
+	mask = atomic_xor_bits(&page->_mapcount, bit | (bit << FRAG_MASK_BITS));
+	if (!(mask & FRAG_MASK))
+		list_del(&page->lru);
+	else {
+		/*
+		 * Add the page table page to pgtable_list so that
+		 * the free fragment can be used by the next alloc.
+		 * We will not be able to use it untill the rcu grace period
+		 * is over, because we have the corresponding high half bit set
+		 * and page_table_alloc looks at the high half bit.
+		 */
+		list_del_init(&page->lru);
+		list_add_tail(&page->lru, &mm->context.pgtable_list);
+	}
+	spin_unlock(&mm->page_table_lock);
+	tlb_remove_table(tlb, table);
+}
+
+void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
+{
+	unsigned long pgf = (unsigned long)table;
+
+	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+	pgf |= shift;
+	if (shift == 0)
+		/* PTE page needs special handling */
+		page_table_free_rcu(tlb, table);
+	else
+		tlb_remove_table(tlb, (void *)pgf);
+}
+
+void __tlb_remove_table(void *_table)
+{
+	void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
+	unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
+
+	if (!shift)
+		/* PTE page needs special handling */
+		__page_table_free_rcu(table);
+	else {
+		BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+		kmem_cache_free(PGT_CACHE(shift), table);
+	}
+}
+#endif
+#endif /* CONFIG_PPC_64K_PAGES */

^ permalink raw reply related

* [PATCH 3/3] powerpc: Set DSCR bit in FSCR setup
From: Michael Neuling @ 2013-03-04  9:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Michael Neuling, linuxppc-dev, Ian Munsie
In-Reply-To: <1362390402-17725-1-git-send-email-mikey@neuling.org>

We support DSCR so we should make sure we set it in the FSCR incase some
firmwares don't set it.

Signed-off-by: Michael Neuling <mikey@neuling.org>
---
 arch/powerpc/kernel/cpu_setup_power.S |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
index bb2d203..ea847ab 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -116,7 +116,7 @@ __init_LPCR:
 
 __init_FSCR:
 	mfspr	r3,SPRN_FSCR
-	ori	r3,r3,FSCR_TAR
+	ori	r3,r3,FSCR_TAR|FSCR_DSCR
 	mtspr	SPRN_FSCR,r3
 	blr
 
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 2/3] powerpc: Add DSCR FSCR register bit definition
From: Michael Neuling @ 2013-03-04  9:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Michael Neuling, linuxppc-dev, Ian Munsie
In-Reply-To: <1362390402-17725-1-git-send-email-mikey@neuling.org>

Also harmonise TAR bit definition too.

Signed-off-by: Michael Neuling <mikey@neuling.org>
---
 arch/powerpc/include/asm/reg.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index e665861..a43cd2d 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -266,7 +266,8 @@
 #define SPRN_HSRR0	0x13A	/* Hypervisor Save/Restore 0 */
 #define SPRN_HSRR1	0x13B	/* Hypervisor Save/Restore 1 */
 #define SPRN_FSCR	0x099	/* Facility Status & Control Register */
-#define FSCR_TAR	(1<<8)	/* Enable Target Adress Register */
+#define   FSCR_TAR	(1 << (63-55)) /* Enable Target Adress Register */
+#define   FSCR_DSCR	(1 << (63-61)) /* Enable DSCR */
 #define SPRN_TAR	0x32f	/* Target Address Register */
 #define SPRN_LPCR	0x13E	/* LPAR Control Register */
 #define   LPCR_VPM0	(1ul << (63-0))
-- 
1.7.10.4

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox