LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH -V2 3/5] mm: Move change_prot_numa outside CONFIG_ARCH_USES_NUMA_PROT_NONE
From: Benjamin Herrenschmidt @ 2013-12-05 21:00 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linuxppc-dev, paulus, Aneesh Kumar K.V
In-Reply-To: <52A0B786.608@redhat.com>

On Thu, 2013-12-05 at 12:27 -0500, Rik van Riel wrote:

> However, it appears that since the code was #ifdefed
> like that, the called code was made generic enough,
> that change_prot_numa should actually work for
> everything.
> 
> In other words:
> 
> Reviewed-by: Rik van Riel <riel@redhat.com>

Ok thanks, that's what I needed. Do you have any objection of me merging
that change via the powerpc tree along with the corresponding powerpc
bits from Aneesh ?

The other option would be to have it in a topic branch that I pull from
you.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH -V3] mm: Move change_prot_numa outside CONFIG_ARCH_USES_NUMA_PROT_NONE
From: Rik van Riel @ 2013-12-05 20:25 UTC (permalink / raw)
  To: Aneesh Kumar K.V, benh, paulus, mgorman; +Cc: linux-mm, linuxppc-dev
In-Reply-To: <1386268702-30806-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

On 12/05/2013 01:38 PM, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>
> change_prot_numa should work even if _PAGE_NUMA != _PAGE_PROTNONE.
> On archs like ppc64 that don't use _PAGE_PROTNONE and also have
> a separate page table outside linux pagetable, we just need to
> make sure that when calling change_prot_numa we flush the
> hardware page table entry so that next page access  result in a numa
> fault.
>
> We still need to make sure we use the numa faulting logic only
> when CONFIG_NUMA_BALANCING is set. This implies the migrate-on-fault
> (Lazy migration) via mbind will only work if CONFIG_NUMA_BALANCING
> is set.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Reviewed-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply

* [PATCH -V3] mm: Move change_prot_numa outside CONFIG_ARCH_USES_NUMA_PROT_NONE
From: Aneesh Kumar K.V @ 2013-12-05 18:38 UTC (permalink / raw)
  To: benh, paulus, mgorman, riel; +Cc: linux-mm, linuxppc-dev, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

change_prot_numa should work even if _PAGE_NUMA != _PAGE_PROTNONE.
On archs like ppc64 that don't use _PAGE_PROTNONE and also have
a separate page table outside linux pagetable, we just need to
make sure that when calling change_prot_numa we flush the
hardware page table entry so that next page access  result in a numa
fault.

We still need to make sure we use the numa faulting logic only
when CONFIG_NUMA_BALANCING is set. This implies the migrate-on-fault
(Lazy migration) via mbind will only work if CONFIG_NUMA_BALANCING
is set.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
Previous discussion around the patch can be found at
http://article.gmane.org/gmane.linux.kernel.mm/109305

changes from V2:
* Move the numa faulting definition within CONFIG_NUMA_BALANCING

 include/linux/mm.h | 2 +-
 mm/mempolicy.c     | 5 ++---
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1cedd000cf29..a7b4e310bf42 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1842,7 +1842,7 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
 }
 #endif
 
-#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
+#ifdef CONFIG_NUMA_BALANCING
 unsigned long change_prot_numa(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end);
 #endif
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index eca4a3129129..9f73b29d304d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -613,7 +613,7 @@ static inline int queue_pages_pgd_range(struct vm_area_struct *vma,
 	return 0;
 }
 
-#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
+#ifdef CONFIG_NUMA_BALANCING
 /*
  * This is used to mark a range of virtual addresses to be inaccessible.
  * These are later cleared by a NUMA hinting fault. Depending on these
@@ -627,7 +627,6 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 			unsigned long addr, unsigned long end)
 {
 	int nr_updated;
-	BUILD_BUG_ON(_PAGE_NUMA != _PAGE_PROTNONE);
 
 	nr_updated = change_protection(vma, addr, end, vma->vm_page_prot, 0, 1);
 	if (nr_updated)
@@ -641,7 +640,7 @@ static unsigned long change_prot_numa(struct vm_area_struct *vma,
 {
 	return 0;
 }
-#endif /* CONFIG_ARCH_USES_NUMA_PROT_NONE */
+#endif /* CONFIG_NUMA_BALANCING */
 
 /*
  * Walk through page tables and collect pages to be migrated.
-- 
1.8.3.2

^ permalink raw reply related

* Re: [PATCH -V2 3/5] mm: Move change_prot_numa outside CONFIG_ARCH_USES_NUMA_PROT_NONE
From: Rik van Riel @ 2013-12-05 17:52 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Aneesh Kumar K.V
  Cc: linux-mm, linuxppc-dev, paulus, Mel Gorman
In-Reply-To: <1386220835.21910.21.camel@pasglop>

On 12/05/2013 12:20 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2013-12-05 at 10:48 +0530, Aneesh Kumar K.V wrote:
>>
>> Ok, I can move the changes below #ifdef CONFIG_NUMA_BALANCING ? We call
>> change_prot_numa from task_numa_work and queue_pages_range(). The later
>> may be an issue. So doing the below will help ?
>>
>> -#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
>> +#ifdef CONFIG_NUMA_BALANCING
>
> I will defer to Mel and Rik (should we also CC Andrea ?)

It looks like manual numa binding can also use lazy
page migration, but I am not sure if that can happen
without CONFIG_NUMA_BALANCING, or if mbind always uses
MPOL_MF_STRICT...

^ permalink raw reply

* Re: [PATCH -V2 3/5] mm: Move change_prot_numa outside CONFIG_ARCH_USES_NUMA_PROT_NONE
From: Rik van Riel @ 2013-12-05 17:27 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Aneesh Kumar K.V; +Cc: linux-mm, paulus, linuxppc-dev
In-Reply-To: <1386126782.16703.137.camel@pasglop>

On 12/03/2013 10:13 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2013-11-18 at 14:58 +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>>
>> change_prot_numa should work even if _PAGE_NUMA != _PAGE_PROTNONE.
>> On archs like ppc64 that don't use _PAGE_PROTNONE and also have
>> a separate page table outside linux pagetable, we just need to
>> make sure that when calling change_prot_numa we flush the
>> hardware page table entry so that next page access  result in a numa
>> fault.
>
> That patch doesn't look right...

At first glance, indeed...

> You are essentially making change_prot_numa() do whatever it does (which
> I don't completely understand) *for all architectures* now, whether they
> have CONFIG_ARCH_USES_NUMA_PROT_NONE or not ... So because you want that
> behaviour on powerpc book3s64, you change everybody.

However, it appears that since the code was #ifdefed
like that, the called code was made generic enough,
that change_prot_numa should actually work for
everything.

In other words:

Reviewed-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply

* [PATCH] powerpc: Fix PTE page address mismatch in pgtable ctor/dtor
From: Hong H. Pham @ 2013-12-05 15:54 UTC (permalink / raw)
  To: linux-rt-users, linuxppc-dev
  Cc: Paul Mackerras, Hong H. Pham, Aneesh Kumar K.V, linux-stable

In pte_alloc_one(), pgtable_page_ctor() is passed an address that has
not been converted by page_address() to the newly allocated PTE page.

When the PTE is freed, __pte_free_tlb() calls pgtable_page_dtor()
with an address to the PTE page that has been converted by page_address().
The mismatch in the PTE's page address causes pgtable_page_dtor() to access
invalid memory, so resources for that PTE (such as the page lock) is not
properly cleaned up.

This bug was introduced by commit d614bb041209fd7cb5e4b35e11a7b2f6ee8f62b8
"powerpc: Move the pte free routines from common header".

On a preempt-rt kernel, a spinlock is dynamically allocated for each
PTE in pgtable_page_ctor().  When the PTE is freed, calling
pgtable_page_dtor() with a mismatched page address causes a memory leak,
as the pointer to the PTE's spinlock is bogus.

On mainline, there isn't any immediately obvious symptoms, but the
problem still exists here.

Fixes: d614bb041209fd7c "powerpc: Move the pte free routes from common header"
Cc: Paul Mackerras <paulus@samba.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: linux-stable <stable@vger.kernel.org> # v3.10+
Signed-off-by: Hong H. Pham <hong.pham@windriver.com>
---
 arch/powerpc/include/asm/pgalloc-32.h | 2 +-
 arch/powerpc/include/asm/pgalloc-64.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc-32.h b/arch/powerpc/include/asm/pgalloc-32.h
index 27b2386..7ff24f0 100644
--- a/arch/powerpc/include/asm/pgalloc-32.h
+++ b/arch/powerpc/include/asm/pgalloc-32.h
@@ -87,7 +87,7 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
 	struct page *page = page_address(table);
 
 	tlb_flush_pgtable(tlb, address);
-	pgtable_page_dtor(page);
+	pgtable_page_dtor(table);
 	pgtable_free_tlb(tlb, page, 0);
 }
 #endif /* _ASM_POWERPC_PGALLOC_32_H */
diff --git a/arch/powerpc/include/asm/pgalloc-64.h b/arch/powerpc/include/asm/pgalloc-64.h
index f65e27b..b187dc5 100644
--- a/arch/powerpc/include/asm/pgalloc-64.h
+++ b/arch/powerpc/include/asm/pgalloc-64.h
@@ -147,7 +147,7 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
 	struct page *page = page_address(table);
 
 	tlb_flush_pgtable(tlb, address);
-	pgtable_page_dtor(page);
+	pgtable_page_dtor(table);
 	pgtable_free_tlb(tlb, page, 0);
 }
 
-- 
1.8.3.2

^ permalink raw reply related

* Re: [PATCH 3/3] powerpc: mm: optimize for the correctly placed page
From: Liu ping fan @ 2013-12-05 13:27 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, Paul Mackerras
In-Reply-To: <877gbjva24.fsf@linux.vnet.ibm.com>

On Thu, Dec 5, 2013 at 6:58 PM, Aneesh Kumar K.V
<aneesh.kumar@linux.vnet.ibm.com> wrote:
> Liu Ping Fan <kernelfans@gmail.com> writes:
>
>> The period check of _PAGE_NUMA can probably trigger the check on
>> the correctly placed page. For this case, we can just insert hpte and
>> do fast exception return.
>
> I still don't understand why we need to handle numa faults in hash
> page ? Are you trying to optimize the code path ? If so can you explain
> the benefits ? Some numbers showing it is helping  ?
>
When return from hash_page(), we will take fast_exc_return_irq, while
from do_page_fault(), we take ret_from_except.
As the "fast" implies that there are more complicated logic to sync
the interrupt states in ret_from_except, which cost much.
Do you think so?

>>
>> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/mm/hash_utils_64.c | 34 +++++++++++++++++++++++++++++++++-
>>  1 file changed, 33 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
>> index 9bf1195..735678c 100644
>> --- a/arch/powerpc/mm/hash_utils_64.c
>> +++ b/arch/powerpc/mm/hash_utils_64.c
>> @@ -965,6 +965,10 @@ int hash_page(unsigned long ea, unsigned long access, unsigned long trap)
>>       const struct cpumask *tmp;
>>       int rc, user_region = 0, local = 0;
>>       int psize, ssize;
>> +     pte_t old, new;
>> +     struct vm_area_struct *vma;
>> +     int page_nid, target_nid;
>> +     struct page *test_page;
>>
>>       DBG_LOW("hash_page(ea=%016lx, access=%lx, trap=%lx\n",
>>               ea, access, trap);
>> @@ -1033,12 +1037,40 @@ int hash_page(unsigned long ea, unsigned long access, unsigned long trap)
>>
>>       /* Get PTE and page size from page tables */
>>       ptep = find_linux_pte_or_hugepte(pgdir, ea, &hugeshift);
>> -     if (ptep == NULL || !pte_present(*ptep) || pte_numa(*ptep)) {
>> +     if (ptep == NULL || !pte_present(*ptep)) {
>>               DBG_LOW(" no PTE !\n");
>>               rc = 1;
>>               goto bail;
>>       }
>>
>> +     old = pte_val(*ptep);
>> +     if (pte_numa(old)) {
>> +             /* If fail to lock, let do_page_fault() to handle it */
>> +             if (down_read_trylock(&mm->mmap_sem)) {
>
> hmm is that something we want to do in hash_page ?
>
Yes, the function has no relation with hash. But I think it depends on
whether it is worth to optimize or not.

Thanks and regards,
Pingfan
>> +                     vma = find_vma(mm, ea);
>> +                     up_read(&mm->mmap_sem);
>> +                     test_page = pte_page(old);
>> +                     page_nid = page_to_nid(test_page);
>> +                     target_nid = numa_migrate_prep(test_page, vma, ea,
>> +                                             page_nid);
>> +                     if (target_nid < 0) {
>> +                             new = pte_mknonnuma(old);
>> +                             /* If ptep is modified under us,
>> +                              * just retry the access
>> +                              */
>> +                             if (unlikely(cmpxchg(ptep, old, new) != old)) {
>> +                                     put_page(test_page);
>> +                                     return 0;
>> +                             }
>> +                             put_page(test_page);
>> +                     }
>> +             } else {
>> +                             put_page(test_page);
>> +                             rc = 1;
>> +                             goto bail;
>> +             }
>> +     }
>> +
>>       /* Add _PAGE_PRESENT to the required access perm */
>>       access |= _PAGE_PRESENT;
>>
>
> -aneesh
>

^ permalink raw reply

* [PATCH 6/9] powerpc/pci: Use dev_is_pci() to check whether it is pci device
From: Yijing Wang @ 2013-12-05 12:01 UTC (permalink / raw)
  To: Bjorn Helgaas, Benjamin Herrenschmidt, Paul Mackerras
  Cc: linux-pci, linux-kernel, Yijing Wang, Hanjun Guo, Scott Wood,
	linuxppc-dev

Use PCI standard marco dev_is_pci() instead of directly compare
pci_bus_type to check whether it is pci device.

Signed-off-by: Yijing Wang <wangyijing@huawei.com>
---
 arch/powerpc/sysdev/fsl_pci.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
index 4dfd61d..7066e52 100644
--- a/arch/powerpc/sysdev/fsl_pci.c
+++ b/arch/powerpc/sysdev/fsl_pci.c
@@ -122,7 +122,7 @@ static int fsl_pci_dma_set_mask(struct device *dev, u64 dma_mask)
 	 * address width of the SoC such that we can address any internal
 	 * SoC address from across PCI if needed
 	 */
-	if ((dev->bus == &pci_bus_type) &&
+	if ((dev_is_pci(dev)) &&
 	    dma_mask >= DMA_BIT_MASK(MAX_PHYS_ADDR_BITS)) {
 		set_dma_ops(dev, &dma_direct_ops);
 		set_dma_offset(dev, pci64_dma_offset);
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH 3/3] powerpc: mm: optimize for the correctly placed page
From: Aneesh Kumar K.V @ 2013-12-05 10:58 UTC (permalink / raw)
  To: Liu Ping Fan, linuxppc-dev; +Cc: Paul Mackerras
In-Reply-To: <1386140348-7854-4-git-send-email-pingfank@linux.vnet.ibm.com>

Liu Ping Fan <kernelfans@gmail.com> writes:

> The period check of _PAGE_NUMA can probably trigger the check on
> the correctly placed page. For this case, we can just insert hpte and
> do fast exception return.

I still don't understand why we need to handle numa faults in hash
page ? Are you trying to optimize the code path ? If so can you explain
the benefits ? Some numbers showing it is helping  ?


>
> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> ---
>  arch/powerpc/mm/hash_utils_64.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
> index 9bf1195..735678c 100644
> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -965,6 +965,10 @@ int hash_page(unsigned long ea, unsigned long access, unsigned long trap)
>  	const struct cpumask *tmp;
>  	int rc, user_region = 0, local = 0;
>  	int psize, ssize;
> +	pte_t old, new;
> +	struct vm_area_struct *vma;
> +	int page_nid, target_nid;
> +	struct page *test_page;
>
>  	DBG_LOW("hash_page(ea=%016lx, access=%lx, trap=%lx\n",
>  		ea, access, trap);
> @@ -1033,12 +1037,40 @@ int hash_page(unsigned long ea, unsigned long access, unsigned long trap)
>
>  	/* Get PTE and page size from page tables */
>  	ptep = find_linux_pte_or_hugepte(pgdir, ea, &hugeshift);
> -	if (ptep == NULL || !pte_present(*ptep) || pte_numa(*ptep)) {
> +	if (ptep == NULL || !pte_present(*ptep)) {
>  		DBG_LOW(" no PTE !\n");
>  		rc = 1;
>  		goto bail;
>  	}
>
> +	old = pte_val(*ptep);
> +	if (pte_numa(old)) {
> +		/* If fail to lock, let do_page_fault() to handle it */
> +		if (down_read_trylock(&mm->mmap_sem)) {

hmm is that something we want to do in hash_page ?

> +			vma = find_vma(mm, ea);
> +			up_read(&mm->mmap_sem);
> +			test_page = pte_page(old);
> +			page_nid = page_to_nid(test_page);
> +			target_nid = numa_migrate_prep(test_page, vma, ea,
> +						page_nid);
> +			if (target_nid < 0) {
> +				new = pte_mknonnuma(old);
> +				/* If ptep is modified under us,
> +				 * just retry the access
> +				 */
> +				if (unlikely(cmpxchg(ptep, old, new) != old)) {
> +					put_page(test_page);
> +					return 0;
> +				}
> +				put_page(test_page);
> +			}
> +		} else {
> +				put_page(test_page);
> +				rc = 1;
> +				goto bail;
> +		}
> +	}
> +
>  	/* Add _PAGE_PRESENT to the required access perm */
>  	access |= _PAGE_PRESENT;
>

-aneesh

^ permalink raw reply

* RE: Error in frreing hugepages with preemption enabled
From: Bharat Bhushan @ 2013-12-05 11:14 UTC (permalink / raw)
  To: Andrea Arcangeli, Alexander Graf
  Cc: Scott Wood, linuxppc-dev@lists.ozlabs.org, kvm@vger.kernel.org,
	kvm-ppc@vger.kernel.org
In-Reply-To: <20131203222121.GB18764@redhat.com>



> -----Original Message-----
> From: Andrea Arcangeli [mailto:aarcange@redhat.com]
> Sent: Wednesday, December 04, 2013 3:51 AM
> To: Alexander Graf
> Cc: Bhushan Bharat-R65777; linuxppc-dev@lists.ozlabs.org; kvm-
> ppc@vger.kernel.org; kvm@vger.kernel.org; Wood Scott-B07421; Ben Herrensc=
hmidt
> Subject: Re: Error in frreing hugepages with preemption enabled
>=20
> Hi everyone,
>=20
> On Fri, Nov 29, 2013 at 12:13:03PM +0100, Alexander Graf wrote:
> >
> > On 29.11.2013, at 05:38, Bharat Bhushan <Bharat.Bhushan@freescale.com> =
wrote:
> >
> > > Hi Alex,
> > >
> > > I am running KVM guest with host kernel having CONFIG_PREEMPT enabled=
. With
> allocated pages things seems to work fine but I uses hugepages for guest =
I see
> below prints when "quit" from qemu.
> > >
> > > (qemu) QEMU waiting for connection on: telnet:0.0.0.0:4444,server
> > > qemu-system-ppc64: pci_add_option_rom: failed to find romfile "efi-
> virtio.rom"
> > > q
> > > debug_smp_processor_id: 15 callbacks suppressed
> > > BUG: using smp_processor_id() in preemptible [00000000] code:
> > > qemu-system-ppc/2504 caller is .free_hugepd_range+0xb0/0x21c
> > > CPU: 1 PID: 2504 Comm: qemu-system-ppc Not tainted
> > > 3.12.0-rc3-07733-gabf4907 #175 Call Trace:
> > > [c0000000fb433400] [c000000000007d38] .show_stack+0x7c/0x1cc
> > > (unreliable) [c0000000fb4334d0] [c0000000005e8ce0]
> > > .dump_stack+0x9c/0xf4 [c0000000fb433560] [c0000000002de5ec]
> > > .debug_smp_processor_id+0x108/0x11c
> > > [c0000000fb4335f0] [c000000000025e10] .free_hugepd_range+0xb0/0x21c
> > > [c0000000fb433680] [c0000000000265bc]
> > > .hugetlb_free_pgd_range+0x2c8/0x3b0
> > > [c0000000fb4337a0] [c0000000000e428c] .free_pgtables+0x14c/0x158
> > > [c0000000fb433840] [c0000000000ef320] .exit_mmap+0xec/0x194
> > > [c0000000fb433960] [c00000000004d780] .mmput+0x64/0x124
> > > [c0000000fb4339e0] [c000000000051f40] .do_exit+0x29c/0x9c8
> > > [c0000000fb433ae0] [c0000000000527c8] .do_group_exit+0x50/0xc4
> > > [c0000000fb433b70] [c0000000000606a0]
> > > .get_signal_to_deliver+0x21c/0x5d8
> > > [c0000000fb433c70] [c000000000009b08] .do_signal+0x54/0x278
> > > [c0000000fb433db0] [c000000000009e50] .do_notify_resume+0x64/0x78
> > > [c0000000fb433e30] [c000000000000b44]
> > > .ret_from_except_lite+0x70/0x74
> > >
> > >
> > > This mean that free_hugepd_range() must be called with preemption ena=
bled.
> >
> > with preemption disabled.
> >
> > > I tried below change and this seems to work fine (I am not having
> > > expertise in this area so not sure this is correct way)
> >
> > Not sure - the scope looks odd to me. Let's ask Andrea - I'm sure he kn=
ows
> what to do :).
>=20
> :) So I had a look at the top of this function (0xb0) in the upstream ker=
nel and
> no smp_processor_id() call is apparent, is this stock git or a ppc tree? =
The
> first few calls seem not to call it but I may have overlooked something. =
It's
> just quicker if somebody with vmlinux finds the location of it.
>=20
> static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int
> pdshift,
> 			      unsigned long start, unsigned long end,
> 			      unsigned long floor, unsigned long ceiling) {
> 	pte_t *hugepte =3D hugepd_page(*hpdp);
> 	int i;
>=20
> 	unsigned long pdmask =3D ~((1UL << pdshift) - 1);
> 	unsigned int num_hugepd =3D 1;
>=20
> #ifdef CONFIG_PPC_FSL_BOOK3E
> 	/* Note: On fsl the hpdp may be the first of several */
> 	num_hugepd =3D (1 << (hugepd_shift(*hpdp) - pdshift)); #else
> 	unsigned int shift =3D hugepd_shift(*hpdp); #endif
>=20
> 	start &=3D pdmask;
> 	if (start < floor)
> 		return;
> 	if (ceiling) {
> 		ceiling &=3D pdmask;
> 		if (! ceiling)
> 			return;
> 	}
> 	if (end - 1 > ceiling - 1)
> 		return;
>=20
> 	for (i =3D 0; i < num_hugepd; i++, hpdp++)
> 		hpdp->pd =3D 0;
>=20
> 	tlb->need_flush =3D 1;
>=20
> #ifdef CONFIG_PPC_FSL_BOOK3E
> 	hugepd_free(tlb, hugepte);
> #else
> 	pgtable_free_tlb(tlb, hugepte, pdshift - shift); #endif }
>=20
> Generally smp_processor_id should never be used, exactly to avoid problem=
s like
> above with preempion enabled in .config.
>=20
> Instead it should be replaced with a get_cpu()/put_cpu() pair that is exa=
ctly
> meant to fix bugs like this and define proper critical sections around th=
e per-
> cpu variables.
>=20
> #define get_cpu()		({ preempt_disable(); smp_processor_id(); })
> #define put_cpu()		preempt_enable()
>=20
> After you find where that smp_processor_id() is located, you should simpl=
y
> replace it with a get_cpu() and then you should insert a put_cpu immediat=
ely
> after the "cpu" info is not used anymore. That will define a proper and s=
trict
> critical section around the use of the per-cpu variables.
>=20
> With a ppc vmlinux it should be immediate to find the location of
> smp_processor_id but I don't have the ppc vmlinux here.

Thanks Andrea for the details description. It is really helpful

I will look into this.

Thanks
-Bharat

>=20
> Thanks!
> Andrea
>=20
> >
> >
> > Alex
> >
> > >
> > > diff --git a/arch/powerpc/mm/hugetlbpage.c
> > > b/arch/powerpc/mm/hugetlbpage.c index d67db4b..6bf8459 100644
> > > --- a/arch/powerpc/mm/hugetlbpage.c
> > > +++ b/arch/powerpc/mm/hugetlbpage.c
> > > @@ -563,8 +563,10 @@ static void hugetlb_free_pmd_range(struct mmu_ga=
ther
> *tlb, pud_t *pud,
> > >                 */
> > >                next =3D addr + (1 << hugepd_shift(*(hugepd_t *)pmd));
> > > #endif
> > > +               preempt_disable();
> > >                free_hugepd_range(tlb, (hugepd_t *)pmd, PMD_SHIFT,
> > >                                  addr, next, floor, ceiling);
> > > +               preempt_enable();
> > >        } while (addr =3D next, addr !=3D end);
> > >
> > >        start &=3D PUD_MASK;
> > >
> > >
> > > Thanks
> > > -Bharat
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe kvm-ppc"
> > > in the body of a message to majordomo@vger.kernel.org More majordomo
> > > info at  http://vger.kernel.org/majordomo-info.html
> >

^ permalink raw reply

* Re: [PATCH 1/3] powerpc: mm: make _PAGE_NUMA take effect
From: Aneesh Kumar K.V @ 2013-12-05 10:53 UTC (permalink / raw)
  To: Liu Ping Fan, linuxppc-dev; +Cc: Paul Mackerras
In-Reply-To: <1386140348-7854-2-git-send-email-pingfank@linux.vnet.ibm.com>

Liu Ping Fan <kernelfans@gmail.com> writes:

> To enable the do_numa_page(), we should not fix _PAGE_NUMA in
> hash_page(), so bail out for the case of pte_numa().
>
> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> ---
>  arch/powerpc/mm/hash_utils_64.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
> index fb176e9..9bf1195 100644
> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -1033,7 +1033,7 @@ int hash_page(unsigned long ea, unsigned long access, unsigned long trap)
>
>  	/* Get PTE and page size from page tables */
>  	ptep = find_linux_pte_or_hugepte(pgdir, ea, &hugeshift);
> -	if (ptep == NULL || !pte_present(*ptep)) {
> +	if (ptep == NULL || !pte_present(*ptep) || pte_numa(*ptep)) {
>  		DBG_LOW(" no PTE !\n");
>  		rc = 1;
>  		goto bail;

why ? , All the hash routines do check for _PAGE_PRESENT via access
variable.

-aneesh

^ permalink raw reply

* Re: [PATCH] powerpc: fix xmon disassembler for little-endian
From: Philippe Bergheaud @ 2013-12-05  8:50 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Tom Musta, linuxppc-dev
In-Reply-To: <1386218366.21910.16.camel@pasglop>

Benjamin Herrenschmidt wrote:
> On Wed, 2013-12-04 at 14:45 +0100, Philippe Bergheaud wrote:
> 
> 
>>>>+#ifdef __LITTLE_ENDIAN__
>>>>+#define GETWORD(v)	(((v)[3] << 24) + ((v)[2] << 16) + ((v)[1] << 8) + (v)[0])
>>>>+#else
>>>>#define GETWORD(v)	(((v)[0] << 24) + ((v)[1] << 16) + ((v)[2] << 8) + (v)[3])
>>>>+#endif
>>>>
>>>>#define isxdigit(c)	(('0' <= (c) && (c) <= '9') \
>>>>			 || ('a' <= (c) && (c) <= 'f') \
>>>>
>>>
>>>
>>>Philippe:  Wouldn't it be better to just do a 32-bit load and let the endianness be worked out
>>>by the hardware?  i.e.
>>>
>>>#define GETWORD(v) (*(u32 *)v)
>>
>>Yes, your alternative is better.
>>Wouldn't it narrow the scope of the macro to aligned words on POWER7?
>>I think that all references to GETWORD operate on aligned words anyway.
> 
> 
> Well, xmon has to be robust ... as long as you are *certain* that even
> with crap entry state it won't try to access unaligned boundaries then
> go for it but we aren't looking at performance here.
Thank you Tom and Ben.  We are definitely not looking at performance here.
I prefer to stay on the safe side, and leave the original patch untouched.

Philippe

^ permalink raw reply

* [PATCH] powerpc/p1010rdb-pa: modify phy interrupt.
From: Zhao Qiang @ 2013-12-05  6:00 UTC (permalink / raw)
  To: bcousson, tony, devicetree; +Cc: Zhao Qiang, linuxppc-dev

It is not correct according to p1010rdb-pa user guide.
So modify it.

Signed-off-by: Zhao Qiang <B45475@freescale.com>
---
 arch/powerpc/boot/dts/p1010rdb-pa.dtsi | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/boot/dts/p1010rdb-pa.dtsi b/arch/powerpc/boot/dts/p1010rdb-pa.dtsi
index 3d275e0..0cf6c34 100644
--- a/arch/powerpc/boot/dts/p1010rdb-pa.dtsi
+++ b/arch/powerpc/boot/dts/p1010rdb-pa.dtsi
@@ -39,7 +39,7 @@
 };
 
 &phy0 {
-	interrupts = <3 1 0 0>;
+	interrupts = <1 1 0 0>;
 };
 
 &phy1 {
@@ -47,5 +47,5 @@
 };
 
 &phy2 {
-	interrupts = <2 1 0 0>;
+	interrupts = <4 1 0 0>;
 };
-- 
1.8.4

^ permalink raw reply related

* Re: [PATCH -V2 3/5] mm: Move change_prot_numa outside CONFIG_ARCH_USES_NUMA_PROT_NONE
From: Benjamin Herrenschmidt @ 2013-12-05  5:20 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linux-mm, Rik van Riel, linuxppc-dev, paulus, Mel Gorman
In-Reply-To: <87a9gfri3u.fsf@linux.vnet.ibm.com>

On Thu, 2013-12-05 at 10:48 +0530, Aneesh Kumar K.V wrote:
> 
> Ok, I can move the changes below #ifdef CONFIG_NUMA_BALANCING ? We call
> change_prot_numa from task_numa_work and queue_pages_range(). The later
> may be an issue. So doing the below will help ?
> 
> -#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
> +#ifdef CONFIG_NUMA_BALANCING

I will defer to Mel and Rik (should we also CC Andrea ?)

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH -V2 3/5] mm: Move change_prot_numa outside CONFIG_ARCH_USES_NUMA_PROT_NONE
From: Aneesh Kumar K.V @ 2013-12-05  5:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Mel Gorman, Rik van Riel
  Cc: linux-mm, paulus, linuxppc-dev
In-Reply-To: <1386126782.16703.137.camel@pasglop>


Adding Mel and Rik to cc:

Benjamin Herrenschmidt <benh@au1.ibm.com> writes:

> On Mon, 2013-11-18 at 14:58 +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>> 
>> change_prot_numa should work even if _PAGE_NUMA != _PAGE_PROTNONE.
>> On archs like ppc64 that don't use _PAGE_PROTNONE and also have
>> a separate page table outside linux pagetable, we just need to
>> make sure that when calling change_prot_numa we flush the
>> hardware page table entry so that next page access  result in a numa
>> fault.
>
> That patch doesn't look right...
>
> You are essentially making change_prot_numa() do whatever it does (which
> I don't completely understand) *for all architectures* now, whether they
> have CONFIG_ARCH_USES_NUMA_PROT_NONE or not ... So because you want that
> behaviour on powerpc book3s64, you change everybody.
>
> Is that correct ?


Yes. 

>
> Also what exactly is that doing, can you explain ? From what I can see,
> it calls back into the core of mprotect to change the protection to
> vma->vm_page_prot, which I would have expected is already the protection
> there, with the added "prot_numa" flag passed down.

it set the _PAGE_NUMA bit. Now we also want to make sure that when
we set _PAGE_NUMA, we would get a pagefault on that so that we can track
that fault as a numa fault. To ensure that, we had the below BUILD_BUG

	BUILD_BUG_ON(_PAGE_NUMA != _PAGE_PROTNONE);
        

But other than that the function doesn't really have any dependency on
_PAGE_PROTNONE. The only requirement is when we set _PAGE_NUMA, the
architecture should do enough to ensure that we get a page fault. Now on
ppc64 we does that by clearlying hpte entry and also clearing
_PAGE_PRESENT. Since we have _PAGE_PRESENT cleared hash_page will return
1 and we get to page fault handler.

>
> Your changeset comment says "On archs like ppc64 [...] we just need to
> make sure that when calling change_prot_numa we flush the
> hardware page table entry so that next page access  result in a numa
> fault."
>
> But change_prot_numa() does a lot more than that ... it does
> pte_mknuma(), do we need it ? I assume we do or we wouldn't have added
> that PTE bit to begin with...
>
> Now it *might* be allright and it might be that no other architecture
> cares anyway etc... but I need at least some mm folks to ack on that
> patch before I can take it because it *will* change behaviour of other
> architectures.
>

Ok, I can move the changes below #ifdef CONFIG_NUMA_BALANCING ? We call
change_prot_numa from task_numa_work and queue_pages_range(). The later
may be an issue. So doing the below will help ?

-#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
+#ifdef CONFIG_NUMA_BALANCING


-aneesh

^ permalink raw reply

* Re: [PATCH 3/8] IBM Akebono: Add support for a new PHY interface to the IBM emac driver
From: Benjamin Herrenschmidt @ 2013-12-05  4:49 UTC (permalink / raw)
  To: Alistair Popple; +Cc: linuxppc-dev, David S. Miller, netdev
In-Reply-To: <1385086116-10972-3-git-send-email-alistair@popple.id.au>

On Fri, 2013-11-22 at 13:08 +1100, Alistair Popple wrote:
> The IBM PPC476GTR SoC that is used on the Akebono board uses a
> different ethernet PHY interface that has wake on lan (WOL) support
> with the IBM emac. This patch adds support to the IBM emac driver for
> this new PHY interface.
> 
> At this stage the wake on lan functionality has not been implemented.
> 
> Signed-off-by: Alistair Popple <alistair@popple.id.au>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: netdev@vger.kernel.org

Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

^ permalink raw reply

* Re: [PATCH V4 00/10] perf: New conditional branch filter
From: Michael Ellerman @ 2013-12-05  4:47 UTC (permalink / raw)
  To: acme
  Cc: mikey, ak, linux-kernel, eranian, linuxppc-dev, sukadev, mingo,
	Anshuman Khandual
In-Reply-To: <1386153162-11225-1-git-send-email-khandual@linux.vnet.ibm.com>

On Wed, 2013-12-04 at 16:02 +0530, Anshuman Khandual wrote:
> 		This patchset is the re-spin of the original branch stack sampling
> patchset which introduced new PERF_SAMPLE_BRANCH_COND branch filter. This patchset
> also enables SW based branch filtering support for book3s powerpc platforms which
> have PMU HW backed branch stack sampling support. 
> 
> Summary of code changes in this patchset:
> 
> (1) Introduces a new PERF_SAMPLE_BRANCH_COND branch filter
> (2) Add the "cond" branch filter options in the "perf record" tool
> (3) Enable PERF_SAMPLE_BRANCH_COND in X86 platforms
> (4) Enable PERF_SAMPLE_BRANCH_COND in POWER8 platform 
> (5) Update the documentation regarding "perf record" tool


Hi Arnaldo,

Can you please take just patches 1-5 into the perf tree? And do you mind
putting them in a topic branch so Benh can merge that.

The remaining patches are powerpc specific and still need some more review.

cheers

^ permalink raw reply

* Re: [PATCH] powerpc: fix xmon disassembler for little-endian
From: Benjamin Herrenschmidt @ 2013-12-05  4:39 UTC (permalink / raw)
  To: Philippe Bergheaud; +Cc: Tom Musta, linuxppc-dev
In-Reply-To: <529F31F8.6040302@linux.vnet.ibm.com>

On Wed, 2013-12-04 at 14:45 +0100, Philippe Bergheaud wrote:

> >>+#ifdef __LITTLE_ENDIAN__
> >>+#define GETWORD(v)	(((v)[3] << 24) + ((v)[2] << 16) + ((v)[1] << 8) + (v)[0])
> >>+#else
> >> #define GETWORD(v)	(((v)[0] << 24) + ((v)[1] << 16) + ((v)[2] << 8) + (v)[3])
> >>+#endif
> >> 
> >> #define isxdigit(c)	(('0' <= (c) && (c) <= '9') \
> >> 			 || ('a' <= (c) && (c) <= 'f') \
> >>
> > 
> > 
> > Philippe:  Wouldn't it be better to just do a 32-bit load and let the endianness be worked out
> > by the hardware?  i.e.
> > 
> > #define GETWORD(v) (*(u32 *)v)
> Yes, your alternative is better.
> Wouldn't it narrow the scope of the macro to aligned words on POWER7?
> I think that all references to GETWORD operate on aligned words anyway.

Well, xmon has to be robust ... as long as you are *certain* that even
with crap entry state it won't try to access unaligned boundaries then
go for it but we aren't looking at performance here.

Cheers,
Ben.

^ permalink raw reply

* [PATCH v2] arch/powerpc: Dynamically allocate slb_shadow from memblock
From: Jeremy Kerr @ 2013-12-05  3:42 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Michael Neuling
In-Reply-To: <1386214268.68246.485420017657.2.gpush@pablo>

Currently, the slb_shadow buffer is our largest symbol:

  [jk@pablo linux]$ nm --size-sort -r -S obj/vmlinux | head -1
  c000000000da0000 0000000000040000 d slb_shadow

- we allocate 128 bytes per cpu; so 256k with NR_CPUS=2048. As we have
constant initialisers, it's allocated in .text, causing a larger vmlinux
image. We may also allocate unecessary slb_shadow buffers (> no. pacas),
since we use the build-time NR_CPUS rather than the run-time nr_cpu_ids.

We could move this to the bss, but then we still have the NR_CPUS vs
nr_cpu_ids potential for overallocation.

This change dynamically allocates the slb_shadow array, during
initialise_pacas(). At a cost of 104 bytes of text, we save 256k of
data:

  [jk@pablo linux]$ size obj/vmlinux{.orig,}
     text     data      bss       dec     hex	filename
  9202795  5244676  1169576  15617047  ee4c17	obj/vmlinux.orig
  9202899  4982532  1169576  15355007  ea4c7f	obj/vmlinux

Tested on pseries.

Signed-off-by: Jeremy Kerr <jk@ozlabs.org>

---
v2: don't drop the endian conversion, use SLB_NUM_BOLTED

---
 arch/powerpc/kernel/paca.c |   32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 9095a6f7..623c356f 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -99,12 +99,28 @@ static inline void free_lppacas(void) { }
  * 3 persistent SLBs are registered here.  The buffer will be zero
  * initially, hence will all be invaild until we actually write them.
  */
-static struct slb_shadow slb_shadow[] __cacheline_aligned = {
-	[0 ... (NR_CPUS-1)] = {
-		.persistent = cpu_to_be32(SLB_NUM_BOLTED),
-		.buffer_length = cpu_to_be32(sizeof(struct slb_shadow)),
-	},
-};
+static struct slb_shadow *slb_shadow;
+
+static void __init allocate_slb_shadows(int nr_cpus, int limit)
+{
+	int size = PAGE_ALIGN(sizeof(struct slb_shadow) * nr_cpus);
+	slb_shadow = __va(memblock_alloc_base(size, PAGE_SIZE, limit));
+	memset(slb_shadow, 0, size);
+}
+
+static struct slb_shadow * __init init_slb_shadow(int cpu)
+{
+	struct slb_shadow *s = &slb_shadow[cpu];
+
+	s->persistent = cpu_to_be32(SLB_NUM_BOLTED);
+	s->buffer_length = cpu_to_be32(sizeof(*s));
+
+	return s;
+}
+
+#else /* CONFIG_PPC_STD_MMU_64 */
+
+static void __init allocate_slb_shadows(int nr_cpus, int limit) { }
 
 #endif /* CONFIG_PPC_STD_MMU_64 */
 
@@ -142,7 +158,7 @@ void __init initialise_paca(struct paca_struct *new_paca, int cpu)
 	new_paca->__current = &init_task;
 	new_paca->data_offset = 0xfeeeeeeeeeeeeeeeULL;
 #ifdef CONFIG_PPC_STD_MMU_64
-	new_paca->slb_shadow_ptr = &slb_shadow[cpu];
+	new_paca->slb_shadow_ptr = init_slb_shadow(cpu);
 #endif /* CONFIG_PPC_STD_MMU_64 */
 }
 
@@ -190,6 +206,8 @@ void __init allocate_pacas(void)
 
 	allocate_lppacas(nr_cpu_ids, limit);
 
+	allocate_slb_shadows(nr_cpu_ids, limit);
+
 	/* Can't use for_each_*_cpu, as they aren't functional yet */
 	for (cpu = 0; cpu < nr_cpu_ids; cpu++)
 		initialise_paca(&paca[cpu], cpu);

^ permalink raw reply related

* [PATCH 1/2] arch/powerpc: Make slb_shadow a local
From: Jeremy Kerr @ 2013-12-05  3:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Michael Neuling

The only external user of slb_shadow is the pseries lpar code, and it
can access through the paca array instead.

Signed-off-by: Jeremy Kerr <jk@ozlabs.org>

---
 arch/powerpc/include/asm/lppaca.h     |    2 --
 arch/powerpc/kernel/paca.c            |    2 +-
 arch/powerpc/platforms/pseries/lpar.c |    2 +-
 3 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/lppaca.h b/arch/powerpc/include/asm/lppaca.h
index 844c28de..d0a2a2f9 100644
--- a/arch/powerpc/include/asm/lppaca.h
+++ b/arch/powerpc/include/asm/lppaca.h
@@ -132,8 +132,6 @@ struct slb_shadow {
 	} save_area[SLB_NUM_BOLTED];
 } ____cacheline_aligned;
 
-extern struct slb_shadow slb_shadow[];
-
 /*
  * Layout of entries in the hypervisor's dispatch trace log buffer.
  */
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 0620eaaa..9095a6f7 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -99,7 +99,7 @@ static inline void free_lppacas(void) { }
  * 3 persistent SLBs are registered here.  The buffer will be zero
  * initially, hence will all be invaild until we actually write them.
  */
-struct slb_shadow slb_shadow[] __cacheline_aligned = {
+static struct slb_shadow slb_shadow[] __cacheline_aligned = {
 	[0 ... (NR_CPUS-1)] = {
 		.persistent = cpu_to_be32(SLB_NUM_BOLTED),
 		.buffer_length = cpu_to_be32(sizeof(struct slb_shadow)),
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 4fca3def..28cf0f33 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -92,7 +92,7 @@ void vpa_init(int cpu)
 	 * PAPR says this feature is SLB-Buffer but firmware never
 	 * reports that.  All SPLPAR support SLB shadow buffer.
 	 */
-	addr = __pa(&slb_shadow[cpu]);
+	addr = __pa(paca[cpu].slb_shadow_ptr);
 	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
 		ret = register_slb_shadow(hwcpu, addr);
 		if (ret)

^ permalink raw reply related

* [PATCH 2/2] arch/powerpc: Dynamically allocate slb_shadow from memblock
From: Jeremy Kerr @ 2013-12-05  3:31 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Michael Neuling
In-Reply-To: <1386214268.68092.227432126043.1.gpush@pablo>

Currently, the slb_shadow buffer is our largest symbol:

  [jk@pablo linux]$ nm --size-sort -r -S obj/vmlinux | head -1
  c000000000da0000 0000000000040000 d slb_shadow

- we allocate 128 bytes per cpu; so 256k with NR_CPUS=2048. As we have
constant initialisers, it's allocated in .text, causing a larger vmlinux
image. We may also allocate unecessary slb_shadow buffers (> no. pacas),
since we use the build-time NR_CPUS rather than the run-time nr_cpu_ids.

We could move this to the bss, but then we still have the NR_CPUS vs
nr_cpu_ids potential for overallocation.

This change dynamically allocates the slb_shadow array, during
initialise_pacas(). At a cost of 104 bytes of text, we save 256k of
data:

  [jk@pablo linux]$ size obj/vmlinux{.orig,}
     text     data      bss       dec     hex	filename
  9202795  5244676  1169576  15617047  ee4c17	obj/vmlinux.orig
  9202899  4982532  1169576  15355007  ea4c7f	obj/vmlinux

Tested on pseries.

Signed-off-by: Jeremy Kerr <jk@ozlabs.org>

---
 arch/powerpc/kernel/paca.c |   32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 9095a6f7..ad7a485f 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -99,12 +99,28 @@ static inline void free_lppacas(void) { }
  * 3 persistent SLBs are registered here.  The buffer will be zero
  * initially, hence will all be invaild until we actually write them.
  */
-static struct slb_shadow slb_shadow[] __cacheline_aligned = {
-	[0 ... (NR_CPUS-1)] = {
-		.persistent = cpu_to_be32(SLB_NUM_BOLTED),
-		.buffer_length = cpu_to_be32(sizeof(struct slb_shadow)),
-	},
-};
+static struct slb_shadow *slb_shadow;
+
+static void __init allocate_slb_shadows(int nr_cpus, int limit)
+{
+	int size = PAGE_ALIGN(sizeof(struct slb_shadow) * nr_cpus);
+	slb_shadow = __va(memblock_alloc_base(size, PAGE_SIZE, limit));
+	memset(slb_shadow, 0, size);
+}
+
+static struct slb_shadow * __init init_slb_shadow(int cpu)
+{
+	struct slb_shadow *s = &slb_shadow[cpu];
+
+	s->persistent = ARRAY_SIZE(s->save_area);
+	s->buffer_length = sizeof(*s);
+
+	return s;
+}
+
+#else /* CONFIG_PPC_STD_MMU_64 */
+
+static void __init allocate_slb_shadows(int nr_cpus, int limit) { }
 
 #endif /* CONFIG_PPC_STD_MMU_64 */
 
@@ -142,7 +158,7 @@ void __init initialise_paca(struct paca_struct *new_paca, int cpu)
 	new_paca->__current = &init_task;
 	new_paca->data_offset = 0xfeeeeeeeeeeeeeeeULL;
 #ifdef CONFIG_PPC_STD_MMU_64
-	new_paca->slb_shadow_ptr = &slb_shadow[cpu];
+	new_paca->slb_shadow_ptr = init_slb_shadow(cpu);
 #endif /* CONFIG_PPC_STD_MMU_64 */
 }
 
@@ -190,6 +206,8 @@ void __init allocate_pacas(void)
 
 	allocate_lppacas(nr_cpu_ids, limit);
 
+	allocate_slb_shadows(nr_cpu_ids, limit);
+
 	/* Can't use for_each_*_cpu, as they aren't functional yet */
 	for (cpu = 0; cpu < nr_cpu_ids; cpu++)
 		initialise_paca(&paca[cpu], cpu);

^ permalink raw reply related

* Re: [PATCH v6 09/17] serial: mpc512x: adjust for OF based clock lookup
From: Greg Kroah-Hartman @ 2013-12-05  0:55 UTC (permalink / raw)
  To: Gerhard Sittig
  Cc: Mike Turquette, Detlev Zundel, Jiri Slaby, linux-serial,
	Scott Wood, Anatolij Gustschin, linuxppc-dev, linux-arm-kernel
In-Reply-To: <1385851897-23475-10-git-send-email-gsi@denx.de>

On Sat, Nov 30, 2013 at 11:51:29PM +0100, Gerhard Sittig wrote:
> after device tree based clock lookup became available, the peripheral
> driver need no longer construct clock names which include the PSC index,
> remove the "psc%d_mclk" template and unconditionally use 'mclk'
> 
> acquire and release the "ipg" clock item for register access as well
> 
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Jiri Slaby <jslaby@suse.cz>
> Cc: linux-serial@vger.kernel.org
> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> # for v4
> Signed-off-by: Gerhard Sittig <gsi@denx.de>
> ---
> Greg, the difference since v4 of this patch is that v4 took the 'mclk'
> and 'ipg' clock items in reverse order, and thus potentially obfuscated
> the adjusted name for 'mclk' -- the updated version of the patch is
> identical in content but cleaner diff-wise

That's fine:

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply

* Re: [PATCH v6 11/17] USB: fsl-mph-dr-of: adjust for OF based clock lookup
From: Greg Kroah-Hartman @ 2013-12-05  0:55 UTC (permalink / raw)
  To: Gerhard Sittig
  Cc: Mike Turquette, Detlev Zundel, linux-usb, Scott Wood,
	Anatolij Gustschin, linuxppc-dev, linux-arm-kernel
In-Reply-To: <1385851897-23475-12-git-send-email-gsi@denx.de>

On Sat, Nov 30, 2013 at 11:51:31PM +0100, Gerhard Sittig wrote:
> after device tree based clock lookup became available, the peripheral
> driver need no longer construct clock names which include the component
> index -- remove the "usb%d_clk" template, always use "ipg" instead
> 
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: linux-usb@vger.kernel.org
> Signed-off-by: Gerhard Sittig <gsi@denx.de>
> ---

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply

* Re: [PATCH v6 10/17] serial: mpc512x: setup the PSC FIFO clock as well
From: Greg Kroah-Hartman @ 2013-12-05  0:55 UTC (permalink / raw)
  To: Gerhard Sittig
  Cc: Mike Turquette, Detlev Zundel, Jiri Slaby, linux-serial,
	Scott Wood, Anatolij Gustschin, linuxppc-dev, linux-arm-kernel
In-Reply-To: <1385851897-23475-11-git-send-email-gsi@denx.de>

On Sat, Nov 30, 2013 at 11:51:30PM +0100, Gerhard Sittig wrote:
> prepare and enable the FIFO clock upon PSC FIFO initialization,
> check for and propagage errors when enabling the PSC FIFO clock,
> disable and unprepare the FIFO clock upon PSC FIFO uninitialization
> 
> devm_{get,put}_clk() doesn't apply here, as the SoC provides a
> single FIFO component which is shared among several PSC components,
> thus the FIFO isn't associated with a device (while the PSCs are)
> 
> provide a fallback clock lookup approach in case the OF based clock
> lookup for the PSC FIFO fails, this allows for successful operation in
> the presence of an outdated device tree which lacks clock specs
> 
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Jiri Slaby <jslaby@suse.cz>
> Cc: linux-serial@vger.kernel.org
> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> # for v4
> Signed-off-by: Gerhard Sittig <gsi@denx.de>
> ---
> Greg, the addition since v4 is the clk_get_sys() call for the 'ipg'
> clock item (backwards compat for device trees w/o clock specs)

That's fine, my ACK still stands.

thanks,

greg k-h

^ permalink raw reply

* [PATCH v3 1/2] offb: little endian fixes
From: Cédric Le Goater @ 2013-12-04 16:49 UTC (permalink / raw)
  To: benh; +Cc: Cédric Le Goater, linuxppc-dev
In-Reply-To: <1385240643.4882.133.camel@pasglop>

The "screen" properties : depth, width, height, linebytes need
to be converted to the host endian order when read from the device
tree.

The offb_init_palette_hacks() routine also made assumption on the
host endian order.

Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
---

Changes in v3:

 - fixed offb_init_palette_hacks()

Changes in v2:

 - replaced be32_to_cpu() by be32_to_cpup()

 drivers/video/offb.c |   18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/video/offb.c b/drivers/video/offb.c
index 9dbea2223401..43a0a52fc527 100644
--- a/drivers/video/offb.c
+++ b/drivers/video/offb.c
@@ -301,7 +301,7 @@ static struct fb_ops offb_ops = {
 static void __iomem *offb_map_reg(struct device_node *np, int index,
 				  unsigned long offset, unsigned long size)
 {
-	const u32 *addrp;
+	const __be32 *addrp;
 	u64 asize, taddr;
 	unsigned int flags;
 
@@ -369,7 +369,11 @@ static void offb_init_palette_hacks(struct fb_info *info, struct device_node *dp
 		}
 		of_node_put(pciparent);
 	} else if (dp && of_device_is_compatible(dp, "qemu,std-vga")) {
-		const u32 io_of_addr[3] = { 0x01000000, 0x0, 0x0 };
+#ifdef __BIG_ENDIAN
+		const __be32 io_of_addr[3] = { 0x01000000, 0x0, 0x0 };
+#else
+		const __be32 io_of_addr[3] = { 0x00000001, 0x0, 0x0 };
+#endif
 		u64 io_addr = of_translate_address(dp, io_of_addr);
 		if (io_addr != OF_BAD_ADDR) {
 			par->cmap_adr = ioremap(io_addr + 0x3c8, 2);
@@ -535,7 +539,7 @@ static void __init offb_init_nodriver(struct device_node *dp, int no_real_node)
 	unsigned int flags, rsize, addr_prop = 0;
 	unsigned long max_size = 0;
 	u64 rstart, address = OF_BAD_ADDR;
-	const u32 *pp, *addrp, *up;
+	const __be32 *pp, *addrp, *up;
 	u64 asize;
 	int foreign_endian = 0;
 
@@ -551,25 +555,25 @@ static void __init offb_init_nodriver(struct device_node *dp, int no_real_node)
 	if (pp == NULL)
 		pp = of_get_property(dp, "depth", &len);
 	if (pp && len == sizeof(u32))
-		depth = *pp;
+		depth = be32_to_cpup(pp);
 
 	pp = of_get_property(dp, "linux,bootx-width", &len);
 	if (pp == NULL)
 		pp = of_get_property(dp, "width", &len);
 	if (pp && len == sizeof(u32))
-		width = *pp;
+		width = be32_to_cpup(pp);
 
 	pp = of_get_property(dp, "linux,bootx-height", &len);
 	if (pp == NULL)
 		pp = of_get_property(dp, "height", &len);
 	if (pp && len == sizeof(u32))
-		height = *pp;
+		height = be32_to_cpup(pp);
 
 	pp = of_get_property(dp, "linux,bootx-linebytes", &len);
 	if (pp == NULL)
 		pp = of_get_property(dp, "linebytes", &len);
 	if (pp && len == sizeof(u32) && (*pp != 0xffffffffu))
-		pitch = *pp;
+		pitch = be32_to_cpup(pp);
 	else
 		pitch = width * ((depth + 7) / 8);
 
-- 
1.7.10.4

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox