LinuxPPC-Dev Archive on lore.kernel.org
* [PATCH 8/8] powerpc/85xx: Update corenet64_smp_defconfig for T4240
From: Kumar Gala @ 2013-03-05 23:16 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <1362525360-23136-7-git-send-email-galak@kernel.crashing.org>

* Add support for up to 24 cores on T4240 (includes threads)
* Enable AltiVec support (on T4240)
* Add T4240QDS board into build
* Other changes are due to general kernel update of defconfig

Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
---
 arch/powerpc/configs/corenet64_smp_defconfig |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/configs/corenet64_smp_defconfig b/arch/powerpc/configs/corenet64_smp_defconfig
index 3d139fa..c3da860 100644
--- a/arch/powerpc/configs/corenet64_smp_defconfig
+++ b/arch/powerpc/configs/corenet64_smp_defconfig
@@ -1,14 +1,13 @@
 CONFIG_PPC64=y
 CONFIG_PPC_BOOK3E_64=y
-# CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set
+CONFIG_ALTIVEC=y
 CONFIG_SMP=y
-CONFIG_NR_CPUS=2
-CONFIG_EXPERIMENTAL=y
+CONFIG_NR_CPUS=24
 CONFIG_SYSVIPC=y
-CONFIG_BSD_PROCESS_ACCT=y
 CONFIG_IRQ_DOMAIN_DEBUG=y
 CONFIG_NO_HZ=y
 CONFIG_HIGH_RES_TIMERS=y
+CONFIG_BSD_PROCESS_ACCT=y
 CONFIG_IKCONFIG=y
 CONFIG_IKCONFIG_PROC=y
 CONFIG_LOG_BUF_SHIFT=14
@@ -24,6 +23,7 @@ CONFIG_PARTITION_ADVANCED=y
 CONFIG_MAC_PARTITION=y
 CONFIG_P5020_DS=y
 CONFIG_P5040_DS=y
+CONFIG_T4240_QDS=y
 # CONFIG_PPC_OF_BOOT_TRAMPOLINE is not set
 CONFIG_BINFMT_MISC=m
 CONFIG_PCIEPORTBUS=y
@@ -140,6 +140,5 @@ CONFIG_CRYPTO_PCBC=m
 CONFIG_CRYPTO_MD4=y
 CONFIG_CRYPTO_SHA256=y
 CONFIG_CRYPTO_SHA512=y
-CONFIG_CRYPTO_AES=y
 # CONFIG_CRYPTO_ANSI_CPRNG is not set
 CONFIG_CRYPTO_DEV_FSL_CAAM=y
-- 
1.7.9.7

^ permalink raw reply related

* Re: [PATCH 5/8] powerpc/fsl-booke: Add initial silicon device tree for
From: Scott Wood @ 2013-03-06  0:15 UTC (permalink / raw)
  To: Kumar Gala; +Cc: linuxppc-dev
In-Reply-To: <1362525360-23136-5-git-send-email-galak@kernel.crashing.org>

On 03/05/2013 05:15:57 PM, Kumar Gala wrote:
> Enable a baseline T4240 SoC to boot.  There are several things missing
> from the device trees for T4240:
>
> * Thread support on e6500

Why did threads get removed from the device tree?  It's supposed to
describe hardware, not what Linux currently supports.

> * Proper PAMU topology information
> * DPAA related nodes (Qman, Bman, Fman, Rman, DCE)
> * Prefetch Manager
> * Thermal monitor unit
> * Interlaken

The dts should be marked preliminary somehow -- we really should get
out of the habit of letting device nodes trickle in as drivers get
added.

> +/* controller at 0x240000 */
> +&pci0 {
> +	compatible = "fsl,t4240-pcie", "fsl,qoriq-pcie-v3.0";

We have a version register -- do we really need to keep sticking the
version number in the compatible?  Note that we've had device trees
that specified the version incorrectly in the past.

> +	device_type = "pci";
> +	#size-cells = <2>;
> +	#address-cells = <3>;
> +	bus-range = <0x0 0xff>;
> +	clock-frequency = <33333333>;

This clock-frequency is not correct (I doubt it's needed at all).

> +		PowerPC,e6500@1 {
> +			device_type = "cpu";
> +			reg = <2>;
> +			next-level-cache = <&L2_1>;
> +		};
> +		PowerPC,e6500@2 {
> +			device_type = "cpu";
> +			reg = <4>;
> +			next-level-cache = <&L2_1>;
> +		};
> +		PowerPC,e6500@3 {
> +			device_type = "cpu";
> +			reg = <6>;
> +			next-level-cache = <&L2_1>;
> +		};
> +
> +		PowerPC,e6500@4 {
> +			device_type = "cpu";
> +			reg = <8>;
> +			next-level-cache = <&L2_2>;
> +		};

Inconsistent whitespace.

As usual, the pre/post split is unnecessary.  Everything in it can go
in post.

-Scott

^ permalink raw reply

* [PATCH 0/2] powerpc: HFSCR enablement for POWER8
From: Michael Neuling @ 2013-03-06  2:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Michael Neuling, linuxppc-dev

Benh, 

This small series adds support for the HFSCR (Hypervisor Facility Status &
Control Register) in POWER8.  It just sets the bits we know about at this
stage.  This is useful only when MSR HV=1.

Mikey

^ permalink raw reply

* [PATCH 1/2] powerpc: Add HFSCR SPR definitions
From: Michael Neuling @ 2013-03-06  2:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Michael Neuling, linuxppc-dev
In-Reply-To: <1362536114-9658-1-git-send-email-mikey@neuling.org>

Add SPR number and bit definitions for the HFSCR (Hypervisor Facility Status
and Control Register).

Signed-off-by: Michael Neuling <mikey@neuling.org>
---
 arch/powerpc/include/asm/reg.h |    6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index c9c67fc..4ae2d44 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -268,6 +268,12 @@
 #define SPRN_FSCR	0x099	/* Facility Status & Control Register */
 #define   FSCR_TAR	(1 << (63-55)) /* Enable Target Address Register */
 #define   FSCR_DSCR	(1 << (63-61)) /* Enable Data Stream Control Register */
+#define SPRN_HFSCR	0xbe	/* HV=1 Facility Status & Control Register */
+#define   HFSCR_TAR	(1 << (63-55)) /* Enable Target Address Register */
+#define   HFSCR_TM	(1 << (63-58)) /* Enable Transactional Memory */
+#define   HFSCR_DSCR	(1 << (63-61)) /* Enable Data Stream Control Register */
+#define   HFSCR_VECVSX	(1 << (63-62)) /* Enable VMX/VSX  */
+#define   HFSCR_FP	(1 << (63-63)) /* Enable Floating Point */
 #define SPRN_TAR	0x32f	/* Target Address Register */
 #define SPRN_LPCR	0x13E	/* LPAR Control Register */
 #define   LPCR_VPM0	(1ul << (63-0))
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 2/2] powerpc: Setup in HFSCR for POWER8
From: Michael Neuling @ 2013-03-06  2:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Michael Neuling, linuxppc-dev
In-Reply-To: <1362536114-9658-1-git-send-email-mikey@neuling.org>

Set up the HFSCR (Hypervisor Facility Status and Control Register) for POWER8
when running HV=1.  The HFSCR is the same as the FSCR except it's for
hypervisors.

This patch sets the facilities Linux knows about in case the firmware doesn't.

Signed-off-by: Michael Neuling <mikey@neuling.org>
---
 arch/powerpc/kernel/cpu_setup_power.S |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
index ea847ab..2e6ad11 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -57,6 +57,7 @@ _GLOBAL(__setup_cpu_power8)
 	mfspr	r3,SPRN_LPCR
 	oris	r3, r3, LPCR_AIL_3@h
 	bl	__init_LPCR
+	bl	__init_HFSCR
 	bl	__init_TLB
 	mtlr	r11
 	blr
@@ -72,6 +73,7 @@ _GLOBAL(__restore_cpu_power8)
 	mfspr   r3,SPRN_LPCR
 	oris	r3, r3, LPCR_AIL_3@h
 	bl	__init_LPCR
+	bl	__init_HFSCR
 	bl	__init_TLB
 	mtlr	r11
 	blr
@@ -120,6 +122,12 @@ __init_FSCR:
 	mtspr	SPRN_FSCR,r3
 	blr
 
+__init_HFSCR:
+	mfspr	r3,SPRN_HFSCR
+	ori	r3,r3,HFSCR_TAR|HFSCR_TM|HFSCR_DSCR|HFSCR_VECVSX|HFSCR_FP
+	mtspr	SPRN_HFSCR,r3
+	blr
+
 __init_TLB:
 	/* Clear the TLB */
 	li	r6,128
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH 5/8] powerpc/fsl-booke: Add initial silicon device tree for
From: Roy Zang @ 2013-03-06 11:02 UTC (permalink / raw)
  To: Kumar Gala; +Cc: linuxppc-dev
In-Reply-To: <1362525360-23136-5-git-send-email-galak@kernel.crashing.org>

On 03/06/2013 07:15 AM, Kumar Gala wrote:
> * Thread support on e6500
> * Proper PAMU topology information
> * DPAA related nodes (Qman, Bman, Fman, Rman, DCE)
> * Prefetch Manager
> * Thermal monitor unit
> * Interlaken
>
> Signed-off-by: Roy Zang<tie-fei.zang@freescale.com>
> Signed-off-by: Minghuan Lian<Minghuan.Lian@freescale.com>
> Signed-off-by: Haiying Wang<Haiying.Wang@freescale.com>
> Signed-off-by: Andy Fleming<afleming@freescale.com>
> Signed-off-by: Prabhakar Kushwaha<prabhakar@freescale.com>
> Signed-off-by: York Sun<yorksun@freescale.com>
> Signed-off-by: Vakul Garg<vakul@freescale.com>
> Signed-off-by: Tang Yuantian<Yuantian.Tang@freescale.com>
> Signed-off-by: Zhao Chenhui<chenhui.zhao@freescale.com>
> Signed-off-by: Li Yang<leoli@freescale.com>
> Signed-off-by: Ramneek Mehresh<ramneek.mehresh@freescale.com>
> Signed-off-by: Haiying Wang<Haiying.Wang@freescale.com>
Haiying is doubled.
Roy

^ permalink raw reply

* Re: [PATCH] powerpc/powernv: Fix next available MSI IRQ
From: Michael Ellerman @ 2013-03-06  3:24 UTC (permalink / raw)
  To: Gavin Shan; +Cc: linuxppc-dev
In-Reply-To: <1362466756-16113-1-git-send-email-shangw@linux.vnet.ibm.com>

On Tue, Mar 05, 2013 at 02:59:16PM +0800, Gavin Shan wrote:
> The allocation of MSI is implemented based on bitmap and working
> like the mechanism of strict round through the traced next available
> cursor. However, the next available MSI is never updated in current
> implementation. The patch fixes the issue.
> 
> Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
> ---
>  arch/powerpc/platforms/powernv/pci.c |    5 +++++
>  1 files changed, 5 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index 6f464dc..9cf18c4 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -66,6 +66,11 @@ static unsigned int pnv_get_one_msi(struct pnv_phb *phb)
>  		rc = 0;
>  		goto out;
>  	}
> +
> +	if (id >= phb->msi_count - 1)
> +		phb->msi_next = 0;
> +	else
> +		phb->msi_next = id + 1;
>  	__set_bit(id, phb->msi_map);


There is code in arch/powerpc/sysdev/msi_bitmap.c that implements a
bitmap allocator for MSI. It may not do what you need but please take a
look at it if you haven't already.
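
For context, the round-robin behaviour the quoted patch is after can be
sketched in plain C as below. This is only an illustration under stated
assumptions: struct msi_pool, msi_pool_get() and msi_pool_put() are made-up
names, the pool is limited to 32 entries, and none of it is the powernv or
msi_bitmap.c code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-PHB MSI state; up to 32 MSIs. */
struct msi_pool {
	uint32_t map;		/* bit n set => MSI n is allocated */
	unsigned int count;	/* number of usable MSIs, 1..32 */
	unsigned int next;	/* cursor: where the next search starts */
};

/* Allocate one MSI, searching from the cursor and wrapping around. */
static int msi_pool_get(struct msi_pool *p)
{
	unsigned int i, id;

	assert(p->count >= 1 && p->count <= 32);

	for (i = 0; i < p->count; i++) {
		id = (p->next + i) % p->count;
		if (p->map & (UINT32_C(1) << id))
			continue;	/* already in use, keep looking */
		p->map |= UINT32_C(1) << id;
		/* Advance the cursor; this is the update the patch adds. */
		p->next = (id + 1) % p->count;
		return (int)id;
	}
	return -1;	/* pool exhausted */
}

/* Release an MSI; the cursor is deliberately left alone. */
static void msi_pool_put(struct msi_pool *p, unsigned int id)
{
	assert(id < p->count);
	p->map &= ~(UINT32_C(1) << id);
}
```

With a four-entry pool, allocating three MSIs, freeing MSI 0 and allocating
again hands out MSI 3 rather than reusing 0 straight away, because the search
starts at the cursor.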

cheers

^ permalink raw reply

* [PATCH 0/2] powerpc: HFSCR enablement for POWER8
From: Michael Neuling @ 2013-03-06  3:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Michael Neuling, linuxppc-dev

Benh, 

This small series adds support for the HFSCR (Hypervisor Facility Status &
Control Register) in POWER8.  It just sets the bits we know about at this
stage.  This is useful only when MSR HV=1.

The HFSCR is the same as the FSCR except it's for hypervisors.  It controls the
availability of various facilities at the OS and userspace levels.  It also
indicates the cause of a hypervisor facility unavailable interrupt (although we
are not using this here).
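
To make the bit layout concrete, here is a small user-space sketch of the
masks involved. It assumes the IBM numbering used in reg.h, where bit 0 is the
most significant bit of the 64-bit register. PPC_BIT64() and
hfscr_enable_known() are illustrative names, not kernel API.

```c
#include <assert.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the MSB of a 64-bit register. */
#define PPC_BIT64(n)	(UINT64_C(1) << (63 - (n)))

/* Same bit positions as the HFSCR_* definitions in patch 1/2. */
#define HFSCR_TAR	PPC_BIT64(55)	/* Target Address Register */
#define HFSCR_TM	PPC_BIT64(58)	/* Transactional Memory */
#define HFSCR_DSCR	PPC_BIT64(61)	/* Data Stream Control Register */
#define HFSCR_VECVSX	PPC_BIT64(62)	/* VMX/VSX */
#define HFSCR_FP	PPC_BIT64(63)	/* Floating Point */

/* OR the known facility bits into an existing HFSCR value. */
static uint64_t hfscr_enable_known(uint64_t hfscr)
{
	const uint64_t mask = HFSCR_TAR | HFSCR_TM | HFSCR_DSCR |
			      HFSCR_VECVSX | HFSCR_FP;

	/* The mask must fit a 16-bit unsigned immediate for a single ori. */
	assert(mask <= 0xffff);
	return hfscr | mask;
}
```

All five bits land in the low 16 bits of the register (0x127 in total), which
is why a single ori immediate is enough in patch 2/2.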

v2:
  Make description of the HFSCR more verbose as suggested by benh.

Mikey

^ permalink raw reply

* [PATCH 1/2] powerpc: Add HFSCR SPR definitions
From: Michael Neuling @ 2013-03-06  3:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Michael Neuling, linuxppc-dev
In-Reply-To: <1362540924-15253-1-git-send-email-mikey@neuling.org>

Add SPR number and bit definitions for the HFSCR (Hypervisor Facility Status
and Control Register).

Signed-off-by: Michael Neuling <mikey@neuling.org>
---
 arch/powerpc/include/asm/reg.h |    6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index c9c67fc..4ae2d44 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -268,6 +268,12 @@
 #define SPRN_FSCR	0x099	/* Facility Status & Control Register */
 #define   FSCR_TAR	(1 << (63-55)) /* Enable Target Address Register */
 #define   FSCR_DSCR	(1 << (63-61)) /* Enable Data Stream Control Register */
+#define SPRN_HFSCR	0xbe	/* HV=1 Facility Status & Control Register */
+#define   HFSCR_TAR	(1 << (63-55)) /* Enable Target Address Register */
+#define   HFSCR_TM	(1 << (63-58)) /* Enable Transactional Memory */
+#define   HFSCR_DSCR	(1 << (63-61)) /* Enable Data Stream Control Register */
+#define   HFSCR_VECVSX	(1 << (63-62)) /* Enable VMX/VSX  */
+#define   HFSCR_FP	(1 << (63-63)) /* Enable Floating Point */
 #define SPRN_TAR	0x32f	/* Target Address Register */
 #define SPRN_LPCR	0x13E	/* LPAR Control Register */
 #define   LPCR_VPM0	(1ul << (63-0))
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 2/2] powerpc: Setup in HFSCR for POWER8
From: Michael Neuling @ 2013-03-06  3:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Michael Neuling, linuxppc-dev
In-Reply-To: <1362540924-15253-1-git-send-email-mikey@neuling.org>

Set up the HFSCR (Hypervisor Facility Status and Control Register) for POWER8
when running HV=1.  The HFSCR is the same as the FSCR except it's for
hypervisors.  It controls the availability of various facilities at the OS and
userspace levels.  It also indicates the cause of a hypervisor facility
unavailable interrupt (although we are not using this here).

This patch sets the facilities Linux knows about in case the firmware doesn't.

Signed-off-by: Michael Neuling <mikey@neuling.org>
---
 arch/powerpc/kernel/cpu_setup_power.S |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
index ea847ab..2e6ad11 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -57,6 +57,7 @@ _GLOBAL(__setup_cpu_power8)
 	mfspr	r3,SPRN_LPCR
 	oris	r3, r3, LPCR_AIL_3@h
 	bl	__init_LPCR
+	bl	__init_HFSCR
 	bl	__init_TLB
 	mtlr	r11
 	blr
@@ -72,6 +73,7 @@ _GLOBAL(__restore_cpu_power8)
 	mfspr   r3,SPRN_LPCR
 	oris	r3, r3, LPCR_AIL_3@h
 	bl	__init_LPCR
+	bl	__init_HFSCR
 	bl	__init_TLB
 	mtlr	r11
 	blr
@@ -120,6 +122,12 @@ __init_FSCR:
 	mtspr	SPRN_FSCR,r3
 	blr
 
+__init_HFSCR:
+	mfspr	r3,SPRN_HFSCR
+	ori	r3,r3,HFSCR_TAR|HFSCR_TM|HFSCR_DSCR|HFSCR_VECVSX|HFSCR_FP
+	mtspr	SPRN_HFSCR,r3
+	blr
+
 __init_TLB:
 	/* Clear the TLB */
 	li	r6,128
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH 2/3] irq: Add hw continuous IRQs map to virtual continuous IRQs support
From: Michael Ellerman @ 2013-03-06  3:54 UTC (permalink / raw)
  To: Mike Qiu; +Cc: tglx, linuxppc-dev, linux-kernel
In-Reply-To: <51359C9D.5030009@linux.vnet.ibm.com>

On Tue, Mar 05, 2013 at 03:19:57PM +0800, Mike Qiu wrote:
> On 2013/3/5 10:23, Michael Ellerman wrote:
> >On Tue, Jan 15, 2013 at 03:38:55PM +0800, Mike Qiu wrote:
> >>Adding a function irq_create_mapping_many() which can associate
> >>multiple MSIs to a continous irq mapping.
> >>
> >>This is needed to enable multiple MSI support for pSeries.
> >>
> >>Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
> >>---
> >>  include/linux/irq.h       |    2 +
> >>  include/linux/irqdomain.h |    3 ++
> >>  kernel/irq/irqdomain.c    |   61 +++++++++++++++++++++++++++++++++++++++++++++
> >>  3 files changed, 66 insertions(+), 0 deletions(-)
> >>
> >>diff --git a/include/linux/irq.h b/include/linux/irq.h
> >>index 60ef45b..e00a7ec 100644
> >>--- a/include/linux/irq.h
> >>+++ b/include/linux/irq.h
> >>@@ -592,6 +592,8 @@ int __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
> >>  #define irq_alloc_desc_from(from, node)		\
> >>  	irq_alloc_descs(-1, from, 1, node)
> >>+#define irq_alloc_desc_n(nevc, node)		\
> >>+	irq_alloc_descs(-1, 0, nevc, node)
> >This has been superseded by irq_alloc_descs_from(), which is the right
> >way to do it.

> Yes, but irq_alloc_descs_from() just for 1 irq

No it's not, look again.

#define irq_alloc_descs_from(from, cnt, node)   \
	irq_alloc_descs(-1, from, cnt, node)


> >>diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
> >>index 96f3a1d..38648e6 100644
> >>--- a/kernel/irq/irqdomain.c
> >>+++ b/kernel/irq/irqdomain.c
> >>@@ -636,6 +636,67 @@ int irq_create_strict_mappings(struct irq_domain *domain, unsigned int irq_base,
> >>  }
> >>  EXPORT_SYMBOL_GPL(irq_create_strict_mappings);
> >>+/**
> >>+ * irq_create_mapping_many - Map a range of hw IRQs to a range of virtual IRQs
> >>+ * @domain: domain owning the interrupt range
> >>+ * @hwirq_base: beginning of continuous hardware IRQ range
> >>+ * @count: Number of interrupts to map

> >For multiple-MSI the allocated interrupt numbers must be a power-of-2,
> >and must be naturally aligned. I don't /think/ that's a requirement for
> >the virtual numbers, but it's probably best that we do it anyway.
> >
> >So this API needs to specify that it will give you back a power-of-2
> >block that is naturally aligned - otherwise you can't use it for MSI.

> rtas_call will return the numbers of hardware interrupt, and it
> should be power-of-2, as this I think do not need to specify

You're confusing hardware interrupt numbers and virtual interrupt
numbers. My comment is about irq_create_mapping_many(), which returns
virtual interrupt numbers.

As I said I don't think there is a requirement that the virtual
interrupt numbers are also a power-of-2 naturally aligned block, but we
should allocate them as one anyway, to avoid any issues in future.

And so this API, which returns virtual interrupt numbers, must satisfy
that specification.
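
The constraint itself is easy to state in code. The following is a user-space
sketch, not a proposed kernel API; count_is_pow2(), is_natural_block() and
round_up_pow2() are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>

/* True if n is a non-zero power of 2. */
static bool count_is_pow2(unsigned int n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * True if [base, base + count) is a power-of-2 sized block that is
 * aligned to its own size, which is what multiple-MSI requires.
 */
static bool is_natural_block(unsigned int base, unsigned int count)
{
	return count_is_pow2(count) && (base & (count - 1)) == 0;
}

/* Round a requested interrupt count up to the next power of 2. */
static unsigned int round_up_pow2(unsigned int n)
{
	unsigned int p = 1;

	assert(n != 0 && n <= (1u << 31));
	while (p < n)
		p <<= 1;
	return p;
}
```

For example, a block of 4 starting at 16 is naturally aligned, while a block
of 4 starting at 18, or a block of 3 anywhere, is not.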

> >>+	/* Look for default domain if nececssary */
> >>+	if (!domain)
> >>+		domain = irq_default_domain;
> >>+	if (!domain) {
> >>+		pr_warn("irq_create_mapping called for NULL domain, hwirq=%lx\n"
> >>+			, hwirq_base);
> >>+		WARN_ON(1);
> >>+		return 0;
> >>+	}
> >>+	pr_debug("-> using domain @%p\n", domain);
> >>+
> >>+	/* For IRQ_DOMAIN_MAP_LEGACY, get the first virtual interrupt number */
> >>+	if (domain->revmap_type == IRQ_DOMAIN_MAP_LEGACY)
> >>+		return irq_domain_legacy_revmap(domain, hwirq_base);
> >The above doesn't work.
> Why it doesn't work ?

Because irq_domain_legacy_revmap() only allocates a single interrupt
number.

> >>+	/* Check if mapping already exists */
> >>+	for (i = 0; i < count; i++) {
> >>+		virq = irq_find_mapping(domain, hwirq_base+i);
> >>+		if (virq) {
> >>+			pr_debug("existing mapping on virq %d,"
> >>+					" now dispose it first\n", virq);
> >>+			irq_dispose_mapping(virq);

> >You might have just disposed of someone else's mapping, we shouldn't do
> >that. It should be an error to the caller.

> It's a good question. If the interrupt is in use by someone else, why can
> I get it from the system?

I agree, that would be a bug. But disposing of someone else's mapping is
not OK.

> So it may someone else forget to dispose mapping, and it never be
> used for others as I have got the interrupt I think.

Perhaps, but that is a bug that needs to be fixed in the code that
forgets to dispose of the mapping.

cheers

^ permalink raw reply

* Re: [PATCH -V1 06/24] powerpc: Reduce PTE table memory wastage
From: Aneesh Kumar K.V @ 2013-03-06  4:01 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, Paul Mackerras, linux-mm
In-Reply-To: <1362440204.21357.20.camel@pasglop>

Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

> On Mon, 2013-03-04 at 16:28 +0530, Aneesh Kumar K.V wrote:
>> I added the below comment when initializing the list.
>> 
>> +#ifdef CONFIG_PPC_64K_PAGES
>> +       /*
>> +        * Used to support 4K PTE fragment. The pages are added to list,
>> +        * when we have free framents in the page. We track the whether
>> +        * a page frament is available using page._mapcount. A value of
>> +        * zero indicate none of the fragments are used and page can be
>> +        * freed. A value of FRAG_MASK indicate all the fragments are used
>> +        * and hence the page will be removed from the below list.
>> +        */
>> +       INIT_LIST_HEAD(&init_mm.context.pgtable_list);
>> +#endif
>> 
>> I am not sure about why you say there is no consistent rule. Can you
>> elaborate on that ?
>
> Do you really need that list ? I assume it's meant to allow you to find
> free frags when allocating but my worry is that you'll end up losing
> quite a bit of node locality of PTE pages....
>
> It may or may not work but can you investigate doing things differently
> here ? The idea I want you to consider is to always allocate a full
> page, but make the relationship of the fragments to PTE pages fixed. IE.
> the fragment in the page is a function of the VA.
>
> Basically, the algorithm for allocation is roughly:
>
>  - Walk the tree down to the PMD ptr (* that can be improved with a
> generic change, see below)
>
>  - Check if any of the neighbouring PMDs is populated. If yes, you have
> your page and pick the appropriate fragment based on the VA
>
>  - If not, allocate and populate
>
> On free, similarly, you checked if all neighbouring PMDs have been
> cleared, in which case you can fire off the page for RCU freeing.
>
> (*) By changing pte_alloc_one to take the PMD ptr (which the call side
> has right at hand) you can avoid the tree lookup.
>

Will try this.
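
The "fragment is a function of the VA" idea can be sketched roughly as
follows. The numbers are assumptions for illustration only: 16 fragments of 4K
in a 64K page, and a caller-supplied pmd_shift. pte_frag_index(),
pte_frag_offset() and pmd_group_first() are hypothetical helpers, not existing
kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_FRAG_SIZE	4096u	/* assumed: 4K PTE fragment */
#define PTE_FRAG_NR	16u	/* assumed: 64K page / 4K fragment */

/* Which fragment of the shared PTE page serves this VA. */
static unsigned int pte_frag_index(uint64_t va, unsigned int pmd_shift)
{
	assert(pmd_shift < 64);
	return (unsigned int)((va >> pmd_shift) & (PTE_FRAG_NR - 1));
}

/* Byte offset of that fragment inside the PTE page. */
static unsigned int pte_frag_offset(uint64_t va, unsigned int pmd_shift)
{
	return pte_frag_index(va, pmd_shift) * PTE_FRAG_SIZE;
}

/* Index of the first PMD entry in the group that shares one PTE page. */
static uint64_t pmd_group_first(uint64_t va, unsigned int pmd_shift)
{
	assert(pmd_shift < 64);
	return (va >> pmd_shift) & ~(uint64_t)(PTE_FRAG_NR - 1);
}
```

The neighbouring PMDs to check on allocation and on free would then be the 16
entries starting at pmd_group_first().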

-aneesh

^ permalink raw reply

* Re: [PATCH -V1 06/24] powerpc: Reduce PTE table memory wastage
From: Aneesh Kumar K.V @ 2013-03-06  4:08 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, linux-mm
In-Reply-To: <20130305021219.GC2888@iris.ozlabs.ibm.com>

Paul Mackerras <paulus@samba.org> writes:

> On Mon, Mar 04, 2013 at 04:28:42PM +0530, Aneesh Kumar K.V wrote:
>> Paul Mackerras <paulus@samba.org> writes:
>> 
>> > The other general comment I have is that it's not really clear when a
>> > page will be on the mm->context.pgtable_list and when it won't.  I
>> > would like to see an invariant that says something like "the page is
>> > on the pgtable_list if and only if (page->_mapcount & FRAG_MASK) is
>> > neither 0 nor FRAG_MASK".  But that doesn't seem to be the case
>> > exactly, and I can't see any consistent rule, which makes me think
>> > there are going to be bugs in corner cases.
>> >
>> 
>> 
>> I added the below comment when initializing the list.
>> 
>> +#ifdef CONFIG_PPC_64K_PAGES
>> +       /*
>> +        * Used to support 4K PTE fragment. The pages are added to list,
>> +        * when we have free framents in the page. We track the whether
>> +        * a page frament is available using page._mapcount. A value of
>> +        * zero indicate none of the fragments are used and page can be
>> +        * freed. A value of FRAG_MASK indicate all the fragments are used
>> +        * and hence the page will be removed from the below list.
>> +        */
>> +       INIT_LIST_HEAD(&init_mm.context.pgtable_list);
>> +#endif
>> 
>> I am not sure about why you say there is no consistent rule. Can you
>> elaborate on that ?
>
> Well, sometimes you take things off the list when mask == 0, and
> sometimes when (mask & FRAG_MASK) == 0.  So it's not clear whether the
> page is supposed to be on the list when (mask & FRAG_MASK) == 0 but
> mask != 0.  If you stated in a comment what the rule was supposed to
> be then reviewers could check whether your code implemented that rule.
> Also, if you had a consistent rule you could more easily add debug
> code to check that the rule was being followed.


I guess you are looking at this in page_table_free_rcu ?

+	mask = atomic_xor_bits(&page->_mapcount, bit | (bit << FRAG_MASK_BITS));
+	if (!(mask & FRAG_MASK))
+		list_del(&page->lru);
+	else {

We decide whether to remove the page from the list by looking at the
lower-half bits. If all of them are cleared, nobody is using the page.
We may still have a pending RCU free, though, which is indicated by the
upper half. So if the lower half is 0 we can remove the page from the
list, and if _mapcount as a whole is 0 we can free the page.

If all the bits in the lower half are set we also remove the page from
the list, because it has no free fragments left.
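
Written as predicates, the rule described above would look something like the
sketch below. FRAG_MASK_BITS is assumed to be 16 here (one bit per 4K fragment
of a 64K page), and the frag_page_* helpers are illustrative names, not code
from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define FRAG_MASK_BITS	16	/* assumed value */
#define FRAG_MASK	((1u << FRAG_MASK_BITS) - 1)

/*
 * Lower half of _mapcount: fragments in use.
 * Upper half: fragments waiting for an RCU grace period.
 * The page belongs on the list only while some, but not all,
 * fragments are in use.
 */
static bool frag_page_on_list(unsigned int mapcount)
{
	unsigned int in_use = mapcount & FRAG_MASK;

	return in_use != 0 && in_use != FRAG_MASK;
}

/* The page may be freed only when nothing is in use or pending. */
static bool frag_page_can_free(unsigned int mapcount)
{
	return mapcount == 0;
}

/* Debug check: list membership must match the rule. */
static void frag_page_check(unsigned int mapcount, bool on_list)
{
	assert(frag_page_on_list(mapcount) == on_list);
}
```

A check like frag_page_check() could be called after each _mapcount update
while debugging, to confirm that every path follows the same rule.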


>
> Also, that comment above doesn't say anything about the upper bits and
> whether they have any influence on whether the page should be on the
> list or not.


Will add more to the comment.

>
>> > Consider, for example, the case where a page has two fragments still
>> > in use, and one of them gets queued up by RCU for freeing via a call
>> > to page_table_free_rcu, and then the other one gets freed through
>> > page_table_free().  Neither the call to page_table_free_rcu nor the
>> > call to page_table_free will take the page off the list AFAICS, and
>> > then __page_table_free_rcu() will free the page while it's still on
>> > the pgtable_list.
>> 
>> The last one that ends up doing atomic_xor_bits which cause the mapcount
>> to go zero, will take the page off the list and free the page. 
>
> No, look at the example again.  page_table_free_rcu() won't take it
> off the list because it uses the (mask & FRAG_MASK) == 0 test, which
> fails (one fragment is still in use).  page_table_free() won't take it
> off the list because it uses the mask == 0 test, which also fails (one
> fragment is still waiting for the RCU grace period).  Finally,
> __page_table_free_rcu() doesn't take it off the list, it just frees
> the page.  Oops. :)

Got it, I will see how we can fix that.
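
The sequence from Paul's example can be replayed with a small user-space
model. This is a sketch built only from the tests described in this thread,
with FRAG_MASK_BITS assumed to be 16. The model_* functions are stand-ins for
page_table_free_rcu(), page_table_free() and __page_table_free_rcu(), not the
patch's code.

```c
#include <assert.h>
#include <stdbool.h>

#define FRAG_MASK_BITS	16	/* assumed value */
#define FRAG_MASK	((1u << FRAG_MASK_BITS) - 1)

struct frag_page {
	unsigned int mapcount;	/* low half: in use, high half: RCU pending */
	bool on_list;
	bool freed;
};

/* Stand-in for page_table_free_rcu(): move a fragment to "RCU pending". */
static void model_free_rcu(struct frag_page *pg, unsigned int bit)
{
	assert(bit != 0 && (bit & FRAG_MASK) == bit);
	pg->mapcount ^= bit | (bit << FRAG_MASK_BITS);
	if (!(pg->mapcount & FRAG_MASK))
		pg->on_list = false;
}

/* Stand-in for page_table_free(): release a fragment immediately. */
static void model_free(struct frag_page *pg, unsigned int bit)
{
	assert(bit != 0 && (bit & FRAG_MASK) == bit);
	pg->mapcount ^= bit;
	if (pg->mapcount == 0) {
		pg->on_list = false;
		pg->freed = true;
	}
}

/* Stand-in for __page_table_free_rcu(): grace period has elapsed. */
static void model_rcu_callback(struct frag_page *pg, unsigned int bit)
{
	assert(bit != 0 && (bit & FRAG_MASK) == bit);
	pg->mapcount ^= bit << FRAG_MASK_BITS;
	if (pg->mapcount == 0)
		pg->freed = true;	/* note: no list removal here */
}
```

Starting from a page with fragments 0x1 and 0x2 in use and on the list,
calling model_free_rcu() for 0x1, model_free() for 0x2 and then
model_rcu_callback() for 0x1 leaves the page freed while on_list is still
true, which is the corner case described above.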

-aneesh

^ permalink raw reply

* Re: [PATCH] ppc32: Fix compile of sha1-powerpc-asm.S
From: Michael Ellerman @ 2013-03-06  4:09 UTC (permalink / raw)
  To: Christian Kujau; +Cc: LinuxPPC-dev
In-Reply-To: <alpine.DEB.2.10.1303041656010.22410@trent.utfs.org>

On Mon, Mar 04, 2013 at 05:23:14PM -0800, Christian Kujau wrote:
> On Tue, 26 Feb 2013 at 13:20, Tony Breeds wrote:
> > When building with CRYPTO_SHA1_PPC enabled we fail with:
> > ---
> > powerpc/crypto/sha1-powerpc-asm.S: Assembler messages:
> > powerpc/crypto/sha1-powerpc-asm.S:116: Error: can't resolve `0' {*ABS* section} - `STACKFRAMESIZE' {*UND* section}
> > powerpc/crypto/sha1-powerpc-asm.S:116: Error: expression too complex
> > powerpc/crypto/sha1-powerpc-asm.S:178: Error: unsupported relocation against STACKFRAMESIZE
> > ---
> > 
> > Use INT_FRAME_SIZE instead.
> > 
> > Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
> 
> Thanks for the fix! Ran into this as well, with your patch 3.9-rc1 
> compiles again (and it even boots :-))
> 
> Tested-by: Christian Kujau <lists@nerdbynature.de>
> 
> $ grep -A10 sha1 /proc/crypto
> name         : sha1
> driver       : sha1-powerpc
> module       : kernel
> priority     : 0
> refcnt       : 1
> selftest     : passed
> type         : shash
> blocksize    : 64
> digestsize   : 20

Thanks Christian. What hardware are you on?

cheers

^ permalink raw reply

* Re: [PATCH] powerpc/powernv: Fix next available MSI IRQ
From: Gavin Shan @ 2013-03-06  4:09 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev, Gavin Shan
In-Reply-To: <20130306032454.GA3493@concordia>

On Wed, Mar 06, 2013 at 02:24:54PM +1100, Michael Ellerman wrote:
>On Tue, Mar 05, 2013 at 02:59:16PM +0800, Gavin Shan wrote:
>> The allocation of MSI is implemented based on bitmap and working
>> like the mechanism of strict round through the traced next available
>> cursor. However, the next available MSI is never updated in current
>> implementation. The patch fixes the issue.
>> 
>> Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/platforms/powernv/pci.c |    5 +++++
>>  1 files changed, 5 insertions(+), 0 deletions(-)
>> 
>> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
>> index 6f464dc..9cf18c4 100644
>> --- a/arch/powerpc/platforms/powernv/pci.c
>> +++ b/arch/powerpc/platforms/powernv/pci.c
>> @@ -66,6 +66,11 @@ static unsigned int pnv_get_one_msi(struct pnv_phb *phb)
>>  		rc = 0;
>>  		goto out;
>>  	}
>> +
>> +	if (id >= phb->msi_count - 1)
>> +		phb->msi_next = 0;
>> +	else
>> +		phb->msi_next = id + 1;
>>  	__set_bit(id, phb->msi_map);
>
>
>There is code in arch/powerpc/sysdev/msi_bitmap.c that implements a
>bitmap allocator for MSI. It may not do what you need but please take a
>look at it if you haven't already.
>

Thanks, Michael. I never knew that you had implemented bitmaps to manage
MSI interrupts. It seems arch/powerpc/sysdev/msi_bitmap.c meets our
requirements here, except that it needs a device tree node. Fortunately, we
can set the corresponding device tree node to NULL, and the functions that
work with device tree nodes (of_node_get/of_node_put) handle a NULL
device tree node fine.

I'll update the powernv platform to use msi_bitmap.c to manage its MSI interrupts.

Thanks,
Gavin

^ permalink raw reply

* Re: [PATCH -V1 07/24] powerpc: Add size argument to pgtable_cache_add
From: Aneesh Kumar K.V @ 2013-03-06  4:23 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, linux-mm
In-Reply-To: <20130305015041.GA2888@iris.ozlabs.ibm.com>

Paul Mackerras <paulus@samba.org> writes:

> On Mon, Mar 04, 2013 at 04:32:24PM +0530, Aneesh Kumar K.V wrote:
>> 
>> Now with table_size argument, the first arg is no more the shift value,
>> rather it is index into the array. Hence i changed the variable name. I
>> will split that patch to make it easy for review.
>
> OK, so you're saying that the simple relation between index and the
> size of the objects in PGT_CACHE(index) no longer holds. That worries
> me, because now, what guarantees that two callers won't use the same
> index value with two different sizes?  And what guarantees that we
> won't have two callers using different index values but the same size
> (which wouldn't be a disaster but would be a waste of space)?
>
> I think it would be preferable to keep the relation between shift and
> the size of the objects and just arrange to use a different shift
> value for the pmd objects when you need to.

In most places we get the cache pointer by doing something like
PGT_CACHE(PMD_INDEX_SIZE). What we need is for that kmem_cache to return an
object twice the size of PMD_TABLE_SIZE. The relevant diff in the later
patch is below.

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/*
+	 * we store the pgtable details in the second half of PMD
+	 */
+	if (PGT_CACHE(PMD_INDEX_SIZE))
+		pr_err("PMD Page cache already initialized with different size\n");
+	__pgtable_cache_add(PMD_INDEX_SIZE, PMD_TABLE_SIZE * 2, pmd_ctor);
+#else
 	pgtable_cache_add(PMD_INDEX_SIZE, pmd_ctor);
+#endif

^ permalink raw reply

* Re: [PATCH -V1 09/24] powerpc: Decode the pte-lp-encoding bits correctly.
From: Aneesh Kumar K.V @ 2013-03-06  4:30 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, linux-mm
In-Reply-To: <20130305020205.GB2888@iris.ozlabs.ibm.com>

Paul Mackerras <paulus@samba.org> writes:

> On Mon, Mar 04, 2013 at 05:11:53PM +0530, Aneesh Kumar K.V wrote:
>> Paul Mackerras <paulus@samba.org> writes:
>> >> +static inline int hpte_actual_psize(struct hash_pte *hptep, int psize)
>> >> +{
>> >> +	unsigned int mask;
>> >> +	int i, penc, shift;
>> >> +	/* Look at the 8 bit LP value */
>> >> +	unsigned int lp = (hptep->r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
>> >> +
>> >> +	penc = 0;
>> >> +	for (i = 0; i < MMU_PAGE_COUNT; i++) {
>> >> +		/* valid entries have a shift value */
>> >> +		if (!mmu_psize_defs[i].shift)
>> >> +			continue;
>> >> +
>> >> +		/* encoding bits per actual page size */
>> >> +		shift = mmu_psize_defs[i].shift - 11;
>> >> +		if (shift > 9)
>> >> +			shift = 9;
>> >> +		mask = (1 << shift) - 1;
>> >> +		if ((lp & mask) == mmu_psize_defs[psize].penc[i])
>> >> +			return i;
>> >> +	}
>> >> +	return -1;
>> >> +}
>> >
>> > This doesn't look right to me.  First, it's not clear what the 11 and
>> > 9 refer to, and I think the 9 should be LP_BITS (i.e. 8).  Secondly,
>> > the mask for the comparison needs to depend on the actual page size
>> > not the base page size.
>>
>> That 11 should be 12.That depends on the fact that we have below mapping
>
> And the 12 should be LP_SHIFT, shouldn't it?

LP_SHIFT indicates how many bit positions we need to shift to get
to the LP field in the HPTE. I guess what we want here is the shift value
for a 4K page.  How about

shift = mmu_psize_defs[i].shift - mmu_psize_defs[MMU_PAGE_4K].shift;
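
As a worked example of that computation, with the clamp switched to LP_BITS as
agreed earlier in the thread, here is a stand-alone sketch. lp_mask_for_shift()
is an illustrative helper, not a function from the patch; the 4K shift of 12
and LP_BITS of 8 are the values mentioned in this discussion.

```c
#include <assert.h>

#define LP_BITS		8	/* width of the LP field, per the thread */
#define PAGE_SHIFT_4K	12	/* mmu_psize_defs[MMU_PAGE_4K].shift */

/*
 * Mask of LP bits that carry the encoding for an actual page size,
 * given that page size's shift (e.g. 16 for 64K, 24 for 16M).
 */
static unsigned int lp_mask_for_shift(unsigned int page_shift)
{
	unsigned int shift;

	assert(page_shift >= PAGE_SHIFT_4K);
	shift = page_shift - PAGE_SHIFT_4K;
	if (shift > LP_BITS)
		shift = LP_BITS;	/* never wider than the LP field */
	return (1u << shift) - 1;
}
```

With these values a 64K actual page size (shift 16) gives a mask of 0xf, and a
16M actual page size (shift 24) is clamped to the full 8-bit field, 0xff.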


>
>>  rrrr rrrz 	≥8KB
>>
>> Yes, that 9 should be LP_BITS.
>>
>> We are generating mask based on actual page size above (variable i in
>> the for loop).
>
> OK, yes, you're right.
>
>> > I don't see where in this function you set the penc[] elements for
>> > invalid actual page sizes to -1.
>>
>> We do the below
>>
>> --- a/arch/powerpc/mm/hash_utils_64.c
>> +++ b/arch/powerpc/mm/hash_utils_64.c
>> @@ -125,7 +125,7 @@ static struct mmu_psize_def mmu_psize_defaults_old[] = {
>>         [MMU_PAGE_4K] = {
>>                 .shift  = 12,
>>                 .sllp   = 0,
>> -               .penc   = 0,
>> +               .penc   = { [0 ... MMU_PAGE_COUNT - 1] = -1 },
>>                 .avpnm  = 0,
>
> Yes, which sets them for the entries you initialize, but not for the
> others.  For example, the entry for MMU_PAGE_64K will initially be all
> zeroes.  Then we find an entry in the ibm,segment-page-sizes property
> for 64k pages, so we set mmu_psize_defs[MMU_PAGE_64K].shift to 16,
> making that entry valid, but we never set any of the .penc[] entries
> to -1, leading your other code to think that it can do (say) 1M pages
> in a 64k segment using an encoding of 0.
>

Noticed that earlier. This is what I currently have.

+static void mmu_psize_set_default_penc(struct mmu_psize_def *mmu_psize)
+{
+	int bpsize, apsize;
+	for (bpsize = 0; bpsize < MMU_PAGE_COUNT; bpsize++)
+		for (apsize = 0; apsize < MMU_PAGE_COUNT; apsize++)
+			mmu_psize[bpsize].penc[apsize] = -1;
+}
+
 static void __init htab_init_page_sizes(void)
 {
 	int rc;
 
+	mmu_psize_set_default_penc(mmu_psize_defaults_old);
+
 	/* Default to 4K pages only */
 	memcpy(mmu_psize_defs, mmu_psize_defaults_old,
 	       sizeof(mmu_psize_defaults_old));
@@ -411,6 +443,8 @@ static void __init htab_init_page_sizes(void)
 	if (rc != 0)  /* Found */
 		goto found;
 
+	mmu_psize_set_default_penc(mmu_psize_defaults_gp);
+
 	/*
 	 * Not in the device-tree, let's fallback on known size
 	 * list for 16M capable GP & GR
	Modified   arch/powerpc/mm/hugetlbpage-hash64.c



> Also, I noticed that the code in the if (base_idx < 0) statement is
> wrong.  It needs to advance prop (and decrease size) by 2 * lpnum,
> not just 2.
>

Ok. Fixed now.
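
A standalone sketch of the skip Paul describes, under the assumption that
the property is a flat list of blocks laid out as base shift, SLB encoding,
lpnum, followed by lpnum (shift, penc) pairs. The helper names and the set
of supported base shifts are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy: the base page shifts this sketch knows about. */
static int base_shift_supported(uint32_t shift)
{
	return shift == 12 || shift == 16 || shift == 24;
}

/*
 * Walk the blocks and count the (shift, penc) pairs that belong to
 * supported base page sizes. Every block is skipped in full, i.e. by
 * 2 * lpnum cells after its 3-cell header, whether or not it is used.
 */
static int count_supported_pairs(const uint32_t *prop, size_t size)
{
	int pairs = 0;

	assert(prop != NULL);
	while (size >= 3) {
		uint32_t base_shift = prop[0];
		size_t cells = (size_t)prop[2] * 2;	/* lpnum pairs */

		prop += 3;
		size -= 3;
		if (cells > size)
			break;	/* truncated block: stop rather than overrun */

		if (base_shift_supported(base_shift))
			pairs += (int)(cells / 2);

		/* Advance past all pairs of this block, not just one. */
		prop += cells;
		size -= cells;
	}
	return pairs;
}
```

Advancing by only 2 cells for an unsupported block with lpnum > 1 would
make the next iteration read a leftover pair as a block header.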

-aneesh

^ permalink raw reply

* Re: [PATCH -V1 06/24] powerpc: Reduce PTE table memory wastage
From: Aneesh Kumar K.V @ 2013-03-06  5:03 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, linux-mm
In-Reply-To: <20130305021219.GC2888@iris.ozlabs.ibm.com>

Paul Mackerras <paulus@samba.org> writes:

> On Mon, Mar 04, 2013 at 04:28:42PM +0530, Aneesh Kumar K.V wrote:
>> The last one that ends up doing atomic_xor_bits, which causes the mapcount
>> to go to zero, will take the page off the list and free the page.
>
> No, look at the example again.  page_table_free_rcu() won't take it
> off the list because it uses the (mask & FRAG_MASK) == 0 test, which
> fails (one fragment is still in use).  page_table_free() won't take it
> off the list because it uses the mask == 0 test, which also fails (one
> fragment is still waiting for the RCU grace period).  Finally,
> __page_table_free_rcu() doesn't take it off the list, it just frees
> the page.  Oops. :)


How about the below

--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -425,7 +425,7 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
        bit = 1 << ((__pa(table) & ~PAGE_MASK) / PTE_FRAG_SIZE);
        spin_lock(&mm->page_table_lock);
        mask = atomic_xor_bits(&page->_mapcount, bit);
-       if (mask == 0)
+       if (!(mask & FRAG_MASK))
                list_del(&page->lru);
        else if (mask & FRAG_MASK) {
                /*
@@ -446,7 +446,7 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)


i.e., we always remove the page from the list when the lower half is
zero or the lower half is FRAG_MASK.  We free the page when _mapcount is 0.
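
A small userspace model of the two tests, assuming the low half of the mask
tracks fragments in use and the high half tracks fragments waiting for an
RCU grace period. FRAG_MASK_BITS = 4 is an arbitrary value picked for the
sketch, and the helper names are hypothetical. Only the "lower half is
zero" case is modelled here.

```c
#include <assert.h>

#define FRAG_MASK_BITS 4u			/* assumed for this sketch */
#define FRAG_MASK ((1u << FRAG_MASK_BITS) - 1)	/* low half: in-use bits */

/* Bit marking fragment idx as in use. */
static unsigned int frag_in_use_bit(unsigned int idx)
{
	assert(idx < FRAG_MASK_BITS);
	return 1u << idx;
}

/* Bit marking fragment idx as waiting for an RCU grace period. */
static unsigned int frag_rcu_bit(unsigned int idx)
{
	return frag_in_use_bit(idx) << FRAG_MASK_BITS;
}

/* Unlink from the list once no fragment is in use any more. */
static int should_unlink(unsigned int mask)
{
	return (mask & FRAG_MASK) == 0;
}

/* Free the page only when nothing is in use and nothing awaits RCU. */
static int should_free(unsigned int mask)
{
	return mask == 0;
}
```

In Paul's example (one fragment in use, another pending RCU) clearing the
in-use bit leaves a non-zero mask, so the old mask == 0 test never unlinks
the page, while the (mask & FRAG_MASK) test does.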

-aneesh

^ permalink raw reply

* Re: [PATCH 2/3] irq: Add hw continuous IRQs map to virtual continuous IRQs support
From: Mike Qiu @ 2013-03-06  5:34 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: tglx, linuxppc-dev, linux-kernel
In-Reply-To: <20130306035443.GB3493@concordia>

On 2013/3/6 11:54, Michael Ellerman wrote:
> On Tue, Mar 05, 2013 at 03:19:57PM +0800, Mike Qiu wrote:
>> On 2013/3/5 10:23, Michael Ellerman wrote:
>>> On Tue, Jan 15, 2013 at 03:38:55PM +0800, Mike Qiu wrote:
>>>> Adding a function irq_create_mapping_many() which can associate
>>>> multiple MSIs to a continuous irq mapping.
>>>>
>>>> This is needed to enable multiple MSI support for pSeries.
>>>>
>>>> Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
>>>> ---
>>>>   include/linux/irq.h       |    2 +
>>>>   include/linux/irqdomain.h |    3 ++
>>>>   kernel/irq/irqdomain.c    |   61 +++++++++++++++++++++++++++++++++++++++++++++
>>>>   3 files changed, 66 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/include/linux/irq.h b/include/linux/irq.h
>>>> index 60ef45b..e00a7ec 100644
>>>> --- a/include/linux/irq.h
>>>> +++ b/include/linux/irq.h
>>>> @@ -592,6 +592,8 @@ int __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
>>>>   #define irq_alloc_desc_from(from, node)		\
>>>>   	irq_alloc_descs(-1, from, 1, node)
>>>> +#define irq_alloc_desc_n(nevc, node)		\
>>>> +	irq_alloc_descs(-1, 0, nevc, node)
>>> This has been superseded by irq_alloc_descs_from(), which is the right
>>> way to do it.
>> Yes, but irq_alloc_descs_from() just for 1 irq
> No it's not, look again.
>
> #define irq_alloc_descs_from(from, cnt, node)   \
> 	irq_alloc_descs(-1, from, cnt, node)
Sorry, I misread it as irq_alloc_desc_from(from, node).
You are right.
>
>
>>>> diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
>>>> index 96f3a1d..38648e6 100644
>>>> --- a/kernel/irq/irqdomain.c
>>>> +++ b/kernel/irq/irqdomain.c
>>>> @@ -636,6 +636,67 @@ int irq_create_strict_mappings(struct irq_domain *domain, unsigned int irq_base,
>>>>   }
>>>>   EXPORT_SYMBOL_GPL(irq_create_strict_mappings);
>>>> +/**
>>>> + * irq_create_mapping_many - Map a range of hw IRQs to a range of virtual IRQs
>>>> + * @domain: domain owning the interrupt range
>>>> + * @hwirq_base: beginning of continuous hardware IRQ range
>>>> + * @count: Number of interrupts to map
>>> For multiple-MSI the allocated interrupt numbers must be a power-of-2,
>>> and must be naturally aligned. I don't /think/ that's a requirement for
>>> the virtual numbers, but it's probably best that we do it anyway.
>>>
>>> So this API needs to specify that it will give you back a power-of-2
>>> block that is naturally aligned - otherwise you can't use it for MSI.
>> rtas_call will return the number of hardware interrupts, and it
>> should be a power of 2, so I think we do not need to specify this
> You're confusing hardware interrupt numbers and virtual interrupt
> numbers. My comment is about irq_create_mapping_many(), which returns
> virtual interrupt numbers.
>
> As I said I don't think there is a requirement that the virtual
> interrupt numbers are also a power-of-2 naturally aligned block, but we
> should allocate them as one anyway, to avoid any issues in future.
But the virtual interrupt numbers should also be a naturally aligned
power-of-2 block, because they must be contiguous, as MSI-HOWTO.txt says:

     4.2.2 pci_enable_msi_block
     int pci_enable_msi_block(struct pci_dev *dev, int count)
     This variation on the above call allows a device driver to request
     multiple MSIs.  The MSI specification only allows interrupts to be
     allocated in powers of two, up to a maximum of 2^5 (32).
     If this function returns 0, it has succeeded in allocating at least
     as many interrupts as the driver requested
     (it may have allocated more in order to satisfy the power-of-two
     requirement). In this case, the function enables MSI on this device
     and updates dev->irq to be the lowest of the new interrupts
     assigned to it. The other interrupts assigned to the device are in
     the range dev->irq to dev->irq + count - 1.

See the last line: it means the virtual interrupts must be a
contiguous block.
> And so this API, which returns virtual interrupt numbers, must satisfy
> that specification.
>
>>>> +	/* Look for default domain if necessary */
>>>> +	if (!domain)
>>>> +		domain = irq_default_domain;
>>>> +	if (!domain) {
>>>> +		pr_warn("irq_create_mapping called for NULL domain, hwirq=%lx\n"
>>>> +			, hwirq_base);
>>>> +		WARN_ON(1);
>>>> +		return 0;
>>>> +	}
>>>> +	pr_debug("-> using domain @%p\n", domain);
>>>> +
>>>> +	/* For IRQ_DOMAIN_MAP_LEGACY, get the first virtual interrupt number */
>>>> +	if (domain->revmap_type == IRQ_DOMAIN_MAP_LEGACY)
>>>> +		return irq_domain_legacy_revmap(domain, hwirq_base);
>>> The above doesn't work.
>> Why doesn't it work?
> Because irq_domain_legacy_revmap() only allocates a single interrupt
> number.
OK, you're right.
>>>> +	/* Check if mapping already exists */
>>>> +	for (i = 0; i < count; i++) {
>>>> +		virq = irq_find_mapping(domain, hwirq_base+i);
>>>> +		if (virq) {
>>>> +			pr_debug("existing mapping on virq %d,"
>>>> +					" now dispose it first\n", virq);
>>>> +			irq_dispose_mapping(virq);
>>> You might have just disposed of someone else's mapping; we shouldn't do
>>> that. It should be an error to the caller.
>> It's a good question. If the interrupt is used by someone else, why
>> can I get it from the system?
> I agree, that would be a bug. But disposing of someone else's mapping is
> not OK.
>
>> So it may be that someone else forgot to dispose of the mapping, and I
>> think it will never be used by others since I have got the interrupt.
> Perhaps, but that is a bug that needs to be fixed in the code that
> forgets to dispose of the mapping.
>
> cheers
>


^ permalink raw reply

* Re: [PATCH 2/3] irq: Add hw continuous IRQs map to virtual continuous IRQs support
From: Michael Ellerman @ 2013-03-06  5:42 UTC (permalink / raw)
  To: Mike Qiu; +Cc: tglx, linuxppc-dev, linux-kernel
In-Reply-To: <5136D582.80101@linux.vnet.ibm.com>

On Wed, Mar 06, 2013 at 01:34:58PM +0800, Mike Qiu wrote:
> On 2013/3/6 11:54, Michael Ellerman wrote:
> >On Tue, Mar 05, 2013 at 03:19:57PM +0800, Mike Qiu wrote:
> >>On 2013/3/5 10:23, Michael Ellerman wrote:
> >>>On Tue, Jan 15, 2013 at 03:38:55PM +0800, Mike Qiu wrote:
> >>>>diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
> >>>>index 96f3a1d..38648e6 100644
> >>>>--- a/kernel/irq/irqdomain.c
> >>>>+++ b/kernel/irq/irqdomain.c
> >>>>@@ -636,6 +636,67 @@ int irq_create_strict_mappings(struct irq_domain *domain, unsigned int irq_base,
> >>>>  }
> >>>>  EXPORT_SYMBOL_GPL(irq_create_strict_mappings);
> >>>>+/**
> >>>>+ * irq_create_mapping_many - Map a range of hw IRQs to a range of virtual IRQs
> >>>>+ * @domain: domain owning the interrupt range
> >>>>+ * @hwirq_base: beginning of continuous hardware IRQ range
> >>>>+ * @count: Number of interrupts to map
> >>>For multiple-MSI the allocated interrupt numbers must be a power-of-2,
> >>>and must be naturally aligned. I don't /think/ that's a requirement for
> >>>the virtual numbers, but it's probably best that we do it anyway.
> >>>
> >>>So this API needs to specify that it will give you back a power-of-2
> >>>block that is naturally aligned - otherwise you can't use it for MSI.
> >>rtas_call will return the number of hardware interrupts, and it
> >>should be a power of 2, so I think we do not need to specify this
> >You're confusing hardware interrupt numbers and virtual interrupt
> >numbers. My comment is about irq_create_mapping_many(), which returns
> >virtual interrupt numbers.
> >
> >As I said I don't think there is a requirement that the virtual
> >interrupt numbers are also a power-of-2 naturally aligned block, but we
> >should allocate them as one anyway, to avoid any issues in future.

> But the virtual interrupt numbers should also be a naturally aligned
> power-of-2 block, because they must be contiguous, as MSI-HOWTO.txt says:
> 
>     4.2.2 pci_enable_msi_block
>     int pci_enable_msi_block(struct pci_dev *dev, int count)
>     This variation on the above call allows a device driver to request
>     multiple MSIs.  The MSI specification only allows interrupts to be
>     allocated in powers of two, up to a maximum of 2^5 (32).
>     If this function returns 0, it has succeeded in allocating at least
>     as many interrupts as the driver requested
>     (it may have allocated more in order to satisfy the power-of-two
>     requirement). In this case, the function enables MSI on this device
>     and updates dev->irq to be the lowest of the new interrupts
>     assigned to it. The other interrupts assigned to the device are in
>     the range dev->irq to dev->irq + count - 1.
> 
> See the last line: it means the virtual interrupts must be a
> contiguous block.

In practice I think things could work if we didn't, because we are not
using the mask routines that assume that layout.

But you're right, we must implement the API as it's specified, so the
virtual interrupt numbers must be a naturally aligned power-of-2 block.
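
For illustration, "naturally aligned power-of-2 block" can be checked with
a few bit operations. This is a standalone sketch; the helper names are
hypothetical and are not part of the irqdomain API.

```c
#include <assert.h>

/* True if n is a power of two (1, 2, 4, ...). */
static int is_power_of_two(unsigned int n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Round a request (1..32 for MSI) up to the next power of two. */
static unsigned int round_up_pow2(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* A block is naturally aligned when base is a multiple of its size. */
static int is_naturally_aligned_block(unsigned int base, unsigned int count)
{
	return is_power_of_two(count) && (base & (count - 1)) == 0;
}

/* First naturally aligned base at or above 'from' for a block of 'count'. */
static unsigned int first_aligned_base(unsigned int from, unsigned int count)
{
	assert(is_power_of_two(count));
	return (from + count - 1) & ~(count - 1);
}
```

For example, a request for 3 interrupts rounds up to 4, and a block of 8
searched from virq 17 would start at 24.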

cheers

^ permalink raw reply

* Re: [PATCH 5/6][v4]: perf: Create a sysfs entry for Power event format
From: Sukadev Bhattiprolu @ 2013-03-06  5:48 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Andi Kleen, Peter Zijlstra, robert.richter, Anton Blanchard,
	linux-kernel, Stephane Eranian, linuxppc-dev, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, Jiri Olsa
In-Reply-To: <20130227011725.GA5819@concordia>

Michael Ellerman [michael@ellerman.id.au] wrote:
| I suspect Arnaldo was either waiting for an ACK from Ben, or was
| expecting Ben to take it?

Arnaldo, here is an updated patch. If it is acked by Paul Mackerras,
Michael Ellerman or Ben, will you add it to your tree so the whole
patchset comes from one place?

Sukadev

---
From 50c7a46f14083c0ed10d66b7aed66ba76e798550 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Date: Tue, 5 Mar 2013 21:20:56 -0800
Subject: [PATCH] [PATCH 5/6][v4]: perf Create a sysfs format entry for Power7 events

Create a sysfs entry, '/sys/bus/event_source/devices/cpu/format/event'
which describes the format of the POWER7 PMU events.

This code is based on corresponding code in x86.

Changelog[v4]:  [Michael Ellerman, Paul Mackerras] The event format is different
		for other POWER cpus. So move the code to the POWER7-specific
		power7-pmu.c. Also, the POWER7 format uses bits 0-19, not 0-20.

Changelog[v2]: [Jiri Olsa] Use PMU_FORMAT_ATTR rather than duplicating code.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
 arch/powerpc/perf/power7-pmu.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
index b554879..3c475d6 100644
--- a/arch/powerpc/perf/power7-pmu.c
+++ b/arch/powerpc/perf/power7-pmu.c
@@ -420,7 +420,20 @@ static struct attribute_group power7_pmu_events_group = {
 	.attrs = power7_events_attr,
 };
 
+PMU_FORMAT_ATTR(event, "config:0-19");
+
+static struct attribute *power7_pmu_format_attr[] = {
+	&format_attr_event.attr,
+	NULL,
+};
+
+struct attribute_group power7_pmu_format_group = {
+	.name = "format",
+	.attrs = power7_pmu_format_attr,
+};
+
 static const struct attribute_group *power7_pmu_attr_groups[] = {
+	&power7_pmu_format_group,
 	&power7_pmu_events_group,
 	NULL,
 };
-- 
1.7.1
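
As an illustration of what the "config:0-19" format string describes, the
sketch below extracts an inclusive bit range from a 64-bit config value.
It is a userspace model with hypothetical helper names, not code from
power7-pmu.c.

```c
#include <assert.h>
#include <stdint.h>

/* Extract bits lo..hi (inclusive, bit 0 = least significant). */
static uint64_t config_field(uint64_t config, unsigned int lo,
			     unsigned int hi)
{
	unsigned int width;
	uint64_t mask;

	assert(lo <= hi && hi < 64);
	width = hi - lo + 1;
	/* A shift by 64 would be undefined, so handle the full width. */
	mask = (width == 64) ? ~0ULL : ((1ULL << width) - 1);
	return (config >> lo) & mask;
}

/* The 'event' term as described by "config:0-19". */
static uint64_t power7_event_code(uint64_t config)
{
	return config_field(config, 0, 19);
}
```

Anything above bit 19 in the config value is ignored by this extraction,
which is what the changelog note about 0-19 versus 0-20 is about.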

^ permalink raw reply related

* 3.9-rc1 powerpc ptrace.c: 'brk.len' is used uninitialized
From: Philippe De Muyter @ 2013-03-06  6:00 UTC (permalink / raw)
  To: Michael Neuling; +Cc: linuxppc-dev, linux-kernel

Hello Michael,

Bisect tells me that since your commit 9422de3e953d0e60eb95f5430a9dd803eec1c6d7
"powerpc: Hardware breakpoints rewrite to handle non DABR breakpoint registers",
compiling Linux fails with:

  cc1: warnings being treated as errors
  arch/powerpc/kernel/ptrace.c: In function 'arch_ptrace':
  arch/powerpc/kernel/ptrace.c:1450: warning: 'brk.len' is used uninitialized in this function
  arch/powerpc/kernel/ptrace.c:1352: note: 'brk.len' was declared here

Could you look at that?

Thanks

Philippe

^ permalink raw reply

* [PATCH -V2 00/26] THP support for PPC64
From: Aneesh Kumar K.V @ 2013-03-06  6:10 UTC (permalink / raw)
  To: benh, paulus; +Cc: linuxppc-dev

Hi,

This patchset adds transparent huge page support for PPC64.

TODO:
* ppc64 KVM related changes
* powernv still doesn't boot
* hash preload support in update_mmu_cache_pmd

Some numbers:

The latency measurement code from Anton, found at
http://ozlabs.org/~anton/junkcode/latency2001.c

THP disabled 64K page size
------------------------
[root@llmp24l02 ~]# ./latency2001 8G
 8589934592    731.73 cycles    205.77 ns
[root@llmp24l02 ~]# ./latency2001 8G
 8589934592    743.39 cycles    209.05 ns
[root@llmp24l02 ~]#

THP disabled large page via hugetlbfs
-------------------------------------
[root@llmp24l02 ~]# ./latency2001  -l 8G
 8589934592    416.09 cycles    117.01 ns
[root@llmp24l02 ~]# ./latency2001  -l 8G
 8589934592    415.74 cycles    116.91 ns

THP enabled 64K page size.
----------------
[root@llmp24l02 ~]# ./latency2001 8G
 8589934592    405.07 cycles    113.91 ns
[root@llmp24l02 ~]# ./latency2001 8G
 8589934592    411.82 cycles    115.81 ns
[root@llmp24l02 ~]#


We are close to hugetlbfs in latency and we can achieve this with zero
config/page reservation. Most of the allocations above are fault allocated.
I haven't really measured the collapse alloc impact.

Another test that does 50000000 random accesses over a 1GB area goes from
2.65 seconds to 1.07 seconds with this patchset.

Changes from V1
* Address review comments
* More patch split
* Add batch hpte invalidate for hugepages.

Changes from RFC V2:
* Address review comments
* More code cleanup and patch split

Changes from RFC V1:
* HugeTLB fs now works
* Compile issues fixed
* rebased to v3.8
* Patch series reordered so that ppc64 cleanups and MM THP changes are moved
  early in the series. This should help in picking those patches early.

Thanks,
-aneesh

^ permalink raw reply

* [PATCH -V2 01/26] powerpc: Use signed formatting when printing error
From: Aneesh Kumar K.V @ 2013-03-06  6:10 UTC (permalink / raw)
  To: benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1362550227-575-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

PAPR defines these errors as negative values. So print them accordingly
for easy debugging.

Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/lpar.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 0da39fe..a77c35b 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -155,7 +155,7 @@ static long pSeries_lpar_hpte_insert(unsigned long hpte_group,
 	 */
 	if (unlikely(lpar_rc != H_SUCCESS)) {
 		if (!(vflags & HPTE_V_BOLTED))
-			pr_devel(" lpar err %lu\n", lpar_rc);
+			pr_devel(" lpar err %ld\n", lpar_rc);
 		return -2;
 	}
 	if (!(vflags & HPTE_V_BOLTED))
-- 
1.7.10
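
A quick userspace illustration of why the format change matters for
negative return codes. The value -4 is just an example, and the helper
names are hypothetical; the real code calls pr_devel() directly.

```c
#include <assert.h>
#include <stdio.h>

/* Format a return code with %ld, as the patched call does. */
static int format_rc_signed(char *buf, size_t len, long rc)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, " lpar err %ld", rc);
}

/* Format the same code with %lu, as the old call did. */
static int format_rc_unsigned(char *buf, size_t len, long rc)
{
	assert(buf != NULL && len > 0);
	/* The explicit cast mirrors what %lu effectively showed. */
	return snprintf(buf, len, " lpar err %lu", (unsigned long)rc);
}
```

With %ld the message reads " lpar err -4"; with %lu the same value shows
up as a very large positive number whose exact digits depend on the width
of long.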

^ permalink raw reply related

* [PATCH -V2 02/26] powerpc: Save DAR and DSISR in pt_regs on MCE
From: Aneesh Kumar K.V @ 2013-03-06  6:10 UTC (permalink / raw)
  To: benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1362550227-575-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

We were not saving DAR and DSISR on MCE. Save them and also print the values
along with exception details in xmon.

Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/exceptions-64s.S |    9 +++++++++
 arch/powerpc/xmon/xmon.c             |    2 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 0e9c48c..d02e730 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -640,9 +640,18 @@ slb_miss_user_pseries:
 	.align	7
 	.globl machine_check_common
 machine_check_common:
+
+	mfspr	r10,SPRN_DAR
+	std	r10,PACA_EXGEN+EX_DAR(r13)
+	mfspr	r10,SPRN_DSISR
+	stw	r10,PACA_EXGEN+EX_DSISR(r13)
 	EXCEPTION_PROLOG_COMMON(0x200, PACA_EXMC)
 	FINISH_NAP
 	DISABLE_INTS
+	ld	r3,PACA_EXGEN+EX_DAR(r13)
+	lwz	r4,PACA_EXGEN+EX_DSISR(r13)
+	std	r3,_DAR(r1)
+	std	r4,_DSISR(r1)
 	bl	.save_nvgprs
 	addi	r3,r1,STACK_FRAME_OVERHEAD
 	bl	.machine_check_exception
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 1f8d2f1..a72e490 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -1423,7 +1423,7 @@ static void excprint(struct pt_regs *fp)
 	printf("    sp: %lx\n", fp->gpr[1]);
 	printf("   msr: %lx\n", fp->msr);
 
-	if (trap == 0x300 || trap == 0x380 || trap == 0x600) {
+	if (trap == 0x300 || trap == 0x380 || trap == 0x600 || trap == 0x200) {
 		printf("   dar: %lx\n", fp->dar);
 		if (trap != 0x380)
 			printf(" dsisr: %lx\n", fp->dsisr);
-- 
1.7.10

^ permalink raw reply related

