LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH V4 0/3] powerpc/powernv: Error logging interfaces
From: Deepthi Dharwar @ 2014-02-03 19:29 UTC (permalink / raw)
  To: linuxppc-dev

This patch series defines generic interfaces for error logging to
push down critical errors from powernv platform to FSP.

Also, it contains few minor fixes for the exisiting error logging
framework that retrieves error logs from FSP.

Changes from V3:
* Change memory allocation to GFP_ATOMIC, to generate
  errors in bad context
* Move all error log generation related code to arch/powernv
  rename it from opal_* to pnv_* .

Changes from V2:
* Review comments from V2 have been addressed
  includes comment formats, changing naming
  conventions and incorporated error handling
  of the buffers.
* Minor typo fix and use of pr_err/pr_fmt to
  log errors.

 Deepthi Dharwar (3):
      powernv: Push critical error logs to FSP
      powernv: Correct spell error in opal-elog.c
      powernv: Have uniform logging of errors in opal-elog.c

 arch/powerpc/include/asm/opal.h                |   36 +++++++++
 arch/powerpc/platforms/powernv/opal-elog.c     |   93 +++++++++++++++++++++---
 arch/powerpc/platforms/powernv/opal-wrappers.S |    2 -
 arch/powerpc/platforms/powernv/powernv.h       |   84 ++++++++++++++++++++++
 4 files changed, 203 insertions(+), 12 deletions(-)

-- Deepthi

^ permalink raw reply

* Re: PCIe Access - achieve bursts without DMA
From: David Hawkins @ 2014-02-03 17:08 UTC (permalink / raw)
  To: Michael Moese; +Cc: linuxppc-dev
In-Reply-To: <20140203082050.GB1970@localhost.intra.men.de>

Hi Michael,

> On Fri, Jan 31, 2014 at 03:18:30PM -0800, David Hawkins wrote:
>> 1. Peripheral board DMA (board-to-board)
>> 2. Peripheral board DMA to host memory.
>> 3. Host (root complex) DMA.
>>
>> As far as "verification" of your custom peripheral board FPGA IP is
>> concerned, if I was a customer, and you had data for (1) and (2),
>> I'd be pretty happy (and could care less about (2), since its so
>> system dependent).
>
> Usually I would totally agree with you and try to implement the benchmark
> using DMA transfers Unfortunately, we have some boards and IP cores that
> do not support DMA transfers, or the target system must not do by a
> requirement, and as I have no influence on these, I had to investigate
> on how to improve my throughput.

Ah, I see, that does make your life difficult then.

> I've submitted a RFC Patch earlier today, which allowed me to perform
> PCIe read bursts on IO memory, achieving 18 MB/s instead of the 3 MB/s
> I got when using non-cached reads. However, I had to ioremap() my
> memory, like Gabriel said, using write-thru configuration.

That sounds like a reasonable compromise.

Cheers,
Dave

^ permalink raw reply

* Re: [PATCH] powerpc: thp: Fix crash on mremap
From: Luis Henriques @ 2014-02-03 14:47 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, stable
In-Reply-To: <1390911423-4409-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

On Tue, Jan 28, 2014 at 05:47:03PM +0530, Aneesh Kumar K.V wrote:
> This patch fix the below crash
> 
> NIP [c00000000004cee4] .__hash_page_thp+0x2a4/0x440
> LR [c0000000000439ac] .hash_page+0x18c/0x5e0
> ...
> Call Trace:
> [c000000736103c40] [00001ffffb000000] 0x1ffffb000000(unreliable)
> [437908.479693] [c000000736103d50] [c0000000000439ac] .hash_page+0x18c/0x5e0
> [437908.479699] [c000000736103e30] [c00000000000924c] .do_hash_page+0x4c/0x58
> 
> On ppc64 we use the pgtable for storing the hpte slot information and
> store address to the pgtable at a constant offset (PTRS_PER_PMD) from
> pmd. On mremap, when we switch the pmd, we need to withdraw and deposit
> the pgtable again, so that we find the pgtable at PTRS_PER_PMD offset
> from new pmd.
> 
> We also want to move the withdraw and deposit before the set_pmd so
> that, when page fault find the pmd as trans huge we can be sure that
> pgtable can be located at the offset.
> 
> variant of upstream SHA1: b3084f4db3aeb991c507ca774337c7e7893ed04f
> for 3.11 stable series
> 

Since both you and Benjamin Herrenschmidt claim this is good for stable, I
am queuing this variant for the 3.11 kernel.  Thanks a lot!

Cheers,
--
Luis

> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  arch/Kconfig                           |  3 +++
>  arch/powerpc/platforms/Kconfig.cputype |  1 +
>  mm/huge_memory.c                       | 12 ++++++++++++
>  3 files changed, 16 insertions(+)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 1feb169274fe..c5863b35d054 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -368,6 +368,9 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE
>  config HAVE_ARCH_SOFT_DIRTY
>  	bool
>  
> +config ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW
> +	bool
> +
>  config HAVE_MOD_ARCH_SPECIFIC
>  	bool
>  	help
> diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
> index 47d9a03dd415..d11a34be018d 100644
> --- a/arch/powerpc/platforms/Kconfig.cputype
> +++ b/arch/powerpc/platforms/Kconfig.cputype
> @@ -71,6 +71,7 @@ config PPC_BOOK3S_64
>  	select PPC_FPU
>  	select PPC_HAVE_PMU_SUPPORT
>  	select SYS_SUPPORTS_HUGETLBFS
> +	select ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW
>  	select HAVE_ARCH_TRANSPARENT_HUGEPAGE if PPC_64K_PAGES
>  
>  config PPC_BOOK3E_64
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 12acb0ba7991..beaa7cc9de75 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1461,8 +1461,20 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
>  
>  	ret = __pmd_trans_huge_lock(old_pmd, vma);
>  	if (ret == 1) {
> +#ifdef CONFIG_ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW
> +		pgtable_t pgtable;
> +#endif
>  		pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
>  		VM_BUG_ON(!pmd_none(*new_pmd));
> +#ifdef CONFIG_ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW
> +		/*
> +		 * Archs like ppc64 use pgtable to store per pmd
> +		 * specific information. So when we switch the pmd,
> +		 * we should also withdraw and deposit the pgtable
> +		 */
> +		pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
> +		pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
> +#endif
>  		set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
>  		spin_unlock(&mm->page_table_lock);
>  	}
> -- 
> 1.8.5.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe stable" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: PCIe Access - achieve bursts without DMA
From: David Laight @ 2014-02-03 10:51 UTC (permalink / raw)
  To: 'Michael Moese'; +Cc: linuxppc-dev@lists.ozlabs.org, David Hawkins
In-Reply-To: <20140203103943.GC1970@localhost.intra.men.de>

From: Michael Moese=20
> On Mon, Feb 03, 2014 at 10:17:43AM +0000, David Laight wrote:
>=20
> > We achieved about twice that using the PEX dma controller.
>=20
> > Your 3MB/s for single word transfers is similar to what we saw.
> > Cycle times that make an ISA bus look fast.
>=20
> Indeed, this is a really poor performance. I know we could achieve much
> more performance using DMA, we have several products where we simply
> don't have DMA available - this requires searching for other paths.

I got the host (ppc) to do a dma, not the card. (This does need a
dma controller that is adequately intergrated with the PCIe logic.)
So it doesn't require any hardware changes.
I did have to design the software to minimise the number of single
memory transfers.

> My ioremap_wt() could help in these situations, at least increasing
> performance for non-DMA operation to a not-that-bad level.

I needed to do writes as well as reads - so I think I would have
needed to map PCIe space fully cached (rather than write-through).
The speed of back to back writes is better than reads (even if they don't
get combined) because the requests get 'posted' and overlap on the
PCIe bus.

Managing cached accesses does get tricky - you need to make sure that
both sides never have to write to the same cache line.

> I don't know if other devices could benefit from this, but surely we
> got several IPs that would, but those were not yet upstreamed, we're
> still working on this.
>=20
> Michael
>=20

^ permalink raw reply

* Re: PCIe Access - achieve bursts without DMA
From: Michael Moese @ 2014-02-03 10:39 UTC (permalink / raw)
  To: David Laight
  Cc: linuxppc-dev@lists.ozlabs.org, David Hawkins,
	'Michael Moese'
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D0F6B774B@AcuExch.aculab.com>

On Mon, Feb 03, 2014 at 10:17:43AM +0000, David Laight wrote:

> We achieved about twice that using the PEX dma controller.

> Your 3MB/s for single word transfers is similar to what we saw.
> Cycle times that make an ISA bus look fast.

Indeed, this is a really poor performance. I know we could achieve much
more performance using DMA, we have several products where we simply 
don't have DMA available - this requires searching for other paths.

My ioremap_wt() could help in these situations, at least increasing
performance for non-DMA operation to a not-that-bad level.

I don't know if other devices could benefit from this, but surely we
got several IPs that would, but those were not yet upstreamed, we're
still working on this.

Michael

^ permalink raw reply

* RE: PCIe Access - achieve bursts without DMA
From: David Laight @ 2014-02-03 10:17 UTC (permalink / raw)
  To: 'Michael Moese', David Hawkins; +Cc: linuxppc-dev@lists.ozlabs.org
In-Reply-To: <20140203082050.GB1970@localhost.intra.men.de>

From: Michael Moese
> Thank you for your help - we might be satisfied with the achieved
> 18 MB/s.

We achieved about twice that using the PEX dma controller.
I found the following comment I wrote:

/* Long transfer requests are cut into smaller DMA requests.
 * Each PCIe request can contain a maximum of 128 bytes, but the
 * dma engine can have multiple PCIe requests outstanding and this
 * speeds things up somewhat (50ns/byte with 128, 24ns/byte with 1024).
 * 1k is somewhere near the point of diminishing returns. */

Those times would include a system call.
The transfers were done through a simple driver that converted pread()
and pwrite() requests into accesses to the boards memory.
The non-dma versions are just copy_to/from_user() directly between
the PCIe and user buffers.

Your 3MB/s for single word transfers is similar to what we saw.
Cycle times that make an ISA bus look fast.

	David

^ permalink raw reply

* Re: [RFC PATCH] powerpc: add ioremap_wt
From: Gabriel Paubert @ 2014-02-03  8:32 UTC (permalink / raw)
  To: Michael Moese; +Cc: Paul Mackerras, linuxppc-dev, linux-kernel
In-Reply-To: <1391411809-1845-1-git-send-email-michael.moese@men.de>

On Mon, Feb 03, 2014 at 08:16:49AM +0100, Michael Moese wrote:
> Allow for IO memory to be mapped cacheable for performing
> PCI read bursts.
> 
> Signed-off-by: Michael Moese <michael.moese@men.de>
> ---
>  arch/powerpc/include/asm/io.h | 3 +++
>  arch/powerpc/mm/pgtable_32.c  | 8 ++++++++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
> index 45698d5..9591fff 100644
> --- a/arch/powerpc/include/asm/io.h
> +++ b/arch/powerpc/include/asm/io.h
> @@ -631,6 +631,8 @@ static inline void iosync(void)
>   *
>   * * ioremap_wc enables write combining
>   *
> + * * ioremap_wc enables write thru

Typo: _wc -> _wt

Looks fine in principle, but there is a significant difference with wc
on x86, where read accesses always go to the bus (no read caching).

	Gabriel

> + *
>   * * iounmap undoes such a mapping and can be hooked
>   *
>   * * __ioremap_at (and the pending __iounmap_at) are low level functions to
> @@ -652,6 +654,7 @@ extern void __iomem *ioremap(phys_addr_t address, unsigned long size);
>  extern void __iomem *ioremap_prot(phys_addr_t address, unsigned long size,
>  				  unsigned long flags);
>  extern void __iomem *ioremap_wc(phys_addr_t address, unsigned long size);
> +extern void __iomem *ioremap_wt(phys_addr_t address, unsigned long size);
>  #define ioremap_nocache(addr, size)	ioremap((addr), (size))
>  
>  extern void iounmap(volatile void __iomem *addr);
> diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
> index 51f8795..9ab0a54 100644
> --- a/arch/powerpc/mm/pgtable_32.c
> +++ b/arch/powerpc/mm/pgtable_32.c
> @@ -141,6 +141,14 @@ ioremap_wc(phys_addr_t addr, unsigned long size)
>  EXPORT_SYMBOL(ioremap_wc);
>  
>  void __iomem *
> +ioremap_wt(phys_addr_t addr, unsigned long size)
> +{
> +	return __ioremap_caller(addr, size, _PAGE_WRITETHRU,
> +				__builtin_return_address(0));
> +}
> +EXPORT_SYMBOL(ioremap_wt);
> +
> +void __iomem *
>  ioremap_prot(phys_addr_t addr, unsigned long size, unsigned long flags)
>  {
>  	/* writeable implies dirty for kernel addresses */
> -- 
> 1.8.5.3
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

^ permalink raw reply

* Re: PCIe Access - achieve bursts without DMA
From: Michael Moese @ 2014-02-03  8:20 UTC (permalink / raw)
  To: David Hawkins; +Cc: linuxppc-dev, Moese, Michael
In-Reply-To: <52EC2F46.7000609@ovro.caltech.edu>

On Fri, Jan 31, 2014 at 03:18:30PM -0800, David Hawkins wrote:
> 1. Peripheral board DMA (board-to-board)
> 2. Peripheral board DMA to host memory.
> 3. Host (root complex) DMA.
> 
> As far as "verification" of your custom peripheral board FPGA IP is
> concerned, if I was a customer, and you had data for (1) and (2),
> I'd be pretty happy (and could care less about (2), since its so
> system dependent).

Usually I would totally agree with you and try to implement the benchmark
using DMA transfers Unfortunately, we have some boards and IP cores that
do not support DMA transfers, or the target system must not do by a 
requirement, and as I have no influence on these, I had to investigate
on how to improve my throughput.
I've submitted a RFC Patch earlier today, which allowed me to perform
PCIe read bursts on IO memory, achieving 18 MB/s instead of the 3 MB/s
I got when using non-cached reads. However, I had to ioremap() my 
memory, like Gabriel said, using write-thru configuration. 

> Since its an FPGA-based IP. I'd also expect to see a PCIe simulation
> with Bus Functional Models showing what the optimal performance of
> your IP was, and then how it nicely matches with the measurements
> in (1). If you do not have a PCIe logic analyzer, both Xilinx and
> Altera have Chipscope/SignalTap logic analyzers that can be used
> for tracing traffic at the TLP layer inside the FPGA.

Of course our IP developers to simulation and analyzing, we have PCI
and PCIe analyzer and all other equipment one might need. However,
we've seen that not only on PowerPC but also on x86, performing real
bursts is not intuitive.

Thank you for your help - we might be satisfied with the achieved 
18 MB/s.

Michael

^ permalink raw reply

* [RFC PATCH] powerpc: add ioremap_wt
From: Michael Moese @ 2014-02-03  7:16 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras
  Cc: linuxppc-dev, linux-kernel, Michael Moese

Allow for IO memory to be mapped cacheable for performing
PCI read bursts.

Signed-off-by: Michael Moese <michael.moese@men.de>
---
 arch/powerpc/include/asm/io.h | 3 +++
 arch/powerpc/mm/pgtable_32.c  | 8 ++++++++
 2 files changed, 11 insertions(+)

diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index 45698d5..9591fff 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -631,6 +631,8 @@ static inline void iosync(void)
  *
  * * ioremap_wc enables write combining
  *
+ * * ioremap_wc enables write thru
+ *
  * * iounmap undoes such a mapping and can be hooked
  *
  * * __ioremap_at (and the pending __iounmap_at) are low level functions to
@@ -652,6 +654,7 @@ extern void __iomem *ioremap(phys_addr_t address, unsigned long size);
 extern void __iomem *ioremap_prot(phys_addr_t address, unsigned long size,
 				  unsigned long flags);
 extern void __iomem *ioremap_wc(phys_addr_t address, unsigned long size);
+extern void __iomem *ioremap_wt(phys_addr_t address, unsigned long size);
 #define ioremap_nocache(addr, size)	ioremap((addr), (size))
 
 extern void iounmap(volatile void __iomem *addr);
diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index 51f8795..9ab0a54 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -141,6 +141,14 @@ ioremap_wc(phys_addr_t addr, unsigned long size)
 EXPORT_SYMBOL(ioremap_wc);
 
 void __iomem *
+ioremap_wt(phys_addr_t addr, unsigned long size)
+{
+	return __ioremap_caller(addr, size, _PAGE_WRITETHRU,
+				__builtin_return_address(0));
+}
+EXPORT_SYMBOL(ioremap_wt);
+
+void __iomem *
 ioremap_prot(phys_addr_t addr, unsigned long size, unsigned long flags)
 {
 	/* writeable implies dirty for kernel addresses */
-- 
1.8.5.3

^ permalink raw reply related

* Re: [git pull] Please pull powerpc.git next branch
From: Michael Ellerman @ 2014-02-03  3:00 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Olaf Hering, Linus Torvalds, linuxppc-dev, Linux Kernel list
In-Reply-To: <4555187.D5eRSF5r8x@mexican>

On Wed, 2014-01-29 at 13:29 +1100, Alistair Popple wrote:
> Looks like I missed the dart iommu code when changing the iommu table
> initialisation. The patch below should fix it, would you mind testing
> it Ben? Thanks.

Any reason not to add the following to save ourselves in future?

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d773dd4..6ab7b53 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -657,6 +657,8 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
        unsigned int i;
        struct iommu_pool *p;
 
+       BUG_ON(!tbl->it_page_shift);
+
        /* number of bytes needed for the bitmap */
        sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
 

cheers

^ permalink raw reply related

* Re: [PATCH 6/8] powerpc/perf: add support for the hv gpci (get performance counter info) interface
From: Michael Ellerman @ 2014-02-01  5:58 UTC (permalink / raw)
  To: Cody P Schafer, Linux PPC
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Cody P Schafer
In-Reply-To: <1389916434-2288-7-git-send-email-cody@linux.vnet.ibm.com>

On Thu, 2014-16-01 at 23:53:52 UTC, Cody P Schafer wrote:
> This provides a basic link between perf and hv_gpci. Notably, it does
> not yet support transactions and does not list any events (they can
> still be manually composed).

What are the plans for listing?

The manual compose is nice but pretty hairy to use in practice I would think.

> diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c
> new file mode 100644
> index 0000000..31d9d59
> --- /dev/null
> +++ b/arch/powerpc/perf/hv-gpci.c
> @@ -0,0 +1,235 @@
> +/*
> + * Hypervisor supplied "gpci" ("get performance counter info") performance
> + * counter support
> + *
> + * Author: Cody P Schafer <cody@linux.vnet.ibm.com>
> + * Copyright 2014 IBM Corporation.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +#define pr_fmt(fmt) "hv-gpci: " fmt
> +
> +#include <linux/module.h>
> +#include <linux/perf_event.h>
> +#include <asm/firmware.h>
> +#include <asm/hvcall.h>
> +#include <asm/hv_gpci.h>
> +#include <asm/io.h>

Needed?

> +/* See arch/powerpc/include/asm/hv_gpci.h for details on the hcall interface */
> +
> +PMU_RANGE_ATTR(request, config, 0, 31); /* u32 */
> +PMU_RANGE_ATTR(starting_index, config, 32, 63); /* u32 */
> +PMU_RANGE_ATTR(secondary_index, config1, 0, 15); /* u16 */
> +PMU_RANGE_ATTR(counter_info_version, config1, 16, 23); /* u8 */
> +PMU_RANGE_ATTR(length, config1, 24, 31); /* u8, bytes of data (1-8) */
> +PMU_RANGE_ATTR(offset, config1, 32, 63); /* u32, byte offset */
> +
> +static struct attribute *format_attr[] = {
> +	&format_attr_request.attr,
> +	&format_attr_starting_index.attr,
> +	&format_attr_secondary_index.attr,
> +	&format_attr_counter_info_version.attr,
> +

Lonley blank line.

> +	&format_attr_offset.attr,
> +	&format_attr_length.attr,
> +	NULL,
> +};
> +
> +static struct attribute_group format_group = {
> +	.name = "format",
> +	.attrs = format_attr,
> +};
> +
> +static const struct attribute_group *attr_groups[] = {
> +	&format_group,
> +	NULL,
> +};
> +
> +static unsigned long single_gpci_request(u32 req, u32 starting_index,
> +		u16 secondary_index, u8 version_in, u32 offset, u8 length,
> +		u64 *value)

Passing the event and extracting the values in here would be neater IMHO.

> +{
> +	unsigned long ret;
> +	size_t i;
> +	u64 count;
> +
> +	struct {
> +		struct hv_get_perf_counter_info_params params;
> +		union {
> +			union h_gpci_cvs data;
> +			uint8_t bytes[sizeof(union h_gpci_cvs)];
> +		};
> +	} arg = {
> +		.params = {
> +			.counter_request = cpu_to_be32(req),
> +			.starting_index = cpu_to_be32(starting_index),
> +			.secondary_index = cpu_to_be16(secondary_index),
> +			.counter_info_version_in = version_in,
> +		}
> +	};
> +
> +	ret = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO,
> +			virt_to_phys(&arg), sizeof(arg));
> +	if (ret) {
> +		pr_devel("hcall failed: 0x%lx\n", ret);
> +		return ret;
> +	}
> +
> +	/*
> +	 * we verify offset and length are within the zeroed buffer at event
> +	 * init.
> +	 */
> +	count = 0;
> +	for (i = offset; i < offset + length; i++)
> +		count |= arg.bytes[i] << (i - offset);
> +
> +	*value = count;
> +	return ret;
> +}
> +
> +static u64 h_gpci_get_value(struct perf_event *event)
> +{
> +	u64 count;
> +	unsigned long ret = single_gpci_request(event_get_request(event),
> +					event_get_starting_index(event),
> +					event_get_secondary_index(event),
> +					event_get_counter_info_version(event),
> +					event_get_offset(event),
> +					event_get_length(event),
> +					&count);
> +	if (ret)
> +		return 0;
> +	return count;
> +}
> +
> +static void h_gpci_event_update(struct perf_event *event)
> +{
> +	s64 prev;
> +	u64 now = h_gpci_get_value(event);
> +	prev = local64_xchg(&event->hw.prev_count, now);
> +	local64_add(now - prev, &event->count);
> +}
> +
> +static void h_gpci_event_start(struct perf_event *event, int flags)
> +{
> +	local64_set(&event->hw.prev_count, h_gpci_get_value(event));
> +	perf_swevent_start_hrtimer(event);
> +}
> +
> +static void h_gpci_event_stop(struct perf_event *event, int flags)
> +{
> +	perf_swevent_cancel_hrtimer(event);
> +	h_gpci_event_update(event);
> +}
> +
> +static int h_gpci_event_add(struct perf_event *event, int flags)
> +{
> +	if (flags & PERF_EF_START)
> +		h_gpci_event_start(event, flags);
> +
> +	return 0;
> +}
> +
> +static void h_gpci_event_del(struct perf_event *event, int flags)
> +{
> +	h_gpci_event_stop(event, flags);
> +}

Can just hook del directly no?

> +static void h_gpci_event_read(struct perf_event *event)
> +{
> +	h_gpci_event_update(event);
> +}

Ditto.

> +static int h_gpci_event_init(struct perf_event *event)
> +{
> +	u64 count;
> +	u8 length;
> +
> +	/* Not our event */
> +	if (event->attr.type != event->pmu->type)
> +		return -ENOENT;

I don't understand why you need this?

> +	/* config2 is unused */
> +	if (event->attr.config2)
> +		return -EINVAL;

You must also check the reserved regions of config and config1.


> +	/* unsupported modes and filters */
> +	if (event->attr.exclude_user   ||
> +	    event->attr.exclude_kernel ||
> +	    event->attr.exclude_hv     ||
> +	    event->attr.exclude_idle   ||
> +	    event->attr.exclude_host   ||
> +	    event->attr.exclude_guest  ||
> +	    is_sampling_event(event)) /* no sampling */

I think you should also check sample_type.

> +		return -EINVAL;

Have you thought about inherit, pinned, exclusive?

> +
> +	/* no branch sampling */
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
> +	length = event_get_length(event);
> +	if (length < 1 || length > 8)
> +		return -EINVAL;
> +
> +	/* last byte within the buffer? */
> +	if ((event_get_offset(event) + length) > sizeof(union h_gpci_cvs))
> +		return -EINVAL;
> +
> +	/* check if the request works... */
> +	if (single_gpci_request(event_get_request(event),
> +				event_get_starting_index(event),
> +				event_get_secondary_index(event),
> +				event_get_counter_info_version(event),
> +				event_get_offset(event),
> +				length,
> +				&count))
> +		return -EINVAL;
> +
> +	/*
> +	 * Some of the events are per-cpu, some per-core, some per-chip, some
> +	 * are global, and some access data from other virtual machines on the
> +	 * same physical machine. We can't map the cpu value without a lot of
> +	 * work. Instead, we pick an arbitrary cpu for all events on this pmu.
> +	 */
> +	event->cpu = 0;

OK, but is having them all on cpu zero a good idea?

> +	perf_swevent_init_hrtimer(event);
> +	return 0;
> +}
> +
> +struct pmu h_gpci_pmu = {
> +	.task_ctx_nr = perf_invalid_context,
> +
> +	.name = "hv_gpci",
> +	.attr_groups = attr_groups,
> +	.event_init = h_gpci_event_init,
> +	.add = h_gpci_event_add,
> +	.del = h_gpci_event_del,
> +	.start = h_gpci_event_start,
> +	.stop = h_gpci_event_stop,
> +	.read = h_gpci_event_read,

Nice to have them align vertically.

> +	.event_idx = perf_swevent_event_idx,
> +};
> +
> +static int hv_gpci_init(void)
> +{
> +	int r;
> +
> +	if (!firmware_has_feature(FW_FEATURE_LPAR)) {
> +		pr_info("Not running under phyp, not supported\n");

If only it was that simple :)

You'll see FW_FEATURE_LPAR in a KVM guest too.

There are at least two mechanisms for FW to indicate the presence of features,
"ibm,hypertas-functions" and "ibm,architecture-vec-5".

If HGPCI is not exposed in either of those then we'd want to do a probe hcall
here to try and detect it at runtime.


> +		return -ENODEV;
> +	}
> +
> +	r = perf_pmu_register(&h_gpci_pmu, h_gpci_pmu.name, -1);
> +	if (r)
> +		return r;
> +
> +	return 0;
> +}
> +
> +module_init(hv_gpci_init);

This is not modular code so you're discouraged from using module_init(),
arch_initcall() would probably be fine.

cheers

^ permalink raw reply

* Re: [PATCH 5/8] powerpc: add 24x7 interface header
From: Michael Ellerman @ 2014-02-01  5:58 UTC (permalink / raw)
  To: Cody P Schafer, Linux PPC
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Cody P Schafer
In-Reply-To: <1389916434-2288-6-git-send-email-cody@linux.vnet.ibm.com>

On Thu, 2014-16-01 at 23:53:51 UTC, Cody P Schafer wrote:
> 24x7 (also called hv_24x7 or H_24X7) is an interface to obtain
> performance counters from the hypervisor. These counters do not have a
> fixed format/possition and are instead documented in a "24x7 Catalog",
> which is provided by the hypervisor (that interface is also documented
> in this header).
> 
> This method of obtaining performance counters from the hypervisor is
> intended to paritialy replace the gpci interface.

Same comments as for the previous patch.

cheers

^ permalink raw reply

* Re: [PATCH 4/8] powerpc: add hv_gpci interface header
From: Michael Ellerman @ 2014-02-01  5:58 UTC (permalink / raw)
  To: Cody P Schafer, Linux PPC
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Cody P Schafer
In-Reply-To: <1389916434-2288-5-git-send-email-cody@linux.vnet.ibm.com>

On Thu, 2014-16-01 at 23:53:50 UTC, Cody P Schafer wrote:
> "H_GetPerformanceCounterInfo" (refered to as hv_gpci or just gpci from
> here on) is an interface to retrieve specific performance counters and
> other data from the hypervisor. All outputs have a fixed format (and
> are represented as structs in this patch).

So how much of this are we actually using? A lot of these seem to be only used
in the union at the bottom of this file, and not touched elsewhere - or am I
missing something subtle?

Some of it doesn't seem to be used at all?
 
> diff --git a/arch/powerpc/include/asm/hv_gpci.h b/arch/powerpc/include/asm/hv_gpci.h

Any reason this can't just live in arch/powerpc/perf ?

> +++ b/arch/powerpc/include/asm/hv_gpci.h
> @@ -0,0 +1,490 @@
> +#ifndef LINUX_POWERPC_UAPI_HV_GPCI_H_
> +#define LINUX_POWERPC_UAPI_HV_GPCI_H_
> +
> +#include <linux/types.h>
> +
> +/* From the document "H_GetPerformanceCounterInfo Interface" v1.06, paritialy
> + * updated with v1.07 */

Is that public?

> +
> +/* H_GET_PERF_COUNTER_INFO argument */
> +struct hv_get_perf_counter_info_params {
> +	__be32 counter_request; /* I */
> +	__be32 starting_index;  /* IO */
> +	__be16 secondary_index; /* IO */
> +	__be16 returned_values; /* O */
> +	__be32 detail_rc; /* O, "only for 32bit clients" */
> +
> +	/*
> +	 * O, size each of counter_value element in bytes, only set for version
> +	 * >= 0x3
> +	 */
> +	__be16 cv_element_size;
> +
> +	/* I, funny if version < 0x3 */

Funny how? Or better still, do we only support operating on some minimum
sane version of the API?

> +	__u8 counter_info_version_in;
> +
> +	/* O, funny if version < 0x3 */
> +	__u8 counter_info_version_out;
> +	__u8 reserved[0xC];
> +	__u8 counter_value[];
> +} __packed;
> +
> +/* 8 => power8 (1.07)
> + * 6 => TLBIE  (1.07)
> + * 5 => (1.05)
> + * 4 => ?
> + * 3 => ?
> + * 2 => v7r7m0.phyp (?)
> + * 1 => v7r6m0.phyp (?)
> + * 0 => v7r{2,3,4}m0.phyp (?)
> + */

I think this is a mapping of version numbers to firmware releases, it should
say so.

> +#define COUNTER_INFO_VERSION_CURRENT 0x8
> +
> +/* these determine the counter_value[] layout and the meaning of starting_index
> + * and secondary_index */

Needs: leading capital, full stop, block comment.

> +enum counter_info_requests {
> +
> +	/* GENERAL */
> +
> +	/* @starting_index: "starting" physical processor index or -1 for

Why '"starting"' ?

> +	 *                  current phyical processor. Data is only collected
> +	 *                  for the processors' "primary" thread.
> +	 * @secondary_index: unused

This seems to be true in all cases at least for this enum, can we drop it?

> +	 */
> +	CIR_dispatch_timebase_by_processor = 0x10,

Any reason for the weird capitialisation? You've obviously learnt the
noCamelCase rule, but this is still a bit odd :)

> +
> +	/* @starting_index: starting partition id or -1 for the current logical
> +	 *                  partition (virtual machine).
> +	 * @secondary_index: unused
> +	 */
> +	CIR_entitled_capped_uncapped_donated_idle_timebase_by_partition = 0x20,
> +
> +	/* @starting_index: starting partition id or -1 for the current logical
> +	 *                  partition (virtual machine).
> +	 * @secondary_index: unused
> +	 */
> +	CIR_run_instructions_run_cycles_by_partition = 0x30,
> +
> +	/* @starting_index: must be -1 (to refer to the current partition)
> +	 * @secondary_index: unused
> +	 */
> +	CIR_system_performance_capabilities = 0x40,
> +
> +
> +	/* Data from this should only be considered valid if
> +	 * counter_info_version >= 0x3
> +	 * @starting_index: starting hardware chip id or -1 for the current hw
> +	 *		    chip id
> +	 * @secondary_index: unused
> +	 */
> +	CIR_processor_bus_utilization_abc_links = 0x50,
> +
> +	/* Data from this should only be considered valid if
> +	 * counter_info_version >= 0x3
> +	 * @starting_index: starting hardware chip id or -1 for the current hw
> +	 *		    chip id
> +	 * @secondary_index: unused
> +	 */
> +	CIR_processor_bus_utilization_wxyz_links = 0x60,
> +
> +
> +	/* EXPANDED */

??

These are only available if you have the DLC ?

> +	/* Avaliable if counter_info_version >= 0x3
> +	 * @starting_index: starting hardware chip id or -1 for the current hw
> +	 *		    chip id
> +	 * @secondary_index: unused
> +	 */
> +	CIR_processor_bus_utilization_gx_links = 0x70,
> +
> +	/* Avaliable if counter_info_version >= 0x3
> +	 * @starting_index: starting hardware chip id or -1 for the current hw
> +	 *		    chip id
> +	 * @secondary_index: unused
> +	 */
> +	CIR_processor_bus_utilization_mc_links = 0x80,
> +
> +	/* Avaliable if counter_info_version >= 0x3
> +	 * @starting_index: starting physical processor or -1 for the current
> +	 *                  physical processor
> +	 * @secondary_index: unused
> +	 */
> +	CIR_processor_config = 0x90,
> +
> +	/* Avaliable if counter_info_version >= 0x3
> +	 * @starting_index: starting physical processor or -1 for the current
> +	 *                  physical processor
> +	 * @secondary_index: unused
> +	 */
> +	CIR_current_processor_frequency = 0x91,
> +
> +	CIR_processor_core_utilization = 0x94,
> +
> +	CIR_processor_core_power_mode = 0x95,
> +
> +	CIR_affinity_domain_information_by_virutal_processor = 0xA0,
> +
> +	CIR_affinity_domain_info_by_domain = 0xB0,
> +
> +	CIR_affinity_domain_info_by_partition = 0xB1,
> +
> +	/* @starting_index: unused
> +	 * @secondary_index: unused
> +	 */
> +	CIR_physical_memory_info = 0xC0,
> +
> +	CIR_processor_bus_topology = 0xD0,
> +
> +	CIR_partition_hypervisor_queuing_times = 0xE0,
> +
> +	CIR_system_hypervisor_times = 0xF0,
> +
> +	/* LAB */
> +
> +	CIR_set_mmcrh = 0x80001000,
> +	CIR_get_hpmcx = 0x80002000,
> +};


cheers

^ permalink raw reply

* Re: [PATCH 3/8] powerpc: add hvcalls for 24x7 and gpci (get performance counter info)
From: Michael Ellerman @ 2014-02-01  5:58 UTC (permalink / raw)
  To: Cody P Schafer, Linux PPC
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Cody P Schafer
In-Reply-To: <1389916434-2288-4-git-send-email-cody@linux.vnet.ibm.com>

On Thu, 2014-16-01 at 23:53:49 UTC, Cody P Schafer wrote:
> Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/hvcall.h | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
> index d8b600b..48d6efa 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -269,11 +269,15 @@
>  #define H_COP			0x304
>  #define H_GET_MPP_X		0x314
>  #define H_SET_MODE		0x31C
> -#define MAX_HCALL_OPCODE	H_SET_MODE
> +#define H_GET_24X7_CATALOG_PAGE 0xF078
> +#define H_GET_24X7_DATA		0xF07C
> +#define H_GET_PERF_COUNTER_INFO 0xF080

Ugh, why the hell did they put them up there.

> +#define MAX_HCALL_OPCODE	H_GET_PERF_COUNTER_INFO

We have an array which is sized based on this, which is unpleasant.

I think you're better off putting these below in the platform specific section,
and leaving MAX_HCALL_OPCODE alone. The only downside is you can't use the
hcall tracing to see them.

>  /* Platform specific hcalls, used by KVM */
>  #define H_RTAS			0xf000


cheers

^ permalink raw reply

* Re: [PATCH 2/8] perf core: export swevent hrtimer helpers
From: Michael Ellerman @ 2014-02-01  5:58 UTC (permalink / raw)
  To: Cody P Schafer, Linux PPC
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Cody P Schafer
In-Reply-To: <1389916434-2288-3-git-send-email-cody@linux.vnet.ibm.com>

Peter, Ingo, can we get your ACK on this please?

cheers


On Thu, 2014-16-01 at 23:53:48 UTC, Cody P Schafer wrote:
> Export the swevent hrtimer helpers currently only used in events/core.c
> to allow the addition of architecture specific sw-like pmus.

> Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
> ---
>  include/linux/perf_event.h | 5 ++++-
>  kernel/events/core.c       | 8 ++++----
>  2 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 8646e33..c5bc71a 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -558,7 +558,10 @@ extern void perf_pmu_migrate_context(struct pmu *pmu,
>  				int src_cpu, int dst_cpu);
>  extern u64 perf_event_read_value(struct perf_event *event,
>  				 u64 *enabled, u64 *running);
> -
> +extern void perf_swevent_init_hrtimer(struct perf_event *event);
> +extern void perf_swevent_start_hrtimer(struct perf_event *event);
> +extern void perf_swevent_cancel_hrtimer(struct perf_event *event);
> +extern int perf_swevent_event_idx(struct perf_event *event);
>  
>  struct perf_sample_data {
>  	u64				type;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index f574401..d881d1e 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5801,7 +5801,7 @@ static int perf_swevent_init(struct perf_event *event)
>  	return 0;
>  }
>  
> -static int perf_swevent_event_idx(struct perf_event *event)
> +int perf_swevent_event_idx(struct perf_event *event)
>  {
>  	return 0;
>  }
> @@ -6030,7 +6030,7 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
>  	return ret;
>  }
>  
> -static void perf_swevent_start_hrtimer(struct perf_event *event)
> +void perf_swevent_start_hrtimer(struct perf_event *event)
>  {
>  	struct hw_perf_event *hwc = &event->hw;
>  	s64 period;
> @@ -6052,7 +6052,7 @@ static void perf_swevent_start_hrtimer(struct perf_event *event)
>  				HRTIMER_MODE_REL_PINNED, 0);
>  }
>  
> -static void perf_swevent_cancel_hrtimer(struct perf_event *event)
> +void perf_swevent_cancel_hrtimer(struct perf_event *event)
>  {
>  	struct hw_perf_event *hwc = &event->hw;
>  
> @@ -6064,7 +6064,7 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
>  	}
>  }
>  
> -static void perf_swevent_init_hrtimer(struct perf_event *event)
> +void perf_swevent_init_hrtimer(struct perf_event *event)
>  {
>  	struct hw_perf_event *hwc = &event->hw;
>  
> -- 
> 1.8.5.2
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 
> 

^ permalink raw reply

* Re: [PATCH 1/8] perf: add PMU_RANGE_ATTR() helper for use by sw-like pmus
From: Michael Ellerman @ 2014-02-01  5:58 UTC (permalink / raw)
  To: Cody P Schafer, Linux PPC
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Cody P Schafer
In-Reply-To: <1389916434-2288-2-git-send-email-cody@linux.vnet.ibm.com>

On Thu, 2014-16-01 at 23:53:47 UTC, Cody P Schafer wrote:
> Add PMU_RANGE_ATTR() and PMU_RANGE_RESV() (for reserved areas) which
> generate functions to extract the relevent bits from
> event->attr.config{,1,2} for use by sw-like pmus where the
> 'config{,1,2}' values don't map directly to hardware registers.

This is neat.

The split of the macros is a bit weird, ie. PMU_RANGE_RESV() doesn't really do
what it's name suggests.

I think you want one macro which creates the accessors, with a name that
reflects that - yeah I can't think of a good one right now, but "event" should
probably be in there because that's what it operates on.

Having a macro for the reserved regions is good, but you MUST actually check
that the reserved regions are zero. Otherwise you are permitting your caller to
pass junk in there and you then can't unreserved them in a future version of
the API.

So I think a macro that gives you a special reserved region routine would be
good, so you can write something like:

  if (event_check_reserved1() || event_check_reserved2())
  	return -EINVAL;

cheers

^ permalink raw reply

* [PATCH v2] powerpc: Add cpu family documentation
From: Michael Ellerman @ 2014-02-01  4:35 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Stephen Rothwell

This patch adds some documentation on the different cpu families
supported by arch/powerpc.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
v2: Reworked formatting to avoid wrapping.
    Fixed up Freescale details.


 Documentation/powerpc/cpu_families.txt | 227 +++++++++++++++++++++++++++++++++
 1 file changed, 227 insertions(+)
 create mode 100644 Documentation/powerpc/cpu_families.txt

diff --git a/Documentation/powerpc/cpu_families.txt b/Documentation/powerpc/cpu_families.txt
new file mode 100644
index 0000000..fa4f159
--- /dev/null
+++ b/Documentation/powerpc/cpu_families.txt
@@ -0,0 +1,227 @@
+CPU Families
+============
+
+This document tries to summarise some of the different cpu families that exist
+and are supported by arch/powerpc.
+
+
+Book3S (aka sPAPR)
+------------------
+
+ - Hash MMU
+ - Mix of 32 & 64 bit
+
+   +--------------+                              +----------------+
+   |  Old POWER   | ---------------------------> | RS64 (threads) |
+   +--------------+                              +----------------+
+          |
+          |
+          v
+   +--------------+                              +----------------+    +-------+
+   |     601      | ---------------------------> |      603       | -> |  740  |
+   +--------------+                              +----------------+    +-------+
+          |                                              |
+          |                                              |
+          v                                              v
+   +--------------+                              +----------------+    +-------+
+   |     604      |                              |    750 (G3)    | -> | 750CX |
+   +--------------+                              +----------------+    +-------+
+          |                                              |                 |
+          |                                              |                 |
+          v                                              v                 v
+   +--------------+                              +----------------+    +-------+
+   | 620 (64 bit) |                              |      7400      |    | 750CL |
+   +--------------+                              +----------------+    +-------+
+          |                                              |                 |
+          |                                              |                 |
+          v                                              v                 v
+   +--------------+                              +----------------+    +-------+
+   |  POWER3/630  |                              |      7410      |    | 750FX |
+   +--------------+                              +----------------+    +-------+
+          |                                              |
+          |                                              |
+          v                                              v
+   +--------------+                              +----------------+
+   |   POWER3+    |                              |      7450      |
+   +--------------+                              +----------------+
+          |                                              |
+          |                                              |
+          v                                              v
+   +--------------+                              +----------------+
+   |    POWER4    |                              |      7455      |
+   +--------------+                              +----------------+
+          |                                              |
+          |                                              |
+          v                                              v
+   +--------------+                  +-------+   +----------------+
+   |   POWER4+    | ---------------> |  970  |   |      7447      |
+   +--------------+                  +-------+   +----------------+
+          |                              |               |
+          |                              |               |
+          v                              v               v
+   +--------------+     +-------+    +-------+   +----------------+
+   |    POWER5    | --> | Cell  |    | 970FX |   |      7448      |
+   +--------------+     +-------+    +-------+   +----------------+
+          |                              |
+          |                              |
+          v                              v
+   +--------------+                  +-------+
+   |   POWER5+    |                  | 970MP |
+   +--------------+                  +-------+
+          |
+          |
+          v
+   +--------------+
+   |   POWER5++   |
+   +--------------+
+          |
+          |
+          v
+   +--------------+
+   |    POWER6    |
+   +--------------+
+          |
+          |
+          v
+   +--------------+
+   |    POWER7    |
+   +--------------+
+          |
+          |
+          v
+   +--------------+
+   |   POWER7+    |
+   +--------------+
+          |
+          |
+          v
+   +--------------+
+   |    POWER8    |
+   +--------------+
+
+
+   +---------------+
+   | PA6T (64 bit) |
+   +---------------+
+
+
+IBM BookE
+---------
+
+ - Software loaded TLB.
+ - All 32 bit
+
+   +--------------+
+   |     401      |
+   +--------------+
+          |
+          |
+          v
+   +--------------+
+   |     403      |
+   +--------------+
+          |
+          |
+          v
+   +--------------+
+   |     405      |
+   +--------------+
+          |
+          |
+          v
+   +--------------+
+   |     440      |
+   +--------------+
+          |
+          |
+          v
+   +--------------+     +----------------+
+   |     450      | --> |      BG/P      |
+   +--------------+     +----------------+
+          |
+          |
+          v
+   +--------------+
+   |     460      |
+   +--------------+
+          |
+          |
+          v
+   +--------------+
+   |     476      |
+   +--------------+
+
+
+Motorola/Freescale 8xx
+----------------------
+
+ - Software loaded with hardware assist.
+ - All 32 bit
+
+   +--------------+
+   |     8xx      |
+   +--------------+
+          |
+          |
+          v
+   +--------------+
+   |     850      |
+   +--------------+
+
+
+Freescale BookE
+---------------
+
+ - Software loaded TLB.
+ - e6500 adds HW loaded indirect TLB entries.
+ - Mix of 32 & 64 bit
+
+   +--------------+
+   |     e200     |
+   +--------------+
+
+
+   +--------------------------------+
+   |              e500              |
+   +--------------------------------+
+                   |
+                   |
+                   v
+   +--------------------------------+
+   |             e500v2             |
+   +--------------------------------+
+                   |
+                   |
+                   v
+   +--------------------------------+
+   |             e500mc             |
+   +--------------------------------+
+                   |
+                   |
+                   v
+   +--------------------------------+
+   |    e5500 (Book3e) (64 bit)     |
+   +--------------------------------+
+                   |
+                   |
+                   v
+   +--------------------------------+
+   | e6500 (HW TLB) (Multithreaded) |
+   +--------------------------------+
+
+
+IBM A2 core
+-----------
+
+ - Book3E, software loaded TLB + HW loaded indirect TLB entries.
+ - 64 bit
+
+   +--------------+     +----------------+
+   |   A2 core    | --> |      WSP       |
+   +--------------+     +----------------+
+           |
+           |
+           v
+   +--------------+
+   |     BG/Q     |
+   +--------------+
-- 
1.8.3.2

^ permalink raw reply related

* Re: [PATCH] powerpc: Add cpu family documentation
From: Michael Ellerman @ 2014-02-01  4:28 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: linuxppc-dev
In-Reply-To: <20140130143237.a2bc71327ed4939cd71a734f@canb.auug.org.au>

On Thu, 2014-01-30 at 14:32 +1100, Stephen Rothwell wrote:
> Hi Michael,
> 
> Nice.
> 
> On Thu, 30 Jan 2014 13:38:00 +1100 Michael Ellerman <mpe@ellerman.id.au> wrote:
> >
> > +++ b/Documentation/powerpc/cpu_families.txt
> > @@ -0,0 +1,76 @@
> > +CPU Families
> > +============
> > +
> > +This doco tries to summarise some of the different cpu families that exist and
>         ^^^^
> document
> 
> > +   |            |
> > +   |            *---- [620] --- POWER3/630 --- POWER3+ --- POWER4 --- POWER4+ --- POWER5 --- POWER5+ --- POWER5++ --- POWER6 --- POWER7 --- POWER7+ --- POWER8
> 
> Its a pity that this wraps ...

Yeah it is. I was too lazy to fix it.

New version coming.

cheers

^ permalink raw reply

* Re: [PATCH] powerpc: Add cpu family documentation
From: Michael Ellerman @ 2014-02-01  4:28 UTC (permalink / raw)
  To: Kumar Gala; +Cc: linuxppc-dev
In-Reply-To: <669D726F-25DF-4703-AD30-CAE7CA142970@kernel.crashing.org>

On Fri, 2014-01-31 at 07:32 -0600, Kumar Gala wrote:
> On Jan 29, 2014, at 8:38 PM, Michael Ellerman <mpe@ellerman.id.au> wrote:
> > +Freescale BookE
> > +---------------
> > +
> > + - Software loaded TLB.
> > + - e6500 adds HW loaded indirect TLB entries.
> > + - Mix of 32 & 64 bit
> > +
> > +  e200 --- e500 --- e500v2 --- e500mc --- e5500 --- e6500
> > +                                         (Book3E)  (HW TLB)
> > +                                         (64bit)
> > +
> 
> e200 is its own core family that doesn’t have any relation to e500 line other than being book-e
> 
> might want to add multithreaded to e6500.

Thanks Kumar.

cheers

^ permalink raw reply

* [PATCH v3 3/3] powerpc/pseries: Report in kernel device tree update to drmgr
From: Tyrel Datwyler @ 2014-01-31 23:58 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: nfont, Tyrel Datwyler
In-Reply-To: <1391212692-16217-1-git-send-email-tyreld@linux.vnet.ibm.com>

Traditionally it has been drmgr's responsibilty to update the device tree
through the /proc/ppc64/ofdt interface after a suspend/resume operation.
This patchset however has modified suspend/resume ops to preform that update
entirely in the kernel during the resume. Therefore, a mechanism is required
for drmgr to determine who is responsible for the update. This patch adds a
show function to the "hibernate" attribute that returns 1 if the kernel
updates the device tree after the resume and 0 if drmgr is responsible.

Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/suspend.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/suspend.c b/arch/powerpc/platforms/pseries/suspend.c
index 1d9c580..b87b978 100644
--- a/arch/powerpc/platforms/pseries/suspend.c
+++ b/arch/powerpc/platforms/pseries/suspend.c
@@ -192,7 +192,30 @@ out:
 	return rc;
 }
 
-static DEVICE_ATTR(hibernate, S_IWUSR, NULL, store_hibernate);
+#define USER_DT_UPDATE	0
+#define KERN_DT_UPDATE	1
+
+/**
+ * show_hibernate - Report device tree update responsibilty
+ * @dev:		subsys root device
+ * @attr:		device attribute struct
+ * @buf:		buffer
+ *
+ * Report whether a device tree update is performed by the kernel after a
+ * resume, or if drmgr must coordinate the update from user space.
+ *
+ * Return value:
+ *	0 if drmgr is to initiate update, and 1 otherwise
+ **/
+static ssize_t show_hibernate(struct device *dev,
+			      struct device_attribute *attr,
+			      char *buf)
+{
+	return sprintf(buf, "%d\n", KERN_DT_UPDATE);
+}
+
+static DEVICE_ATTR(hibernate, S_IWUSR | S_IRUGO,
+		   show_hibernate, store_hibernate);
 
 static struct bus_type suspend_subsys = {
 	.name = "power",
-- 
1.7.12.2

^ permalink raw reply related

* [PATCH v3 2/3] powerpc/pseries: Update dynamic cache nodes for suspend/resume operation
From: Tyrel Datwyler @ 2014-01-31 23:58 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: nfont, Tyrel Datwyler
In-Reply-To: <1391212692-16217-1-git-send-email-tyreld@linux.vnet.ibm.com>

From: Haren Myneni <hbabu@us.ibm.com>

pHyp can change cache nodes for suspend/resume operation. The current code
updates the device tree after all non boot CPUs are enabled. Hence, we do not
modify the cache list based on the latest cache nodes. Also we do not remove
cache entries for the primary CPU.

This patch removes the cache list for the boot CPU, updates the device tree
before enabling nonboot CPUs and adds cache list for the boot cpu.

Signed-off-by: Haren Myneni <hbabu@us.ibm.com>
Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/rtas.h          |  1 +
 arch/powerpc/platforms/pseries/suspend.c | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index 9bd52c6..a0e1add 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -283,6 +283,7 @@ extern void pSeries_log_error(char *buf, unsigned int err_type, int fatal);
 
 #ifdef CONFIG_PPC_PSERIES
 extern int pseries_devicetree_update(s32 scope);
+extern void post_mobility_fixup(void);
 #endif
 
 #ifdef CONFIG_PPC_RTAS_DAEMON
diff --git a/arch/powerpc/platforms/pseries/suspend.c b/arch/powerpc/platforms/pseries/suspend.c
index 16a2552..1d9c580 100644
--- a/arch/powerpc/platforms/pseries/suspend.c
+++ b/arch/powerpc/platforms/pseries/suspend.c
@@ -26,6 +26,7 @@
 #include <asm/mmu.h>
 #include <asm/rtas.h>
 #include <asm/topology.h>
+#include "../../kernel/cacheinfo.h"
 
 static u64 stream_id;
 static struct device suspend_dev;
@@ -79,6 +80,23 @@ static int pseries_suspend_cpu(void)
 }
 
 /**
+ * pseries_suspend_enable_irqs
+ *
+ * Post suspend configuration updates
+ *
+ **/
+static void pseries_suspend_enable_irqs(void)
+{
+	/*
+	 * Update configuration which can be modified based on device tree
+	 * changes during resume.
+	 */
+	cacheinfo_cpu_offline(smp_processor_id());
+	post_mobility_fixup();
+	cacheinfo_cpu_online(smp_processor_id());
+}
+
+/**
  * pseries_suspend_enter - Final phase of hibernation
  *
  * Return value:
@@ -235,6 +253,7 @@ static int __init pseries_suspend_init(void)
 		return rc;
 
 	ppc_md.suspend_disable_cpu = pseries_suspend_cpu;
+	ppc_md.suspend_enable_irqs = pseries_suspend_enable_irqs;
 	suspend_set_ops(&pseries_suspend_ops);
 	return 0;
 }
-- 
1.7.12.2

^ permalink raw reply related

* [PATCH v3 1/3] powerpc/pseries: Device tree should only be updated once after suspend/migrate
From: Tyrel Datwyler @ 2014-01-31 23:58 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: nfont, Tyrel Datwyler
In-Reply-To: <1391212692-16217-1-git-send-email-tyreld@linux.vnet.ibm.com>

From: Haren Myneni <hbabu@us.ibm.com>

The current code makes rtas calls for update-nodes, activate-firmware and then
update-nodes again. The FW provides the same data for both update-nodes calls.
As a result a proc entry exists error is reported for the second update while
adding device nodes.

This patch makes a single rtas call for update-nodes after activating the FW.
It also add rtas_busy delay for the activate-firmware rtas call.

Signed-off-by: Haren Myneni <hbabu@us.ibm.com>
Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/mobility.c | 26 ++++++++++----------------
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
index cde4e0a..bde7eba 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -290,13 +290,6 @@ void post_mobility_fixup(void)
 	int rc;
 	int activate_fw_token;
 
-	rc = pseries_devicetree_update(MIGRATION_SCOPE);
-	if (rc) {
-		printk(KERN_ERR "Initial post-mobility device tree update "
-		       "failed: %d\n", rc);
-		return;
-	}
-
 	activate_fw_token = rtas_token("ibm,activate-firmware");
 	if (activate_fw_token == RTAS_UNKNOWN_SERVICE) {
 		printk(KERN_ERR "Could not make post-mobility "
@@ -304,16 +297,17 @@ void post_mobility_fixup(void)
 		return;
 	}
 
-	rc = rtas_call(activate_fw_token, 0, 1, NULL);
-	if (!rc) {
-		rc = pseries_devicetree_update(MIGRATION_SCOPE);
-		if (rc)
-			printk(KERN_ERR "Secondary post-mobility device tree "
-			       "update failed: %d\n", rc);
-	} else {
+	do {
+		rc = rtas_call(activate_fw_token, 0, 1, NULL);
+	} while (rtas_busy_delay(rc));
+
+	if (rc)
 		printk(KERN_ERR "Post-mobility activate-fw failed: %d\n", rc);
-		return;
-	}
+
+	rc = pseries_devicetree_update(MIGRATION_SCOPE);
+	if (rc)
+		printk(KERN_ERR "Post-mobility device tree update "
+			"failed: %d\n", rc);
 
 	return;
 }
-- 
1.7.12.2

^ permalink raw reply related

* [PATCH v3 0/3] powerpc/pseries: fix issues in suspend/resume code
From: Tyrel Datwyler @ 2014-01-31 23:58 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: nfont, Tyrel Datwyler

This patchset fixes a couple of issues encountered in the suspend/resume code
base. First when using the kernel device tree update code update-nodes is
unnecessarily called more than once. Second the cpu cache lists are not
updated after a suspend/resume which under certain conditions may cause a
panic. Finally, since the cache list fix utilzes in kernel device tree update
code a means for telling drmgr not to perform a device tree update from
userspace is required.

Changes from v2:
- Moved dynamic configuration update code into pseries specific routine
  per Nathan's suggestion.

Changes from v1:
- Fixed several commit message typos
- Fixed authorship of first two patches

Haren Myneni (2):
  powerpc/pseries: Device tree should only be updated once after
    suspend/migrate
  powerpc/pseries: Update dynamic cache nodes for suspend/resume
    operation

Tyrel Datwyler (1):
  powerpc/pseries: Report in kernel device tree update to drmgr

 arch/powerpc/include/asm/rtas.h           |  1 +
 arch/powerpc/platforms/pseries/mobility.c | 26 +++++++-----------
 arch/powerpc/platforms/pseries/suspend.c  | 44 ++++++++++++++++++++++++++++++-
 3 files changed, 54 insertions(+), 17 deletions(-)

-- 
1.7.12.2

^ permalink raw reply

* Re: PCIe Access - achieve bursts without DMA
From: David Hawkins @ 2014-01-31 23:18 UTC (permalink / raw)
  To: Moese, Michael; +Cc: linuxppc-dev@lists.ozlabs.org
In-Reply-To: <1391208815.27142.38.camel@pasglop>

Hi Michael,

>> I'm currently trying to benchmark access speeds to our PCIe-connected IP-cores
>> located inside our FPGA. On x86-based systems I was able to achieve bursts for
>> both read and write access. On PPC32, using an e500v2, I had no success at all
>> so far.

Whenever I want to benchmark PCI/PCIe performance I do the
following tests;

1. Peripheral board DMA (board-to-board)

    Use two of your FPGA boards in a chassis and DMA between them.

    In a PCI system, you can put the cards on the same bus segment and
    then between a bridge and see how that affects things. In your case,
    the PCIe traffic will all be via the root-complex/switch, so
    you should get the same performance regardless of which PCIe slot
    you use.

    This is likely the "best you can do" as far as bursts go.

2. Peripheral board DMA to host memory.

    In this case I typically insmod a simple driver on the host that
    gives me a page of memory, and then DMA into and out of that
    memory, using the DMA controller on the peripheral.

3. Host (root complex) DMA.

    If your host has a DMA controller, then program it per (2).

As far as "verification" of your custom peripheral board FPGA IP is
concerned, if I was a customer, and you had data for (1) and (2),
I'd be pretty happy (and could care less about (2), since its so
system dependent).

Since its an FPGA-based IP. I'd also expect to see a PCIe simulation
with Bus Functional Models showing what the optimal performance of
your IP was, and then how it nicely matches with the measurements
in (1). If you do not have a PCIe logic analyzer, both Xilinx and
Altera have Chipscope/SignalTap logic analyzers that can be used
for tracing traffic at the TLP layer inside the FPGA.

Just some thoughts ...

Cheers,
Dave

^ permalink raw reply

* Re: PCIe Access - achieve bursts without DMA
From: Benjamin Herrenschmidt @ 2014-01-31 22:53 UTC (permalink / raw)
  To: Moese, Michael; +Cc: linuxppc-dev@lists.ozlabs.org
In-Reply-To: <2DF74D4E746FF14C8697D5041AAE72D56A2B1420@MEN-EX2.intra.men.de>

On Thu, 2014-01-30 at 12:20 +0000, Moese, Michael wrote:
> Hello PPC-developers,
> I'm currently trying to benchmark access speeds to our PCIe-connected IP-cores
> located inside our FPGA. On x86-based systems I was able to achieve bursts for
> both read and write access. On PPC32, using an e500v2, I had no success at all 
> so far. 
> I tried using ioremap_wc(), like I did on x86, for writing, and it only results in my
> writes just being single requests, one after another.

Hrm, ioremap_wc will give you a mapping without the G (guard) bit.
Whether that results in some store gathering or not on IOs depends on a
specific HW implementation, you'll have to check with the FSP folks on
that one, there could also be a chicken switch (HID bit or similar)
needed to enable that (there was on some earlier ppc32 chips).

Another thing you can try is to use FP register load/stores.

> For reads, I noticed I could not ioremap_cache() on PPC, so I used simple ioremap()
> here. 
> I used several ways to read from the device, from simple readl(),memcpy_from_io(), 
> memcpy()  to cacheable_memcpy() - with no improvements.  Even when just issuing
> a batch of prefetch()-calls for all the memory to read did not result in read bursts.
> 
> I only get really poor results, writing is possible with around 40 MiByte/s, whereas I  
> can read at about only 3 MiByte/s.
> After hours of studying the reference manual from freescale, looking into other code
> and searching the web, I'm close to resignation.
> 
> Maybe someone of you has some more directions for me, I'd appreciate every hint
> that leads me to my problem's solution - maybe I just missed something or lack 
> knowledge about this architecture in general.
> 
> Thanks for your reading.
> 
> 
> Michael
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox