LinuxPPC-Dev Archive on lore.kernel.org


* Re: [PATCH 7/9] fsl: add EPU FSM configuration for deep sleep
From: Scott Wood @ 2014-03-14 22:51 UTC (permalink / raw)
  To: Chenhui Zhao; +Cc: linuxppc-dev, linux-kernel, Jason.Jin
In-Reply-To: <20140312083410.GF4706@localhost.localdomain>

On Wed, 2014-03-12 at 16:34 +0800, Chenhui Zhao wrote:
> On Tue, Mar 11, 2014 at 07:08:43PM -0500, Scott Wood wrote:
> > On Fri, 2014-03-07 at 12:58 +0800, Chenhui Zhao wrote:
> > > From: Hongbo Zhang <hongbo.zhang@freescale.com>
> > > 
> > > In the last stage of deep sleep, software will trigger a Finite
> > > State Machine (FSM) to control the hardware procedure, such as
> > > board isolation, killing PLLs, removing power, and so on.
> > > 
> > > When the system is woken up by an interrupt, the FSM controls the
> > > hardware to complete the early resume procedure.
> > > 
> > > This patch configures the EPU FSM in preparation for deep sleep.
> > > 
> > > Signed-off-by: Hongbo Zhang <hongbo.zhang@freescale.com>
> > > Signed-off-by: Chenhui Zhao <chenhui.zhao@freescale.com>
> > 
> > Couldn't this be part of qoriq_pm.c?
> 
> Put the code in drivers/platform/fsl/ so that LS1 can share this code.

How can LS1 share it if it's got hardcoded T1040 values?

> > > diff --git a/drivers/platform/Kconfig b/drivers/platform/Kconfig
> > > index 09fde58..6539e6d 100644
> > > --- a/drivers/platform/Kconfig
> > > +++ b/drivers/platform/Kconfig
> > > @@ -6,3 +6,7 @@ source "drivers/platform/goldfish/Kconfig"
> > >  endif
> > >  
> > >  source "drivers/platform/chrome/Kconfig"
> > > +
> > > +if FSL_SOC
> > > +source "drivers/platform/fsl/Kconfig"
> > > +endif
> > 
> > Chrome doesn't need an ifdef -- why does this?
> 
> We don't want other platforms to see these options, and X86 and GOLDFISH
> have ifdefs.

The point is you can implement the dependency inside
drivers/platform/fsl/Kconfig.
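
As a rough illustration of that suggestion (a sketch only, not the actual
patch): the top-level file sources the new Kconfig unconditionally, the way
chrome does, and the dependency moves into the sourced file. The symbol
FSL_SLEEP_FSM is taken from the Makefile hunk in this thread; the prompt text
is made up for the example.

```kconfig
# drivers/platform/Kconfig: no ifdef needed here
source "drivers/platform/fsl/Kconfig"

# drivers/platform/fsl/Kconfig: the dependency lives with the option
config FSL_SLEEP_FSM
	bool "Freescale sleep FSM configuration (illustrative prompt text)"
	depends on FSL_SOC
```

With this layout, other platforms never see the option because FSL_SOC is not
set for them.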

> > > diff --git a/drivers/platform/fsl/Makefile b/drivers/platform/fsl/Makefile
> > > new file mode 100644
> > > index 0000000..d99ca0e
> > > --- /dev/null
> > > +++ b/drivers/platform/fsl/Makefile
> > > @@ -0,0 +1,5 @@
> > > +#
> > > +# Makefile for linux/drivers/platform/fsl
> > > +# Freescale Specific Power Management Drivers
> > > +#
> > > +obj-$(CONFIG_FSL_SLEEP_FSM)	+= sleep_fsm.o
> > 
> > Why is this here while the other stuff is in arch/powerpc/sysdev?
> > 
> > > +/* Block offsets */
> > > +#define	RCPM_BLOCK_OFFSET	0x00022000
> > > +#define	EPU_BLOCK_OFFSET	0x00000000
> > > +#define	NPC_BLOCK_OFFSET	0x00001000
> > 
> > Why don't these block offsets come from the device tree?
> 
> The DCSR registers are already mapped. We don't wish to remap them.

We don't wish to have hardcoded CCSR/DCSR offsets in the kernel source.
Sorry.
 
> > > +	/* Configure the EPU Counters */
> > > +	epu_write(EPCCR15, 0x92840000);
> > > +	epu_write(EPCCR14, 0x92840000);
> > > +	epu_write(EPCCR12, 0x92840000);
> > > +	epu_write(EPCCR11, 0x92840000);
> > > +	epu_write(EPCCR10, 0x92840000);
> > > +	epu_write(EPCCR9, 0x92840000);
> > > +	epu_write(EPCCR8, 0x92840000);
> > > +	epu_write(EPCCR5, 0x92840000);
> > > +	epu_write(EPCCR4, 0x92840000);
> > > +	epu_write(EPCCR2, 0x92840000);
> > > +
> > > +	/* Configure the SCUs Inputs */
> > > +	epu_write(EPSMCR15, 0x76000000);
> > > +	epu_write(EPSMCR14, 0x00000031);
> > > +	epu_write(EPSMCR13, 0x00003100);
> > > +	epu_write(EPSMCR12, 0x7F000000);
> > > +	epu_write(EPSMCR11, 0x31740000);
> > > +	epu_write(EPSMCR10, 0x65000030);
> > > +	epu_write(EPSMCR9, 0x00003000);
> > > +	epu_write(EPSMCR8, 0x64300000);
> > > +	epu_write(EPSMCR7, 0x30000000);
> > > +	epu_write(EPSMCR6, 0x7C000000);
> > > +	epu_write(EPSMCR5, 0x00002E00);
> > > +	epu_write(EPSMCR4, 0x002F0000);
> > > +	epu_write(EPSMCR3, 0x2F000000);
> > > +	epu_write(EPSMCR2, 0x6C700000);
> > 
> > Where do these magic numbers come from?  Which chips are they valid for?
> 
> They are for T1040. They can be found in the RCPM chapter of the T1040RM.

Then put in a comment to that effect, including what part of the RCPM
chapter.

How do you plan to handle the addition of another SoC with different
values?

-Scott
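
One possible answer to the question above, sketched purely for illustration:
keep the register values in a per-SoC table and pick the table by compatible
string, so adding an SoC means adding a table rather than editing the write
sequence. This is a userspace model, not kernel code. The compatible strings,
the register offsets, and the second table are placeholders invented for the
example; only the values 0x92840000 and 0x76000000 come from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* One register write: offset within the EPU block plus the value. */
struct epu_reg_setting {
	uint32_t offset;
	uint32_t value;
};

/* Per-SoC table, selected by compatible string (hypothetical layout). */
struct epu_soc_data {
	const char *compatible;
	const struct epu_reg_setting *settings;
	size_t count;
};

/* Stands in for the mapped EPU block; a real driver would use MMIO. */
static uint32_t fake_epu[4];

static void epu_write(uint32_t offset, uint32_t value)
{
	fake_epu[offset / 4] = value;
}

/* Offsets are placeholders, NOT taken from any reference manual. */
static const struct epu_reg_setting soc_a_settings[] = {
	{ 0x00, 0x92840000 },	/* EPCCR15 value from the patch */
	{ 0x04, 0x76000000 },	/* EPSMCR15 value from the patch */
};

static const struct epu_reg_setting soc_b_settings[] = {
	{ 0x00, 0x11110000 },	/* made-up value for a second SoC */
};

static const struct epu_soc_data epu_soc_table[] = {
	{ "example,soc-a", soc_a_settings, ARRAY_SIZE(soc_a_settings) },
	{ "example,soc-b", soc_b_settings, ARRAY_SIZE(soc_b_settings) },
};

/* Apply the table for the given SoC; returns 0 on success, -1 if unknown. */
static int epu_configure(const char *compatible)
{
	size_t i, j;

	for (i = 0; i < ARRAY_SIZE(epu_soc_table); i++) {
		const struct epu_soc_data *soc = &epu_soc_table[i];

		if (strcmp(soc->compatible, compatible) != 0)
			continue;
		for (j = 0; j < soc->count; j++)
			epu_write(soc->settings[j].offset,
				  soc->settings[j].value);
		return 0;
	}
	return -1;
}
```

The table entries are also a natural place for the comment requested above,
naming the reference manual section each value comes from.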


* Re: [PATCH 6/9] powerpc/85xx: support sleep feature on QorIQ SoCs with RCPM
From: Scott Wood @ 2014-03-14 22:46 UTC (permalink / raw)
  To: Chenhui Zhao; +Cc: linuxppc-dev, linux-kernel, Jason.Jin
In-Reply-To: <20140312080814.GE4706@localhost.localdomain>

On Wed, 2014-03-12 at 16:08 +0800, Chenhui Zhao wrote:
> On Tue, Mar 11, 2014 at 07:00:27PM -0500, Scott Wood wrote:
> > On Fri, 2014-03-07 at 12:58 +0800, Chenhui Zhao wrote:
> > > In sleep mode, the clocks of e500 cores and unused IP blocks are
> > > turned off. The IP blocks which are allowed to wake up the processor
> > > are still running.
> > > 
> > > The sleep mode is equivalent to the Standby state in Linux. Use this
> > > command to enter sleep mode:
> > >   echo standby > /sys/power/state
> > > 
> > > Signed-off-by: Chenhui Zhao <chenhui.zhao@freescale.com>
> > > ---
> > >  arch/powerpc/Kconfig                   |    4 +-
> > >  arch/powerpc/platforms/85xx/Makefile   |    3 +
> > >  arch/powerpc/platforms/85xx/qoriq_pm.c |   78 ++++++++++++++++++++++++++++++++
> > >  3 files changed, 83 insertions(+), 2 deletions(-)
> > >  create mode 100644 arch/powerpc/platforms/85xx/qoriq_pm.c
> > > 
> > > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> > > index 05f6323..e1d6510 100644
> > > --- a/arch/powerpc/Kconfig
> > > +++ b/arch/powerpc/Kconfig
> > > @@ -222,7 +222,7 @@ config ARCH_HIBERNATION_POSSIBLE
> > >  config ARCH_SUSPEND_POSSIBLE
> > >  	def_bool y
> > >  	depends on ADB_PMU || PPC_EFIKA || PPC_LITE5200 || PPC_83xx || \
> > > -		   (PPC_85xx && !PPC_E500MC) || PPC_86xx || PPC_PSERIES \
> > > +		   FSL_SOC_BOOKE || PPC_86xx || PPC_PSERIES \
> > >  		   || 44x || 40x
> > >  
> > >  config PPC_DCR_NATIVE
> > > @@ -709,7 +709,7 @@ config FSL_PCI
> > >  config FSL_PMC
> > >  	bool
> > >  	default y
> > > -	depends on SUSPEND && (PPC_85xx || PPC_86xx)
> > > +	depends on SUSPEND && (PPC_85xx && !PPC_E500MC || PPC_86xx)
> > 
> > Don't mix && and || without parentheses.
> > 
> > Maybe convert this into being selected (similar to FSL_RCPM), rather
> > than default y?
> 
> Yes, will do.
> 
> > 
> > > diff --git a/arch/powerpc/platforms/85xx/Makefile b/arch/powerpc/platforms/85xx/Makefile
> > > index 25cebe7..7fae817 100644
> > > --- a/arch/powerpc/platforms/85xx/Makefile
> > > +++ b/arch/powerpc/platforms/85xx/Makefile
> > > @@ -2,6 +2,9 @@
> > >  # Makefile for the PowerPC 85xx linux kernel.
> > >  #
> > >  obj-$(CONFIG_SMP) += smp.o
> > > +ifeq ($(CONFIG_FSL_CORENET_RCPM), y)
> > > +obj-$(CONFIG_SUSPEND)	+= qoriq_pm.o
> > > +endif
> > 
> > There should probably be a kconfig symbol for this.
> 
> OK.
> 
> > 
> > > diff --git a/arch/powerpc/platforms/85xx/qoriq_pm.c b/arch/powerpc/platforms/85xx/qoriq_pm.c
> > > new file mode 100644
> > > index 0000000..915b13b
> > > --- /dev/null
> > > +++ b/arch/powerpc/platforms/85xx/qoriq_pm.c
> > > @@ -0,0 +1,78 @@
> > > +/*
> > > + * Support Power Management feature
> > > + *
> > > + * Copyright 2014 Freescale Semiconductor Inc.
> > > + *
> > > + * Author: Chenhui Zhao <chenhui.zhao@freescale.com>
> > > + *
> > > + * This program is free software; you can redistribute	it and/or modify it
> > > + * under  the terms of	the GNU General	 Public License as published by the
> > > + * Free Software Foundation;  either version 2 of the  License, or (at your
> > > + * option) any later version.
> > > + */
> > > +
> > > +#include <linux/kernel.h>
> > > +#include <linux/suspend.h>
> > > +#include <linux/of_platform.h>
> > > +
> > > +#include <sysdev/fsl_soc.h>
> > > +
> > > +#define FSL_SLEEP		0x1
> > > +#define FSL_DEEP_SLEEP		0x2
> > 
> > FSL_DEEP_SLEEP is unused.
> 
> Will be used in the last patch.
> [PATCH 9/9] powerpc/pm: support deep sleep feature on T1040

Ideally the #define would have been introduced in that patch.

> > > +	sleep_modes = FSL_SLEEP;
> > > +	sleep_pm_state = PLAT_PM_SLEEP;
> > > +
> > > +	np = of_find_compatible_node(NULL, NULL, "fsl,qoriq-rcpm-2.0");
> > > +	if (np)
> > > +		sleep_pm_state = PLAT_PM_LPM20;
> > > +
> > > +	suspend_set_ops(&qoriq_suspend_ops);
> > > +
> > > +	return 0;
> > > +}
> > > +arch_initcall(qoriq_suspend_init);
> > 
> > Why is this not a platform driver?  If fsl_pmc can do it...
> > 
> > -Scott
> 
> It can be, but what is the advantage of being a platform driver?

If nothing else, compliance with the standard way of doing things.  Why
not make it a platform driver?  You'd be able to use dev_err, have a
place in sysfs if attributes are needed in the future, etc.

A better answer might be that there are multiple not-very-related files
driving different portions of RCPM.

-Scott


* Re: [PATCH 5/9] powerpc/85xx: disable irq by hardware when suspend for 64-bit
From: Scott Wood @ 2014-03-14 22:41 UTC (permalink / raw)
  To: Chenhui Zhao; +Cc: linuxppc-dev, linux-kernel, Jason.Jin
In-Reply-To: <20140312074610.GD4706@localhost.localdomain>

On Wed, 2014-03-12 at 15:46 +0800, Chenhui Zhao wrote:
> On Tue, Mar 11, 2014 at 06:51:20PM -0500, Scott Wood wrote:
> > On Fri, 2014-03-07 at 12:58 +0800, Chenhui Zhao wrote:
> > > In 64-bit mode, the kernel just clears the irq soft-enable flag
> > > in struct paca_struct to disable external irqs. But, in
> > > the case of suspend, irqs should be disabled by hardware.
> > > Therefore, hook a function to ppc_md.suspend_disable_irqs
> > > to really disable irqs.
> > > 
> > > Signed-off-by: Chenhui Zhao <chenhui.zhao@freescale.com>
> > > ---
> > >  arch/powerpc/platforms/85xx/corenet_generic.c |   12 ++++++++++++
> > >  1 files changed, 12 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/platforms/85xx/corenet_generic.c b/arch/powerpc/platforms/85xx/corenet_generic.c
> > > index 3fdf9f3..983d81f 100644
> > > --- a/arch/powerpc/platforms/85xx/corenet_generic.c
> > > +++ b/arch/powerpc/platforms/85xx/corenet_generic.c
> > > @@ -32,6 +32,13 @@
> > >  #include <sysdev/fsl_pci.h>
> > >  #include "smp.h"
> > >  
> > > +#if defined(CONFIG_PPC64) && defined(CONFIG_SUSPEND)
> > > +static void fsl_suspend_disable_irqs(void)
> > > +{
> > > +	__hard_irq_disable();
> > > +}
> > > +#endif
> > 
> > Why the underscore version?  Don't you want PACA_IRQ_HARD_DIS to be set?
> > 
> > If hard disabling is appropriate here, shouldn't we do it in
> > generic_suspend_disable_irqs()?
> > 
> > Are there any existing platforms that supply a
> > ppc_md.suspend_disable_irqs()?  I don't see any when grepping.
> > 
> > -Scott
> 
> Will use hard_irq_disable().
> 
> I think this is a general problem for powerpc.
> MSR_EE should be cleared before suspend. I agree to put it
> in generic_suspend_disable_irqs().

BTW, make sure you test this patchset with CONFIG_DEBUG_PREEMPT and
similar debugging options to help ensure that the soft IRQ state is
being tracked properly.

-Scott


* Re: [PATCH 3/9] powerpc/rcpm: add RCPM driver
From: Scott Wood @ 2014-03-14 22:34 UTC (permalink / raw)
  To: Chenhui Zhao; +Cc: linuxppc-dev, linux-kernel, Jason.Jin
In-Reply-To: <20140312035954.GB4706@localhost.localdomain>

On Wed, 2014-03-12 at 11:59 +0800, Chenhui Zhao wrote:
> On Tue, Mar 11, 2014 at 06:42:51PM -0500, Scott Wood wrote:
> > On Fri, 2014-03-07 at 12:57 +0800, Chenhui Zhao wrote:
> > > +int fsl_rcpm_init(void)
> > > +{
> > > +	struct device_node *np;
> > > +
> > > +	np = of_find_compatible_node(NULL, NULL, "fsl,qoriq-rcpm-2.0");
> > > +	if (np) {
> > > +		rcpm_v2_regs = of_iomap(np, 0);
> > > +		of_node_put(np);
> > > +		if (!rcpm_v2_regs)
> > > +			return -ENOMEM;
> > > +
> > > +		qoriq_pm_ops = &qoriq_rcpm_v2_ops;
> > > +
> > > +	} else {
> > > +		np = of_find_compatible_node(NULL, NULL, "fsl,qoriq-rcpm-1.0");
> > > +		if (np) {
> > > +			rcpm_v1_regs = of_iomap(np, 0);
> > > +			of_node_put(np);
> > > +			if (!rcpm_v1_regs)
> > > +				return -ENOMEM;
> > > +
> > > +			qoriq_pm_ops = &qoriq_rcpm_v1_ops;
> > > +
> > > +		} else {
> > > +			pr_err("%s: can't find the rcpm node.\n", __func__);
> > > +			return -EINVAL;
> > > +		}
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > 
> > Why isn't this a proper platform driver?
> > 
> > -Scott
> 
> The RCPM is not a single-function IP block; instead, it is a collection
> of device run control and power management functions. It would be called
> by other drivers and functions. For example, the callback
> .freeze_time_base() needs to be called at an early stage of kernel init.
> Therefore, it would be better to initialize it at an early stage.

OK, but consider using of_find_matching_node_and_match().

-Scott
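
To illustrate the idea behind that suggestion, here is a small userspace
model of a match table replacing the nested if/else in fsl_rcpm_init(). It
only mimics the shape of a compatible-string table with per-entry data; it
does not call the real device tree API, and rcpm_match, rcpm_ops and
rcpm_select_ops() are names invented for the sketch. Note that a table walk
like this picks by table order, which may not be identical to how the kernel
helper chooses among nodes.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for the ops structures selected in fsl_rcpm_init(). */
struct rcpm_ops {
	const char *name;
};

static const struct rcpm_ops rcpm_v1_ops = { "rcpm-1.0" };
static const struct rcpm_ops rcpm_v2_ops = { "rcpm-2.0" };

/* Compatible string plus associated data, terminated by a NULL entry. */
struct rcpm_match {
	const char *compatible;
	const struct rcpm_ops *ops;
};

static const struct rcpm_match rcpm_matches[] = {
	{ "fsl,qoriq-rcpm-2.0", &rcpm_v2_ops },
	{ "fsl,qoriq-rcpm-1.0", &rcpm_v1_ops },
	{ NULL, NULL },
};

/* Return the ops for a node's compatible string, or NULL if none match. */
static const struct rcpm_ops *rcpm_select_ops(const char *node_compatible)
{
	const struct rcpm_match *m;

	for (m = rcpm_matches; m->compatible; m++) {
		if (strcmp(m->compatible, node_compatible) == 0)
			return m->ops;
	}
	return NULL;
}
```

Supporting a further RCPM revision then means adding one table row instead of
another else branch.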


* Re: [PATCH 9/9] powerpc/pm: support deep sleep feature on T1040
From: Scott Wood @ 2014-03-14 22:26 UTC (permalink / raw)
  To: Kevin Hao; +Cc: linuxppc-dev, Chenhui Zhao, Jason.Jin, linux-kernel
In-Reply-To: <20140313074613.GD26692@pek-khao-d1.corp.ad.wrs.com>

On Thu, 2014-03-13 at 15:46 +0800, Kevin Hao wrote:
> On Wed, Mar 12, 2014 at 12:43:05PM -0500, Scott Wood wrote:
> > > > Shouldn't we use "readback, sync" here? The following is quoted from the T4240RM:
> > >   To guarantee that the results of any sequence of writes to configuration
> > >   registers are in effect, the final configuration register write should be
> > >   immediately followed by a read of the same register, and that should be
> > >   followed by a SYNC instruction. Then accesses can safely be made to memory
> > >   regions affected by the configuration register write.
> > 
> > I agree that the sync before the readback is probably not necessary,
> > since transactions to the same address should already be ordered.
> > 
> > A sync after the readback helps if you're trying to order the readback
> > with subsequent memory accesses, though in that case wouldn't a sync
> > alone (no readback) be adequate?
> 
> No, we don't just want to order the subsequent memory access here.
> The 'write, readback, sync' is the required sequence if we want to make
> sure that the writing to CCSR register does really take effect.
> 
> >  Though maybe not always -- see the
> > comment near the end of fsl_elbc_write_buf() in
> > drivers/mtd/nand_fsl_elbc.c.  I guess the readback does more than just
> > make sure the device has seen the write, ensuring that the device has
> > finished the transaction to the point of acting on another one.
> 
> Agree.
> 
> > 
> > The data dependency plus isync sequence, which is done by the normal I/O
> > accessors used from C code, orders the readback versus all future
> > instructions (not just I/O).  The delay loop is not I/O.
> 
> According to the PowerISA, the sequence 'load, data dependency, isync' only
> orders the load accesses. 

The point is to order the delay loop after the load, not to order
storage versus storage.

This is a sequence we're already using on all of our I/O loads
(excluding accesses like in this patch that don't use the standard
accessors).  I'm confident that it works even if it's not
architecturally guaranteed.  I'm not sure that there exists a clear
architectural way of synchronizing non-storage instructions relative to
storage instructions.

Given that isync is documented as preventing any execution of
instructions after the isync until all previous instructions complete,
it doesn't seem to make sense for the architecture to explicitly talk
about loads (as opposed to any other instruction) following a load,
dependent conditional branch, isync sequence.

> So if we want to order all the storage accesses as well
> as execution synchronization, we should choose sync here.

Do we need execution synchronization or context synchronization?

The t4240 RM section that talks about a readback and a sync is in the
context of subsequent memory operations ("Then accesses can safely be
made to memory regions affected..."), not arbitrary instructions.  There
are also a couple other places in the RM where isync is recommended
instead (when setting LAWs or CCSRBAR), even though those also only
involve memory accesses.

In any case, this is not performance critical and thus it's better to
oversynchronize than undersynchronize.

-Scott
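
For readers following the ordering discussion, the "write, readback, sync"
sequence quoted from the reference manual can be pictured as below. This is
a portable userspace sketch, not the code from the patch: a plain variable
stands in for the configuration register, and the GCC/Clang builtin
__atomic_thread_fence(__ATOMIC_SEQ_CST) stands in for the sync instruction
(the assumption being that the compiler emits a full barrier for it on the
target). Kernel code would use the I/O accessors and barrier macros instead.
The helper name cfg_write_readback_sync() is invented for the example.

```c
#include <stdint.h>

/*
 * Order of operations only:
 *   1. write the configuration register,
 *   2. read the same register back,
 *   3. issue a full barrier before any dependent accesses.
 */
static inline void cfg_write_readback_sync(volatile uint32_t *reg,
					   uint32_t val)
{
	*reg = val;				  /* 1. configuration write */
	(void)*reg;				  /* 2. readback (volatile) */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);  /* 3. stand-in for sync */
}
```

Whether step 3 should be sync or isync on real hardware is exactly the open
question in this thread; the sketch takes no position on it.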


* Re: [PATCH 06/10] powerpc/booke64: Use SPRG_TLB_EXFRAME on bolted handlers
From: Scott Wood @ 2014-03-14 19:29 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Tiejun Chen, linuxppc-dev, kvm-ppc, Mihai Caraman
In-Reply-To: <1394755249-8856-7-git-send-email-scottwood@freescale.com>

On Thu, 2014-03-13 at 19:00 -0500, Scott Wood wrote:
> @@ -444,6 +451,9 @@ _GLOBAL(kvmppc_resume_host)
>  	PPC_STD(r8, VCPU_SHARED_SPRG6, r11)
>  	mfxer	r3
>  	PPC_STD(r9, VCPU_SHARED_SPRG7, r11)
> +#ifdef CONFIG_64BIT
> +	mtspr	SPRN_SPRG_VDSO_WRITE, r3
> +#endif

Oops, this hunk was a patch shuffling accident.  v2 coming.

-Scott


* Re: [PATCH v3 10/52] arm, kvm: Fix CPU hotplug callback registration
From: Christoffer Dall @ 2014-03-14 19:10 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: ego, kvm, peterz, linux-kernel, linuxppc-dev, paulus, walken,
	kvmarm, linux-arch, linux, mingo, marc.zyngier, paulmck, linux-pm,
	Gleb Natapov, rusty, tglx, linux-arm-kernel, rjw, oleg, tj,
	Paolo Bonzini, akpm
In-Reply-To: <53229701.8050405@linux.vnet.ibm.com>

On Fri, Mar 14, 2014 at 11:13:29AM +0530, Srivatsa S. Bhat wrote:
> On 03/13/2014 04:51 AM, Christoffer Dall wrote:
> > On Tue, Mar 11, 2014 at 02:05:38AM +0530, Srivatsa S. Bhat wrote:
> >> Subsystems that want to register CPU hotplug callbacks, as well as perform
> >> initialization for the CPUs that are already online, often do it as shown
> >> below:
> >>
> >> 	get_online_cpus();
> >>
> >> 	for_each_online_cpu(cpu)
> >> 		init_cpu(cpu);
> >>
> >> 	register_cpu_notifier(&foobar_cpu_notifier);
> >>
> >> 	put_online_cpus();
> >>
> >> This is wrong, since it is prone to ABBA deadlocks involving the
> >> cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
> >> with CPU hotplug operations).
> >>
> >> Instead, the correct and race-free way of performing the callback
> >> registration is:
> >>
> >> 	cpu_notifier_register_begin();
> >>
> >> 	for_each_online_cpu(cpu)
> >> 		init_cpu(cpu);
> >>
> >> 	/* Note the use of the double underscored version of the API */
> >> 	__register_cpu_notifier(&foobar_cpu_notifier);
> >>
> >> 	cpu_notifier_register_done();
> >>
> >>
> >> Fix the kvm code in arm by using this latter form of callback registration.
> >>
> >> Cc: Christoffer Dall <christoffer.dall@linaro.org>
> >> Cc: Gleb Natapov <gleb@kernel.org>
> >> Cc: Russell King <linux@arm.linux.org.uk>
> >> Cc: Ingo Molnar <mingo@kernel.org>
> >> Cc: kvmarm@lists.cs.columbia.edu
> >> Cc: kvm@vger.kernel.org
> >> Cc: linux-arm-kernel@lists.infradead.org
> >> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
> >> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> >> ---
> >>
> >>  arch/arm/kvm/arm.c |    7 ++++++-
> >>  1 file changed, 6 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> >> index bd18bb8..f0e50a0 100644
> >> --- a/arch/arm/kvm/arm.c
> >> +++ b/arch/arm/kvm/arm.c
> >> @@ -1051,21 +1051,26 @@ int kvm_arch_init(void *opaque)
> >>  		}
> >>  	}
> >>  
> >> +	cpu_notifier_register_begin();
> >> +
> >>  	err = init_hyp_mode();
> >>  	if (err)
> >>  		goto out_err;
> >>  
> >> -	err = register_cpu_notifier(&hyp_init_cpu_nb);
> >> +	err = __register_cpu_notifier(&hyp_init_cpu_nb);
> >>  	if (err) {
> >>  		kvm_err("Cannot register HYP init CPU notifier (%d)\n", err);
> >>  		goto out_err;
> >>  	}
> >>  
> >> +	cpu_notifier_register_done();
> >> +
> >>  	hyp_cpu_pm_init();
> >>  
> >>  	kvm_coproc_table_init();
> >>  	return 0;
> >>  out_err:
> >> +	cpu_notifier_register_done();
> >>  	return err;
> >>  }
> >>  
> >>
> > 
> > > Just so we're clear, the existing code was simply racy, not prone to
> > > deadlocks, right?
> > 
> > This makes it clear that the test above for compatible CPUs can be quite
> > easily evaded by using CPU hotplug, but we don't really have a good
> > solution for handling that yet...  Hmmm, grumble grumble, I guess if you
> > hotplug unsupported CPUs on a KVM/ARM system for now, stuff will break.
> > 
> 
> In this particular case, there was no deadlock possibility, rather the
> existing code had insufficient synchronization against CPU hotplug.
> 
> init_hyp_mode() would invoke cpu_init_hyp_mode() on currently online CPUs
> using on_each_cpu(). If a CPU came online after this point and before calling
> register_cpu_notifier(), that CPU would remain uninitialized because this
> subsystem would miss the hot-online event. This patch fixes this bug and
> also uses the new synchronization method (instead of get/put_online_cpus())
> to ensure that we don't deadlock with CPU hotplug.
> 

Yes, that was my conclusion as well.  Thanks for clarifying.  (It could
be noted in the commit message as well if you should feel so inclined).

> > In any case:
> > Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
> > 
> 
> Thanks a lot!
> 
Thanks,
-Christoffer


* Re: [PATCH 1/2 v2] irqdomain: add support for creating a continous mapping
From: Thomas Gleixner @ 2014-03-14 11:18 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: Scott Wood, linuxppc-dev
In-Reply-To: <20140221085736.GA11411@linutronix.de>


On Fri, 21 Feb 2014, Sebastian Andrzej Siewior wrote:

> An MSI device may have multiple interrupts. That means that the
> interrupt numbers should be continuous so that pdev->irq refers to the
> first interrupt, pdev->irq + 1 to the second and so on.
> This patch adds support for continuous allocation of virqs for a range
> of hwirqs. The function is based on irq_create_mapping() but due to the
> number argument there is very little in common now.
> 
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> Scott, this is what you suggested. I must admit, it does not look that
> bad. It is just compile tested.

Is it tested for real as well?
 
> +static int irq_check_continuous_mapping(struct irq_domain *domain,
> +		irq_hw_number_t hwirq, unsigned int num)
> +{
> +	int virq;
> +	int i;
> +
> +	virq = irq_find_mapping(domain, hwirq);
> +
> +	for (i = 1; i < num; i++) {
> +		unsigned int next;
> +
> +		next = irq_find_mapping(domain, hwirq + i);
> +		if (next == virq + i)
> +			continue;
> +
> +		pr_err("irq: invalid partial mapping. First hwirq %lu maps to "
> +				"%d and \n", hwirq, virq);
> +		pr_err("irq: +%d hwirq (%lu) maps to %d but should be %d.\n",
> +				i, hwirq + i, next, virq + i);
> +		return -EINVAL;
> +	}
> +
> +	pr_debug("-> existing mapping on virq %d\n", virq);
> +	return virq;
> +}
> +
>  /**
> - * irq_create_mapping() - Map a hardware interrupt into linux irq space
> + * irq_create_mapping_block() - Map multiple hardware interrupts
>   * @domain: domain owning this hardware interrupt or NULL for default domain
>   * @hwirq: hardware irq number in that domain space
> + * @num: number of interrupts
> + *
> + * Maps a hwirq to a newly allocated virq. If num is greater than 1 then num
> + * hwirqs (hwirq … hwirq + num - 1) will be mapped and virq will be  continuous.
> + * Returns the first linux virq number.
>   *
> - * Only one mapping per hardware interrupt is permitted. Returns a linux
> - * irq number.
>   * If the sense/trigger is to be specified, set_irq_type() should be called
>   * on the number returned from that call.
>   */
> -unsigned int irq_create_mapping(struct irq_domain *domain,
> -				irq_hw_number_t hwirq)
> +unsigned int irq_create_mapping_block(struct irq_domain *domain,
> +		irq_hw_number_t hwirq, unsigned int num)
>  {
> -	unsigned int hint;
>  	int virq;
> +	int i;
> +	int node;
> +	unsigned int hint;

What's wrong with

	unsigned int hint;
  	int virq, i, node;

?
  
> -	pr_debug("irq_create_mapping(0x%p, 0x%lx)\n", domain, hwirq);
> +	pr_debug("%s(0x%p, 0x%lx, %d)\n", __func__, domain, hwirq, num);
>  
>  	/* Look for default domain if nececssary */
> -	if (domain == NULL)
> +	if (!domain && num == 1)
>  		domain = irq_default_domain;
> +
>  	if (domain == NULL) {
>  		WARN(1, "%s(, %lx) called with NULL domain\n", __func__, hwirq);
>  		return 0;
>  	}
>  	pr_debug("-> using domain @%p\n", domain);
>  
>  	/* Check if mapping already exists */
> -	virq = irq_find_mapping(domain, hwirq);
> -	if (virq) {
> -		pr_debug("-> existing mapping on virq %d\n", virq);
> -		return virq;
> +	for (i = 0; i < num; i++) {
> +		virq = irq_find_mapping(domain, hwirq + i);
> +		if (virq != NO_IRQ) {
> +			if (i == 0)
> +				return irq_check_continuous_mapping(domain,
> +						hwirq, num);

So what is the loop for? If i == 0 and virq != NO_IRQ you return. That
does not make sense at all.
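
To make the three cases explicit (nothing mapped, whole block already mapped
contiguously, partial mapping), the check could be structured roughly as in
this userspace model. It is only a sketch of the logic, not a proposed
replacement: toy_map and toy_find_mapping() stand in for the domain lookup,
and check_block() and its return convention are invented for the example.

```c
#define NO_IRQ		0
#define NR_HWIRQS	16

/* Toy hwirq -> virq table standing in for the irq domain lookup. */
static unsigned int toy_map[NR_HWIRQS];

static unsigned int toy_find_mapping(unsigned long hwirq)
{
	return hwirq < NR_HWIRQS ? toy_map[hwirq] : NO_IRQ;
}

/*
 * Returns the first virq (> 0) if the whole block is already mapped
 * contiguously, 0 if nothing in the block is mapped, and -1 if the
 * block is only partially or non-contiguously mapped.
 */
static int check_block(unsigned long hwirq, unsigned int num)
{
	unsigned int first = toy_find_mapping(hwirq);
	unsigned int i;

	for (i = 1; i < num; i++) {
		unsigned int next = toy_find_mapping(hwirq + i);

		if (first == NO_IRQ) {
			if (next != NO_IRQ)
				return -1;	/* later hwirq mapped, first not */
		} else if (next != first + i) {
			return -1;		/* gap or out-of-order mapping */
		}
	}
	return (int)first;
}
```

Here the first lookup happens once, outside the loop, so the loop body only
ever compares the remaining entries against it.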
  
> +			pr_err("irq: hwirq %ld has no mapping but hwirq %ld "
> +				"maps to virq %d. This can't be a block\n",
> +				hwirq, hwirq + i, virq);
> +			return -EINVAL;
> +		}
>  	}
>  
> +	node = of_node_to_nid(domain->of_node);
>  	/* Allocate a virtual interrupt number */
>  	hint = hwirq % nr_irqs;
>  	if (hint == 0)
>  		hint++;
> -	virq = irq_alloc_desc_from(hint, of_node_to_nid(domain->of_node));
> -	if (virq <= 0)
> -		virq = irq_alloc_desc_from(1, of_node_to_nid(domain->of_node));
> +	virq = irq_alloc_descs_from(hint, num, node);
> +	if (virq <= 0 && hint != 1)
> +		virq = irq_alloc_descs_from(1, num, node);
>  	if (virq <= 0) {
>  		pr_debug("-> virq allocation failed\n");
>  		return 0;
>  	}
>  
> -	if (irq_domain_associate(domain, virq, hwirq)) {
> -		irq_free_desc(virq);
> -		return 0;
> +	irq_domain_associate_many(domain, virq, hwirq, num);

So irq_domain_associate can fail, but irq_domain_associate_many cannot ?

> +	if (num == 1) {
> +		pr_debug("irq %lu on domain %s mapped to virtual irq %u\n",
> +			hwirq, of_node_full_name(domain->of_node), virq);
> +		return virq;
>  	}
> -
> -	pr_debug("irq %lu on domain %s mapped to virtual irq %u\n",
> -		hwirq, of_node_full_name(domain->of_node), virq);
> -
> +	pr_debug("irqs %lu…%lu on domain %s mapped to virtual irqs %u…%u\n",
> +		hwirq, hwirq + num - 1, of_node_full_name(domain->of_node),
> +			virq, virq + num - 1);

A single pr_debug is sufficient, hmm?

Thanks,

	tglx


* Re: [PATCH RFC v9 5/6] dma: mpc512x: add device tree binding document
From: Arnd Bergmann @ 2014-03-14 10:43 UTC (permalink / raw)
  To: Mark Rutland
  Cc: devicetree@vger.kernel.org, Lars-Peter Clausen, Vinod Koul,
	Gerhard Sittig, Andy Shevchenko, Alexander Popov,
	dmaengine@vger.kernel.org, Dan Williams, Anatolij Gustschin,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <20140313180916.GF25870@e106331-lin.cambridge.arm.com>

On Thursday 13 March 2014, Mark Rutland wrote:
> > +
> > +Example:
> > +
> > +     dma0: dma@14000 {
> > +             compatible = "fsl,mpc5121-dma";
> > +             reg = <0x14000 0x1800>;
> > +             interrupts = <65 0x8>;
> > +             #dma-cells = <1>;
> > +     };
> > +
> > +
> > +Client node properties:
> > +
> > +Required properties:
> > +- dmas:                      list of DMA specifiers, consisting each of a handle
> > +                     for the DMA controller and integer cells to specify
> > +                     the channel used within the DMA controller
> > +- dma-names:         list of identifier strings for the DMA specifiers,
> > +                     client device driver code uses these strings to
> > +                     have DMA channels looked up at the controller
> 
> List the exact names you expect, or the dma-names property is useless.

Listing specific names makes no sense in the binding for the provider,
they should be part of the slave device binding. A reference to the
generic bindings/dma/dma.txt file should be enough here.

	Arnd


* Re: [PATCH RFC v9 3/6] dma: mpc512x: replace devm_request_irq() with request_irq()
From: Andy Shevchenko @ 2014-03-14  9:50 UTC (permalink / raw)
  To: Alexander Popov
  Cc: Lars-Peter Clausen, Arnd Bergmann, Vinod Koul, Gerhard Sittig,
	dmaengine, Dan Williams, Anatolij Gustschin, linuxppc-dev
In-Reply-To: <1394624875-24411-4-git-send-email-a13xp0p0v88@gmail.com>

On Wed, 2014-03-12 at 15:47 +0400, Alexander Popov wrote:
> Replace devm_request_irq() with request_irq() since there is no need
> to use it because the original code always frees IRQ manually with
> devm_free_irq(). Replace devm_free_irq() with free_irq() accordingly.
> 
> Signed-off-by: Alexander Popov <a13xp0p0v88@gmail.com>
> ---
>  drivers/dma/mpc512x_dma.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/dma/mpc512x_dma.c b/drivers/dma/mpc512x_dma.c
> index b1e430c..ff7f678 100644
> --- a/drivers/dma/mpc512x_dma.c
> +++ b/drivers/dma/mpc512x_dma.c
> @@ -921,16 +921,15 @@ static int mpc_dma_probe(struct platform_device *op)
>  	mdma->tcd = (struct mpc_dma_tcd *)((u8 *)(mdma->regs)
>  							+ MPC_DMA_TCD_OFFSET);
>  
> -	retval = devm_request_irq(dev, mdma->irq, &mpc_dma_irq, 0, DRV_NAME,
> -									mdma);
> +	retval = request_irq(mdma->irq, &mpc_dma_irq, 0, DRV_NAME, mdma);
>  	if (retval) {
>  		dev_err(dev, "Error requesting IRQ!\n");
>  		return -EINVAL;
>  	}
>  
>  	if (mdma->is_mpc8308) {
> -		retval = devm_request_irq(dev, mdma->irq2, &mpc_dma_irq, 0,
> -				DRV_NAME, mdma);
> +		retval = request_irq(mdma->irq2, &mpc_dma_irq, 0,
> +							DRV_NAME, mdma);
>  		if (retval) {
>  			dev_err(dev, "Error requesting IRQ2!\n");

+ free_irq(IRQ1) here and maybe in other places.
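
The usual shape for this kind of cleanup is goto-based unwinding: once the
first IRQ is no longer device-managed, every later failure has to release it
explicitly. The following is a userspace model of that pattern only;
toy_request_irq(), toy_free_irq() and toy_probe() are stand-ins invented for
the sketch and do not mirror the real request_irq() signature.

```c
#include <stdbool.h>

/* Tracks which toy IRQ lines are currently held. */
static bool irq_held[8];

/* Stand-in for requesting an IRQ; 'fail' forces an error for testing. */
static int toy_request_irq(int irq, bool fail)
{
	if (fail)
		return -1;
	irq_held[irq] = true;
	return 0;
}

static void toy_free_irq(int irq)
{
	irq_held[irq] = false;
}

/* Probe-like function: a failure on the second IRQ releases the first. */
static int toy_probe(int irq, int irq2, bool has_irq2, bool fail_irq2)
{
	int ret;

	ret = toy_request_irq(irq, false);
	if (ret)
		return ret;

	if (has_irq2) {
		ret = toy_request_irq(irq2, fail_irq2);
		if (ret)
			goto err_free_irq1;
	}

	return 0;

err_free_irq1:
	toy_free_irq(irq);
	return ret;
}
```

Later failure points in the same function would jump to labels placed above
err_free_irq1, so that resources are released in reverse order of
acquisition.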

>  			return -EINVAL;
> @@ -1020,7 +1019,7 @@ static int mpc_dma_probe(struct platform_device *op)
>  	dev_set_drvdata(dev, mdma);
>  	retval = dma_async_device_register(dma);
>  	if (retval) {
> -		devm_free_irq(dev, mdma->irq, mdma);
> +		free_irq(mdma->irq, mdma);
>  		irq_dispose_mapping(mdma->irq);
>  	}
>  
> @@ -1033,7 +1032,7 @@ static int mpc_dma_remove(struct platform_device *op)
>  	struct mpc_dma *mdma = dev_get_drvdata(dev);
>  
>  	dma_async_device_unregister(&mdma->dma);
> -	devm_free_irq(dev, mdma->irq, mdma);
> +	free_irq(mdma->irq, mdma);
>  	irq_dispose_mapping(mdma->irq);
>  
>  	return 0;


-- 
Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Intel Finland Oy


* Re: [PATCH RFC v9 2/6] dma: mpc512x: add support for peripheral transfers
From: Andy Shevchenko @ 2014-03-14  9:47 UTC (permalink / raw)
  To: Alexander Popov
  Cc: Lars-Peter Clausen, Arnd Bergmann, Vinod Koul, Gerhard Sittig,
	dmaengine, Dan Williams, Anatolij Gustschin, linuxppc-dev
In-Reply-To: <1394624875-24411-3-git-send-email-a13xp0p0v88@gmail.com>

On Wed, 2014-03-12 at 15:47 +0400, Alexander Popov wrote:
> Introduce support for slave s/g transfer preparation and the associated
> device control callback in the MPC512x DMA controller driver, which adds
> support for data transfers between memory and peripheral I/O to the
> previously supported mem-to-mem transfers.


> --- a/drivers/dma/mpc512x_dma.c
> +++ b/drivers/dma/mpc512x_dma.c
> @@ -2,6 +2,7 @@
>   * Copyright (C) Freescale Semicondutor, Inc. 2007, 2008.
>   * Copyright (C) Semihalf 2009
>   * Copyright (C) Ilya Yanok, Emcraft Systems 2010
> + * Copyright (C) Alexander Popov, Promcontroller 2013

2014?

[]

> +static int mpc_dma_device_control(struct dma_chan *chan, enum dma_ctrl_cmd cmd,
> +							unsigned long arg)
> +{
> +	struct mpc_dma_chan *mchan;
> +	struct mpc_dma *mdma;
> +	struct dma_slave_config *cfg;
> +	unsigned long flags;
> +
> +	mchan = dma_chan_to_mpc_dma_chan(chan);
> +	switch (cmd) {
> +	case DMA_TERMINATE_ALL:
> +		/* Disable channel requests */
> +		mdma = dma_chan_to_mpc_dma(chan);
> +
> +		spin_lock_irqsave(&mchan->lock, flags);
> +
> +		out_8(&mdma->regs->dmacerq, chan->chan_id);
> +		list_splice_tail_init(&mchan->prepared, &mchan->free);
> +		list_splice_tail_init(&mchan->queued, &mchan->free);
> +		list_splice_tail_init(&mchan->active, &mchan->free);
> +
> +		spin_unlock_irqrestore(&mchan->lock, flags);
> +
> +		return 0;
> +	case DMA_SLAVE_CONFIG:
> +		/* Constraints:
> +		 *  - only transfers between a peripheral device and
> +		 *     memory are supported;
> +		 *  - minimal transfer chunk is 4 bytes and consequently
> +		 *     source and destination addresses must be 4-byte aligned
> +		 *     and transfer size must be aligned on (4 * maxburst)
> +		 *     boundary;
> +		 *  - during the transfer RAM address is being incremented by
> +		 *     the size of minimal transfer chunk;
> +		 *  - peripheral port's address is constant during the transfer.
> +		 */
> +
> +		cfg = (void *)arg;
> +
> +		if (!is_slave_direction(cfg->direction))
> +			return -EINVAL;

As far as I understand, the intention is not to use the direction field
in dma_slave_config. It will be removed at some point.

> +
> +		if (cfg->src_addr_width != DMA_SLAVE_BUSWIDTH_4_BYTES &&
> +			cfg->dst_addr_width != DMA_SLAVE_BUSWIDTH_4_BYTES)
> +			return -EINVAL;
> +
> +		spin_lock_irqsave(&mchan->lock, flags);
> +
> +		if (cfg->direction == DMA_DEV_TO_MEM) {
> +			mchan->per_paddr = cfg->src_addr;
> +			mchan->tcd_nunits = cfg->src_maxburst;
> +		} else {
> +			mchan->per_paddr = cfg->dst_addr;
> +			mchan->tcd_nunits = cfg->dst_maxburst;
> +		}

Ditto.
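
To illustrate the point about not relying on the direction field, here is a
minimal userspace sketch, not the driver's code. It assumes the channel
simply stores both the source and destination settings at configuration time
and picks one when a transfer is prepared, where the direction is passed in
with the request. All type and function names (slave_cfg, chan_state,
chan_config, chan_prep) are hypothetical stand-ins, not dmaengine
definitions. The alignment checks mirror the constraints listed in the
quoted comment.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the dmaengine types; not the kernel definitions. */
enum xfer_dir { DIR_MEM_TO_DEV, DIR_DEV_TO_MEM };

struct slave_cfg {
	uint32_t src_addr, dst_addr;         /* peripheral FIFO addresses */
	uint32_t src_maxburst, dst_maxburst; /* burst length in 4-byte units */
};

struct chan_state {
	struct slave_cfg cfg;                /* both directions kept */
};

/* Configuration time: remember everything, decide nothing about direction. */
static void chan_config(struct chan_state *c, const struct slave_cfg *cfg)
{
	c->cfg = *cfg;
}

/* Preparation time: the direction arrives with the transfer request. */
static bool chan_prep(const struct chan_state *c, enum xfer_dir dir,
		      uint32_t mem_addr, size_t len,
		      uint32_t *per_paddr, uint32_t *nunits)
{
	uint32_t burst;

	if (dir == DIR_DEV_TO_MEM) {
		*per_paddr = c->cfg.src_addr;
		burst = c->cfg.src_maxburst;
	} else {
		*per_paddr = c->cfg.dst_addr;
		burst = c->cfg.dst_maxburst;
	}

	/* 4-byte aligned addresses, length a multiple of 4 * maxburst. */
	if (burst == 0 || (mem_addr & 3) || (*per_paddr & 3) ||
	    len % (4u * burst))
		return false;

	*nunits = burst;
	return true;
}
```

With this shape the same configuration can serve transfers in either
direction without being reconfigured.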


-- 
Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Intel Finland Oy

^ permalink raw reply

* Re: [PATCH v3 36/52] zsmalloc: Fix CPU hotplug callback registration
From: Minchan Kim @ 2014-03-14  6:41 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-arch, ego, walken, linux, akpm, linux-pm, peterz, rusty,
	rjw, oleg, linux-kernel, linux-mm, linuxppc-dev, paulus, tj, tglx,
	paulmck, mingo, Nitin Gupta
In-Reply-To: <20140310203959.10746.61303.stgit@srivatsabhat.in.ibm.com>

On Tue, Mar 11, 2014 at 02:09:59AM +0530, Srivatsa S. Bhat wrote:
> Subsystems that want to register CPU hotplug callbacks, as well as perform
> initialization for the CPUs that are already online, often do it as shown
> below:
> 
> 	get_online_cpus();
> 
> 	for_each_online_cpu(cpu)
> 		init_cpu(cpu);
> 
> 	register_cpu_notifier(&foobar_cpu_notifier);
> 
> 	put_online_cpus();
> 
> This is wrong, since it is prone to ABBA deadlocks involving the
> cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
> with CPU hotplug operations).
> 
> Instead, the correct and race-free way of performing the callback
> registration is:
> 
> 	cpu_notifier_register_begin();
> 
> 	for_each_online_cpu(cpu)
> 		init_cpu(cpu);
> 
> 	/* Note the use of the double underscored version of the API */
> 	__register_cpu_notifier(&foobar_cpu_notifier);
> 
> 	cpu_notifier_register_done();
> 
> 
> Fix the zsmalloc code by using this latter form of callback registration.
> 
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Nitin Gupta <ngupta@vflare.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: linux-mm@kvack.org
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

Acked-by: Minchan Kim <minchan@kernel.org>

Thanks.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply

* [PATCH] powerpc: ratelimit users spamming kernel log buffer
From: Michael Neuling @ 2014-03-14  6:03 UTC (permalink / raw)
  To: benh; +Cc: michael, Linux PPC dev, Paul Mackerras

The facility unavailable exception can be triggered from userspace by
accessing PMU registers when EBB is not enabled.  This causes the
included pr_err() to run, hence spamming the kernel log buffer.

Avoid this by rate limiting these messages.

Signed-off-by: Michael Neuling <mikey@neuling.org>
---
We can bike shed the hell out of this one.  We could just remove it, or
we could put it under the control of show_unhandled_signals.
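
For readers unfamiliar with the mechanism, the following is a userspace
model of interval/burst rate limiting, roughly what pr_err_ratelimited()
relies on. It is a sketch only: the struct and function names (rl_state,
rl_allow) are made up for illustration, time is in arbitrary ticks, and the
kernel's real state handling (locking, reporting of suppressed callbacks) is
left out. The kernel defaults are a 5 second interval and a burst of 10.

```c
#include <stdbool.h>

/* Userspace model only; names are hypothetical. */
struct rl_state {
	long interval;   /* window length, in arbitrary ticks */
	int burst;       /* messages allowed per window */
	long begin;      /* start of the current window */
	int printed;     /* messages let through in this window */
	int missed;      /* messages suppressed in this window */
};

/* Return true if a message may be printed at time 'now'. */
static bool rl_allow(struct rl_state *rs, long now)
{
	if (now - rs->begin >= rs->interval) {
		/* Window expired: start a new one and reset the counters. */
		rs->begin = now;
		rs->printed = 0;
		rs->missed = 0;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}

	rs->missed++;
	return false;
}
```

A userspace loop hitting the exception thousands of times per second would
then produce at most 'burst' lines per interval instead of flooding the log.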

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index e435bc0..7836de3f 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -1340,8 +1340,9 @@ void facility_unavailable_exception(struct pt_regs *regs)
 	if (!arch_irq_disabled_regs(regs))
 		local_irq_enable();
 
-	pr_err("%sFacility '%s' unavailable, exception at 0x%lx, MSR=%lx\n",
-	       hv ? "Hypervisor " : "", facility, regs->nip, regs->msr);
+	pr_err_ratelimited(
+		"%sFacility '%s' unavailable, exception at 0x%lx, MSR=%lx\n",
+		hv ? "Hypervisor " : "", facility, regs->nip, regs->msr);
 
 	if (user_mode(regs)) {
 		_exception(SIGILL, regs, ILL_ILLOPC, regs->nip);

^ permalink raw reply related

* Re: [PATCH v3 10/52] arm, kvm: Fix CPU hotplug callback registration
From: Srivatsa S. Bhat @ 2014-03-14  5:43 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: ego, kvm, peterz, linux-kernel, linuxppc-dev, paulus, walken,
	kvmarm, linux-arch, linux, mingo, marc.zyngier, paulmck, linux-pm,
	Gleb Natapov, rusty, tglx, linux-arm-kernel, rjw, oleg, tj,
	Paolo Bonzini, akpm
In-Reply-To: <20140312232127.GC24808@cbox>

On 03/13/2014 04:51 AM, Christoffer Dall wrote:
> On Tue, Mar 11, 2014 at 02:05:38AM +0530, Srivatsa S. Bhat wrote:
>> Subsystems that want to register CPU hotplug callbacks, as well as perform
>> initialization for the CPUs that are already online, often do it as shown
>> below:
>>
>> 	get_online_cpus();
>>
>> 	for_each_online_cpu(cpu)
>> 		init_cpu(cpu);
>>
>> 	register_cpu_notifier(&foobar_cpu_notifier);
>>
>> 	put_online_cpus();
>>
>> This is wrong, since it is prone to ABBA deadlocks involving the
>> cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
>> with CPU hotplug operations).
>>
>> Instead, the correct and race-free way of performing the callback
>> registration is:
>>
>> 	cpu_notifier_register_begin();
>>
>> 	for_each_online_cpu(cpu)
>> 		init_cpu(cpu);
>>
>> 	/* Note the use of the double underscored version of the API */
>> 	__register_cpu_notifier(&foobar_cpu_notifier);
>>
>> 	cpu_notifier_register_done();
>>
>>
>> Fix the kvm code in arm by using this latter form of callback registration.
>>
>> Cc: Christoffer Dall <christoffer.dall@linaro.org>
>> Cc: Gleb Natapov <gleb@kernel.org>
>> Cc: Russell King <linux@arm.linux.org.uk>
>> Cc: Ingo Molnar <mingo@kernel.org>
>> Cc: kvmarm@lists.cs.columbia.edu
>> Cc: kvm@vger.kernel.org
>> Cc: linux-arm-kernel@lists.infradead.org
>> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
>> ---
>>
>>  arch/arm/kvm/arm.c |    7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
>> index bd18bb8..f0e50a0 100644
>> --- a/arch/arm/kvm/arm.c
>> +++ b/arch/arm/kvm/arm.c
>> @@ -1051,21 +1051,26 @@ int kvm_arch_init(void *opaque)
>>  		}
>>  	}
>>  
>> +	cpu_notifier_register_begin();
>> +
>>  	err = init_hyp_mode();
>>  	if (err)
>>  		goto out_err;
>>  
>> -	err = register_cpu_notifier(&hyp_init_cpu_nb);
>> +	err = __register_cpu_notifier(&hyp_init_cpu_nb);
>>  	if (err) {
>>  		kvm_err("Cannot register HYP init CPU notifier (%d)\n", err);
>>  		goto out_err;
>>  	}
>>  
>> +	cpu_notifier_register_done();
>> +
>>  	hyp_cpu_pm_init();
>>  
>>  	kvm_coproc_table_init();
>>  	return 0;
>>  out_err:
>> +	cpu_notifier_register_done();
>>  	return err;
>>  }
>>  
>>
> 
> Just so we're clear, the existing code was simply racy, not prone to
> deadlocks, right?
> 
> This makes it clear that the test above for compatible CPUs can be quite
> easily evaded by using CPU hotplug, but we don't really have a good
> solution for handling that yet...  Hmmm, grumble grumble, I guess if you
> hotplug unsupported CPUs on a KVM/ARM system for now, stuff will break.
> 

In this particular case, there was no deadlock possibility, rather the
existing code had insufficient synchronization against CPU hotplug.

init_hyp_mode() would invoke cpu_init_hyp_mode() on currently online CPUs
using on_each_cpu(). If a CPU came online after this point and before calling
register_cpu_notifier(), that CPU would remain uninitialized because this
subsystem would miss the hot-online event. This patch fixes this bug and
also uses the new synchronization method (instead of get/put_online_cpus())
to ensure that we don't deadlock with CPU hotplug.
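
The missed hot-online window can be modelled in userspace. This is only a
sketch under stated assumptions: a single pthread mutex stands in for the
lock taken by cpu_notifier_register_begin(), one function pointer stands in
for the notifier chain, and every name below (hotplug_lock, model_cpu_up,
subsystem_init, ...) is hypothetical. Initializing the online CPUs and
registering the callback under the same lock means a CPU cannot come online
in between the two steps.

```c
#include <pthread.h>
#include <stdbool.h>

#define NCPUS 4

/* Stand-in for the lock held between register_begin() and register_done(). */
static pthread_mutex_t hotplug_lock = PTHREAD_MUTEX_INITIALIZER;
static bool cpu_online[NCPUS] = { true, true, false, false };
static bool cpu_ready[NCPUS];
static void (*online_notifier)(int cpu);   /* one callback is enough here */

static void init_cpu(int cpu)
{
	cpu_ready[cpu] = true;
}

/* Hotplug side: bring a CPU up and notify whoever is registered. */
static void model_cpu_up(int cpu)
{
	pthread_mutex_lock(&hotplug_lock);
	cpu_online[cpu] = true;
	if (online_notifier)
		online_notifier(cpu);
	pthread_mutex_unlock(&hotplug_lock);
}

/*
 * Subsystem side: initialize the CPUs that are already online and register
 * the callback without dropping the lock in between, so no hot-online event
 * can slip through the gap.
 */
static void subsystem_init(void)
{
	int cpu;

	pthread_mutex_lock(&hotplug_lock);
	for (cpu = 0; cpu < NCPUS; cpu++)
		if (cpu_online[cpu])
			init_cpu(cpu);
	online_notifier = init_cpu;
	pthread_mutex_unlock(&hotplug_lock);
}
```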

> In any case:
> Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
> 

Thanks a lot!

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* [PATCH 20/20] powerpc/perf: Fix handling of L3 events with bank == 1
From: Michael Ellerman @ 2014-03-14  5:00 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: cody, khandual
In-Reply-To: <1394773245-18328-1-git-send-email-mpe@ellerman.id.au>

Currently we reject events which have the L3 bank == 1, such as
0x000084918F, because the cache field is non-zero.

However that is incorrect, because although the bank is non-zero, the
value we would write into MMCRC is zero, and so we can count the event.

So fix the check to ignore the bank selector when checking whether the
cache selector is non-zero.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
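
A small standalone sketch of the check, for illustration only. It assumes
the cache selector is a 4-bit field at bits 20-23 of the event code (the
shift and mask names below are made up for this example) and, per the
comment added by the patch, that bit 3 of that field is the bank selector.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: 4-bit cache selector at bits 20-23 of the event code. */
#define CACHE_SEL_SHIFT	20
#define CACHE_SEL_MASK	0xfULL

static unsigned int cache_sel(uint64_t event)
{
	return (unsigned int)((event >> CACHE_SEL_SHIFT) & CACHE_SEL_MASK);
}

/* Old check: any non-zero selector, including bank == 1, is rejected. */
static bool l3_event_ok_old(uint64_t event)
{
	return cache_sel(event) == 0;
}

/* New check: ignore the bank bit (bit 3), reject only if bits 0-2 are set. */
static bool l3_event_ok_new(uint64_t event)
{
	return (cache_sel(event) & 0x7) == 0;
}
```

Under these assumptions the event 0x000084918F has a selector of 0x8, so
the old check rejects it and the new one accepts it.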
 arch/powerpc/perf/power8-pmu.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c
index 3ad363d..fe2763b 100644
--- a/arch/powerpc/perf/power8-pmu.c
+++ b/arch/powerpc/perf/power8-pmu.c
@@ -325,9 +325,10 @@ static int power8_get_constraint(u64 event, unsigned long *maskp, unsigned long
 		 * HV writable, and there is no API for guest kernels to modify
 		 * it. The solution is for the hypervisor to initialise the
 		 * field to zeroes, and for us to only ever allow events that
-		 * have a cache selector of zero.
+		 * have a cache selector of zero. The bank selector (bit 3) is
+		 * irrelevant, as long as the rest of the value is 0.
 		 */
-		if (cache)
+		if (cache & 0x7)
 			return -1;
 
 	} else if (event & EVENT_IS_L1) {
-- 
1.8.3.2

^ permalink raw reply related

* [PATCH 19/20] powerpc/perf/hv_{gpci, 24x7}: Add documentation of device attributes
From: Michael Ellerman @ 2014-03-14  5:00 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: cody, khandual
In-Reply-To: <1394773245-18328-1-git-send-email-mpe@ellerman.id.au>

From: Cody P Schafer <cody@linux.vnet.ibm.com>

gpci and 24x7 expose some device specific attributes. Add some
documentation for them.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 .../testing/sysfs-bus-event_source-devices-hv_24x7 | 23 ++++++++++++
 .../testing/sysfs-bus-event_source-devices-hv_gpci | 43 ++++++++++++++++++++++
 2 files changed, 66 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
 create mode 100644 Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci

diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7 b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
new file mode 100644
index 0000000..e78ee79
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
@@ -0,0 +1,23 @@
+What:		/sys/bus/event_source/devices/hv_24x7/interface/catalog
+Date:		February 2014
+Contact:	Cody P Schafer <cody@linux.vnet.ibm.com>
+Description:
+		Provides access to the binary "24x7 catalog" provided by the
+		hypervisor on POWER7 and 8 systems. This catalog lists events
+		available from the powerpc "hv_24x7" pmu. Its format is
+		documented here:
+		https://raw.githubusercontent.com/jmesmon/catalog-24x7/master/hv-24x7-catalog.h
+
+What:		/sys/bus/event_source/devices/hv_24x7/interface/catalog_length
+Date:		February 2014
+Contact:	Cody P Schafer <cody@linux.vnet.ibm.com>
+Description:
+		A number equal to the length in bytes of the catalog. This is
+		also extractable from the provided binary "catalog" sysfs entry.
+
+What:		/sys/bus/event_source/devices/hv_24x7/interface/catalog_version
+Date:		February 2014
+Contact:	Cody P Schafer <cody@linux.vnet.ibm.com>
+Description:
+		Exposes the "version" field of the 24x7 catalog. This is also
+		extractable from the provided binary "catalog" sysfs entry.
diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
new file mode 100644
index 0000000..3fa58c2
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
@@ -0,0 +1,43 @@
+What:		/sys/bus/event_source/devices/hv_gpci/interface/collect_privileged
+Date:		February 2014
+Contact:	Cody P Schafer <cody@linux.vnet.ibm.com>
+Description:
+		'0' if the hypervisor is configured to forbid access to event
+		counters being accumulated by other guests and to physical
+		domain event counters.
+		'1' if that access is allowed.
+
+What:		/sys/bus/event_source/devices/hv_gpci/interface/ga
+Date:		February 2014
+Contact:	Cody P Schafer <cody@linux.vnet.ibm.com>
+Description:
+		0 or 1. Indicates whether we have access to "GA" events (listed
+		in arch/powerpc/perf/hv-gpci.h).
+
+What:		/sys/bus/event_source/devices/hv_gpci/interface/expanded
+Date:		February 2014
+Contact:	Cody P Schafer <cody@linux.vnet.ibm.com>
+Description:
+		0 or 1. Indicates whether we have access to "EXPANDED" events (listed
+		in arch/powerpc/perf/hv-gpci.h).
+
+What:		/sys/bus/event_source/devices/hv_gpci/interface/lab
+Date:		February 2014
+Contact:	Cody P Schafer <cody@linux.vnet.ibm.com>
+Description:
+		0 or 1. Indicates whether we have access to "LAB" events (listed
+		in arch/powerpc/perf/hv-gpci.h).
+
+What:		/sys/bus/event_source/devices/hv_gpci/interface/version
+Date:		February 2014
+Contact:	Cody P Schafer <cody@linux.vnet.ibm.com>
+Description:
+		A number indicating the version of the gpci interface that the
+		hypervisor reports supporting.
+
+What:		/sys/bus/event_source/devices/hv_gpci/interface/kernel_version
+Date:		February 2014
+Contact:	Cody P Schafer <cody@linux.vnet.ibm.com>
+Description:
+		A number indicating the latest version of the gpci interface
+		that the kernel is aware of.
-- 
1.8.3.2

^ permalink raw reply related

* [PATCH 18/20] powerpc/perf: Add kconfig option for hypervisor provided counters
From: Michael Ellerman @ 2014-03-14  5:00 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: cody, khandual
In-Reply-To: <1394773245-18328-1-git-send-email-mpe@ellerman.id.au>

From: Cody P Schafer <cody@linux.vnet.ibm.com>

This commit adds a Kconfig option which allows the hv_gpci and hv_24x7
PMUs, added in the preceding commits, to be built.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/perf/Makefile             |  2 ++
 arch/powerpc/platforms/pseries/Kconfig | 12 ++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
index 60d71ee..f9c083a 100644
--- a/arch/powerpc/perf/Makefile
+++ b/arch/powerpc/perf/Makefile
@@ -11,5 +11,7 @@ obj32-$(CONFIG_PPC_PERF_CTRS)	+= mpc7450-pmu.o
 obj-$(CONFIG_FSL_EMB_PERF_EVENT) += core-fsl-emb.o
 obj-$(CONFIG_FSL_EMB_PERF_EVENT_E500) += e500-pmu.o e6500-pmu.o
 
+obj-$(CONFIG_HV_PERF_CTRS) += hv-24x7.o hv-gpci.o hv-common.o
+
 obj-$(CONFIG_PPC64)		+= $(obj64-y)
 obj-$(CONFIG_PPC32)		+= $(obj32-y)
diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
index 80b1d57..2cb8b77 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -111,6 +111,18 @@ config CMM
 	  will be reused for other LPARs. The interface allows firmware to
 	  balance memory across many LPARs.
 
+config HV_PERF_CTRS
+	bool "Hypervisor supplied PMU events (24x7 & GPCI)"
+	default y
+	depends on PERF_EVENTS && PPC_PSERIES
+	help
+	  Enable access to hypervisor supplied counters in perf. Currently,
+	  this enables code that uses the hcall GetPerfCounterInfo and 24x7
+	  interfaces to retrieve counters. GPCI exists on Power 6 and later
+	  systems. 24x7 is available on Power 8 systems.
+
+	  If unsure, select Y.
+
 config DTL
 	bool "Dispatch Trace Log"
 	depends on PPC_SPLPAR && DEBUG_FS
-- 
1.8.3.2

^ permalink raw reply related

* [PATCH 17/20] powerpc/perf: Add support for the hv 24x7 interface
From: Michael Ellerman @ 2014-03-14  5:00 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: cody, khandual
In-Reply-To: <1394773245-18328-1-git-send-email-mpe@ellerman.id.au>

From: Cody P Schafer <cody@linux.vnet.ibm.com>

This provides a basic interface between hv_24x7 and perf. Similar to
the one provided for gpci, it lacks transaction support and does not
list any events.

Example usage via perf tool:

	perf stat -e 'hv_24x7/domain=2,offset=8,starting_index=0,lpar=0xffffffff/' -r 0 -C 0 -x ' ' sleep 0.1

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
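
To make the event encoding concrete, here is a userspace sketch of how the
fields named in the example map onto the perf config words, following the
bit ranges in the format definitions of this patch (domain: config 0-3,
starting_index: config 16-31, offset: config 32-63, lpar: config1 0-15). It
assumes bit 0 is the least significant bit. The helper names (field_get,
ev24x7_make, ev24x7_check) are hypothetical, and the checks only mirror the
reserved-bit, alignment and domain tests in the event init path. Since the
lpar field is 16 bits wide, the sketch is exercised with 0xffff rather than
the 0xffffffff shown in the example command.

```c
#include <stdint.h>

struct ev24x7 {
	uint64_t config, config1, config2;
};

/* Extract bits lo..hi (inclusive, bit 0 = least significant). */
static uint64_t field_get(uint64_t v, unsigned int lo, unsigned int hi)
{
	unsigned int width = hi - lo + 1;
	uint64_t mask = width >= 64 ? ~0ULL : ((1ULL << width) - 1);

	return (v >> lo) & mask;
}

/* Pack the four user-visible fields into the config words. */
static struct ev24x7 ev24x7_make(unsigned int domain, uint16_t starting_index,
				 uint32_t offset, uint16_t lpar)
{
	struct ev24x7 e = { 0, 0, 0 };

	e.config  = (uint64_t)(domain & 0xf);
	e.config |= (uint64_t)starting_index << 16;
	e.config |= (uint64_t)offset << 32;
	e.config1 = lpar;
	return e;
}

/* Return 0 if the event passes the static checks, -1 otherwise. */
static int ev24x7_check(const struct ev24x7 *e)
{
	/* Reserved areas must be zero. */
	if (field_get(e->config, 4, 15) ||
	    field_get(e->config1, 16, 63) ||
	    e->config2)
		return -1;

	/* The offset must be 8-byte aligned. */
	if (field_get(e->config, 32, 63) % 8)
		return -1;

	/* Domains above 6 are invalid. */
	if (field_get(e->config, 0, 3) > 6)
		return -1;

	return 0;
}
```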
 arch/powerpc/perf/hv-24x7.c | 510 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 510 insertions(+)
 create mode 100644 arch/powerpc/perf/hv-24x7.c
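
The windowed copy described in the read_offset_data() comment can be
sketched in userspace as follows. This is an illustration, not the patch's
code: the function name copy_window is hypothetical, and it follows the
diagram's definitions (v = requested_offset + dest_len, x = source_offset),
so the end of the request is taken relative to source_offset. It also
returns 0 when the request starts before the source window, a case the
diagram does not cover.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the part of [requested_offset, requested_offset + dest_len) that
 * overlaps the source window [source_offset, source_offset + src_len).
 * Returns the number of bytes copied, or -1 on a negative offset.
 */
static long copy_window(void *dest, size_t dest_len, long long requested_offset,
			const void *src, size_t src_len, long long source_offset)
{
	unsigned long long start;   /* w - x: request start inside src */
	size_t avail, n;

	if (requested_offset < 0 || source_offset < 0)
		return -1;

	/* Request begins before the window: nothing to copy in this sketch. */
	if (requested_offset < source_offset)
		return 0;

	start = (unsigned long long)(requested_offset - source_offset);
	if (start >= src_len)
		return 0;               /* request begins past the window */

	avail = src_len - (size_t)start;         /* bytes left in the window */
	n = dest_len < avail ? dest_len : avail; /* min(v, z) - w */

	memcpy(dest, (const char *)src + start, n);
	return (long)n;
}
```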

diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
new file mode 100644
index 0000000..297c9105
--- /dev/null
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -0,0 +1,510 @@
+/*
+ * Hypervisor supplied "24x7" performance counter support
+ *
+ * Author: Cody P Schafer <cody@linux.vnet.ibm.com>
+ * Copyright 2014 IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define pr_fmt(fmt) "hv-24x7: " fmt
+
+#include <linux/perf_event.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <asm/firmware.h>
+#include <asm/hvcall.h>
+#include <asm/io.h>
+
+#include "hv-24x7.h"
+#include "hv-24x7-catalog.h"
+#include "hv-common.h"
+
+/*
+ * TODO: Merging events:
+ * - Think of the hcall as an interface to a 4d array of counters:
+ *   - x = domains
+ *   - y = indexes in the domain (core, chip, vcpu, node, etc)
+ *   - z = offset into the counter space
+ *   - w = lpars (guest vms, "logical partitions")
+ * - A single request is: x,y,y_last,z,z_last,w,w_last
+ *   - this means we can retrieve a rectangle of counters in y,z for a single x.
+ *
+ * - Things to consider (ignoring w):
+ *   - input  cost_per_request = 16
+ *   - output cost_per_result(ys,zs)  = 8 + 8 * ys + ys * zs
+ *   - limited number of requests per hcall (must fit into 4K bytes)
+ *     - 4k = 16 [buffer header] + 16 [request size] * request_count
+ *     - 255 requests per hcall
+ *   - sometimes it will be more efficient to read extra data and discard
+ */
+
+/*
+ * Example usage:
+ *  perf stat -e 'hv_24x7/domain=2,offset=8,starting_index=0,lpar=0xffffffff/'
+ */
+
+/* u3 0-6, one of HV_24X7_PERF_DOMAIN */
+EVENT_DEFINE_RANGE_FORMAT(domain, config, 0, 3);
+/* u16 */
+EVENT_DEFINE_RANGE_FORMAT(starting_index, config, 16, 31);
+/* u32, see "data_offset" */
+EVENT_DEFINE_RANGE_FORMAT(offset, config, 32, 63);
+/* u16 */
+EVENT_DEFINE_RANGE_FORMAT(lpar, config1, 0, 15);
+
+EVENT_DEFINE_RANGE(reserved1, config,   4, 15);
+EVENT_DEFINE_RANGE(reserved2, config1, 16, 63);
+EVENT_DEFINE_RANGE(reserved3, config2,  0, 63);
+
+static struct attribute *format_attrs[] = {
+	&format_attr_domain.attr,
+	&format_attr_offset.attr,
+	&format_attr_starting_index.attr,
+	&format_attr_lpar.attr,
+	NULL,
+};
+
+static struct attribute_group format_group = {
+	.name = "format",
+	.attrs = format_attrs,
+};
+
+static struct kmem_cache *hv_page_cache;
+
+/*
+ * read_offset_data - copy data from one buffer to another while treating the
+ *                    source buffer as a small view on the total available
+ *                    source data.
+ *
+ * @dest: buffer to copy into
+ * @dest_len: length of @dest in bytes
+ * @requested_offset: the offset within the source data we want. Must be >= 0
+ * @src: buffer to copy data from
+ * @src_len: length of @src in bytes
+ * @source_offset: the offset in the source data that (src,src_len) refers to.
+ *                 Must be >= 0
+ *
+ * returns the number of bytes copied.
+ *
+ * The following ascii art shows the various buffer positioning we need to
+ * handle, assigns some arbitrary variables to points on the buffer, and then
+ * shows how we fiddle with those values to get things we care about (copy
+ * start in src and copy len)
+ *
+ * s = @src buffer
+ * d = @dest buffer
+ * '.' areas in d are written to.
+ *
+ *                       u
+ *   x         w	 v  z
+ * d           |.........|
+ * s |----------------------|
+ *
+ *                      u
+ *   x         w	z     v
+ * d           |........------|
+ * s |------------------|
+ *
+ *   x         w        u,z,v
+ * d           |........|
+ * s |------------------|
+ *
+ *   x,w                u,v,z
+ * d |..................|
+ * s |------------------|
+ *
+ *   x        u
+ *   w        v		z
+ * d |........|
+ * s |------------------|
+ *
+ *   x      z   w      v
+ * d            |------|
+ * s |------|
+ *
+ * x = source_offset
+ * w = requested_offset
+ * z = source_offset + src_len
+ * v = requested_offset + dest_len
+ *
+ * w_offset_in_s = w - x = requested_offset - source_offset
+ * z_offset_in_s = z - x = src_len
+ * v_offset_in_s = v - x = requested_offset + dest_len - source_offset
+ */
+static ssize_t read_offset_data(void *dest, size_t dest_len,
+				loff_t requested_offset, void *src,
+				size_t src_len, loff_t source_offset)
+{
+	size_t w_offset_in_s = requested_offset - source_offset;
+	size_t z_offset_in_s = src_len;
+	size_t v_offset_in_s = requested_offset + dest_len - source_offset;
+	size_t u_offset_in_s = min(z_offset_in_s, v_offset_in_s);
+	size_t copy_len = u_offset_in_s - w_offset_in_s;
+
+	if (requested_offset < 0 || source_offset < 0)
+		return -EINVAL;
+
+	if (z_offset_in_s <= w_offset_in_s)
+		return 0;
+
+	memcpy(dest, src + w_offset_in_s, copy_len);
+	return copy_len;
+}
+
+static unsigned long h_get_24x7_catalog_page(char page[static 4096],
+					     u32 version, u32 index)
+{
+	WARN_ON(!IS_ALIGNED((unsigned long)page, 4096));
+	return plpar_hcall_norets(H_GET_24X7_CATALOG_PAGE,
+			virt_to_phys(page),
+			version,
+			index);
+}
+
+static ssize_t catalog_read(struct file *filp, struct kobject *kobj,
+			    struct bin_attribute *bin_attr, char *buf,
+			    loff_t offset, size_t count)
+{
+	unsigned long hret;
+	ssize_t ret = 0;
+	size_t catalog_len = 0, catalog_page_len = 0, page_count = 0;
+	loff_t page_offset = 0;
+	uint32_t catalog_version_num = 0;
+	void *page = kmem_cache_alloc(hv_page_cache, GFP_USER);
+	struct hv_24x7_catalog_page_0 *page_0 = page;
+	if (!page)
+		return -ENOMEM;
+
+	hret = h_get_24x7_catalog_page(page, 0, 0);
+	if (hret) {
+		ret = -EIO;
+		goto e_free;
+	}
+
+	catalog_version_num = be32_to_cpu(page_0->version);
+	catalog_page_len = be32_to_cpu(page_0->length);
+	catalog_len = catalog_page_len * 4096;
+
+	page_offset = offset / 4096;
+	page_count  = count  / 4096;
+
+	if (page_offset >= catalog_page_len)
+		goto e_free;
+
+	if (page_offset != 0) {
+		hret = h_get_24x7_catalog_page(page, catalog_version_num,
+					       page_offset);
+		if (hret) {
+			ret = -EIO;
+			goto e_free;
+		}
+	}
+
+	ret = read_offset_data(buf, count, offset,
+				page, 4096, page_offset * 4096);
+e_free:
+	if (hret)
+		pr_err("h_get_24x7_catalog_page(ver=%d, page=%lld) failed: rc=%ld\n",
+				catalog_version_num, page_offset, hret);
+	kmem_cache_free(hv_page_cache, page);
+
+	pr_devel("catalog_read: offset=%lld(%lld) count=%zu(%zu) catalog_len=%zu(%zu) => %zd\n",
+			offset, page_offset, count, page_count, catalog_len,
+			catalog_page_len, ret);
+
+	return ret;
+}
+
+#define PAGE_0_ATTR(_name, _fmt, _expr)				\
+static ssize_t _name##_show(struct device *dev,			\
+			    struct device_attribute *dev_attr,	\
+			    char *buf)				\
+{								\
+	unsigned long hret;					\
+	ssize_t ret = 0;					\
+	void *page = kmem_cache_alloc(hv_page_cache, GFP_USER);	\
+	struct hv_24x7_catalog_page_0 *page_0 = page;		\
+	if (!page)						\
+		return -ENOMEM;					\
+	hret = h_get_24x7_catalog_page(page, 0, 0);		\
+	if (hret) {						\
+		ret = -EIO;					\
+		goto e_free;					\
+	}							\
+	ret = sprintf(buf, _fmt, _expr);			\
+e_free:								\
+	kmem_cache_free(hv_page_cache, page);			\
+	return ret;						\
+}								\
+static DEVICE_ATTR_RO(_name)
+
+PAGE_0_ATTR(catalog_version, "%lld\n",
+		(unsigned long long)be32_to_cpu(page_0->version));
+PAGE_0_ATTR(catalog_len, "%lld\n",
+		(unsigned long long)be32_to_cpu(page_0->length) * 4096);
+static BIN_ATTR_RO(catalog, 0/* real length varies */);
+
+static struct bin_attribute *if_bin_attrs[] = {
+	&bin_attr_catalog,
+	NULL,
+};
+
+static struct attribute *if_attrs[] = {
+	&dev_attr_catalog_len.attr,
+	&dev_attr_catalog_version.attr,
+	NULL,
+};
+
+static struct attribute_group if_group = {
+	.name = "interface",
+	.bin_attrs = if_bin_attrs,
+	.attrs = if_attrs,
+};
+
+static const struct attribute_group *attr_groups[] = {
+	&format_group,
+	&if_group,
+	NULL,
+};
+
+static bool is_physical_domain(int domain)
+{
+	return  domain == HV_24X7_PERF_DOMAIN_PHYSICAL_CHIP ||
+		domain == HV_24X7_PERF_DOMAIN_PHYSICAL_CORE;
+}
+
+static unsigned long single_24x7_request(u8 domain, u32 offset, u16 ix,
+					 u16 lpar, u64 *res,
+					 bool success_expected)
+{
+	unsigned long ret;
+
+	/*
+	 * request_buffer and result_buffer are not required to be 4k aligned,
+	 * but are not allowed to cross any 4k boundary. Aligning them to 4k is
+	 * the simplest way to ensure that.
+	 */
+	struct reqb {
+		struct hv_24x7_request_buffer buf;
+		struct hv_24x7_request req;
+	} __packed __aligned(4096) request_buffer = {
+		.buf = {
+			.interface_version = HV_24X7_IF_VERSION_CURRENT,
+			.num_requests = 1,
+		},
+		.req = {
+			.performance_domain = domain,
+			.data_size = cpu_to_be16(8),
+			.data_offset = cpu_to_be32(offset),
+			.starting_lpar_ix = cpu_to_be16(lpar),
+			.max_num_lpars = cpu_to_be16(1),
+			.starting_ix = cpu_to_be16(ix),
+			.max_ix = cpu_to_be16(1),
+		}
+	};
+
+	struct resb {
+		struct hv_24x7_data_result_buffer buf;
+		struct hv_24x7_result res;
+		struct hv_24x7_result_element elem;
+		__be64 result;
+	} __packed __aligned(4096) result_buffer = {};
+
+	ret = plpar_hcall_norets(H_GET_24X7_DATA,
+			virt_to_phys(&request_buffer), sizeof(request_buffer),
+			virt_to_phys(&result_buffer),  sizeof(result_buffer));
+
+	if (ret) {
+		if (success_expected)
+			pr_err_ratelimited("hcall failed: %d %#x %#x %d => 0x%lx (%ld) detail=0x%x failing ix=%x\n",
+					domain, offset, ix, lpar,
+					ret, ret,
+					result_buffer.buf.detailed_rc,
+					result_buffer.buf.failing_request_ix);
+		return ret;
+	}
+
+	*res = be64_to_cpu(result_buffer.result);
+	return ret;
+}
+
+static unsigned long event_24x7_request(struct perf_event *event, u64 *res,
+		bool success_expected)
+{
+	return single_24x7_request(event_get_domain(event),
+				event_get_offset(event),
+				event_get_starting_index(event),
+				event_get_lpar(event),
+				res,
+				success_expected);
+}
+
+static int h_24x7_event_init(struct perf_event *event)
+{
+	struct hv_perf_caps caps;
+	unsigned domain;
+	unsigned long hret;
+	u64 ct;
+
+	/* Not our event */
+	if (event->attr.type != event->pmu->type)
+		return -ENOENT;
+
+	/* Unused areas must be 0 */
+	if (event_get_reserved1(event) ||
+	    event_get_reserved2(event) ||
+	    event_get_reserved3(event)) {
+		pr_devel("reserved set when forbidden 0x%llx(0x%llx) 0x%llx(0x%llx) 0x%llx(0x%llx)\n",
+				event->attr.config,
+				event_get_reserved1(event),
+				event->attr.config1,
+				event_get_reserved2(event),
+				event->attr.config2,
+				event_get_reserved3(event));
+		return -EINVAL;
+	}
+
+	/* unsupported modes and filters */
+	if (event->attr.exclude_user   ||
+	    event->attr.exclude_kernel ||
+	    event->attr.exclude_hv     ||
+	    event->attr.exclude_idle   ||
+	    event->attr.exclude_host   ||
+	    event->attr.exclude_guest  ||
+	    is_sampling_event(event)) /* no sampling */
+		return -EINVAL;
+
+	/* no branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
+	/* offset must be 8 byte aligned */
+	if (event_get_offset(event) % 8) {
+		pr_devel("bad alignment\n");
+		return -EINVAL;
+	}
+
+	/* Domains above 6 are invalid */
+	domain = event_get_domain(event);
+	if (domain > 6) {
+		pr_devel("invalid domain %d\n", domain);
+		return -EINVAL;
+	}
+
+	hret = hv_perf_caps_get(&caps);
+	if (hret) {
+		pr_devel("could not get capabilities: rc=%ld\n", hret);
+		return -EIO;
+	}
+
+	/* PHYSICAL domains & other lpars require extra capabilities */
+	if (!caps.collect_privileged && (is_physical_domain(domain) ||
+		(event_get_lpar(event) != event_get_lpar_max()))) {
+		pr_devel("hv permissions disallow: is_physical_domain:%d, lpar=0x%llx\n",
+				is_physical_domain(domain),
+				event_get_lpar(event));
+		return -EACCES;
+	}
+
+	/* see if the event complains */
+	if (event_24x7_request(event, &ct, false)) {
+		pr_devel("test hcall failed\n");
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static u64 h_24x7_get_value(struct perf_event *event)
+{
+	unsigned long ret;
+	u64 ct;
+	ret = event_24x7_request(event, &ct, true);
+	if (ret)
+		/* We checked this in event init, shouldn't fail here... */
+		return 0;
+
+	return ct;
+}
+
+static void h_24x7_event_update(struct perf_event *event)
+{
+	s64 prev;
+	u64 now;
+	now = h_24x7_get_value(event);
+	prev = local64_xchg(&event->hw.prev_count, now);
+	local64_add(now - prev, &event->count);
+}
+
+static void h_24x7_event_start(struct perf_event *event, int flags)
+{
+	if (flags & PERF_EF_RELOAD)
+		local64_set(&event->hw.prev_count, h_24x7_get_value(event));
+}
+
+static void h_24x7_event_stop(struct perf_event *event, int flags)
+{
+	h_24x7_event_update(event);
+}
+
+static int h_24x7_event_add(struct perf_event *event, int flags)
+{
+	if (flags & PERF_EF_START)
+		h_24x7_event_start(event, flags);
+
+	return 0;
+}
+
+static int h_24x7_event_idx(struct perf_event *event)
+{
+	return 0;
+}
+
+static struct pmu h_24x7_pmu = {
+	.task_ctx_nr = perf_invalid_context,
+
+	.name = "hv_24x7",
+	.attr_groups = attr_groups,
+	.event_init  = h_24x7_event_init,
+	.add         = h_24x7_event_add,
+	.del         = h_24x7_event_stop,
+	.start       = h_24x7_event_start,
+	.stop        = h_24x7_event_stop,
+	.read        = h_24x7_event_update,
+	.event_idx   = h_24x7_event_idx,
+};
+
+static int hv_24x7_init(void)
+{
+	int r;
+	unsigned long hret;
+	struct hv_perf_caps caps;
+
+	if (!firmware_has_feature(FW_FEATURE_LPAR)) {
+		pr_info("not a virtualized system, not enabling\n");
+		return -ENODEV;
+	}
+
+	hret = hv_perf_caps_get(&caps);
+	if (hret) {
+		pr_info("could not obtain capabilities, error 0x%lx, not enabling\n",
+				hret);
+		return -ENODEV;
+	}
+
+	hv_page_cache = kmem_cache_create("hv-page-4096", 4096, 4096, 0, NULL);
+	if (!hv_page_cache)
+		return -ENOMEM;
+
+	r = perf_pmu_register(&h_24x7_pmu, h_24x7_pmu.name, -1);
+	if (r)
+		return r;
+
+	return 0;
+}
+
+device_initcall(hv_24x7_init);
-- 
1.8.3.2

^ permalink raw reply related

* [PATCH 16/20] powerpc/perf: Add support for the hv gpci (get performance counter info) interface
From: Michael Ellerman @ 2014-03-14  5:00 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: cody, khandual
In-Reply-To: <1394773245-18328-1-git-send-email-mpe@ellerman.id.au>

From: Cody P Schafer <cody@linux.vnet.ibm.com>

This provides a basic link between perf and hv_gpci. Notably, it does
not yet support transactions and does not list any events (they can
still be manually composed).

Example usage via perf tool:

	perf stat -e 'hv_gpci/counter_info_version=3,offset=0,length=8,secondary_index=0,starting_index=0xffffffff,request=0x10/' -r 0 -C 0 -x ' ' sleep 0.1

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
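
For reference, a userspace sketch of how the fields in the example command
would be packed into the two config words, following the bit ranges in the
format definitions of this patch (request: config 0-31, starting_index:
config 32-63, secondary_index: config1 0-15, counter_info_version: config1
16-23, length: config1 24-31, offset: config1 32-63). It assumes bit 0 is
the least significant bit. The function name gpci_make is hypothetical, and
the length check only reflects the "bytes of data (1-8)" note in the source.

```c
#include <stdint.h>

/*
 * Pack hv_gpci event fields into perf config words.
 * Returns 0 on success, -1 if length is outside 1..8.
 */
static int gpci_make(uint32_t request, uint32_t starting_index,
		     uint16_t secondary_index, uint8_t counter_info_version,
		     uint8_t length, uint32_t offset,
		     uint64_t *config, uint64_t *config1)
{
	if (length < 1 || length > 8)
		return -1;

	*config  = (uint64_t)request;
	*config |= (uint64_t)starting_index << 32;

	*config1  = (uint64_t)secondary_index;
	*config1 |= (uint64_t)counter_info_version << 16;
	*config1 |= (uint64_t)length << 24;
	*config1 |= (uint64_t)offset << 32;
	return 0;
}
```

With the values from the example (request=0x10, starting_index=0xffffffff,
secondary_index=0, counter_info_version=3, length=8, offset=0) this gives
config = 0xffffffff00000010 and config1 = 0x08030000.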
 arch/powerpc/perf/hv-gpci.c | 294 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 294 insertions(+)
 create mode 100644 arch/powerpc/perf/hv-gpci.c

diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c
new file mode 100644
index 0000000..278ba7b
--- /dev/null
+++ b/arch/powerpc/perf/hv-gpci.c
@@ -0,0 +1,294 @@
+/*
+ * Hypervisor supplied "gpci" ("get performance counter info") performance
+ * counter support
+ *
+ * Author: Cody P Schafer <cody@linux.vnet.ibm.com>
+ * Copyright 2014 IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define pr_fmt(fmt) "hv-gpci: " fmt
+
+#include <linux/init.h>
+#include <linux/perf_event.h>
+#include <asm/firmware.h>
+#include <asm/hvcall.h>
+#include <asm/io.h>
+
+#include "hv-gpci.h"
+#include "hv-common.h"
+
+/*
+ * Example usage:
+ *  perf stat -e 'hv_gpci/counter_info_version=3,offset=0,length=8,
+ *		  secondary_index=0,starting_index=0xffffffff,request=0x10/' ...
+ */
+
+/* u32 */
+EVENT_DEFINE_RANGE_FORMAT(request, config, 0, 31);
+/* u32 */
+EVENT_DEFINE_RANGE_FORMAT(starting_index, config, 32, 63);
+/* u16 */
+EVENT_DEFINE_RANGE_FORMAT(secondary_index, config1, 0, 15);
+/* u8 */
+EVENT_DEFINE_RANGE_FORMAT(counter_info_version, config1, 16, 23);
+/* u8, bytes of data (1-8) */
+EVENT_DEFINE_RANGE_FORMAT(length, config1, 24, 31);
+/* u32, byte offset */
+EVENT_DEFINE_RANGE_FORMAT(offset, config1, 32, 63);
+
+static struct attribute *format_attrs[] = {
+	&format_attr_request.attr,
+	&format_attr_starting_index.attr,
+	&format_attr_secondary_index.attr,
+	&format_attr_counter_info_version.attr,
+
+	&format_attr_offset.attr,
+	&format_attr_length.attr,
+	NULL,
+};
+
+static struct attribute_group format_group = {
+	.name = "format",
+	.attrs = format_attrs,
+};
+
+#define HV_CAPS_ATTR(_name, _format)				\
+static ssize_t _name##_show(struct device *dev,			\
+			    struct device_attribute *attr,	\
+			    char *page)				\
+{								\
+	struct hv_perf_caps caps;				\
+	unsigned long hret = hv_perf_caps_get(&caps);		\
+	if (hret)						\
+		return -EIO;					\
+								\
+	return sprintf(page, _format, caps._name);		\
+}								\
+static struct device_attribute hv_caps_attr_##_name = __ATTR_RO(_name)
+
+static ssize_t kernel_version_show(struct device *dev,
+				   struct device_attribute *attr,
+				   char *page)
+{
+	return sprintf(page, "0x%x\n", COUNTER_INFO_VERSION_CURRENT);
+}
+
+DEVICE_ATTR_RO(kernel_version);
+HV_CAPS_ATTR(version, "0x%x\n");
+HV_CAPS_ATTR(ga, "%d\n");
+HV_CAPS_ATTR(expanded, "%d\n");
+HV_CAPS_ATTR(lab, "%d\n");
+HV_CAPS_ATTR(collect_privileged, "%d\n");
+
+static struct attribute *interface_attrs[] = {
+	&dev_attr_kernel_version.attr,
+	&hv_caps_attr_version.attr,
+	&hv_caps_attr_ga.attr,
+	&hv_caps_attr_expanded.attr,
+	&hv_caps_attr_lab.attr,
+	&hv_caps_attr_collect_privileged.attr,
+	NULL,
+};
+
+static struct attribute_group interface_group = {
+	.name = "interface",
+	.attrs = interface_attrs,
+};
+
+static const struct attribute_group *attr_groups[] = {
+	&format_group,
+	&interface_group,
+	NULL,
+};
+
+#define GPCI_MAX_DATA_BYTES \
+	(1024 - sizeof(struct hv_get_perf_counter_info_params))
+
+static unsigned long single_gpci_request(u32 req, u32 starting_index,
+		u16 secondary_index, u8 version_in, u32 offset, u8 length,
+		u64 *value)
+{
+	unsigned long ret;
+	size_t i;
+	u64 count;
+
+	struct {
+		struct hv_get_perf_counter_info_params params;
+		uint8_t bytes[GPCI_MAX_DATA_BYTES];
+	} __packed __aligned(sizeof(uint64_t)) arg = {
+		.params = {
+			.counter_request = cpu_to_be32(req),
+			.starting_index = cpu_to_be32(starting_index),
+			.secondary_index = cpu_to_be16(secondary_index),
+			.counter_info_version_in = version_in,
+		}
+	};
+
+	ret = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO,
+			virt_to_phys(&arg), sizeof(arg));
+	if (ret) {
+		pr_devel("hcall failed: 0x%lx\n", ret);
+		return ret;
+	}
+
+	/*
+	 * we verify offset and length are within the zeroed buffer at event
+	 * init.
+	 */
+	count = 0;
+	for (i = offset; i < offset + length; i++)
+		count = (count << 8) | arg.bytes[i];
+
+	*value = count;
+	return ret;
+}
+
+static u64 h_gpci_get_value(struct perf_event *event)
+{
+	u64 count;
+	unsigned long ret = single_gpci_request(event_get_request(event),
+					event_get_starting_index(event),
+					event_get_secondary_index(event),
+					event_get_counter_info_version(event),
+					event_get_offset(event),
+					event_get_length(event),
+					&count);
+	if (ret)
+		return 0;
+	return count;
+}
+
+static void h_gpci_event_update(struct perf_event *event)
+{
+	s64 prev;
+	u64 now = h_gpci_get_value(event);
+	prev = local64_xchg(&event->hw.prev_count, now);
+	local64_add(now - prev, &event->count);
+}
+
+static void h_gpci_event_start(struct perf_event *event, int flags)
+{
+	local64_set(&event->hw.prev_count, h_gpci_get_value(event));
+}
+
+static void h_gpci_event_stop(struct perf_event *event, int flags)
+{
+	h_gpci_event_update(event);
+}
+
+static int h_gpci_event_add(struct perf_event *event, int flags)
+{
+	if (flags & PERF_EF_START)
+		h_gpci_event_start(event, flags);
+
+	return 0;
+}
+
+static int h_gpci_event_init(struct perf_event *event)
+{
+	u64 count;
+	u8 length;
+
+	/* Not our event */
+	if (event->attr.type != event->pmu->type)
+		return -ENOENT;
+
+	/* config2 is unused */
+	if (event->attr.config2) {
+		pr_devel("config2 set when reserved\n");
+		return -EINVAL;
+	}
+
+	/* unsupported modes and filters */
+	if (event->attr.exclude_user   ||
+	    event->attr.exclude_kernel ||
+	    event->attr.exclude_hv     ||
+	    event->attr.exclude_idle   ||
+	    event->attr.exclude_host   ||
+	    event->attr.exclude_guest  ||
+	    is_sampling_event(event)) /* no sampling */
+		return -EINVAL;
+
+	/* no branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
+	length = event_get_length(event);
+	if (length < 1 || length > 8) {
+		pr_devel("length invalid\n");
+		return -EINVAL;
+	}
+
+	/* last byte within the buffer? */
+	if ((event_get_offset(event) + length) > GPCI_MAX_DATA_BYTES) {
+		pr_devel("request outside of buffer: %zu > %zu\n",
+				(size_t)event_get_offset(event) + length,
+				GPCI_MAX_DATA_BYTES);
+		return -EINVAL;
+	}
+
+	/* check if the request works... */
+	if (single_gpci_request(event_get_request(event),
+				event_get_starting_index(event),
+				event_get_secondary_index(event),
+				event_get_counter_info_version(event),
+				event_get_offset(event),
+				length,
+				&count)) {
+		pr_devel("gpci hcall failed\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int h_gpci_event_idx(struct perf_event *event)
+{
+	return 0;
+}
+
+static struct pmu h_gpci_pmu = {
+	.task_ctx_nr = perf_invalid_context,
+
+	.name = "hv_gpci",
+	.attr_groups = attr_groups,
+	.event_init  = h_gpci_event_init,
+	.add         = h_gpci_event_add,
+	.del         = h_gpci_event_stop,
+	.start       = h_gpci_event_start,
+	.stop        = h_gpci_event_stop,
+	.read        = h_gpci_event_update,
+	.event_idx   = h_gpci_event_idx,
+};
+
+static int hv_gpci_init(void)
+{
+	int r;
+	unsigned long hret;
+	struct hv_perf_caps caps;
+
+	if (!firmware_has_feature(FW_FEATURE_LPAR)) {
+		pr_info("not a virtualized system, not enabling\n");
+		return -ENODEV;
+	}
+
+	hret = hv_perf_caps_get(&caps);
+	if (hret) {
+		pr_info("could not obtain capabilities, error 0x%lx, not enabling\n",
+				hret);
+		return -ENODEV;
+	}
+
+	r = perf_pmu_register(&h_gpci_pmu, h_gpci_pmu.name, -1);
+	if (r)
+		return r;
+
+	return 0;
+}
+
+device_initcall(hv_gpci_init);
-- 
1.8.3.2

* [PATCH 15/20] powerpc/perf: Add macros for defining event fields & formats
From: Michael Ellerman @ 2014-03-14  5:00 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: cody, khandual
In-Reply-To: <1394773245-18328-1-git-send-email-mpe@ellerman.id.au>

From: Cody P Schafer <cody@linux.vnet.ibm.com>

Add two macros which generate functions to extract the relevant bits
from event->attr.config{,1,2}.

EVENT_DEFINE_RANGE() defines an accessor for a range of bits in the
event, as well as a "max" function that gives the maximum value of the
field based on the bit width.

EVENT_DEFINE_RANGE_FORMAT() defines the accessor & max routine and also
a format attribute for use in the PMU's attr_groups.
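
A minimal userspace sketch of what the generated accessor and "max"
routine compute, written as plain functions instead of macros. The names
range_max() and range_get() are hypothetical, and the run-time assert
stands in for the BUILD_BUG_ON() in the real macro:

```c
#include <assert.h>
#include <stdint.h>

/* Largest value that fits in the inclusive bit range [bit_start, bit_end]. */
static uint64_t range_max(unsigned int bit_start, unsigned int bit_end)
{
	assert(bit_start <= bit_end && bit_end < 64);
	/*
	 * A plain (1 << width) - 1 would shift by 64 for a full-width field,
	 * so build the mask from width - 1 bits and then extend it by one.
	 */
	return (((UINT64_C(1) << (bit_end - bit_start)) - 1) << 1) + 1;
}

/* Extract the field stored in bits [bit_start, bit_end] of word. */
static uint64_t range_get(uint64_t word, unsigned int bit_start,
			  unsigned int bit_end)
{
	return (word >> bit_start) & range_max(bit_start, bit_end);
}
```

For instance, range_get(0x1234567890ABCDEF, 16, 23) picks out the byte
0xAB, and range_max(0, 63) is all ones without shifting by the full word
width.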

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
[mpe: move to powerpc, ugly but descriptive macro names]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/perf/hv-common.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/arch/powerpc/perf/hv-common.h b/arch/powerpc/perf/hv-common.h
index 7e615bd..5d79cec 100644
--- a/arch/powerpc/perf/hv-common.h
+++ b/arch/powerpc/perf/hv-common.h
@@ -1,6 +1,7 @@
 #ifndef LINUX_POWERPC_PERF_HV_COMMON_H_
 #define LINUX_POWERPC_PERF_HV_COMMON_H_
 
+#include <linux/perf_event.h>
 #include <linux/types.h>
 
 struct hv_perf_caps {
@@ -14,4 +15,22 @@ struct hv_perf_caps {
 
 unsigned long hv_perf_caps_get(struct hv_perf_caps *caps);
 
+
+#define EVENT_DEFINE_RANGE_FORMAT(name, attr_var, bit_start, bit_end)	\
+PMU_FORMAT_ATTR(name, #attr_var ":" #bit_start "-" #bit_end);		\
+EVENT_DEFINE_RANGE(name, attr_var, bit_start, bit_end)
+
+#define EVENT_DEFINE_RANGE(name, attr_var, bit_start, bit_end)	\
+static u64 event_get_##name##_max(void)					\
+{									\
+	BUILD_BUG_ON((bit_start > bit_end)				\
+		    || (bit_end >= (sizeof(1ull) * 8)));		\
+	return (((1ull << (bit_end - bit_start)) - 1) << 1) + 1;	\
+}									\
+static u64 event_get_##name(struct perf_event *event)			\
+{									\
+	return (event->attr.attr_var >> (bit_start)) &			\
+		event_get_##name##_max();				\
+}
+
 #endif
-- 
1.8.3.2

* [PATCH 14/20] powerpc/perf: Add a shared interface to get gpci version and capabilities
From: Michael Ellerman @ 2014-03-14  5:00 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: cody, khandual
In-Reply-To: <1394773245-18328-1-git-send-email-mpe@ellerman.id.au>

From: Cody P Schafer <cody@linux.vnet.ibm.com>

This exposes a simple way to grab the firmware-provided
collect_privileged, ga, expanded, and lab capability bits. All of these
bits come in from the same gpci request, so we've exposed all of them.

Only the collect_privileged bit is really used by the hv-gpci/hv-24x7
code; the other bits are simply exposed in sysfs to inform the user.
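
For illustration only, the decoding done in hv_perf_caps_get() can be
sketched in userspace as below. The CV_CM_* bit positions are copied from
hv-gpci.h in this series; struct caps_flags and decode_caps() are
hypothetical stand-ins and not part of the patch:

```c
#include <stdint.h>

/* Bit positions copied from hv-gpci.h in this series. */
#define CV_CM_GA       (1 << 7)
#define CV_CM_EXPANDED (1 << 6)
#define CV_CM_LAB      (1 << 5)

/* Hypothetical mirror of the flag bits in struct hv_perf_caps. */
struct caps_flags {
	unsigned int collect_privileged:1,
		     ga:1,
		     expanded:1,
		     lab:1;
};

/* Normalize the two firmware-provided bytes into single-bit flags. */
static struct caps_flags decode_caps(uint8_t perf_collect_privileged,
				     uint8_t capability_mask)
{
	struct caps_flags f = {
		/* any non-zero value means "privileged" */
		.collect_privileged = !!perf_collect_privileged,
		.ga = !!(capability_mask & CV_CM_GA),
		.expanded = !!(capability_mask & CV_CM_EXPANDED),
		.lab = !!(capability_mask & CV_CM_LAB),
	};
	return f;
}
```

A capability_mask of 0xA0 would decode to ga=1, expanded=0, lab=1.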

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/perf/hv-common.c | 39 +++++++++++++++++++++++++++++++++++++++
 arch/powerpc/perf/hv-common.h | 17 +++++++++++++++++
 2 files changed, 56 insertions(+)
 create mode 100644 arch/powerpc/perf/hv-common.c
 create mode 100644 arch/powerpc/perf/hv-common.h

diff --git a/arch/powerpc/perf/hv-common.c b/arch/powerpc/perf/hv-common.c
new file mode 100644
index 0000000..47e02b3
--- /dev/null
+++ b/arch/powerpc/perf/hv-common.c
@@ -0,0 +1,39 @@
+#include <asm/io.h>
+#include <asm/hvcall.h>
+
+#include "hv-gpci.h"
+#include "hv-common.h"
+
+unsigned long hv_perf_caps_get(struct hv_perf_caps *caps)
+{
+	unsigned long r;
+	struct p {
+		struct hv_get_perf_counter_info_params params;
+		struct cv_system_performance_capabilities caps;
+	} __packed __aligned(sizeof(uint64_t));
+
+	struct p arg = {
+		.params = {
+			.counter_request = cpu_to_be32(
+					CIR_SYSTEM_PERFORMANCE_CAPABILITIES),
+			.starting_index = cpu_to_be32(-1),
+			.counter_info_version_in = 0,
+		}
+	};
+
+	r = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO,
+			       virt_to_phys(&arg), sizeof(arg));
+
+	if (r)
+		return r;
+
+	pr_devel("capability_mask: 0x%x\n", arg.caps.capability_mask);
+
+	caps->version = arg.params.counter_info_version_out;
+	caps->collect_privileged = !!arg.caps.perf_collect_privileged;
+	caps->ga = !!(arg.caps.capability_mask & CV_CM_GA);
+	caps->expanded = !!(arg.caps.capability_mask & CV_CM_EXPANDED);
+	caps->lab = !!(arg.caps.capability_mask & CV_CM_LAB);
+
+	return r;
+}
diff --git a/arch/powerpc/perf/hv-common.h b/arch/powerpc/perf/hv-common.h
new file mode 100644
index 0000000..7e615bd
--- /dev/null
+++ b/arch/powerpc/perf/hv-common.h
@@ -0,0 +1,17 @@
+#ifndef LINUX_POWERPC_PERF_HV_COMMON_H_
+#define LINUX_POWERPC_PERF_HV_COMMON_H_
+
+#include <linux/types.h>
+
+struct hv_perf_caps {
+	u16 version;
+	u16 collect_privileged:1,
+	    ga:1,
+	    expanded:1,
+	    lab:1,
+	    unused:12;
+};
+
+unsigned long hv_perf_caps_get(struct hv_perf_caps *caps);
+
+#endif
-- 
1.8.3.2

* [PATCH 13/20] powerpc/perf: Add 24x7 interface headers
From: Michael Ellerman @ 2014-03-14  5:00 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: cody, khandual
In-Reply-To: <1394773245-18328-1-git-send-email-mpe@ellerman.id.au>

From: Cody P Schafer <cody@linux.vnet.ibm.com>

24x7 (also called hv_24x7 or H_24X7) is an interface to obtain
performance counters from the hypervisor. These counters do not have a
fixed format/position and are instead documented in a "24x7 Catalog",
which is provided by the hypervisor (that interface is also documented
partially in the included hv-24x7-catalog.h and fully at
https://raw.githubusercontent.com/jmesmon/catalog-24x7/master/hv-24x7-catalog.h ).
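
As a small illustration of the catalog header added here, page 0 starts
with a big-endian magic word spelling "24x7". A consumer might sanity
check a retrieved page roughly like this; load_be32() and
catalog_page0_magic_ok() are hypothetical helpers for this sketch, not
functions from the patch:

```c
#include <stddef.h>
#include <stdint.h>

#define HV_24X7_CATALOG_MAGIC 0x32347837 /* "24x7" in ASCII */

/* Read a 32-bit big-endian value from a byte buffer. */
static uint32_t load_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 1 if the buffer is long enough and begins with the catalog magic. */
static int catalog_page0_magic_ok(const uint8_t *page, size_t len)
{
	return len >= 4 && load_be32(page) == HV_24X7_CATALOG_MAGIC;
}
```

The check is done byte by byte so it does not depend on the endianness of
the machine running it.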

The 24x7 data access is simply a copy operation into a 4-dimensional
array of 64-bit counters (from hypervisor to kernel memory). No
interrupt is triggered on overflow, and these counters are completely
disjoint from the typical power pmu.

This method of obtaining performance counters from the hypervisor is
intended to partially replace the gpci interface.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/perf/hv-24x7-catalog.h |  33 +++++++++++
 arch/powerpc/perf/hv-24x7.h         | 109 ++++++++++++++++++++++++++++++++++++
 2 files changed, 142 insertions(+)
 create mode 100644 arch/powerpc/perf/hv-24x7-catalog.h
 create mode 100644 arch/powerpc/perf/hv-24x7.h

diff --git a/arch/powerpc/perf/hv-24x7-catalog.h b/arch/powerpc/perf/hv-24x7-catalog.h
new file mode 100644
index 0000000..21b19dd
--- /dev/null
+++ b/arch/powerpc/perf/hv-24x7-catalog.h
@@ -0,0 +1,33 @@
+#ifndef LINUX_POWERPC_PERF_HV_24X7_CATALOG_H_
+#define LINUX_POWERPC_PERF_HV_24X7_CATALOG_H_
+
+#include <linux/types.h>
+
+/* From document "24x7 Event and Group Catalog Formats Proposal" v0.15 */
+
+struct hv_24x7_catalog_page_0 {
+#define HV_24X7_CATALOG_MAGIC 0x32347837 /* "24x7" in ASCII */
+	__be32 magic;
+	__be32 length; /* In 4096 byte pages */
+	__be64 version; /* XXX: arbitrary? what's the meaning/usage/purpose? */
+	__u8 build_time_stamp[16]; /* "YYYYMMDDHHMMSS\0\0" */
+	__u8 reserved2[32];
+	__be16 schema_data_offs; /* in 4096 byte pages */
+	__be16 schema_data_len;  /* in 4096 byte pages */
+	__be16 schema_entry_count;
+	__u8 reserved3[2];
+	__be16 event_data_offs;
+	__be16 event_data_len;
+	__be16 event_entry_count;
+	__u8 reserved4[2];
+	__be16 group_data_offs; /* in 4096 byte pages */
+	__be16 group_data_len;  /* in 4096 byte pages */
+	__be16 group_entry_count;
+	__u8 reserved5[2];
+	__be16 formula_data_offs; /* in 4096 byte pages */
+	__be16 formula_data_len;  /* in 4096 byte pages */
+	__be16 formula_entry_count;
+	__u8 reserved6[2];
+} __packed;
+
+#endif
diff --git a/arch/powerpc/perf/hv-24x7.h b/arch/powerpc/perf/hv-24x7.h
new file mode 100644
index 0000000..720ebce
--- /dev/null
+++ b/arch/powerpc/perf/hv-24x7.h
@@ -0,0 +1,109 @@
+#ifndef LINUX_POWERPC_PERF_HV_24X7_H_
+#define LINUX_POWERPC_PERF_HV_24X7_H_
+
+#include <linux/types.h>
+
+struct hv_24x7_request {
+	/* PHYSICAL domains require enabling via phyp/hmc. */
+#define HV_24X7_PERF_DOMAIN_PHYSICAL_CHIP 0x01
+#define HV_24X7_PERF_DOMAIN_PHYSICAL_CORE 0x02
+#define HV_24X7_PERF_DOMAIN_VIRTUAL_PROCESSOR_HOME_CORE   0x03
+#define HV_24X7_PERF_DOMAIN_VIRTUAL_PROCESSOR_HOME_CHIP   0x04
+#define HV_24X7_PERF_DOMAIN_VIRTUAL_PROCESSOR_HOME_NODE   0x05
+#define HV_24X7_PERF_DOMAIN_VIRTUAL_PROCESSOR_REMOTE_NODE 0x06
+	__u8 performance_domain;
+	__u8 reserved[0x1];
+
+	/* bytes to read starting at @data_offset. must be a multiple of 8 */
+	__be16 data_size;
+
+	/*
+	 * byte offset within the perf domain to read from. must be 8 byte
+	 * aligned
+	 */
+	__be32 data_offset;
+
+	/*
+	 * only valid for VIRTUAL_PROCESSOR domains, ignored for others.
+	 * -1 means "current partition only"
+	 *  Enabling via phyp/hmc required for non-"-1" values. 0 forbidden
+	 *  unless requestor is 0.
+	 */
+	__be16 starting_lpar_ix;
+
+	/*
+	 * Ignored when @starting_lpar_ix == -1
+	 * Ignored when @performance_domain is not VIRTUAL_PROCESSOR_*
+	 * -1 means "infinite" or all
+	 */
+	__be16 max_num_lpars;
+
+	/* chip, core, or virtual processor based on @performance_domain */
+	__be16 starting_ix;
+	__be16 max_ix;
+} __packed;
+
+struct hv_24x7_request_buffer {
+	/* 0 - ? */
+	/* 1 - ? */
+#define HV_24X7_IF_VERSION_CURRENT 0x01
+	__u8 interface_version;
+	__u8 num_requests;
+	__u8 reserved[0xE];
+	struct hv_24x7_request requests[];
+} __packed;
+
+struct hv_24x7_result_element {
+	__be16 lpar_ix;
+
+	/*
+	 * represents the core, chip, or virtual processor based on the
+	 * request's @performance_domain
+	 */
+	__be16 domain_ix;
+
+	/* -1 if @performance_domain does not refer to a virtual processor */
+	__be32 lpar_cfg_instance_id;
+
+	/* size = @result_element_data_size of containing result. */
+	__u8 element_data[];
+} __packed;
+
+struct hv_24x7_result {
+	__u8 result_ix;
+
+	/*
+	 * 0 = not all result elements fit into the buffer, additional requests
+	 *     required
+	 * 1 = all result elements were returned
+	 */
+	__u8 results_complete;
+	__be16 num_elements_returned;
+
+	/* This is a copy of @data_size from the corresponding hv_24x7_request */
+	__be16 result_element_data_size;
+	__u8 reserved[0x2];
+
+	/* WARNING: only valid for first result element due to variable sizes
+	 *          of result elements */
+	/* struct hv_24x7_result_element[@num_elements_returned] */
+	struct hv_24x7_result_element elements[];
+} __packed;
+
+struct hv_24x7_data_result_buffer {
+	/* See versioning for request buffer */
+	__u8 interface_version;
+
+	__u8 num_results;
+	__u8 reserved[0x1];
+	__u8 failing_request_ix;
+	__be32 detailed_rc;
+	__be64 cec_cfg_instance_id;
+	__be64 catalog_version_num;
+	__u8 reserved2[0x8];
+	/* WARNING: only valid for the first result due to variable sizes of
+	 *	    results */
+	struct hv_24x7_result results[]; /* [@num_results] */
+} __packed;
+
+#endif
-- 
1.8.3.2

* [PATCH 12/20] powerpc/perf: Add hv_gpci interface header
From: Michael Ellerman @ 2014-03-14  5:00 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: cody, khandual
In-Reply-To: <1394773245-18328-1-git-send-email-mpe@ellerman.id.au>

From: Cody P Schafer <cody@linux.vnet.ibm.com>

"H_GetPerformanceCounterInfo" (referred to as hv_gpci or just gpci from
here on) is an interface to retrieve specific performance counters and
other data from the hypervisor. All outputs have a fixed format. This
header only describes the portions of the interface that we plan on
using in Linux at this time.
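
A userspace sketch of the request header layout and of reading a value
out of the data area that follows it. It assumes the 1024-byte buffer
used by the hv-gpci caller in this series and assumes counter data is
big-endian like the header fields; struct gpci_params, gpci_range_ok()
and gpci_read_be() are hypothetical names, and the __be types are
replaced by plain fixed-width integers:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct hv_get_perf_counter_info_params (32 bytes packed). */
struct gpci_params {
	uint32_t counter_request;
	uint32_t starting_index;
	uint16_t secondary_index;
	uint16_t returned_values;
	uint32_t detail_rc;
	uint16_t cv_element_size;
	uint8_t counter_info_version_in;
	uint8_t counter_info_version_out;
	uint8_t reserved[0xC];
	uint8_t counter_value[];
} __attribute__((packed));

#define GPCI_BUF_BYTES 1024 /* assumed total hcall buffer size */
#define GPCI_MAX_DATA_BYTES (GPCI_BUF_BYTES - sizeof(struct gpci_params))

/* Returns 1 if length is 1-8 and [offset, offset + length) fits the data area. */
static int gpci_range_ok(uint32_t offset, uint8_t length)
{
	if (length < 1 || length > 8)
		return 0;
	/* widen before adding so a huge offset cannot wrap around */
	return (uint64_t)offset + length <= GPCI_MAX_DATA_BYTES;
}

/* Assemble a big-endian value of length bytes starting at offset. */
static uint64_t gpci_read_be(const uint8_t *data, uint32_t offset,
			     uint8_t length)
{
	uint64_t v = 0;

	for (uint8_t i = 0; i < length; i++)
		v = (v << 8) | data[offset + i];
	return v;
}
```

With this layout the header occupies 32 bytes, which would leave 992
bytes of counter data in a 1024-byte buffer.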

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/perf/hv-gpci.h | 73 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)
 create mode 100644 arch/powerpc/perf/hv-gpci.h

diff --git a/arch/powerpc/perf/hv-gpci.h b/arch/powerpc/perf/hv-gpci.h
new file mode 100644
index 0000000..b25f460
--- /dev/null
+++ b/arch/powerpc/perf/hv-gpci.h
@@ -0,0 +1,73 @@
+#ifndef LINUX_POWERPC_PERF_HV_GPCI_H_
+#define LINUX_POWERPC_PERF_HV_GPCI_H_
+
+#include <linux/types.h>
+
+/* From the document "H_GetPerformanceCounterInfo Interface" v1.07 */
+
+/* H_GET_PERF_COUNTER_INFO argument */
+struct hv_get_perf_counter_info_params {
+	__be32 counter_request; /* I */
+	__be32 starting_index;  /* IO */
+	__be16 secondary_index; /* IO */
+	__be16 returned_values; /* O */
+	__be32 detail_rc; /* O, only needed when called via *_norets() */
+
+	/*
+	 * O, size of each counter_value element in bytes, only set for
+	 * version >= 0x3
+	 */
+	__be16 cv_element_size;
+
+	/* I, 0 (zero) for versions < 0x3 */
+	__u8 counter_info_version_in;
+
+	/* O, 0 (zero) if version < 0x3. Must be set to 0 when making hcall */
+	__u8 counter_info_version_out;
+	__u8 reserved[0xC];
+	__u8 counter_value[];
+} __packed;
+
+/*
+ * counter info version => fw version/reference (spec version)
+ *
+ * 8 => power8 (1.07)
+ * [7 is skipped by spec 1.07]
+ * 6 => TLBIE (1.07)
+ * 5 => v7r7m0.phyp (1.05)
+ * [4 skipped]
+ * 3 => v7r6m0.phyp (?)
+ * [1,2 skipped]
+ * 0 => v7r{2,3,4}m0.phyp (?)
+ */
+#define COUNTER_INFO_VERSION_CURRENT 0x8
+
+/*
+ * These determine the counter_value[] layout and the meaning of starting_index
+ * and secondary_index.
+ *
+ * Unless otherwise noted, @secondary_index is unused and ignored.
+ */
+enum counter_info_requests {
+
+	/* GENERAL */
+
+	/* @starting_index: must be -1 (to refer to the current partition)
+	 */
+	CIR_SYSTEM_PERFORMANCE_CAPABILITIES = 0x40,
+};
+
+struct cv_system_performance_capabilities {
+	/* If != 0, allowed to collect data from other partitions */
+	__u8 perf_collect_privileged;
+
+	/* These following are only valid if counter_info_version >= 0x3 */
+#define CV_CM_GA       (1 << 7)
+#define CV_CM_EXPANDED (1 << 6)
+#define CV_CM_LAB      (1 << 5)
+	/* remaining bits are reserved */
+	__u8 capability_mask;
+	__u8 reserved[0xE];
+} __packed;
+
+#endif
-- 
1.8.3.2

* [PATCH 11/20] powerpc: Add hvcalls for 24x7 and gpci (Get Performance Counter Info)
From: Michael Ellerman @ 2014-03-14  5:00 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: cody, khandual
In-Reply-To: <1394773245-18328-1-git-send-email-mpe@ellerman.id.au>

From: Cody P Schafer <cody@linux.vnet.ibm.com>

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/include/asm/hvcall.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index d8b600b..5dbbb29 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -274,6 +274,11 @@
 /* Platform specific hcalls, used by KVM */
 #define H_RTAS			0xf000
 
+/* "Platform specific hcalls", provided by PHYP */
+#define H_GET_24X7_CATALOG_PAGE	0xF078
+#define H_GET_24X7_DATA		0xF07C
+#define H_GET_PERF_COUNTER_INFO	0xF080
+
 #ifndef __ASSEMBLY__
 
 /**
-- 
1.8.3.2

* [PATCH 10/20] sysfs: create bin_attributes under the requested group
From: Michael Ellerman @ 2014-03-14  5:00 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: cody, khandual
In-Reply-To: <1394773245-18328-1-git-send-email-mpe@ellerman.id.au>

From: Cody P Schafer <cody@linux.vnet.ibm.com>

bin_attributes created/updated in create_files() (such as those listed
via (struct device).attribute_groups) were not placed under the
specified group, and instead appeared in the base kobj directory.

Fix this by making bin_attributes use creation code similar to normal
attributes.

A quick grep shows that no one is using bin_attrs in a named attribute
group yet, so we can do this without breaking anything in userspace.

Note that I do not add is_visible() support to
bin_attributes, though that could be done as well.

This is a copy of the patch already merged in Greg's tree.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/sysfs/group.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index 6b57938..aa04068 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -70,8 +70,11 @@ static int create_files(struct kernfs_node *parent, struct kobject *kobj,
 	if (grp->bin_attrs) {
 		for (bin_attr = grp->bin_attrs; *bin_attr; bin_attr++) {
 			if (update)
-				sysfs_remove_bin_file(kobj, *bin_attr);
-			error = sysfs_create_bin_file(kobj, *bin_attr);
+				kernfs_remove_by_name(parent,
+						(*bin_attr)->attr.name);
+			error = sysfs_add_file_mode_ns(parent,
+					&(*bin_attr)->attr, true,
+					(*bin_attr)->attr.mode, NULL);
 			if (error)
 				break;
 		}
-- 
1.8.3.2
