LinuxPPC-Dev Archive on lore.kernel.org

* Re: [PATCH 3/8] powerpc: add hvcalls for 24x7 and gpci (get performance counter info)
From: Cody P Schafer @ 2014-02-03 21:21 UTC
  To: Michael Ellerman, Linux PPC
  Cc: Ingo Molnar, Paul Mackerras, Peter Zijlstra,
	Arnaldo Carvalho de Melo, LKML
In-Reply-To: <20140201055806.A25152C00B2@ozlabs.org>

On 01/31/2014 09:58 PM, Michael Ellerman wrote:
> On Thu, 2014-16-01 at 23:53:49 UTC, Cody P Schafer wrote:
>> Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
>> ---
>>   arch/powerpc/include/asm/hvcall.h | 6 +++++-
>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
>> index d8b600b..48d6efa 100644
>> --- a/arch/powerpc/include/asm/hvcall.h
>> +++ b/arch/powerpc/include/asm/hvcall.h
>> @@ -269,11 +269,15 @@
>>   #define H_COP			0x304
>>   #define H_GET_MPP_X		0x314
>>   #define H_SET_MODE		0x31C
>> -#define MAX_HCALL_OPCODE	H_SET_MODE
>> +#define H_GET_24X7_CATALOG_PAGE 0xF078
>> +#define H_GET_24X7_DATA		0xF07C
>> +#define H_GET_PERF_COUNTER_INFO 0xF080
>
> Ugh, why the hell did they put them up there.
>
>> +#define MAX_HCALL_OPCODE	H_GET_PERF_COUNTER_INFO
>
> We have an array which is sized based on this, which is unpleasant.
>
> I think you're better off putting these below in the platform specific section,
> and leaving MAX_HCALL_OPCODE alone. The only downside is you can't use the
> hcall tracing to see them.

Ya, I'm aware. I've got them up there as I did want to trace them :) . I 
don't see a big issue with moving them out of that section, though.
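
To put a number on the array concern raised above, here is a minimal sketch. It assumes the table sized from MAX_HCALL_OPCODE has one slot per opcode step of 4, i.e. (MAX_HCALL_OPCODE >> 2) + 1 entries. That sizing rule and the helper name hcall_table_entries() are assumptions made for illustration, not something stated in this thread.

```c
#include <stddef.h>

/* Opcode values quoted in the patch above. */
#define H_SET_MODE               0x31C
#define H_GET_PERF_COUNTER_INFO  0xF080

/*
 * Hypothetical helper: number of table slots needed when opcodes are
 * multiples of 4 and the table is indexed by (opcode >> 2).
 * The indexing rule is an assumption made for this sketch.
 */
static inline size_t hcall_table_entries(unsigned long max_opcode)
{
	return (max_opcode >> 2) + 1;
}
```

Under that assumption, hcall_table_entries(H_SET_MODE) is 200 while hcall_table_entries(H_GET_PERF_COUNTER_INFO) is 15393, so raising MAX_HCALL_OPCODE grows the table by roughly 77x for three new opcodes.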

>>   /* Platform specific hcalls, used by KVM */
>>   #define H_RTAS			0xf000
>


* Re: [PATCH 1/8] perf: add PMU_RANGE_ATTR() helper for use by sw-like pmus
From: Cody P Schafer @ 2014-02-03 21:19 UTC
  To: Michael Ellerman, Linux PPC
  Cc: Ingo Molnar, Paul Mackerras, Peter Zijlstra,
	Arnaldo Carvalho de Melo, LKML
In-Reply-To: <20140201055805.6FF982C00AF@ozlabs.org>

On 01/31/2014 09:58 PM, Michael Ellerman wrote:
> On Thu, 2014-16-01 at 23:53:47 UTC, Cody P Schafer wrote:
>> Add PMU_RANGE_ATTR() and PMU_RANGE_RESV() (for reserved areas) which
>> generate functions to extract the relevant bits from
>> event->attr.config{,1,2} for use by sw-like pmus where the
>> 'config{,1,2}' values don't map directly to hardware registers.
>
> This is neat.
>
> The split of the macros is a bit weird, i.e. PMU_RANGE_RESV() doesn't really do
> what its name suggests.
>
> I think you want one macro which creates the accessors, with a name that
> reflects that - yeah I can't think of a good one right now, but "event" should
> probably be in there because that's what it operates on.
>
> Having a macro for the reserved regions is good, but you MUST actually check
> that the reserved regions are zero. Otherwise you are permitting your caller to
> pass junk in there and you then can't unreserve them in a future version of
> the API.
>
> So I think a macro that gives you a special reserved region routine would be
> good, so you can write something like:
>
>    if (event_check_reserved1() || event_check_reserved2())
>    	return -EINVAL;
>

The way it's set up right now, RESV is just a hint to the user of the 
PMU_RANGE_ATTR() and PMU_RANGE_RESV() macros to indicate which to use. 
RESV simply avoids creating an attr format which would go unused only in 
the case where the range is a reserved one (and gcc would complain about 
it).

I don't like the "event_check_foo()" bit because that is actually
identical to "event_get_foo()"; I don't see a point in generating
differently named functions that do exactly the same thing.

The current user (hv-24x7.c) of PMU_RANGE_RESV() already does the 
appropriate checking:

	if (event_get_reserved1(event) ||
	    event_get_reserved2(event) ||
	    event_get_reserved3(event)) {
		pr_devel("reserved set when forbidden 0x%llx(0x%llx) 0x%llx(0x%llx) 0x%llx(0x%llx)\n",
				event->attr.config,
				event_get_reserved1(event),
				event->attr.config1,
				event_get_reserved2(event),
				event->attr.config2,
				event_get_reserved3(event));
		return -EINVAL;
	}
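
For readers who have not seen patch 1/8, the idea of generated accessors plus a reserved-bits check can be sketched in plain C as follows. This is not the PMU_RANGE_ATTR()/PMU_RANGE_RESV() implementation: struct mock_attr, RANGE_ACCESSOR(), the field named domain and the bit positions are all hypothetical, chosen only to show the pattern.

```c
#include <stdint.h>

/* Stand-in for the three config words of a perf event attribute. */
struct mock_attr {
	uint64_t config, config1, config2;
};

/* Mask with the low (end - start + 1) bits set; valid for widths 1..64. */
#define RANGE_MASK(start, end) (~0ULL >> (63 - ((end) - (start))))

/* Hypothetical generator: defines mock_get_<name>() for bits [start, end]. */
#define RANGE_ACCESSOR(name, field, start, end)				\
static inline uint64_t mock_get_##name(const struct mock_attr *a)	\
{									\
	return (a->field >> (start)) & RANGE_MASK(start, end);		\
}

RANGE_ACCESSOR(domain, config, 0, 3)
RANGE_ACCESSOR(reserved1, config, 4, 15)
RANGE_ACCESSOR(reserved2, config1, 16, 63)

/* Returns nonzero when any reserved bit is set, mirroring the check above. */
static inline int mock_reserved_set(const struct mock_attr *a)
{
	return mock_get_reserved1(a) || mock_get_reserved2(a);
}
```

The point made in the reply carries over: the same generated getter serves both to read a field and to test a reserved range for zero, so a separate "check" accessor would just duplicate it.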


* Re: [PATCH 6/8] powerpc/perf: add support for the hv gpci (get performance counter info) interface
From: Cody P Schafer @ 2014-02-03 21:13 UTC
  To: Michael Ellerman, Linux PPC
  Cc: Ingo Molnar, Paul Mackerras, Peter Zijlstra,
	Arnaldo Carvalho de Melo, LKML
In-Reply-To: <20140201055808.1829F2C00D3@ozlabs.org>

On 01/31/2014 09:58 PM, Michael Ellerman wrote:
> On Thu, 2014-16-01 at 23:53:52 UTC, Cody P Schafer wrote:
>> This provides a basic link between perf and hv_gpci. Notably, it does
>> not yet support transactions and does not list any events (they can
>> still be manually composed).
>
> What are the plans for listing?

I'm looking at extending the sysfs api for listing perf events. We can't 
use the existing one as it doesn't let us parametrize the events by 
anything, meaning we'd need to list an essentially duplicate event for 
each cpu/core/chip and lpar (guest). The duplication in cpu/core/chip 
comes from not using the typical cpu parameter to perf_event_open() 
(which we don't do because it wouldn't make sense when the guest 
monitored isn't us).

> The manual compose is nice but pretty hairy to use in practice I would think.
>
>> diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c
>> new file mode 100644
>> index 0000000..31d9d59
>> --- /dev/null
>> +++ b/arch/powerpc/perf/hv-gpci.c
>> @@ -0,0 +1,235 @@
>> +/*
>> + * Hypervisor supplied "gpci" ("get performance counter info") performance
>> + * counter support
>> + *
>> + * Author: Cody P Schafer <cody@linux.vnet.ibm.com>
>> + * Copyright 2014 IBM Corporation.
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU General Public License
>> + * as published by the Free Software Foundation; either version
>> + * 2 of the License, or (at your option) any later version.
>> + */
>> +#define pr_fmt(fmt) "hv-gpci: " fmt
>> +
>> +#include <linux/module.h>
>> +#include <linux/perf_event.h>
>> +#include <asm/firmware.h>
>> +#include <asm/hvcall.h>
>> +#include <asm/hv_gpci.h>
>> +#include <asm/io.h>
>
> Needed?
>

asm/io.h is for virt_to_phys(). And yes, I need it.

>> +/* See arch/powerpc/include/asm/hv_gpci.h for details on the hcall interface */
>> +
>> +PMU_RANGE_ATTR(request, config, 0, 31); /* u32 */
>> +PMU_RANGE_ATTR(starting_index, config, 32, 63); /* u32 */
>> +PMU_RANGE_ATTR(secondary_index, config1, 0, 15); /* u16 */
>> +PMU_RANGE_ATTR(counter_info_version, config1, 16, 23); /* u8 */
>> +PMU_RANGE_ATTR(length, config1, 24, 31); /* u8, bytes of data (1-8) */
>> +PMU_RANGE_ATTR(offset, config1, 32, 63); /* u32, byte offset */
>> +
>> +static struct attribute *format_attr[] = {
>> +	&format_attr_request.attr,
>> +	&format_attr_starting_index.attr,
>> +	&format_attr_secondary_index.attr,
>> +	&format_attr_counter_info_version.attr,
>> +
> Lonely blank line.

Which was separating "real" attributes from those provided by the kernel
(gpci doesn't know about "offset" and "length"). I'll remove.
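
Since "offset" and "length" are interpreted on the kernel side rather than by gpci, it may help to see how such a pair can pick one counter out of a result buffer. The sketch below is a guess at the intent, not the patch's code: it assumes the selected bytes form a big-endian value, and extract_be() is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: read `length` bytes (1-8) starting at `offset`
 * from `buf` and assemble them as a big-endian unsigned integer.
 * Returns 0 on success, -1 if the range is invalid or out of bounds.
 */
static inline int extract_be(const uint8_t *buf, size_t buf_len,
			     size_t offset, size_t length, uint64_t *value)
{
	uint64_t v = 0;
	size_t i;

	if (length < 1 || length > 8)
		return -1;
	if (offset > buf_len || length > buf_len - offset)
		return -1;

	/* Earlier bytes are more significant; each byte adds 8 bits. */
	for (i = 0; i < length; i++)
		v = (v << 8) | buf[offset + i];

	*value = v;
	return 0;
}
```

Two details matter in a helper like this: the accumulator is 64 bits wide before any shifting happens, and each byte moves the value by 8 bits, not by 1.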

>> +	&format_attr_offset.attr,
>> +	&format_attr_length.attr,
>> +	NULL,
>> +};
>> +
>> +static struct attribute_group format_group = {
>> +	.name = "format",
>> +	.attrs = format_attr,
>> +};
>> +
>> +static const struct attribute_group *attr_groups[] = {
>> +	&format_group,
>> +	NULL,
>> +};
>> +
>> +static unsigned long single_gpci_request(u32 req, u32 starting_index,
>> +		u16 secondary_index, u8 version_in, u32 offset, u8 length,
>> +		u64 *value)
>
> Passing the event and extracting the values in here would be neater IMHO.
>

Well, I'll at least add a wrapper that does that. The idea here was to 
separate the perf specific logic from the gpci specific logic. And I'll 
end up taking advantage of that separation when doing the interface 
probing (mentioned way down near the end of this email).

>> +{
>> +	unsigned long ret;
>> +	size_t i;
>> +	u64 count;
>> +
>> +	struct {
>> +		struct hv_get_perf_counter_info_params params;
>> +		union {
>> +			union h_gpci_cvs data;
>> +			uint8_t bytes[sizeof(union h_gpci_cvs)];
>> +		};
>> +	} arg = {
>> +		.params = {
>> +			.counter_request = cpu_to_be32(req),
>> +			.starting_index = cpu_to_be32(starting_index),
>> +			.secondary_index = cpu_to_be16(secondary_index),
>> +			.counter_info_version_in = version_in,
>> +		}
>> +	};
>> +
>> +	ret = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO,
>> +			virt_to_phys(&arg), sizeof(arg));
>> +	if (ret) {
>> +		pr_devel("hcall failed: 0x%lx\n", ret);
>> +		return ret;
>> +	}
>> +
>> +	/*
>> +	 * we verify offset and length are within the zeroed buffer at event
>> +	 * init.
>> +	 */
>> +	count = 0;
>> +	for (i = offset; i < offset + length; i++)
>> +		count |= arg.bytes[i] << (i - offset);
>> +
>> +	*value = count;
>> +	return ret;
>> +}
>> +
>> +static u64 h_gpci_get_value(struct perf_event *event)
>> +{
>> +	u64 count;
>> +	unsigned long ret = single_gpci_request(event_get_request(event),
>> +					event_get_starting_index(event),
>> +					event_get_secondary_index(event),
>> +					event_get_counter_info_version(event),
>> +					event_get_offset(event),
>> +					event_get_length(event),
>> +					&count);
>> +	if (ret)
>> +		return 0;
>> +	return count;
>> +}
>> +
>> +static void h_gpci_event_update(struct perf_event *event)
>> +{
>> +	s64 prev;
>> +	u64 now = h_gpci_get_value(event);
>> +	prev = local64_xchg(&event->hw.prev_count, now);
>> +	local64_add(now - prev, &event->count);
>> +}
>> +
>> +static void h_gpci_event_start(struct perf_event *event, int flags)
>> +{
>> +	local64_set(&event->hw.prev_count, h_gpci_get_value(event));
>> +	perf_swevent_start_hrtimer(event);
>> +}
>> +
>> +static void h_gpci_event_stop(struct perf_event *event, int flags)
>> +{
>> +	perf_swevent_cancel_hrtimer(event);
>> +	h_gpci_event_update(event);
>> +}
>> +
>> +static int h_gpci_event_add(struct perf_event *event, int flags)
>> +{
>> +	if (flags & PERF_EF_START)
>> +		h_gpci_event_start(event, flags);
>> +
>> +	return 0;
>> +}
>> +
>> +static void h_gpci_event_del(struct perf_event *event, int flags)
>> +{
>> +	h_gpci_event_stop(event, flags);
>> +}
>
> Can just hook del directly no?

Yep, good point.

>
>> +static void h_gpci_event_read(struct perf_event *event)
>> +{
>> +	h_gpci_event_update(event);
>> +}
>
> Ditto.

Yep, agreed.

>
>> +static int h_gpci_event_init(struct perf_event *event)
>> +{
>> +	u64 count;
>> +	u8 length;
>> +
>> +	/* Not our event */
>> +	if (event->attr.type != event->pmu->type)
>> +		return -ENOENT;
> I don't understand why you need this?

It's part of the standard perf stuff. In perf_init_event() we do an idr 
lookup to identify the responsible PMU, and if that fails the PMUs get
iterated over and need to return -ENOENT for events they don't own. That
said, I have no idea in what cases the idr lookup would fail.
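
The fallback walk described above can be modelled in a few lines of userspace C. Everything here is a stand-in: struct mock_pmu, struct mock_event and mock_find_pmu() are hypothetical and only illustrate why a PMU must answer -ENOENT for events it does not own. It is not a copy of perf_init_event().

```c
#include <errno.h>
#include <stddef.h>

struct mock_event { int type; };

struct mock_pmu {
	int type;
	int (*event_init)(const struct mock_pmu *pmu,
			  const struct mock_event *ev);
};

/* A PMU that only accepts events whose type matches its own. */
static inline int mock_event_init(const struct mock_pmu *pmu,
				  const struct mock_event *ev)
{
	if (ev->type != pmu->type)
		return -ENOENT;	/* not our event, let the next PMU try */
	return 0;
}

/*
 * Hypothetical fallback walk: offer the event to every PMU in turn.
 * -ENOENT means "keep looking"; any other result ends the search.
 */
static inline const struct mock_pmu *
mock_find_pmu(const struct mock_pmu *pmus, size_t n,
	      const struct mock_event *ev, int *err)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int ret = pmus[i].event_init(&pmus[i], ev);

		if (ret == -ENOENT)
			continue;
		*err = ret;
		return ret ? NULL : &pmus[i];
	}
	*err = -ENOENT;
	return NULL;
}
```

In this model a PMU that returned -EINVAL for a foreign event would stop the walk early and hide the PMU that actually owns it, which is why the type check comes first in event_init.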

>> +	/* config2 is unused */
>> +	if (event->attr.config2)
>> +		return -EINVAL;
>
> You must also check the reserved regions of config and config1.

There aren't any (I use all the bits in config and config1).

>> +	/* unsupported modes and filters */
>> +	if (event->attr.exclude_user   ||
>> +	    event->attr.exclude_kernel ||
>> +	    event->attr.exclude_hv     ||
>> +	    event->attr.exclude_idle   ||
>> +	    event->attr.exclude_host   ||
>> +	    event->attr.exclude_guest  ||
>> +	    is_sampling_event(event)) /* no sampling */
>
> I think you should also check sample_type.

Why? Many PMUs don't, I'm not seeing what the need is here.

>> +		return -EINVAL;
>
> Have you thought about inherit, pinned, exclusive?
>

exclusive and pinned don't make any sense because we don't have any 
actual PMU hw that is being consumed. The event counters are always 
counting.
inherit isn't relevant because we forbid all task tracking (via "+ 
.task_ctx_nr = perf_invalid_context," below).

I haven't forbidden either of these because I'm not sure there is a 
point in doing so (the exclude_* stuff is used for some permissions 
checking)

>> +
>> +	/* no branch sampling */
>> +	if (has_branch_stack(event))
>> +		return -EOPNOTSUPP;
>> +
>> +	length = event_get_length(event);
>> +	if (length < 1 || length > 8)
>> +		return -EINVAL;
>> +
>> +	/* last byte within the buffer? */
>> +	if ((event_get_offset(event) + length) > sizeof(union h_gpci_cvs))
>> +		return -EINVAL;
>> +
>> +	/* check if the request works... */
>> +	if (single_gpci_request(event_get_request(event),
>> +				event_get_starting_index(event),
>> +				event_get_secondary_index(event),
>> +				event_get_counter_info_version(event),
>> +				event_get_offset(event),
>> +				length,
>> +				&count))
>> +		return -EINVAL;
>> +
>> +	/*
>> +	 * Some of the events are per-cpu, some per-core, some per-chip, some
>> +	 * are global, and some access data from other virtual machines on the
>> +	 * same physical machine. We can't map the cpu value without a lot of
>> +	 * work. Instead, we pick an arbitrary cpu for all events on this pmu.
>> +	 */
>> +	event->cpu = 0;
>
> OK, but is having them all on cpu zero a good idea?

Probably not. We can remove this line without any real effect on the 
event output. Ideally, we'd have a way to tell perf that our PMU isn't 
associated with _any_ cpu (which would also eliminate some cross-cpu 
calls which trigger IPIs), and that is something I'm looking at, but 
perf widely assumes that events are tied to cpus (or some cpu mappable 
unit like a core or socket).

>
>> +	perf_swevent_init_hrtimer(event);
>> +	return 0;
>> +}
>> +
>> +struct pmu h_gpci_pmu = {
>> +	.task_ctx_nr = perf_invalid_context,
>> +
>> +	.name = "hv_gpci",
>> +	.attr_groups = attr_groups,
>> +	.event_init = h_gpci_event_init,
>> +	.add = h_gpci_event_add,
>> +	.del = h_gpci_event_del,
>> +	.start = h_gpci_event_start,
>> +	.stop = h_gpci_event_stop,
>> +	.read = h_gpci_event_read,
>
> Nice to have them align vertically.
Ack.

>
>> +	.event_idx = perf_swevent_event_idx,
>> +};
>> +
>> +static int hv_gpci_init(void)
>> +{
>> +	int r;
>> +
>> +	if (!firmware_has_feature(FW_FEATURE_LPAR)) {
>> +		pr_info("Not running under phyp, not supported\n");
>
> If only it was that simple :)
>
> You'll see FW_FEATURE_LPAR in a KVM guest too.
>
> There are at least two mechanisms for FW to indicate the presence of features,
> "ibm,hypertas-functions" and "ibm,architecture-vec-5".

I don't _think_ it is exposed in either of those, or if it is the docs 
don't say so :(.

> If HGPCI is not exposed in either of those then we'd want to do a probe hcall
> here to try and detect it at runtime.

Yep, having a probing hcall would be the way to go. The only issue there 
is handling migration between firmware which has or doesn't have support 
for this hcall. I suppose hooking a migration notifier of some kind 
would let us get past that bump.

>> +		return -ENODEV;
>> +	}
>> +
>> +	r = perf_pmu_register(&h_gpci_pmu, h_gpci_pmu.name, -1);
>> +	if (r)
>> +		return r;
>> +
>> +	return 0;
>> +}
>> +
>> +module_init(hv_gpci_init);
>
> This is not modular code so you're discouraged from using module_init(),
> arch_initcall() would probably be fine.
>

Got it.


* [PATCH V4 3/3] powerpc/powernv: Have uniform logging of errors in opal-elog.c
From: Deepthi Dharwar @ 2014-02-03 19:30 UTC
  To: linuxppc-dev
In-Reply-To: <20140203192915.7790.20325.stgit@deepthi>

Currently some errors/info to be reported use
printk and the rest pr_fmt(). This patch
makes the complete error logging uniform.
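
As background on why the "ELOG: " prefix disappears from the individual messages in the diff below, here is a small userspace sketch of the pr_fmt() convention. mock_pr_err() is a hypothetical stand-in that formats into a buffer; the real pr_err() also attaches a log level and goes through printk, which is left out here.

```c
#include <stdio.h>

/* Per-file prefix; it has to be defined before the logging macros are used. */
#define pr_fmt(fmt) "ELOG: " fmt

/*
 * Userspace stand-in for pr_err(): the format string is passed through
 * pr_fmt(), so the prefix is glued on by string literal concatenation.
 * Uses the GNU ##__VA_ARGS__ extension, as kernel code commonly does.
 */
#define mock_pr_err(buf, len, fmt, ...) \
	snprintf(buf, len, pr_fmt(fmt), ##__VA_ARGS__)
```

With the prefix supplied once per file, a message that still spelled out "ELOG: " itself would show the prefix twice, so dropping it from the strings and converting the remaining printk() calls go together.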

Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/opal-elog.c |   14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-elog.c b/arch/powerpc/platforms/powernv/opal-elog.c
index 13874b1..d7a3e68 100644
--- a/arch/powerpc/platforms/powernv/opal-elog.c
+++ b/arch/powerpc/platforms/powernv/opal-elog.c
@@ -63,7 +63,7 @@ void opal_elog_ack(uint64_t ack_id)
 	struct opal_err_log *record, *next;
 	bool found = false;
 
-	printk(KERN_INFO "OPAL Log ACK=%llx", ack_id);
+	pr_info("OPAL Log ACK=%llx", ack_id);
 
 	/* once user acknowledge a log delete record from list */
 	spin_lock_irqsave(&opal_elog_lock, flags);
@@ -189,7 +189,7 @@ static void opal_elog_read(void)
 	/* read log size and log ID from OPAL */
 	rc = opal_get_elog_size(&log_id, &elog_size, &elog_type);
 	if (rc != OPAL_SUCCESS) {
-		pr_err("ELOG: Opal log read failed\n");
+		pr_err("Opal log read failed\n");
 		return;
 	}
 	if (elog_size >= OPAL_MAX_ERRLOG_SIZE)
@@ -203,7 +203,7 @@ static void opal_elog_read(void)
 	rc = opal_read_elog(__pa(err_log_data), elog_size, log_id);
 	if (rc != OPAL_SUCCESS) {
 		mutex_unlock(&err_log_data_mutex);
-		pr_err("ELOG: log read failed for log-id=%llx\n", log_id);
+		pr_err("Reading of log failed for log-id=%llx\n", log_id);
 		/* put back the free node. */
 		spin_lock_irqsave(&opal_elog_lock, flags);
 		list_add(&record->link, &elog_ack_list);
@@ -265,7 +265,7 @@ static int init_err_log_buffer(void)
 
 	buf_ptr = vmalloc(sizeof(struct opal_err_log) * MAX_NUM_RECORD);
 	if (!buf_ptr) {
-		printk(KERN_ERR "ELOG: failed to allocate memory.\n");
+		pr_err("Failed to allocate memory for error logging buffers.\n");
 		return -ENOMEM;
 	}
 	memset(buf_ptr, 0, sizeof(struct opal_err_log) * MAX_NUM_RECORD);
@@ -358,15 +358,13 @@ int __init opal_elog_init(void)
 
 	rc = sysfs_create_bin_file(opal_kobj, &opal_elog_attr);
 	if (rc) {
-		printk(KERN_ERR "ELOG: unable to create sysfs file"
-					"opal_elog (%d)\n", rc);
+		pr_err("Unable to create sysfs file opal_elog (%d)\n", rc);
 		return rc;
 	}
 
 	rc = sysfs_create_file(opal_kobj, &opal_elog_ack_attr.attr);
 	if (rc) {
-		printk(KERN_ERR "ELOG: unable to create sysfs file"
-			" opal_elog_ack (%d)\n", rc);
+		pr_err("Unable to create sysfs file opal_elog_ack (%d)\n", rc);
 		return rc;
 	}
 


* [PATCH V4 2/3] powerpc/powernv: Correct spell error in opal-elog.c
From: Deepthi Dharwar @ 2014-02-03 19:29 UTC
  To: linuxppc-dev
In-Reply-To: <20140203192915.7790.20325.stgit@deepthi>

Correct spell error in opal-elog.c

Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/opal-elog.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/opal-elog.c b/arch/powerpc/platforms/powernv/opal-elog.c
index 0a03b60..13874b1 100644
--- a/arch/powerpc/platforms/powernv/opal-elog.c
+++ b/arch/powerpc/platforms/powernv/opal-elog.c
@@ -26,7 +26,7 @@
 /* Maximum size of a single log on FSP is 16KB */
 #define OPAL_MAX_ERRLOG_SIZE	16384
 
-/* maximu number of records powernv can hold */
+/* Maximum number of records powernv platform can hold */
 #define MAX_NUM_RECORD	128
 
 struct opal_err_log {


* [PATCH V4 1/3] powerpc/powernv: Push critical error logs to FSP
From: Deepthi Dharwar @ 2014-02-03 19:29 UTC
  To: linuxppc-dev
In-Reply-To: <20140203192915.7790.20325.stgit@deepthi>

This patch provides error logging interfaces to report critical
powernv error logs to FSP.
All the required information to dump the error is collected
at POWERNV level through error log interfaces
and then pushed on to FSP.

Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/opal.h                |   36 ++++++++++
 arch/powerpc/platforms/powernv/opal-elog.c     |   77 ++++++++++++++++++++++
 arch/powerpc/platforms/powernv/opal-wrappers.S |    2 -
 arch/powerpc/platforms/powernv/powernv.h       |   84 ++++++++++++++++++++++++
 4 files changed, 196 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 554a031..6f9e02a 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -268,6 +268,40 @@ enum OpalMessageType {
 	OPAL_MSG_TYPE_MAX,
 };
 
+/* Max user dump size is 14K    */
+#define OPAL_LOG_MAX_DUMP       14336
+
+/* Multiple user data sections */
+struct __attribute__((__packed__)) opal_user_data_section {
+	uint32_t tag;
+	uint16_t size;
+	uint16_t component_id;
+	char data_dump[1];
+};
+
+/*
+ * All the information regarding an error/event to be reported
+ * needs to populate this structure using pre-defined interfaces
+ * only
+ */
+struct __attribute__((__packed__)) opal_errorlog {
+
+	uint16_t component_id;
+	uint8_t error_event_type;
+	uint8_t subsystem_id;
+
+	uint8_t event_severity;
+	uint8_t event_subtype;
+	uint8_t user_section_count;
+	uint8_t elog_origin;
+
+	uint32_t user_section_size;
+	uint32_t reason_code;
+	uint32_t additional_info[4];
+
+	char user_data_dump[OPAL_LOG_MAX_DUMP];
+};
+
 /* Machine check related definitions */
 enum OpalMCE_Version {
 	OpalMCE_V1 = 1,
@@ -859,7 +893,7 @@ int64_t opal_lpc_read(uint32_t chip_id, enum OpalLPCAddressType addr_type,
 		      uint32_t addr, uint32_t *data, uint32_t sz);
 int64_t opal_read_elog(uint64_t buffer, size_t size, uint64_t log_id);
 int64_t opal_get_elog_size(uint64_t *log_id, size_t *size, uint64_t *elog_type);
-int64_t opal_write_elog(uint64_t buffer, uint64_t size, uint64_t offset);
+int64_t opal_elog_write(void *buffer);
 int64_t opal_send_ack_elog(uint64_t log_id);
 void opal_resend_pending_logs(void);
 int64_t opal_validate_flash(uint64_t buffer, uint32_t *size, uint32_t *result);
diff --git a/arch/powerpc/platforms/powernv/opal-elog.c b/arch/powerpc/platforms/powernv/opal-elog.c
index fc891ae..0a03b60 100644
--- a/arch/powerpc/platforms/powernv/opal-elog.c
+++ b/arch/powerpc/platforms/powernv/opal-elog.c
@@ -8,6 +8,9 @@
  * as published by the Free Software Foundation; either version
  * 2 of the License, or (at your option) any later version.
  */
+#undef DEBUG
+#define pr_fmt(fmt) "ELOG: " fmt
+
 #include <linux/kernel.h>
 #include <linux/init.h>
 #include <linux/of.h>
@@ -16,8 +19,9 @@
 #include <linux/fs.h>
 #include <linux/vmalloc.h>
 #include <linux/fcntl.h>
+#include <linux/mm.h>
 #include <asm/uaccess.h>
-#include <asm/opal.h>
+#include "powernv.h"
 
 /* Maximum size of a single log on FSP is 16KB */
 #define OPAL_MAX_ERRLOG_SIZE	16384
@@ -272,6 +276,77 @@ static int init_err_log_buffer(void)
 	return 0;
 }
 
+/* Interface to be used by POWERNV to push the logs to FSP via Sapphire */
+struct opal_errorlog *pnv_elog_create(uint8_t pnv_error_event_type,
+			uint16_t pnv_component_id, uint8_t pnv_subsystem_id,
+			uint8_t pnv_event_severity, uint8_t pnv_event_subtype,
+			uint32_t reason_code, uint32_t info0, uint32_t info1,
+			uint32_t info2, uint32_t info3)
+{
+	struct opal_errorlog *buf;
+
+	buf = kzalloc(sizeof(struct opal_errorlog), GFP_ATOMIC);
+	if (!buf) {
+		pr_err("Failed to allocate buffer for generating error log\n");
+		return NULL;
+	}
+
+	buf->error_event_type = pnv_error_event_type;
+	buf->component_id = pnv_component_id;
+	buf->subsystem_id = pnv_subsystem_id;
+	buf->event_severity = pnv_event_severity;
+	buf->event_subtype = pnv_event_subtype;
+	buf->reason_code = reason_code;
+	buf->additional_info[0] = info0;
+	buf->additional_info[1] = info1;
+	buf->additional_info[2] = info2;
+	buf->additional_info[3] = info3;
+	return buf;
+}
+
+int pnv_elog_update_user_dump(struct opal_errorlog *buf, unsigned char *data,
+						uint32_t tag, uint16_t size)
+{
+	char *buffer;
+	struct opal_user_data_section *tmp;
+
+	if (!buf) {
+		pr_err("Cannot update user data. Error log buffer is invalid");
+		return -1;
+	}
+
+	buffer = (char *)buf->user_data_dump + buf->user_section_size;
+	if ((buf->user_section_size + size) > OPAL_LOG_MAX_DUMP) {
+		pr_err("Size of user data overruns the buffer");
+		return -1;
+	}
+
+	tmp = (struct opal_user_data_section *)buffer;
+	tmp->tag = tag;
+	tmp->size = size + sizeof(struct opal_user_data_section) - 1;
+	memcpy(tmp->data_dump, data, size);
+
+	buf->user_section_size += tmp->size;
+	buf->user_section_count++;
+	return 0;
+}
+
+int pnv_commit_errorlog(struct opal_errorlog *buf)
+{
+	int rc;
+
+	rc = opal_elog_write((void *)
+			(vmalloc_to_pfn(buf) << PAGE_SHIFT));
+	if (rc == OPAL_SUCCESS) {
+		/* If the log has been committed, free the buffer */
+		kfree(buf);
+		buf = NULL;
+	} else
+		pr_err("Error log could not be committed to FSP");
+
+	return rc;
+}
+
 /* Initialize error logging */
 int __init opal_elog_init(void)
 {
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index 81e445f..c9d46e8 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -120,7 +120,7 @@ OPAL_CALL(opal_read_elog,			OPAL_ELOG_READ);
 OPAL_CALL(opal_send_ack_elog,			OPAL_ELOG_ACK);
 OPAL_CALL(opal_get_elog_size,			OPAL_ELOG_SIZE);
 OPAL_CALL(opal_resend_pending_logs,		OPAL_ELOG_RESEND);
-OPAL_CALL(opal_write_elog,			OPAL_ELOG_WRITE);
+OPAL_CALL(opal_elog_write,			OPAL_ELOG_WRITE);
 OPAL_CALL(opal_validate_flash,			OPAL_FLASH_VALIDATE);
 OPAL_CALL(opal_manage_flash,			OPAL_FLASH_MANAGE);
 OPAL_CALL(opal_update_flash,			OPAL_FLASH_UPDATE);
diff --git a/arch/powerpc/platforms/powernv/powernv.h b/arch/powerpc/platforms/powernv/powernv.h
index c9cfb0b..42a8b8c 100644
--- a/arch/powerpc/platforms/powernv/powernv.h
+++ b/arch/powerpc/platforms/powernv/powernv.h
@@ -1,6 +1,8 @@
 #ifndef _POWERNV_H
 #define _POWERNV_H
 
+#include <asm/opal.h>
+
 #ifdef CONFIG_SMP
 extern void pnv_smp_init(void);
 #else
@@ -23,4 +25,86 @@ bool cpu_core_split_required(void);
 
 extern void pnv_lpc_init(void);
 
+/* Classification of error/event type to be reported on POWERNV */
+/* Platform Events/Errors: Report Machine Check Interrupt */
+#define PNV_PLATFORM_ERR_EVT		0x01
+/* INPUT_OUTPUT: Report all I/O related events/errors */
+#define PNV_INPUT_OUTPUT_ERR_EVT	0x02
+/* RESOURCE_DEALLOC: Hotplug events and errors */
+#define PNV_RESOURCE_DEALLOC_ERR_EVT	0x03
+/* MISC: Miscellaneous error */
+#define PNV_MISC_ERR_EVT		0x04
+
+/* POWERNV Subsystem IDs listed for reporting events/errors */
+#define PNV_PROCESSOR_SUBSYSTEM		0x10
+#define PNV_MEMORY_SUBSYSTEM		0x20
+#define PNV_IO_SUBSYSTEM		0x30
+#define PNV_IO_DEVICES			0x40
+#define PNV_CEC_HARDWARE		0x50
+#define PNV_POWER_COOLING		0x60
+#define PNV_MISC_SUBSYSTEM		0x70
+#define PNV_SURVEILLANCE_ERR		0x7A
+#define PNV_PLATFORM_FIRMWARE		0x80
+#define PNV_SOFTWARE			0x90
+#define PNV_EXTERNAL_ENV		0xA0
+/*
+ * During reporting an event/error the following represents
+ * how serious the logged event/error is. (Severity)
+ */
+#define PNV_INFO						0x00
+#define PNV_RECOVERED_ERR_GENERAL				0x10
+
+/* 0x2X series is to denote set of Predictive Error */
+/* 0x20 Generic predictive error */
+#define PNV_PREDICTIVE_ERR_GENERAL				0x20
+/* 0x21 Predictive error, degraded performance */
+#define PNV_PREDICTIVE_ERR_DEGRADED_PERF			0x21
+/* 0x22 Predictive error, fault may be corrected after reboot */
+#define PNV_PREDICTIVE_ERR_FAULT_RECTIFY_REBOOT			0x22
+/*
+ * 0x23 Predictive error, fault may be corrected after reboot,
+ * degraded performance
+ */
+#define PNV_PREDICTIVE_ERR_FAULT_RECTIFY_BOOT_DEGRADE_PERF	0x23
+/* 0x24 Predictive error, loss of redundancy */
+#define PNV_PREDICTIVE_ERR_LOSS_OF_REDUNDANCY			0x24
+
+/* 0x4X series for Unrecoverable Error */
+/* 0x40 Generic Unrecoverable error */
+#define PNV_UNRECOVERABLE_ERR_GENERAL				0x40
+/* 0x41 Unrecoverable error bypassed with degraded performance */
+#define PNV_UNRECOVERABLE_ERR_DEGRADE_PERF			0x41
+/* 0x44 Unrecoverable error bypassed with loss of redundancy */
+#define PNV_UNRECOVERABLE_ERR_LOSS_REDUNDANCY			0x44
+/* 0x45 Unrecoverable error bypassed with loss of redundancy and performance */
+#define PNV_UNRECOVERABLE_ERR_LOSS_REDUNDANCY_PERF		0x45
+/* 0x48 Unrecoverable error bypassed with loss of function */
+#define PNV_UNRECOVERABLE_ERR_LOSS_OF_FUNCTION			0x48
+/*
+ * POWERNV Event Sub-type
+ * This field provides additional information on the non-error
+ * event type
+ */
+#define PNV_NA						0x00
+#define PNV_MISCELLANEOUS_INFO_ONLY			0x01
+#define PNV_PREV_REPORTED_ERR_RECTIFIED			0x10
+#define PNV_SYS_RESOURCES_DECONFIG_BY_USER		0x20
+#define PNV_SYS_RESOURCE_DECONFIG_PRIOR_ERR		0x21
+#define PNV_RESOURCE_DEALLOC_EVENT_NOTIFY		0x22
+#define PNV_CONCURRENT_MAINTENANCE_EVENT		0x40
+#define PNV_CAPACITY_UPGRADE_EVENT			0x60
+#define PNV_RESOURCE_SPARING_EVENT			0x70
+#define PNV_DYNAMIC_RECONFIG_EVENT			0x80
+#define PNV_NORMAL_SYS_PLATFORM_SHUTDOWN		0xD0
+#define PNV_ABNORMAL_POWER_OFF				0xE0
+
+struct opal_errorlog *pnv_elog_create(uint8_t pnv_error_event_type,
+			uint16_t pnv_component_id, uint8_t pnv_subsystem_id,
+			uint8_t pnv_event_severity, uint8_t pnv_event_subtype,
+			uint32_t reason_code, uint32_t info0, uint32_t info1,
+			uint32_t info2, uint32_t info3);
+int pnv_elog_update_user_dump(struct opal_errorlog *buf, unsigned char *data,
+						uint32_t tag, uint16_t size);
+int pnv_commit_errorlog(struct opal_errorlog *buf);
+
 #endif /* _POWERNV_H */


* [PATCH V4 0/3] powerpc/powernv: Error logging interfaces
From: Deepthi Dharwar @ 2014-02-03 19:29 UTC
  To: linuxppc-dev

This patch series defines generic interfaces for error logging to
push down critical errors from powernv platform to FSP.

Also, it contains a few minor fixes for the existing error logging
framework that retrieves error logs from FSP.

Changes from V3:
* Change memory allocation to GFP_ATOMIC, to generate
  errors in bad context
* Move all error log generation related code to arch/powernv
  rename it from opal_* to pnv_* .

Changes from V2:
* Review comments from V2 have been addressed
  includes comment formats, changing naming
  conventions and incorporated error handling
  of the buffers.
* Minor typo fix and use of pr_err/pr_fmt to
  log errors.

 Deepthi Dharwar (3):
      powernv: Push critical error logs to FSP
      powernv: Correct spell error in opal-elog.c
      powernv: Have uniform logging of errors in opal-elog.c

 arch/powerpc/include/asm/opal.h                |   36 +++++++++
 arch/powerpc/platforms/powernv/opal-elog.c     |   93 +++++++++++++++++++++---
 arch/powerpc/platforms/powernv/opal-wrappers.S |    2 -
 arch/powerpc/platforms/powernv/powernv.h       |   84 ++++++++++++++++++++++
 4 files changed, 203 insertions(+), 12 deletions(-)

-- Deepthi
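
The create / add user data / commit flow in this series can be illustrated with a userspace model of the "add user data" step. The types and names below (struct mock_errorlog, struct mock_section_hdr, mock_add_section()) are hypothetical and the header layout is simplified. One deliberate difference from the 1/3 patch: the bounds check here counts the section header as well as the payload.

```c
#include <stdint.h>
#include <string.h>

#define MOCK_MAX_DUMP 14336	/* same value as OPAL_LOG_MAX_DUMP */

/* Simplified section header; the layout is illustrative only. */
struct mock_section_hdr {
	uint32_t tag;
	uint16_t size;		/* header plus payload, in bytes */
	uint16_t component_id;
};

struct mock_errorlog {
	uint8_t  section_count;
	uint32_t used;		/* bytes of dump[] already filled */
	unsigned char dump[MOCK_MAX_DUMP];
};

/*
 * Append one tagged section after the ones already stored.
 * Returns 0 on success, -1 if header plus payload would not fit.
 */
static inline int mock_add_section(struct mock_errorlog *log, uint32_t tag,
				   const void *data, uint16_t len)
{
	struct mock_section_hdr hdr;
	size_t total = sizeof(hdr) + len;

	if (total > UINT16_MAX || total > MOCK_MAX_DUMP - log->used)
		return -1;

	hdr.tag = tag;
	hdr.size = (uint16_t)total;
	hdr.component_id = 0;

	/* memcpy avoids unaligned stores into the byte buffer. */
	memcpy(log->dump + log->used, &hdr, sizeof(hdr));
	memcpy(log->dump + log->used + sizeof(hdr), data, len);

	log->used += (uint32_t)total;
	log->section_count++;
	return 0;
}
```

A caller would zero-initialise the log, add sections until one is rejected or the data is exhausted, and then hand the whole buffer to the commit step.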


* Re: PCIe Access - achieve bursts without DMA
From: David Hawkins @ 2014-02-03 17:08 UTC
  To: Michael Moese; +Cc: linuxppc-dev
In-Reply-To: <20140203082050.GB1970@localhost.intra.men.de>

Hi Michael,

> On Fri, Jan 31, 2014 at 03:18:30PM -0800, David Hawkins wrote:
>> 1. Peripheral board DMA (board-to-board)
>> 2. Peripheral board DMA to host memory.
>> 3. Host (root complex) DMA.
>>
>> As far as "verification" of your custom peripheral board FPGA IP is
>> concerned, if I was a customer, and you had data for (1) and (2),
>> I'd be pretty happy (and could care less about (2), since it's so
>> system dependent).
>
> Usually I would totally agree with you and try to implement the benchmark
> using DMA transfers. Unfortunately, we have some boards and IP cores that
> do not support DMA transfers, or where the target system must not use DMA
> by requirement. As I have no influence on these, I had to investigate
> how to improve my throughput.

Ah, I see, that does make your life difficult then.

> I've submitted an RFC patch earlier today, which allowed me to perform
> PCIe read bursts on IO memory, achieving 18 MB/s instead of the 3 MB/s
> I got when using non-cached reads. However, I had to ioremap() my
> memory, like Gabriel said, using write-thru configuration.

That sounds like a reasonable compromise.

Cheers,
Dave


* Re: [PATCH] powerpc: thp: Fix crash on mremap
From: Luis Henriques @ 2014-02-03 14:47 UTC
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, stable
In-Reply-To: <1390911423-4409-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

On Tue, Jan 28, 2014 at 05:47:03PM +0530, Aneesh Kumar K.V wrote:
> This patch fixes the crash below
> 
> NIP [c00000000004cee4] .__hash_page_thp+0x2a4/0x440
> LR [c0000000000439ac] .hash_page+0x18c/0x5e0
> ...
> Call Trace:
> [c000000736103c40] [00001ffffb000000] 0x1ffffb000000(unreliable)
> [437908.479693] [c000000736103d50] [c0000000000439ac] .hash_page+0x18c/0x5e0
> [437908.479699] [c000000736103e30] [c00000000000924c] .do_hash_page+0x4c/0x58
> 
> On ppc64 we use the pgtable for storing the hpte slot information and
> store the address of the pgtable at a constant offset (PTRS_PER_PMD) from
> the pmd. On mremap, when we switch the pmd, we need to withdraw and deposit
> the pgtable again, so that we find the pgtable at the PTRS_PER_PMD offset
> from the new pmd.
> 
> We also want to move the withdraw and deposit before the set_pmd so
> that, when a page fault finds the pmd as trans huge, we can be sure that
> the pgtable can be located at the offset.
> 
> variant of upstream SHA1: b3084f4db3aeb991c507ca774337c7e7893ed04f
> for 3.11 stable series
> 

Since both you and Benjamin Herrenschmidt claim this is good for stable, I
am queuing this variant for the 3.11 kernel.  Thanks a lot!

Cheers,
--
Luis

> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  arch/Kconfig                           |  3 +++
>  arch/powerpc/platforms/Kconfig.cputype |  1 +
>  mm/huge_memory.c                       | 12 ++++++++++++
>  3 files changed, 16 insertions(+)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 1feb169274fe..c5863b35d054 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -368,6 +368,9 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE
>  config HAVE_ARCH_SOFT_DIRTY
>  	bool
>  
> +config ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW
> +	bool
> +
>  config HAVE_MOD_ARCH_SPECIFIC
>  	bool
>  	help
> diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
> index 47d9a03dd415..d11a34be018d 100644
> --- a/arch/powerpc/platforms/Kconfig.cputype
> +++ b/arch/powerpc/platforms/Kconfig.cputype
> @@ -71,6 +71,7 @@ config PPC_BOOK3S_64
>  	select PPC_FPU
>  	select PPC_HAVE_PMU_SUPPORT
>  	select SYS_SUPPORTS_HUGETLBFS
> +	select ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW
>  	select HAVE_ARCH_TRANSPARENT_HUGEPAGE if PPC_64K_PAGES
>  
>  config PPC_BOOK3E_64
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 12acb0ba7991..beaa7cc9de75 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1461,8 +1461,20 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
>  
>  	ret = __pmd_trans_huge_lock(old_pmd, vma);
>  	if (ret == 1) {
> +#ifdef CONFIG_ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW
> +		pgtable_t pgtable;
> +#endif
>  		pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
>  		VM_BUG_ON(!pmd_none(*new_pmd));
> +#ifdef CONFIG_ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW
> +		/*
> +		 * Archs like ppc64 use pgtable to store per pmd
> +		 * specific information. So when we switch the pmd,
> +		 * we should also withdraw and deposit the pgtable
> +		 */
> +		pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
> +		pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
> +#endif
>  		set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
>  		spin_unlock(&mm->page_table_lock);
>  	}
> -- 
> 1.8.5.3
> 

^ permalink raw reply

* RE: PCIe Access - achieve bursts without DMA
From: David Laight @ 2014-02-03 10:51 UTC (permalink / raw)
  To: 'Michael Moese'; +Cc: linuxppc-dev@lists.ozlabs.org, David Hawkins
In-Reply-To: <20140203103943.GC1970@localhost.intra.men.de>

From: Michael Moese
> On Mon, Feb 03, 2014 at 10:17:43AM +0000, David Laight wrote:
>
> > We achieved about twice that using the PEX dma controller.
>
> > Your 3MB/s for single word transfers is similar to what we saw.
> > Cycle times that make an ISA bus look fast.
>=20
> Indeed, this is really poor performance. I know we could achieve much
> better performance using DMA, but we have several products where we simply
> don't have DMA available - this requires searching for other paths.

I got the host (ppc) to do a dma, not the card. (This does need a
dma controller that is adequately integrated with the PCIe logic.)
So it doesn't require any hardware changes.
I did have to design the software to minimise the number of single
memory transfers.

> My ioremap_wt() could help in these situations, at least increasing
> performance for non-DMA operation to a not-that-bad level.

I needed to do writes as well as reads - so I think I would have
needed to map PCIe space fully cached (rather than write-through).
The speed of back to back writes is better than reads (even if they don't
get combined) because the requests get 'posted' and overlap on the
PCIe bus.

Managing cached accesses does get tricky - you need to make sure that
both sides never have to write to the same cache line.

> I don't know if other devices could benefit from this, but we certainly
> have several IPs that would; those are not upstream yet, we're
> still working on this.
>
> Michael
>

^ permalink raw reply

* Re: PCIe Access - achieve bursts without DMA
From: Michael Moese @ 2014-02-03 10:39 UTC (permalink / raw)
  To: David Laight
  Cc: linuxppc-dev@lists.ozlabs.org, David Hawkins,
	'Michael Moese'
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D0F6B774B@AcuExch.aculab.com>

On Mon, Feb 03, 2014 at 10:17:43AM +0000, David Laight wrote:

> We achieved about twice that using the PEX dma controller.

> Your 3MB/s for single word transfers is similar to what we saw.
> Cycle times that make an ISA bus look fast.

Indeed, this is really poor performance. I know we could achieve much
better performance using DMA, but we have several products where we simply
don't have DMA available - this requires searching for other paths.

My ioremap_wt() could help in these situations, at least increasing
performance for non-DMA operation to a not-that-bad level.

I don't know if other devices could benefit from this, but we certainly
have several IPs that would; those are not upstream yet, we're
still working on this.

Michael

^ permalink raw reply

* RE: PCIe Access - achieve bursts without DMA
From: David Laight @ 2014-02-03 10:17 UTC (permalink / raw)
  To: 'Michael Moese', David Hawkins; +Cc: linuxppc-dev@lists.ozlabs.org
In-Reply-To: <20140203082050.GB1970@localhost.intra.men.de>

From: Michael Moese
> Thank you for your help - we might be satisfied with the achieved
> 18 MB/s.

We achieved about twice that using the PEX dma controller.
I found the following comment I wrote:

/* Long transfer requests are cut into smaller DMA requests.
 * Each PCIe request can contain a maximum of 128 bytes, but the
 * dma engine can have multiple PCIe requests outstanding and this
 * speeds things up somewhat (50ns/byte with 128, 24ns/byte with 1024).
 * 1k is somewhere near the point of diminishing returns. */

Those times would include a system call.
The transfers were done through a simple driver that converted pread()
and pwrite() requests into accesses to the board's memory.
The non-dma versions are just copy_to/from_user() directly between
the PCIe and user buffers.
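
For comparison with the MB/s figures quoted elsewhere in this thread, the
per-byte times in the comment above can be converted to throughput. This is
only a rough sketch: the helper name ns_per_byte_to_mb_per_s() is made up
for illustration, and 1 MB is taken to be 10^6 bytes.

```c
/*
 * Hypothetical helper: convert a cost in ns/byte into MB/s,
 * taking 1 MB as 10^6 bytes.
 */
static double ns_per_byte_to_mb_per_s(double ns_per_byte)
{
	/* 1e9 ns per second / (ns per byte) = bytes per second; / 1e6 = MB/s */
	return 1000.0 / ns_per_byte;
}
```

With the numbers from the comment, 50 ns/byte comes out at 20 MB/s and
24 ns/byte at a little under 42 MB/s, which fits "about twice" the 18 MB/s
reported for the write-through mapping.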

Your 3MB/s for single word transfers is similar to what we saw.
Cycle times that make an ISA bus look fast.

	David

^ permalink raw reply

* Re: [RFC PATCH] powerpc: add ioremap_wt
From: Gabriel Paubert @ 2014-02-03  8:32 UTC (permalink / raw)
  To: Michael Moese; +Cc: Paul Mackerras, linuxppc-dev, linux-kernel
In-Reply-To: <1391411809-1845-1-git-send-email-michael.moese@men.de>

On Mon, Feb 03, 2014 at 08:16:49AM +0100, Michael Moese wrote:
> Allow for IO memory to be mapped cacheable for performing
> PCI read bursts.
> 
> Signed-off-by: Michael Moese <michael.moese@men.de>
> ---
>  arch/powerpc/include/asm/io.h | 3 +++
>  arch/powerpc/mm/pgtable_32.c  | 8 ++++++++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
> index 45698d5..9591fff 100644
> --- a/arch/powerpc/include/asm/io.h
> +++ b/arch/powerpc/include/asm/io.h
> @@ -631,6 +631,8 @@ static inline void iosync(void)
>   *
>   * * ioremap_wc enables write combining
>   *
> + * * ioremap_wc enables write thru

Typo: _wc -> _wt

Looks fine in principle, but there is a significant difference with wc
on x86, where read accesses always go to the bus (no read caching).

	Gabriel

> + *
>   * * iounmap undoes such a mapping and can be hooked
>   *
>   * * __ioremap_at (and the pending __iounmap_at) are low level functions to
> @@ -652,6 +654,7 @@ extern void __iomem *ioremap(phys_addr_t address, unsigned long size);
>  extern void __iomem *ioremap_prot(phys_addr_t address, unsigned long size,
>  				  unsigned long flags);
>  extern void __iomem *ioremap_wc(phys_addr_t address, unsigned long size);
> +extern void __iomem *ioremap_wt(phys_addr_t address, unsigned long size);
>  #define ioremap_nocache(addr, size)	ioremap((addr), (size))
>  
>  extern void iounmap(volatile void __iomem *addr);
> diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
> index 51f8795..9ab0a54 100644
> --- a/arch/powerpc/mm/pgtable_32.c
> +++ b/arch/powerpc/mm/pgtable_32.c
> @@ -141,6 +141,14 @@ ioremap_wc(phys_addr_t addr, unsigned long size)
>  EXPORT_SYMBOL(ioremap_wc);
>  
>  void __iomem *
> +ioremap_wt(phys_addr_t addr, unsigned long size)
> +{
> +	return __ioremap_caller(addr, size, _PAGE_WRITETHRU,
> +				__builtin_return_address(0));
> +}
> +EXPORT_SYMBOL(ioremap_wt);
> +
> +void __iomem *
>  ioremap_prot(phys_addr_t addr, unsigned long size, unsigned long flags)
>  {
>  	/* writeable implies dirty for kernel addresses */
> -- 
> 1.8.5.3
> 

^ permalink raw reply

* Re: PCIe Access - achieve bursts without DMA
From: Michael Moese @ 2014-02-03  8:20 UTC (permalink / raw)
  To: David Hawkins; +Cc: linuxppc-dev, Moese, Michael
In-Reply-To: <52EC2F46.7000609@ovro.caltech.edu>

On Fri, Jan 31, 2014 at 03:18:30PM -0800, David Hawkins wrote:
> 1. Peripheral board DMA (board-to-board)
> 2. Peripheral board DMA to host memory.
> 3. Host (root complex) DMA.
> 
> As far as "verification" of your custom peripheral board FPGA IP is
> concerned, if I was a customer, and you had data for (1) and (2),
> I'd be pretty happy (and couldn't care less about (3), since it's so
> system dependent).

Usually I would totally agree with you and try to implement the benchmark
using DMA transfers. Unfortunately, we have some boards and IP cores that
do not support DMA transfers, or the target system is required not to use
DMA, and as I have no influence on these, I had to investigate
how to improve my throughput.

I submitted an RFC patch earlier today, which allowed me to perform
PCIe read bursts on IO memory, achieving 18 MB/s instead of the 3 MB/s
I got when using non-cached reads. However, I had to ioremap() my
memory, like Gabriel said, using write-thru configuration.

> Since it's an FPGA-based IP, I'd also expect to see a PCIe simulation
> with Bus Functional Models showing what the optimal performance of
> your IP was, and then how it nicely matches with the measurements
> in (1). If you do not have a PCIe logic analyzer, both Xilinx and
> Altera have Chipscope/SignalTap logic analyzers that can be used
> for tracing traffic at the TLP layer inside the FPGA.

Of course our IP developers do simulation and analysis, and we have PCI
and PCIe analyzers and all other equipment one might need. However,
we've seen that not only on PowerPC but also on x86, performing real
bursts is not intuitive.

Thank you for your help - we might be satisfied with the achieved 
18 MB/s.

Michael

^ permalink raw reply

* [RFC PATCH] powerpc: add ioremap_wt
From: Michael Moese @ 2014-02-03  7:16 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras
  Cc: linuxppc-dev, linux-kernel, Michael Moese

Allow for IO memory to be mapped cacheable for performing
PCI read bursts.

Signed-off-by: Michael Moese <michael.moese@men.de>
---
 arch/powerpc/include/asm/io.h | 3 +++
 arch/powerpc/mm/pgtable_32.c  | 8 ++++++++
 2 files changed, 11 insertions(+)

diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index 45698d5..9591fff 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -631,6 +631,8 @@ static inline void iosync(void)
  *
  * * ioremap_wc enables write combining
  *
+ * * ioremap_wc enables write thru
+ *
  * * iounmap undoes such a mapping and can be hooked
  *
  * * __ioremap_at (and the pending __iounmap_at) are low level functions to
@@ -652,6 +654,7 @@ extern void __iomem *ioremap(phys_addr_t address, unsigned long size);
 extern void __iomem *ioremap_prot(phys_addr_t address, unsigned long size,
 				  unsigned long flags);
 extern void __iomem *ioremap_wc(phys_addr_t address, unsigned long size);
+extern void __iomem *ioremap_wt(phys_addr_t address, unsigned long size);
 #define ioremap_nocache(addr, size)	ioremap((addr), (size))
 
 extern void iounmap(volatile void __iomem *addr);
diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index 51f8795..9ab0a54 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -141,6 +141,14 @@ ioremap_wc(phys_addr_t addr, unsigned long size)
 EXPORT_SYMBOL(ioremap_wc);
 
 void __iomem *
+ioremap_wt(phys_addr_t addr, unsigned long size)
+{
+	return __ioremap_caller(addr, size, _PAGE_WRITETHRU,
+				__builtin_return_address(0));
+}
+EXPORT_SYMBOL(ioremap_wt);
+
+void __iomem *
 ioremap_prot(phys_addr_t addr, unsigned long size, unsigned long flags)
 {
 	/* writeable implies dirty for kernel addresses */
-- 
1.8.5.3

^ permalink raw reply related

* Re: [git pull] Please pull powerpc.git next branch
From: Michael Ellerman @ 2014-02-03  3:00 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Olaf Hering, Linus Torvalds, linuxppc-dev, Linux Kernel list
In-Reply-To: <4555187.D5eRSF5r8x@mexican>

On Wed, 2014-01-29 at 13:29 +1100, Alistair Popple wrote:
> Looks like I missed the dart iommu code when changing the iommu table
> initialisation. The patch below should fix it, would you mind testing
> it Ben? Thanks.

Any reason not to add the following to save ourselves in future?

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d773dd4..6ab7b53 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -657,6 +657,8 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
        unsigned int i;
        struct iommu_pool *p;
 
+       BUG_ON(!tbl->it_page_shift);
+
        /* number of bytes needed for the bitmap */
        sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
 

cheers

^ permalink raw reply related

* Re: [PATCH 6/8] powerpc/perf: add support for the hv gpci (get performance counter info) interface
From: Michael Ellerman @ 2014-02-01  5:58 UTC (permalink / raw)
  To: Cody P Schafer, Linux PPC
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Cody P Schafer
In-Reply-To: <1389916434-2288-7-git-send-email-cody@linux.vnet.ibm.com>

On Thu, 2014-16-01 at 23:53:52 UTC, Cody P Schafer wrote:
> This provides a basic link between perf and hv_gpci. Notably, it does
> not yet support transactions and does not list any events (they can
> still be manually composed).

What are the plans for listing?

The manual compose is nice but pretty hairy to use in practice I would think.

> diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c
> new file mode 100644
> index 0000000..31d9d59
> --- /dev/null
> +++ b/arch/powerpc/perf/hv-gpci.c
> @@ -0,0 +1,235 @@
> +/*
> + * Hypervisor supplied "gpci" ("get performance counter info") performance
> + * counter support
> + *
> + * Author: Cody P Schafer <cody@linux.vnet.ibm.com>
> + * Copyright 2014 IBM Corporation.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +#define pr_fmt(fmt) "hv-gpci: " fmt
> +
> +#include <linux/module.h>
> +#include <linux/perf_event.h>
> +#include <asm/firmware.h>
> +#include <asm/hvcall.h>
> +#include <asm/hv_gpci.h>
> +#include <asm/io.h>

Needed?

> +/* See arch/powerpc/include/asm/hv_gpci.h for details on the hcall interface */
> +
> +PMU_RANGE_ATTR(request, config, 0, 31); /* u32 */
> +PMU_RANGE_ATTR(starting_index, config, 32, 63); /* u32 */
> +PMU_RANGE_ATTR(secondary_index, config1, 0, 15); /* u16 */
> +PMU_RANGE_ATTR(counter_info_version, config1, 16, 23); /* u8 */
> +PMU_RANGE_ATTR(length, config1, 24, 31); /* u8, bytes of data (1-8) */
> +PMU_RANGE_ATTR(offset, config1, 32, 63); /* u32, byte offset */
> +
> +static struct attribute *format_attr[] = {
> +	&format_attr_request.attr,
> +	&format_attr_starting_index.attr,
> +	&format_attr_secondary_index.attr,
> +	&format_attr_counter_info_version.attr,
> +

Lonely blank line.

> +	&format_attr_offset.attr,
> +	&format_attr_length.attr,
> +	NULL,
> +};
> +
> +static struct attribute_group format_group = {
> +	.name = "format",
> +	.attrs = format_attr,
> +};
> +
> +static const struct attribute_group *attr_groups[] = {
> +	&format_group,
> +	NULL,
> +};
> +
> +static unsigned long single_gpci_request(u32 req, u32 starting_index,
> +		u16 secondary_index, u8 version_in, u32 offset, u8 length,
> +		u64 *value)

Passing the event and extracting the values in here would be neater IMHO.
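
As a rough userspace illustration of that idea, the fields could be decoded
from the raw config words in one place. The bit ranges mirror the
PMU_RANGE_ATTR() lines quoted above, assuming bit 0 is the least significant
bit; struct gpci_request, bits() and gpci_decode() are names invented for
this sketch and are not part of the patch.

```c
#include <stdint.h>

/* Hypothetical container for the decoded fields (not in the patch). */
struct gpci_request {
	uint32_t request;
	uint32_t starting_index;
	uint16_t secondary_index;
	uint8_t  counter_info_version;
	uint8_t  length;
	uint32_t offset;
};

/* Extract bits lo..hi (inclusive, bit 0 = least significant) from v. */
static uint64_t bits(uint64_t v, unsigned int lo, unsigned int hi)
{
	unsigned int width = hi - lo + 1;
	/* Avoid an undefined 64-bit shift when the range covers the whole word. */
	uint64_t mask = (width >= 64) ? ~0ULL : ((1ULL << width) - 1);

	return (v >> lo) & mask;
}

/* Decode both config words once, so callers pass a single struct around. */
static struct gpci_request gpci_decode(uint64_t config, uint64_t config1)
{
	struct gpci_request r = {
		.request              = (uint32_t)bits(config, 0, 31),
		.starting_index       = (uint32_t)bits(config, 32, 63),
		.secondary_index      = (uint16_t)bits(config1, 0, 15),
		.counter_info_version = (uint8_t)bits(config1, 16, 23),
		.length               = (uint8_t)bits(config1, 24, 31),
		.offset               = (uint32_t)bits(config1, 32, 63),
	};

	return r;
}
```

In the driver the two words would come from event->attr.config and
event->attr.config1, so the request function would only need the event.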

> +{
> +	unsigned long ret;
> +	size_t i;
> +	u64 count;
> +
> +	struct {
> +		struct hv_get_perf_counter_info_params params;
> +		union {
> +			union h_gpci_cvs data;
> +			uint8_t bytes[sizeof(union h_gpci_cvs)];
> +		};
> +	} arg = {
> +		.params = {
> +			.counter_request = cpu_to_be32(req),
> +			.starting_index = cpu_to_be32(starting_index),
> +			.secondary_index = cpu_to_be16(secondary_index),
> +			.counter_info_version_in = version_in,
> +		}
> +	};
> +
> +	ret = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO,
> +			virt_to_phys(&arg), sizeof(arg));
> +	if (ret) {
> +		pr_devel("hcall failed: 0x%lx\n", ret);
> +		return ret;
> +	}
> +
> +	/*
> +	 * we verify offset and length are within the zeroed buffer at event
> +	 * init.
> +	 */
> +	count = 0;
> +	for (i = offset; i < offset + length; i++)
> +		count |= arg.bytes[i] << (i - offset);
> +
> +	*value = count;
> +	return ret;
> +}
> +
> +static u64 h_gpci_get_value(struct perf_event *event)
> +{
> +	u64 count;
> +	unsigned long ret = single_gpci_request(event_get_request(event),
> +					event_get_starting_index(event),
> +					event_get_secondary_index(event),
> +					event_get_counter_info_version(event),
> +					event_get_offset(event),
> +					event_get_length(event),
> +					&count);
> +	if (ret)
> +		return 0;
> +	return count;
> +}
> +
> +static void h_gpci_event_update(struct perf_event *event)
> +{
> +	s64 prev;
> +	u64 now = h_gpci_get_value(event);
> +	prev = local64_xchg(&event->hw.prev_count, now);
> +	local64_add(now - prev, &event->count);
> +}
> +
> +static void h_gpci_event_start(struct perf_event *event, int flags)
> +{
> +	local64_set(&event->hw.prev_count, h_gpci_get_value(event));
> +	perf_swevent_start_hrtimer(event);
> +}
> +
> +static void h_gpci_event_stop(struct perf_event *event, int flags)
> +{
> +	perf_swevent_cancel_hrtimer(event);
> +	h_gpci_event_update(event);
> +}
> +
> +static int h_gpci_event_add(struct perf_event *event, int flags)
> +{
> +	if (flags & PERF_EF_START)
> +		h_gpci_event_start(event, flags);
> +
> +	return 0;
> +}
> +
> +static void h_gpci_event_del(struct perf_event *event, int flags)
> +{
> +	h_gpci_event_stop(event, flags);
> +}

Can just hook del directly no?

> +static void h_gpci_event_read(struct perf_event *event)
> +{
> +	h_gpci_event_update(event);
> +}

Ditto.

> +static int h_gpci_event_init(struct perf_event *event)
> +{
> +	u64 count;
> +	u8 length;
> +
> +	/* Not our event */
> +	if (event->attr.type != event->pmu->type)
> +		return -ENOENT;

I don't understand why you need this?

> +	/* config2 is unused */
> +	if (event->attr.config2)
> +		return -EINVAL;

You must also check the reserved regions of config and config1.


> +	/* unsupported modes and filters */
> +	if (event->attr.exclude_user   ||
> +	    event->attr.exclude_kernel ||
> +	    event->attr.exclude_hv     ||
> +	    event->attr.exclude_idle   ||
> +	    event->attr.exclude_host   ||
> +	    event->attr.exclude_guest  ||
> +	    is_sampling_event(event)) /* no sampling */

I think you should also check sample_type.

> +		return -EINVAL;

Have you thought about inherit, pinned, exclusive?

> +
> +	/* no branch sampling */
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
> +	length = event_get_length(event);
> +	if (length < 1 || length > 8)
> +		return -EINVAL;
> +
> +	/* last byte within the buffer? */
> +	if ((event_get_offset(event) + length) > sizeof(union h_gpci_cvs))
> +		return -EINVAL;
> +
> +	/* check if the request works... */
> +	if (single_gpci_request(event_get_request(event),
> +				event_get_starting_index(event),
> +				event_get_secondary_index(event),
> +				event_get_counter_info_version(event),
> +				event_get_offset(event),
> +				length,
> +				&count))
> +		return -EINVAL;
> +
> +	/*
> +	 * Some of the events are per-cpu, some per-core, some per-chip, some
> +	 * are global, and some access data from other virtual machines on the
> +	 * same physical machine. We can't map the cpu value without a lot of
> +	 * work. Instead, we pick an arbitrary cpu for all events on this pmu.
> +	 */
> +	event->cpu = 0;

OK, but is having them all on cpu zero a good idea?

> +	perf_swevent_init_hrtimer(event);
> +	return 0;
> +}
> +
> +struct pmu h_gpci_pmu = {
> +	.task_ctx_nr = perf_invalid_context,
> +
> +	.name = "hv_gpci",
> +	.attr_groups = attr_groups,
> +	.event_init = h_gpci_event_init,
> +	.add = h_gpci_event_add,
> +	.del = h_gpci_event_del,
> +	.start = h_gpci_event_start,
> +	.stop = h_gpci_event_stop,
> +	.read = h_gpci_event_read,

Nice to have them align vertically.

> +	.event_idx = perf_swevent_event_idx,
> +};
> +
> +static int hv_gpci_init(void)
> +{
> +	int r;
> +
> +	if (!firmware_has_feature(FW_FEATURE_LPAR)) {
> +		pr_info("Not running under phyp, not supported\n");

If only it was that simple :)

You'll see FW_FEATURE_LPAR in a KVM guest too.

There are at least two mechanisms for FW to indicate the presence of features,
"ibm,hypertas-functions" and "ibm,architecture-vec-5".

If HGPCI is not exposed in either of those then we'd want to do a probe hcall
here to try and detect it at runtime.


> +		return -ENODEV;
> +	}
> +
> +	r = perf_pmu_register(&h_gpci_pmu, h_gpci_pmu.name, -1);
> +	if (r)
> +		return r;
> +
> +	return 0;
> +}
> +
> +module_init(hv_gpci_init);

This is not modular code so you're discouraged from using module_init(),
arch_initcall() would probably be fine.

cheers

^ permalink raw reply

* Re: [PATCH 5/8] powerpc: add 24x7 interface header
From: Michael Ellerman @ 2014-02-01  5:58 UTC (permalink / raw)
  To: Cody P Schafer, Linux PPC
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Cody P Schafer
In-Reply-To: <1389916434-2288-6-git-send-email-cody@linux.vnet.ibm.com>

On Thu, 2014-16-01 at 23:53:51 UTC, Cody P Schafer wrote:
> 24x7 (also called hv_24x7 or H_24X7) is an interface to obtain
> performance counters from the hypervisor. These counters do not have a
> fixed format/position and are instead documented in a "24x7 Catalog",
> which is provided by the hypervisor (that interface is also documented
> in this header).
> 
> This method of obtaining performance counters from the hypervisor is
> intended to partially replace the gpci interface.

Same comments as for the previous patch.

cheers

^ permalink raw reply

* Re: [PATCH 4/8] powerpc: add hv_gpci interface header
From: Michael Ellerman @ 2014-02-01  5:58 UTC (permalink / raw)
  To: Cody P Schafer, Linux PPC
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Cody P Schafer
In-Reply-To: <1389916434-2288-5-git-send-email-cody@linux.vnet.ibm.com>

On Thu, 2014-16-01 at 23:53:50 UTC, Cody P Schafer wrote:
> "H_GetPerformanceCounterInfo" (referred to as hv_gpci or just gpci from
> here on) is an interface to retrieve specific performance counters and
> other data from the hypervisor. All outputs have a fixed format (and
> are represented as structs in this patch).

So how much of this are we actually using? A lot of these seem to be only used
in the union at the bottom of this file, and not touched elsewhere - or am I
missing something subtle?

Some of it doesn't seem to be used at all?
 
> diff --git a/arch/powerpc/include/asm/hv_gpci.h b/arch/powerpc/include/asm/hv_gpci.h

Any reason this can't just live in arch/powerpc/perf ?

> +++ b/arch/powerpc/include/asm/hv_gpci.h
> @@ -0,0 +1,490 @@
> +#ifndef LINUX_POWERPC_UAPI_HV_GPCI_H_
> +#define LINUX_POWERPC_UAPI_HV_GPCI_H_
> +
> +#include <linux/types.h>
> +
> +/* From the document "H_GetPerformanceCounterInfo Interface" v1.06, partially
> + * updated with v1.07 */

Is that public?

> +
> +/* H_GET_PERF_COUNTER_INFO argument */
> +struct hv_get_perf_counter_info_params {
> +	__be32 counter_request; /* I */
> +	__be32 starting_index;  /* IO */
> +	__be16 secondary_index; /* IO */
> +	__be16 returned_values; /* O */
> +	__be32 detail_rc; /* O, "only for 32bit clients" */
> +
> +	/*
> +	 * O, size of each counter_value element in bytes, only set for version
> +	 * >= 0x3
> +	 */
> +	__be16 cv_element_size;
> +
> +	/* I, funny if version < 0x3 */

Funny how? Or better still, do we only support operating on some minimum
sane version of the API?

> +	__u8 counter_info_version_in;
> +
> +	/* O, funny if version < 0x3 */
> +	__u8 counter_info_version_out;
> +	__u8 reserved[0xC];
> +	__u8 counter_value[];
> +} __packed;
> +
> +/* 8 => power8 (1.07)
> + * 6 => TLBIE  (1.07)
> + * 5 => (1.05)
> + * 4 => ?
> + * 3 => ?
> + * 2 => v7r7m0.phyp (?)
> + * 1 => v7r6m0.phyp (?)
> + * 0 => v7r{2,3,4}m0.phyp (?)
> + */

I think this is a mapping of version numbers to firmware releases, it should
say so.

> +#define COUNTER_INFO_VERSION_CURRENT 0x8
> +
> +/* these determine the counter_value[] layout and the meaning of starting_index
> + * and secondary_index */

Needs: leading capital, full stop, block comment.

> +enum counter_info_requests {
> +
> +	/* GENERAL */
> +
> +	/* @starting_index: "starting" physical processor index or -1 for

Why '"starting"' ?

> +	 *                  current physical processor. Data is only collected
> +	 *                  for the processors' "primary" thread.
> +	 * @secondary_index: unused

This seems to be true in all cases at least for this enum, can we drop it?

> +	 */
> +	CIR_dispatch_timebase_by_processor = 0x10,

Any reason for the weird capitalisation? You've obviously learnt the
noCamelCase rule, but this is still a bit odd :)

> +
> +	/* @starting_index: starting partition id or -1 for the current logical
> +	 *                  partition (virtual machine).
> +	 * @secondary_index: unused
> +	 */
> +	CIR_entitled_capped_uncapped_donated_idle_timebase_by_partition = 0x20,
> +
> +	/* @starting_index: starting partition id or -1 for the current logical
> +	 *                  partition (virtual machine).
> +	 * @secondary_index: unused
> +	 */
> +	CIR_run_instructions_run_cycles_by_partition = 0x30,
> +
> +	/* @starting_index: must be -1 (to refer to the current partition)
> +	 * @secondary_index: unused
> +	 */
> +	CIR_system_performance_capabilities = 0x40,
> +
> +
> +	/* Data from this should only be considered valid if
> +	 * counter_info_version >= 0x3
> +	 * @starting_index: starting hardware chip id or -1 for the current hw
> +	 *		    chip id
> +	 * @secondary_index: unused
> +	 */
> +	CIR_processor_bus_utilization_abc_links = 0x50,
> +
> +	/* Data from this should only be considered valid if
> +	 * counter_info_version >= 0x3
> +	 * @starting_index: starting hardware chip id or -1 for the current hw
> +	 *		    chip id
> +	 * @secondary_index: unused
> +	 */
> +	CIR_processor_bus_utilization_wxyz_links = 0x60,
> +
> +
> +	/* EXPANDED */

??

These are only available if you have the DLC ?

> +	/* Available if counter_info_version >= 0x3
> +	 * @starting_index: starting hardware chip id or -1 for the current hw
> +	 *		    chip id
> +	 * @secondary_index: unused
> +	 */
> +	CIR_processor_bus_utilization_gx_links = 0x70,
> +
> +	/* Available if counter_info_version >= 0x3
> +	 * @starting_index: starting hardware chip id or -1 for the current hw
> +	 *		    chip id
> +	 * @secondary_index: unused
> +	 */
> +	CIR_processor_bus_utilization_mc_links = 0x80,
> +
> +	/* Available if counter_info_version >= 0x3
> +	 * @starting_index: starting physical processor or -1 for the current
> +	 *                  physical processor
> +	 * @secondary_index: unused
> +	 */
> +	CIR_processor_config = 0x90,
> +
> +	/* Available if counter_info_version >= 0x3
> +	 * @starting_index: starting physical processor or -1 for the current
> +	 *                  physical processor
> +	 * @secondary_index: unused
> +	 */
> +	CIR_current_processor_frequency = 0x91,
> +
> +	CIR_processor_core_utilization = 0x94,
> +
> +	CIR_processor_core_power_mode = 0x95,
> +
> +	CIR_affinity_domain_information_by_virutal_processor = 0xA0,
> +
> +	CIR_affinity_domain_info_by_domain = 0xB0,
> +
> +	CIR_affinity_domain_info_by_partition = 0xB1,
> +
> +	/* @starting_index: unused
> +	 * @secondary_index: unused
> +	 */
> +	CIR_physical_memory_info = 0xC0,
> +
> +	CIR_processor_bus_topology = 0xD0,
> +
> +	CIR_partition_hypervisor_queuing_times = 0xE0,
> +
> +	CIR_system_hypervisor_times = 0xF0,
> +
> +	/* LAB */
> +
> +	CIR_set_mmcrh = 0x80001000,
> +	CIR_get_hpmcx = 0x80002000,
> +};


cheers

^ permalink raw reply

* Re: [PATCH 3/8] powerpc: add hvcalls for 24x7 and gpci (get performance counter info)
From: Michael Ellerman @ 2014-02-01  5:58 UTC (permalink / raw)
  To: Cody P Schafer, Linux PPC
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Cody P Schafer
In-Reply-To: <1389916434-2288-4-git-send-email-cody@linux.vnet.ibm.com>

On Thu, 2014-16-01 at 23:53:49 UTC, Cody P Schafer wrote:
> Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/hvcall.h | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
> index d8b600b..48d6efa 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -269,11 +269,15 @@
>  #define H_COP			0x304
>  #define H_GET_MPP_X		0x314
>  #define H_SET_MODE		0x31C
> -#define MAX_HCALL_OPCODE	H_SET_MODE
> +#define H_GET_24X7_CATALOG_PAGE 0xF078
> +#define H_GET_24X7_DATA		0xF07C
> +#define H_GET_PERF_COUNTER_INFO 0xF080

Ugh, why the hell did they put them up there.

> +#define MAX_HCALL_OPCODE	H_GET_PERF_COUNTER_INFO

We have an array which is sized based on this, which is unpleasant.

I think you're better off putting these below in the platform specific section,
and leaving MAX_HCALL_OPCODE alone. The only downside is you can't use the
hcall tracing to see them.
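
To put a number on that, here is a back-of-the-envelope sketch. It assumes
the array has one slot per possible opcode and that opcodes step by 4, so
the slot count is (MAX_HCALL_OPCODE >> 2) + 1; hcall_slots() is a name made
up for this illustration, and the per-slot size is left out.

```c
/* Opcode values taken from the quoted diff. */
#define H_SET_MODE		0x31C
#define H_GET_PERF_COUNTER_INFO	0xF080

/* Hypothetical sizing rule: one slot per opcode, opcodes step by 4. */
static unsigned long hcall_slots(unsigned long max_opcode)
{
	return (max_opcode >> 2) + 1;
}
```

Under that assumption the array would grow from 200 slots to 15393, nearly
all of them unused.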

>  /* Platform specific hcalls, used by KVM */
>  #define H_RTAS			0xf000


cheers

^ permalink raw reply

* Re: [PATCH 2/8] perf core: export swevent hrtimer helpers
From: Michael Ellerman @ 2014-02-01  5:58 UTC (permalink / raw)
  To: Cody P Schafer, Linux PPC
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Cody P Schafer
In-Reply-To: <1389916434-2288-3-git-send-email-cody@linux.vnet.ibm.com>

Peter, Ingo, can we get your ACK on this please?

cheers


On Thu, 2014-16-01 at 23:53:48 UTC, Cody P Schafer wrote:
> Export the swevent hrtimer helpers currently only used in events/core.c
> to allow the addition of architecture specific sw-like pmus.

> Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
> ---
>  include/linux/perf_event.h | 5 ++++-
>  kernel/events/core.c       | 8 ++++----
>  2 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 8646e33..c5bc71a 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -558,7 +558,10 @@ extern void perf_pmu_migrate_context(struct pmu *pmu,
>  				int src_cpu, int dst_cpu);
>  extern u64 perf_event_read_value(struct perf_event *event,
>  				 u64 *enabled, u64 *running);
> -
> +extern void perf_swevent_init_hrtimer(struct perf_event *event);
> +extern void perf_swevent_start_hrtimer(struct perf_event *event);
> +extern void perf_swevent_cancel_hrtimer(struct perf_event *event);
> +extern int perf_swevent_event_idx(struct perf_event *event);
>  
>  struct perf_sample_data {
>  	u64				type;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index f574401..d881d1e 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5801,7 +5801,7 @@ static int perf_swevent_init(struct perf_event *event)
>  	return 0;
>  }
>  
> -static int perf_swevent_event_idx(struct perf_event *event)
> +int perf_swevent_event_idx(struct perf_event *event)
>  {
>  	return 0;
>  }
> @@ -6030,7 +6030,7 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
>  	return ret;
>  }
>  
> -static void perf_swevent_start_hrtimer(struct perf_event *event)
> +void perf_swevent_start_hrtimer(struct perf_event *event)
>  {
>  	struct hw_perf_event *hwc = &event->hw;
>  	s64 period;
> @@ -6052,7 +6052,7 @@ static void perf_swevent_start_hrtimer(struct perf_event *event)
>  				HRTIMER_MODE_REL_PINNED, 0);
>  }
>  
> -static void perf_swevent_cancel_hrtimer(struct perf_event *event)
> +void perf_swevent_cancel_hrtimer(struct perf_event *event)
>  {
>  	struct hw_perf_event *hwc = &event->hw;
>  
> @@ -6064,7 +6064,7 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
>  	}
>  }
>  
> -static void perf_swevent_init_hrtimer(struct perf_event *event)
> +void perf_swevent_init_hrtimer(struct perf_event *event)
>  {
>  	struct hw_perf_event *hwc = &event->hw;
>  
> -- 
> 1.8.5.2
> 

^ permalink raw reply

* Re: [PATCH 1/8] perf: add PMU_RANGE_ATTR() helper for use by sw-like pmus
From: Michael Ellerman @ 2014-02-01  5:58 UTC (permalink / raw)
  To: Cody P Schafer, Linux PPC
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Cody P Schafer
In-Reply-To: <1389916434-2288-2-git-send-email-cody@linux.vnet.ibm.com>

On Thu, 2014-16-01 at 23:53:47 UTC, Cody P Schafer wrote:
> Add PMU_RANGE_ATTR() and PMU_RANGE_RESV() (for reserved areas) which
> generate functions to extract the relevant bits from
> event->attr.config{,1,2} for use by sw-like pmus where the
> 'config{,1,2}' values don't map directly to hardware registers.

This is neat.

The split of the macros is a bit weird, i.e. PMU_RANGE_RESV() doesn't really do
what its name suggests.

I think you want one macro which creates the accessors, with a name that
reflects that - yeah I can't think of a good one right now, but "event" should
probably be in there because that's what it operates on.

Having a macro for the reserved regions is good, but you MUST actually check
that the reserved regions are zero. Otherwise you are permitting your caller to
pass junk in there and you then can't unreserve them in a future version of
the API.

So I think a macro that gives you a special reserved region routine would be
good, so you can write something like:

  if (event_check_reserved1() || event_check_reserved2())
  	return -EINVAL;
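
A minimal userspace sketch of what such macros might look like follows.
Everything in it is an assumption made for illustration: struct fake_attr
stands in for the config words of the event attr, and the macro names
(EVENT_DEFINE_RANGE, EVENT_DEFINE_RESERVED), the field names and the bit
ranges are all made up, not taken from the patch.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the config words of an event's attr. */
struct fake_attr {
	uint64_t config, config1, config2;
};

/* Mask covering bits [lo, hi] inclusive, with 0 <= lo <= hi <= 63. */
#define RANGE_MASK(lo, hi) \
	((~0ULL >> (63 - (hi))) & (~0ULL << (lo)))

/* Generate event_get_<name>(): extract a field from one config word. */
#define EVENT_DEFINE_RANGE(name, word, lo, hi)				\
static inline uint64_t event_get_##name(const struct fake_attr *a)	\
{									\
	return (a->word & RANGE_MASK(lo, hi)) >> (lo);			\
}

/* Generate event_check_<name>(): nonzero if any reserved bit is set. */
#define EVENT_DEFINE_RESERVED(name, word, lo, hi)			\
static inline int event_check_##name(const struct fake_attr *a)	\
{									\
	return (a->word & RANGE_MASK(lo, hi)) != 0;			\
}

/* Made-up layout: two real fields and two reserved regions. */
EVENT_DEFINE_RANGE(domain, config, 0, 3)
EVENT_DEFINE_RESERVED(reserved1, config, 4, 31)
EVENT_DEFINE_RANGE(offset, config, 32, 63)
EVENT_DEFINE_RESERVED(reserved2, config1, 0, 63)

/* Reject events that pass junk in the reserved regions. */
static int fake_event_init(const struct fake_attr *a)
{
	if (event_check_reserved1(a) || event_check_reserved2(a))
		return -EINVAL;
	return 0;
}
```

With that layout, a config of 0x0000000500000003 is accepted (domain 3,
offset 5), while setting bit 4 of config or any bit of config1 gets -EINVAL.
Rejecting nonzero reserved bits up front is what leaves room to give them a
meaning later without breaking existing callers.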

cheers

^ permalink raw reply

* [PATCH v2] powerpc: Add cpu family documentation
From: Michael Ellerman @ 2014-02-01  4:35 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Stephen Rothwell

This patch adds some documentation on the different cpu families
supported by arch/powerpc.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
v2: Reworked formatting to avoid wrapping.
    Fixed up Freescale details.


 Documentation/powerpc/cpu_families.txt | 227 +++++++++++++++++++++++++++++++++
 1 file changed, 227 insertions(+)
 create mode 100644 Documentation/powerpc/cpu_families.txt

diff --git a/Documentation/powerpc/cpu_families.txt b/Documentation/powerpc/cpu_families.txt
new file mode 100644
index 0000000..fa4f159
--- /dev/null
+++ b/Documentation/powerpc/cpu_families.txt
@@ -0,0 +1,227 @@
+CPU Families
+============
+
+This document tries to summarise some of the different cpu families that exist
+and are supported by arch/powerpc.
+
+
+Book3S (aka sPAPR)
+------------------
+
+ - Hash MMU
+ - Mix of 32 & 64 bit
+
+   +--------------+                              +----------------+
+   |  Old POWER   | ---------------------------> | RS64 (threads) |
+   +--------------+                              +----------------+
+          |
+          |
+          v
+   +--------------+                              +----------------+    +-------+
+   |     601      | ---------------------------> |      603       | -> |  740  |
+   +--------------+                              +----------------+    +-------+
+          |                                              |
+          |                                              |
+          v                                              v
+   +--------------+                              +----------------+    +-------+
+   |     604      |                              |    750 (G3)    | -> | 750CX |
+   +--------------+                              +----------------+    +-------+
+          |                                              |                 |
+          |                                              |                 |
+          v                                              v                 v
+   +--------------+                              +----------------+    +-------+
+   | 620 (64 bit) |                              |      7400      |    | 750CL |
+   +--------------+                              +----------------+    +-------+
+          |                                              |                 |
+          |                                              |                 |
+          v                                              v                 v
+   +--------------+                              +----------------+    +-------+
+   |  POWER3/630  |                              |      7410      |    | 750FX |
+   +--------------+                              +----------------+    +-------+
+          |                                              |
+          |                                              |
+          v                                              v
+   +--------------+                              +----------------+
+   |   POWER3+    |                              |      7450      |
+   +--------------+                              +----------------+
+          |                                              |
+          |                                              |
+          v                                              v
+   +--------------+                              +----------------+
+   |    POWER4    |                              |      7455      |
+   +--------------+                              +----------------+
+          |                                              |
+          |                                              |
+          v                                              v
+   +--------------+                  +-------+   +----------------+
+   |   POWER4+    | ---------------> |  970  |   |      7447      |
+   +--------------+                  +-------+   +----------------+
+          |                              |               |
+          |                              |               |
+          v                              v               v
+   +--------------+     +-------+    +-------+   +----------------+
+   |    POWER5    | --> | Cell  |    | 970FX |   |      7448      |
+   +--------------+     +-------+    +-------+   +----------------+
+          |                              |
+          |                              |
+          v                              v
+   +--------------+                  +-------+
+   |   POWER5+    |                  | 970MP |
+   +--------------+                  +-------+
+          |
+          |
+          v
+   +--------------+
+   |   POWER5++   |
+   +--------------+
+          |
+          |
+          v
+   +--------------+
+   |    POWER6    |
+   +--------------+
+          |
+          |
+          v
+   +--------------+
+   |    POWER7    |
+   +--------------+
+          |
+          |
+          v
+   +--------------+
+   |   POWER7+    |
+   +--------------+
+          |
+          |
+          v
+   +--------------+
+   |    POWER8    |
+   +--------------+
+
+
+   +---------------+
+   | PA6T (64 bit) |
+   +---------------+
+
+
+IBM BookE
+---------
+
+ - Software loaded TLB.
+ - All 32 bit
+
+   +--------------+
+   |     401      |
+   +--------------+
+          |
+          |
+          v
+   +--------------+
+   |     403      |
+   +--------------+
+          |
+          |
+          v
+   +--------------+
+   |     405      |
+   +--------------+
+          |
+          |
+          v
+   +--------------+
+   |     440      |
+   +--------------+
+          |
+          |
+          v
+   +--------------+     +----------------+
+   |     450      | --> |      BG/P      |
+   +--------------+     +----------------+
+          |
+          |
+          v
+   +--------------+
+   |     460      |
+   +--------------+
+          |
+          |
+          v
+   +--------------+
+   |     476      |
+   +--------------+
+
+
+Motorola/Freescale 8xx
+----------------------
+
+ - Software loaded with hardware assist.
+ - All 32 bit
+
+   +--------------+
+   |     8xx      |
+   +--------------+
+          |
+          |
+          v
+   +--------------+
+   |     850      |
+   +--------------+
+
+
+Freescale BookE
+---------------
+
+ - Software loaded TLB.
+ - e6500 adds HW loaded indirect TLB entries.
+ - Mix of 32 & 64 bit
+
+   +--------------+
+   |     e200     |
+   +--------------+
+
+
+   +--------------------------------+
+   |              e500              |
+   +--------------------------------+
+                   |
+                   |
+                   v
+   +--------------------------------+
+   |             e500v2             |
+   +--------------------------------+
+                   |
+                   |
+                   v
+   +--------------------------------+
+   |             e500mc             |
+   +--------------------------------+
+                   |
+                   |
+                   v
+   +--------------------------------+
+   |    e5500 (Book3e) (64 bit)     |
+   +--------------------------------+
+                   |
+                   |
+                   v
+   +--------------------------------+
+   | e6500 (HW TLB) (Multithreaded) |
+   +--------------------------------+
+
+
+IBM A2 core
+-----------
+
+ - Book3E, software loaded TLB + HW loaded indirect TLB entries.
+ - 64 bit
+
+   +--------------+     +----------------+
+   |   A2 core    | --> |      WSP       |
+   +--------------+     +----------------+
+           |
+           |
+           v
+   +--------------+
+   |     BG/Q     |
+   +--------------+
-- 
1.8.3.2

^ permalink raw reply related

* Re: [PATCH] powerpc: Add cpu family documentation
From: Michael Ellerman @ 2014-02-01  4:28 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: linuxppc-dev
In-Reply-To: <20140130143237.a2bc71327ed4939cd71a734f@canb.auug.org.au>

On Thu, 2014-01-30 at 14:32 +1100, Stephen Rothwell wrote:
> Hi Michael,
> 
> Nice.
> 
> On Thu, 30 Jan 2014 13:38:00 +1100 Michael Ellerman <mpe@ellerman.id.au> wrote:
> >
> > +++ b/Documentation/powerpc/cpu_families.txt
> > @@ -0,0 +1,76 @@
> > +CPU Families
> > +============
> > +
> > +This doco tries to summarise some of the different cpu families that exist and
>         ^^^^
> document
> 
> > +   |            |
> > +   |            *---- [620] --- POWER3/630 --- POWER3+ --- POWER4 --- POWER4+ --- POWER5 --- POWER5+ --- POWER5++ --- POWER6 --- POWER7 --- POWER7+ --- POWER8
> 
> It's a pity that this wraps ...

Yeah it is. I was too lazy to fix it.

New version coming.

cheers

^ permalink raw reply

* Re: [PATCH] powerpc: Add cpu family documentation
From: Michael Ellerman @ 2014-02-01  4:28 UTC (permalink / raw)
  To: Kumar Gala; +Cc: linuxppc-dev
In-Reply-To: <669D726F-25DF-4703-AD30-CAE7CA142970@kernel.crashing.org>

On Fri, 2014-01-31 at 07:32 -0600, Kumar Gala wrote:
> On Jan 29, 2014, at 8:38 PM, Michael Ellerman <mpe@ellerman.id.au> wrote:
> > +Freescale BookE
> > +---------------
> > +
> > + - Software loaded TLB.
> > + - e6500 adds HW loaded indirect TLB entries.
> > + - Mix of 32 & 64 bit
> > +
> > +  e200 --- e500 --- e500v2 --- e500mc --- e5500 --- e6500
> > +                                         (Book3E)  (HW TLB)
> > +                                         (64bit)
> > +
> 
> e200 is its own core family that doesn’t have any relation to e500 line other than being book-e
> 
> might want to add multithreaded to e6500.

Thanks Kumar.

cheers

^ permalink raw reply
