LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH v2 2/2] powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask
From: Gautham R Shenoy @ 2020-06-24 10:56 UTC (permalink / raw)
  To: Kajol Jain; +Cc: nathanl, ego, maddy, suka, anju, linuxppc-dev
In-Reply-To: <20200624101754.169612-3-kjain@linux.ibm.com>

Hi Kajol,

On Wed, Jun 24, 2020 at 03:47:54PM +0530, Kajol Jain wrote:
> Patch here adds a cpumask attr to hv_24x7 pmu along with ABI documentation.
> 
> command:# cat /sys/devices/hv_24x7/cpumask
> 0

Since this sysfs interface is read-only, and the user cannot change
the CPU which will be making the HCALLs to obtain the 24x7 counts,
does the user even need to know if currently CPU X is the one which is
going to make HCALLs to retrive the 24x7 counts ? Does it help in any
kind of trouble-shooting ?

It would have made sense if the interface was read-write, since a user
can set this to a CPU which is not running user applications. This
would help in minimising jitter on those active CPUs running the user
applications.


> 
> Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
> ---
>  .../sysfs-bus-event_source-devices-hv_24x7    |  7 ++++
>  arch/powerpc/perf/hv-24x7.c                   | 36 +++++++++++++++++--
>  2 files changed, 41 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7 b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
> index e8698afcd952..f9dd3755b049 100644
> --- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
> +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
> @@ -43,6 +43,13 @@ Description:	read only
>  		This sysfs interface exposes the number of cores per chip
>  		present in the system.
> 
> +What:		/sys/devices/hv_24x7/cpumask
> +Date:		June 2020
> +Contact:	Linux on PowerPC Developer List <linuxppc-dev@lists.ozlabs.org>
> +Description:	read only
> +		This sysfs file exposes the cpumask which is designated to make
> +		HCALLs to retrive hv-24x7 pmu event counter data.
> +
>  What:		/sys/bus/event_source/devices/hv_24x7/event_descs/<event-name>
>  Date:		February 2014
>  Contact:	Linux on PowerPC Developer List <linuxppc-dev@lists.ozlabs.org>
> diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
> index ce4739e2b407..3c699612d29f 100644
> --- a/arch/powerpc/perf/hv-24x7.c
> +++ b/arch/powerpc/perf/hv-24x7.c
> @@ -448,6 +448,12 @@ static ssize_t device_show_string(struct device *dev,
>  	return sprintf(buf, "%s\n", (char *)d->var);
>  }
> 
> +static ssize_t cpumask_get_attr(struct device *dev,
> +				struct device_attribute *attr, char *buf)
> +{
> +	return cpumap_print_to_pagebuf(true, buf, &hv_24x7_cpumask);
> +}
> +
>  static ssize_t sockets_show(struct device *dev,
>  			    struct device_attribute *attr, char *buf)
>  {
> @@ -1116,6 +1122,17 @@ static DEVICE_ATTR_RO(sockets);
>  static DEVICE_ATTR_RO(chipspersocket);
>  static DEVICE_ATTR_RO(coresperchip);
> 
> +static DEVICE_ATTR(cpumask, S_IRUGO, cpumask_get_attr, NULL);
> +
> +static struct attribute *cpumask_attrs[] = {
> +	&dev_attr_cpumask.attr,
> +	NULL,
> +};
> +
> +static struct attribute_group cpumask_attr_group = {
> +	.attrs = cpumask_attrs,
> +};
> +
>  static struct bin_attribute *if_bin_attrs[] = {
>  	&bin_attr_catalog,
>  	NULL,
> @@ -1143,6 +1160,11 @@ static const struct attribute_group *attr_groups[] = {
>  	&event_desc_group,
>  	&event_long_desc_group,
>  	&if_group,
> +	/*
> +	 * This NULL is a placeholder for the cpumask attr which will update
> +	 * onlyif cpuhotplug registration is successful
> +	 */
> +	NULL,
>  	NULL,
>  };
> 
> @@ -1683,7 +1705,7 @@ static int hv_24x7_cpu_hotplug_init(void)
> 
>  static int hv_24x7_init(void)
>  {
> -	int r;
> +	int r, i = -1;
>  	unsigned long hret;
>  	struct hv_perf_caps caps;
> 
> @@ -1727,8 +1749,18 @@ static int hv_24x7_init(void)
> 
>  	/* init cpuhotplug */
>  	r = hv_24x7_cpu_hotplug_init();
> -	if (r)
> +	if (r) {
>  		pr_err("hv_24x7: CPU hotplug init failed\n");
> +	} else {
> +		/*
> +		 * Cpu hotplug init is successful, add the
> +		 * cpumask file as part of pmu attr group and
> +		 * assign it to very first NULL location.
> +		 */
> +		while (attr_groups[++i])
> +			/* nothing */;
> +		attr_groups[i] = &cpumask_attr_group;
> +	}
> 
>  	r = perf_pmu_register(&h_24x7_pmu, h_24x7_pmu.name, -1);
>  	if (r)
> -- 
> 2.18.2
> 

^ permalink raw reply

* Re: [PATCH v2 1/2] powerpc/perf/hv-24x7: Add cpu hotplug support
From: kajoljain @ 2020-06-24 10:44 UTC (permalink / raw)
  To: ego; +Cc: nathanl, maddy, suka, anju, linuxppc-dev
In-Reply-To: <20200624103609.GC31972@in.ibm.com>



On 6/24/20 4:06 PM, Gautham R Shenoy wrote:
> On Wed, Jun 24, 2020 at 03:47:53PM +0530, Kajol Jain wrote:
>> Patch here adds cpu hotplug functions to hv_24x7 pmu.
>> A new cpuhp_state "CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE" enum
>> is added.
>>
>> The online function update the cpumask only if its NULL.
>       ^^^^^^^^^^^^^^^^^^^^^^                     ^^^^^^^^^
> 	online callback function updates	 it is empty.
> 						 
> 
>> As the primary intention for adding hotplug support
>> is to desiginate a CPU to make HCALL to collect the
>         ^^^^^^^^^^
> 	designate
> 	
Sorry My bad. Thanks for reviewing the patch. Will correct these
mistakes.

Thanks,
Kajol Jain

>> count data.
>>
>> The offline function test and clear corresponding cpu in a cpumask
>> and update cpumask to any other active cpu.
>>
>> Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
> 
> Otherwise, looks good to me.
> 
> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> 
>> ---
>>  arch/powerpc/perf/hv-24x7.c | 45 +++++++++++++++++++++++++++++++++++++
>>  include/linux/cpuhotplug.h  |  1 +
>>  2 files changed, 46 insertions(+)
>>
>> diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
>> index db213eb7cb02..ce4739e2b407 100644
>> --- a/arch/powerpc/perf/hv-24x7.c
>> +++ b/arch/powerpc/perf/hv-24x7.c
>> @@ -31,6 +31,8 @@ static int interface_version;
>>  /* Whether we have to aggregate result data for some domains. */
>>  static bool aggregate_result_elements;
>>
>> +static cpumask_t hv_24x7_cpumask;
>> +
>>  static bool domain_is_valid(unsigned domain)
>>  {
>>  	switch (domain) {
>> @@ -1641,6 +1643,44 @@ static struct pmu h_24x7_pmu = {
>>  	.capabilities = PERF_PMU_CAP_NO_EXCLUDE,
>>  };
>>
>> +static int ppc_hv_24x7_cpu_online(unsigned int cpu)
>> +{
>> +	/* Make this CPU the designated target for counter collection */
>> +	if (cpumask_empty(&hv_24x7_cpumask))
>> +		cpumask_set_cpu(cpu, &hv_24x7_cpumask);
>> +
>> +	return 0;
>> +}
>> +
>> +static int ppc_hv_24x7_cpu_offline(unsigned int cpu)
>> +{
>> +	int target = -1;
>> +
>> +	/* Check if exiting cpu is used for collecting 24x7 events */
>> +	if (!cpumask_test_and_clear_cpu(cpu, &hv_24x7_cpumask))
>> +		return 0;
>> +
>> +	/* Find a new cpu to collect 24x7 events */
>> +	target = cpumask_last(cpu_active_mask);
>> +
>> +	if (target < 0 || target >= nr_cpu_ids)
>> +		return -1;
>> +
>> +	/* Migrate 24x7 events to the new target */
>> +	cpumask_set_cpu(target, &hv_24x7_cpumask);
>> +	perf_pmu_migrate_context(&h_24x7_pmu, cpu, target);
>> +
>> +	return 0;
>> +}
>> +
>> +static int hv_24x7_cpu_hotplug_init(void)
>> +{
>> +	return cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE,
>> +			  "perf/powerpc/hv_24x7:online",
>> +			  ppc_hv_24x7_cpu_online,
>> +			  ppc_hv_24x7_cpu_offline);
>> +}
>> +
>>  static int hv_24x7_init(void)
>>  {
>>  	int r;
>> @@ -1685,6 +1725,11 @@ static int hv_24x7_init(void)
>>  	if (r)
>>  		return r;
>>
>> +	/* init cpuhotplug */
>> +	r = hv_24x7_cpu_hotplug_init();
>> +	if (r)
>> +		pr_err("hv_24x7: CPU hotplug init failed\n");
>> +
>>  	r = perf_pmu_register(&h_24x7_pmu, h_24x7_pmu.name, -1);
>>  	if (r)
>>  		return r;
>> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
>> index 191772d4a4d7..a2710e654b64 100644
>> --- a/include/linux/cpuhotplug.h
>> +++ b/include/linux/cpuhotplug.h
>> @@ -181,6 +181,7 @@ enum cpuhp_state {
>>  	CPUHP_AP_PERF_POWERPC_CORE_IMC_ONLINE,
>>  	CPUHP_AP_PERF_POWERPC_THREAD_IMC_ONLINE,
>>  	CPUHP_AP_PERF_POWERPC_TRACE_IMC_ONLINE,
>> +	CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE,
>>  	CPUHP_AP_WATCHDOG_ONLINE,
>>  	CPUHP_AP_WORKQUEUE_ONLINE,
>>  	CPUHP_AP_RCUTREE_ONLINE,
>> -- 
>> 2.18.2
>>

^ permalink raw reply

* Re: [PATCH v2 1/2] powerpc/perf/hv-24x7: Add cpu hotplug support
From: Gautham R Shenoy @ 2020-06-24 10:36 UTC (permalink / raw)
  To: Kajol Jain; +Cc: nathanl, ego, maddy, suka, anju, linuxppc-dev
In-Reply-To: <20200624101754.169612-2-kjain@linux.ibm.com>

On Wed, Jun 24, 2020 at 03:47:53PM +0530, Kajol Jain wrote:
> Patch here adds cpu hotplug functions to hv_24x7 pmu.
> A new cpuhp_state "CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE" enum
> is added.
> 
> The online function update the cpumask only if its NULL.
      ^^^^^^^^^^^^^^^^^^^^^^                     ^^^^^^^^^
	online callback function updates	 it is empty.
						 

> As the primary intention for adding hotplug support
> is to desiginate a CPU to make HCALL to collect the
        ^^^^^^^^^^
	designate
	
> count data.
> 
> The offline function test and clear corresponding cpu in a cpumask
> and update cpumask to any other active cpu.
> 
> Signed-off-by: Kajol Jain <kjain@linux.ibm.com>

Otherwise, looks good to me.

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>

> ---
>  arch/powerpc/perf/hv-24x7.c | 45 +++++++++++++++++++++++++++++++++++++
>  include/linux/cpuhotplug.h  |  1 +
>  2 files changed, 46 insertions(+)
> 
> diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
> index db213eb7cb02..ce4739e2b407 100644
> --- a/arch/powerpc/perf/hv-24x7.c
> +++ b/arch/powerpc/perf/hv-24x7.c
> @@ -31,6 +31,8 @@ static int interface_version;
>  /* Whether we have to aggregate result data for some domains. */
>  static bool aggregate_result_elements;
> 
> +static cpumask_t hv_24x7_cpumask;
> +
>  static bool domain_is_valid(unsigned domain)
>  {
>  	switch (domain) {
> @@ -1641,6 +1643,44 @@ static struct pmu h_24x7_pmu = {
>  	.capabilities = PERF_PMU_CAP_NO_EXCLUDE,
>  };
> 
> +static int ppc_hv_24x7_cpu_online(unsigned int cpu)
> +{
> +	/* Make this CPU the designated target for counter collection */
> +	if (cpumask_empty(&hv_24x7_cpumask))
> +		cpumask_set_cpu(cpu, &hv_24x7_cpumask);
> +
> +	return 0;
> +}
> +
> +static int ppc_hv_24x7_cpu_offline(unsigned int cpu)
> +{
> +	int target = -1;
> +
> +	/* Check if exiting cpu is used for collecting 24x7 events */
> +	if (!cpumask_test_and_clear_cpu(cpu, &hv_24x7_cpumask))
> +		return 0;
> +
> +	/* Find a new cpu to collect 24x7 events */
> +	target = cpumask_last(cpu_active_mask);
> +
> +	if (target < 0 || target >= nr_cpu_ids)
> +		return -1;
> +
> +	/* Migrate 24x7 events to the new target */
> +	cpumask_set_cpu(target, &hv_24x7_cpumask);
> +	perf_pmu_migrate_context(&h_24x7_pmu, cpu, target);
> +
> +	return 0;
> +}
> +
> +static int hv_24x7_cpu_hotplug_init(void)
> +{
> +	return cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE,
> +			  "perf/powerpc/hv_24x7:online",
> +			  ppc_hv_24x7_cpu_online,
> +			  ppc_hv_24x7_cpu_offline);
> +}
> +
>  static int hv_24x7_init(void)
>  {
>  	int r;
> @@ -1685,6 +1725,11 @@ static int hv_24x7_init(void)
>  	if (r)
>  		return r;
> 
> +	/* init cpuhotplug */
> +	r = hv_24x7_cpu_hotplug_init();
> +	if (r)
> +		pr_err("hv_24x7: CPU hotplug init failed\n");
> +
>  	r = perf_pmu_register(&h_24x7_pmu, h_24x7_pmu.name, -1);
>  	if (r)
>  		return r;
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index 191772d4a4d7..a2710e654b64 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -181,6 +181,7 @@ enum cpuhp_state {
>  	CPUHP_AP_PERF_POWERPC_CORE_IMC_ONLINE,
>  	CPUHP_AP_PERF_POWERPC_THREAD_IMC_ONLINE,
>  	CPUHP_AP_PERF_POWERPC_TRACE_IMC_ONLINE,
> +	CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE,
>  	CPUHP_AP_WATCHDOG_ONLINE,
>  	CPUHP_AP_WORKQUEUE_ONLINE,
>  	CPUHP_AP_RCUTREE_ONLINE,
> -- 
> 2.18.2
> 

^ permalink raw reply

* Re: [PATCH v5 2/3] powerpc/numa: Prefer node id queried from vphn
From: Gautham R Shenoy @ 2020-06-24 10:29 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Gautham R Shenoy, Michal Hocko, David Hildenbrand, Linus Torvalds,
	linux-kernel, linux-mm, Satheesh Rajendran, Mel Gorman,
	Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter, Vlastimil Babka
In-Reply-To: <20200624092846.9194-3-srikar@linux.vnet.ibm.com>

Hello Srikar,


On Wed, Jun 24, 2020 at 02:58:45PM +0530, Srikar Dronamraju wrote:
> Node id queried from the static device tree may not
> be correct. For example: it may always show 0 on a shared processor.
> Hence prefer the node id queried from vphn and fallback on the device tree
> based node id if vphn query fails.
> 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
> Cc: David Hildenbrand <david@redhat.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>


This patch looks good to me.

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>


--
Thanks and Regards
gautham.

^ permalink raw reply

* [PATCH v2 2/2] powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask
From: Kajol Jain @ 2020-06-24 10:17 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: nathanl, ego, maddy, kjain, suka, anju
In-Reply-To: <20200624101754.169612-1-kjain@linux.ibm.com>

Patch here adds a cpumask attr to hv_24x7 pmu along with ABI documentation.

command:# cat /sys/devices/hv_24x7/cpumask
0

Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
---
 .../sysfs-bus-event_source-devices-hv_24x7    |  7 ++++
 arch/powerpc/perf/hv-24x7.c                   | 36 +++++++++++++++++--
 2 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7 b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
index e8698afcd952..f9dd3755b049 100644
--- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
@@ -43,6 +43,13 @@ Description:	read only
 		This sysfs interface exposes the number of cores per chip
 		present in the system.
 
+What:		/sys/devices/hv_24x7/cpumask
+Date:		June 2020
+Contact:	Linux on PowerPC Developer List <linuxppc-dev@lists.ozlabs.org>
+Description:	read only
+		This sysfs file exposes the cpumask which is designated to make
+		HCALLs to retrive hv-24x7 pmu event counter data.
+
 What:		/sys/bus/event_source/devices/hv_24x7/event_descs/<event-name>
 Date:		February 2014
 Contact:	Linux on PowerPC Developer List <linuxppc-dev@lists.ozlabs.org>
diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index ce4739e2b407..3c699612d29f 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -448,6 +448,12 @@ static ssize_t device_show_string(struct device *dev,
 	return sprintf(buf, "%s\n", (char *)d->var);
 }
 
+static ssize_t cpumask_get_attr(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	return cpumap_print_to_pagebuf(true, buf, &hv_24x7_cpumask);
+}
+
 static ssize_t sockets_show(struct device *dev,
 			    struct device_attribute *attr, char *buf)
 {
@@ -1116,6 +1122,17 @@ static DEVICE_ATTR_RO(sockets);
 static DEVICE_ATTR_RO(chipspersocket);
 static DEVICE_ATTR_RO(coresperchip);
 
+static DEVICE_ATTR(cpumask, S_IRUGO, cpumask_get_attr, NULL);
+
+static struct attribute *cpumask_attrs[] = {
+	&dev_attr_cpumask.attr,
+	NULL,
+};
+
+static struct attribute_group cpumask_attr_group = {
+	.attrs = cpumask_attrs,
+};
+
 static struct bin_attribute *if_bin_attrs[] = {
 	&bin_attr_catalog,
 	NULL,
@@ -1143,6 +1160,11 @@ static const struct attribute_group *attr_groups[] = {
 	&event_desc_group,
 	&event_long_desc_group,
 	&if_group,
+	/*
+	 * This NULL is a placeholder for the cpumask attr which will update
+	 * onlyif cpuhotplug registration is successful
+	 */
+	NULL,
 	NULL,
 };
 
@@ -1683,7 +1705,7 @@ static int hv_24x7_cpu_hotplug_init(void)
 
 static int hv_24x7_init(void)
 {
-	int r;
+	int r, i = -1;
 	unsigned long hret;
 	struct hv_perf_caps caps;
 
@@ -1727,8 +1749,18 @@ static int hv_24x7_init(void)
 
 	/* init cpuhotplug */
 	r = hv_24x7_cpu_hotplug_init();
-	if (r)
+	if (r) {
 		pr_err("hv_24x7: CPU hotplug init failed\n");
+	} else {
+		/*
+		 * Cpu hotplug init is successful, add the
+		 * cpumask file as part of pmu attr group and
+		 * assign it to very first NULL location.
+		 */
+		while (attr_groups[++i])
+			/* nothing */;
+		attr_groups[i] = &cpumask_attr_group;
+	}
 
 	r = perf_pmu_register(&h_24x7_pmu, h_24x7_pmu.name, -1);
 	if (r)
-- 
2.18.2


^ permalink raw reply related

* [PATCH v2 1/2] powerpc/perf/hv-24x7: Add cpu hotplug support
From: Kajol Jain @ 2020-06-24 10:17 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: nathanl, ego, maddy, kjain, suka, anju
In-Reply-To: <20200624101754.169612-1-kjain@linux.ibm.com>

Patch here adds cpu hotplug functions to hv_24x7 pmu.
A new cpuhp_state "CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE" enum
is added.

The online function update the cpumask only if its NULL.
As the primary intention for adding hotplug support
is to desiginate a CPU to make HCALL to collect the
count data.

The offline function test and clear corresponding cpu in a cpumask
and update cpumask to any other active cpu.

Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
---
 arch/powerpc/perf/hv-24x7.c | 45 +++++++++++++++++++++++++++++++++++++
 include/linux/cpuhotplug.h  |  1 +
 2 files changed, 46 insertions(+)

diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index db213eb7cb02..ce4739e2b407 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -31,6 +31,8 @@ static int interface_version;
 /* Whether we have to aggregate result data for some domains. */
 static bool aggregate_result_elements;
 
+static cpumask_t hv_24x7_cpumask;
+
 static bool domain_is_valid(unsigned domain)
 {
 	switch (domain) {
@@ -1641,6 +1643,44 @@ static struct pmu h_24x7_pmu = {
 	.capabilities = PERF_PMU_CAP_NO_EXCLUDE,
 };
 
+static int ppc_hv_24x7_cpu_online(unsigned int cpu)
+{
+	/* Make this CPU the designated target for counter collection */
+	if (cpumask_empty(&hv_24x7_cpumask))
+		cpumask_set_cpu(cpu, &hv_24x7_cpumask);
+
+	return 0;
+}
+
+static int ppc_hv_24x7_cpu_offline(unsigned int cpu)
+{
+	int target = -1;
+
+	/* Check if exiting cpu is used for collecting 24x7 events */
+	if (!cpumask_test_and_clear_cpu(cpu, &hv_24x7_cpumask))
+		return 0;
+
+	/* Find a new cpu to collect 24x7 events */
+	target = cpumask_last(cpu_active_mask);
+
+	if (target < 0 || target >= nr_cpu_ids)
+		return -1;
+
+	/* Migrate 24x7 events to the new target */
+	cpumask_set_cpu(target, &hv_24x7_cpumask);
+	perf_pmu_migrate_context(&h_24x7_pmu, cpu, target);
+
+	return 0;
+}
+
+static int hv_24x7_cpu_hotplug_init(void)
+{
+	return cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE,
+			  "perf/powerpc/hv_24x7:online",
+			  ppc_hv_24x7_cpu_online,
+			  ppc_hv_24x7_cpu_offline);
+}
+
 static int hv_24x7_init(void)
 {
 	int r;
@@ -1685,6 +1725,11 @@ static int hv_24x7_init(void)
 	if (r)
 		return r;
 
+	/* init cpuhotplug */
+	r = hv_24x7_cpu_hotplug_init();
+	if (r)
+		pr_err("hv_24x7: CPU hotplug init failed\n");
+
 	r = perf_pmu_register(&h_24x7_pmu, h_24x7_pmu.name, -1);
 	if (r)
 		return r;
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 191772d4a4d7..a2710e654b64 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -181,6 +181,7 @@ enum cpuhp_state {
 	CPUHP_AP_PERF_POWERPC_CORE_IMC_ONLINE,
 	CPUHP_AP_PERF_POWERPC_THREAD_IMC_ONLINE,
 	CPUHP_AP_PERF_POWERPC_TRACE_IMC_ONLINE,
+	CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE,
 	CPUHP_AP_WATCHDOG_ONLINE,
 	CPUHP_AP_WORKQUEUE_ONLINE,
 	CPUHP_AP_RCUTREE_ONLINE,
-- 
2.18.2


^ permalink raw reply related

* [PATCH v2 0/2] Add cpu hotplug support for powerpc/perf/hv-24x7
From: Kajol Jain @ 2020-06-24 10:17 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: nathanl, ego, maddy, kjain, suka, anju

This patchset add cpu hotplug support for hv_24x7 driver by adding
online/offline cpu hotplug function. It also add sysfs file
"cpumask" to expose current online cpu that can be used for
hv_24x7 event count.

Changelog:
v1 -> v2
- Changed function to pick active cpu incase of offline
  from "cpumask_any_but" to "cpumask_last", as
  cpumask_any_but function pick very next online cpu and incase where
  we are sequentially off-lining multiple cpus, "pmu_migrate_context"
  can add extra latency.
  - Suggested by: Gautham R Shenoy.

- Change documentation for cpumask and rather then hardcode the
  initialization for cpumask_attr_group, add loop to get very first
  NULL as suggested by Gautham R Shenoy.

Kajol Jain (2):
  powerpc/perf/hv-24x7: Add cpu hotplug support
  powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask

 .../sysfs-bus-event_source-devices-hv_24x7    |  7 ++
 arch/powerpc/perf/hv-24x7.c                   | 79 ++++++++++++++++++-
 include/linux/cpuhotplug.h                    |  1 +
 3 files changed, 86 insertions(+), 1 deletion(-)

-- 
2.18.2


^ permalink raw reply

* Re: [PATCH v4 7/8] lockdep: Change hardirq{s_enabled,_context} to per-cpu variables
From: Marco Elver @ 2020-06-24 10:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-s390, linuxppc-dev, Paul E. McKenney, bigeasy,
	the arch/x86 maintainers, heiko.carstens, LKML, Steven Rostedt,
	David S. Miller, Ahmed S. Darwish, sparclinux, linux,
	Thomas Gleixner, Will Deacon, Ingo Molnar
In-Reply-To: <20200624090033.GD4800@hirez.programming.kicks-ass.net>

On Wed, 24 Jun 2020 at 11:01, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jun 23, 2020 at 10:24:04PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 23, 2020 at 08:12:32PM +0200, Peter Zijlstra wrote:
> > > Fair enough; I'll rip it all up and boot a KCSAN kernel, see what if
> > > anything happens.
> >
> > OK, so the below patch doesn't seem to have any nasty recursion issues
> > here. The only 'problem' is that lockdep now sees report_lock can cause
> > deadlocks.
> >
> > It is completely right about it too, but I don't suspect there's much we
> > can do about it, it's pretty much the standard printk() with scheduler
> > locks held report.
>
> So I've been getting tons and tons of this:
>
> [   60.471348] ==================================================================
> [   60.479427] BUG: KCSAN: data-race in __rcu_read_lock / __rcu_read_unlock
> [   60.486909]
> [   60.488572] write (marked) to 0xffff88840fff1cf0 of 4 bytes by interrupt on cpu 1:
> [   60.497026]  __rcu_read_lock+0x37/0x60
> [   60.501214]  cpuacct_account_field+0x1b/0x170
> [   60.506081]  task_group_account_field+0x32/0x160
> [   60.511238]  account_system_time+0xe6/0x110
> [   60.515912]  update_process_times+0x1d/0xd0
> [   60.520585]  tick_sched_timer+0xfc/0x180
> [   60.524967]  __hrtimer_run_queues+0x271/0x440
> [   60.529832]  hrtimer_interrupt+0x222/0x670
> [   60.534409]  __sysvec_apic_timer_interrupt+0xb3/0x1a0
> [   60.540052]  asm_call_on_stack+0x12/0x20
> [   60.544434]  sysvec_apic_timer_interrupt+0xba/0x130
> [   60.549882]  asm_sysvec_apic_timer_interrupt+0x12/0x20
> [   60.555621]  delay_tsc+0x7d/0xe0
> [   60.559226]  kcsan_setup_watchpoint+0x292/0x4e0
> [   60.564284]  __rcu_read_unlock+0x73/0x2c0
> [   60.568763]  __unlock_page_memcg+0xda/0xf0
> [   60.573338]  unlock_page_memcg+0x32/0x40
> [   60.577721]  page_remove_rmap+0x5c/0x200
> [   60.582104]  unmap_page_range+0x83c/0xc10
> [   60.586582]  unmap_single_vma+0xb0/0x150
> [   60.590963]  unmap_vmas+0x81/0xe0
> [   60.594663]  exit_mmap+0x135/0x2b0
> [   60.598464]  __mmput+0x21/0x150
> [   60.601970]  mmput+0x2a/0x30
> [   60.605176]  exit_mm+0x2fc/0x350
> [   60.608780]  do_exit+0x372/0xff0
> [   60.612385]  do_group_exit+0x139/0x140
> [   60.616571]  __do_sys_exit_group+0xb/0x10
> [   60.621048]  __se_sys_exit_group+0xa/0x10
> [   60.625524]  __x64_sys_exit_group+0x1b/0x20
> [   60.630189]  do_syscall_64+0x6c/0xe0
> [   60.634182]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [   60.639820]
> [   60.641485] read to 0xffff88840fff1cf0 of 4 bytes by task 2430 on cpu 1:
> [   60.648969]  __rcu_read_unlock+0x73/0x2c0
> [   60.653446]  __unlock_page_memcg+0xda/0xf0
> [   60.658019]  unlock_page_memcg+0x32/0x40
> [   60.662400]  page_remove_rmap+0x5c/0x200
> [   60.666782]  unmap_page_range+0x83c/0xc10
> [   60.671259]  unmap_single_vma+0xb0/0x150
> [   60.675641]  unmap_vmas+0x81/0xe0
> [   60.679341]  exit_mmap+0x135/0x2b0
> [   60.683141]  __mmput+0x21/0x150
> [   60.686647]  mmput+0x2a/0x30
> [   60.689853]  exit_mm+0x2fc/0x350
> [   60.693458]  do_exit+0x372/0xff0
> [   60.697062]  do_group_exit+0x139/0x140
> [   60.701248]  __do_sys_exit_group+0xb/0x10
> [   60.705724]  __se_sys_exit_group+0xa/0x10
> [   60.710201]  __x64_sys_exit_group+0x1b/0x20
> [   60.714872]  do_syscall_64+0x6c/0xe0
> [   60.718864]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [   60.724503]
> [   60.726156] Reported by Kernel Concurrency Sanitizer on:
> [   60.732089] CPU: 1 PID: 2430 Comm: sshd Not tainted 5.8.0-rc2-00186-gb4ee11fe08b3-dirty #303
> [   60.741510] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
> [   60.752957] ==================================================================
>
> And I figured a quick way to get rid of that would be something like the
> below, seeing how volatile gets auto annotated... but that doesn't seem
> to actually work.
>
> What am I missing?

There's one more in include/linux/rcupdate.h. I suggested this at some point:

    https://lore.kernel.org/lkml/20200220213317.GA35033@google.com/

To avoid volatiles as I don't think they are needed here.

[ Still testing your other patches for KCSAN, will send another reply there. ]

Thanks,
-- Marco

^ permalink raw reply

* Re: [PATCH v5 1/3] powerpc/numa: Set numa_node for all possible cpus
From: Gautham R Shenoy @ 2020-06-24  9:48 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Gautham R Shenoy, Michal Hocko, David Hildenbrand, Linus Torvalds,
	linux-kernel, linux-mm, Satheesh Rajendran, Mel Gorman,
	Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter, Vlastimil Babka
In-Reply-To: <20200624092846.9194-2-srikar@linux.vnet.ibm.com>

On Wed, Jun 24, 2020 at 02:58:44PM +0530, Srikar Dronamraju wrote:
> A Powerpc system with multiple possible nodes and with CONFIG_NUMA
> enabled always used to have a node 0, even if node 0 does not any cpus
> or memory attached to it. As per PAPR, node affinity of a cpu is only
> available once its present / online. For all cpus that are possible but
> not present, cpu_to_node() would point to node 0.
> 
> To ensure a cpuless, memoryless dummy node is not online, powerpc need
> to make sure all possible but not present cpu_to_node are set to a
> proper node.
> 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
> Cc: David Hildenbrand <david@redhat.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

This looks good to me.

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>

--
Thanks and Regards
gautham.

^ permalink raw reply

* [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
From: Srikar Dronamraju @ 2020-06-24  9:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gautham R Shenoy, Michal Hocko, Srikar Dronamraju,
	David Hildenbrand, Linus Torvalds, linux-kernel, linux-mm,
	Satheesh Rajendran, Mel Gorman, Kirill A. Shutemov,
	Christopher Lameter, linuxppc-dev, Vlastimil Babka
In-Reply-To: <20200624092846.9194-1-srikar@linux.vnet.ibm.com>

Currently Linux kernel with CONFIG_NUMA on a system with multiple
possible nodes, marks node 0 as online at boot.  However in practice,
there are systems which have node 0 as memoryless and cpuless.

This can cause numa_balancing to be enabled on systems with only one node
with memory and CPUs. The existence of this dummy node which is cpuless and
memoryless node can confuse users/scripts looking at output of lscpu /
numactl.

By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 is
always online.

v5.8-rc2
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
------------------
 /sys/devices/system/node/online:            0,2
 /proc/sys/kernel/numa_balancing:            1
 /sys/devices/system/node/has_cpu:           2
 /sys/devices/system/node/has_memory:        2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:          0-31

v5.8-rc2 + patch
------------------
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
------------------
/sys/devices/system/node/online:            2
/proc/sys/kernel/numa_balancing:            0
/sys/devices/system/node/has_cpu:           2
/sys/devices/system/node/has_memory:        2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:          0-31

Note: On Powerpc, cpu_to_node of possible but not present cpus would
previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
queried from vphn"). Without the 2 commits, Powerpc system might crash.

1. User space applications like Numactl, lscpu, that parse the sysfs tend to
believe there is an extra online node. This tends to confuse users and
applications. Other user space applications start believing that system was
not able to use all the resources (i.e missing resources) or the system was
not setup correctly.

2. Also existence of dummy node also leads to inconsistent information. The
number of online nodes is inconsistent with the information in the
device-tree and resource-dump

3. When the dummy node is present, single node non-Numa systems end up showing
up as NUMA systems and numa_balancing gets enabled. This will mean we take
the hit from the unnecessary numa hinting faults.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v4:->v5:
- rebased to v5.8-rc2
link v4: http://lore.kernel.org/lkml/20200512132937.19295-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v1:->v2:
- Rebased to v5.7-rc3
Link v2: https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-srikar@linux.vnet.ibm.com/t/#u

 mm/page_alloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48eb0f1410d4..5187664558e1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -117,8 +117,10 @@ EXPORT_SYMBOL(latent_entropy);
  */
 nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
 	[N_POSSIBLE] = NODE_MASK_ALL,
+#ifdef CONFIG_NUMA
+	[N_ONLINE] = NODE_MASK_NONE,
+#else
 	[N_ONLINE] = { { [0] = 1UL } },
-#ifndef CONFIG_NUMA
 	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
 #ifdef CONFIG_HIGHMEM
 	[N_HIGH_MEMORY] = { { [0] = 1UL } },
-- 
2.18.1


^ permalink raw reply related

* [PATCH v5 2/3] powerpc/numa: Prefer node id queried from vphn
From: Srikar Dronamraju @ 2020-06-24  9:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gautham R Shenoy, Michal Hocko, Srikar Dronamraju,
	David Hildenbrand, Linus Torvalds, linux-kernel, linux-mm,
	Satheesh Rajendran, Mel Gorman, Kirill A. Shutemov,
	Christopher Lameter, linuxppc-dev, Vlastimil Babka
In-Reply-To: <20200624092846.9194-1-srikar@linux.vnet.ibm.com>

Node id queried from the static device tree may not
be correct. For example: it may always show 0 on a shared processor.
Hence prefer the node id queried from vphn and fallback on the device tree
based node id if vphn query fails.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v4:->v5:
- rebased to v5.8-rc2
link v4: http://lore.kernel.org/lkml/20200512132937.19295-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v2:->v3:
- Resolved comments from Gautham.
Link v2: https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v1:->v2:
- Rebased to v5.7-rc3

 arch/powerpc/mm/numa.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 5b7918c132f5..163f5dc8fd46 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -724,21 +724,22 @@ static int __init parse_numa_properties(void)
 	 */
 	for_each_present_cpu(i) {
 		struct device_node *cpu;
-		int nid;
-
-		cpu = of_get_cpu_node(i, NULL);
-		BUG_ON(!cpu);
-		nid = of_node_to_nid_single(cpu);
-		of_node_put(cpu);
+		int nid = vphn_get_nid(i);
 
 		/*
 		 * Don't fall back to default_nid yet -- we will plug
 		 * cpus into nodes once the memory scan has discovered
 		 * the topology.
 		 */
-		if (nid < 0)
-			continue;
-		node_set_online(nid);
+		if (nid == NUMA_NO_NODE) {
+			cpu = of_get_cpu_node(i, NULL);
+			BUG_ON(!cpu);
+			nid = of_node_to_nid_single(cpu);
+			of_node_put(cpu);
+		}
+
+		if (likely(nid > 0))
+			node_set_online(nid);
 	}
 
 	get_n_mem_cells(&n_mem_addr_cells, &n_mem_size_cells);
-- 
2.18.1


^ permalink raw reply related

* [PATCH v5 1/3] powerpc/numa: Set numa_node for all possible cpus
From: Srikar Dronamraju @ 2020-06-24  9:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gautham R Shenoy, Michal Hocko, Srikar Dronamraju,
	David Hildenbrand, Linus Torvalds, linux-kernel, linux-mm,
	Satheesh Rajendran, Mel Gorman, Kirill A. Shutemov,
	Christopher Lameter, linuxppc-dev, Vlastimil Babka
In-Reply-To: <20200624092846.9194-1-srikar@linux.vnet.ibm.com>

A Powerpc system with multiple possible nodes and with CONFIG_NUMA
enabled always used to have a node 0, even if node 0 does not any cpus
or memory attached to it. As per PAPR, node affinity of a cpu is only
available once its present / online. For all cpus that are possible but
not present, cpu_to_node() would point to node 0.

To ensure a cpuless, memoryless dummy node is not online, powerpc need
to make sure all possible but not present cpu_to_node are set to a
proper node.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v4:->v5:
- rebased to v5.8-rc2
link v4: http://lore.kernel.org/lkml/20200512132937.19295-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v3:->v4:
- Resolved comments from Christopher.
Link v3: http://lore.kernel.org/lkml/20200501031128.19584-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v1:->v2:
- Rebased to v5.7-rc3

 arch/powerpc/mm/numa.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 9fcf2d195830..5b7918c132f5 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -506,6 +506,11 @@ static int numa_setup_cpu(unsigned long lcpu)
 	int fcpu = cpu_first_thread_sibling(lcpu);
 	int nid = NUMA_NO_NODE;
 
+	if (!cpu_present(lcpu)) {
+		set_cpu_numa_node(lcpu, first_online_node);
+		return first_online_node;
+	}
+
 	/*
 	 * If a valid cpu-to-node mapping is already available, use it
 	 * directly instead of querying the firmware, since it represents
@@ -931,8 +936,17 @@ void __init mem_topology_setup(void)
 
 	reset_numa_cpu_lookup_table();
 
-	for_each_present_cpu(cpu)
+	for_each_possible_cpu(cpu) {
+		/*
+		 * Powerpc with CONFIG_NUMA always used to have a node 0,
+		 * even if it was memoryless or cpuless. For all cpus that
+		 * are possible but not present, cpu_to_node() would point
+		 * to node 0. To remove a cpuless, memoryless dummy node,
+		 * powerpc need to make sure all possible but not present
+		 * cpu_to_node are set to a proper node.
+		 */
 		numa_setup_cpu(cpu);
+	}
 }
 
 void __init initmem_init(void)
-- 
2.18.1


^ permalink raw reply related

* [PATCH v5 0/3] Offline memoryless cpuless node 0
From: Srikar Dronamraju @ 2020-06-24  9:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gautham R Shenoy, Michal Hocko, Srikar Dronamraju,
	David Hildenbrand, Linus Torvalds, linux-kernel, linux-mm,
	Satheesh Rajendran, Mel Gorman, Kirill A. Shutemov,
	Christopher Lameter, linuxppc-dev, Vlastimil Babka

Changelog v4:->v5:
- rebased to v5.8-rc2
link v4: http://lore.kernel.org/lkml/20200512132937.19295-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v3:->v4:
- Resolved comments from Christopher.
Link v3: http://lore.kernel.org/lkml/20200501031128.19584-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v2:->v3:
- Resolved comments from Gautham.
Link v2: https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v1:->v2:
- Rebased to v5.7-rc3
- Updated the changelog.
Link v1: https://lore.kernel.org/linuxppc-dev/20200311110237.5731-1-srikar@linux.vnet.ibm.com/t/#u

Linux kernel configured with CONFIG_NUMA on a system with multiple
possible nodes, marks node 0 as online at boot. However in practice,
there are systems which have node 0 as memoryless and cpuless.

This can cause
1. numa_balancing to be enabled on systems with only one online node.
2. Existence of dummy (cpuless and memoryless) node which can confuse
users/scripts looking at output of lscpu / numactl.

This patchset wants to correct this anomaly.

This should only affect systems that have CONFIG_MEMORYLESS_NODES.
Currently there are only 2 architectures ia64 and powerpc that have this
config.

Note: Patch 3 in this patch series depends on patches 1 and 2.
Without patches 1 and 2, patch 3 might crash powerpc.

v5.8-rc2
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
------------------
 /sys/devices/system/node/online:            0,2
 /proc/sys/kernel/numa_balancing:            1
 /sys/devices/system/node/has_cpu:           2
 /sys/devices/system/node/has_memory:        2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:          0-31

v5.8-rc2 + patches
------------------
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
------------------
/sys/devices/system/node/online:            2
/proc/sys/kernel/numa_balancing:            0
/sys/devices/system/node/has_cpu:           2
/sys/devices/system/node/has_memory:        2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:          0-31

1. User space applications like Numactl, lscpu, that parse the sysfs tend to
believe there is an extra online node. This tends to confuse users and
applications. Other user space applications start believing that system was
not able to use all the resources (i.e missing resources) or the system was
not setup correctly.

2. Also existence of dummy node also leads to inconsistent information. The
number of online nodes is inconsistent with the information in the
device-tree and resource-dump

3. When the dummy node is present, single node non-Numa systems end up showing
up as NUMA systems and numa_balancing gets enabled. This will mean we take
the hit from the unnecessary numa hinting faults.

On a machine with just one node with node number not being 0,
the current setup will end up showing 2 online nodes. And when there are
more than one online nodes, numa_balancing gets enabled.

Without patch
$ grep numa /proc/vmstat
numa_hit 95179
numa_miss 0
numa_foreign 0
numa_interleave 3764
numa_local 95179
numa_other 0
numa_pte_updates 1206973                 <----------
numa_huge_pte_updates 4654                 <----------
numa_hint_faults 19560                 <----------
numa_hint_faults_local 19560                 <----------
numa_pages_migrated 0

With patch
$ grep numa /proc/vmstat
numa_hit 322338756
numa_miss 0
numa_foreign 0
numa_interleave 3790
numa_local 322338756
numa_other 0
numa_pte_updates 0                 <----------
numa_huge_pte_updates 0                 <----------
numa_hint_faults 0                 <----------
numa_hint_faults_local 0                 <----------
numa_pages_migrated 0

Here are 2 sample numa programs.

numa01.sh is a set of 2 process each running threads as many as number of
cpus;
each thread doing 50 loops on 3GB process shared memory operations.

numa02.sh is a single process with threads as many as number of cpus;
each thread doing 800 loops on 32MB thread local memory operations.

Testcase         Time:  Min      Max      Avg      StdDev
./numa01.sh      Real:  149.62   149.66   149.64   0.02
./numa01.sh      Sys:   3.21     3.71     3.46     0.25
./numa01.sh      User:  4755.13  4758.15  4756.64  1.51
./numa02.sh      Real:  24.98    25.02    25.00    0.02
./numa02.sh      Sys:   0.51     0.59     0.55     0.04
./numa02.sh      User:  790.28   790.88   790.58   0.30

Testcase         Time:  Min      Max      Avg      StdDev  %Change
./numa01.sh      Real:  149.44   149.46   149.45   0.01    0.127133%
./numa01.sh      Sys:   0.71     0.89     0.80     0.09    332.5%
./numa01.sh      User:  4754.19  4754.48  4754.33  0.15    0.0485873%
./numa02.sh      Real:  24.97    24.98    24.98    0.00    0.0800641%
./numa02.sh      Sys:   0.26     0.41     0.33     0.08    66.6667%
./numa02.sh      User:  789.75   790.28   790.01   0.27    0.072151%

numa01.sh
param                   no_patch    with_patch  %Change
-----                   ----------  ----------  -------
numa_hint_faults        1131164     0           -100%
numa_hint_faults_local  1131164     0           -100%
numa_hit                213696      214244      0.256439%
numa_local              213696      214244      0.256439%
numa_pte_updates        1131294     0           -100%
pgfault                 1380845     241424      -82.5162%
pgmajfault              75          60          -20%

Here are 2 sample numa programs.

numa01.sh is a set of 2 process each running threads as many as number of
cpus;
each thread doing 50 loops on 3GB process shared memory operations.

numa02.sh is a single process with threads as many as number of cpus;
each thread doing 800 loops on 32MB thread local memory operations.

Without patch
-------------
Testcase         Time:  Min      Max      Avg      StdDev
./numa01.sh      Real:  149.62   149.66   149.64   0.02
./numa01.sh      Sys:   3.21     3.71     3.46     0.25
./numa01.sh      User:  4755.13  4758.15  4756.64  1.51
./numa02.sh      Real:  24.98    25.02    25.00    0.02
./numa02.sh      Sys:   0.51     0.59     0.55     0.04
./numa02.sh      User:  790.28   790.88   790.58   0.30

With patch
-----------
Testcase         Time:  Min      Max      Avg      StdDev  %Change
./numa01.sh      Real:  149.44   149.46   149.45   0.01    0.127133%
./numa01.sh      Sys:   0.71     0.89     0.80     0.09    332.5%
./numa01.sh      User:  4754.19  4754.48  4754.33  0.15    0.0485873%
./numa02.sh      Real:  24.97    24.98    24.98    0.00    0.0800641%
./numa02.sh      Sys:   0.26     0.41     0.33     0.08    66.6667%
./numa02.sh      User:  789.75   790.28   790.01   0.27    0.072151%

numa01.sh
param                   no_patch    with_patch  %Change
-----                   ----------  ----------  -------
numa_hint_faults        1131164     0           -100%
numa_hint_faults_local  1131164     0           -100%
numa_hit                213696      214244      0.256439%
numa_local              213696      214244      0.256439%
numa_pte_updates        1131294     0           -100%
pgfault                 1380845     241424      -82.5162%
pgmajfault              75          60          -20%

numa02.sh
param                   no_patch    with_patch  %Change
-----                   ----------  ----------  -------
numa_hint_faults        111878      0           -100%
numa_hint_faults_local  111878      0           -100%
numa_hit                41854       43220       3.26373%
numa_local              41854       43220       3.26373%
numa_pte_updates        113926      0           -100%
pgfault                 163662      51210       -68.7099%
pgmajfault              56          52          -7.14286%

Observations:
The real time and user time actually doesn't change much. However the system
time changes to some extent. The reason being the number of numa hinting
faults. With the patch we are not seeing the numa hinting faults.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>

Srikar Dronamraju (3):
  powerpc/numa: Set numa_node for all possible cpus
  powerpc/numa: Prefer node id queried from vphn
  mm/page_alloc: Keep memoryless cpuless node 0 offline

 arch/powerpc/mm/numa.c | 35 +++++++++++++++++++++++++----------
 mm/page_alloc.c        |  4 +++-
 2 files changed, 28 insertions(+), 11 deletions(-)

-- 
2.18.1


^ permalink raw reply

* Re: [PATCH v4 7/8] lockdep: Change hardirq{s_enabled,_context} to per-cpu variables
From: Peter Zijlstra @ 2020-06-24  9:00 UTC (permalink / raw)
  To: Marco Elver
  Cc: linux-s390, linuxppc-dev, bigeasy, x86, heiko.carstens,
	linux-kernel, rostedt, davem, Ahmed S. Darwish, sparclinux, linux,
	tglx, will, mingo
In-Reply-To: <20200623202404.GE2483@worktop.programming.kicks-ass.net>

On Tue, Jun 23, 2020 at 10:24:04PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 08:12:32PM +0200, Peter Zijlstra wrote:
> > Fair enough; I'll rip it all up and boot a KCSAN kernel, see what if
> > anything happens.
> 
> OK, so the below patch doesn't seem to have any nasty recursion issues
> here. The only 'problem' is that lockdep now sees report_lock can cause
> deadlocks.
> 
> It is completely right about it too, but I don't suspect there's much we
> can do about it, it's pretty much the standard printk() with scheduler
> locks held report.

So I've been getting tons and tons of this:

[   60.471348] ==================================================================
[   60.479427] BUG: KCSAN: data-race in __rcu_read_lock / __rcu_read_unlock
[   60.486909]
[   60.488572] write (marked) to 0xffff88840fff1cf0 of 4 bytes by interrupt on cpu 1:
[   60.497026]  __rcu_read_lock+0x37/0x60
[   60.501214]  cpuacct_account_field+0x1b/0x170
[   60.506081]  task_group_account_field+0x32/0x160
[   60.511238]  account_system_time+0xe6/0x110
[   60.515912]  update_process_times+0x1d/0xd0
[   60.520585]  tick_sched_timer+0xfc/0x180
[   60.524967]  __hrtimer_run_queues+0x271/0x440
[   60.529832]  hrtimer_interrupt+0x222/0x670
[   60.534409]  __sysvec_apic_timer_interrupt+0xb3/0x1a0
[   60.540052]  asm_call_on_stack+0x12/0x20
[   60.544434]  sysvec_apic_timer_interrupt+0xba/0x130
[   60.549882]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[   60.555621]  delay_tsc+0x7d/0xe0
[   60.559226]  kcsan_setup_watchpoint+0x292/0x4e0
[   60.564284]  __rcu_read_unlock+0x73/0x2c0
[   60.568763]  __unlock_page_memcg+0xda/0xf0
[   60.573338]  unlock_page_memcg+0x32/0x40
[   60.577721]  page_remove_rmap+0x5c/0x200
[   60.582104]  unmap_page_range+0x83c/0xc10
[   60.586582]  unmap_single_vma+0xb0/0x150
[   60.590963]  unmap_vmas+0x81/0xe0
[   60.594663]  exit_mmap+0x135/0x2b0
[   60.598464]  __mmput+0x21/0x150
[   60.601970]  mmput+0x2a/0x30
[   60.605176]  exit_mm+0x2fc/0x350
[   60.608780]  do_exit+0x372/0xff0
[   60.612385]  do_group_exit+0x139/0x140
[   60.616571]  __do_sys_exit_group+0xb/0x10
[   60.621048]  __se_sys_exit_group+0xa/0x10
[   60.625524]  __x64_sys_exit_group+0x1b/0x20
[   60.630189]  do_syscall_64+0x6c/0xe0
[   60.634182]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   60.639820]
[   60.641485] read to 0xffff88840fff1cf0 of 4 bytes by task 2430 on cpu 1:
[   60.648969]  __rcu_read_unlock+0x73/0x2c0
[   60.653446]  __unlock_page_memcg+0xda/0xf0
[   60.658019]  unlock_page_memcg+0x32/0x40
[   60.662400]  page_remove_rmap+0x5c/0x200
[   60.666782]  unmap_page_range+0x83c/0xc10
[   60.671259]  unmap_single_vma+0xb0/0x150
[   60.675641]  unmap_vmas+0x81/0xe0
[   60.679341]  exit_mmap+0x135/0x2b0
[   60.683141]  __mmput+0x21/0x150
[   60.686647]  mmput+0x2a/0x30
[   60.689853]  exit_mm+0x2fc/0x350
[   60.693458]  do_exit+0x372/0xff0
[   60.697062]  do_group_exit+0x139/0x140
[   60.701248]  __do_sys_exit_group+0xb/0x10
[   60.705724]  __se_sys_exit_group+0xa/0x10
[   60.710201]  __x64_sys_exit_group+0x1b/0x20
[   60.714872]  do_syscall_64+0x6c/0xe0
[   60.718864]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   60.724503]
[   60.726156] Reported by Kernel Concurrency Sanitizer on:
[   60.732089] CPU: 1 PID: 2430 Comm: sshd Not tainted 5.8.0-rc2-00186-gb4ee11fe08b3-dirty #303
[   60.741510] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[   60.752957] ==================================================================

And I figured a quick way to get rid of that would be something like the
below, seeing how volatile gets auto annotated... but that doesn't seem
to actually work.

What am I missing?



diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 352223664ebd..b08861118e1a 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -351,17 +351,17 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
 
 static void rcu_preempt_read_enter(void)
 {
-	current->rcu_read_lock_nesting++;
+	(*(volatile int *)&current->rcu_read_lock_nesting)++;
 }
 
 static int rcu_preempt_read_exit(void)
 {
-	return --current->rcu_read_lock_nesting;
+	return --(*(volatile int *)&current->rcu_read_lock_nesting);
 }
 
 static void rcu_preempt_depth_set(int val)
 {
-	current->rcu_read_lock_nesting = val;
+	WRITE_ONCE(current->rcu_read_lock_nesting, val);
 }
 
 /*


^ permalink raw reply related

* [PATCH v2 6/6] powerpc/pseries/iommu: Avoid errors when DDW starts at 0x00
From: Leonardo Bras @ 2020-06-24  6:24 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Leonardo Bras, Alexey Kardashevskiy, Thiago Jung Bauermann,
	Ram Pai
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200624062411.367796-1-leobras.c@gmail.com>

As of today, enable_ddw() will return a non-null DMA address if the
created DDW maps the whole partition. If the address is valid,
iommu_bypass_supported_pSeriesLP() will consider iommu bypass enabled.

This can cause some trouble if the DDW happens to start at 0x00.

Instead if checking if the address is non-null, check directly if
the DDW maps the whole partition, so it can bypass iommu.

Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
---
 arch/powerpc/platforms/pseries/iommu.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 2d217cda4075..967634a379b0 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1078,7 +1078,8 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
  *
  * returns the dma offset for use by the direct mapped DMA code.
  */
-static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn,
+		      bool *maps_partition)
 {
 	int len, ret;
 	struct ddw_query_response query;
@@ -1237,9 +1238,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	list_add(&window->list, &direct_window_list);
 	spin_unlock(&direct_window_list_lock);
 
-	/* Only returns the dma_addr if DDW maps the whole partition */
 	if (len == order_base_2(max_addr))
-		dma_addr = be64_to_cpu(ddwprop->dma_base);
+		*maps_partition = true;
+	dma_addr = be64_to_cpu(ddwprop->dma_base);
 	goto out_unlock;
 
 out_free_window:
@@ -1324,6 +1325,7 @@ static bool iommu_bypass_supported_pSeriesLP(struct pci_dev *pdev, u64 dma_mask)
 {
 	struct device_node *dn = pci_device_to_OF_node(pdev), *pdn;
 	const __be32 *dma_window = NULL;
+	bool ret = false;
 
 	/* only attempt to use a new window if 64-bit DMA is requested */
 	if (dma_mask < DMA_BIT_MASK(64))
@@ -1344,13 +1346,10 @@ static bool iommu_bypass_supported_pSeriesLP(struct pci_dev *pdev, u64 dma_mask)
 			break;
 	}
 
-	if (pdn && PCI_DN(pdn)) {
-		pdev->dev.archdata.dma_offset = enable_ddw(pdev, pdn);
-		if (pdev->dev.archdata.dma_offset)
-			return true;
-	}
+	if (pdn && PCI_DN(pdn))
+		pdev->dev.archdata.dma_offset = enable_ddw(pdev, pdn, &ret);
 
-	return false;
+	return ret;
 }
 
 static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
-- 
2.25.4


^ permalink raw reply related

* [PATCH v2 5/6] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition
From: Leonardo Bras @ 2020-06-24  6:24 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Leonardo Bras, Alexey Kardashevskiy, Thiago Jung Bauermann,
	Ram Pai
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200624062411.367796-1-leobras.c@gmail.com>

As of today, if a DDW is created and can't map the whole partition, it's
removed and the default DMA window "ibm,dma-window" is used instead.

Usually this DDW is bigger than the default DMA window, so it would be
better to make use of it instead.

Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
---
 arch/powerpc/platforms/pseries/iommu.c | 28 +++++++++++++++++---------
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 4fcf00016fb1..2d217cda4075 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -685,7 +685,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 	struct iommu_table *tbl;
 	struct device_node *dn, *pdn;
 	struct pci_dn *ppci;
-	const __be32 *dma_window = NULL;
+	const __be32 *dma_window = NULL, *alt_dma_window = NULL;
 
 	dn = pci_bus_to_OF_node(bus);
 
@@ -699,8 +699,13 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 			break;
 	}
 
+	/* If there is a DDW available, use it instead */
+	alt_dma_window = of_get_property(pdn, DIRECT64_PROPNAME, NULL);
+	if (alt_dma_window)
+		dma_window = alt_dma_window;
+
 	if (dma_window == NULL) {
-		pr_debug("  no ibm,dma-window property !\n");
+		pr_debug("  no ibm,dma-window nor linux,direct64-ddr-window-info property !\n");
 		return;
 	}
 
@@ -1166,16 +1171,19 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 			  query.page_size);
 		goto out_failed;
 	}
+
 	/* verify the window * number of ptes will map the partition */
-	/* check largest block * page size > max memory hotplug addr */
 	max_addr = ddw_memory_hotplug_max();
 	if (query.largest_available_block < (max_addr >> page_shift)) {
-		dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
-			  "%llu-sized pages\n", max_addr,  query.largest_available_block,
-			  1ULL << page_shift);
-		goto out_failed;
+		dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu %llu-sized pages\n",
+			max_addr, query.largest_available_block,
+			1ULL << page_shift);
+
+		len = order_base_2(query.largest_available_block << page_shift);
+	} else {
+		len = order_base_2(max_addr);
 	}
-	len = order_base_2(max_addr);
+
 	win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
 	if (!win64) {
 		dev_info(&dev->dev,
@@ -1229,7 +1237,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	list_add(&window->list, &direct_window_list);
 	spin_unlock(&direct_window_list_lock);
 
-	dma_addr = be64_to_cpu(ddwprop->dma_base);
+	/* Only returns the dma_addr if DDW maps the whole partition */
+	if (len == order_base_2(max_addr))
+		dma_addr = be64_to_cpu(ddwprop->dma_base);
 	goto out_unlock;
 
 out_free_window:
-- 
2.25.4


^ permalink raw reply related

* [PATCH v2 4/6] powerpc/pseries/iommu: Remove default DMA window before creating DDW
From: Leonardo Bras @ 2020-06-24  6:24 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Leonardo Bras, Alexey Kardashevskiy, Thiago Jung Bauermann,
	Ram Pai
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200624062411.367796-1-leobras.c@gmail.com>

On LoPAR "DMA Window Manipulation Calls", it's recommended to remove the
default DMA window for the device, before attempting to configure a DDW,
in order to make the maximum resources available for the next DDW to be
created.

This is a requirement for some devices to use DDW, given they only
allow one DMA window.

If setting up a new DDW fails anywhere after the removal of this
default DMA window, it's needed to restore the default DMA window.
For this, an implementation of ibm,reset-pe-dma-windows rtas call is
needed:

Platforms supporting the DDW option starting with LoPAR level 2.7 implement
ibm,ddw-extensions. The first extension available (index 2) carries the
token for ibm,reset-pe-dma-windows rtas call, which is used to restore
the default DMA window for a device, if it has been deleted.

It does so by resetting the TCE table allocation for the PE to it's
boot time value, available in "ibm,dma-window" device tree node.

Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
---
 arch/powerpc/platforms/pseries/iommu.c | 70 ++++++++++++++++++++++----
 1 file changed, 61 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index a8840d9e1c35..4fcf00016fb1 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1029,6 +1029,39 @@ static phys_addr_t ddw_memory_hotplug_max(void)
 	return max_addr;
 }
 
+/*
+ * Platforms supporting the DDW option starting with LoPAR level 2.7 implement
+ * ibm,ddw-extensions, which carries the rtas token for
+ * ibm,reset-pe-dma-windows.
+ * That rtas-call can be used to restore the default DMA window for the device.
+ */
+static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
+{
+	int ret;
+	u32 cfg_addr, ddw_ext[DDW_EXT_RESET_DMA_WIN + 1];
+	u64 buid;
+	struct device_node *dn;
+	struct pci_dn *pdn;
+
+	ret = of_property_read_u32_array(par_dn, "ibm,ddw-extensions",
+					 &ddw_ext[0], DDW_EXT_RESET_DMA_WIN + 1);
+	if (ret)
+		return;
+
+	dn = pci_device_to_OF_node(dev);
+	pdn = PCI_DN(dn);
+	buid = pdn->phb->buid;
+	cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8));
+
+	ret = rtas_call(ddw_ext[DDW_EXT_RESET_DMA_WIN], 3, 1, NULL, cfg_addr,
+			BUID_HI(buid), BUID_LO(buid));
+	if (ret)
+		dev_info(&dev->dev,
+			 "ibm,reset-pe-dma-windows(%x) %x %x %x returned %d ",
+			 ddw_ext[1], cfg_addr, BUID_HI(buid), BUID_LO(buid),
+			 ret);
+}
+
 /*
  * If the PE supports dynamic dma windows, and there is space for a table
  * that can map all pages in a linear offset, then setup such a table,
@@ -1049,8 +1082,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	u64 dma_addr, max_addr;
 	struct device_node *dn;
 	u32 ddw_avail[DDW_APPLICABLE_SIZE];
+
 	struct direct_window *window;
-	struct property *win64;
+	struct property *win64, *default_win = NULL, *ddw_ext = NULL;
 	struct dynamic_dma_window_prop *ddwprop;
 	struct failed_ddw_pdn *fpdn;
 
@@ -1085,7 +1119,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	if (ret)
 		goto out_failed;
 
-       /*
+	/*
 	 * Query if there is a second window of size to map the
 	 * whole partition.  Query returns number of windows, largest
 	 * block assigned to PE (partition endpoint), and two bitmasks
@@ -1096,15 +1130,31 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	if (ret != 0)
 		goto out_failed;
 
+	/*
+	 * If there is no window available, remove the default DMA window,
+	 * if it's present. This will make all the resources available to the
+	 * new DDW window.
+	 * If anything fails after this, we need to restore it, so also check
+	 * for extensions presence.
+	 */
 	if (query.windows_available == 0) {
-		/*
-		 * no additional windows are available for this device.
-		 * We might be able to reallocate the existing window,
-		 * trading in for a larger page size.
-		 */
-		dev_dbg(&dev->dev, "no free dynamic windows");
-		goto out_failed;
+		default_win = of_find_property(pdn, "ibm,dma-window", NULL);
+		ddw_ext = of_find_property(pdn, "ibm,ddw-extensions", NULL);
+		if (default_win && ddw_ext)
+			remove_dma_window(pdn, ddw_avail, default_win);
+
+		/* Query again, to check if the window is available */
+		ret = query_ddw(dev, ddw_avail, &query, pdn);
+		if (ret != 0)
+			goto out_failed;
+
+		if (query.windows_available == 0) {
+			/* no windows are available for this device. */
+			dev_dbg(&dev->dev, "no free dynamic windows");
+			goto out_failed;
+		}
 	}
+
 	if (query.page_size & 4) {
 		page_shift = 24; /* 16MB */
 	} else if (query.page_size & 2) {
@@ -1194,6 +1244,8 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	kfree(win64);
 
 out_failed:
+	if (default_win && ddw_ext)
+		reset_dma_window(dev, pdn);
 
 	fpdn = kzalloc(sizeof(*fpdn), GFP_KERNEL);
 	if (!fpdn)
-- 
2.25.4


^ permalink raw reply related

* [PATCH v2 3/6] powerpc/pseries/iommu: Move window-removing part of remove_ddw into remove_dma_window
From: Leonardo Bras @ 2020-06-24  6:24 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Leonardo Bras, Alexey Kardashevskiy, Thiago Jung Bauermann,
	Ram Pai
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200624062411.367796-1-leobras.c@gmail.com>

Move the window-removing part of remove_ddw into a new function
(remove_dma_window), so it can be used to remove other DMA windows.

It's useful for removing DMA windows that don't create DIRECT64_PROPNAME
property, like the default DMA window from the device, which uses
"ibm,dma-window".

Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
---
 arch/powerpc/platforms/pseries/iommu.c | 45 +++++++++++++++-----------
 1 file changed, 27 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 558e5441c355..a8840d9e1c35 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -776,25 +776,14 @@ static int __init disable_ddw_setup(char *str)
 
 early_param("disable_ddw", disable_ddw_setup);
 
-static void remove_ddw(struct device_node *np, bool remove_prop)
+static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
+			      struct property *win)
 {
 	struct dynamic_dma_window_prop *dwp;
-	struct property *win64;
-	u32 ddw_avail[DDW_APPLICABLE_SIZE];
 	u64 liobn;
-	int ret = 0;
-
-	ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
-					 &ddw_avail[0], DDW_APPLICABLE_SIZE);
-
-	win64 = of_find_property(np, DIRECT64_PROPNAME, NULL);
-	if (!win64)
-		return;
-
-	if (ret || win64->length < sizeof(*dwp))
-		goto delprop;
+	int ret;
 
-	dwp = win64->value;
+	dwp = win->value;
 	liobn = (u64)be32_to_cpu(dwp->liobn);
 
 	/* clear the whole window, note the arg is in kernel pages */
@@ -816,10 +805,30 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
 		pr_debug("%pOF: successfully removed direct window: rtas returned "
 			"%d to ibm,remove-pe-dma-window(%x) %llx\n",
 			np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
+}
+
+static void remove_ddw(struct device_node *np, bool remove_prop)
+{
+	struct property *win;
+	u32 ddw_avail[DDW_APPLICABLE_SIZE];
+	int ret = 0;
+
+	ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
+					 &ddw_avail[0], DDW_APPLICABLE_SIZE);
+	if (ret)
+		return;
+
+	win = of_find_property(np, DIRECT64_PROPNAME, NULL);
+	if (!win)
+		return;
+
+	if (win->length >= sizeof(struct dynamic_dma_window_prop))
+		remove_dma_window(np, ddw_avail, win);
+
+	if (!remove_prop)
+		return;
 
-delprop:
-	if (remove_prop)
-		ret = of_remove_property(np, win64);
+	ret = of_remove_property(np, win);
 	if (ret)
 		pr_warn("%pOF: failed to remove direct window property: %d\n",
 			np, ret);
-- 
2.25.4


^ permalink raw reply related

* [PATCH v2 2/6] powerpc/pseries/iommu: Update call to ibm, query-pe-dma-windows
From: Leonardo Bras @ 2020-06-24  6:24 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Leonardo Bras, Alexey Kardashevskiy, Thiago Jung Bauermann,
	Ram Pai
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200624062411.367796-1-leobras.c@gmail.com>

From LoPAR level 2.8, "ibm,ddw-extensions" index 3 can make the number of
outputs from "ibm,query-pe-dma-windows" go from 5 to 6.

This change of output size is meant to expand the address size of
largest_available_block PE TCE from 32-bit to 64-bit, which ends up
shifting page_size and migration_capable.

This ends up requiring the update of
ddw_query_response->largest_available_block from u32 to u64, and manually
assigning the values from the buffer into this struct, according to
output size.

Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
---
 arch/powerpc/platforms/pseries/iommu.c | 57 +++++++++++++++++++++-----
 1 file changed, 47 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 68d2aa9c71a8..558e5441c355 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -44,6 +44,10 @@
 #define DDW_REMOVE_PE_DMA_WIN	2
 #define DDW_APPLICABLE_SIZE	3
 
+#define DDW_EXT_SIZE		0
+#define DDW_EXT_RESET_DMA_WIN	1
+#define DDW_EXT_QUERY_OUT_SIZE	2
+
 static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 {
 	struct iommu_table_group *table_group;
@@ -339,7 +343,7 @@ struct direct_window {
 /* Dynamic DMA Window support */
 struct ddw_query_response {
 	u32 windows_available;
-	u32 largest_available_block;
+	u64 largest_available_block;
 	u32 page_size;
 	u32 migration_capable;
 };
@@ -875,13 +879,29 @@ static int find_existing_ddw_windows(void)
 machine_arch_initcall(pseries, find_existing_ddw_windows);
 
 static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail,
-			struct ddw_query_response *query)
+		     struct ddw_query_response *query,
+		     struct device_node *parent)
 {
 	struct device_node *dn;
 	struct pci_dn *pdn;
-	u32 cfg_addr;
+	u32 cfg_addr, query_out[5], ddw_ext[DDW_EXT_QUERY_OUT_SIZE + 1];
 	u64 buid;
-	int ret;
+	int ret, out_sz;
+
+	/*
+	 * From LoPAR level 2.8, "ibm,ddw-extensions" index 3 can rule how many
+	 * output parameters ibm,query-pe-dma-windows will have, ranging from
+	 * 5 to 6.
+	 */
+
+	ret = of_property_read_u32_array(parent, "ibm,ddw-extensions",
+					 &ddw_ext[0],
+					 DDW_EXT_QUERY_OUT_SIZE + 1);
+	if (ret && ddw_ext[DDW_EXT_SIZE] > 1 &&
+	    ddw_ext[DDW_EXT_QUERY_OUT_SIZE] == 1)
+		out_sz = 6;
+	else
+		out_sz = 5;
 
 	/*
 	 * Get the config address and phb buid of the PE window.
@@ -894,11 +914,28 @@ static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail,
 	buid = pdn->phb->buid;
 	cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8));
 
-	ret = rtas_call(ddw_avail[DDW_QUERY_PE_DMA_WIN], 3, 5, (u32 *)query,
+	ret = rtas_call(ddw_avail[DDW_QUERY_PE_DMA_WIN], 3, out_sz, query_out,
 			cfg_addr, BUID_HI(buid), BUID_LO(buid));
-	dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x"
-		" returned %d\n", ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr,
-		 BUID_HI(buid), BUID_LO(buid), ret);
+	dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x returned %d\n",
+		 ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr, BUID_HI(buid),
+		 BUID_LO(buid), ret);
+
+	switch (out_sz) {
+	case 5:
+		query->windows_available = query_out[0];
+		query->largest_available_block = query_out[1];
+		query->page_size = query_out[2];
+		query->migration_capable = query_out[3];
+		break;
+	case 6:
+		query->windows_available = query_out[0];
+		query->largest_available_block = ((u64)query_out[1] << 32) |
+						 query_out[2];
+		query->page_size = query_out[3];
+		query->migration_capable = query_out[4];
+		break;
+	}
+
 	return ret;
 }
 
@@ -1046,7 +1083,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	 * of page sizes: supported and supported for migrate-dma.
 	 */
 	dn = pci_device_to_OF_node(dev);
-	ret = query_ddw(dev, ddw_avail, &query);
+	ret = query_ddw(dev, ddw_avail, &query, pdn);
 	if (ret != 0)
 		goto out_failed;
 
@@ -1074,7 +1111,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	/* check largest block * page size > max memory hotplug addr */
 	max_addr = ddw_memory_hotplug_max();
 	if (query.largest_available_block < (max_addr >> page_shift)) {
-		dev_dbg(&dev->dev, "can't map partition max 0x%llx with %u "
+		dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
 			  "%llu-sized pages\n", max_addr,  query.largest_available_block,
 			  1ULL << page_shift);
 		goto out_failed;
-- 
2.25.4


^ permalink raw reply related

* [PATCH v2 1/6] powerpc/pseries/iommu: Create defines for operations in ibm, ddw-applicable
From: Leonardo Bras @ 2020-06-24  6:24 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Leonardo Bras, Alexey Kardashevskiy, Thiago Jung Bauermann,
	Ram Pai
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200624062411.367796-1-leobras.c@gmail.com>

Create defines to help handling ibm,ddw-applicable values, avoiding
confusion about the index of given operations.

Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
---
 arch/powerpc/platforms/pseries/iommu.c | 40 +++++++++++++++-----------
 1 file changed, 23 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 6d47b4a3ce39..68d2aa9c71a8 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -39,6 +39,11 @@
 
 #include "pseries.h"
 
+#define DDW_QUERY_PE_DMA_WIN	0
+#define DDW_CREATE_PE_DMA_WIN	1
+#define DDW_REMOVE_PE_DMA_WIN	2
+#define DDW_APPLICABLE_SIZE	3
+
 static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 {
 	struct iommu_table_group *table_group;
@@ -771,12 +776,12 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
 {
 	struct dynamic_dma_window_prop *dwp;
 	struct property *win64;
-	u32 ddw_avail[3];
+	u32 ddw_avail[DDW_APPLICABLE_SIZE];
 	u64 liobn;
 	int ret = 0;
 
 	ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
-					 &ddw_avail[0], 3);
+					 &ddw_avail[0], DDW_APPLICABLE_SIZE);
 
 	win64 = of_find_property(np, DIRECT64_PROPNAME, NULL);
 	if (!win64)
@@ -798,15 +803,15 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
 		pr_debug("%pOF successfully cleared tces in window.\n",
 			 np);
 
-	ret = rtas_call(ddw_avail[2], 1, 1, NULL, liobn);
+	ret = rtas_call(ddw_avail[DDW_REMOVE_PE_DMA_WIN], 1, 1, NULL, liobn);
 	if (ret)
 		pr_warn("%pOF: failed to remove direct window: rtas returned "
 			"%d to ibm,remove-pe-dma-window(%x) %llx\n",
-			np, ret, ddw_avail[2], liobn);
+			np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
 	else
 		pr_debug("%pOF: successfully removed direct window: rtas returned "
 			"%d to ibm,remove-pe-dma-window(%x) %llx\n",
-			np, ret, ddw_avail[2], liobn);
+			np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
 
 delprop:
 	if (remove_prop)
@@ -889,11 +894,11 @@ static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail,
 	buid = pdn->phb->buid;
 	cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8));
 
-	ret = rtas_call(ddw_avail[0], 3, 5, (u32 *)query,
-		  cfg_addr, BUID_HI(buid), BUID_LO(buid));
+	ret = rtas_call(ddw_avail[DDW_QUERY_PE_DMA_WIN], 3, 5, (u32 *)query,
+			cfg_addr, BUID_HI(buid), BUID_LO(buid));
 	dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x"
-		" returned %d\n", ddw_avail[0], cfg_addr, BUID_HI(buid),
-		BUID_LO(buid), ret);
+		" returned %d\n", ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr,
+		 BUID_HI(buid), BUID_LO(buid), ret);
 	return ret;
 }
 
@@ -920,15 +925,16 @@ static int create_ddw(struct pci_dev *dev, const u32 *ddw_avail,
 
 	do {
 		/* extra outputs are LIOBN and dma-addr (hi, lo) */
-		ret = rtas_call(ddw_avail[1], 5, 4, (u32 *)create,
-				cfg_addr, BUID_HI(buid), BUID_LO(buid),
-				page_shift, window_shift);
+		ret = rtas_call(ddw_avail[DDW_CREATE_PE_DMA_WIN], 5, 4,
+				(u32 *)create, cfg_addr, BUID_HI(buid),
+				BUID_LO(buid), page_shift, window_shift);
 	} while (rtas_busy_delay(ret));
 	dev_info(&dev->dev,
 		"ibm,create-pe-dma-window(%x) %x %x %x %x %x returned %d "
-		"(liobn = 0x%x starting addr = %x %x)\n", ddw_avail[1],
-		 cfg_addr, BUID_HI(buid), BUID_LO(buid), page_shift,
-		 window_shift, ret, create->liobn, create->addr_hi, create->addr_lo);
+		"(liobn = 0x%x starting addr = %x %x)\n",
+		 ddw_avail[DDW_CREATE_PE_DMA_WIN], cfg_addr, BUID_HI(buid),
+		 BUID_LO(buid), page_shift, window_shift, ret, create->liobn,
+		 create->addr_hi, create->addr_lo);
 
 	return ret;
 }
@@ -996,7 +1002,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	int page_shift;
 	u64 dma_addr, max_addr;
 	struct device_node *dn;
-	u32 ddw_avail[3];
+	u32 ddw_avail[DDW_APPLICABLE_SIZE];
 	struct direct_window *window;
 	struct property *win64;
 	struct dynamic_dma_window_prop *ddwprop;
@@ -1029,7 +1035,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	 * the property is actually in the parent, not the PE
 	 */
 	ret = of_property_read_u32_array(pdn, "ibm,ddw-applicable",
-					 &ddw_avail[0], 3);
+					 &ddw_avail[0], DDW_APPLICABLE_SIZE);
 	if (ret)
 		goto out_failed;
 
-- 
2.25.4


^ permalink raw reply related

* [PATCH v2 0/6] Remove default DMA window before creating DDW
From: Leonardo Bras @ 2020-06-24  6:24 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Alexey Kardashevskiy, Leonardo Bras, Thiago Jung Bauermann,
	Ram Pai
  Cc: linuxppc-dev, linux-kernel

There are some devices that only allow 1 DMA window to exist at a time,
and in those cases, a DDW is never created to them, since the default DMA
window keeps using this resource.

LoPAR recommends this procedure:
1. Remove the default DMA window,
2. Query for which configs the DDW can be created,
3. Create a DDW.

Patch #1:
Create defines for outputs of ibm,ddw-applicable, so it's easier to
identify them.

Patch #2:
- After LoPAR level 2.8, there is an extension that can make
  ibm,query-pe-dma-windows to have 6 outputs instead of 5. This changes the
  order of the outputs, and that can cause some trouble. 
- query_ddw() was updated to check how many outputs the 
  ibm,query-pe-dma-windows is supposed to have, update the rtas_call() and
  deal correctly with the outputs in both cases.
- This patch looks somehow unrelated to the series, but it can avoid future
  problems on DDW creation.

Patch #3 moves the window-removing code from remove_ddw() to
remove_dma_window(), creating a way to delete any DMA window, so it can be
used to delete the default DMA window.

Patch #4 makes use of the remove_dma_window() from patch #3 to remove the
default DMA window before query_ddw(). It also implements a new rtas call
to recover the default DMA window, in case anything fails after it was
removed, and a DDW couldn't be created.

Patch #5:
Instead of destroying the created DDW if it doesn't map the whole
partition, make use of it instead of the default DMA window.

Patch #6:
Changes the way iommu_bypass_supported_pSeriesLP() check for 
iommu_bypass: instead of checking the address returned by enable_ddw(),
it checks a new output value that reflects if the DDW created maps
the whole partition.


All patches were tested into an LPAR with an Ethernet VF:
4005:01:00.0 Ethernet controller: Mellanox Technologies MT27700 Family
[ConnectX-4 Virtual Function]

---
Changes since v1:
- Add defines for ibm,ddw-applicable and ibm,ddw-extensions outputs
- Merge aux function query_ddw_out_sz() into query_ddw()
- Merge reset_dma_window() patch (prev. #2) into remove default DMA
  window patch (#4).
- Keep device_node *np name instead of using pdn in remove_*()
- Rename 'device_node *pdn' into 'parent' in new functions
- Rename dfl_win to default_win
- Only remove the default DMA window if there is no window available
  in first query.
- Check if default DMA window can be restored before removing it.
- Fix 'unitialized use' (found by travis mpe:ci-test)
- New patches #5 and #6

Special thanks for Alexey Kardashevskiy and Oliver O'Halloran for
the feedback provided!

Leonardo Bras (6):
  powerpc/pseries/iommu: Create defines for operations in
    ibm,ddw-applicable
  powerpc/pseries/iommu: Update call to ibm,query-pe-dma-windows
  powerpc/pseries/iommu: Move window-removing part of remove_ddw into
    remove_dma_window
  powerpc/pseries/iommu: Remove default DMA window before creating DDW
  powerpc/pseries/iommu: Make use of DDW even if it does not map the
    partition
  powerpc/pseries/iommu: Avoid errors when DDW starts at 0x00

 arch/powerpc/platforms/pseries/iommu.c | 239 ++++++++++++++++++-------
 1 file changed, 176 insertions(+), 63 deletions(-)

-- 
2.25.4





^ permalink raw reply

* [PATCH RFC 1/1] powerpc/eeh: PE info tree via debugfs and syslog
From: Sam Bobroff @ 2020-06-24  5:57 UTC (permalink / raw)
  To: linuxppc-dev

Provide an ASCII art tree display of the PEs affected by recovery,
with as much state as possible, at the start and end of recovery.

Some platform specific information is provided via a new eeh_ops
member.

The same information is made available from debugfs at:

/sys/kernel/debug/powerpc/PCIXXXX/eeh_pe_tree

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
---
Here's some debug code I've been using for a long time while working on EEH. I
haven't posted it before because it wasn't possible to make the code safe
enough (to avoid either NULL or LIST_POISON), but with the recent safety work
done it's become possible.

It should be applied on top of:

"powerpc/eeh: Synchronization for safety"
"powerpc/eeh: Provide a unique ID for each EEH recovery"
"powerpc/eeh: Asynchronous recovery"

 arch/powerpc/include/asm/eeh.h               |  3 +
 arch/powerpc/kernel/eeh.c                    | 90 ++++++++++++++++++++
 arch/powerpc/kernel/eeh_driver.c             | 28 ++++++
 arch/powerpc/platforms/powernv/eeh-powernv.c | 63 +++++++++++++-
 arch/powerpc/platforms/pseries/eeh_pseries.c | 21 ++++-
 5 files changed, 203 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index dd55d1bf1cfd..46dec5b2482e 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -230,6 +230,7 @@ struct eeh_ops {
 	int (*next_error)(struct eeh_pe **pe);
 	int (*restore_config)(struct pci_dn *pdn);
 	int (*notify_resume)(struct pci_dn *pdn);
+	void (*pe_plat_state_dump)(char *buf, size_t len, struct eeh_pe *pe);
 };
 
 extern int eeh_subsystem_flags;
@@ -324,6 +325,8 @@ int eeh_pe_configure(struct eeh_pe *pe);
 int eeh_pe_inject_err(struct eeh_pe *pe, int type, int func,
 		      unsigned long addr, unsigned long mask);
 int eeh_restore_vf_config(struct pci_dn *pdn);
+void eeh_tree_state_dump(void (*pfn)(void *, const char *, ...),
+			 void *s, struct eeh_pe *pe, int level, int xlevel);
 
 /**
  * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure.
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 54f921ff7621..6f675f277d26 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -839,6 +839,96 @@ int eeh_restore_vf_config(struct pci_dn *pdn)
 	return 0;
 }
 
+static void eeh_tree_state_indent(void (*pfn)(void *, const char *, ...),
+				  void *s, int level, int xlevel, bool node)
+{
+	int i;
+
+	for (i = 0; i < level; i++)
+		pfn(s, "%c   ", ((xlevel & (1 << i)) ? '|' : ' '));
+	if (node)
+		pfn(s, "%c---", ((xlevel & (1 << level)) ? '+' : '\\'));
+	else
+		pfn(s, "%c...", ((xlevel & (1 << level)) ? '|' : ' '));
+}
+
+void eeh_tree_state_dump(void (*pfn)(void *, const char *, ...),
+			 void *s, struct eeh_pe *pe, int level, int xlevel)
+{
+	struct eeh_dev *edev;
+	struct eeh_pe *child_pe;
+	int slevel, sxlevel;
+	char buf[1024];
+
+	eeh_recovery_must_be_locked();
+	eeh_tree_state_indent(pfn, s, level, xlevel, true);
+	pfn(s, "* [PE#0x%x] type=%s%s%s%s%s config_addr=0x%x pass_count=%d\n",
+	    pe->addr,
+	    ((pe->type & EEH_PE_INVALID) ? "INVALID " : ""),
+	    ((pe->type & EEH_PE_PHB) ? "PHB " : ""),
+	    ((pe->type & EEH_PE_DEVICE) ? "DEVICE " : ""),
+	    ((pe->type & EEH_PE_BUS) ? "BUS " : ""),
+	    ((pe->type & EEH_PE_VF) ? "VF " : ""),
+	    pe->config_addr,
+	    atomic_read(&pe->pass_dev_cnt));
+
+	slevel = level + 1;
+	sxlevel = xlevel;
+	if (!list_empty(&pe->edevs) || !list_empty(&pe->child_list))
+		sxlevel |= (1 << slevel);
+	eeh_tree_state_indent(pfn, s, slevel, sxlevel, false);
+	pfn(s, "  check_count=%d freeze_count=%d false_positives=%d first_freeze=%llu\n",
+		   pe->check_count, pe->freeze_count, pe->false_positives,
+		   pe->tstamp);
+	eeh_tree_state_indent(pfn, s, slevel, sxlevel, false);
+	pfn(s, "  kernel state=0x%x %s%s%s%s%s%s%s%s\n", pe->state,
+	    ((pe->state & EEH_PE_ISOLATED) ? "ISOLATED " : ""),
+	    ((pe->state & EEH_PE_RECOVERING) ? "RECOVERING " : ""),
+	    ((pe->state & EEH_PE_CFG_BLOCKED) ? "CFG_BLOCKED " : ""),
+	    ((pe->state & EEH_PE_RESET) ? "RESET " : ""),
+	    ((pe->state & EEH_PE_KEEP) ? "KEEP " : ""),
+	    ((pe->state & EEH_PE_CFG_RESTRICTED) ? "CFG_RESTRICTED " : ""),
+	    ((pe->state & EEH_PE_REMOVED) ? "REMOVED" : ""),
+	    ((pe->state & EEH_PE_PRI_BUS) ? "PRI_BUS" : ""));
+	if (eeh_ops->pe_plat_state_dump) {
+		eeh_tree_state_indent(pfn, s, slevel, sxlevel, false);
+		eeh_ops->pe_plat_state_dump(buf, sizeof(buf), pe);
+		pfn(s, "  %.*s\n", sizeof(buf), buf);
+	}
+
+	slevel = level + 1;
+	list_for_each_entry(edev, &pe->edevs, entry) {
+		struct pci_dev *pdev = edev->pdev;
+
+		sxlevel = xlevel;
+		if ((edev != list_last_entry(&pe->edevs, struct eeh_dev, entry)) ||
+		    !list_empty(&pe->child_list))
+			sxlevel |= (1 << slevel);
+		eeh_tree_state_indent(pfn, s, slevel, sxlevel, true);
+		pfn(s, "* [DEV %s] pe_config_addr=0x%x driver=%s\n",
+		    pci_name(pdev), edev->pe_config_addr, eeh_driver_name(pdev));
+		eeh_tree_state_indent(pfn, s, slevel + 1, sxlevel, false);
+		pfn(s, "mode=0x%x %s%s%s%s%s%s%s%s\n", edev->mode,
+		    ((edev->mode & EEH_DEV_BRIDGE) ? "BRIDGE " : ""),
+		    ((edev->mode & EEH_DEV_ROOT_PORT) ? "ROOT_PORT " : ""),
+		    ((edev->mode & EEH_DEV_DS_PORT) ? "DS_PORT " : ""),
+		    ((edev->mode & EEH_DEV_IRQ_DISABLED) ? "IRQ_DISABLED " : ""),
+		    ((edev->mode & EEH_DEV_DISCONNECTED) ? "DISCONNECTED " : ""),
+		    ((edev->mode & EEH_DEV_NO_HANDLER) ? "NO_HANDLER " : ""),
+		    ((edev->mode & EEH_DEV_SYSFS) ? "SYSFS " : ""),
+		    ((edev->mode & EEH_DEV_REMOVED) ? "REMOVED " : ""));
+	}
+
+	slevel = level + 1;
+	list_for_each_entry(child_pe, &pe->child_list, child) {
+		sxlevel = xlevel;
+		if (child_pe != list_last_entry(&pe->child_list, struct eeh_pe, child))
+			sxlevel |= (1 << slevel);
+		eeh_tree_state_dump(pfn, s, child_pe, slevel, sxlevel);
+	}
+}
+
+
 /**
  * pcibios_set_pcie_reset_state - Set PCI-E reset state
  * @dev: pci device struct
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 9d03292f66a7..7b077599c567 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -15,6 +15,7 @@
 #include <linux/kthread.h>
 #include <linux/workqueue.h>
 #include <linux/completion.h>
+#include <linux/seq_buf.h>
 #include <asm/eeh.h>
 #include <asm/eeh_event.h>
 #include <asm/ppc-pci.h>
@@ -996,6 +997,26 @@ static int eeh_reset_device(unsigned int event_id,
 	return 0;
 }
 
+static void eeh_tree_state_dump_kprintf(struct eeh_pe *root_pe)
+{
+	struct seq_buf s;
+	char *buf, *p, *q;
+
+	buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	seq_buf_init(&s, buf, PAGE_SIZE);
+
+	eeh_tree_state_dump((void (*)(void *, const char *, ...))seq_buf_printf, &s, root_pe, 0, 0);
+
+	/* printk's output is limited to LOG_LINE_MAX, so split into lines: */
+	/* buf must end with a \n */
+	p = buf;
+	while ((q = strstr(p, "\n"))) {
+		printk(KERN_INFO "%.*s", (int)(q - p + 1), p);
+		p = q + 1;
+	}
+	kfree(buf);
+}
+
 /* The longest amount of time to wait for a pci device
  * to come back on line, in seconds.
  */
@@ -1129,6 +1150,9 @@ void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
 	int devices = 0;
 
 	eeh_recovery_lock();
+	pr_info("EEH(%u): PE state before recovery:\n", event_id);
+	eeh_tree_state_dump_kprintf(pe);
+
 	bus = eeh_pe_bus_get(pe);
 	if (!bus) {
 		pr_err("EEH(%u): %s: Cannot find PCI bus for PHB#%x-PE#%x\n",
@@ -1381,6 +1405,8 @@ void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
 			eeh_pe_state_clear(pe, EEH_PE_PRI_BUS, true);
 			eeh_pe_dev_mode_mark(pe, EEH_DEV_REMOVED);
 
+			pr_info("PE state after recovery (before final device removal):\n");
+			eeh_tree_state_dump_kprintf(pe);
 			eeh_recovery_unlock();
 			pci_lock_rescan_remove();
 			pci_hp_remove_devices(bus);
@@ -1405,6 +1431,8 @@ void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
 			eeh_clear_slot_attention(edev->pdev);
 
 	eeh_pe_state_clear(pe, EEH_PE_RECOVERING, true);
+	pr_info("EEH(%u): PE state after recovery:\n", event_id);
+	eeh_tree_state_dump_kprintf(pe);
 	eeh_recovery_unlock();
 
 }
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 1f6ec78ad88b..82b705e3f1e7 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -206,6 +206,30 @@ PNV_EEH_DBGFS_ENTRY(outb, 0xD10);
 PNV_EEH_DBGFS_ENTRY(inbA, 0xD90);
 PNV_EEH_DBGFS_ENTRY(inbB, 0xE10);
 
+static int eeh_tree_state_dump_seq_file(struct seq_file *s, void *unused)
+{
+	struct eeh_pe *root_pe = s->private;
+
+	eeh_recovery_lock();
+	eeh_tree_state_dump((void (*)(void *, const char *, ...))seq_printf,
+			    s, root_pe, 0, 0);
+	eeh_recovery_unlock();
+	return 0;
+}
+
+static int eeh_tree_state_dbgfs_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, eeh_tree_state_dump_seq_file, eeh_phb_pe_get(inode->i_private));
+}
+
+static const struct file_operations eeh_tree_state_debugfs_ops = {
+	.owner		= THIS_MODULE,
+	.open		= eeh_tree_state_dbgfs_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 #endif /* CONFIG_DEBUG_FS */
 
 void pnv_eeh_enable_phbs(void)
@@ -287,6 +311,9 @@ int pnv_eeh_post_init(void)
 		debugfs_create_file("err_injct_inboundB", 0600,
 				    phb->dbgfs, hose,
 				    &pnv_eeh_dbgfs_ops_inbB);
+		debugfs_create_file("eeh_pe_tree", 0400,
+				    phb->dbgfs, hose,
+				    &eeh_tree_state_debugfs_ops);
 #endif /* CONFIG_DEBUG_FS */
 	}
 
@@ -1685,6 +1712,39 @@ static int pnv_eeh_restore_config(struct pci_dn *pdn)
 	return ret;
 }
 
+void pnv_eeh_pe_plat_state_dump(char *buf, size_t len, struct eeh_pe *pe)
+{
+	struct pnv_phb *phb = pe->phb->private_data;
+	s64 rc;
+	u8 state = 0;
+	__be16 pcierr = 0;
+
+	/* Collect state directly from OPAL to avoid side effects: */
+	rc = opal_pci_eeh_freeze_status(phb->opal_id, pe->addr, &state,
+					&pcierr, NULL);
+	pcierr = be16_to_cpu(pcierr);
+
+	if (rc == OPAL_SUCCESS)
+		snprintf(buf, len, "OPAL state=0x%x %s%s%s%s%s%s%s pcierr=0x%x %s%s%s%s%s%s",
+			 state,
+			 ((state == OPAL_EEH_STOPPED_NOT_FROZEN) ? "OK" : ""),
+			 ((state == OPAL_EEH_STOPPED_MMIO_FREEZE) ? "MMIO_FREEZE" : ""),
+			 ((state == OPAL_EEH_STOPPED_DMA_FREEZE) ? "DMA_FREEZE" : ""),
+			 ((state == OPAL_EEH_STOPPED_MMIO_DMA_FREEZE) ? "MMIO_DMA_FREEZE" : ""),
+			 ((state == OPAL_EEH_STOPPED_RESET) ? "STOPPED_RESET" : ""),
+			 ((state == OPAL_EEH_STOPPED_TEMP_UNAVAIL) ? "STOPPED_TEMP_UNAVAIL" : ""),
+			 ((state == OPAL_EEH_STOPPED_PERM_UNAVAIL) ? "STOPPED_PERM_UNAVAIL" : ""),
+			 pcierr,
+			 ((pcierr == OPAL_EEH_NO_ERROR) ? "OK" : ""),
+			 ((pcierr == OPAL_EEH_IOC_ERROR) ? "IOC_ERROR" : ""),
+			 ((pcierr == OPAL_EEH_PHB_ERROR) ? "PHB_ERROR" : ""),
+			 ((pcierr == OPAL_EEH_PE_ERROR) ? "PE_ERROR" : ""),
+			 ((pcierr == OPAL_EEH_PE_MMIO_ERROR) ? "MMIO_ERROR" : ""),
+			 ((pcierr == OPAL_EEH_PE_DMA_ERROR) ? "DMA_ERROR" : ""));
+	else
+		snprintf(buf, len, "OPAL error=0x%llx", rc);
+}
+
 static struct eeh_ops pnv_eeh_ops = {
 	.name                   = "powernv",
 	.init                   = pnv_eeh_init,
@@ -1700,7 +1760,8 @@ static struct eeh_ops pnv_eeh_ops = {
 	.write_config           = pnv_eeh_write_config,
 	.next_error		= pnv_eeh_next_error,
 	.restore_config		= pnv_eeh_restore_config,
-	.notify_resume		= NULL
+	.notify_resume		= NULL,
+	.pe_plat_state_dump	= pnv_eeh_pe_plat_state_dump,
 };
 
 #ifdef CONFIG_PCI_IOV
diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c b/arch/powerpc/platforms/pseries/eeh_pseries.c
index c49fab23f1c9..3fbbd8583350 100644
--- a/arch/powerpc/platforms/pseries/eeh_pseries.c
+++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
@@ -782,6 +782,24 @@ static int pseries_notify_resume(struct pci_dn *pdn)
 }
 #endif
 
+void pseries_eeh_pe_plat_state_dump(char *buf, size_t len, struct eeh_pe *pe)
+{
+	int state, delay;
+
+	/* pseries_eeh_get_state() doesn't have side-effects, so we can use it here: */
+	state = pseries_eeh_get_state(pe, &delay);
+
+	snprintf(buf, len, "RTAS state=0x%x %s%s%s%s%s%s%s",
+	    state,
+	    ((state & EEH_STATE_UNAVAILABLE) ? "UNAVAILABLE " : ""),
+	    ((state & EEH_STATE_NOT_SUPPORT) ? "NOT_SUPPORT " : ""),
+	    ((state & EEH_STATE_RESET_ACTIVE) ? "RESET_ACTIVE " : ""),
+	    ((state & EEH_STATE_MMIO_ACTIVE) ? "MMIO_ACTIVE " : ""),
+	    ((state & EEH_STATE_DMA_ACTIVE) ? "DMA_ACTIVE " : ""),
+	    ((state & EEH_STATE_MMIO_ENABLED) ? "MMIO_ENABLED " : ""),
+	    ((state & EEH_STATE_DMA_ENABLED) ? "DMA_ENABLED " : ""));
+}
+
 static struct eeh_ops pseries_eeh_ops = {
 	.name			= "pseries",
 	.init			= pseries_eeh_init,
@@ -798,8 +816,9 @@ static struct eeh_ops pseries_eeh_ops = {
 	.next_error		= NULL,
 	.restore_config		= pseries_eeh_restore_config,
 #ifdef CONFIG_PCI_IOV
-	.notify_resume		= pseries_notify_resume
+	.notify_resume		= pseries_notify_resume,
 #endif
+	.pe_plat_state_dump	= pseries_eeh_pe_plat_state_dump,
 };
 
 /**
-- 
2.22.0.216.g00a2a96fc9


^ permalink raw reply related

* [PATCH RFC 1/1] powerpc/eeh: Asynchronous recovery
From: Sam Bobroff @ 2020-06-24  5:54 UTC (permalink / raw)
  To: linuxppc-dev

Currently, EEH recovery is entirely serialized and takes place within
a single kernel thread. This can cause recovery to take a long time
when there are many devices.

To shorten recovery time, this change allows recovery to proceed in
parallel in two ways:
- Each PHB is given it's own recovery event queue and can be recovered
independently from other PHBs.
- Driver handlers are called in parallel, but with the constraint that
handlers higher up (closer to the PHB) in the PE hierarchy must be
called before those lower down.

To maintain the constraint, above, the driver handlers are called by
traversing the tree of affected PEs from the top, stopping to call
handlers (in parallel) when a PE with devices is discovered. When the
calls for that PE are complete, traversal continues at each child PE.

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
---
This patch should be applied on top of both:
"powerpc/eeh: Synchronization for safety"
"powerpc/eeh: Provide a unique ID for each EEH recovery"

 arch/powerpc/include/asm/eeh.h        |   1 +
 arch/powerpc/include/asm/eeh_event.h  |   7 +
 arch/powerpc/include/asm/pci-bridge.h |   2 +
 arch/powerpc/kernel/eeh_dev.c         |   2 +
 arch/powerpc/kernel/eeh_driver.c      | 313 ++++++++++++++++++--------
 arch/powerpc/kernel/eeh_event.c       |  65 +++---
 6 files changed, 276 insertions(+), 114 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 1d4c0b19a63c..dd55d1bf1cfd 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -130,6 +130,7 @@ static inline bool eeh_pe_passed(struct eeh_pe *pe)
 #define EEH_DEV_NO_HANDLER	(1 << 8)	/* No error handler	*/
 #define EEH_DEV_SYSFS		(1 << 9)	/* Sysfs created	*/
 #define EEH_DEV_REMOVED		(1 << 10)	/* Removed permanently	*/
+#define EEH_DEV_RECOVERING	(1 << 11)	/* Recovering		*/
 
 struct eeh_dev {
 	int mode;			/* EEH mode			*/
diff --git a/arch/powerpc/include/asm/eeh_event.h b/arch/powerpc/include/asm/eeh_event.h
index a1fe736bc4cf..b21f49e87b7b 100644
--- a/arch/powerpc/include/asm/eeh_event.h
+++ b/arch/powerpc/include/asm/eeh_event.h
@@ -8,6 +8,8 @@
 #define ASM_POWERPC_EEH_EVENT_H
 #ifdef __KERNEL__
 
+#include <linux/workqueue.h>
+
 /*
  * structure holding pci controller data that describes a
  * change in the isolation status of a PCI slot.  A pointer
@@ -15,16 +17,21 @@
  * callback.
  */
 struct eeh_event {
+	struct work_struct	work;
 	struct list_head	list;	/* to form event queue	*/
 	struct eeh_pe		*pe;	/* EEH PE		*/
 	unsigned int		id;	/* Event ID		*/
 };
 
+extern spinlock_t eeh_eventlist_lock;
+
 int eeh_event_init(void);
+int eeh_phb_event(struct eeh_pe *pe);
 int eeh_send_failure_event(struct eeh_pe *pe);
 int __eeh_send_failure_event(struct eeh_pe *pe);
 void eeh_remove_event(struct eeh_pe *pe, bool force);
 void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe);
+void eeh_handle_normal_event_work(struct work_struct *work);
 void eeh_handle_special_event(void);
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 69f4cb3b7c56..2a9d639b18d1 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -127,6 +127,8 @@ struct pci_controller {
 
 	void *private_data;
 	struct npu *npu;
+	bool eeh_in_progress;
+	struct list_head eeh_eventlist;
 };
 
 /* These are used for config access before all the PCI probing
diff --git a/arch/powerpc/kernel/eeh_dev.c b/arch/powerpc/kernel/eeh_dev.c
index 7370185c7a05..2e48a1e142a9 100644
--- a/arch/powerpc/kernel/eeh_dev.c
+++ b/arch/powerpc/kernel/eeh_dev.c
@@ -62,6 +62,8 @@ struct eeh_dev *eeh_dev_init(struct pci_dn *pdn)
  */
 void eeh_dev_phb_init_dynamic(struct pci_controller *phb)
 {
+	phb->eeh_in_progress = false; /* TODO: Necessary? */
+	INIT_LIST_HEAD(&phb->eeh_eventlist);
 	/* EEH PE for PHB */
 	eeh_phb_pe_create(phb);
 }
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 0dbc218597e3..9d03292f66a7 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -12,6 +12,9 @@
 #include <linux/module.h>
 #include <linux/pci.h>
 #include <linux/pci_hotplug.h>
+#include <linux/kthread.h>
+#include <linux/workqueue.h>
+#include <linux/completion.h>
 #include <asm/eeh.h>
 #include <asm/eeh_event.h>
 #include <asm/ppc-pci.h>
@@ -19,6 +22,8 @@
 #include <asm/prom.h>
 #include <asm/rtas.h>
 
+static atomic_t eeh_wu_id = ATOMIC_INIT(0);
+
 struct eeh_rmv_data {
 	struct list_head removed_vf_list;
 	int removed_dev_count;
@@ -249,73 +254,58 @@ static void eeh_set_irq_state(struct eeh_pe *root, bool enable)
 }
 
 typedef enum pci_ers_result (*eeh_report_fn)(unsigned int event_id,
+					     unsigned int id,
 					     struct pci_dev *,
 					     struct pci_driver *);
 static void eeh_pe_report_pdev(unsigned int event_id,
-			       struct pci_dev *pdev, eeh_report_fn fn,
+			       unsigned int id,
+			       struct pci_dev *pdev,
+			       const char *fn_name, eeh_report_fn fn,
 			       enum pci_ers_result *result,
-			       const char *handler_name)
+			       bool late, bool removed, bool passed)
 {
-	struct eeh_dev *edev;
 	struct pci_driver *driver;
-	bool actionable, late, removed, passed;
 	enum pci_ers_result new_result;
 
-	edev = pci_dev_to_eeh_dev(pdev);
-	if (!edev) {
-		pci_info(pdev, "EEH(%u): no EEH state for device", event_id);
-		return;
-	}
-	/* Cache some useful values before releasing the lock: */
-	actionable = eeh_edev_actionable(edev);
-	late = edev->mode & EEH_DEV_NO_HANDLER;
-	removed = eeh_dev_removed(edev);
-	passed = eeh_pe_passed(edev->pe);
-	if (actionable) {
+	/*
+	 * Driver callbacks may end up calling back into EEH functions
+	 * (for example by removing a PCI device) which will deadlock
+	 * unless the EEH locks are released first. Note that it may be
+	 * re-acquired by the report functions, if necessary.
+	 */
+	device_lock(&pdev->dev);
+	driver = eeh_pcid_get(pdev);
+
+	if (!driver)
+		pci_info(pdev, "EEH(%u): W%u: no driver", event_id, id);
+	else if (!driver->err_handler)
+		pci_info(pdev, "EEH(%u): W%u: driver not EEH aware", event_id, id);
+	else if (late)
+		pci_info(pdev, "EEH(%u): W%u: driver bound too late", event_id, id);
+	else {
+		pci_info(pdev, "EEH(%u): EVENT=HANDLER_CALL HANDLER='%s'\n",
+			event_id, fn_name);
+
+		new_result = fn(event_id, id, pdev, driver);
 		/*
-		 * Driver callbacks may end up calling back into EEH functions
-		 * (for example by removing a PCI device) which will deadlock
-		 * unless the EEH locks are released first. Note that it may be
-		 * re-acquired by the report functions, if necessary.
+		 * It's not safe to use edev here, because the locks
+		 * have been released and devices could have changed.
 		 */
-		eeh_recovery_unlock();
-		device_lock(&pdev->dev);
-		driver = eeh_pcid_get(pdev);
-
-		if (!driver)
-			pci_info(pdev, "EEH(%u): no driver", event_id);
-		else if (!driver->err_handler)
-			pci_info(pdev, "EEH(%u): driver not EEH aware", event_id);
-		else if (late)
-			pci_info(pdev, "EEH(%u): driver bound too late", event_id);
-		else {
-			pr_warn("EEH(%u): EVENT=HANDLER_CALL DEVICE=%04x:%02x:%02x.%x DRIVER='%s' HANDLER='%s'\n",
-				event_id, edev->controller->global_number,
-				PCI_BUSNO(edev->bdfn), PCI_SLOT(edev->bdfn),
-				PCI_FUNC(edev->bdfn), driver->name, handler_name);
-
-			new_result = fn(event_id, pdev, driver);
-			/*
-			 * It's not safe to use edev here, because the locks
-			 * have been released and devices could have changed.
-			 */
-			pr_warn("EEH(%u): EVENT=HANDLER_RETURN RESULT='%s'\n",
-				event_id, pci_ers_result_name(new_result));
-			pci_info(pdev, "EEH(%u): %s driver reports: %s",
-				      event_id, driver->name,
-				      pci_ers_result_name(new_result));
-			if (result)
-				*result = pci_ers_merge_result(*result,
-							       new_result);
+		pr_warn("EEH(%u): EVENT=HANDLER_RETURN RESULT='%s'\n",
+			event_id, pci_ers_result_name(new_result));
+		pci_info(pdev, "EEH(%u): W%u: %s driver reports: '%s'",
+			 event_id, id, driver->name,
+			 pci_ers_result_name(new_result));
+		if (result) {
+			eeh_recovery_lock();
+			*result = pci_ers_merge_result(*result,
+						       new_result);
+			eeh_recovery_unlock();
 		}
-		if (driver)
-			eeh_pcid_put(pdev);
-		device_unlock(&pdev->dev);
-		eeh_recovery_lock();
-	} else {
-		pci_info(pdev, "EEH(%u): not actionable (%d,%d,%d)",
-			 event_id, !!pdev, !removed, !passed);
 	}
+	if (driver)
+		eeh_pcid_put(pdev);
+	device_unlock(&pdev->dev);
 }
 
 struct pci_dev **pdev_cache_list_create(struct eeh_pe *root)
@@ -355,27 +345,142 @@ static void pdev_cache_list_destroy(struct pci_dev **pdevs)
 	kfree(pdevs);
 }
 
-static void eeh_pe_report(unsigned int event_id,
-			  const char *name, struct eeh_pe *root,
-			  eeh_report_fn fn, enum pci_ers_result *result)
+struct work_unit {
+	unsigned int id;
+	struct work_struct work;
+	unsigned int event_id;
+	struct pci_dev *pdev;
+	struct eeh_pe *pe;
+	const char *fn_name;
+	eeh_report_fn fn;
+	enum pci_ers_result *result;
+	atomic_t *count;
+	struct completion *done;
+};
+
+static void eeh_pe_report_pdev_thread(struct work_struct *work);
+/*
+ * Traverse down from a PE through it's children, to find devices and enqueue
+ * jobs to call the handler (fn) on them.  But do not traverse below a PE that
+ * has devices, so that devices are always handled strictly before their
+ * children. (Traversal is continued by the jobs after handlers are called.)
+ * The recovery lock must be held.
+ * TODO: Convert away from recursive descent traversal?
+ */
+static bool enqueue_pe_work(struct eeh_pe *root, unsigned int event_id,
+			    const char *fn_name, eeh_report_fn fn,
+			    enum pci_ers_result *result, atomic_t *count,
+			    struct completion *done)
 {
-	struct pci_dev **pdevs, **pdevp;
+	struct eeh_pe *pe;
+	struct eeh_dev *edev, *tmp;
+	struct work_unit *wu;
+	bool work_added = false;
+
+	if (list_empty(&root->edevs)) {
+		list_for_each_entry(pe, &root->child_list, child)
+			work_added |= enqueue_pe_work(pe, event_id, fn_name, fn, result, count, done);
+	} else {
+		eeh_pe_for_each_dev(root, edev, tmp) {
+			work_added = true;
+			edev->mode |= EEH_DEV_RECOVERING;
+			atomic_inc(count);
+			WARN_ON(!(edev->mode & EEH_DEV_RECOVERING));
+			wu = kmalloc(sizeof(*wu), GFP_KERNEL);
+			wu->id = (unsigned int)atomic_inc_return(&eeh_wu_id);
+			wu->event_id = event_id;
+			get_device(&edev->pdev->dev);
+			wu->pdev = edev->pdev;
+			wu->pe = root;
+			wu->fn_name = fn_name;
+			wu->fn = fn;
+			wu->result = result;
+			wu->count = count;
+			wu->done = done;
+			INIT_WORK(&wu->work, eeh_pe_report_pdev_thread);
+			pr_debug("EEH(%u): Queue work unit W%u for device %s (count ~ %d)\n", event_id, wu->id, pci_name(edev->pdev), atomic_read(count));
+			queue_work(system_unbound_wq, &wu->work);
+		}
+		/* This PE has devices, so don't traverse further now */
+	}
+	return work_added;
+}
+
+static void eeh_pe_report_pdev_thread(struct work_struct *work)
+{
+	struct work_unit *wu = container_of(work, struct work_unit, work);
+	struct eeh_dev *edev, *oedev, *tmp;
+	struct eeh_pe *pe;
+	int todo;
 
-	pr_info("EEH(%u): Beginning: '%s'\n", event_id, name);
 	/*
 	 * It would be convenient to continue to hold the recovery lock here
 	 * but driver callbacks can take a very long time or never return at
 	 * all.
 	 */
-	pdevs = pdev_cache_list_create(root);
-	for (pdevp = pdevs; pdevp && *pdevp; pdevp++) {
-		/*
-		 * NOTE! eeh_recovery_lock() is released briefly
-		 * in eeh_pe_report_pdev()
-		 */
-		eeh_pe_report_pdev(event_id, *pdevp, fn, result, name);
+	pr_debug("EEH(%u): W%u: start (device: %s)\n", wu->event_id, wu->id, pci_name(wu->pdev));
+	eeh_recovery_lock();
+	edev = pci_dev_to_eeh_dev(wu->pdev);
+	if (edev) {
+		bool late, removed, passed;
+
+		WARN_ON(!(edev->mode & EEH_DEV_RECOVERING));
+		removed = eeh_dev_removed(edev);
+		passed = eeh_pe_passed(edev->pe);
+		late = edev->mode & EEH_DEV_NO_HANDLER;
+		if (eeh_edev_actionable(edev)) {
+			eeh_recovery_unlock();
+			eeh_pe_report_pdev(wu->event_id, wu->id, wu->pdev, wu->fn_name, wu->fn, wu->result, late, removed, passed);
+			eeh_recovery_lock();
+		} else {
+			pci_info(wu->pdev, "EEH(%u): W%u: Not actionable (%d,%d,%d)\n",
+				 wu->event_id, wu->id, !!wu->pdev, !removed, !passed);
+		}
+		edev = pci_dev_to_eeh_dev(wu->pdev); // Re-acquire after lock release
+		if (edev)
+			edev->mode &= ~EEH_DEV_RECOVERING;
+		/* The edev may be lost, but not moved to a different PE! */
+		WARN_ON(eeh_dev_to_pe(edev) && (eeh_dev_to_pe(edev) != wu->pe));
+		todo = 0;
+		eeh_pe_for_each_dev(wu->pe, oedev, tmp)
+			if (oedev->mode & EEH_DEV_RECOVERING)
+				todo++;
+		pci_dbg(wu->pdev, "EEH(%u): W%u: Remaining devices in this PE: %d\n", wu->event_id, wu->id, todo);
+		if (todo) {
+			pr_debug("EEH(%u): W%u: Remaining work units at this PE: %d\n", wu->event_id, wu->id, todo);
+		} else {
+			pr_debug("EEH(%u): W%u: All work for this PE complete, continuing traversal:\n", wu->event_id, wu->id);
+			list_for_each_entry(pe, &wu->pe->child_list, child)
+				enqueue_pe_work(pe, wu->event_id, wu->fn_name, wu->fn, wu->result, wu->count, wu->done);
+		}
+	} else {
+		pr_warn("EEH(%u): W%u: Device removed.\n", wu->event_id, wu->id);
+	}
+	eeh_recovery_unlock();
+	if (atomic_dec_and_test(wu->count)) {
+		pr_debug("EEH(%u): W%u: done\n", wu->event_id, wu->id);
+		complete(wu->done);
+	}
+	put_device(&wu->pdev->dev);
+	kfree(wu);
+}
+
+static void eeh_pe_report(unsigned int event_id, const char *name, struct eeh_pe *root,
+			  eeh_report_fn fn, enum pci_ers_result *result)
+{
+	atomic_t count = ATOMIC_INIT(0);
+	DECLARE_COMPLETION_ONSTACK(done);
+
+	pr_info("EEH(%u): Beginning: '%s'\n", event_id, name);
+	if (enqueue_pe_work(root, event_id, name, fn, result, &count, &done)) {
+		pr_info("EEH(%u): Waiting for asynchronous recovery work to complete...\n", event_id);
+		eeh_recovery_unlock();
+		wait_for_completion_interruptible(&done);
+		pr_info("EEH(%u): Asynchronous recovery work is complete.\n", event_id);
+		eeh_recovery_lock();
+	} else {
+		pr_info("EEH(%u): No recovery work do.\n", event_id);
 	}
-	pdev_cache_list_destroy(pdevs);
 
 	if (result)
 		pr_info("EEH(%u): Finished:'%s' with aggregate recovery state:'%s'\n",
@@ -392,6 +497,7 @@ static void eeh_pe_report(unsigned int event_id,
  * Report an EEH error to each device driver.
  */
 static enum pci_ers_result eeh_report_error(unsigned int event_id,
+					    unsigned int id,
 					    struct pci_dev *pdev,
 					    struct pci_driver *driver)
 {
@@ -402,7 +508,8 @@ static enum pci_ers_result eeh_report_error(unsigned int event_id,
 	if (!driver->err_handler->error_detected)
 		return PCI_ERS_RESULT_NONE;
 
-	pci_info(pdev, "Invoking %s->error_detected(IO frozen)", driver->name);
+	pci_info(pdev, "EEH(%u): W%u: Invoking %s->error_detected(IO frozen)",
+		 event_id, id, driver->name);
 	rc = driver->err_handler->error_detected(pdev, pci_channel_io_frozen);
 
 	eeh_serialize_lock(&flags);
@@ -423,13 +530,14 @@ static enum pci_ers_result eeh_report_error(unsigned int event_id,
  * are now enabled.
  */
 static enum pci_ers_result eeh_report_mmio_enabled(unsigned int event_id,
+						   unsigned int id,
 						   struct pci_dev *pdev,
 						   struct pci_driver *driver)
 {
 	if (!driver->err_handler->mmio_enabled)
 		return PCI_ERS_RESULT_NONE;
-	pci_info(pdev, "EEH(%u): Invoking %s->mmio_enabled()",
-		 event_id, driver->name);
+	pci_info(pdev, "EEH(%u): W%u: Invoking %s->mmio_enabled()",
+		 event_id, id, driver->name);
 	return driver->err_handler->mmio_enabled(pdev);
 }
 
@@ -444,6 +552,7 @@ static enum pci_ers_result eeh_report_mmio_enabled(unsigned int event_id,
  * driver can work again while the device is recovered.
  */
 static enum pci_ers_result eeh_report_reset(unsigned int event_id,
+					    unsigned int id,
 					    struct pci_dev *pdev,
 					    struct pci_driver *driver)
 {
@@ -457,8 +566,8 @@ static enum pci_ers_result eeh_report_reset(unsigned int event_id,
 		return PCI_ERS_RESULT_NONE;
 	}
 	eeh_serialize_unlock(flags);
-	pci_info(pdev, "EEH(%u): Invoking %s->slot_reset()",
-		 event_id, driver->name);
+	pci_info(pdev, "EEH(%u): W%u: Invoking %s->slot_reset()",
+		 event_id, id, driver->name);
 	return driver->err_handler->slot_reset(pdev);
 }
 
@@ -499,6 +608,7 @@ static void eeh_dev_restore_state(struct eeh_dev *edev, void *userdata)
  * to make the recovered device work again.
  */
 static enum pci_ers_result eeh_report_resume(unsigned int event_id,
+					     unsigned int id,
 					     struct pci_dev *pdev,
 					     struct pci_driver *driver)
 {
@@ -513,8 +623,8 @@ static enum pci_ers_result eeh_report_resume(unsigned int event_id,
 	}
 	eeh_serialize_unlock(flags);
 
-	pci_info(pdev, "EEH(%u): Invoking %s->resume()",
-		 event_id, driver->name);
+	pci_info(pdev, "EEH(%u): W%u Invoking %s->resume()",
+		 event_id, id, driver->name);
 	driver->err_handler->resume(pdev);
 
 	pci_uevent_ers(pdev, PCI_ERS_RESULT_RECOVERED);
@@ -536,6 +646,7 @@ static enum pci_ers_result eeh_report_resume(unsigned int event_id,
  * dead, and that no further recovery attempts will be made on it.
  */
 static enum pci_ers_result eeh_report_failure(unsigned int event_id,
+					      unsigned int id,
 					      struct pci_dev *pdev,
 					      struct pci_driver *driver)
 {
@@ -544,8 +655,8 @@ static enum pci_ers_result eeh_report_failure(unsigned int event_id,
 	if (!driver->err_handler->error_detected)
 		return PCI_ERS_RESULT_NONE;
 
-	pci_info(pdev, "EEH(%u): Invoking %s->error_detected(permanent failure)",
-		 event_id, driver->name);
+	pci_info(pdev, "EEH(%u): W%u: Invoking %s->error_detected(permanent failure)",
+		 event_id, id, driver->name);
 	rc = driver->err_handler->error_detected(pdev,
 						 pci_channel_io_perm_failure);
 
@@ -1006,9 +1117,10 @@ static void eeh_clear_slot_attention(struct pci_dev *pdev)
  */
 void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
 {
+	struct eeh_pe *tmp_pe;
+	struct pci_controller *phb = pe->phb;
 	struct pci_bus *bus;
 	struct eeh_dev *edev, *tmp;
-	struct eeh_pe *tmp_pe;
 	struct pci_dev **pdevs, **pdevp;
 	int rc = 0;
 	enum pci_ers_result result = PCI_ERS_RESULT_NONE;
@@ -1020,7 +1132,7 @@ void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
 	bus = eeh_pe_bus_get(pe);
 	if (!bus) {
 		pr_err("EEH(%u): %s: Cannot find PCI bus for PHB#%x-PE#%x\n",
-			event_id, __func__, pe->phb->global_number, pe->addr);
+			event_id, __func__, phb->global_number, pe->addr);
 		eeh_recovery_unlock();
 		return;
 	}
@@ -1041,7 +1153,7 @@ void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
 
 	if (!devices) {
 		pr_debug("EEH(%u): Frozen PHB#%x-PE#%x is empty!\n",
-			event_id, pe->phb->global_number, pe->addr);
+			event_id, phb->global_number, pe->addr);
 		goto out; /* nothing to recover */
 	}
 
@@ -1053,12 +1165,12 @@ void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
 	/* Log the event */
 	if (pe->type & EEH_PE_PHB) {
 		pr_err("EEH(%u): Recovering PHB#%x, location: %s\n",
-			event_id, pe->phb->global_number, eeh_pe_loc_get(pe));
+			event_id, phb->global_number, eeh_pe_loc_get(pe));
 	} else {
-		struct eeh_pe *phb_pe = eeh_phb_pe_get(pe->phb);
+		struct eeh_pe *phb_pe = eeh_phb_pe_get(phb);
 
 		pr_err("EEH(%u): Recovering PHB#%x-PE#%x\n",
-		       event_id, pe->phb->global_number, pe->addr);
+		       event_id, phb->global_number, pe->addr);
 		pr_err("EEH(%u): PE location: %s, PHB location: %s\n",
 		       event_id, eeh_pe_loc_get(pe), eeh_pe_loc_get(phb_pe));
 	}
@@ -1073,7 +1185,7 @@ void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
 		int i;
 
 		pr_err("EEH(%u): Frozen PHB#%x-PE#%x detected\n",
-		       event_id, pe->phb->global_number, pe->addr);
+		       event_id, phb->global_number, pe->addr);
 
 		/* FIXME: Use the same format as dump_stack() */
 		pr_err("EEH(%u): Call Trace:\n", event_id);
@@ -1087,7 +1199,7 @@ void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
 	eeh_pe_update_time_stamp(pe);
 	if (pe->freeze_count > eeh_max_freezes) {
 		pr_err("EEH(%u): PHB#%x-PE#%x has failed %d times in the last hour and has been permanently disabled.\n",
-		       event_id, pe->phb->global_number, pe->addr,
+		       event_id, phb->global_number, pe->addr,
 		       pe->freeze_count);
 		result = PCI_ERS_RESULT_DISCONNECT;
 	}
@@ -1240,7 +1352,7 @@ void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
 		 */
 		pr_err("EEH(%u): Unable to recover from failure from PHB#%x-PE#%x.\n"
 		       "Please try reseating or replacing it\n",
-			event_id, pe->phb->global_number, pe->addr);
+			event_id, phb->global_number, pe->addr);
 
 		eeh_slot_error_detail(event_id, pe, EEH_LOG_PERM);
 
@@ -1294,8 +1406,34 @@ void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
 
 	eeh_pe_state_clear(pe, EEH_PE_RECOVERING, true);
 	eeh_recovery_unlock();
+
 }
 
+void eeh_handle_normal_event_work(struct work_struct *work)
+{
+	unsigned long flags;
+	struct eeh_event *event = container_of(work, struct eeh_event, work);
+	struct pci_controller *phb = event->pe->phb;
+
+	eeh_handle_normal_event(event->id, event->pe);
+
+	kfree(event);
+	spin_lock_irqsave(&eeh_eventlist_lock, flags);
+	WARN_ON_ONCE(!phb->eeh_in_progress);
+	if (list_empty(&phb->eeh_eventlist)) {
+		phb->eeh_in_progress = false;
+		pr_debug("EEH(%u): No more work to do\n", event->id);
+	} else {
+		pr_warn("EEH(%u): More work to do\n", event->id);
+		event = list_entry(phb->eeh_eventlist.next,
+				   struct eeh_event, list);
+		list_del(&event->list);
+		queue_work(system_unbound_wq, &event->work);
+	}
+	spin_unlock_irqrestore(&eeh_eventlist_lock, flags);
+}
+
+
 /**
  * eeh_handle_special_event - Handle EEH events without a specific failing PE
  *
@@ -1366,8 +1504,7 @@ void eeh_handle_special_event(void)
 		 */
 		if (rc == EEH_NEXT_ERR_FROZEN_PE ||
 		    rc == EEH_NEXT_ERR_FENCED_PHB) {
-			eeh_pe_state_mark(pe, EEH_PE_RECOVERING);
-			eeh_handle_normal_event(0, pe);
+			eeh_phb_event(pe);
 		} else {
 			eeh_for_each_pe(pe, tmp_pe)
 				eeh_pe_for_each_dev(tmp_pe, edev, tmp_edev)
diff --git a/arch/powerpc/kernel/eeh_event.c b/arch/powerpc/kernel/eeh_event.c
index bd38d6fe5449..81dc4f924324 100644
--- a/arch/powerpc/kernel/eeh_event.c
+++ b/arch/powerpc/kernel/eeh_event.c
@@ -22,7 +22,7 @@
  *  work-queue, where a worker thread can drive recovery.
  */
 
-static DEFINE_SPINLOCK(eeh_eventlist_lock);
+DEFINE_SPINLOCK(eeh_eventlist_lock);
 static DECLARE_COMPLETION(eeh_eventlist_event);
 static LIST_HEAD(eeh_eventlist);
 
@@ -61,7 +61,7 @@ static int eeh_event_handler(void * dummy)
 			continue;
 
 		/* We might have event without binding PE */
-		if (event->pe)
+		if (event->pe) /* TODO: Unused now? */
 			eeh_handle_normal_event(event->id, event->pe);
 		else
 			eeh_handle_special_event();
@@ -94,33 +94,56 @@ int eeh_event_init(void)
 	return 0;
 }
 
-/**
- * eeh_send_failure_event - Generate a PCI error event
- * @pe: EEH PE
- *
- * This routine can be called within an interrupt context;
- * the actual event will be delivered in a normal context
- * (from a workqueue).
- */
-int __eeh_send_failure_event(struct eeh_pe *pe)
+int eeh_phb_event(struct eeh_pe *pe)
 {
-	unsigned long flags;
 	struct eeh_event *event;
+	unsigned long flags;
 
 	event = kzalloc(sizeof(*event), GFP_ATOMIC);
 	if (!event) {
 		pr_err("EEH: out of memory, event not handled\n");
 		return -ENOMEM;
 	}
-	event->pe = pe;
+
 	do {
 		/* Skip over the special value (0) */
 		event->id = (unsigned int)atomic_inc_return(&eeh_event_id);
 	} while (!event->id);
 
-	pr_err("EEH(%u): EVENT=ERROR_DETECTED PHB=%#x PE=%#x\n",
-	      event->id, pe->phb->global_number, pe->addr);
+	spin_lock_irqsave(&eeh_eventlist_lock, flags);
+	INIT_WORK(&event->work, eeh_handle_normal_event_work);
+	if (pe) {
+		event->pe = pe;
+		eeh_pe_state_mark(pe, EEH_PE_RECOVERING);
+		pr_err("EEH(%u): EVENT=ERROR_DETECTED PHB=%#x PE=%#x\n",
+		      event->id, pe->phb->global_number, pe->addr);
+		if (event->pe->phb->eeh_in_progress) {
+			pr_info("EEH: EEH already in progress on this PHB, queueing.\n");
+			list_add(&event->list, &event->pe->phb->eeh_eventlist);
+		} else {
+			pr_info("EEH: Beginning recovery on this PHB.\n");
+			WARN_ON_ONCE(!list_empty(&event->pe->phb->eeh_eventlist));
+			event->pe->phb->eeh_in_progress = true;
+			queue_work(system_unbound_wq, &event->work);
+		}
+	} else {
+		list_add(&event->list, &eeh_eventlist);
+		complete(&eeh_eventlist_event);
+	}
+	spin_unlock_irqrestore(&eeh_eventlist_lock, flags);
+	return 0;
+}
 
+/**
+ * eeh_send_failure_event - Generate a PCI error event
+ * @pe: EEH PE
+ *
+ * This routine can be called within an interrupt context;
+ * the actual event will be delivered in a normal context
+ * (from a workqueue).
+ */
+int __eeh_send_failure_event(struct eeh_pe *pe)
+{
 	/*
 	 * Mark the PE as recovering before inserting it in the queue.
 	 * This prevents the PE from being free()ed by a hotplug driver
@@ -136,18 +159,8 @@ int __eeh_send_failure_event(struct eeh_pe *pe)
 					 ARRAY_SIZE(pe->stack_trace), 0);
 #endif /* CONFIG_STACKTRACE */
 
-		eeh_pe_state_mark(pe, EEH_PE_RECOVERING);
 	}
-
-	/* We may or may not be called in an interrupt context */
-	spin_lock_irqsave(&eeh_eventlist_lock, flags);
-	list_add(&event->list, &eeh_eventlist);
-	spin_unlock_irqrestore(&eeh_eventlist_lock, flags);
-
-	/* For EEH deamon to knick in */
-	complete(&eeh_eventlist_event);
-
-	return 0;
+	return eeh_phb_event(pe);
 }
 
 int eeh_send_failure_event(struct eeh_pe *pe)
-- 
2.22.0.216.g00a2a96fc9


^ permalink raw reply related

* Re: [PATCH v2 2/2] cpufreq: Specify default governor on command line
From: Viresh Kumar @ 2020-06-24  5:50 UTC (permalink / raw)
  To: Quentin Perret
  Cc: juri.lelli, kernel-team, vincent.guittot, arnd, rafael, peterz,
	adharmap, linux-pm, rjw, linux-kernel, mingo, paulus,
	linuxppc-dev, tkjos
In-Reply-To: <20200623142138.209513-3-qperret@google.com>

On 23-06-20, 15:21, Quentin Perret wrote:
> Currently, the only way to specify the default CPUfreq governor is via
> Kconfig options, which suits users who can build the kernel themselves
> perfectly.
> 
> However, for those who use a distro-like kernel (such as Android, with
> the Generic Kernel Image project), the only way to use a different
> default is to boot to userspace, and to then switch using the sysfs
> interface. Being able to specify the default governor on the command
> line, like is the case for cpuidle, would enable those users to specify
> their governor of choice earlier on, and to simplify slighlty the
> userspace boot procedure.
> 
> To support this use-case, add a kernel command line parameter enabling
> to specify a default governor for CPUfreq, which takes precedence over
> the builtin default.
> 
> This implementation has one notable limitation: the default governor
> must be registered before the driver. This is solved for builtin
> governors and drivers using appropriate *_initcall() functions. And in
> the modular case, this must be reflected as a constraint on the module
> loading order.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |  5 ++++
>  Documentation/admin-guide/pm/cpufreq.rst      |  6 ++---
>  drivers/cpufreq/cpufreq.c                     | 23 +++++++++++++++----
>  3 files changed, 26 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index fb95fad81c79..5fd3c9f187eb 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -703,6 +703,11 @@
>  	cpufreq.off=1	[CPU_FREQ]
>  			disable the cpufreq sub-system
>  
> +	cpufreq.default_governor=
> +			[CPU_FREQ] Name of the default cpufreq governor to use.
> +			This governor must be registered in the kernel before
> +			the cpufreq driver probes.
> +
>  	cpu_init_udelay=N
>  			[X86] Delay for N microsec between assert and de-assert
>  			of APIC INIT to start processors.  This delay occurs
> diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
> index 0c74a7784964..368e612145d2 100644
> --- a/Documentation/admin-guide/pm/cpufreq.rst
> +++ b/Documentation/admin-guide/pm/cpufreq.rst
> @@ -147,9 +147,9 @@ CPUs in it.
>  
>  The next major initialization step for a new policy object is to attach a
>  scaling governor to it (to begin with, that is the default scaling governor
> -determined by the kernel configuration, but it may be changed later
> -via ``sysfs``).  First, a pointer to the new policy object is passed to the
> -governor's ``->init()`` callback which is expected to initialize all of the
> +determined by the kernel command line or configuration, but it may be changed
> +later via ``sysfs``).  First, a pointer to the new policy object is passed to
> +the governor's ``->init()`` callback which is expected to initialize all of the
>  data structures necessary to handle the given policy and, possibly, to add
>  a governor ``sysfs`` interface to it.  Next, the governor is started by
>  invoking its ``->start()`` callback.
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 0128de3603df..4b1a5c0173cf 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -50,6 +50,9 @@ static LIST_HEAD(cpufreq_governor_list);
>  #define for_each_governor(__governor)				\
>  	list_for_each_entry(__governor, &cpufreq_governor_list, governor_list)
>  
> +static char cpufreq_param_governor[CPUFREQ_NAME_LEN];
> +static struct cpufreq_governor *default_governor;
> +
>  /**
>   * The "cpufreq driver" - the arch- or hardware-dependent low
>   * level driver of CPUFreq support, and its spinlock. This lock
> @@ -1055,7 +1058,6 @@ __weak struct cpufreq_governor *cpufreq_default_governor(void)
>  
>  static int cpufreq_init_policy(struct cpufreq_policy *policy)
>  {
> -	struct cpufreq_governor *def_gov = cpufreq_default_governor();
>  	struct cpufreq_governor *gov = NULL;
>  	unsigned int pol = CPUFREQ_POLICY_UNKNOWN;
>  
> @@ -1065,8 +1067,8 @@ static int cpufreq_init_policy(struct cpufreq_policy *policy)
>  		if (gov) {
>  			pr_debug("Restoring governor %s for cpu %d\n",
>  				 policy->governor->name, policy->cpu);
> -		} else if (def_gov) {
> -			gov = def_gov;
> +		} else if (default_governor) {
> +			gov = default_governor;
>  		} else {
>  			return -ENODATA;
>  		}
> @@ -1074,8 +1076,8 @@ static int cpufreq_init_policy(struct cpufreq_policy *policy)
>  		/* Use the default policy if there is no last_policy. */
>  		if (policy->last_policy) {
>  			pol = policy->last_policy;
> -		} else if (def_gov) {
> -			pol = cpufreq_parse_policy(def_gov->name);
> +		} else if (default_governor) {
> +			pol = cpufreq_parse_policy(default_governor->name);
>  			/*
>  			 * In case the default governor is neiter "performance"
>  			 * nor "powersave", fall back to the initial policy
> @@ -2320,6 +2322,9 @@ int cpufreq_register_governor(struct cpufreq_governor *governor)
>  		list_add(&governor->governor_list, &cpufreq_governor_list);
>  	}
>  
> +	if (!strncasecmp(cpufreq_param_governor, governor->name, CPUFREQ_NAME_LEN))
> +		default_governor = governor;
> +
>  	mutex_unlock(&cpufreq_governor_mutex);
>  	return err;
>  }
> @@ -2348,6 +2353,8 @@ void cpufreq_unregister_governor(struct cpufreq_governor *governor)
>  
>  	mutex_lock(&cpufreq_governor_mutex);
>  	list_del(&governor->governor_list);
> +	if (governor == default_governor)
> +		default_governor = cpufreq_default_governor();
>  	mutex_unlock(&cpufreq_governor_mutex);
>  }
>  EXPORT_SYMBOL_GPL(cpufreq_unregister_governor);
> @@ -2789,7 +2796,13 @@ static int __init cpufreq_core_init(void)
>  	cpufreq_global_kobject = kobject_create_and_add("cpufreq", &cpu_subsys.dev_root->kobj);
>  	BUG_ON(!cpufreq_global_kobject);
>  
> +	mutex_lock(&cpufreq_governor_mutex);
> +	if (!default_governor)
> +		default_governor = cpufreq_default_governor();
> +	mutex_unlock(&cpufreq_governor_mutex);

I don't think locking is required here at core-initcall level. Apart
from that:

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

> +
>  	return 0;
>  }
>  module_param(off, int, 0444);
> +module_param_string(default_governor, cpufreq_param_governor, CPUFREQ_NAME_LEN, 0444);
>  core_initcall(cpufreq_core_init);
> -- 
> 2.27.0.111.gc72c7da667-goog

-- 
viresh

^ permalink raw reply

* [PATCH RFC 1/1] powerpc/eeh: Provide a unique ID for each EEH recovery
From: Sam Bobroff @ 2020-06-24  5:19 UTC (permalink / raw)
  To: linuxppc-dev

Give a unique ID to each recovery event, to ease log parsing and
prepare for parallel recovery.

Also add some new messages with a very simple format that may be
useful to log-parsers.

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
---
This patch should be applied on top of my recent(ish) set:
"powerpc/eeh: Synchronization for safety".

 arch/powerpc/include/asm/eeh_event.h |   3 +-
 arch/powerpc/include/asm/ppc-pci.h   |   2 +-
 arch/powerpc/kernel/eeh.c            |  45 ++++---
 arch/powerpc/kernel/eeh_driver.c     | 189 +++++++++++++++------------
 arch/powerpc/kernel/eeh_event.c      |  12 +-
 5 files changed, 148 insertions(+), 103 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh_event.h b/arch/powerpc/include/asm/eeh_event.h
index dadde7d52f46..a1fe736bc4cf 100644
--- a/arch/powerpc/include/asm/eeh_event.h
+++ b/arch/powerpc/include/asm/eeh_event.h
@@ -17,13 +17,14 @@
 struct eeh_event {
 	struct list_head	list;	/* to form event queue	*/
 	struct eeh_pe		*pe;	/* EEH PE		*/
+	unsigned int		id;	/* Event ID		*/
 };
 
 int eeh_event_init(void);
 int eeh_send_failure_event(struct eeh_pe *pe);
 int __eeh_send_failure_event(struct eeh_pe *pe);
 void eeh_remove_event(struct eeh_pe *pe, bool force);
-void eeh_handle_normal_event(struct eeh_pe *pe);
+void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe);
 void eeh_handle_special_event(void);
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/include/asm/ppc-pci.h b/arch/powerpc/include/asm/ppc-pci.h
index 7f4be5a05eb3..ec62b491ff97 100644
--- a/arch/powerpc/include/asm/ppc-pci.h
+++ b/arch/powerpc/include/asm/ppc-pci.h
@@ -47,7 +47,7 @@ extern int rtas_setup_phb(struct pci_controller *phb);
 void eeh_addr_cache_insert_dev(struct pci_dev *dev);
 void eeh_addr_cache_rmv_dev(struct pci_dev *dev);
 struct eeh_dev *eeh_addr_cache_get_dev(unsigned long addr);
-void eeh_slot_error_detail(struct eeh_pe *pe, int severity);
+void eeh_slot_error_detail(unsigned int event_id, struct eeh_pe *pe, int severity);
 int eeh_pci_enable(struct eeh_pe *pe, int function);
 int eeh_pe_reset_full(struct eeh_pe *pe, bool include_passed);
 void eeh_save_bars(struct eeh_dev *edev);
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 68e6dfa526a5..54f921ff7621 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -197,7 +197,8 @@ EXPORT_SYMBOL_GPL(eeh_recovery_must_be_locked);
  * for the indicated PCI device, and puts them into a buffer
  * for RTAS error logging.
  */
-static size_t eeh_dump_dev_log(struct eeh_dev *edev, char *buf, size_t len)
+static size_t eeh_dump_dev_log(unsigned int event_id, struct eeh_dev *edev,
+			       char *buf, size_t len)
 {
 	struct pci_dn *pdn = eeh_dev_to_pdn(edev);
 	u32 cfg;
@@ -206,34 +207,37 @@ static size_t eeh_dump_dev_log(struct eeh_dev *edev, char *buf, size_t len)
 	char buffer[128];
 
 	if (!pdn) {
-		pr_warn("EEH: Note: No error log for absent device.\n");
+		pr_warn("EEH(%u): Note: No error log for absent device.\n",
+			event_id);
 		return 0;
 	}
 
 	n += scnprintf(buf+n, len-n, "%04x:%02x:%02x.%01x\n",
 		       pdn->phb->global_number, pdn->busno,
 		       PCI_SLOT(pdn->devfn), PCI_FUNC(pdn->devfn));
-	pr_warn("EEH: of node=%04x:%02x:%02x.%01x\n",
+	pr_warn("EEH(%u): of node=%04x:%02x:%02x.%01x\n",
+		event_id,
 		pdn->phb->global_number, pdn->busno,
 		PCI_SLOT(pdn->devfn), PCI_FUNC(pdn->devfn));
 
 	eeh_ops->read_config(pdn, PCI_VENDOR_ID, 4, &cfg);
 	n += scnprintf(buf+n, len-n, "dev/vend:%08x\n", cfg);
-	pr_warn("EEH: PCI device/vendor: %08x\n", cfg);
+	pr_warn("EEH(%u): PCI device/vendor: %08x\n", event_id, cfg);
 
 	eeh_ops->read_config(pdn, PCI_COMMAND, 4, &cfg);
 	n += scnprintf(buf+n, len-n, "cmd/stat:%x\n", cfg);
-	pr_warn("EEH: PCI cmd/status register: %08x\n", cfg);
+	pr_warn("EEH(%u): PCI cmd/status register: %08x\n", event_id, cfg);
 
 	/* Gather bridge-specific registers */
 	if (edev->mode & EEH_DEV_BRIDGE) {
 		eeh_ops->read_config(pdn, PCI_SEC_STATUS, 2, &cfg);
 		n += scnprintf(buf+n, len-n, "sec stat:%x\n", cfg);
-		pr_warn("EEH: Bridge secondary status: %04x\n", cfg);
+		pr_warn("EEH(%u): Bridge secondary status: %04x\n",
+			event_id, cfg);
 
 		eeh_ops->read_config(pdn, PCI_BRIDGE_CONTROL, 2, &cfg);
 		n += scnprintf(buf+n, len-n, "brdg ctl:%x\n", cfg);
-		pr_warn("EEH: Bridge control: %04x\n", cfg);
+		pr_warn("EEH(%u): Bridge control: %04x\n", event_id, cfg);
 	}
 
 	/* Dump out the PCI-X command and status regs */
@@ -241,18 +245,19 @@ static size_t eeh_dump_dev_log(struct eeh_dev *edev, char *buf, size_t len)
 	if (cap) {
 		eeh_ops->read_config(pdn, cap, 4, &cfg);
 		n += scnprintf(buf+n, len-n, "pcix-cmd:%x\n", cfg);
-		pr_warn("EEH: PCI-X cmd: %08x\n", cfg);
+		pr_warn("EEH(%u): PCI-X cmd: %08x\n", event_id, cfg);
 
 		eeh_ops->read_config(pdn, cap+4, 4, &cfg);
 		n += scnprintf(buf+n, len-n, "pcix-stat:%x\n", cfg);
-		pr_warn("EEH: PCI-X status: %08x\n", cfg);
+		pr_warn("EEH(%u): PCI-X status: %08x\n", event_id, cfg);
 	}
 
 	/* If PCI-E capable, dump PCI-E cap 10 */
 	cap = edev->pcie_cap;
 	if (cap) {
 		n += scnprintf(buf+n, len-n, "pci-e cap10:\n");
-		pr_warn("EEH: PCI-E capabilities and status follow:\n");
+		pr_warn("EEH(%u): PCI-E capabilities and status follow:\n",
+			event_id);
 
 		for (i=0; i<=8; i++) {
 			eeh_ops->read_config(pdn, cap+4*i, 4, &cfg);
@@ -263,8 +268,8 @@ static size_t eeh_dump_dev_log(struct eeh_dev *edev, char *buf, size_t len)
 					pr_warn("%s\n", buffer);
 
 				l = scnprintf(buffer, sizeof(buffer),
-					      "EEH: PCI-E %02x: %08x ",
-					      4*i, cfg);
+					      "EEH(%u): PCI-E %02x: %08x ",
+					      event_id, 4*i, cfg);
 			} else {
 				l += scnprintf(buffer+l, sizeof(buffer)-l,
 					       "%08x ", cfg);
@@ -279,7 +284,8 @@ static size_t eeh_dump_dev_log(struct eeh_dev *edev, char *buf, size_t len)
 	cap = edev->aer_cap;
 	if (cap) {
 		n += scnprintf(buf+n, len-n, "pci-e AER:\n");
-		pr_warn("EEH: PCI-E AER capability register set follows:\n");
+		pr_warn("EEH(%u): PCI-E AER capability register set follows:\n",
+			event_id);
 
 		for (i=0; i<=13; i++) {
 			eeh_ops->read_config(pdn, cap+4*i, 4, &cfg);
@@ -304,16 +310,13 @@ static size_t eeh_dump_dev_log(struct eeh_dev *edev, char *buf, size_t len)
 	return n;
 }
 
-static void *eeh_dump_pe_log(struct eeh_pe *pe, void *flag)
+static void eeh_dump_pe_log(unsigned int event_id, struct eeh_pe *pe, size_t *plen)
 {
 	struct eeh_dev *edev, *tmp;
-	size_t *plen = flag;
 
 	eeh_pe_for_each_dev(pe, edev, tmp)
-		*plen += eeh_dump_dev_log(edev, pci_regs_buf + *plen,
+		*plen += eeh_dump_dev_log(event_id, edev, pci_regs_buf + *plen,
 					  EEH_PCI_REGS_LOG_LEN - *plen);
-
-	return NULL;
 }
 
 /**
@@ -326,9 +329,10 @@ static void *eeh_dump_pe_log(struct eeh_pe *pe, void *flag)
  * out from the config space of the corresponding PCI device, while
  * the error log is fetched through platform dependent function call.
  */
-void eeh_slot_error_detail(struct eeh_pe *pe, int severity)
+void eeh_slot_error_detail(unsigned int event_id, struct eeh_pe *pe, int severity)
 {
 	size_t loglen = 0;
+	struct eeh_pe *tmp_pe;
 
 	/*
 	 * When the PHB is fenced or dead, it's pointless to collect
@@ -368,7 +372,8 @@ void eeh_slot_error_detail(struct eeh_pe *pe, int severity)
 			eeh_pe_restore_bars(pe);
 
 			pci_regs_buf[0] = 0;
-			eeh_pe_traverse(pe, eeh_dump_pe_log, &loglen);
+			eeh_for_each_pe(pe, tmp_pe)
+				eeh_dump_pe_log(event_id, tmp_pe, &loglen);
 		}
 	}
 
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index f21e0910f80a..0dbc218597e3 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -248,10 +248,13 @@ static void eeh_set_irq_state(struct eeh_pe *root, bool enable)
 	}
 }
 
-typedef enum pci_ers_result (*eeh_report_fn)(struct pci_dev *,
+typedef enum pci_ers_result (*eeh_report_fn)(unsigned int event_id,
+					     struct pci_dev *,
 					     struct pci_driver *);
-static void eeh_pe_report_pdev(struct pci_dev *pdev, eeh_report_fn fn,
-			       enum pci_ers_result *result)
+static void eeh_pe_report_pdev(unsigned int event_id,
+			       struct pci_dev *pdev, eeh_report_fn fn,
+			       enum pci_ers_result *result,
+			       const char *handler_name)
 {
 	struct eeh_dev *edev;
 	struct pci_driver *driver;
@@ -260,7 +263,7 @@ static void eeh_pe_report_pdev(struct pci_dev *pdev, eeh_report_fn fn,
 
 	edev = pci_dev_to_eeh_dev(pdev);
 	if (!edev) {
-		pci_info(pdev, "no EEH state for device");
+		pci_info(pdev, "EEH(%u): no EEH state for device", event_id);
 		return;
 	}
 	/* Cache some useful values before releasing the lock: */
@@ -280,19 +283,26 @@ static void eeh_pe_report_pdev(struct pci_dev *pdev, eeh_report_fn fn,
 		driver = eeh_pcid_get(pdev);
 
 		if (!driver)
-			pci_info(pdev, "no driver");
+			pci_info(pdev, "EEH(%u): no driver", event_id);
 		else if (!driver->err_handler)
-			pci_info(pdev, "driver not EEH aware");
+			pci_info(pdev, "EEH(%u): driver not EEH aware", event_id);
 		else if (late)
-			pci_info(pdev, "driver bound too late");
+			pci_info(pdev, "EEH(%u): driver bound too late", event_id);
 		else {
-			new_result = fn(pdev, driver);
+			pr_warn("EEH(%u): EVENT=HANDLER_CALL DEVICE=%04x:%02x:%02x.%x DRIVER='%s' HANDLER='%s'\n",
+				event_id, edev->controller->global_number,
+				PCI_BUSNO(edev->bdfn), PCI_SLOT(edev->bdfn),
+				PCI_FUNC(edev->bdfn), driver->name, handler_name);
+
+			new_result = fn(event_id, pdev, driver);
 			/*
 			 * It's not safe to use edev here, because the locks
 			 * have been released and devices could have changed.
 			 */
-			pci_info(pdev, "%s driver reports: '%s'",
-				      driver->name,
+			pr_warn("EEH(%u): EVENT=HANDLER_RETURN RESULT='%s'\n",
+				event_id, pci_ers_result_name(new_result));
+			pci_info(pdev, "EEH(%u): %s driver reports: %s",
+				      event_id, driver->name,
 				      pci_ers_result_name(new_result));
 			if (result)
 				*result = pci_ers_merge_result(*result,
@@ -303,8 +313,8 @@ static void eeh_pe_report_pdev(struct pci_dev *pdev, eeh_report_fn fn,
 		device_unlock(&pdev->dev);
 		eeh_recovery_lock();
 	} else {
-		pci_info(pdev, "not actionable (%d,%d,%d)", !!pdev,
-			 !removed, !passed);
+		pci_info(pdev, "EEH(%u): not actionable (%d,%d,%d)",
+			 event_id, !!pdev, !removed, !passed);
 	}
 }
 
@@ -345,12 +355,13 @@ static void pdev_cache_list_destroy(struct pci_dev **pdevs)
 	kfree(pdevs);
 }
 
-static void eeh_pe_report(const char *name, struct eeh_pe *root,
+static void eeh_pe_report(unsigned int event_id,
+			  const char *name, struct eeh_pe *root,
 			  eeh_report_fn fn, enum pci_ers_result *result)
 {
 	struct pci_dev **pdevs, **pdevp;
 
-	pr_info("EEH: Beginning: '%s'\n", name);
+	pr_info("EEH(%u): Beginning: '%s'\n", event_id, name);
 	/*
 	 * It would be convenient to continue to hold the recovery lock here
 	 * but driver callbacks can take a very long time or never return at
@@ -362,15 +373,15 @@ static void eeh_pe_report(const char *name, struct eeh_pe *root,
 		 * NOTE! eeh_recovery_lock() is released briefly
 		 * in eeh_pe_report_pdev()
 		 */
-		eeh_pe_report_pdev(*pdevp, fn, result);
+		eeh_pe_report_pdev(event_id, *pdevp, fn, result, name);
 	}
 	pdev_cache_list_destroy(pdevs);
 
 	if (result)
-		pr_info("EEH: Finished:'%s' with aggregate recovery state:'%s'\n",
-			name, pci_ers_result_name(*result));
+		pr_info("EEH(%u): Finished:'%s' with aggregate recovery state:'%s'\n",
+			event_id, name, pci_ers_result_name(*result));
 	else
-		pr_info("EEH: Finished:'%s'", name);
+		pr_info("EEH(%u): Finished:'%s'", event_id, name);
 }
 
 /**
@@ -380,7 +391,8 @@ static void eeh_pe_report(const char *name, struct eeh_pe *root,
  *
  * Report an EEH error to each device driver.
  */
-static enum pci_ers_result eeh_report_error(struct pci_dev *pdev,
+static enum pci_ers_result eeh_report_error(unsigned int event_id,
+					    struct pci_dev *pdev,
 					    struct pci_driver *driver)
 {
 	enum pci_ers_result rc;
@@ -410,12 +422,14 @@ static enum pci_ers_result eeh_report_error(struct pci_dev *pdev,
  * Tells each device driver that IO ports, MMIO and config space I/O
  * are now enabled.
  */
-static enum pci_ers_result eeh_report_mmio_enabled(struct pci_dev *pdev,
+static enum pci_ers_result eeh_report_mmio_enabled(unsigned int event_id,
+						   struct pci_dev *pdev,
 						   struct pci_driver *driver)
 {
 	if (!driver->err_handler->mmio_enabled)
 		return PCI_ERS_RESULT_NONE;
-	pci_info(pdev, "Invoking %s->mmio_enabled()", driver->name);
+	pci_info(pdev, "EEH(%u): Invoking %s->mmio_enabled()",
+		 event_id, driver->name);
 	return driver->err_handler->mmio_enabled(pdev);
 }
 
@@ -429,7 +443,8 @@ static enum pci_ers_result eeh_report_mmio_enabled(struct pci_dev *pdev,
  * some actions, usually to save data the driver needs so that the
  * driver can work again while the device is recovered.
  */
-static enum pci_ers_result eeh_report_reset(struct pci_dev *pdev,
+static enum pci_ers_result eeh_report_reset(unsigned int event_id,
+					    struct pci_dev *pdev,
 					    struct pci_driver *driver)
 {
 	struct eeh_dev *edev;
@@ -442,7 +457,8 @@ static enum pci_ers_result eeh_report_reset(struct pci_dev *pdev,
 		return PCI_ERS_RESULT_NONE;
 	}
 	eeh_serialize_unlock(flags);
-	pci_info(pdev, "Invoking %s->slot_reset()", driver->name);
+	pci_info(pdev, "EEH(%u): Invoking %s->slot_reset()",
+		 event_id, driver->name);
 	return driver->err_handler->slot_reset(pdev);
 }
 
@@ -482,7 +498,8 @@ static void eeh_dev_restore_state(struct eeh_dev *edev, void *userdata)
  * could resume so that the device driver can do some initialization
  * to make the recovered device work again.
  */
-static enum pci_ers_result eeh_report_resume(struct pci_dev *pdev,
+static enum pci_ers_result eeh_report_resume(unsigned int event_id,
+					     struct pci_dev *pdev,
 					     struct pci_driver *driver)
 {
 	struct eeh_dev *edev;
@@ -496,7 +513,8 @@ static enum pci_ers_result eeh_report_resume(struct pci_dev *pdev,
 	}
 	eeh_serialize_unlock(flags);
 
-	pci_info(pdev, "Invoking %s->resume()", driver->name);
+	pci_info(pdev, "EEH(%u): Invoking %s->resume()",
+		 event_id, driver->name);
 	driver->err_handler->resume(pdev);
 
 	pci_uevent_ers(pdev, PCI_ERS_RESULT_RECOVERED);
@@ -517,7 +535,8 @@ static enum pci_ers_result eeh_report_resume(struct pci_dev *pdev,
  * This informs the device driver that the device is permanently
  * dead, and that no further recovery attempts will be made on it.
  */
-static enum pci_ers_result eeh_report_failure(struct pci_dev *pdev,
+static enum pci_ers_result eeh_report_failure(unsigned int event_id,
+					      struct pci_dev *pdev,
 					      struct pci_driver *driver)
 {
 	enum pci_ers_result rc;
@@ -525,8 +544,8 @@ static enum pci_ers_result eeh_report_failure(struct pci_dev *pdev,
 	if (!driver->err_handler->error_detected)
 		return PCI_ERS_RESULT_NONE;
 
-	pci_info(pdev, "Invoking %s->error_detected(permanent failure)",
-		 driver->name);
+	pci_info(pdev, "EEH(%u): Invoking %s->error_detected(permanent failure)",
+		 event_id, driver->name);
 	rc = driver->err_handler->error_detected(pdev,
 						 pci_channel_io_perm_failure);
 
@@ -579,7 +598,8 @@ static void *eeh_add_virt_device(struct eeh_dev *edev)
  * lock is dropped (which it must be in order to take the PCI rescan/remove
  * lock without risking a deadlock).
  */
-static void eeh_rmv_device(struct pci_dev *pdev, void *userdata)
+static void eeh_rmv_device(unsigned int event_id,
+			   struct pci_dev *pdev, void *userdata)
 {
 	struct eeh_dev *edev;
 	struct pci_driver *driver;
@@ -588,8 +608,8 @@ static void eeh_rmv_device(struct pci_dev *pdev, void *userdata)
 
 	edev = pci_dev_to_eeh_dev(pdev);
 	if (!edev) {
-		pci_warn(pdev, "EEH: Device removed during processing (#%d)\n",
-			 __LINE__);
+		pci_warn(pdev, "EEH(%u): Device removed during processing (#%d)\n",
+			 event_id, __LINE__);
 		return;
 	}
 	/*
@@ -617,7 +637,8 @@ static void eeh_rmv_device(struct pci_dev *pdev, void *userdata)
 	}
 
 	/* Remove it from PCI subsystem */
-	pci_info(pdev, "EEH: Removing device without EEH sensitive driver\n");
+	pci_info(pdev, "EEH(%u): Removing device without EEH sensitive driver\n",
+		 event_id);
 	edev->mode |= EEH_DEV_DISCONNECTED;
 	if (rmv_data)
 		rmv_data->removed_dev_count++;
@@ -743,7 +764,8 @@ int eeh_pe_reset_and_recover(struct eeh_pe *pe)
  * During the reset, udev might be invoked because those affected
  * PCI devices will be removed and then added.
  */
-static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus,
+static int eeh_reset_device(unsigned int event_id,
+			    struct eeh_pe *pe, struct pci_bus *bus,
 			    struct eeh_rmv_data *rmv_data,
 			    bool driver_eeh_aware)
 {
@@ -777,7 +799,7 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus,
 		pdevs = pdev_cache_list_create(pe);
 		/* eeh_rmv_device() may re-acquire the recovery lock */
 		for (pdevp = pdevs; pdevp && *pdevp; pdevp++)
-			eeh_rmv_device(*pdevp, rmv_data);
+			eeh_rmv_device(event_id, *pdevp, rmv_data);
 		pdev_cache_list_destroy(pdevs);
 	} else {
 		eeh_recovery_unlock();
@@ -825,8 +847,8 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus,
 	 * potentially weird things happen.
 	 */
 	if (!driver_eeh_aware || rmv_data->removed_dev_count) {
-		pr_info("EEH: Sleep 5s ahead of %s hotplug\n",
-			(driver_eeh_aware ? "partial" : "complete"));
+		pr_info("EEH(%u): Sleep 5s ahead of %s hotplug\n",
+			event_id, (driver_eeh_aware ? "partial" : "complete"));
 		eeh_recovery_unlock();
 		ssleep(5);
 		eeh_recovery_lock();
@@ -982,7 +1004,7 @@ static void eeh_clear_slot_attention(struct pci_dev *pdev)
  * drivers (which cause a second set of hotplug events to go out to
  * userspace).
  */
-void eeh_handle_normal_event(struct eeh_pe *pe)
+void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
 {
 	struct pci_bus *bus;
 	struct eeh_dev *edev, *tmp;
@@ -997,8 +1019,8 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 	eeh_recovery_lock();
 	bus = eeh_pe_bus_get(pe);
 	if (!bus) {
-		pr_err("%s: Cannot find PCI bus for PHB#%x-PE#%x\n",
-			__func__, pe->phb->global_number, pe->addr);
+		pr_err("EEH(%u): %s: Cannot find PCI bus for PHB#%x-PE#%x\n",
+			event_id, __func__, pe->phb->global_number, pe->addr);
 		eeh_recovery_unlock();
 		return;
 	}
@@ -1018,22 +1040,27 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 				devices++;
 
 	if (!devices) {
-		pr_debug("EEH: Frozen PHB#%x-PE#%x is empty!\n",
-			pe->phb->global_number, pe->addr);
+		pr_debug("EEH(%u): Frozen PHB#%x-PE#%x is empty!\n",
+			event_id, pe->phb->global_number, pe->addr);
 		goto out; /* nothing to recover */
 	}
 
+	pe->freeze_count++;
+	pr_warn("EEH(%u): EVENT=RECOVERY_START TYPE=%s PHB=%#x PE=%#x COUNT=%d\n",
+		event_id, ((pe->type & EEH_PE_PHB) ? "PHB" : "PE"),
+		pe->phb->global_number, pe->addr, pe->freeze_count);
+
 	/* Log the event */
 	if (pe->type & EEH_PE_PHB) {
-		pr_err("EEH: Recovering PHB#%x, location: %s\n",
-			pe->phb->global_number, eeh_pe_loc_get(pe));
+		pr_err("EEH(%u): Recovering PHB#%x, location: %s\n",
+			event_id, pe->phb->global_number, eeh_pe_loc_get(pe));
 	} else {
 		struct eeh_pe *phb_pe = eeh_phb_pe_get(pe->phb);
 
-		pr_err("EEH: Recovering PHB#%x-PE#%x\n",
-		       pe->phb->global_number, pe->addr);
-		pr_err("EEH: PE location: %s, PHB location: %s\n",
-		       eeh_pe_loc_get(pe), eeh_pe_loc_get(phb_pe));
+		pr_err("EEH(%u): Recovering PHB#%x-PE#%x\n",
+		       event_id, pe->phb->global_number, pe->addr);
+		pr_err("EEH(%u): PE location: %s, PHB location: %s\n",
+		       event_id, eeh_pe_loc_get(pe), eeh_pe_loc_get(phb_pe));
 	}
 
 #ifdef CONFIG_STACKTRACE
@@ -1045,23 +1072,22 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 		void **ptrs = (void **) pe->stack_trace;
 		int i;
 
-		pr_err("EEH: Frozen PHB#%x-PE#%x detected\n",
-		       pe->phb->global_number, pe->addr);
+		pr_err("EEH(%u): Frozen PHB#%x-PE#%x detected\n",
+		       event_id, pe->phb->global_number, pe->addr);
 
 		/* FIXME: Use the same format as dump_stack() */
-		pr_err("EEH: Call Trace:\n");
+		pr_err("EEH(%u): Call Trace:\n", event_id);
 		for (i = 0; i < pe->trace_entries; i++)
-			pr_err("EEH: [%pK] %pS\n", ptrs[i], ptrs[i]);
+			pr_err("EEH(%u): [%pK] %pS\n", event_id, ptrs[i], ptrs[i]);
 
 		pe->trace_entries = 0;
 	}
 #endif /* CONFIG_STACKTRACE */
 
 	eeh_pe_update_time_stamp(pe);
-	pe->freeze_count++;
 	if (pe->freeze_count > eeh_max_freezes) {
-		pr_err("EEH: PHB#%x-PE#%x has failed %d times in the last hour and has been permanently disabled.\n",
-		       pe->phb->global_number, pe->addr,
+		pr_err("EEH(%u): PHB#%x-PE#%x has failed %d times in the last hour and has been permanently disabled.\n",
+		       event_id, pe->phb->global_number, pe->addr,
 		       pe->freeze_count);
 		result = PCI_ERS_RESULT_DISCONNECT;
 	}
@@ -1081,12 +1107,12 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 	 * hotplug for this case.
 	 */
 	if (result != PCI_ERS_RESULT_DISCONNECT) {
-		pr_warn("EEH: This PCI device has failed %d times in the last hour and will be permanently disabled after %d failures.\n",
-			pe->freeze_count, eeh_max_freezes);
-		pr_info("EEH: Notify device drivers to shutdown\n");
+		pr_warn("EEH(%u): This PCI device has failed %d times in the last hour and will be permanently disabled after %d failures.\n",
+			event_id, pe->freeze_count, eeh_max_freezes);
+		pr_info("EEH(%u): Notify device drivers to shutdown\n", event_id);
 		eeh_set_channel_state(pe, pci_channel_io_frozen);
 		eeh_set_irq_state(pe, false);
-		eeh_pe_report("error_detected(IO frozen)", pe,
+		eeh_pe_report(event_id, "error_detected(IO frozen)", pe,
 			      eeh_report_error, &result);
 		if ((pe->type & EEH_PE_PHB) &&
 		    result != PCI_ERS_RESULT_NONE &&
@@ -1100,7 +1126,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 	if (result != PCI_ERS_RESULT_DISCONNECT) {
 		rc = eeh_wait_state(pe, MAX_WAIT_FOR_RECOVERY*1000, true);
 		if (rc < 0 || rc == EEH_STATE_NOT_SUPPORT) {
-			pr_warn("EEH: Permanent failure\n");
+			pr_warn("EEH(%u): Permanent failure\n", event_id);
 			result = PCI_ERS_RESULT_DISCONNECT;
 		}
 	}
@@ -1110,8 +1136,8 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 	 * have been informed.
 	 */
 	if (result != PCI_ERS_RESULT_DISCONNECT) {
-		pr_info("EEH: Collect temporary log\n");
-		eeh_slot_error_detail(pe, EEH_LOG_TEMP);
+		pr_info("EEH(%u): Collect temporary log\n", event_id);
+		eeh_slot_error_detail(event_id, pe, EEH_LOG_TEMP);
 	}
 
 	/* If all device drivers were EEH-unaware, then shut
@@ -1119,8 +1145,8 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 	 * go down willingly, without panicing the system.
 	 */
 	if (result == PCI_ERS_RESULT_NONE) {
-		pr_info("EEH: Reset with hotplug activity\n");
-		rc = eeh_reset_device(pe, bus, NULL, false);
+		pr_info("EEH(%u): Reset with hotplug activity\n", event_id);
+		rc = eeh_reset_device(event_id, pe, bus, NULL, false);
 		if (rc) {
 			pr_warn("%s: Unable to reset, err=%d\n",
 				__func__, rc);
@@ -1130,7 +1156,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 
 	/* If all devices reported they can proceed, then re-enable MMIO */
 	if (result == PCI_ERS_RESULT_CAN_RECOVER) {
-		pr_info("EEH: Enable I/O for affected devices\n");
+		pr_info("EEH(%u): Enable I/O for affected devices\n", event_id);
 		rc = eeh_pci_enable(pe, EEH_OPT_THAW_MMIO);
 
 		if (rc < 0) {
@@ -1138,15 +1164,15 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 		} else if (rc) {
 			result = PCI_ERS_RESULT_NEED_RESET;
 		} else {
-			pr_info("EEH: Notify device drivers to resume I/O\n");
-			eeh_pe_report("mmio_enabled", pe,
+			pr_info("EEH(%u): Notify device drivers to resume I/O\n", event_id);
+			eeh_pe_report(event_id, "mmio_enabled", pe,
 				      eeh_report_mmio_enabled, &result);
 		}
 	}
 
 	/* If all devices reported they can proceed, then re-enable DMA */
 	if (result == PCI_ERS_RESULT_CAN_RECOVER) {
-		pr_info("EEH: Enabled DMA for affected devices\n");
+		pr_info("EEH(%u): Enabled DMA for affected devices\n", event_id);
 		rc = eeh_pci_enable(pe, EEH_OPT_THAW_DMA);
 
 		if (rc < 0) {
@@ -1166,8 +1192,8 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 
 	/* If any device called out for a reset, then reset the slot */
 	if (result == PCI_ERS_RESULT_NEED_RESET) {
-		pr_info("EEH: Reset without hotplug activity\n");
-		rc = eeh_reset_device(pe, bus, &rmv_data, true);
+		pr_info("EEH(%u): Reset without hotplug activity\n", event_id);
+		rc = eeh_reset_device(event_id, pe, bus, &rmv_data, true);
 		if (rc) {
 			pr_warn("%s: Cannot reset, err=%d\n",
 				__func__, rc);
@@ -1176,7 +1202,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 			result = PCI_ERS_RESULT_NONE;
 			eeh_set_channel_state(pe, pci_channel_io_normal);
 			eeh_set_irq_state(pe, true);
-			eeh_pe_report("slot_reset", pe, eeh_report_reset,
+			eeh_pe_report(event_id, "slot_reset", pe, eeh_report_reset,
 				      &result);
 		}
 	}
@@ -1194,10 +1220,10 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 		}
 
 		/* Tell all device drivers that they can resume operations */
-		pr_info("EEH: Notify device driver to resume\n");
+		pr_info("EEH(%u): Notify device driver to resume\n", event_id);
 		eeh_set_channel_state(pe, pci_channel_io_normal);
 		eeh_set_irq_state(pe, true);
-		eeh_pe_report("resume", pe, eeh_report_resume, NULL);
+		eeh_pe_report(event_id, "resume", pe, eeh_report_resume, NULL);
 		eeh_for_each_pe(pe, tmp_pe) {
 			eeh_pe_for_each_dev(tmp_pe, edev, tmp) {
 				edev->mode &= ~EEH_DEV_NO_HANDLER;
@@ -1205,24 +1231,25 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 			}
 		}
 
-		pr_info("EEH: Recovery successful.\n");
+		pr_info("EEH(%u): Recovery successful.\n", event_id);
 	} else  {
 		/*
 		 * About 90% of all real-life EEH failures in the field
 		 * are due to poorly seated PCI cards. Only 10% or so are
 		 * due to actual, failed cards.
 		 */
-		pr_err("EEH: Unable to recover from failure from PHB#%x-PE#%x.\n"
+		pr_err("EEH(%u): Unable to recover from failure from PHB#%x-PE#%x.\n"
 		       "Please try reseating or replacing it\n",
-			pe->phb->global_number, pe->addr);
+			event_id, pe->phb->global_number, pe->addr);
 
-		eeh_slot_error_detail(pe, EEH_LOG_PERM);
+		eeh_slot_error_detail(event_id, pe, EEH_LOG_PERM);
 
 		/* Notify all devices that they're about to go down. */
 		eeh_set_channel_state(pe, pci_channel_io_perm_failure);
 		eeh_set_irq_state(pe, false);
-		eeh_pe_report("error_detected(permanent failure)", pe,
+		eeh_pe_report(event_id, "error_detected(permanent failure)", pe,
 			      eeh_report_failure, NULL);
+		pr_crit("EEH(%u): EVENT=RECOVERY_END RESULT=failure\n", event_id);
 
 		/* Mark the PE to be removed permanently */
 		eeh_pe_state_mark(pe, EEH_PE_REMOVED);
@@ -1235,7 +1262,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 		if (pe->type & EEH_PE_VF) {
 			pdevs = pdev_cache_list_create(pe);
 			for (pdevp = pdevs; pdevp && *pdevp; pdevp++)
-				eeh_rmv_device(*pdevp, NULL);
+				eeh_rmv_device(event_id, *pdevp, NULL);
 			pdev_cache_list_destroy(pdevs);
 			eeh_pe_dev_mode_mark(pe, EEH_DEV_REMOVED);
 		} else {
@@ -1251,6 +1278,8 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 		}
 	}
 
+	pr_info("EEH(%u): EVENT=RECOVERY_END RESULT=success\n", event_id);
+
 out:
 	/*
 	 * Clean up any PEs without devices. While marked as EEH_PE_RECOVERYING
@@ -1338,7 +1367,7 @@ void eeh_handle_special_event(void)
 		if (rc == EEH_NEXT_ERR_FROZEN_PE ||
 		    rc == EEH_NEXT_ERR_FENCED_PHB) {
 			eeh_pe_state_mark(pe, EEH_PE_RECOVERING);
-			eeh_handle_normal_event(pe);
+			eeh_handle_normal_event(0, pe);
 		} else {
 			eeh_for_each_pe(pe, tmp_pe)
 				eeh_pe_for_each_dev(tmp_pe, edev, tmp_edev)
@@ -1347,7 +1376,7 @@ void eeh_handle_special_event(void)
 			/* Notify all devices to be down */
 			eeh_pe_state_clear(pe, EEH_PE_PRI_BUS, true);
 			eeh_set_channel_state(pe, pci_channel_io_perm_failure);
-			eeh_pe_report(
+			eeh_pe_report(0,
 				"error_detected(permanent failure)", pe,
 				eeh_report_failure, NULL);
 
diff --git a/arch/powerpc/kernel/eeh_event.c b/arch/powerpc/kernel/eeh_event.c
index a7a8dc182efb..bd38d6fe5449 100644
--- a/arch/powerpc/kernel/eeh_event.c
+++ b/arch/powerpc/kernel/eeh_event.c
@@ -26,6 +26,9 @@ static DEFINE_SPINLOCK(eeh_eventlist_lock);
 static DECLARE_COMPLETION(eeh_eventlist_event);
 static LIST_HEAD(eeh_eventlist);
 
+/* Event ID 0 is reserved for special events */
+static atomic_t eeh_event_id = ATOMIC_INIT(1);
+
 /**
  * eeh_event_handler - Dispatch EEH events.
  * @dummy - unused
@@ -59,7 +62,7 @@ static int eeh_event_handler(void * dummy)
 
 		/* We might have event without binding PE */
 		if (event->pe)
-			eeh_handle_normal_event(event->pe);
+			eeh_handle_normal_event(event->id, event->pe);
 		else
 			eeh_handle_special_event();
 
@@ -110,6 +113,13 @@ int __eeh_send_failure_event(struct eeh_pe *pe)
 		return -ENOMEM;
 	}
 	event->pe = pe;
+	do {
+		/* Skip over the special value (0) */
+		event->id = (unsigned int)atomic_inc_return(&eeh_event_id);
+	} while (!event->id);
+
+	pr_err("EEH(%u): EVENT=ERROR_DETECTED PHB=%#x PE=%#x\n",
+	      event->id, pe->phb->global_number, pe->addr);
 
 	/*
 	 * Mark the PE as recovering before inserting it in the queue.
-- 
2.22.0.216.g00a2a96fc9


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox