LinuxPPC-Dev Archive on lore.kernel.org


* Re: [PATCH v14 16/22] selftests/vm: fix an assertion in test_pkey_alloc_exhaust()
From: Dave Hansen @ 2018-07-18 16:52 UTC
  To: Ram Pai, shuahkh, linux-kselftest
  Cc: mpe, linuxppc-dev, linux-mm, x86, linux-arch, mingo, mhocko,
	bauerman, fweimer, msuchanek, aneesh.kumar
In-Reply-To: <1531835365-32387-17-git-send-email-linuxram@us.ibm.com>

On 07/17/2018 06:49 AM, Ram Pai wrote:
> The maximum number of keys that can be allocated has to
> take into consideration, that some keys are reserved by
> the architecture for   specific   purpose. Hence cannot
> be allocated.

Back to incomplete sentences, I see. :)

How about:

	Some pkeys which are valid to the hardware are not available
	for application use.  Those cannot be allocated.

	test_pkey_alloc_exhaust() tries to account for these but
	___FILL_IN_WHAT_IT_DID_WRONG____.  We fix this by
	___FILL_IN_WAY_IT_WAS_FIXED____.

> diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c
> index d27fa5e..67d841e 100644
> --- a/tools/testing/selftests/vm/protection_keys.c
> +++ b/tools/testing/selftests/vm/protection_keys.c
> @@ -1171,15 +1171,11 @@ void test_pkey_alloc_exhaust(int *ptr, u16 pkey)
>  	pkey_assert(i < NR_PKEYS*2);
>  
>  	/*
> -	 * There are 16 pkeys supported in hardware.  Three are
> -	 * allocated by the time we get here:
> -	 *   1. The default key (0)
> -	 *   2. One possibly consumed by an execute-only mapping.
> -	 *   3. One allocated by the test code and passed in via
> -	 *      'pkey' to this function.
> -	 * Ensure that we can allocate at least another 13 (16-3).
> +	 * There are NR_PKEYS pkeys supported in hardware. arch_reserved_keys()
> +	 * are reserved. And one key is allocated by the test code and passed
> +	 * in via 'pkey' to this function.
>  	 */
> -	pkey_assert(i >= NR_PKEYS-3);
> +	pkey_assert(i >= (NR_PKEYS-arch_reserved_keys()-1));
>  
>  	for (i = 0; i < nr_allocated_pkeys; i++) {
>  		err = sys_pkey_free(allocated_pkeys[i]);

You also killed my nice, shiny, new comment.  You made an attempt to
make up for it two patches ago, but it pales in comparison to mine.  The
fact that I wrote it only a few short weeks ago makes me very attached to
it, kinda like a new puppy.  I don't want to throw it to the wolves
quite yet.  So, please preserve as much of it as possible, even if it
has to live in the x86 header.

For bonus points, axe this comment in the same patch that you create the
x86 header comment for easier review.
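
To make the arithmetic behind the new assertion concrete, here is a
minimal standalone sketch.  The helper name min_allocatable_pkeys() and
the sample values are assumptions for illustration only; in the selftest
the inputs come from NR_PKEYS and arch_reserved_keys().

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of protection_keys.c): the smallest
 * number of successful pkey_alloc() calls the test should accept.
 *
 *   nr_pkeys    - keys the hardware supports
 *   nr_reserved - keys the architecture keeps for itself
 *   1           - the key the test code allocated and passed in as 'pkey'
 */
int min_allocatable_pkeys(int nr_pkeys, int nr_reserved)
{
	return nr_pkeys - nr_reserved - 1;
}
```

Treating the default key and the possible execute-only key from the old
comment as reserved, min_allocatable_pkeys(16, 2) gives 13, the same
bound as the old "16-3".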

* Re: [PATCH v14 20/22] selftests/vm: testcases must restore pkey-permissions
From: Dave Hansen @ 2018-07-18 16:56 UTC
  To: Ram Pai, shuahkh, linux-kselftest
  Cc: mpe, linuxppc-dev, linux-mm, x86, linux-arch, mingo, mhocko,
	bauerman, fweimer, msuchanek, aneesh.kumar
In-Reply-To: <1531835365-32387-21-git-send-email-linuxram@us.ibm.com>

On 07/17/2018 06:49 AM, Ram Pai wrote:
> Generally the signal handler restores the state of the pkey register
> before returning. However there are times when the read/write operation
> can legitimately fail without invoking the signal handler.  E.g.: a
> sys_read() operation to a write-protected page should be disallowed.  In
> such a case the state of the pkey register is not restored to its
> original state.  Test cases may not remember to restore the key
> register state. During cleanup generically restore the key permissions.

This would, indeed, be a good thing to do for a well-behaved application.

But, for selftests, why does it matter what state we leave the key in?
Doesn't the test itself need to establish permissions?  Don't we *do*
that at pkey_alloc() anyway?

What problem does this solve?

> diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c
> index 8a6afdd..ea3cf04 100644
> --- a/tools/testing/selftests/vm/protection_keys.c
> +++ b/tools/testing/selftests/vm/protection_keys.c
> @@ -1476,8 +1476,13 @@ void run_tests_once(void)
>  		pkey_tests[test_nr](ptr, pkey);
>  		dprintf1("freeing test memory: %p\n", ptr);
>  		free_pkey_malloc(ptr);
> +
> +		/* restore the permission on the key after use */
> +		pkey_access_allow(pkey);
> +		pkey_write_allow(pkey);
>  		sys_pkey_free(pkey);
>  
> +
>  		dprintf1("pkey_faults: %d\n", pkey_faults);
>  		dprintf1("orig_pkey_faults: %d\n", orig_pkey_faults);

* Re: [PATCH v14 22/22] selftests/vm: test correct behavior of pkey-0
From: Dave Hansen @ 2018-07-18 17:03 UTC
  To: Ram Pai, shuahkh, linux-kselftest
  Cc: mpe, linuxppc-dev, linux-mm, x86, linux-arch, mingo, mhocko,
	bauerman, fweimer, msuchanek, aneesh.kumar
In-Reply-To: <1531835365-32387-23-git-send-email-linuxram@us.ibm.com>

On 07/17/2018 06:49 AM, Ram Pai wrote:
> Ensure pkey-0 is allocated on start.  Ensure pkey-0 can be attached
> dynamically in various modes, without failures.  Ensure pkey-0 can be
> freed and allocated.
> 
> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> ---
>  tools/testing/selftests/vm/protection_keys.c |   66 +++++++++++++++++++++++++-
>  1 files changed, 64 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c
> index 569faf1..156b449 100644
> --- a/tools/testing/selftests/vm/protection_keys.c
> +++ b/tools/testing/selftests/vm/protection_keys.c
> @@ -999,6 +999,67 @@ void close_test_fds(void)
>  	return *ptr;
>  }
>  
> +void test_pkey_alloc_free_attach_pkey0(int *ptr, u16 pkey)
> +{
> +	int i, err;
> +	int max_nr_pkey_allocs;
> +	int alloced_pkeys[NR_PKEYS];
> +	int nr_alloced = 0;
> +	int newpkey;
> +	long size;
> +
> +	assert(pkey_last_malloc_record);
> +	size = pkey_last_malloc_record->size;
> +	/*
> +	 * This is a bit of a hack.  But mprotect() requires
> +	 * huge-page-aligned sizes when operating on hugetlbfs.
> +	 * So, make sure that we use something that's a multiple
> +	 * of a huge page when we can.
> +	 */
> +	if (size >= HPAGE_SIZE)
> +		size = HPAGE_SIZE;
> +
> +
> +	/* allocate every possible key and make sure key-0 never got allocated */
> +	max_nr_pkey_allocs = NR_PKEYS;
> +	for (i = 0; i < max_nr_pkey_allocs; i++) {
> +		int new_pkey = alloc_pkey();
> +		assert(new_pkey != 0);

Missed these earlier.  This needs to be pkey_assert().  We don't want
these tests to ever _actually_ crash.
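
As a rough illustration of the difference, here is a sketch of a check
that reports a failure without raising SIGABRT the way assert() does.
The names soft_assert(), soft_assert_failures and check_new_pkey() are
hypothetical, and this is not how pkey_assert() itself is implemented.

```c
#include <stdio.h>

/* Hypothetical failure counter; the real selftest keeps its own state. */
static int soft_assert_failures;

/*
 * Hypothetical stand-in for a test-suite assert: print the failed
 * condition and record it, but do not call abort().
 */
#define soft_assert(cond) do {					\
	if (!(cond)) {						\
		fprintf(stderr, "check failed at %s:%d: %s\n",	\
			__FILE__, __LINE__, #cond);		\
		soft_assert_failures++;				\
	}							\
} while (0)

/* Returns how many checks failed for this key instead of crashing. */
int check_new_pkey(int new_pkey)
{
	int before = soft_assert_failures;

	soft_assert(new_pkey != 0);
	return soft_assert_failures - before;
}
```

With this shape a failed check is logged and the caller decides what to
do next, rather than the process dying in the middle of the test.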

> +	/* attach key-0 in various modes */
> +	err = sys_mprotect_pkey(ptr, size, PROT_READ, 0);
> +	pkey_assert(!err);
> +	err = sys_mprotect_pkey(ptr, size, PROT_WRITE, 0);
> +	pkey_assert(!err);
> +	err = sys_mprotect_pkey(ptr, size, PROT_EXEC, 0);
> +	pkey_assert(!err);
> +	err = sys_mprotect_pkey(ptr, size, PROT_READ|PROT_WRITE, 0);
> +	pkey_assert(!err);
> +	err = sys_mprotect_pkey(ptr, size, PROT_READ|PROT_WRITE|PROT_EXEC, 0);
> +	pkey_assert(!err);

This is all fine.

> +	/* free key-0 */
> +	err = sys_pkey_free(0);
> +	pkey_assert(!err);

This part is called out as undefined behavior in the manpage:

>        An application should not call pkey_free() on any protection key
>        which has been assigned to an address range by pkey_mprotect(2) and
>        which is still in use.  The behavior in this case is undefined and
>        may result in an error.

I don't think we should be testing for undefined behavior.

> +	newpkey = sys_pkey_alloc(0, 0x0);
> +	assert(newpkey == 0);
> +}
> +
>  void test_read_of_write_disabled_region(int *ptr, u16 pkey)
>  {
>  	int ptr_contents;
> @@ -1144,10 +1205,10 @@ void test_kernel_gup_write_to_write_disabled_region(int *ptr, u16 pkey)
>  void test_pkey_syscalls_on_non_allocated_pkey(int *ptr, u16 pkey)
>  {
>  	int err;
> -	int i = get_start_key();
> +	int i;
>  
>  	/* Note: 0 is the default pkey, so don't mess with it */
> -	for (; i < NR_PKEYS; i++) {
> +	for (i=1; i < NR_PKEYS; i++) {
>  		if (pkey == i)
>  			continue;

This seems to be randomly reverting earlier changes.

* Re: [PATCH] Mark ams driver as orphaned in MAINTAINERS
From: Michael Hanselmann @ 2018-07-18 17:24 UTC
  To: Michael Ellerman, linuxppc-dev, linux-kernel
In-Reply-To: <877elsq2xh.fsf@concordia.ellerman.id.au>

On 18.07.2018 12:00, Michael Ellerman wrote:
> Michael Hanselmann <linux-kernel@hansmi.ch> writes:
>> I no longer have any hardware with the Apple motion sensor and thus
>> relinquish maintainership of the driver.
> 
> Thanks. I think I'll just remove the whole entry, meaning it will be
> caught under the "LINUX FOR POWER MACINTOSH" entry.
> 
> Unless you object. Full patch below.

Sounds good to me.

Reviewed-by: Michael Hanselmann <linux-kernel@hansmi.ch>

* [PATCH v6 2/2] hwmon: ibmpowernv: Add attributes to enable/disable sensor groups
From: Shilpasri G Bhat @ 2018-07-18 18:13 UTC
  To: mpe, linux; +Cc: linuxppc-dev, linux-hwmon, linux-kernel, ego, Shilpasri G Bhat
In-Reply-To: <1531937610-6454-1-git-send-email-shilpa.bhat@linux.vnet.ibm.com>

The On-Chip Controller (OCC) is an embedded microprocessor in the POWER9
chip that measures various system- and chip-level sensors. These sensors
comprise environmental sensors (such as power, temperature, current and
voltage) and performance sensors (such as utilization and frequency).
All these sensors are copied to main memory at a regular interval of
100ms. OCC provides a way to select which sensor groups are copied to
main memory, in order to increase the update frequency of the selected
groups. When a sensor group is disabled, OCC will not copy it to main
memory and those sensors read 0 values.

This patch adds support for enabling and disabling sensor groups such
as power, temperature, current and voltage, through a new per-sensor
sysfs attribute.

Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
---
Changes from v5:
- Don't store the sensor node; instead, parse the device-tree for each
  sensor to find the sensor-group during init

 Documentation/hwmon/ibmpowernv |  43 ++++++-
 drivers/hwmon/ibmpowernv.c     | 249 +++++++++++++++++++++++++++++++++++------
 2 files changed, 258 insertions(+), 34 deletions(-)

diff --git a/Documentation/hwmon/ibmpowernv b/Documentation/hwmon/ibmpowernv
index 8826ba2..5646825 100644
--- a/Documentation/hwmon/ibmpowernv
+++ b/Documentation/hwmon/ibmpowernv
@@ -33,9 +33,48 @@ fanX_input		Measured RPM value.
 fanX_min		Threshold RPM for alert generation.
 fanX_fault		0: No fail condition
 			1: Failing fan
+
 tempX_input		Measured ambient temperature.
 tempX_max		Threshold ambient temperature for alert generation.
-inX_input		Measured power supply voltage
+tempX_highest		Historical maximum temperature
+tempX_lowest		Historical minimum temperature
+tempX_enable		Enable/disable all temperature sensors belonging to the
+			sub-group. In POWER9, this attribute corresponds to
+			each OCC. Using this attribute each OCC can be asked to
+			disable/enable all of its temperature sensors.
+			1: Enable
+			0: Disable
+
+inX_input		Measured power supply voltage (millivolt)
 inX_fault		0: No fail condition.
 			1: Failing power supply.
-power1_input		System power consumption (microWatt)
+inX_highest		Historical maximum voltage
+inX_lowest		Historical minimum voltage
+inX_enable		Enable/disable all voltage sensors belonging to the
+			sub-group. In POWER9, this attribute corresponds to
+			each OCC. Using this attribute each OCC can be asked to
+			disable/enable all of its voltage sensors.
+			1: Enable
+			0: Disable
+
+powerX_input		Power consumption (microWatt)
+powerX_input_highest	Historical maximum power
+powerX_input_lowest	Historical minimum power
+powerX_enable		Enable/disable all power sensors belonging to the
+			sub-group. In POWER9, this attribute corresponds to
+			each OCC. Using this attribute each OCC can be asked to
+			disable/enable all of its power sensors.
+			1: Enable
+			0: Disable
+
+currX_input		Measured current (milliampere)
+currX_highest		Historical maximum current
+currX_lowest		Historical minimum current
+currX_enable		Enable/disable all current sensors belonging to the
+			sub-group. In POWER9, this attribute corresponds to
+			each OCC. Using this attribute each OCC can be asked to
+			disable/enable all of its current sensors.
+			1: Enable
+			0: Disable
+
+energyX_input		Cumulative energy (microJoule)
diff --git a/drivers/hwmon/ibmpowernv.c b/drivers/hwmon/ibmpowernv.c
index f829dad..167acf3 100644
--- a/drivers/hwmon/ibmpowernv.c
+++ b/drivers/hwmon/ibmpowernv.c
@@ -90,11 +90,20 @@ struct sensor_data {
 	char label[MAX_LABEL_LEN];
 	char name[MAX_ATTR_LEN];
 	struct device_attribute dev_attr;
+	struct sensor_group_data *sgrp_data;
+};
+
+struct sensor_group_data {
+	struct mutex mutex;
+	u32 gid;
+	bool enable;
 };
 
 struct platform_data {
 	const struct attribute_group *attr_groups[MAX_SENSOR_TYPE + 1];
+	struct sensor_group_data *sgrp_data;
 	u32 sensors_count; /* Total count of sensors from each group */
+	u32 nr_sensor_groups; /* Total number of sensor groups */
 };
 
 static ssize_t show_sensor(struct device *dev, struct device_attribute *devattr,
@@ -105,6 +114,9 @@ static ssize_t show_sensor(struct device *dev, struct device_attribute *devattr,
 	ssize_t ret;
 	u64 x;
 
+	if (sdata->sgrp_data && !sdata->sgrp_data->enable)
+		return -ENODATA;
+
 	ret =  opal_get_sensor_data_u64(sdata->id, &x);
 
 	if (ret)
@@ -120,6 +132,46 @@ static ssize_t show_sensor(struct device *dev, struct device_attribute *devattr,
 	return sprintf(buf, "%llu\n", x);
 }
 
+static ssize_t show_enable(struct device *dev,
+			   struct device_attribute *devattr, char *buf)
+{
+	struct sensor_data *sdata = container_of(devattr, struct sensor_data,
+						 dev_attr);
+
+	return sprintf(buf, "%u\n", sdata->sgrp_data->enable);
+}
+
+static ssize_t store_enable(struct device *dev,
+			    struct device_attribute *devattr,
+			    const char *buf, size_t count)
+{
+	struct sensor_data *sdata = container_of(devattr, struct sensor_data,
+						 dev_attr);
+	struct sensor_group_data *sgrp_data = sdata->sgrp_data;
+	bool data;
+	int ret;
+
+	ret = kstrtobool(buf, &data);
+	if (ret)
+		return ret;
+
+	ret = mutex_lock_interruptible(&sgrp_data->mutex);
+	if (ret)
+		return ret;
+
+	if (data != sgrp_data->enable) {
+		ret =  sensor_group_enable(sgrp_data->gid, data);
+		if (!ret)
+			sgrp_data->enable = data;
+	}
+
+	if (!ret)
+		ret = count;
+
+	mutex_unlock(&sgrp_data->mutex);
+	return ret;
+}
+
 static ssize_t show_label(struct device *dev, struct device_attribute *devattr,
 			  char *buf)
 {
@@ -292,12 +344,126 @@ static u32 get_sensor_hwmon_index(struct sensor_data *sdata,
 	return ++sensor_groups[sdata->type].hwmon_index;
 }
 
+static int init_sensor_group_data(struct platform_device *pdev,
+				  struct platform_data *pdata)
+{
+	struct sensor_group_data *sgrp_data;
+	struct device_node *groups, *sgrp;
+	enum sensors type;
+	int count = 0, ret = 0;
+
+	groups = of_find_node_by_path("/ibm,opal/sensor-groups");
+	if (!groups)
+		return ret;
+
+	for_each_child_of_node(groups, sgrp) {
+		type = get_sensor_type(sgrp);
+		if (type != MAX_SENSOR_TYPE)
+			pdata->nr_sensor_groups++;
+	}
+
+	if (!pdata->nr_sensor_groups)
+		goto out;
+
+	sgrp_data = devm_kcalloc(&pdev->dev, pdata->nr_sensor_groups,
+				 sizeof(*sgrp_data), GFP_KERNEL);
+	if (!sgrp_data) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	for_each_child_of_node(groups, sgrp) {
+		const __be32 *phandles;
+		int len, gid;
+
+		type = get_sensor_type(sgrp);
+		if (type == MAX_SENSOR_TYPE)
+			continue;
+
+		if (of_property_read_u32(sgrp, "sensor-group-id", &gid))
+			continue;
+
+		phandles = of_get_property(sgrp, "sensors", &len);
+		if (!phandles)
+			continue;
+
+		len /= sizeof(u32);
+		if (!len)
+			continue;
+
+		sensor_groups[type].attr_count++;
+		sgrp_data[count].gid = gid;
+		mutex_init(&sgrp_data[count].mutex);
+		sgrp_data[count++].enable = false;
+	}
+
+	pdata->sgrp_data = sgrp_data;
+out:
+	of_node_put(groups);
+	return ret;
+}
+
+static struct sensor_group_data *get_sensor_group(struct platform_data *pdata,
+						  struct device_node *node,
+						  enum sensors gtype)
+{
+	struct sensor_group_data *sgrp_data = pdata->sgrp_data;
+	struct device_node *groups, *sgrp;
+
+	groups = of_find_node_by_path("/ibm,opal/sensor-groups");
+	if (!groups)
+		return NULL;
+
+	for_each_child_of_node(groups, sgrp) {
+		const __be32 *phandles;
+		int len, gid, i;
+		enum sensors type;
+
+		type = get_sensor_type(sgrp);
+		if (type != gtype)
+			continue;
+
+		if (of_property_read_u32(sgrp, "sensor-group-id", &gid))
+			continue;
+
+		phandles = of_get_property(sgrp, "sensors", &len);
+		if (!phandles)
+			continue;
+
+		len /= sizeof(u32);
+		if (!len)
+			continue;
+
+		while (--len >= 0)
+			if (be32_to_cpu(phandles[len]) == node->phandle)
+				break;
+
+		if (len < 0)
+			continue;
+
+		for (i = 0; i < pdata->nr_sensor_groups; i++)
+			if (gid == sgrp_data[i].gid) {
+				of_node_put(sgrp);
+				of_node_put(groups);
+				return &sgrp_data[i];
+			}
+	}
+
+	of_node_put(groups);
+	return NULL;
+}
+
 static int populate_attr_groups(struct platform_device *pdev)
 {
 	struct platform_data *pdata = platform_get_drvdata(pdev);
 	const struct attribute_group **pgroups = pdata->attr_groups;
 	struct device_node *opal, *np;
 	enum sensors type;
+	int ret;
+
+	ret = init_sensor_group_data(pdev, pdata);
+	if (ret)
+		return ret;
 
 	opal = of_find_node_by_path("/ibm,opal/sensors");
 	for_each_child_of_node(opal, np) {
@@ -344,7 +510,10 @@ static int populate_attr_groups(struct platform_device *pdev)
 static void create_hwmon_attr(struct sensor_data *sdata, const char *attr_name,
 			      ssize_t (*show)(struct device *dev,
 					      struct device_attribute *attr,
-					      char *buf))
+					      char *buf),
+			    ssize_t (*store)(struct device *dev,
+					     struct device_attribute *attr,
+					     const char *buf, size_t count))
 {
 	snprintf(sdata->name, MAX_ATTR_LEN, "%s%d_%s",
 		 sensor_groups[sdata->type].name, sdata->hwmon_index,
@@ -352,23 +521,33 @@ static void create_hwmon_attr(struct sensor_data *sdata, const char *attr_name,
 
 	sysfs_attr_init(&sdata->dev_attr.attr);
 	sdata->dev_attr.attr.name = sdata->name;
-	sdata->dev_attr.attr.mode = S_IRUGO;
 	sdata->dev_attr.show = show;
+	if (store) {
+		sdata->dev_attr.store = store;
+		sdata->dev_attr.attr.mode = 0664;
+	} else {
+		sdata->dev_attr.attr.mode = 0444;
+	}
 }
 
 static void populate_sensor(struct sensor_data *sdata, int od, int hd, int sid,
 			    const char *attr_name, enum sensors type,
 			    const struct attribute_group *pgroup,
+			    struct sensor_group_data *sgrp_data,
 			    ssize_t (*show)(struct device *dev,
 					    struct device_attribute *attr,
-					    char *buf))
+					    char *buf),
+			    ssize_t (*store)(struct device *dev,
+					     struct device_attribute *attr,
+					     const char *buf, size_t count))
 {
 	sdata->id = sid;
 	sdata->type = type;
 	sdata->opal_index = od;
 	sdata->hwmon_index = hd;
-	create_hwmon_attr(sdata, attr_name, show);
+	create_hwmon_attr(sdata, attr_name, show, store);
 	pgroup->attrs[sensor_groups[type].attr_count++] = &sdata->dev_attr.attr;
+	sdata->sgrp_data = sgrp_data;
 }
 
 static char *get_max_attr(enum sensors type)
@@ -403,24 +582,23 @@ static int create_device_attrs(struct platform_device *pdev)
 	const struct attribute_group **pgroups = pdata->attr_groups;
 	struct device_node *opal, *np;
 	struct sensor_data *sdata;
-	u32 sensor_id;
-	enum sensors type;
 	u32 count = 0;
-	int err = 0;
+	u32 group_attr_id[MAX_SENSOR_TYPE] = {0};
 
-	opal = of_find_node_by_path("/ibm,opal/sensors");
 	sdata = devm_kcalloc(&pdev->dev,
 			     pdata->sensors_count, sizeof(*sdata),
 			     GFP_KERNEL);
-	if (!sdata) {
-		err = -ENOMEM;
-		goto exit_put_node;
-	}
+	if (!sdata)
+		return -ENOMEM;
 
+	opal = of_find_node_by_path("/ibm,opal/sensors");
 	for_each_child_of_node(opal, np) {
+		struct sensor_group_data *sgrp_data;
 		const char *attr_name;
-		u32 opal_index;
+		u32 opal_index, hw_id;
+		u32 sensor_id;
 		const char *label;
+		enum sensors type;
 
 		if (np->name == NULL)
 			continue;
@@ -456,14 +634,12 @@ static int create_device_attrs(struct platform_device *pdev)
 			opal_index = INVALID_INDEX;
 		}
 
-		sdata[count].opal_index = opal_index;
-		sdata[count].hwmon_index =
-			get_sensor_hwmon_index(&sdata[count], sdata, count);
-
-		create_hwmon_attr(&sdata[count], attr_name, show_sensor);
-
-		pgroups[type]->attrs[sensor_groups[type].attr_count++] =
-				&sdata[count++].dev_attr.attr;
+		hw_id = get_sensor_hwmon_index(&sdata[count], sdata, count);
+		sgrp_data = get_sensor_group(pdata, np, type);
+		populate_sensor(&sdata[count], opal_index, hw_id, sensor_id,
+				attr_name, type, pgroups[type], sgrp_data,
+				show_sensor, NULL);
+		count++;
 
 		if (!of_property_read_string(np, "label", &label)) {
 			/*
@@ -474,35 +650,43 @@ static int create_device_attrs(struct platform_device *pdev)
 			 */
 
 			make_sensor_label(np, &sdata[count], label);
-			populate_sensor(&sdata[count], opal_index,
-					sdata[count - 1].hwmon_index,
+			populate_sensor(&sdata[count], opal_index, hw_id,
 					sensor_id, "label", type, pgroups[type],
-					show_label);
+					NULL, show_label, NULL);
 			count++;
 		}
 
 		if (!of_property_read_u32(np, "sensor-data-max", &sensor_id)) {
 			attr_name = get_max_attr(type);
-			populate_sensor(&sdata[count], opal_index,
-					sdata[count - 1].hwmon_index,
+			populate_sensor(&sdata[count], opal_index, hw_id,
 					sensor_id, attr_name, type,
-					pgroups[type], show_sensor);
+					pgroups[type], sgrp_data, show_sensor,
+					NULL);
 			count++;
 		}
 
 		if (!of_property_read_u32(np, "sensor-data-min", &sensor_id)) {
 			attr_name = get_min_attr(type);
-			populate_sensor(&sdata[count], opal_index,
-					sdata[count - 1].hwmon_index,
+			populate_sensor(&sdata[count], opal_index, hw_id,
 					sensor_id, attr_name, type,
-					pgroups[type], show_sensor);
+					pgroups[type], sgrp_data, show_sensor,
+					NULL);
+			count++;
+		}
+
+		if (sgrp_data && !sgrp_data->enable) {
+			sgrp_data->enable = true;
+			hw_id = ++group_attr_id[type];
+			populate_sensor(&sdata[count], opal_index, hw_id,
+					sgrp_data->gid, "enable", type,
+					pgroups[type], sgrp_data, show_enable,
+					store_enable);
 			count++;
 		}
 	}
 
-exit_put_node:
 	of_node_put(opal);
-	return err;
+	return 0;
 }
 
 static int ibmpowernv_probe(struct platform_device *pdev)
@@ -517,6 +701,7 @@ static int ibmpowernv_probe(struct platform_device *pdev)
 
 	platform_set_drvdata(pdev, pdata);
 	pdata->sensors_count = 0;
+	pdata->nr_sensor_groups = 0;
 	err = populate_attr_groups(pdev);
 	if (err)
 		return err;
-- 
1.8.3.1

* [PATCH v6 1/2] powernv:opal-sensor-groups: Add support to enable sensor groups
From: Shilpasri G Bhat @ 2018-07-18 18:13 UTC
  To: mpe, linux; +Cc: linuxppc-dev, linux-hwmon, linux-kernel, ego, Shilpasri G Bhat
In-Reply-To: <1531937610-6454-1-git-send-email-shilpa.bhat@linux.vnet.ibm.com>

Add support to enable/disable a sensor group at runtime. This
can be used to select the sensor groups that need to be copied to
main memory by OCC. Sensor groups like power, temperature, current,
voltage, frequency, utilization can be enabled/disabled at runtime.

Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/opal-api.h                |  1 +
 arch/powerpc/include/asm/opal.h                    |  2 ++
 .../powerpc/platforms/powernv/opal-sensor-groups.c | 28 ++++++++++++++++++++++
 arch/powerpc/platforms/powernv/opal-wrappers.S     |  1 +
 4 files changed, 32 insertions(+)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 3bab299..56a94a1 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -206,6 +206,7 @@
 #define OPAL_NPU_SPA_CLEAR_CACHE		160
 #define OPAL_NPU_TL_SET				161
 #define OPAL_SENSOR_READ_U64			162
+#define OPAL_SENSOR_GROUP_ENABLE		163
 #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR		164
 #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR		165
 #define OPAL_LAST				165
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index e1b2910..fc0550e 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -292,6 +292,7 @@ int64_t opal_imc_counters_init(uint32_t type, uint64_t address,
 int opal_get_power_shift_ratio(u32 handle, int token, u32 *psr);
 int opal_set_power_shift_ratio(u32 handle, int token, u32 psr);
 int opal_sensor_group_clear(u32 group_hndl, int token);
+int opal_sensor_group_enable(u32 group_hndl, int token, bool enable);
 
 s64 opal_signal_system_reset(s32 cpu);
 s64 opal_quiesce(u64 shutdown_type, s32 cpu);
@@ -326,6 +327,7 @@ extern int opal_async_wait_response_interruptible(uint64_t token,
 		struct opal_msg *msg);
 extern int opal_get_sensor_data(u32 sensor_hndl, u32 *sensor_data);
 extern int opal_get_sensor_data_u64(u32 sensor_hndl, u64 *sensor_data);
+extern int sensor_group_enable(u32 grp_hndl, bool enable);
 
 struct rtc_time;
 extern time64_t opal_get_boot_time(void);
diff --git a/arch/powerpc/platforms/powernv/opal-sensor-groups.c b/arch/powerpc/platforms/powernv/opal-sensor-groups.c
index 541c9ea..f7d04b6 100644
--- a/arch/powerpc/platforms/powernv/opal-sensor-groups.c
+++ b/arch/powerpc/platforms/powernv/opal-sensor-groups.c
@@ -32,6 +32,34 @@ struct sg_attr {
 	struct sg_attr *sgattrs;
 } *sgs;
 
+int sensor_group_enable(u32 handle, bool enable)
+{
+	struct opal_msg msg;
+	int token, ret;
+
+	token = opal_async_get_token_interruptible();
+	if (token < 0)
+		return token;
+
+	ret = opal_sensor_group_enable(handle, token, enable);
+	if (ret == OPAL_ASYNC_COMPLETION) {
+		ret = opal_async_wait_response(token, &msg);
+		if (ret) {
+			pr_devel("Failed to wait for the async response\n");
+			ret = -EIO;
+			goto out;
+		}
+		ret = opal_error_code(opal_get_async_rc(msg));
+	} else {
+		ret = opal_error_code(ret);
+	}
+
+out:
+	opal_async_release_token(token);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(sensor_group_enable);
+
 static ssize_t sg_store(struct kobject *kobj, struct kobj_attribute *attr,
 			const char *buf, size_t count)
 {
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index a8d9b40..8268a1e 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -327,3 +327,4 @@ OPAL_CALL(opal_npu_tl_set,			OPAL_NPU_TL_SET);
 OPAL_CALL(opal_pci_get_pbcq_tunnel_bar,		OPAL_PCI_GET_PBCQ_TUNNEL_BAR);
 OPAL_CALL(opal_pci_set_pbcq_tunnel_bar,		OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
 OPAL_CALL(opal_sensor_read_u64,			OPAL_SENSOR_READ_U64);
+OPAL_CALL(opal_sensor_group_enable,		OPAL_SENSOR_GROUP_ENABLE);
-- 
1.8.3.1

* [PATCH v6 0/2] hwmon/powernv: Add attributes to enable/disable sensors
From: Shilpasri G Bhat @ 2018-07-18 18:13 UTC
  To: mpe, linux; +Cc: linuxppc-dev, linux-hwmon, linux-kernel, ego, Shilpasri G Bhat

This patch series adds a new attribute to enable or disable a sensor at
runtime.
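
For illustration, a userspace program could toggle such an attribute
roughly as sketched below.  The helper name write_enable_attr() and any
concrete path (for example /sys/class/hwmon/hwmon0/power1_enable) are
assumptions; the actual hwmon index and attribute name depend on the
system.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write "1" (enable) or "0" (disable) to an
 * *_enable attribute.  Returns 0 on success and -1 on any failure.
 */
int write_enable_attr(const char *path, int enable)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;

	if (fputs(enable ? "1\n" : "0\n", f) == EOF) {
		fclose(f);
		return -1;
	}

	/* A write error is only reported once the buffer is flushed. */
	return fclose(f) == 0 ? 0 : -1;
}
```

The attribute is created with mode 0664, so the caller needs write
permission on the file for this to succeed.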

Changes from v5:
- Don't store the sensor node; instead, parse the device-tree for each
  sensor to find the sensor-group during init

v5 : https://lkml.org/lkml/2018/7/15/15
v4 : https://lkml.org/lkml/2018/7/6/379
v3 : https://lkml.org/lkml/2018/7/5/476
v2 : https://lkml.org/lkml/2018/7/4/263
v1 : https://lkml.org/lkml/2018/3/22/214

Shilpasri G Bhat (2):
  powernv:opal-sensor-groups: Add support to enable sensor groups
  hwmon: ibmpowernv: Add attributes to enable/disable sensor groups

 Documentation/hwmon/ibmpowernv                     |  43 +++-
 arch/powerpc/include/asm/opal-api.h                |   1 +
 arch/powerpc/include/asm/opal.h                    |   2 +
 .../powerpc/platforms/powernv/opal-sensor-groups.c |  28 +++
 arch/powerpc/platforms/powernv/opal-wrappers.S     |   1 +
 drivers/hwmon/ibmpowernv.c                         | 249 ++++++++++++++++++---
 6 files changed, 290 insertions(+), 34 deletions(-)

-- 
1.8.3.1

* Re: [PATCH] mtd: powernv_flash: set of_node in mtd's dev
From: Boris Brezillon @ 2018-07-18 18:27 UTC
  To: Rafał Miłecki
  Cc: David Woodhouse, Brian Norris, Boris Brezillon, Marek Vasut,
	Richard Weinberger, Benjamin Herrenschmidt, Timothy Pearson,
	linux-mtd, Michael Ellerman, Miquel Raynal,
	Rafał Miłecki, Paul Mackerras, linuxppc-dev, Cyril Bur
In-Reply-To: <20180713081559.4373-1-zajec5@gmail.com>

On Fri, 13 Jul 2018 10:15:59 +0200
Rafał Miłecki <zajec5@gmail.com> wrote:

> From: Rafał Miłecki <rafal@milecki.pl>
>
> This enables some features implemented in mtd subsystem like reading
> label and partitioning info from DT.
>
> Reported-by: Timothy Pearson <tpearson@raptorengineering.com>
> Signed-off-by: Rafał Miłecki <rafal@milecki.pl>

Applied.

Thanks,

Boris

> ---
>  drivers/mtd/devices/powernv_flash.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/mtd/devices/powernv_flash.c b/drivers/mtd/devices/powernv_flash.c
> index c1312b141ae0..33593122e49b 100644
> --- a/drivers/mtd/devices/powernv_flash.c
> +++ b/drivers/mtd/devices/powernv_flash.c
> @@ -223,6 +223,7 @@ static int powernv_flash_set_driver_info(struct device *dev,
>  	mtd->_read = powernv_flash_read;
>  	mtd->_write = powernv_flash_write;
>  	mtd->dev.parent = dev;
> +	mtd_set_of_node(mtd, dev->of_node);
>  	return 0;
>  }
>

* [PATCH] powerpc/ps3: Set driver coherent_dma_mask
From: Geoff Levand @ 2018-07-18 22:08 UTC
  To: Alan Stern, Takashi Iwai, Jaroslav Kysela, Michael Ellerman
  Cc: linux-usb, linuxppc-dev@lists.ozlabs.org

Set the coherent_dma_mask for the PS3 ehci, ohci, and snd devices.

Silences WARN_ON_ONCE messages emitted by the dma_alloc_attrs() routine.
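
For reference, the 32-bit mask used here works out as in the userspace
sketch below.  DMA_BIT_MASK() is redefined locally with the same shape
as the kernel macro so the snippet builds on its own; addr_fits_mask()
is a hypothetical helper for illustration and not a kernel function.

```c
#include <stdint.h>

/* Same shape as the kernel's DMA_BIT_MASK(); local copy for userspace. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/*
 * Hypothetical helper: non-zero if 'addr' has no bits set above the
 * range described by 'mask', i.e. a device limited to 'mask' can
 * address it.
 */
int addr_fits_mask(uint64_t addr, uint64_t mask)
{
	return (addr & ~mask) == 0;
}
```

DMA_BIT_MASK(32) evaluates to 0xffffffff, so any address at or above
4 GiB falls outside the mask.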

Reported-by: Fredrik Noring <noring@nocrew.org>
Signed-off-by: Geoff Levand <geoff@infradead.org>
---
Hi Michael,

This just silences some warnings.  Can you take it through the powerpc
tree?

-Geoff


 drivers/usb/host/ehci-ps3.c | 6 ++++--
 drivers/usb/host/ohci-ps3.c | 6 ++++--
 sound/ppc/snd_ps3.c         | 5 +++++
 3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/usb/host/ehci-ps3.c b/drivers/usb/host/ehci-ps3.c
index 8c733492d8fe..454d8c624a3f 100644
--- a/drivers/usb/host/ehci-ps3.c
+++ b/drivers/usb/host/ehci-ps3.c
@@ -86,7 +86,7 @@ static int ps3_ehci_probe(struct ps3_system_bus_device *dev)
 	int result;
 	struct usb_hcd *hcd;
 	unsigned int virq;
-	static u64 dummy_mask = DMA_BIT_MASK(32);
+	static u64 dummy_mask;
 
 	if (usb_disabled()) {
 		result = -ENODEV;
@@ -131,7 +131,9 @@ static int ps3_ehci_probe(struct ps3_system_bus_device *dev)
 		goto fail_irq;
 	}
 
-	dev->core.dma_mask = &dummy_mask; /* FIXME: for improper usb code */
+	dummy_mask = DMA_BIT_MASK(32);
+	dev->core.dma_mask = &dummy_mask;
+	dma_set_coherent_mask(&dev->core, dummy_mask);
 
 	hcd = usb_create_hcd(&ps3_ehci_hc_driver, &dev->core, dev_name(&dev->core));
 
diff --git a/drivers/usb/host/ohci-ps3.c b/drivers/usb/host/ohci-ps3.c
index 20a23d795adf..395f9d3bc849 100644
--- a/drivers/usb/host/ohci-ps3.c
+++ b/drivers/usb/host/ohci-ps3.c
@@ -69,7 +69,7 @@ static int ps3_ohci_probe(struct ps3_system_bus_device *dev)
 	int result;
 	struct usb_hcd *hcd;
 	unsigned int virq;
-	static u64 dummy_mask = DMA_BIT_MASK(32);
+	static u64 dummy_mask;
 
 	if (usb_disabled()) {
 		result = -ENODEV;
@@ -115,7 +115,9 @@ static int ps3_ohci_probe(struct ps3_system_bus_device *dev)
 		goto fail_irq;
 	}
 
-	dev->core.dma_mask = &dummy_mask; /* FIXME: for improper usb code */
+	dummy_mask = DMA_BIT_MASK(32);
+	dev->core.dma_mask = &dummy_mask;
+	dma_set_coherent_mask(&dev->core, dummy_mask);
 
 	hcd = usb_create_hcd(&ps3_ohci_hc_driver, &dev->core, dev_name(&dev->core));
 
diff --git a/sound/ppc/snd_ps3.c b/sound/ppc/snd_ps3.c
index 36f34f434ecb..abe031c9d592 100644
--- a/sound/ppc/snd_ps3.c
+++ b/sound/ppc/snd_ps3.c
@@ -930,6 +930,7 @@ static int snd_ps3_driver_probe(struct ps3_system_bus_device *dev)
 {
 	int i, ret;
 	u64 lpar_addr, lpar_size;
+	static u64 dummy_mask;
 
 	if (WARN_ON(!firmware_has_feature(FW_FEATURE_PS3_LV1)))
 		return -ENODEV;
@@ -970,6 +971,10 @@ static int snd_ps3_driver_probe(struct ps3_system_bus_device *dev)
 		goto clean_mmio;
 	}
 
+	dummy_mask = DMA_BIT_MASK(32);
+	dev->core.dma_mask = &dummy_mask;
+	dma_set_coherent_mask(&dev->core, dummy_mask);
+
 	snd_ps3_audio_set_base_addr(dev->d_region->bus_addr);
 
 	/* CONFIG_SND_PS3_DEFAULT_START_DELAY */
-- 
2.14.1

* Re: [PATCH] MAINTAINERS: Drop inactive Vitaly Bordug's email
From: David Miller @ 2018-07-18 22:22 UTC (permalink / raw)
  To: krzk; +Cc: pantelis.antoniou, linuxppc-dev, netdev, linux-kernel, vbordug,
	vitb
In-Reply-To: <20180717164154.6577-1-krzk@kernel.org>

From: Krzysztof Kozlowski <krzk@kernel.org>
Date: Tue, 17 Jul 2018 18:41:54 +0200

> The Vitaly Bordug's email bounces ("ru.mvista.com: Name or service not
> known") and there was no activity (ack, review, sign) since 2009.
> 
> Cc: Vitaly Bordug <vitb@kernel.crashing.org>
> Cc: Pantelis Antoniou <pantelis.antoniou@gmail.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>

Applied.

* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
From: Andrew Morton @ 2018-07-18 22:33 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-kernel, robh+dt, dan.j.williams, nicolas.pitre, josh,
	fengguang.wu, bp, andy.shevchenko, patrik.r.jakobsson, airlied,
	kys, haiyangz, sthemmin, dmitry.torokhov, frowand.list,
	keith.busch, jonathan.derrick, lorenzo.pieralisi, bhelgaas, tglx,
	brijesh.singh, jglisse, thomas.lendacky, gregkh, baiyaowei,
	richard.weiyang, devel, linux-input, linux-nvdimm, devicetree,
	linux-pci, ebiederm, vgoyal, dyoung, yinghai, monstr, davem,
	chris, jcmvbkbc, gustavo, maarten.lankhorst, seanpaul,
	linux-parisc, linuxppc-dev, kexec
In-Reply-To: <20180718024944.577-5-bhe@redhat.com>

On Wed, 18 Jul 2018 10:49:44 +0800 Baoquan He <bhe@redhat.com> wrote:

> For kexec_file loading, if kexec_buf.top_down is 'true', the memory which
> is used to load kernel/initrd/purgatory is supposed to be allocated from
> top to down. This is what we have been doing all along in the old kexec
> loading interface and the kexec loading is still default setting in some
> distributions. However, the current kexec_file loading interface doesn't
> do like this. The function arch_kexec_walk_mem() it calls ignores checking
> kexec_buf.top_down, but calls walk_system_ram_res() directly to go through
> all resources of System RAM from bottom to up, to try to find memory region
> which can contain the specific kexec buffer, then call locate_mem_hole_callback()
> to allocate memory in that found memory region from top to down. This brings
> confusion especially when KASLR is widely supported , users have to make clear
> why kexec/kdump kernel loading position is different between these two
> interfaces in order to exclude unnecessary noises. Hence these two interfaces
> need be unified on behaviour.

As far as I can tell, the above is the whole reason for the patchset,
yes?  To avoid confusing users.

Is that sufficient?  Can we instead simplify their lives by providing
better documentation or informative printks or better Kconfig text,
etc?

And who *are* the people who are performing this configuration?  Random
system administrators?  Linux distro engineers?  If the latter then
they presumably aren't easily confused!

In other words, I'm trying to understand how much benefit this patchset
will provide to our users as a whole.

* Re: Improvements for the PS3
From: Geoff Levand @ 2018-07-18 22:40 UTC (permalink / raw)
  To: Fredrik Noring, linuxppc-dev, Geert Uytterhoeven
In-Reply-To: <20180714164906.GQ23412@localhost.localdomain>

Hi Fredrik,

On 07/14/2018 09:49 AM, Fredrik Noring wrote:
> so I added a sleep with
> 
> +	msleep(10000);
> +
>  	return 0;
> 
> et voilà, the screen came alive and the kernel panic was revealed! It seems
> the kernel panics so fast that the PS3 frame buffer is unprepared. This is,
> of course, very unfortunate because trying to debug the boot process without
> a screen or any other means of obtaining console text is quite difficult.

We could add a fixed delay there, but I'd like to avoid waiting that
long on every boot.  Why don't you add a kernel module_param, named
something like ps3fb_delay, that takes a value in milliseconds and
defaults to zero?  See:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/video/fbdev/ps3fb.c?h=v4.17#n260

> I suppose the problem is that it relies on interrupts for ps3fb_sync_image
> to regularly copy the image, hence without them the screen isn't updated to
> show kernel panics, etc. Perhaps one way to fix that is to implement the
> struct fb_tile_ops API, so that the console is synchronously updated? Would
> that be acceptable?

I'm not sure if that would work or not.  Maybe Geert is more familiar with it.

-Geoff

* Re: [PATCH v3] PCI: Data corruption happening due to race condition
From: Bjorn Helgaas @ 2018-07-18 23:29 UTC (permalink / raw)
  To: Hari Vyas
  Cc: bhelgaas, linux-pci, ray.jui, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, linuxppc-dev
In-Reply-To: <1530608741-30664-2-git-send-email-hari.vyas@broadcom.com>

[+cc Paul, Michael, linuxppc-dev]

On Tue, Jul 03, 2018 at 02:35:41PM +0530, Hari Vyas wrote:
> When a pci device is detected, a variable is_added is set to
> 1 in pci device structure and proc, sys entries are created.
> 
> When a pci device is removed, first is_added is checked for one
> and then device is detached with clearing of proc and sys
> entries and at end, is_added is set to 0.
> 
> is_added and is_busmaster are bit fields in pci_dev structure
> sharing same memory location.
> 
> A strange issue was observed with multiple times removal and
> rescan of a pcie nvme device using sysfs commands where is_added
> flag was observed as zero instead of one while removing device
> and proc,sys entries are not cleared.  This causes issue in
> later device addition with warning message "proc_dir_entry"
> already registered.
> 
> Debugging revealed a race condition between pcie core driver
> enabling is_added bit(pci_bus_add_device()) and nvme driver
> reset work-queue enabling is_busmaster bit (by pci_set_master()).
> As both fields are not handled in atomic manner and that clears
> is_added bit.
> 
> Fix moves device addition is_added bit to separate private flag
> variable and use different atomic functions to set and retrieve
> device addition state. As is_added shares different memory
> location so race condition is avoided.

Really nice bit of debugging!
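
For readers who want the failure mode spelled out, here is a minimal
userspace sketch (not kernel code) of the lost update described in the
changelog above.  struct fake_dev, struct fake_dev_fixed and all of the
fake_*/replay_* helpers are hypothetical names invented for this
illustration.  C11 atomics stand in for the kernel's assign_bit() and
test_bit(), and the interleaving is replayed by hand instead of being
raced, so the outcome is deterministic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two neighbouring pci_dev bit fields. */
struct fake_dev {
	unsigned int is_added:1;
	unsigned int is_busmaster:1;
};

/*
 * Replay the bad interleaving: both "CPUs" load the shared word, each
 * sets its own bit in a private copy, then both store the whole word.
 */
static struct fake_dev replay_lost_update(void)
{
	struct fake_dev dev = { 0 };
	struct fake_dev a = dev;	/* CPU A loads the word */
	struct fake_dev b = dev;	/* CPU B loads the same word */

	a.is_added = 1;			/* A: pci_bus_add_device() path */
	b.is_busmaster = 1;		/* B: pci_set_master() path */

	dev = a;			/* A stores: is_added = 1 */
	dev = b;			/* B stores last: is_added is 0 again */
	return dev;
}

/* Assumed bit number, mirroring PCI_DEV_ADDED from the patch. */
enum { FAKE_DEV_ADDED = 1 };

/* The "added" state lives in its own word and is updated atomically. */
struct fake_dev_fixed {
	unsigned int is_busmaster:1;
	atomic_ulong priv_flags;
};

static void fake_assign_added(struct fake_dev_fixed *dev, bool added)
{
	if (added)
		atomic_fetch_or(&dev->priv_flags, 1UL << FAKE_DEV_ADDED);
	else
		atomic_fetch_and(&dev->priv_flags, ~(1UL << FAKE_DEV_ADDED));
}

static bool fake_is_added(struct fake_dev_fixed *dev)
{
	return (atomic_load(&dev->priv_flags) & (1UL << FAKE_DEV_ADDED)) != 0;
}

/* Same interleaving as above, but the two updates hit different words. */
static void replay_with_separate_flags(struct fake_dev_fixed *dev)
{
	unsigned int b_copy;

	dev->is_busmaster = 0;
	atomic_init(&dev->priv_flags, 0);

	b_copy = dev->is_busmaster;	/* CPU B loads the bit-field word */
	fake_assign_added(dev, true);	/* CPU A: atomic RMW on priv_flags */
	b_copy = 1;
	dev->is_busmaster = b_copy;	/* CPU B stores only the bit-field word */
}
```

In the first replay the device ends up with is_busmaster set and
is_added clear, which matches the symptom in the report.  In the second
both survive, because the store to the bit-field word no longer covers
the "added" state.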

> Signed-off-by: Hari Vyas <hari.vyas@broadcom.com>
> ---
>  arch/powerpc/kernel/pci-common.c          |  4 +++-
>  arch/powerpc/platforms/powernv/pci-ioda.c |  3 ++-
>  arch/powerpc/platforms/pseries/setup.c    |  3 ++-
>  drivers/pci/bus.c                         |  6 +++---
>  drivers/pci/hotplug/acpiphp_glue.c        |  2 +-
>  drivers/pci/pci.h                         | 11 +++++++++++
>  drivers/pci/probe.c                       |  4 ++--
>  drivers/pci/remove.c                      |  5 +++--
>  include/linux/pci.h                       |  1 -
>  9 files changed, 27 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
> index fe9733f..471aac3 100644
> --- a/arch/powerpc/kernel/pci-common.c
> +++ b/arch/powerpc/kernel/pci-common.c
> @@ -42,6 +42,8 @@
>  #include <asm/ppc-pci.h>
>  #include <asm/eeh.h>
>  
> +#include "../../../drivers/pci/pci.h"

I see why you need it, but this include path is really ugly.  Outside
of bootloaders and tools, there are very few instances of includes
like this that reference a different top-level directory, and I'm not
very keen about adding more.

Obviously powerpc is the only arch that needs dev->is_added.  It seems
to be because "We can only call pcibios_setup_device() after bus setup
is complete, since some of the platform specific DMA setup code
depends on it."

I don't know powerpc, but it does raise the question in my mind of
whether powerpc could be changed to do the DMA setup more like other
arches do to remove this ordering dependency and the need to use
dev->is_added.

That sounds like a lot of work, but it would have the benefit of
unifying some code that is probably needlessly arch-specific.

>  /* hose_spinlock protects accesses to the the phb_bitmap. */
>  static DEFINE_SPINLOCK(hose_spinlock);
>  LIST_HEAD(hose_list);
> @@ -1014,7 +1016,7 @@ void pcibios_setup_bus_devices(struct pci_bus *bus)
>  		/* Cardbus can call us to add new devices to a bus, so ignore
>  		 * those who are already fully discovered
>  		 */
> -		if (dev->is_added)
> +		if (pci_dev_is_added(dev))
>  			continue;
>  
>  		pcibios_setup_device(dev);
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 5bd0eb6..70b2e1e 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -46,6 +46,7 @@
>  
>  #include "powernv.h"
>  #include "pci.h"
> +#include "../../../../drivers/pci/pci.h"
>  
>  #define PNV_IODA1_M64_NUM	16	/* Number of M64 BARs	*/
>  #define PNV_IODA1_M64_SEGS	8	/* Segments per M64 BAR	*/
> @@ -3138,7 +3139,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>  	struct pci_dn *pdn;
>  	int mul, total_vfs;
>  
> -	if (!pdev->is_physfn || pdev->is_added)
> +	if (!pdev->is_physfn || pci_dev_is_added(pdev))
>  		return;
>  
>  	pdn = pci_get_pdn(pdev);
> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
> index 139f0af..8a4868a 100644
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -71,6 +71,7 @@
>  #include <asm/security_features.h>
>  
>  #include "pseries.h"
> +#include "../../../../drivers/pci/pci.h"
>  
>  int CMO_PrPSP = -1;
>  int CMO_SecPSP = -1;
> @@ -664,7 +665,7 @@ static void pseries_pci_fixup_iov_resources(struct pci_dev *pdev)
>  	const int *indexes;
>  	struct device_node *dn = pci_device_to_OF_node(pdev);
>  
> -	if (!pdev->is_physfn || pdev->is_added)
> +	if (!pdev->is_physfn || pci_dev_is_added(pdev))
>  		return;
>  	/*Firmware must support open sriov otherwise dont configure*/
>  	indexes = of_get_property(dn, "ibm,open-sriov-vf-bar-info", NULL);
> diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> index 35b7fc8..5cb40b2 100644
> --- a/drivers/pci/bus.c
> +++ b/drivers/pci/bus.c
> @@ -330,7 +330,7 @@ void pci_bus_add_device(struct pci_dev *dev)
>  		return;
>  	}
>  
> -	dev->is_added = 1;
> +	pci_dev_assign_added(dev, true);
>  }
>  EXPORT_SYMBOL_GPL(pci_bus_add_device);
>  
> @@ -347,14 +347,14 @@ void pci_bus_add_devices(const struct pci_bus *bus)
>  
>  	list_for_each_entry(dev, &bus->devices, bus_list) {
>  		/* Skip already-added devices */
> -		if (dev->is_added)
> +		if (pci_dev_is_added(dev))
>  			continue;
>  		pci_bus_add_device(dev);
>  	}
>  
>  	list_for_each_entry(dev, &bus->devices, bus_list) {
>  		/* Skip if device attach failed */
> -		if (!dev->is_added)
> +		if (!pci_dev_is_added(dev))
>  			continue;
>  		child = dev->subordinate;
>  		if (child)
> diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
> index 3a17b29..ef0b1b6 100644
> --- a/drivers/pci/hotplug/acpiphp_glue.c
> +++ b/drivers/pci/hotplug/acpiphp_glue.c
> @@ -509,7 +509,7 @@ static void enable_slot(struct acpiphp_slot *slot)
>  
>  	list_for_each_entry(dev, &bus->devices, bus_list) {
>  		/* Assume that newly added devices are powered on already. */
> -		if (!dev->is_added)
> +		if (!pci_dev_is_added(dev))
>  			dev->current_state = PCI_D0;
>  	}
>  
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 882f1f9..0881725 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -288,6 +288,7 @@ struct pci_sriov {
>  
>  /* pci_dev priv_flags */
>  #define PCI_DEV_DISCONNECTED 0
> +#define PCI_DEV_ADDED 1
>  
>  static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
>  {
> @@ -300,6 +301,16 @@ static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
>  	return test_bit(PCI_DEV_DISCONNECTED, &dev->priv_flags);
>  }
>  
> +static inline void pci_dev_assign_added(struct pci_dev *dev, bool added)
> +{
> +	assign_bit(PCI_DEV_ADDED, &dev->priv_flags, added);
> +}
> +
> +static inline bool pci_dev_is_added(const struct pci_dev *dev)
> +{
> +	return test_bit(PCI_DEV_ADDED, &dev->priv_flags);
> +}
> +
>  #ifdef CONFIG_PCI_ATS
>  void pci_restore_ats_state(struct pci_dev *dev);
>  #else
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index ac876e3..611adcd 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2433,13 +2433,13 @@ int pci_scan_slot(struct pci_bus *bus, int devfn)
>  	dev = pci_scan_single_device(bus, devfn);
>  	if (!dev)
>  		return 0;
> -	if (!dev->is_added)
> +	if (!pci_dev_is_added(dev))
>  		nr++;
>  
>  	for (fn = next_fn(bus, dev, 0); fn > 0; fn = next_fn(bus, dev, fn)) {
>  		dev = pci_scan_single_device(bus, devfn + fn);
>  		if (dev) {
> -			if (!dev->is_added)
> +			if (!pci_dev_is_added(dev))
>  				nr++;
>  			dev->multifunction = 1;
>  		}
> diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
> index 6f072ea..5e3d0dc 100644
> --- a/drivers/pci/remove.c
> +++ b/drivers/pci/remove.c
> @@ -19,11 +19,12 @@ static void pci_stop_dev(struct pci_dev *dev)
>  {
>  	pci_pme_active(dev, false);
>  
> -	if (dev->is_added) {
> +	if (pci_dev_is_added(dev)) {
>  		device_release_driver(&dev->dev);
>  		pci_proc_detach_device(dev);
>  		pci_remove_sysfs_dev_files(dev);
> -		dev->is_added = 0;
> +
> +		pci_dev_assign_added(dev, false);
>  	}
>  
>  	if (dev->bus->self)
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 340029b..506125b 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -368,7 +368,6 @@ struct pci_dev {
>  	unsigned int	transparent:1;		/* Subtractive decode bridge */
>  	unsigned int	multifunction:1;	/* Multi-function device */
>  
> -	unsigned int	is_added:1;
>  	unsigned int	is_busmaster:1;		/* Is busmaster */
>  	unsigned int	no_msi:1;		/* May not use MSI */
>  	unsigned int	no_64bit_msi:1; 	/* May only use 32-bit MSIs */
> -- 
> 1.9.1
> 

* [PATCH v3 1/1] KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space
From: Sam Bobroff @ 2018-07-19  2:25 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: kvm, kvm-ppc, paulus, david, clg

From: Sam Bobroff <sam.bobroff@au1.ibm.com>

It is not currently possible to create the full number of possible
VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses fewer
threads per core than its core stride (or "VSMT mode"). This is
because the VCORE ID and XIVE offsets grow beyond KVM_MAX_VCPUS
even though the VCPU ID is less than KVM_MAX_VCPU_ID.

To address this, "pack" the VCORE ID and XIVE offsets by using
knowledge of the way the VCPU IDs will be used when there are fewer
guest threads per core than the core stride. The primary thread of
each core will always be used first. Then, if the guest uses more than
one thread per core, these secondary threads will sequentially follow
the primary in each core.

So the only way an ID above KVM_MAX_VCPUS can be seen is if the
VCPUs are being spaced apart. In that case at least half of each core
is empty, and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be
mapped into the second half of each core (4..7, in an 8-thread core).

Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
each core is being left empty, and we can map down into the second and
fourth quarters of each core (2, 3 and 6, 7 in an 8-thread core).

Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
threads are being used and 7/8 of the core is empty, allowing use of
the 1, 3, 5 and 7 thread slots.

(Strides less than 8 are handled similarly.)

This allows the VCORE ID or offset to be calculated quickly from the
VCPU ID or XIVE server numbers, without access to the VCPU structure.
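
As a worked illustration of the packing, here is a small userspace
model (not the kernel function itself).  DEMO_MAX_VCPUS is a
hypothetical, deliberately tiny stand-in for KVM_MAX_VCPUS, and
demo_pack_vcpu_id() and demo_packing_is_collision_free() are names made
up for this sketch.  The sketch assumes the offsets table is ordered
{0, 4, 2, 6, 1, 5, 3, 7}, so that a stride of 4 selects offsets
{0, 2, 1, 3} and a stride of 2 selects {0, 1}.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SMT_THREADS	8
#define DEMO_MAX_VCPUS	16	/* hypothetical stand-in for KVM_MAX_VCPUS */

/*
 * Model of the packing step: "stride" is the guest's core stride
 * (1, 2, 4 or 8).  The block number says how many times the ID has
 * wrapped past DEMO_MAX_VCPUS; each block gets its own thread slot.
 */
static uint32_t demo_pack_vcpu_id(uint32_t id, int stride)
{
	/* Assumed ordering; see the paragraph above. */
	static const int block_offsets[MAX_SMT_THREADS] = {
		0, 4, 2, 6, 1, 5, 3, 7
	};
	int block = (id / DEMO_MAX_VCPUS) * (MAX_SMT_THREADS / stride);

	assert(block < MAX_SMT_THREADS);
	return (id % DEMO_MAX_VCPUS) + block_offsets[block];
}

/*
 * Create DEMO_MAX_VCPUS VCPUs with one thread per core, i.e. IDs
 * 0, stride, 2 * stride, ..., and check that every packed ID is in
 * range and that no two of them collide.  Returns 1 on success.
 */
static int demo_packing_is_collision_free(int stride)
{
	unsigned char seen[DEMO_MAX_VCPUS] = { 0 };
	int n;

	for (n = 0; n < DEMO_MAX_VCPUS; n++) {
		uint32_t packed = demo_pack_vcpu_id((uint32_t)(n * stride),
						    stride);

		if (packed >= DEMO_MAX_VCPUS || seen[packed])
			return 0;
		seen[packed] = 1;
	}
	return 1;
}
```

With these toy constants and a stride of 8, VCPU ID 16 is the first ID
of block 1 and packs to slot 4, while ID 64 is the first ID of block 4
and packs to slot 1.  IDs below DEMO_MAX_VCPUS are returned unchanged.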

Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
---
Hello everyone,

I've completed a trial merge with the guest native-XIVE code and found no
problems; it's no more difficult than the host side and only requires a few
calls to xive_vp().

On that basis, here is v3 (unchanged from v2) as non-RFC and it seems to be
ready to go.

Patch set v3:
Patch 1/1: KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space

Patch set v2:
Patch 1/1: KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space
* Corrected places in kvm/book3s_xive.c where IDs weren't packed.
* Because kvmppc_pack_vcpu_id() is only called on P9, there is no need to test "emul_smt_mode > 1", so remove it.
* Re-ordered block_offsets[] to be more ascending.
* Added more detailed description of the packing algorithm.

Patch set v1:
Patch 1/1: KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space

 arch/powerpc/include/asm/kvm_book3s.h | 44 +++++++++++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_hv.c          | 14 +++++++----
 arch/powerpc/kvm/book3s_xive.c        | 19 +++++++++------
 3 files changed, 66 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 1f345a0b6ba2..ba4b6e00fca7 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -390,4 +390,48 @@ extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
 #define SPLIT_HACK_MASK			0xff000000
 #define SPLIT_HACK_OFFS			0xfb000000
 
+/* Pack a VCPU ID from the [0..KVM_MAX_VCPU_ID) space down to the
+ * [0..KVM_MAX_VCPUS) space, while using knowledge of the guest's core stride
+ * (but not its actual threading mode, which is not available) to avoid
+ * collisions.
+ *
+ * The implementation leaves VCPU IDs from the range [0..KVM_MAX_VCPUS) (block
+ * 0) unchanged: if the guest is filling each VCORE completely then it will be
+ * using consecutive IDs and it will fill the space without any packing.
+ *
+ * For higher VCPU IDs, the packed ID is based on the VCPU ID modulo
+ * KVM_MAX_VCPUS (effectively masking off the top bits) and then an offset is
+ * added to avoid collisions.
+ *
+ * VCPU IDs in the range [KVM_MAX_VCPUS..(KVM_MAX_VCPUS*2)) (block 1) are only
+ * possible if the guest is leaving at least 1/2 of each VCORE empty, so IDs
+ * can be safely packed into the second half of each VCORE by adding an offset
+ * of (stride / 2).
+ *
+ * Similarly, if VCPU IDs in the range [(KVM_MAX_VCPUS*2)..(KVM_MAX_VCPUS*4))
+ * (blocks 2 and 3) are seen, the guest must be leaving at least 3/4 of each
+ * VCORE empty so packed IDs can be offset by (stride / 4) and (stride * 3 / 4).
+ *
+ * Finally, VCPU IDs from blocks 4..7 will only be seen if the guest is using a
+ * stride of 8 and 1 thread per core so the remaining offsets of 1, 3, 5 and 7
+ * must be free to use.
+ *
+ * (The offsets for each block are stored in block_offsets[], indexed by the
+ * block number if the stride is 8. For cases where the guest's stride is less
+ * than 8, we can re-use the block_offsets array by multiplying the block
+ * number by (MAX_SMT_THREADS / stride) to reach the correct entry.)
+ */
+static inline u32 kvmppc_pack_vcpu_id(struct kvm *kvm, u32 id)
+{
+	const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 5, 3, 7};
+	int stride = kvm->arch.emul_smt_mode;
+	int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);
+	u32 packed_id;
+
+	BUG_ON(block >= MAX_SMT_THREADS);
+	packed_id = (id % KVM_MAX_VCPUS) + block_offsets[block];
+	BUG_ON(packed_id >= KVM_MAX_VCPUS);
+	return packed_id;
+}
+
 #endif /* __ASM_KVM_BOOK3S_H__ */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index de686b340f4a..363c2fb0d89e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1816,7 +1816,7 @@ static int threads_per_vcore(struct kvm *kvm)
 	return threads_per_subcore;
 }
 
-static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
+static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int id)
 {
 	struct kvmppc_vcore *vcore;
 
@@ -1830,7 +1830,7 @@ static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
 	init_swait_queue_head(&vcore->wq);
 	vcore->preempt_tb = TB_NIL;
 	vcore->lpcr = kvm->arch.lpcr;
-	vcore->first_vcpuid = core * kvm->arch.smt_mode;
+	vcore->first_vcpuid = id;
 	vcore->kvm = kvm;
 	INIT_LIST_HEAD(&vcore->preempt_list);
 
@@ -2048,12 +2048,18 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
 	mutex_lock(&kvm->lock);
 	vcore = NULL;
 	err = -EINVAL;
-	core = id / kvm->arch.smt_mode;
+	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+		BUG_ON(kvm->arch.smt_mode != 1);
+		core = kvmppc_pack_vcpu_id(kvm, id);
+	} else {
+		core = id / kvm->arch.smt_mode;
+	}
 	if (core < KVM_MAX_VCORES) {
 		vcore = kvm->arch.vcores[core];
+		BUG_ON(cpu_has_feature(CPU_FTR_ARCH_300) && vcore);
 		if (!vcore) {
 			err = -ENOMEM;
-			vcore = kvmppc_vcore_create(kvm, core);
+			vcore = kvmppc_vcore_create(kvm, id & ~(kvm->arch.smt_mode - 1));
 			kvm->arch.vcores[core] = vcore;
 			kvm->arch.online_vcores++;
 		}
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index f9818d7d3381..dbd5887daf4a 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -317,6 +317,11 @@ static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
 	return -EBUSY;
 }
 
+static u32 xive_vp(struct kvmppc_xive *xive, u32 server)
+{
+	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
+}
+
 static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
 			     struct kvmppc_xive_src_block *sb,
 			     struct kvmppc_xive_irq_state *state)
@@ -362,7 +367,7 @@ static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
 	 */
 	if (xd->flags & OPAL_XIVE_IRQ_MASK_VIA_FW) {
 		xive_native_configure_irq(hw_num,
-					  xive->vp_base + state->act_server,
+					  xive_vp(xive, state->act_server),
 					  MASKED, state->number);
 		/* set old_p so we can track if an H_EOI was done */
 		state->old_p = true;
@@ -418,7 +423,7 @@ static void xive_finish_unmask(struct kvmppc_xive *xive,
 	 */
 	if (xd->flags & OPAL_XIVE_IRQ_MASK_VIA_FW) {
 		xive_native_configure_irq(hw_num,
-					  xive->vp_base + state->act_server,
+					  xive_vp(xive, state->act_server),
 					  state->act_priority, state->number);
 		/* If an EOI is needed, do it here */
 		if (!state->old_p)
@@ -495,7 +500,7 @@ static int xive_target_interrupt(struct kvm *kvm,
 	kvmppc_xive_select_irq(state, &hw_num, NULL);
 
 	return xive_native_configure_irq(hw_num,
-					 xive->vp_base + server,
+					 xive_vp(xive, server),
 					 prio, state->number);
 }
 
@@ -883,7 +888,7 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long guest_irq,
 	 * which is fine for a never started interrupt.
 	 */
 	xive_native_configure_irq(hw_irq,
-				  xive->vp_base + state->act_server,
+				  xive_vp(xive, state->act_server),
 				  state->act_priority, state->number);
 
 	/*
@@ -959,7 +964,7 @@ int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq,
 
 	/* Reconfigure the IPI */
 	xive_native_configure_irq(state->ipi_number,
-				  xive->vp_base + state->act_server,
+				  xive_vp(xive, state->act_server),
 				  state->act_priority, state->number);
 
 	/*
@@ -1084,7 +1089,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
 		pr_devel("Duplicate !\n");
 		return -EEXIST;
 	}
-	if (cpu >= KVM_MAX_VCPUS) {
+	if (cpu >= KVM_MAX_VCPU_ID) {
 		pr_devel("Out of bounds !\n");
 		return -EINVAL;
 	}
@@ -1098,7 +1103,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
 	xc->xive = xive;
 	xc->vcpu = vcpu;
 	xc->server_num = cpu;
-	xc->vp_id = xive->vp_base + cpu;
+	xc->vp_id = xive_vp(xive, cpu);
 	xc->mfrr = 0xff;
 	xc->valid = true;
 
-- 
2.16.1.74.g9b0b1f47b

* Re: [PATCH v3] PCI: Data corruption happening due to race condition
From: Benjamin Herrenschmidt @ 2018-07-19  4:18 UTC (permalink / raw)
  To: Bjorn Helgaas, Hari Vyas
  Cc: bhelgaas, linux-pci, ray.jui, Paul Mackerras, Michael Ellerman,
	linuxppc-dev
In-Reply-To: <20180718232904.GJ128988@bhelgaas-glaptop.roam.corp.google.com>

On Wed, 2018-07-18 at 18:29 -0500, Bjorn Helgaas wrote:
> [+cc Paul, Michael, linuxppc-dev]
> 

   ..../...

> > Debugging revealed a race condition between pcie core driver
> > enabling is_added bit(pci_bus_add_device()) and nvme driver
> > reset work-queue enabling is_busmaster bit (by pci_set_master()).
> > As both fields are not handled in atomic manner and that clears
> > is_added bit.
> > 
> > Fix moves device addition is_added bit to separate private flag
> > variable and use different atomic functions to set and retrieve
> > device addition state. As is_added shares different memory
> > location so race condition is avoided.
> 
> Really nice bit of debugging!

Indeed. However, I'm not a fan of the solution. Shouldn't we instead have
some locking for the content of pci_dev? I've always been wary of us
having other similar races in there.

As for the powerpc bits, I'm probably the one who wrote them, however,
I'm on vacation this week and right now, no bandwidth to context switch
all that back in :-) So give me a few days and/or ping me next week.

The powerpc PCI code contains a lot of cruft coming from the depth of
history, including rather nasty assumptions. We want to progressively
clean it up, starting with EEH, but it will take time.

Cheers,
Ben.

> > Signed-off-by: Hari Vyas <hari.vyas@broadcom.com>
> > ---
> >  arch/powerpc/kernel/pci-common.c          |  4 +++-
> >  arch/powerpc/platforms/powernv/pci-ioda.c |  3 ++-
> >  arch/powerpc/platforms/pseries/setup.c    |  3 ++-
> >  drivers/pci/bus.c                         |  6 +++---
> >  drivers/pci/hotplug/acpiphp_glue.c        |  2 +-
> >  drivers/pci/pci.h                         | 11 +++++++++++
> >  drivers/pci/probe.c                       |  4 ++--
> >  drivers/pci/remove.c                      |  5 +++--
> >  include/linux/pci.h                       |  1 -
> >  9 files changed, 27 insertions(+), 12 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
> > index fe9733f..471aac3 100644
> > --- a/arch/powerpc/kernel/pci-common.c
> > +++ b/arch/powerpc/kernel/pci-common.c
> > @@ -42,6 +42,8 @@
> >  #include <asm/ppc-pci.h>
> >  #include <asm/eeh.h>
> >  
> > +#include "../../../drivers/pci/pci.h"
> 
> I see why you need it, but this include path is really ugly.  Outside
> of bootloaders and tools, there are very few instances of includes
> like this that reference a different top-level directory, and I'm not
> very keen about adding more.
> 
> Obviously powerpc is the only arch that needs dev->is_added.  It seems
> to be because "We can only call pcibios_setup_device() after bus setup
> is complete, since some of the platform specific DMA setup code
> depends on it."
> 
> I don't know powerpc, but it does raise the question in my mind of
> whether powerpc could be changed to do the DMA setup more like other
> arches do to remove this ordering dependency and the need to use
> dev->is_added.
> 
> That sounds like a lot of work, but it would have the benefit of
> unifying some code that is probably needlessly arch-specific.
> 
> >  /* hose_spinlock protects accesses to the the phb_bitmap. */
> >  static DEFINE_SPINLOCK(hose_spinlock);
> >  LIST_HEAD(hose_list);
> > @@ -1014,7 +1016,7 @@ void pcibios_setup_bus_devices(struct pci_bus *bus)
> >  		/* Cardbus can call us to add new devices to a bus, so ignore
> >  		 * those who are already fully discovered
> >  		 */
> > -		if (dev->is_added)
> > +		if (pci_dev_is_added(dev))
> >  			continue;
> >  
> >  		pcibios_setup_device(dev);
> > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> > index 5bd0eb6..70b2e1e 100644
> > --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> > @@ -46,6 +46,7 @@
> >  
> >  #include "powernv.h"
> >  #include "pci.h"
> > +#include "../../../../drivers/pci/pci.h"
> >  
> >  #define PNV_IODA1_M64_NUM	16	/* Number of M64 BARs	*/
> >  #define PNV_IODA1_M64_SEGS	8	/* Segments per M64 BAR	*/
> > @@ -3138,7 +3139,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
> >  	struct pci_dn *pdn;
> >  	int mul, total_vfs;
> >  
> > -	if (!pdev->is_physfn || pdev->is_added)
> > +	if (!pdev->is_physfn || pci_dev_is_added(pdev))
> >  		return;
> >  
> >  	pdn = pci_get_pdn(pdev);
> > diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
> > index 139f0af..8a4868a 100644
> > --- a/arch/powerpc/platforms/pseries/setup.c
> > +++ b/arch/powerpc/platforms/pseries/setup.c
> > @@ -71,6 +71,7 @@
> >  #include <asm/security_features.h>
> >  
> >  #include "pseries.h"
> > +#include "../../../../drivers/pci/pci.h"
> >  
> >  int CMO_PrPSP = -1;
> >  int CMO_SecPSP = -1;
> > @@ -664,7 +665,7 @@ static void pseries_pci_fixup_iov_resources(struct pci_dev *pdev)
> >  	const int *indexes;
> >  	struct device_node *dn = pci_device_to_OF_node(pdev);
> >  
> > -	if (!pdev->is_physfn || pdev->is_added)
> > +	if (!pdev->is_physfn || pci_dev_is_added(pdev))
> >  		return;
> >  	/*Firmware must support open sriov otherwise dont configure*/
> >  	indexes = of_get_property(dn, "ibm,open-sriov-vf-bar-info", NULL);
> > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> > index 35b7fc8..5cb40b2 100644
> > --- a/drivers/pci/bus.c
> > +++ b/drivers/pci/bus.c
> > @@ -330,7 +330,7 @@ void pci_bus_add_device(struct pci_dev *dev)
> >  		return;
> >  	}
> >  
> > -	dev->is_added = 1;
> > +	pci_dev_assign_added(dev, true);
> >  }
> >  EXPORT_SYMBOL_GPL(pci_bus_add_device);
> >  
> > @@ -347,14 +347,14 @@ void pci_bus_add_devices(const struct pci_bus *bus)
> >  
> >  	list_for_each_entry(dev, &bus->devices, bus_list) {
> >  		/* Skip already-added devices */
> > -		if (dev->is_added)
> > +		if (pci_dev_is_added(dev))
> >  			continue;
> >  		pci_bus_add_device(dev);
> >  	}
> >  
> >  	list_for_each_entry(dev, &bus->devices, bus_list) {
> >  		/* Skip if device attach failed */
> > -		if (!dev->is_added)
> > +		if (!pci_dev_is_added(dev))
> >  			continue;
> >  		child = dev->subordinate;
> >  		if (child)
> > diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
> > index 3a17b29..ef0b1b6 100644
> > --- a/drivers/pci/hotplug/acpiphp_glue.c
> > +++ b/drivers/pci/hotplug/acpiphp_glue.c
> > @@ -509,7 +509,7 @@ static void enable_slot(struct acpiphp_slot *slot)
> >  
> >  	list_for_each_entry(dev, &bus->devices, bus_list) {
> >  		/* Assume that newly added devices are powered on already. */
> > -		if (!dev->is_added)
> > +		if (!pci_dev_is_added(dev))
> >  			dev->current_state = PCI_D0;
> >  	}
> >  
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index 882f1f9..0881725 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -288,6 +288,7 @@ struct pci_sriov {
> >  
> >  /* pci_dev priv_flags */
> >  #define PCI_DEV_DISCONNECTED 0
> > +#define PCI_DEV_ADDED 1
> >  
> >  static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
> >  {
> > @@ -300,6 +301,16 @@ static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
> >  	return test_bit(PCI_DEV_DISCONNECTED, &dev->priv_flags);
> >  }
> >  
> > +static inline void pci_dev_assign_added(struct pci_dev *dev, bool added)
> > +{
> > +	assign_bit(PCI_DEV_ADDED, &dev->priv_flags, added);
> > +}
> > +
> > +static inline bool pci_dev_is_added(const struct pci_dev *dev)
> > +{
> > +	return test_bit(PCI_DEV_ADDED, &dev->priv_flags);
> > +}
> > +
> >  #ifdef CONFIG_PCI_ATS
> >  void pci_restore_ats_state(struct pci_dev *dev);
> >  #else
> > diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> > index ac876e3..611adcd 100644
> > --- a/drivers/pci/probe.c
> > +++ b/drivers/pci/probe.c
> > @@ -2433,13 +2433,13 @@ int pci_scan_slot(struct pci_bus *bus, int devfn)
> >  	dev = pci_scan_single_device(bus, devfn);
> >  	if (!dev)
> >  		return 0;
> > -	if (!dev->is_added)
> > +	if (!pci_dev_is_added(dev))
> >  		nr++;
> >  
> >  	for (fn = next_fn(bus, dev, 0); fn > 0; fn = next_fn(bus, dev, fn)) {
> >  		dev = pci_scan_single_device(bus, devfn + fn);
> >  		if (dev) {
> > -			if (!dev->is_added)
> > +			if (!pci_dev_is_added(dev))
> >  				nr++;
> >  			dev->multifunction = 1;
> >  		}
> > diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
> > index 6f072ea..5e3d0dc 100644
> > --- a/drivers/pci/remove.c
> > +++ b/drivers/pci/remove.c
> > @@ -19,11 +19,12 @@ static void pci_stop_dev(struct pci_dev *dev)
> >  {
> >  	pci_pme_active(dev, false);
> >  
> > -	if (dev->is_added) {
> > +	if (pci_dev_is_added(dev)) {
> >  		device_release_driver(&dev->dev);
> >  		pci_proc_detach_device(dev);
> >  		pci_remove_sysfs_dev_files(dev);
> > -		dev->is_added = 0;
> > +
> > +		pci_dev_assign_added(dev, false);
> >  	}
> >  
> >  	if (dev->bus->self)
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index 340029b..506125b 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -368,7 +368,6 @@ struct pci_dev {
> >  	unsigned int	transparent:1;		/* Subtractive decode bridge */
> >  	unsigned int	multifunction:1;	/* Multi-function device */
> >  
> > -	unsigned int	is_added:1;
> >  	unsigned int	is_busmaster:1;		/* Is busmaster */
> >  	unsigned int	no_msi:1;		/* May not use MSI */
> >  	unsigned int	no_64bit_msi:1; 	/* May only use 32-bit MSIs */
> > -- 
> > 1.9.1
> > 

^ permalink raw reply
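
The pci.h hunk quoted above replaces the is_added bitfield with a bit in
priv_flags, written through assign_bit() and read through test_bit(). A
minimal user-space sketch of that pattern follows, assuming C11 atomics as a
stand-in for the kernel bitops. struct fake_dev, flag_assign() and flag_test()
are hypothetical names used only for illustration, not kernel API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Bit numbers mirroring the quoted PCI_DEV_DISCONNECTED / PCI_DEV_ADDED. */
#define DEV_DISCONNECTED 0
#define DEV_ADDED        1

/* Hypothetical stand-in for struct pci_dev; only the flag word is modelled. */
struct fake_dev {
	atomic_ulong priv_flags;
};

/* Set or clear one bit atomically, leaving the other bits untouched. */
static inline void flag_assign(struct fake_dev *dev, unsigned int nr, bool value)
{
	unsigned long mask = 1UL << nr;

	if (value)
		atomic_fetch_or(&dev->priv_flags, mask);
	else
		atomic_fetch_and(&dev->priv_flags, ~mask);
}

/* Read one bit of the flag word. */
static inline bool flag_test(struct fake_dev *dev, unsigned int nr)
{
	return (atomic_load(&dev->priv_flags) >> nr) & 1UL;
}
```

Each flag gets its own bit number in a single word, so updating one flag is a
single atomic read-modify-write that does not rewrite its neighbours.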

* Re: [PATCH v3 1/1] KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space
From: David Gibson @ 2018-07-19  5:28 UTC (permalink / raw)
  To: Sam Bobroff; +Cc: linuxppc-dev, kvm, kvm-ppc, paulus, clg
In-Reply-To: <1fb3aea5f44f1029866ee10db40abde7e18b24ad.1531967105.git.sbobroff@linux.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 10410 bytes --]

On Thu, Jul 19, 2018 at 12:25:10PM +1000, Sam Bobroff wrote:
> From: Sam Bobroff <sam.bobroff@au1.ibm.com>
> 
> It is not currently possible to create the full number of possible
> VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses fewer
> threads per core than its core stride (or "VSMT mode"). This is
> because the VCORE ID and XIVE offsets grow beyond KVM_MAX_VCPUS
> even though the VCPU ID is less than KVM_MAX_VCPU_ID.
> 
> To address this, "pack" the VCORE ID and XIVE offsets by using
> knowledge of the way the VCPU IDs will be used when there are fewer
> guest threads per core than the core stride. The primary thread of
> each core will always be used first. Then, if the guest uses more than
> one thread per core, these secondary threads will sequentially follow
> the primary in each core.
> 
> So, the only way an ID above KVM_MAX_VCPUS can be seen is if the
> VCPUs are being spaced apart, so at least half of each core is empty
> and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
> into the second half of each core (4..7, in an 8-thread core).
> 
> Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
> each core is being left empty, and we can map down into the second and
> third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
> 
> Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
> threads are being used and 7/8 of the core is empty, allowing use of
> the 1, 3, 5 and 7 thread slots.
> 
> (Strides less than 8 are handled similarly.)
> 
> This allows the VCORE ID or offset to be calculated quickly from the
> VCPU ID or XIVE server numbers, without access to the VCPU structure.
> 
> Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
> Hello everyone,
> 
> I've completed a trial merge with the guest native-XIVE code and found no
> problems; it's no more difficult than the host side and only requires a few
> calls to xive_vp().
> 
> On that basis, here is v3 (unchanged from v2) as non-RFC and it seems to be
> ready to go.
> 
> Patch set v3:
> Patch 1/1: KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space
> 
> Patch set v2:
> Patch 1/1: KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space
> * Corrected places in kvm/book3s_xive.c where IDs weren't packed.
> * Because kvmppc_pack_vcpu_id() is only called on P9, there is no need to test "emul_smt_mode > 1", so remove it.
> * Re-ordered block_offsets[] to be more ascending.
> * Added more detailed description of the packing algorithm.
> 
> Patch set v1:
> Patch 1/1: KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space
> 
>  arch/powerpc/include/asm/kvm_book3s.h | 44 +++++++++++++++++++++++++++++++++++
>  arch/powerpc/kvm/book3s_hv.c          | 14 +++++++----
>  arch/powerpc/kvm/book3s_xive.c        | 19 +++++++++------
>  3 files changed, 66 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
> index 1f345a0b6ba2..ba4b6e00fca7 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -390,4 +390,48 @@ extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
>  #define SPLIT_HACK_MASK			0xff000000
>  #define SPLIT_HACK_OFFS			0xfb000000
>  
> +/* Pack a VCPU ID from the [0..KVM_MAX_VCPU_ID) space down to the
> + * [0..KVM_MAX_VCPUS) space, while using knowledge of the guest's core stride
> + * (but not its actual threading mode, which is not available) to avoid
> + * collisions.
> + *
> + * The implementation leaves VCPU IDs from the range [0..KVM_MAX_VCPUS) (block
> + * 0) unchanged: if the guest is filling each VCORE completely then it will be
> + * using consecutive IDs and it will fill the space without any packing.
> + *
> + * For higher VCPU IDs, the packed ID is based on the VCPU ID modulo
> + * KVM_MAX_VCPUS (effectively masking off the top bits) and then an offset is
> + * added to avoid collisions.
> + *
> + * VCPU IDs in the range [KVM_MAX_VCPUS..(KVM_MAX_VCPUS*2)) (block 1) are only
> + * possible if the guest is leaving at least 1/2 of each VCORE empty, so IDs
> + * can be safely packed into the second half of each VCORE by adding an offset
> + * of (stride / 2).
> + *
> + * Similarly, if VCPU IDs in the range [(KVM_MAX_VCPUS*2)..(KVM_MAX_VCPUS*4))
> + * (blocks 2 and 3) are seen, the guest must be leaving at least 3/4 of each
> + * VCORE empty so packed IDs can be offset by (stride / 4) and (stride * 3 / 4).
> + *
> + * Finally, VCPU IDs from blocks 4..7 will only be seen if the guest is using a
> + * stride of 8 and 1 thread per core so the remaining offsets of 1, 3, 5 and 7
> + * must be free to use.
> + *
> + * (The offsets for each block are stored in block_offsets[], indexed by the
> + * block number if the stride is 8. For cases where the guest's stride is less
> + * than 8, we can re-use the block_offsets array by multiplying the block
> + * number by (MAX_SMT_THREADS / stride) to reach the correct entry.)
> + */
> +static inline u32 kvmppc_pack_vcpu_id(struct kvm *kvm, u32 id)
> +{
> +	const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 3, 5, 7};
> +	int stride = kvm->arch.emul_smt_mode;
> +	int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);
> +	u32 packed_id;
> +
> +	BUG_ON(block >= MAX_SMT_THREADS);
> +	packed_id = (id % KVM_MAX_VCPUS) + block_offsets[block];
> +	BUG_ON(packed_id >= KVM_MAX_VCPUS);
> +	return packed_id;
> +}
> +
>  #endif /* __ASM_KVM_BOOK3S_H__ */
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index de686b340f4a..363c2fb0d89e 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -1816,7 +1816,7 @@ static int threads_per_vcore(struct kvm *kvm)
>  	return threads_per_subcore;
>  }
>  
> -static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> +static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int id)
>  {
>  	struct kvmppc_vcore *vcore;
>  
> @@ -1830,7 +1830,7 @@ static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
>  	init_swait_queue_head(&vcore->wq);
>  	vcore->preempt_tb = TB_NIL;
>  	vcore->lpcr = kvm->arch.lpcr;
> -	vcore->first_vcpuid = core * kvm->arch.smt_mode;
> +	vcore->first_vcpuid = id;
>  	vcore->kvm = kvm;
>  	INIT_LIST_HEAD(&vcore->preempt_list);
>  
> @@ -2048,12 +2048,18 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
>  	mutex_lock(&kvm->lock);
>  	vcore = NULL;
>  	err = -EINVAL;
> -	core = id / kvm->arch.smt_mode;
> +	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
> +		BUG_ON(kvm->arch.smt_mode != 1);
> +		core = kvmppc_pack_vcpu_id(kvm, id);
> +	} else {
> +		core = id / kvm->arch.smt_mode;
> +	}
>  	if (core < KVM_MAX_VCORES) {
>  		vcore = kvm->arch.vcores[core];
> +		BUG_ON(cpu_has_feature(CPU_FTR_ARCH_300) && vcore);
>  		if (!vcore) {
>  			err = -ENOMEM;
> -			vcore = kvmppc_vcore_create(kvm, core);
> +			vcore = kvmppc_vcore_create(kvm, id & ~(kvm->arch.smt_mode - 1));
>  			kvm->arch.vcores[core] = vcore;
>  			kvm->arch.online_vcores++;
>  		}
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index f9818d7d3381..dbd5887daf4a 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -317,6 +317,11 @@ static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
>  	return -EBUSY;
>  }
>  
> +static u32 xive_vp(struct kvmppc_xive *xive, u32 server)
> +{
> +	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
> +}
> +
>  static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
>  			     struct kvmppc_xive_src_block *sb,
>  			     struct kvmppc_xive_irq_state *state)
> @@ -362,7 +367,7 @@ static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
>  	 */
>  	if (xd->flags & OPAL_XIVE_IRQ_MASK_VIA_FW) {
>  		xive_native_configure_irq(hw_num,
> -					  xive->vp_base + state->act_server,
> +					  xive_vp(xive, state->act_server),
>  					  MASKED, state->number);
>  		/* set old_p so we can track if an H_EOI was done */
>  		state->old_p = true;
> @@ -418,7 +423,7 @@ static void xive_finish_unmask(struct kvmppc_xive *xive,
>  	 */
>  	if (xd->flags & OPAL_XIVE_IRQ_MASK_VIA_FW) {
>  		xive_native_configure_irq(hw_num,
> -					  xive->vp_base + state->act_server,
> +					  xive_vp(xive, state->act_server),
>  					  state->act_priority, state->number);
>  		/* If an EOI is needed, do it here */
>  		if (!state->old_p)
> @@ -495,7 +500,7 @@ static int xive_target_interrupt(struct kvm *kvm,
>  	kvmppc_xive_select_irq(state, &hw_num, NULL);
>  
>  	return xive_native_configure_irq(hw_num,
> -					 xive->vp_base + server,
> +					 xive_vp(xive, server),
>  					 prio, state->number);
>  }
>  
> @@ -883,7 +888,7 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long guest_irq,
>  	 * which is fine for a never started interrupt.
>  	 */
>  	xive_native_configure_irq(hw_irq,
> -				  xive->vp_base + state->act_server,
> +				  xive_vp(xive, state->act_server),
>  				  state->act_priority, state->number);
>  
>  	/*
> @@ -959,7 +964,7 @@ int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq,
>  
>  	/* Reconfigure the IPI */
>  	xive_native_configure_irq(state->ipi_number,
> -				  xive->vp_base + state->act_server,
> +				  xive_vp(xive, state->act_server),
>  				  state->act_priority, state->number);
>  
>  	/*
> @@ -1084,7 +1089,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
>  		pr_devel("Duplicate !\n");
>  		return -EEXIST;
>  	}
> -	if (cpu >= KVM_MAX_VCPUS) {
> +	if (cpu >= KVM_MAX_VCPU_ID) {
>  		pr_devel("Out of bounds !\n");
>  		return -EINVAL;
>  	}
> @@ -1098,7 +1103,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
>  	xc->xive = xive;
>  	xc->vcpu = vcpu;
>  	xc->server_num = cpu;
> -	xc->vp_id = xive->vp_base + cpu;
> +	xc->vp_id = xive_vp(xive, cpu);
>  	xc->mfrr = 0xff;
>  	xc->valid = true;
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply
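
The packing rule described in the quoted kvm_book3s.h comment can be tried out
in user space. The sketch below is a rough transcription of
kvmppc_pack_vcpu_id(), assuming KVM_MAX_VCPUS is 2048 and MAX_SMT_THREADS is 8
(both values are assumptions for illustration) and taking the stride as a
plain parameter instead of reading kvm->arch.emul_smt_mode. The SKETCH_*
names, pack_vcpu_id() and stride8_packing_is_unique() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration; the real ones come from kernel headers. */
#define SKETCH_MAX_VCPUS   2048u
#define SKETCH_MAX_THREADS 8u

/* Same table as in the quoted patch, indexed by block number at stride 8. */
static const unsigned int block_offsets[SKETCH_MAX_THREADS] = {0, 4, 2, 6, 1, 3, 5, 7};

/* Map a VCPU ID down into [0..SKETCH_MAX_VCPUS) using the block's offset. */
static uint32_t pack_vcpu_id(unsigned int stride, uint32_t id)
{
	unsigned int block = (id / SKETCH_MAX_VCPUS) * (SKETCH_MAX_THREADS / stride);
	uint32_t packed;

	assert(block < SKETCH_MAX_THREADS);	/* stands in for BUG_ON() */
	packed = (id % SKETCH_MAX_VCPUS) + block_offsets[block];
	assert(packed < SKETCH_MAX_VCPUS);	/* stands in for BUG_ON() */
	return packed;
}

/*
 * Returns 1 if a guest using one thread per core at stride 8 (IDs 0, 8,
 * 16, ...) packs SKETCH_MAX_VCPUS VCPUs without any two sharing a slot.
 */
static int stride8_packing_is_unique(void)
{
	static unsigned char seen[SKETCH_MAX_VCPUS];
	uint32_t n;

	for (n = 0; n < SKETCH_MAX_VCPUS; n++)
		seen[n] = 0;
	for (n = 0; n < SKETCH_MAX_VCPUS; n++) {
		uint32_t packed = pack_vcpu_id(8, n * 8);

		if (seen[packed])
			return 0;
		seen[packed] = 1;
	}
	return 1;
}
```

Only stride 8 is exercised by the uniqueness check. Working the quoted table
by hand for stride 4 gives the same packed value, 5, for IDs 4100 and 6144, so
the entry order for smaller strides may deserve a closer look.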

* Re: powerpc/Makefile: Assemble with -me500 when building for E500
From: Michael Ellerman @ 2018-07-19  6:06 UTC (permalink / raw)
  To: James Clarke, linuxppc-dev; +Cc: James Clarke
In-Reply-To: <20180712214149.19587-1-jrtc27@jrtc27.com>

On Thu, 2018-07-12 at 21:41:49 UTC, James Clarke wrote:
> Some of the assembly files use instructions specific to BookE or E500,
> which are rejected with the now-default -mcpu=powerpc, so we must pass
> -me500 to the assembler just as we pass -me200 for E200.
> 
> Fixes: 4bf4f42a2feb ("powerpc/kbuild: Set default generic machine type for 32-bit compile")
> Signed-off-by: James Clarke <jrtc27@jrtc27.com>

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/4e4a4b75ccce827f9b8b04b6ee5c90

cheers

^ permalink raw reply

* Re: powerpc/xmon: Fix disassembly since printf changes
From: Michael Ellerman @ 2018-07-19  6:06 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: malat
In-Reply-To: <20180717005038.32352-1-mpe@ellerman.id.au>

On Tue, 2018-07-17 at 00:50:38 UTC, Michael Ellerman wrote:
> The recent change to add printf annotations to xmon inadvertently made
> the disassembly output ugly, eg:
> 
>   c00000002001e058  7ee00026      mfcr    r23
>   c00000002001e05c  fffffffffae101a0      std     r23,416(r1)
>   c00000002001e060  fffffffff8230000      std     r1,0(r3)
> 
> The problem being that negative 32-bit values are being displayed in
> full 64-bits.
> 
> The printf conversion was actually correct, we are passing unsigned
> long so it should use "lx". But powerpc instructions are only 4 bytes
> and the code only reads 4 bytes, so inst should really just be
> unsigned int, and that also fixes the printing to look the way we
> want:
> 
>   c00000002001e058  7ee00026      mfcr    r23
>   c00000002001e05c  fae101a0      std     r23,416(r1)
>   c00000002001e060  f8230000      std     r1,0(r3)
> 
> Fixes: e70d8f55268b ("powerpc/xmon: Add __printf annotation to xmon_printf()")
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

Applied to powerpc fixes.

https://git.kernel.org/powerpc/c/941d810725ad48cc21948f4cff8cf7

cheers

^ permalink raw reply
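
The effect described in the xmon message can be reproduced outside the kernel.
The message does not show how the instruction word ends up sign-extended, so
the sketch below simply assumes it passes through a signed 32-bit value before
landing in a 64-bit variable; uint64_t stands in for the 64-bit kernel
unsigned long. fmt_inst_widened() and fmt_inst_narrow() are hypothetical
helpers, not xmon functions.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Print the word after widening it through a signed 32-bit value, which
 * sign-extends any instruction whose top bit is set.
 */
static void fmt_inst_widened(uint32_t raw, char *buf, size_t len)
{
	/* Reinterpretation as two's complement, as on common compilers. */
	int32_t as_signed = (int32_t)raw;
	uint64_t inst = (uint64_t)(int64_t)as_signed;

	snprintf(buf, len, "%" PRIx64, inst);
}

/* Print the word from a 32-bit unsigned variable: no extension can occur. */
static void fmt_inst_narrow(uint32_t raw, char *buf, size_t len)
{
	unsigned int inst = raw;

	snprintf(buf, len, "%x", inst);
}
```

With 0xfae101a0 (the std r23,416(r1) word from the message) the widened form
prints fffffffffae101a0 and the narrow form prints fae101a0, while 0x7ee00026
prints the same either way because its top bit is clear.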

* Re: [kernel, v7, 1/2] vfio/spapr: Use IOMMU pageshift rather than pagesize
From: Michael Ellerman @ 2018-07-19  6:06 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev
  Cc: Alexey Kardashevskiy, Nicholas Piggin, kvm-ppc, Alex Williamson,
	Aneesh Kumar K.V, David Gibson
In-Reply-To: <20180717071913.2167-2-aik@ozlabs.ru>

On Tue, 2018-07-17 at 07:19:12 UTC, Alexey Kardashevskiy wrote:
> The size is always equal to 1 page so let's use this. Later on this will
> be used for other checks which use page shifts to check the granularity
> of access.
> 
> This should cause no behavioral change.
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Acked-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/1463edca6734d42ab4406fa2896e20

cheers

^ permalink raw reply

* Re: [kernel, v7, 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
From: Michael Ellerman @ 2018-07-19  6:06 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev
  Cc: Alexey Kardashevskiy, Nicholas Piggin, kvm-ppc, Alex Williamson,
	Aneesh Kumar K.V, David Gibson
In-Reply-To: <20180717071913.2167-3-aik@ozlabs.ru>

On Tue, 2018-07-17 at 07:19:13 UTC, Alexey Kardashevskiy wrote:
> A VM which has:
>  - a DMA capable device passed through to it (eg. network card);
>  - running a malicious kernel that ignores H_PUT_TCE failure;
>  - capability of using IOMMU pages bigger than physical pages
> can create an IOMMU mapping that exposes (for example) 16MB of
> the host physical memory to the device when only 64K was allocated to the VM.
> 
> The remaining 16MB - 64K will be some other content of host memory, possibly
> including pages of the VM, but also pages of host kernel memory, host
> programs or other VMs.
> 
> The attacking VM does not control the location of the page it can map,
> and is only allowed to map as many pages as it has pages of RAM.
> 
> We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> an IOMMU page is contained in the physical page so the PCI hardware won't
> get access to unassigned host memory; however this check is missing in
> the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and
> did not hit this yet as the very first time when the mapping happens
> we do not have tbl::it_userspace allocated yet and fall back to
> the userspace which in turn calls VFIO IOMMU driver, this fails and
> the guest does not retry.
> 
> This stores the smallest preregistered page size in the preregistered
> region descriptor and changes the mm_iommu_xxx API to check this against
> the IOMMU page size.
> 
> This calculates maximum page size as a minimum of the natural region
> alignment and compound page size. For the page shift this uses the shift
> returned by find_linux_pte() which indicates how the page is mapped to
> the current userspace - if the page is huge and this is not zero, then
> it is a leaf pte and the page is mapped within the range.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/76fa4975f3ed12d15762bc979ca440

cheers

^ permalink raw reply
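
The containment rule in the message above comes down to comparing page shifts.
The sketch below is only an illustration of that arithmetic, not the
mm_iommu_xxx API: it assumes the preregistered region records the smallest
page shift among its pinned pages and that an IOMMU page is acceptable only if
its shift does not exceed that value. All function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest page shift among the pinned pages of a region (n must be >= 1). */
static unsigned int region_min_pageshift(const unsigned int *shifts, size_t n)
{
	unsigned int min = shifts[0];

	for (size_t i = 1; i < n; i++)
		if (shifts[i] < min)
			min = shifts[i];
	return min;
}

/* An IOMMU page fits only if it is no larger than the smallest pinned page. */
static bool iommu_page_is_contained(unsigned int iommu_pageshift,
				    unsigned int region_pageshift)
{
	return iommu_pageshift <= region_pageshift;
}

/* Bytes beyond the pinned page that an oversized IOMMU page would cover. */
static uint64_t bytes_exposed(unsigned int iommu_pageshift,
			      unsigned int region_pageshift)
{
	if (iommu_page_is_contained(iommu_pageshift, region_pageshift))
		return 0;
	return (1ULL << iommu_pageshift) - (1ULL << region_pageshift);
}
```

For the case in the message, a 16MB IOMMU page (shift 24) over a 64K pinned
page (shift 16) fails the check and would cover 16MB - 64K of memory that was
never given to the VM.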

* Re: [v2] powerpc/powernv : Save/Restore SPRG3 on entry/exit from stop.
From: Michael Ellerman @ 2018-07-19  6:06 UTC (permalink / raw)
  To: Gautham R. Shenoy, Benjamin Herrenschmidt, Michael Neuling,
	Vaidyanathan Srinivasan
  Cc: Florian Weimer, Gautham R. Shenoy, linux-kernel, stable,
	Oleg Nesterov, linuxppc-dev
In-Reply-To: <1531902796-32294-1-git-send-email-ego@linux.vnet.ibm.com>

On Wed, 2018-07-18 at 08:33:16 UTC, "Gautham R. Shenoy" wrote:
> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> 
> On 64-bit servers, SPRN_SPRG3 and its userspace read-only mirror
> SPRN_USPRG3 are used as userspace VDSO write and read registers
> respectively.
> 
> SPRN_SPRG3 is lost when we enter stop4 and above, and is currently not
> restored.  As a result, any read from SPRN_USPRG3 returns zero on an
> exit from stop4 and above.
> 
> Thus in this situation, on POWER9, any call to sched_getcpu() always
> returns zero, as on powerpc, we call __kernel_getcpu() which relies
> upon SPRN_USPRG3 to report the CPU and NUMA node information.
> 
> Fix this by restoring SPRN_SPRG3 on wake up from a deep stop state
> with the sprg_vdso value that is cached in PACA.
> 
> Fixes: e1c1cfed5432 ("powerpc/powernv: Save/Restore additional SPRs
> for stop4 cpuidle")
> 
> Reported-by: Florian Weimer <fweimer@redhat.com>
> Cc: <stable@vger.kernel.org> # 4.14
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/b03897cf318dfc47de33a7ecbc7655

cheers

^ permalink raw reply

* Re: cpufreq: powernv: Remove global pstate ramp-down timer in POWER9
From: Michael Ellerman @ 2018-07-19  6:07 UTC (permalink / raw)
  To: Shilpasri G Bhat, rjw, viresh.kumar
  Cc: linux-pm, linux-kernel, Shilpasri G Bhat, linuxppc-dev
In-Reply-To: <1524636895-27185-1-git-send-email-shilpa.bhat@linux.vnet.ibm.com>

On Wed, 2018-04-25 at 06:14:55 UTC, Shilpasri G Bhat wrote:
> POWER9 does not support global pstate requests for the chip. So remove
> the timer logic which slowly ramps down the global pstate in P9
> platforms.
> 
> Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/dcb14337e0f2adb227c376e6327ef0

cheers

^ permalink raw reply

* Re: powerpc/64s: Report SLB multi-hit rather than parity error
From: Michael Ellerman @ 2018-07-19  6:07 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: npiggin
In-Reply-To: <20180613132414.32207-1-mpe@ellerman.id.au>

On Wed, 2018-06-13 at 13:24:14 UTC, Michael Ellerman wrote:
> When we take an SLB multi-hit on bare metal, we see both the multi-hit
> and parity error bits set in DSISR. The user manuals indicate this is
> expected to always happen on Power8, whereas on Power9 they say a
> multi-hit will "usually" also cause a parity error.
> 
> We decide what to do based on the various error tables in mce_power.c,
> and because we process them in order and only report the first, we
> currently always report a parity error but not the multi-hit, eg:
> 
>   Severe Machine check interrupt [Recovered]
>     Initiator: CPU
>     Error type: SLB [Parity]
>       Effective address: c000000ffffd4300
> 
> Although this is correct, it leaves the user wondering why they got a
> parity error. It would be clearer instead if we reported the
> multi-hit because that is more likely to be simply a software bug,
> whereas a true parity error is possibly an indication of a bad core.
> 
> We can do that simply by reordering the error tables so that multi-hit
> appears before parity. That doesn't affect the error recovery at all,
> because we flush the SLB either way.
> 
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

Applied to powerpc next.

https://git.kernel.org/powerpc/c/54dbcfc211f15586c57d27492f938e

cheers

^ permalink raw reply
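
The reordering argument in the message above relies on the tables being
scanned in order with only the first match reported. A small sketch of that
behaviour follows. The bit values, the struct layout and the table contents
are made up for illustration and are not the definitions in mce_power.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bit assignments; the real DSISR bits differ. */
#define SKETCH_DSISR_SLB_PARITY   0x1u
#define SKETCH_DSISR_SLB_MULTIHIT 0x2u

struct err_entry {
	unsigned int mask;
	const char *type;
};

/* Old ordering: parity is listed before multi-hit. */
static const struct err_entry parity_first[] = {
	{ SKETCH_DSISR_SLB_PARITY,   "SLB [Parity]" },
	{ SKETCH_DSISR_SLB_MULTIHIT, "SLB [Multihit]" },
};

/* New ordering: multi-hit is listed before parity. */
static const struct err_entry multihit_first[] = {
	{ SKETCH_DSISR_SLB_MULTIHIT, "SLB [Multihit]" },
	{ SKETCH_DSISR_SLB_PARITY,   "SLB [Parity]" },
};

/* Walk the table in order and report only the first entry that matches. */
static const char *first_reported(const struct err_entry *table, size_t n,
				  unsigned int dsisr)
{
	for (size_t i = 0; i < n; i++)
		if (dsisr & table[i].mask)
			return table[i].type;
	return "Unknown";
}
```

When both bits are set, the first table reports a parity error and the second
reports a multi-hit; a register value with only the parity bit set is reported
as a parity error by both.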

* Re: powerpc: enable kernel XZ compression option on BOOK3S_32
From: Michael Ellerman @ 2018-07-19  6:07 UTC (permalink / raw)
  To: Aaro Koskinen, Benjamin Herrenschmidt, Paul Mackerras,
	linuxppc-dev
  Cc: Aaro Koskinen
In-Reply-To: <20180619205230.19290-1-aaro.koskinen@iki.fi>

On Tue, 2018-06-19 at 20:52:30 UTC, Aaro Koskinen wrote:
> Enable kernel XZ compression option on BOOK3S_32. Tested on G4 PowerBook.
> 
> Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/26064848efbca49c643d1237dc1f82

cheers

^ permalink raw reply

* Re: [RESEND, 1/3] powerpc: dts: use 'atmel' as at24 manufacturer for pdm360ng
From: Michael Ellerman @ 2018-07-19  6:07 UTC (permalink / raw)
  To: Bartosz Golaszewski, Rob Herring, Mark Rutland,
	Benjamin Herrenschmidt, Paul Mackerras
  Cc: devicetree, Bartosz Golaszewski, linuxppc-dev, linux-kernel
In-Reply-To: <20180621083305.5322-1-brgl@bgdev.pl>

On Thu, 2018-06-21 at 08:33:03 UTC, Bartosz Golaszewski wrote:
> Using 'at' as the <manufacturer> part of the compatible string is now
> deprecated. Use a correct string: 'atmel,<model>'.
> 
> Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/835b706bab95141e2663b046f2fc59

cheers

^ permalink raw reply
