LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: G5 Quad working with Linux PPC and RadeonHD
From: Luigi Burdo @ 2014-05-17  8:28 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <5376DD44.4040402@daenzer.net>

Hi Michel D=C3=A4nzer thanks for your explanation about .
About the first post

https://bugs.freedesktop.org/show_bug.cgi?id=3D72877

Christian (Xeno74) had been success to have the right mesa/gl colors on
X1000 machine he make a patch about =2C i tested his package on QuadG5
but no effect probably something is different in kernel/hardware (?!?)

about the second
on X1000 r600 boards dont gave decoration problems
there is probably needed some other trick about
for the other architecture.

https://bugs.freedesktop.org/show_bug.cgi?id=3D74939


In any way thanks alot for the infos and for the works
the PPC linux staff is doing :)

Luigi

Il 17/05/14 05:53=2C Michel D=C3=A4nzer ha scritto:
> On 16.05.2014 19:31=2C Luigi Burdo wrote:
>> Hi all i start one week ago to write massage in ubuntu about
>> my object i have  Radeon Hd working on my Quad G5 thanks to linuxPPC .
>>
>> But there are some things to fix
>> One RadeonHD mesa colors are wrong .
> See https://bugs.freedesktop.org/show_bug.cgi?id=3D72877 . No progress ye=
t
> unfortunately.
>
>
>> On RadeonHD 4650 ddr3 512mb i have full video acceleration and hdmi
>> audio working
>> On RadeonHD 6570 ddr3 2048mb i have video but no chars on windows and
>> icons .
> The latter sounds like
> https://bugs.freedesktop.org/show_bug.cgi?id=3D74939 =2C which is fixed i=
n
> http://cgit.freedesktop.org/xorg/driver/xf86-video-ati/commit/?id=3D8da17=
f30c70f4494ce22ad781a1cee17041812f3
> .
>
>

^ permalink raw reply

* [PATCH] powerpc: Clear ELF personality flag if ELFv2 is not requested.
From: Jeff Bailey @ 2014-05-17 15:05 UTC (permalink / raw)
  To: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 1258 bytes --]

powerpc: Clear ELF personality flag if ELFv2 is not requested.

The POWER kernel uses a personality flag to determine whether it should
be setting up function descriptors or not (per the updated ABI).  This
flag wasn't being cleared on a new process but instead was being
inherited.  The visible effect was that an ELFv2 binary could not execve
to an ELFv1 binary.

Signed-off-by: Jeff Bailey <jeffbailey@google.com>

 arch/powerpc/include/asm/elf.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
index 935b5e7..888d8f3 100644
--- a/arch/powerpc/include/asm/elf.h
+++ b/arch/powerpc/include/asm/elf.h
@@ -90,6 +90,8 @@ typedef elf_vrregset_t elf_fpxregset_t;
 do {                                                           \
        if (((ex).e_flags & 0x3) == 2)                          \
                set_thread_flag(TIF_ELF2ABI);                   \
+       else                                                    \
+               clear_thread_flag(TIF_ELF2ABI);                 \
        if ((ex).e_ident[EI_CLASS] == ELFCLASS32)               \
                set_thread_flag(TIF_32BIT);                     \
        else                                                    \

[-- Attachment #2: Type: text/html, Size: 1790 bytes --]

^ permalink raw reply related

* [PATCH] macintosh: windfarm_pm121.c: Fix for possible null pointer dereference
From: Rickard Strandqvist @ 2014-05-17 17:20 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev; +Cc: linux-kernel, Rickard Strandqvist

There is otherwise a risk of a possible null pointer dereference.

Was largely found by using a static code analysis program called cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
---
 drivers/macintosh/windfarm_pm121.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/macintosh/windfarm_pm121.c b/drivers/macintosh/windfarm_pm121.c
index 7fe58b0..b350fb8 100644
--- a/drivers/macintosh/windfarm_pm121.c
+++ b/drivers/macintosh/windfarm_pm121.c
@@ -555,8 +555,18 @@ static void pm121_create_sys_fans(int loop_id)
 	pid_param.interval	= PM121_SYS_INTERVAL;
 	pid_param.history_len	= PM121_SYS_HISTORY_SIZE;
 	pid_param.itarget	= param->itarget;
-	pid_param.min		= control->ops->get_min(control);
-	pid_param.max		= control->ops->get_max(control);
+	if(control)
+	{
+		pid_param.min		= control->ops->get_min(control);
+		pid_param.max		= control->ops->get_max(control);
+	} else {
+		/*
+		 * This is probably not the right!?
+		 * Perhaps goto fail  if control == NULL  above?
+		 */
+		pid_param.min		= 0;
+		pid_param.max		= 0;
+	}
 
 	wf_pid_init(&pm121_sys_state[loop_id]->pid, &pid_param);
 
@@ -571,7 +581,7 @@ static void pm121_create_sys_fans(int loop_id)
 	   control the same control */
 	printk(KERN_WARNING "pm121: failed to set up %s loop "
 	       "setting \"%s\" to max speed.\n",
-	       loop_names[loop_id], control->name);
+	       loop_names[loop_id], control ? control->name : "uninitialized value");
 
 	if (control)
 		wf_control_set_max(control);
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH] powerpc/powernv: hwmon driver for power values, fan rpm and temperature
From: Neelesh Gupta @ 2014-05-18 18:08 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: linuxppc-dev, sbhat, jdelvare, lm-sensors
In-Reply-To: <20140514170933.GB18032@roeck-us.net>

On 05/14/2014 10:39 PM, Guenter Roeck wrote:
> On Wed, May 14, 2014 at 11:31:53AM +0530, Neelesh Gupta wrote:
>> This patch adds basic kernel enablement for reading power values, fan
>> speed rpm and temperature values on powernv platforms which will
>> be exported to user space through sysfs interface.
>>
>> Test results:
>> -------------
>> [root@tul163p1 ~]# sensors
>> ibmpowernv-isa-0000
>> Adapter: ISA adapter
>> fan1:        5294 RPM  (min =    0 RPM)
>> fan2:        4945 RPM  (min =    0 RPM)
>> fan3:        5831 RPM  (min =    0 RPM)
>> fan4:        5212 RPM  (min =    0 RPM)
>> fan5:           0 RPM  (min =    0 RPM)
>> fan6:           0 RPM  (min =    0 RPM)
>> fan7:        7472 RPM  (min =    0 RPM)
>> fan8:        7920 RPM  (min =    0 RPM)
>> temp1:        +39.0°C  (high =  +0.0°C)
>> power1:      192.00 W
>>
>> [root@tul163p1 ~]#
>> [root@tul163p1 ~]# ls /sys/devices/platform/ibmpowernv.0/
>> driver      fan2_min    fan4_min    fan6_min    fan8_min   modalias      uevent
>> fan1_fault  fan3_fault  fan5_fault  fan7_fault  hwmon      name
>> fan1_input  fan3_input  fan5_input  fan7_input  in1_fault  power1_input
>> fan1_min    fan3_min    fan5_min    fan7_min    in2_fault  subsystem
>> fan2_fault  fan4_fault  fan6_fault  fan8_fault  in3_fault  temp1_input
>> fan2_input  fan4_input  fan6_input  fan8_input  in4_fault  temp1_max
>> [root@tul163p1 ~]#
>> [root@tul163p1 ~]# ls /sys/class/hwmon/hwmon0/device/
>> driver      fan2_min    fan4_min    fan6_min    fan8_min   modalias      uevent
>> fan1_fault  fan3_fault  fan5_fault  fan7_fault  hwmon      name
>> fan1_input  fan3_input  fan5_input  fan7_input  in1_fault  power1_input
>> fan1_min    fan3_min    fan5_min    fan7_min    in2_fault  subsystem
>> fan2_fault  fan4_fault  fan6_fault  fan8_fault  in3_fault  temp1_input
>> fan2_input  fan4_input  fan6_input  fan8_input  in4_fault  temp1_max
>> [root@tul163p1 ~]#
>>
>> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.vnet.ibm.com>
>> Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
>> ---
>>   drivers/hwmon/Kconfig      |    8 +
>>   drivers/hwmon/Makefile     |    1
>>   drivers/hwmon/ibmpowernv.c |  386 ++++++++++++++++++++++++++++++++++++++++++++
>>   3 files changed, 395 insertions(+)
>>   create mode 100644 drivers/hwmon/ibmpowernv.c
>>
>> diff --git a/drivers/hwmon/Kconfig b/drivers/hwmon/Kconfig
>> index bc196f4..3e308fa 100644
>> --- a/drivers/hwmon/Kconfig
>> +++ b/drivers/hwmon/Kconfig
>> @@ -554,6 +554,14 @@ config SENSORS_IBMPEX
>>   	  This driver can also be built as a module.  If so, the module
>>   	  will be called ibmpex.
>>   
>> +config SENSORS_IBMPOWERNV
>> +	tristate "IBM POWERNV platform sensors"
>> +	depends on PPC_POWERNV
>> +	default y
>> +	help
>> +	  If you say yes here you get support for the temperature/fan/power
>> +	  sensors on your platform.
>> +
>>   config SENSORS_IIO_HWMON
>>   	tristate "Hwmon driver that uses channels specified via iio maps"
>>   	depends on IIO
>> diff --git a/drivers/hwmon/Makefile b/drivers/hwmon/Makefile
>> index c48f987..199c401 100644
>> --- a/drivers/hwmon/Makefile
>> +++ b/drivers/hwmon/Makefile
>> @@ -71,6 +71,7 @@ obj-$(CONFIG_SENSORS_ULTRA45)	+= ultra45_env.o
>>   obj-$(CONFIG_SENSORS_I5K_AMB)	+= i5k_amb.o
>>   obj-$(CONFIG_SENSORS_IBMAEM)	+= ibmaem.o
>>   obj-$(CONFIG_SENSORS_IBMPEX)	+= ibmpex.o
>> +obj-$(CONFIG_SENSORS_IBMPOWERNV)+= ibmpowernv.o
>>   obj-$(CONFIG_SENSORS_IIO_HWMON) += iio_hwmon.o
>>   obj-$(CONFIG_SENSORS_INA209)	+= ina209.o
>>   obj-$(CONFIG_SENSORS_INA2XX)	+= ina2xx.o
>> diff --git a/drivers/hwmon/ibmpowernv.c b/drivers/hwmon/ibmpowernv.c
>> new file mode 100644
>> index 0000000..e5cffce
>> --- /dev/null
>> +++ b/drivers/hwmon/ibmpowernv.c
>> @@ -0,0 +1,386 @@
>> +/*
>> + * IBM PowerNV platform sensors for temperature/fan/power
>> + * Copyright (C) 2014 IBM
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program; if not, write to the Free Software
>> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
> Please drop the FSF address; it can change, and we don't want to update the
> driver each time it does.
>
>> + */
>> +
>> +#include <linux/init.h>
>> +#include <linux/module.h>
>> +#include <linux/kernel.h>
>> +#include <linux/hwmon.h>
>> +#include <linux/hwmon-sysfs.h>
>> +#include <linux/of.h>
>> +#include <linux/slab.h>
>> +
>> +#include <linux/platform_device.h>
>> +#include <asm/opal.h>
>> +#include <linux/err.h>
>> +
>> +#define DRVNAME		"ibmpowernv"
>> +#define MAX_ATTR_LEN	32
>> +
>> +/* Sensor suffix name from DT */
>> +#define DT_FAULT_ATTR_SUFFIX		"faulted"
>> +#define DT_DATA_ATTR_SUFFIX		"data"
>> +#define DT_THRESHOLD_ATTR_SUFFIX	"thrs"
>> +
>> +/* Enumerates all the sensors in the POWERNV platform and also index into
>> + * 'struct sensor_name'
> Comment coding style, please (this is not the networking subsystem ;-)
>
>> + */
>> +enum sensors {
>> +	FAN,
>> +	AMBIENT_TEMP,
>> +	POWERSUPPLY,
>> +	POWER,
>> +	MAX_SENSOR_TYPE,
>> +};
>> +
>> +static struct sensor_name {
> const ?
>
>> +	char *name;
>> +	char *compatible;
>> +} sensor_names[] = {
>> +	{"fan", "ibm,opal-sensor-cooling-fan"},
>> +	{"temp", "ibm,opal-sensor-amb-temp"},
>> +	{"in", "ibm,opal-sensor-power-supply"},
>> +	{"power", "ibm,opal-sensor-power"}
>> +};
>> +
>> +struct platform_data {
>> +	struct device *hwmon_dev;
> Not needed.
>
>> +	struct device_attribute name_attr;
>> +	struct list_head attr_list;
>> +};
>> +
>> +struct sensor_data {
>> +	u32 id;
>> +	enum sensors stype;
>> +	char name[MAX_ATTR_LEN];
>> +	struct sensor_device_attribute sd_attr;
>> +	struct list_head list;
>> +};
>> +
>> +/* Platform device representing all the ibmpowernv sensors */
>> +static struct platform_device *pdevice;
>> +
>> +static void get_sensor_index_attr(const char *name, u32 *index, char *attr)
>> +{
>> +	char *hash_pos = strchr(name, '#');
>> +	char *dash_pos;
>> +	u32 copy_len;
>> +	char buf[8];
>> +
>> +	memset(buf, 0, sizeof(buf));
>> +	*index = 0;
>> +	*attr = '\0';
>> +
>> +	if (hash_pos) {
>> +		dash_pos = strchr(hash_pos, '-');
>> +		if (dash_pos) {
>> +			copy_len = dash_pos - hash_pos - 1;
>> +			if (copy_len < sizeof(buf)) {
>> +				strncpy(buf, hash_pos + 1, copy_len);
>> +				sscanf(buf, "%d", index);
>> +			}
>> +
>> +			strncpy(attr, dash_pos + 1, MAX_ATTR_LEN);
>> +		}
>> +	}
>> +}
>> +
>> +/*
>> + * This function translates the DT node name into the 'hwmon' attribute name.
>> + * IBMPOWERNV device node appear like cooling-fan#2-data, amb-temp#1-thrs etc.
>> + * which need to be mapped as fan2_input, temp1_max respectively before
>> + * populating them inside hwmon device class..
>> + */
>> +static int create_hwmon_attr_name(enum sensors stype, const char *node_name,
>> +		char *hwmon_attr_name)
> Please follow CodingStyle for continuation line alignment.
>
>> +{
>> +	char attr_suffix[MAX_ATTR_LEN];
>> +	char *attr_name;
>> +	u32 index;
>> +
>> +	get_sensor_index_attr(node_name, &index, attr_suffix);
>> +	if (!index || !strlen(attr_suffix)) {
>> +		pr_info("%s: Sensor device node name is invalid, name: %s\n",
>> +				__func__, node_name);
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (!strcmp(attr_suffix, DT_FAULT_ATTR_SUFFIX))
>> +		attr_name = "fault";
>> +	else if(!strcmp(attr_suffix, DT_DATA_ATTR_SUFFIX))
>> +		attr_name = "input";
>> +	else if (!strcmp(attr_suffix, DT_THRESHOLD_ATTR_SUFFIX)) {
>> +		if (stype == AMBIENT_TEMP)
>> +			attr_name = "max";
>> +		else if (stype == FAN)
>> +			attr_name = "min";
>> +		else
>> +			return -ENOENT;
>> +	} else
>> +		return -ENOENT;
>> +
>> +	snprintf(hwmon_attr_name, MAX_ATTR_LEN, "%s%d_%s",
>> +			sensor_names[stype].name, index, attr_name);
>> +	return 0;
>> +}
>> +
>> +static ssize_t show_sensor(struct device *dev, struct device_attribute *devattr,
>> +		char *buf)
>> +{
>> +	struct sensor_device_attribute *sd_attr = to_sensor_dev_attr(devattr);
>> +	struct sensor_data *sdata = container_of(sd_attr, struct sensor_data,
>> +			sd_attr);
>> +	int err;
>> +	u32 x;
>> +
>> +	err = opal_get_sensor_data(sdata->id, &x);
>> +	if (err) {
>> +		pr_err("%s: Failed to get opal sensor data\n", __func__);
>> +		x = -1;
> Unusual. Why not report the error to user space ?
> Reporting <maxuint> on error doesn't seem to be very useful.
>
>> +	}
>> +
>> +	/* Convert temperature to milli-degrees */
>> +	if (sdata->stype == AMBIENT_TEMP && x > 0)
>> +		x *= 1000;
>> +	/* Convert power to micro-watts */
>> +	else if (sdata->stype == POWER && x > 0)
>> +		x *= 1000000;
>> +
> Is there an overflow concern ? Guess not ... overflow happens at ~17KW.
> Just wondering.
>
>> +	return sprintf(buf, "%d\n", x);
>> +}
>> +
>> +static void remove_device_attrs(struct platform_device *pdev)
>> +{
>> +	struct platform_data *pdata = platform_get_drvdata(pdev);
>> +	struct device *dev = &pdev->dev;
>> +	struct sensor_data *s, *next;
>> +
>> +	list_for_each_entry_safe(s, next, &pdata->attr_list, list) {
>> +		device_remove_file(dev, &s->sd_attr.dev_attr);
>> +		list_del(&s->list);
> This is unnecessary since you always remove the entire list anyway.
> Actually, I don't think you need the list in the first place.
> You only use it to delete entries, but that can be handled automatically
> by the infrastructure. All you really need is an array pointing to the device
> attributes so you can create all attribute files with a single operation.
>
>> +		kfree(s);
>> +	}
>> +}
>> +
>> +/*
>> + * Iterate through the device tree and for each child of sensor node, create
>> + * a sysfs attribute file, the file is named by translating the DT node name
>> + * to the name required by the higher 'hwmon' driver like fan1_input, temp1_max
>> + * etc..
>> + */
>> +static int create_device_attrs(struct platform_device *pdev)
>> +{
>> +	struct platform_data *pdata = platform_get_drvdata(pdev);
>> +	struct device *dev = &pdev->dev;
>> +	struct device_node *opal, *np;
>> +	struct sensor_data *sdata;
>> +	const u32 *sensor_id;
>> +	enum sensors stype;
>> +	int err;
>> +
>> +	opal = of_find_node_by_path("/ibm,opal/sensors");
>> +        if (!opal) {
>> +		pr_err("%s: Opal 'sensors' node not found\n", __func__);
>> +		err = -ENXIO;
> 	ENODEV ?
>
>> +		goto exit;
> 	I would suggest to simply return here.
>
>> +        }
>> +
>> +	for_each_child_of_node(opal, np) {
>> +                if (np->name == NULL)
>> +                        continue;
>> +
>> +		for (stype = 0; stype < MAX_SENSOR_TYPE; stype++)
>> +			if (of_device_is_compatible(np,
>> +					sensor_names[stype].compatible))
>> +				break;
>> +
>> +		if (stype == MAX_SENSOR_TYPE)
>> +			continue;
>> +
>> +		sensor_id = of_get_property(np, "sensor-id", NULL);
>> +		if (!sensor_id) {
>> +			pr_info("%s: %s doesn't have sensor-id\n", __func__,
>> +					np->name);
>> +			continue;
>> +		}
>> +
>> +		sdata = kzalloc(sizeof(*sdata), GFP_KERNEL);
> Can you use devm_kzalloc() ?
>
>> +		if (!sdata) {
>> +			pr_err("%s: Failed to allocate memory for sensor_data",
>> +					__func__);
>> +			err = -ENOMEM;
>> +			goto exit_put_node;
>> +		}
>> +
>> +		sdata->id = *sensor_id;
>> +		sdata->stype = stype;
>> +		err = create_hwmon_attr_name(stype, np->name, sdata->name);
>> +		if (err) {
>> +			pr_err("%s: Failed to create the hwmon attribute name\n",
>> +					__func__);
> create_hwmon_attr_name() (sometimes) already creates an error message.
> Would be nice if you can avoid duplicate error messages.
>
>> +			goto exit_free_sdata;
>> +		}
>> +
>> +		sysfs_attr_init(&sdata->sd_attr.dev_attr.attr);
>> +		sdata->sd_attr.dev_attr.attr.name = sdata->name;
>> +		sdata->sd_attr.dev_attr.attr.mode = S_IRUGO;
>> +		sdata->sd_attr.dev_attr.show = show_sensor;
> Since you are not using the index value from sensor_device_attribute,
> but use your own 'id' variable instead, you don't need sensor_device_attribute
> but can use device_attribute instead (or drop 'id' and use
> sd_attr.index instead).
>
>> +
>> +		/* Create sysfs attribute file */
>> +		err = device_create_file(dev, &sdata->sd_attr.dev_attr);
>> +		if (err)
>> +			goto exit_free_sdata;
>> +
> Please don't create the attribute files here but just a list of attributes
> instead. See below for more details.
>
>> +		list_add_tail(&sdata->list, &pdata->attr_list);
>> +	}
>> +
>> +	of_node_put(opal);
>> +	return 0;
>> +
>> +exit_free_sdata:
>> +	kfree(sdata);
>> +exit_put_node:
>> +	of_node_put(opal);
>> +	remove_device_attrs(pdev);
>> +exit:
>> +	return err;
>> +}
>> +
>> +static ssize_t show_name(struct device *dev, struct device_attribute *devattr,
>> +		char *buf)
>> +{
>> +	struct platform_device *pdev = to_platform_device(dev);
>> +	return sprintf(buf, "%s\n", pdev->name);
>> +}
> Not necessary; see below.
>
>> +
>> +static int ibmpowernv_probe(struct platform_device *pdev)
>> +{
>> +	struct device *dev = &pdev->dev;
>> +	struct platform_data *pdata;
>> +	int err;
>> +
>> +	pdata = devm_kzalloc(dev, sizeof(*pdata), GFP_KERNEL);
>> +	if (!pdata) {
>> +		err = -ENOMEM;
>> +		goto exit;
>> +	}
>> +
>> +	INIT_LIST_HEAD(&pdata->attr_list);
>> +	platform_set_drvdata(pdev, pdata);
>> +
>> +	/* Create platform device 'name' attribute */
>> +	sysfs_attr_init(&pdata->name_attr.attr);
>> +	pdata->name_attr.attr.name = "name";
>> +	pdata->name_attr.attr.mode = S_IRUGO;
>> +	pdata->name_attr.show = show_name;
>> +	err = device_create_file(dev, &pdata->name_attr);
> You don't need to create a name attribute.
> devm_hwmon_device_register_with_groups does it for you (from the second
> argument).
>
>> +	if (err)
>> +		goto exit;
>> +
>> +	/* Create sysfs attribute file for each sensor found in the DT */
>> +	err = create_device_attrs(pdev);
> That defeats the purpose of using devm_hwmon_device_register_with_groups.
> The idea here is to build a list of attributes, which you can do in
> create_device_attrs(), but not create the actual device entries.
> Then also create an attribute_group as well as a list of groups,
> and pass that as last argument to devm_hwmon_device_register_with_groups().
>
> drivers/hwmon/pmbus/pmbus_core.c does something similar; maybe you can
> use a similar approach.
>
> With that, you also don't need the remove functions, since the hwmon
> subsystem will auto-remove the attributes. If you do it correctly
> (ie use devm_kzalloc() for allocating memory) you should not need a
> remove function at all.
>
>> +	if (err) {
>> +		pr_err("%s: Failed to create the device attributes\n",
>> +				__func__);
>> +		goto exit_remove_name;
>> +	}
>> +
>> +	/* Finally, register with hwmon */
>> +	pdata->hwmon_dev = devm_hwmon_device_register_with_groups(dev, DRVNAME,
>> +			pdata, NULL);
>> +	if (IS_ERR(pdata->hwmon_dev)) {
>> +		err = PTR_ERR(pdata->hwmon_dev);
>> +		goto exit_remove_devattrs;
>> +	}
>> +
>> +	return 0;
>> +
>> +exit_remove_devattrs:
>> +	remove_device_attrs(pdev);
>> +exit_remove_name:
>> +	device_remove_file(dev, &pdata->name_attr);
>> +exit:
>> +	return err;
> If you do it right you should be able to reduce this to something like
>
> 	hwmon_dev = devm_hwmon_device_register_with_groups(dev, DRVNAME, pdata,
> 							   pdata->attr_groups);
> 	return PTR_ERR_OR_ZERO(hwmon_dev);
>
> ...
>
>> +}
>> +
>> +static int ibmpowernv_remove(struct platform_device *pdev)
>> +{
>> +	struct platform_data *pdata = platform_get_drvdata(pdev);
>> +	struct device *dev = &pdev->dev;
>> +
>> +	remove_device_attrs(pdev);
>> +	device_remove_file(dev, &pdata->name_attr);
>> +
>> +	return 0;
>> +}
> ... and you should be able to remove this entire function.
>
>> +
>> +
> No double empty lines please.
>
>> +static struct platform_driver ibmpowernv_driver = {
>> +	.driver = {
>> +		.owner = THIS_MODULE,
>> +		.name = DRVNAME,
>> +	},
>> +	.probe = ibmpowernv_probe,
>> +	.remove = ibmpowernv_remove,
>> +};
>> +
>> +
> Excessive empty lines.
>
>> +static int __init ibmpowernv_init(void)
>> +{
>> +	int err;
>> +
>> +
> Excessive empty lines.
>
>> +	err = platform_driver_register(&ibmpowernv_driver);
>> +	if (err)
>> +		goto exit;
> Sometimes you create an error message, sometimes not. I'd prefer no error
> message to start with (the module loader will create one anyway), but
> at least please be consistent.
>
>> +
>> +
> Excessive empty lines.
>
>> +	pdevice = platform_device_alloc(DRVNAME, 0);
>> +	if (!pdevice) {
>> +		pr_err("%s: Device allocation failed\n", __func__);
>> +		err = -ENOMEM;
>> +		goto exit_driver_unreg;
>> +	}
>> +
>> +	err = platform_device_add(pdevice);
>> +	if (err) {
>> +		pr_err("%s: Device addition failed (%d)\n", __func__, err);
>> +		goto exit_device_put;
>> +	}
>> +
> Can you use platform_driver_probe() here ?
>
>> +	return 0;
>> +
>> +exit_device_put:
>> +	platform_device_put(pdevice);
>> +exit_driver_unreg:
>> +	platform_driver_unregister(&ibmpowernv_driver);
>> +exit:
>> +	return err;
>> +}
>> +
>> +static void __exit ibmpowernv_exit(void)
>> +{
>> +	platform_device_unregister(pdevice);
>> +	platform_driver_unregister(&ibmpowernv_driver);
>> +}
>> +
>> +MODULE_DESCRIPTION("IBM POWERNV platform sensors");
>> +MODULE_LICENSE("GPL");
>> +
>> +module_init(ibmpowernv_init);
>> +module_exit(ibmpowernv_exit);
>>
>>
> Empty lines at end.
>
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
Hi Guenter,

Thanks for the review. I'll rework on the patch and post the v2.

- Neelesh

^ permalink raw reply

* Re: questions on CONFIG_PPC_ADV_DEBUG_REGS, DBCR0_BRT, and DBCR0_ACTIVE_EVENTS
From: shiva7 @ 2014-05-18 23:38 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <1398206224.1694.254.camel@snotra.buserror.net>

Thanks Scott.

Apologize for not quoting the old email reference. I assumed the old archive
content always will be in the trail.

Any idea whether the DBCR0 BRT bit actually works(??), if I add  DBCR0 under
sys_debug context call? because this code seems like direct porting and
wondering why on first place not included under debug_context call. 

And also, anything special required for "server" family application code
porting here ?? as because in server family the trace exception used to viz
NORMAL exception proglog and uses SRR0 and SRR1 but in this ISA/embedded
case have dedicated DEBUG_DEBUG prolog and dedicated registers DSRR0 and
DSRR1.

I can see the debug handler already have HACK to simulate the branch taken
case and SIGTRAP to user space.

Thanks in Advance.



--
View this message in context: http://linuxppc.10917.n7.nabble.com/questions-on-CONFIG-PPC-ADV-DEBUG-REGS-DBCR0-BRT-and-DBCR0-ACTIVE-EVENTS-tp70147p82408.html
Sent from the linuxppc-dev mailing list archive at Nabble.com.

^ permalink raw reply

* Re: [PATCH V4 0/2] mm: FAULT_AROUND_ORDER patchset performance data for powerpc
From: Rusty Russell @ 2014-05-19  0:12 UTC (permalink / raw)
  To: Hugh Dickins, Madhavan Srinivasan
  Cc: linux-arch, riel, ak, dave.hansen, peterz, x86, linux-kernel,
	linux-mm, paulus, mgorman, akpm, linuxppc-dev, mingo,
	kirill.shutemov
In-Reply-To: <alpine.LSU.2.11.1405151026540.4664@eggly.anvils>

Hugh Dickins <hughd@google.com> writes:
> On Thu, 15 May 2014, Madhavan Srinivasan wrote:
>> 
>> Hi Ingo,
>> 
>> 	Do you have any comments for the latest version of the patchset. If
>> not, kindly can you pick it up as is.
>> 
>> 
>> With regards
>> Maddy
>> 
>> > Kirill A. Shutemov with 8c6e50b029 commit introduced
>> > vm_ops->map_pages() for mapping easy accessible pages around
>> > fault address in hope to reduce number of minor page faults.
>> > 
>> > This patch creates infrastructure to modify the FAULT_AROUND_ORDER
>> > value using mm/Kconfig. This will enable architecture maintainers
>> > to decide on suitable FAULT_AROUND_ORDER value based on
>> > performance data for that architecture. First patch also defaults
>> > FAULT_AROUND_ORDER Kconfig element to 4. Second patch list
>> > out the performance numbers for powerpc (platform pseries) and
>> > initialize the fault around order variable for pseries platform of
>> > powerpc.
>
> Sorry for not commenting earlier - just reminded by this ping to Ingo.
>
> I didn't study your numbers, but nowhere did I see what PAGE_SIZE you use.
>
> arch/powerpc/Kconfig suggests that Power supports base page size of
> 4k, 16k, 64k or 256k.
>
> I would expect your optimal fault_around_order to depend very much on
> the base page size.

It was 64k, which is what PPC64 uses on all the major distributions.
You really only get a choice of 4k and 64k with 64 bit power.

> Perhaps fault_around_size would provide a more useful default?

That seems to fit.  With 4k pages and order 4, you're asking for 64k.
Maddy's result shows 64k is also reasonable for 64k pages.

Perhaps we try to generalize from two data points (a slight improvement
over doing it from 1!), eg:

/* 4 seems good for 4k-page x86, 0 seems good for 64k page ppc64, so: */
unsigned int fault_around_order __read_mostly =
        (16 - PAGE_SHIFT < 0 ? 0 : 16 - PAGE_SHIFT);

Cheers,
Rusty.

^ permalink raw reply

* Re: G5 Quad working with Linux PPC and RadeonHD
From: Michel Dänzer @ 2014-05-19  2:45 UTC (permalink / raw)
  To: Luigi Burdo, linuxppc-dev
In-Reply-To: <BLU436-SMTP260B93781C0F77CDD26FD4AC8300@phx.gbl>

On 17.05.2014 17:28, Luigi Burdo wrote:
> 
> on X1000 r600 boards dont gave decoration problems
> there is probably needed some other trick about
> for the other architecture.
> 
> https://bugs.freedesktop.org/show_bug.cgi?id=74939

That bug is specific to Evergreen / Northern Islands family cards, which
includes your 6570, but not your 4650 for example.


-- 
Earthling Michel Dänzer            |                  http://www.amd.com
Libre software enthusiast          |                Mesa and X developer

^ permalink raw reply

* Re: [RFC PATCH 1/3] slub: search partial list on numa_mem_id(), instead of numa_node_id()
From: Joonsoo Kim @ 2014-05-19  2:41 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, mpm, penberg, linux-mm, paulus, Anton Blanchard,
	David Rientjes, Christoph Lameter, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140516233735.GH8941@linux.vnet.ibm.com>

On Fri, May 16, 2014 at 04:37:35PM -0700, Nishanth Aravamudan wrote:
> On 06.02.2014 [17:07:04 +0900], Joonsoo Kim wrote:
> > Currently, if allocation constraint to node is NUMA_NO_NODE, we search
> > a partial slab on numa_node_id() node. This doesn't work properly on the
> > system having memoryless node, since it can have no memory on that node and
> > there must be no partial slab on that node.
> > 
> > On that node, page allocation always fallback to numa_mem_id() first. So
> > searching a partial slab on numa_node_id() in that case is proper solution
> > for memoryless node case.
> > 
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> 
> Joonsoo, would you send this one on to Andrew?

Hello,

Okay. I will do it.

Thanks.

^ permalink raw reply

* Re: [PATCH V4 0/2] mm: FAULT_AROUND_ORDER patchset performance data for powerpc
From: Madhavan Srinivasan @ 2014-05-19  3:05 UTC (permalink / raw)
  To: Rusty Russell, Hugh Dickins
  Cc: linux-arch, riel, ak, dave.hansen, peterz, x86, linux-kernel,
	linux-mm, paulus, mgorman, akpm, linuxppc-dev, mingo,
	kirill.shutemov
In-Reply-To: <87wqdik4n5.fsf@rustcorp.com.au>

On Monday 19 May 2014 05:42 AM, Rusty Russell wrote:
> Hugh Dickins <hughd@google.com> writes:
>> On Thu, 15 May 2014, Madhavan Srinivasan wrote:
>>>
>>> Hi Ingo,
>>>
>>> 	Do you have any comments for the latest version of the patchset. If
>>> not, kindly can you pick it up as is.
>>>
>>>
>>> With regards
>>> Maddy
>>>
>>>> Kirill A. Shutemov with 8c6e50b029 commit introduced
>>>> vm_ops->map_pages() for mapping easy accessible pages around
>>>> fault address in hope to reduce number of minor page faults.
>>>>
>>>> This patch creates infrastructure to modify the FAULT_AROUND_ORDER
>>>> value using mm/Kconfig. This will enable architecture maintainers
>>>> to decide on suitable FAULT_AROUND_ORDER value based on
>>>> performance data for that architecture. First patch also defaults
>>>> FAULT_AROUND_ORDER Kconfig element to 4. Second patch list
>>>> out the performance numbers for powerpc (platform pseries) and
>>>> initialize the fault around order variable for pseries platform of
>>>> powerpc.
>>
>> Sorry for not commenting earlier - just reminded by this ping to Ingo.
>>
>> I didn't study your numbers, but nowhere did I see what PAGE_SIZE you use.
>>
>> arch/powerpc/Kconfig suggests that Power supports base page size of
>> 4k, 16k, 64k or 256k.
>>
>> I would expect your optimal fault_around_order to depend very much on
>> the base page size.
> 
> It was 64k, which is what PPC64 uses on all the major distributions.
> You really only get a choice of 4k and 64k with 64 bit power.
> 
This is true. PPC64 support multiple pagesize and yes the default page
size of 64k, is taken as base pagesize for the tests.

>> Perhaps fault_around_size would provide a more useful default?
> 
> That seems to fit.  With 4k pages and order 4, you're asking for 64k.
> Maddy's result shows 64k is also reasonable for 64k pages.
> 
> Perhaps we try to generalize from two data points (a slight improvement
> over doing it from 1!), eg:
> 
> /* 4 seems good for 4k-page x86, 0 seems good for 64k page ppc64, so: */
> unsigned int fault_around_order __read_mostly =
>         (16 - PAGE_SHIFT < 0 ? 0 : 16 - PAGE_SHIFT);
> 

This may be right. But these are the concerns, will not this make other
arch to pick default without any tuning and also this will remove the
compile time option to disable the feature?

Thanks for review
With regards
Maddy



> Cheers,
> Rusty.
> 

^ permalink raw reply

* Re: [PATCH V2 2/3] powerpc, ptrace: Enable support for transactional memory register sets
From: Anshuman Khandual @ 2014-05-19  9:12 UTC (permalink / raw)
  To: Pedro Alves; +Cc: mikey, avagin, oleg, linux-kernel, michael, linuxppc-dev
In-Reply-To: <5374AE24.1030302@redhat.com>

On 05/15/2014 05:38 PM, Pedro Alves wrote:
> On 05/15/2014 09:25 AM, Anshuman Khandual wrote:
>> On 05/14/2014 04:45 PM, Pedro Alves wrote:
>>> On 05/14/14 06:46, Anshuman Khandual wrote:
>>>> On 05/13/2014 10:43 PM, Pedro Alves wrote:
>>>>> On 05/05/14 08:54, Anshuman Khandual wrote:
>>>>>> This patch enables get and set of transactional memory related register
>>>>>> sets through PTRACE_GETREGSET/PTRACE_SETREGSET interface by implementing
>>>>>> four new powerpc specific register sets i.e REGSET_TM_SPR, REGSET_TM_CGPR,
>>>>>> REGSET_TM_CFPR, REGSET_CVMX support corresponding to these following new
>>>>>> ELF core note types added previously in this regard.
>>>>>>
>>>>>> 	(1) NT_PPC_TM_SPR
>>>>>> 	(2) NT_PPC_TM_CGPR
>>>>>> 	(3) NT_PPC_TM_CFPR
>>>>>> 	(4) NT_PPC_TM_CVMX
>>>>>
>>>>> Sorry that I couldn't tell this from the code, but, what does the
>>>>> kernel return when the ptracer requests these registers and the
>>>>> program is not in a transaction?  Specifically I'm wondering whether
>>>>> this follows the same semantics as the s390 port.
>>>>>
>>>>
>>>> Right now, it still returns the saved state of the registers from thread
>>>> struct. I had assumed that the user must know the state of the transaction
>>>> before initiating the ptrace request. I guess its better to check for
>>>> the transaction status before processing the request. In case if TM is not
>>>> active on that thread, we should return -EINVAL.
>>>
>>> I think s390 returns ENODATA in that case.
>>>
>>>  https://sourceware.org/ml/gdb-patches/2013-06/msg00273.html
>>>
>>> We'll want some way to tell whether the system actually
>>> supports this.  That could be ENODATA vs something-else (EINVAL
>>> or perhaps better EIO for "request is invalid").
>>
>> As Mickey has pointed out, the transaction memory support in the system can be
>> checked from the HWCAP2 flags. So when the transaction is not active, we will
>> return ENODATA instead for TM related ptrace regset requests.
> 
> Returning ENODATA when the transaction is not active, like
> s390 is great.  Thank you.
> 
> But I think it's worth it to consider what should the kernel
> return when the machine doesn't have these registers at all.
> 
> Sure, for this case we happen to have the hwcap flag.  But in
> general, I don't know whether we will always have a hwcap bit
> for each register set that is added.  Maybe we will, so that
> the info ends up in core dumps.
> 
> Still, I think it's worth to consider this case in the
> general sense, irrespective of hwcap.
> 
> That is, what should PTRACE_GETREGSET/PTRACE_SETREGSET return
> when the machine doesn't have the registers at all.  We shouldn't
> need to consult something elsewhere (like hwcap) to determine
> what ENODATA means.  The kernel knows it right there.  I think
> s390 goofed here.
> 
> Taking a look at x86, for example, we see:
> 
> 	[REGSET_XSTATE] = {
> 		.core_note_type = NT_X86_XSTATE,
> 		.size = sizeof(u64), .align = sizeof(u64),
> 		.active = xstateregs_active, .get = xstateregs_get,
> 		.set = xstateregs_set
> 	},
> 
> Note that it installs the ".active" hook.
> 
>  24 /**
>  25  * user_regset_active_fn - type of @active function in &struct user_regset
>  26  * @target:     thread being examined
>  27  * @regset:     regset being examined
>  28  *
>  29  * Return -%ENODEV if not available on the hardware found.
>  30  * Return %0 if no interesting state in this thread.
>  31  * Return >%0 number of @size units of interesting state.
>  32  * Any get call fetching state beyond that number will
>  33  * see the default initialization state for this data,
>  34  * so a caller that knows what the default state is need
>  35  * not copy it all out.
>  36  * This call is optional; the pointer is %NULL if there
>  37  * is no inexpensive check to yield a value < @n.
>  38  */
>  39 typedef int user_regset_active_fn(struct task_struct *target,
>  40                                   const struct user_regset *regset);
>  41
> 
> Note the mention of ENODEV.
> 
> I couldn't actually find any arch that currently returns -ENODEV in
> the "active" hook.  I see that binfmt_elf.c doesn't handle
> regset->active() returning < 0.  Guess that may be why.  Looks like
> something that could be cleaned up, to me.
> 
> Anyway, notice x86's REGSET_XSTATE regset->get hook:
> 
> int xstateregs_get(struct task_struct *target, const struct user_regset *regset,
> 		unsigned int pos, unsigned int count,
> 		void *kbuf, void __user *ubuf)
> {
> 	int ret;
> 
> 	if (!cpu_has_xsave)
> 		return -ENODEV;
>         ^^^^^^^^^^^^^^^^^^^^^^
> 
> And then we see that xfpregs_get has a similar ENODEV case.
> 
> So in sum, it very much looks like the intention is for
> PTRACE_GETREGSET/PTRACE_SETREGSET to return ENODEV in the
> case the regset doesn't exist on the running machine, and then
> it looks like at least x86 works that way.
> 

Will work on these suggestions and post it again. Thanks for the
detailed insights and review.

^ permalink raw reply

* Re: [PATCH RFC v3 0/8] EEH Support for VFIO PCI device
From: Gavin Shan @ 2014-05-19 10:07 UTC (permalink / raw)
  To: Gavin Shan; +Cc: aik, agraf, kvm-ppc, alex.williamson, qiudayu, linuxppc-dev
In-Reply-To: <1400040722-29608-1-git-send-email-gwshan@linux.vnet.ibm.com>

On Wed, May 14, 2014 at 02:11:54PM +1000, Gavin Shan wrote:

Ping, Alex.G and Alex.W. Could you please take a look on this when
you have available bandwidth? Thanks in advance for your time :-)

>The series of patches intends to support EEH for PCI devices, which are
>passed through to PowerKVM based guest via VFIO. The implementation is
>straightforward based on the issues or problems we have to resolve to
>support EEH for PowerKVM based guest.
>
>- Emulation for EEH RTAS requests. All EEH RTAS requests goes to QEMU firstly.
>  If QEMU can't handle it, the request will be sent to host via newly introduced
>  VFIO container IOCTL command (VFIO_EEH_INFO) and gets handled in host kernel.
>
>- The error injection infrastructure need support request from the userland
>  utility "errinjct" and PowerKVM based guest. The userland utility "errinjct"
>  works on pSeries platform well with dedicated syscall, which helps invoking
>  RTAS service to fulfil error injection in kernel. From the perspective, it's
>  reasonable to extend the syscall to support PowerNV platform so that OPAL call
>  can be invoked in host kernel for injecting errors. The data transported
>  between userland and kerenl is still following "struct rtas_args" for both
>  cases of PowerNV (OPAL) and pSeries (RTAS).
>
>The series of patches requires corresponding firmware changes from Mike Qiu to
>support error injection and QEMU changes to support EEH for guest. QEMU patchset
>will be sent separately.
>
>Change log
>==========
>v1 -> v2:
>	* EEH RTAS requests are routed to QEMU, and then possiblly to host kerenl.
>	  The mechanism KVM in-kernel handling is dropped.
>	* Error injection is reimplemented based syscall, instead of KVM in-kerenl
>	  handling. The logic for error injection token management is moved to
>	  QEMU. The error injection request is routed to QEMU and then possiblly
>	  to host kernel.
>v2 -> v3:
>	* Make the fields in struct eeh_vfio_pci_addr, struct vfio_eeh_info based
>	  on the comments from Alexey.
>	* Define macros for EEH VFIO operations (Alexey).
>	* Clear frozen state after successful PE reset.
>	* Merge original [PATCH 1/2/3] to one.
>
>Testing on P7
>=============
>
>- Emulex adapter
>
>Testing on P8
>=============
>
>- Need more testing after design is finalized.
>
>-----
>
>Gavin Shan (8):
>  drivers/vfio: Introduce CONFIG_VFIO_EEH
>  powerpc/eeh: Info to trace passed devices
>  drivers/vfio: New IOCTL command VFIO_EEH_INFO
>  powerpc/eeh: Avoid event on passed PE
>  powerpc/powernv: Sync OPAL header file with firmware
>  powerpc: Extend syscall ppc_rtas()
>  powerpc/powernv: Implement ppc_call_opal()
>  powerpc/powernv: Error injection infrastructure
>
> arch/powerpc/include/asm/eeh.h                 |  52 +++
> arch/powerpc/include/asm/opal.h                |  74 ++-
> arch/powerpc/include/asm/rtas.h                |  10 +-
> arch/powerpc/include/asm/syscalls.h            |   2 +-
> arch/powerpc/include/asm/systbl.h              |   2 +-
> arch/powerpc/include/uapi/asm/unistd.h         |   2 +-
> arch/powerpc/kernel/eeh.c                      |   8 +
> arch/powerpc/kernel/eeh_pe.c                   |  80 ++++
> arch/powerpc/kernel/rtas.c                     |  57 +--
> arch/powerpc/kernel/syscalls.c                 |  50 +++
> arch/powerpc/platforms/powernv/Makefile        |   3 +-
> arch/powerpc/platforms/powernv/eeh-ioda.c      |   3 +-
> arch/powerpc/platforms/powernv/eeh-vfio.c      | 593 +++++++++++++++++++++++++
> arch/powerpc/platforms/powernv/errinject.c     | 224 ++++++++++
> arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
> arch/powerpc/platforms/powernv/opal.c          |  93 ++++
> drivers/vfio/Kconfig                           |   6 +
> drivers/vfio/vfio_iommu_spapr_tce.c            |  12 +
> include/uapi/linux/vfio.h                      |  57 +++
> kernel/sys_ni.c                                |   2 +-
> 20 files changed, 1278 insertions(+), 53 deletions(-)
> create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c
> create mode 100644 arch/powerpc/platforms/powernv/errinject.c
>

Thanks,
Gavin

^ permalink raw reply

* Re: [PATCH V2 2/3] powerpc, ptrace: Enable support for transactional memory register sets
From: Anshuman Khandual @ 2014-05-19 11:46 UTC (permalink / raw)
  To: Pedro Alves; +Cc: mikey, avagin, oleg, linux-kernel, michael, linuxppc-dev
In-Reply-To: <5374AE24.1030302@redhat.com>

On 05/15/2014 05:38 PM, Pedro Alves wrote:
> On 05/15/2014 09:25 AM, Anshuman Khandual wrote:
>> On 05/14/2014 04:45 PM, Pedro Alves wrote:
>>> On 05/14/14 06:46, Anshuman Khandual wrote:
>>>> On 05/13/2014 10:43 PM, Pedro Alves wrote:
>>>>> On 05/05/14 08:54, Anshuman Khandual wrote:
>>>>>> This patch enables get and set of transactional memory related register
>>>>>> sets through PTRACE_GETREGSET/PTRACE_SETREGSET interface by implementing
>>>>>> four new powerpc specific register sets i.e REGSET_TM_SPR, REGSET_TM_CGPR,
>>>>>> REGSET_TM_CFPR, REGSET_CVMX support corresponding to these following new
>>>>>> ELF core note types added previously in this regard.
>>>>>>
>>>>>> 	(1) NT_PPC_TM_SPR
>>>>>> 	(2) NT_PPC_TM_CGPR
>>>>>> 	(3) NT_PPC_TM_CFPR
>>>>>> 	(4) NT_PPC_TM_CVMX
>>>>>
>>>>> Sorry that I couldn't tell this from the code, but, what does the
>>>>> kernel return when the ptracer requests these registers and the
>>>>> program is not in a transaction?  Specifically I'm wondering whether
>>>>> this follows the same semantics as the s390 port.
>>>>>
>>>>
>>>> Right now, it still returns the saved state of the registers from thread
>>>> struct. I had assumed that the user must know the state of the transaction
>>>> before initiating the ptrace request. I guess its better to check for
>>>> the transaction status before processing the request. In case if TM is not
>>>> active on that thread, we should return -EINVAL.
>>>
>>> I think s390 returns ENODATA in that case.
>>>
>>>  https://sourceware.org/ml/gdb-patches/2013-06/msg00273.html
>>>
>>> We'll want some way to tell whether the system actually
>>> supports this.  That could be ENODATA vs something-else (EINVAL
>>> or perhaps better EIO for "request is invalid").
>>
>> As Mickey has pointed out, the transaction memory support in the system can be
>> checked from the HWCAP2 flags. So when the transaction is not active, we will
>> return ENODATA instead for TM related ptrace regset requests.
> 
> Returning ENODATA when the transaction is not active, like
> s390 is great.  Thank you.
> 
> But I think it's worth it to consider what should the kernel
> return when the machine doesn't have these registers at all.
> 
> Sure, for this case we happen to have the hwcap flag.  But in
> general, I don't know whether we will always have a hwcap bit
> for each register set that is added.  Maybe we will, so that
> the info ends up in core dumps.
> 
> Still, I think it's worth to consider this case in the
> general sense, irrespective of hwcap.
> 
> That is, what should PTRACE_GETREGSET/PTRACE_SETREGSET return
> when the machine doesn't have the registers at all.  We shouldn't
> need to consult something elsewhere (like hwcap) to determine
> what ENODATA means.  The kernel knows it right there.  I think
> s390 goofed here.
> 
> Taking a look at x86, for example, we see:
> 
> 	[REGSET_XSTATE] = {
> 		.core_note_type = NT_X86_XSTATE,
> 		.size = sizeof(u64), .align = sizeof(u64),
> 		.active = xstateregs_active, .get = xstateregs_get,
> 		.set = xstateregs_set
> 	},
> 
> Note that it installs the ".active" hook.
> 
>  24 /**
>  25  * user_regset_active_fn - type of @active function in &struct user_regset
>  26  * @target:     thread being examined
>  27  * @regset:     regset being examined
>  28  *
>  29  * Return -%ENODEV if not available on the hardware found.
>  30  * Return %0 if no interesting state in this thread.
>  31  * Return >%0 number of @size units of interesting state.
>  32  * Any get call fetching state beyond that number will
>  33  * see the default initialization state for this data,
>  34  * so a caller that knows what the default state is need
>  35  * not copy it all out.
>  36  * This call is optional; the pointer is %NULL if there
>  37  * is no inexpensive check to yield a value < @n.
>  38  */
>  39 typedef int user_regset_active_fn(struct task_struct *target,
>  40                                   const struct user_regset *regset);
>  41
> 
> Note the mention of ENODEV.
> 
> I couldn't actually find any arch that currently returns -ENODEV in
> the "active" hook.  I see that binfmt_elf.c doesn't handle
> regset->active() returning < 0.  Guess that may be why.  Looks like
> something that could be cleaned up, to me.
>

Also it does not consider the return value of regset->active(t->task, regset)
(whose objective is to figure out whether we need to request regset->n number
of elements or less than that) in the subsequent call to regset->get function.

> Anyway, notice x86's REGSET_XSTATE regset->get hook:
> 
> int xstateregs_get(struct task_struct *target, const struct user_regset *regset,
> 		unsigned int pos, unsigned int count,
> 		void *kbuf, void __user *ubuf)
> {
> 	int ret;
> 
> 	if (!cpu_has_xsave)
> 		return -ENODEV;
>         ^^^^^^^^^^^^^^^^^^^^^^
> 
> And then we see that xfpregs_get has a similar ENODEV case.
> 
> So in sum, it very much looks like the intention is for
> PTRACE_GETREGSET/PTRACE_SETREGSET to return ENODEV in the
> case the regset doesn't exist on the running machine, and then
> it looks like at least x86 works that way.

Okay. Looks like for all the "get/set" hooks I have added for the brand new regsets,
we need to implement ENODATA error condition as that of s390 when TM is not active
on the thread in target and implement ENODEV error condition as that of x86 when
TM supports is not at all available on the system. So the code snippet which should
be added to all the new "get/set" functions will be something like this.

+       if (!cpu_has_feature(CPU_FTR_TM))
+               return -ENODEV;
+
+       if(!MSR_TM_ACTIVE(target->thread.regs->msr))
+               return -ENODATA;


Now coming to the installation of the .active hooks part for all the new regsets, it
should be pretty straight forward as well. Though its optional and used for elf_core_dump
purpose only, its worth adding them here. Example of an active function should be something
like this. The function is inexpensive as required.

+static int tm_spr_active(struct task_struct *target,
+                               const struct user_regset *regset)
+{
+       if (!cpu_has_feature(CPU_FTR_TM))
+               return -ENODEV;
+
+       if(!MSR_TM_ACTIVE(target->thread.regs->msr))
+               return 0;
+
+       return regset->n;
+}

^ permalink raw reply

* Re: [PATCH 2/8] powerpc/eeh: Info to trace passed devices
From: Alexander Graf @ 2014-05-19 12:46 UTC (permalink / raw)
  To: Gavin Shan, kvm-ppc; +Cc: aik, alex.williamson, qiudayu, linuxppc-dev
In-Reply-To: <1400040722-29608-3-git-send-email-gwshan@linux.vnet.ibm.com>


On 14.05.14 06:11, Gavin Shan wrote:
> The address of passed PCI devices (domain:bus:slot:func) might be
> quite different from the perspective of host and guest. We have to
> trace the address mapping so that we can emulate EEH RTAS requests
> from guest. The patch introduces additional fields to eeh_pe and
> eeh_dev for the purpose.
>
> Also, the patch adds function eeh_vfio_pe_get() and eeh_vfio_dev_get()
> to search EEH PE or device according to the given guest address. Both
> of them will be used by subsequent patches.
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>

I don't see the point of VFIO knowing about guest addresses. They are 
not unique across a system and the whole idea that a VFIO device has to 
be owned by a guest is also pretty dubious.

I suppose what you really care about here is just a token for a specific 
device? But why do you need one where we don't have tokens yet?


Alex

^ permalink raw reply

* Re: [PATCH 3/8] drivers/vfio: New IOCTL command VFIO_EEH_INFO
From: Alexander Graf @ 2014-05-19 12:51 UTC (permalink / raw)
  To: Gavin Shan, kvm-ppc; +Cc: aik, alex.williamson, qiudayu, linuxppc-dev
In-Reply-To: <1400040722-29608-4-git-send-email-gwshan@linux.vnet.ibm.com>


On 14.05.14 06:11, Gavin Shan wrote:
> The patch adds new IOCTL command VFIO_EEH_INFO to VFIO container
> to support EEH functionality for PCI devices, which have been
> passed from host to guest via VFIO.
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>   arch/powerpc/platforms/powernv/Makefile   |   1 +
>   arch/powerpc/platforms/powernv/eeh-vfio.c | 593 ++++++++++++++++++++++++++++++
>   drivers/vfio/vfio_iommu_spapr_tce.c       |  12 +
>   include/uapi/linux/vfio.h                 |  57 +++

Documentation/...?


Alex

^ permalink raw reply

* Re: [PATCH 6/8] powerpc: Extend syscall ppc_rtas()
From: Alexander Graf @ 2014-05-19 12:55 UTC (permalink / raw)
  To: Gavin Shan, kvm-ppc; +Cc: aik, alex.williamson, qiudayu, linuxppc-dev
In-Reply-To: <1400040722-29608-7-git-send-email-gwshan@linux.vnet.ibm.com>


On 14.05.14 06:12, Gavin Shan wrote:
> Originally, syscall ppc_rtas() can be used to invoke RTAS call from
> user space. Utility "errinjct" is using it to inject various errors
> to the system for testing purpose. The patch intends to extend the
> syscall to support both pSeries and PowerNV platform. With that,
> RTAS and OPAL call can be invoked from user space. In turn, utility
> "errinjct" can be supported on pSeries and PowerNV platform at same
> time.
>
> The original syscall handler ppc_rtas() is renamed to ppc_firmware(),
> which calls ppc_call_rtas() or ppc_call_opal() depending on the
> running platform. The data transported between userland and kerenl is

Please fix your spelling of kernel.

> by "struct rtas_args". It's platform specific on how to use the data.
>
> Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>

I think the basic idea to maintain the same interface between PAPR and 
OPAL to user space is sound, but this is really Ben's call.


Alex

^ permalink raw reply

* Re: [PATCH 7/8] powerpc/powernv: Implement ppc_call_opal()
From: Alexander Graf @ 2014-05-19 12:59 UTC (permalink / raw)
  To: Gavin Shan, kvm-ppc; +Cc: aik, alex.williamson, qiudayu, linuxppc-dev
In-Reply-To: <1400040722-29608-8-git-send-email-gwshan@linux.vnet.ibm.com>


On 14.05.14 06:12, Gavin Shan wrote:
> If we're running PowerNV platform, ppc_firmware() will be directed
> to ppc_call_opal() where we can call to OPAL API accordingly. In
> ppc_call_opal(), the input argument are parsed out and call to
> appropriate OPAL API to handle that. Each request passed to the
> function is identified with token. As we get to the function either
> from host owned application (e.g. errinjct) or VM, we always have
> the first parameter (so-called "virtual") to differentiate the
> cases.

This sounds like nonsense. A virtual parameter? What is that supposed to be?

I don't think you can use this interface for VM device management, as 
the syscall interface is (hopefully!) limited to CAP_ADMIN which you 
(hopefully!) don't have as VFIO user.


Alex

^ permalink raw reply

* Re: [PATCH 8/8] powerpc/powernv: Error injection infrastructure
From: Alexander Graf @ 2014-05-19 13:04 UTC (permalink / raw)
  To: Gavin Shan, kvm-ppc; +Cc: aik, alex.williamson, qiudayu, linuxppc-dev
In-Reply-To: <1400040722-29608-9-git-send-email-gwshan@linux.vnet.ibm.com>


On 14.05.14 06:12, Gavin Shan wrote:
> The patch intends to implement the error injection infrastructure
> for PowerNV platform. The predetermined handlers will be called
> according to the type of injected error (e.g. OpalErrinjctTypeIoaBusError).
> For now, we just support PCI error injection. We need support
> injecting other types of errors in future.

Your token to a VFIO device is the VFIO fd. If you want to inject an 
error into that device, you should do it via that token. That gets you 
all permission problems solved for free.

But I still didn't quite grasp why you need to do this. Why do we need 
to inject an error into a device via OPAL when we want to do EEH inside 
of a guest? Are you trying to emulate guest side error injection?


Alex

^ permalink raw reply

* [PATCH v2] powerpc/powernv: hwmon driver for power values, fan rpm and temperature
From: Neelesh Gupta @ 2014-05-19 14:26 UTC (permalink / raw)
  To: linuxppc-dev, linux, jdelvare, lm-sensors; +Cc: sbhat

This patch adds basic kernel enablement for reading power values, fan
speed rpm and temperature values on powernv platforms which will
be exported to user space through sysfs interface.

Test results:
-------------
[root@tul163p1 ~]# sensors
ibmpowernv-isa-0000
Adapter: ISA adapter
fan1:        5487 RPM  (min =    0 RPM)
fan2:        5152 RPM  (min =    0 RPM)
fan3:        5590 RPM  (min =    0 RPM)
fan4:        4963 RPM  (min =    0 RPM)
fan5:           0 RPM  (min =    0 RPM)
fan6:           0 RPM  (min =    0 RPM)
fan7:        7488 RPM  (min =    0 RPM)
fan8:        7944 RPM  (min =    0 RPM)
temp1:        +39.0°C  (high =  +0.0°C)
power1:      192.00 W  

[root@tul163p1 ~]# ls /sys/devices/platform/
alarmtimer  ibmpowernv.0  rtc-generic  serial8250  uevent
[root@tul163p1 ~]# ls /sys/devices/platform/ibmpowernv.0/
driver/    hwmon/     modalias   subsystem/ uevent     
[root@tul163p1 ~]# ls /sys/devices/platform/ibmpowernv.0/hwmon/hwmon0/
device	    fan2_min	fan4_min    fan6_min	fan8_min   power1_input
fan1_fault  fan3_fault	fan5_fault  fan7_fault	in1_fault  subsystem
fan1_input  fan3_input	fan5_input  fan7_input	in2_fault  temp1_input
fan1_min    fan3_min	fan5_min    fan7_min	in3_fault  temp1_max
fan2_fault  fan4_fault	fan6_fault  fan8_fault	in4_fault  uevent
fan2_input  fan4_input	fan6_input  fan8_input	name
[root@tul163p1 ~]# 
[root@tul163p1 ~]# ls /sys/class/hwmon/hwmon0/
device	    fan2_min	fan4_min    fan6_min	fan8_min   power1_input
fan1_fault  fan3_fault	fan5_fault  fan7_fault	in1_fault  subsystem
fan1_input  fan3_input	fan5_input  fan7_input	in2_fault  temp1_input
fan1_min    fan3_min	fan5_min    fan7_min	in3_fault  temp1_max
fan2_fault  fan4_fault	fan6_fault  fan8_fault	in4_fault  uevent
fan2_input  fan4_input	fan6_input  fan8_input	name
[root@tul163p1 ~]#

Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
---

Changes in v2
=============
- Generic use of devm_* functions in hwmon like using devm_kzalloc() for dynamic
  memory request, avoiding the need to explicit free of memory.
  Adding 'struct attribute_group' as member of platform data structure to be
  populated and then passed to devm_hwmon_device_register_with_groups().

  Note: Having an array of pointers of 'attribute_group' and each group
  corresponds to 'enum sensors' type. Not completely sure, if it's ideal or
  could have just one group populated with attributes of sensor types?

- 'ibmpowernv' is not hot-pluggable device so moving 'platform_driver' callback
  function (probe) as part of __init code.
- Fixed issues related to coding style.
- Other general comments in v1.

 drivers/hwmon/Kconfig      |    8 +
 drivers/hwmon/Makefile     |    1 
 drivers/hwmon/ibmpowernv.c |  368 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 377 insertions(+)
 create mode 100644 drivers/hwmon/ibmpowernv.c

diff --git a/drivers/hwmon/Kconfig b/drivers/hwmon/Kconfig
index bc196f4..3e308fa 100644
--- a/drivers/hwmon/Kconfig
+++ b/drivers/hwmon/Kconfig
@@ -554,6 +554,14 @@ config SENSORS_IBMPEX
 	  This driver can also be built as a module.  If so, the module
 	  will be called ibmpex.
 
+config SENSORS_IBMPOWERNV
+	tristate "IBM POWERNV platform sensors"
+	depends on PPC_POWERNV
+	default y
+	help
+	  If you say yes here you get support for the temperature/fan/power
+	  sensors on your platform.
+
 config SENSORS_IIO_HWMON
 	tristate "Hwmon driver that uses channels specified via iio maps"
 	depends on IIO
diff --git a/drivers/hwmon/Makefile b/drivers/hwmon/Makefile
index c48f987..199c401 100644
--- a/drivers/hwmon/Makefile
+++ b/drivers/hwmon/Makefile
@@ -71,6 +71,7 @@ obj-$(CONFIG_SENSORS_ULTRA45)	+= ultra45_env.o
 obj-$(CONFIG_SENSORS_I5K_AMB)	+= i5k_amb.o
 obj-$(CONFIG_SENSORS_IBMAEM)	+= ibmaem.o
 obj-$(CONFIG_SENSORS_IBMPEX)	+= ibmpex.o
+obj-$(CONFIG_SENSORS_IBMPOWERNV)+= ibmpowernv.o
 obj-$(CONFIG_SENSORS_IIO_HWMON) += iio_hwmon.o
 obj-$(CONFIG_SENSORS_INA209)	+= ina209.o
 obj-$(CONFIG_SENSORS_INA2XX)	+= ina2xx.o
diff --git a/drivers/hwmon/ibmpowernv.c b/drivers/hwmon/ibmpowernv.c
new file mode 100644
index 0000000..afce620
--- /dev/null
+++ b/drivers/hwmon/ibmpowernv.c
@@ -0,0 +1,368 @@
+/*
+ * IBM PowerNV platform sensors for temperature/fan/power
+ * Copyright (C) 2014 IBM
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/hwmon.h>
+#include <linux/hwmon-sysfs.h>
+#include <linux/of.h>
+#include <linux/slab.h>
+
+#include <linux/platform_device.h>
+#include <asm/opal.h>
+#include <linux/err.h>
+
+#define DRVNAME		"ibmpowernv"
+#define MAX_ATTR_LEN	32
+
+/* Sensor suffix name from DT */
+#define DT_FAULT_ATTR_SUFFIX		"faulted"
+#define DT_DATA_ATTR_SUFFIX		"data"
+#define DT_THRESHOLD_ATTR_SUFFIX	"thrs"
+
+/*
+ * Enumerates all the types of sensors in the POWERNV platform and does index
+ * into 'struct sensor_group'
+ */
+enum sensors {
+	FAN,
+	AMBIENT_TEMP,
+	POWER_SUPPLY,
+	POWER_INPUT,
+	MAX_SENSOR_TYPE,
+};
+
+static struct sensor_group {
+	const char *name;
+	const char *compatible;
+	struct attribute_group group;
+	u32 attr_count;
+} sensor_groups[] = {
+	{"fan", "ibm,opal-sensor-cooling-fan", {0}, 0},
+	{"temp", "ibm,opal-sensor-amb-temp", {0}, 0},
+	{"in", "ibm,opal-sensor-power-supply", {0}, 0},
+	{"power", "ibm,opal-sensor-power", {0}, 0}
+};
+
+struct sensor_data {
+	u32 id; /* An opaque id of the firmware for each sensor */
+	enum sensors type;
+	char name[MAX_ATTR_LEN];
+	struct device_attribute dev_attr;
+};
+
+struct platform_data {
+	const struct attribute_group *attr_groups[MAX_SENSOR_TYPE + 1];
+	u32 sensors_count; /* Total count of sensors from each group */
+};
+
+/* Platform device representing all the ibmpowernv sensors */
+static struct platform_device *pdevice;
+
+static ssize_t show_sensor(struct device *dev, struct device_attribute *devattr,
+			   char *buf)
+{
+	struct sensor_data *sdata = container_of(devattr, struct sensor_data,
+						 dev_attr);
+	ssize_t ret;
+	u32 x;
+
+	ret = opal_get_sensor_data(sdata->id, &x);
+	if (ret) {
+		pr_err("%s: Failed to get opal sensor data\n", __func__);
+		return ret;
+	}
+
+	/* Convert temperature to milli-degrees */
+	if (sdata->type == AMBIENT_TEMP)
+		x *= 1000;
+	/* Convert power to micro-watts */
+	else if (sdata->type == POWER_INPUT)
+		x *= 1000000;
+
+	return sprintf(buf, "%d\n", x);
+}
+
+static void __init get_sensor_index_attr(const char *name, u32 *index, char *attr)
+{
+	char *hash_pos = strchr(name, '#');
+	char *dash_pos;
+	u32 copy_len;
+	char buf[8];
+
+	memset(buf, 0, sizeof(buf));
+	*index = 0;
+	*attr = '\0';
+
+	if (hash_pos) {
+		dash_pos = strchr(hash_pos, '-');
+		if (dash_pos) {
+			copy_len = dash_pos - hash_pos - 1;
+			if (copy_len < sizeof(buf)) {
+				strncpy(buf, hash_pos + 1, copy_len);
+				sscanf(buf, "%d", index);
+			}
+
+			strncpy(attr, dash_pos + 1, MAX_ATTR_LEN);
+		}
+	}
+}
+
+/*
+ * This function translates the DT node name into the 'hwmon' attribute name.
+ * IBMPOWERNV device node appear like cooling-fan#2-data, amb-temp#1-thrs etc.
+ * which need to be mapped as fan2_input, temp1_max respectively before
+ * populating them inside hwmon device class..
+ */
+static int __init create_hwmon_attr_name(enum sensors type, const char *node_name,
+				  char *hwmon_attr_name)
+{
+	char attr_suffix[MAX_ATTR_LEN];
+	char *attr_name;
+	u32 index;
+
+	get_sensor_index_attr(node_name, &index, attr_suffix);
+	if (!index || !strlen(attr_suffix)) {
+		pr_info("%s: Sensor device node name is invalid, name: %s\n",
+				__func__, node_name);
+		return -EINVAL;
+	}
+
+	if (!strcmp(attr_suffix, DT_FAULT_ATTR_SUFFIX))
+		attr_name = "fault";
+	else if(!strcmp(attr_suffix, DT_DATA_ATTR_SUFFIX))
+		attr_name = "input";
+	else if (!strcmp(attr_suffix, DT_THRESHOLD_ATTR_SUFFIX)) {
+		if (type == AMBIENT_TEMP)
+			attr_name = "max";
+		else if (type == FAN)
+			attr_name = "min";
+		else
+			return -ENOENT;
+	} else
+		return -ENOENT;
+
+	snprintf(hwmon_attr_name, MAX_ATTR_LEN, "%s%d_%s",
+			sensor_groups[type].name, index, attr_name);
+	return 0;
+}
+
+static int __init populate_attr_groups(struct platform_device *pdev)
+{
+	struct platform_data *pdata = platform_get_drvdata(pdev);
+	const struct attribute_group **pgroups = pdata->attr_groups;
+	struct device_node *opal, *np;
+	enum sensors type;
+	int err = 0;
+
+	opal = of_find_node_by_path("/ibm,opal/sensors");
+        if (!opal) {
+		pr_err("%s: Opal 'sensors' node not found\n", __func__);
+		return -ENODEV;
+        }
+
+	for_each_child_of_node(opal, np) {
+		if (np->name == NULL)
+			continue;
+
+		for (type = 0; type < MAX_SENSOR_TYPE; type++)
+			if (of_device_is_compatible(np,
+				sensor_groups[type].compatible)) {
+				sensor_groups[type].attr_count++;
+				break;
+			}
+	}
+
+	for (type = 0; type < MAX_SENSOR_TYPE; type++) {
+		if (!sensor_groups[type].attr_count)
+			continue;
+
+		sensor_groups[type].group.attrs = devm_kzalloc(&pdev->dev,
+					sizeof(struct attribute*) *
+					(sensor_groups[type].attr_count + 1),
+					GFP_KERNEL);
+		if (!sensor_groups[type].group.attrs) {
+			pr_err("%s: Failed to allocate memory for attribute"
+					"array\n", __func__);
+			err = -ENOMEM;
+			goto exit_put_node;
+		}
+
+		pgroups[type] = &sensor_groups[type].group;
+		pdata->sensors_count += sensor_groups[type].attr_count;
+		sensor_groups[type].attr_count = 0;
+	}
+
+exit_put_node:
+	of_node_put(opal);
+	return err;
+}
+
+/*
+ * Iterate through the device tree for each child of sensor node, create
+ * a sysfs attribute file, the file is named by translating the DT node name
+ * to the name required by the higher 'hwmon' driver like fan1_input, temp1_max
+ * etc..
+ */
+static int __init create_device_attrs(struct platform_device *pdev)
+{
+	struct platform_data *pdata = platform_get_drvdata(pdev);
+	const struct attribute_group **pgroups = pdata->attr_groups;
+	struct device_node *opal, *np;
+	struct sensor_data *sdata;
+	const u32 *sensor_id;
+	enum sensors type;
+	u32 count = 0;
+	int err = 0;
+
+	opal = of_find_node_by_path("/ibm,opal/sensors");
+        if (!opal) {
+		pr_err("%s: Opal 'sensors' node not found\n", __func__);
+		return -ENODEV;
+        }
+
+	sdata = devm_kzalloc(&pdev->dev, (pdata->sensors_count) *
+			     sizeof(*sdata), GFP_KERNEL);
+	if (!sdata) {
+		pr_err("%s: Failed to allocate memory for the sensor_data",
+				__func__);
+		err = -ENOMEM;
+		goto exit_put_node;
+	}
+
+	for_each_child_of_node(opal, np) {
+                if (np->name == NULL)
+                        continue;
+
+		for (type = 0; type < MAX_SENSOR_TYPE; type++)
+			if (of_device_is_compatible(np,
+					sensor_groups[type].compatible))
+				break;
+
+		if (type == MAX_SENSOR_TYPE)
+			continue;
+
+		sensor_id = of_get_property(np, "sensor-id", NULL);
+		if (!sensor_id) {
+			pr_info("%s: %s doesn't have sensor-id\n", __func__,
+				np->name);
+			continue;
+		}
+
+		sdata[count].id = *sensor_id;
+		sdata[count].type = type;
+		err = create_hwmon_attr_name(type, np->name, sdata[count].name);
+		if (err)
+			goto exit_put_node;
+
+		sysfs_attr_init(&sdata[count].dev_attr.attr);
+		sdata[count].dev_attr.attr.name = sdata[count].name;
+		sdata[count].dev_attr.attr.mode = S_IRUGO;
+		sdata[count].dev_attr.show = show_sensor;
+
+		pgroups[type]->attrs[sensor_groups[type].attr_count++] =
+				&sdata[count++].dev_attr.attr;
+	}
+
+exit_put_node:
+	of_node_put(opal);
+	return err;
+}
+
+static int __init ibmpowernv_probe(struct platform_device *pdev)
+{
+	struct platform_data *pdata;
+	struct device *hwmon_dev;
+	int err;
+
+	pdata = devm_kzalloc(&pdev->dev, sizeof(*pdata), GFP_KERNEL);
+	if (!pdata)
+		return -ENOMEM;
+
+	platform_set_drvdata(pdev, pdata);
+	pdata->sensors_count = 0;
+	err = populate_attr_groups(pdev);
+	if (err)
+		return err;
+
+	/* Create sysfs attribute file for each sensor found in the DT */
+	err = create_device_attrs(pdev);
+	if (err)
+		return err;
+
+	/* Finally, register with hwmon */
+	hwmon_dev = devm_hwmon_device_register_with_groups(&pdev->dev, DRVNAME,
+							   pdata,
+							   pdata->attr_groups);
+
+	return PTR_ERR_OR_ZERO(hwmon_dev);
+}
+
+static struct platform_driver ibmpowernv_driver = {
+	.driver = {
+		.owner = THIS_MODULE,
+		.name = DRVNAME,
+	},
+};
+
+static int __init ibmpowernv_init(void)
+{
+	int err;
+
+	pdevice = platform_device_alloc(DRVNAME, 0);
+	if (!pdevice) {
+		pr_err("%s: Device allocation failed\n", __func__);
+		err = -ENOMEM;
+		goto exit;
+	}
+
+	err = platform_device_add(pdevice);
+	if (err) {
+		pr_err("%s: Device addition failed (%d)\n", __func__, err);
+		goto exit_device_put;
+	}
+
+	err = platform_driver_probe(&ibmpowernv_driver, ibmpowernv_probe);
+	if (err) {
+		pr_err("%s: Platfrom driver probe failed\n", __func__);
+		goto exit_device_del;
+	}
+
+	return 0;
+
+exit_device_del:
+	platform_device_del(pdevice);
+exit_device_put:
+	platform_device_put(pdevice);
+exit:
+	return err;
+}
+
+static void __exit ibmpowernv_exit(void)
+{
+	platform_driver_unregister(&ibmpowernv_driver);
+	platform_device_unregister(pdevice);
+}
+
+MODULE_AUTHOR("Neelesh Gupta <neelegup@linux.vnet.ibm.com>");
+MODULE_DESCRIPTION("IBM POWERNV platform sensors");
+MODULE_LICENSE("GPL");
+
+module_init(ibmpowernv_init);
+module_exit(ibmpowernv_exit);

^ permalink raw reply related

* Re: [PATCH V2 2/3] powerpc, ptrace: Enable support for transactional memory register sets
From: Pedro Alves @ 2014-05-19 14:43 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: mikey, avagin, oleg, linux-kernel, michael, linuxppc-dev
In-Reply-To: <5379EF0E.6090504@linux.vnet.ibm.com>

On 05/19/2014 12:46 PM, Anshuman Khandual wrote:

>> > I couldn't actually find any arch that currently returns -ENODEV in
>> > the "active" hook.  I see that binfmt_elf.c doesn't handle
>> > regset->active() returning < 0.  Guess that may be why.  Looks like
>> > something that could be cleaned up, to me.
>> >
> Also it does not consider the return value of regset->active(t->task, regset)
> (whose objective is to figure out whether we need to request regset->n number
> of elements or less than that) in the subsequent call to regset->get function.

Indeed.

TBC, do you plan on fixing this?  Otherwise ...

> Now coming to the installation of the .active hooks part for all the new regsets, it
> should be pretty straight forward as well. Though its optional and used for elf_core_dump
> purpose only, its worth adding them here. Example of an active function should be something
> like this. The function is inexpensive as required.
> 
> +static int tm_spr_active(struct task_struct *target,
> +                               const struct user_regset *regset)
> +{
> +       if (!cpu_has_feature(CPU_FTR_TM))
> +               return -ENODEV;

... unfortunately this will do the wrong thing.

Thanks,
-- 
Pedro Alves

^ permalink raw reply

* [PATCH] dt/powerpc: Introduce the use of the managed version of kzalloc
From: Himangi Saraogi @ 2014-05-19 17:35 UTC (permalink / raw)
  To: Olof Johansson, Benjamin Herrenschmidt, Paul Mackerras,
	linuxppc-dev, linux-kernel
  Cc: julia.lawall

This patch moves data allocated using kzalloc to managed data allocated
using devm_kzalloc and cleans now unnecessary kfrees in probe and remove
functions. Also, the now unnecessary labels out_free_priv and out are
done away with.

The following Coccinelle semantic patch was used for making the change:

@platform@
identifier p, probefn, removefn;
@@
struct platform_driver p = {
  .probe = probefn,
  .remove = removefn,
};

@prb@
identifier platform.probefn, pdev;
expression e, e1, e2;
@@
probefn(struct platform_device *pdev, ...) {
  <+...
- e = kzalloc(e1, e2)
+ e = devm_kzalloc(&pdev->dev, e1, e2)
  ...
?-kfree(e);
  ...+>
}

@rem depends on prb@
identifier platform.removefn;
expression e;
@@
removefn(...) {
  <...
- kfree(e);
  ...>
}

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
---
Not compile tested due to incompatible architecture

 arch/powerpc/platforms/pasemi/gpio_mdio.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/pasemi/gpio_mdio.c b/arch/powerpc/platforms/pasemi/gpio_mdio.c
index 15adee5..a722a4a 100644
--- a/arch/powerpc/platforms/pasemi/gpio_mdio.c
+++ b/arch/powerpc/platforms/pasemi/gpio_mdio.c
@@ -226,15 +226,14 @@ static int gpio_mdio_probe(struct platform_device *ofdev)
 	const unsigned int *prop;
 	int err;
 
-	err = -ENOMEM;
-	priv = kzalloc(sizeof(struct gpio_priv), GFP_KERNEL);
+	priv = devm_kzalloc(&ofdev->dev, sizeof(struct gpio_priv), GFP_KERNEL);
 	if (!priv)
-		goto out;
+		return -ENOMEM;
 
 	new_bus = mdiobus_alloc();
 
 	if (!new_bus)
-		goto out_free_priv;
+		return -ENOMEM;
 
 	new_bus->name = "pasemi gpio mdio bus";
 	new_bus->read = &gpio_mdio_read;
@@ -268,9 +267,6 @@ static int gpio_mdio_probe(struct platform_device *ofdev)
 
 out_free_irq:
 	kfree(new_bus);
-out_free_priv:
-	kfree(priv);
-out:
 	return err;
 }
 
@@ -283,7 +279,6 @@ static int gpio_mdio_remove(struct platform_device *dev)
 
 	dev_set_drvdata(&dev->dev, NULL);
 
-	kfree(bus->priv);
 	bus->priv = NULL;
 	mdiobus_free(bus);
 
-- 
1.9.1

^ permalink raw reply related

* Re: questions on CONFIG_PPC_ADV_DEBUG_REGS, DBCR0_BRT, and DBCR0_ACTIVE_EVENTS
From: Scott Wood @ 2014-05-19 17:47 UTC (permalink / raw)
  To: shiva7; +Cc: linuxppc-dev
In-Reply-To: <1400456291193-82408.post@n7.nabble.com>

On Sun, 2014-05-18 at 16:38 -0700, shiva7 wrote:
> Thanks Scott.
> 
> Apologize for not quoting the old email reference.

You did it again. :-)

> I assumed the old archive content always will be in the trail.

It's in the archives (as I noted), but it's a pain to search for it
versus having relevant bits quoted.

> Any idea whether the DBCR0 BRT bit actually works(??),

Do you have reason to believe that it might not?

> if I add  DBCR0 under sys_debug context call? because this code seems
> like direct porting and wondering why on first place not included under
> debug_context call. 

Again, it was probably an oversight, but the people who might know for
sure are no longer reachable.

> And also, anything special required for "server" family application code
> porting here ?? as because in server family the trace exception used to viz
> NORMAL exception proglog and uses SRR0 and SRR1 but in this ISA/embedded
> case have dedicated DEBUG_DEBUG prolog and dedicated registers DSRR0 and
> DSRR1.

IIRC the branch taken mechanism does have different semantics than the
equivalent mechanism on server.  You can find discussion of this in the
archives. :-)

-Scott

^ permalink raw reply

* [RFC PATCH v2 1/2] powerpc: numa: enable USE_PERCPU_NUMA_NODE_ID
From: Nishanth Aravamudan @ 2014-05-19 18:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Lee Schermerhorn, Christoph Lameter, Mel Gorman, linux-mm,
	Anton Blanchard, David Rientjes, Joonsoo Kim, linuxppc-dev
In-Reply-To: <20140516233945.GI8941@linux.vnet.ibm.com>

Hi Andrew,

I found one issue with my patch, fixed below...

On 16.05.2014 [16:39:45 -0700], Nishanth Aravamudan wrote:
> Based off 3bccd996 for ia64, convert powerpc to use the generic per-CPU
> topology tracking, specifically:
>     
> 	initialize per cpu numa_node entry in start_secondary
>     	remove the powerpc cpu_to_node()
>     	define CONFIG_USE_PERCPU_NUMA_NODE_ID if NUMA
>     
> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

<snip>

> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index e2a4232..b95be24 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -750,6 +750,11 @@ void start_secondary(void *unused)
>  	}
>  	traverse_core_siblings(cpu, true);
>  
> +	/*
> +	 * numa_node_id() works after this.
> +	 */
> +	set_numa_node(numa_cpu_lookup_table[cpu]);
> +

Similar change is needed for the boot CPU. Update patch:


powerpc: numa: enable USE_PERCPU_NUMA_NODE_ID
    
Based off 3bccd996 for ia64, convert powerpc to use the generic per-CPU
topology tracking, specifically:
    
    initialize per cpu numa_node entry in start_secondary
    remove the powerpc cpu_to_node()
    define CONFIG_USE_PERCPU_NUMA_NODE_ID if NUMA
    
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index e099899..9125964 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -453,6 +453,10 @@ config NODES_SHIFT
 	default "4"
 	depends on NEED_MULTIPLE_NODES
 
+config USE_PERCPU_NUMA_NODE_ID
+	def_bool y
+	depends on NUMA
+
 config ARCH_SELECT_MEMORY_MODEL
 	def_bool y
 	depends on PPC64
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index c920215..5ecf7ea 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -20,19 +20,6 @@ struct device_node;
 
 #include <asm/mmzone.h>
 
-static inline int cpu_to_node(int cpu)
-{
-	int nid;
-
-	nid = numa_cpu_lookup_table[cpu];
-
-	/*
-	 * During early boot, the numa-cpu lookup table might not have been
-	 * setup for all CPUs yet. In such cases, default to node 0.
-	 */
-	return (nid < 0) ? 0 : nid;
-}
-
 #define parent_node(node)	(node)
 
 #define cpumask_of_node(node) ((node) == -1 ?				\
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index e2a4232..d7252ad 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -390,6 +390,7 @@ void smp_prepare_boot_cpu(void)
 #ifdef CONFIG_PPC64
 	paca[boot_cpuid].__current = current;
 #endif
+	set_numa_node(numa_cpu_lookup_table[boot_cpuid]);
 	current_set[boot_cpuid] = task_thread_info(current);
 }
 
@@ -750,6 +751,11 @@ void start_secondary(void *unused)
 	}
 	traverse_core_siblings(cpu, true);
 
+	/*
+	 * numa_node_id() works after this.
+	 */
+	set_numa_node(numa_cpu_lookup_table[cpu]);
+
 	smp_wmb();
 	notify_cpu_starting(cpu);
 	set_cpu_online(cpu, true);

^ permalink raw reply related

* Re: Node 0 not necessary for powerpc?
From: Nishanth Aravamudan @ 2014-05-19 18:24 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, linuxppc-dev, anton, rientjes
In-Reply-To: <20140313164949.GC22247@linux.vnet.ibm.com>

On 13.03.2014 [09:49:49 -0700], Nishanth Aravamudan wrote:
> On 12.03.2014 [08:41:40 -0500], Christoph Lameter wrote:
> > On Tue, 11 Mar 2014, Nishanth Aravamudan wrote:
> > > I have a P7 system that has no node0, but a node0 shows up in numactl
> > > --hardware, which has no cpus and no memory (and no PCI devices):
> > 
> > Well as you see from the code there has been so far the assumption that
> > node 0 has memory. I have never run a machine that has no node 0 memory.
> 
> Do you mean beyond the initialization? I didn't see anything obvious so
> far in the code itself that assumes a given node has memory (in the
> sense of the nid). What are your thoughts about how best to support
> this?

Ah, I found one path that is problematic on powerpc:

I'm seeing a panic at boot with this change on an LPAR which actually
has no Node 0. Here's what I think is happening:

start_kernel
    ...
    -> setup_per_cpu_areas
        -> pcpu_embed_first_chunk
            -> pcpu_fc_alloc
                -> ___alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu), ...
    -> smp_prepare_boot_cpu
        -> set_numa_node(boot_cpuid)

So we panic on the NODE_DATA call. It seems that ia64, at least, uses
pcpu_alloc_first_chunk rather than embed. x86 has some code to handle
early calls of cpu_to_node (early_cpu_to_node) and sets the mapping for
all CPUs in setup_per_cpu_areas().

Thoughts? Does that mean we need something similar to x86 for powerpc?

Thanks,
Nish

^ permalink raw reply

* Re: [PATCH 3/8] drivers/vfio: New IOCTL command VFIO_EEH_INFO
From: Alex Williamson @ 2014-05-19 22:33 UTC (permalink / raw)
  To: Gavin Shan; +Cc: aik, agraf, kvm-ppc, qiudayu, linuxppc-dev
In-Reply-To: <1400040722-29608-4-git-send-email-gwshan@linux.vnet.ibm.com>

On Wed, 2014-05-14 at 14:11 +1000, Gavin Shan wrote:
> The patch adds new IOCTL command VFIO_EEH_INFO to VFIO container
> to support EEH functionality for PCI devices, which have been
> passed from host to guest via VFIO.

Some comments throughout, but overall this seems to forgo every bit of
the device ownership and protection model used by VFIO and lets the user
pick arbitrary host devices and do various operations, mostly unchecked.
That's not acceptable.

> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>  arch/powerpc/platforms/powernv/Makefile   |   1 +
>  arch/powerpc/platforms/powernv/eeh-vfio.c | 593 ++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_iommu_spapr_tce.c       |  12 +
>  include/uapi/linux/vfio.h                 |  57 +++
>  4 files changed, 663 insertions(+)
>  create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c
> 
> diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
> index 63cebb9..2b15a03 100644
> --- a/arch/powerpc/platforms/powernv/Makefile
> +++ b/arch/powerpc/platforms/powernv/Makefile
> @@ -6,5 +6,6 @@ obj-y			+= opal-msglog.o
>  obj-$(CONFIG_SMP)	+= smp.o
>  obj-$(CONFIG_PCI)	+= pci.o pci-p5ioc2.o pci-ioda.o
>  obj-$(CONFIG_EEH)	+= eeh-ioda.o eeh-powernv.o
> +obj-$(CONFIG_VFIO_EEH)	+= eeh-vfio.o
>  obj-$(CONFIG_PPC_SCOM)	+= opal-xscom.o
>  obj-$(CONFIG_MEMORY_FAILURE)	+= opal-memory-errors.o
> diff --git a/arch/powerpc/platforms/powernv/eeh-vfio.c b/arch/powerpc/platforms/powernv/eeh-vfio.c
> new file mode 100644
> index 0000000..69d5f2d
> --- /dev/null
> +++ b/arch/powerpc/platforms/powernv/eeh-vfio.c
> @@ -0,0 +1,593 @@
> +/*
> +  * The file intends to support EEH funtionality for those PCI devices,
> +  * which have been passed through from host to guest via VFIO. So this
> +  * file is naturally part of VFIO implementation on PowerNV platform.
> +  *
> +  * Copyright Benjamin Herrenschmidt & Gavin Shan, IBM Corporation 2014.
> +  *
> +  * This program is free software; you can redistribute it and/or modify
> +  * it under the terms of the GNU General Public License as published by
> +  * the Free Software Foundation; either version 2 of the License, or
> +  * (at your option) any later version.
> +  */
> +
> +#include <linux/init.h>
> +#include <linux/io.h>
> +#include <linux/irq.h>
> +#include <linux/kernel.h>
> +#include <linux/kvm_host.h>
> +#include <linux/msi.h>
> +#include <linux/pci.h>
> +#include <linux/string.h>
> +#include <linux/vfio.h>
> +
> +#include <asm/eeh.h>
> +#include <asm/eeh_event.h>
> +#include <asm/io.h>
> +#include <asm/iommu.h>
> +#include <asm/opal.h>
> +#include <asm/msi_bitmap.h>
> +#include <asm/pci-bridge.h>
> +#include <asm/ppc-pci.h>
> +#include <asm/tce.h>
> +#include <asm/uaccess.h>
> +
> +#include "powernv.h"
> +#include "pci.h"
> +
> +static int powernv_eeh_vfio_map(struct vfio_eeh_info *info)
> +{
> +	struct pci_bus *bus, *pe_bus;
> +	struct pci_dev *pdev;
> +	struct eeh_dev *edev;
> +	struct eeh_pe *pe;
> +	int domain, bus_no, devfn;
> +
> +	/* Host address */
> +	domain = info->map.host_domain;
> +	bus_no = (info->map.host_cfg_addr >> 8) & 0xff;
> +	devfn = info->map.host_cfg_addr & 0xff;

Where are we validating that the user has any legitimate claim to be
touching this device?

> +	/* Find PCI bus */
> +	bus = pci_find_bus(domain, bus_no);
> +	if (!bus) {
> +		pr_warn("%s: PCI bus %04x:%02x not found\n",
> +			__func__, domain, bus_no);
> +		return -ENODEV;
> +	}
> +
> +	/* Find PCI device */
> +	pdev = pci_get_slot(bus, devfn);
> +	if (!pdev) {
> +		pr_warn("%s: PCI device %04x:%02x:%02x.%01x not found\n",
> +			__func__, domain, bus_no,
> +			PCI_SLOT(devfn), PCI_FUNC(devfn));
> +		return -ENODEV;
> +	}
> +
> +	/* No EEH device - almost impossible */
> +	edev = pci_dev_to_eeh_dev(pdev);
> +	if (unlikely(!edev)) {
> +		pci_dev_put(pdev);
> +		pr_warn("%s: No EEH dev for PCI device %s\n",
> +			__func__, pci_name(pdev));
> +		return -ENODEV;
> +	}
> +
> +	/* Doesn't support PE migration between different PHBs */
> +	pe = edev->pe;
> +	if (!eeh_pe_passed(pe)) {
> +		pe_bus = eeh_pe_bus_get(pe);
> +		BUG_ON(!pe_bus);

Can a user trigger this maliciously?

> +
> +		/* PE# has format 00BBSS00 */
> +		pe->guest_addr.buid    = info->map.guest_buid;
> +		pe->guest_addr.pe_addr = pe_bus->number << 16;
> +		eeh_pe_set_passed(pe, true);
> +	} else if (pe->guest_addr.buid != info->map.guest_buid) {
> +		pci_dev_put(pdev);
> +		pr_warn("%s: Mismatched PHB BUID (0x%llx, 0x%llx)\n",
> +			__func__, pe->guest_addr.buid, info->map.guest_buid);
> +		return -EINVAL;
> +	}
> +
> +	edev->guest_addr.buid = info->map.guest_buid;
> +	edev->guest_addr.config_addr = info->map.guest_cfg_addr;
> +	eeh_dev_set_passed(edev, true);
> +
> +	pr_debug("EEH: Host PCI dev %s to %llx-%02x:%02x.%01x\n",
> +		 pci_name(pdev), info->map.guest_buid,
> +		 (info->map.guest_cfg_addr >> 8) & 0xFF,
> +		 PCI_SLOT(info->map.guest_cfg_addr & 0xFF),
> +		 PCI_FUNC(info->map.guest_cfg_addr & 0xFF));
> +
> +	pci_dev_put(pdev);
> +	return 0;
> +}

So the effect of this function is that a user gets to setup an arbitrary
guest mapping for an arbitrary host device and associated pe.  Is that
right?  It seems bad.

> +
> +static int powernv_eeh_vfio_unmap(struct vfio_eeh_info *info)
> +{
> +	struct eeh_vfio_pci_addr addr;
> +	struct pci_dev *pdev;
> +	struct eeh_dev *edev, *tmp;
> +	struct eeh_pe *pe;
> +	bool passed;
> +
> +	/* Get EEH device */
> +	addr.buid = info->unmap.buid;
> +	addr.config_addr = info->unmap.cfg_addr;
> +	edev = eeh_vfio_dev_get(&addr);

eeh_vfio_dev_get() just looks for a "passed" dev and a match for a well
known address space.  Seems very exploitable.

> +	if (!edev) {
> +		pr_warn("%s: Cannot find %llx:%02x:%02x.%01x\n",
> +			__func__, info->unmap.buid,
> +			(info->unmap.cfg_addr >> 8) & 0xFF,
> +			PCI_SLOT(info->unmap.cfg_addr & 0xFF),
> +			PCI_FUNC(info->unmap.cfg_addr & 0xFF));
> +		return -ENODEV;
> +	}
> +
> +	/* Return EEH device */
> +	memset(&edev->guest_addr, 0, sizeof(edev->guest_addr));
> +	eeh_dev_set_passed(edev, false);
> +	pdev = eeh_dev_to_pci_dev(edev);
> +	pr_debug("EEH: Host PCI dev %s returned\n",
> +		 pdev ? pci_name(pdev) : "NULL");
> +
> +	/* Return PE if no EEH device is owned by guest */
> +	pe = edev->pe;
> +	passed = false;
> +	eeh_pe_for_each_dev(pe, edev, tmp) {
> +		pdev = eeh_dev_to_pci_dev(edev);
> +		if (pdev && pdev->subordinate)
> +			continue;
> +
> +		if (eeh_dev_passed(edev)) {
> +			passed = true;
> +			break;
> +		}
> +	}
> +
> +	if (!passed) {
> +		memset(&pe->guest_addr, 0, sizeof(pe->guest_addr));
> +		eeh_pe_set_passed(pe, false);
> +		pr_debug("EEH: PHB#%x-PE#%x returned to host\n",
> +			 pe->phb->global_number, pe->addr);
> +	}
> +
> +	return 0;
> +}
> +
> +static int powernv_eeh_vfio_set_option(struct vfio_eeh_info *info)
> +{
> +	struct pnv_phb *phb;
> +	struct eeh_dev *edev;
> +	struct eeh_pe *pe;
> +	struct eeh_vfio_pci_addr addr;
> +	int opcode = info->option.option;
> +	int ret = 0;
> +
> +	/* Check opcode */
> +	if (opcode < EEH_OPT_DISABLE || opcode > EEH_OPT_THAW_DMA) {
> +		pr_warn("%s: opcode %d out of range (%d, %d)\n",
> +			__func__, opcode, EEH_OPT_DISABLE, EEH_OPT_THAW_DMA);
> +		ret = 3;

Please don't make up arbitrary return values.

> +		goto out;
> +	}
> +
> +	/* Option "enable" uses PCI config address */
> +	if (opcode == EEH_OPT_ENABLE) {
> +		addr.buid = info->option.buid;
> +		addr.config_addr = (info->option.addr >> 8) & 0xFFFF;
> +		edev = eeh_vfio_dev_get(&addr);
> +		if (!edev) {
> +			pr_warn("%s: Cannot find %llx:%02x:%02x.%01x\n",
> +				__func__, addr.buid,
> +				(addr.config_addr >> 8) & 0xFF,
> +				PCI_SLOT(addr.config_addr & 0xFF),
> +				PCI_FUNC(addr.config_addr & 0xFF));
> +			ret = 7;
> +			goto out;
> +		}
> +		phb = edev->phb->private_data;
> +	} else {
> +		addr.buid    = info->option.buid;
> +		addr.pe_addr = info->option.addr;
> +		pe = eeh_vfio_pe_get(&addr);
> +		if (!pe) {
> +			pr_warn("%s: Cannot find PE %llx:%x\n",
> +				__func__, addr.buid, addr.pe_addr);
> +			ret = 7;
> +			goto out;
> +		}
> +		phb = pe->phb->private_data;
> +	}
> +
> +	/* Insure that the EEH stuff has been initialized */
> +	if (!(phb->flags & PNV_PHB_FLAG_EEH)) {
> +		pr_warn("%s: EEH disabled on PHB#%d\n",
> +			__func__, phb->hose->global_number);
> +		ret = 7;
> +		goto out;
> +	}
> +
> +	/*
> +	 * The EEH functionality has been enabled on all PEs
> +	 * by default. So just return success. The same situation
> +	 * would be applied while we disable EEH functionality.
> +	 * However, the guest isn't expected to disable that
> +	 * at all.
> +	 */
> +	if (opcode == EEH_OPT_DISABLE ||
> +	    opcode == EEH_OPT_ENABLE) {
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	/*
> +	 * Call into the IODA dependent backend in order
> +	 * to enable DMA or MMIO for the indicated PE.
> +	 */
> +	if (phb->eeh_ops && phb->eeh_ops->set_option) {
> +		if (phb->eeh_ops->set_option(pe, opcode)) {
> +			pr_warn("%s: Failure from backend\n",
> +				__func__);
> +			ret = 1;
> +		}
> +	} else {
> +		pr_warn("%s: Unsupported request\n",
> +			__func__);
> +		ret = 7;
> +	}
> +
> +out:
> +	return ret;
> +}
> +
> +static int powernv_eeh_vfio_get_addr(struct vfio_eeh_info *info)
> +{
> +	struct pnv_phb *phb;
> +	struct eeh_dev *edev;
> +	struct eeh_vfio_pci_addr addr;
> +	int opcode = info->addr.option;
> +	int ret = 0;
> +
> +	/* Check opcode */
> +	if (opcode != 0 && opcode != 1) {
> +		pr_warn("%s: opcode %d out of range (0, 1)\n",
> +			__func__, opcode);
> +		ret = 3;
> +		goto out;
> +	}
> +
> +	/* Find EEH device */
> +	addr.buid = info->addr.buid;
> +	addr.config_addr = (info->addr.cfg_addr >> 8 ) & 0xFFFF;
> +	edev = eeh_vfio_dev_get(&addr);
> +	if (!edev) {
> +		pr_warn("%s: Cannot find %llx:%02x:%02x.%01x\n",
> +			__func__, addr.buid,
> +			(addr.config_addr >> 8) & 0xFF,
> +			PCI_SLOT(addr.config_addr & 0xFF),
> +			PCI_FUNC(addr.config_addr & 0xFF));
> +		ret = 7;
> +		goto out;
> +	}
> +	phb = edev->phb->private_data;
> +
> +	/* EEH enabled ? */
> +	if (!(phb->flags & PNV_PHB_FLAG_EEH)) {
> +		pr_warn("%s: EEH disabled on PHB#%d\n",
> +			__func__, phb->hose->global_number);
> +		ret = 3;
> +		goto out;
> +	}
> +
> +	/* EEH device passed ? */
> +	if (!eeh_dev_passed(edev)) {
> +		pr_warn("%s: EEH dev %llx:%02x:%02x.%01x owned by host\n",
> +			__func__, addr.buid,
> +			(addr.config_addr >> 8) & 0xFF,
> +			PCI_SLOT(addr.config_addr & 0xFF),
> +			PCI_FUNC(addr.config_addr & 0xFF));
> +		ret = 3;
> +		goto out;
> +	}
> +
> +	/*
> +	 * Fill result according to opcode. We don't differentiate
> +	 * PCI bus and device sensitive PE here.
> +	 */
> +	if (opcode == 0)
> +		info->addr.ret = edev->pe->guest_addr.pe_addr;
> +	else
> +		info->addr.ret = 1;
> +out:
> +	return ret;
> +}
> +
> +static int powernv_eeh_vfio_get_state(struct vfio_eeh_info *info)
> +{
> +	struct pnv_phb *phb;
> +	struct eeh_pe *pe;
> +	struct eeh_vfio_pci_addr addr;
> +	int result, ret = 0;
> +
> +	/* Locate the PE */
> +	addr.buid    = info->state.buid;
> +	addr.pe_addr = info->state.pe_addr;
> +	pe = eeh_vfio_pe_get(&addr);
> +	if (!pe) {
> +		pr_warn("%s: Cannot locate %llx:%x\n",
> +			__func__, addr.buid, addr.pe_addr);
> +		ret = 3;
> +		goto out;
> +	}
> +	phb = pe->phb->private_data;
> +
> +	/* EEH enabled ? */
> +	if (!(phb->flags & PNV_PHB_FLAG_EEH)) {
> +		pr_warn("%s: EEH disabled on PHB#%d\n",
> +			__func__, phb->hose->global_number);
> +		ret = 3;
> +		goto out;
> +	}
> +
> +	/* Call to the IOC dependent function */
> +	if (phb->eeh_ops && phb->eeh_ops->get_state) {
> +		result = phb->eeh_ops->get_state(pe);
> +
> +		if (!(result & EEH_STATE_RESET_ACTIVE) &&
> +		     (result & EEH_STATE_DMA_ENABLED) &&
> +		     (result & EEH_STATE_MMIO_ENABLED))
> +			info->state.state = 0;
> +		else if (result & EEH_STATE_RESET_ACTIVE)
> +			info->state.state = 1;
> +		else if (!(result & EEH_STATE_RESET_ACTIVE) &&
> +			 !(result & EEH_STATE_DMA_ENABLED) &&
> +			 !(result & EEH_STATE_MMIO_ENABLED))
> +			info->state.state = 2;
> +		else if (!(result & EEH_STATE_RESET_ACTIVE) &&
> +			 (result & EEH_STATE_DMA_ENABLED) &&
> +			 !(result & EEH_STATE_MMIO_ENABLED))
> +			info->state.state = 4;
> +		else
> +			info->state.state = 5;
> +
> +		ret = 0;
> +	} else {
> +		pr_warn("%s: Unsupported request\n", __func__);
> +		ret = 3;
> +	}
> +
> +out:
> +	return ret;
> +}
> +
> +static int powernv_eeh_vfio_pe_reset(struct vfio_eeh_info *info)
> +{
> +	struct pnv_phb *phb;
> +	struct eeh_pe *pe;
> +	struct eeh_vfio_pci_addr addr;
> +	int opcode = info->reset.option;
> +	int ret = 0;
> +
> +	/* Check opcode */
> +	if (opcode != EEH_RESET_DEACTIVATE &&
> +	    opcode != EEH_RESET_HOT &&
> +	    opcode != EEH_RESET_FUNDAMENTAL) {
> +		pr_warn("%s: Unsupported opcode %d\n",
> +			__func__, opcode);

Console warnings are exploitable DoS attacks.

> +		ret = 3;
> +		goto out;
> +	}
> +
> +	/* Locate the PE */
> +	addr.buid    = info->reset.buid;
> +	addr.pe_addr = info->reset.pe_addr;
> +	pe = eeh_vfio_pe_get(&addr);
> +	if (!pe) {
> +		pr_warn("%s: Cannot locate %llx:%x\n",
> +			__func__, addr.buid, addr.pe_addr);
> +		ret = 3;
> +		goto out;
> +	}
> +	phb = pe->phb->private_data;
> +
> +	/* EEH enabled ? */
> +	if (!(phb->flags & PNV_PHB_FLAG_EEH)) {
> +		pr_warn("%s: EEH disabled on PHB#%d\n",
> +			__func__, phb->hose->global_number);
> +		ret = 3;
> +		goto out;
> +	}
> +
> +	/* Call into the IODA dependent backend to do the reset */
> +	if (!phb->eeh_ops ||
> +	    !phb->eeh_ops->set_option ||
> +	    !phb->eeh_ops->reset) {
> +		pr_warn("%s: Unsupported request\n",
> +			__func__);
> +		ret = 7;
> +	} else {
> +		/*
> +		 * The frozen PE might be caused by the mechanism called
> +		 * PAPR error injection, which is supposed to be one-shot
> +		 * without "sticky" bit as being stated by the spec. But
> +		 * the reality isn't that, at least on P7IOC. So we have
> +		 * to clear that to avoid recrusive error, which fails the
> +		 * recovery eventually.
> +		 */
> +		if (opcode == EEH_RESET_DEACTIVATE)
> +			opal_pci_reset(phb->opal_id,
> +				       OPAL_PHB_ERROR,
> +				       OPAL_ASSERT_RESET);
> +
> +		if (phb->eeh_ops->reset(pe, opcode)) {
> +			pr_warn("%s: Failure from backend\n", __func__);
> +			ret = 1;
> +			goto out;
> +		}
> +
> +		/*
> +		 * The PE is still in frozen state and we need clear that.
> +		 * It's good to clear frozen state after deassert to avoid
> +		 * messy IO access during reset, which might cause recrusive
> +		 * frozen PE.
> +		 */
> +		if (opcode == EEH_RESET_DEACTIVATE) {
> +			if (phb->eeh_ops->set_option(pe, EEH_OPT_THAW_MMIO) ||
> +			    phb->eeh_ops->set_option(pe, EEH_OPT_THAW_DMA)) {
> +				pr_warn("%s: Cannot clear frozen state\n",
> +					__func__);
> +				ret = 1;
> +			}
> +
> +			eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
> +		}
> +	}
> +
> +out:
> +	return ret;
> +}
> +
> +static int powernv_eeh_vfio_pe_config(struct vfio_eeh_info *info)
> +{
> +	struct pnv_phb *phb;
> +	struct eeh_pe *pe;
> +	struct eeh_vfio_pci_addr addr;
> +	int ret = 0;
> +
> +	/* Locate the PE */
> +	addr.buid    = info->config.buid;
> +	addr.pe_addr = info->config.pe_addr;
> +	pe = eeh_vfio_pe_get(&addr);
> +	if (!pe) {
> +		pr_warn("%s: Cannot locate %llx:%x\n",
> +			__func__, addr.buid, addr.pe_addr);
> +		ret = 3;
> +		goto out;
> +	}
> +	phb = pe->phb->private_data;
> +
> +	/* EEH enabled ? */
> +	if (!(phb->flags & PNV_PHB_FLAG_EEH)) {
> +		pr_warn("%s: EEH disabled on PHB#%d\n",
> +			__func__, phb->hose->global_number);
> +		ret = 3;
> +		goto out;
> +        }
> +
> +	/*
> +	 * The access to PCI config space on VFIO device has some
> +	 * limitations. Part of PCI config space, including BAR
> +	 * registers are not readable and writable. So the guest
> +	 * should have stale values for those registers and we have
> +	 * to restore them in host side.
> +	 */
> +	eeh_pe_restore_bars(pe);
> +out:
> +	return ret;
> +}
> +
> +void eeh_vfio_release(struct iommu_table *tbl)
> +{
> +	struct pnv_ioda_pe *pnv_pe = container_of(tbl, struct pnv_ioda_pe,
> +						  tce32_table);
> +	struct pnv_phb *phb = pnv_pe->phb;
> +	struct eeh_pe *phb_pe, *pe;
> +	struct eeh_dev dev, *edev, *tmp;
> +
> +	/* Find PHB PE */
> +	phb_pe = eeh_phb_pe_get(phb->hose);
> +	if (unlikely(!phb_pe)) {
> +		pr_warn("%s: Cannot find PHB#%d PE\n",
> +			__func__, phb->hose->global_number);
> +		return;
> +	}
> +
> +	/* Find PE */
> +	memset(&dev, 0, sizeof(struct eeh_dev));
> +	dev.phb = phb->hose;
> +	dev.pe_config_addr = pnv_pe->pe_number;
> +	pe = eeh_pe_get(&dev);
> +	if (unlikely(!pe)) {
> +		pr_warn("%s: Cannot find PE instance for PHB#%d-PE#%d\n",
> +			__func__, phb->hose->global_number,
> +			pnv_pe->pe_number);
> +		return;
> +	}
> +
> +	/* Release it to host */
> +	if (!eeh_pe_passed(pe))
> +		return;
> +
> +	eeh_pe_for_each_dev(pe, edev, tmp) {
> +		if (!eeh_dev_passed(edev))
> +			continue;
> +
> +		memset(&edev->guest_addr, 0, sizeof(edev->guest_addr));

Is guest_addr = { 0 } not valid?  As agraf already mentioned, there are
a number of issues with using a guest_address for a token.

> +		eeh_dev_set_passed(edev, false);
> +	}
> +
> +	memset(&pe->guest_addr, 0, sizeof(pe->guest_addr));
> +	eeh_pe_set_passed(pe, false);
> +}
> +EXPORT_SYMBOL(eeh_vfio_release);
> +
> +int eeh_vfio_ioctl(unsigned long arg)
> +{
> +	struct vfio_eeh_info info;
> +	int ret = -EINVAL;
> +
> +	/* Copy over user argument */
> +	if (copy_from_user(&info, (void __user *)arg, sizeof(info))) {
> +		pr_warn("%s: Cannot copy user argument 0x%lx\n",
> +			__func__, arg);
> +		return -EFAULT;
> +	}
> +
> +	/* Sanity check */
> +	if (info.argsz != sizeof(info)) {

This breaks compatibility if you need to later add a new ops with a
larger footprint.

> +		pr_warn("%s: Invalid argument size (%d, %ld)\n",
> +			__func__, info.argsz, sizeof(info));
> +		return -EINVAL;
> +	}
> +
> +	/* Route according to operation */
> +	switch (info.op) {
> +	case VFIO_EEH_OP_MAP:
> +		ret = powernv_eeh_vfio_map(&info);
> +		break;
> +	case VFIO_EEH_OP_UNMAP:
> +		ret = powernv_eeh_vfio_unmap(&info);
> +		break;
> +	case VFIO_EEH_OP_SET_OPTION:
> +		ret = powernv_eeh_vfio_set_option(&info);
> +		break;
> +	case VFIO_EEH_OP_GET_ADDR:
> +		ret = powernv_eeh_vfio_get_addr(&info);
> +		break;
> +	case VFIO_EEH_OP_GET_STATE:
> +		ret = powernv_eeh_vfio_get_state(&info);
> +		break;
> +	case VFIO_EEH_OP_PE_RESET:
> +		ret = powernv_eeh_vfio_pe_reset(&info);
> +		break;
> +	case VFIO_EEH_OP_PE_CONFIG:
> +		ret = powernv_eeh_vfio_pe_config(&info);
> +		break;
> +	default:
> +		pr_info("%s: Cannot handle op#%d\n",
> +			__func__, info.op);
> +	}
> +
> +	/* Copy data back */
> +	if (!ret && copy_to_user((void __user *)arg, &info, sizeof(info))) {
> +		pr_warn("%s: Cannot copy to user 0x%lx\n",
> +			__func__, arg);
> +		return -EFAULT;
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(eeh_vfio_ioctl);
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index a84788b..c45dece 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -26,6 +26,11 @@
>  #define DRIVER_AUTHOR   "aik@ozlabs.ru"
>  #define DRIVER_DESC     "VFIO IOMMU SPAPR TCE"
>  
> +#ifdef CONFIG_VFIO_EEH
> +extern void eeh_vfio_release(struct iommu_table *tbl);
> +extern int eeh_vfio_ioctl(unsigned long arg);
> +#endif
> +
>  static void tce_iommu_detach_group(void *iommu_data,
>  		struct iommu_group *iommu_group);
>  
> @@ -283,6 +288,10 @@ static long tce_iommu_ioctl(void *iommu_data,
>  		tce_iommu_disable(container);
>  		mutex_unlock(&container->lock);
>  		return 0;
> +#ifdef CONFIG_VFIO_EEH

I'm not a fan of all these #ifdefs, hide it in eeh_vfio_ioctl() and
eeh_vfio_release() if needed.

> +	case VFIO_EEH_INFO:
> +		return eeh_vfio_ioctl(arg);
> +#endif
>  	}
>  
>  	return -ENOTTY;
> @@ -342,6 +351,9 @@ static void tce_iommu_detach_group(void *iommu_data,
>  		/* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
>  				iommu_group_id(iommu_group), iommu_group); */
>  		container->tbl = NULL;
> +#ifdef CONFIG_VFIO_EEH
> +		eeh_vfio_release(tbl);
> +#endif
>  		iommu_release_ownership(tbl);
>  	}
>  	mutex_unlock(&container->lock);
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index cb9023d..1fd1bfb 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -455,6 +455,63 @@ struct vfio_iommu_spapr_tce_info {
>  
>  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
>  
> +/*
> + * The VFIO EEH info struct provides way to support EEH functionality
> + * for PCI device that is passed from host to guest via VFIO.
> + */
> +#define VFIO_EEH_OP_MAP		0
> +#define VFIO_EEH_OP_UNMAP	1
> +#define VFIO_EEH_OP_SET_OPTION	2
> +#define VFIO_EEH_OP_GET_ADDR	3
> +#define VFIO_EEH_OP_GET_STATE	4
> +#define VFIO_EEH_OP_PE_RESET	5
> +#define VFIO_EEH_OP_PE_CONFIG	6

Is this really an "info" ioctl?

> +
> +struct vfio_eeh_info {
> +	__u32 argsz;
> +	__u32 op;
> +
> +	union {
> +		struct vfio_eeh_map {
> +			__u32 host_domain;
> +			__u16 host_cfg_addr;
> +			__u64 guest_buid;
> +			__u16 guest_cfg_addr;
> +		} map;
> +		struct vfio_eeh_unmap {
> +			__u64 buid;
> +			__u16 cfg_addr;
> +		} unmap;
> +		struct vfio_eeh_set_option {
> +			__u64 buid;
> +			__u32 addr;
> +			__u32 option;
> +		} option;
> +		struct vfio_eeh_pe_addr {
> +			__u64 buid;
> +			__u32 cfg_addr;
> +			__u32 option;
> +			__u32 ret;
> +		} addr;
> +		struct vfio_eeh_state {
> +			__u64 buid;
> +			__u32 pe_addr;
> +			__u32 state;
> +                } state;
> +		struct vfio_eeh_reset {
> +			__u64 buid;
> +			__u32 pe_addr;
> +			__u32 option;
> +		} reset;
> +		struct vfio_eeh_config {
> +			__u64 buid;
> +			__u32 pe_addr;
> +		} config;
> +	};
> +};
> +
> +#define VFIO_EEH_INFO	_IO(VFIO_TYPE, VFIO_BASE + 21)
> +
>  /* ***************************************************************** */
>  
>  #endif /* _UAPIVFIO_H */

^ permalink raw reply

* Re: [PATCH 2/8] powerpc/eeh: Info to trace passed devices
From: Benjamin Herrenschmidt @ 2014-05-19 22:37 UTC (permalink / raw)
  To: Alexander Graf
  Cc: aik, Gavin Shan, kvm-ppc, alex.williamson, qiudayu, linuxppc-dev
In-Reply-To: <5379FD19.2020100@suse.de>

On Mon, 2014-05-19 at 14:46 +0200, Alexander Graf wrote:
> I don't see the point of VFIO knowing about guest addresses. They are 
> not unique across a system and the whole idea that a VFIO device has to 
> be owned by a guest is also pretty dubious.
> 
> I suppose what you really care about here is just a token for a specific 
> device? But why do you need one where we don't have tokens yet?

I think this is going to be needed when doing in-kernel acceleration
of some of the RTAS calls, but yes, Gavin, why do we need that now ?

Cheers,
Ben.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox