Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: "Poosa, Karthik" <karthik.poosa@intel.com>
To: Raag Jadav <raag.jadav@intel.com>
Cc: <intel-xe@lists.freedesktop.org>, <anshuman.gupta@intel.com>,
	<badal.nilawar@intel.com>, <rodrigo.vivi@intel.com>,
	<riana.tauro@intel.com>
Subject: Re: [PATCH v3 2/4] drm/xe/hwmon: Expose memory controller temperature
Date: Wed, 24 Dec 2025 15:43:05 +0530	[thread overview]
Message-ID: <ba9efeca-6733-4af2-a0fc-77bdbe765ccf@intel.com> (raw)
In-Reply-To: <aUUE7IXfU2zxyRLz@black.igk.intel.com>


On 19-12-2025 13:25, Raag Jadav wrote:
> On Tue, Dec 16, 2025 at 05:10:28PM +0530, Karthik Poosa wrote:
>> Expose GPU memory controller average temperature and its limits
>> under temp4_xxx.
>> Update Xe hwmon documentation for this.
>>
>> v2:
>>   - Rephrase commit message. (Badal)
>>   - Update kernel version in Xe hwmon documentation. (Raag)
>>
>> Signed-off-by: Karthik Poosa <karthik.poosa@intel.com>
>> ---
>>   .../ABI/testing/sysfs-driver-intel-xe-hwmon   | 24 +++++
>>   drivers/gpu/drm/xe/xe_hwmon.c                 | 94 ++++++++++++++++++-
>>   drivers/gpu/drm/xe/xe_pcode_api.h             |  4 +
>>   3 files changed, 118 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/ABI/testing/sysfs-driver-intel-xe-hwmon b/Documentation/ABI/testing/sysfs-driver-intel-xe-hwmon
>> index c8f211c336be..81f9b5d58850 100644
>> --- a/Documentation/ABI/testing/sysfs-driver-intel-xe-hwmon
>> +++ b/Documentation/ABI/testing/sysfs-driver-intel-xe-hwmon
>> @@ -236,3 +236,27 @@ Contact:	intel-xe@lists.freedesktop.org
>>   Description:	RO. VRAM critical temperature in millidegree Celsius.
>>   
>>   		Only supported for particular Intel Xe graphics platforms.
>> +
>> +What:		/sys/bus/pci/drivers/xe/.../hwmon/hwmon<i>/temp4_input
>> +Date:		December 2025
>> +KernelVersion:	6.19
>> +Contact:	intel-xe@lists.freedesktop.org
>> +Description:	RO. Memory controller average temperature in millidegree Celsius.
>> +
>> +		Only supported for particular Intel Xe graphics platforms.
>> +
>> +What:		/sys/bus/pci/drivers/xe/.../hwmon/hwmon<i>/temp4_emergency
>> +Date:		December 2025
>> +KernelVersion:	6.19
>> +Contact:	intel-xe@lists.freedesktop.org
>> +Description:	RO. Memory controller shutdown temperature in millidegree Celsius.
>> +
>> +		Only supported for particular Intel Xe graphics platforms.
>> +
>> +What:		/sys/bus/pci/drivers/xe/.../hwmon/hwmon<i>/temp4_crit
>> +Date:		December 2025
>> +KernelVersion:	6.19
>> +Contact:	intel-xe@lists.freedesktop.org
>> +Description:	RO. Memory controller critical temperature in millidegree Celsius.
>> +
>> +		Only supported for particular Intel Xe graphics platforms.
> Same as patch 1. These attributes defer from ABI definition so not my call.
>
>> diff --git a/drivers/gpu/drm/xe/xe_hwmon.c b/drivers/gpu/drm/xe/xe_hwmon.c
>> index 66a8c3e40027..6d31ad74cd0e 100644
>> --- a/drivers/gpu/drm/xe/xe_hwmon.c
>> +++ b/drivers/gpu/drm/xe/xe_hwmon.c
>> @@ -43,6 +43,7 @@ enum xe_hwmon_channel {
>>   	CHANNEL_CARD,
>>   	CHANNEL_PKG,
>>   	CHANNEL_VRAM,
>> +	CHANNEL_MCTRL,
>>   	CHANNEL_MAX,
>>   };
>>   
>> @@ -99,6 +100,11 @@ enum sensor_attr_power {
>>    */
>>   #define PL_WRITE_MBX_TIMEOUT_MS	(1)
>>   
>> +/*
>> + * Number of thermal sensors.
>> + */
>> +#define MAX_THERMAL_SENSORS    (255)
>> +
>>   /**
>>    * struct xe_hwmon_energy_info - to accumulate energy
>>    */
>> @@ -129,6 +135,10 @@ struct xe_hwmon_thermal_info {
>>   		/** @data: temperature limits in dwords */
>>   		u32 data[2];
>>   	};
>> +	/** @count: no of temperature sensors */
>> +	u8 count;
>> +	/** @value: value from each sensors S1.7 format. */
>> +	u8 value[MAX_THERMAL_SENSORS];
>>   };
>>   
>>   /**
>> @@ -702,6 +712,7 @@ static const struct hwmon_channel_info * const hwmon_info[] = {
>>   			   HWMON_T_LABEL,
>>   			   HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_EMERGENCY | HWMON_T_CRIT |
>>   			   HWMON_T_MAX,
>> +			   HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_EMERGENCY | HWMON_T_CRIT,
>>   			   HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_EMERGENCY | HWMON_T_CRIT),
>>   	HWMON_CHANNEL_INFO(power, HWMON_P_MAX | HWMON_P_RATED_MAX | HWMON_P_LABEL | HWMON_P_CRIT |
>>   			   HWMON_P_CAP,
>> @@ -717,6 +728,7 @@ static int xe_hwmon_pcode_read_thermal_info(struct xe_hwmon *hwmon)
>>   {
>>   	struct xe_tile *root_tile = xe_device_get_root_tile(hwmon->xe);
>>   	int ret = 0;
>> +	u32 val = 0;
>>   
>>   	if (!hwmon->xe->info.has_mbx_power_limits)
>>   		return -EOPNOTSUPP;
>> @@ -727,9 +739,54 @@ static int xe_hwmon_pcode_read_thermal_info(struct xe_hwmon *hwmon)
>>   	drm_dbg(&hwmon->xe->drm, "thermal info read ret %d, val 0x%x val1 0x%x\n", ret,
>>   		hwmon->temp.data[0], hwmon->temp.data[1]);
>>   
>> +	/* Read thermal config */
> Comments are usually helpful for something that's not obvious from the code,
> so no need for tautology (and also in all other places where applicable).
will remove this
>
>> +	ret = xe_pcode_read(root_tile, PCODE_MBOX(PCODE_THERMAL_INFO, READ_THERMAL_CONFIG, 0),
>> +			    &val, NULL);
>> +	drm_dbg(&hwmon->xe->drm, "thermal config read ret %d, count %d\n", ret, val);
> This is redundant in error cases because the pcode helper already prints the
> error code and count won't be present.
it will be helpful to know the count during debug
>
>> +	if (ret)
>> +		return ret;
>> +
>> +	hwmon->temp.count = val & TEMP_MASK;
>> +	if (hwmon->temp.count > MAX_THERMAL_SENSORS) {
> How does a u8 variable exceed the value of 255? ...
it is not needed, this can be removed
>
>> +		drm_warn(&hwmon->xe->drm, "thermal config count %d exceeds supported limit %d\n",
>> +			 hwmon->temp.count, MAX_THERMAL_SENSORS);
>> +		ret = -ENOMEM;
> ... and how does it translate to 'out of memory'?

Earlier patch we have allocated memory only for 11 sensors, so we had 
this check,

it is not needed now as we are storing memory for MAX_THERMAL_SENSORS

>
>> +	}
>>   	return ret;
>>   }
>>   
>> +static int get_mc_temp(struct xe_hwmon *hwmon, long *val)
>> +{
>> +	struct xe_tile *root_tile = xe_device_get_root_tile(hwmon->xe);
>> +	int ret = 0, i = 0;
>> +
>> +	for (i = 0; i < (hwmon->temp.count / 4); i++) {
> Magic 4.
sizeof(u32), will change this
>
>> +		/* Read thermal data */
>> +		ret = xe_pcode_read(root_tile, PCODE_MBOX(PCODE_THERMAL_INFO, READ_THERMAL_DATA, i),
>> +				    (uint32_t *)&hwmon->temp.value[i * 4], NULL);
> Let's not solve a problem that doesn't exist and try to work without
> a cast.
>
>> +		drm_dbg(&hwmon->xe->drm, "thermal data for group %d ret %d, val 0x%x\n", i, ret,
>> +			(u32)hwmon->temp.value[i * 4]);
> Ditto. And you're using both uint32_t and u32, make it consistent.
ok
>
>> +		if (ret)
>> +			return ret;
>> +	}
>> +	if (hwmon->temp.count % 4) {
>> +		ret = xe_pcode_read(root_tile, PCODE_MBOX(PCODE_THERMAL_INFO, READ_THERMAL_DATA, i),
>> +				    (uint32_t *)&hwmon->temp.value[i * 4], NULL);
> Ditto.
>
>> +		drm_dbg(&hwmon->xe->drm, "thermal data for group %d ret %d, val 0x%x\n", i, ret,
>> +			(u32)hwmon->temp.value[i * 4]);
> Ditto.
>
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	*val = 0;
>> +	for (i = TEMP_INDEX_MCTRL; i < hwmon->temp.count; i++) {
>> +		*val += (hwmon->temp.value[i] & TEMP_MASK_MAILBOX) *
>> +			((hwmon->temp.value[i] & 0x80) ? -1 : 1);
> Now, this is something that needs an explanation but doesn't have it ...
Signed for S1.7, bit 7 for sign, 6:0 for value, adding comment in next 
revision
>
>> +	}
>> +	*val = (*val / hwmon->temp.count) * MILLIDEGREE_PER_DEGREE;
> ... and the math is independent of the assignment so let's do them
> separately.
okay
>
>> +	return 0;
>> +}
>> +
>>   /* I1 is exposed as power_crit or as curr_crit depending on bit 31 */
>>   static int xe_hwmon_pcode_read_i1(const struct xe_hwmon *hwmon, u32 *uval)
>>   {
>> @@ -827,6 +884,8 @@ static void xe_hwmon_get_voltage(struct xe_hwmon *hwmon, int channel, long *valu
>>   static umode_t
>>   xe_hwmon_temp_is_visible(struct xe_hwmon *hwmon, u32 attr, int channel)
>>   {
>> +	long val = 0;
>> +
>>   	switch (attr) {
>>   	case hwmon_temp_emergency:
>>   		switch (channel) {
>> @@ -834,6 +893,8 @@ xe_hwmon_temp_is_visible(struct xe_hwmon *hwmon, u32 attr, int channel)
>>   			return hwmon->temp.limit[TEMP_LIMIT_PKG_SHUTDOWN] ? 0444 : 0;
>>   		case CHANNEL_VRAM:
>>   			return hwmon->temp.limit[TEMP_LIMIT_MEM_SHUTDOWN] ? 0444 : 0;
>> +		case CHANNEL_MCTRL:
>> +			return (!get_mc_temp(hwmon, &val)) ? 0444 : 0;
> Do we really need to read it all here?
we can make visible based on hwmon->temp.count > 0
>
>>   		default:
>>   			return 0;
>>   		}
>> @@ -844,6 +905,8 @@ xe_hwmon_temp_is_visible(struct xe_hwmon *hwmon, u32 attr, int channel)
>>   			return hwmon->temp.limit[TEMP_LIMIT_PKG_TJMAX] ? 0444 : 0;
>>   		case CHANNEL_VRAM:
>>   			return hwmon->temp.limit[TEMP_LIMIT_MEM_TJMAX] ? 0444 : 0;
>> +		case CHANNEL_MCTRL:
>> +			return (!get_mc_temp(hwmon, &val)) ? 0444 : 0;
> Ditto.
>
>>   		default:
>>   			return 0;
>>   		}
>> @@ -858,7 +921,16 @@ xe_hwmon_temp_is_visible(struct xe_hwmon *hwmon, u32 attr, int channel)
>>   		break;
>>   	case hwmon_temp_input:
>>   	case hwmon_temp_label:
>> -		return xe_reg_is_valid(xe_hwmon_get_reg(hwmon, REG_TEMP, channel)) ? 0444 : 0;
>> +		switch (channel) {
>> +		case CHANNEL_PKG:
>> +		case CHANNEL_VRAM:
>> +			return xe_reg_is_valid(xe_hwmon_get_reg(hwmon, REG_TEMP,
>> +					       channel)) ? 0444 : 0;
>> +		case CHANNEL_MCTRL:
>> +			return (!get_mc_temp(hwmon, &val)) ? 0444 : 0;
> Ditto.
>
>> +		default:
>> +			return 0;
>> +		}
>>   	default:
>>   		return 0;
>>   	}
>> @@ -872,10 +944,20 @@ xe_hwmon_temp_read(struct xe_hwmon *hwmon, u32 attr, int channel, long *val)
>>   
>>   	switch (attr) {
>>   	case hwmon_temp_input:
>> -		reg_val = xe_mmio_read32(mmio, xe_hwmon_get_reg(hwmon, REG_TEMP, channel));
>> +		switch (channel) {
>> +		case CHANNEL_PKG:
>> +		case CHANNEL_VRAM:
>> +			reg_val = xe_mmio_read32(mmio, xe_hwmon_get_reg(hwmon, REG_TEMP, channel));
>>   
>> -		/* HW register value is in degrees Celsius, convert to millidegrees. */
>> -		*val = REG_FIELD_GET(TEMP_MASK, reg_val) * MILLIDEGREE_PER_DEGREE;
>> +			/* HW register value is in degrees Celsius, convert to millidegrees. */
>> +			*val = REG_FIELD_GET(TEMP_MASK, reg_val) * MILLIDEGREE_PER_DEGREE;
>> +			break;
>> +		case CHANNEL_MCTRL:
>> +			return get_mc_temp(hwmon, val);
>> +		default:
>> +			*val = 0;
> Do you need this?
incase *val is garbage this shall make this 0
>
>> +			return -EOPNOTSUPP;
>> +		}
>>   		break;
>>   	case hwmon_temp_emergency:
>>   		switch (channel) {
>> @@ -883,6 +965,7 @@ xe_hwmon_temp_read(struct xe_hwmon *hwmon, u32 attr, int channel, long *val)
>>   			*val = hwmon->temp.limit[TEMP_LIMIT_PKG_SHUTDOWN] * MILLIDEGREE_PER_DEGREE;
>>   			break;
>>   		case CHANNEL_VRAM:
>> +		case CHANNEL_MCTRL:
>>   			*val = hwmon->temp.limit[TEMP_LIMIT_MEM_SHUTDOWN] * MILLIDEGREE_PER_DEGREE;
>>   			break;
>>   		default:
>> @@ -896,6 +979,7 @@ xe_hwmon_temp_read(struct xe_hwmon *hwmon, u32 attr, int channel, long *val)
>>   			*val = hwmon->temp.limit[TEMP_LIMIT_PKG_TJMAX] * MILLIDEGREE_PER_DEGREE;
>>   			break;
>>   		case CHANNEL_VRAM:
>> +		case CHANNEL_MCTRL:
>>   			*val = hwmon->temp.limit[TEMP_LIMIT_MEM_TJMAX] * MILLIDEGREE_PER_DEGREE;
>>   			break;
>>   		default:
>> @@ -1274,6 +1358,8 @@ static int xe_hwmon_read_label(struct device *dev,
>>   			*str = "pkg";
>>   		else if (channel == CHANNEL_VRAM)
>>   			*str = "vram";
>> +		else if (channel == CHANNEL_MCTRL)
>> +			*str = "mctrl_avg";
> The rest of the channels also signify the average so no need to spell
> it out.
>
> Raag
ok
>
>>   		return 0;
>>   	case hwmon_power:
>>   	case hwmon_energy:
>> diff --git a/drivers/gpu/drm/xe/xe_pcode_api.h b/drivers/gpu/drm/xe/xe_pcode_api.h
>> index a3bff76a074d..91e232834482 100644
>> --- a/drivers/gpu/drm/xe/xe_pcode_api.h
>> +++ b/drivers/gpu/drm/xe/xe_pcode_api.h
>> @@ -67,6 +67,10 @@
>>   
>>   #define	PCODE_THERMAL_INFO			0x25
>>   #define	READ_THERMAL_LIMITS			0x0
>> +#define	READ_THERMAL_CONFIG			0x1
>> +#define	READ_THERMAL_DATA			0x2
>> +#define TEMP_MASK_MAILBOX			REG_GENMASK8(6, 0)
>> +#define TEMP_INDEX_MCTRL			0x2
>>   
>>   #define   PCODE_FREQUENCY_CONFIG		0x6e
>>   /* Frequency Config Sub Commands (param1) */
>> -- 
>> 2.25.1
>>

  parent reply	other threads:[~2025-12-24 10:13 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-16 11:40 [PATCH v3 0/4] drm/xe/hwmon: Expose new temperature attributes Karthik Poosa
2025-12-16 11:40 ` [PATCH v3 1/4] drm/xe/hwmon: Expose temperature limits Karthik Poosa
2025-12-17 17:05   ` Raag Jadav
2025-12-18  7:37     ` Poosa, Karthik
2025-12-18 11:15       ` Raag Jadav
2025-12-19  8:37         ` Nilawar, Badal
2025-12-19 15:06           ` Guenter Roeck
2025-12-19 15:56         ` Rodrigo Vivi
2025-12-16 11:40 ` [PATCH v3 2/4] drm/xe/hwmon: Expose memory controller temperature Karthik Poosa
2025-12-19  7:55   ` Raag Jadav
2025-12-23 15:49     ` Poosa, Karthik
2025-12-24 10:13     ` Poosa, Karthik [this message]
2025-12-16 11:40 ` [PATCH v3 3/4] drm/xe/hwmon: Expose GPU pcie temperature Karthik Poosa
2025-12-19  8:23   ` Raag Jadav
2025-12-24 10:19     ` Poosa, Karthik
2025-12-16 11:40 ` [PATCH v3 4/4] drm/xe/hwmon: Expose individual vram temperature Karthik Poosa
2025-12-21 18:52   ` Raag Jadav
2025-12-23 11:51     ` Poosa, Karthik
2025-12-16 19:28 ` ✓ CI.KUnit: success for drm/xe/hwmon: Expose new temperature attributes (rev4) Patchwork
2025-12-16 21:05 ` ✓ Xe.CI.BAT: " Patchwork
2025-12-17 19:17 ` ✗ Xe.CI.Full: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ba9efeca-6733-4af2-a0fc-77bdbe765ccf@intel.com \
    --to=karthik.poosa@intel.com \
    --cc=anshuman.gupta@intel.com \
    --cc=badal.nilawar@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=raag.jadav@intel.com \
    --cc=riana.tauro@intel.com \
    --cc=rodrigo.vivi@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox