From: Raag Jadav <raag.jadav@intel.com>
To: Karthik Poosa <karthik.poosa@intel.com>
Cc: intel-xe@lists.freedesktop.org, anshuman.gupta@intel.com,
badal.nilawar@intel.com, rodrigo.vivi@intel.com
Subject: Re: [PATCH v4 2/4] drm/xe/hwmon: Expose memory controller temperature
Date: Fri, 9 Jan 2026 11:56:39 +0100 [thread overview]
Message-ID: <aWDe56t90FND09x2@black.igk.intel.com> (raw)
In-Reply-To: <20260108130323.426531-3-karthik.poosa@intel.com>
On Thu, Jan 08, 2026 at 06:33:21PM +0530, Karthik Poosa wrote:
> Expose GPU memory controller average temperature and its limits
> under temp4_xxx.
Same comments as last patch.
> Update Xe hwmon documentation for this.
>
> v2:
> - Rephrase commit message. (Badal)
> - Update kernel version in Xe hwmon documentation. (Raag)
>
> v3:
> - Update kernel version in Xe hwmon documentation.
> - Address review comments from Raag.
> - Remove obvious comments.
> - Remove redundant debug logs.
> - Remove unnecessary checks.
> - Avoid magic numbers.
> - Add new comments.
> - Use temperature sensors count to make memory controller visible.
> - Use temperature limits of package for memory controller.
>
> Signed-off-by: Karthik Poosa <karthik.poosa@intel.com>
> ---
> .../ABI/testing/sysfs-driver-intel-xe-hwmon | 24 ++++++
> drivers/gpu/drm/xe/xe_hwmon.c | 86 ++++++++++++++++++-
> drivers/gpu/drm/xe/xe_pcode_api.h | 4 +
> 3 files changed, 110 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-driver-intel-xe-hwmon b/Documentation/ABI/testing/sysfs-driver-intel-xe-hwmon
> index c3b367c42741..a9fcfa6f11b9 100644
> --- a/Documentation/ABI/testing/sysfs-driver-intel-xe-hwmon
> +++ b/Documentation/ABI/testing/sysfs-driver-intel-xe-hwmon
> @@ -236,3 +236,27 @@ Contact: intel-xe@lists.freedesktop.org
> Description: RO. VRAM critical temperature in millidegree Celsius.
>
> Only supported for particular Intel Xe graphics platforms.
> +
> +What: /sys/bus/pci/drivers/xe/.../hwmon/hwmon<i>/temp4_input
> +Date: January 2026
> +KernelVersion: 7.0
> +Contact: intel-xe@lists.freedesktop.org
> +Description: RO. Memory controller average temperature in millidegree Celsius.
> +
> + Only supported for particular Intel Xe graphics platforms.
> +
> +What: /sys/bus/pci/drivers/xe/.../hwmon/hwmon<i>/temp4_emergency
> +Date: January 2026
> +KernelVersion: 7.0
> +Contact: intel-xe@lists.freedesktop.org
> +Description: RO. Memory controller shutdown temperature in millidegree Celsius.
> +
> + Only supported for particular Intel Xe graphics platforms.
> +
> +What: /sys/bus/pci/drivers/xe/.../hwmon/hwmon<i>/temp4_crit
> +Date: January 2026
> +KernelVersion: 7.0
> +Contact: intel-xe@lists.freedesktop.org
> +Description: RO. Memory controller critical temperature in millidegree Celsius.
> +
> + Only supported for particular Intel Xe graphics platforms.
Same comments as last patch.
> diff --git a/drivers/gpu/drm/xe/xe_hwmon.c b/drivers/gpu/drm/xe/xe_hwmon.c
> index db00728c89e4..2bf5c9ac948a 100644
> --- a/drivers/gpu/drm/xe/xe_hwmon.c
> +++ b/drivers/gpu/drm/xe/xe_hwmon.c
> @@ -43,6 +43,7 @@ enum xe_hwmon_channel {
> CHANNEL_CARD,
> CHANNEL_PKG,
> CHANNEL_VRAM,
> + CHANNEL_MCTRL,
> CHANNEL_MAX,
> };
>
> @@ -100,6 +101,11 @@ enum sensor_attr_power {
> */
> #define PL_WRITE_MBX_TIMEOUT_MS (1)
>
> +/*
> + * Maximum number of thermal sensors supported by hardware.
> + */
> +#define MAX_THERMAL_SENSORS (255)
I don't think we need this, see below.
> +
> /**
> * struct xe_hwmon_energy_info - to accumulate energy
> */
> @@ -130,6 +136,10 @@ struct xe_hwmon_thermal_info {
> /** @data: temperature limits in dwords */
> u32 data[(TEMP_LIMIT_MAX / sizeof(u32)) + ((TEMP_LIMIT_MAX % sizeof(u32)) ? 1 : 0)];
> };
> + /** @count: no of temperature sensors available for the platform */
> + u8 count;
> + /** @value: value from each sensor, bit 7 is for sign and 6:0 for value */
Nit: "Signed value from each sensor" would be sufficient.
> + s8 value[MAX_THERMAL_SENSORS];
Just use U8_MAX and add a comment that it's the size of count member.
> };
>
> /**
> @@ -703,6 +713,7 @@ static const struct hwmon_channel_info * const hwmon_info[] = {
> HWMON_T_LABEL,
> HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_EMERGENCY | HWMON_T_CRIT |
> HWMON_T_MAX,
> + HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_EMERGENCY | HWMON_T_CRIT,
Alphabetic order please!
> HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_EMERGENCY | HWMON_T_CRIT),
> HWMON_CHANNEL_INFO(power, HWMON_P_MAX | HWMON_P_RATED_MAX | HWMON_P_LABEL | HWMON_P_CRIT |
> HWMON_P_CAP,
> @@ -718,15 +729,58 @@ static int xe_hwmon_pcode_read_thermal_info(struct xe_hwmon *hwmon)
> {
> struct xe_tile *root_tile = xe_device_get_root_tile(hwmon->xe);
> int ret = 0;
> + u32 val = 0;
Make this config.
> ret = xe_pcode_read(root_tile, PCODE_MBOX(PCODE_THERMAL_INFO, READ_THERMAL_LIMITS, 0),
> &hwmon->temp.data[0], &hwmon->temp.data[1]);
> + if (ret)
> + return ret;
> +
> drm_dbg(&hwmon->xe->drm, "thermal info read val 0x%x val1 0x%x\n",
> hwmon->temp.data[0], hwmon->temp.data[1]);
>
> + ret = xe_pcode_read(root_tile, PCODE_MBOX(PCODE_THERMAL_INFO, READ_THERMAL_CONFIG, 0),
> + &val, NULL);
> + if (ret)
> + return ret;
> +
> + drm_dbg(&hwmon->xe->drm, "thermal config count %d\n", val);
> + hwmon->temp.count = val & TEMP_MASK;
> +
> return ret;
> }
>
> +static int get_mc_temp(struct xe_hwmon *hwmon, long *val)
> +{
> + struct xe_tile *root_tile = xe_device_get_root_tile(hwmon->xe);
> + int ret = 0, i = 0;
Redundant initializations.
> + u32 *dword = (u32 *)hwmon->temp.value;
Use reverse xmas tree order.
> + s16 average = 0;
I know this does the job but let's use s32 for uniformity.
> + for (i = 0; i < (hwmon->temp.count / sizeof(u32)); i++) {
Hm... see below.
> + ret = xe_pcode_read(root_tile, PCODE_MBOX(PCODE_THERMAL_INFO, READ_THERMAL_DATA, i),
> + (dword + i), NULL);
> + if (ret)
> + return ret;
> + drm_dbg(&hwmon->xe->drm, "thermal data for group %d val 0x%x\n", i, dword[i]);
> + }
> +
> + if (hwmon->temp.count % (sizeof(u32))) {
I think the correct logic would be to use
(((hwmon->temp.count - 1)/ sizeof(u32)) + 1) as loop condition,
or did I miss something?
> + ret = xe_pcode_read(root_tile, PCODE_MBOX(PCODE_THERMAL_INFO, READ_THERMAL_DATA, i),
> + (dword + i), NULL);
> + if (ret)
> + return ret;
> + drm_dbg(&hwmon->xe->drm, "thermal data for group %d val 0x%x\n", i, dword[i]);
> + }
> +
> + for (i = TEMP_INDEX_MCTRL; i < hwmon->temp.count - 1; i++)
> + average += hwmon->temp.value[i];
> +
> + average /= (hwmon->temp.count - TEMP_INDEX_MCTRL - 1);
> + *val = average * MILLIDEGREE_PER_DEGREE;
> + return 0;
> +}
> +
> /* I1 is exposed as power_crit or as curr_crit depending on bit 31 */
> static int xe_hwmon_pcode_read_i1(const struct xe_hwmon *hwmon, u32 *uval)
> {
> @@ -831,6 +885,8 @@ xe_hwmon_temp_is_visible(struct xe_hwmon *hwmon, u32 attr, int channel)
> return hwmon->temp.limit[TEMP_LIMIT_PKG_SHUTDOWN] ? 0444 : 0;
> case CHANNEL_VRAM:
> return hwmon->temp.limit[TEMP_LIMIT_MEM_SHUTDOWN] ? 0444 : 0;
> + case CHANNEL_MCTRL:
> + return hwmon->temp.count ? 0444 : 0;
> default:
> return 0;
> }
> @@ -841,6 +897,8 @@ xe_hwmon_temp_is_visible(struct xe_hwmon *hwmon, u32 attr, int channel)
> return hwmon->temp.limit[TEMP_LIMIT_PKG_TJMAX] ? 0444 : 0;
> case CHANNEL_VRAM:
> return hwmon->temp.limit[TEMP_LIMIT_MEM_TJMAX] ? 0444 : 0;
> + case CHANNEL_MCTRL:
> + return hwmon->temp.count ? 0444 : 0;
> default:
> return 0;
> }
> @@ -855,7 +913,16 @@ xe_hwmon_temp_is_visible(struct xe_hwmon *hwmon, u32 attr, int channel)
> break;
> case hwmon_temp_input:
> case hwmon_temp_label:
> - return xe_reg_is_valid(xe_hwmon_get_reg(hwmon, REG_TEMP, channel)) ? 0444 : 0;
> + switch (channel) {
> + case CHANNEL_PKG:
> + case CHANNEL_VRAM:
> + return xe_reg_is_valid(xe_hwmon_get_reg(hwmon, REG_TEMP,
> + channel)) ? 0444 : 0;
Align with correct braces.
> + case CHANNEL_MCTRL:
> + return hwmon->temp.count ? 0444 : 0;
> + default:
> + return 0;
> + }
> default:
> return 0;
> }
> @@ -869,14 +936,22 @@ xe_hwmon_temp_read(struct xe_hwmon *hwmon, u32 attr, int channel, long *val)
>
> switch (attr) {
> case hwmon_temp_input:
> - reg_val = xe_mmio_read32(mmio, xe_hwmon_get_reg(hwmon, REG_TEMP, channel));
> + switch (channel) {
> + case CHANNEL_PKG:
> + case CHANNEL_VRAM:
> + reg_val = xe_mmio_read32(mmio, xe_hwmon_get_reg(hwmon, REG_TEMP, channel));
>
> - /* HW register value is in degrees Celsius, convert to millidegrees. */
> - *val = REG_FIELD_GET(TEMP_MASK, reg_val) * MILLIDEGREE_PER_DEGREE;
> + /* HW register value is in degrees Celsius, convert to millidegrees. */
> + *val = REG_FIELD_GET(TEMP_MASK, reg_val) * MILLIDEGREE_PER_DEGREE;
> + break;
Nope, just return.
> + case CHANNEL_MCTRL:
> + return get_mc_temp(hwmon, val);
Add a default case with return 0 to make static checker happy (and also
in all other similar places).
> + }
> break;
> case hwmon_temp_emergency:
> switch (channel) {
> case CHANNEL_PKG:
> + case CHANNEL_MCTRL:
> *val = hwmon->temp.limit[TEMP_LIMIT_PKG_SHUTDOWN] * MILLIDEGREE_PER_DEGREE;
> break;
> case CHANNEL_VRAM:
> @@ -887,6 +962,7 @@ xe_hwmon_temp_read(struct xe_hwmon *hwmon, u32 attr, int channel, long *val)
> case hwmon_temp_crit:
> switch (channel) {
> case CHANNEL_PKG:
> + case CHANNEL_MCTRL:
> *val = hwmon->temp.limit[TEMP_LIMIT_PKG_TJMAX] * MILLIDEGREE_PER_DEGREE;
> break;
> case CHANNEL_VRAM:
> @@ -1263,6 +1339,8 @@ static int xe_hwmon_read_label(struct device *dev,
> *str = "pkg";
> else if (channel == CHANNEL_VRAM)
> *str = "vram";
> + else if (channel == CHANNEL_MCTRL)
> + *str = "mctrl";
> return 0;
> case hwmon_power:
> case hwmon_energy:
> diff --git a/drivers/gpu/drm/xe/xe_pcode_api.h b/drivers/gpu/drm/xe/xe_pcode_api.h
> index 6c84c1780fe9..fc8811a87741 100644
> --- a/drivers/gpu/drm/xe/xe_pcode_api.h
> +++ b/drivers/gpu/drm/xe/xe_pcode_api.h
> @@ -67,6 +67,10 @@
>
> #define PCODE_THERMAL_INFO 0x25
> #define READ_THERMAL_LIMITS 0x0
> +#define READ_THERMAL_CONFIG 0x1
> +#define READ_THERMAL_DATA 0x2
> +#define TEMP_INDEX_MCTRL 0x2
This is not used by pcode and should not be in this file. I think the right
place for it would be xe_hwmon.c.
> +#define TEMP_MASK_MAILBOX REG_GENMASK8(6, 0)
Is this used in this patch?
Raag
> #define PCODE_FREQUENCY_CONFIG 0x6e
> /* Frequency Config Sub Commands (param1) */
> --
> 2.25.1
>
next prev parent reply other threads:[~2026-01-09 10:56 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-01-08 13:03 [PATCH v4 0/4] drm/xe/hwmon: Expose new temperature attributes Karthik Poosa
2026-01-08 13:03 ` [PATCH v4 1/4] drm/xe/hwmon: Expose temperature limits Karthik Poosa
2026-01-09 9:29 ` Raag Jadav
2026-01-09 13:42 ` Poosa, Karthik
2026-01-09 14:24 ` Poosa, Karthik
2026-01-08 13:03 ` [PATCH v4 2/4] drm/xe/hwmon: Expose memory controller temperature Karthik Poosa
2026-01-09 10:56 ` Raag Jadav [this message]
2026-01-08 13:03 ` [PATCH v4 3/4] drm/xe/hwmon: Expose GPU pcie temperature Karthik Poosa
2026-01-09 13:26 ` Raag Jadav
2026-01-08 13:03 ` [PATCH v4 4/4] drm/xe/hwmon: Expose individual vram channel temperature Karthik Poosa
2026-01-08 13:20 ` ✓ CI.KUnit: success for drm/xe/hwmon: Expose new temperature attributes (rev5) Patchwork
2026-01-08 13:58 ` ✓ Xe.CI.BAT: " Patchwork
2026-01-08 19:02 ` ✓ Xe.CI.Full: " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aWDe56t90FND09x2@black.igk.intel.com \
--to=raag.jadav@intel.com \
--cc=anshuman.gupta@intel.com \
--cc=badal.nilawar@intel.com \
--cc=intel-xe@lists.freedesktop.org \
--cc=karthik.poosa@intel.com \
--cc=rodrigo.vivi@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.