From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Raag Jadav <raag.jadav@intel.com>
Cc: "Poosa, Karthik" <karthik.poosa@intel.com>,
<intel-xe@lists.freedesktop.org>, <anshuman.gupta@intel.com>,
<badal.nilawar@intel.com>, <riana.tauro@intel.com>
Subject: Re: [PATCH v3 1/4] drm/xe/hwmon: Expose temperature limits
Date: Fri, 19 Dec 2025 10:56:40 -0500 [thread overview]
Message-ID: <aUV1uHBt1k0h3kXz@intel.com> (raw)
In-Reply-To: <aUPiXUAji8P9QvFH@black.igk.intel.com>
On Thu, Dec 18, 2025 at 12:15:41PM +0100, Raag Jadav wrote:
> On Thu, Dec 18, 2025 at 01:07:57PM +0530, Poosa, Karthik wrote:
> > On 17-12-2025 22:35, Raag Jadav wrote:
> > > On Tue, Dec 16, 2025 at 05:10:27PM +0530, Karthik Poosa wrote:
> > > > Read temperature limits using pcode mailbox and
> > > > expose shutdown temperature limit as tempX_emergency,
> > > > critical temperature limit of as tempX_crit and
> > > > GPU average temperature limit as tempX_max.
> > > >
> > > > Update Xe hwmon documentation for these entries.
> > > >
> > > > Signed-off-by: Karthik Poosa <karthik.poosa@intel.com>
> > > > ---
> > > > .../ABI/testing/sysfs-driver-intel-xe-hwmon | 40 ++++++
> > > > drivers/gpu/drm/xe/xe_device_types.h | 3 +
> > > > drivers/gpu/drm/xe/xe_hwmon.c | 116 +++++++++++++++++-
> > > > drivers/gpu/drm/xe/xe_pci.c | 5 +
> > > > drivers/gpu/drm/xe/xe_pci_types.h | 1 +
> > > > drivers/gpu/drm/xe/xe_pcode_api.h | 3 +
> > > > 6 files changed, 164 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/Documentation/ABI/testing/sysfs-driver-intel-xe-hwmon b/Documentation/ABI/testing/sysfs-driver-intel-xe-hwmon
> > > > index d9e2b17c6872..c8f211c336be 100644
> > > > --- a/Documentation/ABI/testing/sysfs-driver-intel-xe-hwmon
> > > > +++ b/Documentation/ABI/testing/sysfs-driver-intel-xe-hwmon
> > > > @@ -196,3 +196,43 @@ Description: RW. Package burst power limit interval (Tau in PL2/Tau) in
> > > > milliseconds over which sustained power is averaged.
> > > > Only supported for particular Intel Xe graphics platforms.
> > > > +
> > > > +What: /sys/bus/pci/drivers/xe/.../hwmon/hwmon<i>/temp2_emergency
> > > > +Date: December 2025
> > > > +KernelVersion: 6.19
> > > > +Contact: intel-xe@lists.freedesktop.org
> > > > +Description: RO. Package shutdown temperature in millidegree Celsius.
> > > > +
> > > > + Only supported for particular Intel Xe graphics platforms.
> > > > +
> > > > +What: /sys/bus/pci/drivers/xe/.../hwmon/hwmon<i>/temp2_crit
> > > > +Date: December 2025
> > > > +KernelVersion: 6.19
> > > > +Contact: intel-xe@lists.freedesktop.org
> > > > +Description: RO. Package critical temperature in millidegree Celsius.
> > > > +
> > > > + Only supported for particular Intel Xe graphics platforms.
> > > > +
> > > > +What: /sys/bus/pci/drivers/xe/.../hwmon/hwmon<i>/temp2_max
> > > > +Date: December 2025
> > > > +KernelVersion: 6.19
> > > > +Contact: intel-xe@lists.freedesktop.org
> > > > +Description: RO. Package average temperature limit in millidegree Celsius.
> > > > +
> > > > + Only supported for particular Intel Xe graphics platforms.
> > > > +
> > > > +What: /sys/bus/pci/drivers/xe/.../hwmon/hwmon<i>/temp3_emergency
> > > > +Date: December 2025
> > > > +KernelVersion: 6.19
> > > > +Contact: intel-xe@lists.freedesktop.org
> > > > +Description: RO. VRAM shutdown temperature in millidegree Celsius.
> > > > +
> > > > + Only supported for particular Intel Xe graphics platforms.
> > > > +
> > > > +What: /sys/bus/pci/drivers/xe/.../hwmon/hwmon<i>/temp3_crit
> > > > +Date: December 2025
> > > > +KernelVersion: 6.19
> > > > +Contact: intel-xe@lists.freedesktop.org
> > > > +Description: RO. VRAM critical temperature in millidegree Celsius.
> > > > +
> > > > + Only supported for particular Intel Xe graphics platforms.
> > > As per hwmon documentation all these attributes are RW so they don't
> > > match with ABI expectations.
> >
> > 1. These are read only values for our hardware, write is not supported.
> >
> > 2. Couple of other drivers are also supporting readonly for temp_crit.
> >
> > Ex:
> >
> > a) drivers/net/ethernet/broadcom/bnxt/bnxt_hwmon.c - bnxt_hwmon_is_visible
> >
> > 75 static umode_t bnxt_hwmon_is_visible(const void *_data, enum
> > hwmon_sensor_types type,
> > 76 u32 attr, int channel)
> > 77 { 82
> > 83 switch (attr) {
> > 86 case hwmon_temp_max:
> > 87 case hwmon_temp_crit:
> > 88 case hwmon_temp_emergency:
> > 89 case hwmon_temp_max_alarm:
> > 90 case hwmon_temp_crit_alarm:
> > 91 case hwmon_temp_emergency_alarm:
> > 92 if (!(bp->fw_cap &
> > BNXT_FW_CAP_THRESHOLD_TEMP_SUPPORTED))
> > 93 return 0;
> > 94 return 0444;
> >
> > b) . drivers/hwmon/intel-m10-bmc-hwmon.c -
> >
> > static const struct hwmon_ops m10bmc_hwmon_ops = {
> > .visible = 0444,
> > .read = m10bmc_hwmon_read,
> >
> > 3. hwmon also doesn't report any violation for 0444 return from
> > xe_hwmon_temp_is_visible.
>
> Sure, but since this is a pre-defined ABI which we don't control I'll let
> the maintainers have the final call on this.
Not needed. It is properly documented:
"""
Read/write values may be read-only for some chips, depending on the
hardware implementation.
"""
in Documentation/hwmon/sysfs-interface.rst
>
> Raag
>
> > > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> > > > index 413ba4c8b62e..8e6dd2665596 100644
> > > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > > @@ -321,6 +321,9 @@ struct xe_device {
> > > > * pcode mailbox commands.
> > > > */
> > > > u8 has_mbx_power_limits:1;
> > > > + /** @info.has_mbx_thermal_info: Device has support for thermal mailbox commands.
> > > > + */
> > > Can be one line.
> > >
> > > > + u8 has_mbx_thermal_info:1;
> > > > /** @info.has_mem_copy_instr: Device supports MEM_COPY instruction */
> > > > u8 has_mem_copy_instr:1;
> > > > /** @info.has_mert: Device has standalone MERT */
> > > > diff --git a/drivers/gpu/drm/xe/xe_hwmon.c b/drivers/gpu/drm/xe/xe_hwmon.c
> > > > index ff2aea52ef75..66a8c3e40027 100644
> > > > --- a/drivers/gpu/drm/xe/xe_hwmon.c
> > > > +++ b/drivers/gpu/drm/xe/xe_hwmon.c
> > > > @@ -53,6 +53,14 @@ enum xe_fan_channel {
> > > > FAN_MAX,
> > > > };
> > > > +enum xe_temp_limit {
> > > > + TEMP_LIMIT_PKG_SHUTDOWN,
> > > > + TEMP_LIMIT_PKG_TJMAX,
> > > > + TEMP_LIMIT_MEM_SHUTDOWN,
> > > > + TEMP_LIMIT_PKG_CRIT,
> > > > + TEMP_LIMIT_MEM_TJMAX,
> > > > +};
> > > > +
> > > > /* Attribute index for powerX_xxx_interval sysfs entries */
> > > > enum sensor_attr_power {
> > > > SENSOR_INDEX_PSYS_PL1,
> > > > @@ -111,6 +119,18 @@ struct xe_hwmon_fan_info {
> > > > u64 time_prev;
> > > > };
> > > > +/**
> > > > + * struct xe_hwmon_thermal_info - to store temperature data
> > > > + */
> > > > +struct xe_hwmon_thermal_info {
> > > > + union {
> > > > + /** @limits: temperatures limits */
> > > > + u8 limit[8];
> > > What are the magic numbers? Can we have proper defines?
> >
> > We can use xe_temp_limit max value and keep remaining bytes as reserved.
> >
> > >
> > > > + /** @data: temperature limits in dwords */
> > > > + u32 data[2];
> > > > + };
> > > > +};
> > > > +
> > > > /**
> > > > * struct xe_hwmon - xe hwmon data structure
> > > > */
> > > > @@ -137,7 +157,8 @@ struct xe_hwmon {
> > > > u32 pl1_on_boot[CHANNEL_MAX];
> > > > /** @pl2_on_boot: power limit PL2 on boot */
> > > > u32 pl2_on_boot[CHANNEL_MAX];
> > > > -
> > > > + /** @temp: Temperature info */
> > > > + struct xe_hwmon_thermal_info temp;
> > > > };
> > > > static int xe_hwmon_pcode_read_power_limit(const struct xe_hwmon *hwmon, u32 attr, int channel,
> > > > @@ -677,8 +698,11 @@ static const struct attribute_group *hwmon_groups[] = {
> > > > };
> > > > static const struct hwmon_channel_info * const hwmon_info[] = {
> > > > - HWMON_CHANNEL_INFO(temp, HWMON_T_LABEL, HWMON_T_INPUT | HWMON_T_LABEL,
> > > > - HWMON_T_INPUT | HWMON_T_LABEL),
> > > > + HWMON_CHANNEL_INFO(temp,
> > > > + HWMON_T_LABEL,
> > > > + HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_EMERGENCY | HWMON_T_CRIT |
> > > > + HWMON_T_MAX,
> > > > + HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_EMERGENCY | HWMON_T_CRIT),
> > > > HWMON_CHANNEL_INFO(power, HWMON_P_MAX | HWMON_P_RATED_MAX | HWMON_P_LABEL | HWMON_P_CRIT |
> > > > HWMON_P_CAP,
> > > > HWMON_P_MAX | HWMON_P_RATED_MAX | HWMON_P_LABEL | HWMON_P_CAP),
> > > > @@ -689,6 +713,23 @@ static const struct hwmon_channel_info * const hwmon_info[] = {
> > > > NULL
> > > > };
> > > > +static int xe_hwmon_pcode_read_thermal_info(struct xe_hwmon *hwmon)
> > > > +{
> > > > + struct xe_tile *root_tile = xe_device_get_root_tile(hwmon->xe);
> > > > + int ret = 0;
> > > > +
> > > > + if (!hwmon->xe->info.has_mbx_power_limits)
> > > > + return -EOPNOTSUPP;
> > > > +
> > > > + /* Read thermal info */
> > > > + ret = xe_pcode_read(root_tile, PCODE_MBOX(PCODE_THERMAL_INFO, READ_THERMAL_LIMITS, 0),
> > > > + &hwmon->temp.data[0], &hwmon->temp.data[1]);
> > > > + drm_dbg(&hwmon->xe->drm, "thermal info read ret %d, val 0x%x val1 0x%x\n", ret,
> > > > + hwmon->temp.data[0], hwmon->temp.data[1]);
> > > > +
> > > > + return ret;
> > > > +}
> > > > +
> > > > /* I1 is exposed as power_crit or as curr_crit depending on bit 31 */
> > > > static int xe_hwmon_pcode_read_i1(const struct xe_hwmon *hwmon, u32 *uval)
> > > > {
> > > > @@ -787,6 +828,34 @@ static umode_t
> > > > xe_hwmon_temp_is_visible(struct xe_hwmon *hwmon, u32 attr, int channel)
> > > > {
> > > > switch (attr) {
> > > > + case hwmon_temp_emergency:
> > > > + switch (channel) {
> > > > + case CHANNEL_PKG:
> > > > + return hwmon->temp.limit[TEMP_LIMIT_PKG_SHUTDOWN] ? 0444 : 0;
> > > > + case CHANNEL_VRAM:
> > > > + return hwmon->temp.limit[TEMP_LIMIT_MEM_SHUTDOWN] ? 0444 : 0;
> > > > + default:
> > > > + return 0;
> > > > + }
> > > > + break;
> > > > + case hwmon_temp_crit:
> > > > + switch (channel) {
> > > > + case CHANNEL_PKG:
> > > > + return hwmon->temp.limit[TEMP_LIMIT_PKG_TJMAX] ? 0444 : 0;
> > > > + case CHANNEL_VRAM:
> > > > + return hwmon->temp.limit[TEMP_LIMIT_MEM_TJMAX] ? 0444 : 0;
> > > > + default:
> > > > + return 0;
> > > > + }
> > > > + break;
> > > > + case hwmon_temp_max:
> > > > + switch (channel) {
> > > > + case CHANNEL_PKG:
> > > > + return hwmon->temp.limit[TEMP_LIMIT_PKG_CRIT] ? 0444 : 0;
> > > > + default:
> > > > + return 0;
> > > > + }
> > > > + break;
> > > > case hwmon_temp_input:
> > > > case hwmon_temp_label:
> > > > return xe_reg_is_valid(xe_hwmon_get_reg(hwmon, REG_TEMP, channel)) ? 0444 : 0;
> > > > @@ -807,10 +876,46 @@ xe_hwmon_temp_read(struct xe_hwmon *hwmon, u32 attr, int channel, long *val)
> > > > /* HW register value is in degrees Celsius, convert to millidegrees. */
> > > > *val = REG_FIELD_GET(TEMP_MASK, reg_val) * MILLIDEGREE_PER_DEGREE;
> > > > - return 0;
> > > > + break;
> > > > + case hwmon_temp_emergency:
> > > > + switch (channel) {
> > > > + case CHANNEL_PKG:
> > > > + *val = hwmon->temp.limit[TEMP_LIMIT_PKG_SHUTDOWN] * MILLIDEGREE_PER_DEGREE;
> > > > + break;
> > > > + case CHANNEL_VRAM:
> > > > + *val = hwmon->temp.limit[TEMP_LIMIT_MEM_SHUTDOWN] * MILLIDEGREE_PER_DEGREE;
> > > > + break;
> > > > + default:
> > > > + *val = 0;
> > > Do you need this?
> > >
> > > > + return -EOPNOTSUPP;
> > > > + }
> > > > + break;
> > > > + case hwmon_temp_crit:
> > > > + switch (channel) {
> > > > + case CHANNEL_PKG:
> > > > + *val = hwmon->temp.limit[TEMP_LIMIT_PKG_TJMAX] * MILLIDEGREE_PER_DEGREE;
> > > > + break;
> > > > + case CHANNEL_VRAM:
> > > > + *val = hwmon->temp.limit[TEMP_LIMIT_MEM_TJMAX] * MILLIDEGREE_PER_DEGREE;
> > > > + break;
> > > > + default:
> > > > + *val = 0;
> > > Ditto.
> > >
> > > > + return -EOPNOTSUPP;
> > > > + }
> > > > + break;
> > > > + case hwmon_temp_max:
> > > > + switch (channel) {
> > > > + case CHANNEL_PKG:
> > > > + *val = hwmon->temp.limit[TEMP_LIMIT_PKG_CRIT] * MILLIDEGREE_PER_DEGREE;
> > > > + break;
> > > > + default:
> > > > + return 0;
> > > > + }
> > > > + break;
> > > > default:
> > > > return -EOPNOTSUPP;
> > > > }
> > > > + return 0;
> > > > }
> > > > static umode_t
> > > > @@ -1263,6 +1368,9 @@ xe_hwmon_get_preregistration_info(struct xe_hwmon *hwmon)
> > > > for (channel = 0; channel < FAN_MAX; channel++)
> > > > if (xe_hwmon_is_visible(hwmon, hwmon_fan, hwmon_fan_input, channel))
> > > > xe_hwmon_fan_input_read(hwmon, channel, &fan_speed);
> > > > +
> > > > + if (xe_hwmon_pcode_read_thermal_info(hwmon))
> > > > + drm_dbg(&hwmon->xe->drm, "Thermal mailbox support not present in firmware\n");
> > > > }
> > > > int xe_hwmon_register(struct xe_device *xe)
> > > > diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> > > > index 673761c8623e..a47637900e84 100644
> > > > --- a/drivers/gpu/drm/xe/xe_pci.c
> > > > +++ b/drivers/gpu/drm/xe/xe_pci.c
> > > > @@ -311,6 +311,7 @@ static const struct xe_device_desc dg2_desc = {
> > > > .has_display = true,
> > > > .has_fan_control = true,
> > > > .has_mbx_power_limits = false,
> > > > + .has_mbx_thermal_info = false,
> > > Why?
> > >
> > > > };
> > > > static const __maybe_unused struct xe_device_desc pvc_desc = {
> > > > @@ -328,6 +329,7 @@ static const __maybe_unused struct xe_device_desc pvc_desc = {
> > > > .vm_max_level = 4,
> > > > .vram_flags = XE_VRAM_FLAGS_NEED64K,
> > > > .has_mbx_power_limits = false,
> > > > + .has_mbx_thermal_info = false,
> > > Ditto.
> > >
> > > > };
> > > > static const struct xe_device_desc mtl_desc = {
> > > > @@ -365,6 +367,7 @@ static const struct xe_device_desc bmg_desc = {
> > > > .has_fan_control = true,
> > > > .has_flat_ccs = 1,
> > > > .has_mbx_power_limits = true,
> > > > + .has_mbx_thermal_info = true,
> > > > .has_gsc_nvm = 1,
> > > > .has_heci_cscfi = 1,
> > > > .has_i2c = true,
> > > > @@ -418,6 +421,7 @@ static const struct xe_device_desc cri_desc = {
> > > > .has_flat_ccs = false,
> > > > .has_i2c = true,
> > > > .has_mbx_power_limits = true,
> > > > + .has_mbx_thermal_info = true,
> > > > .has_mert = true,
> > > > .has_pre_prod_wa = 1,
> > > > .has_sriov = true,
> > > > @@ -681,6 +685,7 @@ static int xe_info_init_early(struct xe_device *xe,
> > > > /* runtime fusing may force flat_ccs to disabled later */
> > > > xe->info.has_flat_ccs = desc->has_flat_ccs;
> > > > xe->info.has_mbx_power_limits = desc->has_mbx_power_limits;
> > > > + xe->info.has_mbx_thermal_info = desc->has_mbx_thermal_info;
> > > > xe->info.has_gsc_nvm = desc->has_gsc_nvm;
> > > > xe->info.has_heci_gscfi = desc->has_heci_gscfi;
> > > > xe->info.has_heci_cscfi = desc->has_heci_cscfi;
> > > > diff --git a/drivers/gpu/drm/xe/xe_pci_types.h b/drivers/gpu/drm/xe/xe_pci_types.h
> > > > index 3bb51d155951..34d2b66188e4 100644
> > > > --- a/drivers/gpu/drm/xe/xe_pci_types.h
> > > > +++ b/drivers/gpu/drm/xe/xe_pci_types.h
> > > > @@ -48,6 +48,7 @@ struct xe_device_desc {
> > > > u8 has_late_bind:1;
> > > > u8 has_llc:1;
> > > > u8 has_mbx_power_limits:1;
> > > > + u8 has_mbx_thermal_info:1;
> > > > u8 has_mem_copy_instr:1;
> > > > u8 has_mert:1;
> > > > u8 has_pre_prod_wa:1;
> > > > diff --git a/drivers/gpu/drm/xe/xe_pcode_api.h b/drivers/gpu/drm/xe/xe_pcode_api.h
> > > > index 975892d6b230..a3bff76a074d 100644
> > > > --- a/drivers/gpu/drm/xe/xe_pcode_api.h
> > > > +++ b/drivers/gpu/drm/xe/xe_pcode_api.h
> > > > @@ -65,6 +65,9 @@
> > > > #define FAN_TABLE 1
> > > > #define VR_CONFIG 2
> > > > +#define PCODE_THERMAL_INFO 0x25
> > > > +#define READ_THERMAL_LIMITS 0x0
> > > Does this match with spacing convention and ordering?
> > >
> > > Raag
> > >
> > > > #define PCODE_FREQUENCY_CONFIG 0x6e
> > > > /* Frequency Config Sub Commands (param1) */
> > > > #define PCODE_MBOX_FC_SC_READ_FUSED_P0 0x0
> > > > --
> > > > 2.25.1
> > > >
next prev parent reply other threads:[~2025-12-19 15:56 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-12-16 11:40 [PATCH v3 0/4] drm/xe/hwmon: Expose new temperature attributes Karthik Poosa
2025-12-16 11:40 ` [PATCH v3 1/4] drm/xe/hwmon: Expose temperature limits Karthik Poosa
2025-12-17 17:05 ` Raag Jadav
2025-12-18 7:37 ` Poosa, Karthik
2025-12-18 11:15 ` Raag Jadav
2025-12-19 8:37 ` Nilawar, Badal
2025-12-19 15:06 ` Guenter Roeck
2025-12-19 15:56 ` Rodrigo Vivi [this message]
2025-12-16 11:40 ` [PATCH v3 2/4] drm/xe/hwmon: Expose memory controller temperature Karthik Poosa
2025-12-19 7:55 ` Raag Jadav
2025-12-23 15:49 ` Poosa, Karthik
2025-12-24 10:13 ` Poosa, Karthik
2025-12-16 11:40 ` [PATCH v3 3/4] drm/xe/hwmon: Expose GPU pcie temperature Karthik Poosa
2025-12-19 8:23 ` Raag Jadav
2025-12-24 10:19 ` Poosa, Karthik
2025-12-16 11:40 ` [PATCH v3 4/4] drm/xe/hwmon: Expose individual vram temperature Karthik Poosa
2025-12-21 18:52 ` Raag Jadav
2025-12-23 11:51 ` Poosa, Karthik
2025-12-16 19:28 ` ✓ CI.KUnit: success for drm/xe/hwmon: Expose new temperature attributes (rev4) Patchwork
2025-12-16 21:05 ` ✓ Xe.CI.BAT: " Patchwork
2025-12-17 19:17 ` ✗ Xe.CI.Full: failure " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aUV1uHBt1k0h3kXz@intel.com \
--to=rodrigo.vivi@intel.com \
--cc=anshuman.gupta@intel.com \
--cc=badal.nilawar@intel.com \
--cc=intel-xe@lists.freedesktop.org \
--cc=karthik.poosa@intel.com \
--cc=raag.jadav@intel.com \
--cc=riana.tauro@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.