From: Mallesh Koujalagi
To: intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org, rodrigo.vivi@intel.com
Cc: andrealmeid@igalia.com, christian.koenig@amd.com, airlied@gmail.com, simona.vetter@ffwll.ch, mripard@kernel.org, anshuman.gupta@intel.com, badal.nilawar@intel.com, riana.tauro@intel.com, karthik.poosa@intel.com, sk.anirban@intel.com, raag.jadav@intel.com, Mallesh Koujalagi
Subject: [PATCH v4 0/4] Introduce cold reset recovery method
Date: Mon, 13 Apr 2026 19:00:14 +0530
Message-ID: <20260413133013.560239-6-mallesh.koujalagi@intel.com>

This series builds on top of "Introduce Xe Uncorrectable Error
Handling" [1] and adds support for handling errors that require a
complete device power cycle (cold reset) to recover.

Certain error conditions leave the device in a persistent hardware
error state that cannot be cleared through existing recovery mechanisms
such as driver reload or PCIe reset. In these cases, functionality can
only be restored by performing a cold reset.

To support this, the series introduces a new DRM wedging recovery
method, DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)). When a device is wedged
with this method, the DRM core notifies userspace via a uevent that a
cold reset is required. This allows userspace to take appropriate
action to power-cycle the device.
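For illustration only, here is a minimal sketch of how a userspace
consumer might detect the request. It is not part of this series. It
assumes the uevent payload has already been split into "KEY=VALUE"
strings, that the WEDGED property may carry a comma-separated list of
recovery methods, and that the new method appears as "cold-reset". The
helper names wedged_has_method() and uevent_needs_cold_reset() are
hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return true if the comma-separated value of a
 * WEDGED= property contains exactly the token `method`.
 */
static bool wedged_has_method(const char *value, const char *method)
{
	size_t mlen = strlen(method);

	while (*value) {
		/* Length of the current token, up to the next ',' or NUL. */
		size_t len = strcspn(value, ",");

		if (len == mlen && strncmp(value, method, len) == 0)
			return true;

		value += len;
		if (*value == ',')
			value++;
	}

	return false;
}

/*
 * Hypothetical helper: scan uevent properties ("KEY=VALUE" strings) and
 * report whether the first WEDGED property asks for a cold reset.
 */
static bool uevent_needs_cold_reset(const char *const *props, size_t n)
{
	static const char key[] = "WEDGED=";
	size_t klen = sizeof(key) - 1;

	for (size_t i = 0; i < n; i++) {
		if (strncmp(props[i], key, klen) == 0)
			return wedged_has_method(props[i] + klen, "cold-reset");
	}

	return false;
}
```

A listener would call uevent_needs_cold_reset() on each drm uevent and,
on a match, hand off to whatever platform-specific mechanism can
power-cycle the device. That mechanism is outside the scope of this
sketch.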

Example uevent received:

  SUBSYSTEM=drm
  WEDGED=cold-reset
  DEVPATH=/devices/.../drm/card0

Detailed description in commit message.

[1] https://patchwork.freedesktop.org/series/160482/

This patch series introduces a call to xe_punit_error_handler() from
within handle_soc_internal_errors() when PUNIT errors are detected.

v2:
- Add use case: Handling errors from power management unit, which
  requires a complete power cycle to recover. (Christian)
- Add several instead of number to avoid update. (Jani)

v3:
- Update any scenario that requires cold-reset. (Riana)
- Update document with generic scenario. (Riana)
- Consistent with terminology. (Raag)
- Remove already covered information.
- Use PUNIT instead of PMU. (Riana)
- Use consistent wording.
- Remove log. (Raag)

v4:
- Rename cold reset to power cycle. (Raag)
- Update doc. (Raag/Riana)
- Change commit message. (Raag)
- Make function static. (Raag)

Cc: André Almeida
Cc: Christian König
Cc: David Airlie
Cc: Simona Vetter
Cc: Maxime Ripard

Mallesh Koujalagi (3):
  drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method
  drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method
  drm/xe: Handle PUNIT errors by requesting cold-reset recovery

Riana Tauro (1):
  Introduce Xe Uncorrectable Error Handling

 Documentation/gpu/drm-uapi.rst                |  60 +++-
 drivers/gpu/drm/drm_drv.c                     |   2 +
 drivers/gpu/drm/xe/Makefile                   |   2 +
 drivers/gpu/drm/xe/xe_device.c                |  10 +
 drivers/gpu/drm/xe/xe_device.h                |  15 +
 drivers/gpu/drm/xe/xe_device_types.h          |   6 +
 drivers/gpu/drm/xe/xe_gt.c                    |  14 +-
 drivers/gpu/drm/xe/xe_guc_submit.c            |   9 +-
 drivers/gpu/drm/xe/xe_pci.c                   |   3 +
 drivers/gpu/drm/xe/xe_pci_error.c             | 118 ++++++
 drivers/gpu/drm/xe/xe_ras.c                   | 335 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_ras.h                   |  16 +
 drivers/gpu/drm/xe/xe_ras_types.h             | 203 +++++++++++
 drivers/gpu/drm/xe/xe_survivability_mode.c    |  12 +-
 drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h |  13 +
 include/drm/drm_device.h                      |   1 +
 16 files changed, 810 insertions(+), 9 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_pci_error.c
 create mode 100644 drivers/gpu/drm/xe/xe_ras.c
 create mode 100644 drivers/gpu/drm/xe/xe_ras.h
 create mode 100644 drivers/gpu/drm/xe/xe_ras_types.h

-- 
2.34.1