From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <8fc9bf18-f3c1-402f-82b3-65e8268cce02@intel.com>
Date: Thu, 18 Apr 2024 16:14:23 +0530
From: "Ghimiray, Himal Prasad"
To: Rodrigo Vivi, intel-xe@lists.freedesktop.org
Cc: Lucas De Marchi, Alan Previn
Subject: [PATCH 4/4] drm/xe: Introduce the wedged_mode debugfs

It seems my previous response was only sent to the email list.

On 10-04-2024 03:45, Rodrigo Vivi wrote:
> So, the wedged mode can be selected per device at runtime,
> before the tests or before reproducing the issue.
>
> v2: - s/busted/wedged
>     - some locking consistency
>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_debugfs.c      | 56 ++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_device.c       | 41 ++++++++++++++------
>  drivers/gpu/drm/xe/xe_device.h       |  4 +-
>  drivers/gpu/drm/xe/xe_device_types.h | 11 +++++-
>  drivers/gpu/drm/xe/xe_gt.c           |  2 +-
>  drivers/gpu/drm/xe/xe_guc.c          |  2 +-
>  drivers/gpu/drm/xe/xe_guc_ads.c      | 52 +++++++++++++++++++++++++-
>  drivers/gpu/drm/xe/xe_guc_ads.h      |  1 +
>  drivers/gpu/drm/xe/xe_guc_submit.c   | 28 +++++++-------
>  9 files changed, 163 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/xe_debugfs.c
> index 86150cafe0ff..6ff067ea5a8f 100644
> --- a/drivers/gpu/drm/xe/xe_debugfs.c
> +++ b/drivers/gpu/drm/xe/xe_debugfs.c
> @@ -12,6 +12,7 @@
>  #include "xe_bo.h"
>  #include "xe_device.h"
>  #include "xe_gt_debugfs.h"
> +#include "xe_guc_ads.h"
>  #include "xe_pm.h"
>  #include "xe_step.h"
>
> @@ -106,6 +107,58 @@ static const struct file_operations forcewake_all_fops = {
>  	.release = forcewake_release,
>  };
>
> +static ssize_t wedged_mode_show(struct file *f, char __user *ubuf,
> +				size_t size, loff_t *pos)
> +{
> +	struct xe_device *xe = file_inode(f)->i_private;
> +	char buf[32];
> +	int len = 0;
> +
> +	mutex_lock(&xe->wedged.lock);
> +	len = scnprintf(buf, sizeof(buf), "%d\n", xe->wedged.mode);
> +	mutex_unlock(&xe->wedged.lock);
> +
> +	return simple_read_from_buffer(ubuf, size, pos, buf, len);
> +}
> +
> +static ssize_t wedged_mode_set(struct file *f, const char __user *ubuf,
> +			       size_t size, loff_t *pos)
> +{
> +	struct xe_device *xe = file_inode(f)->i_private;
> +	struct xe_gt *gt;
> +	u32 wedged_mode;
> +	ssize_t ret;
> +	u8 id;
> +
> +	ret = kstrtouint_from_user(ubuf, size, 0, &wedged_mode);
> +	if (ret)
> +		return ret;
> +
> +	if (wedged_mode > 2)
> +		return -EINVAL;
> +
> +	mutex_lock(&xe->wedged.lock);
> +	xe->wedged.mode = wedged_mode;
> +	if (wedged_mode == 2) {

The transition of xe->wedged.mode from 2 to 1 indicates a change in wedged state, yet the GuC policy still retains engine reset disabled, which seems incorrect. How about calling xe_guc_ads_scheduler_policy_disable_reset() for both modes (1 and 2)? For mode 1, this function would reset the GuC policies to the default settings.

If we agree on calling the above function unconditionally, it might be better to rename xe_guc_ads_scheduler_policy_disable_reset() to a more suitable name, since for mode 1 it won't actually disable reset.
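Roughly what I have in mind on the caller side (untested sketch; xe_guc_ads_scheduler_policy_update() is only a placeholder name for the renamed helper):

	mutex_lock(&xe->wedged.lock);
	xe->wedged.mode = wedged_mode;
	for_each_gt(gt, xe, id) {
		/* Push policies matching the new mode: defaults for 0/1,
		 * engine reset disabled for 2. */
		ret = xe_guc_ads_scheduler_policy_update(&gt->uc.guc.ads);
		if (ret) {
			drm_err(&xe->drm,
				"Failed to update GuC ADS scheduler policy (%pe)\n",
				ERR_PTR(ret));
			break;
		}
	}
	mutex_unlock(&xe->wedged.lock);

That way a 2 -> 1 (or 1 -> 2) transition always leaves the GuC with a policy matching the new mode.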
> +		for_each_gt(gt, xe, id) {
> +			ret = xe_guc_ads_scheduler_policy_disable_reset(&gt->uc.guc.ads);

Given this debugfs, where users have the option to choose whether to disable engine reset before submission, is the modparam introduced in [PATCH 3/4] really necessary? This also ensures that post rebind we have the default policies.
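And on the helper side, building the global flags from scratch rather than OR-ing into the freshly allocated struct would make mode 1 (and 0) actually go back to the default policy. Again, only a rough, untested sketch:

	policies->dpc_promote_time = ads_blob_read(ads, policies.dpc_promote_time);
	policies->max_num_work_items = ads_blob_read(ads, policies.max_num_work_items);
	/* Start from the default global flags... */
	policies->global_flags = 0;
	/* ...and only disable engine reset when in wedged_mode == 2 */
	if (xe->wedged.mode == 2)
		policies->global_flags |= GLOBAL_POLICY_DISABLE_ENGINE_RESET;
	policies->is_valid = 1;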
> +			if (ret) {
> +				drm_err(&xe->drm, "Failed to update GuC ADS scheduler policy. GPU might still reset even on the wedged_mode=2\n");
> +				break;
> +			}
> +		}
> +	}
> +	mutex_unlock(&xe->wedged.lock);
> +
> +	return size;
> +}
> +
> +static const struct file_operations wedged_mode_fops = {
> +	.owner = THIS_MODULE,
> +	.read = wedged_mode_show,
> +	.write = wedged_mode_set,
> +};
> +
>  void xe_debugfs_register(struct xe_device *xe)
>  {
>  	struct ttm_device *bdev = &xe->ttm;
> @@ -123,6 +176,9 @@ void xe_debugfs_register(struct xe_device *xe)
>  	debugfs_create_file("forcewake_all", 0400, root, xe,
>  			    &forcewake_all_fops);
>
> +	debugfs_create_file("wedged_mode", 0400, root, xe,
> +			    &wedged_mode_fops);
> +
>  	for (mem_type = XE_PL_VRAM0; mem_type <= XE_PL_VRAM1; ++mem_type) {
>  		man = ttm_manager_type(bdev, mem_type);
>
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 7928a5470cee..949fca2f0400 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -445,6 +445,9 @@ int xe_device_probe_early(struct xe_device *xe)
>  	if (err)
>  		return err;
>
> +	mutex_init(&xe->wedged.lock);
> +	xe->wedged.mode = xe_modparam.wedged_mode;
> +
>  	return 0;
>  }
>
> @@ -787,26 +790,37 @@ u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address)
>  }
>
>  /**
> - * xe_device_declare_wedged - Declare device wedged
> + * xe_device_hint_wedged - Get a hint and possibly declare device as wedged
>   * @xe: xe device instance
> + * @in_timeout_path: hint coming from a timeout path
>   *
> - * This is a final state that can only be cleared with a module
> + * The wedged state is a final on that can only be cleared with a module
>   * re-probe (unbind + bind).
>   * In this state every IOCTL will be blocked so the GT cannot be used.
> - * In general it will be called upon any critical error such as gt reset
> - * failure or guc loading failure.
> - * If xe.wedged module parameter is set to 2, this function will be called
> - * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
> - * snapshot capture. In this mode, GT reset won't be attempted so the state of
> - * the issue is preserved for further debugging.
> + * In general device will be declared wedged only at critical
> + * error paths such as gt reset failure or guc loading failure.
> + * Hints are also expected from every single execution timeout (a.k.a. GPU hang)
> + * right after devcoredump snapshot capture. Then, device can be declared wedged
> + * if wedged_mode is set to 2. In this mode, GT reset won't be attempted so the
> + * state of the issue is preserved for further debugging.
> + *
> + * Return: True if device has been just declared wedged. False otherwise.
>   */
> -void xe_device_declare_wedged(struct xe_device *xe)
> +bool xe_device_hint_wedged(struct xe_device *xe, bool in_timeout_path)
>  {
> -	if (xe_modparam.wedged_mode == 0)
> -		return;
> +	bool ret = false;
> +
> +	mutex_lock(&xe->wedged.lock);
>
> -	if (!atomic_xchg(&xe->wedged, 1)) {
> +	if (xe->wedged.mode == 0)
> +		goto out;
> +
> +	if (in_timeout_path && xe->wedged.mode != 2)
> +		goto out;
> +
> +	if (!atomic_xchg(&xe->wedged.flag, 1)) {
>  		xe->needs_flr_on_fini = true;
> +		ret = true;
>  		drm_err(&xe->drm,
>  			"CRITICAL: Xe has declared device %s as wedged.\n"
>  			"IOCTLs and executions are blocked until device is probed again with unbind and bind operations:\n"
> @@ -816,4 +830,7 @@ void xe_device_declare_wedged(struct xe_device *xe)
>  			dev_name(xe->drm.dev), dev_name(xe->drm.dev),
>  			dev_name(xe->drm.dev));
>  	}
> +out:
> +	mutex_unlock(&xe->wedged.lock);
> +	return ret;
>  }
> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> index 0fea5c18f76d..e3ea8a43e7f9 100644
> --- a/drivers/gpu/drm/xe/xe_device.h
> +++ b/drivers/gpu/drm/xe/xe_device.h
> @@ -178,9 +178,9 @@ u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
>
>  static inline bool xe_device_wedged(struct xe_device *xe)
>  {
> -	return atomic_read(&xe->wedged);
> +	return atomic_read(&xe->wedged.flag);
>  }
>
> -void xe_device_declare_wedged(struct xe_device *xe);
> +bool xe_device_hint_wedged(struct xe_device *xe, bool in_timeout_path);
>
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index b9ef60f21750..0da4787f1087 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -458,8 +458,15 @@ struct xe_device {
>  	/** @needs_flr_on_fini: requests function-reset on fini */
>  	bool needs_flr_on_fini;
>
> -	/** @wedged: Xe device faced a critical error and is now blocked. */
> -	atomic_t wedged;
> +	/** @wedged: Struct to control Wedged States and mode */
> +	struct {
> +		/** @wedged.flag: Xe device faced a critical error and is now blocked. */
> +		atomic_t flag;
> +		/** @wedged.mode: Mode controlled by kernel parameter and debugfs */
> +		int mode;
> +		/** @wedged.lock: To protect @wedged.mode */
> +		struct mutex lock;
> +	} wedged;
>
>  	/* private: */
>
> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> index 0844081b88ef..da16f4273877 100644
> --- a/drivers/gpu/drm/xe/xe_gt.c
> +++ b/drivers/gpu/drm/xe/xe_gt.c
> @@ -688,7 +688,7 @@ static int gt_reset(struct xe_gt *gt)
>  err_fail:
>  	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
>
> -	xe_device_declare_wedged(gt_to_xe(gt));
> +	xe_device_hint_wedged(gt_to_xe(gt), false);
>
>  	return err;
>  }
> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> index f1c3e338301d..ee7e0fa4815d 100644
> --- a/drivers/gpu/drm/xe/xe_guc.c
> +++ b/drivers/gpu/drm/xe/xe_guc.c
> @@ -495,7 +495,7 @@ static void guc_wait_ucode(struct xe_guc *guc)
>  			xe_gt_err(gt, "GuC firmware exception. EIP: %#x\n",
>  				  xe_mmio_read32(gt, SOFT_SCRATCH(13)));
>
> -		xe_device_declare_wedged(gt_to_xe(gt));
> +		xe_device_hint_wedged(gt_to_xe(gt), false);
>  	} else {
>  		xe_gt_dbg(gt, "GuC successfully loaded\n");
>  	}
> diff --git a/drivers/gpu/drm/xe/xe_guc_ads.c b/drivers/gpu/drm/xe/xe_guc_ads.c
> index dbd88ae20aa3..ad64d5a31239 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ads.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ads.c
> @@ -9,6 +9,7 @@
>
>  #include <generated/xe_wa_oob.h>
>
> +#include "abi/guc_actions_abi.h"
>  #include "regs/xe_engine_regs.h"
>  #include "regs/xe_gt_regs.h"
>  #include "regs/xe_guc_regs.h"
> @@ -16,11 +17,11 @@
>  #include "xe_gt.h"
>  #include "xe_gt_ccs_mode.h"
>  #include "xe_guc.h"
> +#include "xe_guc_ct.h"
>  #include "xe_hw_engine.h"
>  #include "xe_lrc.h"
>  #include "xe_map.h"
>  #include "xe_mmio.h"
> -#include "xe_module.h"
>  #include "xe_platform_types.h"
>  #include "xe_wa.h"
>
> @@ -395,6 +396,7 @@ int xe_guc_ads_init_post_hwconfig(struct xe_guc_ads *ads)
>
>  static void guc_policies_init(struct xe_guc_ads *ads)
>  {
> +	struct xe_device *xe = ads_to_xe(ads);
>  	u32 global_flags = 0;
>
>  	ads_blob_write(ads, policies.dpc_promote_time,
> @@ -402,8 +404,10 @@ static void guc_policies_init(struct xe_guc_ads *ads)
>  	ads_blob_write(ads, policies.max_num_work_items,
>  		       GLOBAL_POLICY_MAX_NUM_WI);
>
> -	if (xe_modparam.wedged_mode == 2)
> +	mutex_lock(&xe->wedged.lock);
> +	if (xe->wedged.mode == 2)
>  		global_flags |= GLOBAL_POLICY_DISABLE_ENGINE_RESET;
> +	mutex_unlock(&xe->wedged.lock);
>
>  	ads_blob_write(ads, policies.global_flags, global_flags);
>  	ads_blob_write(ads, policies.is_valid, 1);
> @@ -760,3 +764,47 @@ void xe_guc_ads_populate_post_load(struct xe_guc_ads *ads)
>  {
>  	guc_populate_golden_lrc(ads);
>  }
> +
> +static int guc_ads_action_update_policies(struct xe_guc_ads *ads, u32 policy_offset)
> +{
> +	struct  xe_guc_ct *ct = &ads_to_guc(ads)->ct;
> +	u32 action[] = {
> +		XE_GUC_ACTION_GLOBAL_SCHED_POLICY_CHANGE,
> +		policy_offset
> +	};
> +
> +	return xe_guc_ct_send(ct, action, ARRAY_SIZE(action), 0, 0);
> +}
> +
> +int xe_guc_ads_scheduler_policy_disable_reset(struct xe_guc_ads *ads)
> +{
> +	struct xe_device *xe = ads_to_xe(ads);
> +	struct xe_gt *gt = ads_to_gt(ads);
> +	struct xe_tile *tile = gt_to_tile(gt);
> +	struct guc_policies *policies;
> +	struct xe_bo *bo;
> +	int ret = 0;
> +
> +	policies = kmalloc(sizeof(*policies), GFP_KERNEL);
> +	if (!policies)
> +		return -ENOMEM;
> +
> +	policies->dpc_promote_time = ads_blob_read(ads, policies.dpc_promote_time);
> +	policies->max_num_work_items = ads_blob_read(ads, policies.max_num_work_items);
> +	policies->is_valid = 1;
> +	if (xe->wedged.mode == 2)
> +		policies->global_flags |= GLOBAL_POLICY_DISABLE_ENGINE_RESET;
> +
> +	bo = xe_managed_bo_create_from_data(xe, tile, policies, sizeof(struct guc_policies),
> +					    XE_BO_FLAG_VRAM_IF_DGFX(tile) |
> +					    XE_BO_FLAG_GGTT);
> +	if (IS_ERR(bo)) {
> +		ret = PTR_ERR(bo);
> +		goto out;
> +	}
> +
> +	ret = guc_ads_action_update_policies(ads, xe_bo_ggtt_addr(bo));
> +out:
> +	kfree(policies);
> +	return ret;
> +}
> diff --git a/drivers/gpu/drm/xe/xe_guc_ads.h b/drivers/gpu/drm/xe/xe_guc_ads.h
> index 138ef6267671..7c45c40fab34 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ads.h
> +++ b/drivers/gpu/drm/xe/xe_guc_ads.h
> @@ -13,5 +13,6 @@ int xe_guc_ads_init_post_hwconfig(struct xe_guc_ads *ads);
>  void xe_guc_ads_populate(struct xe_guc_ads *ads);
>  void xe_guc_ads_populate_minimal(struct xe_guc_ads *ads);
>  void xe_guc_ads_populate_post_load(struct xe_guc_ads *ads);
> +int xe_guc_ads_scheduler_policy_disable_reset(struct xe_guc_ads *ads);
>
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 0bea17536659..7de97b90ad00 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -35,7 +35,6 @@
>  #include "xe_macros.h"
>  #include "xe_map.h"
>  #include "xe_mocs.h"
> -#include "xe_module.h"
>  #include "xe_ring_ops_types.h"
>  #include "xe_sched_job.h"
>  #include "xe_trace.h"
> @@ -868,26 +867,33 @@ static void xe_guc_exec_queue_trigger_cleanup(struct xe_exec_queue *q)
>  		xe_sched_tdr_queue_imm(&q->guc->sched);
>  }
>
> -static void guc_submit_wedged(struct xe_guc *guc)
> +static bool guc_submit_hint_wedged(struct xe_guc *guc)
>  {
>  	struct xe_exec_queue *q;
>  	unsigned long index;
>  	int err;
>
> -	xe_device_declare_wedged(guc_to_xe(guc));
> +	if (xe_device_wedged(guc_to_xe(guc)))
> +		return true;
> +
> +	if (!xe_device_hint_wedged(guc_to_xe(guc), true))
> +		return false;
> +
>  	xe_guc_submit_reset_prepare(guc);
>  	xe_guc_ct_stop(&guc->ct);
>
>  	err = drmm_add_action_or_reset(&guc_to_xe(guc)->drm,
>  				       guc_submit_wedged_fini, guc);
>  	if (err)
> -		return;
> +		return true; /* Device is wedged anyway */
>
>  	mutex_lock(&guc->submission_state.lock);
>  	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
>  		if (xe_exec_queue_get_unless_zero(q))
>  			set_exec_queue_wedged(q);
>  	mutex_unlock(&guc->submission_state.lock);
> +
> +	return true;
>  }
>
>  static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
> @@ -898,15 +904,12 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
>  	struct xe_guc *guc = exec_queue_to_guc(q);
>  	struct xe_device *xe = guc_to_xe(guc);
>  	struct xe_gpu_scheduler *sched = &ge->sched;
> -	bool wedged = xe_device_wedged(xe);
> +	bool wedged;
>
>  	xe_assert(xe, xe_exec_queue_is_lr(q));
>  	trace_xe_exec_queue_lr_cleanup(q);
>
> -	if (!wedged && xe_modparam.wedged_mode == 2) {
> -		guc_submit_wedged(exec_queue_to_guc(q));
> -		wedged = true;
> -	}
> +	wedged = guc_submit_hint_wedged(exec_queue_to_guc(q));
>
>  	/* Kill the run_job / process_msg entry points */
>  	xe_sched_submission_stop(sched);
> @@ -957,7 +960,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>  	struct xe_device *xe = guc_to_xe(exec_queue_to_guc(q));
>  	int err = -ETIME;
>  	int i = 0;
> -	bool wedged = xe_device_wedged(xe);
> +	bool wedged;
>
>  	/*
>  	 * TDR has fired before free job worker. Common if exec queue
> @@ -981,10 +984,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>
>  	trace_xe_sched_job_timedout(job);
>
> -	if (!wedged && xe_modparam.wedged_mode == 2) {
> -		guc_submit_wedged(exec_queue_to_guc(q));
> -		wedged = true;
> -	}
> +	wedged = guc_submit_hint_wedged(exec_queue_to_guc(q));
>
>  	/* Kill the run_job entry point */
>  	xe_sched_submission_stop(sched);