From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <94bc7f8c-bb28-4f25-929d-a42253c65702@intel.com>
Date: Fri, 6 Feb 2026 15:29:11 -0500
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v2 2/3] drm/xe: Forcefully tear down exec queues in GuC submit fini
To: Matthew Brost
References: <20251218214418.4037401-1-matthew.brost@intel.com> <20251218214418.4037401-3-matthew.brost@intel.com> <5a99db81-ebbe-4dfe-a528-1063c4bcf1d1@intel.com>
Content-Language: en-US
From: "Dong, Zhanjun"
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
X-BeenThere: intel-xe@lists.freedesktop.org
Precedence: list
List-Id: Intel Xe graphics driver

On 2026-02-06 12:50 a.m., Matthew Brost wrote:
> On Wed, Jan 14, 2026 at 05:35:38PM -0500, Dong, Zhanjun wrote:
>
> This is actually a larger problem.
>
>>
>>
>> On 2026-01-08 2:17 p.m., Matthew Brost wrote:
>>> On Thu, Jan 08, 2026 at 02:00:15PM -0500, Dong, Zhanjun wrote:
>>>>
>>>>
>>>> On 2025-12-18 4:44 p.m., Matthew Brost wrote:
>>>>> In GuC submit fini, forcefully tear down any exec queues by disabling
>>>>> CTs, stopping the scheduler (which cleans up lost G2H), killing all
>>>>> remaining queues, and resuming scheduling to allow any remaining cleanup
>>>>> actions to complete and signal any remaining fences.
>>>>>
>>>>> v2:
>>>>> - Fix VF failure (CI)
>>>>>
>>>>> Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
>>>>> Cc: stable@vger.kernel.org
>>>>> Signed-off-by: Zhanjun Dong
>>>>> Signed-off-by: Matthew Brost
>>>>>
>>>>> ---
>>>>>
>>>>> This fix will not apply outright to any stable kernel as it depends on
>>>>> functions which have been added in the KMD since the original commit. Likely
>>>>> will have to manually send out patches to stable for kernels which we'd
>>>>> like to fix.
>>>>> ---
>>>>>  drivers/gpu/drm/xe/xe_guc_submit.c | 27 ++++++++++++++++++++-------
>>>>>  1 file changed, 20 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>>>>> index 071cbfec2401..58ec94439df1 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>>>>> @@ -289,6 +289,8 @@ static bool exec_queue_killed_or_banned_or_wedged(struct xe_exec_queue *q)
>>>>>  		 EXEC_QUEUE_STATE_BANNED));
>>>>>  }
>>>>> +static int __xe_guc_submit_reset_prepare(struct xe_guc *guc);
>>>>> +
>>>>>  static void guc_submit_fini(struct drm_device *drm, void *arg)
>>>>>  {
>>>>>  	struct xe_guc *guc = arg;
>>>>> @@ -296,6 +298,12 @@ static void guc_submit_fini(struct drm_device *drm, void *arg)
>>>>>  	struct xe_gt *gt = guc_to_gt(guc);
>>>>>  	int ret;
>>>>> +	/* Forcefully kill any remaining exec queues */
>>>>> +	xe_guc_ct_stop(&guc->ct);
>>>>> +	__xe_guc_submit_reset_prepare(guc);
>>>>> +	xe_guc_submit_stop(guc);
>>>>> +	xe_guc_submit_pause_abort(guc);
>>>>> +
>>>>
>>>> Tested this series over
>>>> 265d13795b45 drm-tip: 2026y-01m-06d-08h-06m-43s UTC integration manifest
>>>> ===(CI_DRM_17772) and (xe-4335) with (IGT_8685)===
>>>>
>>>> and ran the test xe_fault_injection --r probe-fail-guc-xe_guc_mmio_send_recv
>>>> --debug
>>>> and got a few problems:
>>>> 1. Assertion ct->g2h_outstanding == 0 triggered
>>>> call stack shows:
>>>> [ 708.967261] xe_guc_ct_disable+0x17/0x80 [xe]
>>>> [ 709.043382] xe_guc_sanitize+0x31/0x50 [xe]
>>>> [ 709.119557] xe_uc_load_hw+0x187/0x2a0 [xe]
>>>
>>> Above is a different problem. Just delete xe_guc_sanitize from
>>> xe_uc_load_hw, that call is nonsense left over from the i915 port.
>>>
>>> xe_guc_sanitize / xe_uc_sanitize everywhere probably needs a look if
>>> those calls make any bit of sense.
>> Agree
>>>
>>>>
>>>> 2. Page fault
>>>> [ 740.822070] BUG: unable to handle page fault for address:
>>>> ffffc9000c80fc50
>>>> [ 740.828896] #PF: supervisor write access in kernel mode
>>>> [ 740.834063] #PF: error_code(0x0002) - not-present page
>>>> [ 740.839145] PGD 100000067 P4D 100000067 PUD 100ad4067 PMD 0
>>>> [ 740.844738] Oops: Oops: 0002 [#2] SMP NOPTI
>>>> [ 740.848880] CPU: 2 UID: 0 PID: 169 Comm: kworker/2:2 Tainted: G S M UD W
>>>> 6.19.0-rc4+xu4335+ #3 PREEMPT(voluntary)
>>>> [ 740.859964] Tainted: [S]=CPU_OUT_OF_SPEC, [M]=MACHINE_CHECK, [U]=USER,
>>>> [D]=DIE, [W]=WARN
>>>> [ 740.867952] Hardware name: Intel Corporation Meteor Lake Client
>>>> Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.4122.D21.2408281317
>>>> 08/28/2024
>>>> [ 740.881081] Workqueue: xe-destroy-wq __guc_exec_queue_destroy_async [xe]
>>>> [ 740.887820] RIP: 0010:xe_ggtt_set_pte+0x53/0x350 [xe]
>>>> [ 740.892900] Code: e2 48 89 45 d0 31 c0 f7 c6 ff 0f 00 00 75 56 49 3b 5c
>>>> 24 08 0f 83 a8 01 00 00 49 8b 84 24 b0 00 00 00 48 c1 eb 0c 48 8d 04 d8 <4c>
>>>> 89 38 48 8b 45 d0 65 48 2b 05 e6 41 d1 e2 0f 85 e1 02 00 00 48
>>>> [ 740.911428] RSP: 0018:ffffc9000074b9f0 EFLAGS: 00010202
>>>> [ 740.916599] RAX: ffffc9000c80fc50 RBX: 0000000000001f8a RCX:
>>>> 0000000000000000
>>>> [ 740.923653] RDX: 0000000000000000 RSI: 0000000001f8a000 RDI:
>>>> ffff888132562628
>>>> [ 740.930705] RBP: ffffc9000074ba88 R08: 0000000000000000 R09:
>>>> ffff888168188000
>>>> [ 740.937758] R10: 0000000000000000 R11: 0000000000000000 R12:
>>>> ffff888132562628
>>>> [ 740.944807] R13: 0000000000000000 R14: ffff88816818a768 R15:
>>>> 0000000000000000
>>>> [ 740.951861] FS: 0000000000000000(0000) GS:ffff8884ebbe0000(0000)
>>>> knlGS:0000000000000000
>>>> [ 740.959850] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [ 740.965534] CR2: ffffc9000c80fc50 CR3: 0000000132923003 CR4:
>>>> 0000000000f72ef0
>>>> [ 740.972585] PKRU: 55555554
>>>> [ 740.975268] Call Trace:
>>>> [ 740.977694] <TASK>
>>>> [ 740.979778] ? __mutex_lock+0xae/0x1080
>>>> [ 740.983583] xe_ggtt_clear+0xa1/0x260 [xe]
>>>> [ 740.987716] ? lock_release+0x1df/0x280
>>>> [ 740.991519] ? pm_runtime_get_conditional+0x66/0x150
>>>> [ 740.996436] ggtt_node_remove+0xb2/0x140 [xe]
>>>> [ 741.000829] xe_ggtt_node_remove+0x40/0xa0 [xe]
>>>> [ 741.005393] xe_ggtt_remove_bo+0x87/0x250 [xe]
>>>> [ 741.009874] ? _raw_write_unlock+0x22/0x50
>>>> [ 741.013927] ? drm_vma_offset_remove+0x65/0x80
>>>> [ 741.018324] xe_ttm_bo_destroy+0xd4/0x310 [xe]
>>>> [ 741.022800] ttm_bo_release+0x70/0x330 [ttm]
>>>> [ 741.027032] ? vunmap+0x4a/0x70
>>>> [ 741.030147] ? vunmap+0x4a/0x70
>>>> [ 741.033260] ttm_bo_fini+0x3c/0x70 [ttm]
>>>> [ 741.037145] xe_gem_object_free+0x1a/0x30 [xe]
>>>> [ 741.041618] drm_gem_object_free+0x1d/0x40
>>>> [ 741.045671] xe_bo_put+0x136/0x1c0 [xe]
>>>> [ 741.049548] xe_lrc_destroy+0x47/0x60 [xe]
>>>> [ 741.053691] xe_exec_queue_fini+0x85/0xd0 [xe]
>>>> [ 741.058172] __guc_exec_queue_destroy_async+0x7c/0x190 [xe]
>>>> [ 741.063770] process_one_work+0x22e/0x6b0
>>>> [ 741.067741] worker_thread+0x1a0/0x370
>>>> [ 741.071456] ? __pfx_worker_thread+0x10/0x10
>>>> [ 741.075683] kthread+0x11f/0x250
>>>> [ 741.078882] ? __pfx_kthread+0x10/0x10
>>>> [ 741.082594] ret_from_fork+0x337/0x390
>>>> [ 741.086315] ? __pfx_kthread+0x10/0x10
>>>> [ 741.090027] ret_from_fork_asm+0x1a/0x30
>>>> [ 741.093909] </TASK>
>>>>
>>>> Sounds like calling xe_guc_submit_pause_abort here might cause trouble.
>>>> That's why I call it in guc_fini_hw, which made the test pass.
>>>>
>>>
>>> Thanks for the info. guc_fini_hw definitely isn't the right place
>>> though as that is registered before xe_guc_submit_init is called.
>>>
>>> If I'm understanding the trace correctly - guc_submit_fini should be on
>>> the devm exit handler.
>>>
>>> Want to give my two suggestions a try? Also feel free to run with these
>>> patches / take over if you have bandwidth. It is unlikely I'll have bandwidth
>>> to pick these back up for at least a week or so.
>>
>> With more debug prints on begin(^)/end($) of
>> guc_fini_hw/mmio_fini/guc_submit_fini:
>> [ 183.000171] ZD guc_fini_hw ^
>> [ 183.000187] xe 0000:00:02.0: [drm:guc_ct_change_state [xe]] Tile0: GT1:
>> GuC CT communication channel disabled
>> [ 183.003374] ZD guc_fini_hw $
>> [ 183.116889] ZD __xe_exec_queue_fini q:ffff88816a92d000 flag:0
>> lrc.bo:ffff88816baa8800
>> [ 183.129725] xe 0000:00:02.0: [drm:guc_ct_change_state [xe]] Tile0: GT0:
>> GuC CT communication channel stopped
>> [ 183.130487] xe 0000:00:02.0: [drm:guc_ct_change_state [xe]] Tile0: GT0:
>> GuC CT communication channel disabled
>> [ 183.131138] ZD guc_fini_hw ^
>> [ 183.131146] xe 0000:00:02.0: [drm:guc_ct_change_state [xe]] Tile0: GT0:
>> GuC CT communication channel disabled
>> [ 183.134163] ZD guc_fini_hw $
>> [ 183.235099] xe 0000:00:02.0: [drm:intel_pps_vdd_off_sync_unlocked [xe]]
>> [ENCODER:505:DDI A/PHY A] PPS 0 turning VDD off
>> [ 183.238289] xe 0000:00:02.0: [drm:intel_pps_vdd_off_sync_unlocked [xe]]
>> [ENCODER:505:DDI A/PHY A] PPS 0 PP_STATUS: 0x00000000 PP_CONTROL: 0x00000060
>> [ 183.238415] xe 0000:00:02.0: [drm:intel_power_well_disable [xe]]
>> disabling AUX_A
>> [ 183.238621] xe 0000:00:02.0: [drm:wait_panel_power_cycle [xe]]
>> [ENCODER:505:DDI A/PHY A] PPS 0 wait for panel power cycle (500 ms
>> remaining)
>> [ 183.747985] xe 0000:00:02.0: [drm:wait_panel_status [xe]]
>> [ENCODER:505:DDI A/PHY A] PPS 0 mask: 0xb800000f value: 0x00000000
>> PP_STATUS: 0x00000000 PP_CONTROL: 0x00000060
>> [ 183.758418] xe 0000:00:02.0: [drm:wait_panel_status [xe]] Wait complete
>> [ 183.774541] ZD mmio_fini ^
>> [ 183.774551] ZD mmio_fini $
>> [ 183.777314] xe 0000:00:02.0: [drm:drm_pagemap_shrinker_fini
>> [drm_gpusvm_helper]] Destroying dpagemap shrinker.
>> [ 183.789419] ZD guc_submit_fini ^
>> [ 183.792669] xe 0000:00:02.0: [drm:guc_ct_change_state [xe]] Tile0: GT1:
>> GuC CT communication channel stopped
>> [ 183.793409] ZD xe_guc_submit_pause_abort q:ffff88811d5fd000 flag:10
>> [ 183.799955] ZD __xe_exec_queue_fini q:ffff88811d5fd600 flag:10
>> lrc.bo:ffff888168fa6800
>> [ 183.807866] ZD guc_submit_fini start drain_workqueue
>> [ 183.807920] ZD __xe_exec_queue_fini q:ffff88811d5fd000 flag:90
>> lrc.bo:ffff888168fa5000
>> [ 183.820685] ZD xe_ggtt_remove_bo bo:ffff888168fa6800
>> ggtt:ffff88812c695628
>> [ 183.827536] ZD xe_ggtt_remove_bo bo:ffff888168fa5000
>> ggtt:ffff88812c695628
>> [ 183.834390] ZD xe_ggtt_clear ggtt:ffff88812c695628 start:33239040
>> gsm:ffffc9000c800000 gsm.:ffffc9000c80fd98
>> [ 183.844343] BUG: unable to handle page fault for address:
>> ffffc9000c80fd98
>> [ 183.851153] #PF: supervisor write access in kernel mode
>> [ 183.856324] #PF: error_code(0x0002) - not-present page
>> [ 183.861406] PGD 100000067 P4D 100000067 PUD 100ac9067 PMD 0
>> [ 183.867001] Oops: Oops: 0002 [#1] SMP NOPTI
>> [ 183.871143] CPU: 7 UID: 0 PID: 298 Comm: kworker/7:2 Tainted: G S M U W
>> 6.19.0-rc5+xu4373+ #13 PREEMPT(voluntary)
>> [ 183.882305] Tainted: [S]=CPU_OUT_OF_SPEC, [M]=MACHINE_CHECK, [U]=USER,
>> [W]=WARN
>> [ 183.889524] Hardware name: Intel Corporation Meteor Lake Client
>> Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.4122.D21.2408281317
>> 08/28/2024
>> [ 183.902650] Workqueue: xe-destroy-wq __guc_exec_queue_destroy_async [xe]
>> [ 183.909399] RIP: 0010:xe_ggtt_set_pte+0x5b/0x360 [xe]
>> [ 183.914482] Code: c6 ff 0f 00 00 75 5e 49 8b 44 24 10 49 03 44 24 08 48
>> 39 c3 0f 83 b0 01 00 00 49 8b 84 24 b8 00 00 00 48 c1 eb 0c 48 8d 04 d8 <4c>
>> 89 38 48 8b 45 d0 65 48 2b 05 1e 41 d1 e2 0f 85 e9 02 00 00 48
>> [ 183.933007] RSP: 0018:ffffc90001ce79c8 EFLAGS: 00010202
>> [ 183.938179] RAX: ffffc9000c80fd98 RBX: 0000000000001fb3 RCX:
>> 0000000000000000
>> [ 183.945234] RDX: 0000000000000000 RSI: 0000000001fb3000 RDI:
>> ffff88812c695628
>> [ 183.952285] RBP: ffffc90001ce7a60 R08: 0000000000000000 R09:
>> 0000000000000000
>> [ 183.959338] R10: 0000000000000000 R11: 0000000000000000 R12:
>> ffff88812c695628
>> [ 183.966388] R13: ffff8881329ea768 R14: ffff8881329ea768 R15:
>> 0000000000000000
>> [ 183.973438] FS: 0000000000000000(0000) GS:ffff8884ebe60000(0000)
>> knlGS:0000000000000000
>> [ 183.981431] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 183.987110] CR2: ffffc9000c80fd98 CR3: 000000010b9c5006 CR4:
>> 0000000000f72ef0
>> [ 183.994159] PKRU: 55555554
>> [ 183.996847] Call Trace:
>> [ 183.999267] <TASK>
>> [ 184.001356] ? vprintk_default+0x1d/0x30
>> [ 184.005244] ? vprintk+0x18/0x50
>> [ 184.008446] ? _printk+0x57/0x80
>> [ 184.011648] xe_ggtt_clear+0x104/0x2a0 [xe]
>> [ 184.015878] ? mark_held_locks+0x4d/0x90
>> [ 184.019767] ggtt_node_remove+0xb2/0x140 [xe]
>
> ggtt_node_remove has hotplug protection via drm_dev_enter, but it
> appears that drm_dev_unplug isn't called if the driver load fails, so
> the device still appears to be plugged in. This becomes an issue if, for
> example, MMIO space is unmapped in mmio_fini then sometime later a BO is
> freed with a GGTT mapping.
>
> I checked all the drm_dev_enter usages and believe we are OK aside from
> the GGTT case.

Nice to hear that.

>
>> [ 184.024164] xe_ggtt_node_remove+0x40/0xa0 [xe]
>> [ 184.028728] xe_ggtt_remove_bo+0xa4/0x2e0 [xe]
>> [ 184.033210] ? _raw_write_unlock+0x22/0x50
>> [ 184.037271] ? drm_vma_offset_remove+0x65/0x80
>> [ 184.041672] xe_ttm_bo_destroy+0xae/0x2d0 [xe]
>> [ 184.046150] ttm_bo_release+0x70/0x330 [ttm]
>> [ 184.050382] ? vunmap+0x4a/0x70
>> [ 184.053494] ? vunmap+0x4a/0x70
>> [ 184.056609] ttm_bo_fini+0x3c/0x70 [ttm]
>> [ 184.060491] xe_gem_object_free+0x1a/0x30 [xe]
>> [ 184.064966] drm_gem_object_free+0x1d/0x40
>> [ 184.069018] xe_bo_put+0x123/0x180 [xe]
>> [ 184.072898] xe_lrc_destroy+0x47/0x60 [xe]
>> [ 184.077041] __xe_exec_queue_fini+0x93/0xd0 [xe]
>> [ 184.081693] xe_exec_queue_fini+0x2b/0x60 [xe]
>> [ 184.086171] __guc_exec_queue_destroy_async+0x6c/0x170 [xe]
>> [ 184.091769] process_one_work+0x22e/0x6b0
>> [ 184.095737] worker_thread+0x1a0/0x370
>> [ 184.099448] ? __pfx_worker_thread+0x10/0x10
>> [ 184.103676] kthread+0x11f/0x250
>> [ 184.106877] ? __pfx_kthread+0x10/0x10
>> [ 184.110586] ret_from_fork+0x337/0x390
>> [ 184.114301] ? __pfx_kthread+0x10/0x10
>> [ 184.118011] ret_from_fork_asm+0x1a/0x30
>> [ 184.121900] </TASK>
>>
>> So the root cause of the page fault should be:
>> 1. mmio_fini does pci_iounmap
>> 2. writeq in xe_ggtt_set_pte accesses a valid address (ffffc9000c80fd98)
>> 3. Since it was already unmapped in step 1, the page fault is triggered.
>>
>> The execution order of the fini(s) is:
>> guc_fini_hw (for each guc)
>> mmio_fini
>> guc_submit_fini
>>
>> Meanwhile, it is the destroy worker that performs the bo release action that
>> causes the problem; the worker is out of sync with the managed actions.
>>
>
> Yes, this is an issue with all versions of this series, even with some
> of the further suggestions I sent over today off-list, if hotplug
> protection doesn’t work in the GGTT code. We might need to open-code the
> protection in the GGTT code rather than relying on hotplug.

Right, that's why I moved guc_submit_fini to devm since v3; testing shows
this prevents the page fault from happening.
And of course, it would be better to have open-coded protection; I will try
it later.
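
For illustration only, here is a minimal userspace sketch of what open-coded protection could look like. It is a model, not xe code: struct fake_ggtt, the mmio_mapped flag, fake_ggtt_init(), fake_mmio_fini() and fake_ggtt_clear() are names made up for this sketch, and a real fix would also have to serialize the flag against the destroy worker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the GGTT state; not the real struct xe_ggtt. */
struct fake_ggtt {
	uint64_t *gsm;		/* models the ioremapped PTE window */
	size_t num_ptes;
	bool mmio_mapped;	/* assumed open-coded guard, set while gsm is usable */
};

/* Models probe: map the PTE window and mark it usable. */
static int fake_ggtt_init(struct fake_ggtt *ggtt, size_t num_ptes)
{
	ggtt->gsm = calloc(num_ptes, sizeof(*ggtt->gsm));
	if (!ggtt->gsm)
		return -1;
	ggtt->num_ptes = num_ptes;
	ggtt->mmio_mapped = true;
	return 0;
}

/* Models mmio_fini(): clear the guard first, then drop the mapping. */
static void fake_mmio_fini(struct fake_ggtt *ggtt)
{
	ggtt->mmio_mapped = false;
	free(ggtt->gsm);
	ggtt->gsm = NULL;
}

/*
 * Models the PTE clear done on BO release. Returns how many PTEs were
 * written; 0 means the write was skipped because MMIO is already gone.
 */
static size_t fake_ggtt_clear(struct fake_ggtt *ggtt, size_t start, size_t count)
{
	size_t i, written = 0;

	if (!ggtt->mmio_mapped)
		return 0;	/* late BO free after mmio_fini: skip the write */

	for (i = start; i < start + count && i < ggtt->num_ptes; i++) {
		ggtt->gsm[i] = 0;
		written++;
	}
	return written;
}
```

In this model, a clear issued before fake_mmio_fini() writes the PTEs, and one issued afterwards becomes a no-op instead of touching the unmapped window.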

Regards,
Zhanjun Dong

>
> Matt
>
>> Regards,
>> Zhanjun Dong
>>
>>
>>>
>>> Matt
>>>
>>>> Regards,
>>>> Zhanjun Dong
>>>>
>>>>>  	ret = wait_event_timeout(guc->submission_state.fini_wq,
>>>>>  				 xa_empty(&guc->submission_state.exec_queue_lookup),
>>>>>  				 HZ * 5);
>>>>> @@ -2459,16 +2467,10 @@ static void guc_exec_queue_stop(struct xe_guc *guc, struct xe_exec_queue *q)
>>>>>  	}
>>>>>  }
>>>>> -int xe_guc_submit_reset_prepare(struct xe_guc *guc)
>>>>> +static int __xe_guc_submit_reset_prepare(struct xe_guc *guc)
>>>>>  {
>>>>>  	int ret;
>>>>> -	if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
>>>>> -		return 0;
>>>>> -
>>>>> -	if (!guc->submission_state.initialized)
>>>>> -		return 0;
>>>>> -
>>>>>  	/*
>>>>>  	 * Using an atomic here rather than submission_state.lock as this
>>>>>  	 * function can be called while holding the CT lock (engine reset
>>>>> @@ -2483,6 +2485,17 @@ int xe_guc_submit_reset_prepare(struct xe_guc *guc)
>>>>>  	return ret;
>>>>>  }
>>>>> +int xe_guc_submit_reset_prepare(struct xe_guc *guc)
>>>>> +{
>>>>> +	if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
>>>>> +		return 0;
>>>>> +
>>>>> +	if (!guc->submission_state.initialized)
>>>>> +		return 0;
>>>>> +
>>>>> +	return __xe_guc_submit_reset_prepare(guc);
>>>>> +}
>>>>> +
>>>>>  void xe_guc_submit_reset_wait(struct xe_guc *guc)
>>>>>  {
>>>>>  	wait_event(guc->ct.wq, xe_device_wedged(guc_to_xe(guc)) ||
>>>>
>>