From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 6 Oct 2025 08:54:15 -0700
From: Matthew Brost
To: Michal Wajdeczko
CC:
Subject: Re: [PATCH v6 14/30] drm/xe/vf: Wakeup in GuC backend on VF post migration recovery
Message-ID:
References: <20251006111038.2234860-1-matthew.brost@intel.com> <20251006111038.2234860-15-matthew.brost@intel.com> <94961e87-1826-4059-bb81-b79073074ea8@intel.com>
In-Reply-To: <94961e87-1826-4059-bb81-b79073074ea8@intel.com>
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
X-BeenThere: intel-xe@lists.freedesktop.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Intel Xe graphics driver
Errors-To: intel-xe-bounces@lists.freedesktop.org
Sender: "Intel-xe"

On Mon, Oct 06, 2025 at 04:35:51PM +0200, Michal Wajdeczko wrote:
> 
> 
> On 10/6/2025 1:10 PM, Matthew Brost wrote:
> > If VF post-migration recovery is in progress, the recovery flow will
> > rebuild all GuC submission state. In this case, exit all waiters to
> > ensure that submission queue scheduling can also be paused. Avoid taking
> > any adverse actions after aborting the wait.
> >
> > As part of waking up the GuC backend, suspend_wait can now return
> > -EAGAIN, indicating the waiter should be retried.
> > If the caller is
> > running in a work item, that work item needs to be requeued to avoid a
> > deadlock caused by the work item blocking the VF migration recovery
> > work item.
> >
> > v3:
> >  - Don't block in preempt fence work queue as this can interfere with VF
> >    post-migration work queue scheduling leading to deadlock (Testing)
> >  - Use xe_gt_recovery_inprogress (Michal)
> > v5:
> >  - Use static function for vf_recovery (Michal)
> >  - Add helper to wake CT waiters (Michal)
> >  - Move some code to following patch (Michal)
> >  - Adjust commit message to explain suspend_wait returning -EAGAIN (Michal)
> >  - Add kernel doc to suspend_wait around returning -EAGAIN
> >
> > Signed-off-by: Matthew Brost
> > ---
> >  drivers/gpu/drm/xe/xe_exec_queue_types.h |  3 +
> >  drivers/gpu/drm/xe/xe_gt_sriov_vf.c      |  4 ++
> >  drivers/gpu/drm/xe/xe_guc_ct.h           |  9 +++
> >  drivers/gpu/drm/xe/xe_guc_submit.c       | 82 ++++++++++++++++++------
> >  drivers/gpu/drm/xe/xe_preempt_fence.c    | 11 ++++
> >  5 files changed, 88 insertions(+), 21 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_exec_queue_types.h b/drivers/gpu/drm/xe/xe_exec_queue_types.h
> > index 27b76cf9da89..282505fa1377 100644
> > --- a/drivers/gpu/drm/xe/xe_exec_queue_types.h
> > +++ b/drivers/gpu/drm/xe/xe_exec_queue_types.h
> > @@ -207,6 +207,9 @@ struct xe_exec_queue_ops {
> >  	 * call after suspend. In dma-fencing path thus must return within a
> >  	 * reasonable amount of time. -ETIME return shall indicate an error
> >  	 * waiting for suspend resulting in associated VM getting killed.
> > +	 * -EAGAIN return indicates the wait should be tried again; if the wait
> > +	 * is within a work item, the work item should be requeued as a deadlock
> > +	 * avoidance mechanism.
> >  	 */
> >  	int (*suspend_wait)(struct xe_exec_queue *q);
> >  	/**
> > diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > index 7057260175f3..7f703336d692 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_sriov_vf.c
> > @@ -23,6 +23,7 @@
> >  #include "xe_gt_sriov_vf.h"
> >  #include "xe_gt_sriov_vf_types.h"
> >  #include "xe_guc.h"
> > +#include "xe_guc_ct.h"
> >  #include "xe_guc_hxg_helpers.h"
> >  #include "xe_guc_relay.h"
> >  #include "xe_guc_submit.h"
> > @@ -743,6 +744,9 @@ static void vf_start_migration_recovery(struct xe_gt *gt)
> >  	    !gt->sriov.vf.migration.recovery_teardown) {
> >  		gt->sriov.vf.migration.recovery_queued = true;
> >  		WRITE_ONCE(gt->sriov.vf.migration.recovery_inprogress, true);
> > +		smp_wmb(); /* Ensure above write visible before wake */
> > +
> > +		xe_guc_ct_wake_waiters(&gt->uc.guc.ct);
> >
> >  		started = queue_work(gt->ordered_wq, &gt->sriov.vf.migration.worker);
> >  		xe_gt_sriov_info(gt, "VF migration recovery %s\n", started ?
> > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
> > index d6c81325a76c..ca0ec938edac 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_ct.h
> > +++ b/drivers/gpu/drm/xe/xe_guc_ct.h
> > @@ -72,4 +72,13 @@ xe_guc_ct_send_block_no_fail(struct xe_guc_ct *ct, const u32 *action, u32 len)
> >
> >  long xe_guc_ct_queue_proc_time_jiffies(struct xe_guc_ct *ct);
> >
> > +/**
> > + * xe_guc_ct_wake_waiters() - GuC CT wake up waiters
> > + * @ct: GuC CT object
> > + */
> > +static inline void xe_guc_ct_wake_waiters(struct xe_guc_ct *ct)
> > +{
> > +	wake_up_all(&ct->wq);
> > +}
> > +
> >  #endif
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index 59371b7cc8a4..b2ca4911efe9 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -27,7 +27,6 @@
> >  #include "xe_gt.h"
> >  #include "xe_gt_clock.h"
> >  #include "xe_gt_printk.h"
> > -#include "xe_gt_sriov_vf.h"
> >  #include "xe_guc.h"
> >  #include "xe_guc_capture.h"
> >  #include "xe_guc_ct.h"
> > @@ -702,6 +701,11 @@ static u32 wq_space_until_wrap(struct xe_exec_queue *q)
> >  	return (WQ_SIZE - q->guc->wqi_tail);
> >  }
> >
> > +static bool vf_recovery(struct xe_guc *guc)
> > +{
> > +	return xe_gt_recovery_pending(guc_to_gt(guc));
> > +}
> > +
> >  static int wq_wait_for_space(struct xe_exec_queue *q, u32 wqi_size)
> >  {
> >  	struct xe_guc *guc = exec_queue_to_guc(q);
> > @@ -711,7 +715,7 @@ static int wq_wait_for_space(struct xe_exec_queue *q, u32 wqi_size)
> >
> >  #define AVAILABLE_SPACE \
> >  	CIRC_SPACE(q->guc->wqi_tail, q->guc->wqi_head, WQ_SIZE)
> > -	if (wqi_size > AVAILABLE_SPACE) {
> > +	if (wqi_size > AVAILABLE_SPACE && !vf_recovery(guc)) {
> >  try_again:
> >  		q->guc->wqi_head = parallel_read(xe, map, wq_desc.head);
> >  		if (wqi_size > AVAILABLE_SPACE) {
> > @@ -910,9 +914,10 @@ static void disable_scheduling_deregister(struct xe_guc *guc,
> >  	ret = wait_event_timeout(guc->ct.wq,
> >  				 (!exec_queue_pending_enable(q) &&
> >  				  !exec_queue_pending_disable(q)) ||
> > -				 xe_guc_read_stopped(guc),
> > +				 xe_guc_read_stopped(guc) ||
> > +				 vf_recovery(guc),
> >  				 HZ * 5);
> > -	if (!ret) {
> > +	if (!ret && !vf_recovery(guc)) {
> >  		struct xe_gpu_scheduler *sched = &q->guc->sched;
> >
> >  		xe_gt_warn(q->gt, "Pending enable/disable failed to respond\n");
> > @@ -1015,6 +1020,10 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
> >  	bool wedged = false;
> >
> >  	xe_gt_assert(guc_to_gt(guc), xe_exec_queue_is_lr(q));
> > +
> > +	if (vf_recovery(guc))
> > +		return;
> > +
> >  	trace_xe_exec_queue_lr_cleanup(q);
> >
> >  	if (!exec_queue_killed(q))
> > @@ -1047,7 +1056,11 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
> >  	 */
> >  	ret = wait_event_timeout(guc->ct.wq,
> >  				 !exec_queue_pending_disable(q) ||
> > -				 xe_guc_read_stopped(guc), HZ * 5);
> > +				 xe_guc_read_stopped(guc) ||
> > +				 vf_recovery(guc), HZ * 5);
> > +	if (vf_recovery(guc))
> > +		return;
> > +
> >  	if (!ret) {
> >  		xe_gt_warn(q->gt, "Schedule disable failed to respond, guc_id=%d\n",
> >  			   q->guc->id);
> > @@ -1137,8 +1150,9 @@ static void enable_scheduling(struct xe_exec_queue *q)
> >
> >  	ret = wait_event_timeout(guc->ct.wq,
> >  				 !exec_queue_pending_enable(q) ||
> > -				 xe_guc_read_stopped(guc), HZ * 5);
> > -	if (!ret || xe_guc_read_stopped(guc)) {
> > +				 xe_guc_read_stopped(guc) ||
> > +				 vf_recovery(guc), HZ * 5);
> > +	if ((!ret && !vf_recovery(guc)) || xe_guc_read_stopped(guc)) {
> >  		xe_gt_warn(guc_to_gt(guc), "Schedule enable failed to respond");
> >  		set_exec_queue_banned(q);
> >  		xe_gt_reset_async(q->gt);
> > @@ -1209,7 +1223,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> >  	 * list so job can be freed and kick scheduler ensuring free job is not
> >  	 * lost.
> >  	 */
> > -	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags))
> > +	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags) ||
> > +	    vf_recovery(guc))
> >  		return DRM_GPU_SCHED_STAT_NO_HANG;
> >
> >  	/* Kill the run_job entry point */
> > @@ -1261,7 +1276,10 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> >  	ret = wait_event_timeout(guc->ct.wq,
> >  				 (!exec_queue_pending_enable(q) &&
> >  				  !exec_queue_pending_disable(q)) ||
> > -				 xe_guc_read_stopped(guc), HZ * 5);
> > +				 xe_guc_read_stopped(guc) ||
> > +				 vf_recovery(guc), HZ * 5);
> > +	if (vf_recovery(guc))
> > +		goto handle_vf_resume;
> >  	if (!ret || xe_guc_read_stopped(guc))
> >  		goto trigger_reset;
> >
> > @@ -1286,7 +1304,10 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> >  	smp_rmb();
> >  	ret = wait_event_timeout(guc->ct.wq,
> >  				 !exec_queue_pending_disable(q) ||
> > -				 xe_guc_read_stopped(guc), HZ * 5);
> > +				 xe_guc_read_stopped(guc) ||
> > +				 vf_recovery(guc), HZ * 5);
> > +	if (vf_recovery(guc))
> > +		goto handle_vf_resume;
> >  	if (!ret || xe_guc_read_stopped(guc)) {
> > trigger_reset:
> >  		if (!ret)
> > @@ -1391,6 +1412,7 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> >  	 * some thought, do this in a follow up.
> >  	 */
> >  	xe_sched_submission_start(sched);
> > +handle_vf_resume:
> >  	return DRM_GPU_SCHED_STAT_NO_HANG;
> >  }
> >
> > @@ -1487,11 +1509,17 @@ static void __guc_exec_queue_process_msg_set_sched_props(struct xe_sched_msg *ms
> >
> >  static void __suspend_fence_signal(struct xe_exec_queue *q)
> >  {
> > +	struct xe_guc *guc = exec_queue_to_guc(q);
> > +	struct xe_device *xe = guc_to_xe(guc);
> > +
> >  	if (!q->guc->suspend_pending)
> >  		return;
> >
> >  	WRITE_ONCE(q->guc->suspend_pending, false);
> > -	wake_up(&q->guc->suspend_wait);
> > +	if (IS_SRIOV_VF(xe))
> > +		wake_up_all(&guc->ct.wq);
> 
> maybe xe_guc_ct_wake_waiters() ?
> 

We have roughly 10 other calls of wake_up_all(&guc->ct.wq) elsewhere that need fixing.
I suggest we fix up the entire driver in a follow-on patch to this series.

> and I guess some small in source comment why we differentiate between
> VF and !VF case would be beneficial
> 

I've added this.

> > +	else
> > +		wake_up(&q->guc->suspend_wait);
> >  }
> >
> >  static void suspend_fence_signal(struct xe_exec_queue *q)
> > @@ -1512,8 +1540,9 @@ static void __guc_exec_queue_process_msg_suspend(struct xe_sched_msg *msg)
> >
> >  	if (guc_exec_queue_allowed_to_change_state(q) && !exec_queue_suspended(q) &&
> >  	    exec_queue_enabled(q)) {
> > -		wait_event(guc->ct.wq, (q->guc->resume_time != RESUME_PENDING ||
> > -			   xe_guc_read_stopped(guc)) && !exec_queue_pending_disable(q));
> > +		wait_event(guc->ct.wq, vf_recovery(guc) ||
> > +			   ((q->guc->resume_time != RESUME_PENDING ||
> > +			   xe_guc_read_stopped(guc)) && !exec_queue_pending_disable(q)));
> >
> >  		if (!xe_guc_read_stopped(guc)) {
> >  			s64 since_resume_ms =
> > @@ -1640,7 +1669,7 @@ static int guc_exec_queue_init(struct xe_exec_queue *q)
> >
> >  	q->entity = &ge->entity;
> >
> > -	if (xe_guc_read_stopped(guc))
> > +	if (xe_guc_read_stopped(guc) || vf_recovery(guc))
> >  		xe_sched_stop(sched);
> >
> >  	mutex_unlock(&guc->submission_state.lock);
> > @@ -1786,6 +1815,7 @@ static int guc_exec_queue_suspend(struct xe_exec_queue *q)
> >  static int guc_exec_queue_suspend_wait(struct xe_exec_queue *q)
> >  {
> >  	struct xe_guc *guc = exec_queue_to_guc(q);
> > +	struct xe_device *xe = guc_to_xe(guc);
> >  	int ret;
> >
> >  	/*
> > @@ -1793,11 +1823,22 @@ static int guc_exec_queue_suspend_wait(struct xe_exec_queue *q)
> >  	 * suspend_pending upon kill but to be paranoid but races in which
> >  	 * suspend_pending is set after kill also check kill here.
> >  	 */
> > -	ret = wait_event_interruptible_timeout(q->guc->suspend_wait,
> > -					       !READ_ONCE(q->guc->suspend_pending) ||
> > -					       exec_queue_killed(q) ||
> > -					       xe_guc_read_stopped(guc),
> > -					       HZ * 5);
> > +	if (IS_SRIOV_VF(xe))
> > +		ret = wait_event_interruptible_timeout(guc->ct.wq,
> > +						       !READ_ONCE(q->guc->suspend_pending) ||
> > +						       exec_queue_killed(q) ||
> > +						       xe_guc_read_stopped(guc) ||
> > +						       vf_recovery(guc),
> > +						       HZ * 5);
> > +	else
> > +		ret = wait_event_interruptible_timeout(q->guc->suspend_wait,
> > +						       !READ_ONCE(q->guc->suspend_pending) ||
> > +						       exec_queue_killed(q) ||
> > +						       xe_guc_read_stopped(guc),
> > +						       HZ * 5);
> 
> nit: maybe both magic 5sec timeouts deserve some comment?

That's just the standard time we pick for dma-fences to signal everywhere in Xe. Again, perhaps we do a follow-up and replace HZ * 5 with a global dma-fence timeout value.

Matt

> > +
> > +	if (vf_recovery(guc) && !xe_device_wedged(guc_to_xe(guc)))
> > +		return -EAGAIN;
> >
> >  	if (!ret) {
> >  		xe_gt_warn(guc_to_gt(guc),
> > @@ -1905,8 +1946,7 @@ int xe_guc_submit_reset_prepare(struct xe_guc *guc)
> >  {
> >  	int ret;
> >
> > -	if (xe_gt_WARN_ON(guc_to_gt(guc),
> > -			  xe_gt_sriov_vf_recovery_pending(guc_to_gt(guc))))
> > +	if (xe_gt_WARN_ON(guc_to_gt(guc), vf_recovery(guc)))
> >  		return 0;
> >
> >  	if (!guc->submission_state.initialized)
> > diff --git a/drivers/gpu/drm/xe/xe_preempt_fence.c b/drivers/gpu/drm/xe/xe_preempt_fence.c
> > index 83fbeea5aa20..7f587ca3947d 100644
> > --- a/drivers/gpu/drm/xe/xe_preempt_fence.c
> > +++ b/drivers/gpu/drm/xe/xe_preempt_fence.c
> > @@ -8,6 +8,8 @@
> >  #include
> >
> >  #include "xe_exec_queue.h"
> > +#include "xe_gt_printk.h"
> > +#include "xe_guc_exec_queue_types.h"
> >  #include "xe_vm.h"
> >
> >  static void preempt_fence_work_func(struct work_struct *w)
> > @@ -22,6 +24,15 @@ static void preempt_fence_work_func(struct work_struct *w)
> >  	} else if (!q->ops->reset_status(q)) {
> >  		int err = q->ops->suspend_wait(q);
> >
> > +		if (err == -EAGAIN) {
> > +			xe_gt_dbg(q->gt, "PREEMPT FENCE RETRY guc_id=%d",
> > +				  q->guc->id);
> > +			queue_work(q->vm->xe->preempt_fence_wq,
> > +				   &pfence->preempt_work);
> > +			dma_fence_end_signalling(cookie);
> > +			return;
> > +		}
> > +
> >  		if (err)
> >  			dma_fence_set_error(&pfence->base, err);
> >  	} else {
> 
> just a few suggestions, but overall LGTM, trusting you (and CI) that it
> works, so
> 
> Reviewed-by: Michal Wajdeczko
> 