From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 551BBC47DDB for ; Fri, 26 Jan 2024 16:30:26 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 11FE110FBD0; Fri, 26 Jan 2024 16:30:26 +0000 (UTC) Received: from mgamail.intel.com (mgamail.intel.com [192.55.52.93]) by gabe.freedesktop.org (Postfix) with ESMTPS id 2EE2010FBD0; Fri, 26 Jan 2024 16:30:25 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1706286625; x=1737822625; h=date:from:to:cc:subject:message-id:references: content-transfer-encoding:in-reply-to:mime-version; bh=sUMABoXmoefgH93qY9BKv2NkTZA0nzR1Qs1xyDMKTUY=; b=E/MIZH3kh6iTG0qiuKfJ4S2tvL4WEL4TzuLRxbNwL45T4hq75Pnu4Gjr uPMrzAOwJFJTxUMY/icFiKxC65ymHbED16ady3VV4TrgCu1+wkHfsmxnv CJva8r+ATARyFoZKDxNyiW/udfjnTpewJq+ONmM98iyAMAt18yhfrIOAl 5O2QwXYmPiBTT9gnkqBi7cetG1IQKh5H/2lKugsMNz2594k90a8s75SQh uDprOP0XxJAW9Vv0WbpTJivLxfGebyTB3krXY2c2rpZFPq3ikFy/paglL u2Nqeers+1f2TDCjCymT6Ohg+OQZVWGPVTnvQ+AtbuWU+Gc5o4l7O0w7u g==; X-IronPort-AV: E=McAfee;i="6600,9927,10964"; a="399667645" X-IronPort-AV: E=Sophos;i="6.05,216,1701158400"; d="scan'208";a="399667645" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jan 2024 08:30:24 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10964"; a="960238955" X-IronPort-AV: E=Sophos;i="6.05,216,1701158400"; d="scan'208";a="960238955" Received: from fmsmsx602.amr.corp.intel.com ([10.18.126.82]) by orsmga005.jf.intel.com with ESMTP/TLS/AES256-GCM-SHA384; 26 Jan 2024 08:30:23 -0800 Received: from fmsmsx610.amr.corp.intel.com (10.18.126.90) by fmsmsx602.amr.corp.intel.com (10.18.126.82) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.35; Fri, 26 Jan 2024 08:30:23 -0800 Received: from fmsedg601.ED.cps.intel.com (10.1.192.135) by fmsmsx610.amr.corp.intel.com (10.18.126.90) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.35 via Frontend Transport; Fri, 26 Jan 2024 08:30:23 -0800 Received: from NAM10-BN7-obe.outbound.protection.outlook.com (104.47.70.100) by edgegateway.intel.com (192.55.55.70) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.1.2507.35; Fri, 26 Jan 2024 08:30:22 -0800 ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=LYu7ixbjeRpILeoexTT/BcjwluHAKVbJn8dAAcriq1lZbw41sDf3VdgmGOGBIhH8dZ9xx4k9Z4EFLkl0fVEmzr+6R96SJGz8jkUvd6HB/dV/KIUSt460WllZcQnlANztNCXx5CGp6FlUSOqBSI6iGJxsU4PexQQZrFR1zPq9qnwURSm1Qiw3DAA9g6G9XF7HHas1CrRw9dPRsma81arv81VevCK8HjCGc4bARoWNpBuG/pQesRwHc5xNsmRKRl4a0wNdkH5AMWcB3Abt8ZVAGqTOYp7nhoZEbd4i72ovVaJxKmFgD0yYbIjLVpQPwPrbJpD4yNzJn0LZ9/mv1Q4e6g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=75Fa7jY4HV92c5CHTPGaB4Bj8UPxc7yFNy3AjlJsOiI=; b=EtCk4JvYvfeq1iRrQlnhWkfB9F2HPokj8RKxIdFNdDasryFR8gRb6Dc3HCK6RHy0DgnNrMFNlCIpaGmetxvsZ/HJt6Om7PC0angnHpNDppRCDOzV//pcaHfbyTVEIlewwbv4R/V49PcRRBV1sj+yxcmoRQ+ADifZkhxxFPQRMjnI3rjcXqoXPI2+JEe/kX1HDrFooyzVaI1HnZeV3+g/32vFD828JXbaPWk+hy5YZJQ7Ow8ot+MMfRoIRJBk3l1lUMRW/5iYZNb7Z/YR6u2AMJg+dwvcw/nECiIpB3l+BhYGXZdHUkKzE4FodIZOD58hIErDRE59J88CkvrtBq6TPg== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=intel.com; dmarc=pass action=none header.from=intel.com; dkim=pass header.d=intel.com; arc=none Authentication-Results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=intel.com; Received: from PH7PR11MB6522.namprd11.prod.outlook.com (2603:10b6:510:212::12) by DM4PR11MB6066.namprd11.prod.outlook.com (2603:10b6:8:62::13) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7228.27; Fri, 26 Jan 2024 16:30:19 +0000 Received: from PH7PR11MB6522.namprd11.prod.outlook.com ([fe80::b9a8:8221:e4a1:4cda]) by PH7PR11MB6522.namprd11.prod.outlook.com ([fe80::b9a8:8221:e4a1:4cda%4]) with mapi id 15.20.7202.035; Fri, 26 Jan 2024 16:30:19 +0000 Date: Fri, 26 Jan 2024 16:29:27 +0000 From: Matthew Brost To: Christian =?iso-8859-1?Q?K=F6nig?= Subject: Re: [PATCH] drm/sched: Drain all entities in DRM sched run job worker Message-ID: References: <20240124210811.1639040-1-matthew.brost@intel.com> <0bef4c76-924f-442f-af9c-d701e640db41@amd.com> Content-Type: text/plain; charset="iso-8859-1" Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <0bef4c76-924f-442f-af9c-d701e640db41@amd.com> X-ClientProxiedBy: BY5PR20CA0036.namprd20.prod.outlook.com (2603:10b6:a03:1f4::49) To PH7PR11MB6522.namprd11.prod.outlook.com (2603:10b6:510:212::12) MIME-Version: 1.0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: PH7PR11MB6522:EE_|DM4PR11MB6066:EE_ X-MS-Office365-Filtering-Correlation-Id: 9a598d32-469b-4845-b0c3-08dc1e8c1617 X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0; X-Microsoft-Antispam-Message-Info: XS0IZ41Bj8Tdfq8dL1iEZ9dyg9ndamcXhdCqA/VwS3HbOWtGUsdFXQlBFbotnvz0dNz7DRAcAhgjIfC7KBnHqWpOnPKPyHheJfYKRgq32VkKKKLZz2V1rtc5+HVO0chAIfXZUK1dhAyoplznVcTgS0rrDnHRcCRHbj+KYgFRkC4PxnLTw/ZHsVvLDgi/ZrnY+X9RVisPifFQ3uxnbvaIdElqFYahp6KpcE33vjqVib6sZQYEzzHMX0tEtEynLuM//BtpvFI85PzoVARXoFcHoOkMOWAwFgMCZpIuKSjcIUd1MwzQndNgLvGnvo+xdLPmCVoul+sIQjUUjrTNvXc08HnaW3nBR+/ikf9fiPf16O38BFXkz+90pW+wQ8IZhHEnYEsSJwV8hYEK+OfVBCsb1Lo4+SG6vAQbcGIn52u733aDr3ZqjOSfkMvminOG5lcfLj2woGloubGx0XEThO6Vxq8DQQp/vBQbLzb98iiffGMeBZSCWRWkBbr4HiprihVszIBF0KdH6kKINQsKOXmmpZQXYg2rWmri9rLTSKayjQk= X-Forefront-Antispam-Report: CIP:255.255.255.255; CTRY:; LANG:en; SCL:1; SRV:; IPV:NLI; SFV:NSPM; H:PH7PR11MB6522.namprd11.prod.outlook.com; PTR:; CAT:NONE; SFS:(13230031)(396003)(346002)(39860400002)(136003)(366004)(376002)(230922051799003)(186009)(451199024)(64100799003)(1800799012)(86362001)(44832011)(7416002)(2906002)(5660300002)(41300700001)(83380400001)(478600001)(82960400001)(6512007)(6666004)(38100700002)(66574015)(26005)(966005)(6486002)(4326008)(8936002)(8676002)(66476007)(6916009)(316002)(54906003)(66556008)(6506007)(66946007); DIR:OUT; SFP:1102; X-MS-Exchange-AntiSpam-MessageData-ChunkCount: 1 X-MS-Exchange-AntiSpam-MessageData-0: =?iso-8859-1?Q?fRlt0DaohxkQUXvsf2vceNwMk1UDbcuoaouLuEfSiRcQSpQbH8Gmb5Ygjj?= =?iso-8859-1?Q?+/yeFjV6l48yKmaEwoZ3sp93EAbHyFV5IWG4Lhv0ZQ197L1mbYZUGpcaHn?= =?iso-8859-1?Q?/RSHLP1o4uZIlruNoaGmPEl/iZAMtFKJ/tSK4YAfHPMBN662WpPaRdtIKg?= =?iso-8859-1?Q?cSVKnIiXeCazpdRcch9z3L7DmwnSaxGu97IUOCDclQYcz8ts5BIy6Lj5W0?= =?iso-8859-1?Q?+XE+viZiNWhDnwkd53UVMywdMsqL7cw+1E6dxalve5f41SkJvGOjzV0op6?= =?iso-8859-1?Q?j3b7TSUA7I2Wti64mrE5a94TCfuFr2shVd5OiCTB7O//szMgEX0+6ue46X?= =?iso-8859-1?Q?7GLbeYtExBCNzNp6UUcptF3HzGHcc8HrIUz5iQoTA1CPpMVhAXGOh4NQeW?= =?iso-8859-1?Q?PAZ+OfHzCEkFtvdGkjVYl7eON5Owodl1RdO9PF2du/nEU2FOBD37FgCUqY?= =?iso-8859-1?Q?UOwQQKM3j+eM+vv1H+XzFdCxNZhX/HWpUUtu98VvddBijX50YCi634pNsu?= =?iso-8859-1?Q?dxekU503wxThLEkjIbuR3aoL1zEzbiLHvh+1KdgDv95CFAr660t6GQqYi8?= =?iso-8859-1?Q?+22fQZJxO/UUoQeWU/BIE7KAwgVajwTcfGeSTX6zWNRJ7Enhbs4WQm/xgV?= =?iso-8859-1?Q?JZElwRFuQ2MT5Fhfb2ycDYaQeEzD1xdZIRvhyVIxB3NsEYFS0cZNOROWGH?= =?iso-8859-1?Q?DjjknERIHRqeCQgmuH69/ZueIztMt3UYngkKYfHuUD/NXkQ6/OViDcrD5o?= =?iso-8859-1?Q?2Oyjb5Wx/4rWGtC4ZwwhnmKTL3RvUO/8YIJXj05aG4G+yBPZj1GPUbgAPK?= =?iso-8859-1?Q?Y0D7PuPxpYJLDsvUGhp5KAhkS9f2SjoxmAIcGrSAMVU+3xy4h7Q83JhzFs?= =?iso-8859-1?Q?YgqXsFhW5AJ2z/M8BFSP/PpUyXvbLQU7cnYfT0F1NjGt0j+ddWbznxpZoi?= =?iso-8859-1?Q?uXXy+hn6PC3LXixx4huPlOGTGxXBLdJzFZ9oDeWnXjtXJqqi9XxICIQK4w?= =?iso-8859-1?Q?E0ZhH+WGzMhX4QnBewIGhiw+2qgEH0ra+nFu7krdWbgPnQRFMVvtdh83aX?= =?iso-8859-1?Q?42iVTYV3OUYAC0ZV6BO50I/i6ZqUDU5rOgvbo+GLkIggfoouWbysm4Etk4?= =?iso-8859-1?Q?MfsiV96L1mFwMZtImuIXD0RzVxwF3mHfQRh0MV/bEx0FiZ1lNmLsh/pBF2?= =?iso-8859-1?Q?bA9Cja7TXi4NTWYBlgLe7sWxGEhx2s8IP0BDR0HPVTtiYHKliW7VsiSKuX?= =?iso-8859-1?Q?BKohqdZsqzxbUP8f/b3w5h0Rd6Fn1x7HSSaiT0SEjsl7QDkYwuxFpVGnMk?= =?iso-8859-1?Q?4uvcOuuax+fb6mBqcS0ujVhVAwe/TxHaBPXrko1eT2MjjtximxzLv07MD7?= =?iso-8859-1?Q?bjnib0Ig80rLjDqG04IbOrdPOJMMrEahux9HAX3fd55CDIS0tQLC60FXsy?= =?iso-8859-1?Q?RQ3lgq1So1Q5NYJGD38J4v1Atc4WD907xleS5Eq5JNyxyb+idmARi8F48y?= =?iso-8859-1?Q?/bKQSw+PGtvLaTrczi/SY2xyU1XA+Z1v1fe7N5grpGooSEfiCG/tzZ1X/j?= =?iso-8859-1?Q?swLagW0i1IKX8FBZDiaAvNfsjJYfeSL9fKj5yNh97yPOdh+HZoBjV4c8/r?= =?iso-8859-1?Q?2JalHBq+diT3aqdt9ygwDvS8aa0b360cQez6wLrLcZNWQxS2emToGriw?= =?iso-8859-1?Q?=3D=3D?= X-MS-Exchange-CrossTenant-Network-Message-Id: 9a598d32-469b-4845-b0c3-08dc1e8c1617 X-MS-Exchange-CrossTenant-AuthSource: PH7PR11MB6522.namprd11.prod.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Internal X-MS-Exchange-CrossTenant-OriginalArrivalTime: 26 Jan 2024 16:30:19.5934 (UTC) X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted X-MS-Exchange-CrossTenant-Id: 46c98d88-e344-4ed4-8496-4ed7712e255d X-MS-Exchange-CrossTenant-MailboxType: HOSTED X-MS-Exchange-CrossTenant-UserPrincipalName: DtVaNPSyTywUuPFcBCOaf8bjBgnuIrUu5SW+CN8boh6kSPM6BmBCCO0LUvcwGkyqsw7LWldZ7FSA1hSracQijw== X-MS-Exchange-Transport-CrossTenantHeadersStamped: DM4PR11MB6066 X-OriginatorOrg: intel.com X-BeenThere: intel-xe@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel Xe graphics driver List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: ltuikov89@gmail.com, dri-devel@lists.freedesktop.org, Thorsten Leemhuis , Mario Limonciello , daniel@ffwll.ch, Mikhail Gavrilov , airlied@gmail.com, intel-xe@lists.freedesktop.org, Vlastimil Babka Errors-To: intel-xe-bounces@lists.freedesktop.org Sender: "Intel-xe" On Fri, Jan 26, 2024 at 11:32:57AM +0100, Christian König wrote: > Am 25.01.24 um 18:30 schrieb Matthew Brost: > > On Thu, Jan 25, 2024 at 04:12:58PM +0100, Christian König wrote: > > > > > > Am 24.01.24 um 22:08 schrieb Matthew Brost: > > > > All entities must be drained in the DRM scheduler run job worker to > > > > avoid the following case. An entity found that is ready, no job found > > > > ready on entity, and run job worker goes idle with other entities + jobs > > > > ready. Draining all ready entities (i.e. loop over all ready entities) > > > > in the run job worker ensures all job that are ready will be scheduled. > > > That doesn't make sense. drm_sched_select_entity() only returns entities > > > which are "ready", e.g. have a job to run. > > > > > That is what I thought too, hence my original design but it is not > > exactly true. Let me explain. > > > > drm_sched_select_entity() returns an entity with a non-empty spsc queue > > (job in queue) and no *current* waiting dependecies [1]. Dependecies for > > an entity can be added when drm_sched_entity_pop_job() is called [2][3] > > returning a NULL job. Thus we can get into a scenario where 2 entities > > A and B both have jobs and no current dependecies. A's job is waiting > > B's job, entity A gets selected first, a dependecy gets installed in > > drm_sched_entity_pop_job(), run work goes idle, and now we deadlock. > > And here is the real problem. run work doesn't goes idle in that moment. > > drm_sched_run_job_work() should restarts itself until there is either no > more space in the ring buffer or it can't find a ready entity any more. > > At least that was the original design when that was all still driven by a > kthread. > > It can perfectly be that we messed this up when switching from kthread to a > work item. > Right, that what this patch does - the run worker does not go idle until no ready entities are found. That was incorrect in the original patch and fixed here. Do you have any issues with this fix? It has been tested 3x times and clearly fixes the issue. Matt > Regards, > Christian. > > > > > The proper solution is to loop over all ready entities until one with a > > job is found via drm_sched_entity_pop_job() and then requeue the run > > job worker. Or loop over all entities until drm_sched_select_entity() > > returns NULL and then let the run job worker go idle. This is what the > > old threaded design did too [4]. Hope this clears everything up. > > > > Matt > > > > [1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L144 > > [2] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L464 > > [3] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L397 > > [4] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_main.c#L1011 > > > > > If that's not the case any more then you have broken something else. > > > > > > Regards, > > > Christian. > > > > > > > Cc: Thorsten Leemhuis > > > > Reported-by: Mikhail Gavrilov > > > > Closes: https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuOWtoeeE+q26zE+Q@mail.gmail.com/ > > > > Reported-and-tested-by: Mario Limonciello > > > > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3124 > > > > Link: https://lore.kernel.org/all/20240123021155.2775-1-mario.limonciello@amd.com/ > > > > Reported-by: Vlastimil Babka > > > > Closes: https://lore.kernel.org/dri-devel/05ddb2da-b182-4791-8ef7-82179fd159a8@amd.com/T/#m0c31d4d1b9ae9995bb880974c4f1dbaddc33a48a > > > > Signed-off-by: Matthew Brost > > > > --- > > > > drivers/gpu/drm/scheduler/sched_main.c | 15 +++++++-------- > > > > 1 file changed, 7 insertions(+), 8 deletions(-) > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > > > > index 550492a7a031..85f082396d42 100644 > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > > @@ -1178,21 +1178,20 @@ static void drm_sched_run_job_work(struct work_struct *w) > > > > struct drm_sched_entity *entity; > > > > struct dma_fence *fence; > > > > struct drm_sched_fence *s_fence; > > > > - struct drm_sched_job *sched_job; > > > > + struct drm_sched_job *sched_job = NULL; > > > > int r; > > > > if (READ_ONCE(sched->pause_submit)) > > > > return; > > > > - entity = drm_sched_select_entity(sched); > > > > + /* Find entity with a ready job */ > > > > + while (!sched_job && (entity = drm_sched_select_entity(sched))) { > > > > + sched_job = drm_sched_entity_pop_job(entity); > > > > + if (!sched_job) > > > > + complete_all(&entity->entity_idle); > > > > + } > > > > if (!entity) > > > > - return; > > > > - > > > > - sched_job = drm_sched_entity_pop_job(entity); > > > > - if (!sched_job) { > > > > - complete_all(&entity->entity_idle); > > > > return; /* No more work */ > > > > - } > > > > s_fence = sched_job->s_fence; >