From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8D78226E6E4 for ; Wed, 24 Sep 2025 11:00:16 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758711617; cv=none; b=paykp6nHPvDwh7YG0OGSAJyoD5jBMmqa3EEXpbtXSFGayJ1Juk76pAtEDiZ1TJMQCz143PyyeiXVPhnwlDOYCylNJJuHuUyC082+dkl9eYs2TXjW+b9vMfNNB3VbRCLnc/ehUKINAwhUwyQ9rFGGGTHAS4fJyEE1XopFakF0CBs= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1758711617; c=relaxed/simple; bh=PLxNIxm4nnFqvF9R5flN5cuSrvAujprcSgu1R4cCFHs=; h=Mime-Version:Content-Type:Date:Message-Id:Subject:Cc:To:From: References:In-Reply-To; b=PEqV9mQy6YH2n4Ng59Au2h9jFY4iuMXYWciX3Gds4R/6ez3Uk/Jix2OwnDUoZsYpTa5z6lYgffO6CLHFJQY1eBJ687Z9qGx15N7DRhLeSnx9NC4sVp5vqNg5ZYD7GuDAG00a7G7uCC8+tRkc9K5ZzdrDY3a+JXd6x5GMGVJbOJI= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=DjXsO7uF; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="DjXsO7uF" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 5122DC4CEE7; Wed, 24 Sep 2025 11:00:14 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1758711616; bh=PLxNIxm4nnFqvF9R5flN5cuSrvAujprcSgu1R4cCFHs=; h=Date:Subject:Cc:To:From:References:In-Reply-To:From; b=DjXsO7uFaLxhxztyX6w+3JEBJijm3UmPV62s1n4Aw/ArAWp3k4Vr4tvtzLrWQdtT8 owspY6gAQ17pJKLPtXCDO8pEVv00RrSlU3t/SkPmX8R7ZQZOCUbrHbJ965ZpcMLSMt iWD3HVCxPaA8VgYCltKanqq+azO//2ERHHSGjO6e+YnaJId0z0HV9N4xIfVTNx9vB8 ksC577bCW3xhF1UH1bCrGbMo0RHQiaRGx+QNo5/lpR/n3Q5vukOBOt0BJiww5m77hI Fy1lVVgRq5IpOUFRBBfHqULEAI0Yln83s956qWiqZzstdE8TqMXiHmgQG7LStyI0KE aVP86a8cOspiA== Precedence: bulk X-Mailing-List: stable@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=UTF-8 Date: Wed, 24 Sep 2025 13:00:12 +0200 Message-Id: Subject: Re: [PATCH 6.12.y 1/3] drm/sched: Optimise drm_sched_entity_push_job Cc: "Philipp Stanner" , "Jules Maselbas" , , , "Tvrtko Ursulin" , "Alex Deucher" , "Luben Tuikov" , "Matthew Brost" To: =?utf-8?q?Christian_K=C3=B6nig?= From: "Danilo Krummrich" References: <20250922130948.5549-1-jmaselbas@zdiv.net> <8661bce085eed921feb3e718b8dc4c46784dff4d.camel@redhat.com> <57b2275c-d18a-418d-956f-2ed054ec555f@amd.com> <76c94ee6-ba28-4517-8b6c-35658ac95d3b@amd.com> In-Reply-To: <76c94ee6-ba28-4517-8b6c-35658ac95d3b@amd.com> On Tue Sep 23, 2025 at 2:33 PM CEST, Christian K=C3=B6nig wrote: > On 23.09.25 14:08, Philipp Stanner wrote: >> You know folks, situations like that are why we want to strongly >> discourage accessing another API's struct members directly. There is no >> API contract for them. Indeed, please don't peek API internals. If you need additional functionali= ty, please send a patch adding a supported API for the component instead. Drivers messing with component internals makes impossible to maintain them = in the long term. >> And a proper API function rarely changes its interface, and if it does, >> it's easy to find for the contributor where drivers need to be >> adjusted. If we were all following that rule, you wouldn't even have to >> bother with patches #1 and #2. >>=20 >> That said, I see two proper solutions for your problem: >>=20 >> A. amdgpu is the one stopping the entities anyways, isn't it? It >> knows which entities it has killed. So that information could be >> stored in struct amdgpu_vm. > > No, it's the scheduler which decides when entities are stopped. Can you please show me the code where the scheduler calls any of drm_sched_entity_fini(), drm_sched_entity_flush(), drm_sched_entity_destroy= ()? Or are you referring the broken hack in drm_sched_fini() (introduced by com= mit c61cdbdbffc1 ("drm/scheduler: Fix hang when sched_entity released")) where = it is just ignored that we need to take the entity lock as well, because it inconviniently would lead to lock inversion? spin_lock(&rq->lock); list_for_each_entry(s_entity, &rq->entities, list) /* * Prevents reinsertion and marks job_queue as idle, * it will be removed from the rq in drm_sched_entity_fini() * eventually */ s_entity->stopped =3D true; spin_unlock(&rq->lock); The patch description that introduced the hack says: If scheduler is already stopped by the time sched_entity is released and entity's job_queue not empty I encountred a hang in drm_sched_entity_flush. But this sounds to me as if amdgpu simply doesn't implement the correct shu= tdown ordering. Why do nouveau, Xe and other drivers don't have this problem? Why= do we need to solve it in the scheduler instead? Maybe there are reasonable answers to that. And assuming there are, it stil= l isn't a justification for building on top of a broken workaround. :(