From: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
To: "Nilawar, Badal" <badal.nilawar@intel.com>,
	<intel-xe@lists.freedesktop.org>,
	<dri-devel@lists.freedesktop.org>, <linux-kernel@vger.kernel.org>
Cc: <anshuman.gupta@intel.com>, <rodrigo.vivi@intel.com>,
	<alexander.usyskin@intel.com>, <gregkh@linuxfoundation.org>,
	<jgg@nvidia.com>
Subject: Re: [PATCH v3 06/10] drm/xe/xe_late_bind_fw: Reload late binding fw in rpm resume
Date: Mon, 23 Jun 2025 08:26:46 -0700
Message-ID: <6733693f-64b2-47fa-97ba-4ebba3edef35@intel.com>
In-Reply-To: <a8d2605c-930b-4eeb-8e4a-1aa9bbfbb960@intel.com>



On 6/18/2025 10:52 PM, Nilawar, Badal wrote:
>
> On 19-06-2025 02:35, Daniele Ceraolo Spurio wrote:
>>
>>
>> On 6/18/2025 12:00 PM, Badal Nilawar wrote:
>>> Reload late binding fw during runtime resume.
>>>
>>> v2: Flush worker during runtime suspend
>>>
>>> Signed-off-by: Badal Nilawar <badal.nilawar@intel.com>
>>> ---
>>>   drivers/gpu/drm/xe/xe_late_bind_fw.c | 2 +-
>>>   drivers/gpu/drm/xe/xe_late_bind_fw.h | 1 +
>>>   drivers/gpu/drm/xe/xe_pm.c           | 6 ++++++
>>>   3 files changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_late_bind_fw.c b/drivers/gpu/drm/xe/xe_late_bind_fw.c
>>> index 54aa08c6bdfd..c0be9611c73b 100644
>>> --- a/drivers/gpu/drm/xe/xe_late_bind_fw.c
>>> +++ b/drivers/gpu/drm/xe/xe_late_bind_fw.c
>>> @@ -58,7 +58,7 @@ static int xe_late_bind_fw_num_fans(struct xe_late_bind *late_bind)
>>>           return 0;
>>>   }
>>>
>>> -static void xe_late_bind_wait_for_worker_completion(struct xe_late_bind *late_bind)
>>> +void xe_late_bind_wait_for_worker_completion(struct xe_late_bind *late_bind)
>>>   {
>>>       struct xe_device *xe = late_bind_to_xe(late_bind);
>>>       struct xe_late_bind_fw *lbfw;
>>> diff --git a/drivers/gpu/drm/xe/xe_late_bind_fw.h b/drivers/gpu/drm/xe/xe_late_bind_fw.h
>>> index 28d56ed2bfdc..07e437390539 100644
>>> --- a/drivers/gpu/drm/xe/xe_late_bind_fw.h
>>> +++ b/drivers/gpu/drm/xe/xe_late_bind_fw.h
>>> @@ -12,5 +12,6 @@ struct xe_late_bind;
>>>
>>>   int xe_late_bind_init(struct xe_late_bind *late_bind);
>>>   int xe_late_bind_fw_load(struct xe_late_bind *late_bind);
>>> +void xe_late_bind_wait_for_worker_completion(struct xe_late_bind *late_bind);
>>>
>>>   #endif
>>> diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
>>> index ff749edc005b..91923fd4af80 100644
>>> --- a/drivers/gpu/drm/xe/xe_pm.c
>>> +++ b/drivers/gpu/drm/xe/xe_pm.c
>>> @@ -20,6 +20,7 @@
>>>   #include "xe_gt.h"
>>>   #include "xe_guc.h"
>>>   #include "xe_irq.h"
>>> +#include "xe_late_bind_fw.h"
>>>   #include "xe_pcode.h"
>>>   #include "xe_pxp.h"
>>>   #include "xe_trace.h"
>>> @@ -460,6 +461,8 @@ int xe_pm_runtime_suspend(struct xe_device *xe)
>>>       if (err)
>>>           goto out;
>>>
>>> +    xe_late_bind_wait_for_worker_completion(&xe->late_bind);
>>
>> I think this can deadlock, because you do an rpm_put from within the
>> worker and if that's the last put it'll end up here and wait for the
>> worker to complete.
>> We could probably just skip this wait, because the worker can handle
>> rpm itself. What we might want to be careful about is to not re-queue
>> it (from xe_late_bind_fw_load below) if it's currently being
>> executed; we could also just let the fw be loaded twice if we hit
>> that race condition, which shouldn't be an issue apart from doing
>> unnecessary work.
>
> In xe_pm_runtime_get/_put, deadlocks are avoided by verifying the 
> condition (xe_pm_read_callback_task(xe) == current).

Isn't that for calls to rpm_get/put done from within the
rpm_suspend/resume code? That is not the case here: we're not
deadlocking on the rpm lock, we're deadlocking on the worker.

The error flow as I see it here would be as follows:

     rpm refcount is 1, owned by thread X
     worker starts
     worker takes rpm [rpm refcount now 2]
     thread X releases rpm [rpm refcount now 1]
     worker releases rpm [rpm refcount now 0]
         rpm_suspend is called from within the worker
             xe_pm_write_callback_task is called
             flush_work is called -> deadlock

I don't see how the callback_task() code can block the flush_work from 
deadlocking here.
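A toy, single-threaded userspace model of the first flow may make the difference concrete. All names below are hypothetical (this is not xe or PM-core code), integer task ids stand in for `current`, and, as in the flow above, it assumes that dropping the last rpm reference runs the suspend path synchronously in the caller's context. The callback-task check only changes how a put issued by the suspend/resume callback itself is handled, so it never fires when the worker drops the last reference:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the first flow (hypothetical names, not xe code).
 * Assumption: the last rpm put runs the suspend path synchronously
 * in the context of whoever dropped that reference.
 */
struct rpm_model {
	int refcount;		/* runtime PM usage count */
	int current_task;	/* stands in for `current` */
	int callback_task;	/* models xe_pm_{write,read}_callback_task() */
	int worker_task;	/* task running the late-bind worker, -1 if idle */
	bool deadlocked;	/* set when the flush would wait on itself */
};

/* Models flush_work(): a worker waiting for its own completion never returns. */
static void model_flush_worker(struct rpm_model *m)
{
	if (m->worker_task == m->current_task)
		m->deadlocked = true;
}

/* Models the runtime suspend path with the proposed wait added. */
static void model_runtime_suspend(struct rpm_model *m)
{
	m->callback_task = m->current_task;
	model_flush_worker(m);
	m->callback_task = -1;
}

static void model_rpm_get(struct rpm_model *m)
{
	m->refcount++;
}

static void model_rpm_put(struct rpm_model *m)
{
	assert(m->refcount > 0);
	m->refcount--;
	/* Puts from the callback itself are "noidle"; any other last put suspends. */
	if (m->callback_task != m->current_task && m->refcount == 0)
		model_runtime_suspend(m);
}

/* Task 1 is thread X, task 2 is the worker; returns true if the flush deadlocks. */
static bool replay_flow(bool worker_puts_last)
{
	struct rpm_model m = { .refcount = 1, .current_task = 1,
			       .callback_task = -1, .worker_task = -1 };

	m.worker_task = 2;		/* worker starts */
	m.current_task = 2;
	model_rpm_get(&m);		/* refcount 2 */

	if (worker_puts_last) {
		m.current_task = 1;
		model_rpm_put(&m);	/* thread X releases: refcount 1 */
		m.current_task = 2;
		model_rpm_put(&m);	/* worker releases: refcount 0, suspend runs here */
	} else {
		model_rpm_put(&m);	/* worker releases: refcount 1 */
		m.worker_task = -1;	/* worker exits */
		m.current_task = 1;
		model_rpm_put(&m);	/* thread X releases: refcount 0 */
	}
	return m.deadlocked;
}
```

replay_flow(true) reports the self-flush, while replay_flow(false) does not, which is why the hang would only show up when the worker happens to be the one dropping the last reference.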

Also, assuming the deadlock issue is not there, what happens if the
rpm refcount is 0 when the worker starts?

     worker starts
     worker takes rpm [rpm refcount now 1]
         rpm_resume is called
             worker is re-queued
     worker releases rpm [rpm refcount now 0]
     worker exits
     worker re-starts -> go back to beginning
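The second flow can be sketched with a similar toy model. The names are again hypothetical, and it assumes that runtime resume always queues the late-bind worker and that the device really suspends whenever the usage count drops to 0 between worker runs. Under those assumptions the worker keeps re-queuing itself, so the drain loop below only stops because of its max_runs cap:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the second flow (hypothetical names, not xe code).
 * Assumptions: the deadlock is gone, runtime resume queues the
 * late-bind worker, and the device suspends whenever the usage
 * count drops to 0 between worker runs.
 */
struct loop_model {
	int refcount;		/* runtime PM usage count */
	bool worker_queued;	/* late-bind work item pending */
	int fw_loads;		/* how many times the table was sent */
};

static void loop_runtime_resume(struct loop_model *m)
{
	m->worker_queued = true;	/* resume path calls xe_late_bind_fw_load() */
}

static void loop_rpm_get(struct loop_model *m)
{
	if (m->refcount++ == 0)
		loop_runtime_resume(m);	/* 0 -> 1 wakes the device */
}

static void loop_rpm_put(struct loop_model *m)
{
	assert(m->refcount > 0);
	m->refcount--;			/* 1 -> 0 lets the device suspend again */
}

/* A worker run that unconditionally takes an rpm reference. */
static void loop_worker_run(struct loop_model *m)
{
	m->worker_queued = false;
	loop_rpm_get(m);
	m->fw_loads++;
	loop_rpm_put(m);
}

/* Runs pending work at most max_runs times and returns the number of runs. */
static int loop_drain(struct loop_model *m, int max_runs)
{
	int runs = 0;

	while (m->worker_queued && runs < max_runs) {
		loop_worker_run(m);
		runs++;
	}
	return runs;
}
```

Starting from refcount 0 with the worker queued, loop_drain() hits its cap with the worker still queued, which is the "go back to beginning" step above.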

This second issue should be easy to fix by using pm_get_if_in_use from
the worker, so that we don't load the late_bind table if we're
rpm_suspended; we'll do it anyway when someone else resumes the device.
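A toy model of the suggested fix, under the same kind of assumptions (hypothetical names, not xe code, resume always queues the worker). The helper below only takes a reference if the device is already in use, roughly like the kernel's pm_runtime_get_if_in_use(); the exact xe wrapper to call is left as in the text above. A suspended device is left alone, and the load happens once a real user resumes it:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the suggested fix (hypothetical names, not xe code):
 * the worker only takes a reference if the device is already in use.
 */
struct fix_model {
	int refcount;		/* runtime PM usage count */
	bool worker_queued;	/* late-bind work item pending */
	int fw_loads;		/* how many times the table was sent */
};

static void fix_rpm_get(struct fix_model *m)
{
	if (m->refcount++ == 0)
		m->worker_queued = true;	/* resume path queues the worker */
}

/* Roughly pm_runtime_get_if_in_use(): never wakes a suspended device. */
static bool fix_rpm_get_if_in_use(struct fix_model *m)
{
	if (m->refcount == 0)
		return false;
	m->refcount++;
	return true;
}

static void fix_rpm_put(struct fix_model *m)
{
	assert(m->refcount > 0);
	m->refcount--;
}

static void fix_worker_run(struct fix_model *m)
{
	m->worker_queued = false;
	if (!fix_rpm_get_if_in_use(m))
		return;			/* suspended: the next resume re-queues the load */
	m->fw_loads++;
	fix_rpm_put(m);
}

/* Runs pending work at most max_runs times and returns the number of runs. */
static int fix_drain(struct fix_model *m, int max_runs)
{
	int runs = 0;

	while (m->worker_queued && runs < max_runs) {
		fix_worker_run(m);
		runs++;
	}
	return runs;
}
```

With the device suspended, a queued worker runs once and exits without loading anything; after a user's get resumes the device, the re-queued worker loads the table exactly once.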

Daniele

>
> Badal
>
>>
>> Daniele
>>
>>> +
>>>       /*
>>>        * Applying lock for entire list op as xe_ttm_bo_destroy and xe_bo_move_notify
>>>        * also checks and deletes bo entry from user fault list.
>>> @@ -550,6 +553,9 @@ int xe_pm_runtime_resume(struct xe_device *xe)
>>>
>>>       xe_pxp_pm_resume(xe->pxp);
>>>
>>> +    if (xe->d3cold.allowed)
>>> +        xe_late_bind_fw_load(&xe->late_bind);
>>> +
>>>   out:
>>>       xe_rpm_lockmap_release(xe);
>>>       xe_pm_write_callback_task(xe, NULL);
>>


Thread overview: 33+ messages
2025-06-18 18:59 [PATCH v3 00/10] Introducing firmware late binding Badal Nilawar
2025-06-18 18:59 ` [PATCH v3 01/10] mei: bus: add mei_cldev_mtu interface Badal Nilawar
2025-06-18 18:59 ` [PATCH v3 02/10] mei: late_bind: add late binding component driver Badal Nilawar
2025-06-19  7:32   ` Gupta, Anshuman
2025-06-19  8:11     ` Jani Nikula
2025-06-18 19:00 ` [PATCH v3 03/10] drm/xe/xe_late_bind_fw: Introducing xe_late_bind_fw Badal Nilawar
2025-06-18 20:16   ` Daniele Ceraolo Spurio
2025-06-18 19:00 ` [PATCH v3 04/10] drm/xe/xe_late_bind_fw: Initialize late binding firmware Badal Nilawar
2025-06-18 20:46   ` Daniele Ceraolo Spurio
2025-06-19  4:57     ` Nilawar, Badal
2025-06-18 19:00 ` [PATCH v3 05/10] drm/xe/xe_late_bind_fw: Load " Badal Nilawar
2025-06-18 20:57   ` Daniele Ceraolo Spurio
2025-06-19  5:54     ` Nilawar, Badal
2025-06-18 19:00 ` [PATCH v3 06/10] drm/xe/xe_late_bind_fw: Reload late binding fw in rpm resume Badal Nilawar
2025-06-18 21:05   ` Daniele Ceraolo Spurio
2025-06-19  5:52     ` Nilawar, Badal
2025-06-23 15:26       ` Daniele Ceraolo Spurio [this message]
2025-06-23 16:11         ` Nilawar, Badal
2025-06-23 16:42           ` Daniele Ceraolo Spurio
2025-06-18 19:00 ` [PATCH v3 07/10] drm/xe/xe_late_bind_fw: Reload late binding fw in S2Idle/S3 resume Badal Nilawar
2025-06-20 13:49   ` Rodrigo Vivi
2025-06-20 15:14     ` Nilawar, Badal
2025-06-18 19:00 ` [PATCH v3 08/10] drm/xe/xe_late_bind_fw: Introduce debug fs node to disable late binding Badal Nilawar
2025-06-18 21:19   ` Daniele Ceraolo Spurio
2025-06-19  6:51     ` Nilawar, Badal
2025-06-20 13:46       ` Rodrigo Vivi
2025-06-23 15:37       ` Daniele Ceraolo Spurio
2025-06-24 10:12         ` Nilawar, Badal
2025-06-18 19:00 ` [PATCH v3 09/10] drm/xe/xe_late_bind_fw: Extract and print version info Badal Nilawar
2025-06-18 21:56   ` Daniele Ceraolo Spurio
2025-06-19  9:32     ` Nilawar, Badal
2025-06-24 13:41   ` Dan Carpenter
2025-06-18 19:00 ` [PATCH v3 10/10] [CI]drm/xe/xe_late_bind_fw: Select INTEL_MEI_LATE_BIND for CI Badal Nilawar
