From: "Lis, Tomasz" <tomasz.lis@intel.com>
To: Michal Wajdeczko <michal.wajdeczko@intel.com>,
<intel-xe@lists.freedesktop.org>
Cc: "Michał Winiarski" <michal.winiarski@intel.com>
Subject: Re: [PATCH v2 4/4] drm/xe/vf: Defer fixups if migrated twice fast
Date: Thu, 26 Sep 2024 23:48:26 +0200 [thread overview]
Message-ID: <4e4954f4-202a-4e07-89f8-cefc7cc1b90a@intel.com> (raw)
In-Reply-To: <7c70dbc5-889c-4571-92c9-b8ac3ea0c9f9@intel.com>
On 26.09.2024 16:35, Michal Wajdeczko wrote:
>
> On 24.09.2024 22:25, Tomasz Lis wrote:
>> If another VF migration happened during post-migration recovery,
>> then the current worker should be finished to allow the next
>> one start swiftly and cleanly.
>>
>> Check for defer in two places: before fixups, and before
>> sending RESFIX_DONE.
>>
>> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
>> ---
>> drivers/gpu/drm/xe/xe_sriov_vf.c | 25 +++++++++++++++++++++++++
>> 1 file changed, 25 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_sriov_vf.c b/drivers/gpu/drm/xe/xe_sriov_vf.c
>> index fe5eefa736c8..f326e507d73e 100644
>> --- a/drivers/gpu/drm/xe/xe_sriov_vf.c
>> +++ b/drivers/gpu/drm/xe/xe_sriov_vf.c
>> @@ -52,6 +52,19 @@ static int vf_post_migration_requery_guc(struct xe_device *xe)
>> return err;
>> }
>>
>> +/*
>> + * vf_post_migration_imminent - Check if post-restore recovery is coming.
>> + * @xe: the &xe_device struct instance
>> + *
>> + * Return: True if migration recovery worker will soon be running. Any worker currently
>> + * executing does not affect the result.
>> + */
>> +static bool vf_post_migration_imminent(struct xe_device *xe)
>> +{
>> + return xe->sriov.vf.migration.gt_flags != 0 ||
>> + work_pending(&xe->sriov.vf.migration.worker);
> make sure scripts/checkpatch.pl --strict is not complaining
ok
>
>> +}
>> +
>> /*
>> * vf_post_migration_notify_resfix_done - Notify all GuCs about resource fixups apply finished.
>> * @xe: the &xe_device struct instance
>> @@ -63,11 +76,17 @@ static void vf_post_migration_notify_resfix_done(struct xe_device *xe)
>> int err, num_sent = 0;
>>
>> for_each_gt(gt, xe, id) {
>> + if (vf_post_migration_imminent(xe))
>> + goto skip;
> hmm, what if new migration happen right here? this is still racy and
> likely needs to be solved at GUC-VF protocol level, not by adding more
> check points in the driver
This is how the current spec works.
"At the end of patching, check if a new MIGRATED interrupt has been
received"
We cannot guarantee no issues if the 2nd migration happens in this short
period. It would be a really bad luck if someone actually ran into the
issue where one of GuCs slips through the checks here, and that would
cause engine hang. Even then, everything would continue after skipping
the one job.
For the check points, there are exactly 2: before fixups and before
RESFIX_DONE. we're not adding more.
-Tomasz
>
>> err = xe_gt_sriov_vf_notify_resfix_done(gt);
>> if (!err)
>> num_sent++;
>> }
>> drm_dbg(&xe->drm, "sent %d VF resource fixups done notifications\n", num_sent);
>> + return;
>> +
>> +skip:
>> + drm_dbg(&xe->drm, "another recovery imminent, skipping notifications\n");
>> }
>>
>> static void vf_post_migration_recovery(struct xe_device *xe)
>> @@ -77,6 +96,8 @@ static void vf_post_migration_recovery(struct xe_device *xe)
>> drm_dbg(&xe->drm, "migration recovery in progress\n");
>> xe_pm_runtime_get(xe);
>> err = vf_post_migration_requery_guc(xe);
>> + if (vf_post_migration_imminent(xe))
>> + goto defer;
>> if (unlikely(err))
>> goto fail;
>>
>> @@ -85,6 +106,10 @@ static void vf_post_migration_recovery(struct xe_device *xe)
>> xe_pm_runtime_put(xe);
>> drm_notice(&xe->drm, "migration recovery ended\n");
>> return;
>> +defer:
>> + xe_pm_runtime_put(xe);
>> + drm_dbg(&xe->drm, "migration recovery deferred\n");
>> + return;
>> fail:
>> xe_pm_runtime_put(xe);
>> drm_err(&xe->drm, "migration recovery failed (%pe)\n", ERR_PTR(err));
next prev parent reply other threads:[~2024-09-26 21:48 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-09-24 20:25 [PATCH v2 0/4] drm/xe/vf: Post-migration recovery worker basis Tomasz Lis
2024-09-24 20:25 ` [PATCH v2 1/4] drm/xe/vf: React to MIGRATED interrupt Tomasz Lis
2024-09-26 14:05 ` Michal Wajdeczko
2024-09-26 21:22 ` Lis, Tomasz
2024-09-24 20:25 ` [PATCH v2 2/4] drm/xe/vf: Send RESFIX_DONE message at end of VF restore Tomasz Lis
2024-09-24 20:25 ` [PATCH v2 3/4] drm/xe/vf: Start post-migration fixups with provisinoning query Tomasz Lis
2024-09-26 14:27 ` Michal Wajdeczko
2024-09-26 21:32 ` Lis, Tomasz
2024-09-24 20:25 ` [PATCH v2 4/4] drm/xe/vf: Defer fixups if migrated twice fast Tomasz Lis
2024-09-26 14:35 ` Michal Wajdeczko
2024-09-26 21:48 ` Lis, Tomasz [this message]
2024-09-26 5:16 ` ✓ CI.Patch_applied: success for drm/xe/vf: Post-migration recovery worker basis (rev2) Patchwork
2024-09-26 5:16 ` ✗ CI.checkpatch: warning " Patchwork
2024-09-26 5:18 ` ✓ CI.KUnit: success " Patchwork
2024-09-26 5:23 ` ✗ CI.Build: failure " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4e4954f4-202a-4e07-89f8-cefc7cc1b90a@intel.com \
--to=tomasz.lis@intel.com \
--cc=intel-xe@lists.freedesktop.org \
--cc=michal.wajdeczko@intel.com \
--cc=michal.winiarski@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox