From: Matthew Brost <matthew.brost@intel.com>
To: Tomasz Lis <tomasz.lis@intel.com>
Cc: intel-xe@lists.freedesktop.org,
"Michał Winiarski" <michal.winiarski@intel.com>,
"Michał Wajdeczko" <michal.wajdeczko@intel.com>,
"Piotr Piórkowski" <piotr.piorkowski@intel.com>,
"Satyanarayana K V P" <satyanarayana.k.v.p@intel.com>
Subject: Re: [PATCH v1] drm/xe/vf: Stop waiting for ring space on VF post migration recovery
Date: Thu, 4 Dec 2025 14:18:35 -0800
Message-ID: <aTIIu/1i+9lpq4GL@lstrano-desk.jf.intel.com>
In-Reply-To: <20251204200820.2206168-1-tomasz.lis@intel.com>
On Thu, Dec 04, 2025 at 09:08:20PM +0100, Tomasz Lis wrote:
> If a wait for ring space starts just before migration, it can delay
> the recovery process by up to 2 seconds, since the wait has no
> bailout path.
>
> A two second delay before recovery is not acceptable. Moreover, if
> the ring filled up even without migration temporarily stopping
> execution, such a wait allows up to a thousand new jobs (assuming a
> constant flow) to be added while the wait is in progress.
>
> While this will not cause data corruption, it will lead to warning
> messages being logged due to a reset being scheduled on a GT that is
> under recovery, and to several seconds of unresponsiveness while the
> backlog of jobs is progressively executed.
>
> Add a bailout condition to make sure the recovery starts without
> much delay. The recovery is expected to finish in about 100 ms under
> moderate stress, so the period between condition checks needs to stay
> below that; it is capped at 64 ms.
>
> The theoretical maximum time the recovery can take depends on how
> many requests can be emitted to engine rings and be pending execution.
> During stress testing it was possible to reach 10k pending requests
> on the rings of a platform with two GTs, which resulted in a maximum
> recovery time of 5 seconds. In real life situations, however, it is
> very unlikely that the number of pending requests will ever exceed
> 100, for which the recovery time is around 50 ms - well within the
> claimed limit of 100 ms.
>
> Fixes: a4dae94aad6a ("drm/xe/vf: Wakeup in GuC backend on VF post migration recovery")
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
> drivers/gpu/drm/xe/xe_guc_submit.c | 10 ++++++----
> 1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index f3f2c8556a66..ff6fda84bf0f 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -722,21 +722,23 @@ static int wq_wait_for_space(struct xe_exec_queue *q, u32 wqi_size)
> struct xe_guc *guc = exec_queue_to_guc(q);
> struct xe_device *xe = guc_to_xe(guc);
> struct iosys_map map = xe_lrc_parallel_map(q->lrc[0]);
> - unsigned int sleep_period_ms = 1;
> + unsigned int sleep_period_ms = 1, sleep_total_ms = 0;
>
> #define AVAILABLE_SPACE \
> CIRC_SPACE(q->guc->wqi_tail, q->guc->wqi_head, WQ_SIZE)
> if (wqi_size > AVAILABLE_SPACE && !vf_recovery(guc)) {
> try_again:
> q->guc->wqi_head = parallel_read(xe, map, wq_desc.head);
> - if (wqi_size > AVAILABLE_SPACE) {
> - if (sleep_period_ms == 1024) {
> + if (wqi_size > AVAILABLE_SPACE && !vf_recovery(guc)) {
Ah, yes, this was a mistake on my end. The intent was to bail out of
the wait if vf_recovery was in progress.
The wait / sleep logic looks better too.
With that:
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> + if (sleep_total_ms > 2000) {
> xe_gt_reset_async(q->gt);
> return -ENODEV;
> }
>
> msleep(sleep_period_ms);
> - sleep_period_ms <<= 1;
> + sleep_total_ms += sleep_period_ms;
> + if (sleep_period_ms < 64)
> + sleep_period_ms <<= 1;
> goto try_again;
> }
> }
> --
> 2.25.1
>
Thread overview: 5+ messages
2025-12-04 20:08 [PATCH v1] drm/xe/vf: Stop waiting for ring space on VF post migration recovery Tomasz Lis
2025-12-04 20:49 ` ✓ CI.KUnit: success for " Patchwork
2025-12-04 21:32 ` ✓ Xe.CI.BAT: " Patchwork
2025-12-04 22:09 ` ✓ Xe.CI.Full: " Patchwork
2025-12-04 22:18 ` Matthew Brost [this message]