Intel-XE Archive on lore.kernel.org
From: Matthew Brost <matthew.brost@intel.com>
To: Tomasz Lis <tomasz.lis@intel.com>
Cc: intel-xe@lists.freedesktop.org,
	"Michał Winiarski" <michal.winiarski@intel.com>,
	"Michał Wajdeczko" <michal.wajdeczko@intel.com>,
	"Piotr Piórkowski" <piotr.piorkowski@intel.com>,
	"Satyanarayana K V P" <satyanarayana.k.v.p@intel.com>
Subject: Re: [PATCH v1] drm/xe/vf: Stop waiting for ring space on VF post migration recovery
Date: Thu, 4 Dec 2025 14:18:35 -0800	[thread overview]
Message-ID: <aTIIu/1i+9lpq4GL@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <20251204200820.2206168-1-tomasz.lis@intel.com>

On Thu, Dec 04, 2025 at 09:08:20PM +0100, Tomasz Lis wrote:
> If a wait for ring space starts just before migration, it can delay
> the recovery process by waiting, with no bailout path, for up to 2
> seconds.
> 
> A two second delay to recovery is not acceptable. If the ring was
> completely filled even without the migration temporarily stopping
> execution, such a wait will allow up to a thousand new jobs (assuming
> a constant flow) to be added while the wait is happening.
> 
> While this will not cause data corruption, it will lead to warning
> messages being logged due to a reset being scheduled on a GT under
> recovery, and to several seconds of unresponsiveness while the
> backlog of jobs is progressively executed.
> 
> Add a bailout condition to make sure the recovery starts without
> much delay. The recovery is expected to finish in about 100 ms under
> moderate stress, so the condition check period needs to be below
> that - settling at 64 ms.
> 
> The theoretical maximum recovery time depends on how many requests
> can be emitted to engine rings and be pending execution. During
> stress testing, it was possible to reach 10k pending requests on the
> rings of a platform with two GTs, which resulted in a maximum
> recovery time of 5 seconds. In real life situations, however, the
> number of pending requests is very unlikely to ever exceed 100, in
> which case the recovery time will be around 50 ms - well within the
> claimed limit of 100 ms.
> 
> Fixes: a4dae94aad6a ("drm/xe/vf: Wakeup in GuC backend on VF post migration recovery")
> Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_guc_submit.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index f3f2c8556a66..ff6fda84bf0f 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -722,21 +722,23 @@ static int wq_wait_for_space(struct xe_exec_queue *q, u32 wqi_size)
>  	struct xe_guc *guc = exec_queue_to_guc(q);
>  	struct xe_device *xe = guc_to_xe(guc);
>  	struct iosys_map map = xe_lrc_parallel_map(q->lrc[0]);
> -	unsigned int sleep_period_ms = 1;
> +	unsigned int sleep_period_ms = 1, sleep_total_ms = 0;
>  
>  #define AVAILABLE_SPACE \
>  	CIRC_SPACE(q->guc->wqi_tail, q->guc->wqi_head, WQ_SIZE)
>  	if (wqi_size > AVAILABLE_SPACE && !vf_recovery(guc)) {
>  try_again:
>  		q->guc->wqi_head = parallel_read(xe, map, wq_desc.head);
> -		if (wqi_size > AVAILABLE_SPACE) {
> -			if (sleep_period_ms == 1024) {
> +		if (wqi_size > AVAILABLE_SPACE && !vf_recovery(guc)) {

Ah, yes, this was a mistake on my end. The intent was to bail out of
the wait if vf_recovery was in progress.

The wait / sleep logic looks better too.

With that:
Reviewed-by: Matthew Brost <matthew.brost@intel.com>

> +			if (sleep_total_ms > 2000) {
>  				xe_gt_reset_async(q->gt);
>  				return -ENODEV;
>  			}
>  
>  			msleep(sleep_period_ms);
> -			sleep_period_ms <<= 1;
> +			sleep_total_ms += sleep_period_ms;
> +			if (sleep_period_ms < 64)
> +				sleep_period_ms <<= 1;
>  			goto try_again;
>  		}
>  	}
> -- 
> 2.25.1
> 

Thread overview: 5+ messages
2025-12-04 20:08 [PATCH v1] drm/xe/vf: Stop waiting for ring space on VF post migration recovery Tomasz Lis
2025-12-04 20:49 ` ✓ CI.KUnit: success for " Patchwork
2025-12-04 21:32 ` ✓ Xe.CI.BAT: " Patchwork
2025-12-04 22:09 ` ✓ Xe.CI.Full: " Patchwork
2025-12-04 22:18 ` Matthew Brost [this message]
