Re: [PATCH] drm/xe/guc: Add more GuC CT states

Intel-XE Archive on lore.kernel.org

From: Riana Tauro <riana.tauro@intel.com>
To: Matthew Brost <matthew.brost@intel.com>,
	Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: intel-xe@lists.freedesktop.org
Subject: Re: [PATCH] drm/xe/guc: Add more GuC CT states
Date: Wed, 27 Dec 2023 12:22:07 +0530
Message-ID: <8932513b-5aa9-40f1-b172-d271a9a08b92@intel.com>
In-Reply-To: <ZYUkIzf+R/JSJPIc@DUT025-TGLU.fm.intel.com>



On 12/22/2023 11:22 AM, Matthew Brost wrote:
> On Fri, Dec 22, 2023 at 05:47:17AM +0000, Matthew Brost wrote:
>> On Thu, Dec 21, 2023 at 01:56:33PM -0800, Daniele Ceraolo Spurio wrote:
>>>
>>>
>>> On 12/19/2023 9:28 AM, Matthew Brost wrote:
>>>> The GuC CT has more than just enabled / disabled states; it has 4. The
>>>> 4 states are not initialized, disabled, drop messages, and enabled.
>>>> Change the code to reflect this. These states will enable proper return
>>>> codes from functions and therefore enable proper error messages.
>>>
>>> Can you explain a bit more in which situation we expect to drop messages and
>>> handle it? AFAICS not all callers waiting for a G2H reply can cope with the
>>
>> Anything that requires a G2H reply must be able to cope with it getting
>> dropped as the GuC can hang at any moment. Certainly all of submission
>> is designed this way, and so are TLB invalidations. More on that below.
>> With everything being able to cope with lost G2H, there is no point in
>> continuing to process G2H once a reset has started (or to send H2G either).
>>
>>> reply not coming; e.g. it looks like xe_gt_tlb_invalidation_wait() will
>>
>> During a GT reset xe_gt_tlb_invalidation_reset() is called which will
>> signal all waiters for invalidations avoiding timeouts.
>>
>> So the flow roughly is:
>>
>> Set CT channel to drop messages
>> Stop all submissions
>> Do reset
>> Signal TLB invalidation waiters.
>>
> 
> Ah, forgot a key detail here. Setting the CT channel to drop messages
> before doing the reset is key here - we don't want a G2H being processed
> to race with cleaning up lost G2H in the reset step.
> 
> Matt
> 
>>> timeout and throw an error (which IMO is already an issue, because the reply
>>> might be lost due to reset). I know that currently in all cases in which we
>>> stop communication we do a reset, so the situation ends up ok, but there is
>>> a pending series to remove the reset in the runtime suspend/resume scenario
>>> (https://patchwork.freedesktop.org/series/122772/) in which case IMO we
>>
>> In this path we would want to put the GuC communication into a state
>> where sending / receiving messages triggers an error (-ENODEV). We don't
>> expect to suspend the device and then send / recv messages. That is the
>> point of this patch - it is fine to drop messages during a reset, but not
>> during suspend or if the CT has not yet been initialized.

Hi Matthew

During D3hot->D0, the GuC is currently reloaded on resume. The pending
series https://patchwork.freedesktop.org/series/122772/ only
suspends/resumes CTB communication instead of reloading the GuC.

Which states should be used in this case?
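
To make the question concrete, here is a minimal user-space sketch of how the checks in the quoted patch appear to map each state to a return code. The `ct_state` enum and the `ct_send_status()` helper are illustrative names and not driver code; only the four states and the errno values (-ENODEV, -ECANCELED) are taken from the patch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical user-space mirror of enum xe_guc_ct_state from the patch. */
enum ct_state {
	CT_STATE_NOT_INITIALIZED = 0,
	CT_STATE_DISABLED,
	CT_STATE_DROP_MESSAGES,
	CT_STATE_ENABLED,
};

/*
 * Illustrative helper (not in the driver): return the status that the
 * state checks added to __guc_ct_send_locked() and g2h_read() would
 * produce for a given state.
 */
static int ct_send_status(enum ct_state state)
{
	if (state == CT_STATE_NOT_INITIALIZED || state == CT_STATE_DISABLED)
		return -ENODEV;		/* messages not expected: report an error */

	if (state == CT_STATE_DROP_MESSAGES)
		return -ECANCELED;	/* reset in progress: message is dropped */

	assert(state == CT_STATE_ENABLED);	/* stands in for xe_assert() */
	return 0;
}
```

Under this reading, a suspend path that marks the CT as DISABLED would land in the -ENODEV branch for any message sent or received while suspended.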

Thanks
Riana
>>
>> Proper error messages will be added based on these new states.
>>
>>> don't want to drop messages but do a flush instead.
>>>
>>
>> See above. Also unsure what you mean by flush here? Do you mean the G2H
>> worker? I think that creates some dma-fencing (or lockdep) nightmares if
>> we do that.
>>
>> Matt
>>
>>> Daniele
>>>
>>>>
>>>> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
>>>> Cc: Tejas Upadhyay <tejas.upadhyay@intel.com>
>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>> ---
>>>>    drivers/gpu/drm/xe/xe_guc.c          |  4 +-
>>>>    drivers/gpu/drm/xe/xe_guc_ct.c       | 55 ++++++++++++++++++++--------
>>>>    drivers/gpu/drm/xe/xe_guc_ct.h       |  8 +++-
>>>>    drivers/gpu/drm/xe/xe_guc_ct_types.h | 18 ++++++++-
>>>>    4 files changed, 64 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
>>>> index 482cb0df9f15..9b0fa8b1eb48 100644
>>>> --- a/drivers/gpu/drm/xe/xe_guc.c
>>>> +++ b/drivers/gpu/drm/xe/xe_guc.c
>>>> @@ -645,7 +645,7 @@ int xe_guc_mmio_send_recv(struct xe_guc *guc, const u32 *request,
>>>>    	BUILD_BUG_ON(VF_SW_FLAG_COUNT != MED_VF_SW_FLAG_COUNT);
>>>> -	xe_assert(xe, !guc->ct.enabled);
>>>> +	xe_assert(xe, !xe_guc_ct_enabled(&guc->ct));
>>>>    	xe_assert(xe, len);
>>>>    	xe_assert(xe, len <= VF_SW_FLAG_COUNT);
>>>>    	xe_assert(xe, len <= MED_VF_SW_FLAG_COUNT);
>>>> @@ -827,7 +827,7 @@ int xe_guc_stop(struct xe_guc *guc)
>>>>    {
>>>>    	int ret;
>>>> -	xe_guc_ct_disable(&guc->ct);
>>>> +	xe_guc_ct_drop_messages(&guc->ct);
>>>>    	ret = xe_guc_submit_stop(guc);
>>>>    	if (ret)
>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
>>>> index 24a33fa36496..22d655a8bf9a 100644
>>>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
>>>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
>>>> @@ -278,12 +278,25 @@ static int guc_ct_control_toggle(struct xe_guc_ct *ct, bool enable)
>>>>    	return ret > 0 ? -EPROTO : ret;
>>>>    }
>>>> +static void xe_guc_ct_set_state(struct xe_guc_ct *ct,
>>>> +				enum xe_guc_ct_state state)
>>>> +{
>>>> +	mutex_lock(&ct->lock);		/* Serialise dequeue_one_g2h() */
>>>> +	spin_lock_irq(&ct->fast_lock);	/* Serialise CT fast-path */
>>>> +
>>>> +	ct->g2h_outstanding = 0;
>>>> +	ct->state = state;
>>>> +
>>>> +	spin_unlock_irq(&ct->fast_lock);
>>>> +	mutex_unlock(&ct->lock);
>>>> +}
>>>> +
>>>>    int xe_guc_ct_enable(struct xe_guc_ct *ct)
>>>>    {
>>>>    	struct xe_device *xe = ct_to_xe(ct);
>>>>    	int err;
>>>> -	xe_assert(xe, !ct->enabled);
>>>> +	xe_assert(xe, !xe_guc_ct_enabled(ct));
>>>>    	guc_ct_ctb_h2g_init(xe, &ct->ctbs.h2g, &ct->bo->vmap);
>>>>    	guc_ct_ctb_g2h_init(xe, &ct->ctbs.g2h, &ct->bo->vmap);
>>>> @@ -300,12 +313,7 @@ int xe_guc_ct_enable(struct xe_guc_ct *ct)
>>>>    	if (err)
>>>>    		goto err_out;
>>>> -	mutex_lock(&ct->lock);
>>>> -	spin_lock_irq(&ct->fast_lock);
>>>> -	ct->g2h_outstanding = 0;
>>>> -	ct->enabled = true;
>>>> -	spin_unlock_irq(&ct->fast_lock);
>>>> -	mutex_unlock(&ct->lock);
>>>> +	xe_guc_ct_set_state(ct, XE_GUC_CT_STATE_ENABLED);
>>>>    	smp_mb();
>>>>    	wake_up_all(&ct->wq);
>>>> @@ -321,12 +329,12 @@ int xe_guc_ct_enable(struct xe_guc_ct *ct)
>>>>    void xe_guc_ct_disable(struct xe_guc_ct *ct)
>>>>    {
>>>> -	mutex_lock(&ct->lock); /* Serialise dequeue_one_g2h() */
>>>> -	spin_lock_irq(&ct->fast_lock); /* Serialise CT fast-path */
>>>> -	ct->enabled = false; /* Finally disable CT communication */
>>>> -	spin_unlock_irq(&ct->fast_lock);
>>>> -	mutex_unlock(&ct->lock);
>>>> +	xe_guc_ct_set_state(ct, XE_GUC_CT_STATE_DISABLED);
>>>> +}
>>>> +void xe_guc_ct_drop_messages(struct xe_guc_ct *ct)
>>>> +{
>>>> +	xe_guc_ct_set_state(ct, XE_GUC_CT_STATE_DROP_MESSAGES);
>>>>    	xa_destroy(&ct->fence_lookup);
>>>>    }
>>>> @@ -493,11 +501,19 @@ static int __guc_ct_send_locked(struct xe_guc_ct *ct, const u32 *action,
>>>>    		goto out;
>>>>    	}
>>>> -	if (unlikely(!ct->enabled)) {
>>>> +	if (ct->state == XE_GUC_CT_STATE_NOT_INITIALIZED ||
>>>> +	    ct->state == XE_GUC_CT_STATE_DISABLED) {
>>>>    		ret = -ENODEV;
>>>>    		goto out;
>>>>    	}
>>>> +	if (ct->state == XE_GUC_CT_STATE_DROP_MESSAGES) {
>>>> +		ret = -ECANCELED;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	xe_assert(xe, xe_guc_ct_enabled(ct));
>>>> +
>>>>    	if (g2h_fence) {
>>>>    		g2h_len = GUC_CTB_HXG_MSG_MAX_LEN;
>>>>    		num_g2h = 1;
>>>> @@ -682,7 +698,8 @@ static bool retry_failure(struct xe_guc_ct *ct, int ret)
>>>>    		return false;
>>>>    #define ct_alive(ct)	\
>>>> -	(ct->enabled && !ct->ctbs.h2g.info.broken && !ct->ctbs.g2h.info.broken)
>>>> +	(xe_guc_ct_enabled(ct) && !ct->ctbs.h2g.info.broken && \
>>>> +	 !ct->ctbs.g2h.info.broken)
>>>>    	if (!wait_event_interruptible_timeout(ct->wq, ct_alive(ct),  HZ * 5))
>>>>    		return false;
>>>>    #undef ct_alive
>>>> @@ -941,12 +958,18 @@ static int g2h_read(struct xe_guc_ct *ct, u32 *msg, bool fast_path)
>>>>    	lockdep_assert_held(&ct->fast_lock);
>>>> -	if (!ct->enabled)
>>>> +	if (ct->state == XE_GUC_CT_STATE_NOT_INITIALIZED ||
>>>> +	    ct->state == XE_GUC_CT_STATE_DISABLED)
>>>>    		return -ENODEV;
>>>> +	if (ct->state == XE_GUC_CT_STATE_DROP_MESSAGES)
>>>> +		return -ECANCELED;
>>>> +
>>>>    	if (g2h->info.broken)
>>>>    		return -EPIPE;
>>>> +	xe_assert(xe, xe_guc_ct_enabled(ct));
>>>> +
>>>>    	/* Calculate DW available to read */
>>>>    	tail = desc_read(xe, g2h, tail);
>>>>    	avail = tail - g2h->info.head;
>>>> @@ -1245,7 +1268,7 @@ struct xe_guc_ct_snapshot *xe_guc_ct_snapshot_capture(struct xe_guc_ct *ct,
>>>>    		return NULL;
>>>>    	}
>>>> -	if (ct->enabled) {
>>>> +	if (xe_guc_ct_enabled(ct)) {
>>>>    		snapshot->ct_enabled = true;
>>>>    		snapshot->g2h_outstanding = READ_ONCE(ct->g2h_outstanding);
>>>>    		guc_ctb_snapshot_capture(xe, &ct->ctbs.h2g,
>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
>>>> index f15f8a4857e0..214a6a357519 100644
>>>> --- a/drivers/gpu/drm/xe/xe_guc_ct.h
>>>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.h
>>>> @@ -13,6 +13,7 @@ struct drm_printer;
>>>>    int xe_guc_ct_init(struct xe_guc_ct *ct);
>>>>    int xe_guc_ct_enable(struct xe_guc_ct *ct);
>>>>    void xe_guc_ct_disable(struct xe_guc_ct *ct);
>>>> +void xe_guc_ct_drop_messages(struct xe_guc_ct *ct);
>>>>    void xe_guc_ct_fast_path(struct xe_guc_ct *ct);
>>>>    struct xe_guc_ct_snapshot *
>>>> @@ -22,10 +23,15 @@ void xe_guc_ct_snapshot_print(struct xe_guc_ct_snapshot *snapshot,
>>>>    void xe_guc_ct_snapshot_free(struct xe_guc_ct_snapshot *snapshot);
>>>>    void xe_guc_ct_print(struct xe_guc_ct *ct, struct drm_printer *p, bool atomic);
>>>> +static inline bool xe_guc_ct_enabled(struct xe_guc_ct *ct)
>>>> +{
>>>> +	return ct->state == XE_GUC_CT_STATE_ENABLED;
>>>> +}
>>>> +
>>>>    static inline void xe_guc_ct_irq_handler(struct xe_guc_ct *ct)
>>>>    {
>>>>    	wake_up_all(&ct->wq);
>>>> -	if (ct->enabled)
>>>> +	if (xe_guc_ct_enabled(ct))
>>>>    		queue_work(system_unbound_wq, &ct->g2h_worker);
>>>>    	xe_guc_ct_fast_path(ct);
>>>>    }
>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct_types.h b/drivers/gpu/drm/xe/xe_guc_ct_types.h
>>>> index d814d4ee3fc6..e36c7029dffe 100644
>>>> --- a/drivers/gpu/drm/xe/xe_guc_ct_types.h
>>>> +++ b/drivers/gpu/drm/xe/xe_guc_ct_types.h
>>>> @@ -72,6 +72,20 @@ struct xe_guc_ct_snapshot {
>>>>    	struct guc_ctb_snapshot h2g;
>>>>    };
>>>> +/**
>>>> + * enum xe_guc_ct_state - CT state
>>>> + * @XE_GUC_CT_STATE_NOT_INITIALIZED: CT not initialized, messages not expected in this state
>>>> + * @XE_GUC_CT_STATE_DISABLED: CT disabled, messages not expected in this state
>>>> + * @XE_GUC_CT_STATE_DROP_MESSAGES: CT drops messages without errors
>>>> + * @XE_GUC_CT_STATE_ENABLED: CT enabled, messages sent / received in this state
>>>> + */
>>>> +enum xe_guc_ct_state {
>>>> +	XE_GUC_CT_STATE_NOT_INITIALIZED = 0,
>>>> +	XE_GUC_CT_STATE_DISABLED,
>>>> +	XE_GUC_CT_STATE_DROP_MESSAGES,
>>>> +	XE_GUC_CT_STATE_ENABLED,
>>>> +};
>>>> +
>>>>    /**
>>>>     * struct xe_guc_ct - GuC command transport (CT) layer
>>>>     *
>>>> @@ -96,8 +110,8 @@ struct xe_guc_ct {
>>>>    	u32 g2h_outstanding;
>>>>    	/** @g2h_worker: worker to process G2H messages */
>>>>    	struct work_struct g2h_worker;
>>>> -	/** @enabled: CT enabled */
>>>> -	bool enabled;
>>>> +	/** @state: CT state */
>>>> +	enum xe_guc_ct_state state;
>>>>    	/** @fence_seqno: G2H fence seqno - 16 bits used by CT */
>>>>    	u32 fence_seqno;
>>>>    	/** @fence_lookup: G2H fence lookup */
>>>
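
For reference, the reset ordering described earlier in the thread (drop messages, stop submissions, reset, signal TLB invalidation waiters) can be sketched in plain C. Every function below is a hypothetical stub that only records the order of the calls; none of them is the driver's implementation.

```c
#include <assert.h>
#include <string.h>

/* Records the order in which the stubbed steps run, e.g. "a>b>c". */
static char trace[128];

static void step(const char *name)
{
	/* Guard the fixed-size buffer: separator + name + terminating NUL. */
	assert(strlen(trace) + strlen(name) + 2 <= sizeof(trace));
	if (trace[0])
		strcat(trace, ">");
	strcat(trace, name);
}

/* Hypothetical stand-ins for the four steps named in the thread. */
static void ct_drop_messages(void)       { step("drop"); }
static void submit_stop(void)            { step("stop"); }
static void do_reset(void)               { step("reset"); }
static void tlb_invalidation_reset(void) { step("signal"); }

/*
 * Sketch of the flow: the CT is switched to dropping messages first, so
 * that no G2H is still being processed while the later steps clean up
 * after lost G2H.
 */
static const char *gt_reset_sketch(void)
{
	trace[0] = '\0';
	ct_drop_messages();
	submit_stop();
	do_reset();
	tlb_invalidation_reset();
	return trace;
}
```

The point of the ordering is only that the switch to dropping messages comes before everything else; the stubs say nothing about what each step does internally.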


Thread overview: 18+ messages
2023-12-19 17:28 [PATCH] drm/xe/guc: Add more GuC CT states Matthew Brost
2023-12-19 17:47 ` ✓ CI.Patch_applied: success for " Patchwork
2023-12-19 17:47 ` ✗ CI.checkpatch: warning " Patchwork
2023-12-19 17:48 ` ✓ CI.KUnit: success " Patchwork
2023-12-19 17:55 ` ✓ CI.Build: " Patchwork
2023-12-19 17:56 ` ✓ CI.Hooks: " Patchwork
2023-12-19 17:57 ` ✓ CI.checksparse: " Patchwork
2023-12-19 18:32 ` ✓ CI.BAT: " Patchwork
2023-12-21 21:56 ` [PATCH] " Daniele Ceraolo Spurio
2023-12-22  5:47   ` Matthew Brost
2023-12-22  5:52     ` Matthew Brost
2023-12-27  6:52       ` Riana Tauro [this message]
2023-12-27 22:21         ` Matthew Brost
2023-12-22 19:36     ` Daniele Ceraolo Spurio
2023-12-27 21:55       ` Matthew Brost
2023-12-27 22:20         ` Daniele Ceraolo Spurio
2023-12-27 22:25           ` Daniele Ceraolo Spurio
2023-12-27 22:43             ` Matthew Brost
