Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: "Nilawar, Badal" <badal.nilawar@intel.com>
To: Matthew Brost <matthew.brost@intel.com>
Cc: <intel-xe@lists.freedesktop.org>, <anshuman.gupta@intel.com>,
	<john.c.harrison@intel.com>, <vinay.belgaumkar@intel.com>,
	<vivi.rodrigo@intel.com>
Subject: Re: [PATCH] drm/xe/guc: In guc_ct_send_recv flush g2h worker if g2h resp times out
Date: Fri, 4 Oct 2024 15:36:49 +0530	[thread overview]
Message-ID: <0f45ed91-488c-4dff-885f-e6355ef02c78@intel.com> (raw)
In-Reply-To: <Zv+QJ4HVR0pJbNsX@DUT025-TGLU.fm.intel.com>



On 04-10-2024 12:20, Matthew Brost wrote:
> On Thu, Oct 03, 2024 at 03:24:12PM +0530, Nilawar, Badal wrote:
>>
>>
>> On 02-10-2024 19:34, Matthew Brost wrote:
>>> On Tue, Oct 01, 2024 at 01:41:15PM +0530, Nilawar, Badal wrote:
>>>>
>>>>
>>>> On 28-09-2024 02:57, Matthew Brost wrote:
>>>>> On Sat, Sep 28, 2024 at 12:54:28AM +0530, Badal Nilawar wrote:
>>>>>> It is observed that for GuC CT request G2H IRQ triggered and g2h_worker
>>>>>> queued, but it didn't get opportunity to execute and timeout occurred.
>>>>>> To address this the g2h_worker is being flushed.
>>>>>>
>>>>>> Cc: John Harrison <John.C.Harrison@Intel.com>
>>>>>> Signed-off-by: Badal Nilawar <badal.nilawar@intel.com>
>>>>>> ---
>>>>>>     drivers/gpu/drm/xe/xe_guc_ct.c | 11 +++++++++++
>>>>>>     1 file changed, 11 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
>>>>>> index 4b95f75b1546..4a5d7f85d1a0 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
>>>>>> @@ -903,6 +903,17 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
>>>>>>     	}
>>>>>>     	ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * It is observed that for above GuC CT request G2H IRQ triggered
>>>>>
>>>>> Where is this observed. 1 second is a long to wait for a worker...
>>>>
>>>> Please see this log.
>>>>
>>>
>>> Logs are good but explaining the test case is also helpful so I don't
>>> have reverse engineer things. Also having platform information would be
>>> helpful too. So what is the test case here and what platform?
>>
>> Sorry, my bad, I should have added issue id in commit message.
>> https://gitlab.freedesktop.org/drm/xe/kernel/issues/1620.
>>
>> This issue is reported on LNL for xe_gt_freq@freq_reset_multiple test and
>> xe_pm@* tests during resume flow.
>>
>>>> [  176.602482] xe 0000:00:02.0: [drm:xe_guc_pc_get_min_freq [xe]] GT0: GT[0]
>>>> GuC PC status query
>>>> [  176.603019] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
>>>> GT[0]
>>>> [  176.603449] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
>>>> running GT[0]
>>>> [  176.604379] xe 0000:00:02.0: [drm:xe_guc_pc_get_max_freq [xe]] GT0: GT[0]
>>>> GuC PC status query
>>>> [  176.605464] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
>>>> GT[0]
>>>> [  176.605821] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
>>>> running GT[0]
>>>> [  176.716699] xe 0000:00:02.0: [drm] GT0: trying reset
>>>
>>> This looks we are doing a GT reset and this is causing problems. This
>>> patch is likely papering over an issue with our GT flows. So this patch
>>> doesn't seem correct to me. Let's try to figure what is going wrong in
>>> the reset flow.
>>
>> This is seen for slpc query after "reset done" as well.
>>
>>>> [  176.716718] xe 0000:00:02.0: [drm] GT0: GuC PC status query    //GuC PC
>>>> check request
>>>> [  176.717648] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
>>>> GT[0]  // IRQ
>>>> [  177.728637] xe 0000:00:02.0: [drm] *ERROR* GT0: Timed out wait for G2H,
>>>> fence 1311, action 3003  //Timeout
>>>> [  177.737637] xe 0000:00:02.0: [drm] *ERROR* GT0: GuC PC query task state
>>>> failed: -ETIME
>>>> [  177.745644] xe 0000:00:02.0: [drm] GT0: reset queued
>>>
>>> Here this is almost 1 second after 'trying reset' which I'm unsure how
>>> that could happen looking at the source code upstream.
>>> 'xe_uc_reset_prepare' is called between 'trying reset' and 'reset
>>> queued' but that doesn't wait anywhere rather resolves to the below
>>> function:
>>>
>>> 1769 int xe_guc_submit_reset_prepare(struct xe_guc *guc)
>>> 1770 {
>>> 1771         int ret;
>>> 1772
>>> 1773         /*
>>> 1774          * Using an atomic here rather than submission_state.lock as this
>>> 1775          * function can be called while holding the CT lock (engine reset
>>> 1776          * failure). submission_state.lock needs the CT lock to resubmit jobs.
>>> 1777          * Atomic is not ideal, but it works to prevent against concurrent reset
>>> 1778          * and releasing any TDRs waiting on guc->submission_state.stopped.
>>> 1779          */
>>> 1780         ret = atomic_fetch_or(1, &guc->submission_state.stopped);
>>> 1781         smp_wmb();
>>> 1782         wake_up_all(&guc->ct.wq);
>>> 1783
>>> 1784         return ret;
>>> 1785 }
>>
>> And CT is not disabled yet, so SLPC query will go through.
>>
> 
> I agree CT should not be disabled at this point.
> 
>>>
>>> If this log from an internal repo or something? This looks like some
>>> sort of circular dependency where a GT reset starts and the G2H handler
>>> doesn't get queued because the CT channel is disabled, the G2H times
>>> out, and reset stalls waiting for the timeout.
>>
>> This log is captured on LNL, with debug prints added, by running
>> xe_gt_freq@freq_reset_multiple.
>>
>> If CT channel is disabled then we will not see "G2H fence (1311) not
>> found!".
>>
>> During xe pm resume flow this is seen during guc_pc_start->pc_init_freqs().
>>
> 
> Ok, was spitballing ideas - if this is upstream then the CT should be
> alive but somehow it appears the worker to process CT is getting
> staled.
> 
> Also it very suspect the time gap between 'trying reset' and 'reset
> queued'.
> 
> This patch doesn't look like the solution. Can you look into hints I've
> given here?

Ok, I will look into it.

Badal

> 
> Matt
>   
>>
>>>
>>>> [  177.849081] xe 0000:00:02.0: [drm:xe_guc_pc_get_min_freq [xe]] GT0: GT[0]
>>>> GuC PC status query
>>>> [  177.849659] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
>>>> GT[0]
>>>> [  178.632672] xe 0000:00:02.0: [drm] GT0: reset started
>>>> [  178.632639] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
>>>> running GT[0] // Worker ran
>>>> [  178.632897] xe 0000:00:02.0: [drm] GT0: G2H fence (1311) not found!
>>>>
>>>>>
>>>>>> +	 * and g2h_worker queued, but it didn't get opportunity to execute
>>>>>> +	 * and timeout occurred. To address the g2h_worker is being flushed.
>>>>>> +	 */
>>>>>> +	if (!ret) {
>>>>>> +		flush_work(&ct->g2h_worker);
>>>>>> +		ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
>>>>>
>>>>> If this is needed I wouldn't wait 1 second, if the flush worked
>>>>> 'g2h_fence.done' should immediately be signaled. Maybe wait 1 MS?
>>>>
>>>> In config HZ is set to 250, which is 4 ms I think.
>>>>
>>>
>>> HZ should always be one second [1].
>>>
>>> [1] https://www.oreilly.com/library/view/linux-device-drivers/9781785280009/4041820a-bbe4-4502-8ef9-d1913e133332.xhtml#:~:text=In%20other%20words%2C%20HZ%20represents,incremented%20HZ%20times%20every%20second.
>>>
>>>> CONFIG_HZ_250=y
>>>> # CONFIG_HZ_300 is not set
>>>> # CONFIG_HZ_1000 is not set
>>>> CONFIG_HZ=250
>>>>
>>>
>>> I'm little confused how this Kconfig works [2] but I don't think
>>> actually changes the time of HZ rather it changes how many jiffies are
>>> in one second.
>>>
>>> [2] https://lwn.net/Articles/56378/
>>
>> Oh ok, Thanks for clarification.
>>
>> Regards,
>> Badal
>>
>>>
>>> Matt
>>>
>>>> Regards,
>>>> Badal
>>>>
>>>>>
>>>>> Matt
>>>>>
>>>>>> +	}
>>>>>> +
>>>>>>     	if (!ret) {
>>>>>>     		xe_gt_err(gt, "Timed out wait for G2H, fence %u, action %04x",
>>>>>>     			  g2h_fence.seqno, action[0]);
>>>>>> -- 
>>>>>> 2.34.1
>>>>>>
>>>>
>>


  reply	other threads:[~2024-10-04 10:07 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-27 19:24 [PATCH] drm/xe/guc: In guc_ct_send_recv flush g2h worker if g2h resp times out Badal Nilawar
2024-09-27 20:26 ` ✓ CI.Patch_applied: success for " Patchwork
2024-09-27 20:26 ` ✓ CI.checkpatch: " Patchwork
2024-09-27 20:27 ` ✓ CI.KUnit: " Patchwork
2024-09-27 20:39 ` ✓ CI.Build: " Patchwork
2024-09-27 20:41 ` ✓ CI.Hooks: " Patchwork
2024-09-27 20:43 ` ✓ CI.checksparse: " Patchwork
2024-09-27 21:08 ` ✓ CI.BAT: " Patchwork
2024-09-27 21:27 ` [PATCH] " Matthew Brost
2024-10-01  8:11   ` Nilawar, Badal
2024-10-01 17:49     ` Nilawar, Badal
2024-10-02 14:04     ` Matthew Brost
2024-10-03  9:54       ` Nilawar, Badal
2024-10-04  6:50         ` Matthew Brost
2024-10-04 10:06           ` Nilawar, Badal [this message]
2024-09-30  6:46 ` ✗ CI.FULL: failure for " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0f45ed91-488c-4dff-885f-e6355ef02c78@intel.com \
    --to=badal.nilawar@intel.com \
    --cc=anshuman.gupta@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=john.c.harrison@intel.com \
    --cc=matthew.brost@intel.com \
    --cc=vinay.belgaumkar@intel.com \
    --cc=vivi.rodrigo@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox