* [PATCH v5 1/3] drm/xe: Move LNL scheduling WA to xe_device.h
@ 2024-10-29 12:01 Nirmoy Das
2024-10-29 12:01 ` [PATCH v5 2/3] drm/xe/ufence: Flush xe ordered_wq in case of ufence timeout Nirmoy Das
` (4 more replies)
0 siblings, 5 replies; 14+ messages in thread
From: Nirmoy Das @ 2024-10-29 12:01 UTC (permalink / raw)
To: intel-xe
Cc: Nirmoy Das, Badal Nilawar, Matthew Auld, Matthew Brost,
Himal Prasad Ghimiray, Lucas De Marchi, stable, John Harrison
Move the LNL scheduling WA to xe_device.h so it can be used in other
places without having to duplicate the comment about the WA's future
removal. The WA, which flushes work or workqueues, is now wrapped in
macros and can be reused wherever needed.
Cc: Badal Nilawar <badal.nilawar@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: <stable@vger.kernel.org> # v6.11+
Suggested-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
---
drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++
drivers/gpu/drm/xe/xe_guc_ct.c | 11 +----------
2 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
index 4c3f0ebe78a9..f1fbfe916867 100644
--- a/drivers/gpu/drm/xe/xe_device.h
+++ b/drivers/gpu/drm/xe/xe_device.h
@@ -191,4 +191,18 @@ void xe_device_declare_wedged(struct xe_device *xe);
struct xe_file *xe_file_get(struct xe_file *xef);
void xe_file_put(struct xe_file *xef);
+/*
+ * Occasionally it is seen that the G2H worker starts running after a delay of more than
+ * a second even after being queued and activated by the Linux workqueue subsystem. This
+ * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
+ * Lunarlake Hybrid CPU. Issue disappears if we disable Lunarlake atom cores from BIOS
+ * and this is beyond xe kmd.
+ *
+ * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
+ */
+#define LNL_FLUSH_WORKQUEUE(wq__) \
+ flush_workqueue(wq__)
+#define LNL_FLUSH_WORK(wrk__) \
+ flush_work(wrk__)
+
#endif
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 1b5d8fb1033a..703b44b257a7 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -1018,17 +1018,8 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
- /*
- * Occasionally it is seen that the G2H worker starts running after a delay of more than
- * a second even after being queued and activated by the Linux workqueue subsystem. This
- * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
- * Lunarlake Hybrid CPU. Issue dissappears if we disable Lunarlake atom cores from BIOS
- * and this is beyond xe kmd.
- *
- * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
- */
if (!ret) {
- flush_work(&ct->g2h_worker);
+ LNL_FLUSH_WORK(&ct->g2h_worker);
if (g2h_fence.done) {
xe_gt_warn(gt, "G2H fence %u, action %04x, done\n",
g2h_fence.seqno, action[0]);
--
2.46.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v5 2/3] drm/xe/ufence: Flush xe ordered_wq in case of ufence timeout
2024-10-29 12:01 [PATCH v5 1/3] drm/xe: Move LNL scheduling WA to xe_device.h Nirmoy Das
@ 2024-10-29 12:01 ` Nirmoy Das
2024-11-01 13:21 ` Matthew Auld
2024-11-01 18:49 ` John Harrison
2024-10-29 12:01 ` [PATCH v5 3/3] drm/xe/guc/tlb: Flush g2h worker in case of tlb timeout Nirmoy Das
` (3 subsequent siblings)
4 siblings, 2 replies; 14+ messages in thread
From: Nirmoy Das @ 2024-10-29 12:01 UTC (permalink / raw)
To: intel-xe
Cc: Nirmoy Das, Badal Nilawar, Matthew Auld, John Harrison,
Himal Prasad Ghimiray, Lucas De Marchi, stable, Matthew Brost
Flush the xe ordered_wq in case of ufence timeout, which is observed
on LNL and points to the recent scheduling issue with E-cores.
This is similar to the recent fix:
commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h
response timeout") and should be removed once there is an E-core
scheduling fix for LNL.
v2: Add platform check(Himal)
s/__flush_workqueue/flush_workqueue(Jani)
v3: Remove gfx platform check as the issue related to cpu
platform(John)
v4: Use the Common macro(John) and print when the flush resolves
timeout(Matt B)
Cc: Badal Nilawar <badal.nilawar@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: <stable@vger.kernel.org> # v6.11+
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2754
Suggested-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
---
drivers/gpu/drm/xe/xe_wait_user_fence.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_wait_user_fence.c b/drivers/gpu/drm/xe/xe_wait_user_fence.c
index f5deb81eba01..5b4264ea38bd 100644
--- a/drivers/gpu/drm/xe/xe_wait_user_fence.c
+++ b/drivers/gpu/drm/xe/xe_wait_user_fence.c
@@ -155,6 +155,13 @@ int xe_wait_user_fence_ioctl(struct drm_device *dev, void *data,
}
if (!timeout) {
+ LNL_FLUSH_WORKQUEUE(xe->ordered_wq);
+ err = do_compare(addr, args->value, args->mask,
+ args->op);
+ if (err <= 0) {
+ drm_dbg(&xe->drm, "LNL_FLUSH_WORKQUEUE resolved ufence timeout\n");
+ break;
+ }
err = -ETIME;
break;
}
--
2.46.0
* [PATCH v5 3/3] drm/xe/guc/tlb: Flush g2h worker in case of tlb timeout
2024-10-29 12:01 [PATCH v5 1/3] drm/xe: Move LNL scheduling WA to xe_device.h Nirmoy Das
2024-10-29 12:01 ` [PATCH v5 2/3] drm/xe/ufence: Flush xe ordered_wq in case of ufence timeout Nirmoy Das
@ 2024-10-29 12:01 ` Nirmoy Das
2024-11-01 13:19 ` Matthew Auld
2024-11-01 18:51 ` John Harrison
2024-11-01 13:09 ` [PATCH v5 1/3] drm/xe: Move LNL scheduling WA to xe_device.h Nirmoy Das
` (2 subsequent siblings)
4 siblings, 2 replies; 14+ messages in thread
From: Nirmoy Das @ 2024-10-29 12:01 UTC (permalink / raw)
To: intel-xe
Cc: Nirmoy Das, Badal Nilawar, Matthew Brost, Matthew Auld,
John Harrison, Himal Prasad Ghimiray, Lucas De Marchi, stable
Flush the g2h worker explicitly if a TLB timeout happens, which is
observed on LNL and points to the recent scheduling issue with
E-cores on LNL.
This is similar to the recent fix:
commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h
response timeout") and should be removed once there is an E-core
scheduling fix.
v2: Add platform check(Himal)
v3: Remove gfx platform check as the issue related to cpu
platform(John)
Use the common WA macro(John) and print when the flush
resolves timeout(Matt B)
v4: Remove the resolves log and do the flush before taking
pending_lock(Matt A)
Cc: Badal Nilawar <badal.nilawar@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: <stable@vger.kernel.org> # v6.11+
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2687
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
---
drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
index 773de1f08db9..3cb228c773cd 100644
--- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
+++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
@@ -72,6 +72,8 @@ static void xe_gt_tlb_fence_timeout(struct work_struct *work)
struct xe_device *xe = gt_to_xe(gt);
struct xe_gt_tlb_invalidation_fence *fence, *next;
+ LNL_FLUSH_WORK(>->uc.guc.ct.g2h_worker);
+
spin_lock_irq(>->tlb_invalidation.pending_lock);
list_for_each_entry_safe(fence, next,
>->tlb_invalidation.pending_fences, link) {
--
2.46.0
* Re: [PATCH v5 1/3] drm/xe: Move LNL scheduling WA to xe_device.h
2024-10-29 12:01 [PATCH v5 1/3] drm/xe: Move LNL scheduling WA to xe_device.h Nirmoy Das
2024-10-29 12:01 ` [PATCH v5 2/3] drm/xe/ufence: Flush xe ordered_wq in case of ufence timeout Nirmoy Das
2024-10-29 12:01 ` [PATCH v5 3/3] drm/xe/guc/tlb: Flush g2h worker in case of tlb timeout Nirmoy Das
@ 2024-11-01 13:09 ` Nirmoy Das
2024-11-01 13:21 ` Matthew Auld
2024-11-01 18:48 ` John Harrison
4 siblings, 0 replies; 14+ messages in thread
From: Nirmoy Das @ 2024-11-01 13:09 UTC (permalink / raw)
To: Nirmoy Das, intel-xe
Cc: Badal Nilawar, Matthew Auld, Matthew Brost, Himal Prasad Ghimiray,
Lucas De Marchi, stable, John Harrison
ping!
On 10/29/2024 1:01 PM, Nirmoy Das wrote:
> Move LNL scheduling WA to xe_device.h so this can be used in other
> places without needing keep the same comment about removal of this WA
> in the future. The WA, which flushes work or workqueues, is now wrapped
> in macros and can be reused wherever needed.
>
> Cc: Badal Nilawar <badal.nilawar@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> cc: <stable@vger.kernel.org> # v6.11+
> Suggested-by: John Harrison <John.C.Harrison@Intel.com>
> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
> ---
> drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++
> drivers/gpu/drm/xe/xe_guc_ct.c | 11 +----------
> 2 files changed, 15 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> index 4c3f0ebe78a9..f1fbfe916867 100644
> --- a/drivers/gpu/drm/xe/xe_device.h
> +++ b/drivers/gpu/drm/xe/xe_device.h
> @@ -191,4 +191,18 @@ void xe_device_declare_wedged(struct xe_device *xe);
> struct xe_file *xe_file_get(struct xe_file *xef);
> void xe_file_put(struct xe_file *xef);
>
> +/*
> + * Occasionally it is seen that the G2H worker starts running after a delay of more than
> + * a second even after being queued and activated by the Linux workqueue subsystem. This
> + * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
> + * Lunarlake Hybrid CPU. Issue disappears if we disable Lunarlake atom cores from BIOS
> + * and this is beyond xe kmd.
> + *
> + * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
> + */
> +#define LNL_FLUSH_WORKQUEUE(wq__) \
> + flush_workqueue(wq__)
> +#define LNL_FLUSH_WORK(wrk__) \
> + flush_work(wrk__)
> +
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> index 1b5d8fb1033a..703b44b257a7 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -1018,17 +1018,8 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
>
> ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
>
> - /*
> - * Occasionally it is seen that the G2H worker starts running after a delay of more than
> - * a second even after being queued and activated by the Linux workqueue subsystem. This
> - * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
> - * Lunarlake Hybrid CPU. Issue dissappears if we disable Lunarlake atom cores from BIOS
> - * and this is beyond xe kmd.
> - *
> - * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
> - */
> if (!ret) {
> - flush_work(&ct->g2h_worker);
> + LNL_FLUSH_WORK(&ct->g2h_worker);
> if (g2h_fence.done) {
> xe_gt_warn(gt, "G2H fence %u, action %04x, done\n",
> g2h_fence.seqno, action[0]);
* Re: [PATCH v5 3/3] drm/xe/guc/tlb: Flush g2h worker in case of tlb timeout
2024-10-29 12:01 ` [PATCH v5 3/3] drm/xe/guc/tlb: Flush g2h worker in case of tlb timeout Nirmoy Das
@ 2024-11-01 13:19 ` Matthew Auld
2024-11-01 18:51 ` John Harrison
1 sibling, 0 replies; 14+ messages in thread
From: Matthew Auld @ 2024-11-01 13:19 UTC (permalink / raw)
To: Nirmoy Das, intel-xe
Cc: Badal Nilawar, Matthew Brost, John Harrison,
Himal Prasad Ghimiray, Lucas De Marchi, stable
On 29/10/2024 12:01, Nirmoy Das wrote:
> Flush the g2h worker explicitly if TLB timeout happens which is
> observed on LNL and that points to the recent scheduling issue with
> E-cores on LNL.
>
> This is similar to the recent fix:
> commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h
> response timeout") and should be removed once there is E core
> scheduling fix.
>
> v2: Add platform check(Himal)
> v3: Remove gfx platform check as the issue related to cpu
> platform(John)
> Use the common WA macro(John) and print when the flush
> resolves timeout(Matt B)
> v4: Remove the resolves log and do the flush before taking
> pending_lock(Matt A)
>
> Cc: Badal Nilawar <badal.nilawar@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: John Harrison <John.C.Harrison@Intel.com>
> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: <stable@vger.kernel.org> # v6.11+
> Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2687
> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
* Re: [PATCH v5 2/3] drm/xe/ufence: Flush xe ordered_wq in case of ufence timeout
2024-10-29 12:01 ` [PATCH v5 2/3] drm/xe/ufence: Flush xe ordered_wq in case of ufence timeout Nirmoy Das
@ 2024-11-01 13:21 ` Matthew Auld
2024-11-01 18:49 ` John Harrison
1 sibling, 0 replies; 14+ messages in thread
From: Matthew Auld @ 2024-11-01 13:21 UTC (permalink / raw)
To: Nirmoy Das, intel-xe
Cc: Badal Nilawar, John Harrison, Himal Prasad Ghimiray,
Lucas De Marchi, stable, Matthew Brost
On 29/10/2024 12:01, Nirmoy Das wrote:
> Flush xe ordered_wq in case of ufence timeout which is observed
> on LNL and that points to recent scheduling issue with E-cores.
>
> This is similar to the recent fix:
> commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h
> response timeout") and should be removed once there is a E-core
> scheduling fix for LNL.
>
> v2: Add platform check(Himal)
> s/__flush_workqueue/flush_workqueue(Jani)
> v3: Remove gfx platform check as the issue related to cpu
> platform(John)
> v4: Use the Common macro(John) and print when the flush resolves
> timeout(Matt B)
>
> Cc: Badal Nilawar <badal.nilawar@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: John Harrison <John.C.Harrison@Intel.com>
> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: <stable@vger.kernel.org> # v6.11+
> Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2754
> Suggested-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
* Re: [PATCH v5 1/3] drm/xe: Move LNL scheduling WA to xe_device.h
2024-10-29 12:01 [PATCH v5 1/3] drm/xe: Move LNL scheduling WA to xe_device.h Nirmoy Das
` (2 preceding siblings ...)
2024-11-01 13:09 ` [PATCH v5 1/3] drm/xe: Move LNL scheduling WA to xe_device.h Nirmoy Das
@ 2024-11-01 13:21 ` Matthew Auld
2024-11-01 13:44 ` Nirmoy Das
2024-11-01 18:48 ` John Harrison
4 siblings, 1 reply; 14+ messages in thread
From: Matthew Auld @ 2024-11-01 13:21 UTC (permalink / raw)
To: Nirmoy Das, intel-xe
Cc: Badal Nilawar, Matthew Brost, Himal Prasad Ghimiray,
Lucas De Marchi, stable, John Harrison
On 29/10/2024 12:01, Nirmoy Das wrote:
> Move LNL scheduling WA to xe_device.h so this can be used in other
> places without needing keep the same comment about removal of this WA
> in the future. The WA, which flushes work or workqueues, is now wrapped
> in macros and can be reused wherever needed.
>
> Cc: Badal Nilawar <badal.nilawar@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> cc: <stable@vger.kernel.org> # v6.11+
> Suggested-by: John Harrison <John.C.Harrison@Intel.com>
> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
* Re: [PATCH v5 1/3] drm/xe: Move LNL scheduling WA to xe_device.h
2024-11-01 13:21 ` Matthew Auld
@ 2024-11-01 13:44 ` Nirmoy Das
2024-11-01 15:47 ` Lucas De Marchi
0 siblings, 1 reply; 14+ messages in thread
From: Nirmoy Das @ 2024-11-01 13:44 UTC (permalink / raw)
To: Matthew Auld, Nirmoy Das, intel-xe
Cc: Badal Nilawar, Matthew Brost, Himal Prasad Ghimiray,
Lucas De Marchi, stable, John Harrison
On 11/1/2024 2:21 PM, Matthew Auld wrote:
> On 29/10/2024 12:01, Nirmoy Das wrote:
>> Move LNL scheduling WA to xe_device.h so this can be used in other
>> places without needing keep the same comment about removal of this WA
>> in the future. The WA, which flushes work or workqueues, is now wrapped
>> in macros and can be reused wherever needed.
>>
>> Cc: Badal Nilawar <badal.nilawar@intel.com>
>> Cc: Matthew Auld <matthew.auld@intel.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> cc: <stable@vger.kernel.org> # v6.11+
>> Suggested-by: John Harrison <John.C.Harrison@Intel.com>
>> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
> Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Thanks for reviewing the series, Matt!
* Re: [PATCH v5 1/3] drm/xe: Move LNL scheduling WA to xe_device.h
2024-11-01 13:44 ` Nirmoy Das
@ 2024-11-01 15:47 ` Lucas De Marchi
0 siblings, 0 replies; 14+ messages in thread
From: Lucas De Marchi @ 2024-11-01 15:47 UTC (permalink / raw)
To: Nirmoy Das
Cc: Matthew Auld, Nirmoy Das, intel-xe, Badal Nilawar, Matthew Brost,
Himal Prasad Ghimiray, stable, John Harrison
On Fri, Nov 01, 2024 at 02:44:13PM +0100, Nirmoy Das wrote:
>
>On 11/1/2024 2:21 PM, Matthew Auld wrote:
>> On 29/10/2024 12:01, Nirmoy Das wrote:
>>> Move LNL scheduling WA to xe_device.h so this can be used in other
>>> places without needing keep the same comment about removal of this WA
>>> in the future. The WA, which flushes work or workqueues, is now wrapped
>>> in macros and can be reused wherever needed.
>>>
>>> Cc: Badal Nilawar <badal.nilawar@intel.com>
>>> Cc: Matthew Auld <matthew.auld@intel.com>
>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>>> cc: <stable@vger.kernel.org> # v6.11+
>>> Suggested-by: John Harrison <John.C.Harrison@Intel.com>
>>> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
>> Reviewed-by: Matthew Auld <matthew.auld@intel.com>
>
>
>Thanks for reviewing the series, Matt!
and pushed to drm-xe-next, thanks.
Lucas De Marchi
* Re: [PATCH v5 1/3] drm/xe: Move LNL scheduling WA to xe_device.h
2024-10-29 12:01 [PATCH v5 1/3] drm/xe: Move LNL scheduling WA to xe_device.h Nirmoy Das
` (3 preceding siblings ...)
2024-11-01 13:21 ` Matthew Auld
@ 2024-11-01 18:48 ` John Harrison
2024-11-04 9:52 ` Nirmoy Das
4 siblings, 1 reply; 14+ messages in thread
From: John Harrison @ 2024-11-01 18:48 UTC (permalink / raw)
To: Nirmoy Das, intel-xe
Cc: Badal Nilawar, Matthew Auld, Matthew Brost, Himal Prasad Ghimiray,
Lucas De Marchi, stable
On 10/29/2024 05:01, Nirmoy Das wrote:
> Move LNL scheduling WA to xe_device.h so this can be used in other
> places without needing keep the same comment about removal of this WA
> in the future. The WA, which flushes work or workqueues, is now wrapped
> in macros and can be reused wherever needed.
>
> Cc: Badal Nilawar <badal.nilawar@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> cc: <stable@vger.kernel.org> # v6.11+
> Suggested-by: John Harrison <John.C.Harrison@Intel.com>
> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
> ---
> drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++
> drivers/gpu/drm/xe/xe_guc_ct.c | 11 +----------
> 2 files changed, 15 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> index 4c3f0ebe78a9..f1fbfe916867 100644
> --- a/drivers/gpu/drm/xe/xe_device.h
> +++ b/drivers/gpu/drm/xe/xe_device.h
> @@ -191,4 +191,18 @@ void xe_device_declare_wedged(struct xe_device *xe);
> struct xe_file *xe_file_get(struct xe_file *xef);
> void xe_file_put(struct xe_file *xef);
>
> +/*
> + * Occasionally it is seen that the G2H worker starts running after a delay of more than
> + * a second even after being queued and activated by the Linux workqueue subsystem. This
> + * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
> + * Lunarlake Hybrid CPU. Issue disappears if we disable Lunarlake atom cores from BIOS
> + * and this is beyond xe kmd.
> + *
> + * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
> + */
> +#define LNL_FLUSH_WORKQUEUE(wq__) \
> + flush_workqueue(wq__)
> +#define LNL_FLUSH_WORK(wrk__) \
> + flush_work(wrk__)
> +
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> index 1b5d8fb1033a..703b44b257a7 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -1018,17 +1018,8 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
>
> ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
>
> - /*
> - * Occasionally it is seen that the G2H worker starts running after a delay of more than
> - * a second even after being queued and activated by the Linux workqueue subsystem. This
> - * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
> - * Lunarlake Hybrid CPU. Issue dissappears if we disable Lunarlake atom cores from BIOS
> - * and this is beyond xe kmd.
> - *
> - * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
> - */
> if (!ret) {
> - flush_work(&ct->g2h_worker);
> + LNL_FLUSH_WORK(&ct->g2h_worker);
> if (g2h_fence.done) {
> xe_gt_warn(gt, "G2H fence %u, action %04x, done\n",
> g2h_fence.seqno, action[0]);
This message is still wrong.
We have a warning that says 'job completed successfully'! That is
misleading. It needs to say "done after flush" or "done but flush was
required" or something along those lines.
John.
* Re: [PATCH v5 2/3] drm/xe/ufence: Flush xe ordered_wq in case of ufence timeout
2024-10-29 12:01 ` [PATCH v5 2/3] drm/xe/ufence: Flush xe ordered_wq in case of ufence timeout Nirmoy Das
2024-11-01 13:21 ` Matthew Auld
@ 2024-11-01 18:49 ` John Harrison
1 sibling, 0 replies; 14+ messages in thread
From: John Harrison @ 2024-11-01 18:49 UTC (permalink / raw)
To: Nirmoy Das, intel-xe
Cc: Badal Nilawar, Matthew Auld, Himal Prasad Ghimiray,
Lucas De Marchi, stable, Matthew Brost
On 10/29/2024 05:01, Nirmoy Das wrote:
> Flush xe ordered_wq in case of ufence timeout which is observed
> on LNL and that points to recent scheduling issue with E-cores.
>
> This is similar to the recent fix:
> commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h
> response timeout") and should be removed once there is a E-core
> scheduling fix for LNL.
>
> v2: Add platform check(Himal)
> s/__flush_workqueue/flush_workqueue(Jani)
> v3: Remove gfx platform check as the issue related to cpu
> platform(John)
> v4: Use the Common macro(John) and print when the flush resolves
> timeout(Matt B)
>
> Cc: Badal Nilawar <badal.nilawar@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: John Harrison <John.C.Harrison@Intel.com>
> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: <stable@vger.kernel.org> # v6.11+
> Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2754
> Suggested-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
> ---
> drivers/gpu/drm/xe/xe_wait_user_fence.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_wait_user_fence.c b/drivers/gpu/drm/xe/xe_wait_user_fence.c
> index f5deb81eba01..5b4264ea38bd 100644
> --- a/drivers/gpu/drm/xe/xe_wait_user_fence.c
> +++ b/drivers/gpu/drm/xe/xe_wait_user_fence.c
> @@ -155,6 +155,13 @@ int xe_wait_user_fence_ioctl(struct drm_device *dev, void *data,
> }
>
> if (!timeout) {
> + LNL_FLUSH_WORKQUEUE(xe->ordered_wq);
> + err = do_compare(addr, args->value, args->mask,
> + args->op);
> + if (err <= 0) {
> + drm_dbg(&xe->drm, "LNL_FLUSH_WORKQUEUE resolved ufence timeout\n");
Should this not be a warning as well? To match the one in the G2H code,
with the reason being that we want CI to track how often this is occurring.
John.
> + break;
> + }
> err = -ETIME;
> break;
> }
* Re: [PATCH v5 3/3] drm/xe/guc/tlb: Flush g2h worker in case of tlb timeout
2024-10-29 12:01 ` [PATCH v5 3/3] drm/xe/guc/tlb: Flush g2h worker in case of tlb timeout Nirmoy Das
2024-11-01 13:19 ` Matthew Auld
@ 2024-11-01 18:51 ` John Harrison
2024-11-04 11:20 ` Nirmoy Das
1 sibling, 1 reply; 14+ messages in thread
From: John Harrison @ 2024-11-01 18:51 UTC (permalink / raw)
To: Nirmoy Das, intel-xe
Cc: Badal Nilawar, Matthew Brost, Matthew Auld, Himal Prasad Ghimiray,
Lucas De Marchi, stable
On 10/29/2024 05:01, Nirmoy Das wrote:
> Flush the g2h worker explicitly if TLB timeout happens which is
> observed on LNL and that points to the recent scheduling issue with
> E-cores on LNL.
>
> This is similar to the recent fix:
> commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h
> response timeout") and should be removed once there is E core
> scheduling fix.
>
> v2: Add platform check(Himal)
> v3: Remove gfx platform check as the issue related to cpu
> platform(John)
> Use the common WA macro(John) and print when the flush
> resolves timeout(Matt B)
> v4: Remove the resolves log and do the flush before taking
> pending_lock(Matt A)
>
> Cc: Badal Nilawar <badal.nilawar@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: John Harrison <John.C.Harrison@Intel.com>
> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: <stable@vger.kernel.org> # v6.11+
> Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2687
> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
> ---
> drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> index 773de1f08db9..3cb228c773cd 100644
> --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> @@ -72,6 +72,8 @@ static void xe_gt_tlb_fence_timeout(struct work_struct *work)
> struct xe_device *xe = gt_to_xe(gt);
> struct xe_gt_tlb_invalidation_fence *fence, *next;
>
> + LNL_FLUSH_WORK(>->uc.guc.ct.g2h_worker);
> +
Do we not want some kind of 'success required flush' message here as per
the other instances?
John.
> spin_lock_irq(>->tlb_invalidation.pending_lock);
> list_for_each_entry_safe(fence, next,
> >->tlb_invalidation.pending_fences, link) {
* Re: [PATCH v5 1/3] drm/xe: Move LNL scheduling WA to xe_device.h
2024-11-01 18:48 ` John Harrison
@ 2024-11-04 9:52 ` Nirmoy Das
0 siblings, 0 replies; 14+ messages in thread
From: Nirmoy Das @ 2024-11-04 9:52 UTC (permalink / raw)
To: John Harrison, Nirmoy Das, intel-xe
Cc: Badal Nilawar, Matthew Auld, Matthew Brost, Himal Prasad Ghimiray,
Lucas De Marchi, stable
On 11/1/2024 7:48 PM, John Harrison wrote:
> On 10/29/2024 05:01, Nirmoy Das wrote:
>> Move LNL scheduling WA to xe_device.h so this can be used in other
>> places without needing keep the same comment about removal of this WA
>> in the future. The WA, which flushes work or workqueues, is now wrapped
>> in macros and can be reused wherever needed.
>>
>> Cc: Badal Nilawar <badal.nilawar@intel.com>
>> Cc: Matthew Auld <matthew.auld@intel.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> cc: <stable@vger.kernel.org> # v6.11+
>> Suggested-by: John Harrison <John.C.Harrison@Intel.com>
>> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
>> ---
>> drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++
>> drivers/gpu/drm/xe/xe_guc_ct.c | 11 +----------
>> 2 files changed, 15 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
>> index 4c3f0ebe78a9..f1fbfe916867 100644
>> --- a/drivers/gpu/drm/xe/xe_device.h
>> +++ b/drivers/gpu/drm/xe/xe_device.h
>> @@ -191,4 +191,18 @@ void xe_device_declare_wedged(struct xe_device *xe);
>> struct xe_file *xe_file_get(struct xe_file *xef);
>> void xe_file_put(struct xe_file *xef);
>> +/*
>> + * Occasionally it is seen that the G2H worker starts running after a delay of more than
>> + * a second even after being queued and activated by the Linux workqueue subsystem. This
>> + * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
>> + * Lunarlake Hybrid CPU. Issue disappears if we disable Lunarlake atom cores from BIOS
>> + * and this is beyond xe kmd.
>> + *
>> + * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
>> + */
>> +#define LNL_FLUSH_WORKQUEUE(wq__) \
>> + flush_workqueue(wq__)
>> +#define LNL_FLUSH_WORK(wrk__) \
>> + flush_work(wrk__)
>> +
>> #endif
>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
>> index 1b5d8fb1033a..703b44b257a7 100644
>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
>> @@ -1018,17 +1018,8 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
>> ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
>> - /*
>> - * Occasionally it is seen that the G2H worker starts running after a delay of more than
>> - * a second even after being queued and activated by the Linux workqueue subsystem. This
>> - * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
>> - * Lunarlake Hybrid CPU. Issue dissappears if we disable Lunarlake atom cores from BIOS
>> - * and this is beyond xe kmd.
>> - *
>> - * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
>> - */
>> if (!ret) {
>> - flush_work(&ct->g2h_worker);
>> + LNL_FLUSH_WORK(&ct->g2h_worker);
>> if (g2h_fence.done) {
>> xe_gt_warn(gt, "G2H fence %u, action %04x, done\n",
>> g2h_fence.seqno, action[0]);
> This message is still wrong.
I see that this is open from the previous patch, https://patchwork.freedesktop.org/patch/620189/#comment_1128218.
Sent out a patch to improve the message.
Thanks,
Nirmoy
>
> We have a warning that says 'job completed successfully'! That is misleading. It needs to say "done after flush" or "done but flush was required" or something along those lines.
>
> John.
>
>
* Re: [PATCH v5 3/3] drm/xe/guc/tlb: Flush g2h worker in case of tlb timeout
2024-11-01 18:51 ` John Harrison
@ 2024-11-04 11:20 ` Nirmoy Das
0 siblings, 0 replies; 14+ messages in thread
From: Nirmoy Das @ 2024-11-04 11:20 UTC (permalink / raw)
To: John Harrison, intel-xe
Cc: Badal Nilawar, Matthew Brost, Matthew Auld, Himal Prasad Ghimiray,
Lucas De Marchi, stable
On 11/1/2024 7:51 PM, John Harrison wrote:
> On 10/29/2024 05:01, Nirmoy Das wrote:
>> Flush the g2h worker explicitly if TLB timeout happens which is
>> observed on LNL and that points to the recent scheduling issue with
>> E-cores on LNL.
>>
>> This is similar to the recent fix:
>> commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h
>> response timeout") and should be removed once there is E core
>> scheduling fix.
>>
>> v2: Add platform check(Himal)
>> v3: Remove gfx platform check as the issue related to cpu
>> platform(John)
>> Use the common WA macro(John) and print when the flush
>> resolves timeout(Matt B)
>> v4: Remove the resolves log and do the flush before taking
>> pending_lock(Matt A)
>>
>> Cc: Badal Nilawar <badal.nilawar@intel.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Matthew Auld <matthew.auld@intel.com>
>> Cc: John Harrison <John.C.Harrison@Intel.com>
>> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> Cc: <stable@vger.kernel.org> # v6.11+
>> Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2687
>> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
>> ---
>> drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
>> index 773de1f08db9..3cb228c773cd 100644
>> --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
>> +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
>> @@ -72,6 +72,8 @@ static void xe_gt_tlb_fence_timeout(struct work_struct *work)
>> struct xe_device *xe = gt_to_xe(gt);
>> struct xe_gt_tlb_invalidation_fence *fence, *next;
>> + LNL_FLUSH_WORK(>->uc.guc.ct.g2h_worker);
>> +
> Do we not want some kind of 'success required flush' message here as per the other instances?
This flush resolves the TLB timeout, but in this function I can't think of a simple way to detect that and log a message.
Regards,
Nirmoy
> John.
>
>> spin_lock_irq(>->tlb_invalidation.pending_lock);
>> list_for_each_entry_safe(fence, next,
>> >->tlb_invalidation.pending_fences, link) {
>
end of thread, other threads:[~2024-11-04 11:20 UTC | newest]
Thread overview: 14+ messages
2024-10-29 12:01 [PATCH v5 1/3] drm/xe: Move LNL scheduling WA to xe_device.h Nirmoy Das
2024-10-29 12:01 ` [PATCH v5 2/3] drm/xe/ufence: Flush xe ordered_wq in case of ufence timeout Nirmoy Das
2024-11-01 13:21 ` Matthew Auld
2024-11-01 18:49 ` John Harrison
2024-10-29 12:01 ` [PATCH v5 3/3] drm/xe/guc/tlb: Flush g2h worker in case of tlb timeout Nirmoy Das
2024-11-01 13:19 ` Matthew Auld
2024-11-01 18:51 ` John Harrison
2024-11-04 11:20 ` Nirmoy Das
2024-11-01 13:09 ` [PATCH v5 1/3] drm/xe: Move LNL scheduling WA to xe_device.h Nirmoy Das
2024-11-01 13:21 ` Matthew Auld
2024-11-01 13:44 ` Nirmoy Das
2024-11-01 15:47 ` Lucas De Marchi
2024-11-01 18:48 ` John Harrison
2024-11-04 9:52 ` Nirmoy Das