public inbox for intel-gfx@lists.freedesktop.org
 help / color / mirror / Atom feed
* [PATCH v4] drm/i915/selftests: Defer signalling the request fence
@ 2026-01-30 18:45 Krzysztof Niemiec
  2026-01-30 19:45 ` ✗ i915.CI.BAT: failure for drm/i915/selftests: Defer signalling the request fence (rev6) Patchwork
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Krzysztof Niemiec @ 2026-01-30 18:45 UTC (permalink / raw)
  To: dri-devel, intel-gfx
  Cc: Andi Shyti, Jonathan Cavitt, Janusz Krzysztofik, Krzysztof Karas,
	Sebastian Brzezinka, Chris Wilson, Krzysztof Niemiec

The i915_active selftests live_active_wait and live_active_retire
operate on an i915_active attached to a mock, empty request, created as
part of test setup. A fence is attached to this request to control when
the request is processed. The tests then wait for the completion of the
active with __i915_active_wait(), and the test is considered successful
if this results in setting a variable in the active callback.

However, the behavior of __i915_active_wait() is such that if the
refcount for the active is 0, the function is almost completely skipped;
waiting on a already completed active yields no effect. This includes a
subsequent call to the retire() function of the active (which is the
callback that the test is interested about, and which dictates whether
its successful or not). So, if the active is completed before the
aforementioned call to __i915_active_wait(), the test will fail.

Most of the test runs in a single thread, including creating the
request, creating the fence for it, signalling that fence, and calling
__i915_active_wait(). However, the request itself is handled
asynchronously. This creates a race condition where if the request is
completed after signalling the fence, but before waiting on its active,
the active callback will not be invoked, failing the test.

Defer signalling the request's fence, to ensure the main test thread
gets to call __i915_active_wait() before request completion.

v4:
- Lower the delay timeout to 50ms (Jonathan)
- Put the check on work_finished inside a helper function (Jonathan)

v3:
- Embed the variables inside the live_active struct (Andi)
- Move the schedule_delayed_work call closer to the wait (Andi)
- Implement error handling in case an error state - the wait has
  finished, but the deferred work didn't run - is somehow achieved (Andi)

v2:
- Clarify the need for a fix a little more (Krzysztof K., Janusz)

Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14808
Signed-off-by: Krzysztof Niemiec <krzysztof.niemiec@intel.com>
---
 drivers/gpu/drm/i915/selftests/i915_active.c | 51 +++++++++++++++++---
 1 file changed, 45 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/selftests/i915_active.c b/drivers/gpu/drm/i915/selftests/i915_active.c
index 36c3a5460221..eadd0a8bc094 100644
--- a/drivers/gpu/drm/i915/selftests/i915_active.c
+++ b/drivers/gpu/drm/i915/selftests/i915_active.c
@@ -21,6 +21,10 @@ struct live_active {
 	struct i915_active base;
 	struct kref ref;
 	bool retired;
+
+	struct i915_sw_fence *submit;
+	struct delayed_work work;
+	bool work_finished;
 };
 
 static void __live_get(struct live_active *active)
@@ -76,11 +80,37 @@ static struct live_active *__live_alloc(struct drm_i915_private *i915)
 	return active;
 }
 
+static void __live_submit_work_handler(struct work_struct *work)
+{
+	struct delayed_work *d_work = container_of(work, struct delayed_work, work);
+	struct live_active *active = container_of(d_work, struct live_active, work);
+	i915_sw_fence_commit(active->submit);
+	heap_fence_put(active->submit);
+	active->work_finished = true;
+}
+
+static int
+__live_work_confirm_finished(struct drm_i915_private *i915,
+			     struct live_active *active)
+{
+	int err = 0;
+
+	if (!active->work_finished) {
+		struct drm_printer p = drm_err_printer(&i915->drm, __func__);
+
+		drm_printf(&p, "active->work hasn't finished, something went\
+				terribly wrong\n");
+		err = -EINVAL;
+		cancel_delayed_work_sync(&active->work);
+	}
+
+	return err;
+}
+
 static struct live_active *
 __live_active_setup(struct drm_i915_private *i915)
 {
 	struct intel_engine_cs *engine;
-	struct i915_sw_fence *submit;
 	struct live_active *active;
 	unsigned int count = 0;
 	int err = 0;
@@ -89,8 +119,11 @@ __live_active_setup(struct drm_i915_private *i915)
 	if (!active)
 		return ERR_PTR(-ENOMEM);
 
-	submit = heap_fence_create(GFP_KERNEL);
-	if (!submit) {
+	INIT_DELAYED_WORK(&active->work, __live_submit_work_handler);
+	active->work_finished = false;
+
+	active->submit = heap_fence_create(GFP_KERNEL);
+	if (!active->submit) {
 		kfree(active);
 		return ERR_PTR(-ENOMEM);
 	}
@@ -109,7 +142,7 @@ __live_active_setup(struct drm_i915_private *i915)
 		}
 
 		err = i915_sw_fence_await_sw_fence_gfp(&rq->submit,
-						       submit,
+						       active->submit,
 						       GFP_KERNEL);
 		if (err >= 0)
 			err = i915_active_add_request(&active->base, rq);
@@ -134,8 +167,6 @@ __live_active_setup(struct drm_i915_private *i915)
 	}
 
 out:
-	i915_sw_fence_commit(submit);
-	heap_fence_put(submit);
 	if (err) {
 		__live_put(active);
 		active = ERR_PTR(err);
@@ -156,6 +187,8 @@ static int live_active_wait(void *arg)
 	if (IS_ERR(active))
 		return PTR_ERR(active);
 
+	schedule_delayed_work(&active->work, msecs_to_jiffies(50));
+
 	__i915_active_wait(&active->base, TASK_UNINTERRUPTIBLE);
 	if (!READ_ONCE(active->retired)) {
 		struct drm_printer p = drm_err_printer(&i915->drm, __func__);
@@ -166,6 +199,8 @@ static int live_active_wait(void *arg)
 		err = -EINVAL;
 	}
 
+	err = __live_work_confirm_finished(i915, active);
+
 	__live_put(active);
 
 	if (igt_flush_test(i915))
@@ -186,6 +221,8 @@ static int live_active_retire(void *arg)
 	if (IS_ERR(active))
 		return PTR_ERR(active);
 
+	schedule_delayed_work(&active->work, msecs_to_jiffies(50));
+
 	/* waits for & retires all requests */
 	if (igt_flush_test(i915))
 		err = -EIO;
@@ -199,6 +236,8 @@ static int live_active_retire(void *arg)
 		err = -EINVAL;
 	}
 
+	err = __live_work_confirm_finished(i915, active);
+
 	__live_put(active);
 
 	return err;
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* ✗ i915.CI.BAT: failure for drm/i915/selftests: Defer signalling the request fence (rev6)
  2026-01-30 18:45 [PATCH v4] drm/i915/selftests: Defer signalling the request fence Krzysztof Niemiec
@ 2026-01-30 19:45 ` Patchwork
  2026-02-02  7:31 ` [PATCH v4] drm/i915/selftests: Defer signalling the request fence Krzysztof Karas
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Patchwork @ 2026-01-30 19:45 UTC (permalink / raw)
  To: Krzysztof Niemiec; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 7703 bytes --]

== Series Details ==

Series: drm/i915/selftests: Defer signalling the request fence (rev6)
URL   : https://patchwork.freedesktop.org/series/156505/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_17911 -> Patchwork_156505v6
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_156505v6 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_156505v6, please notify your bug team (I915-ci-infra@lists.freedesktop.org) to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_156505v6/index.html

Participating hosts (43 -> 41)
------------------------------

  Missing    (2): bat-dg2-13 fi-snb-2520m 

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_156505v6:

### IGT changes ###

#### Possible regressions ####

  * igt@i915_selftest@live:
    - fi-hsw-4770:        [PASS][1] -> [DMESG-FAIL][2] +1 other test dmesg-fail
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_17911/fi-hsw-4770/igt@i915_selftest@live.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_156505v6/fi-hsw-4770/igt@i915_selftest@live.html
    - fi-ivb-3770:        [PASS][3] -> [DMESG-FAIL][4] +1 other test dmesg-fail
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_17911/fi-ivb-3770/igt@i915_selftest@live.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_156505v6/fi-ivb-3770/igt@i915_selftest@live.html
    - bat-rpls-4:         [PASS][5] -> [DMESG-FAIL][6] +1 other test dmesg-fail
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_17911/bat-rpls-4/igt@i915_selftest@live.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_156505v6/bat-rpls-4/igt@i915_selftest@live.html
    - fi-ilk-650:         [PASS][7] -> [DMESG-FAIL][8] +1 other test dmesg-fail
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_17911/fi-ilk-650/igt@i915_selftest@live.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_156505v6/fi-ilk-650/igt@i915_selftest@live.html
    - fi-skl-6600u:       [PASS][9] -> [DMESG-FAIL][10] +1 other test dmesg-fail
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_17911/fi-skl-6600u/igt@i915_selftest@live.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_156505v6/fi-skl-6600u/igt@i915_selftest@live.html

  * igt@i915_selftest@live@active:
    - bat-kbl-2:          [PASS][11] -> [DMESG-FAIL][12] +1 other test dmesg-fail
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_17911/bat-kbl-2/igt@i915_selftest@live@active.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_156505v6/bat-kbl-2/igt@i915_selftest@live@active.html

  
Known issues
------------

  Here are the changes found in Patchwork_156505v6 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@i915_selftest@live:
    - bat-mtlp-8:         [PASS][13] -> [DMESG-FAIL][14] ([i915#12061]) +1 other test dmesg-fail
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_17911/bat-mtlp-8/igt@i915_selftest@live.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_156505v6/bat-mtlp-8/igt@i915_selftest@live.html
    - fi-glk-j4005:       [PASS][15] -> [ABORT][16] ([i915#15623])
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_17911/fi-glk-j4005/igt@i915_selftest@live.html
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_156505v6/fi-glk-j4005/igt@i915_selftest@live.html
    - bat-jsl-5:          [PASS][17] -> [DMESG-FAIL][18] ([i915#14808]) +1 other test dmesg-fail
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_17911/bat-jsl-5/igt@i915_selftest@live.html
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_156505v6/bat-jsl-5/igt@i915_selftest@live.html
    - fi-rkl-11600:       [PASS][19] -> [DMESG-FAIL][20] ([i915#14808]) +1 other test dmesg-fail
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_17911/fi-rkl-11600/igt@i915_selftest@live.html
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_156505v6/fi-rkl-11600/igt@i915_selftest@live.html
    - fi-cfl-guc:         [PASS][21] -> [DMESG-FAIL][22] ([i915#14808]) +1 other test dmesg-fail
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_17911/fi-cfl-guc/igt@i915_selftest@live.html
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_156505v6/fi-cfl-guc/igt@i915_selftest@live.html

  * igt@i915_selftest@live@active:
    - fi-cfl-8700k:       [PASS][23] -> [DMESG-FAIL][24] ([i915#14808]) +1 other test dmesg-fail
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_17911/fi-cfl-8700k/igt@i915_selftest@live@active.html
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_156505v6/fi-cfl-8700k/igt@i915_selftest@live@active.html

  * igt@i915_selftest@live@execlists:
    - fi-glk-j4005:       [PASS][25] -> [ABORT][26] ([i915#15624])
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_17911/fi-glk-j4005/igt@i915_selftest@live@execlists.html
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_156505v6/fi-glk-j4005/igt@i915_selftest@live@execlists.html

  
#### Possible fixes ####

  * igt@i915_selftest@live@sanitycheck:
    - bat-apl-1:          [DMESG-WARN][27] ([i915#13735]) -> [PASS][28] +75 other tests pass
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_17911/bat-apl-1/igt@i915_selftest@live@sanitycheck.html
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_156505v6/bat-apl-1/igt@i915_selftest@live@sanitycheck.html

  * igt@i915_selftest@live@workarounds:
    - bat-arls-5:         [DMESG-FAIL][29] ([i915#12061]) -> [PASS][30] +1 other test pass
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_17911/bat-arls-5/igt@i915_selftest@live@workarounds.html
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_156505v6/bat-arls-5/igt@i915_selftest@live@workarounds.html

  * igt@kms_pm_rpm@basic-pci-d3-state:
    - bat-apl-1:          [DMESG-WARN][31] ([i915#13735] / [i915#180]) -> [PASS][32] +49 other tests pass
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_17911/bat-apl-1/igt@kms_pm_rpm@basic-pci-d3-state.html
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_156505v6/bat-apl-1/igt@kms_pm_rpm@basic-pci-d3-state.html

  
#### Warnings ####

  * igt@i915_selftest@live:
    - bat-apl-1:          [DMESG-WARN][33] ([i915#13735]) -> [DMESG-FAIL][34] ([i915#14808]) +1 other test dmesg-fail
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_17911/bat-apl-1/igt@i915_selftest@live.html
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_156505v6/bat-apl-1/igt@i915_selftest@live.html

  
  [i915#12061]: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12061
  [i915#13735]: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/13735
  [i915#14808]: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14808
  [i915#15623]: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/15623
  [i915#15624]: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/15624
  [i915#180]: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/180


Build changes
-------------

  * Linux: CI_DRM_17911 -> Patchwork_156505v6

  CI-20190529: 20190529
  CI_DRM_17911: feea7da320aa6950f02d3c4633ab678e6e61f7e9 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_8726: 8726
  Patchwork_156505v6: feea7da320aa6950f02d3c4633ab678e6e61f7e9 @ git://anongit.freedesktop.org/gfx-ci/linux

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_156505v6/index.html

[-- Attachment #2: Type: text/html, Size: 9088 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v4] drm/i915/selftests: Defer signalling the request fence
  2026-01-30 18:45 [PATCH v4] drm/i915/selftests: Defer signalling the request fence Krzysztof Niemiec
  2026-01-30 19:45 ` ✗ i915.CI.BAT: failure for drm/i915/selftests: Defer signalling the request fence (rev6) Patchwork
@ 2026-02-02  7:31 ` Krzysztof Karas
  2026-02-04  0:39 ` Andi Shyti
  2026-02-04  9:26 ` Andi Shyti
  3 siblings, 0 replies; 5+ messages in thread
From: Krzysztof Karas @ 2026-02-02  7:31 UTC (permalink / raw)
  To: Krzysztof Niemiec
  Cc: dri-devel, intel-gfx, Andi Shyti, Jonathan Cavitt,
	Janusz Krzysztofik, Sebastian Brzezinka, Chris Wilson

Hi Krzysztof,

[...]

> +__live_work_confirm_finished(struct drm_i915_private *i915,
> +			     struct live_active *active)
> +{
> +	int err = 0;
> +
> +	if (!active->work_finished) {
> +		struct drm_printer p = drm_err_printer(&i915->drm, __func__);
> +
> +		drm_printf(&p, "active->work hasn't finished, something went\
> +				terribly wrong\n");
I think "something went terribly wrong" is superfluous, when you
are not sure what the "something" was, so you could skip
printing that part. This is a non-blocking suggestion, include
it if you plan to prepare a v5 of this patch.

Reviewed-by: Krzysztof Karas <krzysztof.karas@intel.com>

-- 
Best Regards,
Krzysztof

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v4] drm/i915/selftests: Defer signalling the request fence
  2026-01-30 18:45 [PATCH v4] drm/i915/selftests: Defer signalling the request fence Krzysztof Niemiec
  2026-01-30 19:45 ` ✗ i915.CI.BAT: failure for drm/i915/selftests: Defer signalling the request fence (rev6) Patchwork
  2026-02-02  7:31 ` [PATCH v4] drm/i915/selftests: Defer signalling the request fence Krzysztof Karas
@ 2026-02-04  0:39 ` Andi Shyti
  2026-02-04  9:26 ` Andi Shyti
  3 siblings, 0 replies; 5+ messages in thread
From: Andi Shyti @ 2026-02-04  0:39 UTC (permalink / raw)
  To: Krzysztof Niemiec
  Cc: dri-devel, intel-gfx, Andi Shyti, Jonathan Cavitt,
	Janusz Krzysztofik, Krzysztof Karas, Sebastian Brzezinka,
	Chris Wilson

Hi Krzysztof,

On Fri, Jan 30, 2026 at 07:45:08PM +0100, Krzysztof Niemiec wrote:
> The i915_active selftests live_active_wait and live_active_retire
> operate on an i915_active attached to a mock, empty request, created as
> part of test setup. A fence is attached to this request to control when
> the request is processed. The tests then wait for the completion of the
> active with __i915_active_wait(), and the test is considered successful
> if this results in setting a variable in the active callback.
> 
> However, the behavior of __i915_active_wait() is such that if the
> refcount for the active is 0, the function is almost completely skipped;
> waiting on a already completed active yields no effect. This includes a
> subsequent call to the retire() function of the active (which is the
> callback that the test is interested about, and which dictates whether
> its successful or not). So, if the active is completed before the
> aforementioned call to __i915_active_wait(), the test will fail.
> 
> Most of the test runs in a single thread, including creating the
> request, creating the fence for it, signalling that fence, and calling
> __i915_active_wait(). However, the request itself is handled
> asynchronously. This creates a race condition where if the request is
> completed after signalling the fence, but before waiting on its active,
> the active callback will not be invoked, failing the test.
> 
> Defer signalling the request's fence, to ensure the main test thread
> gets to call __i915_active_wait() before request completion.
> 
> v4:
> - Lower the delay timeout to 50ms (Jonathan)
> - Put the check on work_finished inside a helper function (Jonathan)
> 
> v3:
> - Embed the variables inside the live_active struct (Andi)
> - Move the schedule_delayed_work call closer to the wait (Andi)
> - Implement error handling in case an error state - the wait has
>   finished, but the deferred work didn't run - is somehow achieved (Andi)
> 
> v2:
> - Clarify the need for a fix a little more (Krzysztof K., Janusz)
> 
> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14808
> Signed-off-by: Krzysztof Niemiec <krzysztof.niemiec@intel.com>

I'm sorry, but for now this patch, for the reason I explained you
in a previous version, is a nack.

You are trying to bypass locking issues by adding a random 50ms
delay. It's too fragile and looks hackish.

In any case, I've seen a few issues below, in case someone else
agrees and is willing to merge it.

> ---
>  drivers/gpu/drm/i915/selftests/i915_active.c | 51 +++++++++++++++++---
>  1 file changed, 45 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/selftests/i915_active.c b/drivers/gpu/drm/i915/selftests/i915_active.c
> index 36c3a5460221..eadd0a8bc094 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_active.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_active.c
> @@ -21,6 +21,10 @@ struct live_active {
>  	struct i915_active base;
>  	struct kref ref;
>  	bool retired;
> +
> +	struct i915_sw_fence *submit;
> +	struct delayed_work work;
> +	bool work_finished;
>  };
>  
>  static void __live_get(struct live_active *active)
> @@ -76,11 +80,37 @@ static struct live_active *__live_alloc(struct drm_i915_private *i915)
>  	return active;
>  }
>  
> +static void __live_submit_work_handler(struct work_struct *work)

I don't see the point for the '__' here.

> +{
> +	struct delayed_work *d_work = container_of(work, struct delayed_work, work);
> +	struct live_active *active = container_of(d_work, struct live_active, work);
> +	i915_sw_fence_commit(active->submit);
> +	heap_fence_put(active->submit);
> +	active->work_finished = true;
> +}
> +
> +static int
> +__live_work_confirm_finished(struct drm_i915_private *i915,
> +			     struct live_active *active)
> +{
> +	int err = 0;
> +
> +	if (!active->work_finished) {
> +		struct drm_printer p = drm_err_printer(&i915->drm, __func__);
> +
> +		drm_printf(&p, "active->work hasn't finished, something went\
> +				terribly wrong\n");

until 100 characters per line is fine, but I'm sure you can
reword to something better.

> +		err = -EINVAL;
> +		cancel_delayed_work_sync(&active->work);
> +	}
> +
> +	return err;
> +}
> +
>  static struct live_active *
>  __live_active_setup(struct drm_i915_private *i915)
>  {
>  	struct intel_engine_cs *engine;
> -	struct i915_sw_fence *submit;
>  	struct live_active *active;
>  	unsigned int count = 0;
>  	int err = 0;
> @@ -89,8 +119,11 @@ __live_active_setup(struct drm_i915_private *i915)
>  	if (!active)
>  		return ERR_PTR(-ENOMEM);
>  
> -	submit = heap_fence_create(GFP_KERNEL);
> -	if (!submit) {
> +	INIT_DELAYED_WORK(&active->work, __live_submit_work_handler);
> +	active->work_finished = false;
> +
> +	active->submit = heap_fence_create(GFP_KERNEL);
> +	if (!active->submit) {
>  		kfree(active);
>  		return ERR_PTR(-ENOMEM);
>  	}
> @@ -109,7 +142,7 @@ __live_active_setup(struct drm_i915_private *i915)
>  		}
>  
>  		err = i915_sw_fence_await_sw_fence_gfp(&rq->submit,
> -						       submit,
> +						       active->submit,
>  						       GFP_KERNEL);
>  		if (err >= 0)
>  			err = i915_active_add_request(&active->base, rq);
> @@ -134,8 +167,6 @@ __live_active_setup(struct drm_i915_private *i915)
>  	}
>  
>  out:
> -	i915_sw_fence_commit(submit);
> -	heap_fence_put(submit);
>  	if (err) {
>  		__live_put(active);
>  		active = ERR_PTR(err);
> @@ -156,6 +187,8 @@ static int live_active_wait(void *arg)
>  	if (IS_ERR(active))
>  		return PTR_ERR(active);

if we return with an error...

>  
> +	schedule_delayed_work(&active->work, msecs_to_jiffies(50));

... we don't schedule the work and therefore you leak the fence.

> +
>  	__i915_active_wait(&active->base, TASK_UNINTERRUPTIBLE);
>  	if (!READ_ONCE(active->retired)) {
>  		struct drm_printer p = drm_err_printer(&i915->drm, __func__);
> @@ -166,6 +199,8 @@ static int live_active_wait(void *arg)
>  		err = -EINVAL;
>  	}
>  
> +	err = __live_work_confirm_finished(i915, active);

Here you are overwriting err.

> +
>  	__live_put(active);
>  
>  	if (igt_flush_test(i915))
> @@ -186,6 +221,8 @@ static int live_active_retire(void *arg)
>  	if (IS_ERR(active))
>  		return PTR_ERR(active);
>  
> +	schedule_delayed_work(&active->work, msecs_to_jiffies(50));
> +
>  	/* waits for & retires all requests */
>  	if (igt_flush_test(i915))
>  		err = -EIO;
> @@ -199,6 +236,8 @@ static int live_active_retire(void *arg)
>  		err = -EINVAL;
>  	}
>  
> +	err = __live_work_confirm_finished(i915, active);

here as well.

Andi

> +
>  	__live_put(active);
>  
>  	return err;
> -- 
> 2.45.2
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v4] drm/i915/selftests: Defer signalling the request fence
  2026-01-30 18:45 [PATCH v4] drm/i915/selftests: Defer signalling the request fence Krzysztof Niemiec
                   ` (2 preceding siblings ...)
  2026-02-04  0:39 ` Andi Shyti
@ 2026-02-04  9:26 ` Andi Shyti
  3 siblings, 0 replies; 5+ messages in thread
From: Andi Shyti @ 2026-02-04  9:26 UTC (permalink / raw)
  To: Krzysztof Niemiec
  Cc: dri-devel, intel-gfx, Andi Shyti, Jonathan Cavitt,
	Janusz Krzysztofik, Krzysztof Karas, Sebastian Brzezinka,
	Chris Wilson

Hi Krzysztof,

On Fri, Jan 30, 2026 at 07:45:08PM +0100, Krzysztof Niemiec wrote:
> The i915_active selftests live_active_wait and live_active_retire
> operate on an i915_active attached to a mock, empty request, created as
> part of test setup. A fence is attached to this request to control when
> the request is processed. The tests then wait for the completion of the
> active with __i915_active_wait(), and the test is considered successful
> if this results in setting a variable in the active callback.
> 
> However, the behavior of __i915_active_wait() is such that if the
> refcount for the active is 0, the function is almost completely skipped;
> waiting on a already completed active yields no effect. This includes a
> subsequent call to the retire() function of the active (which is the
> callback that the test is interested about, and which dictates whether
> its successful or not). So, if the active is completed before the
> aforementioned call to __i915_active_wait(), the test will fail.
> 
> Most of the test runs in a single thread, including creating the
> request, creating the fence for it, signalling that fence, and calling
> __i915_active_wait(). However, the request itself is handled
> asynchronously. This creates a race condition where if the request is
> completed after signalling the fence, but before waiting on its active,
> the active callback will not be invoked, failing the test.
> 
> Defer signalling the request's fence, to ensure the main test thread
> gets to call __i915_active_wait() before request completion.
> 
> v4:
> - Lower the delay timeout to 50ms (Jonathan)
> - Put the check on work_finished inside a helper function (Jonathan)
> 
> v3:
> - Embed the variables inside the live_active struct (Andi)
> - Move the schedule_delayed_work call closer to the wait (Andi)
> - Implement error handling in case an error state - the wait has
>   finished, but the deferred work didn't run - is somehow achieved (Andi)
> 
> v2:
> - Clarify the need for a fix a little more (Krzysztof K., Janusz)
> 
> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14808
> Signed-off-by: Krzysztof Niemiec <krzysztof.niemiec@intel.com>

BTW, I don't want to block this patch, I'm just not feeling
comfortable at merging it and I don't have better suggestions.

BTW, you already have consensus here:

Reviewed-by: Sebastian Brzezinka <sebastian.brzezinka@intel.com>
Reviewed-by: Krzysztof Karas <krzysztof.karas@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>

And, BTW, can you please add comments through the lines so that
people understand what you are doing.

Moreover, as Janusz suggested I would also like to have a real
use case description of the issue and how it appeared in our
environment.

Thanks,
Andi

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-02-04  9:26 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-01-30 18:45 [PATCH v4] drm/i915/selftests: Defer signalling the request fence Krzysztof Niemiec
2026-01-30 19:45 ` ✗ i915.CI.BAT: failure for drm/i915/selftests: Defer signalling the request fence (rev6) Patchwork
2026-02-02  7:31 ` [PATCH v4] drm/i915/selftests: Defer signalling the request fence Krzysztof Karas
2026-02-04  0:39 ` Andi Shyti
2026-02-04  9:26 ` Andi Shyti

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox