* [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc error
@ 2025-06-06 13:58 Janusz Krzysztofik
2025-06-10 13:59 ` Sebastian Brzezinka
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Janusz Krzysztofik @ 2025-06-06 13:58 UTC (permalink / raw)
To: intel-gfx
Cc: dri-devel, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
Tvrtko Ursulin, Andi Shyti, Nitin Gote, Chris Wilson,
Matthew Auld
The following error has been reported sporadically by CI when a test
unbinds the i915 driver on a ring submission platform:
<4> [239.330153] ------------[ cut here ]------------
<4> [239.330166] i915 0000:00:02.0: [drm] drm_WARN_ON(dev_priv->mm.shrink_count)
<4> [239.330196] WARNING: CPU: 1 PID: 18570 at drivers/gpu/drm/i915/i915_gem.c:1309 i915_gem_cleanup_early+0x13e/0x150 [i915]
...
<4> [239.330640] RIP: 0010:i915_gem_cleanup_early+0x13e/0x150 [i915]
...
<4> [239.330942] Call Trace:
<4> [239.330944] <TASK>
<4> [239.330949] i915_driver_late_release+0x2b/0xa0 [i915]
<4> [239.331202] i915_driver_release+0x86/0xa0 [i915]
<4> [239.331482] devm_drm_dev_init_release+0x61/0x90
<4> [239.331494] devm_action_release+0x15/0x30
<4> [239.331504] release_nodes+0x3d/0x120
<4> [239.331517] devres_release_all+0x96/0xd0
<4> [239.331533] device_unbind_cleanup+0x12/0x80
<4> [239.331543] device_release_driver_internal+0x23a/0x280
<4> [239.331550] ? bus_find_device+0xa5/0xe0
<4> [239.331563] device_driver_detach+0x14/0x20
...
<4> [357.719679] ---[ end trace 0000000000000000 ]---
If the test also unloads the i915 module then that's followed with:
<3> [357.787478] =============================================================================
<3> [357.788006] BUG i915_vma (Tainted: G U W N ): Objects remaining on __kmem_cache_shutdown()
<3> [357.788031] -----------------------------------------------------------------------------
<3> [357.788204] Object 0xffff888109e7f480 @offset=29824
<3> [357.788670] Allocated in i915_vma_instance+0xee/0xc10 [i915] age=292729 cpu=4 pid=2244
<4> [357.788994] i915_vma_instance+0xee/0xc10 [i915]
<4> [357.789290] init_status_page+0x7b/0x420 [i915]
<4> [357.789532] intel_engines_init+0x1d8/0x980 [i915]
<4> [357.789772] intel_gt_init+0x175/0x450 [i915]
<4> [357.790014] i915_gem_init+0x113/0x340 [i915]
<4> [357.790281] i915_driver_probe+0x847/0xed0 [i915]
<4> [357.790504] i915_pci_probe+0xe6/0x220 [i915]
...
Closer analysis of CI results history has revealed a dependency of the
error on a few IGT tests, namely:
- igt@api_intel_allocator@fork-simple-stress-signal,
- igt@api_intel_allocator@two-level-inception-interruptible,
- igt@gem_linear_blits@interruptible,
- igt@prime_mmap_coherency@ioctl-errors,
which invisibly trigger the issue, then exhibited with first driver unbind
attempt.
All of the above tests perform actions which are actively interrupted with
signals. Further debugging has allowed to narrow that scope down to
DRM_IOCTL_I915_GEM_EXECBUFFER2, and ring_context_alloc(), specific to ring
submission, in particular.
If successful then that function, or its execlists or GuC submission
equivalent, is supposed to be called only once per GEM context engine,
followed by raise of a flag that prevents the function from being called
again. The function is expected to unwind its internal errors itself, so
it may be safely called once more after it returns an error.
In case of ring submission, the function first gets a reference to the
engine's legacy timeline and then allocates a VMA. If the VMA allocation
fails, e.g. when i915_vma_instance() called from inside is interrupted
with a signal, then ring_context_alloc() fails, leaving the timeline held
referenced. On next I915_GEM_EXECBUFFER2 IOCTL, another reference to the
timeline is got, and only that last one is put on successful completion.
As a consequence, the legacy timeline, with its underlying engine status
page's VMA object, is still held and not released on driver unbind.
Get the legacy timeline only after successful allocation of the context
engine's VMA.
Fixes: 75d0a7f31eec ("drm/i915: Lift timeline into intel_context")
Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12061
Cc: Chris Wilson <chris.p.wilson@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
---
drivers/gpu/drm/i915/gt/intel_ring_submission.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
index a876a34455f11..2a6d79abf25b5 100644
--- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
@@ -610,7 +610,6 @@ static int ring_context_alloc(struct intel_context *ce)
/* One ringbuffer to rule them all */
GEM_BUG_ON(!engine->legacy.ring);
ce->ring = engine->legacy.ring;
- ce->timeline = intel_timeline_get(engine->legacy.timeline);
GEM_BUG_ON(ce->state);
if (engine->context_size) {
@@ -623,6 +622,8 @@ static int ring_context_alloc(struct intel_context *ce)
ce->state = vma;
}
+ ce->timeline = intel_timeline_get(engine->legacy.timeline);
+
return 0;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc error
2025-06-06 13:58 Janusz Krzysztofik
@ 2025-06-10 13:59 ` Sebastian Brzezinka
2025-06-10 14:06 ` Krzysztof Niemiec
2025-06-11 8:03 ` Krzysztof Karas
2 siblings, 0 replies; 14+ messages in thread
From: Sebastian Brzezinka @ 2025-06-10 13:59 UTC (permalink / raw)
To: Janusz Krzysztofik, intel-gfx
Cc: dri-devel, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
Tvrtko Ursulin, Andi Shyti, Nitin Gote, Chris Wilson,
Matthew Auld
On Fri Jun 6, 2025 at 1:58 PM UTC, Janusz Krzysztofik wrote:
> The following error has been reported sporadically by CI when a test
> unbinds the i915 driver on a ring submission platform:
>
> <4> [239.330153] ------------[ cut here ]------------
> <4> [239.330166] i915 0000:00:02.0: [drm] drm_WARN_ON(dev_priv->mm.shrink_count)
> <4> [239.330196] WARNING: CPU: 1 PID: 18570 at drivers/gpu/drm/i915/i915_gem.c:1309 i915_gem_cleanup_early+0x13e/0x150 [i915]
> ...
> <4> [239.330640] RIP: 0010:i915_gem_cleanup_early+0x13e/0x150 [i915]
> ...
> <4> [239.330942] Call Trace:
> <4> [239.330944] <TASK>
> <4> [239.330949] i915_driver_late_release+0x2b/0xa0 [i915]
> <4> [239.331202] i915_driver_release+0x86/0xa0 [i915]
> <4> [239.331482] devm_drm_dev_init_release+0x61/0x90
> <4> [239.331494] devm_action_release+0x15/0x30
> <4> [239.331504] release_nodes+0x3d/0x120
> <4> [239.331517] devres_release_all+0x96/0xd0
> <4> [239.331533] device_unbind_cleanup+0x12/0x80
> <4> [239.331543] device_release_driver_internal+0x23a/0x280
> <4> [239.331550] ? bus_find_device+0xa5/0xe0
> <4> [239.331563] device_driver_detach+0x14/0x20
> ...
> <4> [357.719679] ---[ end trace 0000000000000000 ]---
>
> If the test also unloads the i915 module then that's followed with:
>
> <3> [357.787478] =============================================================================
> <3> [357.788006] BUG i915_vma (Tainted: G U W N ): Objects remaining on __kmem_cache_shutdown()
> <3> [357.788031] -----------------------------------------------------------------------------
> <3> [357.788204] Object 0xffff888109e7f480 @offset=29824
> <3> [357.788670] Allocated in i915_vma_instance+0xee/0xc10 [i915] age=292729 cpu=4 pid=2244
> <4> [357.788994] i915_vma_instance+0xee/0xc10 [i915]
> <4> [357.789290] init_status_page+0x7b/0x420 [i915]
> <4> [357.789532] intel_engines_init+0x1d8/0x980 [i915]
> <4> [357.789772] intel_gt_init+0x175/0x450 [i915]
> <4> [357.790014] i915_gem_init+0x113/0x340 [i915]
> <4> [357.790281] i915_driver_probe+0x847/0xed0 [i915]
> <4> [357.790504] i915_pci_probe+0xe6/0x220 [i915]
> ...
>
> Closer analysis of CI results history has revealed a dependency of the
> error on a few IGT tests, namely:
> - igt@api_intel_allocator@fork-simple-stress-signal,
> - igt@api_intel_allocator@two-level-inception-interruptible,
> - igt@gem_linear_blits@interruptible,
> - igt@prime_mmap_coherency@ioctl-errors,
> which invisibly trigger the issue, then exhibited with first driver unbind
> attempt.
>
> All of the above tests perform actions which are actively interrupted with
> signals. Further debugging has allowed to narrow that scope down to
> DRM_IOCTL_I915_GEM_EXECBUFFER2, and ring_context_alloc(), specific to ring
> submission, in particular.
>
> If successful then that function, or its execlists or GuC submission
> equivalent, is supposed to be called only once per GEM context engine,
> followed by raise of a flag that prevents the function from being called
> again. The function is expected to unwind its internal errors itself, so
> it may be safely called once more after it returns an error.
>
> In case of ring submission, the function first gets a reference to the
> engine's legacy timeline and then allocates a VMA. If the VMA allocation
> fails, e.g. when i915_vma_instance() called from inside is interrupted
> with a signal, then ring_context_alloc() fails, leaving the timeline held
> referenced. On next I915_GEM_EXECBUFFER2 IOCTL, another reference to the
> timeline is got, and only that last one is put on successful completion.
> As a consequence, the legacy timeline, with its underlying engine status
> page's VMA object, is still held and not released on driver unbind.
>
> Get the legacy timeline only after successful allocation of the context
> engine's VMA.
>
> Fixes: 75d0a7f31eec ("drm/i915: Lift timeline into intel_context")
> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12061
> Cc: Chris Wilson <chris.p.wilson@linux.intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
Everything looks good to me overall.
Reviewed-by: Sebastian Brzezinka <sebastian.brzezinka@intel.com>
--
Best regards,
Sebastian
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc error
2025-06-06 13:58 Janusz Krzysztofik
2025-06-10 13:59 ` Sebastian Brzezinka
@ 2025-06-10 14:06 ` Krzysztof Niemiec
2025-06-11 8:03 ` Krzysztof Karas
2 siblings, 0 replies; 14+ messages in thread
From: Krzysztof Niemiec @ 2025-06-10 14:06 UTC (permalink / raw)
To: Janusz Krzysztofik
Cc: intel-gfx, dri-devel, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
Tvrtko Ursulin, Andi Shyti, Nitin Gote, Chris Wilson,
Matthew Auld
Hi Janusz,
On 2025-06-06 at 15:58:08 GMT, Janusz Krzysztofik wrote:
> The following error has been reported sporadically by CI when a test
> unbinds the i915 driver on a ring submission platform:
>
> <4> [239.330153] ------------[ cut here ]------------
> <4> [239.330166] i915 0000:00:02.0: [drm] drm_WARN_ON(dev_priv->mm.shrink_count)
> <4> [239.330196] WARNING: CPU: 1 PID: 18570 at drivers/gpu/drm/i915/i915_gem.c:1309 i915_gem_cleanup_early+0x13e/0x150 [i915]
> ...
> <4> [239.330640] RIP: 0010:i915_gem_cleanup_early+0x13e/0x150 [i915]
> ...
> <4> [239.330942] Call Trace:
> <4> [239.330944] <TASK>
> <4> [239.330949] i915_driver_late_release+0x2b/0xa0 [i915]
> <4> [239.331202] i915_driver_release+0x86/0xa0 [i915]
> <4> [239.331482] devm_drm_dev_init_release+0x61/0x90
> <4> [239.331494] devm_action_release+0x15/0x30
> <4> [239.331504] release_nodes+0x3d/0x120
> <4> [239.331517] devres_release_all+0x96/0xd0
> <4> [239.331533] device_unbind_cleanup+0x12/0x80
> <4> [239.331543] device_release_driver_internal+0x23a/0x280
> <4> [239.331550] ? bus_find_device+0xa5/0xe0
> <4> [239.331563] device_driver_detach+0x14/0x20
> ...
> <4> [357.719679] ---[ end trace 0000000000000000 ]---
>
> If the test also unloads the i915 module then that's followed with:
>
> <3> [357.787478] =============================================================================
> <3> [357.788006] BUG i915_vma (Tainted: G U W N ): Objects remaining on __kmem_cache_shutdown()
> <3> [357.788031] -----------------------------------------------------------------------------
> <3> [357.788204] Object 0xffff888109e7f480 @offset=29824
> <3> [357.788670] Allocated in i915_vma_instance+0xee/0xc10 [i915] age=292729 cpu=4 pid=2244
> <4> [357.788994] i915_vma_instance+0xee/0xc10 [i915]
> <4> [357.789290] init_status_page+0x7b/0x420 [i915]
> <4> [357.789532] intel_engines_init+0x1d8/0x980 [i915]
> <4> [357.789772] intel_gt_init+0x175/0x450 [i915]
> <4> [357.790014] i915_gem_init+0x113/0x340 [i915]
> <4> [357.790281] i915_driver_probe+0x847/0xed0 [i915]
> <4> [357.790504] i915_pci_probe+0xe6/0x220 [i915]
> ...
>
> Closer analysis of CI results history has revealed a dependency of the
> error on a few IGT tests, namely:
> - igt@api_intel_allocator@fork-simple-stress-signal,
> - igt@api_intel_allocator@two-level-inception-interruptible,
> - igt@gem_linear_blits@interruptible,
> - igt@prime_mmap_coherency@ioctl-errors,
> which invisibly trigger the issue, then exhibited with first driver unbind
> attempt.
>
> All of the above tests perform actions which are actively interrupted with
> signals. Further debugging has allowed to narrow that scope down to
> DRM_IOCTL_I915_GEM_EXECBUFFER2, and ring_context_alloc(), specific to ring
> submission, in particular.
>
> If successful then that function, or its execlists or GuC submission
> equivalent, is supposed to be called only once per GEM context engine,
> followed by raise of a flag that prevents the function from being called
> again. The function is expected to unwind its internal errors itself, so
> it may be safely called once more after it returns an error.
>
> In case of ring submission, the function first gets a reference to the
> engine's legacy timeline and then allocates a VMA. If the VMA allocation
> fails, e.g. when i915_vma_instance() called from inside is interrupted
> with a signal, then ring_context_alloc() fails, leaving the timeline held
> referenced. On next I915_GEM_EXECBUFFER2 IOCTL, another reference to the
> timeline is got, and only that last one is put on successful completion.
> As a consequence, the legacy timeline, with its underlying engine status
> page's VMA object, is still held and not released on driver unbind.
>
> Get the legacy timeline only after successful allocation of the context
> engine's VMA.
>
> Fixes: 75d0a7f31eec ("drm/i915: Lift timeline into intel_context")
> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12061
> Cc: Chris Wilson <chris.p.wilson@linux.intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
LGTM.
Reviewed-by: Krzysztof Niemiec <krzysztof.niemiec@intel.com>
> ---
> drivers/gpu/drm/i915/gt/intel_ring_submission.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> index a876a34455f11..2a6d79abf25b5 100644
> --- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> @@ -610,7 +610,6 @@ static int ring_context_alloc(struct intel_context *ce)
> /* One ringbuffer to rule them all */
> GEM_BUG_ON(!engine->legacy.ring);
> ce->ring = engine->legacy.ring;
> - ce->timeline = intel_timeline_get(engine->legacy.timeline);
>
> GEM_BUG_ON(ce->state);
> if (engine->context_size) {
> @@ -623,6 +622,8 @@ static int ring_context_alloc(struct intel_context *ce)
> ce->state = vma;
> }
>
> + ce->timeline = intel_timeline_get(engine->legacy.timeline);
> +
> return 0;
> }
>
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc error
2025-06-06 13:58 Janusz Krzysztofik
2025-06-10 13:59 ` Sebastian Brzezinka
2025-06-10 14:06 ` Krzysztof Niemiec
@ 2025-06-11 8:03 ` Krzysztof Karas
2 siblings, 0 replies; 14+ messages in thread
From: Krzysztof Karas @ 2025-06-11 8:03 UTC (permalink / raw)
To: Janusz Krzysztofik
Cc: intel-gfx, dri-devel, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
Tvrtko Ursulin, Andi Shyti, Nitin Gote, Chris Wilson,
Matthew Auld
Hi Janusz,
[...]
> If successful then that function, or its execlists or GuC submission
> equivalent, is supposed to be called only once per GEM context engine,
Could you clarify "execlists or GuC submission equivalent" here
- do these functions perform similar reference acquisition,
which may be lost along with some other data? Should they also
be modified to avoid similar problems in the future?
Best Regards,
Krzysztof
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc error
@ 2025-06-11 10:42 Janusz Krzysztofik
2025-06-11 15:45 ` Gote, Nitin R
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Janusz Krzysztofik @ 2025-06-11 10:42 UTC (permalink / raw)
To: intel-gfx
Cc: dri-devel, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
Tvrtko Ursulin, Chris Wilson, Matthew Auld, Andi Shyti,
Nitin Gote, Sebastian Brzezinka, Krzysztof Niemiec,
Krzysztof Karas, Janusz Krzysztofik
The following error has been reported sporadically by CI when a test
unbinds the i915 driver on a ring submission platform:
<4> [239.330153] ------------[ cut here ]------------
<4> [239.330166] i915 0000:00:02.0: [drm] drm_WARN_ON(dev_priv->mm.shrink_count)
<4> [239.330196] WARNING: CPU: 1 PID: 18570 at drivers/gpu/drm/i915/i915_gem.c:1309 i915_gem_cleanup_early+0x13e/0x150 [i915]
...
<4> [239.330640] RIP: 0010:i915_gem_cleanup_early+0x13e/0x150 [i915]
...
<4> [239.330942] Call Trace:
<4> [239.330944] <TASK>
<4> [239.330949] i915_driver_late_release+0x2b/0xa0 [i915]
<4> [239.331202] i915_driver_release+0x86/0xa0 [i915]
<4> [239.331482] devm_drm_dev_init_release+0x61/0x90
<4> [239.331494] devm_action_release+0x15/0x30
<4> [239.331504] release_nodes+0x3d/0x120
<4> [239.331517] devres_release_all+0x96/0xd0
<4> [239.331533] device_unbind_cleanup+0x12/0x80
<4> [239.331543] device_release_driver_internal+0x23a/0x280
<4> [239.331550] ? bus_find_device+0xa5/0xe0
<4> [239.331563] device_driver_detach+0x14/0x20
...
<4> [357.719679] ---[ end trace 0000000000000000 ]---
If the test also unloads the i915 module then that's followed with:
<3> [357.787478] =============================================================================
<3> [357.788006] BUG i915_vma (Tainted: G U W N ): Objects remaining on __kmem_cache_shutdown()
<3> [357.788031] -----------------------------------------------------------------------------
<3> [357.788204] Object 0xffff888109e7f480 @offset=29824
<3> [357.788670] Allocated in i915_vma_instance+0xee/0xc10 [i915] age=292729 cpu=4 pid=2244
<4> [357.788994] i915_vma_instance+0xee/0xc10 [i915]
<4> [357.789290] init_status_page+0x7b/0x420 [i915]
<4> [357.789532] intel_engines_init+0x1d8/0x980 [i915]
<4> [357.789772] intel_gt_init+0x175/0x450 [i915]
<4> [357.790014] i915_gem_init+0x113/0x340 [i915]
<4> [357.790281] i915_driver_probe+0x847/0xed0 [i915]
<4> [357.790504] i915_pci_probe+0xe6/0x220 [i915]
...
Closer analysis of CI results history has revealed a dependency of the
error on a few IGT tests, namely:
- igt@api_intel_allocator@fork-simple-stress-signal,
- igt@api_intel_allocator@two-level-inception-interruptible,
- igt@gem_linear_blits@interruptible,
- igt@prime_mmap_coherency@ioctl-errors,
which invisibly trigger the issue, then exhibited with first driver unbind
attempt.
All of the above tests perform actions which are actively interrupted with
signals. Further debugging has allowed to narrow that scope down to
DRM_IOCTL_I915_GEM_EXECBUFFER2, and ring_context_alloc(), specific to ring
submission, in particular.
If successful then that function, or its execlists or GuC submission
equivalent, is supposed to be called only once per GEM context engine,
followed by raise of a flag that prevents the function from being called
again. The function is expected to unwind its internal errors itself, so
it may be safely called once more after it returns an error.
In case of ring submission, the function first gets a reference to the
engine's legacy timeline and then allocates a VMA. If the VMA allocation
fails, e.g. when i915_vma_instance() called from inside is interrupted
with a signal, then ring_context_alloc() fails, leaving the timeline held
referenced. On next I915_GEM_EXECBUFFER2 IOCTL, another reference to the
timeline is got, and only that last one is put on successful completion.
As a consequence, the legacy timeline, with its underlying engine status
page's VMA object, is still held and not released on driver unbind.
Get the legacy timeline only after successful allocation of the context
engine's VMA.
v2: Add a note on other submission methods (Krzysztof Karas):
Both execlists and GuC submission use lrc_alloc() which seems free
from a similar issue.
Fixes: 75d0a7f31eec ("drm/i915: Lift timeline into intel_context")
Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12061
Cc: Chris Wilson <chris.p.wilson@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Krzysztof Karas <krzysztof.karas@intel.com>
Reviewed-by: Sebastian Brzezinka <sebastian.brzezinka@intel.com>
Reviewed-by: Krzysztof Niemiec <krzysztof.niemiec@intel.com>
Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
---
drivers/gpu/drm/i915/gt/intel_ring_submission.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
index a876a34455f11..2a6d79abf25b5 100644
--- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
@@ -610,7 +610,6 @@ static int ring_context_alloc(struct intel_context *ce)
/* One ringbuffer to rule them all */
GEM_BUG_ON(!engine->legacy.ring);
ce->ring = engine->legacy.ring;
- ce->timeline = intel_timeline_get(engine->legacy.timeline);
GEM_BUG_ON(ce->state);
if (engine->context_size) {
@@ -623,6 +622,8 @@ static int ring_context_alloc(struct intel_context *ce)
ce->state = vma;
}
+ ce->timeline = intel_timeline_get(engine->legacy.timeline);
+
return 0;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* RE: [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc error
2025-06-11 10:42 [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc error Janusz Krzysztofik
@ 2025-06-11 15:45 ` Gote, Nitin R
2025-06-11 20:54 ` Andi Shyti
2025-06-11 20:53 ` Andi Shyti
2025-06-30 13:46 ` Andi Shyti
2 siblings, 1 reply; 14+ messages in thread
From: Gote, Nitin R @ 2025-06-11 15:45 UTC (permalink / raw)
To: Janusz Krzysztofik
Cc: dri-devel@lists.freedesktop.org, Jani Nikula, Joonas Lahtinen,
Vivi, Rodrigo, Tvrtko Ursulin, Chris Wilson, Auld, Matthew,
Andi Shyti, Brzezinka, Sebastian, Niemiec, Krzysztof,
Karas, Krzysztof, intel-gfx
Hi Janusz,
[...]
> Subject: [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc
> error
>
Generally, it's preferred to use "drm/i915/gt:" file path over "drm/i915/ring_submission:" file name in the commit title.
Otherwise, the patch looks good to me.
Reviewed-by: Nitin Gote <nitin.r.gote@intel.com>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc error
2025-06-11 10:42 [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc error Janusz Krzysztofik
2025-06-11 15:45 ` Gote, Nitin R
@ 2025-06-11 20:53 ` Andi Shyti
2025-06-30 13:46 ` Andi Shyti
2 siblings, 0 replies; 14+ messages in thread
From: Andi Shyti @ 2025-06-11 20:53 UTC (permalink / raw)
To: Janusz Krzysztofik
Cc: intel-gfx, dri-devel, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
Tvrtko Ursulin, Chris Wilson, Matthew Auld, Andi Shyti,
Nitin Gote, Sebastian Brzezinka, Krzysztof Niemiec,
Krzysztof Karas
Hi Janusz,
On Wed, Jun 11, 2025 at 12:42:13PM +0200, Janusz Krzysztofik wrote:
> The following error has been reported sporadically by CI when a test
> unbinds the i915 driver on a ring submission platform:
>
> <4> [239.330153] ------------[ cut here ]------------
> <4> [239.330166] i915 0000:00:02.0: [drm] drm_WARN_ON(dev_priv->mm.shrink_count)
> <4> [239.330196] WARNING: CPU: 1 PID: 18570 at drivers/gpu/drm/i915/i915_gem.c:1309 i915_gem_cleanup_early+0x13e/0x150 [i915]
> ...
> <4> [239.330640] RIP: 0010:i915_gem_cleanup_early+0x13e/0x150 [i915]
> ...
> <4> [239.330942] Call Trace:
> <4> [239.330944] <TASK>
> <4> [239.330949] i915_driver_late_release+0x2b/0xa0 [i915]
> <4> [239.331202] i915_driver_release+0x86/0xa0 [i915]
> <4> [239.331482] devm_drm_dev_init_release+0x61/0x90
> <4> [239.331494] devm_action_release+0x15/0x30
> <4> [239.331504] release_nodes+0x3d/0x120
> <4> [239.331517] devres_release_all+0x96/0xd0
> <4> [239.331533] device_unbind_cleanup+0x12/0x80
> <4> [239.331543] device_release_driver_internal+0x23a/0x280
> <4> [239.331550] ? bus_find_device+0xa5/0xe0
> <4> [239.331563] device_driver_detach+0x14/0x20
> ...
> <4> [357.719679] ---[ end trace 0000000000000000 ]---
>
> If the test also unloads the i915 module then that's followed with:
>
> <3> [357.787478] =============================================================================
> <3> [357.788006] BUG i915_vma (Tainted: G U W N ): Objects remaining on __kmem_cache_shutdown()
> <3> [357.788031] -----------------------------------------------------------------------------
> <3> [357.788204] Object 0xffff888109e7f480 @offset=29824
> <3> [357.788670] Allocated in i915_vma_instance+0xee/0xc10 [i915] age=292729 cpu=4 pid=2244
> <4> [357.788994] i915_vma_instance+0xee/0xc10 [i915]
> <4> [357.789290] init_status_page+0x7b/0x420 [i915]
> <4> [357.789532] intel_engines_init+0x1d8/0x980 [i915]
> <4> [357.789772] intel_gt_init+0x175/0x450 [i915]
> <4> [357.790014] i915_gem_init+0x113/0x340 [i915]
> <4> [357.790281] i915_driver_probe+0x847/0xed0 [i915]
> <4> [357.790504] i915_pci_probe+0xe6/0x220 [i915]
> ...
>
> Closer analysis of CI results history has revealed a dependency of the
> error on a few IGT tests, namely:
> - igt@api_intel_allocator@fork-simple-stress-signal,
> - igt@api_intel_allocator@two-level-inception-interruptible,
> - igt@gem_linear_blits@interruptible,
> - igt@prime_mmap_coherency@ioctl-errors,
> which invisibly trigger the issue, then exhibited with first driver unbind
> attempt.
>
> All of the above tests perform actions which are actively interrupted with
> signals. Further debugging has allowed to narrow that scope down to
> DRM_IOCTL_I915_GEM_EXECBUFFER2, and ring_context_alloc(), specific to ring
> submission, in particular.
>
> If successful then that function, or its execlists or GuC submission
> equivalent, is supposed to be called only once per GEM context engine,
> followed by raise of a flag that prevents the function from being called
> again. The function is expected to unwind its internal errors itself, so
> it may be safely called once more after it returns an error.
>
> In case of ring submission, the function first gets a reference to the
> engine's legacy timeline and then allocates a VMA. If the VMA allocation
> fails, e.g. when i915_vma_instance() called from inside is interrupted
> with a signal, then ring_context_alloc() fails, leaving the timeline held
> referenced. On next I915_GEM_EXECBUFFER2 IOCTL, another reference to the
> timeline is got, and only that last one is put on successful completion.
> As a consequence, the legacy timeline, with its underlying engine status
> page's VMA object, is still held and not released on driver unbind.
>
> Get the legacy timeline only after successful allocation of the context
> engine's VMA.
>
> v2: Add a note on other submission methods (Krzysztof Karas):
> Both execlists and GuC submission use lrc_alloc() which seems free
> from a similar issue.
>
> Fixes: 75d0a7f31eec ("drm/i915: Lift timeline into intel_context")
> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12061
> Cc: Chris Wilson <chris.p.wilson@linux.intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Krzysztof Karas <krzysztof.karas@intel.com>
> Reviewed-by: Sebastian Brzezinka <sebastian.brzezinka@intel.com>
> Reviewed-by: Krzysztof Niemiec <krzysztof.niemiec@intel.com>
> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
Yeah, it basically needed a cleanup before returning error.
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
You are still missing
Cc: <stable@vger.kernel.org> # v5.4+
Thanks,
Andi
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc error
2025-06-11 15:45 ` Gote, Nitin R
@ 2025-06-11 20:54 ` Andi Shyti
2025-06-12 9:08 ` Janusz Krzysztofik
0 siblings, 1 reply; 14+ messages in thread
From: Andi Shyti @ 2025-06-11 20:54 UTC (permalink / raw)
To: Gote, Nitin R
Cc: Janusz Krzysztofik, dri-devel@lists.freedesktop.org, Jani Nikula,
Joonas Lahtinen, Vivi, Rodrigo, Tvrtko Ursulin, Chris Wilson,
Auld, Matthew, Andi Shyti, Brzezinka, Sebastian,
Niemiec, Krzysztof, Karas, Krzysztof, intel-gfx
Hi Nitin,
On Wed, Jun 11, 2025 at 03:45:30PM +0000, Gote, Nitin R wrote:
> [...]
> > Subject: [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc
> > error
> >
>
> Generally, it's preferred to use "drm/i915/gt:" file path over "drm/i915/ring_submission:" file name in the commit title.
good observation, I missed it. I agree with Nitin on this, it can
be fixed before merging.
Andi
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc error
2025-06-11 20:54 ` Andi Shyti
@ 2025-06-12 9:08 ` Janusz Krzysztofik
2025-06-12 9:35 ` Jani Nikula
0 siblings, 1 reply; 14+ messages in thread
From: Janusz Krzysztofik @ 2025-06-12 9:08 UTC (permalink / raw)
To: Gote, Nitin R, Andi Shyti
Cc: dri-devel@lists.freedesktop.org, Jani Nikula, Joonas Lahtinen,
Vivi, Rodrigo, Tvrtko Ursulin, Chris Wilson, Auld, Matthew,
Andi Shyti, Brzezinka, Sebastian, Niemiec, Krzysztof,
Karas, Krzysztof, intel-gfx
On Wednesday, 11 June 2025 22:54:40 CEST Andi Shyti wrote:
> Hi Nitin,
>
> On Wed, Jun 11, 2025 at 03:45:30PM +0000, Gote, Nitin R wrote:
> > [...]
> > > Subject: [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc
> > > error
> > >
> >
> > Generally, it's preferred to use "drm/i915/gt:" file path over "drm/i915/ring_submission:" file name in the commit title.
>
> good observation, I missed it. I agree with Nitin on this, it can
> be fixed before merging.
I'm not sure. I found no single word on the *subsystem* component of the
canonical patch format subject line (or commit message) expected to reflect
any directory structure in case of DRM. However, if you think it should for
some reason, or you just don't recognize i915 ring submission as a good
candidate for the subsystem component of the commit message, then I'm OK with
drm/i915/gt, but then, the summary phrase of the commit message seems too
general for the whole GT subsystem, not pointing to ring submission as the
only submission method out of the three that's affected, and needs to be
rephrased, I believe, while still kept short enough. Maybe "Fix *legacy*
timeline held on VMA alloc error" (with the 'left' word dropped)?
Thanks,
Janusz
>
> Andi
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc error
2025-06-12 9:08 ` Janusz Krzysztofik
@ 2025-06-12 9:35 ` Jani Nikula
2025-06-12 9:45 ` Janusz Krzysztofik
0 siblings, 1 reply; 14+ messages in thread
From: Jani Nikula @ 2025-06-12 9:35 UTC (permalink / raw)
To: Janusz Krzysztofik, Gote, Nitin R, Andi Shyti
Cc: dri-devel@lists.freedesktop.org, Joonas Lahtinen, Vivi, Rodrigo,
Tvrtko Ursulin, Chris Wilson, Auld, Matthew, Andi Shyti,
Brzezinka, Sebastian, Niemiec, Krzysztof, Karas, Krzysztof,
intel-gfx
On Thu, 12 Jun 2025, Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> wrote:
> On Wednesday, 11 June 2025 22:54:40 CEST Andi Shyti wrote:
>> Hi Nitin,
>>
>> On Wed, Jun 11, 2025 at 03:45:30PM +0000, Gote, Nitin R wrote:
>> > [...]
>> > > Subject: [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc
>> > > error
>> > >
>> >
>> > Generally, it's preferred to use "drm/i915/gt:" file path over "drm/i915/ring_submission:" file name in the commit title.
>>
>> good observation, I missed it. I agree with Nitin on this, it can
>> be fixed before merging.
>
> I'm not sure. I found no single word on the *subsystem* component of the
> canonical patch format subject line (or commit message) expected to reflect
> any directory structure in case of DRM.
It's not about the directory structure, though, but rather about
(admittedly unwritten) conventions. Usually about driver components,
features or platforms.
See:
$ git log --since={5years} --no-merges --pretty=%s -- "<PATH>" | sed 's/:.*//' | sort | uniq -c | sort -rn
Where "<PATH>" is drivers/gpu/drm/i915/gt/intel_ring_submission.c or
drivers/gpu/drm/i915/gt.
"ring" or "submission" is just not there in the prefix, at all.
BR,
Jani.
> However, if you think it should for
> some reason, or you just don't recognize i915 ring submission as a good
> candidate for the subsystem component of the commit message, then I'm OK with
> drm/i915/gt, but then, the summary phrase of the commit message seems too
> general for the whole GT subsystem, not pointing to ring submission as the
> only submission method out of the three that's affected, and needs to be
> rephrased, I believe, while still kept short enough. Maybe "Fix *legacy*
> timeline held on VMA alloc error" (with the 'left' word dropped)?
>
> Thanks,
> Janusz
>
>>
>> Andi
>>
>
>
>
>
--
Jani Nikula, Intel
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc error
2025-06-12 9:35 ` Jani Nikula
@ 2025-06-12 9:45 ` Janusz Krzysztofik
2025-06-12 11:30 ` Andi Shyti
0 siblings, 1 reply; 14+ messages in thread
From: Janusz Krzysztofik @ 2025-06-12 9:45 UTC (permalink / raw)
To: Gote, Nitin R, Andi Shyti, Jani Nikula
Cc: dri-devel@lists.freedesktop.org, Joonas Lahtinen, Vivi, Rodrigo,
Tvrtko Ursulin, Chris Wilson, Auld, Matthew, Andi Shyti,
Brzezinka, Sebastian, Niemiec, Krzysztof, Karas, Krzysztof,
intel-gfx
On Thursday, 12 June 2025 11:35:31 CEST Jani Nikula wrote:
> On Thu, 12 Jun 2025, Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> wrote:
> > On Wednesday, 11 June 2025 22:54:40 CEST Andi Shyti wrote:
> >> Hi Nitin,
> >>
> >> On Wed, Jun 11, 2025 at 03:45:30PM +0000, Gote, Nitin R wrote:
> >> > [...]
> >> > > Subject: [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc
> >> > > error
> >> > >
> >> >
> >> > Generally, it's preferred to use "drm/i915/gt:" file path over "drm/i915/ring_submission:" file name in the commit title.
> >>
> >> good observation, I missed it. I agree with Nitin on this, it can
> >> be fixed before merging.
> >
> > I'm not sure. I found no single word on the *subsystem* component of the
> > canonical patch format subject line (or commit message) expected to reflect
> > any directory structure in case of DRM.
>
> It's not about the directory structure, though, but rather about
> (admittedly unwritten) conventions. Usually about driver components,
> features or platforms.
>
> See:
>
> $ git log --since={5years} --no-merges --pretty=%s -- "<PATH>" | sed 's/:.*//' | sort | uniq -c | sort -rn
>
> Where "<PATH>" is drivers/gpu/drm/i915/gt/intel_ring_submission.c or
> drivers/gpu/drm/i915/gt.
>
> "ring" or "submission" is just not there in the prefix, at all.
I see. Is there a convention for designating old, pre-execlists *platforms*
as affected subsystem / area? Or is describing it in the summary phrase of
the commit message the only way?
Thanks,
Janusz
>
>
> BR,
> Jani.
>
> > However, if you think it should for
> > some reason, or you just don't recognize i915 ring submission as a good
> > candidate for the subsystem component of the commit message, then I'm OK with
> > drm/i915/gt, but then, the summary phrase of the commit message seems too
> > general for the whole GT subsystem, not pointing to ring submission as the
> > only submission method out of the three that's affected, and needs to be
> > rephrased, I believe, while still kept short enough. Maybe "Fix *legacy*
> > timeline held on VMA alloc error" (with the 'left' word dropped)?
> >
> > Thanks,
> > Janusz
> >
> >>
> >> Andi
> >>
> >
> >
> >
> >
>
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc error
2025-06-12 9:45 ` Janusz Krzysztofik
@ 2025-06-12 11:30 ` Andi Shyti
2025-06-12 11:46 ` Janusz Krzysztofik
0 siblings, 1 reply; 14+ messages in thread
From: Andi Shyti @ 2025-06-12 11:30 UTC (permalink / raw)
To: Janusz Krzysztofik
Cc: Gote, Nitin R, Andi Shyti, Jani Nikula,
dri-devel@lists.freedesktop.org, Joonas Lahtinen, Vivi, Rodrigo,
Tvrtko Ursulin, Chris Wilson, Auld, Matthew, Brzezinka, Sebastian,
Niemiec, Krzysztof, Karas, Krzysztof, intel-gfx
Hi Janusz,
On Thu, Jun 12, 2025 at 11:45:46AM +0200, Janusz Krzysztofik wrote:
> On Thursday, 12 June 2025 11:35:31 CEST Jani Nikula wrote:
> > On Thu, 12 Jun 2025, Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> wrote:
> > > On Wednesday, 11 June 2025 22:54:40 CEST Andi Shyti wrote:
> > >> Hi Nitin,
> > >>
> > >> On Wed, Jun 11, 2025 at 03:45:30PM +0000, Gote, Nitin R wrote:
> > >> > [...]
> > >> > > Subject: [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc
> > >> > > error
> > >> > >
> > >> >
> > >> > Generally, it's preferred to use "drm/i915/gt:" file path over "drm/i915/ring_submission:" file name in the commit title.
> > >>
> > >> good observation, I missed it. I agree with Nitin on this, it can
> > >> be fixed before merging.
> > >
> > > I'm not sure. I found no single word on the *subsystem* component of the
> > > canonical patch format subject line (or commit message) expected to reflect
> > > any directory structure in case of DRM.
> >
> > It's not about the directory structure, though, but rather about
> > (admittedly unwritten) conventions. Usually about driver components,
> > features or platforms.
> >
> > See:
> >
> > $ git log --since={5years} --no-merges --pretty=%s -- "<PATH>" | sed 's/:.*//' | sort | uniq -c | sort -rn
> >
> > Where "<PATH>" is drivers/gpu/drm/i915/gt/intel_ring_submission.c or
> > drivers/gpu/drm/i915/gt.
> >
> > "ring" or "submission" is just not there in the prefix, at all.
>
> I see. Is there a convention for designating old, pre-execlists *platforms*
> as affected subsystem / area? Or is describing it in the summary phrase of
> the commit message the only way?
it's an unwritten rule and my feeling is that these tings take
their own track without anyone deciding it officially.
Indeed every community has its own way of doing it. As you know
already, in i915 we have always used:
drm/i915:
drm/i915/gt:
drm/i915/gem:
drm/i915/gt/guc: (or drm/i915/guc:)
drm/i915/display:
...
pointing to the directory rather than the topic or the file.
In my opinion using "ring_submission:" is not wrong and it makes
sense, but it's out of the ordinary and this would be the only
patch doing it.
That's why this title is a little odd, unless we all agree to
change and set a convention.
Thanks,
Andi
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc error
2025-06-12 11:30 ` Andi Shyti
@ 2025-06-12 11:46 ` Janusz Krzysztofik
0 siblings, 0 replies; 14+ messages in thread
From: Janusz Krzysztofik @ 2025-06-12 11:46 UTC (permalink / raw)
To: Andi Shyti
Cc: Gote, Nitin R, Andi Shyti, Jani Nikula,
dri-devel@lists.freedesktop.org, Joonas Lahtinen, Vivi, Rodrigo,
Tvrtko Ursulin, Chris Wilson, Auld, Matthew, Brzezinka, Sebastian,
Niemiec, Krzysztof, Karas, Krzysztof, intel-gfx
On Thursday, 12 June 2025 13:30:42 CEST Andi Shyti wrote:
> Hi Janusz,
>
> On Thu, Jun 12, 2025 at 11:45:46AM +0200, Janusz Krzysztofik wrote:
> > On Thursday, 12 June 2025 11:35:31 CEST Jani Nikula wrote:
> > > On Thu, 12 Jun 2025, Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> wrote:
> > > > On Wednesday, 11 June 2025 22:54:40 CEST Andi Shyti wrote:
> > > >> Hi Nitin,
> > > >>
> > > >> On Wed, Jun 11, 2025 at 03:45:30PM +0000, Gote, Nitin R wrote:
> > > >> > [...]
> > > >> > > Subject: [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc
> > > >> > > error
> > > >> > >
> > > >> >
> > > >> > Generally, it's preferred to use "drm/i915/gt:" file path over "drm/i915/ring_submission:" file name in the commit title.
> > > >>
> > > >> good observation, I missed it. I agree with Nitin on this, it can
> > > >> be fixed before merging.
> > > >
> > > > I'm not sure. I found no single word on the *subsystem* component of the
> > > > canonical patch format subject line (or commit message) expected to reflect
> > > > any directory structure in case of DRM.
> > >
> > > It's not about the directory structure, though, but rather about
> > > (admittedly unwritten) conventions. Usually about driver components,
> > > features or platforms.
> > >
> > > See:
> > >
> > > $ git log --since={5years} --no-merges --pretty=%s -- "<PATH>" | sed 's/:.*//' | sort | uniq -c | sort -rn
> > >
> > > Where "<PATH>" is drivers/gpu/drm/i915/gt/intel_ring_submission.c or
> > > drivers/gpu/drm/i915/gt.
> > >
> > > "ring" or "submission" is just not there in the prefix, at all.
> >
> > I see. Is there a convention for designating old, pre-execlists *platforms*
> > as affected subsystem / area? Or is describing it in the summary phrase of
> > the commit message the only way?
>
> it's an unwritten rule and my feeling is that these tings take
> their own track without anyone deciding it officially.
>
> Indeed every community has its own way of doing it. As you know
> already, in i915 we have always used:
>
> drm/i915:
> drm/i915/gt:
> drm/i915/gem:
> drm/i915/gt/guc: (or drm/i915/guc:)
> drm/i915/display:
> ...
I find this convention as more oriented on designating an area of
responsibility rather than a component / feature / platform that is affected /
fixed. My feeling is that it should rather be the latter.
Anyway, in this particular case I propose to follow the current convention and
add the word 'legacy' to the summary phrase, as I suggested before, unless you
are able to propose something better. We may discuss the convention doubts
independently.
Thanks,
Janusz
>
> pointing to the directory rather than the topic or the file.
>
> In my opinion using "ring_submission:" is not wrong and it makes
> sense, but it's out of the ordinary and this would be the only
> patch doing it.
>
> That's why this title is a little odd, unless we all agree to
> change and set a convention.
>
> Thanks,
> Andi
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc error
2025-06-11 10:42 [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc error Janusz Krzysztofik
2025-06-11 15:45 ` Gote, Nitin R
2025-06-11 20:53 ` Andi Shyti
@ 2025-06-30 13:46 ` Andi Shyti
2 siblings, 0 replies; 14+ messages in thread
From: Andi Shyti @ 2025-06-30 13:46 UTC (permalink / raw)
To: Janusz Krzysztofik
Cc: intel-gfx, dri-devel, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
Tvrtko Ursulin, Chris Wilson, Matthew Auld, Andi Shyti,
Nitin Gote, Sebastian Brzezinka, Krzysztof Niemiec,
Krzysztof Karas
Hi Janusz,
merged to drm-intel-gt-next.
Thanks,
Andi
On Wed, Jun 11, 2025 at 12:42:13PM +0200, Janusz Krzysztofik wrote:
> The following error has been reported sporadically by CI when a test
> unbinds the i915 driver on a ring submission platform:
>
> <4> [239.330153] ------------[ cut here ]------------
> <4> [239.330166] i915 0000:00:02.0: [drm] drm_WARN_ON(dev_priv->mm.shrink_count)
> <4> [239.330196] WARNING: CPU: 1 PID: 18570 at drivers/gpu/drm/i915/i915_gem.c:1309 i915_gem_cleanup_early+0x13e/0x150 [i915]
> ...
> <4> [239.330640] RIP: 0010:i915_gem_cleanup_early+0x13e/0x150 [i915]
> ...
> <4> [239.330942] Call Trace:
> <4> [239.330944] <TASK>
> <4> [239.330949] i915_driver_late_release+0x2b/0xa0 [i915]
> <4> [239.331202] i915_driver_release+0x86/0xa0 [i915]
> <4> [239.331482] devm_drm_dev_init_release+0x61/0x90
> <4> [239.331494] devm_action_release+0x15/0x30
> <4> [239.331504] release_nodes+0x3d/0x120
> <4> [239.331517] devres_release_all+0x96/0xd0
> <4> [239.331533] device_unbind_cleanup+0x12/0x80
> <4> [239.331543] device_release_driver_internal+0x23a/0x280
> <4> [239.331550] ? bus_find_device+0xa5/0xe0
> <4> [239.331563] device_driver_detach+0x14/0x20
> ...
> <4> [357.719679] ---[ end trace 0000000000000000 ]---
>
> If the test also unloads the i915 module then that's followed with:
>
> <3> [357.787478] =============================================================================
> <3> [357.788006] BUG i915_vma (Tainted: G U W N ): Objects remaining on __kmem_cache_shutdown()
> <3> [357.788031] -----------------------------------------------------------------------------
> <3> [357.788204] Object 0xffff888109e7f480 @offset=29824
> <3> [357.788670] Allocated in i915_vma_instance+0xee/0xc10 [i915] age=292729 cpu=4 pid=2244
> <4> [357.788994] i915_vma_instance+0xee/0xc10 [i915]
> <4> [357.789290] init_status_page+0x7b/0x420 [i915]
> <4> [357.789532] intel_engines_init+0x1d8/0x980 [i915]
> <4> [357.789772] intel_gt_init+0x175/0x450 [i915]
> <4> [357.790014] i915_gem_init+0x113/0x340 [i915]
> <4> [357.790281] i915_driver_probe+0x847/0xed0 [i915]
> <4> [357.790504] i915_pci_probe+0xe6/0x220 [i915]
> ...
>
> Closer analysis of CI results history has revealed a dependency of the
> error on a few IGT tests, namely:
> - igt@api_intel_allocator@fork-simple-stress-signal,
> - igt@api_intel_allocator@two-level-inception-interruptible,
> - igt@gem_linear_blits@interruptible,
> - igt@prime_mmap_coherency@ioctl-errors,
> which invisibly trigger the issue, then exhibited with first driver unbind
> attempt.
>
> All of the above tests perform actions which are actively interrupted with
> signals. Further debugging has allowed to narrow that scope down to
> DRM_IOCTL_I915_GEM_EXECBUFFER2, and ring_context_alloc(), specific to ring
> submission, in particular.
>
> If successful then that function, or its execlists or GuC submission
> equivalent, is supposed to be called only once per GEM context engine,
> followed by raise of a flag that prevents the function from being called
> again. The function is expected to unwind its internal errors itself, so
> it may be safely called once more after it returns an error.
>
> In case of ring submission, the function first gets a reference to the
> engine's legacy timeline and then allocates a VMA. If the VMA allocation
> fails, e.g. when i915_vma_instance() called from inside is interrupted
> with a signal, then ring_context_alloc() fails, leaving the timeline held
> referenced. On next I915_GEM_EXECBUFFER2 IOCTL, another reference to the
> timeline is got, and only that last one is put on successful completion.
> As a consequence, the legacy timeline, with its underlying engine status
> page's VMA object, is still held and not released on driver unbind.
>
> Get the legacy timeline only after successful allocation of the context
> engine's VMA.
>
> v2: Add a note on other submission methods (Krzysztof Karas):
> Both execlists and GuC submission use lrc_alloc() which seems free
> from a similar issue.
>
> Fixes: 75d0a7f31eec ("drm/i915: Lift timeline into intel_context")
> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12061
> Cc: Chris Wilson <chris.p.wilson@linux.intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Krzysztof Karas <krzysztof.karas@intel.com>
> Reviewed-by: Sebastian Brzezinka <sebastian.brzezinka@intel.com>
> Reviewed-by: Krzysztof Niemiec <krzysztof.niemiec@intel.com>
> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
> ---
> drivers/gpu/drm/i915/gt/intel_ring_submission.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> index a876a34455f11..2a6d79abf25b5 100644
> --- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> @@ -610,7 +610,6 @@ static int ring_context_alloc(struct intel_context *ce)
> /* One ringbuffer to rule them all */
> GEM_BUG_ON(!engine->legacy.ring);
> ce->ring = engine->legacy.ring;
> - ce->timeline = intel_timeline_get(engine->legacy.timeline);
>
> GEM_BUG_ON(ce->state);
> if (engine->context_size) {
> @@ -623,6 +622,8 @@ static int ring_context_alloc(struct intel_context *ce)
> ce->state = vma;
> }
>
> + ce->timeline = intel_timeline_get(engine->legacy.timeline);
> +
> return 0;
> }
>
> --
> 2.49.0
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2025-06-30 13:46 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-11 10:42 [PATCH] drm/i915/ring_submission: Fix timeline left held on VMA alloc error Janusz Krzysztofik
2025-06-11 15:45 ` Gote, Nitin R
2025-06-11 20:54 ` Andi Shyti
2025-06-12 9:08 ` Janusz Krzysztofik
2025-06-12 9:35 ` Jani Nikula
2025-06-12 9:45 ` Janusz Krzysztofik
2025-06-12 11:30 ` Andi Shyti
2025-06-12 11:46 ` Janusz Krzysztofik
2025-06-11 20:53 ` Andi Shyti
2025-06-30 13:46 ` Andi Shyti
-- strict thread matches above, loose matches on Subject: below --
2025-06-06 13:58 Janusz Krzysztofik
2025-06-10 13:59 ` Sebastian Brzezinka
2025-06-10 14:06 ` Krzysztof Niemiec
2025-06-11 8:03 ` Krzysztof Karas
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).