* [PATCH v4 0/3] drm/i915: Fix VMA UAF on destroy against deactivate race
@ 2024-01-22 14:04 ` Janusz Krzysztofik
0 siblings, 0 replies; 19+ messages in thread
From: Janusz Krzysztofik @ 2024-01-22 14:04 UTC (permalink / raw)
To: intel-gfx
Cc: Thomas Hellström, Chris Wilson, Andrzej Hajda, dri-devel,
Daniel Vetter, Rodrigo Vivi, David Airlie, Nirmoy Das
Object debugging tools were sporadically reporting illegal attempts to
free a still active i915 VMA object when parking a GPU tile believed to be
idle.
[161.359441] ODEBUG: free active (active state 0) object: ffff88811643b958 object type: i915_active hint: __i915_vma_active+0x0/0x50 [i915]
[161.360082] WARNING: CPU: 5 PID: 276 at lib/debugobjects.c:514 debug_print_object+0x80/0xb0
...
[161.360304] CPU: 5 PID: 276 Comm: kworker/5:2 Not tainted 6.5.0-rc1-CI_DRM_13375-g003f860e5577+ #1
[161.360314] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022
[161.360322] Workqueue: i915-unordered __intel_wakeref_put_work [i915]
[161.360592] RIP: 0010:debug_print_object+0x80/0xb0
...
[161.361347] debug_object_free+0xeb/0x110
[161.361362] i915_active_fini+0x14/0x130 [i915]
[161.361866] release_references+0xfe/0x1f0 [i915]
[161.362543] i915_vma_parked+0x1db/0x380 [i915]
[161.363129] __gt_park+0x121/0x230 [i915]
[161.363515] ____intel_wakeref_put_last+0x1f/0x70 [i915]
That has been tracked down to be happening when another thread is
deactivating the VMA inside __active_retire() helper, after the VMA's
active counter has been already decremented to 0, but before deactivation
of the VMA's object is reported to the object debugging tool.
There was an attempt to fix this issue on 2-tile Meteor Lake by acquiring
an extra wakeref for a Primary GT from i915_gem_do_execbuffer() -- see
commit f56fe3e91787 ("drm/i915: Fix a VMA UAF for multi-gt platform").
However, it occurred insufficient -- the issue was still reported by CI.
That wakeref was released on exit from i915_gem_do_execbuffer(), then
potentially before completion of the request and deactivation of its
associated VMAs.
I believe the issue was introduced by commit d93939730347 ("drm/i915:
Remove the vma refcount") which moved a call to i915_active_fini() from
a dropped i915_vma_release(), called on last put of the removed VMA kref,
to i915_vma_parked() processing path called on last put of a GT wakeref.
However, its visibility to the object debugging tool was suppressed by a
bug in i915_active that was fixed two weeks later with commit e92eb246feb9
("drm/i915/active: Fix missing debug object activation").
Fix the issue by getting a wakeref for the VMA's tile when activating it,
and putting that wakeref only after the VMA is deactivated. However,
exclude global GTT from that processing path, otherwise the GPU never goes
idle. Since __i915_vma_retire() may be called from atomic contexts, use
async variant of wakeref put.
Having that fixed, stop explicitly acquiring the extra GT0 wakeref from
inside i915_gem_do_execbuffer(), and also drop an extra call to
i915_active_wait(), introduced by commit 7a2280e8dcd2 ("drm/i915: Wait for
active retire before i915_active_fini()") as another insufficient fix for
this UAF race.
v4: Refresh on top of commit 5e4e06e4087e ("drm/i915: Track gt pm
wakerefs") (Andi),
- for more easy backporting, split out removal of former insufficient
workarounds and move them to separate patches (Nirmoy).
Janusz Krzysztofik (3):
drm/i915/vma: Fix UAF on destroy against retire race
Manually revert "drm/i915: Fix a VMA UAF for multi-gt platform"
Revert "drm/i915: Wait for active retire before i915_active_fini()"
.../gpu/drm/i915/gem/i915_gem_execbuffer.c | 17 -----------
drivers/gpu/drm/i915/i915_vma.c | 28 +++++++++++++------
drivers/gpu/drm/i915/i915_vma_types.h | 1 +
3 files changed, 20 insertions(+), 26 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 19+ messages in thread* [PATCH v4 0/3] drm/i915: Fix VMA UAF on destroy against deactivate race @ 2024-01-22 14:04 ` Janusz Krzysztofik 0 siblings, 0 replies; 19+ messages in thread From: Janusz Krzysztofik @ 2024-01-22 14:04 UTC (permalink / raw) To: intel-gfx Cc: Tvrtko Ursulin, Andi Shyti, Thomas Hellström, Chris Wilson, Andrzej Hajda, dri-devel, Daniel Vetter, Rodrigo Vivi, Janusz Krzysztofik, David Airlie, Nirmoy Das Object debugging tools were sporadically reporting illegal attempts to free a still active i915 VMA object when parking a GPU tile believed to be idle. [161.359441] ODEBUG: free active (active state 0) object: ffff88811643b958 object type: i915_active hint: __i915_vma_active+0x0/0x50 [i915] [161.360082] WARNING: CPU: 5 PID: 276 at lib/debugobjects.c:514 debug_print_object+0x80/0xb0 ... [161.360304] CPU: 5 PID: 276 Comm: kworker/5:2 Not tainted 6.5.0-rc1-CI_DRM_13375-g003f860e5577+ #1 [161.360314] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022 [161.360322] Workqueue: i915-unordered __intel_wakeref_put_work [i915] [161.360592] RIP: 0010:debug_print_object+0x80/0xb0 ... [161.361347] debug_object_free+0xeb/0x110 [161.361362] i915_active_fini+0x14/0x130 [i915] [161.361866] release_references+0xfe/0x1f0 [i915] [161.362543] i915_vma_parked+0x1db/0x380 [i915] [161.363129] __gt_park+0x121/0x230 [i915] [161.363515] ____intel_wakeref_put_last+0x1f/0x70 [i915] That has been tracked down to be happening when another thread is deactivating the VMA inside __active_retire() helper, after the VMA's active counter has been already decremented to 0, but before deactivation of the VMA's object is reported to the object debugging tool. There was an attempt to fix this issue on 2-tile Meteor Lake by acquiring an extra wakeref for a Primary GT from i915_gem_do_execbuffer() -- see commit f56fe3e91787 ("drm/i915: Fix a VMA UAF for multi-gt platform"). However, it occurred insufficient -- the issue was still reported by CI. That wakeref was released on exit from i915_gem_do_execbuffer(), then potentially before completion of the request and deactivation of its associated VMAs. I believe the issue was introduced by commit d93939730347 ("drm/i915: Remove the vma refcount") which moved a call to i915_active_fini() from a dropped i915_vma_release(), called on last put of the removed VMA kref, to i915_vma_parked() processing path called on last put of a GT wakeref. However, its visibility to the object debugging tool was suppressed by a bug in i915_active that was fixed two weeks later with commit e92eb246feb9 ("drm/i915/active: Fix missing debug object activation"). Fix the issue by getting a wakeref for the VMA's tile when activating it, and putting that wakeref only after the VMA is deactivated. However, exclude global GTT from that processing path, otherwise the GPU never goes idle. Since __i915_vma_retire() may be called from atomic contexts, use async variant of wakeref put. Having that fixed, stop explicitly acquiring the extra GT0 wakeref from inside i915_gem_do_execbuffer(), and also drop an extra call to i915_active_wait(), introduced by commit 7a2280e8dcd2 ("drm/i915: Wait for active retire before i915_active_fini()") as another insufficient fix for this UAF race. v4: Refresh on top of commit 5e4e06e4087e ("drm/i915: Track gt pm wakerefs") (Andi), - for more easy backporting, split out removal of former insufficient workarounds and move them to separate patches (Nirmoy). Janusz Krzysztofik (3): drm/i915/vma: Fix UAF on destroy against retire race Manually revert "drm/i915: Fix a VMA UAF for multi-gt platform" Revert "drm/i915: Wait for active retire before i915_active_fini()" .../gpu/drm/i915/gem/i915_gem_execbuffer.c | 17 ----------- drivers/gpu/drm/i915/i915_vma.c | 28 +++++++++++++------ drivers/gpu/drm/i915/i915_vma_types.h | 1 + 3 files changed, 20 insertions(+), 26 deletions(-) -- 2.43.0 ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH v4 1/3] drm/i915/vma: Fix UAF on destroy against retire race 2024-01-22 14:04 ` Janusz Krzysztofik @ 2024-01-22 14:04 ` Janusz Krzysztofik -1 siblings, 0 replies; 19+ messages in thread From: Janusz Krzysztofik @ 2024-01-22 14:04 UTC (permalink / raw) To: intel-gfx Cc: Thomas Hellström, Chris Wilson, Andrzej Hajda, dri-devel, Daniel Vetter, Rodrigo Vivi, David Airlie, Nirmoy Das Object debugging tools were sporadically reporting illegal attempts to free a still active i915 VMA object when parking a GPU tile believed to be idle. [161.359441] ODEBUG: free active (active state 0) object: ffff88811643b958 object type: i915_active hint: __i915_vma_active+0x0/0x50 [i915] [161.360082] WARNING: CPU: 5 PID: 276 at lib/debugobjects.c:514 debug_print_object+0x80/0xb0 ... [161.360304] CPU: 5 PID: 276 Comm: kworker/5:2 Not tainted 6.5.0-rc1-CI_DRM_13375-g003f860e5577+ #1 [161.360314] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022 [161.360322] Workqueue: i915-unordered __intel_wakeref_put_work [i915] [161.360592] RIP: 0010:debug_print_object+0x80/0xb0 ... [161.361347] debug_object_free+0xeb/0x110 [161.361362] i915_active_fini+0x14/0x130 [i915] [161.361866] release_references+0xfe/0x1f0 [i915] [161.362543] i915_vma_parked+0x1db/0x380 [i915] [161.363129] __gt_park+0x121/0x230 [i915] [161.363515] ____intel_wakeref_put_last+0x1f/0x70 [i915] That has been tracked down to be happening when another thread is deactivating the VMA inside __active_retire() helper, after the VMA's active counter has been already decremented to 0, but before deactivation of the VMA's object is reported to the object debugging tool. We could prevent from that race by serializing i915_active_fini() with __active_retire() via ref->tree_lock, but that wouldn't stop the VMA from being used, e.g. from __i915_vma_retire() called at the end of __active_retire(), after that VMA has been already freed by a concurrent i915_vma_destroy() on return from the i915_active_fini(). Then, we should rather fix the issue at the VMA level, not in i915_active. Since __i915_vma_parked() is called from __gt_park() on last put of the GT's wakeref, the issue could be addressed by holding the GT wakeref long enough for __active_retire() to complete before that wakeref is released and the GT parked. A VMA associated with a request doesn't acquire a GT wakeref by itself. Instead, it depends on a wakeref held directly by the request's active intel_context for a GT associated with its VM, and indirectly on that intel_context's engine wakeref if the engine belongs to the same GT as the VMA's VM. In case of single-tile platforms, at least one of those wakerefs is usually held long enough for the request's VMA to be deactivated on time, before it is destroyed on last put of its VM GT wakeref. However, on multi-tile platforms, a request may use a VMA from a tile other than the one that hosts the request's engine, then it is protected only with the intel_context's VM GT wakeref. There was an attempt to fix this issue on 2-tile Meteor Lake by acquiring an extra wakeref for a Primary GT from i915_gem_do_execbuffer() -- see commit f56fe3e91787 ("drm/i915: Fix a VMA UAF for multi-gt platform"). However, it occurred insufficient -- the issue was still reported by CI. That wakeref was released on exit from i915_gem_do_execbuffer(), then potentially before completion of the request and deactivation of its associated VMAs. OTOH, CI reports indicate that single-tile platforms also suffer sporadically from the same race. I believe the issue was introduced by commit d93939730347 ("drm/i915: Remove the vma refcount") which moved a call to i915_active_fini() from a dropped i915_vma_release(), called on last put of the removed VMA kref, to i915_vma_parked() processing path called on last put of a GT wakeref. However, its visibility to the object debugging tool was suppressed by a bug in i915_active that was fixed two weeks later with commit e92eb246feb9 ("drm/i915/active: Fix missing debug object activation"). Fix the issue by getting a wakeref for the VMA's tile when activating it, and putting that wakeref only after the VMA is deactivated. However, exclude global GTT from that processing path, otherwise the GPU never goes idle. Since __i915_vma_retire() may be called from atomic contexts, use async variant of wakeref put. v4: Refresh on top of commit 5e4e06e4087e ("drm/i915: Track gt pm wakerefs") (Andi), - for more easy backporting, split out removal of former insufficient workarounds and move them to separate patches (Nirmoy). - clean up commit message and description a bit. v3: Identify root cause more precisely, and a commit to blame, - identify and drop former workarounds, - update commit message and description. v2: Get the wakeref before VM mutex to avoid circular locking dependency, - drop questionable Fixes: tag. Fixes: d93939730347 ("drm/i915: Remove the vma refcount") Closes: https://gitlab.freedesktop.org/drm/intel/issues/8875 Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Cc: Andi Shyti <andi.shyti@linux.intel.com> Cc: stable@vger.kernel.org # v5.19+ --- drivers/gpu/drm/i915/i915_vma.c | 26 +++++++++++++++++++------- drivers/gpu/drm/i915/i915_vma_types.h | 1 + 2 files changed, 20 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c index d09aad34ba37f..604d420b9e1fd 100644 --- a/drivers/gpu/drm/i915/i915_vma.c +++ b/drivers/gpu/drm/i915/i915_vma.c @@ -34,6 +34,7 @@ #include "gt/intel_engine.h" #include "gt/intel_engine_heartbeat.h" #include "gt/intel_gt.h" +#include "gt/intel_gt_pm.h" #include "gt/intel_gt_requests.h" #include "gt/intel_tlb.h" @@ -103,12 +104,25 @@ static inline struct i915_vma *active_to_vma(struct i915_active *ref) static int __i915_vma_active(struct i915_active *ref) { - return i915_vma_tryget(active_to_vma(ref)) ? 0 : -ENOENT; + struct i915_vma *vma = active_to_vma(ref); + + if (!i915_vma_tryget(vma)) + return -ENOENT; + + if (!i915_vma_is_ggtt(vma)) + vma->wakeref = intel_gt_pm_get(vma->vm->gt); + + return 0; } static void __i915_vma_retire(struct i915_active *ref) { - i915_vma_put(active_to_vma(ref)); + struct i915_vma *vma = active_to_vma(ref); + + if (!i915_vma_is_ggtt(vma)) + intel_gt_pm_put_async(vma->vm->gt, vma->wakeref); + + i915_vma_put(vma); } static struct i915_vma * @@ -1404,7 +1418,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, struct i915_vma_work *work = NULL; struct dma_fence *moving = NULL; struct i915_vma_resource *vma_res = NULL; - intel_wakeref_t wakeref = 0; + intel_wakeref_t wakeref; unsigned int bound; int err; @@ -1424,8 +1438,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, if (err) return err; - if (flags & PIN_GLOBAL) - wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm); + wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm); if (flags & vma->vm->bind_async_flags) { /* lock VM */ @@ -1561,8 +1574,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, if (work) dma_fence_work_commit_imm(&work->base); err_rpm: - if (wakeref) - intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref); + intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref); if (moving) dma_fence_put(moving); diff --git a/drivers/gpu/drm/i915/i915_vma_types.h b/drivers/gpu/drm/i915/i915_vma_types.h index 64472b7f0e770..f0086fadff4d3 100644 --- a/drivers/gpu/drm/i915/i915_vma_types.h +++ b/drivers/gpu/drm/i915/i915_vma_types.h @@ -264,6 +264,7 @@ struct i915_vma { #define I915_VMA_SCANOUT ((int)BIT(I915_VMA_SCANOUT_BIT)) struct i915_active active; + intel_wakeref_t wakeref; #define I915_VMA_PAGES_BIAS 24 #define I915_VMA_PAGES_ACTIVE (BIT(24) | 1) -- 2.43.0 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH v4 1/3] drm/i915/vma: Fix UAF on destroy against retire race @ 2024-01-22 14:04 ` Janusz Krzysztofik 0 siblings, 0 replies; 19+ messages in thread From: Janusz Krzysztofik @ 2024-01-22 14:04 UTC (permalink / raw) To: intel-gfx Cc: Tvrtko Ursulin, Andi Shyti, Thomas Hellström, Chris Wilson, Andrzej Hajda, dri-devel, Daniel Vetter, Rodrigo Vivi, Janusz Krzysztofik, David Airlie, Nirmoy Das Object debugging tools were sporadically reporting illegal attempts to free a still active i915 VMA object when parking a GPU tile believed to be idle. [161.359441] ODEBUG: free active (active state 0) object: ffff88811643b958 object type: i915_active hint: __i915_vma_active+0x0/0x50 [i915] [161.360082] WARNING: CPU: 5 PID: 276 at lib/debugobjects.c:514 debug_print_object+0x80/0xb0 ... [161.360304] CPU: 5 PID: 276 Comm: kworker/5:2 Not tainted 6.5.0-rc1-CI_DRM_13375-g003f860e5577+ #1 [161.360314] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022 [161.360322] Workqueue: i915-unordered __intel_wakeref_put_work [i915] [161.360592] RIP: 0010:debug_print_object+0x80/0xb0 ... [161.361347] debug_object_free+0xeb/0x110 [161.361362] i915_active_fini+0x14/0x130 [i915] [161.361866] release_references+0xfe/0x1f0 [i915] [161.362543] i915_vma_parked+0x1db/0x380 [i915] [161.363129] __gt_park+0x121/0x230 [i915] [161.363515] ____intel_wakeref_put_last+0x1f/0x70 [i915] That has been tracked down to be happening when another thread is deactivating the VMA inside __active_retire() helper, after the VMA's active counter has been already decremented to 0, but before deactivation of the VMA's object is reported to the object debugging tool. We could prevent from that race by serializing i915_active_fini() with __active_retire() via ref->tree_lock, but that wouldn't stop the VMA from being used, e.g. from __i915_vma_retire() called at the end of __active_retire(), after that VMA has been already freed by a concurrent i915_vma_destroy() on return from the i915_active_fini(). Then, we should rather fix the issue at the VMA level, not in i915_active. Since __i915_vma_parked() is called from __gt_park() on last put of the GT's wakeref, the issue could be addressed by holding the GT wakeref long enough for __active_retire() to complete before that wakeref is released and the GT parked. A VMA associated with a request doesn't acquire a GT wakeref by itself. Instead, it depends on a wakeref held directly by the request's active intel_context for a GT associated with its VM, and indirectly on that intel_context's engine wakeref if the engine belongs to the same GT as the VMA's VM. In case of single-tile platforms, at least one of those wakerefs is usually held long enough for the request's VMA to be deactivated on time, before it is destroyed on last put of its VM GT wakeref. However, on multi-tile platforms, a request may use a VMA from a tile other than the one that hosts the request's engine, then it is protected only with the intel_context's VM GT wakeref. There was an attempt to fix this issue on 2-tile Meteor Lake by acquiring an extra wakeref for a Primary GT from i915_gem_do_execbuffer() -- see commit f56fe3e91787 ("drm/i915: Fix a VMA UAF for multi-gt platform"). However, it occurred insufficient -- the issue was still reported by CI. That wakeref was released on exit from i915_gem_do_execbuffer(), then potentially before completion of the request and deactivation of its associated VMAs. OTOH, CI reports indicate that single-tile platforms also suffer sporadically from the same race. I believe the issue was introduced by commit d93939730347 ("drm/i915: Remove the vma refcount") which moved a call to i915_active_fini() from a dropped i915_vma_release(), called on last put of the removed VMA kref, to i915_vma_parked() processing path called on last put of a GT wakeref. However, its visibility to the object debugging tool was suppressed by a bug in i915_active that was fixed two weeks later with commit e92eb246feb9 ("drm/i915/active: Fix missing debug object activation"). Fix the issue by getting a wakeref for the VMA's tile when activating it, and putting that wakeref only after the VMA is deactivated. However, exclude global GTT from that processing path, otherwise the GPU never goes idle. Since __i915_vma_retire() may be called from atomic contexts, use async variant of wakeref put. v4: Refresh on top of commit 5e4e06e4087e ("drm/i915: Track gt pm wakerefs") (Andi), - for more easy backporting, split out removal of former insufficient workarounds and move them to separate patches (Nirmoy). - clean up commit message and description a bit. v3: Identify root cause more precisely, and a commit to blame, - identify and drop former workarounds, - update commit message and description. v2: Get the wakeref before VM mutex to avoid circular locking dependency, - drop questionable Fixes: tag. Fixes: d93939730347 ("drm/i915: Remove the vma refcount") Closes: https://gitlab.freedesktop.org/drm/intel/issues/8875 Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Cc: Andi Shyti <andi.shyti@linux.intel.com> Cc: stable@vger.kernel.org # v5.19+ --- drivers/gpu/drm/i915/i915_vma.c | 26 +++++++++++++++++++------- drivers/gpu/drm/i915/i915_vma_types.h | 1 + 2 files changed, 20 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c index d09aad34ba37f..604d420b9e1fd 100644 --- a/drivers/gpu/drm/i915/i915_vma.c +++ b/drivers/gpu/drm/i915/i915_vma.c @@ -34,6 +34,7 @@ #include "gt/intel_engine.h" #include "gt/intel_engine_heartbeat.h" #include "gt/intel_gt.h" +#include "gt/intel_gt_pm.h" #include "gt/intel_gt_requests.h" #include "gt/intel_tlb.h" @@ -103,12 +104,25 @@ static inline struct i915_vma *active_to_vma(struct i915_active *ref) static int __i915_vma_active(struct i915_active *ref) { - return i915_vma_tryget(active_to_vma(ref)) ? 0 : -ENOENT; + struct i915_vma *vma = active_to_vma(ref); + + if (!i915_vma_tryget(vma)) + return -ENOENT; + + if (!i915_vma_is_ggtt(vma)) + vma->wakeref = intel_gt_pm_get(vma->vm->gt); + + return 0; } static void __i915_vma_retire(struct i915_active *ref) { - i915_vma_put(active_to_vma(ref)); + struct i915_vma *vma = active_to_vma(ref); + + if (!i915_vma_is_ggtt(vma)) + intel_gt_pm_put_async(vma->vm->gt, vma->wakeref); + + i915_vma_put(vma); } static struct i915_vma * @@ -1404,7 +1418,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, struct i915_vma_work *work = NULL; struct dma_fence *moving = NULL; struct i915_vma_resource *vma_res = NULL; - intel_wakeref_t wakeref = 0; + intel_wakeref_t wakeref; unsigned int bound; int err; @@ -1424,8 +1438,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, if (err) return err; - if (flags & PIN_GLOBAL) - wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm); + wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm); if (flags & vma->vm->bind_async_flags) { /* lock VM */ @@ -1561,8 +1574,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, if (work) dma_fence_work_commit_imm(&work->base); err_rpm: - if (wakeref) - intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref); + intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref); if (moving) dma_fence_put(moving); diff --git a/drivers/gpu/drm/i915/i915_vma_types.h b/drivers/gpu/drm/i915/i915_vma_types.h index 64472b7f0e770..f0086fadff4d3 100644 --- a/drivers/gpu/drm/i915/i915_vma_types.h +++ b/drivers/gpu/drm/i915/i915_vma_types.h @@ -264,6 +264,7 @@ struct i915_vma { #define I915_VMA_SCANOUT ((int)BIT(I915_VMA_SCANOUT_BIT)) struct i915_active active; + intel_wakeref_t wakeref; #define I915_VMA_PAGES_BIAS 24 #define I915_VMA_PAGES_ACTIVE (BIT(24) | 1) -- 2.43.0 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH v4 1/3] drm/i915/vma: Fix UAF on destroy against retire race 2024-01-22 14:04 ` Janusz Krzysztofik @ 2024-01-22 21:09 ` Rodrigo Vivi -1 siblings, 0 replies; 19+ messages in thread From: Rodrigo Vivi @ 2024-01-22 21:09 UTC (permalink / raw) To: Janusz Krzysztofik Cc: Thomas Hellström, Chris Wilson, intel-gfx, Andrzej Hajda, dri-devel, Daniel Vetter, David Airlie, Nirmoy Das On Mon, Jan 22, 2024 at 03:04:42PM +0100, Janusz Krzysztofik wrote: > Object debugging tools were sporadically reporting illegal attempts to > free a still active i915 VMA object when parking a GPU tile believed to be > idle. > > [161.359441] ODEBUG: free active (active state 0) object: ffff88811643b958 object type: i915_active hint: __i915_vma_active+0x0/0x50 [i915] > [161.360082] WARNING: CPU: 5 PID: 276 at lib/debugobjects.c:514 debug_print_object+0x80/0xb0 > ... > [161.360304] CPU: 5 PID: 276 Comm: kworker/5:2 Not tainted 6.5.0-rc1-CI_DRM_13375-g003f860e5577+ #1 > [161.360314] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022 > [161.360322] Workqueue: i915-unordered __intel_wakeref_put_work [i915] > [161.360592] RIP: 0010:debug_print_object+0x80/0xb0 > ... > [161.361347] debug_object_free+0xeb/0x110 > [161.361362] i915_active_fini+0x14/0x130 [i915] > [161.361866] release_references+0xfe/0x1f0 [i915] > [161.362543] i915_vma_parked+0x1db/0x380 [i915] > [161.363129] __gt_park+0x121/0x230 [i915] > [161.363515] ____intel_wakeref_put_last+0x1f/0x70 [i915] > > That has been tracked down to be happening when another thread is > deactivating the VMA inside __active_retire() helper, after the VMA's > active counter has been already decremented to 0, but before deactivation > of the VMA's object is reported to the object debugging tool. > > We could prevent from that race by serializing i915_active_fini() with > __active_retire() via ref->tree_lock, but that wouldn't stop the VMA from > being used, e.g. from __i915_vma_retire() called at the end of > __active_retire(), after that VMA has been already freed by a concurrent > i915_vma_destroy() on return from the i915_active_fini(). Then, we should > rather fix the issue at the VMA level, not in i915_active. > > Since __i915_vma_parked() is called from __gt_park() on last put of the > GT's wakeref, the issue could be addressed by holding the GT wakeref long > enough for __active_retire() to complete before that wakeref is released > and the GT parked. > > A VMA associated with a request doesn't acquire a GT wakeref by itself. > Instead, it depends on a wakeref held directly by the request's active > intel_context for a GT associated with its VM, and indirectly on that > intel_context's engine wakeref if the engine belongs to the same GT as the > VMA's VM. In case of single-tile platforms, at least one of those > wakerefs is usually held long enough for the request's VMA to be > deactivated on time, before it is destroyed on last put of its VM GT > wakeref. However, on multi-tile platforms, a request may use a VMA from a > tile other than the one that hosts the request's engine, then it is > protected only with the intel_context's VM GT wakeref. > > There was an attempt to fix this issue on 2-tile Meteor Lake by acquiring please do not confuse the terminology here. MTL is 1-tile platform, with multiple GTs (1 for Render/Compute and 1 for Media). Also you could probably avoid mentioning the other case here when you are actively trying to resolve the RKL's single GT case. > an extra wakeref for a Primary GT from i915_gem_do_execbuffer() -- see > commit f56fe3e91787 ("drm/i915: Fix a VMA UAF for multi-gt platform"). > However, it occurred insufficient -- the issue was still reported by CI. > That wakeref was released on exit from i915_gem_do_execbuffer(), then > potentially before completion of the request and deactivation of its > associated VMAs. > > OTOH, CI reports indicate that single-tile platforms also suffer > sporadically from the same race. > > I believe the issue was introduced by commit d93939730347 ("drm/i915: > Remove the vma refcount") which moved a call to i915_active_fini() from > a dropped i915_vma_release(), called on last put of the removed VMA kref, > to i915_vma_parked() processing path called on last put of a GT wakeref. > However, its visibility to the object debugging tool was suppressed by a > bug in i915_active that was fixed two weeks later with commit e92eb246feb9 > ("drm/i915/active: Fix missing debug object activation"). > > Fix the issue by getting a wakeref for the VMA's tile when activating it, > and putting that wakeref only after the VMA is deactivated. However, > exclude global GTT from that processing path, otherwise the GPU never goes > idle. Since __i915_vma_retire() may be called from atomic contexts, use > async variant of wakeref put. okay, this explains the first block of the patch below, but I'm afraid that it doesn't explain why: - if (flags & PIN_GLOBAL) > > v4: Refresh on top of commit 5e4e06e4087e ("drm/i915: Track gt pm > wakerefs") (Andi), > - for more easy backporting, split out removal of former insufficient > workarounds and move them to separate patches (Nirmoy). > - clean up commit message and description a bit. > v3: Identify root cause more precisely, and a commit to blame, > - identify and drop former workarounds, > - update commit message and description. > v2: Get the wakeref before VM mutex to avoid circular locking dependency, > - drop questionable Fixes: tag. > > Fixes: d93939730347 ("drm/i915: Remove the vma refcount") > Closes: https://gitlab.freedesktop.org/drm/intel/issues/8875 > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> > Cc: Nirmoy Das <nirmoy.das@intel.com> > Cc: Andi Shyti <andi.shyti@linux.intel.com> > Cc: stable@vger.kernel.org # v5.19+ > --- > drivers/gpu/drm/i915/i915_vma.c | 26 +++++++++++++++++++------- > drivers/gpu/drm/i915/i915_vma_types.h | 1 + > 2 files changed, 20 insertions(+), 7 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c > index d09aad34ba37f..604d420b9e1fd 100644 > --- a/drivers/gpu/drm/i915/i915_vma.c > +++ b/drivers/gpu/drm/i915/i915_vma.c > @@ -34,6 +34,7 @@ > #include "gt/intel_engine.h" > #include "gt/intel_engine_heartbeat.h" > #include "gt/intel_gt.h" > +#include "gt/intel_gt_pm.h" > #include "gt/intel_gt_requests.h" > #include "gt/intel_tlb.h" > > @@ -103,12 +104,25 @@ static inline struct i915_vma *active_to_vma(struct i915_active *ref) > > static int __i915_vma_active(struct i915_active *ref) > { > - return i915_vma_tryget(active_to_vma(ref)) ? 0 : -ENOENT; > + struct i915_vma *vma = active_to_vma(ref); > + > + if (!i915_vma_tryget(vma)) > + return -ENOENT; > + > + if (!i915_vma_is_ggtt(vma)) > + vma->wakeref = intel_gt_pm_get(vma->vm->gt); > + > + return 0; > } > > static void __i915_vma_retire(struct i915_active *ref) > { > - i915_vma_put(active_to_vma(ref)); > + struct i915_vma *vma = active_to_vma(ref); > + > + if (!i915_vma_is_ggtt(vma)) > + intel_gt_pm_put_async(vma->vm->gt, vma->wakeref); > + > + i915_vma_put(vma); > } > > static struct i915_vma * > @@ -1404,7 +1418,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, > struct i915_vma_work *work = NULL; > struct dma_fence *moving = NULL; > struct i915_vma_resource *vma_res = NULL; > - intel_wakeref_t wakeref = 0; > + intel_wakeref_t wakeref; > unsigned int bound; > int err; > > @@ -1424,8 +1438,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, > if (err) > return err; > > - if (flags & PIN_GLOBAL) > - wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm); > + wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm); > > if (flags & vma->vm->bind_async_flags) { > /* lock VM */ > @@ -1561,8 +1574,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, > if (work) > dma_fence_work_commit_imm(&work->base); > err_rpm: > - if (wakeref) > - intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref); > + intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref); > > if (moving) > dma_fence_put(moving); > diff --git a/drivers/gpu/drm/i915/i915_vma_types.h b/drivers/gpu/drm/i915/i915_vma_types.h > index 64472b7f0e770..f0086fadff4d3 100644 > --- a/drivers/gpu/drm/i915/i915_vma_types.h > +++ b/drivers/gpu/drm/i915/i915_vma_types.h > @@ -264,6 +264,7 @@ struct i915_vma { > #define I915_VMA_SCANOUT ((int)BIT(I915_VMA_SCANOUT_BIT)) > > struct i915_active active; > + intel_wakeref_t wakeref; > > #define I915_VMA_PAGES_BIAS 24 > #define I915_VMA_PAGES_ACTIVE (BIT(24) | 1) > -- > 2.43.0 > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 1/3] drm/i915/vma: Fix UAF on destroy against retire race @ 2024-01-22 21:09 ` Rodrigo Vivi 0 siblings, 0 replies; 19+ messages in thread From: Rodrigo Vivi @ 2024-01-22 21:09 UTC (permalink / raw) To: Janusz Krzysztofik Cc: Tvrtko Ursulin, Andi Shyti, Thomas Hellström, Chris Wilson, intel-gfx, Andrzej Hajda, dri-devel, Daniel Vetter, David Airlie, Nirmoy Das On Mon, Jan 22, 2024 at 03:04:42PM +0100, Janusz Krzysztofik wrote: > Object debugging tools were sporadically reporting illegal attempts to > free a still active i915 VMA object when parking a GPU tile believed to be > idle. > > [161.359441] ODEBUG: free active (active state 0) object: ffff88811643b958 object type: i915_active hint: __i915_vma_active+0x0/0x50 [i915] > [161.360082] WARNING: CPU: 5 PID: 276 at lib/debugobjects.c:514 debug_print_object+0x80/0xb0 > ... > [161.360304] CPU: 5 PID: 276 Comm: kworker/5:2 Not tainted 6.5.0-rc1-CI_DRM_13375-g003f860e5577+ #1 > [161.360314] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022 > [161.360322] Workqueue: i915-unordered __intel_wakeref_put_work [i915] > [161.360592] RIP: 0010:debug_print_object+0x80/0xb0 > ... > [161.361347] debug_object_free+0xeb/0x110 > [161.361362] i915_active_fini+0x14/0x130 [i915] > [161.361866] release_references+0xfe/0x1f0 [i915] > [161.362543] i915_vma_parked+0x1db/0x380 [i915] > [161.363129] __gt_park+0x121/0x230 [i915] > [161.363515] ____intel_wakeref_put_last+0x1f/0x70 [i915] > > That has been tracked down to be happening when another thread is > deactivating the VMA inside __active_retire() helper, after the VMA's > active counter has been already decremented to 0, but before deactivation > of the VMA's object is reported to the object debugging tool. > > We could prevent from that race by serializing i915_active_fini() with > __active_retire() via ref->tree_lock, but that wouldn't stop the VMA from > being used, e.g. from __i915_vma_retire() called at the end of > __active_retire(), after that VMA has been already freed by a concurrent > i915_vma_destroy() on return from the i915_active_fini(). Then, we should > rather fix the issue at the VMA level, not in i915_active. > > Since __i915_vma_parked() is called from __gt_park() on last put of the > GT's wakeref, the issue could be addressed by holding the GT wakeref long > enough for __active_retire() to complete before that wakeref is released > and the GT parked. > > A VMA associated with a request doesn't acquire a GT wakeref by itself. > Instead, it depends on a wakeref held directly by the request's active > intel_context for a GT associated with its VM, and indirectly on that > intel_context's engine wakeref if the engine belongs to the same GT as the > VMA's VM. In case of single-tile platforms, at least one of those > wakerefs is usually held long enough for the request's VMA to be > deactivated on time, before it is destroyed on last put of its VM GT > wakeref. However, on multi-tile platforms, a request may use a VMA from a > tile other than the one that hosts the request's engine, then it is > protected only with the intel_context's VM GT wakeref. > > There was an attempt to fix this issue on 2-tile Meteor Lake by acquiring please do not confuse the terminology here. MTL is 1-tile platform, with multiple GTs (1 for Render/Compute and 1 for Media). Also you could probably avoid mentioning the other case here when you are actively trying to resolve the RKL's single GT case. > an extra wakeref for a Primary GT from i915_gem_do_execbuffer() -- see > commit f56fe3e91787 ("drm/i915: Fix a VMA UAF for multi-gt platform"). > However, it occurred insufficient -- the issue was still reported by CI. > That wakeref was released on exit from i915_gem_do_execbuffer(), then > potentially before completion of the request and deactivation of its > associated VMAs. > > OTOH, CI reports indicate that single-tile platforms also suffer > sporadically from the same race. > > I believe the issue was introduced by commit d93939730347 ("drm/i915: > Remove the vma refcount") which moved a call to i915_active_fini() from > a dropped i915_vma_release(), called on last put of the removed VMA kref, > to i915_vma_parked() processing path called on last put of a GT wakeref. > However, its visibility to the object debugging tool was suppressed by a > bug in i915_active that was fixed two weeks later with commit e92eb246feb9 > ("drm/i915/active: Fix missing debug object activation"). > > Fix the issue by getting a wakeref for the VMA's tile when activating it, > and putting that wakeref only after the VMA is deactivated. However, > exclude global GTT from that processing path, otherwise the GPU never goes > idle. Since __i915_vma_retire() may be called from atomic contexts, use > async variant of wakeref put. okay, this explains the first block of the patch below, but I'm afraid that it doesn't explain why: - if (flags & PIN_GLOBAL) > > v4: Refresh on top of commit 5e4e06e4087e ("drm/i915: Track gt pm > wakerefs") (Andi), > - for more easy backporting, split out removal of former insufficient > workarounds and move them to separate patches (Nirmoy). > - clean up commit message and description a bit. > v3: Identify root cause more precisely, and a commit to blame, > - identify and drop former workarounds, > - update commit message and description. > v2: Get the wakeref before VM mutex to avoid circular locking dependency, > - drop questionable Fixes: tag. > > Fixes: d93939730347 ("drm/i915: Remove the vma refcount") > Closes: https://gitlab.freedesktop.org/drm/intel/issues/8875 > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> > Cc: Nirmoy Das <nirmoy.das@intel.com> > Cc: Andi Shyti <andi.shyti@linux.intel.com> > Cc: stable@vger.kernel.org # v5.19+ > --- > drivers/gpu/drm/i915/i915_vma.c | 26 +++++++++++++++++++------- > drivers/gpu/drm/i915/i915_vma_types.h | 1 + > 2 files changed, 20 insertions(+), 7 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c > index d09aad34ba37f..604d420b9e1fd 100644 > --- a/drivers/gpu/drm/i915/i915_vma.c > +++ b/drivers/gpu/drm/i915/i915_vma.c > @@ -34,6 +34,7 @@ > #include "gt/intel_engine.h" > #include "gt/intel_engine_heartbeat.h" > #include "gt/intel_gt.h" > +#include "gt/intel_gt_pm.h" > #include "gt/intel_gt_requests.h" > #include "gt/intel_tlb.h" > > @@ -103,12 +104,25 @@ static inline struct i915_vma *active_to_vma(struct i915_active *ref) > > static int __i915_vma_active(struct i915_active *ref) > { > - return i915_vma_tryget(active_to_vma(ref)) ? 0 : -ENOENT; > + struct i915_vma *vma = active_to_vma(ref); > + > + if (!i915_vma_tryget(vma)) > + return -ENOENT; > + > + if (!i915_vma_is_ggtt(vma)) > + vma->wakeref = intel_gt_pm_get(vma->vm->gt); > + > + return 0; > } > > static void __i915_vma_retire(struct i915_active *ref) > { > - i915_vma_put(active_to_vma(ref)); > + struct i915_vma *vma = active_to_vma(ref); > + > + if (!i915_vma_is_ggtt(vma)) > + intel_gt_pm_put_async(vma->vm->gt, vma->wakeref); > + > + i915_vma_put(vma); > } > > static struct i915_vma * > @@ -1404,7 +1418,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, > struct i915_vma_work *work = NULL; > struct dma_fence *moving = NULL; > struct i915_vma_resource *vma_res = NULL; > - intel_wakeref_t wakeref = 0; > + intel_wakeref_t wakeref; > unsigned int bound; > int err; > > @@ -1424,8 +1438,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, > if (err) > return err; > > - if (flags & PIN_GLOBAL) > - wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm); > + wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm); > > if (flags & vma->vm->bind_async_flags) { > /* lock VM */ > @@ -1561,8 +1574,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, > if (work) > dma_fence_work_commit_imm(&work->base); > err_rpm: > - if (wakeref) > - intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref); > + intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref); > > if (moving) > dma_fence_put(moving); > diff --git a/drivers/gpu/drm/i915/i915_vma_types.h b/drivers/gpu/drm/i915/i915_vma_types.h > index 64472b7f0e770..f0086fadff4d3 100644 > --- a/drivers/gpu/drm/i915/i915_vma_types.h > +++ b/drivers/gpu/drm/i915/i915_vma_types.h > @@ -264,6 +264,7 @@ struct i915_vma { > #define I915_VMA_SCANOUT ((int)BIT(I915_VMA_SCANOUT_BIT)) > > struct i915_active active; > + intel_wakeref_t wakeref; > > #define I915_VMA_PAGES_BIAS 24 > #define I915_VMA_PAGES_ACTIVE (BIT(24) | 1) > -- > 2.43.0 > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 1/3] drm/i915/vma: Fix UAF on destroy against retire race 2024-01-22 21:09 ` Rodrigo Vivi @ 2024-01-23 10:51 ` Janusz Krzysztofik -1 siblings, 0 replies; 19+ messages in thread From: Janusz Krzysztofik @ 2024-01-23 10:51 UTC (permalink / raw) To: Rodrigo Vivi Cc: Thomas Hellström, Chris Wilson, intel-gfx, Andrzej Hajda, dri-devel, Daniel Vetter, David Airlie, Nirmoy Das Hi Rodrigo, Thank you for review. On Monday, 22 January 2024 22:09:38 CET Rodrigo Vivi wrote: > On Mon, Jan 22, 2024 at 03:04:42PM +0100, Janusz Krzysztofik wrote: > > Object debugging tools were sporadically reporting illegal attempts to > > free a still active i915 VMA object when parking a GPU tile believed to be > > idle. > > > > [161.359441] ODEBUG: free active (active state 0) object: ffff88811643b958 object type: i915_active hint: __i915_vma_active+0x0/0x50 [i915] > > [161.360082] WARNING: CPU: 5 PID: 276 at lib/debugobjects.c:514 debug_print_object+0x80/0xb0 > > ... > > [161.360304] CPU: 5 PID: 276 Comm: kworker/5:2 Not tainted 6.5.0-rc1-CI_DRM_13375-g003f860e5577+ #1 > > [161.360314] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022 > > [161.360322] Workqueue: i915-unordered __intel_wakeref_put_work [i915] > > [161.360592] RIP: 0010:debug_print_object+0x80/0xb0 > > ... > > [161.361347] debug_object_free+0xeb/0x110 > > [161.361362] i915_active_fini+0x14/0x130 [i915] > > [161.361866] release_references+0xfe/0x1f0 [i915] > > [161.362543] i915_vma_parked+0x1db/0x380 [i915] > > [161.363129] __gt_park+0x121/0x230 [i915] > > [161.363515] ____intel_wakeref_put_last+0x1f/0x70 [i915] > > > > That has been tracked down to be happening when another thread is > > deactivating the VMA inside __active_retire() helper, after the VMA's > > active counter has been already decremented to 0, but before deactivation > > of the VMA's object is reported to the object debugging tool. > > > > We could prevent from that race by serializing i915_active_fini() with > > __active_retire() via ref->tree_lock, but that wouldn't stop the VMA from > > being used, e.g. from __i915_vma_retire() called at the end of > > __active_retire(), after that VMA has been already freed by a concurrent > > i915_vma_destroy() on return from the i915_active_fini(). Then, we should > > rather fix the issue at the VMA level, not in i915_active. > > > > Since __i915_vma_parked() is called from __gt_park() on last put of the > > GT's wakeref, the issue could be addressed by holding the GT wakeref long > > enough for __active_retire() to complete before that wakeref is released > > and the GT parked. > > > > A VMA associated with a request doesn't acquire a GT wakeref by itself. > > Instead, it depends on a wakeref held directly by the request's active > > intel_context for a GT associated with its VM, and indirectly on that > > intel_context's engine wakeref if the engine belongs to the same GT as the > > VMA's VM. In case of single-tile platforms, at least one of those > > wakerefs is usually held long enough for the request's VMA to be > > deactivated on time, before it is destroyed on last put of its VM GT > > wakeref. However, on multi-tile platforms, a request may use a VMA from a > > tile other than the one that hosts the request's engine, then it is > > protected only with the intel_context's VM GT wakeref. > > > > There was an attempt to fix this issue on 2-tile Meteor Lake by acquiring > > please do not confuse the terminology here. MTL is 1-tile platform, > with multiple GTs (1 for Render/Compute and 1 for Media). I didn't realize that "tile" is not the same as "GT". I can review the whole description and replace all occurrences of "tile" with "GT". > > Also you could probably avoid mentioning the other case here when > you are actively trying to resolve the RKL's single GT case. OK, but let me keep that part of commit description in the cover letter then, because: - historically, the issue was more frequently reproduced in CI on MTL than on other platforms, and that was the initial scope I started working on, - the full description as is better reflects phases of my process of reproduction and root cause analysis of the issue, - the MTL case was specifically addressed by the former insufficient workaround which now I'm now proposing to drop. > > an extra wakeref for a Primary GT from i915_gem_do_execbuffer() -- see > > commit f56fe3e91787 ("drm/i915: Fix a VMA UAF for multi-gt platform"). > > However, it occurred insufficient -- the issue was still reported by CI. > > That wakeref was released on exit from i915_gem_do_execbuffer(), then > > potentially before completion of the request and deactivation of its > > associated VMAs. > > > > OTOH, CI reports indicate that single-tile platforms also suffer > > sporadically from the same race. > > > > I believe the issue was introduced by commit d93939730347 ("drm/i915: > > Remove the vma refcount") which moved a call to i915_active_fini() from > > a dropped i915_vma_release(), called on last put of the removed VMA kref, > > to i915_vma_parked() processing path called on last put of a GT wakeref. > > However, its visibility to the object debugging tool was suppressed by a > > bug in i915_active that was fixed two weeks later with commit e92eb246feb9 > > ("drm/i915/active: Fix missing debug object activation"). > > > > Fix the issue by getting a wakeref for the VMA's tile when activating it, > > and putting that wakeref only after the VMA is deactivated. However, > > exclude global GTT from that processing path, otherwise the GPU never goes > > idle. Since __i915_vma_retire() may be called from atomic contexts, use > > async variant of wakeref put. > > okay, this explains the first block of the patch below, but I'm afraid > that it doesn't explain why: > > - if (flags & PIN_GLOBAL) That's explained in v2 changelog below, I believe. I can add that information to the body of commit description as well. Thanks, Janusz > > > > v4: Refresh on top of commit 5e4e06e4087e ("drm/i915: Track gt pm > > wakerefs") (Andi), > > - for more easy backporting, split out removal of former insufficient > > workarounds and move them to separate patches (Nirmoy). > > - clean up commit message and description a bit. > > v3: Identify root cause more precisely, and a commit to blame, > > - identify and drop former workarounds, > > - update commit message and description. > > v2: Get the wakeref before VM mutex to avoid circular locking dependency, > > - drop questionable Fixes: tag. > > > > Fixes: d93939730347 ("drm/i915: Remove the vma refcount") > > Closes: https://gitlab.freedesktop.org/drm/intel/issues/8875 > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> > > Cc: Nirmoy Das <nirmoy.das@intel.com> > > Cc: Andi Shyti <andi.shyti@linux.intel.com> > > Cc: stable@vger.kernel.org # v5.19+ > > --- > > drivers/gpu/drm/i915/i915_vma.c | 26 +++++++++++++++++++------- > > drivers/gpu/drm/i915/i915_vma_types.h | 1 + > > 2 files changed, 20 insertions(+), 7 deletions(-) > > > > diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c > > index d09aad34ba37f..604d420b9e1fd 100644 > > --- a/drivers/gpu/drm/i915/i915_vma.c > > +++ b/drivers/gpu/drm/i915/i915_vma.c > > @@ -34,6 +34,7 @@ > > #include "gt/intel_engine.h" > > #include "gt/intel_engine_heartbeat.h" > > #include "gt/intel_gt.h" > > +#include "gt/intel_gt_pm.h" > > #include "gt/intel_gt_requests.h" > > #include "gt/intel_tlb.h" > > > > @@ -103,12 +104,25 @@ static inline struct i915_vma *active_to_vma(struct i915_active *ref) > > > > static int __i915_vma_active(struct i915_active *ref) > > { > > - return i915_vma_tryget(active_to_vma(ref)) ? 0 : -ENOENT; > > + struct i915_vma *vma = active_to_vma(ref); > > + > > + if (!i915_vma_tryget(vma)) > > + return -ENOENT; > > + > > + if (!i915_vma_is_ggtt(vma)) > > + vma->wakeref = intel_gt_pm_get(vma->vm->gt); > > + > > + return 0; > > } > > > > static void __i915_vma_retire(struct i915_active *ref) > > { > > - i915_vma_put(active_to_vma(ref)); > > + struct i915_vma *vma = active_to_vma(ref); > > + > > + if (!i915_vma_is_ggtt(vma)) > > + intel_gt_pm_put_async(vma->vm->gt, vma->wakeref); > > + > > + i915_vma_put(vma); > > } > > > > static struct i915_vma * > > @@ -1404,7 +1418,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, > > struct i915_vma_work *work = NULL; > > struct dma_fence *moving = NULL; > > struct i915_vma_resource *vma_res = NULL; > > - intel_wakeref_t wakeref = 0; > > + intel_wakeref_t wakeref; > > unsigned int bound; > > int err; > > > > @@ -1424,8 +1438,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, > > if (err) > > return err; > > > > - if (flags & PIN_GLOBAL) > > - wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm); > > + wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm); > > > > if (flags & vma->vm->bind_async_flags) { > > /* lock VM */ > > @@ -1561,8 +1574,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, > > if (work) > > dma_fence_work_commit_imm(&work->base); > > err_rpm: > > - if (wakeref) > > - intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref); > > + intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref); > > > > if (moving) > > dma_fence_put(moving); > > diff --git a/drivers/gpu/drm/i915/i915_vma_types.h b/drivers/gpu/drm/i915/i915_vma_types.h > > index 64472b7f0e770..f0086fadff4d3 100644 > > --- a/drivers/gpu/drm/i915/i915_vma_types.h > > +++ b/drivers/gpu/drm/i915/i915_vma_types.h > > @@ -264,6 +264,7 @@ struct i915_vma { > > #define I915_VMA_SCANOUT ((int)BIT(I915_VMA_SCANOUT_BIT)) > > > > struct i915_active active; > > + intel_wakeref_t wakeref; > > > > #define I915_VMA_PAGES_BIAS 24 > > #define I915_VMA_PAGES_ACTIVE (BIT(24) | 1) > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 1/3] drm/i915/vma: Fix UAF on destroy against retire race @ 2024-01-23 10:51 ` Janusz Krzysztofik 0 siblings, 0 replies; 19+ messages in thread From: Janusz Krzysztofik @ 2024-01-23 10:51 UTC (permalink / raw) To: Rodrigo Vivi Cc: Tvrtko Ursulin, Andi Shyti, Thomas Hellström, Chris Wilson, intel-gfx, Andrzej Hajda, dri-devel, Daniel Vetter, Janusz Krzysztofik, David Airlie, Nirmoy Das Hi Rodrigo, Thank you for review. On Monday, 22 January 2024 22:09:38 CET Rodrigo Vivi wrote: > On Mon, Jan 22, 2024 at 03:04:42PM +0100, Janusz Krzysztofik wrote: > > Object debugging tools were sporadically reporting illegal attempts to > > free a still active i915 VMA object when parking a GPU tile believed to be > > idle. > > > > [161.359441] ODEBUG: free active (active state 0) object: ffff88811643b958 object type: i915_active hint: __i915_vma_active+0x0/0x50 [i915] > > [161.360082] WARNING: CPU: 5 PID: 276 at lib/debugobjects.c:514 debug_print_object+0x80/0xb0 > > ... > > [161.360304] CPU: 5 PID: 276 Comm: kworker/5:2 Not tainted 6.5.0-rc1-CI_DRM_13375-g003f860e5577+ #1 > > [161.360314] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022 > > [161.360322] Workqueue: i915-unordered __intel_wakeref_put_work [i915] > > [161.360592] RIP: 0010:debug_print_object+0x80/0xb0 > > ... > > [161.361347] debug_object_free+0xeb/0x110 > > [161.361362] i915_active_fini+0x14/0x130 [i915] > > [161.361866] release_references+0xfe/0x1f0 [i915] > > [161.362543] i915_vma_parked+0x1db/0x380 [i915] > > [161.363129] __gt_park+0x121/0x230 [i915] > > [161.363515] ____intel_wakeref_put_last+0x1f/0x70 [i915] > > > > That has been tracked down to be happening when another thread is > > deactivating the VMA inside __active_retire() helper, after the VMA's > > active counter has been already decremented to 0, but before deactivation > > of the VMA's object is reported to the object debugging tool. > > > > We could prevent from that race by serializing i915_active_fini() with > > __active_retire() via ref->tree_lock, but that wouldn't stop the VMA from > > being used, e.g. from __i915_vma_retire() called at the end of > > __active_retire(), after that VMA has been already freed by a concurrent > > i915_vma_destroy() on return from the i915_active_fini(). Then, we should > > rather fix the issue at the VMA level, not in i915_active. > > > > Since __i915_vma_parked() is called from __gt_park() on last put of the > > GT's wakeref, the issue could be addressed by holding the GT wakeref long > > enough for __active_retire() to complete before that wakeref is released > > and the GT parked. > > > > A VMA associated with a request doesn't acquire a GT wakeref by itself. > > Instead, it depends on a wakeref held directly by the request's active > > intel_context for a GT associated with its VM, and indirectly on that > > intel_context's engine wakeref if the engine belongs to the same GT as the > > VMA's VM. In case of single-tile platforms, at least one of those > > wakerefs is usually held long enough for the request's VMA to be > > deactivated on time, before it is destroyed on last put of its VM GT > > wakeref. However, on multi-tile platforms, a request may use a VMA from a > > tile other than the one that hosts the request's engine, then it is > > protected only with the intel_context's VM GT wakeref. > > > > There was an attempt to fix this issue on 2-tile Meteor Lake by acquiring > > please do not confuse the terminology here. MTL is 1-tile platform, > with multiple GTs (1 for Render/Compute and 1 for Media). I didn't realize that "tile" is not the same as "GT". I can review the whole description and replace all occurrences of "tile" with "GT". > > Also you could probably avoid mentioning the other case here when > you are actively trying to resolve the RKL's single GT case. OK, but let me keep that part of commit description in the cover letter then, because: - historically, the issue was more frequently reproduced in CI on MTL than on other platforms, and that was the initial scope I started working on, - the full description as is better reflects phases of my process of reproduction and root cause analysis of the issue, - the MTL case was specifically addressed by the former insufficient workaround which now I'm now proposing to drop. > > an extra wakeref for a Primary GT from i915_gem_do_execbuffer() -- see > > commit f56fe3e91787 ("drm/i915: Fix a VMA UAF for multi-gt platform"). > > However, it occurred insufficient -- the issue was still reported by CI. > > That wakeref was released on exit from i915_gem_do_execbuffer(), then > > potentially before completion of the request and deactivation of its > > associated VMAs. > > > > OTOH, CI reports indicate that single-tile platforms also suffer > > sporadically from the same race. > > > > I believe the issue was introduced by commit d93939730347 ("drm/i915: > > Remove the vma refcount") which moved a call to i915_active_fini() from > > a dropped i915_vma_release(), called on last put of the removed VMA kref, > > to i915_vma_parked() processing path called on last put of a GT wakeref. > > However, its visibility to the object debugging tool was suppressed by a > > bug in i915_active that was fixed two weeks later with commit e92eb246feb9 > > ("drm/i915/active: Fix missing debug object activation"). > > > > Fix the issue by getting a wakeref for the VMA's tile when activating it, > > and putting that wakeref only after the VMA is deactivated. However, > > exclude global GTT from that processing path, otherwise the GPU never goes > > idle. Since __i915_vma_retire() may be called from atomic contexts, use > > async variant of wakeref put. > > okay, this explains the first block of the patch below, but I'm afraid > that it doesn't explain why: > > - if (flags & PIN_GLOBAL) That's explained in v2 changelog below, I believe. I can add that information to the body of commit description as well. Thanks, Janusz > > > > v4: Refresh on top of commit 5e4e06e4087e ("drm/i915: Track gt pm > > wakerefs") (Andi), > > - for more easy backporting, split out removal of former insufficient > > workarounds and move them to separate patches (Nirmoy). > > - clean up commit message and description a bit. > > v3: Identify root cause more precisely, and a commit to blame, > > - identify and drop former workarounds, > > - update commit message and description. > > v2: Get the wakeref before VM mutex to avoid circular locking dependency, > > - drop questionable Fixes: tag. > > > > Fixes: d93939730347 ("drm/i915: Remove the vma refcount") > > Closes: https://gitlab.freedesktop.org/drm/intel/issues/8875 > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> > > Cc: Nirmoy Das <nirmoy.das@intel.com> > > Cc: Andi Shyti <andi.shyti@linux.intel.com> > > Cc: stable@vger.kernel.org # v5.19+ > > --- > > drivers/gpu/drm/i915/i915_vma.c | 26 +++++++++++++++++++------- > > drivers/gpu/drm/i915/i915_vma_types.h | 1 + > > 2 files changed, 20 insertions(+), 7 deletions(-) > > > > diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c > > index d09aad34ba37f..604d420b9e1fd 100644 > > --- a/drivers/gpu/drm/i915/i915_vma.c > > +++ b/drivers/gpu/drm/i915/i915_vma.c > > @@ -34,6 +34,7 @@ > > #include "gt/intel_engine.h" > > #include "gt/intel_engine_heartbeat.h" > > #include "gt/intel_gt.h" > > +#include "gt/intel_gt_pm.h" > > #include "gt/intel_gt_requests.h" > > #include "gt/intel_tlb.h" > > > > @@ -103,12 +104,25 @@ static inline struct i915_vma *active_to_vma(struct i915_active *ref) > > > > static int __i915_vma_active(struct i915_active *ref) > > { > > - return i915_vma_tryget(active_to_vma(ref)) ? 0 : -ENOENT; > > + struct i915_vma *vma = active_to_vma(ref); > > + > > + if (!i915_vma_tryget(vma)) > > + return -ENOENT; > > + > > + if (!i915_vma_is_ggtt(vma)) > > + vma->wakeref = intel_gt_pm_get(vma->vm->gt); > > + > > + return 0; > > } > > > > static void __i915_vma_retire(struct i915_active *ref) > > { > > - i915_vma_put(active_to_vma(ref)); > > + struct i915_vma *vma = active_to_vma(ref); > > + > > + if (!i915_vma_is_ggtt(vma)) > > + intel_gt_pm_put_async(vma->vm->gt, vma->wakeref); > > + > > + i915_vma_put(vma); > > } > > > > static struct i915_vma * > > @@ -1404,7 +1418,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, > > struct i915_vma_work *work = NULL; > > struct dma_fence *moving = NULL; > > struct i915_vma_resource *vma_res = NULL; > > - intel_wakeref_t wakeref = 0; > > + intel_wakeref_t wakeref; > > unsigned int bound; > > int err; > > > > @@ -1424,8 +1438,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, > > if (err) > > return err; > > > > - if (flags & PIN_GLOBAL) > > - wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm); > > + wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm); > > > > if (flags & vma->vm->bind_async_flags) { > > /* lock VM */ > > @@ -1561,8 +1574,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, > > if (work) > > dma_fence_work_commit_imm(&work->base); > > err_rpm: > > - if (wakeref) > > - intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref); > > + intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref); > > > > if (moving) > > dma_fence_put(moving); > > diff --git a/drivers/gpu/drm/i915/i915_vma_types.h b/drivers/gpu/drm/i915/i915_vma_types.h > > index 64472b7f0e770..f0086fadff4d3 100644 > > --- a/drivers/gpu/drm/i915/i915_vma_types.h > > +++ b/drivers/gpu/drm/i915/i915_vma_types.h > > @@ -264,6 +264,7 @@ struct i915_vma { > > #define I915_VMA_SCANOUT ((int)BIT(I915_VMA_SCANOUT_BIT)) > > > > struct i915_active active; > > + intel_wakeref_t wakeref; > > > > #define I915_VMA_PAGES_BIAS 24 > > #define I915_VMA_PAGES_ACTIVE (BIT(24) | 1) > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 1/3] drm/i915/vma: Fix UAF on destroy against retire race 2024-01-23 10:51 ` Janusz Krzysztofik @ 2024-01-23 22:33 ` Rodrigo Vivi -1 siblings, 0 replies; 19+ messages in thread From: Rodrigo Vivi @ 2024-01-23 22:33 UTC (permalink / raw) To: Janusz Krzysztofik Cc: Thomas Hellström, Chris Wilson, intel-gfx, Andrzej Hajda, dri-devel, Daniel Vetter, Rodrigo Vivi, David Airlie, Nirmoy Das On Tue, Jan 23, 2024 at 11:51:15AM +0100, Janusz Krzysztofik wrote: > Hi Rodrigo, > > Thank you for review. > > On Monday, 22 January 2024 22:09:38 CET Rodrigo Vivi wrote: > > On Mon, Jan 22, 2024 at 03:04:42PM +0100, Janusz Krzysztofik wrote: > > > Object debugging tools were sporadically reporting illegal attempts to > > > free a still active i915 VMA object when parking a GPU tile believed to be > > > idle. > > > > > > [161.359441] ODEBUG: free active (active state 0) object: ffff88811643b958 object type: i915_active hint: __i915_vma_active+0x0/0x50 [i915] > > > [161.360082] WARNING: CPU: 5 PID: 276 at lib/debugobjects.c:514 debug_print_object+0x80/0xb0 > > > ... > > > [161.360304] CPU: 5 PID: 276 Comm: kworker/5:2 Not tainted 6.5.0-rc1-CI_DRM_13375-g003f860e5577+ #1 > > > [161.360314] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022 > > > [161.360322] Workqueue: i915-unordered __intel_wakeref_put_work [i915] > > > [161.360592] RIP: 0010:debug_print_object+0x80/0xb0 > > > ... > > > [161.361347] debug_object_free+0xeb/0x110 > > > [161.361362] i915_active_fini+0x14/0x130 [i915] > > > [161.361866] release_references+0xfe/0x1f0 [i915] > > > [161.362543] i915_vma_parked+0x1db/0x380 [i915] > > > [161.363129] __gt_park+0x121/0x230 [i915] > > > [161.363515] ____intel_wakeref_put_last+0x1f/0x70 [i915] > > > > > > That has been tracked down to be happening when another thread is > > > deactivating the VMA inside __active_retire() helper, after the VMA's > > > active counter has been already decremented to 0, but before deactivation > > > of the VMA's object is reported to the object debugging tool. > > > > > > We could prevent from that race by serializing i915_active_fini() with > > > __active_retire() via ref->tree_lock, but that wouldn't stop the VMA from > > > being used, e.g. from __i915_vma_retire() called at the end of > > > __active_retire(), after that VMA has been already freed by a concurrent > > > i915_vma_destroy() on return from the i915_active_fini(). Then, we should > > > rather fix the issue at the VMA level, not in i915_active. > > > > > > Since __i915_vma_parked() is called from __gt_park() on last put of the > > > GT's wakeref, the issue could be addressed by holding the GT wakeref long > > > enough for __active_retire() to complete before that wakeref is released > > > and the GT parked. > > > > > > A VMA associated with a request doesn't acquire a GT wakeref by itself. > > > Instead, it depends on a wakeref held directly by the request's active > > > intel_context for a GT associated with its VM, and indirectly on that > > > intel_context's engine wakeref if the engine belongs to the same GT as the > > > VMA's VM. In case of single-tile platforms, at least one of those > > > wakerefs is usually held long enough for the request's VMA to be > > > deactivated on time, before it is destroyed on last put of its VM GT > > > wakeref. However, on multi-tile platforms, a request may use a VMA from a > > > tile other than the one that hosts the request's engine, then it is > > > protected only with the intel_context's VM GT wakeref. > > > > > > There was an attempt to fix this issue on 2-tile Meteor Lake by acquiring > > > > please do not confuse the terminology here. MTL is 1-tile platform, > > with multiple GTs (1 for Render/Compute and 1 for Media). > > I didn't realize that "tile" is not the same as "GT". I can review the whole > description and replace all occurrences of "tile" with "GT". yeap, in i915 it is a 1-1 map... a tile is implemented within the intel_gt. In Xe there's a good split and a good doc picture is here: https://dri.freedesktop.org/docs/drm/gpu/driver-uapi.html#drm-xe-uapi > > > > > Also you could probably avoid mentioning the other case here when > > you are actively trying to resolve the RKL's single GT case. > > OK, but let me keep that part of commit description in the cover letter then, > because: > - historically, the issue was more frequently reproduced in CI on MTL than on > other platforms, and that was the initial scope I started working on, > - the full description as is better reflects phases of my process of > reproduction and root cause analysis of the issue, > - the MTL case was specifically addressed by the former insufficient > workaround which now I'm now proposing to drop. > > > > an extra wakeref for a Primary GT from i915_gem_do_execbuffer() -- see > > > commit f56fe3e91787 ("drm/i915: Fix a VMA UAF for multi-gt platform"). > > > However, it occurred insufficient -- the issue was still reported by CI. > > > That wakeref was released on exit from i915_gem_do_execbuffer(), then > > > potentially before completion of the request and deactivation of its > > > associated VMAs. > > > > > > OTOH, CI reports indicate that single-tile platforms also suffer > > > sporadically from the same race. > > > > > > I believe the issue was introduced by commit d93939730347 ("drm/i915: > > > Remove the vma refcount") which moved a call to i915_active_fini() from > > > a dropped i915_vma_release(), called on last put of the removed VMA kref, > > > to i915_vma_parked() processing path called on last put of a GT wakeref. > > > However, its visibility to the object debugging tool was suppressed by a > > > bug in i915_active that was fixed two weeks later with commit e92eb246feb9 > > > ("drm/i915/active: Fix missing debug object activation"). > > > > > > Fix the issue by getting a wakeref for the VMA's tile when activating it, > > > and putting that wakeref only after the VMA is deactivated. However, > > > exclude global GTT from that processing path, otherwise the GPU never goes > > > idle. Since __i915_vma_retire() may be called from atomic contexts, use > > > async variant of wakeref put. > > > > okay, this explains the first block of the patch below, but I'm afraid > > that it doesn't explain why: > > > > - if (flags & PIN_GLOBAL) > > That's explained in v2 changelog below, I believe. I can add that information > to the body of commit description as well. > > Thanks, > Janusz > > > > > > > > v4: Refresh on top of commit 5e4e06e4087e ("drm/i915: Track gt pm > > > wakerefs") (Andi), > > > - for more easy backporting, split out removal of former insufficient > > > workarounds and move them to separate patches (Nirmoy). > > > - clean up commit message and description a bit. > > > v3: Identify root cause more precisely, and a commit to blame, > > > - identify and drop former workarounds, > > > - update commit message and description. > > > v2: Get the wakeref before VM mutex to avoid circular locking dependency, > > > - drop questionable Fixes: tag. > > > > > > Fixes: d93939730347 ("drm/i915: Remove the vma refcount") > > > Closes: https://gitlab.freedesktop.org/drm/intel/issues/8875 > > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> > > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> > > > Cc: Nirmoy Das <nirmoy.das@intel.com> > > > Cc: Andi Shyti <andi.shyti@linux.intel.com> > > > Cc: stable@vger.kernel.org # v5.19+ > > > --- > > > drivers/gpu/drm/i915/i915_vma.c | 26 +++++++++++++++++++------- > > > drivers/gpu/drm/i915/i915_vma_types.h | 1 + > > > 2 files changed, 20 insertions(+), 7 deletions(-) > > > > > > diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c > > > index d09aad34ba37f..604d420b9e1fd 100644 > > > --- a/drivers/gpu/drm/i915/i915_vma.c > > > +++ b/drivers/gpu/drm/i915/i915_vma.c > > > @@ -34,6 +34,7 @@ > > > #include "gt/intel_engine.h" > > > #include "gt/intel_engine_heartbeat.h" > > > #include "gt/intel_gt.h" > > > +#include "gt/intel_gt_pm.h" > > > #include "gt/intel_gt_requests.h" > > > #include "gt/intel_tlb.h" > > > > > > @@ -103,12 +104,25 @@ static inline struct i915_vma *active_to_vma(struct i915_active *ref) > > > > > > static int __i915_vma_active(struct i915_active *ref) > > > { > > > - return i915_vma_tryget(active_to_vma(ref)) ? 0 : -ENOENT; > > > + struct i915_vma *vma = active_to_vma(ref); > > > + > > > + if (!i915_vma_tryget(vma)) > > > + return -ENOENT; > > > + > > > + if (!i915_vma_is_ggtt(vma)) > > > + vma->wakeref = intel_gt_pm_get(vma->vm->gt); > > > + > > > + return 0; > > > } > > > > > > static void __i915_vma_retire(struct i915_active *ref) > > > { > > > - i915_vma_put(active_to_vma(ref)); > > > + struct i915_vma *vma = active_to_vma(ref); > > > + > > > + if (!i915_vma_is_ggtt(vma)) > > > + intel_gt_pm_put_async(vma->vm->gt, vma->wakeref); > > > + > > > + i915_vma_put(vma); > > > } > > > > > > static struct i915_vma * > > > @@ -1404,7 +1418,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, > > > struct i915_vma_work *work = NULL; > > > struct dma_fence *moving = NULL; > > > struct i915_vma_resource *vma_res = NULL; > > > - intel_wakeref_t wakeref = 0; > > > + intel_wakeref_t wakeref; > > > unsigned int bound; > > > int err; > > > > > > @@ -1424,8 +1438,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, > > > if (err) > > > return err; > > > > > > - if (flags & PIN_GLOBAL) > > > - wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm); > > > + wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm); > > > > > > if (flags & vma->vm->bind_async_flags) { > > > /* lock VM */ > > > @@ -1561,8 +1574,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, > > > if (work) > > > dma_fence_work_commit_imm(&work->base); > > > err_rpm: > > > - if (wakeref) > > > - intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref); > > > + intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref); > > > > > > if (moving) > > > dma_fence_put(moving); > > > diff --git a/drivers/gpu/drm/i915/i915_vma_types.h b/drivers/gpu/drm/i915/i915_vma_types.h > > > index 64472b7f0e770..f0086fadff4d3 100644 > > > --- a/drivers/gpu/drm/i915/i915_vma_types.h > > > +++ b/drivers/gpu/drm/i915/i915_vma_types.h > > > @@ -264,6 +264,7 @@ struct i915_vma { > > > #define I915_VMA_SCANOUT ((int)BIT(I915_VMA_SCANOUT_BIT)) > > > > > > struct i915_active active; > > > + intel_wakeref_t wakeref; > > > > > > #define I915_VMA_PAGES_BIAS 24 > > > #define I915_VMA_PAGES_ACTIVE (BIT(24) | 1) > > > > > > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 1/3] drm/i915/vma: Fix UAF on destroy against retire race @ 2024-01-23 22:33 ` Rodrigo Vivi 0 siblings, 0 replies; 19+ messages in thread From: Rodrigo Vivi @ 2024-01-23 22:33 UTC (permalink / raw) To: Janusz Krzysztofik Cc: Tvrtko Ursulin, Andi Shyti, Thomas Hellström, Chris Wilson, intel-gfx, Andrzej Hajda, dri-devel, Daniel Vetter, Rodrigo Vivi, David Airlie, Nirmoy Das On Tue, Jan 23, 2024 at 11:51:15AM +0100, Janusz Krzysztofik wrote: > Hi Rodrigo, > > Thank you for review. > > On Monday, 22 January 2024 22:09:38 CET Rodrigo Vivi wrote: > > On Mon, Jan 22, 2024 at 03:04:42PM +0100, Janusz Krzysztofik wrote: > > > Object debugging tools were sporadically reporting illegal attempts to > > > free a still active i915 VMA object when parking a GPU tile believed to be > > > idle. > > > > > > [161.359441] ODEBUG: free active (active state 0) object: ffff88811643b958 object type: i915_active hint: __i915_vma_active+0x0/0x50 [i915] > > > [161.360082] WARNING: CPU: 5 PID: 276 at lib/debugobjects.c:514 debug_print_object+0x80/0xb0 > > > ... > > > [161.360304] CPU: 5 PID: 276 Comm: kworker/5:2 Not tainted 6.5.0-rc1-CI_DRM_13375-g003f860e5577+ #1 > > > [161.360314] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022 > > > [161.360322] Workqueue: i915-unordered __intel_wakeref_put_work [i915] > > > [161.360592] RIP: 0010:debug_print_object+0x80/0xb0 > > > ... > > > [161.361347] debug_object_free+0xeb/0x110 > > > [161.361362] i915_active_fini+0x14/0x130 [i915] > > > [161.361866] release_references+0xfe/0x1f0 [i915] > > > [161.362543] i915_vma_parked+0x1db/0x380 [i915] > > > [161.363129] __gt_park+0x121/0x230 [i915] > > > [161.363515] ____intel_wakeref_put_last+0x1f/0x70 [i915] > > > > > > That has been tracked down to be happening when another thread is > > > deactivating the VMA inside __active_retire() helper, after the VMA's > > > active counter has been already decremented to 0, but before deactivation > > > of the VMA's object is reported to the object debugging tool. > > > > > > We could prevent from that race by serializing i915_active_fini() with > > > __active_retire() via ref->tree_lock, but that wouldn't stop the VMA from > > > being used, e.g. from __i915_vma_retire() called at the end of > > > __active_retire(), after that VMA has been already freed by a concurrent > > > i915_vma_destroy() on return from the i915_active_fini(). Then, we should > > > rather fix the issue at the VMA level, not in i915_active. > > > > > > Since __i915_vma_parked() is called from __gt_park() on last put of the > > > GT's wakeref, the issue could be addressed by holding the GT wakeref long > > > enough for __active_retire() to complete before that wakeref is released > > > and the GT parked. > > > > > > A VMA associated with a request doesn't acquire a GT wakeref by itself. > > > Instead, it depends on a wakeref held directly by the request's active > > > intel_context for a GT associated with its VM, and indirectly on that > > > intel_context's engine wakeref if the engine belongs to the same GT as the > > > VMA's VM. In case of single-tile platforms, at least one of those > > > wakerefs is usually held long enough for the request's VMA to be > > > deactivated on time, before it is destroyed on last put of its VM GT > > > wakeref. However, on multi-tile platforms, a request may use a VMA from a > > > tile other than the one that hosts the request's engine, then it is > > > protected only with the intel_context's VM GT wakeref. > > > > > > There was an attempt to fix this issue on 2-tile Meteor Lake by acquiring > > > > please do not confuse the terminology here. MTL is 1-tile platform, > > with multiple GTs (1 for Render/Compute and 1 for Media). > > I didn't realize that "tile" is not the same as "GT". I can review the whole > description and replace all occurrences of "tile" with "GT". yeap, in i915 it is a 1-1 map... a tile is implemented within the intel_gt. In Xe there's a good split and a good doc picture is here: https://dri.freedesktop.org/docs/drm/gpu/driver-uapi.html#drm-xe-uapi > > > > > Also you could probably avoid mentioning the other case here when > > you are actively trying to resolve the RKL's single GT case. > > OK, but let me keep that part of commit description in the cover letter then, > because: > - historically, the issue was more frequently reproduced in CI on MTL than on > other platforms, and that was the initial scope I started working on, > - the full description as is better reflects phases of my process of > reproduction and root cause analysis of the issue, > - the MTL case was specifically addressed by the former insufficient > workaround which now I'm now proposing to drop. > > > > an extra wakeref for a Primary GT from i915_gem_do_execbuffer() -- see > > > commit f56fe3e91787 ("drm/i915: Fix a VMA UAF for multi-gt platform"). > > > However, it occurred insufficient -- the issue was still reported by CI. > > > That wakeref was released on exit from i915_gem_do_execbuffer(), then > > > potentially before completion of the request and deactivation of its > > > associated VMAs. > > > > > > OTOH, CI reports indicate that single-tile platforms also suffer > > > sporadically from the same race. > > > > > > I believe the issue was introduced by commit d93939730347 ("drm/i915: > > > Remove the vma refcount") which moved a call to i915_active_fini() from > > > a dropped i915_vma_release(), called on last put of the removed VMA kref, > > > to i915_vma_parked() processing path called on last put of a GT wakeref. > > > However, its visibility to the object debugging tool was suppressed by a > > > bug in i915_active that was fixed two weeks later with commit e92eb246feb9 > > > ("drm/i915/active: Fix missing debug object activation"). > > > > > > Fix the issue by getting a wakeref for the VMA's tile when activating it, > > > and putting that wakeref only after the VMA is deactivated. However, > > > exclude global GTT from that processing path, otherwise the GPU never goes > > > idle. Since __i915_vma_retire() may be called from atomic contexts, use > > > async variant of wakeref put. > > > > okay, this explains the first block of the patch below, but I'm afraid > > that it doesn't explain why: > > > > - if (flags & PIN_GLOBAL) > > That's explained in v2 changelog below, I believe. I can add that information > to the body of commit description as well. > > Thanks, > Janusz > > > > > > > > v4: Refresh on top of commit 5e4e06e4087e ("drm/i915: Track gt pm > > > wakerefs") (Andi), > > > - for more easy backporting, split out removal of former insufficient > > > workarounds and move them to separate patches (Nirmoy). > > > - clean up commit message and description a bit. > > > v3: Identify root cause more precisely, and a commit to blame, > > > - identify and drop former workarounds, > > > - update commit message and description. > > > v2: Get the wakeref before VM mutex to avoid circular locking dependency, > > > - drop questionable Fixes: tag. > > > > > > Fixes: d93939730347 ("drm/i915: Remove the vma refcount") > > > Closes: https://gitlab.freedesktop.org/drm/intel/issues/8875 > > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> > > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> > > > Cc: Nirmoy Das <nirmoy.das@intel.com> > > > Cc: Andi Shyti <andi.shyti@linux.intel.com> > > > Cc: stable@vger.kernel.org # v5.19+ > > > --- > > > drivers/gpu/drm/i915/i915_vma.c | 26 +++++++++++++++++++------- > > > drivers/gpu/drm/i915/i915_vma_types.h | 1 + > > > 2 files changed, 20 insertions(+), 7 deletions(-) > > > > > > diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c > > > index d09aad34ba37f..604d420b9e1fd 100644 > > > --- a/drivers/gpu/drm/i915/i915_vma.c > > > +++ b/drivers/gpu/drm/i915/i915_vma.c > > > @@ -34,6 +34,7 @@ > > > #include "gt/intel_engine.h" > > > #include "gt/intel_engine_heartbeat.h" > > > #include "gt/intel_gt.h" > > > +#include "gt/intel_gt_pm.h" > > > #include "gt/intel_gt_requests.h" > > > #include "gt/intel_tlb.h" > > > > > > @@ -103,12 +104,25 @@ static inline struct i915_vma *active_to_vma(struct i915_active *ref) > > > > > > static int __i915_vma_active(struct i915_active *ref) > > > { > > > - return i915_vma_tryget(active_to_vma(ref)) ? 0 : -ENOENT; > > > + struct i915_vma *vma = active_to_vma(ref); > > > + > > > + if (!i915_vma_tryget(vma)) > > > + return -ENOENT; > > > + > > > + if (!i915_vma_is_ggtt(vma)) > > > + vma->wakeref = intel_gt_pm_get(vma->vm->gt); > > > + > > > + return 0; > > > } > > > > > > static void __i915_vma_retire(struct i915_active *ref) > > > { > > > - i915_vma_put(active_to_vma(ref)); > > > + struct i915_vma *vma = active_to_vma(ref); > > > + > > > + if (!i915_vma_is_ggtt(vma)) > > > + intel_gt_pm_put_async(vma->vm->gt, vma->wakeref); > > > + > > > + i915_vma_put(vma); > > > } > > > > > > static struct i915_vma * > > > @@ -1404,7 +1418,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, > > > struct i915_vma_work *work = NULL; > > > struct dma_fence *moving = NULL; > > > struct i915_vma_resource *vma_res = NULL; > > > - intel_wakeref_t wakeref = 0; > > > + intel_wakeref_t wakeref; > > > unsigned int bound; > > > int err; > > > > > > @@ -1424,8 +1438,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, > > > if (err) > > > return err; > > > > > > - if (flags & PIN_GLOBAL) > > > - wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm); > > > + wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm); > > > > > > if (flags & vma->vm->bind_async_flags) { > > > /* lock VM */ > > > @@ -1561,8 +1574,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct i915_gem_ww_ctx *ww, > > > if (work) > > > dma_fence_work_commit_imm(&work->base); > > > err_rpm: > > > - if (wakeref) > > > - intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref); > > > + intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref); > > > > > > if (moving) > > > dma_fence_put(moving); > > > diff --git a/drivers/gpu/drm/i915/i915_vma_types.h b/drivers/gpu/drm/i915/i915_vma_types.h > > > index 64472b7f0e770..f0086fadff4d3 100644 > > > --- a/drivers/gpu/drm/i915/i915_vma_types.h > > > +++ b/drivers/gpu/drm/i915/i915_vma_types.h > > > @@ -264,6 +264,7 @@ struct i915_vma { > > > #define I915_VMA_SCANOUT ((int)BIT(I915_VMA_SCANOUT_BIT)) > > > > > > struct i915_active active; > > > + intel_wakeref_t wakeref; > > > > > > #define I915_VMA_PAGES_BIAS 24 > > > #define I915_VMA_PAGES_ACTIVE (BIT(24) | 1) > > > > > > ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH v4 2/3] Manually revert "drm/i915: Fix a VMA UAF for multi-gt platform" 2024-01-22 14:04 ` Janusz Krzysztofik @ 2024-01-22 14:04 ` Janusz Krzysztofik -1 siblings, 0 replies; 19+ messages in thread From: Janusz Krzysztofik @ 2024-01-22 14:04 UTC (permalink / raw) To: intel-gfx Cc: Thomas Hellström, Chris Wilson, Andrzej Hajda, dri-devel, Daniel Vetter, Rodrigo Vivi, David Airlie, Nirmoy Das This reverts changes introduced by commit f56fe3e91787, obsoleted by "drm/i915/vma: Fix UAF on destroy against retire race". Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> --- drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 17 ----------------- 1 file changed, 17 deletions(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c index d3a771afb083e..cedca8fd8754d 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c @@ -2686,7 +2686,6 @@ static int eb_select_engine(struct i915_execbuffer *eb) { struct intel_context *ce, *child; - struct intel_gt *gt; unsigned int idx; int err; @@ -2710,17 +2709,10 @@ eb_select_engine(struct i915_execbuffer *eb) } } eb->num_batches = ce->parallel.number_children + 1; - gt = ce->engine->gt; for_each_child(ce, child) intel_context_get(child); eb->wakeref = intel_gt_pm_get(ce->engine->gt); - /* - * Keep GT0 active on MTL so that i915_vma_parked() doesn't - * free VMAs while execbuf ioctl is validating VMAs. - */ - if (gt->info.id) - eb->wakeref_gt0 = intel_gt_pm_get(to_gt(gt->i915)); if (!test_bit(CONTEXT_ALLOC_BIT, &ce->flags)) { err = intel_context_alloc_state(ce); @@ -2759,9 +2751,6 @@ eb_select_engine(struct i915_execbuffer *eb) return err; err: - if (gt->info.id) - intel_gt_pm_put(to_gt(gt->i915), eb->wakeref_gt0); - intel_gt_pm_put(ce->engine->gt, eb->wakeref); for_each_child(ce, child) intel_context_put(child); @@ -2775,12 +2764,6 @@ eb_put_engine(struct i915_execbuffer *eb) struct intel_context *child; i915_vm_put(eb->context->vm); - /* - * This works in conjunction with eb_select_engine() to prevent - * i915_vma_parked() from interfering while execbuf validates vmas. - */ - if (eb->gt->info.id) - intel_gt_pm_put(to_gt(eb->gt->i915), eb->wakeref_gt0); intel_gt_pm_put(eb->context->engine->gt, eb->wakeref); for_each_child(eb->context, child) intel_context_put(child); -- 2.43.0 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH v4 2/3] Manually revert "drm/i915: Fix a VMA UAF for multi-gt platform" @ 2024-01-22 14:04 ` Janusz Krzysztofik 0 siblings, 0 replies; 19+ messages in thread From: Janusz Krzysztofik @ 2024-01-22 14:04 UTC (permalink / raw) To: intel-gfx Cc: Tvrtko Ursulin, Andi Shyti, Thomas Hellström, Chris Wilson, Andrzej Hajda, dri-devel, Daniel Vetter, Rodrigo Vivi, Janusz Krzysztofik, David Airlie, Nirmoy Das This reverts changes introduced by commit f56fe3e91787, obsoleted by "drm/i915/vma: Fix UAF on destroy against retire race". Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> --- drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 17 ----------------- 1 file changed, 17 deletions(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c index d3a771afb083e..cedca8fd8754d 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c @@ -2686,7 +2686,6 @@ static int eb_select_engine(struct i915_execbuffer *eb) { struct intel_context *ce, *child; - struct intel_gt *gt; unsigned int idx; int err; @@ -2710,17 +2709,10 @@ eb_select_engine(struct i915_execbuffer *eb) } } eb->num_batches = ce->parallel.number_children + 1; - gt = ce->engine->gt; for_each_child(ce, child) intel_context_get(child); eb->wakeref = intel_gt_pm_get(ce->engine->gt); - /* - * Keep GT0 active on MTL so that i915_vma_parked() doesn't - * free VMAs while execbuf ioctl is validating VMAs. - */ - if (gt->info.id) - eb->wakeref_gt0 = intel_gt_pm_get(to_gt(gt->i915)); if (!test_bit(CONTEXT_ALLOC_BIT, &ce->flags)) { err = intel_context_alloc_state(ce); @@ -2759,9 +2751,6 @@ eb_select_engine(struct i915_execbuffer *eb) return err; err: - if (gt->info.id) - intel_gt_pm_put(to_gt(gt->i915), eb->wakeref_gt0); - intel_gt_pm_put(ce->engine->gt, eb->wakeref); for_each_child(ce, child) intel_context_put(child); @@ -2775,12 +2764,6 @@ eb_put_engine(struct i915_execbuffer *eb) struct intel_context *child; i915_vm_put(eb->context->vm); - /* - * This works in conjunction with eb_select_engine() to prevent - * i915_vma_parked() from interfering while execbuf validates vmas. - */ - if (eb->gt->info.id) - intel_gt_pm_put(to_gt(eb->gt->i915), eb->wakeref_gt0); intel_gt_pm_put(eb->context->engine->gt, eb->wakeref); for_each_child(eb->context, child) intel_context_put(child); -- 2.43.0 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH v4 2/3] Manually revert "drm/i915: Fix a VMA UAF for multi-gt platform" 2024-01-22 14:04 ` Janusz Krzysztofik @ 2024-01-22 21:13 ` Rodrigo Vivi -1 siblings, 0 replies; 19+ messages in thread From: Rodrigo Vivi @ 2024-01-22 21:13 UTC (permalink / raw) To: Janusz Krzysztofik Cc: Thomas Hellström, Chris Wilson, intel-gfx, Andrzej Hajda, dri-devel, Daniel Vetter, David Airlie, Nirmoy Das On Mon, Jan 22, 2024 at 03:04:43PM +0100, Janusz Krzysztofik wrote: > This reverts changes introduced by commit f56fe3e91787, obsoleted by > "drm/i915/vma: Fix UAF on destroy against retire race". I believe the good chunk of the first commit message should be moved here instead. But why did you do a 'manually revert'? revert itself didn't apply? maybe you could avoid the word 'revert' and use something like Remove extra multi-gt pm-references... or something like that. > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> > --- > drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 17 ----------------- > 1 file changed, 17 deletions(-) > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c > index d3a771afb083e..cedca8fd8754d 100644 > --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c > +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c > @@ -2686,7 +2686,6 @@ static int > eb_select_engine(struct i915_execbuffer *eb) > { > struct intel_context *ce, *child; > - struct intel_gt *gt; > unsigned int idx; > int err; > > @@ -2710,17 +2709,10 @@ eb_select_engine(struct i915_execbuffer *eb) > } > } > eb->num_batches = ce->parallel.number_children + 1; > - gt = ce->engine->gt; > > for_each_child(ce, child) > intel_context_get(child); > eb->wakeref = intel_gt_pm_get(ce->engine->gt); > - /* > - * Keep GT0 active on MTL so that i915_vma_parked() doesn't > - * free VMAs while execbuf ioctl is validating VMAs. > - */ > - if (gt->info.id) > - eb->wakeref_gt0 = intel_gt_pm_get(to_gt(gt->i915)); > > if (!test_bit(CONTEXT_ALLOC_BIT, &ce->flags)) { > err = intel_context_alloc_state(ce); > @@ -2759,9 +2751,6 @@ eb_select_engine(struct i915_execbuffer *eb) > return err; > > err: > - if (gt->info.id) > - intel_gt_pm_put(to_gt(gt->i915), eb->wakeref_gt0); > - > intel_gt_pm_put(ce->engine->gt, eb->wakeref); > for_each_child(ce, child) > intel_context_put(child); > @@ -2775,12 +2764,6 @@ eb_put_engine(struct i915_execbuffer *eb) > struct intel_context *child; > > i915_vm_put(eb->context->vm); > - /* > - * This works in conjunction with eb_select_engine() to prevent > - * i915_vma_parked() from interfering while execbuf validates vmas. > - */ > - if (eb->gt->info.id) > - intel_gt_pm_put(to_gt(eb->gt->i915), eb->wakeref_gt0); > intel_gt_pm_put(eb->context->engine->gt, eb->wakeref); > for_each_child(eb->context, child) > intel_context_put(child); > -- > 2.43.0 > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 2/3] Manually revert "drm/i915: Fix a VMA UAF for multi-gt platform" @ 2024-01-22 21:13 ` Rodrigo Vivi 0 siblings, 0 replies; 19+ messages in thread From: Rodrigo Vivi @ 2024-01-22 21:13 UTC (permalink / raw) To: Janusz Krzysztofik Cc: Tvrtko Ursulin, Andi Shyti, Thomas Hellström, Chris Wilson, intel-gfx, Andrzej Hajda, dri-devel, Daniel Vetter, David Airlie, Nirmoy Das On Mon, Jan 22, 2024 at 03:04:43PM +0100, Janusz Krzysztofik wrote: > This reverts changes introduced by commit f56fe3e91787, obsoleted by > "drm/i915/vma: Fix UAF on destroy against retire race". I believe the good chunk of the first commit message should be moved here instead. But why did you do a 'manually revert'? revert itself didn't apply? maybe you could avoid the word 'revert' and use something like Remove extra multi-gt pm-references... or something like that. > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> > --- > drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 17 ----------------- > 1 file changed, 17 deletions(-) > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c > index d3a771afb083e..cedca8fd8754d 100644 > --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c > +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c > @@ -2686,7 +2686,6 @@ static int > eb_select_engine(struct i915_execbuffer *eb) > { > struct intel_context *ce, *child; > - struct intel_gt *gt; > unsigned int idx; > int err; > > @@ -2710,17 +2709,10 @@ eb_select_engine(struct i915_execbuffer *eb) > } > } > eb->num_batches = ce->parallel.number_children + 1; > - gt = ce->engine->gt; > > for_each_child(ce, child) > intel_context_get(child); > eb->wakeref = intel_gt_pm_get(ce->engine->gt); > - /* > - * Keep GT0 active on MTL so that i915_vma_parked() doesn't > - * free VMAs while execbuf ioctl is validating VMAs. > - */ > - if (gt->info.id) > - eb->wakeref_gt0 = intel_gt_pm_get(to_gt(gt->i915)); > > if (!test_bit(CONTEXT_ALLOC_BIT, &ce->flags)) { > err = intel_context_alloc_state(ce); > @@ -2759,9 +2751,6 @@ eb_select_engine(struct i915_execbuffer *eb) > return err; > > err: > - if (gt->info.id) > - intel_gt_pm_put(to_gt(gt->i915), eb->wakeref_gt0); > - > intel_gt_pm_put(ce->engine->gt, eb->wakeref); > for_each_child(ce, child) > intel_context_put(child); > @@ -2775,12 +2764,6 @@ eb_put_engine(struct i915_execbuffer *eb) > struct intel_context *child; > > i915_vm_put(eb->context->vm); > - /* > - * This works in conjunction with eb_select_engine() to prevent > - * i915_vma_parked() from interfering while execbuf validates vmas. > - */ > - if (eb->gt->info.id) > - intel_gt_pm_put(to_gt(eb->gt->i915), eb->wakeref_gt0); > intel_gt_pm_put(eb->context->engine->gt, eb->wakeref); > for_each_child(eb->context, child) > intel_context_put(child); > -- > 2.43.0 > ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH v4 3/3] Revert "drm/i915: Wait for active retire before i915_active_fini()" 2024-01-22 14:04 ` Janusz Krzysztofik @ 2024-01-22 14:04 ` Janusz Krzysztofik -1 siblings, 0 replies; 19+ messages in thread From: Janusz Krzysztofik @ 2024-01-22 14:04 UTC (permalink / raw) To: intel-gfx Cc: Thomas Hellström, Chris Wilson, Andrzej Hajda, dri-devel, Daniel Vetter, Rodrigo Vivi, David Airlie, Nirmoy Das This reverts commit 7a2280e8dcd2f1f436db9631287c0b21cf6a92b0, obsoleted by "drm/i915/vma: Fix UAF on destroy against retire race". Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> --- drivers/gpu/drm/i915/i915_vma.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c index 604d420b9e1fd..09b8a6b52d065 100644 --- a/drivers/gpu/drm/i915/i915_vma.c +++ b/drivers/gpu/drm/i915/i915_vma.c @@ -1752,8 +1752,6 @@ static void release_references(struct i915_vma *vma, struct intel_gt *gt, if (vm_ddestroy) i915_vm_resv_put(vma->vm); - /* Wait for async active retire */ - i915_active_wait(&vma->active); i915_active_fini(&vma->active); GEM_WARN_ON(vma->resource); i915_vma_free(vma); -- 2.43.0 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH v4 3/3] Revert "drm/i915: Wait for active retire before i915_active_fini()" @ 2024-01-22 14:04 ` Janusz Krzysztofik 0 siblings, 0 replies; 19+ messages in thread From: Janusz Krzysztofik @ 2024-01-22 14:04 UTC (permalink / raw) To: intel-gfx Cc: Tvrtko Ursulin, Andi Shyti, Thomas Hellström, Chris Wilson, Andrzej Hajda, dri-devel, Daniel Vetter, Rodrigo Vivi, Janusz Krzysztofik, David Airlie, Nirmoy Das This reverts commit 7a2280e8dcd2f1f436db9631287c0b21cf6a92b0, obsoleted by "drm/i915/vma: Fix UAF on destroy against retire race". Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> --- drivers/gpu/drm/i915/i915_vma.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c index 604d420b9e1fd..09b8a6b52d065 100644 --- a/drivers/gpu/drm/i915/i915_vma.c +++ b/drivers/gpu/drm/i915/i915_vma.c @@ -1752,8 +1752,6 @@ static void release_references(struct i915_vma *vma, struct intel_gt *gt, if (vm_ddestroy) i915_vm_resv_put(vma->vm); - /* Wait for async active retire */ - i915_active_wait(&vma->active); i915_active_fini(&vma->active); GEM_WARN_ON(vma->resource); i915_vma_free(vma); -- 2.43.0 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Fix VMA UAF on destroy against deactivate race 2024-01-22 14:04 ` Janusz Krzysztofik ` (3 preceding siblings ...) (?) @ 2024-01-22 21:05 ` Patchwork -1 siblings, 0 replies; 19+ messages in thread From: Patchwork @ 2024-01-22 21:05 UTC (permalink / raw) To: Janusz Krzysztofik; +Cc: intel-gfx == Series Details == Series: drm/i915: Fix VMA UAF on destroy against deactivate race URL : https://patchwork.freedesktop.org/series/129026/ State : warning == Summary == Error: dim checkpatch failed 55cc6a52bd6b drm/i915/vma: Fix UAF on destroy against retire race 90d6c8b0ef9a Manually revert "drm/i915: Fix a VMA UAF for multi-gt platform" -:7: ERROR:GIT_COMMIT_ID: Please use git commit description style 'commit <12+ chars of sha1> ("<title line>")' - ie: 'commit f56fe3e91787 ("drm/i915: Fix a VMA UAF for multi-gt platform")' #7: This reverts changes introduced by commit f56fe3e91787, obsoleted by total: 1 errors, 0 warnings, 0 checks, 45 lines checked 940d2453dbc1 Revert "drm/i915: Wait for active retire before i915_active_fini()" ^ permalink raw reply [flat|nested] 19+ messages in thread
* ✗ Fi.CI.SPARSE: warning for drm/i915: Fix VMA UAF on destroy against deactivate race 2024-01-22 14:04 ` Janusz Krzysztofik ` (4 preceding siblings ...) (?) @ 2024-01-22 21:05 ` Patchwork -1 siblings, 0 replies; 19+ messages in thread From: Patchwork @ 2024-01-22 21:05 UTC (permalink / raw) To: Janusz Krzysztofik; +Cc: intel-gfx == Series Details == Series: drm/i915: Fix VMA UAF on destroy against deactivate race URL : https://patchwork.freedesktop.org/series/129026/ State : warning == Summary == Error: dim sparse failed Sparse version: v0.6.2 Fast mode used, each commit won't be checked separately. ^ permalink raw reply [flat|nested] 19+ messages in thread
* ✗ Fi.CI.BAT: failure for drm/i915: Fix VMA UAF on destroy against deactivate race 2024-01-22 14:04 ` Janusz Krzysztofik ` (5 preceding siblings ...) (?) @ 2024-01-22 21:23 ` Patchwork -1 siblings, 0 replies; 19+ messages in thread From: Patchwork @ 2024-01-22 21:23 UTC (permalink / raw) To: Janusz Krzysztofik; +Cc: intel-gfx [-- Attachment #1: Type: text/plain, Size: 4302 bytes --] == Series Details == Series: drm/i915: Fix VMA UAF on destroy against deactivate race URL : https://patchwork.freedesktop.org/series/129026/ State : failure == Summary == CI Bug Log - changes from CI_DRM_14159 -> Patchwork_129026v1 ==================================================== Summary ------- **FAILURE** Serious unknown changes coming with Patchwork_129026v1 absolutely need to be verified manually. If you think the reported changes have nothing to do with the changes introduced in Patchwork_129026v1, please notify your bug team (I915-ci-infra@lists.freedesktop.org) to allow them to document this new failure mode, which will reduce false positives in CI. External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_129026v1/index.html Participating hosts (36 -> 36) ------------------------------ Additional (1): fi-pnv-d510 Missing (1): fi-snb-2520m Possible new issues ------------------- Here are the unknown changes that may have been introduced in Patchwork_129026v1: ### IGT changes ### #### Possible regressions #### * igt@core_hotunplug@unbind-rebind: - fi-tgl-1115g4: [PASS][1] -> [INCOMPLETE][2] [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_14159/fi-tgl-1115g4/igt@core_hotunplug@unbind-rebind.html [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_129026v1/fi-tgl-1115g4/igt@core_hotunplug@unbind-rebind.html #### Suppressed #### The following results come from untrusted machines, tests, or statuses. They do not affect the overall result. * igt@core_hotunplug@unbind-rebind: - {bat-adls-6}: [PASS][3] -> [ABORT][4] [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_14159/bat-adls-6/igt@core_hotunplug@unbind-rebind.html [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_129026v1/bat-adls-6/igt@core_hotunplug@unbind-rebind.html Known issues ------------ Here are the changes found in Patchwork_129026v1 that come from known issues: ### CI changes ### #### Issues hit #### * boot: - bat-jsl-1: [PASS][5] -> [FAIL][6] ([i915#8293]) [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_14159/bat-jsl-1/boot.html [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_129026v1/bat-jsl-1/boot.html ### IGT changes ### #### Issues hit #### * igt@gem_lmem_swapping@basic: - fi-pnv-d510: NOTRUN -> [SKIP][7] ([fdo#109271]) +28 other tests skip [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_129026v1/fi-pnv-d510/igt@gem_lmem_swapping@basic.html * igt@i915_pciid: - fi-apl-guc: [PASS][8] -> [INCOMPLETE][9] ([i915#2295]) [8]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_14159/fi-apl-guc/igt@i915_pciid.html [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_129026v1/fi-apl-guc/igt@i915_pciid.html #### Possible fixes #### * igt@i915_selftest@live@hangcheck: - {bat-rpls-3}: [DMESG-WARN][10] ([i915#5591]) -> [PASS][11] [10]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_14159/bat-rpls-3/igt@i915_selftest@live@hangcheck.html [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_129026v1/bat-rpls-3/igt@i915_selftest@live@hangcheck.html {name}: This element is suppressed. This means it is ignored when computing the status of the difference (SUCCESS, WARNING, or FAILURE). [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271 [i915#2295]: https://gitlab.freedesktop.org/drm/intel/issues/2295 [i915#5591]: https://gitlab.freedesktop.org/drm/intel/issues/5591 [i915#8293]: https://gitlab.freedesktop.org/drm/intel/issues/8293 Build changes ------------- * Linux: CI_DRM_14159 -> Patchwork_129026v1 CI-20190529: 20190529 CI_DRM_14159: 7f85c565ce86f84fb5ef9ec4cf4c67b8e021710c @ git://anongit.freedesktop.org/gfx-ci/linux IGT_7685: 7685 Patchwork_129026v1: 7f85c565ce86f84fb5ef9ec4cf4c67b8e021710c @ git://anongit.freedesktop.org/gfx-ci/linux ### Linux commits c48edaec0e17 Revert "drm/i915: Wait for active retire before i915_active_fini()" ef8f2581a395 Manually revert "drm/i915: Fix a VMA UAF for multi-gt platform" 5bd2664bd952 drm/i915/vma: Fix UAF on destroy against retire race == Logs == For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_129026v1/index.html [-- Attachment #2: Type: text/html, Size: 5088 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2024-01-23 22:34 UTC | newest] Thread overview: 19+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-01-22 14:04 [PATCH v4 0/3] drm/i915: Fix VMA UAF on destroy against deactivate race Janusz Krzysztofik 2024-01-22 14:04 ` Janusz Krzysztofik 2024-01-22 14:04 ` [PATCH v4 1/3] drm/i915/vma: Fix UAF on destroy against retire race Janusz Krzysztofik 2024-01-22 14:04 ` Janusz Krzysztofik 2024-01-22 21:09 ` Rodrigo Vivi 2024-01-22 21:09 ` Rodrigo Vivi 2024-01-23 10:51 ` Janusz Krzysztofik 2024-01-23 10:51 ` Janusz Krzysztofik 2024-01-23 22:33 ` Rodrigo Vivi 2024-01-23 22:33 ` Rodrigo Vivi 2024-01-22 14:04 ` [PATCH v4 2/3] Manually revert "drm/i915: Fix a VMA UAF for multi-gt platform" Janusz Krzysztofik 2024-01-22 14:04 ` Janusz Krzysztofik 2024-01-22 21:13 ` Rodrigo Vivi 2024-01-22 21:13 ` Rodrigo Vivi 2024-01-22 14:04 ` [PATCH v4 3/3] Revert "drm/i915: Wait for active retire before i915_active_fini()" Janusz Krzysztofik 2024-01-22 14:04 ` Janusz Krzysztofik 2024-01-22 21:05 ` ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Fix VMA UAF on destroy against deactivate race Patchwork 2024-01-22 21:05 ` ✗ Fi.CI.SPARSE: " Patchwork 2024-01-22 21:23 ` ✗ Fi.CI.BAT: failure " Patchwork
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.