public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] drm/amd: fix VRR race condition during IRQ handling
@ 2024-09-02  9:40 tjakobi
  2024-09-02  9:40 ` [PATCH 1/2] drm/amd/display: Avoid race between dcn10_set_drr() and dc_state_destruct() tjakobi
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: tjakobi @ 2024-09-02  9:40 UTC (permalink / raw)
  To: amd-gfx, dri-devel, linux-kernel; +Cc: Tobias Jakobi

From: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>

Hello,

this fixes a nasty race condition in the set_drr() callbacks for DCN10
and DCN35 that has existed now since quite some time, see this GitLab
issue for reference.

https://gitlab.freedesktop.org/drm/amd/-/issues/3142

The report just focuses von DCN10, but the same problem also exists in
the DCN35 code.

With best wishes,
Tobias

Tobias Jakobi (2):
  drm/amd/display: Avoid race between dcn10_set_drr() and
    dc_state_destruct()
  drm/amd/display: Avoid race between dcn35_set_drr() and
    dc_state_destruct()

 .../amd/display/dc/hwss/dcn10/dcn10_hwseq.c   | 20 +++++++++++--------
 .../amd/display/dc/hwss/dcn35/dcn35_hwseq.c   | 20 +++++++++++--------
 2 files changed, 24 insertions(+), 16 deletions(-)

-- 
2.44.2


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/2] drm/amd/display: Avoid race between dcn10_set_drr() and dc_state_destruct()
  2024-09-02  9:40 [PATCH 0/2] drm/amd: fix VRR race condition during IRQ handling tjakobi
@ 2024-09-02  9:40 ` tjakobi
  2024-09-02 11:03   ` Sefa Eyeoglu
                     ` (3 more replies)
  2024-09-02  9:40 ` [PATCH 2/2] drm/amd/display: Avoid race between dcn35_set_drr() " tjakobi
  2024-09-08  7:35 ` [PATCH 0/2] drm/amd: fix VRR race condition during IRQ handling Christopher Snowhill
  2 siblings, 4 replies; 14+ messages in thread
From: tjakobi @ 2024-09-02  9:40 UTC (permalink / raw)
  To: Harry Wentland, Leo Li, Rodrigo Siqueira, Alex Deucher,
	Christian König, Pan, Xinhui, David Airlie, Daniel Vetter,
	Mario Limonciello
  Cc: Tobias Jakobi, amd-gfx, dri-devel, linux-kernel

From: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>

dc_state_destruct() nulls the resource context of the DC state. The pipe
context passed to dcn10_set_drr() is a member of this resource context.

If dc_state_destruct() is called parallel to the IRQ processing (which
calls dcn10_set_drr() at some point), we can end up using already nulled
function callback fields of struct stream_resource.

The logic in dcn10_set_drr() already tries to avoid this, by checking tg
against NULL. But if the nulling happens exactly after the NULL check and
before the next access, then we get a race.

Avoid this by copying tg first to a local variable, and then use this
variable for all the operations. This should work, as long as nobody
frees the resource pool where the timing generators live.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3142
Fixes: 06ad7e164256 ("drm/amd/display: Destroy DC context while keeping DML and DML2")
Signed-off-by: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>
---
 .../amd/display/dc/hwss/dcn10/dcn10_hwseq.c   | 20 +++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c b/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
index 3306684e805a..da8f2cb3c5db 100644
--- a/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
+++ b/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
@@ -3223,15 +3223,19 @@ void dcn10_set_drr(struct pipe_ctx **pipe_ctx,
 	 * as well.
 	 */
 	for (i = 0; i < num_pipes; i++) {
-		if ((pipe_ctx[i]->stream_res.tg != NULL) && pipe_ctx[i]->stream_res.tg->funcs) {
-			if (pipe_ctx[i]->stream_res.tg->funcs->set_drr)
-				pipe_ctx[i]->stream_res.tg->funcs->set_drr(
-					pipe_ctx[i]->stream_res.tg, &params);
+		/* dc_state_destruct() might null the stream resources, so fetch tg
+		 * here first to avoid a race condition. The lifetime of the pointee
+		 * itself (the timing_generator object) is not a problem here.
+		 */
+		struct timing_generator *tg = pipe_ctx[i]->stream_res.tg;
+
+		if ((tg != NULL) && tg->funcs) {
+			if (tg->funcs->set_drr)
+				tg->funcs->set_drr(tg, &params);
 			if (adjust.v_total_max != 0 && adjust.v_total_min != 0)
-				if (pipe_ctx[i]->stream_res.tg->funcs->set_static_screen_control)
-					pipe_ctx[i]->stream_res.tg->funcs->set_static_screen_control(
-						pipe_ctx[i]->stream_res.tg,
-						event_triggers, num_frames);
+				if (tg->funcs->set_static_screen_control)
+					tg->funcs->set_static_screen_control(
+						tg, event_triggers, num_frames);
 		}
 	}
 }
-- 
2.44.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/2] drm/amd/display: Avoid race between dcn35_set_drr() and dc_state_destruct()
  2024-09-02  9:40 [PATCH 0/2] drm/amd: fix VRR race condition during IRQ handling tjakobi
  2024-09-02  9:40 ` [PATCH 1/2] drm/amd/display: Avoid race between dcn10_set_drr() and dc_state_destruct() tjakobi
@ 2024-09-02  9:40 ` tjakobi
  2024-09-03 21:28   ` Harry Wentland
  2024-09-08  7:35 ` [PATCH 0/2] drm/amd: fix VRR race condition during IRQ handling Christopher Snowhill
  2 siblings, 1 reply; 14+ messages in thread
From: tjakobi @ 2024-09-02  9:40 UTC (permalink / raw)
  To: Harry Wentland, Leo Li, Rodrigo Siqueira, Alex Deucher,
	Christian König, Pan, Xinhui, David Airlie, Daniel Vetter,
	Mario Limonciello
  Cc: Tobias Jakobi, amd-gfx, dri-devel, linux-kernel

From: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>

dc_state_destruct() nulls the resource context of the DC state. The pipe
context passed to dcn35_set_drr() is a member of this resource context.

If dc_state_destruct() is called parallel to the IRQ processing (which
calls dcn35_set_drr() at some point), we can end up using already nulled
function callback fields of struct stream_resource.

The logic in dcn35_set_drr() already tries to avoid this, by checking tg
against NULL. But if the nulling happens exactly after the NULL check and
before the next access, then we get a race.

Avoid this by copying tg first to a local variable, and then use this
variable for all the operations. This should work, as long as nobody
frees the resource pool where the timing generators live.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3142
Fixes: 06ad7e164256 ("drm/amd/display: Destroy DC context while keeping DML and DML2")
Signed-off-by: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>
---
 .../amd/display/dc/hwss/dcn35/dcn35_hwseq.c   | 20 +++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/hwss/dcn35/dcn35_hwseq.c b/drivers/gpu/drm/amd/display/dc/hwss/dcn35/dcn35_hwseq.c
index dcced89c07b3..4e77728dac10 100644
--- a/drivers/gpu/drm/amd/display/dc/hwss/dcn35/dcn35_hwseq.c
+++ b/drivers/gpu/drm/amd/display/dc/hwss/dcn35/dcn35_hwseq.c
@@ -1370,7 +1370,13 @@ void dcn35_set_drr(struct pipe_ctx **pipe_ctx,
 	params.vertical_total_mid_frame_num = adjust.v_total_mid_frame_num;
 
 	for (i = 0; i < num_pipes; i++) {
-		if ((pipe_ctx[i]->stream_res.tg != NULL) && pipe_ctx[i]->stream_res.tg->funcs) {
+		/* dc_state_destruct() might null the stream resources, so fetch tg
+		 * here first to avoid a race condition. The lifetime of the pointee
+		 * itself (the timing_generator object) is not a problem here.
+		 */
+		struct timing_generator *tg = pipe_ctx[i]->stream_res.tg;
+
+		if ((tg != NULL) && tg->funcs) {
 			struct dc_crtc_timing *timing = &pipe_ctx[i]->stream->timing;
 			struct dc *dc = pipe_ctx[i]->stream->ctx->dc;
 
@@ -1383,14 +1389,12 @@ void dcn35_set_drr(struct pipe_ctx **pipe_ctx,
 					num_frames = 2 * (frame_rate % 60);
 				}
 			}
-			if (pipe_ctx[i]->stream_res.tg->funcs->set_drr)
-				pipe_ctx[i]->stream_res.tg->funcs->set_drr(
-					pipe_ctx[i]->stream_res.tg, &params);
+			if (tg->funcs->set_drr)
+				tg->funcs->set_drr(tg, &params);
 			if (adjust.v_total_max != 0 && adjust.v_total_min != 0)
-				if (pipe_ctx[i]->stream_res.tg->funcs->set_static_screen_control)
-					pipe_ctx[i]->stream_res.tg->funcs->set_static_screen_control(
-						pipe_ctx[i]->stream_res.tg,
-						event_triggers, num_frames);
+				if (tg->funcs->set_static_screen_control)
+					tg->funcs->set_static_screen_control(
+						tg, event_triggers, num_frames);
 		}
 	}
 }
-- 
2.44.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] drm/amd/display: Avoid race between dcn10_set_drr() and dc_state_destruct()
  2024-09-02  9:40 ` [PATCH 1/2] drm/amd/display: Avoid race between dcn10_set_drr() and dc_state_destruct() tjakobi
@ 2024-09-02 11:03   ` Sefa Eyeoglu
  2024-09-02 14:43   ` raoul.van.rueschen
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Sefa Eyeoglu @ 2024-09-02 11:03 UTC (permalink / raw)
  To: tjakobi, Harry Wentland, Leo Li, Rodrigo Siqueira, Alex Deucher,
	Christian König, Pan, Xinhui, David Airlie, Daniel Vetter,
	Mario Limonciello
  Cc: amd-gfx, dri-devel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3023 bytes --]

On Mon, 2024-09-02 at 11:40 +0200, tjakobi@math.uni-bielefeld.de wrote:
> From: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>
> 
> dc_state_destruct() nulls the resource context of the DC state. The
> pipe
> context passed to dcn10_set_drr() is a member of this resource
> context.
> 
> If dc_state_destruct() is called parallel to the IRQ processing
> (which
> calls dcn10_set_drr() at some point), we can end up using already
> nulled
> function callback fields of struct stream_resource.
> 
> The logic in dcn10_set_drr() already tries to avoid this, by checking
> tg
> against NULL. But if the nulling happens exactly after the NULL check
> and
> before the next access, then we get a race.
> 
> Avoid this by copying tg first to a local variable, and then use this
> variable for all the operations. This should work, as long as nobody
> frees the resource pool where the timing generators live.
> 
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3142
> Fixes: 06ad7e164256 ("drm/amd/display: Destroy DC context while
> keeping DML and DML2")
> Signed-off-by: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>
> ---
>  .../amd/display/dc/hwss/dcn10/dcn10_hwseq.c   | 20 +++++++++++------
> --
>  1 file changed, 12 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
> b/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
> index 3306684e805a..da8f2cb3c5db 100644
> --- a/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
> +++ b/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
> @@ -3223,15 +3223,19 @@ void dcn10_set_drr(struct pipe_ctx
> **pipe_ctx,
>  	 * as well.
>  	 */
>  	for (i = 0; i < num_pipes; i++) {
> -		if ((pipe_ctx[i]->stream_res.tg != NULL) &&
> pipe_ctx[i]->stream_res.tg->funcs) {
> -			if (pipe_ctx[i]->stream_res.tg->funcs-
> >set_drr)
> -				pipe_ctx[i]->stream_res.tg->funcs-
> >set_drr(
> -					pipe_ctx[i]->stream_res.tg,
> &params);
> +		/* dc_state_destruct() might null the stream
> resources, so fetch tg
> +		 * here first to avoid a race condition. The
> lifetime of the pointee
> +		 * itself (the timing_generator object) is not a
> problem here.
> +		 */
> +		struct timing_generator *tg = pipe_ctx[i]-
> >stream_res.tg;
> +
> +		if ((tg != NULL) && tg->funcs) {
> +			if (tg->funcs->set_drr)
> +				tg->funcs->set_drr(tg, &params);
>  			if (adjust.v_total_max != 0 &&
> adjust.v_total_min != 0)
> -				if (pipe_ctx[i]->stream_res.tg-
> >funcs->set_static_screen_control)
> -					pipe_ctx[i]->stream_res.tg-
> >funcs->set_static_screen_control(
> -						pipe_ctx[i]-
> >stream_res.tg,
> -						event_triggers,
> num_frames);
> +				if (tg->funcs-
> >set_static_screen_control)
> +					tg->funcs-
> >set_static_screen_control(
> +						tg, event_triggers,
> num_frames);
>  		}
>  	}
>  }

This fixes the panics with my RX 6800 XT on Sway with VRR enabled!

Tested-by: Sefa Eyeoglu <contact@scrumplex.net>

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] drm/amd/display: Avoid race between dcn10_set_drr() and dc_state_destruct()
  2024-09-02  9:40 ` [PATCH 1/2] drm/amd/display: Avoid race between dcn10_set_drr() and dc_state_destruct() tjakobi
  2024-09-02 11:03   ` Sefa Eyeoglu
@ 2024-09-02 14:43   ` raoul.van.rueschen
  2024-09-03  5:13   ` Christopher Snowhill
  2024-09-03 21:27   ` Harry Wentland
  3 siblings, 0 replies; 14+ messages in thread
From: raoul.van.rueschen @ 2024-09-02 14:43 UTC (permalink / raw)
  To: tjakobi
  Cc: Rodrigo.Siqueira, Xinhui.Pan, airlied, alexander.deucher, amd-gfx,
	christian.koenig, daniel, dri-devel, harry.wentland, linux-kernel,
	mario.limonciello, sunpeng.li, raoul.van.rueschen

> From: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>
>
> dc_state_destruct() nulls the resource context of the DC state. The
> pipe
> context passed to dcn10_set_drr() is a member of this resource
> context.
>
> If dc_state_destruct() is called parallel to the IRQ processing
> (which
> calls dcn10_set_drr() at some point), we can end up using already
> nulled
> function callback fields of struct stream_resource.
>
> The logic in dcn10_set_drr() already tries to avoid this, by checking
> tg
> against NULL. But if the nulling happens exactly after the NULL check
> and
> before the next access, then we get a race.
>
> Avoid this by copying tg first to a local variable, and then use this
> variable for all the operations. This should work, as long as nobody
> frees the resource pool where the timing generators live.
>
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3142
> Fixes: 06ad7e164256 ("drm/amd/display: Destroy DC context while
> keeping DML and DML2")
> Signed-off-by: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>
> ---
>  .../amd/display/dc/hwss/dcn10/dcn10_hwseq.c   | 20 +++++++++++------
> --
>  1 file changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
> b/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
> index 3306684e805a..da8f2cb3c5db 100644
> --- a/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
> +++ b/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
> @@ -3223,15 +3223,19 @@ void dcn10_set_drr(struct pipe_ctx
> **pipe_ctx,
>  	 * as well.
>  	 */
>  	for (i = 0; i < num_pipes; i++) {
> -		if ((pipe_ctx[i]->stream_res.tg != NULL) &&
> pipe_ctx[i]->stream_res.tg->funcs) {
> -			if (pipe_ctx[i]->stream_res.tg->funcs-
> >set_drr)
> -				pipe_ctx[i]->stream_res.tg->funcs-
> >set_drr(
> -					pipe_ctx[i]->stream_res.tg,
> &params);
> +		/* dc_state_destruct() might null the stream
> resources, so fetch tg
> +		 * here first to avoid a race condition. The
> lifetime of the pointee
> +		 * itself (the timing_generator object) is not a
> problem here.
> +		 */
> +		struct timing_generator *tg = pipe_ctx[i]-
> >stream_res.tg;
> +
> +		if ((tg != NULL) && tg->funcs) {
> +			if (tg->funcs->set_drr)
> +				tg->funcs->set_drr(tg, &params);
>  			if (adjust.v_total_max != 0 &&
> adjust.v_total_min != 0)
> -				if (pipe_ctx[i]->stream_res.tg-
> >funcs->set_static_screen_control)
> -					pipe_ctx[i]->stream_res.tg-
> >funcs->set_static_screen_control(
> -						pipe_ctx[i]-
> >stream_res.tg,
> -						event_triggers,
> num_frames);
> +				if (tg->funcs-
> >set_static_screen_control)
> +					tg->funcs-
> >set_static_screen_control(
> +						tg, event_triggers,
> num_frames);
>  		}
>  	}
>  }

This fixes full system freezes when taking screenshots at low framerates with VRR enabled on an RX 7900 XTX.

Tested-by: Raoul van Rüschen <raoul.van.rueschen@gmail.com>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] drm/amd/display: Avoid race between dcn10_set_drr() and dc_state_destruct()
  2024-09-02  9:40 ` [PATCH 1/2] drm/amd/display: Avoid race between dcn10_set_drr() and dc_state_destruct() tjakobi
  2024-09-02 11:03   ` Sefa Eyeoglu
  2024-09-02 14:43   ` raoul.van.rueschen
@ 2024-09-03  5:13   ` Christopher Snowhill
  2024-09-03 21:27   ` Harry Wentland
  3 siblings, 0 replies; 14+ messages in thread
From: Christopher Snowhill @ 2024-09-03  5:13 UTC (permalink / raw)
  To: tjakobi, Harry Wentland, Leo Li, Rodrigo Siqueira, Alex Deucher,
	Christian König, Pan, Xinhui, David Airlie, Daniel Vetter,
	Mario Limonciello
  Cc: amd-gfx, dri-devel, linux-kernel

On Mon Sep 2, 2024 at 2:40 AM PDT, tjakobi wrote:
> From: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>
>
> dc_state_destruct() nulls the resource context of the DC state. The pipe
> context passed to dcn10_set_drr() is a member of this resource context.
>
> If dc_state_destruct() is called parallel to the IRQ processing (which
> calls dcn10_set_drr() at some point), we can end up using already nulled
> function callback fields of struct stream_resource.
>
> The logic in dcn10_set_drr() already tries to avoid this, by checking tg
> against NULL. But if the nulling happens exactly after the NULL check and
> before the next access, then we get a race.
>
> Avoid this by copying tg first to a local variable, and then use this
> variable for all the operations. This should work, as long as nobody
> frees the resource pool where the timing generators live.
>
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3142
> Fixes: 06ad7e164256 ("drm/amd/display: Destroy DC context while keeping DML and DML2")
> Signed-off-by: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>
> ---
>  .../amd/display/dc/hwss/dcn10/dcn10_hwseq.c   | 20 +++++++++++--------
>  1 file changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c b/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
> index 3306684e805a..da8f2cb3c5db 100644
> --- a/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
> +++ b/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
> @@ -3223,15 +3223,19 @@ void dcn10_set_drr(struct pipe_ctx **pipe_ctx,
>  	 * as well.
>  	 */
>  	for (i = 0; i < num_pipes; i++) {
> -		if ((pipe_ctx[i]->stream_res.tg != NULL) && pipe_ctx[i]->stream_res.tg->funcs) {
> -			if (pipe_ctx[i]->stream_res.tg->funcs->set_drr)
> -				pipe_ctx[i]->stream_res.tg->funcs->set_drr(
> -					pipe_ctx[i]->stream_res.tg, &params);
> +		/* dc_state_destruct() might null the stream resources, so fetch tg
> +		 * here first to avoid a race condition. The lifetime of the pointee
> +		 * itself (the timing_generator object) is not a problem here.
> +		 */
> +		struct timing_generator *tg = pipe_ctx[i]->stream_res.tg;
> +
> +		if ((tg != NULL) && tg->funcs) {
> +			if (tg->funcs->set_drr)
> +				tg->funcs->set_drr(tg, &params);
>  			if (adjust.v_total_max != 0 && adjust.v_total_min != 0)
> -				if (pipe_ctx[i]->stream_res.tg->funcs->set_static_screen_control)
> -					pipe_ctx[i]->stream_res.tg->funcs->set_static_screen_control(
> -						pipe_ctx[i]->stream_res.tg,
> -						event_triggers, num_frames);
> +				if (tg->funcs->set_static_screen_control)
> +					tg->funcs->set_static_screen_control(
> +						tg, event_triggers, num_frames);
>  		}
>  	}
>  }

This fixes hard to trace panics with labwc VRR and Wayfire on RX 6700 XT. I had to use netconsole to arrive at the original bug report.

Tested-by: Christopher Snowhill <chris@kode54.net>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] drm/amd/display: Avoid race between dcn10_set_drr() and dc_state_destruct()
  2024-09-02  9:40 ` [PATCH 1/2] drm/amd/display: Avoid race between dcn10_set_drr() and dc_state_destruct() tjakobi
                     ` (2 preceding siblings ...)
  2024-09-03  5:13   ` Christopher Snowhill
@ 2024-09-03 21:27   ` Harry Wentland
  3 siblings, 0 replies; 14+ messages in thread
From: Harry Wentland @ 2024-09-03 21:27 UTC (permalink / raw)
  To: tjakobi, Leo Li, Rodrigo Siqueira, Alex Deucher,
	Christian König, Pan, Xinhui, David Airlie, Daniel Vetter,
	Mario Limonciello
  Cc: amd-gfx, dri-devel, linux-kernel



On 2024-09-02 05:40, tjakobi@math.uni-bielefeld.de wrote:
> From: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>
> 
> dc_state_destruct() nulls the resource context of the DC state. The pipe
> context passed to dcn10_set_drr() is a member of this resource context.
> 
> If dc_state_destruct() is called parallel to the IRQ processing (which
> calls dcn10_set_drr() at some point), we can end up using already nulled
> function callback fields of struct stream_resource.
> 
> The logic in dcn10_set_drr() already tries to avoid this, by checking tg
> against NULL. But if the nulling happens exactly after the NULL check and
> before the next access, then we get a race.
> 
> Avoid this by copying tg first to a local variable, and then use this
> variable for all the operations. This should work, as long as nobody
> frees the resource pool where the timing generators live.
> 
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3142
> Fixes: 06ad7e164256 ("drm/amd/display: Destroy DC context while keeping DML and DML2")
> Signed-off-by: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>

Thanks for this fix. It also makes the code more readable.

Reviewed-by: Harry Wentland <harry.wentland@amd.com>

Harry

> ---
>  .../amd/display/dc/hwss/dcn10/dcn10_hwseq.c   | 20 +++++++++++--------
>  1 file changed, 12 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c b/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
> index 3306684e805a..da8f2cb3c5db 100644
> --- a/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
> +++ b/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
> @@ -3223,15 +3223,19 @@ void dcn10_set_drr(struct pipe_ctx **pipe_ctx,
>  	 * as well.
>  	 */
>  	for (i = 0; i < num_pipes; i++) {
> -		if ((pipe_ctx[i]->stream_res.tg != NULL) && pipe_ctx[i]->stream_res.tg->funcs) {
> -			if (pipe_ctx[i]->stream_res.tg->funcs->set_drr)
> -				pipe_ctx[i]->stream_res.tg->funcs->set_drr(
> -					pipe_ctx[i]->stream_res.tg, &params);
> +		/* dc_state_destruct() might null the stream resources, so fetch tg
> +		 * here first to avoid a race condition. The lifetime of the pointee
> +		 * itself (the timing_generator object) is not a problem here.
> +		 */
> +		struct timing_generator *tg = pipe_ctx[i]->stream_res.tg;
> +
> +		if ((tg != NULL) && tg->funcs) {
> +			if (tg->funcs->set_drr)
> +				tg->funcs->set_drr(tg, &params);
>  			if (adjust.v_total_max != 0 && adjust.v_total_min != 0)
> -				if (pipe_ctx[i]->stream_res.tg->funcs->set_static_screen_control)
> -					pipe_ctx[i]->stream_res.tg->funcs->set_static_screen_control(
> -						pipe_ctx[i]->stream_res.tg,
> -						event_triggers, num_frames);
> +				if (tg->funcs->set_static_screen_control)
> +					tg->funcs->set_static_screen_control(
> +						tg, event_triggers, num_frames);
>  		}
>  	}
>  }


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] drm/amd/display: Avoid race between dcn35_set_drr() and dc_state_destruct()
  2024-09-02  9:40 ` [PATCH 2/2] drm/amd/display: Avoid race between dcn35_set_drr() " tjakobi
@ 2024-09-03 21:28   ` Harry Wentland
  0 siblings, 0 replies; 14+ messages in thread
From: Harry Wentland @ 2024-09-03 21:28 UTC (permalink / raw)
  To: tjakobi, Leo Li, Rodrigo Siqueira, Alex Deucher,
	Christian König, Pan, Xinhui, David Airlie, Daniel Vetter,
	Mario Limonciello
  Cc: amd-gfx, dri-devel, linux-kernel



On 2024-09-02 05:40, tjakobi@math.uni-bielefeld.de wrote:
> From: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>
> 
> dc_state_destruct() nulls the resource context of the DC state. The pipe
> context passed to dcn35_set_drr() is a member of this resource context.
> 
> If dc_state_destruct() is called parallel to the IRQ processing (which
> calls dcn35_set_drr() at some point), we can end up using already nulled
> function callback fields of struct stream_resource.
> 
> The logic in dcn35_set_drr() already tries to avoid this, by checking tg
> against NULL. But if the nulling happens exactly after the NULL check and
> before the next access, then we get a race.
> 
> Avoid this by copying tg first to a local variable, and then use this
> variable for all the operations. This should work, as long as nobody
> frees the resource pool where the timing generators live.
> 
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3142
> Fixes: 06ad7e164256 ("drm/amd/display: Destroy DC context while keeping DML and DML2")
> Signed-off-by: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>

Reviewed-by: Harry Wentland <harry.wentland@amd.com>

Harry

> ---
>  .../amd/display/dc/hwss/dcn35/dcn35_hwseq.c   | 20 +++++++++++--------
>  1 file changed, 12 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/display/dc/hwss/dcn35/dcn35_hwseq.c b/drivers/gpu/drm/amd/display/dc/hwss/dcn35/dcn35_hwseq.c
> index dcced89c07b3..4e77728dac10 100644
> --- a/drivers/gpu/drm/amd/display/dc/hwss/dcn35/dcn35_hwseq.c
> +++ b/drivers/gpu/drm/amd/display/dc/hwss/dcn35/dcn35_hwseq.c
> @@ -1370,7 +1370,13 @@ void dcn35_set_drr(struct pipe_ctx **pipe_ctx,
>  	params.vertical_total_mid_frame_num = adjust.v_total_mid_frame_num;
>  
>  	for (i = 0; i < num_pipes; i++) {
> -		if ((pipe_ctx[i]->stream_res.tg != NULL) && pipe_ctx[i]->stream_res.tg->funcs) {
> +		/* dc_state_destruct() might null the stream resources, so fetch tg
> +		 * here first to avoid a race condition. The lifetime of the pointee
> +		 * itself (the timing_generator object) is not a problem here.
> +		 */
> +		struct timing_generator *tg = pipe_ctx[i]->stream_res.tg;
> +
> +		if ((tg != NULL) && tg->funcs) {
>  			struct dc_crtc_timing *timing = &pipe_ctx[i]->stream->timing;
>  			struct dc *dc = pipe_ctx[i]->stream->ctx->dc;
>  
> @@ -1383,14 +1389,12 @@ void dcn35_set_drr(struct pipe_ctx **pipe_ctx,
>  					num_frames = 2 * (frame_rate % 60);
>  				}
>  			}
> -			if (pipe_ctx[i]->stream_res.tg->funcs->set_drr)
> -				pipe_ctx[i]->stream_res.tg->funcs->set_drr(
> -					pipe_ctx[i]->stream_res.tg, &params);
> +			if (tg->funcs->set_drr)
> +				tg->funcs->set_drr(tg, &params);
>  			if (adjust.v_total_max != 0 && adjust.v_total_min != 0)
> -				if (pipe_ctx[i]->stream_res.tg->funcs->set_static_screen_control)
> -					pipe_ctx[i]->stream_res.tg->funcs->set_static_screen_control(
> -						pipe_ctx[i]->stream_res.tg,
> -						event_triggers, num_frames);
> +				if (tg->funcs->set_static_screen_control)
> +					tg->funcs->set_static_screen_control(
> +						tg, event_triggers, num_frames);
>  		}
>  	}
>  }


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/2] drm/amd: fix VRR race condition during IRQ handling
  2024-09-02  9:40 [PATCH 0/2] drm/amd: fix VRR race condition during IRQ handling tjakobi
  2024-09-02  9:40 ` [PATCH 1/2] drm/amd/display: Avoid race between dcn10_set_drr() and dc_state_destruct() tjakobi
  2024-09-02  9:40 ` [PATCH 2/2] drm/amd/display: Avoid race between dcn35_set_drr() " tjakobi
@ 2024-09-08  7:35 ` Christopher Snowhill
  2024-09-08 11:23   ` Tobias Jakobi
  2 siblings, 1 reply; 14+ messages in thread
From: Christopher Snowhill @ 2024-09-08  7:35 UTC (permalink / raw)
  To: tjakobi, amd-gfx, dri-devel, linux-kernel

On Mon Sep 2, 2024 at 2:40 AM PDT, tjakobi wrote:
> From: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>
>
> Hello,
>
> this fixes a nasty race condition in the set_drr() callbacks for DCN10
> and DCN35 that has existed now since quite some time, see this GitLab
> issue for reference.
>
> https://gitlab.freedesktop.org/drm/amd/-/issues/3142
>
> The report just focuses von DCN10, but the same problem also exists in
> the DCN35 code.

Does the problem not exist in the following references to funcs->set_drr?

drivers/gpu/drm/amd/display/dc/hwss/dce110/dce110_hwseq.c:      if (pipe_ctx->stream_res.tg->funcs->set_drr)
drivers/gpu/drm/amd/display/dc/hwss/dce110/dce110_hwseq.c:              pipe_ctx->stream_res.tg->funcs->set_drr(
drivers/gpu/drm/amd/display/dc/hwss/dce110/dce110_hwseq.c:              pipe_ctx[i]->stream_res.tg->funcs->set_drr(
drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:        if (pipe_ctx->stream_res.tg->funcs->set_drr)
drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:                pipe_ctx->stream_res.tg->funcs->set_drr(
drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:                if (pipe_ctx->stream_res.tg->funcs->set_drr)
drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:                        pipe_ctx->stream_res.tg->funcs->set_drr(
drivers/gpu/drm/amd/display/dc/hwss/dcn31/dcn31_hwseq.c:        if (pipe_ctx->stream_res.tg->funcs->set_drr)
drivers/gpu/drm/amd/display/dc/hwss/dcn31/dcn31_hwseq.c:                pipe_ctx->stream_res.tg->funcs->set_drr(
drivers/gpu/drm/amd/display/dc/hwss/dcn401/dcn401_hwseq.c:      if (pipe_ctx->stream_res.tg->funcs->set_drr)
drivers/gpu/drm/amd/display/dc/hwss/dcn401/dcn401_hwseq.c:              pipe_ctx->stream_res.tg->funcs->set_drr(

>
> With best wishes,
> Tobias
>
> Tobias Jakobi (2):
>   drm/amd/display: Avoid race between dcn10_set_drr() and
>     dc_state_destruct()
>   drm/amd/display: Avoid race between dcn35_set_drr() and
>     dc_state_destruct()
>
>  .../amd/display/dc/hwss/dcn10/dcn10_hwseq.c   | 20 +++++++++++--------
>  .../amd/display/dc/hwss/dcn35/dcn35_hwseq.c   | 20 +++++++++++--------
>  2 files changed, 24 insertions(+), 16 deletions(-)


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/2] drm/amd: fix VRR race condition during IRQ handling
  2024-09-08  7:35 ` [PATCH 0/2] drm/amd: fix VRR race condition during IRQ handling Christopher Snowhill
@ 2024-09-08 11:23   ` Tobias Jakobi
  2024-09-09  8:26     ` Christopher Snowhill
  2024-09-09 17:11     ` Alex Deucher
  0 siblings, 2 replies; 14+ messages in thread
From: Tobias Jakobi @ 2024-09-08 11:23 UTC (permalink / raw)
  To: Christopher Snowhill, amd-gfx, dri-devel, linux-kernel

On 9/8/24 09:35, Christopher Snowhill wrote:

> On Mon Sep 2, 2024 at 2:40 AM PDT, tjakobi wrote:
>> From: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>
>>
>> Hello,
>>
>> this fixes a nasty race condition in the set_drr() callbacks for DCN10
>> and DCN35 that has existed now since quite some time, see this GitLab
>> issue for reference.
>>
>> https://gitlab.freedesktop.org/drm/amd/-/issues/3142
>>
>> The report just focuses von DCN10, but the same problem also exists in
>> the DCN35 code.
> Does the problem not exist in the following references to funcs->set_drr?
>
> drivers/gpu/drm/amd/display/dc/hwss/dce110/dce110_hwseq.c:      if (pipe_ctx->stream_res.tg->funcs->set_drr)
> drivers/gpu/drm/amd/display/dc/hwss/dce110/dce110_hwseq.c:              pipe_ctx->stream_res.tg->funcs->set_drr(
> drivers/gpu/drm/amd/display/dc/hwss/dce110/dce110_hwseq.c:              pipe_ctx[i]->stream_res.tg->funcs->set_drr(
> drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:        if (pipe_ctx->stream_res.tg->funcs->set_drr)
> drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:                pipe_ctx->stream_res.tg->funcs->set_drr(
> drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:                if (pipe_ctx->stream_res.tg->funcs->set_drr)
> drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:                        pipe_ctx->stream_res.tg->funcs->set_drr(
> drivers/gpu/drm/amd/display/dc/hwss/dcn31/dcn31_hwseq.c:        if (pipe_ctx->stream_res.tg->funcs->set_drr)
> drivers/gpu/drm/amd/display/dc/hwss/dcn31/dcn31_hwseq.c:                pipe_ctx->stream_res.tg->funcs->set_drr(
> drivers/gpu/drm/amd/display/dc/hwss/dcn401/dcn401_hwseq.c:      if (pipe_ctx->stream_res.tg->funcs->set_drr)
> drivers/gpu/drm/amd/display/dc/hwss/dcn401/dcn401_hwseq.c:              pipe_ctx->stream_res.tg->funcs->set_drr(

Maybe. But the big difference I see here, is that in this code there 
isn't even any kind of NULL check applied to tg. Or most of the members 
of *pipe_ctx. If there really is the same kind of problem here, then one 
would need to rewrite a bit more code to fix stuff.

E.g. in the case of  dcn31_hwseq.c, the questionable code is in 
dcn31_reset_back_end_for_pipe(), which is static and only called from 
dcn31_reset_hw_ctx_wrap(). Which is assigned to the .reset_hw_ctx_wrap 
callback. And this specific callback, from what I can see, is only 
called from dce110_reset_hw_ctx_wrap(). Which is then assigned to the 
.apply_ctx_to_hw callback. The callback is only called from 
dc_commit_state_no_check(). That one is static again, and called from 
dc_commit_streams().

I could trace this even further. My point is: I don't think this is 
called from any IRQ handler code. And given the depth and complexity of 
the callgraph, I have to admit, that, at least at this point, this is a 
bit over my head.

Sure, I could now sprinkle a bunch of x != NULL in the code, but that 
would be merely voodoo. And I usually try to have a theoretical basis 
when I apply changes to code.

Maybe if someone from the AMD display team could give some insight if 
there still is potentially vulnerable code in some of the instances that 
Christopher has posted, then I would gladly take a look.

With best wishes,
Tobias

>
>> With best wishes,
>> Tobias
>>
>> Tobias Jakobi (2):
>>    drm/amd/display: Avoid race between dcn10_set_drr() and
>>      dc_state_destruct()
>>    drm/amd/display: Avoid race between dcn35_set_drr() and
>>      dc_state_destruct()
>>
>>   .../amd/display/dc/hwss/dcn10/dcn10_hwseq.c   | 20 +++++++++++--------
>>   .../amd/display/dc/hwss/dcn35/dcn35_hwseq.c   | 20 +++++++++++--------
>>   2 files changed, 24 insertions(+), 16 deletions(-)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/2] drm/amd: fix VRR race condition during IRQ handling
  2024-09-08 11:23   ` Tobias Jakobi
@ 2024-09-09  8:26     ` Christopher Snowhill
  2024-09-09 17:11     ` Alex Deucher
  1 sibling, 0 replies; 14+ messages in thread
From: Christopher Snowhill @ 2024-09-09  8:26 UTC (permalink / raw)
  To: Tobias Jakobi, amd-gfx, dri-devel, linux-kernel

On Sun Sep 8, 2024 at 4:23 AM PDT, Tobias Jakobi wrote:
> On 9/8/24 09:35, Christopher Snowhill wrote:
>
> > On Mon Sep 2, 2024 at 2:40 AM PDT, tjakobi wrote:
> >> From: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>
> >>
> >> Hello,
> >>
> >> this fixes a nasty race condition in the set_drr() callbacks for DCN10
> >> and DCN35 that has existed now since quite some time, see this GitLab
> >> issue for reference.
> >>
> >> https://gitlab.freedesktop.org/drm/amd/-/issues/3142
> >>
> >> The report just focuses von DCN10, but the same problem also exists in
> >> the DCN35 code.
> > Does the problem not exist in the following references to funcs->set_drr?
> >
> > drivers/gpu/drm/amd/display/dc/hwss/dce110/dce110_hwseq.c:      if (pipe_ctx->stream_res.tg->funcs->set_drr)
> > drivers/gpu/drm/amd/display/dc/hwss/dce110/dce110_hwseq.c:              pipe_ctx->stream_res.tg->funcs->set_drr(
> > drivers/gpu/drm/amd/display/dc/hwss/dce110/dce110_hwseq.c:              pipe_ctx[i]->stream_res.tg->funcs->set_drr(
> > drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:        if (pipe_ctx->stream_res.tg->funcs->set_drr)
> > drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:                pipe_ctx->stream_res.tg->funcs->set_drr(
> > drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:                if (pipe_ctx->stream_res.tg->funcs->set_drr)
> > drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:                        pipe_ctx->stream_res.tg->funcs->set_drr(
> > drivers/gpu/drm/amd/display/dc/hwss/dcn31/dcn31_hwseq.c:        if (pipe_ctx->stream_res.tg->funcs->set_drr)
> > drivers/gpu/drm/amd/display/dc/hwss/dcn31/dcn31_hwseq.c:                pipe_ctx->stream_res.tg->funcs->set_drr(
> > drivers/gpu/drm/amd/display/dc/hwss/dcn401/dcn401_hwseq.c:      if (pipe_ctx->stream_res.tg->funcs->set_drr)
> > drivers/gpu/drm/amd/display/dc/hwss/dcn401/dcn401_hwseq.c:              pipe_ctx->stream_res.tg->funcs->set_drr(
>
> Maybe. But the big difference I see here, is that in this code there 
> isn't even any kind of NULL check applied to tg. Or most of the members 
> of *pipe_ctx. If there really is the same kind of problem here, then one 
> would need to rewrite a bit more code to fix stuff.
>
> E.g. in the case of  dcn31_hwseq.c, the questionable code is in 
> dcn31_reset_back_end_for_pipe(), which is static and only called from 
> dcn31_reset_hw_ctx_wrap(). Which is assigned to the .reset_hw_ctx_wrap 
> callback. And this specific callback, from what I can see, is only 
> called from dce110_reset_hw_ctx_wrap(). Which is then assigned to the 
> .apply_ctx_to_hw callback. The callback is only called from 
> dc_commit_state_no_check(). That one is static again, and called from 
> dc_commit_streams().
>
> I could trace this even further. My point is: I don't think this is 
> called from any IRQ handler code. And given the depth and complexity of 
> the callgraph, I have to admit, that, at least at this point, this is a 
> bit over my head.
>
> Sure, I could now sprinkle a bunch of x != NULL in the code, but that 
> would be merely voodoo. And I usually try to have a theoretical basis 
> when I apply changes to code.
>
> Maybe if someone from the AMD display team could give some insight if 
> there still is potentially vulnerable code in some of the instances that 
> Christopher has posted, then I would gladly take a look.

Sorry, I was taking a note from someone else who mentioned set_drr functions, and wasn't aware that none of the other implementations happen to be called from IRQ handlers. Thanks for looking into this.

-Christopher

> With best wishes,
> Tobias
>
> >
> >> With best wishes,
> >> Tobias
> >>
> >> Tobias Jakobi (2):
> >>    drm/amd/display: Avoid race between dcn10_set_drr() and
> >>      dc_state_destruct()
> >>    drm/amd/display: Avoid race between dcn35_set_drr() and
> >>      dc_state_destruct()
> >>
> >>   .../amd/display/dc/hwss/dcn10/dcn10_hwseq.c   | 20 +++++++++++--------
> >>   .../amd/display/dc/hwss/dcn35/dcn35_hwseq.c   | 20 +++++++++++--------
> >>   2 files changed, 24 insertions(+), 16 deletions(-)


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/2] drm/amd: fix VRR race condition during IRQ handling
  2024-09-08 11:23   ` Tobias Jakobi
  2024-09-09  8:26     ` Christopher Snowhill
@ 2024-09-09 17:11     ` Alex Deucher
  2024-09-09 17:18       ` Harry Wentland
  1 sibling, 1 reply; 14+ messages in thread
From: Alex Deucher @ 2024-09-09 17:11 UTC (permalink / raw)
  To: Tobias Jakobi, Wentland, Harry
  Cc: Christopher Snowhill, amd-gfx, dri-devel, linux-kernel

On Sun, Sep 8, 2024 at 7:23 AM Tobias Jakobi
<tjakobi@math.uni-bielefeld.de> wrote:
>
> On 9/8/24 09:35, Christopher Snowhill wrote:
>
> > On Mon Sep 2, 2024 at 2:40 AM PDT, tjakobi wrote:
> >> From: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>
> >>
> >> Hello,
> >>
> >> this fixes a nasty race condition in the set_drr() callbacks for DCN10
> >> and DCN35 that has existed now since quite some time, see this GitLab
> >> issue for reference.
> >>
> >> https://gitlab.freedesktop.org/drm/amd/-/issues/3142
> >>
> >> The report just focuses von DCN10, but the same problem also exists in
> >> the DCN35 code.
> > Does the problem not exist in the following references to funcs->set_drr?
> >
> > drivers/gpu/drm/amd/display/dc/hwss/dce110/dce110_hwseq.c:      if (pipe_ctx->stream_res.tg->funcs->set_drr)
> > drivers/gpu/drm/amd/display/dc/hwss/dce110/dce110_hwseq.c:              pipe_ctx->stream_res.tg->funcs->set_drr(
> > drivers/gpu/drm/amd/display/dc/hwss/dce110/dce110_hwseq.c:              pipe_ctx[i]->stream_res.tg->funcs->set_drr(
> > drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:        if (pipe_ctx->stream_res.tg->funcs->set_drr)
> > drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:                pipe_ctx->stream_res.tg->funcs->set_drr(
> > drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:                if (pipe_ctx->stream_res.tg->funcs->set_drr)
> > drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:                        pipe_ctx->stream_res.tg->funcs->set_drr(
> > drivers/gpu/drm/amd/display/dc/hwss/dcn31/dcn31_hwseq.c:        if (pipe_ctx->stream_res.tg->funcs->set_drr)
> > drivers/gpu/drm/amd/display/dc/hwss/dcn31/dcn31_hwseq.c:                pipe_ctx->stream_res.tg->funcs->set_drr(
> > drivers/gpu/drm/amd/display/dc/hwss/dcn401/dcn401_hwseq.c:      if (pipe_ctx->stream_res.tg->funcs->set_drr)
> > drivers/gpu/drm/amd/display/dc/hwss/dcn401/dcn401_hwseq.c:              pipe_ctx->stream_res.tg->funcs->set_drr(
>
> Maybe. But the big difference I see here, is that in this code there
> isn't even any kind of NULL check applied to tg. Or most of the members
> of *pipe_ctx. If there really is the same kind of problem here, then one
> would need to rewrite a bit more code to fix stuff.
>
> E.g. in the case of  dcn31_hwseq.c, the questionable code is in
> dcn31_reset_back_end_for_pipe(), which is static and only called from
> dcn31_reset_hw_ctx_wrap(). Which is assigned to the .reset_hw_ctx_wrap
> callback. And this specific callback, from what I can see, is only
> called from dce110_reset_hw_ctx_wrap(). Which is then assigned to the
> .apply_ctx_to_hw callback. The callback is only called from
> dc_commit_state_no_check(). That one is static again, and called from
> dc_commit_streams().
>
> I could trace this even further. My point is: I don't think this is
> called from any IRQ handler code. And given the depth and complexity of
> the callgraph, I have to admit, that, at least at this point, this is a
> bit over my head.
>
> Sure, I could now sprinkle a bunch of x != NULL in the code, but that
> would be merely voodoo. And I usually try to have a theoretical basis
> when I apply changes to code.
>
> Maybe if someone from the AMD display team could give some insight if
> there still is potentially vulnerable code in some of the instances that
> Christopher has posted, then I would gladly take a look.

@Wentland, Harry can you confirm this?

Alex

>
> With best wishes,
> Tobias
>
> >
> >> With best wishes,
> >> Tobias
> >>
> >> Tobias Jakobi (2):
> >>    drm/amd/display: Avoid race between dcn10_set_drr() and
> >>      dc_state_destruct()
> >>    drm/amd/display: Avoid race between dcn35_set_drr() and
> >>      dc_state_destruct()
> >>
> >>   .../amd/display/dc/hwss/dcn10/dcn10_hwseq.c   | 20 +++++++++++--------
> >>   .../amd/display/dc/hwss/dcn35/dcn35_hwseq.c   | 20 +++++++++++--------
> >>   2 files changed, 24 insertions(+), 16 deletions(-)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/2] drm/amd: fix VRR race condition during IRQ handling
  2024-09-09 17:11     ` Alex Deucher
@ 2024-09-09 17:18       ` Harry Wentland
  2024-09-09 19:36         ` Tobias Jakobi
  0 siblings, 1 reply; 14+ messages in thread
From: Harry Wentland @ 2024-09-09 17:18 UTC (permalink / raw)
  To: Alex Deucher, Tobias Jakobi
  Cc: Christopher Snowhill, amd-gfx, dri-devel, linux-kernel



On 2024-09-09 13:11, Alex Deucher wrote:
> On Sun, Sep 8, 2024 at 7:23 AM Tobias Jakobi
> <tjakobi@math.uni-bielefeld.de> wrote:
>>
>> On 9/8/24 09:35, Christopher Snowhill wrote:
>>
>>> On Mon Sep 2, 2024 at 2:40 AM PDT, tjakobi wrote:
>>>> From: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>
>>>>
>>>> Hello,
>>>>
>>>> this fixes a nasty race condition in the set_drr() callbacks for DCN10
>>>> and DCN35 that has existed now since quite some time, see this GitLab
>>>> issue for reference.
>>>>
>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3142
>>>>
>>>> The report just focuses von DCN10, but the same problem also exists in
>>>> the DCN35 code.
>>> Does the problem not exist in the following references to funcs->set_drr?
>>>
>>> drivers/gpu/drm/amd/display/dc/hwss/dce110/dce110_hwseq.c:      if (pipe_ctx->stream_res.tg->funcs->set_drr)
>>> drivers/gpu/drm/amd/display/dc/hwss/dce110/dce110_hwseq.c:              pipe_ctx->stream_res.tg->funcs->set_drr(
>>> drivers/gpu/drm/amd/display/dc/hwss/dce110/dce110_hwseq.c:              pipe_ctx[i]->stream_res.tg->funcs->set_drr(
>>> drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:        if (pipe_ctx->stream_res.tg->funcs->set_drr)
>>> drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:                pipe_ctx->stream_res.tg->funcs->set_drr(
>>> drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:                if (pipe_ctx->stream_res.tg->funcs->set_drr)
>>> drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:                        pipe_ctx->stream_res.tg->funcs->set_drr(
>>> drivers/gpu/drm/amd/display/dc/hwss/dcn31/dcn31_hwseq.c:        if (pipe_ctx->stream_res.tg->funcs->set_drr)
>>> drivers/gpu/drm/amd/display/dc/hwss/dcn31/dcn31_hwseq.c:                pipe_ctx->stream_res.tg->funcs->set_drr(
>>> drivers/gpu/drm/amd/display/dc/hwss/dcn401/dcn401_hwseq.c:      if (pipe_ctx->stream_res.tg->funcs->set_drr)
>>> drivers/gpu/drm/amd/display/dc/hwss/dcn401/dcn401_hwseq.c:              pipe_ctx->stream_res.tg->funcs->set_drr(
>>
>> Maybe. But the big difference I see here, is that in this code there
>> isn't even any kind of NULL check applied to tg. Or most of the members
>> of *pipe_ctx. If there really is the same kind of problem here, then one
>> would need to rewrite a bit more code to fix stuff.
>>
>> E.g. in the case of  dcn31_hwseq.c, the questionable code is in
>> dcn31_reset_back_end_for_pipe(), which is static and only called from
>> dcn31_reset_hw_ctx_wrap(). Which is assigned to the .reset_hw_ctx_wrap
>> callback. And this specific callback, from what I can see, is only
>> called from dce110_reset_hw_ctx_wrap(). Which is then assigned to the
>> .apply_ctx_to_hw callback. The callback is only called from
>> dc_commit_state_no_check(). That one is static again, and called from
>> dc_commit_streams().
>>
>> I could trace this even further. My point is: I don't think this is
>> called from any IRQ handler code. And given the depth and complexity of
>> the callgraph, I have to admit, that, at least at this point, this is a
>> bit over my head.
>>
>> Sure, I could now sprinkle a bunch of x != NULL in the code, but that
>> would be merely voodoo. And I usually try to have a theoretical basis
>> when I apply changes to code.
>>
>> Maybe if someone from the AMD display team could give some insight if
>> there still is potentially vulnerable code in some of the instances that
>> Christopher has posted, then I would gladly take a look.
> 
> @Wentland, Harry can you confirm this?
> 

As Tobias said, without extensive analysis and trace of the code in all
possible use-case it's hard to say there's no possible way the other
set_drr calls could potentially have a similar issue.

I think Tobias' analysis is sound and this fixes a number of issues, hence
my RB.

Harry

> Alex
> 
>>
>> With best wishes,
>> Tobias
>>
>>>
>>>> With best wishes,
>>>> Tobias
>>>>
>>>> Tobias Jakobi (2):
>>>>    drm/amd/display: Avoid race between dcn10_set_drr() and
>>>>      dc_state_destruct()
>>>>    drm/amd/display: Avoid race between dcn35_set_drr() and
>>>>      dc_state_destruct()
>>>>
>>>>   .../amd/display/dc/hwss/dcn10/dcn10_hwseq.c   | 20 +++++++++++--------
>>>>   .../amd/display/dc/hwss/dcn35/dcn35_hwseq.c   | 20 +++++++++++--------
>>>>   2 files changed, 24 insertions(+), 16 deletions(-)


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/2] drm/amd: fix VRR race condition during IRQ handling
  2024-09-09 17:18       ` Harry Wentland
@ 2024-09-09 19:36         ` Tobias Jakobi
  0 siblings, 0 replies; 14+ messages in thread
From: Tobias Jakobi @ 2024-09-09 19:36 UTC (permalink / raw)
  To: Harry Wentland, Alex Deucher
  Cc: Christopher Snowhill, amd-gfx, dri-devel, linux-kernel

On 9/9/24 19:18, Harry Wentland wrote:

>
> On 2024-09-09 13:11, Alex Deucher wrote:
>> On Sun, Sep 8, 2024 at 7:23 AM Tobias Jakobi
>> <tjakobi@math.uni-bielefeld.de> wrote:
>>> On 9/8/24 09:35, Christopher Snowhill wrote:
>>>
>>>> On Mon Sep 2, 2024 at 2:40 AM PDT, tjakobi wrote:
>>>>> From: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>
>>>>>
>>>>> Hello,
>>>>>
>>>>> this fixes a nasty race condition in the set_drr() callbacks for DCN10
>>>>> and DCN35 that has existed now since quite some time, see this GitLab
>>>>> issue for reference.
>>>>>
>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3142
>>>>>
>>>>> The report just focuses von DCN10, but the same problem also exists in
>>>>> the DCN35 code.
>>>> Does the problem not exist in the following references to funcs->set_drr?
>>>>
>>>> drivers/gpu/drm/amd/display/dc/hwss/dce110/dce110_hwseq.c:      if (pipe_ctx->stream_res.tg->funcs->set_drr)
>>>> drivers/gpu/drm/amd/display/dc/hwss/dce110/dce110_hwseq.c:              pipe_ctx->stream_res.tg->funcs->set_drr(
>>>> drivers/gpu/drm/amd/display/dc/hwss/dce110/dce110_hwseq.c:              pipe_ctx[i]->stream_res.tg->funcs->set_drr(
>>>> drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:        if (pipe_ctx->stream_res.tg->funcs->set_drr)
>>>> drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:                pipe_ctx->stream_res.tg->funcs->set_drr(
>>>> drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:                if (pipe_ctx->stream_res.tg->funcs->set_drr)
>>>> drivers/gpu/drm/amd/display/dc/hwss/dcn20/dcn20_hwseq.c:                        pipe_ctx->stream_res.tg->funcs->set_drr(
>>>> drivers/gpu/drm/amd/display/dc/hwss/dcn31/dcn31_hwseq.c:        if (pipe_ctx->stream_res.tg->funcs->set_drr)
>>>> drivers/gpu/drm/amd/display/dc/hwss/dcn31/dcn31_hwseq.c:                pipe_ctx->stream_res.tg->funcs->set_drr(
>>>> drivers/gpu/drm/amd/display/dc/hwss/dcn401/dcn401_hwseq.c:      if (pipe_ctx->stream_res.tg->funcs->set_drr)
>>>> drivers/gpu/drm/amd/display/dc/hwss/dcn401/dcn401_hwseq.c:              pipe_ctx->stream_res.tg->funcs->set_drr(
>>> Maybe. But the big difference I see here, is that in this code there
>>> isn't even any kind of NULL check applied to tg. Or most of the members
>>> of *pipe_ctx. If there really is the same kind of problem here, then one
>>> would need to rewrite a bit more code to fix stuff.
>>>
>>> E.g. in the case of  dcn31_hwseq.c, the questionable code is in
>>> dcn31_reset_back_end_for_pipe(), which is static and only called from
>>> dcn31_reset_hw_ctx_wrap(). Which is assigned to the .reset_hw_ctx_wrap
>>> callback. And this specific callback, from what I can see, is only
>>> called from dce110_reset_hw_ctx_wrap(). Which is then assigned to the
>>> .apply_ctx_to_hw callback. The callback is only called from
>>> dc_commit_state_no_check(). That one is static again, and called from
>>> dc_commit_streams().
>>>
>>> I could trace this even further. My point is: I don't think this is
>>> called from any IRQ handler code. And given the depth and complexity of
>>> the callgraph, I have to admit, that, at least at this point, this is a
>>> bit over my head.
>>>
>>> Sure, I could now sprinkle a bunch of x != NULL in the code, but that
>>> would be merely voodoo. And I usually try to have a theoretical basis
>>> when I apply changes to code.
>>>
>>> Maybe if someone from the AMD display team could give some insight if
>>> there still is potentially vulnerable code in some of the instances that
>>> Christopher has posted, then I would gladly take a look.
>> @Wentland, Harry can you confirm this?
>>
> As Tobias said, without extensive analysis and trace of the code in all
> possible use-case it's hard to say there's no possible way the other
> set_drr calls could potentially have a similar issue.
>
> I think Tobias' analysis is sound and this fixes a number of issues, hence
> my RB.
In fact one user pointed out another potentially vulnerable callback:
https://gitlab.freedesktop.org/drm/amd/-/issues/3142#note_2560109

Which is set_drr() in dce110_hwseq.c -- from which we know that it's 
called from IRQ handler code. Also the backtrace that he posted confirms 
this. That code seems to be a bit older than the DCN10/DCN25 code, as it 
lacks any kind of NULL-check. I have posted a patch that more or less 
copies over the DCN10/35 code. Still waiting for conclusive feedback if 
the patch does something.

If it does, I'm going to post it to amd-gfx as well.

With best wishes,
Tobias
>
> Harry
>
>> Alex
>>
>>> With best wishes,
>>> Tobias
>>>
>>>>> With best wishes,
>>>>> Tobias
>>>>>
>>>>> Tobias Jakobi (2):
>>>>>     drm/amd/display: Avoid race between dcn10_set_drr() and
>>>>>       dc_state_destruct()
>>>>>     drm/amd/display: Avoid race between dcn35_set_drr() and
>>>>>       dc_state_destruct()
>>>>>
>>>>>    .../amd/display/dc/hwss/dcn10/dcn10_hwseq.c   | 20 +++++++++++--------
>>>>>    .../amd/display/dc/hwss/dcn35/dcn35_hwseq.c   | 20 +++++++++++--------
>>>>>    2 files changed, 24 insertions(+), 16 deletions(-)

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-09-09 19:36 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-02  9:40 [PATCH 0/2] drm/amd: fix VRR race condition during IRQ handling tjakobi
2024-09-02  9:40 ` [PATCH 1/2] drm/amd/display: Avoid race between dcn10_set_drr() and dc_state_destruct() tjakobi
2024-09-02 11:03   ` Sefa Eyeoglu
2024-09-02 14:43   ` raoul.van.rueschen
2024-09-03  5:13   ` Christopher Snowhill
2024-09-03 21:27   ` Harry Wentland
2024-09-02  9:40 ` [PATCH 2/2] drm/amd/display: Avoid race between dcn35_set_drr() " tjakobi
2024-09-03 21:28   ` Harry Wentland
2024-09-08  7:35 ` [PATCH 0/2] drm/amd: fix VRR race condition during IRQ handling Christopher Snowhill
2024-09-08 11:23   ` Tobias Jakobi
2024-09-09  8:26     ` Christopher Snowhill
2024-09-09 17:11     ` Alex Deucher
2024-09-09 17:18       ` Harry Wentland
2024-09-09 19:36         ` Tobias Jakobi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox