[PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()
@ 2017-02-09 10:33 Samuel Pitoiset
       [not found] ` <20170209103337.1522-1-samuel.pitoiset-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Samuel Pitoiset @ 2017-02-09 10:33 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Samuel Pitoiset

When ttm_bo_init() fails, the reservation mutex should be unlocked.

In debug build, the kernel reported "possible recursive locking
detected" in this codepath. For debugging purposes, I also added
a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
mutex was locked as expected.

This should fix (random) GPU hangs. The easy way to reproduce the
issue is to change the "Super Sampling" option from 1.0 to 2.0 in
Hitman. It will create a huge buffer, evict a bunch of buffers
(around ~5k) and deadlock.

This regression has been introduced pretty recently.

v2: only release the mutex if resv is NULL

Fixes: 12a852219583 ("drm/amdgpu: improve AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index d1ef1d064de4..556236a112c1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct amdgpu_device *adev,
 			&bo->placement, page_align, !kernel, NULL,
 			acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
 			&amdgpu_ttm_bo_destroy);
-	if (unlikely(r != 0))
+	if (unlikely(r != 0)) {
+		if (!resv)
+			ww_mutex_unlock(&bo->tbo.resv->lock);
 		return r;
+	}
 
 	bo->tbo.priority = ilog2(bo->tbo.num_pages);
 	if (kernel)
-- 
2.11.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 2/2] drm/amdgpu: report the number of bytes moved at buffer creation
       [not found] ` <20170209103337.1522-1-samuel.pitoiset-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-02-09 10:33   ` Samuel Pitoiset
       [not found]     ` <20170209103337.1522-2-samuel.pitoiset-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2017-02-09 16:08   ` [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted() Christian König
  2017-02-13 16:25   ` Nicolai Hähnle
  2 siblings, 1 reply; 23+ messages in thread
From: Samuel Pitoiset @ 2017-02-09 10:33 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Samuel Pitoiset

Like ttm_bo_validate(), ttm_bo_init() might need to move BO and
the number of bytes moved by TTM should be reported. This can help
the throttle buffer migration mechanism to make a better decision.

v2: fix computation

Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     | 3 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 ++++++
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 402a8954c6d8..5227e4d1d5db 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1720,6 +1720,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data);
 int amdgpu_cs_get_ring(struct amdgpu_device *adev, u32 ip_type,
 		       u32 ip_instance, u32 ring,
 		       struct amdgpu_ring **out_ring);
+void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev, u64 num_bytes);
 void amdgpu_ttm_placement_from_domain(struct amdgpu_bo *abo, u32 domain);
 bool amdgpu_ttm_bo_is_amdgpu_bo(struct ttm_buffer_object *bo);
 int amdgpu_ttm_tt_get_user_pages(struct ttm_tt *ttm, struct page **pages);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 6e948e4986ec..dade2fa9593a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -351,8 +351,7 @@ static u64 amdgpu_cs_get_threshold_for_moves(struct amdgpu_device *adev)
  * submission. This can result in a debt that can stop buffer migrations
  * temporarily.
  */
-static void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev,
-					 u64 num_bytes)
+void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev, u64 num_bytes)
 {
 	spin_lock(&adev->mm_stats.lock);
 	adev->mm_stats.accum_us -= bytes_to_us(adev, num_bytes);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 556236a112c1..4aa2c8a94347 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -323,6 +323,7 @@ int amdgpu_bo_create_restricted(struct amdgpu_device *adev,
 	struct amdgpu_bo *bo;
 	enum ttm_bo_type type;
 	unsigned long page_align;
+	u64 initial_bytes_moved;
 	size_t acc_size;
 	int r;
 
@@ -399,10 +400,15 @@ int amdgpu_bo_create_restricted(struct amdgpu_device *adev,
 		locked = ww_mutex_trylock(&bo->tbo.ttm_resv.lock);
 		WARN_ON(!locked);
 	}
+
+	initial_bytes_moved = atomic64_read(&adev->num_bytes_moved);
 	r = ttm_bo_init(&adev->mman.bdev, &bo->tbo, size, type,
 			&bo->placement, page_align, !kernel, NULL,
 			acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
 			&amdgpu_ttm_bo_destroy);
+	amdgpu_cs_report_moved_bytes(adev,
+		atomic64_read(&adev->num_bytes_moved) - initial_bytes_moved);
+
 	if (unlikely(r != 0)) {
 		if (!resv)
 			ww_mutex_unlock(&bo->tbo.resv->lock);
-- 
2.11.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()
       [not found] ` <20170209103337.1522-1-samuel.pitoiset-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2017-02-09 10:33   ` [PATCH v2 2/2] drm/amdgpu: report the number of bytes moved at buffer creation Samuel Pitoiset
@ 2017-02-09 16:08   ` Christian König
  2017-02-13 16:25   ` Nicolai Hähnle
  2 siblings, 0 replies; 23+ messages in thread
From: Christian König @ 2017-02-09 16:08 UTC (permalink / raw)
  To: Samuel Pitoiset, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Am 09.02.2017 um 11:33 schrieb Samuel Pitoiset:
> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>
> In debug build, the kernel reported "possible recursive locking
> detected" in this codepath. For debugging purposes, I also added
> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
> mutex was locked as expected.
>
> This should fix (random) GPU hangs. The easy way to reproduce the
> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
> Hitman. It will create a huge buffer, evict a bunch of buffers
> (around ~5k) and deadlock.
>
> This regression has been introduced pretty recently.
>
> v2: only release the mutex if resv is NULL
>
> Fixes: 12a852219583 ("drm/amdgpu: improve AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>

Nice catch, both patches are Reviewed-by: Christian König 
<christian.koenig@amd.com>.

Regards,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index d1ef1d064de4..556236a112c1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct amdgpu_device *adev,
>   			&bo->placement, page_align, !kernel, NULL,
>   			acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>   			&amdgpu_ttm_bo_destroy);
> -	if (unlikely(r != 0))
> +	if (unlikely(r != 0)) {
> +		if (!resv)
> +			ww_mutex_unlock(&bo->tbo.resv->lock);
>   		return r;
> +	}
>   
>   	bo->tbo.priority = ilog2(bo->tbo.num_pages);
>   	if (kernel)


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()
       [not found] ` <20170209103337.1522-1-samuel.pitoiset-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2017-02-09 10:33   ` [PATCH v2 2/2] drm/amdgpu: report the number of bytes moved at buffer creation Samuel Pitoiset
  2017-02-09 16:08   ` [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted() Christian König
@ 2017-02-13 16:25   ` Nicolai Hähnle
       [not found]     ` <ef49a53a-6011-0de1-9ff6-3755851daf10-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2 siblings, 1 reply; 23+ messages in thread
From: Nicolai Hähnle @ 2017-02-13 16:25 UTC (permalink / raw)
  To: Samuel Pitoiset, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 09.02.2017 11:33, Samuel Pitoiset wrote:
> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>
> In debug build, the kernel reported "possible recursive locking
> detected" in this codepath. For debugging purposes, I also added
> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
> mutex was locked as expected.
>
> This should fix (random) GPU hangs. The easy way to reproduce the
> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
> Hitman. It will create a huge buffer, evict a bunch of buffers
> (around ~5k) and deadlock.
>
> This regression has been introduced pretty recently.
>
> v2: only release the mutex if resv is NULL
>
> Fixes: 12a852219583 ("drm/amdgpu: improve AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index d1ef1d064de4..556236a112c1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct amdgpu_device *adev,
>  			&bo->placement, page_align, !kernel, NULL,
>  			acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>  			&amdgpu_ttm_bo_destroy);
> -	if (unlikely(r != 0))
> +	if (unlikely(r != 0)) {
> +		if (!resv)
> +			ww_mutex_unlock(&bo->tbo.resv->lock);
>  		return r;
> +	}

I was looking at this myself a couple of weeks back, and I'm pretty sure 
I had this exact same patch just to realize that it's actually incorrect.

The problem is that ttm_bo_init will actually call the destroy function 
(in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been freed.

This code is a huge mess. I'm surprised though: have you verified that 
this patch actually fixes a hang?

Cheers,
Nicolai


>
>  	bo->tbo.priority = ilog2(bo->tbo.num_pages);
>  	if (kernel)
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/2] drm/amdgpu: report the number of bytes moved at buffer creation
       [not found]     ` <20170209103337.1522-2-samuel.pitoiset-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-02-13 16:28       ` Nicolai Hähnle
       [not found]         ` <758a0adf-3e8c-1860-7528-0099a82e79da-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Nicolai Hähnle @ 2017-02-13 16:28 UTC (permalink / raw)
  To: Samuel Pitoiset, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 09.02.2017 11:33, Samuel Pitoiset wrote:
> Like ttm_bo_validate(), ttm_bo_init() might need to move BO and
> the number of bytes moved by TTM should be reported. This can help
> the throttle buffer migration mechanism to make a better decision.

Hmm, this could double-count bytes if there's a concurrent CS submission 
going on.

It's only a heuristic, so I guess it's not too bad, but still - having 
at least a comment about this would be nice.

Nicolai

> v2: fix computation
>
> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     | 3 +--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 ++++++
>  3 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 402a8954c6d8..5227e4d1d5db 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1720,6 +1720,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data);
>  int amdgpu_cs_get_ring(struct amdgpu_device *adev, u32 ip_type,
>  		       u32 ip_instance, u32 ring,
>  		       struct amdgpu_ring **out_ring);
> +void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev, u64 num_bytes);
>  void amdgpu_ttm_placement_from_domain(struct amdgpu_bo *abo, u32 domain);
>  bool amdgpu_ttm_bo_is_amdgpu_bo(struct ttm_buffer_object *bo);
>  int amdgpu_ttm_tt_get_user_pages(struct ttm_tt *ttm, struct page **pages);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 6e948e4986ec..dade2fa9593a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -351,8 +351,7 @@ static u64 amdgpu_cs_get_threshold_for_moves(struct amdgpu_device *adev)
>   * submission. This can result in a debt that can stop buffer migrations
>   * temporarily.
>   */
> -static void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev,
> -					 u64 num_bytes)
> +void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev, u64 num_bytes)
>  {
>  	spin_lock(&adev->mm_stats.lock);
>  	adev->mm_stats.accum_us -= bytes_to_us(adev, num_bytes);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 556236a112c1..4aa2c8a94347 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -323,6 +323,7 @@ int amdgpu_bo_create_restricted(struct amdgpu_device *adev,
>  	struct amdgpu_bo *bo;
>  	enum ttm_bo_type type;
>  	unsigned long page_align;
> +	u64 initial_bytes_moved;
>  	size_t acc_size;
>  	int r;
>
> @@ -399,10 +400,15 @@ int amdgpu_bo_create_restricted(struct amdgpu_device *adev,
>  		locked = ww_mutex_trylock(&bo->tbo.ttm_resv.lock);
>  		WARN_ON(!locked);
>  	}
> +
> +	initial_bytes_moved = atomic64_read(&adev->num_bytes_moved);
>  	r = ttm_bo_init(&adev->mman.bdev, &bo->tbo, size, type,
>  			&bo->placement, page_align, !kernel, NULL,
>  			acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>  			&amdgpu_ttm_bo_destroy);
> +	amdgpu_cs_report_moved_bytes(adev,
> +		atomic64_read(&adev->num_bytes_moved) - initial_bytes_moved);
> +
>  	if (unlikely(r != 0)) {
>  		if (!resv)
>  			ww_mutex_unlock(&bo->tbo.resv->lock);
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/2] drm/amdgpu: report the number of bytes moved at buffer creation
       [not found]         ` <758a0adf-3e8c-1860-7528-0099a82e79da-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-02-13 16:39           ` Christian König
  2017-02-13 17:50           ` Samuel Pitoiset
  1 sibling, 0 replies; 23+ messages in thread
From: Christian König @ 2017-02-13 16:39 UTC (permalink / raw)
  To: Nicolai Hähnle, Samuel Pitoiset,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Am 13.02.2017 um 17:28 schrieb Nicolai Hähnle:
> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>> Like ttm_bo_validate(), ttm_bo_init() might need to move BO and
>> the number of bytes moved by TTM should be reported. This can help
>> the throttle buffer migration mechanism to make a better decision.
>
> Hmm, this could double-count bytes if there's a concurrent CS 
> submission going on.
>
> It's only a heuristic, so I guess it's not too bad, but still - having 
> at least a comment about this would be nice.

Yeah, already working on this.

I've want to give ttm_bo_init and ttm_bo_validate a "context" parameter 
to count the bytes moved as well as the bytes it is allowed to move (as 
a new feature) before returning -EBUSY.

Regards,
Christian.

>
> Nicolai
>
>> v2: fix computation
>>
>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     | 3 +--
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 ++++++
>>  3 files changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 402a8954c6d8..5227e4d1d5db 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -1720,6 +1720,7 @@ int amdgpu_cs_parser_init(struct 
>> amdgpu_cs_parser *p, void *data);
>>  int amdgpu_cs_get_ring(struct amdgpu_device *adev, u32 ip_type,
>>                 u32 ip_instance, u32 ring,
>>                 struct amdgpu_ring **out_ring);
>> +void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev, u64 
>> num_bytes);
>>  void amdgpu_ttm_placement_from_domain(struct amdgpu_bo *abo, u32 
>> domain);
>>  bool amdgpu_ttm_bo_is_amdgpu_bo(struct ttm_buffer_object *bo);
>>  int amdgpu_ttm_tt_get_user_pages(struct ttm_tt *ttm, struct page 
>> **pages);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index 6e948e4986ec..dade2fa9593a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -351,8 +351,7 @@ static u64 
>> amdgpu_cs_get_threshold_for_moves(struct amdgpu_device *adev)
>>   * submission. This can result in a debt that can stop buffer 
>> migrations
>>   * temporarily.
>>   */
>> -static void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev,
>> -                     u64 num_bytes)
>> +void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev, u64 
>> num_bytes)
>>  {
>>      spin_lock(&adev->mm_stats.lock);
>>      adev->mm_stats.accum_us -= bytes_to_us(adev, num_bytes);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> index 556236a112c1..4aa2c8a94347 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> @@ -323,6 +323,7 @@ int amdgpu_bo_create_restricted(struct 
>> amdgpu_device *adev,
>>      struct amdgpu_bo *bo;
>>      enum ttm_bo_type type;
>>      unsigned long page_align;
>> +    u64 initial_bytes_moved;
>>      size_t acc_size;
>>      int r;
>>
>> @@ -399,10 +400,15 @@ int amdgpu_bo_create_restricted(struct 
>> amdgpu_device *adev,
>>          locked = ww_mutex_trylock(&bo->tbo.ttm_resv.lock);
>>          WARN_ON(!locked);
>>      }
>> +
>> +    initial_bytes_moved = atomic64_read(&adev->num_bytes_moved);
>>      r = ttm_bo_init(&adev->mman.bdev, &bo->tbo, size, type,
>>              &bo->placement, page_align, !kernel, NULL,
>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>              &amdgpu_ttm_bo_destroy);
>> +    amdgpu_cs_report_moved_bytes(adev,
>> +        atomic64_read(&adev->num_bytes_moved) - initial_bytes_moved);
>> +
>>      if (unlikely(r != 0)) {
>>          if (!resv)
>>              ww_mutex_unlock(&bo->tbo.resv->lock);
>>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()
       [not found]     ` <ef49a53a-6011-0de1-9ff6-3755851daf10-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-02-13 17:49       ` Samuel Pitoiset
       [not found]         ` <3fc6e61c-cc70-7b10-9abb-45a4f22b3909-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Samuel Pitoiset @ 2017-02-13 17:49 UTC (permalink / raw)
  To: Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



On 02/13/2017 05:25 PM, Nicolai Hähnle wrote:
> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>>
>> In debug build, the kernel reported "possible recursive locking
>> detected" in this codepath. For debugging purposes, I also added
>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>> mutex was locked as expected.
>>
>> This should fix (random) GPU hangs. The easy way to reproduce the
>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>> Hitman. It will create a huge buffer, evict a bunch of buffers
>> (around ~5k) and deadlock.
>>
>> This regression has been introduced pretty recently.
>>
>> v2: only release the mutex if resv is NULL
>>
>> Fixes: 12a852219583 ("drm/amdgpu: improve
>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> index d1ef1d064de4..556236a112c1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>> amdgpu_device *adev,
>>              &bo->placement, page_align, !kernel, NULL,
>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>              &amdgpu_ttm_bo_destroy);
>> -    if (unlikely(r != 0))
>> +    if (unlikely(r != 0)) {
>> +        if (!resv)
>> +            ww_mutex_unlock(&bo->tbo.resv->lock);
>>          return r;
>> +    }
>
> I was looking at this myself a couple of weeks back, and I'm pretty sure
> I had this exact same patch just to realize that it's actually incorrect.
>
> The problem is that ttm_bo_init will actually call the destroy function
> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been freed.
>
> This code is a huge mess. I'm surprised though: have you verified that
> this patch actually fixes a hang?

Yes, I triple-checked. I can't reproduce the hangs with Hitman.

This fixes a deadlock, here's the report:
https://hastebin.com/durodivoma.xml

The resv->lock has to be unlocked when ttm_bo_init() fails (I checked 
with a WARN_ON(is_locked)) because it doesn't call the destroy function 
in all situations. Presumably, when drm_vma_offset_add() fails and resv 
is not NULL, the mutex is not unlocked.

>
> Cheers,
> Nicolai
>
>
>>
>>      bo->tbo.priority = ilog2(bo->tbo.num_pages);
>>      if (kernel)
>>
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 2/2] drm/amdgpu: report the number of bytes moved at buffer creation
       [not found]         ` <758a0adf-3e8c-1860-7528-0099a82e79da-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2017-02-13 16:39           ` Christian König
@ 2017-02-13 17:50           ` Samuel Pitoiset
  1 sibling, 0 replies; 23+ messages in thread
From: Samuel Pitoiset @ 2017-02-13 17:50 UTC (permalink / raw)
  To: Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



On 02/13/2017 05:28 PM, Nicolai Hähnle wrote:
> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>> Like ttm_bo_validate(), ttm_bo_init() might need to move BO and
>> the number of bytes moved by TTM should be reported. This can help
>> the throttle buffer migration mechanism to make a better decision.
>
> Hmm, this could double-count bytes if there's a concurrent CS submission
> going on.

You are right.

Thanks Christian for taking care of this.

>
> It's only a heuristic, so I guess it's not too bad, but still - having
> at least a comment about this would be nice.
>
> Nicolai
>
>> v2: fix computation
>>
>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     | 3 +--
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 ++++++
>>  3 files changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 402a8954c6d8..5227e4d1d5db 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -1720,6 +1720,7 @@ int amdgpu_cs_parser_init(struct
>> amdgpu_cs_parser *p, void *data);
>>  int amdgpu_cs_get_ring(struct amdgpu_device *adev, u32 ip_type,
>>                 u32 ip_instance, u32 ring,
>>                 struct amdgpu_ring **out_ring);
>> +void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev, u64
>> num_bytes);
>>  void amdgpu_ttm_placement_from_domain(struct amdgpu_bo *abo, u32
>> domain);
>>  bool amdgpu_ttm_bo_is_amdgpu_bo(struct ttm_buffer_object *bo);
>>  int amdgpu_ttm_tt_get_user_pages(struct ttm_tt *ttm, struct page
>> **pages);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index 6e948e4986ec..dade2fa9593a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -351,8 +351,7 @@ static u64
>> amdgpu_cs_get_threshold_for_moves(struct amdgpu_device *adev)
>>   * submission. This can result in a debt that can stop buffer migrations
>>   * temporarily.
>>   */
>> -static void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev,
>> -                     u64 num_bytes)
>> +void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev, u64
>> num_bytes)
>>  {
>>      spin_lock(&adev->mm_stats.lock);
>>      adev->mm_stats.accum_us -= bytes_to_us(adev, num_bytes);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> index 556236a112c1..4aa2c8a94347 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> @@ -323,6 +323,7 @@ int amdgpu_bo_create_restricted(struct
>> amdgpu_device *adev,
>>      struct amdgpu_bo *bo;
>>      enum ttm_bo_type type;
>>      unsigned long page_align;
>> +    u64 initial_bytes_moved;
>>      size_t acc_size;
>>      int r;
>>
>> @@ -399,10 +400,15 @@ int amdgpu_bo_create_restricted(struct
>> amdgpu_device *adev,
>>          locked = ww_mutex_trylock(&bo->tbo.ttm_resv.lock);
>>          WARN_ON(!locked);
>>      }
>> +
>> +    initial_bytes_moved = atomic64_read(&adev->num_bytes_moved);
>>      r = ttm_bo_init(&adev->mman.bdev, &bo->tbo, size, type,
>>              &bo->placement, page_align, !kernel, NULL,
>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>              &amdgpu_ttm_bo_destroy);
>> +    amdgpu_cs_report_moved_bytes(adev,
>> +        atomic64_read(&adev->num_bytes_moved) - initial_bytes_moved);
>> +
>>      if (unlikely(r != 0)) {
>>          if (!resv)
>>              ww_mutex_unlock(&bo->tbo.resv->lock);
>>
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()
       [not found]         ` <3fc6e61c-cc70-7b10-9abb-45a4f22b3909-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-02-13 17:52           ` Samuel Pitoiset
  2017-02-13 18:04           ` Nicolai Hähnle
  1 sibling, 0 replies; 23+ messages in thread
From: Samuel Pitoiset @ 2017-02-13 17:52 UTC (permalink / raw)
  To: Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



On 02/13/2017 06:49 PM, Samuel Pitoiset wrote:
>
>
> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote:
>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>>>
>>> In debug build, the kernel reported "possible recursive locking
>>> detected" in this codepath. For debugging purposes, I also added
>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>> mutex was locked as expected.
>>>
>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>> (around ~5k) and deadlock.
>>>
>>> This regression has been introduced pretty recently.
>>>
>>> v2: only release the mutex if resv is NULL
>>>
>>> Fixes: 12a852219583 ("drm/amdgpu: improve
>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> index d1ef1d064de4..556236a112c1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>>> amdgpu_device *adev,
>>>              &bo->placement, page_align, !kernel, NULL,
>>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>              &amdgpu_ttm_bo_destroy);
>>> -    if (unlikely(r != 0))
>>> +    if (unlikely(r != 0)) {
>>> +        if (!resv)
>>> +            ww_mutex_unlock(&bo->tbo.resv->lock);
>>>          return r;
>>> +    }
>>
>> I was looking at this myself a couple of weeks back, and I'm pretty sure
>> I had this exact same patch just to realize that it's actually incorrect.
>>
>> The problem is that ttm_bo_init will actually call the destroy function
>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been
>> freed.
>>
>> This code is a huge mess. I'm surprised though: have you verified that
>> this patch actually fixes a hang?
>
> Yes, I triple-checked. I can't reproduce the hangs with Hitman.
>
> This fixes a deadlock, here's the report:
> https://hastebin.com/durodivoma.xml
>
> The resv->lock has to be unlocked when ttm_bo_init() fails (I checked
> with a WARN_ON(is_locked)) because it doesn't call the destroy function
> in all situations. Presumably, when drm_vma_offset_add() fails and resv
> is not NULL, the mutex is not unlocked.

err, resv is always NULL in this situation.

>
>>
>> Cheers,
>> Nicolai
>>
>>
>>>
>>>      bo->tbo.priority = ilog2(bo->tbo.num_pages);
>>>      if (kernel)
>>>
>>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()
       [not found]         ` <3fc6e61c-cc70-7b10-9abb-45a4f22b3909-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2017-02-13 17:52           ` Samuel Pitoiset
@ 2017-02-13 18:04           ` Nicolai Hähnle
       [not found]             ` <7c0ece9f-8983-317b-0eae-adaf535c6d06-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  1 sibling, 1 reply; 23+ messages in thread
From: Nicolai Hähnle @ 2017-02-13 18:04 UTC (permalink / raw)
  To: Samuel Pitoiset, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 13.02.2017 18:49, Samuel Pitoiset wrote:
>
>
> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote:
>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>>>
>>> In debug build, the kernel reported "possible recursive locking
>>> detected" in this codepath. For debugging purposes, I also added
>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>> mutex was locked as expected.
>>>
>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>> (around ~5k) and deadlock.
>>>
>>> This regression has been introduced pretty recently.
>>>
>>> v2: only release the mutex if resv is NULL
>>>
>>> Fixes: 12a852219583 ("drm/amdgpu: improve
>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> index d1ef1d064de4..556236a112c1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>>> amdgpu_device *adev,
>>>              &bo->placement, page_align, !kernel, NULL,
>>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>              &amdgpu_ttm_bo_destroy);
>>> -    if (unlikely(r != 0))
>>> +    if (unlikely(r != 0)) {
>>> +        if (!resv)
>>> +            ww_mutex_unlock(&bo->tbo.resv->lock);
>>>          return r;
>>> +    }
>>
>> I was looking at this myself a couple of weeks back, and I'm pretty sure
>> I had this exact same patch just to realize that it's actually incorrect.
>>
>> The problem is that ttm_bo_init will actually call the destroy function
>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been
>> freed.
>>
>> This code is a huge mess. I'm surprised though: have you verified that
>> this patch actually fixes a hang?
>
> Yes, I triple-checked. I can't reproduce the hangs with Hitman.

That's surprising, but a relief. Maybe it ties into some of the other 
problems I'm seeing as well.

This means we need a real fix for this; I still think the current patch 
is broken.


> This fixes a deadlock, here's the report:
> https://hastebin.com/durodivoma.xml
>
> The resv->lock has to be unlocked when ttm_bo_init() fails (I checked
> with a WARN_ON(is_locked)) because it doesn't call the destroy function
> in all situations. Presumably, when drm_vma_offset_add() fails and resv
> is not NULL, the mutex is not unlocked.

On which code path is the destroy function not called? If that is the 
case, we're leaking memory.

With the patch as-is, the error paths are either leaking memory (if 
you're right) or accessing memory after it's freed (otherwise). 
Obviously, neither is good.

Nicolai
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()
       [not found]             ` <7c0ece9f-8983-317b-0eae-adaf535c6d06-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-02-13 18:09               ` Nicolai Hähnle
       [not found]                 ` <ed01ca29-e255-62f8-ca99-f079de294eda-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2017-02-13 18:11               ` Samuel Pitoiset
  1 sibling, 1 reply; 23+ messages in thread
From: Nicolai Hähnle @ 2017-02-13 18:09 UTC (permalink / raw)
  To: Samuel Pitoiset, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 13.02.2017 19:04, Nicolai Hähnle wrote:
> On 13.02.2017 18:49, Samuel Pitoiset wrote:
>>
>>
>> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote:
>>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>>> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>>>>
>>>> In debug build, the kernel reported "possible recursive locking
>>>> detected" in this codepath. For debugging purposes, I also added
>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>>> mutex was locked as expected.
>>>>
>>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>>> (around ~5k) and deadlock.
>>>>
>>>> This regression has been introduced pretty recently.
>>>>
>>>> v2: only release the mutex if resv is NULL
>>>>
>>>> Fixes: 12a852219583 ("drm/amdgpu: improve
>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
>>>> ---
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>> index d1ef1d064de4..556236a112c1 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>>>> amdgpu_device *adev,
>>>>              &bo->placement, page_align, !kernel, NULL,
>>>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>>              &amdgpu_ttm_bo_destroy);
>>>> -    if (unlikely(r != 0))
>>>> +    if (unlikely(r != 0)) {
>>>> +        if (!resv)
>>>> +            ww_mutex_unlock(&bo->tbo.resv->lock);
>>>>          return r;
>>>> +    }
>>>
>>> I was looking at this myself a couple of weeks back, and I'm pretty sure
>>> I had this exact same patch just to realize that it's actually
>>> incorrect.
>>>
>>> The problem is that ttm_bo_init will actually call the destroy function
>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been
>>> freed.
>>>
>>> This code is a huge mess. I'm surprised though: have you verified that
>>> this patch actually fixes a hang?
>>
>> Yes, I triple-checked. I can't reproduce the hangs with Hitman.
>
> That's surprising, but a relief. Maybe it ties into some of the other
> problems I'm seeing as well.
>
> This means we need a real fix for this; I still think the current patch
> is broken.
>
>
>> This fixes a deadlock, here's the report:
>> https://hastebin.com/durodivoma.xml
>>
>> The resv->lock has to be unlocked when ttm_bo_init() fails (I checked
>> with a WARN_ON(is_locked)) because it doesn't call the destroy function
>> in all situations. Presumably, when drm_vma_offset_add() fails and resv
>> is not NULL, the mutex is not unlocked.
>
> On which code path is the destroy function not called? If that is the
> case, we're leaking memory.
>
> With the patch as-is, the error paths are either leaking memory (if
> you're right) or accessing memory after it's freed (otherwise).
> Obviously, neither is good.

Actually, I find it extremely suspicious that this patch resolves hangs. 
By all rights, no other task should have a pointer to this bo left. It 
points at problems elsewhere in the code, possibly the precise problem 
I've been trying to track down.

Could you please revert the patch, reproduce the hang, and report 
/proc/$pid/stack for all the hung tasks?

Thanks,
Nicolai
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()
       [not found]             ` <7c0ece9f-8983-317b-0eae-adaf535c6d06-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2017-02-13 18:09               ` Nicolai Hähnle
@ 2017-02-13 18:11               ` Samuel Pitoiset
       [not found]                 ` <e8a51ac0-8319-810a-0fb8-992c3afd4f34-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  1 sibling, 1 reply; 23+ messages in thread
From: Samuel Pitoiset @ 2017-02-13 18:11 UTC (permalink / raw)
  To: Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



On 02/13/2017 07:04 PM, Nicolai Hähnle wrote:
> On 13.02.2017 18:49, Samuel Pitoiset wrote:
>>
>>
>> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote:
>>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>>> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>>>>
>>>> In debug build, the kernel reported "possible recursive locking
>>>> detected" in this codepath. For debugging purposes, I also added
>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>>> mutex was locked as expected.
>>>>
>>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>>> (around ~5k) and deadlock.
>>>>
>>>> This regression has been introduced pretty recently.
>>>>
>>>> v2: only release the mutex if resv is NULL
>>>>
>>>> Fixes: 12a852219583 ("drm/amdgpu: improve
>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
>>>> ---
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>> index d1ef1d064de4..556236a112c1 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>>>> amdgpu_device *adev,
>>>>              &bo->placement, page_align, !kernel, NULL,
>>>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>>              &amdgpu_ttm_bo_destroy);
>>>> -    if (unlikely(r != 0))
>>>> +    if (unlikely(r != 0)) {
>>>> +        if (!resv)
>>>> +            ww_mutex_unlock(&bo->tbo.resv->lock);
>>>>          return r;
>>>> +    }
>>>
>>> I was looking at this myself a couple of weeks back, and I'm pretty sure
>>> I had this exact same patch just to realize that it's actually
>>> incorrect.
>>>
>>> The problem is that ttm_bo_init will actually call the destroy function
>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been
>>> freed.
>>>
>>> This code is a huge mess. I'm surprised though: have you verified that
>>> this patch actually fixes a hang?
>>
>> Yes, I triple-checked. I can't reproduce the hangs with Hitman.
>
> That's surprising, but a relief. Maybe it ties into some of the other
> problems I'm seeing as well.
>
> This means we need a real fix for this; I still think the current patch
> is broken.

Maybe the issue is somewhere else and this not the proper solution, but 
I don't think the given patch is broken as-is. It fixes deadlocks which 
are pretty easy to reproduce with Hitman (as explained in the commit 
description).

>
>
>> This fixes a deadlock, here's the report:
>> https://hastebin.com/durodivoma.xml
>>
>> The resv->lock has to be unlocked when ttm_bo_init() fails (I checked
>> with a WARN_ON(is_locked)) because it doesn't call the destroy function
>> in all situations. Presumably, when drm_vma_offset_add() fails and resv
>> is not NULL, the mutex is not unlocked.
>
> On which code path is the destroy function not called? If that is the
> case, we're leaking memory.
>
> With the patch as-is, the error paths are either leaking memory (if
> you're right) or accessing memory after it's freed (otherwise).
> Obviously, neither is good.

No, I was wrong. resv is always NULL in this situation. The best 
solution is probably to try to clean up that code path because I do 
agree: it's a bit messy.

>
> Nicolai
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()
       [not found]                 ` <e8a51ac0-8319-810a-0fb8-992c3afd4f34-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-02-13 18:19                   ` Nicolai Hähnle
       [not found]                     ` <eeafbad0-ca3b-ba03-7068-f54dd2119267-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Nicolai Hähnle @ 2017-02-13 18:19 UTC (permalink / raw)
  To: Samuel Pitoiset, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 13.02.2017 19:11, Samuel Pitoiset wrote:
>
>
> On 02/13/2017 07:04 PM, Nicolai Hähnle wrote:
>> On 13.02.2017 18:49, Samuel Pitoiset wrote:
>>>
>>>
>>> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote:
>>>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>>>> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>>>>>
>>>>> In debug build, the kernel reported "possible recursive locking
>>>>> detected" in this codepath. For debugging purposes, I also added
>>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>>>> mutex was locked as expected.
>>>>>
>>>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>>>> (around ~5k) and deadlock.
>>>>>
>>>>> This regression has been introduced pretty recently.
>>>>>
>>>>> v2: only release the mutex if resv is NULL
>>>>>
>>>>> Fixes: 12a852219583 ("drm/amdgpu: improve
>>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
>>>>> ---
>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> index d1ef1d064de4..556236a112c1 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>>>>> amdgpu_device *adev,
>>>>>              &bo->placement, page_align, !kernel, NULL,
>>>>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>>>              &amdgpu_ttm_bo_destroy);
>>>>> -    if (unlikely(r != 0))
>>>>> +    if (unlikely(r != 0)) {
>>>>> +        if (!resv)
>>>>> +            ww_mutex_unlock(&bo->tbo.resv->lock);
>>>>>          return r;
>>>>> +    }
>>>>
>>>> I was looking at this myself a couple of weeks back, and I'm pretty
>>>> sure
>>>> I had this exact same patch just to realize that it's actually
>>>> incorrect.
>>>>
>>>> The problem is that ttm_bo_init will actually call the destroy function
>>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been
>>>> freed.
>>>>
>>>> This code is a huge mess. I'm surprised though: have you verified that
>>>> this patch actually fixes a hang?
>>>
>>> Yes, I triple-checked. I can't reproduce the hangs with Hitman.
>>
>> That's surprising, but a relief. Maybe it ties into some of the other
>> problems I'm seeing as well.
>>
>> This means we need a real fix for this; I still think the current patch
>> is broken.
>
> Maybe the issue is somewhere else and this not the proper solution, but
> I don't think the given patch is broken as-is. It fixes deadlocks which
> are pretty easy to reproduce with Hitman (as explained in the commit
> description).

I'm sorry, but a use-after-free is clearly broken.

Nicolai

>
>>
>>
>>> This fixes a deadlock, here's the report:
>>> https://hastebin.com/durodivoma.xml
>>>
>>> The resv->lock has to be unlocked when ttm_bo_init() fails (I checked
>>> with a WARN_ON(is_locked)) because it doesn't call the destroy function
>>> in all situations. Presumably, when drm_vma_offset_add() fails and resv
>>> is not NULL, the mutex is not unlocked.
>>
>> On which code path is the destroy function not called? If that is the
>> case, we're leaking memory.
>>
>> With the patch as-is, the error paths are either leaking memory (if
>> you're right) or accessing memory after it's freed (otherwise).
>> Obviously, neither is good.
>
> No, I was wrong. resv is always NULL in this situation. The best
> solution is probably to try to clean up that code path because I do
> agree: it's a bit messy.
>
>>
>> Nicolai

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()
       [not found]                     ` <eeafbad0-ca3b-ba03-7068-f54dd2119267-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-02-13 18:32                       ` Samuel Pitoiset
       [not found]                         ` <4d8bb8a9-d2b7-2ade-0dd0-8762e1dcc7a5-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Samuel Pitoiset @ 2017-02-13 18:32 UTC (permalink / raw)
  To: Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



On 02/13/2017 07:19 PM, Nicolai Hähnle wrote:
> On 13.02.2017 19:11, Samuel Pitoiset wrote:
>>
>>
>> On 02/13/2017 07:04 PM, Nicolai Hähnle wrote:
>>> On 13.02.2017 18:49, Samuel Pitoiset wrote:
>>>>
>>>>
>>>> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote:
>>>>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>>>>> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>>>>>>
>>>>>> In debug build, the kernel reported "possible recursive locking
>>>>>> detected" in this codepath. For debugging purposes, I also added
>>>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>>>>> mutex was locked as expected.
>>>>>>
>>>>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>>>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>>>>> (around ~5k) and deadlock.
>>>>>>
>>>>>> This regression has been introduced pretty recently.
>>>>>>
>>>>>> v2: only release the mutex if resv is NULL
>>>>>>
>>>>>> Fixes: 12a852219583 ("drm/amdgpu: improve
>>>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
>>>>>> ---
>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>> index d1ef1d064de4..556236a112c1 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>>>>>> amdgpu_device *adev,
>>>>>>              &bo->placement, page_align, !kernel, NULL,
>>>>>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>>>>              &amdgpu_ttm_bo_destroy);
>>>>>> -    if (unlikely(r != 0))
>>>>>> +    if (unlikely(r != 0)) {
>>>>>> +        if (!resv)
>>>>>> +            ww_mutex_unlock(&bo->tbo.resv->lock);
>>>>>>          return r;
>>>>>> +    }
>>>>>
>>>>> I was looking at this myself a couple of weeks back, and I'm pretty
>>>>> sure
>>>>> I had this exact same patch just to realize that it's actually
>>>>> incorrect.
>>>>>
>>>>> The problem is that ttm_bo_init will actually call the destroy
>>>>> function
>>>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been
>>>>> freed.
>>>>>
>>>>> This code is a huge mess. I'm surprised though: have you verified that
>>>>> this patch actually fixes a hang?
>>>>
>>>> Yes, I triple-checked. I can't reproduce the hangs with Hitman.
>>>
>>> That's surprising, but a relief. Maybe it ties into some of the other
>>> problems I'm seeing as well.
>>>
>>> This means we need a real fix for this; I still think the current patch
>>> is broken.
>>
>> Maybe the issue is somewhere else and this not the proper solution, but
>> I don't think the given patch is broken as-is. It fixes deadlocks which
>> are pretty easy to reproduce with Hitman (as explained in the commit
>> description).
>
> I'm sorry, but a use-after-free is clearly broken.

You are right. If the destroy callback is called, there is a 
use-after-free which is bad, really..

>
> Nicolai
>
>>
>>>
>>>
>>>> This fixes a deadlock, here's the report:
>>>> https://hastebin.com/durodivoma.xml
>>>>
>>>> The resv->lock has to be unlocked when ttm_bo_init() fails (I checked
>>>> with a WARN_ON(is_locked)) because it doesn't call the destroy function
>>>> in all situations. Presumably, when drm_vma_offset_add() fails and resv
>>>> is not NULL, the mutex is not unlocked.
>>>
>>> On which code path is the destroy function not called? If that is the
>>> case, we're leaking memory.
>>>
>>> With the patch as-is, the error paths are either leaking memory (if
>>> you're right) or accessing memory after it's freed (otherwise).
>>> Obviously, neither is good.
>>
>> No, I was wrong. resv is always NULL in this situation. The best
>> solution is probably to try to clean up that code path because I do
>> agree: it's a bit messy.
>>
>>>
>>> Nicolai
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()
       [not found]                 ` <ed01ca29-e255-62f8-ca99-f079de294eda-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-02-13 18:38                   ` Samuel Pitoiset
       [not found]                     ` <aaa5ce51-e2e0-f368-c3f0-b94b5afe3881-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Samuel Pitoiset @ 2017-02-13 18:38 UTC (permalink / raw)
  To: Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



On 02/13/2017 07:09 PM, Nicolai Hähnle wrote:
> On 13.02.2017 19:04, Nicolai Hähnle wrote:
>> On 13.02.2017 18:49, Samuel Pitoiset wrote:
>>>
>>>
>>> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote:
>>>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>>>> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>>>>>
>>>>> In debug build, the kernel reported "possible recursive locking
>>>>> detected" in this codepath. For debugging purposes, I also added
>>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>>>> mutex was locked as expected.
>>>>>
>>>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>>>> (around ~5k) and deadlock.
>>>>>
>>>>> This regression has been introduced pretty recently.
>>>>>
>>>>> v2: only release the mutex if resv is NULL
>>>>>
>>>>> Fixes: 12a852219583 ("drm/amdgpu: improve
>>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
>>>>> ---
>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> index d1ef1d064de4..556236a112c1 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>>>>> amdgpu_device *adev,
>>>>>              &bo->placement, page_align, !kernel, NULL,
>>>>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>>>              &amdgpu_ttm_bo_destroy);
>>>>> -    if (unlikely(r != 0))
>>>>> +    if (unlikely(r != 0)) {
>>>>> +        if (!resv)
>>>>> +            ww_mutex_unlock(&bo->tbo.resv->lock);
>>>>>          return r;
>>>>> +    }
>>>>
>>>> I was looking at this myself a couple of weeks back, and I'm pretty
>>>> sure
>>>> I had this exact same patch just to realize that it's actually
>>>> incorrect.
>>>>
>>>> The problem is that ttm_bo_init will actually call the destroy function
>>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been
>>>> freed.
>>>>
>>>> This code is a huge mess. I'm surprised though: have you verified that
>>>> this patch actually fixes a hang?
>>>
>>> Yes, I triple-checked. I can't reproduce the hangs with Hitman.
>>
>> That's surprising, but a relief. Maybe it ties into some of the other
>> problems I'm seeing as well.
>>
>> This means we need a real fix for this; I still think the current patch
>> is broken.
>>
>>
>>> This fixes a deadlock, here's the report:
>>> https://hastebin.com/durodivoma.xml
>>>
>>> The resv->lock has to be unlocked when ttm_bo_init() fails (I checked
>>> with a WARN_ON(is_locked)) because it doesn't call the destroy function
>>> in all situations. Presumably, when drm_vma_offset_add() fails and resv
>>> is not NULL, the mutex is not unlocked.
>>
>> On which code path is the destroy function not called? If that is the
>> case, we're leaking memory.
>>
>> With the patch as-is, the error paths are either leaking memory (if
>> you're right) or accessing memory after it's freed (otherwise).
>> Obviously, neither is good.
>
> Actually, I find it extremely suspicious that this patch resolves hangs.
> By all rights, no other task should have a pointer to this bo left. It
> points at problems elsewhere in the code, possibly the precise problem
> I've been trying to track down.

Well, maybe we are just lucky but as I said, I checked many times to 
reproduce the issue with that patch applied without any success, you can 
trust me. Although I'm also starting to think that's not the right 
solution (and could introduce other ones).

>
> Could you please revert the patch, reproduce the hang, and report
> /proc/$pid/stack for all the hung tasks?

Sure. The thing is: Hitman's branch has been updated during the weekend 
and my local installation is broken. I need to re-download the whole 
game (will take a while).

I will let you know when I'm able to grab that report.

Thanks Nicolai.
>
> Thanks,
> Nicolai
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()
       [not found]                         ` <4d8bb8a9-d2b7-2ade-0dd0-8762e1dcc7a5-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-02-13 18:41                           ` Christian König
       [not found]                             ` <39ea0637-7794-865e-dbf7-6109d295064e-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Christian König @ 2017-02-13 18:41 UTC (permalink / raw)
  To: Samuel Pitoiset, Nicolai Hähnle,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Am 13.02.2017 um 19:32 schrieb Samuel Pitoiset:
>
>
> On 02/13/2017 07:19 PM, Nicolai Hähnle wrote:
>> On 13.02.2017 19:11, Samuel Pitoiset wrote:
>>>
>>>
>>> On 02/13/2017 07:04 PM, Nicolai Hähnle wrote:
>>>> On 13.02.2017 18:49, Samuel Pitoiset wrote:
>>>>>
>>>>>
>>>>> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote:
>>>>>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>>>>>> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>>>>>>>
>>>>>>> In debug build, the kernel reported "possible recursive locking
>>>>>>> detected" in this codepath. For debugging purposes, I also added
>>>>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>>>>>> mutex was locked as expected.
>>>>>>>
>>>>>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>>>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>>>>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>>>>>> (around ~5k) and deadlock.
>>>>>>>
>>>>>>> This regression has been introduced pretty recently.
>>>>>>>
>>>>>>> v2: only release the mutex if resv is NULL
>>>>>>>
>>>>>>> Fixes: 12a852219583 ("drm/amdgpu: improve
>>>>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>>>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
>>>>>>> ---
>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>> index d1ef1d064de4..556236a112c1 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>              &bo->placement, page_align, !kernel, NULL,
>>>>>>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>>>>>              &amdgpu_ttm_bo_destroy);
>>>>>>> -    if (unlikely(r != 0))
>>>>>>> +    if (unlikely(r != 0)) {
>>>>>>> +        if (!resv)
>>>>>>> + ww_mutex_unlock(&bo->tbo.resv->lock);
>>>>>>>          return r;
>>>>>>> +    }
>>>>>>
>>>>>> I was looking at this myself a couple of weeks back, and I'm pretty
>>>>>> sure
>>>>>> I had this exact same patch just to realize that it's actually
>>>>>> incorrect.
>>>>>>
>>>>>> The problem is that ttm_bo_init will actually call the destroy
>>>>>> function
>>>>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been
>>>>>> freed.
>>>>>>
>>>>>> This code is a huge mess. I'm surprised though: have you verified 
>>>>>> that
>>>>>> this patch actually fixes a hang?
>>>>>
>>>>> Yes, I triple-checked. I can't reproduce the hangs with Hitman.
>>>>
>>>> That's surprising, but a relief. Maybe it ties into some of the other
>>>> problems I'm seeing as well.
>>>>
>>>> This means we need a real fix for this; I still think the current 
>>>> patch
>>>> is broken.
>>>
>>> Maybe the issue is somewhere else and this not the proper solution, but
>>> I don't think the given patch is broken as-is. It fixes deadlocks which
>>> are pretty easy to reproduce with Hitman (as explained in the commit
>>> description).
>>
>> I'm sorry, but a use-after-free is clearly broken.
>
> You are right. If the destroy callback is called, there is a 
> use-after-free which is bad, really..

bad. Calling the destroy callback when something goes wrong sound fishy 
to me in the first place when the structure initialized here is 
allocated by the caller.

Probably best to clean that up from the beginning.

Regards,
Christian.

>
>>
>> Nicolai
>>
>>>
>>>>
>>>>
>>>>> This fixes a deadlock, here's the report:
>>>>> https://hastebin.com/durodivoma.xml
>>>>>
>>>>> The resv->lock has to be unlocked when ttm_bo_init() fails (I checked
>>>>> with a WARN_ON(is_locked)) because it doesn't call the destroy 
>>>>> function
>>>>> in all situations. Presumably, when drm_vma_offset_add() fails and 
>>>>> resv
>>>>> is not NULL, the mutex is not unlocked.
>>>>
>>>> On which code path is the destroy function not called? If that is the
>>>> case, we're leaking memory.
>>>>
>>>> With the patch as-is, the error paths are either leaking memory (if
>>>> you're right) or accessing memory after it's freed (otherwise).
>>>> Obviously, neither is good.
>>>
>>> No, I was wrong. resv is always NULL in this situation. The best
>>> solution is probably to try to clean up that code path because I do
>>> agree: it's a bit messy.
>>>
>>>>
>>>> Nicolai
>>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()
       [not found]                             ` <39ea0637-7794-865e-dbf7-6109d295064e-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2017-02-13 18:45                               ` Samuel Pitoiset
  0 siblings, 0 replies; 23+ messages in thread
From: Samuel Pitoiset @ 2017-02-13 18:45 UTC (permalink / raw)
  To: Christian König, Nicolai Hähnle,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



On 02/13/2017 07:41 PM, Christian König wrote:
> Am 13.02.2017 um 19:32 schrieb Samuel Pitoiset:
>>
>>
>> On 02/13/2017 07:19 PM, Nicolai Hähnle wrote:
>>> On 13.02.2017 19:11, Samuel Pitoiset wrote:
>>>>
>>>>
>>>> On 02/13/2017 07:04 PM, Nicolai Hähnle wrote:
>>>>> On 13.02.2017 18:49, Samuel Pitoiset wrote:
>>>>>>
>>>>>>
>>>>>> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote:
>>>>>>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>>>>>>> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>>>>>>>>
>>>>>>>> In debug build, the kernel reported "possible recursive locking
>>>>>>>> detected" in this codepath. For debugging purposes, I also added
>>>>>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>>>>>>> mutex was locked as expected.
>>>>>>>>
>>>>>>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>>>>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>>>>>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>>>>>>> (around ~5k) and deadlock.
>>>>>>>>
>>>>>>>> This regression has been introduced pretty recently.
>>>>>>>>
>>>>>>>> v2: only release the mutex if resv is NULL
>>>>>>>>
>>>>>>>> Fixes: 12a852219583 ("drm/amdgpu: improve
>>>>>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>>>>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
>>>>>>>> ---
>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>> index d1ef1d064de4..556236a112c1 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>>>>>>>> amdgpu_device *adev,
>>>>>>>>              &bo->placement, page_align, !kernel, NULL,
>>>>>>>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>>>>>>              &amdgpu_ttm_bo_destroy);
>>>>>>>> -    if (unlikely(r != 0))
>>>>>>>> +    if (unlikely(r != 0)) {
>>>>>>>> +        if (!resv)
>>>>>>>> + ww_mutex_unlock(&bo->tbo.resv->lock);
>>>>>>>>          return r;
>>>>>>>> +    }
>>>>>>>
>>>>>>> I was looking at this myself a couple of weeks back, and I'm pretty
>>>>>>> sure
>>>>>>> I had this exact same patch just to realize that it's actually
>>>>>>> incorrect.
>>>>>>>
>>>>>>> The problem is that ttm_bo_init will actually call the destroy
>>>>>>> function
>>>>>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been
>>>>>>> freed.
>>>>>>>
>>>>>>> This code is a huge mess. I'm surprised though: have you verified
>>>>>>> that
>>>>>>> this patch actually fixes a hang?
>>>>>>
>>>>>> Yes, I triple-checked. I can't reproduce the hangs with Hitman.
>>>>>
>>>>> That's surprising, but a relief. Maybe it ties into some of the other
>>>>> problems I'm seeing as well.
>>>>>
>>>>> This means we need a real fix for this; I still think the current
>>>>> patch
>>>>> is broken.
>>>>
>>>> Maybe the issue is somewhere else and this not the proper solution, but
>>>> I don't think the given patch is broken as-is. It fixes deadlocks which
>>>> are pretty easy to reproduce with Hitman (as explained in the commit
>>>> description).
>>>
>>> I'm sorry, but a use-after-free is clearly broken.
>>
>> You are right. If the destroy callback is called, there is a
>> use-after-free which is bad, really..
>
> bad. Calling the destroy callback when something goes wrong sound fishy
> to me in the first place when the structure initialized here is
> allocated by the caller.
>
> Probably best to clean that up from the beginning.

Yes.
I wonder where the original issue comes from though. I will need to 
investigate more.

>
> Regards,
> Christian.
>
>>
>>>
>>> Nicolai
>>>
>>>>
>>>>>
>>>>>
>>>>>> This fixes a deadlock, here's the report:
>>>>>> https://hastebin.com/durodivoma.xml
>>>>>>
>>>>>> The resv->lock has to be unlocked when ttm_bo_init() fails (I checked
>>>>>> with a WARN_ON(is_locked)) because it doesn't call the destroy
>>>>>> function
>>>>>> in all situations. Presumably, when drm_vma_offset_add() fails and
>>>>>> resv
>>>>>> is not NULL, the mutex is not unlocked.
>>>>>
>>>>> On which code path is the destroy function not called? If that is the
>>>>> case, we're leaking memory.
>>>>>
>>>>> With the patch as-is, the error paths are either leaking memory (if
>>>>> you're right) or accessing memory after it's freed (otherwise).
>>>>> Obviously, neither is good.
>>>>
>>>> No, I was wrong. resv is always NULL in this situation. The best
>>>> solution is probably to try to clean up that code path because I do
>>>> agree: it's a bit messy.
>>>>
>>>>>
>>>>> Nicolai
>>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()
       [not found]                     ` <aaa5ce51-e2e0-f368-c3f0-b94b5afe3881-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-02-13 18:58                       ` Nicolai Hähnle
       [not found]                         ` <4185541c-a3a4-7681-4539-54c5daddd0ac-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Nicolai Hähnle @ 2017-02-13 18:58 UTC (permalink / raw)
  To: Samuel Pitoiset, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 13.02.2017 19:38, Samuel Pitoiset wrote:
>
>
> On 02/13/2017 07:09 PM, Nicolai Hähnle wrote:
>> On 13.02.2017 19:04, Nicolai Hähnle wrote:
>>> On 13.02.2017 18:49, Samuel Pitoiset wrote:
>>>>
>>>>
>>>> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote:
>>>>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>>>>> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>>>>>>
>>>>>> In debug build, the kernel reported "possible recursive locking
>>>>>> detected" in this codepath. For debugging purposes, I also added
>>>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>>>>> mutex was locked as expected.
>>>>>>
>>>>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>>>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>>>>> (around ~5k) and deadlock.
>>>>>>
>>>>>> This regression has been introduced pretty recently.
>>>>>>
>>>>>> v2: only release the mutex if resv is NULL
>>>>>>
>>>>>> Fixes: 12a852219583 ("drm/amdgpu: improve
>>>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
>>>>>> ---
>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>> index d1ef1d064de4..556236a112c1 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>>>>>> amdgpu_device *adev,
>>>>>>              &bo->placement, page_align, !kernel, NULL,
>>>>>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>>>>              &amdgpu_ttm_bo_destroy);
>>>>>> -    if (unlikely(r != 0))
>>>>>> +    if (unlikely(r != 0)) {
>>>>>> +        if (!resv)
>>>>>> +            ww_mutex_unlock(&bo->tbo.resv->lock);
>>>>>>          return r;
>>>>>> +    }
>>>>>
>>>>> I was looking at this myself a couple of weeks back, and I'm pretty
>>>>> sure
>>>>> I had this exact same patch just to realize that it's actually
>>>>> incorrect.
>>>>>
>>>>> The problem is that ttm_bo_init will actually call the destroy
>>>>> function
>>>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been
>>>>> freed.
>>>>>
>>>>> This code is a huge mess. I'm surprised though: have you verified that
>>>>> this patch actually fixes a hang?
>>>>
>>>> Yes, I triple-checked. I can't reproduce the hangs with Hitman.
>>>
>>> That's surprising, but a relief. Maybe it ties into some of the other
>>> problems I'm seeing as well.
>>>
>>> This means we need a real fix for this; I still think the current patch
>>> is broken.
>>>
>>>
>>>> This fixes a deadlock, here's the report:
>>>> https://hastebin.com/durodivoma.xml
>>>>
>>>> The resv->lock has to be unlocked when ttm_bo_init() fails (I checked
>>>> with a WARN_ON(is_locked)) because it doesn't call the destroy function
>>>> in all situations. Presumably, when drm_vma_offset_add() fails and resv
>>>> is not NULL, the mutex is not unlocked.
>>>
>>> On which code path is the destroy function not called? If that is the
>>> case, we're leaking memory.
>>>
>>> With the patch as-is, the error paths are either leaking memory (if
>>> you're right) or accessing memory after it's freed (otherwise).
>>> Obviously, neither is good.
>>
>> Actually, I find it extremely suspicious that this patch resolves hangs.
>> By all rights, no other task should have a pointer to this bo left. It
>> points at problems elsewhere in the code, possibly the precise problem
>> I've been trying to track down.
>
> Well, maybe we are just lucky but as I said, I checked many times to
> reproduce the issue with that patch applied without any success, you can
> trust me. Although I'm also starting to think that's not the right
> solution (and could introduce other ones).
>
>>
>> Could you please revert the patch, reproduce the hang, and report
>> /proc/$pid/stack for all the hung tasks?
>
> Sure. The thing is: Hitman's branch has been updated during the weekend
> and my local installation is broken. I need to re-download the whole
> game (will take a while).
>
> I will let you know when I'm able to grab that report.

Hmm, so I thought about this some more, and I'm no longer so sure that 
your bug and mine are the same. If it was related, I'd somehow expect 
you to get an error about a mutex being destroyed while it's held (at 
least with lock debugging enabled).

Anyway... we need to change the contract of ttm_bo_init, I'm just not 
yet sure how, because there are two points of failure: one quite early 
on, and the second rather late which gets cleaned up by ttm_bo_unref.

Cheers,
Nicolai

> Thanks Nicolai.
>>
>> Thanks,
>> Nicolai

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()
       [not found]                         ` <4185541c-a3a4-7681-4539-54c5daddd0ac-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-02-13 19:03                           ` Nicolai Hähnle
       [not found]                             ` <60486a5c-0ac3-675e-86f6-67f67db1213c-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2017-02-13 22:13                           ` Samuel Pitoiset
  1 sibling, 1 reply; 23+ messages in thread
From: Nicolai Hähnle @ 2017-02-13 19:03 UTC (permalink / raw)
  To: Samuel Pitoiset, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 13.02.2017 19:58, Nicolai Hähnle wrote:
> On 13.02.2017 19:38, Samuel Pitoiset wrote:
>>
>>
>> On 02/13/2017 07:09 PM, Nicolai Hähnle wrote:
>>> On 13.02.2017 19:04, Nicolai Hähnle wrote:
>>>> On 13.02.2017 18:49, Samuel Pitoiset wrote:
>>>>>
>>>>>
>>>>> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote:
>>>>>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>>>>>> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>>>>>>>
>>>>>>> In debug build, the kernel reported "possible recursive locking
>>>>>>> detected" in this codepath. For debugging purposes, I also added
>>>>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>>>>>> mutex was locked as expected.
>>>>>>>
>>>>>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>>>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>>>>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>>>>>> (around ~5k) and deadlock.
>>>>>>>
>>>>>>> This regression has been introduced pretty recently.
>>>>>>>
>>>>>>> v2: only release the mutex if resv is NULL
>>>>>>>
>>>>>>> Fixes: 12a852219583 ("drm/amdgpu: improve
>>>>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>>>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
>>>>>>> ---
>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>> index d1ef1d064de4..556236a112c1 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>              &bo->placement, page_align, !kernel, NULL,
>>>>>>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>>>>>              &amdgpu_ttm_bo_destroy);
>>>>>>> -    if (unlikely(r != 0))
>>>>>>> +    if (unlikely(r != 0)) {
>>>>>>> +        if (!resv)
>>>>>>> +            ww_mutex_unlock(&bo->tbo.resv->lock);
>>>>>>>          return r;
>>>>>>> +    }
>>>>>>
>>>>>> I was looking at this myself a couple of weeks back, and I'm pretty
>>>>>> sure
>>>>>> I had this exact same patch just to realize that it's actually
>>>>>> incorrect.
>>>>>>
>>>>>> The problem is that ttm_bo_init will actually call the destroy
>>>>>> function
>>>>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been
>>>>>> freed.
>>>>>>
>>>>>> This code is a huge mess. I'm surprised though: have you verified
>>>>>> that
>>>>>> this patch actually fixes a hang?
>>>>>
>>>>> Yes, I triple-checked. I can't reproduce the hangs with Hitman.
>>>>
>>>> That's surprising, but a relief. Maybe it ties into some of the other
>>>> problems I'm seeing as well.
>>>>
>>>> This means we need a real fix for this; I still think the current patch
>>>> is broken.
>>>>
>>>>
>>>>> This fixes a deadlock, here's the report:
>>>>> https://hastebin.com/durodivoma.xml
>>>>>
>>>>> The resv->lock has to be unlocked when ttm_bo_init() fails (I checked
>>>>> with a WARN_ON(is_locked)) because it doesn't call the destroy
>>>>> function
>>>>> in all situations. Presumably, when drm_vma_offset_add() fails and
>>>>> resv
>>>>> is not NULL, the mutex is not unlocked.
>>>>
>>>> On which code path is the destroy function not called? If that is the
>>>> case, we're leaking memory.
>>>>
>>>> With the patch as-is, the error paths are either leaking memory (if
>>>> you're right) or accessing memory after it's freed (otherwise).
>>>> Obviously, neither is good.
>>>
>>> Actually, I find it extremely suspicious that this patch resolves hangs.
>>> By all rights, no other task should have a pointer to this bo left. It
>>> points at problems elsewhere in the code, possibly the precise problem
>>> I've been trying to track down.
>>
>> Well, maybe we are just lucky but as I said, I checked many times to
>> reproduce the issue with that patch applied without any success, you can
>> trust me. Although I'm also starting to think that's not the right
>> solution (and could introduce other ones).
>>
>>>
>>> Could you please revert the patch, reproduce the hang, and report
>>> /proc/$pid/stack for all the hung tasks?
>>
>> Sure. The thing is: Hitman's branch has been updated during the weekend
>> and my local installation is broken. I need to re-download the whole
>> game (will take a while).
>>
>> I will let you know when I'm able to grab that report.
>
> Hmm, so I thought about this some more, and I'm no longer so sure that
> your bug and mine are the same. If it was related, I'd somehow expect
> you to get an error about a mutex being destroyed while it's held (at
> least with lock debugging enabled).
>
> Anyway... we need to change the contract of ttm_bo_init, I'm just not
> yet sure how, because there are two points of failure: one quite early
> on, and the second rather late which gets cleaned up by ttm_bo_unref.

Maybe it would actually be best to split ttm_bo_init into two parts: the 
initial bulk of structure initialization as the first half, and the 
ttm_bo_validate in the second half.

Cheers,
Nicolai

>
> Cheers,
> Nicolai
>
>> Thanks Nicolai.
>>>
>>> Thanks,
>>> Nicolai
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()
       [not found]                         ` <4185541c-a3a4-7681-4539-54c5daddd0ac-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2017-02-13 19:03                           ` Nicolai Hähnle
@ 2017-02-13 22:13                           ` Samuel Pitoiset
  1 sibling, 0 replies; 23+ messages in thread
From: Samuel Pitoiset @ 2017-02-13 22:13 UTC (permalink / raw)
  To: Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



On 02/13/2017 07:58 PM, Nicolai Hähnle wrote:
> On 13.02.2017 19:38, Samuel Pitoiset wrote:
>>
>>
>> On 02/13/2017 07:09 PM, Nicolai Hähnle wrote:
>>> On 13.02.2017 19:04, Nicolai Hähnle wrote:
>>>> On 13.02.2017 18:49, Samuel Pitoiset wrote:
>>>>>
>>>>>
>>>>> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote:
>>>>>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>>>>>> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>>>>>>>
>>>>>>> In debug build, the kernel reported "possible recursive locking
>>>>>>> detected" in this codepath. For debugging purposes, I also added
>>>>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>>>>>> mutex was locked as expected.
>>>>>>>
>>>>>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>>>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>>>>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>>>>>> (around ~5k) and deadlock.
>>>>>>>
>>>>>>> This regression has been introduced pretty recently.
>>>>>>>
>>>>>>> v2: only release the mutex if resv is NULL
>>>>>>>
>>>>>>> Fixes: 12a852219583 ("drm/amdgpu: improve
>>>>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>>>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
>>>>>>> ---
>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>> index d1ef1d064de4..556236a112c1 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>              &bo->placement, page_align, !kernel, NULL,
>>>>>>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>>>>>              &amdgpu_ttm_bo_destroy);
>>>>>>> -    if (unlikely(r != 0))
>>>>>>> +    if (unlikely(r != 0)) {
>>>>>>> +        if (!resv)
>>>>>>> +            ww_mutex_unlock(&bo->tbo.resv->lock);
>>>>>>>          return r;
>>>>>>> +    }
>>>>>>
>>>>>> I was looking at this myself a couple of weeks back, and I'm pretty
>>>>>> sure
>>>>>> I had this exact same patch just to realize that it's actually
>>>>>> incorrect.
>>>>>>
>>>>>> The problem is that ttm_bo_init will actually call the destroy
>>>>>> function
>>>>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been
>>>>>> freed.
>>>>>>
>>>>>> This code is a huge mess. I'm surprised though: have you verified
>>>>>> that
>>>>>> this patch actually fixes a hang?
>>>>>
>>>>> Yes, I triple-checked. I can't reproduce the hangs with Hitman.
>>>>
>>>> That's surprising, but a relief. Maybe it ties into some of the other
>>>> problems I'm seeing as well.
>>>>
>>>> This means we need a real fix for this; I still think the current patch
>>>> is broken.
>>>>
>>>>
>>>>> This fixes a deadlock, here's the report:
>>>>> https://hastebin.com/durodivoma.xml
>>>>>
>>>>> The resv->lock has to be unlocked when ttm_bo_init() fails (I checked
>>>>> with a WARN_ON(is_locked)) because it doesn't call the destroy
>>>>> function
>>>>> in all situations. Presumably, when drm_vma_offset_add() fails and
>>>>> resv
>>>>> is not NULL, the mutex is not unlocked.
>>>>
>>>> On which code path is the destroy function not called? If that is the
>>>> case, we're leaking memory.
>>>>
>>>> With the patch as-is, the error paths are either leaking memory (if
>>>> you're right) or accessing memory after it's freed (otherwise).
>>>> Obviously, neither is good.
>>>
>>> Actually, I find it extremely suspicious that this patch resolves hangs.
>>> By all rights, no other task should have a pointer to this bo left. It
>>> points at problems elsewhere in the code, possibly the precise problem
>>> I've been trying to track down.
>>
>> Well, maybe we are just lucky but as I said, I checked many times to
>> reproduce the issue with that patch applied without any success, you can
>> trust me. Although I'm also starting to think that's not the right
>> solution (and could introduce other ones).
>>
>>>
>>> Could you please revert the patch, reproduce the hang, and report
>>> /proc/$pid/stack for all the hung tasks?
>>
>> Sure. The thing is: Hitman's branch has been updated during the weekend
>> and my local installation is broken. I need to re-download the whole
>> game (will take a while).
>>
>> I will let you know when I'm able to grab that report.
>
> Hmm, so I thought about this some more, and I'm no longer so sure that
> your bug and mine are the same. If it was related, I'd somehow expect
> you to get an error about a mutex being destroyed while it's held (at
> least with lock debugging enabled).

I enabled a bunch of debugging options (include lock debugging) when I 
investigated into that issue.

Yeah, unclear if it's actually related but something definitely needs to 
be fixed.

>
> Anyway... we need to change the contract of ttm_bo_init, I'm just not
> yet sure how, because there are two points of failure: one quite early
> on, and the second rather late which gets cleaned up by ttm_bo_unref.
>
> Cheers,
> Nicolai
>
>> Thanks Nicolai.
>>>
>>> Thanks,
>>> Nicolai
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()
       [not found]                             ` <60486a5c-0ac3-675e-86f6-67f67db1213c-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-02-14  2:56                               ` zhoucm1
       [not found]                                 ` <58A271E0.9090807-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: zhoucm1 @ 2017-02-14  2:56 UTC (permalink / raw)
  To: Nicolai Hähnle, Samuel Pitoiset,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



On 2017年02月14日 03:03, Nicolai Hähnle wrote:
> On 13.02.2017 19:58, Nicolai Hähnle wrote:
>> On 13.02.2017 19:38, Samuel Pitoiset wrote:
>>>
>>>
>>> On 02/13/2017 07:09 PM, Nicolai Hähnle wrote:
>>>> On 13.02.2017 19:04, Nicolai Hähnle wrote:
>>>>> On 13.02.2017 18:49, Samuel Pitoiset wrote:
>>>>>>
>>>>>>
>>>>>> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote:
>>>>>>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>>>>>>> When ttm_bo_init() fails, the reservation mutex should be 
>>>>>>>> unlocked.
>>>>>>>>
>>>>>>>> In debug build, the kernel reported "possible recursive locking
>>>>>>>> detected" in this codepath. For debugging purposes, I also added
>>>>>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>>>>>>> mutex was locked as expected.
>>>>>>>>
>>>>>>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>>>>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>>>>>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>>>>>>> (around ~5k) and deadlock.
>>>>>>>>
>>>>>>>> This regression has been introduced pretty recently.
>>>>>>>>
>>>>>>>> v2: only release the mutex if resv is NULL
>>>>>>>>
>>>>>>>> Fixes: 12a852219583 ("drm/amdgpu: improve
>>>>>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>>>>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
>>>>>>>> ---
>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>> index d1ef1d064de4..556236a112c1 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>>>>>>>> amdgpu_device *adev,
>>>>>>>>              &bo->placement, page_align, !kernel, NULL,
>>>>>>>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>>>>>>              &amdgpu_ttm_bo_destroy);
>>>>>>>> -    if (unlikely(r != 0))
>>>>>>>> +    if (unlikely(r != 0)) {
>>>>>>>> +        if (!resv)
>>>>>>>> + ww_mutex_unlock(&bo->tbo.resv->lock);
>>>>>>>>          return r;
>>>>>>>> +    }
>>>>>>>
>>>>>>> I was looking at this myself a couple of weeks back, and I'm pretty
>>>>>>> sure
>>>>>>> I had this exact same patch just to realize that it's actually
>>>>>>> incorrect.
>>>>>>>
>>>>>>> The problem is that ttm_bo_init will actually call the destroy
>>>>>>> function
>>>>>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been
>>>>>>> freed.
>>>>>>>
>>>>>>> This code is a huge mess. I'm surprised though: have you verified
>>>>>>> that
>>>>>>> this patch actually fixes a hang?
>>>>>>
>>>>>> Yes, I triple-checked. I can't reproduce the hangs with Hitman.
>>>>>
>>>>> That's surprising, but a relief. Maybe it ties into some of the other
>>>>> problems I'm seeing as well.
>>>>>
>>>>> This means we need a real fix for this; I still think the current 
>>>>> patch
>>>>> is broken.
>>>>>
>>>>>
>>>>>> This fixes a deadlock, here's the report:
>>>>>> https://hastebin.com/durodivoma.xml
>>>>>>
>>>>>> The resv->lock has to be unlocked when ttm_bo_init() fails (I 
>>>>>> checked
>>>>>> with a WARN_ON(is_locked)) because it doesn't call the destroy
>>>>>> function
>>>>>> in all situations. Presumably, when drm_vma_offset_add() fails and
>>>>>> resv
>>>>>> is not NULL, the mutex is not unlocked.
>>>>>
>>>>> On which code path is the destroy function not called? If that is the
>>>>> case, we're leaking memory.
>>>>>
>>>>> With the patch as-is, the error paths are either leaking memory (if
>>>>> you're right) or accessing memory after it's freed (otherwise).
>>>>> Obviously, neither is good.
>>>>
>>>> Actually, I find it extremely suspicious that this patch resolves 
>>>> hangs.
>>>> By all rights, no other task should have a pointer to this bo left. It
>>>> points at problems elsewhere in the code, possibly the precise problem
>>>> I've been trying to track down.
>>>
>>> Well, maybe we are just lucky but as I said, I checked many times to
>>> reproduce the issue with that patch applied without any success, you 
>>> can
>>> trust me. Although I'm also starting to think that's not the right
>>> solution (and could introduce other ones).
>>>
>>>>
>>>> Could you please revert the patch, reproduce the hang, and report
>>>> /proc/$pid/stack for all the hung tasks?
>>>
>>> Sure. The thing is: Hitman's branch has been updated during the weekend
>>> and my local installation is broken. I need to re-download the whole
>>> game (will take a while).
>>>
>>> I will let you know when I'm able to grab that report.
>>
>> Hmm, so I thought about this some more, and I'm no longer so sure that
>> your bug and mine are the same. If it was related, I'd somehow expect
>> you to get an error about a mutex being destroyed while it's held (at
>> least with lock debugging enabled).
>>
>> Anyway... we need to change the contract of ttm_bo_init, I'm just not
>> yet sure how, because there are two points of failure: one quite early
>> on, and the second rather late which gets cleaned up by ttm_bo_unref.
>
> Maybe it would actually be best to split ttm_bo_init into two parts: 
> the initial bulk of structure initialization as the first half, and 
> the ttm_bo_validate in the second half.
Agreed. Have you gone ahead with your proposal?
Although Samuel's patch isn't best way, it indeed fix a OCL bug which is 
trying to allocate multiple big buffers.

Thanks,
David Zhou
>
> Cheers,
> Nicolai
>
>>
>> Cheers,
>> Nicolai
>>
>>> Thanks Nicolai.
>>>>
>>>> Thanks,
>>>> Nicolai
>>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()
       [not found]                                 ` <58A271E0.9090807-5C7GfCeVMHo@public.gmane.org>
@ 2017-02-14  8:30                                   ` Nicolai Hähnle
  2017-02-14 17:31                                   ` Alex Deucher
  1 sibling, 0 replies; 23+ messages in thread
From: Nicolai Hähnle @ 2017-02-14  8:30 UTC (permalink / raw)
  To: zhoucm1, Samuel Pitoiset,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 14.02.2017 03:56, zhoucm1 wrote:
>
>
> On 2017年02月14日 03:03, Nicolai Hähnle wrote:
>> On 13.02.2017 19:58, Nicolai Hähnle wrote:
>>> On 13.02.2017 19:38, Samuel Pitoiset wrote:
>>>>
>>>>
>>>> On 02/13/2017 07:09 PM, Nicolai Hähnle wrote:
>>>>> On 13.02.2017 19:04, Nicolai Hähnle wrote:
>>>>>> On 13.02.2017 18:49, Samuel Pitoiset wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote:
>>>>>>>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>>>>>>>> When ttm_bo_init() fails, the reservation mutex should be
>>>>>>>>> unlocked.
>>>>>>>>>
>>>>>>>>> In debug build, the kernel reported "possible recursive locking
>>>>>>>>> detected" in this codepath. For debugging purposes, I also added
>>>>>>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>>>>>>>> mutex was locked as expected.
>>>>>>>>>
>>>>>>>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>>>>>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>>>>>>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>>>>>>>> (around ~5k) and deadlock.
>>>>>>>>>
>>>>>>>>> This regression has been introduced pretty recently.
>>>>>>>>>
>>>>>>>>> v2: only release the mutex if resv is NULL
>>>>>>>>>
>>>>>>>>> Fixes: 12a852219583 ("drm/amdgpu: improve
>>>>>>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>>>>>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
>>>>>>>>> ---
>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>>>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>>> index d1ef1d064de4..556236a112c1 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>              &bo->placement, page_align, !kernel, NULL,
>>>>>>>>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>>>>>>>              &amdgpu_ttm_bo_destroy);
>>>>>>>>> -    if (unlikely(r != 0))
>>>>>>>>> +    if (unlikely(r != 0)) {
>>>>>>>>> +        if (!resv)
>>>>>>>>> + ww_mutex_unlock(&bo->tbo.resv->lock);
>>>>>>>>>          return r;
>>>>>>>>> +    }
>>>>>>>>
>>>>>>>> I was looking at this myself a couple of weeks back, and I'm pretty
>>>>>>>> sure
>>>>>>>> I had this exact same patch just to realize that it's actually
>>>>>>>> incorrect.
>>>>>>>>
>>>>>>>> The problem is that ttm_bo_init will actually call the destroy
>>>>>>>> function
>>>>>>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been
>>>>>>>> freed.
>>>>>>>>
>>>>>>>> This code is a huge mess. I'm surprised though: have you verified
>>>>>>>> that
>>>>>>>> this patch actually fixes a hang?
>>>>>>>
>>>>>>> Yes, I triple-checked. I can't reproduce the hangs with Hitman.
>>>>>>
>>>>>> That's surprising, but a relief. Maybe it ties into some of the other
>>>>>> problems I'm seeing as well.
>>>>>>
>>>>>> This means we need a real fix for this; I still think the current
>>>>>> patch
>>>>>> is broken.
>>>>>>
>>>>>>
>>>>>>> This fixes a deadlock, here's the report:
>>>>>>> https://hastebin.com/durodivoma.xml
>>>>>>>
>>>>>>> The resv->lock has to be unlocked when ttm_bo_init() fails (I
>>>>>>> checked
>>>>>>> with a WARN_ON(is_locked)) because it doesn't call the destroy
>>>>>>> function
>>>>>>> in all situations. Presumably, when drm_vma_offset_add() fails and
>>>>>>> resv
>>>>>>> is not NULL, the mutex is not unlocked.
>>>>>>
>>>>>> On which code path is the destroy function not called? If that is the
>>>>>> case, we're leaking memory.
>>>>>>
>>>>>> With the patch as-is, the error paths are either leaking memory (if
>>>>>> you're right) or accessing memory after it's freed (otherwise).
>>>>>> Obviously, neither is good.
>>>>>
>>>>> Actually, I find it extremely suspicious that this patch resolves
>>>>> hangs.
>>>>> By all rights, no other task should have a pointer to this bo left. It
>>>>> points at problems elsewhere in the code, possibly the precise problem
>>>>> I've been trying to track down.
>>>>
>>>> Well, maybe we are just lucky but as I said, I checked many times to
>>>> reproduce the issue with that patch applied without any success, you
>>>> can
>>>> trust me. Although I'm also starting to think that's not the right
>>>> solution (and could introduce other ones).
>>>>
>>>>>
>>>>> Could you please revert the patch, reproduce the hang, and report
>>>>> /proc/$pid/stack for all the hung tasks?
>>>>
>>>> Sure. The thing is: Hitman's branch has been updated during the weekend
>>>> and my local installation is broken. I need to re-download the whole
>>>> game (will take a while).
>>>>
>>>> I will let you know when I'm able to grab that report.
>>>
>>> Hmm, so I thought about this some more, and I'm no longer so sure that
>>> your bug and mine are the same. If it was related, I'd somehow expect
>>> you to get an error about a mutex being destroyed while it's held (at
>>> least with lock debugging enabled).
>>>
>>> Anyway... we need to change the contract of ttm_bo_init, I'm just not
>>> yet sure how, because there are two points of failure: one quite early
>>> on, and the second rather late which gets cleaned up by ttm_bo_unref.
>>
>> Maybe it would actually be best to split ttm_bo_init into two parts:
>> the initial bulk of structure initialization as the first half, and
>> the ttm_bo_validate in the second half.
> Agreed. Have you gone ahead with your proposal?

I'm looking into it now.

Nicolai

> Although Samuel's patch isn't best way, it indeed fix a OCL bug which is
> trying to allocate multiple big buffers.
>
> Thanks,
> David Zhou
>>
>> Cheers,
>> Nicolai
>>
>>>
>>> Cheers,
>>> Nicolai
>>>
>>>> Thanks Nicolai.
>>>>>
>>>>> Thanks,
>>>>> Nicolai
>>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()
       [not found]                                 ` <58A271E0.9090807-5C7GfCeVMHo@public.gmane.org>
  2017-02-14  8:30                                   ` Nicolai Hähnle
@ 2017-02-14 17:31                                   ` Alex Deucher
  1 sibling, 0 replies; 23+ messages in thread
From: Alex Deucher @ 2017-02-14 17:31 UTC (permalink / raw)
  To: zhoucm1; +Cc: Nicolai Hähnle, amd-gfx list, Samuel Pitoiset

On Mon, Feb 13, 2017 at 9:56 PM, zhoucm1 <david1.zhou@amd.com> wrote:
>
>
> On 2017年02月14日 03:03, Nicolai Hähnle wrote:
>>
>> On 13.02.2017 19:58, Nicolai Hähnle wrote:
>>>
>>> On 13.02.2017 19:38, Samuel Pitoiset wrote:
>>>>
>>>>
>>>>
>>>> On 02/13/2017 07:09 PM, Nicolai Hähnle wrote:
>>>>>
>>>>> On 13.02.2017 19:04, Nicolai Hähnle wrote:
>>>>>>
>>>>>> On 13.02.2017 18:49, Samuel Pitoiset wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote:
>>>>>>>>
>>>>>>>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>>>>>>>>
>>>>>>>>> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>>>>>>>>>
>>>>>>>>> In debug build, the kernel reported "possible recursive locking
>>>>>>>>> detected" in this codepath. For debugging purposes, I also added
>>>>>>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>>>>>>>> mutex was locked as expected.
>>>>>>>>>
>>>>>>>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>>>>>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>>>>>>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>>>>>>>> (around ~5k) and deadlock.
>>>>>>>>>
>>>>>>>>> This regression has been introduced pretty recently.
>>>>>>>>>
>>>>>>>>> v2: only release the mutex if resv is NULL
>>>>>>>>>
>>>>>>>>> Fixes: 12a852219583 ("drm/amdgpu: improve
>>>>>>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>>>>>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
>>>>>>>>> ---
>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>>>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>>> index d1ef1d064de4..556236a112c1 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>              &bo->placement, page_align, !kernel, NULL,
>>>>>>>>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>>>>>>>              &amdgpu_ttm_bo_destroy);
>>>>>>>>> -    if (unlikely(r != 0))
>>>>>>>>> +    if (unlikely(r != 0)) {
>>>>>>>>> +        if (!resv)
>>>>>>>>> + ww_mutex_unlock(&bo->tbo.resv->lock);
>>>>>>>>>          return r;
>>>>>>>>> +    }
>>>>>>>>
>>>>>>>>
>>>>>>>> I was looking at this myself a couple of weeks back, and I'm pretty
>>>>>>>> sure
>>>>>>>> I had this exact same patch just to realize that it's actually
>>>>>>>> incorrect.
>>>>>>>>
>>>>>>>> The problem is that ttm_bo_init will actually call the destroy
>>>>>>>> function
>>>>>>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been
>>>>>>>> freed.
>>>>>>>>
>>>>>>>> This code is a huge mess. I'm surprised though: have you verified
>>>>>>>> that
>>>>>>>> this patch actually fixes a hang?
>>>>>>>
>>>>>>>
>>>>>>> Yes, I triple-checked. I can't reproduce the hangs with Hitman.
>>>>>>
>>>>>>
>>>>>> That's surprising, but a relief. Maybe it ties into some of the other
>>>>>> problems I'm seeing as well.
>>>>>>
>>>>>> This means we need a real fix for this; I still think the current
>>>>>> patch
>>>>>> is broken.
>>>>>>
>>>>>>
>>>>>>> This fixes a deadlock, here's the report:
>>>>>>> https://hastebin.com/durodivoma.xml
>>>>>>>
>>>>>>> The resv->lock has to be unlocked when ttm_bo_init() fails (I checked
>>>>>>> with a WARN_ON(is_locked)) because it doesn't call the destroy
>>>>>>> function
>>>>>>> in all situations. Presumably, when drm_vma_offset_add() fails and
>>>>>>> resv
>>>>>>> is not NULL, the mutex is not unlocked.
>>>>>>
>>>>>>
>>>>>> On which code path is the destroy function not called? If that is the
>>>>>> case, we're leaking memory.
>>>>>>
>>>>>> With the patch as-is, the error paths are either leaking memory (if
>>>>>> you're right) or accessing memory after it's freed (otherwise).
>>>>>> Obviously, neither is good.
>>>>>
>>>>>
>>>>> Actually, I find it extremely suspicious that this patch resolves
>>>>> hangs.
>>>>> By all rights, no other task should have a pointer to this bo left. It
>>>>> points at problems elsewhere in the code, possibly the precise problem
>>>>> I've been trying to track down.
>>>>
>>>>
>>>> Well, maybe we are just lucky but as I said, I checked many times to
>>>> reproduce the issue with that patch applied without any success, you can
>>>> trust me. Although I'm also starting to think that's not the right
>>>> solution (and could introduce other ones).
>>>>
>>>>>
>>>>> Could you please revert the patch, reproduce the hang, and report
>>>>> /proc/$pid/stack for all the hung tasks?
>>>>
>>>>
>>>> Sure. The thing is: Hitman's branch has been updated during the weekend
>>>> and my local installation is broken. I need to re-download the whole
>>>> game (will take a while).
>>>>
>>>> I will let you know when I'm able to grab that report.
>>>
>>>
>>> Hmm, so I thought about this some more, and I'm no longer so sure that
>>> your bug and mine are the same. If it was related, I'd somehow expect
>>> you to get an error about a mutex being destroyed while it's held (at
>>> least with lock debugging enabled).
>>>
>>> Anyway... we need to change the contract of ttm_bo_init, I'm just not
>>> yet sure how, because there are two points of failure: one quite early
>>> on, and the second rather late which gets cleaned up by ttm_bo_unref.
>>
>>
>> Maybe it would actually be best to split ttm_bo_init into two parts: the
>> initial bulk of structure initialization as the first half, and the
>> ttm_bo_validate in the second half.
>
> Agreed. Have you gone ahead with your proposal?
> Although Samuel's patch isn't best way, it indeed fix a OCL bug which is
> trying to allocate multiple big buffers.

I've applied it for now.

Alex

>
> Thanks,
> David Zhou
>
>>
>> Cheers,
>> Nicolai
>>
>>>
>>> Cheers,
>>> Nicolai
>>>
>>>> Thanks Nicolai.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Nicolai
>>>
>>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-02-14 17:31 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-02-09 10:33 [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted() Samuel Pitoiset
     [not found] ` <20170209103337.1522-1-samuel.pitoiset-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-09 10:33   ` [PATCH v2 2/2] drm/amdgpu: report the number of bytes moved at buffer creation Samuel Pitoiset
     [not found]     ` <20170209103337.1522-2-samuel.pitoiset-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 16:28       ` Nicolai Hähnle
     [not found]         ` <758a0adf-3e8c-1860-7528-0099a82e79da-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 16:39           ` Christian König
2017-02-13 17:50           ` Samuel Pitoiset
2017-02-09 16:08   ` [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted() Christian König
2017-02-13 16:25   ` Nicolai Hähnle
     [not found]     ` <ef49a53a-6011-0de1-9ff6-3755851daf10-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 17:49       ` Samuel Pitoiset
     [not found]         ` <3fc6e61c-cc70-7b10-9abb-45a4f22b3909-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 17:52           ` Samuel Pitoiset
2017-02-13 18:04           ` Nicolai Hähnle
     [not found]             ` <7c0ece9f-8983-317b-0eae-adaf535c6d06-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 18:09               ` Nicolai Hähnle
     [not found]                 ` <ed01ca29-e255-62f8-ca99-f079de294eda-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 18:38                   ` Samuel Pitoiset
     [not found]                     ` <aaa5ce51-e2e0-f368-c3f0-b94b5afe3881-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 18:58                       ` Nicolai Hähnle
     [not found]                         ` <4185541c-a3a4-7681-4539-54c5daddd0ac-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 19:03                           ` Nicolai Hähnle
     [not found]                             ` <60486a5c-0ac3-675e-86f6-67f67db1213c-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-14  2:56                               ` zhoucm1
     [not found]                                 ` <58A271E0.9090807-5C7GfCeVMHo@public.gmane.org>
2017-02-14  8:30                                   ` Nicolai Hähnle
2017-02-14 17:31                                   ` Alex Deucher
2017-02-13 22:13                           ` Samuel Pitoiset
2017-02-13 18:11               ` Samuel Pitoiset
     [not found]                 ` <e8a51ac0-8319-810a-0fb8-992c3afd4f34-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 18:19                   ` Nicolai Hähnle
     [not found]                     ` <eeafbad0-ca3b-ba03-7068-f54dd2119267-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 18:32                       ` Samuel Pitoiset
     [not found]                         ` <4d8bb8a9-d2b7-2ade-0dd0-8762e1dcc7a5-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 18:41                           ` Christian König
     [not found]                             ` <39ea0637-7794-865e-dbf7-6109d295064e-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2017-02-13 18:45                               ` Samuel Pitoiset

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.