All of lore.kernel.org
 help / color / mirror / Atom feed
From: Oleksandr Natalenko <oleksandr@natalenko.name>
To: amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	Tvrtko Ursulin <tursulin@igalia.com>
Cc: kernel-dev@igalia.com,
	"Tvrtko Ursulin" <tvrtko.ursulin@igalia.com>,
	stable@vger.kernel.org, "Matthew Brost" <matthew.brost@intel.com>,
	"Danilo Krummrich" <dakr@kernel.org>,
	"Philipp Stanner" <pstanner@redhat.com>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Tejun Heo" <tj@kernel.org>
Subject: Re: [PATCH] drm/amdgpu: Make the submission path memory reclaim safe
Date: Tue, 03 Dec 2024 15:24:36 +0100	[thread overview]
Message-ID: <2757527.mvXUDI8C0e@natalenko.name> (raw)
In-Reply-To: <20241113134838.52608-1-tursulin@igalia.com>

[-- Attachment #1: Type: text/plain, Size: 5274 bytes --]

On středa 13. listopadu 2024 14:48:38, středoevropský standardní čas Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> 
> As commit 746ae46c1113 ("drm/sched: Mark scheduler work queues with WQ_MEM_RECLAIM")
> points out, ever since
> a6149f039369 ("drm/sched: Convert drm scheduler to use a work queue rather than kthread"),
> any workqueue flushing done from the job submission path must only
> involve memory reclaim safe workqueues to be safe against reclaim
> deadlocks.
> 
> This is also pointed out by workqueue sanity checks:
> 
>  [ ] workqueue: WQ_MEM_RECLAIM sdma0:drm_sched_run_job_work [gpu_sched] is flushing !WQ_MEM_RECLAIM events:amdgpu_device_delay_enable_gfx_off [amdgpu]
> ...
>  [ ] Workqueue: sdma0 drm_sched_run_job_work [gpu_sched]
> ...
>  [ ] Call Trace:
>  [ ]  <TASK>
> ...
>  [ ]  ? check_flush_dependency+0xf5/0x110
> ...
>  [ ]  cancel_delayed_work_sync+0x6e/0x80
>  [ ]  amdgpu_gfx_off_ctrl+0xab/0x140 [amdgpu]
>  [ ]  amdgpu_ring_alloc+0x40/0x50 [amdgpu]
>  [ ]  amdgpu_ib_schedule+0xf4/0x810 [amdgpu]
>  [ ]  ? drm_sched_run_job_work+0x22c/0x430 [gpu_sched]
>  [ ]  amdgpu_job_run+0xaa/0x1f0 [amdgpu]
>  [ ]  drm_sched_run_job_work+0x257/0x430 [gpu_sched]
>  [ ]  process_one_work+0x217/0x720
> ...
>  [ ]  </TASK>
> 
> Fix this by creating a memory reclaim safe driver workqueue and make the
> submission path use it.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> References: 746ae46c1113 ("drm/sched: Mark scheduler work queues with WQ_MEM_RECLAIM")
> Fixes: a6149f039369 ("drm/sched: Convert drm scheduler to use a work queue rather than kthread")
> Cc: stable@vger.kernel.org
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: Philipp Stanner <pstanner@redhat.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  2 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 25 +++++++++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c |  5 +++--
>  3 files changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 7645e498faa4..a6aad687537e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -268,6 +268,8 @@ extern int amdgpu_agp;
>  
>  extern int amdgpu_wbrf;
>  
> +extern struct workqueue_struct *amdgpu_reclaim_wq;
> +
>  #define AMDGPU_VM_MAX_NUM_CTX			4096
>  #define AMDGPU_SG_THRESHOLD			(256*1024*1024)
>  #define AMDGPU_WAIT_IDLE_TIMEOUT_IN_MS	        3000
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 38686203bea6..f5b7172e8042 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -255,6 +255,8 @@ struct amdgpu_watchdog_timer amdgpu_watchdog_timer = {
>  	.period = 0x0, /* default to 0x0 (timeout disable) */
>  };
>  
> +struct workqueue_struct *amdgpu_reclaim_wq;
> +
>  /**
>   * DOC: vramlimit (int)
>   * Restrict the total amount of VRAM in MiB for testing.  The default is 0 (Use full VRAM).
> @@ -2971,6 +2973,21 @@ static struct pci_driver amdgpu_kms_pci_driver = {
>  	.dev_groups = amdgpu_sysfs_groups,
>  };
>  
> +static int amdgpu_wq_init(void)
> +{
> +	amdgpu_reclaim_wq =
> +		alloc_workqueue("amdgpu-reclaim", WQ_MEM_RECLAIM, 0);
> +	if (!amdgpu_reclaim_wq)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +static void amdgpu_wq_fini(void)
> +{
> +	destroy_workqueue(amdgpu_reclaim_wq);
> +}
> +
>  static int __init amdgpu_init(void)
>  {
>  	int r;
> @@ -2978,6 +2995,10 @@ static int __init amdgpu_init(void)
>  	if (drm_firmware_drivers_only())
>  		return -EINVAL;
>  
> +	r = amdgpu_wq_init();
> +	if (r)
> +		goto error_wq;
> +
>  	r = amdgpu_sync_init();
>  	if (r)
>  		goto error_sync;
> @@ -3006,6 +3027,9 @@ static int __init amdgpu_init(void)
>  	amdgpu_sync_fini();
>  
>  error_sync:
> +	amdgpu_wq_fini();
> +
> +error_wq:
>  	return r;
>  }
>  
> @@ -3017,6 +3041,7 @@ static void __exit amdgpu_exit(void)
>  	amdgpu_acpi_release();
>  	amdgpu_sync_fini();
>  	amdgpu_fence_slab_fini();
> +	amdgpu_wq_fini();
>  	mmu_notifier_synchronize();
>  	amdgpu_xcp_drv_release();
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index 2f3f09dfb1fd..f8fd71d9382f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -790,8 +790,9 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>  						AMD_IP_BLOCK_TYPE_GFX, true))
>  					adev->gfx.gfx_off_state = true;
>  			} else {
> -				schedule_delayed_work(&adev->gfx.gfx_off_delay_work,
> -					      delay);
> +				queue_delayed_work(amdgpu_reclaim_wq,
> +						   &adev->gfx.gfx_off_delay_work,
> +						   delay);
>  			}
>  		}
>  	} else {

I can confirm this fixed the warning for me.

Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>

What's the fate of this submission?

Thank you.

-- 
Oleksandr Natalenko, MSE

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  parent reply	other threads:[~2024-12-04  8:16 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-13 13:48 [PATCH] drm/amdgpu: Make the submission path memory reclaim safe Tvrtko Ursulin
2024-11-13 14:26 ` Christian König
2024-11-13 14:42   ` Tvrtko Ursulin
2024-11-22 11:34     ` Tvrtko Ursulin
2024-11-22 13:46       ` Christian König
2024-11-22 14:36         ` Tvrtko Ursulin
2024-12-17  0:26           ` Matthew Brost
2024-12-17  9:30             ` Christian König
2024-12-17  9:33               ` Fwd: " Christian König
2024-12-03 14:24 ` Oleksandr Natalenko [this message]
2024-12-03 14:54   ` Christian König
2024-12-17 10:39 ` Philipp Stanner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2757527.mvXUDI8C0e@natalenko.name \
    --to=oleksandr@natalenko.name \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    --cc=dakr@kernel.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=kernel-dev@igalia.com \
    --cc=matthew.brost@intel.com \
    --cc=pstanner@redhat.com \
    --cc=stable@vger.kernel.org \
    --cc=tj@kernel.org \
    --cc=tursulin@igalia.com \
    --cc=tvrtko.ursulin@igalia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.