Re: [RFC PATCH] drm: gpu: msm: forbid mem reclaim from reset

From: Sergey Senozhatsky <senozhatsky@chromium.org>
To: Rob Clark <robin.clark@oss.qualcomm.com>,
	 Maarten Lankhorst <maarten.lankhorst@linux.intel.com>,
	Maxime Ripard <mripard@kernel.org>,
	 Thomas Zimmermann <tzimmermann@suse.de>,
	David Airlie <airlied@gmail.com>,
	 Simona Vetter <simona@ffwll.ch>
Cc: Sean Paul <sean@poorly.run>,
	Konrad Dybcio <konradybcio@kernel.org>,
	 Akhil P Oommen <akhilpo@oss.qualcomm.com>,
	linux-arm-msm@vger.kernel.org, dri-devel@lists.freedesktop.org,
	 freedreno@lists.freedesktop.org, linux-kernel@vger.kernel.org,
	Tomasz Figa <tfiga@chromium.org>,
	 Sergey Senozhatsky <senozhatsky@chromium.org>
Subject: Re: [RFC PATCH] drm: gpu: msm: forbid mem reclaim from reset
Date: Thu, 26 Mar 2026 10:54:07 +0900
Message-ID: <acSRDNA8fCP7qAFJ@google.com>
In-Reply-To: <20260127073341.2862078-1-senozhatsky@chromium.org>

On (26/01/27 16:33), Sergey Senozhatsky wrote:
> We sometimes get into a situation where GPU hangcheck fails to
> recover the GPU:
> 
> [..]
> msm_dpu ae01000.display-controller: [drm:hangcheck_handler] *ERROR* (IPv4: 1): hangcheck detected gpu lockup rb 0!
> msm_dpu ae01000.display-controller: [drm:hangcheck_handler] *ERROR* (IPv4: 1): completed fence: 7840161
> msm_dpu ae01000.display-controller: [drm:hangcheck_handler] *ERROR* (IPv4: 1): submitted fence: 7840162
> msm_dpu ae01000.display-controller: [drm:hangcheck_handler] *ERROR* (IPv4: 1): hangcheck detected gpu lockup rb 0!
> msm_dpu ae01000.display-controller: [drm:hangcheck_handler] *ERROR* (IPv4: 1): completed fence: 7840162
> msm_dpu ae01000.display-controller: [drm:hangcheck_handler] *ERROR* (IPv4: 1): submitted fence: 7840163
> [..]
> 
> The problem is that the msm_job worker is blocked on gpu->lock:
> 
> INFO: task ring0:155 blocked for more than 122 seconds.
> Not tainted 6.6.99-08727-gaac38b365d2c #1
> task:ring0 state:D stack:0 pid:155 ppid:2 flags:0x00000008
> Call trace:
> __switch_to+0x108/0x208
> schedule+0x544/0x11f0
> schedule_preempt_disabled+0x30/0x50
> __mutex_lock_common+0x410/0x850
> __mutex_lock_slowpath+0x28/0x40
> mutex_lock+0x5c/0x90
> msm_job_run+0x9c/0x140
> drm_sched_main+0x514/0x938
> kthread+0x114/0x138
> ret_from_fork+0x10/0x20
> 
> gpu->lock is owned by the recover worker, which is itself waiting for
> DMA fences from a memory reclaim path while holding that very same
> gpu->lock:
> 
> INFO: task ring0:155 is blocked on a mutex likely owned by task gpu-worker:154.
> task:gpu-worker state:D stack:0 pid:154 ppid:2 flags:0x00000008
> Call trace:
> __switch_to+0x108/0x208
> schedule+0x544/0x11f0
> schedule_timeout+0x1f8/0x770
> dma_fence_default_wait+0x108/0x218
> dma_fence_wait_timeout+0x6c/0x1c0
> dma_resv_wait_timeout+0xe4/0x118
> active_purge+0x34/0x98
> drm_gem_lru_scan+0x1d0/0x388
> msm_gem_shrinker_scan+0x1cc/0x2e8
> shrink_slab+0x228/0x478
> shrink_node+0x380/0x730
> try_to_free_pages+0x204/0x510
> __alloc_pages_direct_reclaim+0x90/0x158
> __alloc_pages_slowpath+0x1d4/0x4a0
> __alloc_pages+0x9f0/0xc88
> vm_area_alloc_pages+0x17c/0x260
> __vmalloc_node_range+0x1c0/0x420
> kvmalloc_node+0xe8/0x108
> msm_gpu_crashstate_capture+0x1e4/0x280
> recover_worker+0x1c0/0x638
> kthread_worker_fn+0x150/0x2d8
> kthread+0x114/0x138
> 
> So no one can make any further progress.
> 
> Forbid the recover/fault workers from entering memory reclaim (under
> gpu->lock) to address this deadlock scenario.
> 
> Cc: Tomasz Figa <tfiga@chromium.org>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>


Folks, can somebody please review/pick up this patch? It solves a real
(deadlock) problem that we observe in the field.

// keeping the patch body just in case

> ---
>  drivers/gpu/drm/msm/msm_gpu.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> index 995549d0bbbc..ddcd9e1c217a 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.c
> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> @@ -17,6 +17,7 @@
>  #include <linux/string_helpers.h>
>  #include <linux/devcoredump.h>
>  #include <linux/sched/task.h>
> +#include <linux/sched/mm.h>
>  
>  /*
>   * Power Management:
> @@ -469,6 +470,7 @@ static void recover_worker(struct kthread_work *work)
>  	struct msm_gem_submit *submit;
>  	struct msm_ringbuffer *cur_ring = gpu->funcs->active_ring(gpu);
>  	char *comm = NULL, *cmd = NULL;
> +	unsigned int noreclaim_flag;
>  	struct task_struct *task;
>  	int i;
>  
> @@ -506,6 +508,8 @@ static void recover_worker(struct kthread_work *work)
>  			msm_gem_vm_unusable(submit->vm);
>  	}
>  
> +	noreclaim_flag = memalloc_noreclaim_save();
> +
>  	get_comm_cmdline(submit, &comm, &cmd);
>  
>  	if (comm && cmd) {
> @@ -524,6 +528,8 @@ static void recover_worker(struct kthread_work *work)
>  	pm_runtime_get_sync(&gpu->pdev->dev);
>  	msm_gpu_crashstate_capture(gpu, submit, NULL, comm, cmd);
>  
> +	memalloc_noreclaim_restore(noreclaim_flag);
> +
>  	kfree(cmd);
>  	kfree(comm);
>  
> @@ -588,6 +594,7 @@ void msm_gpu_fault_crashstate_capture(struct msm_gpu *gpu, struct msm_gpu_fault_
>  	struct msm_gem_submit *submit;
>  	struct msm_ringbuffer *cur_ring = gpu->funcs->active_ring(gpu);
>  	char *comm = NULL, *cmd = NULL;
> +	unsigned int noreclaim_flag;
>  
>  	mutex_lock(&gpu->lock);
>  
> @@ -595,6 +602,8 @@ void msm_gpu_fault_crashstate_capture(struct msm_gpu *gpu, struct msm_gpu_fault_
>  	if (submit && submit->fault_dumped)
>  		goto resume_smmu;
>  
> +	noreclaim_flag = memalloc_noreclaim_save();
> +
>  	if (submit) {
>  		get_comm_cmdline(submit, &comm, &cmd);
>  
> @@ -610,6 +619,8 @@ void msm_gpu_fault_crashstate_capture(struct msm_gpu *gpu, struct msm_gpu_fault_
>  	msm_gpu_crashstate_capture(gpu, submit, fault_info, comm, cmd);
>  	pm_runtime_put_sync(&gpu->pdev->dev);
>  
> +	memalloc_noreclaim_restore(noreclaim_flag);
> +
>  	kfree(cmd);
>  	kfree(comm);
>  
> -- 
> 2.53.0.rc1.217.geba53bf80e-goog
>
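
For readers less familiar with the scoped allocation API, the save/restore
pairing used in the patch can be modeled in userspace roughly as follows.
This is only a sketch under stated assumptions: model_task_flags stands in
for current->flags, MODEL_PF_MEMALLOC for PF_MEMALLOC, and every model_*
name is hypothetical and not part of the kernel or of this patch. It shows
why the saved value is passed back to restore: a section that was entered
with the flag already set leaves it set on exit, so nested scopes compose.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the kernel's PF_MEMALLOC task flag. */
#define MODEL_PF_MEMALLOC 0x800u

/* Stand-in for current->flags; a real task has one such word per thread. */
static unsigned int model_task_flags;

/* Set the flag and return its previous state so the caller can restore it. */
static unsigned int model_noreclaim_save(void)
{
	unsigned int old = model_task_flags & MODEL_PF_MEMALLOC;

	model_task_flags |= MODEL_PF_MEMALLOC;
	return old;
}

/* Put the flag back to the state observed by the matching save. */
static void model_noreclaim_restore(unsigned int old)
{
	model_task_flags = (model_task_flags & ~MODEL_PF_MEMALLOC) | old;
}

/* True when an allocation made now would be allowed to enter direct reclaim. */
static bool model_may_reclaim(void)
{
	return !(model_task_flags & MODEL_PF_MEMALLOC);
}

/* Mirrors the shape of the patched capture path: save, allocate, restore. */
static bool model_capture_section(void)
{
	unsigned int noreclaim_flag = model_noreclaim_save();
	bool reclaim_allowed_inside = model_may_reclaim();

	model_noreclaim_restore(noreclaim_flag);
	/* After restore, the flag is exactly what it was before the save. */
	assert(model_may_reclaim() == (noreclaim_flag == 0));
	return reclaim_allowed_inside;
}
```

In this model, model_capture_section() always reports that reclaim is not
allowed inside the section, and the caller's flag state is unchanged once
it returns, whether or not an outer scope had already set the flag.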

Thread overview: 7+ messages
2026-01-27  7:33 [RFC PATCH] drm: gpu: msm: forbid mem reclaim from reset Sergey Senozhatsky
2026-02-03  3:42 ` Sergey Senozhatsky
2026-03-26  1:54 ` Sergey Senozhatsky [this message]
2026-03-27  0:17   ` Akhil P Oommen
2026-03-27 16:08     ` Rob Clark
2026-03-30  2:46       ` Sergey Senozhatsky
2026-03-30  2:45     ` Sergey Senozhatsky