AMD-GFX Archive on lore.kernel.org
From: "Christian König" <ckoenig.leichtzumerken@gmail.com>
To: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>,
	Alex Deucher <alexdeucher@gmail.com>
Cc: michel@daenzer.net, Borislav Petkov <bp@alien8.de>,
	amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: grab extra fence reference for drm_sched_job_add_dependency
Date: Thu, 5 Jan 2023 11:03:14 +0100
Message-ID: <e6b6a599-8fdd-a4fc-a2bb-d0750e6d477d@gmail.com>
In-Reply-To: <CABXGCsOmtfo=7YWUv0QmGGrCat1Md59oz7UWw9-7MPn7f6AAdA@mail.gmail.com>

On 05.01.23 at 02:44, Mikhail Gavrilov wrote:
> On Tue, Jan 3, 2023 at 7:26 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>> On Tue, Jan 3, 2023 at 3:34 AM Christian König
>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>> I assume that this was already upstreamed while I was on sick leave?
>> Yes.
>>
>> Alex
>>
> What about commit 2fdb8a8f07c2f1353770a324fd19b8114e4329ac ?

That one should be fixed by:

commit 9f1ecfc5dcb47a7ca37be47b0eaca0f37f1ae93d
Author: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Date:   Wed Nov 23 03:13:03 2022 +0300

     drm/scheduler: Fix lockup in drm_sched_entity_kill()

     The drm_sched_entity_kill() is invoked twice by drm_sched_entity_destroy()
     while userspace process is exiting or being killed. First time it's invoked
     when sched entity is flushed and second time when entity is released. This
     causes a lockup within wait_for_completion(entity_idle) due to how completion
     API works.

     Calling wait_for_completion() more times than complete() was invoked is a
     error condition that causes lockup because completion internally uses
     counter for complete/wait calls. The complete_all() must be used instead
     in such cases.

     This patch fixes lockup of Panfrost driver that is reproducible by killing
     any application in a middle of 3d drawing operation.

     Fixes: 2fdb8a8f07c2 ("drm/scheduler: rework entity flush, kill and fini")
     Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
     Reviewed-by: Christian König <christian.koenig@amd.com>
     Link: https://patchwork.freedesktop.org/patch/msgid/20221123001303.533968-1-dmitry.osipenko@collabora.com
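
As a rough illustration of the counter behaviour that commit message describes, here is a minimal userspace sketch in C. It is not the kernel implementation: the struct and every model_* name are hypothetical, only the counter is modeled, and the blocking wait is replaced by a function that merely reports whether a wait would return or sleep. The assumption is that complete() lets one waiter through per call, while complete_all() saturates the counter so that any number of waits return.

```c
/* Userspace model of the completion counter; illustration only. */
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct completion: only the counter is modeled. */
struct model_completion {
	unsigned int done;
};

static void model_init(struct model_completion *c)
{
	c->done = 0;
}

/* Models complete(): lets exactly one waiter through. */
static void model_complete(struct model_completion *c)
{
	if (c->done != UINT_MAX)
		c->done++;
}

/* Models complete_all(): saturates the counter for all waiters. */
static void model_complete_all(struct model_completion *c)
{
	c->done = UINT_MAX;
}

/*
 * Non-blocking stand-in for wait_for_completion(): returns true if the
 * wait would return, false if the caller would sleep. In this model
 * nobody signals again, so "sleep" means "sleep forever".
 */
static bool model_wait_would_return(struct model_completion *c)
{
	if (c->done == 0)
		return false;
	if (c->done != UINT_MAX)
		c->done--;
	return true;
}

/* Signals once, then counts how many of `waits` waits get through. */
static int model_waits_passed(bool use_complete_all, int waits)
{
	struct model_completion c;
	int passed = 0;

	assert(waits >= 0);
	model_init(&c);
	if (use_complete_all)
		model_complete_all(&c);
	else
		model_complete(&c);

	for (int i = 0; i < waits; i++)
		if (model_wait_would_return(&c))
			passed++;
	return passed;
}
```

With one model_complete() and two waits (the flush path and the release path), only the first wait gets through and the second would block, which matches the hang in drm_sched_entity_kill() in the trace below. With model_complete_all() both waits return.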

Regards,
Christian.

> I checked twice and I'm sure that this commit is the reason why I
> can't terminate some games (and other processes).
> Demonstration: https://youtu.be/O0AfjiMdFGw
> I also attached a full kernel log.
>
> INFO: task ZAT.exe:4745 blocked for more than 122 seconds.
>        Tainted: G        W    L 6.1.0-rc1-13-2fdb8a8f07c2f1353770a324fd19b8114e4329ac+ #18
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:ZAT.exe         state:D stack:12608 pid:4745  ppid:1      flags:0x20004006
> Call Trace:
>   <TASK>
>   __schedule+0x4c5/0x1740
>   schedule+0x5d/0xe0
>   schedule_timeout+0xf0/0x130
>   __wait_for_common+0xa9/0x1f0
>   ? usleep_range_state+0x90/0x90
>   drm_sched_entity_kill.part.0+0x4d/0x210 [gpu_sched]
>   drm_sched_entity_flush+0xa0/0x260 [gpu_sched]
>   amdgpu_ctx_mgr_entity_flush+0x83/0xd0 [amdgpu]
>   amdgpu_flush+0x25/0x40 [amdgpu]
>   filp_close+0x31/0x70
>   put_files_struct+0x78/0xf0
>   do_exit+0x364/0xc30
>   ? sched_clock_cpu+0xb/0xc0
>   do_group_exit+0x33/0xa0
>   get_signal+0xb41/0xb50
>   arch_do_signal_or_restart+0x44/0x7a0
>   exit_to_user_mode_prepare+0x17b/0x250
>   syscall_exit_to_user_mode+0x16/0x50
>   __do_fast_syscall_32+0x94/0xf0
> 2132]: Reached target exit.target - Exit the Session.
> 1]: user@1000.service: Killing process 4402 (reaper) with signal SIGKILL.
> 1]: user@1000.service: Killing process 4745 (ZAT.exe) with signal SIGKILL.
> 1]: Started plymouth-reboot.service - Show Plymouth Reboot Screen.
> : SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=plymouth-reboot comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=succe>
> 1]: plymouth-switch-root-initramfs.service - Tell Plymouth To Jump To initramfs was skipped because of an unmet condition check (ConditionPathExists=/run/initramfs/bin/sh).
> INFO: task ZAT.exe:4745 blocked for more than 122 seconds.
>        Tainted: G        W    L 6.1.0-rc1-13-2fdb8a8f07c2f1353770a324fd19b8114e4329ac+ #18
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:ZAT.exe         state:D stack:12608 pid:4745  ppid:1      flags:0x20004006
> Call Trace:
>   <TASK>
>   __schedule+0x4c5/0x1740
>   schedule+0x5d/0xe0
>   schedule_timeout+0xf0/0x130
>   __wait_for_common+0xa9/0x1f0
>   ? usleep_range_state+0x90/0x90
>   drm_sched_entity_kill.part.0+0x4d/0x210 [gpu_sched]
>   drm_sched_entity_flush+0xa0/0x260 [gpu_sched]
>   amdgpu_ctx_mgr_entity_flush+0x83/0xd0 [amdgpu]
>   amdgpu_flush+0x25/0x40 [amdgpu]
>   filp_close+0x31/0x70
>   put_files_struct+0x78/0xf0
>   do_exit+0x364/0xc30
>   ? sched_clock_cpu+0xb/0xc0
>   do_group_exit+0x33/0xa0
>   get_signal+0xb41/0xb50
>   arch_do_signal_or_restart+0x44/0x7a0
>   exit_to_user_mode_prepare+0x17b/0x250
>   syscall_exit_to_user_mode+0x16/0x50
>   __do_fast_syscall_32+0x94/0xf0
>   ? __do_fast_syscall_32+0x94/0xf0
>   ? lockdep_hardirqs_on+0x7d/0x100
>   ? __do_fast_syscall_32+0x94/0xf0
>   ? __do_fast_syscall_32+0x94/0xf0
>   do_fast_syscall_32+0x2f/0x70
>   entry_SYSCALL_compat_after_hwframe+0x62/0x6a
> RIP: 0023:0xf7f6b579
> RSP: 002b:00000000e8dffd40 EFLAGS: 00200282 ORIG_RAX: 00000000000000f0
> RAX: fffffffffffffe00 RBX: 00000000f0b54dcc RCX: 0000000000000189
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: 00000000ffffffff R08: 00000000e8dffd40 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000200282 R12: 0000000000000000
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>   </TASK>
>
> Showing all locks held in the system:
> 1 lock held by rcu_tasks_kthre/11:
>   #0: ffffffffae368a20 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2b/0x3e0
> 1 lock held by rcu_tasks_rude_/12:
>   #0: ffffffffae368760 (rcu_tasks_rude.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2b/0x3e0
> 1 lock held by rcu_tasks_trace/13:
>   #0: ffffffffae368460 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2b/0x3e0
> 1 lock held by khungtaskd/182:
>   #0: ffffffffae369520 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x16b
> 2 locks held by kworker/25:1/215:
> 1 lock held by systemd-journal/852:
> 1 lock held by ZAT.exe/4745:
>   #0: ffff9b087c337cf8 (&mgr->lock#3){+.+.}-{3:3}, at: amdgpu_ctx_mgr_entity_flush+0x3a/0xd0 [amdgpu]
>
> =============================================
> 1]: user@1000.service: Processes still around after final SIGKILL. Entering failed mode.
> 1]: user@1000.service: Failed with result 'timeout'.
> 1]: Stopped user@1000.service - User Manager for UID 1000.
>
>



Thread overview: 18+ messages
2022-12-19 10:47 [PATCH] drm/amdgpu: grab extra fence reference for drm_sched_job_add_dependency Christian König
2022-12-19 14:00 ` Borislav Petkov
2022-12-21 21:10   ` Alex Deucher
2023-01-03  8:34     ` Christian König
2023-01-03 14:26       ` Alex Deucher
2023-01-03 14:28         ` Michel Dänzer
2023-01-05  1:44         ` Mikhail Gavrilov
2023-01-05 10:03           ` Christian König [this message]
2023-01-06 12:59             ` Mikhail Gavrilov
2023-01-06 14:24               ` Alex Deucher
2023-01-06 15:27                 ` Christian König
2023-01-09 13:13                   ` Mikhail Gavrilov
2023-01-09 13:40                     ` Christian König
2023-01-10 18:21                       ` Mikhail Gavrilov
2023-01-12 12:05                         ` Christian König
2022-12-19 15:08 ` Luben Tuikov
2022-12-23 10:00 ` Michal Kubecek
2022-12-23 22:55 ` Mikhail Gavrilov
