All of lore.kernel.org
 help / color / mirror / Atom feed
From: Trigger Huang <Trigger.Huang-5C7GfCeVMHo@public.gmane.org>
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Cc: Trigger Huang <Trigger.Huang-5C7GfCeVMHo@public.gmane.org>
Subject: [PATCH] drm/scheduler: Fix bad job be re-processed in TDR
Date: Fri, 9 Nov 2018 20:10:03 +0800	[thread overview]
Message-ID: <1541765403-8732-1-git-send-email-Trigger.Huang@amd.com> (raw)

A bad job is the one triggered TDR. In the recovery process, its fence
will be signaled and as a result, the work behind will be scheduled
to delete it from the mirror list, but if the TDR process is invoked
before the work's execution, then the call dma_fence_set_error to its
fence in TDR process will lead to kernel warning trace:

[  143.033605] WARNING: CPU: 2 PID: 53 at ./include/linux/dma-fence.h:437 amddrm_sched_job_recovery+0x1af/0x1c0 [amd_sched]
kernel: [  143.033606] Modules linked in: amdgpu(OE) amdchash(OE) amdttm(OE) amd_sched(OE) amdkcl(OE) amd_iommu_v2 drm_kms_helper drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 snd_hda_codec_generic crypto_simd glue_helper cryptd snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq joydev snd_seq_device snd_timer snd soundcore binfmt_misc input_leds mac_hid serio_raw nfsd auth_rpcgss nfs_acl lockd grace sunrpc sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 8139too floppy psmouse 8139cp mii i2c_piix4 pata_acpi
[  143.033649] CPU: 2 PID: 53 Comm: kworker/2:1 Tainted: G           OE    4.15.0-20-generic #21-Ubuntu
[  143.033650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  143.033653] Workqueue: events drm_sched_job_timedout [amd_sched]
[  143.033656] RIP: 0010:amddrm_sched_job_recovery+0x1af/0x1c0 [amd_sched]
[  143.033657] RSP: 0018:ffffa9f880fe7d48 EFLAGS: 00010202
[  143.033659] RAX: 0000000000000007 RBX: ffff9b98f2b24c00 RCX: ffff9b98efef4f08
[  143.033660] RDX: ffff9b98f2b27400 RSI: ffff9b98f2b24c50 RDI: ffff9b98efef4f18
[  143.033660] RBP: ffffa9f880fe7d98 R08: 0000000000000001 R09: 00000000000002b6
[  143.033661] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9b98efef3430
[  143.033662] R13: ffff9b98efef4d80 R14: ffff9b98efef4e98 R15: ffff9b98eaf91c00
[  143.033663] FS:  0000000000000000(0000) GS:ffff9b98ffd00000(0000) knlGS:0000000000000000
[  143.033664] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  143.033665] CR2: 00007fc49c96d470 CR3: 000000001400a005 CR4: 00000000003606e0
[  143.033669] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  143.033669] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  143.033670] Call Trace:
[  143.033744]  amdgpu_device_gpu_recover+0x144/0x820 [amdgpu]
[  143.033788]  amdgpu_job_timedout+0x9b/0xa0 [amdgpu]
[  143.033791]  drm_sched_job_timedout+0xcc/0x150 [amd_sched]
[  143.033795]  process_one_work+0x1de/0x410
[  143.033797]  worker_thread+0x32/0x410
[  143.033799]  kthread+0x121/0x140
[  143.033801]  ? process_one_work+0x410/0x410
[  143.033803]  ? kthread_create_worker_on_cpu+0x70/0x70
[  143.033806]  ret_from_fork+0x35/0x40

So just delete the bad job from mirror list directly after signal its fence

Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 18ebbb0..b2832fb 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -228,7 +228,10 @@ static void drm_sched_job_finish(struct work_struct *work)
 
 	spin_lock(&sched->job_list_lock);
 	/* remove job from ring_mirror_list */
-	list_del(&s_job->node);
+	/* This node may already has been deleted in job recovery */
+	if (s_job->node.next != &(s_job->node))
+		list_del_init(&s_job->node);
+
 	/* queue TDR for next job */
 	drm_sched_start_timeout(sched);
 	spin_unlock(&sched->job_list_lock);
@@ -394,6 +397,19 @@ void drm_sched_job_recovery(struct drm_gpu_scheduler *sched)
 			drm_sched_process_job(NULL, &s_fence->cb);
 		}
 		spin_lock(&sched->job_list_lock);
+		/**
+		 * Normally the 'bad' job will be deleted by its finish_work
+		 * after signal its fence, but sometimes if a GPU recovery is
+		 * invoked before this work finished execution (for example,
+		 * KFD/Host triggered the GPU reset when the current one is
+		 * on-going), then the 'bad' job may will be processed again,
+		 * which is definitely no necessary, and also will cause a lot
+		 * of warning call traces when this job is set by
+		 * 'dma_fence_set_error' because it has already been signaled
+		 */
+		if ((s_fence->finished.error < 0)
+			&& (s_job->node.next != &(s_job->node)))
+			list_del_init(&s_job->node);
 	}
 	drm_sched_start_timeout(sched);
 	spin_unlock(&sched->job_list_lock);
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

             reply	other threads:[~2018-11-09 12:10 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-11-09 12:10 Trigger Huang [this message]
     [not found] ` <1541765403-8732-1-git-send-email-Trigger.Huang-5C7GfCeVMHo@public.gmane.org>
2018-11-09 12:18   ` [PATCH] drm/scheduler: Fix bad job be re-processed in TDR Christian König
     [not found]     ` <c201b18a-7ea3-a832-538a-3d393764c7e1-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-11-12  8:35       ` Huang, Trigger
     [not found]         ` <BY2PR12MB054503EE0A2D13827825AD7EFEC10-K//h7OWB4q7Qm/t26ohGYQdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-11-12  9:19           ` Liu, Monk
     [not found]             ` <CY4PR1201MB0245017E5AF13831BD09A9B584C10-1iTaO6aE1DBfNQakwlCMTGrFom/aUZj6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2018-11-12 10:12               ` Koenig, Christian
     [not found]                 ` <c13ec36d-79d5-19f1-a159-fd79f1e9a2f3-5C7GfCeVMHo@public.gmane.org>
2018-11-14  9:06                   ` Huang, Trigger

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1541765403-8732-1-git-send-email-Trigger.Huang@amd.com \
    --to=trigger.huang-5c7gfcevmho@public.gmane.org \
    --cc=amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.