From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER, INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id BDC7DC43214 for ; Mon, 16 Aug 2021 15:39:19 +0000 (UTC) Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 88EAA61042 for ; Mon, 16 Aug 2021 15:39:19 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 88EAA61042 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=ffwll.ch Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=lists.freedesktop.org Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id CFF7A89AB3; Mon, 16 Aug 2021 15:39:18 +0000 (UTC) Received: from mail-ej1-x62f.google.com (mail-ej1-x62f.google.com [IPv6:2a00:1450:4864:20::62f]) by gabe.freedesktop.org (Postfix) with ESMTPS id 8099089AB3 for ; Mon, 16 Aug 2021 15:39:17 +0000 (UTC) Received: by mail-ej1-x62f.google.com with SMTP id z20so32570381ejf.5 for ; Mon, 16 Aug 2021 08:39:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ffwll.ch; s=google; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=xrs4zPTDDVJACG+yEgg1cFeyZ7/1pj9Z6Onj3ArL5y0=; b=C3mpZJig38/JlCZU56Ao3vUnRvChKwKMf6mW7uTzpsHqTfg/UJmxWH2d1jIqy3UfkO rtAXMxRVE/fLEdMuR5hyIXKGW/o+epTb4bBOQJiG/xHKvsX0bxpsCoJPzBuVQzWjqSD9 J3/JfxJ22oUbx43IfJCYsREW3qvG2G+VNMoK0= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=xrs4zPTDDVJACG+yEgg1cFeyZ7/1pj9Z6Onj3ArL5y0=; b=sZSiSrWuptltSJ2bzCMsJC1BLMBW3tAVdUWFcgK06pVDR6WRQlwvdRVUd8Y40WUBz8 kZrYwJ2WdTOGRFezDVoxS0Q+spXh7VXb8+U++8YDgUoxolM7WyWHHDQBJPxVYCOU8kuy XWGEXp9g2qJws2UDK7mnY+ReY+Euth2mZigJ3OosPLO9ScepsTTPDyU6YXNbL8ZlMhgm wM0Sx1bW0bSxGVQ+w42JfqJrU4j71rhP2hJuXtA8jtPJZw0+o9BaLEUGI64ssds3z62y 9A1nUbBj59hOdedK7BrTGYnW75tvo1L77oA+sfJFvKP1nS4SktZs144mUfzXEYC3cbQ3 s5QA== X-Gm-Message-State: AOAM532cjnAWtJ+ICkrPiLhNr7TiYKoOIDbr0PJPhEVdw8b1gDECtFzU vMASggT9D+9XHDQW0vA46D1g0w== X-Google-Smtp-Source: ABdhPJw0TzV0bulcAezdFeFqMUZE+6D/5LH9QXB//dObs9+D8NrRnLhKh9pMu/BmEqbL0IWW0SiflQ== X-Received: by 2002:a17:906:f11:: with SMTP id z17mr17096099eji.385.1629128355844; Mon, 16 Aug 2021 08:39:15 -0700 (PDT) Received: from phenom.ffwll.local ([2a02:168:57f4:0:efd0:b9e5:5ae6:c2fa]) by smtp.gmail.com with ESMTPSA id u18sm233922ejf.118.2021.08.16.08.39.14 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 16 Aug 2021 08:39:15 -0700 (PDT) Date: Mon, 16 Aug 2021 17:39:13 +0200 From: Daniel Vetter To: Matthew Brost Cc: Daniel Vetter , gfx-internal-devel@eclists.intel.com, dri-devel@lists.freedesktop.org Subject: Re: [PATCH 5/9] drm/i915/guc: Flush the work queue for GuC generated G2H Message-ID: References: <20210811011622.255784-1-matthew.brost@intel.com> <20210811011622.255784-6-matthew.brost@intel.com> <20210812152330.GA480603@DUT151-ICLU.fm.intel.com> <20210812223818.GA649455@DUT151-ICLU.fm.intel.com> <20210813190255.GA798750@DUT151-ICLU.fm.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20210813190255.GA798750@DUT151-ICLU.fm.intel.com> X-Operating-System: Linux phenom 5.10.0-7-amd64 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" On Fri, Aug 13, 2021 at 07:02:55PM +0000, Matthew Brost wrote: > On Fri, Aug 13, 2021 at 05:11:59PM +0200, Daniel Vetter wrote: > > On Thu, Aug 12, 2021 at 10:38:18PM +0000, Matthew Brost wrote: > > > On Thu, Aug 12, 2021 at 09:47:23PM +0200, Daniel Vetter wrote: > > > > On Thu, Aug 12, 2021 at 03:23:30PM +0000, Matthew Brost wrote: > > > > > On Thu, Aug 12, 2021 at 04:11:28PM +0200, Daniel Vetter wrote: > > > > > > On Wed, Aug 11, 2021 at 01:16:18AM +0000, Matthew Brost wrote: > > > > > > > Flush the work queue for GuC generated G2H messages durinr a GT reset. > > > > > > > This is accomplished by spinning on the the list of outstanding G2H to > > > > > > > go empty. > > > > > > > > > > > > > > Fixes: eb5e7da736f3 ("drm/i915/guc: Reset implementation for new GuC interface") > > > > > > > Signed-off-by: Matthew Brost > > > > > > > Cc: > > > > > > > --- > > > > > > > drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 5 +++++ > > > > > > > 1 file changed, 5 insertions(+) > > > > > > > > > > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > > > > > > > index 3cd2da6f5c03..e5eb2df11b4a 100644 > > > > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > > > > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > > > > > > > @@ -727,6 +727,11 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc) > > > > > > > wait_for_reset(guc, &guc->outstanding_submission_g2h); > > > > > > > } while (!list_empty(&guc->ct.requests.incoming)); > > > > > > > } > > > > > > > + > > > > > > > + /* Flush any GuC generated G2H */ > > > > > > > + while (!list_empty(&guc->ct.requests.incoming)) > > > > > > > + msleep(20); > > > > > > > > > > > > flush_work or flush_workqueue, beacuse that comes with lockdep > > > > > > annotations. Dont hand-roll stuff like this if at all possible. > > > > > > > > > > lockdep puked when used that. > > > > > > > > Lockdep tends to be right ... So we definitely want that, but maybe a > > > > different flavour, or there's something wrong with the workqueue setup. > > > > > > > > > > Here is a dependency chain that lockdep doesn't like. > > > > > > fs_reclaim_acquire -> >->reset.mutex (shrinker) > > > workqueue -> fs_reclaim_acquire (error capture in workqueue) > > > >->reset.mutex -> workqueue (reset) > > > > > > In practice I don't think we couldn't ever hit this but lockdep does > > > looks right here. Trying to work out how to fix this. We really need to > > > all G2H to done being processed before we proceed during a reset or we > > > have races. Have a few ideas of how to handle this but can't convince > > > myself any of them are fully safe. > > > > It might be false sharing due to a single workqueue, or a single-threaded > > workqueue. > > > > Essentially the lockdep annotations for work_struct track two things: > > - dependencies against the specific work item > > - dependencies against anything queued on that work queue, if you flush > > the entire queue, or if you flush a work item that's on a > > single-threaded queue. > > > > Because if guc/host communication is inverted like this here, you have a > > much bigger problem. > > > > Note that if you pick a different workqueue for your guc work stuff then > > you need to make sure that's all properly flushed on suspend and driver > > unload. > > > > It might also be that the reset work is on the wrong workqueue. > > > > Either way, this must be fixed, because I've seen too many of these "it > > never happens in practice" blow up, plus if your locking scheme is > > engineered with quicksand forget about anyone ever understanding it. > > The solution is to allocate memory for the error capture in an atomic > context if the error capture is being done from the G2H work queue. That > means this can possibly fail if the system is under memory pressure but > that is better than a lockdep splat. Ah yeah if this is for error capture then GFP_ATOMIC is the right option. -Daniel > > Matt > > > -Daniel > > > > > > > > Splat below: > > > > > > [ 154.625989] ====================================================== > > > [ 154.632195] WARNING: possible circular locking dependency detected > > > [ 154.638393] 5.14.0-rc5-guc+ #50 Tainted: G U > > > [ 154.643991] ------------------------------------------------------ > > > [ 154.650196] i915_selftest/1673 is trying to acquire lock: > > > [ 154.655621] ffff8881079cb918 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}, at: __flush_work+0x350/0x4d0 > > > [ 154.665826] > > > but task is already holding lock: > > > [ 154.671682] ffff8881079cbfb8 (>->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0xf0/0x300 [i915] > > > [ 154.680659] > > > which lock already depends on the new lock. > > > > > > [ 154.688857] > > > the existing dependency chain (in reverse order) is: > > > [ 154.696365] > > > -> #2 (>->reset.mutex){+.+.}-{3:3}: > > > [ 154.702571] lock_acquire+0xd2/0x300 > > > [ 154.706695] i915_gem_shrinker_taints_mutex+0x2d/0x50 [i915] > > > [ 154.712959] intel_gt_init_reset+0x61/0x80 [i915] > > > [ 154.718258] intel_gt_init_early+0xe6/0x120 [i915] > > > [ 154.723648] i915_driver_probe+0x592/0xdc0 [i915] > > > [ 154.728942] i915_pci_probe+0x43/0x1c0 [i915] > > > [ 154.733891] pci_device_probe+0x9b/0x110 > > > [ 154.738362] really_probe+0x1a6/0x3a0 > > > [ 154.742568] __driver_probe_device+0xf9/0x170 > > > [ 154.747468] driver_probe_device+0x19/0x90 > > > [ 154.752114] __driver_attach+0x99/0x170 > > > [ 154.756492] bus_for_each_dev+0x73/0xc0 > > > [ 154.760870] bus_add_driver+0x14b/0x1f0 > > > [ 154.765248] driver_register+0x67/0xb0 > > > [ 154.769542] i915_init+0x18/0x8c [i915] > > > [ 154.773964] do_one_initcall+0x53/0x2e0 > > > [ 154.778343] do_init_module+0x56/0x210 > > > [ 154.782639] load_module+0x25fc/0x29f0 > > > [ 154.786934] __do_sys_finit_module+0xae/0x110 > > > [ 154.791835] do_syscall_64+0x38/0xc0 > > > [ 154.795958] entry_SYSCALL_64_after_hwframe+0x44/0xae > > > [ 154.801558] > > > -> #1 (fs_reclaim){+.+.}-{0:0}: > > > [ 154.807241] lock_acquire+0xd2/0x300 > > > [ 154.811361] fs_reclaim_acquire+0x9e/0xd0 > > > [ 154.815914] kmem_cache_alloc_trace+0x30/0x790 > > > [ 154.820899] i915_gpu_coredump_alloc+0x53/0x1a0 [i915] > > > [ 154.826649] i915_gpu_coredump+0x39/0x560 [i915] > > > [ 154.831866] i915_capture_error_state+0xa/0x70 [i915] > > > [ 154.837513] intel_guc_context_reset_process_msg+0x174/0x1f0 [i915] > > > [ 154.844383] ct_incoming_request_worker_func+0x130/0x1b0 [i915] > > > [ 154.850898] process_one_work+0x264/0x590 > > > [ 154.855451] worker_thread+0x4b/0x3a0 > > > [ 154.859655] kthread+0x147/0x170 > > > [ 154.863428] ret_from_fork+0x1f/0x30 > > > [ 154.867548] > > > -> #0 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}: > > > [ 154.875747] check_prev_add+0x90/0xc30 > > > [ 154.880042] __lock_acquire+0x1643/0x2110 > > > [ 154.884595] lock_acquire+0xd2/0x300 > > > [ 154.888715] __flush_work+0x373/0x4d0 > > > [ 154.892920] intel_guc_submission_reset_prepare+0xf3/0x340 [i915] > > > [ 154.899606] intel_uc_reset_prepare+0x40/0x50 [i915] > > > [ 154.905166] reset_prepare+0x55/0x60 [i915] > > > [ 154.909946] intel_gt_reset+0x11c/0x300 [i915] > > > [ 154.914984] do_device_reset+0x13/0x20 [i915] > > > [ 154.919936] check_whitelist_across_reset+0x166/0x250 [i915] > > > [ 154.926212] live_reset_whitelist.cold+0x6a/0x7a [i915] > > > [ 154.932037] __i915_subtests.cold+0x20/0x74 [i915] > > > [ 154.937428] __run_selftests.cold+0x96/0xee [i915] > > > [ 154.942816] i915_live_selftests+0x2c/0x60 [i915] > > > [ 154.948125] i915_pci_probe+0x93/0x1c0 [i915] > > > [ 154.953076] pci_device_probe+0x9b/0x110 > > > [ 154.957545] really_probe+0x1a6/0x3a0 > > > [ 154.961749] __driver_probe_device+0xf9/0x170 > > > [ 154.966653] driver_probe_device+0x19/0x90 > > > [ 154.971290] __driver_attach+0x99/0x170 > > > [ 154.975671] bus_for_each_dev+0x73/0xc0 > > > [ 154.980053] bus_add_driver+0x14b/0x1f0 > > > [ 154.984431] driver_register+0x67/0xb0 > > > [ 154.988725] i915_init+0x18/0x8c [i915] > > > [ 154.993149] do_one_initcall+0x53/0x2e0 > > > [ 154.997527] do_init_module+0x56/0x210 > > > [ 155.001822] load_module+0x25fc/0x29f0 > > > [ 155.006118] __do_sys_finit_module+0xae/0x110 > > > [ 155.011019] do_syscall_64+0x38/0xc0 > > > [ 155.015139] entry_SYSCALL_64_after_hwframe+0x44/0xae > > > [ 155.020729] > > > other info that might help us debug this: > > > > > > [ 155.028752] Chain exists of: > > > (work_completion)(&ct->requests.worker) --> fs_reclaim --> >->reset.mutex > > > > > > [ 155.041294] Possible unsafe locking scenario: > > > > > > [ 155.047240] CPU0 CPU1 > > > [ 155.051791] ---- ---- > > > [ 155.056344] lock(>->reset.mutex); > > > [ 155.060026] lock(fs_reclaim); > > > [ 155.065706] lock(>->reset.mutex); > > > [ 155.071912] lock((work_completion)(&ct->requests.worker)); > > > [ 155.077595] > > > *** DEADLOCK *** > > > > > > [ 155.083534] 2 locks held by i915_selftest/1673: > > > [ 155.088086] #0: ffff888103851240 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x8e/0x170 > > > [ 155.096460] #1: ffff8881079cbfb8 (>->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0xf0/0x300 [i915] > > > [ 155.105845] > > > stack backtrace: > > > [ 155.110223] CPU: 6 PID: 1673 Comm: i915_selftest Tainted: G U 5.14.0-rc5-guc+ #50 > > > [ 155.119035] Hardware name: Intel Corporation Tiger Lake Client Platform/TigerLake U LPDDR4/4x T4 RVP, BIOS TGLSFWI1.R00.3425.A00.2010162309 10/16/2020 > > > [ 155.132530] Call Trace: > > > [ 155.134999] dump_stack_lvl+0x57/0x7d > > > [ 155.138690] check_noncircular+0x12d/0x150 > > > [ 155.142814] check_prev_add+0x90/0xc30 > > > [ 155.146587] __lock_acquire+0x1643/0x2110 > > > [ 155.150618] lock_acquire+0xd2/0x300 > > > [ 155.154218] ? __flush_work+0x350/0x4d0 > > > [ 155.158073] ? __lock_acquire+0x5f3/0x2110 > > > [ 155.162194] __flush_work+0x373/0x4d0 > > > [ 155.165883] ? __flush_work+0x350/0x4d0 > > > [ 155.169739] ? mark_held_locks+0x49/0x70 > > > [ 155.173686] ? _raw_spin_unlock_irqrestore+0x50/0x70 > > > [ 155.178679] intel_guc_submission_reset_prepare+0xf3/0x340 [i915] > > > [ 155.184857] ? _raw_spin_unlock_irqrestore+0x50/0x70 > > > [ 155.189843] intel_uc_reset_prepare+0x40/0x50 [i915] > > > [ 155.194891] reset_prepare+0x55/0x60 [i915] > > > [ 155.199145] intel_gt_reset+0x11c/0x300 [i915] > > > [ 155.203657] ? _raw_spin_unlock_irqrestore+0x3d/0x70 > > > [ 155.208641] ? do_engine_reset+0x10/0x10 [i915] > > > [ 155.213243] do_device_reset+0x13/0x20 [i915] > > > [ 155.217664] check_whitelist_across_reset+0x166/0x250 [i915] > > > [ 155.223415] live_reset_whitelist.cold+0x6a/0x7a [i915] > > > [ 155.228725] __i915_subtests.cold+0x20/0x74 [i915] > > > [ 155.233593] ? __i915_live_teardown+0x50/0x50 [i915] > > > [ 155.238644] ? __intel_gt_live_setup+0x30/0x30 [i915] > > > [ 155.243773] __run_selftests.cold+0x96/0xee [i915] > > > [ 155.248646] i915_live_selftests+0x2c/0x60 [i915] > > > [ 155.253425] i915_pci_probe+0x93/0x1c0 [i915] > > > [ 155.257854] ? _raw_spin_unlock_irqrestore+0x3d/0x70 > > > [ 155.262839] pci_device_probe+0x9b/0x110 > > > [ 155.266785] really_probe+0x1a6/0x3a0 > > > [ 155.270467] __driver_probe_device+0xf9/0x170 > > > [ 155.274846] driver_probe_device+0x19/0x90 > > > [ 155.278968] __driver_attach+0x99/0x170 > > > [ 155.282824] ? __device_attach_driver+0xd0/0xd0 > > > [ 155.287376] ? __device_attach_driver+0xd0/0xd0 > > > [ 155.291928] bus_for_each_dev+0x73/0xc0 > > > [ 155.295792] bus_add_driver+0x14b/0x1f0 > > > [ 155.299648] driver_register+0x67/0xb0 > > > [ 155.303419] i915_init+0x18/0x8c [i915] > > > [ 155.307328] ? 0xffffffffa0188000 > > > [ 155.310669] do_one_initcall+0x53/0x2e0 > > > [ 155.314525] ? rcu_read_lock_sched_held+0x4d/0x80 > > > [ 155.319253] ? kmem_cache_alloc_trace+0x5ad/0x790 > > > [ 155.323982] do_init_module+0x56/0x210 > > > [ 155.327754] load_module+0x25fc/0x29f0 > > > [ 155.331528] ? __do_sys_finit_module+0xae/0x110 > > > [ 155.336081] __do_sys_finit_module+0xae/0x110 > > > [ 155.340462] do_syscall_64+0x38/0xc0 > > > [ 155.344060] entry_SYSCALL_64_after_hwframe+0x44/0xae > > > [ 155.349137] RIP: 0033:0x7efc4496389d > > > [ 155.352735] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 f5 0c 00 f7 d8 64 89 01 48 > > > [ 155.371530] RSP: 002b:00007ffeb3e23218 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 > > > [ 155.379123] RAX: ffffffffffffffda RBX: 0000557664b72240 RCX: 00007efc4496389d > > > [ 155.386282] RDX: 0000000000000000 RSI: 0000557664b69610 RDI: 0000000000000004 > > > [ 155.393443] RBP: 0000000000000020 R08: 00007ffeb3e21ff0 R09: 0000557664b682f0 > > > [ 155.400603] R10: 00007ffeb3e23360 R11: 0000000000000246 R12: 0000557664b69610 > > > [ 155.407761] R13: 0000000000000000 R14: 0000557664b6ada0 R15: 0000557664b72240 > > > > > > > This is the major reason why inventing any wheels locally in the kernel > > > > isn't a good idea - aside from the duplicated code because likely there is > > > > a solution for whatever you need. There's all the validation tools, > > > > careful thought about semantics (especially around races) and all that > > > > stuff that you're always missing on your hand-rolled thing. Aside from > > > > anything hand-rolled is much harder to read, since intent isn't clear. > > > > -Daniel > > > > > > > > > > > > > > > > > > Matt > > > > > > > > > > > -Daniel > > > > > > > > > > > > > + > > > > > > > scrub_guc_desc_for_outstanding_g2h(guc); > > > > > > > } > > > > > > > > > > > > > > -- > > > > > > > 2.32.0 > > > > > > > > > > > > > > > > > > > -- > > > > > > Daniel Vetter > > > > > > Software Engineer, Intel Corporation > > > > > > http://blog.ffwll.ch > > > > > > > > -- > > > > Daniel Vetter > > > > Software Engineer, Intel Corporation > > > > http://blog.ffwll.ch > > > > -- > > Daniel Vetter > > Software Engineer, Intel Corporation > > http://blog.ffwll.ch -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch