Message-ID: <571f25bb979922f84c12cd0e98ab5870c9f63448.camel@linux.intel.com>
Subject: Re: [RFC PATCH 1/2] drm/amdgpu: Move dma_resv waiting outside the notifier lock
From: Thomas Hellström
To: Christian König, dri-devel@lists.freedesktop.org
Date: Fri, 08 Oct 2021 08:43:16 +0200
References: <20211007132630.495184-1-thomas.hellstrom@linux.intel.com>
 <20211007132630.495184-2-thomas.hellstrom@linux.intel.com>
List-Id: Direct Rendering Infrastructure - Development

Hi, Christian,

On Thu, 2021-10-07 at 16:57 +0200, Christian König wrote:
> On 07.10.21 at 15:26, Thomas Hellström wrote:
> > While the range notifier is executing, we have the write-side mmu
> > interval seqlock, and mmu_interval_read_retry() is always returning
> > true, which means that if amdgpu_cs_submit grabs the notifier lock
> > during the fence wait, it will retry anyway when checking the userptr
> > validity and block when retrying in mmu_interval_read_begin().
> > (See the extensive comments in mmu_interval_read_begin())
> >
> > Hence we can release the notifier lock before the fence wait and
> > avoid a device-wide command submission block during invalidation.
>
> First of all I'm not convinced that this works and second blocking the
> CS while an MMU invalidation is underway is completely intentional.
>
> In other words when for example fork() is called in parallel with a CS
> the CS should be blocked until the invalidation caused by the fork()
> operation is completed and *NOT* risk that the CS succeeds and adds
> another dependency to the MMU invalidation.

The point is that command submission can't succeed: while the notifier
runs, we hold the write-side seqlock. It's released when the notifier
ends, and we can rely on that.

The following will happen:

Thread 1                        Thread 2
enter_cs();
                                enter_invalidation();
                                notifier_lock()
                                update_seqno();
                                notifier_unlock();
notifier_lock();
                                fence_wait();
validate_userptr();
(invalid_seqno)
notifier_unlock();
(retry)
mmu_interval_read_begin();
(blocks)
                                fence_signal();
                                (notifier ends)
                                mn_itree_inv_end();
(unblocks, new seqno);
enter_cs();
(succeeds)

So the difference is that you block only CS that affect that particular
userptr, not CS across the entire device.

So for the sake of discussion, is there a particular situation where
you see a race that might happen here?

Thanks,
Thomas

>
> Christian.
>
> >
> > Cc: Christian König
> > Signed-off-by: Thomas Hellström
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 3 ++-
> >   1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > index 4b153daf283d..d3d340a6129c 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > @@ -75,9 +75,10 @@ static bool amdgpu_mn_invalidate_gfx(struct mmu_interval_notifier *mni,
> >
> >         mmu_interval_set_seq(mni, cur_seq);
> >
> > +       mutex_unlock(&adev->notifier_lock);
> > +
> >         r = dma_resv_wait_timeout(bo->tbo.base.resv, true, false,
> >                                   MAX_SCHEDULE_TIMEOUT);
> > -       mutex_unlock(&adev->notifier_lock);
> >         if (r <= 0)
> >                 DRM_ERROR("(%ld) failed to wait for user bo\n", r);
> >         return true;
>
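
For the sake of discussion, the two-thread timeline in this mail can be
replayed with a small single-threaded C model. This is only a sketch under
stated assumptions: every model_* name and the two-counter struct are
hypothetical and are not the kernel's mmu_interval_notifier code; the
"blocking" in mmu_interval_read_begin() is represented by a predicate
instead of a real wait, and nested invalidations are ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state, NOT the kernel's data structures. */
struct model_notifier {
	unsigned long global_seq; /* odd while an invalidation is running */
	unsigned long range_seq;  /* last seq recorded for this userptr range */
};

/* Models invalidation start plus the notifier recording the current seq. */
static void model_inv_start(struct model_notifier *n)
{
	n->global_seq |= 1;            /* write side taken: seq becomes odd */
	n->range_seq = n->global_seq;  /* stands in for update_seqno() */
}

/* Models the end of the notifier (mn_itree_inv_end() in the timeline). */
static void model_inv_end(struct model_notifier *n)
{
	n->global_seq++;               /* write side released: seq even again */
}

/* True where a real reader would sleep in its read_begin step. */
static bool model_read_would_block(const struct model_notifier *n)
{
	return (n->global_seq & 1) && n->range_seq == n->global_seq;
}

/* Reader samples the seq it will later validate against. */
static unsigned long model_read_begin(const struct model_notifier *n)
{
	return n->range_seq;
}

/* True if the range was invalidated after seq was sampled. */
static bool model_read_retry(const struct model_notifier *n, unsigned long seq)
{
	return n->range_seq != seq;
}

/*
 * Replays the timeline. Returns the number of CS attempts needed,
 * or -1 if the model deviates from the sequence described in the mail.
 */
static int model_replay(void)
{
	struct model_notifier n = { 0, 0 };
	int attempts = 0;
	unsigned long seq;

	seq = model_read_begin(&n);      /* Thread 1: enter_cs() */
	model_inv_start(&n);             /* Thread 2: enter_invalidation() */

	attempts++;                      /* Thread 1: validate_userptr() */
	if (!model_read_retry(&n, seq))
		return -1;               /* stale seq must force a retry */

	if (!model_read_would_block(&n)) /* Thread 1: retry would block */
		return -1;

	model_inv_end(&n);               /* Thread 2: notifier ends */
	if (model_read_would_block(&n))
		return -1;               /* reader is released */

	seq = model_read_begin(&n);      /* Thread 1: enter_cs() again */
	attempts++;
	return model_read_retry(&n, seq) ? -1 : attempts;
}
```

In this model the first submission attempt always fails validation because
its sampled seq predates the invalidation, and the retry can only proceed
after model_inv_end(), regardless of whether a device-wide lock is held
during the fence wait. It does not by itself answer whether a real race
exists in the driver.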