From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jason Gunthorpe <jgg@ziepe.ca>
Subject: Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
Date: Thu, 15 Aug 2019 14:16:22 -0300
Message-ID: <20190815171622.GL21596@ziepe.ca>
References: <20190814202027.18735-1-daniel.vetter@ffwll.ch>
 <20190814202027.18735-3-daniel.vetter@ffwll.ch>
 <20190814134558.fe659b1a9a169c0150c3e57c@linux-foundation.org>
 <20190815084429.GE9477@dhcp22.suse.cz>
 <20190815130415.GD21596@ziepe.ca>
 <CAKMK7uE9zdmBuvxa788ONYky=46GN=5Up34mKDmsJMkir4x7MQ@mail.gmail.com>
 <20190815143759.GG21596@ziepe.ca>
 <CAKMK7uEJQ6mPQaOWbT_6M+55T-dCVbsOxFnMC6KzLAMQNa-RGg@mail.gmail.com>
 <20190815151028.GJ21596@ziepe.ca>
 <20190815163238.GA30781@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path: <linux-kernel-owner@vger.kernel.org>
Content-Disposition: inline
In-Reply-To: <20190815163238.GA30781@redhat.com>
Sender: linux-kernel-owner@vger.kernel.org
To: Jerome Glisse <jglisse@redhat.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>, Michal Hocko <mhocko@kernel.org>, Andrew Morton <akpm@linux-foundation.org>, LKML <linux-kernel@vger.kernel.org>, Linux MM <linux-mm@kvack.org>, DRI Development <dri-devel@lists.freedesktop.org>, Intel Graphics Development <intel-gfx@lists.freedesktop.org>, Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@redhat.com>, David Rientjes <rientjes@google.com>, Christian =?utf-8?B?S8O2bmln?= <christian.koenig@amd.com>, Masahiro Yamada <yamada.masahiro@socionext.com>, Wei Wang <wvw@google.com>, Andy Shevchenko <andriy.shevchenko@linux.intel.com>, Thomas Gleixner <tglx@linutronix.de>, Jann Horn <jannh@google.com>, Feng Tang <feng.tang@intel.com>, Kees Cook <keescook@chromium.org>, Randy Dunlap <rdunlap@infrade>
List-Id: intel-gfx@lists.freedesktop.org

On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote:
> On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> > 
> > > You have to wait for the gpu to finnish current processing in
> > > invalidate_range_start. Otherwise there's no point to any of this
> > > really. So the wait_event/dma_fence_wait are unavoidable really.
> > 
> > I don't envy your task :|
> > 
> > But, what you describe sure sounds like a 'registration cache' model,
> > not the 'shadow pte' model of coherency.
> > 
> > The key difference is that a regirstationcache is allowed to become
> > incoherent with the VMA's because it holds page pins. It is a
> > programming bug in userspace to change VA mappings via mmap/munmap/etc
> > while the device is working on that VA, but it does not harm system
> > integrity because of the page pin.
> > 
> > The cache ensures that each initiated operation sees a DMA setup that
> > matches the current VA map when the operation is initiated and allows
> > expensive device DMA setups to be re-used.
> > 
> > A 'shadow pte' model (ie hmm) *really* needs device support to
> > directly block DMA access - ie trigger 'device page fault'. ie the
> > invalidate_start should inform the device to enter a fault mode and
> > that is it.  If the device can't do that, then the driver probably
> > shouldn't persue this level of coherency. The driver would quickly get
> > into the messy locking problems like dma_fence_wait from a notifier.
> 
> I think here we do not agree on the hardware requirement. For GPU
> we will always need to be able to wait for some GPU fence from inside
> the notifier callback, there is just no way around that for many of
> the GPUs today (i do not see any indication of that changing).

I didn't say you couldn't wait, I was trying to say that the wait
should only be contigent on the HW itself. Ie you can wait on a GPU
page table lock, and you can wait on a GPU page table flush completion
via IRQ.

What is troubling is to wait till some other thread gets a GPU command
completion and decr's a kref on the DMA buffer - which kinda looks
like what this dma_fence() stuff is all about. A driver like that
would have to be super careful to ensure consistent forward progress
toward dma ref == 0 when the system is under reclaim. 

ie by running it's entire IRQ flow under fs_reclaim locking.

> associated with the mm_struct. In all GPU driver so far it is a short
> lived lock and nothing blocking is done while holding it (it is just
> about updating page table directory really wether it is filling it or
> clearing it).

The main blocking I expect in a shadow PTE flow is waiting for the HW
to complete invalidations of its PTE cache.

> > It is important to identify what model you are going for as defining a
> > 'registration cache' coherence expectation allows the driver to skip
> > blocking in invalidate_range_start. All it does is invalidate the
> > cache so that future operations pick up the new VA mapping.
> > 
> > Intel's HFI RDMA driver uses this model extensively, and I think it is
> > well proven, within some limitations of course.
> > 
> > At least, 'registration cache' is the only use model I know of where
> > it is acceptable to skip invalidate_range_end.
> 
> Here GPU are not in the registration cache model, i know it might looks
> like it because of GUP but GUP was use just because hmm did not exist
> at the time.

It is not because of GUP, it is because of the lack of
invalidate_range_end. A driver cannot correctly implement the SPTE
model without invalidate_range_end, even if it holds the page pins via
GUP.

So, I've been assuming the few drivers without invalidate_range_end
are trying to do registration caching, rather than assuming they are
broken.

Jason