From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jason Gunthorpe
Subject: Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
Date: Thu, 15 Aug 2019 14:16:22 -0300
Message-ID: <20190815171622.GL21596@ziepe.ca>
References: <20190814202027.18735-1-daniel.vetter@ffwll.ch>
 <20190814202027.18735-3-daniel.vetter@ffwll.ch>
 <20190814134558.fe659b1a9a169c0150c3e57c@linux-foundation.org>
 <20190815084429.GE9477@dhcp22.suse.cz>
 <20190815130415.GD21596@ziepe.ca>
 <20190815143759.GG21596@ziepe.ca>
 <20190815151028.GJ21596@ziepe.ca>
 <20190815163238.GA30781@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path:
Content-Disposition: inline
In-Reply-To: <20190815163238.GA30781@redhat.com>
Sender: linux-kernel-owner@vger.kernel.org
To: Jerome Glisse
Cc: Daniel Vetter, Michal Hocko, Andrew Morton, LKML, Linux MM,
 DRI Development, Intel Graphics Development, Peter Zijlstra,
 Ingo Molnar, David Rientjes, Christian König, Masahiro Yamada,
 Wei Wang, Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
 Kees Cook, Randy Dunlap
List-Id: intel-gfx@lists.freedesktop.org

On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote:
> On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> >
> > > You have to wait for the gpu to finish current processing in
> > > invalidate_range_start. Otherwise there's no point to any of this
> > > really. So the wait_event/dma_fence_wait are unavoidable really.
> >
> > I don't envy your task :|
> >
> > But, what you describe sure sounds like a 'registration cache' model,
> > not the 'shadow pte' model of coherency.
> >
> > The key difference is that a registration cache is allowed to become
> > incoherent with the VMA's because it holds page pins.
> > It is a programming bug in userspace to change VA mappings via
> > mmap/munmap/etc while the device is working on that VA, but it does
> > not harm system integrity because of the page pin.
> >
> > The cache ensures that each initiated operation sees a DMA setup that
> > matches the current VA map when the operation is initiated and allows
> > expensive device DMA setups to be re-used.
> >
> > A 'shadow pte' model (ie hmm) *really* needs device support to
> > directly block DMA access - ie trigger 'device page fault'. ie the
> > invalidate_start should inform the device to enter a fault mode and
> > that is it. If the device can't do that, then the driver probably
> > shouldn't pursue this level of coherency. The driver would quickly get
> > into the messy locking problems like dma_fence_wait from a notifier.
>
> I think here we do not agree on the hardware requirement. For GPU
> we will always need to be able to wait for some GPU fence from inside
> the notifier callback, there is just no way around that for many of
> the GPUs today (i do not see any indication of that changing).

I didn't say you couldn't wait, I was trying to say that the wait
should only be contingent on the HW itself. Ie you can wait on a GPU
page table lock, and you can wait on a GPU page table flush completion
via IRQ.

What is troubling is to wait till some other thread gets a GPU command
completion and decr's a kref on the DMA buffer - which kinda looks
like what this dma_fence() stuff is all about. A driver like that
would have to be super careful to ensure consistent forward progress
toward dma ref == 0 when the system is under reclaim. ie by running
its entire IRQ flow under fs_reclaim locking.

> associated with the mm_struct. In all GPU drivers so far it is a short
> lived lock and nothing blocking is done while holding it (it is just
> about updating page table directory really whether it is filling it or
> clearing it).

The main blocking I expect in a shadow PTE flow is waiting for the HW
to complete invalidations of its PTE cache.

> > It is important to identify what model you are going for as defining a
> > 'registration cache' coherence expectation allows the driver to skip
> > blocking in invalidate_range_start. All it does is invalidate the
> > cache so that future operations pick up the new VA mapping.
> >
> > Intel's HFI RDMA driver uses this model extensively, and I think it is
> > well proven, within some limitations of course.
> >
> > At least, 'registration cache' is the only use model I know of where
> > it is acceptable to skip invalidate_range_end.
>
> Here GPUs are not in the registration cache model, i know it might look
> like it because of GUP but GUP was used just because hmm did not exist
> at the time.

It is not because of GUP, it is because of the lack of
invalidate_range_end. A driver cannot correctly implement the SPTE
model without invalidate_range_end, even if it holds the page pins via
GUP.

So, I've been assuming the few drivers without invalidate_range_end
are trying to do registration caching, rather than assuming they are
broken.

Jason
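
To make the distinction between the two models concrete, here is a
rough userspace sketch. It is not kernel code and none of the names are
real driver or mmu_notifier APIs: struct reg_entry, struct spte_dev and
every function below are hypothetical stand-ins. The generation counter
and the nesting count of active invalidations are assumptions of this
sketch, not something stated in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model only. All names here are hypothetical stand-ins for
 * the two coherency models, not kernel or driver APIs.
 */

/* --- Registration cache: a cached, pinned DMA setup for one VA range --- */
struct reg_entry {
	unsigned long start, end;	/* VA range [start, end) */
	bool valid;			/* may a new operation reuse this setup? */
	unsigned int generation;	/* bumped each time the setup is rebuilt */
	int pins;			/* in-flight operations holding page pins */
};

/* Never blocks: only stops future operations from reusing the setup. */
static void regcache_invalidate_range_start(struct reg_entry *e,
					    unsigned long start,
					    unsigned long end)
{
	if (start < e->end && e->start < end)
		e->valid = false;
}

/*
 * Start an operation. A stale entry is rebuilt from the current VA map,
 * modelled here as a generation bump. Returns the generation in use.
 */
static unsigned int regcache_begin_op(struct reg_entry *e)
{
	if (!e->valid) {
		e->generation++;
		e->valid = true;
	}
	e->pins++;
	return e->generation;
}

/* Finish an operation and drop its page pin. */
static void regcache_end_op(struct reg_entry *e)
{
	assert(e->pins > 0);
	e->pins--;
}

/* --- Shadow PTE: the device mirrors the CPU page tables --- */
struct spte_dev {
	int active_invalidations;	/* start calls not yet matched by end */
	bool fault_mode;		/* device faults instead of doing DMA */
	unsigned long hw_flushes;	/* completed PTE cache flushes */
};

/* Stand-in for waiting on the HW itself, e.g. a flush-completion IRQ. */
static void spte_wait_hw_flush(struct spte_dev *d)
{
	d->hw_flushes++;
}

/* Put the device in fault mode; the only wait is on the HW flush. */
static void spte_invalidate_range_start(struct spte_dev *d)
{
	d->active_invalidations++;
	d->fault_mode = true;
	spte_wait_hw_flush(d);
}

/* In this model, end is the only point where mapping may resume. */
static void spte_invalidate_range_end(struct spte_dev *d)
{
	assert(d->active_invalidations > 0);
	if (--d->active_invalidations == 0)
		d->fault_mode = false;
}

/* True if the device may DMA now, false if it must take a device fault. */
static bool spte_device_may_dma(const struct spte_dev *d)
{
	return !d->fault_mode;
}
```

In the registration cache half, an operation started before
regcache_invalidate_range_start() keeps its pin and its old generation,
and only the next regcache_begin_op() picks up the new mapping. In the
shadow PTE half, spte_device_may_dma() stays false until every
spte_invalidate_range_start() has been matched by an
spte_invalidate_range_end(), which is why the sketch cannot drop the
end callback.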