From: Jerome Glisse
Subject: Re: [PATCH 0/3 v3] mmu_notifier: Allow to manage CPU external TLBs
Date: Fri, 12 Sep 2014 14:39:35 -0400
Message-ID: <20140912183934.GA5196@gmail.com>
References: <1410277434-3087-1-git-send-email-joro@8bytes.org>
 <20140910150125.31a7495c7d0fe814b85fd514@linux-foundation.org>
 <20140911000211.GA4989@gmail.com>
In-Reply-To: <20140911000211.GA4989@gmail.com>
To: Andrew Morton
Cc: Joerg Roedel, Andrea Arcangeli, Rik van Riel, jroedel@suse.de,
 Peter Zijlstra, John.Bridgman@amd.com, Jesse Barnes, Hugh Dickins,
 linux-kernel@vger.kernel.org, ben.sander@amd.com, linux-mm@kvack.org,
 Jerome Glisse, Jay.Cornwall@amd.com, Mel Gorman, David Woodhouse,
 Johannes Weiner, iommu@lists.linux-foundation.org

On Wed, Sep 10, 2014 at 08:02:12PM -0400, Jerome Glisse wrote:
> On Wed, Sep 10, 2014 at 03:01:25PM -0700, Andrew Morton wrote:
> > On Tue, 9 Sep 2014 17:43:51 +0200 Joerg Roedel wrote:
> >
> > > here is a patch-set to extend the mmu_notifiers in the Linux
> > > kernel to allow managing CPU external TLBs. Those TLBs may
> > > be implemented in IOMMUs or any other external device, e.g.
> > > ATS/PRI capable PCI devices.
> > >
> > > The problem with managing these TLBs is the semantics of
> > > the invalidate_range_start/end call-backs currently
> > > available. Currently the subsystem using mmu_notifiers has
> > > to guarantee that no new TLB entries are established between
> > > invalidate_range_start/end. Furthermore the
> > > invalidate_range_start() function is called when all pages
> > > are still mapped and invalidate_range_end() when the pages
> > > are unmapped and already freed.
> > >
> > > So both call-backs can't be used to safely flush any non-CPU
> > > TLB because _start() is called too early and _end() too
> > > late.
> >
> > There's a lot of missing information here.  Why don't the existing
> > callbacks suit non-CPU TLBs?  What is different about them?  Please
> > update the changelog to contain all this context.
>
> Unlike KVM or any other current user of the mmu_notifier API, PCIe
> ATS/PASID capable hardware (IOMMUv2 for AMD, VT-d for Intel) walks
> the CPU page table on its own and has its own TLB cache. In fact,
> not only can the IOMMU have a TLB cache, but any single PCIe device
> can implement its own local TLB cache as well.
>
> So if we flush the IOMMU and device TLBs inside the range_start()
> callback, there is a chance that the hardware will just re-walk the
> CPU page table and repopulate its TLB before the CPU page table is
> actually updated.
>
> Now if we shoot down the TLB inside the range_end() callback, then
> we are too late, ie the CPU page table is already populated with new
> entries and all the TLBs in the IOMMU and in devices might still be
> pointing to the old pages.
>
> So the aim of this callback is to be invoked right after the CPU
> page table is updated but before the old page is freed or recycled.
> Note that this is also safe for COW and other transitions such as
> read-only to read-write or the other way around.
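>
> To make the ordering concrete, here is a rough sketch of where the
> new callback would sit in an unmap path. This is illustrative
> pseudo-kernel code, not the actual patch; unmap_one_page() is a
> made-up helper:
>
>     #include <linux/mm.h>
>     #include <linux/mmu_notifier.h>
>
>     static void unmap_one_page(struct vm_area_struct *vma,
>                                unsigned long addr, pte_t *ptep,
>                                struct page *page)
>     {
>             struct mm_struct *mm = vma->vm_mm;
>
>             mmu_notifier_invalidate_range_start(mm, addr,
>                                                 addr + PAGE_SIZE);
>
>             /* Clear the CPU page table entry, flush the CPU TLB. */
>             ptep_clear_flush(vma, addr, ptep);
>
>             /*
>              * New callback: external TLBs (IOMMU, ATS/PASID capable
>              * devices) are flushed here, after the CPU PTE is gone
>              * but before the page can be freed or recycled.
>              */
>             mmu_notifier_invalidate_range(mm, addr, addr + PAGE_SIZE);
>
>             /* Only now is it safe to drop the page. */
>             put_page(page);
>
>             mmu_notifier_invalidate_range_end(mm, addr,
>                                               addr + PAGE_SIZE);
>     }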
>
> > > In the AMD IOMMUv2 driver this is currently implemented by
> > > assigning an empty page-table to the external device between
> > > _start() and _end(). But as tests have shown this doesn't
> > > work as external devices don't re-fault infinitely but enter
> > > a failure state after some time.
> >
> > More missing info.  Why are these faults occurring?  Is there some
> > device activity which is trying to fault in pages, but the CPU is
> > executing code between _start() and _end() so the driver must refuse
> > to instantiate a page to satisfy the fault?  That's just a guess,
> > and I shouldn't be guessing.  Please update the changelog to fully
> > describe the dynamic activity which is causing this.
>
> The hack that was used prior to this patch was to point the IOMMU to
> an empty page table (a zero page) inside the range_start() callback
> and shoot down the TLB, but this meant that the device could enter a
> storm of page faults. GPUs can have thousands of threads, and
> because the empty page table is in use during invalidation they all
> start triggering page faults, even those that were not trying to
> access the range being invalidated. It turns out that when this
> happens, current hardware like AMD GPUs actually stops working after
> a while, ie the hardware stumbles because too many faults are going
> on.
>
> >
> > > Next problem with this solution is that it causes an
> > > interrupt storm for IO page faults to be handled when an
> > > empty page-table is assigned.
> >
> > Also too skimpy.  I *think* this is a variant of the problem in the
> > preceding paragraph.  We get a fault storm (which is problem 2) and
> > sometimes the faulting device gives up (which is problem 1).
> >
> > Or something.  Please de-fog all of this.
>
> Does the above explanation help in understanding the issue? Keep in
> mind that each device page fault triggers an IRQ (well, the way the
> hardware works is a bit more complex than that).
>
> > > Furthermore the _start()/end() notifiers only catch the
> > > moment when page mappings are released, but not page-table
> > > pages. But this is necessary for managing external TLBs when
> > > the page-table is shared with the CPU.
> >
> > How come?
>
> As explained above, _end() might happen after pages that were
> previously mapped have been freed or recycled.
>
> >
> > > To solve this situation I wrote a patch-set to introduce a
> > > new notifier call-back: mmu_notifier_invalidate_range(). This
> > > notifier lifts the strict requirement that no new
> > > references are taken in the range between _start() and
> > > _end(). When the subsystem can't guarantee that no new
> > > references are taken, it has to provide the
> > > invalidate_range() call-back to clear any new references in
> > > there.
> > >
> > > It is called between invalidate_range_start() and _end()
> > > every time the VMM has to wipe out any references to a
> > > couple of pages. These are usually the places where the CPU
> > > TLBs are flushed too and where it's important that this
> > > happens before invalidate_range_end() is called.
> > >
> > > Any comments and review appreciated!
> >
> > The patchset looks decent, although I find it hard to review because
> > I just wasn't provided with enough of the thinking that went into
> > it.  I have enough info to look at the C code, but not enough info
> > to identify and evaluate alternative implementation approaches, to
> > identify possible future extensions, etc.
> >
> > The patchset does appear to add significant additional overhead to
> > hot code paths when mm_has_notifiers(mm).  Please let's update the
> > changelog to address this rather important concern.  How significant
> > is the impact on such mm's, how common are such mm's now and in the
> > future, should we (for example) look at short-circuiting
> > __mmu_notifier_invalidate_range() if none of the registered
> > notifiers implement ->invalidate_range(), etc.
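>
> For reference, the dispatch loop in the patchset already skips
> notifiers that don't provide the new hook. A per-mm short-circuit
> along the lines Andrew suggests could look roughly like the sketch
> below, modeled on mm/mmu_notifier.c; the has_invalidate_range flag
> is hypothetical and not part of the patchset:
>
>     void __mmu_notifier_invalidate_range(struct mm_struct *mm,
>                                          unsigned long start,
>                                          unsigned long end)
>     {
>             struct mmu_notifier *mn;
>             int id;
>
>             /*
>              * Hypothetical per-mm flag, set at registration time
>              * when a notifier supplies ->invalidate_range(), so
>              * the SRCU walk is skipped entirely for everybody else.
>              */
>             if (!mm->mmu_notifier_mm->has_invalidate_range)
>                     return;
>
>             id = srcu_read_lock(&srcu);
>             hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list,
>                                      hlist) {
>                     if (mn->ops->invalidate_range)
>                             mn->ops->invalidate_range(mn, mm,
>                                                       start, end);
>             }
>             srcu_read_unlock(&srcu, id);
>     }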
>
> So one might feel like just completely removing range_start()/end()
> from the mmu_notifier and sticking to this one callback, but that
> will not work with other hardware like the one my HMM patchset
> targets (I sent it again a couple of weeks ago:
> https://lkml.org/lkml/2014/8/29/423).
>
> Right now there are very few users of the mmu_notifier: SGI, Xen,
> KVM and, sadly, recently the GPU folks (which I am part of too). So
> as GPUs start to use it we will see a lot more applications going
> through the mmu_notifier callbacks. You do want to leverage hardware
> like GPUs, and you do want to use the same address space on the GPU
> as on the CPU, and thus you do want to share, or at least keep
> synchronized, the GPU view of the CPU page table.
>
> Right now, for IOMMUv2 this means adding this callback; for devices
> that do not rely on the ATS/PASID PCIe extension it means something
> like HMM. Also note that HMM is a superset of IOMMUv2, as both could
> be used at the same time, and HMM provides more features, mainly
> allowing some pages to be migrated to device-local memory for
> performance purposes.
>
> I think this sums up all the motivation behind this patchset and
> also behind my other patchset. As usual I am happy to discuss
> alternative ways to do things, but I think the path of least
> disruption from the current code is the one implemented by these
> patchsets.
>
> Cheers,
> Jérôme

Hi Andrew, I wanted to touch base to see if those explanations were
enough to understand the problem we are trying to solve. Do you think
the patchset is OK, or do you feel like there should be a better
solution?

Cheers,
Jérôme