From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 8 May 2014 21:45:45 -0400
From: Jerome Glisse
Subject: Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
Message-ID: <20140509014544.GB2906@gmail.com>
References: <1399038730-25641-1-git-send-email-j.glisse@gmail.com>
 <20140506102925.GD11096@twins.programming.kicks-ass.net>
 <1399429987.2581.25.camel@buesod1.americas.hpqcorp.net>
 <536BB508.2020704@mellanox.com>
 <20140508175624.GA3121@gmail.com>
 <1399599734.2497.2.camel@buesod1.americas.hpqcorp.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <1399599734.2497.2.camel@buesod1.americas.hpqcorp.net>
Sender: owner-linux-mm@kvack.org
To: Davidlohr Bueso
Cc: sagi grimberg, Peter Zijlstra, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Mel Gorman, "H.
 Peter Anvin", Andrew Morton, Linda Wang, Kevin E Martin, Jerome Glisse,
 Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
 Dave Airlie, Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
 Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
 Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
 Or Gerlitz, Shachar Raindel, Liran Liss, Roland Dreier, "Sander, Ben",
 "Stoner, Greg", "Bridgman, John", "Mantor, Michael", "Blinzer, Paul",
 "Morichetti, Laurent", "Deucher, Alexander", "Gabbay, Oded",
 Linus Torvalds

On Thu, May 08, 2014 at 06:42:14PM -0700, Davidlohr Bueso wrote:
> On Thu, 2014-05-08 at 13:56 -0400, Jerome Glisse wrote:
> > On Thu, May 08, 2014 at 07:47:04PM +0300, sagi grimberg wrote:
> > > On 5/7/2014 5:33 AM, Davidlohr Bueso wrote:
> > > > On Tue, 2014-05-06 at 12:29 +0200, Peter Zijlstra wrote:
> > > > > So you forgot to CC Linus; Linus has expressed some dislike for
> > > > > preemptible mmu_notifiers in the recent past:
> > > > >
> > > > >   https://lkml.org/lkml/2013/9/30/385
> > > >
> > > > I'm glad this came up again.
> > > >
> > > > So I've been running benchmarks (mostly aim7, which nicely exercises our
> > > > locks) comparing my recent v4 for rwsem optimistic spinning against
> > > > previous implementation ideas for the anon-vma lock, mostly:
> > > >
> > > > - rwsem (current)
> > > > - rwlock_t
> > > > - qrwlock_t
> > > > - rwsem+optspin
> > > >
> > > > Of course, *any* change provides a significant improvement in throughput
> > > > for several workloads by avoiding blocking -- there are more
> > > > performance numbers in the different patches. This is fairly obvious.
> > > >
> > > > What is perhaps not so obvious is that rwsem+optimistic spinning beats
> > > > all the others, including the improved qrwlock from Waiman and Peter. This
> > > > is mostly because of the idea of cancelable MCS, which was mimicked from
> > > > mutexes. The delta in most cases is around +10-15%, which is non-trivial.
> > > >
> > > These are great news, David!
> > >
> > > > I mention this because, from a performance PoV, we'll stop caring so much
> > > > about the type of lock we require in the notifier-related code. So while
> > > > this is not conclusive, I'm not as opposed to keeping the locks blocking
> > > > as I once was. Now this might still imply things like poor design
> > > > choices, but that's neither here nor there.
> > >
> > > So is the rwsem+opt strategy the way to go, given it keeps everyone happy?
> > > We will be more than satisfied with it, as it will allow us to
> > > guarantee device MMU updates.
> > >
> > > > /me sees Sagi smiling ;)
> > >
> > > :)
> >
> > So I started doing things with TLB flush, but I must say it looks ugly.
> > I need a new page flag (goodbye, 32-bit platforms), and I need my own LRU
> > and page reclamation for any page in use by a device; I need to hook up
> > inside try_to_unmap or migrate (I will do the former). I am trying to be
> > smart by scheduling a worker on another CPU before sending the IPI, so
> > that while the IPI is in progress another CPU will hopefully schedule the
> > invalidation on the GPU, and the wait for the GPU after the IPI will be
> > quick.
> >
> > So all in all this is looking ugly, and it does not change the fact that
> > I sleep (well, need to be able to sleep). It just moves the sleeping to
> > another part.
> >
> > Maybe I should stress that with the mmu_notifier version it only sleeps
> > for processes that are using the GPU, and those processes are using
> > userspace APIs like OpenCL which do not play well with fork (read: do not
> > use fork if you are using such an API).
> >
> > So in my case, if a process has mm->hmm set, that means there is a GPU
> > using that address space, and it is unlikely to go under the massive
> > workload that people try to optimize the anon_vma lock for.
> >
> > My point is that with rwsem+optspin it could try spinning if mm->hmm
> > was NULL and make the massive fork workload go fast, or it could sleep
> > directly if mm->hmm is set.
>
> Sorry? Unless I'm misunderstanding you, we don't do such things. Our
> locks are generic and need to work in any circumstance; no special
> cases here and there... _especially_ with these kinds of things. So no,
> rwsem will spin as long as the owner is set, just like for any other user.
>
> Thanks,
> Davidlohr

I do not mind spinning all the time; I was just thinking that it could be
optimized away when there is an hmm for the current mm, since in that case
there is very likely going to be a schedule inside the mmu_notifier anyway.
But if you prefer to keep the code generic, I am fine with wasting the CPU
cycles.

Cheers,
Jerome Glisse

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org