From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jerome Glisse
Subject: Re: [PATCH v1 for-next 06/16] IB/core: Implement support for MMU notifiers regarding on demand paging regions
Date: Wed, 10 Sep 2014 16:14:34 -0400
Message-ID: <20140910201432.GA3801@gmail.com>
References: <20140904202458.GB2685@gmail.com> <6B2A6E60C06CCC42AE31809BF572352B010E2442C7@MTLDAG02.mtl.com> <20140909153627.GA3545@gmail.com> <6B2A6E60C06CCC42AE31809BF572352B010E244E86@MTLDAG02.mtl.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path: 
Content-Disposition: inline
In-Reply-To: <6B2A6E60C06CCC42AE31809BF572352B010E244E86-LSMZvP3E4uyuSA5JZHE7gA@public.gmane.org>
Sender: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Shachar Raindel
Cc: Haggai Eran, "linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org", Sagi Grimberg
List-Id: linux-rdma@vger.kernel.org

On Wed, Sep 10, 2014 at 09:00:36AM +0000, Shachar Raindel wrote:
>
>
> > -----Original Message-----
> > From: Jerome Glisse [mailto:j.glisse-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org]
> > Sent: Tuesday, September 09, 2014 6:37 PM
> > To: Shachar Raindel
> > Cc: 1404377069-20585-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org; Haggai Eran;
> > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Jerome Glisse; Sagi Grimberg
> > Subject: Re: [PATCH v1 for-next 06/16] IB/core: Implement support for
> > MMU notifiers regarding on demand paging regions
> >
> > On Sun, Sep 07, 2014 at 02:35:59PM +0000, Shachar Raindel wrote:
> > > Hi,
> > >
> > > > -----Original Message-----
> > > > From: Jerome Glisse [mailto:j.glisse-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org]
> > > > Sent: Thursday, September 04, 2014 11:25 PM
> > > > To: Haggai Eran; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > > Cc: Shachar Raindel; Sagi Grimberg
> > > > Subject: Re: [PATCH v1 for-next 06/16] IB/core: Implement support
> > > > for MMU notifiers regarding on demand paging regions
> > > >
> > > > > * Add an interval tree implementation for ODP umems. Create an
> > > > >   interval tree for each ucontext (including a count of the number
> > > > >   of ODP MRs in this context, mutex, etc.), and register ODP umems
> > > > >   in the interval tree.
> > > > > * Add MMU notifiers handling functions, using the interval tree to
> > > > >   notify only the relevant umems and underlying MRs.
> > > > > * Register to receive MMU notifier events from the MM subsystem
> > > > >   upon ODP MR registration (and unregister accordingly).
> > > > > * Add a completion object to synchronize the destruction of ODP
> > > > >   umems.
> > > > > * Add a mechanism to abort page faults when there's a concurrent
> > > > >   invalidation.
> > > > >
> > > > > The way we synchronize between concurrent invalidations and page
> > > > > faults is by keeping a counter of currently running invalidations,
> > > > > and a sequence number that is incremented whenever an invalidation
> > > > > is caught. The page fault code checks the counter and also
> > > > > verifies that the sequence number hasn't progressed before it
> > > > > updates the umem's page tables. This is similar to what the kvm
> > > > > module does.
> > > > >
> > > > > There's currently a rare race in the code when registering a umem
> > > > > in the middle of an ongoing notifier. The proper fix is to either
> > > > > serialize the insertion to our umem tree with mm_lock_all or use a
> > > > > ucontext-wide running notifiers count for the retries decision.
> > > > > Either is ugly and can lead to some sort of starvation. The
> > > > > current workaround is ugly as well - now the user can end up with
> > > > > mapped addresses that are not in the user's address space
> > > > > (although it is highly unlikely).
> > > > I have been trying to wrap my head around this comment. I am
> > > > totally unfamiliar with RDMA code, but from a quick look at it,
> > > > when registering a umem you take the mmap_sem in read mode, so any
> > > > munmap from userspace would be serialized. Really, the worst that
> > > > can happen is that a umem points to a mmaped file that is
> > > > concurrently truncated, but even then the address is still valid;
> > > > it should result in a SIGBUS, which here is obviously harder to
> > > > report (again, I don't know how RDMA works).
> > > >
> > > > So am I missing something?
> > > >
> > >
> > > Sadly, taking mmap_sem in read-only mode does not prevent all
> > > possible invalidations from happening. For example, a call to
> > > madvise requesting MADV_DONTNEED will lock the mmap_sem for reading
> > > only, allowing a notifier to run in parallel to the MR registration.
> > > As a result, the following sequence of events could happen:
> > >
> > > Thread 1:                       | Thread 2
> > > --------------------------------+-------------------------
> > > madvise                         |
> > > down_read(mmap_sem)             |
> > > notifier_start                  |
> > >                                 | down_read(mmap_sem)
> > >                                 | register_mr
> > > notifier_end                    |
> > > reduce_mr_notifiers_count       |
> > >
> > > The end result of this sequence is an MR with a running notifiers
> > > count of -1, which is bad. The current workaround is to avoid
> > > decreasing the notifiers count if it is zero, which can cause other
> > > issues. The proper fix would be to prevent notifiers from running in
> > > parallel to registration. For this, taking mmap_sem in write mode
> > > might be sufficient, but we are not sure about this. We will be
> > > happy to hear additional input on this subject, to make sure we got
> > > it covered properly.
> > So in HMM I solve this by having a struct allocated in the start
> > range callback, and the end range callback just ignores things when
> > it cannot find the matching struct.
>
> This kind of mechanism sounds like it has a bigger risk of deadlocking
> the system, causing an OOM kill without a real need, or significantly
> slowing down the system.
> If you are doing non-atomic memory allocations, you can deadlock the
> system by requesting memory in the swapper flow.
> Even if you are doing atomic memory allocations, you need to handle
> the case of a failing allocation, the solution to which is unclear to
> me.
> If you are using a pre-allocated pool, what are you doing when you run
> out of available entries in the pool? If you are blocking until some
> entries free up, what guarantees that this will not cause a deadlock?

So I am using a fixed pool, and when it runs out, it blocks in the start
callback until one is freed. But as I said, I have a patch to use the
stack that will solve this and avoid a pool.

>
> >
> > That being said, when registering the mmu_notifier you need 2 things.
> > First, you need a pin on the mm (either the mm is current, i.e.
> > current->mm, or you took a reference on it). Second, you need to take
> > the mmap semaphore in write mode so that no concurrent
> > mmap/munmap/madvise can happen. By doing that you protect yourself
> > from the concurrent range_start/range_end that can happen and that do
> > matter. The only concurrent range_start/end that can happen is
> > through file invalidation, which is fine because a subsequent page
> > fault will go through the file layer and bring back the page or
> > return an error (if the file was truncated for instance).
>
> Sadly, this is not sufficient for our use case. We are registering
> a single MMU notifier handler, and broadcast the notifications to
> all relevant listeners, which are stored in an interval tree.
> Each listener represents a section of the address space that has been
> exposed to the network. Such an implementation allows us to limit the
> impact of invalidations, and only block racing page faults to the
> affected areas.
>
> Each of the listeners maintains a counter of the number of
> invalidate_range notifications that are currently affecting it. The
> counter is increased for each invalidate_range_start callback
> received, and decreased for each invalidate_range_end callback
> received. If we add a listener to the interval tree after the
> invalidate_range_start callback happened, but before the
> invalidate_range_end callback happened, it will decrease the counter,
> reaching negative numbers and breaking the logic.
>
> The mmu_notifier registration code avoids such issues by taking all
> relevant locks on the MM. This effectively blocks all possible
> notifiers from happening when registering a new notifier. Sadly, this
> function is not exported for modules to use.
>
> Our options at the moment are:
> - Use a tracking mechanism similar to what HMM uses, alongside the
>   challenges involved in allocating memory from notifiers
>
> - Use a per-process counter for invalidations, causing a possible
>   performance degradation. This can possibly be used as a fallback to
>   the first option (i.e. have a pool of X notifier identifiers; once
>   it is full, increase/decrease a per-MM counter)
>
> - Export the mm_take_all_locks function for modules. This will allow
>   us to lock the MM when adding a new listener.

I was not clear enough: you need to take the mmap_sem in write mode
across mmu_notifier_register(). This only partially solves your issue:
if an mmu_notifier is already registered for the mm you are trying to
register against, then there is a chance for you to be inside an active
range_start/range_end section, which would lead to an invalid counter
inside your tracking structure.
But, sadly, taking mmap_sem in write mode is not enough, as file
invalidation might still happen concurrently, so you will need to make
sure your invalidation counters do not go negative. From the page fault
point of view you will be fine, because the page fault will synchronize
through the pagecache. So, scenario (A and B are two overlapping
anonymous address ranges):

APP_TOTO_RDMA_THREAD          | APP_TOTO_SOME_OTHER_THREAD
                              |
                              | mmu_notifier_invalidate_range_start(A)
odp_register()                |
  down_read(mmap_sem)         |
  mmu_notifier_register()     |
  up_read(mmap_sem)           |
odp_add_new_region(B)         |
odp_page_fault(B)             |
  down_read(mmap_sem)         |
  ...                         |
  up_read(mmap_sem)           |
                              | mmu_notifier_invalidate_range_end(A)

The odp_page_fault(B) might see an invalid cpu page table, but you have
no idea about it because you registered after the range_start(). But if
you take the mmap_sem in write mode, then the only case where you might
still have this scenario is if A and B are ranges of a file-backed vma
and the file is undergoing some change (most likely truncation). The
file case is fine, because odp_page_fault() will go through the
pagecache, which is properly synchronized against the current range
invalidation.

Now, for the general case outside of mmu_notifier_register(), HMM also
tracks active invalidation ranges to avoid page faulting into those
ranges, as we cannot trust the cpu page table for as long as the range
invalidation is ongoing.

> >
> > So as long as you hold the mmap_sem in write mode you should not
> > worry about concurrent range_start/range_end (well, they might
> > happen, but only for file-backed vma).
> >
>
> Sadly, the mmap_sem is not enough to protect us :(.

This is enough, as I explained above, but I am only talking about the
mmu notifier registration. For the general case, once you register, you
only need to take the mmap_sem in read mode during page fault.
> > Given that you face the same issue as I have with the
> > range_start/range_end, I will stitch up a patch to make it easier to
> > track those.
> >
>
> That would be nice, especially if we could easily integrate it into
> our code and reduce the code size.

Yes, it's a "small modification" to the mmu_notifier API; I have been
sidetracked on other things. But I will have it soon.

>
> > Cheers,
> > Jérôme
> >
> >
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html