From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id C65576B003D for ; Wed, 11 Mar 2009 13:35:39 -0400 (EDT) Date: Wed, 11 Mar 2009 10:33:00 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: <20090311170611.GA2079@elte.hu> Message-ID: References: <20090311170611.GA2079@elte.hu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Nick Piggin , Hugh Dickins , Andrea Arcangeli , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Wed, 11 Mar 2009, Ingo Molnar wrote: > > FYI, in case you missed it. Large MM fix - and it's awfully late > in -rc7. Yeah, I'm not taking this at this point. No way, no-how. If there is no simpler and obvious fix, it needs to go through -stable, after having cooked in 2.6.30-rc for a while. Especially as this is a totally uninteresting usage case that I can't see as being at all relevant to any real world. Anybody who mixes O_DIRECT and fork() (and threads) is already doing some seriously strange things. Nothing new there. And quite frankly, the patch is so ugly as-is that I'm not likely to take it even into the 2.6.30 merge window unless it can be cleaned up. That whole fork_pre_cow function is too f*cking ugly to live. We just don't write code like this in the kernel. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id 156AD6B003D for ; Wed, 11 Mar 2009 13:41:11 -0400 (EDT) Date: Wed, 11 Mar 2009 18:41:03 +0100 From: Ingo Molnar Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Message-ID: <20090311174103.GA11979@elte.hu> References: <20090311170611.GA2079@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Nick Piggin , Hugh Dickins , Andrea Arcangeli , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: * Linus Torvalds wrote: > On Wed, 11 Mar 2009, Ingo Molnar wrote: > > > > FYI, in case you missed it. Large MM fix - and it's awfully > > late in -rc7. > > Yeah, I'm not taking this at this point. No way, no-how. > > If there is no simpler and obvious fix, it needs to go through > -stable, after having cooked in 2.6.30-rc for a while. > Especially as this is a totally uninteresting usage case that > I can't see as being at all relevant to any real world. > > Anybody who mixes O_DIRECT and fork() (and threads) is already > doing some seriously strange things. Nothing new there. Hm, is there any security impact? Andrea is talking about data corruption. I'm wondering whether that's just corruption relative to whatever twisted semantics O_DIRECT has in this case [which would be harmless], or some true pagecache corruption going across COW (or other) protection domains that could be exploited [which would not be harmless]. Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 690C86B003D for ; Wed, 11 Mar 2009 14:02:27 -0400 (EDT) Date: Wed, 11 Mar 2009 10:58:17 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: <20090311174103.GA11979@elte.hu> Message-ID: References: <20090311170611.GA2079@elte.hu> <20090311174103.GA11979@elte.hu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Nick Piggin , Hugh Dickins , Andrea Arcangeli , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Wed, 11 Mar 2009, Ingo Molnar wrote: > > Hm, is there any security impact? Andrea is talking about data > corruption. I'm wondering whether that's just corruption > relative to whatever twisted semantics O_DIRECT has in this case > [which would be harmless], or some true pagecache corruption > going across COW (or other) protection domains that could be > exploited [which would not be harmless]. As far as I can tell, it's the same old problem that we've always had: if you fork(), it's unclear who is going to do the first write - parent or child (and "parent" in this case can include any number of threads that share the VM, of course). And that means that anything that relies on pinned pages will never know whether it is pinning a page in the parent or the child - because whoever does the first COW of that page is the one that just gets a _copy_, not the original pinned page. This isn't anything new. Anything that does anything by physical address will simply not do the right thing over a fork. The physical page may have started out as the parents physical page, but it may end up in the end being the _childs_ physical page if the parent wrote to it and triggered the cow. 
The rule has always been: don't mix fork() with page pinning. It doesn't work. It never worked. It likely never will. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 594AC6B003D for ; Wed, 11 Mar 2009 14:22:28 -0400 (EDT) Date: Wed, 11 Mar 2009 19:22:16 +0100 From: Andrea Arcangeli Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Message-ID: <20090311182216.GJ27823@random.random> References: <20090311170611.GA2079@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Ingo Molnar , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Wed, Mar 11, 2009 at 10:33:00AM -0700, Linus Torvalds wrote: > > On Wed, 11 Mar 2009, Ingo Molnar wrote: > > > > FYI, in case you missed it. Large MM fix - and it's awfully late > > in -rc7. I didn't specify it, but I didn't mean to submit it for immediate inclusion. I posted it because it's ready and I wanted feedback from Hugh/Nick/linux-mm so we can get this fixed when next merge window open. > Yeah, I'm not taking this at this point. No way, no-how. > > If there is no simpler and obvious fix, it needs to go through -stable, > after having cooked in 2.6.30-rc for a while. Especially as this is a > totally uninteresting usage case that I can't see as being at all relevant > to any real world. Actually AFIK there are mission critical real world applications that used 512byte blocksize that were affected by this (I CC'ed relevant people who knows). However this is rare thing so it almost never triggers because the window is so small. 
> Anybody who mixes O_DIRECT and fork() (and threads) is already doing some > seriously strange things. Nothing new there. Most apps aren't affected of course. But almost all apps eventually call fork (system/fork/exec/anything). Calling fork currently is enough to generate memory corruption in the parent (i.e. lost O_DIRECT reads from disk). > And quite frankly, the patch is so ugly as-is that I'm not likely to take > it even into the 2.6.30 merge window unless it can be cleaned up. That > whole fork_pre_cow function is too f*cking ugly to live. We just don't > write code like this in the kernel. Yes, this is exactly why I posted it now, to get feedback, it wasn't meant for submission. Feel free to write it yourself in another way of course, I included all relevant testcases to test alternate fixes too. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 7DA326B004D for ; Wed, 11 Mar 2009 14:38:05 -0400 (EDT) Date: Wed, 11 Mar 2009 19:37:48 +0100 From: Andrea Arcangeli Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Message-ID: <20090311183748.GK27823@random.random> References: <20090311170611.GA2079@elte.hu> <20090311174103.GA11979@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Ingo Molnar , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Wed, Mar 11, 2009 at 10:58:17AM -0700, Linus Torvalds wrote: > As far as I can tell, it's the same old problem that we've always had: if > you fork(), it's unclear who is going to do the first write - parent or > child (and "parent" in this 
case can include any number of threads that > share the VM, of course). The child doesn't touch any page. Calling fork just generates O_DIRECT corruption in the parent regardless of what the child does. > This isn't anything new. Anything that does anything by physical address This is nothing new also in the sense that all linux kernels out there had this bug thus far. > will simply not do the right thing over a fork. The physical page may have > started out as the parents physical page, but it may end up in the end > being the _childs_ physical page if the parent wrote to it and triggered > the cow. Actually the child will get corrupted too. Not just the parent by losing the O_DIRECT reads. The child always assumes its anon page contents will not get lost or overwritten after changing them in the child. > The rule has always been: don't mix fork() with page pinning. It doesn't > work. It never worked. It likely never will. I never heard this rule here, but surely I agree there will not be many apps out there capable of triggering this. Mostly because most apps uses O_DIRECT on top of shm (surely not because they're not usually calling fork). The ones affected are the ones using anonymous memory with threads and not allocating memory with memalign(4096) despite they use 512byte blocksize for their I/O. If they use threads and they allocate with memalign(512) they can be affected if they call fork anywhere. I don't think it's urgent fix, but if you now are pretending that this doesn't ever need fixing and we can live with the bug forever, I think you're wrong. If something I'd rather see O_DIRECT not supporting hardblocksize anymore but only PAGE_SIZE multiples, that would at least limiting the breakage to an undefined behavior. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
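Archive note: the message above ties the exposure to buffers that are 512-byte aligned rather than page aligned. Below is a minimal user-space sketch of allocating an O_DIRECT buffer that starts on a page boundary and covers whole pages, so that no unrelated data shares a page with in-flight I/O. The helper names alloc_dio_buffer() and round_up_to_page() are hypothetical and do not come from the thread or from any library. This only illustrates the memalign(4096) case Andrea describes; it is not a fix for the kernel behaviour under discussion.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: round len up to the next multiple of page. */
static size_t round_up_to_page(size_t len, size_t page)
{
    return (len + page - 1) / page * page;
}

/*
 * Hypothetical helper: allocate a buffer for O_DIRECT I/O that begins on a
 * page boundary and spans whole pages, even if the I/O itself is issued in
 * 512-byte units. Returns NULL on failure; the padded size is stored in
 * *alloc_len when alloc_len is not NULL. Release the buffer with free().
 */
static void *alloc_dio_buffer(size_t len, size_t *alloc_len)
{
    long page = sysconf(_SC_PAGESIZE);
    void *buf = NULL;
    size_t padded;

    if (page <= 0 || len == 0)
        return NULL;

    padded = round_up_to_page(len, (size_t)page);
    if (posix_memalign(&buf, (size_t)page, padded) != 0)
        return NULL;

    /* posix_memalign() guarantees the requested alignment on success. */
    assert((uintptr_t)buf % (uintptr_t)page == 0);

    if (alloc_len)
        *alloc_len = padded;
    return buf;
}
```

A caller would pass the returned pointer to read() or write() on a descriptor opened with O_DIRECT and keep other data out of the padded region while I/O is in flight.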
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id E2B026B004D for ; Wed, 11 Mar 2009 14:49:14 -0400 (EDT) Date: Wed, 11 Mar 2009 11:46:17 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: <20090311183748.GK27823@random.random> Message-ID: References: <20090311170611.GA2079@elte.hu> <20090311174103.GA11979@elte.hu> <20090311183748.GK27823@random.random> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Ingo Molnar , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Wed, 11 Mar 2009, Andrea Arcangeli wrote: > On Wed, Mar 11, 2009 at 10:58:17AM -0700, Linus Torvalds wrote: > > As far as I can tell, it's the same old problem that we've always had: if > > you fork(), it's unclear who is going to do the first write - parent or > > child (and "parent" in this case can include any number of threads that > > share the VM, of course). > > The child doesn't touch any page. Calling fork just generates O_DIRECT > corruption in the parent regardless of what the child does. You aren't listening. It depends on who does the write. If the _parent_ does the write (with another thread or not), then the _parent_ gets the COW. That's all I said. > > The rule has always been: don't mix fork() with page pinning. It doesn't > > work. It never worked. It likely never will. > > I never heard this rule here It's never been written down, but it's obvious to anybody who looks at how COW works for even five seconds. The fact is, the person doing the COW after a fork() is the person who no longer has the same physical page (because he got a new page). So _anything- that depends on physical addresses simply _cannot_ work concurrently with a fork. 
That has always been true. If the idiots who use O_DIRECT don't understand that, then hey, it's their problem. I have long been of the opinion that we should not support O_DIRECT at all, and that it's a totally broken premise to start with. This is just one of millions of reasons. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id D88A86B004D for ; Wed, 11 Mar 2009 14:53:47 -0400 (EDT) Date: Wed, 11 Mar 2009 19:53:39 +0100 From: Andrea Arcangeli Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Message-ID: <20090311185339.GL27823@random.random> References: <20090311170611.GA2079@elte.hu> <20090311174103.GA11979@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090311174103.GA11979@elte.hu> Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Linus Torvalds , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: Hello, On Wed, Mar 11, 2009 at 06:41:03PM +0100, Ingo Molnar wrote: > Hm, is there any security impact? Andrea is talking about data > corruption. I'm wondering whether that's just corruption > relative to whatever twisted semantics O_DIRECT has in this case > [which would be harmless], or some true pagecache corruption I don't think it's exploitable and I don't see this much as a security issue. This can only corrupt user data inside anonymous pages (not filesystem metadata or kernel pagecache). Side effects will be the usual ones of random user memory corruption or as worse it can lead to I/O corruption on disk, but only in user data. 
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id C11BF6B0047 for ; Wed, 11 Mar 2009 15:07:13 -0400 (EDT) Date: Wed, 11 Mar 2009 20:06:55 +0100 From: Ingo Molnar Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Message-ID: <20090311190655.GA690@elte.hu> References: <20090311170611.GA2079@elte.hu> <20090311182216.GJ27823@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090311182216.GJ27823@random.random> Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Linus Torvalds , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: * Andrea Arcangeli wrote: > On Wed, Mar 11, 2009 at 10:33:00AM -0700, Linus Torvalds wrote: > > > > On Wed, 11 Mar 2009, Ingo Molnar wrote: > > > > > > FYI, in case you missed it. Large MM fix - and it's awfully late > > > in -rc7. > > I didn't specify it, but I didn't mean to submit it for > immediate inclusion. I posted it because it's ready and I > wanted feedback from Hugh/Nick/linux-mm so we can get this > fixed when next merge window open. Good - i saw the '(fast-)gup fix' qualifier and fast-gup is a fresh feature. If the problem existed in earlier kernels too then i guess it isnt urgent. Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id A073F6B004D for ; Wed, 11 Mar 2009 15:07:22 -0400 (EDT) Date: Wed, 11 Mar 2009 20:06:55 +0100 From: Andrea Arcangeli Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Message-ID: <20090311190655.GM27823@random.random> References: <20090311170611.GA2079@elte.hu> <20090311174103.GA11979@elte.hu> <20090311183748.GK27823@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Ingo Molnar , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Wed, Mar 11, 2009 at 11:46:17AM -0700, Linus Torvalds wrote: > > > On Wed, 11 Mar 2009, Andrea Arcangeli wrote: > > > On Wed, Mar 11, 2009 at 10:58:17AM -0700, Linus Torvalds wrote: > > > As far as I can tell, it's the same old problem that we've always had: if > > > you fork(), it's unclear who is going to do the first write - parent or > > > child (and "parent" in this case can include any number of threads that > > > share the VM, of course). > > > > The child doesn't touch any page. Calling fork just generates O_DIRECT > > corruption in the parent regardless of what the child does. > > You aren't listening. > > It depends on who does the write. If the _parent_ does the write (with > another thread or not), then the _parent_ gets the COW. > > That's all I said. I only wanted to clarify this doesn't require the child to touch the page at all. > If the idiots who use O_DIRECT don't understand that, then hey, it's their > problem. I have long been of the opinion that we should not support > O_DIRECT at all, and that it's a totally broken premise to start with. Well if you don't like it used by databases, O_DIRECT is still ideal for KVM. 
Guest caches runs at cpu core speed unlike host cache. Not that KVM can reproduce this bug (all ram where KVM would be doing O_DIRECT is mapped MADV_DONTFORK, and besides guest physical ram has to be allocated with memalign(4096) ;). Said that I agree it'd be better off to nuke O_DIRECT than to leave this bug as O_DIRECT should not break the usual memory-protection semantics provided by read() and fork() syscalls. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 805776B003D for ; Wed, 11 Mar 2009 15:15:39 -0400 (EDT) Date: Wed, 11 Mar 2009 20:15:26 +0100 From: Andrea Arcangeli Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Message-ID: <20090311191526.GN27823@random.random> References: <20090311170611.GA2079@elte.hu> <20090311182216.GJ27823@random.random> <20090311190655.GA690@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090311190655.GA690@elte.hu> Sender: owner-linux-mm@kvack.org To: Ingo Molnar Cc: Linus Torvalds , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Wed, Mar 11, 2009 at 08:06:55PM +0100, Ingo Molnar wrote: > Good - i saw the '(fast-)gup fix' qualifier and fast-gup is a > fresh feature. If the problem existed in earlier kernels too > then i guess it isnt urgent. It always existed yes. The reason of the (fast-) qualifier is because gup-fast made it harder to fix this in mainline (there is also a patch floating around for 2.6.18 based kernels that is simpler thanks to gup-fast not being there). 
The trouble of gup-fast is that doing the check of page_count inside PT lock (or mmap_sem write mode like in fork(), but ksm only takes mmap_sem in read mode and it relied on PT lock only) wasn't enough anymore to be sure the page_count wouldn't increase from under us just after we read it, because a gup-fast could be running in another CPU without mmap_sem and without PT lock taken. So fixing this on mainline has been a bit harder as I had to prevent gup-fast to go ahead in the fast path, in a way that didn't send IPIs to flush the smp-tlb before reading the page_count (so to avoid sending IPIs for every anon page mapped writeable). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 2C7966B003D for ; Wed, 11 Mar 2009 15:19:03 -0400 (EDT) Date: Wed, 11 Mar 2009 12:01:56 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: Message-ID: References: <20090311170611.GA2079@elte.hu> <20090311174103.GA11979@elte.hu> <20090311183748.GK27823@random.random> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Ingo Molnar , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Wed, 11 Mar 2009, Linus Torvalds wrote: > > It's never been written down, but it's obvious to anybody who looks at how > COW works for even five seconds. The fact is, the person doing the COW > after a fork() is the person who no longer has the same physical page > (because he got a new page). Btw, I think your patch has a race. Admittedly a really small one. 
When you look up the page in gup.c, and then set the GUP flag on the
"struct page", in between the lookup and the setting of the flag, another
thread can come in and do that same fork+write thing.

    CPU0:                   CPU1

    gup:                    fork:
     - look up page
     - it's read-write
    ...
                            set_wr_protect
                            test GUP bit - not set, good
                            done

     - Mark it GUP
                            tlb_flush

                            write to it from user space - COW

since there is no lockng on the GUP side (there's the TLB flush that will
wait for interrupts being enabled again on CPU0, but that's later in the
fork sequence).

Maybe I'm missing something. The race is certainly very unlikely to ever
happen in practice, but it looks real.

Also, having to set the PG_GUP bit means that the "fast" gup is likely not
much faster than the slow one. It now has two atomics per page it looks
up, afaik, which sounds like it would delete any advantage it had over the
slow version that needed locking.

What we _could_ try to do is to always make the COW breaking be a
_directed_ event - we'd make sure that we always break COW in the
direction of the first owner (going to the rmap chains). That might solve
everything, and be purely local to the logic in mm/memory.c (do_wp_page).

I dunno. I have not looked at how horrible that would be.

        Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
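Archive note: the interleaving described above depends on what the lockless lookup does after it has taken its pin. Replies in this thread point to a re-read of the PTE after the page is marked. The following is a user-space model of that "pin, barrier, recheck" pattern using C11 atomics, offered only as a sketch. Every name in it (fake_page, fake_pte, PTE_WRITE, fork_wrprotect, gup_fast_sketch, racing_fork) is hypothetical; it is not kernel code and does not reproduce the patch. The `between` hook exists only so the racing write-protect can be replayed deterministically.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define PTE_WRITE 0x1u              /* hypothetical "writable" bit */

struct fake_page {
    atomic_int  refcount;           /* models the page pin count */
    atomic_bool gup_mark;           /* models a per-page "GUP" flag */
};

struct fake_pte {
    atomic_uint flags;
    struct fake_page *page;
};

static void fake_pte_init(struct fake_pte *pte, struct fake_page *page)
{
    atomic_init(&page->refcount, 0);
    atomic_init(&page->gup_mark, false);
    atomic_init(&pte->flags, PTE_WRITE);
    pte->page = page;
}

/* Fork side: write-protect first, then look at the mark. */
static bool fork_wrprotect(struct fake_pte *pte)
{
    atomic_fetch_and(&pte->flags, ~PTE_WRITE);
    return atomic_load(&pte->page->gup_mark);
}

/* Test hook standing in for "another CPU runs fork right now". */
typedef void (*race_hook)(struct fake_pte *);

static void racing_fork(struct fake_pte *pte)
{
    (void)fork_wrprotect(pte);
}

/* Lookup side: returns 1 if the pin is kept, 0 if the caller must fall back. */
static int gup_fast_sketch(struct fake_pte *pte, race_hook between)
{
    struct fake_page *page;

    assert(pte != NULL && pte->page != NULL);
    page = pte->page;

    if (!(atomic_load(&pte->flags) & PTE_WRITE))
        return 0;                   /* not writable: take the slow path */

    if (between)
        between(pte);               /* the window shown in the diagram */

    atomic_fetch_add(&page->refcount, 1);
    atomic_store(&page->gup_mark, true);
    atomic_thread_fence(memory_order_seq_cst);

    /* Recheck: if the PTE was write-protected meanwhile, drop the pin. */
    if (!(atomic_load(&pte->flags) & PTE_WRITE)) {
        atomic_fetch_sub(&page->refcount, 1);
        return 0;
    }
    return 1;
}
```

In this model, either the fork side observes the mark, or the lookup side observes the cleared write bit and backs off. Calling gup_fast_sketch(&pte, racing_fork) replays the diagram's ordering and leaves no pin behind.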
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51])
    by kanga.kvack.org (Postfix) with SMTP id 8B3C56B003D
    for ; Wed, 11 Mar 2009 16:00:01 -0400 (EDT)
Date: Wed, 11 Mar 2009 20:59:35 +0100
From: Andrea Arcangeli
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
Message-ID: <20090311195935.GO27823@random.random>
References: <20090311170611.GA2079@elte.hu> <20090311174103.GA11979@elte.hu> <20090311183748.GK27823@random.random>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
To: Linus Torvalds
Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

On Wed, Mar 11, 2009 at 12:01:56PM -0700, Linus Torvalds wrote:
> Btw, I think your patch has a race. Admittedly a really small one.
>
> When you look up the page in gup.c, and then set the GUP flag on the
> "struct page", in between the lookup and the setting of the flag, another
> thread can come in and do that same fork+write thing.
>
>     CPU0:                   CPU1
>
>     gup:                    fork:
>      - look up page
>      - it's read-write
>     ...
>                             set_wr_protect
>                             test GUP bit - not set, good
>                             done
>
>      - Mark it GUP
>                             tlb_flush
>
>                             write to it from user space - COW

Did you notice the check after 'mark it gup' that will run in CPU0?

+        if (PageAnon(page)) {
+            if (!PageGUP(page))
+                SetPageGUP(page);
+            smp_mb();
+            /*
+             * Fork doesn't want to flush the smp-tlb for
+             * every pte that it marks readonly but newly
+             * created shared anon pages cannot have
+             * direct-io going to them, so check if fork
+             * made the page shared before we taken the
+             * page pin.
+             */
+            if ((pte_flags(gup_get_pte(ptep)) &
+                 (mask | _PAGE_SPECIAL)) != mask) {
+                put_page(page);
+                pte_unmap(ptep);
+                return 0;
+            }
+        }

gup-fast will _not_ succeed because of the set_wr_protect that just
happened on CPU1.
That's why I added the above check after setpagegup/get_page. > since there is no lockng on the GUP side (there's the TLB flush that will > wait for interrupts being enabled again on CPU0, but that's later in the > fork sequence). Right, I preferred to 'recheck' the wrprotect bit before allowing gup-fast to succeed to avoid sending a flood of IPI in the fork fast path. So I leave the tlb flush at the end of the fork sequence and a single IPI in the common case. Only exception is the forcecow path where the copy has to happen atomically per-page, so I have to flush the smp-tlb before the copy after marking the parent wrprotected temporarly (later the parent pte is marked read-write again by fork_pre_cow after the copy), or NPTL will never have a chance to fix its bug as its glibc-parent data structures that could be modified by threads won't be copied atomically to the child. But that's a slow path so it's ok to flush tlb there. > Also, having to set the PG_GUP bit means that the "fast" gup is likely not > much faster than the slow one. It now has two atomics per page it looks > up, afaik, which sounds like it would delete any advantage it had over the > slow version that needed locking. gup-fast has already to get_page, so I don't see it. gup-fast will always dirty that cacheline and take over it regardless of PG_gup, gup-fast will never be able to run without running get_page. Furthermore starting from the second access GUP is already set and it's only a read from l1 from a cacheline that was already dirtied and taken over a few instructions before. So I think it can't be slowing down gup-fast in any measurable way, given how close mark-gup is set after get_page. > What we _could_ try to do is to always make the COW breaking be a > _directed_ event - we'd make sure that we always break COW in the > direction of the first owner (going to the rmap chains). That might solve > everything, and be purely local to the logic in mm/memory.c (do_wp_page). 
That's a really interesting idea and frankly I didn't think about it. Probably one reason is that it can't work for ksm where we take two random anon pages and create one out of them so each one could already have O_DIRECT in progress on them and we've to prevent to merge pages that have in-flight O_DIRECT to be merged no matter what (ordering is irrelevant for ksm, page contents must be stable or ksm will break). I was thinking of using the same logic for both ksm and fork. But theoretically, ksm can keep doing the page_count check to truly ensure no in-flight I/O is going on, and fork could fix it in whatever way it wants (I wonder if it'd be ok for fork to map a 'changing' page in the child because of the not-defined behavior of forking while a read is in progress, at least at the first write the page would stop changing contents). In fact ksm doesn't even require the above change to gup-fast because it does ptep_clear_flush_notify when it tries to wrprotect a not-shared anon page. > I dunno. I have not looked at how horrible that would be. For fork I think it would work, not sure if the current data structures would be enough, but at first glance I think besides how horrible that would be, I think from a practical standpoint the main problem is the slowdown it'd generate in the do_wp_page fast path. The anon_vma list can be huge in some weird case, which we normally cannot care less as swap algorithms and disk I/O (even on no-seeking SSD) is even slower than that. The coolness of rmap w/o pte_chains is that rmap is zerocost for all page faults (a check on vma->anon_vma being not null is the only cost) and I'd like to keep it that way. The cost of my fix to fork is not measurable with fork microbenchmark, while the cost of finding who owns the original shared page in do_wp_page would be potentially be much bigger. 
The only slowdown to fork is in the O_DIRECT slow path which we don't care about and in the worst case is limited to the total amount of in-flight I/O. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id A6FB86B003D for ; Wed, 11 Mar 2009 16:22:14 -0400 (EDT) Date: Wed, 11 Mar 2009 13:19:03 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: <20090311195935.GO27823@random.random> Message-ID: References: <20090311170611.GA2079@elte.hu> <20090311174103.GA11979@elte.hu> <20090311183748.GK27823@random.random> <20090311195935.GO27823@random.random> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Ingo Molnar , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Wed, 11 Mar 2009, Andrea Arcangeli wrote: > > Did you notice the check after 'mark it gup' that will run in CPU0? Ahh, no. I just read the patch through fairly quickly, and the whole "(gup_get_pte & mask) != mask" didn't trigger as obvious. But yeah, I see that it ends up re-checking the RW bit. > gup-fast will _not_ succeed because of the set_wr_protect that just > happened on CPU1. That's why I added the above check after > setpagegup/get_page. Ok, with the recheck I think it's fine. > > Also, having to set the PG_GUP bit means that the "fast" gup is likely not > > much faster than the slow one. It now has two atomics per page it looks > > up, afaik, which sounds like it would delete any advantage it had over the > > slow version that needed locking. > > gup-fast has already to get_page, so I don't see it. That's my point. 
It used to have one atomic. Now it has two (and a memory barrier). Those tend to be pretty expensive - even when there's no cacheline bouncing. > Furthermore starting from the second access GUP is already > set That's a totally bogus argument. It will be true for _benchmarks_, but if somebody is trying to avoid buffered IO, one very possible common case is that it's all going to be new pages all the time. That said, I don't know who the crazy O_DIRECT users are. It may be true that some O_DIRECT users end up using the same pages over and over again, and that this is a good optimization for them. > > What we _could_ try to do is to always make the COW breaking be a > > _directed_ event - we'd make sure that we always break COW in the > > direction of the first owner (going to the rmap chains). That might solve > > everything, and be purely local to the logic in mm/memory.c (do_wp_page). > > That's a really interesting idea and frankly I didn't think about it. The advantage of it is that it fixes the problem not just in one place, but "forever". No hacks about exactly how you access the mappings etc. Of course, nothing _really_ solves things. If you do some delayed IO after having looked up the mapping and turned it into a physical page, and the original allocator actually unmaps it (or exits), then the same issue can still happen (well, not the _same_ one - but the very similar issue of the child seeing changes even though the IO was started in the parent). This is why I think any "look up by physical" is fundamentally flawed. It very basically becomes a "I have a secret local TLB that cannot be changed or flushed". And any single-bit solution (GUP) is always going to be fairly broken. > The cost of my fix to fork is not measurable with fork microbenchmark, > while the cost of finding who owns the original shared page in > do_wp_page would be potentially be much bigger. 
The only slowdown to > fork is in the O_DIRECT slow path which we don't care about and in the > worst case is limited to the total amount of in-flight I/O. Agreed. However, I really think this is a O_DIRECT problem. Just document it. Tell people that O_DIRECT simply doesn't work with COW, and fundamentally can never work well. If you use O_DIRECT with threading, you had better know what the hell you're doing anyway. I do not think that the kernel should do stupid things just because stupid users don't understand the semantics of the _non-stupid_ thing (which is to just let people think about COW for five seconds). Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 205C16B003D for ; Wed, 11 Mar 2009 16:36:28 -0400 (EDT) Date: Wed, 11 Mar 2009 13:33:17 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: Message-ID: References: <20090311170611.GA2079@elte.hu> <20090311174103.GA11979@elte.hu> <20090311183748.GK27823@random.random> <20090311195935.GO27823@random.random> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Ingo Molnar , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Wed, 11 Mar 2009, Linus Torvalds wrote: > > Agreed. However, I really think this is a O_DIRECT problem. Just document > it. Tell people that O_DIRECT simply doesn't work with COW, and > fundamentally can never work well. > > If you use O_DIRECT with threading, you had better know what the hell > you're doing anyway. 
I do not think that the kernel should do stupid > things just because stupid users don't understand the semantics of the > _non-stupid_ thing (which is to just let people think about COW for five > seconds). Btw, if we don't do that, then there are better alternatives. One is: - fork already always takes the write lock on mmap_sem (and f*ck no, I doubt anybody will ever care one whit how "parallel" you can do forks from threads, so I don't think this is an issue) - Just make the rule be that people who use get_user_pages() always have to have the read-lock on mmap_sem until they've used the pages. We already take the read-lock for the lookup (well, not for the gup, but for all the slow cases), but I'm saying that we could go one step further - just read-lock over the _whole_ O_DIRECT read or write. That way you literally protect against concurrent fork()s. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
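For readers who want the "think about COW for five seconds" point in concrete form, here is a minimal userspace sketch. It is not from the thread: the helper name child_view_after_parent_write() and the 4096-byte mapping size are assumptions made for illustration. It only shows the value-level effect of copy-on-write after fork(); it does not reproduce the O_DIRECT/gup race itself, which depends on timing and on in-flight DMA.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo helper (not from the thread): returns the byte a forked
 * child reads from a private anonymous page after the parent has overwritten
 * that byte post-fork, or -1 on error.
 */
int child_view_after_parent_write(void)
{
	int go[2], status, result = -1;
	pid_t pid;
	char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (page == MAP_FAILED)
		return -1;
	if (pipe(go) != 0) {
		munmap(page, 4096);
		return -1;
	}
	page[0] = 'A';

	pid = fork();
	if (pid == 0) {
		char token;

		close(go[1]);
		/* Block until the parent has done its post-fork write. */
		if (read(go[0], &token, 1) != 1)
			_exit(0);
		_exit(page[0]);
	}
	if (pid > 0) {
		/* This write happens after fork(); COW gives the writer its own copy. */
		page[0] = 'B';
		if (write(go[1], "x", 1) != 1)
			result = -1;
		close(go[1]);
		go[1] = -1;
		if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
			result = WEXITSTATUS(status);
	}

	close(go[0]);
	if (go[1] >= 0)
		close(go[1]);
	munmap(page, 4096);
	return result;
}
```

The child still reads 'A' even though the parent wrote 'B' first. That isolation is exactly what goes wrong when a device is writing to the physical page behind the parent's back: the data can land in the copy the parent no longer maps.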
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 573276B003D for ; Wed, 11 Mar 2009 16:48:38 -0400 (EDT) Date: Wed, 11 Mar 2009 21:48:15 +0100 From: Andrea Arcangeli Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Message-ID: <20090311204815.GQ27823@random.random> References: <20090311170611.GA2079@elte.hu> <20090311174103.GA11979@elte.hu> <20090311183748.GK27823@random.random> <20090311195935.GO27823@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Ingo Molnar , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Wed, Mar 11, 2009 at 01:19:03PM -0700, Linus Torvalds wrote: > That said, I don't know who the crazy O_DIRECT users are. It may be true > that some O_DIRECT users end up using the same pages over and over again, > and that this is a good optimization for them. If it's done on new pages chances are that gup-fast fast-path can't run in the first place, modulo glibc memalign re-using previously freed areas. Overall I think it's worthwhile optimization, to avoid the locked op in the rewrite case that I think it's common enough. But I totally agree that it'd be good to benchmark gup-fast on already instantiated ptes where SetPageGUP will run. I thought it'd be like below measurement error and not measurable but good to check it. > The advantage of it is that it fixes the problem not just in one place, > but "forever". No hacks about exactly how you access the mappings etc. > > Of course, nothing _really_ solves things. 
If you do some delayed IO after > having looked up the mapping and turned it into a physical page, and the > original allocator actually unmaps it (or exits), then the same issue can > still happen (well, not the _same_ one - but the very similar issue of the > child seeing changes even though the IO was started in the parent). > > This is why I think any "look up by physical" is fundamentally flawed. It > very basically becomes a "I have a secret local TLB that cannot be changed > or flushed". And any single-bit solution (GUP) is always going to be > fairly broken. One of the reasons of not sharing when PG_gup is set and page_count is shown as pinned, is also to fix all sort of drivers that are doing gup to "lookup by physical" on anon pages and doing "dma by physical some offset of the page" at any time later and fork. Otherwise PageReserved should be set by default by gup-fast instead of relying on the drivers to set it after gup-fast returns. > Agreed. However, I really think this is a O_DIRECT problem. Just document > it. Tell people that O_DIRECT simply doesn't work with COW, and > fundamentally can never work well. > > If you use O_DIRECT with threading, you had better know what the hell > you're doing anyway. I do not think that the kernel should do stupid > things just because stupid users don't understand the semantics of the > _non-stupid_ thing (which is to just let people think about COW for five > seconds). This really isn't only about O_DIRECT. This is to fix gup vs fork, O_DIRECT is just one of the million of gup users out there... KVM work around this by using MADV_DONTFORK, until MADV_DONTFORK was introduced I once started to get corruption in KVM when a change made system() to be executed once in a while for whatever unrelated reason. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 9EDC36B003D for ; Wed, 11 Mar 2009 16:55:37 -0400 (EDT) Date: Wed, 11 Mar 2009 21:55:29 +0100 From: Andrea Arcangeli Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Message-ID: <20090311205529.GR27823@random.random> References: <20090311170611.GA2079@elte.hu> <20090311174103.GA11979@elte.hu> <20090311183748.GK27823@random.random> <20090311195935.GO27823@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Ingo Molnar , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Wed, Mar 11, 2009 at 01:33:17PM -0700, Linus Torvalds wrote: > Btw, if we don't do that, then there are better alternatives. One is: > > - fork already always takes the write lock on mmap_sem (and f*ck no, I > doubt anybody will ever care one whit how "parallel" you can do forks > from threads, so I don't think this is an issue) > > - Just make the rule be that people who use get_user_pages() always > have to have the read-lock on mmap_sem until they've used the pages. How do you handle pages where gup already returned and I/O still in flight? Forcing gup-fast to be called with mmap_sem already hold (like gup used to require) only avoids the need of changes in gup-fast AFAICT. You'll still get pages that are pinned and calling gup-fast under mmap_sem (no matter if read or even write mode) won't make a difference, still those pages will be pinned while fork runs and with dma going to them (by O_DIRECT or some driver using gup, as long as PageReserved isn't set on them). 
> We already take the read-lock for the lookup (well, not for the gup, but > for all the slow cases), but I'm saying that we could go one step further > - just read-lock over the _whole_ O_DIRECT read or write. That way you > literally protect against concurrent fork()s. Releasing the mmap_sem read mode in the irq-completion handler context should be possible, however fork will end up throttled blocking for I/O which isn't very nice behavior. BTW, direct-io.c is a total mess, I couldn't even figure out where to release those locks in the I/O completion handlers when I tried something like this with PG_lock instead of the mmap_sem... Eventually I gave it up because this isn't just about O_DIRECT but all gup users have this trouble with fork. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 605286B003D for ; Wed, 11 Mar 2009 17:30:46 -0400 (EDT) Date: Wed, 11 Mar 2009 14:28:08 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: <20090311205529.GR27823@random.random> Message-ID: References: <20090311170611.GA2079@elte.hu> <20090311174103.GA11979@elte.hu> <20090311183748.GK27823@random.random> <20090311195935.GO27823@random.random> <20090311205529.GR27823@random.random> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Ingo Molnar , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Wed, 11 Mar 2009, Andrea Arcangeli wrote: > On Wed, Mar 11, 2009 at 01:33:17PM -0700, Linus Torvalds wrote: > > Btw, if we don't do that, then there are better alternatives. 
One is: > > > > - fork already always takes the write lock on mmap_sem (and f*ck no, I > > doubt anybody will ever care one whit how "parallel" you can do forks > > from threads, so I don't think this is an issue) > > > > - Just make the rule be that people who use get_user_pages() always > > have to have the read-lock on mmap_sem until they've used the pages. > > How do you handle pages where gup already returned and I/O still in > flight? The rule is: - either keep the mmap_sem for reading until the IO is done - admit the fact that IO is asynchronous, and has visible async behavior. > Forcing gup-fast to be called with mmap_sem already hold (like > gup used to require) only avoids the need of changes in gup-fast > AFAICT. You'll still get pages that are pinned and calling gup-fast > under mmap_sem (no matter if read or even write mode) won't make a > difference, still those pages will be pinned while fork runs and with > dma going to them (by O_DIRECT or some driver using gup, as long as > PageReserved isn't set on them). The point I'm trying to make is that anybody who thinks that pages are stable over various behavior that runs in another thread - be it a fork, a mmap/munmap, or anything else, is just fooling themselves. The pages are going to show up in "random" places. The fact that the non-fast "get_user_pages()" takes the mmap semaphore for reading doesn't even protect that. It just means that the pages made sense at the time the get_user_pages() happened, not necessarily at the time when the actual use of them did. > Releasing the mmap_sem read mode in the irq-completion handler context > should be possible, however fork will end up throttled blocking for > I/O which isn't very nice behavior. BTW, direct-io.c is a total mess, > I couldn't even figure out where to release those locks in the I/O > completion handlers when I tried something like this with PG_lock > instead of the mmap_sem... 
Eventually I gave it up because this isn't > just about O_DIRECT but all gup users have this trouble with fork. O_DIRECT is actually the _simple_ case, since we won't be returning until it is done (ie it's not actually a async interface). So no, O_DIRECT doesn't need any interrupt handler games. It would just need to hold the sem over the actual call to the filesystem (ie just over the ->direct_IO() call). Of course, I suspect that all users of O_DIRECT would be _very_ unhappy if they cannot do mmap/unmap/brk on other areas while O_DIRECT is going on, so it's almost certainly not reasonable. People want the relaxed synchronization we give them, and that's literally why get_user_pages_fast exists - because people don't want _more_ synchronization, they want _less_. But the thing is, with less synchronization, the behavior really is surprising in the edge cases. Which is why I think "threaded fork" plus "get_user_pages_fast" just doesn't make sense to even _worry_ about. If you use O_DIRECT and mix it with fork, you get what you get, and it's random - exactly because people who want O_DIRECT don't want any locking. It's a user-space issue, not a kernel issue. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id A8DF86B0047 for ; Wed, 11 Mar 2009 17:57:33 -0400 (EDT) Date: Wed, 11 Mar 2009 22:57:21 +0100 From: Andrea Arcangeli Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Message-ID: <20090311215721.GS27823@random.random> References: <20090311174103.GA11979@elte.hu> <20090311183748.GK27823@random.random> <20090311195935.GO27823@random.random> <20090311205529.GR27823@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Ingo Molnar , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Wed, Mar 11, 2009 at 02:28:08PM -0700, Linus Torvalds wrote: > The fact that the non-fast "get_user_pages()" takes the mmap semaphore for > reading doesn't even protect that. It just means that the pages made sense > at the time the get_user_pages() happened, not necessarily at the time > when the actual use of them did. Indeed this is a generic problem, not specific to get_user_pages_fast. get_user_pages_fast just adds a few complications to serialize against. > O_DIRECT is actually the _simple_ case, since we won't be returning until > it is done (ie it's not actually a async interface). So no, O_DIRECT > doesn't need any interrupt handler games. It would just need to hold the > sem over the actual call to the filesystem (ie just over the ->direct_IO() > call). I don't see how you can solve the race by only holding the sem only over the direct_IO call (and not until the I/O completion handler fires). I think to solve the race using mmap_sem only, the bio I/O completion handler that eventually calls into direct-io.c from irq context would need to up_read(&mmap_sem). 
The way my patch avoids to alter the I/O completion path running from irq context is by ensuring no I/O is going on at all to the pages that are being shared with the child, and by ensuring that any gup or gup-fast will trigger cow before it can write to the shared page. Pages simply can't be shared before I/O is complete. > People want the relaxed synchronization we give them, and that's literally > why get_user_pages_fast exists - because people don't want _more_ > synchronization, they want _less_. > > But the thing is, with less synchronization, the behavior really is > surprising in the edge cases. Which is why I think "threaded fork" plus > "get_user_pages_fast" just doesn't make sense to even _worry_ about. If > you use O_DIRECT and mix it with fork, you get what you get, and it's > random - exactly because people who want O_DIRECT don't want any locking. > > It's a user-space issue, not a kernel issue. I think your point of view is clear, I sure can write userland code that copes it the currently altered memory protection semantics of read vs fork if fd is opened with O_DIRECT or drivers using gup, so I'll let the userland folks comment on it, some are in CC. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id C2E286B0047 for ; Wed, 11 Mar 2009 18:09:08 -0400 (EDT) Date: Wed, 11 Mar 2009 15:06:18 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: <20090311215721.GS27823@random.random> Message-ID: References: <20090311174103.GA11979@elte.hu> <20090311183748.GK27823@random.random> <20090311195935.GO27823@random.random> <20090311205529.GR27823@random.random> <20090311215721.GS27823@random.random> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Ingo Molnar , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Wed, 11 Mar 2009, Andrea Arcangeli wrote: > > > People want the relaxed synchronization we give them, and that's literally > > why get_user_pages_fast exists - because people don't want _more_ > > synchronization, they want _less_. > > > > But the thing is, with less synchronization, the behavior really is > > surprising in the edge cases. Which is why I think "threaded fork" plus > > "get_user_pages_fast" just doesn't make sense to even _worry_ about. If > > you use O_DIRECT and mix it with fork, you get what you get, and it's > > random - exactly because people who want O_DIRECT don't want any locking. > > > > It's a user-space issue, not a kernel issue. > > I think your point of view is clear, I sure can write userland code > that copes it the currently altered memory protection semantics of > read vs fork if fd is opened with O_DIRECT or drivers using gup, so > I'll let the userland folks comment on it, some are in CC. Btw, we could make it easier for people to not screw up. In particular, "fork()" in a threaded program is almost always wrong. 
If you want to exec another program from a threaded one, you should either
just do execve() (which kills all threads) or you should do vfork+execve
(which has none of the COW issues).

And we could add a warning for it. Something like "if this is a threaded
program, and it has ever used get_user_pages(), and it does a fork(), warn
about it once". Maybe people would realize what a stupid thing they are
doing, and that there is a simple fix (vfork).

		Linus

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id A9B476B0047 for ; Wed, 11 Mar 2009 18:10:41 -0400 (EDT)
Date: Wed, 11 Mar 2009 15:07:48 -0700 (PDT)
From: Linus Torvalds
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
In-Reply-To:
Message-ID:
References: <20090311174103.GA11979@elte.hu> <20090311183748.GK27823@random.random> <20090311195935.GO27823@random.random> <20090311205529.GR27823@random.random> <20090311215721.GS27823@random.random>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
To: Andrea Arcangeli
Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

On Wed, 11 Mar 2009, Linus Torvalds wrote:
>
> And we could add a warning for it. Something like "if this is a threaded
> program, and it has ever used get_user_pages(), and it does a fork(), warn
> about it once". Maybe people would realize what a stupid thing they are
> doing, and that there is a simple fix (vfork).

Ehh. vfork is only simple if you literally are going to execve. If you are
using a fork as some kind of odd way to snapshot, I don't know what you
should do.
You can't sanely snapshot a threaded app with fork, but I bet some people
try.

		Linus

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id D580D6B0047 for ; Wed, 11 Mar 2009 18:22:27 -0400 (EDT)
Received: from makko.or.mcafeemobile.com by x35.xmailserver.org with [XMail 1.26 ESMTP Server] id for from ; Wed, 11 Mar 2009 18:22:24 -0400
Date: Wed, 11 Mar 2009 15:22:39 -0700 (PDT)
From: Davide Libenzi
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
In-Reply-To:
Message-ID:
References: <20090311174103.GA11979@elte.hu> <20090311183748.GK27823@random.random> <20090311195935.GO27823@random.random> <20090311205529.GR27823@random.random> <20090311215721.GS27823@random.random>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
To: Linus Torvalds
Cc: Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

On Wed, 11 Mar 2009, Linus Torvalds wrote:

> In particular, "fork()" in a threaded program is almost always wrong. If
> you want to exec another program from a threaded one, you should either
> just do execve() (which kills all threads) or you should do vfork+execve
> (which has none of the COW issues).

Didn't follow the lengthy thread, but if we make fork+exec to fail inside
a threaded program, we might end up making a lot of people unhappy.

- Davide
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id 6804F6B0047 for ; Wed, 11 Mar 2009 18:35:33 -0400 (EDT)
Date: Wed, 11 Mar 2009 15:32:38 -0700 (PDT)
From: Linus Torvalds
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
In-Reply-To:
Message-ID:
References: <20090311174103.GA11979@elte.hu> <20090311183748.GK27823@random.random> <20090311195935.GO27823@random.random> <20090311205529.GR27823@random.random> <20090311215721.GS27823@random.random>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
To: Davide Libenzi
Cc: Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

On Wed, 11 Mar 2009, Davide Libenzi wrote:
>
> Didn't follow the lengthy thread, but if we make fork+exec to fail inside
> a threaded program, we might end up making a lot of people unhappy.

Yeah, no, we don't want to fail it, but we could do a one-time warning or
something, to at least see who does it and perhaps see if some of them
might realize the problems.

		Linus
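The vfork+execve pattern suggested above for launching another program from a threaded process can be sketched roughly as follows. This is an illustrative sketch, not code from the thread: the helper name spawn_and_wait() is hypothetical, and error handling is kept to a minimum (for instance, an interrupted waitpid() is not retried).

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): run argv[0] via vfork()+execve()
 * and return the child's exit status, or -1 on failure.
 */
int spawn_and_wait(char *const argv[], char *const envp[])
{
	int status;
	pid_t pid = vfork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child borrows the parent's memory: only exec or _exit here. */
		execve(argv[0], argv, envp);
		_exit(127);	/* reached only if execve() failed */
	}

	/* The parent resumes once the child has exec'd or exited. */
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the child never gets a copy-on-write view of the address space, no page is write-protected in the parent and any I/O in flight keeps targeting the pages the parent still maps. As noted in the thread, this only helps when the child really is going to execve.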
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 618326B003D for ; Thu, 12 Mar 2009 01:36:28 -0400 (EDT) From: Nick Piggin Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Date: Thu, 12 Mar 2009 16:36:18 +1100 References: <20090311170611.GA2079@elte.hu> <20090311183748.GK27823@random.random> In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200903121636.18867.nickpiggin@yahoo.com.au> Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Thursday 12 March 2009 05:46:17 Linus Torvalds wrote: > On Wed, 11 Mar 2009, Andrea Arcangeli wrote: > > > The rule has always been: don't mix fork() with page pinning. It > > > doesn't work. It never worked. It likely never will. > > > > I never heard this rule here > > It's never been written down, but it's obvious to anybody who looks at how > COW works for even five seconds. The fact is, the person doing the COW > after a fork() is the person who no longer has the same physical page > (because he got a new page). > > So _anything- that depends on physical addresses simply _cannot_ work > concurrently with a fork. That has always been true. > > If the idiots who use O_DIRECT don't understand that, then hey, it's their > problem. I have long been of the opinion that we should not support > O_DIRECT at all, and that it's a totally broken premise to start with. > > This is just one of millions of reasons. Well it is a quite well known issue at this stage I think. We've had MADV_DONTFORK since 2.6.16 which is basically to solve this issue I think with infiniband library. 
I guess if it would be really helpful we *could* add MADV_DONTCOW.

Assuming we want to try fixing it transparently... what about another
approach, mark a vma as VM_DONTCOW and uncow all existing pages in it
if it ever has get_user_pages run on it. Big hammer approach.

fast gup would be a little bit harder because looking up the vma
defeats the purpose. However if we use another page bit to say the
page belongs to a VM_DONTCOW vma, then we only need to check that
once and fall back to slow gup if it is clear. So there would be no
extra atomics in the repeat case. Yes it would be slower, but apps
that really care should know what they are doing and set
MADV_DONTFORK or MADV_DONTCOW on the vma by hand before doing the
zero copy IO.

Would this work? Anyone see any holes? (I imagine someone might argue
against big hammer, but I would prefer it if it is lighter impact on
the VM and still allows good applications to avoid the hammer)
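As an illustration of setting MADV_DONTFORK "by hand" before doing zero-copy I/O, here is a minimal userspace sketch. The helper names alloc_dontfork_buffer() and child_sees_buffer() are hypothetical and not from the thread. MADV_DONTCOW is only a proposal in this discussion, so the sketch uses the existing MADV_DONTFORK flag. The check relies on write() to a pipe failing with EFAULT when the source address is not mapped in the calling process.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: map an anonymous I/O buffer that fork() will not duplicate. */
void *alloc_dontfork_buffer(size_t len)
{
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;
	if (madvise(buf, len, MADV_DONTFORK) != 0) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}

/*
 * Hypothetical check: returns 1 if a forked child can read one byte at buf,
 * 0 if the address is not mapped in the child, -1 on any other error.
 */
int child_sees_buffer(const void *buf)
{
	int fds[2], status;
	pid_t pid;

	if (pipe(fds) != 0)
		return -1;
	pid = fork();
	if (pid == 0) {
		/* The kernel copies from buf; an unmapped source yields EFAULT. */
		ssize_t n = write(fds[1], buf, 1);

		_exit(n == 1 ? 1 : (errno == EFAULT ? 0 : 2));
	}
	close(fds[0]);
	close(fds[1]);
	if (pid < 0 || waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}
```

With the range excluded from the child, fork() has nothing to write-protect there, so the parent keeps its physical pages for the whole lifetime of the I/O. The cost is that the child cannot touch the buffer at all, which is fine for a child that only execs.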
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id DC88A6B003D for ; Thu, 12 Mar 2009 12:23:53 -0400 (EDT) From: Nick Piggin Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Date: Fri, 13 Mar 2009 03:23:40 +1100 References: <20090311170611.GA2079@elte.hu> <200903121636.18867.nickpiggin@yahoo.com.au> In-Reply-To: <200903121636.18867.nickpiggin@yahoo.com.au> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: base64 Content-Disposition: inline Message-Id: <200903130323.41193.nickpiggin@yahoo.com.au> Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: T24gVGh1cnNkYXkgMTIgTWFyY2ggMjAwOSAxNjozNjoxOCBOaWNrIFBpZ2dpbiB3cm90ZToKCj4g QXNzdW1pbmcgd2Ugd2FudCB0byB0cnkgZml4aW5nIGl0IHRyYW5zcGFyZW50bHkuLi4gd2hhdCBh Ym91dCBhbm90aGVyCj4gYXBwcm9hY2gsIG1hcmsgYSB2bWEgYXMgVk1fRE9OVENPVyBhbmQgdW5j b3cgYWxsIGV4aXN0aW5nIHBhZ2VzIGluIGl0Cj4gaWYgaXQgZXZlciBoYXMgZ2V0X3VzZXJfcGFn ZXMgcnVuIG9uIGl0LiBCaWcgaGFtbWVyIGFwcHJvYWNoLgo+Cj4gZmFzdCBndXAgd291bGQgYmUg YSBsaXR0bGUgYml0IGhhcmRlciBiZWNhdXNlIGxvb2tpbmcgdXAgdGhlIHZtYQo+IGRlZmVhdHMg dGhlIHB1cnBvc2UuIEhvd2V2ZXIgaWYgd2UgdXNlIGFub3RoZXIgcGFnZSBiaXQgdG8gc2F5IHRo ZQo+IHBhZ2UgYmVsb25ncyB0byBhIFZNX0RPTlRDT1cgdm1hLCB0aGVuIHdlIG9ubHkgbmVlZCB0 byBjaGVjayB0aGF0Cj4gb25jZSBhbmQgZmFsbCBiYWNrIHRvIHNsb3cgZ3VwIGlmIGl0IGlzIGNs ZWFyLiBTbyB0aGVyZSB3b3VsZCBiZSBubwo+IGV4dHJhIGF0b21pY3MgaW4gdGhlIHJlcGVhdCBj YXNlLiBZZXMgaXQgd291bGQgYmUgc2xvd2VyLCBidXQgYXBwcwo+IHRoYXQgcmVhbGx5IGNhcmUg c2hvdWxkIGtub3cgd2hhdCB0aGV5IGFyZSBkb2luZyBhbmQgc2V0Cj4gTUFEVl9ET05URk9SSyBv ciBNQURWX0RPTlRDT1cgb24gdGhlIHZtYSBieSBoYW5kIGJlZm9yZSBkb2luZyB0aGUKPiB6ZXJv IGNvcHkgSU8uCj4KPiBXb3VsZCB0aGlzIHdvcms/IEFueW9uZSBzZWUgYW55IGhvbGVzPyAoSSBp 
bWFnaW5lIHNvbWVvbmUgbWlnaHQgYXJndWUKPiBhZ2FpbnN0IGJpZyBoYW1tZXIsIGJ1dCBJIHdv dWxkIHByZWZlciBpdCBpZiBpdCBpcyBsaWdodGVyIGltcGFjdCBvbgo+IHRoZSBWTSBhbmQgc3Rp bGwgYWxsb3dzIGdvb2QgYXBwbGljYXRpb25zIHRvIGF2b2lkIHRoZSBoYW1tZXIpCgpPSywgdGhp cyBpcyBhcyBmYXIgYXMgSSBnb3QgdG9uaWdodC4KClRoaXMgcGFzc2VzIEFuZHJlYSdzIGRtYV90 aHJlYWQgdGVzdCBjYXNlLiBJIGhhdmVuJ3Qgc3RhcnRlZCBodWdlcGFnZXMsCmFuZCBpdCBpc24n dCBxdWl0ZSByaWdodCB0byBkcm9wIHRoZSBtbWFwX3NlbSBhbmQgcmV0YWtlIGl0IGZvciB3cml0 ZQppbiBnZXRfdXNlcl9wYWdlcyAoZmlyc3RseSwgY2FsbGVyIG1pZ2h0IGhvbGQgbW1hcF9zZW0g Zm9yIHdyaXRlLApzZWNvbmRseSwgaXQgbWF5IG5vdCBiZSBhYmxlIHRvIHRvbGVyYXRlIG1tYXBf c2VtIGJlaW5nIGRyb3BwZWQpLgoKQW5ub3lpbmcgdGhhdCBpdCBoYXMgdG8gdGFrZSBtbWFwX3Nl bSBmb3Igd3JpdGUgdG8gYWRkIHRoaXMgYml0IHRvCnZtX2ZsYWdzLiBQb3NzaWJseSB3ZSBjb3Vs ZCB1c2UgYSBkaWZmZXJlbnQgd2F5IHRvIHNpZ25hbCBpdCBpcyBhCiJkb250Y293IiB2bWEuLi4g c29tZXRoaW5nIGluIGFub25fdm1hIG1heWJlPwoKQW55d2F5LCBiZWZvcmUgd29ycnlpbmcgdG9v IG11Y2ggbW9yZSBhYm91dCB0aG9zZSBkZXRhaWxzLCBJJ2xsIHBvc3QKaXQuIEl0IGlzIGEgZGlm ZmVyZW50IGFwcHJvYWNoIHRoYXQgSSB0aGluayBtaWdodCBiZSB3b3J0aCBjb25zaWRlcmF0aW9u LgpDb21tZW50cz8KClRoYW5rcywKTmljawotLQpJbmRleDogbGludXgtMi42L2luY2x1ZGUvbGlu dXgvbW0uaAo9PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09Ci0tLSBsaW51eC0yLjYub3JpZy9pbmNsdWRlL2xpbnV4L21tLmgJ MjAwOS0wMy0xMyAwMzowMDo1OC4wMDAwMDAwMDAgKzExMDAKKysrIGxpbnV4LTIuNi9pbmNsdWRl L2xpbnV4L21tLmgJMjAwOS0wMy0xMyAwMzowNTowMC4wMDAwMDAwMDAgKzExMDAKQEAgLTEwNCw2 ICsxMDQsNyBAQCBleHRlcm4gdW5zaWduZWQgaW50IGtvYmpzaXplKGNvbnN0IHZvaWQgCiAjZGVm aW5lIFZNX0NBTl9OT05MSU5FQVIgMHgwODAwMDAwMAkvKiBIYXMgLT5mYXVsdCAmIGRvZXMgbm9u bGluZWFyIHBhZ2VzICovCiAjZGVmaW5lIFZNX01JWEVETUFQCTB4MTAwMDAwMDAJLyogQ2FuIGNv bnRhaW4gInN0cnVjdCBwYWdlIiBhbmQgcHVyZSBQRk4gcGFnZXMgKi8KICNkZWZpbmUgVk1fU0FP CQkweDIwMDAwMDAwCS8qIFN0cm9uZyBBY2Nlc3MgT3JkZXJpbmcgKHBvd2VycGMpICovCisjZGVm aW5lIFZNX0RPTlRDT1cJMHg0MDAwMDAwMAkvKiBDb250YWlucyBubyBDT1cgcGFnZXMgKGNvcGll 
cyBvbiBmb3JrKSAqLwogCiAjaWZuZGVmIFZNX1NUQUNLX0RFRkFVTFRfRkxBR1MJCS8qIGFyY2gg Y2FuIG92ZXJyaWRlIHRoaXMgKi8KICNkZWZpbmUgVk1fU1RBQ0tfREVGQVVMVF9GTEFHUyBWTV9E QVRBX0RFRkFVTFRfRkxBR1MKQEAgLTc4OSw3ICs3OTAsNyBAQCBpbnQgd2Fsa19wYWdlX3Jhbmdl KHVuc2lnbmVkIGxvbmcgYWRkciwgCiB2b2lkIGZyZWVfcGdkX3JhbmdlKHN0cnVjdCBtbXVfZ2F0 aGVyICp0bGIsIHVuc2lnbmVkIGxvbmcgYWRkciwKIAkJdW5zaWduZWQgbG9uZyBlbmQsIHVuc2ln bmVkIGxvbmcgZmxvb3IsIHVuc2lnbmVkIGxvbmcgY2VpbGluZyk7CiBpbnQgY29weV9wYWdlX3Jh bmdlKHN0cnVjdCBtbV9zdHJ1Y3QgKmRzdCwgc3RydWN0IG1tX3N0cnVjdCAqc3JjLAotCQkJc3Ry dWN0IHZtX2FyZWFfc3RydWN0ICp2bWEpOworCQlzdHJ1Y3Qgdm1fYXJlYV9zdHJ1Y3QgKmRzdF92 bWEsIHN0cnVjdCB2bV9hcmVhX3N0cnVjdCAqdm1hKTsKIHZvaWQgdW5tYXBfbWFwcGluZ19yYW5n ZShzdHJ1Y3QgYWRkcmVzc19zcGFjZSAqbWFwcGluZywKIAkJbG9mZl90IGNvbnN0IGhvbGViZWdp biwgbG9mZl90IGNvbnN0IGhvbGVsZW4sIGludCBldmVuX2Nvd3MpOwogaW50IGZvbGxvd19waHlz KHN0cnVjdCB2bV9hcmVhX3N0cnVjdCAqdm1hLCB1bnNpZ25lZCBsb25nIGFkZHJlc3MsCkluZGV4 OiBsaW51eC0yLjYvbW0vbWVtb3J5LmMKPT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PQotLS0gbGludXgtMi42Lm9yaWcvbW0v bWVtb3J5LmMJMjAwOS0wMy0xMyAwMzowMDo1OC4wMDAwMDAwMDAgKzExMDAKKysrIGxpbnV4LTIu Ni9tbS9tZW1vcnkuYwkyMDA5LTAzLTEzIDAzOjA3OjUyLjAwMDAwMDAwMCArMTEwMApAQCAtNTgw LDcgKzU4MCw4IEBAIGNvcHlfb25lX3B0ZShzdHJ1Y3QgbW1fc3RydWN0ICpkc3RfbW0sIHMKIAkg KiBpbiB0aGUgcGFyZW50IGFuZCB0aGUgY2hpbGQKIAkgKi8KIAlpZiAoaXNfY293X21hcHBpbmco dm1fZmxhZ3MpKSB7Ci0JCXB0ZXBfc2V0X3dycHJvdGVjdChzcmNfbW0sIGFkZHIsIHNyY19wdGUp OworCQlpZiAobGlrZWx5KCEodm1fZmxhZ3MgJiBWTV9ET05UQ09XKSkpCisJCQlwdGVwX3NldF93 cnByb3RlY3Qoc3JjX21tLCBhZGRyLCBzcmNfcHRlKTsKIAkJcHRlID0gcHRlX3dycHJvdGVjdChw dGUpOwogCX0KIApAQCAtNTk0LDYgKzU5NSw3IEBAIGNvcHlfb25lX3B0ZShzdHJ1Y3QgbW1fc3Ry dWN0ICpkc3RfbW0sIHMKIAogCXBhZ2UgPSB2bV9ub3JtYWxfcGFnZSh2bWEsIGFkZHIsIHB0ZSk7 CiAJaWYgKHBhZ2UpIHsKKwkJVk1fQlVHX09OKFBhZ2VEb250Q09XKHBhZ2UpICYmICEodm1fZmxh Z3MgJiBWTV9ET05UQ09XKSk7CiAJCWdldF9wYWdlKHBhZ2UpOwogCQlwYWdlX2R1cF9ybWFwKHBh 
Z2UsIHZtYSwgYWRkcik7CiAJCXJzc1shIVBhZ2VBbm9uKHBhZ2UpXSsrOwpAQCAtNjk2LDggKzY5 OCwxMCBAQCBzdGF0aWMgaW5saW5lIGludCBjb3B5X3B1ZF9yYW5nZShzdHJ1Y3QgCiAJcmV0dXJu IDA7CiB9CiAKK3N0YXRpYyBpbnQgZGVjb3dfcGFnZV9yYW5nZShzdHJ1Y3QgbW1fc3RydWN0ICpt bSwgc3RydWN0IHZtX2FyZWFfc3RydWN0ICp2bWEpOworCiBpbnQgY29weV9wYWdlX3JhbmdlKHN0 cnVjdCBtbV9zdHJ1Y3QgKmRzdF9tbSwgc3RydWN0IG1tX3N0cnVjdCAqc3JjX21tLAotCQlzdHJ1 Y3Qgdm1fYXJlYV9zdHJ1Y3QgKnZtYSkKKwkJc3RydWN0IHZtX2FyZWFfc3RydWN0ICpkc3Rfdm1h LCBzdHJ1Y3Qgdm1fYXJlYV9zdHJ1Y3QgKnZtYSkKIHsKIAlwZ2RfdCAqc3JjX3BnZCwgKmRzdF9w Z2Q7CiAJdW5zaWduZWQgbG9uZyBuZXh0OwpAQCAtNzU1LDYgKzc1OSwxNSBAQCBpbnQgY29weV9w YWdlX3JhbmdlKHN0cnVjdCBtbV9zdHJ1Y3QgKmRzCiAJaWYgKGlzX2Nvd19tYXBwaW5nKHZtYS0+ dm1fZmxhZ3MpKQogCQltbXVfbm90aWZpZXJfaW52YWxpZGF0ZV9yYW5nZV9lbmQoc3JjX21tLAog CQkJCQkJICB2bWEtPnZtX3N0YXJ0LCBlbmQpOworCisJV0FSTl9PTihyZXQpOworCWlmICh1bmxp a2VseSh2bWEtPnZtX2ZsYWdzICYgVk1fRE9OVENPVykgJiYgIXJldCkgeworCQlpZiAoZGVjb3df cGFnZV9yYW5nZShkc3RfbW0sIGRzdF92bWEpKQorCQkJcmV0ID0gLUVOT01FTTsKKwkJLyogY2hp bGQgZG9lc24ndCByZWFsbHkgbmVlZCBWTV9ET05UQ09XIGFmdGVyIGJlaW5nIGRlLUNPV2VkICov CisJCS8vIGRzdF92bWEtPnZtX2ZsYWdzICY9IH5WTV9ET05UQ09XOworCX0KKwogCXJldHVybiBy ZXQ7CiB9CiAKQEAgLTEyMDAsNiArMTIxMyw3IEBAIHN0YXRpYyBpbmxpbmUgaW50IHVzZV96ZXJv X3BhZ2Uoc3RydWN0IHYKIH0KIAogCitzdGF0aWMgaW50IG1ha2Vfdm1hX25vY293KHN0cnVjdCB2 bV9hcmVhX3N0cnVjdCAqdm1hKTsKIAogaW50IF9fZ2V0X3VzZXJfcGFnZXMoc3RydWN0IHRhc2tf c3RydWN0ICp0c2ssIHN0cnVjdCBtbV9zdHJ1Y3QgKm1tLAogCQkgICAgIHVuc2lnbmVkIGxvbmcg c3RhcnQsIGludCBsZW4sIGludCBmbGFncywKQEAgLTEyNzMsNiArMTI4NywyMyBAQCBpbnQgX19n ZXRfdXNlcl9wYWdlcyhzdHJ1Y3QgdGFza19zdHJ1Y3QgCiAJCSAgICAoIWlnbm9yZSAmJiAhKHZt X2ZsYWdzICYgdm1hLT52bV9mbGFncykpKQogCQkJcmV0dXJuIGkgPyA6IC1FRkFVTFQ7CiAKKwkJ aWYgKCEoZmxhZ3MgJiBHVVBfRkxBR1NfU1RBQ0spICYmCisJCQkJaXNfY293X21hcHBpbmcodm1h LT52bV9mbGFncykgJiYKKwkJCQkhKHZtYS0+dm1fZmxhZ3MgJiBWTV9ET05UQ09XKSkgeworCQkJ dXBfcmVhZCgmbW0tPm1tYXBfc2VtKTsKKwkJCWRvd25fd3JpdGUoJm1tLT5tbWFwX3NlbSk7CisJ 
CQl2bWEgPSBmaW5kX3ZtYShtbSwgc3RhcnQpOworCQkJaWYgKHZtYSAmJiBpc19jb3dfbWFwcGlu Zyh2bWEtPnZtX2ZsYWdzKSAmJgorCQkJCSEodm1hLT52bV9mbGFncyAmIFZNX0RPTlRDT1cpKSB7 CisJCQkJaWYgKG1ha2Vfdm1hX25vY293KHZtYSkpIHsKKwkJCQkJZG93bmdyYWRlX3dyaXRlKCZt bS0+bW1hcF9zZW0pOworCQkJCQlyZXR1cm4gaSA/IDogLUVOT01FTTsKKwkJCQl9CisJCQl9CisJ CQlkb3duZ3JhZGVfd3JpdGUoJm1tLT5tbWFwX3NlbSk7CisJCQljb250aW51ZTsKKwkJfQorCiAJ CWlmIChpc192bV9odWdldGxiX3BhZ2Uodm1hKSkgewogCQkJaSA9IGZvbGxvd19odWdldGxiX3Bh Z2UobW0sIHZtYSwgcGFnZXMsIHZtYXMsCiAJCQkJCQkmc3RhcnQsICZsZW4sIGksIHdyaXRlKTsK QEAgLTE5MTAsNiArMTk0MSw4IEBAIHN0YXRpYyBpbnQgZG9fd3BfcGFnZShzdHJ1Y3QgbW1fc3Ry dWN0ICoKIAkJZ290byBnb3R0ZW47CiAJfQogCisJVk1fQlVHX09OKFBhZ2VEb250Q09XKG9sZF9w YWdlKSk7CisKIAkvKgogCSAqIFRha2Ugb3V0IGFub255bW91cyBwYWdlcyBmaXJzdCwgYW5vbnlt b3VzIHNoYXJlZCB2bWFzIGFyZQogCSAqIG5vdCBkaXJ0eSBhY2NvdW50YWJsZS4KQEAgLTIxMDIs NiArMjEzNSwyMzIgQEAgdW53cml0YWJsZV9wYWdlOgogCXJldHVybiBWTV9GQVVMVF9TSUdCVVM7 CiB9CiAKK3N0YXRpYyBpbnQgZGVjb3dfb25lX3B0ZShzdHJ1Y3QgbW1fc3RydWN0ICptbSwgcHRl X3QgKnB0ZXAsIHBtZF90ICpwbWQsCisJCQlzcGlubG9ja190ICpwdGwsIHN0cnVjdCB2bV9hcmVh X3N0cnVjdCAqdm1hLAorCQkJdW5zaWduZWQgbG9uZyBhZGRyZXNzKQoreworCXB0ZV90IHB0ZSA9 ICpwdGVwOworCXN0cnVjdCBwYWdlICpwYWdlLCAqbmV3X3BhZ2U7CisJaW50IHJldCA9IDA7CisK KwkvKiBwdGUgY29udGFpbnMgcG9zaXRpb24gaW4gc3dhcCBvciBmaWxlLCBzbyBkb24ndCBkbyBh bnl0aGluZyAqLworCWlmICh1bmxpa2VseSghcHRlX3ByZXNlbnQocHRlKSkpCisJCXJldHVybiAw OworCS8qIHB0ZSBpcyB3cml0YWJsZSwgY2FuJ3QgYmUgQ09XICovCisJaWYgKHB0ZV93cml0ZShw dGUpKQorCQlyZXR1cm4gMDsKKworCXBhZ2UgPSB2bV9ub3JtYWxfcGFnZSh2bWEsIGFkZHJlc3Ms IHB0ZSk7CisJaWYgKCFwYWdlKQorCQlyZXR1cm4gMDsKKworCWlmICghUGFnZUFub24ocGFnZSkp CisJCXJldHVybiAwOworCisJcGFnZV9jYWNoZV9nZXQocGFnZSk7CisKKwlwdGVfdW5tYXBfdW5s b2NrKHB0ZSwgcHRsKTsKKworCWlmICh1bmxpa2VseShhbm9uX3ZtYV9wcmVwYXJlKHZtYSkpKQor CQlnb3RvIG9vbTsKKwlWTV9CVUdfT04ocGFnZSA9PSBaRVJPX1BBR0UoMCkpOworCW5ld19wYWdl ID0gYWxsb2NfcGFnZV92bWEoR0ZQX0hJR0hVU0VSX01PVkFCTEUsIHZtYSwgYWRkcmVzcyk7CisJ 
aWYgKCFuZXdfcGFnZSkKKwkJZ290byBvb207CisJLyoKKwkgKiBEb24ndCBsZXQgYW5vdGhlciB0 YXNrLCB3aXRoIHBvc3NpYmx5IHVubG9ja2VkIHZtYSwKKwkgKiBrZWVwIHRoZSBtbG9ja2VkIHBh Z2UuCisJICovCisJaWYgKHZtYS0+dm1fZmxhZ3MgJiBWTV9MT0NLRUQpIHsKKwkJbG9ja19wYWdl KHBhZ2UpOwkvKiBmb3IgTFJVIG1hbmlwdWxhdGlvbiAqLworCQljbGVhcl9wYWdlX21sb2NrKHBh Z2UpOworCQl1bmxvY2tfcGFnZShwYWdlKTsKKwl9CisJY293X3VzZXJfcGFnZShuZXdfcGFnZSwg cGFnZSwgYWRkcmVzcywgdm1hKTsKKwlfX1NldFBhZ2VVcHRvZGF0ZShuZXdfcGFnZSk7CisJX19T ZXRQYWdlRG9udENPVyhuZXdfcGFnZSk7CisKKwlpZiAobWVtX2Nncm91cF9uZXdwYWdlX2NoYXJn ZShuZXdfcGFnZSwgbW0sIEdGUF9LRVJORUwpKQorCQlnb3RvIG9vbV9mcmVlX25ldzsKKworCS8q CisJICogUmUtY2hlY2sgdGhlIHB0ZSAtIHdlIGRyb3BwZWQgdGhlIGxvY2sKKwkgKi8KKwlwdGVw ID0gcHRlX29mZnNldF9tYXBfbG9jayhtbSwgcG1kLCBhZGRyZXNzLCAmcHRsKTsKKwlpZiAobGlr ZWx5KHB0ZV9zYW1lKCpwdGVwLCBwdGUpKSkgeworCQlwdGVfdCBlbnRyeTsKKworCQlmbHVzaF9j YWNoZV9wYWdlKHZtYSwgYWRkcmVzcywgcHRlX3BmbihwdGUpKTsKKwkJZW50cnkgPSBta19wdGUo bmV3X3BhZ2UsIHZtYS0+dm1fcGFnZV9wcm90KTsKKwkJZW50cnkgPSBtYXliZV9ta3dyaXRlKHB0 ZV9ta2RpcnR5KGVudHJ5KSwgdm1hKTsKKwkJLyoKKwkJICogQ2xlYXIgdGhlIHB0ZSBlbnRyeSBh bmQgZmx1c2ggaXQgZmlyc3QsIGJlZm9yZSB1cGRhdGluZyB0aGUKKwkJICogcHRlIHdpdGggdGhl IG5ldyBlbnRyeS4gVGhpcyB3aWxsIGF2b2lkIGEgcmFjZSBjb25kaXRpb24KKwkJICogc2VlbiBp biB0aGUgcHJlc2VuY2Ugb2Ygb25lIHRocmVhZCBkb2luZyBTTUMgYW5kIGFub3RoZXIKKwkJICog dGhyZWFkIGRvaW5nIENPVy4KKwkJICovCisJCXB0ZXBfY2xlYXJfZmx1c2hfbm90aWZ5KHZtYSwg YWRkcmVzcywgcHRlcCk7CisJCXBhZ2VfYWRkX25ld19hbm9uX3JtYXAobmV3X3BhZ2UsIHZtYSwg YWRkcmVzcyk7CisJCXNldF9wdGVfYXQobW0sIGFkZHJlc3MsIHB0ZXAsIGVudHJ5KTsKKworCQkv KiBTZWUgY29tbWVudCBpbiBkb193cF9wYWdlICovCisJCXBhZ2VfcmVtb3ZlX3JtYXAocGFnZSk7 CisJfSBlbHNlIHsKKwkJbWVtX2Nncm91cF91bmNoYXJnZV9wYWdlKG5ld19wYWdlKTsKKwkJcGFn ZV9jYWNoZV9yZWxlYXNlKG5ld19wYWdlKTsKKwkJcmV0ID0gLUVBR0FJTjsKKwl9CisKKwlwYWdl X2NhY2hlX3JlbGVhc2UocGFnZSk7CisKKwlyZXR1cm4gcmV0OworCitvb21fZnJlZV9uZXc6CisJ cGFnZV9jYWNoZV9yZWxlYXNlKG5ld19wYWdlKTsKK29vbToKKwlwYWdlX2NhY2hlX3JlbGVhc2Uo 
cGFnZSk7CisJcmV0dXJuIC1FTk9NRU07Cit9CisKK3N0YXRpYyBpbnQgZGVjb3dfcHRlX3Jhbmdl KHN0cnVjdCBtbV9zdHJ1Y3QgKm1tLAorCQkJcG1kX3QgKnBtZCwgc3RydWN0IHZtX2FyZWFfc3Ry dWN0ICp2bWEsCisJCQl1bnNpZ25lZCBsb25nIGFkZHIsIHVuc2lnbmVkIGxvbmcgZW5kKQorewor CXB0ZV90ICpwdGU7CisJc3BpbmxvY2tfdCAqcHRsOworCWludCBwcm9ncmVzcyA9IDA7CisJaW50 IHJldCA9IDA7CisKK2FnYWluOgorCXB0ZSA9IHB0ZV9vZmZzZXRfbWFwX2xvY2sobW0sIHBtZCwg YWRkciwgJnB0bCk7CisvLwlhcmNoX2VudGVyX2xhenlfbW11X21vZGUoKTsKKworCWRvIHsKKwkJ LyoKKwkJICogV2UgYXJlIGhvbGRpbmcgdHdvIGxvY2tzIGF0IHRoaXMgcG9pbnQgLSBlaXRoZXIg b2YgdGhlbQorCQkgKiBjb3VsZCBnZW5lcmF0ZSBsYXRlbmNpZXMgaW4gYW5vdGhlciB0YXNrIG9u IGFub3RoZXIgQ1BVLgorCQkgKi8KKwkJaWYgKHByb2dyZXNzID49IDMyKSB7CisJCQlwcm9ncmVz cyA9IDA7CisJCQlpZiAobmVlZF9yZXNjaGVkKCkgfHwgc3Bpbl9uZWVkYnJlYWsocHRsKSkKKwkJ CQlicmVhazsKKwkJfQorCQlpZiAocHRlX25vbmUoKnB0ZSkpIHsKKwkJCXByb2dyZXNzKys7CisJ CQljb250aW51ZTsKKwkJfQorCQlyZXQgPSBkZWNvd19vbmVfcHRlKG1tLCBwdGUsIHBtZCwgcHRs LCB2bWEsIGFkZHIpOworCQlpZiAocmV0KSB7CisJCQlpZiAocmV0ID09IC1FQUdBSU4pIHsgLyog cmV0cnkgKi8KKwkJCQlyZXQgPSAwOworCQkJCWJyZWFrOworCQkJfQorCQkJZ290byBvdXQ7CisJ CX0KKwkJcHJvZ3Jlc3MgKz0gODsKKwl9IHdoaWxlIChwdGUrKywgYWRkciArPSBQQUdFX1NJWkUs IGFkZHIgIT0gZW5kKTsKKworLy8JYXJjaF9sZWF2ZV9sYXp5X21tdV9tb2RlKCk7CisJcHRlX3Vu bWFwX3VubG9jayhwdGUgLSAxLCBwdGwpOworCWNvbmRfcmVzY2hlZCgpOworCWlmIChhZGRyICE9 IGVuZCkKKwkJZ290byBhZ2FpbjsKK291dDoKKwlyZXR1cm4gcmV0OworfQorCitzdGF0aWMgaW50 IGRlY293X3BtZF9yYW5nZShzdHJ1Y3QgbW1fc3RydWN0ICptbSwKKwkJCXB1ZF90ICpwdWQsIHN0 cnVjdCB2bV9hcmVhX3N0cnVjdCAqdm1hLAorCQkJdW5zaWduZWQgbG9uZyBhZGRyLCB1bnNpZ25l ZCBsb25nIGVuZCkKK3sKKwlwbWRfdCAqcG1kOworCXVuc2lnbmVkIGxvbmcgbmV4dDsKKworCXBt ZCA9IHBtZF9vZmZzZXQocHVkLCBhZGRyKTsKKwlkbyB7CisJCW5leHQgPSBwbWRfYWRkcl9lbmQo YWRkciwgZW5kKTsKKwkJaWYgKHBtZF9ub25lX29yX2NsZWFyX2JhZChwbWQpKQorCQkJY29udGlu dWU7CisJCWlmIChkZWNvd19wdGVfcmFuZ2UobW0sIHBtZCwgdm1hLCBhZGRyLCBuZXh0KSkKKwkJ CXJldHVybiAtRU5PTUVNOworCX0gd2hpbGUgKHBtZCsrLCBhZGRyID0gbmV4dCwgYWRkciAhPSBl 
bmQpOworCXJldHVybiAwOworfQorCitzdGF0aWMgaW50IGRlY293X3B1ZF9yYW5nZShzdHJ1Y3Qg bW1fc3RydWN0ICptbSwKKwkJCXBnZF90ICpwZ2QsIHN0cnVjdCB2bV9hcmVhX3N0cnVjdCAqdm1h LAorCQkJdW5zaWduZWQgbG9uZyBhZGRyLCB1bnNpZ25lZCBsb25nIGVuZCkKK3sKKwlwdWRfdCAq cHVkOworCXVuc2lnbmVkIGxvbmcgbmV4dDsKKworCXB1ZCA9IHB1ZF9vZmZzZXQocGdkLCBhZGRy KTsKKwlkbyB7CisJCW5leHQgPSBwdWRfYWRkcl9lbmQoYWRkciwgZW5kKTsKKwkJaWYgKHB1ZF9u b25lX29yX2NsZWFyX2JhZChwdWQpKQorCQkJY29udGludWU7CisJCWlmIChkZWNvd19wbWRfcmFu Z2UobW0sIHB1ZCwgdm1hLCBhZGRyLCBuZXh0KSkKKwkJCXJldHVybiAtRU5PTUVNOworCX0gd2hp bGUgKHB1ZCsrLCBhZGRyID0gbmV4dCwgYWRkciAhPSBlbmQpOworCXJldHVybiAwOworfQorCitz dGF0aWMgbm9pbmxpbmUgaW50IGRlY293X3BhZ2VfcmFuZ2Uoc3RydWN0IG1tX3N0cnVjdCAqbW0s IHN0cnVjdCB2bV9hcmVhX3N0cnVjdCAqdm1hKQoreworCXBnZF90ICpwZ2Q7CisJdW5zaWduZWQg bG9uZyBuZXh0OworCXVuc2lnbmVkIGxvbmcgYWRkciA9IHZtYS0+dm1fc3RhcnQ7CisJdW5zaWdu ZWQgbG9uZyBlbmQgPSB2bWEtPnZtX2VuZDsKKwlpbnQgcmV0OworCisJQlVHX09OKCFpc19jb3df bWFwcGluZyh2bWEtPnZtX2ZsYWdzKSk7CisKKy8vCWlmIChpc192bV9odWdldGxiX3BhZ2Uodm1h KSkKKy8vCQlyZXR1cm4gZGVjb3dfaHVnZXRsYl9wYWdlX3JhbmdlKG1tLCB2bWEpOworCisJbW11 X25vdGlmaWVyX2ludmFsaWRhdGVfcmFuZ2Vfc3RhcnQobW0sIGFkZHIsIGVuZCk7CisKKwlyZXQg PSAwOworCXBnZCA9IHBnZF9vZmZzZXQobW0sIGFkZHIpOworCWRvIHsKKwkJbmV4dCA9IHBnZF9h ZGRyX2VuZChhZGRyLCBlbmQpOworCQlpZiAocGdkX25vbmVfb3JfY2xlYXJfYmFkKHBnZCkpCisJ CQljb250aW51ZTsKKwkJaWYgKHVubGlrZWx5KGRlY293X3B1ZF9yYW5nZShtbSwgcGdkLCB2bWEs IGFkZHIsIG5leHQpKSkgeworCQkJcmV0ID0gLUVOT01FTTsKKwkJCWJyZWFrOworCQl9CisJfSB3 aGlsZSAocGdkKyssIGFkZHIgPSBuZXh0LCBhZGRyICE9IGVuZCk7CisKKwltbXVfbm90aWZpZXJf aW52YWxpZGF0ZV9yYW5nZV9lbmQobW0sIHZtYS0+dm1fc3RhcnQsIGVuZCk7CisKKwlyZXR1cm4g cmV0OworfQorCisvKgorICogVHVybnMgdGhlIGFub255bW91cyBWTUEgaW50byBhICJub2NvdyIg dm1hLiBEZS1jb3cgZXhpc3RpbmcgQ09XIHBhZ2VzLgorICogTXVzdCBob2xkIG1tYXBfc2VtIGZv ciB3cml0ZS4KKyAqLworc3RhdGljIGludCBtYWtlX3ZtYV9ub2NvdyhzdHJ1Y3Qgdm1fYXJlYV9z dHJ1Y3QgKnZtYSkKK3sKKwlzdGF0aWMgREVGSU5FX01VVEVYKGxvY2spOworCXN0cnVjdCBtbV9z 
dHJ1Y3QgKm1tID0gdm1hLT52bV9tbTsKKwlpbnQgcmV0OworCisJbXV0ZXhfbG9jaygmbG9jayk7 CisJaWYgKHZtYS0+dm1fZmxhZ3MgJiBWTV9ET05UQ09XKSB7CisJCW11dGV4X3VubG9jaygmbG9j ayk7CisJCXJldHVybiAwOworCX0KKworCXJldCA9IGRlY293X3BhZ2VfcmFuZ2UobW0sIHZtYSk7 CisJaWYgKCFyZXQpCisJCXZtYS0+dm1fZmxhZ3MgfD0gVk1fRE9OVENPVzsKKwltdXRleF91bmxv Y2soJmxvY2spOworCisJcmV0dXJuIHJldDsKK30KKwogLyoKICAqIEhlbHBlciBmdW5jdGlvbnMg Zm9yIHVubWFwX21hcHBpbmdfcmFuZ2UoKS4KICAqCkBAIC0yNDMzLDYgKzI2OTIsOSBAQCBzdGF0 aWMgaW50IGRvX3N3YXBfcGFnZShzdHJ1Y3QgbW1fc3RydWN0CiAJCWNvdW50X3ZtX2V2ZW50KFBH TUFKRkFVTFQpOwogCX0KIAorCWlmICh1bmxpa2VseSh2bWEtPnZtX2ZsYWdzICYgVk1fRE9OVENP VykpCisJCVNldFBhZ2VEb250Q09XKHBhZ2UpOworCiAJbWFya19wYWdlX2FjY2Vzc2VkKHBhZ2Up OwogCiAJbG9ja19wYWdlKHBhZ2UpOwpAQCAtMjUzMCw2ICsyNzkyLDggQEAgc3RhdGljIGludCBk b19hbm9ueW1vdXNfcGFnZShzdHJ1Y3QgbW1fcwogCWlmICghcGFnZSkKIAkJZ290byBvb207CiAJ X19TZXRQYWdlVXB0b2RhdGUocGFnZSk7CisJaWYgKHVubGlrZWx5KHZtYS0+dm1fZmxhZ3MgJiBW TV9ET05UQ09XKSkKKwkJX19TZXRQYWdlRG9udENPVyhwYWdlKTsKIAogCWlmIChtZW1fY2dyb3Vw X25ld3BhZ2VfY2hhcmdlKHBhZ2UsIG1tLCBHRlBfS0VSTkVMKSkKIAkJZ290byBvb21fZnJlZV9w YWdlOwpAQCAtMjYzNiw2ICsyOTAwLDggQEAgc3RhdGljIGludCBfX2RvX2ZhdWx0KHN0cnVjdCBt bV9zdHJ1Y3QgKgogCQkJCWNsZWFyX3BhZ2VfbWxvY2sodm1mLnBhZ2UpOwogCQkJY29weV91c2Vy X2hpZ2hwYWdlKHBhZ2UsIHZtZi5wYWdlLCBhZGRyZXNzLCB2bWEpOwogCQkJX19TZXRQYWdlVXB0 b2RhdGUocGFnZSk7CisJCQlpZiAodW5saWtlbHkodm1hLT52bV9mbGFncyAmIFZNX0RPTlRDT1cp KQorCQkJCV9fU2V0UGFnZURvbnRDT1cocGFnZSk7CiAJCX0gZWxzZSB7CiAJCQkvKgogCQkJICog SWYgdGhlIHBhZ2Ugd2lsbCBiZSBzaGFyZWFibGUsIHNlZSBpZiB0aGUgYmFja2luZwpAQCAtMjkz NSw4ICszMjAxLDkgQEAgaW50IG1ha2VfcGFnZXNfcHJlc2VudCh1bnNpZ25lZCBsb25nIGFkZAog CUJVR19PTihhZGRyID49IGVuZCk7CiAJQlVHX09OKGVuZCA+IHZtYS0+dm1fZW5kKTsKIAlsZW4g PSBESVZfUk9VTkRfVVAoZW5kLCBQQUdFX1NJWkUpIC0gYWRkci9QQUdFX1NJWkU7Ci0JcmV0ID0g Z2V0X3VzZXJfcGFnZXMoY3VycmVudCwgY3VycmVudC0+bW0sIGFkZHIsCi0JCQlsZW4sIHdyaXRl LCAwLCBOVUxMLCBOVUxMKTsKKwlyZXQgPSBfX2dldF91c2VyX3BhZ2VzKGN1cnJlbnQsIGN1cnJl 
bnQtPm1tLCBhZGRyLAorCQkJbGVuLCBHVVBfRkxBR1NfU1RBQ0sgfCAod3JpdGUgPyBHVVBfRkxB R1NfV1JJVEUgOiAwKSwKKwkJCU5VTEwsIE5VTEwpOwogCWlmIChyZXQgPCAwKQogCQlyZXR1cm4g cmV0OwogCXJldHVybiByZXQgPT0gbGVuID8gMCA6IC1FRkFVTFQ7CkBAIC0zMDg1LDggKzMzNTIs OSBAQCBpbnQgYWNjZXNzX3Byb2Nlc3Nfdm0oc3RydWN0IHRhc2tfc3RydWN0CiAJCXZvaWQgKm1h ZGRyOwogCQlzdHJ1Y3QgcGFnZSAqcGFnZSA9IE5VTEw7CiAKLQkJcmV0ID0gZ2V0X3VzZXJfcGFn ZXModHNrLCBtbSwgYWRkciwgMSwKLQkJCQl3cml0ZSwgMSwgJnBhZ2UsICZ2bWEpOworCQlyZXQg PSBfX2dldF91c2VyX3BhZ2VzKHRzaywgbW0sIGFkZHIsIDEsCisJCQkJR1VQX0ZMQUdTX0ZPUkNF IHwgR1VQX0ZMQUdTX1NUQUNLIHwKKwkJCQkod3JpdGUgPyBHVVBfRkxBR1NfV1JJVEUgOiAwKSwg JnBhZ2UsICZ2bWEpOwogCQlpZiAocmV0IDw9IDApIHsKIAkJCS8qCiAJCQkgKiBDaGVjayBpZiB0 aGlzIGlzIGEgVk1fSU8gfCBWTV9QRk5NQVAgVk1BLCB3aGljaApJbmRleDogbGludXgtMi42L2Fy Y2gveDg2L21tL2d1cC5jCj09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT0KLS0tIGxpbnV4LTIuNi5vcmlnL2FyY2gveDg2L21t L2d1cC5jCTIwMDktMDMtMTMgMDM6MDA6NTguMDAwMDAwMDAwICsxMTAwCisrKyBsaW51eC0yLjYv YXJjaC94ODYvbW0vZ3VwLmMJMjAwOS0wMy0xMyAwMzowMTowMy4wMDAwMDAwMDAgKzExMDAKQEAg LTgzLDExICs4MywxNCBAQCBzdGF0aWMgbm9pbmxpbmUgaW50IGd1cF9wdGVfcmFuZ2UocG1kX3Qg CiAJCXN0cnVjdCBwYWdlICpwYWdlOwogCiAJCWlmICgocHRlX2ZsYWdzKHB0ZSkgJiAobWFzayB8 IF9QQUdFX1NQRUNJQUwpKSAhPSBtYXNrKSB7CitmYWlsZWQ6CiAJCQlwdGVfdW5tYXAocHRlcCk7 CiAJCQlyZXR1cm4gMDsKIAkJfQogCQlWTV9CVUdfT04oIXBmbl92YWxpZChwdGVfcGZuKHB0ZSkp KTsKIAkJcGFnZSA9IHB0ZV9wYWdlKHB0ZSk7CisJCWlmICh1bmxpa2VseSghUGFnZURvbnRDT1co cGFnZSkpKQorCQkJZ290byBmYWlsZWQ7CiAJCWdldF9wYWdlKHBhZ2UpOwogCQlwYWdlc1sqbnJd ID0gcGFnZTsKIAkJKCpucikrKzsKSW5kZXg6IGxpbnV4LTIuNi9pbmNsdWRlL2xpbnV4L3BhZ2Ut ZmxhZ3MuaAo9PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09Ci0tLSBsaW51eC0yLjYub3JpZy9pbmNsdWRlL2xpbnV4L3BhZ2Ut ZmxhZ3MuaAkyMDA5LTAzLTEzIDAzOjAwOjU4LjAwMDAwMDAwMCArMTEwMAorKysgbGludXgtMi42 L2luY2x1ZGUvbGludXgvcGFnZS1mbGFncy5oCTIwMDktMDMtMTMgMDM6MDE6MDMuMDAwMDAwMDAw 
ICsxMTAwCkBAIC05NCw2ICs5NCw3IEBAIGVudW0gcGFnZWZsYWdzIHsKIAlQR19yZWNsYWltLAkJ LyogVG8gYmUgcmVjbGFpbWVkIGFzYXAgKi8KIAlQR19idWRkeSwJCS8qIFBhZ2UgaXMgZnJlZSwg b24gYnVkZHkgbGlzdHMgKi8KIAlQR19zd2FwYmFja2VkLAkJLyogUGFnZSBpcyBiYWNrZWQgYnkg UkFNL3N3YXAgKi8KKwlQR19kb250Y293LAkJLyogUGFnZUFub24gcGFnZSBpbiBhIFZNX0RPTlRD T1cgdm1hICovCiAjaWZkZWYgQ09ORklHX1VORVZJQ1RBQkxFX0xSVQogCVBHX3VuZXZpY3RhYmxl LAkJLyogUGFnZSBpcyAidW5ldmljdGFibGUiICAqLwogCVBHX21sb2NrZWQsCQkvKiBQYWdlIGlz IHZtYSBtbG9ja2VkICovCkBAIC0yMDgsNiArMjA5LDggQEAgX19QQUdFRkxBRyhTbHViRGVidWcs IHNsdWJfZGVidWcpCiAgKi8KIFRFU1RQQUdFRkxBRyhXcml0ZWJhY2ssIHdyaXRlYmFjaykgVEVT VFNDRkxBRyhXcml0ZWJhY2ssIHdyaXRlYmFjaykKIF9fUEFHRUZMQUcoQnVkZHksIGJ1ZGR5KQor X19QQUdFRkxBRyhEb250Q09XLCBkb250Y293KQorU0VUUEFHRUZMQUcoRG9udENPVywgZG9udGNv dykKIFBBR0VGTEFHKE1hcHBlZFRvRGlzaywgbWFwcGVkdG9kaXNrKQogCiAvKiBQR19yZWFkYWhl YWQgaXMgb25seSB1c2VkIGZvciBmaWxlIHJlYWRzOyBQR19yZWNsYWltIGlzIG9ubHkgZm9yIHdy aXRlcyAqLwpJbmRleDogbGludXgtMi42L21tL3BhZ2VfYWxsb2MuYwo9PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09Ci0tLSBs aW51eC0yLjYub3JpZy9tbS9wYWdlX2FsbG9jLmMJMjAwOS0wMy0xMyAwMzowMDo1OC4wMDAwMDAw MDAgKzExMDAKKysrIGxpbnV4LTIuNi9tbS9wYWdlX2FsbG9jLmMJMjAwOS0wMy0xMyAwMzowMTow My4wMDAwMDAwMDAgKzExMDAKQEAgLTEwMDAsNiArMTAwMCw3IEBAIHN0YXRpYyB2b2lkIGZyZWVf aG90X2NvbGRfcGFnZShzdHJ1Y3QgcGEKIAlzdHJ1Y3QgcGVyX2NwdV9wYWdlcyAqcGNwOwogCXVu c2lnbmVkIGxvbmcgZmxhZ3M7CiAKKwlfX0NsZWFyUGFnZURvbnRDT1cocGFnZSk7CiAJaWYgKFBh Z2VBbm9uKHBhZ2UpKQogCQlwYWdlLT5tYXBwaW5nID0gTlVMTDsKIAlpZiAoZnJlZV9wYWdlc19j aGVjayhwYWdlKSkKSW5kZXg6IGxpbnV4LTIuNi9rZXJuZWwvZm9yay5jCj09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT0KLS0t IGxpbnV4LTIuNi5vcmlnL2tlcm5lbC9mb3JrLmMJMjAwOS0wMy0xMyAwMzowNDozMy4wMDAwMDAw MDAgKzExMDAKKysrIGxpbnV4LTIuNi9rZXJuZWwvZm9yay5jCTIwMDktMDMtMTMgMDM6MDU6MDAu MDAwMDAwMDAwICsxMTAwCkBAIC0zNTMsNyArMzUzLDcgQEAgc3RhdGljIGludCBkdXBfbW1hcChz 
dHJ1Y3QgbW1fc3RydWN0ICptbQogCQlyYl9wYXJlbnQgPSAmdG1wLT52bV9yYjsKIAogCQltbS0+ bWFwX2NvdW50Kys7Ci0JCXJldHZhbCA9IGNvcHlfcGFnZV9yYW5nZShtbSwgb2xkbW0sIG1wbnQp OworCQlyZXR2YWwgPSBjb3B5X3BhZ2VfcmFuZ2UobW0sIG9sZG1tLCB0bXAsIG1wbnQpOwogCiAJ CWlmICh0bXAtPnZtX29wcyAmJiB0bXAtPnZtX29wcy0+b3BlbikKIAkJCXRtcC0+dm1fb3BzLT5v cGVuKHRtcCk7CkluZGV4OiBsaW51eC0yLjYvZnMvZXhlYy5jCj09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT0KLS0tIGxpbnV4 LTIuNi5vcmlnL2ZzL2V4ZWMuYwkyMDA5LTAzLTEzIDAzOjA0OjMzLjAwMDAwMDAwMCArMTEwMAor KysgbGludXgtMi42L2ZzL2V4ZWMuYwkyMDA5LTAzLTEzIDAzOjA1OjAwLjAwMDAwMDAwMCArMTEw MApAQCAtMTY1LDYgKzE2NSwxMyBAQCBleGl0OgogCiAjaWZkZWYgQ09ORklHX01NVQogCisjZGVm aW5lIEdVUF9GTEFHU19XUklURSAgICAgICAgICAgICAgICAgIDB4MDEKKyNkZWZpbmUgR1VQX0ZM QUdTX1NUQUNLICAgICAgICAgICAgICAgICAgMHgxMAorCitpbnQgX19nZXRfdXNlcl9wYWdlcyhz dHJ1Y3QgdGFza19zdHJ1Y3QgKnRzaywgc3RydWN0IG1tX3N0cnVjdCAqbW0sCisJCSAgICAgdW5z aWduZWQgbG9uZyBzdGFydCwgaW50IGxlbiwgaW50IGZsYWdzLAorCQkgICAgIHN0cnVjdCBwYWdl ICoqcGFnZXMsIHN0cnVjdCB2bV9hcmVhX3N0cnVjdCAqKnZtYXMpOworCiBzdGF0aWMgc3RydWN0 IHBhZ2UgKmdldF9hcmdfcGFnZShzdHJ1Y3QgbGludXhfYmlucHJtICpicHJtLCB1bnNpZ25lZCBs b25nIHBvcywKIAkJaW50IHdyaXRlKQogewpAQCAtMTc4LDggKzE4NSwxMSBAQCBzdGF0aWMgc3Ry dWN0IHBhZ2UgKmdldF9hcmdfcGFnZShzdHJ1Y3QgCiAJCQlyZXR1cm4gTlVMTDsKIAl9CiAjZW5k aWYKLQlyZXQgPSBnZXRfdXNlcl9wYWdlcyhjdXJyZW50LCBicHJtLT5tbSwgcG9zLAotCQkJMSwg d3JpdGUsIDEsICZwYWdlLCBOVUxMKTsKKwlkb3duX3JlYWQoJmJwcm0tPm1tLT5tbWFwX3NlbSk7 CisJcmV0ID0gX19nZXRfdXNlcl9wYWdlcyhjdXJyZW50LCBicHJtLT5tbSwgcG9zLAorCQkJMSwg R1VQX0ZMQUdTX1NUQUNLIHwgKHdyaXRlID8gR1VQX0ZMQUdTX1dSSVRFIDogMCksCisJCQkmcGFn ZSwgTlVMTCk7CisJdXBfcmVhZCgmYnBybS0+bW0tPm1tYXBfc2VtKTsKIAlpZiAocmV0IDw9IDAp CiAJCXJldHVybiBOVUxMOwogCkluZGV4OiBsaW51eC0yLjYvbW0vaW50ZXJuYWwuaAo9PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09Ci0tLSBsaW51eC0yLjYub3JpZy9tbS9pbnRlcm5hbC5oCTIwMDktMDMtMTMgMDM6MDQ6MzMu 
MDAwMDAwMDAwICsxMTAwCisrKyBsaW51eC0yLjYvbW0vaW50ZXJuYWwuaAkyMDA5LTAzLTEzIDAz OjA1OjAwLjAwMDAwMDAwMCArMTEwMApAQCAtMjczLDEwICsyNzMsMTEgQEAgc3RhdGljIGlubGlu ZSB2b2lkIG1taW5pdF92YWxpZGF0ZV9tZW1tbwogfQogI2VuZGlmIC8qIENPTkZJR19TUEFSU0VN RU0gKi8KIAotI2RlZmluZSBHVVBfRkxBR1NfV1JJVEUgICAgICAgICAgICAgICAgICAweDEKLSNk ZWZpbmUgR1VQX0ZMQUdTX0ZPUkNFICAgICAgICAgICAgICAgICAgMHgyCi0jZGVmaW5lIEdVUF9G TEFHU19JR05PUkVfVk1BX1BFUk1JU1NJT05TIDB4NAotI2RlZmluZSBHVVBfRkxBR1NfSUdOT1JF X1NJR0tJTEwgICAgICAgICAweDgKKyNkZWZpbmUgR1VQX0ZMQUdTX1dSSVRFICAgICAgICAgICAg ICAgICAgMHgwMQorI2RlZmluZSBHVVBfRkxBR1NfRk9SQ0UgICAgICAgICAgICAgICAgICAweDAy CisjZGVmaW5lIEdVUF9GTEFHU19JR05PUkVfVk1BX1BFUk1JU1NJT05TIDB4MDQKKyNkZWZpbmUg R1VQX0ZMQUdTX0lHTk9SRV9TSUdLSUxMICAgICAgICAgMHgwOAorI2RlZmluZSBHVVBfRkxBR1Nf U1RBQ0sgICAgICAgICAgICAgICAgICAweDEwCiAKIGludCBfX2dldF91c2VyX3BhZ2VzKHN0cnVj dCB0YXNrX3N0cnVjdCAqdHNrLCBzdHJ1Y3QgbW1fc3RydWN0ICptbSwKIAkJICAgICB1bnNpZ25l ZCBsb25nIHN0YXJ0LCBpbnQgbGVuLCBpbnQgZmxhZ3MsCgAKCgo= -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 103A96B003D for ; Thu, 12 Mar 2009 13:00:37 -0400 (EDT)
Date: Thu, 12 Mar 2009 18:00:11 +0100
From: Andrea Arcangeli
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
Message-ID: <20090312170010.GT27823@random.random>
References: <20090311170611.GA2079@elte.hu> <200903121636.18867.nickpiggin@yahoo.com.au> <200903130323.41193.nickpiggin@yahoo.com.au>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <200903130323.41193.nickpiggin@yahoo.com.au>
Sender: owner-linux-mm@kvack.org
To: Nick Piggin
Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

On Fri, Mar 13, 2009 at 03:23:40AM +1100, Nick Piggin wrote:
> OK, this is as far as I got tonight.
>
> This passes Andrea's dma_thread test case. I haven't started hugepages,
> and it isn't quite right to drop the mmap_sem and retake it for write
> in get_user_pages (firstly, caller might hold mmap_sem for write,
> secondly, it may not be able to tolerate mmap_sem being dropped).

What's the point? I mean this will simply work worse than my patch
because it'll have to don't-cow the whole range regardless of whether
it's pinned or not. Which will slow down fork in the O_DIRECT case even
more, for no good reason.

I thought the complaint here was only a beauty issue of not wanting to
add a function called fork_pre_cow or your equivalent decow_one_pte in
the fork path, not any practical issue with my patch which already
passed all sorts of regression testing and performance evaluations.
Plus you still have a per-page bitflag, and I think you have
implementation issues in the patch (the parent pte can't be left
writeable if you are in a don't-cow vma, or the copy will not be
atomic, and glibc will have no chance to fix its bugs). You're not
removing the fork_pre_cow logic from fork, so I can only see it as a
regression to make the logic less granular in the vma.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 5999A6B003D for ; Thu, 12 Mar 2009 13:20:33 -0400 (EDT)
From: Nick Piggin
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
Date: Fri, 13 Mar 2009 04:20:27 +1100
References: <20090311170611.GA2079@elte.hu> <200903130323.41193.nickpiggin@yahoo.com.au> <20090312170010.GT27823@random.random>
In-Reply-To: <20090312170010.GT27823@random.random>
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200903130420.28772.nickpiggin@yahoo.com.au>
Sender: owner-linux-mm@kvack.org
To: Andrea Arcangeli
Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

On Friday 13 March 2009 04:00:11 Andrea Arcangeli wrote:
> On Fri, Mar 13, 2009 at 03:23:40AM +1100, Nick Piggin wrote:
> > OK, this is as far as I got tonight.
> >
> > This passes Andrea's dma_thread test case. I haven't started hugepages,
> > and it isn't quite right to drop the mmap_sem and retake it for write
> > in get_user_pages (firstly, caller might hold mmap_sem for write,
> > secondly, it may not be able to tolerate mmap_sem being dropped).
>
> What's the point?
Well, the main point is to avoid atomics and barriers and stuff like
that, especially in the fast gup path. It also seems very much smaller
(the vast majority of the change is the addition of the decow function).

> I mean this will simply work worse than my patch
> because it'll have to don't-cow the whole range regardless of whether
> it's pinned or not. Which will slow down fork in the O_DIRECT case even
> more, for no good reason.

Hmm, maybe. It can probably work entirely without the vm_flag and just
use the page flag, however. Yes I think it could, and that might just
avoid the whole problem of modifying vm_flags in gup. I'll have to
consider it more tomorrow.

But this case is just if we want to transparently support this without
being too intrusive. Apps that know and care very much could use
MADV_DONTFORK to avoid the copy completely.

> I thought the complaint here was only a
> beauty issue of not wanting to add a function called fork_pre_cow or
> your equivalent decow_one_pte in the fork path, not any practical
> issue with my patch which already passed all sorts of regression
> testing and performance evaluations.

My complaint is not decow / pre cow (I think I suggested it as the fix
for the problem in the first place). I think the patch is quite complex
and is quite a slowdown for fast gup (especially with hugepages). I'm
just trying to explore a different approach.

> Plus you still have a per-page
> bitflag,

Sure. It's the atomic operations which I want to try to minimise.

> and I think you have implementation issues in the patch (the
> parent pte can't be left writeable if you are in a don't-cow vma, or
> the copy will not be atomic, and glibc will have no chance to fix its
> bugs)

Oh, we need to do that? OK, then just take out that statement, and
change VM_BUG_ON(PageDontCOW()) in do_wp_page to
VM_BUG_ON(PageDontCOW() && !reuse);

> .
> You're not removing the fork_pre_cow logic from fork, so I can
> only see it as a regression to make the logic less granular in the
> vma.

I'll see if it can be made per-page. But I still don't know if it is a
big problem. It's hard to know exactly what crazy things apps require
to be fast.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 924016B0047 for ; Thu, 12 Mar 2009 13:23:40 -0400 (EDT)
From: Nick Piggin
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
Date: Fri, 13 Mar 2009 04:23:35 +1100
References: <20090311170611.GA2079@elte.hu> <20090312170010.GT27823@random.random> <200903130420.28772.nickpiggin@yahoo.com.au>
In-Reply-To: <200903130420.28772.nickpiggin@yahoo.com.au>
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200903130423.36142.nickpiggin@yahoo.com.au>
Sender: owner-linux-mm@kvack.org
To: Andrea Arcangeli
Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

On Friday 13 March 2009 04:20:27 Nick Piggin wrote:
> On Friday 13 March 2009 04:00:11 Andrea Arcangeli wrote:
> > and I think you have implementation issues in the patch (the
> > parent pte can't be left writeable if you are in a don't-cow vma, or
> > the copy will not be atomic, and glibc will have no chance to fix its
> > bugs)
>
> Oh, we need to do that?
> OK, then just take out that statement, and

Should read: "take out that *if* statement" (the one which I put in to
avoid wrprotect in the parent)

> change VM_BUG_ON(PageDontCOW()) in do_wp_page to
> VM_BUG_ON(PageDontCOW() && !reuse);

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id E19366B0047 for ; Thu, 12 Mar 2009 14:06:58 -0400 (EDT)
Date: Thu, 12 Mar 2009 19:06:48 +0100
From: Andrea Arcangeli
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
Message-ID: <20090312180648.GV27823@random.random>
References: <20090311170611.GA2079@elte.hu> <200903130323.41193.nickpiggin@yahoo.com.au> <20090312170010.GT27823@random.random> <200903130420.28772.nickpiggin@yahoo.com.au>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <200903130420.28772.nickpiggin@yahoo.com.au>
Sender: owner-linux-mm@kvack.org
To: Nick Piggin
Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

On Fri, Mar 13, 2009 at 04:20:27AM +1100, Nick Piggin wrote:
> Well, the main point is to avoid atomics and barriers and stuff like
> that, especially in the fast gup path. It also seems very much smaller
> (the vast majority of the change is the addition of the decow function).

Well if you remove the hugetlb part and you remove the passing of the
src/dst vma that is needed anyway to fix PAT bugs, my patch will get
quite a bit smaller too.

Agree about the gup-fast path, but frankly I don't see how you avoid
having to change gup-fast... I wanted to ask about that...

> Hmm, maybe. It can probably work entirely without the vm_flag
> and just use the page flag, however.
> Yes I think it could, and that

Right I only use the page flag, and you seem to have a page flag
PG_dontcow too after all.

> might just avoid the whole problem of modifying vm_flags in gup. I'll
> have to consider it more tomorrow.

Ok.

> But this case is just if we want to transparently support this without
> being too intrusive. Apps that know and care very much could use
> MADV_DONTFORK to avoid the copy completely.

Well those apps aren't the problem.

> My complaint is not decow / pre cow (I think I suggested it as the
> fix for the problem in the first place). I think the patch is quite

I'm sure that's not your complaint, right. I thought it was the primary
complaint in the discussion so far though.

> complex and is quite a slowdown for fast gup (especially with
> hugepages). I'm just trying to explore a different approach.

I think we could benchmark this. Also, once I understand how you avoid
touching the gup-fast fast path without sending a flood of IPIs in
fork, I'll understand better how your patch works.

> Oh, we need to do that? OK, then just take out that statement, and
> change VM_BUG_ON(PageDontCOW()) in do_wp_page to
> VM_BUG_ON(PageDontCOW() && !reuse);

Not sure how do_wp_page is relevant, the problem I pointed out is in
the fork_pre_cow/decow_pte only. If do_wp_page runs it means the page
was already wrprotected in the parent or it couldn't be shared, no
problem in do_wp_page in that respect.

The only thing required is that cow_user_page is copying a page that
can't be modified by the parent thread pool during the copy. So
marking parent pte wrprotected and flushing tlb is required. Then
after the copy like in my fork_pre_cow we set the parent pte writable
again. BTW, I start to think I forgot a tlb flush after setting the
pte writable again, that could generate a minor fault that we can
avoid by flushing the tlb, right?
But this is a minor thing, and it'd
only trigger if parent only reads the parent pte, otherwise the parent
thread will wait for fork on mmap_sem if it did a write, or it won't
have the tlb loaded in the first place if it didn't touch the page
while the pte was temporarily wrprotected.

> I'll see if it can be made per-page. But I still don't know if it
> is a big problem. It's hard to know exactly what crazy things apps
> require to be fast.

The thing is quite simple, if an app has 1G of vma loaded, you'll
allocate 1G of ram for no good reason. It can even OOM, it's not just
a performance issue. Whereas doing it per-page, like I do, won't be
noticeable, as the in-flight I/O will be minor.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id BD7696B003D for ; Thu, 12 Mar 2009 14:59:07 -0400 (EDT)
Date: Thu, 12 Mar 2009 19:58:41 +0100
From: Andrea Arcangeli
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
Message-ID: <20090312185841.GA27823@random.random>
References: <20090311170611.GA2079@elte.hu> <200903130323.41193.nickpiggin@yahoo.com.au> <20090312170010.GT27823@random.random> <200903130420.28772.nickpiggin@yahoo.com.au> <20090312180648.GV27823@random.random>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20090312180648.GV27823@random.random>
Sender: owner-linux-mm@kvack.org
To: Nick Piggin
Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

On Thu, Mar 12, 2009 at 07:06:48PM +0100, Andrea Arcangeli wrote:
> again.
> BTW, I start to think I forgot a tlb flush after setting the
> pte writable again, that could generate a minor fault that we can
> avoid by flushing the tlb, right? But this is a minor thing, and it'd

Ah no, that is already taken care of by the fork flush in the parent
before returning, so no problem (and it would have been a minor thing
anyway).

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id C9C656B003D for ; Fri, 13 Mar 2009 12:09:47 -0400 (EDT)
From: Nick Piggin
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
Date: Sat, 14 Mar 2009 03:09:39 +1100
References: <20090311170611.GA2079@elte.hu> <200903130420.28772.nickpiggin@yahoo.com.au> <20090312180648.GV27823@random.random>
In-Reply-To: <20090312180648.GV27823@random.random>
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200903140309.39777.nickpiggin@yahoo.com.au>
Sender: owner-linux-mm@kvack.org
To: Andrea Arcangeli
Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

On Friday 13 March 2009 05:06:48 Andrea Arcangeli wrote:
> On Fri, Mar 13, 2009 at 04:20:27AM +1100, Nick Piggin wrote:
> > Well, the main point is to avoid atomics and barriers and stuff like
> > that, especially in the fast gup path. It also seems very much smaller
> > (the vast majority of the change is the addition of the decow function).
>
> Well if you remove the hugetlb part and you remove the passing of the
> src/dst vma that is needed anyway to fix PAT bugs, my patch will get
> quite a bit smaller too.

Possibly true.
OK, it wasn't a very good argument to compare my incomplete, RFC patch based on size alone :) > Agree about the gup-fast path, but frankly I miss how you avoid having > to change gup-fast... I wanted to ask about that... It is more straightforward than your version because it does not try to make the page re-cow-able again after the GUP is finished. The main conceptual difference between our fixes I think (ignoring my silly vma-wide decow), is this issue. Of course I could have a race in fast-gup, but I don't think I can see one. I'm working on removing the vma stuff and just making it per-page, which might make it easier to review. > > Oh, we need to do that? OK, then just take out that statement, and > > change VM_BUG_ON(PageDontCOW()) in do_wp_page to > > VM_BUG_ON(PageDontCOW() && !reuse); > > Not sure how do_wp_page is relevant, the problem I pointed out is in > the fork_pre_cow/decow_pte only. If do_wp_page runs it means the page > was already wrprotected in the parent or it couldn't be shared, no > problem in do_wp_page in that respect. Well, it would save having to touch the parent's pagetables after doing the atomic copy-on-fork in the child. Just have the parent do a do_wp_page, which will notice it is the only user of the page and reuse it rather than COW it (now that Hugh has fixed the races in the reuse check that should be fine). > The only thing required is that cow_user_page is copying a page that > can't be modified by the parent thread pool during the copy. So > marking parent pte wrprotected and flushing tlb is required. Then > after the copy like in my fork_pre_cow we set the parent pte writable > again. Yes you could do it this way too, I'm not sure which way is better... I'll have to take another look at it after removing the per-vma code from mine. > > I'll see if it can be made per-page. But I still don't know if it > > is a big problem. It's hard to know exactly what crazy things apps > > require to be fast.
> > The thing is quite simple, if an app has a 1G of vma loaded, you'll > allocate 1G of ram for no good reason. It can even OOM, it's not just > a performance issue. While doing it per-page like I do, won't be > noticeable, as the in-flight I/O will be minor. Yes I agree now it is a silly way to do it. Now I also see that your patch still hasn't covered the other side of the race, whereas my scheme should do. Hmm, I think that if we want to go to the extent of adding all this code in and tell userspace apps they can use zerocopy IO and not care about COW, then we really must cover both sides of the race otherwise it is just asking for data corruption. Conversely, if we leave *any* holes open by design, then we may as well leave *all* holes open and have simpler code -- because apps will have to know about the zerocopy vs COW problem anyway. Don't you agree? Thanks, Nick
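The cost argument in this exchange (a vma-wide decow copies the whole mapping, a per-page decow copies only the pages with I/O in flight) can be sketched in userspace. This is only a toy model under stated assumptions: struct toy_page, its pinned flag, the 4 KiB page size and both function names are hypothetical and are not taken from either patch, and the per-vma trigger condition ("any pinned page forces a full copy") is a guess at the behaviour being criticised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL /* assumption: 4 KiB pages */

/* Hypothetical stand-in for a page; 'pinned' models a page that
 * get_user_pages() has marked as having I/O in flight. */
struct toy_page {
    bool pinned;
};

/* Per-vma policy (assumed): a single pinned page forces fork to
 * copy every page of the mapping eagerly. */
size_t toy_copy_bytes_per_vma(const struct toy_page *pages, size_t n)
{
    assert(pages != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (pages[i].pinned)
            return n * TOY_PAGE_SIZE;
    return 0;
}

/* Per-page policy: fork copies only the pinned pages eagerly;
 * everything else stays shared copy-on-write. */
size_t toy_copy_bytes_per_page(const struct toy_page *pages, size_t n)
{
    size_t pinned = 0;

    assert(pages != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (pages[i].pinned)
            pinned++;
    return pinned * TOY_PAGE_SIZE;
}
```

With a 1G mapping (262144 pages of 4 KiB) and 16 pinned pages, the per-vma function returns 1073741824 bytes and the per-page function returns 65536 bytes, which is the gap the "1G of ram for no good reason" remark points at.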
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 192416B003D for ; Fri, 13 Mar 2009 15:34:36 -0400 (EDT) Date: Fri, 13 Mar 2009 20:34:16 +0100 From: Andrea Arcangeli Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Message-ID: <20090313193416.GG27823@random.random> References: <20090311170611.GA2079@elte.hu> <200903130420.28772.nickpiggin@yahoo.com.au> <20090312180648.GV27823@random.random> <200903140309.39777.nickpiggin@yahoo.com.au> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200903140309.39777.nickpiggin@yahoo.com.au> Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Linus Torvalds , Ingo Molnar , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Sat, Mar 14, 2009 at 03:09:39AM +1100, Nick Piggin wrote: > Of course I could have a race in fast-gup, but I don't think I can see > one. I'm working on removing the vma stuff and just making it per-page, > which might make it easier to review. If you didn't touch gup-fast and you don't send ipis in fork, you most certainly have one, it's the one Linus pointed out and that I've fixed (with Izik, then I sorted out the ordering details and how to make it safe on fork side). > Well, it would save having to touch the parent's pagetables after > doing the atomic copy-on-fork in the child. Just have the parent do > a do_wp_page, which will notice it is the only user of the page and > reuse it rather than COW it (now that Hugh has fixed the races in > the reuse check that should be fine). If we're into the trouble path, it means parent already owns the page. I just leave it owned to the parent, pte remains the same before and after fork. No point in changing the pte value if we're in the troublesome path as far as I can tell.
I only verify that the parent pte didn't go away from under fork when I temporarily release the parent PT lock to allocate the cow page in the slow path (see the -EAGAIN path, I also verified it triggers with swapping and system survives fine ;). > Now I also see that your patch still hasn't covered the other side of > the race, whereas my scheme should do. Hmm, I think that if we want to Sorry, but can you elaborate again what the other side of the race is? If child gets a whole new page, and parent keeps its own page with pte marked read-write the whole time that a page fault can run (page fault takes mmap_sem, all we have to protect against when temporarily releasing parent PT lock is the VM rmap code and that is taken care of by the pte_same path), so I don't see any other side of the race... > go to the extent of adding all this code in and tell userspace apps > they can use zerocopy IO and not care about COW, then we really must > cover both sides of the race otherwise it is just asking for data > corruption. Surely I agree if there's another side of the race left uncovered by my patch we've to address it too if we make any change and we don't consider this a 'feature'! > Conversely, if we leave *any* holes open by design, then we may as well > leave *all* holes open and have simpler code -- because apps will have > to know about the zerocopy vs COW problem anyway. Don't you agree? Indeed ;)
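The "release the PT lock, allocate, retake the lock, check pte_same, else -EAGAIN" pattern described in this message can be sketched as a single-threaded userspace toy. Everything here is an assumption made for illustration: struct toy_slot, toy_precopy(), toy_zap_hook() and the hook mechanism that stands in for concurrent reclaim are hypothetical names, the lock is only a flag, and none of it is code from the patch under discussion.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096 /* assumption: 4 KiB pages */

/* Hypothetical stand-in for one page-table entry plus its lock. */
struct toy_slot {
    unsigned long pte; /* models the pte value */
    int locked;        /* models the PT lock; a flag only, no real locking */
};

/* Called while the lock is dropped; models what other code may do then. */
typedef void (*toy_unlocked_hook)(struct toy_slot *);

/* Example hook: models reclaim unmapping the page in the unlocked window. */
void toy_zap_hook(struct toy_slot *slot)
{
    slot->pte = 0;
}

/*
 * Snapshot the entry under the lock, drop the lock to allocate, retake
 * it and compare against the snapshot (the role pte_same() plays).
 * Returns 0 and stores the copy in *out, -EAGAIN if the entry changed
 * while unlocked (caller retries), or -ENOMEM if allocation failed.
 */
int toy_precopy(struct toy_slot *slot, const void *src,
                toy_unlocked_hook hook, void **out)
{
    unsigned long snap;
    void *copy;

    assert(slot != NULL && src != NULL && out != NULL);

    slot->locked = 1;
    snap = slot->pte;
    slot->locked = 0;           /* drop the lock before allocating */

    copy = malloc(TOY_PAGE_SIZE);
    if (hook)
        hook(slot);             /* unlocked window: entry may change */

    slot->locked = 1;           /* retake the lock */
    if (!copy) {
        slot->locked = 0;
        return -ENOMEM;
    }
    if (slot->pte != snap) {    /* entry went away from under us */
        slot->locked = 0;
        free(copy);
        return -EAGAIN;
    }
    memcpy(copy, src, TOY_PAGE_SIZE); /* copy while the entry is stable */
    slot->locked = 0;
    *out = copy;
    return 0;
}
```

Calling toy_precopy() with a NULL hook succeeds and hands back a copy of the source page; calling it with toy_zap_hook returns -EAGAIN and leaves *out untouched, which is the case a caller would handle by restarting from the lock acquisition.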
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 4F6546B003D for ; Sat, 14 Mar 2009 00:46:44 -0400 (EDT) From: Nick Piggin Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Date: Sat, 14 Mar 2009 15:46:30 +1100 References: <20090311170611.GA2079@elte.hu> <20090312180648.GV27823@random.random> <200903140309.39777.nickpiggin@yahoo.com.au> In-Reply-To: <200903140309.39777.nickpiggin@yahoo.com.au> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: base64 Content-Disposition: inline Message-Id: <200903141546.31139.nickpiggin@yahoo.com.au> Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Linus Torvalds , Ingo Molnar , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: T24gU2F0dXJkYXkgMTQgTWFyY2ggMjAwOSAwMzowOTozOSBOaWNrIFBpZ2dpbiB3cm90ZToKPiBP biBGcmlkYXkgMTMgTWFyY2ggMjAwOSAwNTowNjo0OCBBbmRyZWEgQXJjYW5nZWxpIHdyb3RlOgoK PiA+IFRoZSB0aGluZyBpcyBxdWl0ZSBzaW1wbGUsIGlmIGFuIGFwcCBoYXMgYSAxRyBvZiB2bWEg bG9hZGVkLCB5b3UnbGwKPiA+IGFsbG9jYXRlIDFHIG9mIHJhbSBmb3Igbm8gZ29vZCByZWFzb24u IEl0IGNhbiBldmVuIE9PTSwgaXQncyBub3QganVzdAo+ID4gYSBwZXJmb3JtYW5jZSBpc3N1ZS4g V2hpbGUgZG9pbmcgaXQgcGVyLXBhZ2UgbGlrZSBJIGRvLCB3b24ndCBiZQo+ID4gbm90aWNlYWJs ZSwgYXMgdGhlIGluLWZsaWdodCBJL08gd2lsbCBiZSBtaW5vci4KPgo+IFllcyBJIGFncmVlIG5v dyBpdCBpcyBhIHNpbGx5IHdheSB0byBkbyBpdC4KCkhlcmUgaXMgYW4gdXBkYXRlZCBwYXRjaCB0 aGF0IGp1c3QgZG9lcyBpdCBvbiBhIHBlci1wYWdlIGJhc2lzLgpBY3R1YWxseSBpdCBpcyBzdGls bCBhIGJpdCBzbG9wcHkgYmVjYXVzZSBJIGp1c3QgcmV1c2VkIHNvbWUgY29kZQpmcm9tIG15IGxh c3QgcGF0Y2ggZm9yIHRoZSBkZWNvdyBsb2dpYy4uLiBwb3NzaWJseSBJIGNhbiBqdXN0IHVzZQp0 aGUgc2FtZSBwcmVjb3cgY29kZSB0aGF0IHlvdSBkbyBmb3Igc21hbGwgYW5kIGh1Z2UgcGFnZXMg KGFsdGhvdWdoCkxpbnVzIGRpZG4ndCBsaWtlIGl0IHNvIG11Y2guLi4gaXQgaXMgdmVyeSBoYXJk 
IHRvIGRvIG5pY2VseSByaWdodApkb3duIHRoZXJlIGluIHRoZSBjYWxsIGNoYWluIDooKQoKQW55 d2F5LCBpZ25vcmluZyB0aGUgZGVjb3cgaW1wbGVtZW50YXRpb24gKHRoYXQncyBub3QgcmVhbGx5 IHRoZQppbnRlcmVzdGluZyBwYXJ0IG9mIHRoZSBwYXRjaCksIEkgdGhpbmsgdGhpcyBpcyBsb29r aW5nIHByZXR0eSBnb29kCm5vdy4KLS0tCkluZGV4OiBsaW51eC0yLjYvaW5jbHVkZS9saW51eC9t bS5oCj09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT0KLS0tIGxpbnV4LTIuNi5vcmlnL2luY2x1ZGUvbGludXgvbW0uaAkyMDA5 LTAzLTE0IDAyOjQ4OjA2LjAwMDAwMDAwMCArMTEwMAorKysgbGludXgtMi42L2luY2x1ZGUvbGlu dXgvbW0uaAkyMDA5LTAzLTE0IDE1OjEyOjEzLjAwMDAwMDAwMCArMTEwMApAQCAtNzg5LDcgKzc4 OSw3IEBAIGludCB3YWxrX3BhZ2VfcmFuZ2UodW5zaWduZWQgbG9uZyBhZGRyLCAKIHZvaWQgZnJl ZV9wZ2RfcmFuZ2Uoc3RydWN0IG1tdV9nYXRoZXIgKnRsYiwgdW5zaWduZWQgbG9uZyBhZGRyLAog CQl1bnNpZ25lZCBsb25nIGVuZCwgdW5zaWduZWQgbG9uZyBmbG9vciwgdW5zaWduZWQgbG9uZyBj ZWlsaW5nKTsKIGludCBjb3B5X3BhZ2VfcmFuZ2Uoc3RydWN0IG1tX3N0cnVjdCAqZHN0LCBzdHJ1 Y3QgbW1fc3RydWN0ICpzcmMsCi0JCQlzdHJ1Y3Qgdm1fYXJlYV9zdHJ1Y3QgKnZtYSk7CisJCXN0 cnVjdCB2bV9hcmVhX3N0cnVjdCAqZHN0X3ZtYSwgc3RydWN0IHZtX2FyZWFfc3RydWN0ICp2bWEp Owogdm9pZCB1bm1hcF9tYXBwaW5nX3JhbmdlKHN0cnVjdCBhZGRyZXNzX3NwYWNlICptYXBwaW5n LAogCQlsb2ZmX3QgY29uc3QgaG9sZWJlZ2luLCBsb2ZmX3QgY29uc3QgaG9sZWxlbiwgaW50IGV2 ZW5fY293cyk7CiBpbnQgZm9sbG93X3BoeXMoc3RydWN0IHZtX2FyZWFfc3RydWN0ICp2bWEsIHVu c2lnbmVkIGxvbmcgYWRkcmVzcywKSW5kZXg6IGxpbnV4LTIuNi9tbS9tZW1vcnkuYwo9PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09Ci0tLSBsaW51eC0yLjYub3JpZy9tbS9tZW1vcnkuYwkyMDA5LTAzLTE0IDAyOjQ4OjA2LjAw MDAwMDAwMCArMTEwMAorKysgbGludXgtMi42L21tL21lbW9yeS5jCTIwMDktMDMtMTQgMTU6NDA6 MzcuMDAwMDAwMDAwICsxMTAwCkBAIC01MzMsMTIgKzUzMywyNDggQEAgb3V0OgogfQogCiAvKgor ICogRG8gcHRlX21rd3JpdGUsIGJ1dCBvbmx5IGlmIHRoZSB2bWEgc2F5cyBWTV9XUklURS4gIFdl IGRvIHRoaXMgd2hlbgorICogc2VydmljaW5nIGZhdWx0cyBmb3Igd3JpdGUgYWNjZXNzLiAgSW4g dGhlIG5vcm1hbCBjYXNlLCBkbyBhbHdheXMgd2FudAorICogcHRlX21rd3JpdGUuICBCdXQgZ2V0 
X3VzZXJfcGFnZXMgY2FuIGNhdXNlIHdyaXRlIGZhdWx0cyBmb3IgbWFwcGluZ3MKKyAqIHRoYXQg ZG8gbm90IGhhdmUgd3JpdGluZyBlbmFibGVkLCB3aGVuIHVzZWQgYnkgYWNjZXNzX3Byb2Nlc3Nf dm0uCisgKi8KK3N0YXRpYyBpbmxpbmUgcHRlX3QgbWF5YmVfbWt3cml0ZShwdGVfdCBwdGUsIHN0 cnVjdCB2bV9hcmVhX3N0cnVjdCAqdm1hKQoreworCWlmIChsaWtlbHkodm1hLT52bV9mbGFncyAm IFZNX1dSSVRFKSkKKwkJcHRlID0gcHRlX21rd3JpdGUocHRlKTsKKwlyZXR1cm4gcHRlOworfQor CitzdGF0aWMgaW5saW5lIHZvaWQgY293X3VzZXJfcGFnZShzdHJ1Y3QgcGFnZSAqZHN0LCBzdHJ1 Y3QgcGFnZSAqc3JjLCB1bnNpZ25lZCBsb25nIHZhLCBzdHJ1Y3QgCnZtX2FyZWFfc3RydWN0ICp2 bWEpCit7CisJLyoKKwkgKiBJZiB0aGUgc291cmNlIHBhZ2Ugd2FzIGEgUEZOIG1hcHBpbmcsIHdl IGRvbid0IGhhdmUKKwkgKiBhICJzdHJ1Y3QgcGFnZSIgZm9yIGl0LiBXZSBkbyBhIGJlc3QtZWZm b3J0IGNvcHkgYnkKKwkgKiBqdXN0IGNvcHlpbmcgZnJvbSB0aGUgb3JpZ2luYWwgdXNlciBhZGRy ZXNzLiBJZiB0aGF0CisJICogZmFpbHMsIHdlIGp1c3QgemVyby1maWxsIGl0LiBMaXZlIHdpdGgg aXQuCisJICovCisJaWYgKHVubGlrZWx5KCFzcmMpKSB7CisJCXZvaWQgKmthZGRyID0ga21hcF9h dG9taWMoZHN0LCBLTV9VU0VSMCk7CisJCXZvaWQgX191c2VyICp1YWRkciA9ICh2b2lkIF9fdXNl ciAqKSh2YSAmIFBBR0VfTUFTSyk7CisKKwkJLyoKKwkJICogVGhpcyByZWFsbHkgc2hvdWxkbid0 IGZhaWwsIGJlY2F1c2UgdGhlIHBhZ2UgaXMgdGhlcmUKKwkJICogaW4gdGhlIHBhZ2UgdGFibGVz LiBCdXQgaXQgbWlnaHQganVzdCBiZSB1bnJlYWRhYmxlLAorCQkgKiBpbiB3aGljaCBjYXNlIHdl IGp1c3QgZ2l2ZSB1cCBhbmQgZmlsbCB0aGUgcmVzdWx0IHdpdGgKKwkJICogemVyb2VzLgorCQkg Ki8KKwkJaWYgKF9fY29weV9mcm9tX3VzZXJfaW5hdG9taWMoa2FkZHIsIHVhZGRyLCBQQUdFX1NJ WkUpKQorCQkJbWVtc2V0KGthZGRyLCAwLCBQQUdFX1NJWkUpOworCQlrdW5tYXBfYXRvbWljKGth ZGRyLCBLTV9VU0VSMCk7CisJCWZsdXNoX2RjYWNoZV9wYWdlKGRzdCk7CisJfSBlbHNlCisJCWNv cHlfdXNlcl9oaWdocGFnZShkc3QsIHNyYywgdmEsIHZtYSk7Cit9CisKK3N0YXRpYyBpbnQgZGVj b3dfb25lX3B0ZShzdHJ1Y3QgbW1fc3RydWN0ICptbSwgcHRlX3QgKnB0ZXAsIHBtZF90ICpwbWQs CisJCQlzcGlubG9ja190ICpwdGwsIHN0cnVjdCB2bV9hcmVhX3N0cnVjdCAqdm1hLAorCQkJdW5z aWduZWQgbG9uZyBhZGRyZXNzKQoreworCXB0ZV90IHB0ZSA9ICpwdGVwOworCXN0cnVjdCBwYWdl ICpwYWdlLCAqbmV3X3BhZ2U7CisKKwkvKiBwdGUgY29udGFpbnMgcG9zaXRpb24gaW4gc3dhcCBv 
ciBmaWxlLCBzbyBkb24ndCBkbyBhbnl0aGluZyAqLworCWlmICh1bmxpa2VseSghcHRlX3ByZXNl bnQocHRlKSkpCisJCXJldHVybiAwOworCS8qIHB0ZSBpcyB3cml0YWJsZSwgY2FuJ3QgYmUgQ09X ICovCisJaWYgKHB0ZV93cml0ZShwdGUpKQorCQlyZXR1cm4gMDsKKworCXBhZ2UgPSB2bV9ub3Jt YWxfcGFnZSh2bWEsIGFkZHJlc3MsIHB0ZSk7CisJaWYgKCFwYWdlKQorCQlyZXR1cm4gMDsKKwor CWlmICghUGFnZUFub24ocGFnZSkpCisJCXJldHVybiAwOworCisJV0FSTl9PTighUGFnZURvbnRD T1cocGFnZSkpOworCisJcGFnZV9jYWNoZV9nZXQocGFnZSk7CisKKwlwdGVfdW5tYXBfdW5sb2Nr KHB0ZSwgcHRsKTsKKworCWlmICh1bmxpa2VseShhbm9uX3ZtYV9wcmVwYXJlKHZtYSkpKQorCQln b3RvIG9vbTsKKwlWTV9CVUdfT04ocGFnZSA9PSBaRVJPX1BBR0UoMCkpOworCW5ld19wYWdlID0g YWxsb2NfcGFnZV92bWEoR0ZQX0hJR0hVU0VSX01PVkFCTEUsIHZtYSwgYWRkcmVzcyk7CisJaWYg KCFuZXdfcGFnZSkKKwkJZ290byBvb207CisJLyoKKwkgKiBEb24ndCBsZXQgYW5vdGhlciB0YXNr LCB3aXRoIHBvc3NpYmx5IHVubG9ja2VkIHZtYSwKKwkgKiBrZWVwIHRoZSBtbG9ja2VkIHBhZ2Uu CisJICovCisJaWYgKHZtYS0+dm1fZmxhZ3MgJiBWTV9MT0NLRUQpIHsKKwkJbG9ja19wYWdlKHBh Z2UpOwkvKiBmb3IgTFJVIG1hbmlwdWxhdGlvbiAqLworCQljbGVhcl9wYWdlX21sb2NrKHBhZ2Up OworCQl1bmxvY2tfcGFnZShwYWdlKTsKKwl9CisJY293X3VzZXJfcGFnZShuZXdfcGFnZSwgcGFn ZSwgYWRkcmVzcywgdm1hKTsKKwlfX1NldFBhZ2VVcHRvZGF0ZShuZXdfcGFnZSk7CisKKwlpZiAo bWVtX2Nncm91cF9uZXdwYWdlX2NoYXJnZShuZXdfcGFnZSwgbW0sIEdGUF9LRVJORUwpKQorCQln b3RvIG9vbV9mcmVlX25ldzsKKworCS8qCisJICogUmUtY2hlY2sgdGhlIHB0ZSAtIHdlIGRyb3Bw ZWQgdGhlIGxvY2sKKwkgKi8KKwlwdGVwID0gcHRlX29mZnNldF9tYXBfbG9jayhtbSwgcG1kLCBh ZGRyZXNzLCAmcHRsKTsKKwlCVUdfT04oIXB0ZV9zYW1lKCpwdGVwLCBwdGUpKTsKKwl7CisJCXB0 ZV90IGVudHJ5OworCisJCWZsdXNoX2NhY2hlX3BhZ2Uodm1hLCBhZGRyZXNzLCBwdGVfcGZuKHB0 ZSkpOworCQllbnRyeSA9IG1rX3B0ZShuZXdfcGFnZSwgdm1hLT52bV9wYWdlX3Byb3QpOworCQll bnRyeSA9IG1heWJlX21rd3JpdGUocHRlX21rZGlydHkoZW50cnkpLCB2bWEpOworCQkvKgorCQkg KiBDbGVhciB0aGUgcHRlIGVudHJ5IGFuZCBmbHVzaCBpdCBmaXJzdCwgYmVmb3JlIHVwZGF0aW5n IHRoZQorCQkgKiBwdGUgd2l0aCB0aGUgbmV3IGVudHJ5LiBUaGlzIHdpbGwgYXZvaWQgYSByYWNl IGNvbmRpdGlvbgorCQkgKiBzZWVuIGluIHRoZSBwcmVzZW5jZSBvZiBvbmUgdGhyZWFkIGRvaW5n 
IFNNQyBhbmQgYW5vdGhlcgorCQkgKiB0aHJlYWQgZG9pbmcgQ09XLgorCQkgKi8KKwkJcHRlcF9j bGVhcl9mbHVzaF9ub3RpZnkodm1hLCBhZGRyZXNzLCBwdGVwKTsKKwkJcGFnZV9hZGRfbmV3X2Fu b25fcm1hcChuZXdfcGFnZSwgdm1hLCBhZGRyZXNzKTsKKwkJc2V0X3B0ZV9hdChtbSwgYWRkcmVz cywgcHRlcCwgZW50cnkpOworCisJCS8qIFNlZSBjb21tZW50IGluIGRvX3dwX3BhZ2UgKi8KKwkJ cGFnZV9yZW1vdmVfcm1hcChwYWdlKTsKKwl9CisKKwlwYWdlX2NhY2hlX3JlbGVhc2UocGFnZSk7 CisKKwlyZXR1cm4gMDsKKworb29tX2ZyZWVfbmV3OgorCXBhZ2VfY2FjaGVfcmVsZWFzZShuZXdf cGFnZSk7Citvb206CisJcGFnZV9jYWNoZV9yZWxlYXNlKHBhZ2UpOworCXJldHVybiAtRU5PTUVN OworfQorCitzdGF0aWMgaW50IGRlY293X3B0ZV9yYW5nZShzdHJ1Y3QgbW1fc3RydWN0ICptbSwK KwkJCXBtZF90ICpwbWQsIHN0cnVjdCB2bV9hcmVhX3N0cnVjdCAqdm1hLAorCQkJdW5zaWduZWQg bG9uZyBhZGRyLCB1bnNpZ25lZCBsb25nIGVuZCkKK3sKKwlwdGVfdCAqcHRlOworCXNwaW5sb2Nr X3QgKnB0bDsKKwlpbnQgcHJvZ3Jlc3MgPSAwOworCWludCByZXQgPSAwOworCithZ2FpbjoKKwlw dGUgPSBwdGVfb2Zmc2V0X21hcF9sb2NrKG1tLCBwbWQsIGFkZHIsICZwdGwpOworLy8JYXJjaF9l bnRlcl9sYXp5X21tdV9tb2RlKCk7CisKKwlkbyB7CisJCS8qCisJCSAqIFdlIGFyZSBob2xkaW5n IHR3byBsb2NrcyBhdCB0aGlzIHBvaW50IC0gZWl0aGVyIG9mIHRoZW0KKwkJICogY291bGQgZ2Vu ZXJhdGUgbGF0ZW5jaWVzIGluIGFub3RoZXIgdGFzayBvbiBhbm90aGVyIENQVS4KKwkJICovCisJ CWlmIChwcm9ncmVzcyA+PSAzMikgeworCQkJcHJvZ3Jlc3MgPSAwOworCQkJaWYgKG5lZWRfcmVz Y2hlZCgpIHx8IHNwaW5fbmVlZGJyZWFrKHB0bCkpCisJCQkJYnJlYWs7CisJCX0KKwkJaWYgKHB0 ZV9ub25lKCpwdGUpKSB7CisJCQlwcm9ncmVzcysrOworCQkJY29udGludWU7CisJCX0KKwkJcmV0 ID0gZGVjb3dfb25lX3B0ZShtbSwgcHRlLCBwbWQsIHB0bCwgdm1hLCBhZGRyKTsKKwkJaWYgKHJl dCkgeworCQkJaWYgKHJldCA9PSAtRUFHQUlOKSB7IC8qIHJldHJ5ICovCisJCQkJcmV0ID0gMDsK KwkJCQlicmVhazsKKwkJCX0KKwkJCWdvdG8gb3V0OworCQl9CisJCXByb2dyZXNzICs9IDg7CisJ fSB3aGlsZSAocHRlKyssIGFkZHIgKz0gUEFHRV9TSVpFLCBhZGRyICE9IGVuZCk7CisKKy8vCWFy Y2hfbGVhdmVfbGF6eV9tbXVfbW9kZSgpOworCXB0ZV91bm1hcF91bmxvY2socHRlIC0gMSwgcHRs KTsKKwljb25kX3Jlc2NoZWQoKTsKKwlpZiAoYWRkciAhPSBlbmQpCisJCWdvdG8gYWdhaW47Citv dXQ6CisJcmV0dXJuIHJldDsKK30KKworc3RhdGljIGludCBkZWNvd19wbWRfcmFuZ2Uoc3RydWN0 
IG1tX3N0cnVjdCAqbW0sCisJCQlwdWRfdCAqcHVkLCBzdHJ1Y3Qgdm1fYXJlYV9zdHJ1Y3QgKnZt YSwKKwkJCXVuc2lnbmVkIGxvbmcgYWRkciwgdW5zaWduZWQgbG9uZyBlbmQpCit7CisJcG1kX3Qg KnBtZDsKKwl1bnNpZ25lZCBsb25nIG5leHQ7CisKKwlwbWQgPSBwbWRfb2Zmc2V0KHB1ZCwgYWRk cik7CisJZG8geworCQluZXh0ID0gcG1kX2FkZHJfZW5kKGFkZHIsIGVuZCk7CisJCWlmIChwbWRf bm9uZV9vcl9jbGVhcl9iYWQocG1kKSkKKwkJCWNvbnRpbnVlOworCQlpZiAoZGVjb3dfcHRlX3Jh bmdlKG1tLCBwbWQsIHZtYSwgYWRkciwgbmV4dCkpCisJCQlyZXR1cm4gLUVOT01FTTsKKwl9IHdo aWxlIChwbWQrKywgYWRkciA9IG5leHQsIGFkZHIgIT0gZW5kKTsKKwlyZXR1cm4gMDsKK30KKwor c3RhdGljIGludCBkZWNvd19wdWRfcmFuZ2Uoc3RydWN0IG1tX3N0cnVjdCAqbW0sCisJCQlwZ2Rf dCAqcGdkLCBzdHJ1Y3Qgdm1fYXJlYV9zdHJ1Y3QgKnZtYSwKKwkJCXVuc2lnbmVkIGxvbmcgYWRk ciwgdW5zaWduZWQgbG9uZyBlbmQpCit7CisJcHVkX3QgKnB1ZDsKKwl1bnNpZ25lZCBsb25nIG5l eHQ7CisKKwlwdWQgPSBwdWRfb2Zmc2V0KHBnZCwgYWRkcik7CisJZG8geworCQluZXh0ID0gcHVk X2FkZHJfZW5kKGFkZHIsIGVuZCk7CisJCWlmIChwdWRfbm9uZV9vcl9jbGVhcl9iYWQocHVkKSkK KwkJCWNvbnRpbnVlOworCQlpZiAoZGVjb3dfcG1kX3JhbmdlKG1tLCBwdWQsIHZtYSwgYWRkciwg bmV4dCkpCisJCQlyZXR1cm4gLUVOT01FTTsKKwl9IHdoaWxlIChwdWQrKywgYWRkciA9IG5leHQs IGFkZHIgIT0gZW5kKTsKKwlyZXR1cm4gMDsKK30KKworc3RhdGljIG5vaW5saW5lIGludCBkZWNv d19wYWdlX3JhbmdlKHN0cnVjdCBtbV9zdHJ1Y3QgKm1tLCBzdHJ1Y3Qgdm1fYXJlYV9zdHJ1Y3Qg KnZtYSwgdW5zaWduZWQgCmxvbmcgYWRkciwgdW5zaWduZWQgbG9uZyBlbmQpCit7CisJcGdkX3Qg KnBnZDsKKwl1bnNpZ25lZCBsb25nIG5leHQ7CisJaW50IHJldDsKKworCUJVR19PTighaXNfY293 X21hcHBpbmcodm1hLT52bV9mbGFncykpOworCisvLwlpZiAoaXNfdm1faHVnZXRsYl9wYWdlKHZt YSkpCisvLwkJcmV0dXJuIGRlY293X2h1Z2V0bGJfcGFnZV9yYW5nZShtbSwgdm1hKTsKKworLy8J bW11X25vdGlmaWVyX2ludmFsaWRhdGVfcmFuZ2Vfc3RhcnQobW0sIGFkZHIsIGVuZCk7CisKKwly ZXQgPSAwOworCXBnZCA9IHBnZF9vZmZzZXQobW0sIGFkZHIpOworCWRvIHsKKwkJbmV4dCA9IHBn ZF9hZGRyX2VuZChhZGRyLCBlbmQpOworCQlpZiAocGdkX25vbmVfb3JfY2xlYXJfYmFkKHBnZCkp CisJCQljb250aW51ZTsKKwkJaWYgKHVubGlrZWx5KGRlY293X3B1ZF9yYW5nZShtbSwgcGdkLCB2 bWEsIGFkZHIsIG5leHQpKSkgeworCQkJcmV0ID0gLUVOT01FTTsKKwkJCWJyZWFrOworCQl9CisJ 
fSB3aGlsZSAocGdkKyssIGFkZHIgPSBuZXh0LCBhZGRyICE9IGVuZCk7CisKKy8vCW1tdV9ub3Rp Zmllcl9pbnZhbGlkYXRlX3JhbmdlX2VuZChtbSwgdm1hLT52bV9zdGFydCwgZW5kKTsKKworCXJl dHVybiByZXQ7Cit9CisKKy8qCiAgKiBjb3B5IG9uZSB2bV9hcmVhIGZyb20gb25lIHRhc2sgdG8g dGhlIG90aGVyLiBBc3N1bWVzIHRoZSBwYWdlIHRhYmxlcwogICogYWxyZWFkeSBwcmVzZW50IGlu IHRoZSBuZXcgdGFzayB0byBiZSBjbGVhcmVkIGluIHRoZSB3aG9sZSByYW5nZQogICogY292ZXJl ZCBieSB0aGlzIHZtYS4KICAqLwogCi1zdGF0aWMgaW5saW5lIHZvaWQKK3N0YXRpYyBpbmxpbmUg aW50CiBjb3B5X29uZV9wdGUoc3RydWN0IG1tX3N0cnVjdCAqZHN0X21tLCBzdHJ1Y3QgbW1fc3Ry dWN0ICpzcmNfbW0sCiAJCXB0ZV90ICpkc3RfcHRlLCBwdGVfdCAqc3JjX3B0ZSwgc3RydWN0IHZt X2FyZWFfc3RydWN0ICp2bWEsCiAJCXVuc2lnbmVkIGxvbmcgYWRkciwgaW50ICpyc3MpCkBAIC01 NDYsNiArNzgyLDcgQEAgY29weV9vbmVfcHRlKHN0cnVjdCBtbV9zdHJ1Y3QgKmRzdF9tbSwgcwog CXVuc2lnbmVkIGxvbmcgdm1fZmxhZ3MgPSB2bWEtPnZtX2ZsYWdzOwogCXB0ZV90IHB0ZSA9ICpz cmNfcHRlOwogCXN0cnVjdCBwYWdlICpwYWdlOworCWludCByZXQgPSAwOwogCiAJLyogcHRlIGNv bnRhaW5zIHBvc2l0aW9uIGluIHN3YXAgb3IgZmlsZSwgc28gY29weS4gKi8KIAlpZiAodW5saWtl bHkoIXB0ZV9wcmVzZW50KHB0ZSkpKSB7CkBAIC01OTcsMjAgKzgzNCwyNiBAQCBjb3B5X29uZV9w dGUoc3RydWN0IG1tX3N0cnVjdCAqZHN0X21tLCBzCiAJCWdldF9wYWdlKHBhZ2UpOwogCQlwYWdl X2R1cF9ybWFwKHBhZ2UsIHZtYSwgYWRkcik7CiAJCXJzc1shIVBhZ2VBbm9uKHBhZ2UpXSsrOwor CQlpZiAodW5saWtlbHkoUGFnZURvbnRDT1cocGFnZSkpKQorCQkJcmV0ID0gLUVBR0FJTjsKIAl9 CiAKIG91dF9zZXRfcHRlOgogCXNldF9wdGVfYXQoZHN0X21tLCBhZGRyLCBkc3RfcHRlLCBwdGUp OworCisJcmV0dXJuIHJldDsKIH0KIAogc3RhdGljIGludCBjb3B5X3B0ZV9yYW5nZShzdHJ1Y3Qg bW1fc3RydWN0ICpkc3RfbW0sIHN0cnVjdCBtbV9zdHJ1Y3QgKnNyY19tbSwKLQkJcG1kX3QgKmRz dF9wbWQsIHBtZF90ICpzcmNfcG1kLCBzdHJ1Y3Qgdm1fYXJlYV9zdHJ1Y3QgKnZtYSwKKwkJcG1k X3QgKmRzdF9wbWQsIHBtZF90ICpzcmNfcG1kLAorCQlzdHJ1Y3Qgdm1fYXJlYV9zdHJ1Y3QgKmRz dF92bWEsIHN0cnVjdCB2bV9hcmVhX3N0cnVjdCAqc3JjX3ZtYSwKIAkJdW5zaWduZWQgbG9uZyBh ZGRyLCB1bnNpZ25lZCBsb25nIGVuZCkKIHsKIAlwdGVfdCAqc3JjX3B0ZSwgKmRzdF9wdGU7CiAJ c3BpbmxvY2tfdCAqc3JjX3B0bCwgKmRzdF9wdGw7CiAJaW50IHByb2dyZXNzID0gMDsKIAlpbnQg 
cnNzWzJdOworCWludCByZXQgPSAwOwogCiBhZ2FpbjoKIAlyc3NbMV0gPSByc3NbMF0gPSAwOwpA QCAtNjM3LDcgKzg4MCwxMCBAQCBhZ2FpbjoKIAkJCXByb2dyZXNzKys7CiAJCQljb250aW51ZTsK IAkJfQotCQljb3B5X29uZV9wdGUoZHN0X21tLCBzcmNfbW0sIGRzdF9wdGUsIHNyY19wdGUsIHZt YSwgYWRkciwgcnNzKTsKKwkJcmV0ID0gY29weV9vbmVfcHRlKGRzdF9tbSwgc3JjX21tLCBkc3Rf cHRlLCBzcmNfcHRlLAorCQkJCQkJc3JjX3ZtYSwgYWRkciwgcnNzKTsKKwkJaWYgKHVubGlrZWx5 KHJldCkpCisJCQlnb3RvIGRlY293OwogCQlwcm9ncmVzcyArPSA4OwogCX0gd2hpbGUgKGRzdF9w dGUrKywgc3JjX3B0ZSsrLCBhZGRyICs9IFBBR0VfU0laRSwgYWRkciAhPSBlbmQpOwogCkBAIC02 NTAsMTAgKzg5NiwyNSBAQCBhZ2FpbjoKIAlpZiAoYWRkciAhPSBlbmQpCiAJCWdvdG8gYWdhaW47 CiAJcmV0dXJuIDA7CisKK2RlY293OgorCWFyY2hfbGVhdmVfbGF6eV9tbXVfbW9kZSgpOworCXNw aW5fdW5sb2NrKHNyY19wdGwpOworCXB0ZV91bm1hcF9uZXN0ZWQoc3JjX3B0ZSk7CisJYWRkX21t X3Jzcyhkc3RfbW0sIHJzc1swXSwgcnNzWzFdKTsKKwlwdGVfdW5tYXBfdW5sb2NrKGRzdF9wdGUs IGRzdF9wdGwpOworCWNvbmRfcmVzY2hlZCgpOworCWlmIChkZWNvd19wYWdlX3JhbmdlKGRzdF9t bSwgZHN0X3ZtYSwgYWRkciwgYWRkciArIFBBR0VfU0laRSkpCisJCXJldHVybiAtRU5PTUVNOwor CWFkZHIgKz0gUEFHRV9TSVpFOworCWlmIChhZGRyICE9IGVuZCkKKwkJZ290byBhZ2FpbjsKKwly ZXR1cm4gMDsKIH0KIAogc3RhdGljIGlubGluZSBpbnQgY29weV9wbWRfcmFuZ2Uoc3RydWN0IG1t X3N0cnVjdCAqZHN0X21tLCBzdHJ1Y3QgbW1fc3RydWN0ICpzcmNfbW0sCi0JCXB1ZF90ICpkc3Rf cHVkLCBwdWRfdCAqc3JjX3B1ZCwgc3RydWN0IHZtX2FyZWFfc3RydWN0ICp2bWEsCisJCXB1ZF90 ICpkc3RfcHVkLCBwdWRfdCAqc3JjX3B1ZCwKKwkJc3RydWN0IHZtX2FyZWFfc3RydWN0ICpkc3Rf dm1hLCBzdHJ1Y3Qgdm1fYXJlYV9zdHJ1Y3QgKnNyY192bWEsCiAJCXVuc2lnbmVkIGxvbmcgYWRk ciwgdW5zaWduZWQgbG9uZyBlbmQpCiB7CiAJcG1kX3QgKnNyY19wbWQsICpkc3RfcG1kOwpAQCAt NjY4LDE0ICs5MjksMTUgQEAgc3RhdGljIGlubGluZSBpbnQgY29weV9wbWRfcmFuZ2Uoc3RydWN0 IAogCQlpZiAocG1kX25vbmVfb3JfY2xlYXJfYmFkKHNyY19wbWQpKQogCQkJY29udGludWU7CiAJ CWlmIChjb3B5X3B0ZV9yYW5nZShkc3RfbW0sIHNyY19tbSwgZHN0X3BtZCwgc3JjX3BtZCwKLQkJ CQkJCXZtYSwgYWRkciwgbmV4dCkpCisJCQkJCQlkc3Rfdm1hLCBzcmNfdm1hLCBhZGRyLCBuZXh0 KSkKIAkJCXJldHVybiAtRU5PTUVNOwogCX0gd2hpbGUgKGRzdF9wbWQrKywgc3JjX3BtZCsrLCBh 
ZGRyID0gbmV4dCwgYWRkciAhPSBlbmQpOwogCXJldHVybiAwOwogfQogCiBzdGF0aWMgaW5saW5l IGludCBjb3B5X3B1ZF9yYW5nZShzdHJ1Y3QgbW1fc3RydWN0ICpkc3RfbW0sIHN0cnVjdCBtbV9z dHJ1Y3QgKnNyY19tbSwKLQkJcGdkX3QgKmRzdF9wZ2QsIHBnZF90ICpzcmNfcGdkLCBzdHJ1Y3Qg dm1fYXJlYV9zdHJ1Y3QgKnZtYSwKKwkJcGdkX3QgKmRzdF9wZ2QsIHBnZF90ICpzcmNfcGdkLAor CQlzdHJ1Y3Qgdm1fYXJlYV9zdHJ1Y3QgKmRzdF92bWEsIHN0cnVjdCB2bV9hcmVhX3N0cnVjdCAq c3JjX3ZtYSwKIAkJdW5zaWduZWQgbG9uZyBhZGRyLCB1bnNpZ25lZCBsb25nIGVuZCkKIHsKIAlw dWRfdCAqc3JjX3B1ZCwgKmRzdF9wdWQ7CkBAIC02OTAsMTkgKzk1MiwxOSBAQCBzdGF0aWMgaW5s aW5lIGludCBjb3B5X3B1ZF9yYW5nZShzdHJ1Y3QgCiAJCWlmIChwdWRfbm9uZV9vcl9jbGVhcl9i YWQoc3JjX3B1ZCkpCiAJCQljb250aW51ZTsKIAkJaWYgKGNvcHlfcG1kX3JhbmdlKGRzdF9tbSwg c3JjX21tLCBkc3RfcHVkLCBzcmNfcHVkLAotCQkJCQkJdm1hLCBhZGRyLCBuZXh0KSkKKwkJCQkJ CWRzdF92bWEsIHNyY192bWEsIGFkZHIsIG5leHQpKQogCQkJcmV0dXJuIC1FTk9NRU07CiAJfSB3 aGlsZSAoZHN0X3B1ZCsrLCBzcmNfcHVkKyssIGFkZHIgPSBuZXh0LCBhZGRyICE9IGVuZCk7CiAJ cmV0dXJuIDA7CiB9CiAKIGludCBjb3B5X3BhZ2VfcmFuZ2Uoc3RydWN0IG1tX3N0cnVjdCAqZHN0 X21tLCBzdHJ1Y3QgbW1fc3RydWN0ICpzcmNfbW0sCi0JCXN0cnVjdCB2bV9hcmVhX3N0cnVjdCAq dm1hKQorCQlzdHJ1Y3Qgdm1fYXJlYV9zdHJ1Y3QgKmRzdF92bWEsIHN0cnVjdCB2bV9hcmVhX3N0 cnVjdCAqc3JjX3ZtYSkKIHsKIAlwZ2RfdCAqc3JjX3BnZCwgKmRzdF9wZ2Q7CiAJdW5zaWduZWQg bG9uZyBuZXh0OwotCXVuc2lnbmVkIGxvbmcgYWRkciA9IHZtYS0+dm1fc3RhcnQ7Ci0JdW5zaWdu ZWQgbG9uZyBlbmQgPSB2bWEtPnZtX2VuZDsKKwl1bnNpZ25lZCBsb25nIGFkZHIgPSBzcmNfdm1h LT52bV9zdGFydDsKKwl1bnNpZ25lZCBsb25nIGVuZCA9IHNyY192bWEtPnZtX2VuZDsKIAlpbnQg cmV0OwogCiAJLyoKQEAgLTcxMSwyMCArOTczLDIwIEBAIGludCBjb3B5X3BhZ2VfcmFuZ2Uoc3Ry dWN0IG1tX3N0cnVjdCAqZHMKIAkgKiByZWFkb25seSBtYXBwaW5ncy4gVGhlIHRyYWRlb2ZmIGlz IHRoYXQgY29weV9wYWdlX3JhbmdlIGlzIG1vcmUKIAkgKiBlZmZpY2llbnQgdGhhbiBmYXVsdGlu Zy4KIAkgKi8KLQlpZiAoISh2bWEtPnZtX2ZsYWdzICYgKFZNX0hVR0VUTEJ8Vk1fTk9OTElORUFS fFZNX1BGTk1BUHxWTV9JTlNFUlRQQUdFKSkpIHsKLQkJaWYgKCF2bWEtPmFub25fdm1hKQorCWlm ICghKHNyY192bWEtPnZtX2ZsYWdzICYgKFZNX0hVR0VUTEJ8Vk1fTk9OTElORUFSfFZNX1BGTk1B 
UHxWTV9JTlNFUlRQQUdFKSkpIHsKKwkJaWYgKCFzcmNfdm1hLT5hbm9uX3ZtYSkKIAkJCXJldHVy biAwOwogCX0KIAotCWlmIChpc192bV9odWdldGxiX3BhZ2Uodm1hKSkKLQkJcmV0dXJuIGNvcHlf aHVnZXRsYl9wYWdlX3JhbmdlKGRzdF9tbSwgc3JjX21tLCB2bWEpOworCWlmIChpc192bV9odWdl dGxiX3BhZ2Uoc3JjX3ZtYSkpCisJCXJldHVybiBjb3B5X2h1Z2V0bGJfcGFnZV9yYW5nZShkc3Rf bW0sIHNyY19tbSwgc3JjX3ZtYSk7CiAKLQlpZiAodW5saWtlbHkoaXNfcGZuX21hcHBpbmcodm1h KSkpIHsKKwlpZiAodW5saWtlbHkoaXNfcGZuX21hcHBpbmcoc3JjX3ZtYSkpKSB7CiAJCS8qCiAJ CSAqIFdlIGRvIG5vdCBmcmVlIG9uIGVycm9yIGNhc2VzIGJlbG93IGFzIHJlbW92ZV92bWEKIAkJ ICogZ2V0cyBjYWxsZWQgb24gZXJyb3IgZnJvbSBoaWdoZXIgbGV2ZWwgcm91dGluZQogCQkgKi8K LQkJcmV0ID0gdHJhY2tfcGZuX3ZtYV9jb3B5KHZtYSk7CisJCXJldCA9IHRyYWNrX3Bmbl92bWFf Y29weShzcmNfdm1hKTsKIAkJaWYgKHJldCkKIAkJCXJldHVybiByZXQ7CiAJfQpAQCAtNzM1LDcg Kzk5Nyw3IEBAIGludCBjb3B5X3BhZ2VfcmFuZ2Uoc3RydWN0IG1tX3N0cnVjdCAqZHMKIAkgKiBw YXJlbnQgbW0uIEFuZCBhIHBlcm1pc3Npb24gZG93bmdyYWRlIHdpbGwgb25seSBoYXBwZW4gaWYK IAkgKiBpc19jb3dfbWFwcGluZygpIHJldHVybnMgdHJ1ZS4KIAkgKi8KLQlpZiAoaXNfY293X21h cHBpbmcodm1hLT52bV9mbGFncykpCisJaWYgKGlzX2Nvd19tYXBwaW5nKHNyY192bWEtPnZtX2Zs YWdzKSkKIAkJbW11X25vdGlmaWVyX2ludmFsaWRhdGVfcmFuZ2Vfc3RhcnQoc3JjX21tLCBhZGRy LCBlbmQpOwogCiAJcmV0ID0gMDsKQEAgLTc0NiwxNSArMTAwOCwxNiBAQCBpbnQgY29weV9wYWdl X3JhbmdlKHN0cnVjdCBtbV9zdHJ1Y3QgKmRzCiAJCWlmIChwZ2Rfbm9uZV9vcl9jbGVhcl9iYWQo c3JjX3BnZCkpCiAJCQljb250aW51ZTsKIAkJaWYgKHVubGlrZWx5KGNvcHlfcHVkX3JhbmdlKGRz dF9tbSwgc3JjX21tLCBkc3RfcGdkLCBzcmNfcGdkLAotCQkJCQkgICAgdm1hLCBhZGRyLCBuZXh0 KSkpIHsKKwkJCQkJICAgIGRzdF92bWEsIHNyY192bWEsIGFkZHIsIG5leHQpKSkgewogCQkJcmV0 ID0gLUVOT01FTTsKIAkJCWJyZWFrOwogCQl9CiAJfSB3aGlsZSAoZHN0X3BnZCsrLCBzcmNfcGdk KyssIGFkZHIgPSBuZXh0LCBhZGRyICE9IGVuZCk7CiAKLQlpZiAoaXNfY293X21hcHBpbmcodm1h LT52bV9mbGFncykpCisJaWYgKGlzX2Nvd19tYXBwaW5nKHNyY192bWEtPnZtX2ZsYWdzKSkKIAkJ bW11X25vdGlmaWVyX2ludmFsaWRhdGVfcmFuZ2VfZW5kKHNyY19tbSwKLQkJCQkJCSAgdm1hLT52 bV9zdGFydCwgZW5kKTsKKwkJCQkJCSAgc3JjX3ZtYS0+dm1fc3RhcnQsIGVuZCk7CisKIAlyZXR1 
cm4gcmV0OwogfQogCkBAIC0xMjAwLDcgKzE0NjMsNiBAQCBzdGF0aWMgaW5saW5lIGludCB1c2Vf emVyb19wYWdlKHN0cnVjdCB2CiB9CiAKIAotCiBpbnQgX19nZXRfdXNlcl9wYWdlcyhzdHJ1Y3Qg dGFza19zdHJ1Y3QgKnRzaywgc3RydWN0IG1tX3N0cnVjdCAqbW0sCiAJCSAgICAgdW5zaWduZWQg bG9uZyBzdGFydCwgaW50IGxlbiwgaW50IGZsYWdzLAogCQlzdHJ1Y3QgcGFnZSAqKnBhZ2VzLCBz dHJ1Y3Qgdm1fYXJlYV9zdHJ1Y3QgKip2bWFzKQpAQCAtMTIyNSw2ICsxNDg3LDcgQEAgaW50IF9f Z2V0X3VzZXJfcGFnZXMoc3RydWN0IHRhc2tfc3RydWN0IAogCWRvIHsKIAkJc3RydWN0IHZtX2Fy ZWFfc3RydWN0ICp2bWE7CiAJCXVuc2lnbmVkIGludCBmb2xsX2ZsYWdzOworCQlpbnQgZGVjb3c7 CiAKIAkJdm1hID0gZmluZF9leHRlbmRfdm1hKG1tLCBzdGFydCk7CiAJCWlmICghdm1hICYmIGlu X2dhdGVfYXJlYSh0c2ssIHN0YXJ0KSkgewpAQCAtMTI3OSwxMiArMTU0MiwxNSBAQCBpbnQgX19n ZXRfdXNlcl9wYWdlcyhzdHJ1Y3QgdGFza19zdHJ1Y3QgCiAJCQljb250aW51ZTsKIAkJfQogCisJ CWRlY293ID0gKCEoZmxhZ3MgJiBHVVBfRkxBR1NfU1RBQ0spICYmCisJCQkJCWlzX2Nvd19tYXBw aW5nKHZtYS0+dm1fZmxhZ3MpKTsKIAkJZm9sbF9mbGFncyA9IEZPTExfVE9VQ0g7CiAJCWlmIChw YWdlcykKIAkJCWZvbGxfZmxhZ3MgfD0gRk9MTF9HRVQ7CiAJCWlmICghd3JpdGUgJiYgdXNlX3pl cm9fcGFnZSh2bWEpKQogCQkJZm9sbF9mbGFncyB8PSBGT0xMX0FOT047CiAKKwogCQlkbyB7CiAJ CQlzdHJ1Y3QgcGFnZSAqcGFnZTsKIApAQCAtMTI5OSw3ICsxNTY1LDcgQEAgaW50IF9fZ2V0X3Vz ZXJfcGFnZXMoc3RydWN0IHRhc2tfc3RydWN0IAogCQkJCQlmYXRhbF9zaWduYWxfcGVuZGluZyhj dXJyZW50KSkpCiAJCQkJcmV0dXJuIGkgPyBpIDogLUVSRVNUQVJUU1lTOwogCi0JCQlpZiAod3Jp dGUpCisJCQlpZiAod3JpdGUgfHwgZGVjb3cpCiAJCQkJZm9sbF9mbGFncyB8PSBGT0xMX1dSSVRF OwogCiAJCQljb25kX3Jlc2NoZWQoKTsKQEAgLTEzNDIsNiArMTYwOCw3IEBAIGludCBfX2dldF91 c2VyX3BhZ2VzKHN0cnVjdCB0YXNrX3N0cnVjdCAKIAkJCWlmIChwYWdlcykgewogCQkJCXBhZ2Vz W2ldID0gcGFnZTsKIAorCQkJCVNldFBhZ2VEb250Q09XKHBhZ2UpOwogCQkJCWZsdXNoX2Fub25f cGFnZSh2bWEsIHBhZ2UsIHN0YXJ0KTsKIAkJCQlmbHVzaF9kY2FjaGVfcGFnZShwYWdlKTsKIAkJ CX0KQEAgLTE4MjksNDUgKzIwOTYsNiBAQCBzdGF0aWMgaW5saW5lIGludCBwdGVfdW5tYXBfc2Ft ZShzdHJ1Y3QgCiB9CiAKIC8qCi0gKiBEbyBwdGVfbWt3cml0ZSwgYnV0IG9ubHkgaWYgdGhlIHZt YSBzYXlzIFZNX1dSSVRFLiAgV2UgZG8gdGhpcyB3aGVuCi0gKiBzZXJ2aWNpbmcgZmF1bHRzIGZv 
ciB3cml0ZSBhY2Nlc3MuICBJbiB0aGUgbm9ybWFsIGNhc2UsIGRvIGFsd2F5cyB3YW50Ci0gKiBw dGVfbWt3cml0ZS4gIEJ1dCBnZXRfdXNlcl9wYWdlcyBjYW4gY2F1c2Ugd3JpdGUgZmF1bHRzIGZv ciBtYXBwaW5ncwotICogdGhhdCBkbyBub3QgaGF2ZSB3cml0aW5nIGVuYWJsZWQsIHdoZW4gdXNl ZCBieSBhY2Nlc3NfcHJvY2Vzc192bS4KLSAqLwotc3RhdGljIGlubGluZSBwdGVfdCBtYXliZV9t a3dyaXRlKHB0ZV90IHB0ZSwgc3RydWN0IHZtX2FyZWFfc3RydWN0ICp2bWEpCi17Ci0JaWYgKGxp a2VseSh2bWEtPnZtX2ZsYWdzICYgVk1fV1JJVEUpKQotCQlwdGUgPSBwdGVfbWt3cml0ZShwdGUp OwotCXJldHVybiBwdGU7Ci19Ci0KLXN0YXRpYyBpbmxpbmUgdm9pZCBjb3dfdXNlcl9wYWdlKHN0 cnVjdCBwYWdlICpkc3QsIHN0cnVjdCBwYWdlICpzcmMsIHVuc2lnbmVkIGxvbmcgdmEsIHN0cnVj dCAKdm1fYXJlYV9zdHJ1Y3QgKnZtYSkKLXsKLQkvKgotCSAqIElmIHRoZSBzb3VyY2UgcGFnZSB3 YXMgYSBQRk4gbWFwcGluZywgd2UgZG9uJ3QgaGF2ZQotCSAqIGEgInN0cnVjdCBwYWdlIiBmb3Ig aXQuIFdlIGRvIGEgYmVzdC1lZmZvcnQgY29weSBieQotCSAqIGp1c3QgY29weWluZyBmcm9tIHRo ZSBvcmlnaW5hbCB1c2VyIGFkZHJlc3MuIElmIHRoYXQKLQkgKiBmYWlscywgd2UganVzdCB6ZXJv LWZpbGwgaXQuIExpdmUgd2l0aCBpdC4KLQkgKi8KLQlpZiAodW5saWtlbHkoIXNyYykpIHsKLQkJ dm9pZCAqa2FkZHIgPSBrbWFwX2F0b21pYyhkc3QsIEtNX1VTRVIwKTsKLQkJdm9pZCBfX3VzZXIg KnVhZGRyID0gKHZvaWQgX191c2VyICopKHZhICYgUEFHRV9NQVNLKTsKLQotCQkvKgotCQkgKiBU aGlzIHJlYWxseSBzaG91bGRuJ3QgZmFpbCwgYmVjYXVzZSB0aGUgcGFnZSBpcyB0aGVyZQotCQkg KiBpbiB0aGUgcGFnZSB0YWJsZXMuIEJ1dCBpdCBtaWdodCBqdXN0IGJlIHVucmVhZGFibGUsCi0J CSAqIGluIHdoaWNoIGNhc2Ugd2UganVzdCBnaXZlIHVwIGFuZCBmaWxsIHRoZSByZXN1bHQgd2l0 aAotCQkgKiB6ZXJvZXMuCi0JCSAqLwotCQlpZiAoX19jb3B5X2Zyb21fdXNlcl9pbmF0b21pYyhr YWRkciwgdWFkZHIsIFBBR0VfU0laRSkpCi0JCQltZW1zZXQoa2FkZHIsIDAsIFBBR0VfU0laRSk7 Ci0JCWt1bm1hcF9hdG9taWMoa2FkZHIsIEtNX1VTRVIwKTsKLQkJZmx1c2hfZGNhY2hlX3BhZ2Uo ZHN0KTsKLQl9IGVsc2UKLQkJY29weV91c2VyX2hpZ2hwYWdlKGRzdCwgc3JjLCB2YSwgdm1hKTsK LX0KLQotLyoKICAqIFRoaXMgcm91dGluZSBoYW5kbGVzIHByZXNlbnQgcGFnZXMsIHdoZW4gdXNl cnMgdHJ5IHRvIHdyaXRlCiAgKiB0byBhIHNoYXJlZCBwYWdlLiBJdCBpcyBkb25lIGJ5IGNvcHlp bmcgdGhlIHBhZ2UgdG8gYSBuZXcgYWRkcmVzcwogICogYW5kIGRlY3JlbWVudGluZyB0aGUgc2hh 
cmVkLXBhZ2UgY291bnRlciBmb3IgdGhlIG9sZCBwYWdlLgpAQCAtMTkzMCw2ICsyMTU4LDggQEAg c3RhdGljIGludCBkb193cF9wYWdlKHN0cnVjdCBtbV9zdHJ1Y3QgKgogCQl9CiAJCXJldXNlID0g cmV1c2Vfc3dhcF9wYWdlKG9sZF9wYWdlKTsKIAkJdW5sb2NrX3BhZ2Uob2xkX3BhZ2UpOworCQlW TV9CVUdfT04oUGFnZURvbnRDT1cob2xkX3BhZ2UpICYmICFyZXVzZSk7CisKIAl9IGVsc2UgaWYg KHVubGlrZWx5KCh2bWEtPnZtX2ZsYWdzICYgKFZNX1dSSVRFfFZNX1NIQVJFRCkpID09CiAJCQkJ CShWTV9XUklURXxWTV9TSEFSRUQpKSkgewogCQkvKgpAQCAtMjkzNSw4ICszMTY1LDkgQEAgaW50 IG1ha2VfcGFnZXNfcHJlc2VudCh1bnNpZ25lZCBsb25nIGFkZAogCUJVR19PTihhZGRyID49IGVu ZCk7CiAJQlVHX09OKGVuZCA+IHZtYS0+dm1fZW5kKTsKIAlsZW4gPSBESVZfUk9VTkRfVVAoZW5k LCBQQUdFX1NJWkUpIC0gYWRkci9QQUdFX1NJWkU7Ci0JcmV0ID0gZ2V0X3VzZXJfcGFnZXMoY3Vy cmVudCwgY3VycmVudC0+bW0sIGFkZHIsCi0JCQlsZW4sIHdyaXRlLCAwLCBOVUxMLCBOVUxMKTsK KwlyZXQgPSBfX2dldF91c2VyX3BhZ2VzKGN1cnJlbnQsIGN1cnJlbnQtPm1tLCBhZGRyLAorCQkJ bGVuLCBHVVBfRkxBR1NfU1RBQ0sgfCAod3JpdGUgPyBHVVBfRkxBR1NfV1JJVEUgOiAwKSwKKwkJ CU5VTEwsIE5VTEwpOwogCWlmIChyZXQgPCAwKQogCQlyZXR1cm4gcmV0OwogCXJldHVybiByZXQg PT0gbGVuID8gMCA6IC1FRkFVTFQ7CkBAIC0zMDg1LDggKzMzMTYsOSBAQCBpbnQgYWNjZXNzX3By b2Nlc3Nfdm0oc3RydWN0IHRhc2tfc3RydWN0CiAJCXZvaWQgKm1hZGRyOwogCQlzdHJ1Y3QgcGFn ZSAqcGFnZSA9IE5VTEw7CiAKLQkJcmV0ID0gZ2V0X3VzZXJfcGFnZXModHNrLCBtbSwgYWRkciwg MSwKLQkJCQl3cml0ZSwgMSwgJnBhZ2UsICZ2bWEpOworCQlyZXQgPSBfX2dldF91c2VyX3BhZ2Vz KHRzaywgbW0sIGFkZHIsIDEsCisJCQkJR1VQX0ZMQUdTX0ZPUkNFIHwgR1VQX0ZMQUdTX1NUQUNL IHwKKwkJCQkod3JpdGUgPyBHVVBfRkxBR1NfV1JJVEUgOiAwKSwgJnBhZ2UsICZ2bWEpOwogCQlp ZiAocmV0IDw9IDApIHsKIAkJCS8qCiAJCQkgKiBDaGVjayBpZiB0aGlzIGlzIGEgVk1fSU8gfCBW TV9QRk5NQVAgVk1BLCB3aGljaApJbmRleDogbGludXgtMi42L2FyY2gveDg2L21tL2d1cC5jCj09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT0KLS0tIGxpbnV4LTIuNi5vcmlnL2FyY2gveDg2L21tL2d1cC5jCTIwMDktMDMtMTQg MDI6NDg6MDYuMDAwMDAwMDAwICsxMTAwCisrKyBsaW51eC0yLjYvYXJjaC94ODYvbW0vZ3VwLmMJ MjAwOS0wMy0xNCAwMjo0ODoxMi4wMDAwMDAwMDAgKzExMDAKQEAgLTgzLDExICs4MywxNCBAQCBz 
dGF0aWMgbm9pbmxpbmUgaW50IGd1cF9wdGVfcmFuZ2UocG1kX3QgCiAJCXN0cnVjdCBwYWdlICpw YWdlOwogCiAJCWlmICgocHRlX2ZsYWdzKHB0ZSkgJiAobWFzayB8IF9QQUdFX1NQRUNJQUwpKSAh PSBtYXNrKSB7CitmYWlsZWQ6CiAJCQlwdGVfdW5tYXAocHRlcCk7CiAJCQlyZXR1cm4gMDsKIAkJ fQogCQlWTV9CVUdfT04oIXBmbl92YWxpZChwdGVfcGZuKHB0ZSkpKTsKIAkJcGFnZSA9IHB0ZV9w YWdlKHB0ZSk7CisJCWlmICh1bmxpa2VseSghUGFnZURvbnRDT1cocGFnZSkpKQorCQkJZ290byBm YWlsZWQ7CiAJCWdldF9wYWdlKHBhZ2UpOwogCQlwYWdlc1sqbnJdID0gcGFnZTsKIAkJKCpucikr KzsKSW5kZXg6IGxpbnV4LTIuNi9pbmNsdWRlL2xpbnV4L3BhZ2UtZmxhZ3MuaAo9PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 Ci0tLSBsaW51eC0yLjYub3JpZy9pbmNsdWRlL2xpbnV4L3BhZ2UtZmxhZ3MuaAkyMDA5LTAzLTE0 IDAyOjQ4OjA2LjAwMDAwMDAwMCArMTEwMAorKysgbGludXgtMi42L2luY2x1ZGUvbGludXgvcGFn ZS1mbGFncy5oCTIwMDktMDMtMTQgMDI6NDg6MTMuMDAwMDAwMDAwICsxMTAwCkBAIC05NCw2ICs5 NCw3IEBAIGVudW0gcGFnZWZsYWdzIHsKIAlQR19yZWNsYWltLAkJLyogVG8gYmUgcmVjbGFpbWVk IGFzYXAgKi8KIAlQR19idWRkeSwJCS8qIFBhZ2UgaXMgZnJlZSwgb24gYnVkZHkgbGlzdHMgKi8K IAlQR19zd2FwYmFja2VkLAkJLyogUGFnZSBpcyBiYWNrZWQgYnkgUkFNL3N3YXAgKi8KKwlQR19k b250Y293LAkJLyogRG9udCBDT1cgUGFnZUFub24gcGFnZSAqLwogI2lmZGVmIENPTkZJR19VTkVW SUNUQUJMRV9MUlUKIAlQR191bmV2aWN0YWJsZSwJCS8qIFBhZ2UgaXMgInVuZXZpY3RhYmxlIiAg Ki8KIAlQR19tbG9ja2VkLAkJLyogUGFnZSBpcyB2bWEgbWxvY2tlZCAqLwpAQCAtMjA4LDYgKzIw OSw4IEBAIF9fUEFHRUZMQUcoU2x1YkRlYnVnLCBzbHViX2RlYnVnKQogICovCiBURVNUUEFHRUZM QUcoV3JpdGViYWNrLCB3cml0ZWJhY2spIFRFU1RTQ0ZMQUcoV3JpdGViYWNrLCB3cml0ZWJhY2sp CiBfX1BBR0VGTEFHKEJ1ZGR5LCBidWRkeSkKK19fUEFHRUZMQUcoRG9udENPVywgZG9udGNvdykK K1NFVFBBR0VGTEFHKERvbnRDT1csIGRvbnRjb3cpCiBQQUdFRkxBRyhNYXBwZWRUb0Rpc2ssIG1h cHBlZHRvZGlzaykKIAogLyogUEdfcmVhZGFoZWFkIGlzIG9ubHkgdXNlZCBmb3IgZmlsZSByZWFk czsgUEdfcmVjbGFpbSBpcyBvbmx5IGZvciB3cml0ZXMgKi8KSW5kZXg6IGxpbnV4LTIuNi9tbS9w YWdlX2FsbG9jLmMKPT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PQotLS0gbGludXgtMi42Lm9yaWcvbW0vcGFnZV9hbGxvYy5j 
CTIwMDktMDMtMTMgMjA6MjU6MDIuMDAwMDAwMDAwICsxMTAwCisrKyBsaW51eC0yLjYvbW0vcGFn ZV9hbGxvYy5jCTIwMDktMDMtMTQgMDI6NDg6MTMuMDAwMDAwMDAwICsxMTAwCkBAIC0xMDAwLDYg KzEwMDAsNyBAQCBzdGF0aWMgdm9pZCBmcmVlX2hvdF9jb2xkX3BhZ2Uoc3RydWN0IHBhCiAJc3Ry dWN0IHBlcl9jcHVfcGFnZXMgKnBjcDsKIAl1bnNpZ25lZCBsb25nIGZsYWdzOwogCisJX19DbGVh clBhZ2VEb250Q09XKHBhZ2UpOwogCWlmIChQYWdlQW5vbihwYWdlKSkKIAkJcGFnZS0+bWFwcGlu ZyA9IE5VTEw7CiAJaWYgKGZyZWVfcGFnZXNfY2hlY2socGFnZSkpCkluZGV4OiBsaW51eC0yLjYv a2VybmVsL2ZvcmsuYwo9PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09Ci0tLSBsaW51eC0yLjYub3JpZy9rZXJuZWwvZm9yay5j CTIwMDktMDMtMTQgMDI6NDg6MDYuMDAwMDAwMDAwICsxMTAwCisrKyBsaW51eC0yLjYva2VybmVs L2ZvcmsuYwkyMDA5LTAzLTE0IDE1OjEyOjA5LjAwMDAwMDAwMCArMTEwMApAQCAtMzUzLDcgKzM1 Myw3IEBAIHN0YXRpYyBpbnQgZHVwX21tYXAoc3RydWN0IG1tX3N0cnVjdCAqbW0KIAkJcmJfcGFy ZW50ID0gJnRtcC0+dm1fcmI7CiAKIAkJbW0tPm1hcF9jb3VudCsrOwotCQlyZXR2YWwgPSBjb3B5 X3BhZ2VfcmFuZ2UobW0sIG9sZG1tLCBtcG50KTsKKwkJcmV0dmFsID0gY29weV9wYWdlX3Jhbmdl KG1tLCBvbGRtbSwgdG1wLCBtcG50KTsKIAogCQlpZiAodG1wLT52bV9vcHMgJiYgdG1wLT52bV9v cHMtPm9wZW4pCiAJCQl0bXAtPnZtX29wcy0+b3Blbih0bXApOwpJbmRleDogbGludXgtMi42L2Zz L2V4ZWMuYwo9PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09Ci0tLSBsaW51eC0yLjYub3JpZy9mcy9leGVjLmMJMjAwOS0wMy0x MyAyMDoyNTowMC4wMDAwMDAwMDAgKzExMDAKKysrIGxpbnV4LTIuNi9mcy9leGVjLmMJMjAwOS0w My0xNCAwMjo0ODoxNC4wMDAwMDAwMDAgKzExMDAKQEAgLTE2NSw2ICsxNjUsMTMgQEAgZXhpdDoK IAogI2lmZGVmIENPTkZJR19NTVUKIAorI2RlZmluZSBHVVBfRkxBR1NfV1JJVEUgICAgICAgICAg ICAgICAgICAweDAxCisjZGVmaW5lIEdVUF9GTEFHU19TVEFDSyAgICAgICAgICAgICAgICAgIDB4 MTAKKworaW50IF9fZ2V0X3VzZXJfcGFnZXMoc3RydWN0IHRhc2tfc3RydWN0ICp0c2ssIHN0cnVj dCBtbV9zdHJ1Y3QgKm1tLAorCQkgICAgIHVuc2lnbmVkIGxvbmcgc3RhcnQsIGludCBsZW4sIGlu dCBmbGFncywKKwkJICAgICBzdHJ1Y3QgcGFnZSAqKnBhZ2VzLCBzdHJ1Y3Qgdm1fYXJlYV9zdHJ1 Y3QgKip2bWFzKTsKKwogc3RhdGljIHN0cnVjdCBwYWdlICpnZXRfYXJnX3BhZ2Uoc3RydWN0IGxp 
bnV4X2JpbnBybSAqYnBybSwgdW5zaWduZWQgbG9uZyBwb3MsCiAJCWludCB3cml0ZSkKIHsKQEAg LTE3OCw4ICsxODUsMTEgQEAgc3RhdGljIHN0cnVjdCBwYWdlICpnZXRfYXJnX3BhZ2Uoc3RydWN0 IAogCQkJcmV0dXJuIE5VTEw7CiAJfQogI2VuZGlmCi0JcmV0ID0gZ2V0X3VzZXJfcGFnZXMoY3Vy cmVudCwgYnBybS0+bW0sIHBvcywKLQkJCTEsIHdyaXRlLCAxLCAmcGFnZSwgTlVMTCk7CisJZG93 bl9yZWFkKCZicHJtLT5tbS0+bW1hcF9zZW0pOworCXJldCA9IF9fZ2V0X3VzZXJfcGFnZXMoY3Vy cmVudCwgYnBybS0+bW0sIHBvcywKKwkJCTEsIEdVUF9GTEFHU19TVEFDSyB8ICh3cml0ZSA/IEdV UF9GTEFHU19XUklURSA6IDApLAorCQkJJnBhZ2UsIE5VTEwpOworCXVwX3JlYWQoJmJwcm0tPm1t LT5tbWFwX3NlbSk7CiAJaWYgKHJldCA8PSAwKQogCQlyZXR1cm4gTlVMTDsKIApJbmRleDogbGlu dXgtMi42L21tL2ludGVybmFsLmgKPT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PQotLS0gbGludXgtMi42Lm9yaWcvbW0vaW50 ZXJuYWwuaAkyMDA5LTAzLTEzIDIwOjI1OjAwLjAwMDAwMDAwMCArMTEwMAorKysgbGludXgtMi42 L21tL2ludGVybmFsLmgJMjAwOS0wMy0xNCAwMjo0ODoxNC4wMDAwMDAwMDAgKzExMDAKQEAgLTI3 MywxMCArMjczLDExIEBAIHN0YXRpYyBpbmxpbmUgdm9pZCBtbWluaXRfdmFsaWRhdGVfbWVtbW8K IH0KICNlbmRpZiAvKiBDT05GSUdfU1BBUlNFTUVNICovCiAKLSNkZWZpbmUgR1VQX0ZMQUdTX1dS SVRFICAgICAgICAgICAgICAgICAgMHgxCi0jZGVmaW5lIEdVUF9GTEFHU19GT1JDRSAgICAgICAg ICAgICAgICAgIDB4MgotI2RlZmluZSBHVVBfRkxBR1NfSUdOT1JFX1ZNQV9QRVJNSVNTSU9OUyAw eDQKLSNkZWZpbmUgR1VQX0ZMQUdTX0lHTk9SRV9TSUdLSUxMICAgICAgICAgMHg4CisjZGVmaW5l IEdVUF9GTEFHU19XUklURSAgICAgICAgICAgICAgICAgIDB4MDEKKyNkZWZpbmUgR1VQX0ZMQUdT X0ZPUkNFICAgICAgICAgICAgICAgICAgMHgwMgorI2RlZmluZSBHVVBfRkxBR1NfSUdOT1JFX1ZN QV9QRVJNSVNTSU9OUyAweDA0CisjZGVmaW5lIEdVUF9GTEFHU19JR05PUkVfU0lHS0lMTCAgICAg ICAgIDB4MDgKKyNkZWZpbmUgR1VQX0ZMQUdTX1NUQUNLICAgICAgICAgICAgICAgICAgMHgxMAog CiBpbnQgX19nZXRfdXNlcl9wYWdlcyhzdHJ1Y3QgdGFza19zdHJ1Y3QgKnRzaywgc3RydWN0IG1t X3N0cnVjdCAqbW0sCiAJCSAgICAgdW5zaWduZWQgbG9uZyBzdGFydCwgaW50IGxlbiwgaW50IGZs YWdzLAoACg== -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id BCAE46B003D for ; Sat, 14 Mar 2009 00:59:19 -0400 (EDT) From: Nick Piggin Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Date: Sat, 14 Mar 2009 15:59:11 +1100 References: <20090311170611.GA2079@elte.hu> <200903140309.39777.nickpiggin@yahoo.com.au> <20090313193416.GG27823@random.random> In-Reply-To: <20090313193416.GG27823@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: base64 Content-Disposition: inline Message-Id: <200903141559.12484.nickpiggin@yahoo.com.au> Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Linus Torvalds , Ingo Molnar , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: T24gU2F0dXJkYXkgMTQgTWFyY2ggMjAwOSAwNjozNDoxNiBBbmRyZWEgQXJjYW5nZWxpIHdyb3Rl Ogo+IE9uIFNhdCwgTWFyIDE0LCAyMDA5IGF0IDAzOjA5OjM5QU0gKzExMDAsIE5pY2sgUGlnZ2lu IHdyb3RlOgo+ID4gT2YgY291cnNlIEkgY291bGQgaGF2ZSBhIHJhY2UgaW4gZmFzdC1ndXAsIGJ1 dCBJIGRvbid0IHRoaW5rIEkgY2FuIHNlZQo+ID4gb25lLiBJJ20gd29ya2luZyBvbiByZW1vdmlu ZyB0aGUgdm1hIHN0dWZmIGFuZCBqdXN0IG1ha2luZyBpdCBwZXItcGFnZSwKPiA+IHdoaWNoIG1p Z2h0IG1ha2UgaXQgZWFzaWVyIHRvIHJldmlldy4KPgo+IElmIHlvdSBkaWRuJ3QgdG91Y2ggZ3Vw LWZhc3QgYW5kIHlvdSBkb24ndCBzZW5kIGlwaXMgaW4gZm9yaywgeW91IG1vc3QKPiBjZXJ0YWlu bHkgaGF2ZSBvbmUsIGl0J3MgdGhlIG9uZSBMaW51cyBwb2ludGVkIG91dCBhbmQgdGhhdCBJJ3Zl IGZpeGVkCj4gKHdpdGggSXppaywgdGhlbiBJIHNvcnRlZCBvdXQgdGhlIG9yZGVyaW5nIGRldGFp bHMgYW5kIGhvdyB0byBtYWtlIGl0Cj4gc2FmZSBvbiBmcm9rIHNpZGUpLgoKSXQgZG9lcyB0b3Vj aCBndXAtZmFzdCwgYnV0IGl0IGp1c3QgYWRkcyBvbmUgYnJhbmNoIGFuZCBubyBiYXJyaWVyIGlu IHRoZQpjYXNlIHRoZSBwYWdlIGlzIGRlLWNvd2VkIChhbmQgd291bGQgYmUgYWJsZSB0byB3b3Jr IHdpdGggaHVnZXBhZ2VzIHdpdGgKdGhlIGdldF9wYWdlX211bHRpcGxlIHN0aWxsIEkgdGhpbmsg 
YWx0aG91Z2ggSSBoYXZlbid0IGRvbmUgaHVnZXBhZ2UKaW1wbGVtZW50YXRpb24geWV0KS4KCgo+ ID4gV2VsbCwgaXQgd291bGQgc2F2ZSBoYXZpbmcgdG8gdG91Y2ggdGhlIHBhcmVudCdzIHBhZ2V0 YWJsZXMgYWZ0ZXIKPiA+IGRvaW5nIHRoZSBhdG9taWMgY29weS1vbi1mb3JrIGluIHRoZSBjaGls ZC4gSnVzdCBoYXZlIHRoZSBwYXJlbnQgZG8KPiA+IGEgZG9fd3BfcGFnZSwgd2hpY2ggd2lsbCBu b3RpY2UgaXQgaXMgdGhlIG9ubHkgdXNlciBvZiB0aGUgcGFnZSBhbmQKPiA+IHJldXNlIGl0IHJh dGhlciB0aGFuIENPVyBpdCAobm93IHRoYXQgSHVnaCBoYXMgZml4ZWQgdGhlIHJhY2VzIGluCj4g PiB0aGUgcmV1c2UgY2hlY2sgdGhhdCBzaG91bGQgYmUgZmluZSkuCj4KPiBJZiB3ZSdyZSBpbnRv IHRoZSB0cm91YmxlIHBhdGgsIGl0IG1lYW5zIHBhcmVudCBhbHJlYWR5IG93bnMgdGhlCj4gcGFn ZS4gSSBqdXN0IGxlYXZlIGl0IG93bmVkIHRvIHRoZSBwYXJlbnQsIHB0ZSByZW1haW5zIHRoZSBz YW1lIGJlZm9yZQo+IGFuZCBhZnRlciBmb3JrLiBObyBwb2ludCBpbiBjaGFuZ2luZyB0aGUgcHRl IHZhbHVlIGlmIHdlJ3JlIGluIHRoZQo+IHRyb3VibGVzb21lIHBhdGggYXMgZmFyIGFzIEkgY2Fu IHRlbGwuIEkgb25seSB2ZXJpZnkgdGhhdCB0aGUgcGFyZW50IHB0ZQo+IGRpZG4ndCBnbyBhd2F5 IGZyb20gdW5kZXIgZm9yayB3aGVuIEkgdGVtcG9yYXJpbHkgcmVsZWFzZSB0aGUgcGFyZW50Cj4g UFQgbG9jayB0byBhbGxvY2F0ZSB0aGUgY293IHBhZ2UgaW4gdGhlIHNsb3cgcGF0aCAoc2VlIHRo ZSAtRUFHQUlOCj4gcGF0aCwgSSBhbHNvIHZlcmlmaWVkIGl0IHRyaWdnZXJzIHdpdGggc3dhcHBp bmcgYW5kIHN5c3RlbSBzdXJ2aXZlcyBmaW5lCj4gOykuCgpQb3NzaWJseSB0aGF0J3MgdGhlIHJp Z2h0IHdheSB0byBnby4gRGVwZW5kcyBpZiBpdCBpcyBpbiB0aGUgc2xpZ2h0ZXN0CnBlcmZvcm1h bmNlIGNyaXRpY2FsLiBJZiBub3QsIEkgd291bGQganVzdCBsZXQgZG9fd3BfcGFnZSBkbyB0aGUg d29yawp0byBhdm9pZCBhIGxpdHRsZSBiaXQgb2YgbG9naWMsIGJ1dCBlaXRoZXIgd2F5IGlzIG5v dCBhIGJpZyBkZWFsIHRvIG1lLgoKCj4gPiBOb3cgSSBhbHNvIHNlZSB0aGF0IHlvdXIgcGF0Y2gg c3RpbGwgaGFzbid0IGNvdmVyZWQgdGhlIG90aGVyIHNpZGUgb2YKPiA+IHRoZSByYWNlLCB3aGVy YXMgbXkgc2NoZW1lIHNob3VsZCBkby4gSG1tLCBJIHRoaW5rIHRoYXQgaWYgd2Ugd2FudCB0bwo+ Cj4gU29ycnksIGJ1dCBjYW4geW91IGVsYWJvcmF0ZSBhZ2FpbiB3aGF0IHRoZSBvdGhlciBzaWRl IG9mIHRoZSByYWNlIGlzPwo+Cj4gSWYgY2hpbGQgZ2V0cyBhIHdob2xlIG5ldyBwYWdlLCBhbmQg cGFyZW50IGtlZXBzIGl0cyBvd24gcGFnZSB3aXRoIHB0ZQo+IG1hcmtlZCByZWFkLXdyaXRlIHRo 
ZSB3aG9sZSB0aW1lIHRoYXQgYSBwYWdlIGZhdWx0IGNhbiBydW4gKHBhZ2UgZmF1bHQKPiB0YWtl cyBtbWFwX3NlbSwgYWxsIHdlIGhhdmUgdG8gcHJvdGVjdCBhZ2FpbnN0IHdoZW4gdGVtcG9yYXJp bHkKPiByZWxlYXNpbmcgcGFyZW50IFBUIGxvY2sgaXMgdGhlIFZNIHJtYXAgY29kZSBhbmQgdGhh dCBpcyB0YWtlbiBjYXJlIG9mCj4gYnkgdGhlIHB0ZV9zYW1lIHBhdGgpLCBzbyBJIGRvbid0IHNl ZSBhbnkgb3RoZXIgc2lkZSBvZiB0aGUgcmFjZS4uLgoKT2ggc29ycnkuIEkgd2FzIHVwIHRvbyBs YXRlIGxhc3QgbmlnaHQgOikKCk9uZSBzaWRlIG9mIHRoZSByYWNlIGlzIGRpcmVjdCBJTyByZWFk IHdyaXRpbmcgdG8gZm9yayBjaGlsZCBwYWdlLgpUaGUgb3RoZXIgc2lkZSBvZiB0aGUgcmFjZSBp cyBmb3JrIGNoaWxkIHBhZ2Ugd3JpdGUgbGVha2luZyBpbnRvCnRoZSBkaXJlY3QgSU8uCgpNeSBw YXRjaCBzb2x2ZXMgYm90aCBzaWRlcyBieSBkZS1jb3dpbmcgKmFueSogQ09XIHBhZ2UgYmVmb3Jl IGl0Cm1heSBiZSByZXR1cm5lZCBmcm9tIGdldF91c2VyX3BhZ2VzIChmb3IgcmVhZCBvciB3cml0 ZSkuCgpUaGUgZm9sbG93aW5nIHRlc3QgY2FzZSBzaG93cyB1cCB0aGUgY29ycnVwdGlvbiBib3Ro IHdpdGggc3RhbmRhcmQKa2VybmVsIGFuZCB5b3VyIHBhdGNoLCBidXQgY2FuJ3QgdHJpZ2dlciBp dCB3aXRoIG15IHBhdGNoLiBZb3UgbXVzdApjcmVhdGUgYnkgaGFuZCAiZmlsZS5kYXQiIGZpbGUg b2YgRklMRVNJWkUgaW4gdGhlIGN3ZC4gWW91IG1heSBoYXZlCnRvIHR3ZWFrIHRpbWluZ3MgdG8g Z2V0IHRoZSBmb2xsb3dpbmcgb3JkZXIgb2Ygb3V0cHV0OgoKdGhyZWFkIHdyaXRpbmcKcGFyZW50 IHN0b3JpbmcKY2hpbGQgc3RvcmluZwp0aHJlYWQgd3JpdGluZyBkb25lCgpBZnRlcndhcmRzIGhl eGR1bXAgZmlsZS5kYXQsIGFuZCBpZiBhbnkgMHhmZiBieXRlcyBoYXZlIGxlYWtlZCBpbnRvCml0 LCB0aGVuIGl0IGlzIGZyb20gY2hpbGQgd3JpdGluZyB0byBjaGlsZCBidWZmZXIgYWZmZWN0aW5n IHBhcmVudCdzCmRpcmVjdCBJTyB3cml0ZS4KCgotLS0gcmV2ZXJzZS1yYWNlLmMgLS0tCgojZGVm aW5lIF9HTlVfU09VUkNFIDEKCiNpbmNsdWRlIDxzdGRpby5oPgojaW5jbHVkZSA8c3RkbGliLmg+ CiNpbmNsdWRlIDxmY250bC5oPgojaW5jbHVkZSA8dW5pc3RkLmg+CiNpbmNsdWRlIDxtZW1vcnku aD4KI2luY2x1ZGUgPHB0aHJlYWQuaD4KI2luY2x1ZGUgPGdldG9wdC5oPgojaW5jbHVkZSA8ZXJy bm8uaD4KI2luY2x1ZGUgPHN5cy90eXBlcy5oPgojaW5jbHVkZSA8c3lzL3dhaXQuaD4KCiNkZWZp bmUgRklMRVNJWkUgKDQqMTAyNCoxMDI0KSAKI2RlZmluZSBCVUZTSVpFICAoMTAyNCoxMDI0KQoK c3RhdGljIHB0aHJlYWRfbXV0ZXhfdCBsb2NrID0gUFRIUkVBRF9NVVRFWF9JTklUSUFMSVpFUjsK 
c3RhdGljIGNvbnN0IGNoYXIgKmZpbGVuYW1lID0gImZpbGUuZGF0IjsKc3RhdGljIGludCBmZDsK c3RhdGljIHZvaWQgKmJ1ZmZlcjsKI2RlZmluZSBQQUdFX1NJWkUgICA0MDk2CgpzdGF0aWMgdm9p ZCBzdG9yZSh2b2lkKQp7CglpbnQgaTsKCglpZiAodXNsZWVwKDUwKjEwMDApID09IC0xKQoJCXBl cnJvcigidXNsZWVwIiksIGV4aXQoMSk7CgoJcHJpbnRmKCJjaGlsZCBzdG9yaW5nXG4iKTsgZmZs dXNoKHN0ZG91dCk7Cglmb3IgKGkgPSAwOyBpIDwgQlVGU0laRTsgaSsrKQoJCSgoY2hhciAqKWJ1 ZmZlcilbaV0gPSAweGZmOwoKCV9leGl0KDApOwp9CgpzdGF0aWMgdm9pZCAqd3JpdGVyKHZvaWQg KmFyZykKewoJaW50IGk7CgoJaWYgKHB0aHJlYWRfbXV0ZXhfbG9jaygmbG9jaykgPT0gLTEpCgkJ cGVycm9yKCJwdGhyZWFkX211dGV4X2xvY2siKSwgZXhpdCgxKTsKCglwcmludGYoInRocmVhZCB3 cml0aW5nXG4iKTsgZmZsdXNoKHN0ZG91dCk7Cglmb3IgKGkgPSAwOyBpIDwgRklMRVNJWkUgLyBC VUZTSVpFOyBpKyspIHsKCQlzaXplX3QgY291bnQgPSBCVUZTSVpFOwoJCXNzaXplX3QgcmV0OwoK CQlkbyB7CgkJCXJldCA9IHdyaXRlKGZkLCBidWZmZXIsIGNvdW50KTsKCQkJaWYgKHJldCA9PSAt MSkgewoJCQkJaWYgKGVycm5vICE9IEVJTlRSKQoJCQkJCXBlcnJvcigid3JpdGUiKSwgZXhpdCgx KTsKCQkJCXJldCA9IDA7CgkJCX0KCQkJY291bnQgLT0gcmV0OwoJCX0gd2hpbGUgKGNvdW50KTsK CX0KCXByaW50ZigidGhyZWFkIHdyaXRpbmcgZG9uZVxuIik7IGZmbHVzaChzdGRvdXQpOwoKCWlm IChwdGhyZWFkX211dGV4X3VubG9jaygmbG9jaykgPT0gLTEpCgkJcGVycm9yKCJwdGhyZWFkX211 dGV4X2xvY2siKSwgZXhpdCgxKTsKCglyZXR1cm4gTlVMTDsKfQoKaW50IG1haW4oaW50IGFyZ2Ms IGNoYXIgKmFyZ3ZbXSkKewoJaW50IGk7CglpbnQgc3RhdHVzOwoJcHRocmVhZF90IHdyaXRlcl90 aHJlYWQ7CglwaWRfdCBzdG9yZV9wcm9jOwoKCXBvc2l4X21lbWFsaWduKCZidWZmZXIsIFBBR0Vf U0laRSwgQlVGU0laRSk7CglwcmludGYoIldyaXRlIGJ1ZmZlcjogJXAuXG4iLCBidWZmZXIpOwoK CWZvciAoaSA9IDA7IGkgPCBCVUZTSVpFOyBpKyspCgkJKChjaGFyICopYnVmZmVyKVtpXSA9IDB4 MDA7CgoJZmQgPSBvcGVuKGZpbGVuYW1lLCBPX1JEV1J8T19ESVJFQ1QpOwoJaWYgKGZkID09IC0x KQoJCXBlcnJvcigib3BlbiIpLCBleGl0KDEpOwoKCWlmIChwdGhyZWFkX211dGV4X2xvY2soJmxv Y2spID09IC0xKQoJCXBlcnJvcigicHRocmVhZF9tdXRleF9sb2NrIiksIGV4aXQoMSk7CgoJaWYg KHB0aHJlYWRfY3JlYXRlKCZ3cml0ZXJfdGhyZWFkLCBOVUxMLCB3cml0ZXIsIE5VTEwpID09IC0x KQoJCXBlcnJvcigicHRocmVkX2NyZWF0ZSIpLCBleGl0KDEpOwoKCXN0b3JlX3Byb2MgPSBmb3Jr 
KCk7CglpZiAoc3RvcmVfcHJvYyA9PSAtMSkKCQlwZXJyb3IoImZvcmsiKSwgZXhpdCgxKTsKCWlm ICghc3RvcmVfcHJvYykKCQlzdG9yZSgpOwoKCWlmIChwdGhyZWFkX211dGV4X3VubG9jaygmbG9j aykgPT0gLTEpCgkJcGVycm9yKCJwdGhyZWFkX211dGV4X2xvY2siKSwgZXhpdCgxKTsKCglpZiAo dXNsZWVwKDEwKjEwMDApID09IC0xKQoJCXBlcnJvcigidXNsZWVwIiksIGV4aXQoMSk7CgoJcHJp bnRmKCJwYXJlbnQgc3RvcmluZ1xuIik7IGZmbHVzaChzdGRvdXQpOwoJZm9yIChpID0gMDsgaSA8 IEJVRlNJWkU7IGkrKykKCQkoKGNoYXIgKilidWZmZXIpW2ldID0gMHgxMTsKCglkbyB7CgkJcGlk X3QgdzsKCQl3ID0gd2FpdHBpZChzdG9yZV9wcm9jLCAmc3RhdHVzLCBXVU5UUkFDRUQgfCBXQ09O VElOVUVEKTsKCQlpZiAodyA9PSAtMSkKCQkJcGVycm9yKCJ3YWl0cGlkIiksIGV4aXQoMSk7Cgl9 IHdoaWxlICghV0lGRVhJVEVEKHN0YXR1cykgJiYgIVdJRlNJR05BTEVEKHN0YXR1cykpOwoKCWlm IChwdGhyZWFkX2pvaW4od3JpdGVyX3RocmVhZCwgTlVMTCkgPT0gLTEpCgkJcGVycm9yKCJwdGhy ZWFkX2pvaW4iKSwgZXhpdCgxKTsKCglleGl0KDApOwp9CgAK -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id EE45D6B003D for ; Sat, 14 Mar 2009 01:06:11 -0400 (EDT) From: Nick Piggin Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Date: Sat, 14 Mar 2009 16:06:03 +1100 References: <20090311170611.GA2079@elte.hu> <200903140309.39777.nickpiggin@yahoo.com.au> <200903141546.31139.nickpiggin@yahoo.com.au> In-Reply-To: <200903141546.31139.nickpiggin@yahoo.com.au> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200903141606.04450.nickpiggin@yahoo.com.au> Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Linus Torvalds , Ingo Molnar , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Saturday 14 March 2009 15:46:30 Nick Piggin wrote: > Index: 
linux-2.6/arch/x86/mm/gup.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/gup.c	2009-03-14 02:48:06.000000000 +1100
> +++ linux-2.6/arch/x86/mm/gup.c	2009-03-14 02:48:12.000000000 +1100
> @@ -83,11 +83,14 @@ static noinline int gup_pte_range(pmd_t
>  		struct page *page;
>
>  		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
> +failed:
>  			pte_unmap(ptep);
>  			return 0;
>  		}
>  		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>  		page = pte_page(pte);
> +		if (unlikely(!PageDontCOW(page)))
> +			goto failed;
>  		get_page(page);
>  		pages[*nr] = page;
>  		(*nr)++;

Ah, that's stupid, the test should be confined just to PageAnon &&
!PageDontCOW pages, of course.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id DB3816B0047 for ; Sat, 14 Mar 2009 01:06:49 -0400 (EDT)
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
From: Benjamin Herrenschmidt
In-Reply-To:
References: <20090311170611.GA2079@elte.hu> <20090311174103.GA11979@elte.hu> <20090311183748.GK27823@random.random> <20090311195935.GO27823@random.random>
Content-Type: text/plain
Date: Sat, 14 Mar 2009 16:06:29 +1100
Message-Id: <1237007189.25062.91.camel@pasglop>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: Linus Torvalds
Cc: Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

On Wed, 2009-03-11 at 13:19 -0700, Linus Torvalds wrote:
>
> That said, I don't know who the crazy O_DIRECT users are. It may be true
> that some O_DIRECT users end up using the same pages over and over again,
> and that this is a good optimization for them.

Just my 2 cents here...

While I agree mostly with what you say about O_DIRECT craziness,
unfortunately, gup is also a fashionable interface in a few other areas,
such as IB or RDMA'ish things, and I'm pretty sure we'll see others
popping up here or there.

Right, it's a bit stinky, but it -is- somewhat nice for a driver to be
able to take a chunk of existing user addresses and not care whether
they are anonymous, shmem, file mappings, large pages, ... and just gup
and get some DMA pounding on them. There are various usage scenarios
where it's in fact less ugly than anything else you can come up with ...
pretty much.

IB folks so far have been avoiding the fork() trap thanks to
madvise(MADV_DONTFORK) afaik. And it all goes generally well when the
whole application knows what it's doing and just plain avoids fork.

-But- things get nasty if for some reason, the user of gup is somewhere
deep in some kind of library that an application uses without knowing,
while forking here or there to run shell scripts or other helpers.

I've seen it :-)

So if a solution can be found that doesn't uglify the whole thing beyond
recognition, it's probably worth it.

Cheers,
Ben.
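
The madvise(MADV_DONTFORK) workaround mentioned above can be sketched in userspace roughly as follows. This is only an illustration, assuming Linux and an anonymous mmap()ed DMA buffer; the helper names map_dontfork_buffer() and child_can_read() are hypothetical and do not come from this thread or from any IB library. The point it shows is that a range marked MADV_DONTFORK is simply absent in the child, so fork() never write-protects or shares those pages.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: map an anonymous buffer and exclude it from
 * fork() with MADV_DONTFORK. Returns NULL on failure.
 */
static void *map_dontfork_buffer(size_t len)
{
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;
	if (madvise(buf, len, MADV_DONTFORK) == -1) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}

/*
 * Hypothetical helper: fork a child that reads one byte from buf.
 * Returns 1 if the child read it and exited cleanly, 0 if the child
 * faulted or otherwise failed, -1 if fork/waitpid itself failed.
 */
static int child_can_read(const void *buf)
{
	const volatile char *p = buf;
	int status;
	pid_t pid = fork();

	if (pid == -1)
		return -1;
	if (pid == 0) {
		char c = p[0];	/* faults if the range was not inherited */
		(void)c;
		_exit(0);
	}
	if (waitpid(pid, &status, 0) == -1)
		return -1;
	return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

A caller would allocate its I/O buffer with map_dontfork_buffer() before starting any gup-based I/O. As the message above notes, this only helps when the code that forks and the code that pins memory know about each other.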

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id 452546B003D for ; Sat, 14 Mar 2009 01:07:49 -0400 (EDT)
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
From: Benjamin Herrenschmidt
In-Reply-To:
References: <20090311170611.GA2079@elte.hu> <20090311174103.GA11979@elte.hu> <20090311183748.GK27823@random.random> <20090311195935.GO27823@random.random>
Content-Type: text/plain
Date: Sat, 14 Mar 2009 16:07:21 +1100
Message-Id: <1237007241.25062.92.camel@pasglop>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: Linus Torvalds
Cc: Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

On Wed, 2009-03-11 at 13:33 -0700, Linus Torvalds wrote:
>  - Just make the rule be that people who use get_user_pages() always
>    have to have the read-lock on mmap_sem until they've used the
>    pages.
>

That's not going to work with IB and friends who gup() whole bunches of
user memory forever...

Ben.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 2BE806B0055 for ; Sat, 14 Mar 2009 01:21:20 -0400 (EDT)
From: Nick Piggin
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
Date: Sat, 14 Mar 2009 16:20:44 +1100
References: <20090311170611.GA2079@elte.hu> <1237007189.25062.91.camel@pasglop>
In-Reply-To: <1237007189.25062.91.camel@pasglop>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200903141620.45052.nickpiggin@yahoo.com.au>
Sender: owner-linux-mm@kvack.org
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

On Saturday 14 March 2009 16:06:29 Benjamin Herrenschmidt wrote:
> On Wed, 2009-03-11 at 13:19 -0700, Linus Torvalds wrote:
> > That said, I don't know who the crazy O_DIRECT users are. It may be true
> > that some O_DIRECT users end up using the same pages over and over again,
> > and that this is a good optimization for them.
>
> Just my 2 cents here...
>
> While I agree mostly with what you say about O_DIRECT craziness,
> unfortunately, gup is also a fashionable interface in a few other areas,
> such as IB or RDMA'ish things, and I'm pretty sure we'll see others
> popping up here or there.
>
> Right, it's a bit stinky, but it -is- somewhat nice for a driver to be
> able to take a chunk of existing user addresses and not care whether
> they are anonymous, shmem, file mappings, large pages, ... and just gup
> and get some DMA pounding on them. There are various usage scenarios
> where it's in fact less ugly than anything else you can come up with ...
> pretty much.
>
> IB folks so far have been avoiding the fork() trap thanks to
> madvise(MADV_DONTFORK) afaik.
And it all goes generally well when the
> whole application knows what it's doing and just plain avoids fork.
>
> -But- things get nasty if for some reason, the user of gup is somewhere
> deep in some kind of library that an application uses without knowing,
> while forking here or there to run shell scripts or other helpers.
>
> I've seen it :-)
>
> So if a solution can be found that doesn't uglify the whole thing beyond
> recognition, it's probably worth it.

AFAIKS, the approach I've posted is probably the simplest (and maybe only
way) to really fix it. It's not too ugly.

You can't easily fix it at write-time by COWing in the right direction like
Linus suggested because at that point you may have multiple get_user_pages
(for read) from the parent and child on the page, so there is no way to COW
it in the right direction.

You could do something crazy like allowing only one get_user_pages read on a
wp page, and recording which direction to send it if it does get COWed. But
at that point you've got something that's far uglier in the core code and
more complex than what I posted.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 4630D6B003D for ; Mon, 16 Mar 2009 09:57:26 -0400 (EDT)
Date: Mon, 16 Mar 2009 14:56:54 +0100
From: Andrea Arcangeli
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
Message-ID: <20090316135654.GA17949@random.random>
References: <20090311170611.GA2079@elte.hu> <200903140309.39777.nickpiggin@yahoo.com.au> <20090313193416.GG27823@random.random> <200903141559.12484.nickpiggin@yahoo.com.au>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <200903141559.12484.nickpiggin@yahoo.com.au>
Sender: owner-linux-mm@kvack.org
To: Nick Piggin
Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

On Sat, Mar 14, 2009 at 03:59:11PM +1100, Nick Piggin wrote:
> It does touch gup-fast, but it just adds one branch and no barrier in the

My question is what trick do you use to stop gup-fast from returning the
page mapped read-write by the pte if gup-fast doesn't take any lock
whatsoever, it doesn't set any bit in any page or vma, and it doesn't
recheck the pte is still viable after having set any bit on page or
vmas, and you still don't send a flood of ipis from fork fast path (no
race case).

> case the page is de-cowed (and would be able to work with hugepages with
> the get_page_multiple still I think although I haven't done hugepage
> implementation yet).

Yes let's ignore hugetlb for now, I fixed hugetlb too but that can be
left for later.

> Possibly that's the right way to go. Depends if it is in the slightest
> performance critical. If not, I would just let do_wp_page do the work
> to avoid a little bit of logic, but either way is not a big deal to me.

fork is less performance critical than do_wp_page, still in the fork
microbenchmark no slowdown is measured with the patch. Before I
introduced PG_gup there were false positives triggered by the pagevec
temporary pins, that was measurable, after PG_gup the fast path is
unaffected (I've still to measure the gup-fast slowdown in setting PG_gup
but I'm rather optimistic that you're underestimating the cost of
walking 4 layers of pagetables compared to a locked op on an l1
exclusive cacheline, so I think it'll be lost in the noise). I think
the big thing of gup-fast is primarily in not having to search vmas,
and in turn not having to take any shared lock like mmap_sem/PT lock,
and in scaling on a page level with just a get-page being the
troublesome cacheline.

> One side of the race is direct IO read writing to fork child page.
> The other side of the race is fork child page write leaking into
> the direct IO.
>
> My patch solves both sides by de-cowing *any* COW page before it
> may be returned from get_user_pages (for read or write).

I see what you mean now. If you read the comment of my patch you'll
see I explicitly intended that only people writing into memory with
gup was troublesome here. Like you point out, using gup for _reading_
from memory is troublesome as well if the child writes to those
pages. This is kind of a lesser problem because the major issue is that
fork is enough to generate memory corruption even if the child isn't
touching those pages. The reverse race requires the child to write to
those pages so I guess it never triggered in real life apps. But
nevertheless I totally agree if we fix the write-to-memory-with-gup
we have to fix the read-from-memory-with-gup.

Below I updated my patch and relative commit header to fix the reverse
race too. However I had to enlarge the buffer to 40M to reproduce with
your testcase because my HD was too fast otherwise.

----------
From: Andrea Arcangeli
Subject: fork-o_direct-race

Think of a thread writing constantly to the last 512 bytes of a page,
while another thread reads and writes to/from the first 512 bytes of the
page. We can lose O_DIRECT reads (or any other get_user_pages write=1
I/O, not just bio/O_DIRECT) the very moment we mark any pte wrprotected
because a third unrelated thread forks off a child.

This fixes it by copying the anon page (instead of sharing it) within
fork, if there can be any direct I/O in flight to the page. That takes
care of O_DIRECT reads (writes to memory, read from disk). Checking
the page_count under the PT lock guarantees no get_user_pages could be
running under us because if somebody wants to write to the page, it
has to break any cow first and that requires taking the PT lock in
follow_page before increasing the page count. We are also guaranteed
mapcount is 1 if fork is writeprotecting the pte so the PT lock is
enough to serialize against get_user_pages->get_page.

Another problem is the O_DIRECT writes to disk: if the parent touches
a shared anon page before the child, the child do_wp_page will take over
the anon page and map it read-write even though it was under direct-io
from the parent thread pool. This requires de-cowing the pages in gup
more aggressively (i.e. setting FOLL_WRITE temporarily on anon pages
to de-cow them, and always assume write=1 for the hugetlb follow_page
version).

gup-fast is taken care of without flushing the smp-tlb for every
parent-pte wrprotected, by wrprotecting the pte before checking the
page count vs mapcount. gup-fast will then re-check that the pte is
still available in write mode after having increased the page count,
so solving the race without a flood of IPIs in fork.

The COW triggered inside fork will run while the parent pte is
readonly to provide as usual the per-page atomic copy from parent to
child during fork. However timings will be altered by having to copy
the pages that might be under O_DIRECT.
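
The ordering argument in the gup-fast paragraph above can be modelled in userspace as a rough sketch. This is not the kernel code from the patch: every name below (toy_page, toy_pte, toy_gup_fast, toy_fork_must_copy) is hypothetical, and sequentially consistent C11 atomics stand in for the barriers the real code would need. The fork side write-protects first and then compares counts; the gup-fast side takes its reference first and then re-checks the pte. With that ordering at least one side always notices the other.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model only; none of these types or functions are kernel APIs. */
struct toy_page {
	atomic_int count;	/* one ref per mapping plus one per gup pin */
	atomic_int mapcount;	/* number of ptes mapping the page */
};

struct toy_pte {
	atomic_bool writable;
	struct toy_page *page;
};

static void toy_init(struct toy_pte *pte, struct toy_page *page)
{
	atomic_init(&page->count, 1);
	atomic_init(&page->mapcount, 1);
	atomic_init(&pte->writable, true);
	pte->page = page;
}

/*
 * gup-fast side: pin the page, then re-check that the pte is still
 * writable. Returns false if the caller must fall back to a slow path.
 */
static bool toy_gup_fast(struct toy_pte *pte)
{
	if (!atomic_load(&pte->writable))
		return false;
	atomic_fetch_add(&pte->page->count, 1);		/* take the pin */
	if (!atomic_load(&pte->writable)) {
		/* fork write-protected the pte under us: back off */
		atomic_fetch_sub(&pte->page->count, 1);
		return false;
	}
	return true;
}

/*
 * fork side: write-protect first, then compare count with mapcount.
 * Returns true if an extra pin exists, i.e. the page has to be copied
 * for the child instead of being shared copy-on-write.
 */
static bool toy_fork_must_copy(struct toy_pte *pte)
{
	atomic_store(&pte->writable, false);
	return atomic_load(&pte->page->count) !=
	       atomic_load(&pte->page->mapcount);
}
```

If toy_gup_fast() wins the race, toy_fork_must_copy() sees the elevated count and copies; if fork wins, toy_gup_fast() sees the read-only pte and drops its reference. Neither side needs an IPI or a lock in this model, which is the property the commit message claims for the real change.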

Once this race is fixed, the testcase, instead of showing corruption, is
capable of triggering a glibc NPTL race condition where fork_pre_cow
is copying the internal nptl stack list in anonymous memory while some
parent thread may be modifying it, which results in a userland
deadlock when the fork-child tries to free the stacks before returning
from fork. We are flushing the tlb after wrprotecting the pte that
maps the anon page if we take the fork_pre_cow path, so we should be
providing per-page atomic copy from parent to child. The race indeed
can trigger also without this patch and without fork_pre_cow, and to
trigger it the wrprotect event must happen exactly in the middle of a
list_add/list_del instruction run by some NPTL thread that is mangling
over the stack list while fork runs. Some preliminary NPTL fix for
this race exposed by this fix is happening in the glibc repository but I
think it'd be better off to use a smart lock capable of jumping in and
out of signal handler and not to go out of order rcu style which
sounds too complex.

The pagevec code calls get_page while the page is sitting in the
pagevec (before it becomes PageLRU) and doing so it can generate false
positives, so to avoid slowing down fork all the time even for pages
that could never possibly be under O_DIRECT write=1, the PG_gup
bitflag is added, this eliminates most overhead of the fix in fork.

I had to add src_vma/dst_vma to use proper ->mm pointers, and in the
case of track_pfn_vma_copy PAT code, this is fixing a bug, because
previously vma was the dst_vma, while track_pfn_vma_copy has to run on
the src_vma (the dst_vma in that place is guaranteed to have zero ptes
instantiated/allocated).

There are two testcases that reproduce the bug and they reproduce
the bug both for regular anon pages and using the libhugetlbfs and
hugepages too. Patch works for both.
The glibc race is also eventually reproducible both using anon pages
and hugepages with the dma_thread testcase (the forkscrew testcase
isn't capable of reproducing the nptl race condition in fork).

========== dma_thread.c =======
/* compile with 'gcc -g -o dma_thread dma_thread.c -lpthread' */

#define _GNU_SOURCE 1

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <memory.h>
#include <pthread.h>
#include <getopt.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

#define FILESIZE (12*1024*1024)
#define READSIZE (1024*1024)

#define FILENAME "test_%.04d.tmp"
#define FILECOUNT 100
#define MIN_WORKERS 2
#define MAX_WORKERS 256
#define PAGE_SIZE 4096

#define true 1
#define false 0

typedef int bool;

bool done = false;
int workers = 2;

#define PATTERN (0xfa)

static void
usage (void)
{
	fprintf(stderr, "\nUsage: dma_thread [-h | -a <alignment> [ -w <workers>]\n"
		"\nWith no arguments, generate test files and exit.\n"
		"-h Display this help and exit.\n"
		"-a align read buffer to offset <alignment>.\n"
		"-w number of worker threads, 2 (default) to 256,\n"
		" defaults to number of cores.\n\n"

		"Run first with no arguments to generate files.\n"
		"Then run with -a <alignment> = 512 or 0. \n");
}

typedef struct {
	pthread_t tid;
	int worker_number;
	int fd;
	int offset;
	int length;
	int pattern;
	unsigned char *buffer;
} worker_t;

void *worker_thread(void * arg)
{
	int bytes_read;
	int i,k;
	worker_t *worker = (worker_t *) arg;
	int offset = worker->offset;
	int fd = worker->fd;
	unsigned char *buffer = worker->buffer;
	int pattern = worker->pattern;
	int length = worker->length;

	if (lseek(fd, offset, SEEK_SET) < 0) {
		fprintf(stderr, "Failed to lseek to %d on fd %d: %s.\n",
			offset, fd, strerror(errno));
		exit(1);
	}

	bytes_read = read(fd, buffer, length);
	if (bytes_read != length) {
		fprintf(stderr, "read failed on fd %d: bytes_read %d, %s\n",
			fd, bytes_read, strerror(errno));
		exit(1);
	}

	/* Corruption check */
	for (i = 0; i < length; i++) {
		if (buffer[i] != pattern) {
			printf("Bad data at 0x%.06x: %p, \n", i, buffer + i);
			printf("Data dump starting at 0x%.06x:\n", i - 8);
			printf("Expect 0x%x followed by 0x%x:\n",
				pattern, PATTERN);

			for (k = 0; k < 16; k++) {
				printf("%02x ", buffer[i - 8 + k]);
				if (k == 7) {
					printf("\n");
				}
			}

			printf("\n");
			abort();
		}
	}

	return 0;
}

void *fork_thread (void *arg)
{
	pid_t pid;

	while (!done) {
		pid = fork();
		if (pid == 0) {
			exit(0);
		} else if (pid < 0) {
			fprintf(stderr, "Failed to fork child.\n");
			exit(1);
		}
		waitpid(pid, NULL, 0 );
		usleep(100);
	}

	return NULL;
}

int main(int argc, char *argv[])
{
	unsigned char *buffer = NULL;
	char filename[1024];
	int fd;
	bool dowrite = true;
	pthread_t fork_tid;
	int c, n, j;
	worker_t *worker;
	int align = 0;
	int offset, rc;

	workers = sysconf(_SC_NPROCESSORS_ONLN);

	while ((c = getopt(argc, argv, "a:hw:")) != -1) {
		switch (c) {
		case 'a':
			align = atoi(optarg);
			if (align < 0 || align > PAGE_SIZE) {
				printf("Bad alignment %d.\n", align);
				exit(1);
			}
			dowrite = false;
			break;

		case 'h':
			usage();
			exit(0);
			break;

		case 'w':
			workers = atoi(optarg);
			if (workers < MIN_WORKERS || workers > MAX_WORKERS) {
				fprintf(stderr, "Worker count %d not between "
					"%d and %d, inclusive.\n",
					workers, MIN_WORKERS, MAX_WORKERS);
				usage();
				exit(1);
			}
			dowrite = false;
			break;

		default:
			usage();
			exit(1);
		}
	}

	if (argc > 1 && (optind < argc)) {
		fprintf(stderr, "Bad command line.\n");
		usage();
		exit(1);
	}

	if (dowrite) {

		buffer = malloc(FILESIZE);
		if (buffer == NULL) {
			fprintf(stderr, "Failed to malloc write buffer.\n");
			exit(1);
		}

		for (n = 1; n <= FILECOUNT; n++) {
			sprintf(filename, FILENAME, n);
			fd = open(filename, O_RDWR|O_CREAT|O_TRUNC, 0666);
			if (fd < 0) {
				printf("create failed(%s): %s.\n", filename, strerror(errno));
				exit(1);
			}
			memset(buffer, n, FILESIZE);
			printf("Writing file %s.\n", filename);
			if (write(fd, buffer, FILESIZE) != FILESIZE) {
				printf("write failed (%s)\n", filename);
			}

			close(fd);
			fd = -1;
		}

		free(buffer);
		buffer = NULL;

		printf("done\n");
		exit(0);
	}

	printf("Using %d workers.\n", workers);

	worker = malloc(workers * sizeof(worker_t));
	if (worker == NULL) {
		fprintf(stderr, "Failed to malloc worker array.\n");
		exit(1);
	}

	for (j = 0; j < workers; j++) {
		worker[j].worker_number = j;
	}

	printf("Using alignment %d.\n", align);

	posix_memalign((void *)&buffer, PAGE_SIZE, READSIZE+ align);
	printf("Read buffer: %p.\n", buffer);
	for (n = 1; n <= FILECOUNT; n++) {

		sprintf(filename, FILENAME, n);
		for (j = 0; j < workers; j++) {
			if ((worker[j].fd = open(filename, O_RDONLY|O_DIRECT)) < 0) {
				fprintf(stderr, "Failed to open %s: %s.\n",
					filename, strerror(errno));
				exit(1);
			}

			worker[j].pattern = n;
		}

		printf("Reading file %d.\n", n);

		for (offset = 0; offset < FILESIZE; offset += READSIZE) {
			memset(buffer, PATTERN, READSIZE + align);
			for (j = 0; j < workers; j++) {
				worker[j].offset = offset + j * PAGE_SIZE;
				worker[j].buffer = buffer + align + j * PAGE_SIZE;
				worker[j].length = PAGE_SIZE;
			}
			/* The final worker reads whatever is left over. */
			worker[workers - 1].length = READSIZE - PAGE_SIZE * (workers - 1);

			done = 0;

			rc = pthread_create(&fork_tid, NULL, fork_thread, NULL);
			if (rc != 0) {
				fprintf(stderr, "Can't create fork thread: %s.\n",
					strerror(rc));
				exit(1);
			}

			for (j = 0; j < workers; j++) {
				rc = pthread_create(&worker[j].tid,
						    NULL,
						    worker_thread,
						    worker + j);
				if (rc != 0) {
					fprintf(stderr, "Can't create worker thread %d: %s.\n",
						j, strerror(rc));
					exit(1);
				}
			}

			for (j = 0; j < workers; j++) {
				rc = pthread_join(worker[j].tid, NULL);
				if (rc != 0) {
					fprintf(stderr, "Failed to join worker thread %d: %s.\n",
						j, strerror(rc));
					exit(1);
				}
			}

			/* Let the fork thread know it's ok to exit */
			done = 1;

			rc = pthread_join(fork_tid, NULL);
			if (rc != 0) {
				fprintf(stderr, "Failed to join fork thread: %s.\n",
					strerror(rc));
				exit(1);
			}
		}

		/* Close the fd's for the next file. */
		for (j = 0; j < workers; j++) {
			close(worker[j].fd);
		}
	}

	return 0;
}
========== dma_thread.c =======

========== forkscrew.c ========
/*
 * Copyright 2009, Red Hat, Inc.
 *
 * Author: Jeff Moyer
 *
 * This program attempts to expose a race between O_DIRECT I/O and the fork()
 * path in a multi-threaded program. In order to reliably reproduce the
 * problem, it is best to perform a dd from the device under test to /dev/null
 * as this makes the read I/O slow enough to orchestrate the problem.
 *
 * Running: ./forkscrew
 *
 * It is expected that a file name "data" exists in the current working
 * directory, and that its contents are something other than 0x2a. A simple
 * dd if=/dev/zero of=data bs=1M count=1 should be sufficient.
*/ #define _GNU_SOURCE 1 #include #include #include #include #include #include #include #include #include #include pthread_cond_t worker_cond = PTHREAD_COND_INITIALIZER; pthread_mutex_t worker_mutex = PTHREAD_MUTEX_INITIALIZER; pthread_cond_t fork_cond = PTHREAD_COND_INITIALIZER; pthread_mutex_t fork_mutex = PTHREAD_MUTEX_INITIALIZER; char *buffer; int fd; /* pattern filled into the in-memory buffer */ #define PATTERN 0x2a // '*' void usage(void) { fprintf(stderr, "\nUsage: forkscrew\n" "it is expected that a file named \"data\" is the current\n" "working directory. It should be at least 3*pagesize in size\n" ); } void dump_buffer(char *buf, int len) { int i; int last_off, last_val; last_off = -1; last_val = -1; for (i = 0; i < len; i++) { if (last_off < 0) { last_off = i; last_val = buf[i]; continue; } if (buf[i] != last_val) { printf("%d - %d: %d\n", last_off, i - 1, last_val); last_off = i; last_val = buf[i]; } } if (last_off != len - 1) printf("%d - %d: %d\n", last_off, i-1, last_val); } int check_buffer(char *bufp, int len, int pattern) { int i; for (i = 0; i < len; i++) { if (bufp[i] == pattern) return 1; } return 0; } void * forker_thread(void *arg) { pthread_mutex_lock(&fork_mutex); pthread_cond_signal(&fork_cond); pthread_cond_wait(&fork_cond, &fork_mutex); switch (fork()) { case 0: sleep(1); printf("child dumping buffer:\n"); dump_buffer(buffer + 512, 2*getpagesize()); exit(0); case -1: perror("fork"); exit(1); default: break; } pthread_cond_signal(&fork_cond); pthread_mutex_unlock(&fork_mutex); wait(NULL); return (void *)0; } void * worker(void *arg) { int first = (int)arg; char *bufp; int pagesize = getpagesize(); int ret; int corrupted = 0; if (first) { io_context_t aioctx; struct io_event event; struct iocb *iocb = malloc(sizeof *iocb); if (!iocb) { perror("malloc"); exit(1); } memset(&aioctx, 0, sizeof(aioctx)); ret = io_setup(1, &aioctx); if (ret != 0) { errno = -ret; perror("io_setup"); exit(1); } bufp = buffer + 512; io_prep_pread(iocb, fd, bufp, 
pagesize, 0); /* submit the I/O */ io_submit(aioctx, 1, &iocb); /* tell the fork thread to run */ pthread_mutex_lock(&fork_mutex); pthread_cond_signal(&fork_cond); /* wait for the fork to happen */ pthread_cond_wait(&fork_cond, &fork_mutex); pthread_mutex_unlock(&fork_mutex); /* release the other worker to issue I/O */ pthread_mutex_lock(&worker_mutex); pthread_cond_signal(&worker_cond); pthread_mutex_unlock(&worker_mutex); ret = io_getevents(aioctx, 1, 1, &event, NULL); if (ret != 1) { errno = -ret; perror("io_getevents"); exit(1); } if (event.res != pagesize) { errno = -event.res; perror("read error"); exit(1); } io_destroy(aioctx); /* check buffer, should be corrupt */ if (check_buffer(bufp, pagesize, PATTERN)) { printf("worker 0 failed check\n"); dump_buffer(bufp, pagesize); corrupted = 1; } } else { bufp = buffer + 512 + pagesize; pthread_mutex_lock(&worker_mutex); pthread_cond_signal(&worker_cond); /* tell main we're ready */ /* wait for the first I/O and the fork */ pthread_cond_wait(&worker_cond, &worker_mutex); pthread_mutex_unlock(&worker_mutex); /* submit overlapping I/O */ ret = read(fd, bufp, pagesize); if (ret != pagesize) { perror("read"); exit(1); } /* check buffer, should be fine */ if (check_buffer(bufp, pagesize, PATTERN)) { printf("worker 1 failed check -- abnormal\n"); dump_buffer(bufp, pagesize); corrupted = 1; } } return (void *)corrupted; } int main(int argc, char **argv) { pthread_t workers[2]; pthread_t forker; int ret, rc = 0; void *thread_ret; int pagesize = getpagesize(); fd = open("data", O_DIRECT|O_RDONLY); if (fd < 0) { perror("open"); exit(1); } ret = posix_memalign(&buffer, pagesize, 3 * pagesize); if (ret != 0) { errno = ret; perror("posix_memalign"); exit(1); } memset(buffer, PATTERN, 3*pagesize); pthread_mutex_lock(&fork_mutex); ret = pthread_create(&forker, NULL, forker_thread, NULL); pthread_cond_wait(&fork_cond, &fork_mutex); pthread_mutex_unlock(&fork_mutex); pthread_mutex_lock(&worker_mutex); ret |= 
pthread_create(&workers[0], NULL, worker, (void *)0); if (ret) { perror("pthread_create"); exit(1); } pthread_cond_wait(&worker_cond, &worker_mutex); pthread_mutex_unlock(&worker_mutex); ret = pthread_create(&workers[1], NULL, worker, (void *)1); if (ret != 0) { perror("pthread_create"); exit(1); } pthread_join(forker, NULL); pthread_join(workers[0], &thread_ret); if (thread_ret != 0) rc = 1; pthread_join(workers[1], &thread_ret); if (thread_ret != 0) rc = 1; if (rc != 0) { printf("parent dumping full buffer\n"); dump_buffer(buffer + 512, 2 * pagesize); } close(fd); free(buffer); exit(rc); } ========== forkscrew.c ======== ========== forkscrewreverse.c ======== #define _GNU_SOURCE 1 #include #include #include #include #include #include #include #include #include #include #define FILESIZE (40*1024*1024) #define BUFSIZE (40*1024*1024) static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; static const char *filename = "file.dat"; static int fd; static void *buffer; #define PAGE_SIZE 4096 static void store(void) { int i; if (usleep(50*1000) == -1) perror("usleep"), exit(1); printf("child storing\n"); fflush(stdout); for (i = 0; i < BUFSIZE; i++) ((char *)buffer)[i] = 0xff; _exit(0); } static void *writer(void *arg) { int i; if (pthread_mutex_lock(&lock) == -1) perror("pthread_mutex_lock"), exit(1); printf("thread writing\n"); fflush(stdout); for (i = 0; i < FILESIZE / BUFSIZE; i++) { size_t count = BUFSIZE; ssize_t ret; do { ret = write(fd, buffer, count); if (ret == -1) { if (errno != EINTR) perror("write"), exit(1); ret = 0; } count -= ret; } while (count); } printf("thread writing done\n"); fflush(stdout); if (pthread_mutex_unlock(&lock) == -1) perror("pthread_mutex_lock"), exit(1); return NULL; } int main(int argc, char *argv[]) { int i; int status; pthread_t writer_thread; pid_t store_proc; posix_memalign(&buffer, PAGE_SIZE, BUFSIZE); printf("Write buffer: %p.\n", buffer); for (i = 0; i < BUFSIZE; i++) ((char *)buffer)[i] = 0x00; fd = open(filename, 
O_RDWR|O_DIRECT); if (fd == -1) perror("open"), exit(1); if (pthread_mutex_lock(&lock) == -1) perror("pthread_mutex_lock"), exit(1); if (pthread_create(&writer_thread, NULL, writer, NULL) == -1) perror("pthread_create"), exit(1); store_proc = fork(); if (store_proc == -1) perror("fork"), exit(1); if (!store_proc) store(); if (pthread_mutex_unlock(&lock) == -1) perror("pthread_mutex_unlock"), exit(1); if (usleep(10*1000) == -1) perror("usleep"), exit(1); printf("parent storing\n"); fflush(stdout); for (i = 0; i < BUFSIZE; i++) ((char *)buffer)[i] = 0x11; do { pid_t w; w = waitpid(store_proc, &status, WUNTRACED | WCONTINUED); if (w == -1) perror("waitpid"), exit(1); } while (!WIFEXITED(status) && !WIFSIGNALED(status)); if (pthread_join(writer_thread, NULL) == -1) perror("pthread_join"), exit(1); exit(0); } ========== forkscrewreverse.c ========

Normally I test with "dma_thread -a 512 -w 40".

To reproduce or verify the fix with hugepages, run the testcases like this:

LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ../test/dma_thread -a 512 -w 40
LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./forkscrew
LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./forkscrewreverse

This is a fixed version of the original patch from Nick Piggin.

KSM has the same problem as fork, and it also checks the page_count after a ptep_clear_flush_notify. The _flush sends an smp-tlb-flush that stops gup-fast, so KSM doesn't depend on the above gup-fast changes, which allow fork not to flush the smp-tlb for every pte it wrprotects. The _notify ensures that all secondary ptes are zapped and any page pin is released for mmu-notifier subsystems that take page pins, as KVM currently does.

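For anybody who wants to convince themselves that the ordering between fork and gup-fast matters, here is a minimal user-space sketch of the handshake this patch relies on: fork wrprotects the pte and then checks PG_gup, while gup-fast sets PG_gup and then rechecks the pte. This is not kernel code. The struct, the step names and count_unsafe_interleavings() are made up for illustration, and the model assumes that the smp_mb() on each side makes the two accesses behave as if they were sequentially consistent. It also leaves out the page_count/mapcount comparison that copy_one_pte does in the patch. It simply walks every interleaving of the four steps and counts the ones where gup-fast keeps its fast-path pin while fork shares the page without forcing a copy.

```c
#include <stdbool.h>

/* Hypothetical model state; none of these are real kernel structures. */
struct model {
	bool pte_writable;  /* parent pte still has write permission */
	bool page_gup;      /* stands in for the PG_gup page flag */
	bool fork_forcecow; /* fork saw the flag and will copy the page */
	bool gup_fast_ok;   /* gup-fast kept its pin on the fast path */
};

enum step { FORK_WRPROTECT, FORK_CHECK_GUP, GUP_SET_FLAG, GUP_RECHECK_PTE };

static void run_step(struct model *m, enum step s)
{
	switch (s) {
	case FORK_WRPROTECT:
		m->pte_writable = false;
		break;
	case FORK_CHECK_GUP:
		m->fork_forcecow = m->page_gup;
		break;
	case GUP_SET_FLAG:
		m->page_gup = true;
		break;
	case GUP_RECHECK_PTE:
		m->gup_fast_ok = m->pte_writable;
		break;
	}
}

/*
 * Enumerate all interleavings that preserve each side's program order.
 * With fork_checks_first == false fork follows the order used by the
 * patch (wrprotect, then check the flag); with true it is reversed.
 */
int count_unsafe_interleavings(bool fork_checks_first)
{
	const enum step fork_steps[2] = {
		fork_checks_first ? FORK_CHECK_GUP : FORK_WRPROTECT,
		fork_checks_first ? FORK_WRPROTECT : FORK_CHECK_GUP,
	};
	const enum step gup_steps[2] = { GUP_SET_FLAG, GUP_RECHECK_PTE };
	int unsafe = 0;

	/* bit i of mask set: the i-th executed step belongs to fork */
	for (unsigned mask = 0; mask < 16; mask++) {
		struct model m = { .pte_writable = true };
		int f = 0, g = 0, i;

		/* keep only schedules with exactly two fork steps */
		for (i = 0; i < 4; i++)
			f += (mask >> i) & 1;
		if (f != 2)
			continue;

		f = 0;
		for (i = 0; i < 4; i++) {
			if ((mask >> i) & 1)
				run_step(&m, fork_steps[f++]);
			else
				run_step(&m, gup_steps[g++]);
		}
		/* unsafe: fork shared the page while gup-fast kept a pin */
		if (m.gup_fast_ok && !m.fork_forcecow)
			unsafe++;
	}
	return unsafe;
}
```

Under these assumptions count_unsafe_interleavings(false) returns 0, while count_unsafe_interleavings(true) returns 1: if fork checked the flag before wrprotecting the pte, gup-fast could set the flag and recheck the pte in between and nobody would notice the pin.
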
BTW, I guess it's pure luck ENOSPC != VM_FAULT_OOM in hugetlb.c, mixing -errno with -VM_FAULT_* is total breakage that will have to be cleaned up (either don't use -ENOSPC, or use -ENOMEM instead of VM_FAULT_OOM), I didn't address it in this patch as it's unrelated. Signed-off-by: Andrea Arcangeli --- Removed mtk.manpages@gmail.com, linux-man@vger.kernel.org from previous CC list. diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c --- a/arch/x86/mm/gup.c +++ b/arch/x86/mm/gup.c @@ -89,6 +89,26 @@ static noinline int gup_pte_range(pmd_t VM_BUG_ON(!pfn_valid(pte_pfn(pte))); page = pte_page(pte); get_page(page); + if (PageAnon(page)) { + if (!PageGUP(page)) + SetPageGUP(page); + smp_mb(); + /* + * Fork doesn't want to flush the smp-tlb for + * every pte that it marks readonly but newly + * created shared anon pages cannot have + * direct-io going to them, so check if fork + * made the page shared before we taken the + * page pin. + * de-cow to make direct read from memory safe. + */ + if ((pte_flags(gup_get_pte(ptep)) & + (mask | _PAGE_SPECIAL)) != (mask|_PAGE_RW)) { + put_page(page); + pte_unmap(ptep); + return 0; + } + } pages[*nr] = page; (*nr)++; @@ -98,24 +118,16 @@ static noinline int gup_pte_range(pmd_t return 1; } -static inline void get_head_page_multiple(struct page *page, int nr) -{ - VM_BUG_ON(page != compound_head(page)); - VM_BUG_ON(page_count(page) == 0); - atomic_add(nr, &page->_count); -} - -static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr, - unsigned long end, int write, struct page **pages, int *nr) +static noinline int gup_huge_pmd(pmd_t *pmdp, unsigned long addr, + unsigned long end, struct page **pages, int *nr) { unsigned long mask; - pte_t pte = *(pte_t *)&pmd; + pte_t pte = *(pte_t *)pmdp; struct page *head, *page; int refs; - mask = _PAGE_PRESENT|_PAGE_USER; - if (write) - mask |= _PAGE_RW; + /* de-cow to make direct read from memory safe */ + mask = _PAGE_PRESENT|_PAGE_USER|_PAGE_RW; if ((pte_flags(pte) & mask) != mask) return 
0; /* hugepages are never "special" */ @@ -127,12 +139,21 @@ static noinline int gup_huge_pmd(pmd_t p page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT); do { VM_BUG_ON(compound_head(page) != head); + get_page(head); + if (!PageGUP(head)) + SetPageGUP(head); + smp_mb(); + if ((pte_flags(*(pte_t *)pmdp) & mask) != mask) { + put_page(page); + return 0; + } pages[*nr] = page; (*nr)++; page++; refs++; } while (addr += PAGE_SIZE, addr != end); - get_head_page_multiple(head, refs); + VM_BUG_ON(page_count(head) == 0); + VM_BUG_ON(head != compound_head(head)); return 1; } @@ -151,7 +172,7 @@ static int gup_pmd_range(pud_t pud, unsi if (pmd_none(pmd)) return 0; if (unlikely(pmd_large(pmd))) { - if (!gup_huge_pmd(pmd, addr, next, write, pages, nr)) + if (!gup_huge_pmd(pmdp, addr, next, pages, nr)) return 0; } else { if (!gup_pte_range(pmd, addr, next, write, pages, nr)) @@ -162,17 +183,16 @@ static int gup_pmd_range(pud_t pud, unsi return 1; } -static noinline int gup_huge_pud(pud_t pud, unsigned long addr, - unsigned long end, int write, struct page **pages, int *nr) +static noinline int gup_huge_pud(pud_t *pudp, unsigned long addr, + unsigned long end, struct page **pages, int *nr) { unsigned long mask; - pte_t pte = *(pte_t *)&pud; + pte_t pte = *(pte_t *)pudp; struct page *head, *page; int refs; - mask = _PAGE_PRESENT|_PAGE_USER; - if (write) - mask |= _PAGE_RW; + /* de-cow to make direct read from memory safe */ + mask = _PAGE_PRESENT|_PAGE_USER|_PAGE_RW; if ((pte_flags(pte) & mask) != mask) return 0; /* hugepages are never "special" */ @@ -184,12 +204,21 @@ static noinline int gup_huge_pud(pud_t p page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT); do { VM_BUG_ON(compound_head(page) != head); + get_page(head); + if (!PageGUP(head)) + SetPageGUP(head); + smp_mb(); + if ((pte_flags(*(pte_t *)pudp) & mask) != mask) { + put_page(page); + return 0; + } pages[*nr] = page; (*nr)++; page++; refs++; } while (addr += PAGE_SIZE, addr != end); - get_head_page_multiple(head, refs); + 
VM_BUG_ON(page_count(head) == 0); + VM_BUG_ON(head != compound_head(head)); return 1; } @@ -208,7 +237,7 @@ static int gup_pud_range(pgd_t pgd, unsi if (pud_none(pud)) return 0; if (unlikely(pud_large(pud))) { - if (!gup_huge_pud(pud, addr, next, write, pages, nr)) + if (!gup_huge_pud(pudp, addr, next, pages, nr)) return 0; } else { if (!gup_pmd_range(pud, addr, next, write, pages, nr)) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -20,8 +20,8 @@ int hugetlb_sysctl_handler(struct ctl_ta int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); int hugetlb_overcommit_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); int hugetlb_treat_movable_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); -int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *); -int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int, int); +int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *, struct vm_area_struct *); +int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int); void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, struct page *); void __unmap_hugepage_range(struct vm_area_struct *, @@ -75,9 +75,9 @@ static inline unsigned long hugetlb_tota return 0; } -#define follow_hugetlb_page(m,v,p,vs,a,b,i,w) ({ BUG(); 0; }) +#define follow_hugetlb_page(m,v,p,vs,a,b,i) ({ BUG(); 0; }) #define follow_huge_addr(mm, addr, write) ERR_PTR(-EINVAL) -#define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; }) +#define copy_hugetlb_page_range(src, dst, dst_vma, src_vma) ({ BUG(); 0; }) #define hugetlb_prefault(mapping, vma) ({ BUG(); 0; }) 
#define unmap_hugepage_range(vma, start, end, page) BUG() static inline void hugetlb_report_meminfo(struct seq_file *m) diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -789,7 +789,8 @@ void free_pgd_range(struct mmu_gather *t void free_pgd_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling); int copy_page_range(struct mm_struct *dst, struct mm_struct *src, - struct vm_area_struct *vma); + struct vm_area_struct *dst_vma, + struct vm_area_struct *src_vma); void unmap_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows); int follow_phys(struct vm_area_struct *vma, unsigned long address, @@ -1238,7 +1239,7 @@ int vm_insert_mixed(struct vm_area_struc unsigned long pfn); struct page *follow_page(struct vm_area_struct *, unsigned long address, - unsigned int foll_flags); + unsigned int *foll_flags); #define FOLL_WRITE 0x01 /* check pte is writable */ #define FOLL_TOUCH 0x02 /* mark page accessed */ #define FOLL_GET 0x04 /* do get_page on page */ diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -101,6 +101,7 @@ enum pageflags { #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR PG_uncached, /* Page has been mapped as uncached */ #endif + PG_gup, __NR_PAGEFLAGS, /* Filesystems */ @@ -195,6 +196,7 @@ PAGEFLAG(Private, private) __CLEARPAGEFL PAGEFLAG(Private, private) __CLEARPAGEFLAG(Private, private) __SETPAGEFLAG(Private, private) PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) +PAGEFLAG(GUP, gup) __CLEARPAGEFLAG(GUP, gup) __PAGEFLAG(SlobPage, slob_page) __PAGEFLAG(SlobFree, slob_free) diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -353,7 +353,7 @@ static int dup_mmap(struct mm_struct *mm rb_parent = &tmp->vm_rb; mm->map_count++; - retval = copy_page_range(mm, oldmm, 
mpnt); + retval = copy_page_range(mm, oldmm, tmp, mpnt); if (tmp->vm_ops && tmp->vm_ops->open) tmp->vm_ops->open(tmp); diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1695,20 +1695,37 @@ static void set_huge_ptep_writable(struc } } +/* Return the pagecache page at a given address within a VMA */ +static struct page *hugetlbfs_pagecache_page(struct hstate *h, + struct vm_area_struct *vma, unsigned long address) +{ + struct address_space *mapping; + pgoff_t idx; + + mapping = vma->vm_file->f_mapping; + idx = vma_hugecache_offset(h, vma, address); + + return find_lock_page(mapping, idx); +} + +static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pte_t *ptep, pte_t pte, + struct page *pagecache_page); int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, - struct vm_area_struct *vma) + struct vm_area_struct *dst_vma, + struct vm_area_struct *src_vma) { - pte_t *src_pte, *dst_pte, entry; + pte_t *src_pte, *dst_pte, entry, orig_entry; struct page *ptepage; unsigned long addr; - int cow; - struct hstate *h = hstate_vma(vma); + int cow, forcecow, oom; + struct hstate *h = hstate_vma(src_vma); unsigned long sz = huge_page_size(h); - cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE; + cow = (src_vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE; - for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) { + for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) { src_pte = huge_pte_offset(src, addr); if (!src_pte) continue; @@ -1720,22 +1737,76 @@ int copy_hugetlb_page_range(struct mm_st if (dst_pte == src_pte) continue; + oom = 0; spin_lock(&dst->page_table_lock); spin_lock_nested(&src->page_table_lock, SINGLE_DEPTH_NESTING); - if (!huge_pte_none(huge_ptep_get(src_pte))) { - if (cow) - huge_ptep_set_wrprotect(src, addr, src_pte); - entry = huge_ptep_get(src_pte); + orig_entry = entry = huge_ptep_get(src_pte); + forcecow = 0; + if 
(!huge_pte_none(entry)) { ptepage = pte_page(entry); get_page(ptepage); + if (cow && pte_write(entry)) { + huge_ptep_set_wrprotect(src, addr, src_pte); + smp_mb(); + if (PageGUP(ptepage)) + forcecow = 1; + entry = huge_ptep_get(src_pte); + } set_huge_pte_at(dst, addr, dst_pte, entry); } spin_unlock(&src->page_table_lock); + if (forcecow) { + if (unlikely(vma_needs_reservation(h, dst_vma, addr) + < 0)) + oom = 1; + else { + struct page *pg; + int cow_ret; + spin_unlock(&dst->page_table_lock); + /* force atomic copy from parent to child */ + flush_tlb_range(src_vma, addr, addr+sz); + /* + * Can use hstate from src_vma and src_vma + * because the hugetlbfs pagecache will + * be the same for both src_vma and dst_vma. + */ + pg = hugetlbfs_pagecache_page(h, + src_vma, + addr); + spin_lock_nested(&dst->page_table_lock, + SINGLE_DEPTH_NESTING); + cow_ret = hugetlb_cow(dst, dst_vma, addr, + dst_pte, entry, + pg); + /* + * We hold mmap_sem in write mode and + * the VM doesn't know about hugepages + * so the src_pte/dst_pte can't change + * from under us even without both + * page_table_lock hold the whole time. + */ + BUG_ON(!pte_same(huge_ptep_get(src_pte), + entry)); + set_huge_pte_at(src, addr, + src_pte, + orig_entry); + if (cow_ret) + oom = 1; + } + } spin_unlock(&dst->page_table_lock); + if (oom) + goto nomem; } return 0; nomem: + /* + * Want this to also be able to return -ENOSPC? Then stop the + * mess of mixing -VM_FAULT_ and -ENOSPC retvals and be + * consistent returning -ENOMEM instead of -VM_FAULT_OOM in + * alloc_huge_page. 
+ */ return -ENOMEM; } @@ -1943,19 +2014,6 @@ retry_avoidcopy: return 0; } -/* Return the pagecache page at a given address within a VMA */ -static struct page *hugetlbfs_pagecache_page(struct hstate *h, - struct vm_area_struct *vma, unsigned long address) -{ - struct address_space *mapping; - pgoff_t idx; - - mapping = vma->vm_file->f_mapping; - idx = vma_hugecache_offset(h, vma, address); - - return find_lock_page(mapping, idx); -} - static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pte_t *ptep, int write_access) { @@ -2160,8 +2218,7 @@ static int huge_zeropage_ok(pte_t *ptep, int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, struct page **pages, struct vm_area_struct **vmas, - unsigned long *position, int *length, int i, - int write) + unsigned long *position, int *length, int i) { unsigned long pfn_offset; unsigned long vaddr = *position; @@ -2181,16 +2238,16 @@ int follow_hugetlb_page(struct mm_struct * first, for the page indexing below to work. */ pte = huge_pte_offset(mm, vaddr & huge_page_mask(h)); - if (huge_zeropage_ok(pte, write, shared)) + if (huge_zeropage_ok(pte, 1, shared)) zeropage_ok = 1; if (!pte || (huge_pte_none(huge_ptep_get(pte)) && !zeropage_ok) || - (write && !pte_write(huge_ptep_get(pte)))) { + !pte_write(huge_ptep_get(pte))) { int ret; spin_unlock(&mm->page_table_lock); - ret = hugetlb_fault(mm, vma, vaddr, write); + ret = hugetlb_fault(mm, vma, vaddr, 1); spin_lock(&mm->page_table_lock); if (!(ret & VM_FAULT_ERROR)) continue; @@ -2207,8 +2264,11 @@ same_page: if (pages) { if (zeropage_ok) pages[i] = ZERO_PAGE(0); - else + else { pages[i] = mem_map_offset(page, pfn_offset); + if (!PageGUP(page)) + SetPageGUP(page); + } get_page(pages[i]); } diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -538,14 +538,16 @@ out: * covered by this vma. 
*/ -static inline void +static inline int copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma, + pte_t *dst_pte, pte_t *src_pte, + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long addr, int *rss) { - unsigned long vm_flags = vma->vm_flags; + unsigned long vm_flags = src_vma->vm_flags; pte_t pte = *src_pte; struct page *page; + int forcecow = 0; /* pte contains position in swap or file, so copy. */ if (unlikely(!pte_present(pte))) { @@ -576,15 +578,6 @@ copy_one_pte(struct mm_struct *dst_mm, s } /* - * If it's a COW mapping, write protect it both - * in the parent and the child - */ - if (is_cow_mapping(vm_flags)) { - ptep_set_wrprotect(src_mm, addr, src_pte); - pte = pte_wrprotect(pte); - } - - /* * If it's a shared mapping, mark it clean in * the child */ @@ -592,27 +585,87 @@ copy_one_pte(struct mm_struct *dst_mm, s pte = pte_mkclean(pte); pte = pte_mkold(pte); - page = vm_normal_page(vma, addr, pte); + /* + * If it's a COW mapping, write protect it both + * in the parent and the child. + */ + if (is_cow_mapping(vm_flags) && pte_write(pte)) { + /* + * Serialization against gup-fast happens by + * wrprotecting the pte and checking the PG_gup flag + * and the number of page pins after that. If gup-fast + * boosts the page_count after we checked it, it will + * also take the slow path because it will find the + * pte wrprotected. 
+ */ + ptep_set_wrprotect(src_mm, addr, src_pte); + } + + page = vm_normal_page(src_vma, addr, pte); if (page) { get_page(page); - page_dup_rmap(page, vma, addr); + page_dup_rmap(page, dst_vma, addr); + if (is_cow_mapping(vm_flags) && pte_write(pte) && + PageAnon(page)) { + smp_mb(); + if (PageGUP(page)) { + if (unlikely(!trylock_page(page))) + forcecow = 1; + else { + BUG_ON(page_mapcount(page) != 2); + if (unlikely(page_count(page) != + page_mapcount(page) + + !!PageSwapCache(page))) + forcecow = 1; + unlock_page(page); + } + } + } rss[!!PageAnon(page)]++; + } + + if (is_cow_mapping(vm_flags) && pte_write(pte)) { + pte = pte_wrprotect(pte); + if (forcecow) { + /* force atomic copy from parent to child */ + flush_tlb_page(src_vma, addr); + /* + * Don't set the dst_pte here to be + * safer, as fork_pre_cow might return + * -EAGAIN and restart. + */ + goto out; + } } out_set_pte: set_pte_at(dst_mm, addr, dst_pte, pte); +out: + return forcecow; } +static int fork_pre_cow(struct mm_struct *dst_mm, + struct mm_struct *src_mm, + struct vm_area_struct *dst_vma, + struct vm_area_struct *src_vma, + unsigned long address, + pte_t **dst_ptep, pte_t **src_ptep, + spinlock_t **dst_ptlp, spinlock_t **src_ptlp, + pmd_t *dst_pmd, pmd_t *src_pmd); + static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma, + pmd_t *dst_pmd, pmd_t *src_pmd, + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long addr, unsigned long end) { pte_t *src_pte, *dst_pte; spinlock_t *src_ptl, *dst_ptl; int progress = 0; int rss[2]; + int forcecow; again: + forcecow = 0; rss[1] = rss[0] = 0; dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl); if (!dst_pte) @@ -623,6 +676,9 @@ again: arch_enter_lazy_mmu_mode(); do { + if (forcecow) + break; + /* * We are holding two locks at this point - either of them * could generate latencies in another task on another CPU. 
@@ -637,9 +693,38 @@ again: progress++; continue; } - copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss); + forcecow = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, + dst_vma, src_vma, addr, rss); progress += 8; } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end); + + if (unlikely(forcecow)) { + pte_t *_src_pte = src_pte-1, *_dst_pte = dst_pte-1; + /* + * Try to COW the child page as direct I/O is working + * on the parent page, and so we've to mark the parent + * pte read-write before dropping the PT lock and + * mmap_sem to avoid the page to be cowed in the + * parent and any direct I/O to get lost. + */ + forcecow = fork_pre_cow(dst_mm, src_mm, + dst_vma, src_vma, + addr-PAGE_SIZE, + &_dst_pte, &_src_pte, + &dst_ptl, &src_ptl, + dst_pmd, src_pmd); + src_pte = _src_pte + 1; + dst_pte = _dst_pte + 1; + /* after the page copy set the parent pte writeable again */ + set_pte_at(src_mm, addr-PAGE_SIZE, src_pte-1, + pte_mkwrite(*(src_pte-1))); + if (unlikely(forcecow == -EAGAIN)) { + dst_pte--; + src_pte--; + addr -= PAGE_SIZE; + rss[1]--; + } + } arch_leave_lazy_mmu_mode(); spin_unlock(src_ptl); @@ -647,13 +732,16 @@ again: add_mm_rss(dst_mm, rss[0], rss[1]); pte_unmap_unlock(dst_pte - 1, dst_ptl); cond_resched(); + if (unlikely(forcecow == -ENOMEM)) + return -ENOMEM; if (addr != end) goto again; return 0; } static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma, + pud_t *dst_pud, pud_t *src_pud, + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long addr, unsigned long end) { pmd_t *src_pmd, *dst_pmd; @@ -668,14 +756,15 @@ static inline int copy_pmd_range(struct if (pmd_none_or_clear_bad(src_pmd)) continue; if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd, - vma, addr, next)) + dst_vma, src_vma, addr, next)) return -ENOMEM; } while (dst_pmd++, src_pmd++, addr = next, addr != end); return 0; } static inline int 
copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma, + pgd_t *dst_pgd, pgd_t *src_pgd, + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long addr, unsigned long end) { pud_t *src_pud, *dst_pud; @@ -690,19 +779,20 @@ static inline int copy_pud_range(struct if (pud_none_or_clear_bad(src_pud)) continue; if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud, - vma, addr, next)) + dst_vma, src_vma, addr, next)) return -ENOMEM; } while (dst_pud++, src_pud++, addr = next, addr != end); return 0; } int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - struct vm_area_struct *vma) + struct vm_area_struct *dst_vma, + struct vm_area_struct *src_vma) { pgd_t *src_pgd, *dst_pgd; unsigned long next; - unsigned long addr = vma->vm_start; - unsigned long end = vma->vm_end; + unsigned long addr = src_vma->vm_start; + unsigned long end = src_vma->vm_end; int ret; /* @@ -711,20 +801,21 @@ int copy_page_range(struct mm_struct *ds * readonly mappings. The tradeoff is that copy_page_range is more * efficient than faulting. */ - if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) { - if (!vma->anon_vma) + if (!(src_vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) { + if (!src_vma->anon_vma) return 0; } - if (is_vm_hugetlb_page(vma)) - return copy_hugetlb_page_range(dst_mm, src_mm, vma); + if (is_vm_hugetlb_page(src_vma)) + return copy_hugetlb_page_range(dst_mm, src_mm, + dst_vma, src_vma); - if (unlikely(is_pfn_mapping(vma))) { + if (unlikely(is_pfn_mapping(src_vma))) { /* * We do not free on error cases below as remove_vma * gets called on error from higher level routine */ - ret = track_pfn_vma_copy(vma); + ret = track_pfn_vma_copy(src_vma); if (ret) return ret; } @@ -735,7 +826,7 @@ int copy_page_range(struct mm_struct *ds * parent mm. And a permission downgrade will only happen if * is_cow_mapping() returns true. 
*/ - if (is_cow_mapping(vma->vm_flags)) + if (is_cow_mapping(src_vma->vm_flags)) mmu_notifier_invalidate_range_start(src_mm, addr, end); ret = 0; @@ -746,15 +837,15 @@ int copy_page_range(struct mm_struct *ds if (pgd_none_or_clear_bad(src_pgd)) continue; if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, - vma, addr, next))) { + dst_vma, src_vma, addr, next))) { ret = -ENOMEM; break; } } while (dst_pgd++, src_pgd++, addr = next, addr != end); - if (is_cow_mapping(vma->vm_flags)) + if (is_cow_mapping(src_vma->vm_flags)) mmu_notifier_invalidate_range_end(src_mm, - vma->vm_start, end); + src_vma->vm_start, end); return ret; } @@ -1091,7 +1182,7 @@ EXPORT_SYMBOL_GPL(zap_vma_ptes); * Do a quick page-table lookup for a single page. */ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, - unsigned int flags) + unsigned int *flagsp) { pgd_t *pgd; pud_t *pud; @@ -1100,6 +1191,7 @@ struct page *follow_page(struct vm_area_ spinlock_t *ptl; struct page *page; struct mm_struct *mm = vma->vm_mm; + unsigned long flags = *flagsp; page = follow_huge_addr(mm, address, flags & FOLL_WRITE); if (!IS_ERR(page)) { @@ -1145,8 +1237,19 @@ struct page *follow_page(struct vm_area_ if (unlikely(!page)) goto bad_page; - if (flags & FOLL_GET) + if (flags & FOLL_GET) { + if (PageAnon(page)) { + /* de-cow to make direct read from memory safe */ + if (!pte_write(pte)) { + page = NULL; + *flagsp |= FOLL_WRITE; + goto unlock; + } + if (!PageGUP(page)) + SetPageGUP(page); + } get_page(page); + } if (flags & FOLL_TOUCH) { if ((flags & FOLL_WRITE) && !pte_dirty(pte) && !PageDirty(page)) @@ -1275,7 +1378,7 @@ int __get_user_pages(struct task_struct if (is_vm_hugetlb_page(vma)) { i = follow_hugetlb_page(mm, vma, pages, vmas, - &start, &len, i, write); + &start, &len, i); continue; } @@ -1303,7 +1406,7 @@ int __get_user_pages(struct task_struct foll_flags |= FOLL_WRITE; cond_resched(); - while (!(page = follow_page(vma, start, foll_flags))) { + while (!(page = 
follow_page(vma, start, &foll_flags))) { int ret; ret = handle_mm_fault(mm, vma, start, foll_flags & FOLL_WRITE); @@ -1865,6 +1968,81 @@ static inline void cow_user_page(struct flush_dcache_page(dst); } else copy_user_highpage(dst, src, va, vma); +} + +static int fork_pre_cow(struct mm_struct *dst_mm, + struct mm_struct *src_mm, + struct vm_area_struct *dst_vma, + struct vm_area_struct *src_vma, + unsigned long address, + pte_t **dst_ptep, pte_t **src_ptep, + spinlock_t **dst_ptlp, spinlock_t **src_ptlp, + pmd_t *dst_pmd, pmd_t *src_pmd) +{ + pte_t _src_pte, _dst_pte; + struct page *old_page, *new_page; + + _src_pte = **src_ptep; + _dst_pte = **dst_ptep; + old_page = vm_normal_page(src_vma, address, **src_ptep); + BUG_ON(!old_page); + get_page(old_page); + arch_leave_lazy_mmu_mode(); + spin_unlock(*src_ptlp); + pte_unmap_nested(*src_ptep); + pte_unmap_unlock(*dst_ptep, *dst_ptlp); + + new_page = alloc_page_vma(GFP_HIGHUSER, dst_vma, address); + if (unlikely(!new_page)) { + *dst_ptep = pte_offset_map_lock(dst_mm, dst_pmd, address, + dst_ptlp); + *src_ptep = pte_offset_map_nested(src_pmd, address); + *src_ptlp = pte_lockptr(src_mm, src_pmd); + spin_lock_nested(*src_ptlp, SINGLE_DEPTH_NESTING); + arch_enter_lazy_mmu_mode(); + return -ENOMEM; + } + cow_user_page(new_page, old_page, address, dst_vma); + + *dst_ptep = pte_offset_map_lock(dst_mm, dst_pmd, address, dst_ptlp); + *src_ptep = pte_offset_map_nested(src_pmd, address); + *src_ptlp = pte_lockptr(src_mm, src_pmd); + spin_lock_nested(*src_ptlp, SINGLE_DEPTH_NESTING); + arch_enter_lazy_mmu_mode(); + + /* + * src pte can unmapped by the VM from under us after dropping + * the src_ptlp but it can't be cowed from under us as fork + * holds the mmap_sem in write mode. 
+ */ + if (!pte_same(**src_ptep, _src_pte)) + goto eagain; + if (!pte_same(**dst_ptep, _dst_pte)) + goto eagain; + + page_remove_rmap(old_page); + page_cache_release(old_page); + page_cache_release(old_page); + + __SetPageUptodate(new_page); + flush_cache_page(src_vma, address, pte_pfn(**src_ptep)); + _dst_pte = mk_pte(new_page, dst_vma->vm_page_prot); + _dst_pte = maybe_mkwrite(pte_mkdirty(_dst_pte), dst_vma); + page_add_new_anon_rmap(new_page, dst_vma, address); + set_pte_at(dst_mm, address, *dst_ptep, _dst_pte); + update_mmu_cache(dst_vma, address, _dst_pte); + return 0; + +eagain: + page_cache_release(old_page); + page_cache_release(new_page); + /* + * Later we'll repeat the copy of this pte, so here we've to + * undo the mapcount and page count taken in copy_one_pte. + */ + page_remove_rmap(old_page); + page_cache_release(old_page); + return -EAGAIN; } /* diff --git a/mm/swap.c b/mm/swap.c --- a/mm/swap.c +++ b/mm/swap.c @@ -64,6 +64,8 @@ static void put_compound_page(struct pag if (put_page_testzero(page)) { compound_page_dtor *dtor; + if (PageGUP(page)) + __ClearPageGUP(page); dtor = get_compound_page_dtor(page); (*dtor)(page); } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id EC9F06B0047 for ; Mon, 16 Mar 2009 12:01:54 -0400 (EDT) From: Nick Piggin Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Date: Tue, 17 Mar 2009 03:01:42 +1100 References: <20090311170611.GA2079@elte.hu> <200903141559.12484.nickpiggin@yahoo.com.au> <20090316135654.GA17949@random.random> In-Reply-To: <20090316135654.GA17949@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200903170301.43091.nickpiggin@yahoo.com.au> Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: Linus Torvalds , Ingo Molnar , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tuesday 17 March 2009 00:56:54 Andrea Arcangeli wrote: > On Sat, Mar 14, 2009 at 03:59:11PM +1100, Nick Piggin wrote: > > It does touch gup-fast, but it just adds one branch and no barrier in the > > My question is what trick do you use to stop gup-fast from returning > the page mapped read-write by the pte if gup-fast doesn't take any > lock whatsoever, it doesn't set any bit in any page or vma, and it > doesn't recheck the pte is still viable after having set any bit on > page or vmas, and you still don't send a flood of ipis from fork fast > path (no race case). If the page is not marked PageDontCOW, then it decows it, which gives synchronisation against fork. If it is marked PageDontCOW, then it can't possibly be COWed by fork, previous or subsequent. > > Possibly that's the right way to go. Depends if it is in the slightest > > performance critical. If not, I would just let do_wp_page do the work > > to avoid a little bit of logic, but either way is not a big deal to me.
> > fork is less performance critical than do_wp_page, still in fork > microbenchmark no slowdown is measured with the patch. Before I > introduced PG_gup there were false positives triggered by the pagevec > temporary pins, that was measurable, after PG_gup the fast path is OK. Mine doesn't get false positives, but it doesn't try to reintroduce pages as COW candidates after the get_user_pages is finished. This is how it is simpler than your patch. > unaffected (I've still to measure gup-fast slowdown in setting PG_gup > but I'm rather optimistic that you're underestimating the cost of > walking 4 layers of pagetables compared to a locked op on a l1 > exclusive cacheline, so I think it'll be lost in the noise). I think > the big thing of gup-fast is primarily in not having to search vmas, > and in turn to take any shared lock like mmap_sem/PT lock and to scale > on a page level with just a get-page being the troublesome cacheline. You lost the get_head_page_multiple too for huge pages. This is the path that Oracle/DB2 will always go down when running any benchmarks. At the current DIO_PAGES size, this means adding up to 63 atomics, 64 mfences, and touching cachelines of 63-64 of the non-head struct pages per request. OK probably even those databases don't get a chance to do such big IOs, but they definitely will be doing larger than 4K at a time in many cases (probably even their internal block size can be larger). > > One side of the race is direct IO read writing to fork child page. > > The other side of the race is fork child page write leaking into > > the direct IO. > > > > My patch solves both sides by de-cowing *any* COW page before it > > may be returned from get_user_pages (for read or write). > > I see what you mean now. If you read the comment of my patch you'll > see I explicitly intended that only people writing into memory with > gup was troublesome here.
Like you point out, using gup for _reading_ > from memory is troublesome as well if child writes to those > pages. This is kind of a lower problem because the major issue is that > fork is enough to generate memory corruption even if the child isn't > touching those pages. The reverse race requires the child to write to > those pages so I guess it never triggered in real life apps. But > nevertheless I totally agree if we fix the write-to-memory-with-gup > we've to fix the read-from-memory-with-gup. Yes. > Below I updated my patch and relative commit header to fix the reverse > race too. However I had to enlarge the buffer to 40M to reproduce with > your testcase because my HD was too fast otherwise. You're using a solid state disk? :) > diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c > --- a/arch/x86/mm/gup.c > +++ b/arch/x86/mm/gup.c > @@ -89,6 +89,26 @@ static noinline int gup_pte_range(pmd_t > VM_BUG_ON(!pfn_valid(pte_pfn(pte))); > page = pte_page(pte); > get_page(page); > + if (PageAnon(page)) { > + if (!PageGUP(page)) > + SetPageGUP(page); > + smp_mb(); > + /* > + * Fork doesn't want to flush the smp-tlb for > + * every pte that it marks readonly but newly > + * created shared anon pages cannot have > + * direct-io going to them, so check if fork > + * made the page shared before we taken the > + * page pin. > + * de-cow to make direct read from memory safe. > + */ > + if ((pte_flags(gup_get_pte(ptep)) & > + (mask | _PAGE_SPECIAL)) != (mask|_PAGE_RW)) { > + put_page(page); > + pte_unmap(ptep); > + return 0; Hmm, so this is disabling fast-gup for RO anonymous ranges? I guess this seems like it covers the reverse race then... btw powerpc has a slightly different fast-gup scheme where it isn't actually holding off TLB shootdown. I don't think you need to do anything too different, but better double check. And here is my improved patch. Same logic but just streamlines the decow stuff a bit and cuts out some unneeded stuff. 
This should be pretty complete for 4K pages. Except I'm a little unsure about the "ptes don't match, retry" path of the decow procedure. Lots of tricky little details to get right... And I'm not quite sure that you got this right either -- vmscan.c can turn the child pte into a swap pte here, right? In which case I think you need to drop its swapcache entry don't you? I don't know if there are other ways it could be changed, but I import the full zap_pte function over just in case. -- Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h 2009-03-14 02:48:06.000000000 +1100 +++ linux-2.6/include/linux/mm.h 2009-03-17 00:37:59.000000000 +1100 @@ -789,7 +789,7 @@ int walk_page_range(unsigned long addr, void free_pgd_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling); int copy_page_range(struct mm_struct *dst, struct mm_struct *src, - struct vm_area_struct *vma); + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma); void unmap_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows); int follow_phys(struct vm_area_struct *vma, unsigned long address, Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c 2009-03-14 02:48:06.000000000 +1100 +++ linux-2.6/mm/memory.c 2009-03-17 02:43:21.000000000 +1100 @@ -533,12 +533,171 @@ out: } /* + * Do pte_mkwrite, but only if the vma says VM_WRITE. We do this when + * servicing faults for write access. In the normal case, do always want + * pte_mkwrite. But get_user_pages can cause write faults for mappings + * that do not have writing enabled, when used by access_process_vm. 
+ */ +static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) +{ + if (likely(vma->vm_flags & VM_WRITE)) + pte = pte_mkwrite(pte); + return pte; +} + +static void cow_user_page(struct page *dst, struct page *src, + unsigned long va, struct vm_area_struct *vma) +{ + /* + * If the source page was a PFN mapping, we don't have + * a "struct page" for it. We do a best-effort copy by + * just copying from the original user address. If that + * fails, we just zero-fill it. Live with it. + */ + if (unlikely(!src)) { + void *kaddr = kmap_atomic(dst, KM_USER0); + void __user *uaddr = (void __user *)(va & PAGE_MASK); + + /* + * This really shouldn't fail, because the page is there + * in the page tables. But it might just be unreadable, + * in which case we just give up and fill the result with + * zeroes. + */ + if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) + memset(kaddr, 0, PAGE_SIZE); + kunmap_atomic(kaddr, KM_USER0); + flush_dcache_page(dst); + } else + copy_user_highpage(dst, src, va, vma); +} + +void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep) +{ + pte_t pte = *ptep; + + if (pte_present(pte)) { + struct page *page; + + flush_cache_page(vma, addr, pte_pfn(pte)); + pte = ptep_clear_flush(vma, addr, ptep); + page = vm_normal_page(vma, addr, pte); + if (page) { + if (pte_dirty(pte)) + set_page_dirty(page); + page_remove_rmap(page); + page_cache_release(page); + update_hiwater_rss(mm); + if (PageAnon(page)) + dec_mm_counter(mm, anon_rss); + else + dec_mm_counter(mm, file_rss); + } + } else { + if (!pte_file(pte)) + free_swap_and_cache(pte_to_swp_entry(pte)); + pte_clear_not_present_full(mm, addr, ptep, 0); + } +} +/* + * breaks COW of child pte that has been marked COW by fork(). + * Must be called with the child's ptl held and pte mapped. + * Returns 0 on success with ptl held and pte mapped. + * -ENOMEM on OOM failure, or -EAGAIN if something changed under us. 
+ * ptl dropped and pte unmapped on error cases. + */ +static noinline int decow_one_pte(struct mm_struct *mm, pte_t *ptep, pmd_t *pmd, + spinlock_t *ptl, struct vm_area_struct *vma, + unsigned long address) +{ + pte_t pte = *ptep; + struct page *page, *new_page; + int ret; + + BUG_ON(!pte_present(pte)); + BUG_ON(pte_write(pte)); + + page = vm_normal_page(vma, address, pte); + BUG_ON(!page); + BUG_ON(!PageAnon(page)); + BUG_ON(!PageDontCOW(page)); + + /* The following code comes from do_wp_page */ + page_cache_get(page); + pte_unmap_unlock(pte, ptl); + + if (unlikely(anon_vma_prepare(vma))) + goto oom; + VM_BUG_ON(page == ZERO_PAGE(0)); + new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); + if (!new_page) + goto oom; + /* + * Don't let another task, with possibly unlocked vma, + * keep the mlocked page. + */ + if (vma->vm_flags & VM_LOCKED) { + lock_page(page); /* for LRU manipulation */ + clear_page_mlock(page); + unlock_page(page); + } + cow_user_page(new_page, page, address, vma); + __SetPageUptodate(new_page); + + if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL)) + goto oom_free_new; + + /* + * Re-check the pte - we dropped the lock + */ + ptep = pte_offset_map_lock(mm, pmd, address, &ptl); + if (pte_same(*ptep, pte)) { + pte_t entry; + + flush_cache_page(vma, address, pte_pfn(pte)); + entry = mk_pte(new_page, vma->vm_page_prot); + entry = maybe_mkwrite(pte_mkdirty(entry), vma); + /* + * Clear the pte entry and flush it first, before updating the + * pte with the new entry. This will avoid a race condition + * seen in the presence of one thread doing SMC and another + * thread doing COW. 
+ */ + ptep_clear_flush_notify(vma, address, ptep); + page_add_new_anon_rmap(new_page, vma, address); + set_pte_at(mm, address, ptep, entry); + + /* See comment in do_wp_page */ + page_remove_rmap(page); + page_cache_release(page); + ret = 0; + } else { + if (!pte_none(*ptep)) + zap_pte(mm, vma, address, ptep); + pte_unmap_unlock(pte, ptl); + mem_cgroup_uncharge_page(new_page); + page_cache_release(new_page); + ret = -EAGAIN; + } + page_cache_release(page); + + return ret; + +oom_free_new: + page_cache_release(new_page); +oom: + page_cache_release(page); + return -ENOMEM; +} + +/* * copy one vm_area from one task to the other. Assumes the page tables * already present in the new task to be cleared in the whole range * covered by this vma. */ -static inline void +static inline int copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma, unsigned long addr, int *rss) @@ -546,6 +705,7 @@ copy_one_pte(struct mm_struct *dst_mm, s unsigned long vm_flags = vma->vm_flags; pte_t pte = *src_pte; struct page *page; + int ret = 0; /* pte contains position in swap or file, so copy. 
*/ if (unlikely(!pte_present(pte))) { @@ -597,20 +757,26 @@ copy_one_pte(struct mm_struct *dst_mm, s get_page(page); page_dup_rmap(page, vma, addr); rss[!!PageAnon(page)]++; + if (unlikely(PageDontCOW(page))) + ret = 1; } out_set_pte: set_pte_at(dst_mm, addr, dst_pte, pte); + + return ret; } static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma, + pmd_t *dst_pmd, pmd_t *src_pmd, + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long addr, unsigned long end) { pte_t *src_pte, *dst_pte; spinlock_t *src_ptl, *dst_ptl; int progress = 0; int rss[2]; + int decow; again: rss[1] = rss[0] = 0; @@ -637,7 +803,10 @@ again: progress++; continue; } - copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss); + decow = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, + src_vma, addr, rss); + if (unlikely(decow)) + goto decow; progress += 8; } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end); @@ -646,14 +815,31 @@ again: pte_unmap_nested(src_pte - 1); add_mm_rss(dst_mm, rss[0], rss[1]); pte_unmap_unlock(dst_pte - 1, dst_ptl); +next: cond_resched(); if (addr != end) goto again; return 0; + +decow: + arch_leave_lazy_mmu_mode(); + spin_unlock(src_ptl); + pte_unmap_nested(src_pte); + add_mm_rss(dst_mm, rss[0], rss[1]); + decow = decow_one_pte(dst_mm, dst_pte, dst_pmd, dst_ptl, dst_vma, addr); + if (decow == -ENOMEM) + return -ENOMEM; + if (decow == -EAGAIN) + goto again; + pte_unmap_unlock(dst_pte, dst_ptl); + cond_resched(); + addr += PAGE_SIZE; + goto next; } static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma, + pud_t *dst_pud, pud_t *src_pud, + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long addr, unsigned long end) { pmd_t *src_pmd, *dst_pmd; @@ -668,14 +854,15 @@ static inline int copy_pmd_range(struct if 
(pmd_none_or_clear_bad(src_pmd)) continue; if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd, - vma, addr, next)) + dst_vma, src_vma, addr, next)) return -ENOMEM; } while (dst_pmd++, src_pmd++, addr = next, addr != end); return 0; } static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma, + pgd_t *dst_pgd, pgd_t *src_pgd, + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long addr, unsigned long end) { pud_t *src_pud, *dst_pud; @@ -690,19 +877,19 @@ static inline int copy_pud_range(struct if (pud_none_or_clear_bad(src_pud)) continue; if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud, - vma, addr, next)) + dst_vma, src_vma, addr, next)) return -ENOMEM; } while (dst_pud++, src_pud++, addr = next, addr != end); return 0; } int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - struct vm_area_struct *vma) + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) { pgd_t *src_pgd, *dst_pgd; unsigned long next; - unsigned long addr = vma->vm_start; - unsigned long end = vma->vm_end; + unsigned long addr = src_vma->vm_start; + unsigned long end = src_vma->vm_end; int ret; /* @@ -711,20 +898,20 @@ int copy_page_range(struct mm_struct *ds * readonly mappings. The tradeoff is that copy_page_range is more * efficient than faulting. 
*/ - if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) { - if (!vma->anon_vma) + if (!(src_vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) { + if (!src_vma->anon_vma) return 0; } - if (is_vm_hugetlb_page(vma)) - return copy_hugetlb_page_range(dst_mm, src_mm, vma); + if (is_vm_hugetlb_page(src_vma)) + return copy_hugetlb_page_range(dst_mm, src_mm, src_vma); - if (unlikely(is_pfn_mapping(vma))) { + if (unlikely(is_pfn_mapping(src_vma))) { /* * We do not free on error cases below as remove_vma * gets called on error from higher level routine */ - ret = track_pfn_vma_copy(vma); + ret = track_pfn_vma_copy(src_vma); if (ret) return ret; } @@ -735,7 +922,7 @@ int copy_page_range(struct mm_struct *ds * parent mm. And a permission downgrade will only happen if * is_cow_mapping() returns true. */ - if (is_cow_mapping(vma->vm_flags)) + if (is_cow_mapping(src_vma->vm_flags)) mmu_notifier_invalidate_range_start(src_mm, addr, end); ret = 0; @@ -746,15 +933,16 @@ int copy_page_range(struct mm_struct *ds if (pgd_none_or_clear_bad(src_pgd)) continue; if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, - vma, addr, next))) { + dst_vma, src_vma, addr, next))) { ret = -ENOMEM; break; } } while (dst_pgd++, src_pgd++, addr = next, addr != end); - if (is_cow_mapping(vma->vm_flags)) + if (is_cow_mapping(src_vma->vm_flags)) mmu_notifier_invalidate_range_end(src_mm, - vma->vm_start, end); + src_vma->vm_start, end); + return ret; } @@ -1199,8 +1387,6 @@ static inline int use_zero_page(struct v return !vma->vm_ops || !vma->vm_ops->fault; } - - int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int len, int flags, struct page **pages, struct vm_area_struct **vmas) @@ -1225,6 +1411,7 @@ int __get_user_pages(struct task_struct do { struct vm_area_struct *vma; unsigned int foll_flags; + int decow; vma = find_extend_vma(mm, start); if (!vma && in_gate_area(tsk, start)) { @@ -1279,6 +1466,14 @@ int 
__get_user_pages(struct task_struct continue; } + /* + * Except in special cases where the caller will not read to or + * write from these pages, we must break COW for any pages + * returned from get_user_pages, so that our caller does not + * subsequently end up with the pages of a parent or child + * process after a COW takes place. + */ + decow = (pages && is_cow_mapping(vma->vm_flags)); foll_flags = FOLL_TOUCH; if (pages) foll_flags |= FOLL_GET; @@ -1299,7 +1494,7 @@ int __get_user_pages(struct task_struct fatal_signal_pending(current))) return i ? i : -ERESTARTSYS; - if (write) + if (write || decow) foll_flags |= FOLL_WRITE; cond_resched(); @@ -1342,6 +1537,8 @@ int __get_user_pages(struct task_struct if (pages) { pages[i] = page; + if (decow && !PageDontCOW(page)) + SetPageDontCOW(page); flush_anon_page(vma, page, start); flush_dcache_page(page); } @@ -1370,7 +1567,6 @@ int get_user_pages(struct task_struct *t start, len, flags, pages, vmas); } - EXPORT_SYMBOL(get_user_pages); pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr, @@ -1829,45 +2025,6 @@ static inline int pte_unmap_same(struct } /* - * Do pte_mkwrite, but only if the vma says VM_WRITE. We do this when - * servicing faults for write access. In the normal case, do always want - * pte_mkwrite. But get_user_pages can cause write faults for mappings - * that do not have writing enabled, when used by access_process_vm. - */ -static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) -{ - if (likely(vma->vm_flags & VM_WRITE)) - pte = pte_mkwrite(pte); - return pte; -} - -static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma) -{ - /* - * If the source page was a PFN mapping, we don't have - * a "struct page" for it. We do a best-effort copy by - * just copying from the original user address. If that - * fails, we just zero-fill it. Live with it. 
- */ - if (unlikely(!src)) { - void *kaddr = kmap_atomic(dst, KM_USER0); - void __user *uaddr = (void __user *)(va & PAGE_MASK); - - /* - * This really shouldn't fail, because the page is there - * in the page tables. But it might just be unreadable, - * in which case we just give up and fill the result with - * zeroes. - */ - if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) - memset(kaddr, 0, PAGE_SIZE); - kunmap_atomic(kaddr, KM_USER0); - flush_dcache_page(dst); - } else - copy_user_highpage(dst, src, va, vma); -} - -/* * This routine handles present pages, when users try to write * to a shared page. It is done by copying the page to a new address * and decrementing the shared-page counter for the old page. @@ -1930,6 +2087,8 @@ static int do_wp_page(struct mm_struct * } reuse = reuse_swap_page(old_page); unlock_page(old_page); + VM_BUG_ON(PageDontCOW(old_page) && !reuse); + } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED))) { /* @@ -2936,7 +3095,8 @@ int make_pages_present(unsigned long add BUG_ON(end > vma->vm_end); len = DIV_ROUND_UP(end, PAGE_SIZE) - addr/PAGE_SIZE; ret = get_user_pages(current, current->mm, addr, - len, write, 0, NULL, NULL); + len, write, 0, + NULL, NULL); if (ret < 0) return ret; return ret == len ? 
0 : -EFAULT; @@ -3086,7 +3246,7 @@ int access_process_vm(struct task_struct struct page *page = NULL; ret = get_user_pages(tsk, mm, addr, 1, - write, 1, &page, &vma); + 0, 1, &page, &vma); if (ret <= 0) { /* * Check if this is a VM_IO | VM_PFNMAP VMA, which Index: linux-2.6/arch/x86/mm/gup.c =================================================================== --- linux-2.6.orig/arch/x86/mm/gup.c 2009-03-14 02:48:06.000000000 +1100 +++ linux-2.6/arch/x86/mm/gup.c 2009-03-14 16:21:40.000000000 +1100 @@ -83,11 +83,14 @@ static noinline int gup_pte_range(pmd_t struct page *page; if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) { +failed: pte_unmap(ptep); return 0; } VM_BUG_ON(!pfn_valid(pte_pfn(pte))); page = pte_page(pte); + if (PageAnon(page) && unlikely(!PageDontCOW(page))) + goto failed; get_page(page); pages[*nr] = page; (*nr)++; Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h 2009-03-14 02:48:06.000000000 +1100 +++ linux-2.6/include/linux/page-flags.h 2009-03-14 02:48:13.000000000 +1100 @@ -94,6 +94,7 @@ enum pageflags { PG_reclaim, /* To be reclaimed asap */ PG_buddy, /* Page is free, on buddy lists */ PG_swapbacked, /* Page is backed by RAM/swap */ + PG_dontcow, /* PageAnon page in a VM_DONTCOW vma */ #ifdef CONFIG_UNEVICTABLE_LRU PG_unevictable, /* Page is "unevictable" */ PG_mlocked, /* Page is vma mlocked */ @@ -208,6 +209,8 @@ __PAGEFLAG(SlubDebug, slub_debug) */ TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback) __PAGEFLAG(Buddy, buddy) +__PAGEFLAG(DontCOW, dontcow) +SETPAGEFLAG(DontCOW, dontcow) PAGEFLAG(MappedToDisk, mappedtodisk) /* PG_readahead is only used for file reads; PG_reclaim is only for writes */ Index: linux-2.6/kernel/fork.c =================================================================== --- linux-2.6.orig/kernel/fork.c 2009-03-14 02:48:06.000000000 +1100 +++ linux-2.6/kernel/fork.c 2009-03-14 
15:12:09.000000000 +1100 @@ -353,7 +353,7 @@ static int dup_mmap(struct mm_struct *mm rb_parent = &tmp->vm_rb; mm->map_count++; - retval = copy_page_range(mm, oldmm, mpnt); + retval = copy_page_range(mm, oldmm, tmp, mpnt); if (tmp->vm_ops && tmp->vm_ops->open) tmp->vm_ops->open(tmp); Index: linux-2.6/mm/internal.h =================================================================== --- linux-2.6.orig/mm/internal.h 2009-03-13 20:25:00.000000000 +1100 +++ linux-2.6/mm/internal.h 2009-03-17 02:41:48.000000000 +1100 @@ -15,6 +15,8 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma, unsigned long floor, unsigned long ceiling); +void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep); extern void prep_compound_page(struct page *page, unsigned long order); extern void prep_compound_gigantic_page(struct page *page, unsigned long order); Index: linux-2.6/arch/powerpc/mm/gup.c =================================================================== --- linux-2.6.orig/arch/powerpc/mm/gup.c 2009-03-17 01:00:48.000000000 +1100 +++ linux-2.6/arch/powerpc/mm/gup.c 2009-03-17 01:02:10.000000000 +1100 @@ -39,6 +39,8 @@ static noinline int gup_pte_range(pmd_t return 0; VM_BUG_ON(!pfn_valid(pte_pfn(pte))); page = pte_page(pte); + if (PageAnon(page) && unlikely(!PageDontCOW(page))) + return 0; if (!page_cache_get_speculative(page)) return 0; if (unlikely(pte_val(pte) != pte_val(*ptep))) { Index: linux-2.6/mm/fremap.c =================================================================== --- linux-2.6.orig/mm/fremap.c 2009-03-17 02:37:21.000000000 +1100 +++ linux-2.6/mm/fremap.c 2009-03-17 02:42:11.000000000 +1100 @@ -23,32 +23,6 @@ #include "internal.h" -static void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long addr, pte_t *ptep) -{ - pte_t pte = *ptep; - - if (pte_present(pte)) { - struct page *page; - - flush_cache_page(vma, addr, pte_pfn(pte)); - pte = ptep_clear_flush(vma, addr, ptep); - page 
= vm_normal_page(vma, addr, pte); - if (page) { - if (pte_dirty(pte)) - set_page_dirty(page); - page_remove_rmap(page); - page_cache_release(page); - update_hiwater_rss(mm); - dec_mm_counter(mm, file_rss); - } - } else { - if (!pte_file(pte)) - free_swap_and_cache(pte_to_swp_entry(pte)); - pte_clear_not_present_full(mm, addr, ptep, 0); - } -} - /* * Install a file pte to a given virtual memory address, release any * previously existing mapping. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id CF2A56B004D for ; Mon, 16 Mar 2009 12:03:26 -0400 (EDT) Date: Tue, 17 Mar 2009 01:01:42 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: <200903141620.45052.nickpiggin@yahoo.com.au> References: <1237007189.25062.91.camel@pasglop> <200903141620.45052.nickpiggin@yahoo.com.au> Message-Id: <20090316223612.4B2A.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: kosaki.motohiro@jp.fujitsu.com, Benjamin Herrenschmidt , Linus Torvalds , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: Hi > > IB folks so far have been avoiding the fork() trap thanks to > > madvise(MADV_DONTFORK) afaik. And it all goes generally well when the > > whole application knows what it's doing and just plain avoids fork. > > > > -But- things get nasty if for some reason, the user of gup is somewhere > > deep in some kind of library that an application uses without knowing, > > while forking here or there to run shell scripts or other helpers. 
> > > > I've seen it :-) > > > > So if a solution can be found that doesn't uglify the whole thing beyond > > recognition, it's probably worth it. > > AFAIKS, the approach I've posted is probably the simplest (and maybe only > way) to really fix it. It's not too ugly. May I join this discussion? If we only need to be concerned with O_DIRECT, the patch below is enough. Yes, my patch isn't really a solution. Andrea already pointed out that it's not an O_DIRECT issue, it's a gup vs fork issue. *And* my patch is crazy slow :) So my point is merely that I oppose deciding too easily to give up on a fix. I agree that we currently don't have an easy way to fix this, but I believe we can solve the problem completely in the near future, because LKML folks are very cool guys. Thus, I hope we don't have to append to the "BUGS" section of the O_DIRECT man page. I also hope I never have to say "Oh, Solaris can meet your requirement, AIX can, FreeBSD can, but Linux can't". That would hurt my pride as a Linux developer a bit ;) Andrea's patch seems a bit more complex than yours, but I think it can be improved later, whereas a man page change can't be undone. In addition, may I talk about my gup-fast concern? AFAIK, the value of gup-fast is not in removing one atomic operation; not grabbing mmap_sem is the essential part. That is because: - the block layer and I/O drivers also take several locks, so direct I/O performs many atomic operations anyway. The cost of one more atomic operation is not so high. - but mmap_sem is one of the most easily contended locks in Linux, because - almost all modern DB software is multi-threaded. - glibc malloc/free can issue mmap, munmap and mprotect syscalls, and those syscalls take down_write(&mmap_sem). - a page fault also takes down_read(&mmap_sem). - anyway, a userland application can't avoid malloc() and page faults. However, I haven't seen anyone try to munmap() a direct-io region. So it implies mmap_sem could be split into finer-grained locks. (Or can we remove it completely?
iirc PeterZ tried it about two months ago) After that, we could take mmap_sem without a performance regression, and many of the efforts to avoid mmap_sem could be removed. Perhaps I'm saying something silly: gup-fast was introduced to solve a DB2 problem, but I don't have any DB2 development experience. Am I over-optimistic? > You can't easily fix it at write-time by COWing in the right direction like > Linus suggested because at that point you may have multiple get_user_pages > (for read) from the parent and child on the page, so there is no way to COW > it in the right direction. > > You could do something crazy like allowing only one get_user_pages read on a > wp page, and recording which direction to send it if it does get COWed. But > at that point you've got something that's far uglier in the core code and > more complex than what I posted. --- fs/direct-io.c | 2 ++ include/linux/init_task.h | 1 + include/linux/mm_types.h | 3 +++ kernel/fork.c | 3 +++ 4 files changed, 9 insertions(+), 0 deletions(-) diff --git a/fs/direct-io.c b/fs/direct-io.c index b6d4390..8f9a810 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -1206,8 +1206,10 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) && (end > i_size_read(inode))); + down_read(&current->mm->directio_sem); retval = direct_io_worker(rw, iocb, inode, iov, offset, nr_segs, blkbits, get_block, end_io, dio); + up_read(&current->mm->directio_sem); /* * In case of error extending write may have instantiated a few diff --git a/include/linux/init_task.h b/include/linux/init_task.h index e752d97..68e02b9 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -37,6 +37,7 @@ extern struct fs_struct init_fs; .page_table_lock = __SPIN_LOCK_UNLOCKED(name.page_table_lock), \ .mmlist = LIST_HEAD_INIT(name.mmlist), \ .cpu_vm_mask = CPU_MASK_ALL, \ + .directio_sem = __RWSEM_INITIALIZER(name.directio_sem), \ } #define INIT_SIGNALS(sig) { \ diff --git
a/include/linux/mm_types.h b/include/linux/mm_types.h index d84feb7..39ba4e6 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -274,6 +274,9 @@ struct mm_struct { #ifdef CONFIG_MMU_NOTIFIER struct mmu_notifier_mm *mmu_notifier_mm; #endif + + /* if there are on-flight directio, we can't fork. */ + struct rw_semaphore directio_sem; }; /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */ diff --git a/kernel/fork.c b/kernel/fork.c index 4854c2c..bbe9fa7 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -266,6 +266,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) unsigned long charge; struct mempolicy *pol; + down_write(&oldmm->directio_sem); down_write(&oldmm->mmap_sem); flush_cache_dup_mm(oldmm); /* @@ -368,6 +369,7 @@ out: up_write(&mm->mmap_sem); flush_tlb_mm(oldmm); up_write(&oldmm->mmap_sem); + up_write(&oldmm->directio_sem); return retval; fail_nomem_policy: kmem_cache_free(vm_area_cachep, tmp); @@ -431,6 +433,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p) mm->free_area_cache = TASK_UNMAPPED_BASE; mm->cached_hole_size = ~0UL; mm_init_owner(mm, p); + init_rwsem(&mm->directio_sem); if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; -- 1.6.0.6 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 062326B005D for ; Mon, 16 Mar 2009 12:23:53 -0400 (EDT) From: Nick Piggin Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Date: Tue, 17 Mar 2009 03:23:45 +1100 References: <1237007189.25062.91.camel@pasglop> <200903141620.45052.nickpiggin@yahoo.com.au> <20090316223612.4B2A.A69D9226@jp.fujitsu.com> In-Reply-To: <20090316223612.4B2A.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200903170323.45917.nickpiggin@yahoo.com.au> Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Benjamin Herrenschmidt , Linus Torvalds , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tuesday 17 March 2009 03:01:42 KOSAKI Motohiro wrote: > Hi > > > AFAIKS, the approach I've posted is probably the simplest (and maybe only > > way) to really fix it. It's not too ugly. > > May I join this discussion? Of course :) > If we only need to be concerned with O_DIRECT, the patch below is enough. > > Yes, my patch isn't really a solution. > Andrea already pointed out that it's not an O_DIRECT issue, it's a gup vs fork > issue. *And* my patch is crazy slow :) Well, it's an interesting question. I'd say it probably is more than just O_DIRECT. vmsplice too, for example (which I think is much harder to fix this way because the pages are retired by the other end of the pipe, so I don't think you can hold a lock across it). For other device drivers, one could argue that they are "special" and require special knowledge and apps to use MADV_DONTFORK... Ben didn't like that so much, and also some other users of get_user_pages might come up. But your patch is interesting. I don't think it is crazy slow... 
well it might be a bit slow in the case that a threaded app doing a lot of direct IO, or an app doing async IO, forks. But how common is that? I would be slightly more worried about the common cacheline touched to take the read lock for multithreaded direct IO, but I'm not sure how much that will hurt DB2. > So, my point is, I merely oppose an easy decision to give up on fixing it. > > Currently, I agree we don't have an easy way to fix it, > but I believe we can solve this problem completely in the near future > because LKML folks are very cool guys. > > Thus, I hope we don't have to add to the "BUGS" section of the O_DIRECT man page. > Also, I hope I don't have to say "Oh, Solaris can solve your requirement, > AIX can, FreeBSD can, but Linux can't". > It hurts my pride as a Linux developer a bit ;) > > Andrea's patch seems a bit more complex than yours, but I think it can > be improved later, > whereas the man page change can't be undone. > > > In addition, may I talk about my gup-fast concern? > AFAIK, the worth of gup-fast is not removing one atomic operation. > Not grabbing mmap_sem is essential. Yes, mmap_sem is the big thing. But straight line speed is important too. [...] > --- > fs/direct-io.c | 2 ++ > include/linux/init_task.h | 1 + > include/linux/mm_types.h | 3 +++ > kernel/fork.c | 3 +++ > 4 files changed, 9 insertions(+), 0 deletions(-) It is an interesting patch. Thanks for throwing it into the discussion. I do prefer to close the race up for all cases if we decide to do anything at all about it, i.e. all or nothing. But maybe others disagree. 
> diff --git a/fs/direct-io.c b/fs/direct-io.c > index b6d4390..8f9a810 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -1206,8 +1206,10 @@ __blockdev_direct_IO(int rw, struct kiocb > *iocb, struct inode *inode, > dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) && > (end > i_size_read(inode))); > > + down_read(&current->mm->directio_sem); > retval = direct_io_worker(rw, iocb, inode, iov, offset, > nr_segs, blkbits, get_block, end_io, dio); > + up_read(&current->mm->directio_sem); > > /* > * In case of error extending write may have instantiated a few > diff --git a/include/linux/init_task.h b/include/linux/init_task.h > index e752d97..68e02b9 100644 > --- a/include/linux/init_task.h > +++ b/include/linux/init_task.h > @@ -37,6 +37,7 @@ extern struct fs_struct init_fs; > .page_table_lock = __SPIN_LOCK_UNLOCKED(name.page_table_lock), \ > .mmlist = LIST_HEAD_INIT(name.mmlist), \ > .cpu_vm_mask = CPU_MASK_ALL, \ > + .directio_sem = __RWSEM_INITIALIZER(name.directio_sem), \ > } > > #define INIT_SIGNALS(sig) { \ > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index d84feb7..39ba4e6 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -274,6 +274,9 @@ struct mm_struct { > #ifdef CONFIG_MMU_NOTIFIER > struct mmu_notifier_mm *mmu_notifier_mm; > #endif > + > + /* if there are on-flight directio, we can't fork. */ > + struct rw_semaphore directio_sem; > }; > > /* Future-safe accessor for struct mm_struct's cpu_vm_mask. 
*/ > diff --git a/kernel/fork.c b/kernel/fork.c > index 4854c2c..bbe9fa7 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -266,6 +266,7 @@ static int dup_mmap(struct mm_struct *mm, struct > mm_struct *oldmm) > unsigned long charge; > struct mempolicy *pol; > > + down_write(&oldmm->directio_sem); > down_write(&oldmm->mmap_sem); > flush_cache_dup_mm(oldmm); > /* > @@ -368,6 +369,7 @@ out: > up_write(&mm->mmap_sem); > flush_tlb_mm(oldmm); > up_write(&oldmm->mmap_sem); > + up_write(&oldmm->directio_sem); > return retval; > fail_nomem_policy: > kmem_cache_free(vm_area_cachep, tmp); > @@ -431,6 +433,7 @@ static struct mm_struct * mm_init(struct mm_struct > * mm, struct task_struct *p) > mm->free_area_cache = TASK_UNMAPPED_BASE; > mm->cached_hole_size = ~0UL; > mm_init_owner(mm, p); > + init_rwsem(&mm->directio_sem); > > if (likely(!mm_alloc_pgd(mm))) { > mm->def_flags = 0; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
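[Editorial sketch, not from the thread.] The message above mentions MADV_DONTFORK as the existing userspace workaround for buffers handed to get_user_pages users. As a rough illustration, an application could mark its I/O buffer so that a later fork() does not share that range with the child at all, which means there is no copy-on-write state on it to race with. The helper names alloc_dontfork_buffer() and free_dontfork_buffer() are hypothetical; mmap(), madvise() and MADV_DONTFORK are the real Linux interfaces, and the sketch assumes a Linux system whose libc exposes them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map an anonymous, page-aligned buffer of `len`
 * bytes and ask the kernel not to make it available to children
 * created by fork(). Returns NULL on failure. */
static void *alloc_dontfork_buffer(size_t len)
{
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;

	/* The range is simply absent from the child's address space,
	 * so fork() never write-protects these pages in the parent. */
	if (madvise(buf, len, MADV_DONTFORK) != 0) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}

/* Hypothetical helper: release a buffer from alloc_dontfork_buffer(). */
static int free_dontfork_buffer(void *buf, size_t len)
{
	return munmap(buf, len);
}
```

The trade-off is the one raised in the thread: this needs every affected application to know about the problem and opt in, and a child that touches the range will fault because the mapping does not exist there.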
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 0E9E86B0047 for ; Mon, 16 Mar 2009 12:37:41 -0400 (EDT) Date: Mon, 16 Mar 2009 09:32:11 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: <200903170323.45917.nickpiggin@yahoo.com.au> Message-ID: References: <1237007189.25062.91.camel@pasglop> <200903141620.45052.nickpiggin@yahoo.com.au> <20090316223612.4B2A.A69D9226@jp.fujitsu.com> <200903170323.45917.nickpiggin@yahoo.com.au> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: KOSAKI Motohiro , Benjamin Herrenschmidt , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 17 Mar 2009, Nick Piggin wrote: > > Yes, my patch isn't really a solution. > > Andrea already pointed out that it's not an O_DIRECT issue, it's a gup vs fork > > issue. *And* my patch is crazy slow :) > > Well, it's an interesting question. I'd say it probably is more than > just O_DIRECT. vmsplice too, for example (which I think is much harder > to fix this way because the pages are retired by the other end of > the pipe, so I don't think you can hold a lock across it). Well, only the "fork()" has the race problem. So having a fork-specific lock (but not naming it by directio) actually does make sense. The fork is much less performance-critical than most random mmap_sem users - and doesn't have the same scalability issues either (ie people probably _do_ want to do mmap/munmap/brk concurrently with gup lookup, but there's much less worry about concurrent fork() performance). It doesn't necessarily make the general problem go away, but it makes the _particular_ race between get_user_pages() and fork() go away. 
Then you can do per-page flags or whatever and not have to worry about concurrent lookups. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 8B31E6B005C for ; Mon, 16 Mar 2009 12:50:22 -0400 (EDT) From: Nick Piggin Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Date: Tue, 17 Mar 2009 03:50:12 +1100 References: <1237007189.25062.91.camel@pasglop> <200903170323.45917.nickpiggin@yahoo.com.au> In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200903170350.13665.nickpiggin@yahoo.com.au> Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: KOSAKI Motohiro , Benjamin Herrenschmidt , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tuesday 17 March 2009 03:32:11 Linus Torvalds wrote: > On Tue, 17 Mar 2009, Nick Piggin wrote: > > > Yes, my patch isn't really a solution. > > > Andrea already pointed out that it's not an O_DIRECT issue, it's a gup vs > > > fork issue. *And* my patch is crazy slow :) > > > > Well, it's an interesting question. I'd say it probably is more than > > just O_DIRECT. vmsplice too, for example (which I think is much harder > > to fix this way because the pages are retired by the other end of > > the pipe, so I don't think you can hold a lock across it). > > Well, only the "fork()" has the race problem. > > So having a fork-specific lock (but not naming it by directio) actually > does make sense. 
The fork is much less performance-critical than most > random mmap_sem users - and doesn't have the same scalability issues > either (ie people probably _do_ want to do mmap/munmap/brk concurrently > with gup lookup, but there's much less worry about concurrent fork() > performance). > > It doesn't necessarily make the general problem go away, but it makes the > _particular_ race between get_user_pages() and fork() go away. Then you > can do per-page flags or whatever and not have to worry about concurrent > lookups. Hmm, I see what you mean there; it can be used to solve Andrea's race instead of using set_bit/memory barriers. But I think then you would still need to put this lock in fork and get_user_pages[_fast], *and* still do most of the other stuff required in Andrea's patch. So I'm not sure if that was KAMEZAWA-san's patch. It actually should solve one side of the race completely, as is, but only for direct-IO. Because it ensures that no get_user_pages for direct IO can be outstanding over a fork. However it does a) not solve other get_user_pages problems, and b) doesn't solve the case where for readonly get_user_pages on an already shared pte will get confused if it is subsequently COWed -- it can end up being polluted with wrong data. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id 60B9F6B005A for ; Mon, 16 Mar 2009 13:08:10 -0400 (EDT) Date: Mon, 16 Mar 2009 10:02:02 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: <200903170350.13665.nickpiggin@yahoo.com.au> Message-ID: References: <1237007189.25062.91.camel@pasglop> <200903170323.45917.nickpiggin@yahoo.com.au> <200903170350.13665.nickpiggin@yahoo.com.au> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: KOSAKI Motohiro , Benjamin Herrenschmidt , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 17 Mar 2009, Nick Piggin wrote: > > Hmm, I see what you mean there; it can be used to solve Andrea's race > instead of using set_bit/memory barriers. But I think then you would > still need to put this lock in fork and get_user_pages[_fast], *and* > still do most of the other stuff required in Andrea's patch. Well, yes and no. What if we just did the caller get the lock? And then leave it entirely to the caller to decide how it wants to synchronize with fork? In particular, we really _could_ just say "hold the lock for reading for as long as you hold the reference count to the page" - since now the lock only matters for fork(), nothing else. And make the forking part use "down_write_killable()", so that you can kill the process if it does something bad. Now you can make vmsplice literally get a read-lock for the whole IO operation. The process that does "vmsplice()" will not be able to fork until the IO is done, but let's be honest here: if you're doing vmsplice(), that is damn well what you WANT! splice() already has a callback for releasing the pages, so it's doable. 
O_DIRECT has similar issues - by the time we return from an O_DIRECT write, the pages had better already be written out, so we could just take the read-lock over the whole operation. So don't take the lock in the low level get_user_pages(). Take it as high as you want to. And if some user doesn't want that serialization (maybe ptrace?), don't take the lock at all, or take it just over the get_user_pages() call. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 156D76B003D for ; Mon, 16 Mar 2009 13:19:47 -0400 (EDT) From: Nick Piggin Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Date: Tue, 17 Mar 2009 04:19:38 +1100 References: <1237007189.25062.91.camel@pasglop> <200903170350.13665.nickpiggin@yahoo.com.au> In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200903170419.38988.nickpiggin@yahoo.com.au> Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: KOSAKI Motohiro , Benjamin Herrenschmidt , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tuesday 17 March 2009 04:02:02 Linus Torvalds wrote: > On Tue, 17 Mar 2009, Nick Piggin wrote: > > Hmm, I see what you mean there; it can be used to solve Andrea's race > > instead of using set_bit/memory barriers. But I think then you would > > still need to put this lock in fork and get_user_pages[_fast], *and* > > still do most of the other stuff required in Andrea's patch. > > Well, yes and no. > > What if we just did the caller get the lock? 
And then leave it entirely to > the caller to decide how it wants to synchronize with fork? > > In particular, we really _could_ just say "hold the lock for reading for > as long as you hold the reference count to the page" - since now the lock > only matters for fork(), nothing else. Well that in theory should close the race in one direction (writing into the wrong page). I don't think it closes it in the other direction (reading the wrong data from the page). I'm also not quite convinced of vmsplice. > And make the forking part use "down_write_killable()", so that you can > kill the process if it does something bad. > > Now you can make vmsplice literally get a read-lock for the whole IO > operation. The process that does "vmsplice()" will not be able to fork > until the IO is done, but let's be honest here: if you're doing > vmsplice(), that is damn well what you WANT! Really? I'm not sure (probably primarily because I've never really seen how vmsplice would be used). splice is supposed to be asynchronous, so I don't know why you necessarily would want to avoid fork after a splice (until the asynchronous reader on the other end that you don't necessarily have control over or know anything about reads all the data you've sent it). > splice() already has a callback for releasing the pages, so it's doable. doable, maybe. > O_DIRECT has similar issues - by the time we return from an O_DIRECT > write, the pages had better already be written out, so we could just take > the read-lock over the whole operation. Yes I think that's what the patch was doing. > So don't take the lock in the low level get_user_pages(). Take it as high > as you want to. > > And if some user doesn't want that serialization (maybe ptrace?), don't > take the lock at all, or take it just over the get_user_pages() call. BTW. have you looked at my approach yet? I've tried to solve the fork vs gup race in yet another way. Don't know if you think it is palatable. 
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 6E1396B003D for ; Mon, 16 Mar 2009 13:48:53 -0400 (EDT) Date: Mon, 16 Mar 2009 10:42:48 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: <200903170419.38988.nickpiggin@yahoo.com.au> Message-ID: References: <1237007189.25062.91.camel@pasglop> <200903170350.13665.nickpiggin@yahoo.com.au> <200903170419.38988.nickpiggin@yahoo.com.au> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: KOSAKI Motohiro , Benjamin Herrenschmidt , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 17 Mar 2009, Nick Piggin wrote: > > Well that in theory should close the race in one direction (writing into > the wrong page). > > I don't think it closes it in the other direction (reading the wrong data > from the page). Why? If somebody does a COW while we have a get_user_pages() page frame cached, the get_user_pages() will have increased the page count, so regardless of _who_ writes to the page, the writer will always get a new page. No? So reading data from the page will always get the old pre-cow data. [ goes to reading code ] Oh, damn. That's how it used to work a long time ago when we looked at the page count. Now we just look at the page *map* count, we don't look at any other counts. So the COW logic won't see that somebody else has a copy. Maybe we could go back to also looking at page counts? > BTW. have you looked at my approach yet? I've tried to solve the fork > vs gup race in yet another way. Don't know if you think it is palatable. 
I really think we should be able to fix this without _anything_ like that at all. Just the lock (and some reuse_swap_page() logic changes). Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 5687A6B005A for ; Mon, 16 Mar 2009 14:03:06 -0400 (EDT) From: Nick Piggin Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Date: Tue, 17 Mar 2009 05:02:56 +1100 References: <1237007189.25062.91.camel@pasglop> <200903170419.38988.nickpiggin@yahoo.com.au> In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200903170502.57217.nickpiggin@yahoo.com.au> Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: KOSAKI Motohiro , Benjamin Herrenschmidt , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tuesday 17 March 2009 04:42:48 Linus Torvalds wrote: > On Tue, 17 Mar 2009, Nick Piggin wrote: > > Well that in theory should close the race in one direction (writing into > > the wrong page). > > > > I don't think it closes it in the other direction (reading the wrong data > > from the page). > > Why? > > If somebody does a COW while we have a get_user_pages() page frame cached, > the get_user_pages() will have increased the page count, so regardless of > _who_ writes to the page, the writer will always get a new page. No? [(no)] > Maybe we could go back to also looking at page counts? Hmm, possibly could. > > BTW. have you looked at my approach yet? I've tried to solve the fork > > vs gup race in yet another way. Don't know if you think it is palatable. 
> > I really think we should be able to fix this without _anything_ like that > at all. Just the lock (and some reuse_swap_page() logic changes). What part of that do you dislike, though? I don't think the lock is a particularly elegant idea either (shared cacheline, vmsplice, converting callers). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 79BCB6B005A for ; Mon, 16 Mar 2009 14:05:54 -0400 (EDT) From: Nick Piggin Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Date: Tue, 17 Mar 2009 05:05:46 +1100 References: <1237007189.25062.91.camel@pasglop> <200903170502.57217.nickpiggin@yahoo.com.au> In-Reply-To: <200903170502.57217.nickpiggin@yahoo.com.au> MIME-Version: 1.0 Content-Disposition: inline Message-Id: <200903170505.46905.nickpiggin@yahoo.com.au> Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: KOSAKI Motohiro , Benjamin Herrenschmidt , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tuesday 17 March 2009 05:02:56 Nick Piggin wrote: > On Tuesday 17 March 2009 04:42:48 Linus Torvalds wrote: > > On Tue, 17 Mar 2009, Nick Piggin wrote: > > > BTW. have you looked at my approach yet? I've tried to solve the fork > > > vs gup race in yet another way. Don't know if you think it is > > > palatable. > > > > I really think we should be able to fix this without _anything_ like that > > at all. Just the lock (and some reuse_swap_page() logic changes). > > What part of that do you dislike, though? 
If you disregard code motion and extra argument to copy_page_range, my fix is a couple of dozen lines change to existing code, plus the "decow" function (which could probably share a fair bit of code with do_wp_page). Do you dislike the added complexity of the code? Or the behaviour that gets changed? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 0DD8D6B005A for ; Mon, 16 Mar 2009 14:21:05 -0400 (EDT) Date: Mon, 16 Mar 2009 11:14:59 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: <200903170502.57217.nickpiggin@yahoo.com.au> Message-ID: References: <1237007189.25062.91.camel@pasglop> <200903170419.38988.nickpiggin@yahoo.com.au> <200903170502.57217.nickpiggin@yahoo.com.au> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: KOSAKI Motohiro , Benjamin Herrenschmidt , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 17 Mar 2009, Nick Piggin wrote: > > What part of that do you dislike, though? I don't think the lock is a > particularly elegant idea either (shared cacheline, vmsplice, converting > callers). All of the absolute *crap* for no good reason. Did you even look at your patch? It wasn't as ugly as Andrea's, but it was ugly enough, and it was buggy. That whole "decow" stuff was too f*cking ugly to live. Couple that with the fact that no real-life user can possibly care, and that O_DIRECT is broken to begin with, and I say: "let's fix this with a _much_ smaller patch". 
You may think that the lock isn't particularly "elegant", but I can only say "f*ck that, look at the number of lines of code, and the simplicity". Your "elegant" argument is total and utter sh*t, in other words. The lock approach is tons more elegant, considering that it solves the problem much more cleanly, and with _much_ less crap. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id BA6246B005A for ; Mon, 16 Mar 2009 14:22:55 -0400 (EDT) Date: Mon, 16 Mar 2009 11:17:02 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: <200903170505.46905.nickpiggin@yahoo.com.au> Message-ID: References: <1237007189.25062.91.camel@pasglop> <200903170502.57217.nickpiggin@yahoo.com.au> <200903170505.46905.nickpiggin@yahoo.com.au> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: KOSAKI Motohiro , Benjamin Herrenschmidt , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 17 Mar 2009, Nick Piggin wrote: > > If you disregard code motion and extra argument to copy_page_range, > my fix is a couple of dozen lines change to existing code, plus the > "decow" function (which could probably share a fair bit of code > with do_wp_page). > > Do you dislike the added complexity of the code? Or the behaviour > that gets changed? The complexity. That decow thing is shit. So is all the extra flags for no good reason. What's your argument against "keep it simple with a single lock, and adding basically a single line to reuse_swap_page() to say "don't reuse the page if the count is elevated"? 
THAT is simple and elegant, and needs none of the complexity. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id EE8266B003D for ; Mon, 16 Mar 2009 14:28:37 -0400 (EDT) Date: Mon, 16 Mar 2009 19:28:14 +0100 From: Andrea Arcangeli Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Message-ID: <20090316182814.GA20555@random.random> References: <1237007189.25062.91.camel@pasglop> <200903170350.13665.nickpiggin@yahoo.com.au> <200903170419.38988.nickpiggin@yahoo.com.au> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Nick Piggin , KOSAKI Motohiro , Benjamin Herrenschmidt , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Mon, Mar 16, 2009 at 10:42:48AM -0700, Linus Torvalds wrote: > Maybe we could go back to also looking at page counts? Hugh just recently reminded me why we switched to mapcount, and the explanation is here: c475a8ab625d567eacf5e30ec35d6d8704558062 which wasn't entirely safe until this was added too: ab967d86015a19777955370deebc8262d50fed63 which reliably allowed taking over swapcache pages taken by gup and at the same time allowed the VM to unmap ptes pointing to swapcache taken by GUP. Yes, it's possible to go back to page counts; then we only have to reintroduce the 2.6.7 solution that will prevent the VM from unmapping ptes that are mapping pages taken by GUP. 
Otherwise do_wp_page won't be able to remap into the pte the same swapcache that was unmapped from the pte by the VM, leading to disk corruption with swapping (the 2.4 bug, fixed in 2.4 with a simpler PG_lock local to direct-io, which prevented the VM from unmapping ptes on the page as long as I/O was in progress; PG_lock was released by the ->end_io async handler from irq IIRC). The only problem I can see is that if mapcount and page count can change freely while the PT lock and rmap locks are taken, comparing them won't be as reliable as in ksm/fork (in my version of the fix) where we're guaranteed mapcount is 1 and stays 1 as long as we hold the PT lock, because pte_write(pte) == true and PageAnon == true (I also added a BUG_ON to check that mapcount is always 1 when the other two conditions are true). That makes ksm/forkfix quite obviously safe in this regard. But for the VM to decide not to unmap a pte taken by GUP, we also have to deal with mapcount > 1 and pte_write(pte) == false and PageAnon == true. So if we solve that ordering issue between reading mapcount and page count, I don't see much of a problem in returning to checking the page count in the VM code to prevent the pte from being unmapped while the page is under GUP, and then removing the mapcount-only check from the do_wp_page swapcache-reuse logic. If we returned to using page_count instead of mapcount, the first patch I posted here would then not require any change to take care of the 'reverse' race (modulo hugetlb) of the child writing to the pages that are being written to disk by the parent; there would be no need to de-cow in GUP (again modulo hugetlb). > I really think we should be able to fix this without _anything_ like that > at all. Just the lock (and some reuse_swap_page() logic changes). I don't see why we should introduce mm-wide locks outside GUP (worrying about the SetPageGUP in gup-fast when gup-fast would then instead have to take an mm-wide lock sounds like a small issue) when we can be page-granular and lockless. 
I agree it could be simpler and less invasive to the gup details to add any logic outside of gup, but I don't think the result will be superior, given it'll most certainly become a heavier-weight lock bouncing across all cpus calling gup-fast, and it won't make a speed difference for the CPU to execute an atomic lock op inside or outside of gup-fast. OTOH if the argument for an outer mm-wide lock is to keep the code simpler or more maintainable, that would explain it. I think fixing it my way is not more complicated than fixing it outside gup, but then I clearly may be biased in what looks simpler to me. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id A70A46B0047 for ; Mon, 16 Mar 2009 14:29:20 -0400 (EDT) From: Nick Piggin Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Date: Tue, 17 Mar 2009 05:29:08 +1100 References: <1237007189.25062.91.camel@pasglop> <200903170502.57217.nickpiggin@yahoo.com.au> In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200903170529.08995.nickpiggin@yahoo.com.au> Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: KOSAKI Motohiro , Benjamin Herrenschmidt , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tuesday 17 March 2009 05:14:59 Linus Torvalds wrote: > On Tue, 17 Mar 2009, Nick Piggin wrote: > > What part of that do you dislike, though? I don't think the lock is a > > particularly elegant idea either (shared cacheline, vmsplice, converting > > callers). > > All of the absolute *crap* for no good reason. > > Did you even look at your patch? 
It wasn't as ugly as Andrea's, but it was > ugly enough, and it was buggy. That whole "decow" stuff was too f*cking > ugly to live. What's buggy about it? Stupid bugs, or fundamentally broken? > Couple that with the fact that no real-life user can possibly care, and > that O_DIRECT is broken to begin with, and I say: "let's fix this with a > _much_ smaller patch". If it is based on nobody caring, I would prefer not to add anything at all to "fix" it? We have MADV_DONTFORK already... > You may think that the lock isn't particularly "elegant", but I can only > say "f*ck that, look at the number of lines of code, and the simplicity". > > Your "elegant" argument is total and utter sh*t, in other words. The lock > approach is tons more elegant, considering that it solves the problem much > more cleanly, and with _much_ less crap. In my opinion it is not, given that you have to convert callers. If you say that you only care about fixing O_DIRECT, then yes I would probably agree the lock is nicer in that case. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
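Editor's sketch, not part of the original message: the MADV_DONTFORK remark above points at a workaround that userspace can already apply. What follows is a minimal sketch, assuming a Linux system and an anonymous private mapping used as the O_DIRECT buffer. The helper names alloc_dio_buffer() and free_dio_buffer() are hypothetical and do not appear anywhere in the thread. Marking the range MADV_DONTFORK keeps fork() from copying the mapping into the child, so no COW relationship is ever set up for those pages.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for MAP_ANONYMOUS and MADV_DONTFORK */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round a length up to a whole number of pages. */
static size_t round_up_to_page(size_t len)
{
    long page = sysconf(_SC_PAGESIZE);

    assert(page > 0);
    return (len + (size_t)page - 1) & ~((size_t)page - 1);
}

/*
 * Hypothetical helper: map a page-aligned anonymous buffer and mark it
 * MADV_DONTFORK, so a later fork() leaves it out of the child entirely.
 * Returns NULL on failure (including len == 0).
 */
static void *alloc_dio_buffer(size_t len)
{
    size_t mapped = round_up_to_page(len);
    void *buf = mmap(NULL, mapped, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (buf == MAP_FAILED)
        return NULL;

    if (madvise(buf, mapped, MADV_DONTFORK) != 0) {
        munmap(buf, mapped);
        return NULL;
    }
    return buf;
}

/* Hypothetical helper: release a buffer from alloc_dio_buffer(). */
static int free_dio_buffer(void *buf, size_t len)
{
    return munmap(buf, round_up_to_page(len));
}
```

The cost of this approach is that the child must never touch the buffer: the range is simply not mapped there, so any access in the child faults.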
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id B738C6B003D for ; Mon, 16 Mar 2009 14:33:56 -0400 (EDT) From: Nick Piggin Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Date: Tue, 17 Mar 2009 05:33:47 +1100 References: <1237007189.25062.91.camel@pasglop> <200903170505.46905.nickpiggin@yahoo.com.au> In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200903170533.48423.nickpiggin@yahoo.com.au> Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: KOSAKI Motohiro , Benjamin Herrenschmidt , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tuesday 17 March 2009 05:17:02 Linus Torvalds wrote: > On Tue, 17 Mar 2009, Nick Piggin wrote: > > If you disregard code motion and extra argument to copy_page_range, > > my fix is a couple of dozen lines change to existing code, plus the > > "decow" function (which could probably share a fair bit of code > > with do_wp_page). > > > > Do you dislike the added complexity of the code? Or the behaviour > > that gets changed? > > The complexity. That decow thing is shit. copying the page on fork instead of write protecting it? The code or the idea? Code can certainly be improved... > So is all the extra flags for no > good reason. Which extra flags are you referring to? > What's your argument against "keep it simple with a single lock, and > adding basically a single line to reuse_swap_page() to say "don't reuse > the page if the count is elevated"? I made them in a previous message. It depends on what callers you want to convert I guess. I don't think vmsplice takes to the lock approach very well though. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 4A16E6B004D for ; Mon, 16 Mar 2009 14:38:10 -0400 (EDT) Date: Mon, 16 Mar 2009 19:37:50 +0100 From: Andrea Arcangeli Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Message-ID: <20090316183750.GB20555@random.random> References: <1237007189.25062.91.camel@pasglop> <200903170419.38988.nickpiggin@yahoo.com.au> <200903170502.57217.nickpiggin@yahoo.com.au> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Nick Piggin , KOSAKI Motohiro , Benjamin Herrenschmidt , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Mon, Mar 16, 2009 at 11:14:59AM -0700, Linus Torvalds wrote: > You may think that the lock isn't particularly "elegant", but I can only > say "f*ck that, look at the number of lines of code, and the simplicity". I'm sorry, but the number of lines that you're reading in the direct_io_worker patch isn't representative of what it takes to fix it with an mm-wide lock. It may be conceptually simpler to fix it outside GUP, on that I can certainly agree (with the downside of leaving splice broken etc.), but I can't see how that small patch can fix anything, as it releases the semaphore as soon as direct_io_worker returns, even when O_DIRECT is mixed with async-io. Before claiming that the outer lock results in fewer lines of code, I'd wait to see a fix that works with O_DIRECT+async-io as well as mine and Nick's do. > Your "elegant" argument is total and utter sh*t, in other words. The lock > approach is tons more elegant, considering that it solves the problem much > more cleanly, and with _much_ less crap. 
I guess elegant is relative, but the size argument is objective, and that should be possible to compare if somebody writes a full fix that doesn't fall apart if return value of direct_io_worker is -EIOCBQUEUED. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id C1BCA6B0047 for ; Mon, 16 Mar 2009 15:23:45 -0400 (EDT) Date: Mon, 16 Mar 2009 12:17:21 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: <200903170529.08995.nickpiggin@yahoo.com.au> Message-ID: References: <1237007189.25062.91.camel@pasglop> <200903170502.57217.nickpiggin@yahoo.com.au> <200903170529.08995.nickpiggin@yahoo.com.au> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: KOSAKI Motohiro , Benjamin Herrenschmidt , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 17 Mar 2009, Nick Piggin wrote: > > What's buggy about it? Stupid bugs, or fundamentally broken? The lack of locking. > In my opinion it is not, given that you have to convert callers. If you > say that you only care about fixing O_DIRECT, then yes I would probably > agree the lock is nicer in that case. F*ck me, I'm not going to bother to argue. I'm not going to merge your patch, it's that easy. Quite frankly, I don't think that the "bug" is a bug to begin with. O_DIRECT+fork() can damn well continue to be broken. But if we fix it, we fix it the _clean_ way with a simple patch, not with that shit-for-logic horrible decow crap. It's that simple. I refuse to take putrid industrial waste patches for something like this. 
Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 784366B0047 for ; Mon, 16 Mar 2009 15:27:49 -0400 (EDT) Date: Mon, 16 Mar 2009 12:22:12 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: <200903170533.48423.nickpiggin@yahoo.com.au> Message-ID: References: <1237007189.25062.91.camel@pasglop> <200903170505.46905.nickpiggin@yahoo.com.au> <200903170533.48423.nickpiggin@yahoo.com.au> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: KOSAKI Motohiro , Benjamin Herrenschmidt , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 17 Mar 2009, Nick Piggin wrote: > > > So is all the extra flags for no > > good reason. > > Which extra flags are you referring to? Fuck me, didn't you even read your own patch? What do you call PG_dontcow? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 0DF016B0047 for ; Mon, 16 Mar 2009 20:00:38 -0400 (EDT) Received: from m5.gw.fujitsu.co.jp ([10.0.50.75]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id n2H00ZAj010416 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Tue, 17 Mar 2009 09:00:35 +0900 Received: from smail (m5 [127.0.0.1]) by outgoing.m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 32E7D45DE56 for ; Tue, 17 Mar 2009 09:00:35 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (s5.gw.fujitsu.co.jp [10.0.50.95]) by m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 0155145DE52 for ; Tue, 17 Mar 2009 09:00:35 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id C8E71E08010 for ; Tue, 17 Mar 2009 09:00:34 +0900 (JST) Received: from ml13.s.css.fujitsu.com (ml13.s.css.fujitsu.com [10.249.87.103]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 491921DB8018 for ; Tue, 17 Mar 2009 09:00:34 +0900 (JST) Date: Tue, 17 Mar 2009 08:59:11 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Message-Id: <20090317085911.4eb2135d.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <200903170350.13665.nickpiggin@yahoo.com.au> References: <1237007189.25062.91.camel@pasglop> <200903170323.45917.nickpiggin@yahoo.com.au> <200903170350.13665.nickpiggin@yahoo.com.au> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Linus Torvalds , KOSAKI Motohiro , Benjamin Herrenschmidt , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , linux-mm@kvack.org List-ID: On Tue, 17 Mar 2009 03:50:12 +1100 Nick Piggin wrote: > On Tuesday 17 March 2009 03:32:11 Linus Torvalds wrote: > > On Tue, 17 Mar 2009, Nick Piggin wrote: > > > > 
Yes, my patch isn't really a solution. > > > > Andrea already pointed out that it's not O_DIRECT issue, it's gup vs > > > > fork issue. *and* my patch is crazy slow :) > > > > > > Well, it's an interesting question. I'd say it probably is more than > > > just O_DIRECT. vmsplice too, for example (which I think is much harder > > > to fix this way because the pages are retired by the other end of > > > the pipe, so I don't think you can hold a lock across it). > > > > Well, only the "fork()" has the race problem. > > > > So having a fork-specific lock (but not naming it by directio) actually > > does make sense. The fork is much less performance-critical than most > > random mmap_sem users - and doesn't have the same scalability issues > > either (ie people probably _do_ want to do mmap/munmap/brk concurrently > > with gup lookup, but there's much less worry about concurrent fork() > > performance). > > > > It doesn't necessarily make the general problem go away, but it makes the > > _particular_ race between get_user_pages() and fork() go away. Then you > > can do per-page flags or whatever and not have to worry about concurrent > > lookups. > > Hmm, I see what you mean there; it can be used to solve Andrea's race > instead of using set_bit/memory barriers. But I think then you would > still need to put this lock in fork and get_user_pages[_fast], *and* > still do most of the other stuff required in Andrea's patch. > > So I'm not sure if that was KAMEZAWA-san's patch. > Just FYI. This was the last patch I sent to Red Hat (against RHEL5), but it was ignored ;) Please ignore the dirty part, which comes from the limitation that I can't modify mm_struct. === This patch provides a kind of rwlock for DIO. This patch adds the following: struct mm_private { struct mm_struct new our data } Before issuing DIO, the DIO submitter should call dio_lock()/dio_unlock(). Before starting COW, the kernel should call mm_cow_start()/mm_cow_end(). dio_lock() registers a range of addresses which is under DIO. 
mm_cow_start() checks whether the range of addresses is under DIO, then - If under DIO, retry the fault (to release the rwsem). - If not under DIO, mark "we're under COW". This will make DIO submitters wait. To avoid too many page faults, a "conflict" counter is added, and if conflict==1, the DIO submitter will wait for a while. If no one has issued DIO yet at copy-on-write time, no checks are done. Signed-off-by: KAMEZAWA Hiroyuki -- fs/direct-io.c | 43 ++++++++++++++- include/linux/direct-io.h | 38 +++++++++++++ include/linux/mm_private.h | 24 ++++++++ kernel/fork.c | 23 ++++++-- mm/Makefile | 2 mm/diolock.c | 129 +++++++++++++++++++++++++++++++++++++++++++++ mm/hugetlb.c | 11 +++ mm/memory.c | 15 +++++ 8 files changed, 278 insertions(+), 7 deletions(-) Index: kame-odirect-linux/include/linux/direct-io.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ kame-odirect-linux/include/linux/direct-io.h 2009-01-30 10:12:58.000000000 +0900 @@ -0,0 +1,38 @@ +#ifndef __LINUX_DIRECT_IO_H +#define __LINUX_DIRECT_IO_H + +struct dio_lock_head +{ + spinlock_t lock; /* A lock for all below */ + struct list_head dios; /* DIOs running now */ + int need_dio_check; /* This process used DIO */ + int cows; /* COWs running now */ + int conflicts; /* conflicts between COW and DIOs*/ + wait_queue_head_t waitq; /* A waitq for all stopped DIOs.*/ +}; + +struct dio_lock_ent +{ + struct list_head list; /* Linked list from head->dios */ + struct mm_struct *mm; /* the mm struct this is assigned for */ + unsigned long start; /* start address for a DIO */ + unsigned long end; /* end address for a DIO */ +}; + +/* called at fork/exit */ +int dio_lock_init(struct dio_lock_head *head); +void dio_lock_free(struct dio_lock_head *head); + +/* + * Called by DIO submitter. + */ +int dio_lock(struct mm_struct *mm, unsigned long start, unsigned long end, + struct dio_lock_ent *lock); +void dio_unlock(struct dio_lock_ent *lock); +/* + * Called by waiters. 
+ */ +int mm_cow_start(struct mm_struct *mm, unsigned long start, unsigned long size); +void mm_cow_end(struct mm_struct *mm); + +#endif Index: kame-odirect-linux/mm/diolock.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ kame-odirect-linux/mm/diolock.c 2009-01-30 10:43:11.000000000 +0900 @@ -0,0 +1,129 @@ +#include +#include +#include +#include + + +int dio_lock_init(struct dio_lock_head *head) +{ + spin_lock_init(&head->lock); + head->need_dio_check = 0; + head->cows = 0; + head->conflicts = 0; + INIT_LIST_HEAD(&head->dios); + init_waitqueue_head(&head->waitq); + return 0; +} + +void dio_lock_free(struct dio_lock_head *head) +{ + BUG_ON(!list_empty(&head->dios)); + return; +} + + +int dio_lock(struct mm_struct *mm, unsigned long start, unsigned long end, + struct dio_lock_ent *lock) +{ + unsigned long flags; + struct dio_lock_head *head; + DEFINE_WAIT(wait); +retry: + if (signal_pending(current)) + return -EINTR; + head = &get_mm_private(mm)->diolock; + + if (!head->need_dio_check) { + down_write(&mm->mmap_sem); + head->need_dio_check = 1; + up_write(&mm->mmap_sem); + } + + prepare_to_wait(&head->waitq, &wait, TASK_INTERRUPTIBLE); + spin_lock_irqsave(&head->lock, flags); + if (head->cows || head->conflicts) { /* Allow COWs go ahead rather than new I/O */ + spin_unlock_irqrestore(&head->lock, flags); + if (head->cows) + schedule(); + else { + schedule_timeout(10); /* Allow 10tick for COW rertry */ + head->conflicts = 0; + } + finish_wait(&head->waitq, &wait); + goto retry; + } + lock->mm = mm; + lock->start = PAGE_ALIGN(start); + lock->end = PAGE_ALIGN(end) + PAGE_SIZE; + list_add(&lock->list, &head->dios); + atomic_inc(&mm->mm_users); + spin_unlock_irqrestore(&head->lock, flags); + finish_wait(&head->waitq, &wait); + return 0; +} + +void dio_unlock(struct dio_lock_ent *lock) +{ + struct dio_lock_head *head; + struct mm_struct *mm; + unsigned long flags; + + mm = lock->mm; + head = 
&get_mm_private(mm)->diolock; + spin_lock_irqsave(&head->lock, flags); + list_del(&lock->list); + if (waitqueue_active(&head->waitq)) + wake_up_all(&head->waitq); + spin_unlock_irqrestore(&head->lock, flags); + mmput(mm); +} + +int mm_cow_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct dio_lock_head *head; + struct dio_lock_ent *lock; + + head = &get_mm_private(mm)->diolock; + if (!head->need_dio_check) + return 0; + + spin_lock_irq(&head->lock); + head->cows++; + if (list_empty(&head->dios)) { + spin_unlock_irq(&head->lock); + return 0; + } + /* SLOW PATH */ + list_for_each_entry(lock, &head->dios, list) { + if ((start < lock->end) && (end > lock->start)) { + head->cows--; + head->conflicts++; + spin_unlock_irq(&head->lock); + /* This page fault will be retried but new dio requests will be + delayed until cow ends.*/ + return 1; + } + } + spin_unlock_irq(&head->lock); + return 0; +} + +void mm_cow_end(struct mm_struct *mm) +{ + struct dio_lock_head *head; + + head = &get_mm_private(mm)->diolock; + if (!head->need_dio_check) + return; + + spin_lock_irq(&head->lock); + head->cows--; + if (!head->cows) { + head->conflicts = 0; + if (waitqueue_active(&head->waitq)) + wake_up_all(&head->waitq); + } + spin_unlock_irq(&head->lock); + +} Index: kame-odirect-linux/fs/direct-io.c =================================================================== --- kame-odirect-linux.orig/fs/direct-io.c 2009-01-29 14:01:44.000000000 +0900 +++ kame-odirect-linux/fs/direct-io.c 2009-01-30 10:53:45.000000000 +0900 @@ -34,6 +34,8 @@ #include #include #include +#include + #include /* @@ -130,8 +132,43 @@ int is_async; /* is IO async ? 
*/ int io_error; /* IO error in completion path */ ssize_t result; /* IO result */ + + /* For sanity of Direct-IO and Copy-On-Write */ + struct dio_lock_ent *locks; + int nr_segs; }; +int dio_protect_all(struct dio *dio, const struct iovec *iov, int nsegs) +{ + struct dio_lock_ent *lock; + unsigned long start, end; + int seg; + + lock = kzalloc(sizeof(*lock) * nsegs, GFP_KERNEL); + if (!lock) + return -ENOMEM; + dio->locks = lock; + dio->nr_segs = nsegs; + for (seg = 0; seg < nsegs; seg++) { + start = (unsigned long)iov[seg].iov_base; + end = (unsigned long)iov[seg].iov_base + iov[seg].iov_len; + dio_lock(current->mm, start, end, lock+seg); + } + return 0; +} + +void dio_release_all_protection(struct dio *dio) +{ + int seg; + + if (!dio->locks) + return; + + for (seg = 0; seg < dio->nr_segs; seg++) + dio_unlock(dio->locks + seg); + kfree(dio->locks); +} + /* * How many pages are in the queue? */ @@ -284,6 +321,7 @@ if (remaining == 0) { int ret = dio_complete(dio, dio->iocb->ki_pos, 0); aio_complete(dio->iocb, ret, 0); + dio_release_all_protection(dio); kfree(dio); } @@ -965,6 +1003,7 @@ dio->iocb = iocb; dio->i_size = i_size_read(inode); + dio->locks = NULL; spin_lock_init(&dio->bio_lock); dio->refcount = 1; @@ -1088,6 +1127,7 @@ if (ret2 == 0) { ret = dio_complete(dio, offset, ret); + dio_release_all_protection(dio); kfree(dio); } else BUG_ON(ret != -EIOCBQUEUED); @@ -1166,7 +1206,8 @@ retval = -ENOMEM; if (!dio) goto out; - + if (dio_protect_all(dio, iov, nr_segs)) + goto out; /* * For block device access DIO_NO_LOCKING is used, * neither readers nor writers do any locking at all Index: kame-odirect-linux/kernel/fork.c =================================================================== --- kame-odirect-linux.orig/kernel/fork.c 2009-01-29 14:01:44.000000000 +0900 +++ kame-odirect-linux/kernel/fork.c 2009-01-30 09:54:05.000000000 +0900 @@ -46,6 +46,7 @@ #include #include #include +#include #ifndef __GENKSYMS__ #include #include @@ -77,8 +78,8 @@ struct hlist_head 
mm_flags_hash[MM_FLAGS_HASH_SIZE] = { [ 0 ... MM_FLAGS_HASH_SIZE - 1 ] = HLIST_HEAD_INIT }; DEFINE_SPINLOCK(mm_flags_lock); -#define MM_HASH_SHIFT ((sizeof(struct mm_struct) >= 1024) ? 10 \ - : (sizeof(struct mm_struct) >= 512) ? 9 \ +#define MM_HASH_SHIFT ((sizeof(struct mm_private) >= 1024) ? 10 \ + : (sizeof(struct mm_private) >= 512) ? 9 \ : 8) #define mm_flags_hash_fn(mm) \ hash_long((unsigned long)(mm) >> MM_HASH_SHIFT, MM_FLAGS_HASH_BITS) @@ -299,6 +300,17 @@ spin_unlock(&mm_flags_lock); } +static void init_mm_private(struct mm_private *mmp) +{ + dio_lock_init(&mmp->diolock); +} + +static void free_mm_private(struct mm_private *mmp) +{ + dio_lock_free(&mmp->diolock); +} + + #ifdef CONFIG_MMU static inline int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) { @@ -430,7 +442,7 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(mmlist_lock); #define allocate_mm() (kmem_cache_alloc(mm_cachep, SLAB_KERNEL)) -#define free_mm(mm) (kmem_cache_free(mm_cachep, (mm))) +#define free_mm(mm) (kmem_cache_free(mm_cachep, get_mm_private((mm)))) #include @@ -451,6 +463,7 @@ mm->ioctx_list = NULL; mm->free_area_cache = TASK_UNMAPPED_BASE; mm->cached_hole_size = ~0UL; + init_mm_private(get_mm_private(mm)); mm_flags = get_mm_flags(current->mm); if (mm_flags != MMF_DUMP_FILTER_DEFAULT) { @@ -466,6 +479,7 @@ if (mm_flags != MMF_DUMP_FILTER_DEFAULT) free_mm_flags(mm); fail_nomem: + free_mm_private(get_mm_private(mm)); free_mm(mm); return NULL; } @@ -494,6 +508,7 @@ { BUG_ON(mm == &init_mm); free_mm_flags(mm); + free_mm_private(get_mm_private(mm)); mm_free_pgd(mm); destroy_context(mm); free_mm(mm); @@ -1550,7 +1565,7 @@ sizeof(struct vm_area_struct), 0, SLAB_PANIC, NULL, NULL); mm_cachep = kmem_cache_create("mm_struct", - sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN, + sizeof(struct mm_private), ARCH_MIN_MMSTRUCT_ALIGN, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); } Index: kame-odirect-linux/mm/Makefile =================================================================== --- 
kame-odirect-linux.orig/mm/Makefile 2009-01-29 14:01:44.000000000 +0900 +++ kame-odirect-linux/mm/Makefile 2009-01-29 14:01:59.000000000 +0900 @@ -5,7 +5,7 @@ mmu-y := nommu.o mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \ mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \ - vmalloc.o + vmalloc.o diolock.o obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \ page_alloc.o page-writeback.o pdflush.o \ Index: kame-odirect-linux/mm/memory.c =================================================================== --- kame-odirect-linux.orig/mm/memory.c 2009-01-29 14:01:44.000000000 +0900 +++ kame-odirect-linux/mm/memory.c 2009-01-29 16:18:19.000000000 +0900 @@ -50,6 +50,7 @@ #include #include #include +#include #include #include @@ -1665,6 +1666,7 @@ int reuse = 0, ret = VM_FAULT_MINOR; struct page *dirty_page = NULL; int dirty_pte = 0; + int dio_stop = 0; old_page = vm_normal_page(vma, address, orig_pte); if (!old_page) @@ -1738,6 +1740,7 @@ gotten: pte_unmap_unlock(page_table, ptl); + if (unlikely(anon_vma_prepare(vma))) goto oom; if (old_page == ZERO_PAGE(address)) { @@ -1748,6 +1751,11 @@ new_page = alloc_page_vma(GFP_HIGHUSER, vma, address); if (!new_page) goto oom; + if (mm_cow_start(mm, address, address+PAGE_SIZE)) { + page_cache_release(new_page); + goto out_retry; + } + dio_stop = 1; cow_user_page(new_page, old_page, address); } @@ -1789,6 +1797,9 @@ page_cache_release(new_page); if (old_page) page_cache_release(old_page); + /* Allow DIO progress */ + if (dio_stop) + mm_cow_end(mm); unlock: pte_unmap_unlock(page_table, ptl); if (dirty_page) { @@ -1797,6 +1808,10 @@ put_page(dirty_page); } return ret; +out_retry: + if (old_page) + page_cache_release(old_page); + return ret; oom: if (old_page) page_cache_release(old_page); Index: kame-odirect-linux/mm/hugetlb.c =================================================================== --- kame-odirect-linux.orig/mm/hugetlb.c 2009-01-29 14:01:44.000000000 +0900 +++ 
kame-odirect-linux/mm/hugetlb.c 2009-01-29 16:29:51.000000000 +0900 @@ -14,6 +14,7 @@ #include #include #include +#include #include #include @@ -470,7 +471,13 @@ page_cache_release(old_page); return VM_FAULT_OOM; } - + if (mm_cow_start(mm, address & HPAGE_MASK, HPAGE_SIZE)) { + /* we have to retry. */ + page_cache_release(old_page); + page_cache_release(new_page); + return VM_FAULT_MINOR; + } + spin_unlock(&mm->page_table_lock); copy_huge_page(new_page, old_page, address); spin_lock(&mm->page_table_lock); @@ -486,6 +493,8 @@ } page_cache_release(new_page); page_cache_release(old_page); + mm_cow_end(mm); + return VM_FAULT_MINOR; } Index: kame-odirect-linux/include/linux/mm_private.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ kame-odirect-linux/include/linux/mm_private.h 2009-01-30 09:52:26.000000000 +0900 @@ -0,0 +1,24 @@ +#ifndef __LINUX_MM_PRIVATE_H +#define __LINUX_MM_PRIVATE_H + +#include +#include + +/* + * Because we have to keep KABI, we cannot modify mm_struct itself. This + * mm_private is per-process object and not covered by KABI. + * Just for a fields of future bugfix. + * Note: Now, this is not copied at fork(). + */ +struct mm_private { + struct mm_struct mm; + /* For fixing direct-io/COW races. */ + struct dio_lock_head diolock; +}; + +static inline struct mm_private *get_mm_private(struct mm_struct *mm) +{ + return container_of(mm, struct mm_private, mm); +} + +#endif -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
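Editor's sketch, not part of the original message: the core of the patch above is the check in mm_cow_start() that asks whether the range about to be copied overlaps any range registered by dio_lock(). The following is a minimal userspace model of that overlap test only, assuming half-open ranges [start, end). The names dio_range, dio_table, dio_register() and cow_conflicts() are hypothetical stand-ins; the model has no locking, no wait queue, and none of the retry logic in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one registered direct-I/O range: [start, end). */
struct dio_range {
    unsigned long start;    /* first byte covered */
    unsigned long end;      /* one past the last byte covered */
};

#define MAX_DIO_RANGES 8

/* Hypothetical stand-in for the per-mm list of in-flight DIO ranges. */
struct dio_table {
    struct dio_range ranges[MAX_DIO_RANGES];
    size_t count;
};

/*
 * Record [start, end) as being under direct I/O.
 * Returns 0 on success, -1 if the range is empty or the table is full.
 */
static int dio_register(struct dio_table *t,
                        unsigned long start, unsigned long end)
{
    if (start >= end || t->count == MAX_DIO_RANGES)
        return -1;
    t->ranges[t->count].start = start;
    t->ranges[t->count].end = end;
    t->count++;
    return 0;
}

/*
 * Returns 1 if copying [start, end) would touch a registered range,
 * 0 otherwise. Two half-open ranges overlap exactly when each one
 * starts before the other one ends.
 */
static int cow_conflicts(const struct dio_table *t,
                         unsigned long start, unsigned long end)
{
    size_t i;

    assert(start < end);
    for (i = 0; i < t->count; i++) {
        const struct dio_range *r = &t->ranges[i];

        if (start < r->end && end > r->start)
            return 1;
    }
    return 0;
}
```

The test only works if every caller passes an end address as the third argument. In the posted patch the header declares that parameter as a size, the mm/memory.c call passes address+PAGE_SIZE, and the mm/hugetlb.c call passes HPAGE_SIZE, so the hugetlb path appears to compare a size against addresses.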
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 5448E6B003D for ; Mon, 16 Mar 2009 20:50:04 -0400 (EDT) Date: Mon, 16 Mar 2009 17:44:25 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: <20090316223612.4B2A.A69D9226@jp.fujitsu.com> Message-ID: References: <1237007189.25062.91.camel@pasglop> <200903141620.45052.nickpiggin@yahoo.com.au> <20090316223612.4B2A.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Nick Piggin , Benjamin Herrenschmidt , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 17 Mar 2009, KOSAKI Motohiro wrote: > > if we only need concern to O_DIRECT, below patch is enough. .. together with something like this, to handle the other direction. This should take care of the case of an O_DIRECT write() call using a page that was duplicated by an _earlier_ fork(), and then got split up by a COW in the wrong direction (ie having data from the child show up in the write). Untested. But fairly trivial, after all. We simply do the same old "reuse_swap_page()" count, but we only break the COW if the page count afterwards is 1 (reuse_swap_page will have removed it from the swap cache if it returns success). Does this (together with Kosaki's patch) pass the tests that Andrea had? 
Linus --- mm/memory.c | 8 +++++++- 1 files changed, 7 insertions(+), 1 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index baa999e..2bd5fb0 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1928,7 +1928,13 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, } page_cache_release(old_page); } - reuse = reuse_swap_page(old_page); + /* + * If we can re-use the swap page _and_ the end + * result has only one user (the mapping), then + * we reuse the whole page + */ + if (reuse_swap_page(old_page)) + reuse = page_count(old_page) == 1; unlock_page(old_page); } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED))) { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 7CFC76B003D for ; Mon, 16 Mar 2009 20:57:36 -0400 (EDT) Received: from mt1.gw.fujitsu.co.jp ([10.0.50.74]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id n2H0vYgE024528 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Tue, 17 Mar 2009 09:57:34 +0900 Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 06A5845DE52 for ; Tue, 17 Mar 2009 09:57:34 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id D3F4745DE4E for ; Tue, 17 Mar 2009 09:57:33 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id AD72AE08001 for ; Tue, 17 Mar 2009 09:57:33 +0900 (JST) Received: from m107.s.css.fujitsu.com (m107.s.css.fujitsu.com [10.249.87.107]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 6339F1DB803C for ; Tue, 17 Mar 2009 09:57:33 +0900 (JST) Date: Tue, 17 Mar 2009 
09:56:11 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Message-Id: <20090317095611.77fdff59.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: References: <1237007189.25062.91.camel@pasglop> <200903141620.45052.nickpiggin@yahoo.com.au> <20090316223612.4B2A.A69D9226@jp.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: KOSAKI Motohiro , Nick Piggin , Benjamin Herrenschmidt , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , linux-mm@kvack.org List-ID: On Mon, 16 Mar 2009 17:44:25 -0700 (PDT) Linus Torvalds wrote: > > > On Tue, 17 Mar 2009, KOSAKI Motohiro wrote: > > > > if we only need concern to O_DIRECT, below patch is enough. > > .. together with something like this, to handle the other direction. This > should take care of the case of an O_DIRECT write() call using a page that > was duplicated by an _earlier_ fork(), and then got split up by a COW in > the wrong direction (ie having data from the child show up in the write). > > Untested. But fairly trivial, after all. We simply do the same old > "reuse_swap_page()" count, but we only break the COW if the page count > afterwards is 1 (reuse_swap_page will have removed it from the swap cache > if it returns success). > > Does this (together with Kosaki's patch) pass the tests that Andrea had? > I'm not sure, but I have doubts about the AIO case. + down_read(&current->mm->directio_sem); retval = direct_io_worker(rw, iocb, inode, iov, offset, nr_segs, blkbits, get_block, end_io, dio); + up_read(&current->mm->directio_sem); With AIO, the range covered by this semaphore does not seem to be enough. 
Thanks, -Kame > Linus > > --- > mm/memory.c | 8 +++++++- > 1 files changed, 7 insertions(+), 1 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index baa999e..2bd5fb0 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1928,7 +1928,13 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, > } > page_cache_release(old_page); > } > - reuse = reuse_swap_page(old_page); > + /* > + * If we can re-use the swap page _and_ the end > + * result has only one user (the mapping), then > + * we reuse the whole page > + */ > + if (reuse_swap_page(old_page)) > + reuse = page_count(old_page) == 1; > unlock_page(old_page); > } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == > (VM_WRITE|VM_SHARED))) { > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 109BA6B0055 for ; Tue, 17 Mar 2009 01:42:37 -0400 (EDT) From: Nick Piggin Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Date: Tue, 17 Mar 2009 16:42:24 +1100 References: <1237007189.25062.91.camel@pasglop> <200903170529.08995.nickpiggin@yahoo.com.au> In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200903171642.25760.nickpiggin@yahoo.com.au> Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: KOSAKI Motohiro , Benjamin Herrenschmidt , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tuesday 17 March 2009 06:17:21 Linus Torvalds wrote: > On Tue, 17 Mar 2009, Nick Piggin wrote: > > What's buggy about it? Stupid bugs, or fundamentally broken? > > The lack of locking. I don't think it's broken. 
I can't see a problem. > > In my opinion it is not, given that you have to convert callers. If you > > say that you only care about fixing O_DIRECT, then yes I would probably > > agree the lock is nicer in that case. > > F*ck me, I'm not going to bother to argue. I'm not going to merge your > patch, it's that easy. > > Quite frankly, I don't think that the "bug" is a bug to begin with. > O_DIRECT+fork() can damn well continue to be broken. But if we fix it, we > fix it the _clean_ way with a simple patch, not with that shit-for-logic > horrible decow crap. > > It's that simple. I refuse to take putrid industrial waste patches for > something like this. I consider it is clean because it only adds branches in 3 places that are not taken unless direct IO and fork are used, and it fixes the "problem" in the VM directly leaving get_user_pages unchanged. I don't think it is conceptually such a problem to copy pages rather than COW them in fork. Seems fairly straightforward to me. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id B6D776B0055 for ; Tue, 17 Mar 2009 01:44:16 -0400 (EDT) From: Nick Piggin Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Date: Tue, 17 Mar 2009 16:44:08 +1100 References: <1237007189.25062.91.camel@pasglop> <200903170533.48423.nickpiggin@yahoo.com.au> In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200903171644.09260.nickpiggin@yahoo.com.au> Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: KOSAKI Motohiro , Benjamin Herrenschmidt , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tuesday 17 March 2009 06:22:12 Linus Torvalds wrote: > On Tue, 17 Mar 2009, Nick Piggin wrote: > > > So is all the extra flags for no > > > good reason. > > > > Which extra flags are you referring to? > > Fuck me, didn't you even read your own patch? > > What do you call PG_dontcow? It is a flag, there for a good reason. It sounded like you were seeing more than one flag, and that you thought they were useless. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id E70946B0047 for ; Tue, 17 Mar 2009 01:59:02 -0400 (EDT) From: Nick Piggin Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Date: Tue, 17 Mar 2009 16:58:55 +1100 References: <1237007189.25062.91.camel@pasglop> <200903171642.25760.nickpiggin@yahoo.com.au> In-Reply-To: <200903171642.25760.nickpiggin@yahoo.com.au> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200903171658.56278.nickpiggin@yahoo.com.au> Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: KOSAKI Motohiro , Benjamin Herrenschmidt , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tuesday 17 March 2009 16:42:24 Nick Piggin wrote: > I consider it is clean because it only adds branches in 3 places that > are not taken unless direct IO and fork are used, and it fixes the > "problem" in the VM directly leaving get_user_pages unchanged. leaving get_user_pages callers unchanged. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 9464A6B003D for ; Tue, 17 Mar 2009 12:20:44 -0400 (EDT) Date: Tue, 17 Mar 2009 13:19:00 +0100 From: Andrea Arcangeli Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Message-ID: <20090317121900.GD20555@random.random> References: <1237007189.25062.91.camel@pasglop> <200903141620.45052.nickpiggin@yahoo.com.au> <20090316223612.4B2A.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: KOSAKI Motohiro , Nick Piggin , Benjamin Herrenschmidt , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Mon, Mar 16, 2009 at 05:44:25PM -0700, Linus Torvalds wrote: > - reuse = reuse_swap_page(old_page); > + /* > + * If we can re-use the swap page _and_ the end > + * result has only one user (the mapping), then > + * we reuse the whole page > + */ > + if (reuse_swap_page(old_page)) > + reuse = page_count(old_page) == 1; > unlock_page(old_page); Think if the anon page is added to swapcache and the pte is unmapped by the VM and set non present after GUP taken the page for a O_DIRECT read (write to memory). If a thread writes to the page while the O_DIRECT read is running in another thread (or aio), then do_wp_page will make a copy of the swapcache under O_DIRECT read, and part of the read operation will get lost. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
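The scenario above can be played through in a small userspace model. This is only a sketch under stated assumptions: struct toy_page, toy_reuse_swap_page() and toy_wp_can_reuse() are hypothetical names, not kernel code, and the counting is simplified to the three references that matter here (one mapping, the swap cache, one gup pin). It mirrors the quoted hunk: reuse the page only if reuse_swap_page() succeeds and the resulting page count is 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; not a kernel structure. */
struct toy_page {
	int mapcount;   /* number of ptes mapping the page */
	int swapcount;  /* other users of the swap entry, excluding the cache */
	int refcount;   /* all references: mappings, swap cache, gup pins */
	bool swapcache; /* page currently sits in the swap cache */
};

/*
 * Toy version of reuse_swap_page(): succeeds when the page has a single
 * user, and on success drops the page from the swap cache together with
 * the reference the swap cache held.
 */
static bool toy_reuse_swap_page(struct toy_page *p)
{
	int users = p->mapcount;

	if (users <= 1 && p->swapcache) {
		users += p->swapcount;
		if (users == 1) {
			p->swapcache = false;
			p->refcount--;
		}
	}
	return users == 1;
}

/* The check from the quoted hunk: reuse only if nobody else holds a ref. */
static bool toy_wp_can_reuse(struct toy_page *p)
{
	assert(p->refcount >= p->mapcount);
	return toy_reuse_swap_page(p) && p->refcount == 1;
}
```

With refcount 2 (mapping plus swap cache) the write fault reuses the page. With refcount 3 (an extra gup pin from an in-flight O_DIRECT read) the check fails, the fault copies the page, and in this model the I/O keeps landing in the old copy, which is the lost-read case described above.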
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id D6AE06B003D for ; Tue, 17 Mar 2009 12:49:27 -0400 (EDT) Date: Tue, 17 Mar 2009 09:43:41 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: <20090317121900.GD20555@random.random> Message-ID: References: <1237007189.25062.91.camel@pasglop> <200903141620.45052.nickpiggin@yahoo.com.au> <20090316223612.4B2A.A69D9226@jp.fujitsu.com> <20090317121900.GD20555@random.random> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: KOSAKI Motohiro , Nick Piggin , Benjamin Herrenschmidt , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 17 Mar 2009, Andrea Arcangeli wrote: > > Think if the anon page is added to swapcache and the pte is unmapped > by the VM and set non present after GUP taken the page for a O_DIRECT > read (write to memory). If a thread writes to the page while the > O_DIRECT read is running in another thread (or aio), then do_wp_page > will make a copy of the swapcache under O_DIRECT read, and part of the > read operation will get lost. In that case, you aren't getting to the "do_wp_page()" case at all, you're getting the "do_swap_page()" case. Which does its own reuse_swap_page() thing (and that one I didn't touch - on purpose). But you're right - it only does that for writes. If we _first_ do a read (to swap it back in), it will mark it read-only and _then_ we can get a "do_wp_page()" that splits it. So yes - I had expected our VM to be sane, and have a writable private page _stay_ writable (in the absence of fork() it should never turn into a COW page), but the swapout+swapin code can result in a rw page that turns read-only in order to catch a swap cache invalidation. Good catch.
Let me think about it. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 500546B003D for ; Tue, 17 Mar 2009 13:06:26 -0400 (EDT) Date: Tue, 17 Mar 2009 10:01:06 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: Message-ID: References: <1237007189.25062.91.camel@pasglop> <200903141620.45052.nickpiggin@yahoo.com.au> <20090316223612.4B2A.A69D9226@jp.fujitsu.com> <20090317121900.GD20555@random.random> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: KOSAKI Motohiro , Nick Piggin , Benjamin Herrenschmidt , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 17 Mar 2009, Linus Torvalds wrote: > > So yes - I had expected our VM to be sane, and have a writable private > page _stay_ writable (in the absense of fork() it should never turn into a > COW page), but the swapout+swapin code can result in a rw page that turns > read-only in order to catch a swap cache invalidation. > > Good catch. Let me think about it. Btw, I think this is actually a pre-existing bug regardless of my patch. That same swapout+swapin problem seems to lose the dirty bit on a O_DIRECT write - exactly for the same reason. When swapin turns the page into a read-only page in order to keep the physical page in the swap cache, the write to the physical page (that was gotten by get_user_pages() earlier) will bypass all that. 
So the get_user_pages() users will then write to the page, but the next time we swap things out, if nobody _else_ wrote to it, that write will be lost because we'll just drop the page (it was in the swap cache!) even though it had changed data on it. My patch changed the scenario a bit (split page rather than dropped page), but the fundamental cause seems to be the same - the swap cache code very much depends on writes to the _virtual_ address. Or am I missing something? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 8631E6B003D for ; Tue, 17 Mar 2009 13:10:59 -0400 (EDT) Date: Tue, 17 Mar 2009 18:10:49 +0100 From: Andrea Arcangeli Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Message-ID: <20090317171049.GA28447@random.random> References: <1237007189.25062.91.camel@pasglop> <200903141620.45052.nickpiggin@yahoo.com.au> <20090316223612.4B2A.A69D9226@jp.fujitsu.com> <20090317121900.GD20555@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: KOSAKI Motohiro , Nick Piggin , Benjamin Herrenschmidt , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, Mar 17, 2009 at 10:01:06AM -0700, Linus Torvalds wrote: > That same swapout+swapin problem seems to lose the dirty bit on a O_DIRECT I think the dirty bit is set in dio_bio_complete (or bio_check_pages_dirty for the aio case), which forces the swapcache to be written out again before the page can be freed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id D95846B003D for ; Tue, 17 Mar 2009 13:48:43 -0400 (EDT) Date: Tue, 17 Mar 2009 10:43:08 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: <20090317171049.GA28447@random.random> Message-ID: References: <1237007189.25062.91.camel@pasglop> <200903141620.45052.nickpiggin@yahoo.com.au> <20090316223612.4B2A.A69D9226@jp.fujitsu.com> <20090317121900.GD20555@random.random> <20090317171049.GA28447@random.random> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: KOSAKI Motohiro , Nick Piggin , Benjamin Herrenschmidt , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 17 Mar 2009, Andrea Arcangeli wrote: > On Tue, Mar 17, 2009 at 10:01:06AM -0700, Linus Torvalds wrote: > > That same swapout+swapin problem seems to lose the dirty bit on a O_DIRECT > > I think the dirty bit is set in dio_bio_complete (or > bio_check_pages_dirty for the aio case) so forcing the swapcache to be > written out again before the page can be freed. Do all the other get_user_pages() users do that, though? [ Looks around - at least access_process_vm(), IB and the NFS direct code do. So we seem to be mostly ok, at least for the main users ] Ok, no worries. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
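Why the explicit re-dirtying matters can be shown with a toy model. This is a sketch, not kernel code: struct toy_page and the toy_* helpers are hypothetical, and "swap" is just a second integer. A write that goes through the pinned physical page (as DMA for an O_DIRECT read would) does not pass through the pte, so unless the caller marks the page dirty afterwards, reclaim treats the swap copy as current and drops the new data.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a swap-backed anonymous page. */
struct toy_page {
	int data;       /* contents of the page in memory */
	int swap_copy;  /* contents of the copy on the swap device */
	bool dirty;     /* memory differs from the swap copy */
	bool resident;  /* page is currently in memory */
};

/*
 * Write through the physical page, bypassing the page tables.
 * mark_dirty models the caller setting the page dirty after the I/O.
 */
static void toy_dma_write(struct toy_page *p, int value, bool mark_dirty)
{
	assert(p->resident);
	p->data = value;
	if (mark_dirty)
		p->dirty = true;
}

/* Reclaim: a clean page is simply dropped, a dirty one is written out first. */
static void toy_reclaim(struct toy_page *p)
{
	if (p->dirty) {
		p->swap_copy = p->data;
		p->dirty = false;
	}
	p->resident = false;
}

/* Swap-in: whatever is on the swap device becomes the page contents. */
static int toy_swap_in(struct toy_page *p)
{
	assert(!p->resident);
	p->data = p->swap_copy;
	p->resident = true;
	return p->data;
}
```

In this model, writing 7 without marking the page dirty and then reclaiming it brings the old value back on swap-in; marking it dirty after the write preserves the 7.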
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id C60086B003D for ; Tue, 17 Mar 2009 14:14:53 -0400 (EDT) Date: Tue, 17 Mar 2009 11:09:02 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: Message-ID: References: <1237007189.25062.91.camel@pasglop> <200903141620.45052.nickpiggin@yahoo.com.au> <20090316223612.4B2A.A69D9226@jp.fujitsu.com> <20090317121900.GD20555@random.random> <20090317171049.GA28447@random.random> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: KOSAKI Motohiro , Nick Piggin , Benjamin Herrenschmidt , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 17 Mar 2009, Linus Torvalds wrote: > > Do all the other get_user_pages() users do that, though? > > [ Looks around - at least access_process_vm(), IB and the NFS direct code > do. So we seem to be mostly ok, at least for the main users ] > > Ok, no worries. This problem is actually pretty easy to fix for anonymous pages: since the act of pinning (for writes) should have done all the COW stuff and made sure the page is not in the swap cache, we only need to avoid adding it back. IOW, something like the following makes sense on all levels regardless (note: I didn't check if there is some off-by-one issue where we've raised the page count for other reasons when scanning it, so this is not meant to be a serious patch, just a "something along these lines" thing). This does not obviate the need to mark pages dirty afterwards, though, since true shared mappings always cause that (and we cannot keep them dirty, since somebody may be doing fsync() on them or something like that). But since the COW issue is only a matter of private pages, this handles that trivially. 
Linus --- mm/swap_state.c | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/mm/swap_state.c b/mm/swap_state.c index 3ecea98..83137fe 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -140,6 +140,10 @@ int add_to_swap(struct page *page) VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(!PageUptodate(page)); + /* Refuse to add pinned pages to the swap cache */ + if (page_count(page) > page_mapped(page)) + return 0; + for (;;) { entry = get_swap_page(); if (!entry.val) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 82C356B003D for ; Tue, 17 Mar 2009 14:25:33 -0400 (EDT) Date: Tue, 17 Mar 2009 11:19:59 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: Message-ID: References: <1237007189.25062.91.camel@pasglop> <200903141620.45052.nickpiggin@yahoo.com.au> <20090316223612.4B2A.A69D9226@jp.fujitsu.com> <20090317121900.GD20555@random.random> <20090317171049.GA28447@random.random> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: KOSAKI Motohiro , Nick Piggin , Benjamin Herrenschmidt , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 17 Mar 2009, Linus Torvalds wrote: > > This problem is actually pretty easy to fix for anonymous pages: since the > act of pinning (for writes) should have done all the COW stuff and made > sure the page is not in the swap cache, we only need to avoid adding it > back. 
An alternative approach would have been to just count page pinning as being a "referenced", which to some degree would be even more logical (we don't set the referenced flag when we look those pages up). That would also affect pages that were get_user_page'd just for reading, which might be seen as an additional bonus. The "don't turn pinned pages into swap cache pages" is a somewhat more direct patch, though. It gives more obvious guarantees about the lifetime behaviour of anon pages wrt get_user_pages[_fast]().. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id DD94F6B003D for ; Tue, 17 Mar 2009 14:47:09 -0400 (EDT) Date: Tue, 17 Mar 2009 19:46:47 +0100 From: Andrea Arcangeli Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Message-ID: <20090317184647.GC28447@random.random> References: <200903141620.45052.nickpiggin@yahoo.com.au> <20090316223612.4B2A.A69D9226@jp.fujitsu.com> <20090317121900.GD20555@random.random> <20090317171049.GA28447@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: KOSAKI Motohiro , Nick Piggin , Benjamin Herrenschmidt , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, Mar 17, 2009 at 11:19:59AM -0700, Linus Torvalds wrote: > > > On Tue, 17 Mar 2009, Linus Torvalds wrote: > > > > This problem is actually pretty easy to fix for anonymous pages: since the > > act of pinning (for writes) should have done all the COW stuff and made > > sure the page is not in the swap cache, we only need to avoid adding it > > back. 
> > An alternative approach would have been to just count page pinning as > being a "referenced", which to some degree would be even more logical (we > don't set the referenced flag when we look those pages up). That would > also affect pages that were get_user_page'd just for reading, which might > be seen as an additional bonus. > > The "don't turn pinned pages into swap cache pages" is a somewhat more > direct patch, though. It gives more obvious guarantees about the lifetime > behaviour of anon pages wrt get_user_pages[_fast]().. I don't think you can tackle this from add_to_swap because the page may be in the swapcache well before gup runs (gup(write=1) can map the swapcache as exclusive and read-write in the pte). So then what happens is again that the VM unmaps the page, do_swap_page map it as readonly swapcache (so far so good), and the do_wp_page copies the page under O_DIRECT read again. The off by one is most certain as it's invoked by the VM but that's an implementation detail not relevant for this discussion agreed, and I guess you also meant page_mapcount instead of page_mapped or I think shared pages would stop being swapped out. That is more relevant because of some worry I have in the comparison between page count and mapcount, see below. My preference is still to keeps pages with elevated refcount pinned in the ptes like 2.6.7 did, that will allow do_wp_page to takeover only pages with page_count not elevated without risk of calling do_wp_page on any page under gup. Only worry I have now is how to compare count with mapcount when both can change under us if mapcount > 1, but if you meant page_mapcount in add_to_swap as I think, that logic in add_to_swap would have the same problem and so it needs a solution for doing a coherent/safe comparison too. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id E4A3D6B003D for ; Tue, 17 Mar 2009 15:09:55 -0400 (EDT) Date: Tue, 17 Mar 2009 12:03:55 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: <20090317184647.GC28447@random.random> Message-ID: References: <200903141620.45052.nickpiggin@yahoo.com.au> <20090316223612.4B2A.A69D9226@jp.fujitsu.com> <20090317121900.GD20555@random.random> <20090317171049.GA28447@random.random> <20090317184647.GC28447@random.random> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: KOSAKI Motohiro , Nick Piggin , Benjamin Herrenschmidt , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 17 Mar 2009, Andrea Arcangeli wrote: > > I don't think you can tackle this from add_to_swap because the page > may be in the swapcache well before gup runs (gup(write=1) can map the > swapcache as exclusive and read-write in the pte). If it's in the swap cache, it should be mapped read-only, and gup(write=1) will do the COW break and un-swapcache it. When can it be writably in the swap cache? The write-only thing is the one we use to invalidate stale swap cache entries, and when we mark those pages writable (in do_wp_page or do_swap_page) we always remove the page from the swap cache at the same time. Or is there some other path I missed? > My preference is still to keeps pages with elevated refcount pinned in > the ptes like 2.6.7 did, that will allow do_wp_page to takeover only > pages with page_count not elevated without risk of calling do_wp_page > on any page under gup. I agree that that would also work - and be even simpler. 
If done right, we can even avoid clearing the dirty bit (in page_mkclean()) for such pages, and now it works for _all_ pages, not just anonymous pages. IOW, even if you had a shared mapping and were to GUP() those pages for writing, they'd _stay_ dirty until you free'd them - no need to re-dirty them in case somebody did IO on them. > Only worry I have now is how to compare count > with mapcount when both can change under us if mapcount > 1, but if > you meant page_mapcount in add_to_swap as I think, that logic in > add_to_swap would have the same problem and so it needs a solution for > doing a coherent/safe comparison too. I don't think you can use just mapcount on its own - you have to compare it to page_count(). Otherwise perfectly normal (non-gup) pages will trigger, since that page count is the only thing that differs between the two cases. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
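The count-versus-mapcount comparison discussed above can be sketched in userspace as follows. Everything here is a hypothetical illustration: struct toy_page, toy_maybe_pinned() and toy_maybe_pinned_bool() are not kernel functions, and extra_refs stands for references the caller already knows about (the possible off-by-one mentioned in the thread). The second helper uses a boolean "is mapped" value the way the posted add_to_swap() hunk did, to show why a page shared by several mappings would then look pinned.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only the two counters are modelled. */
struct toy_page {
	int count;    /* all references, including one per mapping */
	int mapcount; /* number of ptes mapping the page */
};

/*
 * A page looks pinned when it has more references than its mappings
 * (plus references the caller itself holds) can account for.
 */
static bool toy_maybe_pinned(const struct toy_page *p, int extra_refs)
{
	assert(p->count >= p->mapcount + extra_refs);
	return p->count > p->mapcount + extra_refs;
}

/* Same test against a boolean "mapped at all" value instead of mapcount. */
static bool toy_maybe_pinned_bool(const struct toy_page *p)
{
	int mapped = p->mapcount > 0;

	return p->count > mapped;
}
```

For a page with count 3 and mapcount 3 the first helper reports "not pinned", while the boolean variant reports "pinned" and would keep such shared pages from ever being swapped out. Neither helper addresses the separate concern that both counters can change concurrently when mapcount > 1.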
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 9A5F56B004D for ; Tue, 17 Mar 2009 15:35:56 -0400 (EDT) Date: Tue, 17 Mar 2009 20:35:38 +0100 From: Andrea Arcangeli Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Message-ID: <20090317193538.GD28447@random.random> References: <20090317121900.GD20555@random.random> <20090317171049.GA28447@random.random> <20090317184647.GC28447@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: KOSAKI Motohiro , Nick Piggin , Benjamin Herrenschmidt , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, Mar 17, 2009 at 12:03:55PM -0700, Linus Torvalds wrote: > If it's in the swap cache, it should be mapped read-only, and gup(write=1) > will do the COW break and un-swapcache it. It may turn it read-write instead of COW break and un-swapcache. if (write_access && reuse_swap_page(page)) { pte = maybe_mkwrite(pte_mkdirty(pte), vma); This is done to avoid fragmenting the swap device. > I agree that that would also work - and be even simpler. If done right, we > can even avoid clearing the dirty bit (in page_mkclean()) for such pages, > and now it works for _all_ pages, not just anonymous pages. > > IOW, even if you had a shared mapping and were to GUP() those pages for > writing, they'd _stay_ dirty until you free'd them - no need to re-dirty > them in case somebody did IO on them. I agree in principle, if the VM stays away from pages under GUP theoretically the dirty bit shouldn't be transferred to the PG_dirty of the page until after the I/O is complete, so the dirty bit set by gup in the pte may be enough. 
Not sure if there are other places that could transfer the dirty bit of the pte before the gup user releases the page-pin. > I don't think you can use just mapcount on its own - you have to compare > it to page_count(). Otherwise perfectly normal (non-gup) pages will > trigger, since that page count is the only thing that differs between the > two cases. Yes, page_count shall be compared with page_mapcount. My worry is only that both can change from under us if mapcount > 1 (not enough to hold PT lock to be sure mapcount/count is stable if mapcount > 1). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id A28876B004D for ; Tue, 17 Mar 2009 16:00:30 -0400 (EDT) Date: Tue, 17 Mar 2009 12:55:05 -0700 (PDT) From: Linus Torvalds Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: <20090317193538.GD28447@random.random> Message-ID: References: <20090317121900.GD20555@random.random> <20090317171049.GA28447@random.random> <20090317184647.GC28447@random.random> <20090317193538.GD28447@random.random> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrea Arcangeli Cc: KOSAKI Motohiro , Nick Piggin , Benjamin Herrenschmidt , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 17 Mar 2009, Andrea Arcangeli wrote: > On Tue, Mar 17, 2009 at 12:03:55PM -0700, Linus Torvalds wrote: > > If it's in the swap cache, it should be mapped read-only, and gup(write=1) > > will do the COW break and un-swapcache it. > > It may turn it read-write instead of COW break and un-swapcache. 
> > if (write_access && reuse_swap_page(page)) { > pte = maybe_mkwrite(pte_mkdirty(pte), vma); > > This is done to avoid fragmenting the swap device. Right, but reuse_swap_page() will have removed it from the swapcache if it returns success. So if the page is writable in the page tables, it should not be in the swap cache. Oh, except that we do it in shrink_page_list(), and while we're going to do that whole "try_to_unmap()", I guess it can fail to unmap there? In that case, you could actually have it in the page tables while in the swap cache. And besides, we do remove it from the page tables in the wrong order (ie we add it to the swap cache first, _then_ remove it), so I guess that also ends up being a race with another CPU doing fast-gup. And we _have_ to do it in that order at least for the map_count > 1 case, since a read-only swap page may be shared by multiple mm's, and the swap-cache is how we make sure that they all end up joining together. Of course, the only case we really care about is the map_count=1 case, since that's the only one that is possible after GUP has succeeded (assuming, as always, that fork() is locked out of making copies). So we really only care about the simpler case. > I agree in principle, if the VM stays away from pages under GUP > theoretically the dirty bit shouldn't be transferred to the PG_dirty > of the page until after the I/O is complete, so the dirty bit set by > gup in the pte may be enough. Not sure if there are other places that > could transfer the dirty bit of the pte before the gup user releases > the page-pin. I do suspect there are subtle issues like the above. > > I don't think you can use just mapcount on its own - you have to compare > > it to page_count(). Otherwise perfectly normal (non-gup) pages will > > trigger, since that page count is the only thing that differs between the > > two cases. > > Yes, page_count shall be compared with page_mapcount. 
My worry is only > that both can change from under us if mapcount > 1 (not enough to hold > PT lock to be sure mapcount/count is stable if mapcount > 1). Now, that's not a big worry, because we only care about mapcount=1 for the anonymous page case at least. So we can stabilize that one with the pt lock. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 7E63C6B005D for ; Tue, 17 Mar 2009 22:04:18 -0400 (EDT) Received: from m6.gw.fujitsu.co.jp ([10.0.50.76]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id n2I24Fae032652 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Wed, 18 Mar 2009 11:04:15 +0900 Received: from smail (m6 [127.0.0.1]) by outgoing.m6.gw.fujitsu.co.jp (Postfix) with ESMTP id 4464645DE53 for ; Wed, 18 Mar 2009 11:04:15 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (s6.gw.fujitsu.co.jp [10.0.50.96]) by m6.gw.fujitsu.co.jp (Postfix) with ESMTP id 2993045DE50 for ; Wed, 18 Mar 2009 11:04:15 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id 1F2C91DB8038 for ; Wed, 18 Mar 2009 11:04:15 +0900 (JST) Received: from m107.s.css.fujitsu.com (m107.s.css.fujitsu.com [10.249.87.107]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id CDC261DB803A for ; Wed, 18 Mar 2009 11:04:14 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] In-Reply-To: <200903170323.45917.nickpiggin@yahoo.com.au> References: <20090316223612.4B2A.A69D9226@jp.fujitsu.com> <200903170323.45917.nickpiggin@yahoo.com.au> Message-Id: <20090318105735.BD17.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" 
Content-Transfer-Encoding: 7bit Date: Wed, 18 Mar 2009 11:04:13 +0900 (JST) Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: kosaki.motohiro@jp.fujitsu.com, Benjamin Herrenschmidt , Linus Torvalds , Andrea Arcangeli , Ingo Molnar , Nick Piggin , Hugh Dickins , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: Hi > > --- > > fs/direct-io.c | 2 ++ > > include/linux/init_task.h | 1 + > > include/linux/mm_types.h | 3 +++ > > kernel/fork.c | 3 +++ > > 4 files changed, 9 insertions(+), 0 deletions(-) > > It is an interesting patch. Thanks for throwing it into the discussion. > I do prefer to close the race up for all cases if we decide to do > anything at all about it, ie. all or nothing. But maybe others disagree. Honestly, I wasn't expecting Linus's reaction, but I hope to make my v2. My points are: - My patch doesn't prevent implementing madvise(DONTCOW), I think. - The complexity of Andrea's patch mainly comes from the effort to avoid a performance regression, so later kernel improvements can shrink his patch automatically. Fortunately KSM hasn't been merged yet; we can discuss his patch again when KSM is submitted. - Anyway, it can fix the bug. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 6AC446B003D for ; Sun, 22 Mar 2009 07:42:05 -0400 (EDT)
Date: Sun, 22 Mar 2009 21:23:56 +0900 (JST)
From: KOSAKI Motohiro
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
In-Reply-To: <20090318105735.BD17.A69D9226@jp.fujitsu.com>
References: <200903170323.45917.nickpiggin@yahoo.com.au> <20090318105735.BD17.A69D9226@jp.fujitsu.com>
Message-Id: <20090322205249.6801.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: Nick Piggin, Linus Torvalds, Andrea Arcangeli
Cc: kosaki.motohiro@jp.fujitsu.com, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

Hi

The following patch is my v2 approach.
It survives Andrea's three dio test cases.

Linus suggested changing the add_to_swap() and shrink_page_list() code
to avoid a false COW in do_wp_page() when a page goes into the swapcache.

I think it's a good idea, but it's a bit radical, so I think it's
something for the development tree to tackle.

So I decided to use Nick's early decow in get_user_pages(),
and RO-mapped pages don't use gup_fast.

Yeah, my approach is an extremely brutal way and a big hammer, but I think
it doesn't have a performance issue in the real world.

Why?

Practically, we can assume the following two things.

(1) The buffer passed as the write(2) syscall argument is an RW-mapped
    page or a COWed RO page.

    If anybody writes the following code, my patch causes a performance
    regression.

       buf = mmap()
       memset(buf, 0x11, len);
       mprotect(buf, len, PROT_READ)
       fd = open(O_DIRECT)
       write(fd, buf, len)

    But it's very artificial code. Nobody wants this.
    OK, we can ignore this.

(2) A DirectIO user process isn't a short-lived process.

    Early decow only decreases the performance of short-lived processes,
    because a long-lived process does the decowing anyway before exec(2).

    And all DB applications are definitely long-lived processes,
    so early decow doesn't cause a regression.


TODO
  - implement down_write_killable().
    (But it isn't important, because this is a rare-case issue.)
  - implement the non-x86 portion.


Am I missing anything?


Note: this is still an RFC, not intended for submission.

--
 arch/x86/mm/gup.c         |   22 ++++++++++++++--------
 fs/direct-io.c            |   11 +++++++++++
 include/linux/init_task.h |    1 +
 include/linux/mm.h        |    9 +++++++++
 include/linux/mm_types.h  |    6 ++++++
 kernel/fork.c             |    3 +++
 mm/internal.h             |   10 ----------
 mm/memory.c               |   17 ++++++++++++++++-
 mm/util.c                 |    8 ++++++--
 9 files changed, 66 insertions(+), 21 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index be54176..02e479b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -74,8 +74,10 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 	pte_t *ptep;
 
 	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
+
+	/* Maybe the read only pte is cow mapped page. (or not maybe)
+	   So, falling back to get_user_pages() is better */
+	mask |= _PAGE_RW;
 
 	ptep = pte_offset_map(&pmd, addr);
 	do {
@@ -114,8 +116,7 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 	int refs;
 
 	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
+	mask |= _PAGE_RW;
 	if ((pte_flags(pte) & mask) != mask)
 		return 0;
 	/* hugepages are never "special" */
@@ -171,8 +172,7 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
 	int refs;
 
 	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
+	mask |= _PAGE_RW;
 	if ((pte_flags(pte) & mask) != mask)
 		return 0;
 	/* hugepages are never "special" */
@@ -272,6 +272,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 
 	{
 		int ret;
+		int gup_flags;
 
 slow:
 		local_irq_enable();
@@ -280,9 +281,14 @@ slow_irqon:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
+		gup_flags = GUP_FLAGS_PINNING_PAGE;
+		if (write)
+			gup_flags |= GUP_FLAGS_WRITE;
+
 		down_read(&mm->mmap_sem);
-		ret = get_user_pages(current, mm, start,
-			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
+		ret = __get_user_pages(current, mm, start,
+				       (end - start) >> PAGE_SHIFT, gup_flags,
+				       pages, NULL);
 		up_read(&mm->mmap_sem);
 
 		/* Have to be a bit careful with return values */
diff --git a/fs/direct-io.c b/fs/direct-io.c
index b6d4390..4f46720 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -131,6 +131,9 @@ struct dio {
 	int is_async;			/* is IO async ? */
 	int io_error;			/* IO error in completion path */
 	ssize_t result;			/* IO result */
+
+	/* fork exclusive stuff */
+	struct mm_struct *mm;
 };
 
 /*
@@ -243,6 +246,9 @@ static int dio_complete(struct dio *dio, loff_t offset, int ret)
 	if (dio->lock_type == DIO_LOCKING)
 		/* lockdep: non-owner release */
 		up_read_non_owner(&dio->inode->i_alloc_sem);
+	up_read_non_owner(&dio->mm->mm_pinned_sem);
+	mmdrop(dio->mm);
+	dio->mm = NULL;
 
 	if (ret == 0)
 		ret = dio->page_errors;
@@ -942,6 +948,7 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
 	ssize_t ret = 0;
 	ssize_t ret2;
 	size_t bytes;
+	struct mm_struct *mm;
 
 	dio->inode = inode;
 	dio->rw = rw;
@@ -960,6 +967,10 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
 	spin_lock_init(&dio->bio_lock);
 	dio->refcount = 1;
 
+	mm = dio->mm = current->mm;
+	atomic_inc(&mm->mm_count);
+	down_read_non_owner(&mm->mm_pinned_sem);
+
 	/*
 	 * In case of non-aligned buffers, we may need 2 more
 	 * pages since we need to zero out first and last block.
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index e752d97..3bc134a 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -37,6 +37,7 @@ extern struct fs_struct init_fs;
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(name.page_table_lock),	\
 	.mmlist		= LIST_HEAD_INIT(name.mmlist),		\
 	.cpu_vm_mask	= CPU_MASK_ALL,				\
+	.mm_pinned_sem	= __RWSEM_INITIALIZER(name.mm_pinned_sem), \
 }
 
 #define INIT_SIGNALS(sig) {						\
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 065cdf8..dcc6ccc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -823,6 +823,15 @@ static inline int handle_mm_fault(struct mm_struct *mm,
 
 extern int make_pages_present(unsigned long addr, unsigned long end);
 extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);
+#define GUP_FLAGS_WRITE                  0x01
+#define GUP_FLAGS_FORCE                  0x02
+#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x04
+#define GUP_FLAGS_IGNORE_SIGKILL         0x08
+#define GUP_FLAGS_PINNING_PAGE           0x10
+
+int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+		     unsigned long start, int len, int flags,
+		     struct page **pages, struct vm_area_struct **vmas);
 
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
 		int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d84feb7..27089d9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -274,6 +274,12 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+
+	/*
+	 * if there are on-flight directio or similar pinning action,
+	 * COW cause memory corruption. the sem protect it by preventing fork.
+	 */
+	struct rw_semaphore mm_pinned_sem;
 };
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
diff --git a/kernel/fork.c b/kernel/fork.c
index 4854c2c..ded7caf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -266,6 +266,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 	unsigned long charge;
 	struct mempolicy *pol;
 
+	down_write(&oldmm->mm_pinned_sem);
 	down_write(&oldmm->mmap_sem);
 	flush_cache_dup_mm(oldmm);
 	/*
@@ -368,6 +369,7 @@ out:
 	up_write(&mm->mmap_sem);
 	flush_tlb_mm(oldmm);
 	up_write(&oldmm->mmap_sem);
+	up_write(&oldmm->mm_pinned_sem);
 	return retval;
 fail_nomem_policy:
 	kmem_cache_free(vm_area_cachep, tmp);
@@ -431,6 +433,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->cached_hole_size = ~0UL;
 	mm_init_owner(mm, p);
+	init_rwsem(&mm->mm_pinned_sem);
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
diff --git a/mm/internal.h b/mm/internal.h
index 478223b..04f25d2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -272,14 +272,4 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
 {
 }
 #endif /* CONFIG_SPARSEMEM */
-
-#define GUP_FLAGS_WRITE                  0x1
-#define GUP_FLAGS_FORCE                  0x2
-#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4
-#define GUP_FLAGS_IGNORE_SIGKILL         0x8
-
-int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		     unsigned long start, int len, int flags,
-		     struct page **pages, struct vm_area_struct **vmas);
-
 #endif
diff --git a/mm/memory.c b/mm/memory.c
index baa999e..b00e3e9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1211,6 +1211,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 	int force = !!(flags & GUP_FLAGS_FORCE);
 	int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
 	int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL);
+	int decow = 0;
 
 	if (len <= 0)
 		return 0;
@@ -1279,6 +1280,20 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			continue;
 		}
 
+		/*
+		 * Except in special cases where the caller will not read to or
+		 * write from these pages, we must break COW for any pages
+		 * returned from get_user_pages, so that our caller does not
+		 * subsequently end up with the pages of a parent or child
+		 * process after a COW takes place.
+		 */
+		if (flags & GUP_FLAGS_PINNING_PAGE) {
+			if (!pages)
+				return -EINVAL;
+			if (is_cow_mapping(vma->vm_flags))
+				decow = 1;
+		}
+
 		foll_flags = FOLL_TOUCH;
 		if (pages)
 			foll_flags |= FOLL_GET;
@@ -1299,7 +1314,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 					fatal_signal_pending(current)))
 				return i ? i : -ERESTARTSYS;
 
-			if (write)
+			if (write || decow)
 				foll_flags |= FOLL_WRITE;
 
 			cond_resched();
diff --git a/mm/util.c b/mm/util.c
index 37eaccd..a80d5d3 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -197,10 +197,14 @@ int __attribute__((weak)) get_user_pages_fast(unsigned long start,
 {
 	struct mm_struct *mm = current->mm;
 	int ret;
+	int gup_flags = GUP_FLAGS_PINNING_PAGE;
+
+	if (write)
+		gup_flags |= GUP_FLAGS_WRITE;
 
 	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, start, nr_pages,
-					write, 0, pages, NULL);
+	ret = __get_user_pages(current, mm, start, nr_pages,
+				gup_flags, pages, NULL);
 	up_read(&mm->mmap_sem);
 
 	return ret;
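
The exclusion that the patch builds around mm_pinned_sem can be sketched outside the kernel. What follows is only a single-threaded toy model, not kernel code: the struct and function names are hypothetical, nothing blocks, and nothing is atomic. It shows only which operations exclude each other: direct IO takes the read side in direct_io_worker() and drops it in dio_complete(), while fork takes the write side in dup_mmap(). In the kernel the losing side would sleep in down_read()/down_write() rather than fail.

```c
#include <stdbool.h>

/*
 * Toy bookkeeping model of a reader/writer semaphore.
 * Hypothetical names; assumed to mirror the roles in the patch only.
 */
struct pinned_sem_model {
	int  readers;	/* in-flight direct-IO requests (read side) */
	bool writer;	/* fork is copying the address space (write side) */
};

/* Models direct_io_worker(): allowed unless fork holds the write side. */
static bool dio_try_start(struct pinned_sem_model *s)
{
	if (s->writer)
		return false;
	s->readers++;
	return true;
}

/* Models dio_complete(): drop the read side taken at submission. */
static void dio_finish(struct pinned_sem_model *s)
{
	s->readers--;
}

/* Models dup_mmap(): allowed only when no direct IO is in flight. */
static bool fork_try_start(struct pinned_sem_model *s)
{
	if (s->writer || s->readers > 0)
		return false;
	s->writer = true;
	return true;
}

/* Models the end of dup_mmap(): release the write side. */
static void fork_finish(struct pinned_sem_model *s)
{
	s->writer = false;
}
```

Several direct-IO requests can be in flight together, but fork has to wait for all of them, and new direct IO has to wait for a fork that is in progress.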
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id A5F596B0047 for ; Sun, 22 Mar 2009 19:20:34 -0400 (EDT)
Received: from m3.gw.fujitsu.co.jp ([10.0.50.73]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id n2N0Dook019176 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Mon, 23 Mar 2009 09:13:50 +0900
Received: from smail (m3 [127.0.0.1]) by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 090B445DD7F for ; Mon, 23 Mar 2009 09:13:50 +0900 (JST)
Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93]) by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id D75E645DD7E for ; Mon, 23 Mar 2009 09:13:49 +0900 (JST)
Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id A756AE0800C for ; Mon, 23 Mar 2009 09:13:49 +0900 (JST)
Received: from m108.s.css.fujitsu.com (m108.s.css.fujitsu.com [10.249.87.108]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 4A18D1DB803E for ; Mon, 23 Mar 2009 09:13:49 +0900 (JST)
From: KOSAKI Motohiro
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
In-Reply-To: <20090322205249.6801.A69D9226@jp.fujitsu.com>
References: <20090318105735.BD17.A69D9226@jp.fujitsu.com> <20090322205249.6801.A69D9226@jp.fujitsu.com>
Message-Id: <20090323091056.69DF.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Date: Mon, 23 Mar 2009 09:13:48 +0900 (JST)
Sender: owner-linux-mm@kvack.org
To: KOSAKI Motohiro
Cc: Nick Piggin, Linus Torvalds, Andrea Arcangeli, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

> Hi
>
> The following patch is my v2 approach.
> It survives Andrea's three dio test cases.
>
> Linus suggested changing the add_to_swap() and shrink_page_list() code
> to avoid a false COW in do_wp_page() when a page goes into the swapcache.
>
> I think it's a good idea, but it's a bit radical, so I think it's
> something for the development tree to tackle.
>
> So I decided to use Nick's early decow in get_user_pages(),
> and RO-mapped pages don't use gup_fast.
>
> Yeah, my approach is an extremely brutal way and a big hammer, but I think
> it doesn't have a performance issue in the real world.
>
> Why?
>
> Practically, we can assume the following two things.
>
> (1) The buffer passed as the write(2) syscall argument is an RW-mapped
>     page or a COWed RO page.
>
>     If anybody writes the following code, my patch causes a performance
>     regression.
>
>        buf = mmap()
>        memset(buf, 0x11, len);
>        mprotect(buf, len, PROT_READ)
>        fd = open(O_DIRECT)
>        write(fd, buf, len)
>
>     But it's very artificial code. Nobody wants this.
>     OK, we can ignore this.
>
> (2) A DirectIO user process isn't a short-lived process.
>
>     Early decow only decreases the performance of short-lived processes,
>     because a long-lived process does the decowing anyway before exec(2).
>
>     And all DB applications are definitely long-lived processes,
>     so early decow doesn't cause a regression.

Frankly, Linus suggested inserting one branch into do_wp_page(),
but I removed one branch from gup_fast.

I think it's a good performance trade-off. But if anybody hates my approach,
I'll drop my chicken heart and try the way Linus suggested.
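
For reference, the "artificial" pattern quoted in case (1) can be spelled out as a small user-space function. This is a sketch under assumptions, not code from the thread: the function name and the error-handling convention are hypothetical, the length is assumed to be a multiple of the device's logical block size, and the target filesystem is assumed to support O_DIRECT (some, such as older tmpfs, reject it at open time).

```c
#define _GNU_SOURCE		/* needed for O_DIRECT on glibc */
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Fill an anonymous buffer, make it read-only, then write it with O_DIRECT.
 * Under the v2 patch such a read-only source buffer no longer takes the
 * gup_fast path. Returns 0 on success or a negative errno value.
 */
static int direct_write_from_ro_buffer(const char *path, size_t len)
{
	int fd;
	int ret = 0;
	ssize_t n;
	char *buf;

	/* mmap() returns page-aligned memory, which O_DIRECT requires. */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -errno;

	memset(buf, 0x11, len);
	if (mprotect(buf, len, PROT_READ) < 0) {
		ret = -errno;
		goto out_unmap;
	}

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0600);
	if (fd < 0) {
		ret = -errno;
		goto out_unmap;
	}

	n = write(fd, buf, len);
	if (n < 0)
		ret = -errno;
	else if ((size_t)n != len)
		ret = -EIO;	/* treat a short write as an error here */

	close(fd);
out_unmap:
	munmap(buf, len);
	return ret;
}
```

A caller would typically pass a length derived from sysconf(_SC_PAGESIZE) so that both the buffer and the transfer size meet the alignment rules.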
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 6C94E6B003D for ; Mon, 23 Mar 2009 11:21:34 -0400 (EDT)
Date: Mon, 23 Mar 2009 17:29:54 +0100
From: Ingo Molnar
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
Message-ID: <20090323162954.GB4192@elte.hu>
References: <20090318105735.BD17.A69D9226@jp.fujitsu.com> <20090322205249.6801.A69D9226@jp.fujitsu.com> <20090323091056.69DF.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20090323091056.69DF.A69D9226@jp.fujitsu.com>
Sender: owner-linux-mm@kvack.org
To: KOSAKI Motohiro
Cc: Nick Piggin, Linus Torvalds, Andrea Arcangeli, Benjamin Herrenschmidt, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

* KOSAKI Motohiro wrote:

> > The following patch is my v2 approach.
> > It survives Andrea's three dio test cases.
> >
> > [...]
>
> Frankly, Linus suggested inserting one branch into do_wp_page(),
> but I removed one branch from gup_fast.
>
> I think it's a good performance trade-off. But if anybody hates my
> approach, I'll drop my chicken heart and try the way Linus
> suggested.

We started out with a difficult corner case problem (for an
arguably botched syscall promise we made to user-space many
moons ago), and an invasive and unmaintainable looking patch:

   8 files changed, 342 insertions(+), 77 deletions(-)

And your v2 is now:

   9 files changed, 66 insertions(+), 21 deletions(-)

... and it is also speeding up fast-gup. Which is a marked
improvement IMO.

	Ingo
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id C5B256B003D for ; Mon, 23 Mar 2009 11:48:17 -0400 (EDT)
Date: Mon, 23 Mar 2009 09:46:18 -0700 (PDT)
From: Linus Torvalds
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
In-Reply-To: <20090323162954.GB4192@elte.hu>
Message-ID:
References: <20090318105735.BD17.A69D9226@jp.fujitsu.com> <20090322205249.6801.A69D9226@jp.fujitsu.com> <20090323091056.69DF.A69D9226@jp.fujitsu.com> <20090323162954.GB4192@elte.hu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
To: Ingo Molnar
Cc: KOSAKI Motohiro, Nick Piggin, Andrea Arcangeli, Benjamin Herrenschmidt, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

On Mon, 23 Mar 2009, Ingo Molnar wrote:
>
> And your v2 is now:
>
>    9 files changed, 66 insertions(+), 21 deletions(-)
>
> ... and it is also speeding up fast-gup. Which is a marked
> improvement IMO.

Yeah, I have no problems with that patch. I'd just suggest a final
simplification, and getting rid of the

	mask = _PAGE_PRESENT|_PAGE_USER;
	/* Maybe the read only pte is cow mapped page. (or not maybe)
	   So, falling back to get_user_pages() is better */
	mask |= _PAGE_RW;

and just doing something like

	/*
	 * fast-GUP only handles the simple cases where we have
	 * full access to the page (ie private pages are copied
	 * etc).
	 */
	#define GUP_MASK (_PAGE_PRESENT|_PAGE_USER|_PAGE_RW)

and leaving it at that.

Of course, maybe somebody does O_DIRECT writes on a fork'ed image in order
to create a snapshot image or something, and now the v2 thing breaks COW
on all the pages in order to be safe and performance sucks.

But I can't really say that _I_ could possibly care. I really seriously
think that O_DIRECT and its ilk were braindamaged to begin with.

		Linus

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 092C86B003D for ; Tue, 24 Mar 2009 01:05:34 -0400 (EDT)
Received: from m1.gw.fujitsu.co.jp ([10.0.50.71]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id n2O58Djh018363 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 24 Mar 2009 14:08:13 +0900
Received: from smail (m1 [127.0.0.1]) by outgoing.m1.gw.fujitsu.co.jp (Postfix) with ESMTP id F36D545DD76 for ; Tue, 24 Mar 2009 14:08:12 +0900 (JST)
Received: from s1.gw.fujitsu.co.jp (s1.gw.fujitsu.co.jp [10.0.50.91]) by m1.gw.fujitsu.co.jp (Postfix) with ESMTP id D072E45DD75 for ; Tue, 24 Mar 2009 14:08:12 +0900 (JST)
Received: from s1.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id B2FBDE08004 for ; Tue, 24 Mar 2009 14:08:12 +0900 (JST)
Received: from m107.s.css.fujitsu.com (m107.s.css.fujitsu.com [10.249.87.107]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id 6EB9CE08002 for ; Tue, 24 Mar 2009 14:08:12 +0900 (JST)
From: KOSAKI Motohiro
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
In-Reply-To:
References: <20090323162954.GB4192@elte.hu>
Message-Id: <20090324140304.902C.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Date: Tue, 24 Mar 2009 14:08:11 +0900 (JST)
Sender: owner-linux-mm@kvack.org
To: Linus Torvalds
Cc: kosaki.motohiro@jp.fujitsu.com, Ingo Molnar, Nick Piggin, Andrea Arcangeli, Benjamin Herrenschmidt, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

Hi

> > And your v2 is now:
> >
> >    9 files changed, 66 insertions(+), 21 deletions(-)
> >
> > ... and it is also speeding up fast-gup. Which is a marked
> > improvement IMO.
>
> Yeah, I have no problems with that patch. I'd just suggest a final
> simplification, and getting rid of the
>
> 	mask = _PAGE_PRESENT|_PAGE_USER;
> 	/* Maybe the read only pte is cow mapped page. (or not maybe)
> 	   So, falling back to get_user_pages() is better */
> 	mask |= _PAGE_RW;
>
> and just doing something like
>
> 	/*
> 	 * fast-GUP only handles the simple cases where we have
> 	 * full access to the page (ie private pages are copied
> 	 * etc).
> 	 */
> 	#define GUP_MASK (_PAGE_PRESENT|_PAGE_USER|_PAGE_RW)

OK! I'll do that. Thanks for the good review!


> and leaving it at that.
>
> Of course, maybe somebody does O_DIRECT writes on a fork'ed image in order
> to create a snapshot image or something, and now the v2 thing breaks COW
> on all the pages in order to be safe and performance sucks.
>
> But I can't really say that _I_ could possibly care. I really seriously
> think that O_DIRECT and its ilk were braindamaged to begin with.

Yes. I totally agree ;)
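
The effect of the single GUP_MASK can be checked in isolation. The sketch below is a user-space illustration only: the PAGE_* constants stand in for the kernel's _PAGE_PRESENT, _PAGE_RW and _PAGE_USER (bits 0, 1 and 2 of an x86 PTE), and gup_fast_can_handle() is a hypothetical helper name, not a function from the patch.

```c
#include <stdbool.h>

/* Stand-ins for the x86 PTE flag bits used by fast-GUP. */
#define PAGE_PRESENT	0x001UL	/* bit 0: page is mapped */
#define PAGE_RW		0x002UL	/* bit 1: page is writable */
#define PAGE_USER	0x004UL	/* bit 2: page is accessible from user mode */

/* One mask for reads and writes alike, as suggested in the mail above. */
#define GUP_MASK (PAGE_PRESENT | PAGE_USER | PAGE_RW)

/*
 * Hypothetical helper: the fast path accepts a PTE only when every bit in
 * GUP_MASK is set. Anything else (not present, kernel-only, or read-only
 * and therefore possibly COW-shared) falls back to the slow path.
 */
static bool gup_fast_can_handle(unsigned long pte_flags)
{
	return (pte_flags & GUP_MASK) == GUP_MASK;
}
```

With this mask a read-only user page is rejected by the fast path even for a read-only pin, which is what lets the slow path break COW first.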
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id D9D966B003D for ; Tue, 24 Mar 2009 09:32:45 -0400 (EDT)
From: Nick Piggin
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
Date: Wed, 25 Mar 2009 00:43:16 +1100
References: <200903170323.45917.nickpiggin@yahoo.com.au> <20090318105735.BD17.A69D9226@jp.fujitsu.com> <20090322205249.6801.A69D9226@jp.fujitsu.com>
In-Reply-To: <20090322205249.6801.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200903250043.18069.nickpiggin@yahoo.com.au>
Sender: owner-linux-mm@kvack.org
To: KOSAKI Motohiro
Cc: Linus Torvalds, Andrea Arcangeli, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

On Sunday 22 March 2009 23:23:56 KOSAKI Motohiro wrote:
> Hi
>
> The following patch is my v2 approach.
> It survives Andrea's three dio test cases.
>
> Linus suggested changing the add_to_swap() and shrink_page_list() code
> to avoid a false COW in do_wp_page() when a page goes into the swapcache.
>
> I think it's a good idea, but it's a bit radical, so I think it's
> something for the development tree to tackle.
>
> So I decided to use Nick's early decow in get_user_pages(),
> and RO-mapped pages don't use gup_fast.

You probably should be testing for PageAnon pages in gup_fast.
Also, using a bit in page->flags you could potentially get
anonymous, readonly mappings working again (I thought I had
them working in my patch, but on second thoughts perhaps I
had a bug in tagging them, I'll try to fix that).


> Yeah, my approach is an extremely brutal way and a big hammer, but I think
> it doesn't have a performance issue in the real world.
>
> Why?
>
> Practically, we can assume the following two things.
>
> (1) The buffer passed as the write(2) syscall argument is an RW-mapped
>     page or a COWed RO page.
>
>     If anybody writes the following code, my patch causes a performance
>     regression.
>
>        buf = mmap()
>        memset(buf, 0x11, len);
>        mprotect(buf, len, PROT_READ)
>        fd = open(O_DIRECT)
>        write(fd, buf, len)
>
>     But it's very artificial code. Nobody wants this.
>     OK, we can ignore this.

The more interesting uses of gup (and perhaps somewhat
improved or enabled with fast-gup) I think are things like
vmsplice, and syslets/threadlets/aio kind of things. And I
don't exactly know what the users are going to look like.


> (2) A DirectIO user process isn't a short-lived process.
>
>     Early decow only decreases the performance of short-lived processes,
>     because a long-lived process does the decowing anyway before exec(2).
>
>     And all DB applications are definitely long-lived processes,
>     so early decow doesn't cause a regression.

Right, most databases won't care *at all* because they won't
do any decowing. But if there are cases that do care, then
we can perhaps take the policy of having them use
MADV_DONTFORK or somesuch.


> TODO
>   - implement down_write_killable().
>     (But it isn't important, because this is a rare-case issue.)
>   - implement the non-x86 portion.
>
>
> Am I missing anything?

I still don't understand why this way is so much better than
my last proposal. I just wanted to let that simmer down for a
few days :) But I'm honestly really just interested in a good
discussion and I don't mind being sworn at if I'm being stupid,
but I really want to hear opinions of why I'm wrong too.

Yes my patch has downsides I'm quite happy to admit. But I just
don't see that copy-on-fork rather than wrprotect-on-fork is
the showstopper. To me it seemed nice because it is practically
just reusing code straight from do_wp_page, and pretty well
isolated out of the fastpath.
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 02EDA6B003D for ; Tue, 24 Mar 2009 13:47:56 -0400 (EDT)
Date: Tue, 24 Mar 2009 10:56:24 -0700 (PDT)
From: Linus Torvalds
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
In-Reply-To: <200903250043.18069.nickpiggin@yahoo.com.au>
Message-ID:
References: <200903170323.45917.nickpiggin@yahoo.com.au> <20090318105735.BD17.A69D9226@jp.fujitsu.com> <20090322205249.6801.A69D9226@jp.fujitsu.com> <200903250043.18069.nickpiggin@yahoo.com.au>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
To: Nick Piggin
Cc: KOSAKI Motohiro, Andrea Arcangeli, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

On Wed, 25 Mar 2009, Nick Piggin wrote:
>
> I still don't understand why this way is so much better than
> my last proposal.

Take a look at the diffstat.

		Linus
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 2746E6B0062 for ; Mon, 30 Mar 2009 06:52:00 -0400 (EDT)
Received: from mt1.gw.fujitsu.co.jp ([10.0.50.74]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id n2UAqkGU009183 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Mon, 30 Mar 2009 19:52:47 +0900
Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id B2FCA45DE59 for ; Mon, 30 Mar 2009 19:52:46 +0900 (JST)
Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 7608F45DE4E for ; Mon, 30 Mar 2009 19:52:46 +0900 (JST)
Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 4FCE31DB8047 for ; Mon, 30 Mar 2009 19:52:46 +0900 (JST)
Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id A7F451DB803A for ; Mon, 30 Mar 2009 19:52:45 +0900 (JST)
From: KOSAKI Motohiro
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
In-Reply-To: <200903250043.18069.nickpiggin@yahoo.com.au>
References: <20090322205249.6801.A69D9226@jp.fujitsu.com> <200903250043.18069.nickpiggin@yahoo.com.au>
Message-Id: <20090330191830.6924.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Date: Mon, 30 Mar 2009 19:52:44 +0900 (JST)
Sender: owner-linux-mm@kvack.org
To: Nick Piggin
Cc: kosaki.motohiro@jp.fujitsu.com, Linus Torvalds, Andrea Arcangeli, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

Hi Nick,

> > Am I missing anything?
>
> I still don't understand why this way is so much better than
> my last proposal. I just wanted to let that simmer down for a
> few days :) But I'm honestly really just interested in a good
> discussion and I don't mind being sworn at if I'm being stupid,
> but I really want to hear opinions of why I'm wrong too.
>
> Yes my patch has downsides I'm quite happy to admit. But I just
> don't see that copy-on-fork rather than wrprotect-on-fork is
> the showstopper. To me it seemed nice because it is practically
> just reusing code straight from do_wp_page, and pretty well
> isolated out of the fastpath.

Firstly, I'm very sorry for the very long delay in responding. This month
I'm very busy and I don't have enough development time ;)

Secondly, I have a strong obsession with fixing the bug (I guess you
already know that), but I have no obsession with the _way_ it is fixed.
My patch was made to create a good discussion, not to NAK your patch.

I think your patch is good, but it has a few disadvantages.
(Yeah, I agree mine has a lot of disadvantages.)

1. It uses page->flags.
   Nowadays page->flags is some of the most prime real estate in Linux.
   As far as possible, we should avoid using it.
2. It doesn't have a GUP_FLAGS_PINNING_PAGE flag,
   so access_process_vm() can decow a page unnecessarily.
   That isn't a good feature, I think.

IOW, I don't think being "caller transparent" is important;
minimal side effects are more important. By side effects I mean effects on
non-direct-io paths. I don't mind side effects on the direct-io path. It is
only used by DBs or similar software, so we can assume a lot about the
userland usage.

And I was playing with your patch last week, but I concluded I can't shrink
it any more.
As far as I understand, Linus doesn't refuse copy-on-fork itself; he only
refuses a messy bugfix patch.
In general, a bugfix patch should be backportable to the stable tree.

So I think step-by-step development is better.

1. First, merge wrprotect-on-fork.
2. Improve speed.

What do you think?


btw,
Linus gave me a good inspiration. If page pinning happened, the page
is guaranteed to be grabbed by only one process.
So we can put a pinning count and some additional information
into anon_vma. That can avoid using page->flags even though we implement
copy-on-fork. Maybe.

HOWEVER, if you really hate my approach, please don't hesitate to say so.
I don't want to submit a patch you dislike.
I respect Linus, but I respect you too.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 084E76B003D for ; Thu, 2 Apr 2009 23:49:18 -0400 (EDT)
From: Nick Piggin
Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
Date: Fri, 3 Apr 2009 14:49:48 +1100
References: <20090322205249.6801.A69D9226@jp.fujitsu.com> <20090330191830.6924.A69D9226@jp.fujitsu.com> <200904022307.12043.nickpiggin@yahoo.com.au>
In-Reply-To: <200904022307.12043.nickpiggin@yahoo.com.au>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200904031449.49594.nickpiggin@yahoo.com.au>
Sender: owner-linux-mm@kvack.org
To: KOSAKI Motohiro
Cc: Linus Torvalds, Andrea Arcangeli, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

[sorry, resending because my mail client started sending HTML and
this didn't get through spam filters]

On Thursday 02 April 2009 23:07:11 Nick Piggin wrote:

Hi!

On Monday 30 March 2009 21:52:44 KOSAKI Motohiro wrote:
>
> Hi Nick,
>
> > > Am I missing anything?
> >
> > I still don't understand why this way is so much better than
> > my last proposal. I just wanted to let that simmer down for a
> > few days :) But I'm honestly really just interested in a good
> > discussion and I don't mind being sworn at if I'm being stupid,
> > but I really want to hear opinions of why I'm wrong too.
> >
> > Yes my patch has downsides I'm quite happy to admit. But I just
> > don't see that copy-on-fork rather than wrprotect-on-fork is
> > the showstopper. To me it seemed nice because it is practically
> > just reusing code straight from do_wp_page, and pretty well
> > isolated out of the fastpath.
>
> Firstly, I'm very sorry for the very long delay in responding. This month
> I'm very busy and I don't have enough development time ;)

No problem.


> Secondly, I have a strong obsession with fixing the bug (I guess you
> already know that), but I have no obsession with the _way_ it is fixed.
> My patch was made to create a good discussion, not to NAK your patch.

Definitely. I like more discussion and alternative approaches.


> I think your patch is good, but it has a few disadvantages.
> (Yeah, I agree mine has a lot of disadvantages.)
>
> 1. It uses page->flags.
>    Nowadays page->flags is some of the most prime real estate in Linux.
>    As far as possible, we should avoid using it.

Well... I'm not sure if it is that bad. It uses an anonymous
page flag, which are not so congested as pagecache page flags.
I can't think of anything preventing anonymous pages from
using PG_owner_priv_1, PG_private, or PG_mappedtodisk, so a
"final" solution that uses a page flag would use one of those
I guess.


> 2. It doesn't have a GUP_FLAGS_PINNING_PAGE flag,
>    so access_process_vm() can decow a page unnecessarily.
>    That isn't a good feature, I think.

access_process_vm I think can just avoid COWing because it
holds mmap_sem for the duration of the operation. I just didn't
fix that because I didn't really think of it.


> IOW, I don't think being "caller transparent" is important;

Well I don't know about that. I don't know that O_DIRECT is
particularly more important to fix the problem than vmsplice,
or any of the numerous other zero-copy methods open coded in
drivers.


> minimal side effects are more important. By side effects I mean effects on
> non-direct-io paths. I don't mind side effects on the direct-io path. It is
> only used by DBs or similar software, so we can assume a lot about the
> userland usage.

I agree my patch should not be de-cowing for access_process_vm
for read. I think that can be fixed.

But I disagree that O_DIRECT is unimportant. I think the big
database users don't like more cost in this path, and they
obviously have the capacity to use it carefully so I'm sure
they would prefer not to add anything. Intel definitely counts
cycles in the O_DIRECT path.


> And I was playing with your patch last week, but I concluded I can't shrink
> it any more.
> As far as I understand, Linus doesn't refuse copy-on-fork itself; he only
> refuses a messy bugfix patch.
> In general, a bugfix patch should be backportable to the stable tree.

I think assessing this type of patch based on diffstat is a bit
ridiculous ;) But I think it can be shrunk a bit if it shares
a bit of code with do_wp_page.


> So I think step-by-step development is better.
>
> 1. First, merge wrprotect-on-fork.
> 2. Improve speed.
>
> What do you think?
>
>
> btw,
> Linus gave me a good inspiration. If page pinning happened, the page
> is guaranteed to be grabbed by only one process.
> So we can put a pinning count and some additional information
> into anon_vma. That can avoid using page->flags even though we implement
> copy-on-fork. Maybe.

Hmm, I might try playing with that in my patch. Not so much because the
extra flag is important (as I explain above), but keeping a count will