* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Linus Torvalds @ 2009-03-11 17:33 UTC
To: Ingo Molnar
Cc: Nick Piggin, Hugh Dickins, Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, 11 Mar 2009, Ingo Molnar wrote:
>
> FYI, in case you missed it. Large MM fix - and it's awfully late
> in -rc7.

Yeah, I'm not taking this at this point. No way, no-how.

If there is no simpler and obvious fix, it needs to go through -stable,
after having cooked in 2.6.30-rc for a while. Especially as this is a
totally uninteresting usage case that I can't see as being relevant to
any real-world use.

Anybody who mixes O_DIRECT and fork() (and threads) is already doing
some seriously strange things. Nothing new there.

And quite frankly, the patch is so ugly as-is that I'm not likely to
take it even into the 2.6.30 merge window unless it can be cleaned up.
That whole fork_pre_cow function is too f*cking ugly to live. We just
don't write code like this in the kernel.

		Linus
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Ingo Molnar @ 2009-03-11 17:41 UTC
To: Linus Torvalds
Cc: Nick Piggin, Hugh Dickins, Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, 11 Mar 2009, Ingo Molnar wrote:
> >
> > FYI, in case you missed it. Large MM fix - and it's awfully
> > late in -rc7.
>
> Yeah, I'm not taking this at this point. No way, no-how.
>
> If there is no simpler and obvious fix, it needs to go through
> -stable, after having cooked in 2.6.30-rc for a while.
> Especially as this is a totally uninteresting usage case that
> I can't see as being relevant to any real-world use.
>
> Anybody who mixes O_DIRECT and fork() (and threads) is already
> doing some seriously strange things. Nothing new there.

Hm, is there any security impact? Andrea is talking about data
corruption. I'm wondering whether that's just corruption relative
to whatever twisted semantics O_DIRECT has in this case [which
would be harmless], or some true pagecache corruption going across
COW (or other) protection domains that could be exploited [which
would not be harmless].

	Ingo
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Linus Torvalds @ 2009-03-11 17:58 UTC
To: Ingo Molnar
Cc: Nick Piggin, Hugh Dickins, Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, 11 Mar 2009, Ingo Molnar wrote:
>
> Hm, is there any security impact? Andrea is talking about data
> corruption. I'm wondering whether that's just corruption relative
> to whatever twisted semantics O_DIRECT has in this case [which
> would be harmless], or some true pagecache corruption going across
> COW (or other) protection domains that could be exploited [which
> would not be harmless].

As far as I can tell, it's the same old problem that we've always had:
if you fork(), it's unclear who is going to do the first write - parent
or child (and "parent" in this case can include any number of threads
that share the VM, of course).

And that means that anything that relies on pinned pages will never
know whether it is pinning a page in the parent or the child - because
whoever does the first COW of that page is the one that just gets a
_copy_, not the original pinned page.

This isn't anything new. Anything that does anything by physical
address will simply not do the right thing over a fork. The physical
page may have started out as the parent's physical page, but it may
end up being the _child's_ physical page if the parent wrote to it and
triggered the COW.

The rule has always been: don't mix fork() with page pinning. It
doesn't work. It never worked. It likely never will.

		Linus
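[Editorial note: for readers following along, here is a minimal userspace
sketch, not part of the original thread, of the failure mode Linus
describes: an O_DIRECT read pins an anonymous page while another thread
forks, and the parent's subsequent write can COW the page away from the
in-flight DMA. The file name is whatever you pass in; build with
-lpthread on a Linux system with O_DIRECT support.]

	/* sketch.c - illustrative only: fork() vs O_DIRECT on a pinned page */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/wait.h>

	static char *buf;	/* anonymous page, COW-shared after fork() */

	static void *reader(void *path)
	{
		int fd = open(path, O_RDONLY | O_DIRECT);

		if (fd < 0) {
			perror("open");
			return NULL;
		}
		/* get_user_pages() pins the physical page backing buf */
		if (read(fd, buf, 4096) < 0)
			perror("read");
		close(fd);
		return NULL;
	}

	int main(int argc, char **argv)
	{
		pthread_t t;

		if (argc < 2 || posix_memalign((void **)&buf, 4096, 4096))
			return 1;
		memset(buf, 0, 4096);		/* instantiate the anon page */

		pthread_create(&t, NULL, reader, argv[1]);

		if (fork() == 0)	/* fork while the read may be in flight */
			_exit(0);	/* the child touches nothing */

		buf[0] = 'x';	/* parent write may COW away the pinned page:
				 * the DMA then lands in the child's copy */
		pthread_join(t, NULL);
		wait(NULL);
		return 0;
	}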
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Andrea Arcangeli @ 2009-03-11 18:37 UTC
To: Linus Torvalds
Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, Mar 11, 2009 at 10:58:17AM -0700, Linus Torvalds wrote:
> As far as I can tell, it's the same old problem that we've always had:
> if you fork(), it's unclear who is going to do the first write -
> parent or child (and "parent" in this case can include any number of
> threads that share the VM, of course).

The child doesn't touch any page. Calling fork just generates O_DIRECT
corruption in the parent regardless of what the child does.

> This isn't anything new. Anything that does anything by physical
> address

This is also nothing new in the sense that all Linux kernels out there
have had this bug so far.

> will simply not do the right thing over a fork. The physical page may
> have started out as the parent's physical page, but it may end up
> being the _child's_ physical page if the parent wrote to it and
> triggered the COW.

Actually the child will get corrupted too, not just the parent by
losing the O_DIRECT reads. The child always assumes its anon page
contents will not get lost or overwritten after it changes them.

> The rule has always been: don't mix fork() with page pinning. It
> doesn't work. It never worked. It likely never will.

I never heard this rule here, but surely I agree there will not be many
apps out there capable of triggering this. Mostly because most apps use
O_DIRECT on top of shm (surely not because they're not usually calling
fork). The ones affected are the ones using anonymous memory with
threads and not allocating memory with memalign(4096) despite using a
512-byte blocksize for their I/O. If they use threads and allocate with
memalign(512), they can be affected if they call fork anywhere.

I don't think it's an urgent fix, but if you're now claiming that this
doesn't ever need fixing and we can live with the bug forever, I think
you're wrong. If anything, I'd rather see O_DIRECT not supporting hard
blocksizes anymore but only PAGE_SIZE multiples; that would at least
limit the breakage to undefined behavior.
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Linus Torvalds @ 2009-03-11 18:46 UTC
To: Andrea Arcangeli
Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, 11 Mar 2009, Andrea Arcangeli wrote:
> On Wed, Mar 11, 2009 at 10:58:17AM -0700, Linus Torvalds wrote:
> > As far as I can tell, it's the same old problem that we've always
> > had: if you fork(), it's unclear who is going to do the first write
> > - parent or child (and "parent" in this case can include any number
> > of threads that share the VM, of course).
>
> The child doesn't touch any page. Calling fork just generates O_DIRECT
> corruption in the parent regardless of what the child does.

You aren't listening. It depends on who does the write.

If the _parent_ does the write (with another thread or not), then the
_parent_ gets the COW. That's all I said.

> > The rule has always been: don't mix fork() with page pinning. It
> > doesn't work. It never worked. It likely never will.
>
> I never heard this rule here

It's never been written down, but it's obvious to anybody who looks at
how COW works for even five seconds. The fact is, the person doing the
COW after a fork() is the person who no longer has the same physical
page (because he got a new page).

So _anything_ that depends on physical addresses simply _cannot_ work
concurrently with a fork. That has always been true.

If the idiots who use O_DIRECT don't understand that, then hey, it's
their problem. I have long been of the opinion that we should not
support O_DIRECT at all, and that it's a totally broken premise to
start with. This is just one of millions of reasons.

		Linus
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Linus Torvalds @ 2009-03-11 19:01 UTC
To: Andrea Arcangeli
Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, 11 Mar 2009, Linus Torvalds wrote:
>
> It's never been written down, but it's obvious to anybody who looks at
> how COW works for even five seconds. The fact is, the person doing the
> COW after a fork() is the person who no longer has the same physical
> page (because he got a new page).

Btw, I think your patch has a race. Admittedly a really small one.

When you look up the page in gup.c, and then set the GUP flag on the
"struct page", in between the lookup and the setting of the flag,
another thread can come in and do that same fork+write thing.

	CPU0:				CPU1:

	gup:				fork:
	- look up page
	- it's read-write
	  ...
					set_wr_protect
					test GUP bit - not set, good
					done
	- Mark it GUP
					tlb_flush

					write to it from user space
					- COW

since there is no locking on the GUP side (there's the TLB flush that
will wait for interrupts being enabled again on CPU0, but that's later
in the fork sequence).

Maybe I'm missing something. The race is certainly very unlikely to
ever happen in practice, but it looks real.

Also, having to set the PG_GUP bit means that the "fast" gup is likely
not much faster than the slow one. It now has two atomics per page it
looks up, afaik, which sounds like it would delete any advantage it had
over the slow version that needed locking.

What we _could_ try to do is to always make the COW breaking be a
_directed_ event - we'd make sure that we always break COW in the
direction of the first owner (going to the rmap chains). That might
solve everything, and be purely local to the logic in mm/memory.c
(do_wp_page).

I dunno. I have not looked at how horrible that would be.

		Linus
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Andrea Arcangeli @ 2009-03-11 19:59 UTC
To: Linus Torvalds
Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, Mar 11, 2009 at 12:01:56PM -0700, Linus Torvalds wrote:
> Btw, I think your patch has a race. Admittedly a really small one.
>
> When you look up the page in gup.c, and then set the GUP flag on the
> "struct page", in between the lookup and the setting of the flag,
> another thread can come in and do that same fork+write thing.
>
>	CPU0:				CPU1:
>
>	gup:				fork:
>	- look up page
>	- it's read-write
>	  ...
>					set_wr_protect
>					test GUP bit - not set, good
>					done
>	- Mark it GUP
>					tlb_flush
>
>					write to it from user space
>					- COW

Did you notice the check after 'mark it gup' that will run on CPU0?

+	if (PageAnon(page)) {
+		if (!PageGUP(page))
+			SetPageGUP(page);
+		smp_mb();
+		/*
+		 * Fork doesn't want to flush the smp-tlb for
+		 * every pte that it marks readonly but newly
+		 * created shared anon pages cannot have
+		 * direct-io going to them, so check if fork
+		 * made the page shared before we take the
+		 * page pin.
+		 */
+		if ((pte_flags(gup_get_pte(ptep)) &
+		     (mask | _PAGE_SPECIAL)) != mask) {
+			put_page(page);
+			pte_unmap(ptep);
+			return 0;
+		}
+	}

gup-fast will _not_ succeed because of the set_wr_protect that just
happened on CPU1. That's why I added the above check after
setpagegup/get_page.

> since there is no locking on the GUP side (there's the TLB flush that
> will wait for interrupts being enabled again on CPU0, but that's
> later in the fork sequence).

Right, I preferred to 'recheck' the wrprotect bit before allowing
gup-fast to succeed, to avoid sending a flood of IPIs in the fork fast
path. So I leave the tlb flush at the end of the fork sequence and a
single IPI in the common case.

The only exception is the forcecow path, where the copy has to happen
atomically per-page, so I have to flush the smp-tlb before the copy
after marking the parent wrprotected temporarily (later the parent pte
is marked read-write again by fork_pre_cow after the copy), or NPTL
will never have a chance to fix its bug, as its glibc-parent data
structures that could be modified by threads won't be copied atomically
to the child. But that's a slow path, so it's ok to flush the tlb
there.

> Also, having to set the PG_GUP bit means that the "fast" gup is
> likely not much faster than the slow one. It now has two atomics per
> page it looks up, afaik, which sounds like it would delete any
> advantage it had over the slow version that needed locking.

gup-fast already has to get_page, so I don't see it. gup-fast will
always dirty that cacheline and take it over regardless of PG_gup;
gup-fast will never be able to run without running get_page.
Furthermore, starting from the second access GUP is already set and
it's only a read from L1 of a cacheline that was already dirtied and
taken over a few instructions before. So I think it can't be slowing
down gup-fast in any measurable way, given how close after get_page
the GUP bit is set.

> What we _could_ try to do is to always make the COW breaking be a
> _directed_ event - we'd make sure that we always break COW in the
> direction of the first owner (going to the rmap chains). That might
> solve everything, and be purely local to the logic in mm/memory.c
> (do_wp_page).
That's a really interesting idea and frankly I didn't think about it.

Probably one reason is that it can't work for ksm, where we take two
random anon pages and create one out of them, so each one could already
have O_DIRECT in progress on it, and we have to prevent pages with
in-flight O_DIRECT from being merged no matter what (ordering is
irrelevant for ksm; page contents must be stable or ksm will break). I
was thinking of using the same logic for both ksm and fork.

But theoretically, ksm can keep doing the page_count check to truly
ensure no in-flight I/O is going on, and fork could fix it in whatever
way it wants (I wonder if it'd be ok for fork to map a 'changing' page
in the child because of the not-defined behavior of forking while a
read is in progress; at least at the first write the page would stop
changing contents). In fact ksm doesn't even require the above change
to gup-fast, because it does ptep_clear_flush_notify when it tries to
wrprotect a not-shared anon page.

> I dunno. I have not looked at how horrible that would be.

For fork I think it would work (not sure if the current data structures
would be enough), but at first glance, besides how horrible it would
be, I think from a practical standpoint the main problem is the
slowdown it'd generate in the do_wp_page fast path. The anon_vma list
can be huge in some weird cases, which we normally couldn't care less
about, as swap algorithms and disk I/O (even on non-seeking SSDs) are
even slower than that. The coolness of rmap without pte_chains is that
rmap is zero-cost for all page faults (a check on vma->anon_vma being
non-null is the only cost) and I'd like to keep it that way.

The cost of my fix to fork is not measurable with a fork
microbenchmark, while the cost of finding who owns the original shared
page in do_wp_page would potentially be much bigger. The only slowdown
to fork is in the O_DIRECT slow path, which we don't care about, and in
the worst case it is limited to the total amount of in-flight I/O.
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Linus Torvalds @ 2009-03-11 20:19 UTC
To: Andrea Arcangeli
Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, 11 Mar 2009, Andrea Arcangeli wrote:
>
> Did you notice the check after 'mark it gup' that will run on CPU0?

Ahh, no. I just read the patch through fairly quickly, and the whole
"(gup_get_pte & mask) != mask" didn't trigger as obvious. But yeah, I
see that it ends up re-checking the RW bit.

> gup-fast will _not_ succeed because of the set_wr_protect that just
> happened on CPU1. That's why I added the above check after
> setpagegup/get_page.

Ok, with the recheck I think it's fine.

> > Also, having to set the PG_GUP bit means that the "fast" gup is
> > likely not much faster than the slow one. It now has two atomics
> > per page it looks up, afaik, which sounds like it would delete any
> > advantage it had over the slow version that needed locking.
>
> gup-fast already has to get_page, so I don't see it.

That's my point. It used to have one atomic. Now it has two (and a
memory barrier). Those tend to be pretty expensive - even when there's
no cacheline bouncing.

> Furthermore, starting from the second access GUP is already set

That's a totally bogus argument. It will be true for _benchmarks_, but
if somebody is trying to avoid buffered IO, one very possible common
case is that it's all going to be new pages all the time.

That said, I don't know who the crazy O_DIRECT users are. It may be
true that some O_DIRECT users end up using the same pages over and
over again, and that this is a good optimization for them.

> > What we _could_ try to do is to always make the COW breaking be a
> > _directed_ event - we'd make sure that we always break COW in the
> > direction of the first owner (going to the rmap chains). That might
> > solve everything, and be purely local to the logic in mm/memory.c
> > (do_wp_page).
>
> That's a really interesting idea and frankly I didn't think about it.

The advantage of it is that it fixes the problem not just in one place,
but "forever". No hacks about exactly how you access the mappings etc.

Of course, nothing _really_ solves things. If you do some delayed IO
after having looked up the mapping and turned it into a physical page,
and the original allocator actually unmaps it (or exits), then the same
issue can still happen (well, not the _same_ one - but the very similar
issue of the child seeing changes even though the IO was started in the
parent).

This is why I think any "look up by physical" is fundamentally flawed.
It very basically becomes a "I have a secret local TLB that cannot be
changed or flushed". And any single-bit solution (GUP) is always going
to be fairly broken.

> The cost of my fix to fork is not measurable with a fork
> microbenchmark, while the cost of finding who owns the original
> shared page in do_wp_page would potentially be much bigger. The only
> slowdown to fork is in the O_DIRECT slow path, which we don't care
> about, and in the worst case it is limited to the total amount of
> in-flight I/O.

Agreed. However, I really think this is an O_DIRECT problem. Just
document it. Tell people that O_DIRECT simply doesn't work with COW,
and fundamentally can never work well.
If you use O_DIRECT with threading, you had better know what the hell
you're doing anyway. I do not think that the kernel should do stupid
things just because stupid users don't understand the semantics of the
_non-stupid_ thing (which is to just let people think about COW for
five seconds).

		Linus
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Linus Torvalds @ 2009-03-11 20:33 UTC
To: Andrea Arcangeli
Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, 11 Mar 2009, Linus Torvalds wrote:
>
> Agreed. However, I really think this is an O_DIRECT problem. Just
> document it. Tell people that O_DIRECT simply doesn't work with COW,
> and fundamentally can never work well.
>
> If you use O_DIRECT with threading, you had better know what the hell
> you're doing anyway. I do not think that the kernel should do stupid
> things just because stupid users don't understand the semantics of
> the _non-stupid_ thing (which is to just let people think about COW
> for five seconds).

Btw, if we don't do that, then there are better alternatives. One is:

 - fork already always takes the write lock on mmap_sem (and f*ck no,
   I doubt anybody will ever care one whit how "parallel" you can do
   forks from threads, so I don't think this is an issue)

 - Just make the rule be that people who use get_user_pages() always
   have to have the read-lock on mmap_sem until they've used the
   pages.

We already take the read-lock for the lookup (well, not for the gup,
but for all the slow cases), but I'm saying that we could go one step
further - just read-lock over the _whole_ O_DIRECT read or write. That
way you literally protect against concurrent fork()s.

		Linus
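[Editorial note: to make the second alternative concrete, here is a
rough, editor-added sketch of what read-locking across the whole
direct-IO call might look like in the fs/direct-io.c of that era. The
exact call site and error handling are glossed over; treat it purely as
an illustration, not as the proposed patch.]

	/* Illustrative only: hold mmap_sem for reading across the whole
	 * direct-IO operation, so that fork()'s down_write(&mm->mmap_sem)
	 * in dup_mmap() cannot run while user pages are pinned for I/O.
	 */
	struct mm_struct *mm = current->mm;

	down_read(&mm->mmap_sem);
	retval = direct_io_worker(rw, iocb, inode, iov, offset,
				  nr_segs, blkbits, get_block, end_io, dio);
	up_read(&mm->mmap_sem);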
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Andrea Arcangeli @ 2009-03-11 20:55 UTC
To: Linus Torvalds
Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, Mar 11, 2009 at 01:33:17PM -0700, Linus Torvalds wrote:
> Btw, if we don't do that, then there are better alternatives. One is:
>
>  - fork already always takes the write lock on mmap_sem (and f*ck no,
>    I doubt anybody will ever care one whit how "parallel" you can do
>    forks from threads, so I don't think this is an issue)
>
>  - Just make the rule be that people who use get_user_pages() always
>    have to have the read-lock on mmap_sem until they've used the
>    pages.

How do you handle pages where gup already returned and I/O is still in
flight? Forcing gup-fast to be called with mmap_sem already held (like
gup used to require) only avoids the need for changes in gup-fast
AFAICT. You'll still get pages that are pinned, and calling gup-fast
under mmap_sem (no matter if in read or even write mode) won't make a
difference: those pages will still be pinned while fork runs, with DMA
going to them (by O_DIRECT or some driver using gup, as long as
PageReserved isn't set on them).

> We already take the read-lock for the lookup (well, not for the gup,
> but for all the slow cases), but I'm saying that we could go one step
> further - just read-lock over the _whole_ O_DIRECT read or write.
> That way you literally protect against concurrent fork()s.

Releasing the mmap_sem read mode in the irq-completion handler context
should be possible; however, fork will end up throttled, blocking for
I/O, which isn't very nice behavior. BTW, direct-io.c is a total mess -
I couldn't even figure out where to release those locks in the I/O
completion handlers when I tried something like this with PG_lock
instead of the mmap_sem... Eventually I gave up, because this isn't
just about O_DIRECT: all gup users have this trouble with fork.
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Linus Torvalds @ 2009-03-11 21:28 UTC
To: Andrea Arcangeli
Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, 11 Mar 2009, Andrea Arcangeli wrote:
> On Wed, Mar 11, 2009 at 01:33:17PM -0700, Linus Torvalds wrote:
> > Btw, if we don't do that, then there are better alternatives. One
> > is:
> >
> >  - fork already always takes the write lock on mmap_sem (and f*ck
> >    no, I doubt anybody will ever care one whit how "parallel" you
> >    can do forks from threads, so I don't think this is an issue)
> >
> >  - Just make the rule be that people who use get_user_pages()
> >    always have to have the read-lock on mmap_sem until they've used
> >    the pages.
>
> How do you handle pages where gup already returned and I/O is still
> in flight?

The rule is:

 - either keep the mmap_sem for reading until the IO is done

 - admit the fact that IO is asynchronous, and has visible async
   behavior.

> Forcing gup-fast to be called with mmap_sem already held (like gup
> used to require) only avoids the need for changes in gup-fast AFAICT.
> You'll still get pages that are pinned, and calling gup-fast under
> mmap_sem (no matter if in read or even write mode) won't make a
> difference: those pages will still be pinned while fork runs, with
> DMA going to them (by O_DIRECT or some driver using gup, as long as
> PageReserved isn't set on them).

The point I'm trying to make is that anybody who thinks that pages are
stable over various behavior that runs in another thread - be it a
fork, a mmap/munmap, or anything else - is just fooling themselves.
The pages are going to show up in "random" places.

The fact that the non-fast "get_user_pages()" takes the mmap semaphore
for reading doesn't even protect that. It just means that the pages
made sense at the time the get_user_pages() happened, not necessarily
at the time when the actual use of them did.

> Releasing the mmap_sem read mode in the irq-completion handler
> context should be possible; however, fork will end up throttled,
> blocking for I/O, which isn't very nice behavior. BTW, direct-io.c is
> a total mess - I couldn't even figure out where to release those
> locks in the I/O completion handlers when I tried something like this
> with PG_lock instead of the mmap_sem... Eventually I gave up, because
> this isn't just about O_DIRECT: all gup users have this trouble with
> fork.

O_DIRECT is actually the _simple_ case, since we won't be returning
until it is done (ie it's not actually an async interface). So no,
O_DIRECT doesn't need any interrupt handler games. It would just need
to hold the sem over the actual call to the filesystem (ie just over
the ->direct_IO() call).

Of course, I suspect that all users of O_DIRECT would be _very_
unhappy if they cannot do mmap/munmap/brk on other areas while
O_DIRECT is going on, so it's almost certainly not reasonable. People
want the relaxed synchronization we give them, and that's literally
why get_user_pages_fast exists - because people don't want _more_
synchronization, they want _less_.

But the thing is, with less synchronization, the behavior really is
surprising in the edge cases. Which is why I think "threaded fork"
plus "get_user_pages_fast" just doesn't make sense to even _worry_
about.
If you use O_DIRECT and mix it with fork, you get what you get, and
it's random - exactly because people who want O_DIRECT don't want any
locking.

It's a user-space issue, not a kernel issue.

		Linus
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Andrea Arcangeli @ 2009-03-11 21:57 UTC
To: Linus Torvalds
Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, Mar 11, 2009 at 02:28:08PM -0700, Linus Torvalds wrote:
> The fact that the non-fast "get_user_pages()" takes the mmap
> semaphore for reading doesn't even protect that. It just means that
> the pages made sense at the time the get_user_pages() happened, not
> necessarily at the time when the actual use of them did.

Indeed, this is a generic problem, not specific to get_user_pages_fast.
get_user_pages_fast just adds a few complications to serialize against.

> O_DIRECT is actually the _simple_ case, since we won't be returning
> until it is done (ie it's not actually an async interface). So no,
> O_DIRECT doesn't need any interrupt handler games. It would just need
> to hold the sem over the actual call to the filesystem (ie just over
> the ->direct_IO() call).

I don't see how you can solve the race by holding the sem only over
the direct_IO call (and not until the I/O completion handler fires). I
think to solve the race using mmap_sem only, the bio I/O completion
handler that eventually calls into direct-io.c from irq context would
need to up_read(&mmap_sem).

The way my patch avoids altering the I/O completion path running from
irq context is by ensuring no I/O is going on at all to the pages that
are being shared with the child, and by ensuring that any gup or
gup-fast will trigger COW before it can write to the shared page.
Pages simply can't be shared before I/O is complete.

> People want the relaxed synchronization we give them, and that's
> literally why get_user_pages_fast exists - because people don't want
> _more_ synchronization, they want _less_.
>
> But the thing is, with less synchronization, the behavior really is
> surprising in the edge cases. Which is why I think "threaded fork"
> plus "get_user_pages_fast" just doesn't make sense to even _worry_
> about. If you use O_DIRECT and mix it with fork, you get what you
> get, and it's random - exactly because people who want O_DIRECT don't
> want any locking.
>
> It's a user-space issue, not a kernel issue.

I think your point of view is clear. I sure can write userland code
that copes with the currently altered memory-protection semantics of
read vs fork when the fd is opened with O_DIRECT or a driver uses gup,
so I'll let the userland folks comment on it; some are in CC.
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Linus Torvalds @ 2009-03-11 22:06 UTC
To: Andrea Arcangeli
Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, 11 Mar 2009, Andrea Arcangeli wrote:
>
> > People want the relaxed synchronization we give them, and that's
> > literally why get_user_pages_fast exists - because people don't
> > want _more_ synchronization, they want _less_.
> >
> > But the thing is, with less synchronization, the behavior really is
> > surprising in the edge cases. Which is why I think "threaded fork"
> > plus "get_user_pages_fast" just doesn't make sense to even _worry_
> > about. If you use O_DIRECT and mix it with fork, you get what you
> > get, and it's random - exactly because people who want O_DIRECT
> > don't want any locking.
> >
> > It's a user-space issue, not a kernel issue.
>
> I think your point of view is clear. I sure can write userland code
> that copes with the currently altered memory-protection semantics of
> read vs fork when the fd is opened with O_DIRECT or a driver uses
> gup, so I'll let the userland folks comment on it; some are in CC.

Btw, we could make it easier for people to not screw up.

In particular, "fork()" in a threaded program is almost always wrong.
If you want to exec another program from a threaded one, you should
either just do execve() (which kills all threads) or you should do
vfork+execve (which has none of the COW issues).

And we could add a warning for it. Something like "if this is a
threaded program, and it has ever used get_user_pages(), and it does a
fork(), warn about it once". Maybe people would realize what a stupid
thing they are doing, and that there is a simple fix (vfork).

		Linus
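[Editorial note: for completeness, the vfork+execve pattern being
recommended looks like this. This is an editor-added sketch; the helper
program name is a placeholder.]

	#include <unistd.h>
	#include <sys/wait.h>

	/* Spawn a helper without duplicating the address space: vfork()
	 * borrows the parent's MM until execve()/_exit(), so there are
	 * no COW'd anonymous pages for pinned-page I/O to race with.
	 */
	static int spawn_helper(void)
	{
		pid_t pid = vfork();

		if (pid == 0) {
			/* child: only exec or _exit is safe here */
			execlp("helper", "helper", (char *)NULL);
			_exit(127);	/* exec failed */
		}
		if (pid < 0)
			return -1;
		return waitpid(pid, NULL, 0) == pid ? 0 : -1;
	}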
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Linus Torvalds @ 2009-03-11 22:07 UTC
To: Andrea Arcangeli
Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, 11 Mar 2009, Linus Torvalds wrote:
>
> And we could add a warning for it. Something like "if this is a
> threaded program, and it has ever used get_user_pages(), and it does
> a fork(), warn about it once". Maybe people would realize what a
> stupid thing they are doing, and that there is a simple fix (vfork).

Ehh. vfork is only simple if you literally are going to execve.

If you are using a fork as some kind of odd way to snapshot, I don't
know what you should do. You can't sanely snapshot a threaded app with
fork, but I bet some people try.

		Linus
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Davide Libenzi @ 2009-03-11 22:22 UTC
To: Linus Torvalds
Cc: Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, 11 Mar 2009, Linus Torvalds wrote:
> In particular, "fork()" in a threaded program is almost always wrong.
> If you want to exec another program from a threaded one, you should
> either just do execve() (which kills all threads) or you should do
> vfork+execve (which has none of the COW issues).

Didn't follow the lengthy thread, but if we make fork+exec fail inside
a threaded program, we might end up making a lot of people unhappy.

- Davide
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Linus Torvalds @ 2009-03-11 22:32 UTC
To: Davide Libenzi
Cc: Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, 11 Mar 2009, Davide Libenzi wrote:
>
> Didn't follow the lengthy thread, but if we make fork+exec fail
> inside a threaded program, we might end up making a lot of people
> unhappy.

Yeah, no, we don't want to fail it, but we could do a one-time warning
or something, to at least see who does it and perhaps see if some of
them might realize the problems.

		Linus
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Benjamin Herrenschmidt @ 2009-03-14 5:07 UTC
To: Linus Torvalds
Cc: Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, 2009-03-11 at 13:33 -0700, Linus Torvalds wrote:
>  - Just make the rule be that people who use get_user_pages() always
>    have to have the read-lock on mmap_sem until they've used the
>    pages.

That's not going to work with IB and friends who gup() whole bunches
of user memory forever...

Ben.
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Andrea Arcangeli @ 2009-03-11 20:48 UTC
To: Linus Torvalds
Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, Mar 11, 2009 at 01:19:03PM -0700, Linus Torvalds wrote:
> That said, I don't know who the crazy O_DIRECT users are. It may be
> true that some O_DIRECT users end up using the same pages over and
> over again, and that this is a good optimization for them.

If it's done on new pages, chances are that the gup-fast fast path
can't run in the first place, modulo glibc memalign re-using
previously freed areas. Overall I think it's a worthwhile
optimization, to avoid the locked op in the rewrite case, which I
think is common enough. But I totally agree that it'd be good to
benchmark gup-fast on already instantiated ptes where SetPageGUP will
run. I thought it'd be below measurement error and not measurable, but
it's good to check.

> The advantage of it is that it fixes the problem not just in one
> place, but "forever". No hacks about exactly how you access the
> mappings etc.
>
> Of course, nothing _really_ solves things. If you do some delayed IO
> after having looked up the mapping and turned it into a physical
> page, and the original allocator actually unmaps it (or exits), then
> the same issue can still happen (well, not the _same_ one - but the
> very similar issue of the child seeing changes even though the IO was
> started in the parent).
>
> This is why I think any "look up by physical" is fundamentally
> flawed. It very basically becomes a "I have a secret local TLB that
> cannot be changed or flushed". And any single-bit solution (GUP) is
> always going to be fairly broken.

One of the reasons for not sharing when PG_gup is set and page_count
shows the page as pinned is also to fix all sorts of drivers that are
doing gup to "look up by physical" on anon pages, doing DMA "by
physical at some offset of the page" at any time later, and fork.
Otherwise PageReserved should be set by default by gup-fast, instead
of relying on the drivers to set it after gup-fast returns.

> Agreed. However, I really think this is an O_DIRECT problem. Just
> document it. Tell people that O_DIRECT simply doesn't work with COW,
> and fundamentally can never work well.
>
> If you use O_DIRECT with threading, you had better know what the hell
> you're doing anyway. I do not think that the kernel should do stupid
> things just because stupid users don't understand the semantics of
> the _non-stupid_ thing (which is to just let people think about COW
> for five seconds).

This really isn't only about O_DIRECT. This is to fix gup vs fork;
O_DIRECT is just one of the millions of gup users out there... KVM
works around this by using MADV_DONTFORK; until MADV_DONTFORK was
introduced, I once started to get corruption in KVM when a change made
system() execute once in a while for whatever unrelated reason.
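[Editorial note: for reference, the MADV_DONTFORK workaround Andrea
mentions looks roughly like this from a gup-using application's point
of view. This is an editor-added sketch; the function name and error
handling are illustrative.]

	#include <stdlib.h>
	#include <sys/mman.h>

	/* Allocate a DMA buffer that is simply not inherited across
	 * fork(): the child never maps it, so fork can't make the
	 * pinned pages COW-shared in the first place.
	 */
	void *alloc_dma_buffer(size_t len)
	{
		void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return NULL;
		if (madvise(buf, len, MADV_DONTFORK)) {	/* Linux >= 2.6.16 */
			munmap(buf, len);
			return NULL;
		}
		return buf;
	}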
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Benjamin Herrenschmidt @ 2009-03-14 5:06 UTC
To: Linus Torvalds
Cc: Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, 2009-03-11 at 13:19 -0700, Linus Torvalds wrote:
>
> That said, I don't know who the crazy O_DIRECT users are. It may be
> true that some O_DIRECT users end up using the same pages over and
> over again, and that this is a good optimization for them.

Just my 2 cents here...

While I agree mostly with what you say about O_DIRECT craziness,
unfortunately gup is also a fashionable interface in a few other
areas, such as IB or RDMA'ish things, and I'm pretty sure we'll see
others popping up here or there.

Right, it's a bit stinky, but it -is- somewhat nice for a driver to be
able to take a chunk of existing user addresses and not care whether
they are anonymous, shmem, file mappings, large pages, ... and just
gup and get some DMA pounding on them. There are various usage
scenarios where it's in fact less ugly than anything else you can come
up with ... pretty much.

IB folks so far have been avoiding the fork() trap thanks to
madvise(MADV_DONTFORK) afaik. And it all goes generally well when the
whole application knows what it's doing and just plain avoids fork.

-But- things get nasty if for some reason the user of gup is somewhere
deep in some kind of library that an application uses without knowing,
while forking here or there to run shell scripts or other helpers.

I've seen it :-)

So if a solution can be found that doesn't uglify the whole thing
beyond recognition, it's probably worth it.

Cheers,
Ben.
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Nick Piggin @ 2009-03-14 5:20 UTC
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Saturday 14 March 2009 16:06:29 Benjamin Herrenschmidt wrote:
> On Wed, 2009-03-11 at 13:19 -0700, Linus Torvalds wrote:
> > That said, I don't know who the crazy O_DIRECT users are. It may be
> > true that some O_DIRECT users end up using the same pages over and
> > over again, and that this is a good optimization for them.
>
> Just my 2 cents here...
>
> While I agree mostly with what you say about O_DIRECT craziness,
> unfortunately gup is also a fashionable interface in a few other
> areas, such as IB or RDMA'ish things, and I'm pretty sure we'll see
> others popping up here or there.
>
> Right, it's a bit stinky, but it -is- somewhat nice for a driver to
> be able to take a chunk of existing user addresses and not care
> whether they are anonymous, shmem, file mappings, large pages, ...
> and just gup and get some DMA pounding on them. There are various
> usage scenarios where it's in fact less ugly than anything else you
> can come up with ... pretty much.
>
> IB folks so far have been avoiding the fork() trap thanks to
> madvise(MADV_DONTFORK) afaik. And it all goes generally well when the
> whole application knows what it's doing and just plain avoids fork.
>
> -But- things get nasty if for some reason the user of gup is
> somewhere deep in some kind of library that an application uses
> without knowing, while forking here or there to run shell scripts or
> other helpers.
>
> I've seen it :-)
>
> So if a solution can be found that doesn't uglify the whole thing
> beyond recognition, it's probably worth it.

AFAIKS, the approach I've posted is probably the simplest (and maybe
only) way to really fix it. It's not too ugly.

You can't easily fix it at write-time by COWing in the right direction
like Linus suggested, because at that point you may have multiple
get_user_pages (for read) from the parent and child on the page, so
there is no way to COW it in the right direction.

You could do something crazy like allowing only one get_user_pages
read on a wp page, and recording which direction to send it if it does
get COWed. But at that point you've got something that's far uglier in
the core code and more complex than what I posted.
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: KOSAKI Motohiro @ 2009-03-16 16:01 UTC
To: Nick Piggin
Cc: kosaki.motohiro, Benjamin Herrenschmidt, Linus Torvalds, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm

Hi

> > IB folks so far have been avoiding the fork() trap thanks to
> > madvise(MADV_DONTFORK) afaik. And it all goes generally well when
> > the whole application knows what it's doing and just plain avoids
> > fork.
> >
> > -But- things get nasty if for some reason the user of gup is
> > somewhere deep in some kind of library that an application uses
> > without knowing, while forking here or there to run shell scripts
> > or other helpers.
> >
> > I've seen it :-)
> >
> > So if a solution can be found that doesn't uglify the whole thing
> > beyond recognition, it's probably worth it.
>
> AFAIKS, the approach I've posted is probably the simplest (and maybe
> only) way to really fix it. It's not too ugly.

May I join this discussion?

If we only need to worry about O_DIRECT, the patch below is enough.

Yes, my patch isn't a real solution. Andrea already pointed out that
it's not an O_DIRECT issue, it's a gup vs fork issue. *and* my patch
is crazy slow :)

So, my point is, I merely oppose an easy decision to give up on fixing
this.

Currently, I agree we don't have an easy way to fix it, but I believe
we can solve this problem completely in the near future, because LKML
folks are very cool guys.

Thus, I don't want to append this to the "BUGS" section of the
O_DIRECT man page. Also, I don't want to say "Oh, Solaris can meet
your requirement, AIX can, FreeBSD can, but Linux can't" - it hurts my
pride as a Linux developer a bit ;)

Andrea's patch seems a bit more complex than yours, but I think it can
be improved later; a man page change can't be undone.

In addition, may I talk about my gup-fast concern? AFAIK, the worth of
gup-fast is not removing one atomic operation; not grabbing mmap_sem
is the essential part. That's because:

 - the block layer and I/O drivers also take several locks, so direct
   IO takes many atomic operations anyway. One atomic operation is not
   so expensive.

 - but mmap_sem is one of the most easily contended locks in Linux,
   because:
    - almost all modern DB software is multi-threaded.
    - glibc malloc/free can cause mmap, munmap, and mprotect syscalls,
      and those syscalls grab down_write(&mmap_sem).
    - page faults also grab down_read(&mmap_sem).
    - anyway, userland applications can't avoid malloc() and page
      faults.

However, I haven't seen anyone try to munmap() a direct-IO region. So
that implies mmap_sem could be split up more fine-grained (or can we
remove it completely? IIRC PeterZ tried that about two months ago).
After that, we could grab mmap_sem without performance degradation,
and many of the mmap_sem-avoiding efforts could be removed.

Perhaps I'm talking nonsense - gup-fast was introduced to solve a DB2
problem, but I don't have any DB2 development experience. Am I
over-optimistic?

> You can't easily fix it at write-time by COWing in the right
> direction like Linus suggested, because at that point you may have
> multiple get_user_pages (for read) from the parent and child on the
> page, so there is no way to COW it in the right direction.
>
> You could do something crazy like allowing only one get_user_pages
> read on a wp page, and recording which direction to send it if it
> does get COWed. But
> at that point you've got something that's far uglier in the core
> code and more complex than what I posted.

---
 fs/direct-io.c            |    2 ++
 include/linux/init_task.h |    1 +
 include/linux/mm_types.h  |    3 +++
 kernel/fork.c             |    3 +++
 4 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index b6d4390..8f9a810 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1206,8 +1206,10 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
 		(end > i_size_read(inode)));
 
+	down_read(&current->mm->directio_sem);
 	retval = direct_io_worker(rw, iocb, inode, iov, offset,
 				nr_segs, blkbits, get_block, end_io, dio);
+	up_read(&current->mm->directio_sem);
 
 	/*
 	 * In case of error extending write may have instantiated a few
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index e752d97..68e02b9 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -37,6 +37,7 @@ extern struct fs_struct init_fs;
 	.page_table_lock = __SPIN_LOCK_UNLOCKED(name.page_table_lock),	\
 	.mmlist		= LIST_HEAD_INIT(name.mmlist),			\
 	.cpu_vm_mask	= CPU_MASK_ALL,					\
+	.directio_sem	= __RWSEM_INITIALIZER(name.directio_sem),	\
 }
 
 #define INIT_SIGNALS(sig) {						\
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d84feb7..39ba4e6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -274,6 +274,9 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+
+	/* if there is in-flight direct IO, we can't fork. */
+	struct rw_semaphore directio_sem;
 };
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
diff --git a/kernel/fork.c b/kernel/fork.c
index 4854c2c..bbe9fa7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -266,6 +266,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 	unsigned long charge;
 	struct mempolicy *pol;
 
+	down_write(&oldmm->directio_sem);
 	down_write(&oldmm->mmap_sem);
 	flush_cache_dup_mm(oldmm);
 	/*
@@ -368,6 +369,7 @@ out:
 	up_write(&mm->mmap_sem);
 	flush_tlb_mm(oldmm);
 	up_write(&oldmm->mmap_sem);
+	up_write(&oldmm->directio_sem);
 	return retval;
 fail_nomem_policy:
 	kmem_cache_free(vm_area_cachep, tmp);
@@ -431,6 +433,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->cached_hole_size = ~0UL;
 	mm_init_owner(mm, p);
+	init_rwsem(&mm->directio_sem);
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
-- 
1.6.0.6
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]

From: Nick Piggin @ 2009-03-16 16:23 UTC
To: KOSAKI Motohiro
Cc: Benjamin Herrenschmidt, Linus Torvalds, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm

On Tuesday 17 March 2009 03:01:42 KOSAKI Motohiro wrote:
> Hi
>
> > AFAIKS, the approach I've posted is probably the simplest (and
> > maybe only) way to really fix it. It's not too ugly.
>
> May I join this discussion?

Of course :)

> If we only need to worry about O_DIRECT, the patch below is enough.
>
> Yes, my patch isn't a real solution. Andrea already pointed out that
> it's not an O_DIRECT issue, it's a gup vs fork issue. *and* my patch
> is crazy slow :)

Well, it's an interesting question. I'd say it probably is more than
just O_DIRECT. vmsplice too, for example (which I think is much harder
to fix this way, because the pages are retired by the other end of the
pipe, so I don't think you can hold a lock across it).

For other device drivers, one could argue that they are "special" and
require special knowledge and apps to use MADV_DONTFORK... Ben didn't
like that so much, and also some other users of get_user_pages might
come up.

But your patch is interesting. I don't think it is crazy slow... well,
it might be a bit slow in the case that a threaded app doing a lot of
direct IO, or an app doing async IO, forks. But how common is that? I
would be slightly more worried about the common cacheline touched to
take the read lock for multithreaded direct IO, but I'm not sure how
much that will hurt DB2.

> So, my point is, I merely oppose an easy decision to give up on
> fixing this.
>
> Currently, I agree we don't have an easy way to fix it, but I believe
> we can solve this problem completely in the near future, because LKML
> folks are very cool guys.
>
> Thus, I don't want to append this to the "BUGS" section of the
> O_DIRECT man page. Also, I don't want to say "Oh, Solaris can meet
> your requirement, AIX can, FreeBSD can, but Linux can't" - it hurts
> my pride as a Linux developer a bit ;)
>
> Andrea's patch seems a bit more complex than yours, but I think it
> can be improved later; a man page change can't be undone.
>
> In addition, may I talk about my gup-fast concern? AFAIK, the worth
> of gup-fast is not removing one atomic operation; not grabbing
> mmap_sem is the essential part.

Yes, mmap_sem is the big thing. But straight line speed is important
too.

[...]

> ---
>  fs/direct-io.c            |    2 ++
>  include/linux/init_task.h |    1 +
>  include/linux/mm_types.h  |    3 +++
>  kernel/fork.c             |    3 +++
>  4 files changed, 9 insertions(+), 0 deletions(-)

It is an interesting patch. Thanks for throwing it into the discussion.

I do prefer to close the race for all cases if we decide to do
anything at all about it, ie. all or nothing. But maybe others
disagree.
> diff --git a/fs/direct-io.c b/fs/direct-io.c > index b6d4390..8f9a810 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -1206,8 +1206,10 @@ __blockdev_direct_IO(int rw, struct kiocb > *iocb, struct inode *inode, > dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) && > (end > i_size_read(inode))); > > + down_read(¤t->mm->directio_sem); > retval = direct_io_worker(rw, iocb, inode, iov, offset, > nr_segs, blkbits, get_block, end_io, dio); > + up_read(¤t->mm->directio_sem); > > /* > * In case of error extending write may have instantiated a few > diff --git a/include/linux/init_task.h b/include/linux/init_task.h > index e752d97..68e02b9 100644 > --- a/include/linux/init_task.h > +++ b/include/linux/init_task.h > @@ -37,6 +37,7 @@ extern struct fs_struct init_fs; > .page_table_lock = __SPIN_LOCK_UNLOCKED(name.page_table_lock), \ > .mmlist = LIST_HEAD_INIT(name.mmlist), \ > .cpu_vm_mask = CPU_MASK_ALL, \ > + .directio_sem = __RWSEM_INITIALIZER(name.directio_sem), \ > } > > #define INIT_SIGNALS(sig) { \ > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index d84feb7..39ba4e6 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -274,6 +274,9 @@ struct mm_struct { > #ifdef CONFIG_MMU_NOTIFIER > struct mmu_notifier_mm *mmu_notifier_mm; > #endif > + > + /* if there are on-flight directio, we can't fork. */ > + struct rw_semaphore directio_sem; > }; > > /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */ > diff --git a/kernel/fork.c b/kernel/fork.c > index 4854c2c..bbe9fa7 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -266,6 +266,7 @@ static int dup_mmap(struct mm_struct *mm, struct > mm_struct *oldmm) > unsigned long charge; > struct mempolicy *pol; > > + down_write(&oldmm->directio_sem); > down_write(&oldmm->mmap_sem); > flush_cache_dup_mm(oldmm); > /* > @@ -368,6 +369,7 @@ out: > up_write(&mm->mmap_sem); > flush_tlb_mm(oldmm); > up_write(&oldmm->mmap_sem); > + up_write(&oldmm->directio_sem); > return retval; > fail_nomem_policy: > kmem_cache_free(vm_area_cachep, tmp); > @@ -431,6 +433,7 @@ static struct mm_struct * mm_init(struct mm_struct > * mm, struct task_struct *p) > mm->free_area_cache = TASK_UNMAPPED_BASE; > mm->cached_hole_size = ~0UL; > mm_init_owner(mm, p); > + init_rwsem(&mm->directio_sem); > > if (likely(!mm_alloc_pgd(mm))) { > mm->def_flags = 0; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
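For readers reconstructing the race being closed by the rwsem above, here is a minimal user-space sketch. It is an illustration of the scenario under discussion, not code from the thread; "testfile", the buffer size, and all names are assumptions, and the file must exist, be at least one page long, and live on a filesystem that supports O_DIRECT. A thread submits an O_DIRECT read into an anonymous page while the main thread forks and both sides write to the buffer; which physical page the DMA lands in then depends on who COWs first, so the window exists even though this toy will rarely hit it:

/* gcc -O2 -pthread dio_fork_race.c -o dio_fork_race */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define BUFSZ 4096                      /* one page, O_DIRECT-aligned */

static void *buf;

static void *dio_reader(void *arg)
{
        int fd = open("testfile", O_RDONLY | O_DIRECT);  /* hypothetical file */
        if (fd < 0) { perror("open"); exit(1); }
        /* in the kernel, get_user_pages() pins the physical page behind
         * 'buf' here, then the DMA completes into that physical page */
        if (read(fd, buf, BUFSZ) != BUFSZ)
                perror("read");
        close(fd);
        return NULL;
}

int main(void)
{
        pthread_t t;
        if (posix_memalign(&buf, BUFSZ, BUFSZ))
                return 1;
        memset(buf, 0, BUFSZ);          /* fault the page in */
        pthread_create(&t, NULL, dio_reader, NULL);
        /* while the read is (possibly) in flight, fork and COW the page */
        if (fork() == 0) {
                memset(buf, 0xaa, BUFSZ);   /* child write may trigger COW */
                _exit(0);
        }
        memset(buf, 0x55, BUFSZ);       /* parent write races the DMA */
        pthread_join(t, NULL);
        wait(NULL);
        return 0;
}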
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 16:23 ` Nick Piggin @ 2009-03-16 16:32 ` Linus Torvalds 2009-03-16 16:50 ` Nick Piggin 2009-03-18 2:04 ` KOSAKI Motohiro 1 sibling, 1 reply; 83+ messages in thread From: Linus Torvalds @ 2009-03-16 16:32 UTC (permalink / raw) To: Nick Piggin Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tue, 17 Mar 2009, Nick Piggin wrote: > > Yes, my patch isn't realy solusion. > > Andrea already pointed out that it's not O_DIRECT issue, it's gup vs fork > > issue. *and* my patch is crazy slow :) > > Well, it's an interesting question. I'd say it probably is more than > just O_DIRECT. vmsplice too, for example (which I think is much harder > to fix this way because the pages are retired by the other end of > the pipe, so I don't think you can hold a lock across it). Well, only the "fork()" has the race problem. So having a fork-specific lock (but not naming it by directio) actually does make sense. The fork is much less performance-critical than most random mmap_sem users - and doesn't have the same scalability issues either (ie people probably _do_ want to do mmap/munmap/brk concurrently with gup lookup, but there's much less worry about concurrent fork() performance). It doesn't necessarily make the general problem go away, but it makes the _particular_ race between get_user_pages() and fork() go away. Then you can do per-page flags or whatever and not have to worry about concurrent lookups. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 16:32 ` Linus Torvalds @ 2009-03-16 16:50 ` Nick Piggin 2009-03-16 17:02 ` Linus Torvalds 2009-03-16 23:59 ` KAMEZAWA Hiroyuki 0 siblings, 2 replies; 83+ messages in thread From: Nick Piggin @ 2009-03-16 16:50 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tuesday 17 March 2009 03:32:11 Linus Torvalds wrote: > On Tue, 17 Mar 2009, Nick Piggin wrote: > > > Yes, my patch isn't realy solusion. > > > Andrea already pointed out that it's not O_DIRECT issue, it's gup vs > > > fork issue. *and* my patch is crazy slow :) > > > > Well, it's an interesting question. I'd say it probably is more than > > just O_DIRECT. vmsplice too, for example (which I think is much harder > > to fix this way because the pages are retired by the other end of > > the pipe, so I don't think you can hold a lock across it). > > Well, only the "fork()" has the race problem. > > So having a fork-specific lock (but not naming it by directio) actually > does make sense. The fork is much less performance-critical than most > random mmap_sem users - and doesn't have the same scalability issues > either (ie people probably _do_ want to do mmap/munmap/brk concurrently > with gup lookup, but there's much less worry about concurrent fork() > performance). > > It doesn't necessarily make the general problem go away, but it makes the > _particular_ race between get_user_pages() and fork() go away. Then you > can do per-page flags or whatever and not have to worry about concurrent > lookups. Hmm, I see what you mean there; it can be used to solve Andrea's race instead of using set_bit/memory barriers. But I think then you would still need to put this lock in fork and get_user_pages[_fast], *and* still do most of the other stuff required in Andrea's patch. So I'm not sure whether that's what KAMEZAWA-san's patch was doing. It actually should solve one side of the race completely, as is, but only for direct-IO, because it ensures that no get_user_pages for direct IO can be outstanding over a fork. However, it a) does not solve other get_user_pages problems, and b) doesn't solve the case where a read-only get_user_pages on an already shared pte gets confused if the page is subsequently COWed -- it can end up being polluted with wrong data. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 16:50 ` Nick Piggin @ 2009-03-16 17:02 ` Linus Torvalds 2009-03-16 17:19 ` Nick Piggin 0 siblings, 1 reply; 83+ messages in thread From: Linus Torvalds @ 2009-03-16 17:02 UTC (permalink / raw) To: Nick Piggin Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tue, 17 Mar 2009, Nick Piggin wrote: > > Hmm, I see what you mean there; it can be used to solve Andrea's race > instead of using set_bit/memory barriers. But I think then you would > still need to put this lock in fork and get_user_pages[_fast], *and* > still do most of the other stuff required in Andrea's patch. Well, yes and no. What if we just let the caller get the lock? And then leave it entirely to the caller to decide how it wants to synchronize with fork? In particular, we really _could_ just say "hold the lock for reading for as long as you hold the reference count to the page" - since now the lock only matters for fork(), nothing else. And make the forking part use "down_write_killable()", so that you can kill the process if it does something bad. Now you can make vmsplice literally get a read-lock for the whole IO operation. The process that does "vmsplice()" will not be able to fork until the IO is done, but let's be honest here: if you're doing vmsplice(), that is damn well what you WANT! splice() already has a callback for releasing the pages, so it's doable. O_DIRECT has similar issues - by the time we return from an O_DIRECT write, the pages had better already be written out, so we could just take the read-lock over the whole operation. So don't take the lock in the low level get_user_pages(). Take it as high as you want to. And if some user doesn't want that serialization (maybe ptrace?), don't take the lock at all, or take it just over the get_user_pages() call. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
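The proposal above reduces to a classic reader/writer discipline, which is easy to state in miniature. Below is a user-space model of just that discipline; all names are invented for illustration, and in the kernel this would be an rw_semaphore in mm_struct, with fork using the down_write_killable() suggested above:

#include <pthread.h>

static pthread_rwlock_t pin_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Page pinners are readers: hold the shared side for as long as the
 * page reference is held, exactly as described above. */
static void toy_direct_io(void)
{
        pthread_rwlock_rdlock(&pin_lock);
        /* get_user_pages(); submit the I/O; wait for it; put_page() */
        pthread_rwlock_unlock(&pin_lock);
}

/* fork()/dup_mmap() is the rare writer: it excludes every in-flight
 * pin while ptes are copied and write-protected.  (The kernel version
 * would want a killable write lock so a stuck I/O can't wedge fork
 * unkillably.) */
static void toy_fork_side(void)
{
        pthread_rwlock_wrlock(&pin_lock);
        /* copy_page_range() and friends */
        pthread_rwlock_unlock(&pin_lock);
}

int main(void)
{
        toy_direct_io();
        toy_fork_side();
        return 0;
}

Because forks are rare and pins are common, almost all acquisitions take the cheap shared path; the cost Nick worries about elsewhere in the thread is the shared cacheline that even the read side must touch.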
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 17:02 ` Linus Torvalds @ 2009-03-16 17:19 ` Nick Piggin 2009-03-16 17:42 ` Linus Torvalds 0 siblings, 1 reply; 83+ messages in thread From: Nick Piggin @ 2009-03-16 17:19 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tuesday 17 March 2009 04:02:02 Linus Torvalds wrote: > On Tue, 17 Mar 2009, Nick Piggin wrote: > > Hmm, I see what you mean there; it can be used to solve Andrea's race > > instead of using set_bit/memory barriers. But I think then you would > > still need to put this lock in fork and get_user_pages[_fast], *and* > > still do most of the other stuff required in Andrea's patch. > > Well, yes and no. > > What if we just did the caller get the lock? And then leave it entirely to > the caller to decide how it wants to synchronize with fork? > > In particular, we really _could_ just say "hold the lock for reading for > as long as you hold the reference count to the page" - since now the lock > only matters for fork(), nothing else. Well that in theory should close the race in one direction (writing into the wrong page). I don't think it closes it in the other direction (reading the wrong data from the page). I'm also not quite convinced of vmsplice. > And make the forking part use "down_write_killable()", so that you can > kill the process if it does something bad. > > Now you can make vmsplice literally get a read-lock for the whole IO > operation. The process that does "vmsplice()" will not be able to fork > until the IO is done, but let's be honest here: if you're doing > vmsplice(), that is damn well what you WANT! Really? I'm not sure (probably primarily because I've never really seen how vmsplice would be used). splice is supposed to be asynchronous, so I don't know why you necessarily would want to avoid fork after a splice (until the asynchronous reader on the other end that you don't necessarily have control over or know anything about reads all the data you've sent it). > splice() already has a callback for releasing the pages, so it's doable. doable, maybe. > O_DIRECT has similar issues - by the time we return from an O_DIRECT > write, the pages had better already be written out, so we could just take > the read-lock over the whole operation. Yes I think that's what the patch was doing. > So don't take the lock in the low level get_user_pages(). Take it as high > as you want to. > > And if some user doesn't want that serialization (maybe ptrace?), don't > take the lock at all, or take it just over the get_user_pages() call. BTW. have you looked at my approach yet? I've tried to solve the fork vs gup race in yet another way. Don't know if you think it is palatable. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 17:19 ` Nick Piggin @ 2009-03-16 17:42 ` Linus Torvalds 2009-03-16 18:02 ` Nick Piggin 2009-03-16 18:28 ` Andrea Arcangeli 0 siblings, 2 replies; 83+ messages in thread From: Linus Torvalds @ 2009-03-16 17:42 UTC (permalink / raw) To: Nick Piggin Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tue, 17 Mar 2009, Nick Piggin wrote: > > Well that in theory should close the race in one direction (writing into > the wrong page). > > I don't think it closes it in the other direction (reading the wrong data > from the page). Why? If somebody does a COW while we have a get_user_pages() page frame cached, the get_user_pages() will have increased the page count, so regardless of _who_ writes to the page, the writer will always get a new page. No? So reading data from the page will always get the old pre-cow data. [ goes to reading code ] Oh, damn. That's how it used to work a long time ago when we looked at the page count. Now we just look at the page *map* count, we don't look at any other counts. So the COW logic won't see that somebody else has a copy. Maybe we could go back to also looking at page counts? > BTW. have you looked at my approach yet? I've tried to solve the fork > vs gup race in yet another way. Don't know if you think it is palatable. I really think we should be able to fix this without _anything_ like that at all. Just the lock (and some reuse_swap_page() logic changes). Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
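The difference noticed above is easy to isolate: gup raises the page *count* but not the *map count*, so a reuse test keyed only on the map count cannot see the pin. Here is a toy model of the two checks; the field and function names are mine, and the real reuse_swap_page() also has swap-cache and LRU references to account for, so this only isolates the arithmetic:

#include <assert.h>

/* Toy page: refcount counts all references (including gup pins),
 * mapcount counts only the ptes mapping the page. */
struct toy_page {
        int refcount;
        int mapcount;
};

/* Mapcount-only logic (roughly the current code): reuse the page for
 * writing if we are the only mapper.  A gup pin is invisible here. */
static int reuse_by_mapcount(const struct toy_page *p)
{
        return p->mapcount == 1;
}

/* Count-based logic being suggested: an extra reference beyond the
 * mappings means someone (e.g. gup) still holds the page, so don't
 * hand it back to the writer. */
static int reuse_by_count(const struct toy_page *p)
{
        return p->mapcount == 1 && p->refcount == p->mapcount;
}

int main(void)
{
        /* one pte maps the page, and O_DIRECT has it pinned */
        struct toy_page pinned = { .refcount = 2, .mapcount = 1 };

        assert(reuse_by_mapcount(&pinned));     /* would wrongly reuse */
        assert(!reuse_by_count(&pinned));       /* sees the gup pin */
        return 0;
}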
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 17:42 ` Linus Torvalds @ 2009-03-16 18:02 ` Nick Piggin 2009-03-16 18:05 ` Nick Piggin 2009-03-16 18:14 ` Linus Torvalds 2009-03-16 18:28 ` Andrea Arcangeli 1 sibling, 2 replies; 83+ messages in thread From: Nick Piggin @ 2009-03-16 18:02 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tuesday 17 March 2009 04:42:48 Linus Torvalds wrote: > On Tue, 17 Mar 2009, Nick Piggin wrote: > > Well that in theory should close the race in one direction (writing into > > the wrong page). > > > > I don't think it closes it in the other direction (reading the wrong data > > from the page). > > Why? > > If somebody does a COW while we have a get_user_pages() page frame cached, > the get_user_pages() will have increased the page count, so regardless of > _who_ writes to the page, the writer will always get a new page. No? [(no)] > Maybe we could go back to also looking at page counts? Hmm, possibly could. > > BTW. have you looked at my approach yet? I've tried to solve the fork > > vs gup race in yet another way. Don't know if you think it is palatable. > > I really think we should be able to fix this without _anything_ like that > at all. Just the lock (and some reuse_swap_page() logic changes). What part of that do you dislike, though? I don't think the lock is a particularly elegant idea either (shared cacheline, vmsplice, converting callers). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 18:02 ` Nick Piggin @ 2009-03-16 18:05 ` Nick Piggin 2009-03-16 18:17 ` Linus Torvalds 2009-03-16 18:14 ` Linus Torvalds 1 sibling, 1 reply; 83+ messages in thread From: Nick Piggin @ 2009-03-16 18:05 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tuesday 17 March 2009 05:02:56 Nick Piggin wrote: > On Tuesday 17 March 2009 04:42:48 Linus Torvalds wrote: > > On Tue, 17 Mar 2009, Nick Piggin wrote: > > > BTW. have you looked at my approach yet? I've tried to solve the fork > > > vs gup race in yet another way. Don't know if you think it is > > > palatable. > > > > I really think we should be able to fix this without _anything_ like that > > at all. Just the lock (and some reuse_swap_page() logic changes). > > What part of that do you dislike, though? If you disregard code motion and extra argument to copy_page_range, my fix is a couple of dozen lines change to existing code, plus the "decow" function (which could probably share a fair bit of code with do_wp_page). Do you dislike the added complexity of the code? Or the behaviour that gets changed? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 18:05 ` Nick Piggin @ 2009-03-16 18:17 ` Linus Torvalds 2009-03-16 18:33 ` Nick Piggin 0 siblings, 1 reply; 83+ messages in thread From: Linus Torvalds @ 2009-03-16 18:17 UTC (permalink / raw) To: Nick Piggin Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tue, 17 Mar 2009, Nick Piggin wrote: > > If you disregard code motion and extra argument to copy_page_range, > my fix is a couple of dozen lines change to existing code, plus the > "decow" function (which could probably share a fair bit of code > with do_wp_page). > > Do you dislike the added complexity of the code? Or the behaviour > that gets changed? The complexity. That decow thing is shit. So is all the extra flags for no good reason. What's your argument against "keep it simple with a single lock, and adding basically a single line to reuse_swap_page() to say "don't reuse the page if the count is elevated"? THAT is simple and elegant, and needs none of the complexity. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 18:17 ` Linus Torvalds @ 2009-03-16 18:33 ` Nick Piggin 2009-03-16 19:22 ` Linus Torvalds 0 siblings, 1 reply; 83+ messages in thread From: Nick Piggin @ 2009-03-16 18:33 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tuesday 17 March 2009 05:17:02 Linus Torvalds wrote: > On Tue, 17 Mar 2009, Nick Piggin wrote: > > If you disregard code motion and extra argument to copy_page_range, > > my fix is a couple of dozen lines change to existing code, plus the > > "decow" function (which could probably share a fair bit of code > > with do_wp_page). > > > > Do you dislike the added complexity of the code? Or the behaviour > > that gets changed? > > The complexity. That decow thing is shit. copying the page on fork instead of write protecting it? The code or the idea? Code can certainly be improved... > So is all the extra flags for no > good reason. Which extra flags are you referring to? > What's your argument against "keep it simple with a single lock, and > adding basically a single line to reuse_swap_page() to say "don't reuse > the page if the count is elevated"? I made them in a previous message. It depends on what callers you want to convert I guess. I don't think vmsplice takes to the lock approach very well though. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 18:33 ` Nick Piggin @ 2009-03-16 19:22 ` Linus Torvalds 2009-03-17 5:44 ` Nick Piggin 0 siblings, 1 reply; 83+ messages in thread From: Linus Torvalds @ 2009-03-16 19:22 UTC (permalink / raw) To: Nick Piggin Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tue, 17 Mar 2009, Nick Piggin wrote: > > > So is all the extra flags for no > > good reason. > > Which extra flags are you referring to? Fuck me, didn't you even read your own patch? What do you call PG_dontcow? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 19:22 ` Linus Torvalds @ 2009-03-17 5:44 ` Nick Piggin 0 siblings, 0 replies; 83+ messages in thread From: Nick Piggin @ 2009-03-17 5:44 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tuesday 17 March 2009 06:22:12 Linus Torvalds wrote: > On Tue, 17 Mar 2009, Nick Piggin wrote: > > > So is all the extra flags for no > > > good reason. > > > > Which extra flags are you referring to? > > Fuck me, didn't you even read your own patch? > > What do you call PG_dontcow? It is a flag, there for a good reason. It sounded like you were seeing more than one flag, and that you thought they were useless. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 18:02 ` Nick Piggin 2009-03-16 18:05 ` Nick Piggin @ 2009-03-16 18:14 ` Linus Torvalds 2009-03-16 18:29 ` Nick Piggin 2009-03-16 18:37 ` Andrea Arcangeli 1 sibling, 2 replies; 83+ messages in thread From: Linus Torvalds @ 2009-03-16 18:14 UTC (permalink / raw) To: Nick Piggin Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tue, 17 Mar 2009, Nick Piggin wrote: > > What part of that do you dislike, though? I don't think the lock is a > particularly elegant idea either (shared cacheline, vmsplice, converting > callers). All of the absolute *crap* for no good reason. Did you even look at your patch? It wasn't as ugly as Andrea's, but it was ugly enough, and it was buggy. That whole "decow" stuff was too f*cking ugly to live. Couple that with the fact that no real-life user can possibly care, and that O_DIRECT is broken to begin with, and I say: "let's fix this with a _much_ smaller patch". You may think that the lock isn't particularly "elegant", but I can only say "f*ck that, look at the number of lines of code, and the simplicity". Your "elegant" argument is total and utter sh*t, in other words. The lock approach is tons more elegant, considering that it solves the problem much more cleanly, and with _much_ less crap. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 18:14 ` Linus Torvalds @ 2009-03-16 18:29 ` Nick Piggin 2009-03-16 19:17 ` Linus Torvalds 2009-03-16 18:37 ` Andrea Arcangeli 1 sibling, 1 reply; 83+ messages in thread From: Nick Piggin @ 2009-03-16 18:29 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tuesday 17 March 2009 05:14:59 Linus Torvalds wrote: > On Tue, 17 Mar 2009, Nick Piggin wrote: > > What part of that do you dislike, though? I don't think the lock is a > > particularly elegant idea either (shared cacheline, vmsplice, converting > > callers). > > All of the absolute *crap* for no good reason. > > Did you even look at your patch? It wasn't as ugly as Andrea's, but it was > ugly enough, and it was buggy. That whole "decow" stuff was too f*cking > ugly to live. What's buggy about it? Stupid bugs, or fundamentally broken? > Couple that with the fact that no real-life user can possibly care, and > that O_DIRECT is broken to begin with, and I say: "let's fix this with a > _much_ smaller patch". If it is based on nobody caring, I would prefer not to add anything at all to "fix" it? We have MADV_DONTFORK already... > You may think that the lock isn't particularly "elegant", but I can only > say "f*ck that, look at the number of lines of code, and the simplicity". > > Your "elegant" argument is total and utter sh*t, in other words. The lock > approach is tons more elegant, considering that it solves the problem much > more cleanly, and with _much_ less crap. In my opinion it is not, given that you have to convert callers. If you say that you only care about fixing O_DIRECT, then yes I would probably agree the lock is nicer in that case. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
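For reference, the MADV_DONTFORK escape hatch mentioned above is a one-line opt-out in the application: the marked range is simply not duplicated into the child, so there is nothing for fork() to COW out from under a pinned page. A minimal sketch follows (buffer size arbitrary, error handling trimmed):

#define _GNU_SOURCE
#include <sys/mman.h>

#define BUFSZ (1 << 20)

int main(void)
{
        void *buf = mmap(NULL, BUFSZ, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return 1;

        /* The range is not inherited by children, so a later fork()
         * cannot COW these pages out from under an in-flight DMA. */
        if (madvise(buf, BUFSZ, MADV_DONTFORK))
                return 1;

        /* ... open with O_DIRECT, read/write through buf, fork freely ... */

        munmap(buf, BUFSZ);
        return 0;
}

The cost is that the application has to know about the problem, which is exactly the objection Ben raised earlier for device-driver users.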
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 18:29 ` Nick Piggin @ 2009-03-16 19:17 ` Linus Torvalds 2009-03-17 5:42 ` Nick Piggin 0 siblings, 1 reply; 83+ messages in thread From: Linus Torvalds @ 2009-03-16 19:17 UTC (permalink / raw) To: Nick Piggin Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tue, 17 Mar 2009, Nick Piggin wrote: > > What's buggy about it? Stupid bugs, or fundamentally broken? The lack of locking. > In my opinion it is not, given that you have to convert callers. If you > say that you only care about fixing O_DIRECT, then yes I would probably > agree the lock is nicer in that case. F*ck me, I'm not going to bother to argue. I'm not going to merge your patch, it's that easy. Quite frankly, I don't think that the "bug" is a bug to begin with. O_DIRECT+fork() can damn well continue to be broken. But if we fix it, we fix it the _clean_ way with a simple patch, not with that shit-for-logic horrible decow crap. It's that simple. I refuse to take putrid industrial waste patches for something like this. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 19:17 ` Linus Torvalds @ 2009-03-17 5:42 ` Nick Piggin 2009-03-17 5:58 ` Nick Piggin 0 siblings, 1 reply; 83+ messages in thread From: Nick Piggin @ 2009-03-17 5:42 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tuesday 17 March 2009 06:17:21 Linus Torvalds wrote: > On Tue, 17 Mar 2009, Nick Piggin wrote: > > What's buggy about it? Stupid bugs, or fundamentally broken? > > The lack of locking. I don't think it's broken. I can't see a problem. > > In my opinion it is not, given that you have to convert callers. If you > > say that you only care about fixing O_DIRECT, then yes I would probably > > agree the lock is nicer in that case. > > F*ck me, I'm not going to bother to argue. I'm not going to merge your > patch, it's that easy. > > Quite frankly, I don't think that the "bug" is a bug to begin with. > O_DIRECT+fork() can damn well continue to be broken. But if we fix it, we > fix it the _clean_ way with a simple patch, not with that shit-for-logic > horrible decow crap. > > It's that simple. I refuse to take putrid industrial waste patches for > something like this. I consider it is clean because it only adds branches in 3 places that are not taken unless direct IO and fork are used, and it fixes the "problem" in the VM directly leaving get_user_pages unchanged. I don't think it is conceptually such a problem to copy pages rather than COW them in fork. Seems fairly straightforward to me. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
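Nick's fork-time copy ("decow") can be stated compactly: if an anon page carries a gup reference beyond its mappings when dup_mmap() reaches it, give the child an immediate private copy and leave the parent's pte writable; otherwise fall back to normal COW sharing. Here is a toy model of just that branch; all names are mine and stand in for the real pte-copy loop:

#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

struct toy_page {
        int refcount;
        int mapcount;
        unsigned char data[PAGE_SIZE];
};

/* Sketch of the fork-time choice: pinned pages are copied up front
 * ("decow"), everything else is shared and write-protected. */
static struct toy_page *fork_one_page(struct toy_page *parent,
                                      int *child_writable)
{
        if (parent->refcount > parent->mapcount) {
                /* gup holds the page: copy now, so the parent keeps the
                 * pinned physical page and its pte stays writable. */
                struct toy_page *copy = calloc(1, sizeof(*copy));
                if (!copy)
                        abort();
                copy->refcount = copy->mapcount = 1;
                memcpy(copy->data, parent->data, PAGE_SIZE);
                *child_writable = 1;
                return copy;
        }
        /* Normal COW: share the page, write-protect both sides. */
        parent->refcount++;
        parent->mapcount++;
        *child_writable = 0;
        return parent;
}

int main(void)
{
        struct toy_page *p = calloc(1, sizeof(*p));
        int w;
        if (!p)
                abort();
        p->refcount = 2;        /* one pte plus one gup pin */
        p->mapcount = 1;
        struct toy_page *child = fork_one_page(p, &w);
        return child == p;      /* a pinned page must not be shared */
}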
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-17 5:42 ` Nick Piggin @ 2009-03-17 5:58 ` Nick Piggin 0 siblings, 0 replies; 83+ messages in thread From: Nick Piggin @ 2009-03-17 5:58 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tuesday 17 March 2009 16:42:24 Nick Piggin wrote: > I consider it is clean because it only adds branches in 3 places that > are not taken unless direct IO and fork are used, and it fixes the > "problem" in the VM directly leaving get_user_pages unchanged. leaving get_user_pages callers unchanged. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 18:14 ` Linus Torvalds 2009-03-16 18:29 ` Nick Piggin @ 2009-03-16 18:37 ` Andrea Arcangeli 1 sibling, 0 replies; 83+ messages in thread From: Andrea Arcangeli @ 2009-03-16 18:37 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, KOSAKI Motohiro, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Mon, Mar 16, 2009 at 11:14:59AM -0700, Linus Torvalds wrote: > You may think that the lock isn't particularly "elegant", but I can only > say "f*ck that, look at the number of lines of code, and the simplicity". I'm sorry, but the number of lines you're reading in the direct_io_worker patch isn't representative of what it takes to fix it with an mm-wide lock. It may be conceptually simpler to fix it outside GUP, on that I can certainly agree (with the downside of leaving splice broken etc.), but I can't see how that small patch can fix anything, since it releases the semaphore after direct_io_worker returns, which breaks O_DIRECT mixed with async-io. Before claiming that the outer lock results in fewer lines of code, I'd wait to see a fix that works with O_DIRECT+async-io as well as mine and Nick's do. > Your "elegant" argument is total and utter sh*t, in other words. The lock > approach is tons more elegant, considering that it solves the problem much > more cleanly, and with _much_ less crap. I guess elegance is relative, but the size argument is objective, and it should be possible to compare if somebody writes a full fix that doesn't fall apart when the return value of direct_io_worker is -EIOCBQUEUED. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 17:42 ` Linus Torvalds 2009-03-16 18:02 ` Nick Piggin @ 2009-03-16 18:28 ` Andrea Arcangeli 1 sibling, 0 replies; 83+ messages in thread From: Andrea Arcangeli @ 2009-03-16 18:28 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, KOSAKI Motohiro, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Mon, Mar 16, 2009 at 10:42:48AM -0700, Linus Torvalds wrote: > Maybe we could go back to also looking at page counts? Hugh just recently reminded me why we switched to mapcount; the explanation is here: c475a8ab625d567eacf5e30ec35d6d8704558062 which wasn't entirely safe until this was added too: ab967d86015a19777955370deebc8262d50fed63 which reliably allowed taking over swapcache pages taken by gup, and at the same time it allowed the VM to unmap ptes pointing to swapcache taken by GUP. Yes, it's possible to go back to page counts; then we only have to reintroduce the 2.6.7 solution that prevented the VM from unmapping ptes that map pages taken by GUP. Otherwise do_wp_page won't be able to remap into the pte the same swapcache that the VM unmapped from the pte, leading to disk corruption with swapping (the 2.4 bug, fixed in 2.4 with a simpler PG_lock local to direct-io, which prevented the VM from unmapping ptes on the page as long as I/O was in progress; PG_lock was released by the ->end_io async handler from irq, IIRC). The only problem I can see is that if mapcount and page count can change freely while the PT lock and rmap locks are taken, comparing them won't be as reliable as in ksm/fork (in my version of the fix), where we're guaranteed mapcount is 1 and stays 1 as long as we hold the PT lock, because pte_write(pte) == true and PageAnon == true (I also added a BUG_ON to check that mapcount is always 1 when the other two conditions are true). That makes the ksm/fork fix quite obviously safe in this regard. But for the VM to decide not to unmap a pte taken by GUP, we also have to deal with mapcount > 1 and pte_write(pte) == false and PageAnon == true. So if we solve that ordering issue between reading mapcount and page count, I don't see much of a problem in returning to checking the page count in the VM code to prevent the pte from being unmapped while the page is under GUP, and then removing the mapcount-only check from the do_wp_page swapcache-reuse logic. If we returned to using the page_count instead of mapcount, the first patch I posted here would not require any change to take care of the 'reverse' race (modulo hugetlb) of the child writing to the pages that are being written to disk by the parent; there would be no need to de-cow in GUP (again modulo hugetlb). > I really think we should be able to fix this without _anything_ like that > at all. Just the lock (and some reuse_swap_page() logic changes). I don't see why we should introduce mm-wide locks outside GUP (the worry about the SetPageGUP in gup-fast, when gup-fast would then instead have to take an mm-wide lock, sounds like a small issue) when we can be page-granular and lockless. I agree it could be simpler and less invasive with respect to the gup details to add the logic outside of gup, but I don't think the result will be superior, given it'll most certainly become a heavier-weight lock bouncing across all cpus calling gup-fast, and it won't make a speed difference for the CPU to execute an atomic lock op inside or outside of gup-fast. OTOH, if the argument for an outer mm-wide lock is to keep the code simpler or more maintainable, that would explain it.
I think fixing it my way is not more complicated than fixing it outside gup, but then I may well be biased about what looks simpler to me. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 16:50 ` Nick Piggin 2009-03-16 17:02 ` Linus Torvalds @ 2009-03-16 23:59 ` KAMEZAWA Hiroyuki 1 sibling, 0 replies; 83+ messages in thread From: KAMEZAWA Hiroyuki @ 2009-03-16 23:59 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, linux-mm On Tue, 17 Mar 2009 03:50:12 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > On Tuesday 17 March 2009 03:32:11 Linus Torvalds wrote: > > On Tue, 17 Mar 2009, Nick Piggin wrote: > > > > Yes, my patch isn't realy solusion. > > > > Andrea already pointed out that it's not O_DIRECT issue, it's gup vs > > > > fork issue. *and* my patch is crazy slow :) > > > > > > Well, it's an interesting question. I'd say it probably is more than > > > just O_DIRECT. vmsplice too, for example (which I think is much harder > > > to fix this way because the pages are retired by the other end of > > > the pipe, so I don't think you can hold a lock across it). > > > > Well, only the "fork()" has the race problem. > > > > So having a fork-specific lock (but not naming it by directio) actually > > does make sense. The fork is much less performance-critical than most > > random mmap_sem users - and doesn't have the same scalability issues > > either (ie people probably _do_ want to do mmap/munmap/brk concurrently > > with gup lookup, but there's much less worry about concurrent fork() > > performance). > > > > It doesn't necessarily make the general problem go away, but it makes the > > _particular_ race between get_user_pages() and fork() go away. Then you > > can do per-page flags or whatever and not have to worry about concurrent > > lookups. > > Hmm, I see what you mean there; it can be used to solve Andrea's race > instead of using set_bit/memory barriers. But I think then you would > still need to put this lock in fork and get_user_pages[_fast], *and* > still do most of the other stuff required in Andrea's patch. > > So I'm not sure whether that's what KAMEZAWA-san's patch was doing. > Just FYI. This was the last patch I sent to redhat (against RHEL5) but it was ignored ;) Please ignore the dirty parts, which come from the limitation that I can't modify mm_struct. === This patch provides a kind of rwlock for DIO. This patch adds the following: struct mm_private { struct mm_struct new our data } Before issuing DIO, the DIO submitter should call dio_lock()/dio_unlock(). Before starting COW, the kernel should call mm_cow_start()/mm_cow_end(). dio_lock() registers an address range which is under DIO. mm_cow_start() checks whether an address range is under DIO, then - If under DIO, retry the fault (to release the rwsem). - If not under DIO, mark "we're under COW". This will make DIO submitters wait. To avoid too many page faults, a "conflict" counter is added; if conflict==1, the DIO submitter will wait for a while. If no one has issued DIO yet at copy-on-write time, there are no checks. 
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> -- fs/direct-io.c | 43 ++++++++++++++- include/linux/direct-io.h | 38 +++++++++++++ include/linux/mm_private.h | 24 ++++++++ kernel/fork.c | 23 ++++++-- mm/Makefile | 2 mm/diolock.c | 129 +++++++++++++++++++++++++++++++++++++++++++++ mm/hugetlb.c | 11 +++ mm/memory.c | 15 +++++ 8 files changed, 278 insertions(+), 7 deletions(-) Index: kame-odirect-linux/include/linux/direct-io.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ kame-odirect-linux/include/linux/direct-io.h 2009-01-30 10:12:58.000000000 +0900 @@ -0,0 +1,38 @@ +#ifndef __LINUX_DIRECT_IO_H +#define __LINUX_DIRECT_IO_H + +struct dio_lock_head +{ + spinlock_t lock; /* A lock for all below */ + struct list_head dios; /* DIOs running now */ + int need_dio_check; /* This process used DIO */ + int cows; /* COWs running now */ + int conflicts; /* conflicts between COW and DIOs*/ + wait_queue_head_t waitq; /* A waitq for all stopped DIOs.*/ +}; + +struct dio_lock_ent +{ + struct list_head list; /* Linked list from head->dios */ + struct mm_struct *mm; /* the mm struct this is assgined for */ + unsigned long start; /* start address for a DIO */ + unsigned long end; /* end address for a DIO */ +}; + +/* called at fork/exit */ +int dio_lock_init(struct dio_lock_head *head); +void dio_lock_free(struct dio_lock_head *head); + +/* + * Called by DIO submitter. + */ +int dio_lock(struct mm_struct *mm, unsigned long start, unsigned long end, + struct dio_lock_ent *lock); +void dio_unlock(struct dio_lock_ent *lock); +/* + * Called by waiters. + */ +int mm_cow_start(struct mm_struct *mm, unsigned long start, unsigned long size); +void mm_cow_end(struct mm_struct *mm); + +#endif Index: kame-odirect-linux/mm/diolock.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ kame-odirect-linux/mm/diolock.c 2009-01-30 10:43:11.000000000 +0900 @@ -0,0 +1,129 @@ +#include <linux/mm.h> +#include <linux/wait.h> +#include <linux/hash.h> +#include <linux/mm_private.h> + + +int dio_lock_init(struct dio_lock_head *head) +{ + spin_lock_init(&head->lock); + head->need_dio_check = 0; + head->cows = 0; + head->conflicts = 0; + INIT_LIST_HEAD(&head->dios); + init_waitqueue_head(&head->waitq); + return 0; +} + +void dio_lock_free(struct dio_lock_head *head) +{ + BUG_ON(!list_empty(&head->dios)); + return; +} + + +int dio_lock(struct mm_struct *mm, unsigned long start, unsigned long end, + struct dio_lock_ent *lock) +{ + unsigned long flags; + struct dio_lock_head *head; + DEFINE_WAIT(wait); +retry: + if (signal_pending(current)) + return -EINTR; + head = &get_mm_private(mm)->diolock; + + if (!head->need_dio_check) { + down_write(&mm->mmap_sem); + head->need_dio_check = 1; + up_write(&mm->mmap_sem); + } + + prepare_to_wait(&head->waitq, &wait, TASK_INTERRUPTIBLE); + spin_lock_irqsave(&head->lock, flags); + if (head->cows || head->conflicts) { /* Allow COWs go ahead rather than new I/O */ + spin_unlock_irqrestore(&head->lock, flags); + if (head->cows) + schedule(); + else { + schedule_timeout(10); /* Allow 10tick for COW rertry */ + head->conflicts = 0; + } + finish_wait(&head->waitq, &wait); + goto retry; + } + lock->mm = mm; + lock->start = PAGE_ALIGN(start); + lock->end = PAGE_ALIGN(end) + PAGE_SIZE; + list_add(&lock->list, &head->dios); + atomic_inc(&mm->mm_users); + spin_unlock_irqrestore(&head->lock, flags); + finish_wait(&head->waitq, &wait); + return 0; +} + 
+void dio_unlock(struct dio_lock_ent *lock) +{ + struct dio_lock_head *head; + struct mm_struct *mm; + unsigned long flags; + + mm = lock->mm; + head = &get_mm_private(mm)->diolock; + spin_lock_irqsave(&head->lock, flags); + list_del(&lock->list); + if (waitqueue_active(&head->waitq)) + wake_up_all(&head->waitq); + spin_unlock_irqrestore(&head->lock, flags); + mmput(mm); +} + +int mm_cow_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct dio_lock_head *head; + struct dio_lock_ent *lock; + + head = &get_mm_private(mm)->diolock; + if (!head->need_dio_check) + return 0; + + spin_lock_irq(&head->lock); + head->cows++; + if (list_empty(&head->dios)) { + spin_unlock_irq(&head->lock); + return 0; + } + /* SLOW PATH */ + list_for_each_entry(lock, &head->dios, list) { + if ((start < lock->end) && (end > lock->start)) { + head->cows--; + head->conflicts++; + spin_unlock_irq(&head->lock); + /* This page fault will be retried but new dio requests will be + delayed until cow ends.*/ + return 1; + } + } + spin_unlock_irq(&head->lock); + return 0; +} + +void mm_cow_end(struct mm_struct *mm) +{ + struct dio_lock_head *head; + + head = &get_mm_private(mm)->diolock; + if (!head->need_dio_check) + return; + + spin_lock_irq(&head->lock); + head->cows--; + if (!head->cows) { + head->conflicts = 0; + if (waitqueue_active(&head->waitq)) + wake_up_all(&head->waitq); + } + spin_unlock_irq(&head->lock); + +} Index: kame-odirect-linux/fs/direct-io.c =================================================================== --- kame-odirect-linux.orig/fs/direct-io.c 2009-01-29 14:01:44.000000000 +0900 +++ kame-odirect-linux/fs/direct-io.c 2009-01-30 10:53:45.000000000 +0900 @@ -34,6 +34,8 @@ #include <linux/buffer_head.h> #include <linux/rwsem.h> #include <linux/uio.h> +#include <linux/direct-io.h> + #include <asm/atomic.h> /* @@ -130,8 +132,43 @@ int is_async; /* is IO async ? */ int io_error; /* IO error in completion path */ ssize_t result; /* IO result */ + + /* For sanity of Direct-IO and Copy-On-Write */ + struct dio_lock_ent *locks; + int nr_segs; }; +int dio_protect_all(struct dio *dio, const struct iovec *iov, int nsegs) +{ + struct dio_lock_ent *lock; + unsigned long start, end; + int seg; + + lock = kzalloc(sizeof(*lock) * nsegs, GFP_KERNEL); + if (!lock) + return -ENOMEM; + dio->locks = lock; + dio->nr_segs = nsegs; + for (seg = 0; seg < nsegs; seg++) { + start = (unsigned long)iov[seg].iov_base; + end = (unsigned long)iov[seg].iov_base + iov[seg].iov_len; + dio_lock(current->mm, start, end, lock+seg); + } + return 0; +} + +void dio_release_all_protection(struct dio *dio) +{ + int seg; + + if (!dio->locks) + return; + + for (seg = 0; seg < dio->nr_segs; seg++) + dio_unlock(dio->locks + seg); + kfree(dio->locks); +} + /* * How many pages are in the queue? 
*/ @@ -284,6 +321,7 @@ if (remaining == 0) { int ret = dio_complete(dio, dio->iocb->ki_pos, 0); aio_complete(dio->iocb, ret, 0); + dio_release_all_protection(dio); kfree(dio); } @@ -965,6 +1003,7 @@ dio->iocb = iocb; dio->i_size = i_size_read(inode); + dio->locks = NULL; spin_lock_init(&dio->bio_lock); dio->refcount = 1; @@ -1088,6 +1127,7 @@ if (ret2 == 0) { ret = dio_complete(dio, offset, ret); + dio_release_all_protection(dio); kfree(dio); } else BUG_ON(ret != -EIOCBQUEUED); @@ -1166,7 +1206,8 @@ retval = -ENOMEM; if (!dio) goto out; - + if (dio_protect_all(dio, iov, nr_segs)) + goto out; /* * For block device access DIO_NO_LOCKING is used, * neither readers nor writers do any locking at all Index: kame-odirect-linux/kernel/fork.c =================================================================== --- kame-odirect-linux.orig/kernel/fork.c 2009-01-29 14:01:44.000000000 +0900 +++ kame-odirect-linux/kernel/fork.c 2009-01-30 09:54:05.000000000 +0900 @@ -46,6 +46,7 @@ #include <linux/delayacct.h> #include <linux/taskstats_kern.h> #include <linux/hash.h> +#include <linux/mm_private.h> #ifndef __GENKSYMS__ #include <linux/ptrace.h> #include <linux/tty.h> @@ -77,8 +78,8 @@ struct hlist_head mm_flags_hash[MM_FLAGS_HASH_SIZE] = { [ 0 ... MM_FLAGS_HASH_SIZE - 1 ] = HLIST_HEAD_INIT }; DEFINE_SPINLOCK(mm_flags_lock); -#define MM_HASH_SHIFT ((sizeof(struct mm_struct) >= 1024) ? 10 \ - : (sizeof(struct mm_struct) >= 512) ? 9 \ +#define MM_HASH_SHIFT ((sizeof(struct mm_private) >= 1024) ? 10 \ + : (sizeof(struct mm_private) >= 512) ? 9 \ : 8) #define mm_flags_hash_fn(mm) \ hash_long((unsigned long)(mm) >> MM_HASH_SHIFT, MM_FLAGS_HASH_BITS) @@ -299,6 +300,17 @@ spin_unlock(&mm_flags_lock); } +static void init_mm_private(struct mm_private *mmp) +{ + dio_lock_init(&mmp->diolock); +} + +static void free_mm_private(struct mm_private *mmp) +{ + dio_lock_free(&mmp->diolock); +} + + #ifdef CONFIG_MMU static inline int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) { @@ -430,7 +442,7 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(mmlist_lock); #define allocate_mm() (kmem_cache_alloc(mm_cachep, SLAB_KERNEL)) -#define free_mm(mm) (kmem_cache_free(mm_cachep, (mm))) +#define free_mm(mm) (kmem_cache_free(mm_cachep, get_mm_private((mm)))) #include <linux/init_task.h> @@ -451,6 +463,7 @@ mm->ioctx_list = NULL; mm->free_area_cache = TASK_UNMAPPED_BASE; mm->cached_hole_size = ~0UL; + init_mm_private(get_mm_private(mm)); mm_flags = get_mm_flags(current->mm); if (mm_flags != MMF_DUMP_FILTER_DEFAULT) { @@ -466,6 +479,7 @@ if (mm_flags != MMF_DUMP_FILTER_DEFAULT) free_mm_flags(mm); fail_nomem: + free_mm_private(get_mm_private(mm)); free_mm(mm); return NULL; } @@ -494,6 +508,7 @@ { BUG_ON(mm == &init_mm); free_mm_flags(mm); + free_mm_private(get_mm_private(mm)); mm_free_pgd(mm); destroy_context(mm); free_mm(mm); @@ -1550,7 +1565,7 @@ sizeof(struct vm_area_struct), 0, SLAB_PANIC, NULL, NULL); mm_cachep = kmem_cache_create("mm_struct", - sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN, + sizeof(struct mm_private), ARCH_MIN_MMSTRUCT_ALIGN, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); } Index: kame-odirect-linux/mm/Makefile =================================================================== --- kame-odirect-linux.orig/mm/Makefile 2009-01-29 14:01:44.000000000 +0900 +++ kame-odirect-linux/mm/Makefile 2009-01-29 14:01:59.000000000 +0900 @@ -5,7 +5,7 @@ mmu-y := nommu.o mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \ mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \ - vmalloc.o + vmalloc.o 
diolock.o obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \ page_alloc.o page-writeback.o pdflush.o \ Index: kame-odirect-linux/mm/memory.c =================================================================== --- kame-odirect-linux.orig/mm/memory.c 2009-01-29 14:01:44.000000000 +0900 +++ kame-odirect-linux/mm/memory.c 2009-01-29 16:18:19.000000000 +0900 @@ -50,6 +50,7 @@ #include <linux/delayacct.h> #include <linux/init.h> #include <linux/writeback.h> +#include <linux/direct-io.h> #include <asm/pgalloc.h> #include <asm/uaccess.h> @@ -1665,6 +1666,7 @@ int reuse = 0, ret = VM_FAULT_MINOR; struct page *dirty_page = NULL; int dirty_pte = 0; + int dio_stop = 0; old_page = vm_normal_page(vma, address, orig_pte); if (!old_page) @@ -1738,6 +1740,7 @@ gotten: pte_unmap_unlock(page_table, ptl); + if (unlikely(anon_vma_prepare(vma))) goto oom; if (old_page == ZERO_PAGE(address)) { @@ -1748,6 +1751,11 @@ new_page = alloc_page_vma(GFP_HIGHUSER, vma, address); if (!new_page) goto oom; + if (mm_cow_start(mm, address, address+PAGE_SIZE)) { + page_cache_release(new_page); + goto out_retry; + } + dio_stop = 1; cow_user_page(new_page, old_page, address); } @@ -1789,6 +1797,9 @@ page_cache_release(new_page); if (old_page) page_cache_release(old_page); + /* Allow DIO progress */ + if (dio_stop) + mm_cow_end(mm); unlock: pte_unmap_unlock(page_table, ptl); if (dirty_page) { @@ -1797,6 +1808,10 @@ put_page(dirty_page); } return ret; +out_retry: + if (old_page) + page_cache_release(old_page); + return ret; oom: if (old_page) page_cache_release(old_page); Index: kame-odirect-linux/mm/hugetlb.c =================================================================== --- kame-odirect-linux.orig/mm/hugetlb.c 2009-01-29 14:01:44.000000000 +0900 +++ kame-odirect-linux/mm/hugetlb.c 2009-01-29 16:29:51.000000000 +0900 @@ -14,6 +14,7 @@ #include <linux/mempolicy.h> #include <linux/cpuset.h> #include <linux/mutex.h> +#include <linux/direct-io.h> #include <asm/page.h> #include <asm/pgtable.h> @@ -470,7 +471,13 @@ page_cache_release(old_page); return VM_FAULT_OOM; } - + if (mm_cow_start(mm, address & HPAGE_MASK, HPAGE_SIZE)) { + /* we have to retry. */ + page_cache_release(old_page); + page_cache_release(new_page); + return VM_FAULT_MINOR; + } + spin_unlock(&mm->page_table_lock); copy_huge_page(new_page, old_page, address); spin_lock(&mm->page_table_lock); @@ -486,6 +493,8 @@ } page_cache_release(new_page); page_cache_release(old_page); + mm_cow_end(mm); + return VM_FAULT_MINOR; } Index: kame-odirect-linux/include/linux/mm_private.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ kame-odirect-linux/include/linux/mm_private.h 2009-01-30 09:52:26.000000000 +0900 @@ -0,0 +1,24 @@ +#ifndef __LINUX_MM_PRIVATE_H +#define __LINUX_MM_PRIVATE_H + +#include <linux/sched.h> +#include <linux/direct-io.h> + +/* + * Because we have to keep KABI, we cannot modify mm_struct itself. This + * mm_private is per-process object and not covered by KABI. + * Just for a fields of future bugfix. + * Note: Now, this is not copied at fork(). + */ +struct mm_private { + struct mm_struct mm; + /* For fixing direct-io/COW races. */ + struct dio_lock_head diolock; +}; + +static inline struct mm_private *get_mm_private(struct mm_struct *mm) +{ + return container_of(mm, struct mm_private, mm); +} + +#endif -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
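One detail of the patch above worth calling out is the interval test in mm_cow_start(): a COW stalls only if its address range intersects a registered DIO range, using the standard half-open overlap check. Extracted into a standalone form (the assertion values are mine):

#include <assert.h>

/* Two half-open ranges [s1, e1) and [s2, e2) overlap iff each one
 * starts before the other ends -- the same test mm_cow_start() applies
 * to every entry on the dios list. */
static int ranges_overlap(unsigned long s1, unsigned long e1,
                          unsigned long s2, unsigned long e2)
{
        return s1 < e2 && s2 < e1;
}

int main(void)
{
        assert(ranges_overlap(0x1000, 0x3000, 0x2000, 0x4000));  /* partial */
        assert(!ranges_overlap(0x1000, 0x2000, 0x2000, 0x3000)); /* adjacent */
        return 0;
}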
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 16:23 ` Nick Piggin 2009-03-16 16:32 ` Linus Torvalds @ 2009-03-18 2:04 ` KOSAKI Motohiro 2009-03-22 12:23 ` KOSAKI Motohiro 1 sibling, 1 reply; 83+ messages in thread From: KOSAKI Motohiro @ 2009-03-18 2:04 UTC (permalink / raw) To: Nick Piggin Cc: kosaki.motohiro, Benjamin Herrenschmidt, Linus Torvalds, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm Hi > > --- > > fs/direct-io.c | 2 ++ > > include/linux/init_task.h | 1 + > > include/linux/mm_types.h | 3 +++ > > kernel/fork.c | 3 +++ > > 4 files changed, 9 insertions(+), 0 deletions(-) > > It is an interesting patch. Thanks for throwing it into the discussion. > I do prefer to close the race up for all cases if we decide to do > anything at all about it, ie. all or nothing. But maybe others disagree. Honestly, I wasn't expecting Linus's reaction, but I hope to make my v2. My point is: - my patch doesn't prevent implementing madvise(DONTCOW), I think. - Andrea's patch's complexity is mainly caused by the effort to avoid performance regressions, so later kernel improvements can shrink his patch automatically. Fortunately KSM isn't merged yet; we can discuss his patch again when KSM is submitted. - anyway, it can fix the bug. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-18 2:04 ` KOSAKI Motohiro @ 2009-03-22 12:23 ` KOSAKI Motohiro 2009-03-23 0:13 ` KOSAKI Motohiro 2009-03-24 13:43 ` Nick Piggin 0 siblings, 2 replies; 83+ messages in thread From: KOSAKI Motohiro @ 2009-03-22 12:23 UTC (permalink / raw) To: Nick Piggin, Linus Torvalds, Andrea Arcangeli Cc: kosaki.motohiro, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm Hi The following patch is my v2 approach. It survives Andrea's three DIO test cases. Linus suggested changing the add_to_swap() and shrink_page_list() stuff to avoid a false COW in do_wp_page() when a page becomes swapcache. I think it's a good idea, but it's a bit radical, so I think it's something to tackle in the development tree. So I decided to use Nick's early decow in get_user_pages(), and RO-mapped pages don't use gup_fast. Yeah, my approach is an extremely brutal big-hammer way, but I don't think it has a performance issue in the real world. Why? Practically, we can assume the following two things. (1) the buffer passed as a write(2) syscall argument is an RW-mapped page or a COWed RO page. If anybody writes the following code, my patch causes a performance regression: buf = mmap() memset(buf, 0x11, len); mprotect(buf, len, PROT_READ) fd = open(O_DIRECT) write(fd, buf, len) but it's very artificial code; nobody wants this, so we can ignore it. (2) A direct-IO user process isn't a short-lived process. Early decow only decreases short-lived process performance, because a long-lived process does the decowing anyway before exec(2). And all DB applications are definitely long-lived processes, so early decow doesn't cause a regression. TODO - implement down_write_killable(). (but it isn't an important thing, because this is a rare-case issue.) - implement the non-x86 portion. Am I missing anything? Note: this is still an RFC, not intended for submission. -- arch/x86/mm/gup.c | 22 ++++++++++++++-------- fs/direct-io.c | 11 +++++++++++ include/linux/init_task.h | 1 + include/linux/mm.h | 9 +++++++++ include/linux/mm_types.h | 6 ++++++ kernel/fork.c | 3 +++ mm/internal.h | 10 ---------- mm/memory.c | 17 ++++++++++++++++- mm/util.c | 8 ++++++-- 9 files changed, 66 insertions(+), 21 deletions(-) diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c index be54176..02e479b 100644 --- a/arch/x86/mm/gup.c +++ b/arch/x86/mm/gup.c @@ -74,8 +74,10 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr, pte_t *ptep; mask = _PAGE_PRESENT|_PAGE_USER; - if (write) - mask |= _PAGE_RW; + + /* Maybe the read only pte is cow mapped page. 
(or not maybe) + So, falling back to get_user_pages() is better */ + mask |= _PAGE_RW; ptep = pte_offset_map(&pmd, addr); do { @@ -114,8 +116,7 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr, int refs; mask = _PAGE_PRESENT|_PAGE_USER; - if (write) - mask |= _PAGE_RW; + mask |= _PAGE_RW; if ((pte_flags(pte) & mask) != mask) return 0; /* hugepages are never "special" */ @@ -171,8 +172,7 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr, int refs; mask = _PAGE_PRESENT|_PAGE_USER; - if (write) - mask |= _PAGE_RW; + mask |= _PAGE_RW; if ((pte_flags(pte) & mask) != mask) return 0; /* hugepages are never "special" */ @@ -272,6 +272,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write, { int ret; + int gup_flags; slow: local_irq_enable(); @@ -280,9 +281,14 @@ slow_irqon: start += nr << PAGE_SHIFT; pages += nr; + gup_flags = GUP_FLAGS_PINNING_PAGE; + if (write) + gup_flags |= GUP_FLAGS_WRITE; + down_read(&mm->mmap_sem); - ret = get_user_pages(current, mm, start, - (end - start) >> PAGE_SHIFT, write, 0, pages, NULL); + ret = __get_user_pages(current, mm, start, + (end - start) >> PAGE_SHIFT, gup_flags, + pages, NULL); up_read(&mm->mmap_sem); /* Have to be a bit careful with return values */ diff --git a/fs/direct-io.c b/fs/direct-io.c index b6d4390..4f46720 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -131,6 +131,9 @@ struct dio { int is_async; /* is IO async ? */ int io_error; /* IO error in completion path */ ssize_t result; /* IO result */ + + /* fork exclusive stuff */ + struct mm_struct *mm; }; /* @@ -243,6 +246,9 @@ static int dio_complete(struct dio *dio, loff_t offset, int ret) if (dio->lock_type == DIO_LOCKING) /* lockdep: non-owner release */ up_read_non_owner(&dio->inode->i_alloc_sem); + up_read_non_owner(&dio->mm->mm_pinned_sem); + mmdrop(dio->mm); + dio->mm = NULL; if (ret == 0) ret = dio->page_errors; @@ -942,6 +948,7 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, ssize_t ret = 0; ssize_t ret2; size_t bytes; + struct mm_struct *mm; dio->inode = inode; dio->rw = rw; @@ -960,6 +967,10 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, spin_lock_init(&dio->bio_lock); dio->refcount = 1; + mm = dio->mm = current->mm; + atomic_inc(&mm->mm_count); + down_read_non_owner(&mm->mm_pinned_sem); + /* * In case of non-aligned buffers, we may need 2 more * pages since we need to zero out first and last block. 
diff --git a/include/linux/init_task.h b/include/linux/init_task.h index e752d97..3bc134a 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -37,6 +37,7 @@ extern struct fs_struct init_fs; .page_table_lock = __SPIN_LOCK_UNLOCKED(name.page_table_lock), \ .mmlist = LIST_HEAD_INIT(name.mmlist), \ .cpu_vm_mask = CPU_MASK_ALL, \ + .mm_pinned_sem = __RWSEM_INITIALIZER(name.mm_pinned_sem), \ } #define INIT_SIGNALS(sig) { \ diff --git a/include/linux/mm.h b/include/linux/mm.h index 065cdf8..dcc6ccc 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -823,6 +823,15 @@ static inline int handle_mm_fault(struct mm_struct *mm, extern int make_pages_present(unsigned long addr, unsigned long end); extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write); +#define GUP_FLAGS_WRITE 0x01 +#define GUP_FLAGS_FORCE 0x02 +#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x04 +#define GUP_FLAGS_IGNORE_SIGKILL 0x08 +#define GUP_FLAGS_PINNING_PAGE 0x10 + +int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int len, int flags, + struct page **pages, struct vm_area_struct **vmas); int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int len, int write, int force, struct page **pages, struct vm_area_struct **vmas); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index d84feb7..27089d9 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -274,6 +274,12 @@ struct mm_struct { #ifdef CONFIG_MMU_NOTIFIER struct mmu_notifier_mm *mmu_notifier_mm; #endif + + /* + * If there is an in-flight direct-IO or similar pinning action, + * COW can cause memory corruption. This sem protects against that + * by preventing fork. + */ + struct rw_semaphore mm_pinned_sem; }; /* Future-safe accessor for struct mm_struct's cpu_vm_mask. 
*/ diff --git a/kernel/fork.c b/kernel/fork.c index 4854c2c..ded7caf 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -266,6 +266,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) unsigned long charge; struct mempolicy *pol; + down_write(&oldmm->mm_pinned_sem); down_write(&oldmm->mmap_sem); flush_cache_dup_mm(oldmm); /* @@ -368,6 +369,7 @@ out: up_write(&mm->mmap_sem); flush_tlb_mm(oldmm); up_write(&oldmm->mmap_sem); + up_write(&oldmm->mm_pinned_sem); return retval; fail_nomem_policy: kmem_cache_free(vm_area_cachep, tmp); @@ -431,6 +433,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p) mm->free_area_cache = TASK_UNMAPPED_BASE; mm->cached_hole_size = ~0UL; mm_init_owner(mm, p); + init_rwsem(&mm->mm_pinned_sem); if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; diff --git a/mm/internal.h b/mm/internal.h index 478223b..04f25d2 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -272,14 +272,4 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn, { } #endif /* CONFIG_SPARSEMEM */ - -#define GUP_FLAGS_WRITE 0x1 -#define GUP_FLAGS_FORCE 0x2 -#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4 -#define GUP_FLAGS_IGNORE_SIGKILL 0x8 - -int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, - unsigned long start, int len, int flags, - struct page **pages, struct vm_area_struct **vmas); - #endif diff --git a/mm/memory.c b/mm/memory.c index baa999e..b00e3e9 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1211,6 +1211,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, int force = !!(flags & GUP_FLAGS_FORCE); int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS); int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL); + int decow = 0; if (len <= 0) return 0; @@ -1279,6 +1280,20 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, continue; } + /* + * Except in special cases where the caller will not read from or + * write to these pages, we must break COW for any pages + * returned from get_user_pages, so that our caller does not + * subsequently end up with the pages of a parent or child + * process after a COW takes place. + */ + if (flags & GUP_FLAGS_PINNING_PAGE) { + if (!pages) + return -EINVAL; + if (is_cow_mapping(vma->vm_flags)) + decow = 1; + } + foll_flags = FOLL_TOUCH; if (pages) foll_flags |= FOLL_GET; @@ -1299,7 +1314,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, fatal_signal_pending(current))) return i ? i : -ERESTARTSYS; - if (write) + if (write || decow) foll_flags |= FOLL_WRITE; cond_resched(); diff --git a/mm/util.c b/mm/util.c index 37eaccd..a80d5d3 100644 --- a/mm/util.c +++ b/mm/util.c @@ -197,10 +197,14 @@ int __attribute__((weak)) get_user_pages_fast(unsigned long start, { struct mm_struct *mm = current->mm; int ret; + int gup_flags = GUP_FLAGS_PINNING_PAGE; + + if (write) + gup_flags |= GUP_FLAGS_WRITE; down_read(&mm->mmap_sem); - ret = get_user_pages(current, mm, start, nr_pages, - write, 0, pages, NULL); + ret = __get_user_pages(current, mm, start, nr_pages, + gup_flags, pages, NULL); up_read(&mm->mmap_sem); return ret; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 83+ messages in thread
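To make the failure mode concrete, here is a minimal userspace sketch of the race these test-cases exercise. This is not Andrea's actual dma_thread.c: the file name is made up, the file must exist, be at least 4096 bytes, and live on a filesystem that supports O_DIRECT, and in practice the race window is tiny, so the real tests loop many times.

/* fork vs gup race sketch: an O_DIRECT read DMAs into an anonymous
 * page while the parent fork()s and then stores to the same page. If
 * the parent's store breaks the COW first, the DMA completes into the
 * page the parent no longer maps, and the read data never shows up. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define BUFSZ 4096

static char *buf;
static int fd;

static void *dio_read(void *arg)
{
	/* get_user_pages() pins buf's physical page, then the block
	 * layer DMAs the file contents into that physical page */
	if (pread(fd, buf, BUFSZ, 0) != BUFSZ)
		perror("pread");
	return NULL;
}

int main(void)
{
	pthread_t t;
	pid_t pid;

	fd = open("/tmp/dio-test", O_RDONLY | O_DIRECT);
	if (fd < 0 || posix_memalign((void **)&buf, BUFSZ, BUFSZ))
		return 1;
	memset(buf, 0, BUFSZ);		/* fault the anon page in */

	pthread_create(&t, NULL, dio_read, NULL);
	pid = fork();			/* write-protects the pinned page */
	if (pid == 0)
		_exit(0);		/* the child does nothing at all */
	buf[0] = 1;			/* parent store: may COW away from the DMA */
	pthread_join(t, NULL);
	waitpid(pid, NULL, 0);

	/* if the COW won the race, buf[] still holds zeroes here even
	 * though pread() reported success */
	printf("buf[512] = %d\n", buf[512]);
	return 0;
}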
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-22 12:23 ` KOSAKI Motohiro @ 2009-03-23 0:13 ` KOSAKI Motohiro 2009-03-23 16:29 ` Ingo Molnar 2009-03-24 13:43 ` Nick Piggin 1 sibling, 1 reply; 83+ messages in thread From: KOSAKI Motohiro @ 2009-03-23 0:13 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Nick Piggin, Linus Torvalds, Andrea Arcangeli, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm > Hi > > The following patch is my v2 approach. > It survives Andrea's three dio test-cases. > > Linus suggested changing the add_to_swap() and shrink_page_list() code > to avoid a false COW in do_wp_page() when a page becomes swapcache. > > I think it's a good idea, but it's a bit radical, so I think it is something > to tackle in the development tree. > > So I decided to use Nick's early decow in > get_user_pages(), and RO mapped pages don't use gup_fast. > > Yeah, my approach is an extremely brutal big hammer, but I think > it doesn't have a performance issue in the real world. > > Why? > > Practically, we can assume the following two things. > > (1) the buffer passed as the write(2) syscall argument is an RW mapped > page or a COWed RO page. > > If anybody writes the following code, my patch causes a performance regression. > > buf = mmap() > memset(buf, 0x11, len); > mprotect(buf, len, PROT_READ) > fd = open(O_DIRECT) > write(fd, buf, len) > > But it's very artificial code; nobody wants this. > OK, we can ignore it. > > (2) a DirectIO user process isn't a short-lived process. > > Early decow only decreases short-lived process performance, > because a long-lived process would do the decowing anyway before exec(2). > > And all DB applications are definitely long-lived processes, > so early decow doesn't cause a regression for them. Frankly, Linus suggested inserting one branch into do_wp_page(), but I removed one branch from gup_fast. I think that's a good performance trade-off; but if anybody hates my approach, I'll drop my chicken heart and try the way Linus suggested. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-23 0:13 ` KOSAKI Motohiro @ 2009-03-23 16:29 ` Ingo Molnar 2009-03-23 16:46 ` Linus Torvalds 0 siblings, 1 reply; 83+ messages in thread From: Ingo Molnar @ 2009-03-23 16:29 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Nick Piggin, Linus Torvalds, Andrea Arcangeli, Benjamin Herrenschmidt, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote: > > The following patch is my v2 approach. > > It survives Andrea's three dio test-cases. > > > > [...] > Frankly, Linus suggested inserting one branch into do_wp_page(), > but I removed one branch from gup_fast. > > I think that's a good performance trade-off; but if anybody hates my > approach, I'll drop my chicken heart and try the way Linus > suggested. We started out with a difficult corner case problem (for an arguably botched syscall promise we made to user-space many moons ago), and an invasive and unmaintainable looking patch: 8 files changed, 342 insertions(+), 77 deletions(-) And your v2 is now: 9 files changed, 66 insertions(+), 21 deletions(-) ... and it is also speeding up fast-gup. Which is a marked improvement IMO. Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-23 16:29 ` Ingo Molnar @ 2009-03-23 16:46 ` Linus Torvalds 2009-03-24 5:08 ` KOSAKI Motohiro 0 siblings, 1 reply; 83+ messages in thread From: Linus Torvalds @ 2009-03-23 16:46 UTC (permalink / raw) To: Ingo Molnar Cc: KOSAKI Motohiro, Nick Piggin, Andrea Arcangeli, Benjamin Herrenschmidt, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Mon, 23 Mar 2009, Ingo Molnar wrote: > > And your v2 is now: > > 9 files changed, 66 insertions(+), 21 deletions(-) > > ... and it is also speeding up fast-gup. Which is a marked > improvement IMO. Yeah, I have no problems with that patch. I'd just suggest a final simplification, and getting rid of the mask = _PAGE_PRESENT|_PAGE_USER; /* Maybe the read only pte is cow mapped page. (or not maybe) So, falling back to get_user_pages() is better */ mask |= _PAGE_RW; and just doing something like /* * fast-GUP only handles the simple cases where we have * full access to the page (ie private pages are copied * etc). */ #define GUP_MASK (_PAGE_PRESENT|_PAGE_USER|_PAGE_RW) and leaving it at that. Of course, maybe somebody does O_DIRECT writes on a fork'ed image in order to create a snapshot image or something, and now the v2 thing breaks COW on all the pages in order to be safe and performance sucks. But I can't really say that _I_ could possibly care. I really seriously think that O_DIRECT and its ilk were braindamaged to begin with. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
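Concretely, against the v2 patch above this suggestion collapses the three copies of the mask setup in arch/x86/mm/gup.c into one define. A sketch (untested; not an actually posted diff):

/*
 * fast-GUP only handles the simple cases where we have
 * full access to the page (ie private pages are copied
 * etc).
 */
#define GUP_MASK	(_PAGE_PRESENT|_PAGE_USER|_PAGE_RW)

/* then gup_pte_range() (and similarly gup_huge_pmd()/gup_huge_pud())
 * simply bounces everything else to the slow path: */
if ((pte_flags(pte) & (GUP_MASK | _PAGE_SPECIAL)) != GUP_MASK) {
	pte_unmap(ptep);
	return 0;	/* caller falls back to get_user_pages() */
}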
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-23 16:46 ` Linus Torvalds @ 2009-03-24 5:08 ` KOSAKI Motohiro 0 siblings, 0 replies; 83+ messages in thread From: KOSAKI Motohiro @ 2009-03-24 5:08 UTC (permalink / raw) To: Linus Torvalds Cc: kosaki.motohiro, Ingo Molnar, Nick Piggin, Andrea Arcangeli, Benjamin Herrenschmidt, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm Hi > > And your v2 is now: > > > > 9 files changed, 66 insertions(+), 21 deletions(-) > > > > ... and it is also speeding up fast-gup. Which is a marked > > improvement IMO. > > Yeah, I have no problems with that patch. I'd just suggest a final > simplification, and getting rid of the > > mask = _PAGE_PRESENT|_PAGE_USER; > /* Maybe the read only pte is cow mapped page. (or not maybe) > So, falling back to get_user_pages() is better */ > mask |= _PAGE_RW; > > and just doing something like > > /* > * fast-GUP only handles the simple cases where we have > * full access to the page (ie private pages are copied > * etc). > */ > #define GUP_MASK (_PAGE_PRESENT|_PAGE_USER|_PAGE_RW) OK! I'll do that. Thanks for the good review! > and leaving it at that. > > Of course, maybe somebody does O_DIRECT writes on a fork'ed image in order > to create a snapshot image or something, and now the v2 thing breaks COW > on all the pages in order to be safe and performance sucks. > > But I can't really say that _I_ could possibly care. I really seriously > think that O_DIRECT and its ilk were braindamaged to begin with. Yes, I totally agree ;) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-22 12:23 ` KOSAKI Motohiro 2009-03-23 0:13 ` KOSAKI Motohiro @ 2009-03-24 13:43 ` Nick Piggin 2009-03-24 17:56 ` Linus Torvalds 2009-03-30 10:52 ` KOSAKI Motohiro 1 sibling, 2 replies; 83+ messages in thread From: Nick Piggin @ 2009-03-24 13:43 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Linus Torvalds, Andrea Arcangeli, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Sunday 22 March 2009 23:23:56 KOSAKI Motohiro wrote: > Hi > > The following patch is my v2 approach. > It survives Andrea's three dio test-cases. > > Linus suggested changing the add_to_swap() and shrink_page_list() code > to avoid a false COW in do_wp_page() when a page becomes swapcache. > > I think it's a good idea, but it's a bit radical, so I think it is something > to tackle in the development tree. > > So I decided to use Nick's early decow in > get_user_pages(), and RO mapped pages don't use gup_fast. You probably should be testing for PageAnon pages in gup_fast. Also, using a bit in page->flags you could potentially get anonymous, readonly mappings working again (I thought I had them working in my patch, but on second thoughts perhaps I had a bug in tagging them, I'll try to fix that). > Yeah, my approach is an extremely brutal big hammer, but I think > it doesn't have a performance issue in the real world. > > Why? > > Practically, we can assume the following two things. > > (1) the buffer passed as the write(2) syscall argument is an RW mapped > page or a COWed RO page. > > If anybody writes the following code, my patch causes a performance regression. > > buf = mmap() > memset(buf, 0x11, len); > mprotect(buf, len, PROT_READ) > fd = open(O_DIRECT) > write(fd, buf, len) > > But it's very artificial code; nobody wants this. > OK, we can ignore it. The more interesting uses of gup (and perhaps somewhat improved or enabled with fast-gup) I think are things like vmsplice, and syslets/threadlets/aio kind of things. And I don't exactly know what the users are going to look like. > (2) a DirectIO user process isn't a short-lived process. > > Early decow only decreases short-lived process performance, > because a long-lived process would do the decowing anyway before exec(2). > > And all DB applications are definitely long-lived processes, > so early decow doesn't cause a regression for them. Right, most databases won't care *at all* because they won't do any decowing. But if there are cases that do care, then we can perhaps take the policy of having them use MADV_DONTFORK or somesuch. > TODO > - implement down_write_killable(). > (But it isn't important, because this is a rare-case issue.) > - implement the non-x86 portion. > > > Am I missing anything? I still don't understand why this way is so much better than my last proposal. I just wanted to let that simmer down for a few days :) But I'm honestly really just interested in a good discussion and I don't mind being sworn at if I'm being stupid, but I really want to hear opinions of why I'm wrong too. Yes my patch has downsides I'm quite happy to admit. But I just don't see that copy-on-fork rather than wrprotect-on-fork is the showstopper. To me it seemed nice because it is practically just reusing code straight from do_wp_page, and pretty well isolated out of the fastpath. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
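A sketch of the narrower test Nick is suggesting (hypothetical; the point is that only anonymous pages can be COW-shared, so read-only file-backed ptes would no longer have to bounce to the slow path):

/* in gup_pte_range(), once the pte is known present and valid: */
page = pte_page(pte);
if (!pte_write(pte) && PageAnon(page)) {
	/* read-only anon page: it may be COW-shared with another
	 * mm, so let the slow get_user_pages() path sort it out */
	pte_unmap(ptep);
	return 0;
}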
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-24 13:43 ` Nick Piggin @ 2009-03-24 17:56 ` Linus Torvalds 2009-03-30 10:52 ` KOSAKI Motohiro 1 sibling, 0 replies; 83+ messages in thread From: Linus Torvalds @ 2009-03-24 17:56 UTC (permalink / raw) To: Nick Piggin Cc: KOSAKI Motohiro, Andrea Arcangeli, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Wed, 25 Mar 2009, Nick Piggin wrote: > > I still don't understand why this way is so much better than > my last proposal. Take a look at the diffstat. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-24 13:43 ` Nick Piggin 2009-03-24 17:56 ` Linus Torvalds @ 2009-03-30 10:52 ` KOSAKI Motohiro [not found] ` <200904022307.12043.nickpiggin@yahoo.com.au> 1 sibling, 1 reply; 83+ messages in thread From: KOSAKI Motohiro @ 2009-03-30 10:52 UTC (permalink / raw) To: Nick Piggin Cc: kosaki.motohiro, Linus Torvalds, Andrea Arcangeli, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm Hi Nick, > > Am I missing anything? > > I still don't understand why this way is so much better than > my last proposal. I just wanted to let that simmer down for a > few days :) But I'm honestly really just interested in a good > discussion and I don't mind being sworn at if I'm being stupid, > but I really want to hear opinions of why I'm wrong too. > > Yes my patch has downsides I'm quite happy to admit. But I just > don't see that copy-on-fork rather than wrprotect-on-fork is > the showstopper. To me it seemed nice because it is practically > just reusing code straight from do_wp_page, and pretty well > isolated out of the fastpath. Firstly, I'm very sorry for the very long delay in responding. This month I'm very busy and don't have enough development time ;) Secondly, I have a strong obsession with bugfixing (I guess you already know that), but I don't have an obsession with any particular bugfix _way_. My patch was made to create good discussion, not to NAK your patch. I think your patch is good, but it has a few disadvantages. (Yeah, I agree mine has a lot of disadvantages.) 1. using page->flags Nowadays, page->flags is some of the most prime real estate in Linux. As far as possible, we should avoid using it. 2. it doesn't have a GUP_FLAGS_PINNING_PAGE flag Then access_process_vm() can decow a page unnecessarily. That isn't a good feature, I think. IOW, I don't think being "caller transparent" is important. Minimal side effects are more important. By side effects I mean effects outside of direct-io. I don't mind side effects on the direct-io path; it is only used by DBs or similar software, so we can make assumptions about the userland usage. And I was playing with your patch last week, but I concluded I can't shrink it any further. As far as I understand, Linus doesn't refuse copy-on-fork itself; he only refuses a messy bugfix patch. In general, a bugfix patch should be backportable to the stable tree. So I think step-by-step development is better: 1. at first, merge wrprotect-on-fork. 2. improve speed. What do you think? BTW, Linus gave me good inspiration: if page pinning has happened, the page is guaranteed to be grabbed by only one process. Then we can put a pinning count and some additional information into the anon_vma. That can avoid using page->flags even if we implement copy-on-fork. Maybe. HOWEVER, if you really hate my approach, please don't hesitate to say so. I don't want to submit a patch you dislike. I respect Linus, but I respect you too. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
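Purely illustrative, the anon_vma idea could look something like the fragment below; no gup_pins field exists in the real anon_vma, and the locking for the increment is exactly the open question:

struct anon_vma {
	spinlock_t lock;		/* existing fields */
	struct list_head head;
	atomic_t gup_pins;		/* hypothetical: gup-pinned pages */
};

/* gup side, once an anonymous page has been pinned (for PageAnon
 * pages, page->mapping carries the anon_vma pointer plus a flag bit): */
anon_vma = (struct anon_vma *)
	((unsigned long)page->mapping - PAGE_MAPPING_ANON);
atomic_inc(&anon_vma->gup_pins);

/* dup_mmap()/do_wp_page() could then test atomic_read(&anon_vma->gup_pins)
 * instead of consuming a scarce page->flags bit. */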
[parent not found: <200904022307.12043.nickpiggin@yahoo.com.au>]
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] [not found] ` <200904022307.12043.nickpiggin@yahoo.com.au> @ 2009-04-03 3:49 ` Nick Piggin 0 siblings, 0 replies; 83+ messages in thread From: Nick Piggin @ 2009-04-03 3:49 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Linus Torvalds, Andrea Arcangeli, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm [sorry, resending because my mail client started sending HTML and this didn't get through spam filters] On Thursday 02 April 2009 23:07:11 Nick Piggin wrote: Hi! On Monday 30 March 2009 21:52:44 KOSAKI Motohiro wrote: > > Hi Nick, > > > > > Am I missing anything? > > > > I still don't understand why this way is so much better than > > my last proposal. I just wanted to let that simmer down for a > > few days :) But I'm honestly really just interested in a good > > discussion and I don't mind being sworn at if I'm being stupid, > > but I really want to hear opinions of why I'm wrong too. > > > > Yes my patch has downsides I'm quite happy to admit. But I just > > don't see that copy-on-fork rather than wrprotect-on-fork is > > the showstopper. To me it seemed nice because it is practically > > just reusing code straight from do_wp_page, and pretty well > > isolated out of the fastpath. > > Firstly, I'm very sorry for the very long delay in responding. This month > I'm very busy and don't have enough development time ;) No problem. > Secondly, I have a strong obsession with bugfixing (I guess you already know that), > but I don't have an obsession with any particular bugfix _way_. My patch was made > to create good discussion, not to NAK your patch. Definitely. > I think your patch is good, but it has a few disadvantages. > (Yeah, I agree mine has a lot of disadvantages.) > > 1. using page->flags > Nowadays, page->flags is some of the most prime real estate in Linux. > As far as possible, we should avoid using it. Well... I'm not sure if it is that bad. It uses an anonymous page flag, and those are not so congested as pagecache page flags. I can't think of anything preventing anonymous pages from using PG_owner_priv_1, PG_private, or PG_mappedtodisk, so a "final" solution that uses a page flag would use one of those I guess. > 2. it doesn't have a GUP_FLAGS_PINNING_PAGE flag > Then access_process_vm() can decow a page unnecessarily. > That isn't a good feature, I think. access_process_vm I think can just avoid COWing because it holds mmap_sem for the duration of the operation. I just didn't fix that because I didn't really think of it. > IOW, I don't think being "caller transparent" is important. Well I don't know about that. I don't know that O_DIRECT is particularly more important to fix the problem than vmsplice, or any of the numerous other zero-copy methods open coded in drivers. > Minimal side effects are more important. By side effects I mean effects > outside of direct-io. I don't mind side effects on the direct-io path; it is only used > by DBs or similar software, so we can make assumptions about the userland usage. I agree my patch should not be de-cowing for access_process_vm for read. I think that can be fixed. But I disagree that O_DIRECT is unimportant. I think the big database users don't like more cost in this path, and they obviously have the capacity to use it carefully so I'm sure they would prefer not to add anything. Intel definitely counts cycles in the O_DIRECT path. > And I was playing with your patch last week, but I concluded I can't shrink > it any further. > As far as I understand, Linus doesn't refuse copy-on-fork itself; 
he only refuses a messy bugfix patch. > In general, a bugfix patch should be backportable to the stable tree. I think assessing this type of patch based on diffstat is a bit ridiculous ;) But I think it can be shrunk a bit if it shares a bit of code with do_wp_page. > So I think step-by-step development is better: > > 1. at first, merge wrprotect-on-fork. > 2. improve speed. > > What do you think? > > > BTW, > Linus gave me good inspiration: if page pinning has happened, the page > is guaranteed to be grabbed by only one process. > Then we can put a pinning count and some additional information > into the anon_vma. That can avoid using page->flags even if we implement > copy-on-fork. Maybe. Hmm, I might try playing with that in my patch. Not so much because the extra flag is important (as I explain above), but keeping a count will -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 16:01 ` KOSAKI Motohiro 2009-03-16 16:23 ` Nick Piggin @ 2009-03-17 0:44 ` Linus Torvalds 2009-03-17 0:56 ` KAMEZAWA Hiroyuki 2009-03-17 12:19 ` Andrea Arcangeli 1 sibling, 2 replies; 83+ messages in thread From: Linus Torvalds @ 2009-03-17 0:44 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Nick Piggin, Benjamin Herrenschmidt, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tue, 17 Mar 2009, KOSAKI Motohiro wrote: > > if we only need to be concerned with O_DIRECT, the patch below is enough. .. together with something like this, to handle the other direction. This should take care of the case of an O_DIRECT write() call using a page that was duplicated by an _earlier_ fork(), and then got split up by a COW in the wrong direction (ie having data from the child show up in the write). Untested. But fairly trivial, after all. We simply do the same old "reuse_swap_page()" count, but we only break the COW if the page count afterwards is 1 (reuse_swap_page will have removed it from the swap cache if it returns success). Does this (together with Kosaki's patch) pass the tests that Andrea had? Linus --- mm/memory.c | 8 +++++++- 1 files changed, 7 insertions(+), 1 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index baa999e..2bd5fb0 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1928,7 +1928,13 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, } page_cache_release(old_page); } - reuse = reuse_swap_page(old_page); + /* + * If we can re-use the swap page _and_ the end + * result has only one user (the mapping), then + * we reuse the whole page + */ + if (reuse_swap_page(old_page)) + reuse = page_count(old_page) == 1; unlock_page(old_page); } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED))) { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-17 0:44 ` Linus Torvalds @ 2009-03-17 0:56 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 83+ messages in thread From: KAMEZAWA Hiroyuki @ 2009-03-17 0:56 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt, Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, linux-mm On Mon, 16 Mar 2009 17:44:25 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Tue, 17 Mar 2009, KOSAKI Motohiro wrote: > > > > if we only need to be concerned with O_DIRECT, the patch below is enough. > > .. together with something like this, to handle the other direction. This > should take care of the case of an O_DIRECT write() call using a page that > was duplicated by an _earlier_ fork(), and then got split up by a COW in > the wrong direction (ie having data from the child show up in the write). > > Untested. But fairly trivial, after all. We simply do the same old > "reuse_swap_page()" count, but we only break the COW if the page count > afterwards is 1 (reuse_swap_page will have removed it from the swap cache > if it returns success). > > Does this (together with Kosaki's patch) pass the tests that Andrea had? > I'm not sure, but I have doubts about the AIO case. + down_read(&current->mm->directio_sem); retval = direct_io_worker(rw, iocb, inode, iov, offset, nr_segs, blkbits, get_block, end_io, dio); + up_read(&current->mm->directio_sem); With AIO, the range covered by this semaphore doesn't seem to be enough. Thanks, -Kame > Linus > > --- > mm/memory.c | 8 +++++++- > 1 files changed, 7 insertions(+), 1 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index baa999e..2bd5fb0 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1928,7 +1928,13 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, > } > page_cache_release(old_page); > } > - reuse = reuse_swap_page(old_page); > + /* > + * If we can re-use the swap page _and_ the end > + * result has only one user (the mapping), then > + * we reuse the whole page > + */ > + if (reuse_swap_page(old_page)) > + reuse = page_count(old_page) == 1; > unlock_page(old_page); > } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == > (VM_WRITE|VM_SHARED))) { > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-17 0:44 ` Linus Torvalds 2009-03-17 0:56 ` KAMEZAWA Hiroyuki @ 2009-03-17 12:19 ` Andrea Arcangeli 2009-03-17 16:43 ` Linus Torvalds 1 sibling, 1 reply; 83+ messages in thread From: Andrea Arcangeli @ 2009-03-17 12:19 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Mon, Mar 16, 2009 at 05:44:25PM -0700, Linus Torvalds wrote: > - reuse = reuse_swap_page(old_page); > + /* > + * If we can re-use the swap page _and_ the end > + * result has only one user (the mapping), then > + * we reuse the whole page > + */ > + if (reuse_swap_page(old_page)) > + reuse = page_count(old_page) == 1; > unlock_page(old_page); Think of the case where the anon page is added to the swapcache and the pte is unmapped by the VM and set non-present after GUP has taken the page for an O_DIRECT read (a write to memory). If a thread writes to the page while the O_DIRECT read is running in another thread (or via aio), then do_wp_page will make a copy of the swapcache page under the O_DIRECT read, and part of the read operation will get lost. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
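Spelled out as a timeline, the interleaving Andrea describes looks like this (sketch):

/*
 * O_DIRECT thread                 VM / second thread
 * ---------------                 ------------------
 * gup(write=1) pins page P
 * DMA into P starts
 *                                 add_to_swap(P) + try_to_unmap(P):
 *                                 pte cleared, P stays in swapcache
 *                                 second thread reads: do_swap_page()
 *                                 maps P read-only from the swapcache
 *                                 second thread writes: do_wp_page()
 *                                 sees extra references, copies P -> P'
 * DMA completes into P            the process now maps P', so the
 *                                 data read into P is lost
 */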
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-17 12:19 ` Andrea Arcangeli @ 2009-03-17 16:43 ` Linus Torvalds 2009-03-17 17:01 ` Linus Torvalds 0 siblings, 1 reply; 83+ messages in thread From: Linus Torvalds @ 2009-03-17 16:43 UTC (permalink / raw) To: Andrea Arcangeli Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tue, 17 Mar 2009, Andrea Arcangeli wrote: > > Think of the case where the anon page is added to the swapcache and the pte is unmapped > by the VM and set non-present after GUP has taken the page for an O_DIRECT > read (a write to memory). If a thread writes to the page while the > O_DIRECT read is running in another thread (or via aio), then do_wp_page > will make a copy of the swapcache page under the O_DIRECT read, and part of the > read operation will get lost. In that case, you aren't getting to the "do_wp_page()" case at all, you're getting the "do_swap_page()" case. Which does its own reuse_swap_page() thing (and that one I didn't touch - on purpose). But you're right - it only does that for writes. If we _first_ do a read (to swap it back in), it will mark it read-only and _then_ we can get a "do_wp_page()" that splits it. So yes - I had expected our VM to be sane, and have a writable private page _stay_ writable (in the absence of fork() it should never turn into a COW page), but the swapout+swapin code can result in a rw page that turns read-only in order to catch a swap cache invalidation. Good catch. Let me think about it. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-17 16:43 ` Linus Torvalds @ 2009-03-17 17:01 ` Linus Torvalds 2009-03-17 17:10 ` Andrea Arcangeli 0 siblings, 1 reply; 83+ messages in thread From: Linus Torvalds @ 2009-03-17 17:01 UTC (permalink / raw) To: Andrea Arcangeli Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tue, 17 Mar 2009, Linus Torvalds wrote: > > So yes - I had expected our VM to be sane, and have a writable private > page _stay_ writable (in the absence of fork() it should never turn into a > COW page), but the swapout+swapin code can result in a rw page that turns > read-only in order to catch a swap cache invalidation. > > Good catch. Let me think about it. Btw, I think this is actually a pre-existing bug regardless of my patch. That same swapout+swapin problem seems to lose the dirty bit on a O_DIRECT write - exactly for the same reason. When swapin turns the page into a read-only page in order to keep the physical page in the swap cache, the write to the physical page (that was gotten by get_user_pages() earlier) will bypass all that. So the get_user_pages() users will then write to the page, but the next time we swap things out, if nobody _else_ wrote to it, that write will be lost because we'll just drop the page (it was in the swap cache!) even though it had changed data on it. My patch changed the scenario a bit (split page rather than dropped page), but the fundamental cause seems to be the same - the swap cache code very much depends on writes to the _virtual_ address. Or am I missing something? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-17 17:01 ` Linus Torvalds @ 2009-03-17 17:10 ` Andrea Arcangeli 2009-03-17 17:43 ` Linus Torvalds 0 siblings, 1 reply; 83+ messages in thread From: Andrea Arcangeli @ 2009-03-17 17:10 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tue, Mar 17, 2009 at 10:01:06AM -0700, Linus Torvalds wrote: > That same swapout+swapin problem seems to lose the dirty bit on a O_DIRECT I think the dirty bit is set in dio_bio_complete (or bio_check_pages_dirty for the aio case) so forcing the swapcache to be written out again before the page can be freed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
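For reference, the completion-side pattern Andrea means, roughly as dio_bio_complete() does it in kernels of this era (paraphrased from memory, not an exact copy of the 2.6.29 code):

/* on I/O completion: re-dirty and unpin every page, so a swapcache
 * page that was DMA'd into must be written out again before the VM
 * can free it */
for (page_no = 0; page_no < bio->bi_vcnt; page_no++) {
	struct page *page = bvec[page_no].bv_page;

	if (dio->rw == READ && !PageCompound(page))
		set_page_dirty_lock(page);	/* the device wrote here */
	page_cache_release(page);		/* drop the gup reference */
}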
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-17 17:10 ` Andrea Arcangeli @ 2009-03-17 17:43 ` Linus Torvalds 2009-03-17 18:09 ` Linus Torvalds 0 siblings, 1 reply; 83+ messages in thread From: Linus Torvalds @ 2009-03-17 17:43 UTC (permalink / raw) To: Andrea Arcangeli Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tue, 17 Mar 2009, Andrea Arcangeli wrote: > On Tue, Mar 17, 2009 at 10:01:06AM -0700, Linus Torvalds wrote: > > That same swapout+swapin problem seems to lose the dirty bit on a O_DIRECT > > I think the dirty bit is set in dio_bio_complete (or > bio_check_pages_dirty for the aio case) so forcing the swapcache to be > written out again before the page can be freed. Do all the other get_user_pages() users do that, though? [ Looks around - at least access_process_vm(), IB and the NFS direct code do. So we seem to be mostly ok, at least for the main users ] Ok, no worries. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-17 17:43 ` Linus Torvalds @ 2009-03-17 18:09 ` Linus Torvalds 2009-03-17 18:19 ` Linus Torvalds 0 siblings, 1 reply; 83+ messages in thread From: Linus Torvalds @ 2009-03-17 18:09 UTC (permalink / raw) To: Andrea Arcangeli Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tue, 17 Mar 2009, Linus Torvalds wrote: > > Do all the other get_user_pages() users do that, though? > > [ Looks around - at least access_process_vm(), IB and the NFS direct code > do. So we seem to be mostly ok, at least for the main users ] > > Ok, no worries. This problem is actually pretty easy to fix for anonymous pages: since the act of pinning (for writes) should have done all the COW stuff and made sure the page is not in the swap cache, we only need to avoid adding it back. IOW, something like the following makes sense on all levels regardless (note: I didn't check if there is some off-by-one issue where we've raised the page count for other reasons when scanning it, so this is not meant to be a serious patch, just a "something along these lines" thing). This does not obviate the need to mark pages dirty afterwards, though, since true shared mappings always cause that (and we cannot keep them dirty, since somebody may be doing fsync() on them or something like that). But since the COW issue is only a matter of private pages, this handles that trivially. Linus --- mm/swap_state.c | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/mm/swap_state.c b/mm/swap_state.c index 3ecea98..83137fe 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -140,6 +140,10 @@ int add_to_swap(struct page *page) VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(!PageUptodate(page)); + /* Refuse to add pinned pages to the swap cache */ + if (page_count(page) > page_mapped(page)) + return 0; + for (;;) { entry = get_swap_page(); if (!entry.val) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-17 18:09 ` Linus Torvalds @ 2009-03-17 18:19 ` Linus Torvalds 2009-03-17 18:46 ` Andrea Arcangeli 0 siblings, 1 reply; 83+ messages in thread From: Linus Torvalds @ 2009-03-17 18:19 UTC (permalink / raw) To: Andrea Arcangeli Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tue, 17 Mar 2009, Linus Torvalds wrote: > > This problem is actually pretty easy to fix for anonymous pages: since the > act of pinning (for writes) should have done all the COW stuff and made > sure the page is not in the swap cache, we only need to avoid adding it > back. An alternative approach would have been to just count page pinning as being a "referenced", which to some degree would be even more logical (we don't set the referenced flag when we look those pages up). That would also affect pages that were get_user_page'd just for reading, which might be seen as an additional bonus. The "don't turn pinned pages into swap cache pages" is a somewhat more direct patch, though. It gives more obvious guarantees about the lifetime behaviour of anon pages wrt get_user_pages[_fast]().. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
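A sketch of what that alternative might look like (hypothetical placement in shrink_page_list(); as with the add_to_swap() variant above, the exact refcount bookkeeping, e.g. the reference shrink_page_list() itself holds from LRU isolation, is glossed over):

/* before trying to swap out an anon page: a raw reference beyond the
 * mappings (and our own isolation ref) means someone gup-pinned it */
if (PageAnon(page) &&
    page_count(page) > page_mapcount(page) + 1)
	goto activate_locked;	/* treat as referenced, keep it */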
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-17 18:19 ` Linus Torvalds @ 2009-03-17 18:46 ` Andrea Arcangeli 2009-03-17 19:03 ` Linus Torvalds 0 siblings, 1 reply; 83+ messages in thread From: Andrea Arcangeli @ 2009-03-17 18:46 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tue, Mar 17, 2009 at 11:19:59AM -0700, Linus Torvalds wrote: > > > On Tue, 17 Mar 2009, Linus Torvalds wrote: > > > > This problem is actually pretty easy to fix for anonymous pages: since the > > act of pinning (for writes) should have done all the COW stuff and made > > sure the page is not in the swap cache, we only need to avoid adding it > > back. > > An alternative approach would have been to just count page pinning as > being a "referenced", which to some degree would be even more logical (we > don't set the referenced flag when we look those pages up). That would > also affect pages that were get_user_page'd just for reading, which might > be seen as an additional bonus. > > The "don't turn pinned pages into swap cache pages" is a somewhat more > direct patch, though. It gives more obvious guarantees about the lifetime > behaviour of anon pages wrt get_user_pages[_fast]().. I don't think you can tackle this from add_to_swap, because the page may be in the swapcache well before gup runs (gup(write=1) can map the swapcache as exclusive and read-write in the pte). So then what happens is again that the VM unmaps the page, do_swap_page maps it as readonly swapcache (so far so good), and do_wp_page copies the page under the O_DIRECT read again. The off-by-one is almost certain, as it's invoked by the VM, but agreed, that's an implementation detail not relevant for this discussion; and I guess you also meant page_mapcount instead of page_mapped, or I think shared pages would stop being swapped out. That is more relevant because of some worry I have about the comparison between page count and mapcount, see below. My preference is still to keep pages with an elevated refcount pinned in the ptes like 2.6.7 did; that will allow do_wp_page to take over only pages whose page_count is not elevated, without risk of calling do_wp_page on any page under gup. My only worry now is how to compare count with mapcount when both can change under us if mapcount > 1; but if you meant page_mapcount in add_to_swap, as I think, that logic in add_to_swap would have the same problem, so it needs a solution for doing a coherent/safe comparison too. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-17 18:46 ` Andrea Arcangeli @ 2009-03-17 19:03 ` Linus Torvalds 2009-03-17 19:35 ` Andrea Arcangeli 0 siblings, 1 reply; 83+ messages in thread From: Linus Torvalds @ 2009-03-17 19:03 UTC (permalink / raw) To: Andrea Arcangeli Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tue, 17 Mar 2009, Andrea Arcangeli wrote: > > I don't think you can tackle this from add_to_swap, because the page > may be in the swapcache well before gup runs (gup(write=1) can map the > swapcache as exclusive and read-write in the pte). If it's in the swap cache, it should be mapped read-only, and gup(write=1) will do the COW break and un-swapcache it. When can it be writably in the swap cache? The write-protect thing is the one we use to invalidate stale swap cache entries, and when we mark those pages writable (in do_wp_page or do_swap_page) we always remove the page from the swap cache at the same time. Or is there some other path I missed? > My preference is still to keep pages with an elevated refcount pinned in > the ptes like 2.6.7 did; that will allow do_wp_page to take over only > pages whose page_count is not elevated, without risk of calling do_wp_page > on any page under gup. I agree that that would also work - and be even simpler. If done right, we can even avoid clearing the dirty bit (in page_mkclean()) for such pages, and now it works for _all_ pages, not just anonymous pages. IOW, even if you had a shared mapping and were to GUP() those pages for writing, they'd _stay_ dirty until you free'd them - no need to re-dirty them in case somebody did IO on them. > My only worry now is how to compare count > with mapcount when both can change under us if mapcount > 1; but if > you meant page_mapcount in add_to_swap, as I think, that logic in > add_to_swap would have the same problem, so it needs a solution for > doing a coherent/safe comparison too. I don't think you can use just mapcount on its own - you have to compare it to page_count(). Otherwise perfectly normal (non-gup) pages will trigger, since that page count is the only thing that differs between the two cases. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-17 19:03 ` Linus Torvalds @ 2009-03-17 19:35 ` Andrea Arcangeli 2009-03-17 19:55 ` Linus Torvalds 0 siblings, 1 reply; 83+ messages in thread From: Andrea Arcangeli @ 2009-03-17 19:35 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tue, Mar 17, 2009 at 12:03:55PM -0700, Linus Torvalds wrote: > If it's in the swap cache, it should be mapped read-only, and gup(write=1) > will do the COW break and un-swapcache it. It may turn it read-write instead of COW break and un-swapcache. if (write_access && reuse_swap_page(page)) { pte = maybe_mkwrite(pte_mkdirty(pte), vma); This is done to avoid fragmenting the swap device. > I agree that that would also work - and be even simpler. If done right, we > can even avoid clearing the dirty bit (in page_mkclean()) for such pages, > and now it works for _all_ pages, not just anonymous pages. > > IOW, even if you had a shared mapping and were to GUP() those pages for > writing, they'd _stay_ dirty until you free'd them - no need to re-dirty > them in case somebody did IO on them. I agree in principle, if the VM stays away from pages under GUP theoretically the dirty bit shouldn't be transferred to the PG_dirty of the page until after the I/O is complete, so the dirty bit set by gup in the pte may be enough. Not sure if there are other places that could transfer the dirty bit of the pte before the gup user releases the page-pin. > I don't think you can use just mapcount on its own - you have to compare > it to page_count(). Otherwise perfectly normal (non-gup) pages will > trigger, since that page count is the only thing that differs between the > two cases. Yes, page_count shall be compared with page_mapcount. My worry is only that both can change from under us if mapcount > 1 (not enough to hold PT lock to be sure mapcount/count is stable if mapcount > 1). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-17 19:35 ` Andrea Arcangeli @ 2009-03-17 19:55 ` Linus Torvalds 0 siblings, 0 replies; 83+ messages in thread From: Linus Torvalds @ 2009-03-17 19:55 UTC (permalink / raw) To: Andrea Arcangeli Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm On Tue, 17 Mar 2009, Andrea Arcangeli wrote: > On Tue, Mar 17, 2009 at 12:03:55PM -0700, Linus Torvalds wrote: > > If it's in the swap cache, it should be mapped read-only, and gup(write=1) > > will do the COW break and un-swapcache it. > > It may turn it read-write instead of COW break and un-swapcache. > > if (write_access && reuse_swap_page(page)) { > pte = maybe_mkwrite(pte_mkdirty(pte), vma); > > This is done to avoid fragmenting the swap device. Right, but reuse_swap_page() will have removed it from the swapcache if it returns success. So if the page is writable in the page tables, it should not be in the swap cache. Oh, except that we do it in shrink_page_list(), and while we're going to do that whole "try_to_unmap()", I guess it can fail to unmap there? In that case, you could actually have it in the page tables while in the swap cache. And besides, we do remove it from the page tables in the wrong order (ie we add it to the swap cache first, _then_ remove it), so I guess that also ends up being a race with another CPU doing fast-gup. And we _have_ to do it in that order at least for the map_count > 1 case, since a read-only swap page may be shared by multiple mm's, and the swap-cache is how we make sure that they all end up joining together. Of course, the only case we really care about is the map_count=1 case, since that's the only one that is possible after GUP has succeeded (assuming, as always, that fork() is locked out of making copies). So we really only care about the simpler case. > I agree in principle, if the VM stays away from pages under GUP > theoretically the dirty bit shouldn't be transferred to the PG_dirty > of the page until after the I/O is complete, so the dirty bit set by > gup in the pte may be enough. Not sure if there are other places that > could transfer the dirty bit of the pte before the gup user releases > the page-pin. I do suspect there are subtle issues like the above. > > I don't think you can use just mapcount on its own - you have to compare > > it to page_count(). Otherwise perfectly normal (non-gup) pages will > > trigger, since that page count is the only thing that differs between the > > two cases. > > Yes, page_count shall be compared with page_mapcount. My worry is only > that both can change from under us if mapcount > 1 (not enough to hold > PT lock to be sure mapcount/count is stable if mapcount > 1). Now, that's not a big worry, because we only care about mapcount=1 for the anonymous page case at least. So we can stabilize that one with the pt lock. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
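Putting the two mails together, the comparison under discussion comes down to something like this (sketch; per Linus's last paragraph it assumes the pt lock is held and mapcount == 1, which is the only case that matters after a successful gup):

static inline int page_gup_pinned(struct page *page)
{
	int refs = page_mapcount(page);		/* one per mapping pte */

	if (PageSwapCache(page))
		refs++;				/* the swapcache's own reference */
	return page_count(page) > refs;		/* anything extra is a pin */
}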
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-11 18:46 ` Linus Torvalds 2009-03-11 19:01 ` Linus Torvalds @ 2009-03-11 19:06 ` Andrea Arcangeli 2009-03-12 5:36 ` Nick Piggin 2 siblings, 0 replies; 83+ messages in thread From: Andrea Arcangeli @ 2009-03-11 19:06 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm On Wed, Mar 11, 2009 at 11:46:17AM -0700, Linus Torvalds wrote: > > > On Wed, 11 Mar 2009, Andrea Arcangeli wrote: > > > On Wed, Mar 11, 2009 at 10:58:17AM -0700, Linus Torvalds wrote: > > > As far as I can tell, it's the same old problem that we've always had: if > > > you fork(), it's unclear who is going to do the first write - parent or > > > child (and "parent" in this case can include any number of threads that > > > share the VM, of course). > > > > The child doesn't touch any page. Calling fork just generates O_DIRECT > > corruption in the parent regardless of what the child does. > > You aren't listening. > > It depends on who does the write. If the _parent_ does the write (with > another thread or not), then the _parent_ gets the COW. > > That's all I said. I only wanted to clarify this doesn't require the child to touch the page at all. > If the idiots who use O_DIRECT don't understand that, then hey, it's their > problem. I have long been of the opinion that we should not support > O_DIRECT at all, and that it's a totally broken premise to start with. Well, if you don't like it used by databases, O_DIRECT is still ideal for KVM. Guest caches run at CPU core speed, unlike the host cache. Not that KVM can reproduce this bug (all RAM where KVM would be doing O_DIRECT is mapped MADV_DONTFORK, and besides, guest physical RAM has to be allocated with memalign(4096) ;). That said, I agree it'd be better to nuke O_DIRECT than to leave this bug, as O_DIRECT should not break the usual memory-protection semantics provided by the read() and fork() syscalls. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-11 18:46 ` Linus Torvalds 2009-03-11 19:01 ` Linus Torvalds 2009-03-11 19:06 ` Andrea Arcangeli @ 2009-03-12 5:36 ` Nick Piggin 2009-03-12 16:23 ` Nick Piggin 2 siblings, 1 reply; 83+ messages in thread From: Nick Piggin @ 2009-03-12 5:36 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm On Thursday 12 March 2009 05:46:17 Linus Torvalds wrote: > On Wed, 11 Mar 2009, Andrea Arcangeli wrote: > > > The rule has always been: don't mix fork() with page pinning. It > > > doesn't work. It never worked. It likely never will. > > > > I never heard this rule here > > It's never been written down, but it's obvious to anybody who looks at how > COW works for even five seconds. The fact is, the person doing the COW > after a fork() is the person who no longer has the same physical page > (because he got a new page). > > So _anything- that depends on physical addresses simply _cannot_ work > concurrently with a fork. That has always been true. > > If the idiots who use O_DIRECT don't understand that, then hey, it's their > problem. I have long been of the opinion that we should not support > O_DIRECT at all, and that it's a totally broken premise to start with. > > This is just one of millions of reasons. Well it is a quite well known issue at this stage I think. We've had MADV_DONTFORK since 2.6.16 which is basically to solve this issue I think with infiniband library. I guess if it would be really helpful we *could* add MADV_DONTCOW. Assuming we want to try fixing it transparently... what about another approach, mark a vma as VM_DONTCOW and uncow all existing pages in it if it ever has get_user_pages run on it. Big hammer approach. fast gup would be a little bit harder because looking up the vma defeats the purpose. However if we use another page bit to say the page belongs to a VM_DONTCOW vma, then we only need to check that once and fall back to slow gup if it is clear. So there would be no extra atomics in the repeat case. Yes it would be slower, but apps that really care should know what they are doing and set MADV_DONTFORK or MADV_DONTCOW on the vma by hand before doing the zero copy IO. Would this work? Anyone see any holes? (I imagine someone might argue against big hammer, but I would prefer it if it is lighter impact on the VM and still allows good applications to avoid the hammer) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
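For completeness, the existing escape hatch Nick mentions is used like this today (MADV_DONTFORK has been real API since 2.6.16; the helper below is illustrative, not taken from any particular library):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Allocate an O_DIRECT buffer that fork() simply will not duplicate:
 * the range gets VM_DONTCOPY, so there is no child mapping and no COW
 * can ever migrate these pages away from an in-flight DMA. */
static void *alloc_dio_buffer(size_t len)
{
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;
	if (madvise(buf, len, MADV_DONTFORK)) {
		munmap(buf, len);
		return NULL;
	}
	return buf;	/* page-aligned, so it satisfies O_DIRECT too */
}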
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-12 5:36 ` Nick Piggin
@ 2009-03-12 16:23 ` Nick Piggin
  2009-03-12 17:00 ` Andrea Arcangeli
  0 siblings, 1 reply; 83+ messages in thread
From: Nick Piggin @ 2009-03-12 16:23 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Thursday 12 March 2009 16:36:18 Nick Piggin wrote:
> Assuming we want to try fixing it transparently... what about another approach: mark a vma as VM_DONTCOW and uncow all existing pages in it if it ever has get_user_pages run on it. Big hammer approach.
>
> fast gup would be a little bit harder because looking up the vma defeats the purpose. However, if we use another page bit to say the page belongs to a VM_DONTCOW vma, then we only need to check that once and fall back to slow gup if it is clear. So there would be no extra atomics in the repeat case. Yes, it would be slower, but apps that really care should know what they are doing and set MADV_DONTFORK or MADV_DONTCOW on the vma by hand before doing the zero copy IO.
>
> Would this work? Anyone see any holes? (I imagine someone might argue against the big hammer, but I would prefer it if it has a lighter impact on the VM and still allows good applications to avoid the hammer)

OK, this is as far as I got tonight.

This passes Andrea's dma_thread test case. I haven't started on hugepages, and it isn't quite right to drop the mmap_sem and retake it for write in get_user_pages (firstly, the caller might hold mmap_sem for write; secondly, it may not be able to tolerate mmap_sem being dropped). Annoying that it has to take mmap_sem for write to add this bit to vm_flags. Possibly we could use a different way to signal it is a "dontcow" vma... something in anon_vma maybe?

Anyway, before worrying too much more about those details, I'll post it. It is a different approach that I think might be worth consideration. Comments?
Thanks, Nick -- Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h 2009-03-13 03:00:58.000000000 +1100 +++ linux-2.6/include/linux/mm.h 2009-03-13 03:05:00.000000000 +1100 @@ -104,6 +104,7 @@ extern unsigned int kobjsize(const void #define VM_CAN_NONLINEAR 0x08000000 /* Has ->fault & does nonlinear pages */ #define VM_MIXEDMAP 0x10000000 /* Can contain "struct page" and pure PFN pages */ #define VM_SAO 0x20000000 /* Strong Access Ordering (powerpc) */ +#define VM_DONTCOW 0x40000000 /* Contains no COW pages (copies on fork) */ #ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */ #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS @@ -789,7 +790,7 @@ int walk_page_range(unsigned long addr, void free_pgd_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling); int copy_page_range(struct mm_struct *dst, struct mm_struct *src, - struct vm_area_struct *vma); + struct vm_area_struct *dst_vma, struct vm_area_struct *vma); void unmap_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows); int follow_phys(struct vm_area_struct *vma, unsigned long address, Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c 2009-03-13 03:00:58.000000000 +1100 +++ linux-2.6/mm/memory.c 2009-03-13 03:07:52.000000000 +1100 @@ -580,7 +580,8 @@ copy_one_pte(struct mm_struct *dst_mm, s * in the parent and the child */ if (is_cow_mapping(vm_flags)) { - ptep_set_wrprotect(src_mm, addr, src_pte); + if (likely(!(vm_flags & VM_DONTCOW))) + ptep_set_wrprotect(src_mm, addr, src_pte); pte = pte_wrprotect(pte); } @@ -594,6 +595,7 @@ copy_one_pte(struct mm_struct *dst_mm, s page = vm_normal_page(vma, addr, pte); if (page) { + VM_BUG_ON(PageDontCOW(page) && !(vm_flags & VM_DONTCOW)); get_page(page); page_dup_rmap(page, vma, addr); rss[!!PageAnon(page)]++; @@ -696,8 +698,10 @@ static inline int copy_pud_range(struct return 0; } +static int decow_page_range(struct mm_struct *mm, struct vm_area_struct *vma); + int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - struct vm_area_struct *vma) + struct vm_area_struct *dst_vma, struct vm_area_struct *vma) { pgd_t *src_pgd, *dst_pgd; unsigned long next; @@ -755,6 +759,15 @@ int copy_page_range(struct mm_struct *ds if (is_cow_mapping(vma->vm_flags)) mmu_notifier_invalidate_range_end(src_mm, vma->vm_start, end); + + WARN_ON(ret); + if (unlikely(vma->vm_flags & VM_DONTCOW) && !ret) { + if (decow_page_range(dst_mm, dst_vma)) + ret = -ENOMEM; + /* child doesn't really need VM_DONTCOW after being de-COWed */ + // dst_vma->vm_flags &= ~VM_DONTCOW; + } + return ret; } @@ -1200,6 +1213,7 @@ static inline int use_zero_page(struct v } +static int make_vma_nocow(struct vm_area_struct *vma); int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int len, int flags, @@ -1273,6 +1287,23 @@ int __get_user_pages(struct task_struct (!ignore && !(vm_flags & vma->vm_flags))) return i ? : -EFAULT; + if (!(flags & GUP_FLAGS_STACK) && + is_cow_mapping(vma->vm_flags) && + !(vma->vm_flags & VM_DONTCOW)) { + up_read(&mm->mmap_sem); + down_write(&mm->mmap_sem); + vma = find_vma(mm, start); + if (vma && is_cow_mapping(vma->vm_flags) && + !(vma->vm_flags & VM_DONTCOW)) { + if (make_vma_nocow(vma)) { + downgrade_write(&mm->mmap_sem); + return i ? 
: -ENOMEM; + } + } + downgrade_write(&mm->mmap_sem); + continue; + } + if (is_vm_hugetlb_page(vma)) { i = follow_hugetlb_page(mm, vma, pages, vmas, &start, &len, i, write); @@ -1910,6 +1941,8 @@ static int do_wp_page(struct mm_struct * goto gotten; } + VM_BUG_ON(PageDontCOW(old_page)); + /* * Take out anonymous pages first, anonymous shared vmas are * not dirty accountable. @@ -2102,6 +2135,232 @@ unwritable_page: return VM_FAULT_SIGBUS; } +static int decow_one_pte(struct mm_struct *mm, pte_t *ptep, pmd_t *pmd, + spinlock_t *ptl, struct vm_area_struct *vma, + unsigned long address) +{ + pte_t pte = *ptep; + struct page *page, *new_page; + int ret = 0; + + /* pte contains position in swap or file, so don't do anything */ + if (unlikely(!pte_present(pte))) + return 0; + /* pte is writable, can't be COW */ + if (pte_write(pte)) + return 0; + + page = vm_normal_page(vma, address, pte); + if (!page) + return 0; + + if (!PageAnon(page)) + return 0; + + page_cache_get(page); + + pte_unmap_unlock(pte, ptl); + + if (unlikely(anon_vma_prepare(vma))) + goto oom; + VM_BUG_ON(page == ZERO_PAGE(0)); + new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); + if (!new_page) + goto oom; + /* + * Don't let another task, with possibly unlocked vma, + * keep the mlocked page. + */ + if (vma->vm_flags & VM_LOCKED) { + lock_page(page); /* for LRU manipulation */ + clear_page_mlock(page); + unlock_page(page); + } + cow_user_page(new_page, page, address, vma); + __SetPageUptodate(new_page); + __SetPageDontCOW(new_page); + + if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL)) + goto oom_free_new; + + /* + * Re-check the pte - we dropped the lock + */ + ptep = pte_offset_map_lock(mm, pmd, address, &ptl); + if (likely(pte_same(*ptep, pte))) { + pte_t entry; + + flush_cache_page(vma, address, pte_pfn(pte)); + entry = mk_pte(new_page, vma->vm_page_prot); + entry = maybe_mkwrite(pte_mkdirty(entry), vma); + /* + * Clear the pte entry and flush it first, before updating the + * pte with the new entry. This will avoid a race condition + * seen in the presence of one thread doing SMC and another + * thread doing COW. + */ + ptep_clear_flush_notify(vma, address, ptep); + page_add_new_anon_rmap(new_page, vma, address); + set_pte_at(mm, address, ptep, entry); + + /* See comment in do_wp_page */ + page_remove_rmap(page); + } else { + mem_cgroup_uncharge_page(new_page); + page_cache_release(new_page); + ret = -EAGAIN; + } + + page_cache_release(page); + + return ret; + +oom_free_new: + page_cache_release(new_page); +oom: + page_cache_release(page); + return -ENOMEM; +} + +static int decow_pte_range(struct mm_struct *mm, + pmd_t *pmd, struct vm_area_struct *vma, + unsigned long addr, unsigned long end) +{ + pte_t *pte; + spinlock_t *ptl; + int progress = 0; + int ret = 0; + +again: + pte = pte_offset_map_lock(mm, pmd, addr, &ptl); +// arch_enter_lazy_mmu_mode(); + + do { + /* + * We are holding two locks at this point - either of them + * could generate latencies in another task on another CPU. 
+ */ + if (progress >= 32) { + progress = 0; + if (need_resched() || spin_needbreak(ptl)) + break; + } + if (pte_none(*pte)) { + progress++; + continue; + } + ret = decow_one_pte(mm, pte, pmd, ptl, vma, addr); + if (ret) { + if (ret == -EAGAIN) { /* retry */ + ret = 0; + break; + } + goto out; + } + progress += 8; + } while (pte++, addr += PAGE_SIZE, addr != end); + +// arch_leave_lazy_mmu_mode(); + pte_unmap_unlock(pte - 1, ptl); + cond_resched(); + if (addr != end) + goto again; +out: + return ret; +} + +static int decow_pmd_range(struct mm_struct *mm, + pud_t *pud, struct vm_area_struct *vma, + unsigned long addr, unsigned long end) +{ + pmd_t *pmd; + unsigned long next; + + pmd = pmd_offset(pud, addr); + do { + next = pmd_addr_end(addr, end); + if (pmd_none_or_clear_bad(pmd)) + continue; + if (decow_pte_range(mm, pmd, vma, addr, next)) + return -ENOMEM; + } while (pmd++, addr = next, addr != end); + return 0; +} + +static int decow_pud_range(struct mm_struct *mm, + pgd_t *pgd, struct vm_area_struct *vma, + unsigned long addr, unsigned long end) +{ + pud_t *pud; + unsigned long next; + + pud = pud_offset(pgd, addr); + do { + next = pud_addr_end(addr, end); + if (pud_none_or_clear_bad(pud)) + continue; + if (decow_pmd_range(mm, pud, vma, addr, next)) + return -ENOMEM; + } while (pud++, addr = next, addr != end); + return 0; +} + +static noinline int decow_page_range(struct mm_struct *mm, struct vm_area_struct *vma) +{ + pgd_t *pgd; + unsigned long next; + unsigned long addr = vma->vm_start; + unsigned long end = vma->vm_end; + int ret; + + BUG_ON(!is_cow_mapping(vma->vm_flags)); + +// if (is_vm_hugetlb_page(vma)) +// return decow_hugetlb_page_range(mm, vma); + + mmu_notifier_invalidate_range_start(mm, addr, end); + + ret = 0; + pgd = pgd_offset(mm, addr); + do { + next = pgd_addr_end(addr, end); + if (pgd_none_or_clear_bad(pgd)) + continue; + if (unlikely(decow_pud_range(mm, pgd, vma, addr, next))) { + ret = -ENOMEM; + break; + } + } while (pgd++, addr = next, addr != end); + + mmu_notifier_invalidate_range_end(mm, vma->vm_start, end); + + return ret; +} + +/* + * Turns the anonymous VMA into a "nocow" vma. De-cow existing COW pages. + * Must hold mmap_sem for write. + */ +static int make_vma_nocow(struct vm_area_struct *vma) +{ + static DEFINE_MUTEX(lock); + struct mm_struct *mm = vma->vm_mm; + int ret; + + mutex_lock(&lock); + if (vma->vm_flags & VM_DONTCOW) { + mutex_unlock(&lock); + return 0; + } + + ret = decow_page_range(mm, vma); + if (!ret) + vma->vm_flags |= VM_DONTCOW; + mutex_unlock(&lock); + + return ret; +} + /* * Helper functions for unmap_mapping_range(). 
* @@ -2433,6 +2692,9 @@ static int do_swap_page(struct mm_struct count_vm_event(PGMAJFAULT); } + if (unlikely(vma->vm_flags & VM_DONTCOW)) + SetPageDontCOW(page); + mark_page_accessed(page); lock_page(page); @@ -2530,6 +2792,8 @@ static int do_anonymous_page(struct mm_s if (!page) goto oom; __SetPageUptodate(page); + if (unlikely(vma->vm_flags & VM_DONTCOW)) + __SetPageDontCOW(page); if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) goto oom_free_page; @@ -2636,6 +2900,8 @@ static int __do_fault(struct mm_struct * clear_page_mlock(vmf.page); copy_user_highpage(page, vmf.page, address, vma); __SetPageUptodate(page); + if (unlikely(vma->vm_flags & VM_DONTCOW)) + __SetPageDontCOW(page); } else { /* * If the page will be shareable, see if the backing @@ -2935,8 +3201,9 @@ int make_pages_present(unsigned long add BUG_ON(addr >= end); BUG_ON(end > vma->vm_end); len = DIV_ROUND_UP(end, PAGE_SIZE) - addr/PAGE_SIZE; - ret = get_user_pages(current, current->mm, addr, - len, write, 0, NULL, NULL); + ret = __get_user_pages(current, current->mm, addr, + len, GUP_FLAGS_STACK | (write ? GUP_FLAGS_WRITE : 0), + NULL, NULL); if (ret < 0) return ret; return ret == len ? 0 : -EFAULT; @@ -3085,8 +3352,9 @@ int access_process_vm(struct task_struct void *maddr; struct page *page = NULL; - ret = get_user_pages(tsk, mm, addr, 1, - write, 1, &page, &vma); + ret = __get_user_pages(tsk, mm, addr, 1, + GUP_FLAGS_FORCE | GUP_FLAGS_STACK | + (write ? GUP_FLAGS_WRITE : 0), &page, &vma); if (ret <= 0) { /* * Check if this is a VM_IO | VM_PFNMAP VMA, which Index: linux-2.6/arch/x86/mm/gup.c =================================================================== --- linux-2.6.orig/arch/x86/mm/gup.c 2009-03-13 03:00:58.000000000 +1100 +++ linux-2.6/arch/x86/mm/gup.c 2009-03-13 03:01:03.000000000 +1100 @@ -83,11 +83,14 @@ static noinline int gup_pte_range(pmd_t struct page *page; if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) { +failed: pte_unmap(ptep); return 0; } VM_BUG_ON(!pfn_valid(pte_pfn(pte))); page = pte_page(pte); + if (unlikely(!PageDontCOW(page))) + goto failed; get_page(page); pages[*nr] = page; (*nr)++; Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h 2009-03-13 03:00:58.000000000 +1100 +++ linux-2.6/include/linux/page-flags.h 2009-03-13 03:01:03.000000000 +1100 @@ -94,6 +94,7 @@ enum pageflags { PG_reclaim, /* To be reclaimed asap */ PG_buddy, /* Page is free, on buddy lists */ PG_swapbacked, /* Page is backed by RAM/swap */ + PG_dontcow, /* PageAnon page in a VM_DONTCOW vma */ #ifdef CONFIG_UNEVICTABLE_LRU PG_unevictable, /* Page is "unevictable" */ PG_mlocked, /* Page is vma mlocked */ @@ -208,6 +209,8 @@ __PAGEFLAG(SlubDebug, slub_debug) */ TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback) __PAGEFLAG(Buddy, buddy) +__PAGEFLAG(DontCOW, dontcow) +SETPAGEFLAG(DontCOW, dontcow) PAGEFLAG(MappedToDisk, mappedtodisk) /* PG_readahead is only used for file reads; PG_reclaim is only for writes */ Index: linux-2.6/mm/page_alloc.c =================================================================== --- linux-2.6.orig/mm/page_alloc.c 2009-03-13 03:00:58.000000000 +1100 +++ linux-2.6/mm/page_alloc.c 2009-03-13 03:01:03.000000000 +1100 @@ -1000,6 +1000,7 @@ static void free_hot_cold_page(struct pa struct per_cpu_pages *pcp; unsigned long flags; + __ClearPageDontCOW(page); if (PageAnon(page)) page->mapping = NULL; if (free_pages_check(page)) Index: linux-2.6/kernel/fork.c 
=================================================================== --- linux-2.6.orig/kernel/fork.c 2009-03-13 03:04:33.000000000 +1100 +++ linux-2.6/kernel/fork.c 2009-03-13 03:05:00.000000000 +1100 @@ -353,7 +353,7 @@ static int dup_mmap(struct mm_struct *mm rb_parent = &tmp->vm_rb; mm->map_count++; - retval = copy_page_range(mm, oldmm, mpnt); + retval = copy_page_range(mm, oldmm, tmp, mpnt); if (tmp->vm_ops && tmp->vm_ops->open) tmp->vm_ops->open(tmp); Index: linux-2.6/fs/exec.c =================================================================== --- linux-2.6.orig/fs/exec.c 2009-03-13 03:04:33.000000000 +1100 +++ linux-2.6/fs/exec.c 2009-03-13 03:05:00.000000000 +1100 @@ -165,6 +165,13 @@ exit: #ifdef CONFIG_MMU +#define GUP_FLAGS_WRITE 0x01 +#define GUP_FLAGS_STACK 0x10 + +int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int len, int flags, + struct page **pages, struct vm_area_struct **vmas); + static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos, int write) { @@ -178,8 +185,11 @@ static struct page *get_arg_page(struct return NULL; } #endif - ret = get_user_pages(current, bprm->mm, pos, - 1, write, 1, &page, NULL); + down_read(&bprm->mm->mmap_sem); + ret = __get_user_pages(current, bprm->mm, pos, + 1, GUP_FLAGS_STACK | (write ? GUP_FLAGS_WRITE : 0), + &page, NULL); + up_read(&bprm->mm->mmap_sem); if (ret <= 0) return NULL; Index: linux-2.6/mm/internal.h =================================================================== --- linux-2.6.orig/mm/internal.h 2009-03-13 03:04:33.000000000 +1100 +++ linux-2.6/mm/internal.h 2009-03-13 03:05:00.000000000 +1100 @@ -273,10 +273,11 @@ static inline void mminit_validate_memmo } #endif /* CONFIG_SPARSEMEM */ -#define GUP_FLAGS_WRITE 0x1 -#define GUP_FLAGS_FORCE 0x2 -#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4 -#define GUP_FLAGS_IGNORE_SIGKILL 0x8 +#define GUP_FLAGS_WRITE 0x01 +#define GUP_FLAGS_FORCE 0x02 +#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x04 +#define GUP_FLAGS_IGNORE_SIGKILL 0x08 +#define GUP_FLAGS_STACK 0x10 int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int len, int flags, \0 ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-12 16:23 ` Nick Piggin
@ 2009-03-12 17:00 ` Andrea Arcangeli
  2009-03-12 17:20 ` Nick Piggin
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-12 17:00 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Fri, Mar 13, 2009 at 03:23:40AM +1100, Nick Piggin wrote:
> OK, this is as far as I got tonight.
>
> This passes Andrea's dma_thread test case. I haven't started on hugepages, and it isn't quite right to drop the mmap_sem and retake it for write in get_user_pages (firstly, the caller might hold mmap_sem for write; secondly, it may not be able to tolerate mmap_sem being dropped).

What's the point? I mean, this will simply work worse than my patch because it'll have to don't-cow the whole range regardless of whether it's pinned or not. Which will slow down fork in the O_DIRECT case even more, for no good reason. I thought the complaint here was only a beauty issue of not wanting to add a function called fork_pre_cow or your equivalent decow_one_pte in the fork path, not any practical issue with my patch, which already passed all sorts of regression testing and performance evaluations.

Plus you still have a per-page bitflag, and I think you have implementation issues in the patch (the parent pte can't be left writeable if you are in a don't-cow vma, or the copy will not be atomic, and glibc will have no chance to fix its bugs). You're not removing the fork_pre_cow logic from fork, so I can only see it as a regression to make the logic less granular, at the vma level.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-12 17:00 ` Andrea Arcangeli
@ 2009-03-12 17:20 ` Nick Piggin
  2009-03-12 17:23 ` Nick Piggin
  2009-03-12 18:06 ` Andrea Arcangeli
  0 siblings, 2 replies; 83+ messages in thread
From: Nick Piggin @ 2009-03-12 17:20 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Friday 13 March 2009 04:00:11 Andrea Arcangeli wrote:
> On Fri, Mar 13, 2009 at 03:23:40AM +1100, Nick Piggin wrote:
> > OK, this is as far as I got tonight.
> >
> > This passes Andrea's dma_thread test case. I haven't started on hugepages, and it isn't quite right to drop the mmap_sem and retake it for write in get_user_pages (firstly, the caller might hold mmap_sem for write; secondly, it may not be able to tolerate mmap_sem being dropped).
>
> What's the point?

Well, the main point is to avoid atomics and barriers and stuff like that, especially in the fast gup path. It also seems very much smaller (the vast majority of the change is the addition of the decow function).

> I mean, this will simply work worse than my patch because it'll have to don't-cow the whole range regardless of whether it's pinned or not. Which will slow down fork in the O_DIRECT case even more, for no good reason.

Hmm, maybe. It can probably work entirely without the vm_flag and just use the page flag, however. Yes, I think it could, and that might just avoid the whole problem of modifying vm_flags in gup. I'll have to consider it more tomorrow.

But this is just for the case where we want to support this transparently without being too intrusive. Apps that know and care very much could use MADV_DONTFORK to avoid the copy completely.

> I thought the complaint here was only a beauty issue of not wanting to add a function called fork_pre_cow or your equivalent decow_one_pte in the fork path, not any practical issue with my patch, which already passed all sorts of regression testing and performance evaluations.

My complaint is not decow / pre cow (I think I suggested it as the fix for the problem in the first place). I think the patch is quite complex and is quite a slowdown for fast gup (especially with hugepages). I'm just trying to explore a different approach.

> Plus you still have a per-page bitflag,

Sure. It's the atomic operations which I want to try to minimise.

> and I think you have implementation issues in the patch (the parent pte can't be left writeable if you are in a don't-cow vma, or the copy will not be atomic, and glibc will have no chance to fix its bugs)

Oh, we need to do that? OK, then just take out that statement, and change VM_BUG_ON(PageDontCOW()) in do_wp_page to VM_BUG_ON(PageDontCOW() && !reuse);

> . You're not removing the fork_pre_cow logic from fork, so I can only see it as a regression to make the logic less granular, at the vma level.

I'll see if it can be made per-page. But I still don't know if it is a big problem. It's hard to know exactly what crazy things apps require to be fast.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-12 17:20 ` Nick Piggin
@ 2009-03-12 17:23 ` Nick Piggin
  2009-03-12 18:06 ` Andrea Arcangeli
  0 siblings, 0 replies; 83+ messages in thread
From: Nick Piggin @ 2009-03-12 17:23 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Friday 13 March 2009 04:20:27 Nick Piggin wrote:
> On Friday 13 March 2009 04:00:11 Andrea Arcangeli wrote:
> > and I think you have implementation issues in the patch (the parent pte can't be left writeable if you are in a don't-cow vma, or the copy will not be atomic, and glibc will have no chance to fix its bugs)
>
> Oh, we need to do that? OK, then just take out that statement, and

Should read: "take out that *if* statement" (the one which I put in to avoid wrprotect in the parent)

> change VM_BUG_ON(PageDontCOW()) in do_wp_page to VM_BUG_ON(PageDontCOW() && !reuse);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

^ permalink raw reply [flat|nested] 83+ messages in thread
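For readers without the source at hand: the "reuse" referred to here is the decision in 2.6.29-era do_wp_page about whether an anonymous page can be written in place instead of COWed. A hedged paraphrase of that logic (from memory, not an exact quote of mm/memory.c):

	int reuse = 0;
	if (PageAnon(old_page) && trylock_page(old_page)) {
		/* sole owner? (Hugh's race-fixed check) */
		reuse = reuse_swap_page(old_page);
		unlock_page(old_page);
	}
	if (reuse) {
		/* keep the same physical page, just make the pte writable */
	} else {
		/* allocate a new page and copy: the COW path */
	}

The proposed VM_BUG_ON(PageDontCOW(old_page) && !reuse) would then assert that a de-cowed page may reach do_wp_page only on the reuse path, never on the copy path.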
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-12 17:20 ` Nick Piggin
  2009-03-12 17:23 ` Nick Piggin
@ 2009-03-12 18:06 ` Andrea Arcangeli
  2009-03-12 18:58 ` Andrea Arcangeli
  2009-03-13 16:09 ` Nick Piggin
  1 sibling, 2 replies; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-12 18:06 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Fri, Mar 13, 2009 at 04:20:27AM +1100, Nick Piggin wrote:
> Well, the main point is to avoid atomics and barriers and stuff like that, especially in the fast gup path. It also seems very much smaller (the vast majority of the change is the addition of the decow function).

Well, if you remove the hugetlb part and you remove the pass of src/dst vma that is needed anyway to fix PAT bugs, my patch will get quite a bit smaller too.

Agree about the gup-fast path, but frankly I miss how you avoid having to change gup-fast... I wanted to ask about that...

> Hmm, maybe. It can probably work entirely without the vm_flag and just use the page flag, however. Yes, I think it could, and that

Right, I only use the page flag, and you seem to have a page flag PG_dontcow too after all.

> might just avoid the whole problem of modifying vm_flags in gup. I'll have to consider it more tomorrow.

Ok.

> But this is just for the case where we want to support this transparently without being too intrusive. Apps that know and care very much could use MADV_DONTFORK to avoid the copy completely.

Well, those apps aren't the problem.

> My complaint is not decow / pre cow (I think I suggested it as the fix for the problem in the first place). I think the patch is quite

I'm sure that's not your complaint, right. I thought it was the primary complaint in the discussion so far, though.

> complex and is quite a slowdown for fast gup (especially with hugepages). I'm just trying to explore a different approach.

I think we could benchmark this. Also, once I understand how you avoid touching the gup-fast fast path without sending a flood of ipis in fork, I'll understand better how your patch works.

> Oh, we need to do that? OK, then just take out that statement, and change VM_BUG_ON(PageDontCOW()) in do_wp_page to VM_BUG_ON(PageDontCOW() && !reuse);

Not sure how do_wp_page is relevant; the problem I pointed out is in the fork_pre_cow/decow_pte path only. If do_wp_page runs it means the page was already wrprotected in the parent or it couldn't be shared, so no problem in do_wp_page in that respect.

The only thing required is that cow_user_page is copying a page that can't be modified by the parent thread pool during the copy. So marking the parent pte wrprotected and flushing the tlb is required. Then after the copy, like in my fork_pre_cow, we set the parent pte writable again. BTW, I'm starting to think I forgot a tlb flush after setting the pte writable again, that could generate a minor fault that we can avoid by flushing the tlb, right? But this is a minor thing, and it'd only trigger if the parent only reads the parent pte; otherwise the parent thread will wait for fork on mmap_sem if it did a write, or it won't have the tlb loaded in the first place if it didn't touch the page while the pte was temporarily wrprotected.

> I'll see if it can be made per-page. But I still don't know if it is a big problem. It's hard to know exactly what crazy things apps require to be fast.

The thing is quite simple: if an app has 1G of vma loaded, you'll allocate 1G of ram for no good reason. It can even OOM; it's not just a performance issue. Doing it per-page like I do won't be noticeable, as the in-flight I/O will be minor.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

^ permalink raw reply [flat|nested] 83+ messages in thread
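The wrprotect/flush/copy/restore sequence Andrea describes condenses to something like the sketch below. The function name is made up, and locking, rmap/accounting and the child-side pte setup are all omitted; see the actual fork_pre_cow in his patch for the real thing.

static void pre_cow_copy_sketch(struct mm_struct *src_mm,
				struct vm_area_struct *src_vma,
				unsigned long addr, pte_t *src_pte,
				struct page *old_page, struct page *new_page)
{
	/* Stop the parent's threads from writing, so the copy is atomic */
	ptep_set_wrprotect(src_mm, addr, src_pte);
	flush_tlb_page(src_vma, addr);

	/* Snapshot the page for the child */
	copy_user_highpage(new_page, old_page, addr, src_vma);

	/* The parent keeps its own page, writable as before the fork */
	set_pte_at(src_mm, addr, src_pte, pte_mkwrite(*src_pte));
}

As the follow-up message notes, a tlb flush after the final set_pte_at turns out to be unnecessary: fork's flush in the parent before returning covers it.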
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-12 18:06 ` Andrea Arcangeli
@ 2009-03-12 18:58 ` Andrea Arcangeli
  0 siblings, 0 replies; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-12 18:58 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Thu, Mar 12, 2009 at 07:06:48PM +0100, Andrea Arcangeli wrote:
> again. BTW, I'm starting to think I forgot a tlb flush after setting the pte writable again, that could generate a minor fault that we can avoid by flushing the tlb, right? But this is a minor thing, and it'd

Ah no, that is already taken care of by the fork flush in the parent before returning, so no problem (and it would have been a minor thing anyway).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-12 18:06 ` Andrea Arcangeli
  2009-03-12 18:58 ` Andrea Arcangeli
@ 2009-03-13 16:09 ` Nick Piggin
  2009-03-13 19:34 ` Andrea Arcangeli
  2009-03-14 4:46 ` Nick Piggin
  1 sibling, 2 replies; 83+ messages in thread
From: Nick Piggin @ 2009-03-13 16:09 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Friday 13 March 2009 05:06:48 Andrea Arcangeli wrote:
> On Fri, Mar 13, 2009 at 04:20:27AM +1100, Nick Piggin wrote:
> > Well, the main point is to avoid atomics and barriers and stuff like that, especially in the fast gup path. It also seems very much smaller (the vast majority of the change is the addition of the decow function).
>
> Well, if you remove the hugetlb part and you remove the pass of src/dst vma that is needed anyway to fix PAT bugs, my patch will get quite a bit smaller too.

Possibly true. OK, it wasn't a very good argument to compare my incomplete, RFC patch based on size alone :)

> Agree about the gup-fast path, but frankly I miss how you avoid having to change gup-fast... I wanted to ask about that...

It is more straightforward than your version because it does not try to make the page re-cow-able again after the GUP is finished. The main conceptual difference between our fixes, I think (ignoring my silly vma-wide decow), is this issue.

Of course I could have a race in fast-gup, but I don't think I can see one. I'm working on removing the vma stuff and just making it per-page, which might make it easier to review.

> > Oh, we need to do that? OK, then just take out that statement, and change VM_BUG_ON(PageDontCOW()) in do_wp_page to VM_BUG_ON(PageDontCOW() && !reuse);
>
> Not sure how do_wp_page is relevant; the problem I pointed out is in the fork_pre_cow/decow_pte path only. If do_wp_page runs it means the page was already wrprotected in the parent or it couldn't be shared, so no problem in do_wp_page in that respect.

Well, it would save having to touch the parent's pagetables after doing the atomic copy-on-fork in the child. Just have the parent do a do_wp_page, which will notice it is the only user of the page and reuse it rather than COW it (now that Hugh has fixed the races in the reuse check that should be fine).

> The only thing required is that cow_user_page is copying a page that can't be modified by the parent thread pool during the copy. So marking the parent pte wrprotected and flushing the tlb is required. Then after the copy, like in my fork_pre_cow, we set the parent pte writable again.

Yes you could do it this way too, I'm not sure which way is better... I'll have to take another look at it after removing the per-vma code from mine.

> > I'll see if it can be made per-page. But I still don't know if it is a big problem. It's hard to know exactly what crazy things apps require to be fast.
>
> The thing is quite simple: if an app has 1G of vma loaded, you'll allocate 1G of ram for no good reason. It can even OOM; it's not just a performance issue. Doing it per-page like I do won't be noticeable, as the in-flight I/O will be minor.

Yes, I agree now it is a silly way to do it.

Now I also see that your patch still hasn't covered the other side of the race, whereas my scheme should do. Hmm, I think that if we want to go to the extent of adding all this code in and tell userspace apps they can use zerocopy IO and not care about COW, then we really must cover both sides of the race, otherwise it is just asking for data corruption.

Conversely, if we leave *any* holes open by design, then we may as well leave *all* holes open and have simpler code -- because apps will have to know about the zerocopy vs COW problem anyway. Don't you agree?

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-13 16:09 ` Nick Piggin
@ 2009-03-13 19:34 ` Andrea Arcangeli
  2009-03-14 4:59 ` Nick Piggin
  1 sibling, 1 reply; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-13 19:34 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Sat, Mar 14, 2009 at 03:09:39AM +1100, Nick Piggin wrote:
> Of course I could have a race in fast-gup, but I don't think I can see one. I'm working on removing the vma stuff and just making it per-page, which might make it easier to review.

If you didn't touch gup-fast and you don't send ipis in fork, you most certainly have one; it's the one Linus pointed out and that I've fixed (with Izik; then I sorted out the ordering details and how to make it safe on the fork side).

> Well, it would save having to touch the parent's pagetables after doing the atomic copy-on-fork in the child. Just have the parent do a do_wp_page, which will notice it is the only user of the page and reuse it rather than COW it (now that Hugh has fixed the races in the reuse check that should be fine).

If we're into the trouble path, it means the parent already owns the page. I just leave it owned by the parent; the pte remains the same before and after fork. No point in changing the pte value if we're in the troublesome path, as far as I can tell. I only verify that the parent pte didn't go away from under fork when I temporarily release the parent PT lock to allocate the cow page in the slow path (see the -EAGAIN path; I also verified it triggers with swapping and the system survives fine ;).

> Now I also see that your patch still hasn't covered the other side of the race, whereas my scheme should do. Hmm, I think that if we want to

Sorry, but can you elaborate again what the other side of the race is?

The child gets a whole new page, and the parent keeps its own page with the pte marked read-write the whole time that a page fault can run (the page fault takes mmap_sem; all we have to protect against when temporarily releasing the parent PT lock is the VM rmap code, and that is taken care of by the pte_same path), so I don't see any other side of the race...

> go to the extent of adding all this code in and tell userspace apps they can use zerocopy IO and not care about COW, then we really must cover both sides of the race, otherwise it is just asking for data corruption.

Surely I agree that if there's another side of the race left uncovered by my patch we have to address it too, if we make any change and don't consider this a 'feature'!

> Conversely, if we leave *any* holes open by design, then we may as well leave *all* holes open and have simpler code -- because apps will have to know about the zerocopy vs COW problem anyway. Don't you agree?

Indeed ;)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

^ permalink raw reply [flat|nested] 83+ messages in thread
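The "ordering details" mentioned above are spelled out in the commit message later in the thread: fork wrprotects the pte and only then checks the page count, while gup-fast takes the page reference and only then rechecks the pte, with a full memory barrier on each side. That is the classic store-buffering pattern, which guarantees at least one side observes the other's store. A minimal userspace analogue, with pthreads and __sync_synchronize() standing in for the PT lock and smp_mb() (all names are illustrative):

#include <pthread.h>
#include <stdio.h>

static volatile int pte_wrprotected;	/* fork's store */
static volatile int page_pinned;	/* gup-fast's store */
static int fork_saw_pin, gup_saw_wrprotect;

static void *fork_side(void *arg)
{
	pte_wrprotected = 1;		/* wrprotect the parent pte */
	__sync_synchronize();		/* smp_mb() */
	if (page_pinned)
		fork_saw_pin = 1;	/* must pre-COW for the child */
	return NULL;
}

static void *gup_side(void *arg)
{
	page_pinned = 1;		/* get_page(): elevate the refcount */
	__sync_synchronize();		/* smp_mb() */
	if (pte_wrprotected)
		gup_saw_wrprotect = 1;	/* drop the pin, fall back to slow gup */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, fork_side, NULL);
	pthread_create(&b, NULL, gup_side, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	/* The barriers rule out both flags being 0: someone sees the conflict. */
	printf("fork saw pin: %d, gup saw wrprotect: %d\n",
	       fork_saw_pin, gup_saw_wrprotect);
	return 0;
}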
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-13 19:34 ` Andrea Arcangeli
@ 2009-03-14 4:59 ` Nick Piggin
  2009-03-16 13:56 ` Andrea Arcangeli
  0 siblings, 1 reply; 83+ messages in thread
From: Nick Piggin @ 2009-03-14 4:59 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Saturday 14 March 2009 06:34:16 Andrea Arcangeli wrote:
> On Sat, Mar 14, 2009 at 03:09:39AM +1100, Nick Piggin wrote:
> > Of course I could have a race in fast-gup, but I don't think I can see one. I'm working on removing the vma stuff and just making it per-page, which might make it easier to review.
>
> If you didn't touch gup-fast and you don't send ipis in fork, you most certainly have one; it's the one Linus pointed out and that I've fixed (with Izik; then I sorted out the ordering details and how to make it safe on the fork side).

It does touch gup-fast, but it just adds one branch and no barrier in the case where the page is de-cowed (and it would still work with hugepages with the get_page_multiple, I think, although I haven't done the hugepage implementation yet).

> > Well, it would save having to touch the parent's pagetables after doing the atomic copy-on-fork in the child. Just have the parent do a do_wp_page, which will notice it is the only user of the page and reuse it rather than COW it (now that Hugh has fixed the races in the reuse check that should be fine).
>
> If we're into the trouble path, it means the parent already owns the page. I just leave it owned by the parent; the pte remains the same before and after fork. No point in changing the pte value if we're in the troublesome path, as far as I can tell. I only verify that the parent pte didn't go away from under fork when I temporarily release the parent PT lock to allocate the cow page in the slow path (see the -EAGAIN path; I also verified it triggers with swapping and the system survives fine ;).

Possibly that's the right way to go. Depends on whether it is in the slightest performance critical. If not, I would just let do_wp_page do the work to avoid a little bit of logic, but either way is not a big deal to me.

> > Now I also see that your patch still hasn't covered the other side of the race, whereas my scheme should do. Hmm, I think that if we want to
>
> Sorry, but can you elaborate again what the other side of the race is?
>
> The child gets a whole new page, and the parent keeps its own page with the pte marked read-write the whole time that a page fault can run (the page fault takes mmap_sem; all we have to protect against when temporarily releasing the parent PT lock is the VM rmap code, and that is taken care of by the pte_same path), so I don't see any other side of the race...

Oh sorry. I was up too late last night :)

One side of the race is the direct IO read writing to the fork child's page. The other side of the race is the fork child's page write leaking into the direct IO.

My patch solves both sides by de-cowing *any* COW page before it may be returned from get_user_pages (for read or write).

The following test case shows up the corruption both with the standard kernel and with your patch, but I can't trigger it with my patch. You must create a "file.dat" file of FILESIZE by hand in the cwd.
You may have to tweak timings to get the following order of output: thread writing parent storing child storing thread writing done Afterwards hexdump file.dat, and if any 0xff bytes have leaked into it, then it is from child writing to child buffer affecting parent's direct IO write. --- reverse-race.c --- #define _GNU_SOURCE 1 #include <stdio.h> #include <stdlib.h> #include <fcntl.h> #include <unistd.h> #include <memory.h> #include <pthread.h> #include <getopt.h> #include <errno.h> #include <sys/types.h> #include <sys/wait.h> #define FILESIZE (4*1024*1024) #define BUFSIZE (1024*1024) static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; static const char *filename = "file.dat"; static int fd; static void *buffer; #define PAGE_SIZE 4096 static void store(void) { int i; if (usleep(50*1000) == -1) perror("usleep"), exit(1); printf("child storing\n"); fflush(stdout); for (i = 0; i < BUFSIZE; i++) ((char *)buffer)[i] = 0xff; _exit(0); } static void *writer(void *arg) { int i; if (pthread_mutex_lock(&lock) == -1) perror("pthread_mutex_lock"), exit(1); printf("thread writing\n"); fflush(stdout); for (i = 0; i < FILESIZE / BUFSIZE; i++) { size_t count = BUFSIZE; ssize_t ret; do { ret = write(fd, buffer, count); if (ret == -1) { if (errno != EINTR) perror("write"), exit(1); ret = 0; } count -= ret; } while (count); } printf("thread writing done\n"); fflush(stdout); if (pthread_mutex_unlock(&lock) == -1) perror("pthread_mutex_lock"), exit(1); return NULL; } int main(int argc, char *argv[]) { int i; int status; pthread_t writer_thread; pid_t store_proc; posix_memalign(&buffer, PAGE_SIZE, BUFSIZE); printf("Write buffer: %p.\n", buffer); for (i = 0; i < BUFSIZE; i++) ((char *)buffer)[i] = 0x00; fd = open(filename, O_RDWR|O_DIRECT); if (fd == -1) perror("open"), exit(1); if (pthread_mutex_lock(&lock) == -1) perror("pthread_mutex_lock"), exit(1); if (pthread_create(&writer_thread, NULL, writer, NULL) == -1) perror("pthred_create"), exit(1); store_proc = fork(); if (store_proc == -1) perror("fork"), exit(1); if (!store_proc) store(); if (pthread_mutex_unlock(&lock) == -1) perror("pthread_mutex_lock"), exit(1); if (usleep(10*1000) == -1) perror("usleep"), exit(1); printf("parent storing\n"); fflush(stdout); for (i = 0; i < BUFSIZE; i++) ((char *)buffer)[i] = 0x11; do { pid_t w; w = waitpid(store_proc, &status, WUNTRACED | WCONTINUED); if (w == -1) perror("waitpid"), exit(1); } while (!WIFEXITED(status) && !WIFSIGNALED(status)); if (pthread_join(writer_thread, NULL) == -1) perror("pthread_join"), exit(1); exit(0); } ^ permalink raw reply [flat|nested] 83+ messages in thread
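A rough timeline of what this test arranges, as I read it (the do_wp_page reuse detail is a hedged reconstruction of 2.6.29 behaviour, not something the program itself states):

/*
 * T0: writer thread gups the buffer for the O_DIRECT file write
 *     (gup write=0: the DMA reads from memory and writes to disk)
 * T1: fork() write-protects the buffer ptes; the pages are now
 *     shared read-only between parent and child
 * T2: parent stores 0x11 -> the parent COWs and gets fresh copies;
 *     the pinned originals are now mapped only by the child
 * T3: child stores 0xff -> do_wp_page sees mapcount == 1 and reuses
 *     the still-pinned pages in place, so the child's 0xff bytes can
 *     land in pages the disk write is still reading from, and leak
 *     into file.dat
 */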
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-14 4:59 ` Nick Piggin
@ 2009-03-16 13:56 ` Andrea Arcangeli
  2009-03-16 16:01 ` Nick Piggin
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-16 13:56 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Sat, Mar 14, 2009 at 03:59:11PM +1100, Nick Piggin wrote:
> It does touch gup-fast, but it just adds one branch and no barrier in the

My question is what trick do you use to stop gup-fast from returning the page mapped read-write by the pte, if gup-fast doesn't take any lock whatsoever, doesn't set any bit in any page or vma, and doesn't recheck that the pte is still viable after having set any bit on pages or vmas, and you still don't send a flood of ipis from the fork fast path (the no-race case).

> case where the page is de-cowed (and it would still work with hugepages with the get_page_multiple, I think, although I haven't done the hugepage implementation yet).

Yes, let's ignore hugetlb for now; I fixed hugetlb too but that can be left for later.

> Possibly that's the right way to go. Depends on whether it is in the slightest performance critical. If not, I would just let do_wp_page do the work to avoid a little bit of logic, but either way is not a big deal to me.

fork is less performance critical than do_wp_page; still, in a fork microbenchmark no slowdown is measured with the patch. Before I introduced PG_gup there were false positives triggered by the pagevec temporary pins, and that was measurable; after PG_gup the fast path is unaffected (I've still to measure the gup-fast slowdown in setting PG_gup, but I'm rather optimistic that you're underestimating the cost of walking 4 layers of pagetables compared to a locked op on an l1-exclusive cacheline, so I think it'll be lost in the noise). I think the big win of gup-fast is primarily in not having to search vmas, and in turn not having to take any shared lock like mmap_sem/PT lock, and in scaling on a page level with just the get-page being the troublesome cacheline.

> One side of the race is the direct IO read writing to the fork child's page. The other side of the race is the fork child's page write leaking into the direct IO.
>
> My patch solves both sides by de-cowing *any* COW page before it may be returned from get_user_pages (for read or write).

I see what you mean now. If you read the comment of my patch you'll see I explicitly assumed that only people writing into memory with gup were troublesome here. Like you point out, using gup for _reading_ from memory is troublesome as well, if the child writes to those pages. This is a lesser problem, because the major issue is that fork alone is enough to generate memory corruption even if the child isn't touching those pages. The reverse race requires the child to write to those pages, so I guess it never triggered in real-life apps. But nevertheless I totally agree that if we fix write-to-memory-with-gup we have to fix read-from-memory-with-gup. Below I updated my patch and the relative commit header to fix the reverse race too. However I had to enlarge the buffer to 40M to reproduce with your testcase, because my HD was too fast otherwise.

----------

From: Andrea Arcangeli <aarcange@redhat.com>
Subject: fork-o_direct-race

Think a thread writing constantly to the last 512 bytes of a page, while another thread reads and writes to/from the first 512 bytes of the page.
We can lose O_DIRECT reads (or any other get_user_pages write=1 I/O, not just bio/O_DIRECT) the very moment we mark any pte wrprotected, because a third, unrelated thread forks off a child.

This fixes it by copying the anon page (instead of sharing it) within fork, if there can be any direct I/O in flight to the page. That takes care of O_DIRECT reads (writes to memory, reads from disk).

Checking the page_count under the PT lock guarantees no get_user_pages could be running under us, because if somebody wants to write to the page, it has to break any cow first and that requires taking the PT lock in follow_page before increasing the page count. We are also guaranteed mapcount is 1 if fork is writeprotecting the pte, so the PT lock is enough to serialize against get_user_pages->get_page.

Another problem is O_DIRECT writes to disk: if the parent touches a shared anon page before the child, the child's do_wp_page will take over the anon page and map it read-write even though it is under direct-io from the parent thread pool. This requires de-cowing the pages in gup more aggressively (i.e. setting FOLL_WRITE temporarily on anon pages to de-cow them, and always assuming write=1 for the hugetlb follow_page version).

gup-fast is taken care of without flushing the smp-tlb for every parent pte wrprotected, by wrprotecting the pte before checking the page count vs mapcount; gup-fast will then re-check that the pte is still available in write mode after having increased the page count, solving the race without a flood of IPIs in fork. The COW triggered inside fork will run while the parent pte is readonly, to provide, as usual, the per-page atomic copy from parent to child during fork. However timings will be altered by having to copy the pages that might be under O_DIRECT.

Once this race is fixed, the testcase, instead of showing corruption, is capable of triggering a glibc NPTL race condition where fork_pre_cow is copying the internal nptl stack list in anonymous memory while some parent thread may be modifying it, which results in a userland deadlock when the fork-child tries to free the stacks before returning from fork. We are flushing the tlb after wrprotecting the pte that maps the anon page if we take the fork_pre_cow path, so we should be providing a per-page atomic copy from parent to child. The race can indeed also trigger without this patch and without fork_pre_cow; to trigger it, the wrprotect event must happen exactly in the middle of a list_add/list_del run by some NPTL thread that is mangling the stack list while fork runs. Some preliminary NPTL fix for this race exposed by this patch is happening in the glibc repository, but I think it'd be better to use a smart lock capable of jumping in and out of a signal handler, and not to go out-of-order rcu style, which sounds too complex.

The pagevec code calls get_page while the page is sitting in the pagevec (before it becomes PageLRU), and in doing so it can generate false positives; so, to avoid slowing down fork all the time even for pages that could never possibly be under O_DIRECT write=1, the PG_gup bitflag is added. This eliminates most of the overhead of the fix in fork.

I had to add src_vma/dst_vma to use proper ->mm pointers, and in the case of the track_pfn_vma_copy PAT code this fixes a bug, because previously vma was the dst_vma, while track_pfn_vma_copy has to run on the src_vma (the dst_vma in that place is guaranteed to have zero ptes instantiated/allocated).
There are two testcases that reproduce the bug, and they reproduce it both for regular anon pages and, using libhugetlbfs, for hugepages too. The patch works for both. The glibc race is also eventually reproducible both using anon pages and hugepages with the dma_thread testcase (the forkscrew testcase isn't capable of reproducing the nptl race condition in fork).

========== dma_thread.c ======= /* compile with 'gcc -g -o dma_thread dma_thread.c -lpthread' */ #define _GNU_SOURCE 1 #include <stdio.h> #include <stdlib.h> #include <fcntl.h> #include <unistd.h> #include <memory.h> #include <pthread.h> #include <getopt.h> #include <errno.h> #include <sys/types.h> #include <sys/wait.h> #define FILESIZE (12*1024*1024) #define READSIZE (1024*1024) #define FILENAME "test_%.04d.tmp" #define FILECOUNT 100 #define MIN_WORKERS 2 #define MAX_WORKERS 256 #define PAGE_SIZE 4096 #define true 1 #define false 0 typedef int bool; bool done = false; int workers = 2; #define PATTERN (0xfa) static void usage (void) { fprintf(stderr, "\nUsage: dma_thread [-h | -a <alignment> [ -w <workers>]\n" "\nWith no arguments, generate test files and exit.\n" "-h Display this help and exit.\n" "-a align read buffer to offset <alignment>.\n" "-w number of worker threads, 2 (default) to 256,\n" " defaults to number of cores.\n\n" "Run first with no arguments to generate files.\n" "Then run with -a <alignment> = 512 or 0. \n"); } typedef struct { pthread_t tid; int worker_number; int fd; int offset; int length; int pattern; unsigned char *buffer; } worker_t; void *worker_thread(void * arg) { int bytes_read; int i,k; worker_t *worker = (worker_t *) arg; int offset = worker->offset; int fd = worker->fd; unsigned char *buffer = worker->buffer; int pattern = worker->pattern; int length = worker->length; if (lseek(fd, offset, SEEK_SET) < 0) { fprintf(stderr, "Failed to lseek to %d on fd %d: %s.\n", offset, fd, strerror(errno)); exit(1); } bytes_read = read(fd, buffer, length); if (bytes_read != length) { fprintf(stderr, "read failed on fd %d: bytes_read %d, %s\n", fd, bytes_read, strerror(errno)); exit(1); } /* Corruption check */ for (i = 0; i < length; i++) { if (buffer[i] != pattern) { printf("Bad data at 0x%.06x: %p, \n", i, buffer + i); printf("Data dump starting at 0x%.06x:\n", i - 8); printf("Expect 0x%x followed by 0x%x:\n", pattern, PATTERN); for (k = 0; k < 16; k++) { printf("%02x ", buffer[i - 8 + k]); if (k == 7) { printf("\n"); } } printf("\n"); abort(); } } return 0; } void *fork_thread (void *arg) { pid_t pid; while (!done) { pid = fork(); if (pid == 0) { exit(0); } else if (pid < 0) { fprintf(stderr, "Failed to fork child.\n"); exit(1); } waitpid(pid, NULL, 0 ); usleep(100); } return NULL; } int main(int argc, char *argv[]) { unsigned char *buffer = NULL; char filename[1024]; int fd; bool dowrite = true; pthread_t fork_tid; int c, n, j; worker_t *worker; int align = 0; int offset, rc; workers = sysconf(_SC_NPROCESSORS_ONLN); while ((c = getopt(argc, argv, "a:hw:")) != -1) { switch (c) { case 'a': align = atoi(optarg); if (align < 0 || align > PAGE_SIZE) { printf("Bad alignment %d.\n", align); exit(1); } dowrite = false; break; case 'h': usage(); exit(0); break; case 'w': workers = atoi(optarg); if (workers < MIN_WORKERS || workers > MAX_WORKERS) { fprintf(stderr, "Worker count %d not between " "%d and %d, inclusive.\n", workers, MIN_WORKERS, MAX_WORKERS); usage(); exit(1); } dowrite = false; break; default: usage(); exit(1); } } if (argc > 1 && (optind < argc)) { fprintf(stderr, "Bad command line.\n"); usage(); 
exit(1); } if (dowrite) { buffer = malloc(FILESIZE); if (buffer == NULL) { fprintf(stderr, "Failed to malloc write buffer.\n"); exit(1); } for (n = 1; n <= FILECOUNT; n++) { sprintf(filename, FILENAME, n); fd = open(filename, O_RDWR|O_CREAT|O_TRUNC, 0666); if (fd < 0) { printf("create failed(%s): %s.\n", filename, strerror(errno)); exit(1); } memset(buffer, n, FILESIZE); printf("Writing file %s.\n", filename); if (write(fd, buffer, FILESIZE) != FILESIZE) { printf("write failed (%s)\n", filename); } close(fd); fd = -1; } free(buffer); buffer = NULL; printf("done\n"); exit(0); } printf("Using %d workers.\n", workers); worker = malloc(workers * sizeof(worker_t)); if (worker == NULL) { fprintf(stderr, "Failed to malloc worker array.\n"); exit(1); } for (j = 0; j < workers; j++) { worker[j].worker_number = j; } printf("Using alignment %d.\n", align); posix_memalign((void *)&buffer, PAGE_SIZE, READSIZE+ align); printf("Read buffer: %p.\n", buffer); for (n = 1; n <= FILECOUNT; n++) { sprintf(filename, FILENAME, n); for (j = 0; j < workers; j++) { if ((worker[j].fd = open(filename, O_RDONLY|O_DIRECT)) < 0) { fprintf(stderr, "Failed to open %s: %s.\n", filename, strerror(errno)); exit(1); } worker[j].pattern = n; } printf("Reading file %d.\n", n); for (offset = 0; offset < FILESIZE; offset += READSIZE) { memset(buffer, PATTERN, READSIZE + align); for (j = 0; j < workers; j++) { worker[j].offset = offset + j * PAGE_SIZE; worker[j].buffer = buffer + align + j * PAGE_SIZE; worker[j].length = PAGE_SIZE; } /* The final worker reads whatever is left over. */ worker[workers - 1].length = READSIZE - PAGE_SIZE * (workers - 1); done = 0; rc = pthread_create(&fork_tid, NULL, fork_thread, NULL); if (rc != 0) { fprintf(stderr, "Can't create fork thread: %s.\n", strerror(rc)); exit(1); } for (j = 0; j < workers; j++) { rc = pthread_create(&worker[j].tid, NULL, worker_thread, worker + j); if (rc != 0) { fprintf(stderr, "Can't create worker thread %d: %s.\n", j, strerror(rc)); exit(1); } } for (j = 0; j < workers; j++) { rc = pthread_join(worker[j].tid, NULL); if (rc != 0) { fprintf(stderr, "Failed to join worker thread %d: %s.\n", j, strerror(rc)); exit(1); } } /* Let the fork thread know it's ok to exit */ done = 1; rc = pthread_join(fork_tid, NULL); if (rc != 0) { fprintf(stderr, "Failed to join fork thread: %s.\n", strerror(rc)); exit(1); } } /* Close the fd's for the next file. */ for (j = 0; j < workers; j++) { close(worker[j].fd); } } return 0; } ========== dma_thread.c ======= ========== forkscrew.c ======== /* * Copyright 2009, Red Hat, Inc. * * Author: Jeff Moyer <jmoyer@redhat.com> * * This program attempts to expose a race between O_DIRECT I/O and the fork() * path in a multi-threaded program. In order to reliably reproduce the * problem, it is best to perform a dd from the device under test to /dev/null * as this makes the read I/O slow enough to orchestrate the problem. * * Running: ./forkscrew * * It is expected that a file name "data" exists in the current working * directory, and that its contents are something other than 0x2a. A simple * dd if=/dev/zero of=data bs=1M count=1 should be sufficient. 
*/
#define _GNU_SOURCE 1
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <pthread.h>
#include <libaio.h>

pthread_cond_t worker_cond = PTHREAD_COND_INITIALIZER;
pthread_mutex_t worker_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t fork_cond = PTHREAD_COND_INITIALIZER;
pthread_mutex_t fork_mutex = PTHREAD_MUTEX_INITIALIZER;

char *buffer;
int fd;

/* pattern filled into the in-memory buffer */
#define PATTERN 0x2a // '*'

void usage(void)
{
	fprintf(stderr, "\nUsage: forkscrew\n"
		"it is expected that a file named \"data\" exists in the current\n"
		"working directory. It should be at least 3*pagesize in size\n");
}

/* print the buffer as runs of equal bytes: "first - last: value" */
void dump_buffer(char *buf, int len)
{
	int i;
	int last_off, last_val;

	last_off = -1;
	last_val = -1;
	for (i = 0; i < len; i++) {
		if (last_off < 0) {
			last_off = i;
			last_val = buf[i];
			continue;
		}
		if (buf[i] != last_val) {
			printf("%d - %d: %d\n", last_off, i - 1, last_val);
			last_off = i;
			last_val = buf[i];
		}
	}
	if (last_off != len - 1)
		printf("%d - %d: %d\n", last_off, i - 1, last_val);
}

/* return 1 if any byte of the in-memory pattern is still present */
int check_buffer(char *bufp, int len, int pattern)
{
	int i;

	for (i = 0; i < len; i++) {
		if (bufp[i] == pattern)
			return 1;
	}
	return 0;
}

/* handshake with the aio worker, then fork while the read is in flight */
void *forker_thread(void *arg)
{
	pthread_mutex_lock(&fork_mutex);
	pthread_cond_signal(&fork_cond);
	pthread_cond_wait(&fork_cond, &fork_mutex);

	switch (fork()) {
	case 0:
		sleep(1);
		printf("child dumping buffer:\n");
		dump_buffer(buffer + 512, 2*getpagesize());
		exit(0);
	case -1:
		perror("fork");
		exit(1);
	default:
		break;
	}

	pthread_cond_signal(&fork_cond);
	pthread_mutex_unlock(&fork_mutex);
	wait(NULL);
	return (void *)0;
}

void *worker(void *arg)
{
	int first = (int)(long)arg;	/* arg is 0 or 1; cast via long */
	char *bufp;
	int pagesize = getpagesize();
	int ret;
	int corrupted = 0;

	if (first) {
		io_context_t aioctx;
		struct io_event event;
		struct iocb *iocb = malloc(sizeof *iocb);
		if (!iocb) {
			perror("malloc");
			exit(1);
		}
		memset(&aioctx, 0, sizeof(aioctx));
		ret = io_setup(1, &aioctx);
		if (ret != 0) {
			errno = -ret;
			perror("io_setup");
			exit(1);
		}
		bufp = buffer + 512;
		io_prep_pread(iocb, fd, bufp, pagesize, 0);

		/* submit the I/O */
		ret = io_submit(aioctx, 1, &iocb);
		if (ret != 1) {
			errno = -ret;
			perror("io_submit");
			exit(1);
		}

		/* tell the fork thread to run */
		pthread_mutex_lock(&fork_mutex);
		pthread_cond_signal(&fork_cond);
		/* wait for the fork to happen */
		pthread_cond_wait(&fork_cond, &fork_mutex);
		pthread_mutex_unlock(&fork_mutex);

		/* release the other worker to issue I/O */
		pthread_mutex_lock(&worker_mutex);
		pthread_cond_signal(&worker_cond);
		pthread_mutex_unlock(&worker_mutex);

		ret = io_getevents(aioctx, 1, 1, &event, NULL);
		if (ret != 1) {
			errno = -ret;
			perror("io_getevents");
			exit(1);
		}
		if (event.res != pagesize) {
			errno = -event.res;
			perror("read error");
			exit(1);
		}
		io_destroy(aioctx);

		/* check buffer, should be corrupt */
		if (check_buffer(bufp, pagesize, PATTERN)) {
			printf("worker 0 failed check\n");
			dump_buffer(bufp, pagesize);
			corrupted = 1;
		}
	} else {
		bufp = buffer + 512 + pagesize;

		pthread_mutex_lock(&worker_mutex);
		pthread_cond_signal(&worker_cond); /* tell main we're ready */
		/* wait for the first I/O and the fork */
		pthread_cond_wait(&worker_cond, &worker_mutex);
		pthread_mutex_unlock(&worker_mutex);

		/* submit overlapping I/O */
		ret = read(fd, bufp, pagesize);
		if (ret != pagesize) {
			perror("read");
			exit(1);
		}

		/* check buffer, should be fine */
		if (check_buffer(bufp, pagesize, PATTERN)) {
			printf("worker 1 failed check -- abnormal\n");
			dump_buffer(bufp, pagesize);
			corrupted = 1;
		}
	}

	return (void *)(long)corrupted;
}

int main(int argc, char **argv)
{
	pthread_t workers[2];
	pthread_t forker;
	int ret, rc = 0;
	void *thread_ret;
	int pagesize = getpagesize();

	fd = open("data", O_DIRECT|O_RDONLY);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	ret = posix_memalign((void **)&buffer, pagesize, 3 * pagesize);
	if (ret != 0) {
		errno = ret;
		perror("posix_memalign");
		exit(1);
	}
	memset(buffer, PATTERN, 3*pagesize);

	pthread_mutex_lock(&fork_mutex);
	ret = pthread_create(&forker, NULL, forker_thread, NULL);
	pthread_cond_wait(&fork_cond, &fork_mutex);
	pthread_mutex_unlock(&fork_mutex);

	pthread_mutex_lock(&worker_mutex);
	ret |= pthread_create(&workers[0], NULL, worker, (void *)0);
	if (ret) {
		perror("pthread_create");
		exit(1);
	}
	pthread_cond_wait(&worker_cond, &worker_mutex);
	pthread_mutex_unlock(&worker_mutex);

	ret = pthread_create(&workers[1], NULL, worker, (void *)1);
	if (ret != 0) {
		perror("pthread_create");
		exit(1);
	}

	pthread_join(forker, NULL);
	pthread_join(workers[0], &thread_ret);
	if (thread_ret != 0)
		rc = 1;
	pthread_join(workers[1], &thread_ret);
	if (thread_ret != 0)
		rc = 1;

	if (rc != 0) {
		printf("parent dumping full buffer\n");
		dump_buffer(buffer + 512, 2 * pagesize);
	}

	close(fd);
	free(buffer);
	exit(rc);
}
========== forkscrew.c ========

========== forkscrewreverse.c ========
#define _GNU_SOURCE 1
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <memory.h>
#include <pthread.h>
#include <getopt.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

#define FILESIZE (40*1024*1024)
#define BUFSIZE (40*1024*1024)

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static const char *filename = "file.dat";
static int fd;
static void *buffer;

#define PAGE_SIZE 4096

static void store(void)
{
	int i;

	if (usleep(50*1000) == -1)
		perror("usleep"), exit(1);
	printf("child storing\n");
	fflush(stdout);
	for (i = 0; i < BUFSIZE; i++)
		((char *)buffer)[i] = 0xff;
	_exit(0);
}

static void *writer(void *arg)
{
	int i;

	/* pthread calls return the error number; stash it in errno for perror */
	if ((errno = pthread_mutex_lock(&lock)) != 0)
		perror("pthread_mutex_lock"), exit(1);
	printf("thread writing\n");
	fflush(stdout);
	for (i = 0; i < FILESIZE / BUFSIZE; i++) {
		char *p = buffer;
		size_t count = BUFSIZE;
		ssize_t ret;
		do {
			ret = write(fd, p, count);
			if (ret == -1) {
				if (errno != EINTR)
					perror("write"), exit(1);
				ret = 0;
			}
			p += ret;	/* advance past a partial write */
			count -= ret;
		} while (count);
	}
	printf("thread writing done\n");
	fflush(stdout);
	if ((errno = pthread_mutex_unlock(&lock)) != 0)
		perror("pthread_mutex_unlock"), exit(1);
	return NULL;
}

int main(int argc, char *argv[])
{
	int i;
	int status;
	pthread_t writer_thread;
	pid_t store_proc;

	if ((errno = posix_memalign(&buffer, PAGE_SIZE, BUFSIZE)) != 0)
		perror("posix_memalign"), exit(1);
	printf("Write buffer: %p.\n", buffer);
	for (i = 0; i < BUFSIZE; i++)
		((char *)buffer)[i] = 0x00;

	fd = open(filename, O_RDWR|O_DIRECT);
	if (fd == -1)
		perror("open"), exit(1);

	if ((errno = pthread_mutex_lock(&lock)) != 0)
		perror("pthread_mutex_lock"), exit(1);
	if ((errno = pthread_create(&writer_thread, NULL, writer, NULL)) != 0)
		perror("pthread_create"), exit(1);

	store_proc = fork();
	if (store_proc == -1)
		perror("fork"), exit(1);
	if (!store_proc)
		store();

	if ((errno = pthread_mutex_unlock(&lock)) != 0)
		perror("pthread_mutex_unlock"), exit(1);

	if (usleep(10*1000) == -1)
		perror("usleep"), exit(1);
	printf("parent storing\n");
	fflush(stdout);
	for (i = 0; i < BUFSIZE; i++)
		((char *)buffer)[i] = 0x11;

	do {
		pid_t w;

		w = waitpid(store_proc, &status, WUNTRACED | WCONTINUED);
		if (w == -1)
			perror("waitpid"), exit(1);
	} while (!WIFEXITED(status) && !WIFSIGNALED(status));

	if ((errno = pthread_join(writer_thread, NULL)) != 0)
		perror("pthread_join"), exit(1);

	exit(0);
}
========== forkscrewreverse.c ========

Normally I test with "dma_thread -a 512 -w 40".
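Both reproducers come down to which side of the fork() resolves the COW of the pinned page first: the side that writes first takes a fresh copy, while the other side keeps the original physical page the DMA is still targeting. That much can be watched directly from userspace through /proc/self/pagemap, with no direct-IO involved at all. A minimal sketch, not part of the test suite above; it assumes the 2.6.25+ pagemap format (64-bit entries, PFN in bits 0-54), newer kernels may report a zero PFN to unprivileged readers, and the file and function names here are made up for illustration:

/* cowpfn.c: watch a COW resolve after fork() via /proc/self/pagemap. */
#define _GNU_SOURCE 1
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

static uint64_t pfn_of(void *addr)
{
	uint64_t entry;
	long pagesize = sysconf(_SC_PAGESIZE);
	off_t off = (off_t)((uintptr_t)addr / pagesize) * 8;
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0 || pread(fd, &entry, 8, off) != 8)
		perror("pagemap"), exit(1);
	close(fd);
	return entry & ((1ULL << 55) - 1);	/* bits 0-54: the PFN */
}

int main(void)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	char *p;

	if ((errno = posix_memalign((void **)&p, pagesize, pagesize)) != 0)
		perror("posix_memalign"), exit(1);
	p[0] = 1;	/* fault the anon page in */
	printf("shared pfn: %llx\n", (unsigned long long)pfn_of(p));
	fflush(stdout);

	switch (fork()) {
	case 0:
		sleep(1);	/* let the parent write first */
		printf("child  pfn: %llx (still the original page)\n",
		       (unsigned long long)pfn_of(p));
		_exit(0);
	case -1:
		perror("fork"), exit(1);
	default:
		p[0] = 2;	/* parent COWs and gets a fresh copy */
		printf("parent pfn: %llx (a new page)\n",
		       (unsigned long long)pfn_of(p));
		wait(NULL);
	}
	return 0;
}

The parent's PFN changes across its store while the child's still matches the pre-fork value; move the store into the child instead and the assignment flips.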
To reproduce or verify the fix with hugepages, run the tests like this:

LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ../test/dma_thread -a 512 -w 40
LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./forkscrew
LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./forkscrewreverse

This is a fixed version of the original patch from Nick Piggin. KSM has the same problem with fork, and it also checks the page_count after a ptep_clear_flush_notify (the _flush sends an smp-tlb-flush that stops gup-fast, so KSM doesn't depend on the gup-fast changes above that allow fork not to flush the smp-tlb for every pte it wrprotects; the _notify ensures all secondary ptes are zapped and any page pin is released by mmu-notifier subsystems that take page pins, like KVM currently does).

BTW, I guess it's pure luck that ENOSPC != VM_FAULT_OOM in hugetlb.c; mixing -errno with -VM_FAULT_* is total breakage that will have to be cleaned up (either don't use -ENOSPC, or use -ENOMEM instead of VM_FAULT_OOM). I didn't address it in this patch as it's unrelated.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

Removed mtk.manpages@gmail.com, linux-man@vger.kernel.org from previous CC list.

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c --- a/arch/x86/mm/gup.c +++ b/arch/x86/mm/gup.c @@ -89,6 +89,26 @@ static noinline int gup_pte_range(pmd_t VM_BUG_ON(!pfn_valid(pte_pfn(pte))); page = pte_page(pte); get_page(page); + if (PageAnon(page)) { + if (!PageGUP(page)) + SetPageGUP(page); + smp_mb(); + /* + * Fork doesn't want to flush the smp-tlb for + * every pte that it marks readonly but newly + * created shared anon pages cannot have + * direct-io going to them, so check if fork + * made the page shared before we took the + * page pin. + * de-cow to make direct read from memory safe.
+ */ + if ((pte_flags(gup_get_pte(ptep)) & + (mask | _PAGE_SPECIAL)) != (mask|_PAGE_RW)) { + put_page(page); + pte_unmap(ptep); + return 0; + } + } pages[*nr] = page; (*nr)++; @@ -98,24 +118,16 @@ static noinline int gup_pte_range(pmd_t return 1; } -static inline void get_head_page_multiple(struct page *page, int nr) -{ - VM_BUG_ON(page != compound_head(page)); - VM_BUG_ON(page_count(page) == 0); - atomic_add(nr, &page->_count); -} - -static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr, - unsigned long end, int write, struct page **pages, int *nr) +static noinline int gup_huge_pmd(pmd_t *pmdp, unsigned long addr, + unsigned long end, struct page **pages, int *nr) { unsigned long mask; - pte_t pte = *(pte_t *)&pmd; + pte_t pte = *(pte_t *)pmdp; struct page *head, *page; int refs; - mask = _PAGE_PRESENT|_PAGE_USER; - if (write) - mask |= _PAGE_RW; + /* de-cow to make direct read from memory safe */ + mask = _PAGE_PRESENT|_PAGE_USER|_PAGE_RW; if ((pte_flags(pte) & mask) != mask) return 0; /* hugepages are never "special" */ @@ -127,12 +139,21 @@ static noinline int gup_huge_pmd(pmd_t p page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT); do { VM_BUG_ON(compound_head(page) != head); + get_page(head); + if (!PageGUP(head)) + SetPageGUP(head); + smp_mb(); + if ((pte_flags(*(pte_t *)pmdp) & mask) != mask) { + put_page(page); + return 0; + } pages[*nr] = page; (*nr)++; page++; refs++; } while (addr += PAGE_SIZE, addr != end); - get_head_page_multiple(head, refs); + VM_BUG_ON(page_count(head) == 0); + VM_BUG_ON(head != compound_head(head)); return 1; } @@ -151,7 +172,7 @@ static int gup_pmd_range(pud_t pud, unsi if (pmd_none(pmd)) return 0; if (unlikely(pmd_large(pmd))) { - if (!gup_huge_pmd(pmd, addr, next, write, pages, nr)) + if (!gup_huge_pmd(pmdp, addr, next, pages, nr)) return 0; } else { if (!gup_pte_range(pmd, addr, next, write, pages, nr)) @@ -162,17 +183,16 @@ static int gup_pmd_range(pud_t pud, unsi return 1; } -static noinline int gup_huge_pud(pud_t pud, unsigned long addr, - unsigned long end, int write, struct page **pages, int *nr) +static noinline int gup_huge_pud(pud_t *pudp, unsigned long addr, + unsigned long end, struct page **pages, int *nr) { unsigned long mask; - pte_t pte = *(pte_t *)&pud; + pte_t pte = *(pte_t *)pudp; struct page *head, *page; int refs; - mask = _PAGE_PRESENT|_PAGE_USER; - if (write) - mask |= _PAGE_RW; + /* de-cow to make direct read from memory safe */ + mask = _PAGE_PRESENT|_PAGE_USER|_PAGE_RW; if ((pte_flags(pte) & mask) != mask) return 0; /* hugepages are never "special" */ @@ -184,12 +204,21 @@ static noinline int gup_huge_pud(pud_t p page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT); do { VM_BUG_ON(compound_head(page) != head); + get_page(head); + if (!PageGUP(head)) + SetPageGUP(head); + smp_mb(); + if ((pte_flags(*(pte_t *)pudp) & mask) != mask) { + put_page(page); + return 0; + } pages[*nr] = page; (*nr)++; page++; refs++; } while (addr += PAGE_SIZE, addr != end); - get_head_page_multiple(head, refs); + VM_BUG_ON(page_count(head) == 0); + VM_BUG_ON(head != compound_head(head)); return 1; } @@ -208,7 +237,7 @@ static int gup_pud_range(pgd_t pgd, unsi if (pud_none(pud)) return 0; if (unlikely(pud_large(pud))) { - if (!gup_huge_pud(pud, addr, next, write, pages, nr)) + if (!gup_huge_pud(pudp, addr, next, pages, nr)) return 0; } else { if (!gup_pmd_range(pud, addr, next, write, pages, nr)) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -20,8 +20,8 @@ int 
hugetlb_sysctl_handler(struct ctl_ta int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); int hugetlb_overcommit_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); int hugetlb_treat_movable_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); -int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *); -int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int, int); +int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *, struct vm_area_struct *); +int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int); void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, struct page *); void __unmap_hugepage_range(struct vm_area_struct *, @@ -75,9 +75,9 @@ static inline unsigned long hugetlb_tota return 0; } -#define follow_hugetlb_page(m,v,p,vs,a,b,i,w) ({ BUG(); 0; }) +#define follow_hugetlb_page(m,v,p,vs,a,b,i) ({ BUG(); 0; }) #define follow_huge_addr(mm, addr, write) ERR_PTR(-EINVAL) -#define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; }) +#define copy_hugetlb_page_range(src, dst, dst_vma, src_vma) ({ BUG(); 0; }) #define hugetlb_prefault(mapping, vma) ({ BUG(); 0; }) #define unmap_hugepage_range(vma, start, end, page) BUG() static inline void hugetlb_report_meminfo(struct seq_file *m) diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -789,7 +789,8 @@ void free_pgd_range(struct mmu_gather *t void free_pgd_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling); int copy_page_range(struct mm_struct *dst, struct mm_struct *src, - struct vm_area_struct *vma); + struct vm_area_struct *dst_vma, + struct vm_area_struct *src_vma); void unmap_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows); int follow_phys(struct vm_area_struct *vma, unsigned long address, @@ -1238,7 +1239,7 @@ int vm_insert_mixed(struct vm_area_struc unsigned long pfn); struct page *follow_page(struct vm_area_struct *, unsigned long address, - unsigned int foll_flags); + unsigned int *foll_flags); #define FOLL_WRITE 0x01 /* check pte is writable */ #define FOLL_TOUCH 0x02 /* mark page accessed */ #define FOLL_GET 0x04 /* do get_page on page */ diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -101,6 +101,7 @@ enum pageflags { #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR PG_uncached, /* Page has been mapped as uncached */ #endif + PG_gup, __NR_PAGEFLAGS, /* Filesystems */ @@ -195,6 +196,7 @@ PAGEFLAG(Private, private) __CLEARPAGEFL PAGEFLAG(Private, private) __CLEARPAGEFLAG(Private, private) __SETPAGEFLAG(Private, private) PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) +PAGEFLAG(GUP, gup) __CLEARPAGEFLAG(GUP, gup) __PAGEFLAG(SlobPage, slob_page) __PAGEFLAG(SlobFree, slob_free) diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -353,7 +353,7 @@ static int dup_mmap(struct mm_struct *mm rb_parent = &tmp->vm_rb; mm->map_count++; - retval = copy_page_range(mm, oldmm, mpnt); + retval = copy_page_range(mm, oldmm, tmp, mpnt); if (tmp->vm_ops && tmp->vm_ops->open) 
tmp->vm_ops->open(tmp); diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1695,20 +1695,37 @@ static void set_huge_ptep_writable(struc } } +/* Return the pagecache page at a given address within a VMA */ +static struct page *hugetlbfs_pagecache_page(struct hstate *h, + struct vm_area_struct *vma, unsigned long address) +{ + struct address_space *mapping; + pgoff_t idx; + + mapping = vma->vm_file->f_mapping; + idx = vma_hugecache_offset(h, vma, address); + + return find_lock_page(mapping, idx); +} + +static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pte_t *ptep, pte_t pte, + struct page *pagecache_page); int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, - struct vm_area_struct *vma) + struct vm_area_struct *dst_vma, + struct vm_area_struct *src_vma) { - pte_t *src_pte, *dst_pte, entry; + pte_t *src_pte, *dst_pte, entry, orig_entry; struct page *ptepage; unsigned long addr; - int cow; - struct hstate *h = hstate_vma(vma); + int cow, forcecow, oom; + struct hstate *h = hstate_vma(src_vma); unsigned long sz = huge_page_size(h); - cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE; + cow = (src_vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE; - for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) { + for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) { src_pte = huge_pte_offset(src, addr); if (!src_pte) continue; @@ -1720,22 +1737,76 @@ int copy_hugetlb_page_range(struct mm_st if (dst_pte == src_pte) continue; + oom = 0; spin_lock(&dst->page_table_lock); spin_lock_nested(&src->page_table_lock, SINGLE_DEPTH_NESTING); - if (!huge_pte_none(huge_ptep_get(src_pte))) { - if (cow) - huge_ptep_set_wrprotect(src, addr, src_pte); - entry = huge_ptep_get(src_pte); + orig_entry = entry = huge_ptep_get(src_pte); + forcecow = 0; + if (!huge_pte_none(entry)) { ptepage = pte_page(entry); get_page(ptepage); + if (cow && pte_write(entry)) { + huge_ptep_set_wrprotect(src, addr, src_pte); + smp_mb(); + if (PageGUP(ptepage)) + forcecow = 1; + entry = huge_ptep_get(src_pte); + } set_huge_pte_at(dst, addr, dst_pte, entry); } spin_unlock(&src->page_table_lock); + if (forcecow) { + if (unlikely(vma_needs_reservation(h, dst_vma, addr) + < 0)) + oom = 1; + else { + struct page *pg; + int cow_ret; + spin_unlock(&dst->page_table_lock); + /* force atomic copy from parent to child */ + flush_tlb_range(src_vma, addr, addr+sz); + /* + * Can use hstate from src_vma and src_vma + * because the hugetlbfs pagecache will + * be the same for both src_vma and dst_vma. + */ + pg = hugetlbfs_pagecache_page(h, + src_vma, + addr); + spin_lock_nested(&dst->page_table_lock, + SINGLE_DEPTH_NESTING); + cow_ret = hugetlb_cow(dst, dst_vma, addr, + dst_pte, entry, + pg); + /* + * We hold mmap_sem in write mode and + * the VM doesn't know about hugepages + * so the src_pte/dst_pte can't change + * from under us even without both + * page_table_lock hold the whole time. + */ + BUG_ON(!pte_same(huge_ptep_get(src_pte), + entry)); + set_huge_pte_at(src, addr, + src_pte, + orig_entry); + if (cow_ret) + oom = 1; + } + } spin_unlock(&dst->page_table_lock); + if (oom) + goto nomem; } return 0; nomem: + /* + * Want this to also be able to return -ENOSPC? Then stop the + * mess of mixing -VM_FAULT_ and -ENOSPC retvals and be + * consistent returning -ENOMEM instead of -VM_FAULT_OOM in + * alloc_huge_page. 
+ */ return -ENOMEM; } @@ -1943,19 +2014,6 @@ retry_avoidcopy: return 0; } -/* Return the pagecache page at a given address within a VMA */ -static struct page *hugetlbfs_pagecache_page(struct hstate *h, - struct vm_area_struct *vma, unsigned long address) -{ - struct address_space *mapping; - pgoff_t idx; - - mapping = vma->vm_file->f_mapping; - idx = vma_hugecache_offset(h, vma, address); - - return find_lock_page(mapping, idx); -} - static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pte_t *ptep, int write_access) { @@ -2160,8 +2218,7 @@ static int huge_zeropage_ok(pte_t *ptep, int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, struct page **pages, struct vm_area_struct **vmas, - unsigned long *position, int *length, int i, - int write) + unsigned long *position, int *length, int i) { unsigned long pfn_offset; unsigned long vaddr = *position; @@ -2181,16 +2238,16 @@ int follow_hugetlb_page(struct mm_struct * first, for the page indexing below to work. */ pte = huge_pte_offset(mm, vaddr & huge_page_mask(h)); - if (huge_zeropage_ok(pte, write, shared)) + if (huge_zeropage_ok(pte, 1, shared)) zeropage_ok = 1; if (!pte || (huge_pte_none(huge_ptep_get(pte)) && !zeropage_ok) || - (write && !pte_write(huge_ptep_get(pte)))) { + !pte_write(huge_ptep_get(pte))) { int ret; spin_unlock(&mm->page_table_lock); - ret = hugetlb_fault(mm, vma, vaddr, write); + ret = hugetlb_fault(mm, vma, vaddr, 1); spin_lock(&mm->page_table_lock); if (!(ret & VM_FAULT_ERROR)) continue; @@ -2207,8 +2264,11 @@ same_page: if (pages) { if (zeropage_ok) pages[i] = ZERO_PAGE(0); - else + else { pages[i] = mem_map_offset(page, pfn_offset); + if (!PageGUP(page)) + SetPageGUP(page); + } get_page(pages[i]); } diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -538,14 +538,16 @@ out: * covered by this vma. */ -static inline void +static inline int copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma, + pte_t *dst_pte, pte_t *src_pte, + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long addr, int *rss) { - unsigned long vm_flags = vma->vm_flags; + unsigned long vm_flags = src_vma->vm_flags; pte_t pte = *src_pte; struct page *page; + int forcecow = 0; /* pte contains position in swap or file, so copy. */ if (unlikely(!pte_present(pte))) { @@ -576,15 +578,6 @@ copy_one_pte(struct mm_struct *dst_mm, s } /* - * If it's a COW mapping, write protect it both - * in the parent and the child - */ - if (is_cow_mapping(vm_flags)) { - ptep_set_wrprotect(src_mm, addr, src_pte); - pte = pte_wrprotect(pte); - } - - /* * If it's a shared mapping, mark it clean in * the child */ @@ -592,27 +585,87 @@ copy_one_pte(struct mm_struct *dst_mm, s pte = pte_mkclean(pte); pte = pte_mkold(pte); - page = vm_normal_page(vma, addr, pte); + /* + * If it's a COW mapping, write protect it both + * in the parent and the child. + */ + if (is_cow_mapping(vm_flags) && pte_write(pte)) { + /* + * Serialization against gup-fast happens by + * wrprotecting the pte and checking the PG_gup flag + * and the number of page pins after that. If gup-fast + * boosts the page_count after we checked it, it will + * also take the slow path because it will find the + * pte wrprotected. 
+ */ + ptep_set_wrprotect(src_mm, addr, src_pte); + } + + page = vm_normal_page(src_vma, addr, pte); if (page) { get_page(page); - page_dup_rmap(page, vma, addr); + page_dup_rmap(page, dst_vma, addr); + if (is_cow_mapping(vm_flags) && pte_write(pte) && + PageAnon(page)) { + smp_mb(); + if (PageGUP(page)) { + if (unlikely(!trylock_page(page))) + forcecow = 1; + else { + BUG_ON(page_mapcount(page) != 2); + if (unlikely(page_count(page) != + page_mapcount(page) + + !!PageSwapCache(page))) + forcecow = 1; + unlock_page(page); + } + } + } rss[!!PageAnon(page)]++; + } + + if (is_cow_mapping(vm_flags) && pte_write(pte)) { + pte = pte_wrprotect(pte); + if (forcecow) { + /* force atomic copy from parent to child */ + flush_tlb_page(src_vma, addr); + /* + * Don't set the dst_pte here to be + * safer, as fork_pre_cow might return + * -EAGAIN and restart. + */ + goto out; + } } out_set_pte: set_pte_at(dst_mm, addr, dst_pte, pte); +out: + return forcecow; } +static int fork_pre_cow(struct mm_struct *dst_mm, + struct mm_struct *src_mm, + struct vm_area_struct *dst_vma, + struct vm_area_struct *src_vma, + unsigned long address, + pte_t **dst_ptep, pte_t **src_ptep, + spinlock_t **dst_ptlp, spinlock_t **src_ptlp, + pmd_t *dst_pmd, pmd_t *src_pmd); + static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma, + pmd_t *dst_pmd, pmd_t *src_pmd, + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long addr, unsigned long end) { pte_t *src_pte, *dst_pte; spinlock_t *src_ptl, *dst_ptl; int progress = 0; int rss[2]; + int forcecow; again: + forcecow = 0; rss[1] = rss[0] = 0; dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl); if (!dst_pte) @@ -623,6 +676,9 @@ again: arch_enter_lazy_mmu_mode(); do { + if (forcecow) + break; + /* * We are holding two locks at this point - either of them * could generate latencies in another task on another CPU. @@ -637,9 +693,38 @@ again: progress++; continue; } - copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss); + forcecow = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, + dst_vma, src_vma, addr, rss); progress += 8; } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end); + + if (unlikely(forcecow)) { + pte_t *_src_pte = src_pte-1, *_dst_pte = dst_pte-1; + /* + * Try to COW the child page as direct I/O is working + * on the parent page, and so we've to mark the parent + * pte read-write before dropping the PT lock and + * mmap_sem to avoid the page to be cowed in the + * parent and any direct I/O to get lost. 
+ */ + forcecow = fork_pre_cow(dst_mm, src_mm, + dst_vma, src_vma, + addr-PAGE_SIZE, + &_dst_pte, &_src_pte, + &dst_ptl, &src_ptl, + dst_pmd, src_pmd); + src_pte = _src_pte + 1; + dst_pte = _dst_pte + 1; + /* after the page copy set the parent pte writeable again */ + set_pte_at(src_mm, addr-PAGE_SIZE, src_pte-1, + pte_mkwrite(*(src_pte-1))); + if (unlikely(forcecow == -EAGAIN)) { + dst_pte--; + src_pte--; + addr -= PAGE_SIZE; + rss[1]--; + } + } arch_leave_lazy_mmu_mode(); spin_unlock(src_ptl); @@ -647,13 +732,16 @@ again: add_mm_rss(dst_mm, rss[0], rss[1]); pte_unmap_unlock(dst_pte - 1, dst_ptl); cond_resched(); + if (unlikely(forcecow == -ENOMEM)) + return -ENOMEM; if (addr != end) goto again; return 0; } static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma, + pud_t *dst_pud, pud_t *src_pud, + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long addr, unsigned long end) { pmd_t *src_pmd, *dst_pmd; @@ -668,14 +756,15 @@ static inline int copy_pmd_range(struct if (pmd_none_or_clear_bad(src_pmd)) continue; if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd, - vma, addr, next)) + dst_vma, src_vma, addr, next)) return -ENOMEM; } while (dst_pmd++, src_pmd++, addr = next, addr != end); return 0; } static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma, + pgd_t *dst_pgd, pgd_t *src_pgd, + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long addr, unsigned long end) { pud_t *src_pud, *dst_pud; @@ -690,19 +779,20 @@ static inline int copy_pud_range(struct if (pud_none_or_clear_bad(src_pud)) continue; if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud, - vma, addr, next)) + dst_vma, src_vma, addr, next)) return -ENOMEM; } while (dst_pud++, src_pud++, addr = next, addr != end); return 0; } int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - struct vm_area_struct *vma) + struct vm_area_struct *dst_vma, + struct vm_area_struct *src_vma) { pgd_t *src_pgd, *dst_pgd; unsigned long next; - unsigned long addr = vma->vm_start; - unsigned long end = vma->vm_end; + unsigned long addr = src_vma->vm_start; + unsigned long end = src_vma->vm_end; int ret; /* @@ -711,20 +801,21 @@ int copy_page_range(struct mm_struct *ds * readonly mappings. The tradeoff is that copy_page_range is more * efficient than faulting. */ - if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) { - if (!vma->anon_vma) + if (!(src_vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) { + if (!src_vma->anon_vma) return 0; } - if (is_vm_hugetlb_page(vma)) - return copy_hugetlb_page_range(dst_mm, src_mm, vma); + if (is_vm_hugetlb_page(src_vma)) + return copy_hugetlb_page_range(dst_mm, src_mm, + dst_vma, src_vma); - if (unlikely(is_pfn_mapping(vma))) { + if (unlikely(is_pfn_mapping(src_vma))) { /* * We do not free on error cases below as remove_vma * gets called on error from higher level routine */ - ret = track_pfn_vma_copy(vma); + ret = track_pfn_vma_copy(src_vma); if (ret) return ret; } @@ -735,7 +826,7 @@ int copy_page_range(struct mm_struct *ds * parent mm. And a permission downgrade will only happen if * is_cow_mapping() returns true. 
*/ - if (is_cow_mapping(vma->vm_flags)) + if (is_cow_mapping(src_vma->vm_flags)) mmu_notifier_invalidate_range_start(src_mm, addr, end); ret = 0; @@ -746,15 +837,15 @@ int copy_page_range(struct mm_struct *ds if (pgd_none_or_clear_bad(src_pgd)) continue; if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, - vma, addr, next))) { + dst_vma, src_vma, addr, next))) { ret = -ENOMEM; break; } } while (dst_pgd++, src_pgd++, addr = next, addr != end); - if (is_cow_mapping(vma->vm_flags)) + if (is_cow_mapping(src_vma->vm_flags)) mmu_notifier_invalidate_range_end(src_mm, - vma->vm_start, end); + src_vma->vm_start, end); return ret; } @@ -1091,7 +1182,7 @@ EXPORT_SYMBOL_GPL(zap_vma_ptes); * Do a quick page-table lookup for a single page. */ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, - unsigned int flags) + unsigned int *flagsp) { pgd_t *pgd; pud_t *pud; @@ -1100,6 +1191,7 @@ struct page *follow_page(struct vm_area_ spinlock_t *ptl; struct page *page; struct mm_struct *mm = vma->vm_mm; + unsigned long flags = *flagsp; page = follow_huge_addr(mm, address, flags & FOLL_WRITE); if (!IS_ERR(page)) { @@ -1145,8 +1237,19 @@ struct page *follow_page(struct vm_area_ if (unlikely(!page)) goto bad_page; - if (flags & FOLL_GET) + if (flags & FOLL_GET) { + if (PageAnon(page)) { + /* de-cow to make direct read from memory safe */ + if (!pte_write(pte)) { + page = NULL; + *flagsp |= FOLL_WRITE; + goto unlock; + } + if (!PageGUP(page)) + SetPageGUP(page); + } get_page(page); + } if (flags & FOLL_TOUCH) { if ((flags & FOLL_WRITE) && !pte_dirty(pte) && !PageDirty(page)) @@ -1275,7 +1378,7 @@ int __get_user_pages(struct task_struct if (is_vm_hugetlb_page(vma)) { i = follow_hugetlb_page(mm, vma, pages, vmas, - &start, &len, i, write); + &start, &len, i); continue; } @@ -1303,7 +1406,7 @@ int __get_user_pages(struct task_struct foll_flags |= FOLL_WRITE; cond_resched(); - while (!(page = follow_page(vma, start, foll_flags))) { + while (!(page = follow_page(vma, start, &foll_flags))) { int ret; ret = handle_mm_fault(mm, vma, start, foll_flags & FOLL_WRITE); @@ -1865,6 +1968,81 @@ static inline void cow_user_page(struct flush_dcache_page(dst); } else copy_user_highpage(dst, src, va, vma); +} + +static int fork_pre_cow(struct mm_struct *dst_mm, + struct mm_struct *src_mm, + struct vm_area_struct *dst_vma, + struct vm_area_struct *src_vma, + unsigned long address, + pte_t **dst_ptep, pte_t **src_ptep, + spinlock_t **dst_ptlp, spinlock_t **src_ptlp, + pmd_t *dst_pmd, pmd_t *src_pmd) +{ + pte_t _src_pte, _dst_pte; + struct page *old_page, *new_page; + + _src_pte = **src_ptep; + _dst_pte = **dst_ptep; + old_page = vm_normal_page(src_vma, address, **src_ptep); + BUG_ON(!old_page); + get_page(old_page); + arch_leave_lazy_mmu_mode(); + spin_unlock(*src_ptlp); + pte_unmap_nested(*src_ptep); + pte_unmap_unlock(*dst_ptep, *dst_ptlp); + + new_page = alloc_page_vma(GFP_HIGHUSER, dst_vma, address); + if (unlikely(!new_page)) { + *dst_ptep = pte_offset_map_lock(dst_mm, dst_pmd, address, + dst_ptlp); + *src_ptep = pte_offset_map_nested(src_pmd, address); + *src_ptlp = pte_lockptr(src_mm, src_pmd); + spin_lock_nested(*src_ptlp, SINGLE_DEPTH_NESTING); + arch_enter_lazy_mmu_mode(); + return -ENOMEM; + } + cow_user_page(new_page, old_page, address, dst_vma); + + *dst_ptep = pte_offset_map_lock(dst_mm, dst_pmd, address, dst_ptlp); + *src_ptep = pte_offset_map_nested(src_pmd, address); + *src_ptlp = pte_lockptr(src_mm, src_pmd); + spin_lock_nested(*src_ptlp, SINGLE_DEPTH_NESTING); + 
arch_enter_lazy_mmu_mode(); + + /* + * src pte can unmapped by the VM from under us after dropping + * the src_ptlp but it can't be cowed from under us as fork + * holds the mmap_sem in write mode. + */ + if (!pte_same(**src_ptep, _src_pte)) + goto eagain; + if (!pte_same(**dst_ptep, _dst_pte)) + goto eagain; + + page_remove_rmap(old_page); + page_cache_release(old_page); + page_cache_release(old_page); + + __SetPageUptodate(new_page); + flush_cache_page(src_vma, address, pte_pfn(**src_ptep)); + _dst_pte = mk_pte(new_page, dst_vma->vm_page_prot); + _dst_pte = maybe_mkwrite(pte_mkdirty(_dst_pte), dst_vma); + page_add_new_anon_rmap(new_page, dst_vma, address); + set_pte_at(dst_mm, address, *dst_ptep, _dst_pte); + update_mmu_cache(dst_vma, address, _dst_pte); + return 0; + +eagain: + page_cache_release(old_page); + page_cache_release(new_page); + /* + * Later we'll repeat the copy of this pte, so here we've to + * undo the mapcount and page count taken in copy_one_pte. + */ + page_remove_rmap(old_page); + page_cache_release(old_page); + return -EAGAIN; } /* diff --git a/mm/swap.c b/mm/swap.c --- a/mm/swap.c +++ b/mm/swap.c @@ -64,6 +64,8 @@ static void put_compound_page(struct pag if (put_page_testzero(page)) { compound_page_dtor *dtor; + if (PageGUP(page)) + __ClearPageGUP(page); dtor = get_compound_page_dtor(page); (*dtor)(page); } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
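A note on the barrier pairing the patch above depends on. gup-fast takes the page pin and sets PG_gup, issues smp_mb(), then rechecks that the pte is still writable; fork wrprotects the pte, issues smp_mb(), then checks PG_gup and the pin count. That is the classic store-buffering shape: with a full barrier between the store and the load on both sides, at least one side must observe the other's store, so either gup-fast bails out to the slow path or fork forces the copy. Here is a small userspace model of just that invariant, a sketch with C11 atomics standing in for the kernel primitives (all names are illustrative, nothing here is kernel code):

/* Model of the fork vs gup-fast handshake: the paired fences forbid
 * the outcome where fork misses the pin AND gup-fast misses the
 * wrprotect at the same time. */
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

static atomic_int pinned, wrprotected;
static int gup_saw_wrprotect, fork_saw_pin;

static void *gup_fast_side(void *arg)
{
	atomic_store_explicit(&pinned, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() in gup_pte_range */
	gup_saw_wrprotect = atomic_load_explicit(&wrprotected,
						 memory_order_relaxed);
	return NULL;
}

static void *fork_side(void *arg)
{
	atomic_store_explicit(&wrprotected, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() in copy_one_pte */
	fork_saw_pin = atomic_load_explicit(&pinned, memory_order_relaxed);
	return NULL;
}

int main(void)
{
	int i;

	for (i = 0; i < 100000; i++) {
		pthread_t a, b;

		atomic_store(&pinned, 0);
		atomic_store(&wrprotected, 0);
		pthread_create(&a, NULL, gup_fast_side, NULL);
		pthread_create(&b, NULL, fork_side, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		if (!gup_saw_wrprotect && !fork_saw_pin) {
			printf("iteration %d: both sides missed the other's store\n", i);
			return 1;
		}
	}
	printf("ok: every race had at least one side observe the other\n");
	return 0;
}

It builds with gcc -std=c11 -pthread. Dropping either fence re-admits the both-loads-see-zero outcome, which is exactly the window where fork misses the pin while gup-fast misses the wrprotect.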
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-16 13:56 ` Andrea Arcangeli @ 2009-03-16 16:01 ` Nick Piggin 0 siblings, 0 replies; 83+ messages in thread From: Nick Piggin @ 2009-03-16 16:01 UTC (permalink / raw) To: Andrea Arcangeli Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm On Tuesday 17 March 2009 00:56:54 Andrea Arcangeli wrote: > On Sat, Mar 14, 2009 at 03:59:11PM +1100, Nick Piggin wrote: > > It does touch gup-fast, but it just adds one branch and no barrier in the > > My question is what trick to you use to stop gup-fast from returning > the page mapped read-write by the pte if gup-fast doesn't take any > lock whatsoever, it doesn't set any bit in any page or vma, and it > doesn't recheck the pte is still viable after having set any bit on > page or vmas, and you still don't send a flood of ipis from fork fast > path (no race case). If the page is not marked PageDontCOW, then it decows it, which gives synchronisation against fork. If it is marked PageDontCOW, then it can't possibly be COWed by fork, previous or subsequent. > > Possibly that's the right way to go. Depends if it is in the slightest > > performance critical. If not, I would just let do_wp_page do the work > > to avoid a little bit of logic, but either way is not a big deal to me. > > fork is less performance critical than do_wp_page, still in fork > microbenchmark no slowdown is measured with the patch. Before I > introduced PG_gup there were false positives triggered by the pagevec > temporary pins, that was measurable, after PG_gup the fast path is OK. Mine doesn't get false positives, but it doesn't try to reintroduce pages as COW candidates after the get_user_pages is finished. This is how it is simpler than your patch. > unaffected (I've still to measure gup-fast slowdown in setting PG_gup > but I'm rather optimistic that you're understimating the cost of > walking 4 layers of pagetables compared to a locked op on a l1 > exclusive cacheline, so I think it'll be lost in the noise). I think > the big thing of gup-fast is primarly in not having to search vmas, > and in turn to take any shared lock like mmap_sem/PT lock and to scale > on a page level with just a get-page being the troublesome cacheline. You lost the get_head_page_multiple too for huge pages. This is the path that Oracle/DB2 will always go down when running any benchmarks. At the current DIO_PAGES size, this means adding up to 63 atomics, 64 mfences, and and touching cachelines of 63-64 of the non-head struct pages per request. OK probably even those databases don't get a chance to do such big IOs, but they definitely will be doing larger than 4K at a time in many cases (probably even their internal block size can be larger). > > One side of the race is direct IO read writing to fork child page. > > The other side of the race is fork child page write leaking into > > the direct IO. > > > > My patch solves both sides by de-cowing *any* COW page before it > > may be returned from get_user_pages (for read or write). > > I see what you mean now. If you read the comment of my patch you'll > see I explicitly intended that only people writing into memory with > gup was troublesome here. Like you point out, using gup for _reading_ > from memory is troublesome as well if child writes to those > pages. This is kind of a lower problem because the major issue is that > fork is enough to generate memory corruption even if the child isn't > touching those pages. 
The reverse race requires the child to write to > those pages so I guess it never triggered in real life apps. But > nevertheless I totally agree if we fix the write-to-memory-with-gup > we've to fix the read-from-memory-with-gup. Yes. > Below I updated my patch and relative commit header to fix the reverse > race too. However I had to enlarge the buffer to 40M to reproduce with > your testcase because my HD was too fast otherwise. You're using a solid state disk? :) > diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c > --- a/arch/x86/mm/gup.c > +++ b/arch/x86/mm/gup.c > @@ -89,6 +89,26 @@ static noinline int gup_pte_range(pmd_t > VM_BUG_ON(!pfn_valid(pte_pfn(pte))); > page = pte_page(pte); > get_page(page); > + if (PageAnon(page)) { > + if (!PageGUP(page)) > + SetPageGUP(page); > + smp_mb(); > + /* > + * Fork doesn't want to flush the smp-tlb for > + * every pte that it marks readonly but newly > + * created shared anon pages cannot have > + * direct-io going to them, so check if fork > + * made the page shared before we taken the > + * page pin. > + * de-cow to make direct read from memory safe. > + */ > + if ((pte_flags(gup_get_pte(ptep)) & > + (mask | _PAGE_SPECIAL)) != (mask|_PAGE_RW)) { > + put_page(page); > + pte_unmap(ptep); > + return 0; Hmm, so this is disabling fast-gup for RO anonymous ranges? I guess this seems like it covers the reverse race then... btw powerpc has a slightly different fast-gup scheme where it isn't actually holding off TLB shootdown. I don't think you need to do anything too different, but better double check. And here is my improved patch. Same logic but just streamlines the decow stuff a bit and cuts out some unneeded stuff. This should be pretty complete for 4K pages. Except I'm a little unsure about the "ptes don't match, retry" path of the decow procedure. Lots of tricky little details to get right... And I'm not quite sure that you got this right either -- vmscan.c can turn the child pte into a swap pte here, right? In which case I think you need to drop its swapcache entry don't you? I don't know if there are other ways it could be changed, but I import the full zap_pte function over just in case. -- Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h 2009-03-14 02:48:06.000000000 +1100 +++ linux-2.6/include/linux/mm.h 2009-03-17 00:37:59.000000000 +1100 @@ -789,7 +789,7 @@ int walk_page_range(unsigned long addr, void free_pgd_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling); int copy_page_range(struct mm_struct *dst, struct mm_struct *src, - struct vm_area_struct *vma); + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma); void unmap_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows); int follow_phys(struct vm_area_struct *vma, unsigned long address, Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c 2009-03-14 02:48:06.000000000 +1100 +++ linux-2.6/mm/memory.c 2009-03-17 02:43:21.000000000 +1100 @@ -533,12 +533,171 @@ out: } /* + * Do pte_mkwrite, but only if the vma says VM_WRITE. We do this when + * servicing faults for write access. In the normal case, do always want + * pte_mkwrite. But get_user_pages can cause write faults for mappings + * that do not have writing enabled, when used by access_process_vm. 
+ */ +static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) +{ + if (likely(vma->vm_flags & VM_WRITE)) + pte = pte_mkwrite(pte); + return pte; +} + +static void cow_user_page(struct page *dst, struct page *src, + unsigned long va, struct vm_area_struct *vma) +{ + /* + * If the source page was a PFN mapping, we don't have + * a "struct page" for it. We do a best-effort copy by + * just copying from the original user address. If that + * fails, we just zero-fill it. Live with it. + */ + if (unlikely(!src)) { + void *kaddr = kmap_atomic(dst, KM_USER0); + void __user *uaddr = (void __user *)(va & PAGE_MASK); + + /* + * This really shouldn't fail, because the page is there + * in the page tables. But it might just be unreadable, + * in which case we just give up and fill the result with + * zeroes. + */ + if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) + memset(kaddr, 0, PAGE_SIZE); + kunmap_atomic(kaddr, KM_USER0); + flush_dcache_page(dst); + } else + copy_user_highpage(dst, src, va, vma); +} + +void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep) +{ + pte_t pte = *ptep; + + if (pte_present(pte)) { + struct page *page; + + flush_cache_page(vma, addr, pte_pfn(pte)); + pte = ptep_clear_flush(vma, addr, ptep); + page = vm_normal_page(vma, addr, pte); + if (page) { + if (pte_dirty(pte)) + set_page_dirty(page); + page_remove_rmap(page); + page_cache_release(page); + update_hiwater_rss(mm); + if (PageAnon(page)) + dec_mm_counter(mm, anon_rss); + else + dec_mm_counter(mm, file_rss); + } + } else { + if (!pte_file(pte)) + free_swap_and_cache(pte_to_swp_entry(pte)); + pte_clear_not_present_full(mm, addr, ptep, 0); + } +} +/* + * breaks COW of child pte that has been marked COW by fork(). + * Must be called with the child's ptl held and pte mapped. + * Returns 0 on success with ptl held and pte mapped. + * -ENOMEM on OOM failure, or -EAGAIN if something changed under us. + * ptl dropped and pte unmapped on error cases. + */ +static noinline int decow_one_pte(struct mm_struct *mm, pte_t *ptep, pmd_t *pmd, + spinlock_t *ptl, struct vm_area_struct *vma, + unsigned long address) +{ + pte_t pte = *ptep; + struct page *page, *new_page; + int ret; + + BUG_ON(!pte_present(pte)); + BUG_ON(pte_write(pte)); + + page = vm_normal_page(vma, address, pte); + BUG_ON(!page); + BUG_ON(!PageAnon(page)); + BUG_ON(!PageDontCOW(page)); + + /* The following code comes from do_wp_page */ + page_cache_get(page); + pte_unmap_unlock(pte, ptl); + + if (unlikely(anon_vma_prepare(vma))) + goto oom; + VM_BUG_ON(page == ZERO_PAGE(0)); + new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); + if (!new_page) + goto oom; + /* + * Don't let another task, with possibly unlocked vma, + * keep the mlocked page. + */ + if (vma->vm_flags & VM_LOCKED) { + lock_page(page); /* for LRU manipulation */ + clear_page_mlock(page); + unlock_page(page); + } + cow_user_page(new_page, page, address, vma); + __SetPageUptodate(new_page); + + if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL)) + goto oom_free_new; + + /* + * Re-check the pte - we dropped the lock + */ + ptep = pte_offset_map_lock(mm, pmd, address, &ptl); + if (pte_same(*ptep, pte)) { + pte_t entry; + + flush_cache_page(vma, address, pte_pfn(pte)); + entry = mk_pte(new_page, vma->vm_page_prot); + entry = maybe_mkwrite(pte_mkdirty(entry), vma); + /* + * Clear the pte entry and flush it first, before updating the + * pte with the new entry. 
This will avoid a race condition + * seen in the presence of one thread doing SMC and another + * thread doing COW. + */ + ptep_clear_flush_notify(vma, address, ptep); + page_add_new_anon_rmap(new_page, vma, address); + set_pte_at(mm, address, ptep, entry); + + /* See comment in do_wp_page */ + page_remove_rmap(page); + page_cache_release(page); + ret = 0; + } else { + if (!pte_none(*ptep)) + zap_pte(mm, vma, address, ptep); + pte_unmap_unlock(pte, ptl); + mem_cgroup_uncharge_page(new_page); + page_cache_release(new_page); + ret = -EAGAIN; + } + page_cache_release(page); + + return ret; + +oom_free_new: + page_cache_release(new_page); +oom: + page_cache_release(page); + return -ENOMEM; +} + +/* * copy one vm_area from one task to the other. Assumes the page tables * already present in the new task to be cleared in the whole range * covered by this vma. */ -static inline void +static inline int copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma, unsigned long addr, int *rss) @@ -546,6 +705,7 @@ copy_one_pte(struct mm_struct *dst_mm, s unsigned long vm_flags = vma->vm_flags; pte_t pte = *src_pte; struct page *page; + int ret = 0; /* pte contains position in swap or file, so copy. */ if (unlikely(!pte_present(pte))) { @@ -597,20 +757,26 @@ copy_one_pte(struct mm_struct *dst_mm, s get_page(page); page_dup_rmap(page, vma, addr); rss[!!PageAnon(page)]++; + if (unlikely(PageDontCOW(page))) + ret = 1; } out_set_pte: set_pte_at(dst_mm, addr, dst_pte, pte); + + return ret; } static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma, + pmd_t *dst_pmd, pmd_t *src_pmd, + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long addr, unsigned long end) { pte_t *src_pte, *dst_pte; spinlock_t *src_ptl, *dst_ptl; int progress = 0; int rss[2]; + int decow; again: rss[1] = rss[0] = 0; @@ -637,7 +803,10 @@ again: progress++; continue; } - copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss); + decow = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, + src_vma, addr, rss); + if (unlikely(decow)) + goto decow; progress += 8; } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end); @@ -646,14 +815,31 @@ again: pte_unmap_nested(src_pte - 1); add_mm_rss(dst_mm, rss[0], rss[1]); pte_unmap_unlock(dst_pte - 1, dst_ptl); +next: cond_resched(); if (addr != end) goto again; return 0; + +decow: + arch_leave_lazy_mmu_mode(); + spin_unlock(src_ptl); + pte_unmap_nested(src_pte); + add_mm_rss(dst_mm, rss[0], rss[1]); + decow = decow_one_pte(dst_mm, dst_pte, dst_pmd, dst_ptl, dst_vma, addr); + if (decow == -ENOMEM) + return -ENOMEM; + if (decow == -EAGAIN) + goto again; + pte_unmap_unlock(dst_pte, dst_ptl); + cond_resched(); + addr += PAGE_SIZE; + goto next; } static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma, + pud_t *dst_pud, pud_t *src_pud, + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long addr, unsigned long end) { pmd_t *src_pmd, *dst_pmd; @@ -668,14 +854,15 @@ static inline int copy_pmd_range(struct if (pmd_none_or_clear_bad(src_pmd)) continue; if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd, - vma, addr, next)) + dst_vma, src_vma, addr, next)) return -ENOMEM; } while (dst_pmd++, src_pmd++, addr = next, addr != end); return 0; } static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct 
*src_mm, - pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma, + pgd_t *dst_pgd, pgd_t *src_pgd, + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long addr, unsigned long end) { pud_t *src_pud, *dst_pud; @@ -690,19 +877,19 @@ static inline int copy_pud_range(struct if (pud_none_or_clear_bad(src_pud)) continue; if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud, - vma, addr, next)) + dst_vma, src_vma, addr, next)) return -ENOMEM; } while (dst_pud++, src_pud++, addr = next, addr != end); return 0; } int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - struct vm_area_struct *vma) + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) { pgd_t *src_pgd, *dst_pgd; unsigned long next; - unsigned long addr = vma->vm_start; - unsigned long end = vma->vm_end; + unsigned long addr = src_vma->vm_start; + unsigned long end = src_vma->vm_end; int ret; /* @@ -711,20 +898,20 @@ int copy_page_range(struct mm_struct *ds * readonly mappings. The tradeoff is that copy_page_range is more * efficient than faulting. */ - if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) { - if (!vma->anon_vma) + if (!(src_vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) { + if (!src_vma->anon_vma) return 0; } - if (is_vm_hugetlb_page(vma)) - return copy_hugetlb_page_range(dst_mm, src_mm, vma); + if (is_vm_hugetlb_page(src_vma)) + return copy_hugetlb_page_range(dst_mm, src_mm, src_vma); - if (unlikely(is_pfn_mapping(vma))) { + if (unlikely(is_pfn_mapping(src_vma))) { /* * We do not free on error cases below as remove_vma * gets called on error from higher level routine */ - ret = track_pfn_vma_copy(vma); + ret = track_pfn_vma_copy(src_vma); if (ret) return ret; } @@ -735,7 +922,7 @@ int copy_page_range(struct mm_struct *ds * parent mm. And a permission downgrade will only happen if * is_cow_mapping() returns true. */ - if (is_cow_mapping(vma->vm_flags)) + if (is_cow_mapping(src_vma->vm_flags)) mmu_notifier_invalidate_range_start(src_mm, addr, end); ret = 0; @@ -746,15 +933,16 @@ int copy_page_range(struct mm_struct *ds if (pgd_none_or_clear_bad(src_pgd)) continue; if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, - vma, addr, next))) { + dst_vma, src_vma, addr, next))) { ret = -ENOMEM; break; } } while (dst_pgd++, src_pgd++, addr = next, addr != end); - if (is_cow_mapping(vma->vm_flags)) + if (is_cow_mapping(src_vma->vm_flags)) mmu_notifier_invalidate_range_end(src_mm, - vma->vm_start, end); + src_vma->vm_start, end); + return ret; } @@ -1199,8 +1387,6 @@ static inline int use_zero_page(struct v return !vma->vm_ops || !vma->vm_ops->fault; } - - int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int len, int flags, struct page **pages, struct vm_area_struct **vmas) @@ -1225,6 +1411,7 @@ int __get_user_pages(struct task_struct do { struct vm_area_struct *vma; unsigned int foll_flags; + int decow; vma = find_extend_vma(mm, start); if (!vma && in_gate_area(tsk, start)) { @@ -1279,6 +1466,14 @@ int __get_user_pages(struct task_struct continue; } + /* + * Except in special cases where the caller will not read to or + * write from these pages, we must break COW for any pages + * returned from get_user_pages, so that our caller does not + * subsequently end up with the pages of a parent or child + * process after a COW takes place. 
+ */ + decow = (pages && is_cow_mapping(vma->vm_flags)); foll_flags = FOLL_TOUCH; if (pages) foll_flags |= FOLL_GET; @@ -1299,7 +1494,7 @@ int __get_user_pages(struct task_struct fatal_signal_pending(current))) return i ? i : -ERESTARTSYS; - if (write) + if (write || decow) foll_flags |= FOLL_WRITE; cond_resched(); @@ -1342,6 +1537,8 @@ int __get_user_pages(struct task_struct if (pages) { pages[i] = page; + if (decow && !PageDontCOW(page)) + SetPageDontCOW(page); flush_anon_page(vma, page, start); flush_dcache_page(page); } @@ -1370,7 +1567,6 @@ int get_user_pages(struct task_struct *t start, len, flags, pages, vmas); } - EXPORT_SYMBOL(get_user_pages); pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr, @@ -1829,45 +2025,6 @@ static inline int pte_unmap_same(struct } /* - * Do pte_mkwrite, but only if the vma says VM_WRITE. We do this when - * servicing faults for write access. In the normal case, do always want - * pte_mkwrite. But get_user_pages can cause write faults for mappings - * that do not have writing enabled, when used by access_process_vm. - */ -static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) -{ - if (likely(vma->vm_flags & VM_WRITE)) - pte = pte_mkwrite(pte); - return pte; -} - -static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma) -{ - /* - * If the source page was a PFN mapping, we don't have - * a "struct page" for it. We do a best-effort copy by - * just copying from the original user address. If that - * fails, we just zero-fill it. Live with it. - */ - if (unlikely(!src)) { - void *kaddr = kmap_atomic(dst, KM_USER0); - void __user *uaddr = (void __user *)(va & PAGE_MASK); - - /* - * This really shouldn't fail, because the page is there - * in the page tables. But it might just be unreadable, - * in which case we just give up and fill the result with - * zeroes. - */ - if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) - memset(kaddr, 0, PAGE_SIZE); - kunmap_atomic(kaddr, KM_USER0); - flush_dcache_page(dst); - } else - copy_user_highpage(dst, src, va, vma); -} - -/* * This routine handles present pages, when users try to write * to a shared page. It is done by copying the page to a new address * and decrementing the shared-page counter for the old page. @@ -1930,6 +2087,8 @@ static int do_wp_page(struct mm_struct * } reuse = reuse_swap_page(old_page); unlock_page(old_page); + VM_BUG_ON(PageDontCOW(old_page) && !reuse); + } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED))) { /* @@ -2936,7 +3095,8 @@ int make_pages_present(unsigned long add BUG_ON(end > vma->vm_end); len = DIV_ROUND_UP(end, PAGE_SIZE) - addr/PAGE_SIZE; ret = get_user_pages(current, current->mm, addr, - len, write, 0, NULL, NULL); + len, write, 0, + NULL, NULL); if (ret < 0) return ret; return ret == len ? 
0 : -EFAULT; @@ -3086,7 +3246,7 @@ int access_process_vm(struct task_struct struct page *page = NULL; ret = get_user_pages(tsk, mm, addr, 1, - write, 1, &page, &vma); + 0, 1, &page, &vma); if (ret <= 0) { /* * Check if this is a VM_IO | VM_PFNMAP VMA, which Index: linux-2.6/arch/x86/mm/gup.c =================================================================== --- linux-2.6.orig/arch/x86/mm/gup.c 2009-03-14 02:48:06.000000000 +1100 +++ linux-2.6/arch/x86/mm/gup.c 2009-03-14 16:21:40.000000000 +1100 @@ -83,11 +83,14 @@ static noinline int gup_pte_range(pmd_t struct page *page; if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) { +failed: pte_unmap(ptep); return 0; } VM_BUG_ON(!pfn_valid(pte_pfn(pte))); page = pte_page(pte); + if (PageAnon(page) && unlikely(!PageDontCOW(page))) + goto failed; get_page(page); pages[*nr] = page; (*nr)++; Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h 2009-03-14 02:48:06.000000000 +1100 +++ linux-2.6/include/linux/page-flags.h 2009-03-14 02:48:13.000000000 +1100 @@ -94,6 +94,7 @@ enum pageflags { PG_reclaim, /* To be reclaimed asap */ PG_buddy, /* Page is free, on buddy lists */ PG_swapbacked, /* Page is backed by RAM/swap */ + PG_dontcow, /* PageAnon page in a VM_DONTCOW vma */ #ifdef CONFIG_UNEVICTABLE_LRU PG_unevictable, /* Page is "unevictable" */ PG_mlocked, /* Page is vma mlocked */ @@ -208,6 +209,8 @@ __PAGEFLAG(SlubDebug, slub_debug) */ TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback) __PAGEFLAG(Buddy, buddy) +__PAGEFLAG(DontCOW, dontcow) +SETPAGEFLAG(DontCOW, dontcow) PAGEFLAG(MappedToDisk, mappedtodisk) /* PG_readahead is only used for file reads; PG_reclaim is only for writes */ Index: linux-2.6/kernel/fork.c =================================================================== --- linux-2.6.orig/kernel/fork.c 2009-03-14 02:48:06.000000000 +1100 +++ linux-2.6/kernel/fork.c 2009-03-14 15:12:09.000000000 +1100 @@ -353,7 +353,7 @@ static int dup_mmap(struct mm_struct *mm rb_parent = &tmp->vm_rb; mm->map_count++; - retval = copy_page_range(mm, oldmm, mpnt); + retval = copy_page_range(mm, oldmm, tmp, mpnt); if (tmp->vm_ops && tmp->vm_ops->open) tmp->vm_ops->open(tmp); Index: linux-2.6/mm/internal.h =================================================================== --- linux-2.6.orig/mm/internal.h 2009-03-13 20:25:00.000000000 +1100 +++ linux-2.6/mm/internal.h 2009-03-17 02:41:48.000000000 +1100 @@ -15,6 +15,8 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma, unsigned long floor, unsigned long ceiling); +void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep); extern void prep_compound_page(struct page *page, unsigned long order); extern void prep_compound_gigantic_page(struct page *page, unsigned long order); Index: linux-2.6/arch/powerpc/mm/gup.c =================================================================== --- linux-2.6.orig/arch/powerpc/mm/gup.c 2009-03-17 01:00:48.000000000 +1100 +++ linux-2.6/arch/powerpc/mm/gup.c 2009-03-17 01:02:10.000000000 +1100 @@ -39,6 +39,8 @@ static noinline int gup_pte_range(pmd_t return 0; VM_BUG_ON(!pfn_valid(pte_pfn(pte))); page = pte_page(pte); + if (PageAnon(page) && unlikely(!PageDontCOW(page))) + return 0; if (!page_cache_get_speculative(page)) return 0; if (unlikely(pte_val(pte) != pte_val(*ptep))) { Index: linux-2.6/mm/fremap.c =================================================================== --- 
linux-2.6.orig/mm/fremap.c 2009-03-17 02:37:21.000000000 +1100 +++ linux-2.6/mm/fremap.c 2009-03-17 02:42:11.000000000 +1100 @@ -23,32 +23,6 @@ #include "internal.h" -static void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long addr, pte_t *ptep) -{ - pte_t pte = *ptep; - - if (pte_present(pte)) { - struct page *page; - - flush_cache_page(vma, addr, pte_pfn(pte)); - pte = ptep_clear_flush(vma, addr, ptep); - page = vm_normal_page(vma, addr, pte); - if (page) { - if (pte_dirty(pte)) - set_page_dirty(page); - page_remove_rmap(page); - page_cache_release(page); - update_hiwater_rss(mm); - dec_mm_counter(mm, file_rss); - } - } else { - if (!pte_file(pte)) - free_swap_and_cache(pte_to_swp_entry(pte)); - pte_clear_not_present_full(mm, addr, ptep, 0); - } -} - /* * Install a file pte to a given virtual memory address, release any * previously existing mapping. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
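Worth noting alongside these kernel-side fixes: an application that must mix O_DIRECT with fork() can already keep itself out of this window with madvise(MADV_DONTFORK), available since 2.6.16, which excludes the I/O buffer's VMA from the child entirely, so fork never write-protects it and no COW can move the pinned pages. A sketch of the idea, with the actual I/O elided; this is a userspace workaround, not part of either patch in this thread:

/* Take the DMA buffer out of fork() entirely.  The buffer here would
 * be handed to io_submit()/read() with O_DIRECT as in the reproducers
 * above; the child must never touch it, since it simply will not have
 * the mapping. */
#define _GNU_SOURCE 1
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	void *buf = mmap(NULL, 3 * pagesize, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		perror("mmap"), exit(1);
	/* fork() will not copy this VMA, so it never wrprotects these
	 * ptes and in-flight DMA cannot be COWed away from the parent. */
	if (madvise(buf, 3 * pagesize, MADV_DONTFORK))
		perror("madvise"), exit(1);

	/* ... start O_DIRECT/AIO against buf, then fork() safely ... */
	return 0;
}

The obvious cost is that the buffer does not exist in the child at all, so this only helps programs that never need it there.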
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-13 16:09 ` Nick Piggin 2009-03-13 19:34 ` Andrea Arcangeli @ 2009-03-14 4:46 ` Nick Piggin 2009-03-14 5:06 ` Nick Piggin 1 sibling, 1 reply; 83+ messages in thread From: Nick Piggin @ 2009-03-14 4:46 UTC (permalink / raw) To: Andrea Arcangeli Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm On Saturday 14 March 2009 03:09:39 Nick Piggin wrote: > On Friday 13 March 2009 05:06:48 Andrea Arcangeli wrote: > > The thing is quite simple, if an app has a 1G of vma loaded, you'll > > allocate 1G of ram for no good reason. It can even OOM, it's not just > > a performance issue. While doing it per-page like I do, won't be > > noticeable, as the in-flight I/O will be minor. > > Yes I agree now it is a silly way to do it. Here is an updated patch that just does it on a per-page basis. Actually it is still a bit sloppy because I just reused some code from my last patch for the decow logic... possibly I can just use the same precow code that you do for small and huge pages (although Linus didn't like it so much... it is very hard to do nicely right down there in the call chain :() Anyway, ignoring the decow implementation (that's not really the interesting part of the patch), I think this is looking pretty good now. --- Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h 2009-03-14 02:48:06.000000000 +1100 +++ linux-2.6/include/linux/mm.h 2009-03-14 15:12:13.000000000 +1100 @@ -789,7 +789,7 @@ int walk_page_range(unsigned long addr, void free_pgd_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling); int copy_page_range(struct mm_struct *dst, struct mm_struct *src, - struct vm_area_struct *vma); + struct vm_area_struct *dst_vma, struct vm_area_struct *vma); void unmap_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows); int follow_phys(struct vm_area_struct *vma, unsigned long address, Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c 2009-03-14 02:48:06.000000000 +1100 +++ linux-2.6/mm/memory.c 2009-03-14 15:40:37.000000000 +1100 @@ -533,12 +533,248 @@ out: } /* + * Do pte_mkwrite, but only if the vma says VM_WRITE. We do this when + * servicing faults for write access. In the normal case, do always want + * pte_mkwrite. But get_user_pages can cause write faults for mappings + * that do not have writing enabled, when used by access_process_vm. + */ +static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) +{ + if (likely(vma->vm_flags & VM_WRITE)) + pte = pte_mkwrite(pte); + return pte; +} + +static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma) +{ + /* + * If the source page was a PFN mapping, we don't have + * a "struct page" for it. We do a best-effort copy by + * just copying from the original user address. If that + * fails, we just zero-fill it. Live with it. + */ + if (unlikely(!src)) { + void *kaddr = kmap_atomic(dst, KM_USER0); + void __user *uaddr = (void __user *)(va & PAGE_MASK); + + /* + * This really shouldn't fail, because the page is there + * in the page tables. But it might just be unreadable, + * in which case we just give up and fill the result with + * zeroes. 
+ */ + if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) + memset(kaddr, 0, PAGE_SIZE); + kunmap_atomic(kaddr, KM_USER0); + flush_dcache_page(dst); + } else + copy_user_highpage(dst, src, va, vma); +} + +static int decow_one_pte(struct mm_struct *mm, pte_t *ptep, pmd_t *pmd, + spinlock_t *ptl, struct vm_area_struct *vma, + unsigned long address) +{ + pte_t pte = *ptep; + struct page *page, *new_page; + + /* pte contains position in swap or file, so don't do anything */ + if (unlikely(!pte_present(pte))) + return 0; + /* pte is writable, can't be COW */ + if (pte_write(pte)) + return 0; + + page = vm_normal_page(vma, address, pte); + if (!page) + return 0; + + if (!PageAnon(page)) + return 0; + + WARN_ON(!PageDontCOW(page)); + + page_cache_get(page); + + pte_unmap_unlock(pte, ptl); + + if (unlikely(anon_vma_prepare(vma))) + goto oom; + VM_BUG_ON(page == ZERO_PAGE(0)); + new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); + if (!new_page) + goto oom; + /* + * Don't let another task, with possibly unlocked vma, + * keep the mlocked page. + */ + if (vma->vm_flags & VM_LOCKED) { + lock_page(page); /* for LRU manipulation */ + clear_page_mlock(page); + unlock_page(page); + } + cow_user_page(new_page, page, address, vma); + __SetPageUptodate(new_page); + + if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL)) + goto oom_free_new; + + /* + * Re-check the pte - we dropped the lock + */ + ptep = pte_offset_map_lock(mm, pmd, address, &ptl); + BUG_ON(!pte_same(*ptep, pte)); + { + pte_t entry; + + flush_cache_page(vma, address, pte_pfn(pte)); + entry = mk_pte(new_page, vma->vm_page_prot); + entry = maybe_mkwrite(pte_mkdirty(entry), vma); + /* + * Clear the pte entry and flush it first, before updating the + * pte with the new entry. This will avoid a race condition + * seen in the presence of one thread doing SMC and another + * thread doing COW. + */ + ptep_clear_flush_notify(vma, address, ptep); + page_add_new_anon_rmap(new_page, vma, address); + set_pte_at(mm, address, ptep, entry); + + /* See comment in do_wp_page */ + page_remove_rmap(page); + } + + page_cache_release(page); + + return 0; + +oom_free_new: + page_cache_release(new_page); +oom: + page_cache_release(page); + return -ENOMEM; +} + +static int decow_pte_range(struct mm_struct *mm, + pmd_t *pmd, struct vm_area_struct *vma, + unsigned long addr, unsigned long end) +{ + pte_t *pte; + spinlock_t *ptl; + int progress = 0; + int ret = 0; + +again: + pte = pte_offset_map_lock(mm, pmd, addr, &ptl); +// arch_enter_lazy_mmu_mode(); + + do { + /* + * We are holding two locks at this point - either of them + * could generate latencies in another task on another CPU. 
+ */ + if (progress >= 32) { + progress = 0; + if (need_resched() || spin_needbreak(ptl)) + break; + } + if (pte_none(*pte)) { + progress++; + continue; + } + ret = decow_one_pte(mm, pte, pmd, ptl, vma, addr); + if (ret) { + if (ret == -EAGAIN) { /* retry */ + ret = 0; + break; + } + goto out; + } + progress += 8; + } while (pte++, addr += PAGE_SIZE, addr != end); + +// arch_leave_lazy_mmu_mode(); + pte_unmap_unlock(pte - 1, ptl); + cond_resched(); + if (addr != end) + goto again; +out: + return ret; +} + +static int decow_pmd_range(struct mm_struct *mm, + pud_t *pud, struct vm_area_struct *vma, + unsigned long addr, unsigned long end) +{ + pmd_t *pmd; + unsigned long next; + + pmd = pmd_offset(pud, addr); + do { + next = pmd_addr_end(addr, end); + if (pmd_none_or_clear_bad(pmd)) + continue; + if (decow_pte_range(mm, pmd, vma, addr, next)) + return -ENOMEM; + } while (pmd++, addr = next, addr != end); + return 0; +} + +static int decow_pud_range(struct mm_struct *mm, + pgd_t *pgd, struct vm_area_struct *vma, + unsigned long addr, unsigned long end) +{ + pud_t *pud; + unsigned long next; + + pud = pud_offset(pgd, addr); + do { + next = pud_addr_end(addr, end); + if (pud_none_or_clear_bad(pud)) + continue; + if (decow_pmd_range(mm, pud, vma, addr, next)) + return -ENOMEM; + } while (pud++, addr = next, addr != end); + return 0; +} + +static noinline int decow_page_range(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, unsigned long end) +{ + pgd_t *pgd; + unsigned long next; + int ret; + + BUG_ON(!is_cow_mapping(vma->vm_flags)); + +// if (is_vm_hugetlb_page(vma)) +// return decow_hugetlb_page_range(mm, vma); + +// mmu_notifier_invalidate_range_start(mm, addr, end); + + ret = 0; + pgd = pgd_offset(mm, addr); + do { + next = pgd_addr_end(addr, end); + if (pgd_none_or_clear_bad(pgd)) + continue; + if (unlikely(decow_pud_range(mm, pgd, vma, addr, next))) { + ret = -ENOMEM; + break; + } + } while (pgd++, addr = next, addr != end); + +// mmu_notifier_invalidate_range_end(mm, vma->vm_start, end); + + return ret; +} + +/* * copy one vm_area from one task to the other. Assumes the page tables * already present in the new task to be cleared in the whole range * covered by this vma. */ -static inline void +static inline int copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma, unsigned long addr, int *rss) @@ -546,6 +782,7 @@ copy_one_pte(struct mm_struct *dst_mm, s unsigned long vm_flags = vma->vm_flags; pte_t pte = *src_pte; struct page *page; + int ret = 0; /* pte contains position in swap or file, so copy. 
*/ if (unlikely(!pte_present(pte))) { @@ -597,20 +834,26 @@ copy_one_pte(struct mm_struct *dst_mm, s get_page(page); page_dup_rmap(page, vma, addr); rss[!!PageAnon(page)]++; + if (unlikely(PageDontCOW(page))) + ret = -EAGAIN; } out_set_pte: set_pte_at(dst_mm, addr, dst_pte, pte); + + return ret; } static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma, + pmd_t *dst_pmd, pmd_t *src_pmd, + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long addr, unsigned long end) { pte_t *src_pte, *dst_pte; spinlock_t *src_ptl, *dst_ptl; int progress = 0; int rss[2]; + int ret = 0; again: rss[1] = rss[0] = 0; @@ -637,7 +880,10 @@ again: progress++; continue; } - copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss); + ret = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, + src_vma, addr, rss); + if (unlikely(ret)) + goto decow; progress += 8; } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end); @@ -650,10 +896,25 @@ again: if (addr != end) goto again; return 0; + +decow: + arch_leave_lazy_mmu_mode(); + spin_unlock(src_ptl); + pte_unmap_nested(src_pte); + add_mm_rss(dst_mm, rss[0], rss[1]); + pte_unmap_unlock(dst_pte, dst_ptl); + cond_resched(); + if (decow_page_range(dst_mm, dst_vma, addr, addr + PAGE_SIZE)) + return -ENOMEM; + addr += PAGE_SIZE; + if (addr != end) + goto again; + return 0; } static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma, + pud_t *dst_pud, pud_t *src_pud, + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long addr, unsigned long end) { pmd_t *src_pmd, *dst_pmd; @@ -668,14 +929,15 @@ static inline int copy_pmd_range(struct if (pmd_none_or_clear_bad(src_pmd)) continue; if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd, - vma, addr, next)) + dst_vma, src_vma, addr, next)) return -ENOMEM; } while (dst_pmd++, src_pmd++, addr = next, addr != end); return 0; } static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma, + pgd_t *dst_pgd, pgd_t *src_pgd, + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long addr, unsigned long end) { pud_t *src_pud, *dst_pud; @@ -690,19 +952,19 @@ static inline int copy_pud_range(struct if (pud_none_or_clear_bad(src_pud)) continue; if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud, - vma, addr, next)) + dst_vma, src_vma, addr, next)) return -ENOMEM; } while (dst_pud++, src_pud++, addr = next, addr != end); return 0; } int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - struct vm_area_struct *vma) + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) { pgd_t *src_pgd, *dst_pgd; unsigned long next; - unsigned long addr = vma->vm_start; - unsigned long end = vma->vm_end; + unsigned long addr = src_vma->vm_start; + unsigned long end = src_vma->vm_end; int ret; /* @@ -711,20 +973,20 @@ int copy_page_range(struct mm_struct *ds * readonly mappings. The tradeoff is that copy_page_range is more * efficient than faulting. 
*/ - if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) { - if (!vma->anon_vma) + if (!(src_vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) { + if (!src_vma->anon_vma) return 0; } - if (is_vm_hugetlb_page(vma)) - return copy_hugetlb_page_range(dst_mm, src_mm, vma); + if (is_vm_hugetlb_page(src_vma)) + return copy_hugetlb_page_range(dst_mm, src_mm, src_vma); - if (unlikely(is_pfn_mapping(vma))) { + if (unlikely(is_pfn_mapping(src_vma))) { /* * We do not free on error cases below as remove_vma * gets called on error from higher level routine */ - ret = track_pfn_vma_copy(vma); + ret = track_pfn_vma_copy(src_vma); if (ret) return ret; } @@ -735,7 +997,7 @@ int copy_page_range(struct mm_struct *ds * parent mm. And a permission downgrade will only happen if * is_cow_mapping() returns true. */ - if (is_cow_mapping(vma->vm_flags)) + if (is_cow_mapping(src_vma->vm_flags)) mmu_notifier_invalidate_range_start(src_mm, addr, end); ret = 0; @@ -746,15 +1008,16 @@ int copy_page_range(struct mm_struct *ds if (pgd_none_or_clear_bad(src_pgd)) continue; if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, - vma, addr, next))) { + dst_vma, src_vma, addr, next))) { ret = -ENOMEM; break; } } while (dst_pgd++, src_pgd++, addr = next, addr != end); - if (is_cow_mapping(vma->vm_flags)) + if (is_cow_mapping(src_vma->vm_flags)) mmu_notifier_invalidate_range_end(src_mm, - vma->vm_start, end); + src_vma->vm_start, end); + return ret; } @@ -1200,7 +1463,6 @@ static inline int use_zero_page(struct v } - int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int len, int flags, struct page **pages, struct vm_area_struct **vmas) @@ -1225,6 +1487,7 @@ int __get_user_pages(struct task_struct do { struct vm_area_struct *vma; unsigned int foll_flags; + int decow; vma = find_extend_vma(mm, start); if (!vma && in_gate_area(tsk, start)) { @@ -1279,12 +1542,15 @@ int __get_user_pages(struct task_struct continue; } + decow = (!(flags & GUP_FLAGS_STACK) && + is_cow_mapping(vma->vm_flags)); foll_flags = FOLL_TOUCH; if (pages) foll_flags |= FOLL_GET; if (!write && use_zero_page(vma)) foll_flags |= FOLL_ANON; + do { struct page *page; @@ -1299,7 +1565,7 @@ int __get_user_pages(struct task_struct fatal_signal_pending(current))) return i ? i : -ERESTARTSYS; - if (write) + if (write || decow) foll_flags |= FOLL_WRITE; cond_resched(); @@ -1342,6 +1608,7 @@ int __get_user_pages(struct task_struct if (pages) { pages[i] = page; + SetPageDontCOW(page); flush_anon_page(vma, page, start); flush_dcache_page(page); } @@ -1829,45 +2096,6 @@ static inline int pte_unmap_same(struct } /* - * Do pte_mkwrite, but only if the vma says VM_WRITE. We do this when - * servicing faults for write access. In the normal case, do always want - * pte_mkwrite. But get_user_pages can cause write faults for mappings - * that do not have writing enabled, when used by access_process_vm. - */ -static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) -{ - if (likely(vma->vm_flags & VM_WRITE)) - pte = pte_mkwrite(pte); - return pte; -} - -static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma) -{ - /* - * If the source page was a PFN mapping, we don't have - * a "struct page" for it. We do a best-effort copy by - * just copying from the original user address. If that - * fails, we just zero-fill it. Live with it. 
- */ - if (unlikely(!src)) { - void *kaddr = kmap_atomic(dst, KM_USER0); - void __user *uaddr = (void __user *)(va & PAGE_MASK); - - /* - * This really shouldn't fail, because the page is there - * in the page tables. But it might just be unreadable, - * in which case we just give up and fill the result with - * zeroes. - */ - if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) - memset(kaddr, 0, PAGE_SIZE); - kunmap_atomic(kaddr, KM_USER0); - flush_dcache_page(dst); - } else - copy_user_highpage(dst, src, va, vma); -} - -/* * This routine handles present pages, when users try to write * to a shared page. It is done by copying the page to a new address * and decrementing the shared-page counter for the old page. @@ -1930,6 +2158,8 @@ static int do_wp_page(struct mm_struct * } reuse = reuse_swap_page(old_page); unlock_page(old_page); + VM_BUG_ON(PageDontCOW(old_page) && !reuse); + } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED))) { /* @@ -2935,8 +3165,9 @@ int make_pages_present(unsigned long add BUG_ON(addr >= end); BUG_ON(end > vma->vm_end); len = DIV_ROUND_UP(end, PAGE_SIZE) - addr/PAGE_SIZE; - ret = get_user_pages(current, current->mm, addr, - len, write, 0, NULL, NULL); + ret = __get_user_pages(current, current->mm, addr, + len, GUP_FLAGS_STACK | (write ? GUP_FLAGS_WRITE : 0), + NULL, NULL); if (ret < 0) return ret; return ret == len ? 0 : -EFAULT; @@ -3085,8 +3316,9 @@ int access_process_vm(struct task_struct void *maddr; struct page *page = NULL; - ret = get_user_pages(tsk, mm, addr, 1, - write, 1, &page, &vma); + ret = __get_user_pages(tsk, mm, addr, 1, + GUP_FLAGS_FORCE | GUP_FLAGS_STACK | + (write ? GUP_FLAGS_WRITE : 0), &page, &vma); if (ret <= 0) { /* * Check if this is a VM_IO | VM_PFNMAP VMA, which Index: linux-2.6/arch/x86/mm/gup.c =================================================================== --- linux-2.6.orig/arch/x86/mm/gup.c 2009-03-14 02:48:06.000000000 +1100 +++ linux-2.6/arch/x86/mm/gup.c 2009-03-14 02:48:12.000000000 +1100 @@ -83,11 +83,14 @@ static noinline int gup_pte_range(pmd_t struct page *page; if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) { +failed: pte_unmap(ptep); return 0; } VM_BUG_ON(!pfn_valid(pte_pfn(pte))); page = pte_page(pte); + if (unlikely(!PageDontCOW(page))) + goto failed; get_page(page); pages[*nr] = page; (*nr)++; Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h 2009-03-14 02:48:06.000000000 +1100 +++ linux-2.6/include/linux/page-flags.h 2009-03-14 02:48:13.000000000 +1100 @@ -94,6 +94,7 @@ enum pageflags { PG_reclaim, /* To be reclaimed asap */ PG_buddy, /* Page is free, on buddy lists */ PG_swapbacked, /* Page is backed by RAM/swap */ + PG_dontcow, /* Dont COW PageAnon page */ #ifdef CONFIG_UNEVICTABLE_LRU PG_unevictable, /* Page is "unevictable" */ PG_mlocked, /* Page is vma mlocked */ @@ -208,6 +209,8 @@ __PAGEFLAG(SlubDebug, slub_debug) */ TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback) __PAGEFLAG(Buddy, buddy) +__PAGEFLAG(DontCOW, dontcow) +SETPAGEFLAG(DontCOW, dontcow) PAGEFLAG(MappedToDisk, mappedtodisk) /* PG_readahead is only used for file reads; PG_reclaim is only for writes */ Index: linux-2.6/mm/page_alloc.c =================================================================== --- linux-2.6.orig/mm/page_alloc.c 2009-03-13 20:25:02.000000000 +1100 +++ linux-2.6/mm/page_alloc.c 2009-03-14 02:48:13.000000000 +1100 @@ -1000,6 +1000,7 @@ static void 
free_hot_cold_page(struct pa struct per_cpu_pages *pcp; unsigned long flags; + __ClearPageDontCOW(page); if (PageAnon(page)) page->mapping = NULL; if (free_pages_check(page)) Index: linux-2.6/kernel/fork.c =================================================================== --- linux-2.6.orig/kernel/fork.c 2009-03-14 02:48:06.000000000 +1100 +++ linux-2.6/kernel/fork.c 2009-03-14 15:12:09.000000000 +1100 @@ -353,7 +353,7 @@ static int dup_mmap(struct mm_struct *mm rb_parent = &tmp->vm_rb; mm->map_count++; - retval = copy_page_range(mm, oldmm, mpnt); + retval = copy_page_range(mm, oldmm, tmp, mpnt); if (tmp->vm_ops && tmp->vm_ops->open) tmp->vm_ops->open(tmp); Index: linux-2.6/fs/exec.c =================================================================== --- linux-2.6.orig/fs/exec.c 2009-03-13 20:25:00.000000000 +1100 +++ linux-2.6/fs/exec.c 2009-03-14 02:48:14.000000000 +1100 @@ -165,6 +165,13 @@ exit: #ifdef CONFIG_MMU +#define GUP_FLAGS_WRITE 0x01 +#define GUP_FLAGS_STACK 0x10 + +int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int len, int flags, + struct page **pages, struct vm_area_struct **vmas); + static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos, int write) { @@ -178,8 +185,11 @@ static struct page *get_arg_page(struct return NULL; } #endif - ret = get_user_pages(current, bprm->mm, pos, - 1, write, 1, &page, NULL); + down_read(&bprm->mm->mmap_sem); + ret = __get_user_pages(current, bprm->mm, pos, + 1, GUP_FLAGS_STACK | (write ? GUP_FLAGS_WRITE : 0), + &page, NULL); + up_read(&bprm->mm->mmap_sem); if (ret <= 0) return NULL; Index: linux-2.6/mm/internal.h =================================================================== --- linux-2.6.orig/mm/internal.h 2009-03-13 20:25:00.000000000 +1100 +++ linux-2.6/mm/internal.h 2009-03-14 02:48:14.000000000 +1100 @@ -273,10 +273,11 @@ static inline void mminit_validate_memmo } #endif /* CONFIG_SPARSEMEM */ -#define GUP_FLAGS_WRITE 0x1 -#define GUP_FLAGS_FORCE 0x2 -#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4 -#define GUP_FLAGS_IGNORE_SIGKILL 0x8 +#define GUP_FLAGS_WRITE 0x01 +#define GUP_FLAGS_FORCE 0x02 +#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x04 +#define GUP_FLAGS_IGNORE_SIGKILL 0x08 +#define GUP_FLAGS_STACK 0x10 int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int len, int flags, \0 ^ permalink raw reply [flat|nested] 83+ messages in thread
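To summarize the per-page flow in the patch above (names as in the diff; this is an outline for reading the patch, not additional code):

        copy_one_pte()
            still COW-shares the pte into the child, but returns -EAGAIN
            when it finds PageDontCOW set on an anonymous page
        copy_pte_range()
            on -EAGAIN, drops both pte locks and calls
            decow_page_range(dst_mm, dst_vma, addr, addr + PAGE_SIZE)
                decow_one_pte()
                    allocates a fresh page in the child, copies the data,
                    and installs the copy in the child's pte; the parent
                    keeps the pinned original

So the copy at fork() time is paid only for anon pages that gup has marked DontCOW, not for the whole vma.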
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-14 4:46 ` Nick Piggin @ 2009-03-14 5:06 ` Nick Piggin 0 siblings, 0 replies; 83+ messages in thread From: Nick Piggin @ 2009-03-14 5:06 UTC (permalink / raw) To: Andrea Arcangeli Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm On Saturday 14 March 2009 15:46:30 Nick Piggin wrote: > Index: linux-2.6/arch/x86/mm/gup.c > =================================================================== > --- linux-2.6.orig/arch/x86/mm/gup.c 2009-03-14 02:48:06.000000000 +1100 > +++ linux-2.6/arch/x86/mm/gup.c 2009-03-14 02:48:12.000000000 +1100 > @@ -83,11 +83,14 @@ static noinline int gup_pte_range(pmd_t > struct page *page; > > if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) { > +failed: > pte_unmap(ptep); > return 0; > } > VM_BUG_ON(!pfn_valid(pte_pfn(pte))); > page = pte_page(pte); > + if (unlikely(!PageDontCOW(page))) > + goto failed; > get_page(page); > pages[*nr] = page; > (*nr)++; Ah, that's stupid, the test should be confined just to PageAnon && !PageDontCOW pages, of course. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
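Spelled out, the confinement would look something like this in gup_pte_range() (a sketch against the PageDontCOW flag from the patch above, not a tested fix):

        VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
        page = pte_page(pte);
        /*
         * Only anonymous pages go through fork's COW logic, so only
         * they need the DontCOW mark; file-backed and other pages
         * must not make gup-fast bail out to the slow path.
         */
        if (PageAnon(page) && unlikely(!PageDontCOW(page)))
                goto failed;
        get_page(page);
        pages[*nr] = page;
        (*nr)++;

The first gup of a not-yet-marked anon page then still falls back to the slow path, where __get_user_pages() sets PageDontCOW with mmap_sem held.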
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-11 17:41 ` Ingo Molnar 2009-03-11 17:58 ` Linus Torvalds @ 2009-03-11 18:53 ` Andrea Arcangeli 1 sibling, 0 replies; 83+ messages in thread From: Andrea Arcangeli @ 2009-03-11 18:53 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm Hello, On Wed, Mar 11, 2009 at 06:41:03PM +0100, Ingo Molnar wrote: > Hm, is there any security impact? Andrea is talking about data > corruption. I'm wondering whether that's just corruption > relative to whatever twisted semantics O_DIRECT has in this case > [which would be harmless], or some true pagecache corruption I don't think it's exploitable and I don't see this as much of a security issue. This can only corrupt user data inside anonymous pages (not filesystem metadata or kernel pagecache). The side effects will be the usual ones of random user memory corruption, or at worst I/O corruption on disk, but only in user data. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-11 17:33 ` [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Linus Torvalds 2009-03-11 17:41 ` Ingo Molnar @ 2009-03-11 18:22 ` Andrea Arcangeli 2009-03-11 19:06 ` Ingo Molnar 1 sibling, 1 reply; 83+ messages in thread From: Andrea Arcangeli @ 2009-03-11 18:22 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm On Wed, Mar 11, 2009 at 10:33:00AM -0700, Linus Torvalds wrote: > > On Wed, 11 Mar 2009, Ingo Molnar wrote: > > > > FYI, in case you missed it. Large MM fix - and it's awfully late > > in -rc7. I didn't specify it, but I didn't mean to submit it for immediate inclusion. I posted it because it's ready and I wanted feedback from Hugh/Nick/linux-mm so we can get this fixed when the next merge window opens. > Yeah, I'm not taking this at this point. No way, no-how. > > If there is no simpler and obvious fix, it needs to go through -stable, > after having cooked in 2.6.30-rc for a while. Especially as this is a > totally uninteresting usage case that I can't see as being at all relevant > to any real world. Actually AFAIK there are mission-critical real-world applications using a 512-byte blocksize that were affected by this (I CC'ed relevant people who know). However this is a rare thing, so it almost never triggers: the window is very small. > Anybody who mixes O_DIRECT and fork() (and threads) is already doing some > seriously strange things. Nothing new there. Most apps aren't affected of course. But almost all apps eventually call fork (system/fork/exec/anything), and calling fork while O_DIRECT I/O is in flight is currently enough to generate memory corruption in the parent (i.e. lost O_DIRECT reads from disk). > And quite frankly, the patch is so ugly as-is that I'm not likely to take > it even into the 2.6.30 merge window unless it can be cleaned up. That > whole fork_pre_cow function is too f*cking ugly to live. We just don't > write code like this in the kernel. Yes, this is exactly why I posted it now: to get feedback. It wasn't meant for submission. Feel free to write it yourself in another way of course; I included all the relevant testcases, so alternate fixes can be tested too. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
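The testcases in question all have roughly this shape (a minimal sketch, not one of the actual testcases posted in the thread; the filename argument, sizes and alignment are illustrative, and hitting the window depends on losing a narrow race with the in-flight read):

        /* fork-vs-gup.c: sketch of the race, not a reliable reproducer */
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/wait.h>

        static char *buf;       /* page-aligned anonymous memory, DMA target */
        static int fd;

        static void *reader(void *arg)
        {
                (void)arg;
                /*
                 * O_DIRECT read: gup pins whatever physical page backs
                 * buf right now, and the disk DMAs into that page.
                 */
                if (pread(fd, buf, 4096, 0) != 4096)
                        perror("pread");
                return NULL;
        }

        int main(int argc, char **argv)
        {
                pthread_t t;
                pid_t pid;

                if (argc < 2)
                        return 1;
                fd = open(argv[1], O_RDONLY | O_DIRECT);
                if (fd < 0 || posix_memalign((void **)&buf, 4096, 4096))
                        return 1;
                memset(buf, 0, 4096);   /* fault the anon page in, writable */

                pthread_create(&t, NULL, reader, NULL);
                pid = fork();           /* both ptes go read-only for COW */
                if (pid == 0)
                        _exit(0);
                buf[0] = 1;             /* parent writes first: COW hands the
                                         * parent a copy, while the in-flight
                                         * DMA still targets the original page */
                pthread_join(t, NULL);
                waitpid(pid, NULL, 0);
                /* if the race was lost, buf[] misses the data read from disk */
                return 0;
        }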
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-11 18:22 ` Andrea Arcangeli @ 2009-03-11 19:06 ` Ingo Molnar 2009-03-11 19:15 ` Andrea Arcangeli 0 siblings, 1 reply; 83+ messages in thread From: Ingo Molnar @ 2009-03-11 19:06 UTC (permalink / raw) To: Andrea Arcangeli Cc: Linus Torvalds, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm * Andrea Arcangeli <aarcange@redhat.com> wrote: > On Wed, Mar 11, 2009 at 10:33:00AM -0700, Linus Torvalds wrote: > > > > On Wed, 11 Mar 2009, Ingo Molnar wrote: > > > > > > FYI, in case you missed it. Large MM fix - and it's awfully late > > > in -rc7. > > I didn't specify it, but I didn't mean to submit it for > immediate inclusion. I posted it because it's ready and I > wanted feedback from Hugh/Nick/linux-mm so we can get this > fixed when the next merge window opens. Good - I saw the '(fast-)gup fix' qualifier and fast-gup is a fresh feature. If the problem existed in earlier kernels too then I guess it isn't urgent. Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] 2009-03-11 19:06 ` Ingo Molnar @ 2009-03-11 19:15 ` Andrea Arcangeli 0 siblings, 0 replies; 83+ messages in thread From: Andrea Arcangeli @ 2009-03-11 19:15 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Nick Piggin, Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm On Wed, Mar 11, 2009 at 08:06:55PM +0100, Ingo Molnar wrote: > Good - I saw the '(fast-)gup fix' qualifier and fast-gup is a > fresh feature. If the problem existed in earlier kernels too > then I guess it isn't urgent. It always existed, yes. The reason for the (fast-) qualifier is that gup-fast made it harder to fix this in mainline (there is also a patch floating around for 2.6.18-based kernels that is simpler, thanks to gup-fast not being there). The trouble with gup-fast is that checking page_count inside the PT lock (or with mmap_sem held in write mode as in fork(); ksm only takes mmap_sem in read mode and relied on the PT lock alone) is no longer enough to be sure the page_count won't increase out from under us just after we read it, because a gup-fast could be running on another CPU without mmap_sem and without the PT lock taken. So fixing this in mainline has been a bit harder, as I had to prevent gup-fast from going ahead in the fast path, in a way that doesn't send IPIs to flush the smp-tlb before reading the page_count (so as to avoid sending IPIs for every anon page mapped writable). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 83+ messages in thread
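Concretely, the ordering described above can interleave like this (the page_count test is the kind of check the simpler 2.6.18-era fix relies on at fork() time; gup-fast holds neither mmap_sem nor the PT lock, only disabled irqs):

        CPU 0: fork(), PT lock held     CPU 1: gup-fast
        ---------------------------     -----------------------------
        pte = *src_pte
        page_count(page) == 1, so
        "nobody has this page pinned"
                                        local_irq_disable()
                                        pte = *ptep
                                        get_page(pte_page(pte))
        wrprotect parent and child
        ptes, COW-share the page        page is now pinned for DMA,
                                        invisibly to fork()

Flushing the smp-tlb with IPIs before sampling page_count would close the window, because gup-fast runs with irqs disabled and so cannot be in flight across the flush, but that is exactly the per-anon-page IPI cost mentioned above.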
end of thread
Thread overview: 83+ messages
[not found] <20090311170611.GA2079@elte.hu>
2009-03-11 17:33 ` [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Linus Torvalds
2009-03-11 17:41 ` Ingo Molnar
2009-03-11 17:58 ` Linus Torvalds
2009-03-11 18:37 ` Andrea Arcangeli
2009-03-11 18:46 ` Linus Torvalds
2009-03-11 19:01 ` Linus Torvalds
2009-03-11 19:59 ` Andrea Arcangeli
2009-03-11 20:19 ` Linus Torvalds
2009-03-11 20:33 ` Linus Torvalds
2009-03-11 20:55 ` Andrea Arcangeli
2009-03-11 21:28 ` Linus Torvalds
2009-03-11 21:57 ` Andrea Arcangeli
2009-03-11 22:06 ` Linus Torvalds
2009-03-11 22:07 ` Linus Torvalds
2009-03-11 22:22 ` Davide Libenzi
2009-03-11 22:32 ` Linus Torvalds
2009-03-14 5:07 ` Benjamin Herrenschmidt
2009-03-11 20:48 ` Andrea Arcangeli
2009-03-14 5:06 ` Benjamin Herrenschmidt
2009-03-14 5:20 ` Nick Piggin
2009-03-16 16:01 ` KOSAKI Motohiro
2009-03-16 16:23 ` Nick Piggin
2009-03-16 16:32 ` Linus Torvalds
2009-03-16 16:50 ` Nick Piggin
2009-03-16 17:02 ` Linus Torvalds
2009-03-16 17:19 ` Nick Piggin
2009-03-16 17:42 ` Linus Torvalds
2009-03-16 18:02 ` Nick Piggin
2009-03-16 18:05 ` Nick Piggin
2009-03-16 18:17 ` Linus Torvalds
2009-03-16 18:33 ` Nick Piggin
2009-03-16 19:22 ` Linus Torvalds
2009-03-17 5:44 ` Nick Piggin
2009-03-16 18:14 ` Linus Torvalds
2009-03-16 18:29 ` Nick Piggin
2009-03-16 19:17 ` Linus Torvalds
2009-03-17 5:42 ` Nick Piggin
2009-03-17 5:58 ` Nick Piggin
2009-03-16 18:37 ` Andrea Arcangeli
2009-03-16 18:28 ` Andrea Arcangeli
2009-03-16 23:59 ` KAMEZAWA Hiroyuki
2009-03-18 2:04 ` KOSAKI Motohiro
2009-03-22 12:23 ` KOSAKI Motohiro
2009-03-23 0:13 ` KOSAKI Motohiro
2009-03-23 16:29 ` Ingo Molnar
2009-03-23 16:46 ` Linus Torvalds
2009-03-24 5:08 ` KOSAKI Motohiro
2009-03-24 13:43 ` Nick Piggin
2009-03-24 17:56 ` Linus Torvalds
2009-03-30 10:52 ` KOSAKI Motohiro
[not found] ` <200904022307.12043.nickpiggin@yahoo.com.au>
2009-04-03 3:49 ` Nick Piggin
2009-03-17 0:44 ` Linus Torvalds
2009-03-17 0:56 ` KAMEZAWA Hiroyuki
2009-03-17 12:19 ` Andrea Arcangeli
2009-03-17 16:43 ` Linus Torvalds
2009-03-17 17:01 ` Linus Torvalds
2009-03-17 17:10 ` Andrea Arcangeli
2009-03-17 17:43 ` Linus Torvalds
2009-03-17 18:09 ` Linus Torvalds
2009-03-17 18:19 ` Linus Torvalds
2009-03-17 18:46 ` Andrea Arcangeli
2009-03-17 19:03 ` Linus Torvalds
2009-03-17 19:35 ` Andrea Arcangeli
2009-03-17 19:55 ` Linus Torvalds
2009-03-11 19:06 ` Andrea Arcangeli
2009-03-12 5:36 ` Nick Piggin
2009-03-12 16:23 ` Nick Piggin
2009-03-12 17:00 ` Andrea Arcangeli
2009-03-12 17:20 ` Nick Piggin
2009-03-12 17:23 ` Nick Piggin
2009-03-12 18:06 ` Andrea Arcangeli
2009-03-12 18:58 ` Andrea Arcangeli
2009-03-13 16:09 ` Nick Piggin
2009-03-13 19:34 ` Andrea Arcangeli
2009-03-14 4:59 ` Nick Piggin
2009-03-16 13:56 ` Andrea Arcangeli
2009-03-16 16:01 ` Nick Piggin
2009-03-14 4:46 ` Nick Piggin
2009-03-14 5:06 ` Nick Piggin
2009-03-11 18:53 ` Andrea Arcangeli
2009-03-11 18:22 ` Andrea Arcangeli
2009-03-11 19:06 ` Ingo Molnar
2009-03-11 19:15 ` Andrea Arcangeli