* page->mapping == 0 [not found] ` <Pine.LNX.4.10.10010262325440.864-100000@penguin.transmeta.com> @ 2000-10-29 9:38 ` Paul Mackerras 2000-10-29 12:39 ` Daniel Phillips 2000-10-29 18:05 ` Linus Torvalds 0 siblings, 2 replies; 11+ messages in thread From: Paul Mackerras @ 2000-10-29 9:38 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel Linus, I have tracked down why I am getting the oops with page->mapping == 0 in block_read_full_page. Well, at least if not the ultimate cause, then the penultimate cause. It's basically yet another truncate bug (tm). I have some fairly detailed traces which I can send you if you like, but the executive summary is that a page can get truncated out from under us while we are sleeping in lock_page. The scenario I observed is this: Process A goes to read some data from a file. It finds or makes a page cache page and starts the read into the page. Process B goes to read the same data from the same file. It gets down into do_generic_file_read, finds the existing page cache page, increments its count, sees that it is not up-to-date, and calls lock_page, which calls ___wait_on_page. At some later time the read completes, the page becomes up-to-date and process B is woken. But by this time process C is running. It is also trying to read the same data from the same file. It completes the read without blocking and then does a truncate on the file (in preparation for rewriting it; the canonical example is when several bashes exit at the same time, and they all go to read and rewrite the ~/.bash_history file). So process C calls vmtruncate() which calls truncate_inode_pages() which calls truncate_complete_page() on all of the complete pages of the file. That in turn calls remove_inode_page() which calls __remove_inode_page() which sets page->mapping to NULL. (This code doesn't seem to check page->count at all.) Then process B actually gets to run. It returns from ___wait_on_page to lock_page to do_generic_file_read. It now has the page locked but it is no longer really the page it wants. It has page->mapping set to NULL and it isn't on the page list for the mapping any more. The code in do_generic_file_read doesn't check this but just charges on to call mapping->a_ops->readpage. I am not any kind of expert on the page cache but it seems to me that maybe after locking the page we should check if it is still the page we want (i.e. page->mapping and page->index are still correct), and go back and look up the page again if not. Presumably there will be quite a few places besides do_generic_file_read where that check would be needed also. Regards, Paul. -- Paul Mackerras, Senior Open Source Researcher, Linuxcare, Inc. +61 2 6262 8990 tel, +61 2 6262 8991 fax paulus@linuxcare.com.au, http://www.linuxcare.com.au/ Linuxcare. Support for the revolution. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: page->mapping == 0 2000-10-29 9:38 ` page->mapping == 0 Paul Mackerras @ 2000-10-29 12:39 ` Daniel Phillips 2000-10-29 18:05 ` Linus Torvalds 1 sibling, 0 replies; 11+ messages in thread From: Daniel Phillips @ 2000-10-29 12:39 UTC (permalink / raw) To: paulus, linux-kernel Paul Mackerras wrote: > I am not any kind of expert on the page cache but it seems to me that > maybe after locking the page we should check if it is still the page > we want (i.e. page->mapping and page->index are still correct), and go > back and look up the page again if not. Presumably there will be > quite a few places besides do_generic_file_read where that check would > be needed also. I'm getting at least some kind of clue about the page cache, though I don't claim to be an expert. You've pointed out how the operation of becoming a reader on a page can fail because the page disappears while you're waiting. That's an error, right? It's the result of a user space race, which in turn is the result of parallel processes doing nonatomic file operations with no synchronization. It sounds like bailing out of the file_read if the page turns out to be unmapped after the lock is the right thing to do, at least for a quick fix. http://lxr.linux.no/source/mm/filemap.c?v=2.4.0-test9#L1029 There actually aren't that many places where lock_page is called: http://lxr.linux.no/ident?v=2.4.0-test9;i=lock_page Maybe lock_page needs to return an error code. -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: page->mapping == 0 2000-10-29 9:38 ` page->mapping == 0 Paul Mackerras 2000-10-29 12:39 ` Daniel Phillips @ 2000-10-29 18:05 ` Linus Torvalds 2000-10-29 18:32 ` Alexander Viro 1 sibling, 1 reply; 11+ messages in thread From: Linus Torvalds @ 2000-10-29 18:05 UTC (permalink / raw) To: Paul Mackerras; +Cc: linux-kernel On Sun, 29 Oct 2000, Paul Mackerras wrote: > > I have tracked down why I am getting the oops with page->mapping == 0 > in block_read_full_page. Well, at least if not the ultimate cause, > then the penultimate cause. It's basically yet another truncate bug > (tm). I have some fairly detailed traces which I can send you if you > like, but the executive summary is that a page can get truncated out > from under us while we are sleeping in lock_page. The scenario I > observed is this: Sorry, maybe our discussions with Al Viro weren't public, but yes, we did come to the same conclusion a few days ago. The only problem right now is deciding how to fix it - because as you rightly point out, this actually happens in a number of places, and as such it's not trivial to fix it cleanly. We may have to use a brute-force approach for 2.4.x where we just re-test against the file size in multiple places where this can happen - but it would be nicer to handle this more cleanly. There are a few details that still elude me, the main one being the big question about why this started happening to so many people just recently. The bug is not new, and yet it's become obvious only in the last few weeks. I _hope_ that the reason for this is that more people are testing out 2.4.x, and that we've fixed most of the other issues that kept people back from using it. It may be as simple as that. But I wonder if there is something else going on too, like the read-ahead being buggy, and that causing the bug to just happen a lot more (if read-ahead is buggy and tries to read ahead past the end of the file, a truncate() will be much more likely to hit these pages past the end, for example. Because most cases of "truncate" are actually about truncating _upwards_, not downwards). So the two worries I have now is (a) the details on how to fix this thing (minor worry, mainly one of aesthetics) and (b) whether there is something else going on that brings it out of the woodworks.. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: page->mapping == 0 2000-10-29 18:05 ` Linus Torvalds @ 2000-10-29 18:32 ` Alexander Viro 2000-10-29 19:12 ` Linus Torvalds 2000-10-29 20:32 ` Alan Cox 0 siblings, 2 replies; 11+ messages in thread From: Alexander Viro @ 2000-10-29 18:32 UTC (permalink / raw) To: Linus Torvalds; +Cc: Paul Mackerras, linux-kernel On Sun, 29 Oct 2000, Linus Torvalds wrote: > approach for 2.4.x where we just re-test against the file size in multiple > places where this can happen - but it would be nicer to handle this more > cleanly. One possible way is to access page->mapping _only_ under the page lock and in cases when we call ->i_mapping->a_ops->foo check the ->mapping before the method call. > There are a few details that still elude me, the main one being the big > question about why this started happening to so many people just recently. > The bug is not new, and yet it's become obvious only in the last few > weeks. And original truncate() problems went undetected since...? Pattern looks very similar. Petr posted bug reports long time ago - back then they were ignored since they involved vmware. In both cases we had persistent problems caught by few people who were using not-too-common combinations (innd on 2.4 boxen, for example) and at some point the shit had hit the fan big way. No amount of testing can prove that there is no bugs and all such... I would expect problems with truncate, mmap, rename, POSIX locks, fasync, ptrace and mount go unnoticed for _long_. Ditto for parts of procfs proper that are not exercised by ps and friends. Sigh... Maybe something along the lines of BTS might be useful, hell knows. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: page->mapping == 0 2000-10-29 18:32 ` Alexander Viro @ 2000-10-29 19:12 ` Linus Torvalds 2000-10-29 19:33 ` Linus Torvalds 2000-10-29 20:32 ` Alan Cox 1 sibling, 1 reply; 11+ messages in thread From: Linus Torvalds @ 2000-10-29 19:12 UTC (permalink / raw) To: Alexander Viro; +Cc: Paul Mackerras, linux-kernel On Sun, 29 Oct 2000, Alexander Viro wrote: > > One possible way is to access page->mapping _only_ under the page lock > and in cases when we call ->i_mapping->a_ops->foo check the ->mapping > before the method call. I'm leaning towards this for a 2.4.x solution. As far as I can tell, page->mapping is _already_ only accessed and modified with the page lock held. It's just that we don't test it for NULL, in case an earlier lock holder decided to clear it. (No, I didn't look through all the users, but at least conceptually it _should_ be true that we only look at "mapping" with the lock held: it's mainly used for pagein, and pageout, buth with the lock held for other reasons already. Certainly all the places where we have had bug-reports have been of this type). Making it policy that we have to re-test page->mapping after aquireing the page lock might be the simplest fix for 2.4.x. It still means that we might end up allowing people to have a "bad" page in the VM space due to the "test->insert" race condition, but it woul dmake that event pretty much a harmless bug (and thus move it to the "beauty wart - to be fixed later" category). And the places where we get the page lock and use page->mapping are not that many, I think. (And notice how we actually _have_ this approach already in do_buffer_fdatasync(), for similar reasons - we use the "re-test the page->buffers" thing there. Of course, there we do it because the clearing of page->buffers is easier to see, and can happen as a result of memory pressure, and not just truncate()). So, for example, in __find_lock_page() we should re-test the mapping after we aquired the page lock. Which is fairly easy, just add something like /* Race: did the mapping go away before we got the page lock? */ if (page && page->mapping != mapping) { page_cache_release(page); goto repeat; } to the end of __find_lock_page(). Add similar logic to do_generic_file_read(), filemap_nopage() and filemap_sync_pte() and read_cache_page(), and you're pretty much done. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: page->mapping == 0 2000-10-29 19:12 ` Linus Torvalds @ 2000-10-29 19:33 ` Linus Torvalds 2000-10-30 1:57 ` Daniel Phillips 0 siblings, 1 reply; 11+ messages in thread From: Linus Torvalds @ 2000-10-29 19:33 UTC (permalink / raw) To: Alexander Viro; +Cc: Paul Mackerras, linux-kernel On Sun, 29 Oct 2000, Linus Torvalds wrote: > > Making it policy that we have to re-test page->mapping after aquireing the > page lock might be the simplest fix for 2.4.x. It still means that we > might end up allowing people to have a "bad" page in the VM space due to > the "test->insert" race condition, but it woul dmake that event pretty > much a harmless bug (and thus move it to the "beauty wart - to be fixed > later" category). I'd like to just re-iterate this point: re-testing "page->mapping" fixes the oops, but is not a full fix for the conceptual problem. The problem with just re-testing "page->mapping" is that you still have a nasty potential race where you insert a (bogus) page into the VM space of a process instead of giving a SIGBUS/SIGSEGV. Now, I don't think this is really a valid usage pattern, so it's most likely to be a result of a buggy application, but I can imagine having some strange kind of user-space VM memory management scheme that depends on SIGBUS to maintain a file length. I've never heard of such a thing, but I could imagine somebody doing some kind of persistent data object store in user space this way. Does anybody actually know of an application that does something like this? Because I'm more and more inclined to just going with the half-fix for now. It would at least guarantee internal kernel data consistency (and no oopses, of course). Don't get me wrong - we need to clean this part up, but as far as I can tell we have never done this "right", so in that sense it's not a new 2.4.x bug and it can't break existing applications. [ In fact, with the current ordering inside "vmtruncate()" of doing the "truncate_inode_pages()" thing before doing the "vmtruncate_list()", I have this suspicion that the race might even be impossible to trigger. Even when the race "happens" in kernel space, we will end up unmapping the page immediately afterwards, and the only effect as far as the user is concerned is the disappearance of the SIGBUS. And the "disappearing SIGBUS" is actually explainable with a _user_level_ race: in order to get the kernel race at all, user level itself must have been inherently racing on the truncate/access, and depending on which one happened "first" you'd have lost the SIGBUS and the data you wrote anyway. So it may actually be that we can honestly claim that the half-fix is actually a proper fix. I would have to look a lot closer at the issue to be able to guarantee this, though. Comments? Anybody? ] Al? Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: page->mapping == 0 2000-10-29 19:33 ` Linus Torvalds @ 2000-10-30 1:57 ` Daniel Phillips 0 siblings, 0 replies; 11+ messages in thread From: Daniel Phillips @ 2000-10-30 1:57 UTC (permalink / raw) To: Linus Torvalds, linux-kernel Linus Torvalds wrote: > Now, I don't think this is really a valid usage pattern, so it's most > likely to be a result of a buggy application, but I can imagine having > some strange kind of user-space VM memory management scheme that depends > on SIGBUS to maintain a file length. I've never heard of such a thing, but > I could imagine somebody doing some kind of persistent data object store > in user space this way. > > Does anybody actually know of an application that does something like > this? Because I'm more and more inclined to just going with the half-fix > for now. It would at least guarantee internal kernel data consistency (and > no oopses, of course). > > Don't get me wrong - we need to clean this part up, but as far as I can > tell we have never done this "right", so in that sense it's not a new > 2.4.x bug and it can't break existing applications. Strace shows that the history really is updated with no locking: 4030 open("/home/phillips/.bash_history", O_WRONLY|O_TRUNC) = 4 4030 write(4, "l\nless README\ne ../*test8/fs/ext"..., 7546) = 7546 4030 close(4) = 0 There's no reason to continue doing the read after you've detected the page was pulled out from under you - clearly it's a user error state, the only question is: return the error and break the buggy application? Or gloss over it. And it seems like returning the error might actually break bash - maybe it deserves it... but... -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: page->mapping == 0 2000-10-29 18:32 ` Alexander Viro 2000-10-29 19:12 ` Linus Torvalds @ 2000-10-29 20:32 ` Alan Cox 2000-10-29 20:42 ` Alexander Viro ` (2 more replies) 1 sibling, 3 replies; 11+ messages in thread From: Alan Cox @ 2000-10-29 20:32 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Paul Mackerras, linux-kernel > I would expect problems with truncate, mmap, rename, POSIX locks, fasync, > ptrace and mount go unnoticed for _long_. Ditto for parts of procfs Well the ptrace one still has mysteriously breaks usermode linux against it on my list here. Was that ever explained. It looked like the stack got corrupted which is weird. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: page->mapping == 0 2000-10-29 20:32 ` Alan Cox @ 2000-10-29 20:42 ` Alexander Viro 2000-10-29 22:26 ` Jeff V. Merkey 2000-10-29 23:12 ` Jeff Dike 2 siblings, 0 replies; 11+ messages in thread From: Alexander Viro @ 2000-10-29 20:42 UTC (permalink / raw) To: Alan Cox; +Cc: Linus Torvalds, Paul Mackerras, linux-kernel On Sun, 29 Oct 2000, Alan Cox wrote: > > I would expect problems with truncate, mmap, rename, POSIX locks, fasync, > > ptrace and mount go unnoticed for _long_. Ditto for parts of procfs > > Well the ptrace one still has mysteriously breaks usermode linux against it > on my list here. Was that ever explained. It looked like the stack got corrupted > which is weird. Alan, is it me or usermode port really tends to catch the stack overflows? <thinks> How about we allocate the pages by 4, not by 2 as we do now, and explicitly unmap the pages below the task_struct? That would still leave the chance of stack overflow going unnoticed, but the window would be limited. Another possibility for ptrace-related screwups: past-the-EOF check in filemap_nopage(). I'm still not convinced that it's right. It may be, but... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: page->mapping == 0 2000-10-29 20:32 ` Alan Cox 2000-10-29 20:42 ` Alexander Viro @ 2000-10-29 22:26 ` Jeff V. Merkey 2000-10-29 23:12 ` Jeff Dike 2 siblings, 0 replies; 11+ messages in thread From: Jeff V. Merkey @ 2000-10-29 22:26 UTC (permalink / raw) To: Alan Cox; +Cc: Alexander Viro, Linus Torvalds, Paul Mackerras, linux-kernel Alan Cox wrote: > > > I would expect problems with truncate, mmap, rename, POSIX locks, fasync, > > ptrace and mount go unnoticed for _long_. Ditto for parts of procfs > > Well the ptrace one still has mysteriously breaks usermode linux against it > on my list here. Was that ever explained. It looked like the stack got corrupted > which is weird. Do any of you guys have an SMP inverse assembler handy? This bug is scary as shit, and someone needs to setup an SMP ICE and actually trace through the code (and set address breakpoints) to see just what is going on. It really is sounding like a race condition of some kind, and I read everyone's "speculation" about what they think is happneing, but how about getting some hardware tools, and tracking is down. It feels like an MMU bug of some type. Who has access to Yellow Cover Intel Errata sheets? It may be a hardware bug intel knows about but we don't -- their MMU is so bug infested how do we know it's not one of these. I don't think you guys have a good handle on it yet. Linus seems to think it's a coding problem with a race somewhere. How about someone getting some hardware tools and nailing the puppy to the floor so 2.4 can get out the door... :-) Jeff > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: page->mapping == 0 2000-10-29 20:32 ` Alan Cox 2000-10-29 20:42 ` Alexander Viro 2000-10-29 22:26 ` Jeff V. Merkey @ 2000-10-29 23:12 ` Jeff Dike 2 siblings, 0 replies; 11+ messages in thread From: Jeff Dike @ 2000-10-29 23:12 UTC (permalink / raw) To: Alan Cox; +Cc: viro, Linus Torvalds, Paul Mackerras, linux-kernel alan@lxorguk.ukuu.org.uk said: > Well the ptrace one still has mysteriously breaks usermode linux > against it on my list here. Was that ever explained. It looked like > the stack got corrupted which is weird. If this is the test8 "turn off ORIG_EAX if EIP is changed" fix, it apparently also broke a couple of other ptrace-intensive things which aren't coming to mind (Andi Kleen noticed them). AFAIK, it was never explained. Something would change its victim's EIP and the victim would segfault as soon as it was continued. Jeff - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2000-10-30 1:57 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <14841.7644.853174.666385@argo.linuxcare.com.au>
[not found] ` <Pine.LNX.4.10.10010262325440.864-100000@penguin.transmeta.com>
2000-10-29 9:38 ` page->mapping == 0 Paul Mackerras
2000-10-29 12:39 ` Daniel Phillips
2000-10-29 18:05 ` Linus Torvalds
2000-10-29 18:32 ` Alexander Viro
2000-10-29 19:12 ` Linus Torvalds
2000-10-29 19:33 ` Linus Torvalds
2000-10-30 1:57 ` Daniel Phillips
2000-10-29 20:32 ` Alan Cox
2000-10-29 20:42 ` Alexander Viro
2000-10-29 22:26 ` Jeff V. Merkey
2000-10-29 23:12 ` Jeff Dike
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox