* Re: [lockup] Re: objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines)
  [not found] <20040310080000.GA30940@dualathlon.random>
@ 2004-03-10 13:01 ` Rik van Riel
  2004-03-10 13:50 ` Andrea Arcangeli
  0 siblings, 1 reply; 74+ messages in thread
From: Rik van Riel @ 2004-03-10 13:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Andrew Morton, torvalds, linux-kernel, Rajesh Venkatasubramanian

On Wed, 10 Mar 2004, Andrea Arcangeli wrote:
> On Tue, Mar 09, 2004 at 06:56:50PM +0100, Andrea Arcangeli wrote:
> > We've got a lot of room for improvements.
>
> Rajesh has a smart idea on how to fix the complexity issue (for both
> truncate and vm) and it involves a new non-trivial data structure.
>
> I trust he will make it, but if there will be any trouble with his
> approach for safety I'm currently planning on a simpler fallback solution
> that I can manage without having to design a new tree data structure.
>
> Sharing his "tree and sorting" idea, the fallback I propose is to simply
> index the vmas in an rbtree too.

That simply results in looking up fewer VMAs for low file
indexes, but still needing to check all of them for high
file indexes.

You really want to sort on both the start and end offset
of the VMA, as can be done with a kd-tree or kdb-tree.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [lockup] Re: objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines)
  2004-03-10 13:01 ` [lockup] Re: objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines) Rik van Riel
@ 2004-03-10 13:50 ` Andrea Arcangeli
  2004-03-12 17:05 ` anon_vma RFC2 Rajesh Venkatasubramanian
  0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-10 13:50 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, Andrew Morton, torvalds, linux-kernel, Rajesh Venkatasubramanian

On Wed, Mar 10, 2004 at 08:01:15AM -0500, Rik van Riel wrote:
> On Wed, 10 Mar 2004, Andrea Arcangeli wrote:
> > On Tue, Mar 09, 2004 at 06:56:50PM +0100, Andrea Arcangeli wrote:
> > > We've got a lot of room for improvements.
> >
> > Rajesh has a smart idea on how to fix the complexity issue (for both
> > truncate and vm) and it involves a new non-trivial data structure.
> >
> > I trust he will make it, but if there will be any trouble with his
> > approach for safety I'm currently planning on a simpler fallback solution
> > that I can manage without having to design a new tree data structure.
> >
> > Sharing his "tree and sorting" idea, the fallback I propose is to simply
> > index the vmas in an rbtree too.
>
> That simply results in looking up fewer VMAs for low file
> indexes, but still needing to check all of them for high
> file indexes.
>
> You really want to sort on both the start and end offset
> of the VMA, as can be done with a kd-tree or kdb-tree.

Yes. But the only reason for me to even consider using the rbtree was
to avoid having to introduce another data structure and to feel very
safe in terms of risks of memory corruption in the short term ;). The
rbtree is extremely well exercised; that's the only reason I
suggested it.
Rajesh is currently working on another data structure that is efficient
at finding a "range" (not sure if it is what you're suggesting; he
called it a prio_tree, a mix between hashes and radix trees). That's
optimal, though in practice the rbtree would work too (perhaps one
could still work out an exploit ;), but the real-life apps would
definitely be covered by the rbtree too (since all vmas are of the same
size and they're all naturally aligned).

^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-10 13:50 ` Andrea Arcangeli
@ 2004-03-12 17:05 ` Rajesh Venkatasubramanian
  2004-03-12 17:26 ` Andrea Arcangeli
  0 siblings, 1 reply; 74+ messages in thread
From: Rajesh Venkatasubramanian @ 2004-03-12 17:05 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

>> have a devastating effect on vma usage, yes) issue of vma merging, but
>> what about the (mandatory) vma splitting? ...[snip]

> you're right about vma_split, the way I implemented it is wrong,
> basically the as.vma/PageDirect idea is falling apart with vma_split.

Why do you have to fix up all page structs' PageDirect and as.vma
fields when a vma_split or vma_merge occurs? Can't you do it lazily on
the next page_referenced or page_add_rmap, etc.? Anyway, we can get to
the anon_vma using as.vma->anon_vma.

I understand that currently your code assumes that if PageDirect is
set, then there cannot be an anon_vma corresponding to the page.

Rajesh

^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-12 17:05 ` anon_vma RFC2 Rajesh Venkatasubramanian
@ 2004-03-12 17:26 ` Andrea Arcangeli
  2004-03-12 21:16 ` Rajesh Venkatasubramanian
  0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-12 17:26 UTC (permalink / raw)
  To: Rajesh Venkatasubramanian; +Cc: linux-kernel

On Fri, Mar 12, 2004 at 12:05:27PM -0500, Rajesh Venkatasubramanian wrote:
> >> have a devastating effect on vma usage, yes) issue of vma merging, but
> >> what about the (mandatory) vma splitting? ...[snip]
>
> > you're right about vma_split, the way I implemented it is wrong,
> > basically the as.vma/PageDirect idea is falling apart with vma_split.
>
> Why do you have to fix up all page structs' PageDirect and as.vma
> fields when a vma_split or vma_merge occurs?
>
> Can't you do it lazily on the next page_referenced or page_add_rmap,

I cannot do it lazily, unfortunately, because the paging routine will
start from the page, so if the page is not uptodate it will go read
into nirvana.

> etc. Anyway we can get to the anon_vma using as.vma->anon_vma.
>
> I understand that currently your code assumes that if PageDirect is
> set, then there cannot be an anon_vma corresponding to the page.

correct, though I will have to change that for the above problem ;(

Well, another way is to just do the pagetable walk and fix up the
page->as.vma to be a page->as.anon_vma during split/merge (actually
merge is already taken care of by forbidding merging in the interesting
cases; what I missed was the split, oh well ;). But preallocating the
anon_vma is such a small cost that it should be a lot better than
slowing down the split.

^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-12 17:26 ` Andrea Arcangeli
@ 2004-03-12 21:16 ` Rajesh Venkatasubramanian
  2004-03-13 17:55 ` Rajesh Venkatasubramanian
  0 siblings, 1 reply; 74+ messages in thread
From: Rajesh Venkatasubramanian @ 2004-03-12 21:16 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, torvalds

>> I think your approach could work (reverse map by having separate address
>> spaces for unrelated processes), but I don't see any good "page->index"
>> allocation scheme that is implementable.
>> Or did I totally mis-understand what you were proposing?

> You're absolutely right. I am still trying to come up with
> a way to do this.
> [snip]
> I just can't think of any now ...

At least one solution exists. It may be just an academic solution,
though. Add a new prio_tree root "remap_address" to the anonmm
address_space structure.

struct anon_remap_address {
        unsigned long old_page_index_start;
        unsigned long old_page_index_end;
        unsigned long new_page_index;
        struct prio_tree_node prio_tree_node;
}

For each mremap that expands the area and moves the page tables,
allocate a new anon_remap_address struct and add it to the
remap_address tree. The page->index does not change ever.

Take the page->index and walk the remap_address tree to find all
remapped addresses. Once a list of all remapped addresses is found,
it's easy to find the interesting vmas (again using a different
prio_tree). Finding all remapped addresses may involve recursion,
which is bad.

Rajesh

^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 21:16 ` Rajesh Venkatasubramanian @ 2004-03-13 17:55 ` Rajesh Venkatasubramanian 2004-03-13 18:16 ` Andrea Arcangeli 0 siblings, 1 reply; 74+ messages in thread From: Rajesh Venkatasubramanian @ 2004-03-13 17:55 UTC (permalink / raw) To: riel; +Cc: linux-kernel, torvalds, andrea > The only problem is mremap() after a fork(), and hell, we know that's a > special case anyway, and let's just add a few lines to copy_one_pte(), > which basically does: > > if (PageAnonymous(page) && page->count > 1) { > newpage = alloc_page(); > copy_page(page, newpage); > page = newpage; > } > /* Move the page to the new address */ > page->index = address >> PAGE_SHIFT; > > and now we have zero special cases. This part makes the problem so simple. If this is acceptable, then we have many choices. Since we won't have many mms in the anonmm list, I don't think we will have any search complexity problems. If we really worry again about search complexity, we can consider using prio_tree (adds 16 bytes per vma - we cannot share vma.shared.prio_tree_node). The prio_tree easily fits for anonmm after linus-mremap-simplification. Rajesh ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-13 17:55 ` Rajesh Venkatasubramanian
@ 2004-03-13 18:16 ` Andrea Arcangeli
  2004-03-13 19:40 ` Rajesh Venkatasubramanian
  0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-13 18:16 UTC (permalink / raw)
  To: Rajesh Venkatasubramanian; +Cc: riel, linux-kernel, torvalds

On Sat, Mar 13, 2004 at 12:55:09PM -0500, Rajesh Venkatasubramanian wrote:
>
> > The only problem is mremap() after a fork(), and hell, we know that's a
> > special case anyway, and let's just add a few lines to copy_one_pte(),
> > which basically does:
> >
> > 	if (PageAnonymous(page) && page->count > 1) {
> > 		newpage = alloc_page();
> > 		copy_page(page, newpage);
> > 		page = newpage;
> > 	}
> > 	/* Move the page to the new address */
> > 	page->index = address >> PAGE_SHIFT;
> >
> > and now we have zero special cases.
>
> This part makes the problem so simple. If this is acceptable, then we
> have many choices. Since we won't have many mms in the anonmm list,
> I don't think we will have any search complexity problems. If we really
> worry again about search complexity, we can consider using prio_tree
> (adds 16 bytes per vma - we cannot share vma.shared.prio_tree_node).
> The prio_tree easily fits for anonmm after linus-mremap-simplification.

prio_tree with linus-mremap-simplification makes no sense to me. You
cannot avoid checking all the mms with the prio_tree, and that is the
only complexity issue introduced by anonmm vs anon_vma.

prio_tree can only sit on top of anon_vma, not on top of
anonmm+linus-unshare-mremap (and yes, I cannot share
vma.shared.prio_tree_node), but practically it's not needed for the
anon_vmas.

^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-13 18:16 ` Andrea Arcangeli
@ 2004-03-13 19:40 ` Rajesh Venkatasubramanian
  2004-03-14 0:23 ` Andrea Arcangeli
  0 siblings, 1 reply; 74+ messages in thread
From: Rajesh Venkatasubramanian @ 2004-03-13 19:40 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: riel, linux-kernel, torvalds

> prio_tree can only sit on top of anon_vma, not on top of
> anonmm+linus-unshare-mremap (and yes, I cannot share
> vma.shared.prio_tree_node) but practically it's not needed for the
> anon_vmas.

Agreed. prio_tree is only useful for anon_vma. But, after
linus-unshare-mremap, the anon_vma patch can be modified (simplified?)
a lot. You don't need any as.anon_vma, as.vma pointers in the page
struct. You just need the already existing page->mapping and
page->index, and a prio_tree of all anon vmas. The prio_tree can be
used to get to the "interesting vmas" without walking all mms.

However, the new prio_tree node adds 16 bytes per vma. Considering
there may not be much sharing of anon vmas in the common case, I am not
sure whether that is worthwhile. Maybe we can wait for someone to write
a program that locks the machine :)

Rajesh

^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-13 19:40 ` Rajesh Venkatasubramanian
@ 2004-03-14 0:23 ` Andrea Arcangeli
  2004-03-14 0:52 ` Linus Torvalds
  0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-14 0:23 UTC (permalink / raw)
  To: Rajesh Venkatasubramanian; +Cc: riel, linux-kernel, torvalds

On Sat, Mar 13, 2004 at 02:40:09PM -0500, Rajesh Venkatasubramanian wrote:
> Agreed. prio_tree is only useful for anon_vma. But, after
> linus-unshare-mremap, the anon_vma patch can be modified
> (simplified ?) a lot. You don't need any as.anon_vma, as.vma
> pointers in the page struct. You just need the already existing
> page->mapping and page->index, and a prio_tree of all anon vmas.

What you are missing is that we don't need a prio_tree at all with
anonmm+linus-unshare-mremap; a prio_tree can make sense only with
anon_vma, not with anonmm. The vm_pgoff is meaningless with anonmm.
find_vma (and the rbtree) already does the trick with anonmm.

linus-unshare-mremap guarantees that a certain physical page will be
only at a certain virtual address in every mm, so a prio_tree taking
pgoff into account isn't needed there; find_vma is more than enough.

No prio_tree can fix anyway the problem that anonmm will force the vm
to scan all mms at the page->index address, even for a newly allocated
malloc region. That is optimized away by anon_vma, and anon_vma also
avoids the early-COW in mremap. The relevant downside of anon_vma is
that it takes a few more bytes in the vma to provide those features.

^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-14 0:23 ` Andrea Arcangeli @ 2004-03-14 0:52 ` Linus Torvalds 2004-03-14 1:01 ` William Lee Irwin III 0 siblings, 1 reply; 74+ messages in thread From: Linus Torvalds @ 2004-03-14 0:52 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Rajesh Venkatasubramanian, riel, linux-kernel On Sun, 14 Mar 2004, Andrea Arcangeli wrote: > > linus-unshare-mremap guarantees that a certain physical page will be > only at a certain virtual address in every mm, so prio_tree taking pgoff > into account isn't needed there, find_vma is more than enough. Yes. However, I'd at least personally hope that we don't even need the find_vma() all the time. When removing a page using the reverse mapping, there really is very little reason to even look up the vma, although right now the "flush_tlb_page()" interface is done for vma only so we'd need to change that or at least add a "flush_tlb_page_mm(mm, virt)" flusher (and if any architecture wants to look up the vma, they could do so). It would be silly to look up the vma if we don't actually need it, and I don't think we do. It's likely faster to just look up the page tables directly than to even worry about anything else. But find_vma() certainly would be sufficient. Linus ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-14 0:52 ` Linus Torvalds @ 2004-03-14 1:01 ` William Lee Irwin III 2004-03-14 1:07 ` Rik van Riel 2004-03-14 1:15 ` Linus Torvalds 0 siblings, 2 replies; 74+ messages in thread From: William Lee Irwin III @ 2004-03-14 1:01 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Rajesh Venkatasubramanian, riel, linux-kernel On Sat, Mar 13, 2004 at 04:52:00PM -0800, Linus Torvalds wrote: > Yes. However, I'd at least personally hope that we don't even need the > find_vma() all the time. > When removing a page using the reverse mapping, there really is very > little reason to even look up the vma, although right now the > "flush_tlb_page()" interface is done for vma only so we'd need to change > that or at least add a "flush_tlb_page_mm(mm, virt)" flusher (and if any > architecture wants to look up the vma, they could do so). > It would be silly to look up the vma if we don't actually need it, and I > don't think we do. It's likely faster to just look up the page tables > directly than to even worry about anything else. > But find_vma() certainly would be sufficient. find_vma() is often necessary to determine whether the page is mlock()'d. In schemes where mm's that may not map the page appear in searches, it may also be necessary to determine if there's even a vma covering the area at all or otherwise a normal vma, since pagetables outside normal vmas may very well not be understood by the core (e.g. hugetlb). -- wli ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-14 1:01 ` William Lee Irwin III @ 2004-03-14 1:07 ` Rik van Riel 2004-03-14 1:19 ` William Lee Irwin III 2004-03-14 1:15 ` Linus Torvalds 1 sibling, 1 reply; 74+ messages in thread From: Rik van Riel @ 2004-03-14 1:07 UTC (permalink / raw) To: William Lee Irwin III Cc: Linus Torvalds, Andrea Arcangeli, Rajesh Venkatasubramanian, linux-kernel On Sat, 13 Mar 2004, William Lee Irwin III wrote: > On Sat, Mar 13, 2004 at 04:52:00PM -0800, Linus Torvalds wrote: > > Yes. However, I'd at least personally hope that we don't even need the > > find_vma() all the time. > > find_vma() is often necessary to determine whether the page is mlock()'d. Alternatively, the mlock()d pages shouldn't appear on the LRU at all, reusing one of the variables inside page->lru as a counter to keep track of exactly how many times this page is mlock()d. > In schemes where mm's that may not map the page appear in searches, > it may also be necessary to determine if there's even a vma covering the > area at all or otherwise a normal vma, since pagetables outside normal > vmas may very well not be understood by the core (e.g. hugetlb). If the page is a normal page on the LRU, I suspect we don't need to find the VMA, with the exception of mlock()d pages... Good thing Christoph was already looking at the mlock()d page counter idea. -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-14 1:07 ` Rik van Riel @ 2004-03-14 1:19 ` William Lee Irwin III 2004-03-14 1:41 ` Rik van Riel 0 siblings, 1 reply; 74+ messages in thread From: William Lee Irwin III @ 2004-03-14 1:19 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Andrea Arcangeli, Rajesh Venkatasubramanian, linux-kernel On Sat, 13 Mar 2004, William Lee Irwin III wrote: >> find_vma() is often necessary to determine whether the page is mlock()'d. On Sat, Mar 13, 2004 at 08:07:52PM -0500, Rik van Riel wrote: > Alternatively, the mlock()d pages shouldn't appear on the LRU > at all, reusing one of the variables inside page->lru as a > counter to keep track of exactly how many times this page is > mlock()d. That would be the rare case where it's not necessary. =) On Sat, 13 Mar 2004, William Lee Irwin III wrote: >> In schemes where mm's that may not map the page appear in searches, >> it may also be necessary to determine if there's even a vma covering the >> area at all or otherwise a normal vma, since pagetables outside normal >> vmas may very well not be understood by the core (e.g. hugetlb). On Sat, Mar 13, 2004 at 08:07:52PM -0500, Rik van Riel wrote: > If the page is a normal page on the LRU, I suspect we don't > need to find the VMA, with the exception of mlock()d pages... > Good thing Christoph was already looking at the mlock()d page > counter idea. That's not quite where the issue happens. Suppose you have a COW sharing group (called variously struct anonmm, struct anon, and so on by various codebases) where a page you're trying to unmap occurs at some virtual address in several of them, but others may have hugetlb vmas where that page is otherwise expected. On i386 and potentially others, the core may not understand present pmd's that are not mere pointers to ptes and other machine-dependent hugetlb constructs, so there is trouble. 
Searching the COW sharing group isn't how everything works, but in those cases where additionally you can find mm's that don't map the page at that virtual address and may have different vmas cover it, this can arise. -- wli ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-14 1:19 ` William Lee Irwin III @ 2004-03-14 1:41 ` Rik van Riel 2004-03-14 2:27 ` William Lee Irwin III 0 siblings, 1 reply; 74+ messages in thread From: Rik van Riel @ 2004-03-14 1:41 UTC (permalink / raw) To: William Lee Irwin III Cc: Linus Torvalds, Andrea Arcangeli, Rajesh Venkatasubramanian, linux-kernel On Sat, 13 Mar 2004, William Lee Irwin III wrote: > [hugetlb at same address] Well, we can find this merely by looking at the page tables themselves, so that shouldn't be a problem. > Searching the COW sharing group isn't how everything works, but in those > cases where additionally you can find mm's that don't map the page at > that virtual address and may have different vmas cover it, this can > arise. This could only happen when you truncate a file that's been mapped by various nonlinear VMAs, so truncate can't get rid of the pages... I suspect there are two ways to fix that: 1) on truncate, scan ALL the ptes inside nonlinear VMAs and remove the pages 2) don't allow truncate on a file that's mapped with nonlinear VMAs Either would work. -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-14 1:41 ` Rik van Riel @ 2004-03-14 2:27 ` William Lee Irwin III 0 siblings, 0 replies; 74+ messages in thread From: William Lee Irwin III @ 2004-03-14 2:27 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Andrea Arcangeli, Rajesh Venkatasubramanian, linux-kernel On Sat, 13 Mar 2004, William Lee Irwin III wrote: >> [hugetlb at same address] On Sat, Mar 13, 2004 at 08:41:42PM -0500, Rik van Riel wrote: > Well, we can find this merely by looking at the page tables > themselves, so that shouldn't be a problem. Pagetables of a kind the core understands may not be present there. On ia32 one could in theory have a pmd_huge() check, which would in turn not suffice for ia64 and sparc64 hugetlb. These were only examples. Other unusual forms of mappings, e.g. VM_RESERVED and VM_IO, may also be bad ideas to trip over by accident. On Sat, 13 Mar 2004, William Lee Irwin III wrote: >> Searching the COW sharing group isn't how everything works, but in those >> cases where additionally you can find mm's that don't map the page at >> that virtual address and may have different vmas cover it, this can >> arise. On Sat, Mar 13, 2004 at 08:41:42PM -0500, Rik van Riel wrote: > This could only happen when you truncate a file that's > been mapped by various nonlinear VMAs, so truncate can't > get rid of the pages... > I suspect there are two ways to fix that: > 1) on truncate, scan ALL the ptes inside nonlinear VMAs > and remove the pages > 2) don't allow truncate on a file that's mapped with > nonlinear VMAs > Either would work. I'm not sure how that came in. The issue I had in mind was strictly a matter of tripping over things one can't make sense of from pagetables alone in try_to_unmap(). COW-shared anonymous pages not unmappable via anonymous COW sharing groups arising from truncate() vs. 
remap_file_pages() interactions and failures to check for nonlinearly-mapped pages in pagetable walkers are an issue in general of course, but they just aren't this issue. -- wli ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-14 1:01 ` William Lee Irwin III 2004-03-14 1:07 ` Rik van Riel @ 2004-03-14 1:15 ` Linus Torvalds 1 sibling, 0 replies; 74+ messages in thread From: Linus Torvalds @ 2004-03-14 1:15 UTC (permalink / raw) To: William Lee Irwin III Cc: Andrea Arcangeli, Rajesh Venkatasubramanian, riel, linux-kernel On Sat, 13 Mar 2004, William Lee Irwin III wrote: > > find_vma() is often necessary to determine whether the page is mlock()'d. > In schemes where mm's that may not map the page appear in searches, it > may also be necessary to determine if there's even a vma covering the > area at all or otherwise a normal vma, since pagetables outside normal > vmas may very well not be understood by the core (e.g. hugetlb). Both excellent points. I guess we'll need the extra few cache misses. Dang. Linus ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
@ 2004-03-11 20:09 Manfred Spraul
0 siblings, 0 replies; 74+ messages in thread
From: Manfred Spraul @ 2004-03-11 20:09 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel
>
>
>at the previous try, with slab debugging enabled, it was spawning tons
>of errors but I suspect it's a bug in the slab debugging, it was
>complaining with red zone memory corruption, could be due the tiny size
>of this object (only 8 bytes).
>
>andrea@xeon:~> grep anon_vma /proc/slabinfo
>anon_vma 1230 1500 12 250 1 : tunables 120 60 8 : slabdata 6 6 0
>
According to the slabinfo line, 12 bytes. The revoke_table is 12 bytes,
too, and I'm not aware of any problems with slab debugging enabled.
Could you send me the first few errors?
--
Manfred
^ permalink raw reply [flat|nested] 74+ messages in thread

* objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines)
@ 2004-03-08 20:24 Andrea Arcangeli
2004-03-09 10:52 ` [lockup] " Ingo Molnar
0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-08 20:24 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton; +Cc: linux-kernel
Hello,
This patch avoids the allocation of rmap for shared memory and it uses
the objrmap framework to find the mapping-ptes starting from a
page_t, at zero memory cost (and zero cpu cost for the fast paths).
The patch applies cleanly to linux-2.5 CVS. I suggest it for merging into
mainline.
Without this patch not even the 4:4 tlb overhead would allow intensive
shm (shmfs+IPC) workloads to survive on 32bit archs. Basically without
this fix it's like 2.6 is running w/o pte-highmem. 700 tasks with 2.7G
of shm mapped each would run the box out of zone-normal even with 4:4.
With 3:1 100 tasks would be enough. Math is easy:
2.7*1024*1024*1024/4096*8*100/1024/1024/1024
2.7*1024*1024*1024/4096*8*700/1024/1024/1024
But the real reason for this work is huge 64bit archs, where we speed up
and avoid wasting tons of ram. On 32-ways the scalability is hurt
very badly by rmap, so it has to be removed (Martin can provide the
numbers, I think).
Even with this fix removing rmap for the file mappings, the anonymous
memory will still pay for the rmap slowdown (still very relevant for
various critical apps), so I just finished designing a new method for
unmapping ptes of anonymous mappings too. It's not Hugh's anobjrmap
patch because (despite being very useful to get the right mindset) its
design was flawed since it was tracking mm not vmas and the page->index
as an absolute address not an offset, so it was breaking with mremap
(forcing him to reinstantiate rmap during mremap in the anobjrmap-5
patch), and it had several other implementation issues. But all my
further work will be against the below objrmap-core. The below patch
just fixes the most serious bottlenecks. So I recommend it for
inclusion, the rest of the work for anonymous memory and non linear
vmas, is orthogonal with this.
Credit for this patch goes entirely to Dave McCracken (the original idea
of using the i_mmap lists for the vm instead of only using them for
truncate is, as usual, from David Miller); I only fixed two bugs in his
version before submitting it to you.
I speculate that because of rmap some people have been forced to use 4:4,
generating >30% slowdowns in critical common server linux workloads even
on boxes with as little as 8G of ram.
I'm very convinced that it would be a huge mistake to force the
userbase with <=16G of ram to the 4:4 slowdown, but to avoid that we have
to drop rmap.
As part of my current anon_vma_chain vm work I'm also shrinking the
page_t to 40 bytes, and eventually it will be 32 bytes with further
patches, that combined with the usage of remap_file_pages (avoiding tons
of vmas) and the bio work not requiring flood of bh anymore (more
powerful than the 2.4 varyio), should reduce even further the needs of
normal-zone during high end workloads, allowing at least 16G boxes to
run perfectly fine with 3:1 design, like today with 2.4 we already run
huge shm workloads on 16G boxes with plenty of zone-normal margin in
production, even 32G seems to work fine (though the margin is not huge
there). With 2.6 I expect to raise the margin significantly (for
safety) in 32G boxes too with the most efficient 3:1 kernel split. Only
64G boxes will require either 2.5:1.5 or 4:4, and I think it's ok to
either use 4:4 or 2.5:1.5 there since they're less than 1% of the
userbase and with AMD64 hitting the market already I doubt the x86 64G
userbase will increase anytime soon.
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/fs/exec.c sles-objrmap/fs/exec.c
--- sles-ref/fs/exec.c 2004-02-29 17:47:21.000000000 +0100
+++ sles-objrmap/fs/exec.c 2004-03-03 06:45:38.716636864 +0100
@@ -323,6 +323,7 @@ void put_dirty_page(struct task_struct *
}
lru_cache_add_active(page);
flush_dcache_page(page);
+ SetPageAnon(page);
set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte(page, prot))));
pte_chain = page_add_rmap(page, pte, pte_chain);
pte_unmap(pte);
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/include/linux/mm.h sles-objrmap/include/linux/mm.h
--- sles-ref/include/linux/mm.h 2004-02-29 17:47:30.000000000 +0100
+++ sles-objrmap/include/linux/mm.h 2004-03-03 06:45:38.000000000 +0100
@@ -180,6 +180,7 @@ struct page {
struct pte_chain *chain;/* Reverse pte mapping pointer.
* protected by PG_chainlock */
pte_addr_t direct;
+ int mapcount;
} pte;
unsigned long private; /* mapping-private opaque data */
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/include/linux/page-flags.h sles-objrmap/include/linux/page-flags.h
--- sles-ref/include/linux/page-flags.h 2004-01-15 18:36:24.000000000 +0100
+++ sles-objrmap/include/linux/page-flags.h 2004-03-03 06:45:38.808622880 +0100
@@ -75,6 +75,7 @@
#define PG_mappedtodisk 17 /* Has blocks allocated on-disk */
#define PG_reclaim 18 /* To be reclaimed asap */
#define PG_compound 19 /* Part of a compound page */
+#define PG_anon 20 /* Anonymous page */
/*
@@ -270,6 +271,10 @@ extern void get_full_page_state(struct p
#define SetPageCompound(page) set_bit(PG_compound, &(page)->flags)
#define ClearPageCompound(page) clear_bit(PG_compound, &(page)->flags)
+#define PageAnon(page) test_bit(PG_anon, &(page)->flags)
+#define SetPageAnon(page) set_bit(PG_anon, &(page)->flags)
+#define ClearPageAnon(page) clear_bit(PG_anon, &(page)->flags)
+
/*
* The PageSwapCache predicate doesn't use a PG_flag at this time,
* but it may again do so one day.
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/include/linux/swap.h sles-objrmap/include/linux/swap.h
--- sles-ref/include/linux/swap.h 2004-02-04 16:07:05.000000000 +0100
+++ sles-objrmap/include/linux/swap.h 2004-03-03 06:45:38.830619536 +0100
@@ -185,6 +185,8 @@ struct pte_chain *FASTCALL(page_add_rmap
void FASTCALL(page_remove_rmap(struct page *, pte_t *));
int FASTCALL(try_to_unmap(struct page *));
+int page_convert_anon(struct page *);
+
/* linux/mm/shmem.c */
extern int shmem_unuse(swp_entry_t entry, struct page *page);
#else
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/filemap.c sles-objrmap/mm/filemap.c
--- sles-ref/mm/filemap.c 2004-02-29 17:47:33.000000000 +0100
+++ sles-objrmap/mm/filemap.c 2004-03-03 06:45:38.915606616 +0100
@@ -73,6 +73,9 @@
* ->mmap_sem
* ->i_sem (msync)
*
+ * ->lock_page
+ * ->i_shared_sem (page_convert_anon)
+ *
* ->inode_lock
* ->sb_lock (fs/fs-writeback.c)
* ->mapping->page_lock (__sync_single_inode)
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/fremap.c sles-objrmap/mm/fremap.c
--- sles-ref/mm/fremap.c 2004-02-29 17:47:26.000000000 +0100
+++ sles-objrmap/mm/fremap.c 2004-03-03 06:45:38.936603424 +0100
@@ -61,10 +61,26 @@ int install_page(struct mm_struct *mm, s
pmd_t *pmd;
pte_t pte_val;
struct pte_chain *pte_chain;
+ unsigned long pgidx;
pte_chain = pte_chain_alloc(GFP_KERNEL);
if (!pte_chain)
goto err;
+
+ /*
+ * Convert this page to anon for objrmap if it's nonlinear
+ */
+ pgidx = (addr - vma->vm_start) >> PAGE_SHIFT;
+ pgidx += vma->vm_pgoff;
+ pgidx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
+ if (!PageAnon(page) && (page->index != pgidx)) {
+ lock_page(page);
+ err = page_convert_anon(page);
+ unlock_page(page);
+ if (err < 0)
+ goto err_free;
+ }
+
pgd = pgd_offset(mm, addr);
spin_lock(&mm->page_table_lock);
@@ -85,12 +101,11 @@ int install_page(struct mm_struct *mm, s
pte_val = *pte;
pte_unmap(pte);
update_mmu_cache(vma, addr, pte_val);
- spin_unlock(&mm->page_table_lock);
- pte_chain_free(pte_chain);
- return 0;
+ err = 0;
err_unlock:
spin_unlock(&mm->page_table_lock);
+err_free:
pte_chain_free(pte_chain);
err:
return err;
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/memory.c sles-objrmap/mm/memory.c
--- sles-ref/mm/memory.c 2004-02-29 17:47:33.000000000 +0100
+++ sles-objrmap/mm/memory.c 2004-03-03 06:45:38.965599016 +0100
@@ -1071,6 +1071,7 @@ static int do_wp_page(struct mm_struct *
++mm->rss;
page_remove_rmap(old_page, page_table);
break_cow(vma, new_page, address, page_table);
+ SetPageAnon(new_page);
pte_chain = page_add_rmap(new_page, page_table, pte_chain);
lru_cache_add_active(new_page);
@@ -1310,6 +1311,7 @@ static int do_swap_page(struct mm_struct
flush_icache_page(vma, page);
set_pte(page_table, pte);
+ SetPageAnon(page);
pte_chain = page_add_rmap(page, page_table, pte_chain);
/* No need to invalidate - it was non-present before */
@@ -1377,6 +1379,7 @@ do_anonymous_page(struct mm_struct *mm,
vma);
lru_cache_add_active(page);
mark_page_accessed(page);
+ SetPageAnon(page);
}
set_pte(page_table, entry);
@@ -1444,6 +1447,10 @@ retry:
if (!pte_chain)
goto oom;
+ /* See if nopage returned an anon page */
+ if (!new_page->mapping || PageSwapCache(new_page))
+ SetPageAnon(new_page);
+
/*
* Should we do an early C-O-W break?
*/
@@ -1454,6 +1461,7 @@ retry:
copy_user_highpage(page, new_page, address);
page_cache_release(new_page);
lru_cache_add_active(page);
+ SetPageAnon(page);
new_page = page;
}
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/mmap.c sles-objrmap/mm/mmap.c
--- sles-ref/mm/mmap.c 2004-02-29 17:47:30.000000000 +0100
+++ sles-objrmap/mm/mmap.c 2004-03-03 06:53:46.000000000 +0100
@@ -267,9 +267,7 @@ static void vma_link(struct mm_struct *m
if (mapping)
down(&mapping->i_shared_sem);
- spin_lock(&mm->page_table_lock);
__vma_link(mm, vma, prev, rb_link, rb_parent);
- spin_unlock(&mm->page_table_lock);
if (mapping)
up(&mapping->i_shared_sem);
@@ -318,6 +316,22 @@ static inline int is_mergeable_vma(struc
return 1;
}
+/* requires that the relevant i_shared_sem be held by the caller */
+static void move_vma_start(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct inode *inode = NULL;
+
+ if (vma->vm_file)
+ inode = vma->vm_file->f_dentry->d_inode;
+ if (inode)
+ __remove_shared_vm_struct(vma, inode);
+ /* If no vm_file, perhaps we should always keep vm_pgoff at 0?? */
+ vma->vm_pgoff += (long)(addr - vma->vm_start) >> PAGE_SHIFT;
+ vma->vm_start = addr;
+ if (inode)
+ __vma_link_file(vma);
+}
+
/*
* Return true if we can merge this (vm_flags,file,vm_pgoff,size)
* in front of (at a lower virtual address and file offset than) the vma.
@@ -370,7 +384,6 @@ static int vma_merge(struct mm_struct *m
unsigned long end, unsigned long vm_flags,
struct file *file, unsigned long pgoff)
{
- spinlock_t *lock = &mm->page_table_lock;
struct inode *inode = file ? file->f_dentry->d_inode : NULL;
struct semaphore *i_shared_sem;
@@ -402,7 +415,6 @@ static int vma_merge(struct mm_struct *m
down(i_shared_sem);
need_up = 1;
}
- spin_lock(lock);
prev->vm_end = end;
/*
@@ -415,7 +427,6 @@ static int vma_merge(struct mm_struct *m
prev->vm_end = next->vm_end;
__vma_unlink(mm, next, prev);
__remove_shared_vm_struct(next, inode);
- spin_unlock(lock);
if (need_up)
up(i_shared_sem);
if (file)
@@ -425,7 +436,6 @@ static int vma_merge(struct mm_struct *m
kmem_cache_free(vm_area_cachep, next);
return 1;
}
- spin_unlock(lock);
if (need_up)
up(i_shared_sem);
return 1;
@@ -443,10 +453,7 @@ static int vma_merge(struct mm_struct *m
if (end == prev->vm_start) {
if (file)
down(i_shared_sem);
- spin_lock(lock);
- prev->vm_start = addr;
- prev->vm_pgoff -= (end - addr) >> PAGE_SHIFT;
- spin_unlock(lock);
+ move_vma_start(prev, addr);
if (file)
up(i_shared_sem);
return 1;
@@ -905,19 +912,16 @@ int expand_stack(struct vm_area_struct *
*/
address += 4 + PAGE_SIZE - 1;
address &= PAGE_MASK;
- spin_lock(&vma->vm_mm->page_table_lock);
grow = (address - vma->vm_end) >> PAGE_SHIFT;
/* Overcommit.. */
if (security_vm_enough_memory(grow)) {
- spin_unlock(&vma->vm_mm->page_table_lock);
return -ENOMEM;
}
if (address - vma->vm_start > current->rlim[RLIMIT_STACK].rlim_cur ||
((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
current->rlim[RLIMIT_AS].rlim_cur) {
- spin_unlock(&vma->vm_mm->page_table_lock);
vm_unacct_memory(grow);
return -ENOMEM;
}
@@ -925,7 +929,6 @@ int expand_stack(struct vm_area_struct *
vma->vm_mm->total_vm += grow;
if (vma->vm_flags & VM_LOCKED)
vma->vm_mm->locked_vm += grow;
- spin_unlock(&vma->vm_mm->page_table_lock);
return 0;
}
@@ -959,19 +962,16 @@ int expand_stack(struct vm_area_struct *
* the spinlock only before relocating the vma range ourself.
*/
address &= PAGE_MASK;
- spin_lock(&vma->vm_mm->page_table_lock);
grow = (vma->vm_start - address) >> PAGE_SHIFT;
/* Overcommit.. */
if (security_vm_enough_memory(grow)) {
- spin_unlock(&vma->vm_mm->page_table_lock);
return -ENOMEM;
}
if (vma->vm_end - address > current->rlim[RLIMIT_STACK].rlim_cur ||
((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
current->rlim[RLIMIT_AS].rlim_cur) {
- spin_unlock(&vma->vm_mm->page_table_lock);
vm_unacct_memory(grow);
return -ENOMEM;
}
@@ -980,7 +980,6 @@ int expand_stack(struct vm_area_struct *
vma->vm_mm->total_vm += grow;
if (vma->vm_flags & VM_LOCKED)
vma->vm_mm->locked_vm += grow;
- spin_unlock(&vma->vm_mm->page_table_lock);
return 0;
}
@@ -1147,8 +1146,6 @@ static void unmap_region(struct mm_struc
/*
* Create a list of vma's touched by the unmap, removing them from the mm's
* vma list as we go..
- *
- * Called with the page_table_lock held.
*/
static void
detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -1211,10 +1208,9 @@ int split_vma(struct mm_struct * mm, str
down(&mapping->i_shared_sem);
spin_lock(&mm->page_table_lock);
- if (new_below) {
- vma->vm_start = addr;
- vma->vm_pgoff += ((addr - new->vm_start) >> PAGE_SHIFT);
- } else
+ if (new_below)
+ move_vma_start(vma, addr);
+ else
vma->vm_end = addr;
__insert_vm_struct(mm, new);
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/page_alloc.c sles-objrmap/mm/page_alloc.c
--- sles-ref/mm/page_alloc.c 2004-02-29 17:47:36.000000000 +0100
+++ sles-objrmap/mm/page_alloc.c 2004-03-03 06:45:38.992594912 +0100
@@ -230,6 +230,8 @@ static inline void free_pages_check(cons
bad_page(function, page);
if (PageDirty(page))
ClearPageDirty(page);
+ if (PageAnon(page))
+ ClearPageAnon(page);
}
/*
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/rmap.c sles-objrmap/mm/rmap.c
--- sles-ref/mm/rmap.c 2004-02-29 17:47:33.000000000 +0100
+++ sles-objrmap/mm/rmap.c 2004-03-03 07:01:39.200621104 +0100
@@ -102,6 +102,136 @@ pte_chain_encode(struct pte_chain *pte_c
**/
/**
+ * find_pte - Find a pte pointer given a vma and a struct page.
+ * @vma: the vma to search
+ * @page: the page to find
+ *
+ * Determine if this page is mapped in this vma. If it is, map and return
+ * the pte pointer associated with it. Return NULL if the page is not
+ * mapped in this vma for any reason.
+ *
+ * This is strictly an internal helper function for the object-based rmap
+ * functions.
+ *
+ * It is the caller's responsibility to unmap the pte if it is returned.
+ */
+static inline pte_t *
+find_pte(struct vm_area_struct *vma, struct page *page, unsigned long *addr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pgd_t *pgd;
+ pmd_t *pmd;
+ pte_t *pte;
+ unsigned long loffset;
+ unsigned long address;
+
+ loffset = (page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT));
+ address = vma->vm_start + ((loffset - vma->vm_pgoff) << PAGE_SHIFT);
+ if (address < vma->vm_start || address >= vma->vm_end)
+ goto out;
+
+ pgd = pgd_offset(mm, address);
+ if (!pgd_present(*pgd))
+ goto out;
+
+ pmd = pmd_offset(pgd, address);
+ if (!pmd_present(*pmd))
+ goto out;
+
+ pte = pte_offset_map(pmd, address);
+ if (!pte_present(*pte))
+ goto out_unmap;
+
+ if (page_to_pfn(page) != pte_pfn(*pte))
+ goto out_unmap;
+
+ if (addr)
+ *addr = address;
+
+ return pte;
+
+out_unmap:
+ pte_unmap(pte);
+out:
+ return NULL;
+}
+
+/**
+ * page_referenced_obj_one - referenced check for object-based rmap
+ * @vma: the vma to look in.
+ * @page: the page we're working on.
+ *
+ * Find a pte entry for a page/vma pair, then check and clear the referenced
+ * bit.
+ *
+ * This is strictly a helper function for page_referenced_obj.
+ */
+static int
+page_referenced_obj_one(struct vm_area_struct *vma, struct page *page)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pte_t *pte;
+ int referenced = 0;
+
+ if (!spin_trylock(&mm->page_table_lock))
+ return 1;
+
+ pte = find_pte(vma, page, NULL);
+ if (pte) {
+ if (ptep_test_and_clear_young(pte))
+ referenced++;
+ pte_unmap(pte);
+ }
+
+ spin_unlock(&mm->page_table_lock);
+ return referenced;
+}
+
+/**
+ * page_referenced_obj - referenced check for object-based rmap
+ * @page: the page we're checking references on.
+ *
+ * For an object-based mapped page, find all the places it is mapped and
+ * check/clear the referenced flag. This is done by following the page->mapping
+ * pointer, then walking the chain of vmas it holds. It returns the number
+ * of references it found.
+ *
+ * This function is only called from page_referenced for object-based pages.
+ *
+ * The semaphore address_space->i_shared_sem is tried. If it can't be gotten,
+ * assume a reference count of 1.
+ */
+static int
+page_referenced_obj(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct vm_area_struct *vma;
+ int referenced = 0;
+
+ if (!page->pte.mapcount)
+ return 0;
+
+ if (!mapping)
+ BUG();
+
+ if (PageSwapCache(page))
+ BUG();
+
+ if (down_trylock(&mapping->i_shared_sem))
+ return 1;
+
+ list_for_each_entry(vma, &mapping->i_mmap, shared)
+ referenced += page_referenced_obj_one(vma, page);
+
+ list_for_each_entry(vma, &mapping->i_mmap_shared, shared)
+ referenced += page_referenced_obj_one(vma, page);
+
+ up(&mapping->i_shared_sem);
+
+ return referenced;
+}
+
+/**
* page_referenced - test if the page was referenced
* @page: the page to test
*
@@ -123,6 +253,10 @@ int fastcall page_referenced(struct page
if (TestClearPageReferenced(page))
referenced++;
+ if (!PageAnon(page)) {
+ referenced += page_referenced_obj(page);
+ goto out;
+ }
if (PageDirect(page)) {
pte_t *pte = rmap_ptep_map(page->pte.direct);
if (ptep_test_and_clear_young(pte))
@@ -154,6 +288,7 @@ int fastcall page_referenced(struct page
__pte_chain_free(pc);
}
}
+out:
return referenced;
}
@@ -176,6 +311,21 @@ page_add_rmap(struct page *page, pte_t *
pte_chain_lock(page);
+ /*
+ * If this is an object-based page, just count it. We can
+ * find the mappings by walking the object vma chain for that object.
+ */
+ if (!PageAnon(page)) {
+ if (!page->mapping)
+ BUG();
+ if (PageSwapCache(page))
+ BUG();
+ if (!page->pte.mapcount)
+ inc_page_state(nr_mapped);
+ page->pte.mapcount++;
+ goto out;
+ }
+
if (page->pte.direct == 0) {
page->pte.direct = pte_paddr;
SetPageDirect(page);
@@ -232,8 +382,25 @@ void fastcall page_remove_rmap(struct pa
pte_chain_lock(page);
if (!page_mapped(page))
- goto out_unlock; /* remap_page_range() from a driver? */
+ goto out_unlock;
+ /*
+ * If this is an object-based page, just uncount it. We can
+ * find the mappings by walking the object vma chain for that object.
+ */
+ if (!PageAnon(page)) {
+ if (!page->mapping)
+ BUG();
+ if (PageSwapCache(page))
+ BUG();
+ if (!page->pte.mapcount)
+ BUG();
+ page->pte.mapcount--;
+ if (!page->pte.mapcount)
+ dec_page_state(nr_mapped);
+ goto out_unlock;
+ }
+
if (PageDirect(page)) {
if (page->pte.direct == pte_paddr) {
page->pte.direct = 0;
@@ -280,6 +447,102 @@ out_unlock:
}
/**
+ * try_to_unmap_obj_one - unmap a page in one vma, object-based rmap method
+ * @page: the page to unmap
+ *
+ * Determine whether a page is mapped in a given vma and unmap it if it's found.
+ *
+ * This function is strictly a helper function for try_to_unmap_obj.
+ */
+static inline int
+try_to_unmap_obj_one(struct vm_area_struct *vma, struct page *page)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address;
+ pte_t *pte;
+ pte_t pteval;
+ int ret = SWAP_AGAIN;
+
+ if (!spin_trylock(&mm->page_table_lock))
+ return ret;
+
+ pte = find_pte(vma, page, &address);
+ if (!pte)
+ goto out;
+
+ if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) {
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
+
+ flush_cache_page(vma, address);
+ pteval = ptep_get_and_clear(pte);
+ flush_tlb_page(vma, address);
+
+ if (pte_dirty(pteval))
+ set_page_dirty(page);
+
+ if (!page->pte.mapcount)
+ BUG();
+
+ mm->rss--;
+ page->pte.mapcount--;
+ page_cache_release(page);
+
+out_unmap:
+ pte_unmap(pte);
+
+out:
+ spin_unlock(&mm->page_table_lock);
+ return ret;
+}
+
+/**
+ * try_to_unmap_obj - unmap a page using the object-based rmap method
+ * @page: the page to unmap
+ *
+ * Find all the mappings of a page using the mapping pointer and the vma chains
+ * contained in the address_space struct it points to.
+ *
+ * This function is only called from try_to_unmap for object-based pages.
+ *
+ * The semaphore address_space->i_shared_sem is tried. If it can't be gotten,
+ * return a temporary error.
+ */
+static int
+try_to_unmap_obj(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct vm_area_struct *vma;
+ int ret = SWAP_AGAIN;
+
+ if (!mapping)
+ BUG();
+
+ if (PageSwapCache(page))
+ BUG();
+
+ if (down_trylock(&mapping->i_shared_sem))
+ return ret;
+
+ list_for_each_entry(vma, &mapping->i_mmap, shared) {
+ ret = try_to_unmap_obj_one(vma, page);
+ if (ret == SWAP_FAIL || !page->pte.mapcount)
+ goto out;
+ }
+
+ list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
+ ret = try_to_unmap_obj_one(vma, page);
+ if (ret == SWAP_FAIL || !page->pte.mapcount)
+ goto out;
+ }
+
+out:
+ up(&mapping->i_shared_sem);
+ return ret;
+}
+
+/**
* try_to_unmap_one - worker function for try_to_unmap
* @page: page to unmap
* @ptep: page table entry to unmap from page
@@ -397,6 +660,15 @@ int fastcall try_to_unmap(struct page *
if (!page->mapping)
BUG();
+ /*
+ * If it's an object-based page, use the object vma chain to find all
+ * the mappings.
+ */
+ if (!PageAnon(page)) {
+ ret = try_to_unmap_obj(page);
+ goto out;
+ }
+
if (PageDirect(page)) {
ret = try_to_unmap_one(page, page->pte.direct);
if (ret == SWAP_SUCCESS) {
@@ -453,12 +725,115 @@ int fastcall try_to_unmap(struct page *
}
}
out:
- if (!page_mapped(page))
+ if (!page_mapped(page)) {
dec_page_state(nr_mapped);
+ ret = SWAP_SUCCESS;
+ }
return ret;
}
/**
+ * page_convert_anon - Convert an object-based mapped page to pte_chain-based.
+ * @page: the page to convert
+ *
+ * Find all the mappings for an object-based page and convert them
+ * to 'anonymous', ie create a pte_chain and store all the pte pointers there.
+ *
+ * This function takes the address_space->i_shared_sem, sets the PageAnon flag,
+ * then sets the mm->page_table_lock for each vma and calls page_add_rmap. This
+ * means there is a period when PageAnon is set, but still has some mappings
+ * with no pte_chain entry. This is in fact safe, since page_remove_rmap will
+ * simply not find it. try_to_unmap might erroneously return success, but it
+ * will never be called because the page_convert_anon() caller has locked the
+ * page.
+ *
+ * page_referenced() may fail to scan all the appropriate pte's and may return
+ * an inaccurate result. This is so rare that it does not matter.
+ */
+int page_convert_anon(struct page *page)
+{
+ struct address_space *mapping;
+ struct vm_area_struct *vma;
+ struct pte_chain *pte_chain = NULL;
+ pte_t *pte;
+ int err = 0;
+
+ mapping = page->mapping;
+ if (mapping == NULL)
+ goto out; /* truncate won the lock_page() race */
+
+ down(&mapping->i_shared_sem);
+ pte_chain_lock(page);
+
+ /*
+ * Has someone else done it for us before we got the lock?
+ * If so, pte.direct or pte.chain has replaced pte.mapcount.
+ */
+ if (PageAnon(page)) {
+ pte_chain_unlock(page);
+ goto out_unlock;
+ }
+
+ SetPageAnon(page);
+ if (page->pte.mapcount == 0) {
+ pte_chain_unlock(page);
+ goto out_unlock;
+ }
+ /* This is gonna get incremented by page_add_rmap */
+ dec_page_state(nr_mapped);
+ page->pte.mapcount = 0;
+
+ /*
+ * Now that the page is marked as anon, unlock it. page_add_rmap will
+ * lock it as necessary.
+ */
+ pte_chain_unlock(page);
+
+ list_for_each_entry(vma, &mapping->i_mmap, shared) {
+ if (!pte_chain) {
+ pte_chain = pte_chain_alloc(GFP_KERNEL);
+ if (!pte_chain) {
+ err = -ENOMEM;
+ goto out_unlock;
+ }
+ }
+ spin_lock(&vma->vm_mm->page_table_lock);
+ pte = find_pte(vma, page, NULL);
+ if (pte) {
+ /* Make sure this isn't a duplicate */
+ page_remove_rmap(page, pte);
+ pte_chain = page_add_rmap(page, pte, pte_chain);
+ pte_unmap(pte);
+ }
+ spin_unlock(&vma->vm_mm->page_table_lock);
+ }
+ list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
+ if (!pte_chain) {
+ pte_chain = pte_chain_alloc(GFP_KERNEL);
+ if (!pte_chain) {
+ err = -ENOMEM;
+ goto out_unlock;
+ }
+ }
+ spin_lock(&vma->vm_mm->page_table_lock);
+ pte = find_pte(vma, page, NULL);
+ if (pte) {
+ /* Make sure this isn't a duplicate */
+ page_remove_rmap(page, pte);
+ pte_chain = page_add_rmap(page, pte, pte_chain);
+ pte_unmap(pte);
+ }
+ spin_unlock(&vma->vm_mm->page_table_lock);
+ }
+
+out_unlock:
+ pte_chain_free(pte_chain);
+ up(&mapping->i_shared_sem);
+out:
+ return err;
+}
+
+/**
** No more VM stuff below this comment, only pte_chain helper
** functions.
**/
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/swapfile.c sles-objrmap/mm/swapfile.c
--- sles-ref/mm/swapfile.c 2004-02-20 17:26:54.000000000 +0100
+++ sles-objrmap/mm/swapfile.c 2004-03-03 07:03:33.128301464 +0100
@@ -390,6 +390,7 @@ unuse_pte(struct vm_area_struct *vma, un
vma->vm_mm->rss++;
get_page(page);
set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
+ SetPageAnon(page);
*pte_chainp = page_add_rmap(page, dir, *pte_chainp);
swap_free(entry);
}
^ permalink raw reply	[flat|nested] 74+ messages in thread

* [lockup] Re: objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines)
  2004-03-08 20:24 objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines) Andrea Arcangeli
@ 2004-03-09 10:52 ` Ingo Molnar
  2004-03-09 11:02   ` Ingo Molnar
  0 siblings, 1 reply; 74+ messages in thread
From: Ingo Molnar @ 2004-03-09 10:52 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Linus Torvalds, Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2140 bytes --]

* Andrea Arcangeli <andrea@suse.de> wrote:

> This patch avoids the allocation of rmap for shared memory and it uses
> the objrmap framework to find the mapping ptes starting from a
> page_t, which is zero memory cost (and zero cpu cost for the fast
> paths)

this patch locks up the VM. To reproduce, run the attached, very simple
test-mmap.c code (as unprivileged user) which maps 80MB worth of shared
memory in a finegrained way, creating ~19K vmas, and sleeps. Keep this
process around. Then try to create any sort of VM swap pressure. (start
a few desktop apps or generate pagecache pressure.) [the 500 MHz P3
system i tried this on has 256 MB of RAM and 300 MB of swap.]

stock 2.6.4-rc2-mm1 handles it just fine - it starts swapping and
recovers. The system is responsive and behaves just fine.

with 2.6.4-rc2-mm1 + your objrmap patch the box in essence locks up and
it's not possible to do anything. The VM is looping within the objrmap
functions. (a sample trace attached.)

Note that the test-mmap.c app does nothing that a normal user cannot
do. In fact it's not even hostile - it only has lots of vmas but is
otherwise not actively pushing the VM, it's just sleeping. (Also, the
test is a very far cry from Oracle's workload of gigabytes of shm
mapped in a finegrained way to hundreds of processes.)

All in one, currently i believe the patch is pretty unacceptable in its
present form.
	Ingo

Pid: 7, comm: kswapd0
EIP: 0060:[<c013ee6d>] CPU: 0
EIP is at page_referenced_obj+0xdd/0x120
EFLAGS: 00000246 Not tainted
EAX: cb311808 EBX: cb311820 ECX: 40a2d000 EDX: cb311848
ESI: cfe202fc EDI: cfe2033c EBP: cfdf9dc4
DS: 007b ES: 007b
CR0: 8005003b CR2: 40507000 CR3: 0b11e000 CR4: 00000290
Call Trace:
 [<c013ef71>] page_referenced+0xc1/0xd0
 [<c0137bad>] refill_inactive_zone+0x3fd/0x4c0
 [<c01376bc>] shrink_cache+0x26c/0x360
 [<c0137d11>] shrink_zone+0xa1/0xb0
 [<c01380d7>] balance_pgdat+0x1a7/0x200
 [<c013820b>] kswapd+0xdb/0xe0
 [<c01180b0>] autoremove_wake_function+0x0/0x50
 [<c01180b0>] autoremove_wake_function+0x0/0x50
 [<c0138130>] kswapd+0x0/0xe0
 [<c01050f9>] kernel_thread_helper+0x5/0xc

[-- Attachment #2: test-mmap.c --]
[-- Type: text/plain, Size: 1095 bytes --]

/*
 * Copyright (C) Ingo Molnar, 2004
 *
 * Create 80 MB worth of finegrained mappings to a shmfs file.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* 80 MB of mappings */
#define CACHE_PAGES	20000
#define PAGE_SIZE	4096
#define CACHE_SIZE	(CACHE_PAGES*PAGE_SIZE)

#define WINDOW_PAGES	(CACHE_PAGES*9/10)
#define WINDOW_SIZE	(WINDOW_PAGES*PAGE_SIZE)
#define WINDOW_START	0x48000000

int main(void)
{
	char *data, *ptr, filename[100];
	char empty_page[PAGE_SIZE];
	int i, fd;

	sprintf(filename, "/dev/shm/cache%d", getpid());
	fd = open(filename, O_RDWR|O_CREAT|O_TRUNC, S_IRWXU);
	unlink(filename);
	for (i = 0; i < CACHE_PAGES; i++)
		write(fd, empty_page, PAGE_SIZE);

	data = mmap(0, WINDOW_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	for (i = 0; i < WINDOW_PAGES; i++) {
		ptr = (char *) mmap(data + i*PAGE_SIZE, PAGE_SIZE,
				PROT_READ|PROT_WRITE, MAP_SHARED | MAP_FIXED,
				fd, (WINDOW_PAGES-i)*PAGE_SIZE);
		(*ptr)++;
	}
	printf("%d pages mapped - sleeping until Ctrl-C.\n", WINDOW_PAGES);
	pause();

	return 0;
}

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [lockup] Re: objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines)
  2004-03-09 10:52 ` [lockup] " Ingo Molnar
@ 2004-03-09 11:02   ` Ingo Molnar
  2004-03-09 11:09     ` Andrew Morton
  0 siblings, 1 reply; 74+ messages in thread
From: Ingo Molnar @ 2004-03-09 11:02 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Linus Torvalds, Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 477 bytes --]

* Ingo Molnar <mingo@elte.hu> wrote:

> To reproduce, run the attached, very simple test-mmap.c code (as
> unprivileged user) which maps 80MB worth of shared memory in a
> finegrained way, creating ~19K vmas, and sleeps. Keep this process
> around.

or run the attached test-mmap2.c code, which simulates a very small DB
app using only 1800 vmas per process: it only maps 8 MB of shm and
spawns 32 processes. This has an even more lethal effect than the
previous code.

	Ingo

[-- Attachment #2: test-mmap2.c --]
[-- Type: text/plain, Size: 1160 bytes --]

/*
 * Copyright (C) Ingo Molnar, 2004
 *
 * Create 8 MB worth of finegrained mappings to a shmfs file,
 * and spawn 32 processes.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* 8 MB of mappings */
#define CACHE_PAGES	2000
#define PAGE_SIZE	4096
#define CACHE_SIZE	(CACHE_PAGES*PAGE_SIZE)

#define WINDOW_PAGES	(CACHE_PAGES*9/10)
#define WINDOW_SIZE	(WINDOW_PAGES*PAGE_SIZE)
#define WINDOW_START	0x48000000

int main(void)
{
	char *data, *ptr, filename[100];
	char empty_page[PAGE_SIZE];
	int i, fd;

	sprintf(filename, "/dev/shm/cache%d", getpid());
	fd = open(filename, O_RDWR|O_CREAT|O_TRUNC, S_IRWXU);
	unlink(filename);
	for (i = 0; i < CACHE_PAGES; i++)
		write(fd, empty_page, PAGE_SIZE);

	data = mmap(0, WINDOW_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	for (i = 0; i < WINDOW_PAGES; i++) {
		ptr = (char *) mmap(data + i*PAGE_SIZE, PAGE_SIZE,
				PROT_READ|PROT_WRITE, MAP_SHARED | MAP_FIXED,
				fd, (WINDOW_PAGES-i)*PAGE_SIZE);
		(*ptr)++;
	}
	printf("%d pages mapped - sleeping until Ctrl-C.\n", WINDOW_PAGES);
	fork(); fork(); fork(); fork(); fork();
	pause();

	return 0;
}

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [lockup] Re: objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines)
  2004-03-09 11:02   ` Ingo Molnar
@ 2004-03-09 11:09     ` Andrew Morton
  2004-03-09 11:49       ` Ingo Molnar
  0 siblings, 1 reply; 74+ messages in thread
From: Andrew Morton @ 2004-03-09 11:09 UTC (permalink / raw)
To: Ingo Molnar; +Cc: andrea, torvalds, linux-kernel

Ingo Molnar <mingo@elte.hu> wrote:
>
> or run the attached test-mmap2.c code, which simulates a very small DB
> app using only 1800 vmas per process: it only maps 8 MB of shm and
> spawns 32 processes. This has an even more lethal effect than the
> previous code.

Do these tests actually make any forward progress at all, or is it some
bug which has sent the kernel into a loop?

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [lockup] Re: objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines)
  2004-03-09 11:09     ` Andrew Morton
@ 2004-03-09 11:49       ` Ingo Molnar
  2004-03-09 16:03         ` Andrea Arcangeli
  0 siblings, 1 reply; 74+ messages in thread
From: Ingo Molnar @ 2004-03-09 11:49 UTC (permalink / raw)
To: Andrew Morton; +Cc: andrea, torvalds, linux-kernel

* Andrew Morton <akpm@osdl.org> wrote:

> > or run the attached test-mmap2.c code, which simulates a very small DB
> > app using only 1800 vmas per process: it only maps 8 MB of shm and
> > spawns 32 processes. This has an even more lethal effect than the
> > previous code.
>
> Do these tests actually make any forward progress at all, or is it some
> bug which has sent the kernel into a loop?

i think they make forward progress, so it's more of a DoS - but a very
effective one, especially considering that i didn't even try hard ...

what worries me is that there are apps that generate such vma patterns
(for various reasons). I do believe that scanning ->i_mmap &
->i_mmap_shared is fundamentally flawed.

	Ingo

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [lockup] Re: objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines)
  2004-03-09 11:49       ` Ingo Molnar
@ 2004-03-09 16:03         ` Andrea Arcangeli
  2004-03-10 10:36           ` RFC anon_vma previous (i.e. full objrmap) Andrea Arcangeli
  0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-09 16:03 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Andrew Morton, torvalds, linux-kernel

On Tue, Mar 09, 2004 at 12:49:24PM +0100, Ingo Molnar wrote:
>
> * Andrew Morton <akpm@osdl.org> wrote:
>
> > > or run the attached test-mmap2.c code, which simulates a very small DB
> > > app using only 1800 vmas per process: it only maps 8 MB of shm and
> > > spawns 32 processes. This has an even more lethal effect than the
> > > previous code.
> >
> > Do these tests actually make any forward progress at all, or is it some
> > bug which has sent the kernel into a loop?
>
> i think they make forward progress, so it's more of a DoS - but a very
> effective one, especially considering that i didn't even try hard ...
>
> what worries me is that there are apps that generate such vma patterns
> (for various reasons).

those vmas in those apps are forced to be mlocked with the rmap VM, so
it's hard for me to buy that rmap is any better. You can't even allow
those vmas to be non-mlocked or you'll exhaust zone-normal even with
4:4.

on 64bit those apps will work absolutely best with objrmap, and they
waste tons of ram (and some amount of cpu too) with rmap. objrmap is
the best model for those apps on any 64bit arch. The arguments you're
making about those apps are all in favour of objrmap IMO.

> I do believe that scanning ->i_mmap & ->i_mmap_shared is fundamentally
> flawed.

If it's the DoS that you worry about, vmtruncate will do the trick too.
The overall machine remains usable for me, despite the increased cpu
load.

^ permalink raw reply	[flat|nested] 74+ messages in thread
* RFC anon_vma previous (i.e. full objrmap)
  2004-03-09 16:03         ` Andrea Arcangeli
@ 2004-03-10 10:36           ` Andrea Arcangeli
  2004-03-11  6:52             ` anon_vma RFC2 Andrea Arcangeli
  0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-10 10:36 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Andrew Morton, torvalds, linux-kernel

On Tue, Mar 09, 2004 at 05:03:07PM +0100, Andrea Arcangeli wrote:
> those vmas in those apps are forced to be mlocked with the rmap VM, so
> it's hard for me to buy that rmap is any better. You can't even allow

btw, try your exploit while keeping the stuff mlocked: you'll see we
stop following the i_mmap list the first time we run into a VM_LOCKED
vma. We could be even more efficient by removing mlocked pages from the
lru, but that's definitely not required to get that workload right, and
that workload needs mlock with rmap anyway to remove the pte_chains! So
even now objrmap seems a lot better than rmap for that workload: it
doesn't require mlock at all, only if you want to pageout heavily (rmap
requires it regardless of whether you pageout or not). And at worst it
can be fixed with an rbtree, while the rmap overhead is not fixable
(other than by removing rmap entirely, like I'm doing).

BTW, my current anon_vma work is going really well, the code is so much
nicer, and it's quite a bit smaller too:

 include/linux/mm.h         |   76 +++
 include/linux/objrmap.h    |   74 +++
 include/linux/page-flags.h |    4
 include/linux/rmap.h       |   53 --
 init/main.c                |    4
 mm/memory.c                |   15
 mm/mmap.c                  |    4
 mm/nommu.c                 |    2
 mm/objrmap.c               |  480 +++++++++++++++++++++++
 mm/page_alloc.c            |    6
 mm/rmap.c                  |  908 ---------------------------------------------
 12 files changed, 636 insertions(+), 990 deletions(-)

and this doesn't remove all the pte_chains everywhere yet.
objrmap.c seems already fully complete; what's missing now is the
removal of all the pte_chains from memory.c and friends, and later the
anon_vma tracking with fork and munmap (I've only covered
do_anonymous_page so far). See how clean it looks now:

static int
do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
		  pte_t *page_table, pmd_t *pmd, int write_access,
		  unsigned long addr)
{
	pte_t entry;
	struct page *page = ZERO_PAGE(addr);
	int ret;

	/* Read-only mapping of ZERO_PAGE. */
	entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));

	/* ..except if it's a write access */
	if (write_access) {
		/* Allocate our own private page. */
		pte_unmap(page_table);
		spin_unlock(&mm->page_table_lock);

		page = alloc_page(GFP_HIGHUSER);
		if (!page)
			goto no_mem;
		clear_user_highpage(page, addr);

		spin_lock(&mm->page_table_lock);
		page_table = pte_offset_map(pmd, addr);

		if (!pte_none(*page_table)) {
			pte_unmap(page_table);
			page_cache_release(page);
			spin_unlock(&mm->page_table_lock);
			ret = VM_FAULT_MINOR;
			goto out;
		}
		mm->rss++;
		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)),
				      vma);
		lru_cache_add_active(page);
		mark_page_accessed(page);
		SetPageAnon(page);
	}

	set_pte(page_table, entry);
	/* ignores ZERO_PAGE */
	page_add_rmap(page, vma);
	pte_unmap(page_table);

	/* No need to invalidate - it was non-present before */
	update_mmu_cache(vma, addr, entry);
	spin_unlock(&mm->page_table_lock);
	ret = VM_FAULT_MINOR;
	goto out;

 no_mem:
	ret = VM_FAULT_OOM;
 out:
	return ret;
}

no pte_chains anywhere. And here is page_add_rmap from objrmap.c:

/* this needs the page->flags PG_map_lock held */
static inline void
anon_vma_page_link(struct page *page, struct vm_area_struct *vma)
{
	SetPageDirect(page);
	page->as.vma = vma;
}

/**
 * page_add_rmap - add reverse mapping entry to a page
 * @page: the page to add the mapping to
 * @vma: the vma that is covering the page
 *
 * Add a new pte reverse mapping to a page.
 * The caller needs to hold the mm->page_table_lock.
*/ void fastcall page_add_rmap(struct page *page, struct vm_area_struct * vma) { if (!pfn_valid(page_to_pfn(page)) || PageReserved(page)) return; page_map_lock(page); if (!page->mapcount++) inc_page_state(nr_mapped); if (PageAnon(page)) anon_vma_page_link(page, vma); else { /* * If this is an object-based page, just count it. * We can find the mappings by walking the object * vma chain for that object. */ BUG_ON(!page->as.mapping); BUG_ON(PageSwapCache(page)); } page_map_unlock(page); } Here is page_remove_rmap: /* this needs the page->flags PG_maplock held */ static inline void anon_vma_page_unlink(struct page * page) { /* * Cleanup if this anon page is gone * as far as the vm is concerned. */ if (!page->mapcount) { page->as.vma = NULL; #if 0 /* * The above clears page->as.anon_vma too * if the page wasn't direct. */ page->as.anon_vma = NULL; #endif ClearPageDirect(page); } } /** * page_remove_rmap - take down reverse mapping to a page * @page: page to remove mapping from * * Removes the reverse mapping of the page; * after that the caller can clear the page table entry and free * the page. * Caller needs to hold the mm->page_table_lock. */ void fastcall page_remove_rmap(struct page *page) { if (!pfn_valid(page_to_pfn(page)) || PageReserved(page)) return; page_map_lock(page); if (!page_mapped(page)) goto out_unlock; if (!--page->mapcount) dec_page_state(nr_mapped); if (PageAnon(page)) anon_vma_page_unlink(page); else { /* * If this is an object-based page, just uncount it. * We can find the mappings by walking the object vma * chain for that object. 
*/ BUG_ON(!page->as.mapping); BUG_ON(PageSwapCache(page)); } out_unlock: page_map_unlock(page); return; } Here is the paging code that unmaps the ptes: static int try_to_unmap_anon(struct page * page) { int ret = SWAP_AGAIN; page_map_lock(page); if (PageDirect(page)) { ret = try_to_unmap_inode_one(page->as.vma, page); } else { struct vm_area_struct * vma; anon_vma_t * anon_vma = page->as.anon_vma; list_for_each_entry(vma, &anon_vma->anon_vma_head, anon_vma_node) { ret = try_to_unmap_inode_one(vma, page); if (ret == SWAP_FAIL || !page->mapcount) goto out; } } out: page_map_unlock(page); return ret; } /** * try_to_unmap - try to remove all page table mappings to a page * @page: the page to get unmapped * * Tries to remove all the page table entries which are mapping this * page, used in the pageout path. Caller must hold the page lock. * Return values are: * * SWAP_SUCCESS - we succeeded in removing all mappings * SWAP_AGAIN - we missed a trylock, try again later * SWAP_FAIL - the page is unswappable */ int fastcall try_to_unmap(struct page * page) { int ret = SWAP_SUCCESS; /* This page should not be on the pageout lists. */ BUG_ON(PageReserved(page)); BUG_ON(!PageLocked(page)); /* * We need backing store to swap out a page. * Subtle: this checks for page->as.anon_vma too ;). 
*/ BUG_ON(!page->as.mapping); if (!PageAnon(page)) ret = try_to_unmap_inode(page); else ret = try_to_unmap_anon(page); if (!page_mapped(page)) { dec_page_state(nr_mapped); ret = SWAP_SUCCESS; } return ret; } In my first attempt I was nuking the page->mapcount++ (it's pure locking overhead for the file mappings and it wastes 4 bytes per page_t), but then I backed off since the nr_mapped changes were spreading everywhere in the vm and the modifications were growing too fast at the same time, so I'll think about it later. For now I will do anon_vma only, plus the nonlinear pagetable walk, so the patch is as self-contained as possible and it'll drop all pte_chains from the kernel. The only reason I need page->mapcount is that if the page is an inode mapping, page->as.mapping won't be enough to tell whether it was already mapped or not. So my current anon_vma patch (incremental with objrmap) only shrinks the page_t by 4 bytes compared to mainline 2.4 and mainline 2.6. With PageDirect and the page->as.vma field I'm deferring _all_ anon_vma object allocations to fork(); even when a MAP_PRIVATE vma is already tracked by an inode and by an anon_vma (generated by an old fork), newly allocated anonymous pages are still "direct". So the same vma can have direct anon pages, anon_vma indirect cow pages, and finally inode pages too (readonly, write-protected). I plan to teach the cow fault to convert anon_vma indirect pages back to direct pages if page->mapcount == 1 (I don't strictly need page->mapcount for that, I could use page_count, but since I have page->mapcount I use it, so even the unlikely races get converted to direct mode too). However a vma can never revert to "direct"; only a page can. The reason is that I've no way to reach _only_ the pages pointing to an anon_vma starting from the vma (the only way would be a pagetable walk, but I don't want to do that, and leaving the anon_vma around is perfectly fine: I will garbage collect it when the vma goes away too). 
Overall this means anonymous page faults will be blazing fast, with no allocation ever in the fast paths; just fork will have to allocate 12 more bytes per anonymous vma to track the cows (not a big deal compared to 8 bytes per pte of rmap ;). Here below (most important of all for understanding my proposed anon_vma design) is a preview of the data structure layout. I think this is close to DaveM's original approach to handling anonymous memory, though the last time I read his patch was a few years ago so I don't remember exactly; the only thing I remember (because I disliked it) was that he was doing slab allocations from page faults, something I definitely want to avoid completely, with highest priority. Hugh's approach wasn't usable either, since it was tracking the mm and it unfortunately broke with mremap. The way I designed the garbage collection of the transient anon_vma objects is, I think, extremely optimized too: I don't need a list of pages or a counter of the pages, I simply garbage collect the anon_vma during vma destruction, checking vma->anon_vma && list_empty(&vma->anon_vma->anon_vma_head). I use the invariant that for a page to point to an anon_vma there must still be a vma queued in the anon_vma. That should work reliably, and it allows me to only point to anon_vmas from pages; I never know from an anon_vma (or a vma) whether any page is pointing to it (I only need to know that no page is pointing to it when no vma is queued in the anon_vma). It took me a while to design this thing, but now I'm quite happy. I hope not to find some huge design flaw at the last minute ;). This is why I'm showing you all this right now, before it's finished: if you see any design flaw please let me know ASAP, I need this thing working quickly! thanks. --- sles-anobjrmap-2/include/linux/mm.h.~1~ 2004-03-03 06:45:38.000000000 +0100 +++ sles-anobjrmap-2/include/linux/mm.h 2004-03-10 10:25:55.955735680 +0100 @@ -39,6 +39,22 @@ extern int page_cluster; * mmap() functions). 
*/ +typedef struct anon_vma_s { + /* This serializes the accesses to the vma list. */ + spinlock_t anon_vma_lock; + + /* + * This is a list of anonymous "related" vmas, + * to scan if one of the pages pointing to this + * anon_vma needs to be unmapped. + * After we unlink the last vma we must garbage collect + * the object if the list is empty because we're + * guaranteed no page can be pointing to this anon_vma + * if there's no vma anymore. + */ + struct list_head anon_vma_head; +} anon_vma_t; + /* * This struct defines a memory VMM memory area. There is one of these * per VM-area/task. A VM area is any part of the process virtual memory @@ -69,6 +85,19 @@ struct vm_area_struct { */ struct list_head shared; + /* + * The same vma can be both queued into the i_mmap and in a + * anon_vma too, for example after a cow in + * a MAP_PRIVATE file mapping. However only the MAP_PRIVATE + * will go both in the i_mmap and anon_vma. A MAP_SHARED + * will only be in the i_mmap_shared and a MAP_ANONYMOUS (file = 0) + * will only be queued only in the anon_vma. + * The list is serialized by the anon_vma->lock. + */ + struct list_head anon_vma_node; + /* Serialized by the vma->vm_mm->page_table_lock */ + anon_vma_t * anon_vma; + /* Function pointers to deal with this struct. */ struct vm_operations_struct * vm_ops; @@ -172,16 +201,51 @@ struct page { updated asynchronously */ atomic_t count; /* Usage count, see below. */ struct list_head list; /* ->mapping has some page lists. */ - struct address_space *mapping; /* The inode (or ...) we belong to. */ unsigned long index; /* Our offset within mapping. */ struct list_head lru; /* Pageout list, eg. active_list; protected by zone->lru_lock !! */ + + /* + * Address space of this page. + * A page can be either mapped to a file or to be anonymous + * memory, so using the union is optimal here. The PG_anon + * bitflag tells if this is anonymous or a file-mapping. 
+ * If PG_anon is clear we use the as.mapping, if PG_anon is + * set and PG_direct is not set we use the as.anon_vma, + * if PG_anon is set and PG_direct is set we use the as.vma. + */ union { - struct pte_chain *chain;/* Reverse pte mapping pointer. - * protected by PG_chainlock */ - pte_addr_t direct; - int mapcount; - } pte; + /* The inode address space if it's a file mapping. */ + struct address_space * mapping; + + /* + * This points to an anon_vma object. + * The anon_vma can't go away under us if + * we hold the PG_maplock. + */ + anon_vma_t * anon_vma; + + /* + * Before the first fork we avoid anon_vma object allocation + * and we set PG_direct. anon_vma objects are only created + * via fork(), and the vm then stop using the page->as.vma + * and it starts using the as.anon_vma object instead. + * After the first fork(), even if the child exit, the pages + * cannot be downgraded to PG_direct anymore (even if we + * wanted to) because there's no way to reach pages starting + * from an anon_vma object. + */ + struct vm_struct * vma; + } as; + + /* + * Number of ptes mapping this page. + * It's serialized by PG_maplock. + * This is needed only to maintain the nr_mapped global info + * so it would be nice to drop it. 
+ */ + unsigned long mapcount; + unsigned long private; /* mapping-private opaque data */ /* --- sles-anobjrmap-2/include/linux/page-flags.h.~1~ 2004-03-03 06:45:38.000000000 +0100 +++ sles-anobjrmap-2/include/linux/page-flags.h 2004-03-10 10:20:59.324830432 +0100 @@ -69,9 +69,9 @@ #define PG_private 12 /* Has something at ->private */ #define PG_writeback 13 /* Page is under writeback */ #define PG_nosave 14 /* Used for system suspend/resume */ -#define PG_chainlock 15 /* lock bit for ->pte_chain */ +#define PG_maplock 15 /* lock bit for ->as.anon_vma and ->mapcount */ -#define PG_direct 16 /* ->pte_chain points directly at pte */ +#define PG_direct 16 /* if set it must use page->as.vma */ #define PG_mappedtodisk 17 /* Has blocks allocated on-disk */ #define PG_reclaim 18 /* To be reclaimed asap */ #define PG_compound 19 /* Part of a compound page */ ^ permalink raw reply [flat|nested] 74+ messages in thread
* anon_vma RFC2 2004-03-10 10:36 ` RFC anon_vma previous (i.e. full objrmap) Andrea Arcangeli @ 2004-03-11 6:52 ` Andrea Arcangeli 2004-03-11 13:23 ` Hugh Dickins 0 siblings, 1 reply; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-11 6:52 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, torvalds, linux-kernel, William Lee Irwin III, Hugh Dickins Hello, this is the full current status of my anon_vma work. Now fork() and all the other page_add/remove_rmap calls in memory.c plus the paging routines seem fully covered, and I'm now dealing with the vma merging and the anon_vma garbage collection (the latter is easy, but I need to track down all the kmem_cache_free calls). There is just one minor limitation with the vma merging of anonymous memory that I didn't consider during the design phase (I figured it out while coding). In short this is only an issue with the mremap syscall (and sometimes with mmap too, while filling a hole). The vma merging happening during mmap/brk (when not filling a hole) is always going to work fine, since the newly created vma has vma->anon_vma == NULL and I have the guarantee from the caller that no page is yet mapped to this vma, so I can merge it just fine and it'll become part of whatever pre-existing anon_vma object (after possibly fixing up the vma->pg_off of the newly created vma). Only when I fill a hole (with mmap or brk) may I be unable to merge the three anon vmas together, if their pg_off disagrees. However their pg_off can disagree only if somebody previously used mremap on those vmas, since I set up the pg_off of anonymous memory in such a way that if you only use mmap/brk, even filling the holes is guaranteed to do full merging. The problem with mremap is not only the pgoff; the problem is that I can merge anonymous vmas only if (!vma1->anon_vma || !vma2->anon_vma) is true. 
If vma1 and vma2 have two different anon_vmas I cannot merge them together (even if the pg_off agrees), because the pages under vma2 may point to vma2->anon_vma and the pages under vma1 to vma1->anon_vma in their page->as.anon_vma, and there is no way to efficiently reach the pages pointing to a given anon_vma. As said yesterday, the invariant I use to garbage collect the anon_vma is to wait for all vmas to be unlinked from it; as long as there are vmas queued in an anon_vma object I cannot release it, and in turn I cannot do the merging either. The only way to allow 100% merging through mremap would be to have a list with its head in the anon_vma and its nodes in the page_t. That would be very easy, but it would waste 4 bytes per page_t for an hlist_node (the 4-byte waste in the anon_vma itself is not a problem). And the merging would be very expensive too, since I would need to run a for_each_page_in_the_list loop to first fix up all the page->index values according to the spread between vma1->pg_off and vma2->pg_off, and second to reset page->as.anon_vma (or page->as.vma for direct pages) to point to the other anon_vma (or the other vma for direct pages, respectively). So I think I will go ahead with the current data structures despite the small regression in vma merging. I doubt it's an issue, but please let me know if you think it is and that I should add an hlist_node to the page_t and an hlist_head to the anon_vma_t. Btw, it's something I can always do later if it turns out to be really necessary. Even with the additional 4 bytes per page_t, the page_t size would not be bigger than in mainline 2.4 and mainline 2.6. 
include/linux/mm.h | 79 +++ include/linux/objrmap.h | 66 +++ include/linux/page-flags.h | 4 include/linux/rmap.h | 53 -- init/main.c | 4 kernel/fork.c | 10 mm/Makefile | 2 mm/memory.c | 129 +----- mm/mmap.c | 9 mm/nommu.c | 2 mm/objrmap.c | 575 ++++++++++++++++++++++++++++ mm/page_alloc.c | 6 mm/rmap.c | 908 --------------------------------------------- 14 files changed, 772 insertions(+), 1075 deletions(-) --- sles-anobjrmap-2/include/linux/mm.h.~1~ 2004-03-03 06:45:38.000000000 +0100 +++ sles-anobjrmap-2/include/linux/mm.h 2004-03-10 18:59:14.000000000 +0100 @@ -39,6 +39,22 @@ extern int page_cluster; * mmap() functions). */ +typedef struct anon_vma_s { + /* This serializes the accesses to the vma list. */ + spinlock_t anon_vma_lock; + + /* + * This is a list of anonymous "related" vmas, + * to scan if one of the pages pointing to this + * anon_vma needs to be unmapped. + * After we unlink the last vma we must garbage collect + * the object if the list is empty because we're + * guaranteed no page can be pointing to this anon_vma + * if there's no vma anymore. + */ + struct list_head anon_vma_head; +} anon_vma_t; + /* * This struct defines a memory VMM memory area. There is one of these * per VM-area/task. A VM area is any part of the process virtual memory @@ -69,6 +85,19 @@ struct vm_area_struct { */ struct list_head shared; + /* + * The same vma can be both queued into the i_mmap and in a + * anon_vma too, for example after a cow in + * a MAP_PRIVATE file mapping. However only the MAP_PRIVATE + * will go both in the i_mmap and anon_vma. A MAP_SHARED + * will only be in the i_mmap_shared and a MAP_ANONYMOUS (file = 0) + * will only be queued only in the anon_vma. + * The list is serialized by the anon_vma->lock. + */ + struct list_head anon_vma_node; + /* Serialized by the vma->vm_mm->page_table_lock */ + anon_vma_t * anon_vma; + /* Function pointers to deal with this struct. 
*/ struct vm_operations_struct * vm_ops; @@ -172,16 +201,51 @@ struct page { updated asynchronously */ atomic_t count; /* Usage count, see below. */ struct list_head list; /* ->mapping has some page lists. */ - struct address_space *mapping; /* The inode (or ...) we belong to. */ unsigned long index; /* Our offset within mapping. */ struct list_head lru; /* Pageout list, eg. active_list; protected by zone->lru_lock !! */ + + /* + * Address space of this page. + * A page can be either mapped to a file or to be anonymous + * memory, so using the union is optimal here. The PG_anon + * bitflag tells if this is anonymous or a file-mapping. + * If PG_anon is clear we use the as.mapping, if PG_anon is + * set and PG_direct is not set we use the as.anon_vma, + * if PG_anon is set and PG_direct is set we use the as.vma. + */ union { - struct pte_chain *chain;/* Reverse pte mapping pointer. - * protected by PG_chainlock */ - pte_addr_t direct; - int mapcount; - } pte; + /* The inode address space if it's a file mapping. */ + struct address_space * mapping; + + /* + * This points to an anon_vma object. + * The anon_vma can't go away under us if + * we hold the PG_maplock. + */ + anon_vma_t * anon_vma; + + /* + * Before the first fork we avoid anon_vma object allocation + * and we set PG_direct. anon_vma objects are only created + * via fork(), and the vm then stop using the page->as.vma + * and it starts using the as.anon_vma object instead. + * After the first fork(), even if the child exit, the pages + * cannot be downgraded to PG_direct anymore (even if we + * wanted to) because there's no way to reach pages starting + * from an anon_vma object. + */ + struct vm_struct * vma; + } as; + + /* + * Number of ptes mapping this page. + * It's serialized by PG_maplock. + * This is needed only to maintain the nr_mapped global info + * so it would be nice to drop it. 
+ */ + unsigned long mapcount; + unsigned long private; /* mapping-private opaque data */ /* @@ -440,7 +504,8 @@ void unmap_page_range(struct mmu_gather unsigned long address, unsigned long size); void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr); int copy_page_range(struct mm_struct *dst, struct mm_struct *src, - struct vm_area_struct *vma); + struct vm_area_struct *vma, struct vm_area_struct *orig_vma, + anon_vma_t ** anon_vma); int zeromap_page_range(struct vm_area_struct *vma, unsigned long from, unsigned long size, pgprot_t prot); --- sles-anobjrmap-2/include/linux/page-flags.h.~1~ 2004-03-03 06:45:38.000000000 +0100 +++ sles-anobjrmap-2/include/linux/page-flags.h 2004-03-10 10:20:59.000000000 +0100 @@ -69,9 +69,9 @@ #define PG_private 12 /* Has something at ->private */ #define PG_writeback 13 /* Page is under writeback */ #define PG_nosave 14 /* Used for system suspend/resume */ -#define PG_chainlock 15 /* lock bit for ->pte_chain */ +#define PG_maplock 15 /* lock bit for ->as.anon_vma and ->mapcount */ -#define PG_direct 16 /* ->pte_chain points directly at pte */ +#define PG_direct 16 /* if set it must use page->as.vma */ #define PG_mappedtodisk 17 /* Has blocks allocated on-disk */ #define PG_reclaim 18 /* To be reclaimed asap */ #define PG_compound 19 /* Part of a compound page */ --- sles-anobjrmap-2/include/linux/objrmap.h.~1~ 2004-03-05 05:27:41.000000000 +0100 +++ sles-anobjrmap-2/include/linux/objrmap.h 2004-03-10 20:48:57.000000000 +0100 @@ -1,8 +1,7 @@ #ifndef _LINUX_RMAP_H #define _LINUX_RMAP_H /* - * Declarations for Reverse Mapping functions in mm/rmap.c - * Its structures are declared within that file. 
+ * Declarations for Object Reverse Mapping functions in mm/objrmap.c */ #include <linux/config.h> @@ -10,32 +9,46 @@ #include <linux/linkage.h> #include <linux/slab.h> +#include <linux/kernel.h> -struct pte_chain; -extern kmem_cache_t *pte_chain_cache; +extern kmem_cache_t * anon_vma_cachep; -#define pte_chain_lock(page) bit_spin_lock(PG_chainlock, &page->flags) -#define pte_chain_unlock(page) bit_spin_unlock(PG_chainlock, &page->flags) +#define page_map_lock(page) bit_spin_lock(PG_maplock, &page->flags) +#define page_map_unlock(page) bit_spin_unlock(PG_maplock, &page->flags) -struct pte_chain *pte_chain_alloc(int gfp_flags); -void __pte_chain_free(struct pte_chain *pte_chain); +static inline void anon_vma_free(anon_vma_t * anon_vma) +{ + kmem_cache_free(anon_vma); +} -static inline void pte_chain_free(struct pte_chain *pte_chain) +static inline anon_vma_t * anon_vma_alloc(void) { - if (pte_chain) - __pte_chain_free(pte_chain); + might_sleep(); + + return kmem_cache_alloc(anon_vma_cachep, SLAB_KERNEL); } -int FASTCALL(page_referenced(struct page *)); -struct pte_chain *FASTCALL(page_add_rmap(struct page *, pte_t *, - struct pte_chain *)); -void FASTCALL(page_remove_rmap(struct page *, pte_t *)); -int page_convert_anon(struct page *); +static inline void anon_vma_unlink(struct vm_area_struct * vma) +{ + anon_vma_t * anon_vma = vma->anon_vma; + + if (anon_vma) { + spin_lock(&anon_vma->anon_vma_lock); + list_del(&vma->anon_vm_node); + spin_unlock(&anon_vma->anon_vma_lock); + } +} + +void FASTCALL(page_add_rmap(struct page *, struct vm_struct *)); +void FASTCALL(page_add_rmap_fork(struct page *, struct vm_area_struct *, + struct vm_area_struct *, anon_vma_t **)); +void FASTCALL(page_remove_rmap(struct page *)); /* * Called from mm/vmscan.c to handle paging out */ int FASTCALL(try_to_unmap(struct page *)); +int FASTCALL(page_referenced(struct page *)); /* * Return values of try_to_unmap --- sles-anobjrmap-2/init/main.c.~1~ 2004-02-29 17:47:36.000000000 +0100 +++ 
sles-anobjrmap-2/init/main.c 2004-03-09 05:32:34.000000000 +0100 @@ -85,7 +85,7 @@ extern void signals_init(void); extern void buffer_init(void); extern void pidhash_init(void); extern void pidmap_init(void); -extern void pte_chain_init(void); +extern void anon_vma_init(void); extern void radix_tree_init(void); extern void free_initmem(void); extern void populate_rootfs(void); @@ -495,7 +495,7 @@ asmlinkage void __init start_kernel(void calibrate_delay(); pidmap_init(); pgtable_cache_init(); - pte_chain_init(); + anon_vma_init(); #ifdef CONFIG_KDB kdb_init(); --- sles-anobjrmap-2/kernel/fork.c.~1~ 2004-02-29 17:47:33.000000000 +0100 +++ sles-anobjrmap-2/kernel/fork.c 2004-03-10 18:58:29.000000000 +0100 @@ -276,6 +276,7 @@ static inline int dup_mmap(struct mm_str struct vm_area_struct * mpnt, *tmp, **pprev; int retval; unsigned long charge = 0; + anon_vma_t * anon_vma = NULL; down_write(&oldmm->mmap_sem); flush_cache_mm(current->mm); @@ -310,6 +311,11 @@ static inline int dup_mmap(struct mm_str goto fail_nomem; charge += len; } + if (!anon_vma) { + anon_vma = anon_vma_alloc(); + if (!anon_vma) + goto fail_nomem; + } tmp = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL); if (!tmp) goto fail_nomem; @@ -339,7 +345,7 @@ static inline int dup_mmap(struct mm_str *pprev = tmp; pprev = &tmp->vm_next; mm->map_count++; - retval = copy_page_range(mm, current->mm, tmp); + retval = copy_page_range(mm, current->mm, tmp, mpnt, &anon_vma); spin_unlock(&mm->page_table_lock); if (tmp->vm_ops && tmp->vm_ops->open) @@ -354,6 +360,8 @@ static inline int dup_mmap(struct mm_str out: flush_tlb_mm(current->mm); up_write(&oldmm->mmap_sem); + if (anon_vma) + anon_vma_free(anon_vma); return retval; fail_nomem: retval = -ENOMEM; --- sles-anobjrmap-2/mm/mmap.c.~1~ 2004-03-03 06:53:46.000000000 +0100 +++ sles-anobjrmap-2/mm/mmap.c 2004-03-11 07:43:32.158221568 +0100 @@ -325,7 +325,7 @@ static void move_vma_start(struct vm_are inode = vma->vm_file->f_dentry->d_inode; if (inode) 
__remove_shared_vm_struct(vma, inode); - /* If no vm_file, perhaps we should always keep vm_pgoff at 0?? */ + /* we must update pgoff even if no vm_file for the anon_vma_chain */ vma->vm_pgoff += (long)(addr - vma->vm_start) >> PAGE_SHIFT; vma->vm_start = addr; if (inode) @@ -576,6 +576,7 @@ unsigned long __do_mmap_pgoff(struct mm_ case MAP_SHARED: break; } + pgoff = addr << PAGE_SHIFT; } error = security_file_mmap(file, prot, flags); @@ -639,6 +640,8 @@ munmap_back: vma->vm_private_data = NULL; vma->vm_next = NULL; INIT_LIST_HEAD(&vma->shared); + INIT_LIST_HEAD(&vma->anon_vma_node); + vma->anon_vma = NULL; if (file) { error = -EINVAL; @@ -1381,10 +1384,12 @@ unsigned long do_brk(unsigned long addr, vma->vm_flags = flags; vma->vm_page_prot = protection_map[flags & 0x0f]; vma->vm_ops = NULL; - vma->vm_pgoff = 0; + vma->vm_pgoff = addr << PAGE_SHIFT; vma->vm_file = NULL; vma->vm_private_data = NULL; INIT_LIST_HEAD(&vma->shared); + INIT_LIST_HEAD(&vma->anon_vma_node); + vma->anon_vma = NULL; vma_link(mm, vma, prev, rb_link, rb_parent); --- sles-anobjrmap-2/mm/page_alloc.c.~1~ 2004-03-03 06:45:38.000000000 +0100 +++ sles-anobjrmap-2/mm/page_alloc.c 2004-03-10 10:28:26.000000000 +0100 @@ -91,6 +91,7 @@ static void bad_page(const char *functio 1 << PG_writeback); set_page_count(page, 0); page->mapping = NULL; + page->mapcount = 0; } #if !defined(CONFIG_HUGETLB_PAGE) && !defined(CONFIG_CRASH_DUMP) \ @@ -216,8 +217,7 @@ static inline void __free_pages_bulk (st static inline void free_pages_check(const char *function, struct page *page) { - if ( page_mapped(page) || - page->mapping != NULL || + if ( page->as.mapping != NULL || page_count(page) != 0 || (page->flags & ( 1 << PG_lru | @@ -329,7 +329,7 @@ static inline void set_page_refs(struct */ static void prep_new_page(struct page *page, int order) { - if (page->mapping || page_mapped(page) || + if (page->as.mapping || (page->flags & ( 1 << PG_private | 1 << PG_locked | --- sles-anobjrmap-2/mm/nommu.c.~1~ 2004-02-04 
16:07:06.000000000 +0100 +++ sles-anobjrmap-2/mm/nommu.c 2004-03-09 05:32:41.000000000 +0100 @@ -568,6 +568,6 @@ unsigned long get_unmapped_area(struct f return -ENOMEM; } -void pte_chain_init(void) +void anon_vma_init(void) { } --- sles-anobjrmap-2/mm/memory.c.~1~ 2004-03-05 05:24:35.000000000 +0100 +++ sles-anobjrmap-2/mm/memory.c 2004-03-10 19:25:27.000000000 +0100 @@ -43,12 +43,11 @@ #include <linux/swap.h> #include <linux/highmem.h> #include <linux/pagemap.h> -#include <linux/rmap.h> +#include <linux/objrmap.h> #include <linux/module.h> #include <linux/init.h> #include <asm/pgalloc.h> -#include <asm/rmap.h> #include <asm/uaccess.h> #include <asm/tlb.h> #include <asm/tlbflush.h> @@ -105,7 +104,6 @@ static inline void free_one_pmd(struct m } page = pmd_page(*dir); pmd_clear(dir); - pgtable_remove_rmap(page); pte_free_tlb(tlb, page); } @@ -164,7 +162,6 @@ pte_t fastcall * pte_alloc_map(struct mm pte_free(new); goto out; } - pgtable_add_rmap(new, mm, address); pmd_populate(mm, pmd, new); } out: @@ -190,7 +187,6 @@ pte_t fastcall * pte_alloc_kernel(struct pte_free_kernel(new); goto out; } - pgtable_add_rmap(virt_to_page(new), mm, address); pmd_populate_kernel(mm, pmd, new); } out: @@ -211,26 +207,17 @@ out: * but may be dropped within pmd_alloc() and pte_alloc_map(). 
*/ int copy_page_range(struct mm_struct *dst, struct mm_struct *src, - struct vm_area_struct *vma) + struct vm_area_struct *vma, struct vm_area_struct *orig_vma, + anon_vma_t ** anon_vma) { pgd_t * src_pgd, * dst_pgd; unsigned long address = vma->vm_start; unsigned long end = vma->vm_end; unsigned long cow; - struct pte_chain *pte_chain = NULL; if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst, src, vma); - pte_chain = pte_chain_alloc(GFP_ATOMIC); - if (!pte_chain) { - spin_unlock(&dst->page_table_lock); - pte_chain = pte_chain_alloc(GFP_KERNEL); - spin_lock(&dst->page_table_lock); - if (!pte_chain) - goto nomem; - } - cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE; src_pgd = pgd_offset(src, address)-1; dst_pgd = pgd_offset(dst, address)-1; @@ -299,7 +286,7 @@ skip_copy_pte_range: pfn = pte_pfn(pte); /* the pte points outside of valid memory, the * mapping is assumed to be good, meaningful - * and not mapped via rmap - duplicate the + * and not mapped via objrmap - duplicate the * mapping as is. */ page = NULL; @@ -331,30 +318,20 @@ skip_copy_pte_range: dst->rss++; set_pte(dst_pte, pte); - pte_chain = page_add_rmap(page, dst_pte, - pte_chain); - if (pte_chain) - goto cont_copy_pte_range_noset; - pte_chain = pte_chain_alloc(GFP_ATOMIC); - if (pte_chain) - goto cont_copy_pte_range_noset; + page_add_rmap_fork(page, vma, orig_vma, anon_vma); + + if (need_resched()) { + pte_unmap_nested(src_pte); + pte_unmap(dst_pte); + spin_unlock(&src->page_table_lock); + spin_unlock(&dst->page_table_lock); + __cond_resched(); + spin_lock(&dst->page_table_lock); + spin_lock(&src->page_table_lock); + dst_pte = pte_offset_map(dst_pmd, address); + src_pte = pte_offset_map_nested(src_pmd, address); + } - /* - * pte_chain allocation failed, and we need to - * run page reclaim. 
- */ - pte_unmap_nested(src_pte); - pte_unmap(dst_pte); - spin_unlock(&src->page_table_lock); - spin_unlock(&dst->page_table_lock); - pte_chain = pte_chain_alloc(GFP_KERNEL); - spin_lock(&dst->page_table_lock); - if (!pte_chain) - goto nomem; - spin_lock(&src->page_table_lock); - dst_pte = pte_offset_map(dst_pmd, address); - src_pte = pte_offset_map_nested(src_pmd, - address); cont_copy_pte_range_noset: address += PAGE_SIZE; if (address >= end) { @@ -377,10 +354,9 @@ cont_copy_pmd_range: out_unlock: spin_unlock(&src->page_table_lock); out: - pte_chain_free(pte_chain); return 0; + nomem: - pte_chain_free(pte_chain); return -ENOMEM; } @@ -421,7 +397,7 @@ zap_pte_range(struct mmu_gather *tlb, pm !PageSwapCache(page)) mark_page_accessed(page); tlb->freed++; - page_remove_rmap(page, ptep); + page_remove_rmap(page); tlb_remove_page(tlb, page); } } @@ -1014,7 +990,6 @@ static int do_wp_page(struct mm_struct * { struct page *old_page, *new_page; unsigned long pfn = pte_pfn(pte); - struct pte_chain *pte_chain; pte_t entry; if (unlikely(!pfn_valid(pfn))) { @@ -1053,9 +1028,6 @@ static int do_wp_page(struct mm_struct * page_cache_get(old_page); spin_unlock(&mm->page_table_lock); - pte_chain = pte_chain_alloc(GFP_KERNEL); - if (!pte_chain) - goto no_pte_chain; new_page = alloc_page(GFP_HIGHUSER); if (!new_page) goto no_new_page; @@ -1069,10 +1041,10 @@ static int do_wp_page(struct mm_struct * if (pte_same(*page_table, pte)) { if (PageReserved(old_page)) ++mm->rss; - page_remove_rmap(old_page, page_table); + page_remove_rmap(old_page); break_cow(vma, new_page, address, page_table); SetPageAnon(new_page); - pte_chain = page_add_rmap(new_page, page_table, pte_chain); + page_add_rmap(new_page, vma); lru_cache_add_active(new_page); /* Free the old page.. 
*/ @@ -1082,12 +1054,9 @@ static int do_wp_page(struct mm_struct * page_cache_release(new_page); page_cache_release(old_page); spin_unlock(&mm->page_table_lock); - pte_chain_free(pte_chain); return VM_FAULT_MINOR; no_new_page: - pte_chain_free(pte_chain); -no_pte_chain: page_cache_release(old_page); return VM_FAULT_OOM; } @@ -1245,7 +1214,6 @@ static int do_swap_page(struct mm_struct swp_entry_t entry = pte_to_swp_entry(orig_pte); pte_t pte; int ret = VM_FAULT_MINOR; - struct pte_chain *pte_chain = NULL; pte_unmap(page_table); spin_unlock(&mm->page_table_lock); @@ -1275,11 +1243,6 @@ static int do_swap_page(struct mm_struct } mark_page_accessed(page); - pte_chain = pte_chain_alloc(GFP_KERNEL); - if (!pte_chain) { - ret = VM_FAULT_OOM; - goto out; - } lock_page(page); /* @@ -1312,14 +1275,13 @@ static int do_swap_page(struct mm_struct flush_icache_page(vma, page); set_pte(page_table, pte); SetPageAnon(page); - pte_chain = page_add_rmap(page, page_table, pte_chain); + page_add_rmap(page, vma); /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, address, pte); pte_unmap(page_table); spin_unlock(&mm->page_table_lock); out: - pte_chain_free(pte_chain); return ret; } @@ -1335,20 +1297,8 @@ do_anonymous_page(struct mm_struct *mm, { pte_t entry; struct page * page = ZERO_PAGE(addr); - struct pte_chain *pte_chain; int ret; - pte_chain = pte_chain_alloc(GFP_ATOMIC); - if (!pte_chain) { - pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); - pte_chain = pte_chain_alloc(GFP_KERNEL); - if (!pte_chain) - goto no_mem; - spin_lock(&mm->page_table_lock); - page_table = pte_offset_map(pmd, addr); - } - /* Read-only mapping of ZERO_PAGE. 
*/ entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); @@ -1359,8 +1309,8 @@ do_anonymous_page(struct mm_struct *mm, spin_unlock(&mm->page_table_lock); page = alloc_page(GFP_HIGHUSER); - if (!page) - goto no_mem; + if (unlikely(!page)) + return VM_FAULT_OOM; clear_user_highpage(page, addr); spin_lock(&mm->page_table_lock); @@ -1370,8 +1320,7 @@ do_anonymous_page(struct mm_struct *mm, pte_unmap(page_table); page_cache_release(page); spin_unlock(&mm->page_table_lock); - ret = VM_FAULT_MINOR; - goto out; + return VM_FAULT_MINOR; } mm->rss++; entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, @@ -1383,20 +1332,16 @@ do_anonymous_page(struct mm_struct *mm, } set_pte(page_table, entry); - /* ignores ZERO_PAGE */ - pte_chain = page_add_rmap(page, page_table, pte_chain); pte_unmap(page_table); /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, addr, entry); spin_unlock(&mm->page_table_lock); ret = VM_FAULT_MINOR; - goto out; -no_mem: - ret = VM_FAULT_OOM; -out: - pte_chain_free(pte_chain); + /* ignores ZERO_PAGE */ + page_add_rmap(page, vma); + return ret; } @@ -1419,7 +1364,6 @@ do_no_page(struct mm_struct *mm, struct struct page * new_page; struct address_space *mapping = NULL; pte_t entry; - struct pte_chain *pte_chain; int sequence = 0; int ret = VM_FAULT_MINOR; @@ -1443,10 +1387,6 @@ retry: if (new_page == NOPAGE_OOM) return VM_FAULT_OOM; - pte_chain = pte_chain_alloc(GFP_KERNEL); - if (!pte_chain) - goto oom; - /* See if nopage returned an anon page */ if (!new_page->mapping || PageSwapCache(new_page)) SetPageAnon(new_page); @@ -1476,7 +1416,6 @@ retry: sequence = atomic_read(&mapping->truncate_count); spin_unlock(&mm->page_table_lock); page_cache_release(new_page); - pte_chain_free(pte_chain); goto retry; } page_table = pte_offset_map(pmd, address); @@ -1500,7 +1439,7 @@ retry: if (write_access) entry = maybe_mkwrite(pte_mkdirty(entry), vma); set_pte(page_table, entry); - pte_chain = page_add_rmap(new_page, page_table, 
pte_chain); + page_add_rmap(new_page, vma); pte_unmap(page_table); } else { /* One of our sibling threads was faster, back out. */ @@ -1513,13 +1452,13 @@ retry: /* no need to invalidate: a not-present page shouldn't be cached */ update_mmu_cache(vma, address, entry); spin_unlock(&mm->page_table_lock); - goto out; -oom: + out: + return ret; + + oom: page_cache_release(new_page); ret = VM_FAULT_OOM; -out: - pte_chain_free(pte_chain); - return ret; + goto out; } /* --- sles-anobjrmap-2/mm/objrmap.c.~1~ 2004-03-05 05:40:21.000000000 +0100 +++ sles-anobjrmap-2/mm/objrmap.c 2004-03-10 20:29:20.000000000 +0100 @@ -1,105 +1,27 @@ /* - * mm/rmap.c - physical to virtual reverse mappings - * - * Copyright 2001, Rik van Riel <riel@conectiva.com.br> - * Released under the General Public License (GPL). + * mm/objrmap.c * + * Provides methods for unmapping all sort of mapped pages + * using the vma objects, the brainer part of objrmap is the + * tracking of the vma to analyze for every given mapped page. + * The anon_vma methods are tracking anonymous pages, + * and the inode methods are tracking pages belonging + * to an inode. * - * Simple, low overhead pte-based reverse mapping scheme. - * This is kept modular because we may want to experiment - * with object-based reverse mapping schemes. Please try - * to keep this thing as modular as possible. + * anonymous methods by Andrea Arcangeli <andrea@suse.de> 2004 + * inode methods by Dave McCracken <dmccr@us.ibm.com> 2003, 2004 */ /* - * Locking: - * - the page->pte.chain is protected by the PG_chainlock bit, - * which nests within the the mm->page_table_lock, - * which nests within the page lock. 
- * - because swapout locking is opposite to the locking order - * in the page fault path, the swapout path uses trylocks - * on the mm->page_table_lock - */ -#include <linux/mm.h> -#include <linux/pagemap.h> -#include <linux/swap.h> -#include <linux/swapops.h> -#include <linux/slab.h> -#include <linux/init.h> -#include <linux/rmap.h> -#include <linux/cache.h> -#include <linux/percpu.h> - -#include <asm/pgalloc.h> -#include <asm/rmap.h> -#include <asm/tlb.h> -#include <asm/tlbflush.h> - -/* #define DEBUG_RMAP */ - -/* - * Shared pages have a chain of pte_chain structures, used to locate - * all the mappings to this page. We only need a pointer to the pte - * here, the page struct for the page table page contains the process - * it belongs to and the offset within that process. - * - * We use an array of pte pointers in this structure to minimise cache misses - * while traversing reverse maps. - */ -#define NRPTE ((L1_CACHE_BYTES - sizeof(unsigned long))/sizeof(pte_addr_t)) - -/* - * next_and_idx encodes both the address of the next pte_chain and the - * offset of the highest-index used pte in ptes[]. + * try_to_unmap/page_referenced/page_add_rmap/page_remove_rmap + * inherit from the rmap design mm/rmap.c under + * Copyright 2001, Rik van Riel <riel@conectiva.com.br> + * Released under the General Public License (GPL). 
*/ -struct pte_chain { - unsigned long next_and_idx; - pte_addr_t ptes[NRPTE]; -} ____cacheline_aligned; - -kmem_cache_t *pte_chain_cache; -static inline struct pte_chain *pte_chain_next(struct pte_chain *pte_chain) -{ - return (struct pte_chain *)(pte_chain->next_and_idx & ~NRPTE); -} - -static inline struct pte_chain *pte_chain_ptr(unsigned long pte_chain_addr) -{ - return (struct pte_chain *)(pte_chain_addr & ~NRPTE); -} - -static inline int pte_chain_idx(struct pte_chain *pte_chain) -{ - return pte_chain->next_and_idx & NRPTE; -} - -static inline unsigned long -pte_chain_encode(struct pte_chain *pte_chain, int idx) -{ - return (unsigned long)pte_chain | idx; -} - -/* - * pte_chain list management policy: - * - * - If a page has a pte_chain list then it is shared by at least two processes, - * because a single sharing uses PageDirect. (Well, this isn't true yet, - * coz this code doesn't collapse singletons back to PageDirect on the remove - * path). - * - A pte_chain list has free space only in the head member - all succeeding - * members are 100% full. - * - If the head element has free space, it occurs in its leading slots. - * - All free space in the pte_chain is at the start of the head member. - * - Insertion into the pte_chain puts a pte pointer in the last free slot of - * the head member. - * - Removal from a pte chain moves the head pte of the head member onto the - * victim pte and frees the head member if it became empty. - */ +#include <linux/mm.h> -/** - ** VM stuff below this comment - **/ +kmem_cache_t * anon_vma_cachep; /** * find_pte - Find a pte pointer given a vma and a struct page. @@ -157,17 +79,17 @@ out: } /** - * page_referenced_obj_one - referenced check for object-based rmap + * page_referenced_inode_one - referenced check for object-based rmap * @vma: the vma to look in. * @page: the page we're working on. * * Find a pte entry for a page/vma pair, then check and clear the referenced * bit. 
* - * This is strictly a helper function for page_referenced_obj. + * This is strictly a helper function for page_referenced_inode. */ static int -page_referenced_obj_one(struct vm_area_struct *vma, struct page *page) +page_referenced_inode_one(struct vm_area_struct *vma, struct page *page) { struct mm_struct *mm = vma->vm_mm; pte_t *pte; @@ -188,11 +110,11 @@ page_referenced_obj_one(struct vm_area_s } /** - * page_referenced_obj_one - referenced check for object-based rmap + * page_referenced_inode_one - referenced check for object-based rmap * @page: the page we're checking references on. * * For an object-based mapped page, find all the places it is mapped and - * check/clear the referenced flag. This is done by following the page->mapping + * check/clear the referenced flag. This is done by following the page->as.mapping * pointer, then walking the chain of vmas it holds. It returns the number * of references it found. * @@ -202,29 +124,54 @@ page_referenced_obj_one(struct vm_area_s * assume a reference count of 1. 
*/ static int -page_referenced_obj(struct page *page) +page_referenced_inode(struct page *page) { - struct address_space *mapping = page->mapping; + struct address_space *mapping = page->as.mapping; struct vm_area_struct *vma; - int referenced = 0; + int referenced; - if (!page->pte.mapcount) + if (!page->mapcount) return 0; - if (!mapping) - BUG(); + BUG_ON(!mapping); + BUG_ON(PageSwapCache(page)); - if (PageSwapCache(page)) - BUG(); + if (down_trylock(&mapping->i_shared_sem)) + return 1; + + referenced = 0; + + list_for_each_entry(vma, &mapping->i_mmap, shared) + referenced += page_referenced_inode_one(vma, page); + + list_for_each_entry(vma, &mapping->i_mmap_shared, shared) + referenced += page_referenced_inode_one(vma, page); + + up(&mapping->i_shared_sem); + + return referenced; +} + +static int page_referenced_anon(struct page *page) +{ + int referenced; + + if (!page->mapcount) + return 0; + + BUG_ON(!mapping); + BUG_ON(PageSwapCache(page)); if (down_trylock(&mapping->i_shared_sem)) return 1; - + + referenced = 0; + list_for_each_entry(vma, &mapping->i_mmap, shared) - referenced += page_referenced_obj_one(vma, page); + referenced += page_referenced_inode_one(vma, page); list_for_each_entry(vma, &mapping->i_mmap_shared, shared) - referenced += page_referenced_obj_one(vma, page); + referenced += page_referenced_inode_one(vma, page); up(&mapping->i_shared_sem); @@ -244,7 +191,6 @@ page_referenced_obj(struct page *page) */ int fastcall page_referenced(struct page * page) { - struct pte_chain *pc; int referenced = 0; if (page_test_and_clear_young(page)) @@ -253,209 +199,179 @@ int fastcall page_referenced(struct page if (TestClearPageReferenced(page)) referenced++; - if (!PageAnon(page)) { - referenced += page_referenced_obj(page); - goto out; - } - if (PageDirect(page)) { - pte_t *pte = rmap_ptep_map(page->pte.direct); - if (ptep_test_and_clear_young(pte)) - referenced++; - rmap_ptep_unmap(pte); - } else { - int nr_chains = 0; + if (!PageAnon(page)) + referenced 
+= page_referenced_inode(page); + else + referenced += page_referenced_anon(page); - /* Check all the page tables mapping this page. */ - for (pc = page->pte.chain; pc; pc = pte_chain_next(pc)) { - int i; - - for (i = pte_chain_idx(pc); i < NRPTE; i++) { - pte_addr_t pte_paddr = pc->ptes[i]; - pte_t *p; - - p = rmap_ptep_map(pte_paddr); - if (ptep_test_and_clear_young(p)) - referenced++; - rmap_ptep_unmap(p); - nr_chains++; - } - } - if (nr_chains == 1) { - pc = page->pte.chain; - page->pte.direct = pc->ptes[NRPTE-1]; - SetPageDirect(page); - pc->ptes[NRPTE-1] = 0; - __pte_chain_free(pc); - } - } -out: return referenced; } +/* this needs the page->flags PG_map_lock held */ +static void inline anon_vma_page_link(struct page * page, struct vm_area_struct * vma) +{ + BUG_ON(page->mapcount != 1); + BUG_ON(PageDirect(page)); + + SetPageDirect(page); + page->as.vma = vma; +} + +/* this needs the page->flags PG_map_lock held */ +static void inline anon_vma_page_link_fork(struct page * page, struct vm_area_struct * vma, + struct vm_area_struct * orig_vma, anon_vma_t ** anon_vma) +{ + anon_vma_t * anon_vma = orig_vma->anon_vma; + + BUG_ON(page->mapcount <= 1); + BUG_ON(!PageDirect(page)); + + if (!anon_vma) { + anon_vma = *anon_vma; + *anon_vma = NULL; + + /* it's single threaded here, avoid the anon_vma->anon_vma_lock */ + list_add(&vma->anon_vma_node, &anon_vma->anon_vma_head); + list_add(&orig_vma->anon_vma_node, &anon_vma->anon_vma_head); + + orig_vma->anon_vma = vma->anon_vma = anon_vma; + } else { + /* multithreaded here, anon_vma existed already in other mm */ + spin_lock(&anon_vma->anon_vma_lock); + list_add(&vma->anon_vma_node, &anon_vma->anon_vma_head); + spin_unlock(&anon_vma->anon_vma_lock); + } + + ClearPageDirect(page); + page->as.anon_vma = anon_vma; +} + /** * page_add_rmap - add reverse mapping entry to a page * @page: the page to add the mapping to - * @ptep: the page table entry mapping this page + * @vma: the vma that is covering the page * * Add a new 
pte reverse mapping to a page. - * The caller needs to hold the mm->page_table_lock. */ -struct pte_chain * fastcall -page_add_rmap(struct page *page, pte_t *ptep, struct pte_chain *pte_chain) +void fastcall page_add_rmap(struct page *page, struct vm_area_struct * vma) { - pte_addr_t pte_paddr = ptep_to_paddr(ptep); - struct pte_chain *cur_pte_chain; + if (!pfn_valid(page_to_pfn(page)) || PageReserved(page)) + return; - if (PageReserved(page)) - return pte_chain; + page_map_lock(page); - pte_chain_lock(page); + if (!page->mapcount++) + inc_page_state(nr_mapped); - /* - * If this is an object-based page, just count it. We can - * find the mappings by walking the object vma chain for that object. - */ - if (!PageAnon(page)) { - if (!page->mapping) - BUG(); - if (PageSwapCache(page)) - BUG(); - if (!page->pte.mapcount) - inc_page_state(nr_mapped); - page->pte.mapcount++; - goto out; + if (PageAnon(page)) + anon_vma_page_link(page, vma); + else { + /* + * If this is an object-based page, just count it. + * We can find the mappings by walking the object + * vma chain for that object. 
+ */ + BUG_ON(!page->as.mapping); + BUG_ON(PageSwapCache(page)); } - if (page->pte.direct == 0) { - page->pte.direct = pte_paddr; - SetPageDirect(page); + page_map_unlock(page); +} + +/* called from fork() */ +void fastcall page_add_rmap_fork(struct page *page, struct vm_area_struct * vma, + struct vm_area_struct * orig_vma, anon_vma_t ** anon_vma) +{ + if (!pfn_valid(page_to_pfn(page)) || PageReserved(page)) + return; + + page_map_lock(page); + + if (!page->mapcount++) inc_page_state(nr_mapped); - goto out; - } - if (PageDirect(page)) { - /* Convert a direct pointer into a pte_chain */ - ClearPageDirect(page); - pte_chain->ptes[NRPTE-1] = page->pte.direct; - pte_chain->ptes[NRPTE-2] = pte_paddr; - pte_chain->next_and_idx = pte_chain_encode(NULL, NRPTE-2); - page->pte.direct = 0; - page->pte.chain = pte_chain; - pte_chain = NULL; /* We consumed it */ - goto out; + if (PageAnon(page)) + anon_vma_page_link_fork(page, vma, orig_vma, anon_vma); + else { + /* + * If this is an object-based page, just count it. + * We can find the mappings by walking the object + * vma chain for that object. + */ + BUG_ON(!page->as.mapping); + BUG_ON(PageSwapCache(page)); } - cur_pte_chain = page->pte.chain; - if (cur_pte_chain->ptes[0]) { /* It's full */ - pte_chain->next_and_idx = pte_chain_encode(cur_pte_chain, - NRPTE - 1); - page->pte.chain = pte_chain; - pte_chain->ptes[NRPTE-1] = pte_paddr; - pte_chain = NULL; /* We consumed it */ - goto out; + page_map_unlock(page); +} + +/* this needs the page->flags PG_map_lock held */ +static void inline anon_vma_page_unlink(struct page * page) +{ + /* + * Cleanup if this anon page is gone + * as far as the vm is concerned. + */ + if (!page->mapcount) { + page->as.vma = 0; +#if 0 + /* + * The above clears page->as.anon_vma too + * if the page wasn't direct. 
+ */ + page->as.anon_vma = 0; +#endif + ClearPageDirect(page); } - cur_pte_chain->ptes[pte_chain_idx(cur_pte_chain) - 1] = pte_paddr; - cur_pte_chain->next_and_idx--; -out: - pte_chain_unlock(page); - return pte_chain; } /** * page_remove_rmap - take down reverse mapping to a page * @page: page to remove mapping from - * @ptep: page table entry to remove * * Removes the reverse mapping from the pte_chain of the page, * after that the caller can clear the page table entry and free * the page. - * Caller needs to hold the mm->page_table_lock. */ -void fastcall page_remove_rmap(struct page *page, pte_t *ptep) +void fastcall page_remove_rmap(struct page *page) { - pte_addr_t pte_paddr = ptep_to_paddr(ptep); - struct pte_chain *pc; - if (!pfn_valid(page_to_pfn(page)) || PageReserved(page)) return; - pte_chain_lock(page); + page_map_lock(page); if (!page_mapped(page)) goto out_unlock; - /* - * If this is an object-based page, just uncount it. We can - * find the mappings by walking the object vma chain for that object. - */ - if (!PageAnon(page)) { - if (!page->mapping) - BUG(); - if (PageSwapCache(page)) - BUG(); - if (!page->pte.mapcount) - BUG(); - page->pte.mapcount--; - if (!page->pte.mapcount) - dec_page_state(nr_mapped); - goto out_unlock; + if (!--page->mapcount) + dec_page_state(nr_mapped); + + if (PageAnon(page)) + anon_vma_page_unlink(page, vma); + else { + /* + * If this is an object-based page, just uncount it. + * We can find the mappings by walking the object vma + * chain for that object. 
+ */ + BUG_ON(!page->as.mapping); + BUG_ON(PageSwapCache(page)); } - if (PageDirect(page)) { - if (page->pte.direct == pte_paddr) { - page->pte.direct = 0; - ClearPageDirect(page); - goto out; - } - } else { - struct pte_chain *start = page->pte.chain; - struct pte_chain *next; - int victim_i = pte_chain_idx(start); - - for (pc = start; pc; pc = next) { - int i; - - next = pte_chain_next(pc); - if (next) - prefetch(next); - for (i = pte_chain_idx(pc); i < NRPTE; i++) { - pte_addr_t pa = pc->ptes[i]; - - if (pa != pte_paddr) - continue; - pc->ptes[i] = start->ptes[victim_i]; - start->ptes[victim_i] = 0; - if (victim_i == NRPTE-1) { - /* Emptied a pte_chain */ - page->pte.chain = pte_chain_next(start); - __pte_chain_free(start); - } else { - start->next_and_idx++; - } - goto out; - } - } - } -out: - if (page->pte.direct == 0 && page_test_and_clear_dirty(page)) - set_page_dirty(page); - if (!page_mapped(page)) - dec_page_state(nr_mapped); -out_unlock: - pte_chain_unlock(page); + page_map_unlock(page); return; } /** - * try_to_unmap_obj - unmap a page using the object-based rmap method + * try_to_unmap_one - unmap a page using the object-based rmap method * @page: the page to unmap * * Determine whether a page is mapped in a given vma and unmap it if it's found. * - * This function is strictly a helper function for try_to_unmap_obj. + * This function is strictly a helper function for try_to_unmap_inode. */ -static inline int -try_to_unmap_obj_one(struct vm_area_struct *vma, struct page *page) +static int +try_to_unmap_one(struct vm_area_struct *vma, struct page *page) { struct mm_struct *mm = vma->vm_mm; unsigned long address; @@ -477,17 +393,39 @@ try_to_unmap_obj_one(struct vm_area_stru } flush_cache_page(vma, address); - pteval = ptep_get_and_clear(pte); - flush_tlb_page(vma, address); + pteval = ptep_clear_flush(vma, address, pte); + + if (PageSwapCache(page)) { + /* + * Store the swap location in the pte. + * See handle_pte_fault() ... 
+ */ + swp_entry_t entry = { .val = page->index }; + swap_duplicate(entry); + set_pte(pte, swp_entry_to_pte(entry)); + BUG_ON(pte_file(*pte)); + } else { + unsigned long pgidx; + /* + * If a nonlinear mapping then store the file page offset + * in the pte. + */ + pgidx = (address - vma->vm_start) >> PAGE_SHIFT; + pgidx += vma->vm_pgoff; + pgidx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT; + if (page->index != pgidx) { + set_pte(pte, pgoff_to_pte(page->index)); + BUG_ON(!pte_file(*pte)); + } + } if (pte_dirty(pteval)) set_page_dirty(page); - if (!page->pte.mapcount) - BUG(); + BUG_ON(!page->mapcount); mm->rss--; - page->pte.mapcount--; + page->mapcount--; page_cache_release(page); out_unmap: @@ -499,7 +437,7 @@ out: } /** - * try_to_unmap_obj - unmap a page using the object-based rmap method + * try_to_unmap_inode - unmap a page using the object-based rmap method * @page: the page to unmap * * Find all the mappings of a page using the mapping pointer and the vma chains @@ -511,30 +449,26 @@ out: * return a temporary error. 
*/ static int -try_to_unmap_obj(struct page *page) +try_to_unmap_inode(struct page *page) { - struct address_space *mapping = page->mapping; + struct address_space *mapping = page->as.mapping; struct vm_area_struct *vma; int ret = SWAP_AGAIN; - if (!mapping) - BUG(); - - if (PageSwapCache(page)) - BUG(); + BUG_ON(PageSwapCache(page)); if (down_trylock(&mapping->i_shared_sem)) return ret; list_for_each_entry(vma, &mapping->i_mmap, shared) { - ret = try_to_unmap_obj_one(vma, page); - if (ret == SWAP_FAIL || !page->pte.mapcount) + ret = try_to_unmap_one(vma, page); + if (ret == SWAP_FAIL || !page->mapcount) goto out; } list_for_each_entry(vma, &mapping->i_mmap_shared, shared) { - ret = try_to_unmap_obj_one(vma, page); - if (ret == SWAP_FAIL || !page->pte.mapcount) + ret = try_to_unmap_one(vma, page); + if (ret == SWAP_FAIL || !page->mapcount) goto out; } @@ -543,94 +477,33 @@ out: return ret; } -/** - * try_to_unmap_one - worker function for try_to_unmap - * @page: page to unmap - * @ptep: page table entry to unmap from page - * - * Internal helper function for try_to_unmap, called for each page - * table entry mapping a page. Because locking order here is opposite - * to the locking order used by the page fault path, we use trylocks. - * Locking: - * page lock shrink_list(), trylock - * pte_chain_lock shrink_list() - * mm->page_table_lock try_to_unmap_one(), trylock - */ -static int FASTCALL(try_to_unmap_one(struct page *, pte_addr_t)); -static int fastcall try_to_unmap_one(struct page * page, pte_addr_t paddr) -{ - pte_t *ptep = rmap_ptep_map(paddr); - unsigned long address = ptep_to_address(ptep); - struct mm_struct * mm = ptep_to_mm(ptep); - struct vm_area_struct * vma; - pte_t pte; - int ret; - - if (!mm) - BUG(); - - /* - * We need the page_table_lock to protect us from page faults, - * munmap, fork, etc... 
- */ - if (!spin_trylock(&mm->page_table_lock)) { - rmap_ptep_unmap(ptep); - return SWAP_AGAIN; - } - - - /* During mremap, it's possible pages are not in a VMA. */ - vma = find_vma(mm, address); - if (!vma) { - ret = SWAP_FAIL; - goto out_unlock; - } - - /* The page is mlock()d, we cannot swap it out. */ - if (vma->vm_flags & VM_LOCKED) { - ret = SWAP_FAIL; - goto out_unlock; - } +static int +try_to_unmap_anon(struct page * page) +{ + int ret = SWAP_AGAIN; - /* Nuke the page table entry. */ - flush_cache_page(vma, address); - pte = ptep_clear_flush(vma, address, ptep); + page_map_lock(page); - if (PageSwapCache(page)) { - /* - * Store the swap location in the pte. - * See handle_pte_fault() ... - */ - swp_entry_t entry = { .val = page->index }; - swap_duplicate(entry); - set_pte(ptep, swp_entry_to_pte(entry)); - BUG_ON(pte_file(*ptep)); + if (PageDirect(page)) { + vma = page->as.vma; + ret = try_to_unmap_one(page->as.vma, page); } else { - unsigned long pgidx; - /* - * If a nonlinear mapping then store the file page offset - * in the pte. - */ - pgidx = (address - vma->vm_start) >> PAGE_SHIFT; - pgidx += vma->vm_pgoff; - pgidx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT; - if (page->index != pgidx) { - set_pte(ptep, pgoff_to_pte(page->index)); - BUG_ON(!pte_file(*ptep)); + struct vm_area_struct * vma; + anon_vma_t * anon_vma = page->as.anon_vma; + + spin_lock(&anon_vma->anon_vma_lock); + list_for_each_entry(vma, &anon_vma->anon_vma_head, anon_vma_node) { + ret = try_to_unmap_one(vma, page); + if (ret == SWAP_FAIL || !page->mapcount) { + spin_unlock(&anon_vma->anon_vma_lock); + goto out; + } } + spin_unlock(&anon_vma->anon_vma_lock); } - /* Move the dirty bit to the physical page now the pte is gone. 
*/ - if (pte_dirty(pte)) - set_page_dirty(page); - - mm->rss--; - page_cache_release(page); - ret = SWAP_SUCCESS; - -out_unlock: - rmap_ptep_unmap(ptep); - spin_unlock(&mm->page_table_lock); +out: + page_map_unlock(page); return ret; } @@ -650,82 +523,22 @@ int fastcall try_to_unmap(struct page * { struct pte_chain *pc, *next_pc, *start; int ret = SWAP_SUCCESS; - int victim_i; /* This page should not be on the pageout lists. */ - if (PageReserved(page)) - BUG(); - if (!PageLocked(page)) - BUG(); - /* We need backing store to swap out a page. */ - if (!page->mapping) - BUG(); + BUG_ON(PageReserved(page)); + BUG_ON(!PageLocked(page)); /* - * If it's an object-based page, use the object vma chain to find all - * the mappings. + * We need backing store to swap out a page. + * Subtle: this checks for page->as.anon_vma too ;). */ - if (!PageAnon(page)) { - ret = try_to_unmap_obj(page); - goto out; - } + BUG_ON(!page->as.mapping); - if (PageDirect(page)) { - ret = try_to_unmap_one(page, page->pte.direct); - if (ret == SWAP_SUCCESS) { - if (page_test_and_clear_dirty(page)) - set_page_dirty(page); - page->pte.direct = 0; - ClearPageDirect(page); - } - goto out; - } + if (!PageAnon(page)) + ret = try_to_unmap_inode(page); + else + ret = try_to_unmap_anon(page); - start = page->pte.chain; - victim_i = pte_chain_idx(start); - for (pc = start; pc; pc = next_pc) { - int i; - - next_pc = pte_chain_next(pc); - if (next_pc) - prefetch(next_pc); - for (i = pte_chain_idx(pc); i < NRPTE; i++) { - pte_addr_t pte_paddr = pc->ptes[i]; - - switch (try_to_unmap_one(page, pte_paddr)) { - case SWAP_SUCCESS: - /* - * Release a slot. If we're releasing the - * first pte in the first pte_chain then - * pc->ptes[i] and start->ptes[victim_i] both - * refer to the same thing. It works out. 
- */ - pc->ptes[i] = start->ptes[victim_i]; - start->ptes[victim_i] = 0; - victim_i++; - if (victim_i == NRPTE) { - page->pte.chain = pte_chain_next(start); - __pte_chain_free(start); - start = page->pte.chain; - victim_i = 0; - } else { - start->next_and_idx++; - } - if (page->pte.direct == 0 && - page_test_and_clear_dirty(page)) - set_page_dirty(page); - break; - case SWAP_AGAIN: - /* Skip this pte, remembering status. */ - ret = SWAP_AGAIN; - continue; - case SWAP_FAIL: - ret = SWAP_FAIL; - goto out; - } - } - } -out: if (!page_mapped(page)) { dec_page_state(nr_mapped); ret = SWAP_SUCCESS; @@ -733,176 +546,30 @@ out: return ret; } -/** - * page_convert_anon - Convert an object-based mapped page to pte_chain-based. - * @page: the page to convert - * - * Find all the mappings for an object-based page and convert them - * to 'anonymous', ie create a pte_chain and store all the pte pointers there. - * - * This function takes the address_space->i_shared_sem, sets the PageAnon flag, - * then sets the mm->page_table_lock for each vma and calls page_add_rmap. This - * means there is a period when PageAnon is set, but still has some mappings - * with no pte_chain entry. This is in fact safe, since page_remove_rmap will - * simply not find it. try_to_unmap might erroneously return success, but it - * will never be called because the page_convert_anon() caller has locked the - * page. - * - * page_referenced() may fail to scan all the appropriate pte's and may return - * an inaccurate result. This is so rare that it does not matter. +/* + * No more VM stuff below this comment, only anon_vma helper + * functions. 
*/ -int page_convert_anon(struct page *page) -{ - struct address_space *mapping; - struct vm_area_struct *vma; - struct pte_chain *pte_chain = NULL; - pte_t *pte; - int err = 0; - - mapping = page->mapping; - if (mapping == NULL) - goto out; /* truncate won the lock_page() race */ - - down(&mapping->i_shared_sem); - pte_chain_lock(page); - - /* - * Has someone else done it for us before we got the lock? - * If so, pte.direct or pte.chain has replaced pte.mapcount. - */ - if (PageAnon(page)) { - pte_chain_unlock(page); - goto out_unlock; - } - - SetPageAnon(page); - if (page->pte.mapcount == 0) { - pte_chain_unlock(page); - goto out_unlock; - } - /* This is gonna get incremented by page_add_rmap */ - dec_page_state(nr_mapped); - page->pte.mapcount = 0; - - /* - * Now that the page is marked as anon, unlock it. page_add_rmap will - * lock it as necessary. - */ - pte_chain_unlock(page); - - list_for_each_entry(vma, &mapping->i_mmap, shared) { - if (!pte_chain) { - pte_chain = pte_chain_alloc(GFP_KERNEL); - if (!pte_chain) { - err = -ENOMEM; - goto out_unlock; - } - } - spin_lock(&vma->vm_mm->page_table_lock); - pte = find_pte(vma, page, NULL); - if (pte) { - /* Make sure this isn't a duplicate */ - page_remove_rmap(page, pte); - pte_chain = page_add_rmap(page, pte, pte_chain); - pte_unmap(pte); - } - spin_unlock(&vma->vm_mm->page_table_lock); - } - list_for_each_entry(vma, &mapping->i_mmap_shared, shared) { - if (!pte_chain) { - pte_chain = pte_chain_alloc(GFP_KERNEL); - if (!pte_chain) { - err = -ENOMEM; - goto out_unlock; - } - } - spin_lock(&vma->vm_mm->page_table_lock); - pte = find_pte(vma, page, NULL); - if (pte) { - /* Make sure this isn't a duplicate */ - page_remove_rmap(page, pte); - pte_chain = page_add_rmap(page, pte, pte_chain); - pte_unmap(pte); - } - spin_unlock(&vma->vm_mm->page_table_lock); - } - -out_unlock: - pte_chain_free(pte_chain); - up(&mapping->i_shared_sem); -out: - return err; -} - -/** - ** No more VM stuff below this comment, only 
pte_chain helper - ** functions. - **/ - -static void pte_chain_ctor(void *p, kmem_cache_t *cachep, unsigned long flags) -{ - struct pte_chain *pc = p; - - memset(pc, 0, sizeof(*pc)); -} - -DEFINE_PER_CPU(struct pte_chain *, local_pte_chain) = 0; -/** - * __pte_chain_free - free pte_chain structure - * @pte_chain: pte_chain struct to free - */ -void __pte_chain_free(struct pte_chain *pte_chain) +static void +anon_vma_ctor(void *data, kmem_cache_t *cachep, unsigned long flags) { - struct pte_chain **pte_chainp; - - pte_chainp = &get_cpu_var(local_pte_chain); - if (pte_chain->next_and_idx) - pte_chain->next_and_idx = 0; - if (*pte_chainp) - kmem_cache_free(pte_chain_cache, *pte_chainp); - *pte_chainp = pte_chain; - put_cpu_var(local_pte_chain); -} + if ((flags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) == + SLAB_CTOR_CONSTRUCTOR) { + anon_vma_t * anon_vma = (anon_vma_t *) data; -/* - * pte_chain_alloc(): allocate a pte_chain structure for use by page_add_rmap(). - * - * The caller of page_add_rmap() must perform the allocation because - * page_add_rmap() is invariably called under spinlock. Often, page_add_rmap() - * will not actually use the pte_chain, because there is space available in one - * of the existing pte_chains which are attached to the page. So the case of - * allocating and then freeing a single pte_chain is specially optimised here, - * with a one-deep per-cpu cache. 
 */
-struct pte_chain *pte_chain_alloc(int gfp_flags)
-{
-	struct pte_chain *ret;
-	struct pte_chain **pte_chainp;
-
-	might_sleep_if(gfp_flags & __GFP_WAIT);
-
-	pte_chainp = &get_cpu_var(local_pte_chain);
-	if (*pte_chainp) {
-		ret = *pte_chainp;
-		*pte_chainp = NULL;
-		put_cpu_var(local_pte_chain);
-	} else {
-		put_cpu_var(local_pte_chain);
-		ret = kmem_cache_alloc(pte_chain_cache, gfp_flags);
+		spin_lock_init(&anon_vma->anon_vma_lock);
+		INIT_LIST_HEAD(&anon_vma->anon_vma_head);
 	}
-	return ret;
 }

-void __init pte_chain_init(void)
+void __init anon_vma_init(void)
 {
-	pte_chain_cache = kmem_cache_create( "pte_chain",
-						sizeof(struct pte_chain),
-						0,
-						SLAB_MUST_HWCACHE_ALIGN,
-						pte_chain_ctor,
-						NULL);
+	/* this is intentonally not hw aligned to avoid wasting ram */
+	anon_vma_cachep = kmem_cache_create("anon_vma",
+					    sizeof(anon_vma_t), 0, 0,
+					    anon_vma_ctor, NULL);

-	if (!pte_chain_cache)
-		panic("failed to create pte_chain cache!\n");
+	if(!anon_vma_cachep)
+		panic("Cannot create anon_vma SLAB cache");
 }
--- sles-anobjrmap-2/mm/Makefile.~1~	2004-02-29 17:47:30.000000000 +0100
+++ sles-anobjrmap-2/mm/Makefile	2004-03-10 20:26:16.000000000 +0100
@@ -4,7 +4,7 @@

 mmu-y		:= nommu.o
 mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
-			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
+			   mlock.o mmap.o mprotect.o mremap.o msync.o objrmap.o \
 			   shmem.o vmalloc.o

 obj-y	:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-11  6:52 ` anon_vma RFC2 Andrea Arcangeli
@ 2004-03-11 13:23   ` Hugh Dickins
  2004-03-11 13:56     ` Andrea Arcangeli
  ` (2 more replies)
  0 siblings, 3 replies; 74+ messages in thread
From: Hugh Dickins @ 2004-03-11 13:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
	William Lee Irwin III

Hi Andrea,

On Thu, 11 Mar 2004, Andrea Arcangeli wrote:
>
> this is the full current status of my anon_vma work. Now fork() and all
> the other page_add/remove_rmap in memory.c plus the paging routines
> seems fully covered and I'm now dealing with the vma merging and the
> anon_vma garbage collection (the latter is easy but I need to track all
> the kmem_cache_free).

I'm still making my way through all the relevant mails, and have not even
glanced at your code yet: I hope to later today.  But to judge by the
length of your essay on vma merging, it strikes me that you've taken a
wrong direction in switching from my anon mm to your anon vma.

Go by vmas and you have tiresome problems as they are split and merged,
very commonly.  Plus you have the overhead of a new data structure per
vma.  If your design magicked those problems away somehow, okay, but it
seems you're finding issues with it: I think you should go back to anon
mms.

Go by mms, and there's only the exceedingly rare (does it ever occur
outside our testing?) awkward case of tracking pages in a private anon
vma inherited from the parent, when parent or child mremaps it with
MAYMOVE.  I reused the pte_chain code for that, but it's probably better
done by conjuring up an imaginary tmpfs object as backing at that point
(that has its own little cost, since the object lives on at full size
until all its mappers unmap it, however small the portion they have
mapped).  And the overhead of the new data structure is per mm only.

I'll get back to reading through the mails now: sorry if I'm about to
find the arguments against anonmm in my reading.
(By the way, several times you mention the size of a 2.6 struct page as larger than a 2.4 struct page: no, thanks to wli and others it's the 2.6 that's smaller.)

Hugh
* Re: anon_vma RFC2
  2004-03-11 13:23             ` Hugh Dickins
@ 2004-03-11 13:56               ` Andrea Arcangeli
  2004-03-11 21:54                 ` Hugh Dickins
  2004-03-12  3:28                 ` Rik van Riel
  1 sibling, 2 replies; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-11 13:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
	William Lee Irwin III

Hi Hugh,

On Thu, Mar 11, 2004 at 01:23:24PM +0000, Hugh Dickins wrote:
> Hi Andrea,
>
> On Thu, 11 Mar 2004, Andrea Arcangeli wrote:
> >
> > this is the full current status of my anon_vma work. Now fork() and all
> > the other page_add/remove_rmap in memory.c plus the paging routines
> > seems fully covered and I'm now dealing with the vma merging and the
> > anon_vma garbage collection (the latter is easy but I need to track all
> > the kmem_cache_free).
>
> I'm still making my way through all the relevant mails, and not even
> glanced at your code yet: I hope later today. But to judge by the
> length of your essay on vma merging, it strikes me that you've taken
> a wrong direction in switching from my anon mm to your anon vma.
>
> Go by vmas and you have tiresome problems as they are split and merged,
> very commonly. Plus you have the overhead of a new data structure per vma.

it's more complicated because it's more fine-grained and it can handle mremap too. I mean, the additional cost of tracking the vmas pays off because then we've a tiny list of vmas to search for every page; otherwise, with the mm-wide model, we'd need to search all of the vmas in an mm. This is quite important during swapping with tons of vmas. Note that in my common case the page will point directly to the vma (PageDirect(page) == 1), no find_vma or whatever needed in between.

the per-vma overhead is 12 bytes: 2 pointers for the list node and 1 pointer to the anon-vma. As said above it provides several advantages, but you're certainly right that the mm approach had no per-vma overhead.
I'm quite convinced the anon_vma is the optimal design, though it's not running yet ;). However it's close to compiling. The whole vma and page layer is finished (including the vma merging). I'm now dealing with the swapcache stuff and I'm doing it slightly differently from your anobjrmap-2 patch (obviously I also reinstantiate the PG_swapcache bitflag, but the fundamental difference is that I don't drop the swapper_space):

	static inline struct address_space * page_mapping(struct page * page)
	{
		extern struct address_space swapper_space;
		struct address_space * mapping = NULL;

		if (PageSwapCache(page))
			mapping = &swapper_space;
		else if (!PageAnon(page))
			mapping = page->as.mapping;

		return mapping;
	}

I want the same pagecache/swapcache code to work transparently, but I free up the page->index and the page->mapping for the swapcache, so that I can reuse them to track the anon_vma. I think the above is simpler than killing the swapper_space completely as you did. My solution avoids hacks like this:

 	if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
 		return mapping->a_ops->sync_page(page);
+	if (PageSwapCache(page))
+		blk_run_queues();
 	return 0;
 }

it also saves me from reworking set_page_dirty to call __set_page_dirty_buffers by hand. I mean, it's less intrusive. The cpu cost is similar, since I pay for an additional compare in page_mapping, but the code looks cleaner. Could be my opinion only though ;).

> If your design magicked those problems away somehow, okay, but it seems
> you're finding issues with it: I think you should go back to anon mms.

the only issue I found so far is that to track the stuff in a fine-granular way I have to forbid merging sometimes.
note that forbidding merging is a feature too: if I went down the path of a pagetable scan on the vma to fix up all page->as.vma/anon_vma and page->index, I would then lose some historic information on the origin of certain vmas, and I would eventually fall back to the mm-wide information if I did total merging.

I think the probability of forbidden merging is low enough that it doesn't matter. Also it doesn't impact in any way the file merging. It basically merges as well as the file merging does. Right now I'm also not overriding the initial vm_pgoff given to brand new anonymous vmas, but I could, to boost the merging with mremapped segments. Though I don't think it's necessary.

Overall the main reason for keeping track of vmas and not of the mm is to be able to handle mremap as efficiently as with 2.4; I mean, your anobjrmap-5 simply reinstantiates the pte_chains, so the vm then has to deal with both pte_chains and anonmm too.

> Go by mms, and there's only the exceedingly rare (does it ever occur
> outside our testing?) awkward case of tracking pages in a private anon
> vma inherited from parent, when parent or child mremaps it with MAYMOVE.
>
> Which I reused the pte_chain code for, but it's probably better done
> by conjuring up an imaginary tmpfs object as backing at that point
> (that has its own little cost, since the object lives on at full size
> until all its mappers unmap it, however small the portion they have
> mapped). And the overhead of the new data structure is per mm only.
>
> I'll get back to reading through the mails now: sorry if I'm about to
> find the arguments against anonmm in my reading. (By the way, several
> times you mention the size of a 2.6 struct page as larger than a 2.4
> struct page: no, thanks to wli and others it's the 2.6 that's smaller.)

really? mainline 2.6 has the same size as mainline 2.4 (48 bytes), or am I counting wrong?
(at least my 2.4-aa tree is 48 bytes too, but I think 2.4 mainline is as well.) objrmap adds 4 bytes (goes to 52 bytes), my patch removes 8 bytes (i.e. the pte_chain), and the result of my patch is 4 bytes less than 2.4 and 2.6 (44 bytes instead of 48 bytes). I wanted to nuke the mapcount too but that destroys the nr_mapped info, and that spreads all over, so for now I keep the page->mapcount ;)
* Re: anon_vma RFC2
  2004-03-11 13:56               ` Andrea Arcangeli
@ 2004-03-11 21:54                 ` Hugh Dickins
  2004-03-12  1:47                   ` Andrea Arcangeli
  1 sibling, 1 reply; 74+ messages in thread
From: Hugh Dickins @ 2004-03-11 21:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds,
	William Lee Irwin III, linux-kernel

On Thu, 11 Mar 2004, Andrea Arcangeli wrote:
> On Thu, Mar 11, 2004 at 01:23:24PM +0000, Hugh Dickins wrote:
> >
> > Go by vmas and you have tiresome problems as they are split and merged,
> > very commonly. Plus you have the overhead of a new data structure per vma.
>
> it's more complicated because it's more fine-grained and it can handle
> mremap too. I mean, the additional cost of tracking the vmas pays off
> because then we've a tiny list of vmas to search for every page,
> otherwise with the mm-wide model we'd need to search all of the vmas in
> an mm. This is quite important during swapping with tons of vmas. Note
> that in my common case the page will point directly to the vma
> (PageDirect(page) == 1), no find_vma or whatever needed in between.

Nice if you can avoid the find_vma, but it is (or was) used in the objrmap case, so I was happy to have it in the anobj case also.

Could you post a patch against 2.6.3 or 2.6.4? Your objrmap patch applies with offsets, no problem, but your anobjrmap patch doesn't apply cleanly on top of that, partly because you've renamed files in between (revert that?), but there seem to be other untracked changes too. I may not be seeing the whole story right.

Great to see the pte_chains gone, but I find what you have for anon vmas strangely complicated: the continued existence of PageDirect etc. I guess, having elected to go by vmas, you're trying to avoid some of the overhead until fork. But that does make it messy to my eyes, the anonmm way much cleaner and simpler in that regard.
> I want the same pagecache/swapcache code to work transparently, but I
> free up the page->index and the page->mapping for the swapcache, so that
> I can reuse it to track the anon_vma. I think the above is simpler than
> killing the swapper_space completely as you did. My solution avoids me
> hacks like this:
>
> 	if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
> 		return mapping->a_ops->sync_page(page);
> +	if (PageSwapCache(page))
> +		blk_run_queues();
> 	return 0;
> }
>
> it also avoids me rework set_page_dirty to call __set_page_dirty_buffers
> by hand too. I mean, it's less intrusive.

There may well be better ways of reassigning the page struct fields than I had, making for less extensive changes, yes. Best to go with the least intrusive for now (so long as not too ugly) and reappraise later.

> Overall the main reason for keeping track of vmas and not of
> the mm, is to be able to handle mremap as efficiently as with 2.4, I mean
> your anobjrmap-5 simply reinstantiates the pte_chains, so the vm then has
> to deal with both pte_chains and anonmm too.

Yes, I used pte_chains for that because we hadn't worked out how to do remap_file_pages without them (I've not yet looked into how you're handling those), so might as well put them to use here too. But if nonlinear is now relieved of pte_chains, great, and as I said below, the anonmm mremap case should be able to conjure a tmpfs backing object - which probably amounts to your anon_vma, but only needed in that one odd case, anon mm sufficient for all the rest, less overhead all round.

> > Go by mms, and there's only the exceedingly rare (does it ever occur
> > outside our testing?) awkward case of tracking pages in a private anon
> > vma inherited from parent, when parent or child mremaps it with MAYMOVE.
> >
> > Which I reused the pte_chain code for, but it's probably better done
> > by conjuring up an imaginary tmpfs object as backing at that point
> > (that has its own little cost, since the object lives on at full size
> > until all its mappers unmap it, however small the portion they have
> > mapped). And the overhead of the new data structure is per mm only.
> >
> > I'll get back to reading through the mails now: sorry if I'm about to
> > find the arguments against anonmm in my reading. (By the way, several
> > times you mention the size of a 2.6 struct page as larger than a 2.4
> > struct page: no, thanks to wli and others it's the 2.6 that's smaller.)
>
> really? mainline 2.6 has the same size as mainline 2.4 (48 bytes), or
> am I counting wrong? (at least my 2.4-aa tree is 48 bytes too, but I
> think 2.4 mainline is as well.) objrmap adds 4 bytes (goes to 52 bytes), my patch
> removes 8 bytes (i.e. the pte_chain) and the result of my patch is 4
> bytes less than 2.4 and 2.6 (44 bytes instead of 48 bytes). I wanted to
> nuke the mapcount too but that destroys the nr_mapped info, and that
> spreads all over so for now I keep the page->mapcount ;)

I think you were counting wrong. Mainline 2.4 i386 48 bytes, agreed. Mainline 2.6 i386 40 bytes, or 44 bytes if PAE & HIGHPTE. And today, 2.6.4-mm1 i386 32 bytes, or 36 bytes if PAE & HIGHPTE. Though of course the vanished fields will often be countered by memory usage elsewhere.

Yes, keep mapcount for now: I went around that same loop, it surely has the feel of something that can be disposed of in the end, but there's no need to attempt that while doing this objrmap job, it's better done after since it needs a different kind of care.

(Be aware that shmem_writepage will do the wrong thing, COWing what should be a shared page, if it is ever given a still-mapped page: but no need to worry about that now, and it may be easy to work it differently once the rmap changes settle down.
As to shmem_writepage going directly to swap, by the way: I'm perfectly happy for you to make that change, but I don't believe the old way was mistaken - it intentionally gave tmpfs pages which should remain in memory another go around. I was never convinced one way or the other: but the current code works very badly for some loads, as you found, I doubt there are any that will suffer so greatly from the change, so go ahead.)

Hugh
* Re: anon_vma RFC2
  2004-03-11 21:54                 ` Hugh Dickins
@ 2004-03-12  1:47                   ` Andrea Arcangeli
  2004-03-12  2:20                     ` Andrea Arcangeli
  0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-12 1:47 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds,
	William Lee Irwin III, linux-kernel

On Thu, Mar 11, 2004 at 09:54:01PM +0000, Hugh Dickins wrote:
> Could you post a patch against 2.6.3 or 2.6.4? Your objrmap patch

I uploaded my latest status. There are three patches: the first is Dave's objrmap, the second is your anobjrmap-1, the third is my anon_vma work that removes the pte_chains all over the kernel.

my patch is not stable yet, it crashes during swapping and the debugging code catches bugs even before swapping (which is good):

 0  0      0 404468  11900  41276    0    0     0     0 1095    61  0  0 100  0
 0  0      0 404468  11900  41276    0    0     0     0 1108    71  0  0 100  0
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0      0 404468  11908  41268    0    0     0   136 1102    59  0  0 100  0
 1  0      0 310972  11908  41268    0    0     0     0 1100    50  2  7 91  0
 1  0      0  66748  11908  41268    0    0     0     0 1085    30  6 19 75  0
 1  1    128   2648    216  14132    0  128     0   256 1118   139  3 16 73  8
 1  2  77084   1332    232   2188    0 76952   308 76952 1162   255  1 10 54 35

I hope to make it work tomorrow, then the next two things to do are the pagetable walk in the nonlinear case (currently it's pinned) and the rbtree (or prio_tree) for the i_mmap{,shared}. Then it will be complete and mergeable.

http://www.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.3/objrmap
* Re: anon_vma RFC2
  2004-03-12  1:47                   ` Andrea Arcangeli
@ 2004-03-12  2:20                     ` Andrea Arcangeli
  0 siblings, 0 replies; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-12 2:20 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds,
	William Lee Irwin III, linux-kernel

On Fri, Mar 12, 2004 at 02:47:10AM +0100, Andrea Arcangeli wrote:
> my patch is not stable yet, it crashes during swapping and the debugging
> code catches bugs even before swapping (which is good):

I fixed some more bugs (s/index/private), it's not stable yet but some basic swapping works now (there is probably some issue with shared swapcache still, since ps just oopsed, and ps may be sharing COW swapcache through fork).

 0  0      0 408712   7800  41160    0    0     0     0 1131    46  0  0 95  5
 0  0      0 408712   7800  41160    0    0     0     0 1102    64  0  0 100  0
 0  0      0 408712   7800  41160    0    0     0     0 1090    40  0  0 100  0
 0  0      0 408712   7800  41160    0    0     0     0 1107    84  0  0 100  0
 0  0      0 408712   7808  41152    0    0     0    84 1101    66  0  0 100  0
 0  0      0 408712   7808  41152    0    0     0     0 1096    52  0  0 100  0
 1  0      0 264808   7808  41152    0    0     0     0 1093    49  5 16 79  0
 1  0      0  51636   7808  41152    0    0     0     0 1083    34  5 20 75  0
 1  1    128   2384    212  14068    0  128     0   204 1106   178  1  7 73 19
 1  2  82824   2332    200   2136   32 82668    40 82668 1221  1955  1 12 49 38
 1  2 130000   2448    208   1868   32 47048   312 47048 1184   782  0  5 60 35
 0  3 178700   1676    208   2428 10388 48700 11000 48700 1536  1291  0  4 55 40
 0  3 205996   1780    216   1992 4264 27224  4424 27224 1312   549  1  4 41 55
 2  2 238900   4148    240   2388   88 32980   684 32984 1190  1380  1  6 23 69
 0  3 295124   1996    244   2392   92 56148   232 56148 1223   149  1  6 38 54
 0  2 315204   2036    244   2356    0 19972     0 19972 1172    55  1  2 52 45
 1  0 334052   3924    264   2592  192 18720   372 18720 1205   154  0  1 35 63
 0  3 377208   2324    264   1928   64 42984    64 42984 1249   208  2  6 39 53
 0  1 389856   3408    264   2032  128 12680   224 12680 1187   159  0  1 60 38
 0  0 374032 263036    316   3504  920    0  2464     0 1258   224  0  2 76 23
 0  0 374032 263036    316   3504    0    0     0     0 1087    27  0  0 100  0
 0  0 374032 263036    316   3504    0    0     0     0 1083    25  0  0 100  0
 0  0 374032 263040    316   3504    0    0     0     0 1086    25  0  0 100  0
 0  0 374032 263040    316   3504    0    0     0     0 1084    27  0  0 100  0
 0  0 374032 263128    316   3504    0    0     0     0 1086    23  0  0 100  0
 0  0 374032 263164    316   3472   32    0    32     0 1086    23  0  0 100  0
 0  0 374032 263212    316   3508   32    0    32     0 1086    25  0  0 100  0

I uploaded a new anon_vma patch in the same directory with the fixes to make the basic swapping work. Tomorrow I'll look into the ps oops and into heavy COW loads.
* Re: anon_vma RFC2
  2004-03-11 13:56               ` Andrea Arcangeli
  2004-03-11 21:54                 ` Hugh Dickins
@ 2004-03-12  3:28                 ` Rik van Riel
  2004-03-12 12:21                   ` Andrea Arcangeli
  1 sibling, 1 reply; 74+ messages in thread
From: Rik van Riel @ 2004-03-12 3:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
	William Lee Irwin III

On Thu, 11 Mar 2004, Andrea Arcangeli wrote:

> it's more complicated because it's more fine-grained and it can handle
> mremap too. I mean, the additional cost of tracking the vmas pays off
> because then we've a tiny list of vmas to search for every page,
> otherwise with the mm-wide model we'd need to search all of the vmas in
> an mm.

Actually, with the code Rajesh is working on there's no search problem with Hugh's idea.

Considering the fact that we'll need Rajesh's code anyway, to deal with Ingo's test program and the real world programs that do similar things, I don't see how your objection to Hugh's code is still valid.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
* Re: anon_vma RFC2
  2004-03-12  3:28                 ` Rik van Riel
@ 2004-03-12 12:21                   ` Andrea Arcangeli
  2004-03-12 12:40                     ` Rik van Riel
  2004-03-12 12:42                     ` Andrea Arcangeli
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-12 12:21 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
	William Lee Irwin III

On Thu, Mar 11, 2004 at 10:28:42PM -0500, Rik van Riel wrote:
> On Thu, 11 Mar 2004, Andrea Arcangeli wrote:
>
> > it's more complicated because it's more fine-grained and it can handle
> > mremap too. I mean, the additional cost of tracking the vmas pays off
> > because then we've a tiny list of vmas to search for every page,
> > otherwise with the mm-wide model we'd need to search all of the vmas in
> > an mm.
>
> Actually, with the code Rajesh is working on there's
> no search problem with Hugh's idea.

you missed the fact that mremap doesn't work; that's the fundamental reason for the vma tracking, so you can use vm_pgoff.

if you take Hugh's anonmm, mremap will be attaching a persistent dynamic overhead to the vma it touches. Currently it does so in the form of pte_chains; that can be converted to other means of overhead, but I simply don't like it.

I like all vmas to be symmetric to each other, without special hacks to handle mremap right. We have the vm_pgoff to handle mremap and I simply use that.

> Considering the fact that we'll need Rajesh's code
> anyway, to deal with Ingo's test program and the real

Rajesh's code has nothing to do with the mremap breakage; Rajesh's code can only boost the search of the interesting vmas in an anonmm, it doesn't solve mremap.

> world programs that do similar things, I don't see how
> your objection to Hugh's code is still valid.
This was my objection; maybe you didn't read all my emails, so I quote again:

"Overall the main reason for keeping track of vmas and not of the mm is to be able to handle mremap as efficiently as with 2.4; I mean, your anobjrmap-5 simply reinstantiates the pte_chains, so the vm then has to deal with both pte_chains and anonmm too."

As said, one can convert the pte_chains to other means of overhead, but still it's a hack and you'll need transient objects to track those if you don't track fine-grained by vma as I'm doing.

It's not that I didn't read the anonmm patches from Hugh, I spent lots of time on those; they just were flawed and they couldn't handle mremap, as he very well knows, see anobjrmap-5 for instance.

the vma merging isn't a problem, we need to rework the code anyways to allow the file merging in both mprotect and mremap (currently only mmap is capable of merging files, and in turn it's also the only one capable of merging anon_vmas). Any merging code that is currently capable of merging files is easy to teach about anon_vmas too, it's basically the same problem as merging.
* Re: anon_vma RFC2
  2004-03-12 12:21                   ` Andrea Arcangeli
@ 2004-03-12 12:40                     ` Rik van Riel
  2004-03-12 13:11                       ` Andrea Arcangeli
  2004-03-12 12:42                     ` Andrea Arcangeli
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 74+ messages in thread
From: Rik van Riel @ 2004-03-12 12:40 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
	William Lee Irwin III

On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> On Thu, Mar 11, 2004 at 10:28:42PM -0500, Rik van Riel wrote:

> > Actually, with the code Rajesh is working on there's
> > no search problem with Hugh's idea.
>
> you missed the fact that mremap doesn't work; that's the fundamental reason
> for the vma tracking, so you can use vm_pgoff.
>
> if you take Hugh's anonmm, mremap will be attaching a persistent dynamic
> overhead to the vma it touches. Currently it does so in the form of pte_chains;
> that can be converted to other means of overhead, but I simply don't
> like it.
>
> I like all vmas to be symmetric to each other, without special hacks to
> handle mremap right.
>
> We have the vm_pgoff to handle mremap and I simply use that.

Would it be possible to get rid of that if we attached a struct address_space to each mm_struct after exec(), sharing the address_space between parent and child processes after a fork()?

Note that the page cache can handle up to 2^42 bytes in one address_space on a 32 bit system, so there's more than enough space to be shared between parent and child processes.

Then the vmas can track vm_pgoff inside the address space attached to the mm.

> > Considering the fact that we'll need Rajesh's code
> > anyway, to deal with Ingo's test program and the real
>
> Rajesh's code has nothing to do with the mremap breakage; Rajesh's code
> can only boost the search of the interesting vmas in an anonmm, it
> doesn't solve mremap.

If you mmap a file, then mremap part of that mmap, where's the special case?
> "Overall the main reason for keeping track of vmas and not of
> the mm is to be able to handle mremap as efficiently as with 2.4; I mean,
> your anobjrmap-5 simply reinstantiates the pte_chains, so the vm then has
> to deal with both pte_chains and anonmm too."

Yes, that's a problem indeed. I'm not sure it's fundamental or just an implementation artifact, though...

> the vma merging isn't a problem, we need to rework the code anyways to
> allow the file merging in both mprotect and mremap

Agreed.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
* Re: anon_vma RFC2
  2004-03-12 12:40                     ` Rik van Riel
@ 2004-03-12 13:11                       ` Andrea Arcangeli
  2004-03-12 16:25                         ` Rik van Riel
  0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-12 13:11 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
	William Lee Irwin III

On Fri, Mar 12, 2004 at 07:40:51AM -0500, Rik van Riel wrote:
> On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> > On Thu, Mar 11, 2004 at 10:28:42PM -0500, Rik van Riel wrote:
> >
> > > Actually, with the code Rajesh is working on there's
> > > no search problem with Hugh's idea.
> >
> > you missed the fact that mremap doesn't work; that's the fundamental reason
> > for the vma tracking, so you can use vm_pgoff.
> >
> > if you take Hugh's anonmm, mremap will be attaching a persistent dynamic
> > overhead to the vma it touches. Currently it does so in the form of pte_chains;
> > that can be converted to other means of overhead, but I simply don't
> > like it.
> >
> > I like all vmas to be symmetric to each other, without special hacks to
> > handle mremap right.
> >
> > We have the vm_pgoff to handle mremap and I simply use that.
>
> Would it be possible to get rid of that if we attached
> a struct address_space to each mm_struct after exec(),
> sharing the address_space between parent and child
> processes after a fork()?
> Note that the page cache can handle up to 2^42 bytes
> in one address_space on a 32 bit system, so there's
> more than enough space to be shared between parent and
> child processes.
>
> Then the vmas can track vm_pgoff inside the address
> space attached to the mm.

I can't understand, sorry. I don't see what you mean by sharing the same address space between parent and child; whatever _global_ mm-wide address space is screwed by mremap: if you don't use the pg_off to offset the page->index, the vm_start/vm_end means nothing.
I think the anonmm design is flawed and it has no way to handle mremap reasonably well, though feel free to keep doing research on that; I would be happy to use a simpler and more efficient design. I just tried to reuse the anonmm, but it was overly complex in design and inefficient too at dealing with mremap, so I had few doubts I had to change that, and the anon_vma idea solved all the issues with anonmm, so I started coding that.

If you don't track by vmas (like I'm doing), and you allow merging of two different vmas, one touched by mremap and the other not, you'll end up mixing the vm_pgoff and the whole anonmm falls apart, and the tree search falls apart too after you lose the vm_pgoff of the vma that got merged.

Hugh solved this by simply saying that anonmm isn't capable of dealing with mremap, and he used the pte_chains as if it were the rmap vm after the first mremap. That's bad, but whatever solution more efficient than the pte_chains (for example metadata tracking a range, not wasting bytes for every single page in the range like rmap does) will still be a mess in terms of vma merging, tracking and rbtree/prio_tree search too, and it won't obviously be more efficient at all, since you'll still have to use the tree, and in all common cases my design will beat the tree performance (even ignoring the mremap overhead with anonmm).

the way I defer the anon_vma allocation and instantiate direct pages is likewise extremely efficient compared to the anonmm. The only thing I disallow is the merging of two vmas with different anon_vma or different vm_pgoff, but that's a feature: if you don't do that in the anonmm design, you'll have to allocate dynamic structures on top of the vma, tracking partial ranges within each vma, which can be a lot slower and is so messy to deal with that I didn't even remotely consider writing anything like that, when I can use the pgoff with the anon_vma_t.
> > > Considering the fact that we'll need Rajesh's code
> > > anyway, to deal with Ingo's test program and the real
> >
> > Rajesh's code has nothing to do with the mremap breakage; Rajesh's code
> > can only boost the search of the interesting vmas in an anonmm, it
> > doesn't solve mremap.
>
> If you mmap a file, then mremap part of that mmap, where's
> the special case?

you miss that we disallow the merging of vmas with different vm_pgoff if they belong to a file (vma->vm_file != NULL). In fact what my code does is treat the anon vmas similarly to the file vmas, and that's why the merging probability is reduced a little bit. The single fact that anonmm allows merging of all anon vmas as if they were not vma-tracked tells you anonmm is flawed w.r.t. mremap.

Something has to be changed anyways in the vma handling code too (like the vma merging code) even with anonmm, if your objective is to always pass through the vma to reach the pagetables. Hugh solved this by not passing through the vma after the first mremap; that works too of course, but I think my design is more efficient: my whole effort is to avoid allocating per-page overhead and to have a single metadata object (the vma) serving a range of pages. That's a lot more efficient than the pte_chains and it saves a load of ram on both 64bit and 32bit.

to put it another way, the problem you have with anonmm is that after an mremap the page->index becomes invalid, and no, you can't fix up the page->index by looping over all the pages pointed to by the vma, because those page->index will be meaningful to other vmas in other address spaces, where their address is still the original one (the one before fork()).

> > "Overall the main reason for keeping track of vmas and not of
> > the mm is to be able to handle mremap as efficiently as with 2.4; I mean,
> > your anobjrmap-5 simply reinstantiates the pte_chains, so the vm then has
> > to deal with both pte_chains and anonmm too."
>
> Yes, that's a problem indeed.
> I'm not sure it's fundamental
> or just an implementation artifact, though...

I think it's fundamental, but again, if you can find a solution to that it's more than welcome. I just don't see how you can ever handle mremap if you treat all the vmas the same, before and after mremap: if you treat all the vmas the same you lose vm_pgoff, and in turn you break in mremap, and you can forget using the vmas for reaching the pagetables, since you will do nothing with just the vm_start/vm_end and page->index then.

You can still treat all of them the same by allocating dynamic stuff on top of the vma, but that will complicate everything, including the tree search and the vma merging too. So the few lines I had to add to the vma merging to teach the vma layer about the anon_vma should be a whole lot simpler and a whole lot more efficient than the ones you'd have to add to allocate those dynamic objects sitting on top of the vmas and telling you the right pg_off per-range (not to mention the handling of the oom conditions while allocating those dynamic objects in super-spinlocked paths; even the GFP_ATOMIC abuses from the pte_chains were nasty, GFP_ATOMIC should be reserved to irqs and bhs since they've no way to unlock and sleep!...).
* Re: anon_vma RFC2
  2004-03-12 13:11                       ` Andrea Arcangeli
@ 2004-03-12 16:25                         ` Rik van Riel
  2004-03-12 17:13                           ` Andrea Arcangeli
  0 siblings, 1 reply; 74+ messages in thread
From: Rik van Riel @ 2004-03-12 16:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
	William Lee Irwin III

On Fri, 12 Mar 2004, Andrea Arcangeli wrote:

> I don't see what you mean by sharing the same address space between
> parent and child; whatever _global_ mm-wide address space is screwed by
> mremap: if you don't use the pg_off to offset the page->index, the
> vm_start/vm_end means nothing.

At mremap time, you don't change the page->index at all, but only the vm_start/vm_end.

Think of it as an mm_struct pointing to a struct address_space with its anonymous memory. On exec() the mm_struct gets a new address_space, on fork parent and child share them.

Sharing is good enough, because there is PAGE_SIZE times more space in a struct address_space than there's available virtual memory in one single process. That means that for a daemon like apache every child can simply get its own 4GB subset of the address space for any new VMAs, while mapping the inherited VMAs in the same way any other file is mapped.

> I think the anonmm design is flawed and it has no way to handle
> mremap reasonably well,

There's no difference between mremap() of anonymous memory and mremap() of part of an mmap() range of a file...

At least, there doesn't need to be.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
* Re: anon_vma RFC2 2004-03-12 16:25 ` Rik van Riel @ 2004-03-12 17:13 ` Andrea Arcangeli 2004-03-12 17:23 ` Rik van Riel 0 siblings, 1 reply; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-12 17:13 UTC (permalink / raw) To: Rik van Riel Cc: Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel, William Lee Irwin III On Fri, Mar 12, 2004 at 11:25:27AM -0500, Rik van Riel wrote: > pointing to a struct address_space with its anonymous > memory. On exec() the mm_struct gets a new address_space, > on fork parent and child share them. isn't this what anonmm is already doing? are you suggesting something different? > There's no difference between mremap() of anonymous memory > and mremap() of part of an mmap() range of a file... > > At least, there doesn't need to be. the anonmm simply cannot work because it's not reaching vmas, it only reaches the mm, and with an mm and a virtual address you cannot reach the right vma if it was moved around by mremap; you don't even see any vm_pgoff during the lookup, so there's no way to fix anonmm with a prio_tree. Something in between anon_vma and anonmm that could handle mremap too would have been possible, but it has downsides not fixable with a prio_tree: it consists of queuing all the _vmas_ (not the mm!) into an anon_vma object, and then you have to fix up the vma merging code to forbid merging with different vm_pgoff. That would be like anon_vma but it would not be fine-grained like anon_vma is: you'll end up scanning very old vma segments in other address spaces even though you're working with direct memory now. Such a model (let's call it anon_vma_global) would save 8 bytes per vma of anon_vma objects. Maybe that's the model that DaveM implemented originally?
I think my anon_vma is superior because it's more fine-grained (it also avoids the need of a prio_tree, even if in theory we could stack a prio_tree on top of every anon_vma, but it's really not needed) and the memory usage is minimal anyway (the per-vma memory cost is the same for anon_vma and anon_vma_global, only the total number of anon_vma objects varies). The prio_tree wouldn't fix the intermediate model because the vma ranges could match fine in all address spaces, so you would need the prio_tree adding another 12 bytes to each vma (on top of the 12 bytes added by the anon_vma_global), but the pages would be different because the vma->vm_mm is different and there can be copy-on-writes. This cannot happen with an inode, so the prio_tree fixes the inode case completely while it doesn't fix the anon_vma_global design with only one anon_vma allocated at fork for all children. anon_vma gets that optimally instead (with an 8-byte cost). So overall I think anon_vma is a much better utilization of the 12 bytes; rather than having a prio_tree stacked on top of an anon_vma_global, I prefer to be fine-grained and to track the stuff that not even a prio_tree can track, when the vma->vm_mm has different pages for every vma in the same range. ^ permalink raw reply [flat|nested] 74+ messages in thread
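The shape of the anon_vma design being argued for here can be sketched as a toy model. These are hypothetical, heavily simplified structures: the field names echo the patches under discussion, but the real code differs in locking, list types, and detail. The point illustrated is the object-based reverse lookup, and the N == 1 common case (one vma per anon_vma) discussed below:

```c
#include <stddef.h>

#define PAGE_SHIFT 12

struct vm_area_struct;

/* One anon_vma object shared by all vmas that can map the same
 * anonymous pages (normally just one vma; more only after fork). */
struct anon_vma {
    struct vm_area_struct *head;      /* list of vmas sharing these pages */
};

struct vm_area_struct {
    unsigned long vm_start, vm_end;
    unsigned long vm_pgoff;
    struct vm_area_struct *anon_next; /* link in the anon_vma list */
};

struct page {
    unsigned long index;              /* virtual page offset, per vm_pgoff */
    struct anon_vma *anon;            /* one pointer per page, no chains */
};

/* Object-based reverse lookup: recover the mapping address of `page`
 * inside `vma`, or 0 if the vma no longer covers it (e.g. after a
 * partial munmap or an mremap that moved this range elsewhere). */
static unsigned long vma_address(const struct page *page,
                                 const struct vm_area_struct *vma)
{
    unsigned long addr = vma->vm_start +
        ((page->index - vma->vm_pgoff) << PAGE_SHIFT);

    if (addr < vma->vm_start || addr >= vma->vm_end)
        return 0;
    return addr;
}

/* A page_referenced()-style walk: visit every vma in the anon_vma.
 * In the common case the list has one entry, so the walk is O(1). */
static int count_mappings(const struct page *page)
{
    int n = 0;
    const struct vm_area_struct *vma;

    for (vma = page->anon->head; vma; vma = vma->anon_next)
        if (vma_address(page, vma))
            n++;
    return n;
}
```

The per-vma cost is one list link plus the anon pointer, which is the "at most 8 bytes per vma of anon_vma objects" accounting in the mail above, as opposed to per-page pte_chains.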
* Re: anon_vma RFC2 2004-03-12 17:13 ` Andrea Arcangeli @ 2004-03-12 17:23 ` Rik van Riel 2004-03-12 17:44 ` Andrea Arcangeli 2004-03-12 18:25 ` Linus Torvalds 0 siblings, 2 replies; 74+ messages in thread From: Rik van Riel @ 2004-03-12 17:23 UTC (permalink / raw) To: Andrea Arcangeli Cc: Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel, William Lee Irwin III On Fri, 12 Mar 2004, Andrea Arcangeli wrote: > On Fri, Mar 12, 2004 at 11:25:27AM -0500, Rik van Riel wrote: > > pointing to a struct address_space with its anonymous > > memory. On exec() the mm_struct gets a new address_space, > > on fork parent and child share them. > > isn't this what anonmm is already doing? are you suggesting something > different? I am suggesting a pointer from the mm_struct to a struct address_space ... > > There's no difference between mremap() of anonymous memory > > and mremap() of part of an mmap() range of a file... > > > > At least, there doesn't need to be. > > the anonmm simply cannot work because it's not reaching vmas, it only > reaches mm, and with an mm and a virtual address you cannot reach the > right vma if it was moved around by mremap, ... and use the offset into the struct address_space as the page->index, NOT the virtual address inside the mm. On first creation of anonymous memory these addresses could be the same, but on mremap inside a forked process (with multiple processes sharing part of anonymous memory) a page could have a different offset inside the struct address_space than its virtual address.... Then on mremap you only need to adjust the start and end offsets inside the VMAs, not the page->index ... > That would be like anon_vma but it would not be fine-grained like anon_vma > is, you'll end up scanning very old vma segments in other address spaces Not really. On exec you can start with a new address space entirely, so the sharing is limited only to processes that really do share anonymous memory with each other...
> I think my anon_vma is superior because it's more fine-grained Isn't being LESS fine-grained the whole reason for moving from pte-based to object-based reverse mapping? ;)) > (it also avoids the need of a prio_tree even if in theory we could stack > a prio_tree on top of every anon_vma, but it's really not needed) We need the prio_tree anyway for files. I don't see why we couldn't reuse that code for anonymous memory, rather than reimplementing something new... Having the same code everywhere will definitely help simplify things. cheers, Rik -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 17:23 ` Rik van Riel @ 2004-03-12 17:44 ` Andrea Arcangeli 2004-03-12 18:18 ` Rik van Riel 2004-03-12 18:25 ` Linus Torvalds 1 sibling, 1 reply; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-12 17:44 UTC (permalink / raw) To: Rik van Riel Cc: Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel, William Lee Irwin III On Fri, Mar 12, 2004 at 12:23:22PM -0500, Rik van Riel wrote: > On Fri, 12 Mar 2004, Andrea Arcangeli wrote: > > On Fri, Mar 12, 2004 at 11:25:27AM -0500, Rik van Riel wrote: > > > pointing to a struct address_space with its anonymous > > > memory. On exec() the mm_struct gets a new address_space, > > > on fork parent and child share them. > > > > isn't this what anonmm is already doing? are you suggesting something > > different? > > I am suggesting a pointer from the mm_struct to a > struct address_space ... that's the anonmm: + mm->anonmm = anonmm; > > > There's no difference between mremap() of anonymous memory > > > and mremap() of part of an mmap() range of a file... > > > > > > At least, there doesn't need to be. > > > > the anonmm simply cannot work because it's not reaching vmas, it only > > reaches mm, and with an mm and a virtual address you cannot reach the > > right vma if it was moved around by mremap, > > ... and use the offset into the struct address_space as > the page->index, NOT the virtual address inside the mm. > > On first creation of anonymous memory these addresses > could be the same, but on mremap inside a forked process > (with multiple processes sharing part of anonymous memory) > a page could have a different offset inside the struct > address space than its virtual address.... > > Then on mremap you only need to adjust the start and > end offsets inside the VMAs, not the page->index ... I don't see how this can work: each vma needs its own vm_pgoff, or a single address space can't handle them all.
Also, the page->index is the virtual address (or the virtual offset with anon_vma); it cannot be replaced with something global, it has to be per-page. > Isn't being LESS fine-grained the whole reason for moving > from pte-based to object-based reverse mapping? ;)) The object is there to cover ranges, instead of forcing a per-page overhead. Being fine-grained at the vma level is fine; being finer-grained than a vma is desirable only if there's no downside. > > (it also avoids the need of a prio_tree even if in theory we could stack > > a prio_tree on top of every anon_vma, but it's really not needed) > > We need the prio_tree anyway for files. I don't see As I said in the last email, the prio_tree will not work for the anon_vmas, because every vma in the same range will map to different pages. So you'll find more vmas than the ones you're interested in. This doesn't happen with inodes: with inodes every vma queued into the i_mmap will be mapping to the right page _if_ it's pte_present == 1. With your anonymous address space shared by children, the prio_tree will find lots of vmas in different vma->vm_mm, each one pointing to different pages. So to unmap a direct page after a malloc, you may end up scanning all the address spaces by mistake. This cannot happen with anon_vma. Furthermore the prio_tree will waste 12 bytes per vma, while the anon_vma design will waste _at_most_ 8 bytes per vma (actually less if the anon_vmas are shared). And with anon_vma in practice you won't need a prio_tree stacked on top of anon_vma. You could put one there if you want, paying another 12 bytes per vma, but it isn't worth it. So anon_vma takes less memory and it's more efficient as far as I can tell. > Having the same code everywhere will definitely help > simplify things.
Reusing the same code would be good, I agree, but I don't think it would work as well as with the inodes, and with the inodes it's really needed only for a special 32-bit case, so normally the lookup would be immediate, while here we'd need it for really expensive lookups if one has many anonymous vmas in the children, even in 64-bit apps. So I prefer a design where, prio_tree or not, the cost for well-behaved apps on 64-bit archs is the same. The prio_tree is not free, it's still O(log(N)), and I prefer a design where the common case is N == 1, as with anon_vma (with your address-space design N would normally be >1 in a server app). ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 17:44 ` Andrea Arcangeli @ 2004-03-12 18:18 ` Rik van Riel 0 siblings, 0 replies; 74+ messages in thread From: Rik van Riel @ 2004-03-12 18:18 UTC (permalink / raw) To: Andrea Arcangeli Cc: Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel, William Lee Irwin III On Fri, 12 Mar 2004, Andrea Arcangeli wrote: > On Fri, Mar 12, 2004 at 12:23:22PM -0500, Rik van Riel wrote: > > ... and use the offset into the struct address_space as > > the page->index, NOT the virtual address inside the mm. > As I said in the last email the prio_tree will not work for the > anon_vmas, because every vma in the same range will map to different > pages. So you'll find more vmas than the ones you're interested in. > This doesn't happen with inodes. With inodes every vma queued into the > i_mmap will be mapping to the right page _if_ it's pte_present == 1. You don't have multiple VMAs mapping to the same pages, but to the same range in the address_space. Note that the per-process virtual memory != the per-"fork-group" backing address_space ... > with your anonymous address space shared by children the prio_tree will > find lots of vmas in different vma->vm_mm, each one pointing to > different pages. Nope. I wish I was better with graphical programs, or I'd draw you a picture. ;) > > Having the same code everywhere will definitely help > > simplify things. > > Reusing the same code would be good, I agree, but I don't think it would > work as well as with the inodes, > prio_tree is not free, it's still O(log(N)) and I prefer a design where > the common case is N == 1 like with anon_vma (with your address-space > design N would be >1 normally in a server app). It's all a space-time trade-off. Do you want more structures allocated and a more complex mremap, or do you eat the O(log(N)) lookup? -- "Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 17:23 ` Rik van Riel 2004-03-12 17:44 ` Andrea Arcangeli @ 2004-03-12 18:25 ` Linus Torvalds 2004-03-12 18:48 ` Rik van Riel 2004-03-12 21:08 ` Jamie Lokier 1 sibling, 2 replies; 74+ messages in thread From: Linus Torvalds @ 2004-03-12 18:25 UTC (permalink / raw) To: Rik van Riel Cc: Andrea Arcangeli, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel, William Lee Irwin III On Fri, 12 Mar 2004, Rik van Riel wrote: > > I am suggesting a pointer from the mm_struct to a > struct address_space ... [ deleted ] > Then on mremap you only need to adjust the start and > end offsets inside the VMAs, not the page->index ... One fundamental problem I see, maybe you can explain it to me... - You need a _unique_ page->index start for each VMA, since each anonymous page needs to have a unique index. Right? - You can use the virtual address as that unique page index start - when you mremap() an area, you leave the start indexes the same, so that you can find the original pages (and create new ones in the old mapping) by just searching the vma's, not by actually looking at the page tables. - HOWEVER, after a mremap(), when you now create a new vma (or expand an old one) into the previously used page index area, you're now screwed. How are you going to generate unique page indexes in this new area without re-using the indexes that you allocated in the old (moved) area? I think your approach could work (reverse map by having separate address spaces for unrelated processes), but I don't see any good "page->index" allocation scheme that is implementable. The "unique" page->index thing wouldn't need to have to have anything to do with the virtual address (indeed, after a mremap it clearly cannot have anything to do with that), but the thing is, you'd need to be able to cover the virtual address space with whatever numbers you choose. 
You'd want to allocate contiguous indexes within one "vma", since the whole point would be to be able to try to quickly find the vma (and thus the page) that contains one particular page, but there are no range allocators that I can think of that allow growing the VMA after allocation (needed for vma merging on mmap and brk()) and still keep the range of indexes down to reasonable numbers. Or did I totally mis-understand what you were proposing? Linus ^ permalink raw reply [flat|nested] 74+ messages in thread
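The collision Linus describes can be made concrete with a toy model. This is purely illustrative (invented struct and function names) and assumes, as in the proposal above, that page indexes are seeded from virtual addresses at first fault and preserved across mremap():

```c
#include <stddef.h>

#define PAGE_SHIFT 12

/* Toy vma: vm_pgoff is the index of the first page in the shared
 * address_space, kept unchanged across mremap() so moved pages stay
 * findable. */
struct toy_vma {
    unsigned long vm_start, vm_end;
    unsigned long vm_pgoff;
};

static unsigned long index_of(const struct toy_vma *v, unsigned long addr)
{
    return ((addr - v->vm_start) >> PAGE_SHIFT) + v->vm_pgoff;
}
```

The failure mode: a vma first mapped at 0x10000 gets indexes starting at 0x10 (its virtual page number). After mremap() moves it to 0x80000 it keeps vm_pgoff 0x10. A later, unrelated mapping placed into the vacated range at 0x10000 would naively be seeded with vm_pgoff 0x10 as well, so two distinct anonymous pages collide on the same index in the shared address_space; this is why a real index allocator, not the virtual address, is needed.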
* Re: anon_vma RFC2 2004-03-12 18:25 ` Linus Torvalds @ 2004-03-12 18:48 ` Rik van Riel 2004-03-12 19:02 ` Chris Friesen 2004-03-12 21:08 ` Jamie Lokier 1 sibling, 1 reply; 74+ messages in thread From: Rik van Riel @ 2004-03-12 18:48 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel, William Lee Irwin III On Fri, 12 Mar 2004, Linus Torvalds wrote: > I think your approach could work (reverse map by having separate address > spaces for unrelated processes), but I don't see any good "page->index" > allocation scheme that is implementable. > Or did I totally mis-understand what you were proposing? You're absolutely right. I am still trying to come up with a way to do this. Note that since we count page->index in PAGE_SIZE units we have PAGE_SIZE times as much space as a process can take, so we definitely have enough address space to come up with a creative allocation scheme. I just can't think of one now ... -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 18:48 ` Rik van Riel @ 2004-03-12 19:02 ` Chris Friesen 2004-03-12 19:06 ` Rik van Riel 0 siblings, 1 reply; 74+ messages in thread From: Chris Friesen @ 2004-03-12 19:02 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Andrea Arcangeli, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel, William Lee Irwin III Rik van Riel wrote: > On Fri, 12 Mar 2004, Linus Torvalds wrote: > > >>I think your approach could work (reverse map by having separate address >>spaces for unrelated processes), but I don't see any good "page->index" >>allocation scheme that is implementable. > Note that since we count page->index in PAGE_SIZE unit we > have PAGE_SIZE times as much space as a process can take, > so we definately have enough address space to come up with > a creative allocation scheme. What happens when you have more than PAGE_SIZE processes running? Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfriesen@nortelnetworks.com ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 19:02 ` Chris Friesen @ 2004-03-12 19:06 ` Rik van Riel 2004-03-12 19:10 ` Chris Friesen 2004-03-12 20:27 ` Andrea Arcangeli 0 siblings, 2 replies; 74+ messages in thread From: Rik van Riel @ 2004-03-12 19:06 UTC (permalink / raw) To: Chris Friesen Cc: Linus Torvalds, Andrea Arcangeli, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel, William Lee Irwin III On Fri, 12 Mar 2004, Chris Friesen wrote: > What happens when you have more than PAGE_SIZE processes running? Forked off the same process ? Without doing an exec ? On a 32 bit system ? You'd probably run out of space to put the VMAs, mm_structs and pgds long before reaching this point ... -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 19:06 ` Rik van Riel @ 2004-03-12 19:10 ` Chris Friesen 2004-03-12 19:14 ` Rik van Riel 2004-03-12 20:27 ` Andrea Arcangeli 1 sibling, 1 reply; 74+ messages in thread From: Chris Friesen @ 2004-03-12 19:10 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Andrea Arcangeli, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel, William Lee Irwin III Rik van Riel wrote: > On Fri, 12 Mar 2004, Chris Friesen wrote: > > >>What happens when you have more than PAGE_SIZE processes running? > > > Forked off the same process ? > Without doing an exec ? > On a 32 bit system ? > > You'd probably run out of space to put the VMAs, > mm_structs and pgds long before reaching this point ... I'm just thinking of the "fork 100000 kids to test 32-bit pids" sort of test cases. Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfriesen@nortelnetworks.com ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 19:10 ` Chris Friesen @ 2004-03-12 19:14 ` Rik van Riel 0 siblings, 0 replies; 74+ messages in thread From: Rik van Riel @ 2004-03-12 19:14 UTC (permalink / raw) To: Chris Friesen Cc: Linus Torvalds, Andrea Arcangeli, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel, William Lee Irwin III On Fri, 12 Mar 2004, Chris Friesen wrote: > I'm just thinking of the "fork 100000 kids to test 32-bit pids" sort of > test cases. Try that with a process that takes up 2GB of address space ;) It won't work now and it'll fail for the same reasons with the scheme I proposed. Probably before the 2^44 bytes of space run out, too. -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 19:06 ` Rik van Riel 2004-03-12 19:10 ` Chris Friesen @ 2004-03-12 20:27 ` Andrea Arcangeli 2004-03-12 20:32 ` Rik van Riel 1 sibling, 1 reply; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-12 20:27 UTC (permalink / raw) To: Rik van Riel Cc: Chris Friesen, Linus Torvalds, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel, William Lee Irwin III On Fri, Mar 12, 2004 at 02:06:17PM -0500, Rik van Riel wrote: > On Fri, 12 Mar 2004, Chris Friesen wrote: > > > What happens when you have more than PAGE_SIZE processes running? > > Forked off the same process ? > Without doing an exec ? > On a 32 bit system ? > > You'd probably run out of space to put the VMAs, > mm_structs and pgds long before reaching this point ... 7.5k users are being reached in a real workload with around 2gigs mapped per process and with tons of vma per process. with 2.6 and faster cpus I hope to go even further. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 20:27 ` Andrea Arcangeli @ 2004-03-12 20:32 ` Rik van Riel 2004-03-12 20:49 ` Andrea Arcangeli 0 siblings, 1 reply; 74+ messages in thread From: Rik van Riel @ 2004-03-12 20:32 UTC (permalink / raw) To: Andrea Arcangeli Cc: Chris Friesen, Linus Torvalds, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel, William Lee Irwin III On Fri, 12 Mar 2004, Andrea Arcangeli wrote: > 7.5k users are being reached in a real workload with around 2gigs mapped > per process and with tons of vma per process. with 2.6 and faster cpus > I hope to go even further. That's not all anonymous memory, though ;) -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 20:32 ` Rik van Riel @ 2004-03-12 20:49 ` Andrea Arcangeli 0 siblings, 0 replies; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-12 20:49 UTC (permalink / raw) To: Rik van Riel Cc: Chris Friesen, Linus Torvalds, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel, William Lee Irwin III On Fri, Mar 12, 2004 at 03:32:20PM -0500, Rik van Riel wrote: > That's not all anonymous memory, though ;) true, my point is it's feasible (cow or shared is the same from a memory footprint standpoint, actually less since anon_vmas are a lot cheaper than dummy shmfs inodes) ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 18:25 ` Linus Torvalds 2004-03-12 18:48 ` Rik van Riel @ 2004-03-12 21:08 ` Jamie Lokier 1 sibling, 0 replies; 74+ messages in thread From: Jamie Lokier @ 2004-03-12 21:08 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Andrea Arcangeli, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel, William Lee Irwin III Linus Torvalds wrote: > You'd want to allocate contiguous indexes within one "vma", since the > whole point would be to be able to try to quickly find the vma (and thus > the page) that contains one particular page, but there are no range > allocators that I can think of that allow growing the VMA after allocation > (needed for vma merging on mmap and brk()) and still keep the range of > indexes down to reasonable numbers. For growing, they don't have to be contiguous - it's just desirable. When a vma is grown and the page->offset space it would like to occupy is already taken, it can be split into two vmas. Of course that alters mremap() semantics, which depend on vma boundaries. (mmap, munmap and mprotect don't care). So add a vma flag which indicates that it and the following vma(s) are a single unit for the purpose of remapping. Call it the mremap-group flag. Groups always have the same flags etc.; only the vm_offset varies. In effect, I'm suggesting that instead of having vmas be the user-visible unit, and some other finer-grained structures track page mappings, let _vmas_ be the finer-grained structure, and make the user-visible unit be whatever multiple consecutive vmas occur with that flag set. (This is a good balance if the number of splits is small; not if there are many). It shouldn't lead to a proliferation of vmas, provided the page->offset allocation algorithm is sufficiently sparse. To keep the number of potential splits small, always allocate some extra page->offset space so that a vma can grow into it. Only when it cannot grow in page->offset space, do you create a new vma. 
The new vma has extra page->offset space allocated too. That extra space should be proportional to the size of the entire new mremap() region (multiple vmas), not the new vma size. In that way, I think it bounds the number of splits to O(log (n/m)) where n is the total mremap() region size, and m is the original size. The constant in that expression is determined by the proportion that is used for reserving extra space. This has some consequences. If each vma's page->offset allocation reserves space around it to grow, then adjacent anonymous vmas won't be mergeable. If they aren't mergeable, it begs the question of why not have an address_space per vma, instead of per-mm, other than to save memory on address_space structures? Well, we like them to be mergeable. Lots of reasons. So make initial mmap() allocations not reserve page->offset space exclusively, but make allocations done by mremap() reserve the extra space, to get that O(log (n/m)) property. Using the mremap-group flag, we are also able to give the appearance of merged vmas when it would be difficult. If we want certain anonymous vmas to appear merged despite them having incompatible vm_offset values, we can do that. So going back to the question of address_space per-mm: you don't need one, due to the mremap-group flag. It's good to use as few as possible, but it's ok to use more than one per process or per fork-group, when absolutely necessary. That fixes the address_space limitation of 2^32 pages and makes page->offset allocation _very_ simple: 1. Allocate by simply incrementing an address counter. 2. When it's about to wrap, allocate a new address_space. 3. When allocating, reserve extra space for growing. The extra space should be proportional to the allocation, or the total size of the region after mremap(), and clamped to a sane maximum such as 4G minus size, and a sane minimum such as 2^22 (room for a million reservations per address_space). 4. When allocating, look at the nearby preceding or following vma in the virtual address space. If the amount of page->offset space reserved by those vmas is large enough, we can claim some of that reservation for the new allocation. If our good neighbour is adjacent to the new vma, that means the neighbour vma is simply grown. Otherwise, it means we create a new vma which is vm_offset-compatible with its neighbour, allowing them to merge if the hole between is filled. 5. By using large reservations, large regions of the virtual address space become covered with vm_offset-compatible vmas that are mergeable when the holes are filled. 6. When trying to merge adjacent anon vmas during ordinary mmap/munmap/mprotect/mremap operations, if they are not vm_offset-compatible (or their address_spaces aren't equal) just use the mremap-group flag to make them appear merged. The user-visible result is a single vma. The effect on the kernel is a rare non-mergeable boundary, which will slow vma searching marginally. The benefit is this simple allocation scheme. This is like what we have today, with some occasional non-mergeable vma boundaries (but only very few compared with the total number of vmas in an mm). These boundaries are not user-visible, and only affect the kernel algorithms - and in a simple way. Data structure changes required: one flag, VM_GROUP or something; each vma needs a pointer to _its_ address_space (can share space with vm_file or such); each vma needs to record how much page->offset space it has reserved beyond its own size. VM_GROWSDOWN vmas might want to record a reservation down rather than up. -- Jamie ^ permalink raw reply [flat|nested] 74+ messages in thread
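Jamie's bump-allocator-with-reservations idea can be sketched as follows. This is one reading of the scheme, with invented names and arbitrary constants, not actual kernel code: indexes come from a monotonically increasing counter, each allocation reserves proportional headroom for growth, and when the 2^32-page space would wrap, a fresh address_space is opened:

```c
#include <stdint.h>

#define RESERVE_MIN (1UL << 22)   /* 2^22 pages, the sane minimum above */

/* A toy address_space: just a bump pointer and an identity. */
struct toy_space {
    uint64_t next;                /* next free page index */
    int id;                       /* which address_space this is */
};

/* An allocated range of page->offset space, all counted in pages. */
struct toy_range {
    int space_id;
    uint64_t start, len, reserved;
};

static int nspaces = 1;

/* Step 1-3: bump-allocate `len` pages, reserving extra headroom; on
 * the verge of wrapping the 2^32-page space, open a new address_space. */
static struct toy_range alloc_index(struct toy_space *sp, uint64_t len)
{
    uint64_t reserve = len > RESERVE_MIN ? len : RESERVE_MIN;

    if (sp->next + len + reserve >= ((uint64_t)1 << 32)) {
        sp->next = 0;             /* about to wrap: fresh address_space */
        sp->id = nspaces++;
    }

    struct toy_range r = { sp->id, sp->next, len, reserve };
    sp->next += len + reserve;
    return r;
}

/* Grow in place while the reservation lasts; once it is exhausted the
 * caller must split off a new vma (the mremap-group case). */
static int grow_range(struct toy_range *r, uint64_t extra)
{
    if (extra > r->reserved)
        return 0;                 /* needs a split */
    r->len += extra;
    r->reserved -= extra;
    return 1;
}
```

Growth within the reservation is free; only growth past it forces a new, possibly non-contiguous range, which is what bounds the number of splits per region.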
* Re: anon_vma RFC2 2004-03-12 12:21 ` Andrea Arcangeli 2004-03-12 12:40 ` Rik van Riel @ 2004-03-12 12:42 ` Andrea Arcangeli 2004-03-12 12:46 ` William Lee Irwin III 2004-03-12 13:43 ` Hugh Dickins 3 siblings, 0 replies; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-12 12:42 UTC (permalink / raw) To: Rik van Riel Cc: Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel, William Lee Irwin III On Fri, Mar 12, 2004 at 01:21:27PM +0100, Andrea Arcangeli wrote: > Rajesh's code has nothing to do with the mremap breakage, Rajesh's code > can only boost the search of the interesting vmas in an anonmm, it > doesn't solve mremap. btw, one more detail: Rajesh's code will fall apart while dealing with the dynamic metadata attached to vmas relocated by mremap. His code is usable out of the box only on top of anon_vma (where vm_pgoff/vm_start/vm_end retain the same semantics as the file mappings in the i_mmap list), not on top of anonmm, where you'll have to stack some other dynamic structure (like the pte_chains today in anobjrmap-5). I'm not sure how well his code could be modified to take into account the dynamic data structure generated by mremap. Also don't forget Rajesh's code doesn't come for free, it also adds overhead to the vma, so if you need the tree in the anonmm too (not only in the inode), you'll grow the vma size too (I grow it by 12 bytes with anon_vma, but then I don't need complex metadata dynamically allocated later in mremap, and I don't need the rbtree search either since it's already fine-grained enough). I also expect you'll still have significant problems merging two vmas, one touched by mremap and the other not, since then the dynamic objects would need to be "partial" for only a part of the vma, complicating even further the "tree search" with ranges in the sub-metadata attached to the vma. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 12:21 ` Andrea Arcangeli 2004-03-12 12:40 ` Rik van Riel 2004-03-12 12:42 ` Andrea Arcangeli @ 2004-03-12 12:46 ` William Lee Irwin III 2004-03-12 13:24 ` Andrea Arcangeli 2004-03-12 16:17 ` Linus Torvalds 2004-03-12 13:43 ` Hugh Dickins 3 siblings, 2 replies; 74+ messages in thread From: William Lee Irwin III @ 2004-03-12 12:46 UTC (permalink / raw) To: Andrea Arcangeli Cc: Rik van Riel, Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel On Fri, Mar 12, 2004 at 01:21:27PM +0100, Andrea Arcangeli wrote: > you missed the fact mremap doesn't work, that's the fundamental reason > for the vma tracking, so you can use vm_pgoff. > if you take Hugh's anonmm, mremap will be attaching a persistent dynamic > overhead to the vma it touches. Currently it does in form of pte_chains, > that can be converted to other means of overhead, but I simply don't > like it. > I like all vmas to be symmetric to each other, without special hacks to > handle mremap right. > We have the vm_pgoff to handle mremap and I simply use that. Absolute guarantees are nice but this characterization is too extreme. The case where mremap() creates rmap_chains is so rare I never ever saw it happen in 6 months of regular practical use and testing. Their creation could be triggered only by remap_file_pages(). -- wli ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 12:46 ` William Lee Irwin III @ 2004-03-12 13:24 ` Andrea Arcangeli 2004-03-12 13:40 ` William Lee Irwin III 2004-03-12 13:55 ` Hugh Dickins 1 sibling, 2 replies; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-12 13:24 UTC (permalink / raw) To: William Lee Irwin III, Rik van Riel, Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel On Fri, Mar 12, 2004 at 04:46:38AM -0800, William Lee Irwin III wrote: > On Fri, Mar 12, 2004 at 01:21:27PM +0100, Andrea Arcangeli wrote: > > you missed the fact mremap doesn't work, that's the fundamental reason > > for the vma tracking, so you can use vm_pgoff. > > if you take Hugh's anonmm, mremap will be attaching a persistent dynamic > > overhead to the vma it touches. Currently it does in form of pte_chains, > > that can be converted to other means of overhead, but I simply don't > > like it. > > I like all vmas to be symmetric to each other, without special hacks to > > handle mremap right. > > We have the vm_pgoff to handle mremap and I simply use that. > > Absolute guarantees are nice but this characterization is too extreme. > The case where mremap() creates rmap_chains is so rare I never ever saw > it happen in 6 months of regular practical use and testing. Their > creation could be triggered only by remap_file_pages(). did you try specweb with apache? that's super heavy mremap as far as I know (and it may be using anon memory, and if not I certainly can't exclude that other apps are using mremap on significant amounts of anonymous ram). To the point that the kmap_lock for the persistent kmaps I used originally in mremap (at least it has never been racy) was a showstopper bottleneck, spending most of the system time there (the profiles looked horrible around the kmap_lock), and I had to fix it up the 2.6 way with the per-cpu atomic kmaps to avoid being an order of magnitude slower than in the small boxes w/o highmem.
the single reason I'm doing this work is to avoid allocating the pte_chains and to always use the vma instead. If I have to use the pte_chains again for mremap (hoping that no application is using mremap) then I'm not at all happy since people could still fall into the pte_chain trap with some app. Admittedly the pte_chains make perfect sense only for nonlinear vmas, since the vma is meaningless for the nonlinear vmas and really a per-page cost makes sense there, but I'm not going to add 8 bytes per-page to swapout the nonlinear vmas efficiently, and I'll let the cpu pay for that if you really need to swap the nonlinear mappings (i.e. the pagetable walk). An alternate way would have been to dynamically allocate the per-pte pointer, but that would throw a whole lot of memory at the problem too, and one of the main points for using nonlinear maps is to avoid the allocation of the vmas, so I doubt people really want to allocate lots of ram to handle nonlinear efficiently, so I believe saving all ram at the expense of cpu cost during swapping will be ok. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 13:24 ` Andrea Arcangeli @ 2004-03-12 13:40 ` William Lee Irwin III 2004-03-12 13:55 ` Hugh Dickins 1 sibling, 0 replies; 74+ messages in thread From: William Lee Irwin III @ 2004-03-12 13:40 UTC (permalink / raw) To: Andrea Arcangeli Cc: Rik van Riel, Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel On Fri, Mar 12, 2004 at 02:24:36PM +0100, Andrea Arcangeli wrote: > did you try specweb with apache? that's super heavy mremap as far as I > know (and it maybe using anon memory, and if not I certainly cannot > exclude other apps are using mremap on significant amounts of anymous > ram). To a point that the kmap_lock for the persistent kmaps I used > originally in mremap (at least it has never been racy) was a showstopper > bottleneck spending most of system time there (profiling was horrible in > the kmap_lock) and I had to fixup the 2.6 way with the per-cpu atomic > kmaps to avoid being an order of magnitude slower than in the small > boxes w/o highmem. No. I have never had access to systems set up for specweb. -- wli ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 13:24 ` Andrea Arcangeli 2004-03-12 13:40 ` William Lee Irwin III @ 2004-03-12 13:55 ` Hugh Dickins 2004-03-12 16:01 ` Andrea Arcangeli 1 sibling, 1 reply; 74+ messages in thread From: Hugh Dickins @ 2004-03-12 13:55 UTC (permalink / raw) To: Andrea Arcangeli Cc: William Lee Irwin III, Rik van Riel, Ingo Molnar, Andrew Morton, torvalds, linux-kernel On Fri, 12 Mar 2004, Andrea Arcangeli wrote: > On Fri, Mar 12, 2004 at 04:46:38AM -0800, William Lee Irwin III wrote: > > > > The case where mremap() creates rmap_chains is so rare I never ever saw > > it happen in 6 months of regular practical use and testing. Their > > creation could be triggered only by remap_file_pages(). > > did you try specweb with apache? that's super heavy mremap as far as I > know (and it maybe using anon memory, and if not I certainly cannot > exclude other apps are using mremap on significant amounts of anymous > ram). anonmm has no problem with most mremaps: the special case is for mremap MAYMOVE of anon vmas _inherited from parent_ (same page at different addresses in the different mms). As I said before, it's quite conceivable that this case never arises outside our testing (but I'd be glad to be shown wrong, would make effort worthwhile). Hugh ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 13:55 ` Hugh Dickins @ 2004-03-12 16:01 ` Andrea Arcangeli 0 siblings, 0 replies; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-12 16:01 UTC (permalink / raw) To: Hugh Dickins Cc: William Lee Irwin III, Rik van Riel, Ingo Molnar, Andrew Morton, torvalds, linux-kernel On Fri, Mar 12, 2004 at 01:55:30PM +0000, Hugh Dickins wrote: > On Fri, 12 Mar 2004, Andrea Arcangeli wrote: > > On Fri, Mar 12, 2004 at 04:46:38AM -0800, William Lee Irwin III wrote: > > > > > > The case where mremap() creates rmap_chains is so rare I never ever saw > > > it happen in 6 months of regular practical use and testing. Their > > > creation could be triggered only by remap_file_pages(). > > > > did you try specweb with apache? that's super heavy mremap as far as I > > know (and it maybe using anon memory, and if not I certainly cannot > > exclude other apps are using mremap on significant amounts of anymous > > ram). > > anonmm has no problem with most mremaps: the special case is for > mremap MAYMOVE of anon vmas _inherited from parent_ (same page at > different addresses in the different mms). As I said before, it's > quite conceivable that this case never arises outside our testing > (but I'd be glad to be shown wrong, would make effort worthwhile). the problem is that it _can_ arise, and fixing that is a huge mess without using the pte_chains IMHO (no hope to use the vma->shared). I also don't see how you can know if a vma is pointing all to "direct" pages and in turn you can move it somewhere else without the pte_chains. sure you can move all anon vmas freely after an execve, but after the first fork (and in turn with cow pages going on) all mremaps will be non-trackable with anonmm, right? 
lots of server processes use the fork() model for their children, and they can run mremap inside the child on memory malloced inside the child, and I don't think you can easily track whether the malloc happened inside the child or inside the parent, though I may be wrong on this. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 12:46 ` William Lee Irwin III 2004-03-12 13:24 ` Andrea Arcangeli @ 2004-03-12 16:17 ` Linus Torvalds 2004-03-13 0:28 ` William Lee Irwin III 2004-03-13 14:43 ` Rik van Riel 1 sibling, 2 replies; 74+ messages in thread From: Linus Torvalds @ 2004-03-12 16:17 UTC (permalink / raw) To: William Lee Irwin III Cc: Andrea Arcangeli, Rik van Riel, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel On Fri, 12 Mar 2004, William Lee Irwin III wrote: > > Absolute guarantees are nice but this characterization is too extreme. > The case where mremap() creates rmap_chains is so rare I never ever saw > it happen in 6 months of regular practical use and testing. Their > creation could be triggered only by remap_file_pages(). I have to _violently_ agree with Andrea on this one. The absolute _LAST_ thing we want to have is a "remnant" rmap infrastructure that only gets very occasional use. That's a GUARANTEED way to get bugs, and really subtle behaviour. I think Andrea is 100% right. Either do rmap for everything (like we do now, modulo IO/mlock), or do it for _nothing_. No half measures with "most of the time". Quite frankly, the stuff I've seen suggested sounds absolutely _horrible_. Special cases are not just a pain to work with, they definitely will cause bugs. It's not a matter of "if", it's a matter of "when". So let's make it clear: if we have an object-based reverse mapping, it should cover all reasonable cases, and in particular, it should NOT have rare fallbacks to code that thus never gets any real testing. And if we have per-page rmap like now, it should _always_ be there. You do have to realize that maintainability is a HELL of a lot more important than scalability of performance can be. Please keep that in mind. Linus ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 16:17 ` Linus Torvalds @ 2004-03-13 0:28 ` William Lee Irwin III 0 siblings, 0 replies; 74+ messages in thread From: William Lee Irwin III @ 2004-03-13 0:28 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Rik van Riel, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel On Fri, Mar 12, 2004 at 08:17:49AM -0800, Linus Torvalds wrote: > I have to _violently_ agree with Andrea on this one. > The absolute _LAST_ thing we want to have is a "remnant" rmap > infrastructure that only gets very occasional use. That's a GUARANTEED way > to get bugs, and really subtle behaviour. > I think Andrea is 100% right. Either do rmap for everything (like we do > now, modulo IO/mlock), or do it for _nothing_. No half measures with > "most of the time". > Quite frankly, the stuff I've seen suggested sounds absolutely _horrible_. > Special cases are not just a pain to work with, they definitely will cause > bugs. It's not a matter of "if", it's a matter of "when". > So let's make it clear: if we have an object-based reverse mapping, it > should cover all reasonable cases, and in particular, it should NOT have > rare fallbacks to code that thus never gets any real testing. > And if we have per-page rmap like now, it should _always_ be there. > You do have to realize that maintainability is a HELL of a lot more > important than scalability of performance can be. Please keep that in > mind. The sole point I had to make was against a performance/resource scalability argument; the soft issues weren't part of that, though they may ultimately be the deciding factor. -- wli ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 16:17 ` Linus Torvalds 2004-03-13 0:28 ` William Lee Irwin III @ 2004-03-13 14:43 ` Rik van Riel 2004-03-13 16:18 ` Linus Torvalds 1 sibling, 1 reply; 74+ messages in thread From: Rik van Riel @ 2004-03-13 14:43 UTC (permalink / raw) To: Linus Torvalds Cc: William Lee Irwin III, Andrea Arcangeli, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel On Fri, 12 Mar 2004, Linus Torvalds wrote: > So let's make it clear: if we have an object-based reverse mapping, it > should cover all reasonable cases, and in particular, it should NOT have > rare fallbacks to code that thus never gets any real testing. Absolutely agreed. And with Rajesh's code it should be possible to get object-based rmap right, not vulnerable to the scalability issues demonstrated by Ingo's test programs. Whether we go with mm-based or vma-based, I don't particularly care either. As long as the code is nice... -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-13 14:43 ` Rik van Riel @ 2004-03-13 16:18 ` Linus Torvalds 2004-03-13 17:24 ` Hugh Dickins 2004-03-13 17:33 ` Andrea Arcangeli 0 siblings, 2 replies; 74+ messages in thread From: Linus Torvalds @ 2004-03-13 16:18 UTC (permalink / raw) To: Rik van Riel, Andrea Arcangeli Cc: William Lee Irwin III, Hugh Dickins, Ingo Molnar, Andrew Morton, Kernel Mailing List Ok, guys, how about this anon-page suggestion? I'm a bit nervous about the complexity issues in Andrea's current setup, so I've been thinking about Rik's per-mm thing. And I think that there is one very simple approach, which should work fine, and should have minimal impact on the existing setup exactly because it is so simple. Basic setup: - each anonymous page is associated with exactly _one_ virtual address, in an "anon memory group". We put the virtual address (shifted down by PAGE_SHIFT) into "page->index". We put the "anon memory group" pointer into "page->mapping". We have a PAGE_ANONYMOUS flag to tell the rest of the world about this. - the anon memory group has a list of all mm's that it is associated with. - an "execve()" creates a new "anon memory group" and drops the old one. - an mm copy operation just increments the reference count and adds the new mm to the mm list for that anon memory group. So now to do reverse mapping, we can take a page, and do if (PageAnonymous(page)) { struct anongroup *mmlist = (struct anongroup *)page->mapping; unsigned long address = page->index << PAGE_SHIFT; struct mm_struct *mm; for_each_entry(mm, mmlist->anon_mms, anon_mm) { .. look up page in page tables in "mm, address" .. .. most of the time we may not even need to look .. .. up the "vma" at all, just walk the page tables .. } } else { /* Shared page */ .. look up page using the inode vma list .. } The above all works 99% of the time. 
The only problem is mremap() after a fork(), and hell, we know that's a special case anyway, and let's just add a few lines to copy_one_pte(), which basically does: if (PageAnonymous(page) && page->count > 1) { newpage = alloc_page(); copy_page(page, newpage); page = newpage; } /* Move the page to the new address */ page->index = address >> PAGE_SHIFT; and now we have zero special cases. The above should work very well. In most cases the "anongroup" will be very small, and even when it's large (if somebody does a ton of forks without any execve's), we only have _one_ address to check, and that is pretty fast. A high-performance server would use threads, anyway. (And quite frankly, _any_ algorithm will have this issue. Even rmap will have exactly the same loop, although rmap skips any vm's where the page might have been COW'ed or removed). The extra COW in mremap() seems benign. Again, it should usually not even trigger. What do you think? To me, this seems to be a really simple approach.. Linus ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-13 16:18 ` Linus Torvalds @ 2004-03-13 17:24 ` Hugh Dickins 2004-03-13 17:28 ` Rik van Riel 2004-03-13 17:48 ` Andrea Arcangeli 2004-03-13 17:33 ` Andrea Arcangeli 1 sibling, 2 replies; 74+ messages in thread From: Hugh Dickins @ 2004-03-13 17:24 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Andrea Arcangeli, William Lee Irwin III, Ingo Molnar, Andrew Morton, Kernel Mailing List On Sat, 13 Mar 2004, Linus Torvalds wrote: > > Ok, guys, > how about this anon-page suggestion? What you describe is pretty much exactly what my anobjrmap patch from a year ago did. I'm currently looking through that again to bring it up to date. > I'm a bit nervous about the complexity issues in Andrea's current setup, > so I've been thinking about Rik's per-mm thing. And I think that there is > one very simple approach, which should work fine, and should have minimal > impact on the existing setup exactly because it is so simple. > > Basic setup: > - each anonymous page is associated with exactly _one_ virtual address, > in a "anon memory group". > > We put the virtual address (shifted down by PAGE_SHIFT) into > "page->index". We put the "anon memory group" pointer into > "page->mapping". We have a PAGE_ANONYMOUS flag to tell the > rest of the world about this. It's a bit more complicated because page->mapping currently contains &swapper_space if PageSwapCache(page) - indeed, at present that's exactly what PageSwapCache(page) tests. So I reintroduced a PageSwapCache(page) flagbit, avoid the very few places where mapping pointing to swapper_space was actually useful, and use page->private instead of page->index for the swp_entry_t. (Andrew did point out that we could reduce the scale of the mods by reusing page->list fields instead of mapping/index; but mapping/index are the natural fields to use, and Andrew now has other changes in -mm which remove page->list: so the original choice looks right again.) 
> for_each_entry(mm, mmlist->anon_mms, anon_mm) { > .. look up page in page tables in "mm, address" .. > .. most of the time we may not even need to look .. > .. up the "vma" at all, just walk the page tables .. > } I believe page_referenced() can just walk the page tables, but try_to_unmap() needs vma to check VM_LOCKED (we're thinking of other ways to avoid that, but they needn't get mixed into this) and for flushing cache and tlb (perhaps avoidable on some arches? I've not checked, and again that would be an optimization to consider later, not mix in at this stage). > The only problem is mremap() after a fork(), and hell, we know that's a > special case anyway, and let's just add a few lines to copy_one_pte(), > which basically does: > > if (PageAnonymous(page) && page->count > 1) { > newpage = alloc_page(); > copy_page(page, newpage); > page = newpage; > } > /* Move the page to the new address */ > page->index = address >> PAGE_SHIFT; > > and now we have zero special cases. That's always been a fallback solution, I was just a little too ashamed to propose it originally - seems a little wrong to waste whole pages rather than wasting a few bytes of data structure trying to track them: though the pages are pageable unlike any data structure we come up with. I think we have page_table_lock in copy_one_pte, so won't want to do it quite like that. It won't matter at all if pages are transiently untrackable. Might want to do something like make_pages_present afterwards (but it should only be COWing instantiated pages; and does need to COW pages currently on swap too). There's probably an issue with Alan's strict commit memory accounting, if the mapping is readonly; but so long as we get that counting right, I don't think it's really going to matter at all if we sometimes fail an mremap for that reason - but probably need to avoid mistaking the common case (mremap of own area) for the rare case which needs this copying (mremap of inherited area). 
Hugh ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-13 17:24 ` Hugh Dickins @ 2004-03-13 17:28 ` Rik van Riel 2004-03-13 17:41 ` Hugh Dickins ` (2 more replies) 2004-03-13 17:48 ` Andrea Arcangeli 1 sibling, 3 replies; 74+ messages in thread From: Rik van Riel @ 2004-03-13 17:28 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Andrea Arcangeli, William Lee Irwin III, Ingo Molnar, Andrew Morton, Kernel Mailing List On Sat, 13 Mar 2004, Hugh Dickins wrote: > On Sat, 13 Mar 2004, Linus Torvalds wrote: > > if (PageAnonymous(page) && page->count > 1) { > > newpage = alloc_page(); > > copy_page(page, newpage); > > page = newpage; > > } > > /* Move the page to the new address */ > > page->index = address >> PAGE_SHIFT; > > > > and now we have zero special cases. > > That's always been a fallback solution, I was just a little too ashamed > to propose it originally - seems a little wrong to waste whole pages > rather than wasting a few bytes of data structure trying to track them: > though the pages are pageable unlike any data structure we come up with. No, Linus is right. If a child process uses mremap(), it stands to reason that it's about to use those pages for something. Think of it as taking the COW faults early, because chances are you'd be taking them anyway, just a little bit later... -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-13 17:28 ` Rik van Riel @ 2004-03-13 17:41 ` Hugh Dickins 2004-03-13 18:08 ` Andrea Arcangeli 2004-03-13 17:54 ` Andrea Arcangeli 2004-03-13 18:57 ` Linus Torvalds 2 siblings, 1 reply; 74+ messages in thread From: Hugh Dickins @ 2004-03-13 17:41 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Andrea Arcangeli, William Lee Irwin III, Ingo Molnar, Andrew Morton, Kernel Mailing List On Sat, 13 Mar 2004, Rik van Riel wrote: > > No, Linus is right. > > If a child process uses mremap(), it stands to reason that > it's about to use those pages for something. > > Think of it as taking the COW faults early, because chances > are you'd be taking them anyway, just a little bit later... Makes perfect sense in the read-write case. The read-only case is less satisfactory, but those will be even rarer. Hugh ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-13 17:41 ` Hugh Dickins @ 2004-03-13 18:08 ` Andrea Arcangeli 0 siblings, 0 replies; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-13 18:08 UTC (permalink / raw) To: Hugh Dickins Cc: Rik van Riel, Linus Torvalds, William Lee Irwin III, Ingo Molnar, Andrew Morton, Kernel Mailing List On Sat, Mar 13, 2004 at 05:41:37PM +0000, Hugh Dickins wrote: > On Sat, 13 Mar 2004, Rik van Riel wrote: > > > > No, Linus is right. > > > > If a child process uses mremap(), it stands to reason that > > it's about to use those pages for something. > > > > Think of it as taking the COW faults early, because chances > > are you'd be taking them anyway, just a little bit later... > > Makes perfect sense in the read-write case. The read-only > case is less satisfactory, but those will be even rarer. overall it's not obvious to me that those will be even rarer. see the last email about kde-like usages to share data like threads but with memory protection, those won't write to the data. I mean, it may be the way to go, but I think we should get some ok from the major linux projects that we're not going to invalidate their smart optimizations first, and we should get this "misfeature" documented somehow. I have to admit the simplicity is appealing, but besides its coding-simplicity in practice I believe the only other appealing thing will be the fact it's not exploitable by people doing a flood of vma_splits; to solve that with anon_vma I'd need a prio tree on top of every anon_vma, and that means even more memory wasted both in the anon_vma and vma, though practically a prio_tree there wouldn't be necessary. The anonmm solves the complexity issue using find_vma, so sharing the rbtree which already works. that's probably the part of anonmm I find most appealing. One can still exploit the complexity with anonmm too, but not from the same address space, so it's easier to limit with ulimit -u. 
I'm really not sure what's best, which is not good since I hoped to get the anon_vma implementation working on Monday evening (heck, it was already swapping my test app fine despite the huge vma_split/PageDirect bug that you noticed, which probably caused `ps` to oops; I bet `ps` is doing a vma_split ;) but now I've returned to wondering about the design issues instead. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-13 17:28 ` Rik van Riel 2004-03-13 17:41 ` Hugh Dickins @ 2004-03-13 17:54 ` Andrea Arcangeli 2004-03-13 17:55 ` Andrea Arcangeli 2004-03-13 18:57 ` Linus Torvalds 1 sibling, 1 reply; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-13 17:54 UTC (permalink / raw) To: Rik van Riel Cc: Hugh Dickins, Linus Torvalds, William Lee Irwin III, Ingo Molnar, Andrew Morton, Kernel Mailing List On Sat, Mar 13, 2004 at 12:28:31PM -0500, Rik van Riel wrote: > On Sat, 13 Mar 2004, Hugh Dickins wrote: > > On Sat, 13 Mar 2004, Linus Torvalds wrote: > > > > if (PageAnonymous(page) && page->count > 1) { > > > newpage = alloc_page(); > > > copy_page(page, newpage); > > > page = newpage; > > > } > > > /* Move the page to the new address */ > > > page->index = address >> PAGE_SHIFT; > > > > > > and now we have zero special cases. > > > > That's always been a fallback solution, I was just a little too ashamed > > to propose it originally - seems a little wrong to waste whole pages > > rather than wasting a few bytes of data structure trying to track them: > > though the pages are pageable unlike any data structure we come up with. > > No, Linus is right. > > If a child process uses mremap(), it stands to reason that > it's about to use those pages for something. > > Think of it as taking the COW faults early, because chances > are you'd be taking them anyway, just a little bit later... using mremap to _move_ anonymous maps is simply not frequent. It's so infrequent that it's hard to tell if the child is going to _read_ or to _write_. Using those pages means nothing, all that matters is whether it will use those pages for reading or for writing, and I don't see how you can assume it's going to write to them and how you can assume this is an early-COW in the common case. the only interesting point to me is that it's infrequent, with that I certainly agree, but I don't see this as an early-COW. 
What worries me most are things like kde: they used the library design with the sole object of sharing readonly anonymous pages, which is very smart since it still prevents a bug in one app from taking down the whole GUI, but if they happen to use mremap to move those readonly pages around after the for we'll screw them completely. I've no indication that this is the case or that they ever call mremap, but I cannot tell the opposite either. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-13 17:54 ` Andrea Arcangeli @ 2004-03-13 17:55 ` Andrea Arcangeli 0 siblings, 0 replies; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-13 17:55 UTC (permalink / raw) To: Rik van Riel Cc: Hugh Dickins, Linus Torvalds, William Lee Irwin III, Ingo Molnar, Andrew Morton, Kernel Mailing List On Sat, Mar 13, 2004 at 06:54:06PM +0100, Andrea Arcangeli wrote: > after the for we'll screw them completely. I've no indication that this ^k ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-13 17:28 ` Rik van Riel 2004-03-13 17:41 ` Hugh Dickins 2004-03-13 17:54 ` Andrea Arcangeli @ 2004-03-13 18:57 ` Linus Torvalds 2004-03-13 19:14 ` Hugh Dickins 2 siblings, 1 reply; 74+ messages in thread From: Linus Torvalds @ 2004-03-13 18:57 UTC (permalink / raw) To: Rik van Riel Cc: Hugh Dickins, Andrea Arcangeli, William Lee Irwin III, Ingo Molnar, Andrew Morton, Kernel Mailing List On Sat, 13 Mar 2004, Rik van Riel wrote: > > No, Linus is right. > > If a child process uses mremap(), it stands to reason that > it's about to use those pages for something. That's not necessarily true, since it's entirely possible that it's just a realloc(), and the old part of the allocation would have been left alone. That said, I suspect that - mremap() isn't all _that_ common in the first place - it's even more rare to do a fork() and then a mremap() (ie most of the time I suspect the page count will be 1, and no COW is necessary). Most apps tend to exec() after a fork. - I agree that in at least part of the remaining cases we _would_ COW the pages anyway. I suspect that the only common "no execve after fork" usage is for a few servers, especially the traditional UNIX kind (ie using processes as fairly heavy-weight threads). It could be interesting to see numbers. But basically I'm inclined to believe that the "unnecessary COW" case is _so_ rare, that if it allows us to make other things simpler (and thus more stable and likely faster) it is worth it. Especially the simplicity just appeals to me. I just think that if mremap() causes so many problems for reverse mapping, we should make _that_ the expensive operation, instead of making everything else more complicated. After all, if it turns out that the "early COW" behaviour I suggest can be a performance problem for some (rare) circumstances, then the fix for that is likely to just let applications know that mremap() can be expensive. 
(It's still likely to be a lot cheaper than actually doing a new mmap+memcpy+munmap, so it's not like mremap would become pointless). Linus ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-13 18:57 ` Linus Torvalds @ 2004-03-13 19:14 ` Hugh Dickins 0 siblings, 0 replies; 74+ messages in thread From: Hugh Dickins @ 2004-03-13 19:14 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Andrea Arcangeli, William Lee Irwin III, Ingo Molnar, Andrew Morton, Kernel Mailing List On Fri, 12 Mar 2004, Linus Torvalds wrote: > > The absolute _LAST_ thing we want to have is a "remnant" rmap > infrastructure that only gets very occasional use. That's a GUARANTEED way > to get bugs, and really subtle behaviour. On Sat, 13 Mar 2004, Linus Torvalds wrote: > > I just think that if mremap() causes so many problems for reverse mapping, > we should make _that_ the expensive operation, instead of making > everything else more complicated. Friday's Linus has a good point, but I agree more with Saturday's: mremap MAYMOVE is a very special case, and I believe it would hurt the whole to put it at the centre of the design. But all power to Andrea to achieve that. Hugh ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-13 17:24 ` Hugh Dickins 2004-03-13 17:28 ` Rik van Riel @ 2004-03-13 17:48 ` Andrea Arcangeli 1 sibling, 0 replies; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-13 17:48 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Rik van Riel, William Lee Irwin III, Ingo Molnar, Andrew Morton, Kernel Mailing List On Sat, Mar 13, 2004 at 05:24:12PM +0000, Hugh Dickins wrote: > On Sat, 13 Mar 2004, Linus Torvalds wrote: > > > > Ok, guys, > > how about this anon-page suggestion? > > What you describe is pretty much exactly what my anobjrmap patch > from a year ago did. I'm currently looking through that again it is. Linus simply provided a solution to the mremap issue, that is to make it impossible to share anonymous pages through an mremap, that solves the problem indeed at some cpu and memory cost after an mremap. I realized you could solve it also by walking the whole list of vmas in every mm->mmap list but that complexity would be way too high. > > The only problem is mremap() after a fork(), and hell, we know that's a > > special case anyway, and let's just add a few lines to copy_one_pte(), > > which basically does: > > > > if (PageAnonymous(page) && page->count > 1) { > > newpage = alloc_page(); > > copy_page(page, newpage); > > page = newpage; > > } > > /* Move the page to the new address */ > > page->index = address >> PAGE_SHIFT; > > > > and now we have zero special cases. > > That's always been a fallback solution, I was just a little too ashamed > to propose it originally - seems a little wrong to waste whole pages > rather than wasting a few bytes of data structure trying to track them: > though the pages are pageable unlike any data structure we come up with. > > I think we have page_table_lock in copy_one_pte, so won't want to do > it quite like that. It won't matter at all if pages are transiently > untrackable. 
Might want to do something like make_pages_present > afterwards (but it should only be COWing instantiated pages; and > does need to COW pages currently on swap too). > > There's probably an issue with Alan's strict commit memory accounting, > if the mapping is readonly; but so long as we get that counting right, > I don't think it's really going to matter at all if we sometimes fail > an mremap for that reason - but probably need to avoid mistaking the > common case (mremap of own area) for the rare case which needs this > copying (mremap of inherited area). It still looks like quite a hack to me, though I must agree that in a desktop scenario with swapoff -a, it will save around 24 bytes per anonymous vma and 12 bytes per file vma plus it doesn't restrict the vma merging in any way, compared to my anon_vma, and it saves me from worrying about people doing a flood of vma_splits that will generate a long list of vmas for every anon_vma. I still feel anon_vma is preferable to anonmm+linus-unshare-mremap if one needs to swap, and while the prio_tree on i_mmap{shared} in practice is needed only for 32bit apps, I know some apps with hundreds of processes allocating huge chunks of direct anon memory each and swapping a lot at the same time. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-13 16:18 ` Linus Torvalds
  2004-03-13 17:24   ` Hugh Dickins
@ 2004-03-13 17:33   ` Andrea Arcangeli
  2004-03-13 17:53     ` Hugh Dickins
  2004-03-13 17:57     ` Rik van Riel
  1 sibling, 2 replies; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-13 17:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, William Lee Irwin III, Hugh Dickins, Ingo Molnar,
      Andrew Morton, Kernel Mailing List

On Sat, Mar 13, 2004 at 08:18:48AM -0800, Linus Torvalds wrote:
>
> Ok, guys,
> how about this anon-page suggestion?
>
> I'm a bit nervous about the complexity issues in Andrea's current setup,
> so I've been thinking about Rik's per-mm thing. And I think that there is
> one very simple approach, which should work fine, and should have minimal
> impact on the existing setup exactly because it is so simple.
>
> Basic setup:
>  - each anonymous page is associated with exactly _one_ virtual address,
>    in a "anon memory group".
>
>    We put the virtual address (shifted down by PAGE_SHIFT) into
>    "page->index". We put the "anon memory group" pointer into
>    "page->mapping". We have a PAGE_ANONYMOUS flag to tell the
>    rest of the world about this.
>
>  - the anon memory group has a list of all mm's that it is associated
>    with.
>
>  - an "execve()" creates a new "anon memory group" and drops the old one.
>
>  - a mm copy operation just increments the reference count and adds the
>    new mm to the mm list for that anon memory group.

This is the anonmm from Hugh.

> So now to do reverse mapping, we can take a page, and do
>
>	if (PageAnonymous(page)) {
>		struct anongroup *mmlist = (struct anongroup *)page->mapping;
>		unsigned long address = page->index << PAGE_SHIFT;
>		struct mm_struct *mm;
>
>		for_each_entry(mm, mmlist->anon_mms, anon_mm) {
>			.. look up page in page tables in "mm, address" ..
>			.. most of the time we may not even need to look ..
>			.. up the "vma" at all, just walk the page tables ..
>		}
>	} else {
>		/* Shared page */
>		.. look up page using the inode vma list ..
>	}
>
> The above all works 99% of the time.

This is again exactly the anonmm from Hugh.

BTW (for completeness), I was thinking last night that the anonmm could
in theory handle mremap correctly too, without changes like the one
below, if it walked the whole list of vmas reachable from mm->mmap for
every mm in the anonmm (your anongroup; Hugh called it struct anonmm
instead of struct anongroup). The problem is that checking all the vmas
is expensive and a single find_vma is a lot faster, but find_vma has no
way to take vm_pgoff into the equation, and in turn it breaks with
mremap.

> The only problem is mremap() after a fork(), and hell, we know that's a
> special case anyway, and let's just add a few lines to copy_one_pte(),
> which basically does:
>
>	if (PageAnonymous(page) && page->count > 1) {
>		newpage = alloc_page();
>		copy_page(page, newpage);
>		page = newpage;
>	}
>	/* Move the page to the new address */
>	page->index = address >> PAGE_SHIFT;
>
> and now we have zero special cases.

Here you're basically saying that you agree with Hugh that anonmm is
the way to go, and you're providing one of the possible ways to handle
mremap correctly with anonmm (without using pte_chains). Above I also
provided another alternate way to handle mremap correctly with anonmm
(that is, to inefficiently walk all of mm->mmap and to try unmapping
from all vmas with vma->vm_file == NULL).

What I called anon_vma_global in an older email is the more efficient
version of checking all the vmas in mm->mmap: a prio_tree could index
all the anon vmas in each mm, taking vm_pgoff into consideration,
unlike find_vma(page->index). That still takes memory for each vma
though, and it also still forces us to check all unrelated mm address
spaces too (see later in the email for details on this).
But returning to your proposed solution to the mremap problem with the
anonmm design: that will certainly work. Rather than trying to handle
that case correctly, we just make it impossible for the condition to
happen. I don't like unsharing pages very much, but it may save more
memory than it actually wastes; the problem is that it depends on the
workload.

The remaining downside of all the global anonmm designs vs my
fine-grained anon_vma design is that if you execute a malloc in a child
(that will be direct memory with page->count == 1), you'll still have
to try all the mms in the anongroup (which can be on the order of
thousands), while the anon_vma design would immediately reach only the
right vma in the right mm, and would not try the wrong vmas in the
other mms (i.e. no find_vma). That isn't fixable with the anonmm
design.

I think the only important thing is to avoid the _per-page_ overhead of
the pte_chains; a _per-vma_ 12-byte cost for the anon_vma doesn't sound
like an issue to me if it can save significant cpu in a setup with
thousands of tasks, each one executing a malloc. A single vma can cover
plenty of memory.

Note that even the i_mmap{,shared} methods (even with a prio_tree!) may
actually check vmas (and in turn mm_structs too) where the page has
been substituted with an anonymous copy during a COW fault, if the vma
has been mapped MAP_PRIVATE. We cannot avoid checking unrelated
mm_structs with MAP_PRIVATE usages (since the only place holding that
information is the pte itself, so by the time we find the answer it's
too late to avoid asking the question), but I can avoid that for the
anonymous memory with my anon_vma design. And my anon_vma gets mremap
right too, without the need of prio trees like the anon_vma_global
design I proposed requires, and while still allowing sharing of pages
through mremap.

The downsides of anon_vma vs anonmm+linus-unshare-during-mremap are
that anon_vma requires a 12-byte object per anonymous vma, and secondly
it requires 12 bytes per vma for the anon_vma_node list_head and the
anon_vma pointer. So it's a worst-case 24-byte overhead per anonymous
vma (on average it will be slightly less, since the anon_vmas can be
shared). Secondly, anon_vma forbids merging of vmas with different
anon_vmas or with different vm_pgoff, though for all appends there will
be no problem at all: appends with mmap are guaranteed to work. A
munmap+mmap gap creation and gap fill is also guaranteed to work (since
split_vma will make both the prev and next vma share the same
anon_vma).

The advantage of anon_vma is that it will track all vmas in the most
fine-grained way possible, avoiding having the unmapping code walk mms
that for sure have nothing to do with the page we want to unmap, plus
it handles mremap (allowing sharing and avoiding copies). It avoids the
find_vma cost too. I'm not sure whether the pros are worth the
additional 24 bytes per anonymous vma; the complexity doesn't worry me
though. Also, when the cost is truly 24 bytes we'll see the biggest
advantage; if the advantage is low it means the cost is less than 24
bytes, since the anon_vma is shared.

> What do you think? To me, this seems to be a really simple approach..

I certainly agree it's simpler. I'm quite undecided at the moment
whether to give up on the anon_vma and use anonmm plus your unshare
during mremap. While it's simpler, it's also a definitely inferior
solution, since it uses the mremap hack to work safely and it will
check all mms in the group with find_pte no matter whether they are
worth checking; but at the same time, if one is never swapping and
never using mremap, it will save some memory from the anon_vma overhead
(and it will also be non-exploitable without the need of a prio_tree).

With anon_vma and without a prio_tree on top of it, one could try
executing a flood of vma_splits, which could cause memory waste during
swapping, but all real applications would definitely swap better with
anon_vma than with anonmm. I mean, I would expect the pte_chain
advocates to agree anon_vma is a lot better than anonmm: they were
going to throw 8 bytes per pte at the problem to save cpu during
swapping; now I throw only 24 bytes per vma at it (with each vma still
extendable through merging) and I still provide optimal swapping with
minimal complexity. So they should like the fine-grained way more than
unsharing with mremap and not scaling during swapping by checking all
unrelated mms too. anon_vma basically sits in between anonmm and
pte_chains.

It was more than enough for me to save all the memory wasted in the
pte_chains on the 64bit archs with huge anonymous vma blocks, but I
didn't want to give up the swap scalability either with many processes
(with i_mmap{,shared} we already have enough trouble with scalability
during swapping, and I didn't want to think about those issues for the
anonymous memory too, with some thousands of tasks as it will run in
practice). If I go straight ahead with anon_vma I'm basically
guaranteed that I can forget about the anonymous vma swapping and that
all real life apps will scale _as_well_ as with the pte_chains, and I'm
guaranteed not to run into issues with mremap (though I don't expect
trouble there).

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-13 17:33 ` Andrea Arcangeli
@ 2004-03-13 17:53   ` Hugh Dickins
  2004-03-13 18:13     ` Andrea Arcangeli
  2004-03-13 17:57   ` Rik van Riel
  1 sibling, 1 reply; 74+ messages in thread
From: Hugh Dickins @ 2004-03-13 17:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Rik van Riel, William Lee Irwin III, Ingo Molnar,
      Andrew Morton, Kernel Mailing List

On Sat, 13 Mar 2004, Andrea Arcangeli wrote:
>
> I certainly agree it's simpler. I'm quite undecided if to giveup on the
> anon_vma and to use anonmm plus your unshared during mremap at the
> moment, while it's simpler it's also a definitely inferior solution

I think you should persist with anon_vma and I should resurrect anonmm,
and let others decide between those two and pte_chains.

But while in this trial phase, can we both do it in such a way as to
avoid too much trivial change all over the tree? For example, I'm
thinking I need to junk my irrelevant renaming of put_dirty_page to
put_stack_page, and for the moment it would help if you cut out your
mapping -> as.mapping changes (when I came to build yours, I had to go
through various filesystems I had in my config updating them
accordingly). It's a correct change (which I was too lazy to do, used
evil casting instead) but better left as a tidyup for later?

Hugh

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-13 17:53 ` Hugh Dickins
@ 2004-03-13 18:13   ` Andrea Arcangeli
  2004-03-13 19:35     ` Hugh Dickins
  0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-13 18:13 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Rik van Riel, William Lee Irwin III, Ingo Molnar,
      Andrew Morton, Kernel Mailing List

On Sat, Mar 13, 2004 at 05:53:36PM +0000, Hugh Dickins wrote:
> On Sat, 13 Mar 2004, Andrea Arcangeli wrote:
> >
> > I certainly agree it's simpler. I'm quite undecided if to giveup on the
> > anon_vma and to use anonmm plus your unshared during mremap at the
> > moment, while it's simpler it's also a definitely inferior solution
>
> I think you should persist with anon_vma and I should resurrect
> anonmm, and let others decide between those two and pte_chains.
>
> But while in this trial phase, can we both do it in such a way as to
> avoid too much trivial change all over the tree? For example, I'm
> thinking I need to junk my irrelevant renaming of put_dirty_page to
> put_stack_page, and for the moment it would help if you cut out your
> mapping -> as.mapping changes (when I came to build yours, I had to
> go through various filesystems I had in my config updating them
> accordingly). It's a correct change (which I was too lazy to do,
> used evil casting instead) but better left as a tidyup for later?

Yes, we should split it in two patches; one is the "preparation" for a
reused page->as.mapping. You know I did it differently to retain the
swapper_space and to avoid hooking explicit "if (PageSwapCache)" checks
into things like sync_page.

About using the union: I still prefer it. I've seen that Linus used an
explicit cast in the pseudocode too, but I don't feel safe with
explicit casts; I prefer more breakage to risking forgetting to
convert some page->mapping into page_mapping, or similar issues with
the casts ;)

I'll return to working on this after the weekend. You can find my
latest status on the ftp; if you extract any interesting "common" bit
from there, just send it to me too. Thanks.

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-13 18:13 ` Andrea Arcangeli
@ 2004-03-13 19:35   ` Hugh Dickins
  0 siblings, 0 replies; 74+ messages in thread
From: Hugh Dickins @ 2004-03-13 19:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Rik van Riel, William Lee Irwin III, Ingo Molnar,
      Andrew Morton, Kernel Mailing List

On Sat, 13 Mar 2004, Andrea Arcangeli wrote:
>
> yes, we should split in two patches, one is the "peparation" for a
> reused page->as.mapping, you know I did it differently to retain the
> swapper_space and avoiding to hook explicit "if (PageSwapCache)" checks
> into things like sync_page.
>
> About using the union, I still prefer it, I've seen Linus in the
> pseudocode used an explicit cast too, but I don't feel safe with
> explicit casts, I prefer more breakage, than risking to forget
> converting any page->mapping into page_maping or similar issues with the
> casts ;)

Your union is right, and my casting lazy, no question of that. It's
just that we'd need to do a whole lot of cosmetic edits to get fully
building trees, distracting from the guts of it. In my case, anyway,
the number of places that actually use the casting are very few (just
rmap.c?), suspect it's same for you. I'm certainly not arguing against
sanity checks where needed, just against treewide edits (or broken
builds) for now.

> I'll return working on this after the weekend. You can find my latest
> status on the ftp, if you extract any interesting "common" bit from
> there just send it to me too. thanks.

Thanks a lot. I don't imagine you've done the nonlinear vma case yet,
but when you or Rajesh do, please may I just steal it, okay?

Hugh

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-13 17:33 ` Andrea Arcangeli
  2004-03-13 17:53   ` Hugh Dickins
@ 2004-03-13 17:57   ` Rik van Riel
  1 sibling, 0 replies; 74+ messages in thread
From: Rik van Riel @ 2004-03-13 17:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, William Lee Irwin III, Hugh Dickins, Ingo Molnar,
      Andrew Morton, Kernel Mailing List

On Sat, 13 Mar 2004, Andrea Arcangeli wrote:

> The remaining downside of all the global anonmm designs vs my finegrined
> anon_vma design, is that if you execute a malloc in a child (that will
> be direct memory with page->count == 1), you'll still have to try all
> the mm in the anongroup (that can be on the order of the thousands),

That's ok, you have a similar issue with very commonly mmap()d files,
where some pages haven't been faulted in by most processes, or have
been replaced by private pages after a COW fault due to a MAP_PRIVATE
mapping.

You just increase the number of pages for which this search is done,
but I suspect that shouldn't be a big worry...

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-12 12:21 ` Andrea Arcangeli
    ` (2 preceding siblings ...)
  2004-03-12 12:46 ` William Lee Irwin III
@ 2004-03-12 13:43 ` Hugh Dickins
  2004-03-12 15:56   ` Andrea Arcangeli
  3 siblings, 1 reply; 74+ messages in thread
From: Hugh Dickins @ 2004-03-12 13:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
      William Lee Irwin III

Thanks a lot for pointing us to your (last night's) patches, Andrea.

On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> On Thu, Mar 11, 2004 at 10:28:42PM -0500, Rik van Riel wrote:
>
> It's not that I didn't read anonmm patches from Hugh, I spent lots of
> time on those, they just were flawed and they couldn't handle mremap,
> he very well knows, see anobjrmap-5 for instance.

Flawed in what way? They handled mremap fine, but yes, used pte_chains
for that extraordinary case, just as pte_chains were used for
nonlinear. With pte_chains gone (hurrah! though nonlinear handling yet
to come), as you know, I've already suggested a better way to handle
that case (use a tmpfs-style backing object).

> the vma merging isn't a problem, we need to rework the code anyways to
> allow the file merging in both mprotect and mremap (currently only mmap
> is capable of merging files, and in turn it's also the only one capable
> of merging anon_vmas). Any merging code that is currently capable of
> merging files is easy to teach about anon_vmas too, it's basically the
> same problem at merging.

You're paying too much attention to the (almost optional, though it
can have a devastating effect on vma usage, yes) issue of vma merging;
but what about the (mandatory) vma splitting? I see no sign of the
tiresome code I said you'd need for anonvma rather than anonmm,
walking the pages updating as.vma whenever a vma changes, e.g. when
mprotecting or munmapping some pages in the middle of a vma. Surely
move_vma_start is not enough?

That's what led me to choose anonmm, which seems a lot simpler: the
real argument for anonvma is that it saves a find_vma per pte in
try_to_unmap (page_referenced doesn't need it): a good saving, but is
it worth the complication of the faster paths?

Hugh

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-12 13:43 ` Hugh Dickins
@ 2004-03-12 15:56   ` Andrea Arcangeli
  2004-03-12 16:12     ` Hugh Dickins
  0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-12 15:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Rik van Riel, Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
      William Lee Irwin III

On Fri, Mar 12, 2004 at 01:43:23PM +0000, Hugh Dickins wrote:
> Thanks a lot for pointing us to your (last night's) patches, Andrea.
>
> On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> > On Thu, Mar 11, 2004 at 10:28:42PM -0500, Rik van Riel wrote:
> >
> > It's not that I didn't read anonmm patches from Hugh, I spent lots of
> > time on those, they just were flawed and they couldn't handle mremap,
> > he very well knows, see anobjrmap-5 for instance.
>
> Flawed in what way? They handled mremap fine, but yes, used pte_chains
> for that extraordinary case, just as pte_chains were used for nonlinear.

"Using pte_chains for the extraordinary case" (which is a common case
for some apps) means it doesn't handle it, and you have to use rmap to
handle that case.

> With pte_chains gone (hurrah! though nonlinear handling yet to come),
> as you know, I've already suggested a better way to handle that case
> (use tmpfs-style backing object).

Do you realize the complexity of creating a tmpfs inode and attaching
all vmas to it, stacked on top of anonmm? And after you fix mremap you
get the same disadvantages for merging of vmas (remember my
disadvantage of not merging after an mremap: you won't merge either),
plus it wastes a lot more ram since you need a fake inode for every
anonymous vma, and it's ugly to create those objects inside mremap. My
transient object is 8 bytes per group of vmas. And you need even the
prio_tree search on top of the anonmm. Don't forget you can't re-use
the vma->shared for doing the tmpfs-style thing, that's already used by
a true inode. So what you're suggesting would become a huge mess to
implement IMHO.

The anon_vma sounds like a much cleaner and more efficient design to me
than stacking inode-like objects on top of a vma already queued in an
i_mmap.

> > the vma merging isn't a problem, we need to rework the code anyways to
> > allow the file merging in both mprotect and mremap (currently only mmap
> > is capable of merging files, and in turn it's also the only one capable
> > of merging anon_vmas). Any merging code that is currently capable of
> > merging files is easy to teach about anon_vmas too, it's basically the
> > same problem at merging.
>
> You're paying too much attention to the (almost optional, though it can
> have a devastating effect on vma usage, yes) issue of vma merging, but
> what about the (mandatory) vma splitting? I see no sign of the tiresome
> code I said you'd need for anonvma rather than anonmm, walking the pages
> updating as.vma whenever vma changes e.g. when mprotecting or munmapping
> some pages in the middle of a vma. Surely move_vma_start is not enough?

You're right about vma_split: the way I implemented it is wrong;
basically the as.vma/PageDirect idea falls apart with vma_split. I
should simply allocate the anon_vma without passing through the direct
mode. That will fix it, though it'll be a bit less efficient for the
first page fault in an anonymous vma (only the first one; for all the
other page faults it'll be as fast as the direct mode).

This is probably why the code was not stable yet, btw ;) so I greatly
appreciate your comments about it; it's just the optimization I did
that was invalid. I could retain the optimization with a list of pages
attached to the vma, but it isn't worth it: allocating the anon_vma is
way too cheap compared to that. The PageDirect path was a
micro-optimization only; any additional complexity to retain it is
worthless.

> That's what led me to choose anonmm, which seems a lot simpler: the real
> argument for anonvma is that it saves a find_vma per pte in try_to_unmap
> (page_referenced doesn't need it): a good saving, but is it worth the
> complication of the faster paths?

The only real argument is mremap; your tmpfs-like thing is overkill
compared to anon_vma, and secondly I don't need the prio_tree to scale.

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-12 15:56 ` Andrea Arcangeli
@ 2004-03-12 16:12   ` Hugh Dickins
  2004-03-12 16:39     ` Andrea Arcangeli
  0 siblings, 1 reply; 74+ messages in thread
From: Hugh Dickins @ 2004-03-12 16:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
      William Lee Irwin III

On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> On Fri, Mar 12, 2004 at 01:43:23PM +0000, Hugh Dickins wrote:
>
> Don't forget you can't re-use the vma->shared for doing the tmpfs-style
> thing, that's already in a true inode.

Good point, I was overlooking that. I'll see if I can come up with
something, but that may well prove a killer.

> you're right about vma_split, the way I implemented it is wrong,
> basically the as.vma/PageDirect idea is falling apart with vma_split.
> I should simply allocate the anon_vma without passing through the direct

Yes, that'll take a lot of the branching out, all much simpler.

> mode, that will fix it though it'll be a bit less efficient for the
> first page fault in an anonymous vma (only the first one, for all the
> other page faults it'll be as fast as the direct mode).

Simpler still to allocate it earlier? Perhaps too wasteful.

Hugh

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-12 16:12 ` Hugh Dickins
@ 2004-03-12 16:39   ` Andrea Arcangeli
  0 siblings, 0 replies; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-12 16:39 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Rik van Riel, Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
      William Lee Irwin III

On Fri, Mar 12, 2004 at 04:12:10PM +0000, Hugh Dickins wrote:
> > you're right about vma_split, the way I implemented it is wrong,
> > basically the as.vma/PageDirect idea is falling apart with vma_split.
> > I should simply allocate the anon_vma without passing through the direct
>
> Yes, that'll take a lot of the branching out, all much simpler.

indeed.

> Simpler still to allocate it earlier? Perhaps too wasteful.

One trouble with allocating it earlier is that insert_vm_struct would
need to return an -ENOMEM retval, plus things like MAP_PRIVATE don't
necessarily need an anon_vma ever (true anon mappings tend to always
need it instead ;). So I will have to add an anon_vma_prepare(vma)
near all SetPageAnon; that's easy. In fact I may want to coalesce the
two things together; it will look like:

	int anon_vma_prepare_page(vma, page)
	{
		if (!vma->anon_vma) {
			vma->anon_vma = anon_vma_alloc();
			if (!vma->anon_vma)
				return -ENOMEM;
			/* single threaded, no locks here */
			list_add(&vma->anon_vma_node,
				 &vma->anon_vma->anon_vma_head);
		}
		SetPageAnon(page);
		return 0;
	}

I will have to handle a retval failure from there; that's the only
annoyance of removing the PageDirect optimization. I really did the
PageDirect mostly to leave all the anon_vma allocations to fork(). Now
it's the exact opposite: fork will never need to allocate any anon_vma
anymore, it will only boost the page->mapcount.

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-11 13:23 ` Hugh Dickins
  2004-03-11 13:56   ` Andrea Arcangeli
@ 2004-03-11 17:33   ` Andrea Arcangeli
  2004-03-11 22:20 ` Rik van Riel
  2 siblings, 0 replies; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-11 17:33 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
      William Lee Irwin III

ok, it links and boots ;)

At the previous try, with slab debugging enabled, it was spawning tons
of errors, but I suspect it's a bug in the slab debugging: it was
complaining about red zone memory corruption, which could be due to
the tiny size of this object (only 8 bytes).

andrea@xeon:~> grep anon_vma /proc/slabinfo
anon_vma            1230   1500     12  250    1 : tunables  120   60    8 : slabdata      6      6      0
andrea@xeon:~>

now I need to try swapping... (I guess it won't work at the first try,
I'd be surprised if I didn't miss any s/index/private/)

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-11 13:23 ` Hugh Dickins
  2004-03-11 13:56   ` Andrea Arcangeli
  2004-03-11 17:33   ` Andrea Arcangeli
@ 2004-03-11 22:20 ` Rik van Riel
  2004-03-11 23:43   ` Hugh Dickins
  2 siblings, 1 reply; 74+ messages in thread
From: Rik van Riel @ 2004-03-11 22:20 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Ingo Molnar, Andrew Morton, torvalds,
      linux-kernel, William Lee Irwin III

On Thu, 11 Mar 2004, Hugh Dickins wrote:

> length of your essay on vma merging, it strikes me that you've taken
> a wrong direction in switching from my anon mm to your anon vma.
>
> Go by vmas and you have tiresome problems as they are split and merged,
> very commonly. Plus you have the overhead of new data structure per vma.

There's of course a blindingly simple alternative.

Add every anonymous page to an "anon_memory" inode. Then everything is
in effect file backed. Using the same page refcounting we already do,
holes get shot into that "file".

The swap cache code provides a filesystem-like mapping from the
anon_memory "files" to the on-disk stuff, or the anon_memory file
pages are resident in memory.

As a side effect, it also makes it possible to get rid of the swapoff
code: simply move the anon_memory file pages from disk into memory...

We can avoid BSD memory object style code by simply having multiple
processes share the same anon_memory inode, allocating extents of
virtual space at once to reduce VMA count.

Not sure to what extent this is similar to what Hugh's stuff already
does, though, or if it's just a different way of saying how it's done
... I need to re-read the code ;)

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-11 22:20 ` Rik van Riel
@ 2004-03-11 23:43   ` Hugh Dickins
  2004-03-12  3:20     ` Rik van Riel
  0 siblings, 1 reply; 74+ messages in thread
From: Hugh Dickins @ 2004-03-11 23:43 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Ingo Molnar, Andrew Morton, Linus Torvalds,
      William Lee Irwin III, linux-kernel

On Thu, 11 Mar 2004, Rik van Riel wrote:
> On Thu, 11 Mar 2004, Hugh Dickins wrote:
>
> > length of your essay on vma merging, it strikes me that you've taken
> > a wrong direction in switching from my anon mm to your anon vma.
> >
> > Go by vmas and you have tiresome problems as they are split and merged,
> > very commonly. Plus you have the overhead of new data structure per vma.
>
> There's of course a blindingly simple alternative.
>
> Add every anonymous page to an "anon_memory" inode. Then
> everything is in effect file backed. Using the same page
> refcounting we already do, holes get shot into that "file".

Okay, Rik, the two extremes belong to you: one anon memory object in
total (above), and one per page (your original rmap); whereas Andrea
is betting on one per vma, and I go for one per mm. Each way has its
merits, I'm sure - and you've placed two bets!

> The swap cache code provides a filesystem like mapping
> from the anon_memory "files" to the on-disk stuff, or the
> anon_memory file pages are resident in memory.

For 2.7 something like that may well be reasonable. But let's beware
the fancy bloat of extra levels.

> As a side effect, it also makes it possible to get rid
> of the swapoff code, simply move the anon_memory file
> pages from disk into memory...

Wonderful if that code could disappear: but I somehow doubt it'll fall
out quite so easily - swapoff is inevitably backwards from sanity,
isn't it?

Hugh

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-11 23:43 ` Hugh Dickins
@ 2004-03-12  3:20   ` Rik van Riel
  0 siblings, 0 replies; 74+ messages in thread
From: Rik van Riel @ 2004-03-12 3:20 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Ingo Molnar, Andrew Morton, Linus Torvalds,
      William Lee Irwin III, linux-kernel

On Thu, 11 Mar 2004, Hugh Dickins wrote:

> Okay, Rik, the two extremes belong to you: one anon memory
> object in total (above), and one per page (your original rmap);
> whereas Andrea is betting on one per vma, and I go for one per mm.
> Each way has its merits, I'm sure - and you've placed two bets!

I suspect yours is the best mix.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply  [flat|nested] 74+ messages in thread
end of thread, other threads:[~2004-03-14 2:27 UTC | newest]
Thread overview: 74+ messages
-- links below jump to the message on this page --
[not found] <20040310080000.GA30940@dualathlon.random>
2004-03-10 13:01 ` [lockup] Re: objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines) Rik van Riel
2004-03-10 13:50 ` Andrea Arcangeli
2004-03-12 17:05 ` anon_vma RFC2 Rajesh Venkatasubramanian
2004-03-12 17:26 ` Andrea Arcangeli
2004-03-12 21:16 ` Rajesh Venkatasubramanian
2004-03-13 17:55 ` Rajesh Venkatasubramanian
2004-03-13 18:16 ` Andrea Arcangeli
2004-03-13 19:40 ` Rajesh Venkatasubramanian
2004-03-14 0:23 ` Andrea Arcangeli
2004-03-14 0:52 ` Linus Torvalds
2004-03-14 1:01 ` William Lee Irwin III
2004-03-14 1:07 ` Rik van Riel
2004-03-14 1:19 ` William Lee Irwin III
2004-03-14 1:41 ` Rik van Riel
2004-03-14 2:27 ` William Lee Irwin III
2004-03-14 1:15 ` Linus Torvalds
2004-03-11 20:09 Manfred Spraul
-- strict thread matches above, loose matches on Subject: below --
2004-03-08 20:24 objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines) Andrea Arcangeli
2004-03-09 10:52 ` [lockup] " Ingo Molnar
2004-03-09 11:02 ` Ingo Molnar
2004-03-09 11:09 ` Andrew Morton
2004-03-09 11:49 ` Ingo Molnar
2004-03-09 16:03 ` Andrea Arcangeli
2004-03-10 10:36 ` RFC anon_vma previous (i.e. full objrmap) Andrea Arcangeli
2004-03-11 6:52 ` anon_vma RFC2 Andrea Arcangeli
2004-03-11 13:23 ` Hugh Dickins
2004-03-11 13:56 ` Andrea Arcangeli
2004-03-11 21:54 ` Hugh Dickins
2004-03-12 1:47 ` Andrea Arcangeli
2004-03-12 2:20 ` Andrea Arcangeli
2004-03-12 3:28 ` Rik van Riel
2004-03-12 12:21 ` Andrea Arcangeli
2004-03-12 12:40 ` Rik van Riel
2004-03-12 13:11 ` Andrea Arcangeli
2004-03-12 16:25 ` Rik van Riel
2004-03-12 17:13 ` Andrea Arcangeli
2004-03-12 17:23 ` Rik van Riel
2004-03-12 17:44 ` Andrea Arcangeli
2004-03-12 18:18 ` Rik van Riel
2004-03-12 18:25 ` Linus Torvalds
2004-03-12 18:48 ` Rik van Riel
2004-03-12 19:02 ` Chris Friesen
2004-03-12 19:06 ` Rik van Riel
2004-03-12 19:10 ` Chris Friesen
2004-03-12 19:14 ` Rik van Riel
2004-03-12 20:27 ` Andrea Arcangeli
2004-03-12 20:32 ` Rik van Riel
2004-03-12 20:49 ` Andrea Arcangeli
2004-03-12 21:08 ` Jamie Lokier
2004-03-12 12:42 ` Andrea Arcangeli
2004-03-12 12:46 ` William Lee Irwin III
2004-03-12 13:24 ` Andrea Arcangeli
2004-03-12 13:40 ` William Lee Irwin III
2004-03-12 13:55 ` Hugh Dickins
2004-03-12 16:01 ` Andrea Arcangeli
2004-03-12 16:17 ` Linus Torvalds
2004-03-13 0:28 ` William Lee Irwin III
2004-03-13 14:43 ` Rik van Riel
2004-03-13 16:18 ` Linus Torvalds
2004-03-13 17:24 ` Hugh Dickins
2004-03-13 17:28 ` Rik van Riel
2004-03-13 17:41 ` Hugh Dickins
2004-03-13 18:08 ` Andrea Arcangeli
2004-03-13 17:54 ` Andrea Arcangeli
2004-03-13 17:55 ` Andrea Arcangeli
2004-03-13 18:57 ` Linus Torvalds
2004-03-13 19:14 ` Hugh Dickins
2004-03-13 17:48 ` Andrea Arcangeli
2004-03-13 17:33 ` Andrea Arcangeli
2004-03-13 17:53 ` Hugh Dickins
2004-03-13 18:13 ` Andrea Arcangeli
2004-03-13 19:35 ` Hugh Dickins
2004-03-13 17:57 ` Rik van Riel
2004-03-12 13:43 ` Hugh Dickins
2004-03-12 15:56 ` Andrea Arcangeli
2004-03-12 16:12 ` Hugh Dickins
2004-03-12 16:39 ` Andrea Arcangeli
2004-03-11 17:33 ` Andrea Arcangeli
2004-03-11 22:20 ` Rik van Riel
2004-03-11 23:43 ` Hugh Dickins
2004-03-12 3:20 ` Rik van Riel