* need help interpreting 'free' output.
@ 2001-10-30 11:32 Frank Dekervel
2001-10-30 11:46 ` Mike Fedyk
2001-10-30 16:07 ` Hugh Dickins
0 siblings, 2 replies; 19+ messages in thread
From: Frank Dekervel @ 2001-10-30 11:32 UTC (permalink / raw)
To: linux-kernel
hello,
since i saw strange things happening with my free memory numbers, i tried
this:
- i compiled and booted a fresh kernel (no proprietary modules, no patches,
just 2.4.14-pre4)
- i did free.
bakvis:~# free
             total       used       free     shared    buffers     cached
Mem:        384912      55644     329268          0       3652      29880
-/+ buffers/cache:      22112     362800
Swap:       136512          0     136512
so i have 22 meg used right ?
- i started the daily cron jobs (updatedb and htdig and some minor things
like log rotation)
- i did 'free' again.
bakvis:~# free
             total       used       free     shared    buffers     cached
Mem:        384912     377060       7852          0      29424     125660
-/+ buffers/cache:     221976     162936
Swap:       136512        752     135760
so now there is 220 meg used memory right ?
and the memory is definitely used, because as soon as i start a memory hog
the system hits swap ...
so what am i missing here ?
should i provide more info about my kernel configuration ? vmstat numbers ?
greetings,
Frank
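For reference, the '-/+ buffers/cache' row in both listings above can be recomputed from the 'Mem:' row. A minimal sketch in C, assuming the old procps layout shown here (all values in kilobytes); the struct and function names are made up for illustration:

```c
/* One "Mem:" row as printed by free(1) in this layout, in kilobytes. */
struct mem_row {
    long total, used, free, shared, buffers, cached;
};

/* "used" column of the "-/+ buffers/cache" row:
 * memory in use once buffers and page cache are discounted. */
long used_minus_cache(const struct mem_row *m)
{
    return m->used - m->buffers - m->cached;
}

/* "free" column of the same row:
 * free memory plus what buffers and page cache could give back. */
long free_plus_cache(const struct mem_row *m)
{
    return m->free + m->buffers + m->cached;
}
```

Fed the two 'Mem:' rows above, these give 22112/362800 and 221976/162936, matching the output. Only 'buffers' and 'cached' are subtracted, so in this layout anything else the kernel holds still shows up as used.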
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: need help interpreting 'free' output.
2001-10-30 11:32 need help interpreting 'free' output Frank Dekervel
@ 2001-10-30 11:46 ` Mike Fedyk
2001-10-30 14:02 ` Frank Dekervel
2001-10-30 16:07 ` Hugh Dickins
1 sibling, 1 reply; 19+ messages in thread
From: Mike Fedyk @ 2001-10-30 11:46 UTC (permalink / raw)
To: Frank Dekervel; +Cc: linux-kernel

On Tue, Oct 30, 2001 at 12:32:52PM +0100, Frank Dekervel wrote:
> so now there is 220 meg used memory right ?
> and the memory is definitely used, because as soon as i start a memory hog
> the system hits swap ...
>
> so what am i missing here ?
> should i provide more info about my kernel configuration ? vmstat numbers ?
>

Ahh, are you a new convert from a 2.2 kernel?

In 2.4 the kernel will swap out much earlier to make room for the running
programs, and disk cache. This is normal.

Earlier 2.4 kernels didn't do so well, but I won't go into detail because
there is already enough about that in the archives...

When you watch vmstat, if you see a lot of swapping traffic without much
good reason, then you should probably report something...

Mike
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: need help interpreting 'free' output.
2001-10-30 11:46 ` Mike Fedyk
@ 2001-10-30 14:02 ` Frank Dekervel
0 siblings, 0 replies; 19+ messages in thread
From: Frank Dekervel @ 2001-10-30 14:02 UTC (permalink / raw)
To: Mike Fedyk; +Cc: linux-kernel

On Tuesday 30 October 2001 12:46, Mike Fedyk wrote:
> Ahh, are you a new convert from a 2.2 kernel?
>
> In 2.4 the kernel will swap out much earlier to make room for the running
> programs, and disk cache. This is normal.
>
> Earlier 2.4 kernels didn't do so well, but I won't go into detail because
> there is already enough about that in the archives...
>
> When you watch vmstat, if you see a lot of swapping traffic without much
> good reason, then you should probably report something...

Hi,

i have already been using 2.4 for some time. the thing that bugs me is that
the 'used' figures go up while no processes actually use that memory (not
the buffers/cached figures; well, they go up too, but that's normal), so it
seems the memory is 'lost' somewhere. i don't see any processes using it up,
and 200 meg of ram in 70 seconds is a lot ...

So either i am misinterpreting something, or i am completely clueless, or
there is a leak somewhere..

greetings,
Frank
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: need help interpreting 'free' output.
2001-10-30 11:32 need help interpreting 'free' output Frank Dekervel
2001-10-30 11:46 ` Mike Fedyk
@ 2001-10-30 16:07 ` Hugh Dickins
2001-10-30 16:51 ` Andrea Arcangeli
` (2 more replies)
1 sibling, 3 replies; 19+ messages in thread
From: Hugh Dickins @ 2001-10-30 16:07 UTC (permalink / raw)
To: Frank Dekervel
Cc: Linus Torvalds, Andrea Arcangeli, Marcelo Tosatti, linux-kernel

On Tue, 30 Oct 2001, Frank Dekervel wrote:
>
> since i saw strange things happening with my free memory numbers, i tried
> this:
> - i compiled and booted a fresh kernel (no proprietary modules, no patches,
> just 2.4.14-pre4)
> - i did free.
>
> bakvis:~# free
>              total       used       free     shared    buffers     cached
> Mem:        384912      55644     329268          0       3652      29880
> -/+ buffers/cache:      22112     362800
> Swap:       136512          0     136512
>
> so i have 22 meg used right ?
>
> - i started the daily cron jobs (updatedb and htdig and some minor things
> like log rotation)
>
> - i did 'free' again.
>
> bakvis:~# free
>              total       used       free     shared    buffers     cached
> Mem:        384912     377060       7852          0      29424     125660
> -/+ buffers/cache:     221976     162936
> Swap:       136512        752     135760
>
> so now there is 220 meg used memory right ?
> and the memory is definitely used, because as soon as i start a memory hog
> the system hits swap ...
>
> so what am i missing here ?
> should i provide more info about my kernel configuration ? vmstat numbers ?

I'm fairly sure /proc/slabinfo will show large inode_cache and large
dentry_cache: which is natural after updatedb, nothing wrong with that.

However, unlike 2.4.13, 2.4.14-pre (you tried pre4, I just tried pre5)
seems much too unwilling to shrink_dcache and shrink_icache: your
memory hog should shrink them, but it seems not to. Linus?

Hugh
^ permalink raw reply [flat|nested] 19+ messages in thread
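The check Hugh suggests above can be sketched roughly as follows. This is only an illustration in C: it assumes each data line of /proc/slabinfo starts with the cache name, the active object count, the total object count and the object size in bytes, and the cache names are the ones from his message. The helper names are hypothetical; on newer kernels the file may need root and may use different cache names.

```c
#include <stdio.h>
#include <string.h>

/* Leading fields of one /proc/slabinfo data line (assumed layout). */
struct slab_stat {
    char name[64];
    unsigned long active_objs;
    unsigned long num_objs;
    unsigned long objsize;      /* bytes per object */
};

/* Parse one line; returns 1 on success, 0 for header or malformed lines. */
int parse_slab_line(const char *line, struct slab_stat *out)
{
    return sscanf(line, "%63s %lu %lu %lu", out->name,
                  &out->active_objs, &out->num_objs, &out->objsize) == 4;
}

/* Rough footprint in KiB: objects * size, ignoring per-slab overhead. */
unsigned long slab_kib(const struct slab_stat *s)
{
    return s->num_objs * s->objsize / 1024;
}

/* Print the two VFS caches named in the message.
 * Returns how many were found, or -1 if the file cannot be opened. */
int report_vfs_caches(const char *path)
{
    char line[512];
    struct slab_stat s;
    int found = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (!parse_slab_line(line, &s))
            continue;
        if (strcmp(s.name, "inode_cache") == 0 ||
            strcmp(s.name, "dentry_cache") == 0) {
            printf("%-14s %lu objs, ~%lu KiB\n",
                   s.name, s.num_objs, slab_kib(&s));
            found++;
        }
    }
    fclose(f);
    return found;
}
```

Calling report_vfs_caches("/proc/slabinfo") before and after updatedb would show whether these two caches account for the jump in 'used'.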
* Re: need help interpreting 'free' output.
2001-10-30 16:07 ` Hugh Dickins
@ 2001-10-30 16:51 ` Andrea Arcangeli
2001-10-30 16:52 ` Linus Torvalds
2001-10-30 18:11 ` Frank Dekervel
2 siblings, 0 replies; 19+ messages in thread
From: Andrea Arcangeli @ 2001-10-30 16:51 UTC (permalink / raw)
To: Hugh Dickins
Cc: Frank Dekervel, Linus Torvalds, Marcelo Tosatti, linux-kernel

On Tue, Oct 30, 2001 at 04:07:45PM +0000, Hugh Dickins wrote:
> On Tue, 30 Oct 2001, Frank Dekervel wrote:
> >
> > since i saw strange things happening with my free memory numbers, i tried
> > this:
> > - i compiled and booted a fresh kernel (no proprietary modules, no patches,
> > just 2.4.14-pre4)
> > - i did free.
> >
> > bakvis:~# free
> >              total       used       free     shared    buffers     cached
> > Mem:        384912      55644     329268          0       3652      29880
> > -/+ buffers/cache:      22112     362800
> > Swap:       136512          0     136512
> >
> > so i have 22 meg used right ?
> >
> > - i started the daily cron jobs (updatedb and htdig and some minor things
> > like log rotation)
> >
> > - i did 'free' again.
> >
> > bakvis:~# free
> >              total       used       free     shared    buffers     cached
> > Mem:        384912     377060       7852          0      29424     125660
> > -/+ buffers/cache:     221976     162936
> > Swap:       136512        752     135760
> >
> > so now there is 220 meg used memory right ?
> > and the memory is definitely used, because as soon as i start a memory hog
> > the system hits swap ...
> >
> > so what am i missing here ?
> > should i provide more info about my kernel configuration ? vmstat numbers ?
>
> I'm fairly sure /proc/slabinfo will show large inode_cache and large
> dentry_cache: which is natural after updatedb, nothing wrong with that.
>
> However, unlike 2.4.13, 2.4.14-pre (you tried pre4, I just tried pre5)
> seems much too unwilling to shrink_dcache and shrink_icache: your
> memory hog should shrink them, but it seems not to. Linus?

2.4.14pre5aa1 has logic to try to shrink those caches at a better time.
Frank, could you try again with pre5aa1 and see if it goes better?

Not shrinking the vfs caches when shrink_cache failed is wrong: allocations
from ZONE_NORMAL will fail with no way to recover as soon as all of
ZONE_NORMAL is eaten up by the vfs caches.

Andrea
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: need help interpreting 'free' output.
2001-10-30 16:07 ` Hugh Dickins
2001-10-30 16:51 ` Andrea Arcangeli
@ 2001-10-30 16:52 ` Linus Torvalds
2001-10-30 17:06 ` Andrea Arcangeli
2001-10-30 20:47 ` David S. Miller
2001-10-30 18:11 ` Frank Dekervel
2 siblings, 2 replies; 19+ messages in thread
From: Linus Torvalds @ 2001-10-30 16:52 UTC (permalink / raw)
To: Hugh Dickins
Cc: Frank Dekervel, Andrea Arcangeli, Marcelo Tosatti, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1370 bytes --]

On Tue, 30 Oct 2001, Hugh Dickins wrote:
>
> However, unlike 2.4.13, 2.4.14-pre (you tried pre4, I just tried pre5)
> seems much too unwilling to shrink_dcache and shrink_icache: your
> memory hog should shrink them, but it seems not to. Linus?

Yes. It's next on my list.

My _preferred_ approach would actually be to move the slab pages to the LRU
list too, and have a special "slab" address space (we don't need to actually
hash them, we just make page->mapping point to it), and have the cache
shrink be done naturally as part of writepage().

That way "shrink_cache()" reacts very naturally to slab pressure, while
right now it's more of a random behaviour. That's what the "anonymous pages
in the LRU" approach fixes - the VM scanning reacts very naturally (instead
of with subtle tweaking and almost random behaviour) to mapped page
pressure.

The "slab address space" is a longer-range plan, though. It might be really
simple (the writepage would just move the page to the active list and try to
shrink the slab that was hit), but I think the current stuff is "good
enough".

So in the short range, I haven't come up with any really good approaches,
but I suspect I'll just have to move the shrink_[di]cache() back to the
caller, which will at least shrink them on swapouts (a bit too much, I
think, but on the other hand maybe not).

Patch attached,

Linus

[-- Attachment #2: Type: TEXT/PLAIN, Size: 1203 bytes --]

diff -u --recursive pre5/linux/mm/vmscan.c linux/mm/vmscan.c
--- pre5/linux/mm/vmscan.c	Tue Oct 30 08:51:13 2001
+++ linux/mm/vmscan.c	Tue Oct 30 08:46:08 2001
@@ -515,20 +515,6 @@
 	}
 	spin_unlock(&pagemap_lru_lock);
 
-	if (nr_pages <= 0)
-		return 0;
-
-	/*
-	 * If swapping out isn't appropriate, and
-	 * we still fail, try the other (usually smaller)
-	 * caches instead.
-	 */
-	shrink_dcache_memory(priority, gfp_mask);
-	shrink_icache_memory(priority, gfp_mask);
-#ifdef CONFIG_QUOTA
-	shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
-#endif
-
 	return nr_pages;
 }
 
@@ -577,7 +563,17 @@
 	ratio = (unsigned long) nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2);
 	refill_inactive(ratio);
 
-	return shrink_cache(nr_pages, classzone, gfp_mask, priority);
+	nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
+	if (nr_pages <= 0)
+		return 0;
+
+	shrink_dcache_memory(priority, gfp_mask);
+	shrink_icache_memory(priority, gfp_mask);
+#ifdef CONFIG_QUOTA
+	shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
+#endif
+
+	return nr_pages;
 }
 
 int try_to_free_pages(zone_t *classzone, unsigned int gfp_mask, unsigned int order)
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: need help interpreting 'free' output.
2001-10-30 16:52 ` Linus Torvalds
@ 2001-10-30 17:06 ` Andrea Arcangeli
2001-10-30 17:28 ` Linus Torvalds
2001-10-30 20:47 ` David S. Miller
1 sibling, 1 reply; 19+ messages in thread
From: Andrea Arcangeli @ 2001-10-30 17:06 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, Frank Dekervel, Marcelo Tosatti, linux-kernel

On Tue, Oct 30, 2001 at 08:52:58AM -0800, Linus Torvalds wrote:
> So in the short range, I haven't come up with any really good approaches,
> but I suspect I'll just have to move the shrink_[di]cache() back to the
> caller, which will at least shrink them on swapouts (a bit too much, I
> think, but on the other hand maybe not).

Agreed.

It is still interesting to hear if it makes a big performance difference
under swap though. In particular it would be very nice to keep inodes
with pagecache in it out of the unused-inode-list, but it would need
additional bitkeeping in inode.c.

I'm also wondering why you dropped the early-cow for the write swapins,
just to avoid managing the anon pages in the lru in do_swap_page and to
have the logic only in one place? I kept the early-cow logic so I only
get 1 page fault for every write-swapped-in page.

Andrea
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: need help interpreting 'free' output.
2001-10-30 17:06 ` Andrea Arcangeli
@ 2001-10-30 17:28 ` Linus Torvalds
2001-10-30 17:39 ` Andrea Arcangeli
2001-10-30 18:05 ` Eric W. Biederman
0 siblings, 2 replies; 19+ messages in thread
From: Linus Torvalds @ 2001-10-30 17:28 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Hugh Dickins, Frank Dekervel, Marcelo Tosatti, linux-kernel

On Tue, 30 Oct 2001, Andrea Arcangeli wrote:
>
> It is still interesting to hear if it makes a big performance difference
> under swap though. In particular it would be very nice to keep inodes
> with pagecache in it out of the unused-inode-list, but it would need
> additional bitkeeping in inode.c.

Yes. I'm worried about the fact that icache shrinking was one of the top
CPU users under heavy swapout, so I'd like to do _something_. The LRU
approach is probably the cleanest and least random approach.

> I'm also wondering why you dropped the early-cow for the write swapins,
> just to avoid managing the anon pages in the lru in do_swap_page and to
> have the logic only in one place? I kept the early-cow logic so I only
> get 1 page fault for every write-swapped-in page.

I only dropped it because the locking rules for how exclusive swap pages
work were too unclear, and I wanted to have the "remove on write" in just
one place.

Then I cleaned up the logic and made the thing use the pagecache lock
properly and turned it into "remove_exclusive_swap_page()", and now I'm not
worried about it any more, so I'm considering moving it back again.

HOWEVER, _then_ I started wondering about whether the thing needs to be
removed from the swap cache at all, and came to the conclusion that for
the only case we really care about (and the only case where we _can_
re-use the swap cache page), we don't actually need to remove it from the
cache in the first place.

I think we should just share the page, and make the WP (and early-COW in
do_swap_page()) logic just be

	/* Are we now the only user? */
	if (swap_count(page) == 1 && page_count(page) == 2) {
		pte = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
		install_pte();
		return;
	}

There's no real reason to remove the page from the swap cache - that only
means that we have to wait for the page to unlock (because you need to lock
the page in order to remove the buffers that you need to remove _before_
you free the swap entry) and other crap that has no real point to it.

When we fork() and possibly share the page non-exclusively, we will
_already_ mark the page read-only and do the COW - so after that point we
will correctly just copy the page on demand.

Much simpler, I think.

Does anybody see why we have to remove it from the swap cache at all?

Linus
^ permalink raw reply [flat|nested] 19+ messages in thread
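To make the test in the snippet above concrete, here is a toy user-space model in C. It is not kernel code: the struct and function are hypothetical, and the reading of the numbers (one user of the swap entry, plus two page references, assumed here to be the swap cache's and the faulting task's) is an interpretation of the message, not something it spells out.

```c
#include <stdbool.h>

/* Toy stand-in for the two counters the proposed check looks at. */
struct swap_page_model {
    int swap_count;   /* users of the swap entry */
    int page_count;   /* references to the in-memory page */
};

/* Mirrors "swap_count(page) == 1 && page_count(page) == 2":
 * true when nobody else could be sharing the page, so it can be
 * mapped writable and dirty without copying it. */
bool sole_user(const struct swap_page_model *p)
{
    return p->swap_count == 1 && p->page_count == 2;
}
```

With a second sharer of the swap entry (swap_count 2) or an extra page reference (page_count 3), sole_user() is false and the ordinary copy-on-write path would apply.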
* Re: need help interpreting 'free' output.
2001-10-30 17:28 ` Linus Torvalds
@ 2001-10-30 17:39 ` Andrea Arcangeli
2001-10-30 17:53 ` Linus Torvalds
2001-10-30 18:05 ` Eric W. Biederman
1 sibling, 1 reply; 19+ messages in thread
From: Andrea Arcangeli @ 2001-10-30 17:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, Frank Dekervel, Marcelo Tosatti, linux-kernel

On Tue, Oct 30, 2001 at 09:28:29AM -0800, Linus Torvalds wrote:
> Does anybody see why we have to remove it from the swap cache at all?

the only reason is to avoid wasting the swap space, so at least Rik's
vm_swap_full logic should be added to it.

The only advantage of dirty swap cache persistence is that it will
maintain the same position on disk across a swapin/swapout cycle.

But anyways you can do that "swap persistence" work in do_swap_page too to
save a page fault for the write swapins. Ok, it's in one more place but it
will be less costly than running into another pagefault just after
returning to userspace.

Andrea
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: need help interpreting 'free' output.
2001-10-30 17:39 ` Andrea Arcangeli
@ 2001-10-30 17:53 ` Linus Torvalds
2001-10-30 18:16 ` Andrea Arcangeli
0 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2001-10-30 17:53 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Hugh Dickins, Frank Dekervel, Marcelo Tosatti, linux-kernel

On Tue, 30 Oct 2001, Andrea Arcangeli wrote:
>
> On Tue, Oct 30, 2001 at 09:28:29AM -0800, Linus Torvalds wrote:
> > Does anybody see why we have to remove it from the swap cache at all?
>
> the only reason is to avoid wasting the swap space, so at least Rik's
> vm_swap_full logic should be added to it.

I agree, but that's true both for reads and writes, and then we want to
delete it. So the logic might be something like

	remove = 0;
	if ((vm_swap_full() && (remove = exclusive_swap_cache_delete())) ||
	    only_swap_user()) {
		pte = mk_pte(page, vma->vm_page_prot);
		if (remove || write_access)
			pte = pte_mkdirty(pte);
		if (vma->vm_page_prot & VM_WRITE)
			pte = pte_mkwrite(pte);
		install_pte();
		return;
	}

ie we _remove_ it if we're low on swap entries and it is exclusive (that
doesn't really save memory, but it allows us to re-use the swap entries
for "better" pages), and we just re-use it without removing it if we're
the only users (it doesn't even have to be a write access - we can do it
even for reads, as if we're the only user we might as well just give the
page to the process anyway - and let fork() do the thing it does in any
case).

Then we'll just trust the dirty bit when shared, like we always have done
before anyway (we need to set it on removal, and we want to set it early
on a write access to avoid unnecessary faults on architectures which do
the dirty bit in software - that's why we have the "remove ||
write_access" test there).

> The only advantage of dirty swap cache persistence is that it will
> maintain the same position on disk across a swapin/swapout cycle.

Well, the _big_ advantage is not the persistence, but the fact that the
page might be in-flight when the user wants to use it, and the swap cache
is just busy. Right now we _wait_ for the write to complete, which is
silly. We might as well just let the user start using the page (including
writing more stuff to it), and later on write it again.

So right now the "remove from swap cache" is actually a IO-serializing
operation, and we're doing it for no really good reason.

Linus
^ permalink raw reply [flat|nested] 19+ messages in thread
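The decision described above can be modelled in user space to see which cases install the pte directly. A minimal sketch in C with hypothetical names: the booleans stand in for vm_swap_full(), for exclusive_swap_cache_delete() succeeding, and for only_swap_user(); none of this is actual kernel code.

```c
#include <stdbool.h>

/* Inputs to the swap-in decision (all stand-ins, see lead-in). */
struct fault_ctx {
    bool swap_full;     /* swap space is nearly exhausted */
    bool exclusive;     /* deleting from the swap cache would succeed */
    bool only_user;     /* we are the only user of the swap page */
    bool write_access;  /* the fault was a write */
    bool vma_writable;  /* the mapping allows writes */
};

/* Outcome: what the installed pte would look like, if installed at all. */
struct pte_model {
    bool installed;
    bool dirty;
    bool writable;
    bool removed_from_swap_cache;
};

struct pte_model swapin_decide(const struct fault_ctx *c)
{
    struct pte_model pte = { false, false, false, false };
    /* Only try to delete the swap cache entry when swap is full. */
    bool remove = c->swap_full && c->exclusive;

    if (remove || c->only_user) {
        pte.installed = true;
        pte.removed_from_swap_cache = remove;
        /* Dirty on removal (no backing copy left) or on a write fault. */
        pte.dirty = remove || c->write_access;
        pte.writable = c->vma_writable;
    }
    return pte;
}
```

A read fault by the only user on a writable mapping, with swap not full, comes out installed, writable and clean, with the swap cache entry left in place; a shared page falls through with installed == false.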
* Re: need help interpreting 'free' output.
2001-10-30 17:53 ` Linus Torvalds
@ 2001-10-30 18:16 ` Andrea Arcangeli
2001-10-30 18:28 ` Linus Torvalds
0 siblings, 1 reply; 19+ messages in thread
From: Andrea Arcangeli @ 2001-10-30 18:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, Frank Dekervel, Marcelo Tosatti, linux-kernel

On Tue, Oct 30, 2001 at 09:53:28AM -0800, Linus Torvalds wrote:
>
> On Tue, 30 Oct 2001, Andrea Arcangeli wrote:
> >
> > On Tue, Oct 30, 2001 at 09:28:29AM -0800, Linus Torvalds wrote:
> > > Does anybody see why we have to remove it from the swap cache at all?
> >
> > the only reason is to avoid wasting the swap space, so at least Rik's
> > vm_swap_full logic should be added to it.
>
> I agree, but that's true both for reads and writes, and then we want to

yes.

> delete it. So the logic might be something like
>
> 	remove = 0;
> 	if ((vm_swap_full() && (remove = exclusive_swap_cache_delete())) ||
> 	    only_swap_user()) {

I preferred the previous exclusive_swap_page logic. It couldn't race
because we had the lock on the page, it's equivalent and it looked cleaner
and simpler to me, we had to bother about the rest only if the page was
exclusive. Now this only_swap_user basically replaces the
exclusive_swap_cache check and you will end up doing double the work if the
vm is full and the page isn't exclusive, so both
exclusive_swap_cache_delete and only_swap_user will have to work and fail.

> 		pte = mk_pte(page, vma->vm_page_prot);
> 		if (remove || write_access)
> 			pte = pte_mkdirty(pte);
> 		if (vma->vm_page_prot & VM_WRITE)
> 			pte = pte_mkwrite(pte);
> 		install_pte();
> 		return;
> 	}
>
> ie we _remove_ it if we're low on swap entries and it is exclusive (that
> doesn't really save memory, but it allows us to re-use the swap entries
> for "better" pages), and we just re-use it without removing it if we're
> the only users (it doesn't even have to be a write access - we can do it
> even for reads, as if we're the only user we might as well just give the
> page to the process anyway - and let fork() do the thing it does in any
> case).
>
> Then we'll just trust the dirty bit when shared, like we always have done
> before anyway (we need to set it on removal, and we want to set it early
> on a write access to avoid unnecessary faults on architectures which do
> the dirty bit in software - that's why we have the "remove ||
> write_access" test there).

ok.

>
> > The only advantage of dirty swap cache persistence is that it will
> > maintain the same position on disk across a swapin/swapout cycle.
>
> Well, the _big_ advantage is not the persistence, but the fact that the
> page might be in-flight when the user wants to use it, and the swap cache
> is just busy. Right now we _wait_ for the write to complete, which is
> silly. We might as well just let the user start using the page (including
> writing more stuff to it), and later on write it again.

if we remove all write-swapins from the swap cache those pages cannot be
in flight, we cannot do I/O on anon memory if it's not in the swapcache or
we would race badly. all I/O to the swap space has to pass through the swap
cache to be safe.

So I don't see how an anonymous page can be in flight.

> So right now the "remove from swap cache" is actually a IO-serializing
> operation, and we're doing it for no really good reason.

I think this is not true. remove_from_swap_cache can be run only if:

1) we hold the lock on the page
2) this means all I/O is complete and so we can safely convert this
   non in-flight page to a clean anonymous page where any further
   I/O will be impossible

So I still think the only advantage is to keep the swap position persistent
across a swapin/swapout cycle.

Andrea
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: need help interpreting 'free' output.
2001-10-30 18:16 ` Andrea Arcangeli
@ 2001-10-30 18:28 ` Linus Torvalds
2001-10-30 18:58 ` Andrea Arcangeli
0 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2001-10-30 18:28 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Hugh Dickins, Frank Dekervel, Marcelo Tosatti, linux-kernel

On Tue, 30 Oct 2001, Andrea Arcangeli wrote:
> > delete it. So the logic might be something like
> >
> > 	remove = 0;
> > 	if ((vm_swap_full() && (remove = exclusive_swap_cache_delete())) ||
> > 	    only_swap_user()) {
>
> I preferred the previous exclusive_swap_page logic. It couldn't race
> because we had the lock on the page, it's equivalent

It is _not_ equivalent.

Think for five seconds about what you just wrote..

"It couldn't race because we had the lock on the page.."

In short: the old code needed to get the page lock. In fact, it needed to
get the page lock even for reads that don't need it at all - only because
there could be a write from another process that shared the swap page.

Ie we optimized for the very very uncommon case. Sharing swap pages is
uncommon in itself, and it only happens when they _really_ aren't accessed
over a fork() etc. In short, writing the code to deal with that by default
is the wrong optimization.

Now, the _common_ case is that the page is truly exclusive, and you don't
want to lock the page - because locking the page means that you can pause
for a _long_ time waiting for the page to be written out when there is IO
pending.

This is especially true since we need to get the swap device lock _anyway_,
so locking the page is (a) inefficient and (b) overkill.

The new re-org gets no new locks, and drops the page lock, allowing people
to do the optimization without holding the page locked, which in turn means
that you don't need to wait for potential IO to complete just to read a
value from a page that you already have in memory.

> if we remove all write-swapins from the swap cache those pages cannot be
> in flight,

What? The page is busy being written out by another process - the page is
locked but up-to-date. We have _no_ reason to not give it immediately. This
is something we do for all page cache pages - go read filemap_nopage() etc.
They don't wait for data that is up-to-date.

> So I don't see how an anonymous page can be in flight.

It's being swapped out. What's so hard to see about that? Look at
mm/vmscan.c: writepage-> swap_writepage(). The page is up-to-date but
locked (it's obviously up-to-date, or we wouldn't be able to write it out).
It won't be unlocked until the IO has completed, which is, under heavy swap
load, easily half a second.

Why do you want to wait for that?

Linus
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: need help interpreting 'free' output.
2001-10-30 18:28 ` Linus Torvalds
@ 2001-10-30 18:58 ` Andrea Arcangeli
2001-10-30 19:21 ` Linus Torvalds
0 siblings, 1 reply; 19+ messages in thread
From: Andrea Arcangeli @ 2001-10-30 18:58 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, Frank Dekervel, Marcelo Tosatti, linux-kernel

On Tue, Oct 30, 2001 at 10:28:29AM -0800, Linus Torvalds wrote:
> want to lock the page - because locking the page means that you can pause
> for a _long_ time waiting for the page to be written out when there is IO
> pending.

ok I see what you mean, I agree (going to merge those important bits into
my tree! :) however those locking bits have nothing to do with
exclusive_swap_page and the early cow I believe. exclusive_swap_page is
faster than remove_exclusive_swap_page + only_swap_page as said in the
earlier email and don't forget you somehow need the page lock too for
remove_exclusive_swap_page.

The magic word here is "_trylock_" after your wait_on_page if the page
wasn't uptodate, it's not that avoiding the early-cow or your
remove_exclusive_swap_cache will change anything (they only slow things
down).

So in short we only need to replace the lock_page with a TryLockPage
(plus your wait_on_page if page is not uptodate to catch the major
faults) and here we go, faster than pre5.

In previous emails I was thinking of major faults, of course the whole
optimization here is for the _minor_ faults where we don't need to block
and where pre5aa1 blocks and where pre5 vanilla doesn't block! Very good
point.

Andrea
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: need help interpreting 'free' output.
2001-10-30 18:58 ` Andrea Arcangeli
@ 2001-10-30 19:21 ` Linus Torvalds
2001-10-30 20:05 ` Andrea Arcangeli
0 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2001-10-30 19:21 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Hugh Dickins, Frank Dekervel, Marcelo Tosatti, linux-kernel

On Tue, 30 Oct 2001, Andrea Arcangeli wrote:
>
> So in short we only need to replace the lock_page with a TryLockPage
> (plus your wait_on_page if page is not uptodate to catch the major
> faults) and here we go, faster than pre5.

Wrong.

If _anybody_ accesses the page unlocked, you cannot do the swap_count() at
all, because then you don't have anything that serializes the accesses to
swap_count vs page_count any more.

Sure, it will _look_ like it is working (because 99.9% of the time we tend
to have exclusive pages anyway), but the fact is that the old scheme
_depended_ on swap_in getting the page lock - not just for testing, but for
everybody else who wasn't even _interested_ in testing, but just wanted to
increment the page count and decrement the swap count.

See?

Do you _now_ understand why pre5 does this atomically? It needs to test the
swap count _and_ the page count atomically under the same lock. The page
lock has NOTHING to do with anything. If we ever have any user that does
not take the page lock (and you now seem to realize why we want to have
such users), the pagelock is WORTHLESS, because suddenly it doesn't end up
protecting the counts at all.

So making it a trylock doesn't help. See?

Linus
^ permalink raw reply [flat|nested] 19+ messages in thread
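The point about testing both counts under one lock can be illustrated outside the kernel. A hedged sketch in C using a pthread mutex as a stand-in for whatever single lock covers both counters; the names are hypothetical and this does not claim to mirror the pre5 locking.

```c
#include <pthread.h>
#include <stdbool.h>

/* Two counters that are only meaningful when read together. */
struct counted_page {
    pthread_mutex_t lock;   /* the one lock every updater and tester takes */
    int swap_count;
    int page_count;
};

/* Another sharer swaps the page in: it takes a page reference and
 * drops its swap reference as a single step under the lock. */
void swapin_share(struct counted_page *p)
{
    pthread_mutex_lock(&p->lock);
    p->page_count++;
    p->swap_count--;
    pthread_mutex_unlock(&p->lock);
}

/* Exclusivity test: both counts are sampled under the same lock,
 * so no update can slip in between the two reads. */
bool exclusive_snapshot(struct counted_page *p)
{
    bool excl;

    pthread_mutex_lock(&p->lock);
    excl = (p->swap_count == 1 && p->page_count == 2);
    pthread_mutex_unlock(&p->lock);
    return excl;
}
```

If a tester instead read page_count first and swap_count later with no common lock, a swapin_share() running in between could leave it holding page_count == 2 and swap_count == 1, and it would wrongly conclude the page is exclusive. A lock that only some of the updaters take does not prevent that.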
* Re: need help interpreting 'free' output.
2001-10-30 19:21 ` Linus Torvalds
@ 2001-10-30 20:05 ` Andrea Arcangeli
2001-10-30 20:25 ` Linus Torvalds
0 siblings, 1 reply; 19+ messages in thread
From: Andrea Arcangeli @ 2001-10-30 20:05 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, Frank Dekervel, Marcelo Tosatti, linux-kernel

On Tue, Oct 30, 2001 at 11:21:46AM -0800, Linus Torvalds wrote:
>
> On Tue, 30 Oct 2001, Andrea Arcangeli wrote:
> >
> > So in short we only need to replace the lock_page with a TryLockPage
> > (plus your wait_on_page if page is not uptodate to catch the major
> > faults) and here we go, faster than pre5.
>
> Wrong.
>
> If _anybody_ accesses the page unlocked, you cannot do the swap_count() at
> all, because then you don't have anything that serializes the accesses to
> swap_count vs page_count any more.

incidentally if trylock fails do_wp_page doesn't even try to check the
swap count, it just leaves the swap cache there. same thing do_swap_page
can do at the early-cow stage. this is the only point I'm making.

and as said if you want to do any remove_exclusive_swap_page() in
do_swap_page as you claimed in earlier email you also need to get the page
lock.

As far as I can tell here the magic key is "trylock" and nothing else, it's
not that the remove_exclusive_swap_page or the avoidance of the early-cow
per se can make any difference (let's ignore swapoff) except running
slower, here the only improvement during swapout load is that you're
delegating the work of remove_exclusive_swap_page to do_wp_page that will
do a trylock instead of a lock_page as far as I can tell. This is the only
point I'm making.

Go ahead and implement this thing in do_swap_page:

	remove = 0;
	if ((vm_swap_full() && (remove = exclusive_swap_cache_delete())) ||
	    only_swap_user()) {
		pte = mk_pte(page, vma->vm_page_prot);
		if (remove || write_access)
			pte = pte_mkdirty(pte);
		if (vma->vm_page_prot & VM_WRITE)
			pte = pte_mkwrite(pte);
		install_pte();
		return;
	}

and you'll find yourself grabbing the page lock somehow first in the
do_swap_page path, or exclusive_swap_cache_delete will obviously BUG()
on you.

This is why I'm saying the real magic is to convert the lock_page of pre4
into a TryLockPage, all other changes are not interesting in real load and
I obviously agree that it's a very good idea to fix the minor faults, that
in pre4 (and all previous kernels including all -ac and -aa) are running as
slow as major faults!

Now about the real need of exclusive_swap_cache_delete compared to
exclusive_swap_page I need to think a little more about it to be sure. In
short previously we ran exclusive_swap_page only with the page lock,
page->buffers is constant if the page is locked. And swap count and page
count _can't_ increase under us if the page happens to be exclusive once.
This was the previous rule at least, but as usual there's the swapoff evil
coming out and doing the lookup on an exclusive swap page... Hugh may
provide more hints on this case.

Andrea
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: need help interpreting 'free' output. 2001-10-30 20:05 ` Andrea Arcangeli @ 2001-10-30 20:25 ` Linus Torvalds 0 siblings, 0 replies; 19+ messages in thread From: Linus Torvalds @ 2001-10-30 20:25 UTC (permalink / raw) To: Andrea Arcangeli Cc: Hugh Dickins, Frank Dekervel, Marcelo Tosatti, linux-kernel On Tue, 30 Oct 2001, Andrea Arcangeli wrote: > > incidentally if trylock fails do_wp_page doesn't even try to check the > swap count, it just lefts the swap cache there. same thing do_swap_page > can do at the early-cow stage. this is the only point I'm making. Yes. At some point we need to lock the page _if_ we actually decide we have to do something with it. The current strategy is along the lines: if we can obviously share it, let's do so, but let's not wait for it to be unsharable. > Go ahead and implement this thing in do_swap_page: > > remove = 0; > if ((vm_swap_full() && (remove = exclusive_swap_cache_delete())) || > only_swap_user()) { > pte = mk_pte(page, vma->vm_page_prot); > if (remove || write_access) > pte = pte_mkdirty(pte); > if (vma->vm_page_prot & VM_WRITE) > pte = pte_mkwrite(pte); > install_pte(); > return; > } > > and you'll find yourself grabbing the page lock somehow first in the > do_swap_page path, or exclusive_swap_cache_delete will obviously BUG() > on you. We'll trylock it yes, but that has nothing to do with the page count protections. We'll trylock it if we end up _deleting_ the page, but not for count reasons, but because deletion needs the lock in order to wait for pages. And realize that that is the really rare case, where we really don't care for performance - we've just realized that we don't even have enough swap for the kind of load the machine is under. So your point is that the "we're out of swap space" case is a bit slower because we potentially take the swapspace spinlocks twice? Sure. But look at the fast paths: no waiting anywhere, and no unnecessary locks. 
> This is why I'm saying the real magic is to convert the lock_page of pre4
> in a TryLockPage, all other changes are not interesting in real load

NO NO NO.

Read my mails again.

The page lock used to protect the integrity of the "swap_count()" test
(which is part of the old "exclusive_swap_page()"). The rule was: the swap
count cannot change on a page when it is locked.

Making the lock_page() be a TryLockPage, and doing the "swap_free()"
without holding the page lock means that that integrity NO LONGER EXISTS.

Which means that the old "exclusive_swap_page()" DOES NOT WORK RELIABLY.
It tested "page_count()" and "swap_count()" in ways that were no longer
guaranteed to be meaningful - swap_count() could go down to 1 _after_
somebody else had incremented "page_count()" on another CPU due to a
swap-in of another process that shared the swap entry (or even another
thread on the same MM - we don't hold the page table spinlock there).

Do you get it now? By making the unconditional "always lock the page on
swap-in" be a "try to lock the page if you _need_ to",
exclusive_swap_page() no longer worked, and had to be gotten rid of or at
least changed to do the right thing. Considering that all users of the
function also wanted to remove the page, the change was obvious.

Might we want to split it up differently if we do the "only_user()"?
Maybe. But _please_ realize that the changes in pre5 are correctness
fixes, not some random movement of code.

Linus
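
[Editor's sketch] The argument above is that once the page lock is only
tried rather than always taken, the "is this page exclusive?" test and the
removal from the swap cache have to happen as one step under the lock,
otherwise the counts can change between test and action. The toy model
below shows only that shape. It is not kernel code: struct toy_page, its
fields, the toy_* functions and the "one user, one swap reference" rule
are all made up for illustration, and the swap-space spinlocks mentioned
in the thread are not modelled. The one thing taken from the kernel is the
convention that a trylock returns nonzero when the lock was already held.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; not the kernel's definition. */
struct toy_page {
	atomic_flag locked;	/* plays the role of the page lock bit */
	int page_count;		/* tasks that reference the page */
	int swap_count;		/* references to the page's swap entry */
	bool in_swap_cache;
};

/* Trylock in the TryLockPage() style: true means "was already locked". */
bool toy_try_lock(struct toy_page *page)
{
	return atomic_flag_test_and_set(&page->locked);
}

void toy_unlock(struct toy_page *page)
{
	atomic_flag_clear(&page->locked);
}

/*
 * Test for exclusivity and remove the page from the swap cache in a
 * single step, both under the page lock. Returns true if the page was
 * removed and now belongs to the caller alone. Returns false, without
 * ever waiting, if the lock is busy or the page is shared; the caller
 * is then expected to leave the swap cache entry in place.
 */
bool toy_exclusive_swap_cache_delete(struct toy_page *page)
{
	bool exclusive;

	assert(page != NULL);

	if (toy_try_lock(page))
		return false;	/* somebody else holds the lock */

	exclusive = page->in_swap_cache &&
		    page->page_count == 1 &&
		    page->swap_count == 1;
	if (exclusive)
		page->in_swap_cache = false;

	toy_unlock(page);
	return exclusive;
}
```

With a page that has one user and one swap reference, the call returns
true and clears in_swap_cache. With swap_count == 2, or with the lock
already held by someone else, it returns false and leaves the page
untouched, which corresponds to the "do not wait, just leave the swap
cache there" fallback discussed above.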
* Re: need help interpreting 'free' output.
2001-10-30 17:28 ` Linus Torvalds
2001-10-30 17:39 ` Andrea Arcangeli
@ 2001-10-30 18:05 ` Eric W. Biederman
1 sibling, 0 replies; 19+ messages in thread
From: Eric W. Biederman @ 2001-10-30 18:05 UTC
To: Linus Torvalds
Cc: Andrea Arcangeli, Hugh Dickins, Frank Dekervel, Marcelo Tosatti,
	linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:

> HOWEVER, _then_ I started wondering about whether the thing needs to be
> removed from the swap cache at all, and came to the conclusion that for
> the only case we really care about (and the only case where we _can_
> re-use the swap cache page), we don't actually need to remove it from the
> cache in the first place.

There is a second case, though you may be handling it differently now.
Typically the case is swap < RAM. But basically, when we don't have
enough swap pages, it pays to drop pages from the swap cache. So, in as
many places as we can, figuring out how to drop swap pages when the swap
space is practically full is important.

The other alternative implementation is to create a logical backing store
for anonymous pages (so they don't need a presence in the page table), and
then we could just walk that backing store and free up swap space on
demand. Though if you can put anonymous pages in the page cache now, a
variation on that idea may be possible. We don't want to remove the swap
from pages that aren't in RAM.

> Does anybody see why we have to remove it from the swap cache at all?

Not just for cow.

Eric
* Re: need help interpreting 'free' output.
2001-10-30 16:52 ` Linus Torvalds
2001-10-30 17:06 ` Andrea Arcangeli
@ 2001-10-30 20:47 ` David S. Miller
1 sibling, 0 replies; 19+ messages in thread
From: David S. Miller @ 2001-10-30 20:47 UTC
To: torvalds
Cc: hugh, Frank.dekervel, andrea, marcelo, linux-kernel

   From: Linus Torvalds <torvalds@transmeta.com>
   Date: Tue, 30 Oct 2001 08:52:58 -0800 (PST)

   My _preferred_ approach would actually be to move the slab pages to
   the LRU list too, and have a special "slab" address space (we don't
   need to actually hash them, we just make page->mapping point to it),
   and have the cache shrink be done naturally as part of writepage().

This is a cool idea. So when a SLAB block gets allocated from, we
"reference" the underlying page?

Franks a lot,
David S. Miller
davem@redhat.com
* Re: need help interpreting 'free' output.
2001-10-30 16:07 ` Hugh Dickins
2001-10-30 16:51 ` Andrea Arcangeli
2001-10-30 16:52 ` Linus Torvalds
@ 2001-10-30 18:11 ` Frank Dekervel
2 siblings, 0 replies; 19+ messages in thread
From: Frank Dekervel @ 2001-10-30 18:11 UTC
To: linux-kernel

On Tuesday 30 October 2001 17:07, Hugh Dickins wrote:
> I'm fairly sure /proc/slabinfo will show large inode_cache and large
> dentry_cache: which is natural after updatedb, nothing wrong with that.

indeed.

before updatedb:
inode_cache        10594  10605    512   1515   1515    1
dentry_cache       18239  18240    128    608    608    1

after:
inode_cache       220883 220913    512  31558  31559    1
dentry_cache      229471 229500    128   7650   7650    1

but i guess this comes a bit late :)

greetings,
frank
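
[Editor's sketch] The slabinfo rows above can be turned into a rough
memory figure. The sketch below assumes the 2.4-era /proc/slabinfo column
order (name, active objects, total objects, object size, active slabs,
total slabs, pages per slab) and a 4 KiB page size; neither is stated in
the thread, so treat both as assumptions. struct slab_row, slab_kib() and
total_kib() are hypothetical helper names, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size (4 KiB, as on i386); not given in the thread. */
#define ASSUMED_PAGE_BYTES 4096UL

/* One /proc/slabinfo row, in the assumed 2.4 column order. */
struct slab_row {
	const char *name;
	unsigned long active_objs;
	unsigned long total_objs;
	unsigned long obj_size;
	unsigned long active_slabs;
	unsigned long total_slabs;
	unsigned long pages_per_slab;
};

/* Rows copied from Frank's mail. */
const struct slab_row before_updatedb[] = {
	{ "inode_cache",  10594,  10605, 512, 1515, 1515, 1 },
	{ "dentry_cache", 18239,  18240, 128,  608,  608, 1 },
};

const struct slab_row after_updatedb[] = {
	{ "inode_cache",  220883, 220913, 512, 31558, 31559, 1 },
	{ "dentry_cache", 229471, 229500, 128,  7650,  7650, 1 },
};

/* Memory held by one cache in KiB: each slab occupies whole pages. */
unsigned long slab_kib(const struct slab_row *row)
{
	assert(row->pages_per_slab > 0);
	return row->total_slabs * row->pages_per_slab *
	       ASSUMED_PAGE_BYTES / 1024;
}

/* Sum of slab_kib() over n rows. */
unsigned long total_kib(const struct slab_row *rows, size_t n)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += slab_kib(&rows[i]);
	return sum;
}
```

Under those assumptions, total_kib(before_updatedb, 2) gives 8492 KiB and
total_kib(after_updatedb, 2) gives 156836 KiB (about 153 MiB, of which
inode_cache alone is 126236 KiB). That accounts for a large part of the
growth in the "-/+ buffers/cache" used column between the two 'free' runs
(22112 KiB to 221976 KiB), which fits Hugh's explanation.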
Thread overview: 19+ messages
2001-10-30 11:32 need help interpreting 'free' output Frank Dekervel
2001-10-30 11:46 ` Mike Fedyk
2001-10-30 14:02 ` Frank Dekervel
2001-10-30 16:07 ` Hugh Dickins
2001-10-30 16:51 ` Andrea Arcangeli
2001-10-30 16:52 ` Linus Torvalds
2001-10-30 17:06 ` Andrea Arcangeli
2001-10-30 17:28 ` Linus Torvalds
2001-10-30 17:39 ` Andrea Arcangeli
2001-10-30 17:53 ` Linus Torvalds
2001-10-30 18:16 ` Andrea Arcangeli
2001-10-30 18:28 ` Linus Torvalds
2001-10-30 18:58 ` Andrea Arcangeli
2001-10-30 19:21 ` Linus Torvalds
2001-10-30 20:05 ` Andrea Arcangeli
2001-10-30 20:25 ` Linus Torvalds
2001-10-30 18:05 ` Eric W. Biederman
2001-10-30 20:47 ` David S. Miller
2001-10-30 18:11 ` Frank Dekervel