* MM patches against 2.5.31
From: Andrew Morton @ 2002-08-22 2:29 UTC (permalink / raw)
To: lkml, linux-mm@kvack.org

I've uploaded a rollup of pending fixes and feature work
against 2.5.31 to

   http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/

The rolled up patch there is suitable for ongoing testing and
development.  The individual patches are in the broken-out/
directory and should all be documented.

broken-out/linus.patch
  Incremental BK patch from Linus' tree

broken-out/page_reserved.patch
  Test PageReserved in pagevec_release()

broken-out/scsi_hack.patch
  Fix block-highmem for scsi

broken-out/page_cache_release_lru_fix.patch
  Fix a race between __page_cache_release() and shrink_cache().

broken-out/page_cache_release_fix.patch
  Fix __page_cache_release() bugs

broken-out/mvm.patch
  Fix vmalloc bugs

broken-out/pte-chain-fix.patch
  Fix a VM lockup on uniprocessors

broken-out/func-fix.patch
  gcc-2.91.66 does not support __func__

broken-out/ext3-htree.patch
  Indexed directories for ext3

broken-out/rmap-mapping-BUG.patch
  Fix a BUG_ON(page->mapping == NULL) in try_to_unmap()

broken-out/misc.patch
  misc fixlets

broken-out/tlb-speedup.patch
  Reduce typical global TLB invalidation frequency by 35%

broken-out/buffer-slab-align.patch
  Don't align the buffer_head slab on hardware cacheline boundaries

broken-out/zone-rename.patch
  Rename zone_struct->zone, zonelist_struct->zonelist.  Remove zone_t,
  zonelist_t.

broken-out/per-zone-lru.patch
  Per-zone page LRUs

broken-out/per-zone-lock.patch
  Per-zone LRU list locking

broken-out/l1-max-size.patch
  Infrastructure for determining the maximum L1 cache size which the
  kernel may have to support.

broken-out/zone-lock-alignment.patch
  Pad struct zone to ensure that the lru and buddy locks are in
  separate cachelines.
broken-out/put_page_cleanup.patch
  Clean up put_page() and page_cache_release().

broken-out/anon-batch-free.patch
  Batched freeing and de-LRUing of anonymous pages

broken-out/writeback-sync.patch
  Writeback fixes and tuneups

broken-out/ext3-inode-allocation.patch
  Fix an ext3 deadlock

broken-out/ext3-o_direct.patch
  O_DIRECT support for ext3.

broken-out/jfs-bio.patch
  Convert JFS to use direct-to-BIO I/O

broken-out/discontig-paddr_to_pfn.patch
  Convert page pointers into pfns for i386 NUMA

broken-out/discontig-setup_arch.patch
  Rework setup_arch() for i386 NUMA

broken-out/discontig-mem_init.patch
  Restructure mem_init for i386 NUMA

broken-out/discontig-i386-numa.patch
  discontigmem support for i386 NUMA

broken-out/cleanup-mem_map-1.patch
  Clean up lots of open-coded uses of mem_map[].  For ia32 NUMA

broken-out/zone-pages-reporting.patch
  Fix the boot-time reporting of each zone's available pages

broken-out/enospc-recovery-fix.patch
  Fix the __block_write_full_page() error path.

broken-out/fix-faults.patch
  Back out the initial work for atomic copy_*_user()

broken-out/bkl-consolidate.patch
  Consolidate per-arch lock_kernel() implementations.

broken-out/might_sleep.patch
  Infrastructure to detect sleep-inside-spinlock bugs

broken-out/spin-lock-check.patch
  spinlock/rwlock checking infrastructure

broken-out/atomic-copy_user.patch
  Support for atomic copy_*_user()

broken-out/kmap_atomic_reads.patch
  Use kmap_atomic() for generic_file_read()

broken-out/kmap_atomic_writes.patch
  Use kmap_atomic() for generic_file_write()

broken-out/config-PAGE_OFFSET.patch
  Configurable kernel/user memory split

broken-out/throttling-fix.patch
  Fix throttling of heavy write()rs.

broken-out/dirty-state-accounting.patch
  Make the global dirty memory accounting more accurate

broken-out/rd-cleanup.patch
  Cleanup and fix the ramdisk driver (doesn't work right yet)

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31
From: Christian Ehrhardt @ 2002-08-22 11:28 UTC (permalink / raw)
To: Andrew Morton; +Cc: lkml, linux-mm@kvack.org

On Wed, Aug 21, 2002 at 07:29:04PM -0700, Andrew Morton wrote:
>
> I've uploaded a rollup of pending fixes and feature work
> against 2.5.31 to
>
>    http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
>
> The rolled up patch there is suitable for ongoing testing and
> development.  The individual patches are in the broken-out/
> directory and should all be documented.

Sorry, but we still have the page release race in multiple places.
Look at the following (page starts with page_count == 1):

Processor 1                             Processor 2

refill_inactive: lines 378-395;
as page count == 1 we'll continue
with line 401.
                                        __pagevec_release: line 138 calls
                                        release_pages.
                                        release_pages: lines 100-111;
                                        put_page_testzero brings the page
                                        count to 0 and we'll continue at
                                        line 114.  Note that this may
                                        happen while another processor
                                        holds the lru lock, i.e. there is
                                        no point in checking for page
                                        count == 0 with the lru lock held,
                                        because the lru lock doesn't
                                        protect against decrements of the
                                        page count after the check.
line 401: page_cache_get resurrects
the page; page count is now 1.
lines 402-448; line 448 calls
__pagevec_release.
__pagevec_release: line 138 calls
release_pages.
release_pages: lines 100-111;
put_page_testzero brings the page
count back to 0 (!!!), i.e. we
continue at line 114.
lines 114-123: the page count == 0
check in line 123 succeeds and the
page is returned to the buddy
allocator.
                                        lines 114-123: the page count == 0
                                        check in line 123 succeeds, i.e.
                                        the page is returned to the buddy
                                        allocator a second time.
===> BOOM

Neither the lru lock nor any of the page count == 0 checks can prevent
this from happening.

  regards   Christian

-- 
THAT'S ALL FOLKS!
* Re: MM patches against 2.5.31
From: Andrew Morton @ 2002-08-26 1:52 UTC (permalink / raw)
To: Christian Ehrhardt; +Cc: lkml, linux-mm@kvack.org

Christian Ehrhardt wrote:
>
> On Wed, Aug 21, 2002 at 07:29:04PM -0700, Andrew Morton wrote:
> >
> > I've uploaded a rollup of pending fixes and feature work
> > against 2.5.31 to
> >
> >    http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
> >
> > The rolled up patch there is suitable for ongoing testing and
> > development.  The individual patches are in the broken-out/
> > directory and should all be documented.
>
> Sorry, but we still have the page release race in multiple places.
> Look at the following (page starts with page_count == 1):

So we do.  It's a hugely improbable race, so there's no huge rush
to fix it.  Looks like the same race is present in -ac kernels,
actually, if add_to_swap() fails.  Also perhaps 2.4 is exposed if
swap_writepage() is synchronous, and page reclaim races with
zap_page_range.  ho-hum.

What I'm inclined to do there is to change __page_cache_release()
to not attempt to free the page at all.  Just let it sit on the
LRU until page reclaim encounters it.  With the anon-free-via-pagevec
patch, very, very, very few pages actually get their final release in
__page_cache_release() - zero on uniprocessor, I expect.

And change pagevec_release() to take the LRU lock before dropping
the refcount on the pages.

That means there will need to be two flavours of pagevec_release():
one which expects the pages to become freeable (and takes the LRU
lock in anticipation of this), and one which doesn't expect the
pages to become freeable.  The latter will leave the occasional
zero-count page on the LRU, as above.

Sound sane?
* Re: MM patches against 2.5.31
From: Christian Ehrhardt @ 2002-08-26 9:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: lkml, linux-mm@kvack.org

On Sun, Aug 25, 2002 at 06:52:55PM -0700, Andrew Morton wrote:
> Christian Ehrhardt wrote:
> >
> > On Wed, Aug 21, 2002 at 07:29:04PM -0700, Andrew Morton wrote:
> > >
> > > I've uploaded a rollup of pending fixes and feature work
> > > against 2.5.31 to
> > >
> > >    http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
> > >
> > > The rolled up patch there is suitable for ongoing testing and
> > > development.  The individual patches are in the broken-out/
> > > directory and should all be documented.
> >
> > Sorry, but we still have the page release race in multiple places.
> > Look at the following (page starts with page_count == 1):
>
> So we do.  It's a hugely improbable race, so there's no huge rush
> to fix it.

Actually the race seems to happen in real life (it does explain the
pte.chain != NULL BUG) and it is not that improbable with preempt.

> Looks like the same race is present in -ac kernels,
> actually, if add_to_swap() fails.  Also perhaps 2.4 is exposed if
> swap_writepage() is synchronous, and page reclaim races with
> zap_page_range.  ho-hum.

I didn't check each kernel, but I agree that most of the recent
kernels have the potential for this race.  Your tree just happened
to be the one where I found it first.

> What I'm inclined to do there is to change __page_cache_release()
> to not attempt to free the page at all.  Just let it sit on the
> LRU until page reclaim encounters it.  With the anon-free-via-pagevec
> patch, very, very, very few pages actually get their final release in
> __page_cache_release() - zero on uniprocessor, I expect.
It's not just a problem with __page_cache_release, but yes, it seems
to be an SMP-only race.

> And change pagevec_release() to take the LRU lock before dropping
> the refcount on the pages.
>
> That means there will need to be two flavours of pagevec_release():
> one which expects the pages to become freeable (and takes the LRU
> lock in anticipation of this).  And one which doesn't expect the
> pages to become freeable.  The latter will leave the occasional
> zero-count page on the LRU, as above.
>
> Sound sane?

So the rules would be:

* If you bring the page count to zero, call __free_pages_ok unless
  the page is on the lru.
* If someone (reclaim) walking the lru finds a page with page count
  zero, remove it from the lru and call __free_pages_ok.

This requires that ANYTHING that ends up calling put_page_testzero
must happen under the lru lock.  This doesn't sound like a good idea,
but it seems to be possible to do it race free.

I'd actually go for the following (patch, in an it-compiles state but
otherwise untested, below to illustrate what I'm talking about).
The basic idea is to move the logic into page_cache_get.  The rules
would be:

1. If you bring the page count to zero (using put_page_testzero), it
   is your job to call __free_pages_ok eventually.  Before doing so,
   make sure that the page is no longer on the lru.
2. You may only call page_cache_get to duplicate an existing
   reference, i.e. page_cache_get could be made to
   BUG_ON(page_count(page) == 0).
3. If you got a pointer to a page without holding a reference (this
   is only allowed to happen if we found the pointer on an lru list),
   call page_cache_get_lru UNDER the lru lock and just leave the
   page alone if that would resurrect the page.
page_cache_get_lru would basically look like this (implementation
details below):

	int page_cache_get_lru (struct page * page)
	{
		if (!atomic_inc_and_test_for_one (&page->count))
			return 1;
		atomic_dec (&page->count);
		return 0;
	}

Proof of correctness: A page is called dead if its page count reached
zero before (no matter what the page count currently is).  Once a page
is dead there can be at most two pointers to the page: one held by the
lru and the other one held by the thread freeing the page.  Any thread
accessing the page via the lru list will first call page_cache_get_lru
under the lru lock; the thread freeing the page will not read the page
count anymore.  As page_cache_get_lru will not resurrect the page,
there will never be a page count != 0 visible outside the lru lock on
a dead page.  This means that each thread trying to access the dead
page via the lru list will detect that the page is dead and leave it
alone.  It follows that each page is freed at most once.

Suppose a page could be leaked under these rules.  This would require
someone calling __put_page (or atomic_dec (&page->count)) to bring the
page count to zero on a not already dead page.  However, the only
place where this happens is in page_cache_get_lru, and there it only
happens if the page is already dead.

Now let's look at the ugly part: implementation.  The basic problem is
that we don't have an atomic_inc_and_test_for_one function, and it is
unlikely that we'll get one on all architectures.  The solution (and
this is the ugly part) is a change in the semantics of page->count:
the value -1 now means no reference, 0 means one reference, etc.
This way we can define put_page_testzero as

	atomic_add_negative (-1, &page->count);

and get_page_testzero as

	atomic_inc_and_test (&page->count);

Here's the promised (untested) patch against bk7 to illustrate what
I'm talking about:

diff -ur linux-2.5.31-bk7/include/linux/mm.h linux-2.5.31-cae/include/linux/mm.h
--- linux-2.5.31-bk7/include/linux/mm.h	Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/include/linux/mm.h	Sun Aug 25 21:40:57 2002
@@ -184,6 +184,11 @@
 /*
  * Methods to modify the page usage count.
  *
+ * NOTE: Real page counts start at -1 for no reference.  This is a hack
+ * to be able to implement get_page_testzero with the existing portable
+ * atomic functions.  The value exposed via set_page_count and page_count
+ * is (1+page->count).
+ *
  * What counts for a page usage:
  * - cache mapping (page->mapping)
  * - private data (page->private)
@@ -192,12 +197,25 @@
  *
  * Also, many kernel routines increase the page count before a critical
  * routine so they can be sure the page doesn't go away from under them.
+ *
+ * A special problem is the lru lists.  Presence on one of these lists
+ * does not increase the page count.  The FIRST thread that brings the
+ * page count back to zero is responsible to remove the page from the
+ * lru list and actually free it (__free_pages_ok).  This means that we
+ * can only get a reference to a page that is on a lru list, if this
+ * page is not already dead, i.e. about to be removed from the lru list.
+ * To do this we call get_page_testzero which will increment the page
+ * count and return true if we just resurrected the page, i.e. the real
+ * page->count is now zero indicating one user.  In this case we drop
+ * the reference again using __put_page.  Both calls must happen under
+ * the lru lock.
  */
 #define get_page(p)		atomic_inc(&(p)->count)
 #define __put_page(p)		atomic_dec(&(p)->count)
-#define put_page_testzero(p)	atomic_dec_and_test(&(p)->count)
-#define page_count(p)		atomic_read(&(p)->count)
-#define set_page_count(p,v)	atomic_set(&(p)->count, v)
+#define put_page_testzero(p)	atomic_add_negative(-1, &(p)->count)
+#define page_count(p)		(1+atomic_read(&(p)->count))
+#define set_page_count(p,v)	atomic_set(&(p)->count, v-1)
+#define get_page_testzero(p)	atomic_inc_and_test(&(p)->count)
 
 extern void FASTCALL(__page_cache_release(struct page *));
 #define put_page(p)							\
 	do {								\
diff -ur linux-2.5.31-bk7/include/linux/pagemap.h linux-2.5.31-cae/include/linux/pagemap.h
--- linux-2.5.31-bk7/include/linux/pagemap.h	Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/include/linux/pagemap.h	Sun Aug 25 21:56:30 2002
@@ -22,7 +22,43 @@
 #define PAGE_CACHE_MASK		PAGE_MASK
 #define PAGE_CACHE_ALIGN(addr)	(((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
 
-#define page_cache_get(x)	get_page(x)
+/*
+ * Get a reference to the page.  This function must not be called on
+ * a dead page, i.e. a page that has page count zero.  If the page is
+ * still on a lru_list use page_cache_get_lru instead.
+ */
+static inline void page_cache_get (struct page * page)
+{
+	BUG_ON(page_count(page) == 0);
+	get_page(page);
+}
+
+/*
+ * Try to get a reference to a page that we found on an lru list.
+ * The lru lists may contain pages with page count zero.  We must
+ * not take a reference to such a page because it is already about
+ * to be freed (once it is off the lru lists).  If we'd take a
+ * reference the page would eventually be freed twice.
+ *
+ * The return value is true if we successfully incremented the page
+ * count.
+ *
+ * This function must be called with the lru lock held.
+ */
+static inline int page_cache_get_lru (struct page * page)
+{
+	/*
+	 * Yes, there is a window where the page count is not zero
+	 * even though the page is dead.  This is one of the reasons
+	 * why the caller must hold the lru lock.  Due to the lru_lock
+	 * only the thread that is about to free the page can have
+	 * a reference to this page.  This thread will not test the
+	 * page count anymore.
+	 */
+	if (!get_page_testzero (page))
+		return 1;
+	__put_page (page);
+	return 0;
+}
 
 static inline void page_cache_release(struct page *page)
 {
diff -ur linux-2.5.31-bk7/mm/swap.c linux-2.5.31-cae/mm/swap.c
--- linux-2.5.31-bk7/mm/swap.c	Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/mm/swap.c	Sun Aug 25 11:28:55 2002
@@ -77,7 +77,6 @@
 void __page_cache_release(struct page *page)
 {
 	unsigned long flags;
-	BUG_ON(page_count(page) != 0);
 
 	spin_lock_irqsave(&_pagemap_lru_lock, flags);
 	if (TestClearPageLRU(page)) {
@@ -86,11 +85,8 @@
 		else
 			del_page_from_inactive_list(page);
 	}
-	if (page_count(page) != 0)
-		page = NULL;
 	spin_unlock_irqrestore(&_pagemap_lru_lock, flags);
-	if (page)
-		__free_pages_ok(page, 0);
+	__free_pages_ok(page, 0);
 }
 
 /*
@@ -131,8 +127,7 @@
 			else
 				del_page_from_inactive_list(page);
 		}
-		if (page_count(page) == 0)
-			pagevec_add(&pages_to_free, page);
+		pagevec_add(&pages_to_free, page);
 	}
 	if (lock_held)
 		spin_unlock_irq(&_pagemap_lru_lock);
diff -ur linux-2.5.31-bk7/mm/vmscan.c linux-2.5.31-cae/mm/vmscan.c
--- linux-2.5.31-bk7/mm/vmscan.c	Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/mm/vmscan.c	Sun Aug 25 21:44:07 2002
@@ -92,6 +92,10 @@
 	return page_count(page) - !!PagePrivate(page) == 2;
 }
 
+/*
+ * The caller must hold a reference to each page in the list.  We drop
+ * this reference if and only if we remove the page from the page_list.
+ */
 static /* inline */ int
 shrink_list(struct list_head *page_list, int nr_pages,
 		zone_t *classzone, unsigned int gfp_mask,
 		int priority, int *max_scan)
@@ -295,24 +299,23 @@
 	spin_lock_irq(&_pagemap_lru_lock);
 	while (max_scan > 0 && nr_pages > 0) {
 		struct page *page;
+		struct list_head * curr;
 		int n = 0;
 
-		while (n < nr_to_process && !list_empty(&inactive_list)) {
-			page = list_entry(inactive_list.prev, struct page, lru);
+		curr = inactive_list.prev;
+		while (n < nr_to_process && (&inactive_list != curr)) {
+			page = list_entry(curr, struct page, lru);
 
-			prefetchw_prev_lru_page(page, &inactive_list, flags);
+			prefetchw_prev_lru_page(page, curr, flags);
+			curr = curr->prev;
 
+			/* Is the page already dead ? */
+			if (!page_cache_get_lru (page))
+				continue;
 
 			if (!TestClearPageLRU(page))
 				BUG();
 			list_del(&page->lru);
-			if (page_count(page) == 0) {
-				/* It is currently in pagevec_release() */
-				SetPageLRU(page);
-				list_add(&page->lru, &inactive_list);
-				continue;
-			}
 			list_add(&page->lru, &page_list);
-			page_cache_get(page);
 			n++;
 		}
 		spin_unlock_irq(&_pagemap_lru_lock);
@@ -381,15 +384,19 @@
 	LIST_HEAD(l_active);	/* Pages to go onto the active_list */
 	struct page *page;
 	struct pagevec pvec;
+	struct list_head *curr;
 
 	lru_add_drain();
 	spin_lock_irq(&_pagemap_lru_lock);
-	while (nr_pages && !list_empty(&active_list)) {
-		page = list_entry(active_list.prev, struct page, lru);
-		prefetchw_prev_lru_page(page, &active_list, flags);
+	curr = active_list.prev;
+	while (nr_pages && (&active_list != curr)) {
+		page = list_entry(curr, struct page, lru);
+		prefetchw_prev_lru_page(page, curr, flags);
+		curr = curr->prev;
+		if (!page_cache_get_lru (page))
+			continue;
 		if (!TestClearPageLRU(page))
 			BUG();
-		page_cache_get(page);
 		list_move(&page->lru, &l_hold);
 		nr_pages--;
 	}

-- 
THAT'S ALL FOLKS!
* Re: MM patches against 2.5.31
From: Daniel Phillips @ 2002-08-26 14:22 UTC (permalink / raw)
To: Christian Ehrhardt, Andrew Morton; +Cc: lkml, linux-mm@kvack.org

On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> + * A special problem is the lru lists.  Presence on one of these lists
> + * does not increase the page count.

Please remind me... why should it not?

-- 
Daniel
* Re: MM patches against 2.5.31
From: Christian Ehrhardt @ 2002-08-26 15:29 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Andrew Morton, lkml, linux-mm@kvack.org

On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > + * A special problem is the lru lists.  Presence on one of these lists
> > + * does not increase the page count.
>
> Please remind me... why should it not?

Pages that are only on the lru but not referenced by anyone are of no
use, and we want to free them immediately.  If we leave them on the lru
list with a page count of 1, someone else will have to walk the lru
list and remove pages that are only on the lru.  One could argue that
try_to_free_pages could do this, but try_to_free_pages will process the
pages in lru order and push out other pages first.

The next suggestion that comes to mind is: let's have some magic in
page_cache_release that will remove the page from the lru list if it
is actually dead.  However, this raises the question: how do we detect
that a page is now dead?  The answer is something along the lines of

	if (put_page_testzero ()) {
		__free_pages_ok (page);
		return;
	}
	spin_lock_irq(pagemap_lru_lock);
	if (PageLRU(page) && (page_count(page) == 1)) {
		lru_cache_del (page);
		spin_unlock_irq(pagemap_lru_lock);
		page_cache_release (page);
		return;
	}
	spin_unlock_irq(pagemap_lru_lock);
	return;

The sad truth is that this solution has all the same races that we
have now, plus it makes the fast path (decreasing the page count to
something not zero) slower.  One problem in the above would be that
the last reference might well not be due to the lru cache, i.e. at
the time we call PageLRU(page) the page might have been freed by
another processor.
I know the idea is appealing (see one of my earlier mails on the
subject ;-) ) but it doesn't solve the problem.

  regards   Christian Ehrhardt

-- 
THAT'S ALL FOLKS!
* Re: MM patches against 2.5.31
From: Daniel Phillips @ 2002-08-26 17:56 UTC (permalink / raw)
To: Christian Ehrhardt; +Cc: Andrew Morton, lkml, linux-mm@kvack.org

On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > + * A special problem is the lru lists.  Presence on one of these lists
> > > + * does not increase the page count.
> >
> > Please remind me... why should it not?
>
> Pages that are only on the lru but not referenced by anyone are of no
> use, and we want to free them immediately.  If we leave them on the lru
> list with a page count of 1, someone else will have to walk the lru
> list and remove pages that are only on the lru.

I don't understand this argument.  Suppose lru list membership is worth
a page count of one.  Then anyone who finds a page by way of the lru
list can safely put_page_testzero and remove the page from the lru
list.  Anyone who finds a page by way of a page table can likewise
put_page_testzero and clear the pte, or remove the mapping and pass the
page to Andrew's pagevec machinery, which will eventually do the
put_page_testzero.  Anyone who removes a page from a radix tree will
also do a put_page_testzero.  Exactly one of those paths will result in
the page count reaching zero, which tells us nobody else holds a
reference and it's time for __free_pages_ok.  The page is thus freed
immediately as soon as there are no more references to it, and does
not hang around on the lru list.

Nobody has to lock against the page count.  Each put_page_testzero
caller only locks the data structure from which it's removing the
reference.

This seems so simple, what is the flaw?
-- 
Daniel
* Re: MM patches against 2.5.31
From: Andrew Morton @ 2002-08-26 19:24 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Christian Ehrhardt, lkml, linux-mm@kvack.org

Daniel Phillips wrote:
>
> On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > + * A special problem is the lru lists.  Presence on one of these lists
> > > > + * does not increase the page count.
> > >
> > > Please remind me... why should it not?
> >
> > Pages that are only on the lru but not referenced by anyone are of no
> > use, and we want to free them immediately.  If we leave them on the lru
> > list with a page count of 1, someone else will have to walk the lru
> > list and remove pages that are only on the lru.
>
> I don't understand this argument.  Suppose lru list membership is worth
> a page count of one.  Then anyone who finds a page by way of the lru
> list can safely put_page_testzero and remove the page from the lru
> list.  Anyone who finds a page by way of a page table can likewise
> put_page_testzero and clear the pte, or remove the mapping and pass the
> page to Andrew's pagevec machinery, which will eventually do the
> put_page_testzero.  Anyone who removes a page from a radix tree will
> also do a put_page_testzero.  Exactly one of those paths will result in
> the page count reaching zero, which tells us nobody else holds a
> reference and it's time for __free_pages_ok.  The page is thus freed
> immediately as soon as there are no more references to it, and does
> not hang around on the lru list.
>
> Nobody has to lock against the page count.
>
> This seems so simple, what is the flaw?

The flaw is in doing the put_page_testzero() outside of any locking
which would prevent other CPUs from finding and "rescuing" the
zero-refcount page.

CPU A:
	if (put_page_testzero()) {
		/* Here's the window */
		spin_lock(lru_lock);
		list_del(page->lru);

CPU B:
	spin_lock(lru_lock);
	page = list_entry(lru);
	page_cache_get(page);	/* If this goes from 0->1, we die */
	...
	page_cache_release(page);	/* double free */

2.5.31-mm1 has tests which make this race enormously improbable [1],
but it's still there.

It's that `put' outside the lock which is the culprit.  Normally, we
handle that with atomic_dec_and_lock() (inodes) or by manipulating
the refcount inside an area which has exclusion (page presence in
pagecache).

The sane, sensible and sucky way is to always take the lock:

	page_cache_release(page)
	{
		spin_lock(lru_lock);
		if (put_page_testzero(page)) {
			lru_cache_del(page);
			__free_pages_ok(page, 0);
		}
		spin_unlock(lru_lock);
	}

Because this provides exclusion from another CPU discovering the page
via the LRU.

So taking the above as the design principle, how can we speed it up?
How to avoid taking the lock in every page_cache_release()?  Maybe:

	page_cache_release(page)
	{
		if (page_count(page) == 1) {
			spin_lock(lru_lock);
			if (put_page_testzero(page)) {
				if (PageLRU(page))
					__lru_cache_del(page);
				__free_pages_ok(page);
			}
			spin_unlock(lru_lock);
		} else {
			atomic_dec(&page->count);
		}
	}

This is nice and quick, but racy.  Two concurrent page_cache_releases
will create a zero-ref unfreed page which is on the LRU.  These are
rare, and can be mopped up in page reclaim.

The above code will also work for pages which aren't on the LRU.  It
will take the lock unnecessarily for (say) slab pages.  But if we put
slab pages on the LRU then I suspect there are so few non-LRU pages
left that it isn't worth bothering about this.

[1] The race requires that the CPU running page_cache_release find a
    five-instruction window against the CPU running shrink_cache.  And
    that they be operating against the same page.  And that the CPU
    running __page_cache_release() then take an interrupt in a 3-4
    instruction window.  And that the interrupt take longer than the
    runtime of shrink_list.  And that the page be the first page in
    the pagevec.

It's a heat-death-of-the-universe race, but even if it were to be
ignored, the current code is too complex.
* Re: MM patches against 2.5.31
From: Daniel Phillips @ 2002-08-26 19:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: Christian Ehrhardt, lkml, linux-mm@kvack.org

On Monday 26 August 2002 21:24, Andrew Morton wrote:
> Daniel Phillips wrote:
> > I don't understand this argument.  Suppose lru list membership is worth
> > a page count of one.  Then anyone who finds a page by way of the lru
> > list can safely put_page_testzero and remove the page from the lru
> > list.  Anyone who finds a page by way of a page table can likewise
> > put_page_testzero and clear the pte, or remove the mapping and pass the
> > page to Andrew's pagevec machinery, which will eventually do the
> > put_page_testzero.  Anyone who removes a page from a radix tree will
> > also do a put_page_testzero.  Exactly one of those paths will result in
> > the page count reaching zero, which tells us nobody else holds a
> > reference and it's time for __free_pages_ok.  The page is thus freed
> > immediately as soon as there are no more references to it, and does
> > not hang around on the lru list.
> >
> > Nobody has to lock against the page count.  Each put_page_testzero
> > caller only locks the data structure from which it's removing the
> > reference.
> >
> > This seems so simple, what is the flaw?
>
> The flaw is in doing the put_page_testzero() outside of any locking
> which would prevent other CPUs from finding and "rescuing" the
> zero-refcount page.
>
> CPU A:
> 	if (put_page_testzero()) {
> 		/* Here's the window */
> 		spin_lock(lru_lock);
> 		list_del(page->lru);

According to my assumption that lru list membership is (should be)
worth one page count, if testzero triggers here the page is not on
the lru.

> CPU B:
> 	spin_lock(lru_lock);
> 	page = list_entry(lru);
> 	page_cache_get(page);	/* If this goes from 0->1, we die */

It can't.  You know that because you found the page on the lru, its
count must be at least one (again, according to the assumption above).

> 	...
> 	page_cache_release(page);	/* double free */

I'd like to jump in and chase more solutions with you, but the above
doesn't prove your point, so I'm not ready to reject this one yet.

-- 
Daniel
* Re: MM patches against 2.5.31 2002-08-26 19:24 ` Andrew Morton 2002-08-26 19:34 ` Daniel Phillips @ 2002-08-26 19:48 ` Christian Ehrhardt 2002-08-27 9:22 ` Christian Ehrhardt 2 siblings, 0 replies; 65+ messages in thread From: Christian Ehrhardt @ 2002-08-26 19:48 UTC (permalink / raw) To: Andrew Morton; +Cc: Daniel Phillips, lkml, linux-mm@kvack.org On Mon, Aug 26, 2002 at 12:24:50PM -0700, Andrew Morton wrote: > The flaw is in doing the put_page_testzero() outside of any locking Well, one could argue that doing the put_page_testzero outside of any locking is a feature. > [ ... ] > > 2.5.31-mm1 has tests which make this race enormously improbable [1], > but it's still there. Agreed. Both on the improbable and on the still there part. > It's that `put' outside the lock which is the culprit. Normally, we > handle that with atomic_dec_and_lock() (inodes) or by manipulating > the refcount inside an area which has exclusion (page presence in > pagecache). > > The sane, sensible and sucky way is to always take the lock: >
> page_cache_release(page)
> {
> 	spin_lock(lru_lock);
> 	if (put_page_testzero(page)) {
> 		lru_cache_del(page);
> 		__free_pages_ok(page, 0);
> 	}
> 	spin_unlock(lru_lock);
> }

That would probably solve the problem.

> Because this provides exclusion from another CPU discovering the page > via the LRU. > > So taking the above as the design principle, how can we speed it up? > How to avoid taking the lock in every page_cache_release()? Maybe: >
> page_cache_release(page)
> {
> 	if (page_count(page) == 1) {
> 		spin_lock(lru_lock);
> 		if (put_page_testzero(page)) {
> 			if (PageLRU(page))
> 				__lru_cache_del(page);
> 			__free_pages_ok(page);
> 		}
> 		spin_unlock(lru_lock);
> 	} else {
> 		atomic_dec(&page->count);
> 	}
> }

However, this is an incredibly bad idea if the page is NOT on the lru. Think of two instances of page_cache_release racing against each other. This could result in a leaked page which is not on the LRU. > This is nice and quick, but racy.
Two concurrent page_cache_releases > will create a zero-ref unfreed page which is on the LRU. These are > rare, and can be mopped up in page reclaim. > > The above code will also work for pages which aren't on the LRU. It will > take the lock unnecessarily for (say) slab pages. But if we put slab pages > on the LRU then I suspect there are so few non-LRU pages left that it isn't > worth bothering about this. No it will not work. See above. > [1] The race requires that the CPU running page_cache_release find a > five instruction window against the CPU running shrink_cache. And > that they be operating against the same page. And that the CPU > running __page_cache_release() then take an interrupt in a 3-4 > instruction window. And that the interrupt take longer than the > runtime for shrink_list. And that the page be the first page in > the pagevec. The interrupt can also be a preemption which might easily take long enough. But I agree that the race is now rare. The real problem is that the locking rules don't guarantee that there are no other racy paths that we both missed. regards Christian -- THAT'S ALL FOLKS! ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-26 19:24 ` Andrew Morton 2002-08-26 19:34 ` Daniel Phillips 2002-08-26 19:48 ` Christian Ehrhardt @ 2002-08-27 9:22 ` Christian Ehrhardt 2002-08-27 19:19 ` Andrew Morton 2 siblings, 1 reply; 65+ messages in thread From: Christian Ehrhardt @ 2002-08-27 9:22 UTC (permalink / raw) To: Andrew Morton; +Cc: Daniel Phillips, lkml, linux-mm@kvack.org On Mon, Aug 26, 2002 at 12:24:50PM -0700, Andrew Morton wrote: > The flaw is in doing the put_page_testzero() outside of any locking > which would prevent other CPUs from finding and "rescuing" the zero-refcount > page. >
> CPUA:
> 	if (put_page_testzero()) {
> 		/* Here's the window */
> 		spin_lock(lru_lock);
> 		list_del(page->lru);
>
> CPUB:
>
> 	spin_lock(lru_lock);
> 	page = list_entry(lru);
> 	page_cache_get(page);	/* If this goes from 0->1, we die */
> 	...
> 	page_cache_release(page);	/* double free */

So what we want CPUB to do instead is:

	spin_lock(lru_lock);
	page = list_entry(lru);

	START ATOMIC
	page_cache_get(page);
	res = (page_count(page) == 1);
	END ATOMIC

	if (res) {
		atomic_dec(&page->count);
		continue;	/* with next page */
	}
	...
	page_cache_release(page);

I.e. we want to detect _atomically_ that we just raised the page count from zero to one. My patch actually has a solution that implements the needed atomic operation above by means of the atomic functions that we currently have on all archs (it's called get_page_testzero and should probably be called get_page_testone). The more I think about this the more I think this is the way to go. regards Christian -- THAT'S ALL FOLKS! ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-27 9:22 ` Christian Ehrhardt @ 2002-08-27 19:19 ` Andrew Morton 0 siblings, 0 replies; 65+ messages in thread From: Andrew Morton @ 2002-08-27 19:19 UTC (permalink / raw) To: Christian Ehrhardt; +Cc: Daniel Phillips, lkml, linux-mm@kvack.org Christian Ehrhardt wrote: > > ... > So what we want CPUB to do instead is: >
> 	spin_lock(lru_lock);
> 	page = list_entry(lru);
>
> 	START ATOMIC
> 	page_cache_get(page);
> 	res = (page_count(page) == 1);
> 	END ATOMIC
>
> 	if (res) {
> 		atomic_dec(&page->count);
> 		continue;	/* with next page */
> 	}
> 	...
> 	page_cache_release(page);
>
> I.e. we want to detect _atomically_ that we just raised the page count > from zero to one. My patch actually has a solution that implements the > needed atomic operation above by means of the atomic functions that we > currently have on all archs (it's called get_page_testzero and > should probably be called get_page_testone). > The more I think about this the more I think this is the way to go. > Yes, I think that would provide a minimal fix to the problem. (I'd prefer a solution in which presence on the LRU contributes to page->count, because that means I can dump a load of expensive page_cache_get-inside-lru-lock instances, but whatever) You had:

-#define put_page_testzero(p)	atomic_dec_and_test(&(p)->count)
-#define page_count(p)		atomic_read(&(p)->count)
-#define set_page_count(p,v)	atomic_set(&(p)->count, v)
+#define put_page_testzero(p)	atomic_add_negative(-1, &(p)->count)
+#define page_count(p)		(1+atomic_read(&(p)->count))
+#define set_page_count(p,v)	atomic_set(&(p)->count, v-1)
+#define get_page_testzero(p)	atomic_inc_and_test(&(p)->count)

So the page count is actually offset by -1, and that is hidden by the macros. Fair enough. atomic_add_negative() is not implemented on quite a number of architectures (sparc64, mips, ppc, sh, cris, 68k, alpha..), so some legwork is needed there.
Looks to be pretty simple though; alpha, ppc and others already have atomic_add_return(). ^ permalink raw reply [flat|nested] 65+ messages in thread
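The offset-by-one trick hidden by those macros can be modelled in userspace with C11 atomics. This is a sketch under assumptions, not the kernel implementation: atomic_fetch_sub/atomic_fetch_add stand in for the kernel's atomic_add_negative() and atomic_inc_and_test(), and the stored value is the real count minus one, so a stored -1 means "no references":

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of the offset-by-one refcount scheme: the stored
 * atomic value is the real count minus one. */
struct page { atomic_int count; };

static void set_page_count(struct page *p, int v)
{
	atomic_store(&p->count, v - 1);	/* mirrors atomic_set(&count, v-1) */
}

static int page_count(struct page *p)
{
	return 1 + atomic_load(&p->count);
}

/* Drop a reference; true when the real count reached zero, i.e. the
 * stored value went negative (mirrors atomic_add_negative(-1, ...)). */
static bool put_page_testzero(struct page *p)
{
	return atomic_fetch_sub(&p->count, 1) - 1 < 0;
}

/* Take a reference; true when the real count was zero beforehand, i.e.
 * the stored value went from -1 to 0 (mirrors atomic_inc_and_test()).
 * This is the single atomic operation that lets the LRU scanner detect
 * that it just resurrected a dying page. */
static bool get_page_testzero(struct page *p)
{
	return atomic_fetch_add(&p->count, 1) + 1 == 0;
}
```

Note how page_count() adds the offset back, so callers never see the bias; only the scanner's get_page_testzero() result changes behaviour.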
* Re: MM patches against 2.5.31 2002-08-26 17:56 ` Daniel Phillips 2002-08-26 19:24 ` Andrew Morton @ 2002-08-26 20:00 ` Christian Ehrhardt 2002-08-26 20:09 ` Daniel Phillips 1 sibling, 1 reply; 65+ messages in thread From: Christian Ehrhardt @ 2002-08-26 20:00 UTC (permalink / raw) To: Daniel Phillips; +Cc: Andrew Morton, lkml, linux-mm@kvack.org On Mon, Aug 26, 2002 at 07:56:52PM +0200, Daniel Phillips wrote: > On Monday 26 August 2002 17:29, Christian Ehrhardt wrote: > > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote: > > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote: > > > > + * A special Problem is the lru lists. Presence on one of these lists > > > > + * does not increase the page count. > > > > > > Please remind me... why should it not? > > > > Pages that are only on the lru but not reference by anyone are of no > > use and we want to free them immediatly. If we leave them on the lru > > list with a page count of 1, someone else will have to walk the lru > > list and remove pages that are only on the lru. > > I don't understand this argument. Suppose lru list membership is worth a > page count of one. Then anyone who finds a page by way of the lru list can This does fix the double free problem but think of a typical anonymous page at exit. The page is on the lru list and there is one reference held by the pte. According to your scheme the pte reference would be freed (obviously due to the exit) but the page would remain on the lru list. However, there is no point in leaving the page on the lru list at all. If you think about who is going to remove the page from the lru you'll see the problem. regards Christian -- THAT'S ALL FOLKS! ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-26 20:00 ` Christian Ehrhardt @ 2002-08-26 20:09 ` Daniel Phillips 2002-08-26 20:58 ` Christian Ehrhardt 2002-08-26 21:31 ` Andrew Morton 0 siblings, 2 replies; 65+ messages in thread From: Daniel Phillips @ 2002-08-26 20:09 UTC (permalink / raw) To: Christian Ehrhardt; +Cc: Andrew Morton, lkml, linux-mm@kvack.org On Monday 26 August 2002 22:00, Christian Ehrhardt wrote: > On Mon, Aug 26, 2002 at 07:56:52PM +0200, Daniel Phillips wrote: > > On Monday 26 August 2002 17:29, Christian Ehrhardt wrote: > > > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote: > > > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote: > > > > > + * A special Problem is the lru lists. Presence on one of these lists > > > > > + * does not increase the page count. > > > > > > > > Please remind me... why should it not? > > > > > > Pages that are only on the lru but not reference by anyone are of no > > > use and we want to free them immediatly. If we leave them on the lru > > > list with a page count of 1, someone else will have to walk the lru > > > list and remove pages that are only on the lru. > > > > I don't understand this argument. Suppose lru list membership is worth a > > page count of one. Then anyone who finds a page by way of the lru list can > > This does fix the double free problem but think of a typical anonymous > page at exit. The page is on the lru list and there is one reference held > by the pte. According to your scheme the pte reference would be freed > (obviously due to the exit) but the page would remain on the lru list. > However, there is no point in leaving the page on the lru list at all. If you want the page off the lru list at that point (which you probably do) then you take the lru lock and put_page_testzero. > If you think about who is going to remove the page from the lru you'll > see the problem. Nope, still don't see it. 
Whoever hits put_page_testzero frees the page, secure in the knowledge that there are no other references to it. -- Daniel ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-26 20:09 ` Daniel Phillips @ 2002-08-26 20:58 ` Christian Ehrhardt 2002-08-27 16:48 ` Daniel Phillips 2002-08-26 21:31 ` Andrew Morton 1 sibling, 1 reply; 65+ messages in thread From: Christian Ehrhardt @ 2002-08-26 20:58 UTC (permalink / raw) To: Daniel Phillips; +Cc: Andrew Morton, lkml, linux-mm@kvack.org On Mon, Aug 26, 2002 at 10:09:38PM +0200, Daniel Phillips wrote: > On Monday 26 August 2002 22:00, Christian Ehrhardt wrote: > > On Mon, Aug 26, 2002 at 07:56:52PM +0200, Daniel Phillips wrote: > > > On Monday 26 August 2002 17:29, Christian Ehrhardt wrote: > > > > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote: > > > > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote: > > > > > > + * A special Problem is the lru lists. Presence on one of these lists > > > > > > + * does not increase the page count. > > > > > > > > > > Please remind me... why should it not? > > > > > > > > Pages that are only on the lru but not reference by anyone are of no > > > > use and we want to free them immediatly. If we leave them on the lru > > > > list with a page count of 1, someone else will have to walk the lru > > > > list and remove pages that are only on the lru. > > > > > > I don't understand this argument. Suppose lru list membership is worth a > > > page count of one. Then anyone who finds a page by way of the lru list can > > > > This does fix the double free problem but think of a typical anonymous > > page at exit. The page is on the lru list and there is one reference held > > by the pte. According to your scheme the pte reference would be freed > > (obviously due to the exit) but the page would remain on the lru list. > > However, there is no point in leaving the page on the lru list at all. > > If you want the page off the lru list at that point (which you probably do) > then you take the lru lock and put_page_testzero. Could you clarify what you mean with "at that point"? 
Especially, how do you plan to test for "this point"? Besides, it is illegal to use the page after put_page_testzero (unless put_page_testzero returns true). > > If you think about who is going to remove the page from the lru you'll > > see the problem. > > Nope, still don't see it. Whoever hits put_page_testzero frees the page, > secure in the knowledge that there are no other references to it. Well yes, but we cannot remove the page from the lru automatically at page_cache_release time if we follow your proposal. If you think we can, show me your implementation of page_cache_release and I'll show you where the races are (unless you do everything under the lru_lock of course). regards Christian -- THAT'S ALL FOLKS! ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-26 20:58 ` Christian Ehrhardt @ 2002-08-27 16:48 ` Daniel Phillips 2002-08-28 13:14 ` Christian Ehrhardt 0 siblings, 1 reply; 65+ messages in thread From: Daniel Phillips @ 2002-08-27 16:48 UTC (permalink / raw) To: Christian Ehrhardt; +Cc: Andrew Morton, lkml, linux-mm@kvack.org On Monday 26 August 2002 22:58, Christian Ehrhardt wrote: > On Mon, Aug 26, 2002 at 10:09:38PM +0200, Daniel Phillips wrote: > > On Monday 26 August 2002 22:00, Christian Ehrhardt wrote: > > > On Mon, Aug 26, 2002 at 07:56:52PM +0200, Daniel Phillips wrote: > > > > On Monday 26 August 2002 17:29, Christian Ehrhardt wrote: > > > > > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote: > > > > > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote: > > > > > > > + * A special Problem is the lru lists. Presence on one of these lists > > > > > > > + * does not increase the page count. > > > > > > > > > > > > Please remind me... why should it not? > > > > > > > > > > Pages that are only on the lru but not reference by anyone are of no > > > > > use and we want to free them immediatly. If we leave them on the lru > > > > > list with a page count of 1, someone else will have to walk the lru > > > > > list and remove pages that are only on the lru. > > > > > > > > I don't understand this argument. Suppose lru list membership is worth a > > > > page count of one. Then anyone who finds a page by way of the lru list can > > > > > > This does fix the double free problem but think of a typical anonymous > > > page at exit. The page is on the lru list and there is one reference held > > > by the pte. According to your scheme the pte reference would be freed > > > (obviously due to the exit) but the page would remain on the lru list. > > > However, there is no point in leaving the page on the lru list at all. > > > > If you want the page off the lru list at that point (which you probably do) > > then you take the lru lock and put_page_testzero. 
> > Could you clarify what you mean with "at that point"? Especially how > do you plan to test for "this point". Besides it is illegal to use > the page after put_page_testzero (unless put_page_testzero returns true). > > > If you think about who is going to remove the page from the lru you'll > > > see the problem. > > > > Nope, still don't see it. Whoever hits put_page_testzero frees the page, > > secure in the knowledge that there are no other references to it. > > Well yes, but we cannot remove the page from the lru automatically > at page_cache_release time if we follow your proposal. If you think we can, > show me your implementation of page_cache_release and I'll show > you where the races are (unless you do everything under the lru_lock > of course).

void page_cache_release(struct page *page)
{
	spin_lock(&pagemap_lru_lock);
	if (PageLRU(page) && page_count(page) == 2) {
		__lru_cache_del(page);
		atomic_dec(&page->count);
	}
	spin_unlock(&pagemap_lru_lock);
	if (put_page_testzero(page))
		__free_pages_ok(page, 0);
}

This allows the following benign race, with initial page count = 3:

	spin_lock(&pagemap_lru_lock);
	if (PageLRU(page) && page_count(page) == 2)	/* false */
	spin_unlock(&pagemap_lru_lock);

	spin_lock(&pagemap_lru_lock);
	if (PageLRU(page) && page_count(page) == 2)	/* false */
	spin_unlock(&pagemap_lru_lock);

	if (put_page_testzero(page))
		__free_pages_ok(page, 0);

	if (put_page_testzero(page))
		__free_pages_ok(page, 0);

Neither holder of a page reference sees the count at 2, and so the page is left on the lru with count = 1. This won't happen often and such pages will be recovered from the cold end of the list in due course. The important question is: can this code ever remove a page from the lru erroneously, leaving somebody holding a reference to a non-lru page? In other words, can the test PageLRU(page) && page_count(page) == 2 return a false positive?
Well, when this test is true we can account for both references: the one we own, and the one the lru list owns. Since we hold the lru lock, the latter won't change. Nobody else has the right to increment the page count, since they must inherit that right from somebody who holds a reference, and there are none. We could also do this:

void page_cache_release(struct page *page)
{
	if (page_count(page) == 2) {
		spin_lock(&pagemap_lru_lock);
		if (PageLRU(page) && page_count(page) == 2) {
			__lru_cache_del(page);
			atomic_dec(&page->count);
		}
		spin_unlock(&pagemap_lru_lock);
	}
	if (put_page_testzero(page))
		__free_pages_ok(page, 0);
}

Which avoids taking the lru lock sometimes in exchange for widening the hole through which pages can end up with count = 1 on the lru list. Let's run this through your race detector and see what happens. -- Daniel ^ permalink raw reply [flat|nested] 65+ messages in thread
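Daniel's second variant can be modelled in userspace (hypothetical code, not the kernel's: an atomic_flag spin stands in for pagemap_lru_lock, and LRU membership counts as one reference per his assumption, so count == 2 means "the caller plus the LRU"):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* A trivial spinlock standing in for pagemap_lru_lock. */
static atomic_flag pagemap_lru_lock = ATOMIC_FLAG_INIT;
static void lru_lock(void)   { while (atomic_flag_test_and_set(&pagemap_lru_lock)) ; }
static void lru_unlock(void) { atomic_flag_clear(&pagemap_lru_lock); }

struct page { atomic_int count; bool on_lru; int freed; };

static bool put_page_testzero(struct page *p)
{
	return atomic_fetch_sub(&p->count, 1) == 1;
}

static void __lru_cache_del(struct page *p)  { p->on_lru = false; }
static void __free_pages_ok(struct page *p)  { p->freed++; }

/* Model of the variant above: count == 2 can only mean the caller's
 * reference plus the one held by LRU membership, so it is safe to take
 * the page off the LRU before the final put. */
static void page_cache_release(struct page *p)
{
	if (atomic_load(&p->count) == 2) {
		lru_lock();
		if (p->on_lru && atomic_load(&p->count) == 2) {
			__lru_cache_del(p);
			atomic_fetch_sub(&p->count, 1);
		}
		lru_unlock();
	}
	if (put_page_testzero(p))
		__free_pages_ok(p);
}
```

In the common cases this behaves as the text describes: the last user pulls the page off the LRU and frees it, while an earlier user merely drops its reference and leaves the page alone. The benign race (both racers skipping the count == 2 branch) only appears under true concurrency, which this sequential sketch does not exercise.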
* Re: MM patches against 2.5.31 2002-08-27 16:48 ` Daniel Phillips @ 2002-08-28 13:14 ` Christian Ehrhardt 2002-08-28 17:18 ` Daniel Phillips 2002-08-28 20:41 ` Daniel Phillips 0 siblings, 2 replies; 65+ messages in thread From: Christian Ehrhardt @ 2002-08-28 13:14 UTC (permalink / raw) To: Daniel Phillips; +Cc: Andrew Morton, lkml, linux-mm@kvack.org On Tue, Aug 27, 2002 at 06:48:50PM +0200, Daniel Phillips wrote: > On Monday 26 August 2002 22:58, Christian Ehrhardt wrote: > > > Nope, still don't see it. Whoever hits put_page_testzero frees the page, > > > secure in the knowledge that there are no other references to it. > > > > Well yes, but we cannot remove the page from the lru automatically > > at page_cache_release time if we follow your proposal. If you think we can, > > show me your implementation of page_cache_release and I'll show > > you where the races are (unless you do everything under the lru_lock > > of course). >
> void page_cache_release(struct page *page)
> {
> 	spin_lock(&pagemap_lru_lock);
> 	if (PageLRU(page) && page_count(page) == 2) {
> 		__lru_cache_del(page);
> 		atomic_dec(&page->count);
> 	}
> 	spin_unlock(&pagemap_lru_lock);
> 	if (put_page_testzero(page))
> 		__free_pages_ok(page, 0);
> }
>
> This allows the following benign race, with initial page count = 3: > [ ... ] > Neither holder of a page reference sees the count at 2, and so the page > is left on the lru with count = 1. This won't happen often and such > pages will be recovered from the cold end of the list in due course.

Ok, agreed. I think this will work, but taking the lru lock each time is probably not a good idea.
> We could also do this:
>
> void page_cache_release(struct page *page)
> {
> 	if (page_count(page) == 2) {
> 		spin_lock(&pagemap_lru_lock);
> 		if (PageLRU(page) && page_count(page) == 2) {
> 			__lru_cache_del(page);
> 			atomic_dec(&page->count);
> 		}
> 		spin_unlock(&pagemap_lru_lock);
> 	}
> 	if (put_page_testzero(page))
> 		__free_pages_ok(page, 0);
> }
>
> Which avoids taking the lru lock sometimes in exchange for widening the > hole through which pages can end up with count = 1 on the lru list.

This sounds like something that is worth trying. I missed that one. Side note: The BUG in __pagevec_lru_del seems strange. refill_inactive or shrink_cache could have removed the page from the lru before __pagevec_lru_del acquired the lru lock. regards Christian -- THAT'S ALL FOLKS! ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-28 13:14 ` Christian Ehrhardt @ 2002-08-28 17:18 ` Daniel Phillips 2002-08-28 17:42 ` Andrew Morton 0 siblings, 1 reply; 65+ messages in thread From: Daniel Phillips @ 2002-08-28 17:18 UTC (permalink / raw) To: Christian Ehrhardt; +Cc: Andrew Morton, lkml, linux-mm@kvack.org On Wednesday 28 August 2002 15:14, Christian Ehrhardt wrote: > Side note: The BUG in __pagevec_lru_del seems strange. refill_inactive > or shrink_cache could have removed the page from the lru before > __pagevec_lru_del acquired the lru lock. It's suspect all right. If there's a chain of assumptions that proves the page is always on the lru at that point, I haven't seen it yet. -- Daniel ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-28 17:18 ` Daniel Phillips @ 2002-08-28 17:42 ` Andrew Morton 0 siblings, 0 replies; 65+ messages in thread From: Andrew Morton @ 2002-08-28 17:42 UTC (permalink / raw) To: Daniel Phillips; +Cc: Christian Ehrhardt, lkml, linux-mm@kvack.org Daniel Phillips wrote: > > On Wednesday 28 August 2002 15:14, Christian Ehrhardt wrote: > > Side note: The BUG in __pagevec_lru_del seems strange. refill_inactive > > or shrink_cache could have removed the page from the lru before > > __pagevec_lru_del acquired the lru lock. > > It's suspect all right. If there's a chain of assumptions that proves > the page is always on the lru at the point, I haven't seen it yet. Yeah. __pagevec_lru_del is only used by invalidate_inode_pages. A very simple solution is to just delete it. untested code: include/linux/pagevec.h | 7 ------- mm/filemap.c | 10 +++++----- mm/swap.c | 28 ---------------------------- 3 files changed, 5 insertions(+), 40 deletions(-) --- 2.5.32/mm/filemap.c~pagevec_lru_del Wed Aug 28 09:51:51 2002 +++ 2.5.32-akpm/mm/filemap.c Wed Aug 28 09:51:51 2002 @@ -116,10 +116,10 @@ void invalidate_inode_pages(struct inode struct list_head *head, *curr; struct page * page; struct address_space *mapping = inode->i_mapping; - struct pagevec lru_pvec; + struct pagevec pvec; head = &mapping->clean_pages; - pagevec_init(&lru_pvec); + pagevec_init(&pvec); write_lock(&mapping->page_lock); curr = head->next; @@ -143,8 +143,8 @@ void invalidate_inode_pages(struct inode __remove_from_page_cache(page); unlock_page(page); - if (!pagevec_add(&lru_pvec, page)) - __pagevec_lru_del(&lru_pvec); + if (!pagevec_add(&pvec, page)) + __pagevec_release(&pvec); continue; unlock: unlock_page(page); @@ -152,7 +152,7 @@ unlock: } write_unlock(&mapping->page_lock); - pagevec_lru_del(&lru_pvec); + pagevec_release(&pvec); } static int do_invalidatepage(struct page *page, unsigned long offset) --- 2.5.32/include/linux/pagevec.h~pagevec_lru_del Wed Aug 28 09:51:51 2002 +++ 
2.5.32-akpm/include/linux/pagevec.h Wed Aug 28 09:51:51 2002 @@ -18,7 +18,6 @@ void __pagevec_release(struct pagevec *p void __pagevec_release_nonlru(struct pagevec *pvec); void __pagevec_free(struct pagevec *pvec); void __pagevec_lru_add(struct pagevec *pvec); -void __pagevec_lru_del(struct pagevec *pvec); void lru_add_drain(void); void pagevec_deactivate_inactive(struct pagevec *pvec); @@ -69,9 +68,3 @@ static inline void pagevec_lru_add(struc if (pagevec_count(pvec)) __pagevec_lru_add(pvec); } - -static inline void pagevec_lru_del(struct pagevec *pvec) -{ - if (pagevec_count(pvec)) - __pagevec_lru_del(pvec); -} --- 2.5.32/mm/swap.c~pagevec_lru_del Wed Aug 28 09:51:51 2002 +++ 2.5.32-akpm/mm/swap.c Wed Aug 28 09:51:58 2002 @@ -214,34 +214,6 @@ void __pagevec_lru_add(struct pagevec *p } /* - * Remove the passed pages from the LRU, then drop the caller's refcount on - * them. Reinitialises the caller's pagevec. - */ -void __pagevec_lru_del(struct pagevec *pvec) -{ - int i; - struct zone *zone = NULL; - - for (i = 0; i < pagevec_count(pvec); i++) { - struct page *page = pvec->pages[i]; - struct zone *pagezone = page_zone(page); - - if (pagezone != zone) { - if (zone) - spin_unlock_irq(&zone->lru_lock); - zone = pagezone; - spin_lock_irq(&zone->lru_lock); - } - if (!TestClearPageLRU(page)) - BUG(); - del_page_from_lru(zone, page); - } - if (zone) - spin_unlock_irq(&zone->lru_lock); - pagevec_release(pvec); -} - -/* * Perform any setup for the swap system */ void __init swap_setup(void) . ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-28 13:14 ` Christian Ehrhardt 2002-08-28 17:18 ` Daniel Phillips @ 2002-08-28 20:41 ` Daniel Phillips 2002-08-28 21:03 ` Andrew Morton 1 sibling, 1 reply; 65+ messages in thread From: Daniel Phillips @ 2002-08-28 20:41 UTC (permalink / raw) To: Christian Ehrhardt; +Cc: Andrew Morton, lkml, linux-mm@kvack.org Going right back to basics, what do you suppose is wrong with the 2.4 strategy of always doing the lru removal in free_pages_ok? -- Daniel ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-28 20:41 ` Daniel Phillips @ 2002-08-28 21:03 ` Andrew Morton 2002-08-28 22:04 ` Daniel Phillips 0 siblings, 1 reply; 65+ messages in thread From: Andrew Morton @ 2002-08-28 21:03 UTC (permalink / raw) To: Daniel Phillips; +Cc: Christian Ehrhardt, lkml, linux-mm@kvack.org Daniel Phillips wrote: > > Going right back to basics, what do you suppose is wrong with the 2.4 > strategy of always doing the lru removal in free_pages_ok? That's equivalent to what we have at present, which is:

	if (put_page_testzero(page)) {
		/* window here */
		lru_cache_del(page);
		__free_pages_ok(page, 0);
	}

versus:

	spin_lock(lru_lock);
	page = list_entry(lru, ...);
	if (page_count(page) == 0)
		continue;
	/* window here */
	page_cache_get(page);
	page_cache_release(page);	/* double-free */

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-28 21:03 ` Andrew Morton @ 2002-08-28 22:04 ` Daniel Phillips 2002-08-28 22:39 ` Andrew Morton 0 siblings, 1 reply; 65+ messages in thread From: Daniel Phillips @ 2002-08-28 22:04 UTC (permalink / raw) To: Andrew Morton; +Cc: Christian Ehrhardt, lkml, linux-mm@kvack.org On Wednesday 28 August 2002 23:03, Andrew Morton wrote: > Daniel Phillips wrote: > > > > Going right back to basics, what do you suppose is wrong with the 2.4 > > strategy of always doing the lru removal in free_pages_ok? > > That's equivalent to what we have at present, which is: >
> 	if (put_page_testzero(page)) {
> 		/* window here */
> 		lru_cache_del(page);
> 		__free_pages_ok(page, 0);
> 	}
>
> versus:
>
> 	spin_lock(lru_lock);
> 	page = list_entry(lru, ...);
> 	if (page_count(page) == 0)
> 		continue;
> 	/* window here */
> 	page_cache_get(page);
> 	page_cache_release(page);	/* double-free */

Indeed it is. In 2.4.19 we have:

(vmscan.c: shrink_cache)			(page_alloc.c: __free_pages)

365	if (unlikely(!page_count(page)))
366		continue;
						444	if (!PageReserved(page) && put_page_testzero(page))
	[many twisty paths, all different]
511	/* effectively free the page here */
512	page_cache_release(page);
						445	__free_pages_ok(page, order);
						[free it again just to make sure]

So there's no question that the race is lurking in 2.4. I noticed several more paths besides the one above that look suspicious as well. The bottom line is, 2.4 needs a fix along the lines of my suggestion or Christian's, something that can actually be proved. It's a wonder that this problem manifests so rarely in practice. -- Daniel ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-28 22:04 ` Daniel Phillips @ 2002-08-28 22:39 ` Andrew Morton 2002-08-28 22:57 ` Daniel Phillips 0 siblings, 1 reply; 65+ messages in thread From: Andrew Morton @ 2002-08-28 22:39 UTC (permalink / raw) To: Daniel Phillips; +Cc: Christian Ehrhardt, lkml, linux-mm@kvack.org Daniel Phillips wrote: > > ... > So there's no question that the race is lurking in 2.4. I noticed several > more paths besides the one above that look suspicious as well. The bottom > line is, 2.4 needs a fix along the lines of my suggestion or Christian's, > something that can actually be proved. > > It's a wonder that this problem manifests so rarely in practice. I sort-of glanced through the 2.4 paths and it appears that in all of the places where it could do a page_cache_get/release, that would never happen because of other parts of the page state. Like: it can't be in pagecache, so we won't run writepage, and it can't have buffers, so we won't run try_to_release_page(). Of course, I might have missed a path. And, well, generally: ugh. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-28 22:39 ` Andrew Morton @ 2002-08-28 22:57 ` Daniel Phillips 0 siblings, 0 replies; 65+ messages in thread From: Daniel Phillips @ 2002-08-28 22:57 UTC (permalink / raw) To: Andrew Morton; +Cc: Christian Ehrhardt, lkml, linux-mm@kvack.org On Thursday 29 August 2002 00:39, Andrew Morton wrote: > Daniel Phillips wrote: > > > > ... > > So there's no question that the race is lurking in 2.4. I noticed several > > more paths besides the one above that look suspicious as well. The bottom > > line is, 2.4 needs a fix along the lines of my suggestion or Christian's, > > something that can actually be proved. > > > > It's a wonder that this problem manifests so rarely in practice. > > I sort-of glanced through the 2.4 paths and it appears that in all of the > places where it could do a page_cache_get/release, that would never happen > because of other parts of the page state. > > Like: it can't be in pagecache, so we won't run writepage, and > it can't have buffers, so we won't run try_to_release_page(). > > Of course, I might have missed a path. And, well, generally: ugh. I think it is happening. I just went searching through the archives on 'oops' and '2.4'. The first one I found was a 2.4.18-xfs (xfs related?) oops report, which fits the description nicely. The race I showed actually causes the page->count to go negative, avoiding a double free on a technicality. That doesn't make me feel much better about it. Have you got a BUG_ON(!page_count(page)) in put_page_testzero? I think we might see some action. -- Daniel ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-26 20:09 ` Daniel Phillips 2002-08-26 20:58 ` Christian Ehrhardt @ 2002-08-26 21:31 ` Andrew Morton 2002-08-27 3:42 ` Benjamin LaHaise 1 sibling, 1 reply; 65+ messages in thread From: Andrew Morton @ 2002-08-26 21:31 UTC (permalink / raw) To: Daniel Phillips; +Cc: Christian Ehrhardt, lkml, linux-mm@kvack.org Daniel Phillips wrote: > > ... > > If you think about who is going to remove the page from the lru you'll > > see the problem. > > Nope, still don't see it. Whoever hits put_page_testzero frees the page, > secure in the knowledge that there are no other references to it. Sure. But this requires that the caller of page_cache_release() has previously removed the page from the LRU. We (used to) do that for truncate and page reclaim. But we did not do that for anon pages. For anon pages, we perform magical LRU removal when the page refcount goes to zero. The fact that we performed explicit removal in one place, and magical removal in the other was unfortunate. I nuked the explicit removal and made it all magical (explicit removal in truncate_complete_page() was wrong anyway - the page could have been rescued and anonymised by concurrent pagefault and must stay on the LRU). Possibly, we could go back to explicit removal everywhere. Haven't really looked at that, but I suspect we're back to a similar problem: how do you unracily determine whether the page should be removed from the LRU? Take ->page_table_lock and look at page_count(page)? Worried. I like the magical-removal-just-before-free, and my gut feel is that it'll provide a cleaner end result. Making presence on the LRU contribute to page->count is attractive, if only because it removes some irritating and expensive page_cache_gets and puts from shrink_cache and refill_inactive. But for it to be useful, we must perform explicit removal everywhere.
Making presence on the LRU contribute to page->count doesn't fundamentally change anything of course - it offsets the current problems by one. Then again, it would remove all page_cache_gets/releases from vmscan.c and may thus make the race go away. That's a bit of a timebomb though. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-26 21:31 ` Andrew Morton @ 2002-08-27 3:42 ` Benjamin LaHaise 2002-08-27 4:37 ` Andrew Morton 0 siblings, 1 reply; 65+ messages in thread From: Benjamin LaHaise @ 2002-08-27 3:42 UTC (permalink / raw) To: Andrew Morton Cc: Daniel Phillips, Christian Ehrhardt, lkml, linux-mm@kvack.org On Mon, Aug 26, 2002 at 02:31:57PM -0700, Andrew Morton wrote: > I like the magical-removal-just-before-free, and my gut feel is that > it'll provide a cleaner end result. For the record, I'd rather see explicit removal everywhere. We received a number of complaints along the lines of "I run my app immediately after system startup, and it's fast, but the second time it's slower" due to the lazy page reclaim in early 2.4. Until there's a way to make LRU scanning faster than page allocation, it can't be lazy. -ben ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-27 3:42 ` Benjamin LaHaise @ 2002-08-27 4:37 ` Andrew Morton 0 siblings, 0 replies; 65+ messages in thread From: Andrew Morton @ 2002-08-27 4:37 UTC (permalink / raw) To: Benjamin LaHaise Cc: Daniel Phillips, Christian Ehrhardt, lkml, linux-mm@kvack.org Benjamin LaHaise wrote: > > On Mon, Aug 26, 2002 at 02:31:57PM -0700, Andrew Morton wrote: > > I like the magical-removal-just-before-free, and my gut feel is that > > it'll provide a cleaner end result. > > For the record, I'd rather see explicit removal everywhere. We received > a number of complaints along the lines of "I run my app immediately after > system startup, and it's fast, but the second time it's slower" due to > the lazy page reclaim in early 2.4. Until there's a way to make LRU > scanning faster than page allocation, it can't be lazy. > I think that's what Rik was referring to. But here, "explicit removal" refers to running lru_cache_del() prior to the final put_page, rather than within the context of the final put_page. So it's a different thing. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-26 1:52 ` Andrew Morton 2002-08-26 9:10 ` Christian Ehrhardt @ 2002-08-26 17:58 ` Linus Torvalds 2002-08-26 19:28 ` Rik van Riel 2002-08-30 23:03 ` [RFC] [PATCH] Include LRU in page count Daniel Phillips 1 sibling, 2 replies; 65+ messages in thread From: Linus Torvalds @ 2002-08-26 17:58 UTC (permalink / raw) To: linux-kernel In article <3D6989F7.9ED1948A@zip.com.au>, Andrew Morton <akpm@zip.com.au> wrote: > >What I'm inclined to do there is to change __page_cache_release() >to not attempt to free the page at all. Just let it sit on the >LRU until page reclaim encounters it. With the anon-free-via-pagevec >patch, very, very, very few pages actually get their final release in >__page_cache_release() - zero on uniprocessor, I expect. If you do this, then I would personally suggest a conceptually different approach: make the LRU list count towards the page count. That will _automatically_ result in what you describe - if a page is on the LRU list, then "freeing" it will always just decrement the count, and the _real_ free comes from walking the LRU list and considering count==1 to be trivially freeable. That way you don't have to have separate functions for releasing different kinds of pages (we've seen how nasty that was from a maintainance standpoint already with the "put_page vs page_cache_release" thing). Ehh? Linus ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-26 17:58 ` Linus Torvalds @ 2002-08-26 19:28 ` Rik van Riel 2002-08-30 23:03 ` [RFC] [PATCH] Include LRU in page count Daniel Phillips 1 sibling, 0 replies; 65+ messages in thread From: Rik van Riel @ 2002-08-26 19:28 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Mon, 26 Aug 2002, Linus Torvalds wrote: > If you do this, then I would personally suggest a conceptually different > approach: make the LRU list count towards the page count. That will > _automatically_ result in what you describe - if a page is on the LRU > list, then "freeing" it will always just decrement the count, and the > _real_ free comes from walking the LRU list and considering count==1 to > be trivially freeable. We can turn these into per-cpu "garbage collect" LRUs, if we're holding the lock anyway when we decrement the count we can move the page to a place where it can be found easily. regards, Rik -- Bravely reimplemented by the knights who say "NIH". http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 65+ messages in thread
* [RFC] [PATCH] Include LRU in page count 2002-08-26 17:58 ` Linus Torvalds 2002-08-26 19:28 ` Rik van Riel @ 2002-08-30 23:03 ` Daniel Phillips 2002-08-31 16:14 ` Christian Ehrhardt 2002-09-05 4:42 ` [RFC] Alternative raceless page free Daniel Phillips 1 sibling, 2 replies; 65+ messages in thread From: Daniel Phillips @ 2002-08-30 23:03 UTC (permalink / raw) To: Andrew Morton, Linus Torvalds, Marcelo Tosatti Cc: linux-kernel, Christian Ehrhardt The executive summary is 'Houston, we have a bug'. The long version of this is that all current linux kernels are now thought to suffer from a rarely occurring page double-free bug that results from a race between shrink_cache and page_cache_release, where both attempt to free a page from the lru at the same time. Andrew's succinct description of the race is: if (put_page_testzero(page)) { /* window here */ lru_cache_del(page); __free_pages_ok(page, 0); } versus: spin_lock(lru lock); page = list_entry(lru, ...); if (page_count(page) == 0) continue; /* window here */ page_cache_get(page); page_cache_release(page); /* double-free */ This rare race happened to become not so rare in 2.5 recently, and was analyzed by Christian Ehrhardt, who also proposed a solution based on a new approach to locking, essentially put_page_testone. We went on to check 2.4 for the race, and while it's quite difficult to prove the race exists there, after a year or two of plugging holes, it's even more difficult to prove it's not there. In fact, sifting through recent kernel archives turns up unexplained bug reports which fit the description of this race, for example: http://marc.theaimsgroup.com/?l=linux-xfs&m=103055330831850&w=2 (2.4.18-xfs (xfs related?) oops report) I proposed an alternate solution using the traditional put_page_testzero primitive, which relies on assigning a page count of one for membership on the lru list. A slightly racy heuristic is used for efficient lru list removal. 
The resulting incarnation of page_cache_release is: static inline void page_cache_release(struct page *page) { if (page_count(page) == 2 && spin_trylock(&pagemap_lru_lock)) { if (PageLRU(page) && page_count(page) == 2) __lru_cache_del(page); spin_unlock(&pagemap_lru_lock); } put_page(page); } which permits the following benign race, starting with page count of 3: if (page_count(page) == 2 /* false */ if (page_count(page) == 2 /* false */ put_page(page); put_page(page); Neither holder of a page reference sees the count at 2, and so the page is left on the lru with count = 1. This won't happen often (I haven't seen it happen yet) and such pages will be recovered from the cold end of the list in due course. There are a number of other variations of this race, but they all have the same effect and are all rare. The important question is: can this code ever remove a page from the lru erroneously, leaving somebody holding a reference to a non-lru page? In other words, can the test PageLRU(page) && page_count(page) == 2 return a false positive? Well, when this test is true we can account for both references: the one we own, and the one the lru list owns. Since we hold the lru lock, the latter won't change. Nobody else has the right to increment the page count, since they must inherit that right from somebody who holds a reference, and there are none. The spin_trylock is efficient - no contention ever, just some cache pingponging at worst - and permits page_cache_release to be used even by a process that holds the lru lock. The attached patch against 2.4.19 implements these ideas, along with some proc statistics to help verify that the intended effect is being achieved. The patch supports both Marcelo's current design (hopefully with no semantic changes at all) and the new lru counting strategy, selected by setting the LRU_PLUS_CACHE to one to get the existing strategy or two to get the new one.
With the patch, there remain only three instances of __lru_cache_del: the one in page_cache_release, the one in shrink_cache and the one in truncate. The latter looks suspicious since it's not clear why anonymous pages can't be allowed to appear on the lru. However, that is not the focus of the current effort. The atomic tests for lru membership appear to be obsoleted by the new regimen, since we are protected in every case by the lru lock, but I did not disturb them. Due to laziness, the correctness of the swapcache logic in swapfile.c has not yet been verified. I've done some light testing on my 2GB 2X1Ghz PIII box, using the following pair of scripts inherited from akpm:

file: doitlots
-----------------------
#!/bin/sh
doit()
{
	( cat $1 | wc -l )
}
count=0
while [ $count != 500 ]
do
	doit doitlots > /dev/null
	count=$(expr $count + 1)
done
echo done
-----------------------

file: forklots
-----------------------
echo >foocount
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &
count=0
while [ $count != 10 ]
do
	count=$(wc foocount | cut -b -8)
done
-----------------------

The patch computes the following statistics and makes them available via /proc/mmstats:

page_cache_release ... Total calls to page_cache_release
page_cache_release_test_lru ... How often the lru heuristic triggers
page_cache_release_del_lru ... How often the heuristic is correct
shrink_cache_del_lru ... How often an orphan is left on the lru
put_page_testzero ... Total page releases
lru_cache_del ... Total lru deletes
truncate_lru_cache_del ... Total lru deletes during truncate

Now the benchmark results...
LRU_PLUS_CACHE = 2, with LRU_DEL_HEURISTIC:

time sh forklots
24.81user 28.15system 0:26.50elapsed 199%CPU

cat /proc/mmstats
page_cache_release 8555807
page_cache_release_test_lru 771550
page_cache_release_del_lru 771535
shrink_cache_del_lru 0
put_page_testzero 9209139
lru_cache_del 771547
truncate_lru_cache_del 12

According to the statistics, we took the lru lock to see if the page is really on the lru about 9% of the time. In just .002% (15) of those cases the heuristic was wrong, the lock was taken and no page removed from the lru. No orphan pages were ever found on the lru. In this case, all the lru_cache_dels are accounted for by the 771,535 that were removed via the page_cache_release heuristic, and the 12 that were converted to anonymous pages via truncate.

LRU_PLUS_CACHE = 2, without LRU_DEL_HEURISTIC:

time sh forklots
25.45user 28.70system 0:27.28elapsed 198%CPU

cat /proc/mmstats
page_cache_release 8611106
page_cache_release_test_lru 0
page_cache_release_del_lru 0
shrink_cache_del_lru 614573
put_page_testzero 9832924
lru_cache_del 11
truncate_lru_cache_del 11

Omitting the heuristic forces shrink_cache to pick up the orphaned lru pages, which costs about 3% in terms of wall clock time on this test. At least this exercises the orphan recovery path in shrink_cache.

LRU_PLUS_CACHE = 2, with LRU_DEL_HEURISTIC, no statistics:

time sh forklots
24.74user 28.02system 0:26.39elapsed 199%CPU

The statistics cost about .5%, which is basically noise.

Vanilla 2.4.19:

time sh forklots
24.15user 28.44system 0:26.31elapsed 199%CPU

Vanilla 2.4.19 may or may not have an advantage of .3%. On the other hand, removing the atomic test for lru membership probably gives a slight advantage to the new version.
The patch is against 2.4.19, apply with patch -p1: --- 2.4.19.clean/fs/proc/proc_misc.c Fri Aug 2 20:39:45 2002 +++ 2.4.19/fs/proc/proc_misc.c Fri Aug 30 21:32:20 2002 @@ -539,6 +539,28 @@ entry->proc_fops = f; } +struct mmstats mmstats; + +char *mmnames[] = { + "page_cache_release", + "page_cache_release_test_lru", + "page_cache_release_del_lru", + "shrink_cache_del_lru", + "put_page_testzero", + "lru_cache_del", + "truncate_lru_cache_del", + NULL, +}; + +static int read_mmstats(char *page, char **start, off_t off, int count, int *eof, void *data) +{ + char **names = mmnames; + unsigned *stats = (unsigned *) &mmstats, len = 0; + for (; *names; *names++, *stats++) if (strlen(*names)) + len += sprintf(page + len, "%s %u\n", *names, *stats); + return proc_calc_metrics(page, start, off, count, eof, len); +} + void __init proc_misc_init(void) { struct proc_dir_entry *entry; @@ -546,6 +568,7 @@ char *name; int (*read_proc)(char*,char**,off_t,int,int*,void*); } *p, simple_ones[] = { + {"mmstats", read_mmstats}, {"loadavg", loadavg_read_proc}, {"uptime", uptime_read_proc}, {"meminfo", meminfo_read_proc}, --- 2.4.19.clean/include/linux/mm.h Fri Aug 2 20:39:45 2002 +++ 2.4.19/include/linux/mm.h Fri Aug 30 21:41:54 2002 @@ -35,6 +35,20 @@ * mmap() functions). */ +extern struct mmstats { + unsigned release; + unsigned release_test_page; + unsigned release_free_page; + unsigned vmscan_free_page; + unsigned testzero; + unsigned lru_cache_del; + unsigned truncate_lru_cache_del; +} mmstats; + +extern char *mmnames[]; + +#define mmstat(field) mmstats.field++ + /* * This struct defines a memory VMM memory area. There is one of these * per VM-area/task. A VM area is any part of the process virtual memory @@ -180,6 +194,15 @@ } mem_map_t; /* + * There is only one 'core' page-freeing function. 
+ */ +extern void FASTCALL(__free_pages(struct page *page, unsigned int order)); +extern void FASTCALL(free_pages(unsigned long addr, unsigned int order)); + +#define __free_page(page) __free_pages((page), 0) +#define free_page(addr) free_pages((addr),0) + +/* * Methods to modify the page usage count. * * What counts for a page usage: @@ -191,12 +214,34 @@ * Also, many kernel routines increase the page count before a critical * routine so they can be sure the page doesn't go away from under them. */ -#define get_page(p) atomic_inc(&(p)->count) -#define put_page(p) __free_page(p) -#define put_page_testzero(p) atomic_dec_and_test(&(p)->count) #define page_count(p) atomic_read(&(p)->count) #define set_page_count(p,v) atomic_set(&(p)->count, v) +extern void FASTCALL(__free_pages(struct page *page, unsigned int order)); + +static inline void get_page(struct page *page) +{ + atomic_inc(&page->count); +} + +static inline void put_page_nofree(struct page *page) +{ + BUG_ON(!page_count(page)); + atomic_dec(&page->count); +} + +static inline int put_page_testzero(struct page *page) +{ + mmstat(testzero); + BUG_ON(!page_count(page)); + return atomic_dec_and_test(&page->count); +} + +static inline void put_page(struct page *page) +{ + __free_page(page); +} + /* * Various page->flags bits: * @@ -451,15 +496,6 @@ */ #define get_free_page get_zeroed_page -/* - * There is only one 'core' page-freeing function. 
- */ -extern void FASTCALL(__free_pages(struct page *page, unsigned int order)); -extern void FASTCALL(free_pages(unsigned long addr, unsigned int order)); - -#define __free_page(page) __free_pages((page), 0) -#define free_page(addr) free_pages((addr),0) - extern void show_free_areas(void); extern void show_free_areas_node(pg_data_t *pgdat); @@ -518,11 +554,6 @@ extern struct address_space swapper_space; #define PageSwapCache(page) ((page)->mapping == &swapper_space) - -static inline int is_page_cache_freeable(struct page * page) -{ - return page_count(page) - !!page->buffers == 1; -} extern int can_share_swap_page(struct page *); extern int remove_exclusive_swap_page(struct page *); --- 2.4.19.clean/include/linux/pagemap.h Mon Feb 25 14:38:13 2002 +++ 2.4.19/include/linux/pagemap.h Fri Aug 30 22:17:21 2002 @@ -27,9 +27,38 @@ #define PAGE_CACHE_SIZE PAGE_SIZE #define PAGE_CACHE_MASK PAGE_MASK #define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK) +#define LRU_PLUS_CACHE 2 +#define LRU_DEL_HEURISTIC #define page_cache_get(x) get_page(x) -#define page_cache_release(x) __free_page(x) + +static inline void page_cache_release(struct page *page) +{ + mmstat(release); +#if LRU_PLUS_CACHE==2 +#ifdef LRU_DEL_HEURISTIC + if (page_count(page) == 2 && spin_trylock(&pagemap_lru_lock)) { + mmstat(release_test_page); + if (PageLRU(page) && page_count(page) == 2) { + mmstat(release_free_page); + __lru_cache_del(page); + } + spin_unlock(&pagemap_lru_lock); + } +#endif +#endif + put_page(page); +} + +static inline int has_buffers(struct page *page) +{ + return !!page->buffers; +} + +static inline int is_page_cache_freeable(struct page *page) +{ + return page_count(page) - has_buffers(page) == LRU_PLUS_CACHE; +} static inline struct page *page_cache_alloc(struct address_space *x) { --- 2.4.19.clean/mm/filemap.c Fri Aug 2 20:39:46 2002 +++ 2.4.19/mm/filemap.c Fri Aug 30 21:40:33 2002 @@ -198,10 +198,12 @@ if (page->buffers && !try_to_free_buffers(page, 0)) goto 
unlock; - if (page_count(page) != 1) + if (page_count(page) != LRU_PLUS_CACHE) goto unlock; +#if LRU_PLUS_CACHE==1 __lru_cache_del(page); +#endif __remove_inode_page(page); UnlockPage(page); page_cache_release(page); @@ -234,8 +236,10 @@ static void truncate_complete_page(struct page *page) { /* Leave it on the LRU if it gets converted into anonymous buffers */ - if (!page->buffers || do_flushpage(page, 0)) + if (!page->buffers || do_flushpage(page, 0)) { + mmstat(truncate_lru_cache_del); lru_cache_del(page); + } /* * We remove the page from the page cache _after_ we have @@ -345,7 +349,7 @@ * The page is locked and we hold the pagecache_lock as well * so both page_count(page) and page->buffers stays constant here. */ - if (page_count(page) == 1 + !!page->buffers) { + if (is_page_cache_freeable(page)) { /* Restart after this page */ list_del(head); list_add_tail(head, curr); --- 2.4.19.clean/mm/page_alloc.c Fri Aug 2 20:39:46 2002 +++ 2.4.19/mm/page_alloc.c Fri Aug 30 21:55:03 2002 @@ -82,9 +82,12 @@ /* Yes, think what happens when other parts of the kernel take * a reference to a page in order to pin it for io. -ben */ +#if LRU_PLUS_CACHE==2 + BUG_ON(PageLRU(page)); +#else if (PageLRU(page)) lru_cache_del(page); - +#endif if (page->buffers) BUG(); if (page->mapping) --- 2.4.19.clean/mm/swap.c Wed Nov 7 01:44:20 2001 +++ 2.4.19/mm/swap.c Fri Aug 30 21:32:20 2002 @@ -60,6 +60,9 @@ if (!TestSetPageLRU(page)) { spin_lock(&pagemap_lru_lock); add_page_to_inactive_list(page); +#if LRU_PLUS_CACHE==2 + get_page(page); +#endif spin_unlock(&pagemap_lru_lock); } } @@ -79,6 +82,10 @@ } else { del_page_from_inactive_list(page); } + mmstat(lru_cache_del); +#if LRU_PLUS_CACHE==2 + put_page_nofree(page); +#endif } } --- 2.4.19.clean/mm/swapfile.c Fri Aug 2 20:39:46 2002 +++ 2.4.19/mm/swapfile.c Fri Aug 30 21:32:20 2002 @@ -239,7 +239,7 @@ if (p->swap_map[SWP_OFFSET(entry)] == 1) { /* Recheck the page count with the pagecache lock held.. 
*/ spin_lock(&pagecache_lock); - if (page_count(page) - !!page->buffers == 2) + if (page_count(page) - has_buffers(page) == LRU_PLUS_CACHE + 1) retval = 1; spin_unlock(&pagecache_lock); } @@ -263,16 +263,16 @@ if (!PageLocked(page)) BUG(); switch (page_count(page)) { - case 3: - if (!page->buffers) + case LRU_PLUS_CACHE + 2: + if (!has_buffers(page)) break; /* Fallthrough */ - case 2: + case LRU_PLUS_CACHE + 1: if (!PageSwapCache(page)) break; retval = exclusive_swap_page(page); break; - case 1: + case LRU_PLUS_CACHE: if (PageReserved(page)) break; retval = 1; @@ -280,6 +280,11 @@ return retval; } +static inline int only_cached(struct page *page) +{ + return page_count(page) - has_buffers(page) == LRU_PLUS_CACHE + 1; +} + /* * Work out if there are any other processes sharing this * swap cache page. Free it if you can. Return success. @@ -294,7 +299,7 @@ BUG(); if (!PageSwapCache(page)) return 0; - if (page_count(page) - !!page->buffers != 2) /* 2: us + cache */ + if (!only_cached(page)) return 0; entry.val = page->index; @@ -307,7 +312,7 @@ if (p->swap_map[SWP_OFFSET(entry)] == 1) { /* Recheck the page count with the pagecache lock held.. */ spin_lock(&pagecache_lock); - if (page_count(page) - !!page->buffers == 2) { + if (only_cached(page)) { __delete_from_swap_cache(page); SetPageDirty(page); retval = 1; @@ -343,7 +348,7 @@ if (page) { page_cache_get(page); /* Only cache user (+us), or swap space full? Free it! 
*/ - if (page_count(page) - !!page->buffers == 2 || vm_swap_full()) { + if (only_cached(page) || vm_swap_full()) { delete_from_swap_cache(page); SetPageDirty(page); } --- 2.4.19.clean/mm/vmscan.c Fri Aug 2 20:39:46 2002 +++ 2.4.19/mm/vmscan.c Fri Aug 30 21:40:30 2002 @@ -89,7 +89,7 @@ mm->rss--; UnlockPage(page); { - int freeable = page_count(page) - !!page->buffers <= 2; + int freeable = page_count(page) - has_buffers(page) <= LRU_PLUS_CACHE + 1; page_cache_release(page); return freeable; } @@ -356,20 +356,30 @@ BUG_ON(PageActive(page)); list_del(entry); +#if LRU_PLUS_CACHE==2 + BUG_ON(!page_count(page)); + if (unlikely(page_count(page) == 1)) { + mmstat(vmscan_free_page); + BUG_ON(!TestClearPageLRU(page)); // side effect abuse!! + put_page(page); + continue; + } +#endif list_add(entry, &inactive_list); +#if LRU_PLUS_CACHE==1 /* * Zero page counts can happen because we unlink the pages * _after_ decrementing the usage count.. */ if (unlikely(!page_count(page))) continue; - +#endif if (!memclass(page_zone(page), classzone)) continue; /* Racy check to avoid trylocking when not worthwhile */ - if (!page->buffers && (page_count(page) != 1 || !page->mapping)) + if (!page->buffers && (page_count(page) != LRU_PLUS_CACHE || !page->mapping)) goto page_mapped; /* @@ -434,9 +444,9 @@ */ spin_lock(&pagemap_lru_lock); UnlockPage(page); +#if LRU_PLUS_CACHE==1 __lru_cache_del(page); - - /* effectively free the page here */ +#endif page_cache_release(page); if (--nr_pages) @@ -505,10 +515,10 @@ swap_free(swap); } +#if LRU_PLUS_CACHE==1 __lru_cache_del(page); +#endif UnlockPage(page); - - /* effectively free the page here */ page_cache_release(page); if (--nr_pages) ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] [PATCH] Include LRU in page count 2002-08-30 23:03 ` [RFC] [PATCH] Include LRU in page count Daniel Phillips @ 2002-08-31 16:14 ` Christian Ehrhardt 2002-08-31 17:54 ` Andrew Morton 2002-08-31 19:47 ` Daniel Phillips 2002-09-05 4:42 ` [RFC] Alternative raceless page free Daniel Phillips 1 sibling, 2 replies; 65+ messages in thread From: Christian Ehrhardt @ 2002-08-31 16:14 UTC (permalink / raw) To: Daniel Phillips Cc: Andrew Morton, Linus Torvalds, Marcelo Tosatti, linux-kernel On Sat, Aug 31, 2002 at 01:03:07AM +0200, Daniel Phillips wrote: > This rare race happened to become not so rare in 2.5 recently, and was > analyzed by Christian Ehrhardt, who also proposed a solution based on a new > approach to locking, essentially put_page_testone. We went on to check 2.4 Just a little correction: The key function implemented in my solution is an atomic GET_page_testone which is called if the page might have a zero refcount. The idea I had in mind is to distinguish heavy and weak references to pages. Your solution is probably the better way to go. > I proposed an alternate solution using the traditional put_page_testzero > primitive, which relies on assigning a page count of one for membership on > the lru list. A slightly racy heuristic is used for efficient lru list > removal. The resulting incarnation of lru_cache_release is: > > static inline void page_cache_release(struct page *page) > { > if (page_count(page) == 2 && spin_trylock(&pagemap_lru_lock)) { > if (PageLRU(page) && page_count(page) == 2) > __lru_cache_del(page); > spin_unlock(&pagemap_lru_lock); > } > put_page(page); > } Just saw that this can still race e.g. with lru_cache_add (not hard to fix though): | void lru_cache_add(struct page * page) | { | if (!TestSetPageLRU(page)) { Window is here: Once we set the PageLRU bit page_cache_release assumes that there is a reference held by the lru cache. 
| spin_lock(&pagemap_lru_lock); | add_page_to_inactive_list(page); |#if LRU_PLUS_CACHE==2 | get_page(page); |#endif But only starting at this point the reference actually exists. | spin_unlock(&pagemap_lru_lock); | } |} Solution: Change the PageLRU bit inside the lock. Looks like lru_cache_add is the only place that doesn't hold the lru lock to change the LRU flag and it's probably not a good idea even without the patch. Two more comments: I don't think it is a good idea to use put_page_nofree in __lru_cache_del. This is probably safe now but it adds an additional rule that lru_cache_del can't be called without holding a second reference to the page. Also there may be lru only pages on the active list, i.e. refill inactive should have this hunk as well: > +#if LRU_PLUS_CACHE==2 > + BUG_ON(!page_count(page)); > + if (unlikely(page_count(page) == 1)) { > + mmstat(vmscan_free_page); > + BUG_ON(!TestClearPageLRU(page)); // side effect abuse!! > + put_page(page); > + continue; > + } > +#endif regards Christian Ehrhardt -- THAT'S ALL FOLKS! ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] [PATCH] Include LRU in page count 2002-08-31 16:14 ` Christian Ehrhardt @ 2002-08-31 17:54 ` Andrew Morton 2002-08-31 19:47 ` Daniel Phillips 1 sibling, 0 replies; 65+ messages in thread From: Andrew Morton @ 2002-08-31 17:54 UTC (permalink / raw) To: Christian Ehrhardt Cc: Daniel Phillips, Linus Torvalds, Marcelo Tosatti, linux-kernel Christian Ehrhardt wrote: > > ... > Solution: Change the PageLRU bit inside the lock. Looks like > lru_cache_add is the only place that doesn't hold the lru lock to > change the LRU flag and it's probably not a good idea even without > the patch. > 2.4.20-pre already does that: void lru_cache_add(struct page * page) { if (!PageLRU(page)) { spin_lock(&pagemap_lru_lock); if (!TestSetPageLRU(page)) add_page_to_inactive_list(page); spin_unlock(&pagemap_lru_lock); } } there was an oopsing race with activate_page()... ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] [PATCH] Include LRU in page count 2002-08-31 16:14 ` Christian Ehrhardt 2002-08-31 17:54 ` Andrew Morton @ 2002-08-31 19:47 ` Daniel Phillips 2002-08-31 20:26 ` Andrew Morton 2002-09-02 17:23 ` Christian Ehrhardt 1 sibling, 2 replies; 65+ messages in thread From: Daniel Phillips @ 2002-08-31 19:47 UTC (permalink / raw) To: Christian Ehrhardt Cc: Andrew Morton, Linus Torvalds, Marcelo Tosatti, linux-kernel On Saturday 31 August 2002 18:14, Christian Ehrhardt wrote: > On Sat, Aug 31, 2002 at 01:03:07AM +0200, Daniel Phillips wrote: > > static inline void page_cache_release(struct page *page) > > { > > if (page_count(page) == 2 && spin_trylock(&pagemap_lru_lock)) { > > if (PageLRU(page) && page_count(page) == 2) > > __lru_cache_del(page); > > spin_unlock(&pagemap_lru_lock); > > } > > put_page(page); > > } > > Just saw that this can still race e.g. with lru_cache_add (not > hard to fix though): > > | void lru_cache_add(struct page * page) > | { > | if (!TestSetPageLRU(page)) { > > Window is here: Once we set the PageLRU bit page_cache_release > assumes that there is a reference held by the lru cache. > > | spin_lock(&pagemap_lru_lock); > | add_page_to_inactive_list(page); > |#if LRU_PLUS_CACHE==2 > | get_page(page); > |#endif > > But only starting at this point the reference actually exists. > > | spin_unlock(&pagemap_lru_lock); > | } > |} > > Solution: Change the PageLRU bit inside the lock. Looks like > lru_cache_add is the only place that doesn't hold the lru lock to > change the LRU flag and it's probably not a good idea even without > the patch. Thanks. I sniffed at that very thing a little yesterday and decided that lru_cache_add must always be called on a page that isn't on the lru. It's really hard to prove that though - just look at the acres of sewage in filemap.c and the various creepy-crawly things that call put_dirty_page. Coding it so the race isn't even possible is a much better idea. 
In the process of getting rid of the TestSet and TestClear bitops, I'd inadvertently done that already, this way: void lru_cache_add(struct page * page) { spin_lock(&pagemap_lru_lock); BUG_ON(PageLRU(page)); SetPageLRU(page); get_page(page); add_page_to_inactive_list(page); spin_unlock(&pagemap_lru_lock); } So at least we get an easy-to-understand BUG instead of a much harder to find side effect. That said, the assumption that a page is never on the lru at the time of this call is too fragile at best and likely wrong - see shmem_writepage->add_to_page_cache_locked. So we want: void lru_cache_add(struct page * page) { spin_lock(&pagemap_lru_lock); if (likely(!PageLRU(page))) { add_page_to_inactive_list(page); SetPageLRU(page); get_page(page); } spin_unlock(&pagemap_lru_lock); } > Two more comments: I don't think it is a good idea to use > put_page_nofree in __lru_cache_del. This is probably safe now but > it adds an additional rule that lru_cache_del can't be called without > holding a second reference to the page. That was intentional. We always do hold an additional reference where this is used, and if that changes in the future the author better at least take a look at the rules. Another way of looking at it: since all the code paths that use this rely on the page staying around for further operations after the lru delete, it is a bug for sure if the count goes to zero during the lru delete. I'd call that one a documentation bug. > Also there may be lru only pages on the active list, i.e. refill > inactive should have this hunk as well: > > > +#if LRU_PLUS_CACHE==2 > > + BUG_ON(!page_count(page)); > > + if (unlikely(page_count(page) == 1)) { > > + mmstat(vmscan_free_page); > > + BUG_ON(!TestClearPageLRU(page)); // side effect abuse!! > > + put_page(page); > > + continue; > > + } > > +#endif If we have orphans on the active list, we'd probably better just count them and figure out what we're doing wrong to put them there in the first place. 
In time they will migrate to the inactive list and get cleaned up. Yesterday's patch had a pretty big performance bug: at the bottom of the shrink_cache loop where a cache page is finally removed, I had: UnlockPage(page); page_cache_release(page); but we hold the lru lock there, so the page always becomes an orphan, to be picked up later. The sensible thing to do is: __lru_cache_del(page); UnlockPage(page); put_page(page); Besides that, I cleaned up yesterdays side-effecty BUG_ON, and removed the TestSet/Clear to just Set/ClearPageLRU. It's tempting to do a lot more, but with some effort I resisted that. Manfred suggested an approach to de-racing this race using atomic_dec_and_lock, which needs to be compared to the current approach. More on that later. -- Daniel --- 2.4.19.clean/fs/proc/proc_misc.c Fri Aug 2 20:39:45 2002 +++ 2.4.19/fs/proc/proc_misc.c Fri Aug 30 21:32:20 2002 @@ -539,6 +539,28 @@ entry->proc_fops = f; } +struct mmstats mmstats; + +char *mmnames[] = { + "page_cache_release", + "page_cache_release_test_lru", + "page_cache_release_del_lru", + "shrink_cache_del_lru", + "put_page_testzero", + "lru_cache_del", + "truncate_lru_cache_del", + NULL, +}; + +static int read_mmstats(char *page, char **start, off_t off, int count, int *eof, void *data) +{ + char **names = mmnames; + unsigned *stats = (unsigned *) &mmstats, len = 0; + for (; *names; *names++, *stats++) if (strlen(*names)) + len += sprintf(page + len, "%s %u\n", *names, *stats); + return proc_calc_metrics(page, start, off, count, eof, len); +} + void __init proc_misc_init(void) { struct proc_dir_entry *entry; @@ -546,6 +568,7 @@ char *name; int (*read_proc)(char*,char**,off_t,int,int*,void*); } *p, simple_ones[] = { + {"mmstats", read_mmstats}, {"loadavg", loadavg_read_proc}, {"uptime", uptime_read_proc}, {"meminfo", meminfo_read_proc}, --- 2.4.19.clean/include/linux/mm.h Fri Aug 2 20:39:45 2002 +++ 2.4.19/include/linux/mm.h Sat Aug 31 18:56:17 2002 @@ -35,6 +35,20 @@ * mmap() functions). 
*/ +extern struct mmstats { + unsigned release; + unsigned release_test_page; + unsigned release_free_page; + unsigned vmscan_free_page; + unsigned testzero; + unsigned lru_cache_del; + unsigned truncate_lru_cache_del; +} mmstats; + +extern char *mmnames[]; + +#define mmstat(field) mmstats.field++ + /* * This struct defines a memory VMM memory area. There is one of these * per VM-area/task. A VM area is any part of the process virtual memory @@ -180,6 +194,15 @@ } mem_map_t; /* + * There is only one 'core' page-freeing function. + */ +extern void FASTCALL(__free_pages(struct page *page, unsigned int order)); +extern void FASTCALL(free_pages(unsigned long addr, unsigned int order)); + +#define __free_page(page) __free_pages((page), 0) +#define free_page(addr) free_pages((addr),0) + +/* * Methods to modify the page usage count. * * What counts for a page usage: @@ -191,12 +214,34 @@ * Also, many kernel routines increase the page count before a critical * routine so they can be sure the page doesn't go away from under them. 
*/ -#define get_page(p) atomic_inc(&(p)->count) -#define put_page(p) __free_page(p) -#define put_page_testzero(p) atomic_dec_and_test(&(p)->count) #define page_count(p) atomic_read(&(p)->count) #define set_page_count(p,v) atomic_set(&(p)->count, v) +extern void FASTCALL(__free_pages(struct page *page, unsigned int order)); + +static inline void get_page(struct page *page) +{ + atomic_inc(&page->count); +} + +static inline void put_page_nofree(struct page *page) +{ + BUG_ON(!page_count(page)); + atomic_dec(&page->count); +} + +static inline int put_page_testzero(struct page *page) +{ + mmstat(testzero); + BUG_ON(!page_count(page)); + return atomic_dec_and_test(&page->count); +} + +static inline void put_page(struct page *page) +{ + __free_page(page); +} + /* * Various page->flags bits: * @@ -394,6 +439,8 @@ #define ClearPageActive(page) clear_bit(PG_active, &(page)->flags) #define PageLRU(page) test_bit(PG_lru, &(page)->flags) +#define SetPageLRU(page) set_bit(PG_lru, &(page)->flags) +#define ClearPageLRU(page) clear_bit(PG_lru, &(page)->flags) #define TestSetPageLRU(page) test_and_set_bit(PG_lru, &(page)->flags) #define TestClearPageLRU(page) test_and_clear_bit(PG_lru, &(page)->flags) @@ -451,15 +498,6 @@ */ #define get_free_page get_zeroed_page -/* - * There is only one 'core' page-freeing function. 
- */ -extern void FASTCALL(__free_pages(struct page *page, unsigned int order)); -extern void FASTCALL(free_pages(unsigned long addr, unsigned int order)); - -#define __free_page(page) __free_pages((page), 0) -#define free_page(addr) free_pages((addr),0) - extern void show_free_areas(void); extern void show_free_areas_node(pg_data_t *pgdat); @@ -518,11 +556,6 @@ extern struct address_space swapper_space; #define PageSwapCache(page) ((page)->mapping == &swapper_space) - -static inline int is_page_cache_freeable(struct page * page) -{ - return page_count(page) - !!page->buffers == 1; -} extern int can_share_swap_page(struct page *); extern int remove_exclusive_swap_page(struct page *); --- 2.4.19.clean/include/linux/pagemap.h Mon Feb 25 14:38:13 2002 +++ 2.4.19/include/linux/pagemap.h Sat Aug 31 18:58:43 2002 @@ -27,9 +27,38 @@ #define PAGE_CACHE_SIZE PAGE_SIZE #define PAGE_CACHE_MASK PAGE_MASK #define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK) +#define LRU_PLUS_CACHE 2 +#define LRU_DEL_HEURISTIC #define page_cache_get(x) get_page(x) -#define page_cache_release(x) __free_page(x) + +static inline void page_cache_release(struct page *page) +{ + mmstat(release); +#if LRU_PLUS_CACHE==2 +#ifdef LRU_DEL_HEURISTIC + if (page_count(page) == 2 && spin_trylock(&pagemap_lru_lock)) { + mmstat(release_test_page); + if (PageLRU(page) && page_count(page) == 2) { + mmstat(release_free_page); + __lru_cache_del(page); + } + spin_unlock(&pagemap_lru_lock); + } +#endif +#endif + put_page(page); +} + +static inline int has_buffers(struct page *page) +{ + return !!page->buffers; +} + +static inline int is_page_cache_freeable(struct page *page) +{ + return page_count(page) - has_buffers(page) == LRU_PLUS_CACHE; +} static inline struct page *page_cache_alloc(struct address_space *x) { --- 2.4.19.clean/include/linux/swap.h Fri Aug 2 20:39:46 2002 +++ 2.4.19/include/linux/swap.h Sat Aug 31 18:56:11 2002 @@ -168,17 +168,9 @@ * List add/del helper macros. 
These must be called * with the pagemap_lru_lock held! */ -#define DEBUG_LRU_PAGE(page) \ -do { \ - if (!PageLRU(page)) \ - BUG(); \ - if (PageActive(page)) \ - BUG(); \ -} while (0) - #define add_page_to_active_list(page) \ do { \ - DEBUG_LRU_PAGE(page); \ + BUG_ON(PageActive(page)); \ SetPageActive(page); \ list_add(&(page)->lru, &active_list); \ nr_active_pages++; \ @@ -186,13 +178,13 @@ #define add_page_to_inactive_list(page) \ do { \ - DEBUG_LRU_PAGE(page); \ list_add(&(page)->lru, &inactive_list); \ nr_inactive_pages++; \ } while (0) #define del_page_from_active_list(page) \ do { \ + BUG_ON(!PageActive(page)); \ list_del(&(page)->lru); \ ClearPageActive(page); \ nr_active_pages--; \ --- 2.4.19.clean/mm/filemap.c Fri Aug 2 20:39:46 2002 +++ 2.4.19/mm/filemap.c Fri Aug 30 21:40:33 2002 @@ -198,10 +198,12 @@ if (page->buffers && !try_to_free_buffers(page, 0)) goto unlock; - if (page_count(page) != 1) + if (page_count(page) != LRU_PLUS_CACHE) goto unlock; +#if LRU_PLUS_CACHE==1 __lru_cache_del(page); +#endif __remove_inode_page(page); UnlockPage(page); page_cache_release(page); @@ -234,8 +236,10 @@ static void truncate_complete_page(struct page *page) { /* Leave it on the LRU if it gets converted into anonymous buffers */ - if (!page->buffers || do_flushpage(page, 0)) + if (!page->buffers || do_flushpage(page, 0)) { + mmstat(truncate_lru_cache_del); lru_cache_del(page); + } /* * We remove the page from the page cache _after_ we have @@ -345,7 +349,7 @@ * The page is locked and we hold the pagecache_lock as well * so both page_count(page) and page->buffers stays constant here. 
*/ - if (page_count(page) == 1 + !!page->buffers) { + if (is_page_cache_freeable(page)) { /* Restart after this page */ list_del(head); list_add_tail(head, curr); --- 2.4.19.clean/mm/page_alloc.c Fri Aug 2 20:39:46 2002 +++ 2.4.19/mm/page_alloc.c Fri Aug 30 21:55:03 2002 @@ -82,9 +82,12 @@ /* Yes, think what happens when other parts of the kernel take * a reference to a page in order to pin it for io. -ben */ +#if LRU_PLUS_CACHE==2 + BUG_ON(PageLRU(page)); +#else if (PageLRU(page)) lru_cache_del(page); - +#endif if (page->buffers) BUG(); if (page->mapping) --- 2.4.19.clean/mm/swap.c Wed Nov 7 01:44:20 2001 +++ 2.4.19/mm/swap.c Sat Aug 31 18:56:07 2002 @@ -57,11 +57,31 @@ */ void lru_cache_add(struct page * page) { +#if LRU_PLUS_CACHE==2 + spin_lock(&pagemap_lru_lock); + if (likely(!PageLRU(page))) { + add_page_to_inactive_list(page); + SetPageLRU(page); + get_page(page); + } + spin_unlock(&pagemap_lru_lock); +#else if (!TestSetPageLRU(page)) { spin_lock(&pagemap_lru_lock); add_page_to_inactive_list(page); spin_unlock(&pagemap_lru_lock); } +#endif +} + +static inline void del_page_from_lru_list(struct page *page) +{ + BUG_ON(!PageLRU(page)); + if (PageActive(page)) + del_page_from_active_list(page); + else + del_page_from_inactive_list(page); + mmstat(lru_cache_del); } /** @@ -73,13 +93,14 @@ */ void __lru_cache_del(struct page * page) { - if (TestClearPageLRU(page)) { - if (PageActive(page)) { - del_page_from_active_list(page); - } else { - del_page_from_inactive_list(page); - } - } +#if LRU_PLUS_CACHE==2 + del_page_from_lru_list(page); + ClearPageLRU(page); + put_page_nofree(page); +#else + if (TestClearPageLRU(page)) + del_page_from_lru_list(page); +#endif } /** --- 2.4.19.clean/mm/swapfile.c Fri Aug 2 20:39:46 2002 +++ 2.4.19/mm/swapfile.c Fri Aug 30 21:32:20 2002 @@ -239,7 +239,7 @@ if (p->swap_map[SWP_OFFSET(entry)] == 1) { /* Recheck the page count with the pagecache lock held.. 
*/ spin_lock(&pagecache_lock); - if (page_count(page) - !!page->buffers == 2) + if (page_count(page) - has_buffers(page) == LRU_PLUS_CACHE + 1) retval = 1; spin_unlock(&pagecache_lock); } @@ -263,16 +263,16 @@ if (!PageLocked(page)) BUG(); switch (page_count(page)) { - case 3: - if (!page->buffers) + case LRU_PLUS_CACHE + 2: + if (!has_buffers(page)) break; /* Fallthrough */ - case 2: + case LRU_PLUS_CACHE + 1: if (!PageSwapCache(page)) break; retval = exclusive_swap_page(page); break; - case 1: + case LRU_PLUS_CACHE: if (PageReserved(page)) break; retval = 1; @@ -280,6 +280,11 @@ return retval; } +static inline int only_cached(struct page *page) +{ + return page_count(page) - has_buffers(page) == LRU_PLUS_CACHE + 1; +} + /* * Work out if there are any other processes sharing this * swap cache page. Free it if you can. Return success. @@ -294,7 +299,7 @@ BUG(); if (!PageSwapCache(page)) return 0; - if (page_count(page) - !!page->buffers != 2) /* 2: us + cache */ + if (!only_cached(page)) return 0; entry.val = page->index; @@ -307,7 +312,7 @@ if (p->swap_map[SWP_OFFSET(entry)] == 1) { /* Recheck the page count with the pagecache lock held.. */ spin_lock(&pagecache_lock); - if (page_count(page) - !!page->buffers == 2) { + if (only_cached(page)) { __delete_from_swap_cache(page); SetPageDirty(page); retval = 1; @@ -343,7 +348,7 @@ if (page) { page_cache_get(page); /* Only cache user (+us), or swap space full? Free it! 
*/ - if (page_count(page) - !!page->buffers == 2 || vm_swap_full()) { + if (only_cached(page) || vm_swap_full()) { delete_from_swap_cache(page); SetPageDirty(page); } --- 2.4.19.clean/mm/vmscan.c Fri Aug 2 20:39:46 2002 +++ 2.4.19/mm/vmscan.c Sat Aug 31 19:15:20 2002 @@ -89,7 +89,7 @@ mm->rss--; UnlockPage(page); { - int freeable = page_count(page) - !!page->buffers <= 2; + int freeable = page_count(page) - has_buffers(page) <= LRU_PLUS_CACHE + 1; page_cache_release(page); return freeable; } @@ -356,20 +356,31 @@ BUG_ON(PageActive(page)); list_del(entry); +#if LRU_PLUS_CACHE==2 + BUG_ON(!page_count(page)); + if (unlikely(page_count(page) == 1)) { + mmstat(vmscan_free_page); + BUG_ON(!PageLRU(page)); + ClearPageLRU(page); + put_page(page); + continue; + } +#endif list_add(entry, &inactive_list); +#if LRU_PLUS_CACHE==1 /* * Zero page counts can happen because we unlink the pages * _after_ decrementing the usage count.. */ if (unlikely(!page_count(page))) continue; - +#endif if (!memclass(page_zone(page), classzone)) continue; /* Racy check to avoid trylocking when not worthwhile */ - if (!page->buffers && (page_count(page) != 1 || !page->mapping)) + if (!page->buffers && (page_count(page) != LRU_PLUS_CACHE || !page->mapping)) goto page_mapped; /* @@ -434,9 +445,9 @@ */ spin_lock(&pagemap_lru_lock); UnlockPage(page); +#if LRU_PLUS_CACHE==1 __lru_cache_del(page); - - /* effectively free the page here */ +#endif page_cache_release(page); if (--nr_pages) @@ -507,9 +518,7 @@ __lru_cache_del(page); UnlockPage(page); - - /* effectively free the page here */ - page_cache_release(page); + put_page(page); if (--nr_pages) continue; ^ permalink raw reply [flat|nested] 65+ messages in thread
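The central invariant the patch establishes, that membership on the LRU list contributes one reference to page->count so the count can never reach zero while the page is still linked, can be sketched as a hypothetical userspace model (not kernel code; a pthread mutex stands in for pagemap_lru_lock):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Userspace model of the LRU_PLUS_CACHE==2 scheme: the LRU list
 * itself owns one reference to every page it holds. */
struct page {
    atomic_int count;
    int on_lru;                 /* models the PG_lru bit */
};

static pthread_mutex_t pagemap_lru_lock = PTHREAD_MUTEX_INITIALIZER;

static void lru_cache_add(struct page *page)
{
    pthread_mutex_lock(&pagemap_lru_lock);
    if (!page->on_lru) {        /* tolerate re-add, as in the fixed version */
        page->on_lru = 1;
        atomic_fetch_add(&page->count, 1);  /* get_page(): the LRU's ref */
    }
    pthread_mutex_unlock(&pagemap_lru_lock);
}

/* Caller must hold pagemap_lru_lock and a second reference, so this
 * decrement can never free the page (the put_page_nofree rule). */
static void __lru_cache_del(struct page *page)
{
    assert(page->on_lru);
    assert(atomic_load(&page->count) >= 2);
    page->on_lru = 0;
    atomic_fetch_sub(&page->count, 1);
}
```

A duplicate add is a no-op rather than a BUG, matching the second lru_cache_add variant above.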
* Re: [RFC] [PATCH] Include LRU in page count 2002-08-31 19:47 ` Daniel Phillips @ 2002-08-31 20:26 ` Andrew Morton 2002-08-31 21:05 ` Daniel Phillips 2002-09-01 21:32 ` Daniel Phillips 2002-09-02 17:23 ` Christian Ehrhardt 1 sibling, 2 replies; 65+ messages in thread From: Andrew Morton @ 2002-08-31 20:26 UTC (permalink / raw) To: Daniel Phillips Cc: Christian Ehrhardt, Linus Torvalds, Marcelo Tosatti, linux-kernel Daniel Phillips wrote: > > Manfred suggested an approach to de-racing this race using > atomic_dec_and_lock, which needs to be compared to the current approach. Could simplify things, but not all architectures have an optimised version. So ia64, mips, parisc and s390 would end up taking the lru lock on every page_cache_release. ^ permalink raw reply [flat|nested] 65+ messages in thread
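The tradeoff Andrew describes can be made concrete: the generic fallback for atomic_dec_and_lock() pays for the lock on every decrement, while an optimized version decrements locklessly unless the count might actually reach zero. A hypothetical userspace sketch (pthread mutex in place of a spinlock, not the kernel implementation):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Generic fallback: every decrement takes the lock, which is what the
 * architectures without an optimised version would pay. Returns 1 with
 * the lock held if the count dropped to zero. */
static int dec_and_lock_generic(atomic_int *count, pthread_mutex_t *lock)
{
    pthread_mutex_lock(lock);
    if (atomic_fetch_sub(count, 1) == 1)
        return 1;               /* hit zero: caller must unlock */
    pthread_mutex_unlock(lock);
    return 0;
}

/* Optimized flavor: decrement with a compare-and-swap while the count
 * is safely above one, falling back to the locked path only when the
 * decrement might reach zero. */
static int dec_and_lock_fast(atomic_int *count, pthread_mutex_t *lock)
{
    int old = atomic_load(count);
    while (old > 1) {
        if (atomic_compare_exchange_weak(count, &old, old - 1))
            return 0;           /* common case: no lock taken */
    }
    return dec_and_lock_generic(count, lock);
}
```

In the common case (count > 1) the fast path touches only the counter, which is why the missing optimised version matters on ia64, mips, parisc and s390.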
* Re: [RFC] [PATCH] Include LRU in page count 2002-08-31 20:26 ` Andrew Morton @ 2002-08-31 21:05 ` Daniel Phillips 2002-08-31 22:30 ` William Lee Irwin III 2002-09-01 21:32 ` Daniel Phillips 1 sibling, 1 reply; 65+ messages in thread From: Daniel Phillips @ 2002-08-31 21:05 UTC (permalink / raw) To: Andrew Morton Cc: Christian Ehrhardt, Linus Torvalds, Marcelo Tosatti, linux-kernel On Saturday 31 August 2002 22:26, Andrew Morton wrote: > Daniel Phillips wrote: > > Manfred suggested an approach to de-racing this race using > > atomic_dec_and_lock, which needs to be compared to the current approach. > > Could simplify things, but not all architectures have an optimised > version. So ia64, mips, parisc and s390 would end up taking > the lru lock on every page_cache_release. As far as implementing it goes, some instances of page_cache_release are already under the lru lock and would need an alternative, non-locking version. The current patch seems satisfactory performance-wise and if it's also raceless as it's supposed to be, it gives us something that works, and we can evaluate alternatives at our leisure. Right now I'm afraid we have something that just works most of the time. I think we're getting to the point where this needs to get some heavy beating up, to see what happens. -- Daniel ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] [PATCH] Include LRU in page count 2002-08-31 21:05 ` Daniel Phillips @ 2002-08-31 22:30 ` William Lee Irwin III 2002-09-01 3:36 ` Daniel Phillips 0 siblings, 1 reply; 65+ messages in thread From: William Lee Irwin III @ 2002-08-31 22:30 UTC (permalink / raw) To: Daniel Phillips Cc: Andrew Morton, Christian Ehrhardt, Linus Torvalds, Marcelo Tosatti, linux-kernel, rmk On Sat, Aug 31, 2002 at 11:05:02PM +0200, Daniel Phillips wrote: > The current patch seems satisfactory performance-wise and if it's > also raceless as it's supposed to be, it gives us something that works, > and we can evaluate alternatives at our leisure. Right now I'm afraid > we have something that just works most of the time. > I think we're getting to the point where this needs to get some heavy > beating up, to see what happens. It's not going to get much heavier than how I'm beating on it. Although it seems my box is mighty bored during 64 simultaneous tiobench 256's (i.e. 16384 tasks). I got bored & compiled a kernel: make -j64 bzImage 304.60s user 848.70s system 694% cpu 2:46.05 total The cpus are 95+% idle except for when I touch /proc/, where the task fishing around /proc/ gets stuck spinning hard in the kernel for anywhere from 30 minutes to several hours before killing it succeeds. It didn't quite finish the run, as the tty deadlock happened again. The VM doesn't appear to be oopsing, though I should slap on the OOM fixes. Cheers, Bill ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] [PATCH] Include LRU in page count 2002-08-31 22:30 ` William Lee Irwin III @ 2002-09-01 3:36 ` Daniel Phillips 0 siblings, 0 replies; 65+ messages in thread From: Daniel Phillips @ 2002-09-01 3:36 UTC (permalink / raw) To: William Lee Irwin III Cc: Andrew Morton, Christian Ehrhardt, Linus Torvalds, Marcelo Tosatti, linux-kernel, rmk On Sunday 01 September 2002 00:30, William Lee Irwin III wrote: > On Sat, Aug 31, 2002 at 11:05:02PM +0200, Daniel Phillips wrote: > > The current patch seems satisfactory performance-wise and if it's > > also raceless as it's supposed to be, it gives us something that works, > > and we can evaluate alternatives at our leisure. Right now I'm afraid > > we have something that just works most of the time. > > I think we're getting to the point where this needs to get some heavy > > beating up, to see what happens. > > It's not going to get much heavier than how I'm beating on it. > Although it seems my box is mighty bored during 64 simultaneous > tiobench 256's (i.e. 16384 tasks). I got bored & compiled a kernel: > > make -j64 bzImage 304.60s user 848.70s system 694% cpu 2:46.05 total > > The cpus are 95+% idle except for when I touch /proc/, where the task > fishing around /proc/ gets stuck spinning hard in the kernel for > anywhere from 30 minutes to several hours before killing it succeeds. > It didn't quite finish the run, as the tty deadlock happened again. The > VM doesn't appear to be oopsing, though I should slap on the OOM fixes. Is that good or bad? It sounds like the VM cycle is doing what it's supposed to. I've been running lots_of_forks for about 5 hours and things have reached a steady state with free memory slowly ratcheting down 1% then, apparently in a sudden fit of scanning, jumping to 2.5%. Pretty normal looking behaviour, though I don't much like the way the free list refill is so sporadic. I can't think of any reason why that is good. It doesn't cost a lot of cpu though. 
There's been a little bit of (useless) swapping, so I guess I can consider that path at least somewhat tested. Every now and then, instead of free jumping up to 2.5% it jumps up to over 10%. I doubt that's my fault. Even though I'm missing an if (--nr_pages) at the orphan collection point, there aren't enough orphans to cause that effect - about 3 million out of 12 billion lru_cache_dels. The orphans appear during or shortly after the episodes of free list filling, as you would expect, since page_cache_release fails the trylock whenever shrink_cache is active. There are lots of parallel page_cache_releases going on, but they don't seem to create any noticeable number of orphans. I think what's happening normally is this:

	if (page_count(page) == 2)	/* no, it's 3 */
	put_page(page);
	if (page_count(page) == 2)	/* now it is */
	if (PageLRU(page) && page_count(page) == 2)
		__lru_cache_del(page);

as opposed to this:

	if (page_count(page) == 2)	/* no, it's 3 */
	if (page_count(page) == 2)	/* no, still 3 */
	put_page(page);
	put_page(page);

which requires two releasers hitting a 3 instruction window on the same page. It's not going to happen often.

Statm_pte_range looks like it's going to get even more confused about what constitutes a shared page than it already is. The solution to this is to examine the rmap. -- Daniel ^ permalink raw reply [flat|nested] 65+ messages in thread
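The orphan mechanism under discussion can be modeled in userspace (a hypothetical sketch, not the kernel code): count == 2 means one user plus the LRU's reference, and when the trylock fails because the scanner holds the lock, the page is left on the list with count == 1 for shrink_cache to collect later.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

struct page {
    atomic_int count;
    int on_lru;
};

static pthread_mutex_t pagemap_lru_lock = PTHREAD_MUTEX_INITIALIZER;

/* Model of the trylock-based page_cache_release in the patch.
 * Returns 1 if this call dropped the last reference. */
static int page_cache_release(struct page *page)
{
    if (atomic_load(&page->count) == 2 &&
        pthread_mutex_trylock(&pagemap_lru_lock) == 0) {
        /* recheck under the lock, as in the patch */
        if (page->on_lru && atomic_load(&page->count) == 2) {
            page->on_lru = 0;                 /* __lru_cache_del() */
            atomic_fetch_sub(&page->count, 1);
        }
        pthread_mutex_unlock(&pagemap_lru_lock);
    }
    /* put_page() */
    return atomic_fetch_sub(&page->count, 1) == 1;
}
```

When the trylock succeeds the page comes off the list and is freed in one pass; when it fails the put_page still runs, creating the count == 1 orphan described above.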
* Re: [RFC] [PATCH] Include LRU in page count 2002-08-31 20:26 ` Andrew Morton 2002-08-31 21:05 ` Daniel Phillips @ 2002-09-01 21:32 ` Daniel Phillips 2002-09-01 22:09 ` Andrew Morton 1 sibling, 1 reply; 65+ messages in thread From: Daniel Phillips @ 2002-09-01 21:32 UTC (permalink / raw) To: Andrew Morton Cc: Christian Ehrhardt, Linus Torvalds, Marcelo Tosatti, linux-kernel On Saturday 31 August 2002 22:26, Andrew Morton wrote: > Daniel Phillips wrote: > > > > Manfred suggested an approach to de-racing this race using > > atomic_dec_and_lock, which needs to be compared to the current approach. > > Could simplify things, but not all architectures have an optimised > version. So ia64, mips, parisc and s390 would end up taking > the lru lock on every page_cache_release. I've put some more thought into this and I don't see any real problem with the atomic_dec_and_lock strategy. The only efficiency drawback is the extra locked cmpxchg on every page count dec, and that most likely tips the efficiency advantage ever so slightly in favor of my current strategy. Something else I like about the current strategy is the way the trylock avoids contention - if shrink_cache is active it just leaves the page on the lru for collection later. Sweeping up the orphans is efficient, since the lru lock is batched by shrink_cache. I think I may be even able to make this all work without holding the extra count, but I'll treat that as a background project. The important thing is, the page reaping cycle was never correct before, now it just might be. I'm looking at your spinlock_irq now and thinking the _irq part could possibly be avoided. Can you please remind me of the motivation for this - was it originally intended to address the same race we've been working on here? I can see the advantage of being able to take the lru lock from interrupt context, but we may be able to achieve the intended effect with a trylock. -- Daniel ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] [PATCH] Include LRU in page count 2002-09-01 21:32 ` Daniel Phillips @ 2002-09-01 22:09 ` Andrew Morton 2002-09-01 22:08 ` Daniel Phillips 2002-09-01 22:20 ` Daniel Phillips 0 siblings, 2 replies; 65+ messages in thread From: Andrew Morton @ 2002-09-01 22:09 UTC (permalink / raw) To: Daniel Phillips Cc: Christian Ehrhardt, Linus Torvalds, Marcelo Tosatti, linux-kernel Daniel Phillips wrote: > > ... > I'm looking at your spinlock_irq now and thinking the _irq part could > possibly be avoided. Can you please remind me of the motivation for this - > was it originally intended to address the same race we've been working on > here? > scalability, mainly. If the CPU holding the lock takes an interrupt, all the other CPUs get to spin until the handler completes. I measured a 30% reduction in contention from this. Not a generally justifiable trick, but this is a heavily-used lock. All the new games in refill_inactive() are there to minimise the interrupt-off time. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] [PATCH] Include LRU in page count 2002-09-01 22:09 ` Andrew Morton @ 2002-09-01 22:08 ` Daniel Phillips 2002-09-01 22:20 ` Daniel Phillips 1 sibling, 0 replies; 65+ messages in thread From: Daniel Phillips @ 2002-09-01 22:08 UTC (permalink / raw) To: Andrew Morton Cc: Christian Ehrhardt, Linus Torvalds, Marcelo Tosatti, linux-kernel On Monday 02 September 2002 00:09, Andrew Morton wrote: > Daniel Phillips wrote: > > > > ... > > I'm looking at your spinlock_irq now and thinking the _irq part could > > possibly be avoided. Can you please remind me of the motivation for this - > > was it originally intended to address the same race we've been working on > > here? > > > > scalability, mainly. If the CPU holding the lock takes an interrupt, > all the other CPUs get to spin until the handler completes. I measured > a 30% reducton in contention from this. > > Not a generally justifiable trick, but this is a heavily-used lock. > All the new games in refill_inactive() are there to minimise the > interrupt-off time. Per-cpu lru lists? Your solution is a lot simpler... -- Daniel ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] [PATCH] Include LRU in page count 2002-09-01 22:09 ` Andrew Morton 2002-09-01 22:08 ` Daniel Phillips @ 2002-09-01 22:20 ` Daniel Phillips 2002-09-01 23:08 ` Andrew Morton 1 sibling, 1 reply; 65+ messages in thread From: Daniel Phillips @ 2002-09-01 22:20 UTC (permalink / raw) To: Andrew Morton Cc: Christian Ehrhardt, Linus Torvalds, Marcelo Tosatti, linux-kernel On Monday 02 September 2002 00:09, Andrew Morton wrote: > Daniel Phillips wrote: > > > > ... > > I'm looking at your spinlock_irq now and thinking the _irq part could > > possibly be avoided. Can you please remind me of the motivation for this - > > was it originally intended to address the same race we've been working on > > here? > > scalability, mainly. If the CPU holding the lock takes an interrupt, > all the other CPUs get to spin until the handler completes. I measured > a 30% reducton in contention from this. > > Not a generally justifiable trick, but this is a heavily-used lock. > All the new games in refill_inactive() are there to minimise the > interrupt-off time. Ick. I hope you really chopped the lock hold time into itty-bitty pieces. Note that I changed the spin_lock in page_cache_release to a trylock, maybe it's worth checking out the effect on contention. With a little head scratching we might be able to get rid of the spin_lock in lru_cache_add as well. That leaves (I think) just the two big scan loops. I've always felt it's silly to run more than one of either at the same time anyway. -- Daniel ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] [PATCH] Include LRU in page count 2002-09-01 22:20 ` Daniel Phillips @ 2002-09-01 23:08 ` Andrew Morton 2002-09-01 23:19 ` Daniel Phillips ` (2 more replies) 0 siblings, 3 replies; 65+ messages in thread From: Andrew Morton @ 2002-09-01 23:08 UTC (permalink / raw) To: Daniel Phillips Cc: Christian Ehrhardt, Linus Torvalds, Marcelo Tosatti, linux-kernel Daniel Phillips wrote: > > On Monday 02 September 2002 00:09, Andrew Morton wrote: > > Daniel Phillips wrote: > > > > > > ... > > > I'm looking at your spinlock_irq now and thinking the _irq part could > > > possibly be avoided. Can you please remind me of the motivation for this - > > > was it originally intended to address the same race we've been working on > > > here? > > > > scalability, mainly. If the CPU holding the lock takes an interrupt, > > all the other CPUs get to spin until the handler completes. I measured > > a 30% reducton in contention from this. > > > > Not a generally justifiable trick, but this is a heavily-used lock. > > All the new games in refill_inactive() are there to minimise the > > interrupt-off time. > > Ick. I hope you really chopped the lock hold time into itty-bitty pieces. Max hold is around 7,000 cycles. > Note that I changed the spin_lock in page_cache_release to a trylock, maybe > it's worth checking out the effect on contention. With a little head > scratching we might be able to get rid of the spin_lock in lru_cache_add as > well. That leaves (I think) just the two big scan loops. I've always felt > it's silly to run more than one of either at the same time anyway. No way. Take a look at http://samba.org/~anton/linux/2.5.30/ That's 8-way power4, the workload is "dd from 7 disks dd if=/dev/sd* of=/dev/null bs=1024k". The CPU load in this situation was dominated by the VM. The LRU list and page reclaim. Spending more CPU in lru_cache_add() than in copy_to_user() is pretty gross. 
I think it's under control now - I expect we're OK up to eight or sixteen CPUs per node, but I'm still not happy with the level of testing. Most people who have enough hardware to test this tend to have so much RAM that their tests don't hit page reclaim. slablru will probably change the picture too. There's a weird effect with the traditional try_to_free_pages() code. The first thing it does is to run kmem_cache_shrink(). Which takes a semaphore, fruitlessly fiddles with the slab and then runs page reclaim. On the 4-way I measured 25% contention on the spinlock inside that semaphore. What is happening is that the combination of the sleeping lock and the slab operations effectively serialises entry into page reclaim. slablru removes that little turnstile on entry to try_to_free_pages(), and we will now see a lot more concurrency in shrink_foo(). My approach was to keep the existing design and warm it up, rather than to redesign. Is it "good enough" now? Dunno - nobody has run the tests with slablru. But it's probably too late for a redesign (such as per-cpu LRU, per-mapping lru, etc). It would be great to make presence on the LRU contribute to page->count, because that would permit the removal of a ton of page_cache_get/release operations inside the LRU lock, perhaps doubling throughput in there. Guess I should get off my lazy butt and see what you've done (will you for heaven's sake go and buy an IDE disk and compile up a 2.5 kernel? :)) ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] [PATCH] Include LRU in page count 2002-09-01 23:08 ` Andrew Morton @ 2002-09-01 23:19 ` Daniel Phillips 2002-09-01 23:28 ` William Lee Irwin III 2002-09-01 23:33 ` Daniel Phillips 2 siblings, 0 replies; 65+ messages in thread From: Daniel Phillips @ 2002-09-01 23:19 UTC (permalink / raw) To: Andrew Morton Cc: Christian Ehrhardt, Linus Torvalds, Marcelo Tosatti, linux-kernel On Monday 02 September 2002 01:08, Andrew Morton wrote: > Daniel Phillips wrote: > It would be great to make presence on the LRU contribute to page->count, because > that would permit the removal of a ton of page_cache_get/release operations inside > the LRU lock, perhaps doubling throughput in there. It's pretty much done, it just needs some torture testing: http://people.nl.linux.org/~phillips/patches/lru.race-2.4.19 The handy /proc/mmstat ad hoc statistics are still in there. The patch supports both zero and one-flavors of lru contribution to page count for the time being, with the latter as the default. > Guess I should get off my > lazy butt and see what you've done (will you for heaven's sake go and buy an IDE > disk and compile up a 2.5 kernel? :)) I was just thinking about that, time to go poke at the DAC960 again. Yes I do remember Jens and Bill offered to help :-) -- Daniel ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] [PATCH] Include LRU in page count 2002-09-01 23:08 ` Andrew Morton 2002-09-01 23:19 ` Daniel Phillips @ 2002-09-01 23:28 ` William Lee Irwin III 2002-09-01 23:33 ` Daniel Phillips 2 siblings, 0 replies; 65+ messages in thread From: William Lee Irwin III @ 2002-09-01 23:28 UTC (permalink / raw) To: Andrew Morton Cc: Daniel Phillips, Christian Ehrhardt, Linus Torvalds, Marcelo Tosatti, linux-kernel On Sun, Sep 01, 2002 at 04:08:03PM -0700, Andrew Morton wrote: > My approach was to keep the existing design and warm it up, rather than to > redesign. Is it "good enough" now? Dunno - nobody has run the tests > with slablru. But it's probably too late for a redesign (such as per-cpu LRU, > per-mapping lru, etc). > > It would be great to make presence on the LRU contribute to > page->count, because that would permit the removal of a ton of > page_cache_get/release operations inside the LRU lock, perhaps > doubling throughput in there. Guess I should get off my lazy butt > and see what you've done (will you for heaven's sake go and buy an IDE > disk and compile up a 2.5 kernel? :)) I've at least been testing (if not at least attempting to debug) 2.5.32-mm2, which includes slablru. Cheers, Bill ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] [PATCH] Include LRU in page count 2002-09-01 23:08 ` Andrew Morton @ 2002-09-01 23:33 ` Daniel Phillips 2002-09-02 0:17 ` Andrew Morton 2002-09-02 1:08 ` Rik van Riel 2 siblings, 2 replies; 65+ messages in thread From: Daniel Phillips @ 2002-09-01 23:33 UTC (permalink / raw) To: Andrew Morton Cc: Christian Ehrhardt, Linus Torvalds, Marcelo Tosatti, linux-kernel On Monday 02 September 2002 01:08, Andrew Morton wrote: > Daniel Phillips wrote: > > Note that I changed the spin_lock in page_cache_release to a trylock, maybe > > it's worth checking out the effect on contention. With a little head > > scratching we might be able to get rid of the spin_lock in lru_cache_add as > > well. That leaves (I think) just the two big scan loops. I've always felt > > it's silly to run more than one of either at the same time anyway. > > No way. Take a look at http://samba.org/~anton/linux/2.5.30/ > > That's 8-way power4, the workload is "dd from 7 disks > dd if=/dev/sd* of=/dev/null bs=1024k". > > The CPU load in this situation was dominated by the VM. The LRU list and page > reclaim. Spending more CPU in lru_cache_add() than in copy_to_user() is > pretty gross. Are we looking at the same thing? The cpu load there is dominated by cpu_idle, 89%. Anyway, if your point is that it makes sense to run shrink_cache or refill_inactive in parallel, I don't see it because they'll serialize on the lru lock anyway. What would make sense is to make shrink_cache nonblocking. > My approach was to keep the existing design and warm it up, rather than to > redesign. Yup. -- Daniel ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] [PATCH] Include LRU in page count 2002-09-01 23:33 ` Daniel Phillips @ 2002-09-02 0:17 ` Andrew Morton 2002-09-02 0:30 ` Daniel Phillips 2002-09-02 1:08 ` Rik van Riel 1 sibling, 1 reply; 65+ messages in thread From: Andrew Morton @ 2002-09-02 0:17 UTC (permalink / raw) To: Daniel Phillips Cc: Christian Ehrhardt, Linus Torvalds, Marcelo Tosatti, linux-kernel Daniel Phillips wrote: > > On Monday 02 September 2002 01:08, Andrew Morton wrote: > > Daniel Phillips wrote: > > > Note that I changed the spin_lock in page_cache_release to a trylock, maybe > > > it's worth checking out the effect on contention. With a little head > > > scratching we might be able to get rid of the spin_lock in lru_cache_add as > > > well. That leaves (I think) just the two big scan loops. I've always felt > > > it's silly to run more than one of either at the same time anyway. > > > > No way. Take a look at http://samba.org/~anton/linux/2.5.30/ > > > > That's 8-way power4, the workload is "dd from 7 disks > > dd if=/dev/sd* of=/dev/null bs=1024k". > > > > The CPU load in this situation was dominated by the VM. The LRU list and page > > reclaim. Spending more CPU in lru_cache_add() than in copy_to_user() is > > pretty gross. > > Are we looking at the same thing? The cpu load there is dominated by cpu_idle, > 89%. Apart from that, dummy ;) The absolute numbers are proportional to IO bandwidth. And 7/8ths of a scsi disk per CPU isn't much. > Anyway, if your point is that it makes sense to run shrink_cache or > refill_inactive in parallel, I don't see it because they'll serialize on the > lru lock anyway. What would make sense is to make shink_cache nonblocking. Well shrink_list() runs locklessly now. But there remains a significant cost in shrink_cache(), a little of which will be due to the hopefully-removable page_cache_get() inside the lock. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] [PATCH] Include LRU in page count 2002-09-02 0:17 ` Andrew Morton @ 2002-09-02 0:30 ` Daniel Phillips 2002-09-02 1:50 ` Andrew Morton 0 siblings, 1 reply; 65+ messages in thread From: Daniel Phillips @ 2002-09-02 0:30 UTC (permalink / raw) To: Andrew Morton Cc: Christian Ehrhardt, Linus Torvalds, Marcelo Tosatti, linux-kernel On Monday 02 September 2002 02:17, Andrew Morton wrote: > Daniel Phillips wrote: > > > > On Monday 02 September 2002 01:08, Andrew Morton wrote: > > > Daniel Phillips wrote: > > > > Note that I changed the spin_lock in page_cache_release to a trylock, maybe > > > > it's worth checking out the effect on contention. With a little head > > > > scratching we might be able to get rid of the spin_lock in lru_cache_add as > > > > well. That leaves (I think) just the two big scan loops. I've always felt > > > > it's silly to run more than one of either at the same time anyway. > > > > > > No way. Take a look at http://samba.org/~anton/linux/2.5.30/ > > > > > > That's 8-way power4, the workload is "dd from 7 disks > > > dd if=/dev/sd* of=/dev/null bs=1024k". > > > > > > The CPU load in this situation was dominated by the VM. The LRU list and page > > > reclaim. Spending more CPU in lru_cache_add() than in copy_to_user() is > > > pretty gross. > > > > Are we looking at the same thing? The cpu load there is dominated by cpu_idle, > > 89%. > > Apart from that, dummy ;) The absolute numbers are proportional > to IO bandwidth. And 7/8ths of a scsi disk per CPU isn't much. OK, fine, my point here is that there's no real excuse for contention on the lru lock. Without departing radically from the current design the most you can possibly spend on page reclaiming is 100% of one cpu, and we ought to simply aim at getting the most bang for the buck out of that one cpu. It follows that we want only one copy of shrink_cache or refill_inactive running at a time, otherwise we're tossing away cycles spinning on locks and getting nothing back. 
If lru_cache_add is costly then we can kill both the lock contention and the cacheline pingponging by feeding the new pages into a local list, then grafting the whole list onto the inactive list when we start to scan. We want to do something like that for nonsucky readahead as well. -- Daniel ^ permalink raw reply [flat|nested] 65+ messages in thread
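[Editorial note: the local-list grafting Daniel proposes above — buffer newly-added pages per CPU, then splice the whole batch onto the LRU under a single lock acquisition — is the idea 2.5.33 later adopted (see the next message). A minimal userspace C sketch follows; `lru_cache_add_batched`, `lru_graft_batch`, the batch size, and the acquisition counter are all hypothetical illustration, not the kernel's code.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: buffer new pages in a local batch, then graft the whole
 * batch onto the inactive list under one lock acquisition instead of
 * taking the lru lock once per page. */

#define BATCH 16

struct page { struct page *next; };

static struct page *inactive_list;   /* global LRU, lock-protected  */
static int lock_acquisitions;        /* stands in for lru lock cost */

static struct page *local_batch[BATCH];
static int batch_count;

static void lru_graft_batch(void)
{
	if (batch_count == 0)
		return;
	lock_acquisitions++;         /* spin_lock(&pagemap_lru_lock); */
	for (int i = 0; i < batch_count; i++) {
		local_batch[i]->next = inactive_list;
		inactive_list = local_batch[i];
	}
	batch_count = 0;             /* spin_unlock(&pagemap_lru_lock); */
}

static void lru_cache_add_batched(struct page *page)
{
	local_batch[batch_count++] = page;
	if (batch_count == BATCH)
		lru_graft_batch();
}

/* How many lock acquisitions does adding npages cost? */
int lock_cost_for(int npages)
{
	struct page pages[npages];
	inactive_list = NULL;
	lock_acquisitions = 0;
	batch_count = 0;
	for (int i = 0; i < npages; i++)
		lru_cache_add_batched(&pages[i]);
	lru_graft_batch();           /* flush remainder before scanning */
	return lock_acquisitions;
}
```

With a batch of 16, adding 64 pages takes the lock 4 times instead of 64 — which is why this also kills the cacheline ping-ponging Daniel mentions.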
* Re: [RFC] [PATCH] Include LRU in page count 2002-09-02 0:30 ` Daniel Phillips @ 2002-09-02 1:50 ` Andrew Morton 0 siblings, 0 replies; 65+ messages in thread From: Andrew Morton @ 2002-09-02 1:50 UTC (permalink / raw) To: Daniel Phillips Cc: Christian Ehrhardt, Linus Torvalds, Marcelo Tosatti, linux-kernel Daniel Phillips wrote: > > ... > If lru_cache_add is costly then we can kill both the lock contention and the > cacheline pingponging by feeding the new pages into a local list, then > grafting the whole list onto the inactive list when we start to scan. 2.5.33 does this. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] [PATCH] Include LRU in page count 2002-09-01 23:33 ` Daniel Phillips 2002-09-02 0:17 ` Andrew Morton @ 2002-09-02 1:08 ` Rik van Riel 1 sibling, 0 replies; 65+ messages in thread From: Rik van Riel @ 2002-09-02 1:08 UTC (permalink / raw) To: Daniel Phillips Cc: Andrew Morton, Christian Ehrhardt, Linus Torvalds, Marcelo Tosatti, linux-kernel On Mon, 2 Sep 2002, Daniel Phillips wrote: > Are we looking at the same thing? The cpu load there is dominated by > cpu_idle, 89%. Anyway, if your point is that it makes sense to run > shrink_cache or refill_inactive in parallel, I don't see it because > they'll serialize on the lru lock anyway. What would make sense is to > make shink_cache nonblocking. Working on it ;) Rik -- Bravely reimplemented by the knights who say "NIH". http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] [PATCH] Include LRU in page count
2002-08-31 19:47 ` Daniel Phillips 2002-08-31 20:26 ` Andrew Morton @ 2002-09-02 17:23 ` Christian Ehrhardt 2002-09-02 18:01 ` Daniel Phillips 1 sibling, 1 reply; 65+ messages in thread
From: Christian Ehrhardt @ 2002-09-02 17:23 UTC (permalink / raw)
To: Daniel Phillips
Cc: Andrew Morton, Linus Torvalds, Marcelo Tosatti, linux-kernel

On Sat, Aug 31, 2002 at 09:47:29PM +0200, Daniel Phillips wrote:
> > Also there may be lru-only pages on the active list, i.e. refill_inactive
> > should have this hunk as well:
> >
> > > +#if LRU_PLUS_CACHE==2
> > > +		BUG_ON(!page_count(page));
> > > +		if (unlikely(page_count(page) == 1)) {
> > > +			mmstat(vmscan_free_page);
> > > +			BUG_ON(!TestClearPageLRU(page));	// side effect abuse!!
> > > +			put_page(page);
> > > +			continue;
> > > +		}
> > > +#endif
>
> If we have orphans on the active list, we'd probably better just count
> them and figure out what we're doing wrong to put them there in the first
> place. In time they will migrate to the inactive list and get cleaned up.

Hm, think of your favourite memory hog being killed with lots of anon
pages on the active list while someone else holds the lru lock.
Won't all these anon pages legally end up orphaned on the active list
(due to the trylock in page_cache_release)?

regards   Christian

--
THAT'S ALL FOLKS!

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] [PATCH] Include LRU in page count 2002-09-02 17:23 ` Christian Ehrhardt @ 2002-09-02 18:01 ` Daniel Phillips 0 siblings, 0 replies; 65+ messages in thread From: Daniel Phillips @ 2002-09-02 18:01 UTC (permalink / raw) To: Christian Ehrhardt Cc: Andrew Morton, Linus Torvalds, Marcelo Tosatti, linux-kernel On Monday 02 September 2002 19:23, Christian Ehrhardt wrote: > On Sat, Aug 31, 2002 at 09:47:29PM +0200, Daniel Phillips wrote: > > > Also there may be lru only pages on the active list, i.e. refill > > > inactive should have this hunk as well: > > > > > > > +#if LRU_PLUS_CACHE==2 > > > > + BUG_ON(!page_count(page)); > > > > + if (unlikely(page_count(page) == 1)) { > > > > + mmstat(vmscan_free_page); > > > > + BUG_ON(!TestClearPageLRU(page)); // side effect abuse!! > > > > + put_page(page); > > > > + continue; > > > > + } > > > > +#endif > > > > If we have orphans on the active list, we'd probably better just count > > them and figure out what we're doing wrong to put them there in the first > > place. In time they will migrate to the inactive list and get cleaned > > up. > > Hm, think of your favourite memory hog being killed with lots of anon > pages on the active list while someone else holds the lru lock. > Won't all these anon pages legally end up orphaned on the active list > (due to the trylock in page_cache_release)? Some of them will, for one pass of refill_inactive. It seems hard to justify saving a single pass through the active list executed in rare, pathological circumstances, in return for adding even a simple test, executed commonly. On a dual processor system, one of them should be scanning while the other is oom_killing. It should work out fine. -- Daniel ^ permalink raw reply [flat|nested] 65+ messages in thread
* [RFC] Alternative raceless page free
2002-08-30 23:03 ` [RFC] [PATCH] Include LRU in page count Daniel Phillips 2002-08-31 16:14 ` Christian Ehrhardt @ 2002-09-05 4:42 ` Daniel Phillips 2002-09-05 12:34 ` Christian Ehrhardt 1 sibling, 1 reply; 65+ messages in thread
From: Daniel Phillips @ 2002-09-05 4:42 UTC (permalink / raw)
To: Andrew Morton, Linus Torvalds, Marcelo Tosatti
Cc: linux-kernel, Christian Ehrhardt

For completeness, I implemented the atomic_dec_and_test version of raceless
page freeing suggested by Manfred Spraul. The atomic_dec_and_test approach
eliminates the free race by ensuring that when a page's count drops to zero
the lru list lock is taken atomically, leaving no window where the page can
also be found and manipulated on the lru list.[1] Both this and the
extra-lru-count version are supported in the linked patch:

http://people.nl.linux.org/~phillips/patches/lru.race-2.4.19-2

The atomic_dec_and_test version is slightly simpler, but was actually more
work to implement because of the need to locate and eliminate all uses of
page_cache_release where the lru lock is known to be held, as these will
deadlock. That had the side effect of eliminating a number of ifdefs vs the
lru count version, and rooting out some hidden redundancy.

The patch exposes __free_pages_ok, which must be called directly by the
atomic_dec_and_lock variant. In the process it got a less confusing name -
recover_pages. (The incumbent name is confusing because all other 'free'
variants in addition manipulate the page count.)

It's a close call which version is faster. I suspect the atomic_dec_and_lock
version will not scale quite as well because of the bus-locked cmpxchg on
the page count (optimized version; unoptimized version always takes the
spinlock), but neither version really lacks in the speed department. I have
a slight preference for the extra-lru-count version, because of the trylock
in page_cache_release.
This means that nobody will have to spin when shrink_cache is active.
Instead, freed pages that collide with the lru lock can just be left on the
lru list to be picked up efficiently later. The trylock also allows the lru
lock to be acquired speculatively from interrupt context, without a
requirement that lru lock holders disable interrupts.

Both versions are provably correct, modulo implementation gaffes.

The linked patch defaults to the atomic_dec_and_lock version. To change to
the extra count version, define LRU_PLUS_CACHE as 2 instead of 1.

Christian, can you please run this one through your race detector?

[1] As a corollary, pages with zero count can never be found on the lru
list, so that is treated as a bug.

-- Daniel

^ permalink raw reply [flat|nested] 65+ messages in thread
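[Editorial note: the atomic_dec_and_lock() discipline Daniel describes can be sketched in userspace C. Everything below is a toy model — the struct, the helper names, and the pthread mutex standing in for pagemap_lru_lock are assumptions, not the patch's code. The point is the invariant from footnote [1]: the decrement that reaches zero happens with the list lock held, so a zero-count page is never visible to anyone scanning the lru.]

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct page {
	atomic_int count;
	bool on_lru;
};

static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

/* Decrement; return true with the lock HELD iff the count hit zero. */
static bool dec_and_lock(atomic_int *count, pthread_mutex_t *lock)
{
	int old = atomic_load(count);
	/* Fast path: while the count is above 1, a plain (bus-locked)
	 * cmpxchg decrement suffices and the lock is never touched. */
	while (old > 1) {
		if (atomic_compare_exchange_weak(count, &old, old - 1))
			return false;
	}
	/* Slow path: the decrement that may reach zero happens under the
	 * lock, so a zero count can never be observed by a list scanner. */
	pthread_mutex_lock(lock);
	if (atomic_fetch_sub(count, 1) == 1)
		return true;	/* caller unlinks from the lru and frees */
	pthread_mutex_unlock(lock);
	return false;
}

int release_page(struct page *page)	/* returns 1 if the page was freed */
{
	if (dec_and_lock(&page->count, &lru_lock)) {
		page->on_lru = false;	/* del_page_from_lru(page) */
		pthread_mutex_unlock(&lru_lock);
		/* __free_pages_ok(page) -- "recover_pages" in the patch */
		return 1;
	}
	return 0;
}
```

The fast path is where the scaling concern mentioned above lives: every uncontended release still pays for one locked cmpxchg on the count.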
* Re: [RFC] Alternative raceless page free
2002-09-05 4:42 ` [RFC] Alternative raceless page free Daniel Phillips @ 2002-09-05 12:34 ` Christian Ehrhardt 2002-09-05 15:21 ` Daniel Phillips 0 siblings, 1 reply; 65+ messages in thread
From: Christian Ehrhardt @ 2002-09-05 12:34 UTC (permalink / raw)
To: Daniel Phillips
Cc: Andrew Morton, Linus Torvalds, Marcelo Tosatti, linux-kernel, Christian Ehrhardt

On Thu, Sep 05, 2002 at 06:42:12AM +0200, Daniel Phillips wrote:
> I have a slight preference for the extra-lru-count version, because of the
> trylock in page_cache_release. This means that nobody will have to spin when
> shrink_cache is active. Instead, freed pages that collide with the lru lock
> can just be left on the lru list to be picked up efficiently later. The
> trylock also allows the lru lock to be acquired speculatively from interrupt
> context, without a requirement that lru lock holders disable interrupts.
> Both versions are provably correct, modulo implementation gaffes.
>
> The linked patch defaults to the atomic_dec_and_lock version. To change to
> the extra count version, define LRU_PLUS_CACHE as 2 instead of 1.
>
> Christian, can you please run this one through your race detector?

I don't see a problem in principle with this approach (except that there
may be architectures that don't have cmpxchg or similar). But this hunk

@@ -455,7 +458,7 @@
 		} else {
 			/* failed to drop the buffers so stop here */
 			UnlockPage(page);
-			page_cache_release(page);
+			put_page(page);

 			spin_lock(&pagemap_lru_lock);
 			continue;

looks a bit suspicious. put_page is not allowed if the page is still
on the lru and there is no other reference to it. As we don't hold any
locks between UnlockPage and put_page there is no formal guarantee that
the above condition is met. I don't have another path that could race
with this one though and chances are that there actually is none.

regards   Christian

--
THAT'S ALL FOLKS!

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] Alternative raceless page free 2002-09-05 12:34 ` Christian Ehrhardt @ 2002-09-05 15:21 ` Daniel Phillips 2002-09-05 16:04 ` Christian Ehrhardt 0 siblings, 1 reply; 65+ messages in thread From: Daniel Phillips @ 2002-09-05 15:21 UTC (permalink / raw) To: Christian Ehrhardt Cc: Andrew Morton, Linus Torvalds, Marcelo Tosatti, linux-kernel, Christian Ehrhardt On Thursday 05 September 2002 14:34, Christian Ehrhardt wrote: > @@ -455,7 +458,7 @@ > } else { > /* failed to drop the buffers so stop here */ > UnlockPage(page); > - page_cache_release(page); > + put_page(page); > > spin_lock(&pagemap_lru_lock); > continue; > > looks a bit suspicious. put_page is not allowed if the page is still > on the lru and there is no other reference to it. As we don't hold any > locks between UnlockPage and put_page there is no formal guarantee that > the above condition is met. I don't have another path that could race > with this one though and chances are that there actually is none. The corresponding get_page is just above, you must have overlooked it. Besides that, we have a promise that the page still has buffers, worth another count, and the page will never be freed here. That's fragile though, and this particular piece of code can no doubt be considerably simplified, while improving robustness and efficiency at the same time. But that goes beyond the scope of this patch. The whole try_to_free_buffers path is hairy, scary and disgusting. It could use some critical analysis, if anybody has time. -- Daniel ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] Alternative raceless page free
2002-09-05 15:21 ` Daniel Phillips @ 2002-09-05 16:04 ` Christian Ehrhardt 2002-09-05 16:10 ` Daniel Phillips ` (2 more replies) 0 siblings, 3 replies; 65+ messages in thread
From: Christian Ehrhardt @ 2002-09-05 16:04 UTC (permalink / raw)
To: Daniel Phillips
Cc: Andrew Morton, Linus Torvalds, Marcelo Tosatti, linux-kernel, Christian Ehrhardt

On Thu, Sep 05, 2002 at 05:21:31PM +0200, Daniel Phillips wrote:
> On Thursday 05 September 2002 14:34, Christian Ehrhardt wrote:
> > @@ -455,7 +458,7 @@
> > 		} else {
> > 			/* failed to drop the buffers so stop here */
> > 			UnlockPage(page);
> > -			page_cache_release(page);
> > +			put_page(page);
> >
> > 			spin_lock(&pagemap_lru_lock);
> > 			continue;
> >
> > looks a bit suspicious. put_page is not allowed if the page is still
> > on the lru and there is no other reference to it. As we don't hold any
> > locks between UnlockPage and put_page there is no formal guarantee that
> > the above condition is met. I don't have another path that could race
> > with this one though and chances are that there actually is none.
>
> The corresponding get_page is just above, you must have overlooked it.

See it. The scenario I'm thinking about is:

  CPU1                            CPU2
  pick page from lru
  Lock it
  get_page
  try_to_release_buffers fails
  UnlockPage
  /* Window here */
                                  Pick page from lru
                                  page_cache_get
                                  Lock it
                                  actually free the buffers
                                  UnlockPage
                                  page_cache_release doesn't remove the
                                  page from the lru because CPU 1 holds
                                  a reference
  put_page drops last reference
  but doesn't remove the page from
  the lru because it is put_page
  and not page_cache_release.

---> Page gets freed while it is still on the lru.

If the lru cache holds a reference that's a non-issue. CPU1 is the path in
shrink_cache. I admit that I don't have a real path that does what CPU2
does in this scenario but doing what CPU2 does should be perfectly legal.

> Besides that, we have a promise that the page still has buffers, worth

Not after UnlockPage.
> another count, and the page will never be freed here. That's fragile > though, and this particular piece of code can no doubt be considerably > simplified, while improving robustness and efficiency at the same time. > But that goes beyond the scope of this patch. Well yes ;-) There's funny things going on like accessing a page after page_cache_release... regards Christian -- THAT'S ALL FOLKS! ^ permalink raw reply [flat|nested] 65+ messages in thread
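[Editorial note: Christian's interleaving can be replayed deterministically in a single thread. The sketch below is a hypothetical toy model of the counting scheme under discussion (the LRU itself holds no reference, as in the LRU_PLUS_CACHE==1 variant); the helpers mimic the semantics described in the thread, not the kernel's actual implementations.]

```c
#include <assert.h>
#include <stdbool.h>

struct page {
	int count;
	bool on_lru;
	bool freed;
};

static void get_page(struct page *p) { p->count++; }

/* Plain put_page: drops a reference and never looks at the LRU. */
static void put_page(struct page *p)
{
	if (--p->count == 0)
		p->freed = true;	/* __free_pages_ok() */
}

/* page_cache_release: unlinks from the LRU before dropping the last ref. */
static void page_cache_release(struct page *p)
{
	if (p->count == 1)
		p->on_lru = false;	/* lru lock + del_page_from_lru() */
	put_page(p);
}

/* Replay the CPU1/CPU2 interleaving from Christian's mail. */
static bool replay(bool cpu1_uses_page_cache_release)
{
	struct page p = { .count = 1, .on_lru = true };	/* page cache's ref */

	get_page(&p);		/* CPU1: shrink_cache picks the page       */
	get_page(&p);		/* CPU2: page_cache_get                    */
	put_page(&p);		/* CPU2: frees buffers, drops cache's ref  */
	page_cache_release(&p);	/* CPU2: count is 2 here, page stays put   */
	if (cpu1_uses_page_cache_release)
		page_cache_release(&p);
	else
		put_page(&p);	/* CPU1: the suspicious plain put_page     */

	return p.freed && p.on_lru;	/* true == freed while on the LRU  */
}
```

With the plain put_page, the final reference is dropped without touching the list and the page is freed while still linked on the lru; substituting page_cache_release for CPU1's last drop removes the page first — which is exactly the fix agreed on in the follow-ups.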
* Re: [RFC] Alternative raceless page free 2002-09-05 16:04 ` Christian Ehrhardt @ 2002-09-05 16:10 ` Daniel Phillips 2002-09-05 16:31 ` Daniel Phillips 2002-09-05 18:06 ` [RFC] Alternative raceless page free, updated Daniel Phillips 2 siblings, 0 replies; 65+ messages in thread From: Daniel Phillips @ 2002-09-05 16:10 UTC (permalink / raw) To: Christian Ehrhardt Cc: Andrew Morton, Linus Torvalds, Marcelo Tosatti, linux-kernel, Christian Ehrhardt On Thursday 05 September 2002 18:04, Christian Ehrhardt wrote: > On Thu, Sep 05, 2002 at 05:21:31PM +0200, Daniel Phillips wrote: > > On Thursday 05 September 2002 14:34, Christian Ehrhardt wrote: > > > @@ -455,7 +458,7 @@ > > > } else { > > > /* failed to drop the buffers so stop here */ > > > UnlockPage(page); > > > - page_cache_release(page); > > > + put_page(page); > > > > > > spin_lock(&pagemap_lru_lock); > > > continue; > > > > > > looks a bit suspicious. put_page is not allowed if the page is still > > > on the lru and there is no other reference to it. As we don't hold any > > > locks between UnlockPage and put_page there is no formal guarantee that > > > the above condition is met. I don't have another path that could race > > > with this one though and chances are that there actually is none. > > > > The corresponding get_page is just above, you must have overlooked it. I misspoke. You're right, page_cache_release is the appropriate function here, not put_page. Fixed, thanks. -- Daniel ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] Alternative raceless page free 2002-09-05 16:04 ` Christian Ehrhardt 2002-09-05 16:10 ` Daniel Phillips @ 2002-09-05 16:31 ` Daniel Phillips 2002-09-05 18:06 ` [RFC] Alternative raceless page free, updated Daniel Phillips 2 siblings, 0 replies; 65+ messages in thread From: Daniel Phillips @ 2002-09-05 16:31 UTC (permalink / raw) To: Christian Ehrhardt Cc: Andrew Morton, Linus Torvalds, Marcelo Tosatti, linux-kernel, Christian Ehrhardt On Thursday 05 September 2002 18:04, Christian Ehrhardt wrote: > On Thu, Sep 05, 2002 at 05:21:31PM +0200, Daniel Phillips wrote: > > ...this particular piece of code can no doubt be considerably > > simplified, while improving robustness and efficiency at the same time. > > But that goes beyond the scope of this patch. > > Well yes ;-) There's funny things going on like accessing a page > after page_cache_release... You're right about that one too, it should be put_page_nofree, since freeing the page there is a bug for two reasons: !!page->mapping tells us there should be a count on the page, and we continue to operate on the page immediately below. Thanks again. -- Daniel ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [RFC] Alternative raceless page free, updated 2002-09-05 16:04 ` Christian Ehrhardt 2002-09-05 16:10 ` Daniel Phillips 2002-09-05 16:31 ` Daniel Phillips @ 2002-09-05 18:06 ` Daniel Phillips 2 siblings, 0 replies; 65+ messages in thread From: Daniel Phillips @ 2002-09-05 18:06 UTC (permalink / raw) To: Christian Ehrhardt Cc: Andrew Morton, Linus Torvalds, Marcelo Tosatti, linux-kernel, Christian Ehrhardt Here's an updated patch with improved/corrected page count handling in shrink_cache along the lines we discussed: http://people.nl.linux.org/~phillips/patches/lru.race-2.4.19-3 -- Daniel ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31
2002-08-22 2:29 MM patches against 2.5.31 Andrew Morton 2002-08-22 11:28 ` Christian Ehrhardt @ 2002-08-22 15:59 ` Steven Cole 2002-08-22 16:06 ` Martin J. Bligh 1 sibling, 1 reply; 65+ messages in thread
From: Steven Cole @ 2002-08-22 15:59 UTC (permalink / raw)
To: Andrew Morton; +Cc: lkml, linux-mm@kvack.org

On Wed, 2002-08-21 at 20:29, Andrew Morton wrote:
> I've uploaded a rollup of pending fixes and feature work
> against 2.5.31 to
>
> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
>
> The rolled up patch there is suitable for ongoing testing and
> development. The individual patches are in the broken-out/
> directory and should all be documented.

The good news: I ran my dbench 1..128 stress test and for the first time
since 2.5.31-vanilla there were _no_ BUG()s reported at all.

The other news, from dmesg:

kjournald starting.  Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,3), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald: page allocation failure. order:0, mode:0x0

The kjournald failure message came out with dbench 48 running on an ext3
partition. The test continued with only this one instance of this message.

Steven

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-22 15:59 ` MM patches against 2.5.31 Steven Cole @ 2002-08-22 16:06 ` Martin J. Bligh 2002-08-22 19:45 ` Steven Cole 2002-08-26 2:15 ` Andrew Morton 0 siblings, 2 replies; 65+ messages in thread From: Martin J. Bligh @ 2002-08-22 16:06 UTC (permalink / raw) To: Steven Cole, Andrew Morton; +Cc: lkml, linux-mm@kvack.org > kjournald: page allocation failure. order:0, mode:0x0 I've seen this before, but am curious how we ever passed a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere that does this? Thanks, M. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-22 16:06 ` Martin J. Bligh @ 2002-08-22 19:45 ` Steven Cole 2002-08-26 2:15 ` Andrew Morton 1 sibling, 0 replies; 65+ messages in thread From: Steven Cole @ 2002-08-22 19:45 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Andrew Morton, lkml, linux-mm@kvack.org On Thu, 2002-08-22 at 10:06, Martin J. Bligh wrote: > > kjournald: page allocation failure. order:0, mode:0x0 > > I've seen this before, but am curious how we ever passed > a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere > that does this? > > Thanks, > > M. I ran dbench 1..128 on 2.5.31-mm1 several more times with nothing unusual happening, and then got this from pdflush with dbench 96. pdflush: page allocation failure. order:0, mode:0x0 FWIW, this 2.5.31-mm1 kernel is SMP, HIGHMEM4G, no PREEMPT. Steven ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-22 16:06 ` Martin J. Bligh 2002-08-22 19:45 ` Steven Cole @ 2002-08-26 2:15 ` Andrew Morton 2002-08-26 2:08 ` Martin J. Bligh 1 sibling, 1 reply; 65+ messages in thread From: Andrew Morton @ 2002-08-26 2:15 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Steven Cole, lkml, linux-mm@kvack.org "Martin J. Bligh" wrote: > > > kjournald: page allocation failure. order:0, mode:0x0 > > I've seen this before, but am curious how we ever passed > a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere > that does this? Could be anywhere, really. A network interrupt doing GFP_ATOMIC while kjournald is executing. A radix-tree node allocation on the add-to-swap path perhaps. (The swapout failure messages aren't supposed to come out, but mempool_alloc() stomps on the caller's setting of PF_NOWARN.) Or: mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l 89 ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31
2002-08-26 2:15 ` Andrew Morton @ 2002-08-26 2:08 ` Martin J. Bligh 2002-08-26 2:32 ` Andrew Morton 0 siblings, 1 reply; 65+ messages in thread
From: Martin J. Bligh @ 2002-08-26 2:08 UTC (permalink / raw)
To: Andrew Morton; +Cc: Steven Cole, lkml, linux-mm@kvack.org

>> > kjournald: page allocation failure. order:0, mode:0x0
>>
>> I've seen this before, but am curious how we ever passed
>> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
>> that does this?
>
> Could be anywhere, really. A network interrupt doing GFP_ATOMIC
> while kjournald is executing. A radix-tree node allocation
> on the add-to-swap path perhaps. (The swapout failure messages
> aren't supposed to come out, but mempool_alloc() stomps on the
> caller's setting of PF_NOWARN.)
>
> Or:
>
> mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l
> 89

No, GFP_ATOMIC is not 0:

#define __GFP_HIGH	0x20	/* Should access emergency pools? */
#define GFP_ATOMIC	(__GFP_HIGH)

Looking at all the options:

#define __GFP_WAIT	0x10	/* Can wait and reschedule? */
#define __GFP_HIGH	0x20	/* Should access emergency pools? */
#define __GFP_IO	0x40	/* Can start low memory physical IO? */
#define __GFP_HIGHIO	0x80	/* Can start high mem physical IO? */
#define __GFP_FS	0x100	/* Can call down to low-level FS? */

What worries me is that 0 seems to mean "you can't do anything to try and
free it, but you can't access the emergency pools either". Seems doomed to
failure to me.

And the standard sets we have are

#define GFP_NOHIGHIO	( __GFP_WAIT | __GFP_IO)
#define GFP_NOIO	( __GFP_WAIT)
#define GFP_NOFS	( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO)
#define GFP_ATOMIC	(__GFP_HIGH)
#define GFP_USER	( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
#define GFP_HIGHUSER	( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS | __GFP_HIGHMEM)
#define GFP_KERNEL	( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
#define GFP_NFS		( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
#define GFP_KSWAPD	( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)

So I think someone's screwed something up, and this is accidental. Or I'm
just totally misunderstanding this, which is perfectly possible.

M.

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: MM patches against 2.5.31 2002-08-26 2:08 ` Martin J. Bligh @ 2002-08-26 2:32 ` Andrew Morton 2002-08-26 3:06 ` Steven Cole 0 siblings, 1 reply; 65+ messages in thread From: Andrew Morton @ 2002-08-26 2:32 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Steven Cole, lkml, linux-mm@kvack.org "Martin J. Bligh" wrote: > > >> > kjournald: page allocation failure. order:0, mode:0x0 > >> > >> I've seen this before, but am curious how we ever passed > >> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere > >> that does this? > > > > Could be anywhere, really. A network interrupt doing GFP_ATOMIC > > while kjournald is executing. A radix-tree node allocation > > on the add-to-swap path perhaps. (The swapout failure messages > > aren't supposed to come out, but mempool_alloc() stomps on the > > caller's setting of PF_NOWARN.) > > > > Or: > > > > mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l > > 89 > > No, GFP_ATOMIC is not 0: > It's mempool_alloc(GFP_NOIO) or such. mempool_alloc() strips __GFP_WAIT|__GFP_IO on the first attempt. It also disables the printk, so maybe I just dunno ;) show_stack() would tell. ^ permalink raw reply [flat|nested] 65+ messages in thread
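[Editorial note: Andrew's explanation is easy to check with the flag values Martin quoted. A small sketch, using those 2.4-era constants; `mempool_first_try` is a hypothetical name for what mempool_alloc()'s first, don't-try-hard pass requests, per the description above.]

```c
#include <assert.h>

/* Flag values as quoted in Martin's mail. */
#define __GFP_WAIT	0x10
#define __GFP_HIGH	0x20
#define __GFP_IO	0x40
#define __GFP_HIGHIO	0x80
#define __GFP_FS	0x100

#define GFP_NOIO	(__GFP_WAIT)
#define GFP_NOFS	(__GFP_WAIT | __GFP_IO | __GFP_HIGHIO)
#define GFP_ATOMIC	(__GFP_HIGH)

/* mempool_alloc() strips __GFP_WAIT|__GFP_IO on its first attempt,
 * per Andrew's description -- so a GFP_NOIO caller's first attempt
 * goes into __alloc_pages() with a mask of exactly 0. */
unsigned int mempool_first_try(unsigned int gfp_mask)
{
	return gfp_mask & ~(__GFP_WAIT | __GFP_IO);
}
```

So GFP_ATOMIC (0x20) can never produce the `mode:0x0` message, but a GFP_NOIO allocation through mempool_alloc can — consistent with the failures coming from kjournald and pdflush.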
* Re: MM patches against 2.5.31
2002-08-26 2:32 ` Andrew Morton @ 2002-08-26 3:06 ` Steven Cole 0 siblings, 0 replies; 65+ messages in thread
From: Steven Cole @ 2002-08-26 3:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: Martin J. Bligh, lkml, linux-mm@kvack.org

On Sun, 2002-08-25 at 20:32, Andrew Morton wrote:
> "Martin J. Bligh" wrote:
> >
> > >> > kjournald: page allocation failure. order:0, mode:0x0
> > >>
> > >> I've seen this before, but am curious how we ever passed
> > >> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
> > >> that does this?
> > >
> > > Could be anywhere, really. A network interrupt doing GFP_ATOMIC
> > > while kjournald is executing. A radix-tree node allocation
> > > on the add-to-swap path perhaps. (The swapout failure messages
> > > aren't supposed to come out, but mempool_alloc() stomps on the
> > > caller's setting of PF_NOWARN.)
> > >
> > > Or:
> > >
> > > mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l
> > > 89
> >
> > No, GFP_ATOMIC is not 0:
>
> It's mempool_alloc(GFP_NOIO) or such. mempool_alloc() strips
> __GFP_WAIT|__GFP_IO on the first attempt.
>
> It also disables the printk, so maybe I just dunno ;) show_stack()
> would tell.

The "kjournald: page allocation failure. order:0, mode:0x0" message and
"pdflush: page allocation failure. order:0, mode:0x0" occurred only once
each on my dual p3 scsi ext3 test box running 2.5.31-mm1. So, I added
something like this:

--- page_alloc.c.orig	Thu Aug 22 17:27:32 2002
+++ page_alloc.c	Thu Aug 22 17:29:24 2002
@@ -388,6 +388,8 @@
 		printk("%s: page allocation failure."
 			" order:%d, mode:0x%x\n",
 			current->comm, order, gfp_mask);
+		if (gfp_mask == 0)
+			BUG();
 	}
 	return NULL;
 }

and continued testing on Friday with no repeats of the "page allocation
failure" messages. I obtained a second dual p3 ext3 test box (ide this
time) and left both boxes running 2.5.31-mm1 and the dbench 1..128 stress
test scripted to rerun many times over the weekend.
Due to a couple of firewalls, I can't look at those boxes from here, but I'll let you know what happened in about 10 to 11 hours. Cheers, Steven ^ permalink raw reply [flat|nested] 65+ messages in thread