From mboxrd@z Thu Jan 1 00:00:00 1970
From: Martin Schwidefsky
Subject: [patch 0/6] Guest page hinting version 7.
Date: Fri, 27 Mar 2009 16:09:05 +0100
Message-ID: <20090327150905.819861420@de.ibm.com>
Return-path:
Sender: owner-linux-mm@kvack.org
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org
Cc: frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com
List-Id: virtualization@lists.linuxfoundation.org

Greetings,
the circus is back in town -- another version of the guest page hinting
patches. The patches differ from version 6 only in the kernel version,
they apply against 2.6.29. My short sniff test showed that the code
is still working as expected.

To recap (you can skip this if you read the boiler plate of the last
version of the patches):
The main benefit for guest page hinting vs. the ballooner is that there
is no need for a monitor that keeps track of the memory usage of all the
guests, a complex algorithm that calculates the working set sizes and for
the calls into the guest kernel to control the size of the balloons.
The host just does normal LRU based paging. If the host picks one of the
pages the guest can recreate, the host can throw it away instead of writing
it to the paging device. Simple and elegant.
The main disadvantage is the added complexity that is introduced to the
guest's memory management code to do the page state changes and to deal
with discard faults.

Right after booting the page states on my 256 MB z/VM guest looked like
this (r=resident, p=preserved, z=zero, S=stable, U=unused,
P=potentially volatile, V=volatile):

          |--tot--|---r---|---p---|---z---|
    S     |  19719|  19673|      0|     46|
    U     | 235416|   2734|      0| 232682|
    P     |      1|      1|      0|      0|
    V     |   7008|   7008|      0|      0|
    tot-> | 262144|  29416|      0| 232728|

about 2.7% of the pages are in volatile state.
After grepping through the linux source tree this picture changes:

          |--tot--|---r---|---p---|---z---|
    S     |  43784|  43744|      0|     40|
    U     |  78631|   2397|      0|  76234|
    P     |      2|      2|      0|      0|
    V     | 139727| 139727|      0|      0|
    tot-> | 262144| 185870|      0|  76274|

about 53% of the pages are now volatile. Depending on the workload you
will get different results.

--
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Martin Schwidefsky
Subject: [patch 3/6] Guest page hinting: mlocked pages.
Date: Fri, 27 Mar 2009 16:09:08 +0100
Message-ID: <20090327151012.095486071@de.ibm.com>
References: <20090327150905.819861420@de.ibm.com>
Return-path:
Content-Disposition: inline; filename=003-hva-mlock.diff
Sender: owner-linux-mm@kvack.org
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org
Cc: frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com, Martin Schwidefsky
List-Id: virtualization@lists.linuxfoundation.org

From: Martin Schwidefsky
From: Hubertus Franke
From: Himanshu Raj

Add code to get mlock() working with guest page hinting. The problem
with mlock is that locked pages may not be removed from page cache.
That means they need to be stable. page_make_volatile needs a way to
check if a page has been locked. To avoid traversing vma lists - which
would hurt performance a lot - a field is added in the struct
address_space. This field is set in mlock_fixup if a vma gets mlocked.
The bit never gets removed - once a file had an mlocked vma all future
pages added to it will stay stable.

The pages of an mlocked area are made present in the linux page table by
a call to make_pages_present which calls get_user_pages and follow_page.
The follow_page function is called for each page in the mlocked vma, if the VM_LOCKED bit in the vma flags is set the page is made stable. Signed-off-by: Martin Schwidefsky --- include/linux/fs.h | 10 ++++++++++ mm/memory.c | 5 +++-- mm/mlock.c | 4 ++++ mm/page-states.c | 5 ++++- mm/rmap.c | 14 ++++++++++++-- 5 files changed, 33 insertions(+), 5 deletions(-) Index: linux-2.6/include/linux/fs.h =================================================================== --- linux-2.6.orig/include/linux/fs.h +++ linux-2.6/include/linux/fs.h @@ -561,6 +561,9 @@ struct address_space { unsigned long flags; /* error bits/gfp mask */ struct backing_dev_info *backing_dev_info; /* device readahead, etc */ spinlock_t private_lock; /* for use by the address_space */ +#ifdef CONFIG_PAGE_STATES + unsigned int mlocked; /* set if VM_LOCKED vmas present */ +#endif struct list_head private_list; /* ditto */ struct address_space *assoc_mapping; /* ditto */ } __attribute__((aligned(sizeof(long)))); @@ -570,6 +573,13 @@ struct address_space { * of struct page's "mapping" pointer be used for PAGE_MAPPING_ANON. */ +static inline void mapping_set_mlocked(struct address_space *mapping) +{ +#ifdef CONFIG_PAGE_STATES + mapping->mlocked = 1; +#endif +} + struct block_device { dev_t bd_dev; /* not a kdev_t - it's a search key */ struct inode * bd_inode; /* will die */ Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -1177,9 +1177,10 @@ struct page *follow_page(struct vm_area_ if (flags & FOLL_GET) get_page(page); - if (flags & FOLL_GET) { + if ((flags & FOLL_GET) || (vma->vm_flags & VM_LOCKED)) { /* - * The page is made stable if a reference is acquired. + * The page is made stable if a reference is acquired or + * the vm area is locked. * If the caller does not get a reference it implies that * the caller can deal with page faults in case the page * is swapped out. 
In this case the caller can deal with Index: linux-2.6/mm/mlock.c =================================================================== --- linux-2.6.orig/mm/mlock.c +++ linux-2.6/mm/mlock.c @@ -18,6 +18,7 @@ #include #include #include +#include #include "internal.h" @@ -380,6 +381,9 @@ static int mlock_fixup(struct vm_area_st (vma->vm_flags & (VM_IO | VM_PFNMAP))) goto out; /* don't set VM_LOCKED, don't count */ + if (lock && vma->vm_file && vma->vm_file->f_mapping) + mapping_set_mlocked(vma->vm_file->f_mapping); + if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) || is_vm_hugetlb_page(vma) || vma == get_gate_vma(current)) { Index: linux-2.6/mm/page-states.c =================================================================== --- linux-2.6.orig/mm/page-states.c +++ linux-2.6/mm/page-states.c @@ -30,6 +30,8 @@ */ static inline int check_bits(struct page *page) { + struct address_space *mapping; + /* * There are several conditions that prevent a page from becoming * volatile. The first check is for the page bits. @@ -53,7 +55,8 @@ static inline int check_bits(struct page * it volatile. It will be freed soon. And if the mapping ever * had locked pages all pages of the mapping will stay stable. */ - return page_mapping(page) != NULL; + mapping = page_mapping(page); + return mapping && !mapping->mlocked; } /* Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -793,8 +793,18 @@ static int try_to_unmap_one(struct page goto out_unmap; } if (ptep_clear_flush_young_notify(vma, address, pte)) { - ret = SWAP_FAIL; - goto out_unmap; + /* + * Check for discarded pages. This can happen if + * there have been discarded pages before a vma + * gets mlocked. The code in make_pages_present + * will force all discarded pages out and reload + * them. That happens after the VM_LOCKED bit + * has been set. 
+ */ + if (likely(!PageDiscarded(page))) { + ret = SWAP_FAIL; + goto out_unmap; + } } } -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Martin Schwidefsky Subject: [patch 2/6] Guest page hinting: volatile swap cache. Date: Fri, 27 Mar 2009 16:09:07 +0100 Message-ID: <20090327151011.798602788@de.ibm.com> References: <20090327150905.819861420@de.ibm.com> Return-path: Content-Disposition: inline; filename=002-hva-swap.diff Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org Cc: frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com, Martin Schwidefsky List-Id: virtualization@lists.linuxfoundation.org From: Martin Schwidefsky From: Hubertus Franke From: Himanshu Raj The volatile page state can be used for anonymous pages as well, if they have been added to the swap cache and the swap write is finished. The tricky bit is in free_swap_and_cache. The call to find_get_page dead-locks with the discard handler. If the page has been discarded find_get_page will try to remove it. To do that it needs the page table lock of all mappers but one is held by the caller of free_swap_and_cache. A special variant of find_get_page is needed that does not check the page state and returns a page reference even if the page is discarded. The second pitfall is that the page needs to be made stable before the swap slot gets freed. If the page cannot be made stable because it has been discarded the swap slot may not be freed because it is still needed to reload the discarded page from the swap device. 
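The second pitfall can be sketched in a few lines of plain C. This is a
toy model and not kernel code: struct toy_page, toy_make_stable() and
toy_try_to_free_swap() are made-up names for illustration, and the model
assumes that making a page stable fails exactly when the host has
discarded it. The point is only the ordering: the swap slot is given up
after the page has been made stable, never before.

```c
#include <stdbool.h>

/* Toy model, not kernel code: every name in this block is hypothetical. */
struct toy_page {
	bool discarded;      /* host has thrown the page content away */
	bool in_swap_cache;  /* page still has a swap cache entry */
	bool dirty;          /* page must be written out again before reuse */
	int  swap_users;     /* other users of the swap slot */
};

/* Stand-in for page_make_stable(): assumed to fail for a discarded page. */
static bool toy_make_stable(const struct toy_page *page)
{
	return !page->discarded;
}

/*
 * Returns true if the swap slot was released. A discarded page keeps
 * its slot, because the slot holds the only remaining copy of the
 * content and is needed to reload the page.
 */
static bool toy_try_to_free_swap(struct toy_page *page)
{
	if (!page->in_swap_cache || page->swap_users > 0)
		return false;
	if (!toy_make_stable(page))
		return false;        /* discarded: slot must survive */
	page->in_swap_cache = false; /* drop swap cache entry and slot */
	page->dirty = true;          /* content now lives only in memory */
	return true;
}
```

For a resident page the slot is released and the page is marked dirty;
for a discarded page the call returns false and the slot is kept so the
content can be reloaded from the swap device.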
Signed-off-by: Martin Schwidefsky --- include/linux/pagemap.h | 3 ++ include/linux/swap.h | 5 ++++ mm/filemap.c | 39 ++++++++++++++++++++++++++++++++++++ mm/memory.c | 13 +++++++++++- mm/page-states.c | 34 +++++++++++++++++++++++--------- mm/rmap.c | 51 ++++++++++++++++++++++++++++++++++++++++++++---- mm/swap_state.c | 25 ++++++++++++++++++++++- mm/swapfile.c | 24 +++++++++++++++++++--- 8 files changed, 176 insertions(+), 18 deletions(-) Index: linux-2.6/include/linux/pagemap.h =================================================================== --- linux-2.6.orig/include/linux/pagemap.h +++ linux-2.6/include/linux/pagemap.h @@ -91,8 +91,11 @@ static inline void mapping_set_gfp_mask( #define page_cache_get(page) get_page(page) #ifdef CONFIG_PAGE_STATES +extern struct page * find_get_page_nodiscard(struct address_space *mapping, + unsigned long index); #define page_cache_release(page) put_page_check(page) #else +#define find_get_page_nodiscard(mapping, index) find_get_page(mapping, index) #define page_cache_release(page) put_page(page) #endif void release_pages(struct page **pages, int nr, int cold); Index: linux-2.6/include/linux/swap.h =================================================================== --- linux-2.6.orig/include/linux/swap.h +++ linux-2.6/include/linux/swap.h @@ -285,6 +285,7 @@ extern void show_swap_cache_info(void); extern int add_to_swap(struct page *); extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t); extern void __delete_from_swap_cache(struct page *); +extern void __delete_from_swap_cache_nocheck(struct page *); extern void delete_from_swap_cache(struct page *); extern void free_page_and_swap_cache(struct page *); extern void free_pages_and_swap_cache(struct page **, int); @@ -402,6 +403,10 @@ static inline void __delete_from_swap_ca { } +static inline void __delete_from_swap_cache_nocheck(struct page *page) +{ +} + static inline void delete_from_swap_cache(struct page *page) { } Index: linux-2.6/mm/filemap.c 
=================================================================== --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -555,6 +555,45 @@ static int __sleep_on_page_lock(void *wo return 0; } +#ifdef CONFIG_PAGE_STATES + +struct page * find_get_page_nodiscard(struct address_space *mapping, + unsigned long offset) +{ + void **pagep; + struct page *page; + + rcu_read_lock(); +repeat: + page = NULL; + pagep = radix_tree_lookup_slot(&mapping->page_tree, offset); + if (pagep) { + page = radix_tree_deref_slot(pagep); + if (unlikely(!page || page == RADIX_TREE_RETRY)) + goto repeat; + + if (!page_cache_get_speculative(page)) + goto repeat; + + /* + * Has the page moved? + * This is part of the lockless pagecache protocol. See + * include/linux/pagemap.h for details. + */ + if (unlikely(page != *pagep)) { + page_cache_release(page); + goto repeat; + } + } + rcu_read_unlock(); + + return page; +} + +EXPORT_SYMBOL(find_get_page_nodiscard); + +#endif + /* * In order to wait for pages to become available there must be * waitqueues associated with pages. By using a hash table of Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -614,7 +614,18 @@ out_discard_pte: * in the page cache anymore. Do what try_to_unmap_one would do * if the copy_one_pte had taken place before page_discard. */ - if (page->index != linear_page_index(vma, addr)) + if (PageAnon(page)) { + swp_entry_t entry = { .val = page_private(page) }; + swap_duplicate(entry); + if (list_empty(&dst_mm->mmlist)) { + spin_lock(&mmlist_lock); + if (list_empty(&dst_mm->mmlist)) + list_add(&dst_mm->mmlist, &init_mm.mmlist); + spin_unlock(&mmlist_lock); + } + pte = swp_entry_to_pte(entry); + set_pte_at(dst_mm, addr, dst_pte, pte); + } else if (page->index != linear_page_index(vma, addr)) /* If nonlinear, store the file page offset in the pte. 
*/ set_pte_at(dst_mm, addr, dst_pte, pgoff_to_pte(page->index)); else Index: linux-2.6/mm/page-states.c =================================================================== --- linux-2.6.orig/mm/page-states.c +++ linux-2.6/mm/page-states.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "internal.h" @@ -35,7 +36,16 @@ static inline int check_bits(struct page */ if (PageDirty(page) || PageReserved(page) || PageWriteback(page) || PageLocked(page) || PagePrivate(page) || PageDiscarded(page) || - !PageUptodate(page) || !PageLRU(page) || PageAnon(page)) + !PageUptodate(page) || !PageLRU(page) || + (PageAnon(page) && !PageSwapCache(page))) + return 0; + + /* + * Special case shared memory: page is PageSwapCache but not + * PageAnon. page_unmap_all failes for swapped shared memory + * pages. + */ + if (PageSwapCache(page) && !PageAnon(page)) return 0; /* @@ -169,15 +179,21 @@ static void __page_discard(struct page * } spin_unlock_irq(&zone->lru_lock); - /* We can't handle swap cache pages (yet). */ - VM_BUG_ON(PageSwapCache(page)); - - /* Remove page from page cache. */ + /* Remove page from page cache/swap cache. */ mapping = page->mapping; - spin_lock_irq(&mapping->tree_lock); - __remove_from_page_cache_nocheck(page); - spin_unlock_irq(&mapping->tree_lock); - __put_page(page); + if (PageSwapCache(page)) { + swp_entry_t entry = { .val = page_private(page) }; + spin_lock_irq(&swapper_space.tree_lock); + __delete_from_swap_cache_nocheck(page); + spin_unlock_irq(&swapper_space.tree_lock); + swap_free(entry); + page_cache_release(page); + } else { + spin_lock_irq(&mapping->tree_lock); + __remove_from_page_cache_nocheck(page); + spin_unlock_irq(&mapping->tree_lock); + __put_page(page); + } } /** Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -762,6 +762,7 @@ void page_remove_rmap(struct page *page) * faster for those pages still in swapcache. 
*/ } + page_make_volatile(page, 1); } /* @@ -1253,13 +1254,13 @@ int try_to_munlock(struct page *page) #ifdef CONFIG_PAGE_STATES /** - * page_unmap_all - removes all mappings of a page + * page_unmap_file - removes all mappings of a file page * * @page: the page which mapping in the vma should be struck down * * the caller needs to hold page lock */ -void page_unmap_all(struct page* page) +static void page_unmap_file(struct page* page) { struct address_space *mapping = page_mapping(page); pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); @@ -1268,8 +1269,6 @@ void page_unmap_all(struct page* page) unsigned long address; int rc; - VM_BUG_ON(!PageLocked(page) || PageReserved(page) || PageAnon(page)); - spin_lock(&mapping->i_mmap_lock); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { address = vma_address(page, vma); @@ -1300,4 +1299,48 @@ out: spin_unlock(&mapping->i_mmap_lock); } +/** + * page_unmap_anon - removes all mappings of an anonymous page + * + * @page: the page which mapping in the vma should be struck down + * + * the caller needs to hold page lock + */ +static void page_unmap_anon(struct page* page) +{ + struct anon_vma *anon_vma; + struct vm_area_struct *vma; + unsigned long address; + int rc; + + anon_vma = page_lock_anon_vma(page); + if (!anon_vma) + return; + list_for_each_entry(vma, &anon_vma->head, anon_vma_node) { + address = vma_address(page, vma); + if (address == -EFAULT) + continue; + rc = try_to_unmap_one(page, vma, address, 0); + VM_BUG_ON(rc == SWAP_FAIL); + } + page_unlock_anon_vma(anon_vma); +} + +/** + * page_unmap_all - removes all mappings of a page + * + * @page: the page which mapping in the vma should be struck down + * + * the caller needs to hold page lock + */ +void page_unmap_all(struct page *page) +{ + VM_BUG_ON(!PageLocked(page) || PageReserved(page)); + + if (PageAnon(page)) + page_unmap_anon(page); + else + page_unmap_file(page); +} + #endif Index: linux-2.6/mm/swap_state.c 
=================================================================== --- linux-2.6.orig/mm/swap_state.c +++ linux-2.6/mm/swap_state.c @@ -18,6 +18,7 @@ #include #include #include +#include #include @@ -107,7 +108,7 @@ int add_to_swap_cache(struct page *page, * This must be called only on pages that have * been verified to be in the swap cache. */ -void __delete_from_swap_cache(struct page *page) +void inline __delete_from_swap_cache_nocheck(struct page *page) { swp_entry_t ent = {.val = page_private(page)}; @@ -124,6 +125,28 @@ void __delete_from_swap_cache(struct pag mem_cgroup_uncharge_swapcache(page, ent); } +void __delete_from_swap_cache(struct page *page) +{ + /* + * Check if the discard fault handler already removed + * the page from the page cache. If not set the discard + * bit in the page flags to prevent double page free if + * a discard fault is racing with normal page free. + */ + if (TestSetPageDiscarded(page)) + return; + + __delete_from_swap_cache_nocheck(page); + + /* + * Check the hardware page state and clear the discard + * bit in the page flags only if the page is not + * discarded. 
+ */ + if (!page_discarded(page)) + ClearPageDiscarded(page); +} + /** * add_to_swap - allocate swap space for a page * @page: page we want to move to swap Index: linux-2.6/mm/swapfile.c =================================================================== --- linux-2.6.orig/mm/swapfile.c +++ linux-2.6/mm/swapfile.c @@ -29,6 +29,7 @@ #include #include #include +#include #include #include @@ -564,6 +565,8 @@ int try_to_free_swap(struct page *page) return 0; if (page_swapcount(page)) return 0; + if (!page_make_stable(page)) + return 0; delete_from_swap_cache(page); SetPageDirty(page); @@ -585,7 +588,13 @@ int free_swap_and_cache(swp_entry_t entr p = swap_info_get(entry); if (p) { if (swap_entry_free(p, entry) == 1) { - page = find_get_page(&swapper_space, entry.val); + /* + * Use find_get_page_nodiscard to avoid the deadlock + * on the swap_lock and the page table lock if the + * page has been discarded. + */ + page = find_get_page_nodiscard(&swapper_space, + entry.val); if (page && !trylock_page(page)) { page_cache_release(page); page = NULL; @@ -600,8 +609,17 @@ int free_swap_and_cache(swp_entry_t entr */ if (PageSwapCache(page) && !PageWriteback(page) && (!page_mapped(page) || vm_swap_full())) { - delete_from_swap_cache(page); - SetPageDirty(page); + /* + * To be able to reload the page from swap the + * swap slot may not be freed. The caller of + * free_swap_and_cache holds a page table lock + * for this page. The discarded page can not be + * removed here. + */ + if (likely(page_make_stable(page))) { + delete_from_swap_cache(page); + SetPageDirty(page); + } } unlock_page(page); page_cache_release(page); -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Martin Schwidefsky
Subject: [patch 6/6] Guest page hinting: s390 support.
Date: Fri, 27 Mar 2009 16:09:11 +0100
Message-ID: <20090327151013.024372165@de.ibm.com>
References: <20090327150905.819861420@de.ibm.com>
Return-path:
Content-Disposition: inline; filename=006-hva-s390.diff
Sender: owner-linux-mm@kvack.org
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org
Cc: frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com, Martin Schwidefsky
List-Id: virtualization@lists.linuxfoundation.org

From: Martin Schwidefsky
From: Hubertus Franke
From: Himanshu Raj

s390 uses the milli-coded ESSA instruction to set the page state. The
page state is formed by four guest page states called block usage states
and three host page states called block content states.

The guest states are:
 - stable (S): there is essential content in the page
 - unused (U): there is no useful content and any access to the page will
   cause an addressing exception
 - volatile (V): there is useful content in the page. The host system is
   allowed to discard the content anytime, but has to deliver a discard
   fault with the absolute address of the page if the guest tries to
   access it.
 - potential volatile (P): the page has useful content. The host system
   is allowed to discard the content after it has checked the dirty bit
   of the page. It has to deliver a discard fault with the absolute
   address of the page if the guest tries to access it.

The host states are:
 - resident: the page is present in real memory.
 - preserved: the page is not present in real memory but the content is
   preserved elsewhere by the machine, e.g. on the paging device.
 - zero: the page is not present in real memory. The content of the page
   is logically-zero.

There are 12 combinations of guest and host state, currently only 8 are
valid page states:
 Sr: a stable, resident page.
 Sp: a stable, preserved page.
 Sz: a stable, logically zero page. A page filled with zeroes will be
     allocated on first access.
 Ur: an unused but resident page. The host could make it Uz anytime but
     it doesn't have to.
 Uz: an unused, logically zero page.
 Vr: a volatile, resident page. The guest can access it normally.
 Vz: a volatile, logically zero page. This is a discarded page. The host
     will deliver a discard fault for any access to the page.
 Pr: a potential volatile, resident page. The guest can access it normally.

The remaining 4 combinations can't occur:
 Up: an unused, preserved page. If the host tries to get rid of a Ur page
     it will remove it without writing the page content to disk and set
     the page to Uz.
 Vp: a volatile, preserved page. If the host picks a Vr page for eviction
     it will discard it and set the page state to Vz.
 Pp: a potential volatile, preserved page. There are two cases for page out:
     1) if the page is dirty then the host will preserve the page and
     set it to Sp or 2) if the page is clean then the host will discard it
     and set the page state to Vz.
 Pz: a potential volatile, logically zero page. The host system will always
     use Vz instead of Pz.

The state transitions (a diagram would be nicer but that is too hard
to do in ascii art...):
 {Ur,Sr,Vr,Pr}: a resident page will change its block usage state if the
     guest requests it with page_set_{unused,stable,volatile}.
 {Uz,Sz,Vz}: a logically zero page will change its block usage state if the
     guest requests it with page_set_{unused,stable,volatile}. The
     guest can't create the Pz state, the state will be Vz instead.
 Ur -> Uz: the host system can remove an unused, resident page from memory
 Sz -> Sr: on first access a stable, logically zero page will become resident
 Sr -> Sp: the host system can swap a stable page to disk
 Sp -> Sr: a guest access to a Sp page forces the host to retrieve it
 Vr -> Vz: the host can discard a volatile page
 Sp -> Uz: a page preserved by the host will be removed if the guest sets
     the block usage state to unused.
 Sp -> Vz: a page preserved by the host will be discarded if the guest sets
     the block usage state to volatile.
 Pr -> Sp: the host can move a page from Pr to Sp if it discovers that the
     page is dirty while trying to discard the page. The page content is
     written to the paging device.
 Pr -> Vz: the host can discard a Pr page. The Pz state is replaced by the
     Vz state.

There are some hazards the code has to deal with:
1) For potential volatile pages the transfer of the hardware dirty bit to
   the software dirty bit needs to make sure that the page gets into the
   stable state before the hardware dirty bit is cleared. Between the
   page_test_dirty and the page_clear_dirty call a page_make_stable is
   required.
2) Since the access of unused pages causes addressing exceptions we need
   to take care with /dev/mem. The copy_{from,to}_user functions need to
   be able to cope with addressing exceptions for the kernel address space.
3) The discard fault on a s390 machine delivers the absolute address of
   the page that caused the fault instead of the virtual address. With the
   virtual address we could have used the page table entry of the current
   process to safely get a reference to the discarded page. We can get to
   the struct page from the absolute page address but it is rather hard to
   get to a proper page reference. The page that caused the fault could
   already have been freed and reused for a different purpose. None of the
   fields in the struct page would be reliable to use.
   The freeing of discarded pages therefore has to be postponed until all
   pending discard faults for this page have been dealt with. The discard
   fault handler is called with interrupts disabled and tries to get a
   page reference with get_page_unless_zero. A discarded page is only
   freed after all cpus have been enabled for interrupts at least once
   since the detection of the discarded page. This is done using the timer
   interrupts and the cpu-idle notifier.

Signed-off-by: Martin Schwidefsky
---

 arch/s390/Kconfig                   |    6
 arch/s390/include/asm/page-states.h |  116 ++++++++++++++++++
 arch/s390/include/asm/page.h        |   11 -
 arch/s390/kernel/process.c          |    4
 arch/s390/kernel/time.c             |    8 +
 arch/s390/kernel/traps.c            |    4
 arch/s390/lib/uaccess_mvcos.c       |   10 -
 arch/s390/lib/uaccess_std.c         |    7 -
 arch/s390/mm/fault.c                |    1
 arch/s390/mm/init.c                 |    3
 arch/s390/mm/page-states.c          |  224 ++++++++++++++++++++++++++++++------
 mm/rmap.c                           |    9 +
 12 files changed, 346 insertions(+), 57 deletions(-)

Index: linux-2.6/arch/s390/Kconfig
===================================================================
--- linux-2.6.orig/arch/s390/Kconfig
+++ linux-2.6/arch/s390/Kconfig
@@ -468,11 +468,7 @@ config CMM_IUCV
 	  the cooperative memory management.
 
 config PAGE_STATES
-	bool "Unused page notification"
-	help
-	  This enables the notification of unused pages to the
-	  hypervisor. The ESSA instruction is used to do the states
-	  changes between a page that has content and the unused state.
+	bool "Enable support for guest page hinting."
config APPLDATA_BASE bool "Linux - VM Monitor Stream, base infrastructure" Index: linux-2.6/arch/s390/include/asm/page-states.h =================================================================== --- /dev/null +++ linux-2.6/arch/s390/include/asm/page-states.h @@ -0,0 +1,116 @@ +#ifndef _ASM_S390_PAGE_STATES_H +#define _ASM_S390_PAGE_STATES_H + +#define ESSA_GET_STATE 0 +#define ESSA_SET_STABLE 1 +#define ESSA_SET_UNUSED 2 +#define ESSA_SET_VOLATILE 3 +#define ESSA_SET_PVOLATILE 4 +#define ESSA_SET_STABLE_MAKE_RESIDENT 5 +#define ESSA_SET_STABLE_IF_NOT_DISCARDED 6 + +#define ESSA_USTATE_MASK 0x0c +#define ESSA_USTATE_STABLE 0x00 +#define ESSA_USTATE_UNUSED 0x04 +#define ESSA_USTATE_PVOLATILE 0x08 +#define ESSA_USTATE_VOLATILE 0x0c + +#define ESSA_CSTATE_MASK 0x03 +#define ESSA_CSTATE_RESIDENT 0x00 +#define ESSA_CSTATE_PRESERVED 0x02 +#define ESSA_CSTATE_ZERO 0x03 + +extern int cmma_flag; + +/* + * ESSA ,, + */ +#define page_essa(_page,_command) ({ \ + int _rc; \ + asm volatile(".insn rrf,0xb9ab0000,%0,%1,%2,0" \ + : "=&d" (_rc) : "a" (page_to_phys(_page)), \ + "i" (_command)); \ + _rc; \ +}) + +static inline int page_host_discards(void) +{ + return cmma_flag; +} + +static inline int page_discarded(struct page *page) +{ + int state; + + if (!cmma_flag) + return 0; + state = page_essa(page, ESSA_GET_STATE); + return (state & ESSA_USTATE_MASK) == ESSA_USTATE_VOLATILE && + (state & ESSA_CSTATE_MASK) == ESSA_CSTATE_ZERO; +} + +static inline void page_set_unused(struct page *page, int order) +{ + int i; + + if (!cmma_flag) + return; + for (i = 0; i < (1 << order); i++) + page_essa(page + i, ESSA_SET_UNUSED); +} + +static inline void page_set_stable(struct page *page, int order) +{ + int i; + + if (!cmma_flag) + return; + for (i = 0; i < (1 << order); i++) + page_essa(page + i, ESSA_SET_STABLE); +} + +static inline void page_set_volatile(struct page *page, int writable) +{ + if (!cmma_flag) + return; + if (writable) + page_essa(page, ESSA_SET_PVOLATILE); + else + 
page_essa(page, ESSA_SET_VOLATILE); +} + +static inline int page_set_stable_if_present(struct page *page) +{ + int rc; + + if (!cmma_flag || PageReserved(page)) + return 1; + + rc = page_essa(page, ESSA_SET_STABLE_IF_NOT_DISCARDED); + return (rc & ESSA_USTATE_MASK) != ESSA_USTATE_VOLATILE || + (rc & ESSA_CSTATE_MASK) != ESSA_CSTATE_ZERO; +} + +/* + * Page locking is done with the architecture page bit PG_arch_1. + */ +static inline int page_test_set_state_change(struct page *page) +{ + return test_and_set_bit(PG_arch_1, &page->flags); +} + +static inline void page_clear_state_change(struct page *page) +{ + clear_bit(PG_arch_1, &page->flags); +} + +static inline int page_state_change(struct page *page) +{ + return test_bit(PG_arch_1, &page->flags); +} + +int page_free_discarded(struct page *page); +void page_shrink_discard_list(void); +void page_discard_init(void); + +#endif /* _ASM_S390_PAGE_STATES_H */ Index: linux-2.6/arch/s390/include/asm/page.h =================================================================== --- linux-2.6.orig/arch/s390/include/asm/page.h +++ linux-2.6/arch/s390/include/asm/page.h @@ -125,17 +125,6 @@ page_get_storage_key(unsigned long addr) return skey; } -#ifdef CONFIG_PAGE_STATES - -struct page; -void arch_free_page(struct page *page, int order); -void arch_alloc_page(struct page *page, int order); - -#define HAVE_ARCH_FREE_PAGE -#define HAVE_ARCH_ALLOC_PAGE - -#endif - #endif /* !__ASSEMBLY__ */ #define __PAGE_OFFSET 0x0UL Index: linux-2.6/arch/s390/kernel/process.c =================================================================== --- linux-2.6.orig/arch/s390/kernel/process.c +++ linux-2.6/arch/s390/kernel/process.c @@ -29,6 +29,7 @@ #include #include #include +#include #include #include #include @@ -82,6 +83,9 @@ extern void s390_handle_mcck(void); */ static void default_idle(void) { +#ifdef CONFIG_PAGE_STATES + page_shrink_discard_list(); +#endif /* CPU is going idle. 
*/ local_irq_disable(); if (need_resched()) { Index: linux-2.6/arch/s390/kernel/time.c =================================================================== --- linux-2.6.orig/arch/s390/kernel/time.c +++ linux-2.6/arch/s390/kernel/time.c @@ -37,6 +37,7 @@ #include #include #include +#include #include #include #include @@ -137,6 +138,9 @@ static int s390_next_event(unsigned long static void s390_set_mode(enum clock_event_mode mode, struct clock_event_device *evt) { +#ifdef CONFIG_PAGE_STATES + page_shrink_discard_list(); +#endif } /* @@ -287,6 +291,10 @@ void __init time_init(void) &ext_int_etr_cc) != 0) panic("Couldn't request external interrupt 0x1406"); +#ifdef CONFIG_PAGE_STATES + page_discard_init(); +#endif + /* Enable TOD clock interrupts on the boot cpu. */ init_cpu_timer(); /* Enable cpu timer interrupts on the boot cpu. */ Index: linux-2.6/arch/s390/kernel/traps.c =================================================================== --- linux-2.6.orig/arch/s390/kernel/traps.c +++ linux-2.6/arch/s390/kernel/traps.c @@ -57,6 +57,7 @@ int sysctl_userprocess_debug = 0; extern pgm_check_handler_t do_protection_exception; extern pgm_check_handler_t do_dat_exception; extern pgm_check_handler_t do_asce_exception; +extern pgm_check_handler_t do_discard_fault; #define stack_pointer ({ void **sp; asm("la %0,0(15)" : "=&d" (sp)); sp; }) @@ -761,5 +762,8 @@ void __init trap_init(void) pgm_check_table[0x15] = &operand_exception; pgm_check_table[0x1C] = &space_switch_exception; pgm_check_table[0x1D] = &hfp_sqrt_exception; +#ifdef CONFIG_PAGE_STATES + pgm_check_table[0x1a] = &do_discard_fault; +#endif pfault_irq_init(); } Index: linux-2.6/arch/s390/lib/uaccess_mvcos.c =================================================================== --- linux-2.6.orig/arch/s390/lib/uaccess_mvcos.c +++ linux-2.6/arch/s390/lib/uaccess_mvcos.c @@ -36,7 +36,7 @@ static size_t copy_from_user_mvcos(size_ tmp1 = -4096UL; asm volatile( "0: .insn ss,0xc80000000000,0(%0,%2),0(%1),0\n" - " jz 7f\n" + 
"10:jz 7f\n" "1:"ALR" %0,%3\n" " "SLR" %1,%3\n" " "SLR" %2,%3\n" @@ -47,7 +47,7 @@ static size_t copy_from_user_mvcos(size_ " "CLR" %0,%4\n" /* copy crosses next page boundary? */ " jnh 4f\n" "3: .insn ss,0xc80000000000,0(%4,%2),0(%1),0\n" - " "SLR" %0,%4\n" + "11:"SLR" %0,%4\n" " "ALR" %2,%4\n" "4:"LHI" %4,-1\n" " "ALR" %4,%0\n" /* copy remaining size, subtract 1 */ @@ -62,6 +62,7 @@ static size_t copy_from_user_mvcos(size_ "7:"SLR" %0,%0\n" "8: \n" EX_TABLE(0b,2b) EX_TABLE(3b,4b) + EX_TABLE(10b,8b) EX_TABLE(11b,8b) : "+a" (size), "+a" (ptr), "+a" (x), "+a" (tmp1), "=a" (tmp2) : "d" (reg0) : "cc", "memory"); return size; @@ -82,7 +83,7 @@ static size_t copy_to_user_mvcos(size_t tmp1 = -4096UL; asm volatile( "0: .insn ss,0xc80000000000,0(%0,%1),0(%2),0\n" - " jz 4f\n" + "6: jz 4f\n" "1:"ALR" %0,%3\n" " "SLR" %1,%3\n" " "SLR" %2,%3\n" @@ -93,11 +94,12 @@ static size_t copy_to_user_mvcos(size_t " "CLR" %0,%4\n" /* copy crosses next page boundary? */ " jnh 5f\n" "3: .insn ss,0xc80000000000,0(%4,%1),0(%2),0\n" - " "SLR" %0,%4\n" + "7:"SLR" %0,%4\n" " j 5f\n" "4:"SLR" %0,%0\n" "5: \n" EX_TABLE(0b,2b) EX_TABLE(3b,5b) + EX_TABLE(6b,5b) EX_TABLE(7b,5b) : "+a" (size), "+a" (ptr), "+a" (x), "+a" (tmp1), "=a" (tmp2) : "d" (reg0) : "cc", "memory"); return size; Index: linux-2.6/arch/s390/lib/uaccess_std.c =================================================================== --- linux-2.6.orig/arch/s390/lib/uaccess_std.c +++ linux-2.6/arch/s390/lib/uaccess_std.c @@ -36,12 +36,12 @@ size_t copy_from_user_std(size_t size, c tmp1 = -256UL; asm volatile( "0: mvcp 0(%0,%2),0(%1),%3\n" - " jz 8f\n" + "10:jz 8f\n" "1:"ALR" %0,%3\n" " la %1,256(%1)\n" " la %2,256(%2)\n" "2: mvcp 0(%0,%2),0(%1),%3\n" - " jnz 1b\n" + "11:jnz 1b\n" " j 8f\n" "3: la %4,255(%1)\n" /* %4 = ptr + 255 */ " "LHI" %3,-4096\n" @@ -50,7 +50,7 @@ size_t copy_from_user_std(size_t size, c " "CLR" %0,%4\n" /* copy crosses next page boundary? 
*/ " jnh 5f\n" "4: mvcp 0(%4,%2),0(%1),%3\n" - " "SLR" %0,%4\n" + "12:"SLR" %0,%4\n" " "ALR" %2,%4\n" "5:"LHI" %4,-1\n" " "ALR" %4,%0\n" /* copy remaining size, subtract 1 */ @@ -65,6 +65,7 @@ size_t copy_from_user_std(size_t size, c "8:"SLR" %0,%0\n" "9: \n" EX_TABLE(0b,3b) EX_TABLE(2b,3b) EX_TABLE(4b,5b) + EX_TABLE(10b,9b) EX_TABLE(11b,9b) EX_TABLE(12b,9b) : "+a" (size), "+a" (ptr), "+a" (x), "+a" (tmp1), "=a" (tmp2) : : "cc", "memory"); return size; Index: linux-2.6/arch/s390/mm/fault.c =================================================================== --- linux-2.6.orig/arch/s390/mm/fault.c +++ linux-2.6/arch/s390/mm/fault.c @@ -611,4 +611,5 @@ void __init pfault_irq_init(void) unregister_early_external_interrupt(0x2603, pfault_interrupt, &ext_int_pfault); } + #endif Index: linux-2.6/arch/s390/mm/init.c =================================================================== --- linux-2.6.orig/arch/s390/mm/init.c +++ linux-2.6/arch/s390/mm/init.c @@ -94,6 +94,9 @@ void __init mem_init(void) /* Setup guest page hinting */ cmma_init(); + /* Setup guest page hinting */ + cmma_init(); + /* this will put all low memory onto the freelists */ totalram_pages += free_all_bootmem(); Index: linux-2.6/arch/s390/mm/page-states.c =================================================================== --- linux-2.6.orig/arch/s390/mm/page-states.c +++ linux-2.6/arch/s390/mm/page-states.c @@ -13,67 +13,223 @@ #include #include #include +#include +#include +#include +#include +#include +#include + +extern void die(const char *,struct pt_regs *,long); + +#ifndef CONFIG_64BIT +#define __FAIL_ADDR_MASK 0x7ffff000 +#else /* CONFIG_64BIT */ +#define __FAIL_ADDR_MASK -4096L +#endif /* CONFIG_64BIT */ -#define ESSA_SET_STABLE 1 -#define ESSA_SET_UNUSED 2 +int cmma_flag; -static int cmma_flag; +void __init cmma_init(void) +{ + register unsigned long tmp asm("0") = 0; + register int rc asm("1") = -ENOSYS; + if (!cmma_flag) + return; + asm volatile( + " .insn rrf,0xb9ab0000,%1,%1,0,0\n" + "0: la 
%0,0\n" + "1:\n" + EX_TABLE(0b,1b) + : "+&d" (rc), "+&d" (tmp)); + if (rc) + cmma_flag = 0; +} static int __init cmma(char *str) { char *parm; + parm = strstrip(str); if (strcmp(parm, "yes") == 0 || strcmp(parm, "on") == 0) { cmma_flag = 1; return 1; } - cmma_flag = 0; - if (strcmp(parm, "no") == 0 || strcmp(parm, "off") == 0) + if (strcmp(parm, "no") == 0 || strcmp(parm, "off") == 0) { + cmma_flag = 0; return 1; + } return 0; } __setup("cmma=", cmma); -void __init cmma_init(void) +static inline void fixup_user_copy(struct pt_regs *regs, + unsigned long address, unsigned short rx) { - register unsigned long tmp asm("0") = 0; - register int rc asm("1") = -EOPNOTSUPP; + const struct exception_table_entry *fixup; + unsigned long kaddr; - if (!cmma_flag) + kaddr = (regs->gprs[rx >> 12] + (rx & 0xfff)) & __FAIL_ADDR_MASK; + if (virt_to_phys((void *) kaddr) != address) return; - asm volatile( - " .insn rrf,0xb9ab0000,%1,%1,0,0\n" - "0: la %0,0\n" - "1:\n" - EX_TABLE(0b,1b) - : "+&d" (rc), "+&d" (tmp)); - if (rc) - cmma_flag = 0; + + fixup = search_exception_tables(regs->psw.addr & PSW_ADDR_INSN); + if (fixup) + regs->psw.addr = fixup->fixup | PSW_ADDR_AMODE; + else + die("discard fault", regs, SIGSEGV); } -void arch_free_page(struct page *page, int order) +/* + * Discarded pages with a page_count() of zero are placed on + * the page_discarded_list until all cpus have been at + * least once in enabled code. That closes the race of page + * free vs. discard faults. + */ +void do_discard_fault(struct pt_regs *regs, unsigned long error_code) { - int i, rc; + unsigned long address; + struct page *page; - if (!cmma_flag) - return; - for (i = 0; i < (1 << order); i++) - asm volatile(".insn rrf,0xb9ab0000,%0,%1,%2,0" - : "=&d" (rc) - : "a" ((page_to_pfn(page) + i) << PAGE_SHIFT), - "i" (ESSA_SET_UNUSED)); + /* + * get the real address that caused the block validity + * exception. 
+ */ + address = S390_lowcore.trans_exc_code & __FAIL_ADDR_MASK; + page = pfn_to_page(address >> PAGE_SHIFT); + + /* + * Check for the special case of a discard fault in + * copy_{from,to}_user. User copy is done using one of + * three special instructions: mvcp, mvcs or mvcos. + */ + if (!(regs->psw.mask & PSW_MASK_PSTATE)) { + switch (*(unsigned char *) regs->psw.addr) { + case 0xda: /* mvcp */ + fixup_user_copy(regs, address, + *(__u16 *)(regs->psw.addr + 2)); + break; + case 0xdb: /* mvcs */ + fixup_user_copy(regs, address, + *(__u16 *)(regs->psw.addr + 4)); + break; + case 0xc8: /* mvcos */ + if (regs->gprs[0] == 0x81) + fixup_user_copy(regs, address, + *(__u16*)(regs->psw.addr + 2)); + else if (regs->gprs[0] == 0x810000) + fixup_user_copy(regs, address, + *(__u16*)(regs->psw.addr + 4)); + break; + default: + break; + } + } + + if (likely(get_page_unless_zero(page))) { + local_irq_enable(); + page_discard(page); + } } -void arch_alloc_page(struct page *page, int order) +static DEFINE_PER_CPU(struct list_head, page_discard_list); +static struct list_head page_gather_list = LIST_HEAD_INIT(page_gather_list); +static struct list_head page_signoff_list = LIST_HEAD_INIT(page_signoff_list); +static cpumask_var_t page_signoff_cpumask; +static DEFINE_SPINLOCK(page_discard_lock); + +/* + * page_free_discarded + * + * free_hot_cold_page calls this function if it is about to free a + * page that has PG_discarded set. Since there might be pending + * discard faults on other cpus on s390 we have to postpone the + * freeing of the page until each cpu has "signed-off" the page. + * + * returns 1 to stop free_hot_cold_page from freeing the page. 
+ */ +int page_free_discarded(struct page *page) { - int i, rc; + local_irq_disable(); + list_add_tail(&page->lru, &__get_cpu_var(page_discard_list)); + local_irq_enable(); + return 1; +} - if (!cmma_flag) +/* + * page_shrink_discard_list + * + * This function is called from the timer tick for an active cpu or + * from the idle notifier. It frees discarded pages in three stages. + * In the first stage it moves the pages from the per-cpu discard + * list to a global list. From the global list the pages are moved + * to the signoff list in a second step. The third step is to free + * the pages after all cpus acknoledged the signoff. That prevents + * that a page is freed when a cpus still has a pending discard + * fault for the page. + */ +void page_shrink_discard_list(void) +{ + struct list_head *cpu_list = &__get_cpu_var(page_discard_list); + struct list_head free_list = LIST_HEAD_INIT(free_list); + struct page *page, *next; + int cpu = smp_processor_id(); + if (list_empty(cpu_list) && + !cpumask_test_cpu(cpu, page_signoff_cpumask)) return; - for (i = 0; i < (1 << order); i++) - asm volatile(".insn rrf,0xb9ab0000,%0,%1,%2,0" - : "=&d" (rc) - : "a" ((page_to_pfn(page) + i) << PAGE_SHIFT), - "i" (ESSA_SET_STABLE)); + spin_lock(&page_discard_lock); + if (!list_empty(cpu_list)) + list_splice_init(cpu_list, &page_gather_list); + cpumask_clear_cpu(cpu, page_signoff_cpumask); + if (cpumask_empty(page_signoff_cpumask)) { + list_splice_init(&page_signoff_list, &free_list); + list_splice_init(&page_gather_list, &page_signoff_list); + if (!list_empty(&page_signoff_list)) { + /* Take care of the nohz race.. 
*/ + cpumask_copy(page_signoff_cpumask, &cpu_online_map); + smp_wmb(); + cpumask_andnot(page_signoff_cpumask, + page_signoff_cpumask, nohz_cpu_mask); + cpumask_clear_cpu(cpu, page_signoff_cpumask); + if (cpumask_empty(page_signoff_cpumask)) + list_splice_init(&page_signoff_list, + &free_list); + } + } + spin_unlock(&page_discard_lock); + list_for_each_entry_safe(page, next, &free_list, lru) { + ClearPageDiscarded(page); + free_cold_page(page); + } +} + +static int page_discard_cpu_notify(struct notifier_block *self, + unsigned long action, void *hcpu) +{ + int cpu = (unsigned long) hcpu; + + if (action == CPU_DEAD) { + local_irq_disable(); + list_splice_init(&per_cpu(page_discard_list, cpu), + &__get_cpu_var(page_discard_list)); + local_irq_enable(); + } + return NOTIFY_OK; +} + +static struct notifier_block page_discard_cpu_notifier = { + .notifier_call = page_discard_cpu_notify, +}; + +void __init page_discard_init(void) +{ + int i; + + if (!alloc_cpumask_var(&page_signoff_cpumask, GFP_KERNEL)) + panic("Couldn't allocate page_signoff_cpumask\n"); + for_each_possible_cpu(i) + INIT_LIST_HEAD(&per_cpu(page_discard_list, i)); + if (register_cpu_notifier(&page_discard_cpu_notifier)) + panic("Couldn't register page discard cpu notifier"); } Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -744,6 +744,15 @@ void page_remove_rmap(struct page *page) */ if ((!PageAnon(page) || PageSwapCache(page)) && page_test_dirty(page)) { + int stable = page_make_stable(page); + VM_BUG_ON(!stable); + /* + * We decremented the mapcount so we now have an + * extra reference for the page. That prevents + * page_make_volatile from making the page + * volatile again while the dirty bit is in + * transit. + */ page_clear_dirty(page); set_page_dirty(page); } -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. 
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Martin Schwidefsky
Subject: [patch 5/6] Guest page hinting: minor fault optimization.
Date: Fri, 27 Mar 2009 16:09:10 +0100
Message-ID: <20090327151012.713478499@de.ibm.com>
References: <20090327150905.819861420@de.ibm.com>
Return-path:
Content-Disposition: inline; filename=005-hva-nohv.diff
Sender: owner-linux-mm@kvack.org
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org
Cc: frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com, Martin Schwidefsky
List-Id: virtualization@lists.linuxfoundation.org

From: Martin Schwidefsky
From: Hubertus Franke
From: Himanshu Raj

One of the challenges of the guest page hinting scheme is the cost of
the state transitions. If the cost gets too high the whole concept of
page state information is in question. Therefore it is important to
avoid the state transitions when possible. One place where the state
transitions can be avoided is the minor fault path. Why change the page
state to stable in find_get_page and back in page_add_anon_rmap/
page_add_file_rmap if the discarded pages can be handled by the discard
fault handler? If the page is in the page/swap cache just map it, even
if it is already discarded. The first access to the page will cause a
discard fault, which needs to be able to deal with this kind of
situation anyway because of other races in the memory management.

The special find_get_page_nodiscard variant introduced for the volatile
swap cache, which does not change the page state, is used for this. The
call to find_get_page in lookup_swap_cache is replaced with
find_get_page_nodiscard, and the find_lock_page calls in filemap_fault
are replaced with a matching find_lock_page_nodiscard variant. By the
use of these functions a new race is created.
If a minor fault races with the discard of a page, the page must not be
mapped into the page table, because the discard handler has removed the
page from the cache, which clears the page->mapping that is needed to
find the page table entry. A check for the discarded bit is added to
do_swap_page and do_no_page. The page table lock for the pte takes care
of the synchronization.

That removes the state transitions on the minor fault path. A page that
has been mapped will eventually be unmapped again. On the unmap path
each page that has been removed from the page table is freed with a
call to page_cache_release. In general that causes an unnecessary page
state transition from volatile to volatile. To get rid of these state
transitions as well, a special variant of page_cache_release is added
that does not attempt to make the page volatile.
page_cache_release_nocheck is then used in free_page_and_swap_cache and
release_pages. This makes the unmap of ptes free of state transitions.

Signed-off-by: Martin Schwidefsky
---
 include/linux/pagemap.h |    4 ++++
 include/linux/swap.h    |    2 +-
 mm/filemap.c            |   27 ++++++++++++++++++++++++---
 mm/fremap.c             |    1 +
 mm/memory.c             |    4 ++--
 mm/rmap.c               |    4 +---
 mm/shmem.c              |    7 +++++++
 mm/swap_state.c         |    4 ++--
 8 files changed, 42 insertions(+), 11 deletions(-)

Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -93,11 +93,15 @@ static inline void mapping_set_gfp_mask( #ifdef CONFIG_PAGE_STATES extern struct page * find_get_page_nodiscard(struct address_space *mapping, unsigned long index); +extern struct page * find_lock_page_nodiscard(struct address_space *mapping, + unsigned long index); #define page_cache_release(page) put_page_check(page) #else #define find_get_page_nodiscard(mapping, index) find_get_page(mapping, index) +#define find_lock_page_nodiscard(mapping, index) find_lock_page(mapping, index) #define 
page_cache_release(page) put_page(page) #endif +#define page_cache_release_nocheck(page) put_page(page) void release_pages(struct page **pages, int nr, int cold); /* Index: linux-2.6/include/linux/swap.h =================================================================== --- linux-2.6.orig/include/linux/swap.h +++ linux-2.6/include/linux/swap.h @@ -362,7 +362,7 @@ static inline void mem_cgroup_uncharge_s /* only sparc can not include linux/pagemap.h in this file * so leave page_cache_release and release_pages undeclared... */ #define free_page_and_swap_cache(page) \ - page_cache_release(page) + page_cache_release_nocheck(page) #define free_pages_and_swap_cache(pages, nr) \ release_pages((pages), (nr), 0); Index: linux-2.6/mm/filemap.c =================================================================== --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -592,6 +592,27 @@ repeat: EXPORT_SYMBOL(find_get_page_nodiscard); +struct page *find_lock_page_nodiscard(struct address_space *mapping, + unsigned long offset) +{ + struct page *page; + +repeat: + page = find_get_page_nodiscard(mapping, offset); + if (page) { + lock_page(page); + /* Has the page been truncated? */ + if (unlikely(page->mapping != mapping)) { + unlock_page(page); + page_cache_release(page); + goto repeat; + } + VM_BUG_ON(page->index != offset); + } + return page; +} +EXPORT_SYMBOL(find_lock_page_nodiscard); + #endif /* @@ -1586,7 +1607,7 @@ int filemap_fault(struct vm_area_struct * Do we have something in the page cache already? */ retry_find: - page = find_lock_page(mapping, vmf->pgoff); + page = find_lock_page_nodiscard(mapping, vmf->pgoff); /* * For sequential accesses, we use the generic readahead logic. 
*/ @@ -1594,7 +1615,7 @@ retry_find: if (!page) { page_cache_sync_readahead(mapping, ra, file, vmf->pgoff, 1); - page = find_lock_page(mapping, vmf->pgoff); + page = find_lock_page_nodiscard(mapping, vmf->pgoff); if (!page) goto no_cached_page; } @@ -1633,7 +1654,7 @@ retry_find: start = vmf->pgoff - ra_pages / 2; do_page_cache_readahead(mapping, file, start, ra_pages); } - page = find_lock_page(mapping, vmf->pgoff); + page = find_lock_page_nodiscard(mapping, vmf->pgoff); if (!page) goto no_cached_page; } Index: linux-2.6/mm/fremap.c =================================================================== --- linux-2.6.orig/mm/fremap.c +++ linux-2.6/mm/fremap.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -2513,7 +2513,7 @@ static int do_swap_page(struct mm_struct * Back out if somebody else already faulted in this pte. */ page_table = pte_offset_map_lock(mm, pmd, address, &ptl); - if (unlikely(!pte_same(*page_table, orig_pte))) + if (unlikely(!pte_same(*page_table, orig_pte) || PageDiscarded(page))) goto out_nomap; if (unlikely(!PageUptodate(page))) { @@ -2753,7 +2753,7 @@ retry: * handle that later. */ /* Only go through if we didn't race with anybody else... 
*/ - if (likely(pte_same(*page_table, orig_pte))) { + if (likely(pte_same(*page_table, orig_pte) && !PageDiscarded(page))) { flush_icache_page(vma, page); entry = mk_pte(page, vma->vm_page_prot); if (flags & FAULT_FLAG_WRITE) Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -703,7 +703,6 @@ void page_add_file_rmap(struct page *pag { if (atomic_inc_and_test(&page->_mapcount)) __inc_zone_page_state(page, NR_FILE_MAPPED); - page_make_volatile(page, 1); } #ifdef CONFIG_DEBUG_VM @@ -763,7 +762,6 @@ void page_remove_rmap(struct page *page) * faster for those pages still in swapcache. */ } - page_make_volatile(page, 1); } /* @@ -862,7 +860,7 @@ static int try_to_unmap_one(struct page } page_remove_rmap(page); - page_cache_release(page); + page_cache_release_nocheck(page); out_unmap: pte_unmap_unlock(pte, ptl); Index: linux-2.6/mm/shmem.c =================================================================== --- linux-2.6.orig/mm/shmem.c +++ linux-2.6/mm/shmem.c @@ -59,6 +59,7 @@ static struct vfsmount *shm_mnt; #include #include #include +#include #include #include @@ -1245,6 +1246,12 @@ repeat: if (swap.val) { /* Look it up and read it in.. 
*/ swappage = lookup_swap_cache(swap); + if (swappage && unlikely(!page_make_stable(swappage))) { + shmem_swp_unmap(entry); + spin_unlock(&info->lock); + page_discard(swappage); + goto repeat; + } if (!swappage) { shmem_swp_unmap(entry); /* here we actually do the io */ Index: linux-2.6/mm/swap_state.c =================================================================== --- linux-2.6.orig/mm/swap_state.c +++ linux-2.6/mm/swap_state.c @@ -241,7 +241,7 @@ static inline void free_swap_cache(struc void free_page_and_swap_cache(struct page *page) { free_swap_cache(page); - page_cache_release(page); + page_cache_release_nocheck(page); } /* @@ -275,7 +275,7 @@ struct page * lookup_swap_cache(swp_entr { struct page *page; - page = find_get_page(&swapper_space, entry.val); + page = find_get_page_nodiscard(&swapper_space, entry.val); if (page) INC_CACHE_INFO(find_success); -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Martin Schwidefsky Subject: [patch 1/6] Guest page hinting: core + volatile page cache. Date: Fri, 27 Mar 2009 16:09:06 +0100 Message-ID: <20090327151011.534224968@de.ibm.com> References: <20090327150905.819861420@de.ibm.com> Return-path: Content-Disposition: inline; filename=001-hva-core.diff Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org Cc: frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com, Martin Schwidefsky List-Id: virtualization@lists.linuxfoundation.org From: Martin Schwidefsky From: Hubertus Franke From: Himanshu Raj The guest page hinting patchset introduces code that passes guest page usage information to the host system that virtualizes the memory of its guests. 
There are three different page states:

* Unused: The page content is of no interest to the guest. The host can
  forget the page content and replace it with a page containing zeroes.
* Stable: The page content is needed by the guest and has to be
  preserved by the host.
* Volatile: The page content is useful to the guest but not essential.
  The host can discard the page but has to deliver a special kind of
  fault to the guest if the guest accesses a page discarded by the host.

The unused state is used for free pages. It allows the host to avoid
the paging of empty pages. The default state for non-free pages is
stable. The host can write stable pages to a swap device but has to
restore the page if the guest accesses it. The volatile page state is
used for clean uptodate page cache pages. The host can choose to discard
volatile pages as part of its vmscan operation instead of writing them
to the host's paging device. The guest system doesn't notice that a
volatile page is gone until it tries to access the page or to make the
page stable again. For a guest access to a discarded page the host
generates a discard fault to notify the guest. The guest has to remove
the page from the cache and reload the page from its backing device.

The volatile state is used for all page cache pages, even for pages
which are referenced by writable ptes. The host needs to be able to
check the dirty state of the pages. Since the host doesn't know where
the page table entries of the guest are located, the volatile state as
introduced by this patch is only usable on architectures with per-page
dirty bits (s390 only). For per-pte dirty bit architectures some
additional code is needed, see patch #4.

The main question is where to put the state transitions between the
volatile and the stable state. The simple solution is to make a page
stable whenever a lookup is done or a page reference is derived from a
page table entry. Attempts to make pages volatile are added at strategic
points.
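As a rough illustration of the three states described above, the
following userspace sketch models what a host may do when its LRU scan
picks a page in a given state. It is only a model under stated
assumptions: the enum names, host_reclaim() and
access_raises_discard_fault() are hypothetical and are not part of the
patch set or of any hypervisor interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's or the host's API. */
enum page_state { PAGE_UNUSED, PAGE_STABLE, PAGE_VOLATILE };

enum host_action {
	HOST_DROP_ZERO,     /* forget content, supply a zero page later */
	HOST_WRITE_TO_SWAP, /* content must be preserved by the host */
	HOST_DISCARD        /* drop content, discard fault on guest access */
};

/* What the host may do when it reclaims a page in the given state. */
static enum host_action host_reclaim(enum page_state state)
{
	switch (state) {
	case PAGE_UNUSED:
		return HOST_DROP_ZERO;
	case PAGE_VOLATILE:
		return HOST_DISCARD;
	case PAGE_STABLE:
	default:
		return HOST_WRITE_TO_SWAP;
	}
}

/* Does a guest access to a page reclaimed in this state fault back? */
static bool access_raises_discard_fault(enum page_state state)
{
	return host_reclaim(state) == HOST_DISCARD;
}
```

Only the volatile state leads to a discard fault in this model. Unused
and stable pages are transparent to the guest, which matches the
description above.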
The conditions that prevent a page from being made volatile:

1) The page is reserved. Some sort of special page.
2) The page is marked dirty in the struct page. The page content is
   more recent than the data on the backing device. The host cannot
   access the linux internal dirty bit so the page needs to be stable.
3) The page is in writeback. The page content is needed for i/o.
4) The page is locked. Someone has exclusive access to the page.
5) The page is anonymous. Swap cache support needs additional code.
   See patch #2.
6) The page has no mapping. Without a backing the page cannot be
   recreated.
7) The page is not uptodate.
8) The page has private information. try_to_release_page can fail,
   e.g. in case the private information is journaling data. The discard
   fault handler needs to be able to remove the page.
9) The page is already discarded.
10) The page is not on the LRU list. The page has been isolated, some
    processing is done.
11) The page map count is not equal to the page reference count - 1.
    The discard fault handler can remove the page cache reference and
    all mappers of a page. It cannot remove the page reference for any
    other user of the page.

The transitions to stable are done by find_get_pages() and its variants,
in follow_page if the FOLL_GET flag is set, by copy-on-write in
do_wp_page, and by the early copy-on-write in do_no_page. For page cache
pages this is always done with a call to page_make_stable().

To make enough pages discardable by the host, an attempt to do the
transition to volatile state is made at several places:

1) When a page gets unlocked (unlock_page).
2) When writeback has finished (test_clear_page_writeback).
3) When the page reference counter is decreased (in page_cache_release
   alias put_page_check and in __free_pages, right before the
   put_page_testzero call).
4) When the map counter is increased (page_add_file_rmap).
5) When a page is moved from the active list to the inactive list.
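The 11 conditions listed above can be read as a single predicate. The
following is a minimal userspace sketch of such a check, not the
page_make_volatile() implementation from the patch. The struct
model_page type, its field names and may_make_volatile() are
hypothetical stand-ins for the corresponding struct page tests.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; field names are illustrative. */
struct model_page {
	bool reserved, dirty, writeback, locked, anon;
	bool has_mapping, uptodate, has_private, discarded, on_lru;
	int map_count;  /* number of page table mappings of the page */
	int ref_count;  /* all references, including the page cache's */
};

/* Returns true if none of the 11 listed conditions blocks the change. */
static bool may_make_volatile(const struct model_page *p)
{
	if (p->reserved || p->dirty || p->writeback || p->locked || p->anon)
		return false;	/* conditions 1 to 5 */
	if (!p->has_mapping || !p->uptodate)
		return false;	/* conditions 6 and 7 */
	if (p->has_private || p->discarded || !p->on_lru)
		return false;	/* conditions 8 to 10 */
	/* condition 11: only the page cache and the mappers hold references */
	return p->map_count == p->ref_count - 1;
}
```

In this model a clean, uptodate, unlocked page cache page on the LRU
with two mappers and three references passes the check. One additional
reference held by another user makes it fail, which corresponds to
condition 11.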
The function for the state transitions to volatile is
page_make_volatile().

The major obstacles that need to be addressed:

* Concurrent page state changes:
  To guard against concurrent page state updates some kind of lock is
  needed. If page_make_volatile() has already done the 11 checks it
  will issue the state change primitive. If one of the conditions has
  changed in the meantime, the user that requires the page in stable
  state has to wait in the page_make_stable() function until the make
  volatile operation has finished. It is up to the architecture to
  define how this is done with the three primitives
  page_test_set_state_change, page_clear_state_change and
  page_state_change. There are several alternatives for how this can be
  done, e.g. a global lock, a lock per segment in the kernel page
  table, or the per page bit PG_arch_1 if it is still free.
* Page references acquired from page tables:
  All page references acquired with find_get_page and friends can be
  used to access the page frame content. A page reference grabbed from
  a page table cannot be used to access the page content; the page has
  to be made stable first. If the make stable operation fails because
  the page has been discarded, it has to be removed from the page
  cache. That removes the page table entry as well.
* Page discard vs. __remove_from_page_cache race:
  A new page flag PG_discarded is added. This bit is set for discarded
  pages. It prevents multiple removes of a page from the page cache due
  to concurrent discard faults and/or normal page removals. It also
  prevents the re-add of isolated pages to the lru list in vmscan if
  the page has been discarded while it was not on the lru list.
* Page discard vs. pte establish:
  The discard fault handler does three things: 1) set the PG_discarded
  bit for the page, 2) remove the page from all page tables and
  3) remove the page from the page cache.
  All page references to the discarded page that are still around after
  step 2 may not be used to establish new mappings because step 3
  clears the page->mapping field that is required to find the mappers.
  Code that establishes new ptes to pages that might be discarded has
  to check the PG_discarded bit. Step 2 has to check all possible
  locations for a pte of a particular page and check whether the pte
  exists or another processor might be in the process of establishing
  one. To do that the page table lock for the pte is used. See
  page_unmap_all and the modified quick check in page_check_address for
  the details.
* copy_one_pte vs. discarded pages:
  The code that copies the page tables may not copy ptes for discarded
  pages because this races with the discard fault handler. copy_one_pte
  cannot back out either since there is no automatic repeat of the
  fault that caused the pte modification. Ptes to discarded pages only
  show up in copy_one_pte if a fork races with a discard fault. In this
  case copy_one_pte has to create a pte in the new page table that
  looks like the one that the discard fault handler would have created
  in the original page table if copy_one_pte had not grabbed the page
  table lock first.
* get_user_pages with FOLL_GET:
  If get_user_pages is called with a non-NULL pages argument the caller
  has to be able to access the page content using the references
  returned in the pages array. This is done with a check in follow_page
  for the FOLL_GET bit and a call to page_make_stable. If
  get_user_pages is called with NULL as the pages argument the pages
  are not made stable. The caller cannot expect that the pages are
  available after the call because vmscan might have removed them.
* buffer heads / page_private:
  A page that is modified with sys_write will get a buffer-head to keep
  track of the dirty state. The existence of a buffer-head makes
  PagePrivate(page) return true. Pages with private information cannot
  be made volatile.
  Until the buffer-head is removed the page will stay stable. The
  standard logic is to call try_to_release_page which frees the
  buffer-head only if more than 10% of GFP_USER memory is used for
  buffer heads. Without high memory every page can have a buffer-head
  without running over the limit. The result is that every page written
  to with sys_write will stay stable until it is removed. To get these
  pages volatile again max_buffer_heads is set to zero (!) to force a
  call to try_to_release_page whenever a page is moved from the active
  to the inactive list.
* page_free_discarded hook:
  The architecture might want/need to do special things for discarded
  pages before they are freed. E.g. s390 has to delay the freeing of
  discarded pages. To allow this a hook is added to free_hot_cold_page.

Another noticeable change is that the first few lines of code in
try_to_unmap_one that calculate the address from the page and the vma
are moved out of try_to_unmap_one to the callers. This is done to make
try_to_unmap_one usable for the removal of discarded pages in
page_unmap_all.
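The role of PG_discarded in the three discard steps described above
(set the bit, unmap, remove from the page cache) can be sketched in
userspace with a C11 atomic test-and-set. This is only a model under
stated assumptions: struct discard_model, MODEL_PG_DISCARDED and
model_page_discard() are hypothetical names, and the two plain
assignments stand in for the real unmap and page cache removal, which
need the page table lock and the mapping's locking in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_PG_DISCARDED 1u	/* hypothetical flag bit */

/* Hypothetical model of a page; not the kernel's struct page. */
struct discard_model {
	atomic_uint flags;
	int mappers;	/* page table entries pointing at the page */
	bool in_cache;	/* still reachable through the page cache */
};

/*
 * Model of the three steps. Returns true if this caller performed the
 * removal, false if an earlier discard already claimed the page.
 */
static bool model_page_discard(struct discard_model *p)
{
	/* Step 1: atomically set the discarded bit; the first caller wins. */
	unsigned int old = atomic_fetch_or(&p->flags, MODEL_PG_DISCARDED);

	if (old & MODEL_PG_DISCARDED)
		return false;
	p->mappers = 0;		/* Step 2: remove from all page tables. */
	p->in_cache = false;	/* Step 3: remove from the page cache. */
	return true;
}
```

Because the bit is set with an atomic read-modify-write, a second
discard fault for the same page sees the bit already set and backs off,
so the page is removed from the cache only once in this model.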

Signed-off-by: Martin Schwidefsky
---
 fs/buffer.c                 |   12 ++
 include/linux/mm.h          |    1
 include/linux/page-flags.h  |   22 ++++
 include/linux/page-states.h |  123 +++++++++++++++++++++++++++
 include/linux/pagemap.h     |    6 +
 mm/Makefile                 |    1
 mm/filemap.c                |   77 ++++++++++++++++-
 mm/memory.c                 |   58 ++++++++++++
 mm/page-states.c            |  198 ++++++++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c         |    2
 mm/page_alloc.c             |   11 ++
 mm/rmap.c                   |   96 ++++++++++++++++++---
 mm/swap.c                   |   35 ++++++-
 mm/vmscan.c                 |   50 ++++++++---
 14 files changed, 656 insertions(+), 36 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -3404,11 +3404,23 @@ void __init buffer_init(void) SLAB_MEM_SPREAD), init_buffer_head); +#ifdef CONFIG_PAGE_STATES + /* + * If volatile page cache is enabled we want to get as many + * pages into volatile state as possible. Pages with private + * information cannot be made volatile. Set max_buffer_heads + * to zero to make shrink_active_list release the private + * information when moving a page from the active to the inactive + * list. 
+ */ + max_buffer_heads = 0; +#else /* * Limit the bh occupancy to 10% of ZONE_NORMAL */ nrpages = (nr_free_buffer_pages() * 10) / 100; max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head)); +#endif hotcpu_notifier(buffer_cpu_notify, 0); } Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h +++ linux-2.6/include/linux/mm.h @@ -319,6 +319,7 @@ static inline void init_page_count(struc } void put_page(struct page *page); +void put_page_check(struct page *page); void put_pages_list(struct list_head *pages); void split_page(struct page *page, unsigned int order); Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h +++ linux-2.6/include/linux/page-flags.h @@ -101,6 +101,9 @@ enum pageflags { #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR PG_uncached, /* Page has been mapped as uncached */ #endif +#ifdef CONFIG_PAGE_STATES + PG_discarded, /* Page discarded by the hypervisor. 
*/ +#endif __NR_PAGEFLAGS, /* Filesystems */ @@ -178,6 +181,17 @@ static inline void __ClearPage##uname(st #define TESTCLEARFLAG_FALSE(uname) \ static inline int TestClearPage##uname(struct page *page) { return 0; } +#ifdef CONFIG_PAGE_STATES +#define PageDiscarded(page) test_bit(PG_discarded, &(page)->flags) +#define ClearPageDiscarded(page) clear_bit(PG_discarded, &(page)->flags) +#define TestSetPageDiscarded(page) \ + test_and_set_bit(PG_discarded, &(page)->flags) +#else +#define PageDiscarded(page) 0 +#define ClearPageDiscarded(page) do { } while (0) +#define TestSetPageDiscarded(page) 0 +#endif + struct page; /* forward declaration */ TESTPAGEFLAG(Locked, locked) @@ -373,6 +387,12 @@ static inline void __ClearPageTail(struc #define __PG_MLOCKED 0 #endif +#ifdef CONFIG_PAGE_STATES +#define __PG_DISCARDED (1 << PG_discarded) +#else +#define __PG_DISCARDED 0 +#endif + /* * Flags checked when a page is freed. Pages being freed should not have * these flags set. It they are, there is a problem. @@ -381,7 +401,7 @@ static inline void __ClearPageTail(struc (1 << PG_lru | 1 << PG_private | 1 << PG_locked | \ 1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \ 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ - __PG_UNEVICTABLE | __PG_MLOCKED) + __PG_UNEVICTABLE | __PG_MLOCKED | __PG_DISCARDED) /* * Flags checked when a page is prepped for return by the page allocator. Index: linux-2.6/include/linux/page-states.h =================================================================== --- /dev/null +++ linux-2.6/include/linux/page-states.h @@ -0,0 +1,123 @@ +#ifndef _LINUX_PAGE_STATES_H +#define _LINUX_PAGE_STATES_H + +/* + * include/linux/page-states.h + * + * Copyright IBM Corp. 
2005, 2007 + * + * Authors: Martin Schwidefsky + * Hubertus Franke + * Himanshu Raj + */ + +#include + +#ifdef CONFIG_PAGE_STATES +/* + * Guest page hinting primitives that need to be defined in the + * architecture header file if PAGE_STATES=y: + * - page_host_discards: + * Indicates whether the host system discards guest pages or not. + * - page_set_unused: + * Indicates to the host that the page content is of no interest + * to the guest. The host can "forget" the page content and replace + * it with a page containing zeroes. + * - page_set_stable: + * Indicate to the host that the page content is needed by the guest. + * - page_set_volatile: + * Make the page discardable by the host. Instead of writing the + * page to the hosts swap device, the host can remove the page. + * A guest that accesses such a discarded page gets a special + * discard fault. + * - page_set_stable_if_present: + * The page state is set to stable if the page has not been discarded + * by the host. The check and the state change have to be done + * atomically. + * - page_discarded: + * Returns true if the page has been discarded by the host. + * - page_volatile: + * Returns true if the page is marked volatile. + * - page_test_set_state_change: + * Tries to lock the page for state change. The primitive does not need + * to have page granularity, it can lock a range of pages. + * - page_clear_state_change: + * Unlocks a page for state changes. + * - page_state_change: + * Returns true if the page is locked for state change. + * - page_free_discarded: + * Free a discarded page. This might require to put the page on a + * discard list and a synchronization over all cpus. Returns true + * if the architecture backend wants to do special things on free. 
+ */ +#include + +extern void page_unmap_all(struct page *page); +extern void page_discard(struct page *page); +extern int __page_make_stable(struct page *page); +extern void __page_make_volatile(struct page *page, int offset); +extern void __pagevec_make_volatile(struct pagevec *pvec); + +/* + * Extended guest page hinting functions defined by using the + * architecture primitives: + * - page_make_stable: + * Tries to make a page stable. This operation can fail if the + * host has discarded the page. The function returns 0 if the + * page could not be made stable and nonzero otherwise. + * - page_make_volatile: + * Tries to make a page volatile. There are a number of conditions + * that prevent a page from becoming volatile. If at least one + * is true the function does nothing. See mm/page-states.c for + * details. + * - pagevec_make_volatile: + * Tries to make a vector of pages volatile. For each page in the + * vector the same conditions apply as for page_make_volatile. + * - page_discard: + * Removes a discarded page from the system. The page is removed + * from the LRU list and the radix tree of its mapping. + * page_discard uses page_unmap_all to remove all page table + * entries for a page. + */ + +static inline int page_make_stable(struct page *page) +{ + return page_host_discards() ? 
__page_make_stable(page) : 1; +} + +static inline void page_make_volatile(struct page *page, int offset) +{ + if (page_host_discards()) + __page_make_volatile(page, offset); +} + +static inline void pagevec_make_volatile(struct pagevec *pvec) +{ + if (page_host_discards()) + __pagevec_make_volatile(pvec); +} + +#else + +#define page_host_discards() (0) +#define page_set_unused(_page,_order) do { } while (0) +#define page_set_stable(_page,_order) do { } while (0) +#define page_set_volatile(_page) do { } while (0) +#define page_set_stable_if_present(_page) (1) +#define page_discarded(_page) (0) +#define page_volatile(_page) (0) + +#define page_test_set_state_change(_page) (0) +#define page_clear_state_change(_page) do { } while (0) +#define page_state_change(_page) (0) + +#define page_free_discarded(_page) (0) + +#define page_make_stable(_page) (1) +#define page_make_volatile(_page, offset) do { } while (0) +#define pagevec_make_volatile(_pagevec) do { } while (0) +#define page_discard(_page) do { } while (0) + +#endif + +#endif /* _LINUX_PAGE_STATES_H */ Index: linux-2.6/include/linux/pagemap.h =================================================================== --- linux-2.6.orig/include/linux/pagemap.h +++ linux-2.6/include/linux/pagemap.h @@ -13,6 +13,7 @@ #include #include #include /* for in_interrupt() */ +#include /* * Bits in mapping->flags. 
The lower __GFP_BITS_SHIFT bits are the page @@ -89,7 +90,11 @@ static inline void mapping_set_gfp_mask( #define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK) #define page_cache_get(page) get_page(page) +#ifdef CONFIG_PAGE_STATES +#define page_cache_release(page) put_page_check(page) +#else #define page_cache_release(page) put_page(page) +#endif void release_pages(struct page **pages, int nr, int cold); /* @@ -436,6 +441,7 @@ int add_to_page_cache_lru(struct page *p pgoff_t index, gfp_t gfp_mask); extern void remove_from_page_cache(struct page *page); extern void __remove_from_page_cache(struct page *page); +extern void __remove_from_page_cache_nocheck(struct page *page); /* * Like add_to_page_cache_locked, but used to add newly allocated pages: Index: linux-2.6/mm/Makefile =================================================================== --- linux-2.6.orig/mm/Makefile +++ linux-2.6/mm/Makefile @@ -33,3 +33,4 @@ obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o +obj-$(CONFIG_PAGE_STATES) += page-states.o Index: linux-2.6/mm/filemap.c =================================================================== --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -34,6 +34,7 @@ #include /* for BUG_ON(!in_atomic()) only */ #include #include /* for page_is_file_cache() */ +#include #include "internal.h" /* @@ -112,7 +113,7 @@ * sure the page is locked and that nobody else uses it - or that usage * is safe. The caller must hold the mapping's tree_lock. 
*/ -void __remove_from_page_cache(struct page *page) +void inline __remove_from_page_cache_nocheck(struct page *page) { struct address_space *mapping = page->mapping; @@ -136,6 +137,28 @@ void __remove_from_page_cache(struct pag } } +void __remove_from_page_cache(struct page *page) +{ + /* + * Check if the discard fault handler already removed + * the page from the page cache. If not set the discard + * bit in the page flags to prevent double page free if + * a discard fault is racing with normal page free. + */ + if (TestSetPageDiscarded(page)) + return; + + __remove_from_page_cache_nocheck(page); + + /* + * Check the hardware page state and clear the discard + * bit in the page flags only if the page is not + * discarded. + */ + if (!page_discarded(page)) + ClearPageDiscarded(page); +} + void remove_from_page_cache(struct page *page) { struct address_space *mapping = page->mapping; @@ -581,6 +604,7 @@ void unlock_page(struct page *page) VM_BUG_ON(!PageLocked(page)); clear_bit_unlock(PG_locked, &page->flags); smp_mb__after_clear_bit(); + page_make_volatile(page, 1); wake_up_page(page, PG_locked); } EXPORT_SYMBOL(unlock_page); @@ -679,6 +703,15 @@ repeat: } rcu_read_unlock(); + if (page && unlikely(!page_make_stable(page))) { + /* + * The page has been discarded by the host. Run the + * discard handler and return NULL. + */ + page_discard(page); + page = NULL; + } + return page; } EXPORT_SYMBOL(find_get_page); @@ -783,6 +816,7 @@ unsigned find_get_pages(struct address_s unsigned int ret; unsigned int nr_found; +from_scratch: rcu_read_lock(); restart: nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree, @@ -812,6 +846,19 @@ repeat: pages[ret] = page; ret++; + + if (likely(page_make_stable(page))) + continue; + /* + * Make stable failed, we discard the page and retry the + * whole operation. 
+ */ + ret--; + rcu_read_unlock(); + page_discard(page); + while (ret--) + page_cache_release(pages[ret]); + goto from_scratch; } rcu_read_unlock(); return ret; @@ -836,6 +883,7 @@ unsigned find_get_pages_contig(struct ad unsigned int ret; unsigned int nr_found; +from_scratch: rcu_read_lock(); restart: nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree, @@ -869,6 +917,19 @@ repeat: pages[ret] = page; ret++; index++; + + if (likely(page_make_stable(page))) + continue; + /* + * Make stable failed, we discard the page and retry the + * whole operation. + */ + ret--; + rcu_read_unlock(); + page_discard(page); + while (ret--) + page_cache_release(pages[ret]); + goto from_scratch; } rcu_read_unlock(); return ret; @@ -893,6 +954,7 @@ unsigned find_get_pages_tag(struct addre unsigned int ret; unsigned int nr_found; +from_scratch: rcu_read_lock(); restart: nr_found = radix_tree_gang_lookup_tag_slot(&mapping->page_tree, @@ -922,6 +984,19 @@ repeat: pages[ret] = page; ret++; + + if (likely(page_make_stable(page))) + continue; + /* + * Make stable failed, we discard the page and retry the + * whole operation. 
+ */ + ret--; + rcu_read_unlock(); + page_discard(page); + while (ret--) + page_cache_release(pages[ret]); + goto from_scratch; } rcu_read_unlock(); Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -55,6 +55,7 @@ #include #include #include +#include #include #include @@ -594,6 +595,8 @@ copy_one_pte(struct mm_struct *dst_mm, s page = vm_normal_page(vma, addr, pte); if (page) { + if (unlikely(PageDiscarded(page))) + goto out_discard_pte; get_page(page); page_dup_rmap(page, vma, addr); rss[!!PageAnon(page)]++; @@ -601,6 +604,21 @@ copy_one_pte(struct mm_struct *dst_mm, s out_set_pte: set_pte_at(dst_mm, addr, dst_pte, pte); + return; + +out_discard_pte: + /* + * If the page referred by the pte has the PG_discarded bit set, + * copy_one_pte is racing with page_discard. The pte may not be + * copied or we can end up with a pte pointing to a page not + * in the page cache anymore. Do what try_to_unmap_one would do + * if the copy_one_pte had taken place before page_discard. + */ + if (page->index != linear_page_index(vma, addr)) + /* If nonlinear, store the file page offset in the pte. */ + set_pte_at(dst_mm, addr, dst_pte, pgoff_to_pte(page->index)); + else + pte_clear(dst_mm, addr, dst_pte); } static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, @@ -1147,6 +1165,19 @@ struct page *follow_page(struct vm_area_ if (flags & FOLL_GET) get_page(page); + + if (flags & FOLL_GET) { + /* + * The page is made stable if a reference is acquired. + * If the caller does not get a reference it implies that + * the caller can deal with page faults in case the page + * is swapped out. In this case the caller can deal with + * discard faults as well. 
+ */ + if (unlikely(!page_make_stable(page))) + goto out_discard; + } + if (flags & FOLL_TOUCH) { if ((flags & FOLL_WRITE) && !pte_dirty(pte) && !PageDirty(page)) @@ -1179,6 +1210,11 @@ no_page_table: BUG_ON(flags & FOLL_WRITE); } return page; + +out_discard: + pte_unmap_unlock(ptep, ptl); + page_discard(page); + return NULL; } /* Can we do the FOLL_ANON optimization? */ @@ -1969,6 +2005,11 @@ static int do_wp_page(struct mm_struct * dirty_page = old_page; get_page(dirty_page); reuse = 1; + /* + * dirty_page will be set dirty, so it needs to be stable. + */ + if (unlikely(!page_make_stable(dirty_page))) + goto discard; } if (reuse) { @@ -1986,6 +2027,12 @@ reuse: * Ok, we need to copy. Oh, well.. */ page_cache_get(old_page); + /* + * To copy the content of old_page it needs to be stable. + * page_cache_release on old_page will make it volatile again. + */ + if (unlikely(!page_make_stable(old_page))) + goto discard; gotten: pte_unmap_unlock(page_table, ptl); @@ -2100,6 +2147,10 @@ oom: unwritable_page: page_cache_release(old_page); return VM_FAULT_SIGBUS; +discard: + pte_unmap_unlock(page_table, ptl); + page_discard(old_page); + return VM_FAULT_MINOR; } /* @@ -2591,7 +2642,7 @@ static int __do_fault(struct mm_struct * vmf.pgoff = pgoff; vmf.flags = flags; vmf.page = NULL; - +retry: ret = vma->vm_ops->fault(vma, &vmf); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) return ret; @@ -2616,6 +2667,11 @@ static int __do_fault(struct mm_struct * ret = VM_FAULT_OOM; goto out; } + if (unlikely(!page_make_stable(vmf.page))) { + unlock_page(vmf.page); + page_discard(vmf.page); + goto retry; + } page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); if (!page) { Index: linux-2.6/mm/page-states.c =================================================================== --- /dev/null +++ linux-2.6/mm/page-states.c @@ -0,0 +1,198 @@ +/* + * mm/page-states.c + * + * (C) Copyright IBM Corp. 2005, 2007 + * + * Guest page hinting functions. 
+ * + * Authors: Martin Schwidefsky + * Hubertus Franke + * Himanshu Raj + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "internal.h" + +/* + * Check if there is anything in the page flags or the mapping + * that prevents the page from changing its state to volatile. + */ +static inline int check_bits(struct page *page) +{ + /* + * There are several conditions that prevent a page from becoming + * volatile. The first check is for the page bits. + */ + if (PageDirty(page) || PageReserved(page) || PageWriteback(page) || + PageLocked(page) || PagePrivate(page) || PageDiscarded(page) || + !PageUptodate(page) || !PageLRU(page) || PageAnon(page)) + return 0; + + /* + * If the page has been truncated there is no point in making + * it volatile. It will be freed soon. And if the mapping ever + * had locked pages all pages of the mapping will stay stable. + */ + return page_mapping(page) != NULL; +} + +/* + * Check the reference counter of the page against the number of + * mappings. The caller passes an offset, that is the number of + * extra, known references. The page cache itself is one extra + * reference. If the caller acquired an additional reference then + * the offset would be 2. If the page map counter is equal to the + * page count minus the offset then there is no other, unknown + * user of the page in the system. + */ +static inline int check_counts(struct page *page, unsigned int offset) +{ + return page_mapcount(page) + offset == page_count(page); +} + +/* + * Attempts to change the state of a page to volatile. + * If there is something preventing the state change the page stays + * in its current state. 
+ */ +void __page_make_volatile(struct page *page, int offset) +{ + preempt_disable(); + if (!page_test_set_state_change(page)) { + if (check_bits(page) && check_counts(page, offset)) + page_set_volatile(page); + page_clear_state_change(page); + } + preempt_enable(); +} +EXPORT_SYMBOL(__page_make_volatile); + +/* + * Attempts to change the state of a vector of pages to volatile. + * If there is something preventing the state change the page stays + * in its current state. + */ +void __pagevec_make_volatile(struct pagevec *pvec) +{ + struct page *page; + int i = pagevec_count(pvec); + + while (--i >= 0) { + /* + * If we can't get the state change bit just give up. + * The worst that can happen is that the page will stay + * in the stable state although it might be volatile. + */ + page = pvec->pages[i]; + if (!page_test_set_state_change(page)) { + if (check_bits(page) && check_counts(page, 1)) + page_set_volatile(page); + page_clear_state_change(page); + } + } +} +EXPORT_SYMBOL(__pagevec_make_volatile); + +/* + * Attempts to change the state of a page to stable. The host could + * have removed a volatile page, so the page_set_stable_if_present + * call can fail. + * + * returns nonzero on success and 0 if the host has discarded the page + */ +int __page_make_stable(struct page *page) +{ + /* + * Postpone state change to stable until the state change bit is + * cleared. As long as the state change bit is set another cpu + * is in page_make_volatile for this page. That makes sure that + * no caller of make_stable "overtakes" a make_volatile leaving + * the page in volatile where stable is required. + * The caller of make_stable needs to make sure that no caller + * of make_volatile can make the page volatile right after + * make_stable has finished. 
+ */ + while (page_state_change(page)) + cpu_relax(); + return page_set_stable_if_present(page); +} +EXPORT_SYMBOL(__page_make_stable); + +/** + * __page_discard() - remove a discarded page from the cache + * + * @page: the page + * + * The page passed to this function needs to be locked. + */ +static void __page_discard(struct page *page) +{ + struct address_space *mapping; + struct zone *zone; + + /* Paranoia checks. */ + VM_BUG_ON(PageWriteback(page)); + VM_BUG_ON(PageDirty(page)); + VM_BUG_ON(PagePrivate(page)); + + /* Set the discarded bit early. */ + if (TestSetPageDiscarded(page)) + return; + + /* Unmap the page from all page tables. */ + page_unmap_all(page); + + /* Check if really all mappers of this page are gone. */ + VM_BUG_ON(page_mapcount(page) != 0); + + /* + * Remove the page from LRU if it is currently added. + * The users of isolate_lru_pages need to check the + * discarded bit before readding the page to the LRU. + */ + zone = page_zone(page); + spin_lock_irq(&zone->lru_lock); + if (PageLRU(page)) { + /* Unlink page from lru. */ + __ClearPageLRU(page); + del_page_from_lru(zone, page); + } + spin_unlock_irq(&zone->lru_lock); + + /* We can't handle swap cache pages (yet). */ + VM_BUG_ON(PageSwapCache(page)); + + /* Remove page from page cache. */ + mapping = page->mapping; + spin_lock_irq(&mapping->tree_lock); + __remove_from_page_cache_nocheck(page); + spin_unlock_irq(&mapping->tree_lock); + __put_page(page); +} + +/** + * page_discard() - remove a discarded page from the cache + * + * @page: the page + * + * Before calling this function an additional page reference needs to + * be acquired. This reference is released by the function. 
+ */ +void page_discard(struct page *page) +{ + lock_page(page); + __page_discard(page); + unlock_page(page); + page_cache_release(page); +} +EXPORT_SYMBOL(page_discard); Index: linux-2.6/mm/page-writeback.c =================================================================== --- linux-2.6.orig/mm/page-writeback.c +++ linux-2.6/mm/page-writeback.c @@ -34,6 +34,7 @@ #include #include #include +#include /* * The maximum number of pages to writeout in a single bdflush/kupdate @@ -1390,6 +1391,7 @@ int test_clear_page_writeback(struct pag radix_tree_tag_clear(&mapping->page_tree, page_index(page), PAGECACHE_TAG_WRITEBACK); + page_make_volatile(page, 1); if (bdi_cap_account_writeback(bdi)) { __dec_bdi_stat(bdi, BDI_WRITEBACK); __bdi_writeout_inc(bdi); Index: linux-2.6/mm/page_alloc.c =================================================================== --- linux-2.6.orig/mm/page_alloc.c +++ linux-2.6/mm/page_alloc.c @@ -46,6 +46,7 @@ #include #include #include +#include #include #include @@ -553,6 +554,7 @@ static void __free_pages_ok(struct page bad += free_pages_check(page + i); if (bad) return; + page_set_unused(page, order); if (!PageHighMem(page)) { debug_check_no_locks_freed(page_address(page),PAGE_SIZE<mapping = NULL; if (free_pages_check(page)) return; + page_set_unused(page, 0); if (!PageHighMem(page)) { debug_check_no_locks_freed(page_address(page), PAGE_SIZE); @@ -1119,6 +1127,7 @@ again: put_cpu(); VM_BUG_ON(bad_range(zone, page)); + page_set_stable(page, order); if (prep_new_page(page, order, gfp_flags)) goto again; return page; @@ -1714,6 +1723,8 @@ void __pagevec_free(struct pagevec *pvec void __free_pages(struct page *page, unsigned int order) { + if (page_count(page) > 1) + page_make_volatile(page, 2); if (put_page_testzero(page)) { if (order == 0) free_hot_page(page); Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -50,6 +50,7 @@ #include #include 
#include +#include #include @@ -286,13 +287,25 @@ pte_t *page_check_address(struct page *p return NULL; pte = pte_offset_map(pmd, address); + ptl = pte_lockptr(mm, pmd); /* Make a quick check before getting the lock */ +#ifndef CONFIG_PAGE_STATES + /* + * If the page table lock for this pte is taken we have to + * assume that someone might be mapping the page. To solve + * the race of a page discard vs. mapping the page we have + * to serialize the two operations by taking the lock, + * otherwise we end up with a pte for a page that has been + * removed from page cache by the discard fault handler. + * So for CONFIG_PAGE_STATES=yes the !pte_present optimization + * need to be deactivated. + */ if (!sync && !pte_present(*pte)) { pte_unmap(pte); return NULL; } +#endif - ptl = pte_lockptr(mm, pmd); spin_lock(ptl); if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) { *ptlp = ptl; @@ -690,6 +703,7 @@ void page_add_file_rmap(struct page *pag { if (atomic_inc_and_test(&page->_mapcount)) __inc_zone_page_state(page, NR_FILE_MAPPED); + page_make_volatile(page, 1); } #ifdef CONFIG_DEBUG_VM @@ -755,19 +769,14 @@ void page_remove_rmap(struct page *page) * repeatedly from either try_to_unmap_anon or try_to_unmap_file. */ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, - int migration) + unsigned long address, int migration) { struct mm_struct *mm = vma->vm_mm; - unsigned long address; pte_t *pte; pte_t pteval; spinlock_t *ptl; int ret = SWAP_AGAIN; - address = vma_address(page, vma); - if (address == -EFAULT) - goto out; - pte = page_check_address(page, mm, address, &ptl, 0); if (!pte) goto out; @@ -831,9 +840,14 @@ static int try_to_unmap_one(struct page swp_entry_t entry; entry = make_migration_entry(page, pte_write(pteval)); set_pte_at(mm, address, pte, swp_entry_to_pte(entry)); - } else + } else { +#ifdef CONFIG_PAGE_STATES + /* If nonlinear, store the file page offset in the pte. 
*/ + if (page->index != linear_page_index(vma, address)) + set_pte_at(mm, address, pte, pgoff_to_pte(page->index)); +#endif dec_mm_counter(mm, file_rss); - + } page_remove_rmap(page); page_cache_release(page); @@ -1000,6 +1014,7 @@ static int try_to_unmap_anon(struct page struct anon_vma *anon_vma; struct vm_area_struct *vma; unsigned int mlocked = 0; + unsigned long address; int ret = SWAP_AGAIN; if (MLOCK_PAGES && unlikely(unlock)) @@ -1016,7 +1031,10 @@ static int try_to_unmap_anon(struct page continue; /* must visit all unlocked vmas */ ret = SWAP_MLOCK; /* saw at least one mlocked vma */ } else { - ret = try_to_unmap_one(page, vma, migration); + address = vma_address(page, vma); + if (address == -EFAULT) + continue; + ret = try_to_unmap_one(page, vma, address, migration); if (ret == SWAP_FAIL || !page_mapped(page)) break; } @@ -1060,6 +1078,7 @@ static int try_to_unmap_file(struct page struct vm_area_struct *vma; struct prio_tree_iter iter; int ret = SWAP_AGAIN; + unsigned long address; unsigned long cursor; unsigned long max_nl_cursor = 0; unsigned long max_nl_size = 0; @@ -1077,7 +1096,10 @@ static int try_to_unmap_file(struct page continue; /* must visit all vmas */ ret = SWAP_MLOCK; } else { - ret = try_to_unmap_one(page, vma, migration); + address = vma_address(page, vma); + if (address == -EFAULT) + continue; + ret = try_to_unmap_one(page, vma, address, migration); if (ret == SWAP_FAIL || !page_mapped(page)) goto out; } @@ -1227,3 +1249,55 @@ int try_to_munlock(struct page *page) return try_to_unmap_file(page, 1, 0); } #endif + +#ifdef CONFIG_PAGE_STATES + +/** + * page_unmap_all - removes all mappings of a page + * + * @page: the page which mapping in the vma should be struck down + * + * the caller needs to hold page lock + */ +void page_unmap_all(struct page* page) +{ + struct address_space *mapping = page_mapping(page); + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); + struct vm_area_struct *vma; + struct prio_tree_iter iter; + 
unsigned long address; + int rc; + + VM_BUG_ON(!PageLocked(page) || PageReserved(page) || PageAnon(page)); + + spin_lock(&mapping->i_mmap_lock); + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { + address = vma_address(page, vma); + if (address == -EFAULT) + continue; + rc = try_to_unmap_one(page, vma, address, 0); + VM_BUG_ON(rc == SWAP_FAIL); + } + + if (list_empty(&mapping->i_mmap_nonlinear)) + goto out; + + /* + * Remove the non-linear mappings of the page. This is + * awfully slow, but we have to find that discarded page.. + */ + list_for_each_entry(vma, &mapping->i_mmap_nonlinear, + shared.vm_set.list) { + address = vma->vm_start; + while (address < vma->vm_end) { + rc = try_to_unmap_one(page, vma, address, 0); + VM_BUG_ON(rc == SWAP_FAIL); + address += PAGE_SIZE; + } + } + +out: + spin_unlock(&mapping->i_mmap_lock); +} + +#endif Index: linux-2.6/mm/swap.c =================================================================== --- linux-2.6.orig/mm/swap.c +++ linux-2.6/mm/swap.c @@ -30,6 +30,7 @@ #include #include #include +#include #include "internal.h" @@ -78,6 +79,16 @@ void put_page(struct page *page) } EXPORT_SYMBOL(put_page); +#ifdef CONFIG_PAGE_STATES +void put_page_check(struct page *page) +{ + if (page_count(page) > 1) + page_make_volatile(page, 2); + put_page(page); +} +EXPORT_SYMBOL(put_page_check); +#endif + /** * put_pages_list() - release a list of pages * @pages: list of pages threaded on page->lru @@ -421,14 +432,22 @@ void ____pagevec_lru_add(struct pagevec } VM_BUG_ON(PageActive(page)); VM_BUG_ON(PageUnevictable(page)); - VM_BUG_ON(PageLRU(page)); - SetPageLRU(page); - active = is_active_lru(lru); - file = is_file_lru(lru); - if (active) - SetPageActive(page); - update_page_reclaim_stat(zone, page, file, active); - add_page_to_lru_list(zone, page, lru); + /* + * Only add the page to lru list if it has not + * been discarded. 
+ */ + if (likely(!PageDiscarded(page))) { + VM_BUG_ON(PageLRU(page)); + SetPageLRU(page); + active = is_active_lru(lru); + file = is_file_lru(lru); + if (active) + SetPageActive(page); + update_page_reclaim_stat(zone, page, file, active); + add_page_to_lru_list(zone, page, lru); + page_make_volatile(page, 2); + } else + ClearPageActive(page); } if (zone) spin_unlock_irq(&zone->lru_lock); Index: linux-2.6/mm/vmscan.c =================================================================== --- linux-2.6.orig/mm/vmscan.c +++ linux-2.6/mm/vmscan.c @@ -40,6 +40,7 @@ #include #include #include +#include #include #include @@ -606,6 +607,9 @@ static unsigned long shrink_page_list(st if (unlikely(!page_evictable(page, NULL))) goto cull_mlocked; + if (unlikely(PageDiscarded(page))) + goto free_it; + if (!sc->may_swap && page_mapped(page)) goto keep_locked; @@ -1152,13 +1156,20 @@ static unsigned long shrink_inactive_lis spin_lock_irq(&zone->lru_lock); continue; } - SetPageLRU(page); - lru = page_lru(page); - add_page_to_lru_list(zone, page, lru); - if (PageActive(page)) { - int file = !!page_is_file_cache(page); - reclaim_stat->recent_rotated[file]++; - } + /* + * Only readd the page to lru list if it has not + * been discarded. 
+ */ + if (likely(!PageDiscarded(page))) { + SetPageLRU(page); + lru = page_lru(page); + add_page_to_lru_list(zone, page, lru); + if (PageActive(page)) { + int file = !!page_is_file_cache(page); + reclaim_stat->recent_rotated[file]++; + } + } else + ClearPageActive(page); if (!pagevec_add(&pvec, page)) { spin_unlock_irq(&zone->lru_lock); __pagevec_release(&pvec); @@ -1278,13 +1289,22 @@ static void shrink_active_list(unsigned page = lru_to_page(&l_inactive); prefetchw_prev_lru_page(page, &l_inactive, flags); VM_BUG_ON(PageLRU(page)); - SetPageLRU(page); - VM_BUG_ON(!PageActive(page)); - ClearPageActive(page); - - list_move(&page->lru, &zone->lru[lru].list); - mem_cgroup_add_lru_list(page, lru); - pgmoved++; + /* + * Only readd the page to lru list if it has not + * been discarded. + */ + if (likely(!PageDiscarded(page))) { + SetPageLRU(page); + VM_BUG_ON(!PageActive(page)); + ClearPageActive(page); + list_move(&page->lru, &zone->lru[lru].list); + mem_cgroup_add_lru_list(page, lru); + pgmoved++; + } else { + ClearPageActive(page); + list_del(&page->lru); + } + if (!pagevec_add(&pvec, page)) { __mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved); spin_unlock_irq(&zone->lru_lock); @@ -1292,6 +1312,7 @@ static void shrink_active_list(unsigned pgmoved = 0; if (buffer_heads_over_limit) pagevec_strip(&pvec); + pagevec_make_volatile(&pvec); __pagevec_release(&pvec); spin_lock_irq(&zone->lru_lock); } @@ -1301,6 +1322,7 @@ static void shrink_active_list(unsigned if (buffer_heads_over_limit) { spin_unlock_irq(&zone->lru_lock); pagevec_strip(&pvec); + pagevec_make_volatile(&pvec); spin_lock_irq(&zone->lru_lock); } __count_zone_vm_events(PGREFILL, zone, pgscanned); -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Martin Schwidefsky
Subject: [patch 4/6] Guest page hinting: writable page table entries.
Date: Fri, 27 Mar 2009 16:09:09 +0100
Message-ID: <20090327151012.398894143@de.ibm.com>
References: <20090327150905.819861420@de.ibm.com>
Return-path:
Content-Disposition: inline; filename=004-hva-prot.diff
Sender: owner-linux-mm@kvack.org
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org
Cc: frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com, Martin Schwidefsky
List-Id: virtualization@lists.linuxfoundation.org

From: Martin Schwidefsky
From: Hubertus Franke
From: Himanshu Raj

The volatile state for page cache and swap cache pages requires that the host system be able to determine whether a volatile page is dirty before removing it. This excludes almost all platforms from using the scheme. What is needed is a way to distinguish between pages that are purely read-only and pages that might get written to. This allows platforms with per-pte dirty bits to use the scheme and gives platforms with per-page dirty bits a small optimization.

Whenever a writable pte is created, a check is added that moves the page into the correct state. This needs to be done before the writable pte is established. To avoid unnecessary state transitions and the need for a counter, a new page flag PG_writable is added. Only the creation of the first writable pte will do a page state change. Even if all the writable ptes pointing to a page are removed again, the page stays in the safe state until all read-only users of the page have unmapped it as well. Only then is the PG_writable bit reset.

The state a page needs to have if a writable pte is present depends on the platform.
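
The PG_writable rule described above can be illustrated with a small userspace sketch. It is not the code from this patch: the names sketch_page, sketch_map and sketch_unmap are hypothetical, the page state is reduced to two values, and locking is ignored. It only shows that the first writable mapping triggers one state change and that the flag is reset once the last mapper is gone.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical two-value stand-in for the real page states. */
enum sketch_state { SKETCH_VOLATILE, SKETCH_SAFE };

/* Hypothetical model of one page; not the kernel's struct page. */
struct sketch_page {
	bool writable;           /* stands in for PG_writable */
	int mapcount;            /* number of ptes mapping the page */
	int transitions;         /* state changes issued so far */
	enum sketch_state state; /* zero-initialised to SKETCH_VOLATILE */
};

/* Called before a pte for the page is established. */
static inline void sketch_map(struct sketch_page *p, bool pte_writable)
{
	/* Only the first writable pte causes a state change. */
	if (pte_writable && !p->writable) {
		p->writable = true;
		p->state = SKETCH_SAFE;
		p->transitions++;
	}
	p->mapcount++;
}

/* Called when a pte for the page is removed. */
static inline void sketch_unmap(struct sketch_page *p)
{
	assert(p->mapcount > 0);
	p->mapcount--;
	/* The flag is reset only after every mapper is gone. */
	if (p->mapcount == 0 && p->writable) {
		p->writable = false;
		p->state = SKETCH_VOLATILE;
		p->transitions++;
	}
}
```

In this model, mapping a page read-only, then writable twice, and then unmapping all three ptes issues exactly two state changes, so no counter of writable ptes is needed.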
A platform with per-pte dirty bits wants to move the page into stable state, a platform with per-page dirty bits like s390 can decide to move the page into a special state that requires the host system to check the dirty bit before discarding a page. Signed-off-by: Martin Schwidefsky --- include/linux/page-flags.h | 8 ++++++ include/linux/page-states.h | 27 +++++++++++++++++++- mm/memory.c | 5 +++ mm/mprotect.c | 2 + mm/page-states.c | 58 ++++++++++++++++++++++++++++++++++++++++++-- mm/rmap.c | 1 6 files changed, 98 insertions(+), 3 deletions(-) Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h +++ linux-2.6/include/linux/page-flags.h @@ -103,6 +103,7 @@ enum pageflags { #endif #ifdef CONFIG_PAGE_STATES PG_discarded, /* Page discarded by the hypervisor. */ + PG_writable, /* Page is mapped writable. */ #endif __NR_PAGEFLAGS, @@ -186,10 +187,17 @@ static inline int TestClearPage##uname(s #define ClearPageDiscarded(page) clear_bit(PG_discarded, &(page)->flags) #define TestSetPageDiscarded(page) \ test_and_set_bit(PG_discarded, &(page)->flags) +#define PageWritable(page) test_bit(PG_writable, &(page)->flags) +#define ClearPageWritable(page) clear_bit(PG_writable, &(page)->flags) +#define TestSetPageWritable(page) \ + test_and_set_bit(PG_writable, &(page)->flags) #else #define PageDiscarded(page) 0 #define ClearPageDiscarded(page) do { } while (0) #define TestSetPageDiscarded(page) 0 +#define PageWritable(page) 0 +#define ClearPageWritable(page) do { } while (0) +#define TestSetPageWritable(page) 0 #endif struct page; /* forward declaration */ Index: linux-2.6/include/linux/page-states.h =================================================================== --- linux-2.6.orig/include/linux/page-states.h +++ linux-2.6/include/linux/page-states.h @@ -57,6 +57,9 @@ extern void page_discard(struct page *pa extern int __page_make_stable(struct page *page); extern void 
__page_make_volatile(struct page *page, int offset); extern void __pagevec_make_volatile(struct pagevec *pvec); +extern void __page_check_writable(struct page *page, pte_t pte, + unsigned int offset); +extern void __page_reset_writable(struct page *page); /* * Extended guest page hinting functions defined by using the @@ -78,6 +81,12 @@ extern void __pagevec_make_volatile(stru * from the LRU list and the radix tree of its mapping. * page_discard uses page_unmap_all to remove all page table * entries for a page. + * - page_check_writable: + * Checks if the page states needs to be adapted because a new + * writable page table entry refering to the page is established. + * - page_reset_writable: + * Resets the page state after the last writable page table entry + * refering to the page has been removed. */ static inline int page_make_stable(struct page *page) @@ -97,12 +106,26 @@ static inline void pagevec_make_volatile __pagevec_make_volatile(pvec); } +static inline void page_check_writable(struct page *page, pte_t pte, + unsigned int offset) +{ + if (page_host_discards() && pte_write(pte) && + !test_bit(PG_writable, &page->flags)) + __page_check_writable(page, pte, offset); +} + +static inline void page_reset_writable(struct page *page) +{ + if (page_host_discards() && test_bit(PG_writable, &page->flags)) + __page_reset_writable(page); +} + #else #define page_host_discards() (0) #define page_set_unused(_page,_order) do { } while (0) #define page_set_stable(_page,_order) do { } while (0) -#define page_set_volatile(_page) do { } while (0) +#define page_set_volatile(_page,_writable) do { } while (0) #define page_set_stable_if_present(_page) (1) #define page_discarded(_page) (0) #define page_volatile(_page) (0) @@ -117,6 +140,8 @@ static inline void pagevec_make_volatile #define page_make_volatile(_page, offset) do { } while (0) #define pagevec_make_volatile(_pagevec) do { } while (0) #define page_discard(_page) do { } while (0) +#define 
page_check_writable(_page,_pte,_off) do { } while (0) +#define page_reset_writable(_page) do { } while (0) #endif Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -2029,6 +2029,7 @@ reuse: flush_cache_page(vma, address, pte_pfn(orig_pte)); entry = pte_mkyoung(orig_pte); entry = maybe_mkwrite(pte_mkdirty(entry), vma); + page_check_writable(old_page, entry, 1); if (ptep_set_access_flags(vma, address, page_table, entry,1)) update_mmu_cache(vma, address, entry); ret |= VM_FAULT_WRITE; @@ -2084,6 +2085,7 @@ gotten: flush_cache_page(vma, address, pte_pfn(orig_pte)); entry = mk_pte(new_page, vma->vm_page_prot); entry = maybe_mkwrite(pte_mkdirty(entry), vma); + page_check_writable(new_page, entry, 2); /* * Clear the pte entry and flush it first, before updating the * pte with the new entry. This will avoid a race condition @@ -2540,6 +2542,7 @@ static int do_swap_page(struct mm_struct write_access = 0; } flush_icache_page(vma, page); + page_check_writable(page, pte, 2); set_pte_at(mm, address, page_table, pte); page_add_anon_rmap(page, vma, address); /* It's better to call commit-charge after rmap is established */ @@ -2599,6 +2602,7 @@ static int do_anonymous_page(struct mm_s entry = mk_pte(page, vma->vm_page_prot); entry = maybe_mkwrite(pte_mkdirty(entry), vma); + page_check_writable(page, entry, 2); page_table = pte_offset_map_lock(mm, pmd, address, &ptl); if (!pte_none(*page_table)) @@ -2754,6 +2758,7 @@ retry: entry = mk_pte(page, vma->vm_page_prot); if (flags & FAULT_FLAG_WRITE) entry = maybe_mkwrite(pte_mkdirty(entry), vma); + page_check_writable(page, entry, 2); if (anon) { inc_mm_counter(mm, anon_rss); page_add_new_anon_rmap(page, vma, address); Index: linux-2.6/mm/mprotect.c =================================================================== --- linux-2.6.orig/mm/mprotect.c +++ linux-2.6/mm/mprotect.c @@ -23,6 +23,7 @@ #include #include #include +#include 
#include #include #include @@ -58,6 +59,7 @@ static void change_pte_range(struct mm_s */ if (dirty_accountable && pte_dirty(ptent)) ptent = pte_mkwrite(ptent); + page_check_writable(pte_page(ptent), ptent, 1); ptep_modify_prot_commit(mm, addr, pte, ptent); } else if (PAGE_MIGRATION && !pte_file(oldpte)) { Index: linux-2.6/mm/page-states.c =================================================================== --- linux-2.6.orig/mm/page-states.c +++ linux-2.6/mm/page-states.c @@ -83,7 +83,7 @@ void __page_make_volatile(struct page *p preempt_disable(); if (!page_test_set_state_change(page)) { if (check_bits(page) && check_counts(page, offset)) - page_set_volatile(page); + page_set_volatile(page, PageWritable(page)); page_clear_state_change(page); } preempt_enable(); @@ -109,7 +109,7 @@ void __pagevec_make_volatile(struct page page = pvec->pages[i]; if (!page_test_set_state_change(page)) { if (check_bits(page) && check_counts(page, 1)) - page_set_volatile(page); + page_set_volatile(page, PageWritable(page)); page_clear_state_change(page); } } @@ -142,6 +142,60 @@ int __page_make_stable(struct page *page EXPORT_SYMBOL(__page_make_stable); /** + * __page_check_writable() - check page state for new writable pte + * + * @page: the page the new writable pte refers to + * @pte: the new writable pte + */ +void __page_check_writable(struct page *page, pte_t pte, unsigned int offset) +{ + int count_ok = 0; + + preempt_disable(); + while (page_test_set_state_change(page)) + cpu_relax(); + + if (!TestSetPageWritable(page)) { + count_ok = check_counts(page, offset); + if (check_bits(page) && count_ok) + page_set_volatile(page, 1); + else + /* + * If two processes create a write mapping at the + * same time check_counts will return false or if + * the page is currently isolated from the LRU + * check_bits will return false but the page might + * be in volatile state. 
+ * We have to take care about the dirty bit so the + * only option left is to make the page stable but + * we can try to make it volatile a bit later. + */ + page_set_stable_if_present(page); + } + page_clear_state_change(page); + if (!count_ok) + page_make_volatile(page, 1); + preempt_enable(); +} +EXPORT_SYMBOL(__page_check_writable); + +/** + * __page_reset_writable() - clear the PageWritable bit + * + * @page: the page + */ +void __page_reset_writable(struct page *page) +{ + preempt_disable(); + if (!page_test_set_state_change(page)) { + ClearPageWritable(page); + page_clear_state_change(page); + } + preempt_enable(); +} +EXPORT_SYMBOL(__page_reset_writable); + +/** * __page_discard() - remove a discarded page from the cache * * @page: the page Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -752,6 +752,7 @@ void page_remove_rmap(struct page *page) mem_cgroup_uncharge_page(page); __dec_zone_page_state(page, PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED); + page_reset_writable(page); /* * It would be tidy to reset the PageAnon mapping here, * but that might overwrite a racing page_add_anon_rmap -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Rik van Riel Subject: Re: [patch 1/6] Guest page hinting: core + volatile page cache. 
Date: Fri, 27 Mar 2009 18:57:31 -0400 Message-ID: <49CD59DB.3070906@redhat.com> References: <20090327150905.819861420@de.ibm.com> <20090327151011.534224968@de.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20090327151011.534224968@de.ibm.com> Sender: owner-linux-mm@kvack.org To: Martin Schwidefsky Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com List-Id: virtualization@lists.linuxfoundation.org Martin Schwidefsky wrote: > The major obstacles that need to get addressed: > * Concurrent page state changes: > To guard against concurrent page state updates some kind of lock > is needed. If page_make_volatile() has already done the 11 checks it > will issue the state change primitive. If in the meantime one of > the conditions has changed the user that requires that page in > stable state will have to wait in the page_make_stable() function > until the make volatile operation has finished. It is up to the > architecture to define how this is done with the three primitives > page_test_set_state_change, page_clear_state_change and > page_state_change. > There are some alternatives how this can be done, e.g. a global > lock, or lock per segment in the kernel page table, or the per page > bit PG_arch_1 if it is still free. Can this be taken care of by memory barriers and careful ordering of operations? If we consider the states unused -> volatile -> stable as progressively higher, "upgrades" can be done before any kernel operation that requires the page to be in that state (but after setting up the things that allow it to be found), while downgrades can be done after the kernel is done with needing the page at a higher level. Since the downgrade checks for users that need the page in a higher state, no lock should be required. 
In fact, it may be possible to manage the page state bitmap with compare-and-swap, without needing a call to the hypervisor. > Signed-off-by: Martin Schwidefsky Some comments and questions in line. > @@ -601,6 +604,21 @@ copy_one_pte(struct mm_struct *dst_mm, s > > out_set_pte: > set_pte_at(dst_mm, addr, dst_pte, pte); > + return; > + > +out_discard_pte: > + /* > + * If the page referred by the pte has the PG_discarded bit set, > + * copy_one_pte is racing with page_discard. The pte may not be > + * copied or we can end up with a pte pointing to a page not > + * in the page cache anymore. Do what try_to_unmap_one would do > + * if the copy_one_pte had taken place before page_discard. > + */ > + if (page->index != linear_page_index(vma, addr)) > + /* If nonlinear, store the file page offset in the pte. */ > + set_pte_at(dst_mm, addr, dst_pte, pgoff_to_pte(page->index)); > + else > + pte_clear(dst_mm, addr, dst_pte); > } It would be good to document that PG_discarded can only happen for file pages and NOT for eg. clean swap cache pages. > @@ -1390,6 +1391,7 @@ int test_clear_page_writeback(struct pag > radix_tree_tag_clear(&mapping->page_tree, > page_index(page), > PAGECACHE_TAG_WRITEBACK); > + page_make_volatile(page, 1); > if (bdi_cap_account_writeback(bdi)) { > __dec_bdi_stat(bdi, BDI_WRITEBACK); > __bdi_writeout_inc(bdi); Does this mark the page volatile before the IO writing the dirty data back to disk has even started? Is that OK? -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dave Hansen Subject: Re: [patch 0/6] Guest page hinting version 7. 
Date: Fri, 27 Mar 2009 16:03:43 -0700 Message-ID: <1238195024.8286.562.camel@nimitz> References: <20090327150905.819861420@de.ibm.com> Mime-Version: 1.0 Content-Type: text/plain Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20090327150905.819861420@de.ibm.com> Sender: owner-linux-mm@kvack.org To: Martin Schwidefsky Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com List-Id: virtualization@lists.linuxfoundation.org On Fri, 2009-03-27 at 16:09 +0100, Martin Schwidefsky wrote: > If the host picks one of the > pages the guest can recreate, the host can throw it away instead of writing > it to the paging device. Simple and elegant. Heh, simple and elegant for the hypervisor. But I'm not sure I'm going to call *anything* that requires a new CPU instruction elegant. ;) I don't see any description of it in there any more, but I thought this entire patch set was to get rid of the idiotic triple I/Os in the following scenario: 1. Hypervisor picks a page and evicts it out to disk, pays the I/O cost to get it written out. (I/O #1) 2. Linux comes along (being a bit late to the party) and picks the same page, also decides it needs to be out to disk 3. Linux tries to write the page to disk, but touches it in the process, pulling the page back in from the store where the hypervisor wrote it. (I/O #2) 4. Linux writes the page to its swap device (I/O #3) I don't see that mentioned at all in the current description. Simplifying the hypervisor is hard to get behind, but cutting system I/O by 2/3 is a much nicer benefit for 1200 lines of invasive code. ;) Can we persuade the hypervisor to tell us which pages it decided to page out and just skip those when we're scanning the LRU? -- Dave -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Rik van Riel Subject: Re: [patch 0/6] Guest page hinting version 7. Date: Fri, 27 Mar 2009 20:06:03 -0400 Message-ID: <49CD69EB.6000000@redhat.com> References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <1238195024.8286.562.camel@nimitz> Sender: owner-linux-mm@kvack.org To: Dave Hansen Cc: Martin Schwidefsky , linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com List-Id: virtualization@lists.linuxfoundation.org Dave Hansen wrote: > On Fri, 2009-03-27 at 16:09 +0100, Martin Schwidefsky wrote: >> If the host picks one of the >> pages the guest can recreate, the host can throw it away instead of writing >> it to the paging device. Simple and elegant. > > Heh, simple and elegant for the hypervisor. But I'm not sure I'm going > to call *anything* that requires a new CPU instruction elegant. ;) I am convinced that it could be done with a guest-writable "bitmap", with 2 bits per page. That would make this scheme useful for KVM, too. > I don't see any description of it in there any more, but I thought this > entire patch set was to get rid of the idiotic triple I/Os in the > following scenario: > I don't see that mentioned at all in the current description. > Simplifying the hypervisor is hard to get behind, but cutting system I/O > by 2/3 is a much nicer benefit for 1200 lines of invasive code. ;) Cutting down on a fair bit of IO is absolutely worth 1200 lines of fairly well isolated code. > Can we persuade the hypervisor to tell us which pages it decided to page > out and just skip those when we're scanning the LRU? The easiest "notification" points are in the page fault handler and the page cache lookup code. -- All rights reversed. 
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Rusty Russell Subject: Re: [patch 0/6] Guest page hinting version 7. Date: Sat, 28 Mar 2009 17:05:28 +1030 Message-ID: <200903281705.29798.rusty@rustcorp.com.au> References: <20090327150905.819861420@de.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20090327150905.819861420@de.ibm.com> Content-Disposition: inline Sender: owner-linux-mm@kvack.org To: virtualization@lists.linux-foundation.org Cc: Martin Schwidefsky , linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, akpm@osdl.org, nickpiggin@yahoo.com.au, frankeh@watson.ibm.com, riel@redhat.com, hugh@veritas.com List-Id: virtualization@lists.linuxfoundation.org On Saturday 28 March 2009 01:39:05 Martin Schwidefsky wrote: > Greetings, > the circus is back in town -- another version of the guest page hinting > patches. The patches differ from version 6 only in the kernel version, > they apply against 2.6.29. My short sniff test showed that the code > is still working as expected. > > To recap (you can skip this if you read the boiler plate of the last > version of the patches): > The main benefit for guest page hinting vs. the ballooner is that there > is no need for a monitor that keeps track of the memory usage of all the > guests, a complex algorithm that calculates the working set sizes and for > the calls into the guest kernel to control the size of the balloons. I thought you weren't convinced of the concrete benefits over ballooning, or am I misremembering? Thanks, Rusty. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Martin Schwidefsky Subject: Re: [patch 1/6] Guest page hinting: core + volatile page cache. Date: Sun, 29 Mar 2009 15:56:40 +0200 Message-ID: <20090329155640.31472c61@skybase> References: <20090327150905.819861420@de.ibm.com> <20090327151011.534224968@de.ibm.com> <49CD59DB.3070906@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <49CD59DB.3070906@redhat.com> Sender: owner-linux-mm@kvack.org To: Rik van Riel Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com List-Id: virtualization@lists.linuxfoundation.org On Fri, 27 Mar 2009 18:57:31 -0400 Rik van Riel wrote: > Martin Schwidefsky wrote: > > > The major obstacles that need to get addressed: > > * Concurrent page state changes: > > To guard against concurrent page state updates some kind of lock > > is needed. If page_make_volatile() has already done the 11 checks it > > will issue the state change primitive. If in the meantime one of > > the conditions has changed the user that requires that page in > > stable state will have to wait in the page_make_stable() function > > until the make volatile operation has finished. It is up to the > > architecture to define how this is done with the three primitives > > page_test_set_state_change, page_clear_state_change and > > page_state_change. > > There are some alternatives how this can be done, e.g. a global > > lock, or lock per segment in the kernel page table, or the per page > > bit PG_arch_1 if it is still free. > > Can this be taken care of by memory barriers and > careful ordering of operations? 
I don't see how this could be done with memory barriers. The sequence is
 1) check conditions
 2) do state change to volatile

Another cpu can do
 i) change one of the conditions

The operation i) needs to be postponed while the first cpu has done 1)
but not done 2) yet. 1+2 needs to be atomic but consists of several
instructions. Ergo we need a lock, no?

> If we consider the states unused -> volatile -> stable
> as progressively higher, "upgrades" can be done before
> any kernel operation that requires the page to be in
> that state (but after setting up the things that allow
> it to be found), while downgrades can be done after the
> kernel is done with needing the page at a higher level.
>
> Since the downgrade checks for users that need the page
> in a higher state, no lock should be required.
>
> In fact, it may be possible to manage the page state
> bitmap with compare-and-swap, without needing a call
> to the hypervisor.
>
> > Signed-off-by: Martin Schwidefsky
>
> Some comments and questions in line.
>
> > @@ -601,6 +604,21 @@ copy_one_pte(struct mm_struct *dst_mm, s
> >
> > out_set_pte:
> > 	set_pte_at(dst_mm, addr, dst_pte, pte);
> > +	return;
> > +
> > +out_discard_pte:
> > +	/*
> > +	 * If the page referred by the pte has the PG_discarded bit set,
> > +	 * copy_one_pte is racing with page_discard. The pte may not be
> > +	 * copied or we can end up with a pte pointing to a page not
> > +	 * in the page cache anymore. Do what try_to_unmap_one would do
> > +	 * if the copy_one_pte had taken place before page_discard.
> > +	 */
> > +	if (page->index != linear_page_index(vma, addr))
> > +		/* If nonlinear, store the file page offset in the pte. */
> > +		set_pte_at(dst_mm, addr, dst_pte, pgoff_to_pte(page->index));
> > +	else
> > +		pte_clear(dst_mm, addr, dst_pte);
> > }
>
> It would be good to document that PG_discarded can only happen for
> file pages and NOT for eg. clean swap cache pages.

PG_discarded can happen for swap cache pages as well.
If a clean swap cache page gets removed and is subsequently accessed
again, the discard fault handler will set the bit (see __page_discard).
The code necessary for volatile swap cache is introduced with patch #2.
So I would rather not add a comment in patch #1 only to remove it again
with patch #2 ..

> > @@ -1390,6 +1391,7 @@ int test_clear_page_writeback(struct pag
> > 			radix_tree_tag_clear(&mapping->page_tree,
> > 						page_index(page),
> > 						PAGECACHE_TAG_WRITEBACK);
> > +			page_make_volatile(page, 1);
> > 			if (bdi_cap_account_writeback(bdi)) {
> > 				__dec_bdi_stat(bdi, BDI_WRITEBACK);
> > 				__bdi_writeout_inc(bdi);
>
> Does this mark the page volatile before the IO writing the
> dirty data back to disk has even started? Is that OK?

Hmm, it could be that the page_make_volatile is just superfluous here.
The logic here is that whenever one of the conditions that prevent a
page from becoming volatile is cleared, a try with page_make_volatile
is done. The condition in question here is PageWriteback(page). If we
can prove that one of the other conditions is still true, this
particular call is a waste of effort.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Martin Schwidefsky
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Sun, 29 Mar 2009 16:12:53 +0200
Message-ID: <20090329161253.3faffdeb@skybase>
References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <1238195024.8286.562.camel@nimitz>
Sender: linux-kernel-owner@vger.kernel.org
To: Dave Hansen
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com
List-Id: virtualization@lists.linuxfoundation.org

On Fri, 27 Mar 2009 16:03:43 -0700
Dave Hansen wrote:

> On Fri, 2009-03-27 at 16:09 +0100, Martin Schwidefsky wrote:
> > If the host picks one of the
> > pages the guest can recreate, the host can throw it away instead of writing
> > it to the paging device. Simple and elegant.
>
> Heh, simple and elegant for the hypervisor. But I'm not sure I'm going
> to call *anything* that requires a new CPU instruction elegant. ;)

Hey, it's cool if you can request an instruction to solve your problem :-)

> I don't see any description of it in there any more, but I thought this
> entire patch set was to get rid of the idiotic triple I/Os in the
> following scenario:
>
> 1. Hypervisor picks a page and evicts it out to disk, pays the I/O cost
>    to get it written out. (I/O #1)
> 2. Linux comes along (being a bit late to the party) and picks the same
>    page, also decides it needs to be out to disk
> 3. Linux tries to write the page to disk, but touches it in the
>    process, pulling the page back in from the store where the hypervisor
>    wrote it. (I/O #2)
> 4. Linux writes the page to its swap device (I/O #3)
>
> I don't see that mentioned at all in the current description.
> Simplifying the hypervisor is hard to get behind, but cutting system I/O
> by 2/3 is a much nicer benefit for 1200 lines of invasive code.
;)

You are right, for a newcomer to the party the advantages of this
approach are not really obvious. I should have copied some more text
from the boilerplate of the previous versions.

Yes, the guest page hinting code aims to reduce the host's swap I/O.
There are two scenarios: one is the above, the other is a simple
read-only file cache page.

Without hinting:
1. Hypervisor picks a page and evicts it; that is one write I/O.
2. Linux accesses the page and causes a host page fault. The host reads
   the page from its swap disk, one read I/O.
In total 2 I/O operations.

With hinting:
1. Hypervisor picks a page, finds it volatile and throws it away.
2. Linux accesses the page and gets a discard fault from the host. Linux
   reads the file page from its block device.
This is just one I/O operation.

> Can we persuade the hypervisor to tell us which pages it decided to page
> out and just skip those when we're scanning the LRU?

One principle of the whole approach is that the hypervisor does not
call into an otherwise idle guest. The cost of scheduling the virtual
cpu is just too high. So we would need a means to store the information
where the guest can pick it up when it happens to do LRU scanning. I
don't think that this will work out.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Martin Schwidefsky
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Sun, 29 Mar 2009 16:20:24 +0200
Message-ID: <20090329162024.687196ab@skybase>
References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <49CD69EB.6000000@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <49CD69EB.6000000@redhat.com>
Sender: owner-linux-mm@kvack.org
To: Rik van Riel
Cc: Dave Hansen, linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

On Fri, 27 Mar 2009 20:06:03 -0400
Rik van Riel wrote:

> Dave Hansen wrote:
> > On Fri, 2009-03-27 at 16:09 +0100, Martin Schwidefsky wrote:
> >> If the host picks one of the
> >> pages the guest can recreate, the host can throw it away instead of writing
> >> it to the paging device. Simple and elegant.
> >
> > Heh, simple and elegant for the hypervisor. But I'm not sure I'm going
> > to call *anything* that requires a new CPU instruction elegant. ;)
>
> I am convinced that it could be done with a guest-writable
> "bitmap", with 2 bits per page. That would make this scheme
> useful for KVM, too.

This was our initial approach before we came up with the milli-code
instruction. The reason we did not use a bitmap was to prevent the
guest from changing the host state (there are 4 guest states U/S/V/P
and 3 host states r/p/z). With the full set of states you'd need 4
bits. And the host needs to have a "master" copy of the host bits, one
the guest cannot change, otherwise you get into trouble.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Martin Schwidefsky
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Sun, 29 Mar 2009 16:23:36 +0200
Message-ID: <20090329162336.7c0700e9@skybase>
References: <20090327150905.819861420@de.ibm.com> <200903281705.29798.rusty@rustcorp.com.au>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <200903281705.29798.rusty@rustcorp.com.au>
Sender: owner-linux-mm@kvack.org
To: Rusty Russell
Cc: virtualization@lists.linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, akpm@osdl.org, nickpiggin@yahoo.com.au, frankeh@watson.ibm.com, riel@redhat.com, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

On Sat, 28 Mar 2009 17:05:28 +1030
Rusty Russell wrote:

> On Saturday 28 March 2009 01:39:05 Martin Schwidefsky wrote:
> > Greetings,
> > the circus is back in town -- another version of the guest page hinting
> > patches. The patches differ from version 6 only in the kernel version,
> > they apply against 2.6.29. My short sniff test showed that the code
> > is still working as expected.
> >
> > To recap (you can skip this if you read the boiler plate of the last
> > version of the patches):
> > The main benefit for guest page hinting vs. the ballooner is that there
> > is no need for a monitor that keeps track of the memory usage of all the
> > guests, a complex algorithm that calculates the working set sizes and for
> > the calls into the guest kernel to control the size of the balloons.
>
> I thought you weren't convinced of the concrete benefits over ballooning,
> or am I misremembering?

The performance tests I have seen so far show that the benefits of
ballooning and guest page hinting are about the same. I am still
convinced that guest page hinting is the way to go because you do
not need an external monitor. Calculating the working set size for a
guest is a challenge. With guest page hinting there is no need for a
working set size calculation.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life."
- Calvin. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Rik van Riel Subject: Re: [patch 1/6] Guest page hinting: core + volatile page cache. Date: Sun, 29 Mar 2009 10:35:31 -0400 Message-ID: <49CF8733.7060309@redhat.com> References: <20090327150905.819861420@de.ibm.com> <20090327151011.534224968@de.ibm.com> <49CD59DB.3070906@redhat.com> <20090329155640.31472c61@skybase> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20090329155640.31472c61@skybase> Sender: owner-linux-mm@kvack.org To: Martin Schwidefsky Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com List-Id: virtualization@lists.linuxfoundation.org Martin Schwidefsky wrote: > On Fri, 27 Mar 2009 18:57:31 -0400 > Rik van Riel wrote: >> Martin Schwidefsky wrote: >>> There are some alternatives how this can be done, e.g. a global >>> lock, or lock per segment in the kernel page table, or the per page >>> bit PG_arch_1 if it is still free. >> Can this be taken care of by memory barriers and >> careful ordering of operations? > > I don't see how this could be done with memory barries, the sequence is > 1) check conditions > 2) do state change to volatile > > another cpus can do > i) change one of the conditions > > The operation i) needs to be postponed while the first cpu has done 1) > but not done 2) yet. 1+2 needs to be atomic but consists of several > instructions. Ergo we need a lock, no ? You are right. Hashed locks may be a space saving option, with a set of (cache line aligned?) locks in each zone and the page state lock chosen by taking a hash of the page number or address. Not ideal, but at least we can get some NUMA locality. 
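
The hashed-lock idea above can be sketched in userspace C11 roughly as follows. This is only an illustration under stated assumptions: the table size, the 64-byte cache line, the multiplicative hash, and every name (`zone_locks`, `state_lock_for`, `toy_page`, `toy_make_volatile`, `toy_make_stable`) are hypothetical and do not come from the patch set, and a kernel version would use its own spinlock primitives rather than this toy spin loop.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

#define CACHE_LINE        64    /* assumed cache line size */
#define STATE_LOCK_SHIFT  6
#define STATE_LOCKS       (1u << STATE_LOCK_SHIFT)  /* hypothetical: 64 locks per zone */

/* One lock per cache line, so neighbouring locks do not false-share. */
struct state_lock {
	alignas(CACHE_LINE) atomic_int held;
};

/* Hypothetical stand-in for a per-zone lock table. */
struct zone_locks {
	struct state_lock lock[STATE_LOCKS];
};

/* Pick the lock for a page by hashing its page frame number. */
static struct state_lock *state_lock_for(struct zone_locks *z, uint64_t pfn)
{
	uint64_t h = pfn * UINT64_C(0x9E3779B97F4A7C15); /* multiplicative hash */

	return &z->lock[h >> (64 - STATE_LOCK_SHIFT)];
}

static void state_lock_acquire(struct state_lock *l)
{
	while (atomic_exchange_explicit(&l->held, 1, memory_order_acquire))
		; /* spin until the previous holder releases */
}

static void state_lock_release(struct state_lock *l)
{
	assert(atomic_load_explicit(&l->held, memory_order_relaxed) == 1);
	atomic_store_explicit(&l->held, 0, memory_order_release);
}

enum { PAGE_STABLE, PAGE_VOLATILE };

struct toy_page {
	uint64_t pfn;
	int users;	/* stands in for the conditions checked before going volatile */
	int state;
};

/* Steps 1) and 2) from the thread, made atomic by holding the hashed lock. */
static int toy_make_volatile(struct zone_locks *z, struct toy_page *p)
{
	struct state_lock *l = state_lock_for(z, p->pfn);
	int ok;

	state_lock_acquire(l);
	ok = (p->users == 0);			/* 1) check conditions */
	if (ok)
		p->state = PAGE_VOLATILE;	/* 2) do the state change */
	state_lock_release(l);
	return ok;
}

/* Operation i): changing a condition takes the same lock, so it has to wait. */
static void toy_make_stable(struct zone_locks *z, struct toy_page *p)
{
	struct state_lock *l = state_lock_for(z, p->pfn);

	state_lock_acquire(l);
	p->users++;
	p->state = PAGE_STABLE;
	state_lock_release(l);
}
```

Two pages that hash to the same slot simply share a lock, which costs some contention but keeps the table at a fixed size per zone instead of one lock bit per page.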
>>> +		if (page->index != linear_page_index(vma, addr))
>>> +			/* If nonlinear, store the file page offset in the pte. */
>>> +			set_pte_at(dst_mm, addr, dst_pte, pgoff_to_pte(page->index));
>>> +		else
>>> +			pte_clear(dst_mm, addr, dst_pte);
>>>  	}
>> It would be good to document that PG_discarded can only happen for
>> file pages and NOT for eg. clean swap cache pages.
>
> PG_discarded can happen for swap cache pages as well. If a clean swap
> cache page gets removed and subsequently accessed again the discard fault
> handler will set the bit (see __page_discard). The code necessary for
> volatile swap cache is introduced with patch #2. So I would rather not
> add a comment in patch #1 only to remove it again with patch #2 ..

I discovered that once I opened the next email :)

>>> @@ -1390,6 +1391,7 @@ int test_clear_page_writeback(struct pag
>>>  			radix_tree_tag_clear(&mapping->page_tree,
>>>  						page_index(page),
>>>  						PAGECACHE_TAG_WRITEBACK);
>>> +			page_make_volatile(page, 1);
>>>  			if (bdi_cap_account_writeback(bdi)) {
>>>  				__dec_bdi_stat(bdi, BDI_WRITEBACK);
>>>  				__bdi_writeout_inc(bdi);
>> Does this mark the page volatile before the IO writing the
>> dirty data back to disk has even started? Is that OK?
>
> Hmm, it could be that the page_make_volatile is just superfluous here.
> The logic here is that whenever one of the conditions that prevent a
> page from becoming volatile is cleared a try with page_make_volatile
> is done. The condition in question here is PageWriteback(page). If we
> can prove that one of the other conditions is true this particular call
> is a waste of effort.

Actually, test_clear_page_writeback is probably called
on IO completion and it was just me being confused after
a few hundred lines of very new (to me) VM code :)

I guess the patch is correct.

Acked-by: Rik van Riel

-- 
All rights reversed.
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rik van Riel
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Sun, 29 Mar 2009 10:38:51 -0400
Message-ID: <49CF87FB.4030608@redhat.com>
References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <49CD69EB.6000000@redhat.com> <20090329162024.687196ab@skybase>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20090329162024.687196ab@skybase>
Sender: owner-linux-mm@kvack.org
To: Martin Schwidefsky
Cc: Dave Hansen, linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

Martin Schwidefsky wrote:
> On Fri, 27 Mar 2009 20:06:03 -0400
> Rik van Riel wrote:
>
>> Dave Hansen wrote:
>>> On Fri, 2009-03-27 at 16:09 +0100, Martin Schwidefsky wrote:
>>>> If the host picks one of the
>>>> pages the guest can recreate, the host can throw it away instead of writing
>>>> it to the paging device. Simple and elegant.
>>> Heh, simple and elegant for the hypervisor. But I'm not sure I'm going
>>> to call *anything* that requires a new CPU instruction elegant. ;)
>> I am convinced that it could be done with a guest-writable
>> "bitmap", with 2 bits per page. That would make this scheme
>> useful for KVM, too.
>
> This was our initial approach before we came up with the milli-code
> instruction. The reason we did not use a bitmap was to prevent the
> guest from changing the host state (4 guest states U/S/V/P and 3 host
> states r/p/z). With the full set of states you'd need 4 bits. And the
> host needs to have a "master" copy of the host bits, one the guest
> cannot change, otherwise you get into trouble.

KVM already has the info from the host bits somewhere else,
which is needed to be able to actually find the physical
pages used by a guest.
That leaves just the guest states, so a compare-and-swap may
work for non-s390.

-- 
All rights reversed.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dave Hansen
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Mon, 30 Mar 2009 08:54:55 -0700
Message-ID: <1238428495.8286.638.camel@nimitz>
References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <20090329161253.3faffdeb@skybase>
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20090329161253.3faffdeb@skybase>
Sender: owner-linux-mm@kvack.org
To: Martin Schwidefsky
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com
List-Id: virtualization@lists.linuxfoundation.org

On Sun, 2009-03-29 at 16:12 +0200, Martin Schwidefsky wrote:
> > Can we persuade the hypervisor to tell us which pages it decided to page
> > out and just skip those when we're scanning the LRU?
>
> One principle of the whole approach is that the hypervisor does not
> call into an otherwise idle guest. The cost of scheduling the virtual
> cpu is just too high. So we would need a means to store the information
> where the guest can pick it up when it happens to do LRU. I don't think
> that this will work out.

I didn't mean for it to actively notify the guest. Perhaps, as Rik
said, have a bitmap where the host can set or clear a bit for the guest
to see.

As the guest is scanning the LRU, it checks the structure (or makes an
hcall or whatever) and sees that the hypervisor has already taken care
of the page. It skips these pages in the first round of scanning.

I do see what you're saying about this saving the page-*out* operation
on the hypervisor side.
It can simply toss out pages instead of paging
them itself. That's a pretty advanced optimization, though. What would
this code look like if we didn't optimize to that level?

It also occurs to me that the hypervisor could be doing a lot of this
internally. This whole scheme is about telling the hypervisor about
pages that we (the kernel) know we can regenerate. The hypervisor
should know a lot of that information, too. We ask it to populate a
page with stuff from virtual I/O devices or write a page out to those
devices. The page remains volatile until something from the guest
writes to it. The hypervisor could keep a record of how to recreate the
page as long as it remains volatile and clean.

That wouldn't cover things like page cache from network filesystems,
though.

This patch does look like the full monty but I have to wonder what other
partial approaches are out there.

-- Dave

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Martin Schwidefsky
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Mon, 30 Mar 2009 18:34:05 +0200
Message-ID: <20090330183405.750440da@skybase>
References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <20090329161253.3faffdeb@skybase> <1238428495.8286.638.camel@nimitz>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <1238428495.8286.638.camel@nimitz>
Sender: owner-linux-mm@kvack.org
To: Dave Hansen
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com
List-Id: virtualization@lists.linuxfoundation.org

On Mon, 30 Mar 2009 08:54:55 -0700
Dave Hansen wrote:

> On Sun, 2009-03-29 at 16:12 +0200, Martin Schwidefsky wrote:
> > > Can we persuade the hypervisor to tell us which pages it decided to page
> > > out and just skip those when we're scanning the LRU?
> >
> > One principle of the whole approach is that the hypervisor does not
> > call into an otherwise idle guest. The cost of scheduling the virtual
> > cpu is just too high. So we would need a means to store the information
> > where the guest can pick it up when it happens to do LRU. I don't think
> > that this will work out.
>
> I didn't mean for it to actively notify the guest. Perhaps, as Rik
> said, have a bitmap where the host can set or clear a bit for the guest
> to see.

Yes, agreed.

> As the guest is scanning the LRU, it checks the structure (or makes an
> hcall or whatever) and sees that the hypervisor has already taken care
> of the page. It skips these pages in the first round of scanning.

As long as we make this optional I'm fine with it. On s390 with the
current implementation that translates to an ESSA call, which is not
exactly inexpensive: we are talking about more than 100 cycles. The
better solution for us is to age the page with the standard
active/inactive processing.
> I do see what you're saying about this saving the page-*out* operation
> on the hypervisor side. It can simply toss out pages instead of paging
> them itself. That's a pretty advanced optimization, though. What would
> this code look like if we didn't optimize to that level?

Why? It is just a simple test in the host's LRU scan. If the page is at
the end of the inactive list AND has the volatile state then don't
bother with writeback, just throw it away. This is the only place where
the host has to check for the page state.

> It also occurs to me that the hypervisor could be doing a lot of this
> internally. This whole scheme is about telling the hypervisor about
> pages that we (the kernel) know we can regenerate. The hypervisor
> should know a lot of that information, too. We ask it to populate a
> page with stuff from virtual I/O devices or write a page out to those
> devices. The page remains volatile until something from the guest
> writes to it. The hypervisor could keep a record of how to recreate the
> page as long as it remains volatile and clean.

Unfortunately it is not that simple. There are quite a few reasons why
a page has to be made stable. You'd have to pass that information back
and forth between the guest and the host, otherwise the host will throw
away e.g. an mlocked page because it is still marked as volatile in the
virtual block device.

> That wouldn't cover things like page cache from network filesystems,
> though.

Yes, there are pages with a backing the host knows nothing about.

> This patch does look like the full monty but I have to wonder what other
> partial approaches are out there.

I am open to suggestions. The simplest partial approach is already
implemented for s390: unused/stable transitions in the buddy allocator.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jeremy Fitzhardinge
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Mon, 30 Mar 2009 11:37:56 -0700
Message-ID: <49D11184.3060002@goop.org>
References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <20090329161253.3faffdeb@skybase> <1238428495.8286.638.camel@nimitz>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <1238428495.8286.638.camel@nimitz>
Sender: owner-linux-mm@kvack.org
To: Dave Hansen
Cc: Martin Schwidefsky, akpm@osdl.org, nickpiggin@yahoo.com.au, frankeh@watson.ibm.com, virtualization@lists.osdl.org, riel@redhat.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

Dave Hansen wrote:
> It also occurs to me that the hypervisor could be doing a lot of this
> internally. This whole scheme is about telling the hypervisor about
> pages that we (the kernel) know we can regenerate. The hypervisor
> should know a lot of that information, too. We ask it to populate a
> page with stuff from virtual I/O devices or write a page out to those
> devices. The page remains volatile until something from the guest
> writes to it. The hypervisor could keep a record of how to recreate the
> page as long as it remains volatile and clean.
>

That potentially pushes a lot of complexity elsewhere. If you have
multiple paths to a storage device, or a cluster store shared between
multiple machines, then the underlying storage can change making the
guest's copies of those blocks unbacked.
Obviously the host/hypervisor
could deal with that, but it would be a pile of new mechanisms which
don't necessarily exist (for example, it would have to be an active
participant in a distributed locking scheme for a shared block device
rather than just passing it all through to the guest to handle).

That said, people have been looking at tracking block IO to work out
when it might be useful to try and share pages between guests under Xen.

    J

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rik van Riel
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Mon, 30 Mar 2009 14:42:15 -0400
Message-ID: <49D11287.4030307@redhat.com>
References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <20090329161253.3faffdeb@skybase> <1238428495.8286.638.camel@nimitz> <49D11184.3060002@goop.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <49D11184.3060002@goop.org>
Sender: owner-linux-mm@kvack.org
To: Jeremy Fitzhardinge
Cc: Dave Hansen, Martin Schwidefsky, akpm@osdl.org, nickpiggin@yahoo.com.au, frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

Jeremy Fitzhardinge wrote:

> That said, people have been looking at tracking block IO to work out
> when it might be useful to try and share pages between guests under Xen.

Tracking block IO seems like a bass-ackwards way to figure
out what the contents of a memory page are.

The KVM KSM code has a simpler, yet still efficient, way of
figuring out which memory pages can be shared.

-- 
All rights reversed.
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jeremy Fitzhardinge
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Mon, 30 Mar 2009 11:59:00 -0700
Message-ID: <49D11674.9040205@goop.org>
References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <20090329161253.3faffdeb@skybase> <1238428495.8286.638.camel@nimitz> <49D11184.3060002@goop.org> <49D11287.4030307@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <49D11287.4030307@redhat.com>
Sender: owner-linux-mm@kvack.org
To: Rik van Riel
Cc: Dave Hansen, Martin Schwidefsky, akpm@osdl.org, nickpiggin@yahoo.com.au, frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

Rik van Riel wrote:
> Jeremy Fitzhardinge wrote:
>
>> That said, people have been looking at tracking block IO to work out
>> when it might be useful to try and share pages between guests under Xen.
>
> Tracking block IO seems like a bass-ackwards way to figure
> out what the contents of a memory page are.

Well, they're research projects, so nobody said that they're necessarily
useful results ;). I think the rationale is that, in general, there
aren't all that many sharable pages, and aside from zero-pages, the bulk
of them are the result of IO. Since it's much simpler to compare
device+block references than doing page content matching, it is worth
looking at the IO stream to work out what your candidates are.

> The KVM KSM code has a simpler, yet still efficient, way of
> figuring out which memory pages can be shared.

How's that? Does it do page content comparison?

    J
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rik van Riel
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Mon, 30 Mar 2009 16:02:44 -0400
Message-ID: <49D12564.40708@redhat.com>
References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <20090329161253.3faffdeb@skybase> <1238428495.8286.638.camel@nimitz> <49D11184.3060002@goop.org> <49D11287.4030307@redhat.com> <49D11674.9040205@goop.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <49D11674.9040205@goop.org>
Sender: owner-linux-mm@kvack.org
To: Jeremy Fitzhardinge
Cc: Dave Hansen, Martin Schwidefsky, akpm@osdl.org, nickpiggin@yahoo.com.au, frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

Jeremy Fitzhardinge wrote:
> Rik van Riel wrote:
>> Jeremy Fitzhardinge wrote:
>>
>>> That said, people have been looking at tracking block IO to work out
>>> when it might be useful to try and share pages between guests under Xen.
>>
>> Tracking block IO seems like a bass-ackwards way to figure
>> out what the contents of a memory page are.
>
> Well, they're research projects, so nobody said that they're necessarily
> useful results ;). I think the rationale is that, in general, there
> aren't all that many sharable pages, and aside from zero-pages, the bulk
> of them are the result of IO.

I'll give you a hint: Windows zeroes out freed pages.

It should also be possible to hook up arch_free_page() so
freed pages in Linux guests become sharable.

Furthermore, every guest with the same OS version will be
running the same system daemons, same glibc, etc. This
means sharable pages from not just disk IO (probably from
different disks anyway), but also in the BSS and possibly
even on the heap.
>> The KVM KSM code has a simpler, yet still efficient, way of
>> figuring out which memory pages can be shared.
> How's that? Does it do page content comparison?

Eventually. It starts out with hashing the first 128 (IIRC)
bytes of page content and comparing the hashes. If that
matches, it will do content comparison.

Content comparison is done in the background on the host.
I suspect (but have not checked) that it is somehow hooked
up to the page reclaim code on the host.

-- 
All rights reversed.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jeremy Fitzhardinge
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Mon, 30 Mar 2009 13:35:34 -0700
Message-ID: <49D12D16.6050407@goop.org>
References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <20090329161253.3faffdeb@skybase> <1238428495.8286.638.camel@nimitz> <49D11184.3060002@goop.org> <49D11287.4030307@redhat.com> <49D11674.9040205@goop.org> <49D12564.40708@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <49D12564.40708@redhat.com>
Sender: owner-linux-mm@kvack.org
To: Rik van Riel
Cc: Dave Hansen, Martin Schwidefsky, akpm@osdl.org, nickpiggin@yahoo.com.au, frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

Rik van Riel wrote:
> Jeremy Fitzhardinge wrote:
>> Rik van Riel wrote:
>>> Jeremy Fitzhardinge wrote:
>>>
>>>> That said, people have been looking at tracking block IO to work
>>>> out when it might be useful to try and share pages between guests
>>>> under Xen.
>>>
>>> Tracking block IO seems like a bass-ackwards way to figure
>>> out what the contents of a memory page are.
>>
>> Well, they're research projects, so nobody said that they're
>> necessarily useful results ;). I think the rationale is that, in
>> general, there aren't all that many sharable pages, and aside from
>> zero-pages, the bulk of them are the result of IO.
>
> I'll give you a hint: Windows zeroes out freed pages.

Right: "aside from zero-pages". If you exclude zero-pages from your
count of shared pages, the amount of sharing drops a lot.

> It should also be possible to hook up arch_free_page() so
> freed pages in Linux guests become sharable.
>
> Furthermore, every guest with the same OS version will be
> running the same system daemons, same glibc, etc. This
> means sharable pages from not just disk IO (probably from
> different disks anyway),

Why? If you're starting a bunch of cookie-cutter guests, then you're
probably starting them from the same template image or COW block
devices. (Also, if you're wearing the cost of physical IO anyway, then
additional cost of hashing is relatively small.)

> but also in the BSS and possibly
> even on the heap.

Well, maybe. Modern systems generally randomize memory layouts, so even
if they're semantically the same, the pointers will all have different
values.

Other research into "sharing" mostly-similar pages is more promising for
that kind of case.

> Eventually. It starts out with hashing the first 128 (IIRC)
> bytes of page content and comparing the hashes. If that
> matches, it will do content comparison.
>
> Content comparison is done in the background on the host.
> I suspect (but have not checked) that it is somehow hooked
> up to the page reclaim code on the host.

Yeah, that's the straightforward approach; there's about a research
project/year doing a Xen implementation, but they never seem to get very
good results aside from very artificial test conditions.

    J
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dor Laor
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Tue, 31 Mar 2009 00:38:01 +0300
Message-ID: <49D13BB9.3010200@redhat.com>
References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <20090329161253.3faffdeb@skybase> <1238428495.8286.638.camel@nimitz> <49D11184.3060002@goop.org> <49D11287.4030307@redhat.com> <49D11674.9040205@goop.org> <49D12564.40708@redhat.com> <49D12D16.6050407@goop.org>
Reply-To: dlaor@redhat.com
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <49D12D16.6050407@goop.org>
Sender: owner-linux-mm@kvack.org
To: Jeremy Fitzhardinge
Cc: Rik van Riel, akpm@osdl.org, nickpiggin@yahoo.com.au, frankeh@watson.ibm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Dave Hansen, virtualization@lists.osdl.org, Martin Schwidefsky, hugh@veritas.com, Izik Eidus
List-Id: virtualization@lists.linuxfoundation.org

Jeremy Fitzhardinge wrote:
> Rik van Riel wrote:
>
>> Jeremy Fitzhardinge wrote:
>>
>>> Rik van Riel wrote:
>>>
>>>> Jeremy Fitzhardinge wrote:
>>>>
>>>>
>>>>> That said, people have been looking at tracking block IO to work
>>>>> out when it might be useful to try and share pages between guests
>>>>> under Xen.
>>>>>
>>>> Tracking block IO seems like a bass-ackwards way to figure
>>>> out what the contents of a memory page are.
>>>>
>>> Well, they're research projects, so nobody said that they're
>>> necessarily useful results ;). I think the rationale is that, in
>>> general, there aren't all that many sharable pages, and aside from
>>> zero-pages, the bulk of them are the result of IO.
>>>
>> I'll give you a hint: Windows zeroes out freed pages.
>>
>
> Right: "aside from zero-pages". If you exclude zero-pages from your
> count of shared pages, the amount of sharing drops a lot.
>
>
>> It should also be possible to hook up arch_free_page() so
>> freed pages in Linux guests become sharable.
>>
>> Furthermore, every guest with the same OS version will be
>> running the same system daemons, same glibc, etc. This
>> means sharable pages from not just disk IO (probably from
>> different disks anyway),
>>
>
> Why? If you're starting a bunch of cookie-cutter guests, then you're
> probably starting them from the same template image or COW block
> devices. (Also, if you're wearing the cost of physical IO anyway, then
> additional cost of hashing is relatively small.)
>
>
>> but also in the BSS and possibly
>> even on the heap.
>>
>
> Well, maybe. Modern systems generally randomize memory layouts, so even
> if they're semantically the same, the pointers will all have different
> values.
>
> Other research into "sharing" mostly-similar pages is more promising for
> that kind of case.
>
>
>> Eventually. It starts out with hashing the first 128 (IIRC)
>> bytes of page content and comparing the hashes. If that
>> matches, it will do content comparison.
>>

The algorithm was changed quite a bit. Izik is planning to resubmit it
any day now.

>> Content comparison is done in the background on the host.
>> I suspect (but have not checked) that it is somehow hooked
>> up to the page reclaim code on the host.
>>
>
> Yeah, that's the straightforward approach; there's about a research
> project/year doing a Xen implementation, but they never seem to get very
> good results aside from very artificial test conditions.
>

Actually we got really good results by using ksm along with kvm,
running a large number of Windows virtual machines. We can achieve an
overcommit ratio of up to 400% of the host RAM for VMs doing an
M$ Office load.

-dor
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Izik Eidus
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Tue, 31 Mar 2009 01:16:54 +0300
Message-ID: <49D144D6.9000001@redhat.com>
References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <20090329161253.3faffdeb@skybase> <1238428495.8286.638.camel@nimitz> <49D11184.3060002@goop.org> <49D11287.4030307@redhat.com> <49D11674.9040205@goop.org> <49D12564.40708@redhat.com> <49D12D16.6050407@goop.org> <49D13BB9.3010200@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <49D13BB9.3010200@redhat.com>
Sender: owner-linux-mm@kvack.org
Cc: Jeremy Fitzhardinge, Rik van Riel, akpm@osdl.org, nickpiggin@yahoo.com.au, frankeh@watson.ibm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Dave Hansen, virtualization@lists.osdl.org, Martin Schwidefsky, hugh@veritas.com, dlaor@redhat.com
List-Id: virtualization@lists.linuxfoundation.org

Jeremy Fitzhardinge wrote:
>> Rik van Riel wrote:
>>
>>> Jeremy Fitzhardinge wrote:
>>>
>>>> Rik van Riel wrote:
>>>>
>>>>> Jeremy Fitzhardinge wrote:
>>>>>
>>>>>
>>>>>> That said, people have been looking at tracking block IO to work
>>>>>> out when it might be useful to try and share pages between guests
>>>>>> under Xen.
>>>>>>
>>>>> Tracking block IO seems like a bass-ackwards way to figure
>>>>> out what the contents of a memory page are.
>>>>>
>>>> Well, they're research projects, so nobody said that they're
>>>> necessarily useful results ;). I think the rationale is that, in
>>>> general, there aren't all that many sharable pages, and aside from
>>>> zero-pages, the bulk of them are the result of IO.
>>> I'll give you a hint: Windows zeroes out freed pages.
>>>
>>
>> Right: "aside from zero-pages". If you exclude zero-pages from your
>> count of shared pages, the amount of sharing drops a lot.

20026 root      15   0  707m  526m  246m S  7.0 14.0   0:39.57 qemu-system-x86
20010 root      15   0  707m  526m  239m S  6.3 14.0   0:47.16 qemu-system-x86
20015 root      15   0  707m  526m  247m S  5.7 14.0   0:46.84 qemu-system-x86
20031 root      15   0  707m  526m  242m S  5.7 14.1   0:46.74 qemu-system-x86
20005 root      15   0  707m  526m  239m S  0.3 14.0   0:54.16 qemu-system-x86

I just ran 5 Debian 5.0 guests, each with 512 MB of physical RAM. All I
did was open X and then open Thunderbird and Firefox in them. Check the
SHR field...

You cannot ignore the fact that the libraries and the kernel would be
identical among guests and would be shared...

Other than the libraries we get the big bonus that is called zero page
in Windows, but that is really not the case for the above example since
these guests are Linux...

>>
>>
>>> It should also be possible to hook up arch_free_page() so
>>> freed pages in Linux guests become sharable.
>>>
>>> Furthermore, every guest with the same OS version will be
>>> running the same system daemons, same glibc, etc. This
>>> means sharable pages from not just disk IO (probably from
>>> different disks anyway),
>>>
>>
>> Why? If you're starting a bunch of cookie-cutter guests, then you're
>> probably starting them from the same template image or COW block
>> devices. (Also, if you're wearing the cost of physical IO anyway,
>> then additional cost of hashing is relatively small.)
>>
>>
>>> but also in the BSS and possibly
>>> even on the heap.
>>>
>>
>> Well, maybe. Modern systems generally randomize memory layouts, so
>> even if they're semantically the same, the pointers will all have
>> different values.
>>
>> Other research into "sharing" mostly-similar pages is more promising
>> for that kind of case.
>>
>>
>>> Eventually. It starts out with hashing the first 128 (IIRC)
>>> bytes of page content and comparing the hashes. If that
>>> matches, it will do content comparison.
>>>
> The algorithm was changed quite a bit. Izik is planning to resubmit it
> any day now.
>>> Content comparison is done in the background on the host.
>>> I suspect (but have not checked) that it is somehow hooked
>>> up to the page reclaim code on the host.
>>>
>>
>> Yeah, that's the straightforward approach; there's about a research
>> project/year doing a Xen implementation, but they never seem to get
>> very good results aside from very artificial test conditions.

I keep hearing this argument from Microsoft, but even in the hardest
test conditions, how would you make the libraries and the kernel not be
identical among the guests?

Anyway, page sharing is running and installed for our customers, and so
far I only hear from the sales guys how surprised and happy the
customers are with the overcommit that page sharing offers...

Anyway, I have a heavily changed ksm version ready (mostly the logical
algorithm for finding pages) that I made against the mainline version.
It is ready to be sent once I get some better benchmark numbers to post
on the list together with the patch...

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rik van Riel
Subject: Re: [patch 2/6] Guest page hinting: volatile swap cache.
Date: Tue, 31 Mar 2009 22:10:48 -0400
Message-ID: <49D2CD28.9080700@redhat.com>
References: <20090327150905.819861420@de.ibm.com> <20090327151011.798602788@de.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20090327151011.798602788@de.ibm.com>
Sender: owner-linux-mm@kvack.org
To: Martin Schwidefsky
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

Martin Schwidefsky wrote:
> From: Martin Schwidefsky
> From: Hubertus Franke
> From: Himanshu Raj
>
> The volatile page state can be used for anonymous pages as well, if
> they have been added to the swap cache and the swap write is finished.
> Signed-off-by: Martin Schwidefsky

Acked-by: Rik van Riel

-- 
All rights reversed.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rik van Riel
Subject: Re: [patch 3/6] Guest page hinting: mlocked pages.
Date: Tue, 31 Mar 2009 22:52:04 -0400
Message-ID: <49D2D6D4.8000309@redhat.com>
References: <20090327150905.819861420@de.ibm.com>
 <20090327151012.095486071@de.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20090327151012.095486071@de.ibm.com>
Sender: owner-linux-mm@kvack.org
To: Martin Schwidefsky
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org,
 nickpiggin@yahoo.com.au, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

Martin Schwidefsky wrote:
> From: Martin Schwidefsky
> From: Hubertus Franke
> From: Himanshu Raj
>
> Add code to get mlock() working with guest page hinting. The problem
> with mlock is that locked pages may not be removed from page cache.
> That means they need to be stable.

> Signed-off-by: Martin Schwidefsky

Acked-by: Rik van Riel

--
All rights reversed.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Martin Schwidefsky
Subject: Re: [patch 2/6] Guest page hinting: volatile swap cache.
Date: Wed, 1 Apr 2009 10:13:34 +0200
Message-ID: <20090401101334.7e6ea848@skybase>
References: <20090327150905.819861420@de.ibm.com>
 <20090327151011.798602788@de.ibm.com>
 <49D2CD28.9080700@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <49D2CD28.9080700@redhat.com>
Sender: owner-linux-mm@kvack.org
To: Rik van Riel
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org,
 nickpiggin@yahoo.com.au, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

On Tue, 31 Mar 2009 22:10:48 -0400
Rik van Riel wrote:

> Martin Schwidefsky wrote:
> > From: Martin Schwidefsky
> > From: Hubertus Franke
> > From: Himanshu Raj
> >
> > The volatile page state can be used for anonymous pages as well, if
> > they have been added to the swap cache and the swap write is finished.
>
> > Signed-off-by: Martin Schwidefsky
>
> Acked-by: Rik van Riel

Thank you for the review. I'll add the Acked-by.

--
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Martin Schwidefsky
Subject: Re: [patch 3/6] Guest page hinting: mlocked pages.
Date: Wed, 1 Apr 2009 10:13:57 +0200
Message-ID: <20090401101357.7f714f7a@skybase>
References: <20090327150905.819861420@de.ibm.com>
 <20090327151012.095486071@de.ibm.com>
 <49D2D6D4.8000309@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <49D2D6D4.8000309@redhat.com>
Sender: owner-linux-mm@kvack.org
To: Rik van Riel
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org,
 nickpiggin@yahoo.com.au, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

On Tue, 31 Mar 2009 22:52:04 -0400
Rik van Riel wrote:

> Martin Schwidefsky wrote:
> > From: Martin Schwidefsky
> > From: Hubertus Franke
> > From: Himanshu Raj
> >
> > Add code to get mlock() working with guest page hinting. The problem
> > with mlock is that locked pages may not be removed from page cache.
> > That means they need to be stable.
>
> > Signed-off-by: Martin Schwidefsky
>
> Acked-by: Rik van Riel

I'll add this one as well. Thanks again.

--
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rik van Riel
Subject: Re: [patch 4/6] Guest page hinting: writable page table entries.
Date: Wed, 01 Apr 2009 09:25:34 -0400
Message-ID: <49D36B4E.7000702@redhat.com>
References: <20090327150905.819861420@de.ibm.com>
 <20090327151012.398894143@de.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20090327151012.398894143@de.ibm.com>
Sender: owner-linux-mm@kvack.org
To: Martin Schwidefsky
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org,
 nickpiggin@yahoo.com.au, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

Martin Schwidefsky wrote:

This code has me stumped. Does it mean that if a page already
has the PageWritable bit set (and count_ok stays 0), we will
always mark the page as volatile?

How does that work out on !s390?

> /**
> + * __page_check_writable() - check page state for new writable pte
> + *
> + * @page: the page the new writable pte refers to
> + * @pte: the new writable pte
> + */
> +void __page_check_writable(struct page *page, pte_t pte, unsigned int offset)
> +{
> +	int count_ok = 0;
> +
> +	preempt_disable();
> +	while (page_test_set_state_change(page))
> +		cpu_relax();
> +
> +	if (!TestSetPageWritable(page)) {
> +		count_ok = check_counts(page, offset);
> +		if (check_bits(page) && count_ok)
> +			page_set_volatile(page, 1);
> +		else
> +			/*
> +			 * If two processes create a write mapping at the
> +			 * same time check_counts will return false or if
> +			 * the page is currently isolated from the LRU
> +			 * check_bits will return false but the page might
> +			 * be in volatile state.
> +			 * We have to take care about the dirty bit so the
> +			 * only option left is to make the page stable but
> +			 * we can try to make it volatile a bit later.
> +			 */
> +			page_set_stable_if_present(page);
> +	}
> +	page_clear_state_change(page);
> +	if (!count_ok)
> +		page_make_volatile(page, 1);
> +	preempt_enable();
> +}
> +EXPORT_SYMBOL(__page_check_writable);

--
All rights reversed.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Martin Schwidefsky
Subject: Re: [patch 4/6] Guest page hinting: writable page table entries.
Date: Wed, 1 Apr 2009 16:36:58 +0200
Message-ID: <20090401163658.60f851ed@skybase>
References: <20090327150905.819861420@de.ibm.com>
 <20090327151012.398894143@de.ibm.com>
 <49D36B4E.7000702@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <49D36B4E.7000702@redhat.com>
Sender: owner-linux-mm@kvack.org
To: Rik van Riel
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org,
 nickpiggin@yahoo.com.au, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

On Wed, 01 Apr 2009 09:25:34 -0400
Rik van Riel wrote:

> Martin Schwidefsky wrote:
>
> This code has me stumped. Does it mean that if a page already
> has the PageWritable bit set (and count_ok stays 0), we will
> always mark the page as volatile?
>
> How does that work out on !s390?

No, we will not always mark the page as volatile. If PG_writable is
already set, count_ok will stay 0 and a call to page_make_volatile is
done. This differs from page_set_volatile as it repeats all the
required checks, then calls page_set_volatile with PageWritable(page)
as the second argument. What state the page will get depends on the
architecture definition of page_set_volatile. For s390 this will do a
state transition to potentially volatile as the PG_writable bit is set.
On architectures that cannot check the dirty bit on a physical page
basis, the page needs to be made stable.

> > /**
> > + * __page_check_writable() - check page state for new writable pte
> > + *
> > + * @page: the page the new writable pte refers to
> > + * @pte: the new writable pte
> > + */
> > +void __page_check_writable(struct page *page, pte_t pte, unsigned int offset)
> > +{
> > +	int count_ok = 0;
> > +
> > +	preempt_disable();
> > +	while (page_test_set_state_change(page))
> > +		cpu_relax();
> > +
> > +	if (!TestSetPageWritable(page)) {
> > +		count_ok = check_counts(page, offset);
> > +		if (check_bits(page) && count_ok)
> > +			page_set_volatile(page, 1);
> > +		else
> > +			/*
> > +			 * If two processes create a write mapping at the
> > +			 * same time check_counts will return false or if
> > +			 * the page is currently isolated from the LRU
> > +			 * check_bits will return false but the page might
> > +			 * be in volatile state.
> > +			 * We have to take care about the dirty bit so the
> > +			 * only option left is to make the page stable but
> > +			 * we can try to make it volatile a bit later.
> > +			 */
> > +			page_set_stable_if_present(page);
> > +	}
> > +	page_clear_state_change(page);
> > +	if (!count_ok)
> > +		page_make_volatile(page, 1);
> > +	preempt_enable();
> > +}
> > +EXPORT_SYMBOL(__page_check_writable);
>

--
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rik van Riel
Subject: Re: [patch 4/6] Guest page hinting: writable page table entries.
Date: Wed, 01 Apr 2009 10:45:53 -0400
Message-ID: <49D37E21.2000609@redhat.com>
References: <20090327150905.819861420@de.ibm.com>
 <20090327151012.398894143@de.ibm.com>
 <49D36B4E.7000702@redhat.com>
 <20090401163658.60f851ed@skybase>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20090401163658.60f851ed@skybase>
Sender: owner-linux-mm@kvack.org
To: Martin Schwidefsky
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org,
 nickpiggin@yahoo.com.au, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

Martin Schwidefsky wrote:
> On Wed, 01 Apr 2009 09:25:34 -0400
> Rik van Riel wrote:
>
>> Martin Schwidefsky wrote:
>>
>> This code has me stumped. Does it mean that if a page already
>> has the PageWritable bit set (and count_ok stays 0), we will
>> always mark the page as volatile?
>>
>> How does that work out on !s390?
>
> No, we will not always mark the page as volatile. If PG_writable is
> already set, count_ok will stay 0 and a call to page_make_volatile is
> done. This differs from page_set_volatile as it repeats all the
> required checks, then calls page_set_volatile with PageWritable(page)
> as the second argument. What state the page will get depends on the
> architecture definition of page_set_volatile. For s390 this will do a
> state transition to potentially volatile as the PG_writable bit is set.
> On architectures that cannot check the dirty bit on a physical page
> basis, the page needs to be made stable.

Good point. I guess that means patch 4/6 checks out right, then :)

Acked-by: Rik van Riel

--
All rights reversed.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rik van Riel
Subject: Re: [patch 5/6] Guest page hinting: minor fault optimization.
Date: Wed, 01 Apr 2009 11:33:59 -0400
Message-ID: <49D38967.8020706@redhat.com>
References: <20090327150905.819861420@de.ibm.com>
 <20090327151012.713478499@de.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20090327151012.713478499@de.ibm.com>
Sender: owner-linux-mm@kvack.org
To: Martin Schwidefsky
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org,
 nickpiggin@yahoo.com.au, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

Martin Schwidefsky wrote:
> From: Martin Schwidefsky
> From: Hubertus Franke
> From: Himanshu Raj
>
> One of the challenges of the guest page hinting scheme is the cost for
> the state transitions. If the cost gets too high the whole concept of
> page state information is in question. Therefore it is important to
> avoid the state transitions when possible.

> Signed-off-by: Martin Schwidefsky

Acked-by: Rik van Riel

--
All rights reversed.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rik van Riel
Subject: Re: [patch 6/6] Guest page hinting: s390 support.
Date: Wed, 01 Apr 2009 12:18:58 -0400
Message-ID: <49D393F2.2010105@redhat.com>
References: <20090327150905.819861420@de.ibm.com>
 <20090327151013.024372165@de.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20090327151013.024372165@de.ibm.com>
Sender: owner-linux-mm@kvack.org
To: Martin Schwidefsky
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org,
 nickpiggin@yahoo.com.au, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

Martin Schwidefsky wrote:
> From: Martin Schwidefsky
> From: Hubertus Franke
> From: Himanshu Raj
>
> s390 uses the milli-coded ESSA instruction to set the page state. The
> page state is formed by four guest page states called block usage states
> and three host page states called block content states.

> Signed-off-by: Martin Schwidefsky

Acked-by: Rik van Riel

--
All rights reversed.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nick Piggin
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Thu, 2 Apr 2009 22:32:00 +1100
Message-ID: <200904022232.02185.nickpiggin@yahoo.com.au>
References: <20090327150905.819861420@de.ibm.com>
 <200903281705.29798.rusty@rustcorp.com.au>
 <20090329162336.7c0700e9@skybase>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20090329162336.7c0700e9@skybase>
Content-Disposition: inline
Sender: owner-linux-mm@kvack.org
To: Martin Schwidefsky
Cc: Rusty Russell, virtualization@lists.linux-foundation.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 virtualization@lists.osdl.org, akpm@osdl.org, frankeh@watson.ibm.com,
 riel@redhat.com, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

On Monday 30 March 2009 01:23:36 Martin Schwidefsky wrote:
> On Sat, 28 Mar 2009 17:05:28 +1030
>
> Rusty Russell wrote:
> > On Saturday 28 March 2009 01:39:05 Martin Schwidefsky wrote:
> > > Greetings,
> > > the circus is back in town -- another version of the guest page hinting
> > > patches. The patches differ from version 6 only in the kernel version,
> > > they apply against 2.6.29. My short sniff test showed that the code
> > > is still working as expected.
> > >
> > > To recap (you can skip this if you read the boiler plate of the last
> > > version of the patches):
> > > The main benefit for guest page hinting vs. the ballooner is that there
> > > is no need for a monitor that keeps track of the memory usage of all
> > > the guests, a complex algorithm that calculates the working set sizes
> > > and for the calls into the guest kernel to control the size of the
> > > balloons.
> >
> > I thought you weren't convinced of the concrete benefits over ballooning,
> > or am I misremembering?
>
> The performance tests I have seen so far show that the benefits of
> ballooning vs. guest page hinting are about the same. I am still
> convinced that the guest page hinting is the way to go because you do
> not need an external monitor. Calculating the working set size for a
> guest is a challenge. With guest page hinting there is no need for a
> working set size calculation.

Sounds backwards to me. If the benefits are the same, then having the
complexity in an external monitor (which, by the way, shares many
problems and goals of single-kernel resource/workload management)
is preferable to putting a huge chunk of crap in the guest kernel's
core mm code.

I still think this needs much more justification.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Martin Schwidefsky
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Thu, 2 Apr 2009 17:52:49 +0200
Message-ID: <20090402175249.3c4a6d59@skybase>
References: <20090327150905.819861420@de.ibm.com>
 <200903281705.29798.rusty@rustcorp.com.au>
 <20090329162336.7c0700e9@skybase>
 <200904022232.02185.nickpiggin@yahoo.com.au>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <200904022232.02185.nickpiggin@yahoo.com.au>
Sender: owner-linux-mm@kvack.org
To: Nick Piggin
Cc: Rusty Russell, virtualization@lists.linux-foundation.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 virtualization@lists.osdl.org, akpm@osdl.org, frankeh@watson.ibm.com,
 riel@redhat.com, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

On Thu, 2 Apr 2009 22:32:00 +1100
Nick Piggin wrote:

> On Monday 30 March 2009 01:23:36 Martin Schwidefsky wrote:
> > On Sat, 28 Mar 2009 17:05:28 +1030
> >
> > Rusty Russell wrote:
> > > On Saturday 28 March 2009 01:39:05 Martin Schwidefsky wrote:
> > > > Greetings,
> > > > the circus is back in town -- another version of the guest page hinting
> > > > patches. The patches differ from version 6 only in the kernel version,
> > > > they apply against 2.6.29. My short sniff test showed that the code
> > > > is still working as expected.
> > > >
> > > > To recap (you can skip this if you read the boiler plate of the last
> > > > version of the patches):
> > > > The main benefit for guest page hinting vs. the ballooner is that there
> > > > is no need for a monitor that keeps track of the memory usage of all
> > > > the guests, a complex algorithm that calculates the working set sizes
> > > > and for the calls into the guest kernel to control the size of the
> > > > balloons.
> > >
> > > I thought you weren't convinced of the concrete benefits over ballooning,
> > > or am I misremembering?
> >
> > The performance tests I have seen so far show that the benefits of
> > ballooning vs. guest page hinting are about the same. I am still
> > convinced that the guest page hinting is the way to go because you do
> > not need an external monitor. Calculating the working set size for a
> > guest is a challenge. With guest page hinting there is no need for a
> > working set size calculation.
>
> Sounds backwards to me. If the benefits are the same, then having the
> complexity in an external monitor (which, by the way, shares many
> problems and goals of single-kernel resource/workload management)
> is preferable to putting a huge chunk of crap in the guest kernel's
> core mm code.

The benefits are the same but the algorithmic complexity is reduced.
The patch to the memory management has complexity in itself but from a
1000 feet standpoint guest page hinting is simpler, no? The question
of how much memory each guest has to release does not arise. With the
ballooner I have seen a few problematic cases where the size of
the balloon in principle killed the guest. My favorite is the "clever"
monitor script that queried the guest's free memory and put all free
memory into the balloon. Now guess what happened with a guest that just
booted..

And could you please explain with a few more words >what< you consider
to be "crap"?
I can't do anything with a general statement "this is crap".
Which translates to me: leave me alone..

> I still think this needs much more justification.

Ok, I can understand that. We probably need a KVM based version to show
that benefits exist on non-s390 hardware as well.

--
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jeremy Fitzhardinge
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Thu, 02 Apr 2009 09:18:39 -0700
Message-ID: <49D4E55F.8010406@goop.org>
References: <20090327150905.819861420@de.ibm.com>
 <200903281705.29798.rusty@rustcorp.com.au>
 <20090329162336.7c0700e9@skybase>
 <200904022232.02185.nickpiggin@yahoo.com.au>
 <20090402175249.3c4a6d59@skybase>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20090402175249.3c4a6d59@skybase>
Sender: owner-linux-mm@kvack.org
To: Martin Schwidefsky
Cc: Nick Piggin, akpm@osdl.org, frankeh@watson.ibm.com,
 virtualization@lists.osdl.org, riel@redhat.com,
 linux-kernel@vger.kernel.org,
 virtualization@lists.linux-foundation.org, linux-mm@kvack.org,
 hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

Martin Schwidefsky wrote:
>> I still think this needs much more justification.
>>
>
> Ok, I can understand that. We probably need a KVM based version to show
> that benefits exist on non-s390 hardware as well.
>

BTW, there was a presentation at the most recent Xen summit which makes
use of CMM ("Satori: Enlightened Page Sharing",
http://www.xen.org/files/xensummit_oracle09/Satori.pdf).

    J

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nick Piggin
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Fri, 3 Apr 2009 03:23:36 +1100
Message-ID: <200904030323.37523.nickpiggin@yahoo.com.au>
References: <20090327150905.819861420@de.ibm.com>
 <200904022232.02185.nickpiggin@yahoo.com.au>
 <20090402175249.3c4a6d59@skybase>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20090402175249.3c4a6d59@skybase>
Content-Disposition: inline
Sender: linux-kernel-owner@vger.kernel.org
To: Martin Schwidefsky
Cc: Rusty Russell, virtualization@lists.linux-foundation.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 virtualization@lists.osdl.org, akpm@osdl.org, frankeh@watson.ibm.com,
 riel@redhat.com, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

On Friday 03 April 2009 02:52:49 Martin Schwidefsky wrote:
> On Thu, 2 Apr 2009 22:32:00 +1100
> Nick Piggin wrote:
>
> > On Monday 30 March 2009 01:23:36 Martin Schwidefsky wrote:
> > > On Sat, 28 Mar 2009 17:05:28 +1030
> > >
> > > Rusty Russell wrote:
> > > > On Saturday 28 March 2009 01:39:05 Martin Schwidefsky wrote:
> > > > > Greetings,
> > > > > the circus is back in town -- another version of the guest page hinting
> > > > > patches. The patches differ from version 6 only in the kernel version,
> > > > > they apply against 2.6.29. My short sniff test showed that the code
> > > > > is still working as expected.
> > > > >
> > > > > To recap (you can skip this if you read the boiler plate of the last
> > > > > version of the patches):
> > > > > The main benefit for guest page hinting vs. the ballooner is that there
> > > > > is no need for a monitor that keeps track of the memory usage of all
> > > > > the guests, a complex algorithm that calculates the working set sizes
> > > > > and for the calls into the guest kernel to control the size of the
> > > > > balloons.
> > > >
> > > > I thought you weren't convinced of the concrete benefits over ballooning,
> > > > or am I misremembering?
> > >
> > > The performance tests I have seen so far show that the benefits of
> > > ballooning vs. guest page hinting are about the same. I am still
> > > convinced that the guest page hinting is the way to go because you do
> > > not need an external monitor. Calculating the working set size for a
> > > guest is a challenge. With guest page hinting there is no need for a
> > > working set size calculation.
> >
> > Sounds backwards to me. If the benefits are the same, then having the
> > complexity in an external monitor (which, by the way, shares many
> > problems and goals of single-kernel resource/workload management)
> > is preferable to putting a huge chunk of crap in the guest kernel's
> > core mm code.
>
> The benefits are the same but the algorithmic complexity is reduced.
> The patch to the memory management has complexity in itself but from a
> 1000 feet standpoint guest page hinting is simpler, no?

Yeah but that's a tradeoff I'll begrudgingly make, considering
a) lots of people doing workload management inside cgroups/containers
   need similar algorithmic complexity so improvements to those
   algorithms will help one another
b) it may be adding complexity, but it isn't adding complexity to a
   subsystem that is already among the most complex in the kernel
c) I don't have to help maintain it

> The question
> of how much memory each guest has to release does not arise. With the
> ballooner I have seen a few problematic cases where the size of
> the balloon in principle killed the guest. My favorite is the "clever"
> monitor script that queried the guest's free memory and put all free
> memory into the balloon. Now guess what happened with a guest that just
> booted..
>
> And could you please explain with a few more words >what< you consider
> to be "crap"? I can't do anything with a general statement "this is
> crap". Which translates to me: leave me alone..

:)

No, it's cool code, an interesting idea etc., and last time I looked I
don't think I saw any fundamental (or even any significant incidental)
bugs.

So I guess my problem with it is that it adds complexity to benefit a
small portion of users where there is already another solution that
another set of users already requires.

> > I still think this needs much more justification.
>
> Ok, I can understand that. We probably need a KVM based version to show
> that benefits exist on non-s390 hardware as well.

Should be significantly better than ballooning too.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rik van Riel
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Thu, 02 Apr 2009 15:06:31 -0400
Message-ID: <49D50CB7.2050705@redhat.com>
References: <20090327150905.819861420@de.ibm.com>
 <200903281705.29798.rusty@rustcorp.com.au>
 <20090329162336.7c0700e9@skybase>
 <200904022232.02185.nickpiggin@yahoo.com.au>
 <20090402175249.3c4a6d59@skybase>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20090402175249.3c4a6d59@skybase>
Sender: owner-linux-mm@kvack.org
To: Martin Schwidefsky
Cc: Nick Piggin, Rusty Russell,
 virtualization@lists.linux-foundation.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, virtualization@lists.osdl.org,
 akpm@osdl.org, frankeh@watson.ibm.com, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

Martin Schwidefsky wrote:
> The benefits are the same but the algorithmic complexity is reduced.
> The patch to the memory management has complexity in itself but from a
> 1000 feet standpoint guest page hinting is simpler, no?

Page hinting has a complex, but well understood, mechanism
and simple policy.

Ballooning has a simpler mechanism, but relies on an
as-of-yet undiscovered policy.

Having experienced a zillion VM corner cases over the
last decade and a bit, I think I prefer a complex mechanism
over complex (or worse, unknown!) policy any day.

> Ok, I can understand that. We probably need a KVM based version to show
> that benefits exist on non-s390 hardware as well.

I believe it can work for KVM just fine, if we keep the host state
and the guest state in separate places (so the guest can always
write the guest state without a hypercall).

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nick Piggin
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Fri, 3 Apr 2009 06:22:29 +1100
Message-ID: <200904030622.30935.nickpiggin@yahoo.com.au>
References: <20090327150905.819861420@de.ibm.com>
 <20090402175249.3c4a6d59@skybase>
 <49D50CB7.2050705@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <49D50CB7.2050705@redhat.com>
Content-Disposition: inline
Sender: owner-linux-mm@kvack.org
To: Rik van Riel
Cc: Martin Schwidefsky, Rusty Russell,
 virtualization@lists.linux-foundation.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, virtualization@lists.osdl.org,
 akpm@osdl.org, frankeh@watson.ibm.com, hugh@veritas.com
List-Id: virtualization@lists.linuxfoundation.org

On Friday 03 April 2009 06:06:31 Rik van Riel wrote:
> Martin Schwidefsky wrote:
> > The benefits are the same but the algorithmic complexity is reduced.
> > The patch to the memory management has complexity in itself but from a
> > 1000 feet standpoint guest page hinting is simpler, no?
> Page hinting has a complex, but well understood, mechanism
> and simple policy.
>
> Ballooning has a simpler mechanism, but relies on an
> as-of-yet undiscovered policy.
>
> Having experienced a zillion VM corner cases over the
> last decade and a bit, I think I prefer a complex mechanism
> over complex (or worse, unknown!) policy any day.

I disagree with it being so clear cut. Volatile pagecache policy is
completely out of the control of the Linux VM. Whereas ballooning does
have to make some tradeoff between guests, the actual reclaim will be
driven by the guests. Neither way is perfect, but it's not like the
hypervisor reclaim is foolproof against making a bad tradeoff between
guests.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Hugh Dickins
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Thu, 2 Apr 2009 20:27:21 +0100 (BST)
Message-ID:
References: <20090327150905.819861420@de.ibm.com>
 <200903281705.29798.rusty@rustcorp.com.au>
 <20090329162336.7c0700e9@skybase>
 <200904022232.02185.nickpiggin@yahoo.com.au>
 <20090402175249.3c4a6d59@skybase>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Return-path:
In-Reply-To: <20090402175249.3c4a6d59@skybase>
Sender: owner-linux-mm@kvack.org
To: Martin Schwidefsky
Cc: Nick Piggin, Rusty Russell,
 virtualization@lists.linux-foundation.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, virtualization@lists.osdl.org,
 akpm@osdl.org, frankeh@watson.ibm.com, riel@redhat.com
List-Id: virtualization@lists.linuxfoundation.org

On Thu, 2 Apr 2009, Martin Schwidefsky wrote:
> On Thu, 2 Apr 2009 22:32:00 +1100
> Nick Piggin wrote:
>
> > I still think this needs much more justification.
>
> Ok, I can understand that. We probably need a KVM based version to show
> that benefits exist on non-s390 hardware as well.

That would indeed help your cause enormously (I think I made the same
point last time).
All these complex transitions, added to benefit only an
architecture to which few developers have access, ask for trouble -
we mm hackers already get caught out often enough by your
too-well-camouflaged page_test_dirty().

Hugh

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jeremy Fitzhardinge
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Thu, 02 Apr 2009 12:58:33 -0700
Message-ID: <49D518E9.1090001@goop.org>
References: <20090327150905.819861420@de.ibm.com>
 <200903281705.29798.rusty@rustcorp.com.au>
 <20090329162336.7c0700e9@skybase>
 <200904022232.02185.nickpiggin@yahoo.com.au>
 <20090402175249.3c4a6d59@skybase>
 <49D50CB7.2050705@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <49D50CB7.2050705@redhat.com>
Sender: owner-linux-mm@kvack.org
To: Rik van Riel
Cc: Martin Schwidefsky, akpm@osdl.org, Nick Piggin,
 frankeh@watson.ibm.com, virtualization@lists.osdl.org,
 linux-kernel@vger.kernel.org,
 virtualization@lists.linux-foundation.org, linux-mm@kvack.org,
 hugh@veritas.com, Xen-devel
List-Id: virtualization@lists.linuxfoundation.org

Rik van Riel wrote:
> Page hinting has a complex, but well understood, mechanism
> and simple policy.
>

For the guest perhaps, and yes, it does push the problem out to the
host. But that doesn't make solving a performance problem any easier if
you end up in a mess.

> Ballooning has a simpler mechanism, but relies on an
> as-of-yet undiscovered policy.
>

(I'm talking about Xen ballooning here; I know KVM ballooning works
differently.)

Yes and no. If you want to be able to shrink the guest very
aggressively, then you need to be very careful about not shrinking too
much for its current and near-future needs.
But you'll get into an equivalently bad state with page hinting if the host decides to swap out and discard lots of persistent guest pages.

When the host demands memory from the guest, ballooning in the simple case is analogous to page hinting:

    * give up free pages == mark pages unused
    * give up clean pages == mark pages volatile
    * cause pressure to release some memory == host swapping

The flipside is how guests can ask for memory if their needs increase again. Page-hinting is fault-driven, so the guest may stall while the host sorts out some memory to back the guest's pages. Ballooning requires the guest to explicitly ask for memory, and that could be done in advance if it notices the pool of easily-freed pages is shrinking rapidly (though I guess it could be done on demand as well, but we don't have hooks for that).

But of course, there are other approaches people are playing with, like Dan Magenheimer's transcendent memory, which is a pool of hypervisor-owned and managed pages which guests can use via a copy interface, as a second-chance page discard cache, fast swap, etc. Such mechanisms may be easier on both the guest complexity and policy fronts.

The more complex host policy decisions of how to balance overall memory use system-wide are much the same for both mechanisms.

J
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Rik van Riel Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Thu, 02 Apr 2009 16:05:22 -0400 Message-ID: <49D51A82.8090908@redhat.com> References: <20090327150905.819861420@de.ibm.com> <20090402175249.3c4a6d59@skybase> <49D50CB7.2050705@redhat.com> <200904030622.30935.nickpiggin@yahoo.com.au> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <200904030622.30935.nickpiggin@yahoo.com.au> Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Martin Schwidefsky , Rusty Russell , virtualization@lists.linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, akpm@osdl.org, frankeh@watson.ibm.com, hugh@veritas.com List-Id: virtualization@lists.linuxfoundation.org Nick Piggin wrote: > On Friday 03 April 2009 06:06:31 Rik van Riel wrote: > >> Ballooning has a simpler mechanism, but relies on an >> as-of-yet undiscovered policy. >> >> Having experienced a zillion VM corner cases over the >> last decade and a bit, I think I prefer a complex mechanism >> over complex (or worse, unknown!) policy any day. >> > > I disagree with it being so clear cut. Volatile pagecache policy is completely > out of the control of the Linux VM. Whereas ballooning does have to make some > tradeoff between guests, but the actual reclaim will be driven by the guests. > Neither way is perfect, but it's not like the hypervisor reclaim is foolproof > against making a bad tradeoff between guests. > I guess we could try to figure out a simple and robust policy for ballooning. If we can come up with a policy which nobody can shoot holes in by just discussing it, it may be worth implementing and benchmarking. Maybe something based on the host passing memory pressure on to the guests, and the guests having their own memory pressure push back to the host. I'll start by telling you the best auto-ballooning policy idea I have come up with so far, and the (major?) hole in it.
Basically, the host needs the memory pressure notification, where the VM will notify the guests when memory is running low (and something could soon be swapped). At that point, each guest which receives the signal will try to free some memory and return it to the host. Each guest can have the reverse in its own pageout code. Once memory pressure grows to a certain point (eg. when the guest is about to swap something out), it could reclaim a few pages from the host. If all the guests behave themselves, this could work. However, even with just reasonably behaving guests, differences between the VMs in each guest could lead to unbalanced reclaiming, penalizing better behaving guests. If one guest is behaving badly, it could really impact the other guests. Can you think of improvements to this idea? Can you think of another balloon policy that does not have nasty corner cases? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Rik van Riel Subject: Re: [patch 0/6] Guest page hinting version 7. 
Date: Thu, 02 Apr 2009 16:14:33 -0400 Message-ID: <49D51CA9.6090601@redhat.com> References: <20090327150905.819861420@de.ibm.com> <200903281705.29798.rusty@rustcorp.com.au> <20090329162336.7c0700e9@skybase> <200904022232.02185.nickpiggin@yahoo.com.au> <20090402175249.3c4a6d59@skybase> <49D50CB7.2050705@redhat.com> <49D518E9.1090001@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <49D518E9.1090001@goop.org> Sender: owner-linux-mm@kvack.org To: Jeremy Fitzhardinge Cc: Martin Schwidefsky , akpm@osdl.org, Nick Piggin , frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, linux-mm@kvack.org, hugh@veritas.com, Xen-devel List-Id: virtualization@lists.linuxfoundation.org Jeremy Fitzhardinge wrote: > The more complex host policy decisions of how to balance overall > memory use system-wide are much in the same for both mechanisms. Not at all. Page hinting is just an optimization to host swapping, where IO can be avoided on many of the pages that hit the end of the LRU. No decisions have to be made at all about balancing memory use between guests, it just happens through regular host LRU aging. Automatic ballooning requires that something on the host figures out how much memory each guest needs and sizes the guests appropriately. All the proposed policies for that which I have seen have some nasty corner cases or are simply very limited in scope. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeremy Fitzhardinge Subject: Re: [patch 0/6] Guest page hinting version 7. 
Date: Thu, 02 Apr 2009 13:34:37 -0700 Message-ID: <49D5215D.6050503@goop.org> References: <20090327150905.819861420@de.ibm.com> <200903281705.29798.rusty@rustcorp.com.au> <20090329162336.7c0700e9@skybase> <200904022232.02185.nickpiggin@yahoo.com.au> <20090402175249.3c4a6d59@skybase> <49D50CB7.2050705@redhat.com> <49D518E9.1090001@goop.org> <49D51CA9.6090601@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <49D51CA9.6090601@redhat.com> Sender: owner-linux-mm@kvack.org To: Rik van Riel Cc: Martin Schwidefsky , akpm@osdl.org, Nick Piggin , frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, linux-mm@kvack.org, hugh@veritas.com, Xen-devel List-Id: virtualization@lists.linuxfoundation.org Rik van Riel wrote: > Jeremy Fitzhardinge wrote: >> The more complex host policy decisions of how to balance overall >> memory use system-wide are much in the same for both mechanisms. > Not at all. Page hinting is just an optimization to host swapping, where > IO can be avoided on many of the pages that hit the end of the LRU. > > No decisions have to be made at all about balancing memory use > between guests, it just happens through regular host LRU aging. When the host pages out a page belonging to guest A, then its making a policy decision on how large guest A should be compared to B. If the policy is a global LRU on all guest pages, then that's still a policy on guest sizes: the target size is a function of its working set, assuming that the working set is well modelled by LRU. I imagine that if the guest and host are both managing their pages with an LRU-like algorithm you'll get some nasty interactions, which page hinting tries to alleviate. > Automatic ballooning requires that something on the host figures > out how much memory each guest needs and sizes the guests > appropriately. 
All the proposed policies for that which I have > seen have some nasty corner cases or are simply very limited > in scope. Well, you could apply something equivalent to a global LRU: ask for more pages from guests who have the most unused pages. (I'm not saying that its necessarily a useful policy.) J -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeremy Fitzhardinge Subject: Re: [patch 0/6] Guest page hinting version 7. Date: Thu, 02 Apr 2009 17:50:49 -0700 Message-ID: <49D55D69.5030605@goop.org> References: <20090327150905.819861420@de.ibm.com> <20090402175249.3c4a6d59@skybase> <49D50CB7.2050705@redhat.com> <200904030622.30935.nickpiggin@yahoo.com.au> <49D51A82.8090908@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <49D51A82.8090908@redhat.com> Sender: owner-linux-mm@kvack.org To: Rik van Riel Cc: Nick Piggin , akpm@osdl.org, frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, linux-mm@kvack.org, Martin Schwidefsky , hugh@veritas.com List-Id: virtualization@lists.linuxfoundation.org Rik van Riel wrote: > I guess we could try to figure out a simple and robust policy > for ballooning. If we can come up with a policy which nobody > can shoot holes in by just discussing it, it may be worth > implementing and benchmarking. > > Maybe something based on the host passing memory pressure > on to the guests, and the guests having their own memory > pressure push back to the host. > > I'l start by telling you the best auto-ballooning policy idea > I have come up with so far, and the (major?) hole in it. > I think the first step is to reasonably precisely describe what the outcome you're trying to get to. 
Once you have that you can start talking about policies and mechanisms to achieve it. I suspect we all have basically the same thing in mind, but there's no harm in being explicit. I'm assuming that:

   1. Each domain has a minimum guaranteed amount of resident memory. If you want to shrink a domain to smaller than that minimum, you may as well take all its memory away (ie suspend to disk, completely swap out, migrate elsewhere, etc). The amount is at least the bare-minimum WSS for the domain, but it may be higher to achieve other guarantees.
   2. Each domain has a maximum allowable resident memory, which could be unbounded. The sum of all maximums could well exceed the total amount of host memory, and that represents the overcommit case.
   3. Each domain has a weight, or memory priority. The simple case is that they all have the same weight, but a useful implementation would probably allow more.
   4. Domains can be cooperative, unhelpful (ignore all requests and make none) or malicious (ignore requests, always try to claim more memory). An incompetent cooperative domain could be effectively unhelpful or malicious.
          * hard max limits will prevent non-cooperative domains from causing too much damage
          * they could be limited in other ways, by lowering IO or CPU priorities
          * a domain's "goodness" could be measured by looking to see how much memory it is actually using relative to its min size and its weight
          * other remedies are essentially non-technical, such as more expensive billing the more non-good a domain is
          * (it's hard to force a Xen domain to give up memory you've already given it)

Given that, what outcome do we want? What are we optimising for?

    * Overall throughput?
    * Fairness?
    * Minimise wastage?
    * Rapid response to changes in conditions? (Cope with domains swinging between 64MB and 3GB on a regular basis?)
    * Punish wrong-doers / Reward cooperative domains?
    * ...?

Trying to make one thing work for all cases isn't going to be simple or robust.
If we pick one or two (minimise wastage+overall throughput?) then it might be more tractable. > Basically, the host needs the memory pressure notification, > where the VM will notify the guests when memory is running > low (and something could soon be swapped). At that point, > each guest which receives the signal will try to free some > memory and return it to the host. > > Each guest can have the reverse in its own pageout code. > Once memory pressure grows to a certain point (eg. when > the guest is about to swap something out), it could reclaim > a few pages from the host. > > If all the guests behave themselves, this could work. > Yes. It seems to me the basic metric is that each domain needs to keep track of how much easily allocatable memory it has on hand (ie, pages it can drop without causing a significant increase in IO). If it gets too large, then it can afford to give pages back to the host. If it gets too small, it must ask for more memory (preferably early enough to prevent a real memory crunch). > However, even with just reasonably behaving guests, > differences between the VMs in each guest could lead > to unbalanced reclaiming, penalizing better behaving > guests. > Well, it depends on what you mean by penalized. If they can function properly with the amount of memory they have, then they're fine. If they're struggling because they don't have enough memory for their WSS, then they got their "do I have enough memory on hand" calculation wrong. > If one guest is behaving badly, it could really impact > the other guests. > > Can you think of improvements to this idea? > > Can you think of another balloon policy that does > not have nasty corner cases? > In fully cooperative environments you can rely on ballooning to move things around dramatically. 
But with only partially cooperative guests, the best you can hope for is that it allows you some provisioning flexibility so you can deal with fluctuating demands in guests, but not order-of-magnitude size changes. You just have to leave enough headroom to make the corner cases not too pointy. J -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Martin Schwidefsky Subject: Re: [patch 0/6] Guest page hinting version 7. Date: Fri, 3 Apr 2009 10:49:13 +0200 Message-ID: <20090403104913.29c62082@skybase> References: <20090327150905.819861420@de.ibm.com> <200903281705.29798.rusty@rustcorp.com.au> <20090329162336.7c0700e9@skybase> <200904022232.02185.nickpiggin@yahoo.com.au> <20090402175249.3c4a6d59@skybase> <49D50CB7.2050705@redhat.com> <49D518E9.1090001@goop.org> <49D51CA9.6090601@redhat.com> <49D5215D.6050503@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <49D5215D.6050503@goop.org> Sender: owner-linux-mm@kvack.org To: Jeremy Fitzhardinge Cc: Rik van Riel , akpm@osdl.org, Nick Piggin , frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, linux-mm@kvack.org, hugh@veritas.com, Xen-devel List-Id: virtualization@lists.linuxfoundation.org On Thu, 02 Apr 2009 13:34:37 -0700 Jeremy Fitzhardinge wrote: > Rik van Riel wrote: > > Jeremy Fitzhardinge wrote: > >> The more complex host policy decisions of how to balance overall > >> memory use system-wide are much in the same for both mechanisms. > > Not at all. Page hinting is just an optimization to host swapping, where > > IO can be avoided on many of the pages that hit the end of the LRU. 
> > > > No decisions have to be made at all about balancing memory use > > between guests, it just happens through regular host LRU aging. > > When the host pages out a page belonging to guest A, then its making a > policy decision on how large guest A should be compared to B. If the > policy is a global LRU on all guest pages, then that's still a policy on > guest sizes: the target size is a function of its working set, assuming > that the working set is well modelled by LRU. I imagine that if the > guest and host are both managing their pages with an LRU-like algorithm > you'll get some nasty interactions, which page hinting tries to alleviate. This is the basic idea of guest page hinting. Let the host memory manager make its decision based on the data it has. That includes page age determined with a global LRU list, page age determined with a per-guest LRU list, i/o rates of the guests, and any kind of policy on which guest should have how much memory. The page hinting comes into play AFTER the decision has been made which page to evict. Only then should the host look at the volatile vs. stable page state and decide what has to be done with the page. If it is volatile the host can throw the page away because the guest can recreate it with LESS effort. That is the optimization. > > Automatic ballooning requires that something on the host figures > > out how much memory each guest needs and sizes the guests > > appropriately. All the proposed policies for that which I have > > seen have some nasty corner cases or are simply very limited > > in scope. > > Well, you could apply something equivalent to a global LRU: ask for more > pages from guests who have the most unused pages. (I'm not saying that > its necessarily a useful policy.) But with page hinting you don't have to even ask. Just take the pages if you need them. The guest already told you that you can have them by setting the unused state. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
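To make the ordering described in the message above concrete, here is a minimal C sketch. It is not taken from the patches or from z/VM. It assumes a plain oldest-first scan as a stand-in for whatever policy the host really uses, and struct host_page, pick_lru_victim() and evict_one() are hypothetical names. The hint is read only after the victim has been chosen, and in this simplified model it only decides whether a write to the paging device is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration; not identifiers from the patch set. */
enum hint_state { HINT_UNUSED, HINT_STABLE, HINT_VOLATILE };

struct host_page {
    unsigned long last_use;   /* smaller value = older page */
    enum hint_state hint;     /* state set by the guest */
};

struct evict_result {
    size_t victim;            /* index of the evicted page */
    int writes;               /* paging-device writes needed: 0 or 1 */
};

/* Step 1: choose the victim by age only. The hint is not consulted. */
static size_t pick_lru_victim(const struct host_page *pages, size_t n)
{
    size_t oldest = 0;

    assert(n > 0);
    for (size_t i = 1; i < n; i++)
        if (pages[i].last_use < pages[oldest].last_use)
            oldest = i;
    return oldest;
}

/*
 * Step 2: only now look at the hint. In this model a stable page costs
 * one write; unused and volatile pages are dropped without I/O.
 */
static struct evict_result evict_one(const struct host_page *pages, size_t n)
{
    struct evict_result r;

    r.victim = pick_lru_victim(pages, n);
    r.writes = (pages[r.victim].hint == HINT_STABLE) ? 1 : 0;
    return r;
}
```

With last_use values 30, 10 and 20, the page with last_use 10 is the victim regardless of its hint; marking it volatile instead of stable changes only the write count from 1 to 0.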
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeremy Fitzhardinge Subject: Re: [patch 0/6] Guest page hinting version 7. Date: Fri, 03 Apr 2009 11:19:24 -0700 Message-ID: <49D6532C.6010804@goop.org> References: <20090327150905.819861420@de.ibm.com> <200903281705.29798.rusty@rustcorp.com.au> <20090329162336.7c0700e9@skybase> <200904022232.02185.nickpiggin@yahoo.com.au> <20090402175249.3c4a6d59@skybase> <49D50CB7.2050705@redhat.com> <49D518E9.1090001@goop.org> <49D51CA9.6090601@redhat.com> <49D5215D.6050503@goop.org> <20090403104913.29c62082@skybase> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20090403104913.29c62082@skybase> Sender: owner-linux-mm@kvack.org To: Martin Schwidefsky Cc: Rik van Riel , akpm@osdl.org, Nick Piggin , frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, linux-mm@kvack.org, hugh@veritas.com, Xen-devel List-Id: virtualization@lists.linuxfoundation.org Martin Schwidefsky wrote: > This is the basic idea of guest page hinting. Let the host memory > manager make it decision based on the data it has. That includes page > age determined with a global LRU list, page age determined with a > per-guest LRU list, i/o rates of the guests, all kind of policy which > guest should have how much memory. Do you look at fault rates? Refault rates? > The page hinting comes into play > AFTER the decision has been made which page to evict. Only then the host > should look at the volatile vs. stable page state and decide what has > to be done with the page. If it is volatile the host can throw the page > away because the guest can recreate it with LESS effort. That is the > optimization. 
> Yes, and its good from that perspective. Do you really implement it purely that way, or do you bias the LRU to push volatile and free pages down the end of the LRU list in preference to pages which must be preserved? If you have a small bias then you can prefer to evict easily evictable pages compared to their near-equivalents which require IO. > But with page hinting you don't have to even ask. Just take the pages > if you need them. The guest already told you that you can have them by > setting the unused state. > Yes. But it still depends on the guest. A very helpful guest could deliberately preswap pages so that it can mark them as volatile, whereas a less helpful one may keep them persistent and defer preswapping them until there's a good reason to do so. Host swapping and page hinting won't put any apparent memory pressure on the guest, so it has no reason to start preswapping even if the overall system is under pressure. Ballooning will expose each guest to its share of the overall system memory pressure, so they can respond appropriately (one hopes). J -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Martin Schwidefsky Subject: Re: [patch 0/6] Guest page hinting version 7. 
Date: Mon, 6 Apr 2009 09:21:11 +0200 Message-ID: <20090406092111.3b432edd@skybase> References: <20090327150905.819861420@de.ibm.com> <200903281705.29798.rusty@rustcorp.com.au> <20090329162336.7c0700e9@skybase> <200904022232.02185.nickpiggin@yahoo.com.au> <20090402175249.3c4a6d59@skybase> <49D50CB7.2050705@redhat.com> <49D518E9.1090001@goop.org> <49D51CA9.6090601@redhat.com> <49D5215D.6050503@goop.org> <20090403104913.29c62082@skybase> <49D6532C.6010804@goop.org> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <49D6532C.6010804@goop.org> Sender: owner-linux-mm@kvack.org To: Jeremy Fitzhardinge Cc: Rik van Riel , akpm@osdl.org, Nick Piggin , frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, linux-mm@kvack.org, hugh@veritas.com, Xen-devel List-Id: virtualization@lists.linuxfoundation.org On Fri, 03 Apr 2009 11:19:24 -0700 Jeremy Fitzhardinge wrote: > Martin Schwidefsky wrote: > > This is the basic idea of guest page hinting. Let the host memory > > manager make it decision based on the data it has. That includes page > > age determined with a global LRU list, page age determined with a > > per-guest LRU list, i/o rates of the guests, all kind of policy which > > guest should have how much memory. > > Do you look at fault rates? Refault rates? That is hidden in the memory management of z/VM. I know some details how the z/VM page manager works but in the end I don't care as the guest operating system. > > The page hinting comes into play > > AFTER the decision has been made which page to evict. Only then the host > > should look at the volatile vs. stable page state and decide what has > > to be done with the page. If it is volatile the host can throw the page > > away because the guest can recreate it with LESS effort. That is the > > optimization. > > > > Yes, and its good from that perspective. 
Do you really implement it > purely that way, or do you bias the LRU to push volatile and free pages > down the end of the LRU list in preference to pages which must be > preserved? If you have a small bias then you can prefer to evict easily > evictable pages compared to their near-equivalents which require IO. We thought about a bias to prefer volatile pages but in the end decided against it. We do prefer free pages: if the page manager finds an unused page it will reuse it immediately. > > But with page hinting you don't have to even ask. Just take the pages > > if you need them. The guest already told you that you can have them by > > setting the unused state. > > > > Yes. But it still depends on the guest. A very helpful guest could > deliberately preswap pages so that it can mark them as volatile, whereas > a less helpful one may keep them persistent and defer preswapping them > until there's a good reason to do so. Host swapping and page hinting > won't put any apparent memory pressure on the guest, so it has no reason > to start preswapping even if the overall system is under pressure. > Ballooning will expose each guest to its share of the overall system > memory pressure, so they can respond appropriately (one hopes). Why should the guest want to do preswapping? It is as expensive for the host to swap a page and get it back as it is for the guest (= one write + one read). It is a waste of cpu time to call into the guest. You need something we call PFAULT though: if a guest process hits a page that is missing in the host page table you don't want to stop the virtual cpu until the page is back. You notify the guest that the host page is missing. The process that caused the fault is put to sleep until the host has retrieved the page again. You will find the pfault code for s390 in arch/s390/mm/fault.c
So to me preswap doesn't make sense.
The only thing you can gain by putting memory pressure on the guest is to free some of the memory that is used by the kernel for dentries, inodes, etc. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Nick Piggin Subject: Re: [patch 0/6] Guest page hinting version 7. Date: Mon, 6 Apr 2009 17:32:39 +1000 Message-ID: <200904061732.39885.nickpiggin@yahoo.com.au> References: <20090327150905.819861420@de.ibm.com> <49D6532C.6010804@goop.org> <20090406092111.3b432edd@skybase> Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20090406092111.3b432edd@skybase> Content-Disposition: inline List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xensource.com Errors-To: xen-devel-bounces@lists.xensource.com To: Martin Schwidefsky Cc: akpm@osdl.org, Jeremy Fitzhardinge , Xen-devel , linux-mm@kvack.org, frankeh@watson.ibm.com, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, hugh@veritas.com, virtualization@lists.linux-foundation.org List-Id: virtualization@lists.linuxfoundation.org On Monday 06 April 2009 17:21:11 Martin Schwidefsky wrote: > On Fri, 03 Apr 2009 11:19:24 -0700 > > Yes. But it still depends on the guest. A very helpful guest could > > deliberately preswap pages so that it can mark them as volatile, whereas > > a less helpful one may keep them persistent and defer preswapping them > > until there's a good reason to do so. Host swapping and page hinting > > won't put any apparent memory pressure on the guest, so it has no reason > > to start preswapping even if the overall system is under pressure. 
> > Ballooning will expose each guest to its share of the overall system > > memory pressure, so they can respond appropriately (one hopes). > > Why should the guest want to do preswapping? It is as expensive for > the host to swap a page and get it back as it is for the guest (= one > write + one read). It is a waste of cpu time to call into the guest. You > need something we call PFAULT though: if a guest process hits a page > that is missing in the host page table you don't want to stop the > virtual cpu until the page is back. You notify the guest that the host > page is missing. The process that caused the fault is put to sleep > until the host retrieved the page again. You will find the pfault code > for s390 in arch/s390/mm/fault.c > > So to me preswap doesn't make sense. The only thing you can gain by > putting memory pressure on the guest is to free some of the memory that > is used by the kernel for dentries, inodes, etc. The guest kernel can have more context about usage patterns, or user hints set on some pages or ranges. And as you say, there are non-pagecache things to free that can be taking significant or most of the freeable memory, and there can be policy knobs set in the guest (swappiness or vfs_cache_pressure etc). I guess that counters or performance monitoring events in the guest should also look more like a normal Linux kernel (although I haven't remembered what you do in that department in your patches). From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeremy Fitzhardinge Subject: Re: [patch 0/6] Guest page hinting version 7. 
Date: Mon, 06 Apr 2009 12:23:35 -0700 Message-ID: <49DA56B7.9020905@goop.org> References: <20090327150905.819861420@de.ibm.com> <200903281705.29798.rusty@rustcorp.com.au> <20090329162336.7c0700e9@skybase> <200904022232.02185.nickpiggin@yahoo.com.au> <20090402175249.3c4a6d59@skybase> <49D50CB7.2050705@redhat.com> <49D518E9.1090001@goop.org> <49D51CA9.6090601@redhat.com> <49D5215D.6050503@goop.org> <20090403104913.29c62082@skybase> <49D6532C.6010804@goop.org> <20090406092111.3b432edd@skybase> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20090406092111.3b432edd@skybase> Sender: owner-linux-mm@kvack.org To: Martin Schwidefsky Cc: Rik van Riel , akpm@osdl.org, Nick Piggin , frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, linux-mm@kvack.org, hugh@veritas.com, Xen-devel List-Id: virtualization@lists.linuxfoundation.org Martin Schwidefsky wrote: > Why should the guest want to do preswapping? It is as expensive for > the host to swap a page and get it back as it is for the guest (= one > write + one read). Yes, perhaps for swapping, but in general it makes sense for the guest to write the pages to backing store to prevent host swapping. For swap pages there's no big benefit, but for file-backed pages its better for the guest to do it. > The only thing you can gain by > putting memory pressure on the guest is to free some of the memory that > is used by the kernel for dentries, inodes, etc. > Well, that's also significant. My point is that the guest has multiple ways in which it can relieve its own memory pressure in response to overall system memory pressure; its just that I happened to pick the example where its much of a muchness. J -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753407AbZC0PKn (ORCPT ); Fri, 27 Mar 2009 11:10:43 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1756260AbZC0PKX (ORCPT ); Fri, 27 Mar 2009 11:10:23 -0400 Received: from mtagate8.de.ibm.com ([195.212.29.157]:57712 "EHLO mtagate8.de.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757265AbZC0PKU (ORCPT ); Fri, 27 Mar 2009 11:10:20 -0400 Message-Id: <20090327150905.819861420@de.ibm.com> User-Agent: quilt/0.46-1 Date: Fri, 27 Mar 2009 16:09:05 +0100 From: Martin Schwidefsky To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org Cc: frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com Subject: [patch 0/6] Guest page hinting version 7. Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Greetings, the circus is back in town -- another version of the guest page hinting patches. The patches differ from version 6 only in the kernel version, they apply against 2.6.29. My short sniff test showed that the code is still working as expected. To recap (you can skip this if you read the boiler plate of the last version of the patches): The main benefit for guest page hinting vs. the ballooner is that there is no need for a monitor that keeps track of the memory usage of all the guests, a complex algorithm that calculates the working set sizes and for the calls into the guest kernel to control the size of the balloons. The host just does normal LRU based paging. If the host picks one of the pages the guest can recreate, the host can throw it away instead of writing it to the paging device. Simple and elegant. The main disadvantage is the added complexity that is introduced to the guests memory management code to do the page state changes and to deal with discard faults. 
Right after booting the page states on my 256 MB z/VM guest looked like this (r=resident, p=preserved, z=zero, S=stable, U=unused, P=potentially volatile, V=volatile):

       |--tot--|---r---|---p---|---z---|
S      |  19719|  19673|      0|     46|
U      | 235416|   2734|      0| 232682|
P      |      1|      1|      0|      0|
V      |   7008|   7008|      0|      0|
tot->  | 262144|  29416|      0| 232728|

About 25% of the resident pages are in volatile state. After grepping through the linux source tree this picture changes:

       |--tot--|---r---|---p---|---z---|
S      |  43784|  43744|      0|     40|
U      |  78631|   2397|      0|  76234|
P      |      2|      2|      0|      0|
V      | 139727| 139727|      0|      0|
tot->  | 262144| 185870|      0|  76274|

About 75% of the resident pages are now volatile. Depending on the workload you will get different results.

--
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758595AbZC0PK6 (ORCPT ); Fri, 27 Mar 2009 11:10:58 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1757932AbZC0PK0 (ORCPT ); Fri, 27 Mar 2009 11:10:26 -0400 Received: from mtagate7.de.ibm.com ([195.212.29.156]:58453 "EHLO mtagate7.de.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757237AbZC0PKV (ORCPT ); Fri, 27 Mar 2009 11:10:21 -0400 Message-Id: <20090327151011.534224968@de.ibm.com> References: <20090327150905.819861420@de.ibm.com> User-Agent: quilt/0.46-1 Date: Fri, 27 Mar 2009 16:09:06 +0100 From: Martin Schwidefsky To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org Cc: frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com, Martin Schwidefsky Subject: [patch 1/6] Guest page hinting: core + volatile page cache.
Content-Disposition: inline; filename=001-hva-core.diff Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

From: Martin Schwidefsky
From: Hubertus Franke
From: Himanshu Raj

The guest page hinting patchset introduces code that passes guest page usage information to the host system that virtualizes the memory of its guests. There are three different page states:

* Unused: The page content is of no interest to the guest. The host can forget the page content and replace it with a page containing zeroes.
* Stable: The page content is needed by the guest and has to be preserved by the host.
* Volatile: The page content is useful to the guest but not essential. The host can discard the page but has to deliver a special kind of fault to the guest if the guest accesses a page discarded by the host.

The unused state is used for free pages; it allows the host to avoid the paging of empty pages. The default state for non-free pages is stable. The host can write stable pages to a swap device but has to restore the page if the guest accesses it. The volatile page state is used for clean uptodate page cache pages. The host can choose to discard volatile pages as part of its vmscan operation instead of writing them to the host's paging device. The guest system doesn't notice that a volatile page is gone until it tries to access the page or tries to make the page stable again. For a guest access to a discarded page the host generates a discard fault to notify the guest. The guest has to remove the page from the cache and reload the page from its backing device.

The volatile state is used for all page cache pages, even for pages which are referenced by writable ptes. The host needs to be able to check the dirty state of the pages. Since the host doesn't know where the page table entries of the guest are located, the volatile state as introduced by this patch is only usable on architectures with per-page dirty bits (s390 only).
For per-pte dirty bit architectures some additional code is needed, see patch #4.

The main question is where to put the state transitions between the volatile and the stable state. The simple solution is to make a page stable whenever a lookup is done or a page reference is derived from a page table entry. Attempts to make pages volatile are added at strategic points.

The conditions that prevent a page from being made volatile:
1) The page is reserved. Some sort of special page.
2) The page is marked dirty in the struct page. The page content is more recent than the data on the backing device. The host cannot access the linux internal dirty bit so the page needs to be stable.
3) The page is in writeback. The page content is needed for i/o.
4) The page is locked. Someone has exclusive access to the page.
5) The page is anonymous. Swap cache support needs additional code. See patch #2.
6) The page has no mapping. Without a backing the page cannot be recreated.
7) The page is not uptodate.
8) The page has private information. try_to_release_page can fail, e.g. in case the private information is journaling data. The discard fault needs to be able to remove the page.
9) The page is already discarded.
10) The page is not on the LRU list. The page has been isolated, some processing is done.
11) The page map count is not equal to the page reference count - 1. The discard fault handler can remove the page cache reference and all mappers of a page. It cannot remove the page reference for any other user of the page.

The transitions to stable are done by find_get_pages() and its variants, in follow_page if the FOLL_GET flag is set, by copy-on-write in do_wp_page, and by the early copy-on-write in do_no_page. For page cache pages this is always done with a call to page_make_stable().

To make enough pages discardable by the host an attempt to do the transition to volatile state is done at several places:
1) When a page gets unlocked (unlock_page).
2) When writeback has finished (test_clear_page_writeback).
3) When the page reference counter is decreased (in page_cache_release, which becomes put_page_check, and in __free_pages right before the put_page_testzero call).
4) When the map counter is increased (page_add_file_rmap).
5) When a page is moved from the active list to the inactive list.

The function for the state transitions to volatile is page_make_volatile().

The major obstacles that need to get addressed:

* Concurrent page state changes:
To guard against concurrent page state updates some kind of lock is needed. If page_make_volatile() has already done the 11 checks it will issue the state change primitive. If in the meantime one of the conditions has changed, the user that requires that page in stable state will have to wait in the page_make_stable() function until the make volatile operation has finished. It is up to the architecture to define how this is done with the three primitives page_test_set_state_change, page_clear_state_change and page_state_change. There are several ways to do this, e.g. a global lock, a lock per segment in the kernel page table, or the per page bit PG_arch_1 if it is still free.

* Page references acquired from page tables:
All page references acquired with find_get_page and friends can be used to access the page frame content. A page reference grabbed from a page table cannot be used to access the page content; the page has to be made stable first. If the make stable operation fails because the page has been discarded it has to be removed from page cache. That removes the page table entry as well.

* Page discard vs. __remove_from_page_cache race
A new page flag PG_discarded is added. This bit is set for discarded pages. It prevents multiple removes of a page from the page cache due to concurrent discard faults and/or normal page removals.
It also prevents the re-add of isolated pages to the lru list in vmscan if the page has been discarded while it was not on the lru list.

* Page discard vs. pte establish
The discard fault handler does three things: 1) set the PG_discarded bit for the page, 2) remove the page from all page tables and 3) remove the page from the page cache. All page references of the discarded page that are still around after step 2 may not be used to establish new mappings because step 3 clears the page->mapping field that is required to find the mappers. Code that establishes new ptes to pages that might be discarded has to check the PG_discarded bit. Step 2 has to check all possible locations for a pte of a particular page and check if the pte exists or another processor might be in the process of establishing one. To do that the page table lock for the pte is used. See page_unmap_all and the modified quick check in page_check_address for the details.

* copy_one_pte vs. discarded pages
The code that copies the page tables may not copy ptes for discarded pages because this races with the discard fault handler. copy_one_pte cannot back out either since there is no automatic repeat of the fault that caused the pte modification. Ptes to discarded pages only show up in copy_one_pte if a fork races with a discard fault. In this case copy_one_pte has to create a pte in the new page table that looks like the one that the discard fault handler would have created in the original page table if copy_one_pte had not grabbed the page table lock first.

* get_user_pages with FOLL_GET
If get_user_pages is called with a non-NULL pages argument the caller has to be able to access the page content using the references returned in the pages array. This is done with a check in follow_page for the FOLL_GET bit and a call to page_make_stable. If get_user_pages is called with NULL as the pages argument the pages are not made stable.
The caller cannot expect that the pages are available after the call because vmscan might have removed them.

* buffer heads / page_private
A page that is modified with sys_write will get a buffer-head to keep track of the dirty state. The existence of a buffer-head makes PagePrivate(page) return true. Pages with private information cannot be made volatile. Until the buffer-head is removed the page will stay stable. The standard logic is to call try_to_release_page which frees the buffer-head only if more than 10% of GFP_USER memory is used for buffer heads. Without high memory every page can have a buffer-head without running over the limit. The result is that every page written to with sys_write will stay stable until it is removed. To get these pages volatile again max_buffer_heads is set to zero (!) to force a call to try_to_release_page whenever a page is moved from the active to the inactive list.

* page_free_discarded hook
The architecture might want/need to do special things for discarded pages before they are freed. E.g. s390 has to delay the freeing of discarded pages. To allow this a hook is added to free_hot_cold_page.

Another noticeable change is that the first few lines of code in try_to_unmap_one that calculate the address from the page and the vma are moved out of try_to_unmap_one to the callers. This is done to make try_to_unmap_one usable for the removal of discarded pages in page_unmap_all.
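
The state transitions described in this message can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: every name in it (hint_state, model_page, the model_* functions) is hypothetical and not part of the patch, the flags are plain booleans instead of page flag bits, and the state change lock that guards concurrent updates is left out. It shows the 11 conditions that block the transition to volatile, the host discarding a volatile page, and the make stable operation failing afterwards.

```c
#include <stdbool.h>

/*
 * Hypothetical user-space model of the page states described above.
 * All names here are illustrative and do not appear in the patch.
 */
enum hint_state { HINT_UNUSED, HINT_STABLE, HINT_VOLATILE };

struct model_page {
	enum hint_state state;
	bool discarded;		/* host dropped the content while volatile */
	bool reserved, dirty, writeback, locked, anon;
	bool has_mapping, uptodate, has_private, on_lru;
	int mapcount;		/* number of page table mappings */
	int refcount;		/* total number of references */
};

/* A clean, uptodate, unmapped page cache page: one reference, the cache's. */
static struct model_page model_clean_cache_page(void)
{
	struct model_page p = {
		.state = HINT_STABLE,
		.has_mapping = true,
		.uptodate = true,
		.on_lru = true,
		.mapcount = 0,
		.refcount = 1,
	};
	return p;
}

/*
 * Conditions 1-11 from the list above. 'offset' is the number of known
 * extra references (1 for the page cache reference alone).
 */
static bool model_can_make_volatile(const struct model_page *p, int offset)
{
	if (p->reserved || p->dirty || p->writeback || p->locked || p->anon)
		return false;
	if (!p->has_mapping || !p->uptodate || p->has_private)
		return false;
	if (p->discarded || !p->on_lru)
		return false;
	return p->mapcount + offset == p->refcount;
}

/* The page stays in its current state if any condition blocks the change. */
static void model_make_volatile(struct model_page *p, int offset)
{
	if (model_can_make_volatile(p, offset))
		p->state = HINT_VOLATILE;
}

/* Host side: only volatile pages may be dropped without paging them out. */
static bool model_host_discard(struct model_page *p)
{
	if (p->state != HINT_VOLATILE || p->discarded)
		return false;
	p->discarded = true;
	return true;
}

/*
 * Returns true on success, false if the host discarded the page, which
 * matches how the callers in the diff test page_make_stable().
 */
static bool model_make_stable(struct model_page *p)
{
	if (p->discarded)
		return false;
	p->state = HINT_STABLE;
	return true;
}
```

In this model a clean page cache page with only the cache reference becomes volatile with model_make_volatile(&p, 1). Once model_host_discard() has succeeded, model_make_stable() returns false, which is the point where the real code calls page_discard() and reloads the page from its backing device. A dirty page, or one with an extra unknown reference, stays stable.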
Signed-off-by: Martin Schwidefsky --- fs/buffer.c | 12 ++ include/linux/mm.h | 1 include/linux/page-flags.h | 22 ++++ include/linux/page-states.h | 123 +++++++++++++++++++++++++++ include/linux/pagemap.h | 6 + mm/Makefile | 1 mm/filemap.c | 77 ++++++++++++++++- mm/memory.c | 58 ++++++++++++ mm/page-states.c | 198 ++++++++++++++++++++++++++++++++++++++++++++ mm/page-writeback.c | 2 mm/page_alloc.c | 11 ++ mm/rmap.c | 96 ++++++++++++++++++--- mm/swap.c | 35 ++++++- mm/vmscan.c | 50 ++++++++--- 14 files changed, 656 insertions(+), 36 deletions(-) Index: linux-2.6/fs/buffer.c =================================================================== --- linux-2.6.orig/fs/buffer.c +++ linux-2.6/fs/buffer.c @@ -3404,11 +3404,23 @@ void __init buffer_init(void) SLAB_MEM_SPREAD), init_buffer_head); +#ifdef CONFIG_PAGE_STATES + /* + * If volatile page cache is enabled we want to get as many + * pages into volatile state as possible. Pages with private + * information cannot be made volatile. Set max_buffer_heads + * to zero to make shrink_active_list release the private + * information when moving a page from the active to the inactive + * list.
+ */ + max_buffer_heads = 0; +#else /* * Limit the bh occupancy to 10% of ZONE_NORMAL */ nrpages = (nr_free_buffer_pages() * 10) / 100; max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head)); +#endif hotcpu_notifier(buffer_cpu_notify, 0); } Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h +++ linux-2.6/include/linux/mm.h @@ -319,6 +319,7 @@ static inline void init_page_count(struc } void put_page(struct page *page); +void put_page_check(struct page *page); void put_pages_list(struct list_head *pages); void split_page(struct page *page, unsigned int order); Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h +++ linux-2.6/include/linux/page-flags.h @@ -101,6 +101,9 @@ enum pageflags { #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR PG_uncached, /* Page has been mapped as uncached */ #endif +#ifdef CONFIG_PAGE_STATES + PG_discarded, /* Page discarded by the hypervisor. 
*/ +#endif __NR_PAGEFLAGS, /* Filesystems */ @@ -178,6 +181,17 @@ static inline void __ClearPage##uname(st #define TESTCLEARFLAG_FALSE(uname) \ static inline int TestClearPage##uname(struct page *page) { return 0; } +#ifdef CONFIG_PAGE_STATES +#define PageDiscarded(page) test_bit(PG_discarded, &(page)->flags) +#define ClearPageDiscarded(page) clear_bit(PG_discarded, &(page)->flags) +#define TestSetPageDiscarded(page) \ + test_and_set_bit(PG_discarded, &(page)->flags) +#else +#define PageDiscarded(page) 0 +#define ClearPageDiscarded(page) do { } while (0) +#define TestSetPageDiscarded(page) 0 +#endif + struct page; /* forward declaration */ TESTPAGEFLAG(Locked, locked) @@ -373,6 +387,12 @@ static inline void __ClearPageTail(struc #define __PG_MLOCKED 0 #endif +#ifdef CONFIG_PAGE_STATES +#define __PG_DISCARDED (1 << PG_discarded) +#else +#define __PG_DISCARDED 0 +#endif + /* * Flags checked when a page is freed. Pages being freed should not have * these flags set. It they are, there is a problem. @@ -381,7 +401,7 @@ static inline void __ClearPageTail(struc (1 << PG_lru | 1 << PG_private | 1 << PG_locked | \ 1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \ 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ - __PG_UNEVICTABLE | __PG_MLOCKED) + __PG_UNEVICTABLE | __PG_MLOCKED | __PG_DISCARDED) /* * Flags checked when a page is prepped for return by the page allocator. Index: linux-2.6/include/linux/page-states.h =================================================================== --- /dev/null +++ linux-2.6/include/linux/page-states.h @@ -0,0 +1,123 @@ +#ifndef _LINUX_PAGE_STATES_H +#define _LINUX_PAGE_STATES_H + +/* + * include/linux/page-states.h + * + * Copyright IBM Corp. 
2005, 2007 + * + * Authors: Martin Schwidefsky + * Hubertus Franke + * Himanshu Raj + */ + +#include + +#ifdef CONFIG_PAGE_STATES +/* + * Guest page hinting primitives that need to be defined in the + * architecture header file if PAGE_STATES=y: + * - page_host_discards: + * Indicates whether the host system discards guest pages or not. + * - page_set_unused: + * Indicates to the host that the page content is of no interest + * to the guest. The host can "forget" the page content and replace + * it with a page containing zeroes. + * - page_set_stable: + * Indicate to the host that the page content is needed by the guest. + * - page_set_volatile: + * Make the page discardable by the host. Instead of writing the + * page to the hosts swap device, the host can remove the page. + * A guest that accesses such a discarded page gets a special + * discard fault. + * - page_set_stable_if_present: + * The page state is set to stable if the page has not been discarded + * by the host. The check and the state change have to be done + * atomically. + * - page_discarded: + * Returns true if the page has been discarded by the host. + * - page_volatile: + * Returns true if the page is marked volatile. + * - page_test_set_state_change: + * Tries to lock the page for state change. The primitive does not need + * to have page granularity, it can lock a range of pages. + * - page_clear_state_change: + * Unlocks a page for state changes. + * - page_state_change: + * Returns true if the page is locked for state change. + * - page_free_discarded: + * Free a discarded page. This might require to put the page on a + * discard list and a synchronization over all cpus. Returns true + * if the architecture backend wants to do special things on free. 
+ */ +#include + +extern void page_unmap_all(struct page *page); +extern void page_discard(struct page *page); +extern int __page_make_stable(struct page *page); +extern void __page_make_volatile(struct page *page, int offset); +extern void __pagevec_make_volatile(struct pagevec *pvec); + +/* + * Extended guest page hinting functions defined by using the + * architecture primitives: + * - page_make_stable: + * Tries to make a page stable. This operation can fail if the + * host has discarded a page. The function returns 0 if the + * page could not be made stable. + * - page_make_volatile: + * Tries to make a page volatile. There are a number of conditions + * that prevent a page from becoming volatile. If at least one + * is true the function does nothing. See mm/page-states.c for + * details. + * - pagevec_make_volatile: + * Tries to make a vector of pages volatile. For each page in the + * vector the same conditions apply as for page_make_volatile. + * - page_discard: + * Removes a discarded page from the system. The page is removed + * from the LRU list and the radix tree of its mapping. + * page_discard uses page_unmap_all to remove all page table + * entries for a page. + */ + +static inline int page_make_stable(struct page *page) +{ + return page_host_discards() ?
__page_make_stable(page) : 1; +} + +static inline void page_make_volatile(struct page *page, int offset) +{ + if (page_host_discards()) + __page_make_volatile(page, offset); +} + +static inline void pagevec_make_volatile(struct pagevec *pvec) +{ + if (page_host_discards()) + __pagevec_make_volatile(pvec); +} + +#else + +#define page_host_discards() (0) +#define page_set_unused(_page,_order) do { } while (0) +#define page_set_stable(_page,_order) do { } while (0) +#define page_set_volatile(_page) do { } while (0) +#define page_set_stable_if_present(_page) (1) +#define page_discarded(_page) (0) +#define page_volatile(_page) (0) + +#define page_test_set_state_change(_page) (0) +#define page_clear_state_change(_page) do { } while (0) +#define page_state_change(_page) (0) + +#define page_free_discarded(_page) (0) + +#define page_make_stable(_page) (1) +#define page_make_volatile(_page, offset) do { } while (0) +#define pagevec_make_volatile(_pagevec) do { } while (0) +#define page_discard(_page) do { } while (0) + +#endif + +#endif /* _LINUX_PAGE_STATES_H */ Index: linux-2.6/include/linux/pagemap.h =================================================================== --- linux-2.6.orig/include/linux/pagemap.h +++ linux-2.6/include/linux/pagemap.h @@ -13,6 +13,7 @@ #include #include #include /* for in_interrupt() */ +#include /* * Bits in mapping->flags. 
The lower __GFP_BITS_SHIFT bits are the page @@ -89,7 +90,11 @@ static inline void mapping_set_gfp_mask( #define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK) #define page_cache_get(page) get_page(page) +#ifdef CONFIG_PAGE_STATES +#define page_cache_release(page) put_page_check(page) +#else #define page_cache_release(page) put_page(page) +#endif void release_pages(struct page **pages, int nr, int cold); /* @@ -436,6 +441,7 @@ int add_to_page_cache_lru(struct page *p pgoff_t index, gfp_t gfp_mask); extern void remove_from_page_cache(struct page *page); extern void __remove_from_page_cache(struct page *page); +extern void __remove_from_page_cache_nocheck(struct page *page); /* * Like add_to_page_cache_locked, but used to add newly allocated pages: Index: linux-2.6/mm/Makefile =================================================================== --- linux-2.6.orig/mm/Makefile +++ linux-2.6/mm/Makefile @@ -33,3 +33,4 @@ obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o +obj-$(CONFIG_PAGE_STATES) += page-states.o Index: linux-2.6/mm/filemap.c =================================================================== --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -34,6 +34,7 @@ #include /* for BUG_ON(!in_atomic()) only */ #include #include /* for page_is_file_cache() */ +#include #include "internal.h" /* @@ -112,7 +113,7 @@ * sure the page is locked and that nobody else uses it - or that usage * is safe. The caller must hold the mapping's tree_lock. 
*/ -void __remove_from_page_cache(struct page *page) +void inline __remove_from_page_cache_nocheck(struct page *page) { struct address_space *mapping = page->mapping; @@ -136,6 +137,28 @@ void __remove_from_page_cache(struct pag } } +void __remove_from_page_cache(struct page *page) +{ + /* + * Check if the discard fault handler already removed + * the page from the page cache. If not set the discard + * bit in the page flags to prevent double page free if + * a discard fault is racing with normal page free. + */ + if (TestSetPageDiscarded(page)) + return; + + __remove_from_page_cache_nocheck(page); + + /* + * Check the hardware page state and clear the discard + * bit in the page flags only if the page is not + * discarded. + */ + if (!page_discarded(page)) + ClearPageDiscarded(page); +} + void remove_from_page_cache(struct page *page) { struct address_space *mapping = page->mapping; @@ -581,6 +604,7 @@ void unlock_page(struct page *page) VM_BUG_ON(!PageLocked(page)); clear_bit_unlock(PG_locked, &page->flags); smp_mb__after_clear_bit(); + page_make_volatile(page, 1); wake_up_page(page, PG_locked); } EXPORT_SYMBOL(unlock_page); @@ -679,6 +703,15 @@ repeat: } rcu_read_unlock(); + if (page && unlikely(!page_make_stable(page))) { + /* + * The page has been discarded by the host. Run the + * discard handler and return NULL. + */ + page_discard(page); + page = NULL; + } + return page; } EXPORT_SYMBOL(find_get_page); @@ -783,6 +816,7 @@ unsigned find_get_pages(struct address_s unsigned int ret; unsigned int nr_found; +from_scratch: rcu_read_lock(); restart: nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree, @@ -812,6 +846,19 @@ repeat: pages[ret] = page; ret++; + + if (likely(page_make_stable(page))) + continue; + /* + * Make stable failed, we discard the page and retry the + * whole operation. 
+ */ + ret--; + rcu_read_unlock(); + page_discard(page); + while (ret--) + page_cache_release(pages[ret]); + goto from_scratch; } rcu_read_unlock(); return ret; @@ -836,6 +883,7 @@ unsigned find_get_pages_contig(struct ad unsigned int ret; unsigned int nr_found; +from_scratch: rcu_read_lock(); restart: nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree, @@ -869,6 +917,19 @@ repeat: pages[ret] = page; ret++; index++; + + if (likely(page_make_stable(page))) + continue; + /* + * Make stable failed, we discard the page and retry the + * whole operation. + */ + ret--; + rcu_read_unlock(); + page_discard(page); + while (ret--) + page_cache_release(pages[ret]); + goto from_scratch; } rcu_read_unlock(); return ret; @@ -893,6 +954,7 @@ unsigned find_get_pages_tag(struct addre unsigned int ret; unsigned int nr_found; +from_scratch: rcu_read_lock(); restart: nr_found = radix_tree_gang_lookup_tag_slot(&mapping->page_tree, @@ -922,6 +984,19 @@ repeat: pages[ret] = page; ret++; + + if (likely(page_make_stable(page))) + continue; + /* + * Make stable failed, we discard the page and retry the + * whole operation. 
+ */ + ret--; + rcu_read_unlock(); + page_discard(page); + while (ret--) + page_cache_release(pages[ret]); + goto from_scratch; } rcu_read_unlock(); Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -55,6 +55,7 @@ #include #include #include +#include #include #include @@ -594,6 +595,8 @@ copy_one_pte(struct mm_struct *dst_mm, s page = vm_normal_page(vma, addr, pte); if (page) { + if (unlikely(PageDiscarded(page))) + goto out_discard_pte; get_page(page); page_dup_rmap(page, vma, addr); rss[!!PageAnon(page)]++; @@ -601,6 +604,21 @@ copy_one_pte(struct mm_struct *dst_mm, s out_set_pte: set_pte_at(dst_mm, addr, dst_pte, pte); + return; + +out_discard_pte: + /* + * If the page referred by the pte has the PG_discarded bit set, + * copy_one_pte is racing with page_discard. The pte may not be + * copied or we can end up with a pte pointing to a page not + * in the page cache anymore. Do what try_to_unmap_one would do + * if the copy_one_pte had taken place before page_discard. + */ + if (page->index != linear_page_index(vma, addr)) + /* If nonlinear, store the file page offset in the pte. */ + set_pte_at(dst_mm, addr, dst_pte, pgoff_to_pte(page->index)); + else + pte_clear(dst_mm, addr, dst_pte); } static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, @@ -1147,6 +1165,19 @@ struct page *follow_page(struct vm_area_ if (flags & FOLL_GET) get_page(page); + + if (flags & FOLL_GET) { + /* + * The page is made stable if a reference is acquired. + * If the caller does not get a reference it implies that + * the caller can deal with page faults in case the page + * is swapped out. In this case the caller can deal with + * discard faults as well. 
+ */ + if (unlikely(!page_make_stable(page))) + goto out_discard; + } + if (flags & FOLL_TOUCH) { if ((flags & FOLL_WRITE) && !pte_dirty(pte) && !PageDirty(page)) @@ -1179,6 +1210,11 @@ no_page_table: BUG_ON(flags & FOLL_WRITE); } return page; + +out_discard: + pte_unmap_unlock(ptep, ptl); + page_discard(page); + return NULL; } /* Can we do the FOLL_ANON optimization? */ @@ -1969,6 +2005,11 @@ static int do_wp_page(struct mm_struct * dirty_page = old_page; get_page(dirty_page); reuse = 1; + /* + * dirty_page will be set dirty, so it needs to be stable. + */ + if (unlikely(!page_make_stable(dirty_page))) + goto discard; } if (reuse) { @@ -1986,6 +2027,12 @@ reuse: * Ok, we need to copy. Oh, well.. */ page_cache_get(old_page); + /* + * To copy the content of old_page it needs to be stable. + * page_cache_release on old_page will make it volatile again. + */ + if (unlikely(!page_make_stable(old_page))) + goto discard; gotten: pte_unmap_unlock(page_table, ptl); @@ -2100,6 +2147,10 @@ oom: unwritable_page: page_cache_release(old_page); return VM_FAULT_SIGBUS; +discard: + pte_unmap_unlock(page_table, ptl); + page_discard(old_page); + return VM_FAULT_MINOR; } /* @@ -2591,7 +2642,7 @@ static int __do_fault(struct mm_struct * vmf.pgoff = pgoff; vmf.flags = flags; vmf.page = NULL; - +retry: ret = vma->vm_ops->fault(vma, &vmf); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) return ret; @@ -2616,6 +2667,11 @@ static int __do_fault(struct mm_struct * ret = VM_FAULT_OOM; goto out; } + if (unlikely(!page_make_stable(vmf.page))) { + unlock_page(vmf.page); + page_discard(vmf.page); + goto retry; + } page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); if (!page) { Index: linux-2.6/mm/page-states.c =================================================================== --- /dev/null +++ linux-2.6/mm/page-states.c @@ -0,0 +1,198 @@ +/* + * mm/page-states.c + * + * (C) Copyright IBM Corp. 2005, 2007 + * + * Guest page hinting functions. 
+ * + * Authors: Martin Schwidefsky + * Hubertus Franke + * Himanshu Raj + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "internal.h" + +/* + * Check if there is anything in the page flags or the mapping + * that prevents the page from changing its state to volatile. + */ +static inline int check_bits(struct page *page) +{ + /* + * There are several conditions that prevent a page from becoming + * volatile. The first check is for the page bits. + */ + if (PageDirty(page) || PageReserved(page) || PageWriteback(page) || + PageLocked(page) || PagePrivate(page) || PageDiscarded(page) || + !PageUptodate(page) || !PageLRU(page) || PageAnon(page)) + return 0; + + /* + * If the page has been truncated there is no point in making + * it volatile. It will be freed soon. And if the mapping ever + * had locked pages all pages of the mapping will stay stable. + */ + return page_mapping(page) != NULL; +} + +/* + * Check the reference counter of the page against the number of + * mappings. The caller passes an offset, that is the number of + * extra, known references. The page cache itself is one extra + * reference. If the caller acquired an additional reference then + * the offset would be 2. If the page map counter is equal to the + * page count minus the offset then there is no other, unknown + * user of the page in the system. + */ +static inline int check_counts(struct page *page, unsigned int offset) +{ + return page_mapcount(page) + offset == page_count(page); +} + +/* + * Attempts to change the state of a page to volatile. + * If there is something preventing the state change the page stays + * in its current state. 
+ */ +void __page_make_volatile(struct page *page, int offset) +{ + preempt_disable(); + if (!page_test_set_state_change(page)) { + if (check_bits(page) && check_counts(page, offset)) + page_set_volatile(page); + page_clear_state_change(page); + } + preempt_enable(); +} +EXPORT_SYMBOL(__page_make_volatile); + +/* + * Attempts to change the state of a vector of pages to volatile. + * If there is something preventing the state change the page stays + * in its current state. + */ +void __pagevec_make_volatile(struct pagevec *pvec) +{ + struct page *page; + int i = pagevec_count(pvec); + + while (--i >= 0) { + /* + * If we can't get the state change bit just give up. + * The worst that can happen is that the page will stay + * in the stable state although it might be volatile. + */ + page = pvec->pages[i]; + if (!page_test_set_state_change(page)) { + if (check_bits(page) && check_counts(page, 1)) + page_set_volatile(page); + page_clear_state_change(page); + } + } +} +EXPORT_SYMBOL(__pagevec_make_volatile); + +/* + * Attempts to change the state of a page to stable. The host could + * have removed a volatile page, the page_set_stable_if_present call + * can fail. + * + * returns nonzero on success and 0 if the page has been discarded + */ +int __page_make_stable(struct page *page) +{ + /* + * Postpone state change to stable until the state change bit is + * cleared. As long as the state change bit is set another cpu + * is in page_make_volatile for this page. That makes sure that + * no caller of make_stable "overtakes" a make_volatile leaving + * the page in volatile where stable is required. + * The caller of make_stable needs to make sure that no caller + * of make_volatile can make the page volatile right after + * make_stable has finished.
+ */ + while (page_state_change(page)) + cpu_relax(); + return page_set_stable_if_present(page); +} +EXPORT_SYMBOL(__page_make_stable); + +/** + * __page_discard() - remove a discarded page from the cache + * + * @page: the page + * + * The page passed to this function needs to be locked. + */ +static void __page_discard(struct page *page) +{ + struct address_space *mapping; + struct zone *zone; + + /* Paranoia checks. */ + VM_BUG_ON(PageWriteback(page)); + VM_BUG_ON(PageDirty(page)); + VM_BUG_ON(PagePrivate(page)); + + /* Set the discarded bit early. */ + if (TestSetPageDiscarded(page)) + return; + + /* Unmap the page from all page tables. */ + page_unmap_all(page); + + /* Check if really all mappers of this page are gone. */ + VM_BUG_ON(page_mapcount(page) != 0); + + /* + * Remove the page from LRU if it is currently added. + * The users of isolate_lru_pages need to check the + * discarded bit before readding the page to the LRU. + */ + zone = page_zone(page); + spin_lock_irq(&zone->lru_lock); + if (PageLRU(page)) { + /* Unlink page from lru. */ + __ClearPageLRU(page); + del_page_from_lru(zone, page); + } + spin_unlock_irq(&zone->lru_lock); + + /* We can't handle swap cache pages (yet). */ + VM_BUG_ON(PageSwapCache(page)); + + /* Remove page from page cache. */ + mapping = page->mapping; + spin_lock_irq(&mapping->tree_lock); + __remove_from_page_cache_nocheck(page); + spin_unlock_irq(&mapping->tree_lock); + __put_page(page); +} + +/** + * page_discard() - remove a discarded page from the cache + * + * @page: the page + * + * Before calling this function an additional page reference needs to + * be acquired. This reference is released by the function. 
+ */ +void page_discard(struct page *page) +{ + lock_page(page); + __page_discard(page); + unlock_page(page); + page_cache_release(page); +} +EXPORT_SYMBOL(page_discard); Index: linux-2.6/mm/page-writeback.c =================================================================== --- linux-2.6.orig/mm/page-writeback.c +++ linux-2.6/mm/page-writeback.c @@ -34,6 +34,7 @@ #include #include #include +#include /* * The maximum number of pages to writeout in a single bdflush/kupdate @@ -1390,6 +1391,7 @@ int test_clear_page_writeback(struct pag radix_tree_tag_clear(&mapping->page_tree, page_index(page), PAGECACHE_TAG_WRITEBACK); + page_make_volatile(page, 1); if (bdi_cap_account_writeback(bdi)) { __dec_bdi_stat(bdi, BDI_WRITEBACK); __bdi_writeout_inc(bdi); Index: linux-2.6/mm/page_alloc.c =================================================================== --- linux-2.6.orig/mm/page_alloc.c +++ linux-2.6/mm/page_alloc.c @@ -46,6 +46,7 @@ #include #include #include +#include #include #include @@ -553,6 +554,7 @@ static void __free_pages_ok(struct page bad += free_pages_check(page + i); if (bad) return; + page_set_unused(page, order); if (!PageHighMem(page)) { debug_check_no_locks_freed(page_address(page),PAGE_SIZE<mapping = NULL; if (free_pages_check(page)) return; + page_set_unused(page, 0); if (!PageHighMem(page)) { debug_check_no_locks_freed(page_address(page), PAGE_SIZE); @@ -1119,6 +1127,7 @@ again: put_cpu(); VM_BUG_ON(bad_range(zone, page)); + page_set_stable(page, order); if (prep_new_page(page, order, gfp_flags)) goto again; return page; @@ -1714,6 +1723,8 @@ void __pagevec_free(struct pagevec *pvec void __free_pages(struct page *page, unsigned int order) { + if (page_count(page) > 1) + page_make_volatile(page, 2); if (put_page_testzero(page)) { if (order == 0) free_hot_page(page); Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -50,6 +50,7 @@ #include #include 
#include +#include #include @@ -286,13 +287,25 @@ pte_t *page_check_address(struct page *p return NULL; pte = pte_offset_map(pmd, address); + ptl = pte_lockptr(mm, pmd); /* Make a quick check before getting the lock */ +#ifndef CONFIG_PAGE_STATES + /* + * If the page table lock for this pte is taken we have to + * assume that someone might be mapping the page. To solve + * the race of a page discard vs. mapping the page we have + * to serialize the two operations by taking the lock, + * otherwise we end up with a pte for a page that has been + * removed from page cache by the discard fault handler. + * So for CONFIG_PAGE_STATES=yes the !pte_present optimization + * need to be deactivated. + */ if (!sync && !pte_present(*pte)) { pte_unmap(pte); return NULL; } +#endif - ptl = pte_lockptr(mm, pmd); spin_lock(ptl); if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) { *ptlp = ptl; @@ -690,6 +703,7 @@ void page_add_file_rmap(struct page *pag { if (atomic_inc_and_test(&page->_mapcount)) __inc_zone_page_state(page, NR_FILE_MAPPED); + page_make_volatile(page, 1); } #ifdef CONFIG_DEBUG_VM @@ -755,19 +769,14 @@ void page_remove_rmap(struct page *page) * repeatedly from either try_to_unmap_anon or try_to_unmap_file. */ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, - int migration) + unsigned long address, int migration) { struct mm_struct *mm = vma->vm_mm; - unsigned long address; pte_t *pte; pte_t pteval; spinlock_t *ptl; int ret = SWAP_AGAIN; - address = vma_address(page, vma); - if (address == -EFAULT) - goto out; - pte = page_check_address(page, mm, address, &ptl, 0); if (!pte) goto out; @@ -831,9 +840,14 @@ static int try_to_unmap_one(struct page swp_entry_t entry; entry = make_migration_entry(page, pte_write(pteval)); set_pte_at(mm, address, pte, swp_entry_to_pte(entry)); - } else + } else { +#ifdef CONFIG_PAGE_STATES + /* If nonlinear, store the file page offset in the pte. 
*/ + if (page->index != linear_page_index(vma, address)) + set_pte_at(mm, address, pte, pgoff_to_pte(page->index)); +#endif dec_mm_counter(mm, file_rss); - + } page_remove_rmap(page); page_cache_release(page); @@ -1000,6 +1014,7 @@ static int try_to_unmap_anon(struct page struct anon_vma *anon_vma; struct vm_area_struct *vma; unsigned int mlocked = 0; + unsigned long address; int ret = SWAP_AGAIN; if (MLOCK_PAGES && unlikely(unlock)) @@ -1016,7 +1031,10 @@ static int try_to_unmap_anon(struct page continue; /* must visit all unlocked vmas */ ret = SWAP_MLOCK; /* saw at least one mlocked vma */ } else { - ret = try_to_unmap_one(page, vma, migration); + address = vma_address(page, vma); + if (address == -EFAULT) + continue; + ret = try_to_unmap_one(page, vma, address, migration); if (ret == SWAP_FAIL || !page_mapped(page)) break; } @@ -1060,6 +1078,7 @@ static int try_to_unmap_file(struct page struct vm_area_struct *vma; struct prio_tree_iter iter; int ret = SWAP_AGAIN; + unsigned long address; unsigned long cursor; unsigned long max_nl_cursor = 0; unsigned long max_nl_size = 0; @@ -1077,7 +1096,10 @@ static int try_to_unmap_file(struct page continue; /* must visit all vmas */ ret = SWAP_MLOCK; } else { - ret = try_to_unmap_one(page, vma, migration); + address = vma_address(page, vma); + if (address == -EFAULT) + continue; + ret = try_to_unmap_one(page, vma, address, migration); if (ret == SWAP_FAIL || !page_mapped(page)) goto out; } @@ -1227,3 +1249,55 @@ int try_to_munlock(struct page *page) return try_to_unmap_file(page, 1, 0); } #endif + +#ifdef CONFIG_PAGE_STATES + +/** + * page_unmap_all - removes all mappings of a page + * + * @page: the page which mapping in the vma should be struck down + * + * the caller needs to hold page lock + */ +void page_unmap_all(struct page* page) +{ + struct address_space *mapping = page_mapping(page); + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); + struct vm_area_struct *vma; + struct prio_tree_iter iter; + 
unsigned long address; + int rc; + + VM_BUG_ON(!PageLocked(page) || PageReserved(page) || PageAnon(page)); + + spin_lock(&mapping->i_mmap_lock); + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { + address = vma_address(page, vma); + if (address == -EFAULT) + continue; + rc = try_to_unmap_one(page, vma, address, 0); + VM_BUG_ON(rc == SWAP_FAIL); + } + + if (list_empty(&mapping->i_mmap_nonlinear)) + goto out; + + /* + * Remove the non-linear mappings of the page. This is + * awfully slow, but we have to find that discarded page.. + */ + list_for_each_entry(vma, &mapping->i_mmap_nonlinear, + shared.vm_set.list) { + address = vma->vm_start; + while (address < vma->vm_end) { + rc = try_to_unmap_one(page, vma, address, 0); + VM_BUG_ON(rc == SWAP_FAIL); + address += PAGE_SIZE; + } + } + +out: + spin_unlock(&mapping->i_mmap_lock); +} + +#endif Index: linux-2.6/mm/swap.c =================================================================== --- linux-2.6.orig/mm/swap.c +++ linux-2.6/mm/swap.c @@ -30,6 +30,7 @@ #include #include #include +#include #include "internal.h" @@ -78,6 +79,16 @@ void put_page(struct page *page) } EXPORT_SYMBOL(put_page); +#ifdef CONFIG_PAGE_STATES +void put_page_check(struct page *page) +{ + if (page_count(page) > 1) + page_make_volatile(page, 2); + put_page(page); +} +EXPORT_SYMBOL(put_page_check); +#endif + /** * put_pages_list() - release a list of pages * @pages: list of pages threaded on page->lru @@ -421,14 +432,22 @@ void ____pagevec_lru_add(struct pagevec } VM_BUG_ON(PageActive(page)); VM_BUG_ON(PageUnevictable(page)); - VM_BUG_ON(PageLRU(page)); - SetPageLRU(page); - active = is_active_lru(lru); - file = is_file_lru(lru); - if (active) - SetPageActive(page); - update_page_reclaim_stat(zone, page, file, active); - add_page_to_lru_list(zone, page, lru); + /* + * Only add the page to lru list if it has not + * been discarded. 
+ */ + if (likely(!PageDiscarded(page))) { + VM_BUG_ON(PageLRU(page)); + SetPageLRU(page); + active = is_active_lru(lru); + file = is_file_lru(lru); + if (active) + SetPageActive(page); + update_page_reclaim_stat(zone, page, file, active); + add_page_to_lru_list(zone, page, lru); + page_make_volatile(page, 2); + } else + ClearPageActive(page); } if (zone) spin_unlock_irq(&zone->lru_lock); Index: linux-2.6/mm/vmscan.c =================================================================== --- linux-2.6.orig/mm/vmscan.c +++ linux-2.6/mm/vmscan.c @@ -40,6 +40,7 @@ #include #include #include +#include #include #include @@ -606,6 +607,9 @@ static unsigned long shrink_page_list(st if (unlikely(!page_evictable(page, NULL))) goto cull_mlocked; + if (unlikely(PageDiscarded(page))) + goto free_it; + if (!sc->may_swap && page_mapped(page)) goto keep_locked; @@ -1152,13 +1156,20 @@ static unsigned long shrink_inactive_lis spin_lock_irq(&zone->lru_lock); continue; } - SetPageLRU(page); - lru = page_lru(page); - add_page_to_lru_list(zone, page, lru); - if (PageActive(page)) { - int file = !!page_is_file_cache(page); - reclaim_stat->recent_rotated[file]++; - } + /* + * Only readd the page to lru list if it has not + * been discarded. 
+ */ + if (likely(!PageDiscarded(page))) { + SetPageLRU(page); + lru = page_lru(page); + add_page_to_lru_list(zone, page, lru); + if (PageActive(page)) { + int file = !!page_is_file_cache(page); + reclaim_stat->recent_rotated[file]++; + } + } else + ClearPageActive(page); if (!pagevec_add(&pvec, page)) { spin_unlock_irq(&zone->lru_lock); __pagevec_release(&pvec); @@ -1278,13 +1289,22 @@ static void shrink_active_list(unsigned page = lru_to_page(&l_inactive); prefetchw_prev_lru_page(page, &l_inactive, flags); VM_BUG_ON(PageLRU(page)); - SetPageLRU(page); - VM_BUG_ON(!PageActive(page)); - ClearPageActive(page); - - list_move(&page->lru, &zone->lru[lru].list); - mem_cgroup_add_lru_list(page, lru); - pgmoved++; + /* + * Only readd the page to lru list if it has not + * been discarded. + */ + if (likely(!PageDiscarded(page))) { + SetPageLRU(page); + VM_BUG_ON(!PageActive(page)); + ClearPageActive(page); + list_move(&page->lru, &zone->lru[lru].list); + mem_cgroup_add_lru_list(page, lru); + pgmoved++; + } else { + ClearPageActive(page); + list_del(&page->lru); + } + if (!pagevec_add(&pvec, page)) { __mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved); spin_unlock_irq(&zone->lru_lock); @@ -1292,6 +1312,7 @@ static void shrink_active_list(unsigned pgmoved = 0; if (buffer_heads_over_limit) pagevec_strip(&pvec); + pagevec_make_volatile(&pvec); __pagevec_release(&pvec); spin_lock_irq(&zone->lru_lock); } @@ -1301,6 +1322,7 @@ static void shrink_active_list(unsigned if (buffer_heads_over_limit) { spin_unlock_irq(&zone->lru_lock); pagevec_strip(&pvec); + pagevec_make_volatile(&pvec); spin_lock_irq(&zone->lru_lock); } __count_zone_vm_events(PGREFILL, zone, pgscanned); -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. 
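The handshake between __page_make_volatile() and __page_make_stable() in the mm/page-states.c hunk of the patch above may be easier to follow as a small userspace model. The sketch below is not kernel code: struct model_page and every model_* name are hypothetical stand-ins, C11 atomics replace the PG_arch_1 bit and the ESSA page state, and the busy flag stands in for check_bits()/check_counts(). The return convention assumed here (nonzero on success) follows how the callers in this series use page_make_stable().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the guest/host page state. */
enum model_state { MODEL_STABLE, MODEL_VOLATILE, MODEL_DISCARDED };

/* Hypothetical stand-in for struct page; these fields do not exist in the kernel. */
struct model_page {
	atomic_bool state_change;	/* plays the role of the PG_arch_1 bit */
	atomic_int state;		/* plays the role of the ESSA page state */
	bool busy;			/* stands in for check_bits()/check_counts() failing */
};

static void model_page_init(struct model_page *p)
{
	atomic_init(&p->state_change, false);
	atomic_init(&p->state, MODEL_STABLE);
	p->busy = false;
}

/* Best effort: if the state change bit is already taken, just give up. */
static void model_make_volatile(struct model_page *p)
{
	if (atomic_exchange(&p->state_change, true))
		return;
	if (!p->busy)
		atomic_store(&p->state, MODEL_VOLATILE);
	atomic_store(&p->state_change, false);
}

/* The "host" may drop a volatile page at any time. */
static void model_host_discard(struct model_page *p)
{
	int expected = MODEL_VOLATILE;

	atomic_compare_exchange_strong(&p->state, &expected, MODEL_DISCARDED);
}

/*
 * Wait out a concurrent make_volatile, then try to go stable.
 * Returns 1 on success and 0 if the page has been discarded.
 */
static int model_make_stable(struct model_page *p)
{
	int expected = MODEL_VOLATILE;

	while (atomic_load(&p->state_change))
		;	/* the kernel code calls cpu_relax() here */
	if (atomic_compare_exchange_strong(&p->state, &expected, MODEL_STABLE))
		return 1;
	return expected == MODEL_STABLE;	/* already stable is fine too */
}

/* Walk one page through volatile -> discarded and report what make_stable sees. */
static int model_example(void)
{
	struct model_page page;

	model_page_init(&page);
	model_make_volatile(&page);
	assert(atomic_load(&page.state) == MODEL_VOLATILE);
	model_host_discard(&page);
	return model_make_stable(&page);	/* 0: content is gone, caller must reload */
}
```

The model keeps the asymmetry of the original: make_volatile is opportunistic and silently gives up, while make_stable must wait because its caller depends on the result.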
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758621AbZC0PLZ (ORCPT ); Fri, 27 Mar 2009 11:11:25 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1757973AbZC0PK2 (ORCPT ); Fri, 27 Mar 2009 11:10:28 -0400 Received: from mtagate4.de.ibm.com ([195.212.29.153]:55031 "EHLO mtagate4.de.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753407AbZC0PKW (ORCPT ); Fri, 27 Mar 2009 11:10:22 -0400 Message-Id: <20090327151011.798602788@de.ibm.com> References: <20090327150905.819861420@de.ibm.com> User-Agent: quilt/0.46-1 Date: Fri, 27 Mar 2009 16:09:07 +0100 From: Martin Schwidefsky To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org Cc: frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com, Martin Schwidefsky Subject: [patch 2/6] Guest page hinting: volatile swap cache. Content-Disposition: inline; filename=002-hva-swap.diff Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Martin Schwidefsky From: Hubertus Franke From: Himanshu Raj The volatile page state can be used for anonymous pages as well, if they have been added to the swap cache and the swap write is finished. The tricky bit is in free_swap_and_cache. The call to find_get_page dead-locks with the discard handler. If the page has been discarded find_get_page will try to remove it. To do that it needs the page table lock of all mappers but one is held by the caller of free_swap_and_cache. A special variant of find_get_page is needed that does not check the page state and returns a page reference even if the page is discarded. The second pitfall is that the page needs to be made stable before the swap slot gets freed. 
If the page cannot be made stable because it has been discarded the swap slot may not be freed because it is still needed to reload the discarded page from the swap device. Signed-off-by: Martin Schwidefsky --- include/linux/pagemap.h | 3 ++ include/linux/swap.h | 5 ++++ mm/filemap.c | 39 ++++++++++++++++++++++++++++++++++++ mm/memory.c | 13 +++++++++++- mm/page-states.c | 34 +++++++++++++++++++++++--------- mm/rmap.c | 51 ++++++++++++++++++++++++++++++++++++++++++++---- mm/swap_state.c | 25 ++++++++++++++++++++++- mm/swapfile.c | 24 +++++++++++++++++++--- 8 files changed, 176 insertions(+), 18 deletions(-) Index: linux-2.6/include/linux/pagemap.h =================================================================== --- linux-2.6.orig/include/linux/pagemap.h +++ linux-2.6/include/linux/pagemap.h @@ -91,8 +91,11 @@ static inline void mapping_set_gfp_mask( #define page_cache_get(page) get_page(page) #ifdef CONFIG_PAGE_STATES +extern struct page * find_get_page_nodiscard(struct address_space *mapping, + unsigned long index); #define page_cache_release(page) put_page_check(page) #else +#define find_get_page_nodiscard(mapping, index) find_get_page(mapping, index) #define page_cache_release(page) put_page(page) #endif void release_pages(struct page **pages, int nr, int cold); Index: linux-2.6/include/linux/swap.h =================================================================== --- linux-2.6.orig/include/linux/swap.h +++ linux-2.6/include/linux/swap.h @@ -285,6 +285,7 @@ extern void show_swap_cache_info(void); extern int add_to_swap(struct page *); extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t); extern void __delete_from_swap_cache(struct page *); +extern void __delete_from_swap_cache_nocheck(struct page *); extern void delete_from_swap_cache(struct page *); extern void free_page_and_swap_cache(struct page *); extern void free_pages_and_swap_cache(struct page **, int); @@ -402,6 +403,10 @@ static inline void __delete_from_swap_ca { } +static inline void 
__delete_from_swap_cache_nocheck(struct page *page) +{ +} + static inline void delete_from_swap_cache(struct page *page) { } Index: linux-2.6/mm/filemap.c =================================================================== --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -555,6 +555,45 @@ static int __sleep_on_page_lock(void *wo return 0; } +#ifdef CONFIG_PAGE_STATES + +struct page * find_get_page_nodiscard(struct address_space *mapping, + unsigned long offset) +{ + void **pagep; + struct page *page; + + rcu_read_lock(); +repeat: + page = NULL; + pagep = radix_tree_lookup_slot(&mapping->page_tree, offset); + if (pagep) { + page = radix_tree_deref_slot(pagep); + if (unlikely(!page || page == RADIX_TREE_RETRY)) + goto repeat; + + if (!page_cache_get_speculative(page)) + goto repeat; + + /* + * Has the page moved? + * This is part of the lockless pagecache protocol. See + * include/linux/pagemap.h for details. + */ + if (unlikely(page != *pagep)) { + page_cache_release(page); + goto repeat; + } + } + rcu_read_unlock(); + + return page; +} + +EXPORT_SYMBOL(find_get_page_nodiscard); + +#endif + /* * In order to wait for pages to become available there must be * waitqueues associated with pages. By using a hash table of Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -614,7 +614,18 @@ out_discard_pte: * in the page cache anymore. Do what try_to_unmap_one would do * if the copy_one_pte had taken place before page_discard. 
*/ - if (page->index != linear_page_index(vma, addr)) + if (PageAnon(page)) { + swp_entry_t entry = { .val = page_private(page) }; + swap_duplicate(entry); + if (list_empty(&dst_mm->mmlist)) { + spin_lock(&mmlist_lock); + if (list_empty(&dst_mm->mmlist)) + list_add(&dst_mm->mmlist, &init_mm.mmlist); + spin_unlock(&mmlist_lock); + } + pte = swp_entry_to_pte(entry); + set_pte_at(dst_mm, addr, dst_pte, pte); + } else if (page->index != linear_page_index(vma, addr)) /* If nonlinear, store the file page offset in the pte. */ set_pte_at(dst_mm, addr, dst_pte, pgoff_to_pte(page->index)); else Index: linux-2.6/mm/page-states.c =================================================================== --- linux-2.6.orig/mm/page-states.c +++ linux-2.6/mm/page-states.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "internal.h" @@ -35,7 +36,16 @@ static inline int check_bits(struct page */ if (PageDirty(page) || PageReserved(page) || PageWriteback(page) || PageLocked(page) || PagePrivate(page) || PageDiscarded(page) || - !PageUptodate(page) || !PageLRU(page) || PageAnon(page)) + !PageUptodate(page) || !PageLRU(page) || + (PageAnon(page) && !PageSwapCache(page))) + return 0; + + /* + * Special case shared memory: page is PageSwapCache but not + * PageAnon. page_unmap_all fails for swapped shared memory + * pages. + */ + if (PageSwapCache(page) && !PageAnon(page)) return 0; /* @@ -169,15 +179,21 @@ static void __page_discard(struct page * } spin_unlock_irq(&zone->lru_lock); - /* We can't handle swap cache pages (yet). */ - VM_BUG_ON(PageSwapCache(page)); - - /* Remove page from page cache. */ + /* Remove page from page cache/swap cache. 
*/ mapping = page->mapping; - spin_lock_irq(&mapping->tree_lock); - __remove_from_page_cache_nocheck(page); - spin_unlock_irq(&mapping->tree_lock); - __put_page(page); + if (PageSwapCache(page)) { + swp_entry_t entry = { .val = page_private(page) }; + spin_lock_irq(&swapper_space.tree_lock); + __delete_from_swap_cache_nocheck(page); + spin_unlock_irq(&swapper_space.tree_lock); + swap_free(entry); + page_cache_release(page); + } else { + spin_lock_irq(&mapping->tree_lock); + __remove_from_page_cache_nocheck(page); + spin_unlock_irq(&mapping->tree_lock); + __put_page(page); + } } /** Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -762,6 +762,7 @@ void page_remove_rmap(struct page *page) * faster for those pages still in swapcache. */ } + page_make_volatile(page, 1); } /* @@ -1253,13 +1254,13 @@ int try_to_munlock(struct page *page) #ifdef CONFIG_PAGE_STATES /** - * page_unmap_all - removes all mappings of a page + * page_unmap_file - removes all mappings of a file page * * @page: the page which mapping in the vma should be struck down * * the caller needs to hold page lock */ -void page_unmap_all(struct page* page) +static void page_unmap_file(struct page* page) { struct address_space *mapping = page_mapping(page); pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); @@ -1268,8 +1269,6 @@ void page_unmap_all(struct page* page) unsigned long address; int rc; - VM_BUG_ON(!PageLocked(page) || PageReserved(page) || PageAnon(page)); - spin_lock(&mapping->i_mmap_lock); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { address = vma_address(page, vma); @@ -1300,4 +1299,48 @@ out: spin_unlock(&mapping->i_mmap_lock); } +/** + * page_unmap_anon - removes all mappings of an anonymous page + * + * @page: the page which mapping in the vma should be struck down + * + * the caller needs to hold page lock + */ +static void page_unmap_anon(struct page* 
page) +{ + struct anon_vma *anon_vma; + struct vm_area_struct *vma; + unsigned long address; + int rc; + + anon_vma = page_lock_anon_vma(page); + if (!anon_vma) + return; + list_for_each_entry(vma, &anon_vma->head, anon_vma_node) { + address = vma_address(page, vma); + if (address == -EFAULT) + continue; + rc = try_to_unmap_one(page, vma, address, 0); + VM_BUG_ON(rc == SWAP_FAIL); + } + page_unlock_anon_vma(anon_vma); +} + +/** + * page_unmap_all - removes all mappings of a page + * + * @page: the page which mapping in the vma should be struck down + * + * the caller needs to hold page lock + */ +void page_unmap_all(struct page *page) +{ + VM_BUG_ON(!PageLocked(page) || PageReserved(page)); + + if (PageAnon(page)) + page_unmap_anon(page); + else + page_unmap_file(page); +} + #endif Index: linux-2.6/mm/swap_state.c =================================================================== --- linux-2.6.orig/mm/swap_state.c +++ linux-2.6/mm/swap_state.c @@ -18,6 +18,7 @@ #include #include #include +#include #include @@ -107,7 +108,7 @@ int add_to_swap_cache(struct page *page, * This must be called only on pages that have * been verified to be in the swap cache. */ -void __delete_from_swap_cache(struct page *page) +void inline __delete_from_swap_cache_nocheck(struct page *page) { swp_entry_t ent = {.val = page_private(page)}; @@ -124,6 +125,28 @@ void __delete_from_swap_cache(struct pag mem_cgroup_uncharge_swapcache(page, ent); } +void __delete_from_swap_cache(struct page *page) +{ + /* + * Check if the discard fault handler already removed + * the page from the page cache. If not set the discard + * bit in the page flags to prevent double page free if + * a discard fault is racing with normal page free. + */ + if (TestSetPageDiscarded(page)) + return; + + __delete_from_swap_cache_nocheck(page); + + /* + * Check the hardware page state and clear the discard + * bit in the page flags only if the page is not + * discarded. 
+ */ + if (!page_discarded(page)) + ClearPageDiscarded(page); +} + /** * add_to_swap - allocate swap space for a page * @page: page we want to move to swap Index: linux-2.6/mm/swapfile.c =================================================================== --- linux-2.6.orig/mm/swapfile.c +++ linux-2.6/mm/swapfile.c @@ -29,6 +29,7 @@ #include #include #include +#include #include #include @@ -564,6 +565,8 @@ int try_to_free_swap(struct page *page) return 0; if (page_swapcount(page)) return 0; + if (!page_make_stable(page)) + return 0; delete_from_swap_cache(page); SetPageDirty(page); @@ -585,7 +588,13 @@ int free_swap_and_cache(swp_entry_t entr p = swap_info_get(entry); if (p) { if (swap_entry_free(p, entry) == 1) { - page = find_get_page(&swapper_space, entry.val); + /* + * Use find_get_page_nodiscard to avoid the deadlock + * on the swap_lock and the page table lock if the + * page has been discarded. + */ + page = find_get_page_nodiscard(&swapper_space, + entry.val); if (page && !trylock_page(page)) { page_cache_release(page); page = NULL; @@ -600,8 +609,17 @@ int free_swap_and_cache(swp_entry_t entr */ if (PageSwapCache(page) && !PageWriteback(page) && (!page_mapped(page) || vm_swap_full())) { - delete_from_swap_cache(page); - SetPageDirty(page); + /* + * To be able to reload the page from swap the + * swap slot may not be freed. The caller of + * free_swap_and_cache holds a page table lock + * for this page. The discarded page can not be + * removed here. + */ + if (likely(page_make_stable(page))) { + delete_from_swap_cache(page); + SetPageDirty(page); + } } unlock_page(page); page_cache_release(page); -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. 
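The second pitfall from the patch description above (make the page stable before the swap slot is freed, and keep the slot if the page was discarded) can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel implementation: struct toy_swap_page, its fields and the toy_* functions are hypothetical, and toy_make_stable() merely mimics the nonzero-on-success convention that the patched try_to_free_swap() relies on.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a swap cache page; not a kernel structure. */
struct toy_swap_page {
	bool in_swap_cache;	/* models PageSwapCache() */
	bool discarded;		/* host dropped the content (state Vz) */
	bool dirty;		/* models PageDirty() */
	int swap_count;		/* models page_swapcount() */
};

/* Nonzero on success, 0 if the host already discarded the page. */
static int toy_make_stable(const struct toy_swap_page *p)
{
	return !p->discarded;
}

/*
 * Modeled on the patched try_to_free_swap(): the slot is only given up
 * once the page content is known to be safely in memory.
 * Returns 1 if the swap slot was released, 0 otherwise.
 */
static int toy_try_to_free_swap(struct toy_swap_page *p)
{
	if (!p->in_swap_cache)
		return 0;
	if (p->swap_count)
		return 0;
	if (!toy_make_stable(p))
		return 0;		/* slot still needed to reload the page */
	p->in_swap_cache = false;	/* models delete_from_swap_cache() */
	p->dirty = true;		/* memory now holds the only copy */
	return 1;
}
```

Checking stability before touching the swap cache is the point of the ordering: a discarded page has no content left in memory, so the copy on the swap device is the only one and its slot has to survive.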
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759083AbZC0PL6 (ORCPT ); Fri, 27 Mar 2009 11:11:58 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1758076AbZC0PKc (ORCPT ); Fri, 27 Mar 2009 11:10:32 -0400 Received: from mtagate1.de.ibm.com ([195.212.17.161]:42210 "EHLO mtagate1.de.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756663AbZC0PKX (ORCPT ); Fri, 27 Mar 2009 11:10:23 -0400 Message-Id: <20090327151013.024372165@de.ibm.com> References: <20090327150905.819861420@de.ibm.com> User-Agent: quilt/0.46-1 Date: Fri, 27 Mar 2009 16:09:11 +0100 From: Martin Schwidefsky To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org Cc: frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com, Martin Schwidefsky Subject: [patch 6/6] Guest page hinting: s390 support. Content-Disposition: inline; filename=006-hva-s390.diff Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Martin Schwidefsky From: Hubertus Franke From: Himanshu Raj s390 uses the milli-coded ESSA instruction to set the page state. The page state is formed by four guest page states called block usage states and three host page states called block content states. The guest states are: - stable (S): there is essential content in the page - unused (U): there is no useful content and any access to the page will cause an addressing exception - volatile (V): there is useful content in the page. The host system is allowed to discard the content anytime, but has to deliver a discard fault with the absolute address of the page if the guest tries to access it. - potential volatile (P): the page has useful content. The host system is allowed to discard the content after it has checked the dirty bit of the page. 
It has to deliver a discard fault with the absolute address of the page if the guest tries to access it.

The host states are:
- resident: the page is present in real memory.
- preserved: the page is not present in real memory but the content is preserved elsewhere by the machine, e.g. on the paging device.
- zero: the page is not present in real memory. The content of the page is logically-zero.

There are 12 combinations of guest and host state; currently only 8 are valid page states:
Sr: a stable, resident page.
Sp: a stable, preserved page.
Sz: a stable, logically zero page. A page filled with zeroes will be allocated on first access.
Ur: an unused but resident page. The host could make it Uz anytime but it doesn't have to.
Uz: an unused, logically zero page.
Vr: a volatile, resident page. The guest can access it normally.
Vz: a volatile, logically zero page. This is a discarded page. The host will deliver a discard fault for any access to the page.
Pr: a potential volatile, resident page. The guest can access it normally.

The remaining 4 combinations can't occur:
Up: an unused, preserved page. If the host tries to get rid of a Ur page it will remove it without writing the page content to disk and set the page to Uz.
Vp: a volatile, preserved page. If the host picks a Vr page for eviction it will discard it and set the page state to Vz.
Pp: a potential volatile, preserved page. There are two cases for page out: 1) if the page is dirty then the host will preserve the page and set it to Sp or 2) if the page is clean then the host will discard it and set the page state to Vz.
Pz: a potential volatile, logically zero page. The host system will always use Vz instead of Pz.

The state transitions (a diagram would be nicer but that is too hard to do in ascii art...):
{Ur,Sr,Vr,Pr}: a resident page will change its block usage state if the guest requests it with page_set_{unused,stable,volatile}.
{Uz,Sz,Vz}: a logically zero page will change its block usage state if the guest requests it with page_set_{unused,stable,volatile}. The guest can't create the Pz state; the state will be Vz instead.
Ur -> Uz: the host system can remove an unused, resident page from memory
Sz -> Sr: on first access a stable, logically zero page will become resident
Sr -> Sp: the host system can swap a stable page to disk
Sp -> Sr: a guest access to a Sp page forces the host to retrieve it
Vr -> Vz: the host can discard a volatile page
Sp -> Uz: a page preserved by the host will be removed if the guest sets the block usage state to unused.
Sp -> Vz: a page preserved by the host will be discarded if the guest sets the block usage state to volatile.
Pr -> Sp: the host can move a page from Pr to Sp if it discovers that the page is dirty while trying to discard the page. The page content is written to the paging device.
Pr -> Vz: the host can discard a Pr page. The Pz state is replaced by the Vz state.

There are some hazards the code has to deal with:
1) For potential volatile pages the transfer of the hardware dirty bit to the software dirty bit needs to make sure that the page gets into the stable state before the hardware dirty bit is cleared. Between the page_test_dirty and the page_clear_dirty call a page_make_stable is required.
2) Since the access of unused pages causes addressing exceptions we need to take care with /dev/mem. The copy_{from,to}_user functions need to be able to cope with addressing exceptions for the kernel address space.
3) The discard fault on an s390 machine delivers the absolute address of the page that caused the fault instead of the virtual address. With the virtual address we could have used the page table entry of the current process to safely get a reference to the discarded page. We can get to the struct page from the absolute page address but it is rather hard to get to a proper page reference.
The page that caused the fault could already have been freed and reused for a different purpose. None of the fields in the struct page would be reliable to use. The freeing of discarded pages therefore has to be postponed until all pending discard faults for this page have been dealt with. The discard fault handler is called disabled for interrupts and tries to get a page reference with get_page_unless_zero. A discarded page is only freed after all cpus have been enabled for interrupts at least once since the detection of the discarded page. This is done using the timer interrupts and the cpu-idle notifier. Signed-off-by: Martin Schwidefsky --- arch/s390/Kconfig | 6 arch/s390/include/asm/page-states.h | 116 ++++++++++++++++++ arch/s390/include/asm/page.h | 11 - arch/s390/kernel/process.c | 4 arch/s390/kernel/time.c | 8 + arch/s390/kernel/traps.c | 4 arch/s390/lib/uaccess_mvcos.c | 10 - arch/s390/lib/uaccess_std.c | 7 - arch/s390/mm/fault.c | 1 arch/s390/mm/init.c | 3 arch/s390/mm/page-states.c | 224 ++++++++++++++++++++++++++++++------ mm/rmap.c | 9 + 12 files changed, 346 insertions(+), 57 deletions(-) Index: linux-2.6/arch/s390/Kconfig =================================================================== --- linux-2.6.orig/arch/s390/Kconfig +++ linux-2.6/arch/s390/Kconfig @@ -468,11 +468,7 @@ config CMM_IUCV the cooperative memory management. config PAGE_STATES - bool "Unused page notification" - help - This enables the notification of unused pages to the - hypervisor. The ESSA instruction is used to do the states - changes between a page that has content and the unused state. + bool "Enable support for guest page hinting." 
config APPLDATA_BASE bool "Linux - VM Monitor Stream, base infrastructure" Index: linux-2.6/arch/s390/include/asm/page-states.h =================================================================== --- /dev/null +++ linux-2.6/arch/s390/include/asm/page-states.h @@ -0,0 +1,116 @@ +#ifndef _ASM_S390_PAGE_STATES_H +#define _ASM_S390_PAGE_STATES_H + +#define ESSA_GET_STATE 0 +#define ESSA_SET_STABLE 1 +#define ESSA_SET_UNUSED 2 +#define ESSA_SET_VOLATILE 3 +#define ESSA_SET_PVOLATILE 4 +#define ESSA_SET_STABLE_MAKE_RESIDENT 5 +#define ESSA_SET_STABLE_IF_NOT_DISCARDED 6 + +#define ESSA_USTATE_MASK 0x0c +#define ESSA_USTATE_STABLE 0x00 +#define ESSA_USTATE_UNUSED 0x04 +#define ESSA_USTATE_PVOLATILE 0x08 +#define ESSA_USTATE_VOLATILE 0x0c + +#define ESSA_CSTATE_MASK 0x03 +#define ESSA_CSTATE_RESIDENT 0x00 +#define ESSA_CSTATE_PRESERVED 0x02 +#define ESSA_CSTATE_ZERO 0x03 + +extern int cmma_flag; + +/* + * ESSA ,, + */ +#define page_essa(_page,_command) ({ \ + int _rc; \ + asm volatile(".insn rrf,0xb9ab0000,%0,%1,%2,0" \ + : "=&d" (_rc) : "a" (page_to_phys(_page)), \ + "i" (_command)); \ + _rc; \ +}) + +static inline int page_host_discards(void) +{ + return cmma_flag; +} + +static inline int page_discarded(struct page *page) +{ + int state; + + if (!cmma_flag) + return 0; + state = page_essa(page, ESSA_GET_STATE); + return (state & ESSA_USTATE_MASK) == ESSA_USTATE_VOLATILE && + (state & ESSA_CSTATE_MASK) == ESSA_CSTATE_ZERO; +} + +static inline void page_set_unused(struct page *page, int order) +{ + int i; + + if (!cmma_flag) + return; + for (i = 0; i < (1 << order); i++) + page_essa(page + i, ESSA_SET_UNUSED); +} + +static inline void page_set_stable(struct page *page, int order) +{ + int i; + + if (!cmma_flag) + return; + for (i = 0; i < (1 << order); i++) + page_essa(page + i, ESSA_SET_STABLE); +} + +static inline void page_set_volatile(struct page *page, int writable) +{ + if (!cmma_flag) + return; + if (writable) + page_essa(page, ESSA_SET_PVOLATILE); + else + 
page_essa(page, ESSA_SET_VOLATILE); +} + +static inline int page_set_stable_if_present(struct page *page) +{ + int rc; + + if (!cmma_flag || PageReserved(page)) + return 1; + + rc = page_essa(page, ESSA_SET_STABLE_IF_NOT_DISCARDED); + return (rc & ESSA_USTATE_MASK) != ESSA_USTATE_VOLATILE || + (rc & ESSA_CSTATE_MASK) != ESSA_CSTATE_ZERO; +} + +/* + * Page locking is done with the architecture page bit PG_arch_1. + */ +static inline int page_test_set_state_change(struct page *page) +{ + return test_and_set_bit(PG_arch_1, &page->flags); +} + +static inline void page_clear_state_change(struct page *page) +{ + clear_bit(PG_arch_1, &page->flags); +} + +static inline int page_state_change(struct page *page) +{ + return test_bit(PG_arch_1, &page->flags); +} + +int page_free_discarded(struct page *page); +void page_shrink_discard_list(void); +void page_discard_init(void); + +#endif /* _ASM_S390_PAGE_STATES_H */ Index: linux-2.6/arch/s390/include/asm/page.h =================================================================== --- linux-2.6.orig/arch/s390/include/asm/page.h +++ linux-2.6/arch/s390/include/asm/page.h @@ -125,17 +125,6 @@ page_get_storage_key(unsigned long addr) return skey; } -#ifdef CONFIG_PAGE_STATES - -struct page; -void arch_free_page(struct page *page, int order); -void arch_alloc_page(struct page *page, int order); - -#define HAVE_ARCH_FREE_PAGE -#define HAVE_ARCH_ALLOC_PAGE - -#endif - #endif /* !__ASSEMBLY__ */ #define __PAGE_OFFSET 0x0UL Index: linux-2.6/arch/s390/kernel/process.c =================================================================== --- linux-2.6.orig/arch/s390/kernel/process.c +++ linux-2.6/arch/s390/kernel/process.c @@ -29,6 +29,7 @@ #include #include #include +#include #include #include #include @@ -82,6 +83,9 @@ extern void s390_handle_mcck(void); */ static void default_idle(void) { +#ifdef CONFIG_PAGE_STATES + page_shrink_discard_list(); +#endif /* CPU is going idle. 
*/ local_irq_disable(); if (need_resched()) { Index: linux-2.6/arch/s390/kernel/time.c =================================================================== --- linux-2.6.orig/arch/s390/kernel/time.c +++ linux-2.6/arch/s390/kernel/time.c @@ -37,6 +37,7 @@ #include #include #include +#include #include #include #include @@ -137,6 +138,9 @@ static int s390_next_event(unsigned long static void s390_set_mode(enum clock_event_mode mode, struct clock_event_device *evt) { +#ifdef CONFIG_PAGE_STATES + page_shrink_discard_list(); +#endif } /* @@ -287,6 +291,10 @@ void __init time_init(void) &ext_int_etr_cc) != 0) panic("Couldn't request external interrupt 0x1406"); +#ifdef CONFIG_PAGE_STATES + page_discard_init(); +#endif + /* Enable TOD clock interrupts on the boot cpu. */ init_cpu_timer(); /* Enable cpu timer interrupts on the boot cpu. */ Index: linux-2.6/arch/s390/kernel/traps.c =================================================================== --- linux-2.6.orig/arch/s390/kernel/traps.c +++ linux-2.6/arch/s390/kernel/traps.c @@ -57,6 +57,7 @@ int sysctl_userprocess_debug = 0; extern pgm_check_handler_t do_protection_exception; extern pgm_check_handler_t do_dat_exception; extern pgm_check_handler_t do_asce_exception; +extern pgm_check_handler_t do_discard_fault; #define stack_pointer ({ void **sp; asm("la %0,0(15)" : "=&d" (sp)); sp; }) @@ -761,5 +762,8 @@ void __init trap_init(void) pgm_check_table[0x15] = &operand_exception; pgm_check_table[0x1C] = &space_switch_exception; pgm_check_table[0x1D] = &hfp_sqrt_exception; +#ifdef CONFIG_PAGE_STATES + pgm_check_table[0x1a] = &do_discard_fault; +#endif pfault_irq_init(); } Index: linux-2.6/arch/s390/lib/uaccess_mvcos.c =================================================================== --- linux-2.6.orig/arch/s390/lib/uaccess_mvcos.c +++ linux-2.6/arch/s390/lib/uaccess_mvcos.c @@ -36,7 +36,7 @@ static size_t copy_from_user_mvcos(size_ tmp1 = -4096UL; asm volatile( "0: .insn ss,0xc80000000000,0(%0,%2),0(%1),0\n" - " jz 7f\n" + 
"10:jz 7f\n" "1:"ALR" %0,%3\n" " "SLR" %1,%3\n" " "SLR" %2,%3\n" @@ -47,7 +47,7 @@ static size_t copy_from_user_mvcos(size_ " "CLR" %0,%4\n" /* copy crosses next page boundary? */ " jnh 4f\n" "3: .insn ss,0xc80000000000,0(%4,%2),0(%1),0\n" - " "SLR" %0,%4\n" + "11:"SLR" %0,%4\n" " "ALR" %2,%4\n" "4:"LHI" %4,-1\n" " "ALR" %4,%0\n" /* copy remaining size, subtract 1 */ @@ -62,6 +62,7 @@ static size_t copy_from_user_mvcos(size_ "7:"SLR" %0,%0\n" "8: \n" EX_TABLE(0b,2b) EX_TABLE(3b,4b) + EX_TABLE(10b,8b) EX_TABLE(11b,8b) : "+a" (size), "+a" (ptr), "+a" (x), "+a" (tmp1), "=a" (tmp2) : "d" (reg0) : "cc", "memory"); return size; @@ -82,7 +83,7 @@ static size_t copy_to_user_mvcos(size_t tmp1 = -4096UL; asm volatile( "0: .insn ss,0xc80000000000,0(%0,%1),0(%2),0\n" - " jz 4f\n" + "6: jz 4f\n" "1:"ALR" %0,%3\n" " "SLR" %1,%3\n" " "SLR" %2,%3\n" @@ -93,11 +94,12 @@ static size_t copy_to_user_mvcos(size_t " "CLR" %0,%4\n" /* copy crosses next page boundary? */ " jnh 5f\n" "3: .insn ss,0xc80000000000,0(%4,%1),0(%2),0\n" - " "SLR" %0,%4\n" + "7:"SLR" %0,%4\n" " j 5f\n" "4:"SLR" %0,%0\n" "5: \n" EX_TABLE(0b,2b) EX_TABLE(3b,5b) + EX_TABLE(6b,5b) EX_TABLE(7b,5b) : "+a" (size), "+a" (ptr), "+a" (x), "+a" (tmp1), "=a" (tmp2) : "d" (reg0) : "cc", "memory"); return size; Index: linux-2.6/arch/s390/lib/uaccess_std.c =================================================================== --- linux-2.6.orig/arch/s390/lib/uaccess_std.c +++ linux-2.6/arch/s390/lib/uaccess_std.c @@ -36,12 +36,12 @@ size_t copy_from_user_std(size_t size, c tmp1 = -256UL; asm volatile( "0: mvcp 0(%0,%2),0(%1),%3\n" - " jz 8f\n" + "10:jz 8f\n" "1:"ALR" %0,%3\n" " la %1,256(%1)\n" " la %2,256(%2)\n" "2: mvcp 0(%0,%2),0(%1),%3\n" - " jnz 1b\n" + "11:jnz 1b\n" " j 8f\n" "3: la %4,255(%1)\n" /* %4 = ptr + 255 */ " "LHI" %3,-4096\n" @@ -50,7 +50,7 @@ size_t copy_from_user_std(size_t size, c " "CLR" %0,%4\n" /* copy crosses next page boundary? 
*/ " jnh 5f\n" "4: mvcp 0(%4,%2),0(%1),%3\n" - " "SLR" %0,%4\n" + "12:"SLR" %0,%4\n" " "ALR" %2,%4\n" "5:"LHI" %4,-1\n" " "ALR" %4,%0\n" /* copy remaining size, subtract 1 */ @@ -65,6 +65,7 @@ size_t copy_from_user_std(size_t size, c "8:"SLR" %0,%0\n" "9: \n" EX_TABLE(0b,3b) EX_TABLE(2b,3b) EX_TABLE(4b,5b) + EX_TABLE(10b,9b) EX_TABLE(11b,9b) EX_TABLE(12b,9b) : "+a" (size), "+a" (ptr), "+a" (x), "+a" (tmp1), "=a" (tmp2) : : "cc", "memory"); return size; Index: linux-2.6/arch/s390/mm/fault.c =================================================================== --- linux-2.6.orig/arch/s390/mm/fault.c +++ linux-2.6/arch/s390/mm/fault.c @@ -611,4 +611,5 @@ void __init pfault_irq_init(void) unregister_early_external_interrupt(0x2603, pfault_interrupt, &ext_int_pfault); } + #endif Index: linux-2.6/arch/s390/mm/init.c =================================================================== --- linux-2.6.orig/arch/s390/mm/init.c +++ linux-2.6/arch/s390/mm/init.c @@ -94,6 +94,9 @@ void __init mem_init(void) /* Setup guest page hinting */ cmma_init(); + /* Setup guest page hinting */ + cmma_init(); + /* this will put all low memory onto the freelists */ totalram_pages += free_all_bootmem(); Index: linux-2.6/arch/s390/mm/page-states.c =================================================================== --- linux-2.6.orig/arch/s390/mm/page-states.c +++ linux-2.6/arch/s390/mm/page-states.c @@ -13,67 +13,223 @@ #include #include #include +#include +#include +#include +#include +#include +#include + +extern void die(const char *,struct pt_regs *,long); + +#ifndef CONFIG_64BIT +#define __FAIL_ADDR_MASK 0x7ffff000 +#else /* CONFIG_64BIT */ +#define __FAIL_ADDR_MASK -4096L +#endif /* CONFIG_64BIT */ -#define ESSA_SET_STABLE 1 -#define ESSA_SET_UNUSED 2 +int cmma_flag; -static int cmma_flag; +void __init cmma_init(void) +{ + register unsigned long tmp asm("0") = 0; + register int rc asm("1") = -ENOSYS; + if (!cmma_flag) + return; + asm volatile( + " .insn rrf,0xb9ab0000,%1,%1,0,0\n" + "0: la 
%0,0\n" + "1:\n" + EX_TABLE(0b,1b) + : "+&d" (rc), "+&d" (tmp)); + if (rc) + cmma_flag = 0; +} static int __init cmma(char *str) { char *parm; + parm = strstrip(str); if (strcmp(parm, "yes") == 0 || strcmp(parm, "on") == 0) { cmma_flag = 1; return 1; } - cmma_flag = 0; - if (strcmp(parm, "no") == 0 || strcmp(parm, "off") == 0) + if (strcmp(parm, "no") == 0 || strcmp(parm, "off") == 0) { + cmma_flag = 0; return 1; + } return 0; } __setup("cmma=", cmma); -void __init cmma_init(void) +static inline void fixup_user_copy(struct pt_regs *regs, + unsigned long address, unsigned short rx) { - register unsigned long tmp asm("0") = 0; - register int rc asm("1") = -EOPNOTSUPP; + const struct exception_table_entry *fixup; + unsigned long kaddr; - if (!cmma_flag) + kaddr = (regs->gprs[rx >> 12] + (rx & 0xfff)) & __FAIL_ADDR_MASK; + if (virt_to_phys((void *) kaddr) != address) return; - asm volatile( - " .insn rrf,0xb9ab0000,%1,%1,0,0\n" - "0: la %0,0\n" - "1:\n" - EX_TABLE(0b,1b) - : "+&d" (rc), "+&d" (tmp)); - if (rc) - cmma_flag = 0; + + fixup = search_exception_tables(regs->psw.addr & PSW_ADDR_INSN); + if (fixup) + regs->psw.addr = fixup->fixup | PSW_ADDR_AMODE; + else + die("discard fault", regs, SIGSEGV); } -void arch_free_page(struct page *page, int order) +/* + * Discarded pages with a page_count() of zero are placed on + * the page_discarded_list until all cpus have been at + * least once in enabled code. That closes the race of page + * free vs. discard faults. + */ +void do_discard_fault(struct pt_regs *regs, unsigned long error_code) { - int i, rc; + unsigned long address; + struct page *page; - if (!cmma_flag) - return; - for (i = 0; i < (1 << order); i++) - asm volatile(".insn rrf,0xb9ab0000,%0,%1,%2,0" - : "=&d" (rc) - : "a" ((page_to_pfn(page) + i) << PAGE_SHIFT), - "i" (ESSA_SET_UNUSED)); + /* + * get the real address that caused the block validity + * exception. 
+ */ + address = S390_lowcore.trans_exc_code & __FAIL_ADDR_MASK; + page = pfn_to_page(address >> PAGE_SHIFT); + + /* + * Check for the special case of a discard fault in + * copy_{from,to}_user. User copy is done using one of + * three special instructions: mvcp, mvcs or mvcos. + */ + if (!(regs->psw.mask & PSW_MASK_PSTATE)) { + switch (*(unsigned char *) regs->psw.addr) { + case 0xda: /* mvcp */ + fixup_user_copy(regs, address, + *(__u16 *)(regs->psw.addr + 2)); + break; + case 0xdb: /* mvcs */ + fixup_user_copy(regs, address, + *(__u16 *)(regs->psw.addr + 4)); + break; + case 0xc8: /* mvcos */ + if (regs->gprs[0] == 0x81) + fixup_user_copy(regs, address, + *(__u16*)(regs->psw.addr + 2)); + else if (regs->gprs[0] == 0x810000) + fixup_user_copy(regs, address, + *(__u16*)(regs->psw.addr + 4)); + break; + default: + break; + } + } + + if (likely(get_page_unless_zero(page))) { + local_irq_enable(); + page_discard(page); + } } -void arch_alloc_page(struct page *page, int order) +static DEFINE_PER_CPU(struct list_head, page_discard_list); +static struct list_head page_gather_list = LIST_HEAD_INIT(page_gather_list); +static struct list_head page_signoff_list = LIST_HEAD_INIT(page_signoff_list); +static cpumask_var_t page_signoff_cpumask; +static DEFINE_SPINLOCK(page_discard_lock); + +/* + * page_free_discarded + * + * free_hot_cold_page calls this function if it is about to free a + * page that has PG_discarded set. Since there might be pending + * discard faults on other cpus on s390 we have to postpone the + * freeing of the page until each cpu has "signed-off" the page. + * + * returns 1 to stop free_hot_cold_page from freeing the page. 
+ */ +int page_free_discarded(struct page *page) { - int i, rc; + local_irq_disable(); + list_add_tail(&page->lru, &__get_cpu_var(page_discard_list)); + local_irq_enable(); + return 1; +} - if (!cmma_flag) +/* + * page_shrink_discard_list + * + * This function is called from the timer tick for an active cpu or + * from the idle notifier. It frees discarded pages in three stages. + * In the first stage it moves the pages from the per-cpu discard + * list to a global list. From the global list the pages are moved + * to the signoff list in a second step. The third step is to free + * the pages after all cpus acknoledged the signoff. That prevents + * that a page is freed when a cpus still has a pending discard + * fault for the page. + */ +void page_shrink_discard_list(void) +{ + struct list_head *cpu_list = &__get_cpu_var(page_discard_list); + struct list_head free_list = LIST_HEAD_INIT(free_list); + struct page *page, *next; + int cpu = smp_processor_id(); + if (list_empty(cpu_list) && + !cpumask_test_cpu(cpu, page_signoff_cpumask)) return; - for (i = 0; i < (1 << order); i++) - asm volatile(".insn rrf,0xb9ab0000,%0,%1,%2,0" - : "=&d" (rc) - : "a" ((page_to_pfn(page) + i) << PAGE_SHIFT), - "i" (ESSA_SET_STABLE)); + spin_lock(&page_discard_lock); + if (!list_empty(cpu_list)) + list_splice_init(cpu_list, &page_gather_list); + cpumask_clear_cpu(cpu, page_signoff_cpumask); + if (cpumask_empty(page_signoff_cpumask)) { + list_splice_init(&page_signoff_list, &free_list); + list_splice_init(&page_gather_list, &page_signoff_list); + if (!list_empty(&page_signoff_list)) { + /* Take care of the nohz race.. 
*/ + cpumask_copy(page_signoff_cpumask, &cpu_online_map); + smp_wmb(); + cpumask_andnot(page_signoff_cpumask, + page_signoff_cpumask, nohz_cpu_mask); + cpumask_clear_cpu(cpu, page_signoff_cpumask); + if (cpumask_empty(page_signoff_cpumask)) + list_splice_init(&page_signoff_list, + &free_list); + } + } + spin_unlock(&page_discard_lock); + list_for_each_entry_safe(page, next, &free_list, lru) { + ClearPageDiscarded(page); + free_cold_page(page); + } +} + +static int page_discard_cpu_notify(struct notifier_block *self, + unsigned long action, void *hcpu) +{ + int cpu = (unsigned long) hcpu; + + if (action == CPU_DEAD) { + local_irq_disable(); + list_splice_init(&per_cpu(page_discard_list, cpu), + &__get_cpu_var(page_discard_list)); + local_irq_enable(); + } + return NOTIFY_OK; +} + +static struct notifier_block page_discard_cpu_notifier = { + .notifier_call = page_discard_cpu_notify, +}; + +void __init page_discard_init(void) +{ + int i; + + if (!alloc_cpumask_var(&page_signoff_cpumask, GFP_KERNEL)) + panic("Couldn't allocate page_signoff_cpumask\n"); + for_each_possible_cpu(i) + INIT_LIST_HEAD(&per_cpu(page_discard_list, i)); + if (register_cpu_notifier(&page_discard_cpu_notifier)) + panic("Couldn't register page discard cpu notifier"); } Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -744,6 +744,15 @@ void page_remove_rmap(struct page *page) */ if ((!PageAnon(page) || PageSwapCache(page)) && page_test_dirty(page)) { + int stable = page_make_stable(page); + VM_BUG_ON(!stable); + /* + * We decremented the mapcount so we now have an + * extra reference for the page. That prevents + * page_make_volatile from making the page + * volatile again while the dirty bit is in + * transit. + */ page_clear_dirty(page); set_page_dirty(page); } -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759426AbZC0PM0 (ORCPT ); Fri, 27 Mar 2009 11:12:26 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1757289AbZC0PKd (ORCPT ); Fri, 27 Mar 2009 11:10:33 -0400 Received: from mtagate8.de.ibm.com ([195.212.29.157]:57850 "EHLO mtagate8.de.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755711AbZC0PKc (ORCPT ); Fri, 27 Mar 2009 11:10:32 -0400 Message-Id: <20090327151012.095486071@de.ibm.com> References: <20090327150905.819861420@de.ibm.com> User-Agent: quilt/0.46-1 Date: Fri, 27 Mar 2009 16:09:08 +0100 From: Martin Schwidefsky To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org Cc: frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com, Martin Schwidefsky Subject: [patch 3/6] Guest page hinting: mlocked pages. Content-Disposition: inline; filename=003-hva-mlock.diff Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Martin Schwidefsky From: Hubertus Franke From: Himanshu Raj Add code to get mlock() working with guest page hinting. The problem with mlock is that locked pages must not be removed from the page cache. That means they need to be stable. page_make_volatile needs a way to check if a page has been locked. To avoid traversing vma lists - which would hurt performance a lot - a field is added to struct address_space. This field is set in mlock_fixup if a vma gets mlocked. The bit never gets removed - once a file has had an mlocked vma, all future pages added to it will stay stable. The pages of an mlocked area are made present in the linux page table by a call to make_pages_present which calls get_user_pages and follow_page. The follow_page function is called for each page in the mlocked vma; if the VM_LOCKED bit in the vma flags is set, the page is made stable.
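The sticky flag described above can be modelled in a few lines of user-space C. This is only a sketch under simplified assumptions: struct mapping_model, model_set_mlocked and model_may_be_volatile are hypothetical stand-ins for struct address_space, mapping_set_mlocked and the final test in check_bits. They are not kernel code and ignore locking and the other checks a real page has to pass.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct address_space; only the new field. */
struct mapping_model {
	unsigned int mlocked;	/* set once a VM_LOCKED vma has been seen */
};

/* Models the idea of mapping_set_mlocked(): the flag is never cleared. */
static void model_set_mlocked(struct mapping_model *mapping)
{
	mapping->mlocked = 1;
}

/*
 * Models the last test in check_bits(): a page may only become
 * volatile if it has a mapping and that mapping never had locked vmas.
 */
static int model_may_be_volatile(const struct mapping_model *mapping)
{
	return mapping != NULL && !mapping->mlocked;
}
```

The point of the design is that the test is a single field read on the mapping, so page_make_volatile never has to walk vma lists. The price is that a file stays "stable only" even after the last mlocked vma is gone.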
Signed-off-by: Martin Schwidefsky --- include/linux/fs.h | 10 ++++++++++ mm/memory.c | 5 +++-- mm/mlock.c | 4 ++++ mm/page-states.c | 5 ++++- mm/rmap.c | 14 ++++++++++++-- 5 files changed, 33 insertions(+), 5 deletions(-) Index: linux-2.6/include/linux/fs.h =================================================================== --- linux-2.6.orig/include/linux/fs.h +++ linux-2.6/include/linux/fs.h @@ -561,6 +561,9 @@ struct address_space { unsigned long flags; /* error bits/gfp mask */ struct backing_dev_info *backing_dev_info; /* device readahead, etc */ spinlock_t private_lock; /* for use by the address_space */ +#ifdef CONFIG_PAGE_STATES + unsigned int mlocked; /* set if VM_LOCKED vmas present */ +#endif struct list_head private_list; /* ditto */ struct address_space *assoc_mapping; /* ditto */ } __attribute__((aligned(sizeof(long)))); @@ -570,6 +573,13 @@ struct address_space { * of struct page's "mapping" pointer be used for PAGE_MAPPING_ANON. */ +static inline void mapping_set_mlocked(struct address_space *mapping) +{ +#ifdef CONFIG_PAGE_STATES + mapping->mlocked = 1; +#endif +} + struct block_device { dev_t bd_dev; /* not a kdev_t - it's a search key */ struct inode * bd_inode; /* will die */ Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -1177,9 +1177,10 @@ struct page *follow_page(struct vm_area_ if (flags & FOLL_GET) get_page(page); - if (flags & FOLL_GET) { + if ((flags & FOLL_GET) || (vma->vm_flags & VM_LOCKED)) { /* - * The page is made stable if a reference is acquired. + * The page is made stable if a reference is acquired or + * the vm area is locked. * If the caller does not get a reference it implies that * the caller can deal with page faults in case the page * is swapped out. 
In this case the caller can deal with Index: linux-2.6/mm/mlock.c =================================================================== --- linux-2.6.orig/mm/mlock.c +++ linux-2.6/mm/mlock.c @@ -18,6 +18,7 @@ #include #include #include +#include #include "internal.h" @@ -380,6 +381,9 @@ static int mlock_fixup(struct vm_area_st (vma->vm_flags & (VM_IO | VM_PFNMAP))) goto out; /* don't set VM_LOCKED, don't count */ + if (lock && vma->vm_file && vma->vm_file->f_mapping) + mapping_set_mlocked(vma->vm_file->f_mapping); + if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) || is_vm_hugetlb_page(vma) || vma == get_gate_vma(current)) { Index: linux-2.6/mm/page-states.c =================================================================== --- linux-2.6.orig/mm/page-states.c +++ linux-2.6/mm/page-states.c @@ -30,6 +30,8 @@ */ static inline int check_bits(struct page *page) { + struct address_space *mapping; + /* * There are several conditions that prevent a page from becoming * volatile. The first check is for the page bits. @@ -53,7 +55,8 @@ static inline int check_bits(struct page * it volatile. It will be freed soon. And if the mapping ever * had locked pages all pages of the mapping will stay stable. */ - return page_mapping(page) != NULL; + mapping = page_mapping(page); + return mapping && !mapping->mlocked; } /* Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -793,8 +793,18 @@ static int try_to_unmap_one(struct page goto out_unmap; } if (ptep_clear_flush_young_notify(vma, address, pte)) { - ret = SWAP_FAIL; - goto out_unmap; + /* + * Check for discarded pages. This can happen if + * there have been discarded pages before a vma + * gets mlocked. The code in make_pages_present + * will force all discarded pages out and reload + * them. That happens after the VM_LOCKED bit + * has been set. 
+ */ + if (likely(!PageDiscarded(page))) { + ret = SWAP_FAIL; + goto out_unmap; + } } } -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758845AbZC0PLj (ORCPT ); Fri, 27 Mar 2009 11:11:39 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1758033AbZC0PK3 (ORCPT ); Fri, 27 Mar 2009 11:10:29 -0400 Received: from mtagate8.de.ibm.com ([195.212.29.157]:57741 "EHLO mtagate8.de.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756922AbZC0PKX (ORCPT ); Fri, 27 Mar 2009 11:10:23 -0400 Message-Id: <20090327151012.398894143@de.ibm.com> References: <20090327150905.819861420@de.ibm.com> User-Agent: quilt/0.46-1 Date: Fri, 27 Mar 2009 16:09:09 +0100 From: Martin Schwidefsky To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org Cc: frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com, Martin Schwidefsky Subject: [patch 4/6] Guest page hinting: writable page table entries. Content-Disposition: inline; filename=004-hva-prot.diff Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Martin Schwidefsky From: Hubertus Franke From: Himanshu Raj The volatile state for page cache and swap cache pages requires that the host system be able to determine whether a volatile page is dirty before removing it. This excludes almost all platforms from using the scheme. What is needed is a way to distinguish between pages that are purely read-only and pages that might get written to. This allows platforms with per-pte dirty bits to use the scheme and gives platforms with per-page dirty bits a small optimization. Whenever a writable pte is created, a check is added that moves the page into the correct state. This needs to be done before the writable pte is established.
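As a rough illustration of the check described above, the following user-space sketch models the gating that this patch builds around the PG_writable flag: the check runs before the pte is installed, and only the first writable pte for a page causes a state change. The names struct page_model, model_test_set_writable and model_check_writable are hypothetical, the flag handling is not atomic, and the counter merely stands in for the real state change.

```c
#include <stdbool.h>

/* Hypothetical page model: one flag bit and a transition counter. */
struct page_model {
	bool writable;		/* models PG_writable */
	int state_changes;	/* how often a page state change was made */
};

/* Models test_and_set_bit(): returns the old value, then sets the flag. */
static bool model_test_set_writable(struct page_model *page)
{
	bool old = page->writable;

	page->writable = true;
	return old;
}

/*
 * Models the check made before a pte is installed: only a writable
 * pte on a page not yet marked writable causes a state change.
 */
static void model_check_writable(struct page_model *page, bool pte_write)
{
	if (!pte_write)
		return;
	if (!model_test_set_writable(page))
		page->state_changes++;
}
```

In this model a caller would invoke model_check_writable first and install the pte afterwards, mirroring the ordering requirement stated above.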
To avoid unnecessary state transitions and the need for a counter, a new page flag PG_writable is added. Only the creation of the first writable pte will do a page state change. Even if all the writable ptes pointing to a page are removed again, the page stays in the safe state until all read-only users of the page have unmapped it as well. Only then is the PG_writable bit reset. The state a page needs to have if a writable pte is present depends on the platform. A platform with per-pte dirty bits wants to move the page into stable state, a platform with per-page dirty bits like s390 can decide to move the page into a special state that requires the host system to check the dirty bit before discarding a page. Signed-off-by: Martin Schwidefsky --- include/linux/page-flags.h | 8 ++++++ include/linux/page-states.h | 27 +++++++++++++++++++- mm/memory.c | 5 +++ mm/mprotect.c | 2 + mm/page-states.c | 58 ++++++++++++++++++++++++++++++++++++++++++-- mm/rmap.c | 1 6 files changed, 98 insertions(+), 3 deletions(-) Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h +++ linux-2.6/include/linux/page-flags.h @@ -103,6 +103,7 @@ enum pageflags { #endif #ifdef CONFIG_PAGE_STATES PG_discarded, /* Page discarded by the hypervisor. */ + PG_writable, /* Page is mapped writable. 
*/ #endif __NR_PAGEFLAGS, @@ -186,10 +187,17 @@ static inline int TestClearPage##uname(s #define ClearPageDiscarded(page) clear_bit(PG_discarded, &(page)->flags) #define TestSetPageDiscarded(page) \ test_and_set_bit(PG_discarded, &(page)->flags) +#define PageWritable(page) test_bit(PG_writable, &(page)->flags) +#define ClearPageWritable(page) clear_bit(PG_writable, &(page)->flags) +#define TestSetPageWritable(page) \ + test_and_set_bit(PG_writable, &(page)->flags) #else #define PageDiscarded(page) 0 #define ClearPageDiscarded(page) do { } while (0) #define TestSetPageDiscarded(page) 0 +#define PageWritable(page) 0 +#define ClearPageWritable(page) do { } while (0) +#define TestSetPageWritable(page) 0 #endif struct page; /* forward declaration */ Index: linux-2.6/include/linux/page-states.h =================================================================== --- linux-2.6.orig/include/linux/page-states.h +++ linux-2.6/include/linux/page-states.h @@ -57,6 +57,9 @@ extern void page_discard(struct page *pa extern int __page_make_stable(struct page *page); extern void __page_make_volatile(struct page *page, int offset); extern void __pagevec_make_volatile(struct pagevec *pvec); +extern void __page_check_writable(struct page *page, pte_t pte, + unsigned int offset); +extern void __page_reset_writable(struct page *page); /* * Extended guest page hinting functions defined by using the @@ -78,6 +81,12 @@ extern void __pagevec_make_volatile(stru * from the LRU list and the radix tree of its mapping. * page_discard uses page_unmap_all to remove all page table * entries for a page. + * - page_check_writable: + * Checks if the page states needs to be adapted because a new + * writable page table entry refering to the page is established. + * - page_reset_writable: + * Resets the page state after the last writable page table entry + * refering to the page has been removed. 
*/ static inline int page_make_stable(struct page *page) @@ -97,12 +106,26 @@ static inline void pagevec_make_volatile __pagevec_make_volatile(pvec); } +static inline void page_check_writable(struct page *page, pte_t pte, + unsigned int offset) +{ + if (page_host_discards() && pte_write(pte) && + !test_bit(PG_writable, &page->flags)) + __page_check_writable(page, pte, offset); +} + +static inline void page_reset_writable(struct page *page) +{ + if (page_host_discards() && test_bit(PG_writable, &page->flags)) + __page_reset_writable(page); +} + #else #define page_host_discards() (0) #define page_set_unused(_page,_order) do { } while (0) #define page_set_stable(_page,_order) do { } while (0) -#define page_set_volatile(_page) do { } while (0) +#define page_set_volatile(_page,_writable) do { } while (0) #define page_set_stable_if_present(_page) (1) #define page_discarded(_page) (0) #define page_volatile(_page) (0) @@ -117,6 +140,8 @@ static inline void pagevec_make_volatile #define page_make_volatile(_page, offset) do { } while (0) #define pagevec_make_volatile(_pagevec) do { } while (0) #define page_discard(_page) do { } while (0) +#define page_check_writable(_page,_pte,_off) do { } while (0) +#define page_reset_writable(_page) do { } while (0) #endif Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -2029,6 +2029,7 @@ reuse: flush_cache_page(vma, address, pte_pfn(orig_pte)); entry = pte_mkyoung(orig_pte); entry = maybe_mkwrite(pte_mkdirty(entry), vma); + page_check_writable(old_page, entry, 1); if (ptep_set_access_flags(vma, address, page_table, entry,1)) update_mmu_cache(vma, address, entry); ret |= VM_FAULT_WRITE; @@ -2084,6 +2085,7 @@ gotten: flush_cache_page(vma, address, pte_pfn(orig_pte)); entry = mk_pte(new_page, vma->vm_page_prot); entry = maybe_mkwrite(pte_mkdirty(entry), vma); + page_check_writable(new_page, entry, 2); /* * Clear the pte entry and 
flush it first, before updating the * pte with the new entry. This will avoid a race condition @@ -2540,6 +2542,7 @@ static int do_swap_page(struct mm_struct write_access = 0; } flush_icache_page(vma, page); + page_check_writable(page, pte, 2); set_pte_at(mm, address, page_table, pte); page_add_anon_rmap(page, vma, address); /* It's better to call commit-charge after rmap is established */ @@ -2599,6 +2602,7 @@ static int do_anonymous_page(struct mm_s entry = mk_pte(page, vma->vm_page_prot); entry = maybe_mkwrite(pte_mkdirty(entry), vma); + page_check_writable(page, entry, 2); page_table = pte_offset_map_lock(mm, pmd, address, &ptl); if (!pte_none(*page_table)) @@ -2754,6 +2758,7 @@ retry: entry = mk_pte(page, vma->vm_page_prot); if (flags & FAULT_FLAG_WRITE) entry = maybe_mkwrite(pte_mkdirty(entry), vma); + page_check_writable(page, entry, 2); if (anon) { inc_mm_counter(mm, anon_rss); page_add_new_anon_rmap(page, vma, address); Index: linux-2.6/mm/mprotect.c =================================================================== --- linux-2.6.orig/mm/mprotect.c +++ linux-2.6/mm/mprotect.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include #include @@ -58,6 +59,7 @@ static void change_pte_range(struct mm_s */ if (dirty_accountable && pte_dirty(ptent)) ptent = pte_mkwrite(ptent); + page_check_writable(pte_page(ptent), ptent, 1); ptep_modify_prot_commit(mm, addr, pte, ptent); } else if (PAGE_MIGRATION && !pte_file(oldpte)) { Index: linux-2.6/mm/page-states.c =================================================================== --- linux-2.6.orig/mm/page-states.c +++ linux-2.6/mm/page-states.c @@ -83,7 +83,7 @@ void __page_make_volatile(struct page *p preempt_disable(); if (!page_test_set_state_change(page)) { if (check_bits(page) && check_counts(page, offset)) - page_set_volatile(page); + page_set_volatile(page, PageWritable(page)); page_clear_state_change(page); } preempt_enable(); @@ -109,7 +109,7 @@ void __pagevec_make_volatile(struct page page = 
pvec->pages[i]; if (!page_test_set_state_change(page)) { if (check_bits(page) && check_counts(page, 1)) - page_set_volatile(page); + page_set_volatile(page, PageWritable(page)); page_clear_state_change(page); } } @@ -142,6 +142,60 @@ int __page_make_stable(struct page *page EXPORT_SYMBOL(__page_make_stable); /** + * __page_check_writable() - check page state for new writable pte + * + * @page: the page the new writable pte refers to + * @pte: the new writable pte + */ +void __page_check_writable(struct page *page, pte_t pte, unsigned int offset) +{ + int count_ok = 0; + + preempt_disable(); + while (page_test_set_state_change(page)) + cpu_relax(); + + if (!TestSetPageWritable(page)) { + count_ok = check_counts(page, offset); + if (check_bits(page) && count_ok) + page_set_volatile(page, 1); + else + /* + * If two processes create a write mapping at the + * same time check_counts will return false or if + * the page is currently isolated from the LRU + * check_bits will return false but the page might + * be in volatile state. + * We have to take care about the dirty bit so the + * only option left is to make the page stable but + * we can try to make it volatile a bit later. 
+ */ + page_set_stable_if_present(page); + } + page_clear_state_change(page); + if (!count_ok) + page_make_volatile(page, 1); + preempt_enable(); +} +EXPORT_SYMBOL(__page_check_writable); + +/** + * __page_reset_writable() - clear the PageWritable bit + * + * @page: the page + */ +void __page_reset_writable(struct page *page) +{ + preempt_disable(); + if (!page_test_set_state_change(page)) { + ClearPageWritable(page); + page_clear_state_change(page); + } + preempt_enable(); +} +EXPORT_SYMBOL(__page_reset_writable); + +/** * __page_discard() - remove a discarded page from the cache * * @page: the page Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -752,6 +752,7 @@ void page_remove_rmap(struct page *page) mem_cgroup_uncharge_page(page); __dec_zone_page_state(page, PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED); + page_reset_writable(page); /* * It would be tidy to reset the PageAnon mapping here, * but that might overwrite a racing page_add_anon_rmap -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758779AbZC0PMo (ORCPT ); Fri, 27 Mar 2009 11:12:44 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1756922AbZC0PKf (ORCPT ); Fri, 27 Mar 2009 11:10:35 -0400 Received: from mtagate4.de.ibm.com ([195.212.29.153]:55166 "EHLO mtagate4.de.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758065AbZC0PKc (ORCPT ); Fri, 27 Mar 2009 11:10:32 -0400 Message-Id: <20090327151012.713478499@de.ibm.com> References: <20090327150905.819861420@de.ibm.com> User-Agent: quilt/0.46-1 Date: Fri, 27 Mar 2009 16:09:10 +0100 From: Martin Schwidefsky To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org Cc: frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com, Martin Schwidefsky Subject: [patch 5/6] Guest page hinting: minor fault optimization. Content-Disposition: inline; filename=005-hva-nohv.diff Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Martin Schwidefsky From: Hubertus Franke From: Himanshu Raj One of the challenges of the guest page hinting scheme is the cost of the state transitions. If the cost gets too high the whole concept of page state information is in question. Therefore it is important to avoid the state transitions when possible. One place where the state transitions can be avoided is the minor fault path. Why change the page state to stable in find_get_page and back in page_add_anon_rmap/ page_add_file_rmap if the discarded pages can be handled by the discard fault handler? If the page is in page/swap cache just map it even if it is already discarded. The first access to the page will cause a discard fault which needs to be able to deal with this kind of situation anyway because of other races in the memory management.
The special find_get_page_nodiscard variant introduced for volatile swap cache is used, which does not change the page state. The calls to find_get_page in filemap_nopage and lookup_swap_cache are replaced with find_get_page_nodiscard. The use of this function creates a new race. If a minor fault races with the discard of a page, the page must not get mapped to the page table because the discard handler removed the page from the cache, which removes the page->mapping that is needed to find the page table entry. A check for the discarded bit is added to do_swap_page and do_no_page. The page table lock for the pte takes care of the synchronization. That removes the state transitions on the minor fault path. A page that has been mapped will eventually be unmapped again. On the unmap path each page that has been removed from the page table is freed with a call to page_cache_release. In general that causes an unnecessary page state transition from volatile to volatile. To get rid of these state transitions as well, a special variant of page_cache_release is added that does not attempt to make the page volatile. page_cache_release_nocheck is then used in free_page_and_swap_cache and release_pages. This makes the unmap of ptes free of state transitions.
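The discarded-bit check can be illustrated with a small user-space model. This is only a sketch under assumptions, not the kernel code: struct fault_page, struct fault_pte_slot and the fault_*() helpers are hypothetical names, and a spinning atomic_flag stands in for the page table lock. Only the rule itself is taken from the description above: with the lock held, map the page only if the pte is unchanged and the page has not been discarded.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only the discarded bit is modeled. */
struct fault_page {
	bool discarded;		/* models PageDiscarded(page) */
};

/* Hypothetical stand-in for one pte and the lock that protects it. */
struct fault_pte_slot {
	atomic_flag ptl;	/* models the page table lock */
	unsigned long pte;	/* models *page_table */
};

static void fault_pte_slot_init(struct fault_pte_slot *slot, unsigned long pte)
{
	atomic_flag_clear(&slot->ptl);
	slot->pte = pte;
}

static void fault_ptl_lock(struct fault_pte_slot *slot)
{
	while (atomic_flag_test_and_set_explicit(&slot->ptl,
						 memory_order_acquire))
		;	/* spin until the lock is free */
}

static void fault_ptl_unlock(struct fault_pte_slot *slot)
{
	atomic_flag_clear_explicit(&slot->ptl, memory_order_release);
}

/*
 * Install new_pte only if nobody else changed the pte and the page was
 * not discarded in the meantime. Returns true if the page was mapped.
 */
static bool fault_map_page(struct fault_pte_slot *slot,
			   const struct fault_page *page,
			   unsigned long orig_pte, unsigned long new_pte)
{
	bool mapped = false;

	assert(slot != NULL && page != NULL);
	fault_ptl_lock(slot);
	if (slot->pte == orig_pte && !page->discarded) {
		slot->pte = new_pte;
		mapped = true;
	}
	fault_ptl_unlock(slot);
	return mapped;
}
```

In this model a caller that gets false back simply backs out of the fault, which mirrors the out_nomap path; a discard side would have to set the discarded flag while holding the same lock for the check to be race free.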
Signed-off-by: Martin Schwidefsky --- include/linux/pagemap.h | 4 ++++ include/linux/swap.h | 2 +- mm/filemap.c | 27 ++++++++++++++++++++++++--- mm/fremap.c | 1 + mm/memory.c | 4 ++-- mm/rmap.c | 4 +--- mm/shmem.c | 7 +++++++ mm/swap_state.c | 4 ++-- 8 files changed, 42 insertions(+), 11 deletions(-) Index: linux-2.6/include/linux/pagemap.h =================================================================== --- linux-2.6.orig/include/linux/pagemap.h +++ linux-2.6/include/linux/pagemap.h @@ -93,11 +93,15 @@ static inline void mapping_set_gfp_mask( #ifdef CONFIG_PAGE_STATES extern struct page * find_get_page_nodiscard(struct address_space *mapping, unsigned long index); +extern struct page * find_lock_page_nodiscard(struct address_space *mapping, + unsigned long index); #define page_cache_release(page) put_page_check(page) #else #define find_get_page_nodiscard(mapping, index) find_get_page(mapping, index) +#define find_lock_page_nodiscard(mapping, index) find_lock_page(mapping, index) #define page_cache_release(page) put_page(page) #endif +#define page_cache_release_nocheck(page) put_page(page) void release_pages(struct page **pages, int nr, int cold); /* Index: linux-2.6/include/linux/swap.h =================================================================== --- linux-2.6.orig/include/linux/swap.h +++ linux-2.6/include/linux/swap.h @@ -362,7 +362,7 @@ static inline void mem_cgroup_uncharge_s /* only sparc can not include linux/pagemap.h in this file * so leave page_cache_release and release_pages undeclared... 
*/ #define free_page_and_swap_cache(page) \ - page_cache_release(page) + page_cache_release_nocheck(page) #define free_pages_and_swap_cache(pages, nr) \ release_pages((pages), (nr), 0); Index: linux-2.6/mm/filemap.c =================================================================== --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -592,6 +592,27 @@ repeat: EXPORT_SYMBOL(find_get_page_nodiscard); +struct page *find_lock_page_nodiscard(struct address_space *mapping, + unsigned long offset) +{ + struct page *page; + +repeat: + page = find_get_page_nodiscard(mapping, offset); + if (page) { + lock_page(page); + /* Has the page been truncated? */ + if (unlikely(page->mapping != mapping)) { + unlock_page(page); + page_cache_release(page); + goto repeat; + } + VM_BUG_ON(page->index != offset); + } + return page; +} +EXPORT_SYMBOL(find_lock_page_nodiscard); + #endif /* @@ -1586,7 +1607,7 @@ int filemap_fault(struct vm_area_struct * Do we have something in the page cache already? */ retry_find: - page = find_lock_page(mapping, vmf->pgoff); + page = find_lock_page_nodiscard(mapping, vmf->pgoff); /* * For sequential accesses, we use the generic readahead logic. 
*/ @@ -1594,7 +1615,7 @@ retry_find: if (!page) { page_cache_sync_readahead(mapping, ra, file, vmf->pgoff, 1); - page = find_lock_page(mapping, vmf->pgoff); + page = find_lock_page_nodiscard(mapping, vmf->pgoff); if (!page) goto no_cached_page; } @@ -1633,7 +1654,7 @@ retry_find: start = vmf->pgoff - ra_pages / 2; do_page_cache_readahead(mapping, file, start, ra_pages); } - page = find_lock_page(mapping, vmf->pgoff); + page = find_lock_page_nodiscard(mapping, vmf->pgoff); if (!page) goto no_cached_page; } Index: linux-2.6/mm/fremap.c =================================================================== --- linux-2.6.orig/mm/fremap.c +++ linux-2.6/mm/fremap.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -2513,7 +2513,7 @@ static int do_swap_page(struct mm_struct * Back out if somebody else already faulted in this pte. */ page_table = pte_offset_map_lock(mm, pmd, address, &ptl); - if (unlikely(!pte_same(*page_table, orig_pte))) + if (unlikely(!pte_same(*page_table, orig_pte) || PageDiscarded(page))) goto out_nomap; if (unlikely(!PageUptodate(page))) { @@ -2753,7 +2753,7 @@ retry: * handle that later. */ /* Only go through if we didn't race with anybody else... 
*/ - if (likely(pte_same(*page_table, orig_pte))) { + if (likely(pte_same(*page_table, orig_pte) && !PageDiscarded(page))) { flush_icache_page(vma, page); entry = mk_pte(page, vma->vm_page_prot); if (flags & FAULT_FLAG_WRITE) Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -703,7 +703,6 @@ void page_add_file_rmap(struct page *pag { if (atomic_inc_and_test(&page->_mapcount)) __inc_zone_page_state(page, NR_FILE_MAPPED); - page_make_volatile(page, 1); } #ifdef CONFIG_DEBUG_VM @@ -763,7 +762,6 @@ void page_remove_rmap(struct page *page) * faster for those pages still in swapcache. */ } - page_make_volatile(page, 1); } /* @@ -862,7 +860,7 @@ static int try_to_unmap_one(struct page } page_remove_rmap(page); - page_cache_release(page); + page_cache_release_nocheck(page); out_unmap: pte_unmap_unlock(pte, ptl); Index: linux-2.6/mm/shmem.c =================================================================== --- linux-2.6.orig/mm/shmem.c +++ linux-2.6/mm/shmem.c @@ -59,6 +59,7 @@ static struct vfsmount *shm_mnt; #include #include #include +#include #include #include @@ -1245,6 +1246,12 @@ repeat: if (swap.val) { /* Look it up and read it in.. 
*/ swappage = lookup_swap_cache(swap); + if (swappage && unlikely(!page_make_stable(swappage))) { + shmem_swp_unmap(entry); + spin_unlock(&info->lock); + page_discard(swappage); + goto repeat; + } if (!swappage) { shmem_swp_unmap(entry); /* here we actually do the io */ Index: linux-2.6/mm/swap_state.c =================================================================== --- linux-2.6.orig/mm/swap_state.c +++ linux-2.6/mm/swap_state.c @@ -241,7 +241,7 @@ static inline void free_swap_cache(struc void free_page_and_swap_cache(struct page *page) { free_swap_cache(page); - page_cache_release(page); + page_cache_release_nocheck(page); } /* @@ -275,7 +275,7 @@ struct page * lookup_swap_cache(swp_entr { struct page *page; - page = find_get_page(&swapper_space, entry.val); + page = find_get_page_nodiscard(&swapper_space, entry.val); if (page) INC_CACHE_INFO(find_success); -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756386AbZC0W71 (ORCPT ); Fri, 27 Mar 2009 18:59:27 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1755907AbZC0W7N (ORCPT ); Fri, 27 Mar 2009 18:59:13 -0400 Received: from mx2.redhat.com ([66.187.237.31]:55628 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755115AbZC0W7M (ORCPT ); Fri, 27 Mar 2009 18:59:12 -0400 Message-ID: <49CD59DB.3070906@redhat.com> Date: Fri, 27 Mar 2009 18:57:31 -0400 From: Rik van Riel Organization: Red Hat, Inc User-Agent: Thunderbird 2.0.0.17 (X11/20080915) MIME-Version: 1.0 To: Martin Schwidefsky CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com Subject: Re: [patch 1/6] Guest page hinting: core + volatile page cache. 
References: <20090327150905.819861420@de.ibm.com> <20090327151011.534224968@de.ibm.com> In-Reply-To: <20090327151011.534224968@de.ibm.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Martin Schwidefsky wrote: > The major obstacles that need to get addressed: > * Concurrent page state changes: > To guard against concurrent page state updates some kind of lock > is needed. If page_make_volatile() has already done the 11 checks it > will issue the state change primitive. If in the meantime one of > the conditions has changed the user that requires that page in > stable state will have to wait in the page_make_stable() function > until the make volatile operation has finished. It is up to the > architecture to define how this is done with the three primitives > page_test_set_state_change, page_clear_state_change and > page_state_change. > There are some alternatives how this can be done, e.g. a global > lock, or lock per segment in the kernel page table, or the per page > bit PG_arch_1 if it is still free. Can this be taken care of by memory barriers and careful ordering of operations? If we consider the states unused -> volatile -> stable as progressively higher, "upgrades" can be done before any kernel operation that requires the page to be in that state (but after setting up the things that allow it to be found), while downgrades can be done after the kernel is done with needing the page at a higher level. Since the downgrade checks for users that need the page in a higher state, no lock should be required. In fact, it may be possible to manage the page state bitmap with compare-and-swap, without needing a call to the hypervisor. > Signed-off-by: Martin Schwidefsky Some comments and questions in line. 
> @@ -601,6 +604,21 @@ copy_one_pte(struct mm_struct *dst_mm, s > > out_set_pte: > set_pte_at(dst_mm, addr, dst_pte, pte); > + return; > + > +out_discard_pte: > + /* > + * If the page referred by the pte has the PG_discarded bit set, > + * copy_one_pte is racing with page_discard. The pte may not be > + * copied or we can end up with a pte pointing to a page not > + * in the page cache anymore. Do what try_to_unmap_one would do > + * if the copy_one_pte had taken place before page_discard. > + */ > + if (page->index != linear_page_index(vma, addr)) > + /* If nonlinear, store the file page offset in the pte. */ > + set_pte_at(dst_mm, addr, dst_pte, pgoff_to_pte(page->index)); > + else > + pte_clear(dst_mm, addr, dst_pte); > } It would be good to document that PG_discarded can only happen for file pages and NOT for eg. clean swap cache pages. > @@ -1390,6 +1391,7 @@ int test_clear_page_writeback(struct pag > radix_tree_tag_clear(&mapping->page_tree, > page_index(page), > PAGECACHE_TAG_WRITEBACK); > + page_make_volatile(page, 1); > if (bdi_cap_account_writeback(bdi)) { > __dec_bdi_stat(bdi, BDI_WRITEBACK); > __bdi_writeout_inc(bdi); Does this mark the page volatile before the IO writing the dirty data back to disk has even started? Is that OK? -- All rights reversed. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1761959AbZC0XET (ORCPT ); Fri, 27 Mar 2009 19:04:19 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1761778AbZC0XDu (ORCPT ); Fri, 27 Mar 2009 19:03:50 -0400 Received: from e37.co.us.ibm.com ([32.97.110.158]:47995 "EHLO e37.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1759870AbZC0XDt (ORCPT ); Fri, 27 Mar 2009 19:03:49 -0400 Subject: Re: [patch 0/6] Guest page hinting version 7. 
From: Dave Hansen To: Martin Schwidefsky Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com In-Reply-To: <20090327150905.819861420@de.ibm.com> References: <20090327150905.819861420@de.ibm.com> Content-Type: text/plain Date: Fri, 27 Mar 2009 16:03:43 -0700 Message-Id: <1238195024.8286.562.camel@nimitz> Mime-Version: 1.0 X-Mailer: Evolution 2.22.3.1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 2009-03-27 at 16:09 +0100, Martin Schwidefsky wrote: > If the host picks one of the > pages the guest can recreate, the host can throw it away instead of writing > it to the paging device. Simple and elegant. Heh, simple and elegant for the hypervisor. But I'm not sure I'm going to call *anything* that requires a new CPU instruction elegant. ;) I don't see any description of it in there any more, but I thought this entire patch set was to get rid of the idiotic triple I/Os in the following scenario: 1. Hypervisor picks a page and evicts it out to disk, pays the I/O cost to get it written out. (I/O #1) 2. Linux comes along (being a bit late to the party) and picks the same page, also decides it needs to be out to disk 3. Linux tries to write the page to disk, but touches it in the process, pulling the page back in from the store where the hypervisor wrote it. (I/O #2) 4. Linux writes the page to its swap device (I/O #3) I don't see that mentioned at all in the current description. Simplifying the hypervisor is hard to get behind, but cutting system I/O by 2/3 is a much nicer benefit for 1200 lines of invasive code. ;) Can we persuade the hypervisor to tell us which pages it decided to page out and just skip those when we're scanning the LRU? 
-- Dave From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755914AbZC1AIa (ORCPT ); Fri, 27 Mar 2009 20:08:30 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752303AbZC1AIW (ORCPT ); Fri, 27 Mar 2009 20:08:22 -0400 Received: from mx2.redhat.com ([66.187.237.31]:52346 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750830AbZC1AIV (ORCPT ); Fri, 27 Mar 2009 20:08:21 -0400 Message-ID: <49CD69EB.6000000@redhat.com> Date: Fri, 27 Mar 2009 20:06:03 -0400 From: Rik van Riel Organization: Red Hat, Inc User-Agent: Thunderbird 2.0.0.17 (X11/20080915) MIME-Version: 1.0 To: Dave Hansen CC: Martin Schwidefsky , linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com Subject: Re: [patch 0/6] Guest page hinting version 7. References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> In-Reply-To: <1238195024.8286.562.camel@nimitz> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Dave Hansen wrote: > On Fri, 2009-03-27 at 16:09 +0100, Martin Schwidefsky wrote: >> If the host picks one of the >> pages the guest can recreate, the host can throw it away instead of writing >> it to the paging device. Simple and elegant. > > Heh, simple and elegant for the hypervisor. But I'm not sure I'm going > to call *anything* that requires a new CPU instruction elegant. ;) I am convinced that it could be done with a guest-writable "bitmap", with 2 bits per page. That would make this scheme useful for KVM, too. 
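A minimal sketch of what such a bitmap could look like, assuming 2 bits per page packed into 32-bit words and updated with compare-and-swap so that a guest-side state change needs no hypervisor call. The numeric state encoding and the names guest_state, bitmap_get_state() and bitmap_cas_state() are made up for illustration; host-side state is not modeled at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The four guest states; the numeric values are an assumption. */
enum guest_state {
	GS_UNUSED = 0,
	GS_VOLATILE = 1,
	GS_POT_VOLATILE = 2,
	GS_STABLE = 3
};

#define STATES_PER_WORD 16u	/* 32 bits per word / 2 bits per page */

/* Read the 2-bit state of page number pfn. */
static enum guest_state bitmap_get_state(_Atomic uint32_t *map, size_t pfn)
{
	unsigned int shift = (unsigned int)(pfn % STATES_PER_WORD) * 2;
	uint32_t word = atomic_load_explicit(&map[pfn / STATES_PER_WORD],
					     memory_order_acquire);

	return (enum guest_state)((word >> shift) & 3u);
}

/*
 * Move page pfn from state "from" to state "to" with compare-and-swap.
 * Returns false if the page is not in state "from". A failed exchange
 * caused by a neighbouring page in the same word is simply retried.
 */
static bool bitmap_cas_state(_Atomic uint32_t *map, size_t pfn,
			     enum guest_state from, enum guest_state to)
{
	_Atomic uint32_t *word = &map[pfn / STATES_PER_WORD];
	unsigned int shift = (unsigned int)(pfn % STATES_PER_WORD) * 2;
	uint32_t mask = (uint32_t)3u << shift;
	uint32_t cur = atomic_load_explicit(word, memory_order_relaxed);

	for (;;) {
		uint32_t want;

		if (((cur & mask) >> shift) != (uint32_t)from)
			return false;
		want = (cur & ~mask) | ((uint32_t)to << shift);
		if (atomic_compare_exchange_weak_explicit(word, &cur, want,
							  memory_order_acq_rel,
							  memory_order_relaxed))
			return true;
		/* cur was reloaded by the failed exchange; check again */
	}
}
```

The compare-and-swap only makes the update of the 2-bit field itself atomic. Whether the checks that precede a transition can be ordered safely without a lock is a separate question.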
> I don't see any description of it in there any more, but I thought this > entire patch set was to get rid of the idiotic triple I/Os in the > following scenario: > I don't see that mentioned at all in the current description. > Simplifying the hypervisor is hard to get behind, but cutting system I/O > by 2/3 is a much nicer benefit for 1200 lines of invasive code. ;) Cutting down on a fair bit of IO is absolutely worth 1200 lines of fairly well isolated code. > Can we persuade the hypervisor to tell us which pages it decided to page > out and just skip those when we're scanning the LRU? The easiest "notification" points are in the page fault handler and the page cache lookup code. -- All rights reversed. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754829AbZC1Gfp (ORCPT ); Sat, 28 Mar 2009 02:35:45 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751830AbZC1Gff (ORCPT ); Sat, 28 Mar 2009 02:35:35 -0400 Received: from ozlabs.org ([203.10.76.45]:57974 "EHLO ozlabs.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751666AbZC1Gff (ORCPT ); Sat, 28 Mar 2009 02:35:35 -0400 From: Rusty Russell To: virtualization@lists.linux-foundation.org Subject: Re: [patch 0/6] Guest page hinting version 7. 
Date: Sat, 28 Mar 2009 17:05:28 +1030 User-Agent: KMail/1.11.1 (Linux/2.6.27-11-generic; KDE/4.2.1; i686; ; ) Cc: Martin Schwidefsky , linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, akpm@osdl.org, nickpiggin@yahoo.com.au, frankeh@watson.ibm.com, riel@redhat.com, hugh@veritas.com References: <20090327150905.819861420@de.ibm.com> In-Reply-To: <20090327150905.819861420@de.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200903281705.29798.rusty@rustcorp.com.au> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Saturday 28 March 2009 01:39:05 Martin Schwidefsky wrote: > Greetings, > the circus is back in town -- another version of the guest page hinting > patches. The patches differ from version 6 only in the kernel version, > they apply against 2.6.29. My short sniff test showed that the code > is still working as expected. > > To recap (you can skip this if you read the boiler plate of the last > version of the patches): > The main benefit for guest page hinting vs. the ballooner is that there > is no need for a monitor that keeps track of the memory usage of all the > guests, a complex algorithm that calculates the working set sizes and for > the calls into the guest kernel to control the size of the balloons. I thought you weren't convinced of the concrete benefits over ballooning, or am I misremembering? Thanks, Rusty. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757344AbZC2N47 (ORCPT ); Sun, 29 Mar 2009 09:56:59 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1756061AbZC2N4t (ORCPT ); Sun, 29 Mar 2009 09:56:49 -0400 Received: from mtagate8.de.ibm.com ([195.212.29.157]:60750 "EHLO mtagate8.de.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755331AbZC2N4r (ORCPT ); Sun, 29 Mar 2009 09:56:47 -0400 Date: Sun, 29 Mar 2009 15:56:40 +0200 From: Martin Schwidefsky To: Rik van Riel Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com Subject: Re: [patch 1/6] Guest page hinting: core + volatile page cache. Message-ID: <20090329155640.31472c61@skybase> In-Reply-To: <49CD59DB.3070906@redhat.com> References: <20090327150905.819861420@de.ibm.com> <20090327151011.534224968@de.ibm.com> <49CD59DB.3070906@redhat.com> Organization: IBM Corporation X-Mailer: Claws Mail 3.7.1 (GTK+ 2.14.7; i486-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 27 Mar 2009 18:57:31 -0400 Rik van Riel wrote: > Martin Schwidefsky wrote: > > > The major obstacles that need to get addressed: > > * Concurrent page state changes: > > To guard against concurrent page state updates some kind of lock > > is needed. If page_make_volatile() has already done the 11 checks it > > will issue the state change primitive. If in the meantime one of > > the conditions has changed the user that requires that page in > > stable state will have to wait in the page_make_stable() function > > until the make volatile operation has finished. 
It is up to the > > architecture to define how this is done with the three primitives > > page_test_set_state_change, page_clear_state_change and > > page_state_change. > > There are some alternatives how this can be done, e.g. a global > > lock, or lock per segment in the kernel page table, or the per page > > bit PG_arch_1 if it is still free. > > Can this be taken care of by memory barriers and > careful ordering of operations? I don't see how this could be done with memory barriers. The sequence is 1) check conditions 2) do state change to volatile another cpu can do i) change one of the conditions The operation i) needs to be postponed while the first cpu has done 1) but not done 2) yet. 1+2 needs to be atomic but consists of several instructions. Ergo we need a lock, no? > If we consider the states unused -> volatile -> stable > as progressively higher, "upgrades" can be done before > any kernel operation that requires the page to be in > that state (but after setting up the things that allow > it to be found), while downgrades can be done after the > kernel is done with needing the page at a higher level. > > Since the downgrade checks for users that need the page > in a higher state, no lock should be required. > > In fact, it may be possible to manage the page state > bitmap with compare-and-swap, without needing a call > to the hypervisor. > > > Signed-off-by: Martin Schwidefsky > > Some comments and questions in line. > > > @@ -601,6 +604,21 @@ copy_one_pte(struct mm_struct *dst_mm, s > > > > out_set_pte: > > set_pte_at(dst_mm, addr, dst_pte, pte); > > + return; > > + > > +out_discard_pte: > > + /* > > + * If the page referred by the pte has the PG_discarded bit set, > > + * copy_one_pte is racing with page_discard. The pte may not be > > + * copied or we can end up with a pte pointing to a page not > > + * in the page cache anymore. Do what try_to_unmap_one would do > > + * if the copy_one_pte had taken place before page_discard.
> > + */ > > + if (page->index != linear_page_index(vma, addr)) > > + /* If nonlinear, store the file page offset in the pte. */ > > + set_pte_at(dst_mm, addr, dst_pte, pgoff_to_pte(page->index)); > > + else > > + pte_clear(dst_mm, addr, dst_pte); > > } > > It would be good to document that PG_discarded can only happen for > file pages and NOT for eg. clean swap cache pages. PG_discarded can happen for swap cache pages as well. If a clean swap cache page gets removed and subsequently accessed again the discard fault handler will set the bit (see __page_discard). The code necessary for volatile swap cache is introduced with patch #2. So I would rather not add a comment in patch #1 only to remove it again with patch #2 .. > > @@ -1390,6 +1391,7 @@ int test_clear_page_writeback(struct pag > > radix_tree_tag_clear(&mapping->page_tree, > > page_index(page), > > PAGECACHE_TAG_WRITEBACK); > > + page_make_volatile(page, 1); > > if (bdi_cap_account_writeback(bdi)) { > > __dec_bdi_stat(bdi, BDI_WRITEBACK); > > __bdi_writeout_inc(bdi); > > Does this mark the page volatile before the IO writing the > dirty data back to disk has even started? Is that OK? Hmm, it could be that the page_make_volatile is just superfluous here. The logic here is that whenever one of the conditions that prevent a page from becoming volatile is cleared a try with page_make_volatile is done. The condition in question here is PageWriteback(page). If we can prove that one of the other conditions is true this particular call is a waste of effort. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
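The locking argument from this reply (steps 1 and 2 must be atomic with respect to step i) can be sketched as a user-space model. This is an illustration under assumptions, not the patch code: struct hint_page and the hint_*() functions are hypothetical names, an atomic_flag stands in for the per page state change bit, and a single user counter stands in for all of the conditions that are checked.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum hint_state { HINT_STABLE, HINT_VOLATILE };

/* Hypothetical stand-in for struct page. */
struct hint_page {
	atomic_flag state_change;	/* models the per page lock bit */
	atomic_int users;		/* the one modeled condition */
	enum hint_state state;		/* protected by state_change */
};

static void hint_page_init(struct hint_page *page)
{
	atomic_flag_clear(&page->state_change);
	atomic_init(&page->users, 0);
	page->state = HINT_STABLE;
}

/* Models page_test_set_state_change(): true if the bit was already set. */
static bool hint_test_set_state_change(struct hint_page *page)
{
	return atomic_flag_test_and_set_explicit(&page->state_change,
						 memory_order_acquire);
}

/* Models page_clear_state_change(). */
static void hint_clear_state_change(struct hint_page *page)
{
	atomic_flag_clear_explicit(&page->state_change, memory_order_release);
}

/*
 * Steps 1) and 2) as one unit. If another state change is in flight the
 * attempt is given up. Returns true if the page became volatile.
 */
static bool hint_make_volatile(struct hint_page *page)
{
	bool done = false;

	if (hint_test_set_state_change(page))
		return false;
	if (atomic_load(&page->users) == 0) {	/* 1) check conditions */
		page->state = HINT_VOLATILE;	/* 2) do the state change */
		done = true;
	}
	hint_clear_state_change(page);
	return done;
}

/*
 * Step i) followed by make stable: change the condition first, then wait
 * until a concurrent make volatile has finished before forcing stable.
 */
static void hint_get_and_make_stable(struct hint_page *page)
{
	atomic_fetch_add(&page->users, 1);	/* i) change a condition */
	while (hint_test_set_state_change(page))
		;	/* wait for the other cpu to finish 1) + 2) */
	page->state = HINT_STABLE;
	hint_clear_state_change(page);
}

/* Drop the user reference taken by hint_get_and_make_stable(). */
static void hint_put(struct hint_page *page)
{
	atomic_fetch_sub(&page->users, 1);
}
```

If the counter is raised between the check and the state change, the stable side still waits on the bit and then overrides the stale volatile state, so the page is stable once hint_get_and_make_stable() returns. A memory barrier alone would order the accesses but would not give the stable side anything to wait on.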
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757097AbZC2OUn (ORCPT ); Sun, 29 Mar 2009 10:20:43 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753016AbZC2OUc (ORCPT ); Sun, 29 Mar 2009 10:20:32 -0400 Received: from mtagate2.de.ibm.com ([195.212.17.162]:40553 "EHLO mtagate2.de.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751953AbZC2OUb (ORCPT ); Sun, 29 Mar 2009 10:20:31 -0400 Date: Sun, 29 Mar 2009 16:20:24 +0200 From: Martin Schwidefsky To: Rik van Riel Cc: Dave Hansen , linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com Subject: Re: [patch 0/6] Guest page hinting version 7. Message-ID: <20090329162024.687196ab@skybase> In-Reply-To: <49CD69EB.6000000@redhat.com> References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <49CD69EB.6000000@redhat.com> Organization: IBM Corporation X-Mailer: Claws Mail 3.7.1 (GTK+ 2.14.7; i486-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 27 Mar 2009 20:06:03 -0400 Rik van Riel wrote: > Dave Hansen wrote: > > On Fri, 2009-03-27 at 16:09 +0100, Martin Schwidefsky wrote: > >> If the host picks one of the > >> pages the guest can recreate, the host can throw it away instead of writing > >> it to the paging device. Simple and elegant. > > > > Heh, simple and elegant for the hypervisor. But I'm not sure I'm going > > to call *anything* that requires a new CPU instruction elegant. ;) > > I am convinced that it could be done with a guest-writable > "bitmap", with 2 bits per page. That would make this scheme > useful for KVM, too. This was our initial approach before we came up with the milli-code instruction. 
The reason we did not use a bitmap was to prevent the guest from changing the host state (4 guest states U/S/V/P and 3 host states r/p/z). With the full set of states you'd need 4 bits. And the host needs to have a "master" copy of the host bits, one the guest cannot change, otherwise you get into trouble. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756689AbZC2OXy (ORCPT ); Sun, 29 Mar 2009 10:23:54 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1754864AbZC2OXo (ORCPT ); Sun, 29 Mar 2009 10:23:44 -0400 Received: from mtagate4.de.ibm.com ([195.212.29.153]:53328 "EHLO mtagate4.de.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754118AbZC2OXn (ORCPT ); Sun, 29 Mar 2009 10:23:43 -0400 Date: Sun, 29 Mar 2009 16:23:36 +0200 From: Martin Schwidefsky To: Rusty Russell Cc: virtualization@lists.linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, akpm@osdl.org, nickpiggin@yahoo.com.au, frankeh@watson.ibm.com, riel@redhat.com, hugh@veritas.com Subject: Re: [patch 0/6] Guest page hinting version 7. Message-ID: <20090329162336.7c0700e9@skybase> In-Reply-To: <200903281705.29798.rusty@rustcorp.com.au> References: <20090327150905.819861420@de.ibm.com> <200903281705.29798.rusty@rustcorp.com.au> Organization: IBM Corporation X-Mailer: Claws Mail 3.7.1 (GTK+ 2.14.7; i486-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, 28 Mar 2009 17:05:28 +1030 Rusty Russell wrote: > On Saturday 28 March 2009 01:39:05 Martin Schwidefsky wrote: > > Greetings, > > the circus is back in town -- another version of the guest page hinting > > patches.
The patches differ from version 6 only in the kernel version, > > they apply against 2.6.29. My short sniff test showed that the code > > is still working as expected. > > > > To recap (you can skip this if you read the boiler plate of the last > > version of the patches): > > The main benefit for guest page hinting vs. the ballooner is that there > > is no need for a monitor that keeps track of the memory usage of all the > > guests, a complex algorithm that calculates the working set sizes and for > > the calls into the guest kernel to control the size of the balloons. > > I thought you weren't convinced of the concrete benefits over ballooning, > or am I misremembering? The performance tests I have seen so far show that the benefits of ballooning vs. guest page hinting are about the same. I am still convinced that guest page hinting is the way to go because you do not need an external monitor. Calculating the working set size for a guest is a challenge. With guest page hinting there is no need for a working set size calculation. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757913AbZC2OhN (ORCPT ); Sun, 29 Mar 2009 10:37:13 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1756725AbZC2Og5 (ORCPT ); Sun, 29 Mar 2009 10:36:57 -0400 Received: from mx2.redhat.com ([66.187.237.31]:59061 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756297AbZC2Og5 (ORCPT ); Sun, 29 Mar 2009 10:36:57 -0400 Message-ID: <49CF8733.7060309@redhat.com> Date: Sun, 29 Mar 2009 10:35:31 -0400 From: Rik van Riel Organization: Red Hat, Inc User-Agent: Thunderbird 2.0.0.17 (X11/20080915) MIME-Version: 1.0 To: Martin Schwidefsky CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com Subject: Re: [patch 1/6] Guest page hinting: core + volatile page cache. References: <20090327150905.819861420@de.ibm.com> <20090327151011.534224968@de.ibm.com> <49CD59DB.3070906@redhat.com> <20090329155640.31472c61@skybase> In-Reply-To: <20090329155640.31472c61@skybase> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Martin Schwidefsky wrote: > On Fri, 27 Mar 2009 18:57:31 -0400 > Rik van Riel wrote: >> Martin Schwidefsky wrote: >>> There are some alternatives how this can be done, e.g. a global >>> lock, or lock per segment in the kernel page table, or the per page >>> bit PG_arch_1 if it is still free. >> Can this be taken care of by memory barriers and >> careful ordering of operations? > > I don't see how this could be done with memory barriers. The sequence is > 1) check conditions > 2) do state change to volatile > > another cpu can do > i) change one of the conditions > > The operation i) needs to be postponed while the first cpu has done 1) > but not done 2) yet.
1+2 needs to be atomic but consists of several > instructions. Ergo we need a lock, no ? You are right. Hashed locks may be a space saving option, with a set of (cache line aligned?) locks in each zone and the page state lock chosen by taking a hash of the page number or address. Not ideal, but at least we can get some NUMA locality. >>> + if (page->index != linear_page_index(vma, addr)) >>> + /* If nonlinear, store the file page offset in the pte. */ >>> + set_pte_at(dst_mm, addr, dst_pte, pgoff_to_pte(page->index)); >>> + else >>> + pte_clear(dst_mm, addr, dst_pte); >>> } >> It would be good to document that PG_discarded can only happen for >> file pages and NOT for eg. clean swap cache pages. > > PG_discarded can happen for swap cache pages as well. If a clean swap > cache page gets remove and subsequently access again the discard fault > handler will set the bit (see __page_discard). The code necessary for > volatile swap cache is introduced with patch #2. So I would rather not > add a comment in patch #1 only to remove it again with patch #2 .. I discovered that once I opened the next email :) >>> @@ -1390,6 +1391,7 @@ int test_clear_page_writeback(struct pag >>> radix_tree_tag_clear(&mapping->page_tree, >>> page_index(page), >>> PAGECACHE_TAG_WRITEBACK); >>> + page_make_volatile(page, 1); >>> if (bdi_cap_account_writeback(bdi)) { >>> __dec_bdi_stat(bdi, BDI_WRITEBACK); >>> __bdi_writeout_inc(bdi); >> Does this mark the page volatile before the IO writing the >> dirty data back to disk has even started? Is that OK? > > Hmm, it could be that the page_make_volatile is just superflouos here. > The logic here is that whenever one of the conditions that prevent a > page from becoming volatile is cleared a try with page_make_volatile > is done. The condition in question here is PageWriteback(page). If we > can prove that one of the other conditions is true this particular call > is a waste of effort. 
Actually, test_clear_page_writeback is probably called on IO completion and it was just me being confused after a few hundred lines of very new (to me) VM code :) I guess the patch is correct. Acked-by: Rik van Riel -- All rights reversed. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757646AbZC2Ok3 (ORCPT ); Sun, 29 Mar 2009 10:40:29 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1754439AbZC2OkU (ORCPT ); Sun, 29 Mar 2009 10:40:20 -0400 Received: from mx2.redhat.com ([66.187.237.31]:42637 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753016AbZC2OkU (ORCPT ); Sun, 29 Mar 2009 10:40:20 -0400 Message-ID: <49CF87FB.4030608@redhat.com> Date: Sun, 29 Mar 2009 10:38:51 -0400 From: Rik van Riel Organization: Red Hat, Inc User-Agent: Thunderbird 2.0.0.17 (X11/20080915) MIME-Version: 1.0 To: Martin Schwidefsky CC: Dave Hansen , linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com Subject: Re: [patch 0/6] Guest page hinting version 7. References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <49CD69EB.6000000@redhat.com> <20090329162024.687196ab@skybase> In-Reply-To: <20090329162024.687196ab@skybase> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Martin Schwidefsky wrote: > On Fri, 27 Mar 2009 20:06:03 -0400 > Rik van Riel wrote: > >> Dave Hansen wrote: >>> On Fri, 2009-03-27 at 16:09 +0100, Martin Schwidefsky wrote: >>>> If the host picks one of the >>>> pages the guest can recreate, the host can throw it away instead of writing >>>> it to the paging device. Simple and elegant. >>> Heh, simple and elegant for the hypervisor. 
But I'm not sure I'm going >>> to call *anything* that requires a new CPU instruction elegant. ;) >> I am convinced that it could be done with a guest-writable >> "bitmap", with 2 bits per page. That would make this scheme >> useful for KVM, too. > > This was our initial approach before we came up with the milli-code > instruction. The reason we did not use a bitmap was to prevent the > guest to change the host state (4 guest states U/S/V/P and 3 host > states r/p/z). With the full set of states you'd need 4 bits. And the > hosts need to have a "master" copy of the host bits, one the guest > cannot change, otherwise you get into trouble. KVM already has the info from the host bits somewhere else, which is needed to be able to actually find the physical pages used by a guest. That leaves just the guest states, so a compare-and-swap may work for non-s390. -- All rights reversed. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754120AbZC3PzW (ORCPT ); Mon, 30 Mar 2009 11:55:22 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753029AbZC3PzH (ORCPT ); Mon, 30 Mar 2009 11:55:07 -0400 Received: from e5.ny.us.ibm.com ([32.97.182.145]:45909 "EHLO e5.ny.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752069AbZC3PzG (ORCPT ); Mon, 30 Mar 2009 11:55:06 -0400 Subject: Re: [patch 0/6] Guest page hinting version 7. 
From: Dave Hansen To: Martin Schwidefsky Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com In-Reply-To: <20090329161253.3faffdeb@skybase> References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <20090329161253.3faffdeb@skybase> Content-Type: text/plain Date: Mon, 30 Mar 2009 08:54:55 -0700 Message-Id: <1238428495.8286.638.camel@nimitz> Mime-Version: 1.0 X-Mailer: Evolution 2.22.3.1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, 2009-03-29 at 16:12 +0200, Martin Schwidefsky wrote: > > Can we persuade the hypervisor to tell us which pages it decided to page > > out and just skip those when we're scanning the LRU? > > One principle of the whole approach is that the hypervisor does not > call into an otherwise idle guest. The cost of schedulung the virtual > cpu is just too high. So we would a means to store the information where > the guest can pick it up when it happens to do LRU. I don't think that > this will work out. I didn't mean for it to actively notify the guest. Perhaps, as Rik said, have a bitmap where the host can set or clear bit for the guest to see. As the guest is scanning the LRU, it checks the structure (or makes an hcall or whatever) and sees that the hypervisor has already taken care of the page. It skips these pages in the first round of scanning. I do see what you're saying about this saving the page-*out* operation on the hypervisor side. It can simply toss out pages instead of paging them itself. That's a pretty advanced optimization, though. What would this code look like if we didn't optimize to that level? It also occurs to me that the hypervisor could be doing a lot of this internally. This whole scheme is about telling the hypervisor about pages that we (the kernel) know we can regenerate. 
The hypervisor should know a lot of that information, too. We ask it to populate a page with stuff from virtual I/O devices or write a page out to those devices. The page remains volatile until something from the guest writes to it. The hypervisor could keep a record of how to recreate the page as long as it remains volatile and clean. That wouldn't cover things like page cache from network filesystems, though. This patch does look like the full monty but I have to wonder what other partial approaches are out there. -- Dave From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753730AbZC3QeW (ORCPT ); Mon, 30 Mar 2009 12:34:22 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751298AbZC3QeO (ORCPT ); Mon, 30 Mar 2009 12:34:14 -0400 Received: from mtagate1.de.ibm.com ([195.212.17.161]:36313 "EHLO mtagate1.de.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751276AbZC3QeM (ORCPT ); Mon, 30 Mar 2009 12:34:12 -0400 Date: Mon, 30 Mar 2009 18:34:05 +0200 From: Martin Schwidefsky To: Dave Hansen Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com Subject: Re: [patch 0/6] Guest page hinting version 7. 
Message-ID: <20090330183405.750440da@skybase> In-Reply-To: <1238428495.8286.638.camel@nimitz> References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <20090329161253.3faffdeb@skybase> <1238428495.8286.638.camel@nimitz> Organization: IBM Corporation X-Mailer: Claws Mail 3.7.1 (GTK+ 2.14.7; i486-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 30 Mar 2009 08:54:55 -0700 Dave Hansen wrote: > On Sun, 2009-03-29 at 16:12 +0200, Martin Schwidefsky wrote: > > > Can we persuade the hypervisor to tell us which pages it decided to page > > > out and just skip those when we're scanning the LRU? > > > > One principle of the whole approach is that the hypervisor does not > > call into an otherwise idle guest. The cost of schedulung the virtual > > cpu is just too high. So we would a means to store the information where > > the guest can pick it up when it happens to do LRU. I don't think that > > this will work out. > > I didn't mean for it to actively notify the guest. Perhaps, as Rik > said, have a bitmap where the host can set or clear bit for the guest to > see. Yes, agreed. > As the guest is scanning the LRU, it checks the structure (or makes an > hcall or whatever) and sees that the hypervisor has already taken care > of the page. It skips these pages in the first round of scanning. As long as we make this optional I'm fine with it. On s390 with the current implementation that translates to an ESSA call. Which is not exactly inexpensive, we are talking about > 100 cycles. The better solution for us is to age the page with the standard active/inactive processing. > I do see what you're saying about this saving the page-*out* operation > on the hypervisor side. It can simply toss out pages instead of paging > them itself. That's a pretty advanced optimization, though. 
What would
> this code look like if we didn't optimize to that level?

Why? It is just a simple test in the host's LRU scan. If the page is at the end of the inactive list AND has the volatile state, then don't bother with writeback, just throw it away. This is the only place where the host has to check the page state.

> It also occurs to me that the hypervisor could be doing a lot of this
> internally. This whole scheme is about telling the hypervisor about
> pages that we (the kernel) know we can regenerate. The hypervisor
> should know a lot of that information, too. We ask it to populate a
> page with stuff from virtual I/O devices or write a page out to those
> devices. The page remains volatile until something from the guest
> writes to it. The hypervisor could keep a record of how to recreate the
> page as long as it remains volatile and clean.

Unfortunately it is not that simple. There are quite a few reasons why a page has to be made stable. You'd have to pass that information back and forth between the guest and the host, otherwise the host will throw away e.g. an mlocked page because it is still marked as volatile in the virtual block device.

> That wouldn't cover things like page cache from network filesystems,
> though.

Yes, there are pages with a backing the host knows nothing about.

> This patch does look like the full monty but I have to wonder what other
> partial approaches are out there.

I am open to suggestions. The simplest partial approach is already implemented for s390: unused/stable transitions in the buddy allocator.

--
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
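[Editor's sketch, not part of the original message.] The host-side test described in the reply above (a volatile page at the tail of the inactive list is thrown away instead of being written back) could look roughly like the following. This is a minimal illustration under stated assumptions, not the s390 or z/VM implementation: `struct host_page`, `host_reclaim_action()` and the enum names are hypothetical, and the handling of the unused and potentially-volatile states is deliberately left out, so they fall through to the conservative writeback path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Guest page states named in the cover letter: S, U, P, V. */
enum guest_state {
    GUEST_STABLE,
    GUEST_UNUSED,
    GUEST_POT_VOLATILE,
    GUEST_VOLATILE,
};

enum reclaim_action {
    RECLAIM_SKIP,      /* not at the tail of the inactive list */
    RECLAIM_WRITEBACK, /* write to the paging device, then free */
    RECLAIM_DISCARD,   /* drop the contents; the guest can recreate them */
};

/* Hypothetical host-side view of one guest page (illustration only). */
struct host_page {
    enum guest_state state;
    bool at_inactive_tail;
};

/*
 * The single check from the thread: at the end of the inactive list,
 * a volatile page is discarded instead of being written back.
 * Every other state takes the normal writeback path in this sketch.
 */
enum reclaim_action host_reclaim_action(const struct host_page *page)
{
    assert(page != NULL);

    if (!page->at_inactive_tail)
        return RECLAIM_SKIP;
    if (page->state == GUEST_VOLATILE)
        return RECLAIM_DISCARD;
    return RECLAIM_WRITEBACK;
}
```

The point of the sketch is only that the page state is consulted in one place in the host's reclaim path; everything else stays ordinary LRU paging.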
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756558AbZC3SiY (ORCPT ); Mon, 30 Mar 2009 14:38:24 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752355AbZC3SiO (ORCPT ); Mon, 30 Mar 2009 14:38:14 -0400 Received: from gw.goop.org ([64.81.55.164]:57455 "EHLO mail.goop.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752745AbZC3SiO (ORCPT ); Mon, 30 Mar 2009 14:38:14 -0400 Message-ID: <49D11184.3060002@goop.org> Date: Mon, 30 Mar 2009 11:37:56 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 2.0.0.21 (X11/20090320) MIME-Version: 1.0 To: Dave Hansen CC: Martin Schwidefsky , akpm@osdl.org, nickpiggin@yahoo.com.au, frankeh@watson.ibm.com, virtualization@lists.osdl.org, riel@redhat.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, hugh@veritas.com Subject: Re: [patch 0/6] Guest page hinting version 7. References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <20090329161253.3faffdeb@skybase> <1238428495.8286.638.camel@nimitz> In-Reply-To: <1238428495.8286.638.camel@nimitz> X-Enigmail-Version: 0.95.6 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Dave Hansen wrote: > It also occurs to me that the hypervisor could be doing a lot of this > internally. This whole scheme is about telling the hypervisor about > pages that we (the kernel) know we can regenerate. The hypervisor > should know a lot of that information, too. We ask it to populate a > page with stuff from virtual I/O devices or write a page out to those > devices. The page remains volatile until something from the guest > writes to it. The hypervisor could keep a record of how to recreate the > page as long as it remains volatile and clean. > That potentially pushes a lot of complexity elsewhere. 
If you have multiple paths to a storage device, or a cluster store shared between multiple machines, then the underlying storage can change making the guest's copies of those blocks unbacked. Obviously the host/hypervisor could deal with that, but it would be a pile of new mechanisms which don't necessarily exist (for example, it would have to be an active participant in a distributed locking scheme for a shared block device rather than just passing it all through to the guest to handle). That said, people have been looking at tracking block IO to work out when it might be useful to try and share pages between guests under Xen. J From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758597AbZC3SpK (ORCPT ); Mon, 30 Mar 2009 14:45:10 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1757526AbZC3Sox (ORCPT ); Mon, 30 Mar 2009 14:44:53 -0400 Received: from mx2.redhat.com ([66.187.237.31]:40552 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757384AbZC3Sow (ORCPT ); Mon, 30 Mar 2009 14:44:52 -0400 Message-ID: <49D11287.4030307@redhat.com> Date: Mon, 30 Mar 2009 14:42:15 -0400 From: Rik van Riel Organization: Red Hat, Inc User-Agent: Thunderbird 2.0.0.17 (X11/20080915) MIME-Version: 1.0 To: Jeremy Fitzhardinge CC: Dave Hansen , Martin Schwidefsky , akpm@osdl.org, nickpiggin@yahoo.com.au, frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, hugh@veritas.com Subject: Re: [patch 0/6] Guest page hinting version 7. 
References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <20090329161253.3faffdeb@skybase> <1238428495.8286.638.camel@nimitz> <49D11184.3060002@goop.org> In-Reply-To: <49D11184.3060002@goop.org> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Jeremy Fitzhardinge wrote: > That said, people have been looking at tracking block IO to work out > when it might be useful to try and share pages between guests under Xen. Tracking block IO seems like a bass-ackwards way to figure out what the contents of a memory page are. The KVM KSM code has a simpler, yet still efficient, way of figuring out which memory pages can be shared. -- All rights reversed. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759590AbZC3TOq (ORCPT ); Mon, 30 Mar 2009 15:14:46 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1757447AbZC3TOZ (ORCPT ); Mon, 30 Mar 2009 15:14:25 -0400 Received: from gw.goop.org ([64.81.55.164]:47593 "EHLO mail.goop.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757562AbZC3TOX (ORCPT ); Mon, 30 Mar 2009 15:14:23 -0400 Message-ID: <49D11674.9040205@goop.org> Date: Mon, 30 Mar 2009 11:59:00 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 2.0.0.21 (X11/20090320) MIME-Version: 1.0 To: Rik van Riel CC: Dave Hansen , Martin Schwidefsky , akpm@osdl.org, nickpiggin@yahoo.com.au, frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, hugh@veritas.com Subject: Re: [patch 0/6] Guest page hinting version 7. 
References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <20090329161253.3faffdeb@skybase> <1238428495.8286.638.camel@nimitz> <49D11184.3060002@goop.org> <49D11287.4030307@redhat.com>
In-Reply-To: <49D11287.4030307@redhat.com>
X-Enigmail-Version: 0.95.6
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Rik van Riel wrote:
> Jeremy Fitzhardinge wrote:
>
>> That said, people have been looking at tracking block IO to work out
>> when it might be useful to try and share pages between guests under Xen.
>
> Tracking block IO seems like a bass-ackwards way to figure
> out what the contents of a memory page are.

Well, they're research projects, so nobody said that they're necessarily useful results ;). I think the rationale is that, in general, there aren't all that many sharable pages, and aside from zero-pages, the bulk of them are the result of IO. Since it's much simpler to compare device+block references than to do page content matching, it is worth looking at the IO stream to work out what your candidates are.

> The KVM KSM code has a simpler, yet still efficient, way of
> figuring out which memory pages can be shared.

How's that? Does it do page content comparison?
J From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1760723AbZC3UFA (ORCPT ); Mon, 30 Mar 2009 16:05:00 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1757724AbZC3UEs (ORCPT ); Mon, 30 Mar 2009 16:04:48 -0400 Received: from mx2.redhat.com ([66.187.237.31]:38829 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757802AbZC3UEr (ORCPT ); Mon, 30 Mar 2009 16:04:47 -0400 Message-ID: <49D12564.40708@redhat.com> Date: Mon, 30 Mar 2009 16:02:44 -0400 From: Rik van Riel Organization: Red Hat, Inc User-Agent: Thunderbird 2.0.0.17 (X11/20080915) MIME-Version: 1.0 To: Jeremy Fitzhardinge CC: Dave Hansen , Martin Schwidefsky , akpm@osdl.org, nickpiggin@yahoo.com.au, frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, hugh@veritas.com Subject: Re: [patch 0/6] Guest page hinting version 7. References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <20090329161253.3faffdeb@skybase> <1238428495.8286.638.camel@nimitz> <49D11184.3060002@goop.org> <49D11287.4030307@redhat.com> <49D11674.9040205@goop.org> In-Reply-To: <49D11674.9040205@goop.org> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Jeremy Fitzhardinge wrote: > Rik van Riel wrote: >> Jeremy Fitzhardinge wrote: >> >>> That said, people have been looking at tracking block IO to work out >>> when it might be useful to try and share pages between guests under Xen. >> >> Tracking block IO seems like a bass-ackwards way to figure >> out what the contents of a memory page are. > > Well, they're research projects, so nobody said that they're necessarily > useful results ;). 
I think the rationale is that, in general, there > aren't all that many sharable pages, and asize from zero-pages, the bulk > of them are the result of IO. I'll give you a hint: Windows zeroes out freed pages. It should also be possible to hook up arch_free_page() so freed pages in Linux guests become sharable. Furthermore, every guest with the same OS version will be running the same system daemons, same glibc, etc. This means sharable pages from not just disk IO (probably from different disks anyway), but also in the BSS and possibly even on the heap. >> The KVM KSM code has a simpler, yet still efficient, way of >> figuring out which memory pages can be shared. > How's that? Does it do page content comparison? Eventually. It starts out with hashing the first 128 (IIRC) bytes of page content and comparing the hashes. If that matches, it will do content comparison. Content comparison is done in the background on the host. I suspect (but have not checked) that it is somehow hooked up to the page reclaim code on the host. -- All rights reversed. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1761203AbZC3Uft (ORCPT ); Mon, 30 Mar 2009 16:35:49 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1758121AbZC3Ufk (ORCPT ); Mon, 30 Mar 2009 16:35:40 -0400 Received: from gw.goop.org ([64.81.55.164]:55765 "EHLO mail.goop.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757245AbZC3Ufj (ORCPT ); Mon, 30 Mar 2009 16:35:39 -0400 Message-ID: <49D12D16.6050407@goop.org> Date: Mon, 30 Mar 2009 13:35:34 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 2.0.0.21 (X11/20090320) MIME-Version: 1.0 To: Rik van Riel CC: Dave Hansen , Martin Schwidefsky , akpm@osdl.org, nickpiggin@yahoo.com.au, frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, hugh@veritas.com Subject: Re: [patch 0/6] Guest page hinting version 7. 
References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <20090329161253.3faffdeb@skybase> <1238428495.8286.638.camel@nimitz> <49D11184.3060002@goop.org> <49D11287.4030307@redhat.com> <49D11674.9040205@goop.org> <49D12564.40708@redhat.com> In-Reply-To: <49D12564.40708@redhat.com> X-Enigmail-Version: 0.95.6 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Rik van Riel wrote: > Jeremy Fitzhardinge wrote: >> Rik van Riel wrote: >>> Jeremy Fitzhardinge wrote: >>> >>>> That said, people have been looking at tracking block IO to work >>>> out when it might be useful to try and share pages between guests >>>> under Xen. >>> >>> Tracking block IO seems like a bass-ackwards way to figure >>> out what the contents of a memory page are. >> >> Well, they're research projects, so nobody said that they're >> necessarily useful results ;). I think the rationale is that, in >> general, there aren't all that many sharable pages, and asize from >> zero-pages, the bulk of them are the result of IO. > > I'll give you a hint: Windows zeroes out freed pages. Right: "aside from zero-pages". If you exclude zero-pages from your count of shared pages, the amount of sharing drops a lot. > It should also be possible to hook up arch_free_page() so > freed pages in Linux guests become sharable. > > Furthermore, every guest with the same OS version will be > running the same system daemons, same glibc, etc. This > means sharable pages from not just disk IO (probably from > different disks anyway), Why? If you're starting a bunch of cookie-cutter guests, then you're probably starting them from the same template image or COW block devices. (Also, if you're wearing the cost of physical IO anyway, then additional cost of hashing is relatively small.) > but also in the BSS and possibly > even on the heap. Well, maybe. 
Modern systems generally randomize memory layouts, so even if they're semantically the same, the pointers will all have different values. Other research into "sharing" mostly-similar pages is more promising for that kind of case. > Eventually. It starts out with hashing the first 128 (IIRC) > bytes of page content and comparing the hashes. If that > matches, it will do content comparison. > > Content comparison is done in the background on the host. > I suspect (but have not checked) that it is somehow hooked > up to the page reclaim code on the host. Yeah, that's the straightforward approach; there's about a research project/year doing a Xen implementation, but they never seem to get very good results aside from very artificial test conditions. J From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1762096AbZC3Vlt (ORCPT ); Mon, 30 Mar 2009 17:41:49 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1761986AbZC3VkE (ORCPT ); Mon, 30 Mar 2009 17:40:04 -0400 Received: from mx2.redhat.com ([66.187.237.31]:51726 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1761980AbZC3VkB (ORCPT ); Mon, 30 Mar 2009 17:40:01 -0400 Message-ID: <49D13BB9.3010200@redhat.com> Date: Tue, 31 Mar 2009 00:38:01 +0300 From: Dor Laor Reply-To: dlaor@redhat.com User-Agent: Thunderbird 2.0.0.19 (X11/20090105) MIME-Version: 1.0 To: Jeremy Fitzhardinge CC: Rik van Riel , akpm@osdl.org, nickpiggin@yahoo.com.au, frankeh@watson.ibm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Dave Hansen , virtualization@lists.osdl.org, Martin Schwidefsky , hugh@veritas.com, Izik Eidus Subject: Re: [patch 0/6] Guest page hinting version 7. 
References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <20090329161253.3faffdeb@skybase> <1238428495.8286.638.camel@nimitz> <49D11184.3060002@goop.org> <49D11287.4030307@redhat.com> <49D11674.9040205@goop.org> <49D12564.40708@redhat.com> <49D12D16.6050407@goop.org> In-Reply-To: <49D12D16.6050407@goop.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Jeremy Fitzhardinge wrote: > Rik van Riel wrote: > >> Jeremy Fitzhardinge wrote: >> >>> Rik van Riel wrote: >>> >>>> Jeremy Fitzhardinge wrote: >>>> >>>> >>>>> That said, people have been looking at tracking block IO to work >>>>> out when it might be useful to try and share pages between guests >>>>> under Xen. >>>>> >>>> Tracking block IO seems like a bass-ackwards way to figure >>>> out what the contents of a memory page are. >>>> >>> Well, they're research projects, so nobody said that they're >>> necessarily useful results ;). I think the rationale is that, in >>> general, there aren't all that many sharable pages, and asize from >>> zero-pages, the bulk of them are the result of IO. >>> >> I'll give you a hint: Windows zeroes out freed pages. >> > > Right: "aside from zero-pages". If you exclude zero-pages from your > count of shared pages, the amount of sharing drops a lot. > > >> It should also be possible to hook up arch_free_page() so >> freed pages in Linux guests become sharable. >> >> Furthermore, every guest with the same OS version will be >> running the same system daemons, same glibc, etc. This >> means sharable pages from not just disk IO (probably from >> different disks anyway), >> > > Why? If you're starting a bunch of cookie-cutter guests, then you're > probably starting them from the same template image or COW block > devices. 
(Also, if you're wearing the cost of physical IO anyway, then
> additional cost of hashing is relatively small.)
>
>
>> but also in the BSS and possibly
>> even on the heap.
>>
>
> Well, maybe. Modern systems generally randomize memory layouts, so even
> if they're semantically the same, the pointers will all have different
> values.
>
> Other research into "sharing" mostly-similar pages is more promising for
> that kind of case.
>
>
>> Eventually. It starts out with hashing the first 128 (IIRC)
>> bytes of page content and comparing the hashes. If that
>> matches, it will do content comparison.
>>

The algorithm was changed quite a bit. Izik is planning to resubmit it any day now.

>> Content comparison is done in the background on the host.
>> I suspect (but have not checked) that it is somehow hooked
>> up to the page reclaim code on the host.
>>
>
> Yeah, that's the straightforward approach; there's about a research
> project/year doing a Xen implementation, but they never seem to get very
> good results aside from very artificial test conditions.
>

Actually, we got really good results using KSM along with KVM, running a large number of Windows virtual machines. We can achieve an overcommit ratio of up to 400% of the host RAM for VMs running an M$ Office load.
-dor From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759776AbZC3WUa (ORCPT ); Mon, 30 Mar 2009 18:20:30 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1755564AbZC3WUV (ORCPT ); Mon, 30 Mar 2009 18:20:21 -0400 Received: from mx2.redhat.com ([66.187.237.31]:60901 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755467AbZC3WUU (ORCPT ); Mon, 30 Mar 2009 18:20:20 -0400 Message-ID: <49D144D6.9000001@redhat.com> Date: Tue, 31 Mar 2009 01:16:54 +0300 From: Izik Eidus User-Agent: Mozilla-Thunderbird 2.0.0.19 (X11/20090103) MIME-Version: 1.0 CC: Jeremy Fitzhardinge , Rik van Riel , akpm@osdl.org, nickpiggin@yahoo.com.au, frankeh@watson.ibm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Dave Hansen , virtualization@lists.osdl.org, Martin Schwidefsky , hugh@veritas.com, dlaor@redhat.com Subject: Re: [patch 0/6] Guest page hinting version 7. References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz> <20090329161253.3faffdeb@skybase> <1238428495.8286.638.camel@nimitz> <49D11184.3060002@goop.org> <49D11287.4030307@redhat.com> <49D11674.9040205@goop.org> <49D12564.40708@redhat.com> <49D12D16.6050407@goop.org> <49D13BB9.3010200@redhat.com> In-Reply-To: <49D13BB9.3010200@redhat.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit To: unlisted-recipients:; (no To-header on input) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Jeremy Fitzhardinge wrote: >> Rik van Riel wrote: >> >>> Jeremy Fitzhardinge wrote: >>> >>>> Rik van Riel wrote: >>>> >>>>> Jeremy Fitzhardinge wrote: >>>>> >>>>> >>>>>> That said, people have been looking at tracking block IO to work >>>>>> out when it might be useful to try and share pages between guests >>>>>> under Xen. 
>>>>>>
>>>>> Tracking block IO seems like a bass-ackwards way to figure
>>>>> out what the contents of a memory page are.
>>>>>
>>>> Well, they're research projects, so nobody said that they're
>>>> necessarily useful results ;). I think the rationale is that, in
>>>> general, there aren't all that many sharable pages, and aside from
>>>> zero-pages, the bulk of them are the result of IO.
>>> I'll give you a hint: Windows zeroes out freed pages.
>>>
>>
>> Right: "aside from zero-pages". If you exclude zero-pages from your
>> count of shared pages, the amount of sharing drops a lot.

20026 root  15  0  707m  526m  246m  S  7.0  14.0  0:39.57  qemu-system-x86
20010 root  15  0  707m  526m  239m  S  6.3  14.0  0:47.16  qemu-system-x86
20015 root  15  0  707m  526m  247m  S  5.7  14.0  0:46.84  qemu-system-x86
20031 root  15  0  707m  526m  242m  S  5.7  14.1  0:46.74  qemu-system-x86
20005 root  15  0  707m  526m  239m  S  0.3  14.0  0:54.16  qemu-system-x86

I just ran 5 Debian 5.0 guests, each with 512 MB of physical RAM. All I did was open X and then open Thunderbird and Firefox in them; check the SHR field... You cannot ignore the fact that the libraries and the kernel would be identical among guests and would be shared... Other than the libraries, we get the big bonus of zeroed pages in Windows, but that is not the case in the above example since these guests are Linux.

>>
>>
>>> It should also be possible to hook up arch_free_page() so
>>> freed pages in Linux guests become sharable.
>>>
>>> Furthermore, every guest with the same OS version will be
>>> running the same system daemons, same glibc, etc. This
>>> means sharable pages from not just disk IO (probably from
>>> different disks anyway),
>>>
>>
>> Why? If you're starting a bunch of cookie-cutter guests, then you're
>> probably starting them from the same template image or COW block
>> devices. (Also, if you're wearing the cost of physical IO anyway,
>> then additional cost of hashing is relatively small.)
>>
>>
>>> but also in the BSS and possibly
>>> even on the heap.
>>>
>>
>> Well, maybe. Modern systems generally randomize memory layouts, so
>> even if they're semantically the same, the pointers will all have
>> different values.
>>
>> Other research into "sharing" mostly-similar pages is more promising
>> for that kind of case.
>>
>>
>>> Eventually. It starts out with hashing the first 128 (IIRC)
>>> bytes of page content and comparing the hashes. If that
>>> matches, it will do content comparison.
>>>
> The algorithm was changed quite a bit. Izik is planning to resubmit it
> any day now.
>>> Content comparison is done in the background on the host.
>>> I suspect (but have not checked) that it is somehow hooked
>>> up to the page reclaim code on the host.
>>>
>>
>> Yeah, that's the straightforward approach; there's about a research
>> project/year doing a Xen implementation, but they never seem to get
>> very good results aside from very artificial test conditions.

I keep hearing this argument from Microsoft, but even in the hardest
test condition, how would you make the libraries and the kernel not be
identical among the guests?

Anyway, page sharing is running and installed for our customers, and so
far I only hear from the sales guys how surprised and happy the
customers are with the overcommit that page sharing offers...

Anyway, I have a heavily changed KSM version ready (mostly the logical
algorithm for finding pages), made against the mainline version, and it
is ready to be sent once I get some better benchmark numbers to post on
the list together with the patch...
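
For readers following the thread, the two-stage check quoted above ("hashing the first 128 (IIRC) bytes of page content and comparing the hashes. If that matches, it will do content comparison") can be sketched in user-space C roughly as follows. This is only an illustration of that description, not the KSM code (which, per the reply above, had already changed quite a bit). The names sketch_prefix_hash and sketch_pages_sharable, the 4096-byte page size and the FNV-1a hash are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE   4096	/* assumed page size */
#define SKETCH_HASH_PREFIX 128	/* "first 128 (IIRC) bytes" from the mail */

/*
 * Cheap filter: FNV-1a over the first SKETCH_HASH_PREFIX bytes.
 * The choice of hash function is an assumption for this sketch.
 */
static uint32_t sketch_prefix_hash(const unsigned char *page)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < SKETCH_HASH_PREFIX; i++) {
		h ^= page[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Returns 1 if both pages hold identical content and could therefore
 * be backed by one shared copy, 0 otherwise.
 */
static int sketch_pages_sharable(const unsigned char *a, const unsigned char *b)
{
	assert(a && b);

	/* Stage 1: different prefix hashes mean the pages cannot match. */
	if (sketch_prefix_hash(a) != sketch_prefix_hash(b))
		return 0;

	/* Stage 2: a hash match is only a hint, so compare the full page. */
	return memcmp(a, b, SKETCH_PAGE_SIZE) == 0;
}
```

Two zero-filled pages pass both stages. A page that differs only in its last byte passes the hash stage and is rejected by the full comparison, which is why the second stage cannot be skipped.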
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758022AbZDACO6 (ORCPT ); Tue, 31 Mar 2009 22:14:58 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1754025AbZDACOs (ORCPT ); Tue, 31 Mar 2009 22:14:48 -0400 Received: from mx2.redhat.com ([66.187.237.31]:54553 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753339AbZDACOr (ORCPT ); Tue, 31 Mar 2009 22:14:47 -0400 Message-ID: <49D2CD28.9080700@redhat.com> Date: Tue, 31 Mar 2009 22:10:48 -0400 From: Rik van Riel Organization: Red Hat, Inc User-Agent: Thunderbird 2.0.0.17 (X11/20080915) MIME-Version: 1.0 To: Martin Schwidefsky CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com Subject: Re: [patch 2/6] Guest page hinting: volatile swap cache. References: <20090327150905.819861420@de.ibm.com> <20090327151011.798602788@de.ibm.com> In-Reply-To: <20090327151011.798602788@de.ibm.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Martin Schwidefsky wrote: > From: Martin Schwidefsky > From: Hubertus Franke > From: Himanshu Raj > > The volatile page state can be used for anonymous pages as well, if > they have been added to the swap cache and the swap write is finished. > Signed-off-by: Martin Schwidefsky Acked-by: Rik van Riel -- All rights reversed. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757578AbZDAC4e (ORCPT ); Tue, 31 Mar 2009 22:56:34 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1754268AbZDAC4Y (ORCPT ); Tue, 31 Mar 2009 22:56:24 -0400 Received: from mx2.redhat.com ([66.187.237.31]:45709 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754122AbZDAC4X (ORCPT ); Tue, 31 Mar 2009 22:56:23 -0400 Message-ID: <49D2D6D4.8000309@redhat.com> Date: Tue, 31 Mar 2009 22:52:04 -0400 From: Rik van Riel Organization: Red Hat, Inc User-Agent: Thunderbird 2.0.0.17 (X11/20080915) MIME-Version: 1.0 To: Martin Schwidefsky CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com Subject: Re: [patch 3/6] Guest page hinting: mlocked pages. References: <20090327150905.819861420@de.ibm.com> <20090327151012.095486071@de.ibm.com> In-Reply-To: <20090327151012.095486071@de.ibm.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Martin Schwidefsky wrote: > From: Martin Schwidefsky > From: Hubertus Franke > From: Himanshu Raj > > Add code to get mlock() working with guest page hinting. The problem > with mlock is that locked pages may not be removed from page cache. > That means they need to be stable. > Signed-off-by: Martin Schwidefsky Acked-by: Rik van Riel -- All rights reversed. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1761779AbZDAIOR (ORCPT ); Wed, 1 Apr 2009 04:14:17 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1761538AbZDAINl (ORCPT ); Wed, 1 Apr 2009 04:13:41 -0400 Received: from mtagate2.de.ibm.com ([195.212.17.162]:51156 "EHLO mtagate2.de.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1761506AbZDAINj (ORCPT ); Wed, 1 Apr 2009 04:13:39 -0400 Date: Wed, 1 Apr 2009 10:13:34 +0200 From: Martin Schwidefsky To: Rik van Riel Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com Subject: Re: [patch 2/6] Guest page hinting: volatile swap cache. Message-ID: <20090401101334.7e6ea848@skybase> In-Reply-To: <49D2CD28.9080700@redhat.com> References: <20090327150905.819861420@de.ibm.com> <20090327151011.798602788@de.ibm.com> <49D2CD28.9080700@redhat.com> Organization: IBM Corporation X-Mailer: Claws Mail 3.7.1 (GTK+ 2.14.7; i486-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 31 Mar 2009 22:10:48 -0400 Rik van Riel wrote: > Martin Schwidefsky wrote: > > From: Martin Schwidefsky > > From: Hubertus Franke > > From: Himanshu Raj > > > > The volatile page state can be used for anonymous pages as well, if > > they have been added to the swap cache and the swap write is finished. > > > Signed-off-by: Martin Schwidefsky > > Acked-by: Rik van Riel Thank you for the review. I'll add the Acked-by. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1761948AbZDAIOu (ORCPT ); Wed, 1 Apr 2009 04:14:50 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1761356AbZDAIOF (ORCPT ); Wed, 1 Apr 2009 04:14:05 -0400 Received: from mtagate6.de.ibm.com ([195.212.29.155]:61151 "EHLO mtagate6.de.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1761518AbZDAIOC (ORCPT ); Wed, 1 Apr 2009 04:14:02 -0400 Date: Wed, 1 Apr 2009 10:13:57 +0200 From: Martin Schwidefsky To: Rik van Riel Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com Subject: Re: [patch 3/6] Guest page hinting: mlocked pages. Message-ID: <20090401101357.7f714f7a@skybase> In-Reply-To: <49D2D6D4.8000309@redhat.com> References: <20090327150905.819861420@de.ibm.com> <20090327151012.095486071@de.ibm.com> <49D2D6D4.8000309@redhat.com> Organization: IBM Corporation X-Mailer: Claws Mail 3.7.1 (GTK+ 2.14.7; i486-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 31 Mar 2009 22:52:04 -0400 Rik van Riel wrote: > Martin Schwidefsky wrote: > > From: Martin Schwidefsky > > From: Hubertus Franke > > From: Himanshu Raj > > > > Add code to get mlock() working with guest page hinting. The problem > > with mlock is that locked pages may not be removed from page cache. > > That means they need to be stable. > > > Signed-off-by: Martin Schwidefsky > > Acked-by: Rik van Riel I'll add this one as well. Thanks again. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752197AbZDAN2W (ORCPT ); Wed, 1 Apr 2009 09:28:22 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753016AbZDAN2I (ORCPT ); Wed, 1 Apr 2009 09:28:08 -0400 Received: from mx2.redhat.com ([66.187.237.31]:36960 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750927AbZDAN2F (ORCPT ); Wed, 1 Apr 2009 09:28:05 -0400 Message-ID: <49D36B4E.7000702@redhat.com> Date: Wed, 01 Apr 2009 09:25:34 -0400 From: Rik van Riel Organization: Red Hat, Inc User-Agent: Thunderbird 2.0.0.17 (X11/20080915) MIME-Version: 1.0 To: Martin Schwidefsky CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com Subject: Re: [patch 4/6] Guest page hinting: writable page table entries. References: <20090327150905.819861420@de.ibm.com> <20090327151012.398894143@de.ibm.com> In-Reply-To: <20090327151012.398894143@de.ibm.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Martin Schwidefsky wrote: This code has me stumped. Does it mean that if a page already has the PageWritable bit set (and count_ok stays 0), we will always mark the page as volatile? How does that work out on !s390? 
> /**
> + * __page_check_writable() - check page state for new writable pte
> + *
> + * @page: the page the new writable pte refers to
> + * @pte: the new writable pte
> + */
> +void __page_check_writable(struct page *page, pte_t pte, unsigned int offset)
> +{
> +	int count_ok = 0;
> +
> +	preempt_disable();
> +	while (page_test_set_state_change(page))
> +		cpu_relax();
> +
> +	if (!TestSetPageWritable(page)) {
> +		count_ok = check_counts(page, offset);
> +		if (check_bits(page) && count_ok)
> +			page_set_volatile(page, 1);
> +		else
> +			/*
> +			 * If two processes create a write mapping at the
> +			 * same time check_counts will return false or if
> +			 * the page is currently isolated from the LRU
> +			 * check_bits will return false but the page might
> +			 * be in volatile state.
> +			 * We have to take care about the dirty bit so the
> +			 * only option left is to make the page stable but
> +			 * we can try to make it volatile a bit later.
> +			 */
> +			page_set_stable_if_present(page);
> +	}
> +	page_clear_state_change(page);
> +	if (!count_ok)
> +		page_make_volatile(page, 1);
> +	preempt_enable();
> +}
> +EXPORT_SYMBOL(__page_check_writable);

--
All rights reversed.

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1765881AbZDAOhn (ORCPT ); Wed, 1 Apr 2009 10:37:43 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1764267AbZDAOhH (ORCPT ); Wed, 1 Apr 2009 10:37:07 -0400 Received: from mtagate2.de.ibm.com ([195.212.17.162]:48375 "EHLO mtagate2.de.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1764787AbZDAOhG (ORCPT ); Wed, 1 Apr 2009 10:37:06 -0400 Date: Wed, 1 Apr 2009 16:36:58 +0200 From: Martin Schwidefsky To: Rik van Riel Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com Subject: Re: [patch 4/6] Guest page hinting: writable page table entries.
Message-ID: <20090401163658.60f851ed@skybase> In-Reply-To: <49D36B4E.7000702@redhat.com> References: <20090327150905.819861420@de.ibm.com> <20090327151012.398894143@de.ibm.com> <49D36B4E.7000702@redhat.com> Organization: IBM Corporation X-Mailer: Claws Mail 3.7.1 (GTK+ 2.14.7; i486-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 01 Apr 2009 09:25:34 -0400 Rik van Riel wrote: > Martin Schwidefsky wrote: > > This code has me stumped. Does it mean that if a page already > has the PageWritable bit set (and count_ok stays 0), we will > always mark the page as volatile? > > How does that work out on !s390? No, we will not always mark the page as volatile. If PG_writable is already set count_ok will stay 0 and a call to page_make_volatile is done. This differs from page_set_volatile as it repeats all the required checks, then calls page_set_volatile with a PageWritable(page) as second argument. What state the page will get depends on the architecture definition of page_set_volatile. For s390 this will do a state transition to potentially volatile as the PG_writable bit is set. On architecture that cannot check the dirty bit on a physical page basis you need to make the page stable. 
> > /**
> > + * __page_check_writable() - check page state for new writable pte
> > + *
> > + * @page: the page the new writable pte refers to
> > + * @pte: the new writable pte
> > + */
> > +void __page_check_writable(struct page *page, pte_t pte, unsigned int offset)
> > +{
> > +	int count_ok = 0;
> > +
> > +	preempt_disable();
> > +	while (page_test_set_state_change(page))
> > +		cpu_relax();
> > +
> > +	if (!TestSetPageWritable(page)) {
> > +		count_ok = check_counts(page, offset);
> > +		if (check_bits(page) && count_ok)
> > +			page_set_volatile(page, 1);
> > +		else
> > +			/*
> > +			 * If two processes create a write mapping at the
> > +			 * same time check_counts will return false or if
> > +			 * the page is currently isolated from the LRU
> > +			 * check_bits will return false but the page might
> > +			 * be in volatile state.
> > +			 * We have to take care about the dirty bit so the
> > +			 * only option left is to make the page stable but
> > +			 * we can try to make it volatile a bit later.
> > +			 */
> > +			page_set_stable_if_present(page);
> > +	}
> > +	page_clear_state_change(page);
> > +	if (!count_ok)
> > +		page_make_volatile(page, 1);
> > +	preempt_enable();
> > +}
> > +EXPORT_SYMBOL(__page_check_writable);
> >

--
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
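
The control flow discussed in this exchange can be traced with a small user-space toy model. It is only a sketch under loud assumptions: every sketch_* name is hypothetical, the results of check_counts() and check_bits() are reduced to fixed booleans, page_set_stable_if_present() is reduced to "make stable", and the state-change lock and preemption handling are left out. The arch_tracks_dirty flag stands in for the architecture difference mentioned in the reply (s390 goes to potentially volatile, an architecture without a per-physical-page dirty bit has to keep the page stable).

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_state { ST_STABLE, ST_VOLATILE, ST_POT_VOLATILE };

struct sketch_page {
	bool writable;		/* stands in for PG_writable */
	bool counts_ok;		/* assumed outcome of check_counts() */
	bool bits_ok;		/* assumed outcome of check_bits() */
	bool arch_tracks_dirty;	/* true models the s390 behaviour */
	enum sketch_state state;
};

/* Models the architecture definition of page_set_volatile(). */
static void sketch_set_volatile(struct sketch_page *p, bool writable)
{
	if (!writable)
		p->state = ST_VOLATILE;
	else if (p->arch_tracks_dirty)
		p->state = ST_POT_VOLATILE;
	else
		p->state = ST_STABLE;
}

/* Models page_make_volatile(): repeat the checks, then pass PageWritable. */
static void sketch_make_volatile(struct sketch_page *p)
{
	if (p->counts_ok && p->bits_ok)
		sketch_set_volatile(p, p->writable);
}

/* Models the branch structure of __page_check_writable(). */
static void sketch_check_writable(struct sketch_page *p)
{
	bool count_ok = false;
	bool was_writable;

	assert(p);
	was_writable = p->writable;
	p->writable = true;			/* TestSetPageWritable() */

	if (!was_writable) {
		count_ok = p->counts_ok;
		if (p->bits_ok && count_ok)
			sketch_set_volatile(p, true);
		else
			p->state = ST_STABLE;	/* simplified stable fallback */
	}
	if (!count_ok)
		sketch_make_volatile(p);	/* the "try again later" path */
}
```

In this model a page that is already writable skips the first branch, so count_ok stays false and the retry path decides the state: potentially volatile when arch_tracks_dirty is set, stable otherwise. That matches the answer given above that the page is not always marked volatile.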
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932229AbZDAOsY (ORCPT ); Wed, 1 Apr 2009 10:48:24 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1759546AbZDAOsO (ORCPT ); Wed, 1 Apr 2009 10:48:14 -0400 Received: from mx2.redhat.com ([66.187.237.31]:33584 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758918AbZDAOsO (ORCPT ); Wed, 1 Apr 2009 10:48:14 -0400 Message-ID: <49D37E21.2000609@redhat.com> Date: Wed, 01 Apr 2009 10:45:53 -0400 From: Rik van Riel Organization: Red Hat, Inc User-Agent: Thunderbird 2.0.0.17 (X11/20080915) MIME-Version: 1.0 To: Martin Schwidefsky CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com Subject: Re: [patch 4/6] Guest page hinting: writable page table entries. References: <20090327150905.819861420@de.ibm.com> <20090327151012.398894143@de.ibm.com> <49D36B4E.7000702@redhat.com> <20090401163658.60f851ed@skybase> In-Reply-To: <20090401163658.60f851ed@skybase> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Martin Schwidefsky wrote: > On Wed, 01 Apr 2009 09:25:34 -0400 > Rik van Riel wrote: > >> Martin Schwidefsky wrote: >> >> This code has me stumped. Does it mean that if a page already >> has the PageWritable bit set (and count_ok stays 0), we will >> always mark the page as volatile? >> >> How does that work out on !s390? > > No, we will not always mark the page as volatile. If PG_writable is > already set count_ok will stay 0 and a call to page_make_volatile is > done. This differs from page_set_volatile as it repeats all the > required checks, then calls page_set_volatile with a PageWritable(page) > as second argument. 
What state the page will get depends on the > architecture definition of page_set_volatile. For s390 this will do a > state transition to potentially volatile as the PG_writable bit is set. > On architecture that cannot check the dirty bit on a physical page basis > you need to make the page stable. Good point. I guess that means patch 4/6 checks out right, then :) Acked-by: Rik van Riel -- All rights reversed. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1761495AbZDAPgT (ORCPT ); Wed, 1 Apr 2009 11:36:19 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753984AbZDAPgG (ORCPT ); Wed, 1 Apr 2009 11:36:06 -0400 Received: from mx2.redhat.com ([66.187.237.31]:50305 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752316AbZDAPgF (ORCPT ); Wed, 1 Apr 2009 11:36:05 -0400 Message-ID: <49D38967.8020706@redhat.com> Date: Wed, 01 Apr 2009 11:33:59 -0400 From: Rik van Riel Organization: Red Hat, Inc User-Agent: Thunderbird 2.0.0.17 (X11/20080915) MIME-Version: 1.0 To: Martin Schwidefsky CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com Subject: Re: [patch 5/6] Guest page hinting: minor fault optimization. References: <20090327150905.819861420@de.ibm.com> <20090327151012.713478499@de.ibm.com> In-Reply-To: <20090327151012.713478499@de.ibm.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Martin Schwidefsky wrote: > From: Martin Schwidefsky > From: Hubertus Franke > From: Himanshu Raj > > One of the challenges of the guest page hinting scheme is the cost for > the state transitions. If the cost gets too high the whole concept of > page state information is in question.
Therefore it is important to > avoid the state transitions when possible. > Signed-off-by: Martin Schwidefsky Acked-by: Rik van Riel -- All rights reversed. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1762308AbZDAQUi (ORCPT ); Wed, 1 Apr 2009 12:20:38 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1756649AbZDAQU2 (ORCPT ); Wed, 1 Apr 2009 12:20:28 -0400 Received: from mx2.redhat.com ([66.187.237.31]:44709 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755950AbZDAQU2 (ORCPT ); Wed, 1 Apr 2009 12:20:28 -0400 Message-ID: <49D393F2.2010105@redhat.com> Date: Wed, 01 Apr 2009 12:18:58 -0400 From: Rik van Riel Organization: Red Hat, Inc User-Agent: Thunderbird 2.0.0.17 (X11/20080915) MIME-Version: 1.0 To: Martin Schwidefsky CC: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com Subject: Re: [patch 6/6] Guest page hinting: s390 support. References: <20090327150905.819861420@de.ibm.com> <20090327151013.024372165@de.ibm.com> In-Reply-To: <20090327151013.024372165@de.ibm.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Martin Schwidefsky wrote: > From: Martin Schwidefsky > From: Hubertus Franke > From: Himanshu Raj > > s390 uses the milli-coded ESSA instruction to set the page state. The > page state is formed by four guest page states called block usage states > and three host page states called block content states. > Signed-off-by: Martin Schwidefsky Acked-by: Rik van Riel -- All rights reversed. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754454AbZDBLcW (ORCPT ); Thu, 2 Apr 2009 07:32:22 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751717AbZDBLcK (ORCPT ); Thu, 2 Apr 2009 07:32:10 -0400 Received: from smtp119.mail.mud.yahoo.com ([209.191.84.76]:41522 "HELO smtp119.mail.mud.yahoo.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP id S1751417AbZDBLcJ (ORCPT ); Thu, 2 Apr 2009 07:32:09 -0400 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com.au; h=Received:X-YMail-OSG:X-Yahoo-Newman-Property:From:To:Subject:Date:User-Agent:Cc:References:In-Reply-To:MIME-Version:Content-Type:Content-Transfer-Encoding:Content-Disposition:Message-Id; b=AaI3X+HoDxc8IkuczUp7ls5hwBVw+gDcRYJ3t2ILCUvC0jYN1IyXUcbYUlOHPa9sJG44yW64GG8F5R+7q4n3Danuu0piHC+bTmWlfAA6IAC8ky3FcmCILrUWZxn3esC3DIbGkZJj5Re3CjvN5WevV6uOeYxwXn4Vb8vbaugvXJM= ; X-YMail-OSG: 9pM4BN8VM1nbQIkBjx8yKY_PErlTiIcljVRbLR5z1bjL3FHwfTLPIIn3JNzhld.j9MexwrKUxL9zISFVDUMenjo2ySBiN9GEaxH9S.HEpdOTsjI.btHgiJ7C58x7P.ImWIwr5aOZ7lZAhoXJm55vF8.XQsARCUKjKaYafBjprjoDErhp3d41fB4ZWYJlxTxdmOuW7Ani6oGkvJymCKQ6b9sZPSFbpnqZOOlADwoZlDHteJpZ4IgBaFPJBHL6T9Z_hJBq2Efb85hCdnHIu3cupFVfy78fwA3x.rLqfk6FMCwyamcwQEAfR6O4mfUIj2AljJ0zX.MLGBDOlSyrIG58iNDRIvot X-Yahoo-Newman-Property: ymail-3 From: Nick Piggin To: Martin Schwidefsky Subject: Re: [patch 0/6] Guest page hinting version 7. 
Date: Thu, 2 Apr 2009 22:32:00 +1100 User-Agent: KMail/1.9.51 (KDE/4.0.4; ; ) Cc: Rusty Russell , virtualization@lists.linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, akpm@osdl.org, frankeh@watson.ibm.com, riel@redhat.com, hugh@veritas.com References: <20090327150905.819861420@de.ibm.com> <200903281705.29798.rusty@rustcorp.com.au> <20090329162336.7c0700e9@skybase> In-Reply-To: <20090329162336.7c0700e9@skybase> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200904022232.02185.nickpiggin@yahoo.com.au> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Monday 30 March 2009 01:23:36 Martin Schwidefsky wrote: > On Sat, 28 Mar 2009 17:05:28 +1030 > > Rusty Russell wrote: > > On Saturday 28 March 2009 01:39:05 Martin Schwidefsky wrote: > > > Greetings, > > > the circus is back in town -- another version of the guest page hinting > > > patches. The patches differ from version 6 only in the kernel version, > > > they apply against 2.6.29. My short sniff test showed that the code > > > is still working as expected. > > > > > > To recap (you can skip this if you read the boiler plate of the last > > > version of the patches): > > > The main benefit for guest page hinting vs. the ballooner is that there > > > is no need for a monitor that keeps track of the memory usage of all > > > the guests, a complex algorithm that calculates the working set sizes > > > and for the calls into the guest kernel to control the size of the > > > balloons. > > > > I thought you weren't convinced of the concrete benefits over ballooning, > > or am I misremembering? > > The performance test I have seen so far show that the benefits of > ballooning vs. guest page hinting are about the same. I am still > convinced that the guest page hinting is the way to go because you do > not need an external monitor. 
Calculating the working set size for a > guest is a challenge. With guest page hinting there is no need for a > working set size calculation. Sounds backwards to me. If the benefits are the same, then having complexity in an external monitor (which, by the way, shares many problems and goals of single-kernel resource/workload management), rather than putting a huge chunk of crap in the guest kernel's core mm code. I still think this needs much more justification. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759825AbZDBPxW (ORCPT ); Thu, 2 Apr 2009 11:53:22 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1763055AbZDBPw4 (ORCPT ); Thu, 2 Apr 2009 11:52:56 -0400 Received: from mtagate4.de.ibm.com ([195.212.29.153]:52238 "EHLO mtagate4.de.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1763025AbZDBPwz (ORCPT ); Thu, 2 Apr 2009 11:52:55 -0400 Date: Thu, 2 Apr 2009 17:52:49 +0200 From: Martin Schwidefsky To: Nick Piggin Cc: Rusty Russell , virtualization@lists.linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, akpm@osdl.org, frankeh@watson.ibm.com, riel@redhat.com, hugh@veritas.com Subject: Re: [patch 0/6] Guest page hinting version 7. 
Message-ID: <20090402175249.3c4a6d59@skybase> In-Reply-To: <200904022232.02185.nickpiggin@yahoo.com.au> References: <20090327150905.819861420@de.ibm.com> <200903281705.29798.rusty@rustcorp.com.au> <20090329162336.7c0700e9@skybase> <200904022232.02185.nickpiggin@yahoo.com.au> Organization: IBM Corporation X-Mailer: Claws Mail 3.7.1 (GTK+ 2.14.7; i486-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 2 Apr 2009 22:32:00 +1100 Nick Piggin wrote: > On Monday 30 March 2009 01:23:36 Martin Schwidefsky wrote: > > On Sat, 28 Mar 2009 17:05:28 +1030 > > > > Rusty Russell wrote: > > > On Saturday 28 March 2009 01:39:05 Martin Schwidefsky wrote: > > > > Greetings, > > > > the circus is back in town -- another version of the guest page hinting > > > > patches. The patches differ from version 6 only in the kernel version, > > > > they apply against 2.6.29. My short sniff test showed that the code > > > > is still working as expected. > > > > > > > > To recap (you can skip this if you read the boiler plate of the last > > > > version of the patches): > > > > The main benefit for guest page hinting vs. the ballooner is that there > > > > is no need for a monitor that keeps track of the memory usage of all > > > > the guests, a complex algorithm that calculates the working set sizes > > > > and for the calls into the guest kernel to control the size of the > > > > balloons. > > > > > > I thought you weren't convinced of the concrete benefits over ballooning, > > > or am I misremembering? > > > > The performance test I have seen so far show that the benefits of > > ballooning vs. guest page hinting are about the same. I am still > > convinced that the guest page hinting is the way to go because you do > > not need an external monitor. Calculating the working set size for a > > guest is a challenge. 
With guest page hinting there is no need for a > > working set size calculation.
>
> Sounds backwards to me. If the benefits are the same, then having
> complexity in an external monitor (which, by the way, shares many
> problems and goals of single-kernel resource/workload management),
> rather than putting a huge chunk of crap in the guest kernel's core
> mm code.

The benefits are the same but the algorithmic complexity is reduced.
The patch to the memory management has complexity in itself but from a
1000 feet standpoint guest page hinting is simpler, no? The question of
how much memory each guest has to release does not exist. With the
ballooner I have seen a few problematic cases where the size of the
balloon in principle killed the guest. My favorite is the "clever"
monitor script that queried the guest's free memory and put all free
memory into the balloon. Now guess what happened with a guest that just
booted..

And could you please explain with a few more words >what< you consider
to be "crap"? I can't do anything with a general statement "this is
crap". Which translates to me: leave me alone..

> I still think this needs much more justification.

Ok, I can understand that. We probably need a KVM based version to show
that benefits exist on non-s390 hardware as well.

--
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1763452AbZDBQTE (ORCPT ); Thu, 2 Apr 2009 12:19:04 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1760217AbZDBQSn (ORCPT ); Thu, 2 Apr 2009 12:18:43 -0400 Received: from gw.goop.org ([64.81.55.164]:48350 "EHLO mail.goop.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1761715AbZDBQSm (ORCPT ); Thu, 2 Apr 2009 12:18:42 -0400 Message-ID: <49D4E55F.8010406@goop.org> Date: Thu, 02 Apr 2009 09:18:39 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 2.0.0.21 (X11/20090320) MIME-Version: 1.0 To: Martin Schwidefsky CC: Nick Piggin , akpm@osdl.org, frankeh@watson.ibm.com, virtualization@lists.osdl.org, riel@redhat.com, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, linux-mm@kvack.org, hugh@veritas.com Subject: Re: [patch 0/6] Guest page hinting version 7. References: <20090327150905.819861420@de.ibm.com> <200903281705.29798.rusty@rustcorp.com.au> <20090329162336.7c0700e9@skybase> <200904022232.02185.nickpiggin@yahoo.com.au> <20090402175249.3c4a6d59@skybase> In-Reply-To: <20090402175249.3c4a6d59@skybase> X-Enigmail-Version: 0.95.6 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Martin Schwidefsky wrote: >> I still think this needs much more justification. >> > > Ok, I can understand that. We probably need a KVM based version to show > that benefits exist on non-s390 hardware as well. > BTW, there was a presentation at the most recent Xen summit which makes use of CMM ("Satori: Enlightened Page Sharing", http://www.xen.org/files/xensummit_oracle09/Satori.pdf). 
J From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1761577AbZDBTJ6 (ORCPT ); Thu, 2 Apr 2009 15:09:58 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1754329AbZDBTJt (ORCPT ); Thu, 2 Apr 2009 15:09:49 -0400 Received: from mx2.redhat.com ([66.187.237.31]:55881 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753089AbZDBTJs (ORCPT ); Thu, 2 Apr 2009 15:09:48 -0400 Message-ID: <49D50CB7.2050705@redhat.com> Date: Thu, 02 Apr 2009 15:06:31 -0400 From: Rik van Riel User-Agent: Thunderbird 2.0.0.21 (X11/20090320) MIME-Version: 1.0 To: Martin Schwidefsky CC: Nick Piggin , Rusty Russell , virtualization@lists.linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, akpm@osdl.org, frankeh@watson.ibm.com, hugh@veritas.com Subject: Re: [patch 0/6] Guest page hinting version 7. References: <20090327150905.819861420@de.ibm.com> <200903281705.29798.rusty@rustcorp.com.au> <20090329162336.7c0700e9@skybase> <200904022232.02185.nickpiggin@yahoo.com.au> <20090402175249.3c4a6d59@skybase> In-Reply-To: <20090402175249.3c4a6d59@skybase> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Martin Schwidefsky wrote: > The benefits are the same but the algorithmic complexity is reduced. > The patch to the memory management has complexity in itself but from a > 1000 feet standpoint guest page hinting is simpler, no? Page hinting has a complex, but well understood, mechanism and simple policy. Ballooning has a simpler mechanism, but relies on an as-of-yet undiscovered policy. Having experienced a zillion VM corner cases over the last decade and a bit, I think I prefer a complex mechanism over complex (or worse, unknown!) policy any day. > Ok, I can understand that. 
We probably need a KVM based version to show > that benefits exist on non-s390 hardware as well. I believe it can work for KVM just fine, if we keep the host state and the guest state in separate places (so the guest can always write the guest state without a hypercall). From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755575AbZDBTXK (ORCPT ); Thu, 2 Apr 2009 15:23:10 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S932733AbZDBTWj (ORCPT ); Thu, 2 Apr 2009 15:22:39 -0400 Received: from smtp116.mail.mud.yahoo.com ([209.191.84.165]:43346 "HELO smtp116.mail.mud.yahoo.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP id S932806AbZDBTWi (ORCPT ); Thu, 2 Apr 2009 15:22:38 -0400 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com.au; h=Received:X-YMail-OSG:X-Yahoo-Newman-Property:From:To:Subject:Date:User-Agent:Cc:References:In-Reply-To:MIME-Version:Content-Type:Content-Transfer-Encoding:Content-Disposition:Message-Id; b=eyBT10hSAXDm+sfCi/ODAasusgeY08fXPJtSJhtA5NAmmaV0PrJyDe3g59BtZYYTOvaJYcQae6t1Evc5PzcrQt28OI2Va+dJE5vbIp1hzDEJEF9Rt/YdzYzfY2uMQolI4HjWXDELBmDkSHt3k0buKDJRazhpIb+94rTfYDtU0Vk= ; X-YMail-OSG: y2o0qdsVM1kcd.H5KtJzgGA1fi25zduauOSIpj8vT4VKO5Zm3luApkD18okv9r9iUBnZO1l1vXEteMGhdWhb67Y71Tth3dB_uslCni.Yz5H7WmIzB6NrWvTWSFo6t_OiRDxDou74QW.IUi7J2nD8BB.W8QnwgW0FFSKTcVp_8EqnygHr5ScOOQMK9olU_YAL58ut0xp2MCkuXtUXzxeUN.d9.MqKAM045m3EqB9muToZJQfcUoTt6lJsW0En846FeUmzQfeST2AAww_4BmvgExGlcQEt1BS.J_MAdwIaJKJMfXDUDXRt X-Yahoo-Newman-Property: ymail-3 From: Nick Piggin To: Rik van Riel Subject: Re: [patch 0/6] Guest page hinting version 7. 
Date: Fri, 3 Apr 2009 06:22:29 +1100 User-Agent: KMail/1.9.51 (KDE/4.0.4; ; ) Cc: Martin Schwidefsky , Rusty Russell , virtualization@lists.linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, akpm@osdl.org, frankeh@watson.ibm.com, hugh@veritas.com References: <20090327150905.819861420@de.ibm.com> <20090402175249.3c4a6d59@skybase> <49D50CB7.2050705@redhat.com> In-Reply-To: <49D50CB7.2050705@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200904030622.30935.nickpiggin@yahoo.com.au> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On Friday 03 April 2009 06:06:31 Rik van Riel wrote:
> Martin Schwidefsky wrote:
> > The benefits are the same but the algorithmic complexity is reduced.
> > The patch to the memory management has complexity in itself but from a
> > 1000 feet standpoint guest page hinting is simpler, no?
> Page hinting has a complex, but well understood, mechanism
> and simple policy.
>
> Ballooning has a simpler mechanism, but relies on an
> as-of-yet undiscovered policy.
>
> Having experienced a zillion VM corner cases over the
> last decade and a bit, I think I prefer a complex mechanism
> over complex (or worse, unknown!) policy any day.

I disagree with it being so clear cut. Volatile pagecache policy is
completely out of the control of the Linux VM. Whereas ballooning does
have to make some tradeoff between guests, the actual reclaim will be
driven by the guests. Neither way is perfect, but it's not like the
hypervisor reclaim is foolproof against making a bad tradeoff between
guests.
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1763411AbZDBTaj (ORCPT ); Thu, 2 Apr 2009 15:30:39 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S933356AbZDBTaU (ORCPT ); Thu, 2 Apr 2009 15:30:20 -0400 Received: from extu-mxob-2.symantec.com ([216.10.194.135]:34283 "EHLO extu-mxob-2.symantec.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1762846AbZDBTaT (ORCPT ); Thu, 2 Apr 2009 15:30:19 -0400 Date: Thu, 2 Apr 2009 20:27:21 +0100 (BST) From: Hugh Dickins X-X-Sender: hugh@blonde.anvils To: Martin Schwidefsky cc: Nick Piggin , Rusty Russell , virtualization@lists.linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, akpm@osdl.org, frankeh@watson.ibm.com, riel@redhat.com Subject: Re: [patch 0/6] Guest page hinting version 7. In-Reply-To: <20090402175249.3c4a6d59@skybase> Message-ID: References: <20090327150905.819861420@de.ibm.com> <200903281705.29798.rusty@rustcorp.com.au> <20090329162336.7c0700e9@skybase> <200904022232.02185.nickpiggin@yahoo.com.au> <20090402175249.3c4a6d59@skybase> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 2 Apr 2009, Martin Schwidefsky wrote: > On Thu, 2 Apr 2009 22:32:00 +1100 > Nick Piggin wrote: > > > I still think this needs much more justification. > > Ok, I can understand that. We probably need a KVM based version to show > that benefits exist on non-s390 hardware as well. That would indeed help your cause enormously (I think I made the same point last time). All these complex transitions, added to benefit only an architecture to which few developers have access, asks for trouble - we mm hackers already get caught out often enough by your too-well-camouflaged page_test_dirty(). 
Hugh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1765648AbZDBT6r (ORCPT ); Thu, 2 Apr 2009 15:58:47 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1754759AbZDBT6i (ORCPT ); Thu, 2 Apr 2009 15:58:38 -0400 Received: from gw.goop.org ([64.81.55.164]:47623 "EHLO mail.goop.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753718AbZDBT6i (ORCPT ); Thu, 2 Apr 2009 15:58:38 -0400 Message-ID: <49D518E9.1090001@goop.org> Date: Thu, 02 Apr 2009 12:58:33 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 2.0.0.21 (X11/20090320) MIME-Version: 1.0 To: Rik van Riel CC: Martin Schwidefsky , akpm@osdl.org, Nick Piggin , frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, linux-mm@kvack.org, hugh@veritas.com, Xen-devel Subject: Re: [patch 0/6] Guest page hinting version 7. References: <20090327150905.819861420@de.ibm.com> <200903281705.29798.rusty@rustcorp.com.au> <20090329162336.7c0700e9@skybase> <200904022232.02185.nickpiggin@yahoo.com.au> <20090402175249.3c4a6d59@skybase> <49D50CB7.2050705@redhat.com> In-Reply-To: <49D50CB7.2050705@redhat.com> X-Enigmail-Version: 0.95.6 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Rik van Riel wrote: > Page hinting has a complex, but well understood, mechanism > and simple policy. > For the guest perhaps, and yes, it does push the problem out to the host. But that doesn't make solving a performance problem any easier if you end up in a mess. > Ballooning has a simpler mechanism, but relies on an > as-of-yet undiscovered policy. > (I'm talking about Xen ballooning here; I know KVM ballooning works differently.) Yes and no. 
If you want to be able to shrink the guest very aggressively, then you
need to be very careful about not shrinking too much for its current and
near-future needs. But you'll get into an equivalently bad state with
page hinting if the host decides to swap out and discard lots of
persistent guest pages.

When the host demands memory from the guest, in the simple case
ballooning is analogous to page hinting:

    * give up free pages == mark pages unused
    * give up clean pages == mark pages volatile
    * cause pressure to release some memory == host swapping

The flipside is how guests can ask for memory if their needs increase
again. Page hinting is fault-driven, so the guest may stall while the
host sorts out some memory to back the guest's pages. Ballooning
requires the guest to explicitly ask for memory, and that could be done
in advance if it notices the pool of easily-freed pages is shrinking
rapidly (though I guess it could be done on demand as well, but we don't
have hooks for that).

But of course, there are other approaches people are playing with, like
Dan Magenheimer's transcendent memory, which is a pool of
hypervisor-owned and managed pages which guests can use via a copy
interface, as a second-chance page discard cache, fast swap, etc. Such
mechanisms may be easier on both the guest complexity and policy fronts.

The more complex host policy decisions of how to balance overall memory
use system-wide are much the same for both mechanisms.
J From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933499AbZDBUIf (ORCPT ); Thu, 2 Apr 2009 16:08:35 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S932926AbZDBUIT (ORCPT ); Thu, 2 Apr 2009 16:08:19 -0400 Received: from mx2.redhat.com ([66.187.237.31]:38225 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932914AbZDBUIS (ORCPT ); Thu, 2 Apr 2009 16:08:18 -0400 Message-ID: <49D51A82.8090908@redhat.com> Date: Thu, 02 Apr 2009 16:05:22 -0400 From: Rik van Riel User-Agent: Thunderbird 2.0.0.21 (X11/20090320) MIME-Version: 1.0 To: Nick Piggin CC: Martin Schwidefsky , Rusty Russell , virtualization@lists.linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, akpm@osdl.org, frankeh@watson.ibm.com, hugh@veritas.com Subject: Re: [patch 0/6] Guest page hinting version 7. References: <20090327150905.819861420@de.ibm.com> <20090402175249.3c4a6d59@skybase> <49D50CB7.2050705@redhat.com> <200904030622.30935.nickpiggin@yahoo.com.au> In-Reply-To: <200904030622.30935.nickpiggin@yahoo.com.au> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Nick Piggin wrote: > On Friday 03 April 2009 06:06:31 Rik van Riel wrote: > >> Ballooning has a simpler mechanism, but relies on an >> as-of-yet undiscovered policy. >> >> Having experienced a zillion VM corner cases over the >> last decade and a bit, I think I prefer a complex mechanism >> over complex (or worse, unknown!) policy any day. >> > > I disagree with it being so clear cut. Volatile pagecache policy is completely > out of the control of the Linux VM. Wheras ballooning does have to make some > tradeoff between guests, but the actual reclaim will be driven by the guests. 
> Neither way is perfect, but it's not like the hypervisor reclaim is foolproof
> against making a bad tradeoff between guests.
>

I guess we could try to figure out a simple and robust policy
for ballooning. If we can come up with a policy which nobody
can shoot holes in by just discussing it, it may be worth
implementing and benchmarking.

Maybe something based on the host passing memory pressure
on to the guests, and the guests having their own memory
pressure push back to the host.

I'll start by telling you the best auto-ballooning policy idea
I have come up with so far, and the (major?) hole in it.

Basically, the host needs the memory pressure notification,
where the VM will notify the guests when memory is running
low (and something could soon be swapped). At that point,
each guest which receives the signal will try to free some
memory and return it to the host.

Each guest can have the reverse in its own pageout code.
Once memory pressure grows to a certain point (eg. when
the guest is about to swap something out), it could reclaim
a few pages from the host.

If all the guests behave themselves, this could work.

However, even with just reasonably behaving guests,
differences between the VMs in each guest could lead
to unbalanced reclaiming, penalizing better behaving
guests.

If one guest is behaving badly, it could really impact
the other guests.

Can you think of improvements to this idea?

Can you think of another balloon policy that does
not have nasty corner cases?
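
For the sake of discussion, here is a toy model of the host-to-guest half of that idea. It is only a sketch: the `Guest` class, `host_pressure_round()` and all page counts are made up for illustration and do not correspond to any existing balloon driver. It shows the hole described above, where a guest that ignores the pressure signal keeps its memory while the cooperative guests shrink.

```python
from dataclasses import dataclass


@dataclass
class Guest:
    # Hypothetical per-guest bookkeeping, in pages.
    name: str
    pages: int                # pages currently held by the guest
    freeable: int             # pages the guest could give up cheaply
    cooperative: bool = True  # does it react to the host's signal?

    def on_host_pressure(self, want: int) -> int:
        """Host signalled low memory; return the number of pages released."""
        if not self.cooperative:
            return 0
        give = min(want, self.freeable)
        self.pages -= give
        self.freeable -= give
        return give


def host_pressure_round(guests, shortfall):
    """Ask every guest for an equal share of the shortfall.

    Returns the number of pages actually recovered.
    """
    share = -(-shortfall // len(guests))  # ceiling division
    return sum(g.on_host_pressure(share) for g in guests)


guests = [
    Guest("a", pages=1000, freeable=400),
    Guest("b", pages=1000, freeable=400),
    Guest("c", pages=1000, freeable=400, cooperative=False),
]
recovered = host_pressure_round(guests, shortfall=300)
```

With these made-up numbers the host asks each guest for 100 pages, gets 200 back, and guest "c" is untouched, so the next round has to squeeze "a" and "b" again.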
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933914AbZDBUQl (ORCPT ); Thu, 2 Apr 2009 16:16:41 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1761972AbZDBUQc (ORCPT ); Thu, 2 Apr 2009 16:16:32 -0400 Received: from mx2.redhat.com ([66.187.237.31]:41153 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1759087AbZDBUQc (ORCPT ); Thu, 2 Apr 2009 16:16:32 -0400 Message-ID: <49D51CA9.6090601@redhat.com> Date: Thu, 02 Apr 2009 16:14:33 -0400 From: Rik van Riel User-Agent: Thunderbird 2.0.0.21 (X11/20090320) MIME-Version: 1.0 To: Jeremy Fitzhardinge CC: Martin Schwidefsky , akpm@osdl.org, Nick Piggin , frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, linux-mm@kvack.org, hugh@veritas.com, Xen-devel Subject: Re: [patch 0/6] Guest page hinting version 7. References: <20090327150905.819861420@de.ibm.com> <200903281705.29798.rusty@rustcorp.com.au> <20090329162336.7c0700e9@skybase> <200904022232.02185.nickpiggin@yahoo.com.au> <20090402175249.3c4a6d59@skybase> <49D50CB7.2050705@redhat.com> <49D518E9.1090001@goop.org> In-Reply-To: <49D518E9.1090001@goop.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Jeremy Fitzhardinge wrote: > The more complex host policy decisions of how to balance overall > memory use system-wide are much in the same for both mechanisms. Not at all. Page hinting is just an optimization to host swapping, where IO can be avoided on many of the pages that hit the end of the LRU. No decisions have to be made at all about balancing memory use between guests, it just happens through regular host LRU aging. 
Automatic ballooning requires that something on the host figures out how much memory each guest needs and sizes the guests appropriately. All the proposed policies for that which I have seen have some nasty corner cases or are simply very limited in scope. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1762952AbZDBUev (ORCPT ); Thu, 2 Apr 2009 16:34:51 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1756669AbZDBUem (ORCPT ); Thu, 2 Apr 2009 16:34:42 -0400 Received: from gw.goop.org ([64.81.55.164]:50714 "EHLO mail.goop.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751147AbZDBUem (ORCPT ); Thu, 2 Apr 2009 16:34:42 -0400 Message-ID: <49D5215D.6050503@goop.org> Date: Thu, 02 Apr 2009 13:34:37 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 2.0.0.21 (X11/20090320) MIME-Version: 1.0 To: Rik van Riel CC: Martin Schwidefsky , akpm@osdl.org, Nick Piggin , frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, linux-mm@kvack.org, hugh@veritas.com, Xen-devel Subject: Re: [patch 0/6] Guest page hinting version 7. References: <20090327150905.819861420@de.ibm.com> <200903281705.29798.rusty@rustcorp.com.au> <20090329162336.7c0700e9@skybase> <200904022232.02185.nickpiggin@yahoo.com.au> <20090402175249.3c4a6d59@skybase> <49D50CB7.2050705@redhat.com> <49D518E9.1090001@goop.org> <49D51CA9.6090601@redhat.com> In-Reply-To: <49D51CA9.6090601@redhat.com> X-Enigmail-Version: 0.95.6 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Rik van Riel wrote: > Jeremy Fitzhardinge wrote: >> The more complex host policy decisions of how to balance overall >> memory use system-wide are much in the same for both mechanisms. > Not at all. 
> Page hinting is just an optimization to host swapping, where
> IO can be avoided on many of the pages that hit the end of the LRU.
>
> No decisions have to be made at all about balancing memory use
> between guests, it just happens through regular host LRU aging.

When the host pages out a page belonging to guest A, then it's making a
policy decision on how large guest A should be compared to B. If the
policy is a global LRU on all guest pages, then that's still a policy on
guest sizes: the target size is a function of its working set, assuming
that the working set is well modelled by LRU. I imagine that if the
guest and host are both managing their pages with an LRU-like algorithm
you'll get some nasty interactions, which page hinting tries to alleviate.

> Automatic ballooning requires that something on the host figures
> out how much memory each guest needs and sizes the guests
> appropriately. All the proposed policies for that which I have
> seen have some nasty corner cases or are simply very limited
> in scope.

Well, you could apply something equivalent to a global LRU: ask for more
pages from guests who have the most unused pages. (I'm not saying that
it's necessarily a useful policy.)
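
As a rough sketch of what "ask the guests with the most unused pages first" could look like: the function name `pick_balloon_targets()` and the per-guest unused counts are hypothetical, and how the host would learn those counts is left open.

```python
def pick_balloon_targets(unused_by_guest, needed):
    """Return {guest: pages_to_request}.

    Guests are visited in order of how many unused pages they report,
    largest first, until `needed` pages have been covered.
    """
    requests = {}
    by_unused = sorted(unused_by_guest.items(),
                       key=lambda kv: kv[1], reverse=True)
    for guest, unused in by_unused:
        if needed <= 0:
            break
        take = min(unused, needed)
        if take > 0:
            requests[guest] = take
            needed -= take
    return requests


# Hypothetical numbers: the host is 350 pages short.
requests = pick_balloon_targets({"a": 50, "b": 300, "c": 120}, needed=350)
```

Here guest "b" is asked for all 300 of its unused pages, "c" for the remaining 50, and "a" is left alone.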
J From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756843AbZDCAv1 (ORCPT ); Thu, 2 Apr 2009 20:51:27 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1755212AbZDCAvG (ORCPT ); Thu, 2 Apr 2009 20:51:06 -0400 Received: from gw.goop.org ([64.81.55.164]:41677 "EHLO mail.goop.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753378AbZDCAvE (ORCPT ); Thu, 2 Apr 2009 20:51:04 -0400 Message-ID: <49D55D69.5030605@goop.org> Date: Thu, 02 Apr 2009 17:50:49 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 2.0.0.21 (X11/20090320) MIME-Version: 1.0 To: Rik van Riel CC: Nick Piggin , akpm@osdl.org, frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, linux-mm@kvack.org, Martin Schwidefsky , hugh@veritas.com Subject: Re: [patch 0/6] Guest page hinting version 7. References: <20090327150905.819861420@de.ibm.com> <20090402175249.3c4a6d59@skybase> <49D50CB7.2050705@redhat.com> <200904030622.30935.nickpiggin@yahoo.com.au> <49D51A82.8090908@redhat.com> In-Reply-To: <49D51A82.8090908@redhat.com> X-Enigmail-Version: 0.95.6 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Rik van Riel wrote: > I guess we could try to figure out a simple and robust policy > for ballooning. If we can come up with a policy which nobody > can shoot holes in by just discussing it, it may be worth > implementing and benchmarking. > > Maybe something based on the host passing memory pressure > on to the guests, and the guests having their own memory > pressure push back to the host. > > I'l start by telling you the best auto-ballooning policy idea > I have come up with so far, and the (major?) hole in it. 
>

I think the first step is to describe reasonably precisely the outcome
you're trying to get to. Once you have that you can start talking about
policies and mechanisms to achieve it. I suspect we all have basically
the same thing in mind, but there's no harm in being explicit.

I'm assuming that:

   1. Each domain has a minimum guaranteed amount of resident memory.
      If you want to shrink a domain to smaller than that minimum, you
      may as well take all its memory away (ie suspend to disk,
      completely swap out, migrate elsewhere, etc). The amount is at
      least the bare-minimum WSS for the domain, but it may be higher to
      achieve other guarantees.
   2. Each domain has a maximum allowable resident memory, which could
      be unbounded. The sum of all maximums could well exceed the total
      amount of host memory, and that represents the overcommit case.
   3. Each domain has a weight, or memory priority. The simple case is
      that they all have the same weight, but a useful implementation
      would probably allow more.
   4. Domains can be cooperative, unhelpful (ignore all requests and
      make none) or malicious (ignore requests, always try to claim more
      memory). An incompetent cooperative domain could be effectively
      unhelpful or malicious.
          * hard max limits will prevent non-cooperative domains from
            causing too much damage
          * they could be limited in other ways, by lowering IO or CPU
            priorities
          * a domain's "goodness" could be measured by looking to see
            how much memory it is actually using relative to its min
            size and its weight
          * other remedies are essentially non-technical, such as more
            expensive billing the more non-good a domain is
          * (it's hard to force a Xen domain to give up memory you've
            already given it)

Given that, what outcome do we want? What are we optimising for?

    * Overall throughput?
    * Fairness?
    * Minimise wastage?
    * Rapid response to changes in conditions? (Cope with domains
      swinging between 64MB and 3GB on a regular basis?)
    * Punish wrong-doers / Reward cooperative domains?
    * ...?
Trying to make one thing work for all cases isn't going to be simple or
robust. If we pick one or two (minimise wastage+overall throughput?)
then it might be more tractable.

> Basically, the host needs the memory pressure notification,
> where the VM will notify the guests when memory is running
> low (and something could soon be swapped). At that point,
> each guest which receives the signal will try to free some
> memory and return it to the host.
>
> Each guest can have the reverse in its own pageout code.
> Once memory pressure grows to a certain point (eg. when
> the guest is about to swap something out), it could reclaim
> a few pages from the host.
>
> If all the guests behave themselves, this could work.
>

Yes. It seems to me the basic metric is that each domain needs to keep
track of how much easily allocatable memory it has on hand (ie, pages it
can drop without causing a significant increase in IO). If it gets too
large, then it can afford to give pages back to the host. If it gets too
small, it must ask for more memory (preferably early enough to prevent a
real memory crunch).

> However, even with just reasonably behaving guests,
> differences between the VMs in each guest could lead
> to unbalanced reclaiming, penalizing better behaving
> guests.
>

Well, it depends on what you mean by penalized. If they can function
properly with the amount of memory they have, then they're fine. If
they're struggling because they don't have enough memory for their WSS,
then they got their "do I have enough memory on hand" calculation wrong.

> If one guest is behaving badly, it could really impact
> the other guests.
>
> Can you think of improvements to this idea?
>
> Can you think of another balloon policy that does
> not have nasty corner cases?
>

In fully cooperative environments you can rely on ballooning to move
things around dramatically.
But with only partially cooperative guests, the best you can hope for is that it allows you some provisioning flexibility so you can deal with fluctuating demands in guests, but not order-of-magnitude size changes. You just have to leave enough headroom to make the corner cases not too pointy. J From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757927AbZDCItd (ORCPT ); Fri, 3 Apr 2009 04:49:33 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752608AbZDCItV (ORCPT ); Fri, 3 Apr 2009 04:49:21 -0400 Received: from mtagate8.de.ibm.com ([195.212.29.157]:38951 "EHLO mtagate8.de.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751683AbZDCItT (ORCPT ); Fri, 3 Apr 2009 04:49:19 -0400 Date: Fri, 3 Apr 2009 10:49:13 +0200 From: Martin Schwidefsky To: Jeremy Fitzhardinge Cc: Rik van Riel , akpm@osdl.org, Nick Piggin , frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, linux-mm@kvack.org, hugh@veritas.com, Xen-devel Subject: Re: [patch 0/6] Guest page hinting version 7. 
Message-ID: <20090403104913.29c62082@skybase> In-Reply-To: <49D5215D.6050503@goop.org> References: <20090327150905.819861420@de.ibm.com> <200903281705.29798.rusty@rustcorp.com.au> <20090329162336.7c0700e9@skybase> <200904022232.02185.nickpiggin@yahoo.com.au> <20090402175249.3c4a6d59@skybase> <49D50CB7.2050705@redhat.com> <49D518E9.1090001@goop.org> <49D51CA9.6090601@redhat.com> <49D5215D.6050503@goop.org> Organization: IBM Corporation X-Mailer: Claws Mail 3.7.1 (GTK+ 2.14.7; i486-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 02 Apr 2009 13:34:37 -0700 Jeremy Fitzhardinge wrote: > Rik van Riel wrote: > > Jeremy Fitzhardinge wrote: > >> The more complex host policy decisions of how to balance overall > >> memory use system-wide are much in the same for both mechanisms. > > Not at all. Page hinting is just an optimization to host swapping, where > > IO can be avoided on many of the pages that hit the end of the LRU. > > > > No decisions have to be made at all about balancing memory use > > between guests, it just happens through regular host LRU aging. > > When the host pages out a page belonging to guest A, then its making a > policy decision on how large guest A should be compared to B. If the > policy is a global LRU on all guest pages, then that's still a policy on > guest sizes: the target size is a function of its working set, assuming > that the working set is well modelled by LRU. I imagine that if the > guest and host are both managing their pages with an LRU-like algorithm > you'll get some nasty interactions, which page hinting tries to alleviate. This is the basic idea of guest page hinting. Let the host memory manager make it decision based on the data it has. 
That includes page age determined with a global LRU list, page age
determined with a per-guest LRU list, i/o rates of the guests, and all
kinds of policy about which guest should have how much memory. The page
hinting comes into play AFTER the decision has been made which page to
evict. Only then should the host look at the volatile vs. stable page
state and decide what has to be done with the page. If it is volatile
the host can throw the page away because the guest can recreate it with
LESS effort. That is the optimization.

> > Automatic ballooning requires that something on the host figures
> > out how much memory each guest needs and sizes the guests
> > appropriately. All the proposed policies for that which I have
> > seen have some nasty corner cases or are simply very limited
> > in scope.
>
> Well, you could apply something equivalent to a global LRU: ask for more
> pages from guests who have the most unused pages. (I'm not saying that
> its necessarily a useful policy.)

But with page hinting you don't even have to ask. Just take the pages
if you need them. The guest already told you that you can have them by
setting the unused state.

--
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
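
The ordering described in the message above (pick the victim first, look at the guest's page state only afterwards) can be sketched as follows. This is a simplified illustration, not z/VM's implementation: the `PageState` enum, `evict()` and `writes_needed()` are hypothetical names, and the "potentially volatile" state from the cover letter is left out.

```python
from enum import Enum


class PageState(Enum):
    STABLE = "S"    # content must be preserved by the host
    UNUSED = "U"    # guest does not care about the content
    VOLATILE = "V"  # guest can recreate the content itself


def evict(state):
    """Return the host's action for a page it has ALREADY chosen to evict.

    The page state plays no part in choosing the victim; it only decides
    what happens to the page once it has been chosen.
    """
    if state is PageState.UNUSED:
        return "reuse"                   # nothing to save
    if state is PageState.VOLATILE:
        return "discard"                 # no write to the paging device
    return "write_to_paging_device"      # stable: keep the content


def writes_needed(victims):
    """Count how many of the chosen victims cost a paging-device write."""
    return sum(1 for s in victims if evict(s) == "write_to_paging_device")


# Hypothetical victim list produced by the host's normal LRU scan.
victims = [PageState.STABLE, PageState.VOLATILE, PageState.VOLATILE,
           PageState.UNUSED, PageState.STABLE]
```

Of the five victims in this made-up list only the two stable pages need I/O; the rest are dropped or reused directly.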
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935205AbZDCSTi (ORCPT ); Fri, 3 Apr 2009 14:19:38 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1763055AbZDCSTa (ORCPT ); Fri, 3 Apr 2009 14:19:30 -0400 Received: from gw.goop.org ([64.81.55.164]:43794 "EHLO mail.goop.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757719AbZDCST3 (ORCPT ); Fri, 3 Apr 2009 14:19:29 -0400 Message-ID: <49D6532C.6010804@goop.org> Date: Fri, 03 Apr 2009 11:19:24 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 2.0.0.21 (X11/20090320) MIME-Version: 1.0 To: Martin Schwidefsky CC: Rik van Riel , akpm@osdl.org, Nick Piggin , frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, linux-mm@kvack.org, hugh@veritas.com, Xen-devel Subject: Re: [patch 0/6] Guest page hinting version 7. References: <20090327150905.819861420@de.ibm.com> <200903281705.29798.rusty@rustcorp.com.au> <20090329162336.7c0700e9@skybase> <200904022232.02185.nickpiggin@yahoo.com.au> <20090402175249.3c4a6d59@skybase> <49D50CB7.2050705@redhat.com> <49D518E9.1090001@goop.org> <49D51CA9.6090601@redhat.com> <49D5215D.6050503@goop.org> <20090403104913.29c62082@skybase> In-Reply-To: <20090403104913.29c62082@skybase> X-Enigmail-Version: 0.95.6 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Martin Schwidefsky wrote: > This is the basic idea of guest page hinting. Let the host memory > manager make it decision based on the data it has. That includes page > age determined with a global LRU list, page age determined with a > per-guest LRU list, i/o rates of the guests, all kind of policy which > guest should have how much memory. Do you look at fault rates? Refault rates? 
> The page hinting comes into play
> AFTER the decision has been made which page to evict. Only then the host
> should look at the volatile vs. stable page state and decide what has
> to be done with the page. If it is volatile the host can throw the page
> away because the guest can recreate it with LESS effort. That is the
> optimization.
>

Yes, and it's good from that perspective. Do you really implement it
purely that way, or do you bias the LRU to push volatile and free pages
down the end of the LRU list in preference to pages which must be
preserved? If you have a small bias then you can prefer to evict easily
evictable pages compared to their near-equivalents which require IO.

> But with page hinting you don't have to even ask. Just take the pages
> if you need them. The guest already told you that you can have them by
> setting the unused state.
>

Yes. But it still depends on the guest. A very helpful guest could
deliberately preswap pages so that it can mark them as volatile, whereas
a less helpful one may keep them persistent and defer preswapping them
until there's a good reason to do so. Host swapping and page hinting
won't put any apparent memory pressure on the guest, so it has no reason
to start preswapping even if the overall system is under pressure.
Ballooning will expose each guest to its share of the overall system
memory pressure, so they can respond appropriately (one hopes).
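
To make the "small bias" question concrete, here is a minimal sketch of victim selection where pages that can be dropped without I/O get a fixed age bonus. The function `pick_victim()`, the tuple layout and the numbers are all assumptions for illustration; nothing here describes what any host actually does.

```python
def pick_victim(pages, bias=0):
    """Return the name of the page to evict.

    pages: list of (name, age, needs_io) tuples, where a larger age means
    the page was used longer ago. Pages that can be dropped without I/O
    get `bias` added to their age, so they beat slightly older pages
    that would need a write.
    """
    def score(page):
        _name, age, needs_io = page
        return age + (0 if needs_io else bias)

    return max(pages, key=score)[0]


# Hypothetical page list.
pages = [
    ("stable", 100, True),        # oldest, but eviction costs a write
    ("volatile", 95, False),      # nearly as old, free to drop
    ("hot_volatile", 10, False),  # recently used, free to drop
]
pure_lru = pick_victim(pages)            # no bias: strict age order
biased = pick_victim(pages, bias=10)     # small bias toward cheap evictions
```

With no bias the oldest page goes regardless of cost; with a bias of 10 the near-equivalent volatile page goes instead, while the recently used volatile page is still protected.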
J From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754999AbZDFHVh (ORCPT ); Mon, 6 Apr 2009 03:21:37 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753820AbZDFHV1 (ORCPT ); Mon, 6 Apr 2009 03:21:27 -0400 Received: from mtagate8.de.ibm.com ([195.212.29.157]:48206 "EHLO mtagate8.de.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753434AbZDFHV0 (ORCPT ); Mon, 6 Apr 2009 03:21:26 -0400 Date: Mon, 6 Apr 2009 09:21:11 +0200 From: Martin Schwidefsky To: Jeremy Fitzhardinge Cc: Rik van Riel , akpm@osdl.org, Nick Piggin , frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, linux-mm@kvack.org, hugh@veritas.com, Xen-devel Subject: Re: [patch 0/6] Guest page hinting version 7. Message-ID: <20090406092111.3b432edd@skybase> In-Reply-To: <49D6532C.6010804@goop.org> References: <20090327150905.819861420@de.ibm.com> <200903281705.29798.rusty@rustcorp.com.au> <20090329162336.7c0700e9@skybase> <200904022232.02185.nickpiggin@yahoo.com.au> <20090402175249.3c4a6d59@skybase> <49D50CB7.2050705@redhat.com> <49D518E9.1090001@goop.org> <49D51CA9.6090601@redhat.com> <49D5215D.6050503@goop.org> <20090403104913.29c62082@skybase> <49D6532C.6010804@goop.org> Organization: IBM Corporation X-Mailer: Claws Mail 3.7.1 (GTK+ 2.14.7; i486-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 03 Apr 2009 11:19:24 -0700 Jeremy Fitzhardinge wrote: > Martin Schwidefsky wrote: > > This is the basic idea of guest page hinting. Let the host memory > > manager make it decision based on the data it has. 
That includes page
> > age determined with a global LRU list, page age determined with a
> > per-guest LRU list, i/o rates of the guests, all kind of policy which
> > guest should have how much memory.
>
> Do you look at fault rates? Refault rates?

That is hidden in the memory management of z/VM. I know some details of
how the z/VM page manager works, but in the end, as the guest operating
system, I don't care.

> > The page hinting comes into play
> > AFTER the decision has been made which page to evict. Only then the host
> > should look at the volatile vs. stable page state and decide what has
> > to be done with the page. If it is volatile the host can throw the page
> > away because the guest can recreate it with LESS effort. That is the
> > optimization.
> >
>
> Yes, and it's good from that perspective. Do you really implement it
> purely that way, or do you bias the LRU to push volatile and free pages
> down the end of the LRU list in preference to pages which must be
> preserved? If you have a small bias then you can prefer to evict easily
> evictable pages compared to their near-equivalents which require IO.

We thought about a bias to prefer volatile pages but in the end decided
against it. We do prefer free pages: if the page manager finds an unused
page it will reuse it immediately.

> > But with page hinting you don't have to even ask. Just take the pages
> > if you need them. The guest already told you that you can have them by
> > setting the unused state.
> >
>
> Yes. But it still depends on the guest. A very helpful guest could
> deliberately preswap pages so that it can mark them as volatile, whereas
> a less helpful one may keep them persistent and defer preswapping them
> until there's a good reason to do so. Host swapping and page hinting
> won't put any apparent memory pressure on the guest, so it has no reason
> to start preswapping even if the overall system is under pressure.
> Ballooning will expose each guest to its share of the overall system > memory pressure, so they can respond appropriately (one hopes). Why should the guest want to do preswapping? It is as expensive for the host to swap a page and get it back as it is for the guest (= one write + one read). It is a waste of cpu time to call into the guest. You need something we call PFAULT though: if a guest process hits a page that is missing in the host page table you don't want to stop the virtual cpu until the page is back. You notify the guest that the host page is missing. The process that caused the fault is put to sleep until the host retrieved the page again. You will find the pfault code for s390 in arch/s390/mm/fault.c So to me preswap doesn't make sense. The only thing you can gain by putting memory pressure on the guest is to free some of the memory that is used by the kernel for dentries, inodes, etc. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754739AbZDFHdY (ORCPT ); Mon, 6 Apr 2009 03:33:24 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1754906AbZDFHc5 (ORCPT ); Mon, 6 Apr 2009 03:32:57 -0400 Received: from smtp118.mail.mud.yahoo.com ([209.191.84.167]:48703 "HELO smtp118.mail.mud.yahoo.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP id S1754540AbZDFHcu (ORCPT ); Mon, 6 Apr 2009 03:32:50 -0400 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com.au; h=Received:X-YMail-OSG:X-Yahoo-Newman-Property:From:To:Subject:Date:User-Agent:Cc:References:In-Reply-To:MIME-Version:Content-Type:Content-Transfer-Encoding:Content-Disposition:Message-Id; b=xuPcjJnW0e/2gBBKZtnUBW6SYTUctwa4SGRxIJXlMFa6ewTggpAaGGAGQInzStzO1TG3EfnwDw8EPX4NHHlk/ViQgn6zLOdw9RwpUCaZR8hZ5NpcSYcLS2NaFNNRAd16+V8UK6z9a051xqINGl797IV4UXiei/uj9IEgCLY6qQw= ; X-YMail-OSG: 
0HG4kfUVM1m4_Lp7e0AIWZV.3iumzoDDndXJgx1ZvP4T.IY0K.hh06lNn8FApswuhLmdBmnIggjnxF3WxoLTQinju6zqA.Glenv1E.lcogLB94sABI7p4RwgurYut4CL44FeCVo4RxDoa4LAuVYtYsssgkxfpIGgaE0IKKo0Ia6C1pz71l9.80uktqM02P_p3RAdm1nDO6i.zy4FRjUMM6XYbkKjLtZ7JeCxc9rUQz4kUpcoXhICyl6DM8nePFYE3d8G6irV0XXH2PYy17qzq7gb_InsIvAAw0cryrko6egs77hBlWJP X-Yahoo-Newman-Property: ymail-3 From: Nick Piggin To: Martin Schwidefsky Subject: Re: [patch 0/6] Guest page hinting version 7. Date: Mon, 6 Apr 2009 17:32:39 +1000 User-Agent: KMail/1.9.51 (KDE/4.0.4; ; ) Cc: Jeremy Fitzhardinge , Rik van Riel , akpm@osdl.org, frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, linux-mm@kvack.org, hugh@veritas.com, "Xen-devel" References: <20090327150905.819861420@de.ibm.com> <49D6532C.6010804@goop.org> <20090406092111.3b432edd@skybase> In-Reply-To: <20090406092111.3b432edd@skybase> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200904061732.39885.nickpiggin@yahoo.com.au> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Monday 06 April 2009 17:21:11 Martin Schwidefsky wrote: > On Fri, 03 Apr 2009 11:19:24 -0700 > > Yes. But it still depends on the guest. A very helpful guest could > > deliberately preswap pages so that it can mark them as volatile, whereas > > a less helpful one may keep them persistent and defer preswapping them > > until there's a good reason to do so. Host swapping and page hinting > > won't put any apparent memory pressure on the guest, so it has no reason > > to start preswapping even if the overall system is under pressure. > > Ballooning will expose each guest to its share of the overall system > > memory pressure, so they can respond appropriately (one hopes). > > Why should the guest want to do preswapping? 
It is as expensive for > the host to swap a page and get it back as it is for the guest (= one > write + one read). It is a waste of cpu time to call into the guest. You > need something we call PFAULT though: if a guest process hits a page > that is missing in the host page table you don't want to stop the > virtual cpu until the page is back. You notify the guest that the host > page is missing. The process that caused the fault is put to sleep > until the host retrieved the page again. You will find the pfault code > for s390 in arch/s390/mm/fault.c > > So to me preswap doesn't make sense. The only thing you can gain by > putting memory pressure on the guest is to free some of the memory that > is used by the kernel for dentries, inodes, etc. The guest kernel can have more context about usage patterns, or user hints set on some pages or ranges. And as you say, there are non-pagecache things to free that can be taking significant or most of the freeable memory, and there can be policy knobs set in the guest (swappiness or vfs_cache_pressure etc). I guess that counters or performance monitoring events in the guest should also look more like a normal Linux kernel (although I haven't remembered what you do in that department in your patches). 
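
As an illustration of the cost argument in this thread (this is not code from the patch set), here is a minimal C sketch of the I/O accounting for one page that the host evicts and the guest later touches again. It assumes the three hint states discussed here: a stable page costs the host one write plus one read, a volatile page is discarded and the guest re-reads it from its own block device, and an unused page costs nothing. The names page_hint, io_cost, evict_then_touch and total_ios are hypothetical and appear neither in the patches nor in z/VM.

```c
#include <assert.h>

/* Hypothetical hint states, modelled on the stable/volatile/unused
 * states described in the thread; not the kernel's real definitions. */
enum page_hint { HINT_STABLE, HINT_VOLATILE, HINT_UNUSED };

/* I/O operations caused by one evict-then-touch cycle of a single page. */
struct io_cost {
    int host_writes;  /* host writes the page to its paging device */
    int host_reads;   /* host reads the page back on a host page fault */
    int guest_reads;  /* guest re-reads the page from its block device */
};

/* Assumed cost model: the host has already chosen this page as the
 * eviction victim, and the guest accesses the page again afterwards. */
struct io_cost evict_then_touch(enum page_hint hint)
{
    struct io_cost c = { 0, 0, 0 };

    assert(hint == HINT_STABLE || hint == HINT_VOLATILE ||
           hint == HINT_UNUSED);

    switch (hint) {
    case HINT_STABLE:
        /* Content must be preserved: one write out, one read back. */
        c.host_writes = 1;
        c.host_reads = 1;
        break;
    case HINT_VOLATILE:
        /* Host discards the page; the guest gets a discard fault and
         * recreates the page with a single read. */
        c.guest_reads = 1;
        break;
    case HINT_UNUSED:
        /* The guest declared the content irrelevant: no I/O at all. */
        break;
    }
    return c;
}

/* Total number of I/O operations in the cycle. */
int total_ios(struct io_cost c)
{
    return c.host_writes + c.host_reads + c.guest_reads;
}
```

Under these assumptions, total_ios(evict_then_touch(HINT_STABLE)) is 2 and total_ios(evict_then_touch(HINT_VOLATILE)) is 1. The model deliberately ignores the guest-side policy knobs mentioned above (swappiness, vfs_cache_pressure), which only memory pressure inside the guest would exercise.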
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755170AbZDFTXt (ORCPT ); Mon, 6 Apr 2009 15:23:49 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752594AbZDFTXk (ORCPT ); Mon, 6 Apr 2009 15:23:40 -0400 Received: from gw.goop.org ([64.81.55.164]:42365 "EHLO mail.goop.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752206AbZDFTXj (ORCPT ); Mon, 6 Apr 2009 15:23:39 -0400 Message-ID: <49DA56B7.9020905@goop.org> Date: Mon, 06 Apr 2009 12:23:35 -0700 From: Jeremy Fitzhardinge User-Agent: Thunderbird 2.0.0.21 (X11/20090320) MIME-Version: 1.0 To: Martin Schwidefsky CC: Rik van Riel , akpm@osdl.org, Nick Piggin , frankeh@watson.ibm.com, virtualization@lists.osdl.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, linux-mm@kvack.org, hugh@veritas.com, Xen-devel Subject: Re: [patch 0/6] Guest page hinting version 7. References: <20090327150905.819861420@de.ibm.com> <200903281705.29798.rusty@rustcorp.com.au> <20090329162336.7c0700e9@skybase> <200904022232.02185.nickpiggin@yahoo.com.au> <20090402175249.3c4a6d59@skybase> <49D50CB7.2050705@redhat.com> <49D518E9.1090001@goop.org> <49D51CA9.6090601@redhat.com> <49D5215D.6050503@goop.org> <20090403104913.29c62082@skybase> <49D6532C.6010804@goop.org> <20090406092111.3b432edd@skybase> In-Reply-To: <20090406092111.3b432edd@skybase> X-Enigmail-Version: 0.95.6 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Martin Schwidefsky wrote: > Why should the guest want to do preswapping? It is as expensive for > the host to swap a page and get it back as it is for the guest (= one > write + one read). Yes, perhaps for swapping, but in general it makes sense for the guest to write the pages to backing store to prevent host swapping. 
For swap pages there's no big benefit, but for file-backed pages it's
better for the guest to do it.

> The only thing you can gain by
> putting memory pressure on the guest is to free some of the memory that
> is used by the kernel for dentries, inodes, etc.
>

Well, that's also significant. My point is that the guest has multiple
ways in which it can relieve its own memory pressure in response to
overall system memory pressure; it's just that I happened to pick the
example where it's much of a muchness.

    J

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 7FA336B003D for ; Sun, 29 Mar 2009 10:13:37 -0400 (EDT)
Received: from d12nrmr1607.megacenter.de.ibm.com (d12nrmr1607.megacenter.de.ibm.com [9.149.167.49]) by mtagate5.de.ibm.com (8.14.3/8.13.8) with ESMTP id n2TECwrA461016 for ; Sun, 29 Mar 2009 14:12:58 GMT
Received: from d12av02.megacenter.de.ibm.com (d12av02.megacenter.de.ibm.com [9.149.165.228]) by d12nrmr1607.megacenter.de.ibm.com (8.13.8/8.13.8/NCO v9.2) with ESMTP id n2TECwsF4288714 for ; Sun, 29 Mar 2009 16:12:58 +0200
Received: from d12av02.megacenter.de.ibm.com (loopback [127.0.0.1]) by d12av02.megacenter.de.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id n2TECvbT016381 for ; Sun, 29 Mar 2009 16:12:57 +0200
Date: Sun, 29 Mar 2009 16:12:53 +0200
From: Martin Schwidefsky
Subject: Re: [patch 0/6] Guest page hinting version 7.
Message-ID: <20090329161253.3faffdeb@skybase>
In-Reply-To: <1238195024.8286.562.camel@nimitz>
References: <20090327150905.819861420@de.ibm.com> <1238195024.8286.562.camel@nimitz>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: Dave Hansen
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, frankeh@watson.ibm.com, akpm@osdl.org, nickpiggin@yahoo.com.au, hugh@veritas.com, riel@redhat.com
List-ID:

On Fri, 27 Mar 2009 16:03:43 -0700
Dave Hansen wrote:

> On Fri, 2009-03-27 at 16:09 +0100, Martin Schwidefsky wrote:
> > If the host picks one of the
> > pages the guest can recreate, the host can throw it away instead of writing
> > it to the paging device. Simple and elegant.
>
> Heh, simple and elegant for the hypervisor. But I'm not sure I'm going
> to call *anything* that requires a new CPU instruction elegant. ;)

Hey, it's cool if you can request an instruction to solve your problem :-)

> I don't see any description of it in there any more, but I thought this
> entire patch set was to get rid of the idiotic triple I/Os in the
> following scenario:
>
> 1. Hypervisor picks a page and evicts it out to disk, pays the I/O cost
>    to get it written out. (I/O #1)
> 2. Linux comes along (being a bit late to the party) and picks the same
>    page, also decides it needs to be out to disk
> 3. Linux tries to write the page to disk, but touches it in the
>    process, pulling the page back in from the store where the hypervisor
>    wrote it. (I/O #2)
> 4. Linux writes the page to its swap device (I/O #3)
>
> I don't see that mentioned at all in the current description.
> Simplifying the hypervisor is hard to get behind, but cutting system I/O
> by 2/3 is a much nicer benefit for 1200 lines of invasive code. ;)

You are right, for a newcomer to the party the advantages of this
approach are not really obvious.
Should have copied some more text from the boilerplate of the previous
versions. Yes, the guest page hinting code aims to reduce the host's
swap I/O. There are two scenarios: one is the above, the other is a
simple read-only file cache page.

Without hinting:
1. Hypervisor picks a page and evicts it; that is one write I/O.
2. Linux accesses the page and causes a host page fault. The host reads
   the page from its swap disk, one read I/O.
In total 2 I/O operations.

With hinting:
1. Hypervisor picks a page, finds it volatile and throws it away.
2. Linux accesses the page and gets a discard fault from the host. Linux
   reads the file page from its block device.
This is just one I/O operation.

> Can we persuade the hypervisor to tell us which pages it decided to page
> out and just skip those when we're scanning the LRU?

One principle of the whole approach is that the hypervisor does not
call into an otherwise idle guest. The cost of scheduling the virtual
cpu is just too high. So we would need a means to store the information
where the guest can pick it up when it happens to scan the LRU. I don't
think that this will work out.

--
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 9B16B6B003D for ; Thu, 2 Apr 2009 12:24:24 -0400 (EDT)
From: Nick Piggin
Subject: Re: [patch 0/6] Guest page hinting version 7.
Date: Fri, 3 Apr 2009 03:23:36 +1100 References: <20090327150905.819861420@de.ibm.com> <200904022232.02185.nickpiggin@yahoo.com.au> <20090402175249.3c4a6d59@skybase> In-Reply-To: <20090402175249.3c4a6d59@skybase> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200904030323.37523.nickpiggin@yahoo.com.au> Sender: owner-linux-mm@kvack.org To: Martin Schwidefsky Cc: Rusty Russell , virtualization@lists.linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, virtualization@lists.osdl.org, akpm@osdl.org, frankeh@watson.ibm.com, riel@redhat.com, hugh@veritas.com List-ID: On Friday 03 April 2009 02:52:49 Martin Schwidefsky wrote: > On Thu, 2 Apr 2009 22:32:00 +1100 > Nick Piggin wrote: > > > On Monday 30 March 2009 01:23:36 Martin Schwidefsky wrote: > > > On Sat, 28 Mar 2009 17:05:28 +1030 > > > > > > Rusty Russell wrote: > > > > On Saturday 28 March 2009 01:39:05 Martin Schwidefsky wrote: > > > > > Greetings, > > > > > the circus is back in town -- another version of the guest page hinting > > > > > patches. The patches differ from version 6 only in the kernel version, > > > > > they apply against 2.6.29. My short sniff test showed that the code > > > > > is still working as expected. > > > > > > > > > > To recap (you can skip this if you read the boiler plate of the last > > > > > version of the patches): > > > > > The main benefit for guest page hinting vs. the ballooner is that there > > > > > is no need for a monitor that keeps track of the memory usage of all > > > > > the guests, a complex algorithm that calculates the working set sizes > > > > > and for the calls into the guest kernel to control the size of the > > > > > balloons. > > > > > > > > I thought you weren't convinced of the concrete benefits over ballooning, > > > > or am I misremembering? > > > > > > The performance test I have seen so far show that the benefits of > > > ballooning vs. 
guest page hinting are about the same. I am still
> > > convinced that the guest page hinting is the way to go because you do
> > > not need an external monitor. Calculating the working set size for a
> > > guest is a challenge. With guest page hinting there is no need for a
> > > working set size calculation.
> >
> > Sounds backwards to me. If the benefits are the same, then having
> > complexity in an external monitor (which, by the way, shares many
> > problems and goals of single-kernel resource/workload management),
> > rather than putting a huge chunk of crap in the guest kernel's core
> > mm code.
>
> The benefits are the same but the algorithmic complexity is reduced.
> The patch to the memory management has complexity in itself but from a
> 1000 feet standpoint guest page hinting is simpler, no?

Yeah, but that's a tradeoff I'll begrudgingly make, considering
a) lots of people doing workload management inside cgroups/containers
   need similar algorithmic complexity, so improvements to those
   algorithms will help one another
b) it may be adding complexity, but it isn't adding complexity to a
   subsystem that is already among the most complex in the kernel
c) I don't have to help maintain it

> The question
> how much memory each guest has to release does not exist. With the
> ballooner I have seen a few problematic cases where the size of
> the balloon in principle killed the guest. My favorite is the "clever"
> monitor script that queried the guest's free memory and put all free
> memory into the balloon. Now guess what happened with a guest that just
> booted..
>
> And could you please explain with a few more words >what< you consider
> to be "crap"? I can't do anything with a general statement "this is
> crap". Which translates to me: leave me alone.. :)

No, it's cool code, interesting idea etc, and last time I looked I don't
think I saw any fundamental (or even any significant incidental) bugs.
So I guess my problem with it is that it adds complexity to benefit
a small portion of users where there is already another solution that
another set of users already require.

> > I still think this needs much more justification.
>
> Ok, I can understand that. We probably need a KVM based version to show
> that benefits exist on non-s390 hardware as well.

Should be significantly better than ballooning too.