* [RFC] respect the referenced bit of KVM guest pages?
From: Wu Fengguang @ 2009-08-05  2:40 UTC
To: Rik van Riel
Cc: Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli, Avi Kivity,
    Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
    Mel Gorman, LKML, linux-mm

Greetings,

Jeff Dike found that many KVM pages are being refaulted in 2.6.29:

    "Lots of pages are being discarded due to memory pressure only to be
    faulted back in soon after.  These pages are nearly all stack pages.
    This is not consistent - sometimes there are relatively few such pages
    and they are spread out between processes."

The refaults can be drastically reduced by the following patch, which
respects the referenced bit of all anonymous pages (including the KVM
pages).

However, it risks reintroducing the problem addressed by commit 7e9cd4842
(fix reclaim scalability problem by ignoring the referenced bit, mainly
the pte young bit).  I wonder if there are better solutions?

Thanks,
Fengguang

---
 mm/vmscan.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -1288,12 +1288,12 @@ static void shrink_active_list(unsigned
 		 * Identify referenced, file-backed active pages and
 		 * give them one more trip around the active list. So
 		 * that executable code get better chances to stay in
-		 * memory under moderate memory pressure.  Anon pages
-		 * are not likely to be evicted by use-once streaming
-		 * IO, plus JVM can create lots of anon VM_EXEC pages,
-		 * so we ignore them here.
+		 * memory under moderate memory pressure.
+		 *
+		 * Also protect anon pages: swapping could be costly,
+		 * and KVM guest's referenced bit is helpful.
 		 */
-		if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
+		if ((vm_flags & VM_EXEC) || PageAnon(page)) {
 			list_add(&page->lru, &l_active);
 			continue;
 		}
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: KOSAKI Motohiro @ 2009-08-05  4:15 UTC
To: Wu Fengguang
Cc: kosaki.motohiro, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
    Andrea Arcangeli, Avi Kivity, Hugh Dickins, Andrew Morton,
    Christoph Lameter, Mel Gorman, LKML, linux-mm

Hi

> Greetings,
>
> Jeff Dike found that many KVM pages are being refaulted in 2.6.29:
>
>     "Lots of pages are being discarded due to memory pressure only to be
>     faulted back in soon after.  These pages are nearly all stack pages.
>     This is not consistent - sometimes there are relatively few such pages
>     and they are spread out between processes."

This result really surprises me.

 - Why does this issue happen only on KVM?
 - Why can't shrink_inactive_list() find the pte young bit?
   Is this really unused stack?

> The refaults can be drastically reduced by the following patch, which
> respects the referenced bit of all anonymous pages (including the KVM
> pages).
>
> However, it risks reintroducing the problem addressed by commit 7e9cd4842
> (fix reclaim scalability problem by ignoring the referenced bit, mainly
> the pte young bit).  I wonder if there are better solutions?
>
> Thanks,
> Fengguang
>
> ---
>  mm/vmscan.c |   10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> --- linux.orig/mm/vmscan.c
> +++ linux/mm/vmscan.c
> @@ -1288,12 +1288,12 @@ static void shrink_active_list(unsigned
>  		 * Identify referenced, file-backed active pages and
>  		 * give them one more trip around the active list. So
>  		 * that executable code get better chances to stay in
> -		 * memory under moderate memory pressure.  Anon pages
> -		 * are not likely to be evicted by use-once streaming
> -		 * IO, plus JVM can create lots of anon VM_EXEC pages,
> -		 * so we ignore them here.
> +		 * memory under moderate memory pressure.
> +		 *
> +		 * Also protect anon pages: swapping could be costly,
> +		 * and KVM guest's referenced bit is helpful.
>  		 */
> -		if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
> +		if ((vm_flags & VM_EXEC) || PageAnon(page)) {
>  			list_add(&page->lru, &l_active);
>  			continue;
>  		}
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Wu Fengguang @ 2009-08-05  4:41 UTC
To: KOSAKI Motohiro
Cc: Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
    Andrea Arcangeli, Avi Kivity, Hugh Dickins, Andrew Morton,
    Christoph Lameter, Mel Gorman, LKML, linux-mm

On Wed, Aug 05, 2009 at 12:15:40PM +0800, KOSAKI Motohiro wrote:
> Hi
>
> > Greetings,
> >
> > Jeff Dike found that many KVM pages are being refaulted in 2.6.29:
> >
> >     "Lots of pages are being discarded due to memory pressure only to be
> >     faulted back in soon after.  These pages are nearly all stack pages.
> >     This is not consistent - sometimes there are relatively few such pages
> >     and they are spread out between processes."
>
> This result really surprises me.
>
>  - Why does this issue happen only on KVM?

Maybe because
- they take up a large portion of memory
- their access patterns/frequencies vary a lot

>  - Why can't shrink_inactive_list() find the pte young bit?

It can, but I guess the grace period would be much shorter than with
this patch.

>    Is this really unused stack?

They were actually being refaulted.  So they should be kind of
not-too-hot as well as not-too-cold pages.

Thanks,
Fengguang

> > The refaults can be drastically reduced by the following patch, which
> > respects the referenced bit of all anonymous pages (including the KVM
> > pages).
> >
> > However, it risks reintroducing the problem addressed by commit 7e9cd4842
> > (fix reclaim scalability problem by ignoring the referenced bit, mainly
> > the pte young bit).  I wonder if there are better solutions?
> >
> > Thanks,
> > Fengguang
> >
> > ---
> >  mm/vmscan.c |   10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > --- linux.orig/mm/vmscan.c
> > +++ linux/mm/vmscan.c
> > @@ -1288,12 +1288,12 @@ static void shrink_active_list(unsigned
> >  		 * Identify referenced, file-backed active pages and
> >  		 * give them one more trip around the active list. So
> >  		 * that executable code get better chances to stay in
> > -		 * memory under moderate memory pressure.  Anon pages
> > -		 * are not likely to be evicted by use-once streaming
> > -		 * IO, plus JVM can create lots of anon VM_EXEC pages,
> > -		 * so we ignore them here.
> > +		 * memory under moderate memory pressure.
> > +		 *
> > +		 * Also protect anon pages: swapping could be costly,
> > +		 * and KVM guest's referenced bit is helpful.
> >  		 */
> > -		if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
> > +		if ((vm_flags & VM_EXEC) || PageAnon(page)) {
> >  			list_add(&page->lru, &l_active);
> >  			continue;
> >  		}
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Avi Kivity @ 2009-08-05  7:58 UTC
To: Wu Fengguang
Cc: Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
    Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
    KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On 08/05/2009 05:40 AM, Wu Fengguang wrote:
> Greetings,
>
> Jeff Dike found that many KVM pages are being refaulted in 2.6.29:
>
>     "Lots of pages are being discarded due to memory pressure only to be
>     faulted back in soon after.  These pages are nearly all stack pages.
>     This is not consistent - sometimes there are relatively few such pages
>     and they are spread out between processes."
>
> The refaults can be drastically reduced by the following patch, which
> respects the referenced bit of all anonymous pages (including the KVM
> pages).
>
> However, it risks reintroducing the problem addressed by commit 7e9cd4842
> (fix reclaim scalability problem by ignoring the referenced bit, mainly
> the pte young bit).  I wonder if there are better solutions?

How do you distinguish between kvm pages and non-kvm anonymous pages?
More importantly, why should you?

Jeff, do you see the refaults on Nehalem systems?  If so, that's likely
due to the lack of an accessed bit on EPT pagetables.  It would be
interesting to compare with Barcelona (which does).

If that's indeed the case, we can have the EPT ageing mechanism give
pages a bit more time around by using an available bit in the EPT PTEs
to return accessed on the first pass and not-accessed on the second.

-- 
error compiling committee.c: too many arguments to function
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Avi Kivity @ 2009-08-05  8:17 UTC
To: Wu Fengguang
Cc: Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
    Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
    KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1455 bytes --]

On 08/05/2009 10:58 AM, Avi Kivity wrote:
> On 08/05/2009 05:40 AM, Wu Fengguang wrote:
>> Greetings,
>>
>> Jeff Dike found that many KVM pages are being refaulted in 2.6.29:
>>
>>     "Lots of pages are being discarded due to memory pressure only to be
>>     faulted back in soon after.  These pages are nearly all stack pages.
>>     This is not consistent - sometimes there are relatively few such pages
>>     and they are spread out between processes."
>>
>> The refaults can be drastically reduced by the following patch, which
>> respects the referenced bit of all anonymous pages (including the KVM
>> pages).
>>
>> However, it risks reintroducing the problem addressed by commit 7e9cd4842
>> (fix reclaim scalability problem by ignoring the referenced bit, mainly
>> the pte young bit).  I wonder if there are better solutions?
>
> How do you distinguish between kvm pages and non-kvm anonymous pages?
> More importantly, why should you?
>
> Jeff, do you see the refaults on Nehalem systems?  If so, that's
> likely due to the lack of an accessed bit on EPT pagetables.  It would
> be interesting to compare with Barcelona (which does).
>
> If that's indeed the case, we can have the EPT ageing mechanism give
> pages a bit more time around by using an available bit in the EPT PTEs
> to return accessed on the first pass and not-accessed on the second.

The attached patch implements this.
-- 
error compiling committee.c: too many arguments to function

[-- Attachment #2: ept-emulate-accessed-bit.patch --]
[-- Type: text/x-patch, Size: 2115 bytes --]

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7b53614..310938a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -195,6 +195,7 @@ static u64 __read_mostly shadow_x_mask;	/* mutual exclusive with nx_mask */
 static u64 __read_mostly shadow_user_mask;
 static u64 __read_mostly shadow_accessed_mask;
 static u64 __read_mostly shadow_dirty_mask;
+static int __read_mostly shadow_accessed_shift;
 
 static inline u64 rsvd_bits(int s, int e)
 {
@@ -219,6 +220,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 {
 	shadow_user_mask = user_mask;
 	shadow_accessed_mask = accessed_mask;
+	shadow_accessed_shift
+		= find_first_bit((void *)&shadow_accessed_mask, 64);
 	shadow_dirty_mask = dirty_mask;
 	shadow_nx_mask = nx_mask;
 	shadow_x_mask = x_mask;
@@ -817,11 +820,11 @@ static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
 	while (spte) {
 		int _young;
 		u64 _spte = *spte;
-		BUG_ON(!(_spte & PT_PRESENT_MASK));
-		_young = _spte & PT_ACCESSED_MASK;
+		BUG_ON(!is_shadow_present_pte(_spte));
+		_young = _spte & shadow_accessed_mask;
 		if (_young) {
 			young = 1;
-			clear_bit(PT_ACCESSED_SHIFT, (unsigned long *)spte);
+			clear_bit(shadow_accessed_shift, (unsigned long *)spte);
 		}
 		spte = rmap_next(kvm, rmapp, spte);
 	}
@@ -2572,7 +2575,7 @@ static void kvm_mmu_access_page(struct kvm_vcpu *vcpu, gfn_t gfn)
 	    && shadow_accessed_mask
 	    && !(*spte & shadow_accessed_mask)
 	    && is_shadow_present_pte(*spte))
-		set_bit(PT_ACCESSED_SHIFT, (unsigned long *)spte);
+		set_bit(shadow_accessed_shift, (unsigned long *)spte);
 }
 
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 0ba706e..bc99367 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4029,7 +4029,7 @@ static int __init vmx_init(void)
 		bypass_guest_pf = 0;
 		kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK |
 			VMX_EPT_WRITABLE_MASK);
-		kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull,
+		kvm_mmu_set_mask_ptes(0ull, 1ull << 63, 0ull, 0ull,
 			VMX_EPT_EXECUTABLE_MASK);
 		kvm_enable_tdp();
 	} else
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Rik van Riel @ 2009-08-05 14:33 UTC
To: Avi Kivity
Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
    Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
    KOSAKI Motohiro, Mel Gorman, LKML, linux-mm, KVM list

Avi Kivity wrote:

> The attached patch implements this.

The attached patch requires each page to go around twice before it is
evicted, but the pages will still get evicted in the order in which
they were made present.

FIFO page replacement was shown to be a bad idea in the 1960's and it
is still a terrible idea today.

-- 
All rights reversed.
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Avi Kivity @ 2009-08-05 15:37 UTC
To: Rik van Riel
Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
    Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
    KOSAKI Motohiro, Mel Gorman, LKML, linux-mm, KVM list

On 08/05/2009 05:33 PM, Rik van Riel wrote:
> Avi Kivity wrote:
>
>> The attached patch implements this.
>
> The attached patch requires each page to go around twice before it is
> evicted, but the pages will still get evicted in the order in which
> they were made present.
>
> FIFO page replacement was shown to be a bad idea in the 1960's and it
> is still a terrible idea today.

Which is why we have accessed bits in page tables... but emulating the
accessed bit via RWX (note: no present bit in EPT) is better than
ignoring it.

-- 
error compiling committee.c: too many arguments to function
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Rik van Riel @ 2009-08-05 14:15 UTC
To: Avi Kivity
Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
    Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
    KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Avi Kivity wrote:

>> However, it risks reintroducing the problem addressed by commit 7e9cd4842
>> (fix reclaim scalability problem by ignoring the referenced bit, mainly
>> the pte young bit).  I wonder if there are better solutions?

Agreed, we need to figure out what the real problem is, and how to
solve it better.

> Jeff, do you see the refaults on Nehalem systems?  If so, that's likely
> due to the lack of an accessed bit on EPT pagetables.  It would be
> interesting to compare with Barcelona (which does).

Not having a hardware accessed bit would explain why the VM is not
reactivating the pages that were accessed while on the inactive list.

> If that's indeed the case, we can have the EPT ageing mechanism give
> pages a bit more time around by using an available bit in the EPT PTEs
> to return accessed on the first pass and not-accessed on the second.

Can we find out which pages are EPT pages?

If so, we could unmap them when they get moved from the active to the
inactive list, and soft fault them back in on access, emulating the
referenced bit for EPT pages and making page replacement on them work
like it should.

Your approximation of pretending the page is accessed the first time
and pretending it's not the second time sounds like it will just lead
to less efficient FIFO replacement, not to anything even vaguely
approximating LRU.

-- 
All rights reversed.
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Avi Kivity @ 2009-08-05 15:12 UTC
To: Rik van Riel
Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
    Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
    KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On 08/05/2009 05:15 PM, Rik van Riel wrote:
>> If that's indeed the case, we can have the EPT ageing mechanism give
>> pages a bit more time around by using an available bit in the EPT
>> PTEs to return accessed on the first pass and not-accessed on the
>> second.
>
> Can we find out which pages are EPT pages?

No need to (see below).

> If so, we could unmap them when they get moved from the
> active to the inactive list, and soft fault them back in
> on access, emulating the referenced bit for EPT pages and
> making page replacement on them work like it should.

It should be easy to implement via the mmu notifier callback: when the
mm calls clear_flush_young(), mark it as young, and unmap it from the
EPT pagetable.

> Your approximation of pretending the page is accessed the
> first time and pretending it's not the second time sounds
> like it will just lead to less efficient FIFO replacement,
> not to anything even vaguely approximating LRU.

Right, it's just a hack that gives EPT pages higher priority, like the
original patch suggested.  Note that LRU for VMs is not a good
algorithm, since the VM will also reference the least recently used
page, leading to thrashing.

-- 
error compiling committee.c: too many arguments to function
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Rik van Riel @ 2009-08-05 15:15 UTC
To: Avi Kivity
Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
    Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
    KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

Avi Kivity wrote:

>> If so, we could unmap them when they get moved from the
>> active to the inactive list, and soft fault them back in
>> on access, emulating the referenced bit for EPT pages and
>> making page replacement on them work like it should.
>
> It should be easy to implement via the mmu notifier callback: when the
> mm calls clear_flush_young(), mark it as young, and unmap it from the
> EPT pagetable.

You mean "mark it as old"?

>> Your approximation of pretending the page is accessed the
>> first time and pretending it's not the second time sounds
>> like it will just lead to less efficient FIFO replacement,
>> not to anything even vaguely approximating LRU.
>
> Right, it's just a hack that gives EPT pages higher priority, like the
> original patch suggested.  Note that LRU for VMs is not a good
> algorithm, since the VM will also reference the least recently used
> page, leading to thrashing.

That is one of the reasons we use a very coarse two-handed clock
algorithm instead of true LRU.  LRU has more overhead and more
artifacts :)

-- 
All rights reversed.
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Avi Kivity @ 2009-08-05 15:25 UTC
To: Rik van Riel
Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
    Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter,
    KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On 08/05/2009 06:15 PM, Rik van Riel wrote:
> Avi Kivity wrote:
>
>>> If so, we could unmap them when they get moved from the
>>> active to the inactive list, and soft fault them back in
>>> on access, emulating the referenced bit for EPT pages and
>>> making page replacement on them work like it should.
>>
>> It should be easy to implement via the mmu notifier callback: when
>> the mm calls clear_flush_young(), mark it as young, and unmap it from
>> the EPT pagetable.
>
> You mean "mark it as old"?

I meant 'return young, and drop it from the EPT pagetable'.

If we use the present bit as a replacement for the accessed bit,
present means young, and clear_flush_young means 'if present, return
young and unmap, otherwise return old'.

See kvm_age_rmapp() in arch/x86/kvm/mmu.c.

-- 
error compiling committee.c: too many arguments to function
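To make the scheme concrete, here is a minimal sketch of the ageing step
described above (illustration only: spte_present() and zap_spte() are
hypothetical helper names, and the real logic would live in
kvm_age_rmapp() plus the mmu notifier glue):

	/*
	 * Emulate the accessed bit on EPT by treating "mapped" as
	 * "young": any page the guest touched since the last ageing
	 * scan must have been faulted back into the EPT page table.
	 */
	static int ept_age_spte(struct kvm *kvm, u64 *spte)
	{
		if (!spte_present(*spte))
			return 0;		/* not touched: old */

		/*
		 * Report young, and unmap the page so the next guest
		 * access takes a minor fault that maps it back in,
		 * recording the "accessed" information again.
		 */
		zap_spte(kvm, spte);
		return 1;			/* touched: young */
	}
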
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Andrea Arcangeli @ 2009-08-05 16:35 UTC
To: Avi Kivity
Cc: Rik van Riel, Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
    Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
    Mel Gorman, LKML, linux-mm

On Wed, Aug 05, 2009 at 06:25:28PM +0300, Avi Kivity wrote:
> On 08/05/2009 06:15 PM, Rik van Riel wrote:
> > Avi Kivity wrote:
> >
> >>> If so, we could unmap them when they get moved from the
> >>> active to the inactive list, and soft fault them back in
> >>> on access, emulating the referenced bit for EPT pages and
> >>> making page replacement on them work like it should.
> >>
> >> It should be easy to implement via the mmu notifier callback: when
> >> the mm calls clear_flush_young(), mark it as young, and unmap it from
> >> the EPT pagetable.
> >
> > You mean "mark it as old"?
>
> I meant 'return young, and drop it from the EPT pagetable'.
>
> If we use the present bit as a replacement for the accessed bit,
> present means young, and clear_flush_young means 'if present, return
> young and unmap, otherwise return old'.

This is the only way to provide accurate information, and it's still a
minor fault, so it's not very different from returning young the first
time around and old the second time around without invalidating the
spte... But the one reason I like it more is that it is done at the
right time, like for the ptes, so it's probably best to implement it
this way to ensure total fairness of the VM regardless of whether it's
the guest or qemu-kvm touching the virtual memory.
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Andrea Arcangeli @ 2009-08-05 16:31 UTC
To: Rik van Riel
Cc: Avi Kivity, Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
    Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
    Mel Gorman, LKML, linux-mm

On Wed, Aug 05, 2009 at 10:15:16AM -0400, Rik van Riel wrote:
> Not having a hardware accessed bit would explain why the VM is not
> reactivating the pages that were accessed while on the inactive list.

Problem is, even with the young bit functional, the VM isn't
reactivating those pages anyway because of that broken check... That
check should be nuked entirely in my view, as it fundamentally thinks
it can outsmart the VM intelligence by checking a bit in the vma...
quite absurd.

> Can we find out which pages are EPT pages?
>
> If so, we could unmap them when they get moved from the active to the
> inactive list, and soft fault them back in on access, emulating the
> referenced bit for EPT pages and making page replacement on them work
> like it should.
>
> Your approximation of pretending the page is accessed the first time
> and pretending it's not the second time sounds like it will just lead
> to less efficient FIFO replacement, not to anything even vaguely
> approximating LRU.

I think it'll still be better than the current situation, as the young
bit is always set for ptes.  Otherwise EPT pages are too penalized; we
need them to stay one more round in the active list like everything
else.  They are too penalized anyway, because at the second pass
they'll be forced out of the active list and unmapped.

This is what alpha and all the other archs without a hardware-set
young bit have to do: they set the young bit in software, clear it in
software, and then set it again in software if there's a page fault
(hopefully a minor fault).  Returning "not young" the first time
around sounds worse to me.
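For illustration, the software young-bit emulation described above
boils down to something like the sketch below.  It is hedged heavily:
pte_mkinvalid()/pte_mkvalid() stand in for whatever arch-specific
operation forces the next access to fault, and soft_mark_young() is a
made-up name, not alpha's actual code:

	/* ageing: report and consume the software-maintained young bit */
	static int soft_ptep_test_and_clear_young(pte_t *ptep)
	{
		if (!pte_young(*ptep))
			return 0;	/* not touched since the last scan */
		/* clear young and force the next access to minor-fault */
		set_pte(ptep, pte_mkold(pte_mkinvalid(*ptep)));
		return 1;
	}

	/* fault path: re-validate the pte and set young again in software */
	static void soft_mark_young(pte_t *ptep)
	{
		set_pte(ptep, pte_mkyoung(pte_mkvalid(*ptep)));
	}
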
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Rik van Riel @ 2009-08-05 17:25 UTC
To: Andrea Arcangeli
Cc: Avi Kivity, Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
    Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
    Mel Gorman, LKML, linux-mm

Andrea Arcangeli wrote:
> On Wed, Aug 05, 2009 at 10:15:16AM -0400, Rik van Riel wrote:
>> Not having a hardware accessed bit would explain why the VM is not
>> reactivating the pages that were accessed while on the inactive list.
>
> Problem is, even with the young bit functional, the VM isn't
> reactivating those pages anyway because of that broken check...

That check is only done where active pages are moved to the inactive
list!  Inactive pages that were referenced always get moved to the
active list (except for unmapped file pages).

> I think it'll still be better than the current situation, as the young
> bit is always set for ptes.  Otherwise EPT pages are too penalized; we
> need them to stay one more round in the active list like everything
> else.

NOTHING ELSE stays on the active anon list for two rounds, for very
good reasons.

Please read up on what has changed in the VM since 2.6.27.

-- 
All rights reversed.
* RE: [RFC] respect the referenced bit of KVM guest pages?
From: Dike, Jeffrey G @ 2009-08-05 15:45 UTC
To: Avi Kivity, Wu, Fengguang
Cc: Rik van Riel, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli,
    Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
    Mel Gorman, LKML, linux-mm

> Jeff, do you see the refaults on Nehalem systems?

My test box is pre-Nehalem - no EPT.

				Jeff
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Andrea Arcangeli @ 2009-08-05 16:05 UTC
To: Avi Kivity
Cc: Wu Fengguang, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
    Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
    Mel Gorman, LKML, linux-mm

On Wed, Aug 05, 2009 at 10:58:10AM +0300, Avi Kivity wrote:
> How do you distinguish between kvm pages and non-kvm anonymous pages?
> More importantly, why should you?

It can't distinguish.  Besides, that the pages are being refaulted (as
minor faults) implies they weren't collected yet.  So whether or not
they are allowed to stay on the active list can't matter or alter the
refaulting issue.

> Jeff, do you see the refaults on Nehalem systems?  If so, that's likely
> due to the lack of an accessed bit on EPT pagetables.  It would be
> interesting to compare with Barcelona (which does).

It seems it wasn't using EPT.  Refaulting as minor faults is still
possible with or without EPT and the young bit... when the young bit is
found not set, we just unmap the spte/pte and leave the page in the lru
for a while until it is collected.  So it can be refaulted even with a
young bit perfectly functional in spte and pte.

But the _whole_ point of the NPT young bit (shame on EPT), and of the
one in the pte, is to try not to unmap the pagetables to get the aging
information.  So there's one more pass with the young bit functional
compared to without it, but that doesn't mean that when the young bit
is found clear at the second pass we immediately free the page; we just
go into the "refaulting lru cache waiting to be collected".  And if the
page isn't actually collected, it doesn't matter whether it's in the
active or inactive list, so the patch can't matter if it's "minor"
refaults we're talking about here :).
* RE: [RFC] respect the referenced bit of KVM guest pages?
From: Dike, Jeffrey G @ 2009-08-05 16:12 UTC
To: Andrea Arcangeli, Avi Kivity
Cc: Wu, Fengguang, Rik van Riel, Yu, Wilfred, Kleen, Andi,
    Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
    Mel Gorman, LKML, linux-mm

> It can't distinguish.  Besides, that the pages are being refaulted (as
> minor faults) implies they weren't collected yet.  So whether or not
> they are allowed to stay on the active list can't matter or alter the
> refaulting issue.

Sounds like there's some terminology confusion.  A refault is a page
being discarded due to memory pressure and subsequently being faulted
back in.  I was counting the number of faults between the discard and
the faulting back in for each affected page.  For a large number of
predominantly stack pages, that number was very small.

				Jeff
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Andrea Arcangeli @ 2009-08-05 16:19 UTC
To: Dike, Jeffrey G
Cc: Avi Kivity, Wu, Fengguang, Rik van Riel, Yu, Wilfred, Kleen, Andi,
    Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
    Mel Gorman, LKML, linux-mm

On Wed, Aug 05, 2009 at 09:12:39AM -0700, Dike, Jeffrey G wrote:
> Sounds like there's some terminology confusion.  A refault is a page
> being discarded due to memory pressure and subsequently being faulted
> back in.  I was counting the number of faults between the discard and
> the faulting back in for each affected page.  For a large number of
> predominantly stack pages, that number was very small.

Hmm, ok.  But if it's anonymous pages we're talking about here (I see
KVM in the equation so it has to be!), normally we call that thing
swapin, to imply I/O is involved, not refault... Refault to me sounds
like a minor fault from swapcache (clean or dirty), and that's about
it...

An anon page becomes swapcache, it is unmapped if the young bit
permits, and then it's collected from the lru eventually; if it is
collected, I/O will be generated as swapin during the next page fault.

If it's too much swapin, then yes, it could be that patch that
prevents the young bit from keeping the anon pages in the active list.
But the fix is to remove the whole check, not just to enable the
list_add for anon pages.
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Andrea Arcangeli @ 2009-08-05 15:58 UTC
To: Wu Fengguang
Cc: Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Avi Kivity,
    Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
    Mel Gorman, LKML, linux-mm

On Wed, Aug 05, 2009 at 10:40:58AM +0800, Wu Fengguang wrote:
>  		 */
> -		if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
> +		if ((vm_flags & VM_EXEC) || PageAnon(page)) {
>  			list_add(&page->lru, &l_active);
>  			continue;
>  		}

Please nuke the whole check and do an unconditional list_add;
continue; there.
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Rik van Riel @ 2009-08-05 17:20 UTC
To: Andrea Arcangeli
Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Avi Kivity,
    Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
    Mel Gorman, LKML, linux-mm

Andrea Arcangeli wrote:
> On Wed, Aug 05, 2009 at 10:40:58AM +0800, Wu Fengguang wrote:
>>  		 */
>> -		if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
>> +		if ((vm_flags & VM_EXEC) || PageAnon(page)) {
>>  			list_add(&page->lru, &l_active);
>>  			continue;
>>  		}
>
> Please nuke the whole check and do an unconditional list_add;
> continue; there.

That would reinstate the bug where the VM has no pages available to
evict.

There are very good reasons that only VM_EXEC file pages get moved to
the back of the active list if they were referenced, and nothing else.

-- 
All rights reversed.
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Rik van Riel @ 2009-08-05 17:42 UTC
To: Andrea Arcangeli
Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Avi Kivity,
    Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
    Mel Gorman, LKML, linux-mm

Andrea Arcangeli wrote:
> On Wed, Aug 05, 2009 at 10:40:58AM +0800, Wu Fengguang wrote:
>>  		 */
>> -		if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
>> +		if ((vm_flags & VM_EXEC) || PageAnon(page)) {
>>  			list_add(&page->lru, &l_active);
>>  			continue;
>>  		}
>
> Please nuke the whole check and do an unconditional list_add;
> continue; there.

<riel> aa: so you're saying we should _never_ add pages to the active
       list at this point in the code
<aa> right
<riel> aa: and remove the list_add and continue completely
<aa> yes
<riel> aa: your email says the opposite :)

-- 
All rights reversed.
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Andrea Arcangeli @ 2009-08-06 10:15 UTC
To: Rik van Riel
Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Avi Kivity,
    Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
    Mel Gorman, LKML, linux-mm

On Wed, Aug 05, 2009 at 01:42:30PM -0400, Rik van Riel wrote:
> Andrea Arcangeli wrote:
> > On Wed, Aug 05, 2009 at 10:40:58AM +0800, Wu Fengguang wrote:
> >>  		 */
> >> -		if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
> >> +		if ((vm_flags & VM_EXEC) || PageAnon(page)) {
> >>  			list_add(&page->lru, &l_active);
> >>  			continue;
> >>  		}
> >
> > Please nuke the whole check and do an unconditional list_add;
> > continue; there.
>
> <riel> aa: so you're saying we should _never_ add pages to the active
>        list at this point in the code
> <aa> right
> <riel> aa: and remove the list_add and continue completely
> <aa> yes
> <riel> aa: your email says the opposite :)

Posted a more meaningful explanation in a self-reply to the email that
said the opposite, which tries to explain why I changed my mind (well,
my focus really was on VM_EXEC, and I haven't changed my mind about it
yet, but then I'm flexible, so I'm listening if somebody thinks it's a
good thing to keep).  The irc quote was greatly out of context and it
missed all the previous conversation... I hope my mail explains my
point in more detail than the above.
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Andrea Arcangeli @ 2009-08-06 10:08 UTC
To: Wu Fengguang
Cc: Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Avi Kivity,
    Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
    Mel Gorman, LKML, linux-mm

On Wed, Aug 05, 2009 at 05:58:05PM +0200, Andrea Arcangeli wrote:
> On Wed, Aug 05, 2009 at 10:40:58AM +0800, Wu Fengguang wrote:
> >  		 */
> > -		if ((vm_flags & VM_EXEC) && !PageAnon(page)) {
> > +		if ((vm_flags & VM_EXEC) || PageAnon(page)) {
> >  			list_add(&page->lru, &l_active);
> >  			continue;
> >  		}
>
> Please nuke the whole check and do an unconditional list_add;
> continue; there.

After some conversation it seems reactivation on large systems
generates trouble for the VM, as pages with the young bit set have
excessive time to be reactivated, making it hard to shrink the active
list.  I see that, so the check should still be nuked, but the
unconditional deactivation should happen instead.  Otherwise it's
trivial to bring the VM to its knees and DoS it with a simple mmap of
a file with PROT_EXEC as a parameter of mmap.  My whole point is that
deciding whether to activate or deactivate pages can't be a function
of VM_EXEC; clearly it helps on desktops, but then it is probably a
signal that the VM isn't good enough by itself to identify the
important working set using young bits and such on desktop systems,
and if there's a good reason not to activate, we shouldn't activate
the VM_EXEC pages either, as anything and anybody can generate a file
mapping with VM_EXEC set...

Likely we need a cut-off point: if we detect it takes more than X
seconds to scan the whole active list, we start ignoring young bits,
as young bits don't provide any meaningful information then and they
just hang the VM by preventing it from shrinking the active list,
looping over it endlessly with a million pages inside that list.  But
on small systems, if the inactive list is short, it may be too quick
to just clear the young bit and only give it time to be set again on
the inactive list.  That may be the source of the problem.  Actually
I'm speculating here, because I barely understood that this is
swapin... not sure exactly what this regression is about, but testing
the patch posted is a good idea and it will tell us if we just need to
dynamically differentiate the algorithm between large and small
systems and start ignoring young bits only at some point.
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Avi Kivity @ 2009-08-06 10:18 UTC
To: Andrea Arcangeli
Cc: Wu Fengguang, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
    Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
    Mel Gorman, LKML, linux-mm

On 08/06/2009 01:08 PM, Andrea Arcangeli wrote:
> After some conversation it seems reactivation on large systems
> generates trouble for the VM, as pages with the young bit set have
> excessive time to be reactivated, making it hard to shrink the active
> list.  I see that, so the check should still be nuked, but the
> unconditional deactivation should happen instead.  Otherwise it's
> trivial to bring the VM to its knees and DoS it with a simple mmap of
> a file with PROT_EXEC as a parameter of mmap.  My whole point is that
> deciding whether to activate or deactivate pages can't be a function
> of VM_EXEC; clearly it helps on desktops, but then it is probably a
> signal that the VM isn't good enough by itself to identify the
> important working set using young bits and such on desktop systems,
> and if there's a good reason not to activate, we shouldn't activate
> the VM_EXEC pages either, as anything and anybody can generate a file
> mapping with VM_EXEC set...

Reasonable; if you depend on a hint from userspace, that hint can be
used against you.

> Likely we need a cut-off point: if we detect it takes more than X
> seconds to scan the whole active list, we start ignoring young bits,
> as young bits don't provide any meaningful information then and they
> just hang the VM by preventing it from shrinking the active list,
> looping over it endlessly with a million pages inside that list.  But
> on small systems, if the inactive list is short, it may be too quick
> to just clear the young bit and only give it time to be set again on
> the inactive list.  That may be the source of the problem.  Actually
> I'm speculating here, because I barely understood that this is
> swapin... not sure exactly what this regression is about, but testing
> the patch posted is a good idea and it will tell us if we just need to
> dynamically differentiate the algorithm between large and small
> systems and start ignoring young bits only at some point.

How about, for every N pages that you scan, evict at least 1 page,
regardless of young bit status?  That limits overscanning to an N:1
ratio.  With N=250 we'll spend at most 25 usec in order to locate one
page to evict.

-- 
error compiling committee.c: too many arguments to function
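A sketch of how such a cap could look inside the shrink_active_list()
scan loop (illustrative only: SCAN_RATIO and the 'reactivated' counter
are made-up names, not from any posted patch; the surrounding
identifiers match the 2.6.3x scan loop):

	#define SCAN_RATIO	250	/* N: reactivate at most N-1 of N */

		unsigned long reactivated = 0;

		while (!list_empty(&l_hold)) {
			page = lru_to_page(&l_hold);
			list_del(&page->lru);

			if (page_mapping_inuse(page) &&
			    page_referenced(page, 0, sc->mem_cgroup, &vm_flags) &&
			    ++reactivated % SCAN_RATIO != 0) {
				/* young and under the overscan budget */
				list_add(&page->lru, &l_active);
				continue;
			}
			/* every SCAN_RATIO-th young page is deactivated anyway */
			list_add(&page->lru, &l_inactive);
		}
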
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Andrea Arcangeli @ 2009-08-06 10:20 UTC
To: Avi Kivity
Cc: Wu Fengguang, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
    Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
    Mel Gorman, LKML, linux-mm

On Thu, Aug 06, 2009 at 01:18:47PM +0300, Avi Kivity wrote:
> Reasonable; if you depend on a hint from userspace, that hint can be
> used against you.

Correct, that is my whole point.  Also we never know if applications
are mmapping huge files with PROT_EXEC just because they might need to
trampoline once in a while, or do some little JIT thing once in a
while.  Sometimes people open files with O_RDWR even if they only need
O_RDONLY.  It's not a bug, but radically altering VM behavior because
of a bitflag doesn't sound good to me.

I certainly see this tends to help, as it will reactivate all .text.
But it signals that current VM behavior is not ok for small systems
IMHO if such a hack is required.  I prefer a dynamic algorithm that,
when the active list grows too much, stops reactivating pages and
reduces the time for young bit activation to only the time the page
sits on the inactive list.  And if the active list is small (like on a
128M system) we fully trust the young bit, and if it is set we don't
allow the page to go into the inactive list, as it's quick enough to
scan the whole active list and the young bit is meaningful there.

The issue I can see is with huge systems and a million pages in the
active list: by the time we scan it all, too much time has passed and
we don't get any meaningful information out of the young bit.  Things
are radically different on all regular workstations, and frankly
regular workstations are very important too, as I suspect there are
more users running on <64G systems than on >64G systems.

> How about, for every N pages that you scan, evict at least 1 page,
> regardless of young bit status?  That limits overscanning to an N:1
> ratio.  With N=250 we'll spend at most 25 usec in order to locate one
> page to evict.

Yes, exactly.  Something like that will be dynamic, and then we can
drop the VM_EXEC check and solve the issues on large systems while
still not almost totally ignoring the young bit on small systems.
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Wu Fengguang @ 2009-08-06 10:59 UTC
To: Andrea Arcangeli
Cc: Avi Kivity, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi,
    Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro,
    Mel Gorman, LKML, linux-mm

On Thu, Aug 06, 2009 at 06:20:57PM +0800, Andrea Arcangeli wrote:
> On Thu, Aug 06, 2009 at 01:18:47PM +0300, Avi Kivity wrote:
> > Reasonable; if you depend on a hint from userspace, that hint can be
> > used against you.
>
> Correct, that is my whole point.  Also we never know if applications
> are mmapping huge files with PROT_EXEC just because they might need to
> trampoline once in a while, or do some little JIT thing once in a
> while.  Sometimes people open files with O_RDWR even if they only need
> O_RDONLY.  It's not a bug, but radically altering VM behavior because
> of a bitflag doesn't sound good to me.
>
> I certainly see this tends to help, as it will reactivate all .text.
> But it signals that current VM behavior is not ok for small systems
> IMHO if such a hack is required.  I prefer a dynamic algorithm that,
> when the active list grows too much, stops reactivating pages and
> reduces the time for young bit activation to only the time the page
> sits on the inactive list.  And if the active list is small (like on a
> 128M system) we fully trust the young bit, and if it is set we don't
> allow the page to go into the inactive list, as it's quick enough to
> scan the whole active list and the young bit is meaningful there.
>
> The issue I can see is with huge systems and a million pages in the
> active list: by the time we scan it all, too much time has passed and
> we don't get any meaningful information out of the young bit.  Things
> are radically different on all regular workstations, and frankly
> regular workstations are very important too, as I suspect there are
> more users running on <64G systems than on >64G systems.
>
> > How about, for every N pages that you scan, evict at least 1 page,
> > regardless of young bit status?  That limits overscanning to an N:1
> > ratio.  With N=250 we'll spend at most 25 usec in order to locate
> > one page to evict.
>
> Yes, exactly.  Something like that will be dynamic, and then we can
> drop the VM_EXEC check and solve the issues on large systems while
> still not almost totally ignoring the young bit on small systems.

This is a quick hack to materialize the idea.  It remembers roughly
the last 32*SWAP_CLUSTER_MAX=1024 active (mapped) pages scanned, and
if _all of them_ are referenced, then the referenced bit is probably
meaningless and should not be taken seriously.

As a refinement, the static variable 'recent_all_referenced' could be
moved to struct zone or made a per-cpu variable.
Thanks,
Fengguang

---
 mm/vmscan.c |   29 ++++++++++++++++-------------
 1 file changed, 16 insertions(+), 13 deletions(-)

--- linux.orig/mm/vmscan.c	2009-08-06 18:31:20.000000000 +0800
+++ linux/mm/vmscan.c	2009-08-06 18:51:58.000000000 +0800
@@ -1239,6 +1239,10 @@ static void move_active_pages_to_lru(str
 static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 			struct scan_control *sc, int priority, int file)
 {
+	static unsigned int recent_all_referenced;
+	int all_referenced = 1;
+	int referenced;
+	int referenced_bit_ok;
 	unsigned long pgmoved;
 	unsigned long pgscanned;
 	unsigned long vm_flags;
@@ -1267,6 +1271,8 @@ static void shrink_active_list(unsigned
 		__mod_zone_page_state(zone, NR_ACTIVE_FILE, -pgmoved);
 	else
 		__mod_zone_page_state(zone, NR_ACTIVE_ANON, -pgmoved);
+
+	referenced_bit_ok = !recent_all_referenced;
 	spin_unlock_irq(&zone->lru_lock);
 
 	pgmoved = 0;  /* count referenced (mapping) mapped pages */
@@ -1281,19 +1287,15 @@ static void shrink_active_list(unsigned
 		}
 
 		/* page_referenced clears PageReferenced */
-		if (page_mapping_inuse(page) &&
-		    page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) {
-			pgmoved++;
-			/*
-			 * Identify referenced, file-backed active pages and
-			 * give them one more trip around the active list. So
-			 * that executable code get better chances to stay in
-			 * memory under moderate memory pressure.
-			 *
-			 * Also protect anon pages: swapping could be costly,
-			 * and KVM guest's referenced bit is helpful.
-			 */
-			if ((vm_flags & VM_EXEC) || PageAnon(page)) {
+		if (page_mapping_inuse(page)) {
+			referenced = page_referenced(page, 0, sc->mem_cgroup,
+						     &vm_flags);
+			if (referenced)
+				pgmoved++;
+			else
+				all_referenced = 0;
+
+			if (referenced && referenced_bit_ok) {
 				list_add(&page->lru, &l_active);
 				continue;
 			}
@@ -1319,6 +1321,7 @@ static void shrink_active_list(unsigned
 	move_active_pages_to_lru(zone, &l_inactive,
 						LRU_BASE + file * LRU_FILE);
 
+	recent_all_referenced = (recent_all_referenced << 1) | all_referenced;
 	spin_unlock_irq(&zone->lru_lock);
 }
* Re: [RFC] respect the referenced bit of KVM guest pages?
From: Avi Kivity @ 2009-08-06 11:44 UTC
To: Wu Fengguang
Cc: Andrea Arcangeli, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred,
    Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter,
    KOSAKI Motohiro, Mel Gorman, LKML, linux-mm

On 08/06/2009 01:59 PM, Wu Fengguang wrote:
> This is a quick hack to materialize the idea.  It remembers roughly
> the last 32*SWAP_CLUSTER_MAX=1024 active (mapped) pages scanned, and
> if _all of them_ are referenced, then the referenced bit is probably
> meaningless and should not be taken seriously.

I don't think we should ignore the referenced bit.  There could still
be a large batch of unreferenced pages later on that we should
preferentially swap.  If we swap at least 1 page for every 250 scanned,
after 4K swaps we will have traversed 1M pages, enough to find them.

> As a refinement, the static variable 'recent_all_referenced' could be
> moved to struct zone or made a per-cpu variable.

Definitely, this should be made part of the zone structure; consider
the original report, where the problem occurs in a 128MB zone (where we
can expect many pages to have their referenced bit set).

-- 
error compiling committee.c: too many arguments to function
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-06 11:44 ` Avi Kivity @ 2009-08-06 13:06 ` Wu Fengguang 2009-08-06 13:16 ` Rik van Riel ` (2 more replies) 2009-08-06 13:13 ` Rik van Riel 1 sibling, 3 replies; 122+ messages in thread From: Wu Fengguang @ 2009-08-06 13:06 UTC (permalink / raw) To: Avi Kivity Cc: Andrea Arcangeli, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On Thu, Aug 06, 2009 at 07:44:01PM +0800, Avi Kivity wrote: > On 08/06/2009 01:59 PM, Wu Fengguang wrote: scheme KEEP_MOST: >> How about, for every N pages that you scan, evict at least 1 page, >> regardless of young bit status? That limits overscanning to an N:1 >> ratio. With N=250 we'll spend at most 25 usec in order to locate one >> page to evict. scheme DROP_CONTINUOUS: > > This is a quick hack to materialize the idea. It remembers roughly > > the last 32*SWAP_CLUSTER_MAX=1024 active (mapped) pages scanned, > > and if _all of them_ are referenced, then the referenced bit is > > probably meaningless and should not be taken seriously. > I don't think we should ignore the referenced bit. There could still be > a large batch of unreferenced pages later on that we should > preferentially swap. If we swap at least 1 page for every 250 scanned, > after 4K swaps we will have traversed 1M pages, enough to find them. I guess both schemes have unacceptable flaws. For JVM/BIGMEM workload, most pages would be found referenced _all the time_. So the KEEP_MOST scheme could increase reclaim overheads by N=250 times; while the DROP_CONTINUOUS scheme is effectively zero cost. However, the DROP_CONTINUOUS scheme does bring more _indeterminacy_. It can behave vastly differently with a single active task and with multiple ones. It is short-sighted and can be cheated by bursty activities. > > As a refinement, the static variable 'recent_all_referenced' could be > > moved to struct zone or made a per-cpu variable. > > Definitely this should be made part of the zone structure, considering the > original report where the problem occurs in a 128MB zone (where we can > expect many pages to have their referenced bit set). Good point. Here the cgroup list is highly stressed, while the global zones are idling. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
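For concreteness, the KEEP_MOST scheme boils down to a counter like the sketch below (an illustration under assumptions, not a patch: the names and the struct are invented, and a real version would presumably keep the counter in struct zone, per the refinement discussed above).

/* KEEP_MOST sketch: bound overscanning at N:1 by forcing at least
 * one deactivation per N pages scanned, young bits notwithstanding. */
#define KEEP_MOST_N	250

struct keep_most_state {
	unsigned long scanned_since_forced;
};

/* returns 1 when the current page should be deactivated even though
 * page_referenced() found it referenced */
static int keep_most_force_evict(struct keep_most_state *km)
{
	if (++km->scanned_since_forced >= KEEP_MOST_N) {
		km->scanned_since_forced = 0;
		return 1;
	}
	return 0;
}

Seen this way, the cost argument is about where the price is paid: KEEP_MOST pays a bounded scanning tax on every reclaim pass, while DROP_CONTINUOUS is free until its 32-batch history happens to mispredict.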
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-06 13:06 ` Wu Fengguang @ 2009-08-06 13:16 ` Rik van Riel 2009-08-16 3:28 ` Wu Fengguang 2009-08-06 13:46 ` Avi Kivity 2009-08-06 21:09 ` Jeff Dike 2 siblings, 1 reply; 122+ messages in thread From: Rik van Riel @ 2009-08-06 13:16 UTC (permalink / raw) To: Wu Fengguang Cc: Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm Wu Fengguang wrote: > I guess both schemes have unacceptable flaws. > > For JVM/BIGMEM workload, most pages would be found referenced _all the time_. > So the KEEP_MOST scheme could increase reclaim overheads by N=250 times; > while the DROP_CONTINUOUS scheme is effectively zero cost. The higher overhead may not be an issue on smaller systems, or inside smaller cgroups inside large systems, when doing cgroup reclaim. > However, the DROP_CONTINUOUS scheme does bring more _indeterminacy_. > It can behave vastly differently with a single active task and with multiple ones. > It is short-sighted and can be cheated by bursty activities. The split LRU VM tries to avoid the bursty page aging as much as possible, by doing background deactivation of anonymous pages whenever we reclaim page cache pages and the number of anonymous pages in the zone (or cgroup) is low. -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-06 13:16 ` Rik van Riel @ 2009-08-16 3:28 ` Wu Fengguang 2009-08-16 3:56 ` Rik van Riel 0 siblings, 1 reply; 122+ messages in thread From: Wu Fengguang @ 2009-08-16 3:28 UTC (permalink / raw) To: Rik van Riel Cc: Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On Thu, Aug 06, 2009 at 09:16:14PM +0800, Rik van Riel wrote: > Wu Fengguang wrote: > > > I guess both schemes have unacceptable flaws. > > > > For JVM/BIGMEM workload, most pages would be found referenced _all the time_. > > So the KEEP_MOST scheme could increase reclaim overheads by N=250 times; > > while the DROP_CONTINUOUS scheme is effectively zero cost. > > The higher overhead may not be an issue on smaller systems, > or inside smaller cgroups inside large systems, when doing > cgroup reclaim. Right. > > However, the DROP_CONTINUOUS scheme does bring more _indeterminacy_. > > It can behave vastly differently with a single active task and with multiple ones. > > It is short-sighted and can be cheated by bursty activities. > > The split LRU VM tries to avoid the bursty page aging as > much as possible, by doing background deactivation of > anonymous pages whenever we reclaim page cache pages and > the number of anonymous pages in the zone (or cgroup) is > low. Right, but I meant bursty page allocations and accesses on them, which can make a large continuous segment of referenced pages in LRU list, say 50MB. They may or may not be valuable as a whole; however, a local algorithm may keep the first 4MB and drop the remaining 46MB. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-16 3:28 ` Wu Fengguang @ 2009-08-16 3:56 ` Rik van Riel 2009-08-16 4:43 ` Balbir Singh 2009-08-16 4:55 ` Wu Fengguang 0 siblings, 2 replies; 122+ messages in thread From: Rik van Riel @ 2009-08-16 3:56 UTC (permalink / raw) To: Wu Fengguang Cc: Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm Wu Fengguang wrote: > Right, but I meant bursty page allocations and accesses on them, which > can make a large continuous segment of referenced pages in LRU list, > say 50MB. They may or may not be valuable as a whole; however, a local > algorithm may keep the first 4MB and drop the remaining 46MB. I wonder if the problem is that we simply do not keep a large enough inactive list in Jeff's test. If we do not, pages do not have a chance to be referenced again before the reclaim code comes in. The cgroup stats should show how many active anon and inactive anon pages there are in the cgroup. -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-16 3:56 ` Rik van Riel @ 2009-08-16 4:43 ` Balbir Singh 2009-08-16 4:55 ` Wu Fengguang 1 sibling, 0 replies; 122+ messages in thread From: Balbir Singh @ 2009-08-16 4:43 UTC (permalink / raw) To: Rik van Riel Cc: Wu Fengguang, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm * Rik van Riel <riel@redhat.com> [2009-08-15 23:56:39]: > Wu Fengguang wrote: > >> Right, but I meant bursty page allocations and accesses on them, which >> can make a large continuous segment of referenced pages in LRU list, >> say 50MB. They may or may not be valuable as a whole; however, a local >> algorithm may keep the first 4MB and drop the remaining 46MB. > > I wonder if the problem is that we simply do not keep a large > enough inactive list in Jeff's test. If we do not, pages do > not have a chance to be referenced again before the reclaim > code comes in. > > The cgroup stats should show how many active anon and inactive > anon pages there are in the cgroup. > Yes, we do show active and inactive anon pages in the mem cgroup controller in the memory.stat file. -- Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-16 3:56 ` Rik van Riel 2009-08-16 4:43 ` Balbir Singh @ 2009-08-16 4:55 ` Wu Fengguang 2009-08-16 5:59 ` Balbir Singh 2009-08-17 19:47 ` Dike, Jeffrey G 1 sibling, 2 replies; 122+ messages in thread From: Wu Fengguang @ 2009-08-16 4:55 UTC (permalink / raw) To: Rik van Riel Cc: Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On Sun, Aug 16, 2009 at 11:56:39AM +0800, Rik van Riel wrote: > Wu Fengguang wrote: > > > Right, but I meant bursty page allocations and accesses on them, which > > can make a large continuous segment of referenced pages in LRU list, > > say 50MB. They may or may not be valuable as a whole; however, a local > > algorithm may keep the first 4MB and drop the remaining 46MB. > > I wonder if the problem is that we simply do not keep a large > enough inactive list in Jeff's test. If we do not, pages do > not have a chance to be referenced again before the reclaim > code comes in. Exactly, that's the case I call the list FIFO. > The cgroup stats should show how many active anon and inactive > anon pages there are in the cgroup. Jeff, can you have a look at these stats? Thanks! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-16 4:55 ` Wu Fengguang @ 2009-08-16 5:59 ` Balbir Singh 0 siblings, 0 replies; 122+ messages in thread From: Balbir Singh @ 2009-08-16 5:59 UTC (permalink / raw) To: Wu Fengguang Cc: Rik van Riel, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm * Wu Fengguang <fengguang.wu@intel.com> [2009-08-16 12:55:22]: > On Sun, Aug 16, 2009 at 11:56:39AM +0800, Rik van Riel wrote: > > Wu Fengguang wrote: > > > > > Right, but I meant bursty page allocations and accesses on them, which > > > can make a large continuous segment of referenced pages in LRU list, > > > say 50MB. They may or may not be valuable as a whole; however, a local > > > algorithm may keep the first 4MB and drop the remaining 46MB. > > > > I wonder if the problem is that we simply do not keep a large > > enough inactive list in Jeff's test. If we do not, pages do > > not have a chance to be referenced again before the reclaim > > code comes in. > > Exactly, that's the case I call the list FIFO. > > > The cgroup stats should show how many active anon and inactive > > anon pages there are in the cgroup. > > Jeff, can you have a look at these stats? Thanks! Another experiment would be to toy with memory.swappiness (although defaults should work well). Could you compare the in-guest values of nr_*active* with the cgroup values as seen by the host? -- Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages? 2009-08-16 4:55 ` Wu Fengguang 2009-08-16 5:59 ` Balbir Singh @ 2009-08-17 19:47 ` Dike, Jeffrey G 2009-08-21 18:24 ` Balbir Singh 1 sibling, 1 reply; 122+ messages in thread From: Dike, Jeffrey G @ 2009-08-17 19:47 UTC (permalink / raw) To: Wu, Fengguang, Rik van Riel Cc: Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm > Jeff, can you have a look at these stats? Thanks! Yeah, I just did after adding some tracing which dumped out the same data. It looks pretty much the same. Inactive anon and active anon are pretty similar. Inactive file and active file are smaller and fluctuate more, but doesn't look horribly unbalanced. Below are the stats from memory.stat - inactive_anon, active_anon, inactive_file, active_file, plus some commentary on what's happening. Jeff 114688 0 1978368 630784 114688 0 1978368 630784 114688 0 1978368 630784 114688 0 1978368 630784 114688 0 1978368 630784 114688 0 1978368 630784 114688 0 1978368 630784 # Fire up instance 20480 4403200 2699264 647168 20480 4411392 2740224 647168 20480 4411392 2740224 647168 20480 11436032 3289088 651264 20480 11587584 3313664 651264 20480 12558336 3313664 651264 20480 12558336 3313664 651264 20480 12558336 3313664 651264 20480 12558336 3313664 651264 20480 12558336 3313664 651264 20480 12558336 3313664 651264 20480 12558336 3313664 651264 20480 12558336 3313664 651264 20480 25387008 4263936 872448 20480 25387008 4263936 872448 20480 25387008 4198400 937984 20480 25387008 4198400 937984 20480 25387008 4198400 937984 20480 25387008 4198400 937984 20480 25387008 4198400 937984 20480 25387008 4198400 937984 20480 25387008 4198400 937984 20480 25387008 4198400 937984 20480 25411584 4198400 937984 20480 25411584 4198400 937984 20480 40665088 7573504 946176 20480 43606016 7573504 946176 20480 45346816 7581696 946176 20480 45752320 7581696 946176 20480 46575616 7581696 946176 20480 46682112 7581696 946176 20480 48975872 9920512 1073152 20480 64536576 38457344 1826816 # Booted, X is starting # Run a browser and editor, then shut them down and halt the instance 10964992 72454144 47714304 3067904 16797696 71151616 42893312 3244032 16797696 73035776 41037824 3272704 16797696 73547776 40525824 3272704 16797696 73609216 40402944 3276800 16797696 73719808 40337408 3289088 16797696 73920512 40079360 3289088 16797696 78016512 36036608 3297280 16797696 78016512 36036608 3297280 16797696 80203776 33755136 3387392 16797696 86904832 26972160 3526656 16797696 93523968 19927040 3837952 29011968 90546176 10276864 4308992 45670400 83685376 991232 3854336 66400256 66416640 368640 933888 66715648 66654208 376832 471040 64811008 64802816 3416064 1114112 65236992 65085440 2535424 1228800 65212416 65011712 2519040 1343488 64626688 64610304 3534848 1429504 63807488 63758336 4829184 1695744 63975424 63946752 4419584 1744896 63975424 63946752 4419584 1744896 63975424 63946752 4419584 1744896 63975424 63946752 4423680 1744896 63975424 63946752 4423680 1744896 63975424 63946752 4423680 1744896 64045056 63946752 4440064 1757184 64069632 63946752 4440064 1757184 64077824 63946752 4403200 1757184 64077824 63946752 4403200 1757184 64077824 63946752 4403200 1757184 64147456 64016384 4222976 1757184 64638976 64733184 2801664 1900544 65208320 65605632 1310720 1892352 64946176 65863680 1335296 1998848 62701568 68599808 843776 1945600 66068480 66023424 778240 978944 65568768 65511424 2093056 1044480 66183168 66056192 
966656 1011712 66478080 66555904 241664 864256 66912256 66899968 135168 270336 66646016 66539520 577536 262144 66134016 66179072 1421312 319488 66125824 66183168 1273856 380928 66330624 66445312 933888 475136 65970176 65966080 1581056 548864 66158592 66158592 1175552 708608 65781760 66084864 1503232 806912 66084864 66048000 1118208 843776 66420736 66449408 376832 851968 66351104 66138112 757760 864256 66285568 66138112 921600 868352 65945600 65847296 1495040 888832 66002944 65839104 1363968 888832 66002944 65839104 1363968 888832 66039808 65839104 1363968 888832 66043904 65839104 1363968 888832 66043904 65839104 1372160 888832 65523712 65490944 2224128 929792 66031616 66297856 827392 946176 64913408 68141056 188416 933888 64770048 68325376 73728 917504 65216512 67932160 81920 909312 65470464 67678208 81920 909312 65036288 67973120 356352 790528 63492096 69877760 110592 647168 63111168 70508544 73728 413696 66895872 66883584 16384 348160 66650112 67203072 20480 344064 66830336 67002368 28672 335872 66785280 67002368 32768 331776 67084288 66736128 45056 331776 67104768 66736128 45056 331776 66916352 66801664 45056 331776 66883584 66863104 45056 331776 66883584 66863104 45056 331776 66891776 66863104 45056 331776 66899968 66863104 45056 331776 66904064 66863104 45056 331776 66904064 66863104 45056 331776 66904064 66863104 45056 331776 66715648 66641920 385024 339968 66617344 66629632 237568 364544 66633728 66629632 237568 364544 66641920 66629632 237568 364544 66641920 66629632 237568 364544 66662400 66461696 589824 389120 66588672 66527232 659456 389120 66252800 66252800 1105920 413696 66277376 66297856 983040 413696 66498560 66285568 884736 413696 66129920 66183168 1163264 442368 66138112 66183168 1163264 442368 66256896 66465792 921600 442368 66560000 66465792 606208 442368 66662400 66445312 589824 446464 66560000 66490368 708608 446464 66629632 66441216 577536 446464 66711552 66441216 520192 462848 66617344 66531328 577536 466944 66412544 66605056 606208 466944 66605056 66637824 475136 471040 66297856 67018752 344064 483328 65863680 67588096 212992 483328 66334720 67129344 159744 516096 65204224 68337664 159744 516096 61399040 72212480 98304 499712 62427136 71114752 98304 499712 62775296 70680576 155648 503808 62595072 70209536 823296 565248 66490368 66576384 458752 577536 66818048 66625536 196608 577536 66699264 66793472 135168 507904 66707456 66736128 151552 512000 66314240 67293184 53248 483328 56832000 76705792 180224 495616 59273216 73998336 368640 499712 59351040 73928704 368640 499712 59400192 73928704 368640 499712 59678720 73478144 241664 495616 59875328 73531392 241664 495616 60690432 72830976 110592 503808 61919232 71634944 110592 503808 65523712 68096000 49152 483328 66748416 66801664 102400 487424 65994752 67555328 114688 487424 65994752 67555328 114688 487424 65994752 67555328 114688 487424 66285568 66195456 1093632 520192 66232320 65978368 1335296 557056 66232320 65978368 1335296 557056 66232320 65978368 1335296 557056 66232320 65978368 1335296 557056 66060288 65994752 1548288 557056 # Instance halted 118784 0 1953792 569344 118784 0 1953792 569344 118784 0 1953792 569344 118784 0 1953792 569344 118784 0 1953792 569344 118784 0 1953792 569344 118784 0 1953792 569344 118784 0 1953792 569344 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-17 19:47 ` Dike, Jeffrey G @ 2009-08-21 18:24 ` Balbir Singh 2009-08-31 19:43 ` Dike, Jeffrey G 0 siblings, 1 reply; 122+ messages in thread From: Balbir Singh @ 2009-08-21 18:24 UTC (permalink / raw) To: Dike, Jeffrey G Cc: Wu, Fengguang, Rik van Riel, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm * Dike, Jeffrey G <jeffrey.g.dike@intel.com> [2009-08-17 12:47:29]: > > Jeff, can you have a look at these stats? Thanks! > > Yeah, I just did after adding some tracing which dumped out the same data. It looks pretty much the same. Inactive anon and active anon are pretty similar. Inactive file and active file are smaller and fluctuate more, but doesn't look horribly unbalanced. > > Below are the stats from memory.stat - inactive_anon, active_anon, inactive_file, active_file, plus some commentary on what's happening. > Interesting... there seems to be a sufficient amount of inactive memory, specifically inactive_file. My biggest suspicion now is the passing of reference info from shadow page tables to the host (although to be honest, I've never looked at that code). What do the stats for / from within kvm look like? -- Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages? 2009-08-21 18:24 ` Balbir Singh @ 2009-08-31 19:43 ` Dike, Jeffrey G 2009-08-31 19:52 ` Rik van Riel 0 siblings, 1 reply; 122+ messages in thread From: Dike, Jeffrey G @ 2009-08-31 19:43 UTC (permalink / raw) To: balbir@linux.vnet.ibm.com Cc: Wu, Fengguang, Rik van Riel, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm > What do the stats for / from within kvm look like? Interesting - what they look like is inactive_anon is always zero. Details below - I took the host numbers at the same time and they are similar to what I reported before. Jeff The fields are inactive_anon, active_anon, inactive_file, active_file - shortly after the data started being collected, I started firefox and an editor thingy. The data continues as far into the shutdown as it could. 0 10858 13516 3279 0 10872 13516 3286 0 10867 13513 3286 0 11455 13268 3552 0 13068 12871 3949 0 13281 12810 4012 0 13701 12719 4103 0 14133 12631 4191 0 10878 11742 5087 0 10878 11741 5085 0 10878 11741 5085 0 10877 11741 5085 0 10877 11741 5085 0 10878 11741 5085 0 10878 11741 5085 0 10877 11741 5085 0 10905 11741 5085 0 11118 11776 5106 0 11594 14541 5169 0 11084 15314 5248 0 12022 15686 5300 0 12813 16379 5608 0 13614 16744 5915 0 14230 16849 5936 0 14461 16943 5953 0 14706 17412 5967 0 15574 17445 6011 0 15623 17459 6011 0 15596 17461 6015 0 15941 17523 6048 0 16508 17684 6048 0 17095 18154 6056 0 18635 18175 6056 0 18867 18195 6060 0 18972 18195 6060 0 18975 18185 6073 0 19220 18234 6073 0 19809 18276 6076 0 19571 18276 6076 0 19567 18276 6076 0 19588 18276 6076 0 19588 18276 6076 0 19588 18276 6076 0 19589 18276 6076 0 19603 18276 6076 0 19607 18277 6077 0 19600 18277 6077 0 19034 18235 6119 0 19041 18235 6119 0 19040 18233 6121 0 19040 18233 6121 0 18724 18240 6121 0 11674 16376 7977 0 11674 16376 7977 0 11673 16376 7977 0 11708 16376 7977 0 11703 16374 7979 0 11703 16374 7979 0 11702 16374 7979 0 11702 16374 7979 0 11716 16374 7979 0 11716 16374 7979 0 11718 16374 7979 0 11711 16374 7979 0 11811 16413 7986 0 11811 16413 7986 0 11897 16413 7986 0 12247 16434 7986 0 12495 16457 7990 0 12495 16457 7990 0 12491 16457 7990 0 12491 16457 7990 0 12737 16457 7990 0 11844 16457 7990 0 10969 16436 8011 0 9586 16140 8328 0 9209 16253 8333 0 8467 16120 8550 0 7857 16504 8592 0 7215 16467 8681 0 7194 16481 8723 0 7155 16475 8730 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-31 19:43 ` Dike, Jeffrey G @ 2009-08-31 19:52 ` Rik van Riel 2009-08-31 20:06 ` Dike, Jeffrey G 0 siblings, 1 reply; 122+ messages in thread From: Rik van Riel @ 2009-08-31 19:52 UTC (permalink / raw) To: Dike, Jeffrey G Cc: balbir@linux.vnet.ibm.com, Wu, Fengguang, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm Dike, Jeffrey G wrote: >> What do the stats for / from within kvm look like? > > Interesting - what they look like is inactive_anon is always zero. This will be because the VM does not start aging pages from the active to the inactive list unless there is some memory pressure. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages? 2009-08-31 19:52 ` Rik van Riel @ 2009-08-31 20:06 ` Dike, Jeffrey G 2009-08-31 20:09 ` Rik van Riel 0 siblings, 1 reply; 122+ messages in thread From: Dike, Jeffrey G @ 2009-08-31 20:06 UTC (permalink / raw) To: Rik van Riel Cc: balbir@linux.vnet.ibm.com, Wu, Fengguang, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm > This will be because the VM does not start aging pages > from the active to the inactive list unless there is > some memory pressure. Which is the reason I gave the VM a puny amount of memory. We know the thing is under memory pressure because I've been complaining about page discards. I didn't collect that data on this run, but I'll do it again to make sure. Jeff -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-31 20:06 ` Dike, Jeffrey G @ 2009-08-31 20:09 ` Rik van Riel 2009-08-31 20:11 ` Dike, Jeffrey G 0 siblings, 1 reply; 122+ messages in thread From: Rik van Riel @ 2009-08-31 20:09 UTC (permalink / raw) To: Dike, Jeffrey G Cc: balbir@linux.vnet.ibm.com, Wu, Fengguang, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm Dike, Jeffrey G wrote: >> This will be because the VM does not start aging pages >> from the active to the inactive list unless there is >> some memory pressure. > > Which is the reason I gave the VM a puny amount of memory. > We know the thing is under memory pressure because I've been > complaining about page discards. Page discards by the host, which are invisible to the guest OS. The guest OS thinks it has enough pages. The host disagrees and swaps out some guest memory. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages? 2009-08-31 20:09 ` Rik van Riel @ 2009-08-31 20:11 ` Dike, Jeffrey G 2009-08-31 20:42 ` Balbir Singh 0 siblings, 1 reply; 122+ messages in thread From: Dike, Jeffrey G @ 2009-08-31 20:11 UTC (permalink / raw) To: Rik van Riel Cc: balbir@linux.vnet.ibm.com, Wu, Fengguang, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm > Page discards by the host, which are invisible to the guest > OS. Duh. Right - I can't keep my VM systems straight... Jeff -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-31 20:11 ` Dike, Jeffrey G @ 2009-08-31 20:42 ` Balbir Singh 0 siblings, 0 replies; 122+ messages in thread From: Balbir Singh @ 2009-08-31 20:42 UTC (permalink / raw) To: Dike, Jeffrey G Cc: Rik van Riel, Wu, Fengguang, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On Tue, Sep 1, 2009 at 1:41 AM, Dike, Jeffrey G<jeffrey.g.dike@intel.com> wrote: >> Page discards by the host, which are invisible to the guest >> OS. > > Duh. Right - I can't keep my VM systems straight... > Sounds like we need a way of indicating reference information. Guest page hinting (cough; cough) anyone? Maybe a simpler version? Balbir Singh. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-06 13:06 ` Wu Fengguang 2009-08-06 13:16 ` Rik van Riel @ 2009-08-06 13:46 ` Avi Kivity 2009-08-06 21:09 ` Jeff Dike 2 siblings, 0 replies; 122+ messages in thread From: Avi Kivity @ 2009-08-06 13:46 UTC (permalink / raw) To: Wu Fengguang Cc: Andrea Arcangeli, Rik van Riel, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On 08/06/2009 04:06 PM, Wu Fengguang wrote: > On Thu, Aug 06, 2009 at 07:44:01PM +0800, Avi Kivity wrote: > >> On 08/06/2009 01:59 PM, Wu Fengguang wrote: >> > > scheme KEEP_MOST: > > >>> How about, for every N pages that you scan, evict at least 1 page, >>> regardless of young bit status? That limits overscanning to a N:1 >>> ratio. With N=250 we'll spend at most 25 usec in order to locate one >>> page to evict. >>> > > scheme DROP_CONTINUOUS: > > >>> This is a quick hack to materialize the idea. It remembers roughly >>> the last 32*SWAP_CLUSTER_MAX=1024 active (mapped) pages scanned, >>> and if _all of them_ are referenced, then the referenced bit is >>> probably meaningless and should not be taken seriously. >>> > > Or one scheme, with N=parameter. >> I don't think we should ignore the referenced bit. There could still be >> a large batch of unreferenced pages later on that we should >> preferentially swap. If we swap at least 1 page for every 250 scanned, >> after 4K swaps we will have traversed 1M pages, enough to find them. >> > > I guess both schemes have unacceptable flaws. > > For JVM/BIGMEM workload, most pages would be found referenced _all the time_. > So the KEEP_MOST scheme could increase reclaim overheads by N=250 times; > while the DROP_CONTINUOUS scheme is effectively zero cost. > Maybe 250 is an exaggeration. But even with 250, the cost is still pretty low compared to the cpu cost of evicting a page (with IPIs and tlb flushes). -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-06 13:06 ` Wu Fengguang 2009-08-06 13:16 ` Rik van Riel 2009-08-06 13:46 ` Avi Kivity @ 2009-08-06 21:09 ` Jeff Dike 2009-08-16 3:18 ` Wu Fengguang 2 siblings, 1 reply; 122+ messages in thread From: Jeff Dike @ 2009-08-06 21:09 UTC (permalink / raw) To: Wu Fengguang Cc: Avi Kivity, Andrea Arcangeli, Rik van Riel, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm Side question - Is there a good reason for this to be in shrink_active_list() as opposed to __isolate_lru_page? if (unlikely(!page_evictable(page, NULL))) { putback_lru_page(page); continue; } Maybe we want to minimize the amount of code under the lru lock or avoid duplicate logic in the isolate_page functions. But if there are important mlock-heavy workloads, this could make the scan come up empty, or at least emptier than we might like. Jeff -- Work email - jdike at linux dot intel dot com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
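If the test did move, it could look roughly like the sketch below (modeled loosely on the 2.6.3x-era __isolate_lru_page(); the _sketch suffix and the simplified mode check mark it as an assumption, not a real patch).

/* refuse unevictable pages at isolation time, so they are never
 * pulled off the LRU only to be put back by the shrink functions */
static int __isolate_lru_page_sketch(struct page *page, int mode, int file)
{
	if (!PageLRU(page))
		return -EINVAL;
	if (mode != ISOLATE_BOTH && page_is_file_cache(page) != file)
		return -EINVAL;
	/* the check moved down from shrink_active_list() */
	if (unlikely(!page_evictable(page, NULL)))
		return -EBUSY;
	if (likely(get_page_unless_zero(page))) {
		ClearPageLRU(page);
		return 0;
	}
	return -EBUSY;
}

One trade-off worth noting: in the current placement it is putback_lru_page() that actually moves the page onto the unevictable list, whereas refusing isolation up front would leave the page wherever it already is.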
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-06 21:09 ` Jeff Dike @ 2009-08-16 3:18 ` Wu Fengguang 2009-08-16 3:53 ` Rik van Riel 0 siblings, 1 reply; 122+ messages in thread From: Wu Fengguang @ 2009-08-16 3:18 UTC (permalink / raw) To: Jeff Dike Cc: Avi Kivity, Andrea Arcangeli, Rik van Riel, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote: > Side question - > Is there a good reason for this to be in shrink_active_list() > as opposed to __isolate_lru_page? > > if (unlikely(!page_evictable(page, NULL))) { > putback_lru_page(page); > continue; > } > > Maybe we want to minimize the amount of code under the lru lock or > avoid duplicate logic in the isolate_page functions. I guess the quick test means to avoid the expensive page_referenced() call that follows it. But that should be mostly one shot cost - the unevictable pages are unlikely to cycle in active/inactive list again and again. > But if there are important mlock-heavy workloads, this could make the > scan come up empty, or at least emptier than we might like. Yes, if the above 'if' block is removed, the inactive lists might get more expensive to reclaim. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-16 3:18 ` Wu Fengguang @ 2009-08-16 3:53 ` Rik van Riel 2009-08-16 5:15 ` Wu Fengguang 0 siblings, 1 reply; 122+ messages in thread From: Rik van Riel @ 2009-08-16 3:53 UTC (permalink / raw) To: Wu Fengguang Cc: Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm Wu Fengguang wrote: > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote: >> Side question - >> Is there a good reason for this to be in shrink_active_list() >> as opposed to __isolate_lru_page? >> >> if (unlikely(!page_evictable(page, NULL))) { >> putback_lru_page(page); >> continue; >> } >> >> Maybe we want to minimize the amount of code under the lru lock or >> avoid duplicate logic in the isolate_page functions. > > I guess the quick test means to avoid the expensive page_referenced() > call that follows it. But that should be mostly one shot cost - the > unevictable pages are unlikely to cycle in active/inactive list again > and again. Please read what putback_lru_page does. It moves the page onto the unevictable list, so that it will not end up in this scan again. >> But if there are important mlock-heavy workloads, this could make the >> scan come up empty, or at least emptier than we might like. > > Yes, if the above 'if' block is removed, the inactive lists might get > more expensive to reclaim. Why? -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-16 3:53 ` Rik van Riel @ 2009-08-16 5:15 ` Wu Fengguang 2009-08-16 11:29 ` Wu Fengguang 0 siblings, 1 reply; 122+ messages in thread From: Wu Fengguang @ 2009-08-16 5:15 UTC (permalink / raw) To: Rik van Riel Cc: Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote: > Wu Fengguang wrote: > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote: > >> Side question - > >> Is there a good reason for this to be in shrink_active_list() > >> as opposed to __isolate_lru_page? > >> > >> if (unlikely(!page_evictable(page, NULL))) { > >> putback_lru_page(page); > >> continue; > >> } > >> > >> Maybe we want to minimize the amount of code under the lru lock or > >> avoid duplicate logic in the isolate_page functions. > > > > I guess the quick test means to avoid the expensive page_referenced() > > call that follows it. But that should be mostly one shot cost - the > > unevictable pages are unlikely to cycle in active/inactive list again > > and again. > > Please read what putback_lru_page does. > > It moves the page onto the unevictable list, so that > it will not end up in this scan again. Yes it does. I said 'mostly' because there is a small hole that an unevictable page may be scanned but still not moved to unevictable list: when a page is mapped in two places, the first pte has the referenced bit set, the _second_ VMA has VM_LOCKED bit set, then page_referenced() will return 1 and shrink_page_list() will move it into active list instead of unevictable list. Shall we fix this rare case? > >> But if there are important mlock-heavy workloads, this could make the > >> scan come up empty, or at least emptier than we might like. > > > > Yes, if the above 'if' block is removed, the inactive lists might get > > more expensive to reclaim. > > Why? Without the 'if' block, an unevictable page may well be deactivated into inactive list (and some time later be moved to unevictable list from there), increasing the inactive list's scanned:reclaimed ratio. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
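The two-VMA hole is easier to see in a toy model of the rmap walk (plain C, purely illustrative; the real walk is page_referenced_one() iterating over the rmap chains).

/* a referenced pte in an earlier vma masks a VM_LOCKED vma found
 * later, so the caller sees "referenced" and reactivates the page
 * instead of culling it to the unevictable list */
struct toy_vma {
	int pte_young;	/* referenced bit set in this mapping */
	int vm_locked;	/* VM_LOCKED set on this vma */
};

static int toy_page_referenced(const struct toy_vma *vma, int nr_maps)
{
	int i, referenced = 0;

	for (i = 0; i < nr_maps; i++) {
		if (vma[i].vm_locked)
			break;	/* the real code breaks early via *mapcount = 1 */
		referenced += vma[i].pte_young;
	}
	return referenced;
}

int main(void)
{
	struct toy_vma maps[2] = { { .pte_young = 1 }, { .vm_locked = 1 } };

	/* returns 1: the page looks referenced and keeps circulating,
	 * which is the hole described above; the fix in the next message
	 * zeroes the result once VM_LOCKED has been seen */
	return toy_page_referenced(maps, 2) == 1 ? 0 : 1;
}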
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-16 5:15 ` Wu Fengguang @ 2009-08-16 11:29 ` Wu Fengguang 2009-08-17 14:33 ` Minchan Kim 2009-08-18 15:57 ` KOSAKI Motohiro 0 siblings, 2 replies; 122+ messages in thread From: Wu Fengguang @ 2009-08-16 11:29 UTC (permalink / raw) To: Rik van Riel Cc: Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote: > On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote: > > Wu Fengguang wrote: > > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote: > > >> Side question - > > >> Is there a good reason for this to be in shrink_active_list() > > >> as opposed to __isolate_lru_page? > > >> > > >> if (unlikely(!page_evictable(page, NULL))) { > > >> putback_lru_page(page); > > >> continue; > > >> } > > >> > > >> Maybe we want to minimize the amount of code under the lru lock or > > >> avoid duplicate logic in the isolate_page functions. > > > > > > I guess the quick test means to avoid the expensive page_referenced() > > > call that follows it. But that should be mostly one shot cost - the > > > unevictable pages are unlikely to cycle in active/inactive list again > > > and again. > > > > Please read what putback_lru_page does. > > > > It moves the page onto the unevictable list, so that > > it will not end up in this scan again. > > Yes it does. I said 'mostly' because there is a small hole that an > unevictable page may be scanned but still not moved to unevictable > list: when a page is mapped in two places, the first pte has the > referenced bit set, the _second_ VMA has VM_LOCKED bit set, then > page_referenced() will return 1 and shrink_page_list() will move it > into active list instead of unevictable list. Shall we fix this rare > case? How about this fix? --- mm: stop circulating of referenced mlocked pages Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- --- linux.orig/mm/rmap.c 2009-08-16 19:11:13.000000000 +0800 +++ linux/mm/rmap.c 2009-08-16 19:22:46.000000000 +0800 @@ -358,6 +358,7 @@ static int page_referenced_one(struct pa */ if (vma->vm_flags & VM_LOCKED) { *mapcount = 1; /* break early from loop */ + *vm_flags |= VM_LOCKED; goto out_unmap; } @@ -482,6 +483,8 @@ static int page_referenced_file(struct p } spin_unlock(&mapping->i_mmap_lock); + if (*vm_flags & VM_LOCKED) + referenced = 0; return referenced; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-16 11:29 ` Wu Fengguang @ 2009-08-17 14:33 ` Minchan Kim 2009-08-18 2:34 ` Wu Fengguang 0 siblings, 1 reply; 122+ messages in thread From: Minchan Kim @ 2009-08-17 14:33 UTC (permalink / raw) To: Wu Fengguang Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm Hi, Wu. On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote: >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote: >> > Wu Fengguang wrote: >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote: >> > >> Side question - >> > >> Is there a good reason for this to be in shrink_active_list() >> > >> as opposed to __isolate_lru_page? >> > >> >> > >> if (unlikely(!page_evictable(page, NULL))) { >> > >> putback_lru_page(page); >> > >> continue; >> > >> } >> > >> >> > >> Maybe we want to minimize the amount of code under the lru lock or >> > >> avoid duplicate logic in the isolate_page functions. >> > > >> > > I guess the quick test means to avoid the expensive page_referenced() >> > > call that follows it. But that should be mostly one shot cost - the >> > > unevictable pages are unlikely to cycle in active/inactive list again >> > > and again. >> > >> > Please read what putback_lru_page does. >> > >> > It moves the page onto the unevictable list, so that >> > it will not end up in this scan again. >> >> Yes it does. I said 'mostly' because there is a small hole that an >> unevictable page may be scanned but still not moved to unevictable >> list: when a page is mapped in two places, the first pte has the >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then >> page_referenced() will return 1 and shrink_page_list() will move it >> into active list instead of unevictable list. Shall we fix this rare >> case? I think it's not a big deal. As you mentioned, it's a rare case, so there would be few pages in the active list instead of the unevictable list. When the next scan comes, we can try to move the pages into the unevictable list again. As far as I know about mlocked pages, we already have some race conditions. They will be rescued as above. > > How about this fix? > > --- > mm: stop circulating of referenced mlocked pages > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > --- > > --- linux.orig/mm/rmap.c 2009-08-16 19:11:13.000000000 +0800 > +++ linux/mm/rmap.c 2009-08-16 19:22:46.000000000 +0800 > @@ -358,6 +358,7 @@ static int page_referenced_one(struct pa > */ > if (vma->vm_flags & VM_LOCKED) { > *mapcount = 1; /* break early from loop */ > + *vm_flags |= VM_LOCKED; > goto out_unmap; > } > > @@ -482,6 +483,8 @@ static int page_referenced_file(struct p > } > > spin_unlock(&mapping->i_mmap_lock); > + if (*vm_flags & VM_LOCKED) > + referenced = 0; > return referenced; > } > > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-17 14:33 ` Minchan Kim @ 2009-08-18 2:34 ` Wu Fengguang 2009-08-18 4:17 ` Minchan Kim 0 siblings, 1 reply; 122+ messages in thread From: Wu Fengguang @ 2009-08-18 2:34 UTC (permalink / raw) To: Minchan Kim Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm Minchan, On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote: > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote: > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote: > >> > Wu Fengguang wrote: > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote: > >> > >> Side question - > >> > >> Is there a good reason for this to be in shrink_active_list() > >> > >> as opposed to __isolate_lru_page? > >> > >> > >> > >> if (unlikely(!page_evictable(page, NULL))) { > >> > >> putback_lru_page(page); > >> > >> continue; > >> > >> } > >> > >> > >> > >> Maybe we want to minimize the amount of code under the lru lock or > >> > >> avoid duplicate logic in the isolate_page functions. > >> > > > >> > > I guess the quick test means to avoid the expensive page_referenced() > >> > > call that follows it. But that should be mostly one shot cost - the > >> > > unevictable pages are unlikely to cycle in active/inactive list again > >> > > and again. > >> > > >> > Please read what putback_lru_page does. > >> > > >> > It moves the page onto the unevictable list, so that > >> > it will not end up in this scan again. > >> > >> Yes it does. I said 'mostly' because there is a small hole that an > >> unevictable page may be scanned but still not moved to unevictable > >> list: when a page is mapped in two places, the first pte has the > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then > >> page_referenced() will return 1 and shrink_page_list() will move it > >> into active list instead of unevictable list. Shall we fix this rare > >> case? > > I think it's not a big deal. Maybe, otherwise I should have brought up this issue long before :) > As you mentioned, it's a rare case, so there would be few pages in the active > list instead of the unevictable list. Yes. > When the next scan comes, we can try to move the pages into the > unevictable list again. Will PG_mlocked be set by then? Otherwise the situation is not likely to change and the VM_LOCKED pages may circulate in active/inactive list for countless times. > As far as I know about mlocked pages, we already have some race conditions. > They will be rescued as above. Thanks, Fengguang > > > > How about this fix? > > > > --- > > mm: stop circulating of referenced mlocked pages > > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > > --- > > > > --- linux.orig/mm/rmap.c 2009-08-16 19:11:13.000000000 +0800 > > +++ linux/mm/rmap.c 2009-08-16 19:22:46.000000000 +0800 > > @@ -358,6 +358,7 @@ static int page_referenced_one(struct pa > > */ > > if (vma->vm_flags & VM_LOCKED) { > > *mapcount = 1; /* break early from loop */ > > + *vm_flags |= VM_LOCKED; > > goto out_unmap; > > } > > > > @@ -482,6 +483,8 @@ static int page_referenced_file(struct p > > } > > > > spin_unlock(&mapping->i_mmap_lock); > > + if (*vm_flags & VM_LOCKED) > > + referenced = 0; > > return referenced; > > } > > > > > > -- > > To unsubscribe, send a message with 'unsubscribe linux-mm' in > > the body to majordomo@kvack.org. For more info on Linux MM, > > see: http://www.linux-mm.org/ . > > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > > > > > > -- > Kind regards, > Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-18 2:34 ` Wu Fengguang @ 2009-08-18 4:17 ` Minchan Kim 2009-08-18 9:31 ` Wu Fengguang 0 siblings, 1 reply; 122+ messages in thread From: Minchan Kim @ 2009-08-18 4:17 UTC (permalink / raw) To: Wu Fengguang Cc: Minchan Kim, Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On Tue, 18 Aug 2009 10:34:38 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote: > Minchan, > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote: > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote: > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote: > > >> > Wu Fengguang wrote: > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote: > > >> > >> Side question - > > >> > >> Is there a good reason for this to be in shrink_active_list() > > >> > >> as opposed to __isolate_lru_page? > > >> > >> > > >> > >> if (unlikely(!page_evictable(page, NULL))) { > > >> > >> putback_lru_page(page); > > >> > >> continue; > > >> > >> } > > >> > >> > > >> > >> Maybe we want to minimize the amount of code under the lru lock or > > >> > >> avoid duplicate logic in the isolate_page functions. > > >> > > > > >> > > I guess the quick test means to avoid the expensive page_referenced() > > >> > > call that follows it. But that should be mostly one shot cost - the > > >> > > unevictable pages are unlikely to cycle in active/inactive list again > > >> > > and again. > > >> > > > >> > Please read what putback_lru_page does. > > >> > > > >> > It moves the page onto the unevictable list, so that > > >> > it will not end up in this scan again. > > >> > > >> Yes it does. I said 'mostly' because there is a small hole that an > > >> unevictable page may be scanned but still not moved to unevictable > > >> list: when a page is mapped in two places, the first pte has the > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then > > >> page_referenced() will return 1 and shrink_page_list() will move it > > >> into active list instead of unevictable list. Shall we fix this rare > > >> case? > > > > I think it's not a big deal. > > Maybe, otherwise I should have brought up this issue long before :) > > > As you mentioned, it's a rare case, so there would be few pages in the active > > > list instead of the unevictable list. > > > > Yes. > > > > > When the next scan comes, we can try to move the pages into the > > > unevictable list again. > > > > Will PG_mlocked be set by then? Otherwise the situation is not likely > > to change and the VM_LOCKED pages may circulate in active/inactive > > list for countless times. PG_mlocked is not important in that case. The important thing is the VM_LOCKED vma. I think the annotation below can help you to understand my point. :) ---- /* * called from munlock()/munmap() path with page supposedly on the LRU. * * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked * [in try_to_munlock()] and then attempt to isolate the page. We must * isolate the page to keep others from messing with its unevictable * and mlocked state while trying to munlock. However, we pre-clear the * mlocked state anyway as we might lose the isolation race and we might * not get another chance to clear PageMlocked.
If we successfully * isolate the page and try_to_munlock() detects other VM_LOCKED vmas * mapping the page, it will restore the PageMlocked state, unless the page * is mapped in a non-linear vma. So, we go ahead and SetPageMlocked(), * perhaps redundantly. * If we lose the isolation race, and the page is mapped by other VM_LOCKED * vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap() * either of which will restore the PageMlocked state by calling * mlock_vma_page() above, if it can grab the vma's mmap sem. */ static void munlock_vma_page(struct page *page) { ... -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-18 4:17 ` Minchan Kim @ 2009-08-18 9:31 ` Wu Fengguang 2009-08-18 9:52 ` Minchan Kim 0 siblings, 1 reply; 122+ messages in thread From: Wu Fengguang @ 2009-08-18 9:31 UTC (permalink / raw) To: Minchan Kim Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote: > On Tue, 18 Aug 2009 10:34:38 +0800 > Wu Fengguang <fengguang.wu@intel.com> wrote: > > > Minchan, > > > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote: > > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: > > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote: > > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote: > > > >> > Wu Fengguang wrote: > > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote: > > > >> > >> Side question - > > > >> > >> Is there a good reason for this to be in shrink_active_list() > > > >> > >> as opposed to __isolate_lru_page? > > > >> > >> > > > >> > >> if (unlikely(!page_evictable(page, NULL))) { > > > >> > >> putback_lru_page(page); > > > >> > >> continue; > > > >> > >> } > > > >> > >> > > > >> > >> Maybe we want to minimize the amount of code under the lru lock or > > > >> > >> avoid duplicate logic in the isolate_page functions. > > > >> > > > > > >> > > I guess the quick test means to avoid the expensive page_referenced() > > > >> > > call that follows it. But that should be mostly one shot cost - the > > > >> > > unevictable pages are unlikely to cycle in active/inactive list again > > > >> > > and again. > > > >> > > > > >> > Please read what putback_lru_page does. > > > >> > > > > >> > It moves the page onto the unevictable list, so that > > > >> > it will not end up in this scan again. > > > >> > > > >> Yes it does. I said 'mostly' because there is a small hole that an > > > >> unevictable page may be scanned but still not moved to unevictable > > > >> list: when a page is mapped in two places, the first pte has the > > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then > > > >> page_referenced() will return 1 and shrink_page_list() will move it > > > >> into active list instead of unevictable list. Shall we fix this rare > > > >> case? > > > > > > I think it's not a big deal. > > > > Maybe, otherwise I should have brought up this issue long before :) > > > > > As you mentioned, it's a rare case, so there would be few pages in the active > > > > list instead of the unevictable list. > > > > > > Yes. > > > > > > > When the next scan comes, we can try to move the pages into the > > > > unevictable list again. > > > > > > Will PG_mlocked be set by then? Otherwise the situation is not likely > > > to change and the VM_LOCKED pages may circulate in active/inactive > > > list for countless times. > > > > PG_mlocked is not important in that case. > The important thing is the VM_LOCKED vma. > I think the annotation below can help you to understand my point. :) Hmm, it looks like pages under a VM_LOCKED vma are guaranteed to have PG_mlocked set, and so will be caught by page_evictable(). Is it? Then I was worrying about a null problem. Sorry for the confusion! Thanks, Fengguang > ---- > > /* > * called from munlock()/munmap() path with page supposedly on the LRU.
> * > * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked > * [in try_to_munlock()] and then attempt to isolate the page. We must > * isolate the page to keep others from messing with its unevictable > * and mlocked state while trying to munlock. However, we pre-clear the > * mlocked state anyway as we might lose the isolation race and we might > * not get another chance to clear PageMlocked. If we successfully > * isolate the page and try_to_munlock() detects other VM_LOCKED vmas > * mapping the page, it will restore the PageMlocked state, unless the page > * is mapped in a non-linear vma. So, we go ahead and SetPageMlocked(), > * perhaps redundantly. > * If we lose the isolation race, and the page is mapped by other VM_LOCKED > * vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap() > * either of which will restore the PageMlocked state by calling > * mlock_vma_page() above, if it can grab the vma's mmap sem. > */ > static void munlock_vma_page(struct page *page) > { > ... > > -- > Kind regards, > Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
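The page_evictable() test in question is short; as the patch context quoted later in the thread shows, it reads roughly:

        int page_evictable(struct page *page, struct vm_area_struct *vma)
        {
                if (mapping_unevictable(page_mapping(page)))
                        return 0;       /* e.g. SHM_LOCK'ed shmem */

                if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page)))
                        return 0;       /* mlocked */

                return 1;
        }

Note that vmscan calls it as page_evictable(page, NULL), so on the reclaim path only PG_mlocked is consulted and the vma-based half never runs. That is why the answer to Wu's question hinges on whether PG_mlocked is reliably set.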
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-18 9:31 ` Wu Fengguang @ 2009-08-18 9:52 ` Minchan Kim 2009-08-18 10:00 ` Wu Fengguang 0 siblings, 1 reply; 122+ messages in thread From: Minchan Kim @ 2009-08-18 9:52 UTC (permalink / raw) To: Wu Fengguang Cc: Minchan Kim, Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On Tue, 18 Aug 2009 17:31:19 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote: > On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote: > > On Tue, 18 Aug 2009 10:34:38 +0800 > > Wu Fengguang <fengguang.wu@intel.com> wrote: > > > > > Minchan, > > > > > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote: > > > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: > > > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote: > > > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote: > > > > >> > Wu Fengguang wrote: > > > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote: > > > > >> > >> Side question - > > > > >> > >> A Is there a good reason for this to be in shrink_active_list() > > > > >> > >> as opposed to __isolate_lru_page? > > > > >> > >> > > > > >> > >> A A A A A if (unlikely(!page_evictable(page, NULL))) { > > > > >> > >> A A A A A A A A A putback_lru_page(page); > > > > >> > >> A A A A A A A A A continue; > > > > >> > >> A A A A A } > > > > >> > >> > > > > >> > >> Maybe we want to minimize the amount of code under the lru lock or > > > > >> > >> avoid duplicate logic in the isolate_page functions. > > > > >> > > > > > > >> > > I guess the quick test means to avoid the expensive page_referenced() > > > > >> > > call that follows it. But that should be mostly one shot cost - the > > > > >> > > unevictable pages are unlikely to cycle in active/inactive list again > > > > >> > > and again. > > > > >> > > > > > >> > Please read what putback_lru_page does. > > > > >> > > > > > >> > It moves the page onto the unevictable list, so that > > > > >> > it will not end up in this scan again. > > > > >> > > > > >> Yes it does. I said 'mostly' because there is a small hole that an > > > > >> unevictable page may be scanned but still not moved to unevictable > > > > >> list: when a page is mapped in two places, the first pte has the > > > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then > > > > >> page_referenced() will return 1 and shrink_page_list() will move it > > > > >> into active list instead of unevictable list. Shall we fix this rare > > > > >> case? > > > > > > > > I think it's not a big deal. > > > > > > Maybe, otherwise I should bring up this issue long time before :) > > > > > > > As you mentioned, it's rare case so there would be few pages in active > > > > list instead of unevictable list. > > > > > > Yes. > > > > > > > When next time to scan comes, we can try to move the pages into > > > > unevictable list, again. > > > > > > Will PG_mlocked be set by then? Otherwise the situation is not likely > > > to change and the VM_LOCKED pages may circulate in active/inactive > > > list for countless times. > > > > PG_mlocked is not important in that case. > > Important thing is VM_LOCKED vma. > > I think below annotaion can help you to understand my point. :) > > Hmm, it looks like pages under VM_LOCKED vma is guaranteed to have > PG_mlocked set, and so will be caught by page_evictable(). Is it? No. 
I am sorry for making my point not clear. I meant following as. When the next time to scan, shrink_page_list -> try_to_unmap -> try_to_unmap_xxx -> if (vma->vm_flags & VM_LOCKED) -> try_to_mlock_page -> TestSetPageMlocked -> putback_lru_page So at last, the page will be located in unevictable list. > Then I was worrying about a null problem. Sorry for the confusion! > > Thanks, > Fengguang > > > ---- > > > > /* > > * called from munlock()/munmap() path with page supposedly on the LRU. > > * > > * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked > > * [in try_to_munlock()] and then attempt to isolate the page. We must > > * isolate the page to keep others from messing with its unevictable > > * and mlocked state while trying to munlock. However, we pre-clear the > > * mlocked state anyway as we might lose the isolation race and we might > > * not get another chance to clear PageMlocked. If we successfully > > * isolate the page and try_to_munlock() detects other VM_LOCKED vmas > > * mapping the page, it will restore the PageMlocked state, unless the page > > * is mapped in a non-linear vma. So, we go ahead and SetPageMlocked(), > > * perhaps redundantly. > > * If we lose the isolation race, and the page is mapped by other VM_LOCKED > > * vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap() > > * either of which will restore the PageMlocked state by calling > > * mlock_vma_page() above, if it can grab the vma's mmap sem. > > */ > > static void munlock_vma_page(struct page *page) > > { > > ... > > > > -- > > Kind regards, > > Minchan Kim -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
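Minchan's chain maps onto the code like this; a condensed sketch from the 2.6.31-rc sources (not exact, and the second argument of try_to_unmap() was a plain int in older trees):

        /* mm/rmap.c, try_to_unmap_one(), roughly: */
        if (!(flags & TTU_IGNORE_MLOCK)) {
                if (vma->vm_flags & VM_LOCKED) {
                        ret = SWAP_MLOCK;       /* tell vmscan to cull */
                        goto out_unmap;
                }
        }

        /* mm/vmscan.c, shrink_page_list(), roughly: */
        if (page_mapped(page) && mapping) {
                switch (try_to_unmap(page, TTU_UNMAP)) {
                case SWAP_FAIL:
                        goto activate_locked;
                case SWAP_AGAIN:
                        goto keep_locked;
                case SWAP_MLOCK:
                        goto cull_mlocked;
                case SWAP_SUCCESS:
                        ; /* try to free the page below */
                }
        }
        ...
cull_mlocked:
        if (PageSwapCache(page))
                try_to_free_swap(page);
        unlock_page(page);
        putback_lru_page(page); /* PG_mlocked set => unevictable LRU */
        continue;

So the unevictable destination is only reached if try_to_unmap() actually gets called, which is the crux of Wu's next question.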
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-18 9:52 ` Minchan Kim @ 2009-08-18 10:00 ` Wu Fengguang 2009-08-18 11:00 ` Minchan Kim 0 siblings, 1 reply; 122+ messages in thread From: Wu Fengguang @ 2009-08-18 10:00 UTC (permalink / raw) To: Minchan Kim Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On Tue, Aug 18, 2009 at 05:52:47PM +0800, Minchan Kim wrote: > On Tue, 18 Aug 2009 17:31:19 +0800 > Wu Fengguang <fengguang.wu@intel.com> wrote: > > > On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote: > > > On Tue, 18 Aug 2009 10:34:38 +0800 > > > Wu Fengguang <fengguang.wu@intel.com> wrote: > > > > > > > Minchan, > > > > > > > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote: > > > > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: > > > > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote: > > > > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote: > > > > > >> > Wu Fengguang wrote: > > > > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote: > > > > > >> > >> Side question - > > > > > >> > >> A Is there a good reason for this to be in shrink_active_list() > > > > > >> > >> as opposed to __isolate_lru_page? > > > > > >> > >> > > > > > >> > >> A A A A A if (unlikely(!page_evictable(page, NULL))) { > > > > > >> > >> A A A A A A A A A putback_lru_page(page); > > > > > >> > >> A A A A A A A A A continue; > > > > > >> > >> A A A A A } > > > > > >> > >> > > > > > >> > >> Maybe we want to minimize the amount of code under the lru lock or > > > > > >> > >> avoid duplicate logic in the isolate_page functions. > > > > > >> > > > > > > > >> > > I guess the quick test means to avoid the expensive page_referenced() > > > > > >> > > call that follows it. But that should be mostly one shot cost - the > > > > > >> > > unevictable pages are unlikely to cycle in active/inactive list again > > > > > >> > > and again. > > > > > >> > > > > > > >> > Please read what putback_lru_page does. > > > > > >> > > > > > > >> > It moves the page onto the unevictable list, so that > > > > > >> > it will not end up in this scan again. > > > > > >> > > > > > >> Yes it does. I said 'mostly' because there is a small hole that an > > > > > >> unevictable page may be scanned but still not moved to unevictable > > > > > >> list: when a page is mapped in two places, the first pte has the > > > > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then > > > > > >> page_referenced() will return 1 and shrink_page_list() will move it > > > > > >> into active list instead of unevictable list. Shall we fix this rare > > > > > >> case? > > > > > > > > > > I think it's not a big deal. > > > > > > > > Maybe, otherwise I should bring up this issue long time before :) > > > > > > > > > As you mentioned, it's rare case so there would be few pages in active > > > > > list instead of unevictable list. > > > > > > > > Yes. > > > > > > > > > When next time to scan comes, we can try to move the pages into > > > > > unevictable list, again. > > > > > > > > Will PG_mlocked be set by then? Otherwise the situation is not likely > > > > to change and the VM_LOCKED pages may circulate in active/inactive > > > > list for countless times. > > > > > > PG_mlocked is not important in that case. > > > Important thing is VM_LOCKED vma. > > > I think below annotaion can help you to understand my point. 
:) > > > > Hmm, it looks like pages under VM_LOCKED vma is guaranteed to have > > PG_mlocked set, and so will be caught by page_evictable(). Is it? > > No. I am sorry for making my point not clear. > I meant following as. > When the next time to scan, > > shrink_page_list -> referenced = page_referenced(page, 1, sc->mem_cgroup, &vm_flags); /* In active use or really unfreeable? Activate it. */ if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced && page_mapping_inuse(page)) goto activate_locked; > -> try_to_unmap ~~~~~~~~~~~~ this line won't be reached if page is found to be referenced in the above lines? Thanks, Fengguang > -> try_to_unmap_xxx > -> if (vma->vm_flags & VM_LOCKED) > -> try_to_mlock_page > -> TestSetPageMlocked > -> putback_lru_page > > So at last, the page will be located in unevictable list. > > > Then I was worrying about a null problem. Sorry for the confusion! > > > > Thanks, > > Fengguang > > > > > ---- > > > > > > /* > > > * called from munlock()/munmap() path with page supposedly on the LRU. > > > * > > > * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked > > > * [in try_to_munlock()] and then attempt to isolate the page. We must > > > * isolate the page to keep others from messing with its unevictable > > > * and mlocked state while trying to munlock. However, we pre-clear the > > > * mlocked state anyway as we might lose the isolation race and we might > > > * not get another chance to clear PageMlocked. If we successfully > > > * isolate the page and try_to_munlock() detects other VM_LOCKED vmas > > > * mapping the page, it will restore the PageMlocked state, unless the page > > > * is mapped in a non-linear vma. So, we go ahead and SetPageMlocked(), > > > * perhaps redundantly. > > > * If we lose the isolation race, and the page is mapped by other VM_LOCKED > > > * vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap() > > > * either of which will restore the PageMlocked state by calling > > > * mlock_vma_page() above, if it can grab the vma's mmap sem. > > > */ > > > static void munlock_vma_page(struct page *page) > > > { > > > ... > > > > > > -- > > > Kind regards, > > > Minchan Kim > > > -- > Kind regards, > Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
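To see what the circulation looks like in code, this is roughly where the referenced check lands the page (2.6.30-era shrink_page_list() labels; a sketch, not verbatim):

activate_locked:
        /* Not a candidate for swapping, so reclaim swap space. */
        if (PageSwapCache(page) && vm_swap_full())
                try_to_free_swap(page);
        VM_BUG_ON(PageActive(page));
        SetPageActive(page);    /* back to the active list... */
        pgactivate++;
keep_locked:
        unlock_page(page);
keep:
        list_add(&page->lru, &ret_pages);

So a VM_LOCKED page whose pte happens to be young is re-activated here and never reaches the SWAP_MLOCK cull, which is exactly the hole under discussion.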
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-18 10:00 ` Wu Fengguang @ 2009-08-18 11:00 ` Minchan Kim 2009-08-18 11:11 ` Wu Fengguang 0 siblings, 1 reply; 122+ messages in thread From: Minchan Kim @ 2009-08-18 11:00 UTC (permalink / raw) To: Wu Fengguang, Lee Schermerhorn Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On Tue, Aug 18, 2009 at 7:00 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: > On Tue, Aug 18, 2009 at 05:52:47PM +0800, Minchan Kim wrote: >> On Tue, 18 Aug 2009 17:31:19 +0800 >> Wu Fengguang <fengguang.wu@intel.com> wrote: >> >> > On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote: >> > > On Tue, 18 Aug 2009 10:34:38 +0800 >> > > Wu Fengguang <fengguang.wu@intel.com> wrote: >> > > >> > > > Minchan, >> > > > >> > > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote: >> > > > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: >> > > > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote: >> > > > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote: >> > > > > >> > Wu Fengguang wrote: >> > > > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote: >> > > > > >> > >> Side question - >> > > > > >> > >> Is there a good reason for this to be in shrink_active_list() >> > > > > >> > >> as opposed to __isolate_lru_page? >> > > > > >> > >> >> > > > > >> > >> if (unlikely(!page_evictable(page, NULL))) { >> > > > > >> > >> putback_lru_page(page); >> > > > > >> > >> continue; >> > > > > >> > >> } >> > > > > >> > >> >> > > > > >> > >> Maybe we want to minimize the amount of code under the lru lock or >> > > > > >> > >> avoid duplicate logic in the isolate_page functions. >> > > > > >> > > >> > > > > >> > > I guess the quick test means to avoid the expensive page_referenced() >> > > > > >> > > call that follows it. But that should be mostly one shot cost - the >> > > > > >> > > unevictable pages are unlikely to cycle in active/inactive list again >> > > > > >> > > and again. >> > > > > >> > >> > > > > >> > Please read what putback_lru_page does. >> > > > > >> > >> > > > > >> > It moves the page onto the unevictable list, so that >> > > > > >> > it will not end up in this scan again. >> > > > > >> >> > > > > >> Yes it does. I said 'mostly' because there is a small hole that an >> > > > > >> unevictable page may be scanned but still not moved to unevictable >> > > > > >> list: when a page is mapped in two places, the first pte has the >> > > > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then >> > > > > >> page_referenced() will return 1 and shrink_page_list() will move it >> > > > > >> into active list instead of unevictable list. Shall we fix this rare >> > > > > >> case? >> > > > > >> > > > > I think it's not a big deal. >> > > > >> > > > Maybe, otherwise I should bring up this issue long time before :) >> > > > >> > > > > As you mentioned, it's rare case so there would be few pages in active >> > > > > list instead of unevictable list. >> > > > >> > > > Yes. >> > > > >> > > > > When next time to scan comes, we can try to move the pages into >> > > > > unevictable list, again. >> > > > >> > > > Will PG_mlocked be set by then? Otherwise the situation is not likely >> > > > to change and the VM_LOCKED pages may circulate in active/inactive >> > > > list for countless times. >> > > >> > > PG_mlocked is not important in that case. 
>> > > Important thing is VM_LOCKED vma. >> > > I think below annotaion can help you to understand my point. :) >> > >> > Hmm, it looks like pages under VM_LOCKED vma is guaranteed to have >> > PG_mlocked set, and so will be caught by page_evictable(). Is it? >> >> No. I am sorry for making my point not clear. >> I meant following as. >> When the next time to scan, >> >> shrink_page_list > -> > referenced = page_referenced(page, 1, > sc->mem_cgroup, &vm_flags); > /* In active use or really unfreeable? Activate it. */ > if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && > referenced && page_mapping_inuse(page)) > goto activate_locked; > >> -> try_to_unmap > ~~~~~~~~~~~~ this line won't be reached if page is found to be > referenced in the above lines? Indeed! In fact, I was worry about that. It looks after live lock problem. But I think it's very small race window so there isn't any report until now. Let's Cced Lee. If we have to fix it, how about this ? This version has small overhead than yours since there is less shrink_page_list call than page_referenced. diff --git a/mm/rmap.c b/mm/rmap.c index ed63894..283266c 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -358,6 +358,7 @@ static int page_referenced_one(struct page *page, */ if (vma->vm_flags & VM_LOCKED) { *mapcount = 1; /* break early from loop */ + *vm_flags |= VM_LOCKED; goto out_unmap; } diff --git a/mm/vmscan.c b/mm/vmscan.c index d224b28..d156e1d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -632,7 +632,8 @@ static unsigned long shrink_page_list(struct list_head *page_list, sc->mem_cgroup, &vm_flags); /* In active use or really unfreeable? Activate it. */ if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && - referenced && page_mapping_inuse(page)) + referenced && page_mapping_inuse(page) + && !(vm_flags & VM_LOCKED)) goto activate_locked; > > Thanks, > Fengguang > >> -> try_to_unmap_xxx >> -> if (vma->vm_flags & VM_LOCKED) >> -> try_to_mlock_page >> -> TestSetPageMlocked >> -> putback_lru_page >> >> So at last, the page will be located in unevictable list. >> >> > Then I was worrying about a null problem. Sorry for the confusion! >> > >> > Thanks, >> > Fengguang >> > >> > > ---- >> > > >> > > /* >> > > * called from munlock()/munmap() path with page supposedly on the LRU. >> > > * >> > > * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked >> > > * [in try_to_munlock()] and then attempt to isolate the page. We must >> > > * isolate the page to keep others from messing with its unevictable >> > > * and mlocked state while trying to munlock. However, we pre-clear the >> > > * mlocked state anyway as we might lose the isolation race and we might >> > > * not get another chance to clear PageMlocked. If we successfully >> > > * isolate the page and try_to_munlock() detects other VM_LOCKED vmas >> > > * mapping the page, it will restore the PageMlocked state, unless the page >> > > * is mapped in a non-linear vma. So, we go ahead and SetPageMlocked(), >> > > * perhaps redundantly. >> > > * If we lose the isolation race, and the page is mapped by other VM_LOCKED >> > > * vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap() >> > > * either of which will restore the PageMlocked state by calling >> > > * mlock_vma_page() above, if it can grab the vma's mmap sem. >> > > */ >> > > static void munlock_vma_page(struct page *page) >> > > { >> > > ... 
>> > > >> > > -- >> > > Kind regards, >> > > Minchan Kim >> >> >> -- >> Kind regards, >> Minchan Kim > -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 122+ messages in thread
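For context on where that vm_flags value comes from: page_referenced() is a dispatcher that accumulates into *vm_flags from both the anon and file walkers. A rough sketch of the 2.6.31-rc version (trimmed; the file-path trylock handling is elided):

        int page_referenced(struct page *page, int is_locked,
                            struct mem_cgroup *mem_cont,
                            unsigned long *vm_flags)
        {
                int referenced = 0;

                if (TestClearPageReferenced(page))
                        referenced++;

                *vm_flags = 0;
                if (page_mapped(page) && page->mapping) {
                        if (PageAnon(page))
                                referenced += page_referenced_anon(page,
                                                        mem_cont, vm_flags);
                        else    /* file-backed; may need the page lock */
                                referenced += page_referenced_file(page,
                                                        mem_cont, vm_flags);
                }

                if (page_test_and_clear_young(page))
                        referenced++;

                return referenced;
        }

Only this dispatcher sees the result of both walkers, which is why the review comments later in the thread argue that any VM_LOCKED override belongs in page_referenced() itself rather than in page_referenced_file() alone.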
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-18 11:00 ` Minchan Kim @ 2009-08-18 11:11 ` Wu Fengguang 2009-08-18 14:03 ` Minchan Kim 2009-08-18 16:27 ` KOSAKI Motohiro 0 siblings, 2 replies; 122+ messages in thread From: Wu Fengguang @ 2009-08-18 11:11 UTC (permalink / raw) To: Minchan Kim Cc: Lee Schermerhorn, Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On Tue, Aug 18, 2009 at 07:00:48PM +0800, Minchan Kim wrote: > On Tue, Aug 18, 2009 at 7:00 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: > > On Tue, Aug 18, 2009 at 05:52:47PM +0800, Minchan Kim wrote: > >> On Tue, 18 Aug 2009 17:31:19 +0800 > >> Wu Fengguang <fengguang.wu@intel.com> wrote: > >> > >> > On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote: > >> > > On Tue, 18 Aug 2009 10:34:38 +0800 > >> > > Wu Fengguang <fengguang.wu@intel.com> wrote: > >> > > > >> > > > Minchan, > >> > > > > >> > > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote: > >> > > > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: > >> > > > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote: > >> > > > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote: > >> > > > > >> > Wu Fengguang wrote: > >> > > > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote: > >> > > > > >> > >> Side question - > >> > > > > >> > >> A Is there a good reason for this to be in shrink_active_list() > >> > > > > >> > >> as opposed to __isolate_lru_page? > >> > > > > >> > >> > >> > > > > >> > >> A A A A A if (unlikely(!page_evictable(page, NULL))) { > >> > > > > >> > >> A A A A A A A A A putback_lru_page(page); > >> > > > > >> > >> A A A A A A A A A continue; > >> > > > > >> > >> A A A A A } > >> > > > > >> > >> > >> > > > > >> > >> Maybe we want to minimize the amount of code under the lru lock or > >> > > > > >> > >> avoid duplicate logic in the isolate_page functions. > >> > > > > >> > > > >> > > > > >> > > I guess the quick test means to avoid the expensive page_referenced() > >> > > > > >> > > call that follows it. But that should be mostly one shot cost - the > >> > > > > >> > > unevictable pages are unlikely to cycle in active/inactive list again > >> > > > > >> > > and again. > >> > > > > >> > > >> > > > > >> > Please read what putback_lru_page does. > >> > > > > >> > > >> > > > > >> > It moves the page onto the unevictable list, so that > >> > > > > >> > it will not end up in this scan again. > >> > > > > >> > >> > > > > >> Yes it does. I said 'mostly' because there is a small hole that an > >> > > > > >> unevictable page may be scanned but still not moved to unevictable > >> > > > > >> list: when a page is mapped in two places, the first pte has the > >> > > > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then > >> > > > > >> page_referenced() will return 1 and shrink_page_list() will move it > >> > > > > >> into active list instead of unevictable list. Shall we fix this rare > >> > > > > >> case? > >> > > > > > >> > > > > I think it's not a big deal. > >> > > > > >> > > > Maybe, otherwise I should bring up this issue long time before :) > >> > > > > >> > > > > As you mentioned, it's rare case so there would be few pages in active > >> > > > > list instead of unevictable list. > >> > > > > >> > > > Yes. 
> >> > > > > >> > > > > When next time to scan comes, we can try to move the pages into > >> > > > > unevictable list, again. > >> > > > > >> > > > Will PG_mlocked be set by then? Otherwise the situation is not likely > >> > > > to change and the VM_LOCKED pages may circulate in active/inactive > >> > > > list for countless times. > >> > > > >> > > PG_mlocked is not important in that case. > >> > > Important thing is VM_LOCKED vma. > >> > > I think below annotaion can help you to understand my point. :) > >> > > >> > Hmm, it looks like pages under VM_LOCKED vma is guaranteed to have > >> > PG_mlocked set, and so will be caught by page_evictable(). Is it? > >> > >> No. I am sorry for making my point not clear. > >> I meant following as. > >> When the next time to scan, > >> > >> shrink_page_list > > -> > > referenced = page_referenced(page, 1, > > sc->mem_cgroup, &vm_flags); > > /* In active use or really unfreeable? Activate it. */ > > if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && > > referenced && page_mapping_inuse(page)) > > goto activate_locked; > > > >> -> try_to_unmap > > ~~~~~~~~~~~~ this line won't be reached if page is found to be > > referenced in the above lines? > > Indeed! In fact, I was worry about that. > It looks after live lock problem. > But I think it's very small race window so there isn't any report until now. > Let's Cced Lee. > > If we have to fix it, how about this ? > This version has small overhead than yours since > there is less shrink_page_list call than page_referenced. Yeah, it looks better. However I still wonder if (VM_LOCKED && !PG_mlocked) is possible and somehow persistent. Does anyone have the answer? Thanks! Thanks, Fengguang > > diff --git a/mm/rmap.c b/mm/rmap.c > index ed63894..283266c 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -358,6 +358,7 @@ static int page_referenced_one(struct page *page, > */ > if (vma->vm_flags & VM_LOCKED) { > *mapcount = 1; /* break early from loop */ > + *vm_flags |= VM_LOCKED; > goto out_unmap; > } > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index d224b28..d156e1d 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -632,7 +632,8 @@ static unsigned long shrink_page_list(struct > list_head *page_list, > sc->mem_cgroup, &vm_flags); > /* In active use or really unfreeable? Activate it. */ > if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && > - referenced && page_mapping_inuse(page)) > + referenced && page_mapping_inuse(page) > + && !(vm_flags & VM_LOCKED)) > goto activate_locked; > > > > > > Thanks, > > Fengguang > > > >> -> try_to_unmap_xxx > >> -> if (vma->vm_flags & VM_LOCKED) > >> -> try_to_mlock_page > >> -> TestSetPageMlocked > >> -> putback_lru_page > >> > >> So at last, the page will be located in unevictable list. > >> > >> > Then I was worrying about a null problem. Sorry for the confusion! > >> > > >> > Thanks, > >> > Fengguang > >> > > >> > > ---- > >> > > > >> > > /* > >> > > * called from munlock()/munmap() path with page supposedly on the LRU. > >> > > * > >> > > * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked > >> > > * [in try_to_munlock()] and then attempt to isolate the page. We must > >> > > * isolate the page to keep others from messing with its unevictable > >> > > * and mlocked state while trying to munlock.
However, we pre-clear the > >> > > * mlocked state anyway as we might lose the isolation race and we might > >> > > * not get another chance to clear PageMlocked. If we successfully > >> > > * isolate the page and try_to_munlock() detects other VM_LOCKED vmas > >> > > * mapping the page, it will restore the PageMlocked state, unless the page > >> > > * is mapped in a non-linear vma. So, we go ahead and SetPageMlocked(), > >> > > * perhaps redundantly. > >> > > * If we lose the isolation race, and the page is mapped by other VM_LOCKED > >> > > * vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap() > >> > > * either of which will restore the PageMlocked state by calling > >> > > * mlock_vma_page() above, if it can grab the vma's mmap sem. > >> > > */ > >> > > static void munlock_vma_page(struct page *page) > >> > > { > >> > > ... > >> > > > >> > > -- > >> > > Kind regards, > >> > > Minchan Kim > >> > >> > >> -- > >> Kind regards, > >> Minchan Kim > > > > > > -- > Kind regards, > Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
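On the set side of Wu's question, PG_mlocked is normally established by mlock_vma_page() in mm/mlock.c; roughly (a 2.6.29-era sketch):

        void mlock_vma_page(struct page *page)
        {
                BUG_ON(!PageLocked(page));

                if (!TestSetPageMlocked(page)) {
                        inc_zone_page_state(page, NR_MLOCK);
                        count_vm_event(UNEVICTABLE_PGMLOCKED);
                        /* move it straight to the unevictable list if we can */
                        if (!isolate_lru_page(page))
                                putback_lru_page(page);
                }
        }

So (VM_LOCKED && !PG_mlocked) arises whenever this path has not run for a given page yet, or after munlock pre-clears the bit, which is exactly what Minchan spells out next.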
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-18 11:11 ` Wu Fengguang @ 2009-08-18 14:03 ` Minchan Kim 2009-08-18 16:27 ` KOSAKI Motohiro 1 sibling, 0 replies; 122+ messages in thread From: Minchan Kim @ 2009-08-18 14:03 UTC (permalink / raw) To: Wu Fengguang Cc: Lee Schermerhorn, Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On Tue, Aug 18, 2009 at 8:11 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: > On Tue, Aug 18, 2009 at 07:00:48PM +0800, Minchan Kim wrote: >> On Tue, Aug 18, 2009 at 7:00 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: >> > On Tue, Aug 18, 2009 at 05:52:47PM +0800, Minchan Kim wrote: >> >> On Tue, 18 Aug 2009 17:31:19 +0800 >> >> Wu Fengguang <fengguang.wu@intel.com> wrote: >> >> >> >> > On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote: >> >> > > On Tue, 18 Aug 2009 10:34:38 +0800 >> >> > > Wu Fengguang <fengguang.wu@intel.com> wrote: >> >> > > >> >> > > > Minchan, >> >> > > > >> >> > > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote: >> >> > > > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: >> >> > > > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote: >> >> > > > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote: >> >> > > > > >> > Wu Fengguang wrote: >> >> > > > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote: >> >> > > > > >> > >> Side question - >> >> > > > > >> > >> Is there a good reason for this to be in shrink_active_list() >> >> > > > > >> > >> as opposed to __isolate_lru_page? >> >> > > > > >> > >> >> >> > > > > >> > >> if (unlikely(!page_evictable(page, NULL))) { >> >> > > > > >> > >> putback_lru_page(page); >> >> > > > > >> > >> continue; >> >> > > > > >> > >> } >> >> > > > > >> > >> >> >> > > > > >> > >> Maybe we want to minimize the amount of code under the lru lock or >> >> > > > > >> > >> avoid duplicate logic in the isolate_page functions. >> >> > > > > >> > > >> >> > > > > >> > > I guess the quick test means to avoid the expensive page_referenced() >> >> > > > > >> > > call that follows it. But that should be mostly one shot cost - the >> >> > > > > >> > > unevictable pages are unlikely to cycle in active/inactive list again >> >> > > > > >> > > and again. >> >> > > > > >> > >> >> > > > > >> > Please read what putback_lru_page does. >> >> > > > > >> > >> >> > > > > >> > It moves the page onto the unevictable list, so that >> >> > > > > >> > it will not end up in this scan again. >> >> > > > > >> >> >> > > > > >> Yes it does. I said 'mostly' because there is a small hole that an >> >> > > > > >> unevictable page may be scanned but still not moved to unevictable >> >> > > > > >> list: when a page is mapped in two places, the first pte has the >> >> > > > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then >> >> > > > > >> page_referenced() will return 1 and shrink_page_list() will move it >> >> > > > > >> into active list instead of unevictable list. Shall we fix this rare >> >> > > > > >> case? >> >> > > > > >> >> > > > > I think it's not a big deal. >> >> > > > >> >> > > > Maybe, otherwise I should bring up this issue long time before :) >> >> > > > >> >> > > > > As you mentioned, it's rare case so there would be few pages in active >> >> > > > > list instead of unevictable list. >> >> > > > >> >> > > > Yes. 
>> >> > > > >> >> > > > > When next time to scan comes, we can try to move the pages into >> >> > > > > unevictable list, again. >> >> > > > >> >> > > > Will PG_mlocked be set by then? Otherwise the situation is not likely >> >> > > > to change and the VM_LOCKED pages may circulate in active/inactive >> >> > > > list for countless times. >> >> > > >> >> > > PG_mlocked is not important in that case. >> >> > > Important thing is VM_LOCKED vma. >> >> > > I think below annotaion can help you to understand my point. :) >> >> > >> >> > Hmm, it looks like pages under VM_LOCKED vma is guaranteed to have >> >> > PG_mlocked set, and so will be caught by page_evictable(). Is it? >> >> >> >> No. I am sorry for making my point not clear. >> >> I meant following as. >> >> When the next time to scan, >> >> >> >> shrink_page_list >> > -> >> > referenced = page_referenced(page, 1, >> > sc->mem_cgroup, &vm_flags); >> > /* In active use or really unfreeable? Activate it. */ >> > if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && >> > referenced && page_mapping_inuse(page)) >> > goto activate_locked; >> > >> >> -> try_to_unmap >> > ~~~~~~~~~~~~ this line won't be reached if page is found to be >> > referenced in the above lines? >> >> Indeed! In fact, I was worry about that. >> It looks after live lock problem. >> But I think it's very small race window so there isn't any report until now. >> Let's Cced Lee. >> >> If we have to fix it, how about this ? >> This version has small overhead than yours since >> there is less shrink_page_list call than page_referenced. > > Yeah, it looks better. However I still wonder if (VM_LOCKED && !PG_mlocked) > is possible and somehow persistent. Does anyone have the answer? Thanks! I think it's possible. munlock_vma_page pre-clears PG_mlocked of page. And then if isolate_lru_page fail, the page have no PG_mlocked and vma which have VM_LOCKED. As munlock_vma_page's annotation said, we hope the page will be rescued by try_to_unmap. But As you pointed out, if the page have PG_referenced, it can't reach try_to_unmap so that it will go into the active list. What are others' opinion ? > Thanks, > Fengguang -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
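The "rescue" Minchan is relying on hinges on an mmap_sem trylock in mm/rmap.c. Sketched from the 2.6.30-era source (the helper is the try_to_mlock_page step in his chain above; the exact call site within the try_to_unmap_*() walkers moved around between versions):

        static int try_to_mlock_page(struct page *page,
                                     struct vm_area_struct *vma)
        {
                int mlocked = 0;

                /*
                 * Only a trylock: if mmap_sem is contended we punt, and
                 * the page keeps circulating until a later scan wins it.
                 */
                if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
                        if (vma->vm_flags & VM_LOCKED) {
                                mlock_vma_page(page);   /* sets PG_mlocked */
                                mlocked++;
                        }
                        up_read(&vma->vm_mm->mmap_sem);
                }
                return mlocked;
        }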
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-18 11:11 ` Wu Fengguang 2009-08-18 14:03 ` Minchan Kim @ 2009-08-18 16:27 ` KOSAKI Motohiro 1 sibling, 0 replies; 122+ messages in thread From: KOSAKI Motohiro @ 2009-08-18 16:27 UTC (permalink / raw) To: Wu Fengguang Cc: kosaki.motohiro, Minchan Kim, Lee Schermerhorn, Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm > On Tue, Aug 18, 2009 at 07:00:48PM +0800, Minchan Kim wrote: > > On Tue, Aug 18, 2009 at 7:00 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: > > > On Tue, Aug 18, 2009 at 05:52:47PM +0800, Minchan Kim wrote: > > >> On Tue, 18 Aug 2009 17:31:19 +0800 > > >> Wu Fengguang <fengguang.wu@intel.com> wrote: > > >> > > >> > On Tue, Aug 18, 2009 at 12:17:34PM +0800, Minchan Kim wrote: > > >> > > On Tue, 18 Aug 2009 10:34:38 +0800 > > >> > > Wu Fengguang <fengguang.wu@intel.com> wrote: > > >> > > > > >> > > > Minchan, > > >> > > > > > >> > > > On Mon, Aug 17, 2009 at 10:33:54PM +0800, Minchan Kim wrote: > > >> > > > > On Sun, Aug 16, 2009 at 8:29 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: > > >> > > > > > On Sun, Aug 16, 2009 at 01:15:02PM +0800, Wu Fengguang wrote: > > >> > > > > >> On Sun, Aug 16, 2009 at 11:53:00AM +0800, Rik van Riel wrote: > > >> > > > > >> > Wu Fengguang wrote: > > >> > > > > >> > > On Fri, Aug 07, 2009 at 05:09:55AM +0800, Jeff Dike wrote: > > >> > > > > >> > >> Side question - > > >> > > > > >> > >> A Is there a good reason for this to be in shrink_active_list() > > >> > > > > >> > >> as opposed to __isolate_lru_page? > > >> > > > > >> > >> > > >> > > > > >> > >> A A A A A if (unlikely(!page_evictable(page, NULL))) { > > >> > > > > >> > >> A A A A A A A A A putback_lru_page(page); > > >> > > > > >> > >> A A A A A A A A A continue; > > >> > > > > >> > >> A A A A A } > > >> > > > > >> > >> > > >> > > > > >> > >> Maybe we want to minimize the amount of code under the lru lock or > > >> > > > > >> > >> avoid duplicate logic in the isolate_page functions. > > >> > > > > >> > > > > >> > > > > >> > > I guess the quick test means to avoid the expensive page_referenced() > > >> > > > > >> > > call that follows it. But that should be mostly one shot cost - the > > >> > > > > >> > > unevictable pages are unlikely to cycle in active/inactive list again > > >> > > > > >> > > and again. > > >> > > > > >> > > > >> > > > > >> > Please read what putback_lru_page does. > > >> > > > > >> > > > >> > > > > >> > It moves the page onto the unevictable list, so that > > >> > > > > >> > it will not end up in this scan again. > > >> > > > > >> > > >> > > > > >> Yes it does. I said 'mostly' because there is a small hole that an > > >> > > > > >> unevictable page may be scanned but still not moved to unevictable > > >> > > > > >> list: when a page is mapped in two places, the first pte has the > > >> > > > > >> referenced bit set, the _second_ VMA has VM_LOCKED bit set, then > > >> > > > > >> page_referenced() will return 1 and shrink_page_list() will move it > > >> > > > > >> into active list instead of unevictable list. Shall we fix this rare > > >> > > > > >> case? > > >> > > > > > > >> > > > > I think it's not a big deal. > > >> > > > > > >> > > > Maybe, otherwise I should bring up this issue long time before :) > > >> > > > > > >> > > > > As you mentioned, it's rare case so there would be few pages in active > > >> > > > > list instead of unevictable list. > > >> > > > > > >> > > > Yes. 
> > >> > > > > > >> > > > > When next time to scan comes, we can try to move the pages into > > >> > > > > unevictable list, again. > > >> > > > > > >> > > > Will PG_mlocked be set by then? Otherwise the situation is not likely > > >> > > > to change and the VM_LOCKED pages may circulate in active/inactive > > >> > > > list for countless times. > > >> > > > > >> > > PG_mlocked is not important in that case. > > >> > > Important thing is VM_LOCKED vma. > > >> > > I think below annotaion can help you to understand my point. :) > > >> > > > >> > Hmm, it looks like pages under VM_LOCKED vma is guaranteed to have > > >> > PG_mlocked set, and so will be caught by page_evictable(). Is it? > > >> > > >> No. I am sorry for making my point not clear. > > >> I meant following as. > > >> When the next time to scan, > > >> > > >> shrink_page_list > > > A -> > > > A A A A A A A A referenced = page_referenced(page, 1, > > > A A A A A A A A A A A A A A A A A A A A A A A A sc->mem_cgroup, &vm_flags); > > > A A A A A A A A /* In active use or really unfreeable? A Activate it. */ > > > A A A A A A A A if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && > > > A A A A A A A A A A A A A A A A A A A A referenced && page_mapping_inuse(page)) > > > A A A A A A A A A A A A goto activate_locked; > > > > > >> -> try_to_unmap > > > A A ~~~~~~~~~~~~ this line won't be reached if page is found to be > > > A A referenced in the above lines? > > > > Indeed! In fact, I was worry about that. > > It looks after live lock problem. > > But I think it's very small race window so there isn't any report until now. > > Let's Cced Lee. > > > > If we have to fix it, how about this ? > > This version has small overhead than yours since > > there is less shrink_page_list call than page_referenced. > > Yeah, it looks better. However I still wonder if (VM_LOCKED && !PG_mlocked) > is possible and somehow persistent. Does anyone have the answer? Thanks! hehe, that's bug. you spotted very good thing IMHO ;) I posted fixed patch. can you see it? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-16 11:29 ` Wu Fengguang 2009-08-17 14:33 ` Minchan Kim @ 2009-08-18 15:57 ` KOSAKI Motohiro 2009-08-19 12:01 ` Wu Fengguang 1 sibling, 1 reply; 122+ messages in thread From: KOSAKI Motohiro @ 2009-08-18 15:57 UTC (permalink / raw) To: Wu Fengguang Cc: kosaki.motohiro, Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm > > Yes it does. I said 'mostly' because there is a small hole that an > > unevictable page may be scanned but still not moved to unevictable > > list: when a page is mapped in two places, the first pte has the > > referenced bit set, the _second_ VMA has VM_LOCKED bit set, then > > page_referenced() will return 1 and shrink_page_list() will move it > > into active list instead of unevictable list. Shall we fix this rare > > case? > > How about this fix? Good spotting. Yes, this is rare case. but I also don't think your patch introduce performance degression. However, I think your patch have one bug. > > --- > mm: stop circulating of referenced mlocked pages > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > --- > > --- linux.orig/mm/rmap.c 2009-08-16 19:11:13.000000000 +0800 > +++ linux/mm/rmap.c 2009-08-16 19:22:46.000000000 +0800 > @@ -358,6 +358,7 @@ static int page_referenced_one(struct pa > */ > if (vma->vm_flags & VM_LOCKED) { > *mapcount = 1; /* break early from loop */ > + *vm_flags |= VM_LOCKED; > goto out_unmap; > } > > @@ -482,6 +483,8 @@ static int page_referenced_file(struct p > } > > spin_unlock(&mapping->i_mmap_lock); > + if (*vm_flags & VM_LOCKED) > + referenced = 0; > return referenced; > } > page_referenced_file? I think we should change page_referenced(). Instead, How about this? ============================================== Subject: [PATCH] mm: stop circulating of referenced mlocked pages Currently, mlock() systemcall doesn't gurantee to mark the page PG_Mlocked because some race prevent page grabbing. In that case, instead vmscan move the page to unevictable lru. However, Recently Wu Fengguang pointed out current vmscan logic isn't so efficient. mlocked page can move circulatly active and inactive list because vmscan check the page is referenced _before_ cull mlocked page. Plus, vmscan should mark PG_Mlocked when cull mlocked page. Otherwise vm stastics show strange number. This patch does that. Reported-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --- mm/internal.h | 5 +++-- mm/rmap.c | 8 +++++++- mm/vmscan.c | 2 +- 3 files changed, 11 insertions(+), 4 deletions(-) Index: b/mm/internal.h =================================================================== --- a/mm/internal.h 2009-06-26 21:06:43.000000000 +0900 +++ b/mm/internal.h 2009-08-18 23:31:11.000000000 +0900 @@ -91,7 +91,8 @@ static inline void unevictable_migrate_p * to determine if it's being mapped into a LOCKED vma. * If so, mark page as mlocked. 
*/ -static inline int is_mlocked_vma(struct vm_area_struct *vma, struct page *page) +static inline int try_set_page_mlocked(struct vm_area_struct *vma, + struct page *page) { VM_BUG_ON(PageLRU(page)); @@ -144,7 +145,7 @@ static inline void mlock_migrate_page(st } #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */ -static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p) +static inline int try_set_page_mlocked(struct vm_area_struct *v, struct page *p) { return 0; } Index: b/mm/rmap.c =================================================================== --- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900 +++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900 @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa * unevictable list. */ if (vma->vm_flags & VM_LOCKED) { - *mapcount = 1; /* break early from loop */ + *mapcount = 1; /* break early from loop */ + *vm_flags |= VM_LOCKED; /* for prevent to move active list */ + try_set_page_mlocked(vma, page); goto out_unmap; } @@ -531,6 +533,9 @@ int page_referenced(struct page *page, if (page_test_and_clear_young(page)) referenced++; + if (unlikely(*vm_flags & VM_LOCKED)) + referenced = 0; + return referenced; } @@ -784,6 +789,7 @@ static int try_to_unmap_one(struct page */ if (!(flags & TTU_IGNORE_MLOCK)) { if (vma->vm_flags & VM_LOCKED) { + try_set_page_mlocked(vma, page); ret = SWAP_MLOCK; goto out_unmap; } Index: b/mm/vmscan.c =================================================================== --- a/mm/vmscan.c 2009-08-18 19:48:14.000000000 +0900 +++ b/mm/vmscan.c 2009-08-18 23:30:51.000000000 +0900 @@ -2666,7 +2666,7 @@ int page_evictable(struct page *page, st if (mapping_unevictable(page_mapping(page))) return 0; - if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page))) + if (PageMlocked(page) || (vma && try_set_page_mlocked(vma, page))) return 0; return 1; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
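For context, the helper KOSAKI renames is not a pure predicate: is_mlocked_vma() in mm/internal.h already sets PG_mlocked as a side effect. Roughly, from the 2.6.30-era source (the VM_SPECIAL masking is from memory):

        static inline int is_mlocked_vma(struct vm_area_struct *vma,
                                         struct page *page)
        {
                VM_BUG_ON(PageLRU(page));

                if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) !=
                                                                VM_LOCKED))
                        return 0;       /* not an mlock()ed mapping */

                if (!TestSetPageMlocked(page)) {
                        inc_zone_page_state(page, NR_MLOCK);
                        count_vm_event(UNEVICTABLE_PGMLOCKED);
                }
                return 1;
        }

which is why the rename to try_set_page_mlocked() arguably describes the existing semantics better than the old name did.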
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-18 15:57 ` KOSAKI Motohiro @ 2009-08-19 12:01 ` Wu Fengguang 2009-08-19 12:05 ` KOSAKI Motohiro 0 siblings, 1 reply; 122+ messages in thread From: Wu Fengguang @ 2009-08-19 12:01 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On Tue, Aug 18, 2009 at 11:57:54PM +0800, KOSAKI Motohiro wrote: > > > Yes it does. I said 'mostly' because there is a small hole that an > > > unevictable page may be scanned but still not moved to unevictable > > > list: when a page is mapped in two places, the first pte has the > > > referenced bit set, the _second_ VMA has VM_LOCKED bit set, then > > > page_referenced() will return 1 and shrink_page_list() will move it > > > into active list instead of unevictable list. Shall we fix this rare > > > case? > > > > How about this fix? > > Good spotting. > Yes, this is rare case. but I also don't think your patch introduce > performance degression. Thanks. > However, I think your patch have one bug. Hehe, sorry for being careless :) > > > > --- > > mm: stop circulating of referenced mlocked pages > > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > > --- > > > > --- linux.orig/mm/rmap.c 2009-08-16 19:11:13.000000000 +0800 > > +++ linux/mm/rmap.c 2009-08-16 19:22:46.000000000 +0800 > > @@ -358,6 +358,7 @@ static int page_referenced_one(struct pa > > */ > > if (vma->vm_flags & VM_LOCKED) { > > *mapcount = 1; /* break early from loop */ > > + *vm_flags |= VM_LOCKED; > > goto out_unmap; > > } > > > > @@ -482,6 +483,8 @@ static int page_referenced_file(struct p > > } > > > > spin_unlock(&mapping->i_mmap_lock); > > + if (*vm_flags & VM_LOCKED) > > + referenced = 0; > > return referenced; > > } > > > > page_referenced_file? > I think we should change page_referenced(). Yeah, good catch. > > Instead, How about this? > ============================================== > > Subject: [PATCH] mm: stop circulating of referenced mlocked pages > > Currently, mlock() systemcall doesn't gurantee to mark the page PG_Mlocked mark PG_mlocked > because some race prevent page grabbing. > In that case, instead vmscan move the page to unevictable lru. > > However, Recently Wu Fengguang pointed out current vmscan logic isn't so > efficient. > mlocked page can move circulatly active and inactive list because > vmscan check the page is referenced _before_ cull mlocked page. > > Plus, vmscan should mark PG_Mlocked when cull mlocked page. PG_mlocked > Otherwise vm stastics show strange number. > > This patch does that. Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> > Reported-by: Wu Fengguang <fengguang.wu@intel.com> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> > --- > mm/internal.h | 5 +++-- > mm/rmap.c | 8 +++++++- > mm/vmscan.c | 2 +- > 3 files changed, 11 insertions(+), 4 deletions(-) > > Index: b/mm/internal.h > =================================================================== > --- a/mm/internal.h 2009-06-26 21:06:43.000000000 +0900 > +++ b/mm/internal.h 2009-08-18 23:31:11.000000000 +0900 > @@ -91,7 +91,8 @@ static inline void unevictable_migrate_p > * to determine if it's being mapped into a LOCKED vma. > * If so, mark page as mlocked. 
> */ > -static inline int is_mlocked_vma(struct vm_area_struct *vma, struct page *page) > +static inline int try_set_page_mlocked(struct vm_area_struct *vma, > + struct page *page) > { > VM_BUG_ON(PageLRU(page)); > > @@ -144,7 +145,7 @@ static inline void mlock_migrate_page(st > } > > #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */ > -static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p) > +static inline int try_set_page_mlocked(struct vm_area_struct *v, struct page *p) > { > return 0; > } > Index: b/mm/rmap.c > =================================================================== > --- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900 > +++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900 > @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa > * unevictable list. > */ > if (vma->vm_flags & VM_LOCKED) { > - *mapcount = 1; /* break early from loop */ > + *mapcount = 1; /* break early from loop */ > + *vm_flags |= VM_LOCKED; /* for prevent to move active list */ > + try_set_page_mlocked(vma, page); That call is not absolutely necessary? Thanks, Fengguang > goto out_unmap; > } > > @@ -531,6 +533,9 @@ int page_referenced(struct page *page, > if (page_test_and_clear_young(page)) > referenced++; > > + if (unlikely(*vm_flags & VM_LOCKED)) > + referenced = 0; > + > return referenced; > } > > @@ -784,6 +789,7 @@ static int try_to_unmap_one(struct page > */ > if (!(flags & TTU_IGNORE_MLOCK)) { > if (vma->vm_flags & VM_LOCKED) { > + try_set_page_mlocked(vma, page); > ret = SWAP_MLOCK; > goto out_unmap; > } > Index: b/mm/vmscan.c > =================================================================== > --- a/mm/vmscan.c 2009-08-18 19:48:14.000000000 +0900 > +++ b/mm/vmscan.c 2009-08-18 23:30:51.000000000 +0900 > @@ -2666,7 +2666,7 @@ int page_evictable(struct page *page, st > if (mapping_unevictable(page_mapping(page))) > return 0; > > - if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page))) > + if (PageMlocked(page) || (vma && try_set_page_mlocked(vma, page))) > return 0; > > return 1; > > > > > > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
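Putting KOSAKI's patch and Wu's review comment together, the intended end-to-end behavior for a page mapped by a VM_LOCKED vma would be (a hand-drawn trace of the patched flow, not code from any tree):

        /*
         * shrink_page_list()
         *   page_referenced()       -> *vm_flags |= VM_LOCKED,
         *                              returns 0 (forced by the new check)
         *   referenced == 0         -> no goto activate_locked
         *   try_to_unmap()
         *     try_to_unmap_one()    -> try_set_page_mlocked() sets
         *                              PG_mlocked, returns SWAP_MLOCK
         *   cull_mlocked:
         *     putback_lru_page()    -> page lands on the unevictable list
         */

Wu's objection is that the extra try_set_page_mlocked() in page_referenced_one() adds nothing to this picture: the try_to_unmap_one() call site already sets the bit one step later.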
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-19 12:01 ` Wu Fengguang @ 2009-08-19 12:05 ` KOSAKI Motohiro 2009-08-19 12:10 ` Wu Fengguang 0 siblings, 1 reply; 122+ messages in thread From: KOSAKI Motohiro @ 2009-08-19 12:05 UTC (permalink / raw) To: Wu Fengguang Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm >> page_referenced_file? >> I think we should change page_referenced(). > > Yeah, good catch. > >> >> Instead, How about this? >> ============================================== >> >> Subject: [PATCH] mm: stop circulating of referenced mlocked pages >> >> Currently, mlock() systemcall doesn't gurantee to mark the page PG_Mlocked > > mark PG_mlocked > >> because some race prevent page grabbing. >> In that case, instead vmscan move the page to unevictable lru. >> >> However, Recently Wu Fengguang pointed out current vmscan logic isn't so >> efficient. >> mlocked page can move circulatly active and inactive list because >> vmscan check the page is referenced _before_ cull mlocked page. >> >> Plus, vmscan should mark PG_Mlocked when cull mlocked page. > > PG_mlocked > >> Otherwise vm stastics show strange number. >> >> This patch does that. > > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Thanks. >> Index: b/mm/rmap.c >> =================================================================== >> --- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900 >> +++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900 >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa >> * unevictable list. >> */ >> if (vma->vm_flags & VM_LOCKED) { >> - *mapcount = 1; /* break early from loop */ >> + *mapcount = 1; /* break early from loop */ >> + *vm_flags |= VM_LOCKED; /* for prevent to move active list */ > >> + try_set_page_mlocked(vma, page); > > That call is not absolutely necessary? Why? I haven't catch your point. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-19 12:05 ` KOSAKI Motohiro @ 2009-08-19 12:10 ` Wu Fengguang 2009-08-19 12:25 ` Minchan Kim 0 siblings, 1 reply; 122+ messages in thread From: Wu Fengguang @ 2009-08-19 12:10 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote: > >> page_referenced_file? > >> I think we should change page_referenced(). > > > > Yeah, good catch. > > > >> > >> Instead, How about this? > >> ============================================== > >> > >> Subject: [PATCH] mm: stop circulating of referenced mlocked pages > >> > >> Currently, mlock() systemcall doesn't gurantee to mark the page PG_Mlocked > > > > mark PG_mlocked > > > >> because some race prevent page grabbing. > >> In that case, instead vmscan move the page to unevictable lru. > >> > >> However, Recently Wu Fengguang pointed out current vmscan logic isn't so > >> efficient. > >> mlocked page can move circulatly active and inactive list because > >> vmscan check the page is referenced _before_ cull mlocked page. > >> > >> Plus, vmscan should mark PG_Mlocked when cull mlocked page. > > > > PG_mlocked > > > >> Otherwise vm stastics show strange number. > >> > >> This patch does that. > > > > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> > > Thanks. > > > > >> Index: b/mm/rmap.c > >> =================================================================== > >> --- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900 > >> +++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900 > >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa > >> * unevictable list. > >> */ > >> if (vma->vm_flags & VM_LOCKED) { > >> - *mapcount = 1; /* break early from loop */ > >> + *mapcount = 1; /* break early from loop */ > >> + *vm_flags |= VM_LOCKED; /* for prevent to move active list */ > > > >> + try_set_page_mlocked(vma, page); > > > > That call is not absolutely necessary? > > Why? I haven't catch your point. Because we'll eventually hit another try_set_page_mlocked() when trying to unmap the page. Ie. duplicated with another call you added in this patch. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-19 12:10 ` Wu Fengguang @ 2009-08-19 12:25 ` Minchan Kim 2009-08-19 13:19 ` KOSAKI Motohiro 2009-08-19 13:24 ` Wu Fengguang 0 siblings, 2 replies; 122+ messages in thread From: Minchan Kim @ 2009-08-19 12:25 UTC (permalink / raw) To: Wu Fengguang Cc: KOSAKI Motohiro, Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On Wed, Aug 19, 2009 at 9:10 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: > On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote: >> >> page_referenced_file? >> >> I think we should change page_referenced(). >> > >> > Yeah, good catch. >> > >> >> >> >> Instead, How about this? >> >> ============================================== >> >> >> >> Subject: [PATCH] mm: stop circulating of referenced mlocked pages >> >> >> >> Currently, mlock() systemcall doesn't gurantee to mark the page PG_Mlocked >> > >> > mark PG_mlocked >> > >> >> because some race prevent page grabbing. >> >> In that case, instead vmscan move the page to unevictable lru. >> >> >> >> However, Recently Wu Fengguang pointed out current vmscan logic isn't so >> >> efficient. >> >> mlocked page can move circulatly active and inactive list because >> >> vmscan check the page is referenced _before_ cull mlocked page. >> >> >> >> Plus, vmscan should mark PG_Mlocked when cull mlocked page. >> > >> > PG_mlocked >> > >> >> Otherwise vm stastics show strange number. >> >> >> >> This patch does that. >> > >> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> >> >> Thanks. >> >> >> >> >> Index: b/mm/rmap.c >> >> =================================================================== >> >> --- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900 >> >> +++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900 >> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa >> >> * unevictable list. >> >> */ >> >> if (vma->vm_flags & VM_LOCKED) { >> >> - *mapcount = 1; /* break early from loop */ >> >> + *mapcount = 1; /* break early from loop */ >> >> + *vm_flags |= VM_LOCKED; /* for prevent to move active list */ >> > >> >> + try_set_page_mlocked(vma, page); >> > >> > That call is not absolutely necessary? >> >> Why? I haven't catch your point. > > Because we'll eventually hit another try_set_page_mlocked() when > trying to unmap the page. Ie. duplicated with another call you added > in this patch. Yes. we don't have to call it and we can make patch simple. I already sent patch on yesterday. http://marc.info/?l=linux-mm&m=125059325722370&w=2 I think It's more simple than KOSAKI's idea. Is any problem in my patch ? > > Thanks, > Fengguang > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-19 12:25 ` Minchan Kim @ 2009-08-19 13:19 ` KOSAKI Motohiro 2009-08-19 13:28 ` Minchan Kim 2009-08-19 13:24 ` Wu Fengguang 1 sibling, 1 reply; 122+ messages in thread From: KOSAKI Motohiro @ 2009-08-19 13:19 UTC (permalink / raw) To: Minchan Kim Cc: Wu Fengguang, Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm 2009/8/19 Minchan Kim <minchan.kim@gmail.com>: > On Wed, Aug 19, 2009 at 9:10 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: >> On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote: >>> >> page_referenced_file? >>> >> I think we should change page_referenced(). >>> > >>> > Yeah, good catch. >>> > >>> >> >>> >> Instead, how about this? >>> >> ============================================== >>> >> >>> >> Subject: [PATCH] mm: stop circulation of referenced mlocked pages >>> >> >>> >> Currently, the mlock() system call doesn't guarantee to mark the page PG_Mlocked >>> > >>> > mark PG_mlocked >>> > >>> >> because some race prevents page grabbing. >>> >> In that case, vmscan instead moves the page to the unevictable lru. >>> >> >>> >> However, recently Wu Fengguang pointed out that the current vmscan logic isn't so >>> >> efficient: >>> >> an mlocked page can circulate between the active and inactive lists because >>> >> vmscan checks whether the page is referenced _before_ culling mlocked pages. >>> >> >>> >> Plus, vmscan should mark PG_Mlocked when culling an mlocked page. >>> > >>> > PG_mlocked >>> > >>> >> Otherwise vm statistics show strange numbers. >>> >> >>> >> This patch does that. >>> > >>> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> >>> >>> Thanks. >>> >>> >>> >>> >> Index: b/mm/rmap.c >>> >> =================================================================== >>> >> --- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900 >>> >> +++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900 >>> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa >>> >> * unevictable list. >>> >> */ >>> >> if (vma->vm_flags & VM_LOCKED) { >>> >> - *mapcount = 1; /* break early from loop */ >>> >> + *mapcount = 1; /* break early from loop */ >>> >> + *vm_flags |= VM_LOCKED; /* prevent moving to the active list */ >>> > >>> >> + try_set_page_mlocked(vma, page); >>> > >>> > That call is not absolutely necessary? >>> >>> Why? I haven't caught your point. >> >> Because we'll eventually hit another try_set_page_mlocked() when >> trying to unmap the page. I.e., it is duplicated with the other call >> you added in this patch. Correct. > Yes, we don't have to call it, and we can make the patch simpler. > I already sent a patch yesterday. > > http://marc.info/?l=linux-mm&m=125059325722370&w=2 > > I think it's simpler than KOSAKI's idea. > Is there any problem with my patch? Hmm, I think 1. Anyway, we need to turn on PG_mlocked. 2. PG_mlocked prevents livelock, because the page_evictable() check is called very early in shrink_page_list(). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-19 13:19 ` KOSAKI Motohiro @ 2009-08-19 13:28 ` Minchan Kim 2009-08-21 11:17 ` KOSAKI Motohiro 0 siblings, 1 reply; 122+ messages in thread From: Minchan Kim @ 2009-08-19 13:28 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Wu Fengguang, Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On Wed, Aug 19, 2009 at 10:19 PM, KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com> wrote: > 2009/8/19 Minchan Kim <minchan.kim@gmail.com>: >> On Wed, Aug 19, 2009 at 9:10 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: >>> On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote: >>>> >> page_referenced_file? >>>> >> I think we should change page_referenced(). >>>> > >>>> > Yeah, good catch. >>>> > >>>> >> >>>> >> Instead, how about this? >>>> >> ============================================== >>>> >> >>>> >> Subject: [PATCH] mm: stop circulation of referenced mlocked pages >>>> >> >>>> >> Currently, the mlock() system call doesn't guarantee to mark the page PG_Mlocked >>>> > >>>> > mark PG_mlocked >>>> > >>>> >> because some race prevents page grabbing. >>>> >> In that case, vmscan instead moves the page to the unevictable lru. >>>> >> >>>> >> However, recently Wu Fengguang pointed out that the current vmscan logic isn't so >>>> >> efficient: >>>> >> an mlocked page can circulate between the active and inactive lists because >>>> >> vmscan checks whether the page is referenced _before_ culling mlocked pages. >>>> >> >>>> >> Plus, vmscan should mark PG_Mlocked when culling an mlocked page. >>>> > >>>> > PG_mlocked >>>> > >>>> >> Otherwise vm statistics show strange numbers. >>>> >> >>>> >> This patch does that. >>>> > >>>> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> >>>> >>>> Thanks. >>>> >>>> >>>> >>>> >> Index: b/mm/rmap.c >>>> >> =================================================================== >>>> >> --- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900 >>>> >> +++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900 >>>> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa >>>> >> * unevictable list. >>>> >> */ >>>> >> if (vma->vm_flags & VM_LOCKED) { >>>> >> - *mapcount = 1; /* break early from loop */ >>>> >> + *mapcount = 1; /* break early from loop */ >>>> >> + *vm_flags |= VM_LOCKED; /* prevent moving to the active list */ >>>> > >>>> >> + try_set_page_mlocked(vma, page); >>>> > >>>> > That call is not absolutely necessary? >>>> >>>> Why? I haven't caught your point. >>> >>> Because we'll eventually hit another try_set_page_mlocked() when >>> trying to unmap the page. I.e., it is duplicated with the other call >>> you added in this patch. > > Correct. > > >> Yes, we don't have to call it, and we can make the patch simpler. >> I already sent a patch yesterday. >> >> http://marc.info/?l=linux-mm&m=125059325722370&w=2 >> >> I think it's simpler than KOSAKI's idea. >> Is there any problem with my patch? > > Hmm, I think > > 1. Anyway, we need to turn on PG_mlocked. I'm adding my patch again to explain.
diff --git a/mm/rmap.c b/mm/rmap.c
index ed63894..283266c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -358,6 +358,7 @@ static int page_referenced_one(struct page *page,
 	 */
 	if (vma->vm_flags & VM_LOCKED) {
 		*mapcount = 1;	/* break early from loop */
+		*vm_flags |= VM_LOCKED;
 		goto out_unmap;
 	}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d224b28..d156e1d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -632,7 +632,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 					sc->mem_cgroup, &vm_flags);
 	/* In active use or really unfreeable? Activate it. */
 	if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
-			referenced && page_mapping_inuse(page))
+			referenced && page_mapping_inuse(page)
+			&& !(vm_flags & VM_LOCKED))
 		goto activate_locked;

With this check, the page can reach try_to_unmap() after page_referenced() in shrink_page_list(). At that time PG_mlocked will be set. > 2. PG_mlocked prevents livelock, because the page_evictable() check is > called very early in shrink_page_list(). -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-19 13:28 ` Minchan Kim @ 2009-08-21 11:17 ` KOSAKI Motohiro 0 siblings, 0 replies; 122+ messages in thread From: KOSAKI Motohiro @ 2009-08-21 11:17 UTC (permalink / raw) To: Minchan Kim Cc: Wu Fengguang, Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm

>> Hmm, I think
>>
>> 1. Anyway, we need to turn on PG_mlocked.
>
> I'm adding my patch again to explain.
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index ed63894..283266c 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -358,6 +358,7 @@ static int page_referenced_one(struct page *page,
>  	 */
>  	if (vma->vm_flags & VM_LOCKED) {
>  		*mapcount = 1;	/* break early from loop */
> +		*vm_flags |= VM_LOCKED;
>  		goto out_unmap;
>  	}
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d224b28..d156e1d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -632,7 +632,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  					sc->mem_cgroup, &vm_flags);
>  	/* In active use or really unfreeable? Activate it. */
>  	if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
> -			referenced && page_mapping_inuse(page))
> +			referenced && page_mapping_inuse(page)
> +			&& !(vm_flags & VM_LOCKED))
>  		goto activate_locked;
>
> With this check, the page can reach try_to_unmap() after
> page_referenced() in shrink_page_list(). At that time PG_mlocked will be
> set.

You are right. Please add my Reviewed-by tag to your patch. Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-19 12:25 ` Minchan Kim 2009-08-19 13:19 ` KOSAKI Motohiro @ 2009-08-19 13:24 ` Wu Fengguang 2009-08-19 13:38 ` Minchan Kim 1 sibling, 1 reply; 122+ messages in thread From: Wu Fengguang @ 2009-08-19 13:24 UTC (permalink / raw) To: Minchan Kim Cc: KOSAKI Motohiro, Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On Wed, Aug 19, 2009 at 08:25:56PM +0800, Minchan Kim wrote: > On Wed, Aug 19, 2009 at 9:10 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: > > On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote: > >> >> page_referenced_file? > >> >> I think we should change page_referenced(). > >> > > >> > Yeah, good catch. > >> > > >> >> > >> >> Instead, how about this? > >> >> ============================================== > >> >> > >> >> Subject: [PATCH] mm: stop circulation of referenced mlocked pages > >> >> > >> >> Currently, the mlock() system call doesn't guarantee to mark the page PG_Mlocked > >> > > >> > mark PG_mlocked > >> > > >> >> because some race prevents page grabbing. > >> >> In that case, vmscan instead moves the page to the unevictable lru. > >> >> > >> >> However, recently Wu Fengguang pointed out that the current vmscan logic isn't so > >> >> efficient: > >> >> an mlocked page can circulate between the active and inactive lists because > >> >> vmscan checks whether the page is referenced _before_ culling mlocked pages. > >> >> > >> >> Plus, vmscan should mark PG_Mlocked when culling an mlocked page. > >> > > >> > PG_mlocked > >> > > >> >> Otherwise vm statistics show strange numbers. > >> >> > >> >> This patch does that. > >> > > >> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> > >> > >> Thanks. > >> > >> > >> > >> >> Index: b/mm/rmap.c > >> >> =================================================================== > >> >> --- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900 > >> >> +++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900 > >> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa > >> >> * unevictable list. > >> >> */ > >> >> if (vma->vm_flags & VM_LOCKED) { > >> >> - *mapcount = 1; /* break early from loop */ > >> >> + *mapcount = 1; /* break early from loop */ > >> >> + *vm_flags |= VM_LOCKED; /* prevent moving to the active list */ > >> > > >> >> + try_set_page_mlocked(vma, page); > >> > > >> > That call is not absolutely necessary? > >> > >> Why? I haven't caught your point. > > > > Because we'll eventually hit another try_set_page_mlocked() when > > trying to unmap the page. I.e., it is duplicated with the other call > > you added in this patch. > > Yes, we don't have to call it, and we can make the patch simpler. > I already sent a patch yesterday. > > http://marc.info/?l=linux-mm&m=125059325722370&w=2 > > I think it's simpler than KOSAKI's idea. > Is there any problem with my patch?
No, IMHO your patch is simple and good, while KOSAKI's is more complete :) - the try_set_page_mlocked() rename is suitable - the call to try_set_page_mlocked() is necessary on try_to_unmap() - the "if (VM_LOCKED) referenced = 0" in page_referenced() could cover both active/inactive vmscan I did like your proposed if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && - referenced && page_mapping_inuse(page)) + referenced && page_mapping_inuse(page) + && !(vm_flags & VM_LOCKED)) goto activate_locked; which looks more intuitive and less confusing. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
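To make the comparison above concrete, here is a toy user-space model of the cull-order change that both patches aim at. This is only a sketch: the struct, helper names and return strings are invented for illustration, and real vmscan code additionally deals with page locking, isolation and writeback.

	#include <stdbool.h>
	#include <stdio.h>

	/* Toy model; all names invented, not kernel code. */
	struct page_state {
		bool referenced;   /* young pte seen by page_referenced() */
		bool vma_locked;   /* page maps into a VM_LOCKED vma */
	};

	/* Old order: the young bit wins, so an mlocked-but-referenced page
	 * bounces between the LRU lists forever without gaining PG_mlocked. */
	static const char *old_cull(const struct page_state *p)
	{
		if (p->referenced)
			return "activate";
		return p->vma_locked ? "unevictable" : "reclaim";
	}

	/* Proposed order: VM_LOCKED overrides the young bit, so the page
	 * falls through to try_to_unmap(), which sets PG_mlocked and culls
	 * it to the unevictable list. */
	static const char *new_cull(const struct page_state *p)
	{
		if (p->referenced && !p->vma_locked)
			return "activate";
		return p->vma_locked ? "unevictable" : "reclaim";
	}

	int main(void)
	{
		const struct page_state p = { .referenced = true, .vma_locked = true };

		printf("old: %s, new: %s\n", old_cull(&p), new_cull(&p));
		return 0;
	}

Under the old order the example page prints "activate" and keeps circulating; under the new order it is culled once and stops burning scan cycles.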
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-19 13:24 ` Wu Fengguang @ 2009-08-19 13:38 ` Minchan Kim 2009-08-19 14:00 ` Wu Fengguang 0 siblings, 1 reply; 122+ messages in thread From: Minchan Kim @ 2009-08-19 13:38 UTC (permalink / raw) To: Wu Fengguang Cc: KOSAKI Motohiro, Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On Wed, Aug 19, 2009 at 10:24 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: > On Wed, Aug 19, 2009 at 08:25:56PM +0800, Minchan Kim wrote: >> On Wed, Aug 19, 2009 at 9:10 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: >> > On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote: >> >> >> page_referenced_file? >> >> >> I think we should change page_referenced(). >> >> > >> >> > Yeah, good catch. >> >> > >> >> >> >> >> >> Instead, how about this? >> >> >> ============================================== >> >> >> >> >> >> Subject: [PATCH] mm: stop circulation of referenced mlocked pages >> >> >> >> >> >> Currently, the mlock() system call doesn't guarantee to mark the page PG_Mlocked >> >> > >> >> > mark PG_mlocked >> >> > >> >> >> because some race prevents page grabbing. >> >> >> In that case, vmscan instead moves the page to the unevictable lru. >> >> >> >> >> >> However, recently Wu Fengguang pointed out that the current vmscan logic isn't so >> >> >> efficient: >> >> >> an mlocked page can circulate between the active and inactive lists because >> >> >> vmscan checks whether the page is referenced _before_ culling mlocked pages. >> >> >> >> >> >> Plus, vmscan should mark PG_Mlocked when culling an mlocked page. >> >> > >> >> > PG_mlocked >> >> > >> >> >> Otherwise vm statistics show strange numbers. >> >> >> >> >> >> This patch does that. >> >> > >> >> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> >> >> >> >> Thanks. >> >> >> >> >> >> >> >> >> Index: b/mm/rmap.c >> >> >> =================================================================== >> >> >> --- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900 >> >> >> +++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900 >> >> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa >> >> >> * unevictable list. >> >> >> */ >> >> >> if (vma->vm_flags & VM_LOCKED) { >> >> >> - *mapcount = 1; /* break early from loop */ >> >> >> + *mapcount = 1; /* break early from loop */ >> >> >> + *vm_flags |= VM_LOCKED; /* prevent moving to the active list */ >> >> > >> >> >> + try_set_page_mlocked(vma, page); >> >> > >> >> > That call is not absolutely necessary? >> >> >> >> Why? I haven't caught your point. >> > >> > Because we'll eventually hit another try_set_page_mlocked() when >> > trying to unmap the page. I.e., it is duplicated with the other call >> > you added in this patch. >> >> Yes, we don't have to call it, and we can make the patch simpler. >> I already sent a patch yesterday. >> >> http://marc.info/?l=linux-mm&m=125059325722370&w=2 >> >> I think it's simpler than KOSAKI's idea. >> Is there any problem with my patch? > > No, IMHO your patch is simple and good, while KOSAKI's is more > complete :) > > - the try_set_page_mlocked() rename is suitable > - the call to try_set_page_mlocked() is necessary on try_to_unmap() We don't need the try_set_page_mlocked() call in try_to_unmap(). That's because try_to_unmap_xxx() will call try_to_mlock_page() if the page is included in any VM_LOCKED vma. Eventually, it can be moved to the unevictable list.
> - the "if (VM_LOCKED) referenced = 0" in page_referenced() could > cover both active/inactive vmscan ASAP we set PG_mlocked in page, we can save unnecessary vmscan cost from active list to inactive list. But I think it's rare case so that there would be few pages. So I think that will be not big overhead. As I know, Rescue by vmscan page losing the isolation race was the Lee's design. But as you pointed out, it have a bug that vmscan can't rescue the page due to reach try_to_unmap. So I think this approach is proper. :) > I did like your proposed > > if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && > - referenced && page_mapping_inuse(page)) > + referenced && page_mapping_inuse(page) > + && !(vm_flags & VM_LOCKED)) > goto activate_locked; > > which looks more intuitive and less confusing. > > Thanks, > Fengguang > -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-19 13:38 ` Minchan Kim @ 2009-08-19 14:00 ` Wu Fengguang 0 siblings, 0 replies; 122+ messages in thread From: Wu Fengguang @ 2009-08-19 14:00 UTC (permalink / raw) To: Minchan Kim Cc: KOSAKI Motohiro, Rik van Riel, Jeff Dike, Avi Kivity, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On Wed, Aug 19, 2009 at 09:38:05PM +0800, Minchan Kim wrote: > On Wed, Aug 19, 2009 at 10:24 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: > > On Wed, Aug 19, 2009 at 08:25:56PM +0800, Minchan Kim wrote: > >> On Wed, Aug 19, 2009 at 9:10 PM, Wu Fengguang<fengguang.wu@intel.com> wrote: > >> > On Wed, Aug 19, 2009 at 08:05:19PM +0800, KOSAKI Motohiro wrote: > >> >> >> page_referenced_file? > >> >> >> I think we should change page_referenced(). > >> >> > > >> >> > Yeah, good catch. > >> >> > > >> >> >> > >> >> >> Instead, how about this? > >> >> >> ============================================== > >> >> >> > >> >> >> Subject: [PATCH] mm: stop circulation of referenced mlocked pages > >> >> >> > >> >> >> Currently, the mlock() system call doesn't guarantee to mark the page PG_Mlocked > >> >> > > >> >> > mark PG_mlocked > >> >> > > >> >> >> because some race prevents page grabbing. > >> >> >> In that case, vmscan instead moves the page to the unevictable lru. > >> >> >> > >> >> >> However, recently Wu Fengguang pointed out that the current vmscan logic isn't so > >> >> >> efficient: > >> >> >> an mlocked page can circulate between the active and inactive lists because > >> >> >> vmscan checks whether the page is referenced _before_ culling mlocked pages. > >> >> >> > >> >> >> Plus, vmscan should mark PG_Mlocked when culling an mlocked page. > >> >> > > >> >> > PG_mlocked > >> >> > > >> >> >> Otherwise vm statistics show strange numbers. > >> >> >> > >> >> >> This patch does that. > >> >> > > >> >> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> > >> >> > >> >> Thanks. > >> >> > >> >> > >> >> > >> >> >> Index: b/mm/rmap.c > >> >> >> =================================================================== > >> >> >> --- a/mm/rmap.c 2009-08-18 19:48:14.000000000 +0900 > >> >> >> +++ b/mm/rmap.c 2009-08-18 23:47:34.000000000 +0900 > >> >> >> @@ -362,7 +362,9 @@ static int page_referenced_one(struct pa > >> >> >> * unevictable list. > >> >> >> */ > >> >> >> if (vma->vm_flags & VM_LOCKED) { > >> >> >> - *mapcount = 1; /* break early from loop */ > >> >> >> + *mapcount = 1; /* break early from loop */ > >> >> >> + *vm_flags |= VM_LOCKED; /* prevent moving to the active list */ > >> >> > > >> >> >> + try_set_page_mlocked(vma, page); > >> >> > > >> >> > That call is not absolutely necessary? > >> >> > >> >> Why? I haven't caught your point. > >> > > >> > Because we'll eventually hit another try_set_page_mlocked() when > >> > trying to unmap the page. I.e., it is duplicated with the other call > >> > you added in this patch. > >> > >> Yes, we don't have to call it, and we can make the patch simpler. > >> I already sent a patch yesterday. > >> > >> http://marc.info/?l=linux-mm&m=125059325722370&w=2 > >> > >> I think it's simpler than KOSAKI's idea. > >> Is there any problem with my patch?
> > No, IMHO your patch is simple and good, while KOSAKI's is more > > complete :) > > > > - the try_set_page_mlocked() rename is suitable > > - the call to try_set_page_mlocked() is necessary on try_to_unmap() > > We don't need the try_set_page_mlocked() call in try_to_unmap(). > That's because try_to_unmap_xxx() will call try_to_mlock_page() if the > page is included in any VM_LOCKED vma. Eventually, it can be moved to > the unevictable list. Yes, indeed! > > - the "if (VM_LOCKED) referenced = 0" in page_referenced() could > > cover both active/inactive vmscan > > The sooner we set PG_mlocked on the page, the more unnecessary vmscan cost > we save moving it from the active list to the inactive list. But I think > it's a rare case, so there would be few such pages; > the overhead will not be big. The active list case can be persistent when the mlocked (but without PG_mlocked) page is executable and referenced by 2+ processes. But I admit that executable pages are relatively rare. > As far as I know, rescue by vmscan of pages that lost the isolation race > was Lee's design. > But as you pointed out, it has a bug: vmscan can't rescue the > page, because the page never reaches try_to_unmap. > > So I think this approach is proper. :) Now you decide :) Thanks, Fengguang > > I did like your proposed > > > > if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && > > - referenced && page_mapping_inuse(page)) > > + referenced && page_mapping_inuse(page) > > + && !(vm_flags & VM_LOCKED)) > > goto activate_locked; > > > > which looks more intuitive and less confusing. > > > > Thanks, > > Fengguang > > > > -- > Kind regards, > Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-06 11:44 ` Avi Kivity 2009-08-06 13:06 ` Wu Fengguang @ 2009-08-06 13:13 ` Rik van Riel 2009-08-06 13:49 ` Avi Kivity 2009-08-07 3:11 ` KOSAKI Motohiro 1 sibling, 2 replies; 122+ messages in thread From: Rik van Riel @ 2009-08-06 13:13 UTC (permalink / raw) To: Avi Kivity Cc: Wu Fengguang, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm Avi Kivity wrote: > On 08/06/2009 01:59 PM, Wu Fengguang wrote: >> As a refinement, the static variable 'recent_all_referenced' could be >> moved to struct zone or made a per-cpu variable. > > Definitely this should be made part of the zone structure, consider the > original report where the problem occurs in a 128MB zone (where we can > expect many pages to have their referenced bit set). The problem did not occur in a 128MB zone, but in a 128MB cgroup. Putting it in the zone means that the cgroup, which may have different behaviour from the rest of the zone, due to excessive memory pressure inside the cgroup, does not get the right statistics. -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-06 13:13 ` Rik van Riel @ 2009-08-06 13:49 ` Avi Kivity 2009-08-07 3:11 ` KOSAKI Motohiro 1 sibling, 0 replies; 122+ messages in thread From: Avi Kivity @ 2009-08-06 13:49 UTC (permalink / raw) To: Rik van Riel Cc: Wu Fengguang, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On 08/06/2009 04:13 PM, Rik van Riel wrote: > Avi Kivity wrote: >> On 08/06/2009 01:59 PM, Wu Fengguang wrote: > >>> As a refinement, the static variable 'recent_all_referenced' could be >>> moved to struct zone or made a per-cpu variable. >> >> Definitely this should be made part of the zone structure, consider >> the original report where the problem occurs in a 128MB zone (where >> we can expect many pages to have their referenced bit set). > > The problem did not occur in a 128MB zone, but in a 128MB cgroup. > > Putting it in the zone means that the cgroup, which may have > different behaviour from the rest of the zone, due to excessive > memory pressure inside the cgroup, does not get the right > statistics. > Well, it should be per inactive list, whether it's a zone or a cgroup. What's the name of this thing? ("inactive list"?) error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-06 13:13 ` Rik van Riel 2009-08-06 13:49 ` Avi Kivity @ 2009-08-07 3:11 ` KOSAKI Motohiro 2009-08-07 7:54 ` Balbir Singh 1 sibling, 1 reply; 122+ messages in thread From: KOSAKI Motohiro @ 2009-08-07 3:11 UTC (permalink / raw) To: Rik van Riel Cc: kosaki.motohiro, Avi Kivity, Wu Fengguang, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm, KAMEZAWA Hiroyuki, Balbir Singh (cc to memcg folks) > Avi Kivity wrote: > > On 08/06/2009 01:59 PM, Wu Fengguang wrote: > > >> As a refinement, the static variable 'recent_all_referenced' could be > >> moved to struct zone or made a per-cpu variable. > > > > Definitely this should be made part of the zone structure, consider the > > original report where the problem occurs in a 128MB zone (where we can > > expect many pages to have their referenced bit set). > > The problem did not occur in a 128MB zone, but in a 128MB cgroup. > > Putting it in the zone means that the cgroup, which may have > different behaviour from the rest of the zone, due to excessive > memory pressure inside the cgroup, does not get the right > statistics. Maybe I haven't caught your point. The current memcgroup logic also uses recent_scan/recent_rotate statistics. Isn't it enough? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-07 3:11 ` KOSAKI Motohiro @ 2009-08-07 7:54 ` Balbir Singh 2009-08-07 8:24 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 122+ messages in thread From: Balbir Singh @ 2009-08-07 7:54 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Rik van Riel, Avi Kivity, Wu Fengguang, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm, KAMEZAWA Hiroyuki On Fri, Aug 7, 2009 at 8:41 AM, KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com> wrote: > (cc to memcg folks) > >> Avi Kivity wrote: >> > On 08/06/2009 01:59 PM, Wu Fengguang wrote: >> >> >> As a refinement, the static variable 'recent_all_referenced' could be >> >> moved to struct zone or made a per-cpu variable. >> > >> > Definitely this should be made part of the zone structure, consider the >> > original report where the problem occurs in a 128MB zone (where we can >> > expect many pages to have their referenced bit set). >> >> The problem did not occur in a 128MB zone, but in a 128MB cgroup. >> >> Putting it in the zone means that the cgroup, which may have >> different behaviour from the rest of the zone, due to excessive >> memory pressure inside the cgroup, does not get the right >> statistics. > > Maybe I haven't caught your point. > > The current memcgroup logic also uses recent_scan/recent_rotate statistics. > Isn't it enough? I don't understand the context; I'll look at the problem when I am back (I am away from work for the next few days). Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-07 7:54 ` Balbir Singh @ 2009-08-07 8:24 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 122+ messages in thread From: KAMEZAWA Hiroyuki @ 2009-08-07 8:24 UTC (permalink / raw) To: Balbir Singh Cc: KOSAKI Motohiro, Rik van Riel, Avi Kivity, Wu Fengguang, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On Fri, 7 Aug 2009 13:24:34 +0530 Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > On Fri, Aug 7, 2009 at 8:41 AM, KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com> wrote: > > The current memcgroup logic also uses recent_scan/recent_rotate statistics. > > Isn't it enough? > > I don't understand the context; I'll look at the problem when I am > back (I am away from work for the next few days). > Brief summary (please point out if this is not correct): prepare a memcg with memory.limit_in_bytes=128MB, run kvm on it, and use apps whose working set is near 256MB (then, heavy swap). In this case: - Anon memory is swapped out even while there are file caches. In particular, a stack page which is frequently accessed can easily be swapped out, again and again. One of the reasons is a recent change: "a page mapped with VM_EXEC is not paged out even if it has no reference". Without memcg, where a user can use gigabytes of memory, the above change is very welcome. So the current question is "how can we handle this case without bad side effects". One possibility I wonder about is that this is a configuration mistake: setting memory.memsw.limit_in_bytes to a proper value may change the behavior. But that seems to be just a workaround. Can't we find an algorithmic/heuristic way to avoid too much swap-out? I think memcg can check the # of swap-ins, but right now we don't have a tag to see the sign of the "recently swapped-out page is reused" case, or that there are too many executable file pages. I wonder whether we can compare the # of paged-out file caches vs. the # of swapped-out anon pages, and keep "# of paged-out file caches < # of swapped-out anon pages" (or use swappiness). This state can be checked by reclaim_stat (per memcg). Hmm? I'm sorry, I'll be on a trip Aug/11-Aug/17 and my responses will be delayed. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
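One possible reading of the balancing idea above, as a sketch only; the struct, counters and bias step are all invented here, and nothing like this existed in the tree at the time:

	/* Hypothetical per-memcg eviction balance; all names invented. */
	struct evict_counters {
		unsigned long file_pageout;	/* file cache pages paged out */
		unsigned long anon_swapout;	/* anon pages swapped out */
	};

	/* Bias the next reclaim pass toward file pages whenever anon
	 * swap-outs have been outpacing file page-outs, so a small memcg
	 * does not keep swapping the same hot anon pages again and again. */
	static int swappiness_bias(const struct evict_counters *c)
	{
		return c->anon_swapout > c->file_pageout ? -10 : 0; /* arbitrary step */
	}

The point of the sketch is only that the signal KAMEZAWA describes is cheap: two counters that reclaim already increments in some form, compared at scan time.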
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-06 10:59 ` Wu Fengguang 2009-08-06 11:44 ` Avi Kivity @ 2009-08-06 13:11 ` Rik van Riel 1 sibling, 0 replies; 122+ messages in thread From: Rik van Riel @ 2009-08-06 13:11 UTC (permalink / raw) To: Wu Fengguang Cc: Andrea Arcangeli, Avi Kivity, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm Wu Fengguang wrote: > This is a quick hack to materialize the idea. It remembers roughly > the last 32*SWAP_CLUSTER_MAX=1024 active (mapped) pages scanned, > and if _all of them_ are referenced, then the referenced bit is > probably meaningless and should not be taken seriously. This has the potential to increase the number of active pages scanned by almost a factor 1024. Let me whip up an alternative idea when I get to the office later today. -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
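The "quick hack" Rik is responding to is only described, not quoted, in this part of the thread, so here is a guess at its shape; the window size matches the 32*SWAP_CLUSTER_MAX figure mentioned, but the names and structure are reconstructed, not the actual patch:

	#define REF_WINDOW	1024	/* ~32 * SWAP_CLUSTER_MAX */

	static unsigned int recent_all_referenced;	/* consecutive referenced pages seen */

	/* Feed in each scanned active page; returns true while the
	 * referenced bit still carries information, i.e. not every one of
	 * the last REF_WINDOW scanned pages had it set. */
	static bool referenced_bit_meaningful(bool page_was_referenced)
	{
		if (!page_was_referenced)
			recent_all_referenced = 0;
		else if (recent_all_referenced < REF_WINDOW)
			recent_all_referenced++;

		return recent_all_referenced < REF_WINDOW;
	}

Rik's objection is roughly visible here: until the counter saturates, the scanner may have to look at on the order of a thousand referenced pages before it is allowed to reclaim one.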
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-06 10:08 ` Andrea Arcangeli 2009-08-06 10:18 ` Avi Kivity @ 2009-08-06 13:08 ` Rik van Riel 2009-08-07 3:17 ` KOSAKI Motohiro 1 sibling, 1 reply; 122+ messages in thread From: Rik van Riel @ 2009-08-06 13:08 UTC (permalink / raw) To: Andrea Arcangeli Cc: Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm Andrea Arcangeli wrote: > Likely we need a cut-off point, if we detect it takes more than X > seconds to scan the whole active list, we start ignoring young bits, We could just make this depend on the calculated inactive_ratio, which depends on the size of the list. For small systems, it may make sense to make every accessed bit count, because the working set will often approach the size of memory. On very large systems, the working set may also approach the size of memory, but the inactive list only contains a small percentage of the pages, so there is enough space for everything. Say, if the inactive_ratio is 3 or less, make the accessed bit on the active lists count. -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
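Rik's cutoff could be wired into the shrink_active_list() check being debated in this thread with a couple of lines. A sketch only; zone->inactive_ratio is the per-zone value discussed later in the thread, and the <= 3 threshold is just the number floated above, untested:

	/* Sketch: honor the referenced bit of active anon pages only
	 * where the inactive list is a sizable fraction of memory
	 * (small systems and cgroups). */
	if ((vm_flags & VM_EXEC) ||
	    (PageAnon(page) && zone->inactive_ratio <= 3)) {
		list_add(&page->lru, &l_active);
		continue;
	}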
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-06 13:08 ` Rik van Riel @ 2009-08-07 3:17 ` KOSAKI Motohiro 2009-08-12 7:48 ` Wu Fengguang 0 siblings, 1 reply; 122+ messages in thread From: KOSAKI Motohiro @ 2009-08-07 3:17 UTC (permalink / raw) To: Rik van Riel Cc: kosaki.motohiro, Andrea Arcangeli, Wu Fengguang, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm > Andrea Arcangeli wrote: > > > Likely we need a cut-off point, if we detect it takes more than X > > seconds to scan the whole active list, we start ignoring young bits, > > We could just make this depend on the calculated inactive_ratio, > which depends on the size of the list. > > For small systems, it may make sense to make every accessed bit > count, because the working set will often approach the size of > memory. > > On very large systems, the working set may also approach the > size of memory, but the inactive list only contains a small > percentage of the pages, so there is enough space for everything. > > Say, if the inactive_ratio is 3 or less, make the accessed bit > on the active lists count. Sounds reasonable. How do we confirm the idea's correctness? Wu, is your X focus-switching benchmark a sufficient test? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-07 3:17 ` KOSAKI Motohiro @ 2009-08-12 7:48 ` Wu Fengguang 2009-08-12 14:31 ` Rik van Riel 0 siblings, 1 reply; 122+ messages in thread From: Wu Fengguang @ 2009-08-12 7:48 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Rik van Riel, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm [-- Attachment #1: Type: text/plain, Size: 28582 bytes --] On Fri, Aug 07, 2009 at 11:17:22AM +0800, KOSAKI Motohiro wrote: > > Andrea Arcangeli wrote: > > > > > Likely we need a cut-off point, if we detect it takes more than X > > > seconds to scan the whole active list, we start ignoring young bits, > > > > We could just make this depend on the calculated inactive_ratio, > > which depends on the size of the list. > > > > For small systems, it may make sense to make every accessed bit > > count, because the working set will often approach the size of > > memory. > > > > On very large systems, the working set may also approach the > > size of memory, but the inactive list only contains a small > > percentage of the pages, so there is enough space for everything. > > > > Say, if the inactive_ratio is 3 or less, make the accessed bit > > on the active lists count. > > Sounds reasonable. Yes, that kind of global measurement would be much better. > How do we confirm the idea's correctness? In general the active list tends to grow large on an under-scanned LRU. I guess Rik is pretty familiar with typical inactive_ratio values on large-memory systems and may even have some real numbers :) > Wu, is your X focus-switching benchmark a sufficient test? It is a major test case for memory-tight desktops. Jeff presents another interesting one for KVM, hehe. Anyway, I collected the active/inactive list sizes, and the numbers show that the inactive_ratio is roughly 1 when the LRU is scanned actively and may go very high when it is under-scanned.
Thanks, Fengguang 4GB desktop, kernel 2.6.30 -------------------------- 1) fresh startup: nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 0 80255 68932 24066 2) read 10GB sparse file: nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 48096 52312 830142 10971 3) kvm -m 512M: nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 82606 155588 684375 15380 4) exit kvm: nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 66364 35275 679033 17009 512MB desktop, kernel 2.6.31 ---------------------------- 1) fresh startup, console: nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 0 1870 7082 2075 2) fresh startx: nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 0 30021 31551 6893 3) start many x apps, no swap: (script attached) nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 0 56475 29608 9707 4074 54886 27431 9743 5452 54025 26685 9950 7417 53428 25394 9963 8522 52388 24717 10553 10684 51955 22055 11384 11644 51597 21329 11342 12341 51221 20822 11513 13874 49738 19916 11516 13874 50494 19916 11517 15284 48778 19739 12127 15668 49037 19196 12380 16821 48571 17661 13133 18329 49175 14470 14490 18961 49652 13081 14432 18961 49608 13236 14414 20563 51379 11171 13823 21044 50281 10311 13948 21426 49906 10268 13984 21771 50479 9734 14019 23246 49062 9672 13431 23984 49490 10083 12763 24479 49373 10332 12446 25782 49053 9655 12101 nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 26970 48078 9891 11415 28041 47873 9617 11079 29485 51183 8445 8293 30484 50140 8441 7997 31841 50578 6904 7413 32579 49873 6937 7804 34117 49336 6447 7440 35380 48300 5816 7471 38055 46486 4778 7546 39528 45227 5043 7417 40777 44681 4148 7325 41902 44468 3967 6534 43107 43378 4630 5846 43418 43538 5019 5698 43563 43514 4839 5514 43660 43587 5228 5431 43645 43315 4919 5886 43618 43555 4531 5704 43751 43646 4584 5600 43839 43703 4507 5569 44015 44057 4757 5378 44115 44089 4707 4724 44331 44184 4710 4701 44577 44554 4221 4265 [...] 
nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 47945 47876 1594 1547 47944 47888 1944 1494 47944 47888 1351 1392 47974 47844 1976 1498 47974 47858 1411 1549 47974 47857 1482 1423 47973 47874 2105 1435 47969 47349 1884 1592 47966 47353 1993 1700 47966 47343 1913 1882 47965 47306 1683 1746 47960 47373 1598 1583 47960 47375 1808 1677 47960 47004 2444 1625 47959 47060 2017 1825 47956 47047 1866 1742 47955 47080 2039 1987 47954 47072 1734 1822 47954 47092 1963 1867 47954 47130 1851 1846 47954 47154 2134 1813 47954 47181 1952 1813 47953 47138 1678 1810 47951 47125 1848 1951 4) start many x apps, with swap: nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 0 6571 13646 3251 0 6823 14637 3900 0 7426 17187 3935 0 8188 19989 3959 0 9994 21582 4148 0 12556 21889 5157 0 13846 23764 5249 0 20383 25393 5546 0 21830 26019 5696 0 22856 26608 5972 0 28651 28128 6146 0 28058 28482 6309 0 27726 28595 6312 0 27634 28775 6471 0 27636 28774 6464 0 31299 28848 6834 0 35102 29539 6886 0 39561 29980 6915 0 41573 30008 6917 0 47562 30041 6917 0 54603 30041 6917 3040 55528 29273 6945 16937 44916 23406 7675 16937 44932 23416 7670 nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 16937 44961 23416 7670 16937 40583 23416 7670 16937 40596 23417 7670 16937 40607 23417 7670 16937 40139 23404 7668 12181 11794 22932 8144 12181 11794 22932 8144 12181 11794 22946 8144 12181 11794 22946 8144 12147 13063 23148 8280 12146 15994 22842 8565 12146 17491 22654 8718 12146 17488 22654 8718 12146 17653 22634 8733 12146 18656 21030 10513 12146 19717 20778 10770 12146 20341 20859 10846 12146 21134 21096 11133 12146 22692 21129 11453 12144 24698 22225 11476 12144 27726 22609 11536 12144 27774 22648 11555 12144 28447 22844 11564 12144 30286 23238 11567 nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 12121 31489 23350 11761 12099 33117 23336 11779 12099 33632 23555 11787 12099 35393 23566 11806 12099 35828 23490 11882 12099 35879 23486 11887 12099 35889 23486 11888 12099 36078 24124 11890 12099 36449 25079 11895 12099 37782 25334 11898 12099 39494 25564 11904 12099 40620 25657 11905 14200 41298 25399 12069 21555 35228 22969 12495 22829 33097 22703 12617 25519 31496 22115 12552 28590 28947 21617 13051 28940 29076 19806 13270 29430 29344 19153 13825 30183 30399 17643 13418 32242 32203 13535 13969 33319 33294 12236 13659 33154 33085 11431 13482 33572 33569 11315 13102 nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 36246 35355 8033 9355 35659 35558 8491 8394 35330 35142 8233 8278 35788 35561 8460 8454 36129 36359 8413 8627 36727 36365 8311 8509 36672 36870 8437 8479 36772 36656 8090 8354 36754 36614 8237 8378 37591 36065 8352 8470 36530 36383 7607 7611 36113 35992 7271 7296 36149 35667 7092 7052 36014 35350 7408 7206 36409 35890 8027 7396 36300 35418 7892 7704 36369 36589 7723 7838 36243 36168 7576 7793 35804 35622 7422 7726 35498 35435 7443 7557 35078 35159 7542 7243 35478 35415 8199 7552 35143 35025 7828 7763 35312 34754 7745 7545 nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 35093 34933 7166 7748 36253 36236 7171 7408 36225 36929 8236 7532 36197 36169 7562 7632 35711 35647 7312 7471 35210 35144 7202 7227 35052 35021 7073 7084 35263 35047 7128 6963 35359 35177 7572 7048 35665 35523 7927 7025 34988 34788 7279 7340 34678 34438 7352 7141 34352 34270 7033 6980 34307 34175 6881 6809 34038 34469 7603 6700 34169 33854 7105 6868 34048 34124 7051 6869 33630 33445 6821 6875 33047 32992 6617 6554 33232 33012 7114 6659 33442 33217 7408 6700 32942 32707 6830 
7257 32672 32593 6801 7207 32406 32142 6656 6960 nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 32127 32036 6641 6798 31929 31769 6567 6664 31786 31968 7532 6670 32208 31228 7448 6859 30904 30835 6503 6774 30543 30559 6345 6709 30394 30278 6235 6288 30541 30239 6470 6243 30463 30656 6959 6587 31020 29794 7393 6897 30169 30128 6295 6905 29755 29644 6236 6598 29765 29617 6342 6475 29874 29748 6215 6335 29654 29491 6355 6358 29972 29853 7079 6607 29437 29267 6670 7205 29160 28956 6602 6982 29411 29017 6578 6937 29069 28952 6539 6717 29570 29342 6982 6850 28882 28927 6912 6809 29326 28731 6928 6814 28883 28817 6762 6819 nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 29072 28696 6756 6803 29296 29120 6993 6972 28426 28167 6238 7182 28071 27862 6197 6953 27944 27767 6872 6780 28141 27819 6839 6654 27547 27309 6285 6209 27578 27842 7730 6465 27741 27470 7180 6665 27481 27217 7566 6919 27568 27405 7696 7027 27274 27004 7416 7120 27110 26920 7111 7303 27282 27056 7476 7046 27549 27044 7779 7074 27325 26968 7290 6972 27665 27528 8465 7058 27093 26974 7662 7243 27155 27068 7299 7344 26638 26553 6925 7325 26718 26383 6571 7425 26264 26150 6470 6960 26463 26176 6590 6803 27155 26396 7387 6709 nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 26532 26408 6722 6702 26491 26421 6789 6731 26783 26950 8389 6849 27129 26584 7713 6991 26791 26228 7316 7202 26208 26168 7115 7172 26031 25907 6957 7118 25980 25675 6764 7216 25608 25535 6779 7042 25571 25501 6520 6943 26068 25287 6574 6948 25734 25305 6778 6776 25442 25134 6629 6556 25514 25217 7469 6543 25659 25552 8561 6620 26082 24784 8494 6676 25312 25194 7052 7026 25386 25267 7422 6973 25070 24965 6716 6886 25143 24801 6597 6785 24971 24866 6643 6786 25223 25212 6829 6757 25504 24778 7589 6840 25531 24786 8068 6896 nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 25343 25169 7227 7042 25195 25129 6804 7149 25355 25071 6958 6941 25294 25202 6676 6850 25688 25050 6743 6694 25736 25268 6910 6580 25750 25530 7299 6557 25401 25273 6622 6810 25672 25525 6798 6770 25192 25067 6226 6486 26011 25360 6540 6466 26673 25768 6444 6411 27211 26326 6370 6423 27527 26764 6615 6534 27355 26820 6337 6467 27385 26962 6098 6446 27528 27431 5832 6303 26955 26918 6016 6015 26816 26469 5847 5894 26961 26390 6077 5866 26781 26664 5625 5815 26755 26454 6114 5806 26994 26552 6016 5784 27482 26910 5945 5714 nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 27609 27143 5929 5715 27885 27443 7168 5947 27684 27635 8231 6145 27320 27205 7359 6415 27679 27265 7898 6445 27655 26342 7651 6574 27033 26853 7385 6831 26696 26468 6533 6721 26464 26310 6374 6465 26192 26084 6261 6417 26182 25801 6511 6367 26010 25880 6251 6266 26130 26032 5974 6280 26417 26175 5830 6610 26558 26450 6002 6623 26758 26016 6141 6526 26481 26363 5911 6356 26765 26622 6401 6266 27022 26534 6593 6210 27587 26515 6560 6193 27156 27029 6123 6109 27284 26926 6159 5776 27153 26996 5698 5642 26712 26603 6151 5541 nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 26697 27024 7919 5565 26651 26471 7134 5965 27021 26479 7617 5996 26323 26024 7091 6273 26081 25894 6267 6527 25605 25487 5814 6407 25564 25447 5613 6422 25460 25406 5630 6374 25380 25380 5776 6358 25661 25653 6045 6037 25790 24706 6069 6045 25512 25043 6024 5982 25440 25102 6067 5807 25802 25181 5953 5838 25864 25314 5694 5711 26022 25737 5592 5510 25964 25741 6784 5376 26092 25952 7929 5537 26110 25990 7120 5789 25311 25252 6157 6146 25432 25379 6658 6197 25552 25390 6176 
6357 25388 25237 5742 6303 25841 25173 5932 6325 nr_inactive_anon nr_active_anon nr_inactive_file nr_active_file 25054 24965 5532 6072 24754 25191 7503 5941 25529 25133 7319 6000 25350 24510 6993 6078 24672 24541 6027 6068 24610 24492 5811 5804 24819 24674 5841 5820 24775 24394 5719 5696 24991 25179 6639 5816 25282 24538 6870 6088 25172 24727 6628 6090 25363 24721 6644 6091 25676 24705 6672 6102 24998 24909 5683 5957 24762 24736 6034 5869 24965 24890 6374 6614 25050 24895 6436 6616 25087 24932 6436 6617 25139 24860 6435 6619 25159 24903 6435 6620 25168 25362 6004 7051 25209 25524 6004 7052 25209 25504 6004 7053 25262 25447 6011 7054 [-- Attachment #2: run-many-x-apps.sh --] [-- Type: application/x-sh, Size: 1735 bytes --] ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-12 7:48 ` Wu Fengguang @ 2009-08-12 14:31 ` Rik van Riel 2009-08-13 1:03 ` Wu Fengguang 0 siblings, 1 reply; 122+ messages in thread From: Rik van Riel @ 2009-08-12 14:31 UTC (permalink / raw) To: Wu Fengguang Cc: KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm Wu Fengguang wrote: > On Fri, Aug 07, 2009 at 11:17:22AM +0800, KOSAKI Motohiro wrote: >>> Andrea Arcangeli wrote: >>> >>>> Likely we need a cut-off point, if we detect it takes more than X >>>> seconds to scan the whole active list, we start ignoring young bits, >>> We could just make this depend on the calculated inactive_ratio, >>> which depends on the size of the list. >>> >>> For small systems, it may make sense to make every accessed bit >>> count, because the working set will often approach the size of >>> memory. >>> >>> On very large systems, the working set may also approach the >>> size of memory, but the inactive list only contains a small >>> percentage of the pages, so there is enough space for everything. >>> >>> Say, if the inactive_ratio is 3 or less, make the accessed bit >>> on the active lists count. >> Sounds reasonable. > > Yes, that kind of global measurement would be much better. > >> How do we confirm the idea's correctness? > > In general the active list tends to grow large on an under-scanned LRU. > I guess Rik is pretty familiar with typical inactive_ratio values on > large-memory systems and may even have some real numbers :) > >> Wu, is your X focus-switching benchmark a sufficient test? > > It is a major test case for memory-tight desktops. Jeff presents > another interesting one for KVM, hehe. > > Anyway, I collected the active/inactive list sizes, and the numbers > show that the inactive_ratio is roughly 1 when the LRU is scanned > actively and may go very high when it is under-scanned. inactive_ratio is based on the zone (or cgroup) size. For zones it is a fixed value, which is available in /proc/zoneinfo. -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-12 14:31 ` Rik van Riel @ 2009-08-13 1:03 ` Wu Fengguang 2009-08-13 15:46 ` Rik van Riel 0 siblings, 1 reply; 122+ messages in thread From: Wu Fengguang @ 2009-08-13 1:03 UTC (permalink / raw) To: Rik van Riel Cc: KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On Wed, Aug 12, 2009 at 10:31:41PM +0800, Rik van Riel wrote: > Wu Fengguang wrote: > > On Fri, Aug 07, 2009 at 11:17:22AM +0800, KOSAKI Motohiro wrote: > >>> Andrea Arcangeli wrote: > >>> > >>>> Likely we need a cut-off point, if we detect it takes more than X > >>>> seconds to scan the whole active list, we start ignoring young bits, > >>> We could just make this depend on the calculated inactive_ratio, > >>> which depends on the size of the list. > >>> > >>> For small systems, it may make sense to make every accessed bit > >>> count, because the working set will often approach the size of > >>> memory. > >>> > >>> On very large systems, the working set may also approach the > >>> size of memory, but the inactive list only contains a small > >>> percentage of the pages, so there is enough space for everything. > >>> > >>> Say, if the inactive_ratio is 3 or less, make the accessed bit > >>> on the active lists count. > >> Sounds reasonable. > > > > Yes, that kind of global measurement would be much better. > > > >> How do we confirm the idea's correctness? > > > > In general the active list tends to grow large on an under-scanned LRU. > > I guess Rik is pretty familiar with typical inactive_ratio values on > > large-memory systems and may even have some real numbers :) > > > >> Wu, is your X focus-switching benchmark a sufficient test? > > > > It is a major test case for memory-tight desktops. Jeff presents > > another interesting one for KVM, hehe. > > > > Anyway, I collected the active/inactive list sizes, and the numbers > > show that the inactive_ratio is roughly 1 when the LRU is scanned > > actively and may go very high when it is under-scanned. > > inactive_ratio is based on the zone (or cgroup) size. Ah sorry, by 'inactive_ratio' I meant the runtime active:inactive ratio. > For zones it is a fixed value, which is available in > /proc/zoneinfo. On my 64bit desktop with 4GB memory: DMA inactive_ratio: 1 DMA32 inactive_ratio: 4 Normal inactive_ratio: 1 The biggest zone DMA32 has inactive_ratio=4. But I guess the referenced bit should not be ignored on this typical desktop configuration? Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
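For reference, the fixed per-zone value reported in /proc/zoneinfo is derived from the zone size at boot. The following is paraphrased from memory of the era's mm/page_alloc.c, so details may differ:

	static void setup_per_zone_inactive_ratio(void)
	{
		struct zone *zone;

		for_each_zone(zone) {
			unsigned int gb, ratio;

			/* zone size in gigabytes */
			gb = zone->present_pages >> (30 - PAGE_SHIFT);
			ratio = int_sqrt(10 * gb);
			if (!ratio)
				ratio = 1;

			zone->inactive_ratio = ratio;
		}
	}

This is consistent with the numbers above: a DMA32 zone of roughly 2GB gives int_sqrt(20) == 4, and zones under 1GB fall back to a ratio of 1.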
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-13 1:03 ` Wu Fengguang @ 2009-08-13 15:46 ` Rik van Riel 2009-08-13 16:12 ` Avi Kivity 0 siblings, 1 reply; 122+ messages in thread From: Rik van Riel @ 2009-08-13 15:46 UTC (permalink / raw) To: Wu Fengguang Cc: KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm Wu Fengguang wrote: > On Wed, Aug 12, 2009 at 10:31:41PM +0800, Rik van Riel wrote: >> For zones it is a fixed value, which is available in >> /proc/zoneinfo > > On my 64bit desktop with 4GB memory: > > DMA inactive_ratio: 1 > DMA32 inactive_ratio: 4 > Normal inactive_ratio: 1 > > The biggest zone DMA32 has inactive_ratio=4. But I guess the > referenced bit should not be ignored on this typical desktop > configuration? We need to ignore the referenced bit on active anon pages on very large systems, but it could indeed be helpful to respect the referenced bit on smaller systems. I have no idea where the cut-off between them would be. Maybe at inactive_ratio <= 4 ? -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-13 15:46 ` Rik van Riel @ 2009-08-13 16:12 ` Avi Kivity 2009-08-13 16:26 ` Rik van Riel 0 siblings, 1 reply; 122+ messages in thread From: Avi Kivity @ 2009-08-13 16:12 UTC (permalink / raw) To: Rik van Riel Cc: Wu Fengguang, KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On 08/13/2009 06:46 PM, Rik van Riel wrote: > We need to ignore the referenced bit on active anon pages > on very large systems, but it could indeed be helpful to > respect the referenced bit on smaller systems. > > I have no idea where the cut-off between them would be. > > Maybe at inactive_ratio <= 4 ? Why do we need to ignore the referenced bit in such cases? To avoid overscanning? -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-13 16:12 ` Avi Kivity @ 2009-08-13 16:26 ` Rik van Riel 2009-08-13 19:12 ` Avi Kivity 0 siblings, 1 reply; 122+ messages in thread From: Rik van Riel @ 2009-08-13 16:26 UTC (permalink / raw) To: Avi Kivity Cc: Wu Fengguang, KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm Avi Kivity wrote: > On 08/13/2009 06:46 PM, Rik van Riel wrote: >> We need to ignore the referenced bit on active anon pages >> on very large systems, but it could indeed be helpful to >> respect the referenced bit on smaller systems. >> >> I have no idea where the cut-off between them would be. >> >> Maybe at inactive_ratio <= 4 ? > > Why do we need to ignore the referenced bit in such cases? To avoid > overscanning? Because swapping out anonymous pages tends to be a relatively rare operation, we'll have many gigabytes of anonymous pages that all have the referenced bit set (because there was lots of time between swapout bursts). Ignoring the referenced bit on active anon pages makes no difference on these systems, because all active anon pages have the referenced bit set, anyway. All we need to do is put the pages on the inactive list and give them a chance to get referenced. However, on smaller systems (and cgroups!), the speed at which we can do pageout IO is larger, compared to the amount of memory. This means we can cycle through the pages more quickly and we may want to count references on the active list, too. Yes, on smaller systems we'll also often end up with bursty swapout loads and all pages referenced - but since we have fewer pages to begin with, it won't hurt as much. I suspect that an inactive_ratio of 3 or 4 might make a good cutoff value. -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-13 16:26 ` Rik van Riel @ 2009-08-13 19:12 ` Avi Kivity 2009-08-13 21:16 ` Johannes Weiner 0 siblings, 1 reply; 122+ messages in thread From: Avi Kivity @ 2009-08-13 19:12 UTC (permalink / raw) To: Rik van Riel Cc: Wu Fengguang, KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On 08/13/2009 07:26 PM, Rik van Riel wrote: >> Why do we need to ignore the referenced bit in such cases? To avoid >> overscanning? > > > Because swapping out anonymous pages tends to be a relatively > rare operation, we'll have many gigabytes of anonymous pages > that all have the referenced bit set (because there was lots > of time between swapout bursts). > > Ignoring the referenced bit on active anon pages makes no > difference on these systems, because all active anon pages > have the referenced bit set, anyway. > > All we need to do is put the pages on the inactive list and > give them a chance to get referenced. > > However, on smaller systems (and cgroups!), the speed at > which we can do pageout IO is larger, compared to the amount > of memory. This means we can cycle through the pages more > quickly and we may want to count references on the active > list, too. > > Yes, on smaller systems we'll also often end up with bursty > swapout loads and all pages referenced - but since we have > fewer pages to begin with, it won't hurt as much. > > I suspect that an inactive_ratio of 3 or 4 might make a > good cutoff value. > Thanks for the explanation. I think my earlier idea of - do not ignore the referenced bit - if you see a run of N pages which all have the referenced bit set, do swap one has merit. It means we cycle more quickly (by a factor of N) through the list, looking for unreferenced pages. If we don't find any we've spent some more cpu, but if we do find an unreferenced page, we win by swapping a truly unneeded page. Cycling faster also means reducing the time between examinations of any particular page, so it increases the meaningfulness of the check on large systems (otherwise even rarely used pages will always show up as referenced). -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
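A sketch of the run-length idea, assuming the scan walks the active list oldest-first; the run limit of 32 is an arbitrary stand-in for N, not taken from the mail:

#include <stdbool.h>

#define RUN_LIMIT 32	/* assumed N */

/*
 * Decide whether to deactivate the next active-list page.  An
 * unreferenced page is always deactivated; a referenced page is
 * normally kept, but after RUN_LIMIT consecutive referenced pages
 * one is deactivated anyway so the scan keeps making progress.
 */
static bool deactivate_next(bool referenced, unsigned int *run_len)
{
	if (!referenced) {
		*run_len = 0;
		return true;
	}
	if (++(*run_len) >= RUN_LIMIT) {
		*run_len = 0;
		return true;
	}
	return false;
}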
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-13 19:12 ` Avi Kivity @ 2009-08-13 21:16 ` Johannes Weiner 2009-08-14 7:16 ` Avi Kivity 0 siblings, 1 reply; 122+ messages in thread From: Johannes Weiner @ 2009-08-13 21:16 UTC (permalink / raw) To: Avi Kivity Cc: Rik van Riel, Wu Fengguang, KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On Thu, Aug 13, 2009 at 10:12:01PM +0300, Avi Kivity wrote: > On 08/13/2009 07:26 PM, Rik van Riel wrote: > >>Why do we need to ignore the referenced bit in such cases? To avoid > >>overscanning? > > > > > >Because swapping out anonymous pages tends to be a relatively > >rare operation, we'll have many gigabytes of anonymous pages > >that all have the referenced bit set (because there was lots > >of time between swapout bursts). > > > >Ignoring the referenced bit on active anon pages makes no > >difference on these systems, because all active anon pages > >have the referenced bit set, anyway. > > > >All we need to do is put the pages on the inactive list and > >give them a chance to get referenced. > > > >However, on smaller systems (and cgroups!), the speed at > >which we can do pageout IO is larger, compared to the amount > >of memory. This means we can cycle through the pages more > >quickly and we may want to count references on the active > >list, too. > > > >Yes, on smaller systems we'll also often end up with bursty > >swapout loads and all pages referenced - but since we have > >fewer pages to begin with, it won't hurt as much. > > > >I suspect that an inactive_ratio of 3 or 4 might make a > >good cutoff value. > > > > Thanks for the explanation. I think my earlier idea of > > - do not ignore the referenced bit > - if you see a run of N pages which all have the referenced bit set, do > swap one > > has merit. It means we cycle more quickly (by a factor of N) through > the list, looking for unreferenced pages. If we don't find any we've > spent a some more cpu, but if we do find an unreferenced page, we win by > swapping a truly unneeded page. But it also means destroying the LRU order. Okay, we ignore the referenced bit, but we keep LRU buddies together which then get reactivated together as well, if they are indeed in active use. I could imagine the VM going nuts when you separate them by a predicate that is rather unrelated to the pages' actual usage patterns. After all, the list order is the primary input to selecting pages for eviction. It would need actual testing, of course, but I bet Rik's idea of using the referenced bit always or never is going to show better results. Hannes -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-13 21:16 ` Johannes Weiner @ 2009-08-14 7:16 ` Avi Kivity 2009-08-14 9:10 ` Johannes Weiner 0 siblings, 1 reply; 122+ messages in thread From: Avi Kivity @ 2009-08-14 7:16 UTC (permalink / raw) To: Johannes Weiner Cc: Rik van Riel, Wu Fengguang, KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On 08/14/2009 12:16 AM, Johannes Weiner wrote: > >> - do not ignore the referenced bit >> - if you see a run of N pages which all have the referenced bit set, do >> swap one >> >> > > But it also means destroying the LRU order. > > True, it would, but if we ignore the referenced bit, LRU order is really FIFO. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-14 7:16 ` Avi Kivity @ 2009-08-14 9:10 ` Johannes Weiner 2009-08-14 9:51 ` Wu Fengguang 0 siblings, 1 reply; 122+ messages in thread From: Johannes Weiner @ 2009-08-14 9:10 UTC (permalink / raw) To: Avi Kivity Cc: Rik van Riel, Wu Fengguang, KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On Fri, Aug 14, 2009 at 10:16:26AM +0300, Avi Kivity wrote: > On 08/14/2009 12:16 AM, Johannes Weiner wrote: > > > >>- do not ignore the referenced bit > >>- if you see a run of N pages which all have the referenced bit set, do > >>swap one > >> > >> > > > >But it also means destroying the LRU order. > > > > > > True, it would, but if we ignore the referenced bit, LRU order is really > FIFO. For the active list, yes. But it's not that we degrade to First Fault First Out in a global scope, we still update the order from mark_page_accessed() and by activating referenced pages in shrink_page_list() etc. So even with the active list being a FIFO, we keep usage information gathered from the inactive list. If we deactivate pages in arbitrary list intervals, we throw this away. And while global FIFO-based reclaim does not work too well, initial fault order is a valuable hint with respect to referential locality as the pages get used in groups and thus move around the lists in groups. Our granularity for regrouping decisions is pretty coarse; for non-filecache pages it's basically 'referenced or not referenced in the last list round-trip', so it will take quite some time to regroup pages that are used in truly similar intervals. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
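The promotion path mentioned above works like a second-chance scheme; a userspace sketch of mark_page_accessed() as it looks in kernels of this era (mm/swap.c), with the LRU locking and unevictable handling omitted:

#include <stdbool.h>

struct page_flags {
	bool active;		/* PageActive */
	bool referenced;	/* PageReferenced, the software flag */
};

/*
 * A page must be seen twice while on the inactive list before it is
 * promoted: the first access only sets the referenced flag, the
 * second one activates the page.  This is the usage information
 * gathered on the inactive list.
 */
static void mark_page_accessed(struct page_flags *page)
{
	if (!page->active && page->referenced) {
		page->active = true;	/* activate_page() */
		page->referenced = false;
	} else if (!page->referenced) {
		page->referenced = true;
	}
}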
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-14 9:10 ` Johannes Weiner @ 2009-08-14 9:51 ` Wu Fengguang 2009-08-14 13:19 ` Rik van Riel 2009-08-14 21:42 ` [RFC] respect the referenced bit of KVM guest pages? Dike, Jeffrey G 0 siblings, 2 replies; 122+ messages in thread From: Wu Fengguang @ 2009-08-14 9:51 UTC (permalink / raw) To: Johannes Weiner Cc: Avi Kivity, Rik van Riel, KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote: > On Fri, Aug 14, 2009 at 10:16:26AM +0300, Avi Kivity wrote: > > On 08/14/2009 12:16 AM, Johannes Weiner wrote: > > > > > >>- do not ignore the referenced bit > > >>- if you see a run of N pages which all have the referenced bit set, do > > >>swap one > > >> > > >> > > > > > >But it also means destroying the LRU order. > > > > > > > > > > True, it would, but if we ignore the referenced bit, LRU order is really > > FIFO. > > For the active list, yes. But it's not that we degrade to First Fault > First Out in a global scope, we still update the order from > mark_page_accessed() and by activating referenced pages in > shrink_page_list() etc. > > So even with the active list being a FIFO, we keep usage information > gathered from the inactive list. If we deactivate pages in arbitrary > list intervals, we throw this away. We do have the danger of FIFO, if the inactive list is small enough, so that (unconditionally) deactivated pages quickly get reclaimed and their life window in the inactive list is too small to be useful. > And while global FIFO-based reclaim does not work too well, initial > fault order is a valuable hint in the aspect of referential locality > as the pages get used in groups and thus move around the lists in > groups. > > Our granularity for regrouping decisions is pretty coarse, for > non-filecache pages it's basically 'referenced or not refrenced in the > last list round-trip', so it will take quite some time to regroup > pages that are used in truly similar intervals. Agreed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-14 9:51 ` Wu Fengguang @ 2009-08-14 13:19 ` Rik van Riel 2009-08-15 5:45 ` Wu Fengguang 1 sibling, 1 reply; 122+ messages in thread From: Rik van Riel @ 2009-08-14 13:19 UTC (permalink / raw) To: Wu Fengguang Cc: Johannes Weiner, Avi Kivity, KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm Wu Fengguang wrote: > On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote: > >> So even with the active list being a FIFO, we keep usage information >> gathered from the inactive list. If we deactivate pages in arbitrary >> list intervals, we throw this away. > > We do have the danger of FIFO, if inactive list is small enough, so > that (unconditionally) deactivated pages quickly get reclaimed and > their life window in inactive list is too small to be useful. This is one of the reasons why we unconditionally deactivate the active anon pages, and do background scanning of the active anon list when reclaiming page cache pages. We want to always move some pages to the inactive anon list, so it does not get too small. -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
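The balance check driving that background scan is a one-liner; a sketch of the predicate (inactive_anon_is_low() in kernels of this era), using the inactive_ratio discussed earlier in the thread:

/*
 * The active anon list is considered too large, and gets background
 * scanned, once it exceeds inactive * inactive_ratio.  With ratio 4,
 * deactivation proceeds until at most 4/5 of anon pages are active.
 */
static int inactive_anon_is_low(unsigned long nr_active_anon,
				unsigned long nr_inactive_anon,
				unsigned int inactive_ratio)
{
	return nr_inactive_anon * inactive_ratio < nr_active_anon;
}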
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-14 13:19 ` Rik van Riel @ 2009-08-15 5:45 ` Wu Fengguang 2009-08-16 5:09 ` Balbir Singh ` (2 more replies) 0 siblings, 3 replies; 122+ messages in thread From: Wu Fengguang @ 2009-08-15 5:45 UTC (permalink / raw) To: Rik van Riel Cc: Johannes Weiner, Avi Kivity, KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On Fri, Aug 14, 2009 at 09:19:35PM +0800, Rik van Riel wrote: > Wu Fengguang wrote: > > On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote: > > >> So even with the active list being a FIFO, we keep usage information > >> gathered from the inactive list. If we deactivate pages in arbitrary > >> list intervals, we throw this away. > > > > We do have the danger of FIFO, if inactive list is small enough, so > > that (unconditionally) deactivated pages quickly get reclaimed and > > their life window in inactive list is too small to be useful. > > This one of the reasons why we unconditionally deactivate > the active anon pages, and do background scanning of the > active anon list when reclaiming page cache pages. > > We want to always move some pages to the inactive anon > list, so it does not get too small. Right, the current code tries to pull the inactive list out of its smallish-size state as long as there are vmscan activities. However there is a possible (and tricky) hole: mem cgroups don't do batched vmscan. shrink_zone() may call shrink_list() with nr_to_scan=1, in which case shrink_list() _still_ calls isolate_pages() with the much larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list scan rate by 10 times when it is still small, and may thus prevent it from ever growing. In that case, LRU becomes FIFO. Jeff, can you confirm if the mem cgroup's inactive list is small? If so, this patch should help.
Thanks, Fengguang --- mm: do batched scans for mem_cgroup Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- include/linux/memcontrol.h | 3 +++ mm/memcontrol.c | 12 ++++++++++++ mm/vmscan.c | 9 +++++---- 3 files changed, 20 insertions(+), 4 deletions(-) --- linux.orig/include/linux/memcontrol.h 2009-08-15 13:12:49.000000000 +0800 +++ linux/include/linux/memcontrol.h 2009-08-15 13:18:13.000000000 +0800 @@ -98,6 +98,9 @@ int mem_cgroup_inactive_file_is_low(stru unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone, enum lru_list lru); +unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg, + struct zone *zone, + enum lru_list lru); struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone); struct zone_reclaim_stat* --- linux.orig/mm/memcontrol.c 2009-08-15 13:07:34.000000000 +0800 +++ linux/mm/memcontrol.c 2009-08-15 13:17:56.000000000 +0800 @@ -115,6 +115,7 @@ struct mem_cgroup_per_zone { */ struct list_head lists[NR_LRU_LISTS]; unsigned long count[NR_LRU_LISTS]; + unsigned long nr_saved_scan[NR_LRU_LISTS]; struct zone_reclaim_stat reclaim_stat; }; @@ -597,6 +598,17 @@ unsigned long mem_cgroup_zone_nr_pages(s return MEM_CGROUP_ZSTAT(mz, lru); } +unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg, + struct zone *zone, + enum lru_list lru) +{ + int nid = zone->zone_pgdat->node_id; + int zid = zone_idx(zone); + struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid); + + return &mz->nr_saved_scan[lru]; +} + struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone) { --- linux.orig/mm/vmscan.c 2009-08-15 13:04:54.000000000 +0800 +++ linux/mm/vmscan.c 2009-08-15 13:19:03.000000000 +0800 @@ -1534,6 +1534,7 @@ static void shrink_zone(int priority, st for_each_evictable_lru(l) { int file = is_file_lru(l); unsigned long scan; + unsigned long *saved_scan; scan = zone_nr_pages(zone, sc, l); if (priority || noswap) { @@ -1541,11 +1542,11 @@ static void shrink_zone(int priority, st scan = (scan * percent[file]) / 100; } if (scanning_global_lru(sc)) - nr[l] = nr_scan_try_batch(scan, - &zone->lru[l].nr_saved_scan, - swap_cluster_max); + saved_scan = &zone->lru[l].nr_saved_scan; else - nr[l] = scan; + saved_scan = mem_cgroup_get_saved_scan(sc->mem_cgroup, + zone, l); + nr[l] = nr_scan_try_batch(scan, saved_scan, swap_cluster_max); } while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] || -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
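For context, the batching helper the patch extends to the memcg case looks like this in kernels of this era (mm/vmscan.c): sub-batch requests accumulate in *nr_saved_scan, and nothing is handed out until a full batch is due.

static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
				       unsigned long *nr_saved_scan,
				       unsigned long swap_cluster_max)
{
	unsigned long nr;

	*nr_saved_scan += nr_to_scan;
	nr = *nr_saved_scan;

	if (nr >= swap_cluster_max)
		*nr_saved_scan = 0;	/* hand out the whole batch */
	else
		nr = 0;			/* keep saving, scan nothing yet */

	return nr;
}

Called with nr_to_scan=1 and swap_cluster_max=32, it returns 0 thirty-one times and then 32 once, so a small memcg list is isolated once per 32 requests instead of 32 pages on every request.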
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-15 5:45 ` Wu Fengguang @ 2009-08-16 5:09 ` Balbir Singh 2009-08-16 5:41 ` Wu Fengguang ` (2 more replies) 2009-08-17 18:04 ` Dike, Jeffrey G 2009-08-18 15:57 ` KOSAKI Motohiro 2 siblings, 3 replies; 122+ messages in thread From: Balbir Singh @ 2009-08-16 5:09 UTC (permalink / raw) To: Wu Fengguang Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm * Wu Fengguang <fengguang.wu@intel.com> [2009-08-15 13:45:24]: > On Fri, Aug 14, 2009 at 09:19:35PM +0800, Rik van Riel wrote: > > Wu Fengguang wrote: > > > On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote: > > > > >> So even with the active list being a FIFO, we keep usage information > > >> gathered from the inactive list. If we deactivate pages in arbitrary > > >> list intervals, we throw this away. > > > > > > We do have the danger of FIFO, if inactive list is small enough, so > > > that (unconditionally) deactivated pages quickly get reclaimed and > > > their life window in inactive list is too small to be useful. > > > > This one of the reasons why we unconditionally deactivate > > the active anon pages, and do background scanning of the > > active anon list when reclaiming page cache pages. > > > > We want to always move some pages to the inactive anon > > list, so it does not get too small. > > Right, the current code tries to pull inactive list out of > smallish-size state as long as there are vmscan activities. > > However there is a possible (and tricky) hole: mem cgroups > don't do batched vmscan. shrink_zone() may call shrink_list() > with nr_to_scan=1, in which case shrink_list() _still_ calls > isolate_pages() with the much larger SWAP_CLUSTER_MAX. > > It effectively scales up the inactive list scan rate by 10 times when > it is still small, and may thus prevent it from growing up for ever. > I think we need to possibly export some scanning data under DEBUG_VM to cross verify. > In that case, LRU becomes FIFO. > > Jeff, can you confirm if the mem cgroup's inactive list is small? > If so, this patch should help. 
> > Thanks, > Fengguang > --- > > mm: do batched scans for mem_cgroup > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > --- > include/linux/memcontrol.h | 3 +++ > mm/memcontrol.c | 12 ++++++++++++ > mm/vmscan.c | 9 +++++---- > 3 files changed, 20 insertions(+), 4 deletions(-) > > --- linux.orig/include/linux/memcontrol.h 2009-08-15 13:12:49.000000000 +0800 > +++ linux/include/linux/memcontrol.h 2009-08-15 13:18:13.000000000 +0800 > @@ -98,6 +98,9 @@ int mem_cgroup_inactive_file_is_low(stru > unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, > struct zone *zone, > enum lru_list lru); > +unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg, > + struct zone *zone, > + enum lru_list lru); > struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, > struct zone *zone); > struct zone_reclaim_stat* > --- linux.orig/mm/memcontrol.c 2009-08-15 13:07:34.000000000 +0800 > +++ linux/mm/memcontrol.c 2009-08-15 13:17:56.000000000 +0800 > @@ -115,6 +115,7 @@ struct mem_cgroup_per_zone { > */ > struct list_head lists[NR_LRU_LISTS]; > unsigned long count[NR_LRU_LISTS]; > + unsigned long nr_saved_scan[NR_LRU_LISTS]; > > struct zone_reclaim_stat reclaim_stat; > }; > @@ -597,6 +598,17 @@ unsigned long mem_cgroup_zone_nr_pages(s > return MEM_CGROUP_ZSTAT(mz, lru); > } > > +unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg, > + struct zone *zone, > + enum lru_list lru) > +{ > + int nid = zone->zone_pgdat->node_id; > + int zid = zone_idx(zone); > + struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid); > + > + return &mz->nr_saved_scan[lru]; > +} > + > struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, > struct zone *zone) > { > --- linux.orig/mm/vmscan.c 2009-08-15 13:04:54.000000000 +0800 > +++ linux/mm/vmscan.c 2009-08-15 13:19:03.000000000 +0800 > @@ -1534,6 +1534,7 @@ static void shrink_zone(int priority, st > for_each_evictable_lru(l) { > int file = is_file_lru(l); > unsigned long scan; > + unsigned long *saved_scan; > > scan = zone_nr_pages(zone, sc, l); > if (priority || noswap) { > @@ -1541,11 +1542,11 @@ static void shrink_zone(int priority, st > scan = (scan * percent[file]) / 100; > } > if (scanning_global_lru(sc)) > - nr[l] = nr_scan_try_batch(scan, > - &zone->lru[l].nr_saved_scan, > - swap_cluster_max); > + saved_scan = &zone->lru[l].nr_saved_scan; > else > - nr[l] = scan; > + saved_scan = mem_cgroup_get_saved_scan(sc->mem_cgroup, > + zone, l); > + nr[l] = nr_scan_try_batch(scan, saved_scan, swap_cluster_max); > } > This might be a concern (although not a big one ATM), since we can't afford to miss limits by much: if a cgroup is near its limit and we drop a scan of it, we'll have to work out what this means for the end user. Maybe a more fundamental look is required at the priority-based logic of exposing how much to scan, I don't know. -- Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-16 5:09 ` Balbir Singh @ 2009-08-16 5:41 ` Wu Fengguang 2009-08-16 5:50 ` Wu Fengguang 0 siblings, 0 replies; 122+ messages in thread From: Wu Fengguang @ 2009-08-16 5:41 UTC (permalink / raw) To: Balbir Singh Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On Sun, Aug 16, 2009 at 01:09:03PM +0800, Balbir Singh wrote: > * Wu Fengguang <fengguang.wu@intel.com> [2009-08-15 13:45:24]: > > > On Fri, Aug 14, 2009 at 09:19:35PM +0800, Rik van Riel wrote: > > > Wu Fengguang wrote: > > > > On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote: > > > > > > >> So even with the active list being a FIFO, we keep usage information > > > >> gathered from the inactive list. If we deactivate pages in arbitrary > > > >> list intervals, we throw this away. > > > > > > > > We do have the danger of FIFO, if inactive list is small enough, so > > > > that (unconditionally) deactivated pages quickly get reclaimed and > > > > their life window in inactive list is too small to be useful. > > > > > > This one of the reasons why we unconditionally deactivate > > > the active anon pages, and do background scanning of the > > > active anon list when reclaiming page cache pages. > > > > > > We want to always move some pages to the inactive anon > > > list, so it does not get too small. > > > > Right, the current code tries to pull inactive list out of > > smallish-size state as long as there are vmscan activities. > > > > However there is a possible (and tricky) hole: mem cgroups > > don't do batched vmscan. shrink_zone() may call shrink_list() > > with nr_to_scan=1, in which case shrink_list() _still_ calls > > isolate_pages() with the much larger SWAP_CLUSTER_MAX. > > > > It effectively scales up the inactive list scan rate by 10 times when > > it is still small, and may thus prevent it from growing up for ever. > > > > I think we need to possibly export some scanning data under DEBUG_VM > to cross verify. Maybe we can add more general debugging code, but here is a quick patch for examining the cgroup case. Note that even for the global zones, max_scan may well not be a multiple of SWAP_CLUSTER_MAX, thus shrink_inactive_list() will scan a little more in its last loop. --- mm/vmscan.c | 7 +++++++ 1 file changed, 7 insertions(+) --- linux.orig/mm/vmscan.c 2009-08-16 13:24:25.000000000 +0800 +++ linux/mm/vmscan.c 2009-08-16 13:38:32.000000000 +0800 @@ -1043,6 +1043,13 @@ static unsigned long shrink_inactive_lis struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc); int lumpy_reclaim = 0; + if (!scanning_global_lru(sc)) + printk("shrink inactive %s count=%lu scan=%lu\n", + file ? "file" : "anon", + mem_cgroup_zone_nr_pages(sc->mem_cgroup, zone, + file ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON), + max_scan); + /* * If we need a large contiguous chunk of memory, or have * trouble getting a small set of contiguous pages, we -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-16 5:09 ` Balbir Singh 2009-08-16 5:41 ` Wu Fengguang @ 2009-08-16 5:50 ` Wu Fengguang 2009-08-18 15:57 ` KOSAKI Motohiro 2 siblings, 0 replies; 122+ messages in thread From: Wu Fengguang @ 2009-08-16 5:50 UTC (permalink / raw) To: Balbir Singh Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On Sun, Aug 16, 2009 at 01:09:03PM +0800, Balbir Singh wrote: > * Wu Fengguang <fengguang.wu@intel.com> [2009-08-15 13:45:24]: > > > On Fri, Aug 14, 2009 at 09:19:35PM +0800, Rik van Riel wrote: > > > Wu Fengguang wrote: > > > > On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote: > > > > > @@ -1541,11 +1542,11 @@ static void shrink_zone(int priority, st > > scan = (scan * percent[file]) / 100; > > } > > if (scanning_global_lru(sc)) > > - nr[l] = nr_scan_try_batch(scan, > > - &zone->lru[l].nr_saved_scan, > > - swap_cluster_max); > > + saved_scan = &zone->lru[l].nr_saved_scan; > > else > > - nr[l] = scan; > > + saved_scan = mem_cgroup_get_saved_scan(sc->mem_cgroup, > > + zone, l); > > + nr[l] = nr_scan_try_batch(scan, saved_scan, swap_cluster_max); > > } > > > > This might be a concern (although not a big ATM), since we can't > afford to miss limits by much. If a cgroup is near its limit and we > drop scanning it. We'll have to work out what this means for the end > user. May be more fundamental look through is required at the priority > based logic of exposing how much to scan, I don't know. I also had this worry at first. Then I dismissed it because the page reclaim should be driven by "pages reclaimed" rather than "pages scanned". So when shrink_zone() decides to cancel one smallish scan, it may well be called again and accumulate nr_saved_scan. So it should only be a problem for a very small mem_cgroup (which may be _full_ scanned too many times in order to accumulate nr_saved_scan). Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-16 5:09 ` Balbir Singh 2009-08-16 5:41 ` Wu Fengguang 2009-08-16 5:50 ` Wu Fengguang @ 2009-08-18 15:57 ` KOSAKI Motohiro 2 siblings, 0 replies; 122+ messages in thread From: KOSAKI Motohiro @ 2009-08-18 15:57 UTC (permalink / raw) To: balbir Cc: kosaki.motohiro, Wu Fengguang, Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm > * Wu Fengguang <fengguang.wu@intel.com> [2009-08-15 13:45:24]: > > > On Fri, Aug 14, 2009 at 09:19:35PM +0800, Rik van Riel wrote: > > > Wu Fengguang wrote: > > > > On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote: > > > > > > > >> So even with the active list being a FIFO, we keep usage information > > > >> gathered from the inactive list. If we deactivate pages in arbitrary > > > >> list intervals, we throw this away. > > > > > > > > We do have the danger of FIFO, if inactive list is small enough, so > > > > that (unconditionally) deactivated pages quickly get reclaimed and > > > > their life window in inactive list is too small to be useful. > > > > > > This one of the reasons why we unconditionally deactivate > > > the active anon pages, and do background scanning of the > > > active anon list when reclaiming page cache pages. > > > > > > We want to always move some pages to the inactive anon > > > list, so it does not get too small. > > > > Right, the current code tries to pull inactive list out of > > smallish-size state as long as there are vmscan activities. > > > > However there is a possible (and tricky) hole: mem cgroups > > don't do batched vmscan. shrink_zone() may call shrink_list() > > with nr_to_scan=1, in which case shrink_list() _still_ calls > > isolate_pages() with the much larger SWAP_CLUSTER_MAX. > > > > It effectively scales up the inactive list scan rate by 10 times when > > it is still small, and may thus prevent it from growing up for ever. > > > > I think we need to possibly export some scanning data under DEBUG_VM > to cross verify. Sorry for the delay. How about this? ======================================= Subject: [PATCH] vmscan: show recent_scanned/rotated stat In a recent discussion, Balbir Singh pointed out that VM developers should be able to see recent_scanned/rotated statistics. This patch does it.
output example -------------------- % cat /proc/zoneinfo Node 0, zone DMA32 pages free 347590 min 613 low 766 high 919 (snip) inactive_ratio: 3 recent_rotated_anon: 127305 recent_rotated_file: 67439 recent_scanned_anon: 135591 recent_scanned_file: 180399 Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --- mm/vmstat.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) Index: b/mm/vmstat.c =================================================================== --- a/mm/vmstat.c 2009-08-08 14:16:53.000000000 +0900 +++ b/mm/vmstat.c 2009-08-18 22:07:25.000000000 +0900 @@ -762,6 +762,20 @@ static void zoneinfo_show_print(struct s zone->prev_priority, zone->zone_start_pfn, zone->inactive_ratio); + +#ifdef CONFIG_DEBUG_VM + seq_printf(m, + "\n recent_rotated_anon: %lu" + "\n recent_rotated_file: %lu" + "\n recent_scanned_anon: %lu" + "\n recent_scanned_file: %lu", + zone->reclaim_stat.recent_rotated[0], + zone->reclaim_stat.recent_rotated[1], + zone->reclaim_stat.recent_scanned[0], + zone->reclaim_stat.recent_scanned[1] + ); +#endif + seq_putc(m, '\n'); } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages? 2009-08-15 5:45 ` Wu Fengguang 2009-08-16 5:09 ` Balbir Singh @ 2009-08-17 18:04 ` Dike, Jeffrey G 2009-08-18 2:26 ` Wu Fengguang 2009-08-18 15:57 ` KOSAKI Motohiro 2 siblings, 1 reply; 122+ messages in thread From: Dike, Jeffrey G @ 2009-08-17 18:04 UTC (permalink / raw) To: Wu, Fengguang, Rik van Riel Cc: Johannes Weiner, Avi Kivity, KOSAKI Motohiro, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm > Jeff, can you confirm if the mem cgroup's inactive list is small? Nope. I have plenty on the inactive anon list, between 13K and 16K pages (i.e. 52M to 64M). The inactive mapped list is much smaller - 0 to ~700 pages. The active lists are comparable in size, but larger - 16K - 19K pages for anon and 60 - 450 pages for mapped. Jeff -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-17 18:04 ` Dike, Jeffrey G @ 2009-08-18 2:26 ` Wu Fengguang 2009-09-02 19:30 ` Dike, Jeffrey G 0 siblings, 1 reply; 122+ messages in thread From: Wu Fengguang @ 2009-08-18 2:26 UTC (permalink / raw) To: Dike, Jeffrey G Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On Tue, Aug 18, 2009 at 02:04:46AM +0800, Dike, Jeffrey G wrote: > > Jeff, can you confirm if the mem cgroup's inactive list is small? > > Nope. I have plenty on the inactive anon list, between 13K and 16K pages (i.e. 52M to 64M). > > The inactive mapped list is much smaller - 0 to ~700 pages. > > The active lists are comparable in size, but larger - 16K - 19K pages for anon and 60 - 450 pages for mapped. The anon inactive list is "over scanned". Take 16k pages for example, with DEF_PRIORITY=12, (16k >> 12) = 4. So when shrink_zone() expects to scan 4 pages in the active/inactive list, SWAP_CLUSTER_MAX=32 pages will be scanned in effect. This triggers the background aging of the active anon list because inactive_anon_is_low() is found to be true, which keeps the active:inactive ratio in balance. So anon inactive list over scanned => anon active list over scanned => anon lists over scanned relative to file lists. (The inactive file list may or may not be over scanned depending on its size <> (1<<prio) pages.) Anyway this is not the expected way vmscan should work, and batching up the cgroup vmscan could get rid of the mess. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
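The amplification is easy to quantify; a tiny program reproducing the arithmetic above (the list size is Jeff's, the constants are the kernel's):

#include <stdio.h>

#define DEF_PRIORITY		12
#define SWAP_CLUSTER_MAX	32

int main(void)
{
	unsigned long list_pages = 16 * 1024;			/* 16k-page list */
	unsigned long expected = list_pages >> DEF_PRIORITY;	/* 4 */

	printf("expected scan %lu, actual scan %d: %lux over-scan\n",
	       expected, SWAP_CLUSTER_MAX,
	       (unsigned long)SWAP_CLUSTER_MAX / expected);
	return 0;
}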
* RE: [RFC] respect the referenced bit of KVM guest pages? 2009-08-18 2:26 ` Wu Fengguang @ 2009-09-02 19:30 ` Dike, Jeffrey G 2009-09-03 2:04 ` Wu Fengguang 0 siblings, 1 reply; 122+ messages in thread From: Dike, Jeffrey G @ 2009-09-02 19:30 UTC (permalink / raw) To: Wu, Fengguang Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm I'm trying to better understand the motivation for your make-mapped-exec-pages-first-class-citizens patch. As I read your (very detailed!) description, you are diagnosing a threshold effect from Rik's evict-use-once-pages-first patch where if the inactive list is slightly smaller than the active list, the active list will start being scanned, pushing text (and other) pages onto the inactive list where they will be quickly kicked out to swap. As I read Rik's patch, if the active list is one page larger than the inactive list, then a batch of pages will get moved from one to the other. For this to have a noticeable effect on the system once the streaming is done, there must be something continuing to keep the active list larger than the inactive list. Maybe there is a consistent percentage of the streamed pages which are use-twice. So, we have a threshold effect where a small change in input (the size of the streaming file vs the number of active pages) causes a large change in output (lots of text pages suddenly start getting thrown out). My immediate reaction to that is that there shouldn't be this sudden change in behavior, and that maybe there should only be enough scanning in shrink_active_list to bring the two lists back to parity. However, if there's something keeping the active list bigger than the inactive list, this will just put off the inevitable required scanning. As for your patch, it seems like we have a problem with scanning I/O, and instead of looking at those pages, you are looking to protect some other set of pages (mapped text). That, in turn, increases pressure on anonymous pages (which is where I came in). Wouldn't it be a better idea to keep looking at those streaming pages and figure out how to get them out of memory quickly? Jeff -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-09-02 19:30 ` Dike, Jeffrey G @ 2009-09-03 2:04 ` Wu Fengguang 2009-09-04 20:06 ` Dike, Jeffrey G 0 siblings, 1 reply; 122+ messages in thread From: Wu Fengguang @ 2009-09-03 2:04 UTC (permalink / raw) To: Dike, Jeffrey G Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm Jeff, On Thu, Sep 03, 2009 at 03:30:59AM +0800, Dike, Jeffrey G wrote: > I'm trying to better understand the motivation for your > make-mapped-exec-pages-first-class-citizens patch. As I read your > (very detailed!) description, you are diagnosing a threshold effect > from Rik's evict-use-once-pages-first patch where if the inactive > list is slightly smaller than the active list, the active list will > start being scanned, pushing text (and other) pages onto the > inactive list where they will be quickly kicked out to swap. Right. > As I read Rik's patch, if the active list is one page larger than > the inactive list, then a batch of pages will get moved from one to > the other. For this to have a noticeable effect on the system once > the streaming is done, there must be something continuing to keep > the active list larger than the inactive list. Maybe there is a > consistent percentage of the streamed pages which are use-twice. Right. Besides the use-twice case, I also explored the desktop-working-set-cannot-fit-in-memory case in the patch. > So, we a threshold effect where a small change in input (the size of > the streaming file vs the number of active pages) causes a large > change in output (lots of text pages suddenly start getting thrown > out). My immediate reaction to that is that there shouldn't be > this sudden change in behavior, and that maybe there should only be > enough scanning in shink_active_list to bring the two lists back to > parity. However, if there's something keeping the active list > bigger than the inactive list, this will just put off the inevitable > required scanning. Yes there will be a sudden "behavior change" as soon as the active list grows larger than the inactive list. However the "output change" is bounded and not as large, because that extra behavior to scan the active list stops as soon as the two lists are back to parity. > As for your patch, it seems like we have a problem with scanning > I/O, and instead of looking at those pages, you are looking to > protect some other set of pages (mapped text). That, in turn, > increases pressure on anonymous pages (which is where I came in). > Wouldn't it be a better idea to keep looking at those streaming > pages and figure out how to get them out of memory quickly? The scanning I/O problem has been largely addressed by Rik's patch. It is not optimal (which is hard), but fair enough for common cases. Your kvm test case sounds like desktop-working-set-cannot-fit-in-memory. In that case, it obviously pays to protect the exec-mapped pages, and there are not too many kvm exec-mapped pages to impact anon pages. I ran a kvm and collected its number of exec-mapped pages as follows. They sum up to ~3MB. This is not a big source of memory pressure.
Thanks, Fengguang --- Rss of kvm: % grep -A2 x /proc/7640/smaps | grep -v Size 00400000-005fe000 r-xp 00000000 08:02 1890389 /usr/bin/kvm Rss: 680 kB -- 7f6c029f9000-7f6c02a0f000 r-xp 00000000 08:02 458771 /lib/libgcc_s.so.1 Rss: 16 kB -- 7f6c02c10000-7f6c02d00000 r-xp 00000000 08:02 1885409 /usr/lib/libstdc++.so.6.0.10 Rss: 364 kB -- 7f6c03bf2000-7f6c03bf7000 r-xp 00000000 08:02 1887873 /usr/lib/libXdmcp.so.6.0.0 Rss: 12 kB -- 7f6c03df7000-7f6c03df9000 r-xp 00000000 08:02 1887871 /usr/lib/libXau.so.6.0.0 Rss: 8 kB -- 7f6c03ff9000-7f6c04019000 r-xp 00000000 08:02 458890 /lib/libx86.so.1 Rss: 36 kB -- 7f6c04019000-7f6c04218000 ---p 00020000 08:02 458890 /lib/libx86.so.1 Rss: 0 kB -- 7f6c04218000-7f6c0421a000 rw-p 0001f000 08:02 458890 /lib/libx86.so.1 Rss: 8 kB -- 7f6c0421b000-7f6c0421f000 r-xp 00000000 08:02 458861 /lib/libattr.so.1.1.0 Rss: 12 kB -- 7f6c0441f000-7f6c04434000 r-xp 00000000 08:02 460739 /lib/libnsl-2.9.so Rss: 24 kB -- 7f6c04637000-7f6c04647000 r-xp 00000000 08:02 1897259 /usr/lib/libXext.so.6.4.0 Rss: 20 kB -- 7f6c04647000-7f6c04847000 ---p 00010000 08:02 1897259 /usr/lib/libXext.so.6.4.0 Rss: 0 kB -- 7f6c04847000-7f6c04848000 rw-p 00010000 08:02 1897259 /usr/lib/libXext.so.6.4.0 Rss: 4 kB -- 7f6c04848000-7f6c04978000 r-xp 00000000 08:02 1889103 /usr/lib/libicuuc.so.38.1 Rss: 244 kB -- 7f6c04b89000-7f6c04ba4000 r-xp 00000000 08:02 1886322 /usr/lib/libxcb.so.1.1.0 Rss: 28 kB -- 7f6c04ba4000-7f6c04da4000 ---p 0001b000 08:02 1886322 /usr/lib/libxcb.so.1.1.0 Rss: 0 kB -- 7f6c04da4000-7f6c04da5000 rw-p 0001b000 08:02 1886322 /usr/lib/libxcb.so.1.1.0 Rss: 4 kB -- 7f6c04da5000-7f6c04df3000 r-xp 00000000 08:02 1899899 /usr/lib/libvga.so.1.4.3 Rss: 68 kB -- 7f6c05004000-7f6c05018000 r-xp 00000000 08:02 1896343 /usr/lib/libdirect-1.0.so.0.1.0 Rss: 24 kB -- 7f6c05219000-7f6c05221000 r-xp 00000000 08:02 1892825 /usr/lib/libfusion-1.0.so.0.1.0 Rss: 16 kB -- 7f6c05421000-7f6c0548d000 r-xp 00000000 08:02 1892826 /usr/lib/libdirectfb-1.0.so.0.1.0 Rss: 64 kB -- 7f6c05691000-7f6c056a4000 r-xp 00000000 08:02 460720 /lib/libresolv-2.9.so Rss: 20 kB -- 7f6c058a7000-7f6c058aa000 r-xp 00000000 08:02 1895568 /usr/lib/libgpg-error.so.0.4.0 Rss: 4 kB -- 7f6c05aaa000-7f6c05b1d000 r-xp 00000000 08:02 1890493 /usr/lib/libgcrypt.so.11.5.2 Rss: 36 kB -- 7f6c05d20000-7f6c05d30000 r-xp 00000000 08:02 2081187 /usr/lib/libtasn1.so.3.1.2 Rss: 12 kB -- 7f6c05f30000-7f6c05f34000 r-xp 00000000 08:02 458960 /lib/libcap.so.2.11 Rss: 12 kB -- 7f6c06134000-7f6c06170000 r-xp 00000000 08:02 1889247 /usr/lib/libdbus-1.so.3.4.0 Rss: 36 kB -- 7f6c06372000-7f6c06376000 r-xp 00000000 08:02 1891735 /usr/lib/libasyncns.so.0.1.0 Rss: 12 kB -- 7f6c06576000-7f6c0657e000 r-xp 00000000 08:02 458834 /lib/libwrap.so.0.7.6 Rss: 20 kB -- 7f6c0677f000-7f6c06784000 r-xp 00000000 08:02 1888031 /usr/lib/libXtst.so.6.1.0 Rss: 12 kB -- 7f6c06985000-7f6c0698d000 r-xp 00000000 08:02 1885331 /usr/lib/libSM.so.6.0.0 Rss: 12 kB -- 7f6c06b8d000-7f6c06ba3000 r-xp 00000000 08:02 1897238 /usr/lib/libICE.so.6.3.0 Rss: 28 kB -- 7f6c06da8000-7f6c06dee000 r-xp 00000000 08:02 2147381 /usr/lib/libpulsecommon-0.9.15.so Rss: 56 kB -- 7f6c06fef000-7f6c06ff1000 r-xp 00000000 08:02 460735 /lib/libdl-2.9.so Rss: 8 kB -- 7f6c071f3000-7f6c0722d000 r-xp 00000000 08:02 2147380 /usr/lib/libpulse.so.0.8.0 Rss: 44 kB -- 7f6c0742f000-7f6c07578000 r-xp 00000000 08:02 460721 /lib/libc-2.9.so Rss: 464 kB -- 7f6c07782000-7f6c07786000 r-xp 00000000 08:02 1892933 /usr/lib/libvdeplug.so.2.1.0 Rss: 12 kB -- 7f6c07986000-7f6c0798f000 r-xp 00000000 08:02 459991 
/lib/libbrlapi.so.0.5.1 Rss: 20 kB -- 7f6c07b91000-7f6c07bcb000 r-xp 00000000 08:02 458804 /lib/libncurses.so.5.6 Rss: 76 kB -- 7f6c07dd0000-7f6c07f05000 r-xp 00000000 08:02 1886324 /usr/lib/libX11.so.6.2.0 Rss: 120 kB -- 7f6c0810b000-7f6c08173000 r-xp 00000000 08:02 1892212 /usr/lib/libSDL-1.2.so.0.11.1 Rss: 36 kB -- 7f6c083c1000-7f6c083c3000 r-xp 00000000 08:02 460719 /lib/libutil-2.9.so Rss: 8 kB -- 7f6c085c4000-7f6c085cb000 r-xp 00000000 08:02 460727 /lib/librt-2.9.so Rss: 24 kB -- 7f6c087cc000-7f6c087e2000 r-xp 00000000 08:02 460725 /lib/libpthread-2.9.so Rss: 68 kB -- 7f6c089e7000-7f6c089f2000 r-xp 00000000 08:02 1889339 /usr/lib/libpci.so.3.1.2 Rss: 16 kB -- 7f6c08bf2000-7f6c08c0c000 r-xp 00000000 08:02 2146929 /usr/lib/libbluetooth.so.3.2.5 Rss: 32 kB -- 7f6c08e0d000-7f6c08eb4000 r-xp 00000000 08:02 1896347 /usr/lib/libgnutls.so.26.11.5 Rss: 136 kB -- 7f6c090bf000-7f6c090c2000 r-xp 00000000 08:02 2147382 /usr/lib/libpulse-simple.so.0.0.2 Rss: 12 kB -- 7f6c092c3000-7f6c093a0000 r-xp 00000000 08:02 1885450 /usr/lib/libasound.so.2.0.0 Rss: 176 kB -- 7f6c095a7000-7f6c095bd000 r-xp 00000000 08:02 1885377 /usr/lib/libz.so.1.2.3.3 Rss: 16 kB -- 7f6c097be000-7f6c09840000 r-xp 00000000 08:02 460724 /lib/libm-2.9.so Rss: 20 kB -- 7f6c09a41000-7f6c09a5e000 r-xp 00000000 08:02 459078 /lib/ld-2.9.so Rss: 96 kB -- 7f6c09b2c000-7f6c09b31000 r-xp 00000000 08:02 1886943 /usr/lib/libgdbm.so.3.0.0 Rss: 12 kB -- 7fff54d4f000-7fff54d50000 r-xp 00000000 00:00 0 [vdso] Rss: 4 kB -- ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] Rss: 0 kB -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages? 2009-09-03 2:04 ` Wu Fengguang @ 2009-09-04 20:06 ` Dike, Jeffrey G 2009-09-04 20:57 ` Rik van Riel 0 siblings, 1 reply; 122+ messages in thread From: Dike, Jeffrey G @ 2009-09-04 20:06 UTC (permalink / raw) To: Wu, Fengguang Cc: Rik van Riel, Johannes Weiner, Avi Kivity, KOSAKI Motohiro, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm Stupid question - what in your patch allows a text page to get kicked out to the inactive list after you've given it an extra pass through the active list? Jeff -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-09-04 20:06 ` Dike, Jeffrey G @ 2009-09-04 20:57 ` Rik van Riel 0 siblings, 0 replies; 122+ messages in thread From: Rik van Riel @ 2009-09-04 20:57 UTC (permalink / raw) To: Dike, Jeffrey G Cc: Wu, Fengguang, Johannes Weiner, Avi Kivity, KOSAKI Motohiro, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm Dike, Jeffrey G wrote: > Stupid question - what in your patch allows a text page get kicked out to the inactive list after you've given it an extra pass through the active list? If it did not get referenced during its second pass through the active list, it will get deactivated. -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-15 5:45 ` Wu Fengguang 2009-08-16 5:09 ` Balbir Singh 2009-08-17 18:04 ` Dike, Jeffrey G @ 2009-08-18 15:57 ` KOSAKI Motohiro 2009-08-19 12:08 ` Wu Fengguang 2009-08-19 13:40 ` [RFC] memcg: move definitions to .h and inline some functions Wu Fengguang 2 siblings, 2 replies; 122+ messages in thread From: KOSAKI Motohiro @ 2009-08-18 15:57 UTC (permalink / raw) To: Wu Fengguang Cc: kosaki.motohiro, Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm > > This one of the reasons why we unconditionally deactivate > > the active anon pages, and do background scanning of the > > active anon list when reclaiming page cache pages. > > > > We want to always move some pages to the inactive anon > > list, so it does not get too small. > > Right, the current code tries to pull inactive list out of > smallish-size state as long as there are vmscan activities. > > However there is a possible (and tricky) hole: mem cgroups > don't do batched vmscan. shrink_zone() may call shrink_list() > with nr_to_scan=1, in which case shrink_list() _still_ calls > isolate_pages() with the much larger SWAP_CLUSTER_MAX. > > It effectively scales up the inactive list scan rate by 10 times when > it is still small, and may thus prevent it from growing up for ever. > > In that case, LRU becomes FIFO. > > Jeff, can you confirm if the mem cgroup's inactive list is small? > If so, this patch should help. This patch does the right thing. However, let me explain why I and the memcg folks didn't do that in the past. Strangely, some memcg struct declarations are hidden in *.c. Thus we can't make inline functions, and we hesitated to introduce extra function call overhead. So, can we move some memcg structure declarations to *.h and make mem_cgroup_get_saved_scan() an inline function?
> > Thanks, > Fengguang > --- > > mm: do batched scans for mem_cgroup > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > --- > include/linux/memcontrol.h | 3 +++ > mm/memcontrol.c | 12 ++++++++++++ > mm/vmscan.c | 9 +++++---- > 3 files changed, 20 insertions(+), 4 deletions(-) > > --- linux.orig/include/linux/memcontrol.h 2009-08-15 13:12:49.000000000 +0800 > +++ linux/include/linux/memcontrol.h 2009-08-15 13:18:13.000000000 +0800 > @@ -98,6 +98,9 @@ int mem_cgroup_inactive_file_is_low(stru > unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, > struct zone *zone, > enum lru_list lru); > +unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg, > + struct zone *zone, > + enum lru_list lru); > struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, > struct zone *zone); > struct zone_reclaim_stat* > --- linux.orig/mm/memcontrol.c 2009-08-15 13:07:34.000000000 +0800 > +++ linux/mm/memcontrol.c 2009-08-15 13:17:56.000000000 +0800 > @@ -115,6 +115,7 @@ struct mem_cgroup_per_zone { > */ > struct list_head lists[NR_LRU_LISTS]; > unsigned long count[NR_LRU_LISTS]; > + unsigned long nr_saved_scan[NR_LRU_LISTS]; > > struct zone_reclaim_stat reclaim_stat; > }; > @@ -597,6 +598,17 @@ unsigned long mem_cgroup_zone_nr_pages(s > return MEM_CGROUP_ZSTAT(mz, lru); > } > > +unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg, > + struct zone *zone, > + enum lru_list lru) > +{ > + int nid = zone->zone_pgdat->node_id; > + int zid = zone_idx(zone); > + struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid); > + > + return &mz->nr_saved_scan[lru]; > +} I think this function is a bit strange. shrink_zone() doesn't hold any lock, so shouldn't we care about the memcg removal race? > + > struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, > struct zone *zone) > { > --- linux.orig/mm/vmscan.c 2009-08-15 13:04:54.000000000 +0800 > +++ linux/mm/vmscan.c 2009-08-15 13:19:03.000000000 +0800 > @@ -1534,6 +1534,7 @@ static void shrink_zone(int priority, st > for_each_evictable_lru(l) { > int file = is_file_lru(l); > unsigned long scan; > + unsigned long *saved_scan; > > scan = zone_nr_pages(zone, sc, l); > if (priority || noswap) { > @@ -1541,11 +1542,11 @@ static void shrink_zone(int priority, st > scan = (scan * percent[file]) / 100; > } > if (scanning_global_lru(sc)) > - nr[l] = nr_scan_try_batch(scan, > - &zone->lru[l].nr_saved_scan, > - swap_cluster_max); > + saved_scan = &zone->lru[l].nr_saved_scan; > else > - nr[l] = scan; > + saved_scan = mem_cgroup_get_saved_scan(sc->mem_cgroup, > + zone, l); > + nr[l] = nr_scan_try_batch(scan, saved_scan, swap_cluster_max); > } > > while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] || -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
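The constraint KOSAKI points at is the usual incomplete-type rule; a minimal illustration (the helper name is made up):

/* In the header only a forward declaration is visible today ... */
struct mem_cgroup;

static inline unsigned long get_saved_scan_stub(struct mem_cgroup *memcg)
{
	/*
	 * return memcg->...;  <-- would not compile here: the struct is
	 * defined inside mm/memcontrol.c, so its layout is unknown to
	 * other translation units.  Moving the definition into the
	 * header is what makes a real inline accessor possible.
	 */
	(void)memcg;
	return 0;
}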
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-18 15:57 ` KOSAKI Motohiro @ 2009-08-19 12:08 ` Wu Fengguang 2009-08-19 13:40 ` [RFC] memcg: move definitions to .h and inline some functions Wu Fengguang 1 sibling, 0 replies; 122+ messages in thread From: Wu Fengguang @ 2009-08-19 12:08 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On Tue, Aug 18, 2009 at 11:57:52PM +0800, KOSAKI Motohiro wrote: > > > > This one of the reasons why we unconditionally deactivate > > > the active anon pages, and do background scanning of the > > > active anon list when reclaiming page cache pages. > > > > > > We want to always move some pages to the inactive anon > > > list, so it does not get too small. > > > > Right, the current code tries to pull inactive list out of > > smallish-size state as long as there are vmscan activities. > > > > However there is a possible (and tricky) hole: mem cgroups > > don't do batched vmscan. shrink_zone() may call shrink_list() > > with nr_to_scan=1, in which case shrink_list() _still_ calls > > isolate_pages() with the much larger SWAP_CLUSTER_MAX. > > > > It effectively scales up the inactive list scan rate by 10 times when > > it is still small, and may thus prevent it from growing up for ever. > > > > In that case, LRU becomes FIFO. > > > > Jeff, can you confirm if the mem cgroup's inactive list is small? > > If so, this patch should help. > > This patch does right thing. > However, I would explain why I and memcg folks didn't do that in past days. > > Strangely, some memcg struct declaration is hide in *.c. Thus we can't > make inline function and we hesitated to introduce many function calling > overhead. > > So, Can we move some memcg structure declaration to *.h and make > mem_cgroup_get_saved_scan() inlined function? Good idea, I'll do that btw. 
> > > > > Thanks, > > Fengguang > > --- > > > > mm: do batched scans for mem_cgroup > > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > > --- > > include/linux/memcontrol.h | 3 +++ > > mm/memcontrol.c | 12 ++++++++++++ > > mm/vmscan.c | 9 +++++---- > > 3 files changed, 20 insertions(+), 4 deletions(-) > > > > --- linux.orig/include/linux/memcontrol.h 2009-08-15 13:12:49.000000000 +0800 > > +++ linux/include/linux/memcontrol.h 2009-08-15 13:18:13.000000000 +0800 > > @@ -98,6 +98,9 @@ int mem_cgroup_inactive_file_is_low(stru > > unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, > > struct zone *zone, > > enum lru_list lru); > > +unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg, > > + struct zone *zone, > > + enum lru_list lru); > > struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, > > struct zone *zone); > > struct zone_reclaim_stat* > > --- linux.orig/mm/memcontrol.c 2009-08-15 13:07:34.000000000 +0800 > > +++ linux/mm/memcontrol.c 2009-08-15 13:17:56.000000000 +0800 > > @@ -115,6 +115,7 @@ struct mem_cgroup_per_zone { > > */ > > struct list_head lists[NR_LRU_LISTS]; > > unsigned long count[NR_LRU_LISTS]; > > + unsigned long nr_saved_scan[NR_LRU_LISTS]; > > > > struct zone_reclaim_stat reclaim_stat; > > }; > > @@ -597,6 +598,17 @@ unsigned long mem_cgroup_zone_nr_pages(s > > return MEM_CGROUP_ZSTAT(mz, lru); > > } > > > > +unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg, > > + struct zone *zone, > > + enum lru_list lru) > > +{ > > + int nid = zone->zone_pgdat->node_id; > > + int zid = zone_idx(zone); > > + struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid); > > + > > + return &mz->nr_saved_scan[lru]; > > +} > > I think this fuction is a bit strange. > shrink_zone don't hold any lock. so, shouldn't we case memcg removing race? We've been doing that racy computation for a long time. It may hurt balancing a bit. But the balanced vmscan was never perfect, and not required to be perfect. So let's just go with it? Thanks, Fengguang > > > + > > struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, > > struct zone *zone) > > { > > --- linux.orig/mm/vmscan.c 2009-08-15 13:04:54.000000000 +0800 > > +++ linux/mm/vmscan.c 2009-08-15 13:19:03.000000000 +0800 > > @@ -1534,6 +1534,7 @@ static void shrink_zone(int priority, st > > for_each_evictable_lru(l) { > > int file = is_file_lru(l); > > unsigned long scan; > > + unsigned long *saved_scan; > > > > scan = zone_nr_pages(zone, sc, l); > > if (priority || noswap) { > > @@ -1541,11 +1542,11 @@ static void shrink_zone(int priority, st > > scan = (scan * percent[file]) / 100; > > } > > if (scanning_global_lru(sc)) > > - nr[l] = nr_scan_try_batch(scan, > > - &zone->lru[l].nr_saved_scan, > > - swap_cluster_max); > > + saved_scan = &zone->lru[l].nr_saved_scan; > > else > > - nr[l] = scan; > > + saved_scan = mem_cgroup_get_saved_scan(sc->mem_cgroup, > > + zone, l); > > + nr[l] = nr_scan_try_batch(scan, saved_scan, swap_cluster_max); > > } > > > > while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] || -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* [RFC] memcg: move definitions to .h and inline some functions 2009-08-18 15:57 ` KOSAKI Motohiro 2009-08-19 12:08 ` Wu Fengguang @ 2009-08-19 13:40 ` Wu Fengguang 2009-08-19 14:18 ` KAMEZAWA Hiroyuki 1 sibling, 1 reply; 122+ messages in thread From: Wu Fengguang @ 2009-08-19 13:40 UTC (permalink / raw) To: KOSAKI Motohiro, Balbir Singh, KAMEZAWA Hiroyuki Cc: Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm, nishimura@mxp.nes.nec.co.jp, lizf@cn.fujitsu.com, menage@google.com On Tue, Aug 18, 2009 at 11:57:52PM +0800, KOSAKI Motohiro wrote: > > > > This one of the reasons why we unconditionally deactivate > > > the active anon pages, and do background scanning of the > > > active anon list when reclaiming page cache pages. > > > > > > We want to always move some pages to the inactive anon > > > list, so it does not get too small. > > > > Right, the current code tries to pull inactive list out of > > smallish-size state as long as there are vmscan activities. > > > > However there is a possible (and tricky) hole: mem cgroups > > don't do batched vmscan. shrink_zone() may call shrink_list() > > with nr_to_scan=1, in which case shrink_list() _still_ calls > > isolate_pages() with the much larger SWAP_CLUSTER_MAX. > > > > It effectively scales up the inactive list scan rate by 10 times when > > it is still small, and may thus prevent it from growing up for ever. > > > > In that case, LRU becomes FIFO. > > > > Jeff, can you confirm if the mem cgroup's inactive list is small? > > If so, this patch should help. > > This patch does right thing. > However, I would explain why I and memcg folks didn't do that in past days. > > Strangely, some memcg struct declaration is hide in *.c. Thus we can't > make inline function and we hesitated to introduce many function calling > overhead. > > So, Can we move some memcg structure declaration to *.h and make > mem_cgroup_get_saved_scan() inlined function? OK here it is. I have to move big chunks to make it compile, and it does reduced a dozen lines of code :) Is this big copy&paste acceptable? (memcg developers CCed). Thanks, Fengguang --- memcg: move definitions to .h and inline some functions So as to make inline functions. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- include/linux/memcontrol.h | 154 ++++++++++++++++++++++++++++++----- mm/memcontrol.c | 131 ----------------------------- 2 files changed, 134 insertions(+), 151 deletions(-) --- linux.orig/include/linux/memcontrol.h 2009-08-19 20:18:55.000000000 +0800 +++ linux/include/linux/memcontrol.h 2009-08-19 20:51:06.000000000 +0800 @@ -20,11 +20,144 @@ #ifndef _LINUX_MEMCONTROL_H #define _LINUX_MEMCONTROL_H #include <linux/cgroup.h> -struct mem_cgroup; +#include <linux/res_counter.h> struct page_cgroup; struct page; struct mm_struct; +/* + * Statistics for memory cgroup. + */ +enum mem_cgroup_stat_index { + /* + * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss. 
+ */ + MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */ + MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */ + MEM_CGROUP_STAT_MAPPED_FILE, /* # of pages charged as file rss */ + MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */ + MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */ + + MEM_CGROUP_STAT_NSTATS, +}; + +struct mem_cgroup_stat_cpu { + s64 count[MEM_CGROUP_STAT_NSTATS]; +} ____cacheline_aligned_in_smp; + +struct mem_cgroup_stat { + struct mem_cgroup_stat_cpu cpustat[0]; +}; + +/* + * per-zone information in memory controller. + */ +struct mem_cgroup_per_zone { + /* + * spin_lock to protect the per cgroup LRU + */ + struct list_head lists[NR_LRU_LISTS]; + unsigned long count[NR_LRU_LISTS]; + + struct zone_reclaim_stat reclaim_stat; +}; +/* Macro for accessing counter */ +#define MEM_CGROUP_ZSTAT(mz, idx) ((mz)->count[(idx)]) + +struct mem_cgroup_per_node { + struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES]; +}; + +struct mem_cgroup_lru_info { + struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES]; +}; + +/* + * The memory controller data structure. The memory controller controls both + * page cache and RSS per cgroup. We would eventually like to provide + * statistics based on the statistics developed by Rik Van Riel for clock-pro, + * to help the administrator determine what knobs to tune. + * + * TODO: Add a water mark for the memory controller. Reclaim will begin when + * we hit the water mark. May be even add a low water mark, such that + * no reclaim occurs from a cgroup at it's low water mark, this is + * a feature that will be implemented much later in the future. + */ +struct mem_cgroup { + struct cgroup_subsys_state css; + /* + * the counter to account for memory usage + */ + struct res_counter res; + /* + * the counter to account for mem+swap usage. + */ + struct res_counter memsw; + /* + * Per cgroup active and inactive list, similar to the + * per zone LRU lists. + */ + struct mem_cgroup_lru_info info; + + /* + protect against reclaim related member. + */ + spinlock_t reclaim_param_lock; + + int prev_priority; /* for recording reclaim priority */ + + /* + * While reclaiming in a hiearchy, we cache the last child we + * reclaimed from. + */ + int last_scanned_child; + /* + * Should the accounting and control be hierarchical, per subtree? + */ + bool use_hierarchy; + unsigned long last_oom_jiffies; + atomic_t refcnt; + + unsigned int swappiness; + + /* set when res.limit == memsw.limit */ + bool memsw_is_minimum; + + /* + * statistics. This must be placed at the end of memcg. 
+ */ + struct mem_cgroup_stat stat; +}; + +static inline struct mem_cgroup_per_zone * +mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid) +{ + return &mem->info.nodeinfo[nid]->zoneinfo[zid]; +} + +static inline unsigned long +mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, + struct zone *zone, + enum lru_list lru) +{ + int nid = zone->zone_pgdat->node_id; + int zid = zone_idx(zone); + struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid); + + return MEM_CGROUP_ZSTAT(mz, lru); +} + +static inline struct zone_reclaim_stat * +mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone) +{ + int nid = zone->zone_pgdat->node_id; + int zid = zone_idx(zone); + struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid); + + return &mz->reclaim_stat; +} + + #ifdef CONFIG_CGROUP_MEM_RES_CTLR /* * All "charge" functions with gfp_mask should use GFP_KERNEL or @@ -95,11 +228,6 @@ extern void mem_cgroup_record_reclaim_pr int priority); int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg); int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg); -unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, - struct zone *zone, - enum lru_list lru); -struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, - struct zone *zone); struct zone_reclaim_stat* mem_cgroup_get_reclaim_stat_from_page(struct page *page); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, @@ -246,20 +374,6 @@ mem_cgroup_inactive_file_is_low(struct m return 1; } -static inline unsigned long -mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone, - enum lru_list lru) -{ - return 0; -} - - -static inline struct zone_reclaim_stat* -mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone) -{ - return NULL; -} - static inline struct zone_reclaim_stat* mem_cgroup_get_reclaim_stat_from_page(struct page *page) { --- linux.orig/mm/memcontrol.c 2009-08-19 20:14:56.000000000 +0800 +++ linux/mm/memcontrol.c 2009-08-19 20:46:50.000000000 +0800 @@ -55,30 +55,6 @@ static int really_do_swap_account __init static DEFINE_MUTEX(memcg_tasklist); /* can be hold under cgroup_mutex */ /* - * Statistics for memory cgroup. - */ -enum mem_cgroup_stat_index { - /* - * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss. - */ - MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */ - MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */ - MEM_CGROUP_STAT_MAPPED_FILE, /* # of pages charged as file rss */ - MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */ - MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */ - - MEM_CGROUP_STAT_NSTATS, -}; - -struct mem_cgroup_stat_cpu { - s64 count[MEM_CGROUP_STAT_NSTATS]; -} ____cacheline_aligned_in_smp; - -struct mem_cgroup_stat { - struct mem_cgroup_stat_cpu cpustat[0]; -}; - -/* * For accounting under irq disable, no need for increment preempt count. */ static inline void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat_cpu *stat, @@ -106,86 +82,6 @@ static s64 mem_cgroup_local_usage(struct return ret; } -/* - * per-zone information in memory controller. 
- */ -struct mem_cgroup_per_zone { - /* - * spin_lock to protect the per cgroup LRU - */ - struct list_head lists[NR_LRU_LISTS]; - unsigned long count[NR_LRU_LISTS]; - - struct zone_reclaim_stat reclaim_stat; -}; -/* Macro for accessing counter */ -#define MEM_CGROUP_ZSTAT(mz, idx) ((mz)->count[(idx)]) - -struct mem_cgroup_per_node { - struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES]; -}; - -struct mem_cgroup_lru_info { - struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES]; -}; - -/* - * The memory controller data structure. The memory controller controls both - * page cache and RSS per cgroup. We would eventually like to provide - * statistics based on the statistics developed by Rik Van Riel for clock-pro, - * to help the administrator determine what knobs to tune. - * - * TODO: Add a water mark for the memory controller. Reclaim will begin when - * we hit the water mark. May be even add a low water mark, such that - * no reclaim occurs from a cgroup at it's low water mark, this is - * a feature that will be implemented much later in the future. - */ -struct mem_cgroup { - struct cgroup_subsys_state css; - /* - * the counter to account for memory usage - */ - struct res_counter res; - /* - * the counter to account for mem+swap usage. - */ - struct res_counter memsw; - /* - * Per cgroup active and inactive list, similar to the - * per zone LRU lists. - */ - struct mem_cgroup_lru_info info; - - /* - protect against reclaim related member. - */ - spinlock_t reclaim_param_lock; - - int prev_priority; /* for recording reclaim priority */ - - /* - * While reclaiming in a hiearchy, we cache the last child we - * reclaimed from. - */ - int last_scanned_child; - /* - * Should the accounting and control be hierarchical, per subtree? - */ - bool use_hierarchy; - unsigned long last_oom_jiffies; - atomic_t refcnt; - - unsigned int swappiness; - - /* set when res.limit == memsw.limit */ - bool memsw_is_minimum; - - /* - * statistics. This must be placed at the end of memcg. - */ - struct mem_cgroup_stat stat; -}; - enum charge_type { MEM_CGROUP_CHARGE_TYPE_CACHE = 0, MEM_CGROUP_CHARGE_TYPE_MAPPED, @@ -244,12 +140,6 @@ static void mem_cgroup_charge_statistics } static struct mem_cgroup_per_zone * -mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid) -{ - return &mem->info.nodeinfo[nid]->zoneinfo[zid]; -} - -static struct mem_cgroup_per_zone * page_cgroup_zoneinfo(struct page_cgroup *pc) { struct mem_cgroup *mem = pc->mem_cgroup; @@ -586,27 +476,6 @@ int mem_cgroup_inactive_file_is_low(stru return (active > inactive); } -unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, - struct zone *zone, - enum lru_list lru) -{ - int nid = zone->zone_pgdat->node_id; - int zid = zone_idx(zone); - struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid); - - return MEM_CGROUP_ZSTAT(mz, lru); -} - -struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, - struct zone *zone) -{ - int nid = zone->zone_pgdat->node_id; - int zid = zone_idx(zone); - struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid); - - return &mz->reclaim_stat; -} - struct zone_reclaim_stat * mem_cgroup_get_reclaim_stat_from_page(struct page *page) { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] memcg: move definitions to .h and inline some functions 2009-08-19 13:40 ` [RFC] memcg: move definitions to .h and inline some functions Wu Fengguang @ 2009-08-19 14:18 ` KAMEZAWA Hiroyuki 2009-08-19 14:27 ` Balbir Singh 0 siblings, 1 reply; 122+ messages in thread From: KAMEZAWA Hiroyuki @ 2009-08-19 14:18 UTC (permalink / raw) To: Wu Fengguang Cc: KOSAKI Motohiro, Balbir Singh, KAMEZAWA Hiroyuki, Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm, nishimura@mxp.nes.nec.co.jp, lizf@cn.fujitsu.com, menage@google.com Wu Fengguang さんは書きました: > On Tue, Aug 18, 2009 at 11:57:52PM +0800, KOSAKI Motohiro wrote: >> >> > > This one of the reasons why we unconditionally deactivate >> > > the active anon pages, and do background scanning of the >> > > active anon list when reclaiming page cache pages. >> > > >> > > We want to always move some pages to the inactive anon >> > > list, so it does not get too small. >> > >> > Right, the current code tries to pull inactive list out of >> > smallish-size state as long as there are vmscan activities. >> > >> > However there is a possible (and tricky) hole: mem cgroups >> > don't do batched vmscan. shrink_zone() may call shrink_list() >> > with nr_to_scan=1, in which case shrink_list() _still_ calls >> > isolate_pages() with the much larger SWAP_CLUSTER_MAX. >> > >> > It effectively scales up the inactive list scan rate by 10 times when >> > it is still small, and may thus prevent it from growing up for ever. >> > >> > In that case, LRU becomes FIFO. >> > >> > Jeff, can you confirm if the mem cgroup's inactive list is small? >> > If so, this patch should help. >> >> This patch does right thing. >> However, I would explain why I and memcg folks didn't do that in past >> days. >> >> Strangely, some memcg struct declaration is hide in *.c. Thus we can't >> make inline function and we hesitated to introduce many function calling >> overhead. >> >> So, Can we move some memcg structure declaration to *.h and make >> mem_cgroup_get_saved_scan() inlined function? > > OK here it is. I have to move big chunks to make it compile, and it > does reduced a dozen lines of code :) > > Is this big copy&paste acceptable? (memcg developers CCed). > > Thanks, > Fengguang I don't like this. plz add hooks to necessary places, at this stage. This will be too big for inlined function, anyway. plz move this after you find overhead is too big. Thanks, -Kame > --- > > memcg: move definitions to .h and inline some functions > > So as to make inline functions. > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > --- > include/linux/memcontrol.h | 154 ++++++++++++++++++++++++++++++----- > mm/memcontrol.c | 131 ----------------------------- > 2 files changed, 134 insertions(+), 151 deletions(-) > > --- linux.orig/include/linux/memcontrol.h 2009-08-19 20:18:55.000000000 > +0800 > +++ linux/include/linux/memcontrol.h 2009-08-19 20:51:06.000000000 +0800 > @@ -20,11 +20,144 @@ > #ifndef _LINUX_MEMCONTROL_H > #define _LINUX_MEMCONTROL_H > #include <linux/cgroup.h> > -struct mem_cgroup; > +#include <linux/res_counter.h> > struct page_cgroup; > struct page; > struct mm_struct; > > +/* > + * Statistics for memory cgroup. > + */ > +enum mem_cgroup_stat_index { > + /* > + * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss. 
> + */ > + MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */ > + MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */ > + MEM_CGROUP_STAT_MAPPED_FILE, /* # of pages charged as file rss */ > + MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */ > + MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */ > + > + MEM_CGROUP_STAT_NSTATS, > +}; > + > +struct mem_cgroup_stat_cpu { > + s64 count[MEM_CGROUP_STAT_NSTATS]; > +} ____cacheline_aligned_in_smp; > + > +struct mem_cgroup_stat { > + struct mem_cgroup_stat_cpu cpustat[0]; > +}; > + > +/* > + * per-zone information in memory controller. > + */ > +struct mem_cgroup_per_zone { > + /* > + * spin_lock to protect the per cgroup LRU > + */ > + struct list_head lists[NR_LRU_LISTS]; > + unsigned long count[NR_LRU_LISTS]; > + > + struct zone_reclaim_stat reclaim_stat; > +}; > +/* Macro for accessing counter */ > +#define MEM_CGROUP_ZSTAT(mz, idx) ((mz)->count[(idx)]) > + > +struct mem_cgroup_per_node { > + struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES]; > +}; > + > +struct mem_cgroup_lru_info { > + struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES]; > +}; > + > +/* > + * The memory controller data structure. The memory controller controls > both > + * page cache and RSS per cgroup. We would eventually like to provide > + * statistics based on the statistics developed by Rik Van Riel for > clock-pro, > + * to help the administrator determine what knobs to tune. > + * > + * TODO: Add a water mark for the memory controller. Reclaim will begin > when > + * we hit the water mark. May be even add a low water mark, such that > + * no reclaim occurs from a cgroup at it's low water mark, this is > + * a feature that will be implemented much later in the future. > + */ > +struct mem_cgroup { > + struct cgroup_subsys_state css; > + /* > + * the counter to account for memory usage > + */ > + struct res_counter res; > + /* > + * the counter to account for mem+swap usage. > + */ > + struct res_counter memsw; > + /* > + * Per cgroup active and inactive list, similar to the > + * per zone LRU lists. > + */ > + struct mem_cgroup_lru_info info; > + > + /* > + protect against reclaim related member. > + */ > + spinlock_t reclaim_param_lock; > + > + int prev_priority; /* for recording reclaim priority */ > + > + /* > + * While reclaiming in a hiearchy, we cache the last child we > + * reclaimed from. > + */ > + int last_scanned_child; > + /* > + * Should the accounting and control be hierarchical, per subtree? > + */ > + bool use_hierarchy; > + unsigned long last_oom_jiffies; > + atomic_t refcnt; > + > + unsigned int swappiness; > + > + /* set when res.limit == memsw.limit */ > + bool memsw_is_minimum; > + > + /* > + * statistics. This must be placed at the end of memcg. 
> + */ > + struct mem_cgroup_stat stat; > +}; > + > +static inline struct mem_cgroup_per_zone * > +mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid) > +{ > + return &mem->info.nodeinfo[nid]->zoneinfo[zid]; > +} > + > +static inline unsigned long > +mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, > + struct zone *zone, > + enum lru_list lru) > +{ > + int nid = zone->zone_pgdat->node_id; > + int zid = zone_idx(zone); > + struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid); > + > + return MEM_CGROUP_ZSTAT(mz, lru); > +} > + > +static inline struct zone_reclaim_stat * > +mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone) > +{ > + int nid = zone->zone_pgdat->node_id; > + int zid = zone_idx(zone); > + struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid); > + > + return &mz->reclaim_stat; > +} > + > + > #ifdef CONFIG_CGROUP_MEM_RES_CTLR > /* > * All "charge" functions with gfp_mask should use GFP_KERNEL or > @@ -95,11 +228,6 @@ extern void mem_cgroup_record_reclaim_pr > int priority); > int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg); > int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg); > -unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, > - struct zone *zone, > - enum lru_list lru); > -struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup > *memcg, > - struct zone *zone); > struct zone_reclaim_stat* > mem_cgroup_get_reclaim_stat_from_page(struct page *page); > extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, > @@ -246,20 +374,6 @@ mem_cgroup_inactive_file_is_low(struct m > return 1; > } > > -static inline unsigned long > -mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone, > - enum lru_list lru) > -{ > - return 0; > -} > - > - > -static inline struct zone_reclaim_stat* > -mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone) > -{ > - return NULL; > -} > - > static inline struct zone_reclaim_stat* > mem_cgroup_get_reclaim_stat_from_page(struct page *page) > { > --- linux.orig/mm/memcontrol.c 2009-08-19 20:14:56.000000000 +0800 > +++ linux/mm/memcontrol.c 2009-08-19 20:46:50.000000000 +0800 > @@ -55,30 +55,6 @@ static int really_do_swap_account __init > static DEFINE_MUTEX(memcg_tasklist); /* can be hold under cgroup_mutex */ > > /* > - * Statistics for memory cgroup. > - */ > -enum mem_cgroup_stat_index { > - /* > - * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss. > - */ > - MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */ > - MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */ > - MEM_CGROUP_STAT_MAPPED_FILE, /* # of pages charged as file rss */ > - MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */ > - MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */ > - > - MEM_CGROUP_STAT_NSTATS, > -}; > - > -struct mem_cgroup_stat_cpu { > - s64 count[MEM_CGROUP_STAT_NSTATS]; > -} ____cacheline_aligned_in_smp; > - > -struct mem_cgroup_stat { > - struct mem_cgroup_stat_cpu cpustat[0]; > -}; > - > -/* > * For accounting under irq disable, no need for increment preempt count. > */ > static inline void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat_cpu > *stat, > @@ -106,86 +82,6 @@ static s64 mem_cgroup_local_usage(struct > return ret; > } > > -/* > - * per-zone information in memory controller. 
> - */ > -struct mem_cgroup_per_zone { > - /* > - * spin_lock to protect the per cgroup LRU > - */ > - struct list_head lists[NR_LRU_LISTS]; > - unsigned long count[NR_LRU_LISTS]; > - > - struct zone_reclaim_stat reclaim_stat; > -}; > -/* Macro for accessing counter */ > -#define MEM_CGROUP_ZSTAT(mz, idx) ((mz)->count[(idx)]) > - > -struct mem_cgroup_per_node { > - struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES]; > -}; > - > -struct mem_cgroup_lru_info { > - struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES]; > -}; > - > -/* > - * The memory controller data structure. The memory controller controls > both > - * page cache and RSS per cgroup. We would eventually like to provide > - * statistics based on the statistics developed by Rik Van Riel for > clock-pro, > - * to help the administrator determine what knobs to tune. > - * > - * TODO: Add a water mark for the memory controller. Reclaim will begin > when > - * we hit the water mark. May be even add a low water mark, such that > - * no reclaim occurs from a cgroup at it's low water mark, this is > - * a feature that will be implemented much later in the future. > - */ > -struct mem_cgroup { > - struct cgroup_subsys_state css; > - /* > - * the counter to account for memory usage > - */ > - struct res_counter res; > - /* > - * the counter to account for mem+swap usage. > - */ > - struct res_counter memsw; > - /* > - * Per cgroup active and inactive list, similar to the > - * per zone LRU lists. > - */ > - struct mem_cgroup_lru_info info; > - > - /* > - protect against reclaim related member. > - */ > - spinlock_t reclaim_param_lock; > - > - int prev_priority; /* for recording reclaim priority */ > - > - /* > - * While reclaiming in a hiearchy, we cache the last child we > - * reclaimed from. > - */ > - int last_scanned_child; > - /* > - * Should the accounting and control be hierarchical, per subtree? > - */ > - bool use_hierarchy; > - unsigned long last_oom_jiffies; > - atomic_t refcnt; > - > - unsigned int swappiness; > - > - /* set when res.limit == memsw.limit */ > - bool memsw_is_minimum; > - > - /* > - * statistics. This must be placed at the end of memcg. 
> - */ > - struct mem_cgroup_stat stat; > -}; > - > enum charge_type { > MEM_CGROUP_CHARGE_TYPE_CACHE = 0, > MEM_CGROUP_CHARGE_TYPE_MAPPED, > @@ -244,12 +140,6 @@ static void mem_cgroup_charge_statistics > } > > static struct mem_cgroup_per_zone * > -mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid) > -{ > - return &mem->info.nodeinfo[nid]->zoneinfo[zid]; > -} > - > -static struct mem_cgroup_per_zone * > page_cgroup_zoneinfo(struct page_cgroup *pc) > { > struct mem_cgroup *mem = pc->mem_cgroup; > @@ -586,27 +476,6 @@ int mem_cgroup_inactive_file_is_low(stru > return (active > inactive); > } > > -unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, > - struct zone *zone, > - enum lru_list lru) > -{ > - int nid = zone->zone_pgdat->node_id; > - int zid = zone_idx(zone); > - struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid); > - > - return MEM_CGROUP_ZSTAT(mz, lru); > -} > - > -struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup > *memcg, > - struct zone *zone) > -{ > - int nid = zone->zone_pgdat->node_id; > - int zid = zone_idx(zone); > - struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid); > - > - return &mz->reclaim_stat; > -} > - > struct zone_reclaim_stat * > mem_cgroup_get_reclaim_stat_from_page(struct page *page) > { > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] memcg: move definitions to .h and inline some functions 2009-08-19 14:18 ` KAMEZAWA Hiroyuki @ 2009-08-19 14:27 ` Balbir Singh 2009-08-20 1:34 ` Wu Fengguang 0 siblings, 1 reply; 122+ messages in thread From: Balbir Singh @ 2009-08-19 14:27 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Wu Fengguang, KOSAKI Motohiro, Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm, nishimura@mxp.nes.nec.co.jp, lizf@cn.fujitsu.com, menage@google.com * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-19 23:18:01]: > Wu Fengguang さんは書きました: > > On Tue, Aug 18, 2009 at 11:57:52PM +0800, KOSAKI Motohiro wrote: > >> > >> > > This one of the reasons why we unconditionally deactivate > >> > > the active anon pages, and do background scanning of the > >> > > active anon list when reclaiming page cache pages. > >> > > > >> > > We want to always move some pages to the inactive anon > >> > > list, so it does not get too small. > >> > > >> > Right, the current code tries to pull inactive list out of > >> > smallish-size state as long as there are vmscan activities. > >> > > >> > However there is a possible (and tricky) hole: mem cgroups > >> > don't do batched vmscan. shrink_zone() may call shrink_list() > >> > with nr_to_scan=1, in which case shrink_list() _still_ calls > >> > isolate_pages() with the much larger SWAP_CLUSTER_MAX. > >> > > >> > It effectively scales up the inactive list scan rate by 10 times when > >> > it is still small, and may thus prevent it from growing up for ever. > >> > > >> > In that case, LRU becomes FIFO. > >> > > >> > Jeff, can you confirm if the mem cgroup's inactive list is small? > >> > If so, this patch should help. > >> > >> This patch does right thing. > >> However, I would explain why I and memcg folks didn't do that in past > >> days. > >> > >> Strangely, some memcg struct declaration is hide in *.c. Thus we can't > >> make inline function and we hesitated to introduce many function calling > >> overhead. > >> > >> So, Can we move some memcg structure declaration to *.h and make > >> mem_cgroup_get_saved_scan() inlined function? > > > > OK here it is. I have to move big chunks to make it compile, and it > > does reduced a dozen lines of code :) > > > > Is this big copy&paste acceptable? (memcg developers CCed). > > > > Thanks, > > Fengguang > > I don't like this. plz add hooks to necessary places, at this stage. > This will be too big for inlined function, anyway. > plz move this after you find overhead is too big. Me too.. I want to abstract the implementation within memcontrol.c to be honest (I am concerned that someone might include memcontrol.h and access its structure members, which scares me). Hiding it within memcontrol.c provides the right level of abstraction. Could you please explain your motivation for this change? I got cc'ed on to a few emails, is this for the patch that exports the nr_saved_scan approach? -- Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] memcg: move definitions to .h and inline some functions 2009-08-19 14:27 ` Balbir Singh @ 2009-08-20 1:34 ` Wu Fengguang 0 siblings, 0 replies; 122+ messages in thread From: Wu Fengguang @ 2009-08-20 1:34 UTC (permalink / raw) To: Balbir Singh Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel, Johannes Weiner, Avi Kivity, Andrea Arcangeli, Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm, nishimura@mxp.nes.nec.co.jp, lizf@cn.fujitsu.com, menage@google.com On Wed, Aug 19, 2009 at 10:27:05PM +0800, Balbir Singh wrote: > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-19 23:18:01]: > > > Wu Fengguang さんは書きました: > > > On Tue, Aug 18, 2009 at 11:57:52PM +0800, KOSAKI Motohiro wrote: > > >> > > >> > > This one of the reasons why we unconditionally deactivate > > >> > > the active anon pages, and do background scanning of the > > >> > > active anon list when reclaiming page cache pages. > > >> > > > > >> > > We want to always move some pages to the inactive anon > > >> > > list, so it does not get too small. > > >> > > > >> > Right, the current code tries to pull inactive list out of > > >> > smallish-size state as long as there are vmscan activities. > > >> > > > >> > However there is a possible (and tricky) hole: mem cgroups > > >> > don't do batched vmscan. shrink_zone() may call shrink_list() > > >> > with nr_to_scan=1, in which case shrink_list() _still_ calls > > >> > isolate_pages() with the much larger SWAP_CLUSTER_MAX. > > >> > > > >> > It effectively scales up the inactive list scan rate by 10 times when > > >> > it is still small, and may thus prevent it from growing up for ever. > > >> > > > >> > In that case, LRU becomes FIFO. > > >> > > > >> > Jeff, can you confirm if the mem cgroup's inactive list is small? > > >> > If so, this patch should help. > > >> > > >> This patch does right thing. > > >> However, I would explain why I and memcg folks didn't do that in past > > >> days. > > >> > > >> Strangely, some memcg struct declaration is hide in *.c. Thus we can't > > >> make inline function and we hesitated to introduce many function calling > > >> overhead. > > >> > > >> So, Can we move some memcg structure declaration to *.h and make > > >> mem_cgroup_get_saved_scan() inlined function? > > > > > > OK here it is. I have to move big chunks to make it compile, and it > > > does reduced a dozen lines of code :) > > > > > > Is this big copy&paste acceptable? (memcg developers CCed). > > > > > > Thanks, > > > Fengguang > > > > I don't like this. plz add hooks to necessary places, at this stage. > > This will be too big for inlined function, anyway. > > plz move this after you find overhead is too big. It should not be a performance regression, since the text size is slightly smaller with the patch:
        text    data      bss      dec     hex filename
before  8732148 2771858 11048432 22552438 1581f76 vmlinux
after   8731972 2771858 11048432 22552262 1581ec6 vmlinux
> Me too.. I want to abstract the implementation within memcontrol.c to > be honest (I am concerned that someone might include memcontrol.h and > access its structure members, which scares me). Hiding it within > memcontrol.c provides the right level of abstraction. Yeah, quite reasonable. > Could you please explain your motivation for this change? I got cc'ed > on to a few emails, is this for the patch that exports the nr_saved_scan > approach?
Yes, KOSAKI proposed to inline the mem_cgroup_get_saved_scan() function introduced in that patch, which requires moving the structs into .h. I'll submit the original (un-inlined) patch. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages? 2009-08-14 9:51 ` Wu Fengguang 2009-08-14 13:19 ` Rik van Riel @ 2009-08-14 21:42 ` Dike, Jeffrey G 2009-08-14 22:37 ` Rik van Riel 2009-09-13 16:23 ` KOSAKI Motohiro 1 sibling, 2 replies; 122+ messages in thread From: Dike, Jeffrey G @ 2009-08-14 21:42 UTC (permalink / raw) To: Wu, Fengguang, Johannes Weiner Cc: Avi Kivity, Rik van Riel, KOSAKI Motohiro, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm A side note - I've been doing some tracing and shrink_active_list is called a humongous number of times (25000-ish during a ~90 kvm run), with a net result of zero pages moved nearly all the time. Your test is rescuing essentially all candidate pages from the inactive list. Right now, I have the VM_EXEC || PageAnon version of your test. Jeff -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
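Jeff does not say how the call counts were collected; one plausible way to gather such numbers is a throwaway kprobes module. The sketch below is hypothetical (it assumes the static shrink_active_list symbol has not been inlined away; the counting method is an assumption, not Jeff's actual setup):

#include <linux/module.h>
#include <linux/kprobes.h>

/* Hypothetical: count how often shrink_active_list() is entered. */
static atomic_t hits = ATOMIC_INIT(0);

static int count_pre(struct kprobe *p, struct pt_regs *regs)
{
	atomic_inc(&hits);
	return 0;
}

static struct kprobe kp = {
	.symbol_name	= "shrink_active_list",
	.pre_handler	= count_pre,
};

static int __init counter_init(void)
{
	return register_kprobe(&kp);
}

static void __exit counter_exit(void)
{
	unregister_kprobe(&kp);
	printk(KERN_INFO "shrink_active_list entered %d times\n",
	       atomic_read(&hits));
}

module_init(counter_init);
module_exit(counter_exit);
MODULE_LICENSE("GPL");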
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-14 21:42 ` [RFC] respect the referenced bit of KVM guest pages? Dike, Jeffrey G @ 2009-08-14 22:37 ` Rik van Riel 2009-08-15 5:32 ` Wu Fengguang 1 sibling, 1 reply; 122+ messages in thread From: Rik van Riel @ 2009-08-14 22:37 UTC (permalink / raw) To: Dike, Jeffrey G Cc: Wu, Fengguang, Johannes Weiner, Avi Kivity, KOSAKI Motohiro, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm Dike, Jeffrey G wrote: > A side note - I've been doing some tracing and shrink_active_list is called a humongous number of times (25000-ish during a ~90 kvm run), with a net result of zero pages moved nearly all the time. Your test is rescuing essentially all candidate pages from the inactive list. Right now, I have the VM_EXEC || PageAnon version of your test. That is exactly why the split LRU VM does an unconditional deactivation of active anon pages :) -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-14 22:37 ` Rik van Riel @ 2009-08-15 5:32 ` Wu Fengguang 0 siblings, 0 replies; 122+ messages in thread From: Wu Fengguang @ 2009-08-15 5:32 UTC (permalink / raw) To: Rik van Riel Cc: Dike, Jeffrey G, Johannes Weiner, Avi Kivity, KOSAKI Motohiro, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm On Sat, Aug 15, 2009 at 06:37:22AM +0800, Rik van Riel wrote: > Dike, Jeffrey G wrote: > > A side note - I've been doing some tracing and shrink_active_list > > is called a humongous number of times (25000-ish during a ~90 kvm > > run), with a net result of zero pages moved nearly all the time. You mean "no pages get deactivated at all in most invocations"? This is possible in the steady (thrashing) state of a memory tight system (the working set is bigger than memory size). > > Your test is rescuing essentially all candidate pages from the > > inactive list. Right now, I have the VM_EXEC || PageAnon version > > of your test. > > That is exactly why the split LRU VM does an unconditional > deactivation of active anon pages :) In general it is :) However in Jeff's small memory case, there will be many refaults without the "PageAnon" protection. But the patch does not imply that I'm happy with the "PageAnon" test ;) Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-14 21:42 ` [RFC] respect the referenced bit of KVM guest pages? Dike, Jeffrey G 2009-08-14 22:37 ` Rik van Riel @ 2009-09-13 16:23 ` KOSAKI Motohiro 1 sibling, 0 replies; 122+ messages in thread From: KOSAKI Motohiro @ 2009-09-13 16:23 UTC (permalink / raw) To: Dike, Jeffrey G Cc: Wu, Fengguang, Johannes Weiner, Avi Kivity, Rik van Riel, Andrea Arcangeli, Yu, Wilfred, Kleen, Andi, Hugh Dickins, Andrew Morton, Christoph Lameter, Mel Gorman, LKML, linux-mm Hi Jeff, > A side note - I've been doing some tracing and shrink_active_list is called a humongous number of times (25000-ish during a ~90 kvm run), with a net result of zero pages moved nearly all the time. Your test is rescuing essentially all candidate pages from the inactive list. Right now, I have the VM_EXEC || PageAnon version of your test. Sorry for the long-delayed reply. I set up a reproduction environment today, but had no luck: I could not reproduce the stack refault issue. Could you please describe your reproduction steps and your analysis method in detail? My environment is:
x86_64, CPUx4, MEM 6G
userland: fedora11
kernel: latest mmotm
cgroup size: 128M
guest mem: 256M
CONFIG_KSM=n
My results:
- plenty of anon and file faults happen, but that is expected: they are caused by demand paging.
- do_anonymous_page handles almost no stack faults, on either host or guest.
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-05 2:40 [RFC] respect the referenced bit of KVM guest pages? Wu Fengguang ` (2 preceding siblings ...) 2009-08-05 15:58 ` Andrea Arcangeli @ 2009-08-05 17:53 ` Rik van Riel 2009-08-05 19:00 ` Dike, Jeffrey G 3 siblings, 1 reply; 122+ messages in thread From: Rik van Riel @ 2009-08-05 17:53 UTC (permalink / raw) To: Wu Fengguang Cc: Dike, Jeffrey G, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli, Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm Wu Fengguang wrote: > The refaults can be drastically reduced by the following patch, which > respects the referenced bit of all anonymous pages (including the KVM > pages). The big question is, which referenced bit? All anonymous pages get the referenced bit set when they are initially created. Acting on that bit is pretty useless, since it does not add any information at all. > However it risks reintroducing the problem addressed by commit 7e9cd4842 > (fix reclaim scalability problem by ignoring the referenced bit, > mainly the pte young bit). I wonder if there are better solutions? Reintroducing that problem is disastrous for large systems running eg. JVMs or certain scientific computing workloads. When you have a 256GB system that is low on memory, you need to be able to find a page to swap out soon. If all 64 million pages in your system are "recently referenced", you run into BIG trouble. I do not believe we can afford to reintroduce that problem. Also, the inactive list (where references to anonymous pages _do_ count) is pretty big. Is it not big enough in Jeff's test case? Jeff, what kind of workloads are you running in the guests? -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages? 2009-08-05 17:53 ` Rik van Riel @ 2009-08-05 19:00 ` Dike, Jeffrey G 2009-08-05 19:07 ` Rik van Riel 0 siblings, 1 reply; 122+ messages in thread From: Dike, Jeffrey G @ 2009-08-05 19:00 UTC (permalink / raw) To: Rik van Riel, Wu, Fengguang Cc: Yu, Wilfred, Kleen, Andi, Andrea Arcangeli, Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm > Also, the inactive list (where references to anonymous pages > _do_ count) is pretty big. Is it not big enough in Jeff's > test case? > Jeff, what kind of workloads are you running in the guests? I'm looking at KVM on small systems. My "small system" is a 128M memory compartment on a 4G server. The workload is boot up the instance, start Firefox and another app (whatever editor comes by default with Moblin), close them, and shut down the instance. Jeff -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-05 19:00 ` Dike, Jeffrey G @ 2009-08-05 19:07 ` Rik van Riel 2009-08-05 19:18 ` Dike, Jeffrey G 0 siblings, 1 reply; 122+ messages in thread From: Rik van Riel @ 2009-08-05 19:07 UTC (permalink / raw) To: Dike, Jeffrey G Cc: Wu, Fengguang, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli, Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm Dike, Jeffrey G wrote: >> Also, the inactive list (where references to anonymous pages >> _do_ count) is pretty big. Is it not big enough in Jeff's >> test case? > >> Jeff, what kind of workloads are you running in the guests? > > I'm looking at KVM on small systems. My "small system" is a 128M memory compartment on a 4G server. How did you create that 128M memory compartment? Did you use cgroups on the host system? > The workload is boot up the instance, start Firefox and another app (whatever editor comes by default with Moblin), close them, and shut down the instance. How much memory do you give your virtual machine? That is, how much memory does it think it has? -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* RE: [RFC] respect the referenced bit of KVM guest pages? 2009-08-05 19:07 ` Rik van Riel @ 2009-08-05 19:18 ` Dike, Jeffrey G 2009-08-06 9:22 ` Avi Kivity 0 siblings, 1 reply; 122+ messages in thread From: Dike, Jeffrey G @ 2009-08-05 19:18 UTC (permalink / raw) To: Rik van Riel Cc: Wu, Fengguang, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli, Avi Kivity, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm > How did you create that 128M memory compartment? > > Did you use cgroups on the host system? Yup. > How much memory do you give your virtual machine? > > That is, how much memory does it think it has? 256M. Jeff -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-05 19:18 ` Dike, Jeffrey G @ 2009-08-06 9:22 ` Avi Kivity 2009-08-06 9:25 ` Wu Fengguang 0 siblings, 1 reply; 122+ messages in thread From: Avi Kivity @ 2009-08-06 9:22 UTC (permalink / raw) To: Dike, Jeffrey G Cc: Rik van Riel, Wu, Fengguang, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On 08/05/2009 10:18 PM, Dike, Jeffrey G wrote: >> How did you create that 128M memory compartment? >> >> Did you use cgroups on the host system? >> > > Yup. > > >> How much memory do you give your virtual machine? >> >> That is, how much memory does it think it has? >> > > 256M. > So you're effectively running a 256M guest on a 128M host? Do cgroups have private active/inactive lists? -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-06 9:22 ` Avi Kivity @ 2009-08-06 9:25 ` Wu Fengguang 2009-08-06 9:35 ` Avi Kivity 0 siblings, 1 reply; 122+ messages in thread From: Wu Fengguang @ 2009-08-06 9:25 UTC (permalink / raw) To: Avi Kivity Cc: Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On Thu, Aug 06, 2009 at 05:22:23PM +0800, Avi Kivity wrote: > On 08/05/2009 10:18 PM, Dike, Jeffrey G wrote: > >> How did you create that 128M memory compartment? > >> > >> Did you use cgroups on the host system? > >> > > > > Yup. > > > > > >> How much memory do you give your virtual machine? > >> > >> That is, how much memory does it think it has? > >> > > > > 256M. > > > > So you're effectively running a 256M guest on a 128M host? > > Do cgroups have private active/inactive lists? Yes, and they reuse the same page reclaim routines as the global LRU lists. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
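Concretely, the shared shrink_zone() path only switches its statistics source depending on whether a cgroup is under reclaim. A rough sketch along the lines of the 2.6.30-era code (an approximation from memory, not a verbatim quote of any tree):

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
#define scanning_global_lru(sc)	(!(sc)->mem_cgroup)
#else
#define scanning_global_lru(sc)	(1)
#endif

static unsigned long zone_nr_pages(struct zone *zone, struct scan_control *sc,
				   enum lru_list lru)
{
	/* Global reclaim reads the per-zone vmstat counters... */
	if (scanning_global_lru(sc))
		return zone_page_state(zone, NR_LRU_BASE + lru);

	/* ...memcg reclaim reads the cgroup's private per-zone LRU counts. */
	return mem_cgroup_zone_nr_pages(sc->mem_cgroup, zone, lru);
}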
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-06 9:25 ` Wu Fengguang @ 2009-08-06 9:35 ` Avi Kivity 2009-08-06 9:35 ` Wu Fengguang 0 siblings, 1 reply; 122+ messages in thread From: Avi Kivity @ 2009-08-06 9:35 UTC (permalink / raw) To: Wu Fengguang Cc: Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On 08/06/2009 12:25 PM, Wu Fengguang wrote: >> So you're effectively running a 256M guest on a 128M host? >> >> Do cgroups have private active/inactive lists? >> > > Yes, and they reuse the same page reclaim routines as the global > LRU lists. > Then this looks like a bug in the shadow accessed bit handling. -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-06 9:35 ` Avi Kivity @ 2009-08-06 9:35 ` Wu Fengguang 2009-08-06 9:59 ` Avi Kivity 0 siblings, 1 reply; 122+ messages in thread From: Wu Fengguang @ 2009-08-06 9:35 UTC (permalink / raw) To: Avi Kivity Cc: Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On Thu, Aug 06, 2009 at 05:35:59PM +0800, Avi Kivity wrote: > On 08/06/2009 12:25 PM, Wu Fengguang wrote: > >> So you're effectively running a 256M guest on a 128M host? > >> > >> Do cgroups have private active/inactive lists? > >> > > > > Yes, and they reuse the same page reclaim routines as the global > > LRU lists. > > > > Then this looks like a bug in the shadow accessed bit handling. Yes. One question is: why only stack pages hurt if it is a general page reclaim problem? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-06 9:35 ` Wu Fengguang @ 2009-08-06 9:59 ` Avi Kivity 2009-08-06 9:59 ` Wu Fengguang 0 siblings, 1 reply; 122+ messages in thread From: Avi Kivity @ 2009-08-06 9:59 UTC (permalink / raw) To: Wu Fengguang Cc: Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On 08/06/2009 12:35 PM, Wu Fengguang wrote: > On Thu, Aug 06, 2009 at 05:35:59PM +0800, Avi Kivity wrote: > >> On 08/06/2009 12:25 PM, Wu Fengguang wrote: > >> > >>>> So you're effectively running a 256M guest on a 128M host? > >>>> > >>>> Do cgroups have private active/inactive lists? > >>>> > >>>> > >>> Yes, and they reuse the same page reclaim routines as the global > >>> LRU lists. > >>> > >>> > >> Then this looks like a bug in the shadow accessed bit handling. > >> > > > > Yes. One question is: why only stack pages hurt if it is a > > general page reclaim problem? > > Do we know for a fact that only stack pages suffer, or is it what has been noticed? -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-06 9:59 ` Avi Kivity @ 2009-08-06 9:59 ` Wu Fengguang 2009-08-06 10:14 ` Avi Kivity 0 siblings, 1 reply; 122+ messages in thread From: Wu Fengguang @ 2009-08-06 9:59 UTC (permalink / raw) To: Avi Kivity Cc: Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On Thu, Aug 06, 2009 at 05:59:53PM +0800, Avi Kivity wrote: > On 08/06/2009 12:35 PM, Wu Fengguang wrote: > > On Thu, Aug 06, 2009 at 05:35:59PM +0800, Avi Kivity wrote: > > > >> On 08/06/2009 12:25 PM, Wu Fengguang wrote: > >> > >>>> So you're effectively running a 256M guest on a 128M host? > >>>> > >>>> Do cgroups have private active/inactive lists? > >>>> > >>>> > >>> Yes, and they reuse the same page reclaim routines as the global > >>> LRU lists. > >>> > >>> > >> Then this looks like a bug in the shadow accessed bit handling. > >> > > > > Yes. One question is: why only stack pages hurt if it is a > > general page reclaim problem? > > > > Do we know for a fact that only stack pages suffer, or is it what has > been noticed? It should be the former: "These pages are nearly all stack pages.", Jeff said. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-06 9:59 ` Wu Fengguang @ 2009-08-06 10:14 ` Avi Kivity 2009-08-07 1:25 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 122+ messages in thread From: Avi Kivity @ 2009-08-06 10:14 UTC (permalink / raw) To: Wu Fengguang Cc: Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On 08/06/2009 12:59 PM, Wu Fengguang wrote: >> Do we know for a fact that only stack pages suffer, or is it what has >> been noticed? >> > > It should be the former: "These pages are nearly all stack pages.", > Jeff said. > Ok. I can't explain it. There's no special treatment for guest stack pages. The accessed bit should be maintained for them exactly like all other pages. Are they kernel-mode stack pages, or user-mode stack pages (the difference being that kernel mode stack pages are accessed through large ptes, whereas user mode stack pages are accessed through normal ptes)? -- error compiling committee.c: too many arguments to function -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [RFC] respect the referenced bit of KVM guest pages? 2009-08-06 10:14 ` Avi Kivity @ 2009-08-07 1:25 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 122+ messages in thread From: KAMEZAWA Hiroyuki @ 2009-08-07 1:25 UTC (permalink / raw) To: Avi Kivity Cc: Wu Fengguang, Dike, Jeffrey G, Rik van Riel, Yu, Wilfred, Kleen, Andi, Andrea Arcangeli, Hugh Dickins, Andrew Morton, Christoph Lameter, KOSAKI Motohiro, Mel Gorman, LKML, linux-mm On Thu, 06 Aug 2009 13:14:09 +0300 Avi Kivity <avi@redhat.com> wrote: > On 08/06/2009 12:59 PM, Wu Fengguang wrote: > >> Do we know for a fact that only stack pages suffer, or is it what has > >> been noticed? > >> > > > > It should be the former: "These pages are nearly all stack pages.", > > Jeff said. > > Ok. I can't explain it. There's no special treatment for guest stack > pages. The accessed bit should be maintained for them exactly like all > other pages. > > Are they kernel-mode stack pages, or user-mode stack pages (the > difference being that kernel mode stack pages are accessed through large > ptes, whereas user mode stack pages are accessed through normal ptes)? > Hmm, finally, a memcg problem? Just as an experiment, how does the following work?
- memory.limit_in_bytes = 128MB
- memory.memsw.limit_in_bytes = 160MB
With this, once memory+swap usage hits 160MB, there will be no more swapping. But please take care of OOM. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 122+ messages in thread