* RFC: shadow page table reclaim
From: Max Laier @ 2009-08-28  2:31 UTC
To: kvm

[-- Attachment #1: Type: Text/Plain, Size: 1589 bytes --]

Hello,

it seems to me that the reclaim mechanism for shadow page table pages is
sub-optimal.  The arch.active_mmu_pages list that is used for reclaiming
does not move up parent shadow page tables when a child is added, so when
we need a new shadow page we zap the oldest - which can well be a
directory-level page holding a just-added table-level page.

Attached are a proof-of-concept diff and two plots, before and after.  The
plots show referenced guest pages over time.  As you can see, there is less
saw-toothing in the after plot and also fewer changes overall (because we
don't zap mappings that are still in use as often).  This is with a limit
of 64 shadow pages to increase the effect, and with vmx/ept.

I realize that the list_move and parent walk are quite expensive and that
kvm_mmu_alloc_page is only half the story.  It should really be done every
time a new guest page table is mapped - maybe via rmap_add.  This would
obviously kill performance completely, though.

Another idea would be to improve the reclaim logic so that it prefers "old"
PT_PAGE_TABLE_LEVEL pages over directories.  I'm not sure how to code that
up sensibly, either.

As I said, this is proof-of-concept and RFC, so any comments are welcome.
For my use case the proof-of-concept diff seems to do well enough, though.

Thanks,

-- 
/"\  Best regards,                      | mlaier@freebsd.org
\ /  Max Laier                          | ICQ #67774661
 X   http://pf4freebsd.love2party.net/  | mlaier@EFnet
/ \  ASCII Ribbon Campaign              | Against HTML Mail and News

[-- Attachment #2: before.png --]
[-- Type: image/png, Size: 7117 bytes --]

[-- Attachment #3: after.png --]
[-- Type: image/png, Size: 6864 bytes --]

[-- Attachment #4: reclaim1.diff --]
[-- Type: text/x-patch, Size: 1203 bytes --]

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 95d5329..0a63570 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -190,6 +190,8 @@ struct kvm_unsync_walk {
 };
 
 typedef int (*mmu_parent_walk_fn) (struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp);
+static void mmu_parent_walk(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+			    mmu_parent_walk_fn fn);
 
 static struct kmem_cache *pte_chain_cache;
 static struct kmem_cache *rmap_desc_cache;
@@ -900,6 +902,12 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn)
 	return gfn & ((1 << KVM_MMU_HASH_SHIFT) - 1);
 }
 
+static int move_up_walk_fn(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+	list_move(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
+	return 1;
+}
+
 static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
 					       u64 *parent_pte)
 {
@@ -918,6 +926,10 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
 	bitmap_zero(sp->slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
 	sp->multimapped = 0;
 	sp->parent_pte = parent_pte;
+#if 1
+	if (parent_pte)
+		mmu_parent_walk(vcpu, sp, move_up_walk_fn);
+#endif
 	--vcpu->kvm->arch.n_free_mmu_pages;
 	return sp;
 }

* Re: RFC: shadow page table reclaim
From: Avi Kivity @ 2009-08-31  9:55 UTC
To: Max Laier; +Cc: kvm

On 08/28/2009 05:31 AM, Max Laier wrote:
> Hello,
>
> it seems to me that the reclaim mechanism for shadow page table pages is
> sub-optimal.  The arch.active_mmu_pages list that is used for reclaiming
> does not move up parent shadow page tables when a child is added, so when
> we need a new shadow page we zap the oldest - which can well be a
> directory-level page holding a just-added table-level page.
>
> Attached are a proof-of-concept diff and two plots, before and after.  The
> plots show referenced guest pages over time.

What do you mean by referenced guest pages?  Total number of populated
sptes?

> As you can see, there is less saw-toothing in the after plot and also
> fewer changes overall (because we don't zap mappings that are still in
> use as often).  This is with a limit of 64 shadow pages to increase the
> effect, and with vmx/ept.
>
> I realize that the list_move and parent walk are quite expensive and that
> kvm_mmu_alloc_page is only half the story.  It should really be done every
> time a new guest page table is mapped - maybe via rmap_add.  This would
> obviously kill performance completely, though.
>
> Another idea would be to improve the reclaim logic so that it prefers
> "old" PT_PAGE_TABLE_LEVEL pages over directories.  I'm not sure how to
> code that up sensibly, either.
>
> As I said, this is proof-of-concept and RFC, so any comments are welcome.
> For my use case the proof-of-concept diff seems to do well enough, though.

Given that reclaim is fairly rare, we should try to move the cost there.
So how about this:

- add an 'accessed' flag to struct kvm_mmu_page
- when reclaiming, try to evict pages that were not recently accessed (but
  don't overscan - if you scan many recently accessed pages, evict some of
  them anyway)
- when scanning, update the accessed flag with the accessed bit of all
  parent_ptes
- when dropping an spte, update the accessed flag of the kvm_mmu_page it
  points to
- when reloading cr3, mark the page as accessed (since it has no
  parent_ptes)

This should introduce some LRU-ness that depends not only on fault
behaviour but also on long-term guest access behaviour (which is important
for long-running processes and kernel pages).

-- 
error compiling committee.c: too many arguments to function

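[A minimal, hypothetical sketch of the reclaim scan proposed above - this is
not code from this thread's patches or from the kernel tree.  The scan bound
RECLAIM_SCAN_LIMIT, the helper names, and the direct test of PT_ACCESSED_MASK
on the parent_ptes (instead of a cached 'accessed' flag in struct
kvm_mmu_page) are illustrative assumptions.]

/*
 * Sketch only: pick a reclaim victim by scanning active_mmu_pages from the
 * LRU tail, preferring pages whose parent_ptes show no accessed bit.  The
 * scan is bounded so a workload that touches everything still makes
 * progress ("don't overscan").  A real implementation would also clear the
 * accessed bits it sees, so the next scan reflects only new accesses.
 */
#define RECLAIM_SCAN_LIMIT	32	/* assumed bound, not a kernel constant */

static int sp_recently_accessed(struct kvm_mmu_page *sp)
{
	struct kvm_pte_chain *pte_chain;
	struct hlist_node *node;
	int i;

	if (!sp->multimapped) {
		if (!sp->parent_pte)
			return 1;	/* root page, no parent_ptes: keep it */
		return !!(*sp->parent_pte & PT_ACCESSED_MASK);
	}

	hlist_for_each_entry(pte_chain, node, &sp->parent_ptes, link)
		for (i = 0; i < NR_PTE_CHAIN_ENTRIES; ++i) {
			if (!pte_chain->parent_ptes[i])
				break;
			if (*pte_chain->parent_ptes[i] & PT_ACCESSED_MASK)
				return 1;
		}
	return 0;
}

static struct kvm_mmu_page *pick_reclaim_victim(struct kvm *kvm)
{
	struct kvm_mmu_page *sp;
	int scanned = 0;

	/* Oldest pages sit at the tail of active_mmu_pages. */
	list_for_each_entry_reverse(sp, &kvm->arch.active_mmu_pages, link) {
		if (!sp_recently_accessed(sp))
			return sp;	/* idle since the last scan: evict */
		if (++scanned >= RECLAIM_SCAN_LIMIT)
			break;		/* everything looks hot: stop scanning */
	}

	/* Overscan fallback: evict the plain LRU tail anyway. */
	return container_of(kvm->arch.active_mmu_pages.prev,
			    struct kvm_mmu_page, link);
}

The victim returned here would go to kvm_mmu_zap_page() in place of the
current "zap the tail of active_mmu_pages" logic; the reclaim_scan.diff
later in this thread is one concrete attempt along these lines.
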
* Re: RFC: shadow page table reclaim
From: Max Laier @ 2009-08-31 12:09 UTC
To: Avi Kivity; +Cc: kvm

On Monday 31 August 2009 11:55:24 Avi Kivity wrote:
> On 08/28/2009 05:31 AM, Max Laier wrote:
> > Hello,
> >
> > it seems to me that the reclaim mechanism for shadow page table pages is
> > sub-optimal.  The arch.active_mmu_pages list that is used for reclaiming
> > does not move up parent shadow page tables when a child is added, so when
> > we need a new shadow page we zap the oldest - which can well be a
> > directory-level page holding a just-added table-level page.
> >
> > Attached are a proof-of-concept diff and two plots, before and after.
> > The plots show referenced guest pages over time.
>
> What do you mean by referenced guest pages?  Total number of populated
> sptes?

Yes.

> > As you can see, there is less saw-toothing in the after plot and also
> > fewer changes overall (because we don't zap mappings that are still in
> > use as often).  This is with a limit of 64 shadow pages to increase the
> > effect, and with vmx/ept.
> >
> > I realize that the list_move and parent walk are quite expensive and
> > that kvm_mmu_alloc_page is only half the story.  It should really be
> > done every time a new guest page table is mapped - maybe via rmap_add.
> > This would obviously kill performance completely, though.
> >
> > Another idea would be to improve the reclaim logic so that it prefers
> > "old" PT_PAGE_TABLE_LEVEL pages over directories.  I'm not sure how to
> > code that up sensibly, either.
> >
> > As I said, this is proof-of-concept and RFC, so any comments are
> > welcome.  For my use case the proof-of-concept diff seems to do well
> > enough, though.
>
> Given that reclaim is fairly rare, we should try to move the cost
> there.  So how about this:
>
> - add an 'accessed' flag to struct kvm_mmu_page
> - when reclaiming, try to evict pages that were not recently accessed
>   (but don't overscan - if you scan many recently accessed pages, evict
>   some of them anyway)

- prefer page table level pages over directory level pages in the face of
  overscan.

> - when scanning, update the accessed flag with the accessed bit of all
>   parent_ptes

I might be misunderstanding, but I think it should be the other way 'round,
i.e. a page is accessed if any of its children have been accessed.

> - when dropping an spte, update the accessed flag of the kvm_mmu_page it
>   points to
> - when reloading cr3, mark the page as accessed (since it has no
>   parent_ptes)
>
> This should introduce some LRU-ness that depends not only on fault
> behaviour but also on long-term guest access behaviour (which is
> important for long-running processes and kernel pages).

I'll try to come up with a patch for this later tonight.  Unless you
already have something in the making.  Thanks.

-- 
/"\  Best regards,                      | mlaier@freebsd.org
\ /  Max Laier                          | ICQ #67774661
 X   http://pf4freebsd.love2party.net/  | mlaier@EFnet
/ \  ASCII Ribbon Campaign              | Against HTML Mail and News

* Re: RFC: shadow page table reclaim
From: Avi Kivity @ 2009-08-31 12:40 UTC
To: Max Laier; +Cc: kvm

On 08/31/2009 03:09 PM, Max Laier wrote:
>>> As you can see, there is less saw-toothing in the after plot and also
>>> fewer changes overall (because we don't zap mappings that are still in
>>> use as often).  This is with a limit of 64 shadow pages to increase the
>>> effect, and with vmx/ept.
>>>
>>> I realize that the list_move and parent walk are quite expensive and
>>> that kvm_mmu_alloc_page is only half the story.  It should really be
>>> done every time a new guest page table is mapped - maybe via rmap_add.
>>> This would obviously kill performance completely, though.
>>>
>>> Another idea would be to improve the reclaim logic so that it prefers
>>> "old" PT_PAGE_TABLE_LEVEL pages over directories.  I'm not sure how to
>>> code that up sensibly, either.
>>>
>>> As I said, this is proof-of-concept and RFC, so any comments are
>>> welcome.  For my use case the proof-of-concept diff seems to do well
>>> enough, though.
>>
>> Given that reclaim is fairly rare, we should try to move the cost
>> there.  So how about this:
>>
>> - add an 'accessed' flag to struct kvm_mmu_page
>> - when reclaiming, try to evict pages that were not recently accessed
>>   (but don't overscan - if you scan many recently accessed pages, evict
>>   some of them anyway)
>
> - prefer page table level pages over directory level pages in the face of
>   overscan.

I'm hoping that overscan will only occur when we start to feel memory
pressure, and that once we do a full scan we'll get accurate recency
information.

>> - when scanning, update the accessed flag with the accessed bit of all
>>   parent_ptes
>
> I might be misunderstanding, but I think it should be the other way
> 'round, i.e. a page is accessed if any of its children have been accessed.

They're both true, but looking at the parents is much more efficient.
Note we need to look at the accessed bit of the parent_ptes, not parent
kvm_mmu_pages.

>> - when dropping an spte, update the accessed flag of the kvm_mmu_page it
>>   points to
>> - when reloading cr3, mark the page as accessed (since it has no
>>   parent_ptes)
>>
>> This should introduce some LRU-ness that depends not only on fault
>> behaviour but also on long-term guest access behaviour (which is
>> important for long-running processes and kernel pages).
>
> I'll try to come up with a patch for this later tonight.  Unless you
> already have something in the making.  Thanks.

Please do, it's an area that needs attention.

-- 
error compiling committee.c: too many arguments to function

* Re: RFC: shadow page table reclaim
From: Max Laier @ 2009-09-02  2:24 UTC
To: Avi Kivity; +Cc: kvm

[-- Attachment #1: Type: Text/Plain, Size: 3711 bytes --]

On Monday 31 August 2009 14:40:29 Avi Kivity wrote:
> On 08/31/2009 03:09 PM, Max Laier wrote:
> >>> As you can see, there is less saw-toothing in the after plot and also
> >>> fewer changes overall (because we don't zap mappings that are still
> >>> in use as often).  This is with a limit of 64 shadow pages to
> >>> increase the effect, and with vmx/ept.
> >>>
> >>> I realize that the list_move and parent walk are quite expensive and
> >>> that kvm_mmu_alloc_page is only half the story.  It should really be
> >>> done every time a new guest page table is mapped - maybe via
> >>> rmap_add.  This would obviously kill performance completely, though.
> >>>
> >>> Another idea would be to improve the reclaim logic so that it prefers
> >>> "old" PT_PAGE_TABLE_LEVEL pages over directories.  I'm not sure how
> >>> to code that up sensibly, either.
> >>>
> >>> As I said, this is proof-of-concept and RFC, so any comments are
> >>> welcome.  For my use case the proof-of-concept diff seems to do well
> >>> enough, though.
> >>
> >> Given that reclaim is fairly rare, we should try to move the cost
> >> there.  So how about this:
> >>
> >> - add an 'accessed' flag to struct kvm_mmu_page
> >> - when reclaiming, try to evict pages that were not recently accessed
> >>   (but don't overscan - if you scan many recently accessed pages,
> >>   evict some of them anyway)
> >
> > - prefer page table level pages over directory level pages in the face
> >   of overscan.
>
> I'm hoping that overscan will only occur when we start to feel memory
> pressure, and that once we do a full scan we'll get accurate recency
> information.
>
> >> - when scanning, update the accessed flag with the accessed bit of all
> >>   parent_ptes
> >
> > I might be misunderstanding, but I think it should be the other way
> > 'round, i.e. a page is accessed if any of its children have been
> > accessed.
>
> They're both true, but looking at the parents is much more efficient.
> Note we need to look at the accessed bit of the parent_ptes, not parent
> kvm_mmu_pages.
>
> >> - when dropping an spte, update the accessed flag of the kvm_mmu_page
> >>   it points to
> >> - when reloading cr3, mark the page as accessed (since it has no
> >>   parent_ptes)
> >>
> >> This should introduce some LRU-ness that depends not only on fault
> >> behaviour but also on long-term guest access behaviour (which is
> >> important for long-running processes and kernel pages).
> >
> > I'll try to come up with a patch for this later tonight.  Unless you
> > already have something in the making.  Thanks.
>
> Please do, it's an area that needs attention.

Okay ... I have /something/, but I'm certainly not there yet, as I have to
admit that I don't understand your idea completely.  In addition, it seems
that EPT doesn't have an accessed bit :-\  Any idea for this?

Regardless, testing the attached with EPT, it turns out that not zapping
shadow pages with root_count != 0 already makes much difference.  After
all, we don't really zap these pages anyway, but just mark them invalid
after zapping the children.  So this could be a first improvement.

In any case, I clearly don't have the right idea here yet, plus I don't
really have time to look into this further right now.  And my hack is
"good enough"[tm] for my testing ... so if anyone more knowledgeable would
like to continue - much appreciated.  Maybe some of this can at least
serve as food for thought.  Sorry.

-- 
/"\  Best regards,                      | mlaier@freebsd.org
\ /  Max Laier                          | ICQ #67774661
 X   http://pf4freebsd.love2party.net/  | mlaier@EFnet
/ \  ASCII Ribbon Campaign              | Against HTML Mail and News

[-- Attachment #2: reclaim_scan.diff --]
[-- Type: text/x-patch, Size: 4201 bytes --]

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a3f637f..089ad0e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -394,6 +394,7 @@ struct kvm_arch{
 	 * Hash table of struct kvm_mmu_page.
 	 */
 	struct list_head active_mmu_pages;
+	struct kvm_mmu_page *scan_hand;
 	struct list_head assigned_dev_head;
 	struct iommu_domain *iommu_domain;
 	int iommu_flags;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f76d086..3715242 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -869,6 +869,8 @@ static int is_empty_shadow_page(u64 *spt)
 static void kvm_mmu_free_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	ASSERT(is_empty_shadow_page(sp->spt));
+	if (kvm->arch.scan_hand == sp)
+		kvm->arch.scan_hand = NULL;
 	list_del(&sp->link);
 	__free_page(virt_to_page(sp->spt));
 	__free_page(virt_to_page(sp->gfns));
@@ -1490,6 +1492,71 @@ static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 	return ret;
 }
 
+static int kvm_mmu_test_and_clear_pte_active(struct kvm_mmu_page *sp)
+{
+	struct kvm_pte_chain *pte_chain;
+	struct hlist_node *node;
+	int i, accessed = 0;
+
+	if (!sp->multimapped) {
+		if (!sp->parent_pte) {
+			if (!sp->root_count)
+				return 0;
+			else
+				return 1;
+		}
+		if (*sp->parent_pte & PT_ACCESSED_MASK) {
+			clear_bit(PT_ACCESSED_SHIFT,
+				  (unsigned long *)sp->parent_pte);
+			return 1;
+		} else
+			return 0;
+	}
+	/* Multimapped */
+	hlist_for_each_entry(pte_chain, node, &sp->parent_ptes, link)
+		for (i = 0; i < NR_PTE_CHAIN_ENTRIES; ++i) {
+			if (!pte_chain->parent_ptes[i])
+				break;
+			if (*pte_chain->parent_ptes[i] &
+			    PT_ACCESSED_MASK) {
+				clear_bit(PT_ACCESSED_SHIFT,
+					  (unsigned long *)
+					  pte_chain->parent_ptes[i]);
+				accessed++;
+			}
+		}
+	if (!accessed)
+		return 0;
+	else
+		return 1;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_inactive_page(struct kvm *kvm)
+{
+	struct kvm_mmu_page *sp, *prev = NULL;
+	int c = (kvm->arch.n_alloc_mmu_pages - kvm->arch.n_free_mmu_pages) / 4;
+
+	if (kvm->arch.scan_hand)
+		sp = kvm->arch.scan_hand;
+	else
+		sp = container_of(kvm->arch.active_mmu_pages.prev,
+				  struct kvm_mmu_page, link);
+
+	list_for_each_entry_reverse(sp, &kvm->arch.active_mmu_pages, link) {
+		if (!kvm_mmu_test_and_clear_pte_active(sp))
+			return sp;
+		if (!prev && sp->role.level == PT_PAGE_TABLE_LEVEL)
+			prev = sp;
+		else
+			kvm->arch.scan_hand = sp;
+		if (!--c)
+			break;
+	}
+
+	return prev ? prev : container_of(kvm->arch.active_mmu_pages.prev,
+					  struct kvm_mmu_page, link);
+}
+
 /*
  * Changing the number of mmu pages allocated to the vm
  * Note: if kvm_nr_mmu_pages is too small, you will get dead lock
@@ -1511,8 +1578,7 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages)
 
 	while (used_pages > kvm_nr_mmu_pages) {
 		struct kvm_mmu_page *page;
 
-		page = container_of(kvm->arch.active_mmu_pages.prev,
-				    struct kvm_mmu_page, link);
+		page = kvm_mmu_get_inactive_page(kvm);
 		kvm_mmu_zap_page(kvm, page);
 		used_pages--;
 	}
@@ -2712,8 +2778,7 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu)
 	       !list_empty(&vcpu->kvm->arch.active_mmu_pages)) {
 		struct kvm_mmu_page *sp;
 
-		sp = container_of(vcpu->kvm->arch.active_mmu_pages.prev,
-				  struct kvm_mmu_page, link);
+		sp = kvm_mmu_get_inactive_page(vcpu->kvm);
 		kvm_mmu_zap_page(vcpu->kvm, sp);
 		++vcpu->kvm->stat.mmu_recycled;
 	}
@@ -2871,8 +2936,7 @@ static void kvm_mmu_remove_one_alloc_mmu_page(struct kvm *kvm)
 {
 	struct kvm_mmu_page *page;
 
-	page = container_of(kvm->arch.active_mmu_pages.prev,
-			    struct kvm_mmu_page, link);
+	page = kvm_mmu_get_inactive_page(kvm);
 	kvm_mmu_zap_page(kvm, page);
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8b3a169..ccd5bea 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4782,6 +4782,7 @@ struct kvm *kvm_arch_create_vm(void)
 		return ERR_PTR(-ENOMEM);
 
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+	kvm->arch.scan_hand = NULL;
 	INIT_LIST_HEAD(&kvm->arch.assigned_dev_head);
 
 	/* Reserve bit 0 of irq_sources_bitmap for userspace irq source */

* Re: RFC: shadow page table reclaim
From: Avi Kivity @ 2009-09-02 11:31 UTC
To: Max Laier; +Cc: kvm

On 09/02/2009 05:24 AM, Max Laier wrote:
> Okay ... I have /something/, but I'm certainly not there yet, as I have to
> admit that I don't understand your idea completely.  In addition, it seems
> that EPT doesn't have an accessed bit :-\  Any idea for this?

Use the rwx bits as an approximation.  If the pages are needed they'll be
faulted back in, which is a lot cheaper than reconstructing them.

But why do you see reclaim with ept?  The pages ought to be constructed
once and then left alone, unless there is severe memory pressure.

> Regardless, testing the attached with EPT, it turns out that not zapping
> shadow pages with root_count != 0 already makes much difference.  After
> all, we don't really zap these pages anyway, but just mark them invalid
> after zapping the children.  So this could be a first improvement.
>
> In any case, I clearly don't have the right idea here yet, plus I don't
> really have time to look into this further right now.  And my hack is
> "good enough"[tm] for my testing ... so if anyone more knowledgeable
> would like to continue - much appreciated.  Maybe some of this can at
> least serve as food for thought.  Sorry.

Sure.

> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index a3f637f..089ad0e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -394,6 +394,7 @@ struct kvm_arch{
>  	 * Hash table of struct kvm_mmu_page.
>  	 */
>  	struct list_head active_mmu_pages;
> +	struct kvm_mmu_page *scan_hand;
>  	struct list_head assigned_dev_head;
>  	struct iommu_domain *iommu_domain;
>  	int iommu_flags;

Why is a scan hand needed?  I thought you could just clear the accessed
bits and requeue the page.

If you drop a page, all the accessed bits in the ptes are lost with it, so
you need to transfer them to the pointed-to pages before you drop it.

Other than that, this seems pretty complete.

-- 
error compiling committee.c: too many arguments to function

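[A hypothetical sketch of the "use the rwx bits as an approximation" idea
above - not code from this thread or from the kernel tree.  EPT_RWX_MASK and
the helper name are made up for illustration, and a real implementation
would have to go through the proper spte update path and flush the EPT TLB.]

/*
 * Sketch only: EPT here has no hardware accessed bit, so approximate one
 * with the permission bits.  Clearing RWX makes the next guest access take
 * an EPT violation; the fault path then restores the entry, so finding the
 * permissions set again at the next scan means the page was used in the
 * meantime.
 */
#define EPT_RWX_MASK	0x7ull		/* bits 0-2: read, write, execute */

static int ept_spte_test_and_clear_accessed(u64 *sptep)
{
	u64 spte = *sptep;

	if (!(spte & EPT_RWX_MASK))
		return 0;	/* still revoked: not touched since last scan */

	/*
	 * Revoke permissions; the next access faults and repopulates the
	 * entry.  TLB invalidation is omitted in this sketch.
	 */
	*sptep = spte & ~EPT_RWX_MASK;
	return 1;
}

The cost of the extra EPT violations is exactly the trade-off pointed out
above: refaulting a dropped entry is cheap compared to rebuilding a zapped
shadow page.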