From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andrea Arcangeli Subject: Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) Date: Wed, 28 May 2008 19:04:10 +0200 Message-ID: <20080528170410.GC8086@duo.random> References: <4830F90A.1020809@cisco.com> <4830FE8D.6010006@cisco.com> <48318E64.8090706@qumranet.com> <4832DDEB.4000100@qumranet.com> <4835EEF5.9010600@cisco.com> <483D391F.7050007@qumranet.com> <483D6898.2050605@cisco.com> <20080528144850.GX27375@duo.random> <483D7C45.5020300@qumranet.com> <483D7D8D.3030309@cisco.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Avi Kivity , kvm@vger.kernel.org To: "David S. Ahern" Return-path: Received: from host36-195-149-62.serverdedicati.aruba.it ([62.149.195.36]:40604 "EHLO mx.cpushare.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751556AbYE1REM (ORCPT ); Wed, 28 May 2008 13:04:12 -0400 Content-Disposition: inline In-Reply-To: <483D7D8D.3030309@cisco.com> Sender: kvm-owner@vger.kernel.org List-ID: On Wed, May 28, 2008 at 09:43:09AM -0600, David S. Ahern wrote: > This is the code in the RHEL3.8 kernel: > > static int scan_active_list(struct zone_struct * zone, int age, > struct list_head * list, int count) > { > struct list_head *page_lru , *next; > struct page * page; > int over_rsslimit; > > count = count * kscand_work_percent / 100; > /* Take the lock while messing with the list... */ > lru_lock(zone); > while (count-- > 0 && !list_empty(list)) { > page = list_entry(list->prev, struct page, lru); > pte_chain_lock(page); > if (page_referenced(page, &over_rsslimit) > && !over_rsslimit > && check_mapping_inuse(page)) > age_page_up_nolock(page, age); > else { > list_del(&page->lru); > list_add(&page->lru, list); > } > pte_chain_unlock(page); > } > lru_unlock(zone); > return 0; > } > > My previous email shows examples of the number of pages in the list and > the scanning that happens. 
This code looks better than the one below, as a limit was introduced
and the whole list isn't scanned anymore. If you decrease
kscand_work_percent (I assume it's a sysctl even if it's missing the
sysctl_ prefix) to, say, 1, you can limit the damage. Did you try it?

> Avi Kivity wrote:
> > Andrea Arcangeli wrote:
> >>
> >> So I never found a relation to the symptom reported of VM kernel
> >> threads going weird, with KVM optimal handling of kmap ptes.
> >>
> >
> >
> > The problem is this code:
> >
> > static int scan_active_list(struct zone_struct * zone, int age,
> >                             struct list_head * list)
> > {
> >         struct list_head *page_lru , *next;
> >         struct page * page;
> >         int over_rsslimit;
> >
> >         /* Take the lock while messing with the list... */
> >         lru_lock(zone);
> >         list_for_each_safe(page_lru, next, list) {
> >                 page = list_entry(page_lru, struct page, lru);
> >                 pte_chain_lock(page);
> >                 if (page_referenced(page, &over_rsslimit) && !over_rsslimit)
> >                         age_page_up_nolock(page, age);
> >                 pte_chain_unlock(page);
> >         }
> >         lru_unlock(zone);
> >         return 0;
> > }
> >
> > If the pages in the list are in the same order as in the ptes (which is
> > very likely), then we have the following access pattern

Yes, it is likely.

> > - set up kmap to point at pte
> > - test_and_clear_bit(pte)
> > - kunmap
> >
> > From kvm's point of view this looks like
> >
> > - several accesses to set up the kmap

Hmm, the kmap establishment takes a single guest operation in the
fixmap area. That's a single write to the pte, writing a pte_t, an
8/4 byte large region (PAE/non-PAE). The same pte_t is then cleared and
flushed out of the tlb with a cpu-local invlpg during kunmap_atomic. I
count 1 write here so far.

> > - if these accesses trigger flooding, we will have to tear down the
> > shadow for this page, only to set it up again soon

So the shadow mapping the fixmap area would be torn down by the
flooding.
Or is it the shadow corresponding to the real user pte, pointed to by
the fixmap, that is unshadowed by the flooding, or both/all?

> > - an access to the pte (emulated)

Here I count the second write. This one isn't done on the fixmap area
like the first write above; it is a write to the real user pte, pointed
to by the fixmap. So if this is emulated, it means the shadow of the
user pte pointing to the real data page is still active.

> > - if this access _doesn't_ trigger flooding, we will have 512 unneeded
> > emulations. The pte is worthless anyway since the accessed bit is clear
> > (so we can't set up a shadow pte for it)
> > - this bug was fixed

You mean the accessed bit on the fixmap pte used by kmap? Or the user
pte pointed to by the fixmap pte?

> > - an access to tear down the kmap

Yep, pte_clear on the fixmap pte_t followed by an invlpg (if that
matters).

> > [btw, am I reading this right? the entire list is scanned each time?

If the list parameter isn't a local LIST_HEAD on the stack but the
global one, it's a full scan each time. I guess it's the global list,
looking at the new code at the top that has a kscand_work_percent
sysctl.

> > if you have 1G of active HIGHMEM, that's a quarter of a million pages,
> > which would take at least a second no matter what we do. VMware can
> > probably special-case kmaps, but we can't]

Perhaps they have a list per age bucket or similar, but still I doubt
this works well on the host either... I guess the virtualization
overhead is exacerbating the inefficiency. Perhaps "killall -STOP
kscand" is a good enough fix ;). This seems to only push the age up; to
be functional the age has to go down, and I guess the aging down is
done by other threads, so stopping kscand may not hurt.

I think what we should aim for is to quickly reach this condition:

1) always keep the fixmap/kmap pte_t shadowed and emulate the
kmap/kunmap access, so the test_and_clear_young done on the user pte
doesn't require re-establishing the spte representing the fixmap
virtual address.
If we don't emulate the fixmap access, we'll have to re-establish the
spte during the write to the user pte, and tear it down again during
kunmap_atomic. So there's not much doubt fixmap access emulation is
worth it.

2) get rid of the user pte shadow mapping pointing to the user data, so
the test_and_clear of the young bitflag on the user pte will not be
emulated and will run at full CPU speed through the shadow pte mapping
corresponding to the fixmap virtual address.

The kscand pattern is the same as running mprotect on a 32bit 2.6
kernel, so it sounds worth optimizing for, even if kscand itself may be
unfixable without "killall -STOP kscand" or VM fixes to the guest.

However I'm not sure about point 2 in the light of mprotect. With
mprotect, the guest virtual addresses mapped by the guest user ptes
will be used. It's not like kscand, which may write forever to the user
ptes without ever using the guest virtual addresses that they're
mapping. So we'd better be sure that by unshadowing and optimizing for
kscand we're not hurting mprotect or other pte mangling operations in
2.6 that will likely keep accessing the guest virtual addresses mapped
by the user ptes previously modified.

Hope this makes sense; I'm not sure I understand this completely.