From: "David S. Ahern" 
Subject: Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
Date: Wed, 28 May 2008 11:24:04 -0600
Message-ID: <483D9534.8010002@cisco.com>
References: <4830F90A.1020809@cisco.com> <4830FE8D.6010006@cisco.com> <48318E64.8090706@qumranet.com> <4832DDEB.4000100@qumranet.com> <4835EEF5.9010600@cisco.com> <483D391F.7050007@qumranet.com> <483D6898.2050605@cisco.com> <20080528144850.GX27375@duo.random> <483D7C45.5020300@qumranet.com> <483D7D8D.3030309@cisco.com> <20080528170410.GC8086@duo.random>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: Avi Kivity , kvm@vger.kernel.org
To: Andrea Arcangeli
Return-path: 
Received: from sj-iport-1.cisco.com ([171.71.176.70]:26912 "EHLO sj-iport-1.cisco.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751340AbYE1RYI (ORCPT ); Wed, 28 May 2008 13:24:08 -0400
In-Reply-To: <20080528170410.GC8086@duo.random>
Sender: kvm-owner@vger.kernel.org
List-ID: 

Yes, I've tried changing kscand_work_percent (values of 50 and 30).
Basically it makes kscand wake more often (i.e., MIN_AGING_INTERVAL
declines in proportion) but do less work each trip through the lists.
I have not seen a noticeable change in guest behavior.

david

Andrea Arcangeli wrote:
> On Wed, May 28, 2008 at 09:43:09AM -0600, David S. Ahern wrote:
>> This is the code in the RHEL3.8 kernel:
>>
>> static int scan_active_list(struct zone_struct * zone, int age,
>> 			    struct list_head * list, int count)
>> {
>> 	struct list_head *page_lru, *next;
>> 	struct page * page;
>> 	int over_rsslimit;
>>
>> 	count = count * kscand_work_percent / 100;
>> 	/* Take the lock while messing with the list... 
*/
>> 	lru_lock(zone);
>> 	while (count-- > 0 && !list_empty(list)) {
>> 		page = list_entry(list->prev, struct page, lru);
>> 		pte_chain_lock(page);
>> 		if (page_referenced(page, &over_rsslimit)
>> 		    && !over_rsslimit
>> 		    && check_mapping_inuse(page))
>> 			age_page_up_nolock(page, age);
>> 		else {
>> 			list_del(&page->lru);
>> 			list_add(&page->lru, list);
>> 		}
>> 		pte_chain_unlock(page);
>> 	}
>> 	lru_unlock(zone);
>> 	return 0;
>> }
>>
>> My previous email shows examples of the number of pages in the list
>> and the scanning that happens.
>
> This code looks better than the one below, as a limit was introduced
> and the whole list isn't scanned anymore; if you decrease
> kscand_work_percent (I assume it's a sysctl even if it's missing the
> sysctl_ prefix) to, say, 1, you can limit the damage. Did you try it?
>
>> Avi Kivity wrote:
>>> Andrea Arcangeli wrote:
>>>> So I never found a relation to the reported symptom of VM kernel
>>>> threads going weird, with KVM's optimal handling of kmap ptes.
>>>>
>>>
>>> The problem is this code:
>>>
>>> static int scan_active_list(struct zone_struct * zone, int age,
>>> 			    struct list_head * list)
>>> {
>>> 	struct list_head *page_lru, *next;
>>> 	struct page * page;
>>> 	int over_rsslimit;
>>>
>>> 	/* Take the lock while messing with the list... */
>>> 	lru_lock(zone);
>>> 	list_for_each_safe(page_lru, next, list) {
>>> 		page = list_entry(page_lru, struct page, lru);
>>> 		pte_chain_lock(page);
>>> 		if (page_referenced(page, &over_rsslimit) && !over_rsslimit)
>>> 			age_page_up_nolock(page, age);
>>> 		pte_chain_unlock(page);
>>> 	}
>>> 	lru_unlock(zone);
>>> 	return 0;
>>> }
>>>
>>> If the pages in the list are in the same order as in the ptes (which
>>> is very likely), then we have the following access pattern
>
> Yes it is likely.
>
>>> - set up kmap to point at pte
>>> - test_and_clear_bit(pte)
>>> - kunmap
>>>
>>> From kvm's point of view this looks like
>>>
>>> - several accesses to set up the kmap
>
> Hmm, the kmap establishment takes a single guest operation in the
> fixmap area. That's a single write to the pte, writing an 8/4 byte
> pte_t region (PAE/non-PAE). The same pte_t is then cleared and
> flushed out of the tlb with a cpu-local invlpg during kunmap_atomic.
>
> I count 1 write here so far.
>
>>> - if these accesses trigger flooding, we will have to tear down the
>>> shadow for this page, only to set it up again soon
>
> So the shadow mapping the fixmap area would be torn down by the
> flooding.
>
> Or is it the shadow corresponding to the real user pte, pointed to by
> the fixmap, that is unshadowed by the flooding? Or both/all?
>
>>> - an access to the pte (emulated)
>
> Here I count the second write. This one isn't done on the fixmap area
> like the first write above; it's a write to the real user pte,
> pointed to by the fixmap. So if this is emulated it means the shadow
> of the user pte pointing to the real data page is still active.
>
>>> - if this access _doesn't_ trigger flooding, we will have 512 unneeded
>>> emulations. The pte is worthless anyway since the accessed bit is clear
>>> (so we can't set up a shadow pte for it)
>>> - this bug was fixed
>
> You mean the accessed bit on the fixmap pte used by kmap? Or on the
> user pte pointed to by the fixmap pte?
>
>>> - an access to tear down the kmap
>
> Yep, pte_clear on the fixmap pte_t followed by an invlpg (if that
> matters).
>
>>> [btw, am I reading this right? the entire list is scanned each time?
>
> If the list parameter isn't a local LIST_HEAD on the stack but the
> global one, it's a full scan each time. Looking at the new code at
> the top, which has the kscand_work_percent sysctl, I guess it's the
> global list.
>
>>> if you have 1G of active HIGHMEM, that's a quarter of a million pages,
>>> which would take at least a second no matter what we do. VMware can
>>> probably special-case kmaps, but we can't]
>
> Perhaps they have a list per age bucket or similar, but I still doubt
> this works well on the host either... I guess the virtualization
> overhead is exacerbating the inefficiency. Perhaps killall -STOP
> kscand is a good enough fix ;). That seems to only push the age up;
> to be functional the age has to go down too, and I guess the
> decrement is done by other threads, so stopping kscand may not hurt.
>
> I think what we should aim for is to quickly reach this condition:
>
> 1) always keep the fixmap/kmap pte_t shadowed and emulate the
> kmap/kunmap access, so the test_and_clear_young done on the user pte
> doesn't require re-establishing the spte representing the fixmap
> virtual address. If we don't emulate fixmap accesses we'll have to
> re-establish the spte during the write to the user pte, and tear it
> down again during kunmap_atomic. So there's not much doubt fixmap
> access emulation is worth it.
>
> 2) get rid of the user pte shadow mapping pointing to the user data,
> so the test_and_clear of the young bitflag on the user pte will not
> be emulated and will run at full CPU speed through the shadow pte
> mapping corresponding to the fixmap virtual address.
>
> The kscand pattern is the same as running mprotect on a 32bit 2.6
> kernel, so it sounds worth optimizing for, even if kscand itself may
> be unfixable without killall -STOP kscand or VM fixes to the guest.
>
> However I'm not sure about point 2 in the light of mprotect. With
> mprotect the guest virtual addresses mapped by the guest user ptes
> will be used. It's not like kscand, which may write forever to the
> user ptes without ever using the guest virtual addresses that they're
> mapping. 
So we had better be sure that by unshadowing and optimizing for
> kscand we're not hurting mprotect or other pte-mangling operations in
> 2.6 that will likely keep accessing the guest virtual addresses mapped
> by the user ptes previously modified.
>
> Hope this makes sense; I'm not sure I understand this completely.
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>