From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andrea Arcangeli Subject: Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) Date: Wed, 28 May 2008 19:04:10 +0200 Message-ID: <20080528170410.GC8086@duo.random> References: <4830F90A.1020809@cisco.com> <4830FE8D.6010006@cisco.com> <48318E64.8090706@qumranet.com> <4832DDEB.4000100@qumranet.com> <4835EEF5.9010600@cisco.com> <483D391F.7050007@qumranet.com> <483D6898.2050605@cisco.com> <20080528144850.GX27375@duo.random> <483D7C45.5020300@qumranet.com> <483D7D8D.3030309@cisco.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Avi Kivity , kvm@vger.kernel.org To: "David S. Ahern" Return-path: Received: from host36-195-149-62.serverdedicati.aruba.it ([62.149.195.36]:40604 "EHLO mx.cpushare.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751556AbYE1REM (ORCPT ); Wed, 28 May 2008 13:04:12 -0400 Content-Disposition: inline In-Reply-To: <483D7D8D.3030309@cisco.com> Sender: kvm-owner@vger.kernel.org List-ID: On Wed, May 28, 2008 at 09:43:09AM -0600, David S. Ahern wrote: > This is the code in the RHEL3.8 kernel: > > static int scan_active_list(struct zone_struct * zone, int age, > struct list_head * list, int count) > { > struct list_head *page_lru , *next; > struct page * page; > int over_rsslimit; > > count = count * kscand_work_percent / 100; > /* Take the lock while messing with the list... */ > lru_lock(zone); > while (count-- > 0 && !list_empty(list)) { > page = list_entry(list->prev, struct page, lru); > pte_chain_lock(page); > if (page_referenced(page, &over_rsslimit) > && !over_rsslimit > && check_mapping_inuse(page)) > age_page_up_nolock(page, age); > else { > list_del(&page->lru); > list_add(&page->lru, list); > } > pte_chain_unlock(page); > } > lru_unlock(zone); > return 0; > } > > My previous email shows examples of the number of pages in the list and > the scanning that happens. 
This code looks better than the one below, as a limit was introduced
and the whole list isn't scanned anymore. If you decrease
kscand_work_percent (I assume it's a sysctl even if it's missing the
sysctl_ prefix) to, say, 1, you can limit the damage. Did you try it?

> Avi Kivity wrote:
> > Andrea Arcangeli wrote:
> >>
> >> So I never found a relation to the symptom reported of VM kernel
> >> threads going weird, with KVM optimal handling of kmap ptes.
> >>
> >
> >
> > The problem is this code:
> >
> > static int scan_active_list(struct zone_struct * zone, int age,
> >                             struct list_head * list)
> > {
> >         struct list_head *page_lru , *next;
> >         struct page * page;
> >         int over_rsslimit;
> >
> >         /* Take the lock while messing with the list... */
> >         lru_lock(zone);
> >         list_for_each_safe(page_lru, next, list) {
> >                 page = list_entry(page_lru, struct page, lru);
> >                 pte_chain_lock(page);
> >                 if (page_referenced(page, &over_rsslimit) && !over_rsslimit)
> >                         age_page_up_nolock(page, age);
> >                 pte_chain_unlock(page);
> >         }
> >         lru_unlock(zone);
> >         return 0;
> > }
> >
> > If the pages in the list are in the same order as in the ptes (which is
> > very likely), then we have the following access pattern

Yes, it is likely.

> > - set up kmap to point at pte
> > - test_and_clear_bit(pte)
> > - kunmap
> >
> > From kvm's point of view this looks like
> >
> > - several accesses to set up the kmap

Hmm, the kmap establishment takes a single guest operation in the
fixmap area. That's a single write to the pte, writing a pte_t, an
8/4 byte large region (PAE/non-PAE). The same pte_t is then cleared and
flushed out of the tlb with a cpu-local invlpg during kunmap_atomic. I
count 1 write here so far.

> > - if these accesses trigger flooding, we will have to tear down the
> > shadow for this page, only to set it up again soon

So the shadow mapping the fixmap area would be torn down by the
flooding.
Or is it the shadow corresponding to the real user pte, pointed to by
the fixmap, that is unshadowed by the flooding, or both/all?

> > - an access to the pte (emulated)

Here I count the second write. This one isn't done on the fixmap area
like the first write above; it is a write to the real user pte, pointed
to by the fixmap. So if this is emulated, it means the shadow of the
user pte pointing to the real data page is still active.

> > - if this access _doesn't_ trigger flooding, we will have 512 unneeded
> > emulations. The pte is worthless anyway since the accessed bit is clear
> > (so we can't set up a shadow pte for it)
> > - this bug was fixed

You mean the accessed bit on the fixmap pte used by kmap? Or the user
pte pointed to by the fixmap pte?

> > - an access to tear down the kmap

Yep, pte_clear on the fixmap pte_t followed by an invlpg (if that
matters).

> > [btw, am I reading this right? the entire list is scanned each time?

If the list parameter isn't a local LIST_HEAD on the stack but the
global one, it's a full scan each time. I guess it's the global list,
looking at the new code at the top that has a kscand_work_percent
sysctl.

> > if you have 1G of active HIGHMEM, that's a quarter of a million pages,
> > which would take at least a second no matter what we do. VMware can
> > probably special-case kmaps, but we can't]

Perhaps they have a list per age bucket or similar, but still I doubt
this works well on the host either... I guess the virtualization
overhead is exacerbating the inefficiency. Perhaps "killall -STOP
kscand" is a good enough fix ;). This seems to only push the age up; to
be functional the age has to go down, and I guess the aging down is
done by other threads, so stopping kscand may not hurt.

I think what we should aim for is to quickly reach this condition:

1) always keep the fixmap/kmap pte_t shadowed and emulate the
kmap/kunmap access, so the test_and_clear_young done on the user pte
doesn't require re-establishing the spte representing the fixmap
virtual address.
If we don't emulate the fixmap access, we'll have to re-establish the
spte during the write to the user pte, and tear it down again during
kunmap_atomic. So there's not much doubt fixmap access emulation is
worth it.

2) get rid of the user pte shadow mapping pointing to the user data, so
the test_and_clear of the young bitflag on the user pte will not be
emulated and will run at full CPU speed through the shadow pte mapping
corresponding to the fixmap virtual address.

The kscand pattern is the same as running mprotect on a 32bit 2.6
kernel, so it sounds worth optimizing for, even if kscand itself may be
unfixable without "killall -STOP kscand" or VM fixes to the guest.

However I'm not sure about point 2 in the light of mprotect. With
mprotect, the guest virtual addresses mapped by the guest user ptes
will be used. It's not like kscand, which may write forever to the user
ptes without ever using the guest virtual addresses that they're
mapping. So we'd better be sure that by unshadowing and optimizing for
kscand we're not hurting mprotect or other pte mangling operations in
2.6 that will likely keep accessing the guest virtual addresses mapped
by the user ptes previously modified.

Hope this makes sense; I'm not sure I understand this completely.