From mboxrd@z Thu Jan 1 00:00:00 1970
From: Avi Kivity
Subject: Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
Date: Sat, 31 May 2008 10:39:17 +0300
Message-ID: <484100A5.2070704@qumranet.com>
References: <4835EEF5.9010600@cisco.com> <483D391F.7050007@qumranet.com> <483D6898.2050605@cisco.com> <20080528144850.GX27375@duo.random> <483D7C45.5020300@qumranet.com> <483D7D8D.3030309@cisco.com> <20080528170410.GC8086@duo.random> <483E7EE2.8010508@qumranet.com> <20080529142703.GJ8086@duo.random> <483EC8E7.4010501@qumranet.com> <20080530131238.GB3118@duo.random>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "David S. Ahern" , kvm@vger.kernel.org
To: Andrea Arcangeli
Return-path:
Received: from bzq-179-150-194.static.bezeqint.net ([212.179.150.194]:31470 "EHLO il.qumranet.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750710AbYEaHjS (ORCPT ); Sat, 31 May 2008 03:39:18 -0400
In-Reply-To: <20080530131238.GB3118@duo.random>
Sender: kvm-owner@vger.kernel.org
List-ID:

Andrea Arcangeli wrote:
> On Thu, May 29, 2008 at 06:16:55PM +0300, Avi Kivity wrote:
>
>> Yes. We need a fault in order to set the guest accessed bit.
>>
>
> So what I'm missing now is how the spte corresponding to the user pte
> that is under test_and_clear to clear the accessed bit, will not be
> zapped immediately. If we don't zap it immediately, how do we set the
> accessed bit again on the user pte, when the user program returned
> running and used that shadow pte to access the program data after the
> kscand pass?
>

The spte is zapped unconditionally in kvm_mmu_pte_write(), and not
re-established in mmu_pte_write_new_pte() due to the missing accessed
bit. The question is whether to tear down the shadow page it is
contained in, or not.

> Or am I missing something?
>
>> Unshadowing a page is expensive, both in immediate cost, and in future cost
>> of reshadowing the page and taking faults. It's worthwhile to be sure the
>> guest really doesn't want it as a page table.
>>
>
> Ok that makes sense, but can we defer the unshadowing while still
> emulating the accessed bit correctly on the user pte?
>

We do, unless there's a bad bug somewhere.

>> If the pages are not scanned linearly, then unshadowing may not help.
>>
>
> It should help the second time kscand runs, for the user ptes that
> aren't shadowed anymore, the second pass won't require any emulation
> for test_and_bit because the spte of the fixmap area will be
> read-write. The bug that passes the anonymous pages number instead of
> the cache number will lead to many more test_and_clear than needed,
> and not all user ptes may be used in between two different kscand passes.
>

We still need 3 emulations per pte to set the fixmap entry. Unshadowing
saves one emulation on the pte itself.

>> Let's see 1G of highmem is 250,000 pages, mapped by 500 page tables.
>>
>
> There are likely 1500 ptes in highmem. (ram isn't the most important factor)
>

I use 'pte' in the Intel manual sense (page table entry), not the Linux
sense (page table). I mentioned these numbers to see the worst case
behavior.

Non-highmem:
- with unshadow: O(500) accesses to unshadow the page tables, then native speed
- without unshadow: O(250000) accesses to modify the ptes

Highmem:
- with unshadow: O(250000) accesses to update the fixmap entry
- without unshadow: O(250000) accesses to update the fixmap entry and to modify the ptes

>> Well, then after 4000 scans we ought to have unshadowed everything. So I
>> guess per-page-pte-history is broken, can't explain it otherwise.
>>
>
> Yes, we should have unshadowed all user ptes after 4000 scans and then
> the test_and_clear shouldn't require any more emulation, there will be
> only 3 emulations for each kmap_atomic/kunmap_atomic.
>

So we save 25%.
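A rough sketch of where the 25% comes from, using only the figures quoted in this thread: 3 emulated writes per pte for the fixmap update, plus 1 for the test-and-clear while the page table is still shadowed. The function and constant names are made up for illustration, and the counts are a worst-case model, not measurements.

```python
# Worst-case emulation counts for one kscand pass, per the numbers in this thread.
# Assumption: every pte costs 3 emulated writes for the fixmap entry
# (kmap_atomic/kunmap_atomic) and 1 more for the test-and-clear on the pte
# itself while its page table is still shadowed.

FIXMAP_EMULATIONS_PER_PTE = 3
PTE_EMULATIONS_PER_PTE = 1


def highmem_emulations(ptes: int, unshadowed: bool) -> int:
    """Emulated writes needed to scan `ptes` highmem ptes once."""
    per_pte = FIXMAP_EMULATIONS_PER_PTE
    if not unshadowed:
        per_pte += PTE_EMULATIONS_PER_PTE
    return ptes * per_pte


def lowmem_emulations(ptes: int, page_tables: int, unshadowed: bool) -> int:
    """Non-highmem: one access per page table to unshadow, else one per pte."""
    return page_tables if unshadowed else ptes


def savings(ptes: int) -> float:
    """Fraction of highmem emulations removed by unshadowing."""
    before = highmem_emulations(ptes, unshadowed=False)
    after = highmem_emulations(ptes, unshadowed=True)
    return (before - after) / before


if __name__ == "__main__":
    ptes, page_tables = 250_000, 500  # 1G of memory, rounded as in the thread
    print(highmem_emulations(ptes, False), highmem_emulations(ptes, True))
    print(f"{savings(ptes):.0%}")
```

In this model the saving does not depend on the number of ptes: it is always 1 emulation out of 4, which is why unshadowing alone cannot fix the highmem case.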
It's still bad even if everything is working correctly.

>
> I think it should be clear that by now, we're trying to be
> bug-compatible like the host here, and optimizing for 2.6 kmaps.
>

Don't understand.

I'm guessing ESX gets its good performance by special-casing something.
For example, they can keep the fixmap page never shadowed, always
emulate accesses through the fixmap page, and recompile instructions
that go through fixmap to issue a hypercall.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.