From: Avi Kivity <avi@qumranet.com>
To: Andrea Arcangeli <andrea@qumranet.com>
Cc: "David S. Ahern" <daahern@cisco.com>, kvm@vger.kernel.org
Subject: Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
Date: Thu, 29 May 2008 18:16:55 +0300 [thread overview]
Message-ID: <483EC8E7.4010501@qumranet.com> (raw)
In-Reply-To: <20080529142703.GJ8086@duo.random>
Andrea Arcangeli wrote:
>>> Here I count the second write and this isn't done on the fixmap area
>>> like the first write above, but this is a write to the real user pte,
>>> pointed by the fixmap. So if this is emulated it means the shadow of
>>> the user pte pointing to the real data page is still active.
>>>
>>>
>> Right. But if we are scanning a page table linearly, it should be
>> unshadowed.
>>
>
> I think we're often not scanning page tables linearly with pte_chains,
> yet those should still be unshadowed. mmap won't always populate
> memory in linear order, and memory isn't always initialized by memset
> or paged in with contiguous virtual accesses.
>
>
I guess we aren't scanning the page table linearly, since with the
linear-scan test case I can't reproduce the problem.
> So while the assumption that following the active list will sometimes
> return guest ptes that map contiguous guest virtual addresses is valid,
> it only accounts for a small percentage of the active list. It largely
> depends on the userland apps. Furthermore, even if the active lru
> initially points to linear ptes, the list is then split into age
> buckets depending on the access patterns at runtime, which further
> fragments the linearity of the virtual addresses of the kmapped ptes.
>
> BTW, one thing we didn't account for in previous emails is that there
> can be more than one guest user pte modified by page_referenced, if
> it's not a direct page. And non-direct pages surely won't provide
> linear scans; in fact, for non-direct pages the most common case is
> that the pte_t will point to the same virtual address but through a
> different pgd_t * (and in turn a different pmd_t).
>
>
Since the pte tracking is per-page, it won't be affected by shared pages.
>>> You mean the accessed bit on fixmap pte used by kmap? Or the user pte
>>> pointed by the fixmap pte?
>>>
>>>
>> The user pte. After guest code runs test_and_clear_bit(accessed_bit,
>> ptep), we can't shadow that pte (all shadowed ptes must have the accessed
>> bit set in the corresponding guest pte, similar to how a tlb entry can only
>> exist if the accessed bit is set).
>>
>
> Is this a software invariant, to ensure that we'll refresh the accessed
> bit on the user pte too?
>
>
Yes. We need a fault in order to set the guest accessed bit.
> So this means kscand, by clearing the accessed bitflag on them, should
> automatically unshadow all user ptes pointed to by the fixmap pte.
>
> So a second test_and_clear_bit on the same user pte will run through
> the fixmap pte established by kmap_atomic without traps.
>
> So this means when the user program runs again, it'll find the user pte
> unshadowed and it'll have to re-instantiate the shadow ptes with a kvm
> page fault, whose primary objective is marking the user pte
> accessed again (to notify the next kscand pass that the data page
> pointed to by the user pte was used meanwhile).
>
> If I understand correctly, the establishment of the shadow pte
> corresponding to the user pte will have to write-protect the spte
> corresponding to the fixmap pte, because we need to intercept
> modifications to shadowed guest ptes, and the spte corresponding to the
> fixmap guest pte is now pointing to a shadowed guest pte after the
> program resumes running.
>
> Then when kscand runs again, for the pages that have been faulted in
> by the user program, we'll trap the test_and_clear_bit happening
> through the readonly spte corresponding to the fixmap guest pte,
> we'll unshadow the spte of the guest user pte again, and we'll mark the
> spte corresponding to the fixmap pte as read-write again, because the
> test_and_clear_bit tells us that we have to unshadow instead of
> emulating.
>
Yes.
>>> 2) get rid of the user pte shadow mapping pointing to the user data so
>>> the test_and_clear of the young bitflag on the user pte will not be
>>> emulated and it'll run at full CPU speed through the shadow pte
>>> mapping corresponding to the fixmap virtual address
>>>
>>>
>> That's what per-page-pte-history is supposed to do. The first few accesses
>> are emulated, the next will be native.
>>
>
> Why not go native immediately when we notice a test_and_clear of
> the accessed bit? First, the ptes won't be in contiguous virtual
> address order, so if the flood detection on the sptes corresponding to
> the guest user ptes depends on the gpas of the guest user ptes being
> contiguous, it won't work well. But more importantly, we've found a
> test_and_clear_bit of the accessed bitflag, so we should unshadow the
> user pte that is being marked "old" immediately, without needing to
> detect any flooding.
>
Unshadowing a page is expensive, both in immediate cost, and in future
cost of reshadowing the page and taking faults. It's worthwhile to be
sure the guest really doesn't want it as a page table.
>> It's still not full speed as the kmap setup has to be emulated (twice).
>>
>
> Agreed, the 1/2/3 emulations on writes to the fixmap area during
> kmap_atomic (1/2 for non-PAE/PAE, and 1 further pte_clear on 2.6 or 2.4
> debug-highmem) seem unavoidable.
>
> But the test_and_clear_bit write-protect fault (when the guest user pte
> is shadowed) should just unshadow the guest user pte, mark the spte
> representing the fixmap pte as writeable, and return immediately to
> guest mode to actually run test_and_clear_bit natively, without
> emulating the write.
>
> Noticing the test_and_clear_bit also requires a bit of instruction
> "detection", but once we detected it from the eip address, we don't
> have to write anything to the guest.
>
> But I guess I'm missing something...
>
>
If the pages are not scanned linearly, then unshadowing may not help.
Let's see: 1G of highmem is about 250,000 pages, mapped by roughly 500
page tables. So after 4000 scans we ought to have unshadowed
everything. I guess per-page-pte-history is broken; I can't explain it
otherwise.
>> One possible optimization is that if we see the first part of the kmap
>> instantiation, we emulate a few more instructions before returning to the
>> guest. Xen does this IIRC.
>>
>
> Surely this would avoid 1 write-protect fault per kmap_atomic, but I'm
> not sure 32-bit PAE is important enough to justify this. Most 32-bit
> enterprise kernels I've worked with aren't compiled with PAE; only the
> one called bigsmp is.
>
>
Well, it seems the RHEL 3.8 smp kernel is PAE.
> Also on 2.6, we could get the same benefit by making 2.6 at least as
> optimal as 2.4 by never clearing the fixmap pte and by doing invlpg
> only after setting it to a new value. Xen can't optimize that write in
> kunmap_atomic.
>
> 2.6 has debug enabled by default for no good reason. So that would be
> the first optimization to do as it saves a few cycles per
> kunmap_atomic on host too.
>
>
Yes, it's probably a small win on native as well.
>> I'm no longer sure the access pattern is sequential, since I see
>> kmap_atomic() will not recreate the pte if its value has not changed
>> (unless HIGHMEM_DEBUG).
>>
>
> Hmm, kmap_atomic always writes a new value to the fixmap pte, even if
> it was mapping the same user pte as before.
>
> static inline void *kmap_atomic(struct page *page, enum km_type type)
> {
> 	enum fixed_addresses idx;
> 	unsigned long vaddr;
>
> 	if (page < highmem_start_page)
> 		return page_address(page);
>
> 	idx = type + KM_TYPE_NR*smp_processor_id();
> 	vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
> #if HIGHMEM_DEBUG
> 	if (!pte_none(*(kmap_pte-idx)))
> 		out_of_line_bug();
> #endif
> 	set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
> 	__flush_tlb_one(vaddr);
>
> 	return (void*) vaddr;
> }
>
>
The CentOS 3.8 sources have:
static inline void *__kmap_atomic(struct page *page, enum km_type type)
{
	enum fixed_addresses idx;
	unsigned long vaddr;

	idx = type + KM_TYPE_NR*smp_processor_id();
	vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
#if HIGHMEM_DEBUG
	if (!pte_none(*(kmap_pte-idx)))
		out_of_line_bug();
#else
	/*
	 * Performance optimization - do not flush if the new
	 * pte is the same as the old one:
	 */
	if (pte_val(*(kmap_pte-idx)) == pte_val(mk_pte(page, kmap_prot)))
		return (void *) vaddr;
#endif
	set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
	__flush_tlb_one(vaddr);
	return (void*) vaddr;
}
(linux-2.4.21-47.EL)
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.