From: Mario Smarduch <m.smarduch@samsung.com>
To: Christoffer Dall <christoffer.dall@linaro.org>
Cc: kvmarm@lists.cs.columbia.edu, marc.zyngier@arm.com,
steve.capper@arm.com, kvm@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, gavin.guo@canonical.com,
peter.maydell@linaro.org, jays.lee@samsung.com,
sungjinn.chung@samsung.com
Subject: Re: [PATCH v6 3/4] live migration support for VM dirty log management
Date: Wed, 28 May 2014 10:59:51 -0700 [thread overview]
Message-ID: <53862417.9040107@samsung.com> (raw)
In-Reply-To: <20140528090834.GK16428@lvm>
On 05/28/2014 02:08 AM, Christoffer Dall wrote:
> On Tue, May 27, 2014 at 02:55:21PM -0700, Mario Smarduch wrote:
>> On 05/27/2014 01:12 PM, Christoffer Dall wrote:
>>> On Thu, May 15, 2014 at 11:27:30AM -0700, Mario Smarduch wrote:
>
> [...]
>
>>>> +
>>>> + /* If pgd, pud, pmd not present and you cross pmd range check next
>>>> + * index.
>>>> + */
>>>> + pgd = pgdp + pgd_index(ipa);
>>>> + if (unlikely(crosses_pmd && !pgd_present(*pgd))) {
>>>> + pgd = pgdp + pgd_index(next);
>>>> + if (!pgd_present(*pgd))
>>>> + return;
>>>> + }
>>>> +
>>>> + pud = pud_offset(pgd, ipa);
>>>> + if (unlikely(crosses_pmd && !pud_present(*pud))) {
>>>> + pud = pud_offset(pgd, next);
>>>> + if (!pud_present(*pud))
>>>> + return;
>>>> + }
>>>> +
>>>> + pmd = pmd_offset(pud, ipa);
>>>> + if (unlikely(crosses_pmd && !pmd_present(*pmd))) {
>>>> + pmd = pmd_offset(pud, next);
>>>> + if (!pmd_present(*pmd))
>>>> + return;
>>>> + }
>>>> +
>>>> + for (;;) {
>>>> + pte = pte_offset_kernel(pmd, ipa);
>>>> + if (!pte_present(*pte))
>>>> + goto next_ipa;
>>>> +
>>>> + if (kvm_s2pte_readonly(pte))
>>>> + goto next_ipa;
>>>> + kvm_set_s2pte_readonly(pte);
>>>> +next_ipa:
>>>> + mask &= mask - 1;
>>>> + if (!mask)
>>>> + break;
>>>> +
>>>> + /* find next page */
>>>> + ipa = (gfnofst + __ffs(mask)) << PAGE_SHIFT;
>>>> +
>>>> + /* skip upper page table lookups */
>>>> + if (!crosses_pmd)
>>>> + continue;
>>>> +
>>>> + pgd = pgdp + pgd_index(ipa);
>>>> + if (unlikely(!pgd_present(*pgd)))
>>>> + goto next_ipa;
>>>> + pud = pud_offset(pgd, ipa);
>>>> + if (unlikely(!pud_present(*pud)))
>>>> + goto next_ipa;
>>>> + pmd = pmd_offset(pud, ipa);
>>>> + if (unlikely(!pmd_present(*pmd)))
>>>> + goto next_ipa;
>>>> + }
>>>
>>> So I think the reason this is done separately on x86 is that they have
>>> an rmap structure for their gfn mappings so that they can quickly lookup
>>> ptes based on a gfn and write-protect it without having to walk the
>>> stage-2 page tables.
>>
>> Yes, they also use rmapps for mmu notifiers, invalidations on huge VMs and
>> large ranges resulted in excessive times.
>>>
>>> Unless you want to introduce this on ARM, I think you will be much
>>
>> Eventually yes but that would also require reworking mmu notifiers. I had
>> two step approach in mind. Initially get the dirty page marking to work,
>> TLB flushing, GIC/arch-timer migration, validate migration under various
>> stress loads (page reclaim) with mmu notifiers, test several VMs and migration
>> times.
>>
>> Then get rmapp (or something similar) working - eventually for huge VMs it's
>> needed. In short two phases.
>>
>>> better off just having a single (properly written) iterating
>>> write-protect function, that takes a start and end IPA and a bitmap for
>>> which pages to actually write-protect, which can then handle the generic
>>> case (either NULL or all-ones bitmap) or a specific case, which just
>>> traverses the IPA range given as input. Such a function should follow
>>> the model of page table walk functions discussed previously
>>> (separate functions: wp_pgd_enties(), wp_pud_entries(),
>>> wp_pmd_entries(), wp_pte_entries()).
>>>
>>> However, you may want to verify my assumption above with the x86 people
>>> and look at sharing the rmap logic between architectures.
>>>
>>> In any case, this code is very difficult to read and understand, and it
>>> doesn't look at all like the other code we have to walk page tables. I
>>> understand you are trying to optimize for performance (by skipping some
>>> intermediate page table level lookups), but you never declare that goal
>>> anywhere in the code or in the commit message.
>>
>> Marc's comment noticed I was walking a small range (128k), using upper table
>> iterations that covered 1G, 2MB ranges. As you mention the code tries to
>> optimize upper table lookups. Yes the function is too bulky, but I'm not sure how
>> to remove the upper table checks since page tables may change between the
>> time pages are marked dirty and the log is retrieved. And if a memory slot
>> is very dirty walking upper tables will impact performance. I'll think some
>> more on this function.
>>
> I think you should aim at the simplest possible implementation that
> functionally works, first. Let's verify that this thing works, have
> clean working code that implementation-wise is as minimal as possible.
>
> Then we can run perf on that and see if our migrations are very slow,
> where we are actually spending time, and only then optimize it.
>
> The solution to this specific problem for the time being appears quite
> clear to me: Follow the exact same scheme as for unmap_range (the one I
> sent out here:
> https://lists.cs.columbia.edu/pipermail/kvmarm/2014-May/009592.html, the
> diff is hard to read, so I recommend you apply the patch and look at the
> resulting code). Have a similar scheme, call it wp_ipa_range() or
> something like that, and use that for now.
Ok I'll reuse that code. I'll need to apply that patch given I need
to test various host conditions during migration.
>
> -Christoffer
>
next prev parent reply other threads:[~2014-05-28 17:59 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-05-15 18:27 [PATCH v6 0/4] live migration dirty bitmap support for ARMv7 Mario Smarduch
2014-05-15 18:27 ` [PATCH v6 1/4] add ARMv7 HYP API to flush VM TLBs without address param Mario Smarduch
2014-05-27 19:51 ` Christoffer Dall
2014-05-15 18:27 ` [PATCH v6 2/4] live migration support for initial write protect of VM Mario Smarduch
2014-05-27 19:58 ` Christoffer Dall
2014-05-27 20:15 ` Mario Smarduch
2014-05-27 20:20 ` Christoffer Dall
2014-05-15 18:27 ` [PATCH v6 3/4] live migration support for VM dirty log management Mario Smarduch
2014-05-27 20:12 ` Christoffer Dall
2014-05-27 21:55 ` Mario Smarduch
2014-05-28 9:08 ` Christoffer Dall
2014-05-28 17:59 ` Mario Smarduch [this message]
2014-05-15 18:27 ` [PATCH v6 4/4] add 2nd stage page fault handling during live migration Mario Smarduch
2014-05-27 20:19 ` Christoffer Dall
2014-05-28 1:30 ` Mario Smarduch
2014-05-28 8:09 ` Christoffer Dall
2014-05-28 17:55 ` Mario Smarduch
2014-05-28 18:42 ` Mario Smarduch
2014-05-29 2:02 ` Mario Smarduch
2014-05-29 8:42 ` Christoffer Dall
2014-05-29 8:51 ` Christoffer Dall
2014-05-29 17:08 ` Mario Smarduch
2014-05-29 17:57 ` Christoffer Dall
2014-05-29 19:10 ` Mario Smarduch
2014-05-15 18:51 ` [PATCH v6 0/4] live migration dirty bitmap support for ARMv7 Christoffer Dall
2014-05-15 22:53 ` Mario Smarduch
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=53862417.9040107@samsung.com \
--to=m.smarduch@samsung.com \
--cc=christoffer.dall@linaro.org \
--cc=gavin.guo@canonical.com \
--cc=jays.lee@samsung.com \
--cc=kvm@vger.kernel.org \
--cc=kvmarm@lists.cs.columbia.edu \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=marc.zyngier@arm.com \
--cc=peter.maydell@linaro.org \
--cc=steve.capper@arm.com \
--cc=sungjinn.chung@samsung.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox