Re: [PATCH 3/6] KVM: Dirty memory tracking for performant checkpointing and improved live migration

From: "Radim Krčmář" <rkrcmar@redhat.com>
To: "Huang, Kai" <kai.huang@linux.intel.com>
Cc: "Cao, Lei" <Lei.Cao@stratus.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>
Subject: Re: [PATCH 3/6] KVM: Dirty memory tracking for performant checkpointing and improved live migration
Date: Wed, 4 May 2016 15:13:15 +0200
Message-ID: <20160504131314.GA27590@potion>
In-Reply-To: <32d8060e-648c-cf99-970a-3ddadc6a501a@linux.intel.com>

2016-05-04 19:45+1200, Huang, Kai:
> On 5/4/2016 2:11 AM, Radim Krčmář wrote:
>> 2016-05-03 18:06+1200, Huang, Kai:
>> > Actually my concern is, with your new mechanism to track guest dirty pages,
>> > there will be two logdirty mechanisms (using bitmap and your per-vcpu list),
>> > which I think is not good as it's a little bit redundant, given both
>> > mechanisms are used for dirty page logging.
>> > 
>> > I think your main concern with the current bitmap mechanism is that scanning
>> > the bitmap takes a lot of time: even when only a few pages get dirty, you still
>> > have to scan the entire bitmap, which results in bad performance if you run
>> > checkpoints very frequently. My suggestion is, instead of introducing two
>> > logdirty data structures, maybe you can try to use another more efficient
>> > data structure instead of bitmap for both current logdirty mechanism and
>> > your new interfaces. Maybe Xen's log-dirty tree is a good reference.
>> 
>> A sparse structure (buffer, tree, ...) also needs a mechanism to grow
>> (store new entries), so concurrent accesses become a problem, because
>> there has to be synchronization.  I think that per-vcpu structure
>> becomes mandatory when thousands of VCPUs dirty memory at the same time.
> 
> Yes, synchronization will be needed. But even for a per-vcpu structure, we
> still need a per-vcpu lock to access, say, gfn_list, right? For example, one
> thread from userspace trying to get and clear dirty pages would need to loop
> over all vcpus and acquire each vcpu's lock for gfn_list. (see function
> mt_reset_all_gfns in patch 3/6). It looks like this is not scalable either?

Coarse locking is optional.  The list can be designed to allow concurrent
addition and removal (circular buffer with 3 atomic markers).
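
To make the idea concrete, here is a minimal userspace C11 sketch of one
possible reading of "circular buffer with 3 atomic markers".  The roles of
the three markers (head, fetch, reset) are an assumption, as are all names
(dirty_ring, ring_push, ring_fetch, ring_reset, RING_SIZE); it assumes a
single producer (the vcpu) and a single harvester, and it is not the code
from the patch series.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u /* hypothetical capacity; must be a power of two */

struct dirty_ring {
	uint64_t gfn[RING_SIZE];
	atomic_uint head;  /* next slot the vcpu will write */
	atomic_uint fetch; /* next slot the harvester will read */
	atomic_uint reset; /* oldest slot that has not been recycled yet */
};

/* vcpu side (single producer): log one dirty gfn, false if the ring is full. */
bool ring_push(struct dirty_ring *r, uint64_t gfn)
{
	unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
	unsigned reset = atomic_load_explicit(&r->reset, memory_order_acquire);

	if (head - reset == RING_SIZE)
		return false; /* every slot is still owned by the harvester */

	r->gfn[head % RING_SIZE] = gfn;
	/* Publish the entry only after its payload has been written. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Harvester side (single consumer): read one entry, false if none pending. */
bool ring_fetch(struct dirty_ring *r, uint64_t *gfn)
{
	unsigned fetch = atomic_load_explicit(&r->fetch, memory_order_relaxed);
	unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);

	if (fetch == head)
		return false;

	*gfn = r->gfn[fetch % RING_SIZE];
	atomic_store_explicit(&r->fetch, fetch + 1, memory_order_release);
	return true;
}

/*
 * Hand all fetched slots back to the producer, e.g. after the pages were
 * write-protected again.  Returns the number of slots released.
 */
unsigned ring_reset(struct dirty_ring *r)
{
	unsigned fetch = atomic_load_explicit(&r->fetch, memory_order_acquire);
	unsigned reset = atomic_load_explicit(&r->reset, memory_order_relaxed);

	assert(fetch - reset <= RING_SIZE);
	atomic_store_explicit(&r->reset, fetch, memory_order_release);
	return fetch - reset;
}
```

Neither side takes a lock in this sketch: the vcpu only writes head, the
harvester only writes fetch and reset, and a slot is reused only after
reset has moved past it.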

If we had 'vcpu -> memslot -> structure' then we would design the
userspace interface so it would only affect one memslot, which would
avoid any scalability issue even if there was a vcpu+memslot lock in
each structure.

>> >                      Maybe Xen's log-dirty tree is a good reference.
>> 
>> Is there some top-level overview?
>> 
>> From a glance at the code, it looked like GPA bitmap sparsified with
>> radix tree in a manner similar to the page table hierarchy.
> 
> Yes, it is just a radix tree. The point is the tree will be pretty small if
> there are few dirty pages, so the scanning will be very quick compared to
> the bitmap.

The bitmap is slow to scan, but any dynamic structure will have problems
with insertion ...

I think the tree might work if we pre-allocated bigger chunks to avoid
allocation overhead and made it "lockless" (fine-grained locking using
cmpxchg) to avoid a bottleneck for concurrent writes.
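
As a rough illustration, a two-level version of such a tree could look
like the userspace C11 sketch below.  Everything in it is hypothetical
(dirty_tree, tree_mark_dirty, tree_count_dirty, the sizes, the bump
allocator standing in for "pre-allocated chunks"); it is neither Xen's
log-dirty tree nor code from this series.  Leaves are installed with a
compare-and-exchange and bits are set with an atomic OR, so writers
never take a lock.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WORD_BITS   (sizeof(unsigned long) * CHAR_BIT)
#define LEAF_BITS   512u /* gfns covered by one leaf (hypothetical) */
#define LEAF_WORDS  (LEAF_BITS / WORD_BITS)
#define ROOT_SLOTS  64u  /* the tree covers ROOT_SLOTS * LEAF_BITS gfns */
#define POOL_LEAVES 16u  /* size of the pre-allocated chunk of leaves */

struct leaf {
	atomic_ulong bits[LEAF_WORDS];
};

struct dirty_tree {
	_Atomic(struct leaf *) slot[ROOT_SLOTS]; /* NULL: no dirty page below */
	struct leaf pool[POOL_LEAVES];           /* zeroed leaves, allocated up front */
	atomic_uint pool_next;                   /* bump allocator index */
};

/* Mark one gfn dirty.  Returns false only if the leaf pool is exhausted. */
bool tree_mark_dirty(struct dirty_tree *t, uint64_t gfn)
{
	assert(gfn < (uint64_t)ROOT_SLOTS * LEAF_BITS);
	unsigned idx = gfn / LEAF_BITS;
	unsigned bit = gfn % LEAF_BITS;
	struct leaf *l = atomic_load_explicit(&t->slot[idx], memory_order_acquire);

	if (!l) {
		unsigned n = atomic_fetch_add(&t->pool_next, 1);
		if (n >= POOL_LEAVES)
			return false; /* a real design would need a fallback here */

		struct leaf *fresh = &t->pool[n];
		struct leaf *expected = NULL;
		if (atomic_compare_exchange_strong(&t->slot[idx], &expected, fresh))
			l = fresh;
		else
			l = expected; /* lost the race; our leaf is simply wasted */
	}

	atomic_fetch_or(&l->bits[bit / WORD_BITS], 1UL << (bit % WORD_BITS));
	return true;
}

/* Count dirty gfns; only populated leaves are visited. */
size_t tree_count_dirty(struct dirty_tree *t)
{
	size_t n = 0;

	for (unsigned i = 0; i < ROOT_SLOTS; i++) {
		struct leaf *l = atomic_load_explicit(&t->slot[i], memory_order_acquire);
		if (!l)
			continue;
		for (size_t w = 0; w < LEAF_WORDS; w++)
			for (unsigned long v = atomic_load(&l->bits[w]); v; v &= v - 1)
				n++; /* clear the lowest set bit each round */
	}
	return n;
}
```

The sketch deliberately wastes a leaf when two writers race to populate
the same slot, which keeps the write path free of any retry loop.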

>> We should have a dynamic sparse dirty log, to avoid wasting memory when
>> there are many small memslots, but a linear structure is probably still
>> fine.
> 
> The sparse dirty log structure can be allocated when necessary so I don't
> think it will waste memory. Take the radix tree as an example: if there's no
> dirty page in the slot, the pointer to the radix tree can be NULL, or just
> the root entry.

(And we want to waste some memory, because allocations are slow,
 tradeoffs, tradeoffs ...)

>> We don't care which vcpu dirtied the page, so it seems like a waste to
>> have them in the hierarchy, but I can't think of designs where the
>> sparse dirty log is rooted in memslot and its updates scale well.
>> 
>> 'memslot -> sparse dirty log' designs usually evolve into buffering on the
>> VCPU side before writing to the memslot, or aren't efficient for sparse
>> datasets.
>> 
>> Where do you think is the balance between 'memslot -> bitmap' and
>> 'vcpu -> memslot -> dirty buffer'?
> 
> In my opinion, we can first try 'memslot -> sparse dirty log'. Cao, Lei
> mentioned there were two bottlenecks: bitmap and bad multithread performance
> due to mmu_lock. I think 'memslot->sparse dirty log' might help to improve
> or solve the bitmap one.

The bitmap was chosen because it scales well with concurrent writes and
it is easy to export.  Lei already hit scalability issues with mmu_lock, so
I don't expect that we could afford to put all VCPUs onto one lock
elsewhere.
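
For comparison, the two properties of the bitmap can be shown with a
small userspace C11 sketch (slot_bitmap, bitmap_mark_dirty,
bitmap_harvest and SLOT_PAGES are made-up names, not the KVM code):
marking a page is one atomic OR on a word that never moves, so writers
need no lock, while harvesting has to touch every word no matter how
few pages are dirty.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define WORD_BITS  (sizeof(unsigned long) * CHAR_BIT)
#define SLOT_PAGES 4096u /* pages in the (hypothetical) memslot */
#define SLOT_WORDS (SLOT_PAGES / WORD_BITS)

struct slot_bitmap {
	atomic_ulong word[SLOT_WORDS];
};

/* Writer side: any number of vcpus may call this concurrently. */
void bitmap_mark_dirty(struct slot_bitmap *b, uint64_t rel_gfn)
{
	assert(rel_gfn < SLOT_PAGES);
	atomic_fetch_or_explicit(&b->word[rel_gfn / WORD_BITS],
				 1UL << (rel_gfn % WORD_BITS),
				 memory_order_relaxed);
}

/*
 * Reader side: atomically take and clear each word, then list the dirty
 * gfns in ascending order.  'out' must have room for SLOT_PAGES entries.
 * The loop always runs over all SLOT_WORDS words, which is the scanning
 * cost discussed in this thread.
 */
size_t bitmap_harvest(struct slot_bitmap *b, uint64_t *out)
{
	size_t n = 0;

	for (size_t w = 0; w < SLOT_WORDS; w++) {
		unsigned long v = atomic_exchange(&b->word[w], 0);

		for (unsigned bit = 0; v; bit++, v >>= 1)
			if (v & 1)
				out[n++] = w * WORD_BITS + bit;
	}
	return n;
}
```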

Good designs so far seem to be:
 memslot -> lockless radix tree
and
 vcpu -> memslot -> list  (memslot -> vcpu -> list)

I'd like to see the lockless radix tree, but I expect the per-vcpu list
to be *much* easier to implement.

Do you see other designs on the Pareto front?

Thanks.
