From: Radim Krčmář
Subject: Re: [PATCH 3/6] KVM: Dirty memory tracking for performant checkpointing and improved live migration
Date: Wed, 4 May 2016 21:27:10 +0200
Message-ID: <20160504192709.GH30059@potion>
References: <33d8668e-2bba-af91-069e-6452609a6ff0@linux.intel.com>
 <20160429181911.GA2687@potion>
 <20160503141118.GA27975@potion>
 <32d8060e-648c-cf99-970a-3ddadc6a501a@linux.intel.com>
 <20160504131314.GA27590@potion>
To: "Cao, Lei"
Cc: "Huang, Kai", Paolo Bonzini, "kvm@vger.kernel.org"

2016-05-04 17:15+0000, Cao, Lei:
> On 5/4/2016 9:13 AM, Radim Krčmář wrote:
>> Good designs so far seem to be:
>>  memslot -> lockless radix tree
>> and
>>  vcpu -> memslot -> list  (memslot -> vcpu -> list)
>>
>
> There is no need for lookup, the dirty log is fetched in sequence, so why use
> radix tree with added complexity but no benefit?
>
> List can be designed to be lockless, so memslot -> lockless fixed list?

It can, but a lockless list for concurrent writers is harder than a
lockless list for one writer and one concurrent reader.  The difference
is in starvation -- it's possible that a VCPU would never get to write
an entry unless you implemented a queueing mechanism.  A queueing
mechanism means that you basically have a spinlock, so I wouldn't bother
with a lockless list and would just try a spinlock directly.

A spinlock with a very short critical section might actually work well
for < 256 VCPUs and is definitely the easiest option.  Worth
experimenting with, IMO.

A lockless radix tree doesn't starve.
Every entry has a well-defined place in the tree.  The entry just might
not be fully allocated yet.  If another VCPU is faster and expands the
tree, then other VCPUs use that extended tree until they all get to
their leaf nodes; VCPUs basically cooperate on growing the tree.

And I completely forgot that we can preallocate the whole tree and use
an effective packed storage thanks to that.  My first guess is that it
would make sense with double the memory of our bitmap.  Scans and
insertion would be slower than for a per-vcpu list, but much faster than
with a dynamically allocated structure.  I'll think a bit about that.

The main reason why I'd like something that can contain all dirty pages
is overflow -- the userspace has to treat *all* pages as dirty if we
lose a dirty page, so overflow must never happen -- we have to either
grow the dirty log or suspend the writer until userspace frees space ...
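
To make the "every entry has a well-defined place" and "overflow must
never happen" points concrete, here is a minimal userspace sketch of a
fully preallocated, two-level structure.  It is only an illustration
under stated assumptions, not the design proposed in this thread: the
names (struct dirty_tree, mark_dirty, collect_dirty) are hypothetical,
it uses C11 atomics rather than kernel primitives, and its memory
footprint (bitmap plus one summary bit per 64 pages) is not the "double
the memory" layout guessed at above.  Writers only do atomic ORs, so no
VCPU can be starved by another, and since every page has its own slot
the log cannot overflow.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical two-level layout: one leaf bit per page,
 * one summary bit per 64-bit leaf word. */
struct dirty_tree {
	size_t npages;
	size_t nleaf;      /* number of leaf words */
	size_t nsummary;   /* number of summary words */
	_Atomic uint64_t *leaf;
	_Atomic uint64_t *summary;
};

/* Index of the lowest set bit; x must be non-zero. */
static unsigned ctz64(uint64_t x)
{
	unsigned n = 0;

	while (!(x & 1)) {
		x >>= 1;
		n++;
	}
	return n;
}

/* Preallocate both levels, so marking a page never allocates. */
static int dirty_tree_init(struct dirty_tree *t, size_t npages)
{
	t->npages = npages;
	t->nleaf = (npages + 63) / 64;
	t->nsummary = (t->nleaf + 63) / 64;
	t->leaf = malloc(t->nleaf * sizeof(*t->leaf));
	t->summary = malloc(t->nsummary * sizeof(*t->summary));
	if (!t->leaf || !t->summary) {
		free(t->leaf);
		free(t->summary);
		return -1;
	}
	for (size_t i = 0; i < t->nleaf; i++)
		atomic_init(&t->leaf[i], 0);
	for (size_t i = 0; i < t->nsummary; i++)
		atomic_init(&t->summary[i], 0);
	return 0;
}

static void dirty_tree_free(struct dirty_tree *t)
{
	free(t->leaf);
	free(t->summary);
}

/* Writer side (any number of concurrent writers): set the leaf bit,
 * then publish it through the summary bit. */
static void mark_dirty(struct dirty_tree *t, size_t gfn)
{
	size_t w = gfn / 64;

	assert(gfn < t->npages);
	atomic_fetch_or_explicit(&t->leaf[w], UINT64_C(1) << (gfn % 64),
				 memory_order_relaxed);
	atomic_fetch_or_explicit(&t->summary[w / 64], UINT64_C(1) << (w % 64),
				 memory_order_release);
}

/* Reader side (single reader): fetch and clear all dirty pages in
 * ascending order, skipping leaf words whose summary bit is clear.
 * cap must be at least npages so the output can never overflow. */
static size_t collect_dirty(struct dirty_tree *t, size_t *out, size_t cap)
{
	size_t n = 0;

	assert(cap >= t->npages);
	for (size_t s = 0; s < t->nsummary; s++) {
		uint64_t sum = atomic_exchange_explicit(&t->summary[s],
					(uint64_t)0, memory_order_acquire);
		while (sum) {
			size_t w = s * 64 + ctz64(sum);
			uint64_t bits = atomic_exchange_explicit(&t->leaf[w],
					(uint64_t)0, memory_order_relaxed);

			sum &= sum - 1;
			while (bits) {
				out[n++] = w * 64 + ctz64(bits);
				bits &= bits - 1;
			}
		}
	}
	return n;
}
```

A writer that sets its leaf bit just after the reader has cleared the
summary word is not lost; its summary bit simply makes the page show up
in the next scan.  The worst case is a summary bit pointing at an
already emptied leaf word, which costs one wasted exchange.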