From mboxrd@z Thu Jan  1 00:00:00 1970
From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Subject: Re: [PATCH 1/3] kvm-s390: infrastructure to kick vcpus out of guest
 state
Date: Thu, 28 May 2009 09:59:36 +0200
Message-ID: <4A1E4468.1050309@linux.vnet.ibm.com>
References: <1243251652-27617-1-git-send-email-ehrhardt@linux.vnet.ibm.com> <1243251652-27617-2-git-send-email-ehrhardt@linux.vnet.ibm.com> <20090525202248.GA7608@amt.cnet> <4A1BA233.6080504@linux.vnet.ibm.com> <20090528034412.GA6090@amt.cnet>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1;
	format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: kvm@vger.kernel.org, avi@redhat.com, borntraeger@de.ibm.com,
	cotte@de.ibm.com, heiko.carstens@de.ibm.com, schwidefsky@de.ibm.com
To: Marcelo Tosatti <mtosatti@redhat.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mtagate4.de.ibm.com ([195.212.29.153]:55626 "EHLO
	mtagate4.de.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1760223AbZE1H7k (ORCPT <rfc822;kvm@vger.kernel.org>);
	Thu, 28 May 2009 03:59:40 -0400
Received: from d12nrmr1607.megacenter.de.ibm.com (d12nrmr1607.megacenter.de.ibm.com [9.149.167.49])
	by mtagate4.de.ibm.com (8.14.3/8.13.8) with ESMTP id n4S7xfVx074572
	for <kvm@vger.kernel.org>; Thu, 28 May 2009 07:59:41 GMT
Received: from d12av01.megacenter.de.ibm.com (d12av01.megacenter.de.ibm.com [9.149.165.212])
	by d12nrmr1607.megacenter.de.ibm.com (8.13.8/8.13.8/NCO v9.2) with ESMTP id n4S7xfuj2838564
	for <kvm@vger.kernel.org>; Thu, 28 May 2009 09:59:41 +0200
Received: from d12av01.megacenter.de.ibm.com (loopback [127.0.0.1])
	by d12av01.megacenter.de.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id n4S7xfZG015815
	for <kvm@vger.kernel.org>; Thu, 28 May 2009 09:59:41 +0200
In-Reply-To: <20090528034412.GA6090@amt.cnet>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

Marcelo Tosatti wrote:
> On Tue, May 26, 2009 at 10:02:59AM +0200, Christian Ehrhardt wrote:
>  =20
>> Marcelo Tosatti wrote:
>>    =20
>>> On Mon, May 25, 2009 at 01:40:49PM +0200, ehrhardt@linux.vnet.ibm.c=
om wrote:
>>>  =20
>>>      =20
>>>> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>>>>
>>>> To ensure vcpu's come out of guest context in certain cases this p=
atch adds a
>>>> s390 specific way to kick them out of guest context. Currently it =
kicks them
>>>> out to rerun the vcpu_run path in the s390 code, but the mechanism=
 itself is
>>>> expandable and with a new flag we could also add e.g. kicks to use=
rspace etc.
>>>>
>>>> Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>>>>    =20
>>>>        =20
>>> "For now I added the optimization to skip kicking vcpus out of gues=
t
>>> that had the request bit already set to the s390 specific loop (sen=
t as
>>> v2 in a few minutes).
>>>
>>> We might one day consider standardizing some generic kickout levels=
 e.g.
>>> kick to "inner loop", "arch vcpu run", "generic vcpu run", "userspa=
ce",
>>> ... whatever levels fit *all* our use cases. And then let that kick=
s be
>>> implemented in an kvm_arch_* backend as it might be very different =
how
>>> they behave on different architectures."
>>>
>>> That would be ideal, yes. Two things make_all_requests handles:=20
>>>
>>> 1) It disables preemption with get_cpu(), so it can reliably check =
for
>>> cpu id. Somehow you don't need that for s390 when kicking multiple
>>> vcpus?
>>>  =20
>>>      =20
>> I don't even need the cpuid as make_all_requests does, I just insert=
 a =20
>> special bit in the vcpu arch part and the vcpu will "come out to me=20
>> (host)".
>> Fortunateley the kick is rare and fast so I can just insert it =20
>> unconditionally (it's even ok to insert it if the vcpu is not in gue=
st =20
>> state). That prevents us from needing vcpu lock or detailed checks w=
hich =20
>> would end up where we started (no guarantee that vcpu's come out of =
=20
>> guest context while trying to aquire all vcpu locks)
>>    =20
>
> Let me see if I get this right: you kick the vcpus out of guest mode =
by
> using a special bit in the vcpu arch part. OK.
>
> What I don't understand is this:=20
> "would end up where we started (no guarantee that vcpu's come out of
> guest context while trying to aquire all vcpu locks)"
>  =20
initially the mechanism looped over vcpu's and just aquired the vcpu=20
lock and then updated the vcpu.arch infor directly.
Avi mentioned that we have no guarantee if/when the vcpu will come out=20
of guest context to free a lock currently held and suggested the=20
mechanism x86 uses via setting vcpu->request and kicking the vcpu. That=
s=20
the eason behind "end up where we (the discussion) started", if we woul=
d=20
need the vcpu lock again we would be at the beginnign of the discussion=
=2E
> So you _need_ a mechanism to kick all vcpus out of guest mode?
>  =20
I have a mechanism to kick a vcpu, and I use it. Due to the fact that=20
smp_call_* don't work as kick for us the kick is an arch specific funct=
ion.
I hop ethat clarified this part :-)
>>> 2) It uses smp_call_function_many(wait=3D1), which guarantees that =
by the
>>> time make_all_requests returns no vcpus will be using stale data (t=
he
>>> remote vcpus will have executed ack_flush).
>>>  =20
>>>      =20
>> yes this is really a part my s390 implementation doesn't fulfill yet=
=2E =20
>> Currently on return vcpus might still use the old memslot informatio=
n.
>> As mentioned before letting all interrupts come "too far" out of the=
 hot =20
>> loop would be a performance issue, therefore I think I will need som=
e =20
>> request&confirm mechanism. I'm not sure yet but maybe it could be as=
 =20
>> easy as this pseudo code example:
>>
>> # in make_all_requests
>> # remember we have slots_lock write here and the reentry that update=
s =20
>> the vcpu specific data aquires slots_lock for read.
>> loop vcpus
>>  set_bit in vcpu requests
>>  kick vcpu #arch function
>> endloop
>>
>> loop vcpus
>>  until the requested bit is disappeared #as the reentry path uses =20
>> test_and_clear it will disappear
>> endloop
>>
>> That would be a implicit synchronization and should work, as I wrote=
 =20
>> before setting memslots while the guest is running is rare if ever =20
>> existant for s390. On x86 smp_call_many could then work without the =
wait =20
>> flag being set.
>>    =20
>
> I see, yes.=20
>
>  =20
>> But I assume that this synchronization approach is slower as it =20
>> serializes all vcpus on reentry (they wait for the slots_lock to get=
 =20
>> dropped), therefore I wanted to ask how often setting memslots on =20
>> runtime will occur on x86 ? Would this approach be acceptable ?
>>    =20
>
> For x86 we need slots_lock for two things:
>
> 1) to protect the memslot structures from changing (very rare), ie:
> kvm_set_memory.
>
> 2) to protect updates to the dirty bitmap (operations on behalf of
> guest) which take slots_lock for read versus updates to that dirty
> bitmap (an ioctl that retrieves what pages have been dirtied in the
> memslots, and clears the dirtyness info).
>
> All you need for S390 is 1), AFAICS.
>  =20
correct
> For 1), we can drop the slots_lock usage, but instead create an
> explicit synchronization point, where all vcpus are forced to (say
> kvm_vcpu_block) "paused" state. qemu-kvm has such notion.
>
> Same language?
>  =20
Yes, I think i got your point :-)
But I think by keeping slots_lock we already got our synchronization=20
point and don't need an explicit one adding extra code and maybe locks.
As I mentioned above it should synchronize already implicit. When I=20
looked at it once more yesterday I realized that kvm_set_memory is not=20
performance critical anyway (i.e. does not have to be the fastest ioctl=
=20
on earth) so we could be one step smarter and instead of serializing al=
l=20
vcpu's among each other we could set_memory just let do it one by one.

In case I lost you again due to my obviously confusing mainframe=20
language this week you might want to my next patch submission where I=20
implement that in the s390 arch code as an example. I'll put you on cc=20
and in that new code we might find an implicit language synchronization=
=20
for us :-)

[...]

--=20

Gr=FCsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization=20