From mboxrd@z Thu Jan  1 00:00:00 1970
From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Subject: Re: [PATCH 3/3] kvm-s390: streamline memslot handling - rebased
Date: Mon, 08 Jun 2009 12:51:26 +0200
Message-ID: <4A2CED2E.6030904@linux.vnet.ibm.com>
References: <1243952771-32428-1-git-send-email-ehrhardt@linux.vnet.ibm.com> <1243952771-32428-4-git-send-email-ehrhardt@linux.vnet.ibm.com> <20090605205312.GA13471@amt.cnet>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1;
	format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: kvm@vger.kernel.org, avi@redhat.com, borntraeger@de.ibm.com,
	cotte@de.ibm.com, heiko.carstens@de.ibm.com, schwidefsky@de.ibm.com
To: Marcelo Tosatti <mtosatti@redhat.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mtagate2.de.ibm.com ([195.212.17.162]:54012 "EHLO
	mtagate2.de.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752663AbZFHKv3 (ORCPT <rfc822;kvm@vger.kernel.org>);
	Mon, 8 Jun 2009 06:51:29 -0400
Received: from d12nrmr1607.megacenter.de.ibm.com (d12nrmr1607.megacenter.de.ibm.com [9.149.167.49])
	by mtagate2.de.ibm.com (8.13.1/8.13.1) with ESMTP id n58ApVJx002462
	for <kvm@vger.kernel.org>; Mon, 8 Jun 2009 10:51:31 GMT
Received: from d12av02.megacenter.de.ibm.com (d12av02.megacenter.de.ibm.com [9.149.165.228])
	by d12nrmr1607.megacenter.de.ibm.com (8.13.8/8.13.8/NCO v9.2) with ESMTP id n58ApVTb2838718
	for <kvm@vger.kernel.org>; Mon, 8 Jun 2009 12:51:31 +0200
Received: from d12av02.megacenter.de.ibm.com (loopback [127.0.0.1])
	by d12av02.megacenter.de.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id n58ApUlc002343
	for <kvm@vger.kernel.org>; Mon, 8 Jun 2009 12:51:31 +0200
In-Reply-To: <20090605205312.GA13471@amt.cnet>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

Marcelo Tosatti wrote:
> On Tue, Jun 02, 2009 at 04:26:11PM +0200, ehrhardt@linux.vnet.ibm.com=
 wrote:
>  =20
>> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>>    =20
[...]
>> @@ -706,13 +713,19 @@ int kvm_arch_set_memory_region(struct kv
>> =20
>>  	/* request update of sie control block for all available vcpus */
>>  	for (i =3D 0; i < KVM_MAX_VCPUS; ++i) {
>> -		if (kvm->vcpus[i]) {
>> -			if (test_and_set_bit(KVM_REQ_MMU_RELOAD,
>> -						&kvm->vcpus[i]->requests))
>> -				continue;
>> -			kvm_s390_inject_sigp_stop(kvm->vcpus[i],
>> -						  ACTION_VCPUREQUEST_ON_STOP);
>> -		}
>> +		vcpu =3D kvm->vcpus[i];
>> +		if (!vcpu)
>> +			continue;
>> +
>> +		if (!test_and_set_bit(KVM_REQ_MMU_RELOAD, &vcpu->requests))
>> +			continue;
>> +
>> +		if (vcpu->cpu =3D=3D -1)
>> +			continue;
>>    =20
>
> What happens if the check for cpu =3D=3D -1 races with kvm_arch_vcpu_=
put?
> This context will wait until the vcpu_put context is scheduled back i=
n
> to clear the bit? Is that OK?
>  =20
It either comes back to clear the bit or it is consumed on deletion of=20
the vcpu. Both ways are ok. The question we have to answer is if it=20
might stall the mem update ioctl for too long.
Because eventually the check for vcpu->cpu =3D=3D -1 is just an optimiz=
ation=20
if we would completely ignore remove it we would have the same problem=20
-> could it stall the set mem operation too much. That means the "race"=
=20
is not an issue it might just be sub-optimal, but the chance for a long=
=20
stall could become an issue. Unfortunately I have no better approach to=
=20
that (yet), until then this I like this implementation more than what w=
e=20
would have without all the corner case fixes in that patch series.

>> +
>> +		kvm_s390_inject_sigp_stop(vcpu, ACTION_VCPUREQUEST_ON_STOP);
>> +		wait_on_bit(&vcpu->requests, KVM_REQ_MMU_RELOAD,
>> +			    wait_bit_schedule, TASK_UNINTERRUPTIBLE);
>>  	}
>>    =20
>
>  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  {
> +       vcpu->cpu =3D -1;
>         save_fp_regs(&vcpu->arch.guest_fpregs);
>  =20
[...]
>> +++ kvm/arch/s390/kvm/kvm-s390.h
>> @@ -92,6 +92,13 @@ static inline unsigned long kvm_s390_han
>>  	if (!vcpu->requests)
>>  		return 0;
>> =20
>> +	/* requests that can be handled at all levels */
>> +	if (test_and_clear_bit(KVM_REQ_MMU_RELOAD, &vcpu->requests)) {
>> +		smp_mb__after_clear_bit();
>>    =20
>
> Really need that smp_mb__after_clear_bit ? AFAIK test_and_clear_bit
> implies a barrier?
>  =20

Well I agree that practically test_and_clear_bit has a barrier on s390,=
=20
but as far as I read Documentation/atomic_ops.txt at line 339-360 I=20
think the interface does not imply it so I wanted to add it explicitly.=
=20
I would be happy if someone really knows the in depth details here and=20
corrects me :-)

>> +		wake_up_bit(&vcpu->requests, KVM_REQ_MMU_RELOAD);
>> +		kvm_s390_vcpu_set_mem(vcpu);
>> +	}
>> +
>>  	return vcpu->requests;
>>  }
>> =20
>> Index: kvm/virt/kvm/kvm_main.c
>> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
>> --- kvm.orig/virt/kvm/kvm_main.c
>> +++ kvm/virt/kvm/kvm_main.c
>> @@ -1682,6 +1682,10 @@ static int kvm_vcpu_release(struct inode
>>  {
>>  	struct kvm_vcpu *vcpu =3D filp->private_data;
>> =20
>> +	clear_bit(KVM_REQ_MMU_RELOAD, &vcpu->requests);
>> +	smp_mb__after_clear_bit();
>> +	wake_up_bit(&vcpu->requests, KVM_REQ_MMU_RELOAD);
>> +
>>    =20
>
> And this should be generic? Say if other architectures want to make u=
se=20
> of a similar wait infrastructure. Talk is cheap.
>  =20
Clear bit and wake up on release doesn't hurt any architecture, but it=20
is at a good place fine for those using the mechanism to ensure cleanin=
g=20
up outstanding things when closing a vcpu fd.
I thought its not worth to add kvm_ARCH_vcpu_release for it while I=20
could do so if we want it separated.
(continued below)
> Anyway, yeah, the set request / wait mechanism you implement here is
> quite similar to the idea mentioned earlier that could be used for x8=
6.
>
> Just get rid of this explicit KVM_REQ_MMU_RELOAD knowledge in
> arch-independent code please (if you want to see this merged).
>  =20
I agree to lift the wait part to other archs later if needed, but as=20
mentioned above I could move this to arch code to the cost of one arch=20
hook more. But as also mentioned it doesn't really hurt. I agree that i=
t=20
does not need to be KVM_REQ_MMU_RELOAD specific, we could just=20
walk/clear/wake all bits on that vcpu->requests variable.
Would that be generic enough in your opinion ?
> Later it can all be lifted off to arch independent code.
>  =20
True for the wait part which can evolve in our arch code until it is=20
ripe to get cross arch merged.

--=20

Gr=FCsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization=20