From mboxrd@z Thu Jan  1 00:00:00 1970
From: Marcelo Tosatti <mtosatti@redhat.com>
Subject: Re: RFC: cache_regs in kvm_emulate_pio
Date: Fri, 20 Jun 2008 18:24:12 -0300
Message-ID: <20080620212412.GA16688@dmt.cnet>
References: <20080619233123.GA24183@dmt.cnet> <485C134D.2070905@qumranet.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: kvm-devel <kvm@vger.kernel.org>
To: Avi Kivity <avi@qumranet.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx1.redhat.com ([66.187.233.31]:52477 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753229AbYFTVY2 (ORCPT <rfc822;kvm@vger.kernel.org>);
	Fri, 20 Jun 2008 17:24:28 -0400
Content-Disposition: inline
In-Reply-To: <485C134D.2070905@qumranet.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On Fri, Jun 20, 2008 at 11:30:05PM +0300, Avi Kivity wrote:
> Marcelo Tosatti wrote:
>> Hi,
>>
>> From my understanding the ->cache_regs call on kvm_emulate_pio() is
>> necessary only on AMD, where vcpu->arch.regs[RAX] is not copied during
>> exit in svm_vcpu_load().
>>
>> On both architectures, the remaining general purpose registers are saved
>> on exit.
>>
>> The following patch saves 100 cycles out of both light and heavy exits
>> on Intel (if correct, kvm_emulate_hypercall and complete_pio could also
>> benefit, thus saving 200 cycles for in-kernel devices).
>>   
>
> ISTR vmwrite as 50 cycles and vmread as much lower.

On my 1.60GHz textbox ->cache_regs takes 114 cycles, measured with
rdtscll() before and after (rdtscll() takes 90 cycles by itself, due to
the barriers I guess, so the exact number was 204 cycles). Calling the
empty ->cache_rax takes 6 cycles.

>> BTW, the decache_regs(vcpu) call at the end of complete_pio() could also
>> be a noop on Intel from what I can tell ?
>>
>>   
>
> I think so.  decache_regs() is actually more important.
>
>>  		       int *exception);
>> +	void (*cache_rax)(struct kvm_vcpu *vcpu);
>>  	void (*cache_regs)(struct kvm_vcpu *vcpu);
>
> ugh, another callback.  how about instead
>
> /* in vcpu structure */
> u16 regs_available;
> u16 regs_dirty;
>
> /* read from cache if possible */
> if (!test_bit(VCPU_REG_RAX, &regs_available))
>       ->cache_regs();
> printk("%d\n", regs[VCPU_REGS_RAX]);
>
> /* write to cache, ->vcpu_run() will flush */
> regs[VCPU_REGS_RAX] = 17;
> __set_bit(VCPU_REGS_RAX, &regs_dirty);

I think that hiding whether registers are cached or not behing wrappers
makes a lot of sense, but having the ->cache_regs interface split can
also result in gains. An index argument to ->cache_regs() would do the
trick.

For example, there's no need to read GUEST_RSP for
skip_emulated_instruction, thats another 50+ cycles.

Unless there's something obscure that means you need to read RSP/RIP
before accessing the now in-memory guest registers saved with "mov"
in vmx_vcpu_run(). The comment on vcpu_load_rsp_rip seems a little
ambiguous to me:

/*
 * Sync the rsp and rip registers into the vcpu structure.  This allows
 * registers to be accessed by indexing vcpu->arch.regs.
 */

But I think it just refers to the interface in general, so that nobody
would try to access RSP or RIP (and RAX in AMD's case) before calling
->cache_regs().