From mboxrd@z Thu Jan 1 00:00:00 1970 From: Marcelo Tosatti Subject: Re: RFC: cache_regs in kvm_emulate_pio Date: Fri, 20 Jun 2008 18:24:12 -0300 Message-ID: <20080620212412.GA16688@dmt.cnet> References: <20080619233123.GA24183@dmt.cnet> <485C134D.2070905@qumranet.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: kvm-devel To: Avi Kivity Return-path: Received: from mx1.redhat.com ([66.187.233.31]:52477 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753229AbYFTVY2 (ORCPT ); Fri, 20 Jun 2008 17:24:28 -0400 Content-Disposition: inline In-Reply-To: <485C134D.2070905@qumranet.com> Sender: kvm-owner@vger.kernel.org List-ID: On Fri, Jun 20, 2008 at 11:30:05PM +0300, Avi Kivity wrote: > Marcelo Tosatti wrote: >> Hi, >> >> From my understanding the ->cache_regs call on kvm_emulate_pio() is >> necessary only on AMD, where vcpu->arch.regs[RAX] is not copied during >> exit in svm_vcpu_load(). >> >> On both architectures, the remaining general purpose registers are saved >> on exit. >> >> The following patch saves 100 cycles out of both light and heavy exits >> on Intel (if correct, kvm_emulate_hypercall and complete_pio could also >> benefit, thus saving 200 cycles for in-kernel devices). >> > > ISTR vmwrite as 50 cycles and vmread as much lower. On my 1.60GHz textbox ->cache_regs takes 114 cycles, measured with rdtscll() before and after (rdtscll() takes 90 cycles by itself, due to the barriers I guess, so the exact number was 204 cycles). Calling the empty ->cache_rax takes 6 cycles. >> BTW, the decache_regs(vcpu) call at the end of complete_pio() could also >> be a noop on Intel from what I can tell ? >> >> > > I think so. decache_regs() is actually more important. > >> int *exception); >> + void (*cache_rax)(struct kvm_vcpu *vcpu); >> void (*cache_regs)(struct kvm_vcpu *vcpu); > > ugh, another callback. how about instead > > /* in vcpu structure */ > u16 regs_available; > u16 regs_dirty; > > /* read from cache if possible */ > if (!test_bit(VCPU_REG_RAX, ®s_available)) > ->cache_regs(); > printk("%d\n", regs[VCPU_REGS_RAX]); > > /* write to cache, ->vcpu_run() will flush */ > regs[VCPU_REGS_RAX] = 17; > __set_bit(VCPU_REGS_RAX, ®s_dirty); I think that hiding whether registers are cached or not behing wrappers makes a lot of sense, but having the ->cache_regs interface split can also result in gains. An index argument to ->cache_regs() would do the trick. For example, there's no need to read GUEST_RSP for skip_emulated_instruction, thats another 50+ cycles. Unless there's something obscure that means you need to read RSP/RIP before accessing the now in-memory guest registers saved with "mov" in vmx_vcpu_run(). The comment on vcpu_load_rsp_rip seems a little ambiguous to me: /* * Sync the rsp and rip registers into the vcpu structure. This allows * registers to be accessed by indexing vcpu->arch.regs. */ But I think it just refers to the interface in general, so that nobody would try to access RSP or RIP (and RAX in AMD's case) before calling ->cache_regs().