From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1760686AbZENIZj (ORCPT ); Thu, 14 May 2009 04:25:39 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1755077AbZENIZU (ORCPT ); Thu, 14 May 2009 04:25:20 -0400 Received: from casper.infradead.org ([85.118.1.10]:47443 "EHLO casper.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751734AbZENIZS (ORCPT ); Thu, 14 May 2009 04:25:18 -0400 Subject: Re: Performance overhead of paravirt_ops on native identified From: Peter Zijlstra To: "H. Peter Anvin" Cc: Jeremy Fitzhardinge , Ingo Molnar , "Xin, Xiaohui" , "Li, Xin" , "Nakajima, Jun" , Nick Piggin , Linux Kernel Mailing List , Xen-devel , rostedt In-Reply-To: <4A0B6F9C.4060405@zytor.com> References: <4A0B62F7.5030802@goop.org> <4A0B6F9C.4060405@zytor.com> Content-Type: text/plain Date: Thu, 14 May 2009 10:25:06 +0200 Message-Id: <1242289506.6642.901.camel@laptop> Mime-Version: 1.0 X-Mailer: Evolution 2.26.1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 2009-05-13 at 18:10 -0700, H. Peter Anvin wrote: > Jeremy Fitzhardinge wrote: > > > > So, what's the fix? > > > > Paravirt patching turns all the pvops calls into direct calls, so > > _spin_lock etc do end up having direct calls. For example, the compiler > > generated code for paravirtualized _spin_lock is: > > > > <_spin_lock+0>: mov %gs:0xb4c8,%rax > > <_spin_lock+9>: incl 0xffffffffffffe044(%rax) > > <_spin_lock+15>: callq *0xffffffff805a5b30 > > <_spin_lock+22>: retq > > > > The indirect call will get patched to: > > <_spin_lock+0>: mov %gs:0xb4c8,%rax > > <_spin_lock+9>: incl 0xffffffffffffe044(%rax) > > <_spin_lock+15>: callq <__ticket_spin_lock> > > <_spin_lock+20>: nop; nop /* or whatever 2-byte nop */ > > <_spin_lock+22>: retq > > > > One possibility is to inline _spin_lock, etc, when building an > > optimised kernel (ie, when there's no spinlock/preempt > > instrumentation/debugging enabled). That will remove the outer > > call/return pair, returning the instruction stream to a single > > call/return, which will presumably execute the same as the non-pvops > > case. The downsides arel 1) it will replicate the > > preempt_disable/enable code at eack lock/unlock callsite; this code is > > fairly small, but not nothing; and 2) the spinlock definitions are > > already a very heavily tangled mass of #ifdefs and other preprocessor > > magic, and making any changes will be non-trivial. > > > > The other obvious option, it would seem to me, would be to eliminate the > *inner* call/return pair, i.e. merging the _spin_lock setup code in with > the internals of each available implementation (in the case above, > __ticket_spin_lock). This is effectively what happens on native. The > one problem with that is that every callsite now becomes a patching target. > > That brings me to a somewhat half-arsed thought I have been walking > around with for a while. > > Consider a paravirt -- or for that matter any other call which is > runtime-static; this isn't just limited to paravirt -- function which > looks to the C compiler just like any other external function -- no > indirection. We can point it by default to a function which is really > just an indirect jump to the appropriate handler, that handles the > prepatching case. However, a linktime pass over vmlinux.o can find all > the points where this function is called, and turn it into a list of > patch sites(*). The advantages are: > > 1. [minor] no additional nop padding due to indirect function calls. > 2. [major] no need for a ton of wrapper macros manifest in the code. > > paravirt_ops that turn into pure inline code in the native case is > obviously another ball of wax entirely; there inline assembly wrappers > are simply unavoidable. > > -hpa > > (*) if patching code on SMP was cheaper, we could actually do this > lazily, and wouldn't have to store a list of patch sites. I don't feel > brave enough to go down that route. This sounds remarkably like what the dynamic function call tracer does.