From: Jeremy Fitzhardinge <jeremy@goop.org>
To: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>,
"Xin, Xiaohui" <xiaohui.xin@intel.com>,
"Li, Xin" <xin.li@intel.com>,
"Nakajima, Jun" <jun.nakajima@intel.com>,
Nick Piggin <npiggin@suse.de>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Xen-devel <xen-devel@lists.xensource.com>
Subject: Re: Performance overhead of paravirt_ops on native identified
Date: Thu, 14 May 2009 10:36:11 -0700
Message-ID: <4A0C568B.7070907@goop.org>
In-Reply-To: <4A0B6F9C.4060405@zytor.com>
H. Peter Anvin wrote:
> The other obvious option, it would seem to me, would be to eliminate the
> *inner* call/return pair, i.e. merging the _spin_lock setup code in with
> the internals of each available implementation (in the case above,
> __ticket_spin_lock). This is effectively what happens on native. The
> one problem with that is that every callsite now becomes a patching target.
>
Yes, that's an option. It has the downside of requiring changes to the
common spinlock code in kernel/spinlock.c and linux/spinlock_api*.h.
The amount of duplicated code is potentially quite large, but there
aren't that many spinlock implementations.
Also, there's not much point in using pv spinlocks when all the
instrumentation is on. Lock contention metering, for example, never
does a proper lock operation, but does a spin with repeated trylocks; we
can't optimise that, so there's no point in trying.
So maybe if we can fast-path the fast-path to pv spinlocks, the problem
is more tractable...
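To make the trylock point concrete, here is a minimal user-space sketch of an instrumented acquire path that only ever spins on trylock. It is not the kernel's contention-metering code: `struct demo_lock`, `demo_trylock`, `demo_lock_metered` and `demo_unlock` are hypothetical names, and C11 atomics stand in for the real lock primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for illustration; not the kernel's spinlock_t. */
struct demo_lock {
    atomic_flag held;        /* set while the lock is owned */
    atomic_ulong contended;  /* count of failed trylock attempts */
};

/* One acquisition attempt; returns true if the lock was taken. */
static bool demo_trylock(struct demo_lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held,
                                              memory_order_acquire);
}

/*
 * Instrumented acquire: it never performs a "proper" blocking lock
 * operation, it just retries trylock and records each failure.
 * A pv slow path has nothing to hook into here.
 */
static void demo_lock_metered(struct demo_lock *l)
{
    while (!demo_trylock(l))
        atomic_fetch_add_explicit(&l->contended, 1,
                                  memory_order_relaxed);
}

static void demo_unlock(struct demo_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Because the loop is built entirely from trylock, there is no single lock operation that could be swapped for a paravirtualized one, which is why the instrumented configuration is not worth optimising.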
> That brings me to a somewhat half-arsed thought I have been walking
> around with for a while.
>
> Consider a paravirt -- or for that matter any other call which is
> runtime-static; this isn't just limited to paravirt -- function which
> looks to the C compiler just like any other external function -- no
> indirection. We can point it by default to a function which is really
> just an indirect jump to the appropriate handler, that handles the
> prepatching case. However, a linktime pass over vmlinux.o can find all
> the points where this function is called, and turn it into a list of
> patch sites(*). The advantages are:
>
> 1. [minor] no additional nop padding due to indirect function calls.
> 2. [major] no need for a ton of wrapper macros manifest in the code.
>
> paravirt_ops that turn into pure inline code in the native case is
> obviously another ball of wax entirely; there inline assembly wrappers
> are simply unavoidable.
>
We did consider something like this at the outset. As I remember, there
were a few concerns:
    * There was no relocation data available in the kernel. I played
      around with ways to make it work, but they ended up being fairly
      complex and brittle, with a tendency (of course) to trigger
      binutils bugs. Maybe that has changed.
    * We didn't really want to implement two separate mechanisms for the
      same thing. Given that we wanted to inline things like
      cli/sti/pushf/popf, we needed to have something capable of full
      patching. Having a separate mechanism just for patching calls is
      harder to justify. Now that pvops is well settled, perhaps it
      makes sense to consider adding another more general patching
      mechanism to avoid the indirect calls (a dynamic linker, essentially).
I won't make any great claims about the beauty of the PV_CALL* gunk, but
at the very least it is contained within paravirt.h.
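As a rough illustration of the indirect call being discussed, the sketch below routes a call through a function-pointer table behind a small wrapper. It is only an assumption-laden model: `struct demo_ops`, `demo_pv`, `demo_save_flags` and the two backends are made-up names, the return values are arbitrary placeholders, and the real PV_CALL* macros additionally record each call site so it can be patched.

```c
/* Hypothetical op table, loosely modelled on the idea of a pv ops struct. */
struct demo_ops {
    unsigned long (*save_flags)(void);
};

/* Two stand-in backends; the values returned are placeholders. */
static unsigned long native_save_flags(void) { return 0x200UL; }
static unsigned long hyper_save_flags(void)  { return 0UL; }

/* Defaults to the native backend until something re-points it. */
static struct demo_ops demo_pv = { .save_flags = native_save_flags };

/*
 * Wrapper seen by callers. Every use costs a load of the pointer plus
 * an indirect call, which is the overhead a link-time or run-time
 * patching pass would try to turn into a direct call.
 */
static inline unsigned long demo_save_flags(void)
{
    return demo_pv.save_flags();
}
```

Re-pointing `demo_pv.save_flags` at boot changes the behaviour of every caller, but the indirection itself remains unless the call sites are rewritten.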
> (*) if patching code on SMP was cheaper, we could actually do this
> lazily, and wouldn't have to store a list of patch sites. I don't feel
> brave enough to go down that route.
>
The problem that the tracepoints people were trying to solve was harder,
where they wanted to replace an arbitrary set of instructions with some
other arbitrary instructions (or a call) - that would need some kind of SMP
synchronization, both for general sanity and to keep the Intel rules happy.
In theory relinking a call should just be a single word write into the
instruction, but I don't know if that gets into undefined territory or
not. On older P4 systems it would end up blowing away the trace cache
on all cpus when you write to code like that, so you'd want to be sure
that your references are getting resolved fairly quickly. But it's hard
to see how patching the offset in a call instruction would end up
calling something other than the old or new function.
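The "single word write" can be sketched as follows for the 5-byte x86 near call (opcode 0xE8 followed by a 32-bit displacement relative to the next instruction). This is a user-space model operating on a byte buffer, not kernel patching code: `relink_call` and `call_target` are hypothetical helpers, there is no range check on the displacement, and nothing here models atomicity or cross-CPU synchronization, which is exactly the open question above.

```c
#include <stdint.h>

#define CALL_OPCODE 0xE8u  /* x86 near call with rel32 operand */
#define CALL_LEN    5u     /* one opcode byte + four displacement bytes */

/*
 * Rewrite the displacement of the call stored in insn[0..4] so that,
 * when executed at address `site`, it transfers to `target`.
 * Returns 0 on success, -1 if insn does not start with a near call.
 * Assumes target is within +/-2GB of site (not checked).
 */
static int relink_call(uint8_t *insn, uint64_t site, uint64_t target)
{
    uint32_t rel;

    if (insn[0] != CALL_OPCODE)
        return -1;

    /* Displacement is relative to the end of the call instruction. */
    rel = (uint32_t)(target - (site + CALL_LEN));

    /* Stored little-endian; on real hardware this is one 32-bit store. */
    insn[1] = (uint8_t)(rel & 0xff);
    insn[2] = (uint8_t)((rel >> 8) & 0xff);
    insn[3] = (uint8_t)((rel >> 16) & 0xff);
    insn[4] = (uint8_t)((rel >> 24) & 0xff);
    return 0;
}

/* Decode where the call at `site` currently points. */
static uint64_t call_target(const uint8_t *insn, uint64_t site)
{
    uint32_t rel = (uint32_t)insn[1] | ((uint32_t)insn[2] << 8) |
                   ((uint32_t)insn[3] << 16) | ((uint32_t)insn[4] << 24);
    uint64_t ext = rel;

    if (rel & 0x80000000u)           /* sign-extend the displacement */
        ext |= 0xFFFFFFFF00000000ull;
    return site + CALL_LEN + ext;
}
```

Only the four displacement bytes change; the opcode byte is left alone, so a CPU that observes the whole displacement word before or after the store decodes a call to either the old or the new function.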
J