Re: [Xen-devel] Performance overhead of paravirt_ops on nativeidentified

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jeremy Fitzhardinge <jeremy@goop.org>
To: Jan Beulich <JBeulich@novell.com>
Cc: Ingo Molnar <mingo@elte.hu>,
	Jun Nakajima <jun.nakajima@intel.com>,
	Xiaohui Xin <xiaohui.xin@intel.com>, Xin Li <xin.li@intel.com>,
	Xen-devel <xen-devel@lists.xensource.com>,
	Nick Piggin <npiggin@suse.de>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	"H. Peter Anvin" <hpa@zytor.com>
Subject: Re: [Xen-devel] Performance overhead of paravirt_ops on			 nativeidentified
Date: Wed, 20 May 2009 15:42:08 -0700	[thread overview]
Message-ID: <4A148740.1070405@goop.org> (raw)
In-Reply-To: <4A11280402000078000014E2@vpn.id2.novell.com>

Jan Beulich wrote:
>>>> Jeremy Fitzhardinge <jeremy@goop.org> 15.05.09 20:50 >>>
>>>>         
>> Jan Beulich wrote:
>>     
>>> A patch for the pv-ops kernel would require some time. What I can give you
>>> right away - just for reference - are the sources we currently use in our kernel:
>>> attached.
>>>       
>> Hm, I see.  Putting a call out to a pv-ops function in the ticket lock 
>> slow path looks pretty straightforward.  The need for an extra lock on 
>> the contended unlock side is a bit unfortunate; have you measured to see 
>> what hit that has?  Seems to me like you could avoid the problem by 
>> using per-cpu storage rather than stack storage (though you'd need to 
>> copy the per-cpu data to stack when handling a nested spinlock).
>>     
>
> Not sure how you'd imagine this to work: The unlock code has to look at all
> cpus' data in either case, so an inner lock would still be required imo.
>   

Well, rather than an explicit lock I was thinking of an optimistic 
scheme like the pv clock update algorithm.

"struct spinning" would have a version counter it could update using the 
even=stable/odd=unstable algorithm.  The lock side would simply save the 
current per-cpu "struct spinning" state onto its own stack (assuming you 
can even have nested spinning), and then update the per-cpu copy with 
the current lock.  The kick side can check the version counter to make 
sure it gets a consistent snapshot of the target cpu's current lock state.

I think that only the locking side requires locked instructions, and the 
kick side can be all unlocked.

>> What's the thinking behind the xen_spin_adjust() stuff?
>>     
>
> That's the placeholder for implementing interrupt re-enabling in the irq-save
> lock path. The requirement is that if a nested lock attempt hits the same
> lock on the same cpu that it failed to get acquired on earlier (but got a ticket
> already), tickets for the given (lock, cpu) pair need to be circularly shifted
> around so that the innermost requestor gets the earliest ticket. This is what
> that function's job will become if I ever get to implement this.
>   

Sounds fiddly.

>>> static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock) { 
>>> unsigned int token, count; bool free; __ticket_spin_lock_preamble; if 
>>> (unlikely(!free)) token = xen_spin_adjust(lock, token); do { count = 1 
>>> << 10; __ticket_spin_lock_body; } while (unlikely(!count) && 
>>> !xen_spin_wait(lock, token)); } 
>>>       
>> How does this work?  Doesn't it always go into the slowpath loop even if 
>> the preamble got the lock with no contention?
>>     
>
> It indeed always enters the slowpath loop, but only for a single pass through
> part of its body (the first compare in the body macro will make it exit the loop
> right away: 'token' is not only the ticket here, but the full lock->slock
> contents). But yes, I think you're right, one could avoid entering the body
> altogether by moving the containing loop into the if(!free) body. The logic
> went through a number of re-writes, so I must have overlooked that
> opportunity on the last round of adjustments.
>   

I was thinking of something like this: (completely untested)

void __ticket_spin_lock(raw_spinlock_t *lock)
{
	unsigned short inc = 0x100;
	unsigned short token;

	do {
		unsigned count = 1 << 10;
		asm volatile (
			"	lock xaddw %w0, %1\n"
			"1:	cmpb %h0, %b0\n"
			"	je 2f\n"
			"	rep ; nop\n"
			"	movb %1, %b0\n"
			/* don't need lfence here, because loads are in-order */
			"	sub $1,%2\n"
			"	jnz 1b\n"
			"2:"
			: "+Q" (inc), "+m" (lock->slock), "+r" (count)
			:
			: "memory", "cc");

		if (likely(count != 0))
			break;

		token = inc;
		inc = 0;
	} while (unlikely(!xen_spin_wait(lock, token)));
}


where xen_spin_wait() would actually be a pvops call, and perhaps the 
asm could be pulled out into an inline to deal with the 8/16 bit ticket 
difference.

    J

WARNING: multiple messages have this Message-ID (diff)

From: Jeremy Fitzhardinge <jeremy@goop.org>
To: Jan Beulich <JBeulich@novell.com>
Cc: Nick Piggin <npiggin@suse.de>,
	Xiaohui Xin <xiaohui.xin@intel.com>,
	Xen-devel <xen-devel@lists.xensource.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Xin Li <xin.li@intel.com>, Jun Nakajima <jun.nakajima@intel.com>,
	"H. Peter Anvin" <hpa@zytor.com>, Ingo Molnar <mingo@elte.hu>
Subject: Re: Performance overhead of paravirt_ops on			 nativeidentified
Date: Wed, 20 May 2009 15:42:08 -0700	[thread overview]
Message-ID: <4A148740.1070405@goop.org> (raw)
In-Reply-To: <4A11280402000078000014E2@vpn.id2.novell.com>

Jan Beulich wrote:
>>>> Jeremy Fitzhardinge <jeremy@goop.org> 15.05.09 20:50 >>>
>>>>         
>> Jan Beulich wrote:
>>     
>>> A patch for the pv-ops kernel would require some time. What I can give you
>>> right away - just for reference - are the sources we currently use in our kernel:
>>> attached.
>>>       
>> Hm, I see.  Putting a call out to a pv-ops function in the ticket lock 
>> slow path looks pretty straightforward.  The need for an extra lock on 
>> the contended unlock side is a bit unfortunate; have you measured to see 
>> what hit that has?  Seems to me like you could avoid the problem by 
>> using per-cpu storage rather than stack storage (though you'd need to 
>> copy the per-cpu data to stack when handling a nested spinlock).
>>     
>
> Not sure how you'd imagine this to work: The unlock code has to look at all
> cpus' data in either case, so an inner lock would still be required imo.
>   

Well, rather than an explicit lock I was thinking of an optimistic 
scheme like the pv clock update algorithm.

"struct spinning" would have a version counter it could update using the 
even=stable/odd=unstable algorithm.  The lock side would simply save the 
current per-cpu "struct spinning" state onto its own stack (assuming you 
can even have nested spinning), and then update the per-cpu copy with 
the current lock.  The kick side can check the version counter to make 
sure it gets a consistent snapshot of the target cpu's current lock state.

I think that only the locking side requires locked instructions, and the 
kick side can be all unlocked.

>> What's the thinking behind the xen_spin_adjust() stuff?
>>     
>
> That's the placeholder for implementing interrupt re-enabling in the irq-save
> lock path. The requirement is that if a nested lock attempt hits the same
> lock on the same cpu that it failed to get acquired on earlier (but got a ticket
> already), tickets for the given (lock, cpu) pair need to be circularly shifted
> around so that the innermost requestor gets the earliest ticket. This is what
> that function's job will become if I ever get to implement this.
>   

Sounds fiddly.

>>> static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock) { 
>>> unsigned int token, count; bool free; __ticket_spin_lock_preamble; if 
>>> (unlikely(!free)) token = xen_spin_adjust(lock, token); do { count = 1 
>>> << 10; __ticket_spin_lock_body; } while (unlikely(!count) && 
>>> !xen_spin_wait(lock, token)); } 
>>>       
>> How does this work?  Doesn't it always go into the slowpath loop even if 
>> the preamble got the lock with no contention?
>>     
>
> It indeed always enters the slowpath loop, but only for a single pass through
> part of its body (the first compare in the body macro will make it exit the loop
> right away: 'token' is not only the ticket here, but the full lock->slock
> contents). But yes, I think you're right, one could avoid entering the body
> altogether by moving the containing loop into the if(!free) body. The logic
> went through a number of re-writes, so I must have overlooked that
> opportunity on the last round of adjustments.
>   

I was thinking of something like this: (completely untested)

void __ticket_spin_lock(raw_spinlock_t *lock)
{
	unsigned short inc = 0x100;
	unsigned short token;

	do {
		unsigned count = 1 << 10;
		asm volatile (
			"	lock xaddw %w0, %1\n"
			"1:	cmpb %h0, %b0\n"
			"	je 2f\n"
			"	rep ; nop\n"
			"	movb %1, %b0\n"
			/* don't need lfence here, because loads are in-order */
			"	sub $1,%2\n"
			"	jnz 1b\n"
			"2:"
			: "+Q" (inc), "+m" (lock->slock), "+r" (count)
			:
			: "memory", "cc");

		if (likely(count != 0))
			break;

		token = inc;
		inc = 0;
	} while (unlikely(!xen_spin_wait(lock, token)));
}


where xen_spin_wait() would actually be a pvops call, and perhaps the 
asm could be pulled out into an inline to deal with the 8/16 bit ticket 
difference.

    J

next prev parent reply	other threads:[~2009-05-20 22:42 UTC|newest]

Thread overview: 99+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-05-14  0:16 Performance overhead of paravirt_ops on native identified Jeremy Fitzhardinge
2009-05-14  0:16 ` Jeremy Fitzhardinge
2009-05-14  1:10 ` H. Peter Anvin
2009-05-14  1:10   ` H. Peter Anvin
2009-05-14  8:25   ` Peter Zijlstra
2009-05-14  8:25     ` Peter Zijlstra
2009-05-14 14:05     ` H. Peter Anvin
2009-05-14 14:05       ` H. Peter Anvin
2009-05-14 17:36   ` Jeremy Fitzhardinge
2009-05-14 17:36     ` Jeremy Fitzhardinge
2009-05-14 17:50     ` H. Peter Anvin
2009-05-14 17:50       ` H. Peter Anvin
2009-05-14  8:05 ` [Xen-devel] Performance overhead of paravirt_ops on nativeidentified Jan Beulich
2009-05-14  8:05   ` Jan Beulich
2009-05-14  8:33   ` [Xen-devel] " Peter Zijlstra
2009-05-14 17:45   ` Jeremy Fitzhardinge
2009-05-14 17:45     ` Jeremy Fitzhardinge
2009-05-15  8:10     ` [Xen-devel] " Jan Beulich
2009-05-15 18:50       ` Jeremy Fitzhardinge
2009-05-18  7:19         ` Jan Beulich
2009-05-18  7:19           ` Jan Beulich
2009-05-20 22:42           ` Jeremy Fitzhardinge [this message]
2009-05-20 22:42             ` Jeremy Fitzhardinge
2009-05-15 18:18 ` [tip:x86/urgent] x86: Fix performance regression caused by paravirt_ops on native kernels tip-bot for Jeremy Fitzhardinge
2009-05-15 18:18   ` tip-bot for Jeremy Fitzhardinge
2009-05-21 22:42 ` Performance overhead of paravirt_ops on native identified Chuck Ebbert
2009-05-21 22:48   ` Jeremy Fitzhardinge
2009-05-21 22:48     ` Jeremy Fitzhardinge
2009-05-21 23:10     ` H. Peter Anvin
2009-05-21 23:10       ` H. Peter Anvin
2009-05-22  1:26     ` Xin, Xiaohui
2009-05-22  1:26       ` Xin, Xiaohui
2009-05-22  3:39       ` H. Peter Anvin
2009-05-22  3:39         ` H. Peter Anvin
2009-05-22  4:27       ` Jeremy Fitzhardinge
2009-05-22  4:27         ` Jeremy Fitzhardinge
2009-05-22  5:59         ` Xin, Xiaohui
2009-05-22  5:59           ` Xin, Xiaohui
2009-05-22 16:33           ` H. Peter Anvin
2009-05-22 16:33             ` H. Peter Anvin
2009-05-22 22:44             ` Jeremy Fitzhardinge
2009-05-22 22:44               ` Jeremy Fitzhardinge
2009-05-22 22:47               ` H. Peter Anvin
2009-05-22 22:47                 ` H. Peter Anvin
2009-05-25  9:15 ` [benchmark] 1% performance overhead of paravirt_ops on native kernels Ingo Molnar
2009-05-26 18:42   ` Jeremy Fitzhardinge
2009-05-28  6:17     ` Nick Piggin
2009-05-28 20:57       ` Jeremy Fitzhardinge
2009-05-30 10:23       ` Ingo Molnar
2009-06-02 14:18         ` Chris Mason
2009-06-02 14:49           ` Ulrich Drepper
2009-06-02 15:03             ` Chris Mason
2009-06-02 15:22               ` Ulrich Drepper
2009-06-02 16:20                 ` Chris Mason
2009-06-02 18:13                   ` Pekka Enberg
2009-06-02 18:06               ` Pekka Enberg
2009-06-02 18:27                 ` Chris Mason
2009-06-03  6:33             ` Jeremy Fitzhardinge
2009-06-02 19:14           ` Thomas Gleixner
2009-06-02 19:51             ` Chris Mason
2009-06-03 12:38         ` Rusty Russell
2009-06-03 16:09           ` Linus Torvalds
     [not found]             ` <200906041554.37102.rusty@rustcorp.com.au>
2009-06-04 15:02               ` Linus Torvalds
2009-06-04 21:52                 ` Dave McCracken
2009-06-05  7:31                   ` Gerd Hoffmann
2009-06-05 14:31                     ` Rusty Russell
2009-06-06 18:54                   ` Anders K. Pedersen
2009-06-05  4:46                 ` Rusty Russell
2009-06-05 14:54                   ` Linus Torvalds
2009-06-07  0:53                     ` Rusty Russell
2009-06-08 14:53                       ` Linus Torvalds
2009-06-09  9:39                 ` Nick Piggin
2009-06-09 11:17                   ` Ingo Molnar
2009-06-09 12:10                     ` Nick Piggin
2009-06-09 12:25                       ` Ingo Molnar
2009-06-09 12:42                         ` Nick Piggin
2009-06-09 12:56                         ` Avi Kivity
2009-06-09 15:18                         ` Linus Torvalds
2009-06-09 23:33                         ` Paul Mackerras
2009-06-10  1:26                           ` Ingo Molnar
2009-06-09 15:07                       ` Linus Torvalds
2009-06-09 15:09                     ` H. Peter Anvin
2009-06-09 18:06                       ` Linus Torvalds
2009-06-09 18:07                         ` Linus Torvalds
2009-06-09 22:48                           ` Matthew Garrett
2009-06-09 22:54                             ` H. Peter Anvin
2009-06-09 14:54                   ` Linus Torvalds
2009-06-09 14:57                     ` Ingo Molnar
2009-06-09 15:55                       ` Avi Kivity
2009-06-09 15:38                     ` Nick Piggin
2009-06-09 16:00                       ` Linus Torvalds
2009-06-09 16:21                         ` Nick Piggin
2009-06-09 16:26                           ` Linus Torvalds
2009-06-09 16:45                             ` Nick Piggin
2009-06-09 17:08                               ` Linus Torvalds
2009-06-10  5:53                                 ` Nick Piggin
2009-06-17  9:40                                   ` Pavel Machek
2009-06-17  9:56                                     ` Nick Piggin
2009-06-10  6:29                             ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4A148740.1070405@goop.org \
    --to=jeremy@goop.org \
    --cc=JBeulich@novell.com \
    --cc=hpa@zytor.com \
    --cc=jun.nakajima@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=npiggin@suse.de \
    --cc=xen-devel@lists.xensource.com \
    --cc=xiaohui.xin@intel.com \
    --cc=xin.li@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.