public inbox for linux-arch@vger.kernel.org
From: Waiman Long <waiman.long@hp.com>
To: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	Arnd Bergmann <arnd@arndb.de>,
	linux-arch@vger.kernel.org, x86@kernel.org,
	linux-kernel@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Richard Weinberger <richard@nod.at>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Matt Fleming <matt.fleming@intel.com>,
	Herbert Xu <herbert@gondor.apana.org.au>,
	Akinobu Mita <akinobu.mita@gmail.com>,
	Rusty Russell <rusty@rustcorp.com.au>,
	Michel Lespinasse <walken@google.com>,
	Andi Kleen <andi@firstfloor.org>, Rik van Riel <riel@redhat.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
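	George Spelvin <linux@horizon.com>,
	Harvey Harrison <harvey.harrison@gmail.com>,
	"Chandramouleeswaran, Aswin" <aswin@hp.com>,
	"Norton, Scott J" <scott.norton@hp.com>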
Subject: Re: [PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
Date: Thu, 01 Aug 2013 17:09:12 -0400
Message-ID: <51FACE78.9070901@hp.com>
In-Reply-To: <51FAC3BA.9050705@linux.vnet.ibm.com>

On 08/01/2013 04:23 PM, Raghavendra K T wrote:
> On 08/01/2013 08:07 AM, Waiman Long wrote:
>>
>> +}
>> +/**
>> + * queue_spin_trylock - try to acquire the queue spinlock
>> + * @lock : Pointer to queue spinlock structure
>> + * Return: 1 if lock acquired, 0 if failed
>> + */
>> +static __always_inline int queue_spin_trylock(struct qspinlock *lock)
>> +{
>> +    if (!queue_spin_is_contended(lock) && (xchg(&lock->locked, 1) == 0))
>> +        return 1;
>> +    return 0;
>> +}
>> +
>> +/**
>> + * queue_spin_lock - acquire a queue spinlock
>> + * @lock: Pointer to queue spinlock structure
>> + */
>> +static __always_inline void queue_spin_lock(struct qspinlock *lock)
>> +{
>> +    if (likely(queue_spin_trylock(lock)))
>> +        return;
>> +    queue_spin_lock_slowpath(lock);
>> +}
>
> Quickly falling into the slowpath may hurt performance in some cases, no?

Failing the trylock means that the process is likely to wait. I do retry 
one more time in the slowpath before waiting in the queue.
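
The fast path / slow path split described above can be sketched in
userspace with C11 atomics. This is only an illustration under stated
assumptions: struct sketch_lock and the sketch_* functions are
hypothetical names, the contention check done by
queue_spin_is_contended() is omitted, and the slow path here simply
spins instead of queuing the waiter as the patch does.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct qspinlock: only the lock word is
 * modeled, the queue code is left out entirely. */
struct sketch_lock {
    atomic_int locked;
};

/* Fast path: a single atomic exchange. It succeeds only if the
 * previous value was 0, i.e. the lock was free. */
static bool sketch_trylock(struct sketch_lock *lock)
{
    return atomic_exchange_explicit(&lock->locked, 1,
                                    memory_order_acquire) == 0;
}

/* Slow path placeholder: keep retrying until the lock is taken.
 * The real slowpath retries once and then waits in a queue; plain
 * spinning is an assumption made to keep the sketch short. */
static void sketch_lock_slowpath(struct sketch_lock *lock)
{
    while (!sketch_trylock(lock))
        ; /* the kernel would call cpu_relax() here */
}

/* Returns true if the fast path was enough, false if the slow path
 * had to be entered. */
static bool sketch_lock_acquire(struct sketch_lock *lock)
{
    if (sketch_trylock(lock))
        return true;
    sketch_lock_slowpath(lock);
    return false;
}

static void sketch_unlock(struct sketch_lock *lock)
{
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
```

On an uncontended lock, sketch_lock_acquire() costs one exchange and
never reaches the slow path.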

> Instead, I tried something like this:
>
> #define SPIN_THRESHOLD 64
>
> static __always_inline void queue_spin_lock(struct qspinlock *lock)
> {
>         unsigned count = SPIN_THRESHOLD;
>         do {
>                 if (likely(queue_spin_trylock(lock)))
>                         return;
>                 cpu_relax();
>         } while (count--);
>         queue_spin_lock_slowpath(lock);
> }
>
> Though I could see some gains in overcommit, it hurt undercommit
> in some workloads :(.

The gcc 4.4.7 compiler that I used on my test machine tends to allocate
stack space for variables instead of using registers when a loop is
present, so I try to avoid having a loop in the fast path. Also, the
count itself is rather arbitrary. For the first pass, I would like to
keep things simple. We can always enhance it once it is accepted and
merged.
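
For reference, a bounded spin before the slow path might look like the
userspace sketch below. It is not the patch's code: struct spin_sketch,
SKETCH_SPIN_THRESHOLD and the spin_sketch_* functions are hypothetical
names, and the `tries` counter exists only so the behavior can be
observed. One detail worth noting is that a `do { ... } while (count--)`
loop starting at 64 makes 65 attempts, whereas the loop below makes
exactly `budget` attempts.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_SPIN_THRESHOLD 64 /* arbitrary value, as discussed above */

/* Hypothetical lock type: just the lock word. */
struct spin_sketch {
    atomic_int locked;
};

static bool spin_sketch_trylock(struct spin_sketch *l)
{
    return atomic_exchange_explicit(&l->locked, 1,
                                    memory_order_acquire) == 0;
}

/* Try at most `budget` times. Returns true if the lock was taken;
 * false tells the caller to fall back to its slow path.
 * `tries` is incremented once per attempt (instrumentation only). */
static bool spin_sketch_lock_bounded(struct spin_sketch *l,
                                     unsigned budget, unsigned *tries)
{
    for (unsigned i = 0; i < budget; i++) {
        (*tries)++;
        if (spin_sketch_trylock(l))
            return true;
        /* the kernel version would call cpu_relax() here */
    }
    return false;
}
```

Returning a flag instead of calling the slow path directly keeps the
spin budget separate from whatever waiting strategy follows it.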

>
>>
>> +/**
>> + * queue_trylock - try to acquire the lock bit ignoring the qcode in lock
>> + * @lock: Pointer to queue spinlock structure
>> + * Return: 1 if lock acquired, 0 if failed
>> + */
>> +static __always_inline int queue_trylock(struct qspinlock *lock)
>> +{
>> +    if (!ACCESS_ONCE(lock->locked) && (xchg(&lock->locked, 1) == 0))
>> +        return 1;
>> +    return 0;
>> +}
>
> It took me a long time to confirm for myself that
> this is used when we exhaust all the nodes. But I am not sure of
> a better name that would not be confused with queue_spin_trylock.
> Anyway, they are in different files :).
>

Yes, I know it is confusing. I will change the name to make it more 
explicit.
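
The quoted queue_trylock() follows the test-and-test-and-set pattern:
read the lock word first and only issue the atomic exchange when the
lock looks free, presumably to avoid write traffic on a lock that is
visibly held. A minimal userspace sketch of that pattern with C11
atomics follows; struct ttas_lock, ttas_trylock() and the
ttas_xchg_count counter are hypothetical and exist only for this
illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: just the lock word. */
struct ttas_lock {
    atomic_int locked;
};

/* Counts how many exchanges were actually issued (instrumentation
 * for the sketch only, not thread-safe). */
static unsigned ttas_xchg_count;

static bool ttas_trylock(struct ttas_lock *l)
{
    /* Test: a plain read, standing in for ACCESS_ONCE(). */
    if (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
        return false;

    /* Test-and-set: the lock looked free, so try to claim it. Another
     * thread may still win the race, hence the check on the old value. */
    ttas_xchg_count++;
    return atomic_exchange_explicit(&l->locked, 1,
                                    memory_order_acquire) == 0;
}
```

Calling ttas_trylock() on a held lock returns false without
incrementing ttas_xchg_count, which is the point of the initial read.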

>
> Result:
> Sandy Bridge, 32 CPUs / 16 cores (HT on), 2-node machine with 16-vCPU
> KVM guests.
>
> In general, I am seeing that undercommit loads benefit from the
> patches.
>
> base = 3.11-rc1
> patched = base + qlock
> +------+------------+----------+------------+----------+--------------+
>                hackbench (time in sec, lower is better)
> +------+------------+----------+------------+----------+--------------+
>  oc            base      stdev      patched      stdev   %improvement
> +------+------------+----------+------------+----------+--------------+
>  0.5x       18.9326     1.6072      20.0686     2.9968       -6.00023
>  1.0x       34.0585     5.5120      33.2230     1.6119        2.45313
> +------+------------+----------+------------+----------+--------------+
>                 ebizzy (records/sec, higher is better)
> +------+------------+----------+------------+----------+--------------+
>  oc            base      stdev      patched      stdev   %improvement
> +------+------------+----------+------------+----------+--------------+
>  0.5x    20499.3750   466.7756   22257.8750   884.8308        8.57831
>  1.0x    15903.5000   271.7126   17993.5000   682.5095       13.14176
>  1.5x     1883.2222   166.3714    1742.8889   135.2271       -7.45177
>  2.5x      829.1250    44.3957     803.6250    78.8034       -3.07553
> +------+------------+----------+------------+----------+--------------+
>              dbench (throughput in MB/sec, higher is better)
> +------+------------+----------+------------+----------+--------------+
>  oc            base      stdev      patched      stdev   %improvement
> +------+------------+----------+------------+----------+--------------+
>  0.5x    11623.5000    34.2764   11667.0250    47.1122        0.37446
>  1.0x     6945.3675    79.0642    6798.4950   161.9431       -2.11468
>  1.5x     3950.4367    27.3828    3910.3122    45.4275       -1.01570
>  2.0x     2588.2063    35.2058    2520.3412    51.7138       -2.62209
> +------+------------+----------+------------+----------+--------------+
>
> I saw dbench results improve to 0.3529, -2.9459, 3.2423, 4.8027
> respectively after delaying entry into the slowpath as above.
> [...]
>
> I have not yet tested on a bigger machine. I hope that a bigger machine
> will see significant undercommit improvements.
>
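
The %improvement column in the quoted tables appears to be the relative
change against base, with the sign chosen so that a positive value
always means the patched kernel did better. That formula is inferred
from the numbers, not stated in the mail. A small sketch, where
improvement_pct() is a hypothetical helper:

```c
/* Relative improvement of `patched` over `base`, in percent.
 * For lower-is-better metrics (hackbench time) a smaller patched value
 * counts as a gain; for higher-is-better metrics (ebizzy, dbench
 * throughput) a larger patched value does. */
static double improvement_pct(double base, double patched,
                              int lower_is_better)
{
    double delta = lower_is_better ? base - patched : patched - base;

    return 100.0 * delta / base;
}
```

For example, improvement_pct(18.9326, 20.0686, 1) reproduces the
hackbench 0.5x figure of about -6.0, and
improvement_pct(20499.3750, 22257.8750, 0) reproduces the ebizzy 0.5x
figure of about 8.58.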

Thanks for running the test. I am a bit confused about the terminology.
What exactly do undercommit and overcommit mean?

Regards,
Longman
