Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

From: Chegu Vinod <chegu_vinod@hp.com>
To: Rik van Riel <riel@redhat.com>
Cc: linux-kernel@vger.kernel.org, aquini@redhat.com,
	walken@google.com, eric.dumazet@gmail.com, lwoodman@redhat.com,
	knoel@redhat.com, raghavendra.kt@linux.vnet.ibm.com,
	mingo@redhat.com
Subject: Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
Date: Sun, 03 Feb 2013 07:28:40 -0800
Message-ID: <510E8228.6050200@hp.com>
In-Reply-To: <20130125140553.060b8ced@annuminas.surriel.com>

On 1/25/2013 11:05 AM, Rik van Riel wrote:
> Many spinlocks are embedded in data structures; having many CPUs
> pounce on the cache line the lock is in will slow down the lock
> holder, and can cause system performance to fall off a cliff.
>
> The paper "Non-scalable locks are dangerous" is a good reference:
>
> 	http://pdos.csail.mit.edu/papers/linux:lock.pdf
>
> In the Linux kernel, spinlocks are optimized for the case of
> there not being contention. After all, if there is contention,
> the data structure can be improved to reduce or eliminate
> lock contention.
>
> Likewise, the spinlock API should remain simple, and the
> common case of the lock not being contended should remain
> as fast as ever.
>
> However, since spinlock contention should be fairly uncommon,
> we can add functionality into the spinlock slow path that keeps
> system performance from falling off a cliff when there is lock
> contention.
>
> Proportional delay in ticket locks is delaying the time between
> checking the ticket based on a delay factor, and the number of
> CPUs ahead of us in the queue for this lock. Checking the lock
> less often allows the lock holder to continue running, resulting
> in better throughput and preventing performance from dropping
> off a cliff.
>
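
For readers following along, here is a minimal user-space sketch of the proportional backoff idea described above. It is not the code from the patch series: the struct layout, the names cpus_ahead(), backoff_loops(), ticket_lock_wait() and the cpu_relax_stub() barrier (GCC/Clang inline asm) are assumptions made for illustration only.

```c
#include <stdint.h>

/* Hypothetical ticket lock: 'head' is the ticket currently being served,
 * 'tail' is the next ticket to hand out. */
struct ticket_lock {
    volatile uint16_t head;
    volatile uint16_t tail;
};

/* Stand-in for the kernel's cpu_relax(); here only a compiler barrier. */
void cpu_relax_stub(void)
{
    __asm__ __volatile__("" ::: "memory");
}

/* CPUs ahead of the owner of 'ticket' (lock holder included).
 * The uint16_t cast keeps the result correct when the counters wrap. */
uint16_t cpus_ahead(uint16_t head, uint16_t ticket)
{
    return (uint16_t)(ticket - head);
}

/* Proportional backoff: the further back in the queue, the longer
 * we wait before looking at the lock's cache line again. */
uint32_t backoff_loops(uint16_t head, uint16_t ticket, uint32_t delay_factor)
{
    return delay_factor * cpus_ahead(head, ticket);
}

/* Slow path: poll the lock, then stay off its cache line for a while. */
void ticket_lock_wait(struct ticket_lock *lock, uint16_t ticket,
                      uint32_t delay_factor)
{
    for (;;) {
        uint16_t head = lock->head;
        if (head == ticket)
            return;                 /* our turn */
        uint32_t loops = backoff_loops(head, ticket, delay_factor);
        while (loops--)
            cpu_relax_stub();
    }
}
```

With a delay factor of 50 and three CPUs ahead, the waiter in this sketch spins 150 times between two reads of the lock word.
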
> The test case has a number of threads locking and unlocking a
> semaphore. With just one thread, everything sits in the CPU
> cache and throughput is around 2.6 million operations per
> second, with a 5-10% variation.
>
> Once a second thread gets involved, data structures bounce
> from CPU to CPU, and performance deteriorates to about 1.25
> million operations per second, with a 5-10% variation.
>
> However, as more and more threads get added to the mix,
> performance with the vanilla kernel continues to deteriorate.
> Once I hit 24 threads, on a 24 CPU, 4 node test system,
> performance is down to about 290k operations/second.
>
> With a proportional backoff delay added to the spinlock
> code, performance with 24 threads goes up to about 400k
> operations/second with a 50x delay, and about 900k operations/second
> with a 250x delay. However, with a 250x delay, performance with
> 2-5 threads is worse than with a 50x delay.
>
> Making the code auto-tune the delay factor results in a system
> that performs well with both light and heavy lock contention,
> and should also protect against the (likely) case of the fixed
> delay factor being wrong for other hardware.
>
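
One way to picture the auto-tuning described above is a single adjustment step taken each time a waiter finishes a backoff period. The sketch below is an assumption about the general shape, not the policy from the patches: tune_delay(), MIN_DELAY and the one-step increments are hypothetical. The max_delay parameter is where a smaller cap for virtual machines, as mentioned for v4, could be plugged in.

```c
#include <stdint.h>

#define MIN_DELAY 1u   /* hypothetical lower bound, not from the patches */

/* One tuning step after a backoff period.
 * lock_is_ours: nonzero if our ticket was being served when we looked. */
uint32_t tune_delay(uint32_t delay, int lock_is_ours, uint32_t max_delay)
{
    if (lock_is_ours) {
        /* The lock may have been waiting for us: we possibly overslept,
         * so shorten the delay a little. */
        if (delay > MIN_DELAY)
            delay--;
    } else if (delay < max_delay) {
        /* Still not our turn: we polled too early, so back off more. */
        delay++;
    }
    return delay;
}
```
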
> The attached graph shows the performance of the multi threaded
> semaphore lock/unlock test case, with 1-24 threads, on the
> vanilla kernel, with 10x, 50x, and 250x proportional delay,
> as well as the v1 patch series with autotuning for 2x and 2.7x
> spinning before the lock is obtained, and with the v2 series.
>
> The v2 series integrates several ideas from Michel Lespinasse
> and Eric Dumazet, which should result in better throughput and
> nicer behaviour in situations with contention on multiple locks.
>
> For the v3 series, I tried out all the ideas suggested by
> Michel. They made perfect sense, but in the end it turned
> out they did not work as well as the simple, aggressive
> "try to make the delay longer" policy I have now. Several
> small bug fixes and cleanups have been integrated.
>
> For the v4 series, I added code to keep the maximum spinlock
> delay to a small value when running in a virtual machine. That
> should solve the performance regression seen in virtual machines.
>
> The performance issue observed with AIM7 is still a mystery.
>
> Performance is within the margin of error of v2, so the graph
> has not been updated.
>
> Please let me know if you manage to break this code in any way,
> so I can fix it...
>
I got back on the machine and re-ran the AIM7-highsystime microbenchmark
with 2000 users and 100 jobs per user on a guest with 20, 40, and 80
vcpus, using a 3.7.5 kernel with and without Rik's latest patches.

Host Platform : 8 socket (80 Core) Westmere with 1TB RAM.

Config 1 : 3.7.5 base running on host and in the guest

Config 2 : 3.7.5 + Rik's patches running on host and in the guest

(Note: I didn't change the PLE settings on the host. The severe drop
at 40-way and 80-way is due to the unoptimized PLE handler; Raghu's
PLE fixes should address that.)

           Config 1    Config 2
20vcpu     170K        168K
40vcpu     37K         37K
80vcpu     10.5K       11.5K

Not much difference between the two configs (I still need to test this
along with Raghu's fixes).
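
To put a number on "not much difference", the relative change of Config 2 over Config 1 can be computed with a small helper like the one below. The function name rel_change_pct() is just for illustration, and the inputs are the table values in thousands.

```c
/* Relative change of 'patched' versus 'base', in percent.
 * A positive result means the patched kernel scored higher. */
double rel_change_pct(double base, double patched)
{
    return (patched - base) / base * 100.0;
}
```

For the three guest sizes this works out to roughly -1.2% (20vcpu), 0% (40vcpu) and +9.5% (80vcpu).
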

BTW, I noticed that results were posted earlier using the AIM7-compute
workload.  The AIM7-highsystime workload is a lot more kernel intensive.

FYI
Vinod
