All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@elte.hu>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>,
	Andi Kleen <andi@firstfloor.org>, Matthew Wilcox <matthew@wil.cx>,
	LKML <linux-kernel@vger.kernel.org>,
	Alexander Viro <viro@ftp.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	"H. Peter Anvin" <hpa@zytor.com>
Subject: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
Date: Thu, 8 May 2008 14:01:30 +0200	[thread overview]
Message-ID: <20080508120130.GA2860@elte.hu> (raw)
In-Reply-To: <alpine.LFD.1.10.0805072115190.3024@woody.linux-foundation.org>


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > > That said, "idle 0%" is easy when you spin. Do you also have 
> > > actual performance numbers?
> >
> > Yes. My conclusion is based on the actual number. cpu idle 0% is 
> > just a behavior it should be.
> 
> Thanks, that's all I wanted to verify.
> 
> I'll leave this overnight, and see if somebody has come up with some 
> smart and wonderful patch. And if not, I think I'll apply mine as 
> "known to fix a regression", and we can perhaps then improve on things 
> further from there.

hey, i happen to have such a smart and wonderful patch =B-)

i reproduced the AIM7 workload and can confirm Yanmin's findings that 
-.26-rc1 regresses over .25 - by over 67% here.

Looking at the workload i found and fixed what i believe to be the real 
bug causing the AIM7 regression: it was inefficient wakeup / scheduling 
/ locking behavior of the new generic semaphore code, causing suboptimal 
performance.

The problem comes from the following code. The new semaphore code does 
this on down():

        spin_lock_irqsave(&sem->lock, flags);
        if (likely(sem->count > 0))
                sem->count--;
        else
                __down(sem);
        spin_unlock_irqrestore(&sem->lock, flags);

and this on up():

        spin_lock_irqsave(&sem->lock, flags);
        if (likely(list_empty(&sem->wait_list)))
                sem->count++;
        else
                __up(sem);
        spin_unlock_irqrestore(&sem->lock, flags);

where __up() does:

        list_del(&waiter->list);
        waiter->up = 1;
        wake_up_process(waiter->task);

and where __down() does this in essence:

        list_add_tail(&waiter.list, &sem->wait_list);
        waiter.task = task;
        waiter.up = 0;
        for (;;) {
                [...]
                spin_unlock_irq(&sem->lock);
                timeout = schedule_timeout(timeout);
                spin_lock_irq(&sem->lock);
                if (waiter.up)
                        return 0;
        }

the fastpath looks good and obvious, but note the following property of 
the contended path: if there's a task on the ->wait_list, the up() of 
the current owner will "pass over" ownership to that waiting task, in a 
wake-one manner, via the waiter->up flag and by removing the waiter from 
the wait list.

That is all and fine in principle, but as implemented in 
kernel/semaphore.c it also creates a nasty, hidden source of contention!

The contention comes from the following property of the new semaphore 
code: the new owner owns the semaphore exclusively, even if it is not 
running yet.

So if the old owner, even if just a few instructions later, does a 
down() [lock_kernel()] again, it will be blocked and will have to wait 
on the new owner to eventually be scheduled (possibly on another CPU)! 
Or if another other task gets to lock_kernel() sooner than the "new 
owner" scheduled, it will be blocked unnecessarily and for a very long 
time when there are 2000 tasks running.

I.e. the implementation of the new semaphores code does wake-one and 
lock ownership in a very restrictive way - it does not allow 
opportunistic re-locking of the lock at all and keeps the scheduler from 
picking task order intelligently.

This kind of scheduling, with 2000 AIM7 processes running, creates awful 
cross-scheduling between those 2000 tasks, causes reduced parallelism, a 
throttled runqueue length and a lot of idle time. With increasing number 
of CPUs it causes an exponentially worse behavior in AIM7, as the chance 
for a newly woken new-owner task to actually run anytime soon is less 
and less likely.

Note that it takes just a tiny bit of contention for the 'new-semaphore 
catastrophy' to happen: the wakeup latencies get added to whatever small 
contention there is, and quickly snowball out of control!

I believe Yanmin's findings and numbers support this analysis too.

The best fix for this problem is to use the same scheduling logic that 
the kernel/mutex.c code uses: keep the wake-one behavior (that is OK and 
wanted because we do not want to over-schedule), but also allow 
opportunistic locking of the lock even if a wakee is already "in 
flight".

The patch below implements this new logic. With this patch applied the 
AIM7 regression is largely fixed on my quad testbox:

  # v2.6.25 vanilla:
  ..................
  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    56096.4         91      207.5   789.7   0.4675
  2000    55894.4         94      208.2   792.7   0.4658

  # v2.6.26-rc1-166-gc0a1811 vanilla:
  ...................................
  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    33230.6         83      350.3   784.5   0.2769
  2000    31778.1         86      366.3   783.6   0.2648

  # v2.6.26-rc1-166-gc0a1811 + semaphore-speedup:
  ...............................................
  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    55707.1         92      209.0   795.6   0.4642
  2000    55704.4         96      209.0   796.0   0.4642

i.e. a 67% speedup. We are now back to within 1% of the v2.6.25 
performance levels and have zero idle time during the test, as expected.

Btw., interactivity also improved dramatically with the fix - for 
example console-switching became almost instantaneous during this 
workload (which after all is running 2000 tasks at once!), without the 
patch it was stuck for a minute at times.

I also ran Linus's spinlock-BKL patch as well:

  # v2.6.26-rc1-166-gc0a1811 + Linus-BKL-spinlock-patch:
  ......................................................
  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    55889.0         92      208.3   793.3   0.4657
  2000    55891.7         96      208.3   793.3   0.4658

it is about 0.3% faster - but note that that is within the general noise 
levels of this test. I'd expect Linus's spinlock-BKL patch to give some 
small speedup because the BKL acquire times are short and 2000 tasks 
running all at once really increases the context-switching cost and most 
BKL contentions are within the cost of context-switching.

But i believe the better solution for that is to remove BKL use from all 
hotpaths, not to hide some of its costs by reintroducing it as a 
spinlock. Reintroducing the spinlock based BKL would have other 
disadvantages as well: it could reintroduce per-CPU-ness assumptions in 
BKL-using code and other complications as well. It's also not a very 
realistic workload - with 2000 tasks running the system was barely 
serviceable.

I'd much rather make BKL costs more apparent and more visible - but 50% 
regression was of course too much. But 0.3% for a 2000-tasks workload, 
which is near the noise level ... is acceptable i think - especially as 
this discussion has now reinvigorated the remove-the-BKL discussions and 
patches.

Linus, we can do your spinlock-BKL patch too if you feel strongly about 
it, but i'd rather not - we fought so hard for the preemptible BKL :-)

The spinlock-based-BKL patch only worked around the real problem i 
believe, because it eliminated the use of the suboptimal new semaphore 
code: with spinlocks there's no scheduling at all, so the wakeup/locking 
bug of the new semaphore code did not apply. It was not about any 
fastpath overhead AFAICS. [we'd have seen that with the 
CONFIG_PREEMPT_BKL=y code as well, which has been the default setting 
since v2.6.8.]

There's another nice side-effect of this speedup patch, the new generic 
semaphore code got even smaller:

   text    data     bss     dec     hex filename
   1241       0       0    1241     4d9 semaphore.o.before
   1207       0       0    1207     4b7 semaphore.o.after

(because the waiter.up complication got removed.)

Longer-term we should look into using the mutex code for the generic 
semaphore code as well - but i's not easy due to legacies and it's 
outside of the scope of v2.6.26 and outside the scope of this patch as 
well.

Hm?

	Ingo

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/semaphore.c |   36 ++++++++++++++++--------------------
 1 file changed, 16 insertions(+), 20 deletions(-)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -54,10 +54,9 @@ void down(struct semaphore *sem)
 	unsigned long flags;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(sem->count <= 0))
 		__down(sem);
+	sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 }
 EXPORT_SYMBOL(down);
@@ -77,10 +76,10 @@ int down_interruptible(struct semaphore 
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(sem->count <= 0))
 		result = __down_interruptible(sem);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -103,10 +102,10 @@ int down_killable(struct semaphore *sem)
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(sem->count <= 0))
 		result = __down_killable(sem);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -157,10 +156,10 @@ int down_timeout(struct semaphore *sem, 
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(sem->count <= 0))
 		result = __down_timeout(sem, jiffies);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -179,9 +178,8 @@ void up(struct semaphore *sem)
 	unsigned long flags;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(list_empty(&sem->wait_list)))
-		sem->count++;
-	else
+	sem->count++;
+	if (unlikely(!list_empty(&sem->wait_list)))
 		__up(sem);
 	spin_unlock_irqrestore(&sem->lock, flags);
 }
@@ -192,7 +190,6 @@ EXPORT_SYMBOL(up);
 struct semaphore_waiter {
 	struct list_head list;
 	struct task_struct *task;
-	int up;
 };
 
 /*
@@ -206,11 +203,11 @@ static inline int __sched __down_common(
 	struct task_struct *task = current;
 	struct semaphore_waiter waiter;
 
-	list_add_tail(&waiter.list, &sem->wait_list);
 	waiter.task = task;
-	waiter.up = 0;
 
 	for (;;) {
+		list_add_tail(&waiter.list, &sem->wait_list);
+
 		if (state == TASK_INTERRUPTIBLE && signal_pending(task))
 			goto interrupted;
 		if (state == TASK_KILLABLE && fatal_signal_pending(task))
@@ -221,7 +218,7 @@ static inline int __sched __down_common(
 		spin_unlock_irq(&sem->lock);
 		timeout = schedule_timeout(timeout);
 		spin_lock_irq(&sem->lock);
-		if (waiter.up)
+		if (sem->count > 0)
 			return 0;
 	}
 
@@ -259,6 +256,5 @@ static noinline void __sched __up(struct
 	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
 						struct semaphore_waiter, list);
 	list_del(&waiter->list);
-	waiter->up = 1;
 	wake_up_process(waiter->task);
 }

  reply	other threads:[~2008-05-08 12:02 UTC|newest]

Thread overview: 140+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-05-06  5:48 AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin
2008-05-06 11:18 ` Matthew Wilcox
2008-05-06 11:44 ` Ingo Molnar
2008-05-06 12:09   ` Matthew Wilcox
2008-05-06 16:23     ` Matthew Wilcox
2008-05-06 16:36       ` Linus Torvalds
2008-05-06 16:42         ` Matthew Wilcox
2008-05-06 16:39           ` Alan Cox
2008-05-06 16:51             ` Matthew Wilcox
2008-05-06 16:45               ` Alan Cox
2008-05-06 17:42               ` Linus Torvalds
2008-05-06 20:28           ` Linus Torvalds
2008-05-06 16:44         ` J. Bruce Fields
2008-05-06 17:21       ` Andrew Morton
2008-05-06 17:31         ` Matthew Wilcox
2008-05-06 17:49           ` Ingo Molnar
2008-05-06 18:07             ` Andrew Morton
2008-05-11 11:11               ` Matthew Wilcox
2008-05-06 17:39         ` Ingo Molnar
2008-05-07  6:49           ` Zhang, Yanmin
2008-05-06 17:45         ` Linus Torvalds
2008-05-07 16:38         ` Matthew Wilcox
2008-05-07 16:55           ` Linus Torvalds
2008-05-07 17:08             ` Linus Torvalds
2008-05-07 17:16               ` Andrew Morton
2008-05-07 17:27                 ` Linus Torvalds
2008-05-07 17:22               ` Ingo Molnar
2008-05-07 17:25                 ` Ingo Molnar
2008-05-07 17:31                 ` Linus Torvalds
2008-05-07 17:47                   ` Linus Torvalds
2008-05-07 17:49                   ` Ingo Molnar
2008-05-07 18:02                     ` Linus Torvalds
2008-05-07 18:17                       ` Ingo Molnar
2008-05-07 18:27                         ` Linus Torvalds
2008-05-07 18:43                           ` Ingo Molnar
2008-05-07 19:01                             ` Linus Torvalds
2008-05-07 19:09                               ` Ingo Molnar
2008-05-07 19:24                               ` Matthew Wilcox
2008-05-07 19:44                                 ` Linus Torvalds
2008-05-07 20:00                                   ` Oi. NFS people. Read this Matthew Wilcox
2008-05-07 22:10                                     ` Trond Myklebust
2008-05-09  1:43                                       ` J. Bruce Fields
2008-05-08  3:24       ` AIM7 40% regression with 2.6.26-rc1 Zhang, Yanmin
2008-05-08  3:34         ` Linus Torvalds
2008-05-08  4:37           ` Zhang, Yanmin
2008-05-08 14:58             ` Linus Torvalds
2008-05-07  2:11   ` Zhang, Yanmin
2008-05-07  3:41     ` Zhang, Yanmin
2008-05-07  3:59       ` Andrew Morton
2008-05-07  4:46         ` Zhang, Yanmin
2008-05-07  6:26       ` Ingo Molnar
2008-05-07  6:28         ` Ingo Molnar
2008-05-07  7:05           ` Zhang, Yanmin
2008-05-07 11:00       ` Andi Kleen
2008-05-07 11:46         ` Matthew Wilcox
2008-05-07 12:21           ` Andi Kleen
2008-05-07 14:36             ` Linus Torvalds
2008-05-07 14:35               ` Alan Cox
2008-05-07 15:00                 ` Linus Torvalds
2008-05-07 15:02                   ` Linus Torvalds
2008-05-07 14:57               ` Andi Kleen
2008-05-07 15:31                 ` Andrew Morton
2008-05-07 16:22                   ` Matthew Wilcox
2008-05-07 15:19               ` Linus Torvalds
2008-05-07 17:14                 ` Ingo Molnar
2008-05-08  2:44                 ` Zhang, Yanmin
2008-05-08  3:29                   ` Linus Torvalds
2008-05-08  4:08                     ` Zhang, Yanmin
2008-05-08  4:17                       ` Linus Torvalds
2008-05-08 12:01                         ` Ingo Molnar [this message]
2008-05-08 12:28                           ` [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) Ingo Molnar
2008-05-08 14:43                             ` Ingo Molnar
2008-05-08 15:10                               ` [git pull] scheduler fixes Ingo Molnar
2008-05-08 15:33                                 ` Adrian Bunk
2008-05-08 15:41                                   ` Ingo Molnar
2008-05-08 19:42                                     ` Adrian Bunk
2008-05-11 11:03                                 ` Matthew Wilcox
2008-05-11 11:14                                   ` Matthew Wilcox
2008-05-11 11:48                                   ` Matthew Wilcox
2008-05-11 12:50                                     ` Ingo Molnar
2008-05-11 12:52                                       ` Ingo Molnar
2008-05-11 13:02                                         ` Matthew Wilcox
2008-05-11 13:26                                           ` Matthew Wilcox
2008-05-11 14:00                                             ` Ingo Molnar
2008-05-11 14:18                                               ` Matthew Wilcox
2008-05-11 14:42                                                 ` Ingo Molnar
2008-05-11 14:48                                                   ` Matthew Wilcox
2008-05-11 15:19                                                     ` Ingo Molnar
2008-05-11 15:29                                                       ` Matthew Wilcox
2008-05-13 14:11                                                         ` Ingo Molnar
2008-05-13 14:21                                                           ` Matthew Wilcox
2008-05-13 14:42                                                             ` Ingo Molnar
2008-05-13 15:28                                                               ` Matthew Wilcox
2008-05-13 17:13                                                                 ` Ingo Molnar
2008-05-13 17:22                                                                   ` Linus Torvalds
2008-05-13 21:05                                                                     ` Ingo Molnar
2008-05-11 13:54                                           ` Ingo Molnar
2008-05-11 14:22                                             ` Matthew Wilcox
2008-05-11 14:32                                               ` Ingo Molnar
2008-05-11 14:46                                                 ` Matthew Wilcox
2008-05-11 16:47                                                 ` Linus Torvalds
2008-05-11 13:01                                   ` Ingo Molnar
2008-05-11 13:06                                     ` Matthew Wilcox
2008-05-11 13:45                                       ` Ingo Molnar
2008-05-11 14:10                                   ` Sven Wegener
2008-05-08 16:02                             ` [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1) Linus Torvalds
2008-05-08 18:30                               ` Linus Torvalds
2008-05-08 20:19                                 ` Ingo Molnar
2008-05-08 20:27                                   ` Linus Torvalds
2008-05-08 21:45                                     ` Ingo Molnar
2008-05-08 22:02                                       ` Ingo Molnar
2008-05-08 22:55                                       ` Linus Torvalds
2008-05-08 23:07                                         ` Linus Torvalds
2008-05-08 23:14                                           ` Linus Torvalds
2008-05-08 23:16                                         ` Alan Cox
2008-05-08 23:33                                           ` Linus Torvalds
2008-05-08 23:27                                             ` Alan Cox
2008-05-09  6:50                                             ` Ingo Molnar
2008-05-09  8:29                                             ` Andi Kleen
2008-05-08 13:20                           ` Matthew Wilcox
2008-05-08 15:01                             ` Ingo Molnar
2008-05-08 13:56                           ` Arjan van de Ven
2008-05-08  6:43                   ` AIM7 40% regression with 2.6.26-rc1 Ingo Molnar
2008-05-08  6:48                     ` Andrew Morton
2008-05-08  7:14                     ` Zhang, Yanmin
2008-05-08  7:39                       ` Ingo Molnar
2008-05-08  8:44                         ` Zhang, Yanmin
2008-05-08  9:21                           ` Ingo Molnar
2008-05-08  9:29                             ` Ingo Molnar
2008-05-08  9:30                             ` Zhang, Yanmin
2008-05-07 16:20               ` Ingo Molnar
2008-05-07 16:35                 ` Linus Torvalds
2008-05-07 17:05                   ` Ingo Molnar
2008-05-07 17:24                     ` Linus Torvalds
2008-05-07 17:36                       ` Ingo Molnar
2008-05-07 17:55                         ` Linus Torvalds
2008-05-07 17:59                           ` Matthew Wilcox
2008-05-07 18:17                             ` Linus Torvalds
2008-05-07 18:49                               ` Ingo Molnar
2008-05-07 13:59         ` Alan Cox

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080508120130.GA2860@elte.hu \
    --to=mingo@elte.hu \
    --cc=akpm@linux-foundation.org \
    --cc=andi@firstfloor.org \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=matthew@wil.cx \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@ftp.linux.org.uk \
    --cc=yanmin_zhang@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.