public inbox for linux-kernel@vger.kernel.org
From: "Török Edwin" <edwin@clamav.net>
To: Peter Zijlstra <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>, Ingo Molnar <mingo@elte.hu>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	aCaB <acab@clamav.net>, Nick Piggin <npiggin@suse.de>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: Mutex vs semaphores scheduler bug
Date: Tue, 20 Oct 2009 22:02:00 +0300
Message-ID: <4ADE0928.6080104@clamav.net>
In-Reply-To: <1255793527.8152.4.camel@laptop>

On 2009-10-17 18:32, Peter Zijlstra wrote:
> On Fri, 2009-10-16 at 00:44 +0100, David Howells wrote:
>> Peter Zijlstra <peterz@infradead.org> wrote:
>>
>>> The problem appears to be that rwsem doesn't allow lock-stealing
>> With good reason.  rwsems can be read or write locked for a long time - so if
>> readers can jump the queue on read-locked rwsems, then writer starvation is a
>> real possibility.  I carefully implemented it so that it is a strict FIFO to
>> avoid certain problems I was having.
> 
> Well, it kinda sucks that rwsem is slower than a mutex.
> 
> What about allowing writer stealing when the next contending task is a
> writer?
> 

Also, can the scheduler be improved so that it doesn't unnecessarily wake
up processes that cannot take the semaphore?

AFAICT (in Linux 2.6.31.3) this is what happens:
 semaphore fails to get acquired -> rwsem.c:rwsem_down_failed_common
  -> if nobody holds the semaphore, wake up the process(es) that can acquire it
  -> loop calling schedule(), until woken

 the thread holding the semaphore releases it (no wakeups!), and the
scheduler will schedule the next process

There are several things that are suboptimal here IMHO:
1. The task calls schedule(), but does that prevent the scheduler from
scheduling it again while waiter->task is still non-NULL? If not, there
are useless wakeups (the task gets scheduled, sees that waiter->task is
still non-NULL, and calls schedule() again).

2. The process at the front of the queue is woken only when there is
contention, which introduces unnecessary latency for the task at the
front of the queue waiting for the semaphore.

3. Due to 2), a task on another CPU waiting for the semaphore is only
woken when another thread tries to acquire the lock, fails, and calls
schedule().

Suggested solution:
schedule() should know that scheduling this process is useless, since it
won't get the lock, so it shouldn't attempt to run it (perhaps by setting
a flag in the task state).
When a thread releases a semaphore, it should also wake the other
threads that were waiting on it (if any).

With both of these changes the scheduler will know that, when it
schedules the task, the task will be able to acquire the semaphore (it
was woken by wake_up_process() in rwsem_do_wake, which clears the
do-not-wake flag). The thread will also be scheduled as soon as another
thread (on another CPU) releases the semaphore, instead of waiting for a
timeslice to elapse.

Here is a snippet of ftrace of sched_switch:
<...>-27515 [000] 44338.386867:  27515:120:R   + [000] 27522:120:D <...>
<...>-27515 [000] 44338.386868:  27515:120:R   + [000] 27511:120:D <...>
<...>-27515 [000] 44338.386869:  27515:120:R   + [001] 27523:120:D <...>
<...>-27515 [000] 44338.386875:  27515:120:R   + [003] 27512:120:D <...>
<...>-27515 [000] 44338.386877:  27515:120:R   + [000] 27517:120:S <...>
<...>-27515 [000] 44338.386905:  27515:120:D ==> [000] 27517:120:R <...>
<...>-27517 [000] 44338.386907:  27517:120:S ==> [000] 27522:120:R <...>
<...>-27522 [000] 44338.386915:  27522:120:D ==> [000] 27511:120:R <...>
<...>-27511 [000] 44338.386917:  27511:120:R   + [001] 27523:120:D <...>
<...>-27511 [000] 44338.386918:  27511:120:D ==> [000] 0:140:R <idle>
<idle>-0     [000] 44338.386973:      0:140:R ==> [000] 27522:120:R <...>

If I interpret this correctly, 27515 wakes a bunch of readers (27511,
27523, 27512, 27517), then switches to 27517.
However 27517 immediately schedules again (!) after 2µs and switches to
27522, which runs for 8µs before scheduling to 27511; 27511 runs for a
short while again, and finally schedules to idle. All this scheduling
ping-pong looks like a waste: none of the tasks woken up did useful work.

Another situation is this scheduler ping-pong with md4_raid10: notice
how 27961 wakes up md4_raid10, switches to it, md4_raid10 schedules back
to 27961, which then schedules to idle (2µs later). md4_raid10 could
have gone directly to idle:

md4_raid10-982   [000] 44746.579830:    982:115:S ==> [000] 27961:120:R lt-clamd
  lt-clamd-27964 [003] 44746.579892:  27964:120:R   + [001] 27966:120:D lt-clamd
  lt-clamd-27964 [003] 44746.580065:  27964:120:D ==> [003] 0:140:R <idle>
  lt-clamd-27961 [000] 44746.580069:  27961:120:R   + [003] 27964:120:D lt-clamd
    <idle>-0     [003] 44746.580073:      0:140:R ==> [003] 27964:120:R lt-clamd
  lt-clamd-27961 [000] 44746.580912:  27961:120:R   + [002] 27960:120:S lt-clamd
  lt-clamd-27961 [000] 44746.580945:  27961:120:D   + [000] 982:115:S md4_raid10
  lt-clamd-27961 [000] 44746.580946:  27961:120:D ==> [000] 982:115:R md4_raid10
md4_raid10-982   [000] 44746.580948:    982:115:S ==> [000] 27961:120:D lt-clamd
  lt-clamd-27961 [000] 44746.580950:  27961:120:D ==> [000] 0:140:R <idle>
  lt-clamd-27964 [003] 44746.581054:  27964:120:R   + [000] 27961:120:D lt-clamd
    <idle>-0     [000] 44746.581064:      0:140:R ==> [000] 27961:120:R lt-clamd
  lt-clamd-27964 [003] 44746.581092:  27964:120:R   + [002] 27960:120:S lt-clamd
  lt-clamd-27964 [003] 44746.581125:  27964:120:D   + [000] 982:115:S md4_raid10
  lt-clamd-27964 [003] 44746.581128:  27964:120:D ==> [003] 27965:120:R lt-clamd
  lt-clamd-27961 [000] 44746.581128:  27961:120:R ==> [000] 982:115:R md4_raid10


Best regards,
--Edwin

Thread overview: 6+ messages
2009-10-10 14:57 Mutex vs semaphores scheduler bug Török Edwin
2009-10-12 14:53 ` Peter Zijlstra
2009-10-12 15:37   ` Török Edwin
2009-10-15 23:44   ` David Howells
2009-10-17 15:32     ` Peter Zijlstra
2009-10-20 19:02       ` Török Edwin [this message]
