public inbox for linux-kernel@vger.kernel.org
* Mutex vs semaphores scheduler bug
@ 2009-10-10 14:57 Török Edwin
  2009-10-12 14:53 ` Peter Zijlstra
  0 siblings, 1 reply; 6+ messages in thread
From: Török Edwin @ 2009-10-10 14:57 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra; +Cc: Linux Kernel, aCaB

[-- Attachment #1: Type: text/plain, Size: 5594 bytes --]

Hi,

If a semaphore (such as mmap_sem) is heavily contended, then using a
userspace mutex makes the program faster.

For example, using a mutex around *anonymous* mmaps speeds it up
significantly (~80% on this microbenchmark, ~15% on real applications).
Such workarounds shouldn't be necessary for userspace applications; the
kernel should use the most efficient lock implementation by default.

Run the attached testcase and observe that, when using a mutex, the
elapsed time (and user+sys time too) is smaller than without it [1].

So using mmap/munmap by itself (and thus directly hitting the
semaphore), memsetting only the first page, is 80% slower than doing the
same guarded by a mutex.
If the entire allocated region is memset (thus faulting in all pages),
the mmap without a mutex is still 7% slower.

In both cases the maximum time for an mmap is HIGHER when using the
mutex (8.7 ms > 5.1 ms and 4 ms > 0.2 ms), but the average time is
smaller. Moreover, when using a mutex the number of context switches is
SMALLER by 40-60%.

I think it's a bug in the scheduler: it schedules the mutex case much
better. Maybe because userspace also spins a bit before actually calling
futex(). So *maybe* a possible fix would be to try and spin a little on
some semaphores before taking them.
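To illustrate what I mean by spinning before futex(): a minimal spin-then-sleep lock could look like the sketch below. This is my own sketch, not glibc's actual code; the names `adaptive_lock`, `adaptive_unlock` and `SPIN_LIMIT` are made up, and the 0/1/2 futex-word protocol is just the usual convention:

```c
/* Sketch of a spin-then-sleep lock: spin briefly in userspace, then fall
 * back to futex(FUTEX_WAIT).  Futex word: 0 = free, 1 = locked,
 * 2 = locked with (possible) waiters. */
#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

#define SPIN_LIMIT 100  /* arbitrary; real implementations tune this */

static void futex_wait(atomic_int *uaddr, int val)
{
    /* sleep while *uaddr == val */
    syscall(SYS_futex, uaddr, FUTEX_WAIT, val, NULL, NULL, 0);
}

static void futex_wake(atomic_int *uaddr)
{
    syscall(SYS_futex, uaddr, FUTEX_WAKE, 1, NULL, NULL, 0);
}

void adaptive_lock(atomic_int *lock)
{
    for (int i = 0; i < SPIN_LIMIT; i++) {
        int expected = 0;
        if (atomic_compare_exchange_strong(lock, &expected, 1))
            return;                     /* got it while spinning */
    }
    /* Contended: advertise a waiter and sleep until woken. */
    while (atomic_exchange(lock, 2) != 0)
        futex_wait(lock, 2);
}

void adaptive_unlock(atomic_int *lock)
{
    if (atomic_exchange(lock, 0) == 2)  /* someone may be sleeping */
        futex_wake(lock);
}
```

With a short critical section, the lock is usually acquired during the spin phase without ever entering the kernel, which is exactly what saves the context switches.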

I think it's important to optimize the mmap_sem semaphore, because it is
hit quite often (even when not mmapping files):
 - mmap of anon memory
 - malloc (due to above)
 - stack increase
 - page faults (of both file-mapped and anon memory)

P.S.: this is not just a microbenchmark; it came up while profiling
ClamAV, where the mutex workaround gives a ~15% speedup.
P.S.: if needed I can send you my .config
Best regards,
--Edwin

[1]

Timings on Linux 2.6.31 Intel(R) Core(TM)2 Quad  CPU   Q9550  @ 2.83GHz
--------------------------------------------------------------------------------------------------------------------------

Running tests with ntimings=100, nloops=100, nthreads=16, number of CPUs=4
Starting test mmap sem congestion (memset all)
Test mmap sem congestion (memset all) ended
Timing mmap sem congestion (memset all): elapsed time: 99.1104 s, 3.055 %user, 38.521 %sys
max 5175.1 us, average 2477.75 us, stdev 49681.3 us
resource usage: 3.02818 s user, 38.1784 s sys, 99.1104 s elapsed, (10240001 min + 0 maj) pagefaults, 5844551 + 2924 context switches

Starting test mmap sem congestion with mutex workaround (memset all)
Test mmap sem congestion with mutex workaround (memset all) ended
Timing mmap sem congestion with mutex workaround (memset all): elapsed time: 93.4126 s, 3.357 %user, 31.822 %sys
max 8710.94 us, average 2335.3 us, stdev 549728 us
resource usage: 3.13619 s user, 29.7259 s sys, 93.4126 s elapsed, (10240000 min + 0 maj) pagefaults, 3316568 + 1464 context switches

Starting test mmap sem congestion (memset firstpage)
Test mmap sem congestion (memset firstpage) ended
Timing mmap sem congestion (memset firstpage): elapsed time: 6.73074 s, 0.000 %user, 15.571 %sys
max 229.96 us, average 168.263 us, stdev 541.789 us
resource usage: 0 s user, 1.04806 s sys, 6.73074 s elapsed, (160000 min + 0 maj) pagefaults, 639971 + 7993 context switches

Starting test mmap sem congestion with mutex workaround (memset firstpage)
Test mmap sem congestion with mutex workaround (memset firstpage) ended
Timing mmap sem congestion with mutex workaround (memset firstpage): elapsed time: 1.23567 s, 2.590 %user, 37.553 %sys
max 4039.39 us, average 30.887 us, stdev 61860.3 us
resource usage: 0.032 s user, 0.464025 s sys, 1.23567 s elapsed, (160000 min + 0 maj) pagefaults, 22335 + 53 context switches

Timings on Linux 2.6.30.4 Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
--------------------------------------------------------------------------------------------------------------------------

Running tests with ntimings=100, nloops=100, nthreads=16, number of CPUs=8
Starting test mmap sem congestion (memset all)
Test mmap sem congestion (memset all) ended
Timing mmap sem congestion (memset all): elapsed time: 81.7756 s, 10.527 %user, 317.860 %sys
	max 4893.67 us, average 4088.78 us, stdev 23209.1 us
	resource usage: 8.60853 s user, 259.932 s sys, 81.7756 s elapsed, (10240002 min + 0 maj) pagefaults, 5278414 + 16294 context switches

Starting test mmap sem congestion with mutex workaround (memset all)
Test mmap sem congestion with mutex workaround (memset all) ended
Timing mmap sem congestion with mutex workaround (memset all): elapsed time: 55.312 s, 13.068 %user, 259.070 %sys
	max 5665.63 us, average 2765.6 us, stdev 527012 us
	resource usage: 7.22845 s user, 143.297 s sys, 55.312 s elapsed, (10240000 min + 0 maj) pagefaults, 3351636 + 14707 context switches

Starting test mmap sem congestion (memset firstpage)
Test mmap sem congestion (memset firstpage) ended
Timing mmap sem congestion (memset firstpage): elapsed time: 6.08303 s, 0.526 %user, 51.951 %sys
	max 354.55 us, average 304.151 us, stdev 74.5072 us
	resource usage: 0.032 s user, 3.16019 s sys, 6.08303 s elapsed, (160000 min + 0 maj) pagefaults, 639994 + 3959 context switches

Starting test mmap sem congestion with mutex workaround (memset firstpage)
Test mmap sem congestion with mutex workaround (memset firstpage) ended
Timing mmap sem congestion with mutex workaround (memset firstpage): elapsed time: 1.15893 s, 6.558 %user, 90.088 %sys
	max 4767.38 us, average 57.9458 us, stdev 83782.6 us
	resource usage: 0.076001 s user, 1.04406 s sys, 1.15893 s elapsed, (160000 min + 0 maj) pagefaults, 64700 + 29 context switches



[-- Attachment #2: scheduler.c --]
[-- Type: text/x-csrc; name="scheduler.c", Size: 8077 bytes --]

/*
 *  Copyright (C) 2009 Török Edwin
 *
 *  This program is free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License version 2 as
 *  published by the Free Software Foundation.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  You should have received a copy of the GNU General Public License
 *  along with this program; if not, write to the Free Software
 *  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
 *  MA 02110-1301, USA.
 */

/* 
 * Compile with: gcc scheduler.c  -Wall -lm -pthread -g -O2 
 */
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/time.h>
#include <stdint.h>
#include <sys/resource.h>
#include <pthread.h>
#include <math.h>
#include <unistd.h>
#include <string.h>
#define RUSAGE_THREAD 1 /* why isn't this in resource.h? */

static int32_t mmap_sem_congestion_all(void)
{
    void *data = mmap(NULL, 256*1024, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (data == MAP_FAILED) {
	perror("mmap");
	abort();
    }
    madvise(data, 256*1024, MADV_RANDOM|MADV_DONTFORK);
    memset(data, 0x5a, 256*1024);
    munmap(data, 256*1024);
    return 1;
}

static int32_t mmap_sem_congestion_firstpage(void)
{
    void *data = mmap(NULL, 256*1024, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (data == MAP_FAILED) {
	perror("mmap");
	abort();
    }
    madvise(data, 256*1024, MADV_RANDOM|MADV_DONTFORK);
    memset(data, 0x5a, 4096);
    munmap(data, 256*1024);
    return 1;
}

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static int32_t mmap_sem_congestion_mtx_all(void)
{
    pthread_mutex_lock(&mutex);
    void *data = mmap(NULL, 256*1024, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (data == MAP_FAILED) {
	perror("mmap");
	abort();
    }
    madvise(data, 256*1024, MADV_RANDOM|MADV_DONTFORK);
    pthread_mutex_unlock(&mutex);
    memset(data, 0x5a, 256*1024);
    pthread_mutex_lock(&mutex);
    munmap(data, 256*1024);
    pthread_mutex_unlock(&mutex);
    return 1;
}

static int32_t mmap_sem_congestion_mtx_firstpage(void)
{
    pthread_mutex_lock(&mutex);
    void *data = mmap(NULL, 256*1024, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (data == MAP_FAILED) {
	perror("mmap");
	abort();
    }
    madvise(data, 256*1024, MADV_RANDOM|MADV_DONTFORK);
    memset(data, 0x5a, 4096);
    pthread_mutex_unlock(&mutex);
    pthread_mutex_lock(&mutex);
    munmap(data, 256*1024);
    pthread_mutex_unlock(&mutex);
    return 1;
}


typedef int32_t (*foofunc)(void);
volatile int nodce;
struct result {
    long double average;
    long double sum;
    long double sum2;
    uint64_t tmax;
    uint64_t tmin;
    struct timeval elapsedtime;
    struct rusage rusage;
} result;

static pthread_mutex_t result_mutex = PTHREAD_MUTEX_INITIALIZER;

static void timeit(foofunc f, const char *name, unsigned ntimings, unsigned nloops)
{
    struct rusage rs0, rs1;
    struct timeval tv00, tv01;

    int64_t timings[ntimings], tmin,tmax;
    long double avg, sum2;
    int64_t sum=0;
    int32_t tmp=0;
    unsigned i, j;

    if (getrusage(RUSAGE_THREAD, &rs0) == -1) {
	perror("getrusage");
	abort();
    }
    gettimeofday(&tv00, NULL);
    for (j=0;j<ntimings;j++) {
	struct timeval tv0, tv1;
	gettimeofday(&tv0, NULL);
	for (i=0;i<nloops;i++)
	    tmp += f();
	gettimeofday(&tv1, NULL);
	nodce = tmp;
	timings[j] = (tv1.tv_usec - tv0.tv_usec) + (tv1.tv_sec - tv0.tv_sec)*1000000;
	sum += timings[j];
    }
    gettimeofday(&tv01, NULL);
    if (getrusage(RUSAGE_THREAD, &rs1) == -1) {
	perror("getrusage");
	abort();
    }
    avg = ((long double)sum) / ntimings / nloops;

    tmin=tmax=timings[0];
    sum2 = 0;
    for (j=0;j<ntimings;j++) {
	if (timings[j] > tmax)
	    tmax = timings[j];
	else if (timings[j] < tmin) {
	    tmin = timings[j];
	}
	long double t = (long double)timings[j]/nloops; /* avoid integer division */
	sum2 += t*t;
    }
    pthread_mutex_lock(&result_mutex);
    result.average += avg;
    result.sum += sum/nloops;
    result.sum2 += sum2;
    result.elapsedtime.tv_sec += tv01.tv_sec - tv00.tv_sec;
    result.elapsedtime.tv_usec += tv01.tv_usec - tv00.tv_usec;
    if (tmax > result.tmax)
	result.tmax = tmax;
    if (tmin < result.tmin)
	result.tmin = tmin;
    result.rusage.ru_utime.tv_sec += rs1.ru_utime.tv_sec - rs0.ru_utime.tv_sec;
    result.rusage.ru_utime.tv_usec += rs1.ru_utime.tv_usec - rs0.ru_utime.tv_usec;
    result.rusage.ru_stime.tv_sec += rs1.ru_stime.tv_sec - rs0.ru_stime.tv_sec;
    result.rusage.ru_stime.tv_usec += rs1.ru_stime.tv_usec - rs0.ru_stime.tv_usec;
    result.rusage.ru_minflt += rs1.ru_minflt - rs0.ru_minflt;
    result.rusage.ru_majflt += rs1.ru_majflt - rs0.ru_majflt;
    result.rusage.ru_nvcsw += rs1.ru_nvcsw - rs0.ru_nvcsw;
    result.rusage.ru_nivcsw += rs1.ru_nivcsw - rs0.ru_nivcsw;
    pthread_mutex_unlock(&result_mutex);
}

struct data {
    const char *name;
    foofunc f;
    unsigned ntimings;
    unsigned nloops;
};

static void *thread(void* arg)
{
    struct data *data = (struct data*)arg;
    timeit(data->f, data->name, data->ntimings, data->nloops);
    return arg;
}
static int numcpu;
static void runtest(const char *name, foofunc f, unsigned ntimings, unsigned nloops, unsigned nthreads)
{
    unsigned i;
    pthread_t threads[nthreads];
    struct data data;
    data.name = name;
    data.f = f;
    data.ntimings = ntimings;
    data.nloops = nloops;
    printf("Starting test %s\n", name);
    memset(&result, 0, sizeof(result));
    for (i=0;i<nthreads;i++) {
	if (pthread_create(&threads[i], NULL, thread, (void*)&data) != 0) { /* returns an error number, not -1 */
	    perror("pthread_create");
	    abort();
	}
    }
    for (i=0;i<nthreads;i++) {
	pthread_join(threads[i], NULL);
    }
    long double avg = result.average / nthreads;
    printf("Test %s ended\n", name);
    unsigned count = ntimings*nthreads;
    long double stdev = sqrtl((count*result.sum2 - result.sum*result.sum) /
			      (count*(count-1))); /* sample stdev needs the sqrt */
    long double elapsedtime = result.elapsedtime.tv_sec + result.elapsedtime.tv_usec/1000000.0;
    /* utime/stime */
    elapsedtime /= numcpu;
    long double utime = result.rusage.ru_utime.tv_sec + result.rusage.ru_utime.tv_usec/1000000.0;
    long double stime = result.rusage.ru_stime.tv_sec + result.rusage.ru_stime.tv_usec/1000000.0;
    double up = 100.0*utime/elapsedtime;
    double sp = 100.0*stime/elapsedtime;

    printf("Timing %s: elapsed time: %Lg s, %.3f %%user, %.3f %%sys\n", name, elapsedtime, up, sp);
    printf("\tmax %g us, average %Lg us, stdev %Lg us\n",
	   1.0*result.tmax/nloops, avg, stdev);
    printf("\tresource usage: %Lg s user, %Lg s sys, %Lg s elapsed, (%lu min + %lu maj) pagefaults, %lu + %lu context switches\n\n",
	   utime, stime, elapsedtime,
	   result.rusage.ru_minflt, result.rusage.ru_majflt,
	   result.rusage.ru_nvcsw, result.rusage.ru_nivcsw);
}


int main(int argc, char *argv[])
{
    numcpu = sysconf(_SC_NPROCESSORS_ONLN);
    if (numcpu == -1) {
	perror("sysconf(_SC_NPROCESSORS_ONLN)");
	abort();
    }
    unsigned ntimings=100, nloops=100, nthreads=16;
    if (argc > 1) {
	ntimings = atoi(argv[1]);
	if (argc > 2) {
	    nloops = atoi(argv[2]);
	    if (argc > 3) {
		nthreads = atoi(argv[3]);
	    }
	}
    }
    printf("Running tests with ntimings=%u, nloops=%u, nthreads=%u, number of CPUs=%d\n",
	   ntimings, nloops, nthreads, numcpu);
    runtest("mmap sem congestion (memset all)", mmap_sem_congestion_all, ntimings, nloops, nthreads);
    runtest("mmap sem congestion with mutex workaround (memset all)", mmap_sem_congestion_mtx_all, ntimings, nloops, nthreads);
    runtest("mmap sem congestion (memset firstpage)", mmap_sem_congestion_firstpage, ntimings, nloops, nthreads);
    runtest("mmap sem congestion with mutex workaround (memset firstpage)", mmap_sem_congestion_mtx_firstpage, ntimings, nloops, nthreads);
    return 0;
}


* Re: Mutex vs semaphores scheduler bug
  2009-10-10 14:57 Mutex vs semaphores scheduler bug Török Edwin
@ 2009-10-12 14:53 ` Peter Zijlstra
  2009-10-12 15:37   ` Török Edwin
  2009-10-15 23:44   ` David Howells
  0 siblings, 2 replies; 6+ messages in thread
From: Peter Zijlstra @ 2009-10-12 14:53 UTC (permalink / raw)
  To: Török Edwin
  Cc: Ingo Molnar, Linux Kernel, aCaB, David Howells, Nick Piggin,
	Linus Torvalds, Thomas Gleixner

On Sat, 2009-10-10 at 17:57 +0300, Török Edwin wrote:
> If a semaphore (such as mmap_sem) is heavily contended, then using a
> userspace mutex makes the program faster.
> 
> For example, using a mutex around *anonymous* mmaps speeds it up
> significantly (~80% on this microbenchmark, ~15% on real applications).
> Such workarounds shouldn't be necessary for userspace applications; the
> kernel should use the most efficient lock implementation by default.

Should, yes, does, no.

> However when using a mutex the number of context switches is SMALLER by
> 40-60%.

That matches the problem, see below.

> I think it's a bug in the scheduler: it schedules the mutex case much
> better.

It's not, the scheduler doesn't know about mutexes/futexes/rwsems.

> Maybe because userspace also spins a bit before actually calling
> futex().

Nope. If we were ever to spin, it would be in the kernel after calling
FUTEX_LOCK (which currently doesn't exist). glibc shouldn't do any
spinning on its own (if it does, I have yet another reason to try and
supplant the glibc futex code).

> I think it's important to optimize the mmap_sem semaphore

It is.

The problem appears to be that rwsem doesn't allow lock-stealing, and
very strictly maintains FIFO order on contention. This results in extra
schedules and reduced performance, as you noticed.

What happens is that when we release a contended rwsem we assign it to
the next waiter. If, before that waiter gets run, another (running) task
comes along and tries to acquire the lock, it gets put to sleep, even
though it could possibly have acquired it (and the woken waiter would
have detected the failure and gone back to sleep).

So what I think we need to do is have a look at all this lib/rwsem.c
slowpath code and hack in lock stealing.
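To make "lock stealing" concrete, here is a toy userspace model of the idea. This is entirely a sketch under a simplified locking scheme of my own (`count`, `nwaiters`, and the function names are invented); lib/rwsem.c actually uses a biased atomic count plus a waiter list:

```c
/* Toy model: count 0 = free, -1 = write-locked; nwaiters counts queued
 * tasks.  A "stealing" trylock ignores nwaiters, so a running task can
 * grab a just-released lock ahead of an already-woken FIFO waiter,
 * saving the schedule-in / fail / schedule-out round trip. */
#include <stdatomic.h>
#include <stdbool.h>
#include <sched.h>

static atomic_int count = 0;
static atomic_int nwaiters = 0;

bool write_trylock_steal(void)
{
    int expected = 0;
    /* Succeeds even when nwaiters > 0: strict FIFO order is dropped. */
    return atomic_compare_exchange_strong(&count, &expected, -1);
}

void write_lock(void)
{
    while (!write_trylock_steal()) {
        atomic_fetch_add(&nwaiters, 1);
        sched_yield();              /* stand-in for the rwsem queue+sleep */
        atomic_fetch_sub(&nwaiters, 1);
    }
}

void write_unlock(void)
{
    atomic_store(&count, 0);        /* plus a wakeup in the real thing */
}
```

The trade-off David raises below is visible here: because `write_trylock_steal()` never looks at `nwaiters`, a steady stream of stealing writers could starve the queued waiters indefinitely.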




* Re: Mutex vs semaphores scheduler bug
  2009-10-12 14:53 ` Peter Zijlstra
@ 2009-10-12 15:37   ` Török Edwin
  2009-10-15 23:44   ` David Howells
  1 sibling, 0 replies; 6+ messages in thread
From: Török Edwin @ 2009-10-12 15:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Linux Kernel, aCaB, David Howells, Nick Piggin,
	Linus Torvalds, Thomas Gleixner

On 2009-10-12 17:53, Peter Zijlstra wrote:
> On Sat, 2009-10-10 at 17:57 +0300, Török Edwin wrote:
>> If a semaphore (such as mmap_sem) is heavily contended, then using a
>> userspace mutex makes the program faster.
>>
>> For example, using a mutex around *anonymous* mmaps speeds it up
>> significantly (~80% on this microbenchmark, ~15% on real applications).
>> Such workarounds shouldn't be necessary for userspace applications; the
>> kernel should use the most efficient lock implementation by default.
> 
> Should, yes, does, no.
> 
>> However when using a mutex the number of context switches is SMALLER by
>> 40-60%.
> 
> That matches the problem, see below.
> 
>> I think it's a bug in the scheduler: it schedules the mutex case much
>> better.
> 
> It's not, the scheduler doesn't know about mutexes/futexes/rwsems.
> 
>> Maybe because userspace also spins a bit before actually calling
>> futex().
> 
> Nope, if we would ever spin, it would be in the kernel after calling
> FUTEX_LOCK (which currently doesn't exist). glibc shouldn't do any
> spinning on its own (if it does, I have yet another reason to try and
> supplant the glibc futex code).

I think it doesn't by default; I was misled by the huge number of cases
in pthread_mutex_lock.c. The default one does this:

__lll_lock_wait:
	cfi_startproc
	pushq	%r10
	cfi_adjust_cfa_offset(8)
	pushq	%rdx
	cfi_adjust_cfa_offset(8)
	cfi_offset(%r10, -16)
	cfi_offset(%rdx, -24)
	xorq	%r10, %r10	/* No timeout.  */
	movl	$2, %edx
	LOAD_FUTEX_WAIT (%esi)

	cmpl	%edx, %eax	/* NB:	 %edx == 2 */
	jne	2f

1:	movl	$SYS_futex, %eax
	syscall

2:	movl	%edx, %eax
	xchgl	%eax, (%rdi)	/* NB:	 lock is implied */

	testl	%eax, %eax
	jnz	1b

	popq	%rdx
	cfi_adjust_cfa_offset(-8)
	cfi_restore(%rdx)
	popq	%r10
	cfi_adjust_cfa_offset(-8)
	cfi_restore(%r10)
	retq
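For readability, a plausible C equivalent of that slow path (my own sketch, not glibc source) would be:

```c
/* Hypothetical C rendering of the asm above.  Futex word: 0 = free,
 * 1 = locked, 2 = locked with waiters.  Entered after the userspace
 * fast-path 0 -> 1 CAS has already failed. */
#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

void lll_lock_wait(atomic_int *futexp)
{
    /* The xchgl: mark the lock contended; if it was actually free by
     * now, we own it and return without any syscall. */
    while (atomic_exchange(futexp, 2) != 0)
        /* futex(futexp, FUTEX_WAIT, 2): sleep while *futexp == 2 */
        syscall(SYS_futex, futexp, FUTEX_WAIT, 2, NULL, NULL, 0);
}
```

Note there is no spin loop anywhere, consistent with Peter's point that glibc doesn't spin by default.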


> 
>> I think it's important to optimize the mmap_sem semaphore
> 
> It is.
> 
> The problem appears to be that rwsem doesn't allow lock-stealing

OK, sorry for mistaking the lack of lock-stealing for a scheduler bug.

>, and
> very strictly maintains FIFO order on contention. This results in extra
> schedules and reduced performance as you noticed.
> 
> What happens is that when we release a contended rwsem we assign it to
> the next waiter. If, before that waiter gets run, another (running) task
> comes along and tries to acquire the lock, it gets put to sleep, even
> though it could possibly have acquired it (and the woken waiter would
> have detected the failure and gone back to sleep).

The reason I initially thought it was a scheduler bug is that it seemed
to have something to do with wakeups, with threads sleeping too long
while waiting for the lock.
But I think the scheduler can't give preference to tasks that would be
able to acquire a semaphore they were sleeping on, because that would
throw fair scheduling off balance, right?

> 
> So what I think we need to do is have a look at all this lib/rwsem.c
> slowpath code and hack in lock stealing.
> 
> 

Best regards,
--Edwin


* Re: Mutex vs semaphores scheduler bug
  2009-10-12 14:53 ` Peter Zijlstra
  2009-10-12 15:37   ` Török Edwin
@ 2009-10-15 23:44   ` David Howells
  2009-10-17 15:32     ` Peter Zijlstra
  1 sibling, 1 reply; 6+ messages in thread
From: David Howells @ 2009-10-15 23:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dhowells, Török Edwin, Ingo Molnar, Linux Kernel, aCaB,
	Nick Piggin, Linus Torvalds, Thomas Gleixner

Peter Zijlstra <peterz@infradead.org> wrote:

> The problem appears to be that rwsem doesn't allow lock-stealing

With good reason.  rwsems can be read or write locked for a long time - so if
readers can jump the queue on read-locked rwsems, then writer starvation is a
real possibility.  I carefully implemented it so that it is a strict FIFO to
avoid certain problems I was having.

David


* Re: Mutex vs semaphores scheduler bug
  2009-10-15 23:44   ` David Howells
@ 2009-10-17 15:32     ` Peter Zijlstra
  2009-10-20 19:02       ` Török Edwin
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2009-10-17 15:32 UTC (permalink / raw)
  To: David Howells
  Cc: Török Edwin, Ingo Molnar, Linux Kernel, aCaB,
	Nick Piggin, Linus Torvalds, Thomas Gleixner

On Fri, 2009-10-16 at 00:44 +0100, David Howells wrote:
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > The problem appears to be that rwsem doesn't allow lock-stealing
> 
> With good reason.  rwsems can be read or write locked for a long time - so if
> readers can jump the queue on read-locked rwsems, then writer starvation is a
> real possibility.  I carefully implemented it so that it is a strict FIFO to
> avoid certain problems I was having.

Well, it kinda sucks that rwsem is slower than a mutex.

What about allowing writer stealing when the next contending task is a
writer?



* Re: Mutex vs semaphores scheduler bug
  2009-10-17 15:32     ` Peter Zijlstra
@ 2009-10-20 19:02       ` Török Edwin
  0 siblings, 0 replies; 6+ messages in thread
From: Török Edwin @ 2009-10-20 19:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Howells, Ingo Molnar, Linux Kernel, aCaB, Nick Piggin,
	Linus Torvalds, Thomas Gleixner

On 2009-10-17 18:32, Peter Zijlstra wrote:
> On Fri, 2009-10-16 at 00:44 +0100, David Howells wrote:
>> Peter Zijlstra <peterz@infradead.org> wrote:
>>
>>> The problem appears to be that rwsem doesn't allow lock-stealing
>> With good reason.  rwsems can be read or write locked for a long time - so if
>> readers can jump the queue on read-locked rwsems, then writer starvation is a
>> real possibility.  I carefully implemented it so that it is a strict FIFO to
>> avoid certain problems I was having.
> 
> Well, it kinda sucks that rwsem is slower than a mutex.
> 
> What about allowing writer stealing when the next contending task is a
> writer?
> 

Also, can the scheduler be improved so that it doesn't unnecessarily
wake up processes that can't take the semaphore?

AFAICT (in Linux 2.6.31.3) this is what happens:
 acquiring the semaphore fails -> rwsem.c:rwsem_down_failed_common
  -> if nobody holds the semaphore, wake up the process(es) that can
acquire it
  -> loop calling schedule() until woken

 Meanwhile the thread holding the semaphore releases it (no wakeups!),
and the scheduler will schedule the next process.
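A userspace model of that wait/grant handshake (my simplification of the 2.6.31 shape; the struct and function names here are invented, not kernel source) looks like:

```c
/* Model of the rwsem_down_failed_common() loop: the waiter sleeps until
 * the releaser clears waiter->task, and every pass through the loop
 * where task is still set is a schedule() that gained nothing. */
#include <stdatomic.h>
#include <sched.h>

struct waiter {
    atomic_int task_set;    /* models waiter->task: 1 = still waiting */
    int wasted_wakeups;     /* ran, found task still set, slept again */
};

/* Releaser side (rwsem_do_wake() analogue): hand over the lock by
 * clearing waiter->task, which is what lets the sleeper proceed. */
void grant_lock(struct waiter *w)
{
    atomic_store(&w->task_set, 0);
}

/* Contended path: keep sleeping until the lock was granted to us. */
void down_failed_wait(struct waiter *w)
{
    while (atomic_load(&w->task_set)) {
        w->wasted_wakeups++;    /* a pass that did no useful work */
        sched_yield();          /* models schedule() */
    }
}
```

The complaint in points 1-3 below is about the `while` loop: nothing stops the scheduler from running the task while `task_set` is still 1, so it can spin through wasted wakeups instead of staying asleep until `grant_lock()`.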

There are several things that are suboptimal here IMHO:
1. The task calls schedule(), but does that prevent the scheduler from
scheduling it again while waiter->task is non-NULL? If not, there are
useless wakeups (the task gets scheduled, sees waiter->task still set,
and calls schedule() again).

2. The process at the front of the queue is woken only when there is
further contention, which adds unnecessary latency for that task while
it waits for the semaphore.

3. Due to 2), a task on another CPU waiting for the semaphore is only
woken when schedule() is called, i.e. when another thread tries to
acquire the lock and fails.

Suggested solution:
schedule() should know that scheduling this process is useless: it won't
get the lock, so the scheduler shouldn't attempt to run it (maybe by
setting a flag in the task state).
When a thread releases a semaphore, it should also try to wake the other
threads that were waiting on it (if any).

With both of these changes the scheduler will know that when it
schedules the task it will be able to acquire the semaphore (it was
woken by wake_up_process() in rwsem_do_wake(), which would clear the
do-not-wake flag). The thread would also be scheduled as soon as another
thread (on another CPU) releases the semaphore, instead of waiting for a
timeslice to elapse.

Here is a snippet of an ftrace sched_switch trace:
<...>-27515 [000] 44338.386867:  27515:120:R   + [000] 27522:120:D <...>
<...>-27515 [000] 44338.386868:  27515:120:R   + [000] 27511:120:D <...>
<...>-27515 [000] 44338.386869:  27515:120:R   + [001] 27523:120:D <...>
<...>-27515 [000] 44338.386875:  27515:120:R   + [003] 27512:120:D <...>
<...>-27515 [000] 44338.386877:  27515:120:R   + [000] 27517:120:S <...>
<...>-27515 [000] 44338.386905:  27515:120:D ==> [000] 27517:120:R <...>
<...>-27517 [000] 44338.386907:  27517:120:S ==> [000] 27522:120:R <...>
<...>-27522 [000] 44338.386915:  27522:120:D ==> [000] 27511:120:R <...>
<...>-27511 [000] 44338.386917:  27511:120:R   + [001] 27523:120:D <...>
<...>-27511 [000] 44338.386918:  27511:120:D ==> [000] 0:140:R <idle>
<idle>-0     [000] 44338.386973:      0:140:R ==> [000] 27522:120:R <...>

If I interpret this correctly, 27515 wakes a bunch of readers (27511,
27523, 27512, 27517), then switches to 27517.
However 27517 immediately schedules again (!) after 2 µs and wakes
27522, which runs for 8 µs, then schedules to 27511, which runs for a
short while again, and finally it schedules to idle. All this scheduling
ping-pong looks like a waste; none of the tasks woken up did useful work.

Another situation is this scheduler ping-pong with md4_raid10: notice
how 27961 wakes up md4_raid10, switches to it, md4_raid10 schedules back
to 27961, which then schedules to idle (2 µs later). md4_raid10 could
have gone directly to idle...:

md4_raid10-982   [000] 44746.579830:    982:115:S ==> [000] 27961:120:R lt-clamd
lt-clamd-27964   [003] 44746.579892:  27964:120:R   + [001] 27966:120:D lt-clamd
lt-clamd-27964   [003] 44746.580065:  27964:120:D ==> [003] 0:140:R <idle>
lt-clamd-27961   [000] 44746.580069:  27961:120:R   + [003] 27964:120:D lt-clamd
<idle>-0         [003] 44746.580073:      0:140:R ==> [003] 27964:120:R lt-clamd
lt-clamd-27961   [000] 44746.580912:  27961:120:R   + [002] 27960:120:S lt-clamd
lt-clamd-27961   [000] 44746.580945:  27961:120:D   + [000] 982:115:S md4_raid10
lt-clamd-27961   [000] 44746.580946:  27961:120:D ==> [000] 982:115:R md4_raid10
md4_raid10-982   [000] 44746.580948:    982:115:S ==> [000] 27961:120:D lt-clamd
lt-clamd-27961   [000] 44746.580950:  27961:120:D ==> [000] 0:140:R <idle>
lt-clamd-27964   [003] 44746.581054:  27964:120:R   + [000] 27961:120:D lt-clamd
<idle>-0         [000] 44746.581064:      0:140:R ==> [000] 27961:120:R lt-clamd
lt-clamd-27964   [003] 44746.581092:  27964:120:R   + [002] 27960:120:S lt-clamd
lt-clamd-27964   [003] 44746.581125:  27964:120:D   + [000] 982:115:S md4_raid10
lt-clamd-27964   [003] 44746.581128:  27964:120:D ==> [003] 27965:120:R lt-clamd
lt-clamd-27961   [000] 44746.581128:  27961:120:R ==> [000] 982:115:R md4_raid10


Best regards,
--Edwin

