public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* lock's trace events can improve mutex's performance in userspace?
@ 2010-03-11  9:43 Xiao Guangrong
  2010-03-13  4:27 ` Hitoshi Mitake
  0 siblings, 1 reply; 4+ messages in thread
From: Xiao Guangrong @ 2010-03-11  9:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Frederic Weisbecker, Peter Zijlstra,
	Hitoshi Mitake, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Ming Lei,
	LKML

[-- Attachment #1: Type: text/plain, Size: 940 bytes --]

We found that if enable lock's trace events, the 'sysbench mutex'
benchmark program can run quicker.

The simple program that is attached can reproduce it, the system info,
kernel config, and the script are also attached.

The test step is below:

# tar -zxvf test-mutex.tar.bz
# cd test-mutex
# make
# ./tscript.sh >& log
# cat log | grep "real"
real    0m46.765s < all trace events are disabled >
real    0m47.073s
real    0m47.402s
real    0m46.458s
real    0m47.433s
real    0m47.395s
real    0m47.010s
real    0m47.454s
real    0m47.044s
real    0m47.464s
real    0m39.245s < enable lock's trace events >
real    0m40.822s
real    0m40.779s
real    0m40.549s
real    0m40.605s
real    0m40.923s
real    0m40.560s
real    0m41.050s
real    0m40.757s
real    0m40.715s

[ "< ... >" is my comments ]

>From the result, we can see the program's runtime is less if enable lock's
trace events.

The conclusion is weird but i don't know why.

[-- Attachment #2: test-mutex.tar.bz --]
[-- Type: application/octet-stream, Size: 23996 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: lock's trace events can improve mutex's performance in userspace?
  2010-03-11  9:43 lock's trace events can improve mutex's performance in userspace? Xiao Guangrong
@ 2010-03-13  4:27 ` Hitoshi Mitake
  2010-04-05  9:14   ` Michel Lespinasse
  0 siblings, 1 reply; 4+ messages in thread
From: Hitoshi Mitake @ 2010-03-13  4:27 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Ingo Molnar, Steven Rostedt, Frederic Weisbecker, Peter Zijlstra,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Ming Lei, LKML

[-- Attachment #1: Type: text/plain, Size: 2321 bytes --]

On 03/11/10 18:43, Xiao Guangrong wrote:
 > We found that if enable lock's trace events, the 'sysbench mutex'
 > benchmark program can run quicker.
 >
 > The simple program that is attached can reproduce it, the system info,
 > kernel config, and the script are also attached.
 >
 > The test step is below:
 >
 > # tar -zxvf test-mutex.tar.bz
 > # cd test-mutex
 > # make
 > # ./tscript.sh>&  log
 > # cat log | grep "real"
 > real    0m46.765s<  all trace events are disabled>
 > real    0m47.073s
 > real    0m47.402s
 > real    0m46.458s
 > real    0m47.433s
 > real    0m47.395s
 > real    0m47.010s
 > real    0m47.454s
 > real    0m47.044s
 > real    0m47.464s
 > real    0m39.245s<  enable lock's trace events>
 > real    0m40.822s
 > real    0m40.779s
 > real    0m40.549s
 > real    0m40.605s
 > real    0m40.923s
 > real    0m40.560s
 > real    0m41.050s
 > real    0m40.757s
 > real    0m40.715s
 >
 > [ "<  ...>" is my comments ]
 >
 >  From the result, we can see the program's runtime is less if enable
lock's
 > trace events.
 >
 > The conclusion is weird but i don't know why.

Hi Xiao,

It's hard to believe, but...

% sudo ./tscript.sh &> log
% grep real log
real    0m24.132s
real    0m23.535s
real    0m20.064s
real    0m16.636s <- enabled from here
real    0m16.435s
real    0m17.339s

I could reproduce your surprising result.
(I only execed your benchmark 3 times.)

I rewrote your mainc.c and checked contended count of each test like
this way,
		if (pthread_mutex_trylock(&mutex) == EBUSY) {
			pthread_mutex_lock(&mutex);
			atomic_inc(&contended);
		}
# I'll attach my new mainc.c

% cat log
Run mutex with trace events disabled...
contended:25191221

real    0m24.132s
user    0m17.149s
sys     1m18.933s
contended:25360563

real    0m23.535s
user    0m17.233s
sys     1m16.213s
contended:23813911

real    0m20.064s
user    0m15.561s
sys     1m4.332s
Run mutex with lockdep events enabled...
contended:11458318

real    0m16.636s
user    0m10.173s
sys     0m55.595s
contended:11881095

real    0m16.435s
user    0m10.273s
sys     0m54.911s
contended:11261650

real    0m17.339s
user    0m10.225s
sys     0m58.556s

It seems that num of contention decreased to about half.
I don't know why this happened and effect to performance of it,
but this result is worth to consider.

Thanks,
	Hitoshi

[-- Attachment #2: mainc.c --]
[-- Type: text/x-csrc, Size: 1594 bytes --]

#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#include <errno.h>

#define THREADS_NUM	1000
#define MUTEX_LOCKS	300000

static pthread_mutex_t mutex;
static pthread_mutex_t thread_start_mutex;
static int dummy_count;

typedef struct {
	volatile int val;
} atomic_t;

atomic_t contended = { 0 };

static inline int
atomic_inc(atomic_t *addr)
{
	return __sync_add_and_fetch(&addr->val, 1);
}

static void *run_thread(void *arg)
{
	int i;

	pthread_mutex_lock(&thread_start_mutex);
 	pthread_mutex_unlock(&thread_start_mutex);

	for (i = 0; i < MUTEX_LOCKS; i++) {
		if (pthread_mutex_trylock(&mutex) == EBUSY) {
			pthread_mutex_lock(&mutex);
			atomic_inc(&contended);
		}
		dummy_count++;
		pthread_mutex_unlock(&mutex);
	}

	return NULL;
}

int main(int argc, char *argv[])
{
	pthread_t thread[THREADS_NUM];
	pthread_attr_t  thread_attr;
	int i;

	pthread_mutex_init(&mutex, NULL);
	pthread_mutex_init(&thread_start_mutex, NULL);
	pthread_attr_init(&thread_attr);
	pthread_attr_setstacksize(&thread_attr, 32*1024);

	pthread_mutex_lock(&thread_start_mutex);
	for (i = 0; i < THREADS_NUM; i++) {
		if (pthread_create(&thread[i], &thread_attr, run_thread, NULL) != 0) {
			printf("Thread[%d] create failed:", i);
			perror(NULL);
			exit(-1);
		}
    	}

	pthread_mutex_unlock(&thread_start_mutex);
	for (i = 0; i < THREADS_NUM; i++) {
		if (pthread_join(thread[i], NULL) != 0)
			printf("Thread #%d join failed.\n", i);
    	}
	pthread_mutex_destroy(&mutex);
	pthread_mutex_destroy(&thread_start_mutex);

	printf("contended:%d\n", contended.val);
	return 0;
}

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: lock's trace events can improve mutex's performance in userspace?
  2010-03-13  4:27 ` Hitoshi Mitake
@ 2010-04-05  9:14   ` Michel Lespinasse
  2010-04-08  9:05     ` Xiao Guangrong
  0 siblings, 1 reply; 4+ messages in thread
From: Michel Lespinasse @ 2010-04-05  9:14 UTC (permalink / raw)
  To: Hitoshi Mitake
  Cc: Xiao Guangrong, Ingo Molnar, Steven Rostedt, Frederic Weisbecker,
	Peter Zijlstra, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Ming Lei,
	LKML

Sorry for the late reply...

One thing to consider in locking micro-benchmarks is that often, code
changes that slow down parts of the contended code path where the lock
is not held, will result in an increase of the reported
micro-benchmark metric. This effect is particularly marked for
micro-benchmarks that consist of multiple threads doing empty
acquire/release loops.

As a thought experiment, imagine what would happen if you added a
one-millisecond sleep in the contended code path for mutex
acquisition. Soon all but one of your benchmark threads would be
sleeping, and the only non-sleeping thread would be able to spin on
that lock/unlock loop with no contention, resulting in very nice
results for the micro-benchmark. Remove the sleep and the lock/unlock
threads will have to contend, resulting in lower reported performance
metrics.

I think what you're seeing with lockdep events is a smaller case of
this - enabling lockdep slows down the contended code path, but also
reduces the number of times that it is taken...

On Fri, Mar 12, 2010 at 8:27 PM, Hitoshi Mitake
<mitake@dcl.info.waseda.ac.jp> wrote:
> On 03/11/10 18:43, Xiao Guangrong wrote:
>> We found that if enable lock's trace events, the 'sysbench mutex'
>> benchmark program can run quicker.
>>
>> The simple program that is attached can reproduce it, the system info,
>> kernel config, and the script are also attached.
>>
>> The test step is below:
>>
>> # tar -zxvf test-mutex.tar.bz
>> # cd test-mutex
>> # make
>> # ./tscript.sh>&  log
>> # cat log | grep "real"
>> real    0m46.765s<  all trace events are disabled>
>> real    0m47.073s
>> real    0m47.402s
>> real    0m46.458s
>> real    0m47.433s
>> real    0m47.395s
>> real    0m47.010s
>> real    0m47.454s
>> real    0m47.044s
>> real    0m47.464s
>> real    0m39.245s<  enable lock's trace events>
>> real    0m40.822s
>> real    0m40.779s
>> real    0m40.549s
>> real    0m40.605s
>> real    0m40.923s
>> real    0m40.560s
>> real    0m41.050s
>> real    0m40.757s
>> real    0m40.715s
>>
>> [ "<  ...>" is my comments ]
>>
>>  From the result, we can see the program's runtime is less if enable
> lock's
>> trace events.
>>
>> The conclusion is weird but i don't know why.
>
> Hi Xiao,
>
> It's hard to believe, but...
>
> % sudo ./tscript.sh &> log
> % grep real log
> real    0m24.132s
> real    0m23.535s
> real    0m20.064s
> real    0m16.636s <- enabled from here
> real    0m16.435s
> real    0m17.339s
>
> I could reproduce your surprising result.
> (I only execed your benchmark 3 times.)
>
> I rewrote your mainc.c and checked contended count of each test like
> this way,
>                if (pthread_mutex_trylock(&mutex) == EBUSY) {
>                        pthread_mutex_lock(&mutex);
>                        atomic_inc(&contended);
>                }
> # I'll attach my new mainc.c
>
> % cat log
> Run mutex with trace events disabled...
> contended:25191221
>
> real    0m24.132s
> user    0m17.149s
> sys     1m18.933s
> contended:25360563
>
> real    0m23.535s
> user    0m17.233s
> sys     1m16.213s
> contended:23813911
>
> real    0m20.064s
> user    0m15.561s
> sys     1m4.332s
> Run mutex with lockdep events enabled...
> contended:11458318
>
> real    0m16.636s
> user    0m10.173s
> sys     0m55.595s
> contended:11881095
>
> real    0m16.435s
> user    0m10.273s
> sys     0m54.911s
> contended:11261650
>
> real    0m17.339s
> user    0m10.225s
> sys     0m58.556s
>
> It seems that num of contention decreased to about half.
> I don't know why this happened and effect to performance of it,
> but this result is worth to consider.
>
> Thanks,
>        Hitoshi
>



-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: lock's trace events can improve mutex's performance in userspace?
  2010-04-05  9:14   ` Michel Lespinasse
@ 2010-04-08  9:05     ` Xiao Guangrong
  0 siblings, 0 replies; 4+ messages in thread
From: Xiao Guangrong @ 2010-04-08  9:05 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Hitoshi Mitake, Ingo Molnar, Steven Rostedt, Frederic Weisbecker,
	Peter Zijlstra, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Ming Lei,
	LKML

Hi Michel,

Michel Lespinasse wrote:
> Sorry for the late reply...
> 
> One thing to consider in locking micro-benchmarks is that often, code
> changes that slow down parts of the contended code path where the lock
> is not held, will result in an increase of the reported
> micro-benchmark metric. This effect is particularly marked for
> micro-benchmarks that consist of multiple threads doing empty
> acquire/release loops.
> 
> As a thought experiment, imagine what would happen if you added a
> one-millisecond sleep in the contended code path for mutex
> acquisition. Soon all but one of your benchmark threads would be
> sleeping, and the only non-sleeping thread would be able to spin on
> that lock/unlock loop with no contention, resulting in very nice
> results for the micro-benchmark. Remove the sleep and the lock/unlock
> threads will have to contend, resulting in lower reported performance
> metrics.

Great thanks for your valuable reply that makes we see the
issue more clearly.

I've do the test address your conjecture that add usleep(1) in mutex
acquisition path, the test result shows contention is reduced.

And i also do the test that does more work in mutex holding path, the
result shows optimization ratio is decreased.

Thanks,
Xiao

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2010-04-08  9:08 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-03-11  9:43 lock's trace events can improve mutex's performance in userspace? Xiao Guangrong
2010-03-13  4:27 ` Hitoshi Mitake
2010-04-05  9:14   ` Michel Lespinasse
2010-04-08  9:05     ` Xiao Guangrong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox