public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4
@ 2006-02-23 19:55 Gautam H Thaker
  2006-02-23 20:15 ` Benjamin LaHaise
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Gautam H Thaker @ 2006-02-23 19:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: Gautam H. Thaker - LM ATL, Ingo Molnar

The real-time patches at the URL below do a great job of endowing Linux with
real-time capabilities.

http://people.redhat.com/mingo/realtime-preempt/

It has been documented before (and accepted) that this patch turns Linux into
an RT kernel but considerably slows down the code paths, especially through the
I/O subsystem. I want to provide some additional measurements and seek opinions
on whether it might ever be possible to improve on this situation.

In my tests I used 20 3 GHz Intel Xeon PCs on an isolated gigabit network.
One of the nodes runs a "monitor" process that listens to incoming UDP packets
from the other 19 nodes. Each node sends approximately 2000 UDP packets/sec to
the monitor process, for a total of about 38,000 incoming UDP packets/sec.
These UDP packets are small, with an application payload of ~10 bytes, for a
total bandwidth usage of less than 4 Mbits/sec at the application level and
less than 15 Mbits/sec counting all headers. (Total bandwidth usage is not
high, but a large number of packets is coming in.) The monitor process does
some fairly simple processing per packet.

I measured the CPU usage of the "monitor" process when the testbed was used
with two different operating systems. The monitor process is the "nalive.p"
process in the "top" output below. The CPU load is fairly stable, and "top"
gives the following information:

::::::::::::::
top:  2.6.12-1.1390_FC4    # STANDARD KERNEL
::::::::::::::
top - 14:34:39 up  2:32,  2 users,  load average: 0.10, 0.05, 0.01
Tasks:  56 total,   2 running,  54 sleeping,   0 stopped,   0 zombie
top - 14:35:32 up  2:33,  2 users,  load average: 0.11, 0.06, 0.01
Tasks:  56 total,   2 running,  54 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.4% us,  7.0% sy,  0.0% ni, 80.8% id,  0.2% wa,  7.0% hi,  3.6% si
Mem:   2076008k total,   100292k used,  1975716k free,    16192k buffers
Swap:   128512k total,        0k used,   128512k free,    50376k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4823 root     -66   0 22712 2236 1484 S  8.4  0.1   0:37.74 nalive.p
 4860 gthaker   16   0  7396 2380 1904 R  0.2  0.1   0:00.04 sshd
    1 root      16   0  1748  572  492 S  0.0  0.0   0:01.06 init
    2 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
    3 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 events/0


::::::::::::::
top:  2.6.15-rt15-smp.out   # REAL_TIME KERNEL
::::::::::::::
node0> top
top - 09:52:48 up  1:47,  3 users,  load average: 0.91, 1.05, 1.02
Tasks:  98 total,   1 running,  97 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.5% us, 41.8% sy,  0.0% ni, 55.6% id,  0.1% wa,  0.0% hi,  0.0% si
Mem:   2058608k total,    88104k used,  1970504k free,     9072k buffers
Swap:   128512k total,        0k used,   128512k free,    39208k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2906 root     -66   0 18624 2244 1480 S 41.4  0.1  27:11.21 nalive.p
    6 root     -91   0     0    0    0 S 32.3  0.0  21:04.53 softirq-net-rx/
 1379 root     -40  -5     0    0    0 S 14.5  0.0   9:54.76 IRQ 23
  400 root      15   0     0    0    0 S  0.2  0.0   0:00.13 kjournald
    1 root      16   0  1740  564  488 S  0.0  0.0   0:04.03 init

The %CPU is at 8% for the non-real-time, uniprocessor kernel, while it is at
least 41% (and may be as much as 41.4% + 32.3% + 14.5% = 88.2%) for the
real-time SMP kernel.


My question is this: How much improvement in raw efficiency is possible for
the real-time patches? We take a very long view, so if there is a belief that
in 5 years the penalty will be reduced from the 5-10x seen in this application
to less than 2x, that would be great. If we think this is about as well as can
be done, it helps to know that too.

There is nothing else going on on the machines; all code paths should be going
down the "happy path" with no contention or blocking. My naive view is that a
2x overhead is plausible, but 5-10x seems harder to understand. And this is not
a case of hitting some large non-preemptible region - real-time performance is
excellent - the question is why the code paths seem so "heavy".

Gautam Thaker


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4
  2006-02-23 19:55 ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4 Gautam H Thaker
@ 2006-02-23 20:15 ` Benjamin LaHaise
  2006-02-23 20:58 ` Ingo Molnar
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Benjamin LaHaise @ 2006-02-23 20:15 UTC (permalink / raw)
  To: Gautam H Thaker; +Cc: linux-kernel, Ingo Molnar

On Thu, Feb 23, 2006 at 02:55:56PM -0500, Gautam H Thaker wrote:
> It has been documented before (and accepted) that this patch turns Linux into
> an RT kernel but considerably slows down the code paths, especially through the
> I/O subsystem. I want to provide some additional measurements and seek opinions
> on whether it might ever be possible to improve on this situation.

32 bit kernel or 64 bit kernel?  What about profiling the system with 
oprofile?

		-ben
-- 
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here 
and they've asked us to stop the party."  Don't Email: <dont@kvack.org>.


* Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4
  2006-02-23 19:55 ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4 Gautam H Thaker
  2006-02-23 20:15 ` Benjamin LaHaise
@ 2006-02-23 20:58 ` Ingo Molnar
  2006-02-23 21:06   ` Nish Aravamudan
  2006-02-24 12:11   ` Andrew Morton
  2006-02-24 16:52 ` Theodore Ts'o
  2006-02-28 19:27 ` Matt Mackall
  3 siblings, 2 replies; 17+ messages in thread
From: Ingo Molnar @ 2006-02-23 20:58 UTC (permalink / raw)
  To: Gautam H Thaker; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1257 bytes --]


* Gautam H Thaker <gthaker@atl.lmco.com> wrote:

> ::::::::::::::
> top:  2.6.15-rt15-smp.out   # REAL_TIME KERNEL
> ::::::::::::::

>  2906 root     -66   0 18624 2244 1480 S 41.4  0.1  27:11.21 nalive.p
>     6 root     -91   0     0    0    0 S 32.3  0.0  21:04.53 softirq-net-rx/
>  1379 root     -40  -5     0    0    0 S 14.5  0.0   9:54.76 IRQ 23

One effect of the -rt kernel is that it shows IRQ load explicitly - 
while the stock kernel can 'hide' it because there interrupts run 
'atomically', making it hard to measure the true system overhead. The 
-rt kernel will likely show more overhead, but i'd not expect this 
amount of overhead.

To figure out the true overhead of both kernels, could you try the 
attached loop_print_thread.c code, and run it on an idle non-rt kernel, 
an idle -rt kernel, a busy non-rt kernel and a busy -rt kernel, and 
send me the typical/average loops/sec value you are getting?

Furthermore, there have been some tasklet related fixes in 2.6.15-rt17, 
which maybe could improve this workload. Maybe ...

Also, would there be some easy way for me to reproduce that workload?  
Possibly some .c code you could send that is easy to run on the server 
and the client to reproduce the guts of this workload?

	Ingo

[-- Attachment #2: loop_print_thread.c --]
[-- Type: text/plain, Size: 2095 bytes --]


#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <pthread.h>
#include <unistd.h>

/* read the TSC via the "=A" (edx:eax) constraint: 32-bit x86 only */
#define rdtscll(val) \
     __asm__ __volatile__ ("rdtsc;" : "=A" (val))

#define SECS 3ULL

volatile unsigned int count_array[1000] __attribute__((aligned(256)));
int atomic = 0;
unsigned long long delta = 0;

void *loop(void *arg)
{
	unsigned long long start, now, mhz = 525000000, limit = mhz * SECS,
		min = -1ULL, tmp;
	volatile unsigned int *count, offset = (unsigned long)arg;
	int j;

	printf("offset: %u (atomic: %d).\n", offset, atomic);
	count = (void *)count_array + offset;

	if (!arg) {
		for (j = 0; j < 10; j++) {
			limit = mhz/10;
			*count = 0;
			rdtscll(start);
			for (;;) {
				(*count)++;
				rdtscll(now);
				if (now - start > limit)
					break;
			}
			rdtscll(now);
			tmp = (now-start)/(*count);
			if (tmp < min)
				min = tmp;
		}
	printf("delta: %llu\n", min);
		delta = min;
	} else
		while (!delta)
			usleep(100000);
	limit = mhz*SECS;

repeat:
	*count = 0;
	rdtscll(start);
	if (atomic)
		for (;;) {
			asm ("lock; incl %0" : "=m" (*count) : "m" (*count));
			rdtscll(now);
			if (now - start > limit)
				break;
		}
	else
		for (;;) {
			(*count)++;
			rdtscll(now);
			if (now - start > limit)
				break;
		}
	printf("speed: %llu loops (%llu cycles per iteration).\n", (*count)/SECS, (limit/(*count)-delta)); fflush(stdout);
	goto repeat;
}

int main (int argc, char **argv)
{
	unsigned int nr_threads, i, ret, offset = 0;
	pthread_t *t;

	if (argc != 2 && argc != 3 && argc != 4) {
usage:
		printf("usage: loop_print2 <nr threads> [<counter offset>] [<atomic>]\n");
		exit(-1);
	}
	nr_threads = atol(argv[1]);
	if (!nr_threads)
		goto usage;
	t = calloc(nr_threads, sizeof(*t));
	if (argc >= 3)
		offset = atol(argv[2]);
	if (offset < sizeof(unsigned int))
		offset = sizeof(unsigned int);
	atomic = 0;
	if (argc >= 4) {
		atomic = atol(argv[3]);
		printf("a: %d\n", atomic);
	}

	for (i = 1; i < nr_threads; i++) {
		ret = pthread_create (t+i, NULL, loop,
				(void *)(unsigned long)(i*offset));
		if (ret)
			exit(-1);
	}
	loop((void *)0);

	return 0;
}


* Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4
  2006-02-23 20:58 ` Ingo Molnar
@ 2006-02-23 21:06   ` Nish Aravamudan
  2006-02-23 21:08     ` Ingo Molnar
  2006-02-24 12:11   ` Andrew Morton
  1 sibling, 1 reply; 17+ messages in thread
From: Nish Aravamudan @ 2006-02-23 21:06 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Gautam H Thaker, linux-kernel

On 2/23/06, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Gautam H Thaker <gthaker@atl.lmco.com> wrote:
>
> > ::::::::::::::
> > top:  2.6.15-rt15-smp.out   # REAL_TIME KERNEL
> > ::::::::::::::
>
> >  2906 root     -66   0 18624 2244 1480 S 41.4  0.1  27:11.21 nalive.p
> >     6 root     -91   0     0    0    0 S 32.3  0.0  21:04.53 softirq-net-rx/
> >  1379 root     -40  -5     0    0    0 S 14.5  0.0   9:54.76 IRQ 23
>
> One effect of the -rt kernel is that it shows IRQ load explicitly -
> while the stock kernel can 'hide' it because there interrupts run
> 'atomically', making it hard to measure the true system overhead. The
> -rt kernel will likely show more overhead, but i'd not expect this
> amount of overhead.
>
> To figure out the true overhead of both kernels, could you try the
> attached loop_print_thread.c code, and run it on an idle non-rt kernel,
> an idle -rt kernel, a busy non-rt kernel and a busy -rt kernel, and
> send me the typical/average loops/sec value you are getting?
>
> Furthermore, there have been some tasklet related fixes in 2.6.15-rt17,
> which maybe could improve this workload. Maybe ...

Would it make more sense to compare 2.6.15 and 2.6.15-rt17, as opposed
to 2.6.12-1.1390_FC4 and 2.6.15-rt17? Seems like the closer the two
kernels are, the easier it will be to isolate the differences.

Thanks,
Nish


* Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4
  2006-02-23 21:06   ` Nish Aravamudan
@ 2006-02-23 21:08     ` Ingo Molnar
  2006-02-23 21:14       ` Nish Aravamudan
  2006-02-24  8:03       ` Jan Engelhardt
  0 siblings, 2 replies; 17+ messages in thread
From: Ingo Molnar @ 2006-02-23 21:08 UTC (permalink / raw)
  To: Nish Aravamudan; +Cc: Gautam H Thaker, linux-kernel


* Nish Aravamudan <nish.aravamudan@gmail.com> wrote:

> Would it make more sense to compare 2.6.15 and 2.6.15-rt17, as opposed 
> to 2.6.12-1.1390_FC4 and 2.6.15-rt17? Seems like the closer the two 
> kernels are, the easier it will be to isolate the differences.

good point. I'd expect there to be similar 'top' output, but still worth 
doing for comparable results.

	Ingo


* Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4
  2006-02-23 21:08     ` Ingo Molnar
@ 2006-02-23 21:14       ` Nish Aravamudan
  2006-02-23 22:07         ` Esben Nielsen
  2006-02-24  8:03       ` Jan Engelhardt
  1 sibling, 1 reply; 17+ messages in thread
From: Nish Aravamudan @ 2006-02-23 21:14 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Gautam H Thaker, linux-kernel

On 2/23/06, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Nish Aravamudan <nish.aravamudan@gmail.com> wrote:
>
> > Would it make more sense to compare 2.6.15 and 2.6.15-rt17, as opposed
> > to 2.6.12-1.1390_FC4 and 2.6.15-rt17? Seems like the closer the two
> > kernels are, the easier it will be to isolate the differences.
>
> good point. I'd expect there to be similar 'top' output, but still worth
> doing for comparable results.

I'd also expect little difference (hopefully) -- although there's
always an off-chance something big changed somewhere and the problem
was fixed in mainline. Just makes the comparison clearer.

Thanks,
Nish


* Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4
  2006-02-23 21:14       ` Nish Aravamudan
@ 2006-02-23 22:07         ` Esben Nielsen
  0 siblings, 0 replies; 17+ messages in thread
From: Esben Nielsen @ 2006-02-23 22:07 UTC (permalink / raw)
  To: Nish Aravamudan; +Cc: Ingo Molnar, Gautam H Thaker, linux-kernel

When PREEMPT_RT has settled down, I propose that many of the irq handlers
be moved back to "raw" irq context. I am sure many of them are so short
that it won't increase latency. It is always a balance between needed
latency and performance. Basically, the rule is that all irq handlers
that run for less than the required latency bound, and fire rarely enough
not to take a significant part of the CPU load, should run in irq context.

Now the kernel hacker doesn't know how long various irq handlers run
on a specific piece of hardware, nor which latencies the application needs.
Therefore it has to be a config option per driver. The driver locks of
course also need to be changed from rt_lock to raw_spin_lock depending on
that option. Thus a macro framework is needed for picking the lock type
that fits the chosen irq-handler context.

For the issue here, I am pretty sure changing the ethernet driver from
running in task context to raw irq context will improve the performance.
What you need to measure as well is how it influences latencies. There
is a good chance you won't be able to measure any difference, because so
little work is actually done in irq context; most of the work is done
by the DMA controller.
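A per-driver option like that might look roughly as follows. This is purely a
sketch: the CONFIG_FOO_HARDIRQ symbol and the foo_lock_* names are invented for
illustration, and the exact raw-lock API spelling in the -rt tree may differ.

```c
/* Hypothetical per-driver choice of irq-handler context and matching
 * lock type.  CONFIG_FOO_HARDIRQ and all foo_* names are invented. */
#ifdef CONFIG_FOO_HARDIRQ
/* Handler stays in raw irq context, so it must take a raw
 * (non-sleeping) spinlock even on -rt. */
typedef raw_spinlock_t foo_lock_t;
#define foo_lock_irqsave(l, f)       raw_spin_lock_irqsave(l, f)
#define foo_unlock_irqrestore(l, f)  raw_spin_unlock_irqrestore(l, f)
#else
/* Handler runs in a thread; on -rt this spinlock_t becomes a
 * sleeping rt-mutex, which is fine in that context. */
typedef spinlock_t foo_lock_t;
#define foo_lock_irqsave(l, f)       spin_lock_irqsave(l, f)
#define foo_unlock_irqrestore(l, f)  spin_unlock_irqrestore(l, f)
#endif
```

The driver then uses foo_lock_t throughout, and the single config symbol
selects both the handler context and a lock type that is legal in it.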

Esben

On Thu, 23 Feb 2006, Nish Aravamudan wrote:

> On 2/23/06, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Nish Aravamudan <nish.aravamudan@gmail.com> wrote:
> >
> > > Would it make more sense to compare 2.6.15 and 2.6.15-rt17, as opposed
> > > to 2.6.12-1.1390_FC4 and 2.6.15-rt17? Seems like the closer the two
> > > kernels are, the easier it will be to isolate the differences.
> >
> > good point. I'd expect there to be similar 'top' output, but still worth
> > doing for comparable results.
> 
> I'd also expect little difference (hopefully) -- although there's
> always an off-chance something big changed somewhere and the problem
> was fixed in mainline. Just makes the comparison clearer.
> 
> Thanks,
> Nish



* Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4
  2006-02-23 21:08     ` Ingo Molnar
  2006-02-23 21:14       ` Nish Aravamudan
@ 2006-02-24  8:03       ` Jan Engelhardt
  1 sibling, 0 replies; 17+ messages in thread
From: Jan Engelhardt @ 2006-02-24  8:03 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Nish Aravamudan, Gautam H Thaker, linux-kernel

>> Would it make more sense to compare 2.6.15 and 2.6.15-rt17, as opposed 
>> to 2.6.12-1.1390_FC4 and 2.6.15-rt17? Seems like the closer the two 
>> kernels are, the easier it will be to isolate the differences.
>
>good point. I'd expect there to be similar 'top' output, but still worth 
>doing for comparable results.
>
I have seen this before too (with earlier -rt's), when MPlayer jumped from 
1.8% to about 10%. Maybe because it's using the rtc at 1024 Hz?


Jan Engelhardt
-- 


* Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4
  2006-02-23 20:58 ` Ingo Molnar
  2006-02-23 21:06   ` Nish Aravamudan
@ 2006-02-24 12:11   ` Andrew Morton
  2006-02-24 20:06     ` Gautam H Thaker
  1 sibling, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2006-02-24 12:11 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: gthaker, linux-kernel

Ingo Molnar <mingo@elte.hu> wrote:
>
> To figure out the true overhead of both kernels, could you try the 
>  attached loop_print_thread.c code
>

http://www.zip.com.au/~akpm/linux/#zc  <- better ;)


* Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4
  2006-02-23 19:55 ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4 Gautam H Thaker
  2006-02-23 20:15 ` Benjamin LaHaise
  2006-02-23 20:58 ` Ingo Molnar
@ 2006-02-24 16:52 ` Theodore Ts'o
  2006-02-24 19:25   ` Gautam H Thaker
  2006-02-28 19:27 ` Matt Mackall
  3 siblings, 1 reply; 17+ messages in thread
From: Theodore Ts'o @ 2006-02-24 16:52 UTC (permalink / raw)
  To: Gautam H Thaker; +Cc: linux-kernel, Ingo Molnar

On Thu, Feb 23, 2006 at 02:55:56PM -0500, Gautam H Thaker wrote:
> The real-time patches at the URL below do a great job of endowing Linux with
> real-time capabilities.
> 
> http://people.redhat.com/mingo/realtime-preempt/

Gautam,

#1) Can you publish the code you used in your tests?

#2) Can you post your .config file?  In particular, did you have any
of the latency measurement options or other debugging options?  

Regards,

					- Ted


* Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4
  2006-02-24 16:52 ` Theodore Ts'o
@ 2006-02-24 19:25   ` Gautam H Thaker
  0 siblings, 0 replies; 17+ messages in thread
From: Gautam H Thaker @ 2006-02-24 19:25 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Gautam H Thaker, linux-kernel, Ingo Molnar

Theodore Ts'o wrote:
> On Thu, Feb 23, 2006 at 02:55:56PM -0500, Gautam H Thaker wrote:
> 
>>The real-time patches at the URL below do a great job of endowing Linux with
>>real-time capabilities.
>>
>>http://people.redhat.com/mingo/realtime-preempt/
> 
> 
> Gautam,
> 
> #1) Can you publish the code you used in your tests?

This may not be easy for me, but I will try to get corporate approval(s).
Basically, the process that is, at least according to "top", showing ~5x
increased CPU usage is receiving very short UDP packets over a gigabit
interface at a rate of about 38,000 per second. The UDP packets are small,
and according to "/sbin/ifconfig" there are no errors, drops, overruns,
frame or carrier errors, or collisions. (It is an isolated network of 20
PC3000s (3 GHz Xeon processors) at www.emulab.net.)

> 
> #2) Can you post your .config file?  In particular, did you have any
> of the latency measurement options or other debugging options?  

The config file I had used to build the "RT" kernel can be found at:

http://www.atl.external.lmco.com/projects/QoS/config.2.6.15-rt15-smp

I had tried to have all debug options off.

> 
> Regards,
> 
> 					- Ted


Gautam

-- 

Gautam H. Thaker
Distributed Processing Lab; Lockheed Martin Adv. Tech. Labs
3 Executive Campus; Cherry Hill, NJ 08002
856-792-9754, fax 856-792-9925  email: gthaker@atl.lmco.com


* Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4
  2006-02-24 12:11   ` Andrew Morton
@ 2006-02-24 20:06     ` Gautam H Thaker
  2006-02-24 20:31       ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: Gautam H Thaker @ 2006-02-24 20:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ingo Molnar, gautam.h.thaker, linux-kernel

Andrew Morton wrote:
> Ingo Molnar <mingo@elte.hu> wrote:
> 
>>To figure out the true overhead of both kernels, could you try the 
>> attached loop_print_thread.c code

> http://www.zip.com.au/~akpm/linux/#zc  <- better ;)

Andrew,

I read the README for the "zc" tests. I hope Ingo can opine on which may be
the better test. Also, I assume that I can run "zcs" and "zcc" on the same
machine. I would do the tests with "send" instead of "sendfile".

I also have some other test data. The graphical summary result can be viewed
at this link:

http://www.atl.external.lmco.com/projects/QoS/LM_ATL_MW_Comparator_7920.png

In these tests I used a single dual-processor 3 GHz Intel Xeon machine with 4
different kernels, all based on 2.6.14:

2.6.14              Uniprocessor kernel
2.6.14-rt22         Uniprocessor kernel w/ RT patches
2.6.14-smp          SMP kernel
2.6.14-rt22-smp     SMP kernel w/ RT patches.


The test is similar to the "zcs"/"zcc" tests. In my tests a client process
opens a TCP connection to the server process (all on the same machine) and
sends it 10,000,000 messages of sizes 4 bytes, 8 bytes, 16 bytes, ...,
32 Kbytes, 64 Kbytes. The server sends back a 1-byte reply. The client
measures roundtrip latencies, and the graphic shows the means. Since the
measurements are taken over so many samples, I believe the large differences
in mean latencies capture the relative CPU consumption of the various
kernels. (This being loopback, there are no NIC card issues or the like.)
One notices a 3:1 ratio here from the uniprocessor, non-RT kernel to the
SMP-RT kernel. The RT kernel has nice real-time properties, and there is a
lot of pressure in our systems to use the SMP hardware of the multicore
machines; in some cases we can even live with a 3x slowdown (since real
applications do more than just I/O). But when I started to see 5x (or more)
in my newer tests, I thought I would at least post something.

I suspect that the "zcs"/"zcc" tests would show pretty much the same
conclusions as this graphic.

Gautam H. Thaker
Distributed Processing Lab; Lockheed Martin Adv. Tech. Labs
3 Executive Campus; Cherry Hill, NJ 08002
856-792-9754, fax 856-792-9925  email: gthaker@atl.lmco.com


* Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4
  2006-02-24 20:06     ` Gautam H Thaker
@ 2006-02-24 20:31       ` Andrew Morton
  2006-02-24 20:44         ` Gautam H Thaker
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2006-02-24 20:31 UTC (permalink / raw)
  To: Gautam H Thaker; +Cc: mingo, gautam.h.thaker, linux-kernel

Gautam H Thaker <gthaker@atl.lmco.com> wrote:
>
> > http://www.zip.com.au/~akpm/linux/#zc  <- better ;)
> 
>  Andrew,
> 
>  I read the README for the "zc" tests. I hope Ingo can opine on which may be
>  the better test. Also, I assume that I can run "zcs" and "zcc" on the same
>  machine. I would do the tests with "send" instead of "sendfile".

Oh.  I don't actually remember what zc does.  I was actually referring to
`cyclesoak', which has proven to be a pretty accurate (or at least,
sensitive and repeatable) way of determining overall per-CPU system load.


* Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4
  2006-02-24 20:31       ` Andrew Morton
@ 2006-02-24 20:44         ` Gautam H Thaker
  0 siblings, 0 replies; 17+ messages in thread
From: Gautam H Thaker @ 2006-02-24 20:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Gautam H Thaker, mingo, linux-kernel

Andrew Morton wrote:
> Gautam H Thaker <gthaker@atl.lmco.com> wrote:
> 
>>>http://www.zip.com.au/~akpm/linux/#zc  <- better ;)
>>
>> Andrew,
>>
>> I read the README for the "zc" tests. I hope Ingo can opine on which may be
>> the better test. Also, I assume that I can run "zcs" and "zcc" on the same
>> machine. I would do the tests with "send" instead of "sendfile".
> 
> 
> Oh.  I don't actually remember what zc does.  I was actually referring to
> `cyclesoak', which has proven to be a pretty accurate (or at least,
> sensitive and repeatable) way of determining overall per-CPU system load.

Yes, I should have been clearer. I meant that perhaps I should use the 4
combinations of OS configs (non-RT/RT x UniProc/SMP) with zc and cyclesoak
rather than run a 20-node test, but I believe I will need many nodes sending
to my one "monitor" node to reach this high packet receive rate of about
38,000/second. Lower rates involving only a single machine should also be
capable of revealing conclusively that RT-SMP kernels are some factor heavier
than the non-RT uniprocessor kernel. Anyway, I will do the tests.

-- 

Gautam H. Thaker
Distributed Processing Lab; Lockheed Martin Adv. Tech. Labs
3 Executive Campus; Cherry Hill, NJ 08002
856-792-9754, fax 856-792-9925  email: gthaker@atl.lmco.com


* Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4
  2006-02-23 19:55 ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4 Gautam H Thaker
                   ` (2 preceding siblings ...)
  2006-02-24 16:52 ` Theodore Ts'o
@ 2006-02-28 19:27 ` Matt Mackall
  2006-02-28 22:19   ` Gautam H Thaker
  3 siblings, 1 reply; 17+ messages in thread
From: Matt Mackall @ 2006-02-28 19:27 UTC (permalink / raw)
  To: Gautam H Thaker; +Cc: linux-kernel, Ingo Molnar

On Thu, Feb 23, 2006 at 02:55:56PM -0500, Gautam H Thaker wrote:
> The real-time patches at the URL below do a great job of endowing Linux with
> real-time capabilities.
> 
> http://people.redhat.com/mingo/realtime-preempt/
> 
> It has been documented before (and accepted) that this patch turns Linux into
> an RT kernel but considerably slows down the code paths, especially through the
> I/O subsystem. I want to provide some additional measurements and seek opinions
> on whether it might ever be possible to improve on this situation.

Are you using the SLAB or SLOB allocator in the -rt kernel?

-- 
Mathematics is the supreme nostalgia of our time.


* Re: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4
  2006-02-28 19:27 ` Matt Mackall
@ 2006-02-28 22:19   ` Gautam H Thaker
  0 siblings, 0 replies; 17+ messages in thread
From: Gautam H Thaker @ 2006-02-28 22:19 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Gautam H Thaker, linux-kernel, Ingo Molnar

Matt Mackall wrote:
> On Thu, Feb 23, 2006 at 02:55:56PM -0500, Gautam H Thaker wrote:
> 
>>The real-time patches at the URL below do a great job of endowing Linux with
>>real-time capabilities.
>>
>>http://people.redhat.com/mingo/realtime-preempt/
>>
>>It has been documented before (and accepted) that this patch turns Linux into
>>an RT kernel but considerably slows down the code paths, especially through the
>>I/O subsystem. I want to provide some additional measurements and seek opinions
>>on whether it might ever be possible to improve on this situation.
> 
> 
> Are you using the SLAB or SLOB allocator in the -rt kernel?

lake> grep SL config.2.6.15-rt15-smp
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_SLAB=y
# CONFIG_SLOB is not set




-- 

Gautam H. Thaker
Distributed Processing Lab; Lockheed Martin Adv. Tech. Labs
3 Executive Campus; Cherry Hill, NJ 08002
856-792-9754, fax 856-792-9925  email: gthaker@atl.lmco.com


* RE: ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4
@ 2006-07-11 18:08 Jonathan Walsh
  0 siblings, 0 replies; 17+ messages in thread
From: Jonathan Walsh @ 2006-07-11 18:08 UTC (permalink / raw)
  To: linux-kernel; +Cc: Gautam H. Thaker, mingo

As a follow-up to previous emails (Gautam Thaker, Ingo Molnar, Ted Ts'o,
et al.) on the subject of large CPU overhead in the RT kernel under heavy
network load, I ran the following test in order to get more reasonable data.
I have 19 nodes with 20 "virtual" node processes each, sending UDP messages
to a single host at a rate of 100 Hz, for 38,000 packets per second in total.
Using cyclesoak to determine CPU usage (over 240 samples, 1 sample per
second), I found the following results:
 
RT kernel: linux-2.6.17-rt1-uni
Mean: 48.9%
Variance: 5.91
Standard Deviation: 2.43
 
Standard kernel: Standard Fedora Core 4
Mean: 23.2%
Variance: 0.237
Standard Deviation: 0.4867
 
Thus I found the average CPU load on the RT kernel to be 2.11 times that of the standard kernel. Hopefully this information will be of some use.

-Jonathan Walsh
Distributed Processing Lab; Lockheed Martin Adv. Tech. Labs



end of thread, other threads:[~2006-07-11 18:08 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-02-23 19:55 ~5x greater CPU load for a networked application when using 2.6.15-rt15-smp vs. 2.6.12-1.1390_FC4 Gautam H Thaker
2006-02-23 20:15 ` Benjamin LaHaise
2006-02-23 20:58 ` Ingo Molnar
2006-02-23 21:06   ` Nish Aravamudan
2006-02-23 21:08     ` Ingo Molnar
2006-02-23 21:14       ` Nish Aravamudan
2006-02-23 22:07         ` Esben Nielsen
2006-02-24  8:03       ` Jan Engelhardt
2006-02-24 12:11   ` Andrew Morton
2006-02-24 20:06     ` Gautam H Thaker
2006-02-24 20:31       ` Andrew Morton
2006-02-24 20:44         ` Gautam H Thaker
2006-02-24 16:52 ` Theodore Ts'o
2006-02-24 19:25   ` Gautam H Thaker
2006-02-28 19:27 ` Matt Mackall
2006-02-28 22:19   ` Gautam H Thaker
  -- strict thread matches above, loose matches on Subject: below --
2006-07-11 18:08 Jonathan Walsh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox