public inbox for linux-kernel@vger.kernel.org
From: Eric Dumazet <dada1@cosmosbay.com>
To: dipankar@in.ibm.com
Cc: "David S. Miller" <davem@davemloft.net>,
	linux-kernel@vger.kernel.org, torvalds@osdl.org, akpm@osdl.org
Subject: Re: [PATCH] reorder struct files_struct
Date: Thu, 15 Sep 2005 19:11:26 +0200
Message-ID: <4329AB3E.1070904@cosmosbay.com>
In-Reply-To: <20050915093518.GB5168@in.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 5949 bytes --]

Dipankar Sarma wrote:
> On Thu, Sep 15, 2005 at 08:17:40AM +0200, Eric Dumazet wrote:
>>The point is that we gain nothing in this case for 32-bit platforms, but 
>>we gain something on 64-bit platforms. And for apps using more than 
> 
> 
> I am not sure about that. IIRC, x86_64 has a 128-byte L1 cacheline.
> So, count, fdt, fdtab, close_on_exec_init and open_fds_init would
> all fit into one cache line. And close_on_exec_init will get updated
> on open(). Also, most apps will not likely have more than the
> default # of fds, it might not be a good idea to optimize for
> that case.

x86_64 has a 64-byte L1 cache line, at least on AMD CPUs.
SMP P3 machines are quite common, and they have a 32-byte L1 cache line.

*** WARNING *** BUG **** :

In the meantime I discovered that sizeof(fd_set) is 128 bytes!
I thought it was only sizeof(long).

This is a huge waste of 248 bytes for most apps (apps that open fewer than 32
or 64 files).

>>
>>Moving next_fd from 'struct fdtable' to 'struct files_struct' is also a win 
>>for 64-bit platforms since sizeof(struct fdtable) becomes 64: a nice power 
>>of two, so 64 bytes are allocated instead of 128.
> 
> 
> Can you benchmark this on a higher end SMP/NUMA system ?

Well, I can benchmark on a dual Xeon 2GHz machine (Dell PowerEdge 1600SC).
That's 2 physical CPUs, four logical CPUs (thanks to HyperThreading).
Not exactly a higher-end NUMA system, unfortunately.


Here are the results:

linux-2.6.14-rc1

# ./bench -t 1 -l 100 -s   # one thread, small fdset
1 threads, 100.005352 seconds, work_done=8429730
# ./bench -t 1 -l 100      # one thread, big fdset
1 threads, 100.006087 seconds, work_done=8343664

# ./bench -t 1 -l 100 -s -i  # one thread plus one idle thread to force locks
1 threads, 100.007956 seconds, work_done=6786008
# ./bench -t 1 -l 100 -i # one thread + idle, bigfdset
1 threads, 100.005044 seconds, work_done=6791259

# ./bench -t 2 -l 100 -s # two threads, small fdset
2 threads, 100.002860 seconds, work_done=11034805
# ./bench -t 2 -l 100 # two threads, big fdset
2 threads, 100.006046 seconds, work_done=11063804

# ./bench -t 2 -l 100 -s -a 5 # force affinity to two phys CPUS
2 threads, 100.004547 seconds, work_done=10825310
# ./bench -t 2 -l 100 -a 5
2 threads, 100.006288 seconds, work_done=11273778

# ./bench -t 4 -l 100 -s # Four threads, small fdset
4 threads, 100.007234 seconds, work_done=15061795
# ./bench -t 4 -l 100 # Four threads, big fdset
4 threads, 100.007620 seconds, work_done=14811832


linux-2.6.14-rc1 + patch :


# ./bench -t 1 -l 100 -s  # one thread, small fdset
1 threads, 100.005759 seconds, work_done=8406981 (~same)
# ./bench -t 1 -l 100     # one thread, big fdset
1 threads, 100.006887 seconds, work_done=8350681 (~same)

# ./bench -t 1 -l 100 -s -i # one thread plus one idle thread to force locks
1 threads, 100.005829 seconds, work_done=6858520 (1% better)
# ./bench -t 1 -l 100 -i # one thread + idle, bigfdset
1 threads, 100.007902 seconds, work_done=6847941 (~same)

# ./bench -t 2 -l 100 -s # two threads, small fdset
2 threads, 100.005877 seconds, work_done=11257165 (2% better)
# ./bench -t 2 -l 100 # two threads, big fdset
2 threads, 100.005561 seconds, work_done=11520262 (4% better)

# ./bench -t 2 -l 100 -s -a 5 # force affinity to two phys CPUS
2 threads, 100.006744 seconds, work_done=11505449 (6% better)
# ./bench -t 2 -l 100 -a 5
2 threads, 100.006706 seconds, work_done=11688051 (3% better)

# ./bench -t 4 -l 100 -s # Four threads, small fdset
4 threads, 100.007496 seconds, work_done=15556770 (3% better)
# ./bench -t 4 -l 100 # Four threads, big fdset
4 threads, 100.009882 seconds, work_done=16145618 (9% better)


linux-2.6.14-rc1 + patch + the two embedded fd_sets each replaced by a long:
(this change should also speed up fork())

struct fdtable {
         unsigned int max_fds;
         int max_fdset;
         struct file ** fd;      /* current fd array */
         fd_set *close_on_exec;
         fd_set *open_fds;
         struct rcu_head rcu;
         struct files_struct *free_files;
         struct fdtable *next;
};

/*
  * Open file table structure
  */
struct files_struct {
/* read mostly part */
         atomic_t count;
         struct fdtable *fdt;
         struct fdtable fdtab;
/* written part */
         spinlock_t file_lock ____cacheline_aligned_in_smp;
         int next_fd;
         unsigned long close_on_exec_init;
         unsigned long open_fds_init;
         struct file * fd_array[NR_OPEN_DEFAULT];
};



# grep files_cache /proc/slabinfo
files_cache           53    195    256   15    1 : tunables  120   60    8 : slabdata     13     13      0
  (256 bytes used instead of 512 for files_cache objects)

# ./bench -t 1 -l 100 -s  # one thread, small fdset
1 threads, 100.007298 seconds, work_done=8413167 (~same)
# ./bench -t 1 -l 100     # one thread, big fdset
1 threads, 100.006007 seconds, work_done=8441197 (1% better)

# ./bench -t 1 -l 100 -s -i # one thread plus one idle thread to force locks
1 threads, 100.005101 seconds, work_done=6870893 (1% better)
# ./bench -t 1 -l 100 -i # one thread + idle, bigfdset
1 threads, 100.005285 seconds, work_done=6852314 (~same)

# ./bench -t 2 -l 100 -s # two threads, small fdset
2 threads, 100.007029 seconds, work_done=11424646 (3.5 % better)
# ./bench -t 2 -l 100 # two threads, big fdset
2 threads, 100.006128 seconds, work_done=11634769 (5% better)

# ./bench -t 2 -l 100 -s -a 5 # force affinity to two phys CPUS
2 threads, 100.008100 seconds, work_done=11408030 (5% better)
# ./bench -t 2 -l 100 -a 5
2 threads, 100.004221 seconds, work_done=11686082 (3% better)

# ./bench -t 4 -l 100 -s # Four threads, small fdset
4 threads, 100.008243 seconds, work_done=15818419 (5% better)
# ./bench -t 4 -l 100 # Four threads, big fdset
4 threads, 100.008279 seconds, work_done=16352921 (10% better)


I suspect that NUMA machines will get more interesting results...

Eric

Attached bench source code.
Compile : gcc -O2 -o bench bench.c -lpthread


[-- Attachment #2: bench.c --]
[-- Type: text/plain, Size: 3304 bytes --]

/*
 * Bench program to exercise multiple threads using open()/close()/read()/lseek() calls.
 * Usage :
 *    bench [-i] [-s] [-a affinity_mask] [-t XX] [-l len]
 *   XX : number of threads
 *   len : bench time in seconds
 *   -s : small fdset : try to use the embedded struct fdtable
 *   -i : add one idle thread, to force locking
 *   -a : CPU affinity bitmask
 */
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>
#include <string.h>
#include <sched.h>
#include <errno.h>
#include <signal.h>

static pthread_mutex_t  mut = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

static int sflag; /* small fdset */
static volatile sig_atomic_t end_prog; /* set from the SIGALRM handler */
static unsigned long work_done;
static char onekilo[1024];

static void catch_alarm(int sig)
{
	end_prog = 1;
}

static void *perform_work(void *arg)
{
	int fd, i;
	unsigned long units = 0;
	int fds[64];
	char filename[64];
	char c;
	strcpy(filename, "/tmp/benchXXXXXX");
	fd = mkstemp(filename);
	write(fd, onekilo, sizeof(onekilo));
	close(fd);

	/* force this program to open more than 64 fds */
	if (!sflag)
		for (i = 0 ; i < 64 ; i++)
			fds[i] = open("/dev/null", O_RDONLY);

	while (!end_prog) {
		fd = open(filename, O_RDONLY);
		read(fd, &c, 1);
		lseek(fd, 10, SEEK_SET);
		read(fd, &c, 1);
		lseek(fd, 20, SEEK_SET);
		read(fd, &c, 1);
		lseek(fd, 30, SEEK_SET);
		read(fd, &c, 1);
		lseek(fd, 40, SEEK_SET);
		read(fd, &c, 1);
		close(fd);
		units++;
	}
	unlink(filename);
	if (!sflag)
		for (i = 0 ; i < 64 ; i++)
			close(fds[i]);
	pthread_mutex_lock(&mut);
	work_done += units;
	pthread_mutex_unlock(&mut);
	return 0;
}

static void *idle_thread(void *arg)
{
	/* Sleep on the condition variable until the benchmark is over. */
	pthread_mutex_lock(&mut);
	while (!end_prog) {
		pthread_cond_wait(&cond, &mut);
	}
	pthread_mutex_unlock(&mut);
	return NULL;
}

static void usage(int code)
{
	fprintf(stderr, "Usage : bench [-i] [-s] [-a affinity_mask] [-t threads] [-l duration]\n");
	exit(code);
}

int main(int argc, char *argv[])
{
	int i, c;
	int nbthreads = 2;
	int iflag = 0;
	unsigned int length = 10;
	pthread_t *tid;
	struct sigaction sg;
	struct timeval t0, t1;
	long mask = 0;

	while ((c = getopt(argc, argv, "sit:l:a:")) != -1) {
		if (c == 't')
			nbthreads = atoi(optarg);
		else if (c == 'l')
			length = atoi(optarg);
		else if (c == 's')
			sflag = 1;
		else if (c == 'i')
			iflag = 1;
		else if (c == 'a')
			sscanf(optarg, "%li", &mask);
		else usage(1);
	}
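	if (mask != 0) {
		/*
		 * Current glibc declares sched_setaffinity(pid, cpusetsize, mask);
		 * build with -D_GNU_SOURCE to get the prototype.
		 */
		int res = sched_setaffinity(0, sizeof(mask), (void *)&mask);
		if (res != 0)
			fprintf(stderr, "sched_setaffinity(0x%lx)->%d errno=%d\n", mask, res, errno);
	}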


	tid = malloc(nbthreads*sizeof(pthread_t));
	gettimeofday(&t0, NULL);
	for (i = 1 ; i < nbthreads; i++)
		pthread_create(tid + i, NULL, perform_work, NULL);
	if (iflag)
		pthread_create(tid, NULL, idle_thread, NULL);
	memset(&sg, 0, sizeof(sg));
	sg.sa_handler = catch_alarm;
	sigaction(SIGALRM, &sg, NULL);
	alarm(length);
	perform_work(NULL);

	if (iflag) {
		pthread_cond_signal(&cond);
		pthread_join(tid[0], NULL);
	}
	for (i = 1 ; i < nbthreads; i++)
		pthread_join(tid[i], NULL); 

	gettimeofday(&t1, NULL);
	t1.tv_sec -= t0.tv_sec;
	t1.tv_usec -= t0.tv_usec;
	if (t1.tv_usec < 0) {
		t1.tv_usec += 1000000;
		t1.tv_sec--;
	}
	pthread_mutex_lock(&mut);
	printf("%d threads, %d.%06d seconds, work_done=%lu\n",
		nbthreads, (int)t1.tv_sec, (int)t1.tv_usec, work_done);
	pthread_mutex_unlock(&mut);
	return 0;
}
