* stat benchmark
@ 2008-04-24 20:59 Soeren Sandmann
From: Soeren Sandmann @ 2008-04-24 20:59 UTC (permalink / raw)
To: linux-kernel
At
http://www.daimi.au.dk/~sandmann/stat-benchmark.c
there is a simple program that will measure the time it takes to stat
every file in the current directory with a cold cache.
This is essentially what applications do in a number of common cases
such as "ls -l", nautilus opening a directory, or an "open file"
dialog being shown.
Unfortunately, performance of that operation kinda sucks. On my system
(ext3), it produces:
c-24-61-65-93:~% sudo ./a.out
Time to readdir(): 0.307671 s
Time to stat 2349 files: 8.203693 s
8 seconds is about 80 times slower than what a user perceives as
"instantly" and slow enough that we really should display a progress
bar if it can't be fixed.
So I am looking for ways to improve this.
Under the theory that disk seeks are killing us, one idea is to add a
'multistat' system call that would allow statting of many files at a
time, which would give the disk scheduler more to work with.
Possibly the same thing would need to be done for the getxattr
information.
Does this sound like a reasonable idea?
Thanks,
Soren
^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: stat benchmark
From: Carl Henrik Lunde @ 2008-04-24 21:42 UTC (permalink / raw)
To: Soeren Sandmann; +Cc: linux-kernel

On Thu, Apr 24, 2008 at 10:59 PM, Soeren Sandmann <sandmann@daimi.au.dk> wrote:
[ about programs reading all inodes after readdir ]
> Unfortunately, performance of that operation kinda sucks. On my system
> (ext3), it produces:
>
>   c-24-61-65-93:~% sudo ./a.out
>   Time to readdir(): 0.307671 s
>   Time to stat 2349 files: 8.203693 s
>
> 8 seconds is about 80 times slower than what a user perceives as
> "instantly" and slow enough that we really should display a progress
> bar if it can't be fixed.
>
> So I am looking for ways to improve this.
>
> Under the theory that disk seeks are killing us, one idea is to add a
> 'multistat' system call that would allow statting of many files at a
> time, which would give the disk scheduler more to work with.

I have experimented with the same problem, and another idea is to
reorder the results from readdir, which has given me good results.
This works because:

- For most filesystems there is a high correlation between the inode
  number and the sector on the disk.
- Most programs, like your example, handle the files in the order they
  are returned from readdir.
- The time spent sorting is very small compared to the disk seeks.

There are several possible ways to implement this:

- reorder the dirents in the kernel for each getdents call
- reorder the dirents in user space, for example by running qsort in a
  libc wrapper
- in the file system, optimize the order before writing back a dirty
  directory

This does not only apply to programs that only stat files, but also to
programs that read them, such as file indexers, backups (tar), and
Nautilus getting thumbnails from JPEGs.

--
Carl Henrik

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: stat benchmark
From: Jan Engelhardt @ 2008-04-24 21:44 UTC (permalink / raw)
To: Soeren Sandmann; +Cc: linux-kernel

On Thursday 2008-04-24 22:59, Soeren Sandmann wrote:
>
>   http://www.daimi.au.dk/~sandmann/stat-benchmark.c
>
> there is a simple program that will measure the time it takes to stat
> every file in the current directory with a cold cache.
>
> This is essentially what applications do in a number of common cases
> such as "ls -l", nautilus opening a directory, or an "open file"
> dialog being shown.

Mh, somewhat. Some programs/libraries unnecessarily stat. No, nautilus
is excluded.

> Unfortunately, performance of that operation kinda sucks. On my system
> (ext3), it produces:
>
>   c-24-61-65-93:~% sudo ./a.out
>   Time to readdir(): 0.307671 s
>   Time to stat 2349 files: 8.203693 s
>
> 8 seconds is about 80 times slower than what a user perceives as
> "instantly" and slow enough that we really should display a progress
> bar if it can't be fixed.
>
> So I am looking for ways to improve this.
>
> Under the theory that disk seeks are killing us, one idea is to add a
> 'multistat' system call that would allow statting of many files at a
> time, which would give the disk scheduler more to work with.
>
> Possibly the same thing would need to be done for the getxattr
> information.
>
> Does this sound like a reasonable idea?

XFS for example already has bulkstat, but it is not really exported to
userspace :-/

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: stat benchmark
From: Justin Banks @ 2008-04-25 2:27 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: Soeren Sandmann, linux-kernel

Jan Engelhardt wrote:
> XFS for example already has bulkstat, but it is not really
> exported to userspace :-/

Perhaps this could be in VFS, and used if present?

-justinb

--
Justin Banks
BakBone Software
justinb@bakbone.com

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: stat benchmark
From: Christoph Hellwig @ 2008-04-25 7:01 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: Soeren Sandmann, linux-kernel

On Thu, Apr 24, 2008 at 11:44:04PM +0200, Jan Engelhardt wrote:
> XFS for example already has bulkstat, but it is not really
> exported to userspace :-/

It is available through an ugly ioctl. It doesn't really help in this
use case, though, because bulkstat iterates over all inodes of a given
filesystem, not over the inodes of a directory.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: stat benchmark
From: Theodore Tso @ 2008-04-25 19:48 UTC (permalink / raw)
To: Soeren Sandmann; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 964 bytes --]

On Thu, Apr 24, 2008 at 10:59:10PM +0200, Soeren Sandmann wrote:
>
> Under the theory that disk seeks are killing us, one idea is to add a
> 'multistat' system call that would allow statting of many files at a
> time, which would give the disk scheduler more to work with.

Why don't you try this version of your stat-benchmark first? If you
give it the -s option, it will sort the files by inode number first.
I think you will find this makes a significant difference.

If it works, something that would be really great is for someone to
make a generic library that could be used instead of readdir(). I have
something which works as an LD_PRELOAD, but apparently it's been
blowing up on 64-bit systems, and I haven't had time to debug it. It's
probably better to do it as a library which userspace applications
link against, anyway. Would you or someone you know be interested in
maybe taking this idea and running with it?
Regards,

- Ted

[-- Attachment #2: stat-benchmark.c --]
[-- Type: text/x-csrc, Size: 2821 bytes --]

#include <sys/time.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <dirent.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <getopt.h>

struct dirent_s {
    unsigned long long d_ino;
    long long d_off;
    unsigned short int d_reclen;
    unsigned char d_type;
    char *d_name;
};

static void
disaster (const char *what)
{
    fprintf (stderr, "%s failed: %s\n", what, strerror (errno));
    exit (1);
}

static void
dump_caches (void)
{
    int fd = open ("/proc/sys/vm/drop_caches", O_RDWR);

    if (fd < 0)
	disaster ("opening drop_caches");
    if (write (fd, "3", strlen ("3")) < 0)
	disaster ("writing drop_caches");
    if (close (fd) < 0)
	disaster ("closing drop_caches");
}

static int
ino_cmp (const void *a, const void *b)
{
    const struct dirent_s *ds_a = (const struct dirent_s *) a;
    const struct dirent_s *ds_b = (const struct dirent_s *) b;
    unsigned long long i_a = ds_a->d_ino; /* was unsigned int, which
					   * truncates 64-bit inode numbers */
    unsigned long long i_b = ds_b->d_ino;

    return (i_a > i_b) - (i_a < i_b);	/* was i_a - i_b, which can overflow */
}

static double
tv_to_sec (const struct timeval *tv)
{
    return tv->tv_sec + tv->tv_usec / 1000000.0;
}

static double
time_diff (const struct timeval *before, const struct timeval *after)
{
    return tv_to_sec (after) - tv_to_sec (before);
}

static int
pot (int n)
{
    int p = 1;
    while (p <= n)
	p *= 2;
    return p;
}

int
main (int argc, char **argv)
{
    DIR *dir = opendir (".");
    struct dirent *ent;
    struct timeval before;
    struct timeval after;
    struct dirent_s *ds = NULL;
    int n_files;
    int do_sort = 0;
    int drop_caches = 1;
    int i, c;

    while ((c = getopt (argc, argv, "sc")) != EOF) {
	switch (c) {
	case 's':
	    do_sort++;
	    break;
	case 'c':
	    drop_caches = 0;
	    break;
	default:
	    printf ("Usage: %s [-s]\n", argv[0]);
	}
    }

    /* Dump caches */
    if (drop_caches)
	dump_caches ();
    sleep (1);

    /* Read directory */
    errno = 0;
    gettimeofday (&before, NULL);
    n_files = 0;
    while ((ent = readdir (dir))) {
	ds = realloc (ds, sizeof (struct dirent_s) * pot (n_files));
	if (!ds)
	    disaster ("realloc");
	ds[n_files].d_name = strdup (ent->d_name);
	ds[n_files].d_ino = ent->d_ino;
	n_files++;
    }
    if (errno != 0)
	disaster ("readdir");
    gettimeofday (&after, NULL);
    printf ("Time to readdir(): %f s\n", time_diff (&before, &after));

    if (do_sort)
	qsort (ds, n_files, sizeof (struct dirent_s), ino_cmp);

    /* Stat all the files */
    gettimeofday (&before, NULL);
    for (i = 0; i < n_files; ++i) {
	struct stat statbuf;
	stat (ds[i].d_name, &statbuf);
    }
    gettimeofday (&after, NULL);
    printf ("Time to stat %d files: %f s\n", n_files,
	    time_diff (&before, &after));

    return 0;
}

[-- Attachment #3: spd_readdir.c --]
[-- Type: text/x-csrc, Size: 6551 bytes --]

/*
 * readdir accelerator
 *
 * (C) Copyright 2003, 2004 by Theodore Ts'o.
 *
 * Compile using the command:
 *
 * gcc -o spd_readdir.so -shared spd_readdir.c -ldl
 *
 * Use it by setting the LD_PRELOAD environment variable:
 *
 * export LD_PRELOAD=/usr/local/sbin/spd_readdir.so
 *
 * %Begin-Header%
 * This file may be redistributed under the terms of the GNU Public
 * License.
 * %End-Header%
 *
 */

#define ALLOC_STEPSIZE	100
#define MAX_DIRSIZE	0

#define DEBUG

#ifdef DEBUG
#define DEBUG_DIR(x)	{if (do_debug) { x; }}
#else
#define DEBUG_DIR(x)
#endif

#define _GNU_SOURCE
#define __USE_LARGEFILE64

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <errno.h>
#include <dlfcn.h>

struct dirent_s {
	unsigned long long d_ino;
	long long d_off;
	unsigned short int d_reclen;
	unsigned char d_type;
	char *d_name;
};

struct dir_s {
	DIR	*dir;
	int	num;
	int	max;
	struct dirent_s *dp;
	int	pos;
	int	fd;
	struct dirent ret_dir;
	struct dirent64 ret_dir64;
};

static int (*real_closedir)(DIR *dir) = 0;
static DIR *(*real_opendir)(const char *name) = 0;
static struct dirent *(*real_readdir)(DIR *dir) = 0;
static struct dirent64 *(*real_readdir64)(DIR *dir) = 0;
static off_t (*real_telldir)(DIR *dir) = 0;
static void (*real_seekdir)(DIR *dir, off_t offset) = 0;
static int (*real_dirfd)(DIR *dir) = 0;
static unsigned long max_dirsize = MAX_DIRSIZE;
static int num_open = 0;	/* was missing the 'int' */
#ifdef DEBUG
static int do_debug = 0;
#endif

static void setup_ptr()
{
	char *cp;

	real_opendir = dlsym(RTLD_NEXT, "opendir");
	real_closedir = dlsym(RTLD_NEXT, "closedir");
	real_readdir = dlsym(RTLD_NEXT, "readdir");
	real_readdir64 = dlsym(RTLD_NEXT, "readdir64");
	real_telldir = dlsym(RTLD_NEXT, "telldir");
	real_seekdir = dlsym(RTLD_NEXT, "seekdir");
	real_dirfd = dlsym(RTLD_NEXT, "dirfd");
	if ((cp = getenv("SPD_READDIR_MAX_SIZE")) != NULL) {
		max_dirsize = atol(cp);
	}
#ifdef DEBUG
	if (getenv("SPD_READDIR_DEBUG"))
		do_debug++;
#endif
}

static void free_cached_dir(struct dir_s *dirstruct)
{
	int i;

	if (!dirstruct->dp)
		return;

	for (i = 0; i < dirstruct->num; i++) {
		free(dirstruct->dp[i].d_name);
	}
	free(dirstruct->dp);
	dirstruct->dp = 0;
}

static int ino_cmp(const void *a, const void *b)
{
	const struct dirent_s *ds_a = (const struct dirent_s *) a;
	const struct dirent_s *ds_b = (const struct dirent_s *) b;
	ino_t i_a, i_b;

	i_a = ds_a->d_ino;
	i_b = ds_b->d_ino;

	if (ds_a->d_name[0] == '.') {
		if (ds_a->d_name[1] == 0)
			i_a = 0;
		else if ((ds_a->d_name[1] == '.') && (ds_a->d_name[2] == 0))
			i_a = 1;
	}
	if (ds_b->d_name[0] == '.') {
		if (ds_b->d_name[1] == 0)
			i_b = 0;
		else if ((ds_b->d_name[1] == '.') && (ds_b->d_name[2] == 0))
			i_b = 1;
	}

	return (i_a > i_b) - (i_a < i_b);	/* was i_a - i_b, which can
						 * overflow with 64-bit ino_t */
}

DIR *opendir(const char *name)
{
	DIR *dir;
	struct dir_s	*dirstruct;
	struct dirent_s	*ds, *dnew;
	struct dirent64	*d;
	struct stat	st;

	if (!real_opendir)
		setup_ptr();

	DEBUG_DIR(printf("Opendir(%s) (%d open)\n", name, num_open++));
	dir = (*real_opendir)(name);
	if (!dir)
		return NULL;

	dirstruct = malloc(sizeof(struct dir_s));
	if (!dirstruct) {
		(*real_closedir)(dir);
		errno = ENOMEM;		/* was -ENOMEM */
		return NULL;
	}
	dirstruct->num = 0;
	dirstruct->max = 0;
	dirstruct->dp = 0;
	dirstruct->pos = 0;
	dirstruct->dir = 0;
	dirstruct->fd = -1;	/* was left uninitialized on the
				 * direct-readdir path */

	if (max_dirsize && (stat(name, &st) == 0) &&
	    (st.st_size > max_dirsize)) {
		DEBUG_DIR(printf("Directory size %ld, using direct readdir\n",
				 st.st_size));
		dirstruct->dir = dir;
		return (DIR *) dirstruct;
	}

	while ((d = (*real_readdir64)(dir)) != NULL) {
		if (dirstruct->num >= dirstruct->max) {
			dirstruct->max += ALLOC_STEPSIZE;
			DEBUG_DIR(printf("Reallocating to size %d\n",
					 dirstruct->max));
			dnew = realloc(dirstruct->dp,
				       dirstruct->max * sizeof(struct dirent_s));
			/* was sizeof(struct dir_s), over-allocating */
			if (!dnew)
				goto nomem;
			dirstruct->dp = dnew;
		}
		ds = &dirstruct->dp[dirstruct->num++];
		ds->d_ino = d->d_ino;
		ds->d_off = d->d_off;
		ds->d_reclen = d->d_reclen;
		ds->d_type = d->d_type;
		if ((ds->d_name = malloc(strlen(d->d_name)+1)) == NULL) {
			dirstruct->num--;
			goto nomem;
		}
		strcpy(ds->d_name, d->d_name);
		DEBUG_DIR(printf("readdir: %lu %s\n",
				 (unsigned long) d->d_ino, d->d_name));
	}
	dirstruct->fd = dup((*real_dirfd)(dir));
	(*real_closedir)(dir);
	qsort(dirstruct->dp, dirstruct->num, sizeof(struct dirent_s), ino_cmp);
	return ((DIR *) dirstruct);
nomem:
	DEBUG_DIR(printf("No memory, backing off to direct readdir\n"));
	free_cached_dir(dirstruct);
	dirstruct->dir = dir;
	return ((DIR *) dirstruct);
}

int closedir(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;

	DEBUG_DIR(printf("Closedir (%d open)\n", --num_open));
	if (dirstruct->dir)
		(*real_closedir)(dirstruct->dir);

	if (dirstruct->fd >= 0)
		close(dirstruct->fd);
	free_cached_dir(dirstruct);
	free(dirstruct);
	return 0;
}

struct dirent *readdir(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;
	struct dirent_s	*ds;

	if (dirstruct->dir)
		return (*real_readdir)(dirstruct->dir);

	if (dirstruct->pos >= dirstruct->num)
		return NULL;

	ds = &dirstruct->dp[dirstruct->pos++];
	dirstruct->ret_dir.d_ino = ds->d_ino;
	dirstruct->ret_dir.d_off = ds->d_off;
	dirstruct->ret_dir.d_reclen = ds->d_reclen;
	dirstruct->ret_dir.d_type = ds->d_type;
	strncpy(dirstruct->ret_dir.d_name, ds->d_name,
		sizeof(dirstruct->ret_dir.d_name));

	return (&dirstruct->ret_dir);
}

struct dirent64 *readdir64(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;
	struct dirent_s	*ds;

	if (dirstruct->dir)
		return (*real_readdir64)(dirstruct->dir);

	if (dirstruct->pos >= dirstruct->num)
		return NULL;

	ds = &dirstruct->dp[dirstruct->pos++];
	dirstruct->ret_dir64.d_ino = ds->d_ino;
	dirstruct->ret_dir64.d_off = ds->d_off;
	dirstruct->ret_dir64.d_reclen = ds->d_reclen;
	dirstruct->ret_dir64.d_type = ds->d_type;
	strncpy(dirstruct->ret_dir64.d_name, ds->d_name,
		sizeof(dirstruct->ret_dir64.d_name));

	return (&dirstruct->ret_dir64);
}

off_t telldir(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;

	if (dirstruct->dir)
		return (*real_telldir)(dirstruct->dir);

	return ((off_t) dirstruct->pos);
}

void seekdir(DIR *dir, off_t offset)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;

	if (dirstruct->dir) {
		(*real_seekdir)(dirstruct->dir, offset);
		return;
	}

	dirstruct->pos = offset;
}

int dirfd(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;

	if (dirstruct->dir)
		return (*real_dirfd)(dirstruct->dir);

	return (dirstruct->fd);
}

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: stat benchmark
From: Soeren Sandmann @ 2008-04-27 23:29 UTC (permalink / raw)
To: Theodore Tso; +Cc: linux-kernel, alexl

Theodore Tso <tytso@mit.edu> writes:

> On Thu, Apr 24, 2008 at 10:59:10PM +0200, Soeren Sandmann wrote:
> >
> > Under the theory that disk seeks are killing us, one idea is to add a
> > 'multistat' system call that would allow statting of many files at a
> > time, which would give the disk scheduler more to work with.
>
> Why don't you try this version of your stat-benchmark first? If you
> give it the -s option, it will sort the files by inode number first.
> I think you will find this makes a significant difference.

Yeah, Carl suggested this as well. Sorting by inode is a major
improvement. The numbers are less stable, but consistently much lower:

  Time to readdir(): 0.238737 s
  Time to stat 2366 files: 1.338904 s

compared to

  Time to readdir(): 0.227599 s
  Time to stat 2366 files: 7.981752 s

Of course, 1.3 seconds is still far from instant, but it may be the
best we can get given the realities of ext3 disk layout.

One thing that surprised me is that lstat is slightly slower than
stat. With lstat() instead of stat(), I get:

  Time to stat 2366 files: 1.472115 s
  Time to readdir(): 1.735542 s

I naively thought that stat() was a superset of lstat(), but
apparently not.

> If it works, something that would be really great is for someone to
> make a generic library that could be used instead of readdir(). I
> have something which works as an LD_PRELOAD, but apparently it's been
> blowing up on 64-bit systems, and I haven't had time to debug it.
> It's probably better to do it as a library which userspace
> applications link against, anyway. Would you or someone you know be
> interested in maybe taking this idea and running with it?
After I read Carl's mail I looked at what glib - which is used by both
Nautilus and the gtk+ file open dialog - does, and in fact the latest
version does actually read chunks of 1000 dirents and sort them by
inode. The version I had installed when I wrote the benchmark just
stat'ed in readdir() order.

For a directory of ~2360 files, chunks of 1000 files is actually
surprisingly worse than statting all of the files at once:

  Time to stat 1000 files: 1.008735 s
  Time to stat 1000 files: 0.738936 s
  Time to stat 366 files: 0.217002 s

I guess this just shows that seeks really are pretty much all that
matters. Glib should maybe use a larger chunk size.

I don't know if a general library outside glib would be useful. It
seems that just telling people to "sort by inode before statting"
would be just as effective as telling them "use this optimized
library".

Thanks,
Soren

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: stat benchmark
From: Carl Henrik Lunde @ 2008-04-28 0:13 UTC (permalink / raw)
To: Soeren Sandmann; +Cc: Theodore Tso, linux-kernel, alexl

On Mon, Apr 28, 2008 at 1:29 AM, Soeren Sandmann <sandmann@daimi.au.dk> wrote:
[...]
> For a directory of ~2360 files, chunks of 1000 files is actually
> surprisingly worse than statting all of the files at once:
>
>   Time to stat 1000 files: 1.008735 s
>   Time to stat 1000 files: 0.738936 s
>   Time to stat 366 files: 0.217002 s
>
> I guess this just shows that seeks really are pretty much all that
> matters. Glib should maybe use a larger chunk size.

I agree; if I remember correctly, I did not find a directory on my
local disk where the best result was to sort a chunk instead of the
complete directory.

> I don't know if a general library outside glib would be useful. It
> seems that just telling people to "sort by inode before statting"
> would be just as effective as telling them "use this optimized
> library".

A library (like glib) could disable the feature for solid-state drives
and perhaps implement an alternative strategy for filesystems without
any inode/sector correlation. So I think we should tell people to use
glib. :-)

--
Carl Henrik

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: stat benchmark
From: Alexander Larsson @ 2008-04-28 19:41 UTC (permalink / raw)
To: Carl Henrik Lunde; +Cc: Soeren Sandmann, Theodore Tso, linux-kernel

On Mon, 2008-04-28 at 02:13 +0200, Carl Henrik Lunde wrote:
> On Mon, Apr 28, 2008 at 1:29 AM, Soeren Sandmann <sandmann@daimi.au.dk> wrote:
> [...]
> > For a directory of ~2360 files, chunks of 1000 files is actually
> > surprisingly worse than statting all of the files at once:
> >
> >   Time to stat 1000 files: 1.008735 s
> >   Time to stat 1000 files: 0.738936 s
> >   Time to stat 366 files: 0.217002 s
> >
> > I guess this just shows that seeks really are pretty much all that
> > matters. Glib should maybe use a larger chunk size.
>
> I agree; if I remember correctly, I did not find a directory on my
> local disk where the best result was to sort a chunk instead of the
> complete directory.

I don't think that is expected either. The reason for the chunking is
to avoid unlimited memory use on large directories.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: stat benchmark
From: Theodore Tso @ 2008-04-28 2:10 UTC (permalink / raw)
To: Soeren Sandmann; +Cc: linux-kernel, alexl

On Mon, Apr 28, 2008 at 01:29:52AM +0200, Soeren Sandmann wrote:
> Sorting by inode is a major improvement. The numbers are less stable,
> but consistently much lower:
>
>   Time to readdir(): 0.238737 s
>   Time to stat 2366 files: 1.338904 s
>
> compared to
>
>   Time to readdir(): 0.227599 s
>   Time to stat 2366 files: 7.981752 s
>
> Of course, 1.3 seconds is still far from instant, but it may be the
> best we can get given the realities of ext3 disk layout.

Out of curiosity, what was the directory that you were statting? If it
took you 1.3 seconds to stat 2366 files, the directory must have
inodes scattered all over the disk, or the disk must be very slow. On
my laptop disk, I can stat 9543 files in 1.1 seconds (from a Maildir
directory).

Also, why does the application need to stat all of the files? Is it
just to get the file type (i.e., regular file vs. directory)? If so,
maybe you can use the d_type field in the directory entry returned by
readdir().

> I don't know if a general library outside glib would be useful. It
> seems that just telling people to "sort by inode before statting"
> would be just as effective as telling them "use this optimized
> library".

Well, the question is what we would need to do in order to make it
really easy for people to drop that into their code. Programmers are
fundamentally lazy, after all, and if it's too much work to create an
interim data structure and then qsort it, they won't. But maybe the
glib interface is that convenient interface, and all we need to do is
change glib to sort with a much larger chunk size. We do need to get
similar changes into find, ls, and many other programs that might not
be so interested in linking against glib, though.
- Ted ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: stat benchmark 2008-04-24 20:59 stat benchmark Soeren Sandmann ` (2 preceding siblings ...) 2008-04-25 19:48 ` Theodore Tso @ 2008-04-27 22:40 ` Carl Henrik Lunde 2008-04-28 17:46 ` Zach Brown 2008-04-28 4:43 ` Ulrich Drepper 4 siblings, 1 reply; 17+ messages in thread From: Carl Henrik Lunde @ 2008-04-27 22:40 UTC (permalink / raw) To: Soeren Sandmann; +Cc: linux-kernel, Zach Brown On Thu, Apr 24, 2008 at 10:59 PM, Soeren Sandmann <sandmann@daimi.au.dk> wrote: [...] > Under the theory that disk seeks are killing us, one idea is to add a > 'multistat' system call that would allow statting of many files at a > time, which would give the disk scheduler more to work with. Instead of a specific system call for "multistat" it would be interesting to look at syslets. I'm not sure what the current state of that work is, the last reference I find is http://lkml.org/lkml/2007/12/6/336 and more interestingly http://lkml.org/lkml/2008/1/9/327 I guess it would be difficult to get close to the optimal disk schedule by using syslets; if a directory contains 1000 files that would require 1000 syslets and a good I/O scheduler - that's unlikely to be feasible. I've added Zach Brown to Cc because he authored the last syslet patchset. -- Carl Henrik ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: stat benchmark 2008-04-27 22:40 ` Carl Henrik Lunde @ 2008-04-28 17:46 ` Zach Brown 0 siblings, 0 replies; 17+ messages in thread From: Zach Brown @ 2008-04-28 17:46 UTC (permalink / raw) To: Carl Henrik Lunde; +Cc: Soeren Sandmann, linux-kernel > I guess it would be difficult to get close to the optimal disk schedule by > using syslets; if a directory contains 1000 files that would require 1000 > syslets and a good I/O scheduler - that's unlikely to be feasible. It wouldn't even get to the I/O scheduler. The VFS stat path (see real_lookup()) is synchronous and serialized. Each stat will hold the i_mutex of the parent directory while it waits for the file system's lookup method to populate the inode from disk. - z ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: stat benchmark
From: Ulrich Drepper @ 2008-04-28 4:43 UTC (permalink / raw)
To: Soeren Sandmann; +Cc: linux-kernel

On Thu, Apr 24, 2008 at 1:59 PM, Soeren Sandmann <sandmann@daimi.au.dk> wrote:
> So I am looking for ways to improve this.

Aside from what has already been proposed there is also the
readdirplus() route. Unfortunately the people behind this and related
proposals vanished after the last discussions. I was hoping they would
come back with a revised proposal, but perhaps not. Maybe it's time to
pick up the ball myself.

As a reminder, readdirplus() is an extended readdir() which also
returns (a subset of) the stat information for the file at the same
time. The subset part is needed to account for the different
information contained in the inodes. For most applications the subset
should be sufficient and therefore all that's needed is a single
iteration over the directory.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: stat benchmark
From: Theodore Tso @ 2008-04-28 11:53 UTC (permalink / raw)
To: Ulrich Drepper; +Cc: Soeren Sandmann, linux-kernel

On Sun, Apr 27, 2008 at 09:43:05PM -0700, Ulrich Drepper wrote:
> On Thu, Apr 24, 2008 at 1:59 PM, Soeren Sandmann <sandmann@daimi.au.dk> wrote:
> > So I am looking for ways to improve this.
>
> Aside from what has already been proposed there is also the
> readdirplus() route. Unfortunately the people behind this and related
> proposals vanished after the last discussions. I was hoping they
> would come back with a revised proposal, but perhaps not. Maybe it's
> time to pick up the ball myself.
>
> As a reminder, readdirplus() is an extended readdir() which also
> returns (a subset of) the stat information for the file at the same
> time. The subset part is needed to account for the different
> information contained in the inodes. For most applications the subset
> should be sufficient and therefore all that's needed is a single
> iteration over the directory.

I'm not sure this would help in the cold cache case, which is what
Soeren originally complained about.[1] The problem is that whatever
information the user might need won't be stored in the directory, so
the filesystem would end up having to stat the file anyway, incurring
a disk seek, which was what the user was complaining about. A
readdirplus() would save a whole bunch of system calls if the inode
was already cached, yes, but I'm not sure it would be worth the effort
given how small Linux's system call overhead is.

But in the cold cache case, you end up seeking all over the disk, and
the only thing you can do is to try to keep the inodes close to each
other, and to have either readdir() or the caller of readdir() sort
all of the returned directory entries by inode number to avoid seeking
all over the disk.

- Ted

[1] At least, not without making filesystem modifications; the problem
is that if you're going to store the inode information (or part of the
inode information) in the directory, it has to get updated each time
the inode gets updated, and that gets painful with hardlinks. There
was a Usenix paper which explored this idea over ten years ago, and
it's possible --- but then all directory entries need to have a linked
list so that you can either (a) find the "master" directory entry
which contains the inode information and be able to migrate inode
information to another directory entry when the original "master"
dirent gets deleted, or (b) keep the subset of the inode information
in all of the directory entries of the hardlink, and have to update
them all each time the inode information changes.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: stat benchmark 2008-04-28 11:53 ` Theodore Tso @ 2008-04-28 11:59 ` Avi Kivity 2008-04-28 13:31 ` Theodore Tso 2008-04-28 16:18 ` J. Bruce Fields 1 sibling, 1 reply; 17+ messages in thread From: Avi Kivity @ 2008-04-28 11:59 UTC (permalink / raw) To: Theodore Tso, Ulrich Drepper, Soeren Sandmann, linux-kernel Theodore Tso wrote: >> Aside from what has already been proposed there is also the >> readdirplus() route. Unfortunately the people behind this and related >> proposals vanished after the last discussions. I was hoping they >> come back with a revised proposal but perhaps not. Maybe it's time to >> pick up the ball myself. >> >> As a reminder, readdirplus() is an extended readdir() which also >> returns (a subset of) the stat information for the file at the same >> time. The subset part is needed to account for the different >> information contained in the inodes. For most applications the subset >> should be sufficient and therefore all that's needed is a single >> iteration over the directory. >> > > I'm not sure this would help in the cold cache case, which is what > Soeren originally complained about.[1] The problem is whaever > information the user might need won't be store in the directory, so > the filesystem would end having to stat the file anyway, incurring a > disk seek, which was what the user was complaining about. A > readdirplus() would save a whole bunch of system calls if the inode > was already cached, yes, but I'm not sure that's it would be worth the > effort given how small Linux's system call overhead would be. But in > the cold cache case, you end up seeking all over the disk, and the > only thing you can do is to try to keep the inodes close to each > other, and to have either readdir() or the caller of readdir() sort > all of the returned directory entries by inode number to avoid seeking > all over the disk. 
A readdirplus() could sort the inodes according to the filesystem's layout, and additionally issue the stats in parallel, so if you have multiple spindles you get significant additional speedup. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: stat benchmark 2008-04-28 11:59 ` Avi Kivity @ 2008-04-28 13:31 ` Theodore Tso 0 siblings, 0 replies; 17+ messages in thread From: Theodore Tso @ 2008-04-28 13:31 UTC (permalink / raw) To: Avi Kivity; +Cc: Ulrich Drepper, Soeren Sandmann, linux-kernel On Mon, Apr 28, 2008 at 02:59:26PM +0300, Avi Kivity wrote: > > A readdirplus() could sort the inodes according to the filesystem's layout, > and additionally issue the stats in parallel, so if you have multiple > spindles you get significant additional speedup. > Well, sorting inodes is something readdir() could do, but it takes potentially an unbounded amount of (non-swappable) kernel memory. That's why it's messy. And it wouldn't be hard to have flags (set by fcntl()) on the fd returned by readdir() which request inode cache prefetching, which would have the same net result, without needing to design a new system call. So before designing a new system call, it might be worth it to do some benchmarks to see how much of the cost is really in the number of stat() system calls, in the warm cache case. - Ted ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: stat benchmark 2008-04-28 11:53 ` Theodore Tso 2008-04-28 11:59 ` Avi Kivity @ 2008-04-28 16:18 ` J. Bruce Fields 1 sibling, 0 replies; 17+ messages in thread From: J. Bruce Fields @ 2008-04-28 16:18 UTC (permalink / raw) To: Theodore Tso, Ulrich Drepper, Soeren Sandmann, linux-kernel On Mon, Apr 28, 2008 at 07:53:22AM -0400, Theodore Tso wrote: > On Sun, Apr 27, 2008 at 09:43:05PM -0700, Ulrich Drepper wrote: > > On Thu, Apr 24, 2008 at 1:59 PM, Soeren Sandmann <sandmann@daimi.au.dk> wrote: > > > So I am looking for ways to improve this. > > > > Aside from what has already been proposed there is also the > > readdirplus() route. Unfortunately the people behind this and related > > proposals vanished after the last discussions. I was hoping they'd > > come back with a revised proposal but perhaps not. Maybe it's time to > > pick up the ball myself. > > > > As a reminder, readdirplus() is an extended readdir() which also > > returns (a subset of) the stat information for the file at the same > > time. The subset part is needed to account for the different > > information contained in the inodes. For most applications the subset > > should be sufficient and therefore all that's needed is a single > > iteration over the directory. > > I'm not sure this would help in the cold cache case, which is what > Soeren originally complained about.[1] The problem is whatever > information the user might need won't be stored in the directory, so > the filesystem would end up having to stat the file anyway, incurring a > disk seek, which was what the user was complaining about. A > readdirplus() would save a whole bunch of system calls if the inode > was already cached, yes, but I'm not sure that it would be worth the > effort given how small Linux's system call overhead would be.
But in > the cold cache case, you end up seeking all over the disk, and the > only thing you can do is to try to keep the inodes close to each > other, and to have either readdir() or the caller of readdir() sort > all of the returned directory entries by inode number to avoid seeking > all over the disk. The other reason for something like a readdirplus or a bulk stat is to provide an opportunity for parallelism. As my favorite example: cold-cache "git diff" of a linux tree on my desktop (with an nfs-mounted /home) takes about 12 seconds. That's mainly just a sequential stat of about 24000 files. Patching git to issue the stats in parallel, I could get that down to about 3.5 seconds. (Still not great. I don't know if it's disk seeks on the server or something else that's the limiting factor.) In the case of git, it's looking just for files that it tracks--it's not reading whole directories--so I don't know if readdirplus() specifically would help. --b. ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2008-04-28 19:42 UTC | newest] Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2008-04-24 20:59 stat benchmark Soeren Sandmann 2008-04-24 21:42 ` Carl Henrik Lunde 2008-04-24 21:44 ` Jan Engelhardt 2008-04-25 2:27 ` Justin Banks 2008-04-25 7:01 ` Christoph Hellwig 2008-04-25 19:48 ` Theodore Tso 2008-04-27 23:29 ` Soeren Sandmann 2008-04-28 0:13 ` Carl Henrik Lunde 2008-04-28 19:41 ` Alexander Larsson 2008-04-28 2:10 ` Theodore Tso 2008-04-27 22:40 ` Carl Henrik Lunde 2008-04-28 17:46 ` Zach Brown 2008-04-28 4:43 ` Ulrich Drepper 2008-04-28 11:53 ` Theodore Tso 2008-04-28 11:59 ` Avi Kivity 2008-04-28 13:31 ` Theodore Tso 2008-04-28 16:18 ` J. Bruce Fields
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox