* stat benchmark
@ 2008-04-24 20:59 Soeren Sandmann
From: Soeren Sandmann @ 2008-04-24 20:59 UTC (permalink / raw)
To: linux-kernel
At
http://www.daimi.au.dk/~sandmann/stat-benchmark.c
there is a simple program that will measure the time it takes to stat
every file in the current directory with a cold cache.
This is essentially what applications do in a number of common cases
such as "ls -l", nautilus opening a directory, or an "open file"
dialog being shown.
Unfortunately, performance of that operation kinda sucks. On my system
(ext3), it produces:
c-24-61-65-93:~% sudo ./a.out
Time to readdir(): 0.307671 s
Time to stat 2349 files: 8.203693 s
8 seconds is about 80 times slower than what a user perceives as
"instantly" and slow enough that we really should display a progress
bar if it can't be fixed.
So I am looking for ways to improve this.
Under the theory that disk seeks are killing us, one idea is to add a
'multistat' system call that would allow statting of many files at a
time, which would give the disk scheduler more to work with.
Possibly the same thing would need to be done for the getxattr
information.
Does this sound like a reasonable idea?
Thanks,
Soren
^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: stat benchmark
From: Carl Henrik Lunde @ 2008-04-24 21:42 UTC (permalink / raw)
To: Soeren Sandmann; +Cc: linux-kernel

On Thu, Apr 24, 2008 at 10:59 PM, Soeren Sandmann <sandmann@daimi.au.dk> wrote:
[ about programs reading all inodes after readdir ]
> Unfortunately, performance of that operation kinda sucks. On my system
> (ext3), it produces:
>
>   c-24-61-65-93:~% sudo ./a.out
>   Time to readdir(): 0.307671 s
>   Time to stat 2349 files: 8.203693 s
>
> 8 seconds is about 80 times slower than what a user perceives as
> "instantly" and slow enough that we really should display a progress
> bar if it can't be fixed.
>
> So I am looking for ways to improve this.
>
> Under the theory that disk seeks are killing us, one idea is to add a
> 'multistat' system call that would allow statting of many files at a
> time, which would give the disk scheduler more to work with.

I have experimented with the same problem, and another idea is to
reorder the results from readdir, which has given me good results.
This works because:

- For most filesystems there is a high correlation between the inode
  number and the sector on the disk.
- Most programs, like your example, handle the files in the order they
  are returned from readdir.
- The time spent sorting is very small compared to the disk seeks.

There are several possible ways to implement this:

- reorder the dirents in the kernel for each getdents call
- reorder the dirents in user space, for example by running qsort in a
  libc wrapper
- in the file system, optimize the order before writing back a dirty
  directory

This does not only apply to programs that only stat files, but also to
programs that read them, such as file indexers, backups (tar), and
Nautilus getting thumbnails from JPEGs.

--
Carl Henrik

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: stat benchmark
From: Jan Engelhardt @ 2008-04-24 21:44 UTC (permalink / raw)
To: Soeren Sandmann; +Cc: linux-kernel

On Thursday 2008-04-24 22:59, Soeren Sandmann wrote:
>
>   http://www.daimi.au.dk/~sandmann/stat-benchmark.c
>
> there is a simple program that will measure the time it takes to stat
> every file in the current directory with a cold cache.
>
> This is essentially what applications do in a number of common cases
> such as "ls -l", nautilus opening a directory, or an "open file"
> dialog being shown.

Mh, somewhat. Some programs/libraries unnecessarily stat. No, nautilus
is excluded.

> Unfortunately, performance of that operation kinda sucks. On my system
> (ext3), it produces:
>
>   c-24-61-65-93:~% sudo ./a.out
>   Time to readdir(): 0.307671 s
>   Time to stat 2349 files: 8.203693 s
>
> 8 seconds is about 80 times slower than what a user perceives as
> "instantly" and slow enough that we really should display a progress
> bar if it can't be fixed.
>
> So I am looking for ways to improve this.
>
> Under the theory that disk seeks are killing us, one idea is to add a
> 'multistat' system call that would allow statting of many files at a
> time, which would give the disk scheduler more to work with.
>
> Possibly the same thing would need to be done for the getxattr
> information.
>
> Does this sound like a reasonable idea?

XFS for example already has bulkstat, but it is not really exported to
userspace :-/

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: stat benchmark
From: Justin Banks @ 2008-04-25 2:27 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: Soeren Sandmann, linux-kernel

Jan Engelhardt wrote:
> XFS for example already has bulkstat, but it is not really
> exported to userspace :-/

Perhaps this could be in VFS, and used if present?

-justinb

--
Justin Banks
BakBone Software
justinb@bakbone.com

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: stat benchmark
From: Christoph Hellwig @ 2008-04-25 7:01 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: Soeren Sandmann, linux-kernel

On Thu, Apr 24, 2008 at 11:44:04PM +0200, Jan Engelhardt wrote:
> XFS for example already has bulkstat, but it is not really
> exported to userspace :-/

It is available through an ugly ioctl. It doesn't really help in this
use case, though, because bulkstat iterates over all inodes of a given
filesystem, not over the inodes of a directory.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: stat benchmark
From: Theodore Tso @ 2008-04-25 19:48 UTC (permalink / raw)
To: Soeren Sandmann; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 964 bytes --]

On Thu, Apr 24, 2008 at 10:59:10PM +0200, Soeren Sandmann wrote:
>
> Under the theory that disk seeks are killing us, one idea is to add a
> 'multistat' system call that would allow statting of many files at a
> time, which would give the disk scheduler more to work with.

Why don't you try this version of your stat-benchmark first? If you
give it the -s option, it will sort the files by inode number first.
I think you will find this makes a significant difference.

If it works, something that would be really great is for someone to
make a generic library that could be used instead of readdir(). I have
something which works as an LD_PRELOAD, but apparently it's been
blowing up on 64-bit systems, and I haven't had time to debug it. It's
probably better to do it as a library which userspace applications
link against, anyway. Would you or someone you know be interested in
maybe taking this idea and running with it?
Regards,

- Ted

[-- Attachment #2: stat-benchmark.c --]
[-- Type: text/x-csrc, Size: 2821 bytes --]

#include <sys/time.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <dirent.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <getopt.h>

struct dirent_s {
    unsigned long long d_ino;
    long long d_off;
    unsigned short int d_reclen;
    unsigned char d_type;
    char *d_name;
};

static void
disaster (const char *what)
{
    fprintf (stderr, "%s failed: %s\n", what, strerror (errno));
    exit (1);
}

static void
dump_caches (void)
{
    int fd = open ("/proc/sys/vm/drop_caches", O_RDWR);

    if (fd < 0)
	disaster ("opening drop_caches");
    if (write (fd, "3", strlen ("3")) < 0)
	disaster ("writing drop_caches");
    if (close (fd) < 0)
	disaster ("closing drop_caches");
}

static int
ino_cmp (const void *a, const void *b)
{
    const struct dirent_s *ds_a = (const struct dirent_s *) a;
    const struct dirent_s *ds_b = (const struct dirent_s *) b;
    unsigned long long i_a = ds_a->d_ino; /* was unsigned int, which
					   * truncates 64-bit inode numbers */
    unsigned long long i_b = ds_b->d_ino;

    return (i_a > i_b) - (i_a < i_b);	/* was i_a - i_b, which can overflow */
}

static double
tv_to_sec (const struct timeval *tv)
{
    return tv->tv_sec + tv->tv_usec / 1000000.0;
}

static double
time_diff (const struct timeval *before, const struct timeval *after)
{
    return tv_to_sec (after) - tv_to_sec (before);
}

static int
pot (int n)
{
    int p = 1;
    while (p <= n)
	p *= 2;
    return p;
}

int
main (int argc, char **argv)
{
    DIR *dir = opendir (".");
    struct dirent *ent;
    struct timeval before;
    struct timeval after;
    struct dirent_s *ds = NULL;
    int n_files;
    int do_sort = 0;
    int drop_caches = 1;
    int i, c;

    while ((c = getopt (argc, argv, "sc")) != EOF) {
	switch (c) {
	case 's':
	    do_sort++;
	    break;
	case 'c':
	    drop_caches = 0;
	    break;
	default:
	    printf ("Usage: %s [-s]\n", argv[0]);
	}
    }

    /* Dump caches */
    if (drop_caches)
	dump_caches ();
    sleep (1);

    /* Read directory */
    errno = 0;
    gettimeofday (&before, NULL);
    n_files = 0;
    while ((ent = readdir (dir))) {
	ds = realloc (ds, sizeof (struct dirent_s) * pot (n_files));
	if (!ds)
	    disaster ("realloc");
	ds[n_files].d_name = strdup (ent->d_name);
	ds[n_files].d_ino = ent->d_ino;
	n_files++;
    }
    if (errno != 0)
	disaster ("readdir");
    gettimeofday (&after, NULL);
    printf ("Time to readdir(): %f s\n", time_diff (&before, &after));

    if (do_sort)
	qsort (ds, n_files, sizeof (struct dirent_s), ino_cmp);

    /* Stat all the files */
    gettimeofday (&before, NULL);
    for (i = 0; i < n_files; ++i) {
	struct stat statbuf;
	stat (ds[i].d_name, &statbuf);
    }
    gettimeofday (&after, NULL);
    printf ("Time to stat %d files: %f s\n", n_files,
	    time_diff (&before, &after));

    return 0;
}

[-- Attachment #3: spd_readdir.c --]
[-- Type: text/x-csrc, Size: 6551 bytes --]

/*
 * readdir accelerator
 *
 * (C) Copyright 2003, 2004 by Theodore Ts'o.
 *
 * Compile using the command:
 *
 * gcc -o spd_readdir.so -shared spd_readdir.c -ldl
 *
 * Use it by setting the LD_PRELOAD environment variable:
 *
 * export LD_PRELOAD=/usr/local/sbin/spd_readdir.so
 *
 * %Begin-Header%
 * This file may be redistributed under the terms of the GNU Public
 * License.
 * %End-Header%
 *
 */

#define ALLOC_STEPSIZE	100
#define MAX_DIRSIZE	0

#define DEBUG

#ifdef DEBUG
#define DEBUG_DIR(x)	{if (do_debug) { x; }}
#else
#define DEBUG_DIR(x)
#endif

#define _GNU_SOURCE
#define __USE_LARGEFILE64

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <errno.h>
#include <dlfcn.h>

struct dirent_s {
	unsigned long long d_ino;
	long long d_off;
	unsigned short int d_reclen;
	unsigned char d_type;
	char *d_name;
};

struct dir_s {
	DIR	*dir;
	int	num;
	int	max;
	struct dirent_s *dp;
	int	pos;
	int	fd;
	struct dirent ret_dir;
	struct dirent64 ret_dir64;
};

static int (*real_closedir)(DIR *dir) = 0;
static DIR *(*real_opendir)(const char *name) = 0;
static struct dirent *(*real_readdir)(DIR *dir) = 0;
static struct dirent64 *(*real_readdir64)(DIR *dir) = 0;
static off_t (*real_telldir)(DIR *dir) = 0;
static void (*real_seekdir)(DIR *dir, off_t offset) = 0;
static int (*real_dirfd)(DIR *dir) = 0;
static unsigned long max_dirsize = MAX_DIRSIZE;
static int num_open = 0;	/* was missing the 'int' */
#ifdef DEBUG
static int do_debug = 0;
#endif

static void setup_ptr()
{
	char *cp;

	real_opendir = dlsym(RTLD_NEXT, "opendir");
	real_closedir = dlsym(RTLD_NEXT, "closedir");
	real_readdir = dlsym(RTLD_NEXT, "readdir");
	real_readdir64 = dlsym(RTLD_NEXT, "readdir64");
	real_telldir = dlsym(RTLD_NEXT, "telldir");
	real_seekdir = dlsym(RTLD_NEXT, "seekdir");
	real_dirfd = dlsym(RTLD_NEXT, "dirfd");
	if ((cp = getenv("SPD_READDIR_MAX_SIZE")) != NULL) {
		max_dirsize = atol(cp);
	}
#ifdef DEBUG
	if (getenv("SPD_READDIR_DEBUG"))
		do_debug++;
#endif
}

static void free_cached_dir(struct dir_s *dirstruct)
{
	int i;

	if (!dirstruct->dp)
		return;

	for (i = 0; i < dirstruct->num; i++) {
		free(dirstruct->dp[i].d_name);
	}
	free(dirstruct->dp);
	dirstruct->dp = 0;
}

static int ino_cmp(const void *a, const void *b)
{
	const struct dirent_s *ds_a = (const struct dirent_s *) a;
	const struct dirent_s *ds_b = (const struct dirent_s *) b;
	ino_t i_a, i_b;

	i_a = ds_a->d_ino;
	i_b = ds_b->d_ino;

	if (ds_a->d_name[0] == '.') {
		if (ds_a->d_name[1] == 0)
			i_a = 0;
		else if ((ds_a->d_name[1] == '.') && (ds_a->d_name[2] == 0))
			i_a = 1;
	}
	if (ds_b->d_name[0] == '.') {
		if (ds_b->d_name[1] == 0)
			i_b = 0;
		else if ((ds_b->d_name[1] == '.') && (ds_b->d_name[2] == 0))
			i_b = 1;
	}

	return (i_a > i_b) - (i_a < i_b);	/* was i_a - i_b, which can
						 * overflow with 64-bit ino_t */
}

DIR *opendir(const char *name)
{
	DIR *dir;
	struct dir_s	*dirstruct;
	struct dirent_s	*ds, *dnew;
	struct dirent64	*d;
	struct stat	st;

	if (!real_opendir)
		setup_ptr();

	DEBUG_DIR(printf("Opendir(%s) (%d open)\n", name, num_open++));
	dir = (*real_opendir)(name);
	if (!dir)
		return NULL;

	dirstruct = malloc(sizeof(struct dir_s));
	if (!dirstruct) {
		(*real_closedir)(dir);
		errno = ENOMEM;		/* was -ENOMEM */
		return NULL;
	}
	dirstruct->num = 0;
	dirstruct->max = 0;
	dirstruct->dp = 0;
	dirstruct->pos = 0;
	dirstruct->dir = 0;
	dirstruct->fd = -1;	/* was left uninitialized on the
				 * direct-readdir path */

	if (max_dirsize && (stat(name, &st) == 0) &&
	    (st.st_size > max_dirsize)) {
		DEBUG_DIR(printf("Directory size %ld, using direct readdir\n",
				 st.st_size));
		dirstruct->dir = dir;
		return (DIR *) dirstruct;
	}

	while ((d = (*real_readdir64)(dir)) != NULL) {
		if (dirstruct->num >= dirstruct->max) {
			dirstruct->max += ALLOC_STEPSIZE;
			DEBUG_DIR(printf("Reallocating to size %d\n",
					 dirstruct->max));
			dnew = realloc(dirstruct->dp,
				       dirstruct->max * sizeof(struct dirent_s));
			/* was sizeof(struct dir_s), over-allocating */
			if (!dnew)
				goto nomem;
			dirstruct->dp = dnew;
		}
		ds = &dirstruct->dp[dirstruct->num++];
		ds->d_ino = d->d_ino;
		ds->d_off = d->d_off;
		ds->d_reclen = d->d_reclen;
		ds->d_type = d->d_type;
		if ((ds->d_name = malloc(strlen(d->d_name)+1)) == NULL) {
			dirstruct->num--;
			goto nomem;
		}
		strcpy(ds->d_name, d->d_name);
		DEBUG_DIR(printf("readdir: %lu %s\n",
				 (unsigned long) d->d_ino, d->d_name));
	}
	dirstruct->fd = dup((*real_dirfd)(dir));
	(*real_closedir)(dir);
	qsort(dirstruct->dp, dirstruct->num, sizeof(struct dirent_s), ino_cmp);
	return ((DIR *) dirstruct);
nomem:
	DEBUG_DIR(printf("No memory, backing off to direct readdir\n"));
	free_cached_dir(dirstruct);
	dirstruct->dir = dir;
	return ((DIR *) dirstruct);
}

int closedir(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;

	DEBUG_DIR(printf("Closedir (%d open)\n", --num_open));
	if (dirstruct->dir)
		(*real_closedir)(dirstruct->dir);

	if (dirstruct->fd >= 0)
		close(dirstruct->fd);
	free_cached_dir(dirstruct);
	free(dirstruct);
	return 0;
}

struct dirent *readdir(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;
	struct dirent_s	*ds;

	if (dirstruct->dir)
		return (*real_readdir)(dirstruct->dir);

	if (dirstruct->pos >= dirstruct->num)
		return NULL;

	ds = &dirstruct->dp[dirstruct->pos++];
	dirstruct->ret_dir.d_ino = ds->d_ino;
	dirstruct->ret_dir.d_off = ds->d_off;
	dirstruct->ret_dir.d_reclen = ds->d_reclen;
	dirstruct->ret_dir.d_type = ds->d_type;
	strncpy(dirstruct->ret_dir.d_name, ds->d_name,
		sizeof(dirstruct->ret_dir.d_name));

	return (&dirstruct->ret_dir);
}

struct dirent64 *readdir64(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;
	struct dirent_s	*ds;

	if (dirstruct->dir)
		return (*real_readdir64)(dirstruct->dir);

	if (dirstruct->pos >= dirstruct->num)
		return NULL;

	ds = &dirstruct->dp[dirstruct->pos++];
	dirstruct->ret_dir64.d_ino = ds->d_ino;
	dirstruct->ret_dir64.d_off = ds->d_off;
	dirstruct->ret_dir64.d_reclen = ds->d_reclen;
	dirstruct->ret_dir64.d_type = ds->d_type;
	strncpy(dirstruct->ret_dir64.d_name, ds->d_name,
		sizeof(dirstruct->ret_dir64.d_name));

	return (&dirstruct->ret_dir64);
}

off_t telldir(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;

	if (dirstruct->dir)
		return (*real_telldir)(dirstruct->dir);

	return ((off_t) dirstruct->pos);
}

void seekdir(DIR *dir, off_t offset)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;

	if (dirstruct->dir) {
		(*real_seekdir)(dirstruct->dir, offset);
		return;
	}

	dirstruct->pos = offset;
}

int dirfd(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;

	if (dirstruct->dir)
		return (*real_dirfd)(dirstruct->dir);

	return (dirstruct->fd);
}

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: stat benchmark
From: Soeren Sandmann @ 2008-04-27 23:29 UTC (permalink / raw)
To: Theodore Tso; +Cc: linux-kernel, alexl

Theodore Tso <tytso@mit.edu> writes:

> On Thu, Apr 24, 2008 at 10:59:10PM +0200, Soeren Sandmann wrote:
> >
> > Under the theory that disk seeks are killing us, one idea is to add a
> > 'multistat' system call that would allow statting of many files at a
> > time, which would give the disk scheduler more to work with.
>
> Why don't you try this version of your stat-benchmark first? If you
> give it the -s option, it will sort the files by inode number first.
> I think you will find this makes a significant difference.

Yeah, Carl suggested this as well. Sorting by inode is a major
improvement. The numbers are less stable, but consistently much lower:

  Time to readdir(): 0.238737 s
  Time to stat 2366 files: 1.338904 s

compared to

  Time to readdir(): 0.227599 s
  Time to stat 2366 files: 7.981752 s

Of course, 1.3 seconds is still far from instant, but it may be the
best we can get given the realities of ext3 disk layout.

One thing that surprised me is that lstat is slightly slower than
stat. With lstat() instead of stat(), I get:

  Time to stat 2366 files: 1.472115 s
  Time to readdir(): 1.735542 s

I naively thought that stat() was a superset of lstat(), but
apparently not.

> If it works, something that would be really great is for someone to
> make a generic library that could be used instead of readdir(). I
> have something which works as an LD_PRELOAD, but apparently it's been
> blowing up on 64-bit systems, and I haven't had time to debug it.
> It's probably better to do it as a library which userspace
> applications link against, anyway. Would you or someone you know be
> interested in maybe taking this idea and running with it?
After I read Carl's mail I looked at what glib - which is used by both
Nautilus and the gtk+ file open dialog - does, and in fact the latest
version does actually read chunks of 1000 dirents and sort them by
inode. The version I had installed when I wrote the benchmark just
stat'ed in readdir() order.

For a directory of ~2360 files, chunks of 1000 files is actually
surprisingly worse than statting all of the files at once:

  Time to stat 1000 files: 1.008735 s
  Time to stat 1000 files: 0.738936 s
  Time to stat 366 files: 0.217002 s

I guess this just shows that seeks really are pretty much all that
matters. Glib should maybe use a larger chunk size.

I don't know if a general library outside glib would be useful. It
seems that just telling people to "sort by inode before statting"
would be just as effective as telling them "use this optimized
library".

Thanks,
Soren

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: stat benchmark
From: Carl Henrik Lunde @ 2008-04-28 0:13 UTC (permalink / raw)
To: Soeren Sandmann; +Cc: Theodore Tso, linux-kernel, alexl

On Mon, Apr 28, 2008 at 1:29 AM, Soeren Sandmann <sandmann@daimi.au.dk> wrote:
[...]
> For a directory of ~2360 files, chunks of 1000 files is actually
> surprisingly worse than statting all of the files at once:
>
>   Time to stat 1000 files: 1.008735 s
>   Time to stat 1000 files: 0.738936 s
>   Time to stat 366 files: 0.217002 s
>
> I guess this just shows that seeks really are pretty much all that
> matters. Glib should maybe use a larger chunk size.

I agree; if I remember correctly, I did not find a directory on my
local disk where the best result was to sort a chunk instead of the
complete directory.

> I don't know if a general library outside glib would be useful. It
> seems that just telling people to "sort by inode before statting"
> would be just as effective as telling them "use this optimized
> library".

A library (like glib) could disable the feature for solid-state drives
and perhaps implement an alternative strategy for filesystems without
any inode/sector correlation. So I think we should tell people to use
glib. :-)

--
Carl Henrik

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: stat benchmark
From: Alexander Larsson @ 2008-04-28 19:41 UTC (permalink / raw)
To: Carl Henrik Lunde; +Cc: Soeren Sandmann, Theodore Tso, linux-kernel

On Mon, 2008-04-28 at 02:13 +0200, Carl Henrik Lunde wrote:
> On Mon, Apr 28, 2008 at 1:29 AM, Soeren Sandmann <sandmann@daimi.au.dk> wrote:
> [...]
> > For a directory of ~2360 files, chunks of 1000 files is actually
> > surprisingly worse than statting all of the files at once:
> >
> >   Time to stat 1000 files: 1.008735 s
> >   Time to stat 1000 files: 0.738936 s
> >   Time to stat 366 files: 0.217002 s
> >
> > I guess this just shows that seeks really are pretty much all that
> > matters. Glib should maybe use a larger chunk size.
>
> I agree; if I remember correctly, I did not find a directory on my
> local disk where the best result was to sort a chunk instead of the
> complete directory.

I don't think that is expected either. The reason for the chunking is
to avoid unlimited memory use on large directories.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: stat benchmark
From: Theodore Tso @ 2008-04-28 2:10 UTC (permalink / raw)
To: Soeren Sandmann; +Cc: linux-kernel, alexl

On Mon, Apr 28, 2008 at 01:29:52AM +0200, Soeren Sandmann wrote:
> Sorting by inode is a major improvement. The numbers are less stable,
> but consistently much lower:
>
>   Time to readdir(): 0.238737 s
>   Time to stat 2366 files: 1.338904 s
>
> compared to
>
>   Time to readdir(): 0.227599 s
>   Time to stat 2366 files: 7.981752 s
>
> Of course, 1.3 seconds is still far from instant, but it may be the
> best we can get given the realities of ext3 disk layout.

Out of curiosity, what was the directory that you were statting? If it
took you 1.3 seconds to stat 2366 files, the directory must have
inodes scattered all over the disk, or the disk must be very slow. On
my laptop disk, I can stat 9543 files in 1.1 seconds (from a Maildir
directory).

Also, why does the application need to stat all of the files? Is it
just to get the file type (i.e., regular file vs. directory)? If so,
maybe you can use the d_type field in the directory entry returned by
readdir().

> I don't know if a general library outside glib would be useful. It
> seems that just telling people to "sort by inode before statting"
> would be just as effective as telling them "use this optimized
> library".

Well, the question is what we would need to do in order to make it
really easy for people to drop that into their code. Programmers are
fundamentally lazy, after all, and if it's too much work to create an
interim data structure and then qsort it, they won't. But maybe the
glib interface is that convenient interface, and all we need to do is
change glib to sort with a much larger chunk size. We do need to get
similar changes into find, ls, and many other programs that might not
be so interested in linking against glib, though.
- Ted ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: stat benchmark 2008-04-24 20:59 stat benchmark Soeren Sandmann ` (2 preceding siblings ...) 2008-04-25 19:48 ` Theodore Tso @ 2008-04-27 22:40 ` Carl Henrik Lunde 2008-04-28 17:46 ` Zach Brown 2008-04-28 4:43 ` Ulrich Drepper 4 siblings, 1 reply; 17+ messages in thread From: Carl Henrik Lunde @ 2008-04-27 22:40 UTC (permalink / raw) To: Soeren Sandmann; +Cc: linux-kernel, Zach Brown On Thu, Apr 24, 2008 at 10:59 PM, Soeren Sandmann <sandmann@daimi.au.dk> wrote: [...] > Under the theory that disk seeks are killing us, one idea is to add a > 'multistat' system call that would allow statting of many files at a > time, which would give the disk scheduler more to work with. Instead of a specific system call for "multistat" it would be interesting to look at syslets. I'm not sure what the current state of that work is, the last reference I find is http://lkml.org/lkml/2007/12/6/336 and more interestingly http://lkml.org/lkml/2008/1/9/327 I guess it would be difficult to get close to the optimal disk schedule by using syslets; if a directory contains 1000 files that would require 1000 syslets and a good I/O scheduler - that's unlikely to be feasible. I've added Zach Brown to Cc because he authored the last syslet patchset. -- Carl Henrik ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: stat benchmark 2008-04-27 22:40 ` Carl Henrik Lunde @ 2008-04-28 17:46 ` Zach Brown 0 siblings, 0 replies; 17+ messages in thread From: Zach Brown @ 2008-04-28 17:46 UTC (permalink / raw) To: Carl Henrik Lunde; +Cc: Soeren Sandmann, linux-kernel > I guess it would be difficult to get close to the optimal disk schedule by > using syslets; if a directory contains 1000 files that would require 1000 > syslets and a good I/O scheduler - that's unlikely to be feasible. It wouldn't even get to the I/O scheduler. The VFS stat path (see real_lookup()) is synchronous and serialized. Each stat will hold the i_mutex of the parent directory while it waits for the file system's lookup method to populate the inode from disk. - z ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: stat benchmark
From: Ulrich Drepper @ 2008-04-28 4:43 UTC (permalink / raw)
To: Soeren Sandmann; +Cc: linux-kernel

On Thu, Apr 24, 2008 at 1:59 PM, Soeren Sandmann <sandmann@daimi.au.dk> wrote:
> So I am looking for ways to improve this.

Aside from what has already been proposed there is also the
readdirplus() route. Unfortunately the people behind this and related
proposals vanished after the last discussions. I was hoping they would
come back with a revised proposal, but perhaps not. Maybe it's time to
pick up the ball myself.

As a reminder, readdirplus() is an extended readdir() which also
returns (a subset of) the stat information for the file at the same
time. The subset part is needed to account for the different
information contained in the inodes. For most applications the subset
should be sufficient and therefore all that's needed is a single
iteration over the directory.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: stat benchmark
From: Theodore Tso @ 2008-04-28 11:53 UTC (permalink / raw)
To: Ulrich Drepper; +Cc: Soeren Sandmann, linux-kernel

On Sun, Apr 27, 2008 at 09:43:05PM -0700, Ulrich Drepper wrote:
> On Thu, Apr 24, 2008 at 1:59 PM, Soeren Sandmann <sandmann@daimi.au.dk> wrote:
> > So I am looking for ways to improve this.
>
> Aside from what has already been proposed there is also the
> readdirplus() route. Unfortunately the people behind this and related
> proposals vanished after the last discussions. I was hoping they
> would come back with a revised proposal, but perhaps not. Maybe it's
> time to pick up the ball myself.
>
> As a reminder, readdirplus() is an extended readdir() which also
> returns (a subset of) the stat information for the file at the same
> time. The subset part is needed to account for the different
> information contained in the inodes. For most applications the subset
> should be sufficient and therefore all that's needed is a single
> iteration over the directory.

I'm not sure this would help in the cold cache case, which is what
Soeren originally complained about.[1] The problem is that whatever
information the user might need won't be stored in the directory, so
the filesystem would end up having to stat the file anyway, incurring
a disk seek, which was what the user was complaining about. A
readdirplus() would save a whole bunch of system calls if the inode
was already cached, yes, but I'm not sure it would be worth the effort
given how small Linux's system call overhead is.

But in the cold cache case, you end up seeking all over the disk, and
the only thing you can do is to try to keep the inodes close to each
other, and to have either readdir() or the caller of readdir() sort
all of the returned directory entries by inode number to avoid seeking
all over the disk.

- Ted

[1] At least, not without making filesystem modifications; the problem
is that if you're going to store the inode information (or part of the
inode information) in the directory, it has to get updated each time
the inode gets updated, and that gets painful with hardlinks. There
was a Usenix paper which explored this idea over ten years ago, and
it's possible --- but then all directory entries need to have a linked
list so that you can either (a) find the "master" directory entry
which contains the inode information and be able to migrate inode
information to another directory entry when the original "master"
dirent gets deleted, or (b) keep the subset of the inode information
in all of the directory entries of the hardlink, and have to update
them all each time the inode information changes.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: stat benchmark 2008-04-28 11:53 ` Theodore Tso @ 2008-04-28 11:59 ` Avi Kivity 2008-04-28 13:31 ` Theodore Tso 2008-04-28 16:18 ` J. Bruce Fields 1 sibling, 1 reply; 17+ messages in thread From: Avi Kivity @ 2008-04-28 11:59 UTC (permalink / raw) To: Theodore Tso, Ulrich Drepper, Soeren Sandmann, linux-kernel Theodore Tso wrote: >> Aside from what has already been proposed there is also the >> readdirplus() route. Unfortunately the people behind this and related >> proposals vanished after the last discussions. I was hoping they >> come back with a revised proposal but perhaps not. Maybe it's time to >> pick up the ball myself. >> >> As a reminder, readdirplus() is an extended readdir() which also >> returns (a subset of) the stat information for the file at the same >> time. The subset part is needed to account for the different >> information contained in the inodes. For most applications the subset >> should be sufficient and therefore all that's needed is a single >> iteration over the directory. >> > > I'm not sure this would help in the cold cache case, which is what > Soeren originally complained about.[1] The problem is whaever > information the user might need won't be store in the directory, so > the filesystem would end having to stat the file anyway, incurring a > disk seek, which was what the user was complaining about. A > readdirplus() would save a whole bunch of system calls if the inode > was already cached, yes, but I'm not sure that's it would be worth the > effort given how small Linux's system call overhead would be. But in > the cold cache case, you end up seeking all over the disk, and the > only thing you can do is to try to keep the inodes close to each > other, and to have either readdir() or the caller of readdir() sort > all of the returned directory entries by inode number to avoid seeking > all over the disk. 
A readdirplus() could sort the inodes according to the filesystem's layout, and additionally issue the stats in parallel, so if you have multiple spindles you get significant additional speedup. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: stat benchmark 2008-04-28 11:59 ` Avi Kivity @ 2008-04-28 13:31 ` Theodore Tso 0 siblings, 0 replies; 17+ messages in thread From: Theodore Tso @ 2008-04-28 13:31 UTC (permalink / raw) To: Avi Kivity; +Cc: Ulrich Drepper, Soeren Sandmann, linux-kernel On Mon, Apr 28, 2008 at 02:59:26PM +0300, Avi Kivity wrote: > > A readdirplus() could sort the inodes according to the filesystem's layout, > and additionally issue the stats in parallel, so if you have multiple > spindles you get significant additional speedup. > Well, sorting inodes is something readdir() could do, but it takes potentially an unbounded amount of (non-swappable) kernel memory. That's why it's messy. And it wouldn't be hard to have flags (set by fcntl()) on the fd returned by readdir() which request inode cache prefetching, which would have the same net result, without needing to design a new system call. So before designing a new system call, it might be worth it to do some benchmarks to see how much of the cost is really in the number of stat() system calls, in the warm cache case. - Ted ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: stat benchmark 2008-04-28 11:53 ` Theodore Tso 2008-04-28 11:59 ` Avi Kivity @ 2008-04-28 16:18 ` J. Bruce Fields 1 sibling, 0 replies; 17+ messages in thread From: J. Bruce Fields @ 2008-04-28 16:18 UTC (permalink / raw) To: Theodore Tso, Ulrich Drepper, Soeren Sandmann, linux-kernel On Mon, Apr 28, 2008 at 07:53:22AM -0400, Theodore Tso wrote: > On Sun, Apr 27, 2008 at 09:43:05PM -0700, Ulrich Drepper wrote: > > On Thu, Apr 24, 2008 at 1:59 PM, Soeren Sandmann <sandmann@daimi.au.dk> wrote: > > > So I am looking for ways to improve this. > > > > Aside from what has already been proposed there is also the > > readdirplus() route. Unfortunately the people behind this and related > > proposals vanished after the last discussions. I was hoping they'd > > come back with a revised proposal but perhaps not. Maybe it's time to > > pick up the ball myself. > > > > As a reminder, readdirplus() is an extended readdir() which also > > returns (a subset of) the stat information for the file at the same > > time. The subset part is needed to account for the different > > information contained in the inodes. For most applications the subset > > should be sufficient and therefore all that's needed is a single > > iteration over the directory. > > I'm not sure this would help in the cold cache case, which is what > Soeren originally complained about.[1] The problem is whatever > information the user might need won't be stored in the directory, so > the filesystem would end up having to stat the file anyway, incurring a > disk seek, which was what the user was complaining about. A > readdirplus() would save a whole bunch of system calls if the inode > was already cached, yes, but I'm not sure that it would be worth the > effort given how small Linux's system call overhead would be.
But in > the cold cache case, you end up seeking all over the disk, and the > only thing you can do is to try to keep the inodes close to each > other, and to have either readdir() or the caller of readdir() sort > all of the returned directory entries by inode number to avoid seeking > all over the disk. The other reason for something like a readdirplus or a bulk stat is to provide an opportunity for parallelism. As my favorite example: cold-cache "git diff" of a linux tree on my desktop (with an nfs-mounted /home) takes about 12 seconds. That's mainly just a sequential stat of about 24000 files. Patching git to issue the stats in parallel, I could get that down to about 3.5 seconds. (Still not great. I don't know if it's disk seeks on the server or something else that's the limiting factor.) In the case of git, it's looking just for files that it tracks--it's not reading whole directories--so I don't know if readdirplus() specifically would help. --b. ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2008-04-28 19:42 UTC | newest] Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2008-04-24 20:59 stat benchmark Soeren Sandmann 2008-04-24 21:42 ` Carl Henrik Lunde 2008-04-24 21:44 ` Jan Engelhardt 2008-04-25 2:27 ` Justin Banks 2008-04-25 7:01 ` Christoph Hellwig 2008-04-25 19:48 ` Theodore Tso 2008-04-27 23:29 ` Soeren Sandmann 2008-04-28 0:13 ` Carl Henrik Lunde 2008-04-28 19:41 ` Alexander Larsson 2008-04-28 2:10 ` Theodore Tso 2008-04-27 22:40 ` Carl Henrik Lunde 2008-04-28 17:46 ` Zach Brown 2008-04-28 4:43 ` Ulrich Drepper 2008-04-28 11:53 ` Theodore Tso 2008-04-28 11:59 ` Avi Kivity 2008-04-28 13:31 ` Theodore Tso 2008-04-28 16:18 ` J. Bruce Fields
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox