git.vger.kernel.org archive mirror
* [RFC] Speed up "git status" by caching untracked file info
@ 2014-04-17  5:51 Nguyễn Thái Ngọc Duy
  2014-04-17 19:40 ` Junio C Hamano
  2014-04-22  9:56 ` Karsten Blees
  0 siblings, 2 replies; 8+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2014-04-17  5:51 UTC (permalink / raw)
  To: git; +Cc: Nguyễn Thái Ngọc Duy

This patch serves as a heads-up about a feature I'm working on. I hope
that by posting it early, people can double-check whether I have made
some fundamental mistake that completely ruins the idea. It's about
speeding up "git status" by caching untracked file info in the index
_if_ your file system supports it (more below).

The whole WIP series is at

https://github.com/pclouds/git/commits/untracked-cache

I only post the real meat here. I'm aware of a few incomplete details
in this patch, but nothing fundamentally wrong. So far the numbers are
promising.  ls-files is updated to run fill_directory() twice in a
row and "ls-files -o --directory --no-empty-directory --exclude-standard"
(with gcc -O0) gives me:

           first run  second (cached) run
gentoo-x86    500 ms             71.6  ms
wine          140 ms              9.72 ms
webkit        125 ms              6.88 ms
linux-2.6     106 ms             16.2  ms

Basically untracked time is cut to one tenth in the best-case
scenario. The final numbers will be a bit higher because I haven't
stored or read the cache from the index yet. The real commit message
follows.


read_directory() plays a big part in the slowness of "git status"
because it has to read every directory and check for excluded entries,
which is really expensive. This patch adds an option to cache the
results so that, after the first slow read_directory(), subsequent
calls should be cheap and fast.

The following inputs are sufficient to determine what files in a
directory are excluded:

 - The list of files and directories of the directory in question
 - The $GIT_DIR/index
 - The content of $GIT_DIR/info/exclude
 - The content of core.excludesfile
 - The content (or the lack) of .gitignore of all parent directories
   from $GIT_WORK_TREE

If we can cheaply validate all those inputs for a certain directory,
we are sure that the current code will always produce the same
results, so we can cache and reuse those results.
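Conceptually, each cached directory then carries a fingerprint of
those inputs; a minimal sketch (the struct and function names here are
illustrative only; the real one is struct untracked_dir in dir.h
below):

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/*
 * Illustrative per-directory fingerprint of the inputs above.
 * Inputs 3 and 4 (info/exclude, core.excludesfile) are global, so
 * they get one SHA-1 each for the whole cache; input 2 (the index)
 * invalidates entries through hooks in read-cache.c instead.
 */
struct dir_fingerprint {
	time_t dir_mtime;                 /* input 1: the directory listing */
	unsigned char gitignore_sha1[20]; /* input 5: this dir's .gitignore */
};

/* Cached results may be reused only while the fingerprint matches. */
static int fingerprint_matches(const struct dir_fingerprint *a,
			       const struct dir_fingerprint *b)
{
	return a->dir_mtime == b->dir_mtime &&
	       !memcmp(a->gitignore_sha1, b->gitignore_sha1, 20);
}
```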

This is not a silver bullet approach. When you compile a C file, for
example, the old .o file is removed and a new one with the same name
created, effectively invalidating the containing directory's
cache. But at least with a large enough work tree, there could be many
directories you never touch. The cache could help there.

The first input can be checked using directory mtime. In many
filesystems, directory mtime is updated when direct files/dirs are
added or removed (*). If you do not use such a file system, this
feature is not for you.
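A minimal sketch of that mtime check (hypothetical names; the actual
patch reuses git's struct stat_data via fill_stat_data() and
match_stat_data(), see valid_cached_dir() below):

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical cached stat info for one directory. */
struct cached_stat {
	time_t mtime;
	ino_t ino;
};

/*
 * Return 1 if the directory listing is unchanged since we cached it;
 * otherwise refresh the cached stat info and return 0 so the caller
 * rescans the directory the slow way.
 */
static int dir_listing_unchanged(struct cached_stat *cs, const char *path)
{
	struct stat st;
	if (stat(path, &st))
		return 0;
	if (cs->mtime == st.st_mtime && cs->ino == st.st_ino)
		return 1;
	cs->mtime = st.st_mtime;
	cs->ino = st.st_ino;
	return 0;
}
```

Note that st_mtime has one-second granularity on many systems, so a
real implementation also has to worry about races within the same
second.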

The second one can be hooked from read-cache.c. Whenever a file (or a
submodule) is added or removed from a directory, we invalidate that
directory. This will be done in a later patch.

The remaining inputs are easy: their SHA-1 can be used to verify
their contents. We do need to read the .gitignore files and digest
them, but they are usually few and small, so the overhead should not
be much.
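The digest step can be sketched with a stand-in content hash (FNV-1a
here purely for illustration; the patch instead records the SHA-1
that add_excludes_from_file_to_list() fills in while the file is
being parsed):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in content hash (FNV-1a), for illustration only. */
static uint64_t content_hash(const char *buf, size_t len)
{
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV offset basis */
	while (len--) {
		h ^= (unsigned char)*buf++;
		h *= 0x100000001b3ULL;		/* FNV prime */
	}
	return h;
}

/* A cached entry stays valid only while the digest is unchanged. */
static int ignore_file_unchanged(uint64_t cached, const char *buf, size_t len)
{
	return cached == content_hash(buf, len);
}
```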

At the implementation level, the whole directory structure is saved;
each directory corresponds to one struct untracked_dir. Each directory
holds the SHA-1 of the .gitignore underneath (or null if it does not
exist) and the list of untracked "files" and subdirs to recurse into
if all is well. Untracked subdirectories are saved in the file queue,
which is the reason for quoting "files" in the previous sentence.

On the first run, no untracked_dir is valid, so the default code path
is run. prep_exclude() is updated to record the SHA-1 of .gitignore
along the way. read_directory_recursive() is updated to record
untracked files.

On subsequent runs, read_directory_recursive() reads the stat info of
the directory in question and verifies whether files/dirs have been
added or removed. With the help of prep_exclude() to verify the
.gitignore chain, it may decide "all is well" and enable the fast path
in treat_path(). read_directory_recursive() is still called for
subdirectories even in the fast path, because a directory's mtime does
not cover all subdirs recursively.

So if all is really well, read_directory() becomes a series of
open(".gitignore"), read(".gitignore"), close(), hash_sha1_file() and
stat(<dir>) _without_ heavyweight exclude filtering. There should be
no overhead if this feature is disabled.

(*) Looking at the code in linux-2.6, the ext* family seems to expose
this behavior. vfat also does (though I'm not so sure about fat).
btrfs probably does. statfs() could be used to detect the file system.
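A rough, Linux-only sketch of such detection (the whitelist of
filesystems is an assumption that would need per-filesystem testing,
exactly as the caveat above says):

```c
#include <assert.h>
#include <sys/vfs.h>
#include <linux/magic.h>

/*
 * Heuristic: allow the untracked cache only on filesystems believed
 * to update directory mtime when direct entries are added/removed.
 * The whitelist below is illustrative, not verified.
 */
static int fs_supports_dir_mtime(const char *path)
{
	struct statfs fs;
	if (statfs(path, &fs))
		return 0;	/* be conservative on error */
	switch (fs.f_type) {
	case EXT2_SUPER_MAGIC:	/* magic shared by ext2/3/4 */
	case BTRFS_SUPER_MAGIC:
		return 1;
	default:
		return 0;
	}
}
```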

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 dir.c | 336 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 dir.h |  31 ++++++
 2 files changed, 343 insertions(+), 24 deletions(-)

diff --git a/dir.c b/dir.c
index bd58c14..f5d6315 100644
--- a/dir.c
+++ b/dir.c
@@ -31,8 +31,23 @@ enum path_treatment {
 	path_untracked
 };
 
+struct cached_dir {
+	DIR *fdir;
+	struct untracked_dir *untracked;
+	int nr_files;
+	int nr_dirs;
+
+	/*
+	 * return data from read_cached_dir(). name and state are only
+	 * valid if de is NULL
+	 */
+	struct dirent *de;
+	struct strbuf name;
+	enum path_treatment state;
+};
+
 static enum path_treatment read_directory_recursive(struct dir_struct *dir,
-	const char *path, int len,
+	const char *path, int len, struct untracked_dir *untracked,
 	int check_only, const struct path_simplify *simplify);
 static int get_dtype(struct dirent *de, const char *path, int len);
 
@@ -796,6 +811,56 @@ static struct exclude *last_exclude_matching_from_lists(struct dir_struct *dir,
 	return NULL;
 }
 
+static struct untracked_dir *lookup_untracked(struct untracked_dir *dir,
+					      const char *name, int len)
+{
+	int first, last;
+	struct untracked_dir *d;
+	if (!dir)
+		return NULL;
+	if (len && name[len - 1] == '/')
+		len--;
+	first = 0;
+	last = dir->dirs_nr;
+	while (last > first) {
+		int cmp, next = (last + first) >> 1;
+		d = dir->dirs[next];
+		cmp = strncmp(name, d->name, len);
+		if (!cmp && strlen(d->name) > len)
+			cmp = -1;
+		if (!cmp)
+			return d;
+		if (cmp < 0) {
+			last = next;
+			continue;
+		}
+		first = next+1;
+	}
+
+	d = xmalloc(sizeof(*d) + len);
+	memset(d, 0, sizeof(*d) + len);
+	d->untracked_nr = -1;
+	memcpy(d->name, name, len);
+
+	ALLOC_GROW(dir->dirs, dir->dirs_nr + 1, dir->dirs_alloc);
+	memmove(dir->dirs + first + 1, dir->dirs + first,
+		(dir->dirs_nr - first) * sizeof(*dir->dirs));
+	dir->dirs_nr++;
+	dir->dirs[first] = d;
+	return d;
+}
+
+static void invalidate_untracked(struct untracked_dir *dir)
+{
+	int i;
+	if (dir->untracked_nr == -1)
+		/* it is assumed that all subdirs are already invalidated */
+		return;
+	dir->untracked_nr = -1;
+	for (i = 0; i < dir->dirs_nr; i++)
+		invalidate_untracked(dir->dirs[i]);
+}
+
 /*
  * Loads the per-directory exclude list for the substring of base
  * which has a char length of baselen.
@@ -805,6 +870,7 @@ static void prep_exclude(struct dir_struct *dir, const char *base, int baselen)
 	struct exclude_list_group *group;
 	struct exclude_list *el;
 	struct exclude_stack *stk = NULL;
+	struct untracked_dir *untracked = NULL;
 	int current;
 
 	group = &dir->exclude_list_group[EXC_DIRS];
@@ -842,18 +908,35 @@ static void prep_exclude(struct dir_struct *dir, const char *base, int baselen)
 	/* Read from the parent directories and push them down. */
 	current = stk ? stk->baselen : -1;
 	strbuf_setlen(&dir->basebuf, current < 0 ? 0 : current);
+
+	if (dir->untracked && current >= 0 && current < baselen) {
+		const char *start = base;
+		const char *end = base + current;
+		untracked = &dir->untracked->root;
+		while (start < end) {
+			const char *cp = strchrnul(start, '/');
+			untracked = lookup_untracked(untracked, start, cp - start);
+			start = *cp == '/' ? cp + 1 : cp;
+		}
+	}
+
 	while (current < baselen) {
 		struct exclude_stack *stk = xcalloc(1, sizeof(*stk));
 		const char *cp;
+		unsigned char sha1[20];
 
 		if (current < 0) {
 			cp = base;
 			current = 0;
+			if (dir->untracked)
+				untracked = &dir->untracked->root;
 		} else {
 			cp = strchr(base + current + 1, '/');
 			if (!cp)
 				die("oops in prep_exclude");
 			cp++;
+			untracked = lookup_untracked(untracked, base + current,
+						     cp - base - current);
 		}
 		stk->prev = dir->exclude_stack;
 		stk->baselen = cp - base;
@@ -880,7 +963,22 @@ static void prep_exclude(struct dir_struct *dir, const char *base, int baselen)
 		}
 
 		/* Try to read per-directory file */
-		if (dir->exclude_per_dir) {
+		hashclr(sha1);
+		if (dir->exclude_per_dir &&
+		    /*
+		     * If we know that no files have been added in
+		     * this directory (i.e. valid_cached_dir() has
+	     * been executed and untracked_nr is non-negative) ..
+		     */
+		    (!untracked || untracked->untracked_nr < 0 ||
+		     /*
+		      * .. and .gitignore does not exist before
+		      * (i.e. null exclude_sha1 and skip_worktree is
+		      * not set). Then we can skip loading .gitignore,
+		      * which would result in ENOENT anyway.
+		      * skip_worktree is taken care of in read_directory()
+		      */
+		     !is_null_sha1(untracked->exclude_sha1))) {
 			/*
 			 * dir->basebuf gets reused by the traversal, but we
 			 * need fname to remain unchanged to ensure the src
@@ -895,7 +993,13 @@ static void prep_exclude(struct dir_struct *dir, const char *base, int baselen)
 			el->src = strbuf_detach(&sb, NULL);
 			add_excludes_from_file_to_list(el->src, el->src,
 						       stk->baselen, el, 1,
-						       NULL);
+						       untracked ? sha1 : NULL);
+		}
+		if (untracked) {
+			if (untracked->untracked_nr >= 0 &&
+			    hashcmp(sha1, untracked->exclude_sha1))
+				invalidate_untracked(untracked);
+			hashcpy(untracked->exclude_sha1, sha1);
 		}
 		dir->exclude_stack = stk;
 		current = stk->baselen;
@@ -1076,6 +1180,7 @@ static enum exist_status directory_exists_in_index(const char *dirname, int len)
  *  (c) otherwise, we recurse into it.
  */
 static enum path_treatment treat_directory(struct dir_struct *dir,
+	struct untracked_dir *untracked,
 	const char *dirname, int len, int exclude,
 	const struct path_simplify *simplify)
 {
@@ -1103,7 +1208,9 @@ static enum path_treatment treat_directory(struct dir_struct *dir,
 	if (!(dir->flags & DIR_HIDE_EMPTY_DIRECTORIES))
 		return exclude ? path_excluded : path_untracked;
 
-	return read_directory_recursive(dir, dirname, len, 1, simplify);
+	return read_directory_recursive(dir, dirname, len,
+					lookup_untracked(untracked, dirname, len),
+					1, simplify);
 }
 
 /*
@@ -1219,6 +1326,7 @@ static int get_dtype(struct dirent *de, const char *path, int len)
 }
 
 static enum path_treatment treat_one_path(struct dir_struct *dir,
+					  struct untracked_dir *untracked,
 					  struct strbuf *path,
 					  const struct path_simplify *simplify,
 					  int dtype, struct dirent *de)
@@ -1271,7 +1379,7 @@ static enum path_treatment treat_one_path(struct dir_struct *dir,
 		return path_none;
 	case DT_DIR:
 		strbuf_addch(path, '/');
-		return treat_directory(dir, path->buf, path->len, exclude,
+		return treat_directory(dir, untracked, path->buf, path->len, exclude,
 			simplify);
 	case DT_REG:
 	case DT_LNK:
@@ -1280,13 +1388,28 @@ static enum path_treatment treat_one_path(struct dir_struct *dir,
 }
 
 static enum path_treatment treat_path(struct dir_struct *dir,
-				      struct dirent *de,
+				      struct untracked_dir *untracked,
+				      struct cached_dir *cdir,
 				      struct strbuf *path,
 				      int baselen,
 				      const struct path_simplify *simplify)
 {
 	int dtype;
+	struct dirent *de = cdir->de;
 
+	/* fast path after verifying cached data is still valid */
+	if (!de) {
+		strbuf_setlen(path, baselen);
+		strbuf_addbuf(path, &cdir->name);
+		/*
+		 * treat_one_path() does this before it calls
+		 * treat_directory()
+		 */
+		if (cdir->state == path_recurse &&
+		    path->buf[path->len - 1] != '/')
+			strbuf_addch(path, '/');
+		return cdir->state;
+	}
 	if (is_dot_or_dotdot(de->d_name) || !strcmp(de->d_name, ".git"))
 		return path_none;
 	strbuf_setlen(path, baselen);
@@ -1295,7 +1418,122 @@ static enum path_treatment treat_path(struct dir_struct *dir,
 		return path_none;
 
 	dtype = DTYPE(de);
-	return treat_one_path(dir, path, simplify, dtype, de);
+	return treat_one_path(dir, untracked, path, simplify, dtype, de);
+}
+
+static void add_untracked(struct untracked_dir *dir, const char *name)
+{
+	if (!dir)
+		return;
+	if (dir->untracked_nr < 0)
+		dir->untracked_nr = 0;
+	ALLOC_GROW(dir->untracked, dir->untracked_nr + 1,
+		   dir->untracked_alloc);
+	dir->untracked[dir->untracked_nr++] = xstrdup(name);
+}
+
+static int valid_cached_dir(struct dir_struct *dir,
+			    struct untracked_dir *untracked,
+			    struct strbuf *path)
+{
+	struct stat st;
+	if (stat(path->len ? path->buf : ".", &st)) {
+		memset(&untracked->stat_data, 0, sizeof(untracked->stat_data));
+		invalidate_untracked(untracked);
+		return 0;
+	}
+	if (untracked->untracked_nr < 0 ||
+	    match_stat_data(&untracked->stat_data, &st)) {
+		fill_stat_data(&untracked->stat_data, &st);
+		/*
+		 * We could leave subdirs validated if .gitignore is
+		 * not added or removed, but for now just invalidate
+		 * recursively for safety, there's the an assumption
+		 * recursively for safety; there's an assumption
+		 * everything inside must be invalidated.
+		 */
+		invalidate_untracked(untracked);
+		return 0;
+	}
+
+	/*
+	 * prep_exclude will be called eventually, but it's called
+	 * much later in last_exclude_matching(), deep in
+	 * treat_path(). We need it now to determine the validity of
+	 * the cache. The next call is nearly no-op, the way
+	 * prep_exclude() is designed.
+	 */
+	if (path->len && path->buf[path->len - 1] != '/') {
+		strbuf_addch(path, '/');
+		prep_exclude(dir, path->buf, path->len);
+		strbuf_setlen(path, path->len - 1);
+	} else
+		prep_exclude(dir, path->buf, path->len);
+
+	/* hopefully prep_exclude() will not invalidate this entry... */
+	return untracked->untracked_nr >= 0;
+}
+
+static int open_cached_dir(struct cached_dir *cdir,
+			   struct dir_struct *dir,
+			   struct untracked_dir *untracked,
+			   struct strbuf *path)
+{
+	memset(cdir, 0, sizeof(*cdir));
+	strbuf_init(&cdir->name, 100);
+	cdir->untracked = untracked;
+	if (!untracked || !valid_cached_dir(dir, untracked, path)) {
+		cdir->fdir = opendir(path->len ? path->buf : ".");
+		if (!cdir->fdir)
+			return -1;
+	}
+	return 0;
+}
+
+int read_cached_dir(struct cached_dir *cdir)
+{
+	if (cdir->fdir) {
+		cdir->de = readdir(cdir->fdir);
+		if (!cdir->de)
+			return -1;
+		return 0;
+	}
+	if (cdir->nr_files < cdir->untracked->untracked_nr) {
+		struct untracked_dir *d = cdir->untracked;
+		strbuf_reset(&cdir->name);
+		strbuf_addstr(&cdir->name, d->untracked[cdir->nr_files]);
+		cdir->nr_files++;
+		cdir->de = NULL;
+		cdir->state = path_untracked;
+		return 0;
+	}
+	while (cdir->nr_dirs < cdir->untracked->dirs_nr) {
+		struct untracked_dir *d = cdir->untracked->dirs[cdir->nr_dirs];
+		if (!d->recurse) {
+			cdir->nr_dirs++;
+			continue;
+		}
+		strbuf_reset(&cdir->name);
+		strbuf_addstr(&cdir->name, d->name);
+		cdir->nr_dirs++;
+		cdir->de = NULL;
+		cdir->state = path_recurse;
+		return 0;
+	}
+	return -1;
+}
+
+static void close_cached_dir(struct cached_dir *cdir)
+{
+	if (cdir->fdir)
+		closedir(cdir->fdir);
+	/*
+	 * We have gone through this directory and found no untracked
+	 * entries. Set untracked_nr to zero to make it valid.
+	 */
+	if (cdir->untracked && cdir->untracked->untracked_nr < 0)
+		cdir->untracked->untracked_nr = 0;
+	strbuf_release(&cdir->name);
 }
 
 /*
@@ -1311,30 +1549,35 @@ static enum path_treatment treat_path(struct dir_struct *dir,
  */
 static enum path_treatment read_directory_recursive(struct dir_struct *dir,
 				    const char *base, int baselen,
-				    int check_only,
+				    struct untracked_dir *untracked, int check_only,
 				    const struct path_simplify *simplify)
 {
-	DIR *fdir;
+	struct cached_dir cdir;
 	enum path_treatment state, subdir_state, dir_state = path_none;
-	struct dirent *de;
 	struct strbuf path = STRBUF_INIT;
 
 	strbuf_add(&path, base, baselen);
 
-	fdir = opendir(path.len ? path.buf : ".");
-	if (!fdir)
+	if (open_cached_dir(&cdir, dir, untracked, &path))
 		goto out;
 
-	while ((de = readdir(fdir)) != NULL) {
+	while (!read_cached_dir(&cdir)) {
 		/* check how the file or directory should be treated */
-		state = treat_path(dir, de, &path, baselen, simplify);
+		state = treat_path(dir, untracked, &cdir, &path, baselen, simplify);
+
 		if (state > dir_state)
 			dir_state = state;
 
 		/* recurse into subdir if instructed by treat_path */
 		if (state == path_recurse) {
-			subdir_state = read_directory_recursive(dir, path.buf,
-				path.len, check_only, simplify);
+			struct untracked_dir *ud;
+			ud = lookup_untracked(untracked, path.buf + baselen,
+					      path.len - baselen);
+			if (ud)
+				ud->recurse = 1;
+			subdir_state =
+				read_directory_recursive(dir, path.buf, path.len,
+							 ud, check_only, simplify);
 			if (subdir_state > dir_state)
 				dir_state = subdir_state;
 		}
@@ -1360,15 +1603,18 @@ static enum path_treatment read_directory_recursive(struct dir_struct *dir,
 			break;
 
 		case path_untracked:
-			if (!(dir->flags & DIR_SHOW_IGNORED))
-				dir_add_name(dir, path.buf, path.len);
+			if (dir->flags & DIR_SHOW_IGNORED)
+				break;
+			dir_add_name(dir, path.buf, path.len);
+			if (cdir.fdir)
+				add_untracked(untracked, path.buf + baselen);
 			break;
 
 		default:
 			break;
 		}
 	}
-	closedir(fdir);
+	close_cached_dir(&cdir);
  out:
 	strbuf_release(&path);
 
@@ -1439,7 +1685,7 @@ static int treat_leading_path(struct dir_struct *dir,
 			break;
 		if (simplify_away(sb.buf, sb.len, simplify))
 			break;
-		if (treat_one_path(dir, &sb, simplify,
+		if (treat_one_path(dir, NULL, &sb, simplify,
 				   DT_DIR, NULL) == path_none)
 			break; /* do not recurse into it */
 		if (len <= baselen) {
@@ -1455,6 +1701,7 @@ static int treat_leading_path(struct dir_struct *dir,
 int read_directory(struct dir_struct *dir, const char *path, int len, const struct pathspec *pathspec)
 {
 	struct path_simplify *simplify;
+	struct untracked_dir *untracked = NULL;
 
 	/*
 	 * Check out create_simplify()
@@ -1478,8 +1725,47 @@ int read_directory(struct dir_struct *dir, const char *path, int len, const stru
 	 * create_simplify().
 	 */
 	simplify = create_simplify(pathspec ? pathspec->_raw : NULL);
-	if (!len || treat_leading_path(dir, path, len, simplify))
-		read_directory_recursive(dir, path, len, 0, simplify);
+	/* dir->untracked->dir_flags = DIR_SHOW_OTHER_DIRECTORIES | DIR_HIDE_EMPTY_DIRECTORIES; */
+	/* if EXC_CMDL is loaded, ignore untracked cache */
+	if ((!pathspec || !pathspec->nr) && dir->untracked &&
+	    dir->flags == dir->untracked->dir_flags &&
+	    (dir->exclude_per_dir == dir->untracked->exclude_per_dir || /* both are NULL? */
+	     !strcmp(dir->exclude_per_dir, dir->untracked->exclude_per_dir))) {
+		int i;
+		untracked = &dir->untracked->root;
+		if (hashcmp(dir->info_exclude_sha1,
+			    dir->untracked->info_exclude_sha1)) {
+			invalidate_untracked(untracked);
+			hashcpy(dir->untracked->info_exclude_sha1,
+				dir->info_exclude_sha1);
+		}
+		if (hashcmp(dir->excludes_file_sha1,
+			    dir->untracked->excludes_file_sha1)) {
+			invalidate_untracked(untracked);
+			hashcpy(dir->untracked->excludes_file_sha1,
+				dir->excludes_file_sha1);
+		}
+		untracked->recurse = 1;
+
+		/*
+		 * An optimization in prep_exclude() does not play
+		 * well with CE_SKIP_WORKTREE. It's a rare case
+		 * anyway, if a single entry has that bit set, disable
+		 * the whole untracked cache.
+		 */
+		for (i = 0; i < active_nr; i++)
+			if (ce_skip_worktree(active_cache[i])) {
+				dir->untracked = NULL;
+				untracked = NULL;
+				break;
+			}
+	} else
+		/* make sure untracked cache code path is disabled */
+		dir->untracked = NULL;
+	if (!len || treat_leading_path(dir, path, len, simplify)) {
+		read_directory_recursive(dir, path, len, untracked,
+					 0, simplify);
+	}
 	free_simplify(simplify);
 	qsort(dir->entries, dir->nr, sizeof(struct dir_entry *), cmp_name);
 	qsort(dir->ignored, dir->ignored_nr, sizeof(struct dir_entry *), cmp_name);
@@ -1646,9 +1932,11 @@ void setup_standard_excludes(struct dir_struct *dir)
 		excludes_file = xdg_path;
 	}
 	if (!access_or_warn(path, R_OK, 0))
-		add_excludes_from_file(dir, path, NULL);
+		add_excludes_from_file(dir, path,
+				       dir->untracked ? dir->info_exclude_sha1 : NULL);
 	if (excludes_file && !access_or_warn(excludes_file, R_OK, 0))
-		add_excludes_from_file(dir, excludes_file, NULL);
+		add_excludes_from_file(dir, excludes_file,
+				       dir->untracked ? dir->excludes_file_sha1 : NULL);
 }
 
 int remove_path(const char *name)
diff --git a/dir.h b/dir.h
index 71ad8c6..d293744 100644
--- a/dir.h
+++ b/dir.h
@@ -73,6 +73,31 @@ struct exclude_list_group {
 	struct exclude_list *el;
 };
 
+struct untracked_dir {
+	/* null SHA-1 means this directory does not have .gitignore */
+	unsigned char exclude_sha1[20];
+	struct stat_data stat_data;
+	int recurse;
+	struct untracked_dir **dirs;
+	char **untracked;
+	unsigned dirs_nr, dirs_alloc;
+	/* untracked_nr == -1 means this entry is invalid */
+	int untracked_nr, untracked_alloc;
+	char name[1];
+};
+
+struct untracked_cache {
+	unsigned char info_exclude_sha1[20];
+	unsigned char excludes_file_sha1[20];
+	const char *exclude_per_dir;
+	/*
+	 * dir_struct#flags must match dir_flags or the untracked
+	 * cache is ignored.
+	 */
+	unsigned dir_flags;
+	struct untracked_dir root;
+};
+
 struct dir_struct {
 	int nr, alloc;
 	int ignored_nr, ignored_alloc;
@@ -120,6 +145,12 @@ struct dir_struct {
 	struct exclude_stack *exclude_stack;
 	struct exclude *exclude;
 	struct strbuf basebuf;
+
+	/* Enable untracked file cache if set */
+	struct untracked_cache *untracked;
+	/* temporary data for validating untracked->[same field name] */
+	unsigned char info_exclude_sha1[20];
+	unsigned char excludes_file_sha1[20];
 };
 
 /*
-- 
1.9.1.346.ga2b5940

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [RFC] Speed up "git status" by caching untracked file info
  2014-04-17  5:51 [RFC] Speed up "git status" by caching untracked file info Nguyễn Thái Ngọc Duy
@ 2014-04-17 19:40 ` Junio C Hamano
  2014-04-17 23:27   ` Duy Nguyen
  2014-04-22  9:56 ` Karsten Blees
  1 sibling, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2014-04-17 19:40 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

>            first run  second (cached) run
> gentoo-x86    500 ms             71.6  ms
> wine          140 ms              9.72 ms
> webkit        125 ms              6.88 ms
> linux-2.6     106 ms             16.2  ms
>
> Basically untracked time is cut to one tenth in the best case
> scenario. The final numbers would be a bit higher because I haven't
> stored or read the cache from index yet. Real commit message follows..

As you allude to later with "if you recompile a single file, the
whole hierarchy in that directory is lost", two back-to-back runs of
"git status" are not very interesting.

>  - The list of files and directories of the directory in question
>  - The $GIT_DIR/index
>  - The content of $GIT_DIR/info/exclude
>  - The content of core.excludesfile
>  - The content (or the lack) of .gitignore of all parent directories
>    from $GIT_WORK_TREE
>
> If we can cheaply validate all those inputs for a certain directory,
> we are sure that the current code will always produce the same
> results, so we can cache and reuse those results.
>
> This is not a silver bullet approach. When you compile a C file, for
> example, the old .o file is removed and a new one with the same name
> created, effectively invalidating the containing directory's
> cache. But at least with a large enough work tree, there could be many
> directories you never touch. The cache could help there.
>
> The first input can be checked using directory mtime. In many
> filesystems, directory mtime is updated when direct files/dirs are
> added or removed (*).

An important thing is that creation of new cruft or deletion of
existing cruft can be detected without any false negative with the
mechanism, and mtime on directory would be a good way to check it.

> The second one can be hooked from read-cache.c. Whenever a file (or a
> submodule) is added or removed from a directory, we invalidate that
> directory. This will be done in a later patch.

I would imagine that it would be done at the same places as we
invalidate cache-trees, with the same "invalidation percolates up"
logic.

> On subsequent runs, read_directory_recursive() reads stat info of the
> directory in question and verifies if files/dirs have been added or
> removed.

Hmph.  If you have a two-level hierarchy D1/D2 and you change the
list of crufts in D2 but not in D1, the mtime of D1/D2 changes but
not the mtime of D1, as you observed below.

> With the help of prep_exclude() to verify .gitignore chain,
> it may decide "all is well" and enable the fast path in
> treat_path(). read_directory_recursive() is still called for
> subdirectories even in fast path, because a directory mtime does not
> cover all subdirs recursively.

I wonder if you can avoid recursing into D1 when no cached mtime
(and .gitignore) information has changed in any subdirectory of it
(e.g. both D1 and D1/D2 match the cache).


* Re: [RFC] Speed up "git status" by caching untracked file info
  2014-04-17 19:40 ` Junio C Hamano
@ 2014-04-17 23:27   ` Duy Nguyen
  0 siblings, 0 replies; 8+ messages in thread
From: Duy Nguyen @ 2014-04-17 23:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List

On Fri, Apr 18, 2014 at 2:40 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:
>
>>            first run  second (cached) run
>> gentoo-x86    500 ms             71.6  ms
>> wine          140 ms              9.72 ms
>> webkit        125 ms              6.88 ms
>> linux-2.6     106 ms             16.2  ms
>>
>> Basically untracked time is cut to one tenth in the best case
>> scenario. The final numbers would be a bit higher because I haven't
>> stored or read the cache from index yet. Real commit message follows..
>
> As you allude to later with "if you recompile a single file, the
> whole hierarchy in that directory is lost", two back-to-back runs of
> "git status" is not very interesting.

No, if you recompile in directory A, then we need to recompute exclude
files for A only, not A/B, A/B/C... We only need to invalidate the
whole hierarchy when A/.gitignore (or worse $GIT_DIR/info/exclude) is
changed.

>> The second one can be hooked from read-cache.c. Whenever a file (or a
>> submodule) is added or removed from a directory, we invalidate that
>> directory. This will be done in a later patch.
>
> I would imagine that it would be done at the same places as we
> invalidate cache-trees, with the same "invalidation percolates up"
> logic.

Yep yep.

>> On subsequent runs, read_directory_recursive() reads stat info of the
>> directory in question and verifies if files/dirs have been added or
>> removed.
>
> Hmph.  If you have a two-level hierarchy D1/D2 and you change the
> list of crufts in D2 but not in D1, the mtime of D1/D2 changes but
> not the mtime of D1, as you observed below.
>
>> With the help of prep_exclude() to verify .gitignore chain,
>> it may decide "all is well" and enable the fast path in
>> treat_path(). read_directory_recursive() is still called for
>> subdirectories even in fast path, because a directory mtime does not
>> cover all subdirs recursively.
>
> I wonder if you can avoid recursing into D1 when no cached mtime
> (and .gitignore) information has changed in any subdirectory of it
> (e.g. both D1 and D1/D2 match the cache).

The problem is that when we need to decide whether to recurse into
D1, we have no idea if any of its subdirs has changed. So we need to
recurse in anyway (at least in the cache; if D1 is unchanged, we will
not try to opendir() it, just get the exclude list from the cache and
move on).
-- 
Duy


* Re: [RFC] Speed up "git status" by caching untracked file info
  2014-04-17  5:51 [RFC] Speed up "git status" by caching untracked file info Nguyễn Thái Ngọc Duy
  2014-04-17 19:40 ` Junio C Hamano
@ 2014-04-22  9:56 ` Karsten Blees
  2014-04-22 10:13   ` Duy Nguyen
  1 sibling, 1 reply; 8+ messages in thread
From: Karsten Blees @ 2014-04-22  9:56 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy, git

Am 17.04.2014 07:51, schrieb Nguyễn Thái Ngọc Duy:
> This patch serves as a heads up about a feature I'm working on. I hope
> that by posting it early, people could double check if I have made
> some fundamental mistakes that completely ruin the idea. It's about
> speeding up "git status" by caching untracked file info in the index
> _if_ your file system supports it (more below).
> 
> The whole WIP series is at
> 
> https://github.com/pclouds/git/commits/untracked-cache
> 
> I only post the real meat here. I'm aware of a few incomplete details
> in this patch, but nothing fundamentally wrong. So far the numbers are
> promising.  ls-files is updated to run fill_directory() twice in a
> row and "ls-files -o --directory --no-empty-directory --exclude-standard"
> (with gcc -O0) gives me:
> 
>            first run  second (cached) run
> gentoo-x86    500 ms             71.6  ms
> wine          140 ms              9.72 ms
> webkit        125 ms              6.88 ms

IIRC name_hash.c::lazy_init_name_hash took ~100ms on my system, so hopefully you did a dummy 'cache_name_exists("anything")' before starting the measurement of the first run?

> linux-2.6     106 ms             16.2  ms
> 
> Basically untracked time is cut to one tenth in the best case
> scenario. The final numbers would be a bit higher because I haven't
> stored or read the cache from index yet. Real commit message follows..
> 
> 
> read_directory() plays a big part in the slowness of "git status"
> because it has to read every directory and check for excluded entries,
> which is really expensive. This patch adds an option to cache the
> results so that after the first slow read_directory(), the following
> calls should be cheap and fast.
> 
> The following inputs are sufficient to determine what files in a
> directory are excluded:
> 
>  - The list of files and directories of the directory in question
>  - The $GIT_DIR/index
>  - The content of $GIT_DIR/info/exclude
>  - The content of core.excludesfile
>  - The content (or the lack) of .gitignore of all parent directories
>    from $GIT_WORK_TREE
> 

The dir_struct.flags also play a big role in the evaluation of read_directory().

E.g. it seems untracked files are not properly recorded if the cache is filled with '--ignored' option:

> @@ -1360,15 +1603,18 @@ static enum path_treatment read_directory_recursive(struct dir_struct *dir,
>  			break;
>  
>  		case path_untracked:
> -			if (!(dir->flags & DIR_SHOW_IGNORED))
> -				dir_add_name(dir, path.buf, path.len);
> +			if (dir->flags & DIR_SHOW_IGNORED)
> +				break;
> +			dir_add_name(dir, path.buf, path.len);
> +			if (cdir.fdir)
> +				add_untracked(untracked, path.buf + baselen);
>  			break;

Similarly, the '--directory' option controls early returns from the directory scan (via read_directory_recursive's check_only argument), so you won't be able to get a full untracked files listing if the cache was recorded with '--directory'. Additionally, '--directory' aggregates the state at the topmost untracked directory, so that directory's cached state depends on all sub-directories as well...

I wonder if it makes sense to separate cache recording logic from read_directory_recursive and friends, which are mainly concerned with flags processing.

> If we can cheaply validate all those inputs for a certain directory,
> we are sure that the current code will always produce the same
> results, so we can cache and reuse those results.
> 
> This is not a silver bullet approach. When you compile a C file, for
> example, the old .o file is removed and a new one with the same name
> created, effectively invalidating the containing directory's
> cache. But at least with a large enough work tree, there could be many
> directories you never touch. The cache could help there.
> 
> The first input can be checked using directory mtime. In many
> filesystems, directory mtime is updated when direct files/dirs are
> added or removed (*). If you do not use such a file system, this
> feature is not for you.
> 
> The second one can be hooked from read-cache.c. Whenever a file (or a
> submodule) is added or removed from a directory, we invalidate that
> directory. This will be done in a later patch.
> 
> The remaining inputs are easy, their SHA-1 could be used to verify
> their contents. We do need to read .gitignore files and digest
> them. But they are usually few and small, so the overhead should not
> be much.
> 
> At the implementation level, the whole directory structure is saved,
> each directory corresponds to one struct untracked_dir.

With the usual options (e.g. standard 'git status'), untracked directories are mostly skipped, so the cache would mostly store tracked directories. Naming it 'struct untracked_dir' is a bit confusing, IMO.

> Each directory
> holds SHA-1 of the .gitignore underneath (or null if it does not
> exist) and the list of untracked "files" and subdirs that need to
> recurse into if all is well. Untracked subdirectories are saved in the
> file queue and are the reason of quoting "files" in the previous
> sentence.
> 
> On the first run, no untracked_dir is valid, the default code path is
> run. prep_exclude() is updated to record SHA-1 of .gitignore along the
> way. read_directory_recursive() is updated to record untracked files.
> 
> On subsequent runs, read_directory_recursive() reads stat info of the
> directory in question and verifies if files/dirs have been added or
> removed. With the help of prep_exclude() to verify .gitignore chain,
> it may decide "all is well" and enable the fast path in
> treat_path(). read_directory_recursive() is still called for
> subdirectories even in fast path, because a directory mtime does not
> cover all subdirs recursively.
> 
> So if all is really well, read_directory() becomes a series of
> open(".gitignore"), read(".gitignore"), close(), hash_sha1_file() and
> stat(<dir>) _without_ heavyweight exclude filtering. There should be
> no overhead if this feature is disabled.
> 

Wouldn't mtime of .gitignore files suffice here (so you don't need to open and parse them every time)?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Speed up "git status" by caching untracked file info
  2014-04-22  9:56 ` Karsten Blees
@ 2014-04-22 10:13   ` Duy Nguyen
  2014-04-22 10:35     ` Duy Nguyen
  0 siblings, 1 reply; 8+ messages in thread
From: Duy Nguyen @ 2014-04-22 10:13 UTC (permalink / raw)
  To: Karsten Blees; +Cc: Git Mailing List

On Tue, Apr 22, 2014 at 4:56 PM, Karsten Blees <karsten.blees@gmail.com> wrote:
> Am 17.04.2014 07:51, schrieb Nguyễn Thái Ngọc Duy:
>> This patch serves as a heads up about a feature I'm working on. I hope
>> that by posting it early, people could double check if I have made
>> some fundamental mistakes that completely ruin the idea. It's about
>> speeding up "git status" by caching untracked file info in the index
>> _if_ your file system supports it (more below).
>>
>> The whole WIP series is at
>>
>> https://github.com/pclouds/git/commits/untracked-cache
>>
>> I only post the real meat here. I'm aware of a few incomplete details
>> in this patch, but nothing fundamentally wrong. So far the numbers are
>> promising.  ls-files is updated to run fill_directory() twice in a
>> row and "ls-files -o --directory --no-empty-directory --exclude-standard"
>> (with gcc -O0) gives me:
>>
>>            first run  second (cached) run
>> gentoo-x86    500 ms             71.6  ms
>> wine          140 ms              9.72 ms
>> webkit        125 ms              6.88 ms
>
> IIRC name_hash.c::lazy_init_name_hash took ~100ms on my system, so hopefully you did a dummy 'cache_name_exists("anything")' before starting the measurement of the first run?

No I didn't. Thanks for pointing it out. I'll see if I can reduce its time.

>> The following inputs are sufficient to determine what files in a
>> directory are excluded:
>>
>>  - The list of files and directories of the directory in question
>>  - The $GIT_DIR/index
>>  - The content of $GIT_DIR/info/exclude
>>  - The content of core.excludesfile
>>  - The content (or the lack) of .gitignore of all parent directories
>>    from $GIT_WORK_TREE
>>
>
> The dir_struct.flags also play a big role in evaluation of read_directory.
>
> E.g. it seems untracked files are not properly recorded if the cache is filled with the '--ignored' option:

Yeah. dir_struct.flags will be part of the input. I intend to optimize
"git status" case only, so if it matches the recorded
dir_struct.flags, the cache is used. Else the cache is ignored.
Caching --ignored is not so interesting, because the list of ignored
files could be huge, while untracked listing is usually small.
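To make that concrete, the cache-key check could look roughly like this. A
sketch only, not the actual patch: the flag names mirror dir.h, but
'struct untracked_cache' and 'cache_usable' are made-up names for
illustration.

```c
/* Flag values mirror git's dir.h but are re-declared here for the sketch. */
#define DIR_SHOW_IGNORED           (1 << 0)
#define DIR_SHOW_OTHER_DIRECTORIES (1 << 1)
#define DIR_HIDE_EMPTY_DIRECTORIES (1 << 2)

/* Hypothetical: the cache records the dir_struct.flags it was built with. */
struct untracked_cache {
	unsigned recorded_flags;
};

/* Return 1 if the cache may serve a query made with 'query_flags'. */
static int cache_usable(const struct untracked_cache *uc, unsigned query_flags)
{
	/*
	 * Caching --ignored output is not supported: the ignored listing
	 * can be huge, while the recorded data only covers untracked files.
	 */
	if (query_flags & DIR_SHOW_IGNORED)
		return 0;
	return uc->recorded_flags == query_flags;
}
```

A "git status"-style query whose flags match the recorded ones uses the
cache; anything else falls back to the full scan.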

>> @@ -1360,15 +1603,18 @@ static enum path_treatment read_directory_recursive(struct dir_struct *dir,
>>                       break;
>>
>>               case path_untracked:
>> -                     if (!(dir->flags & DIR_SHOW_IGNORED))
>> -                             dir_add_name(dir, path.buf, path.len);
>> +                     if (dir->flags & DIR_SHOW_IGNORED)
>> +                             break;
>> +                     dir_add_name(dir, path.buf, path.len);
>> +                     if (cdir.fdir)
>> +                             add_untracked(untracked, path.buf + baselen);
>>                       break;
>
> Similarly, the '--directory' option controls early returns from the directory scan (via read_directory_recursive's check_only argument), so you won't be able to get a full untracked files listing if the cache was recorded with '--directory'. Additionally, '--directory' aggregates the state at the topmost untracked directory, so that directory's cached state depends on all sub-directories as well...

I missed this. We could ignore check_only if caching is enabled, but
that does not sound really good. Let me think about it more..

>
> I wonder if it makes sense to separate cache recording logic from read_directory_recursive and friends, which are mainly concerned with flags processing.

The core code path is still shared though, or we would duplicate r_d_r
entirely for cache recording, which sounds like a maintenance
nightmare.

>> At the implementation level, the whole directory structure is saved,
>> each directory corresponds to one struct untracked_dir.
>
> With the usual options (e.g. standard 'git status'), untracked directories are mostly skipped, so the cache would mostly store tracked directories. Naming it 'struct untracked_dir' is a bit confusing, IMO.

It's actually just "directories". We may need to store both tracked
and untracked directories. Maybe renaming it to cached_dir..

>> So if all is really well, read_directory() becomes a series of
>> open(".gitignore"), read(".gitignore"), close(), hash_sha1_file() and
>> stat(<dir>) _without_ heavyweight exclude filtering. There should be
>> no overhead if this feature is disabled.
>>
>
> Wouldn't mtime of .gitignore files suffice here (so you don't need to open and parse them every time)?

That's a further optimization. With the current code it's simpler to
open .gitignore. Assume you have a path a/b/c. a/.gitignore's stat
info is good, so you skip opening it. Then you find a/b/.gitignore is
modified and you need to recompute untracked files in a/b. To do that
you need a/.gitignore as well. Lazily opening a/.gitignore at this
stage is possible, but trickier (you have to make sure the rules are
in correct order because of negative patterns).

Anyway, the number of .gitignore files is usually small. We can
already avoid opening non-existent .gitignore, which is proportional
to the number of directories.
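For what it's worth, the validation described above boils down to something
like this sketch. A toy checksum stands in for SHA-1 (the real code would go
through hash_sha1_file()), and 'struct gitignore_record' / 'gitignore_valid'
are invented names, not part of the patch.

```c
#include <string.h>

/* Toy stand-in for SHA-1; only the compare-against-cache idea matters. */
static unsigned long toy_hash(const char *buf, size_t len)
{
	unsigned long h = 5381;
	for (size_t i = 0; i < len; i++)
		h = h * 33 + (unsigned char)buf[i];
	return h;
}

/* Hypothetical cache record for one directory's .gitignore. */
struct gitignore_record {
	unsigned long recorded_hash; /* 0 means the file was absent */
};

/*
 * Re-read the current .gitignore content and compare with the cache.
 * Returns 1 if the exclude rules for this directory are unchanged.
 */
static int gitignore_valid(const struct gitignore_record *rec,
			   const char *content /* NULL if file is absent */)
{
	unsigned long h = content ? toy_hash(content, strlen(content)) : 0;
	return h == rec->recorded_hash;
}
```

If any .gitignore in the chain fails this check, the untracked data for the
affected directories is recomputed.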
-- 
Duy


* Re: [RFC] Speed up "git status" by caching untracked file info
  2014-04-22 10:13   ` Duy Nguyen
@ 2014-04-22 10:35     ` Duy Nguyen
  2014-04-22 18:56       ` Karsten Blees
  0 siblings, 1 reply; 8+ messages in thread
From: Duy Nguyen @ 2014-04-22 10:35 UTC (permalink / raw)
  To: Karsten Blees; +Cc: Git Mailing List

On Tue, Apr 22, 2014 at 5:13 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>> IIRC name_hash.c::lazy_init_name_hash took ~100ms on my system, so hopefully you did a dummy 'cache_name_exists("anything")' before starting the measurement of the first run?
>
> No I didn't. Thanks for pointing it out. I'll see if I can reduce its time.

Well name-hash is only used when core.ignorecase is set. So it's
optional. Maybe we could save it in a separate index extension, but we
need to verify that the reader uses the same hash function as the
writer.

>> Similarly, the '--directory' option controls early returns from the directory scan (via read_directory_recursive's check_only argument), so you won't be able to get a full untracked files listing if the cache was recorded with '--directory'. Additionally, '--directory' aggregates the state at the topmost untracked directory, so that directory's cached state depends on all sub-directories as well...
>
> I missed this. We could ignore check_only if caching is enabled, but
> that does not sound really good. Let me think about it more..

We could save "check_only" to the cache as well. This way we don't
have to disable the check_only trick completely.

So we process a directory with check_only set, find one untracked
entry and stop short. We store check_only value and the status ("found
something") in addition to dir mtime. Next time we check the dir's
mtime. If it matches and is called with check_only set, we know there
is at least one untracked entry, that's enough to stop r_d_r and
return early. If dir mtime does not match, or r_d_r is called without
check_only, we ignore the cached data and fall back to opendir.
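Roughly, in (entirely made-up) code:

```c
#include <time.h>

/* Hypothetical per-directory cache record, as proposed above. */
struct dir_cache_entry {
	time_t mtime;            /* directory mtime when the record was made */
	int recorded_check_only; /* scan stopped at the first untracked entry */
	int found_untracked;     /* at least one untracked entry was seen */
};

enum cache_result {
	CACHE_MISS,          /* fall back to opendir() and a real scan */
	CACHE_HAS_UNTRACKED, /* enough for a check_only caller to stop early */
	CACHE_USE            /* cached listing is complete and still valid */
};

static enum cache_result cache_lookup(const struct dir_cache_entry *e,
				      time_t current_mtime, int check_only)
{
	if (e->mtime != current_mtime)
		return CACHE_MISS;  /* directory changed */
	if (check_only)
		return e->found_untracked ? CACHE_HAS_UNTRACKED : CACHE_USE;
	if (e->recorded_check_only)
		return CACHE_MISS;  /* early-out scan: listing is incomplete */
	return CACHE_USE;
}
```

A record written during an early-out (check_only) scan can answer another
check_only query, but a caller that wants the full listing has to fall back
to opendir.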

Sounds good?
-- 
Duy


* Re: [RFC] Speed up "git status" by caching untracked file info
  2014-04-22 10:35     ` Duy Nguyen
@ 2014-04-22 18:56       ` Karsten Blees
  2014-04-23  0:52         ` Duy Nguyen
  0 siblings, 1 reply; 8+ messages in thread
From: Karsten Blees @ 2014-04-22 18:56 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git Mailing List

Am 22.04.2014 12:35, schrieb Duy Nguyen:
> On Tue, Apr 22, 2014 at 5:13 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>>> IIRC name_hash.c::lazy_init_name_hash took ~100ms on my system, so hopefully you did a dummy 'cache_name_exists("anything")' before starting the measurement of the first run?
>>
>> No I didn't. Thanks for pointing it out. I'll see if I can reduce its time.
> 
> Well name-hash is only used when core.ignorecase is set. So it's
> optional.

This is only true for the case-insensitive directory hash. The file hash ('cache_file_exists') is always used to skip expensive excluded checks for tracked files. 

'cache_file_exists' basically trades higher setup costs for faster lookups, which makes perfect sense when scanning the entire work tree. However, if most of the directory info is cached and just a few directories need refresh (and core.ignorecase=false), binary search ('cache_name_pos') may be better. The difficulty is to decide when to choose one over the other :-)

> Maybe we could save it in a separate index extension, but we
> need to verify that the reader uses the same hash function as the
> writer.
> 
>>> Similarly, the '--directory' option controls early returns from the directory scan (via read_directory_recursive's check_only argument), so you won't be able to get a full untracked files listing if the cache was recorded with '--directory'. Additionally, '--directory' aggregates the state at the topmost untracked directory, so that directory's cached state depends on all sub-directories as well...
>>
>> I missed this. We could ignore check_only if caching is enabled, but
>> that does not sound really good. Let me think about it more..
> 
> We could save "check_only" to the cache as well. This way we don't
> have to disable the check_only trick completely.
> 
> So we process a directory with check_only set, find one untracked
> entry and stop short. We store check_only value and the status ("found
> something") in addition to dir mtime. Next time we check the dir's
> mtime. If it matches and is called with check_only set, we know there
> is at least one untracked entry, that's enough to stop r_d_r and
> return early. If dir mtime does not match, or r_d_r is called without
> check_only, we ignore the cached data and fall back to opendir.
> 
> Sounds good?
> 

What about untracked files in sub-directories? E.g. you have untracked dirs a/b with untracked file a/b/c, so normal 'git status' would list 'a/' as untracked.
Now, 'rm a/b/c' would update mtime of b, but not of a, so you'd still list 'a/' as untracked. Same thing for 'echo "c" >a/b/.gitignore'.

Your solution could work if you additionally cache the directories that had to be scanned to find that first untracked file (but you probably had that in mind anyway).

If the cache is only used for certain dir_struct.flags combinations, you can probably get around saving the check_only flag (which can only ever be true if both DIR_SHOW_OTHER_DIRECTORIES and DIR_HIDE_EMPTY_DIRECTORIES are set (which is the default for 'git status')).


* Re: [RFC] Speed up "git status" by caching untracked file info
  2014-04-22 18:56       ` Karsten Blees
@ 2014-04-23  0:52         ` Duy Nguyen
  0 siblings, 0 replies; 8+ messages in thread
From: Duy Nguyen @ 2014-04-23  0:52 UTC (permalink / raw)
  To: Karsten Blees; +Cc: Git Mailing List

On Wed, Apr 23, 2014 at 1:56 AM, Karsten Blees <karsten.blees@gmail.com> wrote:
> Am 22.04.2014 12:35, schrieb Duy Nguyen:
>> On Tue, Apr 22, 2014 at 5:13 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>>>> IIRC name_hash.c::lazy_init_name_hash took ~100ms on my system, so hopefully you did a dummy 'cache_name_exists("anything")' before starting the measurement of the first run?
>>>
>>> No I didn't. Thanks for pointing it out. I'll see if I can reduce its time.
>>
>> Well name-hash is only used when core.ignorecase is set. So it's
>> optional.
>
> This is only true for the case-insensitive directory hash. The file hash ('cache_file_exists') is always used to skip expensive excluded checks for tracked files.
>
> 'cache_file_exists' basically treats faster lookups for higher setup costs, which makes perfect sense when scanning the entire work tree. However, if most of the directory info is cached and just a few directories need refresh (and core.ignorecase=false), binary search ('cache_name_pos') may be better. The difficulty is to decide when to choose one over the other :-)

Right. The problem is even if untracked cache is used, we don't know
in advance how many cache_file_exists calls we need to make. If .gitignore
changes, we could see how many directories are invalidated recursively
and that could be an indicator for favoring cache_file_exists over
cache_name_pos. It's harder when dir mtime changes, I suppose we could
be optimistic and stick to cache_name_pos until the number of calls
gets over a limit and turn to cache_file_exists. May backfire though..
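Sketched out, the optimistic switch-over could look like this. Everything
here is hypothetical: cache_name_pos and cache_file_exists are stubbed with
a plain binary search over a tiny index, and the threshold is an arbitrary
number that would need tuning against lazy_init_name_hash's real cost.

```c
#include <string.h>

#define HASH_THRESHOLD 4 /* made-up cutoff; the real value needs tuning */

static const char *index_entries[] = { "Makefile", "dir.c", "dir.h" };
static const int nr_entries = 3;

/* Stand-in for cache_name_pos(): binary search over the sorted index. */
static int binary_search(const char *path)
{
	int lo = 0, hi = nr_entries;
	while (lo < hi) {
		int mid = (lo + hi) / 2;
		int cmp = strcmp(index_entries[mid], path);
		if (!cmp)
			return 1;
		if (cmp < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}

/*
 * Stand-in for cache_file_exists(); the real one builds a hash table of
 * every index entry up front (lazy_init_name_hash), which is the setup
 * cost this heuristic tries to amortize.
 */
static int hash_lookup(const char *path)
{
	return binary_search(path);
}

struct lookup_state {
	int calls;      /* lookups made so far */
	int hash_ready; /* switched to the hash table */
};

static int index_has_file(struct lookup_state *st, const char *path)
{
	if (!st->hash_ready && ++st->calls > HASH_THRESHOLD)
		st->hash_ready = 1; /* would call lazy_init_name_hash() here */
	return st->hash_ready ? hash_lookup(path) : binary_search(path);
}
```

The first few lookups go through binary search; once the call count passes
the threshold, the hash table is (lazily) built and used from then on.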

>
>> Maybe we could save it in a separate index extension, but we
>> need to verify that the reader uses the same hash function as the
>> writer.
>>
>>>> Similarly, the '--directory' option controls early returns from the directory scan (via read_directory_recursive's check_only argument), so you won't be able to get a full untracked files listing if the cache was recorded with '--directory'. Additionally, '--directory' aggregates the state at the topmost untracked directory, so that directory's cached state depends on all sub-directories as well...
>>>
>>> I missed this. We could ignore check_only if caching is enabled, but
>>> that does not sound really good. Let me think about it more..
>>
>> We could save "check_only" to the cache as well. This way we don't
>> have to disable the check_only trick completely.
>>
>> So we process a directory with check_only set, find one untracked
>> entry and stop short. We store check_only value and the status ("found
>> something") in addition to dir mtime. Next time we check the dir's
>> mtime. If it matches and is called with check_only set, we know there
>> is at least one untracked entry, that's enough to stop r_d_r and
>> return early. If dir mtime does not match, or r_d_r is called without
>> check_only, we ignore the cached data and fall back to opendir.
>>
>> Sounds good?
>>
>
> What about untracked files in sub-directories? E.g. you have untracked dirs a/b with untracked file a/b/c, so normal 'git status' would list 'a/' as untracked.
> Now, 'rm a/b/c' would update mtime of b, but not of a, so you'd still list 'a/' as untracked. Same thing for 'echo "c" >a/b/.gitignore'.
>
> Your solution could work if you additionally cache the directories that had to be scanned to find that first untracked file (but you probably had that in mind anyway).

Basically all directories that are touched by r_d_r() will be cached.

> If the cache is only used for certain dir_struct.flags combinations, you can probably get around saving the check_only flag (which can only ever be true if both DIR_SHOW_OTHER_DIRECTORIES and DIR_HIDE_EMPTY_DIRECTORIES are set (which is the default for 'git status')).
-- 
Duy


end of thread

Thread overview: 8+ messages
2014-04-17  5:51 [RFC] Speed up "git status" by caching untracked file info Nguyễn Thái Ngọc Duy
2014-04-17 19:40 ` Junio C Hamano
2014-04-17 23:27   ` Duy Nguyen
2014-04-22  9:56 ` Karsten Blees
2014-04-22 10:13   ` Duy Nguyen
2014-04-22 10:35     ` Duy Nguyen
2014-04-22 18:56       ` Karsten Blees
2014-04-23  0:52         ` Duy Nguyen
