git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* git-add has gone lstat() mad
@ 2007-03-30 19:55 Andy Parkins
  2007-03-30 20:20 ` Andy Parkins
  0 siblings, 1 reply; 10+ messages in thread
From: Andy Parkins @ 2007-03-30 19:55 UTC (permalink / raw)
  To: git

Hello,

I was interested in trying out the GIT_WORK_DIR stuff, but ended up 
being unable to.  The thing that stopped me is present in master as 
well:

 $ git --version
 git version 1.5.1.rc3.20.gaa453
 $ cd $HOME
 $ git init
 $ git add .bashrc

At this point the CPU pegs at 100% systime.  An strace shows that git is 
calling lstat64() on every file in my home directory.  I killed git 
before it scanned everything I've ever done.

I only want to track one file; is git meant to scan every file in the 
directory even though I'm not adding any of them?



Andy
-- 
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: git-add has gone lstat() mad
  2007-03-30 19:55 git-add has gone lstat() mad Andy Parkins
@ 2007-03-30 20:20 ` Andy Parkins
  2007-03-31  1:38   ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Andy Parkins @ 2007-03-30 20:20 UTC (permalink / raw)
  To: git

On Friday 2007, March 30, Andy Parkins wrote:

> At this point the CPU pegs at 100% systime.  An strace shows that git
> is calling lstat64() on every file in my home directory.  I killed
> git before it scanned everything I've ever done.

Okay.  I've tracked down the culprit function, but I have no idea what 
the fix is.

builtin-add.c:fill_directory() calls
dir.c:read_directory() which calls
dir.c:read_directory_recursive()

I can't see why git feels that it has to recurse the entire subtree.  It 
seems to be something to do with the gitignore stuff.  Surely there is 
no need to use a recursive search when no directories are being added?

If git-add were given

 file1
 dir1/file2
 dir2/dir3/file3

Then only the directories "."; "dir1/"; "dir2"; and "dir2/dir3" need 
checking for .gitignore files; and in none of those cases does the 
search need to be recursive.



Andy

-- 
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: git-add has gone lstat() mad
  2007-03-30 20:20 ` Andy Parkins
@ 2007-03-31  1:38   ` Junio C Hamano
  2007-03-31  3:39     ` Linus Torvalds
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2007-03-31  1:38 UTC (permalink / raw)
  To: Andy Parkins; +Cc: git

Andy Parkins <andyparkins@gmail.com> writes:

> I can't see why git feels that it has to recurse the entire subtree.  It 
> seems to be something to do with the gitignore stuff.  Surely there is 
> no need to use a recursive search when no directories are being added?

I do not think this is anything new.

> If git-add were given
>
>  file1
>  dir1/file2
>  dir2/dir3/file3

The thing is, you are not giving the above three pathnames,
although you might think you are.

You are giving three path *patterns* and asking git-add: "please
run 'git ls-files --others' and add the ones that match these
patterns".

You can teach it to detect cases where you do not have wildcard
(that is both shell glob wildcard and directory names; the
latter means "grab everything in that named directory") to limit
the set of directories to descend into.

Patches are welcome, but applying them to 'master' needs to wait
post 1.5.1.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: git-add has gone lstat() mad
  2007-03-31  1:38   ` Junio C Hamano
@ 2007-03-31  3:39     ` Linus Torvalds
  2007-03-31 10:18       ` Andy Parkins
  2007-04-01  0:39       ` Junio C Hamano
  0 siblings, 2 replies; 10+ messages in thread
From: Linus Torvalds @ 2007-03-31  3:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Andy Parkins, Git Mailing List



On Fri, 30 Mar 2007, Junio C Hamano wrote:

> Andy Parkins <andyparkins@gmail.com> writes:
> 
> > I can't see why git feels that it has to recurse the entire subtree.  It 
> > seems to be something to do with the gitignore stuff.  Surely there is 
> > no need to use a recursive search when no directories are being added?
> 
> I do not think this is anything new.

Yeah, it's probably old. That said, it's still ugly.

> Patches are welcome, but applying them to 'master' needs to wait
> post 1.5.1.

Here's a patch. It passes all tests. It's not that complex. But people 
should double-check. ESPECIALLY the list of special characters (currently 
'?' '*' '\\' and '[').

Andy, does it work for you?

NOTE! It changes the "const char **pathspec" into an array of "struct 
path_simplify", because that way it doesn't need to check for the magic 
shell expansion characters over and over and over and over again, and can 
just do it once up-front. All the real meat of the patch is really that 
conversion, and somebody should double-check that I actually got all the 
special characters..

The way things are set up, you can now pass a "pathspec" to the 
"read_directory()" function. If you pass NULL, it acts exactly like it 
used to do (read everything). If you pass a non-NULL pointer, it will 
simplify it into a "these are the prefixes without any special 
characters", and stop any readdir() early if the path in question doesn't 
match any of the prefixes.

NOTE! This does *not* obviate the need for the caller to do the *exact* 
pathspec match later. It's a first-level filter on "read_directory()", but 
it does not do the full pathspec thing. Maybe it should. But in the 
meantime, builtin-add.c really does need to do first

	read_directory(dir, .., pathspec);
	if (pathspec)
		prune_directory(dir, pathspec, baselen);

ie the "prune_directory()" part will do the *exact* pathspec pruning, 
while the "read_directory()" will use the pathspec just to do some quick 
high-level pruning of the directories it will recurse into.

Does this matter? I can say that:

	git ls-files -o Makefile~

on the kernel took 0.110s for me before, and it now takes 0.014s. And 
maybe Andy's case more noticeable. Andy?

		Linus
---
 builtin-add.c      |    2 +-
 builtin-ls-files.c |    2 +-
 dir.c              |   96 +++++++++++++++++++++++++++++++++++++++++++++++++---
 dir.h              |    2 +-
 wt-status.c        |    2 +-
 5 files changed, 95 insertions(+), 9 deletions(-)

diff --git a/builtin-add.c b/builtin-add.c
index 9fcf514..871e23f 100644
--- a/builtin-add.c
+++ b/builtin-add.c
@@ -87,7 +87,7 @@ static void fill_directory(struct dir_struct *dir, const char **pathspec)
 	}
 
 	/* Read the directory and prune it */
-	read_directory(dir, path, base, baselen);
+	read_directory(dir, path, base, baselen, pathspec);
 	if (pathspec)
 		prune_directory(dir, pathspec, baselen);
 }
diff --git a/builtin-ls-files.c b/builtin-ls-files.c
index 4e1d5af..74a6aca 100644
--- a/builtin-ls-files.c
+++ b/builtin-ls-files.c
@@ -216,7 +216,7 @@ static void show_files(struct dir_struct *dir, const char *prefix)
 
 		if (baselen)
 			path = base = prefix;
-		read_directory(dir, path, base, baselen);
+		read_directory(dir, path, base, baselen, pathspec);
 		if (show_others)
 			show_other_files(dir);
 		if (show_killed)
diff --git a/dir.c b/dir.c
index b48e19d..f1cf278 100644
--- a/dir.c
+++ b/dir.c
@@ -8,6 +8,11 @@
 #include "cache.h"
 #include "dir.h"
 
+struct path_simplify {
+	int len;
+	const char *path;
+};
+
 int common_prefix(const char **pathspec)
 {
 	const char *path, *slash, *next;
@@ -293,6 +298,31 @@ static int dir_exists(const char *dirname, int len)
 }
 
 /*
+ * This is an inexact early pruning of any recursive directory
+ * reading - if the path cannot possibly be in the pathspec,
+ * return true, and we'll skip it early.
+ */
+static int simplify_away(const char *path, int pathlen, const struct path_simplify *simplify)
+{
+	if (simplify) {
+		for (;;) {
+			const char *match = simplify->path;
+			int len = simplify->len;
+
+			if (!match)
+				break;
+			if (len > pathlen)
+				len = pathlen;
+			if (!memcmp(path, match, len))
+				return 0;
+			simplify++;
+		}
+		return 1;
+	}
+	return 0;
+}
+
+/*
  * Read a directory tree. We currently ignore anything but
  * directories, regular files and symlinks. That's because git
  * doesn't handle them at all yet. Maybe that will change some
@@ -301,7 +331,7 @@ static int dir_exists(const char *dirname, int len)
  * Also, we ignore the name ".git" (even if it is not a directory).
  * That likely will not change.
  */
-static int read_directory_recursive(struct dir_struct *dir, const char *path, const char *base, int baselen, int check_only)
+static int read_directory_recursive(struct dir_struct *dir, const char *path, const char *base, int baselen, int check_only, const struct path_simplify *simplify)
 {
 	DIR *fdir = opendir(path);
 	int contents = 0;
@@ -324,6 +354,8 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 				continue;
 			len = strlen(de->d_name);
 			memcpy(fullname + baselen, de->d_name, len+1);
+			if (simplify_away(fullname, baselen + len, simplify))
+				continue;
 			if (excluded(dir, fullname) != dir->show_ignored) {
 				if (!dir->show_ignored || DTYPE(de) != DT_DIR) {
 					continue;
@@ -350,13 +382,13 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 					if (dir->hide_empty_directories &&
 					    !read_directory_recursive(dir,
 						    fullname, fullname,
-						    baselen + len, 1))
+						    baselen + len, 1, simplify))
 						continue;
 					break;
 				}
 
 				contents += read_directory_recursive(dir,
-					fullname, fullname, baselen + len, 0);
+					fullname, fullname, baselen + len, 0, simplify);
 				continue;
 			case DT_REG:
 			case DT_LNK:
@@ -386,8 +418,61 @@ static int cmp_name(const void *p1, const void *p2)
 				  e2->name, e2->len);
 }
 
-int read_directory(struct dir_struct *dir, const char *path, const char *base, int baselen)
+/*
+ * Return the length of the "simple" part of a path match limiter.
+ */
+static int simple_length(const char *match)
 {
+	const char special[256] = {
+		[0] = 1, ['?'] = 1,
+		['\\'] = 1, ['*'] = 1,
+		['['] = 1
+	};
+	int len = -1;
+
+	for (;;) {
+		unsigned char c = *match++;
+		len++;
+		if (special[c])
+			return len;
+	}
+}
+
+static struct path_simplify *create_simplify(const char **pathspec)
+{
+	int nr, alloc = 0;
+	struct path_simplify *simplify = NULL;
+
+	if (!pathspec)
+		return NULL;
+
+	for (nr = 0 ; ; nr++) {
+		const char *match;
+		if (nr >= alloc) {
+			alloc = alloc_nr(alloc);
+			simplify = xrealloc(simplify, alloc * sizeof(*simplify));
+		}
+		match = *pathspec++;
+		if (!match)
+			break;
+		simplify[nr].path = match;
+		simplify[nr].len = simple_length(match);
+	}
+	simplify[nr].path = NULL;
+	simplify[nr].len = 0;
+	return simplify;
+}
+
+static void free_simplify(struct path_simplify *simplify)
+{
+	if (simplify)
+		free(simplify);
+}
+
+int read_directory(struct dir_struct *dir, const char *path, const char *base, int baselen, const char **pathspec)
+{
+	struct path_simplify *simplify = create_simplify(pathspec);
+
 	/*
 	 * Make sure to do the per-directory exclude for all the
 	 * directories leading up to our base.
@@ -414,7 +499,8 @@ int read_directory(struct dir_struct *dir, const char *path, const char *base, i
 		}
 	}
 
-	read_directory_recursive(dir, path, base, baselen, 0);
+	read_directory_recursive(dir, path, base, baselen, 0, simplify);
+	free_simplify(simplify);
 	qsort(dir->entries, dir->nr, sizeof(struct dir_entry *), cmp_name);
 	return dir->nr;
 }
diff --git a/dir.h b/dir.h
index 7233d65..33c31f2 100644
--- a/dir.h
+++ b/dir.h
@@ -48,7 +48,7 @@ extern int common_prefix(const char **pathspec);
 #define MATCHED_EXACTLY 3
 extern int match_pathspec(const char **pathspec, const char *name, int namelen, int prefix, char *seen);
 
-extern int read_directory(struct dir_struct *, const char *path, const char *base, int baselen);
+extern int read_directory(struct dir_struct *, const char *path, const char *base, int baselen, const char **pathspec);
 extern int push_exclude_per_directory(struct dir_struct *, const char *, int);
 extern void pop_exclude_per_directory(struct dir_struct *, int);
 
diff --git a/wt-status.c b/wt-status.c
index a25632b..a055990 100644
--- a/wt-status.c
+++ b/wt-status.c
@@ -260,7 +260,7 @@ static void wt_status_print_untracked(struct wt_status *s)
 	if (file_exists(x))
 		add_excludes_from_file(&dir, x);
 
-	read_directory(&dir, ".", "", 0);
+	read_directory(&dir, ".", "", 0, NULL);
 	for(i = 0; i < dir.nr; i++) {
 		/* check for matching entry, which is unmerged; lifted from
 		 * builtin-ls-files:show_other_files */

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: git-add has gone lstat() mad
  2007-03-31  3:39     ` Linus Torvalds
@ 2007-03-31 10:18       ` Andy Parkins
  2007-03-31 14:54         ` Randal L. Schwartz
  2007-04-01  0:39       ` Junio C Hamano
  1 sibling, 1 reply; 10+ messages in thread
From: Andy Parkins @ 2007-03-31 10:18 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds, Junio C Hamano

On Saturday 2007, March 31, Linus Torvalds wrote:

> on the kernel took 0.110s for me before, and it now takes 0.014s. And
> maybe Andy's case more noticeable. Andy?

Blindingly fast.

Previously, I never completed a git add .bashrc as it was taking so 
long.  Now it's instant.  This is back to true git form - I wasn't 
entirely sure git-add had done anything :-).  git-status confirmed that 
it had worked successfully though.

I've not done any extensive tests for regressions, but I've done

 cd $HOME
 git init
 git add .bashrc
 git add somedirectory/

And they work fine.  So - it's works for me from me, and a big happy 
grin.



Andy
-- 
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: git-add has gone lstat() mad
  2007-03-31 10:18       ` Andy Parkins
@ 2007-03-31 14:54         ` Randal L. Schwartz
  2007-03-31 15:09           ` Andy Parkins
  2007-03-31 19:28           ` Tom Prince
  0 siblings, 2 replies; 10+ messages in thread
From: Randal L. Schwartz @ 2007-03-31 14:54 UTC (permalink / raw)
  To: Andy Parkins; +Cc: git

>>>>> "Andy" == Andy Parkins <andyparkins@gmail.com> writes:

Andy> I've not done any extensive tests for regressions, but I've done

Andy>  cd $HOME
Andy>  git init
Andy>  git add .bashrc
Andy>  git add somedirectory/

Andy> And they work fine.  So - it's works for me from me, and a big happy 
Andy> grin.

Given that git currently doesn't maintain any metadata other than +x/-x, how
are you maintaining the metadata for your homedir items?  I know some schemes
were discussed here in the past, but I'm curious as to what you settled on.
For example, your .netrc file needs to be 600, but git won't track that.

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: git-add has gone lstat() mad
  2007-03-31 14:54         ` Randal L. Schwartz
@ 2007-03-31 15:09           ` Andy Parkins
  2007-03-31 19:28           ` Tom Prince
  1 sibling, 0 replies; 10+ messages in thread
From: Andy Parkins @ 2007-03-31 15:09 UTC (permalink / raw)
  To: git; +Cc: Randal L. Schwartz

On Saturday 2007, March 31, Randal L. Schwartz wrote:

> Given that git currently doesn't maintain any metadata other than
> +x/-x, how are you maintaining the metadata for your homedir items? 
> I know some schemes were discussed here in the past, but I'm curious
> as to what you settled on. For example, your .netrc file needs to be
> 600, but git won't track that.

I'm not maintaining anything as yet; the experiment so far consists 
of .bashrc only :-)

Just storing my .inputrc, .bashrc, .vimrc and .gitconfig in a repository 
is probably going to give me all that I want.  There aren't many other 
config files that I regularly update, and certainly not many that I 
update differently on different machines.

However, I plan to use a script to wrap the git calls so that I don't 
have to type GIT_WORK_DIR and GIT_DIR for every file I want to track.  
I plan to make it so that if you run this script with a checkout or 
other working tree changing command then I will run a function within 
it that reads the metadata out of a file that is stored in the 
repository which will restore those settings.

Ideally this functionality would be in hooks for pre-update-index and 
post-checkout-index or similar.  But I have a feeling that the 
potential for conflicts makes it hard to do in reality.



Andy
-- 
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: git-add has gone lstat() mad
  2007-03-31 14:54         ` Randal L. Schwartz
  2007-03-31 15:09           ` Andy Parkins
@ 2007-03-31 19:28           ` Tom Prince
  1 sibling, 0 replies; 10+ messages in thread
From: Tom Prince @ 2007-03-31 19:28 UTC (permalink / raw)
  To: Randal L. Schwartz; +Cc: Andy Parkins, git

On Sat, Mar 31, 2007 at 07:54:37AM -0700, Randal L. Schwartz wrote:
> Given that git currently doesn't maintain any metadata other than +x/-x, how
> are you maintaining the metadata for your homedir items?  I know some
> schemes
> were discussed here in the past, but I'm curious as to what you settled on.
> For example, your .netrc file needs to be 600, but git won't track that.

I personally just don't track files that have sensitive data. I tend to have
copies of my home dir on somewhat public servers, so I don't want that data
copied around anyway.

  Tom

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: git-add has gone lstat() mad
  2007-03-31  3:39     ` Linus Torvalds
  2007-03-31 10:18       ` Andy Parkins
@ 2007-04-01  0:39       ` Junio C Hamano
  2007-04-01  8:25         ` Andy Parkins
  1 sibling, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2007-04-01  0:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andy Parkins, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> Here's a patch. It passes all tests. It's not that complex. But people 
> should double-check. ESPECIALLY the list of special characters (currently 
> '?' '*' '\\' and '[').

I think the above is a good set.

This is an optimization different from what I was thinking
about.  I was hoping that we do not even need to call into
read_directory() if all the pathspec[] elements succeeds to
lstat() and they are not directories; in such a case we can just
stuff them to dir structure by hand, and use the remainder for
directory walk.

But I like this patch better.  We need to look at .gitignore to
warn about adding ignored files, so we cannot just stuff what
are found to dir without checking if they are ignored.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: git-add has gone lstat() mad
  2007-04-01  0:39       ` Junio C Hamano
@ 2007-04-01  8:25         ` Andy Parkins
  0 siblings, 0 replies; 10+ messages in thread
From: Andy Parkins @ 2007-04-01  8:25 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Linus Torvalds

On Sunday 2007, April 01, Junio C Hamano wrote:

> But I like this patch better.  We need to look at .gitignore to
> warn about adding ignored files, so we cannot just stuff what
> are found to dir without checking if they are ignored.

I needed the following needed on top of current pu:

diff --git a/unpack-trees.c b/unpack-trees.c
index fa36495..d4f7589 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -521,7 +521,7 @@ static void verify_clean_subdirectory(const char *path, const char *action,
 	memset(&d, 0, sizeof(d));
 	if (o->dir)
 		d.exclude_per_dir = o->dir->exclude_per_dir;
-	i = read_directory(&d, path, pathbuf, namelen+1);
+	i = read_directory(&d, path, pathbuf, namelen+1, NULL);
 	if (i)
 		die("Updating '%s' would lose untracked files in it",
 		    path);

^ permalink raw reply related	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2007-04-01  8:28 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-03-30 19:55 git-add has gone lstat() mad Andy Parkins
2007-03-30 20:20 ` Andy Parkins
2007-03-31  1:38   ` Junio C Hamano
2007-03-31  3:39     ` Linus Torvalds
2007-03-31 10:18       ` Andy Parkins
2007-03-31 14:54         ` Randal L. Schwartz
2007-03-31 15:09           ` Andy Parkins
2007-03-31 19:28           ` Tom Prince
2007-04-01  0:39       ` Junio C Hamano
2007-04-01  8:25         ` Andy Parkins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).