git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] Soem core git optimizations
@ 2008-05-09 16:04 Linus Torvalds
  2008-05-09 16:11 ` [PATCH 1/2] Avoid some unnecessary lstat() calls Linus Torvalds
  2008-05-11  2:36 ` [PATCH 0/2] Soem core git optimizations Junio C Hamano
  0 siblings, 2 replies; 5+ messages in thread
From: Linus Torvalds @ 2008-05-09 16:04 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List


I notice that Junio merged my fnmatch-avoidance patch, but I have a few 
other optimizations that I track in my private tree that I have sent out 
but probably didn't get much attention. They do matter from a performance 
angle, even if not as much as avoiding fnmatch did.

The first patch is obvious and pretty trivial: just moving lstat() calls 
up a bit in the call-chain, so that you don't end up unnecessarily doing 
two lstat() calls after each other.

The second patch is the symlink detection rewrite, which avoids a _lot_ of 
unnecessary lstat calls under some loads by not doing the lstat on the 
directory path entries over and over for each pathname. It's just a 
single-deep cache for "last directory seen" and "last symlink seen", and 
replaces the old code that only cached the last symlink.

Diffstat for the combined thing as follows:

 builtin-apply.c  |    2 +-
 builtin-commit.c |    6 ++-
 cache.h          |    4 ++-
 diff-lib.c       |   10 +++---
 read-cache.c     |   29 +++++++++++--------
 symlinks.c       |   82 ++++++++++++++++++++++++++++++++---------------------
 unpack-trees.c   |   12 +++----
 7 files changed, 84 insertions(+), 61 deletions(-)

and while this probably doesn't matter on most loads, the reason I'm 
re-sending is that I think it's pretty solid and core code. I've been 
running with both of these patches (and some others) rebased on top of 
Junio's tree for the last few weeks.

		Linus

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/2] Avoid some unnecessary lstat() calls
  2008-05-09 16:04 [PATCH 0/2] Soem core git optimizations Linus Torvalds
@ 2008-05-09 16:11 ` Linus Torvalds
  2008-05-09 16:21   ` [PATCH 2/2] Optimize symlink/directory detection Linus Torvalds
  2008-05-11  2:36 ` [PATCH 0/2] Soem core git optimizations Junio C Hamano
  1 sibling, 1 reply; 5+ messages in thread
From: Linus Torvalds @ 2008-05-09 16:11 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List


The commit sequence used to do

	if (file_exists(p->path))
		add_file_to_cache(p->path, 0);

where both "file_exists()" and "add_file_to_cache()" needed to do a 
lstat() on the path to do their work.

This cuts down 'lstat()' calls for the partial commit case by two 
for each path we know about (because we do this twice per path).

Just move the lstat() to the caller instead (that's all that 
"file_exists()" really does), and pass the stat information down to the 
add_to_cache() function.

This essentially makes 'add_to_index()' the core function that adds a path 
to the index, getting the index pointer, the pathname and the stat 
information as arguments. There are then shorthand helper functions that 
use this core function:

 - 'add_to_cache()' is just 'add_to_index()' with the default index

 - 'add_file_to_cache/index()' is the same, but does the lstat() call 
   itself, so you can pass just the pathname if you don't already have the 
   stat information available.

So old users of the 'add_file_to_xyzzy()' are essentially left unchanged, 
and this just exposes the more generic helper function that can take 
existing stat information into account.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 builtin-commit.c |    6 ++++--
 cache.h          |    2 ++
 read-cache.c     |   29 +++++++++++++++++------------
 3 files changed, 23 insertions(+), 14 deletions(-)

diff --git a/builtin-commit.c b/builtin-commit.c
index 256181a..6433f86 100644
--- a/builtin-commit.c
+++ b/builtin-commit.c
@@ -175,9 +175,11 @@ static void add_remove_files(struct path_list *list)
 {
 	int i;
 	for (i = 0; i < list->nr; i++) {
+		struct stat st;
 		struct path_list_item *p = &(list->items[i]);
-		if (file_exists(p->path))
-			add_file_to_cache(p->path, 0);
+
+		if (!lstat(p->path, &st))
+			add_to_cache(p->path, &st, 0);
 		else
 			remove_file_from_cache(p->path);
 	}
diff --git a/cache.h b/cache.h
index 80a8842..4803b03 100644
--- a/cache.h
+++ b/cache.h
@@ -261,6 +261,7 @@ static inline void remove_name_hash(struct cache_entry *ce)
 #define add_cache_entry(ce, option) add_index_entry(&the_index, (ce), (option))
 #define remove_cache_entry_at(pos) remove_index_entry_at(&the_index, (pos))
 #define remove_file_from_cache(path) remove_file_from_index(&the_index, (path))
+#define add_to_cache(path, st, verbose) add_to_index(&the_index, (path), (st), (verbose))
 #define add_file_to_cache(path, verbose) add_file_to_index(&the_index, (path), (verbose))
 #define refresh_cache(flags) refresh_index(&the_index, (flags), NULL, NULL)
 #define ce_match_stat(ce, st, options) ie_match_stat(&the_index, (ce), (st), (options))
@@ -365,6 +366,7 @@ extern int add_index_entry(struct index_state *, struct cache_entry *ce, int opt
 extern struct cache_entry *refresh_cache_entry(struct cache_entry *ce, int really);
 extern int remove_index_entry_at(struct index_state *, int pos);
 extern int remove_file_from_index(struct index_state *, const char *path);
+extern int add_to_index(struct index_state *, const char *path, struct stat *, int verbose);
 extern int add_file_to_index(struct index_state *, const char *path, int verbose);
 extern struct cache_entry *make_cache_entry(unsigned int mode, const unsigned char *sha1, const char *path, int stage, int refresh);
 extern int ce_same_name(struct cache_entry *a, struct cache_entry *b);
diff --git a/read-cache.c b/read-cache.c
index e71c3b7..f0d74b3 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -461,21 +461,18 @@ static struct cache_entry *create_alias_ce(struct cache_entry *ce, struct cache_
 	return new;
 }
 
-int add_file_to_index(struct index_state *istate, const char *path, int verbose)
+int add_to_index(struct index_state *istate, const char *path, struct stat *st, int verbose)
 {
 	int size, namelen;
-	struct stat st;
+	mode_t st_mode = st->st_mode;
 	struct cache_entry *ce, *alias;
 	unsigned ce_option = CE_MATCH_IGNORE_VALID|CE_MATCH_RACY_IS_DIRTY;
 
-	if (lstat(path, &st))
-		die("%s: unable to stat (%s)", path, strerror(errno));
-
-	if (!S_ISREG(st.st_mode) && !S_ISLNK(st.st_mode) && !S_ISDIR(st.st_mode))
+	if (!S_ISREG(st_mode) && !S_ISLNK(st_mode) && !S_ISDIR(st_mode))
 		die("%s: can only add regular files, symbolic links or git-directories", path);
 
 	namelen = strlen(path);
-	if (S_ISDIR(st.st_mode)) {
+	if (S_ISDIR(st_mode)) {
 		while (namelen && path[namelen-1] == '/')
 			namelen--;
 	}
@@ -483,10 +480,10 @@ int add_file_to_index(struct index_state *istate, const char *path, int verbose)
 	ce = xcalloc(1, size);
 	memcpy(ce->name, path, namelen);
 	ce->ce_flags = namelen;
-	fill_stat_cache_info(ce, &st);
+	fill_stat_cache_info(ce, st);
 
 	if (trust_executable_bit && has_symlinks)
-		ce->ce_mode = create_ce_mode(st.st_mode);
+		ce->ce_mode = create_ce_mode(st_mode);
 	else {
 		/* If there is an existing entry, pick the mode bits and type
 		 * from it, otherwise assume unexecutable regular file.
@@ -495,18 +492,18 @@ int add_file_to_index(struct index_state *istate, const char *path, int verbose)
 		int pos = index_name_pos_also_unmerged(istate, path, namelen);
 
 		ent = (0 <= pos) ? istate->cache[pos] : NULL;
-		ce->ce_mode = ce_mode_from_stat(ent, st.st_mode);
+		ce->ce_mode = ce_mode_from_stat(ent, st_mode);
 	}
 
 	alias = index_name_exists(istate, ce->name, ce_namelen(ce), ignore_case);
-	if (alias && !ce_stage(alias) && !ie_match_stat(istate, alias, &st, ce_option)) {
+	if (alias && !ce_stage(alias) && !ie_match_stat(istate, alias, st, ce_option)) {
 		/* Nothing changed, really */
 		free(ce);
 		ce_mark_uptodate(alias);
 		alias->ce_flags |= CE_ADDED;
 		return 0;
 	}
-	if (index_path(ce->sha1, path, &st, 1))
+	if (index_path(ce->sha1, path, st, 1))
 		die("unable to index file %s", path);
 	if (ignore_case && alias && different_name(ce, alias))
 		ce = create_alias_ce(ce, alias);
@@ -518,6 +515,14 @@ int add_file_to_index(struct index_state *istate, const char *path, int verbose)
 	return 0;
 }
 
+int add_file_to_index(struct index_state *istate, const char *path, int verbose)
+{
+	struct stat st;
+	if (lstat(path, &st))
+		die("%s: unable to stat (%s)", path, strerror(errno));
+	return add_to_index(istate, path, &st, verbose);
+}
+
 struct cache_entry *make_cache_entry(unsigned int mode,
 		const unsigned char *sha1, const char *path, int stage,
 		int refresh)

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 2/2] Optimize symlink/directory detection
  2008-05-09 16:11 ` [PATCH 1/2] Avoid some unnecessary lstat() calls Linus Torvalds
@ 2008-05-09 16:21   ` Linus Torvalds
  0 siblings, 0 replies; 5+ messages in thread
From: Linus Torvalds @ 2008-05-09 16:21 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List


This is the base for making symlink detection in the middle fo a pathname 
saner and (much) more efficient.

Under various loads, we want to verify that the full path leading up to a 
filename is a real directory tree, and that when we successfully do an 
'lstat()' on a filename, we don't get a false positive due to a symlink in 
the middle of the path that git should have seen as a symlink, not as a 
normal path component.

The 'has_symlink_leading_path()' function already did this, and cached 
a single level of symlink information, but didn't cache the _lack_ of a 
symlink, so the normal behaviour was actually the wrong way around, and we 
ended up doing an 'lstat()' on each path component to check that it was a 
real directory.

This caches the last detected full directory and symlink entries, and 
speeds up especially deep directory structures a lot by avoiding to 
lstat() all the directories leading up to each entry in the index.
    
[ This can - and should - probably be extended upon so that we eventually 
  never do a bare 'lstat()' on any path entries at *all* when checking the 
  index, but always check the full path carefully. Right now we do not 
  generally check the whole path for all our normal quick index 
  revalidation.

  We should also make sure that we're careful about all the invalidation, 
  ie when we remove a link and replace it by a directory we should 
  invalidate the symlink cache if it matches (and vice versa for the 
  directory cache).

  But regardless, the basic function needs to be sane to do that. The old
  'has_symlink_leading_path()' was not capable enough - or indeed the code
  readable enough - to really do that sanely. So I'm pushing this as not 
  just an optimization, but as a base for further work. ]

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

 builtin-apply.c |    2 +-
 cache.h         |    2 +-
 diff-lib.c      |   10 +++---
 symlinks.c      |   82 +++++++++++++++++++++++++++++++++----------------------
 unpack-trees.c  |   12 +++-----
 5 files changed, 61 insertions(+), 47 deletions(-)

diff --git a/builtin-apply.c b/builtin-apply.c
index caa3f2a..1103625 100644
--- a/builtin-apply.c
+++ b/builtin-apply.c
@@ -2247,7 +2247,7 @@ static int check_to_create_blob(const char *new_name, int ok_if_exists)
 		 * In such a case, path "new_name" does not exist as
 		 * far as git is concerned.
 		 */
-		if (has_symlink_leading_path(new_name, NULL))
+		if (has_symlink_leading_path(strlen(new_name), new_name))
 			return 0;
 
 		return error("%s: already exists in working directory", new_name);
diff --git a/cache.h b/cache.h
index 4803b03..0d9d9e2 100644
--- a/cache.h
+++ b/cache.h
@@ -598,7 +598,7 @@ struct checkout {
 };
 
 extern int checkout_entry(struct cache_entry *ce, const struct checkout *state, char *topath);
-extern int has_symlink_leading_path(const char *name, char *last_symlink);
+extern int has_symlink_leading_path(int len, const char *name);
 
 extern struct alternate_object_database {
 	struct alternate_object_database *next;
diff --git a/diff-lib.c b/diff-lib.c
index 9139e45..9966d99 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -341,14 +341,14 @@ int run_diff_files_cmd(struct rev_info *revs, int argc, const char **argv)
  * See if work tree has an entity that can be staged.  Return 0 if so,
  * return 1 if not and return -1 if error.
  */
-static int check_work_tree_entity(const struct cache_entry *ce, struct stat *st, char *symcache)
+static int check_work_tree_entity(const struct cache_entry *ce, struct stat *st)
 {
 	if (lstat(ce->name, st) < 0) {
 		if (errno != ENOENT && errno != ENOTDIR)
 			return -1;
 		return 1;
 	}
-	if (has_symlink_leading_path(ce->name, symcache))
+	if (has_symlink_leading_path(ce_namelen(ce), ce->name))
 		return 1;
 	if (S_ISDIR(st->st_mode)) {
 		unsigned char sub[20];
@@ -402,7 +402,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			memset(&(dpath->parent[0]), 0,
 			       sizeof(struct combine_diff_parent)*5);
 
-			changed = check_work_tree_entity(ce, &st, symcache);
+			changed = check_work_tree_entity(ce, &st);
 			if (!changed)
 				dpath->mode = ce_mode_from_stat(ce, st.st_mode);
 			else {
@@ -466,7 +466,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 		if (ce_uptodate(ce))
 			continue;
 
-		changed = check_work_tree_entity(ce, &st, symcache);
+		changed = check_work_tree_entity(ce, &st);
 		if (changed) {
 			if (changed < 0) {
 				perror(ce->name);
@@ -527,7 +527,7 @@ static int get_stat_data(struct cache_entry *ce,
 	if (!cached) {
 		int changed;
 		struct stat st;
-		changed = check_work_tree_entity(ce, &st, cbdata->symcache);
+		changed = check_work_tree_entity(ce, &st);
 		if (changed < 0)
 			return -1;
 		else if (changed) {
diff --git a/symlinks.c b/symlinks.c
index be9ace6..5a5e781 100644
--- a/symlinks.c
+++ b/symlinks.c
@@ -1,48 +1,64 @@
 #include "cache.h"
 
-int has_symlink_leading_path(const char *name, char *last_symlink)
-{
+struct pathname {
+	int len;
 	char path[PATH_MAX];
-	const char *sp, *ep;
-	char *dp;
-
-	sp = name;
-	dp = path;
-
-	if (last_symlink && *last_symlink) {
-		size_t last_len = strlen(last_symlink);
-		size_t len = strlen(name);
-		if (last_len < len &&
-		    !strncmp(name, last_symlink, last_len) &&
-		    name[last_len] == '/')
-			return 1;
-		*last_symlink = '\0';
+};
+
+/* Return matching pathname prefix length, or zero if not matching */
+static inline int match_pathname(int len, const char *name, struct pathname *match)
+{
+	int match_len = match->len;
+	return (len > match_len &&
+		name[match_len] == '/' &&
+		!memcmp(name, match->path, match_len)) ? match_len : 0;
+}
+
+static inline void set_pathname(int len, const char *name, struct pathname *match)
+{
+	if (len < PATH_MAX) {
+		match->len = len;
+		memcpy(match->path, name, len);
+		match->path[len] = 0;
 	}
+}
+
+int has_symlink_leading_path(int len, const char *name)
+{
+	static struct pathname link, nonlink;
+	char path[PATH_MAX];
+	struct stat st;
+	char *sp;
+	int known_dir;
 
-	while (1) {
-		size_t len;
-		struct stat st;
+	/*
+	 * See if the last known symlink cache matches.
+	 */
+	if (match_pathname(len, name, &link))
+		return 1;
 
-		ep = strchr(sp, '/');
-		if (!ep)
-			break;
-		len = ep - sp;
-		if (PATH_MAX <= dp + len - path + 2)
-			return 0; /* new name is longer than that??? */
-		memcpy(dp, sp, len);
-		dp[len] = 0;
+	/*
+	 * Get rid of the last known directory part
+	 */
+	known_dir = match_pathname(len, name, &nonlink);
+
+	while ((sp = strchr(name + known_dir + 1, '/')) != NULL) {
+		int thislen = sp - name ;
+		memcpy(path, name, thislen);
+		path[thislen] = 0;
 
 		if (lstat(path, &st))
 			return 0;
+		if (S_ISDIR(st.st_mode)) {
+			set_pathname(thislen, path, &nonlink);
+			known_dir = thislen;
+			continue;
+		}
 		if (S_ISLNK(st.st_mode)) {
-			if (last_symlink)
-				strcpy(last_symlink, path);
+			set_pathname(thislen, path, &link);
 			return 1;
 		}
-
-		dp[len++] = '/';
-		dp = dp + len;
-		sp = ep + 1;
+		break;
 	}
 	return 0;
 }
diff --git a/unpack-trees.c b/unpack-trees.c
index feae846..1ab28fd 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -26,11 +26,12 @@ static void add_entry(struct unpack_trees_options *o, struct cache_entry *ce,
  * directories, in case this unlink is the removal of the
  * last entry in the directory -- empty directories are removed.
  */
-static void unlink_entry(char *name, char *last_symlink)
+static void unlink_entry(struct cache_entry *ce)
 {
 	char *cp, *prev;
+	char *name = ce->name;
 
-	if (has_symlink_leading_path(name, last_symlink))
+	if (has_symlink_leading_path(ce_namelen(ce), ce->name))
 		return;
 	if (unlink(name))
 		return;
@@ -58,7 +59,6 @@ static int check_updates(struct unpack_trees_options *o)
 {
 	unsigned cnt = 0, total = 0;
 	struct progress *progress = NULL;
-	char last_symlink[PATH_MAX];
 	struct index_state *index = &o->result;
 	int i;
 	int errs = 0;
@@ -75,14 +75,13 @@ static int check_updates(struct unpack_trees_options *o)
 		cnt = 0;
 	}
 
-	*last_symlink = '\0';
 	for (i = 0; i < index->cache_nr; i++) {
 		struct cache_entry *ce = index->cache[i];
 
 		if (ce->ce_flags & CE_REMOVE) {
 			display_progress(progress, ++cnt);
 			if (o->update)
-				unlink_entry(ce->name, last_symlink);
+				unlink_entry(ce);
 			remove_index_entry_at(&o->result, i);
 			i--;
 			continue;
@@ -97,7 +96,6 @@ static int check_updates(struct unpack_trees_options *o)
 			ce->ce_flags &= ~CE_UPDATE;
 			if (o->update) {
 				errs |= checkout_entry(ce, &state, NULL);
-				*last_symlink = '\0';
 			}
 		}
 	}
@@ -553,7 +551,7 @@ static int verify_absent(struct cache_entry *ce, const char *action,
 	if (o->index_only || o->reset || !o->update)
 		return 0;
 
-	if (has_symlink_leading_path(ce->name, NULL))
+	if (has_symlink_leading_path(ce_namelen(ce), ce->name))
 		return 0;
 
 	if (!lstat(ce->name, &st)) {

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/2] Soem core git optimizations
  2008-05-09 16:04 [PATCH 0/2] Soem core git optimizations Linus Torvalds
  2008-05-09 16:11 ` [PATCH 1/2] Avoid some unnecessary lstat() calls Linus Torvalds
@ 2008-05-11  2:36 ` Junio C Hamano
  2008-05-11  2:42   ` Linus Torvalds
  1 sibling, 1 reply; 5+ messages in thread
From: Junio C Hamano @ 2008-05-11  2:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> I notice that Junio merged my fnmatch-avoidance patch, but I have a few 
> other optimizations that I track in my private tree that I have sent out 
> but probably didn't get much attention. They do matter from a performance 
> angle, even if not as much as avoiding fnmatch did.
> ...
> and while this probably doesn't matter on most loads, the reason I'm 
> re-sending is that I think it's pretty solid and core code. I've been 
> running with both of these patches (and some others) rebased on top of 
> Junio's tree for the last few weeks.

Unfortunately this seems to depend on stuff still in 'next', and a trivial
rebasing seems to break switching branches with "git checkout" in t4122
(switching from 'master' to 'test' will lose arch/i386/boot/Makefile).

I'll take a hint and merge the "case insensitive" and "diff-submodule"
series to 'master' and then apply these two --- the result does not have
the git-checkout breakage.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/2] Soem core git optimizations
  2008-05-11  2:36 ` [PATCH 0/2] Soem core git optimizations Junio C Hamano
@ 2008-05-11  2:42   ` Linus Torvalds
  0 siblings, 0 replies; 5+ messages in thread
From: Linus Torvalds @ 2008-05-11  2:42 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List



On Sat, 10 May 2008, Junio C Hamano wrote:
> 
> Unfortunately this seems to depend on stuff still in 'next', and a trivial
> rebasing seems to break switching branches with "git checkout" in t4122
> (switching from 'master' to 'test' will lose arch/i386/boot/Makefile).

Gaah. It's actually based on 'master', not 'next', but yeah, I have my 
other patch-series in my tree (the case-insensitivity ones you have in 
next), and while I thought it was independent, I was wrong. Sorry about 
that.

		Linus

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2008-05-11  2:44 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-05-09 16:04 [PATCH 0/2] Soem core git optimizations Linus Torvalds
2008-05-09 16:11 ` [PATCH 1/2] Avoid some unnecessary lstat() calls Linus Torvalds
2008-05-09 16:21   ` [PATCH 2/2] Optimize symlink/directory detection Linus Torvalds
2008-05-11  2:36 ` [PATCH 0/2] Soem core git optimizations Junio C Hamano
2008-05-11  2:42   ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).