* [PATCH 0/3] Trivial (and small) exclude optimizations @ 2013-03-09 4:09 Nguyễn Thái Ngọc Duy 2013-03-09 4:09 ` [PATCH 1/3] match_pathname: avoid calling strncmp if baselen is 0 Nguyễn Thái Ngọc Duy ` (3 more replies) 0 siblings, 4 replies; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-09 4:09 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy I've been examining how and where str*cmp are called by read_directory(). They are presumably where most of the time is spent. The gain is not big, but still worth it, I think. These are tested without nd/read-directory-recursive-optim. The saving is probably smaller as nd/read... cuts down a big number of entries to be processed by exclude machinery. There is another optimization window, which a simple hack suggests it may shave 100ms out of 500ms ls-files time on my machine. Say you have "/INSTALL" in your root .gitignore. The pattern will be tested for _all_ pathnames including "path/deep/down/here". By limiting the pattern to pathnames of the same directory level, the number of applicable pathnames to the pattern will be usually small and limited (otherwise as the repository grows with more pathnames, we pay more for exclude). Haven't figured out how to save directory level yet though. Nguyễn Thái Ngọc Duy (3): match_pathname: avoid calling strncmp if baselen is 0 dir.c: inline convenient *_icase helpers match_basename: use strncmp instead of strcmp attr.c | 2 ++ dir.c | 26 ++++++-------------------- dir.h | 18 +++++++++++++++--- 3 files changed, 23 insertions(+), 23 deletions(-) -- 1.8.1.2.536.gf441e6d ^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH 1/3] match_pathname: avoid calling strncmp if baselen is 0 2013-03-09 4:09 [PATCH 0/3] Trivial (and small) exclude optimizations Nguyễn Thái Ngọc Duy @ 2013-03-09 4:09 ` Nguyễn Thái Ngọc Duy 2013-03-09 9:06 ` Antoine Pelisse 2013-03-09 4:09 ` [PATCH 2/3] dir.c: inline convenient *_icase helpers Nguyễn Thái Ngọc Duy ` (2 subsequent siblings) 3 siblings, 1 reply; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-09 4:09 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy This reduces "git status" user time by a little bit. This is the sorted results of 10 consecutive runs of "git ls-files --exclude-standard -o" on webkit.git, compiled with gcc -O2: before after user 0m0.580s 0m0.546s user 0m0.581s 0m0.549s user 0m0.582s 0m0.550s user 0m0.584s 0m0.558s user 0m0.586s 0m0.560s user 0m0.587s 0m0.561s user 0m0.587s 0m0.562s user 0m0.593s 0m0.566s user 0m0.597s 0m0.568s user 0m0.601s 0m0.573s Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- dir.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dir.c b/dir.c index 57394e4..669cf80 100644 --- a/dir.c +++ b/dir.c @@ -663,7 +663,7 @@ int match_pathname(const char *pathname, int pathlen, */ if (pathlen < baselen + 1 || (baselen && pathname[baselen] != '/') || - strncmp_icase(pathname, base, baselen)) + (baselen && strncmp_icase(pathname, base, baselen))) return 0; namelen = baselen ? pathlen - baselen - 1 : pathlen; -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH 1/3] match_pathname: avoid calling strncmp if baselen is 0 2013-03-09 4:09 ` [PATCH 1/3] match_pathname: avoid calling strncmp if baselen is 0 Nguyễn Thái Ngọc Duy @ 2013-03-09 9:06 ` Antoine Pelisse 0 siblings, 0 replies; 48+ messages in thread From: Antoine Pelisse @ 2013-03-09 9:06 UTC (permalink / raw) To: Nguyễn Thái Ngọc Duy; +Cc: git > diff --git a/dir.c b/dir.c > index 57394e4..669cf80 100644 > --- a/dir.c > +++ b/dir.c > @@ -663,7 +663,7 @@ int match_pathname(const char *pathname, int pathlen, > */ > if (pathlen < baselen + 1 || > (baselen && pathname[baselen] != '/') || > - strncmp_icase(pathname, base, baselen)) > + (baselen && strncmp_icase(pathname, base, baselen))) Shouldn't you factorize by baselen here ? For readability reasons, not performance of course. > return 0; > > namelen = baselen ? pathlen - baselen - 1 : pathlen; ^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH 2/3] dir.c: inline convenient *_icase helpers 2013-03-09 4:09 [PATCH 0/3] Trivial (and small) exclude optimizations Nguyễn Thái Ngọc Duy 2013-03-09 4:09 ` [PATCH 1/3] match_pathname: avoid calling strncmp if baselen is 0 Nguyễn Thái Ngọc Duy @ 2013-03-09 4:09 ` Nguyễn Thái Ngọc Duy 2013-03-09 4:09 ` [PATCH 3/3] match_basename: use strncmp instead of strcmp Nguyễn Thái Ngọc Duy 2013-03-10 6:14 ` [PATCH v2 0/6] Exclude optimizations Nguyễn Thái Ngọc Duy 3 siblings, 0 replies; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-09 4:09 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy Like the previous patch, this cuts down the number of str*cmp calls in read_directory (which does _a lot_). Again sorted results on webkit.git: before after user 0m0.546s 0m0.519s user 0m0.549s 0m0.521s user 0m0.550s 0m0.523s user 0m0.558s 0m0.532s user 0m0.560s 0m0.534s user 0m0.561s 0m0.536s user 0m0.562s 0m0.537s user 0m0.566s 0m0.545s user 0m0.568s 0m0.546s user 0m0.573s 0m0.548s Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- dir.c | 16 ---------------- dir.h | 18 +++++++++++++++--- 2 files changed, 15 insertions(+), 19 deletions(-) diff --git a/dir.c b/dir.c index 669cf80..f58320d 100644 --- a/dir.c +++ b/dir.c @@ -21,22 +21,6 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, in int check_only, const struct path_simplify *simplify); static int get_dtype(struct dirent *de, const char *path, int len); -/* helper string functions with support for the ignore_case flag */ -int strcmp_icase(const char *a, const char *b) -{ - return ignore_case ? strcasecmp(a, b) : strcmp(a, b); -} - -int strncmp_icase(const char *a, const char *b, size_t count) -{ - return ignore_case ? strncasecmp(a, b, count) : strncmp(a, b, count); -} - -int fnmatch_icase(const char *pattern, const char *string, int flags) -{ - return fnmatch(pattern, string, flags | (ignore_case ? FNM_CASEFOLD : 0)); -} - inline int git_fnmatch(const char *pattern, const char *string, int flags, int prefix) { diff --git a/dir.h b/dir.h index c3eb4b5..560ade4 100644 --- a/dir.h +++ b/dir.h @@ -200,9 +200,21 @@ extern int remove_dir_recursively(struct strbuf *path, int flag); /* tries to remove the path with empty directories along it, ignores ENOENT */ extern int remove_path(const char *path); -extern int strcmp_icase(const char *a, const char *b); -extern int strncmp_icase(const char *a, const char *b, size_t count); -extern int fnmatch_icase(const char *pattern, const char *string, int flags); +/* helper string functions with support for the ignore_case flag */ +static inline int strcmp_icase(const char *a, const char *b) +{ + return ignore_case ? strcasecmp(a, b) : strcmp(a, b); +} + +static inline int strncmp_icase(const char *a, const char *b, size_t count) +{ + return ignore_case ? strncasecmp(a, b, count) : strncmp(a, b, count); +} + +static inline int fnmatch_icase(const char *pattern, const char *string, int flags) +{ + return fnmatch(pattern, string, flags | (ignore_case ? FNM_CASEFOLD : 0)); +} /* * The prefix part of pattern must not contains wildcards. -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH 3/3] match_basename: use strncmp instead of strcmp 2013-03-09 4:09 [PATCH 0/3] Trivial (and small) exclude optimizations Nguyễn Thái Ngọc Duy 2013-03-09 4:09 ` [PATCH 1/3] match_pathname: avoid calling strncmp if baselen is 0 Nguyễn Thái Ngọc Duy 2013-03-09 4:09 ` [PATCH 2/3] dir.c: inline convenient *_icase helpers Nguyễn Thái Ngọc Duy @ 2013-03-09 4:09 ` Nguyễn Thái Ngọc Duy 2013-03-09 7:50 ` Junio C Hamano 2013-03-10 6:14 ` [PATCH v2 0/6] Exclude optimizations Nguyễn Thái Ngọc Duy 3 siblings, 1 reply; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-09 4:09 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy strncmp provides length information, compared to strcmp, which could be taken advantage by the implementation. Even better, we could check if the lengths are equal before calling strncmp, eliminating a bit of strncmp calls. before after user 0m0.519s 0m0.489s user 0m0.521s 0m0.504s user 0m0.523s 0m0.507s user 0m0.532s 0m0.510s user 0m0.534s 0m0.513s user 0m0.536s 0m0.514s user 0m0.537s 0m0.522s user 0m0.545s 0m0.523s user 0m0.546s 0m0.527s user 0m0.548s 0m0.529s While at there, fix an inconsistency about pattern/patternlen in how attr handles EXC_FLAG_MUSTBEDIR. When parse_exclude_pattern detects this flag, it sets patternlen _not_ to include the trailing slash and expects the caller to trim it. add_exclude does, parse_attr_line does not. In attr.c, the pattern could be "foo/" while patternlen tells us it only has 3 chars. Some functions do not care about patternlen and will see the pattern as "foo/" while others may see it as "foo". This patch makes patternlen 4 in this case. (Although for a piece of mind, perhaps we should trim it to "foo" like exclude, and never pass a pathname like "abc/" to match_{base,path}name) Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- attr.c | 2 ++ dir.c | 8 +++++--- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/attr.c b/attr.c index e2f9377..1818ba5 100644 --- a/attr.c +++ b/attr.c @@ -255,6 +255,8 @@ static struct match_attr *parse_attr_line(const char *line, const char *src, &res->u.pat.patternlen, &res->u.pat.flags, &res->u.pat.nowildcardlen); + if (res->u.pat.flags & EXC_FLAG_MUSTBEDIR) + res->u.pat.patternlen++; if (res->u.pat.flags & EXC_FLAG_NEGATIVE) { warning(_("Negative patterns are ignored in git attributes\n" "Use '\\!' for literal leading exclamation.")); diff --git a/dir.c b/dir.c index f58320d..2a91d14 100644 --- a/dir.c +++ b/dir.c @@ -610,12 +610,14 @@ int match_basename(const char *basename, int basenamelen, int flags) { if (prefix == patternlen) { - if (!strcmp_icase(pattern, basename)) + if (patternlen == basenamelen && + !strncmp_icase(pattern, basename, patternlen)) return 1; } else if (flags & EXC_FLAG_ENDSWITH) { if (patternlen - 1 <= basenamelen && - !strcmp_icase(pattern + 1, - basename + basenamelen - patternlen + 1)) + !strncmp_icase(pattern + 1, + basename + basenamelen - patternlen + 1, + patternlen - 1)) return 1; } else { if (fnmatch_icase(pattern, basename, 0) == 0) -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] match_basename: use strncmp instead of strcmp 2013-03-09 4:09 ` [PATCH 3/3] match_basename: use strncmp instead of strcmp Nguyễn Thái Ngọc Duy @ 2013-03-09 7:50 ` Junio C Hamano 2013-03-09 8:47 ` Fredrik Gustafsson 2013-03-09 9:58 ` Duy Nguyen 0 siblings, 2 replies; 48+ messages in thread From: Junio C Hamano @ 2013-03-09 7:50 UTC (permalink / raw) To: Nguyễn Thái Ngọc Duy; +Cc: git Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes: > strncmp provides length information, compared to strcmp, which could > be taken advantage by the implementation. Even better, we could check > if the lengths are equal before calling strncmp, eliminating a bit of > strncmp calls. I think I am a bit slower than my usual self tonight, but I am utterly confused by the above. strncmp() compares _only_ up to the first n bytes, so when you are using it for equality, it is not "we could check length", but is "we MUST check they match to the length of the shorter string", if you want to obtain not just faster but correct result. Am I mistaken? Even if you are using strcmp() that yields ordering not just equality, it can return a correct result as soon as it hits the first bytes that are different; I doubt using strncmp() contributes to the performance very much. Comparing lengths before doing byte-for-byte comparison could help because you can reject two strings with different lengths without looking at them. At the same time, I wonder if we can take advantage of the fact that these call sites only care about equality and not ordering. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] match_basename: use strncmp instead of strcmp 2013-03-09 7:50 ` Junio C Hamano @ 2013-03-09 8:47 ` Fredrik Gustafsson 2013-03-09 9:58 ` Duy Nguyen 1 sibling, 0 replies; 48+ messages in thread From: Fredrik Gustafsson @ 2013-03-09 8:47 UTC (permalink / raw) To: Junio C Hamano; +Cc: Nguyễn Thái Ngọc Duy, git On Fri, Mar 08, 2013 at 11:50:04PM -0800, Junio C Hamano wrote: > At the same time, I wonder if we can take advantage of the fact that > these call sites only care about equality and not ordering. I did an RFC-patch for that (that I mistakenly didn't sent as a reply to this e-mail). And I believe that you're correct. My solution is inspired of curl's strequal. Is the reason for git not to care about lower/upper-case for beeing able to support windows? Or is there any other smart reason? I was also thinking about discarding files by looking at their modification date. If the modification timestamp is older than/or equal to the latest commit, there's probably no reason for examine that file any further. I'm not sure about the side effects this may imply though. I think they can be quite nasty. Is this something worth digging more in or am I already on the wrong path? -- Med vänliga hälsningar Fredrik Gustafsson tel: 0733-608274 e-post: iveqy@iveqy.com ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] match_basename: use strncmp instead of strcmp 2013-03-09 7:50 ` Junio C Hamano 2013-03-09 8:47 ` Fredrik Gustafsson @ 2013-03-09 9:58 ` Duy Nguyen 1 sibling, 0 replies; 48+ messages in thread From: Duy Nguyen @ 2013-03-09 9:58 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Sat, Mar 9, 2013 at 2:50 PM, Junio C Hamano <gitster@pobox.com> wrote: > Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes: > >> strncmp provides length information, compared to strcmp, which could >> be taken advantage by the implementation. Even better, we could check >> if the lengths are equal before calling strncmp, eliminating a bit of >> strncmp calls. > > I think I am a bit slower than my usual self tonight, but I am > utterly confused by the above. > > strncmp() compares _only_ up to the first n bytes, so when you are > using it for equality, it is not "we could check length", but is "we > MUST check they match to the length of the shorter string", if you > want to obtain not just faster but correct result. > > Am I mistaken? Yeap, the description is a bit misleading. Although you could get away with length check by doing !strncmp(a, b, strlen(a)+1). > Even if you are using strcmp() that yields ordering not just > equality, it can return a correct result as soon as it hits the > first bytes that are different; I doubt using strncmp() contributes > to the performance very much. Comparing lengths before doing > byte-for-byte comparison could help because you can reject two > strings with different lengths without looking at them. > > At the same time, I wonder if we can take advantage of the fact that > these call sites only care about equality and not ordering. I tried to push it further and compare hash before do the actual string comparison. It slowed things down (hopefully because the cost of hashing, the same one from name-hash.c, not because I did it wrong). -- Duy ^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH v2 0/6] Exclude optimizations 2013-03-09 4:09 [PATCH 0/3] Trivial (and small) exclude optimizations Nguyễn Thái Ngọc Duy ` (2 preceding siblings ...) 2013-03-09 4:09 ` [PATCH 3/3] match_basename: use strncmp instead of strcmp Nguyễn Thái Ngọc Duy @ 2013-03-10 6:14 ` Nguyễn Thái Ngọc Duy 2013-03-10 6:14 ` [PATCH v2 1/6] match_pathname: avoid calling strncmp if baselen is 0 Nguyễn Thái Ngọc Duy ` (7 more replies) 3 siblings, 8 replies; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-10 6:14 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy v2 includes strncmp_equal and directory level pattern filter. user time of "git ls-files --exclude-standard -o" on webkit.git below. Looking pretty good. before after user 0m0.607s 0m0.365s user 0m0.613s 0m0.366s user 0m0.613s 0m0.374s user 0m0.621s 0m0.374s user 0m0.621s 0m0.377s user 0m0.622s 0m0.381s user 0m0.624s 0m0.381s user 0m0.626s 0m0.383s user 0m0.628s 0m0.384s user 0m0.638s 0m0.384s Nguyễn Thái Ngọc Duy (6): match_pathname: avoid calling strncmp if baselen is 0 dir.c: inline convenient *_icase helpers match_basename: use strncmp instead of strcmp match_{base,path}name: replace strncmp_icase with strnequal_icase dir.c: pass pathname length to last_exclude_matching exclude: filter patterns by directory level attr.c | 5 ++- dir.c | 114 ++++++++++++++++++++++++++++++++++++++++++++--------------------- dir.h | 27 +++++++++++++--- 3 files changed, 104 insertions(+), 42 deletions(-) -- 1.8.1.2.536.gf441e6d ^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH v2 1/6] match_pathname: avoid calling strncmp if baselen is 0 2013-03-10 6:14 ` [PATCH v2 0/6] Exclude optimizations Nguyễn Thái Ngọc Duy @ 2013-03-10 6:14 ` Nguyễn Thái Ngọc Duy 2013-03-10 6:14 ` [PATCH v2 2/6] dir.c: inline convenient *_icase helpers Nguyễn Thái Ngọc Duy ` (6 subsequent siblings) 7 siblings, 0 replies; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-10 6:14 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy This reduces "git status" user time by a little bit. This is the sorted results of 10 consecutive runs of "git ls-files --exclude-standard -o" on webkit.git, compiled with gcc -O2: before after user 0m0.607s 0m0.554s user 0m0.613s 0m0.564s user 0m0.613s 0m0.571s user 0m0.621s 0m0.576s user 0m0.621s 0m0.578s user 0m0.622s 0m0.579s user 0m0.624s 0m0.583s user 0m0.626s 0m0.584s user 0m0.628s 0m0.586s user 0m0.638s 0m0.592s Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- dir.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/dir.c b/dir.c index 57394e4..b3cd66c 100644 --- a/dir.c +++ b/dir.c @@ -662,8 +662,8 @@ int match_pathname(const char *pathname, int pathlen, * may not end with a trailing slash though. */ if (pathlen < baselen + 1 || - (baselen && pathname[baselen] != '/') || - strncmp_icase(pathname, base, baselen)) + (baselen && (pathname[baselen] != '/' || + strncmp_icase(pathname, base, baselen)))) return 0; namelen = baselen ? pathlen - baselen - 1 : pathlen; -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH v2 2/6] dir.c: inline convenient *_icase helpers 2013-03-10 6:14 ` [PATCH v2 0/6] Exclude optimizations Nguyễn Thái Ngọc Duy 2013-03-10 6:14 ` [PATCH v2 1/6] match_pathname: avoid calling strncmp if baselen is 0 Nguyễn Thái Ngọc Duy @ 2013-03-10 6:14 ` Nguyễn Thái Ngọc Duy 2013-03-10 6:14 ` [PATCH v2 3/6] match_basename: use strncmp instead of strcmp Nguyễn Thái Ngọc Duy ` (5 subsequent siblings) 7 siblings, 0 replies; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-10 6:14 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy Like the previous patch, this cuts down the number of str*cmp calls in read_directory (which does _a lot_). Again sorted results on webkit.git: before after user 0m0.554s 0m0.548s user 0m0.564s 0m0.549s user 0m0.571s 0m0.554s user 0m0.576s 0m0.557s user 0m0.578s 0m0.558s user 0m0.579s 0m0.559s user 0m0.583s 0m0.562s user 0m0.584s 0m0.564s user 0m0.586s 0m0.566s user 0m0.592s 0m0.569s Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- dir.c | 16 ---------------- dir.h | 18 +++++++++++++++--- 2 files changed, 15 insertions(+), 19 deletions(-) diff --git a/dir.c b/dir.c index b3cd66c..9960a37 100644 --- a/dir.c +++ b/dir.c @@ -21,22 +21,6 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, in int check_only, const struct path_simplify *simplify); static int get_dtype(struct dirent *de, const char *path, int len); -/* helper string functions with support for the ignore_case flag */ -int strcmp_icase(const char *a, const char *b) -{ - return ignore_case ? strcasecmp(a, b) : strcmp(a, b); -} - -int strncmp_icase(const char *a, const char *b, size_t count) -{ - return ignore_case ? strncasecmp(a, b, count) : strncmp(a, b, count); -} - -int fnmatch_icase(const char *pattern, const char *string, int flags) -{ - return fnmatch(pattern, string, flags | (ignore_case ? FNM_CASEFOLD : 0)); -} - inline int git_fnmatch(const char *pattern, const char *string, int flags, int prefix) { diff --git a/dir.h b/dir.h index c3eb4b5..560ade4 100644 --- a/dir.h +++ b/dir.h @@ -200,9 +200,21 @@ extern int remove_dir_recursively(struct strbuf *path, int flag); /* tries to remove the path with empty directories along it, ignores ENOENT */ extern int remove_path(const char *path); -extern int strcmp_icase(const char *a, const char *b); -extern int strncmp_icase(const char *a, const char *b, size_t count); -extern int fnmatch_icase(const char *pattern, const char *string, int flags); +/* helper string functions with support for the ignore_case flag */ +static inline int strcmp_icase(const char *a, const char *b) +{ + return ignore_case ? strcasecmp(a, b) : strcmp(a, b); +} + +static inline int strncmp_icase(const char *a, const char *b, size_t count) +{ + return ignore_case ? strncasecmp(a, b, count) : strncmp(a, b, count); +} + +static inline int fnmatch_icase(const char *pattern, const char *string, int flags) +{ + return fnmatch(pattern, string, flags | (ignore_case ? FNM_CASEFOLD : 0)); +} /* * The prefix part of pattern must not contains wildcards. -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH v2 3/6] match_basename: use strncmp instead of strcmp 2013-03-10 6:14 ` [PATCH v2 0/6] Exclude optimizations Nguyễn Thái Ngọc Duy 2013-03-10 6:14 ` [PATCH v2 1/6] match_pathname: avoid calling strncmp if baselen is 0 Nguyễn Thái Ngọc Duy 2013-03-10 6:14 ` [PATCH v2 2/6] dir.c: inline convenient *_icase helpers Nguyễn Thái Ngọc Duy @ 2013-03-10 6:14 ` Nguyễn Thái Ngọc Duy 2013-03-10 7:34 ` Junio C Hamano 2013-03-10 6:14 ` [PATCH v2 4/6] match_{base,path}name: replace strncmp_icase with strnequal_icase Nguyễn Thái Ngọc Duy ` (4 subsequent siblings) 7 siblings, 1 reply; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-10 6:14 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy strncmp is provided length information which could be taken advantage by the underlying implementation. Even better, we need to check if the lengths are equal before calling strncmp, eliminating a bit of strncmp calls. before after user 0m0.548s 0m0.516s user 0m0.549s 0m0.523s user 0m0.554s 0m0.532s user 0m0.557s 0m0.533s user 0m0.558s 0m0.535s user 0m0.559s 0m0.542s user 0m0.562s 0m0.546s user 0m0.564s 0m0.551s user 0m0.566s 0m0.556s user 0m0.569s 0m0.561s While at there, fix an inconsistency about pattern/patternlen in how attr handles EXC_FLAG_MUSTBEDIR. When parse_exclude_pattern detects this flag, it sets patternlen _not_ to include the trailing slash and expects the caller to trim it. add_exclude does, parse_attr_line does not. In attr.c, the pattern could be "foo/" while patternlen tells us it only has 3 chars. Some functions do not care about patternlen and will see the pattern as "foo/" while others may see it as "foo". This patch makes patternlen 4 in this case. (Although for a piece of mind, perhaps we should trim it to "foo" like exclude, and never pass a pathname like "abc/" to match_{base,path}name) Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- attr.c | 2 ++ dir.c | 8 +++++--- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/attr.c b/attr.c index e2f9377..1818ba5 100644 --- a/attr.c +++ b/attr.c @@ -255,6 +255,8 @@ static struct match_attr *parse_attr_line(const char *line, const char *src, &res->u.pat.patternlen, &res->u.pat.flags, &res->u.pat.nowildcardlen); + if (res->u.pat.flags & EXC_FLAG_MUSTBEDIR) + res->u.pat.patternlen++; if (res->u.pat.flags & EXC_FLAG_NEGATIVE) { warning(_("Negative patterns are ignored in git attributes\n" "Use '\\!' for literal leading exclamation.")); diff --git a/dir.c b/dir.c index 9960a37..46b24db 100644 --- a/dir.c +++ b/dir.c @@ -610,12 +610,14 @@ int match_basename(const char *basename, int basenamelen, int flags) { if (prefix == patternlen) { - if (!strcmp_icase(pattern, basename)) + if (patternlen == basenamelen && + !strncmp_icase(pattern, basename, patternlen)) return 1; } else if (flags & EXC_FLAG_ENDSWITH) { if (patternlen - 1 <= basenamelen && - !strcmp_icase(pattern + 1, - basename + basenamelen - patternlen + 1)) + !strncmp_icase(pattern + 1, + basename + basenamelen - patternlen + 1, + patternlen - 1)) return 1; } else { if (fnmatch_icase(pattern, basename, 0) == 0) -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH v2 3/6] match_basename: use strncmp instead of strcmp 2013-03-10 6:14 ` [PATCH v2 3/6] match_basename: use strncmp instead of strcmp Nguyễn Thái Ngọc Duy @ 2013-03-10 7:34 ` Junio C Hamano 2013-03-10 10:38 ` Duy Nguyen 0 siblings, 1 reply; 48+ messages in thread From: Junio C Hamano @ 2013-03-10 7:34 UTC (permalink / raw) To: Nguyễn Thái Ngọc Duy; +Cc: git Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes: > strncmp is provided length information which could be taken advantage > by the underlying implementation. I may be missing something fundamental, but I somehow find the above does not make any sense. strcmp(a, b) has to pay attention to NUL in these strings and stop comparison. strncmp(a, b, n) not only has to pay the same attention to NUL in the strings, but also needs to stop comparing at n bytes. In what situation can the latter take advantage of that extra thing that it needs to keep track of and operate faster, when n is the length of shorter of the two strings? > diff --git a/dir.c b/dir.c > index 9960a37..46b24db 100644 > --- a/dir.c > +++ b/dir.c > @@ -610,12 +610,14 @@ int match_basename(const char *basename, int basenamelen, > int flags) > { > if (prefix == patternlen) { > - if (!strcmp_icase(pattern, basename)) > + if (patternlen == basenamelen && > + !strncmp_icase(pattern, basename, patternlen)) > return 1; What happens if you replace this with if (patternlen == baselen && !strcmp_icase(pattern, basename, patternlen)) and drop the other hunk and run the benchmark again? > } else if (flags & EXC_FLAG_ENDSWITH) { > if (patternlen - 1 <= basenamelen && > - !strcmp_icase(pattern + 1, > - basename + basenamelen - patternlen + 1)) > + !strncmp_icase(pattern + 1, > + basename + basenamelen - patternlen + 1, > + patternlen - 1)) > return 1; > } else { > if (fnmatch_icase(pattern, basename, 0) == 0) ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 3/6] match_basename: use strncmp instead of strcmp 2013-03-10 7:34 ` Junio C Hamano @ 2013-03-10 10:38 ` Duy Nguyen 2013-03-10 11:43 ` Antoine Pelisse 2013-03-12 20:59 ` Junio C Hamano 0 siblings, 2 replies; 48+ messages in thread From: Duy Nguyen @ 2013-03-10 10:38 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Sun, Mar 10, 2013 at 2:34 PM, Junio C Hamano <gitster@pobox.com> wrote: > Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes: > >> strncmp is provided length information which could be taken advantage >> by the underlying implementation. > > I may be missing something fundamental, but I somehow find the above > does not make any sense. > > strcmp(a, b) has to pay attention to NUL in these strings and stop > comparison. strncmp(a, b, n) not only has to pay the same attention > to NUL in the strings, but also needs to stop comparing at n bytes. > > In what situation can the latter take advantage of that extra thing > that it needs to keep track of and operate faster, when n is the > length of shorter of the two strings? glibc's C strncmp version does 4-byte comparison at a time when n >=4, then fall back to 1-byte for the rest. I don't know if it's faster than a plain always 1-byte comparison though. There's also the hand written assembly version that compares n from 1..16, not exactly sure how this version works yet. >> diff --git a/dir.c b/dir.c >> index 9960a37..46b24db 100644 >> --- a/dir.c >> +++ b/dir.c >> @@ -610,12 +610,14 @@ int match_basename(const char *basename, int basenamelen, >> int flags) >> { >> if (prefix == patternlen) { >> - if (!strcmp_icase(pattern, basename)) >> + if (patternlen == basenamelen && >> + !strncmp_icase(pattern, basename, patternlen)) >> return 1; > > What happens if you replace this with > > if (patternlen == baselen && > !strcmp_icase(pattern, basename, patternlen)) > > and drop the other hunk and run the benchmark again? > before after user 0m0.533s 0m0.522s user 0m0.549s 0m0.530s user 0m0.550s 0m0.534s user 0m0.551s 0m0.545s user 0m0.556s 0m0.550s user 0m0.557s 0m0.552s user 0m0.559s 0m0.554s user 0m0.564s 0m0.561s user 0m0.567s 0m0.565s user 0m0.567s 0m0.565s -- Duy ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 3/6] match_basename: use strncmp instead of strcmp 2013-03-10 10:38 ` Duy Nguyen @ 2013-03-10 11:43 ` Antoine Pelisse 2013-03-10 11:54 ` Antoine Pelisse 2013-03-12 20:59 ` Junio C Hamano 1 sibling, 1 reply; 48+ messages in thread From: Antoine Pelisse @ 2013-03-10 11:43 UTC (permalink / raw) To: Duy Nguyen; +Cc: Junio C Hamano, git On Sun, Mar 10, 2013 at 11:38 AM, Duy Nguyen <pclouds@gmail.com> wrote: > glibc's C strncmp version does 4-byte comparison at a time when n >=4, > then fall back to 1-byte for the rest. Looking at this (http://fossies.org/dox/glibc-2.17/strncmp_8c_source.html), it's not exactly true. It would rather be while (n >= 4), manually unroll the loop. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 3/6] match_basename: use strncmp instead of strcmp 2013-03-10 11:43 ` Antoine Pelisse @ 2013-03-10 11:54 ` Antoine Pelisse 2013-03-10 12:06 ` Duy Nguyen 0 siblings, 1 reply; 48+ messages in thread From: Antoine Pelisse @ 2013-03-10 11:54 UTC (permalink / raw) To: Duy Nguyen; +Cc: Junio C Hamano, git On Sun, Mar 10, 2013 at 12:43 PM, Antoine Pelisse <apelisse@gmail.com> wrote: > On Sun, Mar 10, 2013 at 11:38 AM, Duy Nguyen <pclouds@gmail.com> wrote: >> glibc's C strncmp version does 4-byte comparison at a time when n >=4, >> then fall back to 1-byte for the rest. > > Looking at this > (http://fossies.org/dox/glibc-2.17/strncmp_8c_source.html), it's not > exactly true. > > It would rather be while (n >= 4), manually unroll the loop. By the way, if we know the length of the string, we could use memcmp. This one is allowed to compare 4-bytes at a time (he doesn't care about end of string). This is true because the value of the length parameter is no longer "at most". ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 3/6] match_basename: use strncmp instead of strcmp 2013-03-10 11:54 ` Antoine Pelisse @ 2013-03-10 12:06 ` Duy Nguyen 2013-03-10 12:11 ` Antoine Pelisse 0 siblings, 1 reply; 48+ messages in thread From: Duy Nguyen @ 2013-03-10 12:06 UTC (permalink / raw) To: Antoine Pelisse; +Cc: Junio C Hamano, git On Sun, Mar 10, 2013 at 6:54 PM, Antoine Pelisse <apelisse@gmail.com> wrote: > On Sun, Mar 10, 2013 at 12:43 PM, Antoine Pelisse <apelisse@gmail.com> wrote: >> On Sun, Mar 10, 2013 at 11:38 AM, Duy Nguyen <pclouds@gmail.com> wrote: >>> glibc's C strncmp version does 4-byte comparison at a time when n >=4, >>> then fall back to 1-byte for the rest. >> >> Looking at this >> (http://fossies.org/dox/glibc-2.17/strncmp_8c_source.html), it's not >> exactly true. >> >> It would rather be while (n >= 4), manually unroll the loop. > > By the way, if we know the length of the string, we could use memcmp. > This one is allowed to compare 4-bytes at a time (he doesn't care > about end of string). This is true because the value of the length > parameter is no longer "at most". We still need to worry about access violation after NUL when two strings have different lengths. That could be avoided in this particular case, but I think it's too fragile. -- Duy ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 3/6] match_basename: use strncmp instead of strcmp 2013-03-10 12:06 ` Duy Nguyen @ 2013-03-10 12:11 ` Antoine Pelisse 2013-03-10 12:14 ` Duy Nguyen 0 siblings, 1 reply; 48+ messages in thread From: Antoine Pelisse @ 2013-03-10 12:11 UTC (permalink / raw) To: Duy Nguyen; +Cc: Junio C Hamano, git >> By the way, if we know the length of the string, we could use memcmp. >> This one is allowed to compare 4-bytes at a time (he doesn't care >> about end of string). This is true because the value of the length >> parameter is no longer "at most". > > We still need to worry about access violation after NUL when two > strings have different lengths. That could be avoided in this > particular case, but I think it's too fragile. Why would we need to compare if the strings don't have the same length ? We already do that in combine-diff.c:append_lost(). ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 3/6] match_basename: use strncmp instead of strcmp 2013-03-10 12:11 ` Antoine Pelisse @ 2013-03-10 12:14 ` Duy Nguyen 0 siblings, 0 replies; 48+ messages in thread From: Duy Nguyen @ 2013-03-10 12:14 UTC (permalink / raw) To: Antoine Pelisse; +Cc: Junio C Hamano, git On Sun, Mar 10, 2013 at 7:11 PM, Antoine Pelisse <apelisse@gmail.com> wrote: >>> By the way, if we know the length of the string, we could use memcmp. >>> This one is allowed to compare 4-bytes at a time (he doesn't care >>> about end of string). This is true because the value of the length >>> parameter is no longer "at most". >> >> We still need to worry about access violation after NUL when two >> strings have different lengths. That could be avoided in this >> particular case, but I think it's too fragile. > > Why would we need to compare if the strings don't have the same length > ? We already do that in combine-diff.c:append_lost(). Watching movie and replying to git@ don't mix. You're right we don't need to compare if lengths are different. What was I thinking.. -- Duy ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 3/6] match_basename: use strncmp instead of strcmp 2013-03-10 10:38 ` Duy Nguyen 2013-03-10 11:43 ` Antoine Pelisse @ 2013-03-12 20:59 ` Junio C Hamano 2013-03-13 1:11 ` Duy Nguyen 1 sibling, 1 reply; 48+ messages in thread From: Junio C Hamano @ 2013-03-12 20:59 UTC (permalink / raw) To: Duy Nguyen; +Cc: git Duy Nguyen <pclouds@gmail.com> writes: > glibc's C strncmp version does 4-byte comparison at a time when n >=4, > then fall back to 1-byte for the rest. I don't know if it's faster > than a plain always 1-byte comparison though. There's also the hand > written assembly version that compares n from 1..16, not exactly sure > how this version works yet. It sounds to me more like "a very popular implementation of strcmp/strncmp pair found to have more optimized strncmp than strcmp". While that may be a good reason to justify this patch, I do not think it justifies this: >> strncmp is provided length information which could be taken advantage >> by the underlying implementation. After all, strcmp() could also be optimized to fetch word-at-a-time while being careful about not overstepping the page boundary. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 3/6] match_basename: use strncmp instead of strcmp 2013-03-12 20:59 ` Junio C Hamano @ 2013-03-13 1:11 ` Duy Nguyen 0 siblings, 0 replies; 48+ messages in thread From: Duy Nguyen @ 2013-03-13 1:11 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Wed, Mar 13, 2013 at 3:59 AM, Junio C Hamano <gitster@pobox.com> wrote: >>> strncmp is provided length information which could be taken advantage >>> by the underlying implementation. > > After all, strcmp() could also be optimized to fetch word-at-a-time > while being careful about not overstepping the page boundary. It boils down to giving more information to the underlying implementation with hope that it can be useful for something. Although at this point I think strncmp vs strcmp vs memcmp may be not worth doing (we keep explicit length comparison to reduce *cmp calls though). The gain is relatively small and will become even smaller after we avoid running exclude on tracked files (which eliminates like 2/3 of the processed entries). -- Duy ^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH v2 4/6] match_{base,path}name: replace strncmp_icase with strnequal_icase 2013-03-10 6:14 ` [PATCH v2 0/6] Exclude optimizations Nguyễn Thái Ngọc Duy ` (2 preceding siblings ...) 2013-03-10 6:14 ` [PATCH v2 3/6] match_basename: use strncmp instead of strcmp Nguyễn Thái Ngọc Duy @ 2013-03-10 6:14 ` Nguyễn Thái Ngọc Duy 2013-03-10 6:14 ` [PATCH v2 5/6] dir.c: pass pathname length to last_exclude_matching Nguyễn Thái Ngọc Duy ` (3 subsequent siblings) 7 siblings, 0 replies; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-10 6:14 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy We could also optimize for ignore_case, assuming that non-ascii characters are not used in most repositories. We could check that all patterns and pathnames are ascii-only, then use git's toupper() before after user 0m0.516s 0m0.433s user 0m0.523s 0m0.437s user 0m0.532s 0m0.443s user 0m0.533s 0m0.448s user 0m0.535s 0m0.449s user 0m0.542s 0m0.452s user 0m0.546s 0m0.453s user 0m0.551s 0m0.458s user 0m0.556s 0m0.459s user 0m0.561s 0m0.462s Suggested-by: Fredrik Gustafsson <iveqy@iveqy.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- dir.c | 27 +++++++++++++++++++++++---- 1 file changed, 23 insertions(+), 4 deletions(-) diff --git a/dir.c b/dir.c index 46b24db..7b6a625 100644 --- a/dir.c +++ b/dir.c @@ -21,6 +21,25 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, in int check_only, const struct path_simplify *simplify); static int get_dtype(struct dirent *de, const char *path, int len); +/* + * This function is more like memequal_icase than strnequal_icase as + * it does not check for NUL. The caller is not supposed to pass a + * length longer than both input strings + */ +static inline strnequal_icase(const char *a, const char *b, int n) +{ + if (!ignore_case) { + while (n && *a == *b) { + a++; + b++; + n--; + } + return n == 0; + } + + return !strncmp_icase(a, b, n); +} + inline int git_fnmatch(const char *pattern, const char *string, int flags, int prefix) { @@ -611,11 +630,11 @@ int match_basename(const char *basename, int basenamelen, { if (prefix == patternlen) { if (patternlen == basenamelen && - !strncmp_icase(pattern, basename, patternlen)) + strnequal_icase(pattern, basename, patternlen)) return 1; } else if (flags & EXC_FLAG_ENDSWITH) { if (patternlen - 1 <= basenamelen && - !strncmp_icase(pattern + 1, + strnequal_icase(pattern + 1, basename + basenamelen - patternlen + 1, patternlen - 1)) return 1; @@ -649,7 +668,7 @@ int match_pathname(const char *pathname, int pathlen, */ if (pathlen < baselen + 1 || (baselen && (pathname[baselen] != '/' || - strncmp_icase(pathname, base, baselen)))) + !strnequal_icase(pathname, base, baselen)))) return 0; namelen = baselen ? pathlen - baselen - 1 : pathlen; @@ -663,7 +682,7 @@ int match_pathname(const char *pathname, int pathlen, if (prefix > namelen) return 0; - if (strncmp_icase(pattern, name, prefix)) + if (!strnequal_icase(pattern, name, prefix)) return 0; pattern += prefix; name += prefix; -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH v2 5/6] dir.c: pass pathname length to last_exclude_matching 2013-03-10 6:14 ` [PATCH v2 0/6] Exclude optimizations Nguyễn Thái Ngọc Duy ` (3 preceding siblings ...) 2013-03-10 6:14 ` [PATCH v2 4/6] match_{base,path}name: replace strncmp_icase with strnequal_icase Nguyễn Thái Ngọc Duy @ 2013-03-10 6:14 ` Nguyễn Thái Ngọc Duy 2013-03-10 6:14 ` [PATCH v2 6/6] exclude: filter patterns by directory level Nguyễn Thái Ngọc Duy ` (2 subsequent siblings) 7 siblings, 0 replies; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-10 6:14 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- dir.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/dir.c b/dir.c index 7b6a625..880b5e6 100644 --- a/dir.c +++ b/dir.c @@ -764,9 +764,9 @@ int is_excluded_from_list(const char *pathname, */ static struct exclude *last_exclude_matching(struct dir_struct *dir, const char *pathname, + int pathlen, int *dtype_p) { - int pathlen = strlen(pathname); int i, j; struct exclude_list_group *group; struct exclude *exclude; @@ -793,10 +793,12 @@ static struct exclude *last_exclude_matching(struct dir_struct *dir, * scans all exclude lists to determine whether pathname is excluded. * Returns 1 if true, otherwise 0. */ -static int is_excluded(struct dir_struct *dir, const char *pathname, int *dtype_p) +static int is_excluded(struct dir_struct *dir, + const char *pathname, int pathlen, + int *dtype_p) { struct exclude *exclude = - last_exclude_matching(dir, pathname, dtype_p); + last_exclude_matching(dir, pathname, pathlen, dtype_p); if (exclude) return exclude->flags & EXC_FLAG_NEGATIVE ? 0 : 1; return 0; @@ -859,7 +861,8 @@ struct exclude *last_exclude_matching_path(struct path_exclude_check *check, if (ch == '/') { int dt = DT_DIR; exclude = last_exclude_matching(check->dir, - path->buf, &dt); + path->buf, path->len, + &dt); if (exclude) { check->exclude = exclude; return exclude; @@ -871,7 +874,7 @@ struct exclude *last_exclude_matching_path(struct path_exclude_check *check, /* An entry in the index; cannot be a directory with subentries */ strbuf_setlen(path, 0); - return last_exclude_matching(check->dir, name, dtype); + return last_exclude_matching(check->dir, name, namelen, dtype); } /* @@ -1249,7 +1252,7 @@ static enum path_treatment treat_one_path(struct dir_struct *dir, const struct path_simplify *simplify, int dtype, struct dirent *de) { - int exclude = is_excluded(dir, path->buf, &dtype); + int exclude = is_excluded(dir, path->buf, path->len, &dtype); if (exclude && (dir->flags & DIR_COLLECT_IGNORED) && exclude_matches_pathspec(path->buf, path->len, simplify)) dir_add_ignored(dir, path->buf, path->len); -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH v2 6/6] exclude: filter patterns by directory level 2013-03-10 6:14 ` [PATCH v2 0/6] Exclude optimizations Nguyễn Thái Ngọc Duy ` (4 preceding siblings ...) 2013-03-10 6:14 ` [PATCH v2 5/6] dir.c: pass pathname length to last_exclude_matching Nguyễn Thái Ngọc Duy @ 2013-03-10 6:14 ` Nguyễn Thái Ngọc Duy 2013-03-10 8:20 ` Junio C Hamano 2013-03-11 15:11 ` [PATCH v2 0/6] Exclude optimizations Duy Nguyen 2013-03-12 13:04 ` [PATCH v3 00/13] " Nguyễn Thái Ngọc Duy 7 siblings, 1 reply; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-10 6:14 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy A non-basename pattern that does not contain /**/ can't match anything outside the attached directory. Record its directory level and avoid matching unless the pathname is also at the same directory level. This optimization shines when there are a lot of non-basename patterns are the root .gitignore and big/deep worktree. Due to the cascading rule of .gitignore, patterns in the root .gitignore are checked for _all_ entries in the worktree. before after user 0m0.424s 0m0.365s user 0m0.427s 0m0.366s user 0m0.432s 0m0.374s user 0m0.435s 0m0.374s user 0m0.435s 0m0.377s user 0m0.437s 0m0.381s user 0m0.439s 0m0.381s user 0m0.440s 0m0.383s user 0m0.450s 0m0.384s user 0m0.454s 0m0.384s Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- attr.c | 3 ++- dir.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++------------------ dir.h | 9 ++++++++- 3 files changed, 60 insertions(+), 20 deletions(-) diff --git a/attr.c b/attr.c index 1818ba5..7764ddd 100644 --- a/attr.c +++ b/attr.c @@ -254,7 +254,8 @@ static struct match_attr *parse_attr_line(const char *line, const char *src, parse_exclude_pattern(&res->u.pat.pattern, &res->u.pat.patternlen, &res->u.pat.flags, - &res->u.pat.nowildcardlen); + &res->u.pat.nowildcardlen, + NULL); if (res->u.pat.flags & EXC_FLAG_MUSTBEDIR) res->u.pat.patternlen++; if (res->u.pat.flags & EXC_FLAG_NEGATIVE) { diff --git a/dir.c b/dir.c index 880b5e6..de7a6ba 100644 --- a/dir.c +++ b/dir.c @@ -360,10 +360,12 @@ static int no_wildcard(const char *string) void parse_exclude_pattern(const char **pattern, int *patternlen, int *flags, - int *nowildcardlen) + int *nowildcardlen, + int *dirs_p) { const char *p = *pattern; size_t i, len; + int dirs; *flags = 0; if (*p == '!') { @@ -375,12 +377,15 @@ void parse_exclude_pattern(const char **pattern, len--; *flags |= EXC_FLAG_MUSTBEDIR; } - for (i = 0; i < len; i++) { + for (i = 0, dirs = 0; i < len; i++) { if (p[i] == '/') - break; + dirs++; } - if (i == len) + if (!dirs) *flags |= EXC_FLAG_NODIR; + else if (*p == '/') + dirs--; + *nowildcardlen = simple_length(p); /* * we should have excluded the trailing slash from 'p' too, @@ -393,6 +398,8 @@ void parse_exclude_pattern(const char **pattern, *flags |= EXC_FLAG_ENDSWITH; *pattern = p; *patternlen = len; + if (dirs_p) + *dirs_p = dirs; } void add_exclude(const char *string, const char *base, @@ -402,8 +409,9 @@ void add_exclude(const char *string, const char *base, int patternlen; int flags; int nowildcardlen; + int dirs; - parse_exclude_pattern(&string, &patternlen, &flags, &nowildcardlen); + parse_exclude_pattern(&string, &patternlen, &flags, &nowildcardlen, &dirs); if (flags & EXC_FLAG_MUSTBEDIR) { char *s; x = xmalloc(sizeof(*x) + patternlen + 1); @@ -415,11 +423,26 @@ void add_exclude(const char *string, const char *base, x = xmalloc(sizeof(*x)); x->pattern = string; } + /* + * TODO: nowildcardlen < patternlen is a stricter than + * necessary mainly to exclude "**" that breaks directory + * boundary. Patterns like "/foo-*" should be fine. + */ + if ((flags & EXC_FLAG_NODIR) || nowildcardlen < patternlen) + dirs = -1; + else { + int i; + for (i = 0; i < baselen; i++) { + if (base[i] == '/') + dirs++; + } + } x->patternlen = patternlen; x->nowildcardlen = nowildcardlen; x->base = base; x->baselen = baselen; x->flags = flags; + x->dirs = dirs; x->srcpos = srcpos; ALLOC_GROW(el->excludes, el->nr + 1, el->alloc); el->excludes[el->nr++] = x; @@ -701,7 +724,7 @@ int match_pathname(const char *pathname, int pathlen, * matched, or NULL for undecided. */ static struct exclude *last_exclude_matching_from_list(const char *pathname, - int pathlen, + int pathlen, int dirs, const char *basename, int *dtype, struct exclude_list *el) @@ -732,6 +755,9 @@ static struct exclude *last_exclude_matching_from_list(const char *pathname, continue; } + if (dirs >= 0 && x->dirs >= 0 && x->dirs != dirs) + continue; + assert(x->baselen == 0 || x->base[x->baselen - 1] == '/'); if (match_pathname(pathname, pathlen, x->base, x->baselen ? x->baselen - 1 : 0, @@ -750,7 +776,8 @@ int is_excluded_from_list(const char *pathname, struct exclude_list *el) { struct exclude *exclude; - exclude = last_exclude_matching_from_list(pathname, pathlen, basename, dtype, el); + exclude = last_exclude_matching_from_list(pathname, pathlen, -1, + basename, dtype, el); if (exclude) return exclude->flags & EXC_FLAG_NEGATIVE ? 0 : 1; return -1; /* undecided */ @@ -765,6 +792,7 @@ int is_excluded_from_list(const char *pathname, static struct exclude *last_exclude_matching(struct dir_struct *dir, const char *pathname, int pathlen, + int dirs, int *dtype_p) { int i, j; @@ -779,8 +807,8 @@ static struct exclude *last_exclude_matching(struct dir_struct *dir, group = &dir->exclude_list_group[i]; for (j = group->nr - 1; j >= 0; j--) { exclude = last_exclude_matching_from_list( - pathname, pathlen, basename, dtype_p, - &group->el[j]); + pathname, pathlen, dir->dir_level, + basename, dtype_p, &group->el[j]); if (exclude) return exclude; } @@ -794,11 +822,11 @@ static struct exclude *last_exclude_matching(struct dir_struct *dir, * Returns 1 if true, otherwise 0. */ static int is_excluded(struct dir_struct *dir, - const char *pathname, int pathlen, + const char *pathname, int pathlen, int dirs, int *dtype_p) { struct exclude *exclude = - last_exclude_matching(dir, pathname, pathlen, dtype_p); + last_exclude_matching(dir, pathname, pathlen, dirs, dtype_p); if (exclude) return exclude->flags & EXC_FLAG_NEGATIVE ? 0 : 1; return 0; @@ -862,7 +890,7 @@ struct exclude *last_exclude_matching_path(struct path_exclude_check *check, int dt = DT_DIR; exclude = last_exclude_matching(check->dir, path->buf, path->len, - &dt); + -1, &dt); if (exclude) { check->exclude = exclude; return exclude; @@ -874,7 +902,7 @@ struct exclude *last_exclude_matching_path(struct path_exclude_check *check, /* An entry in the index; cannot be a directory with subentries */ strbuf_setlen(path, 0); - return last_exclude_matching(check->dir, name, namelen, dtype); + return last_exclude_matching(check->dir, name, namelen, -1, dtype); } /* @@ -1248,11 +1276,11 @@ enum path_treatment { }; static enum path_treatment treat_one_path(struct dir_struct *dir, - struct strbuf *path, + struct strbuf *path, int dirs, const struct path_simplify *simplify, int dtype, struct dirent *de) { - int exclude = is_excluded(dir, path->buf, path->len, &dtype); + int exclude = is_excluded(dir, path->buf, path->len, dirs, &dtype); if (exclude && (dir->flags & DIR_COLLECT_IGNORED) && exclude_matches_pathspec(path->buf, path->len, simplify)) dir_add_ignored(dir, path->buf, path->len); @@ -1310,7 +1338,7 @@ static enum path_treatment treat_path(struct dir_struct *dir, return path_ignored; dtype = DTYPE(de); - return treat_one_path(dir, path, simplify, dtype, de); + return treat_one_path(dir, path, -1, simplify, dtype, de); } /* @@ -1338,6 +1366,7 @@ static int read_directory_recursive(struct dir_struct *dir, if (!fdir) goto out; + dir->dir_level++; while ((de = readdir(fdir)) != NULL) { switch (treat_path(dir, de, &path, baselen, simplify)) { case path_recurse: @@ -1357,6 +1386,7 @@ static int read_directory_recursive(struct dir_struct *dir, } closedir(fdir); out: + dir->dir_level--; strbuf_release(&path); return contents; @@ -1427,7 +1457,7 @@ static int treat_leading_path(struct dir_struct *dir, break; if (simplify_away(sb.buf, sb.len, simplify)) break; - if (treat_one_path(dir, &sb, simplify, + if (treat_one_path(dir, &sb, -1, simplify, DT_DIR, NULL) == path_ignored) break; /* do not recurse into it */ if (len <= baselen) { @@ -1447,8 +1477,10 @@ int read_directory(struct dir_struct *dir, const char *path, int len, const char return dir->nr; simplify = create_simplify(pathspec); - if (!len || treat_leading_path(dir, path, len, simplify)) + if (!len || treat_leading_path(dir, path, len, simplify)) { + dir->dir_level = -1; read_directory_recursive(dir, path, len, 0, simplify); + } free_simplify(simplify); qsort(dir->entries, dir->nr, sizeof(struct dir_entry *), cmp_name); qsort(dir->ignored, dir->ignored_nr, sizeof(struct dir_entry *), cmp_name); diff --git a/dir.h b/dir.h index 560ade4..c434f1c 100644 --- a/dir.h +++ b/dir.h @@ -45,6 +45,7 @@ struct exclude_list { const char *base; int baselen; int flags; + int dirs; /* * Counting starts from 1 for line numbers in ignore files, @@ -87,6 +88,8 @@ struct dir_struct { /* Exclude info */ const char *exclude_per_dir; + int dir_level; + /* * We maintain three groups of exclude pattern lists: * @@ -171,7 +174,11 @@ extern struct exclude_list *add_exclude_list(struct dir_struct *dir, extern int add_excludes_from_file_to_list(const char *fname, const char *base, int baselen, struct exclude_list *el, int check_index); extern void add_excludes_from_file(struct dir_struct *, const char *fname); -extern void parse_exclude_pattern(const char **string, int *patternlen, int *flags, int *nowildcardlen); +extern void parse_exclude_pattern(const char **string, + int *patternlen, + int *flags, + int *nowildcardlen, + int *dirs); extern void add_exclude(const char *string, const char *base, int baselen, struct exclude_list *el, int srcpos); extern void clear_exclude_list(struct exclude_list *el); -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH v2 6/6] exclude: filter patterns by directory level 2013-03-10 6:14 ` [PATCH v2 6/6] exclude: filter patterns by directory level Nguyễn Thái Ngọc Duy @ 2013-03-10 8:20 ` Junio C Hamano 2013-03-10 10:18 ` Duy Nguyen 2013-03-10 10:58 ` Junio C Hamano 0 siblings, 2 replies; 48+ messages in thread From: Junio C Hamano @ 2013-03-10 8:20 UTC (permalink / raw) To: Nguyễn Thái Ngọc Duy; +Cc: git Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes: > A non-basename pattern that does not contain /**/ can't match anything > outside the attached directory. Record its directory level and avoid > matching unless the pathname is also at the same directory level. Without defining what a "directory level" is, the above is a bit hard to grok, but I think you mean an entry "b/c/*.c" that appears in "a/.gitignore" file will want to match a path that is directly in "a/b/c" directory (and not in its subdirectories), "a/b/x.c" at the two levels deep subdirectory or "a/b/c/d/x.c" that is four levels deep will never match the pattern. The logic feels sound. > diff --git a/dir.c b/dir.c > index 880b5e6..de7a6ba 100644 > --- a/dir.c > +++ b/dir.c > @@ -360,10 +360,12 @@ static int no_wildcard(const char *string) > void parse_exclude_pattern(const char **pattern, > int *patternlen, > int *flags, > - int *nowildcardlen) > + int *nowildcardlen, > + int *dirs_p) > { > const char *p = *pattern; > size_t i, len; > + int dirs; > > *flags = 0; > if (*p == '!') { > @@ -375,12 +377,15 @@ void parse_exclude_pattern(const char **pattern, > len--; > *flags |= EXC_FLAG_MUSTBEDIR; > } > - for (i = 0; i < len; i++) { > + for (i = 0, dirs = 0; i < len; i++) { > if (p[i] == '/') > - break; > + dirs++; > } > - if (i == len) > + if (!dirs) > *flags |= EXC_FLAG_NODIR; > + else if (*p == '/') > + dirs--; I presume this is to compensate for a pattern like "/pat" whose leading slash is only to anchor the pattern at the level. Correct? > @@ -415,11 +423,26 @@ void add_exclude(const char *string, const char *base, > x = xmalloc(sizeof(*x)); > x->pattern = string; > } > + /* > + * TODO: nowildcardlen < patternlen is a stricter than > + * necessary mainly to exclude "**" that breaks directory > + * boundary. Patterns like "/foo-*" should be fine. > + */ > + if ((flags & EXC_FLAG_NODIR) || nowildcardlen < patternlen) > + dirs = -1; OK, so an entry "README" to match README in any subdirectory will becomes (dirs < 0) and the matcher below will not short-circuit the comparison. Good. > + else { > + int i; > + for (i = 0; i < baselen; i++) { > + if (base[i] == '/') > + dirs++; > + } > + } > x->patternlen = patternlen; > x->nowildcardlen = nowildcardlen; > x->base = base; > x->baselen = baselen; > x->flags = flags; > + x->dirs = dirs; > x->srcpos = srcpos; > ALLOC_GROW(el->excludes, el->nr + 1, el->alloc); > el->excludes[el->nr++] = x; > @@ -701,7 +724,7 @@ int match_pathname(const char *pathname, int pathlen, > * matched, or NULL for undecided. > */ > static struct exclude *last_exclude_matching_from_list(const char *pathname, > - int pathlen, > + int pathlen, int dirs, > const char *basename, > int *dtype, > struct exclude_list *el) > @@ -732,6 +755,9 @@ static struct exclude *last_exclude_matching_from_list(const char *pathname, > continue; > } > > + if (dirs >= 0 && x->dirs >= 0 && x->dirs != dirs) > + continue; ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 6/6] exclude: filter patterns by directory level 2013-03-10 8:20 ` Junio C Hamano @ 2013-03-10 10:18 ` Duy Nguyen 2013-03-10 10:58 ` Junio C Hamano 1 sibling, 0 replies; 48+ messages in thread From: Duy Nguyen @ 2013-03-10 10:18 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Sun, Mar 10, 2013 at 3:20 PM, Junio C Hamano <gitster@pobox.com> wrote: >> + else if (*p == '/') >> + dirs--; > > I presume this is to compensate for a pattern like "/pat" whose > leading slash is only to anchor the pattern at the level. Correct? Yes. Also for the record, we could cut down the number of prep_exclude calls significantly by only calling it when we switch directories (i.e. when read_directory_recursive begins or exits), not calling it for all entries of the same directory. For instance, path/, path/a, path/b, path/c/, path/c/d, path/e should only call prep_exclude 3 times when we enter "path", "path/c" and leave "path/c" (rather than 6 times currently). Unfortunately, I see no real time savings by this call reduction. So no patch. Maybe I'll try it again on my slower laptop and see if it makes any difference. -- Duy ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 6/6] exclude: filter patterns by directory level 2013-03-10 8:20 ` Junio C Hamano 2013-03-10 10:18 ` Duy Nguyen @ 2013-03-10 10:58 ` Junio C Hamano 2013-03-10 11:14 ` Duy Nguyen 1 sibling, 1 reply; 48+ messages in thread From: Junio C Hamano @ 2013-03-10 10:58 UTC (permalink / raw) To: Nguyễn Thái Ngọc Duy; +Cc: git Junio C Hamano <gitster@pobox.com> writes: > Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes: > >> A non-basename pattern that does not contain /**/ can't match anything >> outside the attached directory. Record its directory level and avoid >> matching unless the pathname is also at the same directory level. > > Without defining what a "directory level" is, the above is a bit > hard to grok, but I think you mean an entry "b/c/*.c" that appears > in "a/.gitignore" file will want to match a path that is directly > in "a/b/c" directory (and not in its subdirectories), > "a/b/x.c" at the two levels deep subdirectory or "a/b/c/d/x.c" that is > four levels deep will never match the pattern. > > The logic feels sound. Actually, I think you may be able to do a lot more with a simpler change. If your top-level .gitignore has "a/b/c/*.c" in it, you certainly want to mark it not to be applied when you are looking at paths directly in directory a/b/ because they will never match, but you also know that nothing will match when you are inside a/b/d/, even though the pattern and the path you are checking are at the same levels. Your dirlen approach will fail for that case, no? The idea behind prep_exclude() that organizes the exclode patterns into a stack structure and update the groups near the leaves by popping those for the old directory we were in and pushing those for the new directory we are going into is to give us a place to tweak the elements on the whole stack for optimization when we notice that we are looking at paths in different directories. Instead of giving a "dirlen" member to each element, you could give a "do not look at me" flag to it, and when you notice that you were in a/b/c/ and now you are going to look at paths in a/b/d/, you can look at the group that was read from the .gitignore from the top-level, and mark entries that cannot be relevant (e.g. "a/b/c/*.c") as such. The mark does not have to be a boolean. "a/b/*.c" when you are in "a/b/c/" can be marked as "This never matches, and I do not have to re-check until I pop one level". When digging deeper to "a/b/c/d", you add one to that. When switching to "a/b/e", you would first pop twice ("d" and then "c"), each time decrementing the "I do not have to re-check" counter by one, and then when pushing "e" down, you notice that you need to re-check, and mark it again as "no need to re-check for one pop". So it is not like you have to re-scan all entries textually every time you switch directories. Most entries that are level-limited you would increment or decrement its counter and only the ones at the level boundary need to be re-checked. Hmm? ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 6/6] exclude: filter patterns by directory level 2013-03-10 10:58 ` Junio C Hamano @ 2013-03-10 11:14 ` Duy Nguyen 0 siblings, 0 replies; 48+ messages in thread From: Duy Nguyen @ 2013-03-10 11:14 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Sun, Mar 10, 2013 at 5:58 PM, Junio C Hamano <gitster@pobox.com> wrote: > Junio C Hamano <gitster@pobox.com> writes: > >> Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes: >> >>> A non-basename pattern that does not contain /**/ can't match anything >>> outside the attached directory. Record its directory level and avoid >>> matching unless the pathname is also at the same directory level. >> >> Without defining what a "directory level" is, the above is a bit >> hard to grok, but I think you mean an entry "b/c/*.c" that appears >> in "a/.gitignore" file will want to match a path that is directly >> in "a/b/c" directory (and not in its subdirectories), >> "a/b/x.c" at the two levels deep subdirectory or "a/b/c/d/x.c" that is >> four levels deep will never match the pattern. >> >> The logic feels sound. > > Actually, I think you may be able to do a lot more with a simpler > change. If your top-level .gitignore has "a/b/c/*.c" in it, you > certainly want to mark it not to be applied when you are looking at > paths directly in directory a/b/ because they will never match, but > you also know that nothing will match when you are inside a/b/d/, > even though the pattern and the path you are checking are at the > same levels. Your dirlen approach will fail for that case, no? > > The idea behind prep_exclude() that organizes the exclode patterns > into a stack structure and update the groups near the leaves by > popping those for the old directory we were in and pushing those for > the new directory we are going into is to give us a place to tweak > the elements on the whole stack for optimization when we notice that > we are looking at paths in different directories. Instead of giving > a "dirlen" member to each element, you could give a "do not look at > me" flag to it, and when you notice that you were in a/b/c/ and now > you are going to look at paths in a/b/d/, you can look at the group > that was read from the .gitignore from the top-level, and mark > entries that cannot be relevant (e.g. "a/b/c/*.c") as such. > > The mark does not have to be a boolean. "a/b/*.c" when you are in > "a/b/c/" can be marked as "This never matches, and I do not have to > re-check until I pop one level". When digging deeper to "a/b/c/d", > you add one to that. When switching to "a/b/e", you would first pop > twice ("d" and then "c"), each time decrementing the "I do not have > to re-check" counter by one, and then when pushing "e" down, you > notice that you need to re-check, and mark it again as "no need to > re-check for one pop". So it is not like you have to re-scan all > entries textually every time you switch directories. Most entries > that are level-limited you would increment or decrement its counter > and only the ones at the level boundary need to be re-checked. A bit confused by "dirlen" (what is it?). I think what you're trying to say is "mark whether a pattern is applicable for entries in this directory in prep_exclude, update the marks as we push and pop directories". It does not sound simpler (and it's actually more powerful, as you said it could avoid checking "a/b/c/*.c" when standing in a/b/d). I'll give it a try. -- Duy ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 0/6] Exclude optimizations 2013-03-10 6:14 ` [PATCH v2 0/6] Exclude optimizations Nguyễn Thái Ngọc Duy ` (5 preceding siblings ...) 2013-03-10 6:14 ` [PATCH v2 6/6] exclude: filter patterns by directory level Nguyễn Thái Ngọc Duy @ 2013-03-11 15:11 ` Duy Nguyen 2013-03-12 13:04 ` [PATCH v3 00/13] " Nguyễn Thái Ngọc Duy 7 siblings, 0 replies; 48+ messages in thread From: Duy Nguyen @ 2013-03-11 15:11 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy The hunt continues.. (and thanks everyone for suggestions). Now is_excluded() (and exclude machinery) is no longer the hot spot in read_directory. index_name_exists is the new star: function time (in seconds) treat_leading_path: 0.000 read_directory: 0.289 +treat_one_path: 0.147 ++is_excluded: 0.013 +++prep_exclude: 0.006 +++matching: 0.004 ++dir_exists_in_index: 0.008 ++index_name_exists: 0.117 <-- +++lazy_init_name_hash: 0.060 +simplify_away: 0.004 +dir_add_name: 0.000 real 0m0.372s user 0m0.256s sys 0m0.114s <-- can't kill this one (*) until we get inotify support I think if we save the hash in index, we could nearly cut lazy_init_name_hash out (or not, perf reported insert_hash near the top, not hash_name). Any ideas to further reduce iname_name_exists cost are welcome. 0.117s on 2.50GHz turns to 0.549s on my Atom 1.6GHz, so I think it's worth doing something about it. (*) I tried breadth-first search, checking for .gitignore existence before opening, chdir() to shorten pathnames. Nothing worked. -- Duy ^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH v3 00/13] Exclude optimizations 2013-03-10 6:14 ` [PATCH v2 0/6] Exclude optimizations Nguyễn Thái Ngọc Duy ` (6 preceding siblings ...) 2013-03-11 15:11 ` [PATCH v2 0/6] Exclude optimizations Duy Nguyen @ 2013-03-12 13:04 ` Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 01/13] dir.c: add MEASURE_EXCLUDE code for tracking exclude performance Nguyễn Thái Ngọc Duy ` (13 more replies) 7 siblings, 14 replies; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy Result of today. I cherry-picked nd/read-directory-recursive-optim to see how far I can get after pulling all the tricks. This is a slower machine so time is longer. Anyway, read_directory time is reduced about 70% in the end. function before after ---------------------------------- treat_leading_path: 0.000 0.000 read_directory: 4.102 1.235 +treat_one_path: 2.843 0.531 ++is_excluded: 2.632 0.102 +++prep_exclude: 0.225 0.040 +++matching: 2.054 0.035 ++dir_exist: 0.035 0.035 ++index_name_exists: 0.292 0.225 lazy_init_name_hash: 0.258 0.155 +simplify_away: 0.085 0.083 +dir_add_name: 0.446 0.000 I don't expect all these patches to go in. The meat is nd/read-directory-recursive-optim (or 10/13) and 09/13. Some other patches are safe even if they do not contribute much to the gain. The last two are probably not worth the trouble. Nguyễn Thái Ngọc Duy (13): dir.c: add MEASURE_EXCLUDE code for tracking exclude performance match_pathname: avoid calling strncmp if baselen is 0 dir.c: inline convenient *_icase helpers match_basename: use strncmp instead of strcmp match_{base,path}name: replace strncmp_icase with memequal_icase dir: pass pathname length to last_exclude_matching exclude: avoid calling prep_exclude on entries of the same directory exclude: record baselen in the pattern exclude: filter out patterns not applicable to the current directory read_directory: avoid invoking exclude machinery on tracked files Preallocate hash tables when the number of inserts are known in advance name-hash: allow to lookup a name with precalculated base hash read_directory: calculate name hashes incrementally Makefile | 1 + attr.c | 6 +- cache.h | 2 - diffcore-rename.c | 1 + dir.c | 392 ++++++++++++++++++++++++++++++++++++++++++++---------- dir.h | 26 +++- hash.h | 7 + name-hash.c | 49 ++++--- name-hash.h (new) | 45 +++++++ read-cache.c | 1 + unpack-trees.c | 1 + 11 files changed, 431 insertions(+), 100 deletions(-) create mode 100644 name-hash.h -- 1.8.1.2.536.gf441e6d ^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH v3 01/13] dir.c: add MEASURE_EXCLUDE code for tracking exclude performance 2013-03-12 13:04 ` [PATCH v3 00/13] " Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 ` Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 02/13] match_pathname: avoid calling strncmp if baselen is 0 Nguyễn Thái Ngọc Duy ` (12 subsequent siblings) 13 siblings, 0 replies; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy Hot read_directory() codepaths are tracked. This could be helpful to see how changes affect read_directory() performance. Results are printed when environment variable GIT_MEASURE_EXCLUDE is set. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- dir.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 100 insertions(+), 9 deletions(-) diff --git a/dir.c b/dir.c index 57394e4..69c045b 100644 --- a/dir.c +++ b/dir.c @@ -12,6 +12,32 @@ #include "refs.h" #include "wildmatch.h" +#ifdef MEASURE_EXCLUDE +static uint32_t tv_treat_leading_path; +static uint32_t tv_read_directory; +static uint32_t tv_treat_one_path; +static uint32_t tv_is_excluded; +static uint32_t tv_prep_exclude; +static uint32_t tv_last_exclude_matching; +static uint32_t tv_dir_add_name; +static uint32_t tv_directory_exists_in_index; +static uint32_t tv_simplify_away; +static uint32_t tv_index_name_exists; +static uint32_t tv_lazy_init_name_hash; +#define START_CLOCK() \ + { \ + struct timeval tv1, tv2; \ + gettimeofday(&tv1, NULL); +#define STOP_CLOCK(v) \ + gettimeofday(&tv2, NULL); \ + v += (uint64_t)tv2.tv_sec * 1000000 + tv2.tv_usec - \ + (uint64_t)tv1.tv_sec * 1000000 - tv1.tv_usec; \ + } +#else +#define START_CLOCK() +#define STOP_CLOCK(v) +#endif + struct path_simplify { int len; const char *path; @@ -768,8 +794,11 @@ static struct exclude *last_exclude_matching(struct dir_struct *dir, const char *basename = strrchr(pathname, '/'); basename = (basename) ? basename+1 : pathname; + START_CLOCK(); prep_exclude(dir, pathname, basename-pathname); + STOP_CLOCK(tv_prep_exclude); + START_CLOCK(); for (i = EXC_CMDL; i <= EXC_FILE; i++) { group = &dir->exclude_list_group[i]; for (j = group->nr - 1; j >= 0; j--) { @@ -780,6 +809,7 @@ static struct exclude *last_exclude_matching(struct dir_struct *dir, return exclude; } } + STOP_CLOCK(tv_last_exclude_matching); return NULL; } @@ -897,9 +927,14 @@ static struct dir_entry *dir_entry_new(const char *pathname, int len) static struct dir_entry *dir_add_name(struct dir_struct *dir, const char *pathname, int len) { - if (!(dir->flags & DIR_SHOW_IGNORED) && - cache_name_exists(pathname, len, ignore_case)) - return NULL; + if (!(dir->flags & DIR_SHOW_IGNORED)) { + struct cache_entry *ce; + START_CLOCK(); + ce = cache_name_exists(pathname, len, ignore_case); + STOP_CLOCK(tv_index_name_exists); + if (ce) + return NULL; + } ALLOC_GROW(dir->entries, dir->nr+1, dir->alloc); return dir->entries[dir->nr++] = dir_entry_new(pathname, len); @@ -1034,8 +1069,12 @@ static enum directory_treatment treat_directory(struct dir_struct *dir, const char *dirname, int len, int exclude, const struct path_simplify *simplify) { + int ret; + START_CLOCK(); /* The "len-1" is to strip the final '/' */ - switch (directory_exists_in_index(dirname, len-1)) { + ret = directory_exists_in_index(dirname, len-1); + STOP_CLOCK(tv_directory_exists_in_index); + switch (ret) { case index_directory: if ((dir->flags & DIR_SHOW_OTHER_DIRECTORIES) && exclude) break; @@ -1179,7 +1218,9 @@ static int get_index_dtype(const char *path, int len) int pos; struct cache_entry *ce; + START_CLOCK(); ce = cache_name_exists(path, len, 0); + STOP_CLOCK(tv_index_name_exists); if (ce) { if (!ce_uptodate(ce)) return DT_UNKNOWN; @@ -1244,7 +1285,12 @@ static enum path_treatment treat_one_path(struct dir_struct *dir, const struct path_simplify *simplify, int dtype, struct dirent *de) { - int exclude = is_excluded(dir, path->buf, &dtype); + int exclude; + + START_CLOCK(); + exclude = is_excluded(dir, path->buf, &dtype); + STOP_CLOCK(tv_is_excluded); + if (exclude && (dir->flags & DIR_COLLECT_IGNORED) && exclude_matches_pathspec(path->buf, path->len, simplify)) dir_add_ignored(dir, path->buf, path->len); @@ -1292,17 +1338,23 @@ static enum path_treatment treat_path(struct dir_struct *dir, int baselen, const struct path_simplify *simplify) { - int dtype; + int dtype, ret; if (is_dot_or_dotdot(de->d_name) || !strcmp(de->d_name, ".git")) return path_ignored; strbuf_setlen(path, baselen); strbuf_addstr(path, de->d_name); - if (simplify_away(path->buf, path->len, simplify)) + START_CLOCK(); + ret = simplify_away(path->buf, path->len, simplify); + STOP_CLOCK(tv_simplify_away); + if (ret) return path_ignored; dtype = DTYPE(de); - return treat_one_path(dir, path, simplify, dtype, de); + START_CLOCK(); + ret = treat_one_path(dir, path, simplify, dtype, de); + STOP_CLOCK(tv_treat_one_path); + return ret; } /* @@ -1345,7 +1397,9 @@ static int read_directory_recursive(struct dir_struct *dir, contents++; if (check_only) break; + START_CLOCK(); dir_add_name(dir, path.buf, path.len); + STOP_CLOCK(tv_dir_add_name); } closedir(fdir); out: @@ -1405,6 +1459,7 @@ static int treat_leading_path(struct dir_struct *dir, len--; if (!len) return 1; + START_CLOCK(); baselen = 0; while (1) { cp = path + baselen + !!baselen; @@ -1428,6 +1483,7 @@ static int treat_leading_path(struct dir_struct *dir, } } strbuf_release(&sb); + STOP_CLOCK(tv_treat_leading_path); return rc; } @@ -1439,8 +1495,43 @@ int read_directory(struct dir_struct *dir, const char *path, int len, const char return dir->nr; simplify = create_simplify(pathspec); - if (!len || treat_leading_path(dir, path, len, simplify)) + if (!len || treat_leading_path(dir, path, len, simplify)) { +#ifdef MEASURE_EXCLUDE + /* The first call triggers lazy_init_name_hash() */ + START_CLOCK(); + index_name_exists(&the_index, "", 0, ignore_case); + STOP_CLOCK(tv_lazy_init_name_hash); +#endif + START_CLOCK(); read_directory_recursive(dir, path, len, 0, simplify); + STOP_CLOCK(tv_read_directory); + } +#ifdef MEASURE_EXCLUDE + if (getenv("GIT_MEASURE_EXCLUDE")) { + fprintf(stderr, "treat_leading_path: %10.3f\n", + (float)tv_treat_leading_path / 1000000); + fprintf(stderr, "read_directory: %10.3f\n", + (float)tv_read_directory / 1000000); + fprintf(stderr, "+treat_one_path: %10.3f\n", + (float)tv_treat_one_path / 1000000); + fprintf(stderr, "++is_excluded: %10.3f\n", + (float)tv_is_excluded / 1000000); + fprintf(stderr, "+++prep_exclude: %10.3f\n", + (float)tv_prep_exclude / 1000000); + fprintf(stderr, "+++matching: %10.3f\n", + (float)tv_last_exclude_matching / 1000000); + fprintf(stderr, "++dir_exist: %10.3f\n", + (float)tv_directory_exists_in_index / 1000000); + fprintf(stderr, "++index_name_exists: %10.3f\n", + (float)tv_index_name_exists / 1000000); + fprintf(stderr, "lazy_init_name_hash: %10.3f\n", + (float)tv_lazy_init_name_hash / 1000000); + fprintf(stderr, "+simplify_away: %10.3f\n", + (float)tv_simplify_away / 1000000); + fprintf(stderr, "+dir_add_name: %10.3f\n", + (float)tv_dir_add_name / 1000000); + } +#endif free_simplify(simplify); qsort(dir->entries, dir->nr, sizeof(struct dir_entry *), cmp_name); qsort(dir->ignored, dir->ignored_nr, sizeof(struct dir_entry *), cmp_name); -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH v3 02/13] match_pathname: avoid calling strncmp if baselen is 0 2013-03-12 13:04 ` [PATCH v3 00/13] " Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 01/13] dir.c: add MEASURE_EXCLUDE code for tracking exclude performance Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 ` Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 03/13] dir.c: inline convenient *_icase helpers Nguyễn Thái Ngọc Duy ` (11 subsequent siblings) 13 siblings, 0 replies; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy This reduces "git status" user time by a little bit. This is the sorted results of 10 consecutive runs of "git ls-files --exclude-standard -o" on webkit.git, compiled with gcc -O2: treat_leading_path: 0.000 0.000 read_directory: 4.102 3.674 +treat_one_path: 2.843 2.427 ++is_excluded: 2.632 2.221 +++prep_exclude: 0.225 0.224 +++matching: 2.054 1.650 ++dir_exist: 0.035 0.035 ++index_name_exists: 0.292 0.288 lazy_init_name_hash: 0.258 0.257 +simplify_away: 0.085 0.085 +dir_add_name: 0.446 0.441 Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- dir.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/dir.c b/dir.c index 69c045b..32a3adb 100644 --- a/dir.c +++ b/dir.c @@ -688,8 +688,8 @@ int match_pathname(const char *pathname, int pathlen, * may not end with a trailing slash though. */ if (pathlen < baselen + 1 || - (baselen && pathname[baselen] != '/') || - strncmp_icase(pathname, base, baselen)) + (baselen && (pathname[baselen] != '/' || + strncmp_icase(pathname, base, baselen)))) return 0; namelen = baselen ? pathlen - baselen - 1 : pathlen; -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH v3 03/13] dir.c: inline convenient *_icase helpers 2013-03-12 13:04 ` [PATCH v3 00/13] " Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 01/13] dir.c: add MEASURE_EXCLUDE code for tracking exclude performance Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 02/13] match_pathname: avoid calling strncmp if baselen is 0 Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 ` Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 04/13] match_basename: use strncmp instead of strcmp Nguyễn Thái Ngọc Duy ` (10 subsequent siblings) 13 siblings, 0 replies; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy Like the previous patch, this cuts down the number of str*cmp calls in read_directory (which does _a lot_). Again sorted results on webkit.git: treat_leading_path: 0.000 0.000 read_directory: 3.674 3.558 +treat_one_path: 2.427 2.305 ++is_excluded: 2.221 2.098 +++prep_exclude: 0.224 0.223 +++matching: 1.650 1.529 ++dir_exist: 0.035 0.035 ++index_name_exists: 0.288 0.291 lazy_init_name_hash: 0.257 0.257 +simplify_away: 0.085 0.086 +dir_add_name: 0.441 0.445 Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- dir.c | 16 ---------------- dir.h | 18 +++++++++++++++--- 2 files changed, 15 insertions(+), 19 deletions(-) diff --git a/dir.c b/dir.c index 32a3adb..a69c8ac 100644 --- a/dir.c +++ b/dir.c @@ -47,22 +47,6 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, in int check_only, const struct path_simplify *simplify); static int get_dtype(struct dirent *de, const char *path, int len); -/* helper string functions with support for the ignore_case flag */ -int strcmp_icase(const char *a, const char *b) -{ - return ignore_case ? strcasecmp(a, b) : strcmp(a, b); -} - -int strncmp_icase(const char *a, const char *b, size_t count) -{ - return ignore_case ? strncasecmp(a, b, count) : strncmp(a, b, count); -} - -int fnmatch_icase(const char *pattern, const char *string, int flags) -{ - return fnmatch(pattern, string, flags | (ignore_case ? FNM_CASEFOLD : 0)); -} - inline int git_fnmatch(const char *pattern, const char *string, int flags, int prefix) { diff --git a/dir.h b/dir.h index c3eb4b5..560ade4 100644 --- a/dir.h +++ b/dir.h @@ -200,9 +200,21 @@ extern int remove_dir_recursively(struct strbuf *path, int flag); /* tries to remove the path with empty directories along it, ignores ENOENT */ extern int remove_path(const char *path); -extern int strcmp_icase(const char *a, const char *b); -extern int strncmp_icase(const char *a, const char *b, size_t count); -extern int fnmatch_icase(const char *pattern, const char *string, int flags); +/* helper string functions with support for the ignore_case flag */ +static inline int strcmp_icase(const char *a, const char *b) +{ + return ignore_case ? strcasecmp(a, b) : strcmp(a, b); +} + +static inline int strncmp_icase(const char *a, const char *b, size_t count) +{ + return ignore_case ? strncasecmp(a, b, count) : strncmp(a, b, count); +} + +static inline int fnmatch_icase(const char *pattern, const char *string, int flags) +{ + return fnmatch(pattern, string, flags | (ignore_case ? FNM_CASEFOLD : 0)); +} /* * The prefix part of pattern must not contains wildcards. -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH v3 04/13] match_basename: use strncmp instead of strcmp 2013-03-12 13:04 ` [PATCH v3 00/13] " Nguyễn Thái Ngọc Duy ` (2 preceding siblings ...) 2013-03-12 13:04 ` [PATCH v3 03/13] dir.c: inline convenient *_icase helpers Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 ` Nguyễn Thái Ngọc Duy 2013-03-12 17:40 ` Antoine Pelisse 2013-03-12 13:04 ` [PATCH v3 05/13] match_{base,path}name: replace strncmp_icase with memequal_icase Nguyễn Thái Ngọc Duy ` (9 subsequent siblings) 13 siblings, 1 reply; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy strncmp provides length information, compared to strcmp, which could be taken advantage by the implementation. Even better, we could check if the lengths are equal before calling strncmp, eliminating a bit of strncmp calls. treat_leading_path: 0.000 0.000 read_directory: 3.558 3.578 +treat_one_path: 2.305 2.323 ++is_excluded: 2.098 2.117 +++prep_exclude: 0.223 0.224 +++matching: 1.529 1.544 ++dir_exist: 0.035 0.035 ++index_name_exists: 0.291 0.290 lazy_init_name_hash: 0.257 0.258 +simplify_away: 0.086 0.087 +dir_add_name: 0.445 0.445 While at there, fix an inconsistency about pattern/patternlen in how attr handles EXC_FLAG_MUSTBEDIR. When parse_exclude_pattern detects this flag, it sets patternlen _not_ to include the trailing slash and expects the caller to trim it. add_exclude does, parse_attr_line does not. In attr.c, the pattern could be "foo/" while patternlen tells us it only has 3 chars. Some functions do not care about patternlen and will see the pattern as "foo/" while others may see it as "foo". This patch makes patternlen 4 in this case. (Although for a piece of mind, perhaps we should trim it to "foo" like exclude, and never pass a pathname like "abc/" to match_{base,path}name) Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- attr.c | 2 ++ dir.c | 8 +++++--- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/attr.c b/attr.c index e2f9377..1818ba5 100644 --- a/attr.c +++ b/attr.c @@ -255,6 +255,8 @@ static struct match_attr *parse_attr_line(const char *line, const char *src, &res->u.pat.patternlen, &res->u.pat.flags, &res->u.pat.nowildcardlen); + if (res->u.pat.flags & EXC_FLAG_MUSTBEDIR) + res->u.pat.patternlen++; if (res->u.pat.flags & EXC_FLAG_NEGATIVE) { warning(_("Negative patterns are ignored in git attributes\n" "Use '\\!' for literal leading exclamation.")); diff --git a/dir.c b/dir.c index a69c8ac..a2ab200 100644 --- a/dir.c +++ b/dir.c @@ -636,12 +636,14 @@ int match_basename(const char *basename, int basenamelen, int flags) { if (prefix == patternlen) { - if (!strcmp_icase(pattern, basename)) + if (patternlen == basenamelen && + !strncmp_icase(pattern, basename, patternlen)) return 1; } else if (flags & EXC_FLAG_ENDSWITH) { if (patternlen - 1 <= basenamelen && - !strcmp_icase(pattern + 1, - basename + basenamelen - patternlen + 1)) + !strncmp_icase(pattern + 1, + basename + basenamelen - patternlen + 1, + patternlen - 1)) return 1; } else { if (fnmatch_icase(pattern, basename, 0) == 0) -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH v3 04/13] match_basename: use strncmp instead of strcmp 2013-03-12 13:04 ` [PATCH v3 04/13] match_basename: use strncmp instead of strcmp Nguyễn Thái Ngọc Duy @ 2013-03-12 17:40 ` Antoine Pelisse 2013-03-13 1:05 ` Duy Nguyen 0 siblings, 1 reply; 48+ messages in thread From: Antoine Pelisse @ 2013-03-12 17:40 UTC (permalink / raw) To: Nguyễn Thái Ngọc Duy; +Cc: git > --- a/dir.c > +++ b/dir.c > @@ -636,12 +636,14 @@ int match_basename(const char *basename, int basenamelen, > int flags) > { > if (prefix == patternlen) { > - if (!strcmp_icase(pattern, basename)) > + if (patternlen == basenamelen && > + !strncmp_icase(pattern, basename, patternlen)) > return 1; > } else if (flags & EXC_FLAG_ENDSWITH) { > if (patternlen - 1 <= basenamelen && > - !strcmp_icase(pattern + 1, > - basename + basenamelen - patternlen + 1)) > + !strncmp_icase(pattern + 1, > + basename + basenamelen - patternlen + 1, > + patternlen - 1)) > return 1; I can see that you kept strncmp(), was it worse with memcmp() ? ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 04/13] match_basename: use strncmp instead of strcmp 2013-03-12 17:40 ` Antoine Pelisse @ 2013-03-13 1:05 ` Duy Nguyen 0 siblings, 0 replies; 48+ messages in thread From: Duy Nguyen @ 2013-03-13 1:05 UTC (permalink / raw) To: Antoine Pelisse; +Cc: git On Wed, Mar 13, 2013 at 12:40 AM, Antoine Pelisse <apelisse@gmail.com> wrote: >> --- a/dir.c >> +++ b/dir.c >> @@ -636,12 +636,14 @@ int match_basename(const char *basename, int basenamelen, >> int flags) >> { >> if (prefix == patternlen) { >> - if (!strcmp_icase(pattern, basename)) >> + if (patternlen == basenamelen && >> + !strncmp_icase(pattern, basename, patternlen)) >> return 1; >> } else if (flags & EXC_FLAG_ENDSWITH) { >> if (patternlen - 1 <= basenamelen && >> - !strcmp_icase(pattern + 1, >> - basename + basenamelen - patternlen + 1)) >> + !strncmp_icase(pattern + 1, >> + basename + basenamelen - patternlen + 1, >> + patternlen - 1)) >> return 1; > > > I can see that you kept strncmp(), was it worse with memcmp() ? One step at a time. 05/13 replace strncmp_icase with memcmp (for when ignore_case == 0). -- Duy ^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH v3 05/13] match_{base,path}name: replace strncmp_icase with memequal_icase 2013-03-12 13:04 ` [PATCH v3 00/13] " Nguyễn Thái Ngọc Duy ` (3 preceding siblings ...) 2013-03-12 13:04 ` [PATCH v3 04/13] match_basename: use strncmp instead of strcmp Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 ` Nguyễn Thái Ngọc Duy 2013-03-13 1:14 ` Duy Nguyen 2013-03-12 13:04 ` [PATCH v3 06/13] dir: pass pathname length to last_exclude_matching Nguyễn Thái Ngọc Duy ` (8 subsequent siblings) 13 siblings, 1 reply; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy treat_leading_path: 0.000 0.000 read_directory: 3.578 3.501 +treat_one_path: 2.323 2.257 ++is_excluded: 2.117 2.056 +++prep_exclude: 0.224 0.216 +++matching: 1.544 1.493 ++dir_exist: 0.035 0.035 ++index_name_exists: 0.290 0.292 lazy_init_name_hash: 0.258 0.256 +simplify_away: 0.087 0.084 +dir_add_name: 0.445 0.447 Suggested-by: Fredrik Gustafsson <iveqy@iveqy.com> Suggested-by: Antoine Pelisse <apelisse@gmail.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- dir.c | 31 +++++++++++++++++++++++++++---- 1 file changed, 27 insertions(+), 4 deletions(-) diff --git a/dir.c b/dir.c index a2ab200..26c3b3a 100644 --- a/dir.c +++ b/dir.c @@ -47,6 +47,29 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, in int check_only, const struct path_simplify *simplify); static int get_dtype(struct dirent *de, const char *path, int len); +static inline int memequal_icase(const char *a, const char *b, int n) +{ + if (!memcmp(a, b, n)) + return 1; + + if (!ignore_case) + return 0; + + /* + * This assumes that ASCII is used in most repositories. We + * try the ascii-only version first (i.e. Git's + * toupper). Failing that, fall back to normal strncasecmp. + */ + while (n && toupper(*a) == toupper(*b)) { + a++; + b++; + n--; + } + if (!n) + return 1; + return !strncasecmp(a, b, n); +} + inline int git_fnmatch(const char *pattern, const char *string, int flags, int prefix) { @@ -637,11 +660,11 @@ int match_basename(const char *basename, int basenamelen, { if (prefix == patternlen) { if (patternlen == basenamelen && - !strncmp_icase(pattern, basename, patternlen)) + memequal_icase(pattern, basename, patternlen)) return 1; } else if (flags & EXC_FLAG_ENDSWITH) { if (patternlen - 1 <= basenamelen && - !strncmp_icase(pattern + 1, + memequal_icase(pattern + 1, basename + basenamelen - patternlen + 1, patternlen - 1)) return 1; @@ -675,7 +698,7 @@ int match_pathname(const char *pathname, int pathlen, */ if (pathlen < baselen + 1 || (baselen && (pathname[baselen] != '/' || - strncmp_icase(pathname, base, baselen)))) + !memequal_icase(pathname, base, baselen)))) return 0; namelen = baselen ? pathlen - baselen - 1 : pathlen; @@ -689,7 +712,7 @@ int match_pathname(const char *pathname, int pathlen, if (prefix > namelen) return 0; - if (strncmp_icase(pattern, name, prefix)) + if (!memequal_icase(pattern, name, prefix)) return 0; pattern += prefix; name += prefix; -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH v3 05/13] match_{base,path}name: replace strncmp_icase with memequal_icase 2013-03-12 13:04 ` [PATCH v3 05/13] match_{base,path}name: replace strncmp_icase with memequal_icase Nguyễn Thái Ngọc Duy @ 2013-03-13 1:14 ` Duy Nguyen 0 siblings, 0 replies; 48+ messages in thread From: Duy Nguyen @ 2013-03-13 1:14 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy On Tue, Mar 12, 2013 at 8:04 PM, Nguyễn Thái Ngọc Duy <pclouds@gmail.com> wrote: > +static inline int memequal_icase(const char *a, const char *b, int n) > +{ > + if (!memcmp(a, b, n)) > + return 1; > + > + if (!ignore_case) > + return 0; > + > + /* > + * This assumes that ASCII is used in most repositories. We > + * try the ascii-only version first (i.e. Git's > + * toupper). Failing that, fall back to normal strncasecmp. > + */ > + while (n && toupper(*a) == toupper(*b)) { > + a++; > + b++; > + n--; > + } > + if (!n) > + return 1; > + return !strncasecmp(a, b, n); > +} Note, the ignore_case == 1 codepath was not tested and measured. I suspect that strncasecmp may be more complex to support locales and the "LANG=C" version should suffice in most case. But it's just guesses at this moment. -- Duy ^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH v3 06/13] dir: pass pathname length to last_exclude_matching 2013-03-12 13:04 ` [PATCH v3 00/13] " Nguyễn Thái Ngọc Duy ` (4 preceding siblings ...) 2013-03-12 13:04 ` [PATCH v3 05/13] match_{base,path}name: replace strncmp_icase with memequal_icase Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 ` Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 07/13] exclude: avoid calling prep_exclude on entries of the same directory Nguyễn Thái Ngọc Duy ` (7 subsequent siblings) 13 siblings, 0 replies; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- dir.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/dir.c b/dir.c index 26c3b3a..58739f3 100644 --- a/dir.c +++ b/dir.c @@ -794,9 +794,9 @@ int is_excluded_from_list(const char *pathname, */ static struct exclude *last_exclude_matching(struct dir_struct *dir, const char *pathname, + int pathlen, int *dtype_p) { - int pathlen = strlen(pathname); int i, j; struct exclude_list_group *group; struct exclude *exclude; @@ -827,10 +827,12 @@ static struct exclude *last_exclude_matching(struct dir_struct *dir, * scans all exclude lists to determine whether pathname is excluded. * Returns 1 if true, otherwise 0. */ -static int is_excluded(struct dir_struct *dir, const char *pathname, int *dtype_p) +static int is_excluded(struct dir_struct *dir, + const char *pathname, int pathlen, + int *dtype_p) { struct exclude *exclude = - last_exclude_matching(dir, pathname, dtype_p); + last_exclude_matching(dir, pathname, pathlen, dtype_p); if (exclude) return exclude->flags & EXC_FLAG_NEGATIVE ? 0 : 1; return 0; @@ -893,7 +895,8 @@ struct exclude *last_exclude_matching_path(struct path_exclude_check *check, if (ch == '/') { int dt = DT_DIR; exclude = last_exclude_matching(check->dir, - path->buf, &dt); + path->buf, path->len, + &dt); if (exclude) { check->exclude = exclude; return exclude; @@ -905,7 +908,7 @@ struct exclude *last_exclude_matching_path(struct path_exclude_check *check, /* An entry in the index; cannot be a directory with subentries */ strbuf_setlen(path, 0); - return last_exclude_matching(check->dir, name, dtype); + return last_exclude_matching(check->dir, name, namelen, dtype); } /* @@ -1297,7 +1300,7 @@ static enum path_treatment treat_one_path(struct dir_struct *dir, int exclude; START_CLOCK(); - exclude = is_excluded(dir, path->buf, &dtype); + exclude = is_excluded(dir, path->buf, path->len, &dtype); STOP_CLOCK(tv_is_excluded); if (exclude && (dir->flags & DIR_COLLECT_IGNORED) -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH v3 07/13] exclude: avoid calling prep_exclude on entries of the same directory 2013-03-12 13:04 ` [PATCH v3 00/13] " Nguyễn Thái Ngọc Duy ` (5 preceding siblings ...) 2013-03-12 13:04 ` [PATCH v3 06/13] dir: pass pathname length to last_exclude_matching Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 ` Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 08/13] exclude: record baselen in the pattern Nguyễn Thái Ngọc Duy ` (6 subsequent siblings) 13 siblings, 0 replies; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy prep_exclude is only necessary when we enter or leave a directory. Now it's called for every entry in a directory. With this patch, the number of prep_exclude calls in webkit.git goes from 188k down to 11k. This patch does not make exclude any faster, but it prepares for making prep_exclude heavier in terms of computation, where a large number of calls may have bigger impacts. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- dir.c | 10 +++++++++- dir.h | 1 + 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/dir.c b/dir.c index 58739f3..f8f7a7e 100644 --- a/dir.c +++ b/dir.c @@ -804,7 +804,10 @@ static struct exclude *last_exclude_matching(struct dir_struct *dir, basename = (basename) ? basename+1 : pathname; START_CLOCK(); - prep_exclude(dir, pathname, basename-pathname); + if (!dir->exclude_prepared) { + prep_exclude(dir, pathname, basename-pathname); + dir->exclude_prepared = 1; + } STOP_CLOCK(tv_prep_exclude); START_CLOCK(); @@ -894,6 +897,7 @@ struct exclude *last_exclude_matching_path(struct path_exclude_check *check, if (ch == '/') { int dt = DT_DIR; + check->dir->exclude_prepared = 0; exclude = last_exclude_matching(check->dir, path->buf, path->len, &dt); @@ -908,6 +912,7 @@ struct exclude *last_exclude_matching_path(struct path_exclude_check *check, /* An entry in the index; cannot be a directory with subentries */ strbuf_setlen(path, 0); + check->dir->exclude_prepared = 0; return last_exclude_matching(check->dir, name, namelen, dtype); } @@ -1394,6 +1399,7 @@ static int read_directory_recursive(struct dir_struct *dir, if (!fdir) goto out; + dir->exclude_prepared = 0; while ((de = readdir(fdir)) != NULL) { switch (treat_path(dir, de, &path, baselen, simplify)) { case path_recurse: @@ -1415,6 +1421,7 @@ static int read_directory_recursive(struct dir_struct *dir, } closedir(fdir); out: + dir->exclude_prepared = 0; strbuf_release(&path); return contents; @@ -1486,6 +1493,7 @@ static int treat_leading_path(struct dir_struct *dir, break; if (simplify_away(sb.buf, sb.len, simplify)) break; + dir->exclude_prepared = 0; if (treat_one_path(dir, &sb, simplify, DT_DIR, NULL) == path_ignored) break; /* do not recurse into it */ diff --git a/dir.h b/dir.h index 560ade4..0748407 100644 --- a/dir.h +++ b/dir.h @@ -86,6 +86,7 @@ struct dir_struct { /* Exclude info */ const char *exclude_per_dir; + int exclude_prepared; /* * We maintain three groups of exclude pattern lists: -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH v3 08/13] exclude: record baselen in the pattern 2013-03-12 13:04 ` [PATCH v3 00/13] " Nguyễn Thái Ngọc Duy ` (6 preceding siblings ...) 2013-03-12 13:04 ` [PATCH v3 07/13] exclude: avoid calling prep_exclude on entries of the same directory Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 ` Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 09/13] exclude: filter out patterns not applicable to the current directory Nguyễn Thái Ngọc Duy ` (5 subsequent siblings) 13 siblings, 0 replies; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- attr.c | 4 +++- dir.c | 19 ++++++++++++++----- dir.h | 6 +++++- 3 files changed, 22 insertions(+), 7 deletions(-) diff --git a/attr.c b/attr.c index 1818ba5..b89da33 100644 --- a/attr.c +++ b/attr.c @@ -249,12 +249,14 @@ static struct match_attr *parse_attr_line(const char *line, const char *src, res->u.attr = git_attr_internal(name, namelen); else { char *p = (char *)&(res->state[num_attr]); + int pattern_baselen; memcpy(p, name, namelen); res->u.pat.pattern = p; parse_exclude_pattern(&res->u.pat.pattern, &res->u.pat.patternlen, &res->u.pat.flags, - &res->u.pat.nowildcardlen); + &res->u.pat.nowildcardlen, + &pattern_baselen); if (res->u.pat.flags & EXC_FLAG_MUSTBEDIR) res->u.pat.patternlen++; if (res->u.pat.flags & EXC_FLAG_NEGATIVE) { diff --git a/dir.c b/dir.c index f8f7a7e..932fd2f 100644 --- a/dir.c +++ b/dir.c @@ -390,10 +390,11 @@ static int no_wildcard(const char *string) void parse_exclude_pattern(const char **pattern, int *patternlen, int *flags, - int *nowildcardlen) + int *nowildcardlen, + int *pattern_baselen) { const char *p = *pattern; - size_t i, len; + int i, len; *flags = 0; if (*p == '!') { @@ -405,12 +406,15 @@ void parse_exclude_pattern(const char **pattern, len--; *flags |= EXC_FLAG_MUSTBEDIR; } - for (i = 0; i < len; i++) { + for (i = len - 1; i >= 0; i--) { if (p[i] == '/') break; } - if (i == len) + if (i < 0) { *flags |= EXC_FLAG_NODIR; + *pattern_baselen = -1; + } else + *pattern_baselen = i; *nowildcardlen = simple_length(p); /* * we should have excluded the trailing slash from 'p' too, @@ -421,6 +425,8 @@ void parse_exclude_pattern(const char **pattern, *nowildcardlen = len; if (*p == '*' && no_wildcard(p + 1)) *flags |= EXC_FLAG_ENDSWITH; + else if (*nowildcardlen != len) + *pattern_baselen = -1; *pattern = p; *patternlen = len; } @@ -432,8 +438,10 @@ void add_exclude(const char *string, const char *base, int patternlen; int flags; int nowildcardlen; + int pattern_baselen; - parse_exclude_pattern(&string, &patternlen, &flags, &nowildcardlen); + parse_exclude_pattern(&string, &patternlen, &flags, + &nowildcardlen, &pattern_baselen); if (flags & EXC_FLAG_MUSTBEDIR) { char *s; x = xmalloc(sizeof(*x) + patternlen + 1); @@ -449,6 +457,7 @@ void add_exclude(const char *string, const char *base, x->nowildcardlen = nowildcardlen; x->base = base; x->baselen = baselen; + x->pattern_baselen = pattern_baselen; x->flags = flags; x->srcpos = srcpos; ALLOC_GROW(el->excludes, el->nr + 1, el->alloc); diff --git a/dir.h b/dir.h index 0748407..cb50a85 100644 --- a/dir.h +++ b/dir.h @@ -44,6 +44,7 @@ struct exclude_list { int nowildcardlen; const char *base; int baselen; + int pattern_baselen; int flags; /* @@ -172,7 +173,10 @@ extern struct exclude_list *add_exclude_list(struct dir_struct *dir, extern int add_excludes_from_file_to_list(const char *fname, const char *base, int baselen, struct exclude_list *el, int check_index); extern void add_excludes_from_file(struct dir_struct *, const char *fname); -extern void parse_exclude_pattern(const char **string, int *patternlen, int *flags, int *nowildcardlen); +extern void parse_exclude_pattern(const char **string, + int *patternlen, int *flags, + int *nowildcardlen, + int *pattern_baselen); extern void add_exclude(const char *string, const char *base, int baselen, struct exclude_list *el, int srcpos); extern void clear_exclude_list(struct exclude_list *el); -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH v3 09/13] exclude: filter out patterns not applicable to the current directory 2013-03-12 13:04 ` [PATCH v3 00/13] " Nguyễn Thái Ngọc Duy ` (7 preceding siblings ...) 2013-03-12 13:04 ` [PATCH v3 08/13] exclude: record baselen in the pattern Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 ` Nguyễn Thái Ngọc Duy 2013-03-12 23:13 ` Eric Sunshine 2013-03-12 13:04 ` [PATCH v3 10/13] read_directory: avoid invoking exclude machinery on tracked files Nguyễn Thái Ngọc Duy ` (4 subsequent siblings) 13 siblings, 1 reply; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy .gitignore files are spread over directories (*) so that when we check for ignored files at foo, we are not bothered by foo/bar/.gitignore, which contains ignore rules for foo/bar only. This is not enough. foo/.gitignore can contain the pattern "foo/bar/*.c". When we stay at foo, we know that the pattern cannot match anything. Similarly, the pattern "/autom4te.cache" at root directory cannot match anything in foo. This patch attempts to filter out such patterns to drive down matching cost. The algorithm implemented here is a naive one. Patterns can be either active or passive: - When we enter a new directory (e.g. from root to foo), currently active patterns may no longer be applicable and can be turned to passive. - On the opposite, when we leave a directory (foo back to roo), passive patterns may come alive again. We could do smarter things. But this implementation cuts a big portion of cost already (and solves the "root .gitignore is evil" problem). There's probably no need to be smart. (*) this design forces us to try to find .gitignore at every directory. On webkit.git that equals to 6k open syscalls. It feels like ".svn on every directory" again. I suggest we add ~/.gitignore.master, containing the list .gitignore files in worktree. If this file exists, we don't poke at every directory for .gitignore. treat_leading_path: 0.000 0.000 read_directory: 3.455 2.879 +treat_one_path: 2.203 1.620 ++is_excluded: 2.000 1.416 +++prep_exclude: 0.171 0.198 +++matching: 1.509 0.904 ++dir_exist: 0.036 0.035 ++index_name_exists: 0.292 0.289 lazy_init_name_hash: 0.257 0.257 +simplify_away: 0.084 0.085 +dir_add_name: 0.446 0.446 Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- dir.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-- dir.h | 1 + 2 files changed, 92 insertions(+), 2 deletions(-) diff --git a/dir.c b/dir.c index 932fd2f..c57bf06 100644 --- a/dir.c +++ b/dir.c @@ -458,7 +458,7 @@ void add_exclude(const char *string, const char *base, x->base = base; x->baselen = baselen; x->pattern_baselen = pattern_baselen; - x->flags = flags; + x->flags = flags | EXC_FLAG_ACTIVE; x->srcpos = srcpos; ALLOC_GROW(el->excludes, el->nr + 1, el->alloc); el->excludes[el->nr++] = x; @@ -591,6 +591,87 @@ void add_excludes_from_file(struct dir_struct *dir, const char *fname) die("cannot use %s as an exclude file", fname); } +static int pattern_match_base(struct dir_struct *dir, + const char *base, int baselen, + const struct exclude *exc) +{ + const char *pattern; + + /* + * TODO: if a patterns come from a .gitignore, exc->base would + * be the same for all of them. We could compare once and + * reuse the result, instead of perform the comparison per + * pattern like this. + */ + if (exc->baselen) { + if (baselen < exc->baselen + 1) + return 0; + + if (base[exc->baselen] != '/' || + memcmp(base, exc->base, exc->baselen)) + return 0; + + base += exc->baselen + 1; + baselen -= exc->baselen + 1; + } + + if (baselen != exc->pattern_baselen) + return 0; + + if (exc->pattern_baselen) { + pattern = exc->pattern; + if (*pattern == '/') + pattern++; + if (memcmp(base, pattern, exc->pattern_baselen)) + return 0; + } + + return 1; +} + +/* + * If pushed is non-zero, we have entered a new directory. Some + * pathname patterns may no longer applicable. Go over all active + * patterns and disable them if so. + * + * If popped is non-zero, we have left a directory. Inactive patterns + * may be applicable again. Go over them and re-enable if so. + */ +static void scan_patterns(struct dir_struct *dir, + const char *base, int baselen, + int pushed, int popped) +{ + int i, j, k; + + for (i = EXC_CMDL; i <= EXC_FILE; i++) { + struct exclude_list_group *group = &dir->exclude_list_group[i]; + for (j = group->nr - 1; j >= 0; j--) { + struct exclude_list *list = &group->el[j]; + for (k = 0; k < list->nr; k++) { + struct exclude *exc = list->excludes[k]; + + /* + * No base (i.e. EXC_FLAG_NODIR) or + * applicable to many bases ("**" + * patterns) + */ + if (exc->pattern_baselen == -1) + continue; + + if (exc->flags & EXC_FLAG_ACTIVE) { + if (pushed && + !pattern_match_base(dir, base, baselen, exc)) + exc->flags &= ~EXC_FLAG_ACTIVE; + } else { + if (popped && + pattern_match_base(dir, base, baselen, exc)) + exc->flags |= EXC_FLAG_ACTIVE; + } + } + } + } +} + /* * Loads the per-directory exclude list for the substring of base * which has a char length of baselen. @@ -600,7 +681,7 @@ static void prep_exclude(struct dir_struct *dir, const char *base, int baselen) struct exclude_list_group *group; struct exclude_list *el; struct exclude_stack *stk = NULL; - int current; + int current, popped = 0, pushed = 0; if ((!dir->exclude_per_dir) || (baselen + strlen(dir->exclude_per_dir) >= PATH_MAX)) @@ -621,6 +702,7 @@ static void prep_exclude(struct dir_struct *dir, const char *base, int baselen) clear_exclude_list(el); free(stk); group->nr--; + popped++; } /* Read from the parent directories and push them down. */ @@ -659,8 +741,12 @@ static void prep_exclude(struct dir_struct *dir, const char *base, int baselen) el, 1); dir->exclude_stack = stk; current = stk->baselen; + pushed++; } dir->basebuf[baselen] = '\0'; + + if (pushed | popped) + scan_patterns(dir, base, baselen, pushed, popped); } int match_basename(const char *basename, int basenamelen, @@ -755,6 +841,9 @@ static struct exclude *last_exclude_matching_from_list(const char *pathname, const char *exclude = x->pattern; int prefix = x->nowildcardlen; + if (!(x->flags & EXC_FLAG_ACTIVE)) + continue; + if (x->flags & EXC_FLAG_MUSTBEDIR) { if (*dtype == DT_UNKNOWN) *dtype = get_dtype(NULL, pathname, pathlen); diff --git a/dir.h b/dir.h index cb50a85..247bfda 100644 --- a/dir.h +++ b/dir.h @@ -14,6 +14,7 @@ struct dir_entry { #define EXC_FLAG_ENDSWITH 4 #define EXC_FLAG_MUSTBEDIR 8 #define EXC_FLAG_NEGATIVE 16 +#define EXC_FLAG_ACTIVE 32 /* * Each excludes file will be parsed into a fresh exclude_list which -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH v3 09/13] exclude: filter out patterns not applicable to the current directory 2013-03-12 13:04 ` [PATCH v3 09/13] exclude: filter out patterns not applicable to the current directory Nguyễn Thái Ngọc Duy @ 2013-03-12 23:13 ` Eric Sunshine 0 siblings, 0 replies; 48+ messages in thread From: Eric Sunshine @ 2013-03-12 23:13 UTC (permalink / raw) To: Nguyễn Thái Ngọc Duy; +Cc: git On Tue, Mar 12, 2013 at 9:04 AM, Nguyễn Thái Ngọc Duy <pclouds@gmail.com> wrote: > The algorithm implemented here is a naive one. Patterns can be either > active or passive: > > - When we enter a new directory (e.g. from root to foo), currently > active patterns may no longer be applicable and can be turned to > passive. > > - On the opposite, when we leave a directory (foo back to roo), s/roo/root/ Also, perhaps you meant s/On/Or/ ? > +/* > + * If pushed is non-zero, we have entered a new directory. Some > + * pathname patterns may no longer applicable. Go over all active s/applicable/be applicable/ -- ES ^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH v3 10/13] read_directory: avoid invoking exclude machinery on tracked files 2013-03-12 13:04 ` [PATCH v3 00/13] " Nguyễn Thái Ngọc Duy ` (8 preceding siblings ...) 2013-03-12 13:04 ` [PATCH v3 09/13] exclude: filter out patterns not applicable to the current directory Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 ` Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 11/13] Preallocate hash tables when the number of inserts are known in advance Nguyễn Thái Ngọc Duy ` (3 subsequent siblings) 13 siblings, 0 replies; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy, Junio C Hamano read_directory() (and its friendly wrapper fill_directory) collects untracked/ignored files by traversing through the whole worktree, feeding every entry to treat_one_path(), where each entry is checked against .gitignore patterns. One may see that tracked files can't be excluded and we do not need to run them through exclude machinery. On repos where there are many .gitignore patterns and/or a lot of tracked files, this unnecessary processing can become expensive. This patch avoids it mostly for normal cases. Directories are still processed as before. DIR_SHOW_IGNORED and DIR_COLLECT_IGNORED are not normally used unless some options are given (e.g. "checkout --overwrite-ignore", "add -f"...) treat_one_path's behavior changes when taking this shortcut. With current code, when a non-directory path is not excluded, treat_one_path calls treat_file, which returns the initial value of exclude_file and causes treat_one_path to return path_handled. With this patch, on the same conditions, treat_one_path returns path_ignored. read_directory_recursive() cares about this difference. Check out the snippet: while (...) { switch (treat_path(...)) { case path_ignored: continue; case path_handled: break; } contents++; if (check_only) break; dir_add_name(dir, path.buf, path.len); } If path_handled is returned, contents goes up. And if check_only is true, the loop could be broken early. These will not happen when treat_one_path (and its wrapper treat_path) returns path_ignored. dir_add_name internally does a cache_name_exists() check so it makes no difference. To avoid this behavior change, treat_one_path is instructed to skip the optimization when check_only or contents is used. Finally some numbers (best of 20 runs) that shows why it's worth all the hassle: git status | webkit linux-2.6 libreoffice-core gentoo-x86 -------------+---------------------------------------------- before | 1.097s 0.208s 0.399s 0.539s after | 0.736s 0.159s 0.248s 0.501s nr. patterns | 89 376 19 0 nr. tracked | 182k 40k 63k 101k treat_leading_path: 0.000 0.000 read_directory: 2.879 1.299 +treat_one_path: 1.620 0.599 ++is_excluded: 1.416 0.103 +++prep_exclude: 0.198 0.040 +++matching: 0.904 0.036 ++dir_exist: 0.035 0.036 ++index_name_exists: 0.289 0.291 lazy_init_name_hash: 0.257 0.257 +simplify_away: 0.085 0.082 +dir_add_name: 0.446 0.000 Tracked-down-by: Karsten Blees <karsten.blees@gmail.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> --- dir.c | 80 ++++++++++++++++++++++++++++++++++++++++++++----------------------- 1 file changed, 53 insertions(+), 27 deletions(-) diff --git a/dir.c b/dir.c index c57bf06..6809dd2 100644 --- a/dir.c +++ b/dir.c @@ -43,8 +43,11 @@ struct path_simplify { const char *path; }; -static int read_directory_recursive(struct dir_struct *dir, const char *path, int len, - int check_only, const struct path_simplify *simplify); +static void read_directory_recursive(struct dir_struct *dir, + const char *path, int len, + int check_only, + const struct path_simplify *simplify, + int *contents); static int get_dtype(struct dirent *de, const char *path, int len); static inline int memequal_icase(const char *a, const char *b, int n) @@ -1184,7 +1187,7 @@ static enum directory_treatment treat_directory(struct dir_struct *dir, const char *dirname, int len, int exclude, const struct path_simplify *simplify) { - int ret; + int contents = 0, ret; START_CLOCK(); /* The "len-1" is to strip the final '/' */ ret = directory_exists_in_index(dirname, len-1); @@ -1219,19 +1222,19 @@ static enum directory_treatment treat_directory(struct dir_struct *dir, * check if it contains only ignored files */ if ((dir->flags & DIR_SHOW_IGNORED) && !exclude) { - int ignored; dir->flags &= ~DIR_SHOW_IGNORED; dir->flags |= DIR_HIDE_EMPTY_DIRECTORIES; - ignored = read_directory_recursive(dir, dirname, len, 1, simplify); + read_directory_recursive(dir, dirname, len, 1, simplify, &contents); dir->flags &= ~DIR_HIDE_EMPTY_DIRECTORIES; dir->flags |= DIR_SHOW_IGNORED; - return ignored ? ignore_directory : show_directory; + return contents ? ignore_directory : show_directory; } if (!(dir->flags & DIR_SHOW_IGNORED) && !(dir->flags & DIR_HIDE_EMPTY_DIRECTORIES)) return show_directory; - if (!read_directory_recursive(dir, dirname, len, 1, simplify)) + read_directory_recursive(dir, dirname, len, 1, simplify, &contents); + if (!contents) return ignore_directory; return show_directory; } @@ -1398,10 +1401,26 @@ enum path_treatment { static enum path_treatment treat_one_path(struct dir_struct *dir, struct strbuf *path, const struct path_simplify *simplify, - int dtype, struct dirent *de) + int dtype, struct dirent *de, + int exclude_shortcut_ok) { int exclude; + if (dtype == DT_UNKNOWN) + dtype = get_dtype(de, path->buf, path->len); + + if (exclude_shortcut_ok && + !(dir->flags & DIR_SHOW_IGNORED) && + !(dir->flags & DIR_COLLECT_IGNORED) && + dtype != DT_DIR) { + struct cache_entry *ce; + START_CLOCK(); + ce = cache_name_exists(path->buf, path->len, ignore_case); + STOP_CLOCK(tv_index_name_exists); + if (ce) + return path_ignored; + } + START_CLOCK(); exclude = is_excluded(dir, path->buf, path->len, &dtype); STOP_CLOCK(tv_is_excluded); @@ -1417,9 +1436,6 @@ static enum path_treatment treat_one_path(struct dir_struct *dir, if (exclude && !(dir->flags & DIR_SHOW_IGNORED)) return path_ignored; - if (dtype == DT_UNKNOWN) - dtype = get_dtype(de, path->buf, path->len); - switch (dtype) { default: return path_ignored; @@ -1451,7 +1467,8 @@ static enum path_treatment treat_path(struct dir_struct *dir, struct dirent *de, struct strbuf *path, int baselen, - const struct path_simplify *simplify) + const struct path_simplify *simplify, + int exclude_shortcut_ok) { int dtype, ret; @@ -1467,7 +1484,7 @@ static enum path_treatment treat_path(struct dir_struct *dir, dtype = DTYPE(de); START_CLOCK(); - ret = treat_one_path(dir, path, simplify, dtype, de); + ret = treat_one_path(dir, path, simplify, dtype, de, exclude_shortcut_ok); STOP_CLOCK(tv_treat_one_path); return ret; } @@ -1481,13 +1498,13 @@ static enum path_treatment treat_path(struct dir_struct *dir, * Also, we ignore the name ".git" (even if it is not a directory). * That likely will not change. */ -static int read_directory_recursive(struct dir_struct *dir, - const char *base, int baselen, - int check_only, - const struct path_simplify *simplify) +static void read_directory_recursive(struct dir_struct *dir, + const char *base, int baselen, + int check_only, + const struct path_simplify *simplify, + int *contents) { DIR *fdir; - int contents = 0; struct dirent *de; struct strbuf path = STRBUF_INIT; @@ -1499,18 +1516,29 @@ static int read_directory_recursive(struct dir_struct *dir, dir->exclude_prepared = 0; while ((de = readdir(fdir)) != NULL) { - switch (treat_path(dir, de, &path, baselen, simplify)) { + switch (treat_path(dir, de, &path, baselen, + simplify, + !check_only && !contents)) { case path_recurse: - contents += read_directory_recursive(dir, path.buf, - path.len, 0, - simplify); + read_directory_recursive(dir, path.buf, + path.len, 0, + simplify, + contents); continue; case path_ignored: continue; case path_handled: break; } - contents++; + /* + * Update the last argument to treat_path if anything + * else is done after this point. This is because if + * treat_path's exclude_shortcut_ok is true, it may + * incorrectly return path_ignored (and never reaches + * this part) instead of path_handled. + */ + if (contents) + (*contents)++; if (check_only) break; START_CLOCK(); @@ -1521,8 +1549,6 @@ static int read_directory_recursive(struct dir_struct *dir, out: dir->exclude_prepared = 0; strbuf_release(&path); - - return contents; } static int cmp_name(const void *p1, const void *p2) @@ -1593,7 +1619,7 @@ static int treat_leading_path(struct dir_struct *dir, break; dir->exclude_prepared = 0; if (treat_one_path(dir, &sb, simplify, - DT_DIR, NULL) == path_ignored) + DT_DIR, NULL, 0) == path_ignored) break; /* do not recurse into it */ if (len <= baselen) { rc = 1; @@ -1621,7 +1647,7 @@ int read_directory(struct dir_struct *dir, const char *path, int len, const char STOP_CLOCK(tv_lazy_init_name_hash); #endif START_CLOCK(); - read_directory_recursive(dir, path, len, 0, simplify); + read_directory_recursive(dir, path, len, 0, simplify, NULL); STOP_CLOCK(tv_read_directory); } #ifdef MEASURE_EXCLUDE -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH v3 11/13] Preallocate hash tables when the number of inserts are known in advance 2013-03-12 13:04 ` [PATCH v3 00/13] " Nguyễn Thái Ngọc Duy ` (9 preceding siblings ...) 2013-03-12 13:04 ` [PATCH v3 10/13] read_directory: avoid invoking exclude machinery on tracked files Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 ` Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 12/13] name-hash: allow to lookup a name with precalculated base hash Nguyễn Thái Ngọc Duy ` (2 subsequent siblings) 13 siblings, 0 replies; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy This avoids unnecessary allocations and reinsertions. On webkit.git (i.e. about 100k inserts to the name hash table), this reduces about 100ms out of 3s user time. treat_leading_path: 0.000 0.000 read_directory: 1.299 1.305 +treat_one_path: 0.599 0.599 ++is_excluded: 0.103 0.103 +++prep_exclude: 0.040 0.040 +++matching: 0.036 0.035 ++dir_exist: 0.036 0.035 ++index_name_exists: 0.291 0.292 lazy_init_name_hash: 0.257 0.155 +simplify_away: 0.082 0.083 +dir_add_name: 0.000 0.000 Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- diffcore-rename.c | 1 + hash.h | 7 +++++++ name-hash.c | 1 + 3 files changed, 9 insertions(+) diff --git a/diffcore-rename.c b/diffcore-rename.c index 512d0ac..8d3d9bb 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -389,6 +389,7 @@ static int find_exact_renames(struct diff_options *options) struct hash_table file_table; init_hash(&file_table); + preallocate_hash(&file_table, (rename_src_nr + rename_dst_nr) * 2); for (i = 0; i < rename_src_nr; i++) insert_file_table(&file_table, -1, i, rename_src[i].p->one); diff --git a/hash.h b/hash.h index b875ce6..244d1fe 100644 --- a/hash.h +++ b/hash.h @@ -40,4 +40,11 @@ static inline void init_hash(struct hash_table *table) table->array = NULL; } +static inline void preallocate_hash(struct hash_table *table, unsigned int size) +{ + assert(table->size == 0 && table->nr == 0 && table->array == NULL); + table->size = size; + table->array = xcalloc(sizeof(struct hash_table_entry), size); +} + #endif diff --git a/name-hash.c b/name-hash.c index 942c459..1287a19 100644 --- a/name-hash.c +++ b/name-hash.c @@ -92,6 +92,7 @@ static void lazy_init_name_hash(struct index_state *istate) if (istate->name_hash_initialized) return; + preallocate_hash(&istate->name_hash, istate->cache_nr * 2); for (nr = 0; nr < istate->cache_nr; nr++) hash_index_entry(istate, istate->cache[nr]); istate->name_hash_initialized = 1; -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH v3 12/13] name-hash: allow to lookup a name with precalculated base hash 2013-03-12 13:04 ` [PATCH v3 00/13] " Nguyễn Thái Ngọc Duy ` (10 preceding siblings ...) 2013-03-12 13:04 ` [PATCH v3 11/13] Preallocate hash tables when the number of inserts are known in advance Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 ` Nguyễn Thái Ngọc Duy 2013-03-12 13:05 ` [PATCH v3 13/13] read_directory: calculate name hashes incrementally Nguyễn Thái Ngọc Duy 2013-03-14 13:05 ` [PATCH v3 00/13] Exclude optimizations Duy Nguyen 13 siblings, 0 replies; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-12 13:04 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy index_name_exists_base() differs from index_name_exists() in how the hash is calculated. It uses the base hash as the seed and skips the baselen part. The intention is to reduce hashing cost during directory traversal, where the hash of the directory is calculated once, and used as base hash for all entries inside. This poses a constraint in hash function. The function has not changed since 2008. With luck it'll never change again. If resuming hashing at any characters are not feasible with a new (better) hash function, maybe we could stop at directory boundary. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- Makefile | 1 + cache.h | 2 -- dir.c | 1 + name-hash.c | 48 ++++++++++++++++++++++-------------------------- name-hash.h (new) | 45 +++++++++++++++++++++++++++++++++++++++++++++ read-cache.c | 1 + unpack-trees.c | 1 + 7 files changed, 71 insertions(+), 28 deletions(-) create mode 100644 name-hash.h diff --git a/Makefile b/Makefile index 26d3332..8d753af 100644 --- a/Makefile +++ b/Makefile @@ -690,6 +690,7 @@ LIB_H += mailmap.h LIB_H += merge-blobs.h LIB_H += merge-recursive.h LIB_H += mergesort.h +LIB_H += name-hash.h LIB_H += notes-cache.h LIB_H += notes-merge.h LIB_H += notes.h diff --git a/cache.h b/cache.h index e493563..cf33c67 100644 --- a/cache.h +++ b/cache.h @@ -313,7 +313,6 @@ static inline void remove_name_hash(struct cache_entry *ce) #define refresh_cache(flags) refresh_index(&the_index, (flags), NULL, NULL, NULL) #define ce_match_stat(ce, st, options) ie_match_stat(&the_index, (ce), (st), (options)) #define ce_modified(ce, st, options) ie_modified(&the_index, (ce), (st), (options)) -#define cache_name_exists(name, namelen, igncase) index_name_exists(&the_index, (name), (namelen), (igncase)) #define cache_name_is_other(name, namelen) index_name_is_other(&the_index, (name), (namelen)) #define resolve_undo_clear() resolve_undo_clear_index(&the_index) #define unmerge_cache_entry_at(at) unmerge_index_entry_at(&the_index, at) @@ -442,7 +441,6 @@ extern int write_index(struct index_state *, int newfd); extern int discard_index(struct index_state *); extern int unmerged_index(const struct index_state *); extern int verify_path(const char *path); -extern struct cache_entry *index_name_exists(struct index_state *istate, const char *name, int namelen, int igncase); extern int index_name_pos(const struct index_state *, const char *name, int namelen); #define ADD_CACHE_OK_TO_ADD 1 /* Ok to add */ #define ADD_CACHE_OK_TO_REPLACE 2 /* Ok to replace file/directory */ diff --git a/dir.c b/dir.c index 6809dd2..5fda5af 100644 --- a/dir.c +++ b/dir.c @@ -11,6 +11,7 @@ #include "dir.h" #include "refs.h" #include "wildmatch.h" +#include "name-hash.h" #ifdef MEASURE_EXCLUDE static uint32_t tv_treat_leading_path; diff --git a/name-hash.c b/name-hash.c index 1287a19..572d232 100644 --- a/name-hash.c +++ b/name-hash.c @@ -7,30 +7,7 @@ */ #define NO_THE_INDEX_COMPATIBILITY_MACROS #include "cache.h" - -/* - * This removes bit 5 if bit 6 is set. - * - * That will make US-ASCII characters hash to their upper-case - * equivalent. We could easily do this one whole word at a time, - * but that's for future worries. - */ -static inline unsigned char icase_hash(unsigned char c) -{ - return c & ~((c & 0x40) >> 1); -} - -static unsigned int hash_name(const char *name, int namelen) -{ - unsigned int hash = 0x123; - - while (namelen--) { - unsigned char c = *name++; - c = icase_hash(c); - hash = hash*101 + c; - } - return hash; -} +#include "name-hash.h" static void hash_index_entry_directories(struct index_state *istate, struct cache_entry *ce) { @@ -152,9 +129,11 @@ static int same_name(const struct cache_entry *ce, const char *name, int namelen return slow_same_name(name, namelen, ce->name, namelen < len ? namelen : len); } -struct cache_entry *index_name_exists(struct index_state *istate, const char *name, int namelen, int icase) +static inline struct cache_entry *find_entry_by_hash(struct index_state *istate, + const char *name, int namelen, + unsigned int hash, + int icase) { - unsigned int hash = hash_name(name, namelen); struct cache_entry *ce; lazy_init_name_hash(istate); @@ -189,3 +168,20 @@ struct cache_entry *index_name_exists(struct index_state *istate, const char *na } return NULL; } + +struct cache_entry *index_name_exists(struct index_state *istate, + const char *name, int namelen, + int icase) +{ + unsigned int hash = hash_name(name, namelen); + return find_entry_by_hash(istate, name, namelen, hash, icase); +} + +struct cache_entry *index_name_exists_base(struct index_state *istate, + unsigned int basehash, int baselen, + const char *name, int namelen, + int icase) +{ + unsigned int hash = hash_name_from(basehash, name + baselen, namelen - baselen); + return find_entry_by_hash(istate, name, namelen, hash, icase); +} diff --git a/name-hash.h b/name-hash.h new file mode 100644 index 0000000..997de30 --- /dev/null +++ b/name-hash.h @@ -0,0 +1,45 @@ +#ifndef NAME_HASH_H +#define NAME_HASH_H + +extern struct cache_entry *index_name_exists(struct index_state *istate, + const char *name, int namelen, + int igncase); +extern struct cache_entry *index_name_exists_base(struct index_state *istate, + unsigned int basehash, int baselen, + const char *name, int namelen, + int igncase); + +static inline struct cache_entry *cache_name_exists(const char *name, int namelen, int igncase) +{ + return index_name_exists(&the_index, name, namelen, igncase); +} + +/* + * This removes bit 5 if bit 6 is set. + * + * That will make US-ASCII characters hash to their upper-case + * equivalent. We could easily do this one whole word at a time, + * but that's for future worries. + */ +static inline unsigned char icase_hash(unsigned char c) +{ + return c & ~((c & 0x40) >> 1); +} + +static inline unsigned int hash_name_from(unsigned int hash, + const char *name, int namelen) +{ + while (namelen--) { + unsigned char c = *name++; + c = icase_hash(c); + hash = hash*101 + c; + } + return hash; +} + +static inline unsigned int hash_name(const char *name, int namelen) +{ + return hash_name_from(0x123, name, namelen); +} + +#endif diff --git a/read-cache.c b/read-cache.c index 827ae55..8bd3ec8 100644 --- a/read-cache.c +++ b/read-cache.c @@ -14,6 +14,7 @@ #include "resolve-undo.h" #include "strbuf.h" #include "varint.h" +#include "name-hash.h" static struct cache_entry *refresh_cache_entry(struct cache_entry *ce, int really); diff --git a/unpack-trees.c b/unpack-trees.c index 09e53df..60fa5d4 100644 --- a/unpack-trees.c +++ b/unpack-trees.c @@ -8,6 +8,7 @@ #include "progress.h" #include "refs.h" #include "attr.h" +#include "name-hash.h" /* * Error messages expected by scripts out of plumbing commands such as -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH v3 13/13] read_directory: calculate name hashes incrementally 2013-03-12 13:04 ` [PATCH v3 00/13] " Nguyễn Thái Ngọc Duy ` (11 preceding siblings ...) 2013-03-12 13:04 ` [PATCH v3 12/13] name-hash: allow to lookup a name with precalculated base hash Nguyễn Thái Ngọc Duy @ 2013-03-12 13:05 ` Nguyễn Thái Ngọc Duy 2013-03-14 13:05 ` [PATCH v3 00/13] Exclude optimizations Duy Nguyen 13 siblings, 0 replies; 48+ messages in thread From: Nguyễn Thái Ngọc Duy @ 2013-03-12 13:05 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy Instead of index_name_exists() calculating a hash for full pathname for every entry, we calculate partial hash per directory, use it as a seed. The number of characters that icase_hash has to chew will reduce. treat_leading_path: 0.000 0.000 read_directory: 1.296 1.235 +treat_one_path: 0.599 0.531 ++is_excluded: 0.102 0.102 +++prep_exclude: 0.040 0.040 +++matching: 0.035 0.035 ++dir_exist: 0.035 0.035 ++index_name_exists: 0.292 0.225 lazy_init_name_hash: 0.155 0.155 +simplify_away: 0.082 0.083 +dir_add_name: 0.000 0.000 Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- dir.c | 44 +++++++++++++++++++++++++++++++++----------- 1 file changed, 33 insertions(+), 11 deletions(-) diff --git a/dir.c b/dir.c index 5fda5af..8638dcd 100644 --- a/dir.c +++ b/dir.c @@ -46,6 +46,7 @@ struct path_simplify { static void read_directory_recursive(struct dir_struct *dir, const char *path, int len, + unsigned int hash, int check_only, const struct path_simplify *simplify, int *contents); @@ -1044,12 +1045,17 @@ static struct dir_entry *dir_entry_new(const char *pathname, int len) return ent; } -static struct dir_entry *dir_add_name(struct dir_struct *dir, const char *pathname, int len) +static struct dir_entry *dir_add_name(struct dir_struct *dir, + const char *pathname, int len, + unsigned int hash, int baselen) { if (!(dir->flags & DIR_SHOW_IGNORED)) { struct cache_entry *ce; START_CLOCK(); - ce = cache_name_exists(pathname, len, ignore_case); + ce = index_name_exists_base(&the_index, + hash, baselen, + pathname, len, + ignore_case); STOP_CLOCK(tv_index_name_exists); if (ce) return NULL; @@ -1225,7 +1231,9 @@ static enum directory_treatment treat_directory(struct dir_struct *dir, if ((dir->flags & DIR_SHOW_IGNORED) && !exclude) { dir->flags &= ~DIR_SHOW_IGNORED; dir->flags |= DIR_HIDE_EMPTY_DIRECTORIES; - read_directory_recursive(dir, dirname, len, 1, simplify, &contents); + read_directory_recursive(dir, dirname, len, + hash_name(dirname, len), + 1, simplify, &contents); dir->flags &= ~DIR_HIDE_EMPTY_DIRECTORIES; dir->flags |= DIR_SHOW_IGNORED; @@ -1234,7 +1242,9 @@ static enum directory_treatment treat_directory(struct dir_struct *dir, if (!(dir->flags & DIR_SHOW_IGNORED) && !(dir->flags & DIR_HIDE_EMPTY_DIRECTORIES)) return show_directory; - read_directory_recursive(dir, dirname, len, 1, simplify, &contents); + read_directory_recursive(dir, dirname, len, + hash_name(dirname, len), + 1, simplify, &contents); if (!contents) return ignore_directory; return show_directory; @@ -1401,6 +1411,8 @@ enum path_treatment { static enum path_treatment treat_one_path(struct dir_struct *dir, struct strbuf *path, + unsigned int hash, + int baselen, const struct path_simplify *simplify, int dtype, struct dirent *de, int exclude_shortcut_ok) @@ -1416,7 +1428,8 @@ static enum path_treatment treat_one_path(struct dir_struct *dir, dtype != DT_DIR) { struct cache_entry *ce; START_CLOCK(); - ce = cache_name_exists(path->buf, path->len, ignore_case); + ce = index_name_exists_base(&the_index, hash, baselen, + path->buf, path->len, ignore_case); STOP_CLOCK(tv_index_name_exists); if (ce) return path_ignored; @@ -1467,6 +1480,7 @@ static enum path_treatment treat_one_path(struct dir_struct *dir, static enum path_treatment treat_path(struct dir_struct *dir, struct dirent *de, struct strbuf *path, + unsigned int hash, int baselen, const struct path_simplify *simplify, int exclude_shortcut_ok) @@ -1485,7 +1499,8 @@ static enum path_treatment treat_path(struct dir_struct *dir, dtype = DTYPE(de); START_CLOCK(); - ret = treat_one_path(dir, path, simplify, dtype, de, exclude_shortcut_ok); + ret = treat_one_path(dir, path, hash, baselen, + simplify, dtype, de, exclude_shortcut_ok); STOP_CLOCK(tv_treat_one_path); return ret; } @@ -1501,6 +1516,7 @@ static enum path_treatment treat_path(struct dir_struct *dir, */ static void read_directory_recursive(struct dir_struct *dir, const char *base, int baselen, + unsigned int hash, int check_only, const struct path_simplify *simplify, int *contents) @@ -1517,12 +1533,16 @@ static void read_directory_recursive(struct dir_struct *dir, dir->exclude_prepared = 0; while ((de = readdir(fdir)) != NULL) { - switch (treat_path(dir, de, &path, baselen, + switch (treat_path(dir, de, &path, hash, baselen, simplify, !check_only && !contents)) { case path_recurse: read_directory_recursive(dir, path.buf, - path.len, 0, + path.len, + hash_name_from(hash, + path.buf + baselen, + path.len - baselen), + 0, simplify, contents); continue; @@ -1543,7 +1563,7 @@ static void read_directory_recursive(struct dir_struct *dir, if (check_only) break; START_CLOCK(); - dir_add_name(dir, path.buf, path.len); + dir_add_name(dir, path.buf, path.len, hash, baselen); STOP_CLOCK(tv_dir_add_name); } closedir(fdir); @@ -1619,7 +1639,7 @@ static int treat_leading_path(struct dir_struct *dir, if (simplify_away(sb.buf, sb.len, simplify)) break; dir->exclude_prepared = 0; - if (treat_one_path(dir, &sb, simplify, + if (treat_one_path(dir, &sb, 0, 0, simplify, DT_DIR, NULL, 0) == path_ignored) break; /* do not recurse into it */ if (len <= baselen) { @@ -1648,7 +1668,9 @@ int read_directory(struct dir_struct *dir, const char *path, int len, const char STOP_CLOCK(tv_lazy_init_name_hash); #endif START_CLOCK(); - read_directory_recursive(dir, path, len, 0, simplify, NULL); + read_directory_recursive(dir, path, len, + hash_name(path, len), + 0, simplify, NULL); STOP_CLOCK(tv_read_directory); } #ifdef MEASURE_EXCLUDE -- 1.8.1.2.536.gf441e6d ^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH v3 00/13] Exclude optimizations 2013-03-12 13:04 ` [PATCH v3 00/13] " Nguyễn Thái Ngọc Duy ` (12 preceding siblings ...) 2013-03-12 13:05 ` [PATCH v3 13/13] read_directory: calculate name hashes incrementally Nguyễn Thái Ngọc Duy @ 2013-03-14 13:05 ` Duy Nguyen 13 siblings, 0 replies; 48+ messages in thread From: Duy Nguyen @ 2013-03-14 13:05 UTC (permalink / raw) To: git; +Cc: Nguyễn Thái Ngọc Duy Just thinking out loud. If I'm not mistaken, a directory's mtime is only changed when files are added are removed in that directory. And that's all we care about in read_directory. So if we keep a secondary index containing the list of all (tracked and untracked) .gitignore files and all untracked files (regardless ignore status), we could avoid reading a directory if: - all relevant .gitignore are unchanged - the directory's stat is unchanged In that case we already have the list of untracked files. Exclude can be run over to filter out ignored files. And because we know these are not tracked, we do not need to call index_name_exists (not if ignore_case == 0). In the best case, nothing's added or removed, read_directory just issues a bunch of lstat (like index refresh), filter out ignored files and _not_ trigger (nor pay penalty for) lazy_init_name_hash. webkit has 11k untracked files. I don't think we have problems storing that list as we already have to chew 182k entries in index. Did I make any mistakes above? -- Duy ^ permalink raw reply [flat|nested] 48+ messages in thread
end of thread, other threads:[~2013-03-14 13:06 UTC | newest] Thread overview: 48+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-03-09 4:09 [PATCH 0/3] Trivial (and small) exclude optimizations Nguyễn Thái Ngọc Duy 2013-03-09 4:09 ` [PATCH 1/3] match_pathname: avoid calling strncmp if baselen is 0 Nguyễn Thái Ngọc Duy 2013-03-09 9:06 ` Antoine Pelisse 2013-03-09 4:09 ` [PATCH 2/3] dir.c: inline convenient *_icase helpers Nguyễn Thái Ngọc Duy 2013-03-09 4:09 ` [PATCH 3/3] match_basename: use strncmp instead of strcmp Nguyễn Thái Ngọc Duy 2013-03-09 7:50 ` Junio C Hamano 2013-03-09 8:47 ` Fredrik Gustafsson 2013-03-09 9:58 ` Duy Nguyen 2013-03-10 6:14 ` [PATCH v2 0/6] Exclude optimizations Nguyễn Thái Ngọc Duy 2013-03-10 6:14 ` [PATCH v2 1/6] match_pathname: avoid calling strncmp if baselen is 0 Nguyễn Thái Ngọc Duy 2013-03-10 6:14 ` [PATCH v2 2/6] dir.c: inline convenient *_icase helpers Nguyễn Thái Ngọc Duy 2013-03-10 6:14 ` [PATCH v2 3/6] match_basename: use strncmp instead of strcmp Nguyễn Thái Ngọc Duy 2013-03-10 7:34 ` Junio C Hamano 2013-03-10 10:38 ` Duy Nguyen 2013-03-10 11:43 ` Antoine Pelisse 2013-03-10 11:54 ` Antoine Pelisse 2013-03-10 12:06 ` Duy Nguyen 2013-03-10 12:11 ` Antoine Pelisse 2013-03-10 12:14 ` Duy Nguyen 2013-03-12 20:59 ` Junio C Hamano 2013-03-13 1:11 ` Duy Nguyen 2013-03-10 6:14 ` [PATCH v2 4/6] match_{base,path}name: replace strncmp_icase with strnequal_icase Nguyễn Thái Ngọc Duy 2013-03-10 6:14 ` [PATCH v2 5/6] dir.c: pass pathname length to last_exclude_matching Nguyễn Thái Ngọc Duy 2013-03-10 6:14 ` [PATCH v2 6/6] exclude: filter patterns by directory level Nguyễn Thái Ngọc Duy 2013-03-10 8:20 ` Junio C Hamano 2013-03-10 10:18 ` Duy Nguyen 2013-03-10 10:58 ` Junio C Hamano 2013-03-10 11:14 ` Duy Nguyen 2013-03-11 15:11 ` [PATCH v2 0/6] Exclude optimizations Duy Nguyen 2013-03-12 13:04 ` [PATCH v3 00/13] " Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 01/13] dir.c: add MEASURE_EXCLUDE code for tracking exclude performance Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 02/13] match_pathname: avoid calling strncmp if baselen is 0 Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 03/13] dir.c: inline convenient *_icase helpers Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 04/13] match_basename: use strncmp instead of strcmp Nguyễn Thái Ngọc Duy 2013-03-12 17:40 ` Antoine Pelisse 2013-03-13 1:05 ` Duy Nguyen 2013-03-12 13:04 ` [PATCH v3 05/13] match_{base,path}name: replace strncmp_icase with memequal_icase Nguyễn Thái Ngọc Duy 2013-03-13 1:14 ` Duy Nguyen 2013-03-12 13:04 ` [PATCH v3 06/13] dir: pass pathname length to last_exclude_matching Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 07/13] exclude: avoid calling prep_exclude on entries of the same directory Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 08/13] exclude: record baselen in the pattern Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 09/13] exclude: filter out patterns not applicable to the current directory Nguyễn Thái Ngọc Duy 2013-03-12 23:13 ` Eric Sunshine 2013-03-12 13:04 ` [PATCH v3 10/13] read_directory: avoid invoking exclude machinery on tracked files Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 11/13] Preallocate hash tables when the number of inserts are known in advance Nguyễn Thái Ngọc Duy 2013-03-12 13:04 ` [PATCH v3 12/13] name-hash: allow to lookup a name with precalculated base hash Nguyễn Thái Ngọc Duy 2013-03-12 13:05 ` [PATCH v3 13/13] read_directory: calculate name hashes incrementally Nguyễn Thái Ngọc Duy 2013-03-14 13:05 ` [PATCH v3 00/13] Exclude optimizations Duy Nguyen
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).