From: Junio C Hamano <gitster@pobox.com>
To: David Turner <dturner@twopensource.com>
Cc: git@vger.kernel.org, mhagger@alum.mit.edu, pclouds@gmail.com,
sunshine@sunshineco.com
Subject: Re: [PATCH v4 2/4] path: optimize common dir checking
Date: Wed, 26 Aug 2015 14:15:41 -0700 [thread overview]
Message-ID: <xmqqwpwh21ky.fsf@gitster.dls.corp.google.com> (raw)
In-Reply-To: <1440618365-20628-3-git-send-email-dturner@twopensource.com> (David Turner's message of "Wed, 26 Aug 2015 15:46:03 -0400")
David Turner <dturner@twopensource.com> writes:
> Instead of a linear search over common_list to check whether
> a path is common, use a trie. The trie search operates on
> path prefixes, and handles excludes.
>
> Signed-off-by: David Turner <dturner@twopensource.com>
> ---
> path.c | 226 ++++++++++++++++++++++++++++++++++++++++++++++----
> t/t0060-path-utils.sh | 1 +
> 2 files changed, 213 insertions(+), 14 deletions(-)
>
> diff --git a/path.c b/path.c
> index d24bfa2..c7a4c40 100644
> --- a/path.c
> +++ b/path.c
> @@ -121,25 +121,223 @@ struct common_dir common_list[] = {
> { 0, 0, 0, NULL }
> };
>
> -static void update_common_dir(struct strbuf *buf, int git_dir_len)
> +/*
> + * A compressed trie. A trie node consists of zero or more characters that
> + * are common to all elements with this prefix, optionally followed by some
> + * children. If value is not NULL, the trie node is a terminal node.
> + *
> + * For example, consider the following set of strings:
> + * abc
> + * def
> + * definite
> + * definition
> + *
> + * The trie would look look like:
> + * root: len = 0, value = (something), children a and d non-NULL.
"value = NULL", as there is no empty string registered in the trie?
> + * a: len = 2, contents = bc
"value = NULL" here, too (just showing I am following along, not
just skimming)?
> + * d: len = 2, contents = ef, children i non-NULL, value = (something)
> + * i: len = 3, contents = nit, children e and i non-NULL, value = NULL
> + * e: len = 0, children all NULL, value = (something)
> + * i: len = 2, contents = on, children all NULL, value = (something)
> + */
> +struct trie {
> + struct trie *children[256];
> + int len;
> + char *contents;
> + void *value;
> +};
> +
> +static struct trie *make_trie_node(const char *key, void *value)
> {
> - char *base = buf->buf + git_dir_len;
> - const struct common_dir *p;
> + struct trie *new_node = xcalloc(1, sizeof(*new_node));
> + new_node->len = strlen(key);
> + if (new_node->len) {
> + new_node->contents = xmalloc(new_node->len);
> + memcpy(new_node->contents, key, new_node->len);
> + }
> + new_node->value = value;
> + return new_node;
> +}
>
> - if (is_dir_file(base, "logs", "HEAD") ||
> - is_dir_file(base, "info", "sparse-checkout"))
> - return; /* keep this in $GIT_DIR */
> - for (p = common_list; p->dirname; p++) {
> - const char *path = p->dirname;
> - if (p->is_dir && dir_prefix(base, path)) {
> - replace_dir(buf, git_dir_len, get_git_common_dir());
> - return;
> +/*
> + * Add a key/value pair to a trie. The key is assumed to be \0-terminated.
> + * If there was an existing value for this key, return it.
> + */
> +static void *add_to_trie(struct trie *root, const char *key, void *value)
> +{
> + struct trie *child;
> + void *old;
> + int i;
> +
> + if (!*key) {
> + /* we have reached the end of the key */
> + old = root->value;
> + root->value = value;
> + return old;
> + }
> +
> + for (i = 0; i < root->len; ++i) {
> + if (root->contents[i] == key[i])
> + continue;
> +
> + /*
> + * Split this node: child will contain this node's
> + * existing children.
> + */
> + child = malloc(sizeof(*child));
> + memcpy(child->children, root->children, sizeof(root->children));
> +
> + child->len = root->len - i - 1;
> + if (child->len) {
> + child->contents = strndup(root->contents + i + 1,
> + child->len);
> }
> - if (!p->is_dir && !strcmp(base, path)) {
> - replace_dir(buf, git_dir_len, get_git_common_dir());
> - return;
> + child->value = root->value;
> + root->value = NULL;
> + root->len = i;
> +
> + memset(root->children, 0, sizeof(root->children));
> + root->children[(unsigned char)root->contents[i]] = child;
> +
> + /* This is the newly-added child. */
> + root->children[(unsigned char)key[i]] =
> + make_trie_node(key + i + 1, value);
> + return NULL;
> + }
> +
> + /* We have matched the entire compressed section */
> + if (key[i]) {
> + child = root->children[(unsigned char)key[root->len]];
> + if (child) {
> + return add_to_trie(child, key + root->len + 1, value);
> + } else {
> + child = make_trie_node(key + root->len + 1, value);
> + root->children[(unsigned char)key[root->len]] = child;
> + return NULL;
> }
> }
> +
> + old = root->value;
> + root->value = value;
> + return old;
> +}
> +
> +typedef int (*match_fn)(const char *unmatched, void *data, void *baton);
> +
> +/*
> + * Search a trie for some key. Find the longest /-or-\0-terminated
> + * prefix of the key for which the trie contains a value. Call fn
> + * with the unmatched portion of the key and the found value, and
> + * return its return value. If there is no such prefix, return -1.
> + *
> + * The key is partially normalized: consecutive slashes are skipped.
> + *
> + * For example, consider the trie containing only [refs,
> + * refs/worktree] (both with values).
> + *
> + * | key | unmatched | val from node | return value |
> + * |-----------------|------------|---------------|--------------|
> + * | a | not called | n/a | -1 |
> + * | refs | \0 | refs | as per fn |
> + * | refs/ | / | refs | as per fn |
> + * | refs/w | /w | refs | as per fn |
> + * | refs/worktree | \0 | refs/worktree | as per fn |
> + * | refs/worktree/ | / | refs/worktree | as per fn |
> + * | refs/worktree/a | /a | refs/worktree | as per fn |
> + * |-----------------|------------|---------------|--------------|
> + *
> + */
> +static int trie_find(struct trie *root, const char *key, match_fn fn,
> + void *baton)
> +{
> + int i;
> + int result;
> + struct trie *child;
> +
> + if (!*key) {
> + /* we have reached the end of the key */
> + if (root->value && !root->len)
> + return fn(key, root->value, baton);
> + else
> + return -1;
> + }
> +
> + for (i = 0; i < root->len; ++i) {
> + /* Partial path normalization: skip consecutive slashes. */
> + if (key[i] == '/' && key[i+1] == '/') {
> + key++;
> + continue;
> + }
> + if (root->contents[i] != key[i])
> + return -1;
> + }
> +
> + /* Matched the entire compressed section */
> + key += i;
> + if (!*key)
> + /* End of key */
> + return fn(key, root->value, baton);
> +
> + /* Partial path normalization: skip consecutive slashes */
> + while (key[0] == '/' && key[1] == '/')
> + key ++;
These "skipping consecutive slashes" inside this function is cute.
I would have expected me to react "Eek, Ugly" to it, but somehow I
didn't find them ugly.
> + child = root->children[(unsigned char)*key];
> + if (child)
> + result = trie_find(child, key + 1, fn, baton);
> + else
> + result = -1;
> +
> + if (result >= 0 || (*key != '/' && *key != 0))
> + return result;
> + if (root->value)
> + return fn(key, root->value, baton);
> + else
> + return -1;
> +}
> +
> +static struct trie common_trie;
> +static int common_trie_done_setup;
> +
> +static void init_common_trie(void)
> +{
> + struct common_dir *p;
> +
> + if (common_trie_done_setup)
> + return;
> +
> + for (p = common_list; p->dirname; p++)
> + add_to_trie(&common_trie, p->dirname, p);
> +
> + common_trie_done_setup = 1;
> +}
I somehow wonder if we'd want to precomute the trie at the build
time, though ;-)
> +/*
> + * Helper function for update_common_dir: returns 1 if the dir
> + * prefix is common.
> + */
> +static int check_common(const char *unmatched, void *value, void *baton)
> +{
> + struct common_dir *dir = value;
> +
> + if (!dir)
> + return 0;
> +
> + if (dir->is_dir && (unmatched[0] == 0 || unmatched[0] == '/'))
> + return !dir->exclude;
> +
> + if (!dir->is_dir && unmatched[0] == 0)
> + return !dir->exclude;
> +
> + return 0;
> +}
> +
> +static void update_common_dir(struct strbuf *buf, int git_dir_len)
> +{
> + char *base = buf->buf + git_dir_len;
> + init_common_trie();
> + if (trie_find(&common_trie, base, check_common, NULL) > 0)
> + replace_dir(buf, git_dir_len, get_git_common_dir());
> }
Thanks for a pleasant read.
>
> void report_linked_checkout_garbage(void)
> diff --git a/t/t0060-path-utils.sh b/t/t0060-path-utils.sh
> index 93605f4..1ca49e1 100755
> --- a/t/t0060-path-utils.sh
> +++ b/t/t0060-path-utils.sh
> @@ -271,6 +271,7 @@ test_git_path GIT_COMMON_DIR=bar objects/bar bar/objects/bar
> test_git_path GIT_COMMON_DIR=bar info/exclude bar/info/exclude
> test_git_path GIT_COMMON_DIR=bar info/grafts bar/info/grafts
> test_git_path GIT_COMMON_DIR=bar info/sparse-checkout .git/info/sparse-checkout
> +test_git_path GIT_COMMON_DIR=bar info//sparse-checkout .git/info//sparse-checkout
> test_git_path GIT_COMMON_DIR=bar remotes/bar bar/remotes/bar
> test_git_path GIT_COMMON_DIR=bar branches/bar bar/branches/bar
> test_git_path GIT_COMMON_DIR=bar logs/refs/heads/master bar/logs/refs/heads/master
next prev parent reply other threads:[~2015-08-26 21:15 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-08-26 19:46 [PATCH v4 0/4] per-worktree bisection refs David Turner
2015-08-26 19:46 ` [PATCH v4 1/4] refs: clean up common_list David Turner
2015-08-26 19:46 ` [PATCH v4 2/4] path: optimize common dir checking David Turner
2015-08-26 21:15 ` Junio C Hamano [this message]
2015-08-26 22:10 ` David Turner
2015-08-26 22:15 ` Junio C Hamano
2015-08-26 22:19 ` David Turner
2015-08-28 16:39 ` Junio C Hamano
2015-08-28 18:36 ` David Turner
2015-08-30 6:25 ` Torsten Bögershausen
2015-08-31 18:56 ` David Turner
2015-08-26 19:46 ` [PATCH v4 3/4] refs: make refs/worktree/* per-worktree David Turner
2015-08-26 19:46 ` [PATCH v4 4/4] bisect: make bisection refs per-worktree David Turner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=xmqqwpwh21ky.fsf@gitster.dls.corp.google.com \
--to=gitster@pobox.com \
--cc=dturner@twopensource.com \
--cc=git@vger.kernel.org \
--cc=mhagger@alum.mit.edu \
--cc=pclouds@gmail.com \
--cc=sunshine@sunshineco.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).