Git development

Git development
 help / color / mirror / Atom feed

* Re: [PATCH 3/6] fsck: prepare dummy objects for --connectivity-check
From: Junio C Hamano @ 2017-01-17 21:16 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Michael Haggerty, Johannes Schindelin
In-Reply-To: <20170116213204.e7ykwowqzafkexqd@sigill.intra.peff.net>

Jeff King <peff@peff.net> writes:

> +		# Drop the index now; we want to be sure that we
> +		# recursively notice that we notice the broken objects
> +		# because they are reachable from refs, not because
> +		# they are in the index.

Rephrase to reduce redundunt "notice"s?

^ permalink raw reply

* Re: [PATCH 4/6] fsck: tighten error-checks of "git fsck <head>"
From: Junio C Hamano @ 2017-01-17 21:17 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Michael Haggerty, Johannes Schindelin
In-Reply-To: <20170116213329.jk26zvcp7erzfc6l@sigill.intra.peff.net>

Jeff King <peff@peff.net> writes:

> Instead of checking reachability from the refs, you can ask
> fsck to check from a particular set of heads. However, the
> error checking here is quite lax. In particular:
>
>   1. It claims lookup_object() will report an error, which
>      is not true. It only does a hash lookup, and the user
>      has no clue that their argument was skipped.
>
>   2. When either the name or sha1 cannot be resolved, we
>      continue to exit with a successful error code, even
>      though we didn't check what the user asked us to.
>
> This patch fixes both of these cases.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---

Makes sense, too.  Thanks.

^ permalink raw reply

* [PATCH 0/2] Fix remote_is_configured()
From: Johannes Schindelin @ 2017-01-17 21:18 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Thomas Gummerer, Andrew Arnott

A surprising behavior triggered the bug report in
https://github.com/git-for-windows/git/issues/888: the mere existence of
the config setting "remote.origin.prune" (in this instance, configured
via ~/.gitconfig so that it applies to all repositories) fooled `git
remote rename <source> <target>` into believing that the <target> remote
is already there.

This patch pair demonstrates the problem, and then fixes it (along with
potential similar problems, such as setting an HTTP proxy for remotes of
a given name via ~/.gitconfig).

Johannes Schindelin (2):
  remote rename: demonstrate a bogus "remote exists" bug
  Be more careful when determining whether a remote was configured

 remote.c          | 9 ++++++++-
 remote.h          | 2 +-
 t/t5505-remote.sh | 9 +++++++++
 3 files changed, 18 insertions(+), 2 deletions(-)

base-commit: d7dffce1cebde29a0c4b309a79e4345450bf352a
Published-As: https://github.com/dscho/git/releases/tag/rename-remote-v1
Fetch-It-Via: git fetch https://github.com/dscho/git rename-remote-v1

-- 
2.11.0.windows.3

^ permalink raw reply

* [PATCH 1/2] remote rename: demonstrate a bogus "remote exists" bug
From: Johannes Schindelin @ 2017-01-17 21:19 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Thomas Gummerer, Andrew Arnott
In-Reply-To: <cover.1484687919.git.johannes.schindelin@gmx.de>

Some users like to set `remote.origin.prune = true` in their ~/.gitconfig
so that all of their repositories use that default.

However, our code is ill-prepared for this, mistaking that single entry to
mean that there is already a remote of the name "origin", even if there is
not.

This patch adds a test case demonstrating this issue.

Reported by Andrew Arnott.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 t/t5505-remote.sh | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/t/t5505-remote.sh b/t/t5505-remote.sh
index 8198d8eb05..d7e41e9230 100755
--- a/t/t5505-remote.sh
+++ b/t/t5505-remote.sh
@@ -764,6 +764,15 @@ test_expect_success 'rename a remote with name prefix of other remote' '
 	)
 '
 
+test_expect_failure 'rename succeeds with existing remote.<target>.prune' '
+	git clone one four.four &&
+	(
+		cd four.four &&
+		git config remote.upstream.prune true &&
+		git remote rename origin upstream
+	)
+'
+
 cat >remotes_origin <<EOF
 URL: $(pwd)/one
 Push: refs/heads/master:refs/heads/upstream
-- 
2.11.0.windows.3



^ permalink raw reply related

* [PATCH 2/2] Be more careful when determining whether a remote was configured
From: Johannes Schindelin @ 2017-01-17 21:19 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Thomas Gummerer, Andrew Arnott
In-Reply-To: <cover.1484687919.git.johannes.schindelin@gmx.de>

One of the really nice features of the ~/.gitconfig file is that users
can override defaults by their own preferred settings for all of their
repositories.

One such default that some users like to override is whether the
"origin" remote gets auto-pruned or not. The user would simply call

	git config --global remote.origin.prune true

and from now on all "origin" remotes would be pruned automatically when
fetching into the local repository.

There is just one catch: now Git thinks that the "origin" remote is
configured, as it does not discern between having a remote whose
fetch (and/or push) URL and refspec is set, and merely having
preemptively-configured, global flags for specific remotes.

Let's fix this by telling Git that a remote is not configured unless any
fetch/push URL or refspect is configured explicitly.

As a special exception, we deem a remote configured also when *only* the
"vcs" setting is configured. The commit a31eeae27f (remote: use
remote_is_configured() for add and rename, 2016-02-16) specifically
extended our test suite to verify this, so it is safe to assume that there
has been at least one user with a legitimate use case for this.

This fixes https://github.com/git-for-windows/git/issues/888

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 remote.c          | 9 ++++++++-
 remote.h          | 2 +-
 t/t5505-remote.sh | 2 +-
 3 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/remote.c b/remote.c
index ad6c5424ed..298f2f93fa 100644
--- a/remote.c
+++ b/remote.c
@@ -255,6 +255,7 @@ static void read_remotes_file(struct remote *remote)
 
 	if (!f)
 		return;
+	remote->configured = 1;
 	remote->origin = REMOTE_REMOTES;
 	while (strbuf_getline(&buf, f) != EOF) {
 		const char *v;
@@ -289,6 +290,7 @@ static void read_branches_file(struct remote *remote)
 		return;
 	}
 
+	remote->configured = 1;
 	remote->origin = REMOTE_BRANCHES;
 
 	/*
@@ -384,21 +386,25 @@ static int handle_config(const char *key, const char *value, void *cb)
 		if (git_config_string(&v, key, value))
 			return -1;
 		add_url(remote, v);
+		remote->configured = 1;
 	} else if (!strcmp(subkey, "pushurl")) {
 		const char *v;
 		if (git_config_string(&v, key, value))
 			return -1;
 		add_pushurl(remote, v);
+		remote->configured = 1;
 	} else if (!strcmp(subkey, "push")) {
 		const char *v;
 		if (git_config_string(&v, key, value))
 			return -1;
 		add_push_refspec(remote, v);
+		remote->configured = 1;
 	} else if (!strcmp(subkey, "fetch")) {
 		const char *v;
 		if (git_config_string(&v, key, value))
 			return -1;
 		add_fetch_refspec(remote, v);
+		remote->configured = 1;
 	} else if (!strcmp(subkey, "receivepack")) {
 		const char *v;
 		if (git_config_string(&v, key, value))
@@ -427,6 +433,7 @@ static int handle_config(const char *key, const char *value, void *cb)
 		return git_config_string((const char **)&remote->http_proxy_authmethod,
 					 key, value);
 	} else if (!strcmp(subkey, "vcs")) {
+		remote->configured = 1;
 		return git_config_string(&remote->foreign_vcs, key, value);
 	}
 	return 0;
@@ -716,7 +723,7 @@ struct remote *pushremote_get(const char *name)
 
 int remote_is_configured(struct remote *remote)
 {
-	return remote && remote->origin;
+	return remote && remote->configured;
 }
 
 int for_each_remote(each_remote_fn fn, void *priv)
diff --git a/remote.h b/remote.h
index 924881169d..7e6c8067bb 100644
--- a/remote.h
+++ b/remote.h
@@ -15,7 +15,7 @@ struct remote {
 	struct hashmap_entry ent;  /* must be first */
 
 	const char *name;
-	int origin;
+	int origin, configured;
 
 	const char *foreign_vcs;
 
diff --git a/t/t5505-remote.sh b/t/t5505-remote.sh
index d7e41e9230..09c9823002 100755
--- a/t/t5505-remote.sh
+++ b/t/t5505-remote.sh
@@ -764,7 +764,7 @@ test_expect_success 'rename a remote with name prefix of other remote' '
 	)
 '
 
-test_expect_failure 'rename succeeds with existing remote.<target>.prune' '
+test_expect_success 'rename succeeds with existing remote.<target>.prune' '
 	git clone one four.four &&
 	(
 		cd four.four &&
-- 
2.11.0.windows.3

^ permalink raw reply related

* Re: [PATCH 3/6] fsck: prepare dummy objects for --connectivity-check
From: Junio C Hamano @ 2017-01-17 21:15 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Michael Haggerty, Johannes Schindelin
In-Reply-To: <20170116213204.e7ykwowqzafkexqd@sigill.intra.peff.net>

Jeff King <peff@peff.net> writes:

> +static void mark_object_for_connectivity(const unsigned char *sha1)
> +{
> +	struct object *obj = lookup_object(sha1);
> +
> +	/*
> +	 * Setting the object type here isn't strictly necessary here for a
> +	 * connectivity check.

Drop one of these "here"s?


> The cmd_fsck() part of the diff is pretty nasty without
> "-b".

True.  I also wonder if swapping the if/else arms make the end
result and the patch easier to read. i.e.

+	if (connectivity_only) {
+		mark loose for connectivity;
+		mark packed for connectivity;
+	} else {
		... existing code comes here reindented ...
	}                

But the patch makes sense.

> diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
> index e88ec7747..c1b2dda33 100755
> --- a/t/t1450-fsck.sh
> +++ b/t/t1450-fsck.sh
> @@ -523,9 +523,21 @@ test_expect_success 'fsck --connectivity-only' '
>  		touch empty &&
>  		git add empty &&
>  		test_commit empty &&
> +
> +		# Drop the index now; we want to be sure that we
> +		# recursively notice that we notice the broken objects
> +		# because they are reachable from refs, not because
> +		# they are in the index.
> +		rm -f .git/index &&
> +
> +		# corrupt the blob, but in a way that we can still identify
> +		# its type. That lets us see that --connectivity-only is
> +		# not actually looking at the contents, but leaves it
> +		# free to examine the type if it chooses.
>  		empty=.git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 &&
> -		rm -f $empty &&
> -		echo invalid >$empty &&
> +		blob=$(echo unrelated | git hash-object -w --stdin) &&
> +		mv $(sha1_file $blob) $empty &&
> +
>  		test_must_fail git fsck --strict &&
>  		git fsck --strict --connectivity-only &&
>  		tree=$(git rev-parse HEAD:) &&
> @@ -537,6 +549,13 @@ test_expect_success 'fsck --connectivity-only' '
>  	)
>  '
>  
> +test_expect_success 'fsck --connectivity-only with explicit head' '
> +	(
> +		cd connectivity-only &&
> +		test_must_fail git fsck --connectivity-only $tree
> +	)
> +'

Most of the earlier "tree=..." assignments are done in subshells,
and it is not clear which tree this refers to.  Is this the one that
was written in 'rev-list --verify-objects with bad sha1' that has
been removed in its when-finished handler?

>  remove_loose_object () {
>  	sha1="$(git rev-parse "$1")" &&
>  	remainder=${sha1#??} &&

^ permalink raw reply

* Re: [PATCH 3/6] fsck: prepare dummy objects for --connectivity-check
From: Jeff King @ 2017-01-17 21:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Michael Haggerty, Johannes Schindelin
In-Reply-To: <xmqqa8ape6b6.fsf@gitster.mtv.corp.google.com>

On Tue, Jan 17, 2017 at 01:15:57PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > +static void mark_object_for_connectivity(const unsigned char *sha1)
> > +{
> > +	struct object *obj = lookup_object(sha1);
> > +
> > +	/*
> > +	 * Setting the object type here isn't strictly necessary here for a
> > +	 * connectivity check.
> 
> Drop one of these "here"s?

Yeah.

> > The cmd_fsck() part of the diff is pretty nasty without
> > "-b".
> 
> True.  I also wonder if swapping the if/else arms make the end
> result and the patch easier to read. i.e.
> 
> +	if (connectivity_only) {
> +		mark loose for connectivity;
> +		mark packed for connectivity;
> +	} else {
> 		... existing code comes here reindented ...
> 	}                
> 
> But the patch makes sense.

Yeah. It doesn't help the diff, but the end result is more readable. And
the conditional lacks a negation, which is nice. Will change.

> > +test_expect_success 'fsck --connectivity-only with explicit head' '
> > +	(
> > +		cd connectivity-only &&
> > +		test_must_fail git fsck --connectivity-only $tree
> > +	)
> > +'
> 
> Most of the earlier "tree=..." assignments are done in subshells,
> and it is not clear which tree this refers to.  Is this the one that
> was written in 'rev-list --verify-objects with bad sha1' that has
> been removed in its when-finished handler?

Doh. It was meant to be the one from the earlier --connectivity-only
check. But that is doubly silly, even, because the tree variable in that
case is holding the filename, not the sha1. The right thing is to call
rev-parse again, but of course that doesn't work because the repo has
been corrupted. :-/

Here's a re-roll with the two changes above (and the notice thing you
mentioned elsewhere), plus a test that actually checks the right thing
(fails before this commit and passes after).

If everything else looks good, you can queue it along with the rest.
Otherwise if I need a re-roll, it will be in the next version.

-- >8 --
Subject: fsck: prepare dummy objects for --connectivity-check

Normally fsck makes a pass over all objects to check their
integrity, and then follows up with a reachability check to
make sure we have all of the referenced objects (and to know
which ones are dangling). The latter checks for the HAS_OBJ
flag in obj->flags to see if we found the object in the
first pass.

Commit 02976bf85 (fsck: introduce `git fsck --connectivity-only`,
2015-06-22) taught fsck to skip the initial pass, and to
fallback to has_sha1_file() instead of the HAS_OBJ check.

However, it converted only one HAS_OBJ check to use
has_sha1_file(). But there are many other places in
builtin/fsck.c that assume that the flag is set (or that
lookup_object() will return an object at all). This leads to
several bugs with --connectivity-only:

  1. mark_object() will not queue objects for examination,
     so recursively following links from commits to trees,
     etc, did nothing. I.e., we were checking the
     reachability of hardly anything at all.

  2. When a set of heads is given on the command-line, we
     use lookup_object() to see if they exist. But without
     the initial pass, we assume nothing exists.

  3. When loading reflog entries, we do a similar
     lookup_object() check, and complain that the reflog is
     broken if the object doesn't exist in our hash.

So in short, --connectivity-only is broken pretty badly, and
will claim that your repository is fine when it's not.
Presumably nobody noticed for a few reasons.

One is that the embedded test does not actually test the
recursive nature of the reachability check. All of the
missing objects are still in the index, and we directly
check items from the index. This patch modifies the test to
delete the index, which shows off breakage (1).

Another is that --connectivity-only just skips the initial
pass for loose objects. So on a real repository, the packed
objects were still checked correctly. But on the flipside,
it means that "git fsck --connectivity-only" still checks
the sha1 of all of the packed objects, nullifying its
original purpose of being a faster git-fsck.

And of course the final problem is that the bug only shows
up when there _is_ corruption, which is rare. So anybody
running "git fsck --connectivity-only" proactively would
assume it was being thorough, when it was not.

One possibility for fixing this is to find all of the spots
that rely on HAS_OBJ and tweak them for the connectivity-only
case. But besides the risk that we might miss a spot (and I
found three already, corresponding to the three bugs above),
there are other parts of fsck that _can't_ work without a
full list of objects. E.g., the list of dangling objects.

Instead, let's make the connectivity-only case look more
like the normal case. Rather than skip the initial pass
completely, we'll do an abbreviated one that sets up the
HAS_OBJ flag for each object, without actually loading the
object data.

That's simple and fast, and we don't have to care about the
connectivity_only flag in the rest of the code at all.
While we're at it, let's make sure we treat loose and packed
objects the same (i.e., setting up dummy objects for both
and skipping the actual sha1 check). That makes the
connectivity-only check actually fast on a real repo (40
seconds versus 180 seconds on my copy of linux.git).

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fsck.c  | 120 +++++++++++++++++++++++++++++++++++++++++++++-----------
 t/t1450-fsck.sh |  29 +++++++++++++-
 2 files changed, 124 insertions(+), 25 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 3e67203f9..75e836e2f 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -205,8 +205,6 @@ static void check_reachable_object(struct object *obj)
 	if (!(obj->flags & HAS_OBJ)) {
 		if (has_sha1_pack(obj->oid.hash))
 			return; /* it is in pack - forget about it */
-		if (connectivity_only && has_object_file(&obj->oid))
-			return;
 		printf("missing %s %s\n", typename(obj->type),
 			describe_object(obj));
 		errors_found |= ERROR_REACHABLE;
@@ -584,6 +582,79 @@ static int fsck_cache_tree(struct cache_tree *it)
 	return err;
 }
 
+static void mark_object_for_connectivity(const unsigned char *sha1)
+{
+	struct object *obj = lookup_object(sha1);
+
+	/*
+	 * Setting the object type here isn't strictly necessary for a
+	 * connectivity check. In most cases, our walk will expect a certain
+	 * type (e.g., a tree referencing a blob) and will use lookup_blob() to
+	 * assign the type. But doing it here has two advantages:
+	 *
+	 *   1. When the fsck_walk code looks at objects that _don't_ come from
+	 *      links (e.g., the tip of a ref), it may complain about the
+	 *      "unknown object type".
+	 *
+	 *   2. This serves as a nice cross-check that the graph links are
+	 *      sane. So --connectivity-only does not check that the bits of
+	 *      blobs are not corrupted, but it _does_ check that 100644 tree
+	 *      entries point to blobs, and so forth.
+	 *
+	 * Unfortunately we can't just use parse_object() here, because the
+	 * whole point of --connectivity-only is to avoid reading the object
+	 * data more than necessary.
+	 */
+	if (!obj || obj->type == OBJ_NONE) {
+		enum object_type type = sha1_object_info(sha1, NULL);
+		switch (type) {
+		case OBJ_BAD:
+			error("%s: unable to read object type",
+			      sha1_to_hex(sha1));
+			break;
+		case OBJ_COMMIT:
+			obj = (struct object *)lookup_commit(sha1);
+			break;
+		case OBJ_TREE:
+			obj = (struct object *)lookup_tree(sha1);
+			break;
+		case OBJ_BLOB:
+			obj = (struct object *)lookup_blob(sha1);
+			break;
+		case OBJ_TAG:
+			obj = (struct object *)lookup_tag(sha1);
+			break;
+		default:
+			error("%s: unknown object type %d",
+			      sha1_to_hex(sha1), type);
+		}
+
+		if (!obj || obj->type == OBJ_NONE) {
+			errors_found |= ERROR_OBJECT;
+			return;
+		}
+	}
+
+	obj->flags |= HAS_OBJ;
+}
+
+static int mark_loose_for_connectivity(const unsigned char *sha1,
+				       const char *path,
+				       void *data)
+{
+	mark_object_for_connectivity(sha1);
+	return 0;
+}
+
+static int mark_packed_for_connectivity(const unsigned char *sha1,
+					struct packed_git *pack,
+					uint32_t pos,
+					void *data)
+{
+	mark_object_for_connectivity(sha1);
+	return 0;
+}
+
 static char const * const fsck_usage[] = {
 	N_("git fsck [<options>] [<object>...]"),
 	NULL
@@ -640,38 +711,41 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 	git_config(fsck_config, NULL);
 
 	fsck_head_link();
-	if (!connectivity_only) {
+	if (connectivity_only) {
+		for_each_loose_object(mark_loose_for_connectivity, NULL, 0);
+		for_each_packed_object(mark_packed_for_connectivity, NULL, 0);
+	} else {
 		fsck_object_dir(get_object_directory());
 
 		prepare_alt_odb();
 		for (alt = alt_odb_list; alt; alt = alt->next)
 			fsck_object_dir(alt->path);
-	}
 
-	if (check_full) {
-		struct packed_git *p;
-		uint32_t total = 0, count = 0;
-		struct progress *progress = NULL;
+		if (check_full) {
+			struct packed_git *p;
+			uint32_t total = 0, count = 0;
+			struct progress *progress = NULL;
+
+			prepare_packed_git();
 
-		prepare_packed_git();
+			if (show_progress) {
+				for (p = packed_git; p; p = p->next) {
+					if (open_pack_index(p))
+						continue;
+					total += p->num_objects;
+				}
 
-		if (show_progress) {
+				progress = start_progress(_("Checking objects"), total);
+			}
 			for (p = packed_git; p; p = p->next) {
-				if (open_pack_index(p))
-					continue;
-				total += p->num_objects;
+				/* verify gives error messages itself */
+				if (verify_pack(p, fsck_obj_buffer,
+						progress, count))
+					errors_found |= ERROR_PACK;
+				count += p->num_objects;
 			}
-
-			progress = start_progress(_("Checking objects"), total);
-		}
-		for (p = packed_git; p; p = p->next) {
-			/* verify gives error messages itself */
-			if (verify_pack(p, fsck_obj_buffer,
-					progress, count))
-				errors_found |= ERROR_PACK;
-			count += p->num_objects;
+			stop_progress(&progress);
 		}
-		stop_progress(&progress);
 	}
 
 	heads = 0;
diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index e88ec7747..4d1c3ba66 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -523,9 +523,21 @@ test_expect_success 'fsck --connectivity-only' '
 		touch empty &&
 		git add empty &&
 		test_commit empty &&
+
+		# Drop the index now; we want to be sure that we
+		# recursively notice the broken objects
+		# because they are reachable from refs, not because
+		# they are in the index.
+		rm -f .git/index &&
+
+		# corrupt the blob, but in a way that we can still identify
+		# its type. That lets us see that --connectivity-only is
+		# not actually looking at the contents, but leaves it
+		# free to examine the type if it chooses.
 		empty=.git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 &&
-		rm -f $empty &&
-		echo invalid >$empty &&
+		blob=$(echo unrelated | git hash-object -w --stdin) &&
+		mv $(sha1_file $blob) $empty &&
+
 		test_must_fail git fsck --strict &&
 		git fsck --strict --connectivity-only &&
 		tree=$(git rev-parse HEAD:) &&
@@ -537,6 +549,19 @@ test_expect_success 'fsck --connectivity-only' '
 	)
 '
 
+test_expect_success 'fsck --connectivity-only with explicit head' '
+	rm -rf connectivity-only &&
+	git init connectivity-only &&
+	(
+		cd connectivity-only &&
+		test_commit foo &&
+		rm -f .git/index &&
+		tree=$(git rev-parse HEAD^{tree}) &&
+		remove_object $(git rev-parse HEAD:foo.t) &&
+		test_must_fail git fsck --connectivity-only $tree
+	)
+'
+
 remove_loose_object () {
 	sha1="$(git rev-parse "$1")" &&
 	remainder=${sha1#??} &&
-- 
2.11.0.651.g41f4a5c4e


^ permalink raw reply related

* Re: [PATCH v5 0/3] Turn the difftool into a builtin
From: Junio C Hamano @ 2017-01-17 21:31 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, David Aguilar, Dennis Kaarsemaker, Paul Sbarra
In-Reply-To: <cover.1484668473.git.johannes.schindelin@gmx.de>

Johannes Schindelin <johannes.schindelin@gmx.de> writes:

> - replaced the cross-validation with the Perl script by a patch that
>   retires the Perl script instead.

Yup.  That makes things a lot simpler.  While we try to be careful
during major rewrite, we usually do not go extra careful to leave
two competing implementations in-tree.

Will replace js/difftool-builtin topic.  Thanks.

^ permalink raw reply

* Re: [PATCH] CodingGuidelines: clarify multi-line brace style
From: Junio C Hamano @ 2017-01-17 21:41 UTC (permalink / raw)
  To: Jeff King; +Cc: Johannes Schindelin, Johannes Sixt, git
In-Reply-To: <20170117200503.3iwaedrgfq544b4i@sigill.intra.peff.net>

Jeff King <peff@peff.net> writes:

> Yeah. I obviously was adapting the original text, and I think I left too
> much of the wishy-washy-ness in. As the point of the patch is to avoid
> that, let's take your suggestion. A re-rolled patch is below.
>
> Now the patch is at least self-consistent. The bigger question remains
> of: do we want to dictate these rules, or did we prefer the vague
> version?

I would say no to make all of them "rules to be dictated".  It is OK
to leave some things "gray".  

But obviously others can vote differently ;-)

> I _thought_ the vague rules were doing fine, but this whole discussion
> has made me think otherwise. And I'd just as soon not ever have to
> repeat it. ;-/
>
> OTOH, I really do not want to review a bunch of patches that do nothing
> but change brace style (and I am sure there is a mix of styles already
> in the code base).

Yes, exactly, that is why the last sentence is in there below.

>> 	When the project says it does not have strong preference
>> 	among multiple choices, you are welcome to write your new
>> 	code using any of them, as long as you are consistent with
>> 	surrounding code.  Do not change style of existing code only
>> 	to flip among these styles, though.

> The existing document says:
>
>     Make your code readable and sensible, and don't try to be clever.
>     
>     As for more concrete guidelines, just imitate the existing code
>     (this is a good guideline, no matter which project you are
>     contributing to). It is always preferable to match the _local_
>     convention. New code added to Git suite is expected to match
>     the overall style of existing code. Modifications to existing
>     code is expected to match the style the surrounding code already
>     uses (even if it doesn't match the overall style of existing code).
>     
>     But if you must have a list of rules, here they are.
>
> I guess that is the place to expound on how to interpret the rules. I
> dunno. Some of the individual rules that go into "gray areas" already
> spell out the "surrounding code" rule.


We are not saying one important thing and that was what invited
useless arguing this time: For "gray" things, the choice is yours in
your new code, but do not churn existing code for "gray" guidelines.

If we made _that_ clear as a rule we dictate, hopefully we'd have
less chance to see "a bunch of patches that do nothing but change
brace style", I would think.

> -- >8 --
> Subject: [PATCH] CodingGuidelines: clarify multi-line brace style
>
> There are some "gray areas" around when to omit braces from
> a conditional or loop body. Since that seems to have
> resulted in some arguments, let's be a little more clear
> about our preferred style.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  Documentation/CodingGuidelines | 37 ++++++++++++++++++++++++++++++++-----
>  1 file changed, 32 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/CodingGuidelines b/Documentation/CodingGuidelines
> index 4cd95da6b..a4191aa38 100644
> --- a/Documentation/CodingGuidelines
> +++ b/Documentation/CodingGuidelines
> @@ -206,11 +206,38 @@ For C programs:
>  		x = 1;
>  	}
>  
> -   is frowned upon.  A gray area is when the statement extends
> -   over a few lines, and/or you have a lengthy comment atop of
> -   it.  Also, like in the Linux kernel, if there is a long list
> -   of "else if" statements, it can make sense to add braces to
> -   single line blocks.
> +   is frowned upon. But there are a few exceptions:
> +
> +	- When the statement extends over a few lines (e.g., a while loop
> +	  with an embedded conditional, or a comment). E.g.:
> +
> +		while (foo) {
> +			if (x)
> +				one();
> +			else
> +				two();
> +		}
> +
> +		if (foo) {
> +			/*
> +			 * This one requires some explanation,
> +			 * so we're better off with braces to make
> +			 * it obvious that the indentation is correct.
> +			 */
> +			doit();
> +		}
> +
> +	- When there are multiple arms to a conditional and some of them
> +	  require braces, enclose even a single line block in braces for
> +	  consistency. E.g.:
> +
> +		if (foo) {
> +			doit();
> +		} else {
> +			one();
> +			two();
> +			three();
> +		}
>  
>   - We try to avoid assignments in the condition of an "if" statement.

^ permalink raw reply

* Re: [PATCH] document index_name_pos
From: Junio C Hamano @ 2017-01-17 21:43 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git
In-Reply-To: <20170117204642.31514-1-sbeller@google.com>

Stefan Beller <sbeller@google.com> writes:

> Signed-off-by: Stefan Beller <sbeller@google.com>
> ---
>
>> These placeholders are meant to encourage those people who dove into
>> the code to update it, so from that point of view, I think removing
>> it is backwards.
>
> Yes, I am currently understanding and writing up documentation for
> index_name_pos. If I recall the latest discussion where we want to have
> documentation, I think a quorum favored documentation in the header itself,
> c.f. strbuf.h, string-list.h for the most desired state. (Although we do have
> Documentation/technical/api-string-list.txt as well ...)
>
> So maybe starting like this?

That is very good.  Let's drop that file from Documentation/technical
and do it like this (meaning, take both patches from you).

Thanks.


>
> Thanks,
> Stefan
>
>  cache.h | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/cache.h b/cache.h
> index 1b67f078dd..e168e4e073 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -575,7 +575,20 @@ extern int verify_path(const char *path);
>  extern int index_dir_exists(struct index_state *istate, const char *name, int namelen);
>  extern void adjust_dirname_case(struct index_state *istate, char *name);
>  extern struct cache_entry *index_file_exists(struct index_state *istate, const char *name, int namelen, int igncase);
> +
> +/*
> + * Searches for an entry defined by name and namelen in the given index.
> + * If the return value is positive (including 0) it is the position of an
> + * exact match. If the return value is negative, the negated value minus 1 is the
> + * position where the entry would be inserted.
> + * Example: In the current index we have the files a,c,d:
> + * index_name_pos(&index, "a", 1) ->  0
> + * index_name_pos(&index, "b", 1) -> -1
> + * index_name_pos(&index, "c", 1) ->  1
> + * index_name_pos(&index, "d", 1) ->  2
> + */
>  extern int index_name_pos(const struct index_state *, const char *name, int namelen);
> +
>  #define ADD_CACHE_OK_TO_ADD 1		/* Ok to add */
>  #define ADD_CACHE_OK_TO_REPLACE 2	/* Ok to replace file/directory */
>  #define ADD_CACHE_SKIP_DFCHECK 4	/* Ok to skip DF conflict checks */

^ permalink raw reply

* Re: [PATCH 0/2] Fix remote_is_configured()
From: Jeff King @ 2017-01-17 21:45 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Junio C Hamano, Thomas Gummerer, Andrew Arnott
In-Reply-To: <cover.1484687919.git.johannes.schindelin@gmx.de>

On Tue, Jan 17, 2017 at 10:18:58PM +0100, Johannes Schindelin wrote:

> A surprising behavior triggered the bug report in
> https://github.com/git-for-windows/git/issues/888: the mere existence of
> the config setting "remote.origin.prune" (in this instance, configured
> via ~/.gitconfig so that it applies to all repositories) fooled `git
> remote rename <source> <target>` into believing that the <target> remote
> is already there.
> 
> This patch pair demonstrates the problem, and then fixes it (along with
> potential similar problems, such as setting an HTTP proxy for remotes of
> a given name via ~/.gitconfig).

I thought it was intentional that any config "created" the remote, even
without a url field. E.g., you can set:

  [remote "https://example.com/foo/git"]
  proxy = localhost:1234

and then a bare "git fetch https://example.com/foo/git" will respect it.

I admit that "prune" is probably useless without a fetch refspec,
though.

Note that I don't disagree that the rule "it's not a remote unless it
has a url" would be a lot saner. But I have a feeling you may be
breaking some existing setups.

I grepped around in the list but I couldn't find any relevant
discussion. So maybe I just dreamed it.

-Peff

^ permalink raw reply

* Re: [PATCH 2/2] Be more careful when determining whether a remote was configured
From: Jeff King @ 2017-01-17 21:47 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Junio C Hamano, Thomas Gummerer, Andrew Arnott
In-Reply-To: <41c347f22c80e96c54db34baa739b6e37e268b61.1484687919.git.johannes.schindelin@gmx.de>

On Tue, Jan 17, 2017 at 10:19:24PM +0100, Johannes Schindelin wrote:

> One of the really nice features of the ~/.gitconfig file is that users
> can override defaults by their own preferred settings for all of their
> repositories.
> 
> One such default that some users like to override is whether the
> "origin" remote gets auto-pruned or not. The user would simply call
> 
> 	git config --global remote.origin.prune true
> 
> and from now on all "origin" remotes would be pruned automatically when
> fetching into the local repository.
> 
> There is just one catch: now Git thinks that the "origin" remote is
> configured, as it does not discern between having a remote whose
> fetch (and/or push) URL and refspec is set, and merely having
> preemptively-configured, global flags for specific remotes.
> 
> Let's fix this by telling Git that a remote is not configured unless any
> fetch/push URL or refspect is configured explicitly.

Hmm. Old versions of GitHub for Windows used to set fetch refspecs in
the system gitconfig, for a similar purpose to what you want to do with
remote.origin.prune.

I notice here that setting a refspec _does_ define a remote. Is there a
reason you drew the line there, and not at, say, whether it has a URL?

-Peff

^ permalink raw reply

* Re: [PATCH v5 3/3] Retire the scripted difftool
From: Junio C Hamano @ 2017-01-17 21:46 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, David Aguilar, Dennis Kaarsemaker, Paul Sbarra
In-Reply-To: <8238bba389c031b091a37396fed43cac94d944e7.1484668473.git.johannes.schindelin@gmx.de>

Johannes Schindelin <johannes.schindelin@gmx.de> writes:

> It served its purpose, but now we have a builtin difftool. Time for the
> Perl script to enjoy Florida.
>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---

The endgame makes a lot of sense.  Both in the cover letter and in
the previous patch you talk about having both in the released
version, so do you want this step to proceed slower than the other
two?  I.e. merge all three to 'next' but graduate only the first two
to 'master' and after a while make this last step graduate?

^ permalink raw reply

* Re: [PATCH] document index_name_pos
From: Stefan Beller @ 2017-01-17 21:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git@vger.kernel.org
In-Reply-To: <xmqqo9z5cqh8.fsf@gitster.mtv.corp.google.com>

On Tue, Jan 17, 2017 at 1:43 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Stefan Beller <sbeller@google.com> writes:
>
>> Signed-off-by: Stefan Beller <sbeller@google.com>
>> ---
>>
>>> These placeholders are meant to encourage those people who dove into
>>> the code to update it, so from that point of view, I think removing
>>> it is backwards.
>>
>> Yes, I am currently understanding and writing up documentation for
>> index_name_pos. If I recall the latest discussion where we want to have
>> documentation, I think a quorum favored documentation in the header itself,
>> c.f. strbuf.h, string-list.h for the most desired state. (Although we do have
>> Documentation/technical/api-string-list.txt as well ...)
>>
>> So maybe starting like this?
>
> That is very good.  Let's drop that file from Documentation/technical
> and do it like this (meaning, take both patches from you).
>
> Thanks.

This patch is incorrect, as we need to do s/b -> -1/b -> -2/.
(Wrong documentation is the worst)

Also we'd probably want to see more of the functions documented.
I'll see if I can extend this into a series documenting more functions.

Thanks,
Stefan

^ permalink raw reply

* RE: [RFC] Add support for downloading blobs on demand
From: Ben Peart @ 2017-01-17 21:50 UTC (permalink / raw)
  To: 'Shawn Pearce'; +Cc: 'git', benpeart
In-Reply-To: <CAJo=hJumYXTRN_B3iZdmcpomp7wJ+UPcikxGb6rn9W=uJeYmfw@mail.gmail.com>

Thanks for the encouragement, support, and good ideas to look into.

Ben

> -----Original Message-----
> From: Shawn Pearce [mailto:spearce@spearce.org]
> Sent: Friday, January 13, 2017 4:07 PM
> To: Ben Peart <peartben@gmail.com>
> Cc: git <git@vger.kernel.org>; benpeart@microsoft.com
> Subject: Re: [RFC] Add support for downloading blobs on demand
> 
> On Fri, Jan 13, 2017 at 7:52 AM, Ben Peart <peartben@gmail.com> wrote:
> >
> > Goal
> > ~~~~
> >
> > To be able to better handle repos with many files that any individual
> > developer doesn’t need it would be nice if clone/fetch only brought
> > down those files that were actually needed.
> >
> > To enable that, we are proposing adding a flag to clone/fetch that
> > will instruct the server to limit the objects it sends to commits and
> > trees and to not send any blobs.
> >
> > When git performs an operation that requires a blob that isn’t
> > currently available locally, it will download the missing blob and add
> > it to the local object store.
> 
> Interesting. This is also an area I want to work on with my team at $DAY_JOB.
> Repositories are growing along multiple dimensions, and developers or
> editors don't always need all blobs for all time available locally to successfully
> perform their work.
> 
> > Design
> > ~~~~~~
> >
> > Clone and fetch will pass a “--lazy-clone” flag (open to a better name
> > here) similar to “--depth” that instructs the server to only return
> > commits and trees and to ignore blobs.
> 
> My group at $DAY_JOB hasn't talked about it yet, but I want to add a
> protocol capability that lets clone/fetch ask only for blobs smaller than a
> specified byte count. This could be set to a reasonable text file size (e.g. <= 5
> MiB) to predominately download only source files and text documentation,
> omitting larger binaries.
> 
> If the limit was set to 0, its the same as your idea to ignore all blobs.
> 

This is an interesting idea that may be an easier way to help mitigate 
the cost of very large files.  While our primary issue today is the 
sheer number of files, I'm sure at some point we'll run into issues with 
file size as well.  

> > Later during git operations like checkout, when a blob cannot be found
> > after checking all the regular places (loose, pack, alternates, etc),
> > git will download the missing object and place it into the local
> > object store (currently as a loose object) then resume the operation.
> 
> Right. I'd like to have this object retrieval be inside the native Git wire
> protocol, reusing the remote configuration and authentication setup. That
> requires expanding the server side of the protocol implementation slightly
> allowing any reachable object to be retrieved by SHA-1 alone. Bitmap indexes
> can significantly reduce the computational complexity for the server.
> 

Agree.  

> > To prevent git from accidentally downloading all missing blobs, some
> > git operations are updated to be aware of the potential for missing blobs.
> > The most obvious being check_connected which will return success as if
> > everything in the requested commits is available locally.
> 
> This ... sounds risky for the developer, as the repository may be corrupt due
> to a missing object, and the user cannot determine it.
> 
> Would it be reasonable for the server to return a list of SHA-1s it knows
> should exist, but has omitted due to the blob threshold (above), and the
> local repository store this in a binary searchable file?
> During connectivity checking its assumed OK if an object is not present in the
> object store, but is listed in this omitted objects file.
> 

Corrupt repos due to missing blobs must be pretty rare as I've never 
seen anyone report that error but for this and other reasons (see Peff's 
suggestion on how to minimize downloading unnecessary blobs) having this 
data could be valuable.  I'll add it to the list of things to look into.

> > To minimize the impact on the server, the existing dumb HTTP protocol
> > endpoint “objects/<sha>” can be used to retrieve the individual
> > missing blobs when needed.
> 
> I'd prefer this to be in the native wire protocol, where the objects are in pack
> format (which unfortunately differs from loose format). I assume servers
> would combine many objects into pack files, potentially isolating large
> uncompressable binaries into their own packs, stored separately from
> commits/trees/small-text-blobs.
> 
> I get the value of this being in HTTP, where HTTP caching inside proxies can
> be leveraged to reduce master server load. I wonder if the native wire
> protocol could be taught to use a variation of an HTTP GET that includes the
> object SHA-1 in the URL line, to retrieve a one-object pack file.
> 

You make a good point. I don't think the benefit of hitting this 
"existing" end point outweighs the many drawbacks.  Adding the ability 
to retrieve an individual blob via the native wire protocol seems a 
better plan.

> > Performance considerations
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > We found that downloading commits and trees on demand had a
> > significant negative performance impact.  In addition, many git
> > commands assume all commits and trees are available locally so they
> > quickly got pulled down anyway.  Even in very large repos the commits
> > and trees are relatively small so bringing them down with the initial
> > commit and subsequent fetch commands was reasonable.
> >
> > After cloning, the developer can use sparse-checkout to limit the set
> > of files to the subset they need (typically only 1-10% in these large
> > repos).  This allows the initial checkout to only download the set of
> > files actually needed to complete their task.  At any point, the
> > sparse-checkout file can be updated to include additional files which
> > will be fetched transparently on demand.
> >
> > Typical source files are relatively small so the overhead of
> > connecting and authenticating to the server for a single file at a
> > time is substantial.  As a result, having a long running process that
> > is started with the first request and can cache connection information
> > between requests is a significant performance win.
> 
> Junio and I talked years ago (offline, sorry no mailing list archive) about
> "narrow checkout", which is the idea of the client being able to ask for a pack
> file from the server that only includes objects along specific path names. This
> would allow a client to amortize the setup costs, and even delta compress
> source files against each other (e.g.
> boilerplate across Makefiles or license headers).
> 
> If the paths of interest can be determined as a batch before starting the
> connection, this may be easier than maintaining a cross platform connection
> cache in a separate process.
> 

We looked into sparse/narrow-clone but for a variety of reasons it 
didn't work well for our usage patterns (see my response to Peff's 
feedback for more details).

> > Now some numbers
> > ~~~~~~~~~~~~~~~~
> >
> > One repo has 3+ million files at tip across 500K folders with 5-6K
> > active developers.  They have done a lot of work to remove large files
> > from the repo so it is down to < 100GB.
> >
> > Before changes: clone took hours to transfer the 87GB .pack + 119MB
> > .idx
> >
> > After changes: clone took 4 minutes to transfer 305MB .pack + 37MB
> > .idx
> >
> > After hydrating 35K files (the typical number any individual developer
> > needs to do their work), there was an additional 460 MB of loose files
> > downloaded.
> >
> > Total savings: 86.24 GB * 6000 developers = 517 Terabytes saved!
> >
> > We have another repo (3.1 M files, 618 GB at tip with no history with
> > 3K+ active developers) where the savings are even greater.
> 
> This is quite impressive, and shows this strategy has a lot of promise.
> 
> 
> > Future Work
> > ~~~~~~~~~~~
> >
> > The current prototype calls a new hook proc in
> > sha1_object_info_extended and read_object, to download each missing
> > blob.  A better solution would be to implement this via a long running
> > process that is spawned on the first download and listens for requests
> > to download additional objects until it terminates when the parent git
> > operation exits (similar to the recent long running smudge and clean filter
> work).
> 
> Or batching these up in advance. checkout should be able to determine
> which path entries from the index it wants to write to the working tree. Once
> it has that set of paths it wants to write, it should be fast to construct a
> subset of paths for which the blobs are not present locally, and then pass the
> entire group off for download.
> 

Yes, I'm optimistic that we can optimize for the checkout case (which is 
a _very_ common case!).

> > Need to do more investigation into possible code paths that can
> > trigger unnecessary blobs to be downloaded.  For example, we have
> > determined that the rename detection logic in status can also trigger
> > unnecessary blobs to be downloaded making status slow.
> 
> There isn't much of a workaround here. Only options I can see are disabling
> rename detection when objects are above a certain size, or removing entries
> from the rename table when the blob isn't already local, which may yield
> different results than if the blob(s) were local.
> 
> Another is to try to have actual source files always be local, and thus we only
> punt on rename detection for bigger files that are more likely to be binary,
> and thus less likely to match for rename[1] unless it was SHA-1 identity
> match, which can be done without the
> blob(s) present.
> 

While large files can be a real problem, our biggest issue today is 
having a lot (millions!) of source files when any individual developer 
only needs a small percentage of them.  Git with 3+ million local files 
just doesn't perform well.

We'll see what we can come up with here - especially if we had some 
information _about_ the blob, even though we didn't have the blob itself.

> 
> [1] I assume most really big files are some sort of media asset (e.g.
> JPEG), where a change inside the source data may result in large difference in
> bytes due to the compression applied by the media file format.
> 
> > Need to investigate an alternate batching scheme where we can make a
> > single request for a set of "related" blobs and receive single a
> > packfile (especially during checkout).
> 
> Heh, what I just said above. Glad to see you already thought of it.
> 
> > Need to investigate adding a new endpoint in the smart protocol that
> > can download both individual blobs as well as a batch of blobs.
> 
> Agreed, I said as much above. Again, glad to see you have similar ideas. :)


^ permalink raw reply

* RE: [RFC] Add support for downloading blobs on demand
From: Ben Peart @ 2017-01-17 21:50 UTC (permalink / raw)
  To: 'Jeff King', 'Ben Peart'; +Cc: git
In-Reply-To: <20170117184258.sd7h2hkv27w52gzt@sigill.intra.peff.net>

Thanks for the thoughtful response.  No need to appologize for the 
length, it's a tough problem to solve so I don't expect it to be handled 
with a single, short email. :)

> -----Original Message-----
> From: Jeff King [mailto:peff@peff.net]
> Sent: Tuesday, January 17, 2017 1:43 PM
> To: Ben Peart <peartben@gmail.com>
> Cc: git@vger.kernel.org; Ben Peart <Ben.Peart@microsoft.com>
> Subject: Re: [RFC] Add support for downloading blobs on demand
> 
> This is an issue I've thought a lot about. So apologies in advance that this
> response turned out a bit long. :)
> 
> On Fri, Jan 13, 2017 at 10:52:53AM -0500, Ben Peart wrote:
> 
> > Design
> > ~~~~~~
> >
> > Clone and fetch will pass a  --lazy-clone  flag (open to a better name
> > here) similar to  --depth  that instructs the server to only return
> > commits and trees and to ignore blobs.
> >
> > Later during git operations like checkout, when a blob cannot be found
> > after checking all the regular places (loose, pack, alternates, etc),
> > git will download the missing object and place it into the local
> > object store (currently as a loose object) then resume the operation.
> 
> Have you looked at the "external odb" patches I wrote a while ago, and
> which Christian has been trying to resurrect?
> 
>   https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fpublic-
> inbox.org%2Fgit%2F20161130210420.15982-1-
> chriscool%40tuxfamily.org%2F&data=02%7C01%7CBen.Peart%40microsoft.c
> om%7C9596d3bf32564f123e0c08d43f08a9e1%7C72f988bf86f141af91ab2d7c
> d011db47%7C1%7C0%7C636202753822020527&sdata=a6%2BGOAQoRhjFoxS
> vftY8JZAVUssmrXuDZ9OBy3xqNZk%3D&reserved=0
> 
> This is a similar approach, though I pushed the policy for "how do you get the
> objects" out into an external script. One advantage there is that large objects
> could easily be fetched from another source entirely (e.g., S3 or equivalent)
> rather than the repo itself.
> 
> The downside is that it makes things more complicated, because a push or a
> fetch now involves three parties (server, client, and the alternate object
> store). So questions like "do I have all the objects I need" are hard to reason
> about.
> 
> If you assume that there's going to be _some_ central Git repo which has all
> of the objects, you might as well fetch from there (and do it over normal git
> protocols). And that simplifies things a bit, at the cost of being less flexible.
> 

We looked quite a bit at the external odb patches, as well as lfs and 
even using alternates.  They all share a common downside that you must 
maintain a separate service that contains _some_ of the files.  These 
files must also be versioned, replicated, backed up and the service 
itself scaled out to handle the load.  As you mentioned, having multiple 
services involved increases flexability but it also increases the 
complexity and decreases the reliability of the overall version control 
service.  

For operational simplicity, we opted to go with a design that uses a 
single, central git repo which has _all_ the objects and to focus on 
enhancing it to handle large numbers of files efficiently.  This allows 
us to focus our efforts on a great git service and to avoid having to 
build out these other services.

> > To prevent git from accidentally downloading all missing blobs, some
> > git operations are updated to be aware of the potential for missing blobs.
> > The most obvious being check_connected which will return success as if
> > everything in the requested commits is available locally.
> 
> Actually, Git is pretty good about trying not to access blobs when it doesn't
> need to. The important thing is that you know enough about the blobs to
> fulfill has_sha1_file() and sha1_object_info() requests without actually
> fetching the data.
> 
> So the client definitely needs to have some list of which objects exist, and
> which it _could_ get if it needed to.
> 
> The one place you'd probably want to tweak things is in the diff code, as a
> single "git log -Sfoo" would fault in all of the blobs.
> 

It is an interesting idea to explore how we could be smarter about 
preventing blobs from faulting in if we had enough info to fulfill 
has_sha1_file() and sha1_object_info().  Given we also heavily prune the 
working directory using sparse-checkout, this hasn't been our top focus 
but it is certainly something worth looking into.

> > To minimize the impact on the server, the existing dumb HTTP protocol
> > endpoint  objects/<sha>  can be used to retrieve the individual
> > missing blobs when needed.
> 
> This is going to behave badly on well-packed repositories, because there isn't
> a good way to fetch a single object. The best case (which is not implemented
> at all in Git) is that you grab the pack .idx, then grab "slices" of the pack
> corresponding to specific objects, including hunting down delta bases.
> 
> But then next time the server repacks, you have to throw away your .idx file.
> And those can be big. The .idx for linux.git is 135MB. You really wouldn't	
> want to do an incremental fetch of 1MB worth of objects and have to grab
> the whole .idx just to figure out which bytes you needed.
> 
> You can solve this by replacing the dumb-http server with a smart one that
> actually serves up the individual objects as if they were truly sitting on the
> filesystem. But then you haven't really minimized impact on the server, and
> you might as well teach the smart protocols to do blob fetches.
> 

Yea, we actually implemented a new endpoint that we are using to fetch 
individual blobs, I just found the dumb endpoint recently and thought 
"hey, maybe we can us this to make it easier for other git servers."  
For a number of good reasons, I don't think this is the right approach.

> 
> One big hurdle to this approach, no matter the protocol, is how you are
> going to handle deltas. Right now, a git client tells the server "I have this
> commit, but I want this other one". And the server knows which objects the
> client has from the first, and which it needs from the second. Moreover, it
> knows that it can send objects in delta form directly from disk if the other
> side has the delta base.
> 
> So what happens in this system? We know we don't need to send any blobs
> in a regular fetch, because the whole idea is that we only send blobs on
> demand. So we wait for the client to ask us for blob A. But then what do we
> send? If we send the whole blob without deltas, we're going to waste a lot of
> bandwidth.
> 
> The on-disk size of all of the blobs in linux.git is ~500MB. The actual data size
> is ~48GB. Some of that is from zlib, which you get even for non-deltas. But
> the rest of it is from the delta compression. I don't think it's feasible to give
> that up, at least not for "normal" source repos like linux.git (more on that in
> a minute).
> 
> So ideally you do want to send deltas. But how do you know which objects
> the other side already has, which you can use as a delta base? Sending the
> list of "here are the blobs I have" doesn't scale. Just the sha1s start to add
> up, especially when you are doing incremental fetches.
> 
> I think this sort of things performs a lot better when you just focus on large
> objects. Because they don't tend to delta well anyway, and the savings are
> much bigger by avoiding ones you don't want. So a directive like "don't
> bother sending blobs larger than 1MB" avoids a lot of these issues. In other
> words, you have some quick shorthand to communicate between the client
> and server: this what I have, and what I don't.
> Normal git relies on commit reachability for that, but there are obviously
> other dimensions. The key thing is that both sides be able to express the
> filters succinctly, and apply them efficiently.
> 

Our challenge has been more the sheer _number_ of files that exist in 
the repo rather than the _size_ of the files in the repo.  With >3M 
source files and any typical developer only needing a small percentage 
of those files to do their job, our focus has been pruning the tree as 
much as possible such that they only pay the cost for the files they 
actually need.  With typical text source files being 10K - 20K in size, 
the overhead of the round trip is a significant part of the overall 
transfer time so deltas don't help as much.  I agree that large files 
are also a problem but it isn't my top focus at this point in time.  

> > After cloning, the developer can use sparse-checkout to limit the set
> > of files to the subset they need (typically only 1-10% in these large
> > repos).  This allows the initial checkout to only download the set of
> > files actually needed to complete their task.  At any point, the
> > sparse-checkout file can be updated to include additional files which
> > will be fetched transparently on demand.
> 
> If most of your benefits are not from avoiding blobs in general, but rather
> just from sparsely populating the tree, then it sounds like sparse clone might
> be an easier path forward. The general idea is to restrict not just the
> checkout, but the actual object transfer and reachability (in the tree
> dimension, the way shallow clone limits it in the time dimension, which will
> require cooperation between the client and server).
> 
> So that's another dimension of filtering, which should be expressed pretty
> succinctly: "I'm interested in these paths, and not these other ones." It's
> pretty easy to compute on the server side during graph traversal (though it
> interacts badly with reachability bitmaps, so there would need to be some
> hacks there).
> 
> It's an idea that's been talked about many times, but I don't recall that there
> were ever working patches. You might dig around in the list archive under
> the name "sparse clone" or possibly "narrow clone".

While a sparse/narrow clone would work with this proposal, it isn't 
required.  You'd still probably want all the commits and trees but the 
clone would also bring down the specified blobs.  Combined with using 
"depth" you could further limit it to those blobs at tip. 

We did run into problems with this model however as our usage patterns 
are such that our working directories often contain very sparse trees 
and as a result, we can end up with thousands of entries in the sparse 
checkout file.  This makes it difficult for users to manually specify a 
sparse-checkout before they even do a clone.  We have implemented a 
hashmap based sparse-checkout to deal with the performance issues of 
having that many entries but that's a different RFC/PATCH.  In short, we 
found that a "lazy-clone" and downloading blobs on demand provided a 
better developer experience.

> 
> > Now some numbers
> > ~~~~~~~~~~~~~~~~
> >
> > One repo has 3+ million files at tip across 500K folders with 5-6K
> > active developers.  They have done a lot of work to remove large files
> > from the repo so it is down to < 100GB.
> >
> > Before changes: clone took hours to transfer the 87GB .pack + 119MB
> > .idx
> >
> > After changes: clone took 4 minutes to transfer 305MB .pack + 37MB
> > .idx
> >
> > After hydrating 35K files (the typical number any individual developer
> > needs to do their work), there was an additional 460 MB of loose files
> > downloaded.
> 
> It sounds like you have a case where the repository has a lot of large files
> that are either historical, or uninteresting the sparse-tree dimension.
> 
> How big is that 460MB if it were actually packed with deltas?
> 

Uninteresting in the sparse-tree dimension.  460 MB divided by 35K files 
is less than 13 KB per file which is fairly typical for source code.  
Given there are no versions to calculate deltas from, compressing them 
into a pack file would help some but I don't have the numbers as to how 
much.  When we get to the "future work" below and start batching up 
requests, we'll have better data on that.

> > Future Work
> > ~~~~~~~~~~~
> >
> > The current prototype calls a new hook proc in
> > sha1_object_info_extended and read_object, to download each missing
> > blob.  A better solution would be to implement this via a long running
> > process that is spawned on the first download and listens for requests
> > to download additional objects until it terminates when the parent git
> > operation exits (similar to the recent long running smudge and clean filter
> work).
> 
> Yeah, see the external-odb discussion. Those prototypes use a process per
> object, but I think we all agree after seeing how the git-lfs interface has
> scaled that this is a non-starter. Recent versions of git-lfs do the single-
> process thing, and I think any sort of external-odb hook should be modeled
> on that protocol.
> 

I'm looking into this now and plan to re-implement it this way before 
sending out the first patch series.  Glad to hear you think it is a good 
protocol to model it on.

> > Need to investigate an alternate batching scheme where we can make a
> > single request for a set of "related" blobs and receive single a
> > packfile (especially during checkout).
> 
> I think this sort of batching is going to be the really hard part to retrofit onto
> git. Because you're throwing out the procedural notion that you can loop
> over a set of objects and ask for each individually.
> You have to start deferring computation until answers are ready. Some
> operations can do that reasonably well (e.g., checkout), but something like
> "git log -p" is constantly digging down into history. I suppose you could just
> perform the skeleton of the operation _twice_, once to find the list of objects
> to fault in, and the second time to actually do it.
> 
> That will make git feel a lot slower, because a lot of the illusion of speed is
> the way it streams out results. OTOH, if you have to wait to fault in objects
> from the network, it's going to feel pretty slow anyway. :)
> 

The good news is that for most operations, git doesn't need to access to 
all the blobs.  You're right, any command that does ends up faulting in 
a bunch of blobs from the network can get pretty slow.  Sometimes you 
get streaming results and sometimes it just "hangs" while we go off 
downloading blobs in the background.  We capture telemetry to detect 
these types of issues but typically the users are more than happy to 
send us an "I just ran command 'foo' and it hung" email. :)

> -Peff


^ permalink raw reply

* Re: [RFC] Add support for downloading blobs on demand
From: Martin Fick @ 2017-01-17 22:05 UTC (permalink / raw)
  To: Ben Peart; +Cc: 'Shawn Pearce', 'git', benpeart
In-Reply-To: <002501d2710b$af74c4d0$0e5e4e70$@gmail.com>

On Tuesday, January 17, 2017 04:50:13 PM Ben Peart wrote:
> While large files can be a real problem, our biggest issue
> today is having a lot (millions!) of source files when
> any individual developer only needs a small percentage of
> them.  Git with 3+ million local files just doesn't
> perform well.

Honestly, this sounds like a problem better dealt with by 
using git subtree or git submodules, have you considered 
that?

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply

* Re: [PATCH 2/2] Be more careful when determining whether a remote was configured
From: Junio C Hamano @ 2017-01-17 22:19 UTC (permalink / raw)
  To: Jeff King; +Cc: Johannes Schindelin, git, Thomas Gummerer, Andrew Arnott
In-Reply-To: <20170117214723.p5rni6wwggei366j@sigill.intra.peff.net>

Jeff King <peff@peff.net> writes:

>> Let's fix this by telling Git that a remote is not configured unless any
>> fetch/push URL or refspec is configured explicitly.
>
> I notice here that setting a refspec _does_ define a remote. Is there a
> reason you drew the line there, and not at, say, whether it has a URL?

"Not configured unless any URL or refspec is configured" means that
if URL is there, even if there is no refspec, then it is a remote
definition, right?  

But I think "what does it mean to define a remote" is a question
that you are asking, and that is not necessary to answer within the
scope of this topic.

I do agree that honoring .prune is nonsense unless refspecs are
defined, but the question "does it make sense to allow prune?"  is
different from "is it configured?".  Your "you can set .proxy and
other useful things, it is just .prune does not make sense" is a
quite appropriate statement to illustrate the difference.

Perhaps instead of adding "is it configured?" flag that is too
broadly named and has too narrow meaning, would it make more sense
to introduce "int can_prune(struct remote *remote)" that looks at
the remote refspecs?

^ permalink raw reply

* Re: [RFC] Add support for downloading blobs on demand
From: Stefan Beller @ 2017-01-17 22:23 UTC (permalink / raw)
  To: Martin Fick; +Cc: Ben Peart, Shawn Pearce, git, benpeart
In-Reply-To: <2381666.1DSVtKRIH5@mfick1-lnx>

On Tue, Jan 17, 2017 at 2:05 PM, Martin Fick <mfick@codeaurora.org> wrote:
> On Tuesday, January 17, 2017 04:50:13 PM Ben Peart wrote:
>> While large files can be a real problem, our biggest issue
>> today is having a lot (millions!) of source files when
>> any individual developer only needs a small percentage of
>> them.  Git with 3+ million local files just doesn't
>> perform well.
>
> Honestly, this sounds like a problem better dealt with by
> using git subtree or git submodules, have you considered
> that?
>
> -Martin
>

I cannot speak for subtrees as I have very little knowledge on them.
But there you also have the problem that *someone* has to have a
whole tree? (e.g. the build bot)

submodules however comes with a couple of things attached, both
positive as well as negative points:

* it offers ACLs along the way. ($user may not be allowed to
  clone all submodules, but only those related to the work)
* The conceptual understanding of git just got a lot harder.
  ("Yo dawg, I heard you like git, so I put git repos inside
  other git repos"), it is not easy to come up with reasonable
  defaults for all usecases, so the everyday user still has to
  have some understanding of submodules.
* typical cheap in-tree operations may become very expensive:
  -> moving a file from one location to another (in another
     submodule) adds overhead, no rename detection.
* We are actively working on submodules, so there is
  some momentum going already.
* our experiments with Android show that e.g. fetching (even
  if you have all of Android) becomes a lot faster for everyday
  usage as only a few repositories change each day). This
  comparision was against the repo tool, that we currently
  use for Android. I do not know how it would compare against
  single repo Git, as having such a large repository seemed
  complicated.
* the support for submodules in Git is already there, though
  not polished. The positive side is to have already a good base,
  the negative side is to have support current use cases.

Stefan

^ permalink raw reply

* Re: [PATCH 5/5] describe: teach describe negative pattern matches
From: Jacob Keller @ 2017-01-17 23:31 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Jacob Keller, Git mailing list
In-Reply-To: <97d4105c-0fa6-41c5-e456-30cebd93dc36@kdbg.org>

On Fri, Jan 13, 2017 at 1:31 PM, Johannes Sixt <j6t@kdbg.org> wrote:
> Am 13.01.2017 um 07:57 schrieb Jacob Keller:
>>
>> On Thu, Jan 12, 2017 at 10:43 PM, Johannes Sixt <j6t@kdbg.org> wrote:
>>>
>>>  When you write
>>>
>>>   git log --branches --exclude=origin/* --remotes
>>>
>>> --exclude=origin/* applies only to --remotes, but not to --branches.
>>
>>
>> Well for describe I don't think the order matters.
>
>
> That is certainly true today. But I would value consistency more. We would
> lose it if some time in the future 'describe' accepts --branches and
> --remotes in addition to --tags and --all.
>
> -- Hannes
>

I am not sure that the interface for git-log and git-describe are
similar enough to make this distinction work. --match already seems to
imply that it only works on refs in refs/tags, as it says it considers
globs matching excluding the "refs/tags" prefix.

In git-describe, we already have "--tags" and "--all" but they are
mutually exclusive. We don't support using more than one at once, and
I'm not really convinced that describe will ever support more than one
at a time. Additionally, match already doesn't respect order.

Thanks,
Jake

^ permalink raw reply

* Re: [RFH - Tcl/Tk] use of procedure before declaration?
From: Philip Oakley @ 2017-01-17 23:32 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Git List, Pat Thoyts
In-Reply-To: <alpine.DEB.2.20.1701171218260.3469@virtualbox>

From: "Johannes Schindelin" <Johannes.Schindelin@gmx.de>
> Hi Philip,
>
> On Mon, 16 Jan 2017, Philip Oakley wrote:
>
>> In
>> https://github.com/git/git/blob/master/git-gui/lib/choose_repository.tcl#L242
>> the procedure `_unset_recentrepo` is called, however the procedure isn't
>> declared until line 248. My reading of the various Tcl tutorials suggest
>> (but not explictly) that this isn't the right way.
>
> Indeed, calling a procedure before it is declared sounds incorrect.
>
> Since documentation can be treacherous, let's just test it. With a `tclsh`
> whose `$tcl_version` variable claims that this is version 8.6, this
> script:
>
> ```tcl
> hello Philip
>
> proc hello {arg} {
>        puts "Hi, $arg"
> }
> ```
>
> ... yields the error message:
>
> invalid command name "hello"
>     while executing
> "hello Philip"
>
> ... while this script:
>
> ```tcl
> proc hello {arg} {
>        puts "Hi, $arg"
> }
>
> hello Philip
> ```
>
> ... prints the expected "Hi, Philip".
>
> Having said that, in the code to which you linked, the procedure is not
> actually called before it is declared, as the call is inside another
> procedure.
>
> Indeed, the entire file declares one object-oriented class, so no code
> gets executed in that file:
>
> https://github.com/git/git/blob/d7dffce1c/git-gui/lib/choose_repository.tcl#L4
>
> (I guess proper indentation would make it easier to understand that this
> file is defining a class, not executing anything yet).
>
> And it is perfectly legitimate to use not-yet-declared procedures in other
> procedures, otherwise recursion would not work.
>
>> Should 3c6a287 ("git-gui: Keep repo_config(gui.recentrepos) and 
>> .gitconfig
>> in sync", 2010-01-23) have declared `proc _unset_recentrepo {p}` before
>> `proc _get_recentrepos {}` ?
>
> Given the findings above, I believe that the patch is actually correct.
>
> Ciao,
> Dscho
>
Thanks for the clarification. I'll update the old patch series and see if we 
can get this fixed.

Philip 


^ permalink raw reply

* [PATCH 0/4] start documenting core functions
From: Stefan Beller @ 2017-01-17 23:34 UTC (permalink / raw)
  To: gitster; +Cc: git, Stefan Beller

The two single patches[1] are turned into a series here.

[1] https://public-inbox.org/git/20170117200147.25425-1-sbeller@google.com/

Thanks,
Stefan

Stefan Beller (4):
  document index_name_pos
  remove_index_entry_at: move documentation to cache.h
  document add_[file_]to_index
  documentation: retire unfinished documentation

 Documentation/technical/api-in-core-index.txt | 21 ----------------
 cache.h                                       | 35 +++++++++++++++++++++++----
 read-cache.c                                  |  1 -
 3 files changed, 30 insertions(+), 27 deletions(-)
 delete mode 100644 Documentation/technical/api-in-core-index.txt

-- 
2.11.0.299.g762782ba8a


^ permalink raw reply

* [PATCH 2/4] remove_index_entry_at: move documentation to cache.h
From: Stefan Beller @ 2017-01-17 23:35 UTC (permalink / raw)
  To: gitster; +Cc: git, Stefan Beller
In-Reply-To: <20170117233503.27137-1-sbeller@google.com>

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 cache.h      | 3 +++
 read-cache.c | 1 -
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/cache.h b/cache.h
index 270a0d0ea7..26632065a5 100644
--- a/cache.h
+++ b/cache.h
@@ -599,7 +599,10 @@ extern int index_name_pos(const struct index_state *, const char *name, int name
 #define ADD_CACHE_KEEP_CACHE_TREE 32	/* Do not invalidate cache-tree */
 extern int add_index_entry(struct index_state *, struct cache_entry *ce, int option);
 extern void rename_index_entry_at(struct index_state *, int pos, const char *new_name);
+
+/* Remove entry, return 1 if there are more entries after pos. */
 extern int remove_index_entry_at(struct index_state *, int pos);
+
 extern void remove_marked_cache_entries(struct index_state *istate);
 extern int remove_file_from_index(struct index_state *, const char *path);
 #define ADD_CACHE_VERBOSE 1
diff --git a/read-cache.c b/read-cache.c
index 2eca639cce..63a414cdb5 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -503,7 +503,6 @@ int index_name_pos(const struct index_state *istate, const char *name, int namel
 	return index_name_stage_pos(istate, name, namelen, 0);
 }
 
-/* Remove entry, return true if there are more entries to go.. */
 int remove_index_entry_at(struct index_state *istate, int pos)
 {
 	struct cache_entry *ce = istate->cache[pos];
-- 
2.11.0.299.g762782ba8a


^ permalink raw reply related

* [PATCH 1/4] document index_name_pos
From: Stefan Beller @ 2017-01-17 23:35 UTC (permalink / raw)
  To: gitster; +Cc: git, Stefan Beller
In-Reply-To: <20170117233503.27137-1-sbeller@google.com>

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 cache.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/cache.h b/cache.h
index 1b67f078dd..270a0d0ea7 100644
--- a/cache.h
+++ b/cache.h
@@ -575,7 +575,22 @@ extern int verify_path(const char *path);
 extern int index_dir_exists(struct index_state *istate, const char *name, int namelen);
 extern void adjust_dirname_case(struct index_state *istate, char *name);
 extern struct cache_entry *index_file_exists(struct index_state *istate, const char *name, int namelen, int igncase);
+
+/*
+ * Searches for an entry defined by name and namelen in the given index.
+ * If the return value is positive (including 0) it is the position of an
+ * exact match. If the return value is negative, the negated value minus 1 is the
+ * position where the entry would be inserted.
+ * Example: In the current index we have the files b,d,e:
+ * index_name_pos(&index, "a", 1) -> -1
+ * index_name_pos(&index, "b", 1) ->  0
+ * index_name_pos(&index, "c", 1) -> -2
+ * index_name_pos(&index, "d", 1) ->  1
+ * index_name_pos(&index, "e", 1) ->  2
+ * index_name_pos(&index, "f", 1) -> -3
+ */
 extern int index_name_pos(const struct index_state *, const char *name, int namelen);
+
 #define ADD_CACHE_OK_TO_ADD 1		/* Ok to add */
 #define ADD_CACHE_OK_TO_REPLACE 2	/* Ok to replace file/directory */
 #define ADD_CACHE_SKIP_DFCHECK 4	/* Ok to skip DF conflict checks */
-- 
2.11.0.299.g762782ba8a


^ permalink raw reply related

* [PATCH 3/4] document add_[file_]to_index
From: Stefan Beller @ 2017-01-17 23:35 UTC (permalink / raw)
  To: gitster; +Cc: git, Stefan Beller
In-Reply-To: <20170117233503.27137-1-sbeller@google.com>

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 cache.h | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/cache.h b/cache.h
index 26632065a5..acc639d6e0 100644
--- a/cache.h
+++ b/cache.h
@@ -605,13 +605,20 @@ extern int remove_index_entry_at(struct index_state *, int pos);
 
 extern void remove_marked_cache_entries(struct index_state *istate);
 extern int remove_file_from_index(struct index_state *, const char *path);
-#define ADD_CACHE_VERBOSE 1
-#define ADD_CACHE_PRETEND 2
-#define ADD_CACHE_IGNORE_ERRORS	4
-#define ADD_CACHE_IGNORE_REMOVAL 8
-#define ADD_CACHE_INTENT 16
+
+#define ADD_CACHE_VERBOSE 1		/* verbose */
+#define ADD_CACHE_PRETEND 2 		/* dry run */
+#define ADD_CACHE_IGNORE_ERRORS 4	/* ignore errors */
+#define ADD_CACHE_IGNORE_REMOVAL 8	/* do not remove files from index */
+#define ADD_CACHE_INTENT 16		/* intend to add later; stage empty file */
+/*
+ * Adds the given path the index, respecting the repsitory configuration, e.g.
+ * in case insensitive file systems, the path is normalized.
+ */
 extern int add_to_index(struct index_state *, const char *path, struct stat *, int flags);
+/* stat the file then call add_to_index */
 extern int add_file_to_index(struct index_state *, const char *path, int flags);
+
 extern struct cache_entry *make_cache_entry(unsigned int mode, const unsigned char *sha1, const char *path, int stage, unsigned int refresh_options);
 extern int chmod_index_entry(struct index_state *, struct cache_entry *ce, char flip);
 extern int ce_same_name(const struct cache_entry *a, const struct cache_entry *b);
-- 
2.11.0.299.g762782ba8a


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox