All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: git@vger.kernel.org
Cc: "Junio C Hamano" <gitster@pobox.com>,
	dstolee@microsoft.com, git@jeffhostetler.com, peff@peff.net,
	johannes.schindelin@gmx.de, jrnieder@gmail.com,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: [PATCH 20/20] abbrev: add a core.validateAbbrev setting
Date: Fri,  8 Jun 2018 22:41:36 +0000	[thread overview]
Message-ID: <20180608224136.20220-21-avarab@gmail.com> (raw)
In-Reply-To: <20180608224136.20220-1-avarab@gmail.com>

Operations that need to abbreviate a lot of SHA-1s such as 'git log
--oneline' experience degraded performance when traversing a lot of
packs. See [1] and [2] for some relevant performance numbers.

One way to address this is something like the MIDX to make looking up
the SHA-1s cheaper.

This change adds an alternate method of achieving some of the same
ends (but possibly not all, see [3] and replies to the original thread
at [1]).

Instead of trying really hard to find an unambiguous SHA-1 we can with
core.validateAbbrev=false, and preferably combined with the new
support for relative core.abbrev values (such as +4) unconditionally
print a short SHA-1 without doing any disambiguation check. I.e. it
allows for picking a trade-off between performance, and the odds that
future or remote (or current and local) short SHA-1 will be ambiguous.

With the included performance test my git.git repacked with with `git
repack -A -d --max-pack-size=X` gives the following results against
git.git itself with X=64M:

    Test                                        HEAD~             HEAD
    ------------------------------------------------------------------------------------
    0014.1: git log --oneline --raw --parents   2.53(2.48+0.05)   2.20(2.14+0.05) -13.0%

With one big pack -7.6%, and with 16M packs -23.8%.

1. https://public-inbox.org/git/20180107181459.222909-1-dstolee@microsoft.com/
2. https://public-inbox.org/git/20180607140338.32440-1-dstolee@microsoft.com/
3. https://public-inbox.org/git/87lgbsz61p.fsf@evledraar.gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/config.txt            | 43 ++++++++++++++++++++++++++
 cache.h                             |  1 +
 config.c                            |  7 +++++
 environment.c                       |  1 +
 sha1-name.c                         |  4 +++
 t/perf/p0014-abbrev.sh              | 13 ++++++++
 t/t1512-rev-parse-disambiguation.sh | 47 +++++++++++++++++++++++++++++
 7 files changed, 116 insertions(+)
 create mode 100755 t/perf/p0014-abbrev.sh

diff --git a/Documentation/config.txt b/Documentation/config.txt
index abf07be7b6..df31d1351f 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -925,6 +925,49 @@ means to add or subtract N characters from the SHA-1 that Git would
 otherwise print, this allows for producing more future-proof SHA-1s
 for use within a given project, while adjusting the value for the
 current approximate number of objects.
++
+This is especially useful in combination with the
+`core.validateAbbrev` setting, or to get more future-proof hashes to
+reference in the future in a repository whose number of objects is
+expected to grow.
+
+core.validateAbbrev::
+	If set to false (true by default) don't do any validation when
+	printing abbreviated object names to see if they're really
+	unique. This makes printing objects more performant at the
+	cost of potentially printing object names that aren't unique
+	within the current repository.
++
+When printing abbreviated object names Git needs to look through the
+local object store. This is an `O(log N)` operation assuming all the
+objects are in a single pack file, but `X * O(log N)` given `X` pack
+files, which can get expensive on some larger repositories.
++
+This setting changes that to `O(1)`, but with the trade-off that
+depending on the value of `core.abbrev` we may be printing abbreviated
+hashes that collide. Too see how likely this is, try running:
++
+-----------------------------------------------------------------------------------------------------------
+git log --all --pretty=format:%h --abbrev=4 | perl -nE 'chomp; say length' | sort | uniq -c | sort -nr
+-----------------------------------------------------------------------------------------------------------
++
+This shows how many commits were found at each abbreviation length. On
+linux.git in June 2018 this shows a bit more than 750,000 commits,
+with just 4 needing 11 characters to be fully abbreviated, and the
+default heuristic picks a length of 12.
++
+Even without `core.validateAbbrev=false` the results abbreviation
+already a bit of a probability game. They're guaranteed at the moment
+of generation, but as more objects are added, ambiguities may be
+introduced. Likewise, what's unambiguous for you may not be for
+somebody else you're communicating with, if they have their own clone.
++
+Therefore the default of `core.validateAbbrev=true` may not save you
+in practice if you're sharing the SHA-1 or noting it now to use after
+a `git fetch`. You may be better off setting `core.abbrev` to
+e.g. `+2` to add 2 extra characters to the SHA-1, and possibly combine
+that with `core.validateAbbrev=false` to get a reasonable trade-off
+between safety and performance.
 
 add.ignoreErrors::
 add.ignore-errors (deprecated)::
diff --git a/cache.h b/cache.h
index 0fb4211804..6dc5af9482 100644
--- a/cache.h
+++ b/cache.h
@@ -773,6 +773,7 @@ extern int quote_path_fully;
 extern int has_symlinks;
 extern int minimum_abbrev, default_abbrev;
 extern int default_abbrev_relative;
+extern int validate_abbrev;
 extern int ignore_case;
 extern int assume_unchanged;
 extern int prefer_symlink_refs;
diff --git a/config.c b/config.c
index cd95c6bdfb..fd11e1047d 100644
--- a/config.c
+++ b/config.c
@@ -1146,6 +1146,13 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.validateabbrev")) {
+		if (!value)
+			return config_error_nonbool(var);
+		validate_abbrev =  git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.abbrev")) {
 		if (!value)
 			return config_error_nonbool(var);
diff --git a/environment.c b/environment.c
index 0d48d52fba..4a24d8126b 100644
--- a/environment.c
+++ b/environment.c
@@ -23,6 +23,7 @@ int check_stat = 1;
 int has_symlinks = 1;
 int minimum_abbrev = 4, default_abbrev = -1;
 int default_abbrev_relative = 0;
+int validate_abbrev = 1;
 int ignore_case;
 int assume_unchanged;
 int prefer_symlink_refs;
diff --git a/sha1-name.c b/sha1-name.c
index 75f1bef7d1..57dc782782 100644
--- a/sha1-name.c
+++ b/sha1-name.c
@@ -612,6 +612,10 @@ int find_unique_abbrev_r(char *hex, const struct object_id *oid, int len)
 		}
 		len += dar;
 	}
+	if (!validate_abbrev) {
+		hex[len] = 0;
+		return len;
+	}
 
 	mad.init_len = len;
 	mad.cur_len = len;
diff --git a/t/perf/p0014-abbrev.sh b/t/perf/p0014-abbrev.sh
new file mode 100755
index 0000000000..15c3764265
--- /dev/null
+++ b/t/perf/p0014-abbrev.sh
@@ -0,0 +1,13 @@
+#!/bin/sh
+
+test_description="Tests core.validateAbbrev performance"
+
+. ./perf-lib.sh
+
+test_perf_large_repo
+
+test_perf 'git log --oneline --raw --parents' '
+	git -c core.abbrev=15 -c core.validateAbbrev=false log --oneline --raw --parents >/dev/null
+'
+
+test_done
diff --git a/t/t1512-rev-parse-disambiguation.sh b/t/t1512-rev-parse-disambiguation.sh
index 96fe3754c8..89565af508 100755
--- a/t/t1512-rev-parse-disambiguation.sh
+++ b/t/t1512-rev-parse-disambiguation.sh
@@ -232,6 +232,53 @@ test_expect_success 'more history' '
 
 '
 
+test_expect_success 'ambiguous objects and core.abbrev' '
+	cat >expected <<-\EOF &&
+	00000000006
+	0000000005
+	00000000008
+	000000000004
+	0000000000e
+	EOF
+	git -c core.abbrev=4 log --pretty=tformat:%h >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success 'ambiguous objects and core.validateAbbrev' '
+	git -c core.abbrev=4 -c core.validateabbrev=true log --pretty=tformat:%h >actual &&
+	test_cmp expected actual &&
+
+	cat >expected <<-\EOF &&
+	0000
+	0000
+	0000
+	0000
+	0000
+	EOF
+	git -c core.abbrev=4 -c core.validateabbrev=false log --pretty=tformat:%h >actual &&
+	test_cmp expected actual &&
+
+	cat >expected <<-\EOF &&
+	00000000006
+	0000000005b
+	00000000008
+	000000000004
+	0000000000e
+	EOF
+	git -c core.abbrev=+4 log --pretty=tformat:%h >actual &&
+	test_cmp expected actual &&
+
+	cat >expected <<-\EOF &&
+	00000000006
+	0000000005b
+	00000000008
+	00000000000
+	0000000000e
+	EOF
+	git -c core.abbrev=+4 -c core.validateabbrev=false log --pretty=tformat:%h >actual &&
+	test_cmp expected actual
+'
+
 test_expect_failure 'parse describe name taking advantage of generation' '
 	# ambiguous at the object name level, but there is only one
 	# such commit at generation 0
-- 
2.17.0.290.gded63e768a


  parent reply	other threads:[~2018-06-08 22:42 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-06-08 22:41 [PATCH 00/20] unconditional O(1) SHA-1 abbreviation Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 01/20] t/README: clarify the description of test_line_count Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 02/20] test library: add a test_byte_count Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 03/20] blame doc: explicitly note how --abbrev=40 gives 39 chars Ævar Arnfjörð Bjarmason
2018-06-12 18:10   ` Junio C Hamano
2018-06-08 22:41 ` [PATCH 04/20] abbrev tests: add tests for core.abbrev and --abbrev Ævar Arnfjörð Bjarmason
2018-06-12 18:31   ` Junio C Hamano
2018-06-08 22:41 ` [PATCH 05/20] abbrev tests: test "git-blame" behavior Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 06/20] blame: fix a bug, core.abbrev should work like --abbrev Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 07/20] abbrev tests: test "git branch" behavior Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 08/20] abbrev tests: test for "git-describe" behavior Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 09/20] abbrev tests: test for "git-log" behavior Ævar Arnfjörð Bjarmason
2018-06-09  8:43   ` Martin Ågren
2018-06-09  9:56     ` Ævar Arnfjörð Bjarmason
2018-06-09 13:56       ` Martin Ågren
2018-06-08 22:41 ` [PATCH 10/20] abbrev tests: test for "git-diff" behavior Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 11/20] abbrev tests: test for plumbing behavior Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 12/20] abbrev tests: test for --abbrev and core.abbrev=[+-]N Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 13/20] parse-options-cb.c: convert uses of 40 to GIT_SHA1_HEXSZ Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 14/20] config.c: use braces on multiple conditional arms Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 15/20] parse-options-cb.c: " Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 16/20] abbrev: unify the handling of non-numeric values Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 17/20] abbrev: unify the handling of empty values Ævar Arnfjörð Bjarmason
2018-06-09 14:24   ` Martin Ågren
2018-06-09 14:31     ` Martin Ågren
2018-06-08 22:41 ` [PATCH 18/20] abbrev parsing: use braces on multiple conditional arms Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` [PATCH 19/20] abbrev: support relative abbrev values Ævar Arnfjörð Bjarmason
2018-06-09 15:38   ` Martin Ågren
2018-06-12 19:16   ` Junio C Hamano
2018-06-13 22:22     ` Ævar Arnfjörð Bjarmason
2018-06-13 22:34       ` Junio C Hamano
2018-06-14  7:36         ` Ævar Arnfjörð Bjarmason
2018-06-14 15:50           ` Junio C Hamano
2018-06-14 19:07             ` Ævar Arnfjörð Bjarmason
2018-06-08 22:41 ` Ævar Arnfjörð Bjarmason [this message]
2018-06-09 15:47   ` [PATCH 20/20] abbrev: add a core.validateAbbrev setting Martin Ågren
2018-06-12 18:58     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180608224136.20220-21-avarab@gmail.com \
    --to=avarab@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=johannes.schindelin@gmx.de \
    --cc=jrnieder@gmail.com \
    --cc=peff@peff.net \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.