git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/7] pack-revindex: enable on-disk reverse indexes by default
@ 2023-04-10 22:53 Taylor Blau
  2023-04-10 22:53 ` [PATCH 1/7] pack-write.c: plug a leak in stage_tmp_packfiles() Taylor Blau
                   ` (8 more replies)
  0 siblings, 9 replies; 29+ messages in thread
From: Taylor Blau @ 2023-04-10 22:53 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

This series enables pack reverse-indexes to be written to disk by
default instead of computed on-the-fly in memory.

For repositories with large packs, this can have a significant benefit
for both end-users and Git forges. Extensive performance tests appear in
the fifth and sixth commit, but here are some highlights:

  - The time it takes to generate a pack containing objects for the 10
    most-recent commits in linux.git (starting from 68047c48b228) drops
    from ~540ms to ~240ms, yielding a ~2.22x improvement.

  - The time it takes to compute the on-disk object size of a single
    object drops from ~300ms to 4ms, yielding a ~76.96x improvement.

Reverse indexes are a trade-off between the time they take to be
computed versus the time it takes to access a single record. In-memory
reverse indexes take time to generate proportional to the number of
packed objects in the repository, but have instantaneous lookup time.

On-disk reverse indexes can be "generated" instantly (the time it takes
to generate them is a one-time cost, loading them is instantaneous).
But individual record lookup time is slower, since it involves multiple
disk I/O operations.

In the vast majority of cases, this trade-off favors the on-disk ".rev"
files. But in certain cases, the in-memory variant performs more
favorably. Since these cases are narrow, and performance is machine- and
repository-dependent, this series also introduces a new configuration
option to disable reading ".rev" files in the third commit.

The series is structured as follows:

  - A couple of cleanup patches to plug a leak in stage_tmp_packfiles().
  - Three patches to enable `pack.readReverseIndex`.
  - Another patch to change the default of `pack.writeReverseIndex` from
    "false" to "true".
  - A final patch to enable the test suite to be run in a mode that does
    not use on-disk ".rev" files.

Thanks in advance for your review. I'm really excited to get this in the
hands of users after a couple of years of running this at GitHub (and
being opt-in otherwise).

Taylor Blau (7):
  pack-write.c: plug a leak in stage_tmp_packfiles()
  t5325: mark as leak-free
  pack-revindex: make `load_pack_revindex` take a repository
  pack-revindex: introduce GIT_TEST_REV_INDEX_DIE_ON_DISK
  pack-revindex: introduce `pack.readReverseIndex`
  config: enable `pack.writeReverseIndex` by default
  t: invert `GIT_TEST_WRITE_REV_INDEX`

 Documentation/config/pack.txt     |  8 +++++++-
 builtin/index-pack.c              |  5 +++--
 builtin/pack-objects.c            |  5 +++--
 ci/run-build-and-tests.sh         |  1 -
 pack-bitmap.c                     |  5 +++--
 pack-revindex.c                   | 12 +++++++++---
 pack-revindex.h                   |  6 ++++--
 pack-write.c                      |  2 ++
 packfile.c                        |  2 +-
 repo-settings.c                   |  1 +
 repository.h                      |  1 +
 t/README                          |  2 +-
 t/perf/p5312-pack-bitmaps-revs.sh |  3 +--
 t/t5325-reverse-index.sh          | 16 +++++++++++++++-
 14 files changed, 51 insertions(+), 18 deletions(-)

-- 
2.40.0.323.g9c80379958

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/7] pack-write.c: plug a leak in stage_tmp_packfiles()
  2023-04-10 22:53 [PATCH 0/7] pack-revindex: enable on-disk reverse indexes by default Taylor Blau
@ 2023-04-10 22:53 ` Taylor Blau
  2023-04-11 13:23   ` Derrick Stolee
  2023-04-10 22:53 ` [PATCH 2/7] t5325: mark as leak-free Taylor Blau
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Taylor Blau @ 2023-04-10 22:53 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

The function `stage_tmp_packfiles()` generates a filename to use for
staging the contents of what will become the pack's ".rev" file.

The name is generated in `write_rev_file_order()` (via its caller
`write_rev_file()`) in a string buffer, and the result is returned back
to `stage_tmp_packfiles()` which uses it to rename the temporary file
into place via `rename_tmp_packfiles()`.

That name is not visible outside of `stage_tmp_packfiles()`, so it can
(and should) be `free()`'d at the end of that function. We can't free it
in `rename_tmp_packfile()` since not all of its `source` arguments are
unreachable after calling it.

Instead, simply free() `rev_tmp_name` at the end of
`stage_tmp_packfiles()`.

(Note that the same leak exists for `mtimes_tmp_name`, but we do not
address it in this commit).

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-write.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/pack-write.c b/pack-write.c
index f171405495..f27c1f7f28 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -568,6 +568,8 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 		rename_tmp_packfile(name_buffer, rev_tmp_name, "rev");
 	if (mtimes_tmp_name)
 		rename_tmp_packfile(name_buffer, mtimes_tmp_name, "mtimes");
+
+	free((char *)rev_tmp_name);
 }
 
 void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought)
-- 
2.40.0.323.g9c80379958


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 2/7] t5325: mark as leak-free
  2023-04-10 22:53 [PATCH 0/7] pack-revindex: enable on-disk reverse indexes by default Taylor Blau
  2023-04-10 22:53 ` [PATCH 1/7] pack-write.c: plug a leak in stage_tmp_packfiles() Taylor Blau
@ 2023-04-10 22:53 ` Taylor Blau
  2023-04-10 22:53 ` [PATCH 3/7] pack-revindex: make `load_pack_revindex` take a repository Taylor Blau
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Taylor Blau @ 2023-04-10 22:53 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

This test is leak-free as of the previous commit, so let's mark it as
such to ensure we don't regress and introduce a leak in the future.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t5325-reverse-index.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
index d042d26f2b..48c14d8576 100755
--- a/t/t5325-reverse-index.sh
+++ b/t/t5325-reverse-index.sh
@@ -1,6 +1,8 @@
 #!/bin/sh
 
 test_description='on-disk reverse index'
+
+TEST_PASSES_SANITIZE_LEAK=true
 . ./test-lib.sh
 
 # The below tests want control over the 'pack.writeReverseIndex' setting
-- 
2.40.0.323.g9c80379958


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 3/7] pack-revindex: make `load_pack_revindex` take a repository
  2023-04-10 22:53 [PATCH 0/7] pack-revindex: enable on-disk reverse indexes by default Taylor Blau
  2023-04-10 22:53 ` [PATCH 1/7] pack-write.c: plug a leak in stage_tmp_packfiles() Taylor Blau
  2023-04-10 22:53 ` [PATCH 2/7] t5325: mark as leak-free Taylor Blau
@ 2023-04-10 22:53 ` Taylor Blau
  2023-04-11 13:45   ` Derrick Stolee
  2023-04-10 22:53 ` [PATCH 4/7] pack-revindex: introduce GIT_TEST_REV_INDEX_DIE_ON_DISK Taylor Blau
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Taylor Blau @ 2023-04-10 22:53 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

In a future commit, we will introduce a `pack.readReverseIndex`
configuration, which forces Git to generate the reverse index from
scratch instead of loading it from disk.

In order to avoid reading this configuration value more than once, we'll
use the `repo_settings` struct to lazily load this value.

In order to access the `struct repo_settings`, add a repository argument
to `load_pack_revindex`, and update all callers to pass the correct
instance (in all cases, `the_repository`).

Signed-off-by: Taylor Blauy <me@ttaylorr.com>
---
 pack-bitmap.c   | 5 +++--
 pack-revindex.c | 4 ++--
 pack-revindex.h | 3 ++-
 packfile.c      | 2 +-
 4 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index b2e7d06d60..8e3bb00931 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -477,13 +477,14 @@ static int load_reverse_index(struct bitmap_index *bitmap_git)
 		 * since we will need to make use of them in pack-objects.
 		 */
 		for (i = 0; i < bitmap_git->midx->num_packs; i++) {
-			ret = load_pack_revindex(bitmap_git->midx->packs[i]);
+			ret = load_pack_revindex(the_repository,
+						 bitmap_git->midx->packs[i]);
 			if (ret)
 				return ret;
 		}
 		return 0;
 	}
-	return load_pack_revindex(bitmap_git->pack);
+	return load_pack_revindex(the_repository, bitmap_git->pack);
 }
 
 static int load_bitmap(struct bitmap_index *bitmap_git)
diff --git a/pack-revindex.c b/pack-revindex.c
index 03c7e81f9d..e3d69cc0f7 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -283,7 +283,7 @@ static int load_pack_revindex_from_disk(struct packed_git *p)
 	return ret;
 }
 
-int load_pack_revindex(struct packed_git *p)
+int load_pack_revindex(struct repository *r, struct packed_git *p)
 {
 	if (p->revindex || p->revindex_data)
 		return 0;
@@ -356,7 +356,7 @@ int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
 {
 	unsigned lo, hi;
 
-	if (load_pack_revindex(p) < 0)
+	if (load_pack_revindex(the_repository, p) < 0)
 		return -1;
 
 	lo = 0;
diff --git a/pack-revindex.h b/pack-revindex.h
index 4974e75eb4..3d1969ce8b 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -39,6 +39,7 @@
 
 struct packed_git;
 struct multi_pack_index;
+struct repository;
 
 /*
  * load_pack_revindex populates the revindex's internal data-structures for the
@@ -47,7 +48,7 @@ struct multi_pack_index;
  * If a '.rev' file is present it is mmap'd, and pointers are assigned into it
  * (instead of using the in-memory variant).
  */
-int load_pack_revindex(struct packed_git *p);
+int load_pack_revindex(struct repository *r, struct packed_git *p);
 
 /*
  * load_midx_revindex loads the '.rev' file corresponding to the given
diff --git a/packfile.c b/packfile.c
index b120405ccc..717fd685c9 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2151,7 +2151,7 @@ int for_each_object_in_pack(struct packed_git *p,
 	int r = 0;
 
 	if (flags & FOR_EACH_OBJECT_PACK_ORDER) {
-		if (load_pack_revindex(p))
+		if (load_pack_revindex(the_repository, p))
 			return -1;
 	}
 
-- 
2.40.0.323.g9c80379958


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 4/7] pack-revindex: introduce GIT_TEST_REV_INDEX_DIE_ON_DISK
  2023-04-10 22:53 [PATCH 0/7] pack-revindex: enable on-disk reverse indexes by default Taylor Blau
                   ` (2 preceding siblings ...)
  2023-04-10 22:53 ` [PATCH 3/7] pack-revindex: make `load_pack_revindex` take a repository Taylor Blau
@ 2023-04-10 22:53 ` Taylor Blau
  2023-04-10 22:53 ` [PATCH 5/7] pack-revindex: introduce `pack.readReverseIndex` Taylor Blau
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Taylor Blau @ 2023-04-10 22:53 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

In ec8e7760ac (pack-revindex: ensure that on-disk reverse indexes are
given precedence, 2021-01-25), we introduced
GIT_TEST_REV_INDEX_DIE_IN_MEMORY to abort the process when Git generated
a reverse index from scratch.

ec8e7760ac was about ensuring that Git prefers a .rev file when
available over generating the same information in memory from scratch.

In a subsequent patch, we'll introduce `pack.readReverseIndex`, which
may be used to disable reading ".rev" files when available. In order to
ensure that those files are indeed being ignored, introduce an analogous
option to abort the process when Git reads a ".rev" file from disk.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-revindex.c | 3 +++
 pack-revindex.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/pack-revindex.c b/pack-revindex.c
index e3d69cc0f7..44e1b3fed9 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -205,6 +205,9 @@ static int load_revindex_from_disk(char *revindex_name,
 	size_t revindex_size;
 	struct revindex_header *hdr;
 
+	if (git_env_bool(GIT_TEST_REV_INDEX_DIE_ON_DISK, 0))
+		die("dying as requested by '%s'", GIT_TEST_REV_INDEX_DIE_ON_DISK);
+
 	fd = git_open(revindex_name);
 
 	if (fd < 0) {
diff --git a/pack-revindex.h b/pack-revindex.h
index 3d1969ce8b..ef8afee88b 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -36,6 +36,7 @@
 
 #define GIT_TEST_WRITE_REV_INDEX "GIT_TEST_WRITE_REV_INDEX"
 #define GIT_TEST_REV_INDEX_DIE_IN_MEMORY "GIT_TEST_REV_INDEX_DIE_IN_MEMORY"
+#define GIT_TEST_REV_INDEX_DIE_ON_DISK "GIT_TEST_REV_INDEX_DIE_ON_DISK"
 
 struct packed_git;
 struct multi_pack_index;
-- 
2.40.0.323.g9c80379958


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 5/7] pack-revindex: introduce `pack.readReverseIndex`
  2023-04-10 22:53 [PATCH 0/7] pack-revindex: enable on-disk reverse indexes by default Taylor Blau
                   ` (3 preceding siblings ...)
  2023-04-10 22:53 ` [PATCH 4/7] pack-revindex: introduce GIT_TEST_REV_INDEX_DIE_ON_DISK Taylor Blau
@ 2023-04-10 22:53 ` Taylor Blau
  2023-04-10 22:53 ` [PATCH 6/7] config: enable `pack.writeReverseIndex` by default Taylor Blau
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Taylor Blau @ 2023-04-10 22:53 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

Since 1615c567b8 (Documentation/config/pack.txt: advertise
'pack.writeReverseIndex', 2021-01-25), we have had the
`pack.writeReverseIndex` configuration option, which tells Git whether
or not it is allowed to write a ".rev" file when indexing a pack.

Introduce a complementary configuration knob, `pack.readReverseIndex` to
control whether or not Git will read any ".rev" file(s) that may be
available on disk.

This option is useful for debugging, as well as disabling the effect of
".rev" files in certain instances.

This is useful because of the trade-off[^1] between the time it takes to
generate a reverse index (slow from scratch, fast when reading an
existing ".rev" file), and the time it takes to access a record (the
opposite).

For example, even though it is faster to use the on-disk reverse index
when computing the on-disk size of a packed object, it is slower to
enumerate the same value for all objects.

Here are a couple of examples from linux.git. When computing the above
for a single object, using the on-disk reverse index is significantly
faster:

    $ git rev-parse HEAD >in
    $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --batch-check="%(objectsize:disk)" <in'
    Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" <in
      Time (mean ± σ):     302.5 ms ±  12.5 ms    [User: 258.7 ms, System: 43.6 ms]
      Range (min … max):   291.1 ms … 328.1 ms    10 runs

    Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" <in
      Time (mean ± σ):       3.9 ms ±   0.3 ms    [User: 1.6 ms, System: 2.4 ms]
      Range (min … max):     2.0 ms …   4.4 ms    801 runs

    Summary
      'git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" <in' ran
       77.29 ± 7.14 times faster than 'git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" <in'

, but when instead trying to compute the on-disk object size for all
objects in the repository, using the ".rev" file is a disadvantage over
creating the reverse index from scratch:

    $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --batch-check="%(objectsize:disk)" --batch-all-objects'
    Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" --batch-all-objects
      Time (mean ± σ):      8.258 s ±  0.035 s    [User: 7.949 s, System: 0.308 s]
      Range (min … max):    8.199 s …  8.293 s    10 runs

    Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" --batch-all-objects
      Time (mean ± σ):     16.976 s ±  0.107 s    [User: 16.706 s, System: 0.268 s]
      Range (min … max):   16.839 s … 17.105 s    10 runs

    Summary
      'git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" --batch-all-objects' ran
	2.06 ± 0.02 times faster than 'git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" --batch-all-objects'

Because the equilibrium point between these two is highly machine- and
repository-dependent, allow users to configure whether or not they will
read any ".rev" file(s) with this configuration knob.

[^1]: Generating a reverse index in memory takes O(N) time (where N is
  the number of objects in the repository), since we use a radix sort.
  Reading an entry from an on-disk ".rev" file is slower since each
  operation is bound by disk I/O instead of memory I/O.

  In order to compute the on-disk size of a packed object, we need to
  find the offset of our object, and the adjacent object (the on-disk
  size difference of these two). Finding the first offset requires a
  binary search. Finding the latter involves a single .rev lookup.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/pack.txt |  6 ++++++
 pack-revindex.c               |  5 ++++-
 repo-settings.c               |  1 +
 repository.h                  |  1 +
 t/t5325-reverse-index.sh      | 11 +++++++++++
 5 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index 53093d9996..7db7fed466 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -171,6 +171,12 @@ pack.writeBitmapLookupTable::
 	beneficial in repositories that have relatively large bitmap
 	indexes. Defaults to false.
 
+pack.readReverseIndex::
+	When true, git will read any .rev file(s) that may be available
+	(see: linkgit:gitformat-pack[5]). When false, the reverse index
+	will be generated from scratch and stored in memory. Defaults to
+	true.
+
 pack.writeReverseIndex::
 	When true, git will write a corresponding .rev file (see:
 	linkgit:gitformat-pack[5])
diff --git a/pack-revindex.c b/pack-revindex.c
index 44e1b3fed9..29f5358b25 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -291,7 +291,10 @@ int load_pack_revindex(struct repository *r, struct packed_git *p)
 	if (p->revindex || p->revindex_data)
 		return 0;
 
-	if (!load_pack_revindex_from_disk(p))
+	prepare_repo_settings(r);
+
+	if (r->settings.pack_read_reverse_index &&
+	    !load_pack_revindex_from_disk(p))
 		return 0;
 	else if (!create_pack_revindex_in_memory(p))
 		return 0;
diff --git a/repo-settings.c b/repo-settings.c
index 0a6c0b381f..bdd7640ab0 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -63,6 +63,7 @@ void prepare_repo_settings(struct repository *r)
 	repo_cfg_bool(r, "core.multipackindex", &r->settings.core_multi_pack_index, 1);
 	repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0);
 	repo_cfg_bool(r, "index.skiphash", &r->settings.index_skip_hash, r->settings.index_skip_hash);
+	repo_cfg_bool(r, "pack.readreverseindex", &r->settings.pack_read_reverse_index, 1);
 
 	/*
 	 * The GIT_TEST_MULTI_PACK_INDEX variable is special in that
diff --git a/repository.h b/repository.h
index 15a8afc5fb..ed73e799b6 100644
--- a/repository.h
+++ b/repository.h
@@ -37,6 +37,7 @@ struct repo_settings {
 	int fetch_write_commit_graph;
 	int command_requires_full_index;
 	int sparse_index;
+	int pack_read_reverse_index;
 
 	struct fsmonitor_settings *fsmonitor; /* lazily loaded */
 
diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
index 48c14d8576..66171c1d67 100755
--- a/t/t5325-reverse-index.sh
+++ b/t/t5325-reverse-index.sh
@@ -96,6 +96,17 @@ test_expect_success 'reverse index is not generated when available on disk' '
 		--batch-check="%(objectsize:disk)" <tip
 '
 
+test_expect_success 'reverse index is ignored when pack.readReverseIndex is false' '
+	test_index_pack true &&
+	test_path_is_file $rev &&
+
+	test_config pack.readReverseIndex false &&
+
+	git rev-parse HEAD >tip &&
+	GIT_TEST_REV_INDEX_DIE_ON_DISK=1 git cat-file \
+		--batch-check="%(objectsize:disk)" <tip
+'
+
 test_expect_success 'revindex in-memory vs on-disk' '
 	git init repo &&
 	test_when_finished "rm -fr repo" &&
-- 
2.40.0.323.g9c80379958


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 6/7] config: enable `pack.writeReverseIndex` by default
  2023-04-10 22:53 [PATCH 0/7] pack-revindex: enable on-disk reverse indexes by default Taylor Blau
                   ` (4 preceding siblings ...)
  2023-04-10 22:53 ` [PATCH 5/7] pack-revindex: introduce `pack.readReverseIndex` Taylor Blau
@ 2023-04-10 22:53 ` Taylor Blau
  2023-04-13 16:14   ` Junio C Hamano
  2023-04-10 22:53 ` [PATCH 7/7] t: invert `GIT_TEST_WRITE_REV_INDEX` Taylor Blau
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Taylor Blau @ 2023-04-10 22:53 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

Back in e37d0b8730 (builtin/index-pack.c: write reverse indexes,
2021-01-25), Git learned how to read and write a pack's reverse index
from a file instead of in-memory.

A pack's reverse index is a mapping from pack position (that is, the
order that objects appear together in a ".pack")  to their position in
lexical order (that is, the order that objects are listed in an ".idx"
file).

Reverse indexes are consulted often during pack-objects, as well as
during auxiliary operations that require mapping between pack offsets,
pack order, and index index.

They are useful in GitHub's infrastructure, where we have seen a
dramatic increase in performance when writing ".rev" files[1]. In
particular:

  - an ~80% reduction in the time it takes to serve fetches on a popular
    repository, Homebrew/homebrew-core.

  - a ~60% reduction in the peak memory usage to serve fetches on that
    same repository.

  - a collective savings of ~35% in CPU time across all pack-objects
    invocations serving fetches across all repositories in a single
    datacenter.

Reverse indexes are also beneficial to end-users as well as forges. For
example, the time it takes to generate a pack containing the objects for
the 10 most recent commits in linux.git (representing a typical push) is
significantly faster when on-disk reverse indexes are available:

    $ { git rev-parse HEAD && printf '^' && git rev-parse HEAD~10 } >in
    $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} pack-objects --delta-base-offset --revs --stdout <in >/dev/null'
    Benchmark 1: git.compile -c pack.readReverseIndex=false pack-objects --delta-base-offset --revs --stdout <in >/dev/null
      Time (mean ± σ):     543.0 ms ±  20.3 ms    [User: 616.2 ms, System: 58.8 ms]
      Range (min … max):   521.0 ms … 577.9 ms    10 runs

    Benchmark 2: git.compile -c pack.readReverseIndex=true pack-objects --delta-base-offset --revs --stdout <in >/dev/null
      Time (mean ± σ):     245.0 ms ±  11.4 ms    [User: 335.6 ms, System: 31.3 ms]
      Range (min … max):   226.0 ms … 259.6 ms    13 runs

    Summary
      'git.compile -c pack.readReverseIndex=true pack-objects --delta-base-offset --revs --stdout <in >/dev/null' ran
	2.22 ± 0.13 times faster than 'git.compile -c pack.readReverseIndex=false pack-objects --delta-base-offset --revs --stdout <in >/dev/null'

The same is true of writing a pack containing the objects for the 30
most-recent commits:

    $ { git rev-parse HEAD && printf '^' && git rev-parse HEAD~30 } >in
    $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} pack-objects --delta-base-offset --revs --stdout <in >/dev/null'
    Benchmark 1: git.compile -c pack.readReverseIndex=false pack-objects --delta-base-offset --revs --stdout <in >/dev/null
      Time (mean ± σ):     866.5 ms ±  16.2 ms    [User: 1414.5 ms, System: 97.0 ms]
      Range (min … max):   839.3 ms … 886.9 ms    10 runs

    Benchmark 2: git.compile -c pack.readReverseIndex=true pack-objects --delta-base-offset --revs --stdout <in >/dev/null
      Time (mean ± σ):     581.6 ms ±  10.2 ms    [User: 1181.7 ms, System: 62.6 ms]
      Range (min … max):   567.5 ms … 599.3 ms    10 runs

    Summary
      'git.compile -c pack.readReverseIndex=true pack-objects --delta-base-offset --revs --stdout <in >/dev/null' ran
	1.49 ± 0.04 times faster than 'git.compile -c pack.readReverseIndex=false pack-objects --delta-base-offset --revs --stdout <in >/dev/null'

...and savings on trivial operations like computing the on-disk size of
a single (packed) object are even more dramatic:

    $ git rev-parse HEAD >in
    $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --batch-check="%(objectsize:disk)" <in'
    Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" <in
      Time (mean ± σ):     305.8 ms ±  11.4 ms    [User: 264.2 ms, System: 41.4 ms]
      Range (min … max):   290.3 ms … 331.1 ms    10 runs

    Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" <in
      Time (mean ± σ):       4.0 ms ±   0.3 ms    [User: 1.7 ms, System: 2.3 ms]
      Range (min … max):     1.6 ms …   4.6 ms    1155 runs

    Summary
      'git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" <in' ran
       76.96 ± 6.25 times faster than 'git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" <in'

In the more than two years since e37d0b8730 was merged, Git's
implementation of on-disk reverse indexes has been thoroughly tested,
both from users enabling `pack.writeReverseIndexes`, and from GitHub's
deployment of the feature. The latter has been running without incident
for more than two years.

This patch changes Git's behavior to write on-disk reverse indexes by
default when indexing a pack, which should make the above operations
faster for everybody's Git installation after a repack.

(The previous commit explains some potential drawbacks of using on-disk
reverse indexes in certain limited circumstances, that essentially boil
down to a trade-off between time to generate, and time to access. For
those limited cases, the `pack.readReverseIndex` escape hatch can be
used).

[1]: https://github.blog/2021-04-29-scaling-monorepo-maintenance/#reverse-indexes

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/pack.txt     | 2 +-
 builtin/index-pack.c              | 1 +
 builtin/pack-objects.c            | 1 +
 t/perf/p5312-pack-bitmaps-revs.sh | 3 +--
 t/t5325-reverse-index.sh          | 1 +
 5 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index 7db7fed466..d4c7c9d4e4 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -182,4 +182,4 @@ pack.writeReverseIndex::
 	linkgit:gitformat-pack[5])
 	for each new packfile that it writes in all places except for
 	linkgit:git-fast-import[1] and in the bulk checkin mechanism.
-	Defaults to false.
+	Defaults to true.
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index b17e79cd40..323c063f9d 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1753,6 +1753,7 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	fsck_options.walk = mark_link;
 
 	reset_pack_idx_option(&opts);
+	opts.flags |= WRITE_REV;
 	git_config(git_index_pack_config, &opts);
 	if (prefix && chdir(prefix))
 		die(_("Cannot come back to cwd"));
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 77d88f85b0..dbaa04482f 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4293,6 +4293,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	}
 
 	reset_pack_idx_option(&pack_idx_opts);
+	pack_idx_opts.flags |= WRITE_REV;
 	git_config(git_pack_config, NULL);
 	if (git_env_bool(GIT_TEST_WRITE_REV_INDEX, 0))
 		pack_idx_opts.flags |= WRITE_REV;
diff --git a/t/perf/p5312-pack-bitmaps-revs.sh b/t/perf/p5312-pack-bitmaps-revs.sh
index 0684b690af..ceec60656b 100755
--- a/t/perf/p5312-pack-bitmaps-revs.sh
+++ b/t/perf/p5312-pack-bitmaps-revs.sh
@@ -12,8 +12,7 @@ test_lookup_pack_bitmap () {
 	test_perf_large_repo
 
 	test_expect_success 'setup bitmap config' '
-		git config pack.writebitmaps true &&
-		git config pack.writeReverseIndex true
+		git config pack.writebitmaps true
 	'
 
 	# we need to create the tag up front such that it is covered by the repack and
diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
index 66171c1d67..149dcf5193 100755
--- a/t/t5325-reverse-index.sh
+++ b/t/t5325-reverse-index.sh
@@ -14,6 +14,7 @@ packdir=.git/objects/pack
 test_expect_success 'setup' '
 	test_commit base &&
 
+	test_config pack.writeReverseIndex false &&
 	pack=$(git pack-objects --all $packdir/pack) &&
 	rev=$packdir/pack-$pack.rev &&
 
-- 
2.40.0.323.g9c80379958


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 7/7] t: invert `GIT_TEST_WRITE_REV_INDEX`
  2023-04-10 22:53 [PATCH 0/7] pack-revindex: enable on-disk reverse indexes by default Taylor Blau
                   ` (5 preceding siblings ...)
  2023-04-10 22:53 ` [PATCH 6/7] config: enable `pack.writeReverseIndex` by default Taylor Blau
@ 2023-04-10 22:53 ` Taylor Blau
  2023-04-11 13:51   ` Derrick Stolee
  2023-04-11 13:54 ` [PATCH 0/7] pack-revindex: enable on-disk reverse indexes by default Derrick Stolee
  2023-04-12 22:20 ` [PATCH v2 " Taylor Blau
  8 siblings, 1 reply; 29+ messages in thread
From: Taylor Blau @ 2023-04-10 22:53 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

Back in e8c58f894b (t: support GIT_TEST_WRITE_REV_INDEX, 2021-01-25), we
added a test knob to conditionally enable writing a ".rev" file when
indexing a pack. At the time, this was used to ensure that the test
suite worked even when ".rev" files were written, which served as a
stress-test for the on-disk reverse index implementation.

Now that reading from on-disk ".rev" files is enabled by default, the
test knob `GIT_TEST_WRITE_REV_INDEX` no longer has any meaning.

We could get rid of the option entirely, but there would be no
convenient way to test Git when ".rev" files *aren't* in place.

Instead of getting rid of the option, invert its meaning to instead
disable writing ".rev" files, thereby running the test suite in a mode
where the reverse index is generated from scratch.

This ensures that we are still running and exercising Git's behavior
when forced to generate reverse indexes from scratch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/index-pack.c      | 4 ++--
 builtin/pack-objects.c    | 4 ++--
 ci/run-build-and-tests.sh | 1 -
 pack-revindex.h           | 2 +-
 t/README                  | 2 +-
 t/t5325-reverse-index.sh  | 2 +-
 6 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 323c063f9d..9e36c985cf 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1758,8 +1758,8 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	if (prefix && chdir(prefix))
 		die(_("Cannot come back to cwd"));
 
-	if (git_env_bool(GIT_TEST_WRITE_REV_INDEX, 0))
-		rev_index = 1;
+	if (git_env_bool(GIT_TEST_NO_WRITE_REV_INDEX, 0))
+		rev_index = 0;
 	else
 		rev_index = !!(opts.flags & (WRITE_REV_VERIFY | WRITE_REV));
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index dbaa04482f..1797871ce9 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4295,8 +4295,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	reset_pack_idx_option(&pack_idx_opts);
 	pack_idx_opts.flags |= WRITE_REV;
 	git_config(git_pack_config, NULL);
-	if (git_env_bool(GIT_TEST_WRITE_REV_INDEX, 0))
-		pack_idx_opts.flags |= WRITE_REV;
+	if (git_env_bool(GIT_TEST_NO_WRITE_REV_INDEX, 0))
+		pack_idx_opts.flags &= ~WRITE_REV;
 
 	progress = isatty(2);
 	argc = parse_options(argc, argv, prefix, pack_objects_options,
diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index b098e10f52..e9fbfb6345 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -27,7 +27,6 @@ linux-TEST-vars)
 	export GIT_TEST_MULTI_PACK_INDEX=1
 	export GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=1
 	export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=master
-	export GIT_TEST_WRITE_REV_INDEX=1
 	export GIT_TEST_CHECKOUT_WORKERS=2
 	;;
 linux-clang)
diff --git a/pack-revindex.h b/pack-revindex.h
index ef8afee88b..46e834064e 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -34,7 +34,7 @@
 #define RIDX_SIGNATURE 0x52494458 /* "RIDX" */
 #define RIDX_VERSION 1
 
-#define GIT_TEST_WRITE_REV_INDEX "GIT_TEST_WRITE_REV_INDEX"
+#define GIT_TEST_NO_WRITE_REV_INDEX "GIT_TEST_NO_WRITE_REV_INDEX"
 #define GIT_TEST_REV_INDEX_DIE_IN_MEMORY "GIT_TEST_REV_INDEX_DIE_IN_MEMORY"
 #define GIT_TEST_REV_INDEX_DIE_ON_DISK "GIT_TEST_REV_INDEX_DIE_ON_DISK"
 
diff --git a/t/README b/t/README
index 29576c3748..bdfac4cceb 100644
--- a/t/README
+++ b/t/README
@@ -475,7 +475,7 @@ GIT_TEST_DEFAULT_HASH=<hash-algo> specifies which hash algorithm to
 use in the test scripts. Recognized values for <hash-algo> are "sha1"
 and "sha256".
 
-GIT_TEST_WRITE_REV_INDEX=<boolean>, when true enables the
+GIT_TEST_NO_WRITE_REV_INDEX=<boolean>, when true disables the
 'pack.writeReverseIndex' setting.
 
 GIT_TEST_SPARSE_INDEX=<boolean>, when true enables index writes to use the
diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
index 149dcf5193..0548fce1aa 100755
--- a/t/t5325-reverse-index.sh
+++ b/t/t5325-reverse-index.sh
@@ -7,7 +7,7 @@ TEST_PASSES_SANITIZE_LEAK=true
 
 # The below tests want control over the 'pack.writeReverseIndex' setting
 # themselves to assert various combinations of it with other options.
-sane_unset GIT_TEST_WRITE_REV_INDEX
+sane_unset GIT_TEST_NO_WRITE_REV_INDEX
 
 packdir=.git/objects/pack
 
-- 
2.40.0.323.g9c80379958

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/7] pack-write.c: plug a leak in stage_tmp_packfiles()
  2023-04-10 22:53 ` [PATCH 1/7] pack-write.c: plug a leak in stage_tmp_packfiles() Taylor Blau
@ 2023-04-11 13:23   ` Derrick Stolee
  2023-04-11 21:25     ` Taylor Blau
  0 siblings, 1 reply; 29+ messages in thread
From: Derrick Stolee @ 2023-04-11 13:23 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: Jeff King, Junio C Hamano

On 4/10/2023 6:53 PM, Taylor Blau wrote:

> Instead, simply free() `rev_tmp_name` at the end of
> `stage_tmp_packfiles()`.

> @@ -568,6 +568,8 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
>  		rename_tmp_packfile(name_buffer, rev_tmp_name, "rev");
>  	if (mtimes_tmp_name)
>  		rename_tmp_packfile(name_buffer, mtimes_tmp_name, "mtimes");
> +
> +	free((char *)rev_tmp_name);

Just cut off from the context is a "if (rev_tmp_name)", so it might be
good to group this into that block, since we have the condition, anyway.


But I was also thinking about how we like to use "const" as an indicator
as "I am not responsible for free()ing this". And this comes from the
public write_rev_file() method. Based on the API prototype, we could
think that this string is held by a static strbuf (making the method
not reentrant, but that happens sometimes in our methods). But generally,
I wanted to inspect what it would take to make the API reflect the fact
that it can return a "new" string.

But there are two issues:

 1. The actual logic is inside write_rev_file_order(), so that API
    needs to change, too.

 2. The "new" string is created only if the rev_name parameter is
    NULL, which is somewhat understandable but still requires
    inside knowledge about the implementation to make that choice.

 3. If we inspect the callers to these methods, only one caller
    passes a non-null name: builtin/index-pack.c. The rest pass NULL,
    including write_midx_reverse_index() (which then leaks the name).

The below diff includes my attempt to change the API to return a
non-const string that must be freed by the callers.

Thanks,
-Stolee

--- >8 ---

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index b17e79cd40f..6d2fa52f9c4 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1725,7 +1725,7 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 {
 	int i, fix_thin_pack = 0, verify = 0, stat_only = 0, rev_index;
 	const char *curr_index;
-	const char *curr_rev_index = NULL;
+	char *curr_rev_index = NULL;
 	const char *index_name = NULL, *pack_name = NULL, *rev_index_name = NULL;
 	const char *keep_msg = NULL;
 	const char *promisor_msg = NULL;
@@ -1956,8 +1956,7 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 		free((void *) curr_pack);
 	if (!index_name)
 		free((void *) curr_index);
-	if (!rev_index_name)
-		free((void *) curr_rev_index);
+	free(curr_rev_index);
 
 	/*
 	 * Let the caller know this pack is not self contained
diff --git a/midx.c b/midx.c
index 9af3e5de889..85154bedd73 100644
--- a/midx.c
+++ b/midx.c
@@ -945,7 +945,7 @@ static void write_midx_reverse_index(char *midx_name, unsigned char *midx_hash,
 				     struct write_midx_context *ctx)
 {
 	struct strbuf buf = STRBUF_INIT;
-	const char *tmp_file;
+	char *tmp_file;
 
 	trace2_region_enter("midx", "write_midx_reverse_index", the_repository);
 
@@ -958,6 +958,7 @@ static void write_midx_reverse_index(char *midx_name, unsigned char *midx_hash,
 		die(_("cannot store reverse index file"));
 
 	strbuf_release(&buf);
+	free(tmp_file);
 
 	trace2_region_leave("midx", "write_midx_reverse_index", the_repository);
 }
diff --git a/pack-write.c b/pack-write.c
index f1714054951..73850c061d9 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -207,15 +207,15 @@ static void write_rev_trailer(struct hashfile *f, const unsigned char *hash)
 	hashwrite(f, hash, the_hash_algo->rawsz);
 }
 
-const char *write_rev_file(const char *rev_name,
-			   struct pack_idx_entry **objects,
-			   uint32_t nr_objects,
-			   const unsigned char *hash,
-			   unsigned flags)
+char *write_rev_file(const char *rev_name,
+		     struct pack_idx_entry **objects,
+		     uint32_t nr_objects,
+		     const unsigned char *hash,
+		     unsigned flags)
 {
 	uint32_t *pack_order;
 	uint32_t i;
-	const char *ret;
+	char *ret;
 
 	if (!(flags & WRITE_REV) && !(flags & WRITE_REV_VERIFY))
 		return NULL;
@@ -233,12 +233,13 @@ const char *write_rev_file(const char *rev_name,
 	return ret;
 }
 
-const char *write_rev_file_order(const char *rev_name,
-				 uint32_t *pack_order,
-				 uint32_t nr_objects,
-				 const unsigned char *hash,
-				 unsigned flags)
+char *write_rev_file_order(const char *rev_name,
+			   uint32_t *pack_order,
+			   uint32_t nr_objects,
+			   const unsigned char *hash,
+			   unsigned flags)
 {
+	char *ret_name;
 	struct hashfile *f;
 	int fd;
 
@@ -249,10 +250,11 @@ const char *write_rev_file_order(const char *rev_name,
 		if (!rev_name) {
 			struct strbuf tmp_file = STRBUF_INIT;
 			fd = odb_mkstemp(&tmp_file, "pack/tmp_rev_XXXXXX");
-			rev_name = strbuf_detach(&tmp_file, NULL);
+			rev_name = ret_name = strbuf_detach(&tmp_file, NULL);
 		} else {
 			unlink(rev_name);
 			fd = xopen(rev_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
+			ret_name = xstrdup(rev_name);
 		}
 		f = hashfd(fd, rev_name);
 	} else if (flags & WRITE_REV_VERIFY) {
@@ -264,6 +266,7 @@ const char *write_rev_file_order(const char *rev_name,
 			} else
 				die_errno(_("could not stat: %s"), rev_name);
 		}
+		ret_name = xstrdup(rev_name);
 		f = hashfd_check(rev_name);
 	} else
 		return NULL;
@@ -280,7 +283,7 @@ const char *write_rev_file_order(const char *rev_name,
 			  CSUM_HASH_IN_STREAM | CSUM_CLOSE |
 			  ((flags & WRITE_IDX_VERIFY) ? 0 : CSUM_FSYNC));
 
-	return rev_name;
+	return ret_name;
 }
 
 static void write_mtimes_header(struct hashfile *f)
@@ -543,7 +546,7 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 			 unsigned char hash[],
 			 char **idx_tmp_name)
 {
-	const char *rev_tmp_name = NULL;
+	char *rev_tmp_name = NULL;
 	const char *mtimes_tmp_name = NULL;
 
 	if (adjust_shared_perm(pack_tmp_name))
@@ -564,8 +567,10 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 	}
 
 	rename_tmp_packfile(name_buffer, pack_tmp_name, "pack");
-	if (rev_tmp_name)
+	if (rev_tmp_name) {
 		rename_tmp_packfile(name_buffer, rev_tmp_name, "rev");
+		free(rev_tmp_name);
+	}
 	if (mtimes_tmp_name)
 		rename_tmp_packfile(name_buffer, mtimes_tmp_name, "mtimes");
 }
diff --git a/pack.h b/pack.h
index 3ab9e3f60c0..02bbdfb19cc 100644
--- a/pack.h
+++ b/pack.h
@@ -96,8 +96,8 @@ struct ref;
 
 void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought);
 
-const char *write_rev_file(const char *rev_name, struct pack_idx_entry **objects, uint32_t nr_objects, const unsigned char *hash, unsigned flags);
-const char *write_rev_file_order(const char *rev_name, uint32_t *pack_order, uint32_t nr_objects, const unsigned char *hash, unsigned flags);
+char *write_rev_file(const char *rev_name, struct pack_idx_entry **objects, uint32_t nr_objects, const unsigned char *hash, unsigned flags);
+char *write_rev_file_order(const char *rev_name, uint32_t *pack_order, uint32_t nr_objects, const unsigned char *hash, unsigned flags);
 
 /*
  * The "hdr" output buffer should be at least this big, which will handle sizes

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/7] pack-revindex: make `load_pack_revindex` take a repository
  2023-04-10 22:53 ` [PATCH 3/7] pack-revindex: make `load_pack_revindex` take a repository Taylor Blau
@ 2023-04-11 13:45   ` Derrick Stolee
  2023-04-11 21:30     ` Taylor Blau
  0 siblings, 1 reply; 29+ messages in thread
From: Derrick Stolee @ 2023-04-11 13:45 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: Jeff King, Junio C Hamano

On 4/10/2023 6:53 PM, Taylor Blau wrote:
> In a future commit, we will introduce a `pack.readReverseIndex`
> configuration, which forces Git to generate the reverse index from
> scratch instead of loading it from disk.
> 
> In order to avoid reading this configuration value more than once, we'll
> use the `repo_settings` struct to lazily load this value.
> 
> In order to access the `struct repo_settings`, add a repository argument
> to `load_pack_revindex`, and update all callers to pass the correct
> instance (in all cases, `the_repository`).

If all callers use the_repository, then we could presumably use
the_repository within the method directly. However, there are some
cases where the call chain is less obvious that we have already
entered something that is "repository scoped".

The patch below applies on top of this one and is the result of
exploring the two callers within pack-bitmap.c. Since they are
static, I was able to only modify things within that file, but
found two callers to _those_ methods that were repository scoped,
so without making this connection we are losing that scope.

There are other non-static methods in pack-bitmap.c that might
benefit from wiring a repository pointer through (or adding a
repository pointer to struct bitmap_index to get it for free),
but I used the trick of defining a local repository pointer at
the top of the method to make it easier to change in the future.

Thanks,
-Stolee


--- >8 ---

From 9816f7026199981b86d9f3e2188036e1b20bc2f9 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <derrickstolee@github.com>
Date: Tue, 11 Apr 2023 09:34:42 -0400
Subject: [PATCH] pack-bitmap: use struct repository more often

The previous change introduced a 'struct repository *' parameter to
load_pack_revindex(). To satisfy the callers within pack-bitmap.c, these
parameters were filled with 'the_repository'.

However, these callers are sometimes included in methods that are
already scoped to a 'struct repository *' parameter. By dropping the
link from that repository and using the_repository, we are giving a
false impression that this portion of the rev-index API is properly
scoped to a single repository.

Expand the static methods in pack-bitmap.c that call
load_pack_revindex() to include a 'struct repository *' parameter.
Modify the callers of those methods to pass a repository as appropriate.
For the methods without an appropriate repository, create a local
variable equal to the_repository so it is easier to convert them to
using a parameter in the future.

In the case of prepare_bitmap_git(), we already have a repository
pointer parameter that can be used. In prepare_bitmap_walk(), the
rev_info struct contains a repository pointer.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 pack-bitmap.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 8e3bb00931..38b35c4823 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -463,7 +463,7 @@ static int open_pack_bitmap_1(struct bitmap_index *bitmap_git, struct packed_git
 	return 0;
 }
 
-static int load_reverse_index(struct bitmap_index *bitmap_git)
+static int load_reverse_index(struct repository *r, struct bitmap_index *bitmap_git)
 {
 	if (bitmap_is_midx(bitmap_git)) {
 		uint32_t i;
@@ -477,24 +477,23 @@ static int load_reverse_index(struct bitmap_index *bitmap_git)
 		 * since we will need to make use of them in pack-objects.
 		 */
 		for (i = 0; i < bitmap_git->midx->num_packs; i++) {
-			ret = load_pack_revindex(the_repository,
-						 bitmap_git->midx->packs[i]);
+			ret = load_pack_revindex(r, bitmap_git->midx->packs[i]);
 			if (ret)
 				return ret;
 		}
 		return 0;
 	}
-	return load_pack_revindex(the_repository, bitmap_git->pack);
+	return load_pack_revindex(r, bitmap_git->pack);
 }
 
-static int load_bitmap(struct bitmap_index *bitmap_git)
+static int load_bitmap(struct repository *r, struct bitmap_index *bitmap_git)
 {
 	assert(bitmap_git->map);
 
 	bitmap_git->bitmaps = kh_init_oid_map();
 	bitmap_git->ext_index.positions = kh_init_oid_pos();
 
-	if (load_reverse_index(bitmap_git))
+	if (load_reverse_index(r, bitmap_git))
 		goto failed;
 
 	if (!(bitmap_git->commits = read_bitmap_1(bitmap_git)) ||
@@ -581,7 +580,7 @@ struct bitmap_index *prepare_bitmap_git(struct repository *r)
 {
 	struct bitmap_index *bitmap_git = xcalloc(1, sizeof(*bitmap_git));
 
-	if (!open_bitmap(r, bitmap_git) && !load_bitmap(bitmap_git))
+	if (!open_bitmap(r, bitmap_git) && !load_bitmap(r, bitmap_git))
 		return bitmap_git;
 
 	free_bitmap_index(bitmap_git);
@@ -590,9 +589,10 @@ struct bitmap_index *prepare_bitmap_git(struct repository *r)
 
 struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_index *midx)
 {
+	struct repository *r = the_repository;
 	struct bitmap_index *bitmap_git = xcalloc(1, sizeof(*bitmap_git));
 
-	if (!open_midx_bitmap_1(bitmap_git, midx) && !load_bitmap(bitmap_git))
+	if (!open_midx_bitmap_1(bitmap_git, midx) && !load_bitmap(r, bitmap_git))
 		return bitmap_git;
 
 	free_bitmap_index(bitmap_git);
@@ -1593,7 +1593,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 	 * from disk. this is the point of no return; after this the rev_list
 	 * becomes invalidated and we must perform the revwalk through bitmaps
 	 */
-	if (load_bitmap(bitmap_git) < 0)
+	if (load_bitmap(revs->repo, bitmap_git) < 0)
 		goto cleanup;
 
 	object_array_clear(&revs->pending);
@@ -1743,6 +1743,7 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 				       uint32_t *entries,
 				       struct bitmap **reuse_out)
 {
+	struct repository *r = the_repository;
 	struct packed_git *pack;
 	struct bitmap *result = bitmap_git->result;
 	struct bitmap *reuse;
@@ -1753,7 +1754,7 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 
 	assert(result);
 
-	load_reverse_index(bitmap_git);
+	load_reverse_index(r, bitmap_git);
 
 	if (bitmap_is_midx(bitmap_git))
 		pack = bitmap_git->midx->packs[midx_preferred_pack(bitmap_git)];
@@ -2133,11 +2134,12 @@ int rebuild_bitmap(const uint32_t *reposition,
 uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git,
 				struct packing_data *mapping)
 {
+	struct repository *r = the_repository;
 	uint32_t i, num_objects;
 	uint32_t *reposition;
 
 	if (!bitmap_is_midx(bitmap_git))
-		load_reverse_index(bitmap_git);
+		load_reverse_index(r, bitmap_git);
 	else if (load_midx_revindex(bitmap_git->midx) < 0)
 		BUG("rebuild_existing_bitmaps: missing required rev-cache "
 		    "extension");
-- 
2.40.0.vfs.0.0.3.g5872ac9aaa4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7] t: invert `GIT_TEST_WRITE_REV_INDEX`
  2023-04-10 22:53 ` [PATCH 7/7] t: invert `GIT_TEST_WRITE_REV_INDEX` Taylor Blau
@ 2023-04-11 13:51   ` Derrick Stolee
  2023-04-11 21:33     ` Taylor Blau
  0 siblings, 1 reply; 29+ messages in thread
From: Derrick Stolee @ 2023-04-11 13:51 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: Jeff King, Junio C Hamano

On 4/10/2023 6:53 PM, Taylor Blau wrote:
> Instead of getting rid of the option, invert its meaning to instead
> disable writing ".rev" files, thereby running the test suite in a mode
> where the reverse index is generated from scratch.
> 
> This ensures that we are still running and exercising Git's behavior
> when forced to generate reverse indexes from scratch.

I don't think this is true because you remove the environment
variable from the following test. Replacing the line with
GIT_TEST_NO_WRITE_REV_INDEX=1 would keep us testing the from-scratch
case as a side-effect in other tests.

> diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
> index b098e10f52..e9fbfb6345 100755
> --- a/ci/run-build-and-tests.sh
> +++ b/ci/run-build-and-tests.sh
> @@ -27,7 +27,6 @@ linux-TEST-vars)
>  	export GIT_TEST_MULTI_PACK_INDEX=1
>  	export GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=1
>  	export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=master
> -	export GIT_TEST_WRITE_REV_INDEX=1
>  	export GIT_TEST_CHECKOUT_WORKERS=2
>  	;;

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/7] pack-revindex: enable on-disk reverse indexes by default
  2023-04-10 22:53 [PATCH 0/7] pack-revindex: enable on-disk reverse indexes by default Taylor Blau
                   ` (6 preceding siblings ...)
  2023-04-10 22:53 ` [PATCH 7/7] t: invert `GIT_TEST_WRITE_REV_INDEX` Taylor Blau
@ 2023-04-11 13:54 ` Derrick Stolee
  2023-04-11 21:40   ` Taylor Blau
  2023-04-12 22:20 ` [PATCH v2 " Taylor Blau
  8 siblings, 1 reply; 29+ messages in thread
From: Derrick Stolee @ 2023-04-11 13:54 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: Jeff King, Junio C Hamano

On 4/10/2023 6:53 PM, Taylor Blau wrote:
> In the vast majority of cases, this trade-off favors the on-disk ".rev"
> files. But in certain cases, the in-memory variant performs more
> favorably. Since these cases are narrow, and performance is machine- and
> repository-dependent, this series also introduces a new configuration
> option to disable reading ".rev" files in the third commit.

I agree that the performance trade-off indicates that having the .rev
files is preferred. It makes operations that _can_ be very fast as fast
as possible (inspecting a small number of objects is much faster because
we don't generate the in-memory index in O(N) time) and you create a knob
for disabling it in the case that we are already doing something that
inspects almost-all objects.
 
> The series is structured as follows:
> 
>   - A couple of cleanup patches to plug a leak in stage_tmp_packfiles().
>   - Three patches to enable `pack.readReverseIndex`.
>   - Another patch to change the default of `pack.writeReverseIndex` from
>     "false" to "true".
>   - A final patch to enable the test suite to be run in a mode that does
>     not use on-disk ".rev" files.

This was an easy series to read. I applied the patches locally and poked
around in the resulting code as I went along. This led to a couple places
where I recommend a few changes, including a new patch that wires
repository pointers through a few more method layers.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/7] pack-write.c: plug a leak in stage_tmp_packfiles()
  2023-04-11 13:23   ` Derrick Stolee
@ 2023-04-11 21:25     ` Taylor Blau
  0 siblings, 0 replies; 29+ messages in thread
From: Taylor Blau @ 2023-04-11 21:25 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Junio C Hamano

On Tue, Apr 11, 2023 at 09:23:31AM -0400, Derrick Stolee wrote:
> On 4/10/2023 6:53 PM, Taylor Blau wrote:
>
> > Instead, simply free() `rev_tmp_name` at the end of
> > `stage_tmp_packfiles()`.
>
> > @@ -568,6 +568,8 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
> >  		rename_tmp_packfile(name_buffer, rev_tmp_name, "rev");
> >  	if (mtimes_tmp_name)
> >  		rename_tmp_packfile(name_buffer, mtimes_tmp_name, "mtimes");
> > +
> > +	free((char *)rev_tmp_name);
>
> Just cut off from the context is a "if (rev_tmp_name)", so it might be
> good to group this into that block, since we have the condition, anyway.

Definitely possible, though FWIW it's fine to have this free()
positioned at the end of the function, since we initialize rev_tmp_name
to NULL (making this a noop when not writing an on-disk reverse index).

> But I was also thinking about how we like to use "const" as an indicator
> as "I am not responsible for free()ing this". And this comes from the
> public write_rev_file() method. Based on the API prototype, we could
> think that this string is held by a static strbuf (making the method
> not reentrant, but that happens sometimes in our methods). But generally,
> I wanted to inspect what it would take to make the API reflect the fact
> that it can return a "new" string.
>
> But there are two issues:
>
>  1. The actual logic is inside write_rev_file_order(), so that API
>     needs to change, too.
>
>  2. The "new" string is created only if the rev_name parameter is
>     NULL, which is somewhat understandable but still requires
>     inside knowledge about the implementation to make that choice.
>
>  3. If we inspect the callers to these methods, only one caller
>     passes a non-null name: builtin/index-pack.c. The rest pass NULL,
>     including write_midx_reverse_index() (which then leaks the name).
>
> The below diff includes my attempt to change the API to return a
> non-const string that must be freed by the callers.

I like this direction. I think that all things being equal (and unless
you feel strongly about it in the meantime), I'd just as soon pursue
this as a "fast follow" to avoid intermixing this API change with the
primary intent of this series.

In the meantime, dropping the const via a cast down to "char *" works
fine to plug the leak here.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/7] pack-revindex: make `load_pack_revindex` take a repository
  2023-04-11 13:45   ` Derrick Stolee
@ 2023-04-11 21:30     ` Taylor Blau
  2023-04-12 17:33       ` Derrick Stolee
  0 siblings, 1 reply; 29+ messages in thread
From: Taylor Blau @ 2023-04-11 21:30 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Junio C Hamano

On Tue, Apr 11, 2023 at 09:45:21AM -0400, Derrick Stolee wrote:
> On 4/10/2023 6:53 PM, Taylor Blau wrote:
> > In a future commit, we will introduce a `pack.readReverseIndex`
> > configuration, which forces Git to generate the reverse index from
> > scratch instead of loading it from disk.
> >
> > In order to avoid reading this configuration value more than once, we'll
> > use the `repo_settings` struct to lazily load this value.
> >
> > In order to access the `struct repo_settings`, add a repository argument
> > to `load_pack_revindex`, and update all callers to pass the correct
> > instance (in all cases, `the_repository`).
>
> If all callers use the_repository, then we could presumably use
> the_repository within the method directly. However, there are some
> cases where the call chain is less obvious that we have already
> entered something that is "repository scoped".
>
> The patch below applies on top of this one and is the result of
> exploring the two callers within pack-bitmap.c. Since they are
> static, I was able to only modify things within that file, but
> found two callers to _those_ methods that were repository scoped,
> so without making this connection we are losing that scope.
>
> There are other non-static methods in pack-bitmap.c that might
> benefit from wiring a repository pointer through (or adding a
> repository pointer to struct bitmap_index to get it for free),
> but I used the trick of defining a local repository pointer at
> the top of the method to make it easier to change in the future.
>
> Thanks,
> -Stolee

> @@ -581,7 +580,7 @@ struct bitmap_index *prepare_bitmap_git(struct repository *r)
>  {
>  	struct bitmap_index *bitmap_git = xcalloc(1, sizeof(*bitmap_git));
>
> -	if (!open_bitmap(r, bitmap_git) && !load_bitmap(bitmap_git))
> +	if (!open_bitmap(r, bitmap_git) && !load_bitmap(r, bitmap_git))
>  		return bitmap_git;
>
>  	free_bitmap_index(bitmap_git);

Oops; we are indeed dropping the repository pointer that was given to
prepare_bitmap_git() here. It's unfortunate that we have to work through
so many layers to propagate it back down, but I agree that it's the
right thing to do.

> @@ -590,9 +589,10 @@ struct bitmap_index *prepare_bitmap_git(struct repository *r)
>
>  struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_index *midx)
>  {
> +	struct repository *r = the_repository;

OK; and here we're using the trick you mentioned in the patch message to
avoid having to propagate this even further out. The rest of the patch
looks sensible to me.

In terms of working this one in, it feels odd to include it as a
separate commit since we know the one immediately prior to it is kind of
broken.

Do you want to squash these together? Something else? Anything is fine
with me here.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7] t: invert `GIT_TEST_WRITE_REV_INDEX`
  2023-04-11 13:51   ` Derrick Stolee
@ 2023-04-11 21:33     ` Taylor Blau
  2023-04-12 17:37       ` Derrick Stolee
  0 siblings, 1 reply; 29+ messages in thread
From: Taylor Blau @ 2023-04-11 21:33 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Junio C Hamano

On Tue, Apr 11, 2023 at 09:51:06AM -0400, Derrick Stolee wrote:
> On 4/10/2023 6:53 PM, Taylor Blau wrote:
> > Instead of getting rid of the option, invert its meaning to instead
> > disable writing ".rev" files, thereby running the test suite in a mode
> > where the reverse index is generated from scratch.
> >
> > This ensures that we are still running and exercising Git's behavior
> > when forced to generate reverse indexes from scratch.
>
> I don't think this is true because you remove the environment
> variable from the following test.

This is intentional; I wrote that last sentence ("This ensures...") with
an implied "[When set], this ensures..." at the beginning of it. IOW, if
you wanted to run the test suite with primarily in-memory reverse
indexes, you still could.

> Replacing the line with GIT_TEST_NO_WRITE_REV_INDEX=1 would keep us
> testing the from-scratch case as a side-effect in other tests.
>
> > diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
> > index b098e10f52..e9fbfb6345 100755
> > --- a/ci/run-build-and-tests.sh
> > +++ b/ci/run-build-and-tests.sh
> > @@ -27,7 +27,6 @@ linux-TEST-vars)
> >  	export GIT_TEST_MULTI_PACK_INDEX=1
> >  	export GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=1
> >  	export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=master
> > -	export GIT_TEST_WRITE_REV_INDEX=1
> >  	export GIT_TEST_CHECKOUT_WORKERS=2
> >  	;;

I wasn't quite sure what to do here. On one hand, sticking
GIT_TEST_NO_WRITE_REV_INDEX=1 here would ensure that the linux-TEST-vars
test is still exercising the old code.

Is that desirable? I dunno. On one hand it increases our coverage, but
on the other hand I've always treated this suite as for experimental
features, not older ones.

But I think all things being equal, I'd rather have more CI coverage
than not, so I'll add it back in. Thanks for a sanity check on this one.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/7] pack-revindex: enable on-disk reverse indexes by default
  2023-04-11 13:54 ` [PATCH 0/7] pack-revindex: enable on-disk reverse indexes by default Derrick Stolee
@ 2023-04-11 21:40   ` Taylor Blau
  2023-04-12 17:39     ` Derrick Stolee
  0 siblings, 1 reply; 29+ messages in thread
From: Taylor Blau @ 2023-04-11 21:40 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Junio C Hamano

On Tue, Apr 11, 2023 at 09:54:08AM -0400, Derrick Stolee wrote:
> On 4/10/2023 6:53 PM, Taylor Blau wrote:
> > In the vast majority of cases, this trade-off favors the on-disk ".rev"
> > files. But in certain cases, the in-memory variant performs more
> > favorably. Since these cases are narrow, and performance is machine- and
> > repository-dependent, this series also introduces a new configuration
> > option to disable reading ".rev" files in the third commit.
>
> I agree that the performance trade-off indicates that having the .rev
> files is preferred. It makes operations that _can_ be very fast as fast
> as possible (inspecting a small number of objects is much faster because
> we don't generate the in-memory index in O(N) time) and you create a knob
> for disabling it in the case that we are already doing something that
> inspects almost-all objects.

Sweet; I'm glad that you agree.

FWIW for non-GitHub folks, observing a slow-down here has never been an
issue for us. So much so that I wrote the pack.readReverseIndex knob
yesterday for the purpose of sending this series.

That said, I think that including it here is still worthwhile, since the
cases where performance really suffers (e.g., `git cat-file
--batch-all-objects --batch-check='%(objectsize:disk)'`) isn't something
that GitHub runs regularly if at all.

To accommodate different workflows, I think having the option to opt-out
of reading the on-disk ".rev" files is worthwhile.

> This was an easy series to read. I applied the patches locally and poked
> around in the resulting code as I went along. This led to a couple places
> where I recommend a few changes, including a new patch that wires
> repository pointers through a few more method layers.

Thanks for taking a look. Based on your review, there are only a couple
of things on my mind:

  - I amended the last patch to more clearly state when we would want to
    run the suite GIT_TEST_NO_WRITE_REV_INDEXES=1 set, and kept it in
    the linux-TEST-vars configuration.

  - How do you want to handle that extra patch? As I noted in [1], I
    think squashing the two together in one way or another makes sense.
    So really we have to figure out (a) if you think that is the right
    way to go, and (b) if so, how to handle attribution / the commit
    message.

Thanks,
Taylor

[1]: https://lore.kernel.org/git/ZDXRajRky5XtFenU@nand.local/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/7] pack-revindex: make `load_pack_revindex` take a repository
  2023-04-11 21:30     ` Taylor Blau
@ 2023-04-12 17:33       ` Derrick Stolee
  0 siblings, 0 replies; 29+ messages in thread
From: Derrick Stolee @ 2023-04-12 17:33 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On 4/11/2023 5:30 PM, Taylor Blau wrote:
> On Tue, Apr 11, 2023 at 09:45:21AM -0400, Derrick Stolee wrote:

>> @@ -581,7 +580,7 @@ struct bitmap_index *prepare_bitmap_git(struct repository *r)
>>  {
>>  	struct bitmap_index *bitmap_git = xcalloc(1, sizeof(*bitmap_git));
>>
>> -	if (!open_bitmap(r, bitmap_git) && !load_bitmap(bitmap_git))
>> +	if (!open_bitmap(r, bitmap_git) && !load_bitmap(r, bitmap_git))
>>  		return bitmap_git;
>>
>>  	free_bitmap_index(bitmap_git);
> 
> Oops; we are indeed dropping the repository pointer that was given to
> prepare_bitmap_git() here. It's unfortunate that we have to work through
> so many layers to propagate it back down, but I agree that it's the
> right thing to do.
> 
>> @@ -590,9 +589,10 @@ struct bitmap_index *prepare_bitmap_git(struct repository *r)
>>
>>  struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_index *midx)
>>  {
>> +	struct repository *r = the_repository;
> 
> OK; and here we're using the trick you mentioned in the patch message to
> avoid having to propagate this even further out. The rest of the patch
> looks sensible to me.
> 
> In terms of working this one in, it feels odd to include it as a
> separate commit since we know the one immediately prior to it is kind of
> broken.
> 
> Do you want to squash these together? Something else? Anything is fine
> with me here.

Feel free to squash it in, to avoid having a commit where the chain is
broken.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7] t: invert `GIT_TEST_WRITE_REV_INDEX`
  2023-04-11 21:33     ` Taylor Blau
@ 2023-04-12 17:37       ` Derrick Stolee
  0 siblings, 0 replies; 29+ messages in thread
From: Derrick Stolee @ 2023-04-12 17:37 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On 4/11/2023 5:33 PM, Taylor Blau wrote:
> On Tue, Apr 11, 2023 at 09:51:06AM -0400, Derrick Stolee wrote:
>> On 4/10/2023 6:53 PM, Taylor Blau wrote:
>>> Instead of getting rid of the option, invert its meaning to instead
>>> disable writing ".rev" files, thereby running the test suite in a mode
>>> where the reverse index is generated from scratch.
>>>
>>> This ensures that we are still running and exercising Git's behavior
>>> when forced to generate reverse indexes from scratch.
>>
>> I don't think this is true because you remove the environment
>> variable from the following test.
> 
> This is intentional; I wrote that last sentence ("This ensures...") with
> an implied "[When set], this ensures..." at the beginning of it. IOW, if
> you wanted to run the test suite with primarily in-memory reverse
> indexes, you still could.

I guess I think of "we are still running and exercising" as only
being true if it happens during regular CI, not just when someone
thinks to try it.

> I wasn't quite sure what to do here. On one hand, sticking
> GIT_TEST_NO_WRITE_REV_INDEX=1 here would ensure that the linux-TEST-vars
> test is still exercising the old code.
> 
> Is that desirable? I dunno. On one hand it increases our coverage, but
> on the other hand I've always treated this suite as for experimental
> features, not older ones.
> 
> But I think all things being equal, I'd rather have more CI coverage
> than not, so I'll add it back in. Thanks for a sanity check on this one.

If we think there is value in continuing to test the case without .rev
files by default, then adding it in would be good. I think it has some
value at least for some time after the transition. We're unlikely to
remember to remove it, but having the extra mode in this build is not
too much overhead.

We should consider auditing this build just to be sure we want to
continue testing these options or if some could be removed in favor
of new defaults.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/7] pack-revindex: enable on-disk reverse indexes by default
  2023-04-11 21:40   ` Taylor Blau
@ 2023-04-12 17:39     ` Derrick Stolee
  0 siblings, 0 replies; 29+ messages in thread
From: Derrick Stolee @ 2023-04-12 17:39 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Jeff King, Junio C Hamano

On 4/11/2023 5:40 PM, Taylor Blau wrote:
> On Tue, Apr 11, 2023 at 09:54:08AM -0400, Derrick Stolee wrote:
>> On 4/10/2023 6:53 PM, Taylor Blau wrote:
>>> In the vast majority of cases, this trade-off favors the on-disk ".rev"
>>> files. But in certain cases, the in-memory variant performs more
>>> favorably. Since these cases are narrow, and performance is machine- and
>>> repository-dependent, this series also introduces a new configuration
>>> option to disable reading ".rev" files in the third commit.
>>
>> I agree that the performance trade-off indicates that having the .rev
>> files is preferred. It makes operations that _can_ be very fast as fast
>> as possible (inspecting a small number of objects is much faster because
>> we don't generate the in-memory index in O(N) time) and you create a knob
>> for disabling it in the case that we are already doing something that
>> inspects almost-all objects.
> 
> Sweet; I'm glad that you agree.
> 
> FWIW for non-GitHub folks, observing a slow-down here has never been an
> issue for us. So much so that I wrote the pack.readReverseIndex knob
> yesterday for the purpose of sending this series.
> 
> That said, I think that including it here is still worthwhile, since the
> cases where performance really suffers (e.g., `git cat-file
> --batch-all-objects --batch-check='%(objectsize:disk)'`) isn't something
> that GitHub runs regularly if at all.
> 
> To accommodate different workflows, I think having the option to opt-out
> of reading the on-disk ".rev" files is worthwhile.

The only thing I can think of that would actually use this kind of
behavior is git-sizer, but even that doesn't actually report the on-disk
size (yet) and instead inflates all deltas when reporting size counts. The
difference in performance here is likely minimal for that tool.
 
>> This was an easy series to read. I applied the patches locally and poked
>> around in the resulting code as I went along. This led to a couple places
>> where I recommend a few changes, including a new patch that wires
>> repository pointers through a few more method layers.
> 
> Thanks for taking a look. Based on your review, there are only a couple
> of things on my mind:
> 
>   - I amended the last patch to more clearly state when we would want to
>     run the suite GIT_TEST_NO_WRITE_REV_INDEXES=1 set, and kept it in
>     the linux-TEST-vars configuration.

I think this is the right thing to do. Thanks.

>   - How do you want to handle that extra patch? As I noted in [1], I
>     think squashing the two together in one way or another makes sense.
>     So really we have to figure out (a) if you think that is the right
>     way to go, and (b) if so, how to handle attribution / the commit
>     message.

Squashing makes sense. You could make me a co-author, or not. It's the
natural thing to do once the problem to solve is identified.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v2 0/7] pack-revindex: enable on-disk reverse indexes by default
  2023-04-10 22:53 [PATCH 0/7] pack-revindex: enable on-disk reverse indexes by default Taylor Blau
                   ` (7 preceding siblings ...)
  2023-04-11 13:54 ` [PATCH 0/7] pack-revindex: enable on-disk reverse indexes by default Derrick Stolee
@ 2023-04-12 22:20 ` Taylor Blau
  2023-04-12 22:20   ` [PATCH v2 1/7] pack-write.c: plug a leak in stage_tmp_packfiles() Taylor Blau
                     ` (7 more replies)
  8 siblings, 8 replies; 29+ messages in thread
From: Taylor Blau @ 2023-04-12 22:20 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

Here is a (tiny) reroll of my series to enable pack reverse-indexes to
be written to disk by default instead of computed on-the-fly in memory.

The original cover letter[1] and commits herein have all of the gore-y
details. Not much has changed since last time, except:

  - squashing in a patch from Stolee to propagate a `struct repository
    *` a little further out from `load_pack_revindex()`

  - a tweak to the linux-TEST-vars CI job to continue running it with
    GIT_TEST_NO_WRITE_REV_INDEX

  - and an additional benchmark to demonstrate the performance of `git
    cat-file --batch-check='%(objectsize:disk)' --batch--all-objects
    --unordered` with and without on-disk reverse indexes

As always, a range-diff is included below for convenience. Thanks in
advance for taking (another) look!

[1]: https://lore.kernel.org/git/cover.1681166596.git.me@ttaylorr.com/

Taylor Blau (7):
  pack-write.c: plug a leak in stage_tmp_packfiles()
  t5325: mark as leak-free
  pack-revindex: make `load_pack_revindex` take a repository
  pack-revindex: introduce GIT_TEST_REV_INDEX_DIE_ON_DISK
  pack-revindex: introduce `pack.readReverseIndex`
  config: enable `pack.writeReverseIndex` by default
  t: invert `GIT_TEST_WRITE_REV_INDEX`

 Documentation/config/pack.txt     |  8 +++++++-
 builtin/index-pack.c              |  5 +++--
 builtin/pack-objects.c            |  5 +++--
 ci/run-build-and-tests.sh         |  2 +-
 pack-bitmap.c                     | 23 +++++++++++++----------
 pack-revindex.c                   | 12 +++++++++---
 pack-revindex.h                   |  6 ++++--
 pack-write.c                      |  2 ++
 packfile.c                        |  2 +-
 repo-settings.c                   |  1 +
 repository.h                      |  1 +
 t/README                          |  2 +-
 t/perf/p5312-pack-bitmaps-revs.sh |  3 +--
 t/t5325-reverse-index.sh          | 16 +++++++++++++++-
 14 files changed, 62 insertions(+), 26 deletions(-)

Range-diff against v1:
-:  ----------- > 1:  18be29c3988 pack-write.c: plug a leak in stage_tmp_packfiles()
-:  ----------- > 2:  affb5e2574b t5325: mark as leak-free
1:  be4faf11011 ! 3:  687a9a58924 pack-revindex: make `load_pack_revindex` take a repository
    @@ Commit message
         to `load_pack_revindex`, and update all callers to pass the correct
         instance (in all cases, `the_repository`).
     
    -    Signed-off-by: Taylor Blauy <me@ttaylorr.com>
    +    In certain instances, a new function-local variable is introduced to
    +    take the place of a `struct repository *` argument to the function
    +    itself to avoid propagating the new parameter even further throughout
    +    the tree.
    +
    +    Co-authored-by: Derrick Stolee <derrickstolee@github.com>
    +    Signed-off-by: Derrick Stolee <derrickstolee@github.com>
    +    Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## pack-bitmap.c ##
    +@@ pack-bitmap.c: static int open_pack_bitmap_1(struct bitmap_index *bitmap_git, struct packed_git
    + 	return 0;
    + }
    + 
    +-static int load_reverse_index(struct bitmap_index *bitmap_git)
    ++static int load_reverse_index(struct repository *r, struct bitmap_index *bitmap_git)
    + {
    + 	if (bitmap_is_midx(bitmap_git)) {
    + 		uint32_t i;
     @@ pack-bitmap.c: static int load_reverse_index(struct bitmap_index *bitmap_git)
      		 * since we will need to make use of them in pack-objects.
      		 */
      		for (i = 0; i < bitmap_git->midx->num_packs; i++) {
     -			ret = load_pack_revindex(bitmap_git->midx->packs[i]);
    -+			ret = load_pack_revindex(the_repository,
    -+						 bitmap_git->midx->packs[i]);
    ++			ret = load_pack_revindex(r, bitmap_git->midx->packs[i]);
      			if (ret)
      				return ret;
      		}
      		return 0;
      	}
     -	return load_pack_revindex(bitmap_git->pack);
    -+	return load_pack_revindex(the_repository, bitmap_git->pack);
    ++	return load_pack_revindex(r, bitmap_git->pack);
      }
      
    - static int load_bitmap(struct bitmap_index *bitmap_git)
    +-static int load_bitmap(struct bitmap_index *bitmap_git)
    ++static int load_bitmap(struct repository *r, struct bitmap_index *bitmap_git)
    + {
    + 	assert(bitmap_git->map);
    + 
    + 	bitmap_git->bitmaps = kh_init_oid_map();
    + 	bitmap_git->ext_index.positions = kh_init_oid_pos();
    + 
    +-	if (load_reverse_index(bitmap_git))
    ++	if (load_reverse_index(r, bitmap_git))
    + 		goto failed;
    + 
    + 	if (!(bitmap_git->commits = read_bitmap_1(bitmap_git)) ||
    +@@ pack-bitmap.c: struct bitmap_index *prepare_bitmap_git(struct repository *r)
    + {
    + 	struct bitmap_index *bitmap_git = xcalloc(1, sizeof(*bitmap_git));
    + 
    +-	if (!open_bitmap(r, bitmap_git) && !load_bitmap(bitmap_git))
    ++	if (!open_bitmap(r, bitmap_git) && !load_bitmap(r, bitmap_git))
    + 		return bitmap_git;
    + 
    + 	free_bitmap_index(bitmap_git);
    +@@ pack-bitmap.c: struct bitmap_index *prepare_bitmap_git(struct repository *r)
    + 
    + struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_index *midx)
    + {
    ++	struct repository *r = the_repository;
    + 	struct bitmap_index *bitmap_git = xcalloc(1, sizeof(*bitmap_git));
    + 
    +-	if (!open_midx_bitmap_1(bitmap_git, midx) && !load_bitmap(bitmap_git))
    ++	if (!open_midx_bitmap_1(bitmap_git, midx) && !load_bitmap(r, bitmap_git))
    + 		return bitmap_git;
    + 
    + 	free_bitmap_index(bitmap_git);
    +@@ pack-bitmap.c: struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
    + 	 * from disk. this is the point of no return; after this the rev_list
    + 	 * becomes invalidated and we must perform the revwalk through bitmaps
    + 	 */
    +-	if (load_bitmap(bitmap_git) < 0)
    ++	if (load_bitmap(revs->repo, bitmap_git) < 0)
    + 		goto cleanup;
    + 
    + 	object_array_clear(&revs->pending);
    +@@ pack-bitmap.c: int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
    + 				       uint32_t *entries,
    + 				       struct bitmap **reuse_out)
    + {
    ++	struct repository *r = the_repository;
    + 	struct packed_git *pack;
    + 	struct bitmap *result = bitmap_git->result;
    + 	struct bitmap *reuse;
    +@@ pack-bitmap.c: int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
    + 
    + 	assert(result);
    + 
    +-	load_reverse_index(bitmap_git);
    ++	load_reverse_index(r, bitmap_git);
    + 
    + 	if (bitmap_is_midx(bitmap_git))
    + 		pack = bitmap_git->midx->packs[midx_preferred_pack(bitmap_git)];
    +@@ pack-bitmap.c: int rebuild_bitmap(const uint32_t *reposition,
    + uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git,
    + 				struct packing_data *mapping)
    + {
    ++	struct repository *r = the_repository;
    + 	uint32_t i, num_objects;
    + 	uint32_t *reposition;
    + 
    + 	if (!bitmap_is_midx(bitmap_git))
    +-		load_reverse_index(bitmap_git);
    ++		load_reverse_index(r, bitmap_git);
    + 	else if (load_midx_revindex(bitmap_git->midx) < 0)
    + 		BUG("rebuild_existing_bitmaps: missing required rev-cache "
    + 		    "extension");
     
      ## pack-revindex.c ##
     @@ pack-revindex.c: static int load_pack_revindex_from_disk(struct packed_git *p)
2:  0f368e2347e = 4:  8eec5bacd3a pack-revindex: introduce GIT_TEST_REV_INDEX_DIE_ON_DISK
3:  6f692d470cb ! 5:  a62fc3e4ec1 pack-revindex: introduce `pack.readReverseIndex`
    @@ Commit message
               'git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" --batch-all-objects' ran
                 2.06 ± 0.02 times faster than 'git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" --batch-all-objects'
     
    +    Luckily, the results when running `git cat-file` with `--unordered` are
    +    closer together:
    +
    +        $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects'
    +        Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects
    +          Time (mean ± σ):      5.066 s ±  0.105 s    [User: 4.792 s, System: 0.274 s]
    +          Range (min … max):    4.943 s …  5.220 s    10 runs
    +
    +        Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects
    +          Time (mean ± σ):      6.193 s ±  0.069 s    [User: 5.937 s, System: 0.255 s]
    +          Range (min … max):    6.145 s …  6.356 s    10 runs
    +
    +        Summary
    +          'git.compile -c pack.readReverseIndex=false cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects' ran
    +            1.22 ± 0.03 times faster than 'git.compile -c pack.readReverseIndex=true cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects'
    +
         Because the equilibrium point between these two is highly machine- and
         repository-dependent, allow users to configure whether or not they will
         read any ".rev" file(s) with this configuration knob.
4:  56a0fc0098e = 6:  f8298fb0bac config: enable `pack.writeReverseIndex` by default
5:  9c803799588 ! 7:  edff6a80c63 t: invert `GIT_TEST_WRITE_REV_INDEX`
    @@ Commit message
         disable writing ".rev" files, thereby running the test suite in a mode
         where the reverse index is generated from scratch.
     
    -    This ensures that we are still running and exercising Git's behavior
    -    when forced to generate reverse indexes from scratch.
    +    This ensures that, when GIT_TEST_NO_WRITE_REV_INDEX is set to some
    +    spelling of "true", we are still running and exercising Git's behavior
    +    when forced to generate reverse indexes from scratch. Do so by setting
    +    it in the linux-TEST-vars CI run to ensure that we are maintaining good
    +    coverage of this now-legacy code.
     
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
    @@ ci/run-build-and-tests.sh: linux-TEST-vars)
      	export GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=1
      	export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=master
     -	export GIT_TEST_WRITE_REV_INDEX=1
    ++	export GIT_TEST_NO_WRITE_REV_INDEX=1
      	export GIT_TEST_CHECKOUT_WORKERS=2
      	;;
      linux-clang)
-- 
2.40.0.323.gedff6a80c63

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v2 1/7] pack-write.c: plug a leak in stage_tmp_packfiles()
  2023-04-12 22:20 ` [PATCH v2 " Taylor Blau
@ 2023-04-12 22:20   ` Taylor Blau
  2023-04-12 22:20   ` [PATCH v2 2/7] t5325: mark as leak-free Taylor Blau
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Taylor Blau @ 2023-04-12 22:20 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

The function `stage_tmp_packfiles()` generates a filename to use for
staging the contents of what will become the pack's ".rev" file.

The name is generated in `write_rev_file_order()` (via its caller
`write_rev_file()`) in a string buffer, and the result is returned back
to `stage_tmp_packfiles()` which uses it to rename the temporary file
into place via `rename_tmp_packfiles()`.

That name is not visible outside of `stage_tmp_packfiles()`, so it can
(and should) be `free()`'d at the end of that function. We can't free it
in `rename_tmp_packfile()` since not all of its `source` arguments are
unreachable after calling it.

Instead, simply free() `rev_tmp_name` at the end of
`stage_tmp_packfiles()`.

(Note that the same leak exists for `mtimes_tmp_name`, but we do not
address it in this commit).

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-write.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/pack-write.c b/pack-write.c
index f1714054951..f27c1f7f281 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -568,6 +568,8 @@ void stage_tmp_packfiles(struct strbuf *name_buffer,
 		rename_tmp_packfile(name_buffer, rev_tmp_name, "rev");
 	if (mtimes_tmp_name)
 		rename_tmp_packfile(name_buffer, mtimes_tmp_name, "mtimes");
+
+	free((char *)rev_tmp_name);
 }
 
 void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought)
-- 
2.40.0.323.gedff6a80c63


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 2/7] t5325: mark as leak-free
  2023-04-12 22:20 ` [PATCH v2 " Taylor Blau
  2023-04-12 22:20   ` [PATCH v2 1/7] pack-write.c: plug a leak in stage_tmp_packfiles() Taylor Blau
@ 2023-04-12 22:20   ` Taylor Blau
  2023-04-12 22:20   ` [PATCH v2 3/7] pack-revindex: make `load_pack_revindex` take a repository Taylor Blau
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Taylor Blau @ 2023-04-12 22:20 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

This test is leak-free as of the previous commit, so let's mark it as
such to ensure we don't regress and introduce a leak in the future.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t5325-reverse-index.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
index d042d26f2b3..48c14d85764 100755
--- a/t/t5325-reverse-index.sh
+++ b/t/t5325-reverse-index.sh
@@ -1,6 +1,8 @@
 #!/bin/sh
 
 test_description='on-disk reverse index'
+
+TEST_PASSES_SANITIZE_LEAK=true
 . ./test-lib.sh
 
 # The below tests want control over the 'pack.writeReverseIndex' setting
-- 
2.40.0.323.gedff6a80c63


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 3/7] pack-revindex: make `load_pack_revindex` take a repository
  2023-04-12 22:20 ` [PATCH v2 " Taylor Blau
  2023-04-12 22:20   ` [PATCH v2 1/7] pack-write.c: plug a leak in stage_tmp_packfiles() Taylor Blau
  2023-04-12 22:20   ` [PATCH v2 2/7] t5325: mark as leak-free Taylor Blau
@ 2023-04-12 22:20   ` Taylor Blau
  2023-04-12 22:20   ` [PATCH v2 4/7] pack-revindex: introduce GIT_TEST_REV_INDEX_DIE_ON_DISK Taylor Blau
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Taylor Blau @ 2023-04-12 22:20 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

In a future commit, we will introduce a `pack.readReverseIndex`
configuration, which forces Git to generate the reverse index from
scratch instead of loading it from disk.

In order to avoid reading this configuration value more than once, we'll
use the `repo_settings` struct to lazily load this value.

In order to access the `struct repo_settings`, add a repository argument
to `load_pack_revindex`, and update all callers to pass the correct
instance (in all cases, `the_repository`).

In certain instances, a new function-local variable is introduced to
take the place of a `struct repository *` argument to the function
itself to avoid propagating the new parameter even further throughout
the tree.

Co-authored-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c   | 23 +++++++++++++----------
 pack-revindex.c |  4 ++--
 pack-revindex.h |  3 ++-
 packfile.c      |  2 +-
 4 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index b2e7d06d604..38b35c48237 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -463,7 +463,7 @@ static int open_pack_bitmap_1(struct bitmap_index *bitmap_git, struct packed_git
 	return 0;
 }
 
-static int load_reverse_index(struct bitmap_index *bitmap_git)
+static int load_reverse_index(struct repository *r, struct bitmap_index *bitmap_git)
 {
 	if (bitmap_is_midx(bitmap_git)) {
 		uint32_t i;
@@ -477,23 +477,23 @@ static int load_reverse_index(struct bitmap_index *bitmap_git)
 		 * since we will need to make use of them in pack-objects.
 		 */
 		for (i = 0; i < bitmap_git->midx->num_packs; i++) {
-			ret = load_pack_revindex(bitmap_git->midx->packs[i]);
+			ret = load_pack_revindex(r, bitmap_git->midx->packs[i]);
 			if (ret)
 				return ret;
 		}
 		return 0;
 	}
-	return load_pack_revindex(bitmap_git->pack);
+	return load_pack_revindex(r, bitmap_git->pack);
 }
 
-static int load_bitmap(struct bitmap_index *bitmap_git)
+static int load_bitmap(struct repository *r, struct bitmap_index *bitmap_git)
 {
 	assert(bitmap_git->map);
 
 	bitmap_git->bitmaps = kh_init_oid_map();
 	bitmap_git->ext_index.positions = kh_init_oid_pos();
 
-	if (load_reverse_index(bitmap_git))
+	if (load_reverse_index(r, bitmap_git))
 		goto failed;
 
 	if (!(bitmap_git->commits = read_bitmap_1(bitmap_git)) ||
@@ -580,7 +580,7 @@ struct bitmap_index *prepare_bitmap_git(struct repository *r)
 {
 	struct bitmap_index *bitmap_git = xcalloc(1, sizeof(*bitmap_git));
 
-	if (!open_bitmap(r, bitmap_git) && !load_bitmap(bitmap_git))
+	if (!open_bitmap(r, bitmap_git) && !load_bitmap(r, bitmap_git))
 		return bitmap_git;
 
 	free_bitmap_index(bitmap_git);
@@ -589,9 +589,10 @@ struct bitmap_index *prepare_bitmap_git(struct repository *r)
 
 struct bitmap_index *prepare_midx_bitmap_git(struct multi_pack_index *midx)
 {
+	struct repository *r = the_repository;
 	struct bitmap_index *bitmap_git = xcalloc(1, sizeof(*bitmap_git));
 
-	if (!open_midx_bitmap_1(bitmap_git, midx) && !load_bitmap(bitmap_git))
+	if (!open_midx_bitmap_1(bitmap_git, midx) && !load_bitmap(r, bitmap_git))
 		return bitmap_git;
 
 	free_bitmap_index(bitmap_git);
@@ -1592,7 +1593,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 	 * from disk. this is the point of no return; after this the rev_list
 	 * becomes invalidated and we must perform the revwalk through bitmaps
 	 */
-	if (load_bitmap(bitmap_git) < 0)
+	if (load_bitmap(revs->repo, bitmap_git) < 0)
 		goto cleanup;
 
 	object_array_clear(&revs->pending);
@@ -1742,6 +1743,7 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 				       uint32_t *entries,
 				       struct bitmap **reuse_out)
 {
+	struct repository *r = the_repository;
 	struct packed_git *pack;
 	struct bitmap *result = bitmap_git->result;
 	struct bitmap *reuse;
@@ -1752,7 +1754,7 @@ int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
 
 	assert(result);
 
-	load_reverse_index(bitmap_git);
+	load_reverse_index(r, bitmap_git);
 
 	if (bitmap_is_midx(bitmap_git))
 		pack = bitmap_git->midx->packs[midx_preferred_pack(bitmap_git)];
@@ -2132,11 +2134,12 @@ int rebuild_bitmap(const uint32_t *reposition,
 uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git,
 				struct packing_data *mapping)
 {
+	struct repository *r = the_repository;
 	uint32_t i, num_objects;
 	uint32_t *reposition;
 
 	if (!bitmap_is_midx(bitmap_git))
-		load_reverse_index(bitmap_git);
+		load_reverse_index(r, bitmap_git);
 	else if (load_midx_revindex(bitmap_git->midx) < 0)
 		BUG("rebuild_existing_bitmaps: missing required rev-cache "
 		    "extension");
diff --git a/pack-revindex.c b/pack-revindex.c
index 03c7e81f9da..e3d69cc0f7a 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -283,7 +283,7 @@ static int load_pack_revindex_from_disk(struct packed_git *p)
 	return ret;
 }
 
-int load_pack_revindex(struct packed_git *p)
+int load_pack_revindex(struct repository *r, struct packed_git *p)
 {
 	if (p->revindex || p->revindex_data)
 		return 0;
@@ -356,7 +356,7 @@ int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
 {
 	unsigned lo, hi;
 
-	if (load_pack_revindex(p) < 0)
+	if (load_pack_revindex(the_repository, p) < 0)
 		return -1;
 
 	lo = 0;
diff --git a/pack-revindex.h b/pack-revindex.h
index 4974e75eb4d..3d1969ce8b9 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -39,6 +39,7 @@
 
 struct packed_git;
 struct multi_pack_index;
+struct repository;
 
 /*
  * load_pack_revindex populates the revindex's internal data-structures for the
@@ -47,7 +48,7 @@ struct multi_pack_index;
  * If a '.rev' file is present it is mmap'd, and pointers are assigned into it
  * (instead of using the in-memory variant).
  */
-int load_pack_revindex(struct packed_git *p);
+int load_pack_revindex(struct repository *r, struct packed_git *p);
 
 /*
  * load_midx_revindex loads the '.rev' file corresponding to the given
diff --git a/packfile.c b/packfile.c
index b120405ccc8..717fd685c97 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2151,7 +2151,7 @@ int for_each_object_in_pack(struct packed_git *p,
 	int r = 0;
 
 	if (flags & FOR_EACH_OBJECT_PACK_ORDER) {
-		if (load_pack_revindex(p))
+		if (load_pack_revindex(the_repository, p))
 			return -1;
 	}
 
-- 
2.40.0.323.gedff6a80c63


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 4/7] pack-revindex: introduce GIT_TEST_REV_INDEX_DIE_ON_DISK
  2023-04-12 22:20 ` [PATCH v2 " Taylor Blau
                     ` (2 preceding siblings ...)
  2023-04-12 22:20   ` [PATCH v2 3/7] pack-revindex: make `load_pack_revindex` take a repository Taylor Blau
@ 2023-04-12 22:20   ` Taylor Blau
  2023-04-12 22:20   ` [PATCH v2 5/7] pack-revindex: introduce `pack.readReverseIndex` Taylor Blau
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Taylor Blau @ 2023-04-12 22:20 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

In ec8e7760ac (pack-revindex: ensure that on-disk reverse indexes are
given precedence, 2021-01-25), we introduced
GIT_TEST_REV_INDEX_DIE_IN_MEMORY to abort the process when Git generated
a reverse index from scratch.

ec8e7760ac was about ensuring that Git prefers a .rev file when
available over generating the same information in memory from scratch.

In a subsequent patch, we'll introduce `pack.readReverseIndex`, which
may be used to disable reading ".rev" files when available. In order to
ensure that those files are indeed being ignored, introduce an analogous
option to abort the process when Git reads a ".rev" file from disk.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-revindex.c | 3 +++
 pack-revindex.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/pack-revindex.c b/pack-revindex.c
index e3d69cc0f7a..44e1b3fed95 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -205,6 +205,9 @@ static int load_revindex_from_disk(char *revindex_name,
 	size_t revindex_size;
 	struct revindex_header *hdr;
 
+	if (git_env_bool(GIT_TEST_REV_INDEX_DIE_ON_DISK, 0))
+		die("dying as requested by '%s'", GIT_TEST_REV_INDEX_DIE_ON_DISK);
+
 	fd = git_open(revindex_name);
 
 	if (fd < 0) {
diff --git a/pack-revindex.h b/pack-revindex.h
index 3d1969ce8b9..ef8afee88b0 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -36,6 +36,7 @@
 
 #define GIT_TEST_WRITE_REV_INDEX "GIT_TEST_WRITE_REV_INDEX"
 #define GIT_TEST_REV_INDEX_DIE_IN_MEMORY "GIT_TEST_REV_INDEX_DIE_IN_MEMORY"
+#define GIT_TEST_REV_INDEX_DIE_ON_DISK "GIT_TEST_REV_INDEX_DIE_ON_DISK"
 
 struct packed_git;
 struct multi_pack_index;
-- 
2.40.0.323.gedff6a80c63


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 5/7] pack-revindex: introduce `pack.readReverseIndex`
  2023-04-12 22:20 ` [PATCH v2 " Taylor Blau
                     ` (3 preceding siblings ...)
  2023-04-12 22:20   ` [PATCH v2 4/7] pack-revindex: introduce GIT_TEST_REV_INDEX_DIE_ON_DISK Taylor Blau
@ 2023-04-12 22:20   ` Taylor Blau
  2023-04-12 22:20   ` [PATCH v2 6/7] config: enable `pack.writeReverseIndex` by default Taylor Blau
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Taylor Blau @ 2023-04-12 22:20 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

Since 1615c567b8 (Documentation/config/pack.txt: advertise
'pack.writeReverseIndex', 2021-01-25), we have had the
`pack.writeReverseIndex` configuration option, which tells Git whether
or not it is allowed to write a ".rev" file when indexing a pack.

Introduce a complementary configuration knob, `pack.readReverseIndex` to
control whether or not Git will read any ".rev" file(s) that may be
available on disk.

This option is useful for debugging, as well as disabling the effect of
".rev" files in certain instances.

This is useful because of the trade-off[^1] between the time it takes to
generate a reverse index (slow from scratch, fast when reading an
existing ".rev" file), and the time it takes to access a record (the
opposite).

For example, even though it is faster to use the on-disk reverse index
when computing the on-disk size of a packed object, it is slower to
enumerate the same value for all objects.

Here are a couple of examples from linux.git. When computing the above
for a single object, using the on-disk reverse index is significantly
faster:

    $ git rev-parse HEAD >in
    $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --batch-check="%(objectsize:disk)" <in'
    Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" <in
      Time (mean ± σ):     302.5 ms ±  12.5 ms    [User: 258.7 ms, System: 43.6 ms]
      Range (min … max):   291.1 ms … 328.1 ms    10 runs

    Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" <in
      Time (mean ± σ):       3.9 ms ±   0.3 ms    [User: 1.6 ms, System: 2.4 ms]
      Range (min … max):     2.0 ms …   4.4 ms    801 runs

    Summary
      'git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" <in' ran
       77.29 ± 7.14 times faster than 'git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" <in'

, but when instead trying to compute the on-disk object size for all
objects in the repository, using the ".rev" file is a disadvantage over
creating the reverse index from scratch:

    $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --batch-check="%(objectsize:disk)" --batch-all-objects'
    Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" --batch-all-objects
      Time (mean ± σ):      8.258 s ±  0.035 s    [User: 7.949 s, System: 0.308 s]
      Range (min … max):    8.199 s …  8.293 s    10 runs

    Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" --batch-all-objects
      Time (mean ± σ):     16.976 s ±  0.107 s    [User: 16.706 s, System: 0.268 s]
      Range (min … max):   16.839 s … 17.105 s    10 runs

    Summary
      'git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" --batch-all-objects' ran
	2.06 ± 0.02 times faster than 'git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" --batch-all-objects'

Luckily, the results when running `git cat-file` with `--unordered` are
closer together:

    $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects'
    Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects
      Time (mean ± σ):      5.066 s ±  0.105 s    [User: 4.792 s, System: 0.274 s]
      Range (min … max):    4.943 s …  5.220 s    10 runs

    Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects
      Time (mean ± σ):      6.193 s ±  0.069 s    [User: 5.937 s, System: 0.255 s]
      Range (min … max):    6.145 s …  6.356 s    10 runs

    Summary
      'git.compile -c pack.readReverseIndex=false cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects' ran
        1.22 ± 0.03 times faster than 'git.compile -c pack.readReverseIndex=true cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects'

Because the equilibrium point between these two is highly machine- and
repository-dependent, allow users to configure whether or not they will
read any ".rev" file(s) with this configuration knob.

[^1]: Generating a reverse index in memory takes O(N) time (where N is
  the number of objects in the repository), since we use a radix sort.
  Reading an entry from an on-disk ".rev" file is slower since each
  operation is bound by disk I/O instead of memory I/O.

  In order to compute the on-disk size of a packed object, we need to
  find the offset of our object, and the adjacent object (the on-disk
  size difference of these two). Finding the first offset requires a
  binary search. Finding the latter involves a single .rev lookup.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/pack.txt |  6 ++++++
 pack-revindex.c               |  5 ++++-
 repo-settings.c               |  1 +
 repository.h                  |  1 +
 t/t5325-reverse-index.sh      | 11 +++++++++++
 5 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index 53093d99969..7db7fed4667 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -171,6 +171,12 @@ pack.writeBitmapLookupTable::
 	beneficial in repositories that have relatively large bitmap
 	indexes. Defaults to false.
 
+pack.readReverseIndex::
+	When true, git will read any .rev file(s) that may be available
+	(see: linkgit:gitformat-pack[5]). When false, the reverse index
+	will be generated from scratch and stored in memory. Defaults to
+	true.
+
 pack.writeReverseIndex::
 	When true, git will write a corresponding .rev file (see:
 	linkgit:gitformat-pack[5])
diff --git a/pack-revindex.c b/pack-revindex.c
index 44e1b3fed95..29f5358b256 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -291,7 +291,10 @@ int load_pack_revindex(struct repository *r, struct packed_git *p)
 	if (p->revindex || p->revindex_data)
 		return 0;
 
-	if (!load_pack_revindex_from_disk(p))
+	prepare_repo_settings(r);
+
+	if (r->settings.pack_read_reverse_index &&
+	    !load_pack_revindex_from_disk(p))
 		return 0;
 	else if (!create_pack_revindex_in_memory(p))
 		return 0;
diff --git a/repo-settings.c b/repo-settings.c
index 0a6c0b381fe..bdd7640ab0b 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -63,6 +63,7 @@ void prepare_repo_settings(struct repository *r)
 	repo_cfg_bool(r, "core.multipackindex", &r->settings.core_multi_pack_index, 1);
 	repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0);
 	repo_cfg_bool(r, "index.skiphash", &r->settings.index_skip_hash, r->settings.index_skip_hash);
+	repo_cfg_bool(r, "pack.readreverseindex", &r->settings.pack_read_reverse_index, 1);
 
 	/*
 	 * The GIT_TEST_MULTI_PACK_INDEX variable is special in that
diff --git a/repository.h b/repository.h
index 15a8afc5fb5..ed73e799b61 100644
--- a/repository.h
+++ b/repository.h
@@ -37,6 +37,7 @@ struct repo_settings {
 	int fetch_write_commit_graph;
 	int command_requires_full_index;
 	int sparse_index;
+	int pack_read_reverse_index;
 
 	struct fsmonitor_settings *fsmonitor; /* lazily loaded */
 
diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
index 48c14d85764..66171c1d67b 100755
--- a/t/t5325-reverse-index.sh
+++ b/t/t5325-reverse-index.sh
@@ -96,6 +96,17 @@ test_expect_success 'reverse index is not generated when available on disk' '
 		--batch-check="%(objectsize:disk)" <tip
 '
 
+test_expect_success 'reverse index is ignored when pack.readReverseIndex is false' '
+	test_index_pack true &&
+	test_path_is_file $rev &&
+
+	test_config pack.readReverseIndex false &&
+
+	git rev-parse HEAD >tip &&
+	GIT_TEST_REV_INDEX_DIE_ON_DISK=1 git cat-file \
+		--batch-check="%(objectsize:disk)" <tip
+'
+
 test_expect_success 'revindex in-memory vs on-disk' '
 	git init repo &&
 	test_when_finished "rm -fr repo" &&
-- 
2.40.0.323.gedff6a80c63


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 6/7] config: enable `pack.writeReverseIndex` by default
  2023-04-12 22:20 ` [PATCH v2 " Taylor Blau
                     ` (4 preceding siblings ...)
  2023-04-12 22:20   ` [PATCH v2 5/7] pack-revindex: introduce `pack.readReverseIndex` Taylor Blau
@ 2023-04-12 22:20   ` Taylor Blau
  2023-04-12 22:20   ` [PATCH v2 7/7] t: invert `GIT_TEST_WRITE_REV_INDEX` Taylor Blau
  2023-04-13 13:40   ` [PATCH v2 0/7] pack-revindex: enable on-disk reverse indexes by default Derrick Stolee
  7 siblings, 0 replies; 29+ messages in thread
From: Taylor Blau @ 2023-04-12 22:20 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

Back in e37d0b8730 (builtin/index-pack.c: write reverse indexes,
2021-01-25), Git learned how to read and write a pack's reverse index
from a file instead of in-memory.

A pack's reverse index is a mapping from pack position (that is, the
order that objects appear together in a ".pack")  to their position in
lexical order (that is, the order that objects are listed in an ".idx"
file).

Reverse indexes are consulted often during pack-objects, as well as
during auxiliary operations that require mapping between pack offsets,
pack order, and index index.

They are useful in GitHub's infrastructure, where we have seen a
dramatic increase in performance when writing ".rev" files[1]. In
particular:

  - an ~80% reduction in the time it takes to serve fetches on a popular
    repository, Homebrew/homebrew-core.

  - a ~60% reduction in the peak memory usage to serve fetches on that
    same repository.

  - a collective savings of ~35% in CPU time across all pack-objects
    invocations serving fetches across all repositories in a single
    datacenter.

Reverse indexes are also beneficial to end-users as well as forges. For
example, the time it takes to generate a pack containing the objects for
the 10 most recent commits in linux.git (representing a typical push) is
significantly faster when on-disk reverse indexes are available:

    $ { git rev-parse HEAD && printf '^' && git rev-parse HEAD~10 } >in
    $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} pack-objects --delta-base-offset --revs --stdout <in >/dev/null'
    Benchmark 1: git.compile -c pack.readReverseIndex=false pack-objects --delta-base-offset --revs --stdout <in >/dev/null
      Time (mean ± σ):     543.0 ms ±  20.3 ms    [User: 616.2 ms, System: 58.8 ms]
      Range (min … max):   521.0 ms … 577.9 ms    10 runs

    Benchmark 2: git.compile -c pack.readReverseIndex=true pack-objects --delta-base-offset --revs --stdout <in >/dev/null
      Time (mean ± σ):     245.0 ms ±  11.4 ms    [User: 335.6 ms, System: 31.3 ms]
      Range (min … max):   226.0 ms … 259.6 ms    13 runs

    Summary
      'git.compile -c pack.readReverseIndex=true pack-objects --delta-base-offset --revs --stdout <in >/dev/null' ran
	2.22 ± 0.13 times faster than 'git.compile -c pack.readReverseIndex=false pack-objects --delta-base-offset --revs --stdout <in >/dev/null'

The same is true of writing a pack containing the objects for the 30
most-recent commits:

    $ { git rev-parse HEAD && printf '^' && git rev-parse HEAD~30 } >in
    $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} pack-objects --delta-base-offset --revs --stdout <in >/dev/null'
    Benchmark 1: git.compile -c pack.readReverseIndex=false pack-objects --delta-base-offset --revs --stdout <in >/dev/null
      Time (mean ± σ):     866.5 ms ±  16.2 ms    [User: 1414.5 ms, System: 97.0 ms]
      Range (min … max):   839.3 ms … 886.9 ms    10 runs

    Benchmark 2: git.compile -c pack.readReverseIndex=true pack-objects --delta-base-offset --revs --stdout <in >/dev/null
      Time (mean ± σ):     581.6 ms ±  10.2 ms    [User: 1181.7 ms, System: 62.6 ms]
      Range (min … max):   567.5 ms … 599.3 ms    10 runs

    Summary
      'git.compile -c pack.readReverseIndex=true pack-objects --delta-base-offset --revs --stdout <in >/dev/null' ran
	1.49 ± 0.04 times faster than 'git.compile -c pack.readReverseIndex=false pack-objects --delta-base-offset --revs --stdout <in >/dev/null'

...and savings on trivial operations like computing the on-disk size of
a single (packed) object are even more dramatic:

    $ git rev-parse HEAD >in
    $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --batch-check="%(objectsize:disk)" <in'
    Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" <in
      Time (mean ± σ):     305.8 ms ±  11.4 ms    [User: 264.2 ms, System: 41.4 ms]
      Range (min … max):   290.3 ms … 331.1 ms    10 runs

    Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" <in
      Time (mean ± σ):       4.0 ms ±   0.3 ms    [User: 1.7 ms, System: 2.3 ms]
      Range (min … max):     1.6 ms …   4.6 ms    1155 runs

    Summary
      'git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" <in' ran
       76.96 ± 6.25 times faster than 'git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" <in'

In the more than two years since e37d0b8730 was merged, Git's
implementation of on-disk reverse indexes has been thoroughly tested,
both from users enabling `pack.writeReverseIndexes`, and from GitHub's
deployment of the feature. The latter has been running without incident
for more than two years.

This patch changes Git's behavior to write on-disk reverse indexes by
default when indexing a pack, which should make the above operations
faster for everybody's Git installation after a repack.

(The previous commit explains some potential drawbacks of using on-disk
reverse indexes in certain limited circumstances, that essentially boil
down to a trade-off between time to generate, and time to access. For
those limited cases, the `pack.readReverseIndex` escape hatch can be
used).

[1]: https://github.blog/2021-04-29-scaling-monorepo-maintenance/#reverse-indexes

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/pack.txt     | 2 +-
 builtin/index-pack.c              | 1 +
 builtin/pack-objects.c            | 1 +
 t/perf/p5312-pack-bitmaps-revs.sh | 3 +--
 t/t5325-reverse-index.sh          | 1 +
 5 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index 7db7fed4667..d4c7c9d4e4e 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -182,4 +182,4 @@ pack.writeReverseIndex::
 	linkgit:gitformat-pack[5])
 	for each new packfile that it writes in all places except for
 	linkgit:git-fast-import[1] and in the bulk checkin mechanism.
-	Defaults to false.
+	Defaults to true.
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index b17e79cd40f..323c063f9db 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1753,6 +1753,7 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	fsck_options.walk = mark_link;
 
 	reset_pack_idx_option(&opts);
+	opts.flags |= WRITE_REV;
 	git_config(git_index_pack_config, &opts);
 	if (prefix && chdir(prefix))
 		die(_("Cannot come back to cwd"));
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 77d88f85b04..dbaa04482fd 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4293,6 +4293,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	}
 
 	reset_pack_idx_option(&pack_idx_opts);
+	pack_idx_opts.flags |= WRITE_REV;
 	git_config(git_pack_config, NULL);
 	if (git_env_bool(GIT_TEST_WRITE_REV_INDEX, 0))
 		pack_idx_opts.flags |= WRITE_REV;
diff --git a/t/perf/p5312-pack-bitmaps-revs.sh b/t/perf/p5312-pack-bitmaps-revs.sh
index 0684b690af0..ceec60656b5 100755
--- a/t/perf/p5312-pack-bitmaps-revs.sh
+++ b/t/perf/p5312-pack-bitmaps-revs.sh
@@ -12,8 +12,7 @@ test_lookup_pack_bitmap () {
 	test_perf_large_repo
 
 	test_expect_success 'setup bitmap config' '
-		git config pack.writebitmaps true &&
-		git config pack.writeReverseIndex true
+		git config pack.writebitmaps true
 	'
 
 	# we need to create the tag up front such that it is covered by the repack and
diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
index 66171c1d67b..149dcf5193b 100755
--- a/t/t5325-reverse-index.sh
+++ b/t/t5325-reverse-index.sh
@@ -14,6 +14,7 @@ packdir=.git/objects/pack
 test_expect_success 'setup' '
 	test_commit base &&
 
+	test_config pack.writeReverseIndex false &&
 	pack=$(git pack-objects --all $packdir/pack) &&
 	rev=$packdir/pack-$pack.rev &&
 
-- 
2.40.0.323.gedff6a80c63


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 7/7] t: invert `GIT_TEST_WRITE_REV_INDEX`
  2023-04-12 22:20 ` [PATCH v2 " Taylor Blau
                     ` (5 preceding siblings ...)
  2023-04-12 22:20   ` [PATCH v2 6/7] config: enable `pack.writeReverseIndex` by default Taylor Blau
@ 2023-04-12 22:20   ` Taylor Blau
  2023-04-13 13:40   ` [PATCH v2 0/7] pack-revindex: enable on-disk reverse indexes by default Derrick Stolee
  7 siblings, 0 replies; 29+ messages in thread
From: Taylor Blau @ 2023-04-12 22:20 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jeff King, Junio C Hamano

Back in e8c58f894b (t: support GIT_TEST_WRITE_REV_INDEX, 2021-01-25), we
added a test knob to conditionally enable writing a ".rev" file when
indexing a pack. At the time, this was used to ensure that the test
suite worked even when ".rev" files were written, which served as a
stress-test for the on-disk reverse index implementation.

Now that reading from on-disk ".rev" files is enabled by default, the
test knob `GIT_TEST_WRITE_REV_INDEX` no longer has any meaning.

We could get rid of the option entirely, but there would be no
convenient way to test Git when ".rev" files *aren't* in place.

Instead of getting rid of the option, invert its meaning to instead
disable writing ".rev" files, thereby running the test suite in a mode
where the reverse index is generated from scratch.

This ensures that, when GIT_TEST_NO_WRITE_REV_INDEX is set to some
spelling of "true", we are still running and exercising Git's behavior
when forced to generate reverse indexes from scratch. Do so by setting
it in the linux-TEST-vars CI run to ensure that we are maintaining good
coverage of this now-legacy code.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/index-pack.c      | 4 ++--
 builtin/pack-objects.c    | 4 ++--
 ci/run-build-and-tests.sh | 2 +-
 pack-revindex.h           | 2 +-
 t/README                  | 2 +-
 t/t5325-reverse-index.sh  | 2 +-
 6 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 323c063f9db..9e36c985cf1 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1758,8 +1758,8 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	if (prefix && chdir(prefix))
 		die(_("Cannot come back to cwd"));
 
-	if (git_env_bool(GIT_TEST_WRITE_REV_INDEX, 0))
-		rev_index = 1;
+	if (git_env_bool(GIT_TEST_NO_WRITE_REV_INDEX, 0))
+		rev_index = 0;
 	else
 		rev_index = !!(opts.flags & (WRITE_REV_VERIFY | WRITE_REV));
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index dbaa04482fd..1797871ce90 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4295,8 +4295,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	reset_pack_idx_option(&pack_idx_opts);
 	pack_idx_opts.flags |= WRITE_REV;
 	git_config(git_pack_config, NULL);
-	if (git_env_bool(GIT_TEST_WRITE_REV_INDEX, 0))
-		pack_idx_opts.flags |= WRITE_REV;
+	if (git_env_bool(GIT_TEST_NO_WRITE_REV_INDEX, 0))
+		pack_idx_opts.flags &= ~WRITE_REV;
 
 	progress = isatty(2);
 	argc = parse_options(argc, argv, prefix, pack_objects_options,
diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index b098e10f52a..a18b13a41dd 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -27,7 +27,7 @@ linux-TEST-vars)
 	export GIT_TEST_MULTI_PACK_INDEX=1
 	export GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=1
 	export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=master
-	export GIT_TEST_WRITE_REV_INDEX=1
+	export GIT_TEST_NO_WRITE_REV_INDEX=1
 	export GIT_TEST_CHECKOUT_WORKERS=2
 	;;
 linux-clang)
diff --git a/pack-revindex.h b/pack-revindex.h
index ef8afee88b0..46e834064e1 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -34,7 +34,7 @@
 #define RIDX_SIGNATURE 0x52494458 /* "RIDX" */
 #define RIDX_VERSION 1
 
-#define GIT_TEST_WRITE_REV_INDEX "GIT_TEST_WRITE_REV_INDEX"
+#define GIT_TEST_NO_WRITE_REV_INDEX "GIT_TEST_NO_WRITE_REV_INDEX"
 #define GIT_TEST_REV_INDEX_DIE_IN_MEMORY "GIT_TEST_REV_INDEX_DIE_IN_MEMORY"
 #define GIT_TEST_REV_INDEX_DIE_ON_DISK "GIT_TEST_REV_INDEX_DIE_ON_DISK"
 
diff --git a/t/README b/t/README
index 29576c37488..bdfac4cceb2 100644
--- a/t/README
+++ b/t/README
@@ -475,7 +475,7 @@ GIT_TEST_DEFAULT_HASH=<hash-algo> specifies which hash algorithm to
 use in the test scripts. Recognized values for <hash-algo> are "sha1"
 and "sha256".
 
-GIT_TEST_WRITE_REV_INDEX=<boolean>, when true enables the
+GIT_TEST_NO_WRITE_REV_INDEX=<boolean>, when true disables the
 'pack.writeReverseIndex' setting.
 
 GIT_TEST_SPARSE_INDEX=<boolean>, when true enables index writes to use the
diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
index 149dcf5193b..0548fce1aa6 100755
--- a/t/t5325-reverse-index.sh
+++ b/t/t5325-reverse-index.sh
@@ -7,7 +7,7 @@ TEST_PASSES_SANITIZE_LEAK=true
 
 # The below tests want control over the 'pack.writeReverseIndex' setting
 # themselves to assert various combinations of it with other options.
-sane_unset GIT_TEST_WRITE_REV_INDEX
+sane_unset GIT_TEST_NO_WRITE_REV_INDEX
 
 packdir=.git/objects/pack
 
-- 
2.40.0.323.gedff6a80c63

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 0/7] pack-revindex: enable on-disk reverse indexes by default
  2023-04-12 22:20 ` [PATCH v2 " Taylor Blau
                     ` (6 preceding siblings ...)
  2023-04-12 22:20   ` [PATCH v2 7/7] t: invert `GIT_TEST_WRITE_REV_INDEX` Taylor Blau
@ 2023-04-13 13:40   ` Derrick Stolee
  7 siblings, 0 replies; 29+ messages in thread
From: Derrick Stolee @ 2023-04-13 13:40 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: Jeff King, Junio C Hamano

On 4/12/2023 6:20 PM, Taylor Blau wrote:
> Here is a (tiny) reroll of my series to enable pack reverse-indexes to
> be written to disk by default instead of computed on-the-fly in memory.
> 
> The original cover letter[1] and commits herein have all of the gore-y
> details. Not much has changed since last time, except:
> 
>   - squashing in a patch from Stolee to propagate a `struct repository
>     *` a little further out from `load_pack_revindex()`
> 
>   - a tweak to the linux-TEST-vars CI job to continue running it with
>     GIT_TEST_NO_WRITE_REV_INDEX
> 
>   - and an additional benchmark to demonstrate the performance of `git
>     cat-file --batch-check='%(objectsize:disk)' --batch--all-objects
>     --unordered` with and without on-disk reverse indexes

Thanks for addressing my concerns from v1. This version LGTM.

-Stolee


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/7] config: enable `pack.writeReverseIndex` by default
  2023-04-10 22:53 ` [PATCH 6/7] config: enable `pack.writeReverseIndex` by default Taylor Blau
@ 2023-04-13 16:14   ` Junio C Hamano
  0 siblings, 0 replies; 29+ messages in thread
From: Junio C Hamano @ 2023-04-13 16:14 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, Derrick Stolee, Jeff King

Taylor Blau <me@ttaylorr.com> writes:

> They are useful in GitHub's infrastructure, where we have seen a
> dramatic increase in performance when writing ".rev" files[1]. In
> particular:
>
>   - an ~80% reduction in the time it takes to serve fetches on a popular
>     repository, Homebrew/homebrew-core.
>
>   - a ~60% reduction in the peak memory usage to serve fetches on that
>     same repository.
>
>   - a collective savings of ~35% in CPU time across all pack-objects
>     invocations serving fetches across all repositories in a single
>     datacenter.
>
> Reverse indexes are also beneficial to end-users as well as forges. For
> example, the time it takes to generate a pack containing the objects for
> the 10 most recent commits in linux.git (representing a typical push) is
> significantly faster when on-disk reverse indexes are available:
> ...
> In the more than two years since e37d0b8730 was merged, Git's
> implementation of on-disk reverse indexes has been thoroughly tested,
> both from users enabling `pack.writeReverseIndexes`, and from GitHub's
> deployment of the feature. The latter has been running without incident
> for more than two years.

What is missing is what extra overhead this adds to "git gc".  Does
it add 5%?  15%?  How big an overhead do we find reasonable?

Other than that, it looks like that the series is very well crafted.
It was somewhat surprising that we still need to add a few new
end-user facing or meant-for-testing knobs for a feature that we
claim is well battle tested.  It is understandable that we benefit
by having knobs to selectively _disable_ a part of the feature now
it will be on by default.

Will queue.  Thanks.


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2023-04-13 16:14 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-04-10 22:53 [PATCH 0/7] pack-revindex: enable on-disk reverse indexes by default Taylor Blau
2023-04-10 22:53 ` [PATCH 1/7] pack-write.c: plug a leak in stage_tmp_packfiles() Taylor Blau
2023-04-11 13:23   ` Derrick Stolee
2023-04-11 21:25     ` Taylor Blau
2023-04-10 22:53 ` [PATCH 2/7] t5325: mark as leak-free Taylor Blau
2023-04-10 22:53 ` [PATCH 3/7] pack-revindex: make `load_pack_revindex` take a repository Taylor Blau
2023-04-11 13:45   ` Derrick Stolee
2023-04-11 21:30     ` Taylor Blau
2023-04-12 17:33       ` Derrick Stolee
2023-04-10 22:53 ` [PATCH 4/7] pack-revindex: introduce GIT_TEST_REV_INDEX_DIE_ON_DISK Taylor Blau
2023-04-10 22:53 ` [PATCH 5/7] pack-revindex: introduce `pack.readReverseIndex` Taylor Blau
2023-04-10 22:53 ` [PATCH 6/7] config: enable `pack.writeReverseIndex` by default Taylor Blau
2023-04-13 16:14   ` Junio C Hamano
2023-04-10 22:53 ` [PATCH 7/7] t: invert `GIT_TEST_WRITE_REV_INDEX` Taylor Blau
2023-04-11 13:51   ` Derrick Stolee
2023-04-11 21:33     ` Taylor Blau
2023-04-12 17:37       ` Derrick Stolee
2023-04-11 13:54 ` [PATCH 0/7] pack-revindex: enable on-disk reverse indexes by default Derrick Stolee
2023-04-11 21:40   ` Taylor Blau
2023-04-12 17:39     ` Derrick Stolee
2023-04-12 22:20 ` [PATCH v2 " Taylor Blau
2023-04-12 22:20   ` [PATCH v2 1/7] pack-write.c: plug a leak in stage_tmp_packfiles() Taylor Blau
2023-04-12 22:20   ` [PATCH v2 2/7] t5325: mark as leak-free Taylor Blau
2023-04-12 22:20   ` [PATCH v2 3/7] pack-revindex: make `load_pack_revindex` take a repository Taylor Blau
2023-04-12 22:20   ` [PATCH v2 4/7] pack-revindex: introduce GIT_TEST_REV_INDEX_DIE_ON_DISK Taylor Blau
2023-04-12 22:20   ` [PATCH v2 5/7] pack-revindex: introduce `pack.readReverseIndex` Taylor Blau
2023-04-12 22:20   ` [PATCH v2 6/7] config: enable `pack.writeReverseIndex` by default Taylor Blau
2023-04-12 22:20   ` [PATCH v2 7/7] t: invert `GIT_TEST_WRITE_REV_INDEX` Taylor Blau
2023-04-13 13:40   ` [PATCH v2 0/7] pack-revindex: enable on-disk reverse indexes by default Derrick Stolee

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).