Git development
 help / color / mirror / Atom feed
* [PATCH 3/3] t7421: eliminate 'grep' check in t7421.4 for mingw compatibility
From: Shourya Shukla @ 2020-08-25 11:30 UTC (permalink / raw)
  To: git
  Cc: gitster, christian.couder, kaartic.sivaraam, Johannes.Schindelin,
	peff, liu.denton, Shourya Shukla, Christian Couder
In-Reply-To: <20200825113020.71801-1-shouryashukla.oo@gmail.com>

The 'grep' check in test 4 of t7421 resulted in the failure of t7421 on
Windows due to a different error message

    error: cannot spawn git: No such file or directory

instead of

    fatal: exec 'rev-parse': cd to 'my-subm' failed: No such file or directory

Tighten up the check to compute '{src,dst}_abbrev' by guarding the
'verify_submodule_committish()' call using `p->status !='D'`, so that
the former isn't called in case of non-existent submodule directory,
consequently, there is no such error message on any execution
environment.

Therefore, eliminate the 'grep' check in t7421. Instead, verify the
absence of an error message by doing a 'test_must_be_empty' on the
file containing the error.

Reported-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Helped-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Mentored-by: Christian Couder <chriscool@tuxfamily.org>
Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Signed-off-by: Shourya Shukla <shouryashukla.oo@gmail.com>
---
 builtin/submodule--helper.c      | 7 ++++---
 t/t7421-submodule-summary-add.sh | 2 +-
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/builtin/submodule--helper.c b/builtin/submodule--helper.c
index 93d0700891..f1951680f7 100644
--- a/builtin/submodule--helper.c
+++ b/builtin/submodule--helper.c
@@ -1035,7 +1035,7 @@ static void print_submodule_summary(struct summary_cb *info, char *errmsg,
 static void generate_submodule_summary(struct summary_cb *info,
 				       struct module_cb *p)
 {
-	char *displaypath, *src_abbrev, *dst_abbrev;
+	char *displaypath, *src_abbrev = NULL, *dst_abbrev = NULL;
 	int missing_src = 0, missing_dst = 0;
 	char *errmsg = NULL;
 	int total_commits = -1;
@@ -1061,8 +1061,9 @@ static void generate_submodule_summary(struct summary_cb *info,
 	}
 
 	if (S_ISGITLINK(p->mod_src)) {
-		src_abbrev = verify_submodule_committish(p->sm_path,
-							 oid_to_hex(&p->oid_src));
+		if (p->status != 'D')
+			src_abbrev = verify_submodule_committish(p->sm_path,
+								 oid_to_hex(&p->oid_src));
 		if (!src_abbrev) {
 			missing_src = 1;
 			/*
diff --git a/t/t7421-submodule-summary-add.sh b/t/t7421-submodule-summary-add.sh
index 59a9b00467..b070f13714 100755
--- a/t/t7421-submodule-summary-add.sh
+++ b/t/t7421-submodule-summary-add.sh
@@ -58,7 +58,7 @@ test_expect_success 'submodule summary output for submodules with changed paths'
 	git commit -m "change submodule path" &&
 	rev=$(git -C sm rev-parse --short HEAD^) &&
 	git submodule summary HEAD^^ -- my-subm >actual 2>err &&
-	grep "fatal:.*my-subm" err &&
+	test_must_be_empty err &&
 	cat >expected <<-EOF &&
 	* my-subm ${rev}...0000000:
 
-- 
2.28.0


^ permalink raw reply related

* [PATCH 1/3] submodule: eliminate unused parameters from print_submodule_summary()
From: Shourya Shukla @ 2020-08-25 11:30 UTC (permalink / raw)
  To: git
  Cc: gitster, christian.couder, kaartic.sivaraam, Johannes.Schindelin,
	peff, liu.denton, Shourya Shukla, Christian Couder
In-Reply-To: <20200825113020.71801-1-shouryashukla.oo@gmail.com>

Eliminate the parameters 'missing_{src,dst}' from the
'print_submodule_summary()' function call since they are not used
anywhere in the function.

Reported-by: Jeff King <peff@peff.net>
Mentored-by: Christian Couder <chriscool@tuxfamily.org>
Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Signed-off-by: Shourya Shukla <shouryashukla.oo@gmail.com>
---
 builtin/submodule--helper.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/builtin/submodule--helper.c b/builtin/submodule--helper.c
index 63ea39025d..b83f840251 100644
--- a/builtin/submodule--helper.c
+++ b/builtin/submodule--helper.c
@@ -982,7 +982,6 @@ static char* verify_submodule_committish(const char *sm_path,
 static void print_submodule_summary(struct summary_cb *info, char* errmsg,
 				    int total_commits, const char *displaypath,
 				    const char *src_abbrev, const char *dst_abbrev,
-				    int missing_src, int missing_dst,
 				    struct module_cb *p)
 {
 	if (p->status == 'T') {
@@ -1154,8 +1153,7 @@ static void generate_submodule_summary(struct summary_cb *info,
 
 	print_submodule_summary(info, errmsg, total_commits,
 				displaypath, src_abbrev,
-				dst_abbrev, missing_src,
-				missing_dst, p);
+				dst_abbrev, p);
 
 	free(displaypath);
 	free(src_abbrev);
-- 
2.28.0


^ permalink raw reply related

* [PATCH 2/3] submodule: fix style in function definition
From: Shourya Shukla @ 2020-08-25 11:30 UTC (permalink / raw)
  To: git
  Cc: gitster, christian.couder, kaartic.sivaraam, Johannes.Schindelin,
	peff, liu.denton, Shourya Shukla, Christian Couder
In-Reply-To: <20200825113020.71801-1-shouryashukla.oo@gmail.com>

The definitions of 'verify_submodule_committish()' and
'print_submodule_summary()' had wrong styling in terms of the asterisk
placement. Amend them.

Also, the warning printed in case of an unexpected file mode printed the
mode in decimal. Print it in octal for enhanced readability.

Reported-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Mentored-by: Christian Couder <chriscool@tuxfamily.org>
Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Shourya Shukla <shouryashukla.oo@gmail.com>
---
 builtin/submodule--helper.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/builtin/submodule--helper.c b/builtin/submodule--helper.c
index b83f840251..93d0700891 100644
--- a/builtin/submodule--helper.c
+++ b/builtin/submodule--helper.c
@@ -959,7 +959,7 @@ enum diff_cmd {
 	DIFF_FILES
 };
 
-static char* verify_submodule_committish(const char *sm_path,
+static char *verify_submodule_committish(const char *sm_path,
 					 const char *committish)
 {
 	struct child_process cp_rev_parse = CHILD_PROCESS_INIT;
@@ -979,7 +979,7 @@ static char* verify_submodule_committish(const char *sm_path,
 	return strbuf_detach(&result, NULL);
 }
 
-static void print_submodule_summary(struct summary_cb *info, char* errmsg,
+static void print_submodule_summary(struct summary_cb *info, char *errmsg,
 				    int total_commits, const char *displaypath,
 				    const char *src_abbrev, const char *dst_abbrev,
 				    struct module_cb *p)
@@ -1056,7 +1056,7 @@ static void generate_submodule_summary(struct summary_cb *info,
 		} else {
 			/* for a submodule removal (mode:0000000), don't warn */
 			if (p->mod_dst)
-				warning(_("unexpected mode %d\n"), p->mod_dst);
+				warning(_("unexpected mode %o\n"), p->mod_dst);
 		}
 	}
 
-- 
2.28.0


^ permalink raw reply related

* [GSoC][PATCH 0/3] submodule: fixup to summary-v3
From: Shourya Shukla @ 2020-08-25 11:30 UTC (permalink / raw)
  To: git
  Cc: gitster, christian.couder, kaartic.sivaraam, Johannes.Schindelin,
	peff, liu.denton, Shourya Shukla

Greetings,

The v3 of 'git submodule summary' port to C is currently in 'next'
branch of git/git. Recently, the patch recieved some comments from
Peff, Dscho and Kaartic:

    1. The definition of 'print_submodule_summary()' contained two
       unused parameters namely 'missing_src' and 'missing_dst'. Hence,
       I had to eliminate them as covered in the commit a22ffa950f
       a22ffa950f (submodule: eliminate unused parameters from
       print_submodule_summary(), 2020-08-21). Reported by Peff.
       Junio also advised to make the output in case of an unexpected
       file mode a bit more user friendly by outputting an octal instead
       of a decimal.

    2. The function definitions of 'verify_submodule_committish()' and
       'print_submodule_summary()' had wrong styling in terms of the
       asterisk placement. Hence it was fixed in 32934998ee (submodule:
       fix style in function definition, 2020-08-22). Reported by
       Kaartic.

    2. The test script 't7421-submodule-summary-add.sh' failed in
       Windows due to failure of t7421.4. Precisely, the 'test_i18ngrep'
       check failed on Windows since the error message which was being
       grepped was different on Windows; it was designed to work on
       Linux. Therefore, we had to eliminate the grep check in t7421.4
       and replace it with a check to see if there is any error message
       or not using 'test_must_be_empty'. Also, to support this change,
       we had to make some small changes in 'print_submodule_summary()'
       function. The call to verify_submodule_committish()' had to be
       guarded using 'p->status !=D' so that it isn't called when the SM
       directory does not exist, therefore, the error message is not
       displayed. This resulted in 82e0956cd2 (t7421: eliminate 'grep'
       check in t7421.4 for mingw compatibility, 2020-08-22). Reported
       by Dscho.

summary-v3: https://lore.kernel.org/git/20200812194404.17028-1-shouryashukla.oo@gmail.com/
Peff's comment: https://lore.kernel.org/git/20200818020838.GA1872632@coredump.intra.peff.net/
Dscho' comment: https://lore.kernel.org/git/nycvar.QRO.7.76.6.2008211708280.56@tvgsbejvaqbjf.bet/
Kaartic's comment: https://lore.kernel.org/git/377b1a2ad60c5ca30864f48c5921ff89b5aca65b.camel@gmail.com/
Junio's comment regarding unexpected file mode: https://lore.kernel.org/git/xmqqo8n053r6.fsf@gitster.c.googlers.com/

Feedback and reviews are appreciated.

Regards,
Shourya Shukla

Shourya Shukla (3):
  submodule: eliminate unused parameters from print_submodule_summary()
  submodule: fix style in function definition
  t7421: eliminate 'grep' check in t7421.4 for mingw compatibility

 builtin/submodule--helper.c      | 17 ++++++++---------
 t/t7421-submodule-summary-add.sh |  2 +-
 2 files changed, 9 insertions(+), 10 deletions(-)

-- 
2.28.0


^ permalink raw reply

* Re: [PATCH 0/3] Optionally skip linking/copying the built-ins
From: Johannes Schindelin @ 2020-08-25  8:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, git
In-Reply-To: <xmqqzh6j51j0.fsf@gitster.c.googlers.com>

Hi Junio,

On Mon, 24 Aug 2020, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> >> It is close ;-)
> >>
> >> The author name is correctly on "From:" but not the address.
> >
> > Yes, but the problem seems to be insurmountable, as I _think_ it is to
> > prevent spammers from successfully sending "from abitrary email
> > addresses".
>
> At least, even with only the name correction, the threads were
> easier to locate.  Perhaps you can leave the in-body From: in to
> help "git am" but keep the half-successful attempt to give the human
> readable name to humans who are reading in their MUA?

But if all you're interested in is the part before the actual email
address, isn't "Johannes Schindelin via GitGitGadget
<gitgitgadget@gmail.com>" almost identical to "Johannes Schindelin
<gitgitgadget@gmail.com>"?

Sorry, I seem to be slow understanding you :-(

Ciao,
Dscho

^ permalink raw reply

* Re: [PATCH v3 03/11] commit-graph: consolidate fill_commit_graph_info
From: Jakub Narębski @ 2020-08-25 11:11 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau,
	Jakub Narębski
In-Reply-To: <20200821041124.GA39355@Abhishek-Arch>

Hello,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Wed, Aug 19, 2020 at 07:54:20PM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>>
>>> Both fill_commit_graph_info() and fill_commit_in_graph() parse
>>> information present in commit data chunk. Let's simplify the
>>> implementation by calling fill_commit_graph_info() within
>>> fill_commit_in_graph().
>>>
>>> The test 'generate tar with future mtime' creates a commit with commit
>>> time of (2 ^ 36 + 1) seconds since EPOCH. The commit time overflows into
>>> generation number (within CDAT chunk) and has undefined behavior.
>>>
>>> The test used to pass
>>
>> Could you please tell us how does this test starts to fail without the
>> change to the test described there?  What is the error message, etc.?
>>
>
> Here's what I revised the commit message to:
>
> commit-graph: consolidate fill_commit_graph_info
>
> Both fill_commit_graph_info() and fill_commit_in_graph() parse
> information present in commit data chunk. Let's simplify the
> implementation by calling fill_commit_graph_info() within
> fill_commit_in_graph().

All right.

We might want to add here the information that we also move loading the
commit date from the commit-graph file from fill_commit_in_graph() down
the [new] call chain into fill_commit_graph_info().  The commit date
would be needed in fill_commit_graph_info() in the next commit to
compute corrected commit date out of corrected commit date offset, and
store it as generation number.


NOTE that this means that if we switch to storing 64-bit corrected
commit date directly in the commit-graph file, instead of storing 32-bit
offsets, neither this Move Statement Into Function Out of Caller
refactoring nor change to the 'generate tar with future mtime' test
would be necessary.

>
> The test 'generate tar with future mtime' creates a commit with commit
> time of (2 ^ 36 + 1) seconds since EPOCH. The CDAT chunk provides
> 34-bits for storing commiter date, thus committer time overflows into
> generation number (within CDAT chunk) and has undefined behavior.
>
> The test used to pass as fill_commit_graph_info() would not set struct
> member `date` of struct commit and loads committer date from the object
> database, generating a tar file with the expected mtime.

I guess that in the case of generating a tar file we would read the
commit out of 'object database', and then only add commit-graph specific
info with fill_commit_graph_info().  Possibly because we need more
information that commit-graph provides for a commit.

>
> However, with corrected commit date, we will load the committer date
> from CDAT chunk (truncated to lower 34-bits) to populate the generation
> number. Thus, fill_commit_graph_info() sets date and generates tar file
> with the truncated mtime and the test fails.
>
> Let's fix the test by setting a timestamp of (2 ^ 34 - 1) seconds, which
> will not be truncated.

Now I got interested why the value of (2 ^ 36 + 1) seconds since EPOCH
was used.

The commit that introduced the 'generate tar with future mtime' test,
namely e51217e15 (t5000: test tar files that overflow ustar headers,
30-06-2016), says:

	The ustar format only has room for 11 (or 12, depending on
	some implementations) octal digits for the size and mtime of
	each file. For values larger than this, we have to add pax
	extended headers to specify the real data, and git does not
	yet know how to do so.

	Before fixing that, let's start off with some test
	infrastructure [...]

The value of 2 ^ 36 equals 2 ^ 3*12 = (2 ^ 3) ^ 12 = 8 ^ 12.
So we need the value of (2 ^ 36 + 1) for this test do do its job.
Possibly the value of 8 ^ 11 + 1 = 2 ^ 33 + 1 would be enough
(if we skip testing "some implementations").

So I think to make this test more clear (for inquisitive minds) we
should set a timestamp of (2 ^ 33 + 1), not (2 ^ 34 - 1) seconds
since EPOCH.  Maybe even add a variant of this test that uses the
origial value of (2 ^ 36 + 1) seconds since EPOCH, but turns off
use of serialized commit-graph.

I'm sorry for not checking this earlier.

Best,
-- 
Jakub Narębski

^ permalink raw reply

* [PATCH v2] refs: remove lookup cache for reference-transaction hook
From: Patrick Steinhardt @ 2020-08-25 10:35 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Taylor Blau
In-Reply-To: <0db8ad8cdb69afb9d6453bf60a808e8b82382a4e.1597998473.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 5263 bytes --]

When adding the reference-transaction hook, there were concerns about
the performance impact it may have on setups which do not make use of
the new hook at all. After all, it gets executed every time a reftx is
prepared, committed or aborted, which linearly scales with the number of
reference-transactions created per session. And as there are code paths
like `git push` which create a new transaction for each reference to be
updated, this may translate to calling `find_hook()` quite a lot.

To address this concern, a cache was added with the intention to not
repeatedly do negative hook lookups. Turns out this cache caused a
regression, which was fixed via e5256c82e5 (refs: fix interleaving hook
calls with reference-transaction hook, 2020-08-07). In the process of
discussing the fix, we realized that the cache doesn't really help even
in the negative-lookup case. While performance tests added to benchmark
this did show a slight improvement in the 1% range, this really doesn't
warrent having a cache. Furthermore, it's quite flaky, too. E.g. running
it twice in succession produces the following results:

Test                         master            pks-reftx-hook-remove-cache
--------------------------------------------------------------------------
1400.2: update-ref           2.79(2.16+0.74)   2.73(2.12+0.71) -2.2%
1400.3: update-ref --stdin   0.22(0.08+0.14)   0.21(0.08+0.12) -4.5%

Test                         master            pks-reftx-hook-remove-cache
--------------------------------------------------------------------------
1400.2: update-ref           2.70(2.09+0.72)   2.74(2.13+0.71) +1.5%
1400.3: update-ref --stdin   0.21(0.10+0.10)   0.21(0.08+0.13) +0.0%

One case notably absent from those benchmarks is a single executable
searching for the hook hundreds of times, which is exactly the case for
which the negative cache was added. p1400.2 will spawn a new update-ref
for each transaction and p1400.3 only has a single reference-transaction
for all reference updates. So this commit adds a third benchmark, which
performs an non-atomic push of a thousand references. This will create a
new reference transaction per reference. But even for this case, the
negative cache doesn't consistently improve performance:

Test                         master            pks-reftx-hook-remove-cache
--------------------------------------------------------------------------
1400.4: nonatomic push       6.63(6.50+0.13)   6.81(6.67+0.14) +2.7%
1400.4: nonatomic push       6.35(6.21+0.14)   6.39(6.23+0.16) +0.6%
1400.4: nonatomic push       6.43(6.31+0.13)   6.42(6.28+0.15) -0.2%

So let's just remove the cache altogether to simplify the code.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---

The only change compared to v1 is that I've addressed the unportable
`branch-{1..1000}` syntax in favor of `test_seq`. I had to setup refs as
part of the setup and change the ordering for "update-ref --stdin" from
create/update/delete to update/delete/create, but I don't think that's
too bad. At least timings didn't seem to really change because of that.

 refs.c                     | 11 ++---------
 t/perf/p1400-update-ref.sh | 13 ++++++++++---
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/refs.c b/refs.c
index cf91711968..cb9bfc5c5c 100644
--- a/refs.c
+++ b/refs.c
@@ -1924,24 +1924,17 @@ int ref_update_reject_duplicates(struct string_list *refnames,
 	return 0;
 }
 
-static const char hook_not_found;
-static const char *hook;
-
 static int run_transaction_hook(struct ref_transaction *transaction,
 				const char *state)
 {
 	struct child_process proc = CHILD_PROCESS_INIT;
 	struct strbuf buf = STRBUF_INIT;
+	const char *hook;
 	int ret = 0, i;
 
-	if (hook == &hook_not_found)
-		return ret;
+	hook = find_hook("reference-transaction");
 	if (!hook)
-		hook = xstrdup_or_null(find_hook("reference-transaction"));
-	if (!hook) {
-		hook = &hook_not_found;
 		return ret;
-	}
 
 	strvec_pushl(&proc.args, hook, state, NULL);
 	proc.in = -1;
diff --git a/t/perf/p1400-update-ref.sh b/t/perf/p1400-update-ref.sh
index d275a81248..ce5ac3ed85 100755
--- a/t/perf/p1400-update-ref.sh
+++ b/t/perf/p1400-update-ref.sh
@@ -7,11 +7,13 @@ test_description="Tests performance of update-ref"
 test_perf_fresh_repo
 
 test_expect_success "setup" '
+	git init --bare target-repo.git &&
 	test_commit PRE &&
 	test_commit POST &&
 	printf "create refs/heads/%d PRE\n" $(test_seq 1000) >create &&
 	printf "update refs/heads/%d POST PRE\n" $(test_seq 1000) >update &&
-	printf "delete refs/heads/%d POST\n" $(test_seq 1000) >delete
+	printf "delete refs/heads/%d POST\n" $(test_seq 1000) >delete &&
+	git update-ref --stdin <create
 '
 
 test_perf "update-ref" '
@@ -24,9 +26,14 @@ test_perf "update-ref" '
 '
 
 test_perf "update-ref --stdin" '
-	git update-ref --stdin <create &&
 	git update-ref --stdin <update &&
-	git update-ref --stdin <delete
+	git update-ref --stdin <delete &&
+	git update-ref --stdin <create
+'
+
+test_perf "nonatomic push" '
+	git push ./target-repo.git $(test_seq 1000) &&
+	git push --delete ./target-repo.git $(test_seq 1000)
 '
 
 test_done
-- 
2.28.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* Re: [PATCH v3 07/11] commit-graph: implement corrected commit date
From: Jakub Narębski @ 2020-08-25 10:07 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau,
	Jakub Narębski
In-Reply-To: <20200825064954.GA645690@Abhishek-Arch>

Hello,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Sat, Aug 22, 2020 at 02:05:41AM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
[...]
>>> To minimize the space required to store corrected commit date, Git
>>> stores corrected commit date offsets into the commit-graph file. The
>>> corrected commit date offset for a commit is defined as the difference
>>> between its corrected commit date and actual commit date.
>>
>> Perhaps we should add more details about data type sizes in question.
>
> Will add.

Note however that we need to solve the problem of storing values which
are not monotonic wrt. parent relation (partial order) in limited disk
space, that is GENERATION_NUMBER_V2_OFFSET_MAX vs GENERATION_NUMBER_MAX;
see comments in 11/11 and 00/11.

>>
>> Storing corrected commit date requires sizeof(timestamp_t) bytes, which
>> in most cases is 64 bits (uintmax_t).  However corrected commit date
>> offsets can be safely stored^* using only 32 bits.  This halves the size
>> of GDAT chunk, reducing per-commit storage from 2*H + 16 + 8 bytes to
>> 2*H + 16 + 4 bytes, which is reduction of around 6%, not including
>> header, fanout table (OIDF) and extra edges list (EDGE).
>>
>> Which might mean that the extra complication is not worth it, and we
>> should store corrected commit date directly instead.
>>
>> *) unless for example one of commits is malformed but valid,
>>    and has committerdate of 0 Unix time, 1 January 1970.

See above.

>>> While Git does not write out offsets at this stage, Git stores the
>>> corrected commit dates in member generation of struct commit_graph_data.
>>> It will begin writing commit date offsets with the introduction of
>>> generation data chunk.
>>
>> OK, so the agenda for introducing geeration number v2 is as follows:
>> - compute generation numbers v2, i.e. corrected commit date
>> - store corrected commit date [offsets] in new GDAT chunk,
>>   unless backward-compatibility concerns require us to not to
>> - load [and compute] corrected commit date from commit-graph
>>   storing it as 'generation' field of `struct commit_graph_data`,
>>   unless backward-compatibility concerns require us to store
>>   topological levels (generation number v1) in there instead
>>
>
> The last point is not correct. We always store topological levels into
> the topo_levels slab introduced and always store corrected commit date
> into data->generation, regardless of backward compatibility concerns.

I think I was not clear enough (in trying to be brief).  I meant here
loading available generation numbers for use in graph traversal,
done in later patches in this series.

In _next_ commit we store topological levels in `generation` field:

  @@ -755,7 +763,11 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
   	date_low = get_be32(commit_data + g->hash_len + 12);
   	item->date = (timestamp_t)((date_high << 32) | date_low);

  -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
  +	if (g->chunk_generation_data)
  +		graph_data->generation = item->date +
  +			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
  +	else
  +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;


We use topo_level slab only when writing the commit-graph file.

> We could avoid initializing topo_slab if we are not writing generation
> data chunk (and thus don't need corrected commit dates) but that
> wouldn't have an impact on run time while writing commit-graph because
> computing corrected commit dates is cheap as the main cost is in walking
> the graph and writing the file.

Right.

Though you need to add the cost of allocation and managing extra
commit slab, I think that amortized cost is negligible.

But what would be better is showing benchmark data: does writing the
commit graph without GDAT take not insigificant more time than without
this patch?

[...]
>>> @@ -2372,8 +2384,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>>>  	for (i = 0; i < g->num_commits; i++) {
>>>  		struct commit *graph_commit, *odb_commit;
>>>  		struct commit_list *graph_parents, *odb_parents;
>>> -		timestamp_t max_generation = 0;
>>> -		timestamp_t generation;
>>> +		timestamp_t max_corrected_commit_date = 0;
>>> +		timestamp_t corrected_commit_date;
>>
>> This is simple, and perhaps unnecessary, rename of variables.
>> Shouldn't we however verify *both* topological level, and
>> (if exists) corrected commit date?
>
> The problem with verifying both topological level and corrected commit
> dates is that we would have to re-fill commit_graph_data slab with commit
> data chunk as we cannot modify data->generation otherwise, essentially
> repeating the whole verification process.
>
> While it's okay for now, I might take this up in a future series [1].
>
> [1]: https://lore.kernel.org/git/4043ffbc-84df-0cd6-5c75-af80383a56cf@gmail.com/

All right, I believe you that verifying both topological level and
corrected commit date would be more difficult.

That doesn't change the conclusion that this variable should remain to
be named `generation`, as when verifying GDAT-less commit-graph files it
would check topological levels (it uses commit_graph_generation(), which
in turn uses `generation` field in commit graph info, which as I have
show above in later patch could be v1 or v2 generation number).

Best,
-- 
Jakub Narębski

^ permalink raw reply

* Re: [PATCH v3 06/11] commit-graph: add a slab to store topological levels
From: Jakub Narębski @ 2020-08-25  7:56 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau
In-Reply-To: <855z97dvsp.fsf@gmail.com>

On Tue, 25 Aug 2020 at 09:33, Jakub Narębski <jnareb@gmail.com> wrote:
> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> > On Fri, Aug 21, 2020 at 08:43:38PM +0200, Jakub Narębski wrote:
> >> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >>
> >>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> [...]
> >>> @@ -1335,11 +1341,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >>>                                     _("Computing commit graph generation numbers"),
> >>>                                     ctx->commits.nr);
> >>>     for (i = 0; i < ctx->commits.nr; i++) {
> >>> -           uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
> >>> +           uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> >>
> >> All right, so that is why this 'generation' variable was not converted
> >> to timestamp_t type.
> >>
> >>>
> >>>             display_progress(ctx->progress, i + 1);
> >>> -           if (generation != GENERATION_NUMBER_V1_INFINITY &&
> >>> -               generation != GENERATION_NUMBER_ZERO)
> >>> +           if (level != GENERATION_NUMBER_V1_INFINITY &&
> >>> +               level != GENERATION_NUMBER_ZERO)
> >>>                     continue;
> >>
> >> Here we use GENERATION_NUMBER*_INFINITY to check if the commit is
> >> outside commit-graph files, and therefore we would need its topological
> >> level computed.
> >>
> >> However, I don't understand how it works.  We have had created the
> >> commit_graph_data_at() and use it instead of commit_graph_data_slab_at()
> >> to provide default values for `struct commit_graph`... but only for
> >> `graph_pos` member.  It is commit_graph_generation() that returns
> >> GENERATION_NUMBER_INFINITY for commits not in graph.
> >>
> >> But neither commit_graph_data_at()->generation nor topo_level_slab_at()
> >> handles this special case, so I don't see how 'generation' variable can
> >> *ever* be GENERATION_NUMBER_INFINITY, and 'level' variable can ever be
> >> GENERATION_NUMBER_V1_INFINITY for commits not in graph.
> >>
> >> Does it work *accidentally*, because the default value for uninitialized
> >> data on commit-slab is 0, which matches GENERATION_NUMBER_ZERO?  It
> >> certainly looks like it does.  And GENERATION_NUMBER_ZERO is an artifact
> >> of commit-graph feature development history, namely the short time where
> >> Git didn't use any generation numbers and stored 0 in the place set for
> >> it in the commit-graph format...  On the other hand this is not the case
> >> for corrected commit date (generation number v2), as it could
> >> "legitimately" be 0 if some root commit (without any parents) had
> >> committerdate of epoch 0, i.e. 1 January 1970 00:00:00 UTC, perhaps
> >> caused by malformed but valid commit object.
> >>
> >> Ugh...
> >
> > It works accidentally.
> >
> > Our decision to avoid the cost of initializing both
> > commit_graph_data->generation and commit_graph_data->graph_pos has
> > led to some unwieldy choices - the complexity of helper functions,
> > bypassing helper functions when writing a commit-graph file [1].
> >
> > I want to re-visit how commit_graph_data slab is defined in a future series.
> >
> > [1]: https://lore.kernel.org/git/be28ab7b-0ae4-2cc5-7f2b-92075de3723a@gmail.com/
>
> All right, we might want to make use of the fact that the value of 0 for
> topological level here always mean that its value for a commit needs to
> be computed, that 0 is not a valid value for topological levels.
> - if the value 0 came from commit-graph file, it means that it came
>   from Git version that used commit-graph but didn't compute generation
>   numbers; the value is GENERATION_NUMBER_ZERO
> - the value 0 might came from the fact that commit is not in graph,
>   and that commit-slab zero-initializes the values stored; let's
>   call this value GENERATION_NUMBER_UNINITIALIZED
>
> If we ensure that corrected commit date can never be zero (which is
> extremely unlikely, as one of root commits would have to be malformed or
> written on badly misconfigured computer, with value of 0 for committer
> timestamp), then this "happy accident" can keep working.
>
>   As a special case, commit date with timestamp of zero (01.01.1970 00:00:00Z)
>   has corrected commit date of one, to be able to distinguish
>   uninitialized values.
>
> Or something like that.
>
> Actually, it is not even necessary, as corrected commit date of 0 just
> means that this single value (well, for every root commit with commit
> date of 0) would be unnecessary recomputed in compute_generation_numbers().
>
> Anyway, we would want to document this fact in the commit message.

Alternatively, instead of comparing 'level' (and later in series also
'corrected_commit_date') against GENERATION_NUMBER_INFINITY,
we could load at no extra cost `graph_pos` value and compare it
against COMMIT_NOT_FROM_GRAPH.

But with this solution we could never get rid of graph_pos, if we
think it is unnecessary. If we split commit_graph_data into separate
slabs (as it was in early versions of respective patch series), we
would have to pay additional cost.

But it is an alternative.

Best,
-- 
Jakub Narębski

^ permalink raw reply

* Re: [PATCH] builtin/repack.c: invalidate MIDX only when necessary
From: Son Luong Ngoc @ 2020-08-25  7:55 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee
In-Reply-To: <ef9186a8df0d712c2ecccbe62cb43a7abadb9c96.1598320716.git.me@ttaylorr.com>

Hi Taylor,

Thanks for working on this.

> On Aug 25, 2020, at 04:01, Taylor Blau <me@ttaylorr.com> wrote:
> 
> In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack'
> learned to remove a multi-pack-index file if it added or removed a pack
> from the object store.
> 
> This mechanism is a little over-eager, since it is only necessary to
> drop a MIDX if 'git repack' removes a pack that the MIDX references.
> Adding a pack outside of the MIDX does not require invalidating the
> MIDX, and likewise for removing a pack the MIDX does not know about.

I wonder if its worth to trigger write_midx_file() to update the midx instead of
just removing MIDX?

That is already the direction we are taking in the 'maintenance' patch series
whenever the multi-pack-index file was deemed invalid.

Or perhaps, we can check for 'core.multiPackIndex' value (which recently is 
'true' by default) and determine whether we should remove the MIDX or rewrite
it?

> 
> Teach 'git repack' to check for this by loading the MIDX, and checking
> whether the to-be-removed pack is known to the MIDX. This requires a
> slightly odd alternation to a test in t5319, which is explained with a
> comment.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>

Cheers,
Son Luong.

^ permalink raw reply

* Draft of Git Rev News edition 66
From: Christian Couder @ 2020-08-25  7:53 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jakub Narebski, Markus Jansen, Kaartic Sivaraam,
	Jeff King, Johannes Schindelin, Chris Torek, Elijah Newren,
	Eric Sunshine

Hi everyone!

A draft of a new Git Rev News edition is available here:

  https://github.com/git/git.github.io/blob/master/rev_news/drafts/edition-66.md

Everyone is welcome to contribute in any section either by editing the
above page on GitHub and sending a pull request, or by commenting on
this GitHub issue:

  https://github.com/git/git.github.io/issues/453

You can also reply to this email.

In general all kinds of contributions, for example proofreading,
suggestions for articles or links, help on the issues in GitHub, and
so on, are very much appreciated.

I tried to Cc everyone who appears in this edition, but maybe I missed
some people, sorry about that.

Jakub, Markus, Kaartic and me plan to publish this edition on Thursday
August 27th.

Thanks,
Christian.

^ permalink raw reply

* Re: [PATCH v3 06/11] commit-graph: add a slab to store topological levels
From: Jakub Narębski @ 2020-08-25  7:33 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau,
	Jakub Narębski
In-Reply-To: <20200825061418.GA629699@Abhishek-Arch>

Hello,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Fri, Aug 21, 2020 at 08:43:38PM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
[...]
>>> @@ -1335,11 +1341,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>>>  					_("Computing commit graph generation numbers"),
>>>  					ctx->commits.nr);
>>>  	for (i = 0; i < ctx->commits.nr; i++) {
>>> -		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
>>> +		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
>>
>> All right, so that is why this 'generation' variable was not converted
>> to timestamp_t type.
>>
>>>
>>>  		display_progress(ctx->progress, i + 1);
>>> -		if (generation != GENERATION_NUMBER_V1_INFINITY &&
>>> -		    generation != GENERATION_NUMBER_ZERO)
>>> +		if (level != GENERATION_NUMBER_V1_INFINITY &&
>>> +		    level != GENERATION_NUMBER_ZERO)
>>>  			continue;
>>
>> Here we use GENERATION_NUMBER*_INFINITY to check if the commit is
>> outside commit-graph files, and therefore we would need its topological
>> level computed.
>>
>> However, I don't understand how it works.  We have had created the
>> commit_graph_data_at() and use it instead of commit_graph_data_slab_at()
>> to provide default values for `struct commit_graph`... but only for
>> `graph_pos` member.  It is commit_graph_generation() that returns
>> GENERATION_NUMBER_INFINITY for commits not in graph.
>>
>> But neither commit_graph_data_at()->generation nor topo_level_slab_at()
>> handles this special case, so I don't see how 'generation' variable can
>> *ever* be GENERATION_NUMBER_INFINITY, and 'level' variable can ever be
>> GENERATION_NUMBER_V1_INFINITY for commits not in graph.
>>
>> Does it work *accidentally*, because the default value for uninitialized
>> data on commit-slab is 0, which matches GENERATION_NUMBER_ZERO?  It
>> certainly looks like it does.  And GENERATION_NUMBER_ZERO is an artifact
>> of commit-graph feature development history, namely the short time where
>> Git didn't use any generation numbers and stored 0 in the place set for
>> it in the commit-graph format...  On the other hand this is not the case
>> for corrected commit date (generation number v2), as it could
>> "legitimately" be 0 if some root commit (without any parents) had
>> committerdate of epoch 0, i.e. 1 January 1970 00:00:00 UTC, perhaps
>> caused by malformed but valid commit object.
>>
>> Ugh...
>
> It works accidentally.
>
> Our decision to avoid the cost of initializing both
> commit_graph_data->generation and commit_graph_data->graph_pos has
> led to some unwieldy choices - the complexity of helper functions,
> bypassing helper functions when writing a commit-graph file [1].
>
> I want to re-visit how commit_graph_data slab is defined in a future series.
>
> [1]: https://lore.kernel.org/git/be28ab7b-0ae4-2cc5-7f2b-92075de3723a@gmail.com/

All right, we might want to make use of the fact that the value of 0 for
topological level here always mean that its value for a commit needs to
be computed, that 0 is not a valid value for topological levels.
- if the value 0 came from commit-graph file, it means that it came
  from Git version that used commit-graph but didn't compute generation
  numbers; the value is GENERATION_NUMBER_ZERO
- the value 0 might came from the fact that commit is not in graph,
  and that commit-slab zero-initializes the values stored; let's
  call this value GENERATION_NUMBER_UNINITIALIZED

If we ensure that corrected commit date can never be zero (which is
extremely unlikely, as one of root commits would have to be malformed or
written on badly misconfigured computer, with value of 0 for committer
timestamp), then this "happy accident" can keep working.

  As a special case, commit date with timestamp of zero (01.01.1970 00:00:00Z)
  has corrected commit date of one, to be able to distinguish
  uninitialized values.

Or something like that.

Actually, it is not even necessary, as corrected commit date of 0 just
means that this single value (well, for every root commit with commit
date of 0) would be unnecessary recomputed in compute_generation_numbers().

Anyway, we would want to document this fact in the commit message.

Best,
-- 
Jakub Narębski

^ permalink raw reply

* Re: [PATCH v3 07/11] commit-graph: implement corrected commit date
From: Abhishek Kumar @ 2020-08-25  6:49 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, me, stolee
In-Reply-To: <85wo1rk0iy.fsf@gmail.com>

On Sat, Aug 22, 2020 at 02:05:41AM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With most of preparations done, let's implement corrected commit date.
> >
> > The corrected commit date for a commit is defined as:
> >
> > * A commit with no parents (a root commit) has corrected commit date
> >   equal to its committer date.
> > * A commit with at least one parent has corrected commit date equal to
> >   the maximum of its commit date and one more than the largest corrected
> >   commit date among its parents.
> 
> Good.
> 
> >
> > To minimize the space required to store corrected commit date, Git
> > stores corrected commit date offsets into the commit-graph file. The
> > corrected commit date offset for a commit is defined as the difference
> > between its corrected commit date and actual commit date.
> 
> Perhaps we should add more details about data type sizes in question.

Will add.

> 
> Storing corrected commit date requires sizeof(timestamp_t) bytes, which
> in most cases is 64 bits (uintmax_t).  However corrected commit date
> offsets can be safely stored^* using only 32 bits.  This halves the size
> of GDAT chunk, reducing per-commit storage from 2*H + 16 + 8 bytes to
> 2*H + 16 + 4 bytes, which is reduction of around 6%, not including
> header, fanout table (OIDF) and extra edges list (EDGE).
> 
> Which might mean that the extra complication is not worth it, and we
> should store corrected commit date directly instead.
> 
> *) unless for example one of commits is malformed but valid,
>    and has committerdate of 0 Unix time, 1 January 1970.
> 
> >
> > While Git does not write out offsets at this stage, Git stores the
> > corrected commit dates in member generation of struct commit_graph_data.
> > It will begin writing commit date offsets with the introduction of
> > generation data chunk.
> 
> OK, so the agenda for introducing geeration number v2 is as follows:
> - compute generation numbers v2, i.e. corrected commit date
> - store corrected commit date [offsets] in new GDAT chunk,
>   unless backward-compatibility concerns require us to not to
> - load [and compute] corrected commit date from commit-graph
>   storing it as 'generation' field of `struct commit_graph_data`,
>   unless backward-compatibility concerns require us to store
>   topological levels (generation number v1) in there instead
> 

The last point is not correct. We always store topological levels into
the topo_levels slab introduced and always store corrected commit date
into data->generation, regardless of backward compatibility concerns.

We could avoid initializing topo_slab if we are not writing generation
data chunk (and thus don't need corrected commit dates) but that
wouldn't have an impact on run time while writing commit-graph because
computing corrected commit dates is cheap as the main cost is in walking
the graph and writing the file.

> Because the reachability condition for corrected commit date and for
> topological level is exactly the same, we don't need to do anything to
> take advantage of generation number v2.
> 
> Though we can use generation number v2 in more cases, where we turned
> off use of generation numbers because v1 gave worse performance than
> date heuristics.
> 
> Did I got this right?
> 
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c | 58 +++++++++++++++++++++++++++-----------------------
> >  1 file changed, 31 insertions(+), 27 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index a2f15b2825..fd69534dd5 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -169,11 +169,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
> >  	else if (generation_a > generation_b)
> >  		return 1;
> >
> > -	/* use date as a heuristic when generations are equal */
> > -	if (a->date < b->date)
> > -		return -1;
> > -	else if (a->date > b->date)
> > -		return 1;
> 
> At first I was wondering why this tie-breaking is beig removed; wouldn't
> be needed for backward-compatibility?  But then I remembered that this
> comparison function is used _only_ for sorting commits when writing
> Bloom filters, for `git commit-graph write --reachable --changed-paths ...`
> 
> Assuming that when writing the commit graph we always compute geeration
> number v2 and 'generation' field stores corrected commit date, we don't
> need to use date as a heuristic when generations are equal, and it would
> not help in tie-breaking anyway.
> 
> All right.
> 
> >  	return 0;
> >  }
> >
> > @@ -1342,10 +1337,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  					ctx->commits.nr);
> >  	for (i = 0; i < ctx->commits.nr; i++) {
> >  		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> > +		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
> 
> All right, so the pattern is to add 'corrected_commit_date' stuff after
> 'topological_level' stuff.
> 
> >
> >  		display_progress(ctx->progress, i + 1);
> >  		if (level != GENERATION_NUMBER_V1_INFINITY &&
> > -		    level != GENERATION_NUMBER_ZERO)
> > +		    level != GENERATION_NUMBER_ZERO &&
> > +		    corrected_commit_date != GENERATION_NUMBER_INFINITY &&
> > +		    corrected_commit_date != GENERATION_NUMBER_ZERO
> > +		    )
> >  			continue;
> >
> >  		commit_list_insert(ctx->commits.list[i], &list);
> > @@ -1354,17 +1353,26 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  			struct commit_list *parent;
> >  			int all_parents_computed = 1;
> >  			uint32_t max_level = 0;
> > +			timestamp_t max_corrected_commit_date = 0;
> >
> >  			for (parent = current->parents; parent; parent = parent->next) {
> >  				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> > +				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
> >
> >  				if (level == GENERATION_NUMBER_V1_INFINITY ||
> > -				    level == GENERATION_NUMBER_ZERO) {
> > +				    level == GENERATION_NUMBER_ZERO ||
> > +				    corrected_commit_date == GENERATION_NUMBER_INFINITY ||
> > +				    corrected_commit_date == GENERATION_NUMBER_ZERO
> > +				    ) {
> >  					all_parents_computed = 0;
> >  					commit_list_insert(parent->item, &list);
> >  					break;
> > -				} else if (level > max_level) {
> > -					max_level = level;
> > +				} else {
> > +					if (level > max_level)
> > +						max_level = level;
> > +
> > +					if (corrected_commit_date > max_corrected_commit_date)
> > +						max_corrected_commit_date = corrected_commit_date;
> >  				}
> >  			}
> >
> > @@ -1374,6 +1382,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  				if (max_level > GENERATION_NUMBER_MAX - 1)
> >  					max_level = GENERATION_NUMBER_MAX - 1;
> >  				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> > +
> > +				if (current->date > max_corrected_commit_date)
> > +					max_corrected_commit_date = current->date - 1;
> > +				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
> >  			}
> >  		}
> >  	}
> 
> All right.  Looks good to me.
> 
> > @@ -2372,8 +2384,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> >  	for (i = 0; i < g->num_commits; i++) {
> >  		struct commit *graph_commit, *odb_commit;
> >  		struct commit_list *graph_parents, *odb_parents;
> > -		timestamp_t max_generation = 0;
> > -		timestamp_t generation;
> > +		timestamp_t max_corrected_commit_date = 0;
> > +		timestamp_t corrected_commit_date;
> 
> This is simple, and perhaps unnecessary, rename of variables.
> Shouldn't we however verify *both* topological level, and
> (if exists) corrected commit date?

The problem with verifying both topological level and corrected commit
dates is that we would have to re-fill commit_graph_data slab with commit
data chunk as we cannot modify data->generation otherwise, essentially
repeating the whole verification process.

While it's okay for now, I might take this up in a future series [1].

[1]: https://lore.kernel.org/git/4043ffbc-84df-0cd6-5c75-af80383a56cf@gmail.com/

> 
> >
> >  		display_progress(progress, i + 1);
> >  		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
> > @@ -2412,9 +2424,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> >  					     oid_to_hex(&graph_parents->item->object.oid),
> >  					     oid_to_hex(&odb_parents->item->object.oid));
> >
> > -			generation = commit_graph_generation(graph_parents->item);
> > -			if (generation > max_generation)
> > -				max_generation = generation;
> > +			corrected_commit_date = commit_graph_generation(graph_parents->item);
> > +			if (corrected_commit_date > max_corrected_commit_date)
> > +				max_corrected_commit_date = corrected_commit_date;
> 
> Actually, commit_graph_generation(<commit>) can return either corrected
> commit date, or topological level, the latter in backward-compatibility
> case (if at least one commit-graph file is lacking GDAT chunk, because
> [some of] it was created by the "Old" Git).
> 
> >
> >  			graph_parents = graph_parents->next;
> >  			odb_parents = odb_parents->next;
> > @@ -2436,20 +2448,12 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> >  		if (generation_zero == GENERATION_ZERO_EXISTS)
> >  			continue;
> >
> > -		/*
> > -		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
> > -		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
> > -		 * extra logic in the following condition.
> > -		 */
> > -		if (max_generation == GENERATION_NUMBER_MAX)
> > -			max_generation--;
> 
> All right, this was needed for checking the correctness of topological
> levels (generation number v1) because we were checking not that it
> fullfills the reachability condition, but more strict one: namely that
> topological level of commit is equal to maximum of topological levels of
> its parents plus one.
> 
> The comment about checking both generation number v1 and v2 still
> applies.
> 
> > -
> > -		generation = commit_graph_generation(graph_commit);
> > -		if (generation != max_generation + 1)
> > -			graph_report(_("commit-graph generation for commit %s is %u != %u"),
> > +		corrected_commit_date = commit_graph_generation(graph_commit);
> > +		if (corrected_commit_date < max_corrected_commit_date + 1)
> > +			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
> >  				     oid_to_hex(&cur_oid),
> > -				     generation,
> > -				     max_generation + 1);
> > +				     corrected_commit_date,
> > +				     max_corrected_commit_date + 1);
> 
> All right, we check less strict condition for corrected commit date.
> 
> >
> >  		if (graph_commit->date != odb_commit->date)
> >  			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),
> 
> Best,
> -- 
> Jakub Narębski

^ permalink raw reply

* Re: [PATCH v3 06/11] commit-graph: add a slab to store topological levels
From: Abhishek Kumar @ 2020-08-25  6:14 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee
In-Reply-To: <85d03jlu05.fsf@gmail.com>

On Fri, Aug 21, 2020 at 08:43:38PM +0200, Jakub Narębski wrote:
> Hello,
> 
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > As we are writing topological levels to commit data chunk to ensure
> > backwards compatibility with "Old" Git and the member `generation` of
> > struct commit_graph_data will store corrected commit date in a later
> > commit, let's introduce a commit-slab to store topological levels while
> > writing commit-graph.
> 
> In my opinion the above it would be easier to follow if rephrased in the
> following way:
> 
>   In a later commit we will introduce corrected commit date as the
>   generation number v2.  This value will be stored in the new separate
>   GDAT chunk.  However to ensure backwards compatibility with "Old" Git
>   we need to continue to write generation number v1, which is
>   topological level, to the commit data chunk (CDAT).  This means that
>   we need to compute both versions of generation numbers when writing
>   the commit-graph file.  Let's therefore introduce a commit-slab
>   to store topological levels; corrected commit date will be stored
>   in the member `generation` of struct commit_graph_data.
> 
> What do you think?
> 

Yes, that's better.

> 
> By the way, do I understand it correctly that in backward-compatibility
> mode (that is, in mixed-version environment where at least some
> commit-graph files were written by "Old" Git and are lacking GDAT chunk
> and generation number v2 data) the `generation` member of commit graph
> data chunk will be populated and will store generation number v1, that
> is topological level? And that the commit-slab for topological levels is
> only there for writing and re-writing?
> 

No, the topo_levels commit-slab would be always populated when we write
a commit data chunk. The topo_level slab is a workaround for the fact
that we are trying to write two independent values (corrected commit
date offset, topological levels) but have one struct member to store them in
(data->generation).

If we are in a mixed-version environment, we could avoid initializing
the slab and fill the topological levels into data->generation instead,
but that's not how it is implemented right now.

> >
> > When Git creates a split commit-graph, it takes advantage of the
> > generation values that have been computed already and present in
> > existing commit-graph files.
> >
> > So, let's add a pointer to struct commit_graph to the topological level
> > commit-slab and populate it with topological levels while writing a
> > split commit-graph.
> 
> All right, looks sensible.

I have extend the last paragraph to include write_commit_graph_context
as well as:

  So, let's add a pointer to struct commit_graph as well as struct
  write_commit_graph_context to the topological level commit-slab and
  populate it with topological levels while writing a commit-graph file.

> 
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c | 47 ++++++++++++++++++++++++++++++++---------------
> >  commit-graph.h |  1 +
> >  commit.h       |  1 +
> >  3 files changed, 34 insertions(+), 15 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 7f9f858577..a2f15b2825 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -64,6 +64,8 @@ void git_test_write_commit_graph_or_die(void)
> >  /* Remember to update object flag allocation in object.h */
> >  #define REACHABLE       (1u<<15)
> >
> > +define_commit_slab(topo_level_slab, uint32_t);
> > +
> 
> All right.
> 
> Also, here we might need GENERATION_NUMBER_V1_INFINITY, but I don't
> think it would be necessary.
> 
> >  /* Keep track of the order in which commits are added to our list. */
> >  define_commit_slab(commit_pos, int);
> >  static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
> > @@ -759,6 +761,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> >  	item->date = (timestamp_t)((date_high << 32) | date_low);
> >
> >  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> > +
> > +	if (g->topo_levels)
> > +		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
> >  }
> 
> All right, here we store topological levels on commit-slab to avoid
> recomputing them.
> 
> Do I understand it correctly that the `topo_levels` member of the `struct
> commit_graph` would be non-null only when we are updating the
> commit-graph?
> 

Yes, that's correct.

> >
> >  static inline void set_commit_tree(struct commit *c, struct tree *t)
> > @@ -953,6 +958,7 @@ struct write_commit_graph_context {
> >  		 changed_paths:1,
> >  		 order_by_pack:1;
> >
> > +	struct topo_level_slab *topo_levels;
> >  	const struct split_commit_graph_opts *split_opts;
> >  	size_t total_bloom_filter_data_size;
> >  	const struct bloom_filter_settings *bloom_settings;
> 
> Why do we need `topo_levels` member *both* in `struct commit_graph` and
> in `struct write_commit_graph_context`?
> 
> [After examining the change further I have realized why both are needed,
>  and written about the reasoning later in this email.]
> 
> 
> Note that the commit message talks only about `struct commit_graph`...
> 
> > @@ -1094,7 +1100,7 @@ static int write_graph_chunk_data(struct hashfile *f,
> >  		else
> >  			packedDate[0] = 0;
> >
> > -		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
> > +		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
> 
> All right, here we prepare for writing to the CDAT chunk using data that
> is now stored on newly introduced topo_levels slab (either computed, or
> taken from commit-graph file being rewritten).
> 
> Assuming that ctx->topo_levels is not-null, and that the values are
> properly calculated before this -- and we did compute topological levels
> before writing the commit-graph.
> 
> >
> >  		packedDate[1] = htonl((*list)->date);
> >  		hashwrite(f, packedDate, 8);
> > @@ -1335,11 +1341,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  					_("Computing commit graph generation numbers"),
> >  					ctx->commits.nr);
> >  	for (i = 0; i < ctx->commits.nr; i++) {
> > -		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
> > +		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> 
> All right, so that is why this 'generation' variable was not converted
> to timestamp_t type.
> 
> >
> >  		display_progress(ctx->progress, i + 1);
> > -		if (generation != GENERATION_NUMBER_V1_INFINITY &&
> > -		    generation != GENERATION_NUMBER_ZERO)
> > +		if (level != GENERATION_NUMBER_V1_INFINITY &&
> > +		    level != GENERATION_NUMBER_ZERO)
> >  			continue;
> 
> Here we use GENERATION_NUMBER*_INFINITY to check if the commit is
> outside commit-graph files, and therefore we would need its topological
> level computed.
> 
> However, I don't understand how it works.  We have had created the
> commit_graph_data_at() and use it instead of commit_graph_data_slab_at()
> to provide default values for `struct commit_graph`... but only for
> `graph_pos` member.  It is commit_graph_generation() that returns
> GENERATION_NUMBER_INFINITY for commits not in graph.
> 
> But neither commit_graph_data_at()->generation nor topo_level_slab_at()
> handles this special case, so I don't see how 'generation' variable can
> *ever* be GENERATION_NUMBER_INFINITY, and 'level' variable can ever be
> GENERATION_NUMBER_V1_INFINITY for commits not in graph.
> 
> Does it work *accidentally*, because the default value for uninitialized
> data on commit-slab is 0, which matches GENERATION_NUMBER_ZERO?  It
> certainly looks like it does.  And GENERATION_NUMBER_ZERO is an artifact
> of commit-graph feature development history, namely the short time where
> Git didn't use any generation numbers and stored 0 in the place set for
> it in the commit-graph format...  On the other hand this is not the case
> for corrected commit date (generation number v2), as it could
> "legitimately" be 0 if some root commit (without any parents) had
> committerdate of epoch 0, i.e. 1 January 1970 00:00:00 UTC, perhaps
> caused by malformed but valid commit object.
> 
> Ugh...

It works accidentally.

Our decision to avoid the cost of initializing both
commit_graph_data->generation and commit_graph_data->graph_pos has
led to some unwieldy choices - the complexity of helper functions,
bypassing helper functions when writing a commit-graph file [1].

I want to re-visit how commit_graph_data slab is defined in a future series.

[1]: https://lore.kernel.org/git/be28ab7b-0ae4-2cc5-7f2b-92075de3723a@gmail.com/

> 
> >
> >  		commit_list_insert(ctx->commits.list[i], &list);
> > @@ -1347,29 +1353,27 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  			struct commit *current = list->item;
> >  			struct commit_list *parent;
> >  			int all_parents_computed = 1;
> > -			uint32_t max_generation = 0;
> > +			uint32_t max_level = 0;
> >
> >  			for (parent = current->parents; parent; parent = parent->next) {
> > -				generation = commit_graph_data_at(parent->item)->generation;
> > +				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> >
> > -				if (generation == GENERATION_NUMBER_V1_INFINITY ||
> > -				    generation == GENERATION_NUMBER_ZERO) {
> > +				if (level == GENERATION_NUMBER_V1_INFINITY ||
> > +				    level == GENERATION_NUMBER_ZERO) {
> >  					all_parents_computed = 0;
> >  					commit_list_insert(parent->item, &list);
> >  					break;
> > -				} else if (generation > max_generation) {
> > -					max_generation = generation;
> > +				} else if (level > max_level) {
> > +					max_level = level;
> >  				}
> >  			}
> 
> This is the same case as for previous chunk; see the comment above.
> 
> This code checks if parents have generation number / topological level
> computed, and tracks maximum value of it among all parents.
> 
> >
> >  			if (all_parents_computed) {
> > -				struct commit_graph_data *data = commit_graph_data_at(current);
> > -
> > -				data->generation = max_generation + 1;
> >  				pop_commit(&list);
> >
> > -				if (data->generation > GENERATION_NUMBER_MAX)
> > -					data->generation = GENERATION_NUMBER_MAX;
> > +				if (max_level > GENERATION_NUMBER_MAX - 1)
> > +					max_level = GENERATION_NUMBER_MAX - 1;
> > +				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> 
> OK, this is safer way of handling GENERATION_NUMBER*_MAX, especially if
> this value can be maximum value that can be safely stored in a given
> data type.  Previously GENERATION_NUMBER_MAX was smaller than maximum
> value that can be safely stored in uint32_t, so generation+1 had no
> chance to overflow.  This is no longer the case; the reorganization done
> here leads to more defensive code (safer).
> 
> All good.  However I think that we should clamp the value of topological
> level to the maximum value that can be safely stored *on disk*, in the
> 30 bits of the CDAT chunk reserved for generation number v1.  Otherwise
> the code to write topological level would get more complicated.
> 
> In my opinion the symbolic constant used here should be named
> GENERATION_NUMBER_V1_MAX, and its value should be at most (2 ^ 30 - 1);
> it should be the current value of GENERATION_NUMBER_MAX, that is
> 0x3FFFFFFF.
> 
> >  			}
> >  		}
> >  	}
> > @@ -2101,6 +2105,7 @@ int write_commit_graph(struct object_directory *odb,
> >  	uint32_t i, count_distinct = 0;
> >  	int res = 0;
> >  	int replace = 0;
> > +	struct topo_level_slab topo_levels;
> >
> 
> All right, we will be using topo_level slab for writing the
> commit-graph, and only for this purpose, so it is good to put it here.
> 
> >  	if (!commit_graph_compatible(the_repository))
> >  		return 0;
> > @@ -2179,6 +2184,18 @@ int write_commit_graph(struct object_directory *odb,
> >  		}
> >  	}
> >
> > +	init_topo_level_slab(&topo_levels);
> > +	ctx->topo_levels = &topo_levels;
> > +
> > +	if (ctx->r->objects->commit_graph) {
> > +		struct commit_graph *g = ctx->r->objects->commit_graph;
> > +
> > +		while (g) {
> > +			g->topo_levels = &topo_levels;
> > +			g = g->base_graph;
> > +		}
> > +	}
> 
> All right, now I see why we need `topo_levels` member both in the
> `struct write_commit_graph_context` and in `struct commit_graph`.
> The former is for functions that write the commit-graph, the latter for
> fill_commit_graph_info() functions that is deep in the callstack, but it
> needs to know whether to load topological level to commit-slab, or maybe
> put it as generation number (and in the future -- discard it, if not
> needed).
> 
> 
> Sidenote: this fragment of code, that fills with a given value some
> member of the `struct commit_graph` throughout the split commit-graph
> chain, will be repeated as similar code in patches later in series.
> However without resorting to preprocessor macros I have no idea how to
> generalize it to avoid code duplication (well, almost).
> 

The pattern is: iterate over the commit-graph chain and assign a member
(here, topo_level and in the other patch, read_generation_data) a value
(the address of topo_level slab, 1/0 depending on whether it is a mixed
generation chain).

We could generalize this in a future series but I don't think it is
worthwhile.

> > +
> >  	if (pack_indexes) {
> >  		ctx->order_by_pack = 1;
> >  		if ((res = fill_oids_from_packs(ctx, pack_indexes)))
> > diff --git a/commit-graph.h b/commit-graph.h
> > index 430bc830bb..1152a9642e 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -72,6 +72,7 @@ struct commit_graph {
> >  	const unsigned char *chunk_bloom_indexes;
> >  	const unsigned char *chunk_bloom_data;
> >
> > +	struct topo_level_slab *topo_levels;
> >  	struct bloom_filter_settings *bloom_filter_settings;
> >  };
> 
> All right: `struct commit_graph` is public, `struct
> write_commit_graph_context` is not.
> 
> >
> > diff --git a/commit.h b/commit.h
> > index bc0732a4fe..bb846e0025 100644
> > --- a/commit.h
> > +++ b/commit.h
> > @@ -15,6 +15,7 @@
> >  #define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
> >  #define GENERATION_NUMBER_MAX 0x3FFFFFFF
> 
> The name GENERATION_NUMBER_MAX for 0x3FFFFFFF should be instead
> GENERATION_NUMBER_V1_MAX, but that may be done in a later commit.
> 
> >  #define GENERATION_NUMBER_ZERO 0
> > +#define GENERATION_NUMBER_V2_OFFSET_MAX 0xFFFFFFFF
> 
> This value is never used, so why it is defined in this commit.
> 

Moved down to the patch actually uses it.

> >
> >  struct commit_list {
> >  	struct commit *item;
> 
> Best,
> -- 
> Jakub Narębski

^ permalink raw reply

* Re: [PATCH v3 05/11] commit-graph: return 64-bit generation number
From: Abhishek Kumar @ 2020-08-25  5:04 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: abhishekkumar8222, git, stolee
In-Reply-To: <85d03km98l.fsf@gmail.com>

On Fri, Aug 21, 2020 at 03:14:34PM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > In a preparatory step, let's return timestamp_t values from
> > commit_graph_generation(), use timestamp_t for local variables
> 
> All right, this is all good.
> 
> > and define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead.
> 
> This needs more detailed examination.  There are two similar constants,
> GENERATION_NUMBER_INFINITY and GENERATION_NUMBER_MAX.  The former is
> used for newest commits outside the commit-graph, while the latter is
> maximum number that commits in the commit-graph can have (because of the
> storage limitations).  We therefore need GENERATION_NUMBER_INFINITY
> to be larger than GENERATION_NUMBER_MAX, and it is (and was).
> 
> The GENERATION_NUMBER_INFINITY is because of the above requirement
> traditionally taken as maximum value that can be represented in the data
> type used to store commit's generation number _in memory_, but it can be
> less.  For timestamp_t the maximum value that can be represented
> is (2 ^ 63 - 1).
> 
> All right then.
> 
> >

Related to this, by the end of this series we are using
GENERATION_NUMBER_MAX in just one place - compute_generation_numbers()
to make sure the topological levels fit within 30 bits.

Would it be more appropriate to rename GENERATION_NUMBER_MAX to
GENERATION_NUMBER_V1_MAX (along the lines of
GENERATION_NUMBER_V2_OFFSET_MAX)  to correctly describe that is a
limit on topological levels, rather than generation number value?

> 
> The commit message says nothing about the new symbolic constant
> GENERATION_NUMBER_V1_INFINITY, though.
> 
> I'm not sure it is even needed (see comments below).

Yes, you are correct. I tried it out with your suggestions and it wasn't
really needed.

Thanks for catching this!

> ...

Thanks
- Abhishek

^ permalink raw reply

* Re: [PATCH] builtin/repack.c: invalidate MIDX only when necessary
From: Taylor Blau @ 2020-08-25  2:37 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, dstolee
In-Reply-To: <20200825022614.GA1391422@coredump.intra.peff.net>

On Mon, Aug 24, 2020 at 10:26:14PM -0400, Jeff King wrote:
> On Mon, Aug 24, 2020 at 10:01:04PM -0400, Taylor Blau wrote:
>
> > In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack'
> > learned to remove a multi-pack-index file if it added or removed a pack
> > from the object store.
> >
> > This mechanism is a little over-eager, since it is only necessary to
> > drop a MIDX if 'git repack' removes a pack that the MIDX references.
> > Adding a pack outside of the MIDX does not require invalidating the
> > MIDX, and likewise for removing a pack the MIDX does not know about.
>
> Does "git repack" ever remove just one pack? Obviously "git repack -ad"
> or "git repack -Ad" is going to pack everything and delete the old
> packs. So I think we'd want to remove a midx there.
>
> And "git repack -d" I think of as deleting only loose objects that we
> just packed. But I guess it could also remove a pack that has now been
> made redundant? That seems like a rare case in practice, but I suppose
> is possible.

Yeah, the patch message makes this sound more likely than it actually
is, which I agree is very rare. I often write 'git repack' instead of
'git pack-objects' to slurp up everything loose into a new pack without
having to list loose objects by name.

That's the case that I really care about here: purely adding a new pack
should not invalidate the existing MIDX.

> Not exactly related to your fix, but kind of the flip side of it: would
> we ever need to retain a midx that mentions some packs that still exist?
>
> E.g., imagine we have a midx that points to packs A and B, and
> git-repack deletes B. By your logic above, we need to remove the midx
> because now it points to objects in B which aren't accessible. But by
> deleting it, could we be deleting the only thing that mentions the
> objects in A?
>
> I _think_ the answer is "no", because we never went all-in on midx and
> allowed deleting the matching .idx files for contained packs. So we'd
> still have that A.idx, and we could just use the pack as normal. But
> it's an interesting corner case if we ever do go in that direction.

Agreed. Maybe a (admittedly somewhat large) #leftoverbits.

> If you'll let me muse a bit more on midx-lifetime issues (which I've
> never really thought about before just now):
>
> I'm also a little curious how bad it is to have a midx whose pack has
> gone away. I guess we'd answer queries for "yes, we have this object"
> even if we don't, which is bad. Though in practice we'd only delete
> those packs if we have their objects elsewhere. And the pack code is
> pretty good about retrying other copies of objects that can't be
> accessed. Alternatively, I wonder if the midx-loading code ought to
> check that all of the constituent packs are available.
>
> In that line of thinking, do we even need to delete midx files if one of
> their packs goes away? The reading side probably ought to be able to
> handle that gracefully.

I think that this is probably the right direction, although I've only
spend time in the MIDX code over the past couple of weeks, so I can't
say with authority. It seems like it would be pretty annoying, though.
For example, code that cares about listing all objects in a MIDX would
have to check first whether the pack they're in still exists before
emitting them. On top of that, there are more corner cases when object X
exists in more than one pack, but some strict subset of those packs
containing X have gone away.

I don't think that it couldn't be done, though.

> And the more interesting case is when you repack everything with "-ad"
> or similar, at which point you shouldn't even need to look up what's in
> the midx to see if you deleted its packs. The point of your operation is
> to put it all-into-one, so you know the old midx should be discarded.
>
> > Teach 'git repack' to check for this by loading the MIDX, and checking
> > whether the to-be-removed pack is known to the MIDX. This requires a
> > slightly odd alternation to a test in t5319, which is explained with a
> > comment.
>
> My above musings aside, this seems like an obvious improvement.
>
> > diff --git a/builtin/repack.c b/builtin/repack.c
> > index 04c5ceaf7e..98fac03946 100644
> > --- a/builtin/repack.c
> > +++ b/builtin/repack.c
> > @@ -133,7 +133,11 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list,
> >  static void remove_redundant_pack(const char *dir_name, const char *base_name)
> >  {
> >  	struct strbuf buf = STRBUF_INIT;
> > -	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
> > +	struct multi_pack_index *m = get_multi_pack_index(the_repository);
> > +	strbuf_addf(&buf, "%s.pack", base_name);
> > +	if (m && midx_contains_pack(m, buf.buf))
> > +		clear_midx_file(the_repository);
> > +	strbuf_insertf(&buf, 0, "%s/", dir_name);
>
> Makes sense. midx_contains_pack() is a binary search, so we'll spend
> O(n log n) effort deleting the packs (I wondered if this might be
> accidentally quadratic over the number of packs).

Right. The MIDX stores packs in lexographic order, so checking them is
O(log n), which we do at most 'n' times.

> And after we clear, "m" will be NULL, so we'll do it at most once. Which
> is why you can get rid of the manual "midx_cleared" flag from the
> preimage.

Yep. I thought briefly about passing 'm' as a parameter, but then you
have to worry about a dangling reference to
'the_repository->objects->multi_pack_index' after calling
'clear_midx_file()', so it's easier to look it up each time.

> So the patch looks good to me.

Thanks.

> -Peff

Thanks,
Taylor

^ permalink raw reply

* Re: [PATCH] builtin/repack.c: invalidate MIDX only when necessary
From: Jeff King @ 2020-08-25  2:26 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee
In-Reply-To: <ef9186a8df0d712c2ecccbe62cb43a7abadb9c96.1598320716.git.me@ttaylorr.com>

On Mon, Aug 24, 2020 at 10:01:04PM -0400, Taylor Blau wrote:

> In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack'
> learned to remove a multi-pack-index file if it added or removed a pack
> from the object store.
> 
> This mechanism is a little over-eager, since it is only necessary to
> drop a MIDX if 'git repack' removes a pack that the MIDX references.
> Adding a pack outside of the MIDX does not require invalidating the
> MIDX, and likewise for removing a pack the MIDX does not know about.

Does "git repack" ever remove just one pack? Obviously "git repack -ad"
or "git repack -Ad" is going to pack everything and delete the old
packs. So I think we'd want to remove a midx there.

And "git repack -d" I think of as deleting only loose objects that we
just packed. But I guess it could also remove a pack that has now been
made redundant? That seems like a rare case in practice, but I suppose
is possible.

Not exactly related to your fix, but kind of the flip side of it: would
we ever need to retain a midx that mentions some packs that still exist?

E.g., imagine we have a midx that points to packs A and B, and
git-repack deletes B. By your logic above, we need to remove the midx
because now it points to objects in B which aren't accessible. But by
deleting it, could we be deleting the only thing that mentions the
objects in A?

I _think_ the answer is "no", because we never went all-in on midx and
allowed deleting the matching .idx files for contained packs. So we'd
still have that A.idx, and we could just use the pack as normal. But
it's an interesting corner case if we ever do go in that direction.

If you'll let me muse a bit more on midx-lifetime issues (which I've
never really thought about before just now):

I'm also a little curious how bad it is to have a midx whose pack has
gone away. I guess we'd answer queries for "yes, we have this object"
even if we don't, which is bad. Though in practice we'd only delete
those packs if we have their objects elsewhere. And the pack code is
pretty good about retrying other copies of objects that can't be
accessed. Alternatively, I wonder if the midx-loading code ought to
check that all of the constituent packs are available.

In that line of thinking, do we even need to delete midx files if one of
their packs goes away? The reading side probably ought to be able to
handle that gracefully.

And the more interesting case is when you repack everything with "-ad"
or similar, at which point you shouldn't even need to look up what's in
the midx to see if you deleted its packs. The point of your operation is
to put it all-into-one, so you know the old midx should be discarded.

> Teach 'git repack' to check for this by loading the MIDX, and checking
> whether the to-be-removed pack is known to the MIDX. This requires a
> slightly odd alternation to a test in t5319, which is explained with a
> comment.

My above musings aside, this seems like an obvious improvement.

> diff --git a/builtin/repack.c b/builtin/repack.c
> index 04c5ceaf7e..98fac03946 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -133,7 +133,11 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list,
>  static void remove_redundant_pack(const char *dir_name, const char *base_name)
>  {
>  	struct strbuf buf = STRBUF_INIT;
> -	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
> +	struct multi_pack_index *m = get_multi_pack_index(the_repository);
> +	strbuf_addf(&buf, "%s.pack", base_name);
> +	if (m && midx_contains_pack(m, buf.buf))
> +		clear_midx_file(the_repository);
> +	strbuf_insertf(&buf, 0, "%s/", dir_name);

Makes sense. midx_contains_pack() is a binary search, so we'll spend
O(n log n) effort deleting the packs (I wondered if this might be
accidentally quadratic over the number of packs).

And after we clear, "m" will be NULL, so we'll do it at most once. Which
is why you can get rid of the manual "midx_cleared" flag from the
preimage.

So the patch looks good to me.

-Peff

^ permalink raw reply

* [PATCH] builtin/repack.c: invalidate MIDX only when necessary
From: Taylor Blau @ 2020-08-25  2:01 UTC (permalink / raw)
  To: git; +Cc: dstolee

In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack'
learned to remove a multi-pack-index file if it added or removed a pack
from the object store.

This mechanism is a little over-eager, since it is only necessary to
drop a MIDX if 'git repack' removes a pack that the MIDX references.
Adding a pack outside of the MIDX does not require invalidating the
MIDX, and likewise for removing a pack the MIDX does not know about.

Teach 'git repack' to check for this by loading the MIDX, and checking
whether the to-be-removed pack is known to the MIDX. This requires a
slightly odd alternation to a test in t5319, which is explained with a
comment.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
Something I noticed while working on a MIDX-related topic. Not critical
that we take this now since it's only dropping the MIDX too
aggressively, but not otherwise doing anything destructive. But, this
will make my somewhat large other series a little smaller.

 builtin/repack.c            | 12 +++++-------
 t/t5319-multi-pack-index.sh | 21 +++++++++++++++++++--
 2 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 04c5ceaf7e..98fac03946 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -133,7 +133,11 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list,
 static void remove_redundant_pack(const char *dir_name, const char *base_name)
 {
 	struct strbuf buf = STRBUF_INIT;
-	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
+	struct multi_pack_index *m = get_multi_pack_index(the_repository);
+	strbuf_addf(&buf, "%s.pack", base_name);
+	if (m && midx_contains_pack(m, buf.buf))
+		clear_midx_file(the_repository);
+	strbuf_insertf(&buf, 0, "%s/", dir_name);
 	unlink_pack_path(buf.buf, 1);
 	strbuf_release(&buf);
 }
@@ -286,7 +290,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int keep_unreachable = 0;
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	int no_update_server_info = 0;
-	int midx_cleared = 0;
 	struct pack_objects_args po_args = {NULL};

 	struct option builtin_repack_options[] = {
@@ -439,11 +442,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		for (ext = 0; ext < ARRAY_SIZE(exts); ext++) {
 			char *fname, *fname_old;

-			if (!midx_cleared) {
-				clear_midx_file(the_repository);
-				midx_cleared = 1;
-			}
-
 			fname = mkpathdup("%s/pack-%s%s", packdir,
 						item->string, exts[ext].name);
 			if (!file_exists(fname)) {
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 43b1b5b2af..16a1ad040e 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -382,12 +382,29 @@ test_expect_success 'repack with the --no-progress option' '
 	test_line_count = 0 err
 '

-test_expect_success 'repack removes multi-pack-index' '
+test_expect_success 'repack removes multi-pack-index when deleting packs' '
 	test_path_is_file $objdir/pack/multi-pack-index &&
-	GIT_TEST_MULTI_PACK_INDEX=0 git repack -adf &&
+	# Set GIT_TEST_MULTI_PACK_INDEX to 0 to avoid writing a new
+	# multi-pack-index after repacking, but set "core.multiPackIndex" to
+	# true so that "git repack" can read the existing MIDX.
+	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -adf &&
 	test_path_is_missing $objdir/pack/multi-pack-index
 '

+test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
+	git multi-pack-index write &&
+	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
+	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
+
+	# Write a new pack that is unknown to the multi-pack-index.
+	git hash-object -w </dev/null >blob &&
+	git pack-objects $objdir/pack/pack <blob &&
+
+	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
+	test_cmp_bin $objdir/pack/multi-pack-index \
+		$objdir/pack/multi-pack-index.bak
+'
+
 compare_results_with_midx "after repack"

 test_expect_success 'multi-pack-index and pack-bitmap' '
--
2.28.0.rc1.13.ge78abce653

^ permalink raw reply related

* What's cooking in git.git (Aug 2020, #06; Mon, 24)
From: Junio C Hamano @ 2020-08-25  0:22 UTC (permalink / raw)
  To: git

Here are the topics that have been cooking.  Commits prefixed with '-' are
only in 'seen' (formerly 'pu'---proposed updates) while commits prefixed
with '+' are in 'next'.  The ones marked with '.' do not appear in any of
the integration branches, but I am still holding onto them.

You can find the changes described here in the integration branches of the
repositories listed at

    http://git-blame.blogspot.com/p/git-public-repositories.html

--------------------------------------------------
[Graduated to 'master']

* dl/subtree-docs (2020-08-18) 2 commits
  (merged to 'next' on 2020-08-19 at e1a8ea9d46)
 + contrib/subtree: document 'push' does not take '--squash'
 + contrib/subtree: fix "unsure" for --message in the document

 Doc updates for subtree (in contrib/)


* ds/midx-repack-to-batch-size (2020-08-11) 1 commit
  (merged to 'next' on 2020-08-17 at eee94634aa)
 + multi-pack-index: repack batches below --batch-size

 The "--batch-size" option of "git multi-pack-index repack" command
 is now used to specify that very small packfiles are collected into
 one until the total size roughly exceeds it.


* en/dir-clear (2020-08-18) 2 commits
  (merged to 'next' on 2020-08-19 at 18c5b69293)
 + dir: fix problematic API to avoid memory leaks
 + dir: make clear_directory() free all relevant memory

 Leakfix with code clean-up.


* en/dir-nonbare-embedded (2020-08-12) 2 commits
  (merged to 'next' on 2020-08-17 at ab180b7fcb)
 + dir: avoid prematurely marking nonbare repositories as matches
 + t3000: fix some test description typos

 "ls-files -o" mishandled the top-level directory of another git
 working tree that hangs in the current git working tree.


* es/init-no-separate-git-dir-in-bare (2020-08-10) 1 commit
  (merged to 'next' on 2020-08-17 at 80498c8659)
 + init: disallow --separate-git-dir with bare repository

 The purpose of "git init --separate-git-dir" is to initialize a
 new project with the repository separate from the working tree,
 or, in the case of an existing project, to move the repository
 (the .git/ directory) out of the working tree. It does not make
 sense to use --separate-git-dir with a bare repository for which
 there is no working tree, so disallow its use with bare
 repositories.


* jc/no-update-fetch-head (2020-08-18) 1 commit
  (merged to 'next' on 2020-08-19 at 68e1374ed6)
 + fetch: optionally allow disabling FETCH_HEAD update
 (this branch is used by ds/maintenance-part-1, ds/maintenance-part-2 and jt/lazy-fetch.)

 "git fetch" learned --no-write-fetch-head option to avoid writing
 the FETCH_HEAD file.


* jk/unleak-fixes (2020-08-13) 2 commits
  (merged to 'next' on 2020-08-17 at f9bd296b25)
 + ls-remote: simplify UNLEAK() usage
 + stop calling UNLEAK() before die()

 Fix some incorrect UNLEAK() annotations.


* ma/doc-sha-256-is-experimental (2020-08-17) 1 commit
  (merged to 'next' on 2020-08-19 at 9ccf6c399c)
 + Documentation: mark `--object-format=sha256` as experimental

 The recent addition of SHA-256 support is marked as experimental in
 the documentation.


* mt/checkout-entry-dead-code-removal (2020-08-18) 1 commit
  (merged to 'next' on 2020-08-19 at 58866e5299)
 + checkout_entry(): remove unreachable error() call

 Code clean-up.


* rs/more-buffered-io (2020-08-17) 3 commits
  (merged to 'next' on 2020-08-19 at 6d23a23bb2)
 + upload-pack: use buffered I/O to talk to rev-list
 + midx: use buffered I/O to talk to pack-objects
 + connected: use buffered I/O to talk to rev-list

 Use more buffered I/O where we used to call many small write(2)s.


* rs/patch-id-with-incomplete-line (2020-08-18) 1 commit
  (merged to 'next' on 2020-08-19 at 72961d48d0)
 + patch-id: ignore newline at end of file in diff_flush_patch_id()

 The patch-id computation did not ignore the "incomplete last line"
 marker like whitespaces.

--------------------------------------------------
[New Topics]

* jk/refspecs-cleanup (2020-08-17) 2 commits
  (merged to 'next' on 2020-08-24 at 807a080ebf)
 + refspec: make sure stack refspec_item variables are zeroed
 + refspec: fix documentation referring to refspec_item
 (this branch is used by jk/refspecs-negative.)

 Preliminary code clean-up before introducing "negative refspec".

 Will merge to 'master'.


* rs/checkout-no-overlay-pathspec-fix (2020-08-22) 1 commit
 - checkout, restore: make pathspec recursive

 "git restore/checkout --no-overlay" with wildcarded pathspec
 mistakenly removed matching paths in subdirectories, which has been
 corrected.

 Will merge to 'next'.


* al/bisect-first-parent (2020-08-22) 1 commit
  (merged to 'next' on 2020-08-24 at f95fbf45a6)
 + bisect: add first-parent option to documentation

 Finishing touches.

 Will merge to 'master'.


* js/no-builtins-on-disk-option (2020-08-24) 3 commits
 - ci: stop linking built-ins to the dashed versions
 - install: optionally skip linking/copying the built-ins
 - msvc: copy the correct `.pdb` files in the Makefile target `install`

 The installation procedure learned to optionally omit "git-foo"
 executable files for each 'foo' built-in subcommand, which are only
 required by old timers that still rely on the age old promise that
 prepending "git --exec-path" output to PATH early in their script
 will keep the "git-foo" calls they wrote working.

 The old attempt to remove these executables from the disk failed in
 the 1.6 era; it may be worth attempting again, but I think it is
 worth to keep this topic separate from such a policy change to help
 it graduate early.

 cf. https://public-inbox.org/git/7vprnzt7d5.fsf@gitster.siamese.dyndns.org/


* jt/threaded-index-pack (2020-08-24) 8 commits
 - fixup! index-pack: make quantum of work smaller
 - index-pack: make quantum of work smaller
 - index-pack: make resolve_delta() assume base data
 - index-pack: calculate {ref,ofs}_{first,last} early
 - index-pack: remove redundant child field
 - index-pack: unify threaded and unthreaded code
 - index-pack: remove redundant parameter
 - Documentation: deltaBaseCacheLimit is per-thread

 "git index-pack" learned to resolve deltified objects with greater
 parallelism.

--------------------------------------------------
[Stalled]

* mt/grep-sparse-checkout (2020-06-12) 6 commits
 - config: add setting to ignore sparsity patterns in some cmds
 - grep: honor sparse checkout patterns
 - config: correctly read worktree configs in submodules
 - t/helper/test-config: facilitate addition of new cli options
 - t/helper/test-config: return exit codes consistently
 - doc: grep: unify info on configuration variables

 "git grep" has been tweaked to be limited to the sparse checkout
 paths.

 Review needed on 4/6; otherwise looking sane.
 cf. <CABPp-BGdEyEeajYZj_rdxp=MyEQdszuyjVTax=hhYj3fOtRQUQ@mail.gmail.com>


* ls/mergetool-meld-auto-merge (2020-07-12) 2 commits
 - SQUASH???
 - Support auto-merge for meld to follow the vim-diff behavior

 The 'meld' backend of the "git mergetool" learned to give the
 underlying 'meld' the '--auto-merge' option, which would help
 reduce the amount of text that requires manual merging.

 Expecting a reroll.


* mf/submodule-summary-with-correct-repository (2020-06-24) 2 commits
 - submodule: use submodule repository when preparing summary
 - revision: use repository from rev_info when parsing commits

 "git diff/show" on a change that involves a submodule used to read
 the information on commits in the submodule from a wrong repository
 and gave a wrong information when the commit-graph is involved.

 Needs tests.


* dr/push-remoteref-fix (2020-04-23) 1 commit
 - remote.c: fix handling of %(push:remoteref)

 The "%(push:remoteref)" placeholder in the "--format=" argument of
 "git format-patch" (and friends) only showed what got explicitly
 configured, not what ref at the receiving end would be updated when
 "git push" was used, as it ignored the default behaviour (e.g. update
 the same ref as the source).

 Expecting a reroll.
 cf. <20200416152145.wp2zeibxmuyas6y6@feanor>


* mr/bisect-in-c-2 (2020-07-17) 14 commits
 . SQUASH??? do not add new users of git_path_bisect_head()
 . bisect--helper: retire `--bisect-autostart` subcommand
 . bisect--helper: retire `--write-terms` subcommand
 . bisect--helper: retire `--check-expected-revs` subcommand
 . bisect--helper: reimplement `bisect_state` & `bisect_head` shell functions in C
 . bisect--helper: retire `--next-all` subcommand
 . bisect--helper: retire `--bisect-clean-state` subcommand
 . bisect--helper: finish porting `bisect_start()` to C
 . bisect--helper: reimplement `bisect_next` and `bisect_auto_next` shell functions in C
 . bisect: call 'clear_commit_marks_all()' in 'bisect_next_all()'
 . bisect--helper: reimplement `bisect_autostart` shell function in C
 . bisect--helper: introduce new `write_in_file()` function
 . bisect--helper: use '-res' in 'cmd_bisect__helper' return
 . bisect--helper: BUG() in cmd_*() on invalid subcommand

 Rewrite of the remainder of "git bisect" script in C continues.

 Needs more work.
 Ejected out of 'seen'; al/bisect-first-parent topic has a bit of
 textual conflict with this topic.


* mk/use-size-t-in-zlib (2018-10-15) 1 commit
 - zlib.c: use size_t for size

 The wrapper to call into zlib followed our long tradition to use
 "unsigned long" for sizes of regions in memory, which have been
 updated to use "size_t".

--------------------------------------------------
[Cooking]

* hn/refs-pseudorefs (2020-08-21) 4 commits
  (merged to 'next' on 2020-08-24 at 3579abe8ff)
 + sequencer: treat REVERT_HEAD as a pseudo ref
 + builtin/commit: suggest update-ref for pseudoref removal
 + sequencer: treat CHERRY_PICK_HEAD as a pseudo ref
 + refs: make refs_ref_exists public
 (this branch uses hn/refs-fetch-head-is-special.)

 Accesses to two pseudorefs have been updated to properly use ref
 API.

 Will merge to 'master'.


* rz/complete-more-options (2020-08-19) 2 commits
  (merged to 'next' on 2020-08-21 at ba8f4c8cb1)
 + completion: add GIT_COMPLETION_SHOW_ALL env var
 + parse-options: add --git-completion-helper-all

 Command line completion (in contrib/) usually omits redundant,
 deprecated and/or dangerous options from its output; it learned to
 optionally include all of them.

 Will merge to 'master'.


* jt/promisor-pack-fix (2020-08-20) 1 commit
  (merged to 'next' on 2020-08-24 at cd26d30d8d)
 + fetch-pack: in partial clone, pass --promisor

 Updates into a lazy/partial clone with a submodule did not work
 well with transfer.fsckobjects set.

 Will merge to 'master'.


* hv/ref-filter-trailers-atom-parsing-fix (2020-08-21) 2 commits
  (merged to 'next' on 2020-08-24 at 79b27f3263)
 + ref-filter: 'contents:trailers' show error if `:` is missing
 + t6300: unify %(trailers) and %(contents:trailers) tests

 The parser for "git for-each-ref --format=..." was too loose when
 parsing the "%(trailers...)" atom, and forgot that "trailers" and
 "trailers:<modifers>" are the only two allowed forms, which has
 been corrected.

 Will merge to 'master'.


* jc/ident-whose-ident (2020-08-21) 1 commit
 - ident: say whose identity is missing when giving user.name hint

 Error message update.

 Will merge to 'next'.


* jk/index-pack-w-more-threads (2020-08-21) 3 commits
  (merged to 'next' on 2020-08-24 at 18f18a5b66)
 + index-pack: adjust default threading cap
 + p5302: count up to online-cpus for thread tests
 + p5302: disable thread-count parameter tests by default

 Long ago, we decided to use 3 threads by default when running the
 index-pack task in parallel, which has been adjusted a bit upwards.

 Will merge to 'master'.


* rp/apply-cached-doc (2020-08-20) 1 commit
 - git-apply.txt: update descriptions of --cached, --index

 The description of --cached/--index options in "git apply --help"
 has been updated.

 Will merge to 'next'.


* dd/diff-customize-index-line-abbrev (2020-08-21) 2 commits
  (merged to 'next' on 2020-08-24 at 74e842a2c8)
 + diff: index-line: respect --abbrev in object's name
 + t4013: improve diff-post-processor logic

 The output from the "diff" family of the commands had abbreviated
 object names of blobs involved in the patch, but its length was not
 affected by the --abbrev option.  Now it is.

 Will merge to 'master'.


* hn/refs-fetch-head-is-special (2020-08-19) 4 commits
  (merged to 'next' on 2020-08-21 at def233ab43)
 + refs: read FETCH_HEAD and MERGE_HEAD generically
 + refs: move gitdir into base ref_store
 + refs: fix comment about submodule ref_stores
 + refs: split off reading loose ref data in separate function
 (this branch is used by hn/refs-pseudorefs.)

 The FETCH_HEAD is now always read from the filesystem regardless of
 the ref backend in use, as its format is much richer than the
 normal refs, and written directly by "git fetch" as a plain file..

 Will merge to 'master'.


* hv/ref-filter-misc (2020-08-17) 9 commits
 - ref-filter: add `sanitize` option for 'subject' atom
 - format-support: move `format_sanitized_subject()` from pretty
 - pretty: refactor `format_sanitized_subject()`
 - ref-filter: add `short` modifier to 'parent' atom
 - ref-filter: add `short` modifier to 'tree' atom
 - ref-filter: rename `objectname` related functions and fields
 - ref-filter: modify error messages in `grab_objectname()`
 - ref-filter: refactor `grab_objectname()`
 - ref-filter: support different email formats

 The "--format=" option to the "for-each-ref" command and friends
 learned a few more tricks, e.g. the ":short" suffix that applies to
 "objectname" now also can be used for "parent", "tree", etc.

 Will merge to 'next'.


* jk/leakfix (2020-08-17) 7 commits
  (merged to 'next' on 2020-08-21 at a8b25a2657)
 + submodule--helper: fix leak of core.worktree value
 + config: fix leak in git_config_get_expiry_in_days()
 + config: drop git_config_get_string_const()
 + config: fix leaks from git_config_get_string_const()
 + checkout: fix leak of non-existent branch names
 + submodule--helper: use strbuf_release() to free strbufs
 + clear_pattern_list(): clear embedded hashmaps

 Code clean-up.

 Will merge to 'master'.


* jk/refspecs-negative (2020-08-21) 1 commit
 - refspec: add support for negative refspecs
 (this branch uses jk/refspecs-cleanup.)

 "negative refspecs"


* jt/fetch-pack-loosen-validation-with-packfile-uri (2020-08-14) 1 commit
 - fetch-pack: make packfile URIs work with transfer.fsckobjects

 Bugfix for "git fetch" when the packfile URI capability is in use.

 Need to pick up a reroll.


* mr/diff-hide-stat-wo-textual-change (2020-08-19) 1 commit
 - diff: teach --stat to ignore uninteresting modifications

 "git diff --stat -w" showed 0-line changes for paths whose changes
 were only whitespaces, which was not intuitive.  We now omit such
 paths from the stat output.

 Will merge to 'next'.


* pw/add-p-allowed-options-fix (2020-08-17) 2 commits
 - add -p: fix checking of user input
 - add -p: use ALLOC_GROW_BY instead of ALLOW_GROW

 "git add -p" update.

 Will merge to 'next'.


* en/mem-pool (2020-08-18) 3 commits
  (merged to 'next' on 2020-08-19 at eff9ad46f0)
 + mem-pool: use consistent pool variable name
 + mem-pool: use more standard initialization and finalization
 + mem-pool: add convenience functions for strdup and strndup

 API update.

 Will merge to 'master'.


* jt/lazy-fetch (2020-08-18) 7 commits
 - fetch-pack: remove no_dependents code
 - promisor-remote: lazy-fetch objects in subprocess
 - fetch-pack: do not lazy-fetch during ref iteration
 - fetch: only populate existing_refs if needed
 - fetch: avoid reading submodule config until needed
 - fetch: allow refspecs specified through stdin
 - negotiator/noop: add noop fetch negotiator

 Updates to on-demand fetching code in lazily cloned repositories.

 Will merge to 'next'.


* jx/proc-receive-hook (2020-08-24) 10 commits
 - doc: add documentation for the proc-receive hook
 - transport: parse report options for tracking refs
 - t5411: test updates of remote-tracking branches
 - receive-pack: new config receive.procReceiveRefs
 - doc: add document for capability report-status-v2
 - New capability "report-status-v2" for git-push
 - receive-pack: feed report options to post-receive
 - receive-pack: add new proc-receive hook
 - t5411: add basic test cases for proc-receive hook
 - transport: not report a non-head push as a branch

 "git receive-pack" that accepts requests by "git push" learned to
 outsource most of the ref updates to the new "proc-receive" hook.


* pw/rebase-i-more-options (2020-08-19) 5 commits
  (merged to 'next' on 2020-08-21 at ade71fd49b)
 + rebase: add --reset-author-date
 + rebase -i: support --ignore-date
 + rebase -i: support --committer-date-is-author-date
 + am: stop exporting GIT_COMMITTER_DATE
 + rebase -i: add --ignore-whitespace flag

 "git rebase -i" learns a bit more options.

 Will merge to 'master'.


* tb/bloom-improvements (2020-08-11) 14 commits
 - builtin/commit-graph.c: introduce '--max-new-filters=<n>'
 - commit-graph: rename 'split_commit_graph_opts'
 - commit-graph: add large-filters bitmap chunk
 - commit-graph.c: sort index into commits list
 - bloom/diff: properly short-circuit on max_changes
 - bloom: use provided 'struct bloom_filter_settings'
 - csum-file.h: introduce 'hashwrite_be64()'
 - bloom: split 'get_bloom_filter()' in two
 - commit-graph.c: store maximum changed paths
 - commit-graph: respect 'commitGraph.readChangedPaths'
 - t/helper/test-read-graph.c: prepare repo settings
 - commit-graph: pass a 'struct repository *' in more places
 - t4216: use an '&&'-chain
 - commit-graph: introduce 'get_bloom_filter_settings()'

 Misc Bloom filter improvements.

 Expecting a reroll.
 It seems that the review is getting closer to result in another update.
 cf. <20200811220503.GC66656@syl.lan>


* jk/slimmed-down (2020-08-13) 5 commits
 - drop vcs-svn experiment
 - make git-fast-import a builtin
 - make git-bugreport a builtin
 - make credential helpers builtins
 - Makefile: drop builtins from MSVC pdb list

 Trim an unused binary and turn a bunch of commands into built-in.

 Will merge to 'next'.


* ss/t7401-modernize (2020-08-21) 5 commits
 - t7401: add a NEEDSWORK
 - t7401: change indentation for enhanced readability
 - t7401: change syntax of test_i18ncmp calls for clarity
 - t7401: use 'short' instead of 'verify' and cut in rev-parse calls
 - t7401: modernize style

 Test clean-up.

 Will merge to 'next'.


* ds/maintenance-part-2 (2020-08-18) 8 commits
 - maintenance: add incremental-repack auto condition
 - maintenance: auto-size incremental-repack batch
 - maintenance: add incremental-repack task
 - midx: use start_delayed_progress()
 - midx: enable core.multiPackIndex by default
 - maintenance: create auto condition for loose-objects
 - maintenance: add loose-objects task
 - maintenance: add prefetch task
 (this branch uses ds/maintenance-part-1.)

 "git maintenance", an extended big brother of "git gc", continues
 to evolve.


* ss/submodule-summary-in-c (2020-08-12) 4 commits
  (merged to 'next' on 2020-08-17 at 9bc352cb70)
 + submodule: port submodule subcommand 'summary' from shell to C
 + t7421: introduce a test script for verifying 'summary' output
 + submodule: rename helper functions to avoid ambiguity
 + submodule: remove extra line feeds between callback struct and macro

 Yet another subcommand of "git submodule" is getting rewritten in C.

 Will merge to 'master'.


* am/ci-wsfix (2020-08-21) 1 commit
  (merged to 'next' on 2020-08-24 at 8491e031f1)
 + ci: fix inconsistent indentation

 Aesthetic fix to a CI configuration file.

 Will merge to 'master'.


* ds/maintenance-part-1 (2020-08-18) 11 commits
 - maintenance: add trace2 regions for task execution
 - maintenance: add auto condition for commit-graph task
 - maintenance: use pointers to check --auto
 - maintenance: create maintenance.<task>.enabled config
 - maintenance: take a lock on the objects directory
 - maintenance: add --task option
 - maintenance: add commit-graph task
 - maintenance: initialize task array
 - maintenance: replace run_auto_gc()
 - maintenance: add --quiet option
 - maintenance: create basic maintenance runner
 (this branch is used by ds/maintenance-part-2.)

 A "git gc"'s big brother has been introduced to take care of more
 repository maintenance tasks, not limited to the object database
 cleaning.

 Comments?


* es/config-hooks (2020-07-30) 6 commits
 - hook: add 'run' subcommand
 - parse-options: parse into argv_array
 - hook: add --porcelain to list command
 - hook: add list command
 - hook: scaffolding for git-hook subcommand
 - doc: propose hooks managed by the config

 The "hooks defined in config" topic.

 Expecting a reroll.
 Now jk/strvec is in 'master', we may want to see the topic reworked
 on top of it.  Are there unresolved issues, or does the topic need
 a round of detailed review?
 cf. <xmqqmu3i9kvg.fsf@gitster.c.googlers.com>

--------------------------------------------------
[Discarded]

* rs/fast-export-anon-simplify (2020-08-13) 1 commit
 . fast-export: factor out print_oid()

 Code simplification.

 Retracted.
 cf. <6e2d4472-8293-4f10-0ba6-82ae83f7a465@web.de>


* mt/hash-to-hex-thread-safety (2020-06-26) 2 commits
 . hex: make hash_to_hex_algop() and friends thread-safe
 . compat/win32/pthread: add pthread_once()

 hash_to_hex() used a set of rotating static buffers, which was not
 safe to use in a threaded environment.  This has been made safer by
 using thread-local storage.

 Retracted.
 cf. <CAHd-oW7Wd8oSaMhPFeRcEeKTJ-k_hC7b6e28efhXT5LFu1E_Uw@mail.gmail.com>

^ permalink raw reply

* Re: [PATCH v2 2/2] ref-filter: 'contents:trailers' show error if `:` is missing
From: Hariom verma @ 2020-08-24 23:32 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Junio C Hamano, Hariom Verma via GitGitGadget, Git List,
	Christian Couder, Heba Waly
In-Reply-To: <CAPig+cScdV1ORSbqDuUiOEvCd6TYgkR=3GK8OCUu4yuoKVy5Pg@mail.gmail.com>

Hi,

On Mon, Aug 24, 2020 at 9:19 AM Eric Sunshine <sunshine@sunshineco.com> wrote:
>
> On Sun, Aug 23, 2020 at 8:56 PM Hariom verma <hariom18599@gmail.com> wrote:
> > On Sat, Aug 22, 2020 at 12:47 AM Junio C Hamano <gitster@pobox.com> wrote:
> > > Eric Sunshine <sunshine@sunshineco.com> writes:
> > > > ...an alternative would have been something like:
> > > >
> > > >   else if (!strcmp(arg, "trailers")) {
> > > >     if (trailers_atom_parser(format, atom, NULL, err))
> > > >       return -1;
> > > >   } else if (skip_prefix(arg, "trailers:", &arg)) {
> > > >     if (trailers_atom_parser(format, atom, arg, err))
> > > >       return -1;
> > > >   }
> > > >
> > > > which is quite simple to reason about (though has the cost of a tiny
> > > > bit of duplication).
> > >
> > > Yeah, that looks quite simple and straight-forward.
> >
> > Recently, I sent a patch series "Improvements to ref-filter"[1]. A
> > patch in this patch series introduced "sanitize" modifier to "subject"
> > atom. i.e "%(subject:sanitize)".
> >
> > What if in the future we also want "%(contents:subject:sanitize)" to work?
> > We can use this helper any number of times, whenever there is a need.
> >
> > Sorry, I missed saying this earlier. But I don't prefer duplicating
> > the code here.
>
> Pushing back on a reviewer suggestion is fine. Explaining the reason
> for your position -- as you do here -- helps reviewers understand why
> you feel the way you do. My review suggestion about making it easier
> to reason about the code while avoiding a brand new function, at the
> cost of a minor amount of duplication, was made in the context of this
> one-off case in which the function increased cognitive load and was
> used just once (not knowing that you envisioned future callers). If
> you expect the new function to be re-used by upcoming changes, then
> that may be a good reason to keep it. Stating so in the commit message
> will help reviewers see beyond the immediate patch or patch series.

Yeah. I should have mentioned this in the commit message.

> Aside from a couple minor style violations[1,2], I don't particularly
> oppose the helper function, though I have a quibble with the name
> check_format_field(), which I don't find helpful, and which (at least
> for me) increases the cognitive load. The increased cognitive load, I
> think, comes not only from the function name not spelling out what the
> function actually does, but also because the function is dual-purpose:
> it's both checking that the argument matches a particular token
> ("trailers", in this case) and extracting the sub-argument. Perhaps
> naming it match_and_extract_subarg() or something similar would help,
> though that's a mouthful.

I will fix those violations.
Also, "match_and_extract_subarg()" looks good to me.

> But the observation about the function being dual-purpose (thus
> potentially confusing) brings up other questions. For instance, is it
> too special-purpose? If you foresee more callers in the future with
> multiple-token arguments such as `%(content:subject:sanitize)`, should
> the function provide more assistance by splitting out each of the
> sub-arguments rather than stopping at the first? Taking that even
> further, a generalized helper for "splitting" arguments like that
> might be useful at the top-level of contents_atom_parser() too, rather
> than only for specific arguments, such as "trailers". Of course, this
> may all be way too ambitious for this little bug fix series or even
> for whatever upcoming changes you're planning, thus not worth
> pursuing.

Splitting sub-arguments is done at "<atomname>_atom_parser()".
If you mean pre-splitting every argument...
something like: ['contents', 'subject', 'sanitize'] for
`%(content:subject:sanitize)` in `contents_atom_parser()` ? I'm not
able to see how it can be useful.

Sorry, If I got your concerned wrong.

> As for the helper's implementation, I might have written it like this:
>
>     static int check_format_field(...)
>     {
>         const char *opt
>         if (!strcmp(arg, field))
>             *option = NULL;
>         else if (skip_prefix(arg, field, opt) && *opt == ':')
>             *option = opt + 1;
>         else
>             return 0;
>         return 1;
>     }
>
> which is more compact and closer to what I suggested earlier for
> avoiding the helper function in the first place. But, of course,
> programming is quite subjective, and you may find your implementation
> easier to reason about. Plus, your version has the benefit of being
> slightly more optimal since it avoids an extra string scan, although
> that probably is mostly immaterial considering that
> contents_atom_parser() itself contains a long chain of potentially
> sub-optimal strcmp() and skip_prefix() calls.

"programming is quite subjective"
Yeah, I couldn't agree more.

The change you suggested looks good too. But I'm little inclined to my
keeping my changes. I'm curious, what others have to say on this.

Thanks,
Hariom

^ permalink raw reply

* [RFD] on removing "git-foo" for builtins from disk
From: Junio C Hamano @ 2020-08-24 22:23 UTC (permalink / raw)
  To: git; +Cc: users, Johannes Schindelin

Back in 2005, all "git" subcommands were installed in $PATH as
"git-commit" etc., and people wrote their scripts assuming that they
are all available on their $PATH.  But it then became cumbersome to
have so many "git-foo" commands in /usr/bin/.  Because "git" is
installed in /usr/bin/ and knows to dispatch to 'git-foo' subcommand
when invoked with 'foo' as its first argument, we asked our users to
use "git foo" form and moved "git-foo" commands out of /usr/bin/,
moved to /usr/libexec/git-core/, and gave an escape hatch, which is
to add the output of "git --exec-path" early in their $PATH, to old
scripts (i.e. the scripts needed a one-line fix).

Exectly 12 years ago, we asked the users if they still need these
on-disk binaries [*1*].  Many built-in commands (i.e. subcommands
whose code is in the "git" binary), as long as they are invoked in
"git foo" form, do not need to be installed anywhere on-disk, and
theoretically can be removed, as long as the users and the scripts
they wrote long time ago, following our advise to use the escape
hatch, can somehow be convinced.

Removing /usr/libexec/git-core/git-{add,branch,commit,...} will
allow us to reduce disk footprint on systems that do not support
hardlinks and slightly simplifies the installation procedure.  Some
folks want to do so unconditionally.  The only downside being that
we finally break the promise we made our users 12 years ago.

I do not have a strong opinion either way, but if proponents for
removal can convince others successfully, I do not mind removing the
unnecessary on-disk binaries for built-in commands.

Discuss away.


[References and Footnotes]

*1* https://lore.kernel.org/git/7vprnzt7d5.fsf@gitster.siamese.dyndns.org/

^ permalink raw reply

* Re: [PATCH 0/7] Better threaded delta resolution in index-pack (another try)
From: Jeff King @ 2020-08-24 22:08 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, steadmon
In-Reply-To: <cover.1598296530.git.jonathantanmy@google.com>

On Mon, Aug 24, 2020 at 12:16:30PM -0700, Jonathan Tan wrote:

> I'm trying to resurrect [1] and have rebased it to latest master
> (675a4aaf3b ("Ninth batch", 2020-08-19)).
> 
> Peff said [2] (of v1) that the overall direction seems reasonable and
> Josh Steadmon said [3] (of v2) that it looks mostly good except for
> possible improvements to commit messages and comments. Josh did not list
> out specific improvements to commit messages but I have taken his
> suggestions for comments.

I haven't looked closely yet, but since I was doing timings of
index-pack recently[1], I wondered if this might change anything
(spoiler: it doesn't really seem to).

Here's the result of p5302 with my extra tests on my 8-core (16 with
hyperthreading) laptop against linux.git:

  5302.3: index-pack 0 threads                   266.66(263.85+2.71)
  5302.4: index-pack 1 threads                   275.06(272.11+2.85)
  5302.5: index-pack 2 threads                   159.49(285.44+3.51)
  5302.6: index-pack 4 threads                   102.54(318.86+4.30)
  5302.7: index-pack 8 threads                   75.60(391.39+6.56) 
  5302.8: index-pack 16 threads                  75.56(748.45+13.37)
  5302.9: index-pack default number of threads   75.01(389.33+6.59) 

So the conclusions from that other series remain pretty similar: nothing
gets faster as we move past the number of actual cores. The penalty for
doing so seems less than what I got before, though (though it might just
be a fluke; it was something like 2s worse before your patches, and
there's a bit of noise; the increased CPU time can be disregarded as the
processors are throttled down when more are running).

The overall time seems to get slightly worse, though (HEAD~7 is before
your patch, HEAD is with it):

  Test                                           HEAD~7               HEAD                    
  --------------------------------------------------------------------------------------------
  5302.9: index-pack default number of threads   71.96(376.11+3.66)   74.18(390.62+6.08) +3.1%

There may be other cases that get better, though. A 3% increase here is
probably OK if we get something for it. But if our primary goal here is
increasing multithread efficiency, then we should be able to show some
benchmark that improves. :)

-Peff

[1] https://lore.kernel.org/git/20200821175153.GA3263018@coredump.intra.peff.net/

^ permalink raw reply

* [PATCH] fixup! index-pack: make quantum of work smaller
From: Jonathan Tan @ 2020-08-24 21:27 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan
In-Reply-To: <cover.1598296530.git.jonathantanmy@google.com>

---
> Josh did not list
> out specific improvements to commit messages but I have taken his
> suggestions for comments.

...and somehow I forgot to commit them before sending out this patch
set, so here they are.

 builtin/index-pack.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 0a5b938e1e..4fa9127c35 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -62,6 +62,8 @@ struct base_data {
  * Stack of struct base_data that have unprocessed children.
  * threaded_second_pass() uses this as a source of work (the other being the
  * objects array).
+ *
+ * Guarded by work_mutex.
  */
 LIST_HEAD(work_head);
 
@@ -70,11 +72,16 @@ LIST_HEAD(work_head);
  * processed or are being processed, and at least one child is being processed.
  * These struct base_data must be kept around until the last child is
  * processed.
+ *
+ * Guarded by work_mutex.
  */
 LIST_HEAD(done_head);
 
 /*
  * All threads share one delta base cache.
+ *
+ * base_cache_used is guarded by work_mutex, and base_cache_limit is read-only
+ * in a thread.
  */
 size_t base_cache_used;
 size_t base_cache_limit;
-- 
2.28.0.297.g1956fa8f8d-goog


^ permalink raw reply related

* Re: [PATCH] Avoid infinite loop in malformed packfiles
From: Junio C Hamano @ 2020-08-24 21:22 UTC (permalink / raw)
  To: Jeff King; +Cc: René Scharfe, Ori Bernstein, git
In-Reply-To: <20200824205231.GA787628@coredump.intra.peff.net>

Jeff King <peff@peff.net> writes:

> It may be hard to test, as I suspect modern versions of Git are not
> happy to create such a deep chain. We could test with a lowered value of
> the config option, though.

Yes, that was what I meant.  Start from a 1KB text, create 50
revisions of the file by adding a single line at its end at a time,
pack with depth limit of 100, and then see "git log -p" die when the
allowed max lowered to 10, or something like that.


^ permalink raw reply

* Re: [PATCH 7/7] index-pack: make quantum of work smaller
From: Junio C Hamano @ 2020-08-24 21:19 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff, steadmon
In-Reply-To: <3e69f41f19535fa2c04fae6adc78bcc4f052d523.1598296530.git.jonathantanmy@google.com>

Jonathan Tan <jonathantanmy@google.com> writes:

> Currently, when index-pack resolves deltas, it does not split up delta
> trees into threads: each delta base root (an object that is not a
> REF_DELTA or OFS_DELTA) can go into its own thread, but all deltas on
> that root (direct or indirect) are processed in the same thread.
>
> This is a problem when a repository contains a large text file (thus,
> delta-able) that is modified many times - delta resolution time during
> fetching is dominated by processing the deltas corresponding to that
> text file.

We favor wide and shallow delta trees for both reduced storage
footprint and lower runtime overhead, so optimizing the index-pack
for that use case makes a lot of sense.


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox