Git development

Git development
 help / color / mirror / Atom feed

* Re: [PATCH 1/1] add: use advise function to display hints
From: Junio C Hamano @ 2020-01-06 23:18 UTC (permalink / raw)
  To: Emily Shaffer; +Cc: Heba Waly via GitGitGadget, git, Heba Waly
In-Reply-To: <20200106231327.GB181522@google.com>

Emily Shaffer <emilyshaffer@google.com> writes:

> Hm, I guess this answers my question above about them being 1:1. But I
> suppose it doesn't necessarily preclude advise() from associating a
> single config with multiple advice messages.

... and probably the other message in the thread from me would
answer any remaining question you may have, I guess ;-)

^ permalink raw reply

* Re: [PATCH 1/1] add: use advise function to display hints
From: Emily Shaffer @ 2020-01-06 23:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Heba Waly via GitGitGadget, git, Heba Waly
In-Reply-To: <xmqqpng1eisc.fsf@gitster-ct.c.googlers.com>

On Thu, Jan 02, 2020 at 11:54:11AM -0800, Junio C Hamano wrote:
> "Heba Waly via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Heba Waly <heba.waly@gmail.com>
> >
> > Use the advise function in advice.c to display hints to the users, as
> > it provides a neat and a standard format for hint messages, i.e: the
> > text is colored in yellow and the line starts by the word "hint:".
> 
> Use of advise() function is good for giving hints not just due to
> its yellow coloring (which by the way I find not very readable,
> perhaps because I use black ink on white paper).  One good thing in
> using the advise() API is that the messages can also be squelched
> with advice.* configuration variables.
> 
> And these two hints in "git add" are good chandidates to make
> customizable (perhaps with "advice.addNothing"), so I tend to agree
> with you that it makes sense to move these two messages to advise().
> Unfortunately this patch goes only halfway and stops (see below).
> 
> If there are many other places that calls to advise() are made
> without getting guarded by the toggles defined in advice.c, we
> should fix them, I think.

Maybe this is my C++ habits not dying when they should :) but to me,
this begs the question, "why doesn't advise() check the toggles for me?"

Are advice messages 1:1 with advice settings? Is there a reason that
advise() doesn't look up its corresponding config for itself?

 - Emily

> 
> >
> > Signed-off-by: Heba Waly <heba.waly@gmail.com>
> > ---
> >  builtin/add.c  | 4 ++--
> >  t/t3700-add.sh | 2 +-
> >  2 files changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/builtin/add.c b/builtin/add.c
> > index 4c38aff419..eebf8d772b 100644
> > --- a/builtin/add.c
> > +++ b/builtin/add.c
> > @@ -390,7 +390,7 @@ static int add_files(struct dir_struct *dir, int flags)
> >  		fprintf(stderr, _(ignore_error));
> >  		for (i = 0; i < dir->ignored_nr; i++)
> >  			fprintf(stderr, "%s\n", dir->ignored[i]->name);
> > -		fprintf(stderr, _("Use -f if you really want to add them.\n"));
> > +		advise(_("Use -f if you really want to add them.\n"));
> >  		exit_status = 1;
> >  	}
> >  
> > @@ -480,7 +480,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
> >  
> >  	if (require_pathspec && pathspec.nr == 0) {
> >  		fprintf(stderr, _("Nothing specified, nothing added.\n"));
> > -		fprintf(stderr, _("Maybe you wanted to say 'git add .'?\n"));
> > +		advise( _("Maybe you wanted to say 'git add .'?\n"));
> >  		return 0;
> >  	}
> 
> The final code for the above part would look like:
> 
> 		if (advice_add_nothing)
> 			advise(_("Use -f if you really want to add them."));
> 		...
> 		if (advice_add_nothing)
> 			advise( _("Maybe you wanted to say 'git add .'?"));
> 

Hm, I guess this answers my question above about them being 1:1. But I
suppose it doesn't necessarily preclude advise() from associating a
single config with multiple advice messages.

 - Emily

^ permalink raw reply

* Re: [PATCH 1/1] add: use advise function to display hints
From: Junio C Hamano @ 2020-01-06 23:13 UTC (permalink / raw)
  To: Emily Shaffer; +Cc: Heba Waly via GitGitGadget, git, Heba Waly
In-Reply-To: <20200106230712.GA181522@google.com>

Emily Shaffer <emilyshaffer@google.com> writes:

> On Thu, Jan 02, 2020 at 03:04:01AM +0000, Heba Waly via GitGitGadget wrote:
>> From: Heba Waly <heba.waly@gmail.com>
>> 
>> @@ -390,7 +390,7 @@ static int add_files(struct dir_struct *dir, int flags)
>>  		fprintf(stderr, _(ignore_error));
>>  		for (i = 0; i < dir->ignored_nr; i++)
>>  			fprintf(stderr, "%s\n", dir->ignored[i]->name);
>> -		fprintf(stderr, _("Use -f if you really want to add them.\n"));
>> +		advise(_("Use -f if you really want to add them.\n"));
>
> In the vein of the rest of your project, for me I'd rather see a
> copy-pasteable response here:
>
> "Use 'git add -f " + name + "' if you really want to add them."
>
> That is, if you know the name of the file that was being added here, you
> could provide it so the user can simply copy and go, rather than
> retyping.

Just being a devil's advocate, but you are opening a can of worms by
suggesting so---the path needs to be quoted proporly (and the way to
do so may be different depending on the shell in use), for example.


^ permalink raw reply

* Re: Fwd: My Introduction and a Doubt
From: Johannes Schindelin @ 2020-01-06 23:07 UTC (permalink / raw)
  To: Nirmal Khedkar; +Cc: git
In-Reply-To: <CAFFaXsx3Mtrq4mEGq6GYct7ZcRtucmZdRG-PQmjLrvVVTfq2Wg@mail.gmail.com>

Hi Nirmal,

On Sun, 5 Jan 2020, Nirmal Khedkar wrote:

> Hey everybody!
>
> I'd love to use this email to introduce myself to the community and
> ask for help.
>
> I'm Nirmal Khedkar, student from India. I love making end user
> applications, and in this FOSS world, there simply aren't many as big
> and impactful as Git. I'd love to contribute Git for GSoC 2020 as well
> as for long-term.
>
> Now the doubt:
> I'm trying to wrap up the issue mentioned here
> (https://github.com/gitgitgadget/git/issues/486) : basically allowing
> git bisect to work from subdirectories.  I do consider myself okay in
> C but I'm still kind-of stuck here: hence emailing this on the mailing
> list.
>
>  When you do "git bisect" in a subdirectory, an error "You need to run
> this command from the toplevel of the working tree." is raised. The
> error is raised in "git-sh-setup.sh" and the command begins execution
> in "bisect--helper.c", but I cant understand how the error appears.
> Additionally I'd love to know how the C files linkwith the numerous
> shell scripts within Git.
>
> I've searched all I could within the Git project, but if there's any
> existing documentation on this, please let me know.

I answered on that ticket:
https://github.com/gitgitgadget/git/issues/486#issuecomment-571355006

Ciao,
Johannes

^ permalink raw reply

* Re: [PATCH 1/1] add: use advise function to display hints
From: Emily Shaffer @ 2020-01-06 23:07 UTC (permalink / raw)
  To: Heba Waly via GitGitGadget; +Cc: git, Heba Waly, Junio C Hamano
In-Reply-To: <90608636bf184de76f91e4e04d9e796a021775a0.1577934241.git.gitgitgadget@gmail.com>

On Thu, Jan 02, 2020 at 03:04:01AM +0000, Heba Waly via GitGitGadget wrote:
> From: Heba Waly <heba.waly@gmail.com>
> 
> @@ -390,7 +390,7 @@ static int add_files(struct dir_struct *dir, int flags)
>  		fprintf(stderr, _(ignore_error));
>  		for (i = 0; i < dir->ignored_nr; i++)
>  			fprintf(stderr, "%s\n", dir->ignored[i]->name);
> -		fprintf(stderr, _("Use -f if you really want to add them.\n"));
> +		advise(_("Use -f if you really want to add them.\n"));

In the vein of the rest of your project, for me I'd rather see a
copy-pasteable response here:

"Use 'git add -f " + name + "' if you really want to add them."

That is, if you know the name of the file that was being added here, you
could provide it so the user can simply copy and go, rather than
retyping.


 - Emily

^ permalink raw reply

* Re: Reduction in blanks in `git status` makes output pretty cramped
From: Anthony Sottile @ 2020-01-06 23:06 UTC (permalink / raw)
  To: Jeff King; +Cc: John Lin, brian m. carlson, Git Mailing List
In-Reply-To: <20200106212343.GC980197@coredump.intra.peff.net>

> (I didn't even notice, as I think it only affects the case where advice.statusHints is enabled

yeah this makes sense -- (unfortunately) advice.statusHints appears to
be the default

Anthony

^ permalink raw reply

* Re: [PATCH 1/1] doc: submodule: fix typo for command absorbgitdirs
From: Johannes Schindelin @ 2020-01-06 22:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, thomas menzel
In-Reply-To: <xmqqblrga4h1.fsf@gitster-ct.c.googlers.com>

Hi Junio,

On Mon, 6 Jan 2020, Junio C Hamano wrote:

> When anybody tries to respond to the quoted message [*1*] that came
> from GGG in the way recommended for the list, the
> *.noreply.github.com address gets placed on the Cc:, which is
> probably not wanted.
>
> Can GGG do something about it?  As GitHub specific service, it is
> not wrong for it to know what noreply.github.com means so dropping
> such an address from its Cc: line shouldn't be too difficult.
>
> Thanks.
>
>
> Reference *1*
>
> <b032358ca97df3bd5605ff356196f82a1f16a320.1578322213.git.gitgitgadget@gmail.com>

Right. I added it to the bug tracker:
https://github.com/gitgitgadget/gitgitgadget/issues/182

Ciao,
Dscho

^ permalink raw reply

* Re: [Outreachy] Return value before or after free()?
From: Jonathan Nieder @ 2020-01-06 22:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Miriam R., git
In-Reply-To: <20200106213051.GD980197@coredump.intra.peff.net>

Hi,

Jeff King wrote:
> On Mon, Jan 06, 2020 at 10:15:53PM +0100, Miriam R. wrote:

>> in run-command.c file `exists_in_PATH()` function does this:
>>
>> static int exists_in_PATH(const char *file)
>> {
>>	char *r = locate_in_PATH(file);
>>	free(r);
>>	return r != NULL;
>> }
>>
>> I wonder if it is correct to do return r != NULL; after free(r);
>
> It is technically undefined behavior according to the C standard, but I
> think it would be hard to find an implementation where it was not
> perfectly fine in practice.
>
> Ref: http://c-faq.com/malloc/ptrafterfree.html
>
> I'd probably leave it alone unless it is causing a problem (e.g., a
> static analyzer complaining).

Today I learned.

Miriam, do you have more context?  Did you notice this while reading or
did a tool bring it to your attention?

(Because I was curious, here's what I chased down in C99:

7.20.3.2 "The free function" says: "The free function causes the space
pointed to by ptr to be deallocated, that is, made available for
further allocation."

6.2.4 "Storage durations of objects" says: "The value of a pointer
becomes indeterminate when the object it points to reaches the end of
its lifetime."

6.2.5 "Types" says: "A pointer type describes an object whose value
provides a reference to an entity of the referenced type."

6.5.9 "Equality operators": "Two pointers compare equal if and only if
both are null pointers, both are pointers to the same object
(including a pointer to an object and a subobject at its beginning) or
function, both are pointers to one past the last element of the same
array object, or one is a pointer to one past the end of one array
object and the other is a pointer to the start of a different array
object that happens to immediately follow the first array object in
the address space."

J.2 "Undefined behavior": "The behavior is undefined in the following
circumstances: [...] The value of an object with automatic storage
duration is used while it is indeterminate (6.2.4, 6.7.8, 6.8)"

The reference to automatic storage duration there is interesting.  Of
course `r` here does have automatic storage duration, but the
distinction from

	char **r = xmalloc(sizeof(*r));
	*r = locate_in_PATH(file);
	free(*r);
	/* leak r */
	return *r != NULL;

is peculiar.  It looks like exists_in_PATH is indeed producing
undefined behavior, but the intention of the standard was probably to
make the behavior implementation defined.)

Thanks,
Jonathan

^ permalink raw reply

* Re: [Outreachy] Return value before or after free()?
From: Jeff King @ 2020-01-06 21:30 UTC (permalink / raw)
  To: Miriam R.; +Cc: git
In-Reply-To: <CAN7CjDDBA0ZoCG9aaQf5rg3gxqny=EjR6v6jE1mnxvUJQSF_0Q@mail.gmail.com>

On Mon, Jan 06, 2020 at 10:15:53PM +0100, Miriam R. wrote:

> in run-command.c file `exists_in_PATH()` function does this:
> 
> static int exists_in_PATH(const char *file)
> {
> char *r = locate_in_PATH(file);
> free(r);
> return r != NULL;
> }
> 
> I wonder if it is correct to do return r != NULL; after free(r);

It is technically undefined behavior according to the C standard, but I
think it would be hard to find an implementation where it was not
perfectly fine in practice.

Ref: http://c-faq.com/malloc/ptrafterfree.html

I'd probably leave it alone unless it is causing a problem (e.g., a
static analyzer complaining).

-Peff

^ permalink raw reply

* Re: Reduction in blanks in `git status` makes output pretty cramped
From: Jeff King @ 2020-01-06 21:23 UTC (permalink / raw)
  To: Anthony Sottile; +Cc: John Lin, brian m. carlson, Git Mailing List
In-Reply-To: <CA+dzEB=m94egho+1UOGeDSFjjzwXYA-HncM-9C7NLLP=E3U2Jw@mail.gmail.com>

On Tue, Dec 31, 2019 at 03:18:17PM -0800, Anthony Sottile wrote:

> This is a comment on 7b098429355bb3271f9ffdf73b97f2ef82794fea
> 
> https://github.com/git/git/commit/7b098429355bb3271f9ffdf73b97f2ef82794fea

I don't have much of an opinion myself (I didn't even notice, as I think
it only affects the case where advice.statusHints is enabled). But
there's more discussion in this thread:

  https://lore.kernel.org/git/20190430060210.79610-1-johnlinp@gmail.com/

about whether to add or remove blank lines. ;)

+cc people who expressed opinions there

^ permalink raw reply

* Re: [PATCH v2] sha1-file: remove OBJECT_INFO_SKIP_CACHED
From: Jeff King @ 2020-01-06 21:17 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jonathan Tan, git, gitster
In-Reply-To: <20200104001331.GA130883@google.com>

On Fri, Jan 03, 2020 at 04:13:31PM -0800, Jonathan Nieder wrote:

> To follow up on Junio's hint in his review: callers can inject
> additional cached objects by using pretend_object_file.  Junio
> described how this would make sense as a mechanism for building
> the virtual ancestor object, but we don't do that.  In fact, the
> only caller is fake_working_tree_commit in "git blame", a read-only
> code path. *phew*
> 
> -- >8 --
> Subject: sha1-file: document how to use pretend_object_file
> [...]
> +/*
> + * Add an object file to the in-memory object store, without writing it
> + * to disk.
> + *
> + * Callers are responsible for calling write_object_file to record the
> + * object in persistent storage before writing any other new objects
> + * that reference it.
> + */
>  int pretend_object_file(void *, unsigned long, enum object_type,
>  			struct object_id *oid);
>  

I think this is an improvement over the status quo, but it's still a
potential trap for code which happens to run in the same process (see my
other email in the thread).

Should the message perhaps be even more scary?

-Peff

^ permalink raw reply

* [Outreachy] Return value before or after free()?
From: Miriam R. @ 2020-01-06 21:15 UTC (permalink / raw)
  To: git

Dear all,
in run-command.c file `exists_in_PATH()` function does this:

static int exists_in_PATH(const char *file)
{
char *r = locate_in_PATH(file);
free(r);
return r != NULL;
}

I wonder if it is correct to do return r != NULL; after free(r);

Could be this version more readable? :

static int exists_in_PATH(const char *file)
{
char *r = locate_in_PATH(file);
int res = (r != NULL);
free(r);
return res;
}

Could you please give me your feedback?

Thank you!

Best,
Miriam.

^ permalink raw reply

* Re: [PATCH v2] sha1-file: remove OBJECT_INFO_SKIP_CACHED
From: Jeff King @ 2020-01-06 21:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git, jrnieder
In-Reply-To: <xmqqftgxedtk.fsf@gitster-ct.c.googlers.com>

On Thu, Jan 02, 2020 at 01:41:27PM -0800, Junio C Hamano wrote:

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > As a historical note, the function now known as repo_read_object_file()
> > was taught the empty tree in 346245a1bb ("hard-code the empty tree
> > object", 2008-02-13), and the function now known as oid_object_info()
> > was taught the empty tree in c4d9986f5f ("sha1_object_info: examine
> > cached_object store too", 2011-02-07). repo_has_object_file() was never
> > updated, perhaps due to oversight. The flag OBJECT_INFO_SKIP_CACHED,
> > introduced later in dfdd4afcf9 ("sha1_file: teach
> > sha1_object_info_extended more flags", 2017-06-26) and used in
> > e83e71c5e1 ("sha1_file: refactor has_sha1_file_with_flags", 2017-06-26),
> > was introduced to preserve this difference in empty-tree handling, but
> > now it can be removed.
> 
> I am not 100% sure if the implication of this change is safe to
> allow us to say "now it can be".
> 
> The has_object_file() helper wanted to say "no" when given a
> non-existing object registered via the pretend_object_file(),
> presumably because we wanted to allow a use pattern like:
> 
>  - prepare an in-core representation of an object we tentatively
>    expect, but not absolutely sure, to be necessary.
> 
>  - perform operations, using the object data obtained via
>    read_object() API, which is capable of yielding data even for
>    such "pretend" objects (perhaps we are creating a tentative merge
>    parents during a recursive merge).
> 
>  - write out final set of objects by enumerating those that do not
>    really exist yet (via has_object_file() API).
> 
> Teaching about the empty tree to has_object_file() is a good thing
> (especially because we do not necessarily write an empty tree object
> to our repositories), but as a collateral damage of doing so, we
> make such use pattern impossible.  
> 
> It is not a large loss---the third bullet in the above list can just
> be made to unconditionally call write_object_file() without
> filtering with has_object_file() and write_object_file() will apply
> the right optimization anyway, so it probably is OK.

I agree that whoever called pretend_object_file() can be careful and
write out the final set of objects itself via write_object_file(). But
I'd worry a bit about a caller who doesn't necessarily realize that they
need to do that. E.g., imagine we call pretend_object_file() for some
blob oid, expecting it to be read-only. And then in the same process,
some other bit of the code writes out a tree that mentions that blob.
Oops, that tree is now corrupt after we exit the process. And IMHO
neither the pretend-caller nor the tree-writer are to blame; the problem
is that they shared global state they were not expecting.

This is pretty far-fetched given that the only user of
pretend_object_file() is in git-blame right now. But it does give me
pause. Overall, though, I'm more inclined to say that we should be
dropping SKIP_CACHED here and considering pretend_object_file() to be a
bit dangerous (i.e., to keep it in mind if somebody proposes more
calls).

Another point of reference (in favor of Jonathan's patch):

  https://lore.kernel.org/git/20190304174053.GA27497@sigill.intra.peff.net/

is a bug that would not have happened if this patch had been applied
(there's also some discussion of the greater issue, but nothing that wasn't
already brought up here, I think).

-Peff

^ permalink raw reply

* Re: [PATCH 4/5] merge: verify signatures if gpg.verifySignatures is true
From: Junio C Hamano @ 2020-01-06 21:01 UTC (permalink / raw)
  To: Hans Jerry Illikainen; +Cc: git
In-Reply-To: <20200105135616.19102-5-hji@dyntopia.com>

Hans Jerry Illikainen <hji@dyntopia.com> writes:

> @@ -610,6 +610,8 @@ static int git_merge_config(const char *k, const char *v, void *cb)
>  		show_diffstat = git_config_bool(k, v);
>  	else if (!strcmp(k, "merge.verifysignatures"))
>  		verify_signatures = git_config_bool(k, v);
> +	else if (!strcmp(k, "gpg.verifysignatures") && verify_signatures < 0)
> +		verify_signatures = git_config_bool(k, v);

Hmm, the next person who attempts to generalize the mechanism
further would have a hard time introducing a fallback configuration
that is even common than "gpg" when s/he has to start with this
code, no?  That is, this patch introduced "gpg.verifysignatures is
used when merge.verifysignatures is not defined" and with the two
level override the code works OK, but to implement "if neither gpg.*
or merge.* is defined, common.verifysignatures is used instead", the
above part needs to be dismantled and redone.

Keeping the "initialize verify_signatures to -1 (unspecified)" from
this patch, setting a separate gpg_verify_signatures variable upon
seeing gpg.verifysignatures, and consolidating the final finding
after git_config(git_merge_config, NULL) returns into verify_signatures
like so:

	init_diff_ui_defaults();
	git_config(git_merge_config, NULL);

+	if (verify_signatures < 0)
+		verify_signatures = (0 <= gpg_verify_signatures) 
+				  ? gpg_verify_signatures 
+				  : 0;

would be more in line with the way we arrange multiple configuration
variables to serve as fallback defaults.  And that is more easily
extensible.

Also with such an arrangement, "if (verify_signatures == 1)" we see
below will become unnecessary, which is another plus.

Thanks.

>  	else if (!strcmp(k, "pull.twohead"))
>  		return git_config_string(&pull_twohead, k, v);
>  	else if (!strcmp(k, "pull.octopus"))
> @@ -1399,7 +1401,7 @@ int cmd_merge(int argc, const char **argv, const char *prefix)
>  		if (remoteheads->next)
>  			die(_("Can merge only exactly one commit into empty head"));
>  
> -		if (verify_signatures &&
> +		if (verify_signatures == 1 &&
>  		    gpg_verify_commit(&remoteheads->item->object.oid, NULL,
>  				      NULL, gpg_flags))
>  			die(_("Signature verification failed"));
> @@ -1423,7 +1425,7 @@ int cmd_merge(int argc, const char **argv, const char *prefix)
>  		usage_with_options(builtin_merge_usage,
>  			builtin_merge_options);
>  
> -	if (verify_signatures) {
> +	if (verify_signatures == 1) {
>  		for (p = remoteheads; p; p = p->next) {
>  			if (gpg_verify_commit(&p->item->object.oid, NULL, NULL,
>  					      gpg_flags))


^ permalink raw reply

* Re: [PATCH 1/1] commit: make the sign-off trailer configurable
From: Junio C Hamano @ 2020-01-06 20:45 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Hans Jerry Illikainen, git
In-Reply-To: <xmqqpnfw8gyn.fsf@gitster-ct.c.googlers.com>

Junio C Hamano <gitster@pobox.com> writes:

> ++
> +As it makes it harder to argue against one who tells the court "that
> +log message ends with a SoB by person X but it is very plausible
> +that it was done by inertia without person X really intending to
> +certify what DCO says, and the SoB is meaningless." to more

s/to more/to have more/; otherwise the sentence does not make any
sense.  Sorry about the noise.

> +publicized ways to add SoB automatically, Git does not (and will not)
> +have a configuration variable to enable it by default.
>  
>  -n::
>  --no-verify::

^ permalink raw reply

* Re: [PATCH 0/1] sequencer: comment out the 'squash!' line
From: Junio C Hamano @ 2020-01-06 20:41 UTC (permalink / raw)
  To: Mike Rappazzo; +Cc: Michael Rappazzo via GitGitGadget, Git List
In-Reply-To: <CANoM8SV=pT3sFrfnEqWc2xBn_c2rES0qSMsdptF0DgcxgYL94w@mail.gmail.com>

Mike Rappazzo <rappazzo@gmail.com> writes:

> This change isn't removing the subject line from the commit message
> during the edit phase, it is only commenting it out.  With the subject being
> commented out, it can minimize the effort to edit during the squash.

Which means existing automation will be broken if they are not
taught to be aware that these subject lines can now be commented out
if their Git is recent enough, which does not sound like a good thing.

> Furthermore, it can help to eliminate accidental inclusion in the final
> message.  Ultimately, the accidental inclusion is my motivation for
> submitting this.

Yes, but that is why the concatenated messages are given to the
editor to be "edited" by you, to be better than just a
concatenation, right?  When I deal with a "squash" (not "fixup"),
the end result would have a log message for a single commit that
describes the single thing it does, which would not resemble to the
original of any of the squashed message---and removing extra title
lines would be the smallest part of such an edit.  So...

^ permalink raw reply

* Re: [PATCH 1/1] commit: make the sign-off trailer configurable
From: Junio C Hamano @ 2020-01-06 20:31 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Hans Jerry Illikainen, git
In-Reply-To: <cfa40ca5-98a1-fc9c-9ccc-f14b81119e60@gmail.com>

Derrick Stolee <stolee@gmail.com> writes:

> Since I started the line of "this isn't a bad idea" I'll follow up with
> the historical search. Here are previous attempts from 2018 [1], 2015 [2],
> 2010 [3].
>
> Thanks,
> -Stolee
>
> [1] https://lore.kernel.org/git/20180204020318.4363-1-chenjingpiao@gmail.com/
> [2] https://lore.kernel.org/git/1435217454-5718-1-git-send-email-cmarcelo@gmail.com/
> [3] https://lore.kernel.org/git/alpine.LNX.2.00.1001131635510.16395@vqena.qenxr.bet.am/

Thanks.  In an earlier thread, we did

https://lore.kernel.org/git/20090401175153.GA90421@macbook.lan/

to which I said "... the wording we can update if somebody can come
up with a better one" in its follow-up.  Perhaps it's time for us to
be that somebody to help everybody to be on the same page.

Here is my attempt, starting from what I wrote in

https://lore.kernel.org/git/xmqqtw9m5s5m.fsf@gitster.mtv.corp.google.com/

-- >8 --
Subject: commit -s: document why commit.signoff option will not be added


Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 As I said in https://lore.kernel.org/git/7veiw69p26.fsf@gitster.siamese.dyndns.org/
 I have a mixed feeling about this.  To projects that use the same
 definition of what SoB means, not adding the configuration ever is
 the right thing to do, but Git is to be used by other projects, and
 some of them may use SoB with a completely different meaning that
 has no legal weight---and to them, lack of such an automation may
 be a bug.  So ...

 Documentation/git-commit.txt | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/Documentation/git-commit.txt b/Documentation/git-commit.txt
index ced5a9beab..1909551087 100644
--- a/Documentation/git-commit.txt
+++ b/Documentation/git-commit.txt
@@ -171,6 +171,13 @@ The `-m` option is mutually exclusive with `-c`, `-C`, and `-F`.
 	the rights to submit this work under the same license and
 	agrees to a Developer Certificate of Origin
 	(see http://developercertificate.org/ for more information).
++
+As it makes it harder to argue against one who tells the court "that
+log message ends with a SoB by person X but it is very plausible
+that it was done by inertia without person X really intending to
+certify what DCO says, and the SoB is meaningless." to more
+publicized ways to add SoB automatically, Git does not (and will not)
+have a configuration variable to enable it by default.
 
 -n::
 --no-verify::


^ permalink raw reply related

* Re: [PATCH 1/1] commit: make the sign-off trailer configurable
From: Derrick Stolee @ 2020-01-06 19:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Hans Jerry Illikainen, git
In-Reply-To: <xmqqtv58a5m2.fsf@gitster-ct.c.googlers.com>

On 1/6/2020 11:53 AM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> My initial thought was that the sign-off was supposed to be a purposeful
>> decision, but then I also realized that I never do anything in the Git
>> codebase that I _can't_ put online under the GPL. (It may not make it
>> upstream, but it will always be put online somewhere.)
> 
> Hmm...
> 
> Sorry, but I do not quite understand the flow of your logic here,
> especially, how "but then I also realized" trumps "signing off a
> patch is a conscious act---it would weaken the legal meaning if you
> automate it", which was why we deliberately avoided adding this
> configuration variable for the last 10+ years.
> 
> So, I dunno.

I guess I meant that enabling this config for a repo is also a
conscious act, making me think this is not completely unreasonable.

But if we've already discussed and rejected this feature in the past,
then that's sufficient.

Since I started the line of "this isn't a bad idea" I'll follow up with
the historical search. Here are previous attempts from 2018 [1], 2015 [2],
2010 [3].

Thanks,
-Stolee

[1] https://lore.kernel.org/git/20180204020318.4363-1-chenjingpiao@gmail.com/
[2] https://lore.kernel.org/git/1435217454-5718-1-git-send-email-cmarcelo@gmail.com/
[3] https://lore.kernel.org/git/alpine.LNX.2.00.1001131635510.16395@vqena.qenxr.bet.am/

^ permalink raw reply

* Re: [PATCH 1/1] fetch: set size_multiple in split_commit_graph_opts
From: Jeff King @ 2020-01-06 19:39 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, szeder.dev, Derrick Stolee, Junio C Hamano
In-Reply-To: <91d89356a20625d04af74d458c28b32445e760c1.1577981654.git.gitgitgadget@gmail.com>

On Thu, Jan 02, 2020 at 04:14:14PM +0000, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
> 
> In 50f26bd ("fetch: add fetch.writeCommitGraph config setting",
> 2019-09-02), the fetch builtin added the capability to write a
> commit-graph using the "--split" feature. This feature creates
> multiple commit-graph files, and those can merge based on a set
> of "split options" including a size multiple. The default size
> multiple is 2, which intends to provide a log_2 N depth of the
> commit-graph chain where N is the number of commits.
> 
> However, I noticed during dogfooding that my commit-graph chains
> were becoming quite large when left only to builds by 'git fetch'.
> It turns out that in split_graph_merge_strategy(), we default the
> size_mult variable to 2 except we override it with the context's
> split_opts if they exist. In builtin/fetch.c, we create such a
> split_opts, but do not populate it with values.
> 
> This problem is due to two failures:
> 
>  1. It is unclear that we can add the flag COMMIT_GRAPH_WRITE_SPLIT
>     with a NULL split_opts.
>  2. If we have a non-NULL split_opts, then we override the default
>     values even if a zero value is given.
> 
> Correct both of these issues. First, do not override size_mult when
> the options provide a zero value. Second, stop creating a split_opts
> in the fetch builtin.

Thanks, the explanation and fix (both parts) look good to me, modulo the
subject correction you already noted.

> ---
>  builtin/fetch.c | 4 +---
>  commit-graph.c  | 4 +++-

Is it worth covering this with a test?

I guess the non-fetch code paths for splitting already cover this pretty
well, and this is just about managing to get the right number into the
commit-graph code. So perhaps it isn't worth it.

-Peff

^ permalink raw reply

* Re: [PATCH 3/5] commit: refactor signature verification helpers
From: Junio C Hamano @ 2020-01-06 19:36 UTC (permalink / raw)
  To: Hans Jerry Illikainen; +Cc: git
In-Reply-To: <20200105135616.19102-4-hji@dyntopia.com>

Hans Jerry Illikainen <hji@dyntopia.com> writes:

> +static unsigned gpg_flags = GPG_VERIFY_SHORT | GPG_VERIFY_COMPAT;
>  static struct strbuf merge_msg = STRBUF_INIT;
>  static struct strategy **use_strategies;
>  static size_t use_strategies_nr, use_strategies_alloc;
> @@ -633,7 +633,7 @@ static int git_merge_config(const char *k, const char *v, void *cb)
>  		sign_commit = git_config_bool(k, v) ? "" : NULL;
>  		return 0;
>  	} else if (!strcmp(k, "gpg.mintrustlevel")) {
> -		check_trust_level = 0;
> +		gpg_flags ^= GPG_VERIFY_COMPAT;

Did you really mean to "toggle the bit each time the variable
appears"?  Or is this better written as

		gpg_flags &= ~GPG_VERIFY_COMPAT;

instead?  There may be another instance of the same in this patch.


^ permalink raw reply

* Re: [PATCH 2/5] gpg-interface: support one-line summaries in print_signature_buffer()
From: Junio C Hamano @ 2020-01-06 19:33 UTC (permalink / raw)
  To: Hans Jerry Illikainen; +Cc: git
In-Reply-To: <20200105135616.19102-3-hji@dyntopia.com>

Hans Jerry Illikainen <hji@dyntopia.com> writes:

> Most consumers of the GPG interface use print_signature_buffer() to show
> This patch moves the one-line summary from verify_merge_signature() to
> print_signature_buffer().  It also introduces a GPG_VERIFY_SHORT flag
> that determines whether the summary should be printed.

I think the motivation makes sense.

> -void print_signature_buffer(const struct signature_check *sigc, unsigned flags)
> +void print_signature_buffer(const struct object_id *oid,
> +			    const struct signature_check *sigc, int status,
> +			    unsigned flags)
>  {
>  	const char *output = flags & GPG_VERIFY_RAW ?
>  		sigc->gpg_status : sigc->gpg_output;
> +	char hex[GIT_MAX_HEXSZ + 1];
> +	char *type;
>  
>  	if (flags & GPG_VERIFY_VERBOSE && sigc->payload)
>  		fputs(sigc->payload, stdout);
>  
>  	if (flags & GPG_VERIFY_FULL && output)
>  		fputs(output, stderr);
> +
> +	if (flags & GPG_VERIFY_SHORT && oid) {

I am not sure about the wisdom of "&& oid" here.  Wouldn't it be a
bug for a caller who asks for _SHORT format to feed a NULL oid to
us?  I would understand

	if (flags & GPG_VERIFY_SHORT) {
		if (!oid)
			BUG("Caller asking for SHORT without oid???");

much better, but if there is a caller that has a legitimate need
tobe able to pass NULL and still request SHORT, let's see it and
discuss if it is a good idea.

For that matter, the two print routines we have immediately above
share the same issue.

> +		type = xstrdup_or_null(
> +			type_name(oid_object_info(the_repository, oid, NULL)));
> +		if (!type)
> +			type = xstrdup("object");

This is minor, but I wonder if this is easier to follow.

		type = type_name(oid_object_info(the_repository, oid, NULL));
		if (!type)
			type = "object";
		type = xstrdup(type);

> +		*type = toupper(*type);

This is not helpful at all for i18n/l10n, I am afraid.

> +		find_unique_abbrev_r(hex, oid, DEFAULT_ABBREV);
> +
> +		switch (sigc->result) {
> +		case 'G':
> +			if (status)
> +				error(_("%s %s has an untrusted GPG signature, "
> +					"allegedly by %s."),
> +				      type, hex, sigc->signer);

The original was of course

        die(_("Commit %s has an untrusted GPG signature, "
              "allegedly by %s."), hex, signature_check.signer);

so the message can properly localized including the typename.  That
is no longer true with this conversion.

You probably would need to prepare a 3-element array (one element
for each of <object, commit, tag>), each element of the array being
a 3-element array that holds the message for these three cases
(i.e. G, N and default) instead.

^ permalink raw reply

* Re: [PATCH 0/1] sequencer: comment out the 'squash!' line
From: Jeff King @ 2020-01-06 19:32 UTC (permalink / raw)
  To: Mike Rappazzo; +Cc: Junio C Hamano, Michael Rappazzo via GitGitGadget, Git List
In-Reply-To: <CANoM8SV=pT3sFrfnEqWc2xBn_c2rES0qSMsdptF0DgcxgYL94w@mail.gmail.com>

On Mon, Jan 06, 2020 at 02:20:09PM -0500, Mike Rappazzo wrote:

> > Can you tell us more about your expected use case?  I am imagining
> > that most people use the log messages from both/all commits being
> > squashed when manually editing to perfect the final log message (as
> > opposed to mechanically processing the concatenated message), so it
> > shouldn't matter if the squash! title is untouched or commented out
> > to them, and those (probably minority) who are mechanical processing
> > will be hurt with this change, so I do not quite see the point of
> > this patch.
> 
> This change isn't removing the subject line from the commit message
> during the edit phase, it is only commenting it out.  With the subject being
> commented out, it can minimize the effort to edit during the squash.
> 
> Furthermore, it can help to eliminate accidental inclusion in the final
> message.  Ultimately, the accidental inclusion is my motivation for
> submitting this.

But I thought that was the point of "squash" versus "fixup"? One
includes the commit message, and the other does not.

I do think "commit --squash" is mostly useless for that reason, and I
suspect we could do a better job in the documentation about pushing
people to "--fixup".

But --squash _can_ be useful with other options to populate the commit
message (e.g., "--edit", which just pre-populates the subject with the
right "squash!" line but lets you otherwise write a normal commit
message). If that's the workflow you're using, then I'm sympathetic to
auto-removing just a "squash!" line, as it's automated garbage that is
only meant as a signal for --autosquash.

-Peff

^ permalink raw reply

* Re: [PATCH 0/1] sequencer: comment out the 'squash!' line
From: Mike Rappazzo @ 2020-01-06 19:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Michael Rappazzo via GitGitGadget, Git List
In-Reply-To: <xmqq7e24a3t0.fsf@gitster-ct.c.googlers.com>

On Mon, Jan 6, 2020 at 12:34 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Michael Rappazzo via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > Since this change what the expected post-rebase commit comment would look
> > like, related test expectations are adjusted to reflect the the new
> > expectation. A new test is added for the new expectation.
>
> Doesn't that mean automated tools people may have written require
> similar adjustment to continue working correctly if this change is
> applied?
>
> Can you tell us more about your expected use case?  I am imagining
> that most people use the log messages from both/all commits being
> squashed when manually editing to perfect the final log message (as
> opposed to mechanically processing the concatenated message), so it
> shouldn't matter if the squash! title is untouched or commented out
> to them, and those (probably minority) who are mechanical processing
> will be hurt with this change, so I do not quite see the point of
> this patch.

This change isn't removing the subject line from the commit message
during the edit phase, it is only commenting it out.  With the subject being
commented out, it can minimize the effort to edit during the squash.

Furthermore, it can help to eliminate accidental inclusion in the final
message.  Ultimately, the accidental inclusion is my motivation for
submitting this.

^ permalink raw reply

* Re: [PATCH 1/5] gpg-interface: conditionally show the result in print_signature_buffer()
From: Junio C Hamano @ 2020-01-06 19:07 UTC (permalink / raw)
  To: Hans Jerry Illikainen; +Cc: git
In-Reply-To: <20200105135616.19102-2-hji@dyntopia.com>

Hans Jerry Illikainen <hji@dyntopia.com> writes:

> The print_signature_buffer() function in gpg-interface.c is used to
> print the result of a GPG verified payload.  It takes a 'flags'
> parameter that determines what to print.
>
> Previously, the 'flags' parameter processed 2 flags:
>
> - GPG_VERIFY_RAW: to print the raw output from GPG instead of the
>   human(ish)-readable output.  One of these outputs were always
>   shown, irregardless of any other flags.
> - GPG_VERIFY_VERBOSE: to print the payload that was verified
>
> Interestingly, there was also a third flag defined in gpg-interface.h;
> GPG_VERIFY_OMIT_STATUS.  That flag wasn't used by the print function
> itself -- instead, callers would check for the presence of
> GPG_VERIFY_OMIT_STATUS before invoking print_signature_buffer().
>
> It seems reasonable that the GPG interface should handle all flags
> related to how the result should be (or shouldn't be) shown.  This patch
> implements that behavior by removing GPG_VERIFY_OMIT_STATUS and adding
> GPG_VERIFY_FULL.  If neither GPG_VERIFY_FULL nor GPG_VERIFY_VERBOSE is
> present, then nothing is printed.  This allows callers to invoke
> print_signature_buffer() unconditionally.

So in short, VERIFY_FULL is equivalent to !OMIT_STATUS?

As the direct callers of "print" are not the ones that set up bits
in flags, I think the proposed change makes the API easier to use.

Will queue.  Thanks.

> Signed-off-by: Hans Jerry Illikainen <hji@dyntopia.com>
> ---
>  builtin/tag.c           | 4 ++--
>  builtin/verify-commit.c | 2 +-
>  builtin/verify-tag.c    | 4 ++--
>  gpg-interface.c         | 2 +-
>  gpg-interface.h         | 6 +++---
>  tag.c                   | 4 +---
>  6 files changed, 10 insertions(+), 12 deletions(-)
>
> diff --git a/builtin/tag.c b/builtin/tag.c
> index e0a4c25382..8489e220e8 100644
> --- a/builtin/tag.c
> +++ b/builtin/tag.c
> @@ -112,10 +112,10 @@ static int verify_tag(const char *name, const char *ref,
>  {
>  	int flags;
>  	const struct ref_format *format = cb_data;
> -	flags = GPG_VERIFY_VERBOSE;
> +	flags = GPG_VERIFY_FULL | GPG_VERIFY_VERBOSE;
>  
>  	if (format->format)
> -		flags = GPG_VERIFY_OMIT_STATUS;
> +		flags = 0;
>  
>  	if (gpg_verify_tag(oid, name, flags))
>  		return -1;
> diff --git a/builtin/verify-commit.c b/builtin/verify-commit.c
> index 40c69a0bed..2a099ec6ba 100644
> --- a/builtin/verify-commit.c
> +++ b/builtin/verify-commit.c
> @@ -63,7 +63,7 @@ static int git_verify_commit_config(const char *var, const char *value, void *cb
>  int cmd_verify_commit(int argc, const char **argv, const char *prefix)
>  {
>  	int i = 1, verbose = 0, had_error = 0;
> -	unsigned flags = 0;
> +	unsigned flags = GPG_VERIFY_FULL;
>  	const struct option verify_commit_options[] = {
>  		OPT__VERBOSE(&verbose, N_("print commit contents")),
>  		OPT_BIT(0, "raw", &flags, N_("print raw gpg status output"), GPG_VERIFY_RAW),
> diff --git a/builtin/verify-tag.c b/builtin/verify-tag.c
> index f45136a06b..bd5e99925b 100644
> --- a/builtin/verify-tag.c
> +++ b/builtin/verify-tag.c
> @@ -30,7 +30,7 @@ static int git_verify_tag_config(const char *var, const char *value, void *cb)
>  int cmd_verify_tag(int argc, const char **argv, const char *prefix)
>  {
>  	int i = 1, verbose = 0, had_error = 0;
> -	unsigned flags = 0;
> +	unsigned flags = GPG_VERIFY_FULL;
>  	struct ref_format format = REF_FORMAT_INIT;
>  	const struct option verify_tag_options[] = {
>  		OPT__VERBOSE(&verbose, N_("print tag contents")),
> @@ -53,7 +53,7 @@ int cmd_verify_tag(int argc, const char **argv, const char *prefix)
>  		if (verify_ref_format(&format))
>  			usage_with_options(verify_tag_usage,
>  					   verify_tag_options);
> -		flags |= GPG_VERIFY_OMIT_STATUS;
> +		flags = 0;
>  	}
>  
>  	while (i < argc) {
> diff --git a/gpg-interface.c b/gpg-interface.c
> index 2d538bcd6e..fc182d39be 100644
> --- a/gpg-interface.c
> +++ b/gpg-interface.c
> @@ -341,7 +341,7 @@ void print_signature_buffer(const struct signature_check *sigc, unsigned flags)
>  	if (flags & GPG_VERIFY_VERBOSE && sigc->payload)
>  		fputs(sigc->payload, stdout);
>  
> -	if (output)
> +	if (flags & GPG_VERIFY_FULL && output)
>  		fputs(output, stderr);
>  }
>  
> diff --git a/gpg-interface.h b/gpg-interface.h
> index f4e9b4f371..4631a91330 100644
> --- a/gpg-interface.h
> +++ b/gpg-interface.h
> @@ -3,9 +3,9 @@
>  
>  struct strbuf;
>  
> -#define GPG_VERIFY_VERBOSE		1
> -#define GPG_VERIFY_RAW			2
> -#define GPG_VERIFY_OMIT_STATUS	4
> +#define GPG_VERIFY_VERBOSE (1 << 0)
> +#define GPG_VERIFY_RAW (1 << 1)
> +#define GPG_VERIFY_FULL (1 << 2)
>  
>  enum signature_trust_level {
>  	TRUST_UNDEFINED,
> diff --git a/tag.c b/tag.c
> index 71b544467e..b8d6da81eb 100644
> --- a/tag.c
> +++ b/tag.c
> @@ -28,9 +28,7 @@ static int run_gpg_verify(const char *buf, unsigned long size, unsigned flags)
>  
>  	ret = check_signature(buf, payload_size, buf + payload_size,
>  				size - payload_size, &sigc);
> -
> -	if (!(flags & GPG_VERIFY_OMIT_STATUS))
> -		print_signature_buffer(&sigc, flags);
> +	print_signature_buffer(&sigc, flags);
>  
>  	signature_check_clear(&sigc);
>  	return ret;

^ permalink raw reply

* Re: [PATCH 2/9] commit-graph: write changed paths bloom filters
From: Jakub Narebski @ 2020-01-06 18:44 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Junio C Hamano,
	Garima Singh
In-Reply-To: <e52c7ad37a306891487bd79a09b040bfb657d723.1576879520.git.gitgitgadget@gmail.com>

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
>
> The changed path bloom filters help determine which paths changed between a
> commit and its first parent. We already have the "--changed-paths" option
> for the "git commit-graph write" subcommand, now actually compute them under
> that option. The COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag enables this
> computation.
>
> RFC Notes: Here are some details about the implementation and I would love
> to know your thoughts and suggestions for improvements here.
>
> For details on what bloom filters are and how they work, please refer to
> Dr. Derrick Stolee's blog post [1].
> [1] https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-bloom-filters/
>
> 1. The implementation sticks to the recommended values of 7 and 10 for the
>    number of hashes and the size of each entry, as described in the blog.

Please provide references to original work for this.  Derrick Stolee
blog post references the following work:

  Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy, Sushil Singh, George Varghese
  "An Improved Construction for Counting Bloom Filters"
  http://theory.stanford.edu/~rinap/papers/esa2006b.pdf
  https://doi.org/10.1007/11841036_61

However, we do not use Counting Bloom Filters, but ordinary Bloom
Filters, if used in untypical way: instead of testing many elements
(keys) against single filter, we test single element (path) against
mainy filters.

Also, I'm not sure that values 10 bits per entry and 7 hash functions
are recommended; the work states:

  "For example, when n/m = 10 and k = 7 the false positive probability
  is just over 0.008."

Given false positive probablity we can calculate best choice for n/m and
k.

On the other hand in https://arxiv.org/abs/1912.08258 we have

  "For efficient memory usage, a Bloom filter with a false-positive
   probability ϵ should use about − log_2(ϵ) hash functions
   [Broder2004]. At a false-positive probability of 1%, seven hash
   functions are thus required."

So k=7 being optimal is somewhat confirmed.

>    The implementation while not completely open to it at the moment, is flexible
>    enough to allow for tweaking these settings in the future.

All right.

>    Note: The performance gains we have observed so far with these values is
>    significant enough to not that we did not need to tweak these settings.
                           ^^^
s/not/note/

Did you try to tweak settings, i.e. numbers of bits per entry, number
of hash functions (which is derivative of the former - at least the
optimal number), the size of the block, the cutoff threshold value?
It is not needed to be in this patch series - fine tuning is probably
better left for later.

>    The cover letter of this series has the details and the commit where we have
>    git log use bloom filters.

The second part of this sentence, from "and the commit..." is a bit
unclear.  Did you mean here that the future / subsequent commit in this
patch series that makes Git actually use Bloom filters in `git log --
<path>` will have more details in its commit message?

> 2. As described in the blog and the linked technical paper therin, we do not need
                                                             ^^^^^^
s/therin/therein/

>    7 independent hashing functions. We use the Murmur3 hashing scheme - seed it
>    twice and then combine those to procure an arbitrary number of hash values.

The "linked technical paper" in the blog post (which I would prefer to
have linked directly to in the commit message) is

  Peter C. Dillinger and Panagiotis Manolios
  "Bloom Filters in Probabilistic Verification"
  http://www.ccs.neu.edu/home/pete/pub/bloom-filters-verification.pdf
  https://doi.org/10.1007/978-3-540-30494-4_26

Sidenote: it looks like it is a reference from Wikipedia on Bloom filters.
This is according to authors the original paper with the _double hashing_
technique.

They also examine in much detail the optimal number of hash functions.

> 3. The filters are sized according to the number of changes in the each commit,
>    with minimum size of one 64 bit word.

Do I understand it correctly that the size of filter is 10*(number of
changed files) bits, rounded up to nearest multiple of 64?

How do you count renames and copies?  As two changes?

Do I understand it correctly that commit with no changes in it (which
can rarely happen) would have 64-bits i.e. 8-bytes Bloom filter of all
zeros: 0x0000000000000000?

How merges are handled?  Does the filter uses all changed files, or just
changes compared to first parent?

>
> [Call for advice] We currently cap writing bloom filters for commits with
> atmost 512 changed files. In the current implementation, we compute the diff,
> and then just throw it away once we see it has more than 512 changes.
> Any suggestiongs on how to reduce the work we are doing in this case are more
> than welcome.

This got solved in "[PATCH] diff: halt tree-diff early after max_changes"
https://public-inbox.org/git/e9a4e4ff-5466-dc39-c3f5-c9a8b8f2f11d@gmail.com/

> [Call for advice] Would the git community like this commit to be split up into
> more granular commits? This commit could possibly be split out further with the
> bloom.c code in its own commit, to be used by the commit-graph in a subsequent
> commit. While I prefer it being contained in one commit this way, I am open to
> suggestions.

I think it might be a good idea to split this commit into purely Bloom
filter implementation (bloom.c) AND unit tests for Bloom filter itself
(which would probably involve some new test-tool).

I have not read further messages in the series [edit: they don't], so I
don't know if such tests already exist or not.  One could test for
negative match, maybe also (for specific choice of hash function) for
positive and maybe even false positive match, for filter size depending
on the number of changes, for changes cap (maybe), maybe also for
no-changes scenario.

As for splitting the main part of the series, I would envision it in the
following way (which is of course only one possibility):

1. Implementation of generic-ish Bloom filter (with elements being
   strings / paths, and optimized to test single key against many
   filters, each taking small-ish space, variable size filter, limit on
   maximum number of elements).

   Technical documentation in comments in bloom.h (description of API)
   and bloom.c (details of the algorithm, with references).

   TODO: test-tool and unit tests.

2. Using per-commit Bloom filter(s) to store changeset information
   i.e. changed paths.  This would implement in-memory storage (on slab)
   and creating Bloom filter out of commit and repository information.

   Perhaps this should also get its own unit tests (that Bloom filter
   catches changed files, and excluding false positivess catches
   unchanged files).

3. Storing per-commit Bloom filters in the commit-graph file:

   a.) writing Bloom filters data to commit-graph file, which means
       designing the chunk(s) format,
   b.) verifying Bloom filter chunks, at least sanity-checks
   c.) reading Bloom filters from commit-graph file into memory

   Perhaps also some integration tests that the information is stored
   and retrieved correctly, and that verifying finds bugs in
   intentionally corrupted Bloom filter chunks.

4. Using Bloom filters to speed up `git log -- <path>` (and similar
   commands).

   It would be nice to have some functional tests, and maybe some
   performance tests, if possible.

> [Call for advice] Would a technical document explaining the exact details of
> the bloom filter implemenation and the hashing calculations be helpful? I will
> be adding details into Documentation/technical/commit-graph-format.txt, but the
> bloom filter code is an independent subsystem and could be used outside of the
> commit-graph feature. Is it worth a separate document, or should we apply "You
> Ain't Gonna Need It" principles?

As nowadays technical reference documentation is being moved from
Documentation/technical/api-*.txt to appropriate header files, maybe the
documentation of Bloom filter API (and some technical documentation and
references) be put in bloom.h?  See for example comments in strbuf.h.

> [Call for advice] I plan to add unit tests for bloom.c, specifically to ensure
> that the hash algorithm and bloom key calculations are stable across versions.

Ah, so the unit tests for bloom.c does not exist, yet...

> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Makefile       |   1 +
>  bloom.c        | 201 +++++++++++++++++++++++++++++++++++++++++++++++++
>  bloom.h        |  46 +++++++++++
>  commit-graph.c |  32 +++++++-
>  4 files changed, 279 insertions(+), 1 deletion(-)
>  create mode 100644 bloom.c
>  create mode 100644 bloom.h
>
> diff --git a/Makefile b/Makefile
> index 42a061d3fb..9d5e26f5d6 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -838,6 +838,7 @@ LIB_OBJS += base85.o
>  LIB_OBJS += bisect.o
>  LIB_OBJS += blame.o
>  LIB_OBJS += blob.o
> +LIB_OBJS += bloom.o
>  LIB_OBJS += branch.o
>  LIB_OBJS += bulk-checkin.o
>  LIB_OBJS += bundle.o

I'll put bloom.h first, to make it easier to review.

> diff --git a/bloom.h b/bloom.h
> new file mode 100644
> index 0000000000..ba8ae70b67
> --- /dev/null
> +++ b/bloom.h
> @@ -0,0 +1,46 @@
> +#ifndef BLOOM_H
> +#define BLOOM_H
> +
> +struct commit;
> +struct repository;

This would probably be missing if this patch was split in two:
introducing Bloom filter and saving Bloom filter in the repository
metadata (in commit-graphh file).

> +

O.K., the names of fields are descriptive enough so that this struct
doesn't need detailed description in comment (like the next one).

> +struct bloom_filter_settings {
> +	uint32_t hash_version;

Do we need full half-word for hash version?

> +	uint32_t num_hashes;

Do we need full 32-bits for number of hashes?  The "Bloom Filters in
Probabilistic Verification" paper mentioned in Stolee blog states that
no one should need number of hashes greater than k=32 - the accuracy is
so high that it doesn't matter that it is not optimal.

  "Notice one last thing about Bloom filters in verification, if $m$ is
   several gigabytes or less and $m/n$ calls for more than about 32 index
   functions, the accuracy is going to be so high that there is not much
   reason to use more than 32—for the next several years at least. In
   response to this, 3SPIN currently limits the user to $k = 32$. The
   point of this observation is that we do not have to worry about the
   runtime cost of $k$ being on the order of 64 or 100, because those
   choices do not really buy us anything over 32."

Here 'm' is the number of bits in Bloom filter, and m/n is number of
bits per element added to filter.

> +	uint32_t bits_per_entry;

All right, we wouldn't really want large Bloom filters, as we use one
filter per commit to match againts one key, not single Bloom filter to
match againts many keys.

> +};
> +
> +#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
> +
> +/*
> + * A bloom_filter struct represents a data segment to
> + * use when testing hash values. The 'len' member
> + * dictates how many uint64_t entries are stored in
> + * 'data'.
> + */
> +struct bloom_filter {
> +	uint64_t *data;
> +	int len;
> +};

O.K., so it is single variable-sized (in 64-bit increments) Bloom filter
bit vector (bitmap).

> +
> +/*
> + * A bloom_key represents the k hash values for a
> + * given hash input. These can be precomputed and
> + * stored in a bloom_key for re-use when testing
> + * against a bloom_filter.
> + */
> +struct bloom_key {
> +	uint32_t *hashes;
> +};

That is smart.  I wonder however if it wouldn't be a good idea to
'typedef' a hash function return type.

I repeat myself: in Git case we have one key that we try to match
against many Bloom filters which are never updated, while in an ordinary
case many keys are matched against single Bloom filter - in many cases
updated (with keys inserted to Bloom filter).

I wonder if somebody from academia have examined such situation.
I couldn't find a good search query.

Sidenote: perhaps Xor or Xor+ filters from Graf & Lemire (2019)
https://arxiv.org/abs/1912.08258 would be better solution - they also
assume unchanging filter.  Though they are a very fresh proposal;
also construction time might be important for Git.
https://github.com/FastFilter/xor_singleheader

> +
> +void load_bloom_filters(void);
> +
> +struct bloom_filter *get_bloom_filter(struct repository *r,
> +				      struct commit *c);

Those two functions really need API documentation on how they are used,
if they are to be used in any other role, especially what is their
calling convention?  Why load_bloom_filters() doesn't take any
parameters?

Anyway, if this patch would be split into pure Bloom filter
implementation and Git use^W store of Bloom filters, then this would be
left for the latter patch.

> +
> +void fill_bloom_key(const char *data,
> +		    int len,
> +		    struct bloom_key *key,
> +		    struct bloom_filter_settings *settings);

It is a bit strange that two basic Bloom filter operations, namely
adding element to Bloom filter (and constructing Bloom filter), and
testing whether element is in Bloom filter are not part of a public
API...

This function should probably be documented, in particular the fact that
*key is an in/out parameter.  This could also be a good place to
document the mechanism itself (i.e. our implementation of Bloom filter,
with references), though it might be better to keep the details of how
it works in the bloom.c - close to the actual source (while keeping
description of API in bloom.h comments).

> +
> +#endif
> diff --git a/bloom.c b/bloom.c
> new file mode 100644
> index 0000000000..08328cc381
> --- /dev/null
> +++ b/bloom.c
> @@ -0,0 +1,201 @@
> +#include "git-compat-util.h"
> +#include "bloom.h"
> +#include "commit-graph.h"
> +#include "object-store.h"
> +#include "diff.h"
> +#include "diffcore.h"
> +#include "revision.h"
> +#include "hashmap.h"
> +
> +#define BITS_PER_BLOCK 64
> +
> +define_commit_slab(bloom_filter_slab, struct bloom_filter);
> +
> +struct bloom_filter_slab bloom_filters;

All right, so the Bloom filter data would be on slab.  This should
probably be mentioned in the commit message, like in
https://lore.kernel.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/

Sidenote: If I remember correctly one of the unmet prerequisites for
switching to generation numbers v2 (corrected commit date with monotonic
offsets) was moving 'generation' field out of 'struct commit' and on to
slab (possibly also 'graph_pos'), and certainly having 'corrected_date'
on slab (Inside-Out Object style).  Which probably could be done with
Coccinelle script...

> +
> +struct pathmap_hash_entry {
> +    struct hashmap_entry entry;
> +    const char path[FLEX_ARRAY];
> +};

Hmmm... I wonder why use hashmap and not string_list.  This is for
adding path with leading directories to the Bloom filter, isn't it?

> +
> +static uint32_t rotate_right(uint32_t value, int32_t count)
> +{
> +	uint32_t mask = 8 * sizeof(uint32_t) - 1;
> +	count &= mask;
> +	return ((value >> count) | (value << ((-count) & mask)));
> +}

Does it actually work with count being negative?  Shouldn't 'count' be
of unsigned type, and if int32_t is needed, perhaps add an assertion (if
needed)?  I think it does not.

It looks like it is John Regehr [2] safe and compiler-friendly
implementation, with explicit 8 in place of CHAR_BIT from <limits.h>,
which should compile to "rotate" assembly instruction... it looks like
it is the case, see https://godbolt.org/z/5JP1Jb (at least for C++
compiler).

[2]: https://en.wikipedia.org/wiki/Circular_shift

I wonder if this should, in the future, be a part of 'compat/', maybe
even using compiler intrinsics for "rotate right" if available (see
https://stackoverflow.com/a/776523/46058).  But that might be outside of
the scope of this patch (perhaps outside of choosing function name).

> +

It would be nice to have reference to the source of algorithm, or to the
code that was borrowed for this in the header comment for the following
function.

I will be comparing the algorithm itself in Wikipedia
https://en.wikipedia.org/wiki/MurmurHash#Algorithm
and its implementation in C in qLibc library (BSD licensed)
https://github.com/wolkykim/qlibc/blob/03a8ce035391adf88d6d755f9a26967c16a1a567/src/utilities/qhash.c#L258

> +static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
> +{
> +	const uint32_t c1 = 0xcc9e2d51;
> +	const uint32_t c2 = 0x1b873593;
> +	const int32_t r1 = 15;
> +	const int32_t r2 = 13;

Those two: r1 and r1, probably should be both uint32_t type.

> +	const uint32_t m = 5;
> +	const uint32_t n = 0xe6546b64;
> +	int i;
> +	uint32_t k1 = 0;
> +	const char *tail;

*tail should probably be 'uint8_t', not 'char', isn't it?

> +
> +	int len4 = len / sizeof(uint32_t);

All length variables and parameters, i.e. `len`, `len4`, `i`, could
possibly be `size_t` and not `int` type.

> +
> +	const uint32_t *blocks = (const uint32_t*)data;
> +

Some implementations copy `seed` (or assume seed=0) to the local
variable named `h` or `hash`.

> +	uint32_t k;
> +	for (i = 0; i < len4; i++)
> +	{
> +		k = blocks[i];
> +		k *= c1;
> +		k = rotate_right(k, r1);

Shouldn't it be *rotate_left* (ROL), not rotate_right (ROR)???
This affects all cases / uses.

> +		k *= c2;
> +
> +		seed ^= k;
> +		seed = rotate_right(seed, r2) * m + n;
> +	}
> +
> +	tail = (data + len4 * sizeof(uint32_t));
> +

We could have reused variable `k`, like the implementation in qLibc
does, instead of introducing new `k1` variable, but this way it is more
clean.  Or name it `remainingBytes` instead of `k1`

> +	switch (len & (sizeof(uint32_t) - 1))
> +	{
> +	case 3:
> +		k1 ^= ((uint32_t)tail[2]) << 16;
> +		/*-fallthrough*/
> +	case 2:
> +		k1 ^= ((uint32_t)tail[1]) << 8;
> +		/*-fallthrough*/
> +	case 1:
> +		k1 ^= ((uint32_t)tail[0]) << 0;
> +		k1 *= c1;
> +		k1 = rotate_right(k1, r1);
> +		k1 *= c2;
> +		seed ^= k1;
> +		break;
> +	}
> +
> +	seed ^= (uint32_t)len;
> +	seed ^= (seed >> 16);
> +	seed *= 0x85ebca6b;
> +	seed ^= (seed >> 13);
> +	seed *= 0xc2b2ae35;
> +	seed ^= (seed >> 16);
> +
> +	return seed;
> +}
> +

It would be nice to have header comment describing what this function is
intended to actually do.

> +static inline uint64_t get_bitmask(uint32_t pos)
> +{
> +	return ((uint64_t)1) << (pos & (BITS_PER_BLOCK - 1));
> +}

Sidenote: I wonder if ewah/bitmap.c implements something similar.
Certainly possible consolidation, if any possible exists, should be left
for the future.

> +
> +void fill_bloom_key(const char *data,
> +		    int len,
> +		    struct bloom_key *key,
> +		    struct bloom_filter_settings *settings)
> +{
> +	int i;
> +	uint32_t seed0 = 0x293ae76f;
> +	uint32_t seed1 = 0x7e646e2c;

Where did those constants came from?  It would be nice to have a
reference either in header comment (in bloom.h or bloom.c), or in a
commit message, or both.

Note that above *constants* are each used only once.

> +
> +	uint32_t hash0 = seed_murmur3(seed0, data, len);
> +	uint32_t hash1 = seed_murmur3(seed1, data, len);

Those are constant values, so perhaps they should be `const uint32_t`.

> +
> +	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
> +	for (i = 0; i < settings->num_hashes; i++)
> +		key->hashes[i] = hash0 + i * hash1;

It looks like this code implements the double hashing technique given in
Eq. (4) in http://www.ccs.neu.edu/home/pete/pub/bloom-filters-verification.pdf
that is "Bloom Filters in Probabilistic Verification".

Note that Dillinger and Manolios in this paper propose also _enhanced_
double hashing algorithm (Algorithm 2 on page 11), which has closed form
given by Eq. (6) - with better science-theoretical properties at
similar cost.

It might be a good idea to explicitly state in the header comment that
all arithmetic is performed with unsigned 32-bit integers, which means
that operations are performed modulo 2^32.  Or it might not be needed.

> +}
> +
> +static void add_key_to_filter(struct bloom_key *key,
> +			      struct bloom_filter *filter,
> +			      struct bloom_filter_settings *settings)
> +{
> +	int i;
> +	uint64_t mod = filter->len * BITS_PER_BLOCK;
> +
> +	for (i = 0; i < settings->num_hashes; i++) {
> +		uint64_t hash_mod = key->hashes[i] % mod;
> +		uint64_t block_pos = hash_mod / BITS_PER_BLOCK;

All right.  Because Bloom filters for different commits (and the same
key) may have different lengths, we can perform modulo operation only
here.  `hash_mod` is i-th hash modulo size of filter, and `block_pos` is
the block the '1' bit would go into.

> +
> +		filter->data[block_pos] |= get_bitmask(hash_mod);

I'm not quite convinced that get_bitmask() is a good name: this function
returns bitmap with hash_mod's bit set to 1.  On the other hand it
doesn't matter, because it is static (file-local) helper function.

Never mind then.

> +	}
> +}
> +
> +void load_bloom_filters(void)
> +{
> +	init_bloom_filter_slab(&bloom_filters);
> +}

Why *load* if all it does is initialize?

> +
> +struct bloom_filter *get_bloom_filter(struct repository *r,
> +				      struct commit *c)

I will not comment on this function; see Jeff King reply and Derrick
Stolee reply.

> +{
[...]
> +}
> \ No newline at end of file

Why there is no newline at the end of the file?  Accident?

> diff --git a/commit-graph.c b/commit-graph.c
> index e771394aff..61e60ff98a 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -16,6 +16,7 @@
>  #include "hashmap.h"
>  #include "replace-object.h"
>  #include "progress.h"
> +#include "bloom.h"
>  
>  #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
>  #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
> @@ -794,9 +795,11 @@ struct write_commit_graph_context {
>  	unsigned append:1,
>  		 report_progress:1,
>  		 split:1,
> -		 check_oids:1;
> +		 check_oids:1,
> +		 bloom:1;

Very minor nitpick: why `bloom` and not `bloom_filter`?

>  
>  	const struct split_commit_graph_opts *split_opts;
> +	uint32_t total_bloom_filter_size;

All right, I guess size of all Bloom filters would fit in uint32_t, no
need for size_t, is it?

Shouldn't it be total_bloom_filters_size -- it is not a single Bloom
filter, but many (minor nitpick)?

>  };
>  
>  static void write_graph_chunk_fanout(struct hashfile *f,
> @@ -1139,6 +1142,28 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  	stop_progress(&ctx->progress);
>  }
>  
> +static void compute_bloom_filters(struct write_commit_graph_context *ctx)
> +{
> +	int i;
> +	struct progress *progress = NULL;
> +
> +	load_bloom_filters();
> +
> +	if (ctx->report_progress)
> +		progress = start_progress(
> +			_("Computing commit diff Bloom filters"),
> +			ctx->commits.nr);
> +
> +	for (i = 0; i < ctx->commits.nr; i++) {
> +		struct commit *c = ctx->commits.list[i];
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
> +		ctx->total_bloom_filter_size += sizeof(uint64_t) * filter->len;

Wouldn't it be more future proof instead of using `sizeof(uint64_t)` to
use `sizeof(filter->data[0])` here?  This may be not worth it, and be
less readable (we have hard-coded use of 64-bits blocks in other places).

> +		display_progress(progress, i + 1);
> +	}
> +
> +	stop_progress(&progress);
> +}
> +
>  static int add_ref_to_list(const char *refname,
>  			   const struct object_id *oid,
>  			   int flags, void *cb_data)
> @@ -1791,6 +1816,8 @@ int write_commit_graph(const char *obj_dir,
>  	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
>  	ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
>  	ctx->split_opts = split_opts;
> +	ctx->bloom = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;

All right, this flag was defined in [PATCH 1/9].

The ordering of setting `ctx` members looks a bit strange.  Now it is
neither check `flags` firsts, neither keep related stuff together (see
ctx->split vs ctx->split_opts).  This is a very minor nitpick.

> +	ctx->total_bloom_filter_size = 0;
>  
>  	if (ctx->split) {
>  		struct commit_graph *g;
> @@ -1885,6 +1912,9 @@ int write_commit_graph(const char *obj_dir,
>  
>  	compute_generation_numbers(ctx);
>  
> +	if (ctx->bloom)
> +		compute_bloom_filters(ctx);
> +
>  	res = write_commit_graph_file(ctx);
>  
>  	if (ctx->split)

Regards,
-- 
Jakub Narębski

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox