Git development

Git development
 help / color / mirror / Atom feed

* Re: [PATCH v5 0/3] Turn the difftool into a builtin
From: Junio C Hamano @ 2017-01-17 21:31 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, David Aguilar, Dennis Kaarsemaker, Paul Sbarra
In-Reply-To: <cover.1484668473.git.johannes.schindelin@gmx.de>

Johannes Schindelin <johannes.schindelin@gmx.de> writes:

> - replaced the cross-validation with the Perl script by a patch that
>   retires the Perl script instead.

Yup.  That makes things a lot simpler.  While we try to be careful
during major rewrite, we usually do not go extra careful to leave
two competing implementations in-tree.

Will replace js/difftool-builtin topic.  Thanks.

^ permalink raw reply

* Re: [PATCH] CodingGuidelines: clarify multi-line brace style
From: Junio C Hamano @ 2017-01-17 21:41 UTC (permalink / raw)
  To: Jeff King; +Cc: Johannes Schindelin, Johannes Sixt, git
In-Reply-To: <20170117200503.3iwaedrgfq544b4i@sigill.intra.peff.net>

Jeff King <peff@peff.net> writes:

> Yeah. I obviously was adapting the original text, and I think I left too
> much of the wishy-washy-ness in. As the point of the patch is to avoid
> that, let's take your suggestion. A re-rolled patch is below.
>
> Now the patch is at least self-consistent. The bigger question remains
> of: do we want to dictate these rules, or did we prefer the vague
> version?

I would say no to make all of them "rules to be dictated".  It is OK
to leave some things "gray".  

But obviously others can vote differently ;-)

> I _thought_ the vague rules were doing fine, but this whole discussion
> has made me think otherwise. And I'd just as soon not ever have to
> repeat it. ;-/
>
> OTOH, I really do not want to review a bunch of patches that do nothing
> but change brace style (and I am sure there is a mix of styles already
> in the code base).

Yes, exactly, that is why the last sentence is in there below.

>> 	When the project says it does not have strong preference
>> 	among multiple choices, you are welcome to write your new
>> 	code using any of them, as long as you are consistent with
>> 	surrounding code.  Do not change style of existing code only
>> 	to flip among these styles, though.

> The existing document says:
>
>     Make your code readable and sensible, and don't try to be clever.
>     
>     As for more concrete guidelines, just imitate the existing code
>     (this is a good guideline, no matter which project you are
>     contributing to). It is always preferable to match the _local_
>     convention. New code added to Git suite is expected to match
>     the overall style of existing code. Modifications to existing
>     code is expected to match the style the surrounding code already
>     uses (even if it doesn't match the overall style of existing code).
>     
>     But if you must have a list of rules, here they are.
>
> I guess that is the place to expound on how to interpret the rules. I
> dunno. Some of the individual rules that go into "gray areas" already
> spell out the "surrounding code" rule.


We are not saying one important thing and that was what invited
useless arguing this time: For "gray" things, the choice is yours in
your new code, but do not churn existing code for "gray" guidelines.

If we made _that_ clear as a rule we dictate, hopefully we'd have
less chance to see "a bunch of patches that do nothing but change
brace style", I would think.

> -- >8 --
> Subject: [PATCH] CodingGuidelines: clarify multi-line brace style
>
> There are some "gray areas" around when to omit braces from
> a conditional or loop body. Since that seems to have
> resulted in some arguments, let's be a little more clear
> about our preferred style.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  Documentation/CodingGuidelines | 37 ++++++++++++++++++++++++++++++++-----
>  1 file changed, 32 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/CodingGuidelines b/Documentation/CodingGuidelines
> index 4cd95da6b..a4191aa38 100644
> --- a/Documentation/CodingGuidelines
> +++ b/Documentation/CodingGuidelines
> @@ -206,11 +206,38 @@ For C programs:
>  		x = 1;
>  	}
>  
> -   is frowned upon.  A gray area is when the statement extends
> -   over a few lines, and/or you have a lengthy comment atop of
> -   it.  Also, like in the Linux kernel, if there is a long list
> -   of "else if" statements, it can make sense to add braces to
> -   single line blocks.
> +   is frowned upon. But there are a few exceptions:
> +
> +	- When the statement extends over a few lines (e.g., a while loop
> +	  with an embedded conditional, or a comment). E.g.:
> +
> +		while (foo) {
> +			if (x)
> +				one();
> +			else
> +				two();
> +		}
> +
> +		if (foo) {
> +			/*
> +			 * This one requires some explanation,
> +			 * so we're better off with braces to make
> +			 * it obvious that the indentation is correct.
> +			 */
> +			doit();
> +		}
> +
> +	- When there are multiple arms to a conditional and some of them
> +	  require braces, enclose even a single line block in braces for
> +	  consistency. E.g.:
> +
> +		if (foo) {
> +			doit();
> +		} else {
> +			one();
> +			two();
> +			three();
> +		}
>  
>   - We try to avoid assignments in the condition of an "if" statement.

^ permalink raw reply

* Re: [PATCH] document index_name_pos
From: Junio C Hamano @ 2017-01-17 21:43 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git
In-Reply-To: <20170117204642.31514-1-sbeller@google.com>

Stefan Beller <sbeller@google.com> writes:

> Signed-off-by: Stefan Beller <sbeller@google.com>
> ---
>
>> These placeholders are meant to encourage those people who dove into
>> the code to update it, so from that point of view, I think removing
>> it is backwards.
>
> Yes, I am currently understanding and writing up documentation for
> index_name_pos. If I recall the latest discussion where we want to have
> documentation, I think a quorum favored documentation in the header itself,
> c.f. strbuf.h, string-list.h for the most desired state. (Although we do have
> Documentation/technical/api-string-list.txt as well ...)
>
> So maybe starting like this?

That is very good.  Let's drop that file from Documentation/technical
and do it like this (meaning, take both patches from you).

Thanks.


>
> Thanks,
> Stefan
>
>  cache.h | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/cache.h b/cache.h
> index 1b67f078dd..e168e4e073 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -575,7 +575,20 @@ extern int verify_path(const char *path);
>  extern int index_dir_exists(struct index_state *istate, const char *name, int namelen);
>  extern void adjust_dirname_case(struct index_state *istate, char *name);
>  extern struct cache_entry *index_file_exists(struct index_state *istate, const char *name, int namelen, int igncase);
> +
> +/*
> + * Searches for an entry defined by name and namelen in the given index.
> + * If the return value is positive (including 0) it is the position of an
> + * exact match. If the return value is negative, the negated value minus 1 is the
> + * position where the entry would be inserted.
> + * Example: In the current index we have the files a,c,d:
> + * index_name_pos(&index, "a", 1) ->  0
> + * index_name_pos(&index, "b", 1) -> -1
> + * index_name_pos(&index, "c", 1) ->  1
> + * index_name_pos(&index, "d", 1) ->  2
> + */
>  extern int index_name_pos(const struct index_state *, const char *name, int namelen);
> +
>  #define ADD_CACHE_OK_TO_ADD 1		/* Ok to add */
>  #define ADD_CACHE_OK_TO_REPLACE 2	/* Ok to replace file/directory */
>  #define ADD_CACHE_SKIP_DFCHECK 4	/* Ok to skip DF conflict checks */

^ permalink raw reply

* Re: [PATCH 0/2] Fix remote_is_configured()
From: Jeff King @ 2017-01-17 21:45 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Junio C Hamano, Thomas Gummerer, Andrew Arnott
In-Reply-To: <cover.1484687919.git.johannes.schindelin@gmx.de>

On Tue, Jan 17, 2017 at 10:18:58PM +0100, Johannes Schindelin wrote:

> A surprising behavior triggered the bug report in
> https://github.com/git-for-windows/git/issues/888: the mere existence of
> the config setting "remote.origin.prune" (in this instance, configured
> via ~/.gitconfig so that it applies to all repositories) fooled `git
> remote rename <source> <target>` into believing that the <target> remote
> is already there.
> 
> This patch pair demonstrates the problem, and then fixes it (along with
> potential similar problems, such as setting an HTTP proxy for remotes of
> a given name via ~/.gitconfig).

I thought it was intentional that any config "created" the remote, even
without a url field. E.g., you can set:

  [remote "https://example.com/foo/git"]
  proxy = localhost:1234

and then a bare "git fetch https://example.com/foo/git" will respect it.

I admit that "prune" is probably useless without a fetch refspec,
though.

Note that I don't disagree that the rule "it's not a remote unless it
has a url" would be a lot saner. But I have a feeling you may be
breaking some existing setups.

I grepped around in the list but I couldn't find any relevant
discussion. So maybe I just dreamed it.

-Peff

^ permalink raw reply

* Re: [PATCH 2/2] Be more careful when determining whether a remote was configured
From: Jeff King @ 2017-01-17 21:47 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Junio C Hamano, Thomas Gummerer, Andrew Arnott
In-Reply-To: <41c347f22c80e96c54db34baa739b6e37e268b61.1484687919.git.johannes.schindelin@gmx.de>

On Tue, Jan 17, 2017 at 10:19:24PM +0100, Johannes Schindelin wrote:

> One of the really nice features of the ~/.gitconfig file is that users
> can override defaults by their own preferred settings for all of their
> repositories.
> 
> One such default that some users like to override is whether the
> "origin" remote gets auto-pruned or not. The user would simply call
> 
> 	git config --global remote.origin.prune true
> 
> and from now on all "origin" remotes would be pruned automatically when
> fetching into the local repository.
> 
> There is just one catch: now Git thinks that the "origin" remote is
> configured, as it does not discern between having a remote whose
> fetch (and/or push) URL and refspec is set, and merely having
> preemptively-configured, global flags for specific remotes.
> 
> Let's fix this by telling Git that a remote is not configured unless any
> fetch/push URL or refspect is configured explicitly.

Hmm. Old versions of GitHub for Windows used to set fetch refspecs in
the system gitconfig, for a similar purpose to what you want to do with
remote.origin.prune.

I notice here that setting a refspec _does_ define a remote. Is there a
reason you drew the line there, and not at, say, whether it has a URL?

-Peff

^ permalink raw reply

* Re: [PATCH v5 3/3] Retire the scripted difftool
From: Junio C Hamano @ 2017-01-17 21:46 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, David Aguilar, Dennis Kaarsemaker, Paul Sbarra
In-Reply-To: <8238bba389c031b091a37396fed43cac94d944e7.1484668473.git.johannes.schindelin@gmx.de>

Johannes Schindelin <johannes.schindelin@gmx.de> writes:

> It served its purpose, but now we have a builtin difftool. Time for the
> Perl script to enjoy Florida.
>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---

The endgame makes a lot of sense.  Both in the cover letter and in
the previous patch you talk about having both in the released
version, so do you want this step to proceed slower than the other
two?  I.e. merge all three to 'next' but graduate only the first two
to 'master' and after a while make this last step graduate?

^ permalink raw reply

* Re: [PATCH] document index_name_pos
From: Stefan Beller @ 2017-01-17 21:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git@vger.kernel.org
In-Reply-To: <xmqqo9z5cqh8.fsf@gitster.mtv.corp.google.com>

On Tue, Jan 17, 2017 at 1:43 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Stefan Beller <sbeller@google.com> writes:
>
>> Signed-off-by: Stefan Beller <sbeller@google.com>
>> ---
>>
>>> These placeholders are meant to encourage those people who dove into
>>> the code to update it, so from that point of view, I think removing
>>> it is backwards.
>>
>> Yes, I am currently understanding and writing up documentation for
>> index_name_pos. If I recall the latest discussion where we want to have
>> documentation, I think a quorum favored documentation in the header itself,
>> c.f. strbuf.h, string-list.h for the most desired state. (Although we do have
>> Documentation/technical/api-string-list.txt as well ...)
>>
>> So maybe starting like this?
>
> That is very good.  Let's drop that file from Documentation/technical
> and do it like this (meaning, take both patches from you).
>
> Thanks.

This patch is incorrect, as we need to do s/b -> -1/b -> -2/.
(Wrong documentation is the worst)

Also we'd probably want to see more of the functions documented.
I'll see if I can extend this into a series documenting more functions.

Thanks,
Stefan

^ permalink raw reply

* RE: [RFC] Add support for downloading blobs on demand
From: Ben Peart @ 2017-01-17 21:50 UTC (permalink / raw)
  To: 'Shawn Pearce'; +Cc: 'git', benpeart
In-Reply-To: <CAJo=hJumYXTRN_B3iZdmcpomp7wJ+UPcikxGb6rn9W=uJeYmfw@mail.gmail.com>

Thanks for the encouragement, support, and good ideas to look into.

Ben

> -----Original Message-----
> From: Shawn Pearce [mailto:spearce@spearce.org]
> Sent: Friday, January 13, 2017 4:07 PM
> To: Ben Peart <peartben@gmail.com>
> Cc: git <git@vger.kernel.org>; benpeart@microsoft.com
> Subject: Re: [RFC] Add support for downloading blobs on demand
> 
> On Fri, Jan 13, 2017 at 7:52 AM, Ben Peart <peartben@gmail.com> wrote:
> >
> > Goal
> > ~~~~
> >
> > To be able to better handle repos with many files that any individual
> > developer doesn’t need it would be nice if clone/fetch only brought
> > down those files that were actually needed.
> >
> > To enable that, we are proposing adding a flag to clone/fetch that
> > will instruct the server to limit the objects it sends to commits and
> > trees and to not send any blobs.
> >
> > When git performs an operation that requires a blob that isn’t
> > currently available locally, it will download the missing blob and add
> > it to the local object store.
> 
> Interesting. This is also an area I want to work on with my team at $DAY_JOB.
> Repositories are growing along multiple dimensions, and developers or
> editors don't always need all blobs for all time available locally to successfully
> perform their work.
> 
> > Design
> > ~~~~~~
> >
> > Clone and fetch will pass a “--lazy-clone” flag (open to a better name
> > here) similar to “--depth” that instructs the server to only return
> > commits and trees and to ignore blobs.
> 
> My group at $DAY_JOB hasn't talked about it yet, but I want to add a
> protocol capability that lets clone/fetch ask only for blobs smaller than a
> specified byte count. This could be set to a reasonable text file size (e.g. <= 5
> MiB) to predominately download only source files and text documentation,
> omitting larger binaries.
> 
> If the limit was set to 0, its the same as your idea to ignore all blobs.
> 

This is an interesting idea that may be an easier way to help mitigate 
the cost of very large files.  While our primary issue today is the 
sheer number of files, I'm sure at some point we'll run into issues with 
file size as well.  

> > Later during git operations like checkout, when a blob cannot be found
> > after checking all the regular places (loose, pack, alternates, etc),
> > git will download the missing object and place it into the local
> > object store (currently as a loose object) then resume the operation.
> 
> Right. I'd like to have this object retrieval be inside the native Git wire
> protocol, reusing the remote configuration and authentication setup. That
> requires expanding the server side of the protocol implementation slightly
> allowing any reachable object to be retrieved by SHA-1 alone. Bitmap indexes
> can significantly reduce the computational complexity for the server.
> 

Agree.  

> > To prevent git from accidentally downloading all missing blobs, some
> > git operations are updated to be aware of the potential for missing blobs.
> > The most obvious being check_connected which will return success as if
> > everything in the requested commits is available locally.
> 
> This ... sounds risky for the developer, as the repository may be corrupt due
> to a missing object, and the user cannot determine it.
> 
> Would it be reasonable for the server to return a list of SHA-1s it knows
> should exist, but has omitted due to the blob threshold (above), and the
> local repository store this in a binary searchable file?
> During connectivity checking its assumed OK if an object is not present in the
> object store, but is listed in this omitted objects file.
> 

Corrupt repos due to missing blobs must be pretty rare as I've never 
seen anyone report that error but for this and other reasons (see Peff's 
suggestion on how to minimize downloading unnecessary blobs) having this 
data could be valuable.  I'll add it to the list of things to look into.

> > To minimize the impact on the server, the existing dumb HTTP protocol
> > endpoint “objects/<sha>” can be used to retrieve the individual
> > missing blobs when needed.
> 
> I'd prefer this to be in the native wire protocol, where the objects are in pack
> format (which unfortunately differs from loose format). I assume servers
> would combine many objects into pack files, potentially isolating large
> uncompressable binaries into their own packs, stored separately from
> commits/trees/small-text-blobs.
> 
> I get the value of this being in HTTP, where HTTP caching inside proxies can
> be leveraged to reduce master server load. I wonder if the native wire
> protocol could be taught to use a variation of an HTTP GET that includes the
> object SHA-1 in the URL line, to retrieve a one-object pack file.
> 

You make a good point. I don't think the benefit of hitting this 
"existing" end point outweighs the many drawbacks.  Adding the ability 
to retrieve an individual blob via the native wire protocol seems a 
better plan.

> > Performance considerations
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > We found that downloading commits and trees on demand had a
> > significant negative performance impact.  In addition, many git
> > commands assume all commits and trees are available locally so they
> > quickly got pulled down anyway.  Even in very large repos the commits
> > and trees are relatively small so bringing them down with the initial
> > commit and subsequent fetch commands was reasonable.
> >
> > After cloning, the developer can use sparse-checkout to limit the set
> > of files to the subset they need (typically only 1-10% in these large
> > repos).  This allows the initial checkout to only download the set of
> > files actually needed to complete their task.  At any point, the
> > sparse-checkout file can be updated to include additional files which
> > will be fetched transparently on demand.
> >
> > Typical source files are relatively small so the overhead of
> > connecting and authenticating to the server for a single file at a
> > time is substantial.  As a result, having a long running process that
> > is started with the first request and can cache connection information
> > between requests is a significant performance win.
> 
> Junio and I talked years ago (offline, sorry no mailing list archive) about
> "narrow checkout", which is the idea of the client being able to ask for a pack
> file from the server that only includes objects along specific path names. This
> would allow a client to amortize the setup costs, and even delta compress
> source files against each other (e.g.
> boilerplate across Makefiles or license headers).
> 
> If the paths of interest can be determined as a batch before starting the
> connection, this may be easier than maintaining a cross platform connection
> cache in a separate process.
> 

We looked into sparse/narrow-clone but for a variety of reasons it 
didn't work well for our usage patterns (see my response to Peff's 
feedback for more details).

> > Now some numbers
> > ~~~~~~~~~~~~~~~~
> >
> > One repo has 3+ million files at tip across 500K folders with 5-6K
> > active developers.  They have done a lot of work to remove large files
> > from the repo so it is down to < 100GB.
> >
> > Before changes: clone took hours to transfer the 87GB .pack + 119MB
> > .idx
> >
> > After changes: clone took 4 minutes to transfer 305MB .pack + 37MB
> > .idx
> >
> > After hydrating 35K files (the typical number any individual developer
> > needs to do their work), there was an additional 460 MB of loose files
> > downloaded.
> >
> > Total savings: 86.24 GB * 6000 developers = 517 Terabytes saved!
> >
> > We have another repo (3.1 M files, 618 GB at tip with no history with
> > 3K+ active developers) where the savings are even greater.
> 
> This is quite impressive, and shows this strategy has a lot of promise.
> 
> 
> > Future Work
> > ~~~~~~~~~~~
> >
> > The current prototype calls a new hook proc in
> > sha1_object_info_extended and read_object, to download each missing
> > blob.  A better solution would be to implement this via a long running
> > process that is spawned on the first download and listens for requests
> > to download additional objects until it terminates when the parent git
> > operation exits (similar to the recent long running smudge and clean filter
> work).
> 
> Or batching these up in advance. checkout should be able to determine
> which path entries from the index it wants to write to the working tree. Once
> it has that set of paths it wants to write, it should be fast to construct a
> subset of paths for which the blobs are not present locally, and then pass the
> entire group off for download.
> 

Yes, I'm optimistic that we can optimize for the checkout case (which is 
a _very_ common case!).

> > Need to do more investigation into possible code paths that can
> > trigger unnecessary blobs to be downloaded.  For example, we have
> > determined that the rename detection logic in status can also trigger
> > unnecessary blobs to be downloaded making status slow.
> 
> There isn't much of a workaround here. Only options I can see are disabling
> rename detection when objects are above a certain size, or removing entries
> from the rename table when the blob isn't already local, which may yield
> different results than if the blob(s) were local.
> 
> Another is to try to have actual source files always be local, and thus we only
> punt on rename detection for bigger files that are more likely to be binary,
> and thus less likely to match for rename[1] unless it was SHA-1 identity
> match, which can be done without the
> blob(s) present.
> 

While large files can be a real problem, our biggest issue today is 
having a lot (millions!) of source files when any individual developer 
only needs a small percentage of them.  Git with 3+ million local files 
just doesn't perform well.

We'll see what we can come up with here - especially if we had some 
information _about_ the blob, even though we didn't have the blob itself.

> 
> [1] I assume most really big files are some sort of media asset (e.g.
> JPEG), where a change inside the source data may result in large difference in
> bytes due to the compression applied by the media file format.
> 
> > Need to investigate an alternate batching scheme where we can make a
> > single request for a set of "related" blobs and receive single a
> > packfile (especially during checkout).
> 
> Heh, what I just said above. Glad to see you already thought of it.
> 
> > Need to investigate adding a new endpoint in the smart protocol that
> > can download both individual blobs as well as a batch of blobs.
> 
> Agreed, I said as much above. Again, glad to see you have similar ideas. :)


^ permalink raw reply

* RE: [RFC] Add support for downloading blobs on demand
From: Ben Peart @ 2017-01-17 21:50 UTC (permalink / raw)
  To: 'Jeff King', 'Ben Peart'; +Cc: git
In-Reply-To: <20170117184258.sd7h2hkv27w52gzt@sigill.intra.peff.net>

Thanks for the thoughtful response.  No need to appologize for the 
length, it's a tough problem to solve so I don't expect it to be handled 
with a single, short email. :)

> -----Original Message-----
> From: Jeff King [mailto:peff@peff.net]
> Sent: Tuesday, January 17, 2017 1:43 PM
> To: Ben Peart <peartben@gmail.com>
> Cc: git@vger.kernel.org; Ben Peart <Ben.Peart@microsoft.com>
> Subject: Re: [RFC] Add support for downloading blobs on demand
> 
> This is an issue I've thought a lot about. So apologies in advance that this
> response turned out a bit long. :)
> 
> On Fri, Jan 13, 2017 at 10:52:53AM -0500, Ben Peart wrote:
> 
> > Design
> > ~~~~~~
> >
> > Clone and fetch will pass a  --lazy-clone  flag (open to a better name
> > here) similar to  --depth  that instructs the server to only return
> > commits and trees and to ignore blobs.
> >
> > Later during git operations like checkout, when a blob cannot be found
> > after checking all the regular places (loose, pack, alternates, etc),
> > git will download the missing object and place it into the local
> > object store (currently as a loose object) then resume the operation.
> 
> Have you looked at the "external odb" patches I wrote a while ago, and
> which Christian has been trying to resurrect?
> 
>   https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fpublic-
> inbox.org%2Fgit%2F20161130210420.15982-1-
> chriscool%40tuxfamily.org%2F&data=02%7C01%7CBen.Peart%40microsoft.c
> om%7C9596d3bf32564f123e0c08d43f08a9e1%7C72f988bf86f141af91ab2d7c
> d011db47%7C1%7C0%7C636202753822020527&sdata=a6%2BGOAQoRhjFoxS
> vftY8JZAVUssmrXuDZ9OBy3xqNZk%3D&reserved=0
> 
> This is a similar approach, though I pushed the policy for "how do you get the
> objects" out into an external script. One advantage there is that large objects
> could easily be fetched from another source entirely (e.g., S3 or equivalent)
> rather than the repo itself.
> 
> The downside is that it makes things more complicated, because a push or a
> fetch now involves three parties (server, client, and the alternate object
> store). So questions like "do I have all the objects I need" are hard to reason
> about.
> 
> If you assume that there's going to be _some_ central Git repo which has all
> of the objects, you might as well fetch from there (and do it over normal git
> protocols). And that simplifies things a bit, at the cost of being less flexible.
> 

We looked quite a bit at the external odb patches, as well as lfs and 
even using alternates.  They all share a common downside that you must 
maintain a separate service that contains _some_ of the files.  These 
files must also be versioned, replicated, backed up and the service 
itself scaled out to handle the load.  As you mentioned, having multiple 
services involved increases flexability but it also increases the 
complexity and decreases the reliability of the overall version control 
service.  

For operational simplicity, we opted to go with a design that uses a 
single, central git repo which has _all_ the objects and to focus on 
enhancing it to handle large numbers of files efficiently.  This allows 
us to focus our efforts on a great git service and to avoid having to 
build out these other services.

> > To prevent git from accidentally downloading all missing blobs, some
> > git operations are updated to be aware of the potential for missing blobs.
> > The most obvious being check_connected which will return success as if
> > everything in the requested commits is available locally.
> 
> Actually, Git is pretty good about trying not to access blobs when it doesn't
> need to. The important thing is that you know enough about the blobs to
> fulfill has_sha1_file() and sha1_object_info() requests without actually
> fetching the data.
> 
> So the client definitely needs to have some list of which objects exist, and
> which it _could_ get if it needed to.
> 
> The one place you'd probably want to tweak things is in the diff code, as a
> single "git log -Sfoo" would fault in all of the blobs.
> 

It is an interesting idea to explore how we could be smarter about 
preventing blobs from faulting in if we had enough info to fulfill 
has_sha1_file() and sha1_object_info().  Given we also heavily prune the 
working directory using sparse-checkout, this hasn't been our top focus 
but it is certainly something worth looking into.

> > To minimize the impact on the server, the existing dumb HTTP protocol
> > endpoint  objects/<sha>  can be used to retrieve the individual
> > missing blobs when needed.
> 
> This is going to behave badly on well-packed repositories, because there isn't
> a good way to fetch a single object. The best case (which is not implemented
> at all in Git) is that you grab the pack .idx, then grab "slices" of the pack
> corresponding to specific objects, including hunting down delta bases.
> 
> But then next time the server repacks, you have to throw away your .idx file.
> And those can be big. The .idx for linux.git is 135MB. You really wouldn't	
> want to do an incremental fetch of 1MB worth of objects and have to grab
> the whole .idx just to figure out which bytes you needed.
> 
> You can solve this by replacing the dumb-http server with a smart one that
> actually serves up the individual objects as if they were truly sitting on the
> filesystem. But then you haven't really minimized impact on the server, and
> you might as well teach the smart protocols to do blob fetches.
> 

Yea, we actually implemented a new endpoint that we are using to fetch 
individual blobs, I just found the dumb endpoint recently and thought 
"hey, maybe we can us this to make it easier for other git servers."  
For a number of good reasons, I don't think this is the right approach.

> 
> One big hurdle to this approach, no matter the protocol, is how you are
> going to handle deltas. Right now, a git client tells the server "I have this
> commit, but I want this other one". And the server knows which objects the
> client has from the first, and which it needs from the second. Moreover, it
> knows that it can send objects in delta form directly from disk if the other
> side has the delta base.
> 
> So what happens in this system? We know we don't need to send any blobs
> in a regular fetch, because the whole idea is that we only send blobs on
> demand. So we wait for the client to ask us for blob A. But then what do we
> send? If we send the whole blob without deltas, we're going to waste a lot of
> bandwidth.
> 
> The on-disk size of all of the blobs in linux.git is ~500MB. The actual data size
> is ~48GB. Some of that is from zlib, which you get even for non-deltas. But
> the rest of it is from the delta compression. I don't think it's feasible to give
> that up, at least not for "normal" source repos like linux.git (more on that in
> a minute).
> 
> So ideally you do want to send deltas. But how do you know which objects
> the other side already has, which you can use as a delta base? Sending the
> list of "here are the blobs I have" doesn't scale. Just the sha1s start to add
> up, especially when you are doing incremental fetches.
> 
> I think this sort of things performs a lot better when you just focus on large
> objects. Because they don't tend to delta well anyway, and the savings are
> much bigger by avoiding ones you don't want. So a directive like "don't
> bother sending blobs larger than 1MB" avoids a lot of these issues. In other
> words, you have some quick shorthand to communicate between the client
> and server: this what I have, and what I don't.
> Normal git relies on commit reachability for that, but there are obviously
> other dimensions. The key thing is that both sides be able to express the
> filters succinctly, and apply them efficiently.
> 

Our challenge has been more the sheer _number_ of files that exist in 
the repo rather than the _size_ of the files in the repo.  With >3M 
source files and any typical developer only needing a small percentage 
of those files to do their job, our focus has been pruning the tree as 
much as possible such that they only pay the cost for the files they 
actually need.  With typical text source files being 10K - 20K in size, 
the overhead of the round trip is a significant part of the overall 
transfer time so deltas don't help as much.  I agree that large files 
are also a problem but it isn't my top focus at this point in time.  

> > After cloning, the developer can use sparse-checkout to limit the set
> > of files to the subset they need (typically only 1-10% in these large
> > repos).  This allows the initial checkout to only download the set of
> > files actually needed to complete their task.  At any point, the
> > sparse-checkout file can be updated to include additional files which
> > will be fetched transparently on demand.
> 
> If most of your benefits are not from avoiding blobs in general, but rather
> just from sparsely populating the tree, then it sounds like sparse clone might
> be an easier path forward. The general idea is to restrict not just the
> checkout, but the actual object transfer and reachability (in the tree
> dimension, the way shallow clone limits it in the time dimension, which will
> require cooperation between the client and server).
> 
> So that's another dimension of filtering, which should be expressed pretty
> succinctly: "I'm interested in these paths, and not these other ones." It's
> pretty easy to compute on the server side during graph traversal (though it
> interacts badly with reachability bitmaps, so there would need to be some
> hacks there).
> 
> It's an idea that's been talked about many times, but I don't recall that there
> were ever working patches. You might dig around in the list archive under
> the name "sparse clone" or possibly "narrow clone".

While a sparse/narrow clone would work with this proposal, it isn't 
required.  You'd still probably want all the commits and trees but the 
clone would also bring down the specified blobs.  Combined with using 
"depth" you could further limit it to those blobs at tip. 

We did run into problems with this model however as our usage patterns 
are such that our working directories often contain very sparse trees 
and as a result, we can end up with thousands of entries in the sparse 
checkout file.  This makes it difficult for users to manually specify a 
sparse-checkout before they even do a clone.  We have implemented a 
hashmap based sparse-checkout to deal with the performance issues of 
having that many entries but that's a different RFC/PATCH.  In short, we 
found that a "lazy-clone" and downloading blobs on demand provided a 
better developer experience.

> 
> > Now some numbers
> > ~~~~~~~~~~~~~~~~
> >
> > One repo has 3+ million files at tip across 500K folders with 5-6K
> > active developers.  They have done a lot of work to remove large files
> > from the repo so it is down to < 100GB.
> >
> > Before changes: clone took hours to transfer the 87GB .pack + 119MB
> > .idx
> >
> > After changes: clone took 4 minutes to transfer 305MB .pack + 37MB
> > .idx
> >
> > After hydrating 35K files (the typical number any individual developer
> > needs to do their work), there was an additional 460 MB of loose files
> > downloaded.
> 
> It sounds like you have a case where the repository has a lot of large files
> that are either historical, or uninteresting the sparse-tree dimension.
> 
> How big is that 460MB if it were actually packed with deltas?
> 

Uninteresting in the sparse-tree dimension.  460 MB divided by 35K files 
is less than 13 KB per file which is fairly typical for source code.  
Given there are no versions to calculate deltas from, compressing them 
into a pack file would help some but I don't have the numbers as to how 
much.  When we get to the "future work" below and start batching up 
requests, we'll have better data on that.

> > Future Work
> > ~~~~~~~~~~~
> >
> > The current prototype calls a new hook proc in
> > sha1_object_info_extended and read_object, to download each missing
> > blob.  A better solution would be to implement this via a long running
> > process that is spawned on the first download and listens for requests
> > to download additional objects until it terminates when the parent git
> > operation exits (similar to the recent long running smudge and clean filter
> work).
> 
> Yeah, see the external-odb discussion. Those prototypes use a process per
> object, but I think we all agree after seeing how the git-lfs interface has
> scaled that this is a non-starter. Recent versions of git-lfs do the single-
> process thing, and I think any sort of external-odb hook should be modeled
> on that protocol.
> 

I'm looking into this now and plan to re-implement it this way before 
sending out the first patch series.  Glad to hear you think it is a good 
protocol to model it on.

> > Need to investigate an alternate batching scheme where we can make a
> > single request for a set of "related" blobs and receive single a
> > packfile (especially during checkout).
> 
> I think this sort of batching is going to be the really hard part to retrofit onto
> git. Because you're throwing out the procedural notion that you can loop
> over a set of objects and ask for each individually.
> You have to start deferring computation until answers are ready. Some
> operations can do that reasonably well (e.g., checkout), but something like
> "git log -p" is constantly digging down into history. I suppose you could just
> perform the skeleton of the operation _twice_, once to find the list of objects
> to fault in, and the second time to actually do it.
> 
> That will make git feel a lot slower, because a lot of the illusion of speed is
> the way it streams out results. OTOH, if you have to wait to fault in objects
> from the network, it's going to feel pretty slow anyway. :)
> 

The good news is that for most operations, git doesn't need to access to 
all the blobs.  You're right, any command that does ends up faulting in 
a bunch of blobs from the network can get pretty slow.  Sometimes you 
get streaming results and sometimes it just "hangs" while we go off 
downloading blobs in the background.  We capture telemetry to detect 
these types of issues but typically the users are more than happy to 
send us an "I just ran command 'foo' and it hung" email. :)

> -Peff


^ permalink raw reply

* Re: [RFC] Add support for downloading blobs on demand
From: Martin Fick @ 2017-01-17 22:05 UTC (permalink / raw)
  To: Ben Peart; +Cc: 'Shawn Pearce', 'git', benpeart
In-Reply-To: <002501d2710b$af74c4d0$0e5e4e70$@gmail.com>

On Tuesday, January 17, 2017 04:50:13 PM Ben Peart wrote:
> While large files can be a real problem, our biggest issue
> today is having a lot (millions!) of source files when
> any individual developer only needs a small percentage of
> them.  Git with 3+ million local files just doesn't
> perform well.

Honestly, this sounds like a problem better dealt with by 
using git subtree or git submodules, have you considered 
that?

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply

* Re: [PATCH 2/2] Be more careful when determining whether a remote was configured
From: Junio C Hamano @ 2017-01-17 22:19 UTC (permalink / raw)
  To: Jeff King; +Cc: Johannes Schindelin, git, Thomas Gummerer, Andrew Arnott
In-Reply-To: <20170117214723.p5rni6wwggei366j@sigill.intra.peff.net>

Jeff King <peff@peff.net> writes:

>> Let's fix this by telling Git that a remote is not configured unless any
>> fetch/push URL or refspec is configured explicitly.
>
> I notice here that setting a refspec _does_ define a remote. Is there a
> reason you drew the line there, and not at, say, whether it has a URL?

"Not configured unless any URL or refspec is configured" means that
if URL is there, even if there is no refspec, then it is a remote
definition, right?  

But I think "what does it mean to define a remote" is a question
that you are asking, and that is not necessary to answer within the
scope of this topic.

I do agree that honoring .prune is nonsense unless refspecs are
defined, but the question "does it make sense to allow prune?"  is
different from "is it configured?".  Your "you can set .proxy and
other useful things, it is just .prune does not make sense" is a
quite appropriate statement to illustrate the difference.

Perhaps instead of adding "is it configured?" flag that is too
broadly named and has too narrow meaning, would it make more sense
to introduce "int can_prune(struct remote *remote)" that looks at
the remote refspecs?

^ permalink raw reply

* Re: [RFC] Add support for downloading blobs on demand
From: Stefan Beller @ 2017-01-17 22:23 UTC (permalink / raw)
  To: Martin Fick; +Cc: Ben Peart, Shawn Pearce, git, benpeart
In-Reply-To: <2381666.1DSVtKRIH5@mfick1-lnx>

On Tue, Jan 17, 2017 at 2:05 PM, Martin Fick <mfick@codeaurora.org> wrote:
> On Tuesday, January 17, 2017 04:50:13 PM Ben Peart wrote:
>> While large files can be a real problem, our biggest issue
>> today is having a lot (millions!) of source files when
>> any individual developer only needs a small percentage of
>> them.  Git with 3+ million local files just doesn't
>> perform well.
>
> Honestly, this sounds like a problem better dealt with by
> using git subtree or git submodules, have you considered
> that?
>
> -Martin
>

I cannot speak for subtrees as I have very little knowledge on them.
But there you also have the problem that *someone* has to have a
whole tree? (e.g. the build bot)

submodules however comes with a couple of things attached, both
positive as well as negative points:

* it offers ACLs along the way. ($user may not be allowed to
  clone all submodules, but only those related to the work)
* The conceptual understanding of git just got a lot harder.
  ("Yo dawg, I heard you like git, so I put git repos inside
  other git repos"), it is not easy to come up with reasonable
  defaults for all usecases, so the everyday user still has to
  have some understanding of submodules.
* typical cheap in-tree operations may become very expensive:
  -> moving a file from one location to another (in another
     submodule) adds overhead, no rename detection.
* We are actively working on submodules, so there is
  some momentum going already.
* our experiments with Android show that e.g. fetching (even
  if you have all of Android) becomes a lot faster for everyday
  usage as only a few repositories change each day). This
  comparision was against the repo tool, that we currently
  use for Android. I do not know how it would compare against
  single repo Git, as having such a large repository seemed
  complicated.
* the support for submodules in Git is already there, though
  not polished. The positive side is to have already a good base,
  the negative side is to have support current use cases.

Stefan

^ permalink raw reply

* Re: [PATCH 5/5] describe: teach describe negative pattern matches
From: Jacob Keller @ 2017-01-17 23:31 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Jacob Keller, Git mailing list
In-Reply-To: <97d4105c-0fa6-41c5-e456-30cebd93dc36@kdbg.org>

On Fri, Jan 13, 2017 at 1:31 PM, Johannes Sixt <j6t@kdbg.org> wrote:
> Am 13.01.2017 um 07:57 schrieb Jacob Keller:
>>
>> On Thu, Jan 12, 2017 at 10:43 PM, Johannes Sixt <j6t@kdbg.org> wrote:
>>>
>>>  When you write
>>>
>>>   git log --branches --exclude=origin/* --remotes
>>>
>>> --exclude=origin/* applies only to --remotes, but not to --branches.
>>
>>
>> Well for describe I don't think the order matters.
>
>
> That is certainly true today. But I would value consistency more. We would
> lose it if some time in the future 'describe' accepts --branches and
> --remotes in addition to --tags and --all.
>
> -- Hannes
>

I am not sure that the interface for git-log and git-describe are
similar enough to make this distinction work. --match already seems to
imply that it only works on refs in refs/tags, as it says it considers
globs matching excluding the "refs/tags" prefix.

In git-describe, we already have "--tags" and "--all" but they are
mutually exclusive. We don't support using more than one at once, and
I'm not really convinced that describe will ever support more than one
at a time. Additionally, match already doesn't respect order.

Thanks,
Jake

^ permalink raw reply

* Re: [RFH - Tcl/Tk] use of procedure before declaration?
From: Philip Oakley @ 2017-01-17 23:32 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Git List, Pat Thoyts
In-Reply-To: <alpine.DEB.2.20.1701171218260.3469@virtualbox>

From: "Johannes Schindelin" <Johannes.Schindelin@gmx.de>
> Hi Philip,
>
> On Mon, 16 Jan 2017, Philip Oakley wrote:
>
>> In
>> https://github.com/git/git/blob/master/git-gui/lib/choose_repository.tcl#L242
>> the procedure `_unset_recentrepo` is called, however the procedure isn't
>> declared until line 248. My reading of the various Tcl tutorials suggest
>> (but not explictly) that this isn't the right way.
>
> Indeed, calling a procedure before it is declared sounds incorrect.
>
> Since documentation can be treacherous, let's just test it. With a `tclsh`
> whose `$tcl_version` variable claims that this is version 8.6, this
> script:
>
> ```tcl
> hello Philip
>
> proc hello {arg} {
>        puts "Hi, $arg"
> }
> ```
>
> ... yields the error message:
>
> invalid command name "hello"
>     while executing
> "hello Philip"
>
> ... while this script:
>
> ```tcl
> proc hello {arg} {
>        puts "Hi, $arg"
> }
>
> hello Philip
> ```
>
> ... prints the expected "Hi, Philip".
>
> Having said that, in the code to which you linked, the procedure is not
> actually called before it is declared, as the call is inside another
> procedure.
>
> Indeed, the entire file declares one object-oriented class, so no code
> gets executed in that file:
>
> https://github.com/git/git/blob/d7dffce1c/git-gui/lib/choose_repository.tcl#L4
>
> (I guess proper indentation would make it easier to understand that this
> file is defining a class, not executing anything yet).
>
> And it is perfectly legitimate to use not-yet-declared procedures in other
> procedures, otherwise recursion would not work.
>
>> Should 3c6a287 ("git-gui: Keep repo_config(gui.recentrepos) and 
>> .gitconfig
>> in sync", 2010-01-23) have declared `proc _unset_recentrepo {p}` before
>> `proc _get_recentrepos {}` ?
>
> Given the findings above, I believe that the patch is actually correct.
>
> Ciao,
> Dscho
>
Thanks for the clarification. I'll update the old patch series and see if we 
can get this fixed.

Philip 


^ permalink raw reply

* [PATCH 0/4] start documenting core functions
From: Stefan Beller @ 2017-01-17 23:34 UTC (permalink / raw)
  To: gitster; +Cc: git, Stefan Beller

The two single patches[1] are turned into a series here.

[1] https://public-inbox.org/git/20170117200147.25425-1-sbeller@google.com/

Thanks,
Stefan

Stefan Beller (4):
  document index_name_pos
  remove_index_entry_at: move documentation to cache.h
  document add_[file_]to_index
  documentation: retire unfinished documentation

 Documentation/technical/api-in-core-index.txt | 21 ----------------
 cache.h                                       | 35 +++++++++++++++++++++++----
 read-cache.c                                  |  1 -
 3 files changed, 30 insertions(+), 27 deletions(-)
 delete mode 100644 Documentation/technical/api-in-core-index.txt

-- 
2.11.0.299.g762782ba8a


^ permalink raw reply

* [PATCH 2/4] remove_index_entry_at: move documentation to cache.h
From: Stefan Beller @ 2017-01-17 23:35 UTC (permalink / raw)
  To: gitster; +Cc: git, Stefan Beller
In-Reply-To: <20170117233503.27137-1-sbeller@google.com>

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 cache.h      | 3 +++
 read-cache.c | 1 -
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/cache.h b/cache.h
index 270a0d0ea7..26632065a5 100644
--- a/cache.h
+++ b/cache.h
@@ -599,7 +599,10 @@ extern int index_name_pos(const struct index_state *, const char *name, int name
 #define ADD_CACHE_KEEP_CACHE_TREE 32	/* Do not invalidate cache-tree */
 extern int add_index_entry(struct index_state *, struct cache_entry *ce, int option);
 extern void rename_index_entry_at(struct index_state *, int pos, const char *new_name);
+
+/* Remove entry, return 1 if there are more entries after pos. */
 extern int remove_index_entry_at(struct index_state *, int pos);
+
 extern void remove_marked_cache_entries(struct index_state *istate);
 extern int remove_file_from_index(struct index_state *, const char *path);
 #define ADD_CACHE_VERBOSE 1
diff --git a/read-cache.c b/read-cache.c
index 2eca639cce..63a414cdb5 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -503,7 +503,6 @@ int index_name_pos(const struct index_state *istate, const char *name, int namel
 	return index_name_stage_pos(istate, name, namelen, 0);
 }
 
-/* Remove entry, return true if there are more entries to go.. */
 int remove_index_entry_at(struct index_state *istate, int pos)
 {
 	struct cache_entry *ce = istate->cache[pos];
-- 
2.11.0.299.g762782ba8a


^ permalink raw reply related

* [PATCH 1/4] document index_name_pos
From: Stefan Beller @ 2017-01-17 23:35 UTC (permalink / raw)
  To: gitster; +Cc: git, Stefan Beller
In-Reply-To: <20170117233503.27137-1-sbeller@google.com>

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 cache.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/cache.h b/cache.h
index 1b67f078dd..270a0d0ea7 100644
--- a/cache.h
+++ b/cache.h
@@ -575,7 +575,22 @@ extern int verify_path(const char *path);
 extern int index_dir_exists(struct index_state *istate, const char *name, int namelen);
 extern void adjust_dirname_case(struct index_state *istate, char *name);
 extern struct cache_entry *index_file_exists(struct index_state *istate, const char *name, int namelen, int igncase);
+
+/*
+ * Searches for an entry defined by name and namelen in the given index.
+ * If the return value is positive (including 0) it is the position of an
+ * exact match. If the return value is negative, the negated value minus 1 is the
+ * position where the entry would be inserted.
+ * Example: In the current index we have the files b,d,e:
+ * index_name_pos(&index, "a", 1) -> -1
+ * index_name_pos(&index, "b", 1) ->  0
+ * index_name_pos(&index, "c", 1) -> -2
+ * index_name_pos(&index, "d", 1) ->  1
+ * index_name_pos(&index, "e", 1) ->  2
+ * index_name_pos(&index, "f", 1) -> -3
+ */
 extern int index_name_pos(const struct index_state *, const char *name, int namelen);
+
 #define ADD_CACHE_OK_TO_ADD 1		/* Ok to add */
 #define ADD_CACHE_OK_TO_REPLACE 2	/* Ok to replace file/directory */
 #define ADD_CACHE_SKIP_DFCHECK 4	/* Ok to skip DF conflict checks */
-- 
2.11.0.299.g762782ba8a


^ permalink raw reply related

* [PATCH 3/4] document add_[file_]to_index
From: Stefan Beller @ 2017-01-17 23:35 UTC (permalink / raw)
  To: gitster; +Cc: git, Stefan Beller
In-Reply-To: <20170117233503.27137-1-sbeller@google.com>

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 cache.h | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/cache.h b/cache.h
index 26632065a5..acc639d6e0 100644
--- a/cache.h
+++ b/cache.h
@@ -605,13 +605,20 @@ extern int remove_index_entry_at(struct index_state *, int pos);
 
 extern void remove_marked_cache_entries(struct index_state *istate);
 extern int remove_file_from_index(struct index_state *, const char *path);
-#define ADD_CACHE_VERBOSE 1
-#define ADD_CACHE_PRETEND 2
-#define ADD_CACHE_IGNORE_ERRORS	4
-#define ADD_CACHE_IGNORE_REMOVAL 8
-#define ADD_CACHE_INTENT 16
+
+#define ADD_CACHE_VERBOSE 1		/* verbose */
+#define ADD_CACHE_PRETEND 2 		/* dry run */
+#define ADD_CACHE_IGNORE_ERRORS 4	/* ignore errors */
+#define ADD_CACHE_IGNORE_REMOVAL 8	/* do not remove files from index */
+#define ADD_CACHE_INTENT 16		/* intend to add later; stage empty file */
+/*
+ * Adds the given path the index, respecting the repsitory configuration, e.g.
+ * in case insensitive file systems, the path is normalized.
+ */
 extern int add_to_index(struct index_state *, const char *path, struct stat *, int flags);
+/* stat the file then call add_to_index */
 extern int add_file_to_index(struct index_state *, const char *path, int flags);
+
 extern struct cache_entry *make_cache_entry(unsigned int mode, const unsigned char *sha1, const char *path, int stage, unsigned int refresh_options);
 extern int chmod_index_entry(struct index_state *, struct cache_entry *ce, char flip);
 extern int ce_same_name(const struct cache_entry *a, const struct cache_entry *b);
-- 
2.11.0.299.g762782ba8a


^ permalink raw reply related

* [PATCH v6 6/6] t/t7004-tag: Add --format specifier tests
From: santiago @ 2017-01-17 23:37 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, sunshine, walters, Santiago Torres
In-Reply-To: <20170117233723.23897-1-santiago@nyu.edu>

From: Santiago Torres <santiago@nyu.edu>

tag -v now supports --format specifiers to inspect the contents of a tag
upon verification. Add two tests to ensure this behavior is respected in
future changes.

Signed-off-by: Santiago Torres <santiago@nyu.edu>
---
 t/t7004-tag.sh | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/t/t7004-tag.sh b/t/t7004-tag.sh
index 07869b0c0..ba88b556b 100755
--- a/t/t7004-tag.sh
+++ b/t/t7004-tag.sh
@@ -874,6 +874,22 @@ test_expect_success GPG 'verifying a forged tag should fail' '
 	test_must_fail git tag -v forged-tag
 '
 
+test_expect_success 'verifying a proper tag with --format pass and format accordingly' '
+	cat >expect <<-\EOF
+	tagname : signed-tag
+	EOF &&
+	git tag -v --format="tagname : %(tag)" "signed-tag" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'verifying a forged tag with --format fail and format accordingly' '
+	cat >expect <<-\EOF
+	tagname : forged-tag
+	EOF &&
+	test_must_fail git tag -v --format="tagname : %(tag)" "forged-tag" >actual &&
+	test_cmp expect actual
+'
+
 # blank and empty messages for signed tags:
 
 get_tag_header empty-signed-tag $commit commit $time >expect
-- 
2.11.0


^ permalink raw reply related

* [PATCH v6 3/6] builtin/verify-tag: add --format to verify-tag
From: santiago @ 2017-01-17 23:37 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, sunshine, walters, Santiago Torres
In-Reply-To: <20170117233723.23897-1-santiago@nyu.edu>

From: Santiago Torres <santiago@nyu.edu>

Callers of verify-tag may want to cross-check the tagname from refs/tags
with the tagname from the tag object header upon GPG verification. This
is to avoid tag refs that point to an incorrect object.

Add a --format parameter to git verify-tag to print the formatted tag
object header in addition to or instead of the --verbose or --raw GPG
verification output.

Signed-off-by: Santiago Torres <santiago@nyu.edu>
---
 Documentation/git-verify-tag.txt |  2 +-
 builtin/verify-tag.c             | 23 ++++++++++++++++++++---
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-verify-tag.txt b/Documentation/git-verify-tag.txt
index d590edceb..0b8075dad 100644
--- a/Documentation/git-verify-tag.txt
+++ b/Documentation/git-verify-tag.txt
@@ -8,7 +8,7 @@ git-verify-tag - Check the GPG signature of tags
 SYNOPSIS
 --------
 [verse]
-'git verify-tag' <tag>...
+'git verify-tag' [--format=<format>] <tag>...
 
 DESCRIPTION
 -----------
diff --git a/builtin/verify-tag.c b/builtin/verify-tag.c
index 99f8148cf..18443bf9f 100644
--- a/builtin/verify-tag.c
+++ b/builtin/verify-tag.c
@@ -12,12 +12,14 @@
 #include <signal.h>
 #include "parse-options.h"
 #include "gpg-interface.h"
+#include "ref-filter.h"
 
 static const char * const verify_tag_usage[] = {
-		N_("git verify-tag [-v | --verbose] <tag>..."),
+		N_("git verify-tag [-v | --verbose] [--format=<format>] <tag>..."),
 		NULL
 };
 
+
 static int git_verify_tag_config(const char *var, const char *value, void *cb)
 {
 	int status = git_gpg_config(var, value, cb);
@@ -30,9 +32,11 @@ int cmd_verify_tag(int argc, const char **argv, const char *prefix)
 {
 	int i = 1, verbose = 0, had_error = 0;
 	unsigned flags = 0;
+	char *fmt_pretty = NULL;
 	const struct option verify_tag_options[] = {
 		OPT__VERBOSE(&verbose, N_("print tag contents")),
 		OPT_BIT(0, "raw", &flags, N_("print raw gpg status output"), GPG_VERIFY_RAW),
+		OPT_STRING(  0 , "format", &fmt_pretty, N_("format"), N_("format to use for the output")),
 		OPT_END()
 	};
 
@@ -46,13 +50,26 @@ int cmd_verify_tag(int argc, const char **argv, const char *prefix)
 	if (verbose)
 		flags |= GPG_VERIFY_VERBOSE;
 
+	if (fmt_pretty) {
+		verify_ref_format(fmt_pretty);
+		flags |= GPG_VERIFY_OMIT_STATUS;
+	}
+
 	while (i < argc) {
 		unsigned char sha1[20];
 		const char *name = argv[i++];
-		if (get_sha1(name, sha1))
+		if (get_sha1(name, sha1)) {
 			had_error = !!error("tag '%s' not found.", name);
-		else if (gpg_verify_tag(sha1, name, flags))
+			continue;
+		}
+
+		if (gpg_verify_tag(sha1, name, flags)) {
 			had_error = 1;
+			continue;
+		}
+
+		if (fmt_pretty)
+			pretty_print_ref(name, sha1, fmt_pretty);
 	}
 	return had_error;
 }
-- 
2.11.0


^ permalink raw reply related

* [PATCH v6 1/6] gpg-interface,tag: add GPG_VERIFY_OMIT_STATUS flag
From: santiago @ 2017-01-17 23:37 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, sunshine, walters, Lukas Puehringer
In-Reply-To: <20170117233723.23897-1-santiago@nyu.edu>

From: Lukas Puehringer <luk.puehringer@gmail.com>

Functions that print git object information may require that the
gpg-interface functions be silent. Add GPG_VERIFY_OMIT_STATUS flag and
prevent print_signature_buffer from being called if flag is set.

Signed-off-by: Lukas Puehringer <luk.puehringer@gmail.com>
---
 gpg-interface.h | 5 +++--
 tag.c           | 5 ++++-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/gpg-interface.h b/gpg-interface.h
index ea68885ad..d2d4fd3a6 100644
--- a/gpg-interface.h
+++ b/gpg-interface.h
@@ -1,8 +1,9 @@
 #ifndef GPG_INTERFACE_H
 #define GPG_INTERFACE_H
 
-#define GPG_VERIFY_VERBOSE	1
-#define GPG_VERIFY_RAW		2
+#define GPG_VERIFY_VERBOSE		1
+#define GPG_VERIFY_RAW			2
+#define GPG_VERIFY_OMIT_STATUS	4
 
 struct signature_check {
 	char *payload;
diff --git a/tag.c b/tag.c
index d1dcd18cd..243d1fdbb 100644
--- a/tag.c
+++ b/tag.c
@@ -3,6 +3,7 @@
 #include "commit.h"
 #include "tree.h"
 #include "blob.h"
+#include "gpg-interface.h"
 
 const char *tag_type = "tag";
 
@@ -24,7 +25,9 @@ static int run_gpg_verify(const char *buf, unsigned long size, unsigned flags)
 
 	ret = check_signature(buf, payload_size, buf + payload_size,
 				size - payload_size, &sigc);
-	print_signature_buffer(&sigc, flags);
+
+	if (!(flags & GPG_VERIFY_OMIT_STATUS))
+		print_signature_buffer(&sigc, flags);
 
 	signature_check_clear(&sigc);
 	return ret;
-- 
2.11.0


^ permalink raw reply related

* [PATCH v6 0/6] Add --format to tag verification
From: santiago @ 2017-01-17 23:37 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, sunshine, walters, Santiago Torres

From: Santiago Torres <santiago@nyu.edu>

This is the sixth iteration of [1][2][3][4][5], and as a result of the
discussion in [5]. The main goal of this patch series is to bring
--format to git tag verification so that upper-layer tools can inspect
the content of a tag and make decisions based on it.

In this re-woll we:

* Changed the call interface so printing is done outside of verification. 

* Fixed a couple of whitespace issues and whatnot. 

Thanks,
-Santiago

[1] http://public-inbox.org/git/20170115184705.10376-1-santiago@nyu.edu/
[2] http://public-inbox.org/git/20161007210721.20437-1-santiago@nyu.edu/
[3] http://public-inbox.org/git/20160930221806.3398-1-santiago@nyu.edu/
[4] http://public-inbox.org/git/20160922185317.349-1-santiago@nyu.edu/
[5] http://public-inbox.org/git/20160926224233.32702-1-santiago@nyu.edu/
[6] http://public-inbox.org/git/20160607195608.16643-1-santiago@nyu.edu/
[7] http://public-inbox.org/git/20161019203546.dfqmi2czcxopgj6w@sigill.intra.peff.net/
[8] http://public-inbox.org/git/20161019203943.epjxnfci7vcqg4xv@sigill.intra.peff.net/

Lukas Puehringer (3):
  gpg-interface,tag: add GPG_VERIFY_OMIT_STATUS flag
  ref-filter: add function to print single ref_array_item
  builtin/tag: add --format argument for tag -v

Santiago Torres (3):
  builtin/verify-tag: add --format to verify-tag
  t/t7030-verify-tag: Add --format specifier tests
  t/t7004-tag: Add --format specifier tests

 Documentation/git-tag.txt        |  2 +-
 Documentation/git-verify-tag.txt |  2 +-
 builtin/tag.c                    | 38 ++++++++++++++++++++++++++++----------
 builtin/verify-tag.c             | 23 ++++++++++++++++++++---
 gpg-interface.h                  |  5 +++--
 ref-filter.c                     | 27 +++++++++++++++++++++------
 ref-filter.h                     |  7 +++++++
 t/t7004-tag.sh                   | 16 ++++++++++++++++
 t/t7030-verify-tag.sh            | 16 ++++++++++++++++
 tag.c                            |  5 ++++-
 10 files changed, 117 insertions(+), 24 deletions(-)

-- 
2.11.0


^ permalink raw reply

* [PATCH v6 2/6] ref-filter: add function to print single ref_array_item
From: santiago @ 2017-01-17 23:37 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, sunshine, walters, Lukas Puehringer
In-Reply-To: <20170117233723.23897-1-santiago@nyu.edu>

From: Lukas Puehringer <luk.puehringer@gmail.com>

ref-filter functions are useful for printing git object information
using a format specifier. However, some other modules may not want to use
this functionality on a ref-array but only print a single item.

Expose a pretty_print_ref function to create, pretty print and free
individual ref-items.

Signed-off-by: Lukas Puehringer <luk.puehringer@gmail.com>
---
 ref-filter.c | 27 +++++++++++++++++++++------
 ref-filter.h |  7 +++++++
 2 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index 1a978405e..5f4b08792 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1361,7 +1361,7 @@ static struct ref_array_item *new_ref_array_item(const char *refname,
 	return ref;
 }
 
-static int filter_ref_kind(struct ref_filter *filter, const char *refname)
+static int ref_kind_from_refname(const char *refname)
 {
 	unsigned int i;
 
@@ -1374,11 +1374,7 @@ static int filter_ref_kind(struct ref_filter *filter, const char *refname)
 		{ "refs/tags/", FILTER_REFS_TAGS}
 	};
 
-	if (filter->kind == FILTER_REFS_BRANCHES ||
-	    filter->kind == FILTER_REFS_REMOTES ||
-	    filter->kind == FILTER_REFS_TAGS)
-		return filter->kind;
-	else if (!strcmp(refname, "HEAD"))
+	if (!strcmp(refname, "HEAD"))
 		return FILTER_REFS_DETACHED_HEAD;
 
 	for (i = 0; i < ARRAY_SIZE(ref_kind); i++) {
@@ -1389,6 +1385,15 @@ static int filter_ref_kind(struct ref_filter *filter, const char *refname)
 	return FILTER_REFS_OTHERS;
 }
 
+static int filter_ref_kind(struct ref_filter *filter, const char *refname)
+{
+	if (filter->kind == FILTER_REFS_BRANCHES ||
+	    filter->kind == FILTER_REFS_REMOTES ||
+	    filter->kind == FILTER_REFS_TAGS)
+		return filter->kind;
+	return ref_kind_from_refname(refname);
+}
+
 /*
  * A call-back given to for_each_ref().  Filter refs and keep them for
  * later object processing.
@@ -1671,6 +1676,16 @@ void show_ref_array_item(struct ref_array_item *info, const char *format, int qu
 	putchar('\n');
 }
 
+void pretty_print_ref(const char *name, const unsigned char *sha1,
+		const char *format)
+{
+	struct ref_array_item *ref_item;
+	ref_item = new_ref_array_item(name, sha1, 0);
+	ref_item->kind = ref_kind_from_refname(name);
+	show_ref_array_item(ref_item, format, 0);
+	free_array_item(ref_item);
+}
+
 /*  If no sorting option is given, use refname to sort as default */
 struct ref_sorting *ref_default_sorting(void)
 {
diff --git a/ref-filter.h b/ref-filter.h
index fc55fa357..7b05592ba 100644
--- a/ref-filter.h
+++ b/ref-filter.h
@@ -109,4 +109,11 @@ struct ref_sorting *ref_default_sorting(void);
 /*  Function to parse --merged and --no-merged options */
 int parse_opt_merge_filter(const struct option *opt, const char *arg, int unset);
 
+/*
+ * Print a single ref, outside of any ref-filter. Note that the
+ * name must be a fully qualified refname.
+ */
+void pretty_print_ref(const char *name, const unsigned char *sha1,
+		const char *format);
+
 #endif /*  REF_FILTER_H  */
-- 
2.11.0


^ permalink raw reply related

* [PATCH v6 4/6] builtin/tag: add --format argument for tag -v
From: santiago @ 2017-01-17 23:37 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, sunshine, walters, Lukas Puehringer
In-Reply-To: <20170117233723.23897-1-santiago@nyu.edu>

From: Lukas Puehringer <luk.puehringer@gmail.com>

Adding --format to git tag -v mutes the default output of the GPG
verification and instead prints the formatted tag object.
This allows callers to cross-check the tagname from refs/tags with
the tagname from the tag object header upon GPG verification.

The callback function for for_each_tag_name() didn't allow callers to
pass custom data to their callback functions. Add a new opaque pointer
to each_tag_name_fn's parameter to allow this.

Signed-off-by: Lukas Puehringer <luk.puehringer@gmail.com>
---
 Documentation/git-tag.txt |  2 +-
 builtin/tag.c             | 38 ++++++++++++++++++++++++++++----------
 2 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/Documentation/git-tag.txt b/Documentation/git-tag.txt
index 76cfe40d9..586aaa315 100644
--- a/Documentation/git-tag.txt
+++ b/Documentation/git-tag.txt
@@ -15,7 +15,7 @@ SYNOPSIS
 'git tag' [-n[<num>]] -l [--contains <commit>] [--points-at <object>]
 	[--column[=<options>] | --no-column] [--create-reflog] [--sort=<key>]
 	[--format=<format>] [--[no-]merged [<commit>]] [<pattern>...]
-'git tag' -v <tagname>...
+'git tag' -v [--format=<format>] <tagname>...
 
 DESCRIPTION
 -----------
diff --git a/builtin/tag.c b/builtin/tag.c
index 73df72811..b9da72761 100644
--- a/builtin/tag.c
+++ b/builtin/tag.c
@@ -24,7 +24,7 @@ static const char * const git_tag_usage[] = {
 	N_("git tag -d <tagname>..."),
 	N_("git tag -l [-n[<num>]] [--contains <commit>] [--points-at <object>]"
 		"\n\t\t[--format=<format>] [--[no-]merged [<commit>]] [<pattern>...]"),
-	N_("git tag -v <tagname>..."),
+	N_("git tag -v [--format=<format>] <tagname>..."),
 	NULL
 };
 
@@ -66,15 +66,17 @@ static int list_tags(struct ref_filter *filter, struct ref_sorting *sorting, con
 }
 
 typedef int (*each_tag_name_fn)(const char *name, const char *ref,
-				const unsigned char *sha1);
+				const unsigned char *sha1, void *cb_data);
 
-static int for_each_tag_name(const char **argv, each_tag_name_fn fn)
+static int for_each_tag_name(const char **argv, each_tag_name_fn fn,
+		void *cb_data)
 {
 	const char **p;
 	char ref[PATH_MAX];
 	int had_error = 0;
 	unsigned char sha1[20];
 
+
 	for (p = argv; *p; p++) {
 		if (snprintf(ref, sizeof(ref), "refs/tags/%s", *p)
 					>= sizeof(ref)) {
@@ -87,14 +89,14 @@ static int for_each_tag_name(const char **argv, each_tag_name_fn fn)
 			had_error = 1;
 			continue;
 		}
-		if (fn(*p, ref, sha1))
+		if (fn(*p, ref, sha1, cb_data))
 			had_error = 1;
 	}
 	return had_error;
 }
 
 static int delete_tag(const char *name, const char *ref,
-				const unsigned char *sha1)
+				const unsigned char *sha1, void *cb_data)
 {
 	if (delete_ref(ref, sha1, 0))
 		return 1;
@@ -103,9 +105,22 @@ static int delete_tag(const char *name, const char *ref,
 }
 
 static int verify_tag(const char *name, const char *ref,
-				const unsigned char *sha1)
+				const unsigned char *sha1, void *cb_data)
 {
-	return gpg_verify_tag(sha1, name, GPG_VERIFY_VERBOSE);
+	int flags;
+	char *fmt_pretty = cb_data;
+	flags = GPG_VERIFY_VERBOSE;
+
+	if (fmt_pretty)
+		flags = GPG_VERIFY_OMIT_STATUS;
+
+	if (gpg_verify_tag(sha1, name, flags))
+		return -1;
+
+    if (fmt_pretty)
+		pretty_print_ref(name, sha1, fmt_pretty);
+
+	return 0;
 }
 
 static int do_sign(struct strbuf *buffer)
@@ -428,9 +443,12 @@ int cmd_tag(int argc, const char **argv, const char *prefix)
 	if (filter.merge_commit)
 		die(_("--merged and --no-merged option are only allowed with -l"));
 	if (cmdmode == 'd')
-		return for_each_tag_name(argv, delete_tag);
-	if (cmdmode == 'v')
-		return for_each_tag_name(argv, verify_tag);
+		return for_each_tag_name(argv, delete_tag, NULL);
+	if (cmdmode == 'v') {
+		if (format)
+			verify_ref_format(format);
+		return for_each_tag_name(argv, verify_tag, format);
+	}
 
 	if (msg.given || msgfile) {
 		if (msg.given && msgfile)
-- 
2.11.0


^ permalink raw reply related

* [PATCH v6 5/6] t/t7030-verify-tag: Add --format specifier tests
From: santiago @ 2017-01-17 23:37 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, sunshine, walters, Santiago Torres
In-Reply-To: <20170117233723.23897-1-santiago@nyu.edu>

From: Santiago Torres <santiago@nyu.edu>

Verify-tag now provides --format specifiers to inspect and ensure the
contents of the tag are proper. We add two tests to ensure this
functionality works as expected: the return value should indicate if
verification passed, and the format specifiers must be respected.

Signed-off-by: Santiago Torres <santiago@nyu.edu>
---
 t/t7030-verify-tag.sh | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/t/t7030-verify-tag.sh b/t/t7030-verify-tag.sh
index 07079a41c..d62ccbb98 100755
--- a/t/t7030-verify-tag.sh
+++ b/t/t7030-verify-tag.sh
@@ -125,4 +125,20 @@ test_expect_success GPG 'verify multiple tags' '
 	test_cmp expect.stderr actual.stderr
 '
 
+test_expect_success 'verifying tag with --format' '
+	cat >expect <<-\EOF
+	tagname : fourth-signed
+	EOF &&
+	git verify-tag --format="tagname : %(tag)" "fourth-signed" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'verifying a forged tag with --format fail and format accordingly' '
+	cat >expect <<-\EOF
+	tagname : 7th forged-signed
+	EOF &&
+	test_must_fail git verify-tag --format="tagname : %(tag)" $(cat forged1.tag) >actual-forged &&
+	test_cmp expect actual-forged
+'
+
 test_done
-- 
2.11.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox