Git development
 help / color / mirror / Atom feed
* Re: [PATCH 5/5] describe: teach describe negative pattern matches
From: Johannes Sixt @ 2017-01-13 21:31 UTC (permalink / raw)
  To: Jacob Keller; +Cc: Jacob Keller, Git mailing list
In-Reply-To: <CA+P7+xq1LMkRG_aSyamrsPUQE+rDv4A9Qd19tDMgx-_a5OHsqQ@mail.gmail.com>

Am 13.01.2017 um 07:57 schrieb Jacob Keller:
> On Thu, Jan 12, 2017 at 10:43 PM, Johannes Sixt <j6t@kdbg.org> wrote:
>>  When you write
>>
>>   git log --branches --exclude=origin/* --remotes
>>
>> --exclude=origin/* applies only to --remotes, but not to --branches.
>
> Well for describe I don't think the order matters.

That is certainly true today. But I would value consistency more. We 
would lose it if some time in the future 'describe' accepts --branches 
and --remotes in addition to --tags and --all.

-- Hannes


^ permalink raw reply

* Re: [PATCH 25/27] attr: store attribute stacks in hashmap
From: Junio C Hamano @ 2017-01-13 21:20 UTC (permalink / raw)
  To: Brandon Williams; +Cc: git, pclouds, sbeller
In-Reply-To: <20170112235354.153403-26-bmwill@google.com>

Brandon Williams <bmwill@google.com> writes:

> The last big hurdle towards a thread-safe API for the attribute system
> is the reliance on a global attribute stack that is modified during each
> call into the attribute system.
>
> This patch removes this global stack and instead a stack is retrieved or
> constructed locally.  Since each of these stacks is only used as a
> read-only structure once constructed, they can be stored in a hashmap
> and shared between threads.

Very good.

The reason why the original code used a stack was because it wanted
to keep only the info read from releavant files in-core, discarding
ones from files no-longer relevant (because the traversal switched
to another subdirectory of the same parent directory), to avoid the
memory consumption grow unbounded.  It probably was a premature
"optimization" that we can do without, so keeping everything we have
read so far in a hashmap (which is my understanding of what is going
on in this patch) is probably OK.

I suspect that this hashmap may eventually need to become per
attr_check if we want to follow through the optimization envisioned
by patch 15/27.

Inside fill(), path_matches() is called for the number of match_attr
in the entire attribute stack but it is wasteful to check if the
path matches with the a.u.pat if none of the a.state[] entries talk
about attributes and macros that are eventually get used by the
caller of check_attr().  By introducing a wrapping structure, 15/27
wanted to make sure that we have a place to store a "reduced"
attribute stack that is kept per attr_check that has only entries
from the files that talk about the attributes the particular
attr_check wants to learn about.

I need to think about this a bit more, but I do not offhand think
that it makes future such enhancement to make it per-check harder to
move from a global stack to a global hashmap, i.e. the above is not
an objection to this step.

> One caveat with storing and sharing the stack frames like this is that
> the info stack needs to be treated separately from the rest of the
> attribute stack.  This is because each stack frame holds a pointer to
> the stack that comes before it and if it was placed on top of the rest
> of the attribute stack then this pointer would be different for each
> attribute stack and wouldn't be able to be shared between threads.  In
> order to allow for sharing the info stack frame it needs to be its own
> isolated frame and can simply be processed first to have the same affect
> of being at the top of the stack.

Good.

Thanks.

^ permalink raw reply

* Re: [RFC] Add support for downloading blobs on demand
From: Shawn Pearce @ 2017-01-13 21:07 UTC (permalink / raw)
  To: Ben Peart; +Cc: git, benpeart
In-Reply-To: <20170113155253.1644-1-benpeart@microsoft.com>

On Fri, Jan 13, 2017 at 7:52 AM, Ben Peart <peartben@gmail.com> wrote:
>
> Goal
> ~~~~
>
> To be able to better handle repos with many files that any individual
> developer doesn’t need it would be nice if clone/fetch only brought down
> those files that were actually needed.
>
> To enable that, we are proposing adding a flag to clone/fetch that will
> instruct the server to limit the objects it sends to commits and trees
> and to not send any blobs.
>
> When git performs an operation that requires a blob that isn’t currently
> available locally, it will download the missing blob and add it to the
> local object store.

Interesting. This is also an area I want to work on with my team at
$DAY_JOB. Repositories are growing along multiple dimensions, and
developers or editors don't always need all blobs for all time
available locally to successfully perform their work.

> Design
> ~~~~~~
>
> Clone and fetch will pass a “--lazy-clone” flag (open to a better name
> here) similar to “--depth” that instructs the server to only return
> commits and trees and to ignore blobs.

My group at $DAY_JOB hasn't talked about it yet, but I want to add a
protocol capability that lets clone/fetch ask only for blobs smaller
than a specified byte count. This could be set to a reasonable text
file size (e.g. <= 5 MiB) to predominately download only source files
and text documentation, omitting larger binaries.

If the limit was set to 0, its the same as your idea to ignore all blobs.

> Later during git operations like checkout, when a blob cannot be found
> after checking all the regular places (loose, pack, alternates, etc),
> git will download the missing object and place it into the local object
> store (currently as a loose object) then resume the operation.

Right. I'd like to have this object retrieval be inside the native Git
wire protocol, reusing the remote configuration and authentication
setup. That requires expanding the server side of the protocol
implementation slightly allowing any reachable object to be retrieved
by SHA-1 alone. Bitmap indexes can significantly reduce the
computational complexity for the server.

> To prevent git from accidentally downloading all missing blobs, some git
> operations are updated to be aware of the potential for missing blobs.
> The most obvious being check_connected which will return success as if
> everything in the requested commits is available locally.

This ... sounds risky for the developer, as the repository may be
corrupt due to a missing object, and the user cannot determine it.

Would it be reasonable for the server to return a list of SHA-1s it
knows should exist, but has omitted due to the blob threshold (above),
and the local repository store this in a binary searchable file?
During connectivity checking its assumed OK if an object is not
present in the object store, but is listed in this omitted objects
file.

> To minimize the impact on the server, the existing dumb HTTP protocol
> endpoint “objects/<sha>” can be used to retrieve the individual missing
> blobs when needed.

I'd prefer this to be in the native wire protocol, where the objects
are in pack format (which unfortunately differs from loose format). I
assume servers would combine many objects into pack files, potentially
isolating large uncompressable binaries into their own packs, stored
separately from commits/trees/small-text-blobs.

I get the value of this being in HTTP, where HTTP caching inside
proxies can be leveraged to reduce master server load. I wonder if the
native wire protocol could be taught to use a variation of an HTTP GET
that includes the object SHA-1 in the URL line, to retrieve a
one-object pack file.

> Performance considerations
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> We found that downloading commits and trees on demand had a significant
> negative performance impact.  In addition, many git commands assume all
> commits and trees are available locally so they quickly got pulled down
> anyway.  Even in very large repos the commits and trees are relatively
> small so bringing them down with the initial commit and subsequent fetch
> commands was reasonable.
>
> After cloning, the developer can use sparse-checkout to limit the set of
> files to the subset they need (typically only 1-10% in these large
> repos).  This allows the initial checkout to only download the set of
> files actually needed to complete their task.  At any point, the
> sparse-checkout file can be updated to include additional files which
> will be fetched transparently on demand.
>
> Typical source files are relatively small so the overhead of connecting
> and authenticating to the server for a single file at a time is
> substantial.  As a result, having a long running process that is started
> with the first request and can cache connection information between
> requests is a significant performance win.

Junio and I talked years ago (offline, sorry no mailing list archive)
about "narrow checkout", which is the idea of the client being able to
ask for a pack file from the server that only includes objects along
specific path names. This would allow a client to amortize the setup
costs, and even delta compress source files against each other (e.g.
boilerplate across Makefiles or license headers).

If the paths of interest can be determined as a batch before starting
the connection, this may be easier than maintaining a cross platform
connection cache in a separate process.

> Now some numbers
> ~~~~~~~~~~~~~~~~
>
> One repo has 3+ million files at tip across 500K folders with 5-6K
> active developers.  They have done a lot of work to remove large files
> from the repo so it is down to < 100GB.
>
> Before changes: clone took hours to transfer the 87GB .pack + 119MB .idx
>
> After changes: clone took 4 minutes to transfer 305MB .pack + 37MB .idx
>
> After hydrating 35K files (the typical number any individual developer
> needs to do their work), there was an additional 460 MB of loose files
> downloaded.
>
> Total savings: 86.24 GB * 6000 developers = 517 Terabytes saved!
>
> We have another repo (3.1 M files, 618 GB at tip with no history with
> 3K+ active developers) where the savings are even greater.

This is quite impressive, and shows this strategy has a lot of promise.


> Future Work
> ~~~~~~~~~~~
>
> The current prototype calls a new hook proc in sha1_object_info_extended
> and read_object, to download each missing blob.  A better solution would
> be to implement this via a long running process that is spawned on the
> first download and listens for requests to download additional objects
> until it terminates when the parent git operation exits (similar to the
> recent long running smudge and clean filter work).

Or batching these up in advance. checkout should be able to determine
which path entries from the index it wants to write to the working
tree. Once it has that set of paths it wants to write, it should be
fast to construct a subset of paths for which the blobs are not
present locally, and then pass the entire group off for download.

> Need to do more investigation into possible code paths that can trigger
> unnecessary blobs to be downloaded.  For example, we have determined
> that the rename detection logic in status can also trigger unnecessary
> blobs to be downloaded making status slow.

There isn't much of a workaround here. Only options I can see are
disabling rename detection when objects are above a certain size, or
removing entries from the rename table when the blob isn't already
local, which may yield different results than if the blob(s) were
local.

Another is to try to have actual source files always be local, and
thus we only punt on rename detection for bigger files that are more
likely to be binary, and thus less likely to match for rename[1]
unless it was SHA-1 identity match, which can be done without the
blob(s) present.


[1] I assume most really big files are some sort of media asset (e.g.
JPEG), where a change inside the source data may result in large
difference in bytes due to the compression applied by the media file
format.

> Need to investigate an alternate batching scheme where we can make a
> single request for a set of "related" blobs and receive single a
> packfile (especially during checkout).

Heh, what I just said above. Glad to see you already thought of it.

> Need to investigate adding a new endpoint in the smart protocol that can
> download both individual blobs as well as a batch of blobs.

Agreed, I said as much above. Again, glad to see you have similar ideas. :)

^ permalink raw reply

* Re: [PATCH 0/5] extend git-describe pattern matching
From: Jacob Keller @ 2017-01-13 20:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jacob Keller, Git mailing list
In-Reply-To: <xmqqpojqhk3i.fsf@gitster.mtv.corp.google.com>

On Fri, Jan 13, 2017 at 10:48 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Jacob Keller <jacob.e.keller@intel.com> writes:
>
>> From: Jacob Keller <jacob.keller@gmail.com>
>>
>> Teach git describe and git name-rev the ability to match multiple
>> patterns inclusively. Additionally, teach these commands to also accept
>> negative patterns to discard any refs which match.
>
> You made quick responses to reviews with "will change", so I am not
> queuing this round to my tree; please don't mistake that as my
> indifference or opposition to the topic.  This sounds like a good
> thing.

Perfect. I will probably take a few days till I am back at a computer
and can do this, but I will be submitting with the suggested changes
soon.

>
> As to the semantics of mixing positives and negatives, I would
> recommend this to follow how positive and negative pathspecs mix.
> IIRC we chose to use the most simple and easy to explain option,
> i.e. a thing must match at least one of the positives and must not
> match any of the negatives to be considered a match.
>
>

That is the current implementation, so I will stick with it. It's the
simplest, and easiest to implement.

Thanks,
Jake

^ permalink raw reply

* Re: [PATCH 2/3] xdiff: -W: include immediately preceding non-empty lines in context
From: Vegard Nossum @ 2017-01-13 20:20 UTC (permalink / raw)
  To: Junio C Hamano, René Scharfe; +Cc: git
In-Reply-To: <xmqqeg06hh6z.fsf@gitster.mtv.corp.google.com>

On 13/01/2017 20:51, Junio C Hamano wrote:
> René Scharfe <l.s.r@web.de> writes:
>> That's true, but I'm not sure "non-empty line before function line" is
>> good enough a definition for desirable lines.  It wouldn't work for
>> people who don't believe in empty lines.  Or for those that put a
>> blank line between comment and function.  (I have an opinion on such
>> habits, but git diff should probably stay neutral.)  And that's just
>> for C code; I have no idea how this heuristic would hold up for other
>> file types like HTML.
>
> As you are, I am fairly negative on the heuristic based on the
> "non-blank" thing.  We tried once with compaction-heuristics already
> and it did not quite perform well.  Let's not hardcode another one.

The patch will work as intended and as expected for 95% of the users out
there (javadoc, Doxygen, kerneldoc, etc. all have the comment
immediately preceding the function) and fixes a very real problem for me
(and I expect many others) _today_; for the remaining 5% (who put a
blank line between their comment and the start of the function) it will
revert back to the current behaviour, so there should be no regression
for them.

For the 0% who don't put even a single blank line between their
functions, it will probably not work as expected, but then again I have
never seen such a coding style in any language, so I am doubtful if this
is something that needs to be taken into account in the first place.

>> We can identify function lines with arbitrary precision (with a
>> xfuncname regex, if needed), but there is no accurate way to classify
>> lines as comments, or as the end of functions.  Adding optional
>> regexes for single- and multi-line comments would help, at least for
>> C.
>
> The funcline regexp is used for two related but different purposes.
> It identifies a single line to be placed on @@ ... @@ line before a
> diff hunk.  This line however does not have to be at the beginning
> of a function.  It has to be the line that conveys the most
> significant information (e.g. the name of the function).
>
> The way "diff -W" codepath used it as if it were always the very
> first line of a function was bound to invite a patch like this, and
> if we want to be extra elaborate, I agree that an extra mechanism to
> say "the line the funcline regexp matches is not the beginning of a
> function, but the beginning is a line that matches this other regexp
> before that line" may help.
>
> Do we really want to be that elaborate, though?  I dunno.

Adding a regex instead of the simple "blank line" test doesn't seem very
difficult to do, but I am doubtful that it will make any difference in
practice. But that can be done incrementally as well by the people who
would actually need it (who I strongly suspect do not exist in the first
place).

> I wonder if it would be sufficient to make -W take an optional
> number, e.g. "git show -W4", to add extre context lines before the
> funcline.
>

I don't like specifying a fixed number, that negates almost all the
reason for using -W in the first place; I would much prefer adding
a config variable to control the -W behaviour (or a new diff flag).


Vegard

^ permalink raw reply

* Re: [PATCH] lib-submodule-update.sh: drop unneeded shell
From: Junio C Hamano @ 2017-01-13 20:19 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git@vger.kernel.org
In-Reply-To: <CAGZ79kayLoH7EURQ9aKGh+FzDz_BegJRjB2175qo53beLZDYog@mail.gmail.com>

Stefan Beller <sbeller@google.com> writes:

> On Fri, Jan 13, 2017 at 11:55 AM, Junio C Hamano <gitster@pobox.com> wrote:
>> Junio C Hamano <gitster@pobox.com> writes:
>>
>>> Stefan Beller <sbeller@google.com> writes:
>>>
>>>> In modern Git we prefer "git -C <cmd" over "(cd <somewhere && git <cmd>)"
>>>> as it doesn't need an extra shell.
>>>
>>> There is a matching '>' missing.  The description is correct (I am
>>> not sure if there actually is "preference", though), but I found the
>>> title a bit misleading....
>>
>> It turns out that there were two missing '>' ;-)  It tentatively has
>> become like this in my tree.
>
> Thanks for fixing up locally. I had resent as
> "[PATCH] lib-submodule-update.sh: reduce use of subshell by using git -C <dir>"
> but you can ignore that now.

Yeah, apparently our mails crossed.  Yours still have "git -C <cmd>"
that should have been "git -C <dir> <cmd>", so I'll keep the locally
munged one.

Thanks.



^ permalink raw reply

* Re: [PATCH] lib-submodule-update.sh: drop unneeded shell
From: Stefan Beller @ 2017-01-13 20:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git@vger.kernel.org
In-Reply-To: <xmqqa8auhgzt.fsf@gitster.mtv.corp.google.com>

On Fri, Jan 13, 2017 at 11:55 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
>> Stefan Beller <sbeller@google.com> writes:
>>
>>> In modern Git we prefer "git -C <cmd" over "(cd <somewhere && git <cmd>)"
>>> as it doesn't need an extra shell.
>>
>> There is a matching '>' missing.  The description is correct (I am
>> not sure if there actually is "preference", though), but I found the
>> title a bit misleading....
>
> It turns out that there were two missing '>' ;-)  It tentatively has
> become like this in my tree.

Thanks for fixing up locally. I had resent as
"[PATCH] lib-submodule-update.sh: reduce use of subshell by using git -C <dir>"
but you can ignore that now.

Thank,
Stefan

^ permalink raw reply

* Re: [PATCH 2/2] Use 'env' to find perl instead of fixed path
From: Junio C Hamano @ 2017-01-13 20:01 UTC (permalink / raw)
  To: Eric Wong; +Cc: Pat Pannuto, Johannes Schindelin, Johannes Sixt, git
In-Reply-To: <20170113185246.GA17441@starla>

Eric Wong <e@80x24.org> writes:

> Junio C Hamano <gitster@pobox.com> wrote:
>> Eric Wong <e@80x24.org> writes:
>> > Pat Pannuto <pat.pannuto@gmail.com> wrote:
>> >> You may still want the 1/2 patch in this series, just to make things
>> >> internally consistent with "-w" vs "use warnings;" inside git's perl
>> >> scripts.
>> >
>> > No, that is a step back.  "-w" affects the entire process, so it
>> > spots more potential problems.  The "warnings" pragma is scoped
>> > to the enclosing block, so it won't span across files.
>> 
>> OK, so with "-w", we do not have to write "use warnings" in each of
>> our files to get them checked.  It is handy when we ship our own
>> libs (e.g. Git.pm) that are used by our programs.
>
> Yes.  "use warnings" should be in our own libs in case other
> people run without "-w"

Would it mean that we need both anyway?  That is, add missing "use
warnings" without removing "-w" from she-bang line?

> Yes, "-w" will trigger warnings in third party packages.
> Existing uses we have should be fine, and I think most Perl
> modules we use or would use are vigilant about being
> warnings-clean.  If we have to leave off a "-w", there should
> probably be a comment at the top stating the reason:
>
> #!/usr/bin/perl
> # Not using "perl -w" since Foo::Bar <= X.Y.Y is not warnings-clean
> use strict;
> use warnings;
> use Foo::Bar;
> ...

Good.

Speaking of Perl, I recall that somebody complained that we ship
with and do use a stale copy of Error.pm that has been deprecated.
I am not asking you to do so, but we may want to see somebody look
into it (i.e. assessing the current situation, and if it indeed is
desirable for us to wean ourselves away from Error.pm, update our
codepaths that use it).

Thanks.

^ permalink raw reply

* Re: [PATCH 0/3] updates to gitk & git-gui doc now gitview has gone
From: Junio C Hamano @ 2017-01-13 20:05 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Philip Oakley, GitList
In-Reply-To: <alpine.DEB.2.20.1701131622510.3469@virtualbox>

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> Lets remove the reference in the gitk man page, and do some other
>> fixups while in the area.
>
> These changes all look sensible to me.

To me, too.  Thanks, both.

^ permalink raw reply

* Re: [PATCH] lib-submodule-update.sh: drop unneeded shell
From: Junio C Hamano @ 2017-01-13 19:55 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git
In-Reply-To: <xmqqtw92hkgc.fsf@gitster.mtv.corp.google.com>

Junio C Hamano <gitster@pobox.com> writes:

> Stefan Beller <sbeller@google.com> writes:
>
>> In modern Git we prefer "git -C <cmd" over "(cd <somewhere && git <cmd>)"
>> as it doesn't need an extra shell.
>
> There is a matching '>' missing.  The description is correct (I am
> not sure if there actually is "preference", though), but I found the
> title a bit misleading....

It turns out that there were two missing '>' ;-)  It tentatively has
become like this in my tree.

-- >8 --
From: Stefan Beller <sbeller@google.com>
Date: Wed, 11 Jan 2017 10:47:32 -0800
Subject: [PATCH] lib-submodule-update.sh: reduce use of subshell by using "git -C"

We write

    (cd <dir> && git <cmd>)

to avoid

    cd <dir> && git <cmd> && cd ..

that allows a breakage in one part of the test script to leave the
entire test process in an unexpected place.  We can do this more
concisely with "git -C <dir> <cmd>" with modern Git.

Signed-off-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>

diff --git a/t/lib-submodule-update.sh b/t/lib-submodule-update.sh
index 79cdd34a54..915eb4a7c6 100755
--- a/t/lib-submodule-update.sh
+++ b/t/lib-submodule-update.sh
@@ -69,10 +69,7 @@ create_lib_submodule_repo () {
 
 		git checkout -b "replace_sub1_with_directory" "add_sub1" &&
 		git submodule update &&
-		(
-			cd sub1 &&
-			git checkout modifications
-		) &&
+		git -C sub1 checkout modifications &&
 		git rm --cached sub1 &&
 		rm sub1/.git* &&
 		git config -f .gitmodules --remove-section "submodule.sub1" &&

^ permalink raw reply related

* Re: [PATCH 2/3] xdiff: -W: include immediately preceding non-empty lines in context
From: Junio C Hamano @ 2017-01-13 19:51 UTC (permalink / raw)
  To: René Scharfe; +Cc: Vegard Nossum, git
In-Reply-To: <e55dc4dd-768b-8c9b-e3b2-e850d5d521f5@web.de>

René Scharfe <l.s.r@web.de> writes:

> Am 13.01.2017 um 17:15 schrieb Vegard Nossum:
>> When using -W to include the whole function in the diff context, you
>> are typically doing this to be able to review the change in its entirety
>> within the context of the function. It is therefore almost always
>> desirable to include any comments that immediately precede the function.
>>
>> This also the fixes the case for C where the declaration is split across
>> multiple lines (where the first line of the declaration would not be
>> included in the output), e.g.:
>>
>> 	void
>> 	dummy(void)
>> 	{
>> 		...
>> 	}
>>
>
> That's true, but I'm not sure "non-empty line before function line" is
> good enough a definition for desirable lines.  It wouldn't work for
> people who don't believe in empty lines.  Or for those that put a
> blank line between comment and function.  (I have an opinion on such
> habits, but git diff should probably stay neutral.)  And that's just
> for C code; I have no idea how this heuristic would hold up for other
> file types like HTML.

As you are, I am fairly negative on the heuristic based on the
"non-blank" thing.  We tried once with compaction-heuristics already
and it did not quite perform well.  Let's not hardcode another one.

> We can identify function lines with arbitrary precision (with a
> xfuncname regex, if needed), but there is no accurate way to classify
> lines as comments, or as the end of functions.  Adding optional
> regexes for single- and multi-line comments would help, at least for
> C.

The funcline regexp is used for two related but different purposes.
It identifies a single line to be placed on @@ ... @@ line before a
diff hunk.  This line however does not have to be at the beginning
of a function.  It has to be the line that conveys the most
significant information (e.g. the name of the function).

The way "diff -W" codepath used it as if it were always the very
first line of a function was bound to invite a patch like this, and
if we want to be extra elaborate, I agree that an extra mechanism to
say "the line the funcline regexp matches is not the beginning of a
function, but the beginning is a line that matches this other regexp
before that line" may help.

Do we really want to be that elaborate, though?  I dunno.

I wonder if it would be sufficient to make -W take an optional
number, e.g. "git show -W4", to add extre context lines before the
funcline.

^ permalink raw reply

* [PATCH] submodule update: run custom update script for initial populating as well
From: Stefan Beller @ 2017-01-13 19:43 UTC (permalink / raw)
  To: bmwill, gitster, judge.packham, olsonse; +Cc: git, Stefan Beller

In 1b4735d9f3 (submodule: no [--merge|--rebase] when newly cloned,
2011-02-17), all actions were defaulted to checkout for populating
a submodule initially, because merging or rebasing makes no sense
in that situation.

Other commands however do make sense, such as the custom command
that was added later (6cb5728c43, submodule update: allow custom
command to update submodule working tree, 2013-07-03).

I am unsure about the "none" command, as I can see an initial
checkout there as a useful thing. On the other hand going strictly
by our own documentation, we should do nothing in case of "none"
as well, because the user asked for it.

Reported-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Stefan Beller <sbeller@google.com>
---
 git-submodule.sh            |  7 ++++++-
 t/t7406-submodule-update.sh | 15 +++++++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/git-submodule.sh b/git-submodule.sh
index 554bd1c494..aeb721ab7e 100755
--- a/git-submodule.sh
+++ b/git-submodule.sh
@@ -606,7 +606,12 @@ cmd_update()
 		if test $just_cloned -eq 1
 		then
 			subsha1=
-			update_module=checkout
+			if test "$update_module" = "merge" ||
+			   test "$update_module" = "rebase" ||
+			   test "$update_module" = "none"
+			then
+				update_module=checkout
+			fi
 		else
 			subsha1=$(sanitize_submodule_env; cd "$sm_path" &&
 				git rev-parse --verify HEAD) ||
diff --git a/t/t7406-submodule-update.sh b/t/t7406-submodule-update.sh
index 64f322c4cc..1407fa6098 100755
--- a/t/t7406-submodule-update.sh
+++ b/t/t7406-submodule-update.sh
@@ -424,6 +424,19 @@ test_expect_success 'submodule update - command in .git/config catches failure -
 	test_i18ncmp actual expect
 '
 
+test_expect_success 'submodule update - command run for initial population of submodule' '
+	cat <<-\ EOF >expect
+	Execution of '\''false $submodulesha1'\'' failed in submodule path '\''submodule'\''
+	EOF
+	(
+		cd super &&
+		rm -rf submodule
+		test_must_fail git submodule update >../actual
+	)
+	test_cmp expect actual
+	git -C super submodule update --checkout
+'
+
 cat << EOF >expect
 Execution of 'false $submodulesha1' failed in submodule path '../super/submodule'
 Failed to recurse into submodule path '../super'
@@ -476,6 +489,7 @@ test_expect_success 'submodule init picks up merge' '
 '
 
 test_expect_success 'submodule update --merge  - ignores --merge  for new submodules' '
+	test_config -C super submodule.submodule.update checkout &&
 	(cd super &&
 	 rm -rf submodule &&
 	 git submodule update submodule &&
@@ -488,6 +502,7 @@ test_expect_success 'submodule update --merge  - ignores --merge  for new submod
 '
 
 test_expect_success 'submodule update --rebase - ignores --rebase for new submodules' '
+	test_config -C super submodule.submodule.update checkout &&
 	(cd super &&
 	 rm -rf submodule &&
 	 git submodule update submodule &&
-- 
2.11.0.300.g08194d1431.dirty


^ permalink raw reply related

* Re: [PATCH] Documentation/bisect: improve on (bad|new) and (good|bad)
From: Junio C Hamano @ 2017-01-13 19:14 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, Manuel Ullmann, Matthieu Moy, Christian Couder
In-Reply-To: <20170113144405.3963-1-chriscool@tuxfamily.org>

Christian Couder <christian.couder@gmail.com> writes:

> The following part of the description:
>
> git bisect (bad|new) [<rev>]
> git bisect (good|old) [<rev>...]
>
> may be a bit confusing, as a reader may wonder if instead it should be:
>
> git bisect (bad|good) [<rev>]
> git bisect (old|new) [<rev>...]
>
> Of course the difference between "[<rev>]" and "[<rev>...]" should hint
> that there is a good reason for the way it is.
>
> But we can further clarify and complete the description by adding
> "<term-new>" and "<term-old>" to the "bad|new" and "good|old"
> alternatives.
>
> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
>  Documentation/git-bisect.txt | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Thanks.  The patch looks good.

A related tangent.  

Last night, I was trying to think if there is a fundamental reason
why "bad/new/term-new" cannot take more than one <rev>s on the newer
side of the bisection, and couldn't quite think of any before
falling asleep.

Currently we keep track of a single bisect/bad, while marking all the
revs given as good previously as bisect/good-<SHA-1>.

Because the next "bad" is typically chosen from the region of the
commit DAG that is bounded by bad and good commits, i.e. "rev-list
bisect/bad --not bisect/good-*", the current bisect/bad will always
be an ancestor of all bad commits that used to be bisect/bad, and
keeping previous bisect/bad as bisect/bad-<SHA-1> won't change the
region of the commit DAG yet to be explored.

As a reason why we need to use only a single bisect/bad, the above
description is understandable.  But as a reason why we cannot have
more than one, it is tautological.  It merely says "if we start from
only one and dig history to find older culprit, we need only one
bad".

I fell asleep last night without thinking further than that.

I think the answer to the question "why do we think we need a single
bisect/bad?" is "because bisection is about assuming that there is
only one commit that flips the tree state from 'old' to 'new' and
finding that single commit".  That would mean that even if we had
bisect/bad-A and bisect/bad-B, e.g.

                          o---o---o---bad-A
                         /
    -----Good---o---o---o
                         \
                          o---o---o---bad-B


where 'o' are all commits whose goodness is not yet known, because
bisection is valid only when we are hunting for a single commit that
flips the state from good to bad, that commit MUST be at or before
the merge base of bad-A and bad-B.  So even if we allowed

	$ git bisect bad bad-A bad-B

on the command line, we won't have to set bisect/bad-A and
bisect/bad-B.  We only need a single bisect/bad that points at the
merge base of these two.

But what if bad-A and bad-B have more than one merge bases?  We
won't know which side the badness came from.

                          o---o---o---bad-A
                         /     \ / 
    -----Good---o---o---o       / 
                         \     / \
                          o---o---o---bad-B

Being able to bisect the region of DAG bound by "^Good bad-A bad-B"
may have value in such a case.  I dunno.


^ permalink raw reply

* [PATCH] lib-submodule-update.sh: reduce use of subshell by using git -C <dir>
From: Stefan Beller @ 2017-01-13 19:03 UTC (permalink / raw)
  To: gitster; +Cc: git, Stefan Beller
In-Reply-To: <xmqqtw92hkgc.fsf@gitster.mtv.corp.google.com>

In modern Git we prefer
    "git -C <cmd>"
over
    "(cd <somewhere && git <cmd>)"
as it doesn't need an extra shell.

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 t/lib-submodule-update.sh | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/t/lib-submodule-update.sh b/t/lib-submodule-update.sh
index 79cdd34a54..915eb4a7c6 100755
--- a/t/lib-submodule-update.sh
+++ b/t/lib-submodule-update.sh
@@ -69,10 +69,7 @@ create_lib_submodule_repo () {
 
 		git checkout -b "replace_sub1_with_directory" "add_sub1" &&
 		git submodule update &&
-		(
-			cd sub1 &&
-			git checkout modifications
-		) &&
+		git -C sub1 checkout modifications &&
 		git rm --cached sub1 &&
 		rm sub1/.git* &&
 		git config -f .gitmodules --remove-section "submodule.sub1" &&
-- 
2.11.0.300.g08194d1431


^ permalink raw reply related

* Re: [PATCH 2/2] Use 'env' to find perl instead of fixed path
From: Eric Wong @ 2017-01-13 18:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Pat Pannuto, Johannes Schindelin, Johannes Sixt, git
In-Reply-To: <xmqq4m12izmd.fsf@gitster.mtv.corp.google.com>

Junio C Hamano <gitster@pobox.com> wrote:
> Eric Wong <e@80x24.org> writes:
> > Pat Pannuto <pat.pannuto@gmail.com> wrote:
> >> You may still want the 1/2 patch in this series, just to make things
> >> internally consistent with "-w" vs "use warnings;" inside git's perl
> >> scripts.
> >
> > No, that is a step back.  "-w" affects the entire process, so it
> > spots more potential problems.  The "warnings" pragma is scoped
> > to the enclosing block, so it won't span across files.
> 
> OK, so with "-w", we do not have to write "use warnings" in each of
> our files to get them checked.  It is handy when we ship our own
> libs (e.g. Git.pm) that are used by our programs.

Yes.  "use warnings" should be in our own libs in case other
people run without "-w"

> If something we write outselves trigger a false-positive, we can
> work it around with "no warnings;" in the smallest block that
> encloses the offending code in the worst case, or just rephrase it
> in a way that won't trigger a false-positive.

Correct.

> If something we _use_ from a third-party is not warnings-clean,
> there is no easy way to squelch them if we use "-w", which is a
> potential downside, isn't it?  I do not know how serious a problem
> it is in practice.  I suspect that the core package we use from perl
> distribution are supposed to be warnings-clean, but we use a handful
> of things from outside the core and I do not know what state they
> are in.

Yes, "-w" will trigger warnings in third party packages.
Existing uses we have should be fine, and I think most Perl
modules we use or would use are vigilant about being
warnings-clean.  If we have to leave off a "-w", there should
probably be a comment at the top stating the reason:

#!/usr/bin/perl
# Not using "perl -w" since Foo::Bar <= X.Y.Y is not warnings-clean
use strict;
use warnings;
use Foo::Bar;
...

^ permalink raw reply

* Re: [PATCH 0/5] extend git-describe pattern matching
From: Junio C Hamano @ 2017-01-13 18:48 UTC (permalink / raw)
  To: Jacob Keller; +Cc: git, Jacob Keller
In-Reply-To: <20170112001721.2534-1-jacob.e.keller@intel.com>

Jacob Keller <jacob.e.keller@intel.com> writes:

> From: Jacob Keller <jacob.keller@gmail.com>
>
> Teach git describe and git name-rev the ability to match multiple
> patterns inclusively. Additionally, teach these commands to also accept
> negative patterns to discard any refs which match.

You made quick responses to reviews with "will change", so I am not
queuing this round to my tree; please don't mistake that as my
indifference or opposition to the topic.  This sounds like a good
thing.

As to the semantics of mixing positives and negatives, I would
recommend this to follow how positive and negative pathspecs mix.
IIRC we chose to use the most simple and easy to explain option,
i.e. a thing must match at least one of the positives and must not
match any of the negatives to be considered a match.



^ permalink raw reply

* Re: [PATCH 2/3] xdiff: -W: include immediately preceding non-empty lines in context
From: Stefan Beller @ 2017-01-13 18:44 UTC (permalink / raw)
  To: René Scharfe, Michael Haggerty
  Cc: Vegard Nossum, Junio C Hamano, git@vger.kernel.org
In-Reply-To: <e55dc4dd-768b-8c9b-e3b2-e850d5d521f5@web.de>

On Fri, Jan 13, 2017 at 10:19 AM, René Scharfe <l.s.r@web.de> wrote:
> Am 13.01.2017 um 17:15 schrieb Vegard Nossum:
>>
>> When using -W to include the whole function in the diff context, you
>> are typically doing this to be able to review the change in its entirety
>> within the context of the function. It is therefore almost always
>> desirable to include any comments that immediately precede the function.

Do we need a small comment in the actual code to hint at why we count
upwards there?

>>
>> This also the fixes the case for C where the declaration is split across
>> multiple lines (where the first line of the declaration would not be
>> included in the output), e.g.:
>>
>>         void
>>         dummy(void)
>>         {
>>                 ...
>>         }
>>
>
> That's true, but I'm not sure "non-empty line before function line" is good
> enough a definition for desirable lines.  It wouldn't work for people who
> don't believe in empty lines.  Or for those that put a blank line between
> comment and function.  (I have an opinion on such habits, but git diff
> should probably stay neutral.)  And that's just for C code; I have no idea
> how this heuristic would hold up for other file types like HTML.

I think empty lines are "good as a first approach", see e.g.
433860f3d0beb0c6 the "compaction" heuristic for a similar
thing (the compaction was introduced at d634d61ed), and then
we can build a more elaborate thing on top.

>
> We can identify function lines with arbitrary precision (with a xfuncname
> regex, if needed), but there is no accurate way to classify lines as
> comments, or as the end of functions.  Adding optional regexes for single-
> and multi-line comments would help, at least for C.

That would cover Java and whole lot of other C like languages. So a good
start as well IMHO.

>
> René

^ permalink raw reply

* Re: [PATCH] lib-submodule-update.sh: drop unneeded shell
From: Junio C Hamano @ 2017-01-13 18:40 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git
In-Reply-To: <20170111184732.26416-1-sbeller@google.com>

Stefan Beller <sbeller@google.com> writes:

> In modern Git we prefer "git -C <cmd" over "(cd <somewhere && git <cmd>)"
> as it doesn't need an extra shell.

There is a matching '>' missing.  The description is correct (I am
not sure if there actually is "preference", though), but I found the
title a bit misleading.  With 'drop unneeded subshell', I would
imagine a patch like

		(
	-		( do something )
	+		do something;
			do something else;
		)

It looks more like "reduce use of subshell by using 'git -C <dir>'"
to me.

> And because it is in a setup function, we actually save the invocation
> of 22 shells for a single run of the whole test suite.

Nice.

>
> Noticed while adding a lot more in near vincinity, though not as near
> to cause merge conflicts, so sending it extra.
>
> Thanks,
> Stefan
>
>  t/lib-submodule-update.sh | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/t/lib-submodule-update.sh b/t/lib-submodule-update.sh
> index 79cdd34a54..915eb4a7c6 100755
> --- a/t/lib-submodule-update.sh
> +++ b/t/lib-submodule-update.sh
> @@ -69,10 +69,7 @@ create_lib_submodule_repo () {
>  
>  		git checkout -b "replace_sub1_with_directory" "add_sub1" &&
>  		git submodule update &&
> -		(
> -			cd sub1 &&
> -			git checkout modifications
> -		) &&
> +		git -C sub1 checkout modifications &&
>  		git rm --cached sub1 &&
>  		rm sub1/.git* &&
>  		git config -f .gitmodules --remove-section "submodule.sub1" &&

^ permalink raw reply

* Re: [PATCHv3 2/2] builtin/commit.c: switch to xstrfmt(), instead of snprintf,
From: Brandon Williams @ 2017-01-13 18:33 UTC (permalink / raw)
  To: Elia Pinto; +Cc: git
In-Reply-To: <20170113175801.39468-2-gitter.spiros@gmail.com>

On 01/13, Elia Pinto wrote:
> In this patch, instead of using xnprintf instead of snprintf, which asserts
> that we don't truncate, we are switching to dynamic allocation with  xstrfmt(),
> , so we can avoid dealing with magic numbers in the code and reduce the cognitive burden from
> the programmers, because they no longer have to count bytes needed for static allocation.
> As a side effect of this patch we have also reduced the snprintf() calls, that may silently truncate 
> results if the programmer is not careful.
> 
> Helped-by: Junio C Hamano <gitster@pobox.com>
> Helped-by: Jeff King <peff@peff.net> 
> Signed-off-by: Elia Pinto <gitter.spiros@gmail.com>

Small nit's with the commit message:
* Stray comma ',' of on its own
* lines are longer than 80 characters

> ---
> This is the third  version of the patch.
> 
> Changes from the first version: I have split the original commit in two, as discussed here
> http://public-inbox.org/git/20161213132717.42965-1-gitter.spiros@gmail.com/.
> 
> Changes from the second version:
> - Changed the commit message to clarify the purpose of the patch (
> as suggested by Junio)
> https://public-inbox.org/git/xmqqtw95mfo3.fsf@gitster.mtv.corp.google.com/T/#m2e6405a8a78a8ca1ed770614c91398290574c4a1
> 
> 
> 
>  builtin/commit.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/builtin/commit.c b/builtin/commit.c
> index 09bcc0f13..37228330c 100644
> --- a/builtin/commit.c
> +++ b/builtin/commit.c
> @@ -1526,12 +1526,10 @@ static int git_commit_config(const char *k, const char *v, void *cb)
>  static int run_rewrite_hook(const unsigned char *oldsha1,
>  			    const unsigned char *newsha1)
>  {
> -	/* oldsha1 SP newsha1 LF NUL */
> -	static char buf[2*40 + 3];
> +	char *buf;
>  	struct child_process proc = CHILD_PROCESS_INIT;
>  	const char *argv[3];
>  	int code;
> -	size_t n;
>  
>  	argv[0] = find_hook("post-rewrite");
>  	if (!argv[0])
> @@ -1547,11 +1545,11 @@ static int run_rewrite_hook(const unsigned char *oldsha1,
>  	code = start_command(&proc);
>  	if (code)
>  		return code;
> -	n = snprintf(buf, sizeof(buf), "%s %s\n",
> -		     sha1_to_hex(oldsha1), sha1_to_hex(newsha1));
> +	buf = xstrfmt("%s %s\n", sha1_to_hex(oldsha1), sha1_to_hex(newsha1));
>  	sigchain_push(SIGPIPE, SIG_IGN);
> -	write_in_full(proc.in, buf, n);
> +	write_in_full(proc.in, buf, strlen(buf));
>  	close(proc.in);
> +	free(buf);
>  	sigchain_pop(SIGPIPE);
>  	return finish_command(&proc);
>  }
> -- 
> 2.11.0.154.g5f5f154
> 

-- 
Brandon Williams

^ permalink raw reply

* Re: [PATCH 1/2] asciidoctor: fix user-manual to be built by `asciidoctor`
From: Junio C Hamano @ 2017-01-13 18:31 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jeff King, git, 마누엘
In-Reply-To: <xmqq37gokrdj.fsf@gitster.mtv.corp.google.com>

Junio C Hamano <gitster@pobox.com> writes:

> ...  Perhaps 1/2 posted as-is
> is the best we could do within that constraint.
>
> Let's queue it on 'pu' and see if somebody else comes up with an
> update that is more visually pleasing with both backends.

... and I noticed that I didn't get to actually queuing this
yesterday, even though I sent the message above.

Now I have, and the patch will be part of tonight's pushout.

Thanks.

^ permalink raw reply

* submodule network operations [WAS: Re: [RFC/PATCH 0/4] working tree operations: support superprefix]
From: Stefan Beller @ 2017-01-13 18:30 UTC (permalink / raw)
  To: Brian J. Davis; +Cc: Brandon Williams, git@vger.kernel.org, David Turner

This question is about networking; the patch you originally replied to
was strictly about local operations in the filesystem, which is quite
a difference, so let's discuss it separately.

On Fri, Jan 13, 2017 at 9:56 AM, Brian J. Davis <bitminer@gmail.com> wrote:
>
> In response to a post at:
>
> https://groups.google.com/forum/#!topic/git-users/BVLcKHhSUKo
>
> I was asked to post here to explain a potential problem with current modules
> implementation.  Apologies if I am posting in the wrong place... so good bad
> or otherwise here goes:
>
> +-------------------------------
> With:
>
> git push origin branchname
>
> or
>
> git push server2 branchname
>
> I can push or pull from multiple servers so no problem as git supports this.
>
> Now the problem with use of submodules
>
> submodules are listed:
>
> submodule.directory_to_
> checkout/proj1_dir.url=https://git.someserver1/git/proj1_dir
> <https://git.someserver1/git/proj1_dir>

Technically it is submodule.<name>.url instead of
submodule.<path>.url. The name is usually the path initially, and once
you move the submodule, only the path changes, the name is supposed to
be constant and stay the same.

>
>
> but if say I want to pull from some server 2 and do a
>
> git submodule update --init --recursive

That is why the "git submodule init" exists at all.

    git submodule init
    $EDIT .git/config # change submodule.<name>.url to server2
    git submodule update # that uses the adapted url and
    # creates the submodule repository.

    # From now on the submodule is on its own.
    cd <submodule> && git config --list
    # prints an "origin" remote and a lot more

For a better explanation, I started a documentation series, see
https://github.com/gitster/git/commit/e2b51b9df618ceeff7c4ec044e20f5ce9a87241e

(It is not finished, but that is supposed to explain this exact pain
point in the
STATES section, feel free to point out missing things or what is hard
to understand)

>
> what I would get is from someserver1 not someserver2 as there is no "origin"
> support for submodules.  Is this correct?  If so can origin support be added
> to submodules?

Can you explain more in detail what you mean by origin support?

> +-------------------------------
>
> So above was post from google group.  Maybe above is enough to explain the
> problem that can result, but if not here is a discussion as to why this can
> be confusing resulting to pushing or pulling from the incorrect server.
>
> Lets say projects starts at origin on https://server0.org. Project is then
> brought in house to server https://indhouse.corp by developer A.
>
> Developer A then changes the submodule projects to point to indhouse and
> changes submodules in super project to point to indhouse so everything is
> great.
>
> Then Developer B clones from indhouse makes changes to submodule1 and
> submodule1 and pushes them to indhouse.  Dev A tells Dev B hey thoes changes
> are great why don't you push to public server0 so every one can get thoes
> changes.  Dev B then git push origin branch_name on submodule1 and push
> origin on submodule 2 and superproject.  And everything is great ... right?
>
> Yes by now those who know git know what dev B  forgot or more to the point
> what git does not support in a "clean" way.  For those who don't enter the
> life of dev C
>
> So dev C clones from server0 and performs a git submodule update --init
> --recursive.  Now Dev C is on the www and can't see indhouse.corp and ...
> kerpow... Dev B and Dev C just experienced one of the many SW mines on the
> battlefield.
>
> I ask that git devs first see if I am correct with this assessment as I
> could be wrong... maybe there is a way of doing this... and if not add a
> feature to git to handle submodules and multiple origins cleanly.... and yes
> beating dev B with rubber chicken might be a solution though I am looking
> for a slightly better option.

Yes this is a big point that we want to solve eventually.

When devA brought it inhouse, what they meant to do was this:
"This superproject is actually from server0, but we want to work on it, which
may have submodules diverge from server0 eventually. So if a submodule changed
you need to get it from the inhouse server, otherwise fall back to the server0".

That way developerB can just make changes to some submodules and when
devC clones
they get the "correct" submodule.

A weak attempt to do this is to use *relative* submodule urls. When
using relative urls, and then mirroring the supeproject inhouse, then
Git will look for the submodules as well inhouse, but there
is no such "or if not found look at the original superprojects
origin", which means, you have to mirror all submodules.

And then about upstreaming changes. If you have a single repo (no
submodules), you have to teach people to run "git remote add remote
server0.org && git push upstream ...", but that you can do for each
submodule. (This is tedious:/ but maybe ok; some submodules are free
to sent things upstream whereas others are supersecret that you do not
want to push upstream ever.)

So yeah maybe we want to have more power in the superprojects push operation

    (in the superproject) $ git push --recurse-submodules \
       --only-these-submodules=subA,subB \
       --submodule-to=upstream-as-configered-in-super-project

This is a lot of words but for explaining that is ok?

Thanks,
Stefan

^ permalink raw reply

* Re: [PATCH 2/2] Use 'env' to find perl instead of fixed path
From: Junio C Hamano @ 2017-01-13 18:27 UTC (permalink / raw)
  To: Eric Wong; +Cc: Pat Pannuto, Johannes Schindelin, Johannes Sixt, git
In-Reply-To: <20170113024842.GA20572@starla>

Eric Wong <e@80x24.org> writes:

> Pat Pannuto <pat.pannuto@gmail.com> wrote:
>> You may still want the 1/2 patch in this series, just to make things
>> internally consistent with "-w" vs "use warnings;" inside git's perl
>> scripts.
>
> No, that is a step back.  "-w" affects the entire process, so it
> spots more potential problems.  The "warnings" pragma is scoped
> to the enclosing block, so it won't span across files.

OK, so with "-w", we do not have to write "use warnings" in each of
our files to get them checked.  It is handy when we ship our own
libs (e.g. Git.pm) that are used by our programs.

If something we write outselves trigger a false-positive, we can
work it around with "no warnings;" in the smallest block that
encloses the offending code in the worst case, or just rephrase it
in a way that won't trigger a false-positive.

If something we _use_ from a third-party is not warnings-clean,
there is no easy way to squelch them if we use "-w", which is a
potential downside, isn't it?  I do not know how serious a problem
it is in practice.  I suspect that the core package we use from perl
distribution are supposed to be warnings-clean, but we use a handful
of things from outside the core and I do not know what state they
are in.

^ permalink raw reply

* Re: [PATCH 2/3] xdiff: -W: include immediately preceding non-empty lines in context
From: René Scharfe @ 2017-01-13 18:19 UTC (permalink / raw)
  To: Vegard Nossum, Junio C Hamano, git
In-Reply-To: <1484324112-17773-2-git-send-email-vegard.nossum@oracle.com>

Am 13.01.2017 um 17:15 schrieb Vegard Nossum:
> When using -W to include the whole function in the diff context, you
> are typically doing this to be able to review the change in its entirety
> within the context of the function. It is therefore almost always
> desirable to include any comments that immediately precede the function.
>
> This also the fixes the case for C where the declaration is split across
> multiple lines (where the first line of the declaration would not be
> included in the output), e.g.:
>
> 	void
> 	dummy(void)
> 	{
> 		...
> 	}
>

That's true, but I'm not sure "non-empty line before function line" is 
good enough a definition for desirable lines.  It wouldn't work for 
people who don't believe in empty lines.  Or for those that put a blank 
line between comment and function.  (I have an opinion on such habits, 
but git diff should probably stay neutral.)  And that's just for C code; 
I have no idea how this heuristic would hold up for other file types 
like HTML.

We can identify function lines with arbitrary precision (with a 
xfuncname regex, if needed), but there is no accurate way to classify 
lines as comments, or as the end of functions.  Adding optional regexes 
for single- and multi-line comments would help, at least for C.

René

^ permalink raw reply

* [PATCH 6/6] fsck: detect trailing garbage in all object types
From: Jeff King @ 2017-01-13 18:00 UTC (permalink / raw)
  To: John Szakmeister; +Cc: Dennis Kaarsemaker, git
In-Reply-To: <20170113175258.e66taigy4wpokohk@sigill.intra.peff.net>

When a loose tree or commit is read by fsck (or any git
program), unpack_sha1_rest() checks whether there is extra
cruft at the end of the object file, after the zlib data.
Blobs that are streamed, however, do not have this check.

For normal git operations, it's not a big deal. We know the
sha1 and size checked out, so we have the object bytes we
wanted.  The trailing garbage doesn't affect what we're
trying to do.

But since the point of fsck is to find corruption or other
problems, it should be more thorough. This patch teaches its
loose-sha1 reader to detect extra bytes after the zlib
stream and complain.

Signed-off-by: Jeff King <peff@peff.net>
---
 sha1_file.c     |  5 +++++
 t/t1450-fsck.sh | 22 ++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/sha1_file.c b/sha1_file.c
index c0fccb73c..b77ab6d5c 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3864,6 +3864,11 @@ static int check_stream_sha1(git_zstream *stream,
 		error("corrupt loose object '%s'", sha1_to_hex(expected_sha1));
 		return -1;
 	}
+	if (stream->avail_in) {
+		error("garbage at end of loose object '%s'",
+		      sha1_to_hex(expected_sha1));
+		return -1;
+	}
 
 	git_SHA1_Final(real_sha1, &c);
 	if (hashcmp(expected_sha1, real_sha1)) {
diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index 455c186fe..8975b4d1b 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -597,4 +597,26 @@ test_expect_success 'fsck finds problems in duplicate loose objects' '
 	)
 '
 
+test_expect_success 'fsck detects trailing loose garbage (commit)' '
+	git cat-file commit HEAD >basis &&
+	echo bump-commit-sha1 >>basis &&
+	commit=$(git hash-object -w -t commit basis) &&
+	file=$(sha1_file $commit) &&
+	test_when_finished "remove_object $commit" &&
+	chmod +w "$file" &&
+	echo garbage >>"$file" &&
+	test_must_fail git fsck 2>out &&
+	test_i18ngrep "garbage.*$commit" out
+'
+
+test_expect_success 'fsck detects trailing loose garbage (blob)' '
+	blob=$(echo trailing | git hash-object -w --stdin) &&
+	file=$(sha1_file $blob) &&
+	test_when_finished "remove_object $blob" &&
+	chmod +w "$file" &&
+	echo garbage >>"$file" &&
+	test_must_fail git fsck 2>out &&
+	test_i18ngrep "garbage.*$blob" out
+'
+
 test_done
-- 
2.11.0.629.g10075098c

^ permalink raw reply related

* [PATCH 5/6] fsck: parse loose object paths directly
From: Jeff King @ 2017-01-13 17:59 UTC (permalink / raw)
  To: John Szakmeister; +Cc: Dennis Kaarsemaker, git
In-Reply-To: <20170113175258.e66taigy4wpokohk@sigill.intra.peff.net>

When we iterate over the list of loose objects to check, we
get the actual path of each object. But we then throw it
away and pass just the sha1 to fsck_sha1(), which will do a
fresh lookup. Usually it would find the same object, but it
may not if an object exists both as a loose and a packed
object. We may end up checking the packed object twice, and
never look at the loose one.

In practice this isn't too terrible, because if fsck doesn't
complain, it means you have at least one good copy. But
since the point of fsck is to look for corruption, we should
be thorough.

The new read_loose_object() interface can help us get the
data from disk, and then we replace parse_object() with
parse_object_buffer(). As a bonus, our error messages now
mention the path to a corrupted object, which should make it
easier to track down errors when they do happen.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fsck.c  | 46 +++++++++++++++++++++++++++++++++-------------
 t/t1450-fsck.sh | 16 ++++++++++++++++
 2 files changed, 49 insertions(+), 13 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index f01b81eeb..4b91ee95e 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -362,18 +362,6 @@ static int fsck_obj(struct object *obj)
 	return 0;
 }
 
-static int fsck_sha1(const unsigned char *sha1)
-{
-	struct object *obj = parse_object(sha1);
-	if (!obj) {
-		errors_found |= ERROR_OBJECT;
-		return error("%s: object corrupt or missing",
-			     sha1_to_hex(sha1));
-	}
-	obj->flags |= HAS_OBJ;
-	return fsck_obj(obj);
-}
-
 static int fsck_obj_buffer(const unsigned char *sha1, enum object_type type,
 			   unsigned long size, void *buffer, int *eaten)
 {
@@ -488,9 +476,41 @@ static void get_default_heads(void)
 	}
 }
 
+static struct object *parse_loose_object(const unsigned char *sha1,
+					 const char *path)
+{
+	struct object *obj;
+	void *contents;
+	enum object_type type;
+	unsigned long size;
+	int eaten;
+
+	if (read_loose_object(path, sha1, &type, &size, &contents) < 0)
+		return NULL;
+
+	if (!contents && type != OBJ_BLOB)
+		die("BUG: read_loose_object streamed a non-blob");
+
+	obj = parse_object_buffer(sha1, type, size, contents, &eaten);
+
+	if (!eaten)
+		free(contents);
+	return obj;
+}
+
 static int fsck_loose(const unsigned char *sha1, const char *path, void *data)
 {
-	if (fsck_sha1(sha1))
+	struct object *obj = parse_loose_object(sha1, path);
+
+	if (!obj) {
+		errors_found |= ERROR_OBJECT;
+		error("%s: object corrupt or missing: %s",
+		      sha1_to_hex(sha1), path);
+		return 0; /* keep checking other objects */
+	}
+
+	obj->flags = HAS_OBJ;
+	if (fsck_obj(obj))
 		errors_found |= ERROR_OBJECT;
 	return 0;
 }
diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index c39d42120..455c186fe 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -581,4 +581,20 @@ test_expect_success 'fsck errors in packed objects' '
 	! grep corrupt out
 '
 
+test_expect_success 'fsck finds problems in duplicate loose objects' '
+	rm -rf broken-duplicate &&
+	git init broken-duplicate &&
+	(
+		cd broken-duplicate &&
+		test_commit duplicate &&
+		# no "-d" here, so we end up with duplicates
+		git repack &&
+		# now corrupt the loose copy
+		file=$(sha1_file "$(git rev-parse HEAD)") &&
+		rm "$file" &&
+		echo broken >"$file" &&
+		test_must_fail git fsck
+	)
+'
+
 test_done
-- 
2.11.0.629.g10075098c


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox