Git development
 help / color / mirror / Atom feed
* [PATCH 1/2] real_path: prevent redefinition of MAXSYMLINKS
From: Brandon Williams @ 2017-01-09 18:50 UTC (permalink / raw)
  To: git; +Cc: Brandon Williams, peff, gitster, larsxschneider
In-Reply-To: <xmqqzij0t7uh.fsf@gitster.mtv.corp.google.com>

The macro 'MAXSYMLINKS' is already defined on macOS and Linux in
<sys/param.h>.  If 'MAXSYMLINKS' has already been defined, use the value
defined by the OS otherwise default to a value of 32 which is more
inline with what is allowed by many systems.

Signed-off-by: Brandon Williams <bmwill@google.com>
---
 abspath.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/abspath.c b/abspath.c
index 1d56f5ed9..0393213e5 100644
--- a/abspath.c
+++ b/abspath.c
@@ -62,7 +62,9 @@ static void get_root_part(struct strbuf *resolved, struct strbuf *remaining)
 }
 
 /* We allow "recursive" symbolic links. Only within reason, though. */
-#define MAXSYMLINKS 5
+#ifndef MAXSYMLINKS
+#define MAXSYMLINKS 32
+#endif
 
 /*
  * Return the real path (i.e., absolute path, with symlinks resolved
-- 
2.11.0.390.gc69c2f50cf-goog


^ permalink raw reply related

* Re: [PATCH v3 00/13] fix mergetool+rerere+subdir regression
From: Stefan Beller @ 2017-01-09 18:49 UTC (permalink / raw)
  To: Richard Hansen
  Cc: git@vger.kernel.org, David Aguilar, Johannes Sixt, Simon Ruderich
In-Reply-To: <20170109054238.42599-1-hansenr@google.com>

On Sun, Jan 8, 2017 at 9:42 PM, Richard Hansen <hansenr@google.com> wrote:
> If rerere is enabled, no pathnames are given, and mergetool is run
> from a subdirectory, mergetool always prints "No files need merging".
> Fix the bug.
>
> This regression was introduced in
> 57937f70a09c12ef484c290865dac4066d207c9c (v2.11.0).
>
> Changes since v2:
>   * Added entries to .mailmap.
>   * Patch 2/4 was split into multiple separate commits.
>   * Moved '-O' out of $orderfile.
>   * $orderfile is now adjusted after running cd_to_toplevel.
>   * Added lots of comments.
>   * Other minor tweaks to address review feedback.
>

The changes to tests look all good to me.

Thanks,
Stefan

^ permalink raw reply

* Re: [PATCH v5 0/5] road to reentrant real_path
From: Brandon Williams @ 2017-01-09 18:24 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, sbeller, peff, jacob.keller, ramsay, tboegi, j6t, pclouds,
	larsxschneider
In-Reply-To: <xmqqzij0t7uh.fsf@gitster.mtv.corp.google.com>

On 01/09, Junio C Hamano wrote:
> Brandon Williams <bmwill@google.com> writes:
> 
> >> How does this relate to the 5-patch real_path: series that has been
> >> on 'next' since last year?
> >
> > The only difference should be in the first patch of the series which
> > handles the #define a bit differently due to the discussion that
> > happened last week.
> >
> > Here is the interdiff between v5 and 'origin/bw/realpath-wo-chdir':
> 
> Then can you make that an incremental patch (or two) with its own
> log message instead?  It (or they) would look and smell like a
> bugfix patch that follows up a change that has already landed.  As
> you know, we won't eject and replace patches that are already in
> 'next' during a development cycle.
> 
> Thanks.

Yes I'll get right on that.

-- 
Brandon Williams

^ permalink raw reply

* Re: [PATCH v5 0/5] road to reentrant real_path
From: Junio C Hamano @ 2017-01-09 18:18 UTC (permalink / raw)
  To: Brandon Williams
  Cc: git, sbeller, peff, jacob.keller, ramsay, tboegi, j6t, pclouds,
	larsxschneider
In-Reply-To: <20170109180418.GB62878@google.com>

Brandon Williams <bmwill@google.com> writes:

>> How does this relate to the 5-patch real_path: series that has been
>> on 'next' since last year?
>
> The only difference should be in the first patch of the series which
> handles the #define a bit differently due to the discussion that
> happened last week.
>
> Here is the interdiff between v5 and 'origin/bw/realpath-wo-chdir':

Then can you make that an incremental patch (or two) with its own
log message instead?  It (or they) would look and smell like a
bugfix patch that follows up a change that has already landed.  As
you know, we won't eject and replace patches that are already in
'next' during a development cycle.

Thanks.

>
> diff --git a/abspath.c b/abspath.c
> index 1d56f5ed9..fce40fddc 100644
> --- a/abspath.c
> +++ b/abspath.c
> @@ -62,7 +62,9 @@ static void get_root_part(struct strbuf *resolved, struct strbuf *remaining)
>  }
>  
>  /* We allow "recursive" symbolic links. Only within reason, though. */
> -#define MAXSYMLINKS 5
> +#ifndef MAXSYMLINKS
> +#define MAXSYMLINKS 32
> +#endif
>  
>  /*
>   * Return the real path (i.e., absolute path, with symlinks resolved
> @@ -139,6 +141,8 @@ char *strbuf_realpath(struct strbuf *resolved, const char *path,
>  			strbuf_reset(&symlink);
>  
>  			if (num_symlinks++ > MAXSYMLINKS) {
> +				errno = ELOOP;
> +
>  				if (die_on_error)
>  					die("More than %d nested symlinks "
>  					    "on path '%s'", MAXSYMLINKS, path);

^ permalink raw reply

* Re: [PATCH v3 13/13] mergetool: fix running in subdir when rerere enabled
From: Junio C Hamano @ 2017-01-09 18:12 UTC (permalink / raw)
  To: Richard Hansen; +Cc: git, davvid, j6t, sbeller, simon
In-Reply-To: <20170109054238.42599-14-hansenr@google.com>

Richard Hansen <hansenr@google.com> writes:

> If rerere is enabled and no pathnames are given, run cd_to_toplevel
> before running 'git diff --name-only' so that 'git diff --name-only'
> sees the files named by 'git rerere remaining', which outputs
> pathnames relative to the top-level directory.
>
> The cd_to_toplevel command could be run after 'git rerere remaining',
> but it is run before just in case 'git rerere remaining' is ever
> changed to print pathnames relative to the current working directory
> rather than relative to the top-level directory.
>
> An alternative approach would be to unconditionally convert all
> relative pathnames (including the orderfile pathname) to be relative
> to the top-level directory and then run cd_to_toplevel before 'git
> diff --name-only', but unfortunately 'git rev-parse --prefix' requires
> valid pathnames, which would break some valid use cases.
>
> This fixes a regression introduced in
> 57937f70a09c12ef484c290865dac4066d207c9c (v2.11.0).
>
> Signed-off-by: Richard Hansen <hansenr@google.com>
> ---
>  git-mergetool.sh     | 32 ++++++++++++++++++++++++++++++++
>  t/t7610-mergetool.sh |  2 +-
>  2 files changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/git-mergetool.sh b/git-mergetool.sh
> index b506896dc..22f56c25a 100755
> --- a/git-mergetool.sh
> +++ b/git-mergetool.sh
> @@ -456,6 +456,28 @@ main () {
>  

While doing the extra cd_to_toplevel added by this patch may not
break anything that comes after the existing cd_to_toplevel, I find
the result of applying this patch unnecessarily confusing.  As "if
the user didn't give any pathnames from the command line, ask rerere
what paths it thinks are necessary to be handled" is merely a
laziness fallback, it feels conceptually wrong to use different
invocations of -O$orderfile when rerere is and is not in effect.

I wonder if it makes more sense to always move to toplevel upfront
and consistently use path from the toplevel, perhaps like the patch
does.  The first hunk is what you wrote but only inside MERGE_RR
block, and the second hunk deals with converting end-user supplied
paths that are relative to the original relative to the top-level.

The tweaking of $orderfile you have in the first hunk may have to be
tightened mimicking the way how "eval ... --sq ... ; shift" is used
in the second hunk to avoid confusion in case orderfile specified by
the end user happens to be the same as a valid revname
(e.g. "master").


diff --git a/git-mergetool.sh b/git-mergetool.sh
index b506896dc1..adbbeceb47 100755
--- a/git-mergetool.sh
+++ b/git-mergetool.sh
@@ -454,6 +454,14 @@ main () {
 	merge_keep_backup="$(git config --bool mergetool.keepBackup || echo true)"
 	merge_keep_temporaries="$(git config --bool mergetool.keepTemporaries || echo false)"
 
+	prefix=$(git rev-parse --show-prefix) || exit 1
+	cd_to_toplevel
+
+	if test -n "$orderfile"
+	then
+		orderfile=$(git rev-parse --prefix "$prefix" "$orderfile") || exit 1
+	fi
+
 	if test $# -eq 0 && test -e "$GIT_DIR/MERGE_RR"
 	then
 		set -- $(git rerere remaining)
@@ -461,14 +469,16 @@ main () {
 		then
 			print_noop_and_exit
 		fi
+	elif test $# -ge 0
+	then
+		eval "set -- $(git rev-parse --sq --prefix "$prefix" -- "$@")"
+		shift
 	fi
 
 	files=$(git -c core.quotePath=false \
 		diff --name-only --diff-filter=U \
 		${orderfile:+"-O$orderfile"} -- "$@")
 
-	cd_to_toplevel
-
 	if test -z "$files"
 	then
 		print_noop_and_exit

^ permalink raw reply related

* Re: [PATCH v5 0/5] road to reentrant real_path
From: Brandon Williams @ 2017-01-09 18:04 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, sbeller, peff, jacob.keller, ramsay, tboegi, j6t, pclouds,
	larsxschneider
In-Reply-To: <xmqqpojy1c49.fsf@gitster.mtv.corp.google.com>

On 01/07, Junio C Hamano wrote:
> Brandon Williams <bmwill@google.com> writes:
> 
> > changes in v5:
> > * set errno to ELOOP when MAXSYMLINKS is exceded.
> > * revert to use MAXSYMLINKS instead of MAXDEPTH.
> > * If the OS hasn't defined MAXSYMLINKS, use a fallback value of 32.
> >
> > Brandon Williams (4):
> >   real_path: resolve symlinks by hand
> >   real_path: convert real_path_internal to strbuf_realpath
> >   real_path: create real_pathdup
> >   real_path: have callers use real_pathdup and strbuf_realpath
> >
> > Johannes Sixt (1):
> >   real_path: canonicalize directory separators in root parts
> >
> 
> How does this relate to the 5-patch real_path: series that has been
> on 'next' since last year?

The only difference should be in the first patch of the series which
handles the #define a bit differently due to the discussion that
happened last week.

Here is the interdiff between v5 and 'origin/bw/realpath-wo-chdir':

diff --git a/abspath.c b/abspath.c
index 1d56f5ed9..fce40fddc 100644
--- a/abspath.c
+++ b/abspath.c
@@ -62,7 +62,9 @@ static void get_root_part(struct strbuf *resolved, struct strbuf *remaining)
 }
 
 /* We allow "recursive" symbolic links. Only within reason, though. */
-#define MAXSYMLINKS 5
+#ifndef MAXSYMLINKS
+#define MAXSYMLINKS 32
+#endif
 
 /*
  * Return the real path (i.e., absolute path, with symlinks resolved
@@ -139,6 +141,8 @@ char *strbuf_realpath(struct strbuf *resolved, const char *path,
 			strbuf_reset(&symlink);
 
 			if (num_symlinks++ > MAXSYMLINKS) {
+				errno = ELOOP;
+
 				if (die_on_error)
 					die("More than %d nested symlinks "
 					    "on path '%s'", MAXSYMLINKS, path);

-- 
Brandon Williams

^ permalink raw reply related

* Re: [PATCH v5 00/16] pathspec cleanup
From: Brandon Williams @ 2017-01-09 17:57 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git Mailing List, Stefan Beller, Junio C Hamano
In-Reply-To: <CACsJy8CM6FfHpVuqby=hjPmiYAxvJjzr1W6LdO5B82KQnTmmog@mail.gmail.com>

On 01/07, Duy Nguyen wrote:
> On Thu, Jan 5, 2017 at 1:03 AM, Brandon Williams <bmwill@google.com> wrote:
> > Changes in v5:
> > * Move GUARD_PATHSPEC to prevent checking if pathspec is null twice.
> > * Mark a string containing 'mnemonic' for translation.
> 
> Argh.. I've run out of things to complain about! Ack!

haha! Well thanks again for your help in reviewing this series!

-- 
Brandon Williams

^ permalink raw reply

* Re: [PATCH 0/6] git worktree move/remove
From: Junio C Hamano @ 2017-01-09 17:34 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git
In-Reply-To: <20170108094003.637-1-pclouds@gmail.com>

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

> This version is the same as nd/worktree-move but with the recursive
> directory copy code removed. We rely on rename() to move directories.

Much simpler ;-)

Will replace; thanks.

> Nguyễn Thái Ngọc Duy (6):
>   worktree.c: add validate_worktree()
>   worktree.c: add update_worktree_location()
>   worktree move: new command
>   worktree move: accept destination as directory
>   worktree move: refuse to move worktrees with submodules
>   worktree remove: new command
>
>  Documentation/git-worktree.txt         |  28 ++++--
>  builtin/worktree.c                     | 162 +++++++++++++++++++++++++++++++++
>  contrib/completion/git-completion.bash |   5 +-
>  t/t2028-worktree-move.sh               |  56 ++++++++++++
>  worktree.c                             |  84 +++++++++++++++++
>  worktree.h                             |  11 +++
>  6 files changed, 335 insertions(+), 11 deletions(-)

^ permalink raw reply

* Re: [PATCH v4] log --graph: customize the graph lines with config log.graphColors
From: Junio C Hamano @ 2017-01-09 17:29 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git, Jeff King
In-Reply-To: <20170109103258.25341-1-pclouds@gmail.com>

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

> +	end = string + strlen(string);
> +	while (start < end) {
> +		const char *comma = strchrnul(start, ',');
> +		char color[COLOR_MAXLEN];
> +
> +		while (start < comma && isspace(*start))
> +			start++;
> +		if (start == comma) {
> +			start = comma + 1;
> +			continue;
> +		}
> +
> +		if (!color_parse_mem(start, comma - start, color))

So you skip the leading blanks but let color_parse_mem() trim the
trailing blanks?  It would work once the control reaches the loop,
but wouldn't that miss

	git -c log.graphColors=' reset , blue, red' log --graph 

as "reset" is not understood by parse_color() and is understood by
color_parse_mem() only when the length is given correctly?

I am undecided, but leaning towards declaring that it is a bug in
color_parse_mem() not in this caller.

> +			argv_array_push(&colors, color);
> +		else
> +			warning(_("ignore invalid color '%.*s' in log.graphColors"),
> +				(int)(comma - start), start);
> +		start = comma + 1;
> +	}
> +	free(string);
> +	argv_array_push(&colors, GIT_COLOR_RESET);
> +	/* graph_set_column_colors takes a max-index, not a count */
> +	graph_set_column_colors(colors.argv, colors.argc - 1);
> +}
> +
>  void graph_set_column_colors(const char **colors, unsigned short colors_max)
>  {
>  	column_colors = colors;
> @@ -208,8 +248,7 @@ struct git_graph *graph_init(struct rev_info *opt)
>  	struct git_graph *graph = xmalloc(sizeof(struct git_graph));
>  
>  	if (!column_colors)
> -		graph_set_column_colors(column_colors_ansi,
> -					column_colors_ansi_max);
> +		read_graph_colors_config();

Now that the helper is renamed to be about "reading the
configuration to figure out the graph colors", the division of labor
between the caller and the callee we see here is suboptimal for
readability, I would think.   

We would want to see the caller to either be

	if (!column_colors) {
		if (read_graph_colors_config())
			; /* ok */
		else                        
			graph_set_column_colors(ansi, ansi_max);
	}

or better yet, something like:

	if (!column_colors) {
		const char **colors = column_colors_ansi;
		int colors_max = column_colors_ansi_max;

		read_graph_colors_config(&colors, &colors_max);
		graph_set_collumn_colors(colors, colors_max);
	}

The last one would make it clear that by default we use ansi but
override that default by calling that function whose purpose is to
read the values that override the default from the configuration.

I haven't thought things thru regarding memory leakages etc., so
there may be valid reasons why the last one is infeasible.


^ permalink raw reply

* Re: Preserve/Prune Old Pack Files
From: Martin Fick @ 2017-01-09 16:20 UTC (permalink / raw)
  To: Jeff King; +Cc: Mike Hommey, git
In-Reply-To: <20170109105545.gchfklcuzwhlfbtf@sigill.intra.peff.net>

On Monday, January 09, 2017 05:55:45 AM Jeff King wrote:
> On Mon, Jan 09, 2017 at 04:01:19PM +0900, Mike Hommey 
wrote:
> > > That's _way_ more complicated than your problem, and
> > > as I said, I do not have a finished solution. But it
> > > seems like they touch on a similar concept (a
> > > post-delete holding area for objects). So I thought
> > > I'd mention it in case if spurs any brilliance.
> > 
> > Something that is kind-of in the same family of problems
> > is the "loosening" or objects on repacks, before they
> > can be pruned.
...
> Yes, this can be a problem. The repack is smart enough not
> to write out objects which would just get pruned
> immediately, but since the grace period is 2 weeks, that
> can include a lot of objects (especially with history
> rewriting as you note). It would be possible to write
> those loose objects to a "cruft" pack, but there are some
> management issues around the cruft pack. You do not want
> to keep repacking them into a new cruft pack at each
> repack, since then they would never expire. So you need
> some way of marking the pack as cruft, letting it age
> out, and then deleting it after the grace period expires.
> 
> I don't think it would be _that_ hard, but AFAIK nobody
> has ever made patches.

FYI, jgit does this,

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply

* Re: Preserve/Prune Old Pack Files
From: Martin Fick @ 2017-01-09 16:17 UTC (permalink / raw)
  To: Jeff King; +Cc: repo-discuss, jmelvin, jgit-dev, git
In-Reply-To: <20170109062137.zghmurndlbts5x44@sigill.intra.peff.net>

On Monday, January 09, 2017 01:21:37 AM Jeff King wrote:
> On Wed, Jan 04, 2017 at 09:11:55AM -0700, Martin Fick 
wrote:
> > I am replying to this email across lists because I
> > wanted to highlight to the git community this jgit
> > change to repacking that we have up for review
> > 
> >  https://git.eclipse.org/r/#/c/87969/
> > 
> > This change introduces a new convention for how to
> > preserve old pack files in a staging area
> > (.git/objects/packs/preserved) before deleting them.  I
> > wanted to ensure that the new proposed convention would
> > be done in a way that would be satisfactory to the git
> > community as a whole so that it would be more easy to
> > provide the same behavior in git eventually.  The
> > preserved pack files (and accompanying index and bitmap
> > files), are not only moved, but they are also renamed
> > so that they no longer will match recursive finds
> > looking for pack files.
> It looks like objects/pack/pack-123.pack becomes
> objects/pack/preserved/pack-123.old-pack,

Yes, that's the idea.

> and so forth. Which seems reasonable, and I'm happy that:
> 
>   find objects/pack -name '*.pack'
> 
> would not find it. :)

Cool.

> I suspect the name-change will break a few tools that you
> might want to use to look at a preserved pack (like
> verify-pack).  I know that's not your primary use case,
> but it seems plausible that somebody may one day want to
> use a preserved pack to try to recover from corruption. I
> think "git index-pack --stdin
> <objects/packs/preserved/pack-123.old-pack" could always
> be a last-resort for re-admitting the objects to the
> repository.

or even a simple manual rename/move back to its orginal 
place?

> I notice this doesn't do anything for loose objects. I
> think they technically suffer the same issue, though the
> race window is much shorter (we mmap them and zlib
> inflate immediately, whereas packfiles may stay mapped
> across many object requests).

Hmm, yeah that's the next change, didn't you see it? :)  No, 
actually I forgot about those.  Our server tends to not have 
too many of those (loose objects), and I don't think we have 
seen any exceptions yet for them.  But, of course, you are 
right, they should get fixed too.  I will work on a followup 
change to do that.

Where would you suggest we store those?  Maybe under 
".git/objects/preserved/<xx>/<sha1>"?  Do they need to be 
renamed also somehow to avoid a find?

...
> I've wondered if we could make object pruning more atomic
> by speculatively moving items to be deleted into some
> kind of "outgoing" object area.
...
> I don't have a solution here.  I don't think we want to
> solve it by locking the repository for updates during a
> repack. I have a vague sense that a solution could be
> crafted around moving the old pack into a holding area
> instead of deleting (during which time nobody else would
> see the objects, and thus not reference them), while the
> repacking process checks to see if the actual deletion
> would break any references (and rolls back the deletion
> if it would).
> 
> That's _way_ more complicated than your problem, and as I
> said, I do not have a finished solution. But it seems
> like they touch on a similar concept (a post-delete
> holding area for objects). So I thought I'd mention it in
> case if spurs any brilliance.

I agree, this is a problem I have wanted to solve also.  I 
think having a "preserved" directory does open the door to 
such "recovery" solutions, although I think you would 
actually want to modify the many read code paths to fall 
back to looking at the preserved area and performing 
immediate "recovery" of the pack file if it ends up being 
needed.  That's a lot of work, but having the packs (and 
eventually the loose objects) preserved into a location 
where no new references will be built to depend on them is 
likely the first step.  Does the name "preserved" do well for 
that use case also, or would there be some better name, what 
would a transactional system call them?

Thanks for the review Peff!

-Martin


-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply

* Re: Refreshing index timestamps without reading content
From: Quentin Casasnovas @ 2017-01-09 15:55 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Duy Nguyen, Quentin Casasnovas, Git Mailing List
In-Reply-To: <xmqqr34cuvjj.fsf@gitster.mtv.corp.google.com>

[-- Attachment #1: Type: text/plain, Size: 6095 bytes --]

On Mon, Jan 09, 2017 at 07:01:36AM -0800, Junio C Hamano wrote:
> Duy Nguyen <pclouds@gmail.com> writes:
>
> > On Thu, Jan 5, 2017 at 6:23 PM, Quentin Casasnovas
> > <quentin.casasnovas@oracle.com> wrote:
> >> Is there any way to tell git, after the git ls-tree command above, to
> >> refresh its stat cache information and trust us that the file content has
> >> not changed, as to avoid any useless file read (though it will obviously
> >> will have to stat all of them, but that's not something we can really
> >> avoid)
> >
> > I don't think there's any way to do that, unfortunately.
>
> Lose "unfortunately".
>
> >> If not, I am willing to implement a --assume-content-unchanged to the git
> >> update-index if you guys don't see something fundamentally wrong with this
> >> approach.
> >
> > If you do that, I think you should go with either of the following options
> >
> > - Extend git-update-index --index-info to take stat info as well (or
> > maybe make a new option instead). Then you can feed stat info directly
> > to git without a use-case-specific "assume-content-unchanged".
> >
> > - Add "git update-index --touch" that does what "touch" does. In this
> > case, it blindly updates stat info to latest. But like touch, we can
> > also specify  mtime from command line if we need to. It's a bit less
> > generic than the above option, but easier to use.
>
> Even if we assume that it is a good idea to let people muck with the
> index like this, either of the above would be a usable addition,
> because the cached stat information does not consist solely of
> mtime.
>
> "git update-index --index-info" was invented for the case where a
> user or a script _knows_ the object ID of the blob that _would_
> result if a contents of a file on the filesystem were run through
> hash-object.  So from the interface's point of view, it may make
> sense to teach it to take an extra/optional argument that is the
> path to the file and take the stat info out of the named file when
> the extra/optional argument was given.
>
> But that assumes that it is a good idea to do this in the first
> place.  It was deliberate design decision that setting the cached
> stat info for the entry was protected behind actual content
> comparison, and removing that protection will open the index to
> abuse.
>

Hi Junio,

Thanks for your feedback, appreciated :)

I do understand how it would be possible for someone to shoot themselves in
the feet with such option, but it solves real life use cases and improved
build times very signficantly here.

Another use case we have is setting up very lightweight linux work trees,
by reflinking from a base work-tree.  This allows for a completely
different work-tree taking up almost no size at first, whereas using a
shared clone or the recent worktree subcommand would "waste" ~500MB*:

 # linux-2.6 is a shared clone of a bare clone residing locally
 ~ $ cp --reflink -a linux-2.6 linux-2.6-reflinked

 # At this point, the mtime inside linux-2.6-reflinked are matching the
 # mtime of the source linux-2.6 (since we used the '-a' option of 'cp)
 ~ $ diff -u <(stat linux-2.6/README) <(stat linux-2.6-reflinked/README)
 --- /proc/self/fd/11  2017-01-09 16:34:04.523438942 +0100
 +++ /proc/self/fd/12  2017-01-09 16:34:04.523438942 +0100
 @@ -1,8 +1,8 @@
 -  File: 'linux-2.6/README'
 +  File: 'linux-2.6-reflinked/README'
    Size: 18372		Blocks: 40         IO Block: 4096   regular file
 -Device: fd00h/64768d	Inode: 268467090   Links: 1
 +Device: fd00h/64768d	Inode: 805970606   Links: 1
  Access: (0644/-rw-r--r--)  Uid: ( 1000/ quentin)   Gid: ( 1000/ quentin)
  Access: 2017-01-09 12:04:15.317758718 +0100
  Modify: 2017-01-09 12:04:12.566758772 +0100
 -Change: 2017-01-09 12:04:12.566758772 +0100
 +Change: 2017-01-09 16:29:48.305444003 +0100
   Birth:

  # Now let's check how long it takes to refresh the index from the source
  # and destination..
  ~/linux-2.6 $ time git update-index --refresh
  git update-index --refresh  0.04s user 0.08s system 204% cpu 0.058 total
                                                               ~~~~~~~~~~~
  ~/linux-2.6-reflinked $ time git update-index --refresh
  git update-index --refresh  2.40s user 1.43s system 38% cpu 10.003 total
                                                              ~~~~~~~~~~~~

This is quite a high penalty when a power user knows that his lightweight
copy worktree matches the index!  That's on a single worktree on a fairly
decent SSD but our build farm would do this with hundreds of work trees in
parallel, all residing on a spinning disks - the penalty would be much
worse than 10 seconds wasted.

I might have a look at adding reflinking awareness to the worktree
subcommand (if it hasn't been implemented already) to avoid this index
force refreshing but that still would not fix our other use case where we
knowingly change the mtime of the files on our worktrees.

* That's a rather low estimate:

  git ls-tree -zr --name-only v4.9 | du -csh --files0-from=- | tail -n1
  747M  total

> The userbase of Git has grown wide enough that it is harder to say "If
> you lie that a file whose contents does not match the index is up to date
> using this mechanism, you will lose data and all bad things happen---you
> can keep both halves".  Once we release a version of Git with such a
> "feature", the first bug report will be "I did not want to run
> 'update-index --refresh' because it takes time, and some index entries
> apparently did not match what is on the filesystem, and I got a corrupt
> working file after a merge.  Git should make sure that the contents match
> when using the new 'path to the file' argument when updating the cached
> stat info!".  I do not have a good answer to such a bug report.
>

The good answer would probably be that this is intended behaviour and would
explicitly be mentionned in the manual for that specific option?  :)

How about renaming the option (which is more of a sub-command) to
--i-know-this-can-eat-my-data-but-refresh-stat-only?  Then we can print a
big warning saying this is generally a VERY bad idea?

Q

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply

* Re: What's cooking in git.git (preview)
From: Junio C Hamano @ 2017-01-09 15:05 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git
In-Reply-To: <alpine.DEB.2.20.1701091228460.3469@virtualbox>

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Hi Junio,
>
> On Sun, 8 Jan 2017, Junio C Hamano wrote:
>
>> * js/difftool-builtin (2017-01-08) 4 commits
>>  - t7800: run both builtin and scripted difftool, for now
>>  - difftool: implement the functionality in the builtin
>>  - difftool: add a skeleton for the upcoming builtin
>>  - git_exec_path: avoid Coverity warning about unfree()d result
>> 
>>  Rewrite a scripted porcelain "git difftool" in C.
>> 
>>  What's the doneness of this topic?
>
> There was only one bug report since the first iteration, and it has been
> fixed as of v4.

This is more RFH than FYI, but some topics in flight add more tests
to t7800 and with everything merged, the test breaks at the tip of
'pu' as of tonight (eh, technically last night, as I ended up not
sleeping due to jetlag), even though each individual topic passes
it.  It _might_ indicate a problem with this topic, or it may merely
be a biproduct of a mismerge.


^ permalink raw reply

* Re: Refreshing index timestamps without reading content
From: Junio C Hamano @ 2017-01-09 15:01 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Quentin Casasnovas, Git Mailing List
In-Reply-To: <CACsJy8BRfJG6L49VyC+qsrQ9Arz0gCGpMATpK9uLq61Lx6_Jtg@mail.gmail.com>

Duy Nguyen <pclouds@gmail.com> writes:

> On Thu, Jan 5, 2017 at 6:23 PM, Quentin Casasnovas
> <quentin.casasnovas@oracle.com> wrote:
>> Is there any way to tell git, after the git ls-tree command above, to
>> refresh its stat cache information and trust us that the file content has
>> not changed, as to avoid any useless file read (though it will obviously
>> will have to stat all of them, but that's not something we can really
>> avoid)
>
> I don't think there's any way to do that, unfortunately.

Lose "unfortunately".

>> If not, I am willing to implement a --assume-content-unchanged to the git
>> update-index if you guys don't see something fundamentally wrong with this
>> approach.
>
> If you do that, I think you should go with either of the following options
>
> - Extend git-update-index --index-info to take stat info as well (or
> maybe make a new option instead). Then you can feed stat info directly
> to git without a use-case-specific "assume-content-unchanged".
>
> - Add "git update-index --touch" that does what "touch" does. In this
> case, it blindly updates stat info to latest. But like touch, we can
> also specify  mtime from command line if we need to. It's a bit less
> generic than the above option, but easier to use.

Even if we assume that it is a good idea to let people muck with the
index like this, either of the above would be a usable addition,
because the cached stat information does not consist solely of
mtime.

"git update-index --index-info" was invented for the case where a
user or a script _knows_ the object ID of the blob that _would_
result if a contents of a file on the filesystem were run through
hash-object.  So from the interface's point of view, it may make
sense to teach it to take an extra/optional argument that is the
path to the file and take the stat info out of the named file when
the extra/optional argument was given.

But that assumes that it is a good idea to do this in the first
place.  It was deliberate design decision that setting the cached
stat info for the entry was protected behind actual content
comparison, and removing that protection will open the index to
abuse.

The userbase of Git has grown wide enough that it is harder to say
"If you lie that a file whose contents does not match the index is
up to date using this mechanism, you will lose data and all bad
things happen---you can keep both halves".  Once we release a
version of Git with such a "feature", the first bug report will be
"I did not want to run 'update-index --refresh' because it takes
time, and some index entries apparently did not match what is on the
filesystem, and I got a corrupt working file after a merge.  Git
should make sure that the contents match when using the new 'path to
the file' argument when updating the cached stat info!".  I do not
have a good answer to such a bug report.

So...

^ permalink raw reply

* Re: [PATCH v3 14/21] read-cache: touch shared index files when used
From: Junio C Hamano @ 2017-01-09 14:34 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Christian Couder, git, Ævar Arnfjörð Bjarmason,
	Christian Couder
In-Reply-To: <CACsJy8AR6yNr0y+_JZDkW-HO_yHPkUx_6zbLGoviKQBOVcSg5A@mail.gmail.com>

Duy Nguyen <pclouds@gmail.com> writes:

> On Sun, Jan 8, 2017 at 4:46 AM, Junio C Hamano <gitster@pobox.com> wrote:
>> Christian Couder <christian.couder@gmail.com> writes:
>>
>>> So what should we do if freshen_file() returns 0 which means that the
>>> freshening failed?
>>
>> You tell me ;-)  as you are the one who is proposing this feature.
>
> My answer is, we are not worse than freshening loose objects case
> (especially since I took the idea from there). 

I do not think so, unfortunately.  Loose object files with stale
timestamps are not removed as long as objects are still reachable.
For the base/shared index file, the timestamp is the only thing that
protects them from pruning, unless it is serving as the base file
for the currently active $GIT_DIR/index that is split.

>> What is the failure mode after such a premature GC happens?  What
>> does the end-user see?  Can you try to (1) split the index (2)
>> modify bunch of entries (3) remove the base/shared index with /bin/rm
>> and then see how various Git commands fail?  Do they fail gracefully?
>>
>> I am trying to gauge the seriousness of ignoring such an error here.
>
> If we fail to refresh it and the file is old enough and gc happens,
> any index file referenced to it are broken. Any commands that read the
> index will die(). The best you could do is delete $GIT_DIR/index and
> read-tree HEAD.

^ permalink raw reply

* Re: [PATCH] Makefile: put LIBS after LDFLAGS for imap-send
From: Junio C Hamano @ 2017-01-09 14:31 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Steven Penny, git
In-Reply-To: <xmqq4m18wbo5.fsf@gitster.mtv.corp.google.com>

Junio C Hamano <gitster@pobox.com> writes:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
>> In any case, it does not break things, and it helps Cygwin, so: ACK
>>
>> Ciao,
>> Dscho
>>
>> P.S.: I pushed this to Git for Windows' `master`, too:
>> https://github.com/git-for-windows/git/commit/f05a26948b
>
> Hmm, I peeked it hoping that you corrected the log message along the
> lines of your <alpine.DEB.2.20.1701081953330.3469@virtualbox>, but
> it appears that what you queued does not have any extra mention of
> cygwin or specifics of the buildchain on top of what Steven posted?

Ah, false alarm.  It does have s/Windows/Cygwin/; I trust your
judgement that that change makes the context clear enough for those
involved in Git for Windows, msysGit and Cygwin port.

Will queue it together with the other one from Steven.

Thanks.

^ permalink raw reply

* Re: [PATCH] Makefile: put LIBS after LDFLAGS for imap-send
From: Junio C Hamano @ 2017-01-09 14:27 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Steven Penny, git
In-Reply-To: <alpine.DEB.2.20.1701091127570.3469@virtualbox>

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> In any case, it does not break things, and it helps Cygwin, so: ACK
>
> Ciao,
> Dscho
>
> P.S.: I pushed this to Git for Windows' `master`, too:
> https://github.com/git-for-windows/git/commit/f05a26948b

Hmm, I peeked it hoping that you corrected the log message along the
lines of your <alpine.DEB.2.20.1701081953330.3469@virtualbox>, but
it appears that what you queued does not have any extra mention of
cygwin or specifics of the buildchain on top of what Steven posted?


^ permalink raw reply

* Re: [PATCH v3] log --graph: customize the graph lines with config log.graphColors
From: Junio C Hamano @ 2017-01-09 14:23 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Jeff King, Git Mailing List
In-Reply-To: <CACsJy8AjW1TrLO28mSUSTc6V+_0kuShf7V-=Pkw3Cw9t7ZRfyw@mail.gmail.com>

Duy Nguyen <pclouds@gmail.com> writes:

> On Mon, Jan 9, 2017 at 12:30 PM, Jeff King <peff@peff.net> wrote:
>> I also wonder if it is worth just using argv_array. We do not care about
>> ...
>> It is not much shorter than ALLOC_GROW(), but IMHO it is easier to read.
>
> Indeed. My only complaint is it's "argv_array_" and not
> "string_array_" but we could think about renaming it later when we see
> another use like this.

Yup, when Peff said "argv-array", it took me a few breaths, until I
realized that argv-array was merely an array of strings, to convince
myself that the data structure can be used here.  If we are going to
use it in more places, we may want to rename it somehow.

^ permalink raw reply

* Re: [PATCH] Makefile: put LIBS after LDFLAGS for imap-send
From: Junio C Hamano @ 2017-01-09 14:23 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Steven Penny, git
In-Reply-To: <alpine.DEB.2.20.1701091127570.3469@virtualbox>

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> ...
> It is a bit curious that -lws2_32 *does* only show up in $(LIBS), but I
> guess it is simply the fact that we use a newer GCC (gcc.exe (Rev2, Built
> by MSYS2 project) 6.2.0) that allows Git for Windows to be built even
> without this patch.
>
> In any case, it does not break things, and it helps Cygwin, so: ACK

Thanks, will queue on my side, too.

^ permalink raw reply

* Re: [PATCH] connect: handle putty/plink also in GIT_SSH_COMMAND
From: Junio C Hamano @ 2017-01-09 14:19 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Segev Finer
In-Reply-To: <alpine.DEB.2.20.1701091207480.3469@virtualbox>

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> > But do we really need to do that?
>> 
>> No.
>
> Exactly.

> But since you seem to convinced that a future bug report like this may
> happen (I am not, and it contradicts my conviction that one should cross a
> bridge only when reaching it, but whatever), how about this, on top:

I do not understand why you keep arguing.

With No-Exactly exchange, I thought that we established that

 * The original auto-detection for GIT_SSH is battle tested and has
   worked for us very well.  It may probably have covered 99.9% of
   the population, and we haven't heard complaints from the
   remaining 0.1%.

 * The added support for GIT_SSH_COMMAND, because the mechanism
   gives more lattitude to end-users to be creative, will have lower
   chance of correctly deciding between ssh, tortoiseplink and
   plink, but it would still be high enough, say 99%, correct
   detection rate.

 * It is foolish to add more code to refine the auto-detection to
   raise 99% to 99.5%.  We know that the approach fundamentally
   cannot make the detection rate reach 100%.

The last one can be paraphrased as "perfect is the enemy of good".

But that does not mean that it is OK to leave the system unusable
for 1% of the users.  If the auto-detection code does not work
correctly for a tiny minority of users, it is still OK, as long as
we give them an escape hatch to allow them to say "You may not be
able to guess among ssh, tortoiseplink and plink, or you may even
guess incorrectly, so I'll tell you.  I am using plink".  99% of
users do not have to know about that escape hatch, but 1% of users
will be helped.

IOW, "give an escape hatch to override auto-detected tortoiseplink
and plink variables" suggestion should be taken as an "in addition
to" suggestion (as opposed to an "instead of" suggestion).  I was
not clear in my very first message, and I thought in my second and
my third message I clarified it as such, but probably I wasn't
explicit enough.

In any case, "do this only if the first one token on the command
line does not have '='" we see below is flawed for two reasons and I
think it would not be a workable counter-proposal (besides, it goes
against the "perfect is the enemy of good" mantra).

For one thing, the command line may not be "VAR=VAL cmd", but just
"cmd" that names the path to tortoiseplink, but the leading
directory path that houses tortoiseplink may have a '=' in it, in
which case we would detect it correctly if we didn't have such a
hack, but with the hack we would punt.  More importantly, even when
"VAR=VAL cmd" form was caught with the change, it may punt and stop
guessing, but punting alone does not help users by letting them say
"I know you cannot auto-detect, so let me help you---what I have is
PuTTY plink".  If it always assumes that the user uses the plain
vanilla ssh, then the change merely changed what incorrect guess the
auto-detection makes from "tortoiseplink" to "vanilla ssh" for a
user who uses the PuTTY variant of plink.

> -- snipsnap --
> diff --git a/connect.c b/connect.c
> index c81f77001b..b990dd6190 100644
> --- a/connect.c
> +++ b/connect.c
> @@ -797,7 +797,8 @@ struct child_process *git_connect(int fd[2], const
> char *url,
>  				char *split_ssh = xstrdup(ssh);
>  				const char **ssh_argv;
>  
> -				if (split_cmdline(split_ssh, &ssh_argv))
> +				if (split_cmdline(split_ssh, &ssh_argv) &&
> +				    !strchr(ssh_argv[0], '='))
>  					ssh_argv0 = xstrdup(ssh_argv[0]);
>  				free(split_ssh);
>  				free((void *)ssh_argv);

^ permalink raw reply

* Re: Test failures when Git is built with libpcre and grep is built without it
From: Andreas Schwab @ 2017-01-09 13:05 UTC (permalink / raw)
  To: A. Wilcox; +Cc: Jeff King, git
In-Reply-To: <58736B2A.40003@adelielinux.org>

On Jan 09 2017, "A. Wilcox" <awilfox@adelielinux.org> wrote:

> Interestingly enough, you seem to be right.  The failure is very
> bizarre and has nothing to do with system /bin/grep:
>
> test_must_fail: command succeeded: git grep -G -F -P -E a\x{2b}b\x{2a}c ab
> not ok 142 - grep -G -F -P -E pattern
> #
> #               >empty &&
> #               test_must_fail git grep -G -F -P -E "a\x{2b}b\x{2a}c"
> ab >actual &&
> #               test_cmp empty actual
> #
>
> However:
>
> elaine trash directory.t7810-grep # git grep -G -F -P -E
> a\x{2b}b\x{2a}c ab >actual

You need to quote the regexp argument, see the line starting with
"test_must_fail" above.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply

* Re: [RFC PATCH 0/5] Localise error headers
From: Michael J Gruber @ 2017-01-09 12:43 UTC (permalink / raw)
  To: Jeff King; +Cc: git
In-Reply-To: <20170104070514.pxdthvilw66ierfz@sigill.intra.peff.net>

Jeff King venit, vidit, dixit 04.01.2017 08:05:
> On Mon, Jan 02, 2017 at 12:14:49PM +0100, Michael J Gruber wrote:
> 
>> Currently, the headers "error: ", "warning: " etc. - generated by die(),
>> warning() etc. - are not localized, but we feed many localized error messages
>> into these functions so that we produce error messages with mixed localisation.
>>
>> This series introduces variants of die() etc. that use localised variants of
>> the headers, i.e. _("error: ") etc., and are to be fed localized messages. So,
>> instead of die(_("not workee")), which would produce a mixed localisation (such
>> as "error: geht ned"), one should use die_(_("not workee")) (resulting in
>> "Fehler: geht ned").
> 
> I can't say I'm excited about having matching "_" variants for each
> function. Are we sure that they are necessary? I.e., would it be
> acceptable to just translate them always?

We would still need to mark the strings, e.g.

die(N_("oopsie"));

and would not be able to opt out of translating in the code (only in the
po file, by not providing a translation).


>> 1/5 prepares the error machinery
>> 2/5 provides new variants error_() etc.
>> 3/5 has coccinelli rules error(_(E)) -> error_(_(E)) etc.
>> 4/5 applies the coccinelli patches
>>
>> 5/5 is not to be applied to the main tree, but helps you try out the feature:
>> it has changes to de.po and git.pot so that e.g. "git branch" has fully localised
>> error messages (see the recipe in the commit message).
> 
> Your patches 4 and 5 don't seem to have made it to the list. Judging
> from the diffstat, I'd guess they broke the 100K limit.

Hmmpf, I didn't know about the limit. In any case, they were simple
results of applying the "make cocci" patches (4/5) resp. providing some
"de" strings to try this out.

In any case, the question is whether we want to tell the user

A: B

where A is in English and B is localised, or rather localise both A and
B (for A in "error", "fatal", "warning"...).

For localising A and B, we'd need this series or something similar. For
keeping the mix, we don't need to do anything ;)

Michael


^ permalink raw reply

* Re: Refreshing index timestamps without reading content
From: Quentin Casasnovas @ 2017-01-09 12:22 UTC (permalink / raw)
  To: Quentin Casasnovas; +Cc: Duy Nguyen, Git Mailing List
In-Reply-To: <20170109121724.GC7000@chrystal.oracle.com>

[-- Attachment #1: Type: text/plain, Size: 1837 bytes --]

On Mon, Jan 09, 2017 at 01:17:24PM +0100, Quentin Casasnovas wrote:
> On Mon, Jan 09, 2017 at 07:02:45PM +0700, Duy Nguyen wrote:
> > On Thu, Jan 5, 2017 at 6:23 PM, Quentin Casasnovas
> > <quentin.casasnovas@oracle.com> wrote:
> > 
> > > If not, I am willing to implement a --assume-content-unchanged to the git
> > > update-index if you guys don't see something fundamentally wrong with this
> > > approach.
> > 
> > If you do that, I think you should go with either of the following options
> > 
> > - Extend git-update-index --index-info to take stat info as well (or
> > maybe make a new option instead). Then you can feed stat info directly
> > to git without a use-case-specific "assume-content-unchanged".
> > 
> > - Add "git update-index --touch" that does what "touch" does. In this
> > case, it blindly updates stat info to latest. But like touch, we can
> > also specify  mtime from command line if we need to. It's a bit less
> > generic than the above option, but easier to use.
> > 
> > Caveat: The options I'm proposing can be rejected. So maybe wait a bit
> > to see how people feel and perhaps send an RFC patch, again to gauge
> > the reception.
> 
> 
> Hey Duy,
> 
> Thanks for your answer.
> 
> I've played with this a bit last week and added an extra command, which I
> called --refresh-stat, which works like your suggested --index.  It works
                                                         ^^^^^^^

Whoops, sorry!

[...] *a bit* like your suggested *--touch*. [...]

I don't really need it to support specific mtime from the CLI for my
different use cases, but if someone has some ideas in how that would be
useful, I can try implementing your --index-info change to support stat
information to be passed around instead of my current simple
implementation.

Q

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply

* Re: Refreshing index timestamps without reading content
From: Quentin Casasnovas @ 2017-01-09 12:17 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Quentin Casasnovas, Git Mailing List
In-Reply-To: <CACsJy8BRfJG6L49VyC+qsrQ9Arz0gCGpMATpK9uLq61Lx6_Jtg@mail.gmail.com>


[-- Attachment #1.1: Type: text/plain, Size: 1736 bytes --]

On Mon, Jan 09, 2017 at 07:02:45PM +0700, Duy Nguyen wrote:
> On Thu, Jan 5, 2017 at 6:23 PM, Quentin Casasnovas
> <quentin.casasnovas@oracle.com> wrote:
> 
> > If not, I am willing to implement a --assume-content-unchanged to the git
> > update-index if you guys don't see something fundamentally wrong with this
> > approach.
> 
> If you do that, I think you should go with either of the following options
> 
> - Extend git-update-index --index-info to take stat info as well (or
> maybe make a new option instead). Then you can feed stat info directly
> to git without a use-case-specific "assume-content-unchanged".
> 
> - Add "git update-index --touch" that does what "touch" does. In this
> case, it blindly updates stat info to latest. But like touch, we can
> also specify  mtime from command line if we need to. It's a bit less
> generic than the above option, but easier to use.
> 
> Caveat: The options I'm proposing can be rejected. So maybe wait a bit
> to see how people feel and perhaps send an RFC patch, again to gauge
> the reception.


Hey Duy,

Thanks for your answer.

I've played with this a bit last week and added an extra command, which I
called --refresh-stat, which works like your suggested --index.  It works
very well with my use case and improves the performances very significantly
on some of our use cases.

It is attached to this e-mail to gather comments, as you suggest, and is
really not meant to be reviewed for inclusion yet as it lacks test cases,
documentation changes, etc.  It is just a convenient way to show what I
need and receive comments.

The logic is simple enugh and will just skip calling ie_modified() when
refreshing the index.

Cheers,
Q

[-- Attachment #1.2: 0001-git-update-index-add-refresh-stat-option.patch --]
[-- Type: text/x-diff, Size: 5346 bytes --]

From 461b4f75056fc476db6bbdf8bc3ff6e91a8ad08d Mon Sep 17 00:00:00 2001
From: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Date: Sat, 7 Jan 2017 09:26:29 +0100
Subject: [PATCH] git-update-index: add --refresh-stat option.

git-update-index --refresh, contrary to what the documentation says, not
only will refresh the stat information but will also update the sha1 of the
objects in the working tree if the stat information is found to be
out-of-date.

We add a --refresh-stat option, which like --refresh will update the stat
information, but will assume that any file in the working tree that is
present in the index matches the index - hence prevents unecessarily
reading the files in the working tree.  It is different from
--assume-unchanged or --skip-worktree in that the new stat information is
recorded in the index, hence subsequent git update-index --refresh will not
read the files if their mtime hasn't changed.

This sounds like a very dangerous option to give to the user, since it
could potentially cause changed files to be missed, but can be needed when
we are sure the working tree files have not changed and are matching the
index.

Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
---
 builtin/update-index.c | 10 ++++++++++
 cache.h                |  3 +++
 read-cache.c           |  7 +++++--
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/builtin/update-index.c b/builtin/update-index.c
index f3f07e7f1cb2..1215b6a09687 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -777,6 +777,12 @@ static int really_refresh_callback(const struct option *opt,
 	return refresh(opt->value, REFRESH_REALLY);
 }
 
+static int refresh_stat_callback(const struct option *opt,
+				const char *arg, int unset)
+{
+	return refresh(opt->value, REFRESH_STAT_ONLY);
+}
+
 static int chmod_callback(const struct option *opt,
 				const char *arg, int unset)
 {
@@ -934,6 +940,10 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
 			N_("refresh stat information"),
 			PARSE_OPT_NOARG | PARSE_OPT_NONEG,
 			refresh_callback},
+		{OPTION_CALLBACK, 0, "refresh-stat", &refresh_args, NULL,
+			N_("refresh only stat information (assume content has not changed)"),
+			PARSE_OPT_NOARG | PARSE_OPT_NONEG,
+			refresh_stat_callback},
 		{OPTION_CALLBACK, 0, "really-refresh", &refresh_args, NULL,
 			N_("like --refresh, but ignore assume-unchanged setting"),
 			PARSE_OPT_NOARG | PARSE_OPT_NONEG,
diff --git a/cache.h b/cache.h
index a50a61a19787..57d0f9fffbe5 100644
--- a/cache.h
+++ b/cache.h
@@ -611,6 +611,8 @@ extern void *read_blob_data_from_index(struct index_state *, const char *, unsig
 #define CE_MATCH_IGNORE_MISSING		0x08
 /* enable stat refresh */
 #define CE_MATCH_REFRESH		0x10
+/* only stat refresh, assume content hasn't changed */
+#define CE_MATCH_REFRESH_STAT_ONLY	0x20
 extern int ie_match_stat(const struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
 extern int ie_modified(const struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
 
@@ -643,6 +645,7 @@ extern void fill_stat_cache_info(struct cache_entry *ce, struct stat *st);
 #define REFRESH_IGNORE_MISSING	0x0008	/* ignore non-existent */
 #define REFRESH_IGNORE_SUBMODULES	0x0010	/* ignore submodules */
 #define REFRESH_IN_PORCELAIN	0x0020	/* user friendly output, not "needs update" */
+#define REFRESH_STAT_ONLY	0x0040	/* do not check content but update stat */
 extern int refresh_index(struct index_state *, unsigned int flags, const struct pathspec *pathspec, char *seen, const char *header_msg);
 extern struct cache_entry *refresh_cache_entry(struct cache_entry *, unsigned int);
 
diff --git a/read-cache.c b/read-cache.c
index db5d91064266..e9334094c6f0 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1101,6 +1101,7 @@ static struct cache_entry *refresh_cache_ent(struct index_state *istate,
 	int ignore_valid = options & CE_MATCH_IGNORE_VALID;
 	int ignore_skip_worktree = options & CE_MATCH_IGNORE_SKIP_WORKTREE;
 	int ignore_missing = options & CE_MATCH_IGNORE_MISSING;
+	int assume_content_unchanged = options & CE_MATCH_REFRESH_STAT_ONLY;
 
 	if (!refresh || ce_uptodate(ce))
 		return ce;
@@ -1161,7 +1162,7 @@ static struct cache_entry *refresh_cache_ent(struct index_state *istate,
 		}
 	}
 
-	if (ie_modified(istate, ce, &st, options)) {
+	if (!assume_content_unchanged && ie_modified(istate, ce, &st, options)) {
 		if (err)
 			*err = EINVAL;
 		return NULL;
@@ -1206,11 +1207,13 @@ int refresh_index(struct index_state *istate, unsigned int flags,
 	int quiet = (flags & REFRESH_QUIET) != 0;
 	int not_new = (flags & REFRESH_IGNORE_MISSING) != 0;
 	int ignore_submodules = (flags & REFRESH_IGNORE_SUBMODULES) != 0;
+	int refresh_stat_only = (flags & REFRESH_STAT_ONLY) != 0;
 	int first = 1;
 	int in_porcelain = (flags & REFRESH_IN_PORCELAIN);
 	unsigned int options = (CE_MATCH_REFRESH |
 				(really ? CE_MATCH_IGNORE_VALID : 0) |
-				(not_new ? CE_MATCH_IGNORE_MISSING : 0));
+				(not_new ? CE_MATCH_IGNORE_MISSING : 0) |
+				(refresh_stat_only ? CE_MATCH_REFRESH_STAT_ONLY : 0));
 	const char *modified_fmt;
 	const char *deleted_fmt;
 	const char *typechange_fmt;
-- 
2.11.0


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply related

* Re: Refreshing index timestamps without reading content
From: Duy Nguyen @ 2017-01-09 12:02 UTC (permalink / raw)
  To: Quentin Casasnovas; +Cc: Git Mailing List
In-Reply-To: <20170105112359.GN8116@chrystal.oracle.com>

On Thu, Jan 5, 2017 at 6:23 PM, Quentin Casasnovas
<quentin.casasnovas@oracle.com> wrote:
> Hi guys,
>
> Apologies if this is documented somewhere, I have fairly bad search vudu
> skills.
>
> I'm looking for a way to cause a full refresh of the index without causing
> any read of the files, basically telling git "trust me, all worktree files
> are matching the index, but their stat information have changed".  I have
> read about the update-index --assume-unchanged and --skip-worktree flags in
> the documentation, but these do not cause any index refresh - rather, they
> fake that the respective worktree files are matching the index until you
> remove those assume-unchanged/skip-worktree bits.
>
> This might sound like a really weird thing to do, but I do have a use case
> for it - we have some build farm setup where the resulting objects of a
> compilation are stored on a shared server.  The source files are not stored
> on the shared server, but locally on each of the build server (as to
> decrease network load and make good use of local storage as caches).
>
> We then use an onion filesystem to mount the compiled objects on top of the
> local sources - and change the modification time of the source to be older
> than the object files, so that on subsequent builds, make does not rebuild
> the whole world.
>
> This works fine except for one thing, after changing the mtime of the
> source files, the first subsequent git command needing to compare the tree
> with the index will take a LONG time since it will read all of the object
> content:
>
>   cd linux-2.6
>
>   # Less than a second  when the index is up to date
>   time git status > /dev/null
>   git status 0.06s user 0.09s system 172% cpu 0.087 total
>                                               ~~~~~~~~~~~
>
>   # Change the mtime..
>   git ls-tree -r --name-only HEAD | xargs -n 1024 touch
>
>   # Now 30s..
>   time git status > /dev/null
>   git status  2.73s user 1.79s system 13% cpu 32.453 total
>                                               ~~~~~~~~~~~~
>
> The timing information above was captured on my laptop SSD and the penalty
> is obviously much higher on spinning disks - especially when this operation
> is done on *hundreds* of different work tree in parallel, all hosted on the
> same filesystem (it can take tens of minutes!).
>
> Is there any way to tell git, after the git ls-tree command above, to
> refresh its stat cache information and trust us that the file content has
> not changed, as to avoid any useless file read (though it will obviously
> will have to stat all of them, but that's not something we can really
> avoid)

I don't think there's any way to do that, unfortunately.

> If not, I am willing to implement a --assume-content-unchanged to the git
> update-index if you guys don't see something fundamentally wrong with this
> approach.

If you do that, I think you should go with either of the following options

- Extend git-update-index --index-info to take stat info as well (or
maybe make a new option instead). Then you can feed stat info directly
to git without a use-case-specific "assume-content-unchanged".

- Add "git update-index --touch" that does what "touch" does. In this
case, it blindly updates stat info to latest. But like touch, we can
also specify  mtime from command line if we need to. It's a bit less
generic than the above option, but easier to use.

Caveat: The options I'm proposing can be rejected. So maybe wait a bit
to see how people feel and perhaps send an RFC patch, again to gauge
the reception.

-- 
Duy

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox