Git development
 help / color / mirror / Atom feed
* Re: [PATCH 2/5] vcs-svn: Start working on the dumpfile producer
From: Ramkumar Ramachandra @ 2011-01-22  9:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git List, Jonathan Nieder, David Barr, Sverre Rabbelier
In-Reply-To: <7v39olok4l.fsf@alter.siamese.dyndns.org>

Hi Junio,

Junio C Hamano writes:
> Ramkumar Ramachandra <artagnon@gmail.com> writes:
> > Start off with some broad design sketches. Compile succeeds, but
> > parser is incorrect. Include a Makefile rule to build it into
> > vcs-svn/lib.a.
> >
> > Signed-off-by: Ramkumar Ramachandra <artagnon@gmail.com>
> 
> I initially thought I should refrain from doing a usual picky review for
> contrib/ material, but realized this does not go to contrib/, so...

Thanks for the review. A new iteration including tests is almost
ready- your review came in at the perfect time :)

> > diff --git a/vcs-svn/dump_export.c b/vcs-svn/dump_export.c
> > new file mode 100644
> > index 0000000..04ede06
> > --- /dev/null
> > +++ b/vcs-svn/dump_export.c
> > ...
> > +void dump_export_begin_rev(int revision, const char *revprops,
> > +			int prop_len) {
> 
> Style. "{" at the beginning of the next line (same everywhere).

Fixed.

> > +void dump_export_node(const char *path, enum node_kind kind,
> > +		enum node_action action, unsigned long text_len,
> > +		unsigned long copyfrom_rev, const char *copyfrom_path) {
> > +	printf("Node-path: %s\n", path);
> > +	printf("Node-kind: ");
> > +	switch (action) {
> > +	case NODE_KIND_NORMAL:
> > +		printf("file\n");
> > +		break;
> > +	case NODE_KIND_EXECUTABLE:
> > +		printf("file\n");
> > +		break;
> > +	case NODE_KIND_SYMLINK:
> > +		printf("file\n");
> > +		break;
> > +	case NODE_KIND_GITLINK:
> > +		printf("file\n");
> > +		break;
> > +	case NODE_KIND_SUBDIR:
> > +		die("Unsupported: subdirectory");
> > +	default:
> > +		break;
> > +	}
> 
> Hmph, is it expected that the repertoire of node-kind stay the same, or
> will it be extended over time?  It might make sense to change the above
> switch a more table-driven one.  The same for node-action that follows
> this part of the code.

I needed more than a lookup table, because I wanted the flexibility to
execute arbitrary statements in each case. It's extended in the latest
version (will come up on the list soon).

> > diff --git a/vcs-svn/svnload.c b/vcs-svn/svnload.c
> > new file mode 100644
> > index 0000000..7043ae7
> > --- /dev/null
> > +++ b/vcs-svn/svnload.c
> > @@ -0,0 +1,294 @@
> > ...
> > +static struct strbuf blobs[100];
> > +	
> 
> Is there a rationale for "100", or is it just a random number that can be
> tweakable at the source level?  Would we want to have a symbolic constant
> for it?
> 
> The "blank" line has a trailing whitespace.

Fixed. I used this to illustrate the problem with persisting blobs- in
this version, I persist 100 blobs; this is totally arbitrary and
inextensible. Since the latest version leverages the --inline-blobs
feature of fast export, this is totally unecessary.

> > +static struct {
> > +	unsigned long prop_len, text_len, copyfrom_rev, mark;
> > +	int text_delta, prop_delta; /* Boolean */
> > ...
> > +static void reset_node_ctx(void)
> > ...
> > +	node_ctx.text_delta = -1;
> > +	node_ctx.prop_delta = -1;
> 
> Does your "Boolean" type take values 0 or -1?  Or is it perhaps a tristate
> false=0, true=1, unknown=-1?  If so, the comment on the type above needs
> to be fixed.

Fixed. Thanks for pointing this out.

> > +	strbuf_reset(&node_ctx.copyfrom_path);
> > +	strbuf_reset(&node_ctx.path);
> > +}
> > +
> > +static void populate_props(struct strbuf *props, const char *author,
> > +			const char *log, const char *date) {
> 
> Style on "{" (look everywhere).

Fixed.

> > +	strbuf_reset(props);	
> 
> Trailing whitespace.

Fixed.

> > ...
> > +	t ++;
> 
> Unwanted SP before "++" (look everywhere).

Fixed.

> > +	tm_time = time_to_tm(strtoul(t, NULL, 10), atoi(tz_off));
> 
> Has your caller already verified tz_off is a reasonable integer?

Fixed.

> > ...
> > +			if (!memcmp(t, "D", 1)) {
> > +				node_ctx.action = NODE_ACTION_DELETE;
> > +			}
> > +			else if (!memcmp(t, "C", 1)) {
> 
> Style: "} else if (...) {".

Fixed.

> > ...
> > +					node_ctx.kind = NODE_KIND_GITLINK;
> > +				else if (!memcmp(val, "040000", 6))
> 
> Is <mode> guaranteed to be a 6-digit octal, padded with 0 to the left?
> 
> The manual page of git-fast-import seems to hint that is the case, but I
> am double checking, as the leading zero is an error in the tree object.

The documentation and commit messages seem to hint that this is the
case. I'm digging to see if there's anything wrong with this.

> > +					node_ctx.kind = NODE_KIND_SUBDIR;
> > +				else {
> > +					if (!memcmp(val, "755", 3))
> > +						node_ctx.kind = NODE_KIND_EXECUTABLE;
> > +					else if(!memcmp(val, "644", 3))
> 
> Style; missing SP after "if/switch" (look everywhere).

Fixed.

> > +						node_ctx.kind = NODE_KIND_NORMAL;
> > +					else
> > +						die("Unrecognized mode: %s", val);
> > +					mode_incr = 4;
> > +				}
> > +				val += mode_incr;
> 
> Hmm, this whole thing is ugly.
> 
> Perhaps a table-lookup, or at least a helper function that translates
> "val" to node-kind (while advancing val as the side effect) may make it
> easier to read?

Fixed. I used a helper function.

> > ...
> > +					to_dump = &blobs[strtoul(val + 1, NULL, 10)];
> 
> Has anybody so far verified the decimal number at (val+1) is a reasonable
> value, or am I seeing a future CVE here?

Fixed.

> Do we care what follows the len of digits on this line?  Does a line in G-F-I
> stream allow trailing garbage (this question is not limited to this part
> of the code)?

Hm, I haven't thought about this hard enough. Currently, both svn-fi
and svn-fe ignore any trailing garbage/ commands they don't
understand. Are there some issues that we haven't anticipated?

> > +int svnload_init(const char *filename)
> > ...
> > +void svnload_deinit(void)
>
> Whoever needs to add an element to rev_ctx must consistently touch reset,
> init and deinit.  Would it help him/her if you moved the code so that
> these functions are close together from the beginning?

Fixed.

-- Ram

^ permalink raw reply

* Re: [PATCH 4/5] fast-export: Introduce --inline-blobs
From: Ramkumar Ramachandra @ 2011-01-22  9:24 UTC (permalink / raw)
  To: Drew Northup
  Cc: Jonathan Nieder, Junio C Hamano, Git List, David Barr,
	Sverre Rabbelier
In-Reply-To: <1295531623.4298.26.camel@drew-northup.unet.maine.edu>

Hi Drew,

Drew Northup writes:
> On Thu, 2011-01-20 at 10:20 +0530, Ramkumar Ramachandra wrote:
> > Hi,
> > Jonathan Nieder writes:
> > > Junio C Hamano wrote:
> > > > Ramkumar Ramachandra <artagnon@gmail.com> writes:
> > > > Just thinking aloud, but is it possible to write a filter that converts an
> > > > arbitrary G-F-I stream with referenced blobs into a G-F-I stream without
> > > > referenced blobs by inlining all the blobs?
> > > 
> > > to avoid complexity in the svn fast-import backend itself.
> > > (Complicating detail: such a filter would presumably take responsibility
> > > for --export-marks, so it might want a way to retrieve commit marks
> > > from its downstream.)
> > 
> > This filter will need to persist every blob for the entire lifetime of
> > the program. We can't possibly do it in-memory, so we have to find
> > some way to persist them on-disk and retrieve them very
> > quickly. Jonathan suggested using something like ToyoCabinet earlier-
> > I'll start working and see what I come up with.
> 
> Is it worth including the extra dependency? Most systems that I'm in
> frequent contact with already have some lightweight BDB implementation
> already. I don't currently know of any with TokyoCabinet (or
> KyotoCabinet for that matter) already in place. Besides, if all you're
> doing is persisting blobs that you're likely to write out to disk
> eventually anyway you might as well just do so once you have them and
> keep an "index" (not to be confused with the Git Index, just lacking a
> better word right now) of what you have in some standard in-memory
> format (a heap?). From there you can build each commit into the Git
> Index in the proper order once you have the required parts for
> each--perhaps even re-using the blobs you've already dumped to disk
> (mv'ing them or something).

Agreed. I wouldn't like to introduce an extra dependency either. I was
talking about using it for prototyping- if the final version includes
an extra dependency, it's unlikely to get merged into git.git :) The
final design will probably use an in-memory B+ tree, but I haven't
thought about that hard enough.

-- Ram

^ permalink raw reply

* [FYI/PATCH] vcs-svn: give control over temporary file names
From: Jonathan Nieder @ 2011-01-22  6:42 UTC (permalink / raw)
  To: git; +Cc: David Barr, Thomas Rast, Ramkumar Ramachandra
In-Reply-To: <20110103031059.GE10143@burratino>

Allow users of the line_buffer library to specify what directory and
filename to use for temporary files.  For example:

	struct line_buffer tmp = LINE_BUFFER_INIT;
	if (buffer_tmpfile_init(&tmp, ".git", "svnfe_blob.XXXXXX"))
		die_errno("opening temporary file");
	...
	if (buffer_deinit(&tmp))
		die_errno("removing temporary file");

On Windows, something like this is needed if users without write
permission to the root directory are to be able to use temporary
files.

Unlike the implementation using tmpfile, this would not take
care of automatically removing the temporary file on exit.  The
user is responsible for now for installing appropriate signal and
atexit handlers to take care of that.

Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
---
I have an odd feeling this will be necessary at some point but I'm not
sure if it's a good idea or not.  Although tmpfile(3) is a limited
interface, it's pretty much exactly what we want.

See [1] for tmpfile on Windows.

This is just to get the idea out there.  I don't think this patch's
moment has come (though I'd be interested in your thoughts either
way).

[1] http://msdn.microsoft.com/en-us/library/x8x7sakw.aspx

 vcs-svn/line_buffer.c |   35 ++++++++++++++++++++++++++++++-----
 vcs-svn/line_buffer.h |    6 ++++--
 2 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/vcs-svn/line_buffer.c b/vcs-svn/line_buffer.c
index aedf105..4131e08 100644
--- a/vcs-svn/line_buffer.c
+++ b/vcs-svn/line_buffer.c
@@ -3,7 +3,7 @@
  * See LICENSE for details.
  */
 
-#include "git-compat-util.h"
+#include "cache.h"
 #include "line_buffer.h"
 #include "strbuf.h"
 
@@ -25,12 +25,33 @@ int buffer_fdinit(struct line_buffer *buf, int fd)
 	return 0;
 }
 
-int buffer_tmpfile_init(struct line_buffer *buf)
+int buffer_tmpfile_init(struct line_buffer *buf,
+			const char *dirname, const char *pattern)
 {
-	buf->infile = tmpfile();
-	if (!buf->infile)
-		return -1;
+	int fd, saved_errno;
+	int mode = 0444;	/* just remove write permission. */
+	strbuf_addstr(&buf->temp_filename, dirname);
+	strbuf_addch(&buf->temp_filename, '/');
+	strbuf_addstr(&buf->temp_filename, pattern);
+
+	fd = git_mkstemp_mode(buf->temp_filename.buf, mode);
+	if (fd < 0) {
+		saved_errno = errno;
+		goto fail_mktemp;
+	}
+	buf->infile = fdopen(fd, "r+");
+	if (!buf->infile) {
+		saved_errno = errno;
+		goto fail_fdopen;
+	}
 	return 0;
+
+fail_fdopen:
+	close(fd);
+fail_mktemp:
+	strbuf_reset(&buf->temp_filename);
+	errno = saved_errno;
+	return -1;
 }
 
 int buffer_deinit(struct line_buffer *buf)
@@ -40,6 +61,9 @@ int buffer_deinit(struct line_buffer *buf)
 		return ferror(buf->infile);
 	err = ferror(buf->infile);
 	err |= fclose(buf->infile);
+	if (buf->temp_filename.len)
+		err |= unlink_or_warn(buf->temp_filename.buf);
+	strbuf_reset(&buf->temp_filename);
 	return err;
 }
 
@@ -129,4 +153,5 @@ void buffer_skip_bytes(struct line_buffer *buf, uint32_t len)
 void buffer_reset(struct line_buffer *buf)
 {
 	strbuf_release(&buf->blob_buffer);
+	strbuf_release(&buf->temp_filename);
 }
diff --git a/vcs-svn/line_buffer.h b/vcs-svn/line_buffer.h
index 96ce966..bd1a621 100644
--- a/vcs-svn/line_buffer.h
+++ b/vcs-svn/line_buffer.h
@@ -9,15 +9,17 @@ struct line_buffer {
 	char line_buffer[LINE_BUFFER_LEN];
 	struct strbuf blob_buffer;
 	FILE *infile;
+	struct strbuf temp_filename;
 };
-#define LINE_BUFFER_INIT {"", STRBUF_INIT, NULL}
+#define LINE_BUFFER_INIT {"", STRBUF_INIT, NULL, STRBUF_INIT}
 
 int buffer_init(struct line_buffer *buf, const char *filename);
 int buffer_fdinit(struct line_buffer *buf, int fd);
 int buffer_deinit(struct line_buffer *buf);
 void buffer_reset(struct line_buffer *buf);
 
-int buffer_tmpfile_init(struct line_buffer *buf);
+int buffer_tmpfile_init(struct line_buffer *buf,
+		const char *directory, const char *pattern);
 FILE *buffer_tmpfile_rewind(struct line_buffer *buf);	/* prepare to write. */
 long buffer_tmpfile_prepare_to_read(struct line_buffer *buf);
 
-- 
1.7.4.rc2

^ permalink raw reply related

* Re: Fwd: Git and Large Binaries: A Proposed Solution
From: Sverre Rabbelier @ 2011-01-22  3:05 UTC (permalink / raw)
  To: Eric Montellese, Avery Pennarun, Jonathan Leto; +Cc: Jeff King, git, Joey Hess
In-Reply-To: <AANLkTinKNtDDy6Pi4Tn+hpTrVw_DBoYpTn3ihCfN_fUd@mail.gmail.com>

Heya,

[More or less separate from to the ongoing discussion, so no text quoted]

Eric, at the last GitTogether Avery presented his tool, bup, which
implements a number of solutions to the problem of large binary files.
I think I remember that Jonathan is also interested in the topic.
Avery, Jonathan, you can read up on the ongoing conversation at [0] if
you like :).

[0] http://thread.gmane.org/gmane.comp.version-control.git/165389/focus=165401

-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply

* Re: [PATCH 4/5] fast-export: Introduce --inline-blobs
From: Junio C Hamano @ 2011-01-22  0:30 UTC (permalink / raw)
  To: Ramkumar Ramachandra
  Cc: Git List, Jonathan Nieder, David Barr, Sverre Rabbelier
In-Reply-To: <1295415899-1192-5-git-send-email-artagnon@gmail.com>

Ramkumar Ramachandra <artagnon@gmail.com> writes:

> +				buf = read_sha1_file(spec->sha1, &type, &size);
> +				if (!buf)
> +					die ("Could not read blob %s",
> +						sha1_to_hex(spec->sha1));

Style; no SP after "die", as it is a function, not syntactic elements like
if/switch/while (look everywhere).

^ permalink raw reply

* Re: [PATCH 2/5] vcs-svn: Start working on the dumpfile producer
From: Junio C Hamano @ 2011-01-22  0:30 UTC (permalink / raw)
  To: Ramkumar Ramachandra
  Cc: Git List, Jonathan Nieder, David Barr, Sverre Rabbelier
In-Reply-To: <1295415899-1192-3-git-send-email-artagnon@gmail.com>

Ramkumar Ramachandra <artagnon@gmail.com> writes:

> Start off with some broad design sketches. Compile succeeds, but
> parser is incorrect. Include a Makefile rule to build it into
> vcs-svn/lib.a.
>
> Signed-off-by: Ramkumar Ramachandra <artagnon@gmail.com>

I initially thought I should refrain from doing a usual picky review for
contrib/ material, but realized this does not go to contrib/, so...

> diff --git a/vcs-svn/dump_export.c b/vcs-svn/dump_export.c
> new file mode 100644
> index 0000000..04ede06
> --- /dev/null
> +++ b/vcs-svn/dump_export.c
> @@ -0,0 +1,73 @@
> +/*
> + * Licensed under a two-clause BSD-style license.
> + * See LICENSE for details.
> + */
> +
> +#include "git-compat-util.h"
> +#include "strbuf.h"
> +#include "line_buffer.h"
> +#include "dump_export.h"
> +
> +void dump_export_begin_rev(int revision, const char *revprops,
> +			int prop_len) {

Style. "{" at the beginning of the next line (same everywhere).

> +void dump_export_node(const char *path, enum node_kind kind,
> +		enum node_action action, unsigned long text_len,
> +		unsigned long copyfrom_rev, const char *copyfrom_path) {
> +	printf("Node-path: %s\n", path);
> +	printf("Node-kind: ");
> +	switch (action) {
> +	case NODE_KIND_NORMAL:
> +		printf("file\n");
> +		break;
> +	case NODE_KIND_EXECUTABLE:
> +		printf("file\n");
> +		break;
> +	case NODE_KIND_SYMLINK:
> +		printf("file\n");
> +		break;
> +	case NODE_KIND_GITLINK:
> +		printf("file\n");
> +		break;
> +	case NODE_KIND_SUBDIR:
> +		die("Unsupported: subdirectory");
> +	default:
> +		break;
> +	}

Hmph, is it expected that the repertoire of node-kind stay the same, or
will it be extended over time?  It might make sense to change the above
switch a more table-driven one.  The same for node-action that follows
this part of the code.

> diff --git a/vcs-svn/svnload.c b/vcs-svn/svnload.c
> new file mode 100644
> index 0000000..7043ae7
> --- /dev/null
> +++ b/vcs-svn/svnload.c
> @@ -0,0 +1,294 @@
> ...
> +#define SVN_DATE_FORMAT "%Y-%m-%dT%H:%M:%S.000000Z"
> +#define SVN_DATE_LEN 28
> +#define LENGTH_UNKNOWN (~0)
> +
> +static struct line_buffer input = LINE_BUFFER_INIT;
> +static struct strbuf blobs[100];
> +	

Is there a rationale for "100", or is it just a random number that can be
tweakable at the source level?  Would we want to have a symbolic constant
for it?

The "blank" line has a trailing whitespace.

> +static struct {
> +	unsigned long prop_len, text_len, copyfrom_rev, mark;
> +	int text_delta, prop_delta; /* Boolean */
> +	enum node_action action;
> +	enum node_kind kind;
> +	struct strbuf copyfrom_path, path;
> +} node_ctx;
> + ...
> +static void reset_node_ctx(void)
> +{
> +	node_ctx.prop_len = LENGTH_UNKNOWN;
> +	node_ctx.text_len = LENGTH_UNKNOWN;
> +	node_ctx.mark = 0;
> +	node_ctx.copyfrom_rev = 0;
> +	node_ctx.text_delta = -1;
> +	node_ctx.prop_delta = -1;

Does your "Boolean" type take values 0 or -1?  Or is it perhaps a tristate
false=0, true=1, unknown=-1?  If so, the comment on the type above needs
to be fixed.

> +	strbuf_reset(&node_ctx.copyfrom_path);
> +	strbuf_reset(&node_ctx.path);
> +}
> +
> +static void populate_props(struct strbuf *props, const char *author,
> +			const char *log, const char *date) {

Style on "{" (look everywhere).

> +	strbuf_reset(props);	

Trailing whitespace.

> +static void parse_author_line(char *val, struct strbuf *name,
> +			struct strbuf *email, struct strbuf *date) {
> +	char *t, *tz_off;
> +	char time_buf[SVN_DATE_LEN];
> +	const struct tm *tm_time;
> +
> +	/* Simon Hausmann <shausman@trolltech.com> 1170199019 +0100 */
> +	strbuf_reset(name);
> +	strbuf_reset(email);
> +	strbuf_reset(date);
> +	tz_off = strrchr(val, ' ');
> +	*tz_off++ = '\0';
> +	t = strrchr(val, ' ');
> +	*(t - 1) = '\0'; /* Ignore '>' from email */
> +	t ++;

Unwanted SP before "++" (look everywhere).

> +	tm_time = time_to_tm(strtoul(t, NULL, 10), atoi(tz_off));

Has your caller already verified tz_off is a reasonable integer?

> +void svnload_read(void) {
> ...
> +		switch (val - t - 1) {
> +		case 1:
> +			if (!memcmp(t, "D", 1)) {
> +				node_ctx.action = NODE_ACTION_DELETE;
> +			}
> +			else if (!memcmp(t, "C", 1)) {

Style: "} else if (...) {".

> +				node_ctx.action = NODE_ACTION_ADD;
> +			}
> +			else if (!memcmp(t, "R", 1)) {
> +				node_ctx.action = NODE_ACTION_REPLACE;
> +			}
> +			else if (!memcmp(t, "M", 1)) {
> +				node_ctx.action = NODE_ACTION_CHANGE;
> +				mode_incr = 7;
> +				if (!memcmp(val, "100644", 6))
> +					node_ctx.kind = NODE_KIND_NORMAL;
> +				else if (!memcmp(val, "100755", 6))
> +					node_ctx.kind = NODE_KIND_EXECUTABLE;
> +				else if (!memcmp(val, "120000", 6))
> +					node_ctx.kind = NODE_KIND_SYMLINK;
> +				else if (!memcmp(val, "160000", 6))
> +					node_ctx.kind = NODE_KIND_GITLINK;
> +				else if (!memcmp(val, "040000", 6))

Is <mode> guaranteed to be a 6-digit octal, padded with 0 to the left?

The manual page of git-fast-import seems to hint that is the case, but I
am double checking, as the leading zero is an error in the tree object.

> +					node_ctx.kind = NODE_KIND_SUBDIR;
> +				else {
> +					if (!memcmp(val, "755", 3))
> +						node_ctx.kind = NODE_KIND_EXECUTABLE;
> +					else if(!memcmp(val, "644", 3))

Style; missing SP after "if/switch" (look everywhere).

> +						node_ctx.kind = NODE_KIND_NORMAL;
> +					else
> +						die("Unrecognized mode: %s", val);
> +					mode_incr = 4;
> +				}
> +				val += mode_incr;

Hmm, this whole thing is ugly.

Perhaps a table-lookup, or at least a helper function that translates
"val" to node-kind (while advancing val as the side effect) may make it
easier to read?

> +				t = strchr(val, ' ');
> +				*t++ = '\0';
> +				strbuf_reset(&node_ctx.path);
> +				strbuf_add(&node_ctx.path, t, strlen(t));
> +				if (!memcmp(val + 1, "inline", 6))
> +					die("Unsupported dataref: inline");
> +				else if (*val == ':')
> +					to_dump = &blobs[strtoul(val + 1, NULL, 10)];

Has anybody so far verified the decimal number at (val+1) is a reasonable
value, or am I seeing a future CVE here?

Do we care what follows the len of digits on this line?  Does a line in G-F-I
stream allow trailing garbage (this question is not limited to this part
of the code)?
> +int svnload_init(const char *filename)
> +{
> +	int i;
> +	if (buffer_init(&input, filename))
> +		return error("cannot open %s: %s", filename, strerror(errno));
> +	active_ctx = UNKNOWN_CTX;
> +	strbuf_init(&rev_ctx.props, MAX_GITSVN_LINE_LEN);
> +	...
> +	strbuf_init(&node_ctx.copyfrom_path, MAX_GITSVN_LINE_LEN);
> +	for (i = 0; i < 100; i ++)
> +		strbuf_init(&blobs[i], 10000);
> +	return 0;
> +}
> +
> +void svnload_deinit(void)
> +{
> +	int i;
> +	reset_rev_ctx(0);
> +	reset_node_ctx();
> +	strbuf_release(&rev_ctx.props);
> +	...
> +	strbuf_release(&node_ctx.copyfrom_path);
> +	for (i = 0; i < 100; i ++)
> +		strbuf_release(&blobs[i]);
> +	if (buffer_deinit(&input))
> +		fprintf(stderr, "Input error\n");
> +	if (ferror(stdout))
> +		fprintf(stderr, "Output error\n");
> +}

Whoever needs to add an element to rev_ctx must consistently touch reset,
init and deinit.  Would it help him/her if you moved the code so that
these functions are close together from the beginning?

^ permalink raw reply

* Re: Fwd: Git and Large Binaries: A Proposed Solution
From: Joey Hess @ 2011-01-22  0:07 UTC (permalink / raw)
  To: git
In-Reply-To: <AANLkTimPua_kz2w33BRPeTtOEWOKDCsJzf0sqxm=db68@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4113 bytes --]

Hi, I wrote git-annex, and pristine-tar, and etckeeper. I enjoy making
git do things that I'm told it shouldn't be used for. :) I should have
probably talked more about git-annex here, before.

Eric Montellese wrote:
> 2. zipped tarballs of source code (that I will never need to modify)
> -- I could unpack these and then use git to track the source code.
> However, I prefer to track these deliverables as the tarballs
> themselves because it makes my customer happier to see the exact
> tarball that they delivered being used when I repackage updates.
> (Let's not discuss problems with this model - I understand that this
> is non-ideal).

In this specific case, you can use pristine-tar to recreate the
original, exact tarballs from unpacked source files that you check into
git. It accomplishes this without the overhead of duplicating compressed
data in tarballs. I feel in this case, this is a better approach than
generic large file support, since it stores all the data in git, just in a
much more compressed form, and so fits in nicely with standard git-based
source code management.

> The short version:
> ***Don't track binaries in git.  Track their hashes.***

That was my principle with git-annex. Although slightly generalized to:
"Don't track large file contents in git. Track unique keys that
an arbitrary backend can use to obtain the file contents."

Now, you mention in a followup that git-annex does not default to keeping
a local copy of every binary referenced by a file in master.
This is true, for the simple reason that a copy of every file in some of
my git repos master would sum to multiple terabytes of data. :) I think
that practically, anything that supports large files in git needs to
support partial checkouts too.

But, git-annex can be run in eg, a post-merge hook, and asked to
retrieve all current file contents, and drop outdated contents.

> First the layout:
> my_git_project/binrepo/
> -- binaries/
> -- hashes/
> -- symlink_to_hashes_file
> -- symlink_to_another_hashes_file
> within the "binrepo" (binary repository) there is a subdirectory for
> binaries, and a subdirectory for hashes.  In the root of the 'binrepo'
> all of the files stored have a symlink to the current version of the
> hash.

Very similar to git-annex in the use of versioned symlinks here.
It stores the binaries in .git/annex/objects to avoid needing to
gitignore them.

> 3. In my setup, all of the binary files are in a single "binrepo"
> directory.  If done from within git, we would need a non-kludgey way
> to allow large binaries to exist anywhere within the git tree.

git-annex allows the symlinks to be mixed with regular git managed
content throughout the repository. (This means that when symlinks
are moved, they may need to be fixed, which is done at commit time.)

> 5. Command to purge all binaries in your "binrepo" that are not needed
> for the current revision (if you're running out of disk space
> locally).

Safely dropping data is really one of the complexities of this
approach. Git-annex stores location tracking information in git,
so it can know where it can retrieve file data *from*. I chose to make
it very cautious about removing data, as location tracking data can 
fall out of date (if for example, a remote had the data, had dropped it,
and has not pushed that information out). So it actively confirms that
enough other copies of the data currently exist before dropping it.
(Of course, these checks can be disabled.)

> 6. Automatically upload new versions of files to the "binrepo" (rather
> than needing to do this manually)

In git-annex, data transfer is done using rsync, so that interrupted
transfers of large files can be resumed. I recently added a git-annex-shell
to support locked-down access, similar to git-shell.


BTW, I have been meaning to look into using smudge filters with git-annex.
I'm a bit worried about some of the potential overhead associated with
smudge filters, and I'm not sure how a partial checkout would work with
them.

-- 
see shy jo

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply

* Re: Fwd: Git and Large Binaries: A Proposed Solution
From: Eric Montellese @ 2011-01-21 23:15 UTC (permalink / raw)
  To: Jeff King; +Cc: git
In-Reply-To: <20110121222440.GA1837@sigill.intra.peff.net>

Peff,

Thanks for your insight -- this looks great.

Once something like this is available and more polished, what's the
process to request that it join the main line of git development?   (i
know functionally there's "no main line" in git... but you know what I
mean)

Has there already been discussion to this effect?  I do think that a
fix like this would improve git adoption among certain groups.  (I
know I've heard the "big binaries" problem mentioned at least a few
times)


I haven't dug around in git code yet, so while I can get the gist of
your code, I'm unable to get the complete picture.  You wouldn't
happen to have a git patch, or a public repo somewhere that I can take
a look at?  Does there happen to be a git developers guide hidden away
anywhere?  Though I have very limited time, I'd be happy to help out
as much as I can.


Eric







On Fri, Jan 21, 2011 at 5:24 PM, Jeff King <peff@peff.net> wrote:
> On Fri, Jan 21, 2011 at 01:57:21PM -0500, Eric Montellese wrote:
>
>> I did a search for this issue before posting, but if I am touching on
>> an old topic that already has a solution in progress, I apologize.  As
>
> It's been talked about a lot, but there is not exactly a solution in
> progress. One promising direction is not very different from what you're
> doing, though:
>
>> Solution:
>> The short version:
>> ***Don't track binaries in git.  Track their hashes.***
>
> Yes, exactly. But what your solution lacks, I think, is more integration
> into git. Specifically, using clean/smudge filters you can have git take
> care of tracking the file contents automatically.
>
> At the very simplest, it would look like:
>
> -- >8 --
> cat >$HOME/local/bin/huge-clean <<'EOF'
> #!/bin/sh
>
> # In an ideal world, we could actually
> # access the original file directly instead of
> # having to cat it to a new file.
> temp="$(git rev-parse --git-dir)"/huge.$$
> cat >"$temp"
> sha1=`sha1sum "$temp" | cut -d' ' -f1`
>
> # now move it to wherever your permanent storage is
> # scp "$root/$sha1" host:/path/to/big_storage/$sha1
> cp "$temp" /tmp/big_storage/$sha1
> rm -f "$temp"
>
> echo $sha1
> EOF
>
> cat >$HOME/local/bin/huge-smudge <<'EOF'
> #!/bin/sh
>
> # Get sha1 from stored blob via stdin
> read sha1
>
> # Now retrieve blob. We could optionally do some caching here.
> # ssh host cat /path/to/big/storage/$sha1
> cat /tmp/big_storage/$sha1
> EOF
> -- 8< --
>
> Obviously our storage mechanism (throwing things in /tmp) is simplistic,
> but obviously you could store and retrieve via ssh, http, s3, or
> whatever.
>
> You can try it out like this:
>
>  # set up our filter config and fake storage area
>  mkdir /tmp/big_storage
>  git config --global filter.huge.clean huge-clean
>  git config --global filter.huge.smudge huge-smudge
>
>  # now make a repo, and make sure we mark *.bin files as huge
>  mkdir repo && cd repo && git init
>  echo '*.bin filter=huge' >.gitattributes
>  git add .gitattributes
>  git commit -m 'add attributes'
>
>  # let's do a moderate 20M file
>  perl -e 'print "foo\n" for (1 .. 5000000)' >foo.bin
>  git add foo.bin
>  git commit -m 'add huge file (foo)'
>
>  # and then another revision
>  perl -e 'print "bar\n" for (1 .. 5000000)' >foo.bin
>  git commit -a -m 'revise huge file (bar)'
>
> Notice that we just add and commit as normal.  And we can check that the
> space usage is what you expect:
>
>  $ du -sh repo/.git
>  196K    repo/.git
>  $ du -sh /tmp/big_storage
>  39M     /tmp/big_storage
>
> Diffs obviously are going to be less interesting, as we just see the
> hash:
>
>  $ git log --oneline -p foo.bin
>  39e549c revise huge file (bar)
>  diff --git a/foo.bin b/foo.bin
>  index 281fd03..70874bd 100644
>  --- a/foo.bin
>  +++ b/foo.bin
>  @@ -1 +1 @@
>  -50a1ee265f4562721346566701fce1d06f54dd9e
>  +bbc2f7f191ad398fe3fcb57d885e1feacb4eae4e
>  845836e add huge file (foo)
>  diff --git a/foo.bin b/foo.bin
>  new file mode 100644
>  index 0000000..281fd03
>  --- /dev/null
>  +++ b/foo.bin
>  @@ -0,0 +1 @@
>  +50a1ee265f4562721346566701fce1d06f54dd9e
>
> but if you wanted to, you could write a custom diff driver that does
> something more meaningful with your particular binary format (it would
> have to grab from big_storage, though).
>
> Checking out other revisions works without extra action:
>
>  $ head -n 1 foo.bin
>  bar
>  $ git checkout HEAD^
>  HEAD is now at 845836e... add huge file (foo)
>  $ head -n 1 foo.bin
>  foo
>
> And since you have the filter config in your ~/.gitconfig, clones will
> just work:
>
>  $ git clone repo other
>  $ du -sh other/.git
>  204K    other/.git
>  $ du -sh other/foo.bin
>  20M
>
> So conceptually it's pretty similar to yours, but the filter integration
> means that git takes care of putting the right files in place at the
> right time.
>
> It would probably benefit a lot from caching the large binary files
> instead of hitting big_storage all the time. And probably the
> putting/getting from storage should be factored out so you can plug in
> different storage. And it should all be configurable. Different users of
> the same repo might want different caching policies, or to access the
> binary assets by different mechanisms or URLs.
>
>> I imagine these features (among others):
>>
>> 1. In my current setup, each large binary file has a different name (a
>> revision number).  This could be easily solved, however, by generating
>> unique names under the hood and tracking this within git.
>
> In the scheme above, we just index by their hash. So you can easily fsck
> your big_storage by making sure everything matches its hash (but you
> can't know that you have _all_ of the blobs needed unless you
> cross-reference with the history).
>
>> 2. A lot of the steps in my current setup are manual.  When I want to
>> add a new binary file, I need to manually create the hash and manually
>> upload the binary to the joint server.  If done within git, this would
>> be automatic.
>
> I think the scheme above takes care of the manual bits.
>
>> 3. In my setup, all of the binary files are in a single "binrepo"
>> directory.  If done from within git, we would need a non-kludgey way
>> to allow large binaries to exist anywhere within the git tree.  If git
>
> Any scheme, whether it uses clean/smudge filters or not, should probably
> tie in via gitattributes.
>
>> 4. User option to download all versions of all binaries, or only the
>> version necessary for the position on the current branch.  If you want
>> to be able to run all versions of the repository when offline, you can
>> download all versions of all binaries.  If you don't need to do this,
>> you can just download the versions you need.  Or perhaps have the
>> option to download all binaries smaller than X-bytes, but skip the big
>> ones.
>
> The scheme above will download on an as-needed basis. If caching were
> implemented, you could just make the cache infinitely big and do a "git
> log -p" which would download everything. :)
>
> Probably you would also want the smudge filter to return "blob not
> available" when operating in some kind of offline mode.
>
>> 5. Command to purge all binaries in your "binrepo" that are not needed
>> for the current revision (if you're running out of disk space
>> locally).
>
> In my scheme, just rm your cache directory (once it exists).
>
>> 6. Automatically upload new versions of files to the "binrepo" (rather
>> than needing to do this manually)
>
> Handled by the clean filter above.
>
>
> So obviously this is not very complete. And there are a few changes to
> git that could make it more efficient (e.g., letting the clean filter
> touch the file directly instead of having to make a copy via stdin). But
> the general idea is there, and it just needs somebody to make a nice
> polished script that is configurable, does caching, etc. I'll get to it
> eventually, but if you'd like to work on it, be my guest.
>
> -Peff
>

^ permalink raw reply

* Re: [PATCH 3/3] setup: always honor GIT_WORK_TREE and core.worktree
From: Junio C Hamano @ 2011-01-21 23:01 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Junio C Hamano, Nguyen Thai Ngoc Duy, git, Maaartin
In-Reply-To: <20110121220532.GA23695@burratino>

Jonathan Nieder <jrnieder@gmail.com> writes:

> Junio C Hamano wrote:
>
>> Perhaps squash this into, or apply on top of, your 3/3?
>
> Looks good to me.  I'd suggest squashing and using your commit
> message for the combined commit.

Thanks, will do.

^ permalink raw reply

* Re: Fwd: Git and Large Binaries: A Proposed Solution
From: Jeff King @ 2011-01-21 22:24 UTC (permalink / raw)
  To: Eric Montellese; +Cc: git
In-Reply-To: <AANLkTimPua_kz2w33BRPeTtOEWOKDCsJzf0sqxm=db68@mail.gmail.com>

On Fri, Jan 21, 2011 at 01:57:21PM -0500, Eric Montellese wrote:

> I did a search for this issue before posting, but if I am touching on
> an old topic that already has a solution in progress, I apologize.  As

It's been talked about a lot, but there is not exactly a solution in
progress. One promising direction is not very different from what you're
doing, though:

> Solution:
> The short version:
> ***Don't track binaries in git.  Track their hashes.***

Yes, exactly. But what your solution lacks, I think, is more integration
into git. Specifically, using clean/smudge filters you can have git take
care of tracking the file contents automatically.

At the very simplest, it would look like:

-- >8 --
cat >$HOME/local/bin/huge-clean <<'EOF'
#!/bin/sh

# In an ideal world, we could actually
# access the original file directly instead of
# having to cat it to a new file.
temp="$(git rev-parse --git-dir)"/huge.$$
cat >"$temp"
sha1=`sha1sum "$temp" | cut -d' ' -f1`

# now move it to wherever your permanent storage is
# scp "$root/$sha1" host:/path/to/big_storage/$sha1
cp "$temp" /tmp/big_storage/$sha1
rm -f "$temp"

echo $sha1
EOF

cat >$HOME/local/bin/huge-smudge <<'EOF'
#!/bin/sh

# Get sha1 from stored blob via stdin
read sha1

# Now retrieve blob. We could optionally do some caching here.
# ssh host cat /path/to/big/storage/$sha1
cat /tmp/big_storage/$sha1
EOF
-- 8< --

Obviously our storage mechanism (throwing things in /tmp) is simplistic,
but obviously you could store and retrieve via ssh, http, s3, or
whatever.

You can try it out like this:

  # set up our filter config and fake storage area
  mkdir /tmp/big_storage
  git config --global filter.huge.clean huge-clean
  git config --global filter.huge.smudge huge-smudge

  # now make a repo, and make sure we mark *.bin files as huge
  mkdir repo && cd repo && git init
  echo '*.bin filter=huge' >.gitattributes
  git add .gitattributes
  git commit -m 'add attributes'

  # let's do a moderate 20M file
  perl -e 'print "foo\n" for (1 .. 5000000)' >foo.bin
  git add foo.bin
  git commit -m 'add huge file (foo)'

  # and then another revision
  perl -e 'print "bar\n" for (1 .. 5000000)' >foo.bin
  git commit -a -m 'revise huge file (bar)'

Notice that we just add and commit as normal.  And we can check that the
space usage is what you expect:

  $ du -sh repo/.git
  196K    repo/.git
  $ du -sh /tmp/big_storage
  39M     /tmp/big_storage

Diffs obviously are going to be less interesting, as we just see the
hash:

  $ git log --oneline -p foo.bin
  39e549c revise huge file (bar)
  diff --git a/foo.bin b/foo.bin
  index 281fd03..70874bd 100644
  --- a/foo.bin
  +++ b/foo.bin
  @@ -1 +1 @@
  -50a1ee265f4562721346566701fce1d06f54dd9e
  +bbc2f7f191ad398fe3fcb57d885e1feacb4eae4e
  845836e add huge file (foo)
  diff --git a/foo.bin b/foo.bin
  new file mode 100644
  index 0000000..281fd03
  --- /dev/null
  +++ b/foo.bin
  @@ -0,0 +1 @@
  +50a1ee265f4562721346566701fce1d06f54dd9e

but if you wanted to, you could write a custom diff driver that does
something more meaningful with your particular binary format (it would
have to grab from big_storage, though).

Checking out other revisions works without extra action:

  $ head -n 1 foo.bin
  bar
  $ git checkout HEAD^
  HEAD is now at 845836e... add huge file (foo)
  $ head -n 1 foo.bin
  foo

And since you have the filter config in your ~/.gitconfig, clones will
just work:

  $ git clone repo other
  $ du -sh other/.git
  204K    other/.git
  $ du -sh other/foo.bin
  20M

So conceptually it's pretty similar to yours, but the filter integration
means that git takes care of putting the right files in place at the
right time.

It would probably benefit a lot from caching the large binary files
instead of hitting big_storage all the time. And probably the
putting/getting from storage should be factored out so you can plug in
different storage. And it should all be configurable. Different users of
the same repo might want different caching policies, or to access the
binary assets by different mechanisms or URLs.

> I imagine these features (among others):
> 
> 1. In my current setup, each large binary file has a different name (a
> revision number).  This could be easily solved, however, by generating
> unique names under the hood and tracking this within git.

In the scheme above, we just index by their hash. So you can easily fsck
your big_storage by making sure everything matches its hash (but you
can't know that you have _all_ of the blobs needed unless you
cross-reference with the history).

> 2. A lot of the steps in my current setup are manual.  When I want to
> add a new binary file, I need to manually create the hash and manually
> upload the binary to the joint server.  If done within git, this would
> be automatic.

I think the scheme above takes care of the manual bits.

> 3. In my setup, all of the binary files are in a single "binrepo"
> directory.  If done from within git, we would need a non-kludgey way
> to allow large binaries to exist anywhere within the git tree.  If git

Any scheme, whether it uses clean/smudge filters or not, should probably
tie in via gitattributes.

> 4. User option to download all versions of all binaries, or only the
> version necessary for the position on the current branch.  If you want
> to be able to run all versions of the repository when offline, you can
> download all versions of all binaries.  If you don't need to do this,
> you can just download the versions you need.  Or perhaps have the
> option to download all binaries smaller than X-bytes, but skip the big
> ones.

The scheme above will download on an as-needed basis. If caching were
implemented, you could just make the cache infinitely big and do a "git
log -p" which would download everything. :)

Probably you would also want the smudge filter to return "blob not
available" when operating in some kind of offline mode.

> 5. Command to purge all binaries in your "binrepo" that are not needed
> for the current revision (if you're running out of disk space
> locally).

In my scheme, just rm your cache directory (once it exists).

> 6. Automatically upload new versions of files to the "binrepo" (rather
> than needing to do this manually)

Handled by the clean filter above.


So obviously this is not very complete. And there are a few changes to
git that could make it more efficient (e.g., letting the clean filter
touch the file directly instead of having to make a copy via stdin). But
the general idea is there, and it just needs somebody to make a nice
polished script that is configurable, does caching, etc. I'll get to it
eventually, but if you'd like to work on it, be my guest.

-Peff

^ permalink raw reply

* Re: [PATCH 3/3] setup: always honor GIT_WORK_TREE and core.worktree
From: Jonathan Nieder @ 2011-01-21 22:05 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nguyen Thai Ngoc Duy, git, Maaartin
In-Reply-To: <7vtyh1oqy8.fsf@alter.siamese.dyndns.org>

Junio C Hamano wrote:

> Perhaps squash this into, or apply on top of, your 3/3?

Looks good to me.  I'd suggest squashing and using your commit
message for the combined commit.

^ permalink raw reply

* Re: git bisect problems/ideas
From: Johannes Sixt @ 2011-01-21 22:04 UTC (permalink / raw)
  To: Christian Couder; +Cc: Aaron S. Meurer, git, SZEDER Gábor, Jonathan Nieder
In-Reply-To: <AANLkTikG6Ft3Y922Aaakf28cnYs26PcRHoq9GSNj04mu@mail.gmail.com>

On Freitag, 21. Januar 2011, Christian Couder wrote:
> On Wed, Jan 19, 2011 at 8:44 PM, Aaron S. Meurer <asmeurer@gmail.com> wrote:
> > If no, I think --reverse is actually a suitable fix.
>
> Yeah, but I think that what Dscho started was probably better. The
> problem is just that it is not so simple to implement and no one yet
> has been interested enough or took enough time to finish it.

Let me throw in an idea:

Add two new sub-commands:

* 'git bisect regression': this is a synonym for 'git bisect start'.

* 'git bisect improvement': this also starts a bisection, but subsequently the 
operation of 'git bisect good' and 'git bisect bad' is reversed.

-- Hannes

^ permalink raw reply

* Re: [PATCH 3/3] setup: always honor GIT_WORK_TREE and core.worktree
From: Junio C Hamano @ 2011-01-21 22:02 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Nguyen Thai Ngoc Duy, git, Maaartin
In-Reply-To: <7v39omotxg.fsf@alter.siamese.dyndns.org>

Junio C Hamano <gitster@pobox.com> writes:

> I was re-reading this thread, and changed my mind; I think we should have
> this series to avoid unnecessary regression, with or without clarifying
> (5), before 1.7.4 final.
>
> Even if some scripts you had trouble with started using GIT_WORK_TREE
> without specifying GIT_DIR because they misunderstood what these are
> designed to do, as long as the combination has been working consistently
> with the expectation of these scripts, ans as long as we can keep the same
> behaviour, I don't see a reason to change it.

... and that leads me to suggest that we may not even want to issue a
warning in these cases.

Perhaps squash this into, or apply on top of, your 3/3?

-- >8 --
Subject: setup: officially support --work-tree without --git-dir

The original intention of --work-tree was to allow people to work in a
subdirectory of their working tree that does not have an embedded .git
directory.  Because their working tree, which their $cwd was in, did not
have an embedded .git, they needed to use $GIT_DIR to specify where it is,
and because this meant there was no way to discover where the root level
of the working tree was, so we needed to add $GIT_WORK_TREE to tell git
where it was.

However, this facility has long been (mis)used by people's scripts to
start git from a working tree _with_ an embedded .git directory, let git
find .git directory, and then pretend as if an unrelated directory were
the associated working tree of the .git directory found by the discovery
process.  It happens to work in simple cases, and is not worth causing
"regression" to these scripts.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 setup.c               |    8 +-----
 t/t1510-repo-setup.sh |   59 +++++++++++++++++++++++++-----------------------
 2 files changed, 33 insertions(+), 34 deletions(-)

diff --git a/setup.c b/setup.c
index e08cdf2..dadc666 100644
--- a/setup.c
+++ b/setup.c
@@ -411,9 +411,8 @@ static const char *setup_discovered_git_dir(const char *gitdir,
 	if (check_repository_format_gently(gitdir, nongit_ok))
 		return NULL;
 
-	/* Accept --work-tree to support old scripts that played with fire. */
+	/* --work-tree is set without --git-dir; use discovered one */
 	if (getenv(GIT_WORK_TREE_ENVIRONMENT) || git_work_tree_cfg) {
-		warning("pretending GIT_DIR was supplied alongside GIT_WORK_TREE");
 		if (offset != len && !is_absolute_path(gitdir))
 			gitdir = xstrdup(make_absolute_path(gitdir));
 		if (chdir(cwd))
@@ -453,13 +452,10 @@ static const char *setup_bare_git_dir(char *cwd, int offset, int len, int *nongi
 	if (check_repository_format_gently(".", nongit_ok))
 		return NULL;
 
-	/*
-	 * Accept --work-tree, reluctantly.
-	 */
+	/* --work-tree is set without --git-dir; use discovered one */
 	if (getenv(GIT_WORK_TREE_ENVIRONMENT) || git_work_tree_cfg) {
 		const char *gitdir;
 
-		warning("pretending GIT_DIR was supplied alongside GIT_WORK_TREE");
 		gitdir = offset == len ? "." : xmemdupz(cwd, offset);
 		if (chdir(cwd))
 			die_errno("Could not come back to cwd");
diff --git a/t/t1510-repo-setup.sh b/t/t1510-repo-setup.sh
index b8a1b02..dcc0f86 100755
--- a/t/t1510-repo-setup.sh
+++ b/t/t1510-repo-setup.sh
@@ -16,9 +16,12 @@ A few rules for repo setup:
 4. GIT_WORK_TREE is relative to user's cwd. --work-tree is
    equivalent to GIT_WORK_TREE.
 
-5. GIT_WORK_TREE/core.worktree is only meant to work if GIT_DIR is set.
-   Otherwise there is a warning and a best effort is made to follow
-   historical behavior.
+5. GIT_WORK_TREE/core.worktree was originally meant to work only if
+   GIT_DIR is set, but earlier git didn't enforce it, and some scripts
+   depend on the implementation that happened to first discover .git by
+   going up from the users $cwd and then using the specified working tree
+   that may or may not have any relation to where .git was found in.  This
+   historical behaviour must be kept.
 
 6. Effective GIT_WORK_TREE overrides core.worktree and core.bare
 
@@ -225,16 +228,16 @@ try_repo () {
 test_expect_success '#0: nonbare repo, no explicit configuration' '
 	try_repo 0 unset unset unset "" unset \
 		.git "$here/0" "$here/0" "(null)" \
-		.git "$here/0" "$here/0" sub/ 2>messages &&
-	! grep "warning:.*GIT_DIR.*GIT_WORK_TREE" messages
+		.git "$here/0" "$here/0" sub/ 2>message &&
+	! test -s message
 '
 
-test_expect_success '#1: GIT_WORK_TREE without explicit GIT_DIR is reluctantly accepted' '
+test_expect_success '#1: GIT_WORK_TREE without explicit GIT_DIR is accepted' '
 	mkdir -p wt &&
 	try_repo 1 "$here" unset unset "" unset \
 		"$here/1/.git" "$here" "$here" 1/ \
 		"$here/1/.git" "$here" "$here" 1/sub/ 2>message &&
-	grep "warning:.*GIT_DIR.*GIT_WORK_TREE" message
+	! test -s message
 '
 
 test_expect_success '#2: worktree defaults to cwd with explicit GIT_DIR' '
@@ -255,16 +258,16 @@ test_expect_success '#3: setup' '
 '
 run_wt_tests 3
 
-test_expect_success '#4: core.worktree without GIT_DIR set is reluctantly accepted' '
+test_expect_success '#4: core.worktree without GIT_DIR set is accepted' '
 	setup_repo 4 ../sub "" unset &&
 	mkdir -p 4/sub sub &&
 	try_case 4 unset unset \
 		.git "$here/4/sub" "$here/4" "(null)" \
 		"$here/4/.git" "$here/4/sub" "$here/4/sub" "(null)" 2>message &&
-	grep "warning:.*GIT_DIR.*GIT_WORK_TREE" message
+	! test -s message
 '
 
-test_expect_success '#5: core.worktree + GIT_WORK_TREE is reluctantly accepted' '
+test_expect_success '#5: core.worktree + GIT_WORK_TREE is accepted' '
 	# or: you cannot intimidate away the lack of GIT_DIR setting
 	try_repo 5 "$here" unset "$here/5" "" unset \
 		"$here/5/.git" "$here" "$here" 5/ \
@@ -272,7 +275,7 @@ test_expect_success '#5: core.worktree + GIT_WORK_TREE is reluctantly accepted'
 	try_repo 5a .. unset "$here/5a" "" unset \
 		"$here/5a/.git" "$here" "$here" 5a/ \
 		"$here/5a/.git" "$here/5a" "$here/5a" sub/ &&
-	grep "warning:.*GIT_DIR.*GIT_WORK_TREE" message
+	! test -s message
 '
 
 test_expect_success '#6: setting GIT_DIR brings core.worktree to life' '
@@ -364,12 +367,12 @@ test_expect_success '#8: gitfile, easy case' '
 		"$here/8.git" "$here/8" "$here/8" sub/
 '
 
-test_expect_success '#9: GIT_WORK_TREE reluctantly accepted with gitfile' '
+test_expect_success '#9: GIT_WORK_TREE accepted with gitfile' '
 	mkdir -p 9/wt &&
 	try_repo 9 wt unset unset gitfile unset \
 		"$here/9.git" "$here/9/wt" "$here/9" "(null)" \
 		"$here/9.git" "$here/9/sub/wt" "$here/9/sub" "(null)" 2>message &&
-	grep "warning:.*GIT_DIR.*GIT_WORK_TREE" message
+	! test -s message
 '
 
 test_expect_success '#10: GIT_DIR can point to gitfile' '
@@ -391,19 +394,19 @@ test_expect_success '#11: setup' '
 '
 run_wt_tests 11 gitfile
 
-test_expect_success '#12: core.worktree with gitfile is reluctantly accepted' '
+test_expect_success '#12: core.worktree with gitfile is accepted' '
 	try_repo 12 unset unset "$here/12" gitfile unset \
 		"$here/12.git" "$here/12" "$here/12" "(null)" \
 		"$here/12.git" "$here/12" "$here/12" sub/ 2>message &&
-	grep "warning:.*GIT_DIR.*GIT_WORK_TREE" message
+	! test -s message
 '
 
-test_expect_success '#13: core.worktree+GIT_WORK_TREE relucantly accepted (with gitfile)' '
+test_expect_success '#13: core.worktree+GIT_WORK_TREE accepted (with gitfile)' '
 	# or: you cannot intimidate away the lack of GIT_DIR setting
 	try_repo 13 non-existent-too unset non-existent gitfile unset \
 		"$here/13.git" "$here/13/non-existent-too" "$here/13" "(null)" \
 		"$here/13.git" "$here/13/sub/non-existent-too" "$here/13/sub" "(null)" 2>message &&
-	grep "warning:.*GIT_DIR.*GIT_WORK_TREE" message
+	! test -s message
 '
 
 # case #14.
@@ -514,7 +517,7 @@ test_expect_success '#16c: bare .git has no worktree' '
 		"$here/16c/.git" "(null)" "$here/16c/sub" "(null)"
 '
 
-test_expect_success '#17: GIT_WORK_TREE without explicit GIT_DIR is reluctantly accepted (bare case)' '
+test_expect_success '#17: GIT_WORK_TREE without explicit GIT_DIR is accepted (bare case)' '
 	# Just like #16.
 	setup_repo 17a unset "" true &&
 	setup_repo 17b unset "" true &&
@@ -539,7 +542,7 @@ test_expect_success '#17: GIT_WORK_TREE without explicit GIT_DIR is reluctantly
 	try_repo 17c "$here/17c" unset unset "" true \
 		.git "$here/17c" "$here/17c" "(null)" \
 		"$here/17c/.git" "$here/17c" "$here/17c" sub/ 2>message &&
-	grep "warning:.*GIT_DIR.*GIT_WORK_TREE" message
+	! test -s message
 '
 
 test_expect_success '#18: bare .git named by GIT_DIR has no worktree' '
@@ -558,7 +561,7 @@ test_expect_success '#19: setup' '
 '
 run_wt_tests 19
 
-test_expect_success '#20a: core.worktree without GIT_DIR reluctantly accepted (inside .git)' '
+test_expect_success '#20a: core.worktree without GIT_DIR accepted (inside .git)' '
 	# Unlike case #16a.
 	setup_repo 20a "$here/20a" "" unset &&
 	mkdir -p 20a/.git/wt/sub &&
@@ -568,7 +571,7 @@ test_expect_success '#20a: core.worktree without GIT_DIR reluctantly accepted (i
 		"$here/20a/.git" "$here/20a" "$here/20a" .git/wt/ &&
 	try_case 20a/.git/wt/sub unset unset \
 		"$here/20a/.git" "$here/20a" "$here/20a" .git/wt/sub/ &&
-	grep "warning:.*GIT_DIR.*GIT_WORK_TREE" message
+	! test -s message
 '
 
 test_expect_success '#20b/c: core.worktree and core.bare conflict' '
@@ -581,7 +584,7 @@ test_expect_success '#20b/c: core.worktree and core.bare conflict' '
 	grep "core.bare and core.worktree" message
 '
 
-# Case #21: core.worktree/GIT_WORK_TREE reluctantly overrides core.bare' '
+# Case #21: core.worktree/GIT_WORK_TREE overrides core.bare' '
 test_expect_success '#21: setup, core.worktree warns before overriding core.bare' '
 	setup_repo 21 non-existent "" unset &&
 	mkdir -p 21/.git/wt/sub &&
@@ -591,7 +594,7 @@ test_expect_success '#21: setup, core.worktree warns before overriding core.bare
 		export GIT_WORK_TREE &&
 		git symbolic-ref HEAD >/dev/null
 	) 2>message &&
-	grep "warning:.*GIT_DIR.*GIT_WORK_TREE" message
+	! test -s message
 
 '
 run_wt_tests 21
@@ -703,11 +706,11 @@ test_expect_success '#24: bare repo has no worktree (gitfile case)' '
 		"$here/24.git" "(null)" "$here/24/sub" "(null)"
 '
 
-test_expect_success '#25: GIT_WORK_TREE accepted reluctantly if GIT_DIR unset (bare gitfile case)' '
+test_expect_success '#25: GIT_WORK_TREE accepted if GIT_DIR unset (bare gitfile case)' '
 	try_repo 25 "$here/25" unset unset gitfile true \
 		"$here/25.git" "$here/25" "$here/25" "(null)"  \
 		"$here/25.git" "$here/25" "$here/25" "sub/" 2>message &&
-	grep "warning:.*GIT_DIR.*GIT_WORK_TREE" message
+	! test -s message
 '
 
 test_expect_success '#26: bare repo has no worktree (GIT_DIR -> gitfile case)' '
@@ -732,11 +735,11 @@ test_expect_success '#28: core.worktree and core.bare conflict (gitfile case)' '
 		cd 28 &&
 		test_must_fail git symbolic-ref HEAD
 	) 2>message &&
-	grep "warning:.*GIT_DIR.*GIT_WORK_TREE" message &&
+	! grep "^warning:" message &&
 	grep "core.bare and core.worktree" message
 '
 
-# Case #29: GIT_WORK_TREE(+core.worktree) reluctantly overries core.bare (gitfile case).
+# Case #29: GIT_WORK_TREE(+core.worktree) overries core.bare (gitfile case).
 test_expect_success '#29: setup' '
 	setup_repo 29 non-existent gitfile true &&
 	mkdir -p 29/sub/sub 29/wt/sub
@@ -746,7 +749,7 @@ test_expect_success '#29: setup' '
 		export GIT_WORK_TREE &&
 		git symbolic-ref HEAD >/dev/null
 	) 2>message &&
-	grep "warning:.*GIT_DIR.*GIT_WORK_TREE" message
+	! test -s message
 '
 run_wt_tests 29 gitfile
 

^ permalink raw reply related

* Re: Fwd: Git and Large Binaries: A Proposed Solution
From: Eric Montellese @ 2011-01-21 22:00 UTC (permalink / raw)
  To: Wesley J. Landaker; +Cc: git
In-Reply-To: <201101211436.32033.wjl@icecavern.net>

Thanks Wesely,

I did take a look at git annex -- it looks to me as though that
project is more of a special-case, allowing users to use git to track
things like music and movies.  While it's possible this might be
usable for the use case I described, what I'm really looking for is a
true extension of git which allows binaries to be treated differently
(if the user desires) when using git as a source management tool.

The major difference I see with git-annex is that the user must
specifically tell git-annex to download certain files.  Instead, I
want the user to always automatically have *all* of the files (both
source and binaries) for the current revision -- but not necessarily
for hundreds (thousands?) of past revisions (which as git is
implemented currently would take up many gigabytes)

git-annex does look like a neat piece of software, but I don't think
it quite fits here -- thank you again for the comment though!

Eric


On Fri, Jan 21, 2011 at 4:36 PM, Wesley J. Landaker <wjl@icecavern.net> wrote:
> On Friday, January 21, 2011 11:57:21 Eric Montellese wrote:
>> To whet your appetite to read all of the below (I know it's long),
>> this is the root of the solution:
>>
>> ---       Don't track binaries in git.  Track their hashes.       ---
>
> Comment from the peanut gallery:
>
> I haven't read your approach in great detail, but just in case you are not
> aware, there is a project call git-annex <http://git-annex.branchable.com/>
> by Joey Hess that I believe takes a similar approach.
>
> Since you've obviously given this a lot of thought, you might want to take a
> peek at that and see if it already does what you want, or if your proposal
> does something significantly different/better.
>

^ permalink raw reply

* Re: Fwd: Git and Large Binaries: A Proposed Solution
From: Wesley J. Landaker @ 2011-01-21 21:36 UTC (permalink / raw)
  To: Eric Montellese; +Cc: git
In-Reply-To: <AANLkTimPua_kz2w33BRPeTtOEWOKDCsJzf0sqxm=db68@mail.gmail.com>

On Friday, January 21, 2011 11:57:21 Eric Montellese wrote:
> To whet your appetite to read all of the below (I know it's long),
> this is the root of the solution:
> 
> ---       Don't track binaries in git.  Track their hashes.       ---

Comment from the peanut gallery:

I haven't read your approach in great detail, but just in case you are not 
aware, there is a project call git-annex <http://git-annex.branchable.com/> 
by Joey Hess that I believe takes a similar approach.

Since you've obviously given this a lot of thought, you might want to take a 
peek at that and see if it already does what you want, or if your proposal 
does something significantly different/better.

^ permalink raw reply

* Re: [PATCH 3/3] setup: always honor GIT_WORK_TREE and core.worktree
From: Junio C Hamano @ 2011-01-21 20:58 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Nguyen Thai Ngoc Duy, git, Maaartin
In-Reply-To: <7vpqrssl5d.fsf@alter.siamese.dyndns.org>

Junio C Hamano <gitster@pobox.com> writes:

> Having said that, thanks for a nice summary.
> ...
>
>> 5. The interaction with core.bare and implicit bareness are not
>>    obvious.  Clearly core.bare should conflict with core.worktree,
>>    but can GIT_WORK_TREE override that?  Maybe
>>    check_repository_format_gently is the right place for this check
>>    (rather than the setup procedure).
>
> IIRC, we on purpose added support to allow GIT_WORK_TREE to tentatively
> lift bareness of a repository so that people can
>
> 	cd /var/tmp
>         GIT_WORK_TREE=. git --git-dir=/srv/git/jgit.git checkout -f
>
> to get a snapshot easily.
>
>> (1) and (2) have been resolved by your work (nice!), (3) seems like
>> a case of "don't do that, then", and (4) out to error out in
>> setup_nongit (though my patch doesn't take care of that).  Given an
>> answer to (5) we could wholeheartedly and consistently support
>> worktree with implicit gitdir, as a new feature.
>
> As long as we really can support it _consistently_, I wouldn't see a big
> problem in resurrecting the historical accident as a feature.  You earlier
> said gitolite also misuses the interface, but does the usage pattern it
> has the same as the one in the debian script you had trouble with, and do
> they expect the same behaviour?

I was re-reading this thread, and changed my mind; I think we should have
this series to avoid unnecessary regression, with or without clarifying
(5), before 1.7.4 final.

Even if some scripts you had trouble with started using GIT_WORK_TREE
without specifying GIT_DIR because they misunderstood what these are
designed to do, as long as the combination has been working consistently
with the expectation of these scripts, ans as long as we can keep the same
behaviour, I don't see a reason to change it.

^ permalink raw reply

* Re: [PATCH] Documentation: do not treat reset --keep as a special case
From: Junio C Hamano @ 2011-01-21 20:35 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Joshua Jensen, Matthieu Moy, Junio C Hamano, Johannes Schindelin,
	Thomas Rast, Nicolas Sebrecht, git, Kevin Ballard, Yann Dirson,
	Eric Raible
In-Reply-To: <20110121183734.GB16325@burratino>

Jonathan Nieder <jrnieder@gmail.com> writes:

> The current treatment of "git reset --keep" emphasizes how it
> differs from --hard (treatment of local changes) and how it breaks
> down into plumbing (git read-tree -m -u HEAD <commit> followed by git
> update-ref HEAD <commit>).  This can discourage people from using
> it, since it might seem to be a complex or niche option.
>
> Better to emphasize what the --keep flag is intended for --- moving
> the index and worktree from one commit to another, like "git checkout"
> would --- so the reader can make a more informed decision about the
> appropriate situations in which to use it.

The updated text makes quite a lot of sense ;-) while the old text
doesn't.  What were we smoking when we wrote it and passed it through the
review?

> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
>  Documentation/git-reset.txt |    9 ++-------
>  1 files changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/git-reset.txt b/Documentation/git-reset.txt
> index fd72976..927ecee 100644
> --- a/Documentation/git-reset.txt
> +++ b/Documentation/git-reset.txt
> @@ -76,15 +76,10 @@ In other words, --merge does something like a 'git read-tree -u -m <commit>',
>  but carries forward unmerged index entries.
>  
>  --keep::
> -	Resets the index, updates files in the working tree that are
> -	different between <commit> and HEAD, but keeps those
> -	which are different between HEAD and the working tree (i.e.
> -	which have local changes).
> +	Resets index entries and updates files in the working tree that are
> +	different between <commit> and HEAD.
>  	If a file that is different between <commit> and HEAD has local changes,
>  	reset is aborted.

I saw "updates files" and one question immediately came to mind: update
how?  "... to match what is in HEAD"?  "Resets index entries" is less of a
problem as the word "reset" already strongly suggests that the current
state does not matter as much as the target state, though.

Thanks.

^ permalink raw reply

* Re: [PATCH] Documentation: suggest "reset --keep" to undo a commit
From: Junio C Hamano @ 2011-01-21 20:28 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Johannes Schindelin, Thomas Rast, Nicolas Sebrecht, Matthieu Moy,
	git, Kevin Ballard, Yann Dirson, Eric Raible, Christian Couder
In-Reply-To: <20110121191459.GC16325@burratino>

Jonathan Nieder <jrnieder@gmail.com> writes:

>> Please tell a story where keep makes more sense than hard by enhancing the
>> explanatory text <1> associated with this section.  The current text says
>> that the three topmost commit representing what you have recently worked
>> so far are all unwanted, strongly hinting that hard is more appropriate
>> thing to do than keep, which is not what we want if we are changing the
>> example to use keep.
>
> Maybe the best story would be "you have just explored a blind alley
> and decided the last three commits are not a good idea at all", with

That unfortunately does not seem to describe the nature of the local
changes at all, which I think is the whole point of this topic to
encourage use of --keep over --hard.

>> It would be sufficient to just hint that the uncommitted changes that you
>> have in your working tree are unrelated to what these three commits wanted
>> to do (e.g. you always keep small changes around, such as debugging
>> printf's
>
> That use case is less interesting to me --- it is relatively harmless
> to clobber such content.

Actually I think that is the primary use case of the feature, as --keep
was done as a parallel to the behaviour of checkout that checks out a
different branch while keeping local changes.

^ permalink raw reply

* Re: [PATCH] Documentation: suggest "reset --keep" to undo a commit
From: Jonathan Nieder @ 2011-01-21 19:14 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin, Thomas Rast, Nicolas Sebrecht, Matthieu Moy,
	git, Kevin Ballard, Yann Dirson, Eric Raible, Christian Couder
In-Reply-To: <7voc7ap3dp.fsf@alter.siamese.dyndns.org>

Junio C Hamano wrote:

> But the user could do the reviewing and thinking with some local changes
> still in the working tree (they are incredients for the fourth commit yet
> to be made) and decide to branch at that point.  The description in <1>
> needs to be updated to hint that there can be uncommitted changes, e.g.
>
> 	You have worked for some time, made a few commits, and may have
> 	uncommitted changes.  After reviewing the current state, you
> 	realized that ...
>
> Using --keep may help the user do so, but only if the local changes do not
> conflict with the changes in the recent commits to be discarded, right?

I think this explanation misses out on something.

I may be abusing git in a certain way, but I find myself in the
following situation fairly often:

	... hack hack hack ...
	git add -p;	# hmm, looks like multiple features.
	git stash -k
	... test ...
	git commit;	# commit feature #1
	git stash pop

	git add -p
	git stash -k
	... test ...
	git commit; # commit feature #2
	git stash pop

	# hmm, feature #2 is not suitable for this branch.
	git branch wip/feature-2
	git reset --keep HEAD^;	# <*>
	git add -p
	git stash -k
	... test ...
	git commit; # commit feature #3

On line <*>, I am just not thinking about the uncommitted changes.
They may be there or they may not.  If they are in the way of what I
am trying to do, "git reset --keep" will politely inform me so I can
act accordingly (usually stash, commit, or discard them).

> By the way, a more natural way to do this would actually be:
>
>     $ git checkout -b topic/wip
>     $ git branch -f @{-1} HEAD~3

True.  (I think the intended scenario was

	git branch topic/wip; # save the tip for later
	git reset --keep HEAD~3
	# now what was I working on?
	... hack hack hack ...

	# okay, now we have time for that diversion.
	git checkout topic/wip

but it would be nice to contrast it with the one you described.)

> or using the stash:
>
>     $ git stash ;# save local changes
>     $ git branch topic/wip ;# and mark the tip before rewinding
>     $ git reset --hard HEAD~3 ;# you could say --keep here too
>     $ git checkout topic/wip ;# and then continue
>     $ git stash pop ;# with the local changes

This approach leaves more files touched and more targets to be rebuilt
by "make".

> Please tell a story where keep makes more sense than hard by enhancing the
> explanatory text <1> associated with this section.  The current text says
> that the three topmost commit representing what you have recently worked
> so far are all unwanted, strongly hinting that hard is more appropriate
> thing to do than keep, which is not what we want if we are changing the
> example to use keep.

Maybe the best story would be "you have just explored a blind alley
and decided the last three commits are not a good idea at all", with
reference to a new section explaining that

 * --soft is for when the commit in preparation has the right content
   but should be on top of a different parent (e.g., squashing commits)

 * --keep is for transporting your local changes to a different commit
   (e.g., rewinding a branch or transplanting changes)

 - --merge is a limited and low-level tool for recovering from a
   conflicted merge and most often will take ORIG_HEAD as its argument.

   Maybe in the future merges will save more information so reset --merge
   can error out more often.

 - --hard is for resetting to a known state

 - --mixed is for resetting to a known state but leaving the worktree
   alone

> It would be sufficient to just hint that the uncommitted changes that you
> have in your working tree are unrelated to what these three commits wanted
> to do (e.g. you always keep small changes around, such as debugging
> printf's

That use case is less interesting to me --- it is relatively harmless
to clobber such content.

^ permalink raw reply

* Fwd: Git and Large Binaries: A Proposed Solution
From: Eric Montellese @ 2011-01-21 18:57 UTC (permalink / raw)
  To: git
In-Reply-To: <AANLkTin=UySutWLS0Y7OmuvkE=T=+YB8G8aUCxLH=GKa@mail.gmail.com>

I did a search for this issue before posting, but if I am touching on
an old topic that already has a solution in progress, I apologize.  As
far as I know, this is still an open issue (last i saw in the
kernel-trap archives was a "git and binary files" thread from Jan
2008, and there are a couple of promising related works (git annex and
git bigfiles) -- but nothing that solves the complete problem).

I'm interested in hearing your thoughts and suggestions, and
interested if there is community interest in adding this feature to
git.  I would be happy to be involved in making the changes, but I
have very limited time, so would prefer help and would like to know
that it has a strong chance of joining the main line before
starting...
To whet your appetite to read all of the below (I know it's long),
this is the root of the solution:

---       Don't track binaries in git.  Track their hashes.       ---


Problem Background:
I work on embedded system software, with code for these products
delivered from multiple customers and in multiple formats, for
example:

1. source code -- works great and is what git is designed for
2. zipped tarballs of source code (that I will never need to modify)
-- I could unpack these and then use git to track the source code.
However, I prefer to track these deliverables as the tarballs
themselves because it makes my customer happier to see the exact
tarball that they delivered being used when I repackage updates.
(Let's not discuss problems with this model - I understand that this
is non-ideal).
3. large and/or many binaries.  (could be pictures, short videos,
pre-compiled binaries, etc)

The problem of course is that, as you know, git is not ideal for
handling large, or many, binaries.  It's better at just about
everything else, of course, but not this, for largely these two
reasons:

1. git cannot diff the binaries to their previous iterations
effectively to save space.  (and neither can any tool)
2. git requires that all clones of that repository must therefore
download all versions of all binaries -- which, if the binaries are
large, or many, and are poorly compressed together (as stated in 1),
this will be a very expensive operation.


Problem Statement:
We (the git user) want and "need" to be able to track large binaries
from within our repository.  But putting them into git slows down git
unnecessarily.
The only current alternative is to *not* check the large binaries into
git -- but now they are no longer tracked, which is unacceptable.  If
I want to jump back in git to a point in the tree from 6 months ago, I
do not have any way to tell which version of the large binaries I
need.  I could keep track of this manually, of course, but that's what
git is for...


Solution:
The short version:
***Don't track binaries in git.  Track their hashes.***

Solution:
The long version:
For my current project, I have this (the "store the hashes" idea)
implemented outside of git.  I am posting to this list because I would
like to see this functionality (well, something even better) become
native to git, and believe that it would remove one of the few
remaining arguments that some projects have against adopting git.
Here is how I have it implemented:

First the layout:
my_git_project/binrepo/
-- binaries/
-- hashes/
-- symlink_to_hashes_file
-- symlink_to_another_hashes_file
within the "binrepo" (binary repository) there is a subdirectory for
binaries, and a subdirectory for hashes.  In the root of the 'binrepo'
all of the files stored have a symlink to the current version of the
hash.
The "binaries" directory is .gitignore'd -- the hashes directory and
the symlinks to the current hashes are maintained by git.
Whenever I receive a new version of a large binary file from a
customer, I put it into "binaries" and I create a new hash for that
file in "hashes" and update the symlink to point to that hash.  I 'git
commit' and 'git push' those changes (this is fast since there is no
large binary in the git repository).
The other important factor is that I must put this large binary file
somewhere accessible for others to download it.  In this example, it
is:  my_git_server.net:/binrepo/

Then I have a bash script (some psuedocode here to save space):

for (all BINFILE in binrepo) ; do
  HASHFILE=$BINFILE".md5"
  # check if the binary exists
  if [[ -e binaries/$BINFILE ]] ; then
    echo "  $BINFILE available"
  else
    echo "  $BINFILE not available. Downloading..."
    wget http://my_git_server.net:/binrepo/$BINFILE
  fi
  # check md5sum
  md5sum $BINFILE > temp.md5
  if ! diff -q ../hashes/$HASHFILE temp.md5 >/dev/null ; then
    echo "ERROR! $BINFILE md5 does not match!"
    exit and/or redownload
  fi
done

This confirms that I have the right version of all of the binaries --
my git repository is effectively tracking the large binaries, but
without actually storing them internally to the git repo.  If someone
else updates the "binrepo" I will know it when I do a "git pull" and I
will automatically get the right version of the binary file so that my
sandbox is up-to-date.  Now let's say I want to revert the version of
the large binary file to the previous version -- all I need to do is
to to edit the symlink in "binrepo", commit, and push.  Other users
will automatically use the old version of the file as well after they
do their pull (and without needing to re-download that file)


Summary of Big Advantages:

1. Repository is unpolluted by large binary files.  git clone stays fast.
2. User has access to any version of any binary file, but does not
need to store every version locally if they do not want to.
3. Git does not need to worry about the big binaries - there are no
slow attempts to calculate binary deltas or pack and unpack under the
hood.


Improvements:

I imagine these features (among others):

1. In my current setup, each large binary file has a different name (a
revision number).  This could be easily solved, however, by generating
unique names under the hood and tracking this within git.
2. A lot of the steps in my current setup are manual.  When I want to
add a new binary file, I need to manually create the hash and manually
upload the binary to the joint server.  If done within git, this would
be automatic.
3. In my setup, all of the binary files are in a single "binrepo"
directory.  If done from within git, we would need a non-kludgey way
to allow large binaries to exist anywhere within the git tree.  If git
handles the "binrepo" under the hood though, the user would never need
to know about it -- instead git would just handle all binaries by
checking the internal "binrepo"  Instead of tracking symlinks, git
would track the file versions in the normal way -- it just wouldn't
store the binaries the same way (instead it would store the hash)
4. User option to download all versions of all binaries, or only the
version necessary for the position on the current branch.  If you want
to be able to run all versions of the repository when offline, you can
download all versions of all binaries.  If you don't need to do this,
you can just download the versions you need.  Or perhaps have the
option to download all binaries smaller than X-bytes, but skip the big
ones.
5. Command to purge all binaries in your "binrepo" that are not needed
for the current revision (if you're running out of disk space
locally).
6. Automatically upload new versions of files to the "binrepo" (rather
than needing to do this manually)


Rock on!
Eric

^ permalink raw reply

* [PATCH] Documentation: do not treat reset --keep as a special case
From: Jonathan Nieder @ 2011-01-21 18:37 UTC (permalink / raw)
  To: Joshua Jensen
  Cc: Matthieu Moy, Junio C Hamano, Johannes Schindelin, Thomas Rast,
	Nicolas Sebrecht, git, Kevin Ballard, Yann Dirson, Eric Raible
In-Reply-To: <4D39C923.20202@workspacewhiz.com>

The current treatment of "git reset --keep" emphasizes how it
differs from --hard (treatment of local changes) and how it breaks
down into plumbing (git read-tree -m -u HEAD <commit> followed by git
update-ref HEAD <commit>).  This can discourage people from using
it, since it might seem to be a complex or niche option.

Better to emphasize what the --keep flag is intended for --- moving
the index and worktree from one commit to another, like "git checkout"
would --- so the reader can make a more informed decision about the
appropriate situations in which to use it.

Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
---
Joshua Jensen wrote:

> I've always wished "git reset --hard" would tell me there are
> modified files and force me to type "git reset --hard --force" to
> overwrite them.  It is a dangerous command, and I stupidly run it
> sometimes without running "git status" first.

Have you tried "git reset --keep"?  How does it compare to your
wish?

 Documentation/git-reset.txt |    9 ++-------
 1 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/Documentation/git-reset.txt b/Documentation/git-reset.txt
index fd72976..927ecee 100644
--- a/Documentation/git-reset.txt
+++ b/Documentation/git-reset.txt
@@ -76,15 +76,10 @@ In other words, --merge does something like a 'git read-tree -u -m <commit>',
 but carries forward unmerged index entries.
 
 --keep::
-	Resets the index, updates files in the working tree that are
-	different between <commit> and HEAD, but keeps those
-	which are different between HEAD and the working tree (i.e.
-	which have local changes).
+	Resets index entries and updates files in the working tree that are
+	different between <commit> and HEAD.
 	If a file that is different between <commit> and HEAD has local changes,
 	reset is aborted.
-+
-In other words, --keep does a 2-way merge between <commit> and HEAD followed by
-'git reset --mixed <commit>'.
 --
 
 If you want to undo a commit other than the latest on a branch,
-- 
1.7.4.rc2

^ permalink raw reply related

* Re: [PATCH 2/2] Re: rebase -i: explain how to discard all commits
From: Joshua Jensen @ 2011-01-21 17:57 UTC (permalink / raw)
  To: Matthieu Moy
  Cc: Junio C Hamano, Johannes Schindelin, Thomas Rast,
	Nicolas Sebrecht, Jonathan Nieder, git, Kevin Ballard,
	Yann Dirson, Eric Raible
In-Reply-To: <vpqmxmu2nm3.fsf@bauges.imag.fr>

----- Original Message -----
From: Matthieu Moy
Date: 1/21/2011 10:05 AM
> Junio C Hamano<gitster@pobox.com>  writes:
>
>> Johannes Schindelin<Johannes.Schindelin@gmx.de>  writes:
>>
>>>> Wouldn't that suggest us that if we were to do anything to this message
>>>> it would be a good idea to teach the user to "reset --hard" the branch
>>>> if no commits truly needs to be replayed on top of the onto-commit?
>>> The important difference between rebase -i&&  noop on the one, and reset
>>> --hard on the other hand is that the latter is completely unsafe. I mean,
>>> utterly completely super-unsafe. And I say that because _this here
>>> developer_ who is not exactly a Git noob lost stuff that way.
>> I think "rebase" already checks that the index and the working tree is
>> clean before starting, so referring to "reset --hard" when "rebase -i"
>> notices there is absolutely nothing to do is _not_ unsafe, no?
> The point is not about letting rebase do a "reset --hard", but to tell
> the user s/he should have ran "reset --hard" instead of rebase. The
> danger is to teach the user's fingers to type "reset --hard" too
> often, which is unsafe ;-).
I've always wished "git reset --hard" would tell me there are modified 
files and force me to type "git reset --hard --force" to overwrite 
them.  It is a dangerous command, and I stupidly run it sometimes 
without running "git status" first.

-Josh

^ permalink raw reply

* Re: Move git-stash from one machine (or working copy) to another
From: Junio C Hamano @ 2011-01-21 17:49 UTC (permalink / raw)
  To: Patrick Doyle; +Cc: git
In-Reply-To: <AANLkTin2M+dLUOFnAKqNvYn04NumCmmQ331Yfb9ieW-D@mail.gmail.com>

Patrick Doyle <wpdster@gmail.com> writes:

> Is there an easy way to move work in progress from one machine to another?
>
> One way to do it might be something like this:
>
> machine1$ git checkout -b movewip
> machine1$ git add .
> machine1$ git commit -m "Moving work in progress"
> machine1$ git push origin movewip:movewip
>
> machine2$ git fetch origin movewip:movewip:
> machine2$ git checkout movewip
> machine2$ git reset HEAD^
> machine2$ git stash
> machine2$ git checkout master
> machine2$ git stash pop
>
> # go through and delete movewip branches on machine1, machine2, and
> the origin server
>
> Except for some possible typos, this seems like it would work, but
> seems to be awfully clumsy.  Is there a more elegant way to accomplish
> this?

If your two machines can talk directly with each other (which seems to be
the case from your "take that with me (somehow) to machine2"), you don't
have to push and fetch through the origin.

^ permalink raw reply

* Re: Creating remote branch called HEAD corrupts remote clones
From: Junio C Hamano @ 2011-01-21 17:37 UTC (permalink / raw)
  To: Felipe Contreras; +Cc: Jeff King, Junio C Hamano, git, Stephen Kelly, KDE PIM
In-Reply-To: <AANLkTikBbSt5_WdbuE8a96w1pWBCYLNjMCUCBThjdLdG@mail.gmail.com>

Felipe Contreras <felipe.contreras@gmail.com> writes:

> I don't fully understand the issue, so excuse me if this is totally
> wrong, but wouldn't a rule like 'you can't create a branch for which
> there's already a symbolic ref' do the trick?

But whose symbolic ref are you checking against?  Your own, or ones in
somebody else's repository that you haven't recently updated from?

^ permalink raw reply

* Re: [PATCH] Documentation: suggest "reset --keep" to undo a commit
From: Junio C Hamano @ 2011-01-21 17:34 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Johannes Schindelin, Thomas Rast, Nicolas Sebrecht, Matthieu Moy,
	git, Kevin Ballard, Yann Dirson, Eric Raible, Christian Couder
In-Reply-To: <20110121073730.GA26276@burratino>

Jonathan Nieder <jrnieder@gmail.com> writes:

> When one's only goal is to move from one commit to another, reset
> --keep is simply better than reset --hard, since it preserves local
> changes in the index and worktree when easy and errors out without
> doing anything when not.  Update the two "how to remove commits"
> examples in this vein.  "reset --hard" is still explained in a later
> example about cleaning up during a merge.

I agree with the first sentence but I do not think its conclusion would
lead to changes to "how to _remove_ commits".  The examples were written
in contexts (explanatory text <$n>) where hard makes sense, and the
context needs tweaking to make keep makes more sense than hard does.

For example, the first one's original sequence is this:

    $ git branch topic/wip
    $ git reset --hard HEAD~3
    $ git checkout topic/wip

The text explains the motivation behind these series of commands in <1>
but it has one untold assumption behind it; the user did the review to
reach the conclusion that the recent changes are premature after fully
committing (i.e. the working tree is clean).  That is why "hard" worked
just fine.

But the user could do the reviewing and thinking with some local changes
still in the working tree (they are incredients for the fourth commit yet
to be made) and decide to branch at that point.  The description in <1>
needs to be updated to hint that there can be uncommitted changes, e.g.

	You have worked for some time, made a few commits, and may have
	uncommitted changes.  After reviewing the current state, you
	realized that ...

Using --keep may help the user do so, but only if the local changes do not
conflict with the changes in the recent commits to be discarded, right?

    Side note: Regardless of any of the above, the section header needs to
    be updated---it is not "Undo *A* commit", we are excluding three from
    the current branch.

By the way, a more natural way to do this would actually be:

    $ git checkout -b topic/wip
    $ git branch -f @{-1} HEAD~3

or using the stash:

    $ git stash ;# save local changes
    $ git branch topic/wip ;# and mark the tip before rewinding
    $ git reset --hard HEAD~3 ;# you could say --keep here too
    $ git checkout topic/wip ;# and then continue
    $ git stash pop ;# with the local changes

The branch/reset/checkout sequence itself, either with hard or keep, looks
more like a contrived example to show "reset" than the best way to solve
the problem in the scenario presented there.  Probably we would want to
drop this one altogether, or keep the scenario and explain the best
solution in somewhere else (e.g. tutorial).

> @@ -163,7 +163,7 @@ Undo commits permanently::
>  +
>  ------------
>  $ git commit ...
> -$ git reset --hard HEAD~3   <1>
> +$ git reset --keep HEAD~3   <1>
>  ------------
>  +
>  <1> The last three commits (HEAD, HEAD^, and HEAD~2) were bad

Please tell a story where keep makes more sense than hard by enhancing the
explanatory text <1> associated with this section.  The current text says
that the three topmost commit representing what you have recently worked
so far are all unwanted, strongly hinting that hard is more appropriate
thing to do than keep, which is not what we want if we are changing the
example to use keep.

It would be sufficient to just hint that the uncommitted changes that you
have in your working tree are unrelated to what these three commits wanted
to do (e.g. you always keep small changes around, such as debugging
printf's and a change to the version string in Makefile---you do not
intend to commit them and they are unrelated to the commits you are
discarding), and you do want to keep them around if you can.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox