* [PATCH] RFC: git lazy clone proof-of-concept
From: Jan Holesovsky @ 2008-02-08 17:28 UTC
  To: git, gitster

Hi,

This is my attempt to implement the 'lazy clone' I've read about a bit in the
git mailing list archive, but did not see implemented anywhere - a clone
that fetches a minimal amount of data, with the possibility of downloading the
rest later (transparently!) when necessary.  I am sorry to send it as a huge
patch, not as a series of patches, but as I don't know whether the way I chose
is acceptable to you [I'm new to the git code ;-)], I'd like to hear some
feedback first, and then I'll split it into smaller pieces for easier
integration - if OK.

Background:

We are currently evaluating git for OpenOffice.org as one of the candidates
(SVN is the other one); see

  http://wiki.services.openoffice.org/wiki/SCM_Migration

I've provided a git import of OOo with the entire history; the problem is that
the pack is 2.5G, so it's not too convenient to download for casual
developers who just want to try it.  A shallow clone is not a possibility - we
don't get patches through mailing lists, so we need pull/push, and also,
thanks to the OOo development cycle, we have so many living heads that a
shallow clone downloads about 1.5G even with --depth 1.  Lazy clone sounded
like the right idea to me.  With this proof-of-concept implementation, just
about 550M of the 2.5G is downloaded, which is still about twice as much as
downloading a tarball, but bearable.

The principle:

During the initial clone, just the commit objects are downloaded.  Then, any
time an object is requested, it is downloaded from the remote repository if it
is not available locally.  To make this usable and reasonably fast, when a
tree is requested, it is downloaded together with all the subtrees and blobs
it points to.  Every subsequent pull (of stuff newer than what was cloned) is
supposed to use the normal git mechanisms.

Protocol extensions:

I've extended the git protocol in 2 ways:
- added a 'commits-only' flag that is used during the clone to get a pack
  containing just the commit objects, nothing else
- added an 'exact-objects' flag that allows the client to request just a few
  exactly specified objects (see the example 'want' lines below)
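
To give an idea of the wire format (a sketch derived from the new
send_want(), not captured traffic): the first 'want' line carries the
capabilities, including the new ones, e.g.

  want <sha1> multi_ack side-band-64k thin-pack commits-only ofs-delta

or, in the exact-objects case,

  want <sha1> multi_ack side-band-64k thin-pack exact-objects ofs-delta

with one plain 'want <sha1>' line per additional requested object.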

A bit more detailed description:

Here I use the term 'remote alternate' for the remote repository from which
the objects are downloaded when they are not available locally.

- fetch-pack.h
- builtin-fetch-pack.c
  Added --commits-only, --exact-objects, and --stdin options.
  --commits-only and --exact-objects trigger the protocol extensions described
  above; --stdin allows fetch-pack to read the list of refs or objects from
  stdin instead of the command line
- transport.h
- transport.c
- builtin-fetch.c
  Added a --commits-only option that is passed down to fetch-pack
- builtin-unpack-objects.c
- index-pack.c
  Added an --ignore-remote-alternates option that keeps these commands from
  fetching remote objects themselves, which would otherwise create a cycle
  while downloading the missing objects.
- cache.h
  Export the function that disables fetching remote objects.
- git-clone.sh
  Extended the handling of -s so that, when the git:// protocol is used, it
  activates the 'remote alternates' and thus produces a lazy clone (see the
  example after this list).  The information about where to get the missing
  objects from is stored in the objects/info/remote_alternates file.
- sha1_file.c
  The core of the changes.  When an object is requested while 'remote
  alternates' are enabled and it is not present locally, it is downloaded.
- upload-pack.c
  Extended so that just the commit objects, or just the exactly requested
  objects, are returned.
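
With all of the above in place, a lazy clone looks like this (hypothetical
URL):

  $ git clone -s git://git.example.com/ooo.git
  $ cd ooo
  $ git log                # works locally - the commit objects were cloned
  $ git checkout HEAD~100  # missing trees/blobs are fetched transparently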

Limitations/FIXMEs/TODOs:

Currently there can be just one 'remote alternate' in the
objects/info/remote_alternates file.  I'm not sure whether it makes sense at
all to allow more than one.

Some operations, like annotate, are too slow and thus unusable [though they
are not disabled, for the patient ones ;-)].

Not too much tested ;-) - maybe I'm leaking memory somewhere, better error
handling for the case when the pack is not available should be introduced,
maybe the names of the variables/functions/commands are not the best chosen,
etc.

Every fetch-pack gets the list of refs from the server, even in the
exact-objects case where it is unnecessary - we already know which objects we
want, so this just wastes bandwidth.

The new options are not documented.


So - comments, ideas, and questions are appreciated; any help with polishing
this / getting this in is appreciated even more ;-)

Regards,
Jan

Signed-off-by: Jan Holesovsky <kendy@suse.cz>
---
diff --git a/builtin-fetch-pack.c b/builtin-fetch-pack.c
index e68e015..69b4226 100644
--- a/builtin-fetch-pack.c
+++ b/builtin-fetch-pack.c
@@ -17,7 +17,7 @@ static struct fetch_pack_args args = {
 };
 
 static const char fetch_pack_usage[] =
-"git-fetch-pack [--all] [--quiet|-q] [--keep|-k] [--thin] [--upload-pack=<git-upload-pack>] [--depth=<n>] [--no-progress] [-v] [<host>:]<directory> [<refs>...]";
+"git-fetch-pack [--all] [--quiet|-q] [--keep|-k] [--thin] [--upload-pack=<git-upload-pack>] [--depth=<n>] [--no-progress] [--commits-only] [--exact-objects] [-v] [--stdin] [<host>:]<directory> [<refs>...|<sha1>...]";
 
 #define COMPLETE	(1U << 0)
 #define COMMON		(1U << 1)
@@ -141,6 +141,34 @@ static const unsigned char* get_rev(void)
 	return commit->object.sha1;
 }
 
+static void send_want(int fd[2], const char *remote, int full_info)
+{
+	if (full_info)
+		packet_write(fd[1], "want %s%s%s%s%s%s%s%s%s\n",
+				remote,
+				(multi_ack ? " multi_ack" : ""),
+				(use_sideband == 2 ? " side-band-64k" : ""),
+				(use_sideband == 1 ? " side-band" : ""),
+				(args.use_thin_pack ? " thin-pack" : ""),
+				(args.no_progress ? " no-progress" : ""),
+				(args.commits_only ? " commits-only" : ""),
+				(args.exact_objects ? " exact-objects" : ""),
+				" ofs-delta");
+	else
+		packet_write(fd[1], "want %s\n", remote);
+}
+
+static void get_exact_objects(int fd[2], int nr_match, char **match)
+{
+	int i;
+
+	/* send all the objects as we got them on the command line */
+	for (i = 0; i < nr_match; i++)
+		send_want(fd, match[i], !i);
+
+	packet_flush(fd[1]);
+}
+
 static int find_common(int fd[2], unsigned char *result_sha1,
 		       struct ref *refs)
 {
@@ -172,17 +200,7 @@ static int find_common(int fd[2], unsigned char *result_sha1,
 			continue;
 		}
 
-		if (!fetching)
-			packet_write(fd[1], "want %s%s%s%s%s%s%s\n",
-				     sha1_to_hex(remote),
-				     (multi_ack ? " multi_ack" : ""),
-				     (use_sideband == 2 ? " side-band-64k" : ""),
-				     (use_sideband == 1 ? " side-band" : ""),
-				     (args.use_thin_pack ? " thin-pack" : ""),
-				     (args.no_progress ? " no-progress" : ""),
-				     " ofs-delta");
-		else
-			packet_write(fd[1], "want %s\n", sha1_to_hex(remote));
+		send_want(fd, sha1_to_hex(remote), !fetching);
 		fetching++;
 	}
 	if (is_repository_shallow())
@@ -523,11 +541,15 @@ static int get_pack(int xd[2], char **pack_lockfile)
 				strcpy(keep_arg + s, "localhost");
 			*av++ = keep_arg;
 		}
+		if (args.exact_objects)
+			*av++ = "--ignore-remote-alternates";
 	}
 	else {
 		*av++ = "unpack-objects";
 		if (args.quiet)
 			*av++ = "-q";
+		if (args.exact_objects)
+			*av++ = "--ignore-remote-alternates";
 	}
 	if (*hdr_arg)
 		*av++ = hdr_arg;
@@ -556,6 +578,7 @@ static struct ref *do_fetch_pack(int fd[2],
 	unsigned char sha1[20];
 
 	get_remote_heads(fd[0], &ref, 0, NULL, 0);
+
 	if (is_repository_shallow() && !server_supports("shallow"))
 		die("Server does not support shallow clients");
 	if (server_supports("multi_ack")) {
@@ -573,20 +596,36 @@ static struct ref *do_fetch_pack(int fd[2],
 			fprintf(stderr, "Server supports side-band\n");
 		use_sideband = 1;
 	}
-	if (!ref) {
-		packet_flush(fd[1]);
-		die("no matching remote head");
+	if (!server_supports("remote-alternates") &&
+			(args.commits_only || args.exact_objects)) {
+		if (args.verbose)
+			fprintf(stderr, "Server does not support remote "
+					"alternates, ignoring %s%s\n",
+					(args.commits_only?
+						"--commits-only ": ""),
+					(args.exact_objects? "--exact-objects": ""));
+		args.commits_only = 0;
+		args.exact_objects = 0;
 	}
-	if (everything_local(&ref, nr_match, match)) {
-		packet_flush(fd[1]);
-		goto all_done;
+
+	if (args.exact_objects)
+		get_exact_objects(fd, nr_match, match);
+	else {
+		if (!ref) {
+			packet_flush(fd[1]);
+			die("no matching remote head");
+		}
+		if (everything_local(&ref, nr_match, match)) {
+			packet_flush(fd[1]);
+			goto all_done;
+		}
+		if (find_common(fd, sha1, ref) < 0)
+			if (!args.keep_pack)
+				/* When cloning, it is not unusual to have
+				 * no common commit.
+				 */
+				fprintf(stderr, "warning: no common commits\n");
 	}
-	if (find_common(fd, sha1, ref) < 0)
-		if (!args.keep_pack)
-			/* When cloning, it is not unusual to have
-			 * no common commit.
-			 */
-			fprintf(stderr, "warning: no common commits\n");
 
 	if (get_pack(fd, pack_lockfile))
 		die("git-fetch-pack: fetch failed.");
@@ -647,12 +686,72 @@ static void fetch_pack_setup(void)
 	did_setup = 1;
 }
 
+static void read_from_stdin(int *num, char ***records)
+{
+	char buffer[4096];
+	size_t records_num, leftover;
+	ssize_t ret;
+
+	*num = 0;
+	leftover = 0;
+
+	records_num = 4096;
+	(*records) = xmalloc(records_num * sizeof(char *));
+
+	do {
+		char *p, *last;
+
+		ret = xread(0 /*stdin*/, buffer + leftover,
+				sizeof(buffer) - leftover);
+		if (ret < 0)
+			die("read error on input: %s", strerror(errno));
+
+		last = buffer;
+		for (p = buffer; p < buffer + leftover + ret; p++)
+			if ((!*p || *p == '\n') && (p != last)) {
+				if (*num >= records_num) {
+					records_num *= 2;
+					(*records) = xrealloc(*records,
+							      records_num * sizeof(char*));
+				}
+
+				if (p - last > 0) {
+					(*records)[*num] =
+						strndup(last, p - last);
+					(*num)++;
+				}
+				last = p + 1;
+			}
+
+		leftover = p - last;
+		if (leftover >= sizeof(buffer))
+			die("input line too long");
+		if (leftover < 0)
+			leftover = 0;
+
+		memmove(buffer, last, leftover);
+	} while (ret > 0);
+
+	if (leftover) {
+		if (*num >= records_num) {
+			records_num *= 2;
+			(*records) = xrealloc(*records,
+					      records_num * sizeof(char*));
+		}
+
+		(*records)[*num] = strndup(buffer, leftover);
+		(*num)++;
+	}
+}
+
 int cmd_fetch_pack(int argc, const char **argv, const char *prefix)
 {
 	int i, ret, nr_heads;
 	struct ref *ref;
 	char *dest = NULL, **heads;
+	int from_stdin;
 
+	from_stdin = 0;
 	nr_heads = 0;
 	heads = NULL;
 	for (i = 1; i < argc; i++) {
@@ -696,6 +795,19 @@ int cmd_fetch_pack(int argc, const char **argv, const char *prefix)
 				args.no_progress = 1;
 				continue;
 			}
+			if (!strcmp("--commits-only", arg)) {
+				args.commits_only = 1;
+				continue;
+			}
+			if (!strcmp("--exact-objects", arg)) {
+				args.exact_objects = 1;
+				disable_remote_alternates();
+				continue;
+			}
+			if (!strcmp("--stdin", arg)) {
+				from_stdin = 1;
+				continue;
+			}
 			usage(fetch_pack_usage);
 		}
 		dest = (char *)arg;
@@ -706,14 +818,18 @@ int cmd_fetch_pack(int argc, const char **argv, const char *prefix)
 	if (!dest)
 		usage(fetch_pack_usage);
 
+	if (from_stdin)
+		read_from_stdin(&nr_heads, &heads);
+
 	ref = fetch_pack(&args, dest, nr_heads, heads, NULL);
 	ret = !ref;
 
-	while (ref) {
-		printf("%s %s\n",
-		       sha1_to_hex(ref->old_sha1), ref->name);
-		ref = ref->next;
-	}
+	if (!args.exact_objects)
+		while (ref) {
+			printf("%s %s\n",
+					sha1_to_hex(ref->old_sha1), ref->name);
+			ref = ref->next;
+		}
 
 	return ret;
 }
@@ -746,7 +862,7 @@ struct ref *fetch_pack(struct fetch_pack_args *my_args,
 	close(fd[1]);
 	ret = finish_connect(conn);
 
-	if (!ret && nr_heads) {
+	if (!ret && nr_heads && !args.exact_objects) {
 		/* If the heads to pull were given, we should have
 		 * consumed all of them by matching the remote.
 		 * Otherwise, 'git-fetch remote no-such-ref' would
diff --git a/builtin-fetch.c b/builtin-fetch.c
index 320e235..858384a 100644
--- a/builtin-fetch.c
+++ b/builtin-fetch.c
@@ -22,7 +22,7 @@ enum {
 	TAGS_SET = 2
 };
 
-static int append, force, keep, update_head_ok, verbose, quiet;
+static int append, force, keep, update_head_ok, verbose, quiet, commits_only;
 static int tags = TAGS_DEFAULT;
 static const char *depth;
 static const char *upload_pack;
@@ -45,6 +45,8 @@ static struct option builtin_fetch_options[] = {
 		    "allow updating of HEAD ref"),
 	OPT_STRING(0, "depth", &depth, "DEPTH",
 		   "deepen history of shallow clone"),
+	OPT_BOOLEAN(0, "commits-only", &commits_only,
+		    "fetch just the commit objects, leave the tree, blob, and tag objects for later"),
 	OPT_END()
 };
 
@@ -602,6 +604,8 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 		set_option(TRANS_OPT_KEEP, "yes");
 	if (depth)
 		set_option(TRANS_OPT_DEPTH, depth);
+	if (commits_only)
+		set_option(TRANS_OPT_COMMITS_ONLY, "yes");
 
 	if (!transport->url)
 		die("Where do you want to fetch from today?");
diff --git a/builtin-unpack-objects.c b/builtin-unpack-objects.c
index 1e51865..58d8a41 100644
--- a/builtin-unpack-objects.c
+++ b/builtin-unpack-objects.c
@@ -10,7 +10,7 @@
 #include "progress.h"
 
 static int dry_run, quiet, recover, has_errors;
-static const char unpack_usage[] = "git-unpack-objects [-n] [-q] [-r] < pack-file";
+static const char unpack_usage[] = "git-unpack-objects [-n] [-q] [-r] [--ignore-remote-alternates] < pack-file";
 
 /* We always read in 4kB chunks. */
 static unsigned char buffer[4096];
@@ -359,6 +359,10 @@ int cmd_unpack_objects(int argc, const char **argv, const char *prefix)
 				recover = 1;
 				continue;
 			}
+			if (!strcmp(arg, "--ignore-remote-alternates")) {
+				disable_remote_alternates();
+				continue;
+			}
 			if (!prefixcmp(arg, "--pack_header=")) {
 				struct pack_header *hdr;
 				char *c;
diff --git a/cache.h b/cache.h
index 549f4bb..def7459 100644
--- a/cache.h
+++ b/cache.h
@@ -480,6 +480,8 @@ extern struct alternate_object_database {
 } *alt_odb_list;
 extern void prepare_alt_odb(void);
 
+extern void disable_remote_alternates(void);
+
 struct pack_window {
 	struct pack_window *next;
 	unsigned char *base;
diff --git a/fetch-pack.h b/fetch-pack.h
index a7888ea..0c3b13f 100644
--- a/fetch-pack.h
+++ b/fetch-pack.h
@@ -12,7 +12,9 @@ struct fetch_pack_args
 		use_thin_pack:1,
 		fetch_all:1,
 		verbose:1,
-		no_progress:1;
+		no_progress:1,
+		commits_only:1,
+		exact_objects:1;
 };
 
 struct ref *fetch_pack(struct fetch_pack_args *args,
diff --git a/git-clone.sh b/git-clone.sh
index b4e858c..208e9fc 100755
--- a/git-clone.sh
+++ b/git-clone.sh
@@ -115,7 +115,7 @@ Perhaps git-update-server-info needs to be run there?"
 quiet=
 local=no
 use_local_hardlink=yes
-local_shared=no
+shared=no
 unset template
 no_checkout=
 upload_pack=
@@ -143,7 +143,7 @@ do
 	--no-hardlinks)
 		use_local_hardlink=no ;;
 	-s|--shared)
-		local_shared=yes ;;
+		shared=yes ;;
 	--template)
 		shift; template="--template=$1" ;;
 	-q|--quiet)
@@ -288,7 +288,7 @@ yes)
 	( cd "$repo/objects" ) ||
 		die "cannot chdir to local '$repo/objects'."
 
-	if test "$local_shared" = yes
+	if test "$shared" = yes
 	then
 		mkdir -p "$GIT_DIR/objects/info"
 		echo "$repo/objects" >>"$GIT_DIR/objects/info/alternates"
@@ -364,11 +364,22 @@ yes)
 		fi
 		;;
 	*)
+		commits_only=
+		if test "$shared" = yes
+		then
+			commits_only="--commits-only"
+		fi
 		case "$upload_pack" in
-		'') git-fetch-pack --all -k $quiet $depth $no_progress "$repo";;
-		*) git-fetch-pack --all -k $quiet "$upload_pack" $depth $no_progress "$repo" ;;
+		'') git-fetch-pack --all -k $quiet $depth $no_progress $commits_only "$repo";;
+		*) git-fetch-pack --all -k $quiet "$upload_pack" $depth $no_progress $commits_only "$repo" ;;
 		esac >"$GIT_DIR/CLONE_HEAD" ||
 			die "fetch-pack from '$repo' failed."
+		if test "$shared" = yes
+		then
+			# Must be done after the fetch
+			mkdir -p "$GIT_DIR/objects/info"
+			echo "$repo" >> "$GIT_DIR/objects/info/remote_alternates"
+		fi
 		;;
 	esac
 	;;
diff --git a/index-pack.c b/index-pack.c
index 9fd6982..f2e6b7a 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -9,7 +9,7 @@
 #include "progress.h"
 
 static const char index_pack_usage[] =
-"git-index-pack [-v] [-o <index-file>] [{ ---keep | --keep=<msg> }] { <pack-file> | --stdin [--fix-thin] [<pack-file>] }";
+"git-index-pack [-v] [-o <index-file>] [{ ---keep | --keep=<msg> }] [--ignore-remote-alternates] { <pack-file> | --stdin [--fix-thin] [<pack-file>] }";
 
 struct object_entry
 {
@@ -746,6 +746,8 @@ int main(int argc, char **argv)
 					pack_idx_off32_limit = strtoul(c+1, &c, 0);
 				if (*c || pack_idx_off32_limit & 0x80000000)
 					die("bad %s", arg);
+			} else if (!strcmp(arg, "--ignore-remote-alternates")) {
+				disable_remote_alternates();
 			} else
 				usage(index_pack_usage);
 			continue;
diff --git a/sha1_file.c b/sha1_file.c
index 66a4e00..7d60be0 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -14,6 +14,7 @@
 #include "tag.h"
 #include "tree.h"
 #include "refs.h"
+#include "run-command.h"
 
 #ifndef O_NOATIME
 #if defined(__linux__) && (defined(__i386__) || defined(__PPC__))
@@ -411,6 +412,205 @@ static char *find_sha1_file(const unsigned char *sha1, struct stat *st)
 	return NULL;
 }
 
+static char *remote_alternates = NULL;
+static int has_remote_alt_feature = -1;
+
+void disable_remote_alternates(void)
+{
+	has_remote_alt_feature = 0;
+}
+
+static int has_remote_alternates(void)
+{
+	/* FIXME: does it make sense to support more URLs inside
+	 * remote_alternates? */
+	struct stat st;
+	const char remote_alt_file_name[] = "info/remote_alternates";
+	char path[PATH_MAX + 1 + sizeof remote_alt_file_name];
+	int fd;
+	char *map, *p;
+	size_t mapsz;
+
+	if (has_remote_alt_feature != -1)
+		return has_remote_alt_feature;
+
+	has_remote_alt_feature = 0;
+
+	sprintf(path, "%s/%s", get_object_directory(),
+			remote_alt_file_name);
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+		return has_remote_alt_feature;
+	else if (fstat(fd, &st) || (st.st_size == 0)) {
+		close(fd);
+		return has_remote_alt_feature;
+	}
+
+	mapsz = xsize_t(st.st_size);
+	map = xmmap(NULL, mapsz, PROT_READ, MAP_PRIVATE, fd, 0);
+	close(fd);
+
+	/* we support just one remote alternate for now,
+	 * so read just the first entry */
+	for (p = map; (p < map + mapsz) && (*p != '\n'); p++)
+		;
+
+	remote_alternates = strndup(map, p - map);
+
+	munmap(map, mapsz);
+
+	if (remote_alternates && remote_alternates[0])
+		has_remote_alt_feature = 1;
+
+	return has_remote_alt_feature;
+}
+
+struct sha1_list {
+	unsigned char sha1[20];
+	struct sha1_list *next;
+};
+
+static int has_sha1_file_locally(const unsigned char *sha1);
+
+static int async_dump_objects(int fd, void *data)
+{
+	FILE *out = NULL;
+	struct sha1_list *list;
+
+	out = fdopen(fd, "w");
+
+	list = (struct sha1_list *)data;
+	while (list) {
+		if (!has_sha1_file_locally(list->sha1))
+			fprintf(out, "%s\n", sha1_to_hex(list->sha1));
+
+		list = list->next;
+	}
+
+	fflush(out);
+	return 0;
+}
+
+static int fetch_remote_sha1s(struct sha1_list *objects)
+{
+	struct async dump_objects;
+	struct child_process fetch_pack;
+	const char *argv[20];
+	int argc = 0;
+	int err;
+
+	if (!objects)
+		return 0;
+
+	/* this will fill the stdin of fetch-pack */
+	dump_objects.proc = async_dump_objects;
+	dump_objects.data = objects;
+
+	if (start_async(&dump_objects))
+		die("unable to send data to fetch-pack");
+
+	argv[argc++] = "fetch-pack";
+	argv[argc++] = "--stdin";
+	argv[argc++] = "--exact-objects";
+	argv[argc++] = remote_alternates;
+	argv[argc] = NULL;
+
+	memset(&fetch_pack, 0, sizeof(fetch_pack));
+	fetch_pack.in = dump_objects.out;
+	fetch_pack.out = 1;
+	fetch_pack.err = 2;
+	fetch_pack.git_cmd = 1;
+	fetch_pack.argv = argv;
+
+	err = run_command(&fetch_pack);
+
+	/* TODO better error handling - is the object really missing, or
+	 * was it just a temporary network error? */
+	if (err) {
+		fprintf(stderr, "error %d while calling fetch-pack\n", err);
+		return 0;
+	}
+
+	return 1;
+}
+
+static struct sha1_list *remote_list = NULL;
+
+static int fill_remote_list(const unsigned char *sha1,
+		const char *base, int baselen,
+		const char *pathname, unsigned mode, int stage)
+{
+	if (!has_sha1_file_locally(sha1)) {
+		struct sha1_list *item;
+
+		item = xmalloc(sizeof(*item));
+		hashcpy(item->sha1, sha1);
+		item->next = remote_list;
+
+		remote_list = item;
+	}
+
+	return 0;
+}
+
+static int fetch_remote_sha1s_recursive(struct sha1_list *objects)
+{
+	struct sha1_list *list;
+	int ret = 0;
+
+	/* first of all, fetch the missing objects */
+	if (!fetch_remote_sha1s(objects))
+		return 0;
+
+	remote_list = NULL;
+
+	list = objects;
+	while (list) {
+		struct tree *tree;
+
+		tree = parse_tree_indirect(list->sha1);
+		if (tree) {
+			read_tree_recursive(tree, "", 0, 0, NULL,
+					fill_remote_list);
+		}
+
+		list = list->next;
+	}
+
+	list = remote_list;
+	if (!list)
+		return 1; /* hooray, we have everything */
+
+	ret = fetch_remote_sha1s_recursive(list);
+
+	while (list) {
+		struct sha1_list *item;
+
+		item = list;
+		list = list->next;
+
+		free(item);
+	}
+
+	return ret;
+}
+
+static int download_remote_sha1(const unsigned char *sha1)
+{
+	struct sha1_list item;
+	int ret;
+
+	if (!has_remote_alternates())
+		return 0;
+
+	hashcpy(item.sha1, sha1);
+	item.next = NULL;
+
+	ret = fetch_remote_sha1s_recursive(&item);
+
+	return ret;
+}
+
 static unsigned int pack_used_ctr;
 static unsigned int pack_mmap_calls;
 static unsigned int peak_pack_open_windows;
@@ -1880,7 +2080,7 @@ int pretend_sha1_file(void *buf, unsigned long len, enum object_type type,
 	return 0;
 }
 
-void *read_sha1_file(const unsigned char *sha1, enum object_type *type,
+static void *read_sha1_file_locally(const unsigned char *sha1, enum object_type *type,
 		     unsigned long *size)
 {
 	unsigned long mapsize;
@@ -1897,6 +2097,7 @@ void *read_sha1_file(const unsigned char *sha1, enum object_type *type,
 	buf = read_packed_sha1(sha1, type, size);
 	if (buf)
 		return buf;
+
 	map = map_sha1_file(sha1, &mapsize);
 	if (map) {
 		buf = unpack_sha1_file(map, mapsize, type, size, sha1);
@@ -1907,6 +2108,21 @@ void *read_sha1_file(const unsigned char *sha1, enum object_type *type,
 	return read_packed_sha1(sha1, type, size);
 }
 
+void *read_sha1_file(const unsigned char *sha1, enum object_type *type,
+		     unsigned long *size)
+{
+	void *result;
+
+	result = read_sha1_file_locally(sha1, type, size);
+
+	/* if it's remote, and we don't have it yet, download it now and try
+	 * again */
+	if (!result && has_remote_alternates() && download_remote_sha1(sha1))
+		result = read_sha1_file_locally(sha1, type, size);
+
+	return result;
+}
+
 void *read_object_with_reference(const unsigned char *sha1,
 				 const char *required_type_name,
 				 unsigned long *size,
@@ -2306,7 +2522,7 @@ int has_sha1_pack(const unsigned char *sha1, const char **ignore_packed)
 	return find_pack_entry(sha1, &e, ignore_packed);
 }
 
-int has_sha1_file(const unsigned char *sha1)
+static int has_sha1_file_locally(const unsigned char *sha1)
 {
 	struct stat st;
 	struct pack_entry e;
@@ -2316,6 +2532,18 @@ int has_sha1_file(const unsigned char *sha1)
 	return find_sha1_file(sha1, &st) ? 1 : 0;
 }
 
+int has_sha1_file(const unsigned char *sha1)
+{
+	if (has_sha1_file_locally(sha1))
+		return 1;
+
+	/* download it if necessary */
+	if (has_remote_alternates() && download_remote_sha1(sha1))
+		return has_sha1_file_locally(sha1);
+
+	return 0;
+}
+
 int index_pipe(unsigned char *sha1, int fd, const char *type, int write_object)
 {
 	struct strbuf buf;
diff --git a/transport.c b/transport.c
index babaa21..918c390 100644
--- a/transport.c
+++ b/transport.c
@@ -562,6 +562,7 @@ static int close_bundle(struct transport *transport)
 struct git_transport_data {
 	unsigned thin : 1;
 	unsigned keep : 1;
+	unsigned commits_only : 1;
 	int depth;
 	const char *uploadpack;
 	const char *receivepack;
@@ -589,6 +590,9 @@ static int set_git_option(struct transport *connection,
 		else
 			data->depth = atoi(value);
 		return 0;
+	} else if (!strcmp(name, TRANS_OPT_COMMITS_ONLY)) {
+		data->commits_only = !!value;
+		return 0;
 	}
 	return 1;
 }
@@ -629,6 +633,7 @@ static int fetch_refs_via_pack(struct transport *transport,
 	args.use_thin_pack = data->thin;
 	args.verbose = transport->verbose > 0;
 	args.depth = data->depth;
+	args.commits_only = data->commits_only;
 
 	for (i = 0; i < nr_heads; i++)
 		origh[i] = heads[i] = xstrdup(to_fetch[i]->name);
diff --git a/transport.h b/transport.h
index 6fb4526..4076186 100644
--- a/transport.h
+++ b/transport.h
@@ -53,6 +53,10 @@ struct transport *transport_get(struct remote *, const char *);
 /* Limit the depth of the fetch if not null */
 #define TRANS_OPT_DEPTH "depth"
 
+/* Download only the commit objects; leave the tree, blob and tag objects
+ * for later */
+#define TRANS_OPT_COMMITS_ONLY "commits-only"
+
 /**
  * Returns 0 if the option was used, non-zero otherwise. Prints a
  * message to stderr if the option is not used.
diff --git a/upload-pack.c b/upload-pack.c
index 7e04311..2d047ec 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -27,7 +27,7 @@ static const char upload_pack_usage[] = "git-upload-pack [--strict] [--timeout=n
 static unsigned long oldest_have;
 
 static int multi_ack, nr_our_refs;
-static int use_thin_pack, use_ofs_delta, no_progress;
+static int use_thin_pack, use_ofs_delta, no_progress, commits_only, exact_objects;
 static struct object_array have_obj;
 static struct object_array want_obj;
 static unsigned int timeout;
@@ -106,9 +106,15 @@ static int do_rev_list(int fd, void *create_full_pack)
 	if (create_full_pack)
 		use_thin_pack = 0; /* no point doing it */
 	init_revisions(&revs, NULL);
-	revs.tag_objects = 1;
-	revs.tree_objects = 1;
-	revs.blob_objects = 1;
+	if (!commits_only) {
+		revs.tag_objects = 1;
+		revs.tree_objects = 1;
+		revs.blob_objects = 1;
+	} else {
+		revs.tag_objects = 0;
+		revs.tree_objects = 0;
+		revs.blob_objects = 0;
+	}
 	if (use_thin_pack)
 		revs.edge_hint = 1;
 
@@ -135,6 +141,20 @@ static int do_rev_list(int fd, void *create_full_pack)
 	return 0;
 }
 
+static int dump_want_objects(int fd, void *data)
+{
+	int i;
+	pack_pipe = fdopen(fd, "w");
+
+	for (i = 0; i < want_obj.nr; i++) {
+		struct object *o = want_obj.objects[i].item;
+		fprintf(pack_pipe, "%s\n", sha1_to_hex(o->sha1));
+	}
+
+	fflush(pack_pipe);
+	return 0;
+}
+
 static void create_pack_file(void)
 {
 	struct async rev_list;
@@ -148,7 +168,10 @@ static void create_pack_file(void)
 	const char *argv[10];
 	int arg = 0;
 
-	rev_list.proc = do_rev_list;
+	if (!exact_objects)
+		rev_list.proc = do_rev_list;
+	else
+		rev_list.proc = dump_want_objects;
 	/* .data is just a boolean: any non-NULL value will do */
 	rev_list.data = create_full_pack ? &rev_list : NULL;
 	if (start_async(&rev_list))
@@ -489,6 +512,10 @@ static void receive_needs(void)
 			use_sideband = DEFAULT_PACKET_MAX;
 		if (strstr(line+45, "no-progress"))
 			no_progress = 1;
+		if (strstr(line+45, "commits-only"))
+			commits_only = 1;
+		if (strstr(line+45, "exact-objects"))
+			exact_objects = 1;
 
 		/* We have sent all our refs already, and the other end
 		 * should have chosen out of them; otherwise they are
@@ -498,9 +525,15 @@ static void receive_needs(void)
 		 * asks for something like "master~10" (symbolic)...
 		 * would it make sense?  I don't know.
 		 */
-		o = lookup_object(sha1_buf);
-		if (!o || !(o->flags & OUR_REF))
-			die("git-upload-pack: not our ref %s", line+5);
+		if (!exact_objects) {
+			o = lookup_object(sha1_buf);
+			if (!o || !(o->flags & OUR_REF))
+				die("git-upload-pack: not our ref %s", line+5);
+		} else {
+			o = lookup_unknown_object(sha1_buf);
+			if (!o)
+				die("git-upload-pack: not an object %s", line+5);
+		}
 		if (!(o->flags & WANTED)) {
 			o->flags |= WANTED;
 			add_object_array(o, NULL, &want_obj);
@@ -557,7 +590,7 @@ static void receive_needs(void)
 static int send_ref(const char *refname, const unsigned char *sha1, int flag, void *cb_data)
 {
 	static const char *capabilities = "multi_ack thin-pack side-band"
-		" side-band-64k ofs-delta shallow no-progress";
+		" side-band-64k ofs-delta shallow no-progress remote-alternates";
 	struct object *o = parse_object(sha1);
 
 	if (!o)
@@ -588,7 +621,8 @@ static void upload_pack(void)
 	packet_flush(1);
 	receive_needs();
 	if (want_obj.nr) {
-		get_common_commits();
+		if (!exact_objects)
+			get_common_commits();
 		create_pack_file();
 	}
 }

* Re: [PATCH] RFC: git lazy clone proof-of-concept
From: Nicolas Pitre @ 2008-02-08 18:03 UTC
  To: Jan Holesovsky; +Cc: git, gitster

On Fri, 8 Feb 2008, Jan Holesovsky wrote:

> We are currently evaluating git for OpenOffice.org as one of the candidates
> (SVN is the other one); see
> 
>   http://wiki.services.openoffice.org/wiki/SCM_Migration
> 
> I've provided a git import of OOo with the entire history; the problem is that
> the pack is 2.5G, so it's not too convenient to download for casual
> developers who just want to try it.  A shallow clone is not a possibility - we
> don't get patches through mailing lists, so we need pull/push, and also,
> thanks to the OOo development cycle, we have so many living heads that a
> shallow clone downloads about 1.5G even with --depth 1.

How did you repack your repository?

We know that the current defaults are not suitable for large projects.  For 
example, the gcc git repository shrank from a 1.5GB pack down to 230MB 
after some tuning.


Nicolas

* Re: [PATCH] RFC: git lazy clone proof-of-concept
From: Harvey Harrison @ 2008-02-08 18:14 UTC
  To: Jan Holesovsky; +Cc: git, gitster

On Fri, 2008-02-08 at 18:28 +0100, Jan Holesovsky wrote:
> Hi,
> 
> This is my attempt to implement the 'lazy clone' I've read about a bit in the
> git mailing list archive, but did not see implemented anywhere - a clone
> that fetches a minimal amount of data, with the possibility of downloading the
> rest later (transparently!) when necessary.  I am sorry to send it as a huge
> patch, not as a series of patches, but as I don't know whether the way I chose
> is acceptable to you [I'm new to the git code ;-)], I'd like to hear some
> feedback first, and then I'll split it into smaller pieces for easier
> integration - if OK.
> 
> Background:
> 
> We are currently evaluating git for OpenOffice.org as one of the candidates
> (SVN is the other one); see
> 
>   http://wiki.services.openoffice.org/wiki/SCM_Migration
> 
> I've provided a git import of OOo with the entire history; the problem is that
> the pack is 2.5G, so it's not too convenient to download for casual
> developers who just want to try it.  A shallow clone is not a possibility - we
> don't get patches through mailing lists, so we need pull/push, and also,
> thanks to the OOo development cycle, we have so many living heads that a
> shallow clone downloads about 1.5G even with --depth 1.  Lazy clone sounded
> like the right idea to me.  With this proof-of-concept implementation, just
> about 550M of the 2.5G is downloaded, which is still about twice as much as
> downloading a tarball, but bearable.

For comparison, how big is the svn repo you're testing?  My experience
has been that git ends up about 15-20 times smaller than SVN once a tuned
repack has been done.

Cheers,

Harvey

* Re: [PATCH] RFC: git lazy clone proof-of-concept
From: Johannes Schindelin @ 2008-02-08 18:20 UTC
  To: Jan Holesovsky; +Cc: git, gitster

Hi,

On Fri, 8 Feb 2008, Jan Holesovsky wrote:

> +static void send_want(int fd[2], const char *remote, int full_info)
> +{
> +	if (full_info)
> +		packet_write(fd[1], "want %s%s%s%s%s%s%s%s%s\n",
> +				remote,
> +				(multi_ack ? " multi_ack" : ""),
> +				(use_sideband == 2 ? " side-band-64k" : ""),
> +				(use_sideband == 1 ? " side-band" : ""),
> +				(args.use_thin_pack ? " thin-pack" : ""),
> +				(args.no_progress ? " no-progress" : ""),
> +				(args.commits_only ? " commits-only" : ""),
> +				(args.exact_objects ? " exact-objects" : ""),
> +				" ofs-delta");
> +	else
> +		packet_write(fd[1], "want %s\n", remote);
> +}

You might want to make the full_info static, and only send the options the 
first time.
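
Something like this (an untested sketch keeping your calling convention):

	static void send_want(int fd[2], const char *remote)
	{
		static int options_sent;

		if (!options_sent) {
			options_sent = 1;
			packet_write(fd[1], "want %s%s%s%s%s%s%s%s%s\n",
					remote,
					(multi_ack ? " multi_ack" : ""),
					(use_sideband == 2 ? " side-band-64k" : ""),
					(use_sideband == 1 ? " side-band" : ""),
					(args.use_thin_pack ? " thin-pack" : ""),
					(args.no_progress ? " no-progress" : ""),
					(args.commits_only ? " commits-only" : ""),
					(args.exact_objects ? " exact-objects" : ""),
					" ofs-delta");
		} else
			packet_write(fd[1], "want %s\n", remote);
	}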

> @@ -523,11 +541,15 @@ static int get_pack(int xd[2], char **pack_lockfile)
>  				strcpy(keep_arg + s, "localhost");
>  			*av++ = keep_arg;
>  		}
> +		if (args.exact_objects)
> +			*av++ = "--ignore-remote-alternates";
>  	}
>  	else {
>  		*av++ = "unpack-objects";
>  		if (args.quiet)
>  			*av++ = "-q";
> +		if (args.exact_objects)
> +			*av++ = "--ignore-remote-alternates";
>  	}

You can move this outside of the if() instead of repeating yourself...
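
I.e. roughly:

	}
	else {
		*av++ = "unpack-objects";
		if (args.quiet)
			*av++ = "-q";
	}
	if (args.exact_objects)
		*av++ = "--ignore-remote-alternates";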

> @@ -556,6 +578,7 @@ static struct ref *do_fetch_pack(int fd[2],
>  	unsigned char sha1[20];
>  
>  	get_remote_heads(fd[0], &ref, 0, NULL, 0);
> +
>  	if (is_repository_shallow() && !server_supports("shallow"))
>  		die("Server does not support shallow clients");
>  	if (server_supports("multi_ack")) {

Not strictly necessary, right? ;-)

> @@ -647,12 +686,72 @@ static void fetch_pack_setup(void)
>  	did_setup = 1;
>  }
>  
> +static void read_from_stdin(int *num, char ***records)
> +{
> +	char buffer[4096];
> +	size_t records_num, leftover;
> +	ssize_t ret;
> +
> +	*num = 0;
> +	leftover = 0;
> +
> +	records_num = 4096;
> +	(*records) = xmalloc(records_num * sizeof(char *));
> +
> +	do {
> +		char *p, *last;
> +
> +		ret = xread(0 /*stdin*/, buffer + leftover,
> +				sizeof(buffer) - leftover);
> +		if (ret < 0)
> +			die("read error on input: %s", strerror(errno));
> +
> +		last = buffer;
> +		for (p = buffer; p < buffer + leftover + ret; p++)
> +			if ((!*p || *p == '\n') && (p != last)) {
> +				if (*num >= records_num) {
> +					records_num *= 2;
> +					(*records) = xrealloc(*records,
> +							      records_num * sizeof(char*));
> +				}
> +
> +				if (p - last > 0) {
> +					(*records)[*num] =
> +						strndup(last, p - last);
> +					(*num)++;
> +				}
> +				last = p + 1;
> +			}
> +
> +		leftover = p - last;
> +		if (leftover >= sizeof(buffer))
> +			die("input line too long");
> +		if (leftover < 0)
> +			leftover = 0;
> +
> +		memmove(buffer, last, leftover);
> +	} while (ret > 0);
> +
> +	if (leftover) {
> +		if (*num >= records_num) {
> +			records_num *= 2;
> +			(*records) = xrealloc(*records,
> +					      records_num * sizeof(char*));
> +		}
> +
> +		(*records)[*num] = strndup(buffer, leftover);
> +		(*num)++;
> +	}
> +}
> +

This chunk could use ALLOC_GROW() quite nicely (would make it more 
readable, and avoid errors).  Also, I'd use alloc_nr() instead of the 
doubling.
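
With ALLOC_GROW() the growing-and-storing part reduces to something like
this sketch ('records_alloc' being a new int that replaces the records_num
bookkeeping):

	ALLOC_GROW(*records, *num + 1, records_alloc);
	(*records)[(*num)++] = strndup(last, p - last);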

>  int cmd_fetch_pack(int argc, const char **argv, const char *prefix)
>  {
>  	int i, ret, nr_heads;
>  	struct ref *ref;
>  	char *dest = NULL, **heads;
> +	int from_stdin;
>  
> +	from_stdin = 0;

You can initialise it to 0 right away...

Unfortunately, I have to go now... so I will review the rest 
(from builtin-fetch.c on) later.

It's great seeing that you work on this!

Thanks,
Dscho

* Re: [PATCH] RFC: git lazy clone proof-of-concept
From: Mike Hommey @ 2008-02-08 18:49 UTC
  To: Jan Holesovsky; +Cc: git, gitster

On Fri, Feb 08, 2008 at 06:28:43PM +0100, Jan Holesovsky wrote:
> Hi,
> 
> This is my attempt to implement the 'lazy clone' I've read about a bit in the
> git mailing list archive, but did not see implemented anywhere - a clone
> that fetches a minimal amount of data, with the possibility of downloading the
> rest later (transparently!) when necessary.  I am sorry to send it as a huge
> patch, not as a series of patches, but as I don't know whether the way I chose
> is acceptable to you [I'm new to the git code ;-)], I'd like to hear some
> feedback first, and then I'll split it into smaller pieces for easier
> integration - if OK.
> 
> Background:
> 
> We are currently evaluating git for OpenOffice.org as one of the candidates
> (SVN is the other one); see
> 
>   http://wiki.services.openoffice.org/wiki/SCM_Migration
> 
> I've provided a git import of OOo with the entire history; the problem is that
> the pack is 2.5G, so it's not too convenient to download for casual
> developers who just want to try it.  A shallow clone is not a possibility - we
> don't get patches through mailing lists, so we need pull/push, and also,
> thanks to the OOo development cycle, we have so many living heads that a
> shallow clone downloads about 1.5G even with --depth 1.  Lazy clone sounded
> like the right idea to me.  With this proof-of-concept implementation, just
> about 550M of the 2.5G is downloaded, which is still about twice as much as
> downloading a tarball, but bearable.
<snip>

There are 2 things here:
- You can probably make your pack smaller with proper window sizing.
Try taking a look at the "Git and GCC" thread that crossed borders
between the gcc and the git mailing lists.
- There are tricks to do roughly what you want without modifying git.
For example, you can prepare several "shared" clones of your repo (git
clone -s) and leave in each only a few branches.  Cloning from these will
only pull the needed data (see the sketch below).
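
A sketch of that setup (the paths are made up):

	$ git clone -s /srv/git/ooo.git /srv/git/ooo-small.git
	$ cd /srv/git/ooo-small.git
	$ git branch -D <every branch you don't want to offer>

Clones of ooo-small.git will then only pull the objects reachable from
its remaining branches.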

Mike

* Re: [PATCH] RFC: git lazy clone proof-of-concept
From: Jakub Narebski @ 2008-02-08 19:00 UTC
  To: Jan Holesovsky; +Cc: git, gitster

Jan Holesovsky <kendy@suse.cz> writes:

> This is my attempt to implement the 'lazy clone' I've read about a
> bit in the git mailing list archive, but did not see implemented
> anywhere - a clone that fetches a minimal amount of data, with the
> possibility of downloading the rest later (transparently!) when
> necessary.

It was not implemented because it was thought to be hard; git assumes
in many places that if it has an object, it has all objects referenced
by it.

But it is very nice of you to [try to] implement 'lazy clone'/'remote
alternates'.

Could you provide some benchmarks (time, network throughput, latency)
for your implementation?

> We are currently evaluating git for OpenOffice.org as one of the
> candidates (SVN is the other one); see
> 
>   http://wiki.services.openoffice.org/wiki/SCM_Migration
> 
> I've provided a git import of OOo with the entire history; the
> problem is that the pack is 2.5G, so it's not too convenient to
> download for casual developers who just want to try it.

One of the reasons why 'lazy clone' was not implemented was the fact
that by using a large enough window, and a larger than default delta
chain length, you can repack an "archive pack" (and keep it from being
repacked by using .keep files, see git-config(1)) much tighter than with
the default (time and CPU conserving) options, and much, much tighter
than a pack which is the result of a fast-import driven import.

Both the Mozilla import and the GCC import were packed below 0.5 GB.
Warning: you would need a machine with a large amount of memory to
repack it tightly in a sensible time!
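
For example (the numbers are per-project tuning, not a recommendation):

	$ git repack -a -d -f --window=250 --depth=250
	$ touch .git/objects/pack/pack-<sha1>.keep

The .keep file keeps later repacks from touching the tightly packed
"archive pack".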

> A shallow clone is not a possibility - we don't get patches through
> mailing lists, so we need pull/push, and also, thanks to the OOo
> development cycle, we have so many living heads that a shallow clone
> downloads about 1.5G even with --depth 1.

Wouldn't it be easier to try to fix the shallow clone implementation to
allow pushing from a shallow to a full clone (fetching from full to
shallow is implemented), and perhaps also push/pull between two shallow
clones?

As to the many living heads: first, you don't need to fetch all
heads.  Currently git-clone has no option to select a subset of heads to
clone, but you can always use git-init + hand configuration + git-remote
and git-fetch for the actual fetching.
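
Roughly (the URL and branch name are made up):

	$ git init
	$ git remote add -t master origin git://git.example.com/ooo.git
	$ git fetch origin

The -t option limits which remote heads get tracked and fetched.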


By the way, did you try to split the OpenOffice.org repository at the
component boundaries into submodules (subprojects)?  This would also
limit the amount of needed download, as you don't need to download and
check out all subprojects.

The problem of course is _how_ to split the repository into
submodules.  Submodules should be self-contained enough that a
whole-tree commit is always (or almost always) only about one submodule.

> Lazy clone sounded like the right idea to me.  With this
> proof-of-concept implementation, just about 550M from the 2.5G is
> downloaded, which is still about twice as much in comparison with
> downloading a tarball, but bearable.

Do you have any numbers for the OOo repository, like the number of
revisions, the depth of the DAG of commits (maximum number of revisions
in one line of commits), the number of files, the size of a checkout,
the average size of a file, etc.?

-- 
Jakub Narebski
Poland
ShadeHawk on #git

* Re: [PATCH] RFC: git lazy clone proof-of-concept
From: Johannes Schindelin @ 2008-02-08 19:04 UTC
  To: Mike Hommey; +Cc: Jan Holesovsky, git, gitster

Hi,

On Fri, 8 Feb 2008, Mike Hommey wrote:

> - There are tricks to do roughly what you want without modifying git. 
> For example, you can prepare several "shared" clones of your repo (git 
> clone -s) and leave in each only a few branches. Cloning from these will 
> only pull the needed data.

The problem is, of course, that the shared clones are not updated 
automatically, whenever the big repository is updated.

Ciao,
Dscho

* Re: [PATCH] RFC: git lazy clone proof-of-concept
From: Jon Smirl @ 2008-02-08 19:26 UTC
  To: Jakub Narebski; +Cc: Jan Holesovsky, git, gitster

On 2/8/08, Jakub Narebski <jnareb@gmail.com> wrote:
> Jan Holesovsky <kendy@suse.cz> writes:
> One of the reasons why 'lazy clone' was not implemented was the fact
> that by using a large enough window, and a larger than default delta
> chain length, you can repack an "archive pack" (and keep it from being
> repacked by using .keep files, see git-config(1)) much tighter than with
> the default (time and CPU conserving) options, and much, much tighter
> than a pack which is the result of a fast-import driven import.
>
> Both the Mozilla import and the GCC import were packed below 0.5 GB.
> Warning: you would need a machine with a large amount of memory to
> repack it tightly in a sensible time!

A lot of memory is 2-4GB. Without this much memory you will trigger
swapping and the pack process will finish in about a month. Note that
only one machine needs to have this kind of memory. It can be used to
make the optimized pack of the project history and mark it with .keep
files. It doesn't take a lot of memory to use the optimized packs,
only to make them.

There are some patches for making repack work multi-core.  Not sure if
they made it into the main git tree yet.  These patches scale almost
linearly: an eight-hour repack will take 2.5 hours on a quad-core
machine.

There is a very good chance your 1.5GB repo will turn into 300MB if it
is extremely packed.  This is something you only need to do once, but
you'll probably end up doing it a dozen times trying to get it just
right.

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: [PATCH] RFC: git lazy clone proof-of-concept
From: Nicolas Pitre @ 2008-02-08 20:09 UTC
  To: Jon Smirl; +Cc: Jakub Narebski, Jan Holesovsky, git, Junio C Hamano

On Fri, 8 Feb 2008, Jon Smirl wrote:

> There are some patches for making repack work multi-core. Not sure if
> they made it into the main git tree yet.

Yes, they are.  You need to compile with "make THREADED_DELTA_SEARCH=yes"
or add THREADED_DELTA_SEARCH=yes to config.mak for it to be enabled,
though.  Then you have to set the pack.threads configuration variable
appropriately to use it.
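
For example (the thread count depends on your machine):

	$ make THREADED_DELTA_SEARCH=yes
	$ git config pack.threads 4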


Nicolas

* Re: [PATCH] RFC: git lazy clone proof-of-concept
From: Johannes Schindelin @ 2008-02-08 20:16 UTC
  To: Jan Holesovsky; +Cc: git, gitster

Hi,

2nd part of my review:

On Fri, 8 Feb 2008, Jan Holesovsky wrote:

> +static void read_from_stdin(int *num, char ***records)
> +{
> +	char buffer[4096];
> +	size_t records_num, leftover;
> +	ssize_t ret;
> +
> +	*num = 0;
> +	leftover = 0;
> +
> +	records_num = 4096;
> +	(*records) = xmalloc(records_num * sizeof(char *));
> +
> +	do {
> +		char *p, *last;
> +
> +		ret = xread(0 /*stdin*/, buffer + leftover,
> +				sizeof(buffer) - leftover);
> +		if (ret < 0)
> +			die("read error on input: %s", strerror(errno));
> +
> +		last = buffer;
> +		for (p = buffer; p < buffer + leftover + ret; p++)
> +			if ((!*p || *p == '\n') && (p != last)) {
> +				if (*num >= records_num) {
> +					records_num *= 2;
> +					(*records) = xrealloc(*records,
> +							      records_num * sizeof(char*));
> +				}
> +
> +				if (p - last > 0) {
> +					(*records)[*num] =
> +						strndup(last, p - last);
> +					(*num)++;
> +				}
> +				last = p + 1;
> +			}
> +		memmove(buffer, last, leftover);
> +	} while (ret > 0);
> +
> +	if (leftover) {
> +		if (*num >= records_num) {
> +			records_num *= 2;
> +			(*records) = xrealloc(*records,
> +					      records_num * sizeof(char*));
> +		}
> +
> +		(*records)[*num] = strndup(buffer, leftover);
> +		(*num)++;
> +	}
> +}

I thought about this function again.  It seems we have something similar 
in builtin-pack-objects.c, which is easier to read.  The equivalent would 
be:

static void read_from_stdin(int *num, char ***records)
{
	char line[4096];
	int alloc = 0;
	size_t len;

	*num = 0;
	*records = NULL;
	for (;;) {
		if (!fgets(line, sizeof(line), stdin)) {
			if (feof(stdin))
				break;
			if (!ferror(stdin))
				die("fgets returned NULL, not EOF, nor error!");
			if (errno != EINTR)
				die("fgets: %s", strerror(errno));
			clearerr(stdin);
			continue;
		}
		/* strip the newline that fgets() keeps */
		len = strlen(line);
		if (len && line[len - 1] == '\n')
			line[--len] = '\0';
		if (!len)
			continue;
		ALLOC_GROW(*records, *num + 1, alloc);
		(*records)[(*num)++] = xstrdup(line);
	}
}

> diff --git a/git-clone.sh b/git-clone.sh
> index b4e858c..208e9fc 100755
> --- a/git-clone.sh
> +++ b/git-clone.sh
> @@ -115,7 +115,7 @@ Perhaps git-update-server-info needs to be run there?"
>  quiet=
>  local=no
>  use_local_hardlink=yes
> -local_shared=no
> +shared=no
>  unset template
>  no_checkout=
>  upload_pack=
> @@ -143,7 +143,7 @@ do
>  	--no-hardlinks)
>  		use_local_hardlink=no ;;
>  	-s|--shared)
> -		local_shared=yes ;;
> +		shared=yes ;;
>  	--template)
>  		shift; template="--template=$1" ;;
>  	-q|--quiet)
> @@ -288,7 +288,7 @@ yes)
>  	( cd "$repo/objects" ) ||
>  		die "cannot chdir to local '$repo/objects'."
>  
> -	if test "$local_shared" = yes
> +	if test "$shared" = yes
>  	then
>  		mkdir -p "$GIT_DIR/objects/info"
>  		echo "$repo/objects" >>"$GIT_DIR/objects/info/alternates"
> @@ -364,11 +364,22 @@ yes)
>  		fi
>  		;;
>  	*)
> +		commits_only=
> +		if test "$shared" = yes
> +		then
> +			commits_only="--commits-only"
> +		fi
>  		case "$upload_pack" in
> -		'') git-fetch-pack --all -k $quiet $depth $no_progress "$repo";;
> -		*) git-fetch-pack --all -k $quiet "$upload_pack" $depth $no_progress "$repo" ;;
> +		'') git-fetch-pack --all -k $quiet $depth $no_progress $commits_only "$repo";;
> +		*) git-fetch-pack --all -k $quiet "$upload_pack" $depth $no_progress $commits_only "$repo" ;;
>  		esac >"$GIT_DIR/CLONE_HEAD" ||
>  			die "fetch-pack from '$repo' failed."
> +		if test "$shared" = yes
> +		then
> +			# Must be done after the fetch
> +			mkdir -p "$GIT_DIR/objects/info"
> +			echo "$repo" >> "$GIT_DIR/objects/info/remote_alternates"
> +		fi
>  		;;
>  	esac
>  	;;

Please have a different option than --shared for lazy clones.  Maybe 
--lazy?  ;-)

I can see why you reused --shared, though.  But let's make this more 
fool-proof: a user should explicitly ask for a lazy clone.

> diff --git a/index-pack.c b/index-pack.c
> index 9fd6982..f2e6b7a 100644
> --- a/index-pack.c
> +++ b/index-pack.c
> @@ -9,7 +9,7 @@
>  #include "progress.h"
>  
>  static const char index_pack_usage[] =
> -"git-index-pack [-v] [-o <index-file>] [{ ---keep | --keep=<msg> }] { <pack-file> | --stdin [--fix-thin] [<pack-file>] }";
> +"git-index-pack [-v] [-o <index-file>] [{ ---keep | --keep=<msg> }] [--ignore-remote-alternates] { <pack-file> | --stdin [--fix-thin] [<pack-file>] }";
>  
>  struct object_entry
>  {
> @@ -746,6 +746,8 @@ int main(int argc, char **argv)
>  					pack_idx_off32_limit = strtoul(c+1, &c, 0);
>  				if (*c || pack_idx_off32_limit & 0x80000000)
>  					die("bad %s", arg);
> +			} else if (!strcmp(arg, "--ignore-remote-alternates")) {
> +				disable_remote_alternates();
>  			} else
>  				usage(index_pack_usage);
>  			continue;

I might be missing something, but I do not believe this is necessary.  
index-pack only works on packs anyway.  Am I wrong?

> diff --git a/sha1_file.c b/sha1_file.c
> index 66a4e00..7d60be0 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -14,6 +14,7 @@
>  #include "tag.h"
>  #include "tree.h"
>  #include "refs.h"
> +#include "run-command.h"
>  
>  #ifndef O_NOATIME
>  #if defined(__linux__) && (defined(__i386__) || defined(__PPC__))
> @@ -411,6 +412,205 @@ static char *find_sha1_file(const unsigned char *sha1, struct stat *st)
>  	return NULL;
>  }
>  
> +static char *remote_alternates = NULL;
> +static int has_remote_alt_feature = -1;
> +
> +void disable_remote_alternates(void)
> +{
> +	has_remote_alt_feature = 0;
> +}
> +
> +static int has_remote_alternates(void)
> +{
> +	/* FIXME: does it make sense to support more URLs inside
> +	 * remote_alternates? */

I think it would make sense.  For example if you have a local machine 
which has most, but maybe not all, of the remote objects.

> +	struct stat st;
> +	const char remote_alt_file_name[] = "info/remote_alternates";

<bikeshedding>maybe remote-alternates (note the dash instead 
of the underscore)</bikeshedding>

> +	char path[PATH_MAX + 1 + sizeof remote_alt_file_name];
> +	int fd;
> +	char *map, *p;
> +	size_t mapsz;
> +
> +	if (has_remote_alt_feature != -1)
> +		return has_remote_alt_feature;
> +
> +	has_remote_alt_feature = 0;
> +
> +	sprintf(path, "%s/%s", get_object_directory(),
> +			remote_alt_file_name);
> +	fd = open(path, O_RDONLY);
> +	if (fd < 0)
> +		return has_remote_alt_feature;
> +	else if (fstat(fd, &st) || (st.st_size == 0)) {
> +		close(fd);
> +		return has_remote_alt_feature;
> +	}
> +
> +	mapsz = xsize_t(st.st_size);
> +	map = xmmap(NULL, mapsz, PROT_READ, MAP_PRIVATE, fd, 0);
> +	close(fd);
> +
> +	/* we support just one remote alternate for now,
> +	 * so read just the first entry */
> +	for (p = map; (p < map + mapsz) && (*p != '\n'); p++)
> +		;
> +
> +	remote_alternates = strndup(map, p - map);

Seems that you do something like the read_from_stdin() here, only from a 
file.  It appears to me as if the function wants to be a library function 
(taking a FILE * parameter, and maybe closing it after use, or even 
taking a filename parameter, which signifies stdin when NULL).

> +struct sha1_list {
> +	unsigned char sha1[20];
> +	struct sha1_list *next;
> +};

It'd probably be better to make this an array which uses ALLOC_GROW()
in order to avoid memory fragmentation/allocation overhead.
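
E.g. a minimal sketch (the names are made up):

	static unsigned char (*remote_sha1s)[20];
	static int remote_nr, remote_alloc;
	...
	ALLOC_GROW(remote_sha1s, remote_nr + 1, remote_alloc);
	hashcpy(remote_sha1s[remote_nr++], sha1);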

> +	memset(&fetch_pack, 0, sizeof(fetch_pack));
> +	fetch_pack.in = dump_objects.out;
> +	fetch_pack.out = 1;
> +	fetch_pack.err = 2;
> +	fetch_pack.git_cmd = 1;
> +	fetch_pack.argv = argv;
> +
> +	err = run_command(&fetch_pack);
> +
> +	/* TODO better error handling - is the object really missing, or
> +	 * was it just a temporary network error? */
> +	if (err) {
> +		fprintf(stderr, "error %d while calling fetch-pack\n", err);
> +		return 0;

That is a

		return error("Error %d while calling fetch-pack", err);

And it does not really matter what type of error it is: you must report 
the error and continue without this object.

> +static int fill_remote_list(const unsigned char *sha1,
> +		const char *base, int baselen,
> +		const char *pathname, unsigned mode, int stage)
> +{
> +	if (!has_sha1_file_locally(sha1)) {
> +		struct sha1_list *item;
> +
> +		item = xmalloc(sizeof(*item));
> +		hashcpy(item->sha1, sha1);
> +		item->next = remote_list;
> +
> +		remote_list = item;
> +	}
> +
> +	return 0;
> +}
> +
> +static int fetch_remote_sha1s_recursive(struct sha1_list *objects)
> +{
> +	struct sha1_list *list;
> +	int ret = 0;
> +
> +	/* first of all, fetch the missing objects */
> +	if (!fetch_remote_sha1s(objects))
> +		return 0;
> +
> +	remote_list = NULL;
> +
> +	list = objects;
> +	while (list) {
> +		struct tree *tree;
> +
> +		tree = parse_tree_indirect(list->sha1);
> +		if (tree) {
> +			read_tree_recursive(tree, "", 0, 0, NULL,
> +					fill_remote_list);
> +		}

The curly brackets are not necessary.  Plus, with fill_remote_list() as 
you defined it, it will break down with submodules (see 481f0ee6 (Fix 
rev-list when showing objects involving submodules) for inspiration).

> +
> +		list = list->next;
> +	}
> +
> +	list = remote_list;
> +	if (!list)
> +		return 1; /* hooray, we have everything */
> +
> +	ret = fetch_remote_sha1s_recursive(list);

This just cries out loud for a non-recursive approach: have two arrays, 
clear the second, fetch the objects in the first array, then fill the 
second with the objects referred to by the first array's objects.  Then 
swap the arrays.  Loop.
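
An untested sketch, reusing the helpers from your patch (and ignoring
the freeing of the intermediate lists for brevity):

	static int fetch_remote_sha1s_iterative(struct sha1_list *objects)
	{
		struct sha1_list *todo = objects;

		while (todo) {
			if (!fetch_remote_sha1s(todo))
				return 0;

			/* collect missing objects referenced by this round */
			remote_list = NULL;
			for (; todo; todo = todo->next) {
				struct tree *tree = parse_tree_indirect(todo->sha1);
				if (tree)
					read_tree_recursive(tree, "", 0, 0, NULL,
							fill_remote_list);
			}
			todo = remote_list;
		}
		return 1;
	}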

> @@ -2316,6 +2532,18 @@ int has_sha1_file(const unsigned char *sha1)
>  	return find_sha1_file(sha1, &st) ? 1 : 0;
>  }
>  
> +int has_sha1_file(const unsigned char *sha1)
> +{
> +	if (has_sha1_file_locally(sha1))
> +		return 1;
> +
> +	/* download it if necessary */
> +	if (has_remote_alternates() && download_remote_sha1(sha1))

Maybe it would be nicer to have the has_remote_alternates() check only in 
download_remote_sha1()?  Same applies to read_sha1_file().

> @@ -106,9 +106,15 @@ static int do_rev_list(int fd, void *create_full_pack)
>  	if (create_full_pack)
>  		use_thin_pack = 0; /* no point doing it */
>  	init_revisions(&revs, NULL);
> -	revs.tag_objects = 1;
> -	revs.tree_objects = 1;
> -	revs.blob_objects = 1;
> +	if (!commits_only) {
> +		revs.tag_objects = 1;
> +		revs.tree_objects = 1;
> +		revs.blob_objects = 1;
> +	} else {
> +		revs.tag_objects = 0;
> +		revs.tree_objects = 0;
> +		revs.blob_objects = 0;
> +	}

Or

	revs.tag_objects = revs.tree_objects = revs.blob_objects
		= !commits_only;


> @@ -498,9 +525,15 @@ static void receive_needs(void)
>  		 * asks for something like "master~10" (symbolic)...
>  		 * would it make sense?  I don't know.
>  		 */
> -		o = lookup_object(sha1_buf);
> -		if (!o || !(o->flags & OUR_REF))
> -			die("git-upload-pack: not our ref %s", line+5);
> +		if (!exact_objects) {
> +			o = lookup_object(sha1_buf);
> +			if (!o || !(o->flags & OUR_REF))
> +				die("git-upload-pack: not our ref %s", line+5);
> +		} else {
> +			o = lookup_unknown_object(sha1_buf);
> +			if (!o)
> +				die("git-upload-pack: not an object %s", line+5);
> +		}

Hmm... AFAICT lookup_unknown_object() does not return NULL.  It creates a 
"none" object if it did not find anything under that sha1.

I think you'd rather want

 		o = lookup_object(sha1_buf);
-		if (!o || !(o->flags & OUR_REF))
+		if (!o || (!exact_objects && !(o->flags & OUR_REF)))
 			die("git-upload-pack: not our ref %s", line+5);

Phew.  What a big patch!  But as I said, it is nice to know somebody is 
working on this.  (I do not necessarily see possibilities to break it 
down into smaller chunks, though.)

But I think that your needs can be satisfied with partial shallow clones, 
too: e.g.

	$ mkdir my-new-workdir
	$ cd my-new-workdir
	$ git init
	$ git remote add -t master origin <url>
	$ git fetch --depth 1 origin
	$ git checkout -b master origin/master

I cannot think of a proper place to make this a one-shot command.

As you probably know, I am a strong believer in semantics, so I would hate 
"git clone" being taught to not clone the whole repository, but only a 
single branch.

But hey, I have been wrong before.

Ciao,
Dscho

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-08 19:26   ` Jon Smirl
  2008-02-08 20:09     ` Nicolas Pitre
@ 2008-02-08 20:19     ` Harvey Harrison
  2008-02-08 20:24       ` Jon Smirl
  1 sibling, 1 reply; 85+ messages in thread
From: Harvey Harrison @ 2008-02-08 20:19 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Jakub Narebski, Jan Holesovsky, git, gitster

On Fri, 2008-02-08 at 14:26 -0500, Jon Smirl wrote:
> On 2/8/08, Jakub Narebski <jnareb@gmail.com> wrote:
> > Jan Holesovsky <kendy@suse.cz> writes:
> > One of the reasons why 'lazy clone' was not implemented was the fact
> > that by using large enough window, and larger than default delta
> > length you can repack "archive pack" (and keep it from trying to
> > repack using .keep files, see git-config(1)) much tighter than with
> > default (time and CPU conserving) options, and much, much tighter than
> > pack which is result of fast-import driven import.
> >
> > Both Mozilla import, and GCC import were packed below 0.5 GB. Warning:
> > you would need machine with large amount of memory to repack it
> > tightly in sensible time!
> 
> A lot of memory is 2-4GB. Without this much memory you will trigger
> swapping and the pack process will finish in about a month. 

Well, my modest little Celeron M laptop w/ 1GB of ram did the full
repack overnight on the gcc repo, so a month is a bit of an
exaggeration.

Cheers,

Harvey

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-08 20:19     ` [PATCH] RFC: git lazy clone proof-of-concept Harvey Harrison
@ 2008-02-08 20:24       ` Jon Smirl
  2008-02-08 20:25         ` Harvey Harrison
  0 siblings, 1 reply; 85+ messages in thread
From: Jon Smirl @ 2008-02-08 20:24 UTC (permalink / raw)
  To: Harvey Harrison; +Cc: Jakub Narebski, Jan Holesovsky, git, gitster

On 2/8/08, Harvey Harrison <harvey.harrison@gmail.com> wrote:
> On Fri, 2008-02-08 at 14:26 -0500, Jon Smirl wrote:
> > On 2/8/08, Jakub Narebski <jnareb@gmail.com> wrote:
> > > Jan Holesovsky <kendy@suse.cz> writes:
> > > One of the reasons why 'lazy clone' was not implemented was the fact
> > > that by using large enough window, and larger than default delta
> > > length you can repack "archive pack" (and keep it from trying to
> > > repack using .keep files, see git-config(1)) much tighter than with
> > > default (time and CPU conserving) options, and much, much tighter than
> > > pack which is result of fast-import driven import.
> > >
> > > Both Mozilla import, and GCC import were packed below 0.5 GB. Warning:
> > > you would need machine with large amount of memory to repack it
> > > tightly in sensible time!
> >
> > A lot of memory is 2-4GB. Without this much memory you will trigger
> > swapping and the pack process will finish in about a month.
>
> Well, my modest little Celeron M laptop w/ 1GB of ram did the full
> repack overnight on the gcc repo, so a month is a bit of an
> exaggeration.

Try it again with window=250 and depth=250. That's how you get the
really small packs.

>
> Cheers,
>
> Harvey
>
>


-- 
Jon Smirl
jonsmirl@gmail.com

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-08 20:24       ` Jon Smirl
@ 2008-02-08 20:25         ` Harvey Harrison
  2008-02-08 20:41           ` Jon Smirl
  0 siblings, 1 reply; 85+ messages in thread
From: Harvey Harrison @ 2008-02-08 20:25 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Jakub Narebski, Jan Holesovsky, git, gitster

On Fri, 2008-02-08 at 15:24 -0500, Jon Smirl wrote:
> On 2/8/08, Harvey Harrison <harvey.harrison@gmail.com> wrote:
> > On Fri, 2008-02-08 at 14:26 -0500, Jon Smirl wrote:
> > > On 2/8/08, Jakub Narebski <jnareb@gmail.com> wrote:
> > > > Jan Holesovsky <kendy@suse.cz> writes:
> > > > One of the reasons why 'lazy clone' was not implemented was the fact
> > > > that by using large enough window, and larger than default delta
> > > > length you can repack "archive pack" (and keep it from trying to
> > > > repack using .keep files, see git-config(1)) much tighter than with
> > > > default (time and CPU conserving) options, and much, much tighter than
> > > > pack which is result of fast-import driven import.
> > > >
> > > > Both Mozilla import, and GCC import were packed below 0.5 GB. Warning:
> > > > you would need machine with large amount of memory to repack it
> > > > tightly in sensible time!
> > >
> > > A lot of memory is 2-4GB. Without this much memory you will trigger
> > > swapping and the pack process will finish in about a month.
> >
> > Well, my modest little Celeron M laptop w/ 1GB of ram did the full
> > repack overnight on the gcc repo, so a month is a bit of an
> > exaggeration.
> 
> Try it again with window=250 and depth=250. That's how you get the
> really small packs.
> 

Yes, I know, and I did if you remember back to the gcc discussion.

Harvey

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-08 20:25         ` Harvey Harrison
@ 2008-02-08 20:41           ` Jon Smirl
  0 siblings, 0 replies; 85+ messages in thread
From: Jon Smirl @ 2008-02-08 20:41 UTC (permalink / raw)
  To: Harvey Harrison; +Cc: Jakub Narebski, Jan Holesovsky, git, gitster

On 2/8/08, Harvey Harrison <harvey.harrison@gmail.com> wrote:
> On Fri, 2008-02-08 at 15:24 -0500, Jon Smirl wrote:
> > On 2/8/08, Harvey Harrison <harvey.harrison@gmail.com> wrote:
> > > On Fri, 2008-02-08 at 14:26 -0500, Jon Smirl wrote:
> > > > On 2/8/08, Jakub Narebski <jnareb@gmail.com> wrote:
> > > > > Jan Holesovsky <kendy@suse.cz> writes:
> > > > > One of the reasons why 'lazy clone' was not implemented was the fact
> > > > > that by using large enough window, and larger than default delta
> > > > > length you can repack "archive pack" (and keep it from trying to
> > > > > repack using .keep files, see git-config(1)) much tighter than with
> > > > > default (time and CPU conserving) options, and much, much tighter than
> > > > > pack which is result of fast-import driven import.
> > > > >
> > > > > Both Mozilla import, and GCC import were packed below 0.5 GB. Warning:
> > > > > you would need machine with large amount of memory to repack it
> > > > > tightly in sensible time!
> > > >
> > > > A lot of memory is 2-4GB. Without this much memory you will trigger
> > > > swapping and the pack process will finish in about a month.
> > >
> > > Well, my modest little Celeron M laptop w/ 1GB of ram did the full
> > > repack overnight on the gcc repo, so a month is a bit of an
> > > exaggeration.
> >
> > Try it again with window=250 and depth=250. That's how you get the
> > really small packs.
> >
>
> Yes, I know, and I did if you remember back to the gcc discussion.

Now that you mention it I seem to recall some changes were made to git
during that discussion that reduced the memory footprint and made the
optimized gcc repack fit into 1GB. I've forgotten the exact timings
and git is a moving target. When I was working on Mozilla it needed
2.4GB to avoid swapping but that was with a much older git.

The rule is: if it starts swapping it is going to take way longer than
you are probably willing to wait. Buying more RAM is a cheap and easy
fix.

If people are having trouble with large repositories please let the
git community know and your issues will probably get quickly fixed.
We can't fix something we don't know about.

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-08 20:16 ` Johannes Schindelin
@ 2008-02-08 21:35   ` Jakub Narebski
  2008-02-08 21:52     ` Johannes Schindelin
  0 siblings, 1 reply; 85+ messages in thread
From: Jakub Narebski @ 2008-02-08 21:35 UTC (permalink / raw)
  To: git

Johannes Schindelin wrote:
> On Fri, 8 Feb 2008, Jan Holesovsky wrote:

>> +     struct stat st;
>> +     const char remote_alt_file_name[] = "info/remote_alternates";
> 
> <bikeshedding>maybe remote-alternates (note the dash instead 
> of the underscore)</bikeshedding>

Why not in info/alternates?

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-08 21:35   ` Jakub Narebski
@ 2008-02-08 21:52     ` Johannes Schindelin
  2008-02-08 22:03       ` Mike Hommey
  2008-02-09 15:54       ` Jan Holesovsky
  0 siblings, 2 replies; 85+ messages in thread
From: Johannes Schindelin @ 2008-02-08 21:52 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

Hi,

[I'll just Cc you, out of the goodness of my heart.]

On Fri, 8 Feb 2008, Jakub Narebski wrote:

> Johannes Schindelin wrote:
> > On Fri, 8 Feb 2008, Jan Holesovsky wrote:
> 
> >> +     struct stat st;
> >> +     const char remote_alt_file_name[] = "info/remote_alternates";
> > 
> > <bikeshedding>maybe remote-alternates (note the dash instead 
> > of the underscore)</bikeshedding>
> 
> Why not in info/alternates?

Again, to make the distinction clear.

Also note that info/alternates is used by the http transport (which would 
then break semi-silently, because I expect that you usually put git:// 
urls into remote-alternates).

Ciao,
Dscho

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-08 21:52     ` Johannes Schindelin
@ 2008-02-08 22:03       ` Mike Hommey
  2008-02-08 22:34         ` Johannes Schindelin
  2008-02-09 15:54       ` Jan Holesovsky
  1 sibling, 1 reply; 85+ messages in thread
From: Mike Hommey @ 2008-02-08 22:03 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jakub Narebski, git

On Fri, Feb 08, 2008 at 09:52:08PM +0000, Johannes Schindelin wrote:
> Hi,
> 
> [I'll just Cc you, out of the goodness of my heart.]
> 
> On Fri, 8 Feb 2008, Jakub Narebski wrote:
> 
> > Johannes Schindelin wrote:
> > > On Fri, 8 Feb 2008, Jan Holesovsky wrote:
> > 
> > >> +     struct stat st;
> > >> +     const char remote_alt_file_name[] = "info/remote_alternates";
> > > 
> > > <bikeshedding>maybe remote-alternates (note the dash instead 
> > > of the underscore)</bikeshedding>
> > 
> > Why not in info/alternates?
> 
> Again, to make the distinction clear.
> 
> Also note that info/alternates is used by the http transport (which would 
> then break semi-silently, because I expect that you usually put git:// 
> urls into remote-alternates).

Also note that the http transport uses info/http-alternates for http://
urls. By the way, it doesn't make much sense that only http-fetch uses
it.

Mike

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-08 22:03       ` Mike Hommey
@ 2008-02-08 22:34         ` Johannes Schindelin
  2008-02-08 22:50           ` Mike Hommey
  0 siblings, 1 reply; 85+ messages in thread
From: Johannes Schindelin @ 2008-02-08 22:34 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Jakub Narebski, git

Hi,

On Fri, 8 Feb 2008, Mike Hommey wrote:

> Also note that the http transport uses info/http-alternates for http:// 
> urls. By the way, it doesn't make much sense that only http-fetch uses 
> it.

I think it does make sense: nobody else needs http-alternates.

Ciao,
Dscho

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-08 22:34         ` Johannes Schindelin
@ 2008-02-08 22:50           ` Mike Hommey
  2008-02-08 23:14             ` Johannes Schindelin
  0 siblings, 1 reply; 85+ messages in thread
From: Mike Hommey @ 2008-02-08 22:50 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jakub Narebski, git

On Fri, Feb 08, 2008 at 10:34:55PM +0000, Johannes Schindelin wrote:
> Hi,
> 
> On Fri, 8 Feb 2008, Mike Hommey wrote:
> 
> > Also note that the http transport uses info/http-alternates for http:// 
> > urls. By the way, it doesn't make much sense that only http-fetch uses 
> > it.
> 
> I think it does make sense: nobody else needs http-alternates.

If you're setting an http-alternate, it means objects are missing in the
repo. If they are missing in the repo and are not in alternates, how can 
any other command needing objects out there work on the repo ?

Mike

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-08 22:50           ` Mike Hommey
@ 2008-02-08 23:14             ` Johannes Schindelin
  2008-02-08 23:38               ` Mike Hommey
  0 siblings, 1 reply; 85+ messages in thread
From: Johannes Schindelin @ 2008-02-08 23:14 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Jakub Narebski, git

Hi,

On Fri, 8 Feb 2008, Mike Hommey wrote:

> On Fri, Feb 08, 2008 at 10:34:55PM +0000, Johannes Schindelin wrote:
> 
> > On Fri, 8 Feb 2008, Mike Hommey wrote:
> > 
> > > Also note that the http transport uses info/http-alternates for 
> > > http:// urls. By the way, it doesn't make much sense that only 
> > > http-fetch uses it.
> > 
> > I think it does make sense: nobody else needs http-alternates.
> 
> If you're setting an http-alternate, it means objects are missing in the 
> repo. If they are missing in the repo and are not in alternates, how can 
> any other command needing objects out there work on the repo ?

The point is: if you have a bare repository on a server that uses 
alternates, that path stored in info/alternates is usable by git-daemon.  
But it is not usable by git-http-fetch, since that does not have a 
git-aware server side.  So if you want to reuse the _same_ bare repository 
_with_ alternates for both git:// transport and http:// transport, you 
_need_ two _different_ alternates: one being a path on the server, and 
another being an http:// url for http-fetch.
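
To illustrate (the paths and urls are made up), the same bare repository 
would then carry both files side by side:

	objects/info/alternates:
		/srv/git/shared.git/objects

	objects/info/http-alternates:
		http://server.example.com/git/shared.git/objects

git-daemon can follow the first, a local path, while git-http-fetch can 
follow the second, a url it can actually reach.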

Hth,
Dscho

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-08 23:14             ` Johannes Schindelin
@ 2008-02-08 23:38               ` Mike Hommey
  2008-02-09 21:20                 ` Jan Hudec
  0 siblings, 1 reply; 85+ messages in thread
From: Mike Hommey @ 2008-02-08 23:38 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jakub Narebski, git

On Fri, Feb 08, 2008 at 11:14:40PM +0000, Johannes Schindelin wrote:
> Hi,
> 
> On Fri, 8 Feb 2008, Mike Hommey wrote:
> 
> > On Fri, Feb 08, 2008 at 10:34:55PM +0000, Johannes Schindelin wrote:
> > 
> > > On Fri, 8 Feb 2008, Mike Hommey wrote:
> > > 
> > > > Also note that the http transport uses info/http-alternates for 
> > > > http:// urls. By the way, it doesn't make much sense that only 
> > > > http-fetch uses it.
> > > 
> > > I think it does make sense: nobody else needs http-alternates.
> > 
> > If you're setting an http-alternate, it means objects are missing in the 
> > repo. If they are missing in the repo and are not in alternates, how can 
> > any other command needing objects out there work on the repo ?
> 
> The point is: if you have a bare repository on a server that uses 
> alternates, that path stored in info/alternates is usable by git-daemon.  
> But it is not usable by git-http-fetch, since that does not have a 
> git-aware server side.  So if you want to reuse the _same_ bare repository 
> _with_ alternates for both git:// transport and http:// transport, you 
> _need_ two _different_ alternates: one being a path on the server, and 
> another being an http:// url for http-fetch.

But nothing prevents you from only setting an http-alternate. Also note
that http-fetch can deal fine with info/alternates if it contains relative
paths.

Mike

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-08 18:03 ` Nicolas Pitre
@ 2008-02-09 14:25   ` Jan Holesovsky
  2008-02-09 22:05     ` Mike Hommey
  2008-02-10  7:23     ` Marco Costalba
  0 siblings, 2 replies; 85+ messages in thread
From: Jan Holesovsky @ 2008-02-09 14:25 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git, gitster

Hi Nicolas,

On Friday 08 February 2008 19:03, Nicolas Pitre wrote:

> > I've provided a git import of OOo with the entire history; the problem is
> > that the pack has 2.5G, so it's not too convenient to download for casual
> > developers that just want to try it.
>
> How did you repack your repository?
>
> We know that current defaults are not suitable for large projects.  For
> example, the gcc git repository shrank from a 1.5GB pack down to 230MB
> after some tuning.

After the suggestions in this thread I tried to experiment with the --window 
and --depth options of git-repack, and indeed, there is still room for 
improvement.

So far I'm at 2G (saved 500M), unfortunately the aggressive values like 
--window=250 --depth=250 that someone mentioned here cause out-of-memory on a 
machine with 8G :-(  If there's anybody brave enough here to try as well, I'd 
be grateful.  Maybe it would also be interesting to _exactly_ locate what 
causes the oom, and e.g. exclude the object from the pack if possible.

The tree is available here:

git clone git://o3-build.services.openoffice.org/git/ooo.git
git clone http://o3-build.services.openoffice.org/~svn/ooo.git (the same over 
http://)

Thank you in advance!

Regards,
Jan

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-08 18:14 ` Harvey Harrison
@ 2008-02-09 14:27   ` Jan Holesovsky
  0 siblings, 0 replies; 85+ messages in thread
From: Jan Holesovsky @ 2008-02-09 14:27 UTC (permalink / raw)
  To: Harvey Harrison; +Cc: git, gitster

Hi Harvey,

On Friday 08 February 2008 19:14, Harvey Harrison wrote:

> For comparison, how big was the svn repo you're testing?  My experience
> has been about 15-20 times smaller than SVN once a tuned repack has
> been done.

Another guy created the SVN repo, IIRC he said it had 55G.

Regards,
Jan

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-08 18:49 ` Mike Hommey
  2008-02-08 19:04   ` Johannes Schindelin
@ 2008-02-09 15:06   ` Jan Holesovsky
  1 sibling, 0 replies; 85+ messages in thread
From: Jan Holesovsky @ 2008-02-09 15:06 UTC (permalink / raw)
  To: Mike Hommey; +Cc: git, gitster

Hi Mike,

On Friday 08 February 2008 19:49, Mike Hommey wrote:

> There are 2 things, here:
> - Probably, you can make your pack smaller with proper window sizing.
> Try taking a look at the "Git and GCC" thread that crossed borders between
> the gcc and the git mailing lists.

Just trying this :-)

> - There are tricks to do roughly what you want without modifying git.
> For example, you can prepare several "shared" clones of your repo (git
> clone -s) and leave in each only a few branches. Cloning from these will
> only pull the needed data.

Good to know about this, thank you!  The problem currently is that we are 
trying to produce SVN and git trees containing the same data, the same number 
of branches, etc. for the sake of comparison.  If git wins, and it will be 
chosen for OOo, we'll be hopefully able to do more tuning - and I'm sure I'll 
ask here for help ;-)

Regards,
Jan

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-08 19:00 ` Jakub Narebski
  2008-02-08 19:26   ` Jon Smirl
@ 2008-02-09 15:27   ` Jan Holesovsky
  2008-02-10  3:10     ` Nicolas Pitre
  2008-02-11  1:20     ` Jakub Narebski
  1 sibling, 2 replies; 85+ messages in thread
From: Jan Holesovsky @ 2008-02-09 15:27 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git, gitster

Hi Jakub,

On Friday 08 February 2008 20:00, Jakub Narebski wrote:

> It was not implemented because it was thought to be hard; git assumes
> in many places that if it has an object, it has all objects referenced
> by it.
>
> But it is very nice of you to [try to] implement 'lazy clone'/'remote
> alternates'.
>
> Could you provide some benchmarks (time, network throughput, latency)
> for your implementation?

Unfortunately not yet :-(  The only data I have is that a clone done on 
git://localhost/ooo.git took 10 minutes without the lazy clone, and 7.5 
minutes with it - and then I sent the patch for review here ;-)  The deadline 
for our SVN vs. git comparison for OOo is next Friday, so I'll definitely 
have some better data by then.

> Both Mozilla import, and GCC import were packed below 0.5 GB. Warning:
> you would need machine with large amount of memory to repack it
> tightly in sensible time!

As I answered elsewhere, unfortunately it goes out of memory even on 8G 
machine (x86-64), so...  But still trying.

> > Shallow clone is not a possibility - we don't get patches through
> > mailing lists, so we need the pull/push, and also thanks to the OOo
> > development cycle, we have too many living heads which causes the
> > shallow clone to download about 1.5G even with --depth 1.
>
> Wouldn't it be easier to try to fix the shallow clone implementation to allow
> for pushing from shallow to full clone (fetching from full to shallow
> is implemented), and perhaps also push/pull between two shallow
> clones?

I tried to look into it a bit, but unfortunately did not see a clear way 
to do it transparently for the user - say you pull a branch that is based off 
a commit you do not have.  But of course, I could have missed something ;-)

> As to many living heads: first, you don't need to fetch all
> heads. Currently git-clone has no option to select subset of heads to
> clone, but you can always use git-init + hand configuration +
> git-remote and git-fetch for actual fetching.

Right, might be interesting as well.  But still the missing push/pull is 
problematic for us [or at least I see it as a problem ;-)].

> By the way, did you try to split OpenOffice.org repository at the
> components boundary into submodules (subprojects)? This would also
> limit the amount of needed download, as you don't need to download and
> checkout all subprojects.

Yes, and got much nicer repositories that way ;-) - just by moving some 
binary stuff out of the CVS to a separate tree.  The problem is that the deal 
is to compare the same stuff in SVN and git - so no choice for me in fact.

> The problem of course is _how_ to split repository into
> submodules. Submodules should be self-contained enough so that the
> whole-tree commit is always (or almost always) only about one submodule.

I hope it will be doable _if_ git wins & is chosen for OOo.

> > Lazy clone sounded like the right idea to me.  With this
> > proof-of-concept implementation, just about 550M from the 2.5G is
> > downloaded, which is still about twice as much in comparison with
> > downloading a tarball, but bearable.
>
> Do you have any numbers for OOo repository like number of revisions,
> depth of DAG of commits (maximum number of revisions in one line of
> commits), number of files, size of checkout, average size of file,
> etc.?

I'll try to provide the data ASAP.

Regards,
Jan

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-08 21:52     ` Johannes Schindelin
  2008-02-08 22:03       ` Mike Hommey
@ 2008-02-09 15:54       ` Jan Holesovsky
  1 sibling, 0 replies; 85+ messages in thread
From: Jan Holesovsky @ 2008-02-09 15:54 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jakub Narebski, git

Hi Johannes,

One more 'thank you' for the review - now publically :-)

On Friday 08 February 2008 22:52, Johannes Schindelin wrote:

> > > <bikeshedding>maybe remote-alternates (note the dash instead
> > > of the underscore)</bikeshedding>
> >
> > Why not in info/alternates?
>
> Again, to make the distinction clear.

Yes; still, even though 'alternates' and 'remote alternates' have some 
ideas in common, the implementation differs (and has to differ) - so I think 
you are right even about the --lazy option for clone instead of reusing -s.

For the rest - I'll post the updated patch ASAP.

Regards,
Jan

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-08 23:38               ` Mike Hommey
@ 2008-02-09 21:20                 ` Jan Hudec
  0 siblings, 0 replies; 85+ messages in thread
From: Jan Hudec @ 2008-02-09 21:20 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Johannes Schindelin, Jakub Narebski, git

On Sat, Feb 09, 2008 at 00:38:56 +0100, Mike Hommey wrote:
> On Fri, Feb 08, 2008 at 11:14:40PM +0000, Johannes Schindelin wrote:
> > On Fri, 8 Feb 2008, Mike Hommey wrote:
> > > On Fri, Feb 08, 2008 at 10:34:55PM +0000, Johannes Schindelin wrote:
> > > > On Fri, 8 Feb 2008, Mike Hommey wrote:
> > > > > Also note that the http transport uses info/http-alternates for 
> > > > > http:// urls. By the way, it doesn't make much sense that only 
> > > > > http-fetch uses it.
> > > > 
> > > > I think it does make sense: nobody else needs http-alternates.
> > > 
> > > If you're setting an http-alternate, it means objects are missing in the 
> > > repo. If they are missing in the repo and are not in alternates, how can 
> > > any other command needing objects out there work on the repo ?
> > 
> > The point is: if you have a bare repository on a server that uses 
> > alternates, that path stored in info/alternates is usable by git-daemon.  
> > But it is not usable by git-http-fetch, since that does not have a 
> > git-aware server side.  So if you want to reuse the _same_ bare repository 
> > _with_ alternates for both git:// transport and http:// transport, you 
> > _need_ two _different_ alternates: one being a path on the server, and 
> > another being an http:// url for http-fetch.
> 
> But nothing prevents you from only setting an http-alternate. Also note
> that http-fetch can deal fine with info/alternates if it contains relative
> paths.

They still may not work because of whatever mapping of paths to URLs the http
server does. Also relative paths in info/alternates don't actually work; or
rather, they do, but /not recursively/ (the code seems fixable, just someone
would have to make sure the proper base is always used).

-- 
						 Jan 'Bulb' Hudec <bulb@ucw.cz>

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-09 14:25   ` Jan Holesovsky
@ 2008-02-09 22:05     ` Mike Hommey
  2008-02-09 23:38       ` Nicolas Pitre
  2008-02-10  7:23     ` Marco Costalba
  1 sibling, 1 reply; 85+ messages in thread
From: Mike Hommey @ 2008-02-09 22:05 UTC (permalink / raw)
  To: Jan Holesovsky; +Cc: Nicolas Pitre, git, gitster

On Sat, Feb 09, 2008 at 03:25:35PM +0100, Jan Holesovsky wrote:
> Hi Nicolas,
> 
> On Friday 08 February 2008 19:03, Nicolas Pitre wrote:
> 
> > > I've provided a git import of OOo with the entire history; the problem is
> > > that the pack has 2.5G, so it's not too convenient to download for casual
> > > developers that just want to try it.
> >
> > How did you repack your repository?
> >
> > We know that current defaults are not suitable for large projects.  For
> > example, the gcc git repository shrank from a 1.5GB pack down to 230MB
> > after some tuning.
> 
> After the suggestions in this thread I tried to experiment with the --window 
> and --depth options of git-repack, and indeed, there is still room for 
> improvement.
> 
> So far I'm at 2G (saved 500M), unfortunately the aggressive values like 
> --window=250 --depth=250 that someone mentioned here cause out-of-memory on a 
> machine with 8G :-(. If there's anybody brave enough here to try as well, I'd 
> be grateful.  Maybe it would also be interesting to _exactly_ locate what 
> causes the oom, and e.g. exclude the object from the pack if possible.

Speaking of which, I haven't taken a deep enough look at
builtin-pack-objects.c, but shouldn't it be possible to do prepare_pack and
write_pack_file in one pass?

Mike

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-09 22:05     ` Mike Hommey
@ 2008-02-09 23:38       ` Nicolas Pitre
  0 siblings, 0 replies; 85+ messages in thread
From: Nicolas Pitre @ 2008-02-09 23:38 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Jan Holesovsky, git, gitster

On Sat, 9 Feb 2008, Mike Hommey wrote:

> Speaking of which, I haven't taken a deep enough look at
> builtin-pack-objects.c, but shouldn't it be possible to do prepare_pack and
> write_pack_file in one pass?

No.


Nicolas

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-09 15:27   ` Jan Holesovsky
@ 2008-02-10  3:10     ` Nicolas Pitre
  2008-02-10  4:59       ` Sean
  2008-02-10 16:43       ` Johannes Schindelin
  2008-02-11  1:20     ` Jakub Narebski
  1 sibling, 2 replies; 85+ messages in thread
From: Nicolas Pitre @ 2008-02-10  3:10 UTC (permalink / raw)
  To: Jan Holesovsky; +Cc: Jakub Narebski, git, Junio C Hamano

On Sat, 9 Feb 2008, Jan Holesovsky wrote:

> On Friday 08 February 2008 20:00, Jakub Narebski wrote:
> 
> > Both Mozilla import, and GCC import were packed below 0.5 GB. Warning:
> > you would need machine with large amount of memory to repack it
> > tightly in sensible time!
> 
> As I answered elsewhere, unfortunately it goes out of memory even on 8G 
> machine (x86-64), so...  But still trying.

Try setting the following config variables as follows:

	git config pack.deltaCacheLimit 1
	git config pack.deltaCacheSize 1
	git config pack.windowMemory 1g

That should help keeping memory usage somewhat bounded.


Nicolas

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-10  3:10     ` Nicolas Pitre
@ 2008-02-10  4:59       ` Sean
  2008-02-10  5:22         ` Nicolas Pitre
  2008-02-10  9:34         ` Joachim B Haga
  2008-02-10 16:43       ` Johannes Schindelin
  1 sibling, 2 replies; 85+ messages in thread
From: Sean @ 2008-02-10  4:59 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jan Holesovsky, Jakub Narebski, git, Junio C Hamano

On Sat, 09 Feb 2008 22:10:06 -0500 (EST)
Nicolas Pitre <nico@cam.org> wrote:

> On Sat, 9 Feb 2008, Jan Holesovsky wrote:
> 
> > On Friday 08 February 2008 20:00, Jakub Narebski wrote:
> > 
> > > Both Mozilla import, and GCC import were packed below 0.5 GB. Warning:
> > > you would need machine with large amount of memory to repack it
> > > tightly in sensible time!
> > 
> > As I answered elsewhere, unfortunately it goes out of memory even on 8G 
> > machine (x86-64), so...  But still trying.
> 
> Try setting the following config variables as follows:
> 
> 	git config pack.deltaCacheLimit 1
> 	git config pack.deltaCacheSize 1
> 	git config pack.windowMemory 1g
> 
> That should help keeping memory usage somewhat bounded.
> 

Hi Nicolas,

Tried that earlier today and got a 1.6G pack (on a 2G machine).  There are
some big objects in that repo.. over 100 are 30 to 62M in size, 400 more
over 10M, and ~40,000 over 100K.  Would you expect a larger memory window
(on a better machine) to help shrink the repo down any more?

Sean

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-10  4:59       ` Sean
@ 2008-02-10  5:22         ` Nicolas Pitre
  2008-02-10  5:35           ` Sean
  2008-02-10  9:34         ` Joachim B Haga
  1 sibling, 1 reply; 85+ messages in thread
From: Nicolas Pitre @ 2008-02-10  5:22 UTC (permalink / raw)
  To: Sean; +Cc: Jan Holesovsky, Jakub Narebski, git, Junio C Hamano

On Sat, 9 Feb 2008, Sean wrote:

> On Sat, 09 Feb 2008 22:10:06 -0500 (EST)
> Nicolas Pitre <nico@cam.org> wrote:
> 
> > On Sat, 9 Feb 2008, Jan Holesovsky wrote:
> > 
> > > On Friday 08 February 2008 20:00, Jakub Narebski wrote:
> > > 
> > > > Both Mozilla import, and GCC import were packed below 0.5 GB. Warning:
> > > > you would need machine with large amount of memory to repack it
> > > > tightly in sensible time!
> > > 
> > > As I answered elsewhere, unfortunately it goes out of memory even on 8G 
> > > machine (x86-64), so...  But still trying.
> > 
> > Try setting the following config variables as follows:
> > 
> > 	git config pack.deltaCacheLimit 1
> > 	git config pack.deltaCacheSize 1
> > 	git config pack.windowMemory 1g
> > 
> > That should help keeping memory usage somewhat bounded.
> > 
> 
> Hi Nicolas,
> 
> Tried that earlier today and got a 1.6G pack (on a 2G machine).  There are
> some big objects in that repo.. over 100 are 30 to 62M in size, 400 more
> over 10M, and ~40,000 over 100K.  Would you expect a larger memory window
> (on a better machine) to help shrink the repo down any more?

Well, I don't think so.  Anyway, with the above pack.windowMemory 
setting, the window probably gets shrunk if those big objects are all 
to be found in the same window.  So that would be the setting to 
increase if you have lots of ram.

Finding out what those huge objects are, and if they actually need to be 
there, would be a good thing to do to reduce any repository size.


Nicolas

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-10  5:22         ` Nicolas Pitre
@ 2008-02-10  5:35           ` Sean
  2008-02-11  1:42             ` Jakub Narebski
  0 siblings, 1 reply; 85+ messages in thread
From: Sean @ 2008-02-10  5:35 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jan Holesovsky, Jakub Narebski, git, Junio C Hamano

On Sun, 10 Feb 2008 00:22:09 -0500 (EST)
Nicolas Pitre <nico@cam.org> wrote:

> Well, I don't think so.  Anyway, with the above pack.windowMemory 
> setting, the window probably gets shrinked if those big objects are all 
> to be found in the same window.  So that would be the setting to 
> increase if you have lots of ram.

Sounds like it would be worthwhile then for Jan to try on that 8G machine
and see what comes out the other end.

> Finding out what those huge objects are, and if they actually need to be 
> there, would be a good thing to do to reduce any repository size.

Okay, I've sent the sha1's of the top 500 to Jan for inspection.  It appears
that many of the largest objects are automatically generated i18n files that
could be regenerated from source files when needed rather than being checked
in themselves; but that's for the OO folks to decide.

Thanks,
Sean

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-09 14:25   ` Jan Holesovsky
  2008-02-09 22:05     ` Mike Hommey
@ 2008-02-10  7:23     ` Marco Costalba
  2008-02-10 12:08       ` Johannes Schindelin
  1 sibling, 1 reply; 85+ messages in thread
From: Marco Costalba @ 2008-02-10  7:23 UTC (permalink / raw)
  To: Jan Holesovsky; +Cc: Nicolas Pitre, git, gitster

On Feb 9, 2008 3:25 PM, Jan Holesovsky <kendy@suse.cz> wrote:
> Hi Nicolas,
>
> On Friday 08 February 2008 19:03, Nicolas Pitre wrote:
>
> > > I've provided a git import of OOo with the entire history; the problem is
> > > that the pack has 2.5G, so it's not too convenient to download for casual
> > > developers that just want to try it.
> >

Sorry to enter this thread so late. I would just like to ask if you
have evaluated a different approach for casual developers.

The approach is the one used by the Linux tree.

The Linux git repository is not very big and can be downloaded with ease.
On the other hand, Linux history spans many more years than the repo
does.

The design choice here is to have *two repositories*, one with recent
stuff and one historical, with stuff older than version 2.6.12.

We have to say that this choice came about by accident, due to Linus
switching from bitkeeper to git around 2.6.12, but today it's more or
less a conscious choice: there exists the git historical repo, converted
from bk, and this repo is still kept separate, even though technically it
could be grafted onto the main one to create one super big Linux repo.
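
(For reference, such a graft would be a single line in the full repo's
.git/info/grafts; the sha1s here are placeholders, not the real ones:

	<sha1 of first commit in recent repo> <sha1 of historical tip>

each line names a commit followed by the parents it should pretend to
have.)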

Advantages of this approach are:

- Lean and fast everyday repos, where actual development occurs

- Easy cloning also for casual users

- The possibility to have the whole history anyway, when needed

A variation on this theme could be to always have two repos, one with
recent stuff, say the last 5 years of development, and one with *the
whole* history, not only the old stuff as in the historical Linux
tree; this way it's easier for people that need to dig into very old
changes, avoiding having to browse two repos as happens now with
Linux.

Marco

P.S.: The idea here is that of a kind of cache memory for git repos ;-)

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-10  4:59       ` Sean
  2008-02-10  5:22         ` Nicolas Pitre
@ 2008-02-10  9:34         ` Joachim B Haga
  1 sibling, 0 replies; 85+ messages in thread
From: Joachim B Haga @ 2008-02-10  9:34 UTC (permalink / raw)
  To: git; +Cc: Jan Holesovsky, Jakub Narebski, git, Junio C Hamano,
	Nicolas Pitre

Sean <seanlkml@sympatico.ca> writes:

>> 	git config pack.deltaCacheLimit 1
>> 	git config pack.deltaCacheSize 1
>> 	git config pack.windowMemory 1g
>
> Tried that earlier today and got a 1.6G pack (on a 2G machine).  There are
> some big objects in that repo.. over 100 are 30 to 62M in size, 400 more
> over 10M, and ~40,000 over 100K.  Would you expect a larger memory window
> (on a better machine) to help shrink the repo down any more?

I tried without these, 1.47GiB packfile. Peak RSS ~14G.

-j.

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-10  7:23     ` Marco Costalba
@ 2008-02-10 12:08       ` Johannes Schindelin
  2008-02-10 16:46         ` David Symonds
  0 siblings, 1 reply; 85+ messages in thread
From: Johannes Schindelin @ 2008-02-10 12:08 UTC (permalink / raw)
  To: Marco Costalba; +Cc: Jan Holesovsky, Nicolas Pitre, git, gitster

Hi,

On Sun, 10 Feb 2008, Marco Costalba wrote:

> Linux git repository is not very big and can be downloaded with ease. On 
> the other hand Linux history spans many more years than the repo does.
> 
> The design choice here is to have *two repositories*, one with recent 
> stuff and one historical, with stuff older than version 2.6.12

I do not think that this is an option: Jan already tried a shallow clone 
(which would amount to something like what you propose), and it was still 
too large.

Ciao,
Dscho

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-10  3:10     ` Nicolas Pitre
  2008-02-10  4:59       ` Sean
@ 2008-02-10 16:43       ` Johannes Schindelin
  2008-02-10 17:01         ` Jon Smirl
                           ` (2 more replies)
  1 sibling, 3 replies; 85+ messages in thread
From: Johannes Schindelin @ 2008-02-10 16:43 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jan Holesovsky, Jakub Narebski, git, Junio C Hamano

Hi,

On Sat, 9 Feb 2008, Nicolas Pitre wrote:

> On Sat, 9 Feb 2008, Jan Holesovsky wrote:
> 
> > On Friday 08 February 2008 20:00, Jakub Narebski wrote:
> > 
> > > Both Mozilla import, and GCC import were packed below 0.5 GB. Warning:
> > > you would need machine with large amount of memory to repack it
> > > tightly in sensible time!
> > 
> > As I answered elsewhere, unfortunately it goes out of memory even on 8G 
> > machine (x86-64), so...  But still trying.
> 
> Try setting the following config variables as follows:
> 
> 	git config pack.deltaCacheLimit 1
> 	git config pack.deltaCacheSize 1
> 	git config pack.windowMemory 1g
> 
> That should help keeping memory usage somewhat bounded.

I tried that:

$ git config pack.deltaCacheLimit 1
$ git config pack.deltaCacheSize 1
$ git config pack.windowMemory 2g
$ #/usr/bin/time git repack -a -d -f --window=250 --depth=250
$ du -s objects/
2548137 objects/
$ /usr/bin/time git repack -a -d -f --window=250 --depth=250
Counting objects: 2477715, done.
fatal: Out of memory, malloc failed411764)
Command exited with non-zero status 1
9356.95user 53.33system 2:38:58elapsed 98%CPU (0avgtext+0avgdata 
0maxresident)k
0inputs+0outputs (31929major+18088744minor)pagefaults 0swaps

Note that this is on a 2.4GHz Quadcore CPU with 3.5GB RAM.

I'm retrying with smaller values, but at over 2.5 hours per try, this is 
getting tedious.

Ciao,
Dscho

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-10 12:08       ` Johannes Schindelin
@ 2008-02-10 16:46         ` David Symonds
  2008-02-10 17:45           ` Johannes Schindelin
  0 siblings, 1 reply; 85+ messages in thread
From: David Symonds @ 2008-02-10 16:46 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Marco Costalba, Jan Holesovsky, Nicolas Pitre, git, gitster

On Feb 10, 2008 4:08 AM, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Sun, 10 Feb 2008, Marco Costalba wrote:
>
> > Linux git repository is not very big and can be downloaded with ease. On
> > the other hand Linux history spans many more years than the repo does.
> >
> > The design choice here is to have *two repositories*, one with recent
> > stuff and one historical, with stuff older than version 2.6.12
>
> I do not think that this is an option: Jan already tried a shallow clone
> (which would amount to something like what you propose), and it was still
> too large.

I think that was still pulling all the branches, so a shallow clone of
just a couple of branches might be feasible.


Dave.

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-10 16:43       ` Johannes Schindelin
@ 2008-02-10 17:01         ` Jon Smirl
  2008-02-10 17:36           ` Johannes Schindelin
  2008-02-10 18:47         ` Johannes Schindelin
  2008-02-10 19:50         ` Nicolas Pitre
  2 siblings, 1 reply; 85+ messages in thread
From: Jon Smirl @ 2008-02-10 17:01 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Nicolas Pitre, Jan Holesovsky, Jakub Narebski, git,
	Junio C Hamano

On 2/10/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Sat, 9 Feb 2008, Nicolas Pitre wrote:
>
> > On Sat, 9 Feb 2008, Jan Holesovsky wrote:
> >
> > > On Friday 08 February 2008 20:00, Jakub Narebski wrote:
> > >
> > > > Both Mozilla import, and GCC import were packed below 0.5 GB. Warning:
> > > > you would need machine with large amount of memory to repack it
> > > > tightly in sensible time!
> > >
> > > As I answered elsewhere, unfortunately it goes out of memory even on 8G
> > > machine (x86-64), so...  But still trying.
> >
> > Try setting the following config variables as follows:
> >
> >       git config pack.deltaCacheLimit 1
> >       git config pack.deltaCacheSize 1
> >       git config pack.windowMemory 1g
> >
> > That should help keeping memory usage somewhat bounded.
>
> I tried that:
>
> $ git config pack.deltaCacheLimit 1
> $ git config pack.deltaCacheSize 1
> $ git config pack.windowMemory 2g
> $ #/usr/bin/time git repack -a -d -f --window=250 --depth=250
> $ du -s objects/
> 2548137 objects/
> $ /usr/bin/time git repack -a -d -f --window=250 --depth=250
> Counting objects: 2477715, done.
> fatal: Out of memory, malloc failed411764)
> Command exited with non-zero status 1
> 9356.95user 53.33system 2:38:58elapsed 98%CPU (0avgtext+0avgdata
> 0maxresident)k
> 0inputs+0outputs (31929major+18088744minor)pagefaults 0swaps
>
> Note that this is on a 2.4GHz Quadcore CPU with 3.5GB RAM.

Turning on multi-core support greatly increases the memory
consumption; at least double the single thread case.

Going over the original repository and deleting (get all copies out of
the history) those giant i18n files generated by programs that Sean
refers to would be my first step. If you have 5,000 revisions of a
10MB file I suspect it would take a huge amount of memory to pack.
Plus you have to copy all of that pointless history around.

>
> I'm retrying with smaller values, but at over 2.5 hours per try, this is
> getting tedious.
>
> Ciao,
> Dscho
>


-- 
Jon Smirl
jonsmirl@gmail.com

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-10 17:01         ` Jon Smirl
@ 2008-02-10 17:36           ` Johannes Schindelin
  0 siblings, 0 replies; 85+ messages in thread
From: Johannes Schindelin @ 2008-02-10 17:36 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Nicolas Pitre, Jan Holesovsky, Jakub Narebski, git,
	Junio C Hamano

Hi,

On Sun, 10 Feb 2008, Jon Smirl wrote:

> Turning on multi-core support greatly increases the memory consumption; 
> at least double the single thread case.

That's why I did not do it.

Ciao,
Dscho

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-10 16:46         ` David Symonds
@ 2008-02-10 17:45           ` Johannes Schindelin
  2008-02-10 19:45             ` Nicolas Pitre
  0 siblings, 1 reply; 85+ messages in thread
From: Johannes Schindelin @ 2008-02-10 17:45 UTC (permalink / raw)
  To: David Symonds; +Cc: Marco Costalba, Jan Holesovsky, Nicolas Pitre, git, gitster

Hi,

On Sun, 10 Feb 2008, David Symonds wrote:

> On Feb 10, 2008 4:08 AM, Johannes Schindelin 
> <Johannes.Schindelin@gmx.de> wrote:
>
> > On Sun, 10 Feb 2008, Marco Costalba wrote:
> >
> > > Linux git repository is not very big and can be downloaded with 
> > > ease. On the other hand Linux history spans many more years than the 
> > > repo does.
> > >
> > > The design choice here is to have *two repositories*, one with 
> > > recent stuff and one historical, with stuff older than version 
> > > 2.6.12
> >
> > I do not think that this is an option: Jan already tried a shallow 
> > clone (which would amount to something like what you propose), and it 
> > was still too large.
> 
> I think that was still pulling all the branches, so a shallow clone of 
> just a couple of branches might be feasible.

Indeed:

$ git ls-remote git://o3-build.services.openoffice.org/git/ooo.git|wc -l
3970
$ git ls-remote --heads git://o3-build.services.openoffice.org/git/ooo.git|
	wc -l
751

Fetching just master is a little hard on the server (it spends quite a 
lot of time deltifying -- minutes! -- especially between 80% and 95%, 
and indexing is even slower), but other than 
that:

$ /usr/bin/time git fetch --depth=1 \
	git://o3-build.services.openoffice.org/git/ooo.git \
	master:refs/remotes/origin/master
warning: no common commits
remote: Generating pack...
remote: Done counting 79934 objects.
remote: Deltifying 79934 objects...
remote:  100% (79934/79934) done
Indexing 79934 objects...
remote: Total 79934 (delta 34549), reused 51323 (delta 20737)
 100% (79934/79934) done
Resolving 34549 deltas...
 100% (34549/34549) done
* refs/remotes/origin/master: storing branch 'master' of 
git://o3-build.services.openoffice.org/git/ooo
  commit: 29990e4
46.48user 4.60system 16:48.29elapsed 5%CPU (0avgtext+0avgdata 
0maxresident)k
0inputs+0outputs (0major+941205minor)pagefaults 0swaps

$ du .git/objects/pack/
464688  .git/objects/pack/
$ /usr/bin/time git repack -a -d -f --window=250 --depth=250
Generating pack...
Done counting 79934 objects.
Deltifying 79934 objects...
 100% (79934/79934) done
Writing 79934 objects...
 100% (79934/79934) done
Total 79934 (delta 40013), reused 0 (delta 0)
Pack pack-350e4edca93ee75ef3d85269284a24775bf6b24f created.
Removing unused objects 100%...
Done.
1869.78user 6.66system 31:36.50elapsed 98%CPU (0avgtext+0avgdata 
0maxresident)k
0inputs+0outputs (2031major+1753824minor)pagefaults 0swaps
$ du .git/objects/pack/
454636  .git/objects/pack/

Of course, the clone time would be reduced dramatically if the repository 
you clone from has only "master", and is fully (re-)packed.

So I was not completely correct in my assumption that a clear cut a la 
linux-2.6 (possibly grafting historical-linux) would not help.

Ciao,
Dscho

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-10 16:43       ` Johannes Schindelin
  2008-02-10 17:01         ` Jon Smirl
@ 2008-02-10 18:47         ` Johannes Schindelin
  2008-02-10 19:42           ` Nicolas Pitre
  2008-02-12 20:37           ` Johannes Schindelin
  2008-02-10 19:50         ` Nicolas Pitre
  2 siblings, 2 replies; 85+ messages in thread
From: Johannes Schindelin @ 2008-02-10 18:47 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jan Holesovsky, Jakub Narebski, git, Junio C Hamano

Hi,

On Sun, 10 Feb 2008, Johannes Schindelin wrote:

> On Sat, 9 Feb 2008, Nicolas Pitre wrote:
> 
> > On Sat, 9 Feb 2008, Jan Holesovsky wrote:
> > 
> > > On Friday 08 February 2008 20:00, Jakub Narebski wrote:
> > > 
> > > > Both Mozilla import, and GCC import were packed below 0.5 GB. 
> > > > Warning: you would need machine with large amount of memory to 
> > > > repack it tightly in sensible time!
> > > 
> > > As I answered elsewhere, unfortunately it goes out of memory even on 
> > > 8G machine (x86-64), so...  But still trying.
> > 
> > Try setting the following config variables as follows:
> > 
> > 	git config pack.deltaCacheLimit 1
> > 	git config pack.deltaCacheSize 1
> > 	git config pack.windowMemory 1g
> > 
> > That should help keeping memory usage somewhat bounded.
> 
> I tried that:
> 
> $ git config pack.deltaCacheLimit 1
> $ git config pack.deltaCacheSize 1
> $ git config pack.windowMemory 2g
> $ #/usr/bin/time git repack -a -d -f --window=250 --depth=250
> $ du -s objects/
> 2548137 objects/
> $ /usr/bin/time git repack -a -d -f --window=250 --depth=250
> Counting objects: 2477715, done.
> fatal: Out of memory, malloc failed411764)
> Command exited with non-zero status 1
> 9356.95user 53.33system 2:38:58elapsed 98%CPU (0avgtext+0avgdata 
> 0maxresident)k
> 0inputs+0outputs (31929major+18088744minor)pagefaults 0swaps
> 
> Note that this is on a 2.4GHz Quadcore CPU with 3.5GB RAM.
> 
> I'm retrying with smaller values, but at over 2.5 hours per try, this is 
> getting tedious.

Now, _that_ is strange.  Using 150 instead of 250 brings it down even 
quicker!

$ /usr/bin/time git repack -a -d -f --window=150 --depth=150
Counting objects: 2477715, done.
Compressing objects:  19% (481551/2411764)
Compressing objects:  19% (482333/2411764)
fatal: Out of memory, malloc failed411764)
Command exited with non-zero status 1
7118.37user 54.15system 2:01:44elapsed 98%CPU (0avgtext+0avgdata 
0maxresident)k
0inputs+0outputs (29834major+17122977minor)pagefaults 0swaps

(I hit the Return key twice during the time I suspected it would go out of 
memory, so it might have been really at 20%.)

Ideas?

Ciao,
Dscho

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-10 18:47         ` Johannes Schindelin
@ 2008-02-10 19:42           ` Nicolas Pitre
  2008-02-10 20:11             ` Jon Smirl
  2008-02-12 20:37           ` Johannes Schindelin
  1 sibling, 1 reply; 85+ messages in thread
From: Nicolas Pitre @ 2008-02-10 19:42 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jan Holesovsky, Jakub Narebski, git, Junio C Hamano

On Sun, 10 Feb 2008, Johannes Schindelin wrote:

> Hi,
> 
> On Sun, 10 Feb 2008, Johannes Schindelin wrote:
> 
> > On Sat, 9 Feb 2008, Nicolas Pitre wrote:
> > 
> > > On Sat, 9 Feb 2008, Jan Holesovsky wrote:
> > > 
> > > > On Friday 08 February 2008 20:00, Jakub Narebski wrote:
> > > > 
> > > > > Both Mozilla import, and GCC import were packed below 0.5 GB. 
> > > > > Warning: you would need machine with large amount of memory to 
> > > > > repack it tightly in sensible time!
> > > > 
> > > > As I answered elsewhere, unfortunately it goes out of memory even on 
> > > > 8G machine (x86-64), so...  But still trying.
> > > 
> > > Try setting the following config variables as follows:
> > > 
> > > 	git config pack.deltaCacheLimit 1
> > > 	git config pack.deltaCacheSize 1
> > > 	git config pack.windowMemory 1g
> > > 
> > > That should help keeping memory usage somewhat bounded.
> > 
> > I tried that:
> > 
> > $ git config pack.deltaCacheLimit 1
> > $ git config pack.deltaCacheSize 1
> > $ git config pack.windowMemory 2g
> > $ #/usr/bin/time git repack -a -d -f --window=250 --depth=250
> > $ du -s objects/
> > 2548137 objects/
> > $ /usr/bin/time git repack -a -d -f --window=250 --depth=250
> > Counting objects: 2477715, done.
> > fatal: Out of memory, malloc failed411764)
> > Command exited with non-zero status 1
> > 9356.95user 53.33system 2:38:58elapsed 98%CPU (0avgtext+0avgdata 
> > 0maxresident)k
> > 0inputs+0outputs (31929major+18088744minor)pagefaults 0swaps
> > 
> > Note that this is on a 2.4GHz Quadcore CPU with 3.5GB RAM.
> > 
> > I'm retrying with smaller values, but at over 2.5 hours per try, this is 
> > getting tedious.
> 
> Now, _that_ is strange.  Using 150 instead of 250 brings it down even 
> quicker!
> 
> $ /usr/bin/time git repack -a -d -f --window=150 --depth=150
> Counting objects: 2477715, done.
> Compressing objects:  19% (481551/2411764)
> Compressing objects:  19% (482333/2411764)
> fatal: Out of memory, malloc failed411764)
> Command exited with non-zero status 1
> 7118.37user 54.15system 2:01:44elapsed 98%CPU (0avgtext+0avgdata 
> 0maxresident)k
> 0inputs+0outputs (29834major+17122977minor)pagefaults 0swaps
> 
> (I hit the Return key twice during the time I suspected it would go out of 
> memory, so it might have been really at 20%.)
> 
> Ideas?

You're probably hitting the same memory allocator fragmentation issue I 
had with the gcc repo.  On my machine with 1GB of ram, I was able to 
repack the 1.5GB source pack just fine, but repacking the 300MB source 
pack was impossible due to memory exhaustion.

My theory is that the smaller pack has many more deltas with deeper 
delta chains, and this is stomping much harder on the memory allocator 
which fails to prevent fragmentation at some point.  When Jon Smirl 
tested Git using the Google memory allocator there was around 1GB less 
allocated, which might indicate that the glibc allocator has issues with 
some of Git's workloads.


Nicolas

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-10 17:45           ` Johannes Schindelin
@ 2008-02-10 19:45             ` Nicolas Pitre
  2008-02-10 20:32               ` Johannes Schindelin
  0 siblings, 1 reply; 85+ messages in thread
From: Nicolas Pitre @ 2008-02-10 19:45 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: David Symonds, Marco Costalba, Jan Holesovsky, git, gitster

On Sun, 10 Feb 2008, Johannes Schindelin wrote:

> Resolving 34549 deltas...
>  100% (34549/34549) done

What Git version is this?

You better try out 1.5.4 for packing comparisons.  It produces slightly 
tighter packs than 1.5.3.


Nicolas

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-10 16:43       ` Johannes Schindelin
  2008-02-10 17:01         ` Jon Smirl
  2008-02-10 18:47         ` Johannes Schindelin
@ 2008-02-10 19:50         ` Nicolas Pitre
  2008-02-14 19:41           ` Brandon Casey
  2 siblings, 1 reply; 85+ messages in thread
From: Nicolas Pitre @ 2008-02-10 19:50 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jan Holesovsky, Jakub Narebski, git, Junio C Hamano

On Sun, 10 Feb 2008, Johannes Schindelin wrote:

> I tried that:
> 
> $ git config pack.deltaCacheLimit 1
> $ git config pack.deltaCacheSize 1
> $ git config pack.windowMemory 2g

This has nothing to do with repacking memory usage, but even tighter 
packs can be obtained with:

	git config repack.usedeltabaseoffset true

This is not the default yet.


Nicolas

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-10 19:42           ` Nicolas Pitre
@ 2008-02-10 20:11             ` Jon Smirl
  0 siblings, 0 replies; 85+ messages in thread
From: Jon Smirl @ 2008-02-10 20:11 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Johannes Schindelin, Jan Holesovsky, Jakub Narebski, git,
	Junio C Hamano

On 2/10/08, Nicolas Pitre <nico@cam.org> wrote:
> On Sun, 10 Feb 2008, Johannes Schindelin wrote:
>
> > Hi,
> >
> > On Sun, 10 Feb 2008, Johannes Schindelin wrote:
> >
> > > On Sat, 9 Feb 2008, Nicolas Pitre wrote:
> > >
> > > > On Sat, 9 Feb 2008, Jan Holesovsky wrote:
> > > >
> > > > > On Friday 08 February 2008 20:00, Jakub Narebski wrote:
> > > > >
> > > > > > Both Mozilla import, and GCC import were packed below 0.5 GB.
> > > > > > Warning: you would need a machine with a large amount of memory
> > > > > > to repack it tightly in sensible time!
> > > > >
> > > > > As I answered elsewhere, unfortunately it goes out of memory even on
> > > > > an 8G machine (x86-64), so...  But still trying.
> > > >
> > > > Try setting the following config variables as follows:
> > > >
> > > >   git config pack.deltaCacheLimit 1
> > > >   git config pack.deltaCacheSize 1
> > > >   git config pack.windowMemory 1g
> > > >
> > > > That should help keeping memory usage somewhat bounded.
> > >
> > > I tried that:
> > >
> > > $ git config pack.deltaCacheLimit 1
> > > $ git config pack.deltaCacheSize 1
> > > $ git config pack.windowMemory 2g
> > > $ #/usr/bin/time git repack -a -d -f --window=250 --depth=250
> > > $ du -s objects/
> > > 2548137 objects/
> > > $ /usr/bin/time git repack -a -d -f --window=250 --depth=250
> > > Counting objects: 2477715, done.
> > > fatal: Out of memory, malloc failed411764)
> > > Command exited with non-zero status 1
> > > 9356.95user 53.33system 2:38:58elapsed 98%CPU (0avgtext+0avgdata
> > > 0maxresident)k
> > > 0inputs+0outputs (31929major+18088744minor)pagefaults 0swaps
> > >
> > > Note that this is on a 2.4GHz Quad-core CPU with 3.5GB RAM.
> > >
> > > I'm retrying with smaller values, but at over 2.5 hours per try, this is
> > > getting tedious.
> >
> > Now, _that_ is strange.  Using 150 instead of 250 brings it down even
> > quicker!
> >
> > $ /usr/bin/time git repack -a -d -f --window=150 --depth=150
> > Counting objects: 2477715, done.
> > Compressing objects:  19% (481551/2411764)
> > Compressing objects:  19% (482333/2411764)
> > fatal: Out of memory, malloc failed411764)
> > Command exited with non-zero status 1
> > 7118.37user 54.15system 2:01:44elapsed 98%CPU (0avgtext+0avgdata
> > 0maxresident)k
> > 0inputs+0outputs (29834major+17122977minor)pagefaults 0swaps
> >
> > (I hit the Return key twice during the time I suspected it would go out of
> > memory, so it might have been really at 20%.)
> >
> > Ideas?
>
> You're probably hitting the same memory allocator fragmentation issue I
> had with the gcc repo.  On my machine with 1GB of ram, I was able to
> repack the 1.5GB source pack just fine, but repacking the 300MB source
> pack was impossible due to memory exhaustion.
>
> My theory is that the smaller pack has many more deltas with deeper
> > delta chains, and this is stomping much harder on the memory allocator,
> > which fails to prevent fragmentation at some point.  When Jon Smirl
> tested Git using the Google memory allocator there was around 1GB less
> allocated, which might indicate that the glibc allocator has issues with
> some of Git's workloads.

I'm forgetting everything again, but I seem to recall that the Google
allocator only made a significant difference with multithreading.  It
is much better at keeping the threads from fragmenting each other.
It's very easy to try it; all you have to do is add another lib to
the link command.
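
For example, LD_PRELOAD lets you test it without even relinking (this
assumes google-perftools is installed; the library path below is just
a guess and will differ per distro):

	# hypothetical tcmalloc path; adjust to your installation
	LD_PRELOAD=/usr/lib/libtcmalloc.so \
		/usr/bin/time git repack -a -d -f --window=250 --depth=250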


>
>
> Nicolas
> -
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-10 19:45             ` Nicolas Pitre
@ 2008-02-10 20:32               ` Johannes Schindelin
  0 siblings, 0 replies; 85+ messages in thread
From: Johannes Schindelin @ 2008-02-10 20:32 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: David Symonds, Marco Costalba, Jan Holesovsky, git, gitster

Hi,

On Sun, 10 Feb 2008, Nicolas Pitre wrote:

> On Sun, 10 Feb 2008, Johannes Schindelin wrote:
> 
> > Resolving 34549 deltas...
> >  100% (34549/34549) done
> 
> What Git version is this?
> 
> You better try out 1.5.4 for packing comparisons.  It produces slightly 
> tighter packs than 1.5.3.

Ooops.  I thought I updated, but no: 1.5.3.6.2835.gf9ebf

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-09 15:27   ` Jan Holesovsky
  2008-02-10  3:10     ` Nicolas Pitre
@ 2008-02-11  1:20     ` Jakub Narebski
  1 sibling, 0 replies; 85+ messages in thread
From: Jakub Narebski @ 2008-02-11  1:20 UTC (permalink / raw)
  To: Jan Holesovsky; +Cc: git, Junio C Hamano

Hi, Jan!

On Sat, 9 Feb 2008, Jan Holesovsky wrote:
> On Friday 08 February 2008 20:00, Jakub Narebski wrote:
> 
>> It was not implemented because it was thought to be hard; git assumes
>> in many places that if it has an object, it has all objects referenced
>> by it.
>>
>> But it is very nice of you to [try to] implement 'lazy clone'/'remote
>> alternates'.
>>
>> Could you provide some benchmarks (time, network throughput, latency)
>> for your implementation?
> 
> Unfortunately not yet :-(  The only data I have is that a clone done on 
> git://localhost/ooo.git took 10 minutes without the lazy clone, and 7.5 
> minutes with it - and then I sent the patch for review here ;-)  The deadline 
> for our SVN vs. git comparison for OOo is next Friday, so I'll definitely 
> have some better data by then.

Here is perhaps another optimization which wasn't done because git is
fast enough on moderately-sized repositories: IIRC git-clone (and
git-fetch for sure) over the native (smart) protocol recreates the
pack, even though it would sometimes be better and simpler to just
copy (transfer) an existing pack.

But this would need a multi-pack "extension" (it should work even now
without a transport protocol extension; the receiver must only be
aware of the need to split the resulting pack, and index all of them).
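
The dumb commit-walker transports already do something close to this.
Done by hand it would look roughly like the following (the pack name
is a made-up placeholder):

	wget http://example.com/ooo.git/objects/pack/pack-XXXX.pack
	git index-pack pack-XXXX.pack     # regenerate the .idx locally
	mv pack-XXXX.pack pack-XXXX.idx .git/objects/pack/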

>> Both Mozilla import, and GCC import were packed below 0.5 GB. Warning:
>> you would need machine with large amount of memory to repack it
>> tightly in sensible time!
> 
> As I answered elsewhere, unfortunately it goes out of memory even on 8G 
> machine (x86-64), so...  But still trying.

I hope that would work better...

>>> Shallow clone is not a possibility - we don't get patches through
>>> mailing lists, so we need the pull/push, and also thanks to the OOo
>>> development cycle, we have too many living heads which causes the
>>> shallow clone to download about 1.5G even with --depth 1.
>>
>> Wouldn't be easier to try to fix shallow clone implementation to allow
>> for pushing from shallow to full clone (fetching from full to shallow
>> is implemented), and perhaps also push/pull between two shallow
>> clones?
> 
> I tried to look into it a bit, but unfortunately did not see a clear way 
> to do it transparently for the user - say you pull a branch that is based off 
> a commit you do not have.  But of course, I could have missed something ;-)

If I remember correctly, fetching _into_ a shallow clone works
correctly, as does deepening the depth of a shallow clone. What is not
implemented AFAIK, but should not be too hard, would be to allow
pushing from a shallow clone to a full clone. This would allow a
network of full clones (functioning as centres to publish your work)
and shallow repositories with a few branches (working repositories).

I don't know if that would be enough.

For better support git would need to exchange graft-like information,
and use the union of the restrictions to get the correct commits.


Perhaps it would be best to mail the 'shallow clone' author...

>> As to many living heads: first, you don't need to fetch all
>> heads. Currently git-clone has no option to select subset of heads to
>> clone, but you can always use git-init + hand configuration +
>> git-remote and git-fetch for actual fetching.
> 
> Right, might be interesting as well.  But still the missing push/pull is 
> problematic for us [or at least I see it as a problem ;-)].

You can configure separate 'remote's for the same repository
with different heads. This would work both for pull and for push.
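
For example (the URL and branch names are made up), in .git/config:

	[remote "core"]
		url = git://example.com/ooo.git
		fetch = +refs/heads/master:refs/remotes/core/master

	[remote "all"]
		url = git://example.com/ooo.git
		fetch = +refs/heads/*:refs/remotes/all/*

'git fetch core' would then only download objects reachable from
master, while 'git fetch all' gets every head.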


I think the solution proposed by Marco Costalba, namely creating an
"archive" repository and a "live" repository, joining them if needed
by grafts, similarly to how the linux kernel has a live repo and a
historical import repo, would be a good alternative to a shallow or
lazy clone.

There would be an "archive" repo (or repos), read only, with the whole
history, very tightly packed with kept packs, with all branches and
all tags, and a "live" repo with only current history (a year, or
since a major API change, or from today, or something like that) and
only the important branches (or several repos, each containing the set
of branches important for a team). There would be a prepared graft
file to join the two histories if you have to examine the full
history. Hopefully the live repo would be smaller.
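
The graft file itself is a one-liner. Assuming <live-root> is the root
commit of the live repo and <archive-tip> the corresponding tip of the
archive history (placeholders, of course), after fetching the archive
objects you would put into .git/info/grafts:

	<live-root-sha1> <archive-tip-sha1>

and git log, git blame etc. would then walk seamlessly across the
join.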

>> By the way, did you try to split OpenOffice.org repository at the
>> components boundary into submodules (subprojects)? This would also
>> limit the amount of needed download, as you don't need to download and
>> check out all subprojects.
> 
> Yes, and got much nicer repositories that way ;-) - just by moving some 
> binary stuff out of the CVS to a separate tree.  The problem is that the deal 
> is to compare the same stuff in SVN and git - so no choice for me in fact.

Sidenote: due to (from what I have read) the heavy use of topic branches
in OOo development, Subversion would have to be used with the svnmerge
extension, or together with SVK, to make working with it not a complete
pain.

In CVS you could have ad-hoc modules and ad-hoc partial checkouts
(so-called 'modules'), but that plays merry hell with whole-tree,
atomic, recoverable-state commits. In Git you have to plan the
boundaries between submodules / subprojects carefully. An additional
advantage is that the boundaries would be clearer, and better
modularity usually leads to better code.

Comparing Subversion and Git directly is a bit stupid: they promote
different workflows. From what I've read, Git, with its ability to
very easily create branches, with easy _merging_ of branches, and the
ability to easily create _private_ branches (testing branches), has
much in common with the chosen OOo SCM workflow. Playing to the
strengths of Subversion just because that is what you used before,
because of the limits of the previously used tools, is not smart.

But if you have to, then you have to. Git would hopefully get lazy
clone support from your effort. But perhaps it would be possible
(though additional work) to prepare two repositories: the first the
same as in Subversion (and the same as now in CVS), the second one
"how it should be done with Git".

>> The problem of course is _how_ to split the repository into
>> submodules. Submodules should be self-contained enough that a
>> whole-tree commit is always (or almost always) only about one submodule.
> 
> I hope it will be doable _if_ the git wins & will be chosen for OOo.

I hope that the ability to work with submodules (with the ability to
not clone / check out modules if not needed), i.e. "svn:externals
done right" to paraphrase the SVN slogan, would be one of the reasons
to choose Git over Subversion.

>>> Lazy clone sounded like the right idea to me.  With this
>>> proof-of-concept implementation, just about 550M from the 2.5G is
>>> downloaded, which is still about twice as much in comparison with
>>> downloading a tarball, but bearable.
>>
>> Do you have any numbers for OOo repository like number of revisions,
>> depth of DAG of commits (maximum number of revisions in one line of
>> commits), number of files, size of checkout, average size of file,
>> etc.?
> 
> I'll try to provide the data ASAP.

For example, what is the size of a full checkout (all version-control
managed files)? If, for example, it is 0.5 GB, it would be hard to go
below 0.5 GB or so with the pack size, even with compression of the
objects themselves in the pack file.


Such large repositories, like Mozilla, GCC, or now OpenOffice.org,
test the limits of Git. Perhaps snapshot-based distributed SCMs
cannot deal sensibly with such large projects; I hope this is not
the case.

I wonder if the packv4 improvements, whose development stalled because
(if I understand correctly) they didn't bring enough improvement, and
because what we have now was good enough for the projects seen so far,
would help with the OpenOffice.org repository...


P.S. From what I have read OOo uses CVS + some branch DB; does
your importer make use of this branch info database?

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-10  5:35           ` Sean
@ 2008-02-11  1:42             ` Jakub Narebski
  2008-02-11  2:04               ` Nicolas Pitre
  0 siblings, 1 reply; 85+ messages in thread
From: Jakub Narebski @ 2008-02-11  1:42 UTC (permalink / raw)
  To: Sean; +Cc: Nicolas Pitre, Jan Holesovsky, git, Junio C Hamano

On Sun, 10 Feb 2008, Sean wrote:
> On Sun, 10 Feb 2008 00:22:09 -0500 (EST)
> Nicolas Pitre <nico@cam.org> wrote:
>
>> Finding out what those huge objects are, and if they actually need to be 
>> there, would be a good thing to do to reduce any repository size.
> 
> Okay, I've sent the sha1's of the top 500 to Jan for inspection.  It appears
> that many of the largest objects are automatically generated i18n files that
> could be regenerated from source files when needed rather than being checked
> in themselves; but that's for the OO folks to decide.

Good practice is to not add generated files to version control.
But sometimes such files are stored if regenerating them is costly
(./configure file in some cases, 'man' and 'html' branches in git.git).

IIRC Dana How also tried to deal with a repository with large binary
files, although in that case those had shallow history. IIRC the
proposed solution was to pack all such large objects undeltified
into a separate kept "large-objects" pack.

You can mark large files with the (undocumented except for the
RelNotes) 'delta' gitattribute, but I don't know if it would help
in your case.
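
For example, to keep big generated files out of the delta search
(the file pattern here is just made up for illustration):

	$ cat .gitattributes
	*.sdf -delta

Blobs matching the pattern are then stored whole (only
zlib-compressed), which can speed up repacking at some cost in
pack size.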

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-11  1:42             ` Jakub Narebski
@ 2008-02-11  2:04               ` Nicolas Pitre
  2008-02-11 10:11                 ` Jakub Narebski
  0 siblings, 1 reply; 85+ messages in thread
From: Nicolas Pitre @ 2008-02-11  2:04 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Sean, Jan Holesovsky, git, Junio C Hamano

On Mon, 11 Feb 2008, Jakub Narebski wrote:

> On Sun, 10 Feb 2008, Sean wrote:
> > On Sun, 10 Feb 2008 00:22:09 -0500 (EST)
> > Nicolas Pitre <nico@cam.org> wrote:
> >
> >> Finding out what those huge objects are, and if they actually need to be 
> >> there, would be a good thing to do to reduce any repository size.
> > 
> > Okay, I've sent the sha1's of the top 500 to Jan for inspection.  It appears
> > that many of the largest objects are automatically generated i18n files that
> > could be regenerated from source files when needed rather than being checked
> > in themselves; but that's for the OO folks to decide.
> 
> Good practice is to not add generated files to version control.
> But sometimes such files are stored if regenerating them is costly
> (./configure file in some cases, 'man' and 'html' branches in git.git).
> 
> IIRC Dana How also tried to deal with a repository with large binary
> files, although in that case those had shallow history. IIRC the
> proposed solution was to pack all such large objects undeltified
> into a separate kept "large-objects" pack.

That was to solve a completely different problem which wasn't about 
space saving, but rather to save on 'git push' latency.


Nicolas

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-11  2:04               ` Nicolas Pitre
@ 2008-02-11 10:11                 ` Jakub Narebski
  0 siblings, 0 replies; 85+ messages in thread
From: Jakub Narebski @ 2008-02-11 10:11 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Sean, Jan Holesovsky, git, Junio C Hamano

On Mon, 11 Feb 2008, Nicolas Pitre wrote:
> On Mon, 11 Feb 2008, Jakub Narebski wrote:
>> On Sun, 10 Feb 2008, Sean wrote:
>>> On Sun, 10 Feb 2008 00:22:09 -0500 (EST)
>>> Nicolas Pitre <nico@cam.org> wrote:
>>>
>>>> Finding out what those huge objects are, and if they actually need to be 
>>>> there, would be a good thing to do to reduce any repository size.

>> IIRC Dana How also tried to deal with a repository with large binary
>> files, although in that case those had shallow history. IIRC the
>> proposed solution was to pack all such large objects undeltified
>> into a separate kept "large-objects" pack.
> 
> That was to solve a completely different problem which wasn't about 
> space saving, but rather to save on 'git push' latency.

Sorry, my mistake.

Although in Dana's case the large blobs, whether separated into
non-packed loose objects (her patches) or into a separate kept pack
of non-delta large blobs (the proposed solution), were shared over a
networked filesystem. So the amortized size of the repository was
smaller... ;-ppp

-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-08 20:09     ` Nicolas Pitre
@ 2008-02-11 10:13       ` Andreas Ericsson
  2008-02-12  2:55         ` [PATCH 1/2] pack-objects: Allow setting the #threads equal to #cpus automatically Brandon Casey
       [not found]         ` <1202784078-23700-1-git-send-email-casey@nrlssc.navy.mil>
  0 siblings, 2 replies; 85+ messages in thread
From: Andreas Ericsson @ 2008-02-11 10:13 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Jon Smirl, Jakub Narebski, Jan Holesovsky, git, Junio C Hamano

Nicolas Pitre wrote:
> On Fri, 8 Feb 2008, Jon Smirl wrote:
> 
>> There are some patches for making repack work multi-core. Not sure if
>> they made it into the main git tree yet.
> 
> Yes, they are.  You need to compile with "make THREADED_DELTA_SEARCH=yes" 
> or add THREADED_DELTA_SEARCH=yes into config.mak for it to be enabled 
> though.  Then you have to set the pack.threads configuration variable 
> appropriately to use it.
> 

I sent a patch to get it to auto-detect multi-core machines, but I see
now that it was commented upon with suggestions for finalization (by
Nicolas, actually) and I must have missed that, thinking it had been
applied because I got an accidental merge in my own tree.

As such, I've been using that patch for the last several months without
problems. I'll rework it as per Nicolas' suggestions and resend.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH 1/2] pack-objects: Allow setting the #threads equal to #cpus automatically
  2008-02-11 10:13       ` Andreas Ericsson
@ 2008-02-12  2:55         ` Brandon Casey
  2008-02-12  5:53           ` Andreas Ericsson
       [not found]         ` <1202784078-23700-1-git-send-email-casey@nrlssc.navy.mil>
  1 sibling, 1 reply; 85+ messages in thread
From: Brandon Casey @ 2008-02-12  2:55 UTC (permalink / raw)
  To: ae; +Cc: Nicolas Pitre, Git Mailing List

Allow pack.threads config option and --threads command line option to
accept '0' as an argument and set the number of created threads equal
to the number of online processors in this case.

Signed-off-by: Brandon Casey <casey@nrlssc.navy.mil>
---


I was preparing this patch when I saw your email. I looked up the
old email you were talking about. Your function is better since
it is cross-platform.

When you redo your patch, you may want to adopt one aspect of this
one. I used a setting of zero to imply "set number of threads to
number of cpus". This allows the user to specifically set pack.threads
in the config file to zero with the above mentioned meaning, or to
override a setting in the config file from the command line with
--threads=0. This is rather than having to delete the option from the
config file.

-brandon


 builtin-pack-objects.c |   22 ++++++++++++++++++----
 1 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 692a761..5c55c11 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -1852,11 +1852,11 @@ static int git_pack_config(const char *k, const char *v)
 	}
 	if (!strcmp(k, "pack.threads")) {
 		delta_search_threads = git_config_int(k, v);
-		if (delta_search_threads < 1)
+		if (delta_search_threads < 0)
 			die("invalid number of threads specified (%d)",
 			    delta_search_threads);
 #ifndef THREADED_DELTA_SEARCH
-		if (delta_search_threads > 1)
+		if (delta_search_threads != 1)
 			warning("no threads support, ignoring %s", k);
 #endif
 		return 0;
@@ -2121,10 +2121,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		if (!prefixcmp(arg, "--threads=")) {
 			char *end;
 			delta_search_threads = strtoul(arg+10, &end, 0);
-			if (!arg[10] || *end || delta_search_threads < 1)
+			if (!arg[10] || *end || delta_search_threads < 0)
 				usage(pack_usage);
 #ifndef THREADED_DELTA_SEARCH
-			if (delta_search_threads > 1)
+			if (delta_search_threads != 1)
 				warning("no threads support, "
 					"ignoring %s", arg);
 #endif
@@ -2234,6 +2234,20 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!pack_to_stdout && thin)
 		die("--thin cannot be used to build an indexable pack.");
 
+#ifdef THREADED_DELTA_SEARCH
+	if (!delta_search_threads) {
+#if defined _SC_NPROCESSORS_ONLN
+		delta_search_threads = sysconf(_SC_NPROCESSORS_ONLN);
+#elif defined _SC_NPROC_ONLN
+		delta_search_threads = sysconf(_SC_NPROC_ONLN);
+#endif
+		if (delta_search_threads == -1)
+			perror("Could not detect number of processors");
+		if (delta_search_threads <= 0)
+			delta_search_threads = 1;
+	}
+#endif
+
 	prepare_packed_git();
 
 	if (progress)
-- 
1.5.4.1.40.gdb90

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 2/2] pack-objects: Default to zero threads, meaning auto-assign to #cpus
       [not found]         ` <1202784078-23700-1-git-send-email-casey@nrlssc.navy.mil>
@ 2008-02-12  2:59           ` Brandon Casey
  2008-02-12  4:57             ` Nicolas Pitre
  0 siblings, 1 reply; 85+ messages in thread
From: Brandon Casey @ 2008-02-12  2:59 UTC (permalink / raw)
  To: ae; +Cc: Nicolas Pitre, Git Mailing List

Additionally, update some tests for which the multi-threaded result
differs from the single-threaded result and the single-threaded
result is expected.

Signed-off-by: Brandon Casey <casey@nrlssc.navy.mil>
---


Two of the tests in t5300-pack-object.sh failed when multiple
threads were used. My fix was to set --threads=1 for all pack-objects
calls. I didn't look into it any further than that. All other tests
passed.

-brandon


 builtin-pack-objects.c |    2 +-
 t/t5300-pack-object.sh |    8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 5c55c11..743de52 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -70,7 +70,7 @@ static int progress = 1;
 static int window = 10;
 static uint32_t pack_size_limit, pack_size_limit_cfg;
 static int depth = 50;
-static int delta_search_threads = 1;
+static int delta_search_threads = 0;
 static int pack_to_stdout;
 static int num_preferred_base;
 static struct progress *progress_state;
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index cd3c149..16ee940 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -35,7 +35,7 @@ test_expect_success \
 
 test_expect_success \
     'pack without delta' \
-    'packname_1=$(git pack-objects --window=0 test-1 <obj-list)'
+    'packname_1=$(git pack-objects --threads=1 --window=0 test-1 <obj-list)'
 
 rm -fr .git2
 mkdir .git2
@@ -66,7 +66,7 @@ cd "$TRASH"
 test_expect_success \
     'pack with REF_DELTA' \
     'pwd &&
-     packname_2=$(git pack-objects test-2 <obj-list)'
+     packname_2=$(git pack-objects --threads=1 test-2 <obj-list)'
 
 rm -fr .git2
 mkdir .git2
@@ -96,7 +96,7 @@ cd "$TRASH"
 test_expect_success \
     'pack with OFS_DELTA' \
     'pwd &&
-     packname_3=$(git pack-objects --delta-base-offset test-3 <obj-list)'
+     packname_3=$(git pack-objects --threads=1 --delta-base-offset test-3 <obj-list)'
 
 rm -fr .git2
 mkdir .git2
@@ -271,7 +271,7 @@ test_expect_success \
 test_expect_success \
     'honor pack.packSizeLimit' \
     'git config pack.packSizeLimit 200 &&
-     packname_4=$(git pack-objects test-4 <obj-list) &&
+     packname_4=$(git pack-objects --threads=1 test-4 <obj-list) &&
      test 3 = $(ls test-4-*.pack | wc -l)'
 
 test_done
-- 
1.5.4.1.40.gdb90

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [PATCH 2/2] pack-objects: Default to zero threads, meaning auto-assign to #cpus
  2008-02-12  2:59           ` [PATCH 2/2] pack-objects: Default to zero threads, meaning auto-assign to #cpus Brandon Casey
@ 2008-02-12  4:57             ` Nicolas Pitre
  0 siblings, 0 replies; 85+ messages in thread
From: Nicolas Pitre @ 2008-02-12  4:57 UTC (permalink / raw)
  To: Brandon Casey; +Cc: ae, Git Mailing List

On Mon, 11 Feb 2008, Brandon Casey wrote:

> Additionally, update some tests for which the multi-threaded result
> differs from the single-threaded result and the single-threaded
> result is expected.
> 
> Signed-off-by: Brandon Casey <casey@nrlssc.navy.mil>

I think the first patch is OK, but having the _default_ be 
multi-threaded is going a bit too far.  IMHO you should document the 
meaning of the value 0, and compile with thread support whenever Posix 
threads are available, but activating threads should be done explicitly.
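
Something along these lines in Documentation/config.txt would do
(the wording is only a sketch):

	pack.threads::
		Specifies the number of threads to spawn when searching
		for best delta matches.  This requires pack-objects to be
		compiled with THREADED_DELTA_SEARCH.  Specifying 0 asks
		git to auto-detect the number of CPUs and use that many
		threads.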


Nicolas

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 1/2] pack-objects: Allow setting the #threads equal to #cpus automatically
  2008-02-12  2:55         ` [PATCH 1/2] pack-objects: Allow setting the #threads equal to #cpus automatically Brandon Casey
@ 2008-02-12  5:53           ` Andreas Ericsson
  0 siblings, 0 replies; 85+ messages in thread
From: Andreas Ericsson @ 2008-02-12  5:53 UTC (permalink / raw)
  To: Brandon Casey; +Cc: Nicolas Pitre, Git Mailing List

Brandon Casey wrote:
> Allow pack.threads config option and --threads command line option to
> accept '0' as an argument and set the number of created threads equal
> to the number of online processors in this case.
> 
> Signed-off-by: Brandon Casey <casey@nrlssc.navy.mil>
> ---
> 
> 
> I was preparing this patch when I saw your email. I looked up your
> the old email you were talking about. Your function is better since
> it is cross platform.
> 
> When you redo your patch, you may want to adopt one aspect of this
> one. I used a setting of zero to imply "set number of threads to
> number of cpus". This allows the user to specifically set pack.threads
> in the config file to zero with the above mentioned meaning, or to
> override a setting in the config file from the command line with
> --threads=0. This is rather than having to delete the option from the
> config file.
> 

That makes sense. Perhaps even going so far as to allow 'auto' as a
keyword would be nifty.

>  
> +#ifdef THREADED_DELTA_SEARCH
> +	if (!delta_search_threads) {
> +#if defined _SC_NPROCESSORS_ONLN
> +		delta_search_threads = sysconf(_SC_NPROCESSORS_ONLN);
> +#elif defined _SC_NPROC_ONLN
> +		delta_search_threads = sysconf(_SC_NPROC_ONLN);
> +#endif
> +		if (delta_search_threads == -1)
> +			perror("Could not detect number of processors");
> +		if (delta_search_threads <= 0)
> +			delta_search_threads = 1;
> +	}
> +#endif
> +

But this is not so good. For one thing you've dropped Windows support
entirely. The last comment on my own patch was that get_num_active_cpus()
should live in a file of its own. You've taken one step back from that
and not even kept it in its own function.

I think perhaps it's time to introduce thread-compat.[ch] to deal with
thread-related cross-platform things like this.
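
Roughly like this, say (only a sketch of what thread-compat.c could
contain; more platforms and error handling would be needed):

	#ifdef _WIN32
	#include <windows.h>
	#else
	#include <unistd.h>
	#endif

	int get_num_active_cpus(void)
	{
	#ifdef _WIN32
		SYSTEM_INFO info;
		GetSystemInfo(&info);
		return (int)info.dwNumberOfProcessors;
	#elif defined(_SC_NPROCESSORS_ONLN)
		long n = sysconf(_SC_NPROCESSORS_ONLN);
		return n > 0 ? (int)n : 1;
	#elif defined(_SC_NPROC_ONLN)
		long n = sysconf(_SC_NPROC_ONLN);
		return n > 0 ? (int)n : 1;
	#else
		return 1;	/* no way to tell; stay single-threaded */
	#endif
	}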

I'll recook my patch and send it in a few minutes, using your suggestions
and Nicolas' combined.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-10 18:47         ` Johannes Schindelin
  2008-02-10 19:42           ` Nicolas Pitre
@ 2008-02-12 20:37           ` Johannes Schindelin
  2008-02-12 21:05             ` Nicolas Pitre
                               ` (3 more replies)
  1 sibling, 4 replies; 85+ messages in thread
From: Johannes Schindelin @ 2008-02-12 20:37 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jan Holesovsky, Jakub Narebski, git, Junio C Hamano

Hi,

On Sun, 10 Feb 2008, Johannes Schindelin wrote:

> $ /usr/bin/time git repack -a -d -f --window=150 --depth=150
> Counting objects: 2477715, done.
> Compressing objects:  19% (481551/2411764)
> Compressing objects:  19% (482333/2411764)
> fatal: Out of memory, malloc failed411764)
> Command exited with non-zero status 1
> 7118.37user 54.15system 2:01:44elapsed 98%CPU (0avgtext+0avgdata 
> 0maxresident)k
> 0inputs+0outputs (29834major+17122977minor)pagefaults 0swaps

I made the window much smaller (512 megabytes), and it still runs, after 27 
hours:

Compressing objects:  20% (484132/2411764)

However, it seems that it only worked on about 4000 objects in the last 
20(!) hours.  So, the first 19% were relatively quick.  The next percent 
not at all.

Will keep you posted,
Dscho

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-12 20:37           ` Johannes Schindelin
@ 2008-02-12 21:05             ` Nicolas Pitre
  2008-02-12 21:08             ` Linus Torvalds
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 85+ messages in thread
From: Nicolas Pitre @ 2008-02-12 21:05 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jan Holesovsky, Jakub Narebski, git, Junio C Hamano

On Tue, 12 Feb 2008, Johannes Schindelin wrote:

> Hi,
> 
> On Sun, 10 Feb 2008, Johannes Schindelin wrote:
> 
> > $ /usr/bin/time git repack -a -d -f --window=150 --depth=150
> > Counting objects: 2477715, done.
> > Compressing objects:  19% (481551/2411764)
> > Compressing objects:  19% (482333/2411764)
> > fatal: Out of memory, malloc failed411764)
> > Command exited with non-zero status 1
> > 7118.37user 54.15system 2:01:44elapsed 98%CPU (0avgtext+0avgdata 
> > 0maxresident)k
> > 0inputs+0outputs (29834major+17122977minor)pagefaults 0swaps
> 
> I made the window much smaller (512 megabytes), and it still runs, after 27 
> hours:
> 
> Compressing objects:  20% (484132/2411764)
> 
> However, it seems that it only worked on about 4000 objects in the last 
> 20(!) hours.  So, the first 19% were relatively quick.  The next percent 
> not at all.

Yeah... this repo is really a pain to repack.  I have access to an 
8-processor machine with 8GB of ram and all my repack attempts so far 
were killed after using too much memory, despite the window memory 
limit.  Those were threaded repack attempts, so the first 98% was really 
quick, like less than 15 minutes, but then all threads converged on this 
small fraction of the object space which appears to cause problems.  
And then I'm presuming I ran into the same threaded memory fragmentation 
issue.  Might be worth attaching gdb to it and extracting a sample of the 
object SHA1's populating the delta window when the slowdown occurs to 
see what they actually are...

I'm attempting a single-threaded repack now.


Nicolas

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-12 20:37           ` Johannes Schindelin
  2008-02-12 21:05             ` Nicolas Pitre
@ 2008-02-12 21:08             ` Linus Torvalds
  2008-02-12 21:36               ` Jon Smirl
  2008-02-12 21:25             ` Jon Smirl
  2008-02-14 19:20             ` Johannes Schindelin
  3 siblings, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2008-02-12 21:08 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Nicolas Pitre, Jan Holesovsky, Jakub Narebski, git,
	Junio C Hamano



On Tue, 12 Feb 2008, Johannes Schindelin wrote:
> 
> I made the window much smaller (512 megabytes), and it still runs, after 27 
> hours:

I'd suggest making the memory window smaller yet. 

512MB is a *big* amount of memory, if you fill it up, and end up using an 
O(n**2) algorithm on the objects within the window (which it is: the 
repacking algorithm is O(n) in _total_ objects, but the constant part is 
basically O(winsize^2)).

I'd suggest that a reasonable window memory limit is around just a few 
megabytes (eg 4MB to maybe 64MB). If you have "normal" source files, 
you're still going to be limited by the window _count_ size (assuming normal 
source files are in the few tens of kB), and for those occasional large 
files, you'd better hope that the sort heuristics are good enough.
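
Concretely, e.g. (any value in that range should do):

	git config pack.windowMemory 16m
	/usr/bin/time git repack -a -d -f --window=250 --depth=250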

			Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-12 20:37           ` Johannes Schindelin
  2008-02-12 21:05             ` Nicolas Pitre
  2008-02-12 21:08             ` Linus Torvalds
@ 2008-02-12 21:25             ` Jon Smirl
  2008-02-14 19:20             ` Johannes Schindelin
  3 siblings, 0 replies; 85+ messages in thread
From: Jon Smirl @ 2008-02-12 21:25 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Nicolas Pitre, Jan Holesovsky, Jakub Narebski, git,
	Junio C Hamano

On 2/12/08, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Sun, 10 Feb 2008, Johannes Schindelin wrote:
>
> > $ /usr/bin/time git repack -a -d -f --window=150 --depth=150
> > Counting objects: 2477715, done.
> > Compressing objects:  19% (481551/2411764)
> > Compressing objects:  19% (482333/2411764)
> > fatal: Out of memory, malloc failed411764)
> > Command exited with non-zero status 1
> > 7118.37user 54.15system 2:01:44elapsed 98%CPU (0avgtext+0avgdata
> > 0maxresident)k
> > 0inputs+0outputs (29834major+17122977minor)pagefaults 0swaps
>
> I made the window much smaller (512 megabytes), and it still runs, after 27
> hours:
>
> Compressing objects:  20% (484132/2411764)
>
> However, it seems that it only worked on about 4000 objects in the last
> 20(!) hours.

I found that out with gcc. 95% went down in no time and the last 5%
took two hours. The 5% that got stuck were chains with 2000+ entries.

The neat thing about the multithreaded code is that it will keep
splitting the workload. That lets all of the easy deltas finish and
not get stuck behind the problem objects.

With quad core on gcc, one core would get stuck on the problem objects.
The other three would finish their list and start splitting the
problem list. This effectively sorts the problems to the end of the
workload. By printing the object hashes out as they are completed you
can easily identify the problem objects. If I recall right, on gcc the
problem was a configure file that had 2000 entries in its delta chain.
That one delta chain took over an hour to process.

Could there be an N-squared type problem when 2000-entry delta chains
are encountered? Maybe something that just isn't noticeable when
depth/window=50. Has testing been done with really long object chains
to make sure that only the minimal amount of work is being done? It
seems like something is breaking down when the chain length exceeds
the window size.
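
One way to spot such chains without instrumenting the code (a sketch;
the pack name is a placeholder): verify-pack prints a chain-length
histogram at the end of its verbose output, so

	git verify-pack -v objects/pack/pack-XXXX.idx |
		sed -n 's/^chain length = \([0-9]*\): \([0-9]*\) object.*/\1 \2/p' |
		tail

shows how many objects sit on the deepest chains.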

> So, the first 19% were relatively quick.  The next percent
> not at all.
>
> Will keep you posted,
> Dscho
>
> -
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-12 21:08             ` Linus Torvalds
@ 2008-02-12 21:36               ` Jon Smirl
  2008-02-12 21:59                 ` Linus Torvalds
  0 siblings, 1 reply; 85+ messages in thread
From: Jon Smirl @ 2008-02-12 21:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Schindelin, Nicolas Pitre, Jan Holesovsky,
	Jakub Narebski, git, Junio C Hamano

How many diffs should it take to compress a 2000-entry delta chain with
window/depth=250?

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-12 21:36               ` Jon Smirl
@ 2008-02-12 21:59                 ` Linus Torvalds
  2008-02-12 22:25                   ` Linus Torvalds
  0 siblings, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2008-02-12 21:59 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Johannes Schindelin, Nicolas Pitre, Jan Holesovsky,
	Jakub Narebski, git, Junio C Hamano



On Tue, 12 Feb 2008, Jon Smirl wrote:
>
> How many diffs should it take to compress a 2000 delta chain with
> window/depth=250?

There's no fixed answer. We do various culling heuristics to avoid actually 
generating a diff at all if it looks unlikely to succeed etc. But in 
general, the way the window works is that 
 (a) we only need to generate the _unpacked_ object once
 (b) we compare each object to the "window-1" preceding objects, which is 
     how I got the O(windowsize^2) 
 (c) but then that "compare" relatively seldom involves actually 
     generating a whole diff!

So the answer is: in _theory_ each object may be compared to 
(windowsize-1) other objects, but in practice it's much less than that.
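
(For scale: with --window=250 that upper bound is 249 delta attempts
per object, i.e. for this repo's ~2.4M objects at most roughly
2.4M * 249 ~= 600M candidate comparisons, which is why the culling
matters so much.)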

			Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-12 21:59                 ` Linus Torvalds
@ 2008-02-12 22:25                   ` Linus Torvalds
  2008-02-12 22:43                     ` Jon Smirl
  0 siblings, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2008-02-12 22:25 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Johannes Schindelin, Nicolas Pitre, Jan Holesovsky,
	Jakub Narebski, git, Junio C Hamano



On Tue, 12 Feb 2008, Linus Torvalds wrote:
>
>  (b) we compare each object to the "window-1" preceding objects, which is 
>      how I got the O(windowsize^2) 

That's not really true, of course. But my (broken and inexact) logic is 
that we get one cost multiplier from the number of objects, and one from 
the size of the objects.

So *if* we have the situation of not limiting the window size, we 
basically have a big slowdown from raising the window in number of 
objects: not only do we get a slowdown from comparing more objects, we 
spend relatively more time comparing the *large* ones to begin with and 
having more of them just makes it even more skewed - when we hit a series 
of big blocks, the window will also contain more big blocks, so it's kind of 
a double whammy.

But I don't think calling it O(windowsize^2) is really correct. It's still 
O(windowsize), it's just that the purely "number-of-object" thing doesn't 
account for big objects being much more expensive to diff. So you really 
want to make the *memory* limiter the big one, because that's the one that 
actually approximates how much time you end up spending.

So ignore that O(n^2) blather. It's not correct. What _is_ correct is that 
we want to aggressively limit memory size, because CPU cost goes up 
linearly not just with number of objects, but also super-linearly with 
size of the object ("super-linear" due to bad cache behavior and in worst 
case due to paging).

			Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-12 22:25                   ` Linus Torvalds
@ 2008-02-12 22:43                     ` Jon Smirl
  2008-02-12 23:39                       ` Linus Torvalds
  0 siblings, 1 reply; 85+ messages in thread
From: Jon Smirl @ 2008-02-12 22:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Schindelin, Nicolas Pitre, Jan Holesovsky,
	Jakub Narebski, git, Junio C Hamano

On 2/12/08, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Tue, 12 Feb 2008, Linus Torvalds wrote:
> >
> >  (b) we compare each object to the "window-1" preceding objects, which is
> >      how I got the O(windowsize^2)
>
> That's not really true, of course. But my (broken and inexact) logic is
> that we get one cost multiplier from the number of objects, and one from
> the size of the objects.
>
> So *if* we have the situation of not limiting the window size, we
> basically have a big slowdown from raising the window in number of
> objects: not only do we get a slowdown from comparing more objects, we
> spend relatively more time comparing the *large* ones to begin with and
> having more of them just makes it even more skewed - when we hit a series
> of big blocks, the window will also contain more big blocks, so it kind of
> a double whammy.
>
> But I don't think calling it O(windowsize^2) is really correct. It's still
> O(windowsize), it's just that the purely "number-of-object" thing doesn't
> account for big objects being much more expensive to diff. So you really
> want to make the *memory* limiter the big one, because that's the one that
> actually approximates how much time you end up spending.
>
> So ignore that O(n^2) blather. It's not correct. What _is_ correct is that
> we want to aggressively limit memory size, because CPU cost goes up
> linearly not just with number of objects, but also super-linearly with
> size of the object ("super-linear" due to bad cache behavior and in worst
> case due to paging).


In the gcc case I wasn't running out of memory. I believe I was CPU bound
for an hour processing a single object chain with 2000 entries. That
sure doesn't feel like O(windowsize).

Maybe someone playing with the OO repo can stick in an appropriate
printf and see how many diffs are really being done, just to make sure
they match what we think the number should be.


>
>                         Linus
>


-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-12 22:43                     ` Jon Smirl
@ 2008-02-12 23:39                       ` Linus Torvalds
  0 siblings, 0 replies; 85+ messages in thread
From: Linus Torvalds @ 2008-02-12 23:39 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Johannes Schindelin, Nicolas Pitre, Jan Holesovsky,
	Jakub Narebski, git, Junio C Hamano



On Tue, 12 Feb 2008, Jon Smirl wrote:
> 
> In the gcc case I wasn't running out memory. I believe was CPU bound
> for an hour processing a single object chain with 2000 entries. That
> sure doesn't feel like O(windowsize).

Well, there's another - and totally unrelated - issue with *pre-existing* 
delta chains that are very deep.

Namely the fact that since such a deep delta chain will exhaust the 
delta-cache, you will now have an O(n*chaindepth) behaviour when you unpack 
the objects (in order to generate the deltas) in the first place!

So that really has nothing to do with the new window (or delta) depth at 
all, just with the _previous_ window depth.

See sha1_file.c: MAX_DELTA_CACHE.

If you have a 2000-deep delta chain, then the delta-cache should be big 
enough that you hit in it regularly without flushing it when you traverse 
down the chain. So MAX_DELTA_CACHE should generally be at _least_ as much 
as the max delta chain length, which is obviously normally the case 
(default max delta chain length: 10).

We could probably fairly easily make that MAX_DELTA_CACHE be a config 
option, but right now you have to recompile to test that theory of mine.
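
To test it, something like this should be enough (I believe the
current value is 256, but check your sha1_file.c):

	--- a/sha1_file.c
	+++ b/sha1_file.c
	-#define MAX_DELTA_CACHE (256)
	+#define MAX_DELTA_CACHE (4096)

and rebuild.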

Or just limit your delta depth to something much smaller (i.e. ~100 or so).

		Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-12 20:37           ` Johannes Schindelin
                               ` (2 preceding siblings ...)
  2008-02-12 21:25             ` Jon Smirl
@ 2008-02-14 19:20             ` Johannes Schindelin
  2008-02-14 20:05               ` Jakub Narebski
  2008-02-15  9:34               ` Jan Holesovsky
  3 siblings, 2 replies; 85+ messages in thread
From: Johannes Schindelin @ 2008-02-14 19:20 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jan Holesovsky, Jakub Narebski, git, Junio C Hamano

Hi,

On Tue, 12 Feb 2008, Johannes Schindelin wrote:

> On Sun, 10 Feb 2008, Johannes Schindelin wrote:
> 
> > $ /usr/bin/time git repack -a -d -f --window=150 --depth=150
> > Counting objects: 2477715, done.
> > Compressing objects:  19% (481551/2411764)
> > Compressing objects:  19% (482333/2411764)
> > fatal: Out of memory, malloc failed411764)
> > Command exited with non-zero status 1
> > 7118.37user 54.15system 2:01:44elapsed 98%CPU (0avgtext+0avgdata 
> > 0maxresident)k
> > 0inputs+0outputs (29834major+17122977minor)pagefaults 0swaps
> 
> I made the window much smaller (512 megabytes), and it still runs, after 27 
> hours:
> 
> Compressing objects:  20% (484132/2411764)
> 
> However, it seems that it only worked on about 4000 objects in the last 
> 20(!) hours.  So, the first 19% were relatively quick.  The next percent 
> not at all.

Finally!

I updated to the newest git+patches (git version 1.5.4.1.1353.g0d5dd), reset 
windowMemory to 512m and restarted the process:

$ /usr/bin/time git repack -a -d -f --window=250 --depth=250
Counting objects: 2477715, done.
Compressing objects: 100% (2411764/2411764), done.
Writing objects: 100% (2477715/2477715), done.
Total 2477715 (delta 1876242), reused 0 (delta 0)
21733.55user 175.32system 6:10:37elapsed 98%CPU (0avgtext+0avgdata 
0maxresident)k
0inputs+0outputs (81921major+63880453minor)pagefaults 0swaps

A little over 6 hours, with one core (of the four available).  Not bad, I 
say.

The result is:

$ ls -la objects/pack/pack-e4dc6da0a10888ec4345490575efc587b7523b45.pack
-rwxrwxrwx 1 root root 1638490531 2008-02-14 17:51 
objects/pack/pack-e4dc6da0a10888ec4345490575efc587b7523b45.pack

1.6G looks much better than 2.4G, wouldn't you say?  Jan, if you want it, 
please give me a place to upload it to.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-10 19:50         ` Nicolas Pitre
@ 2008-02-14 19:41           ` Brandon Casey
  2008-02-14 19:58             ` Johannes Schindelin
  2008-02-14 20:11             ` Nicolas Pitre
  0 siblings, 2 replies; 85+ messages in thread
From: Brandon Casey @ 2008-02-14 19:41 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Johannes Schindelin, Jan Holesovsky, Jakub Narebski, git,
	Junio C Hamano

Nicolas Pitre wrote:
> On Sun, 10 Feb 2008, Johannes Schindelin wrote:
> 
>> I tried that:
>>
>> $ git config pack.deltaCacheLimit 1
>> $ git config pack.deltaCacheSize 1
>> $ git config pack.windowMemory 2g
> 
> This has nothing to do with repacking memory usage, but even tighter 
> packs can be obtained with:
> 
> 	git config repack.usedeltabaseoffset true
> 
> This is not the default yet.

I have successfully repacked this repo a few times on a 2.1GHz system with 16G.

The smallest attained pack was about 1.45G (1556569742B).

This run took about 7 hours 26 min.

I ran: git repack -a -d -f --window=250 --depth=250

Here are the relevant config entries:
[pack]
        threads = 1
        compression = 9
[repack]
        usedeltabaseoffset = true


Other runs:


* Same as above, but with default compression:

	pack size: 1560624388
	time: 7 hours 11 min

	Not much difference in time or size.


* Multi threaded (250m window)
[pack]
        threads = 4
        windowmemory = 250m
        compression = 9
[repack]
        usedeltabaseoffset = true

	pack size: 1767405703
	time: 3 hours

	First >99% took 50 min. The last 10000 objects took 2 hours.

* Multi threaded (500m window)
[pack]
        threads = 4
        windowmemory = 500m
        compression = 9
[repack]
        usedeltabaseoffset = true

	pack size: 1640820903
	time: forgot to time, but between 3-4 hours based on file time

	I just received Dscho's email; this is interesting to compare
	with his single-threaded result of 1638490531. I wonder if he
	used deltabaseoffset? I think his machine is a little faster
	than this one. So using 4 threads finished twice as fast and
	produced a similar pack size. Actually, the difference could
	just be the compression setting.

* Deeper (git repack -a -d -f --window=250 --depth=500)
[pack]
        threads = 1
        compression = 9
[repack]
        usedeltabaseoffset = true

	pack size: 1578263745
	time: 7 hours 58 min

	Larger pack compared to --depth=250.

-brandon

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-14 19:41           ` Brandon Casey
@ 2008-02-14 19:58             ` Johannes Schindelin
  2008-02-14 20:11             ` Nicolas Pitre
  1 sibling, 0 replies; 85+ messages in thread
From: Johannes Schindelin @ 2008-02-14 19:58 UTC (permalink / raw)
  To: Brandon Casey
  Cc: Nicolas Pitre, Jan Holesovsky, Jakub Narebski, git,
	Junio C Hamano

Hi,

On Thu, 14 Feb 2008, Brandon Casey wrote:

> 	I just received Dscho's email, this is interesting to compare
> 	with his single threaded result of 1638490531. I wonder if he
> 	used deltabaseoffset?

Nope.  Wanted it to be as compatible as possible.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-14 19:20             ` Johannes Schindelin
@ 2008-02-14 20:05               ` Jakub Narebski
  2008-02-14 20:16                 ` Nicolas Pitre
                                   ` (2 more replies)
  2008-02-15  9:34               ` Jan Holesovsky
  1 sibling, 3 replies; 85+ messages in thread
From: Jakub Narebski @ 2008-02-14 20:05 UTC (permalink / raw)
  To: Johannes Schindelin, Brandon Casey
  Cc: Nicolas Pitre, Jan Holesovsky, git, Junio C Hamano

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Finally!
> 
> I updated to newest git+patches (git version 1.5.4.1.1353.g0d5dd), reset 
> windowMemory to 512m and restarted the process:
 
> A little over 6 hours, with one core (of the four available).  Not bad, I 
> say.
> 
> The result is:
> 
> $ ls -la objects/pack/pack-e4dc6da0a10888ec4345490575efc587b7523b45.pack
> -rwxrwxrwx 1 root root 1638490531 2008-02-14 17:51 
> objects/pack/pack-e4dc6da0a10888ec4345490575efc587b7523b45.pack
> 
> 1.6G looks much better than 2.4G, wouldn't you say?  Jan, if you want it, 
> please give me a place to upload it to.

Brandon Casey wrote:

> I have successfully repacked this repo a few times on a 2.1GHz
> system with 16G.
> 
> The smallest attained pack was about 1.45G (1556569742B).

Do you perchance know why OOo needs such a large pack? Perhaps you could
try running contrib/stats/packinfo.pl on this pack to examine it and
find out what takes the most space.

What is the size of checkout, by the way?

Hmmm... I wonder if packv4 would help...
-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-14 19:41           ` Brandon Casey
  2008-02-14 19:58             ` Johannes Schindelin
@ 2008-02-14 20:11             ` Nicolas Pitre
  1 sibling, 0 replies; 85+ messages in thread
From: Nicolas Pitre @ 2008-02-14 20:11 UTC (permalink / raw)
  To: Brandon Casey
  Cc: Johannes Schindelin, Jan Holesovsky, Jakub Narebski, git,
	Junio C Hamano

On Thu, 14 Feb 2008, Brandon Casey wrote:

> I have successfully repacked this repo a few times on a 2.1GHz system 
> with 16G.
> 
> The smallest attained pack was about 1.45G (1556569742B).
> 
[...]
> 
> * Multi threaded (250m window)
> [pack]
>         threads = 4
>         windowmemory = 250m
>         compression = 9
> [repack]
>         usedeltabaseoffset = true
> 
> 	pack size: 1767405703
> 	time: 3 hours
> 
> 	First >99% took 50 min. The last 10000 objects took 2 hours.

Right.  That's because the algorithm to distribute the load between 
threads ends up stealing work from other threads whenever a thread is 
done with its own share.  So the easy objects are quickly done with by a 
few threads until they all converge onto the hard ones.  In the 
non-threaded case, the slowdown occurs around 12%.

It looks like those hard objects are huge binary blobs.  If they could 
be removed from the repository entirely and regenerated as needed 
instead of being carried around then I expect the repository size would 
fall below the 500MB mark.
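
Listing the biggest objects is easy enough (a sketch; the pack name is
a placeholder, and column 3 of the verbose output is the unpacked
object size):

	git verify-pack -v objects/pack/pack-XXXX.idx |
		grep ' blob ' | sort -n -r -k 3 | head -20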


Nicolas

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-14 20:05               ` Jakub Narebski
@ 2008-02-14 20:16                 ` Nicolas Pitre
  2008-02-14 21:04                 ` Johannes Schindelin
  2008-02-14 21:08                 ` Brandon Casey
  2 siblings, 0 replies; 85+ messages in thread
From: Nicolas Pitre @ 2008-02-14 20:16 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Johannes Schindelin, Brandon Casey, Jan Holesovsky, git,
	Junio C Hamano

On Thu, 14 Feb 2008, Jakub Narebski wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > The result is:
> > 
> > $ ls -la objects/pack/pack-e4dc6da0a10888ec4345490575efc587b7523b45.pack
> > -rwxrwxrwx 1 root root 1638490531 2008-02-14 17:51 
> > objects/pack/pack-e4dc6da0a10888ec4345490575efc587b7523b45.pack
> > 
> > 1.6G looks much better than 2.4G, wouldn't you say?  Jan, if you want it, 
> > please give me a place to upload it to.
> 
> Brandon Casey wrote:
> 
> > I have successfully repacked this repo a few times on a 2.1GHz
> > system with 16G.
> > 
> > The smallest attained pack was about 1.45G (1556569742B).
> 
> Hmmm... I wonder if packv4 would help...

No.  Well, it would help a bit, maybe in the 10-20% range, but nothing 
as significant as going from 2.6G to 1.5G, or like in the GCC case, from 
1.3G to 230M.


Nicolas

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-14 20:05               ` Jakub Narebski
  2008-02-14 20:16                 ` Nicolas Pitre
@ 2008-02-14 21:04                 ` Johannes Schindelin
  2008-02-14 21:59                   ` Jakub Narebski
  2008-02-14 21:08                 ` Brandon Casey
  2 siblings, 1 reply; 85+ messages in thread
From: Johannes Schindelin @ 2008-02-14 21:04 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Brandon Casey, Nicolas Pitre, Jan Holesovsky, git, Junio C Hamano

Hi,

On Thu, 14 Feb 2008, Jakub Narebski wrote:

> Do you perchance know why OOo needs such a large pack?

No.

> Perhaps you could try running contrib/stats/packinfo.pl on this pack to 
> examine it and find out what takes the most space.

$ ~/git/contrib/stats/packinfo.pl < \
objects/pack/pack-e4dc6da0a10888ec4345490575efc587b7523b45.pack 2>&1 | \
tee packinfo.txt
Illegal division by zero at /home/imaging/git/contrib/stats/packinfo.pl 
line 141, <STDIN> line 6330855.

> What is the size of checkout, by the way?

I work on a bare repository, but:

$ git archive origin/master | wc -c
2010060800

Or more precisely:

$ echo $(($(git ls-tree -l -r origin/master | sed -n 's/^[^ ]* [^ ]* [^ ]*  
*\([0-9]*\).*$/\1/p' | tr '\012' +)0))
1947839459

So yes, we still have the crown of the _whole_ repository being _smaller_ 
than a single checkout.

Yeah!

> Hmmm... I wonder if packv4 would help...

I could imagine that it would, what with it being so much better with 
strings.  But it would come at a price in performance, I guess, as the 
string table should be well over 64k.

Ciao,
Dscho

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-14 20:05               ` Jakub Narebski
  2008-02-14 20:16                 ` Nicolas Pitre
  2008-02-14 21:04                 ` Johannes Schindelin
@ 2008-02-14 21:08                 ` Brandon Casey
  2 siblings, 0 replies; 85+ messages in thread
From: Brandon Casey @ 2008-02-14 21:08 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Johannes Schindelin, Nicolas Pitre, Jan Holesovsky, git,
	Junio C Hamano

Jakub Narebski wrote:
> Brandon Casey wrote:

>> The smallest attained pack was about 1.45G (1556569742B).
> 
> Do you perchance know why OOo needs such a large pack?  Perhaps you
> could try running contrib/stats/packinfo.pl on this pack to examine it
> and find out what takes the most space.

Earlier in this thread Sean did some analysis and found lots of large
objects, and he mentioned that he sent a listing to Jan for inspection.
I haven't heard anything more.

> What is the size of checkout, by the way?

2.4G

-brandon

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-14 21:04                 ` Johannes Schindelin
@ 2008-02-14 21:59                   ` Jakub Narebski
  2008-02-14 23:38                     ` Johannes Schindelin
  2008-02-15  9:43                     ` Jan Holesovsky
  0 siblings, 2 replies; 85+ messages in thread
From: Jakub Narebski @ 2008-02-14 21:59 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Brandon Casey, Nicolas Pitre, Jan Holesovsky, git, Junio C Hamano,
	Brian Downing

Johannes Schindelin wrote:
> On Thu, 14 Feb 2008, Jakub Narebski wrote:

>> Perhaps you could try running contrib/stats/packinfo.pl on this pack to 
>> examine it and find out what takes the most space.
> 
> $ ~/git/contrib/stats/packinfo.pl < \
> objects/pack/pack-e4dc6da0a10888ec4345490575efc587b7523b45.pack 2>&1 | \
> tee packinfo.txt
> Illegal division by zero at /home/imaging/git/contrib/stats/packinfo.pl 
> line 141, <STDIN> line 6330855.

Errr... sorry, I should have been more explicit. What I meant here
is the result of

$ git verify-pack -v <packfile> | \
  ~/git/contrib/stats/packinfo.pl


>> What is the size of checkout, by the way?
> 
> I work on a bare repository, but:
> 
> $ git archive origin/master | wc -c
> 2010060800
> 
> Or more precisely:
> 
> $ echo $(($(git ls-tree -l -r origin/master | sed -n 's/^[^ ]* [^ ]* [^ ]*  
> *\([0-9]*\).*$/\1/p' | tr '\012' +)0))
> 1947839459
> 
> So yes, we still have the crown of the _whole_ repository being _smaller_ 
> than a single checkout.
> 
> Yeah!


Brandon Casey wrote:
> Jakub Narebski wrote:
>> 
>> What is the size of checkout, by the way?
> 
> 2.4G

That's a huuuuge tree.  Compared to that, the 1.6G (or 1.4G) packfile
doesn't look large.

I wonder if a proper subdivision into submodules (which should encourage
better code, by the way; see TAOUP) and perhaps partial checkouts
wouldn't be a better solution than a lazy clone.  But it is nice to have
a long-discussed feature, even if only at the RFC stage, with some code.

-- 
Jakub Narebski
Poland

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-14 21:59                   ` Jakub Narebski
@ 2008-02-14 23:38                     ` Johannes Schindelin
  2008-02-14 23:51                       ` Brian Downing
  2008-02-15  1:07                       ` Jakub Narebski
  2008-02-15  9:43                     ` Jan Holesovsky
  1 sibling, 2 replies; 85+ messages in thread
From: Johannes Schindelin @ 2008-02-14 23:38 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Brandon Casey, Nicolas Pitre, Jan Holesovsky, git, Junio C Hamano,
	Brian Downing

Hi,

On Thu, 14 Feb 2008, Jakub Narebski wrote:

> Johannes Schindelin wrote:
> > On Thu, 14 Feb 2008, Jakub Narebski wrote:
> 
> >> Perhaps you could try running contrib/stats/packinfo.pl on this pack 
> >> to examine it and find out what takes the most space.
> > 
> > $ ~/git/contrib/stats/packinfo.pl < \
> > objects/pack/pack-e4dc6da0a10888ec4345490575efc587b7523b45.pack 2>&1 | \
> > tee packinfo.txt
> > Illegal division by zero at /home/imaging/git/contrib/stats/packinfo.pl 
> > line 141, <STDIN> line 6330855.
> 
> Errr... sorry, I should have been more explicit. What I meant here is 
> the result of
> 
> $ git verify-pack -v <packfile> | \
>   ~/git/contrib/stats/packinfo.pl

Heh.  I was too lazy to look up the usage, so I just did what I thought 
would make sense...

So here it goes:

$ git verify-pack -v \
    objects/pack/pack-e4dc6da0a10888ec4345490575efc587b7523b45.pack | \
    ~/git/contrib/stats/packinfo.pl | tee packinfo.txt
      all sizes: count 601473 total 2855826280 min 0 max 62173032 mean 4748.05 median 232 std_dev 221254.37
 all path sizes: count 601473 total 2855826280 min 0 max 62173032 mean 4748.05 median 232 std_dev 221254.37
     tree sizes: count 601473 total 2855826280 min 0 max 62173032 mean 4748.05 median 232 std_dev 221254.37
tree path sizes: count 601473 total 2855826280 min 0 max 62173032 mean 4748.05 median 232 std_dev 221254.37
         depths: count 2477715 total 70336238 min 0 max 250 mean 28.39 median 4 std_dev 55.49

Something in my gut tells me that those four repetitive lines are not 
meant to look like they do...

> > 2.4G
>
> That's a huuuuge tree.  Compared to that, the 1.6G (or 1.4G) packfile 
> doesn't look large.
> 
> I wonder if a proper subdivision into submodules (which should encourage 
> better code, by the way; see TAOUP) and perhaps partial checkouts 
> wouldn't be a better solution than a lazy clone.  But it is nice to have 
> a long-discussed feature, even if only at the RFC stage, with some code.

I think partial checkouts are wrong.  If you can work on partial 
checkouts, chances are that what you work on should be a submodule.

Having said that, I can understand if some people do not want to have the 
hassle of test^H^H^H^Husing submodules...

Ciao,
Dscho

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-14 23:38                     ` Johannes Schindelin
@ 2008-02-14 23:51                       ` Brian Downing
  2008-02-14 23:57                         ` Brian Downing
  2008-02-15  0:08                         ` Johannes Schindelin
  2008-02-15  1:07                       ` Jakub Narebski
  1 sibling, 2 replies; 85+ messages in thread
From: Brian Downing @ 2008-02-14 23:51 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Jakub Narebski, Brandon Casey, Nicolas Pitre, Jan Holesovsky, git,
	Junio C Hamano

On Thu, Feb 14, 2008 at 11:38:24PM +0000, Johannes Schindelin wrote:
> Heh.  I was too lazy to look up the usage, so I just did what I thought 
> would make sense...
> 
> So here it goes:
> 
> $ git verify-pack -v \
> objects/pack/pack-e4dc6da0a10888ec4345490575efc587b7523b45.pack | \
> ~/git/contrib/stats/packinfo.pl | tee packinfo.txt
>       all sizes: count 601473 total 2855826280 min 0 max 62173032 mean 4748.05 median 232 std_dev 221254.37
>  all path sizes: count 601473 total 2855826280 min 0 max 62173032 mean 4748.05 median 232 std_dev 221254.37
>      tree sizes: count 601473 total 2855826280 min 0 max 62173032 mean 4748.05 median 232 std_dev 221254.37
> tree path sizes: count 601473 total 2855826280 min 0 max 62173032 mean 4748.05 median 232 std_dev 221254.37
>          depths: count 2477715 total 70336238 min 0 max 250 mean 28.39 median 4 std_dev 55.49
> 
> Something in my gut tells me that those four repetitive lines are not 
> meant to look like they do...

Do you by chance have repack.usedeltabaseoffset turned on?  That has the
unfortunate side effect of changing the output of verify-pack -v to be
almost useless for my packinfo script (specifically, it no longer
reports the parent SHA-1 hash for deltas, and the script is basically all
about delta tree statistics).  I suppose that should probably be fixed,
but I never looked into it.
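
(If it is on, a sketch of one way to get analyzable output again - repack
with REF deltas and rerun the script; untested, and the commands assume a
bare repository:)

  git config repack.usedeltabaseoffset false
  git repack -a -d -f     # rewrites the pack using REF deltas only
  git verify-pack -v objects/pack/pack-*.idx | ~/git/contrib/stats/packinfo.pl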

-bcd

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-14 23:51                       ` Brian Downing
@ 2008-02-14 23:57                         ` Brian Downing
  2008-02-15  0:08                         ` Johannes Schindelin
  1 sibling, 0 replies; 85+ messages in thread
From: Brian Downing @ 2008-02-14 23:57 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Jakub Narebski, Brandon Casey, Nicolas Pitre, Jan Holesovsky, git,
	Junio C Hamano

On Thu, Feb 14, 2008 at 05:51:29PM -0600, Brian Downing wrote:
> Do you by chance have repack.usedeltabaseoffset turned on?  That has the
> unfortunate side effect of changing the output of verify-pack -v to be
> almost useless for my packinfo script (specifically, it no longer
> reports the parent SHA1 hash for deltas, and the script is basically all
> about deltra tree statistics.)  I suppose that should probably be fixed,
> but I never looked into it.

That being said, in my experience the most useful output for figuring out
where all the space in the pack is going comes from:

git-verify-pack -v <packfile>.idx | packinfo.pl -tree -filenames

That will produce a huge amount of output, which is basically the tree
structure of the delta chains in the file.  If things aren't being
deltified together properly, it's usually pretty obvious.

A delta chain in this output looks approximately like this:

   0   blob 03156f21...     1767     1767 Documentation/git-lost-found.txt @ tags/v1.2.0~142
   1    blob f52a9d7f...       10     1777 Documentation/git-lost-found.txt @ tags/v1.5.0-rc1~74
   2     blob a8cc5739...       51     1828 Documentation/git-lost+found.txt @ tags/v0.99.9h^0
   3      blob 660e90b1...       15     1843 Documentation/git-lost+found.txt @ master~3222^2~2
   4       blob 0cb8e3bb...       33     1876 Documentation/git-lost+found.txt @ master~3222^2~3
   2     blob e48607f0...      311     2088 Documentation/git-lost-found.txt @ tags/v1.5.2-rc3~4
      size: count 6 total 2187 min 10 max 1767 mean 364.50 median 51 std_dev 635.85
 path size: count 6 total 11179 min 1767 max 2088 mean 1863.17 median 1843 std_dev 107.26

The first number after the SHA-1 is the object size; the second number
is the path size.  The statistics are across all objects in the
previous delta tree.  Obviously they are omitted for trees of one
object.

A path size is the sum of the sizes along the delta chain, including
the base object.  In other words, it's how many bytes need to be read
to reassemble the file from deltas.
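
(To make the arithmetic concrete: in the chain above, the six object
sizes 1767+10+51+15+33+311 sum to the reported total of 2187, and the six
path sizes 1767+1777+1828+1843+1876+2088 sum to 11179.)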

This is also quite slow, as it runs git-ls-tree -t -r on every commit in
the repository to assign file names to blobs.  You can leave out the
-filenames option to not do this (if you don't care about seeing
filenames, that is).

-bcd

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-14 23:51                       ` Brian Downing
  2008-02-14 23:57                         ` Brian Downing
@ 2008-02-15  0:08                         ` Johannes Schindelin
  2008-02-15  1:41                           ` Nicolas Pitre
  1 sibling, 1 reply; 85+ messages in thread
From: Johannes Schindelin @ 2008-02-15  0:08 UTC (permalink / raw)
  To: Brian Downing
  Cc: Jakub Narebski, Brandon Casey, Nicolas Pitre, Jan Holesovsky, git,
	Junio C Hamano

Hi,

On Thu, 14 Feb 2008, Brian Downing wrote:

> On Thu, Feb 14, 2008 at 11:38:24PM +0000, Johannes Schindelin wrote:
> > Heh.  I was too lazy to look up the usage, so I just did what I 
> > thought would make sense...
> > 
> > So here it goes:
> > 
> > $ git verify-pack -v \
> > objects/pack/pack-e4dc6da0a10888ec4345490575efc587b7523b45.pack | \
> > ~/git/contrib/stats/packinfo.pl | tee packinfo.txt
> >       all sizes: count 601473 total 2855826280 min 0 max 62173032 mean 4748.05 median 232 std_dev 221254.37
> >  all path sizes: count 601473 total 2855826280 min 0 max 62173032 mean 4748.05 median 232 std_dev 221254.37
> >      tree sizes: count 601473 total 2855826280 min 0 max 62173032 mean 4748.05 median 232 std_dev 221254.37
> > tree path sizes: count 601473 total 2855826280 min 0 max 62173032 mean 4748.05 median 232 std_dev 221254.37
> >          depths: count 2477715 total 70336238 min 0 max 250 mean 28.39 median 4 std_dev 55.49
> > 
> > Something in my gut tells me that those four repetitive lines are not 
> > meant to look like they do...
> 
> Do you by chance have repack.usedeltabaseoffset turned on?

Ouch.  That must have been a leftover from earlier attempts.  I did not 
_mean_ to keep it, but now that I have a pretty packed repository, I think 
I'll just keep it as-is.

Ciao,
Dscho

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-14 23:38                     ` Johannes Schindelin
  2008-02-14 23:51                       ` Brian Downing
@ 2008-02-15  1:07                       ` Jakub Narebski
  1 sibling, 0 replies; 85+ messages in thread
From: Jakub Narebski @ 2008-02-15  1:07 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Brandon Casey, Nicolas Pitre, Jan Holesovsky, git, Junio C Hamano,
	Brian Downing

On Friday, 15 February 2008 at 00:38, Johannes Schindelin wrote:
> Hi,
> 
> On Thu, 14 Feb 2008, Jakub Narebski wrote:
>
>> I wonder if a proper subdivision into submodules (which should
>> encourage better code, by the way; see TAOUP) and perhaps
>> _partial checkouts_ wouldn't be a better solution than a _lazy clone_.
>> But it is nice to have a long-discussed feature, even if only at the
>> RFC stage, with some code.
> 
> I think partial checkouts are wrong.  If you can work on partial 
> checkouts, chances are that what you work on should be a submodule.
> 
> Having said that, I can understand if some people do not want to have
> the hassle of test^H^H^H^Husing submodules...

IMHO there is a place for submodules, there is a place for partial 
checkouts, and perhaps there is even a place for a combination of the two.

For example, Documentation/ isn't a good candidate for a submodule, 
because when you add a new feature you want to add documentation, and 
when you change some feature you want to change its documentation: there 
are whole-tree commits which contain changes outside Documentation/.
Nevertheless there are some people (technical writers) who are 
interested only in Documentation/, perhaps only in a few files there.
They would want to have a partial checkout, I guess.

On the other hand, cgit and msysgit use submodules, and I think it is a 
good solution.  I wonder if the Sourcemage Linux distro uses submodules... 
In the case of cgit I think having git.git or its clone/fork as a 
submodule is a good idea, but perhaps even better would be to check out 
only part of it: libgit or libgitthin.

-- 
Jakub Narebski
Poland

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-15  0:08                         ` Johannes Schindelin
@ 2008-02-15  1:41                           ` Nicolas Pitre
  2008-02-17  8:18                             ` Shawn O. Pearce
  0 siblings, 1 reply; 85+ messages in thread
From: Nicolas Pitre @ 2008-02-15  1:41 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Brian Downing, Jakub Narebski, Brandon Casey, Jan Holesovsky, git,
	Junio C Hamano

On Fri, 15 Feb 2008, Johannes Schindelin wrote:

> Hi,
> 
> On Thu, 14 Feb 2008, Brian Downing wrote:
> 
> > On Thu, Feb 14, 2008 at 11:38:24PM +0000, Johannes Schindelin wrote:
> > > Heh.  I was too lazy to look up the usage, so I just did what I 
> > > thought would make sense...
> > > 
> > > So here it goes:
> > > 
> > > $ git verify-pack -v \
> > > objects/pack/pack-e4dc6da0a10888ec4345490575efc587b7523b45.pack | \
> > > ~/git/contrib/stats/packinfo.pl | tee packinfo.txt
> > >       all sizes: count 601473 total 2855826280 min 0 max 62173032 mean 4748.05 median 232 std_dev 221254.37
> > >  all path sizes: count 601473 total 2855826280 min 0 max 62173032 mean 4748.05 median 232 std_dev 221254.37
> > >      tree sizes: count 601473 total 2855826280 min 0 max 62173032 mean 4748.05 median 232 std_dev 221254.37
> > > tree path sizes: count 601473 total 2855826280 min 0 max 62173032 mean 4748.05 median 232 std_dev 221254.37
> > >          depths: count 2477715 total 70336238 min 0 max 250 mean 28.39 median 4 std_dev 55.49
> > > 
> > > Something in my gut tells me that those four repetitive lines are not 
> > > meant to look like they do...
> > 
> > Do you by chance have repack.usedeltabaseoffset turned on?
> 
> Ouch.  That must have been a leftover from earlier attempts.  I did not 
> _mean_ to keep it, but now that I have a pretty packed repository, I think 
> I'll just keep it as-is.

I should really come around to fixing packed_object_info_detail() for 
the OBJ_OFS_DELTA case one day.


Nicolas

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-14 19:20             ` Johannes Schindelin
  2008-02-14 20:05               ` Jakub Narebski
@ 2008-02-15  9:34               ` Jan Holesovsky
  1 sibling, 0 replies; 85+ messages in thread
From: Jan Holesovsky @ 2008-02-15  9:34 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Nicolas Pitre, Jakub Narebski, git, Junio C Hamano

Hi Johannes,

On Thursday, 14 February 2008, Johannes Schindelin wrote:

> The result is:
>
> $ ls -la objects/pack/pack-e4dc6da0a10888ec4345490575efc587b7523b45.pack
> -rwxrwxrwx 1 root root 1638490531 2008-02-14 17:51
> objects/pack/pack-e4dc6da0a10888ec4345490575efc587b7523b45.pack
>
> 1.6G looks much better than 2.4G, wouldn't you say?  Jan, if you want it,
> please give me a place to upload it to.

Thank you!  In the meantime, I happened to produce something similar.  
Unfortunately even mine was too late for another round of tests to present 
it in our git vs. svn comparison (with today's deadline) - so we just 
mentioned in the report that the tested repository still had room for 
improvement [but the numbers were quite nice even with the 2.5G one ;-)].

> ll minimal3.git/objects/pack/
total 1636608
-r--r--r-- 1 kendy users   59264432 2008-02-10 15:22 pack-909b501d3d673f10a66adfefdf8371933e7a6f3e.idx
-r--r--r-- 1 kendy users 1614968445 2008-02-10 15:22 pack-909b501d3d673f10a66adfefdf8371933e7a6f3e.pack

> ll minimal4.git/objects/pack/
total 1644160
-r--r--r-- 1 kendy users   59264432 2008-02-11 16:09 pack-909b501d3d673f10a66adfefdf8371933e7a6f3e.idx
-rw-r--r-- 1 kendy users          0 2008-02-11 16:29 pack-909b501d3d673f10a66adfefdf8371933e7a6f3e.keep
-r--r--r-- 1 kendy users 1622697708 2008-02-11 16:09 pack-909b501d3d673f10a66adfefdf8371933e7a6f3e.pack

The 'minimal3' case was packed with '--window=250 --depth=250'; 
'minimal4' with '--window=250 --depth=50'.

I tried --depth=50 because the man page says 'making it too deep affects 
the performance on the unpacker side'.  How big could the difference be 
in practice, please?
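
(One rough, hypothetical way to measure it with the two packs above is to
time a full-history walk that inflates every tree in each repository:)

  time git --git-dir=minimal3.git log --raw --all >/dev/null   # --depth=250 pack
  time git --git-dir=minimal4.git log --raw --all >/dev/null   # --depth=50 pack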

Regards,
Jan

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-14 21:59                   ` Jakub Narebski
  2008-02-14 23:38                     ` Johannes Schindelin
@ 2008-02-15  9:43                     ` Jan Holesovsky
  1 sibling, 0 replies; 85+ messages in thread
From: Jan Holesovsky @ 2008-02-15  9:43 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Johannes Schindelin, Brandon Casey, Nicolas Pitre, git,
	Junio C Hamano, Brian Downing

Hi Jakub,

On Thursday, 14 February 2008, Jakub Narebski wrote:

> >> What is the size of checkout, by the way?
> >
> > 2.4G
>
> That's a huuuuge tree.  Compared to that, the 1.6G (or 1.4G) packfile
> doesn't look large.
>
> I wonder if a proper subdivision into submodules (which should encourage
> better code, by the way; see TAOUP) and perhaps partial checkouts
> wouldn't be a better solution than a lazy clone.  But it is nice to have
> a long-discussed feature, even if only at the RFC stage, with some code.

Yes, I'd love to see the OOo tree split into several parts; I've already 
proposed a division (http://www.nabble.com/OOo-source-split-td13096065.html), 
but I'm afraid it'll take some more time :-(

Regards,
Jan

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-15  1:41                           ` Nicolas Pitre
@ 2008-02-17  8:18                             ` Shawn O. Pearce
  2008-02-17  9:05                               ` Junio C Hamano
  2008-02-17 18:44                               ` Nicolas Pitre
  0 siblings, 2 replies; 85+ messages in thread
From: Shawn O. Pearce @ 2008-02-17  8:18 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Johannes Schindelin, Brian Downing, Jakub Narebski, Brandon Casey,
	Jan Holesovsky, git, Junio C Hamano

Nicolas Pitre <nico@cam.org> wrote:
> 
> I should really come around to fixing packed_object_info_detail() for 
> the OBJ_OFS_DELTA case one day.

Please don't.

Obtaining the SHA-1 of your delta base would require unpacking your
delta base and then doing a SHA-1 hash of it.  Or alternatively
doing a search through the .idx for the object that starts at the
requested offset.  Either way, it's really expensive for a minor detail
of output in verify-pack.  Something that any script can produce
with a simple reverse lookup table.
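
(For example - a sketch, with an illustrative pack name - git show-index
reads an .idx from stdin and prints one '<offset> <sha1>' line per
object, which is exactly such a table:)

  git show-index \
          < objects/pack/pack-e4dc6da0a10888ec4345490575efc587b7523b45.idx |
          sort -n > offset-to-sha1.txt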

It's also run after we just spent a hell of a lot of time and disk
IO trying to verify the packfile.  We slammed through the pack
once to do its overall SHA-1, and then god knows how many times as
we iterate the objects in pack order, not delta base order, thus
causing the delta base cache to become overwhelmed and constantly
fault out entries.  Pack verification is stupid and slow.  This
would make -v even worse.


But if you are going to do that, you may also want to fix the
"*store_size = 0 /* notyet */" that's like 5 lines above.  :)


BTW, why does this return const char* from typename(type) instead
of just returning the enum object_type and letting the caller do
typename() if they want it?  Most of our other code that returns
types returns the enum, not the string.  :-\

-- 
Shawn.

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-17  8:18                             ` Shawn O. Pearce
@ 2008-02-17  9:05                               ` Junio C Hamano
  2008-02-17 18:44                               ` Nicolas Pitre
  1 sibling, 0 replies; 85+ messages in thread
From: Junio C Hamano @ 2008-02-17  9:05 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: Nicolas Pitre, Johannes Schindelin, Brian Downing, Jakub Narebski,
	Brandon Casey, Jan Holesovsky, git

"Shawn O. Pearce" <spearce@spearce.org> writes:

> BTW, why does this return const char* from typename(type) instead
> of just returning the enum object_type and letting the caller do
> typename() if they want it?  Most of our other code that returns
> types returns the enum, not the string.  :-\

It just was not converted from the old string interface.  I
thought you were old enough to remember ;-)

* Re: [PATCH] RFC: git lazy clone proof-of-concept
  2008-02-17  8:18                             ` Shawn O. Pearce
  2008-02-17  9:05                               ` Junio C Hamano
@ 2008-02-17 18:44                               ` Nicolas Pitre
  1 sibling, 0 replies; 85+ messages in thread
From: Nicolas Pitre @ 2008-02-17 18:44 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: Johannes Schindelin, Brian Downing, Jakub Narebski, Brandon Casey,
	Jan Holesovsky, git, Junio C Hamano

On Sun, 17 Feb 2008, Shawn O. Pearce wrote:

> Nicolas Pitre <nico@cam.org> wrote:
> > 
> > I should really come around to fixing packed_object_info_detail() for 
> > the OBJ_OFS_DELTA case one day.
> 
> Please don't.
> 
> Obtaining the SHA-1 of your delta base would require unpacking your
> delta base and then doing a SHA-1 hash of it.  Or alternatively
> doing a search through the .idx for the object that starts at the
> requested offset.

I intended to use the pack index of course.  And the code already exists 
in pack-objects as find_packed_object().

> Either way, it's really expensive for a minor detail
> of output in verify-pack.

Not _that_ expensive, actually.  Like I said, in pack-objects we do it 
all the time.

> But if you are going to do that, you may also want to fix the
> "*store_size = 0 /* notyet */" that's like 5 lines above.  :)

Yeah, that's easy too.


Nicolas

end of thread

Thread overview: 85+ messages
2008-02-08 17:28 [PATCH] RFC: git lazy clone proof-of-concept Jan Holesovsky
2008-02-08 18:03 ` Nicolas Pitre
2008-02-09 14:25   ` Jan Holesovsky
2008-02-09 22:05     ` Mike Hommey
2008-02-09 23:38       ` Nicolas Pitre
2008-02-10  7:23     ` Marco Costalba
2008-02-10 12:08       ` Johannes Schindelin
2008-02-10 16:46         ` David Symonds
2008-02-10 17:45           ` Johannes Schindelin
2008-02-10 19:45             ` Nicolas Pitre
2008-02-10 20:32               ` Johannes Schindelin
2008-02-08 18:14 ` Harvey Harrison
2008-02-09 14:27   ` Jan Holesovsky
2008-02-08 18:20 ` Johannes Schindelin
2008-02-08 18:49 ` Mike Hommey
2008-02-08 19:04   ` Johannes Schindelin
2008-02-09 15:06   ` Jan Holesovsky
2008-02-08 19:00 ` Jakub Narebski
2008-02-08 19:26   ` Jon Smirl
2008-02-08 20:09     ` Nicolas Pitre
2008-02-11 10:13       ` Andreas Ericsson
2008-02-12  2:55         ` [PATCH 1/2] pack-objects: Allow setting the #threads equal to #cpus automatically Brandon Casey
2008-02-12  5:53           ` Andreas Ericsson
     [not found]         ` <1202784078-23700-1-git-send-email-casey@nrlssc.navy.mil>
2008-02-12  2:59           ` [PATCH 2/2] pack-objects: Default to zero threads, meaning auto-assign to #cpus Brandon Casey
2008-02-12  4:57             ` Nicolas Pitre
2008-02-08 20:19     ` [PATCH] RFC: git lazy clone proof-of-concept Harvey Harrison
2008-02-08 20:24       ` Jon Smirl
2008-02-08 20:25         ` Harvey Harrison
2008-02-08 20:41           ` Jon Smirl
2008-02-09 15:27   ` Jan Holesovsky
2008-02-10  3:10     ` Nicolas Pitre
2008-02-10  4:59       ` Sean
2008-02-10  5:22         ` Nicolas Pitre
2008-02-10  5:35           ` Sean
2008-02-11  1:42             ` Jakub Narebski
2008-02-11  2:04               ` Nicolas Pitre
2008-02-11 10:11                 ` Jakub Narebski
2008-02-10  9:34         ` Joachim B Haga
2008-02-10 16:43       ` Johannes Schindelin
2008-02-10 17:01         ` Jon Smirl
2008-02-10 17:36           ` Johannes Schindelin
2008-02-10 18:47         ` Johannes Schindelin
2008-02-10 19:42           ` Nicolas Pitre
2008-02-10 20:11             ` Jon Smirl
2008-02-12 20:37           ` Johannes Schindelin
2008-02-12 21:05             ` Nicolas Pitre
2008-02-12 21:08             ` Linus Torvalds
2008-02-12 21:36               ` Jon Smirl
2008-02-12 21:59                 ` Linus Torvalds
2008-02-12 22:25                   ` Linus Torvalds
2008-02-12 22:43                     ` Jon Smirl
2008-02-12 23:39                       ` Linus Torvalds
2008-02-12 21:25             ` Jon Smirl
2008-02-14 19:20             ` Johannes Schindelin
2008-02-14 20:05               ` Jakub Narebski
2008-02-14 20:16                 ` Nicolas Pitre
2008-02-14 21:04                 ` Johannes Schindelin
2008-02-14 21:59                   ` Jakub Narebski
2008-02-14 23:38                     ` Johannes Schindelin
2008-02-14 23:51                       ` Brian Downing
2008-02-14 23:57                         ` Brian Downing
2008-02-15  0:08                         ` Johannes Schindelin
2008-02-15  1:41                           ` Nicolas Pitre
2008-02-17  8:18                             ` Shawn O. Pearce
2008-02-17  9:05                               ` Junio C Hamano
2008-02-17 18:44                               ` Nicolas Pitre
2008-02-15  1:07                       ` Jakub Narebski
2008-02-15  9:43                     ` Jan Holesovsky
2008-02-14 21:08                 ` Brandon Casey
2008-02-15  9:34               ` Jan Holesovsky
2008-02-10 19:50         ` Nicolas Pitre
2008-02-14 19:41           ` Brandon Casey
2008-02-14 19:58             ` Johannes Schindelin
2008-02-14 20:11             ` Nicolas Pitre
2008-02-11  1:20     ` Jakub Narebski
2008-02-08 20:16 ` Johannes Schindelin
2008-02-08 21:35   ` Jakub Narebski
2008-02-08 21:52     ` Johannes Schindelin
2008-02-08 22:03       ` Mike Hommey
2008-02-08 22:34         ` Johannes Schindelin
2008-02-08 22:50           ` Mike Hommey
2008-02-08 23:14             ` Johannes Schindelin
2008-02-08 23:38               ` Mike Hommey
2008-02-09 21:20                 ` Jan Hudec
2008-02-09 15:54       ` Jan Holesovsky
