Git development
* Re: Figured out how to get Mozilla into git
From: Lars Johannsen @ 2006-06-10 18:55 UTC
  To: Jon Smirl; +Cc: git
In-Reply-To: <9e4733910606100844v5f4765d8o85c9a6f239faed43@mail.gmail.com>

On (10/06/06 11:44), Jon Smirl wrote:
> Date:	Sat, 10 Jun 2006 11:44:58 -0400
> From:	"Jon Smirl" <jonsmirl@gmail.com>
> To:	"Junio C Hamano" <junkio@cox.net>
> Subject: Re: Figured out how to get Mozilla into git
> Cc:	git@vger.kernel.org
> 
> On 6/10/06, Junio C Hamano <junkio@cox.net> wrote:
> >"Jon Smirl" <jonsmirl@gmail.com> writes:
> >
> >> Here's a new transport problem. When using git-clone to fetch Martin's
> >> tree it kept failing for me at dreamhost. I had a parallel fetch
> >> running on my local machine which has a much slower net connection. It
> >> finally finished and I am watching the end phase where it prints all
> >> of the 'walk' messages. The git-http-fetch process has jumped up to
> >> 800MB in size after being 2MB during the download. dreamhost has a
> >> 500MB process size limit so that is why my fetches kept failing there.
> >
> >The http-fetch process works by mmapping the downloaded pack, and
> >if I recall correctly we are talking about 600MB pack, so 500MB
> >limit sounds impossible, perhaps?
> 
> The fetch on my local machine failed too. It left nothing behind, now
> I have to download the 680MB again.
> 
> walk 1f19465388a4ef7aff7527a13f16122a809487d4
> walk c3ca840256e3767d08c649f8d2761a1a887351ab
> walk 7a74e42699320c02b814b88beadb1ae65009e745
> error: Couldn't get
> http://mirrors.catalyst.net.nz/pub/mozilla.git//refs/tags/JS%5F1%5F7%5FALPHA%5FBASE
> for tags/JS_1_7_ALPHA_BASE
> Couldn't resolve host 'mirrors.catalyst.net.nz'
> error: Could not interpret tags/JS_1_7_ALPHA_BASE as something to pull
> [jonsmirl@jonsmirl mozgit]$ cg update
> There is no GIT repository here (.git not found)
> [jonsmirl@jonsmirl mozgit]$ ls -a
> .  ..
> [jonsmirl@jonsmirl mozgit]$

To avoid repeating the download (for this repo) you could grab it with a browser:
- mkdir tmp; cd tmp; git init-db;
- copy mirror../pu/mozilla.git/objects/* to .git/objects/
- copy mirror../pu/mozilla.git/info/refs to refsinfo in the tmp dir
- run
  gawk '{if  ($2 !~ /\^\{\}$/) print $1 > sprintf(".git/%s",$2);}' refsinfo
  to extract branches and tags into .git/refs/{heads,tags}
- start playing (after a backup) with git-fsck-objects, git-checkout etc.
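
The ref-extraction step above can also be sketched in C. This is only an illustration, assuming each info/refs line has the form "<sha1><TAB><refname>" and that peeled tag entries end in "^{}"; the names parse_info_refs_line and write_ref_file are made up for this example and are not part of git.

```c
#include <stdio.h>
#include <string.h>

#define SHA1_HEX_LEN 40

/*
 * Parse one "<sha1>\t<refname>" line as found in info/refs (assumed format).
 * Returns 1 and fills sha1/ref when the line names a ref to write,
 * 0 when it is a peeled "^{}" entry to skip, -1 when it is malformed.
 */
int parse_info_refs_line(const char *line, char sha1[SHA1_HEX_LEN + 1],
			 char *ref, size_t ref_size)
{
	const char *name;
	size_t len;

	if (strlen(line) < SHA1_HEX_LEN + 2 || line[SHA1_HEX_LEN] != '\t')
		return -1;
	if (strspn(line, "0123456789abcdef") != SHA1_HEX_LEN)
		return -1;
	name = line + SHA1_HEX_LEN + 1;
	len = strcspn(name, "\r\n");	/* ref name without the line ending */
	if (len == 0 || len >= ref_size)
		return -1;
	if (len >= 3 && !strncmp(name + len - 3, "^{}", 3))
		return 0;
	memcpy(sha1, line, SHA1_HEX_LEN);
	sha1[SHA1_HEX_LEN] = '\0';
	memcpy(ref, name, len);
	ref[len] = '\0';
	return 1;
}

/*
 * Write "<sha1>\n" to <git_dir>/<ref>.  Like the gawk one-liner, this
 * does not create missing parent directories.
 */
int write_ref_file(const char *git_dir, const char *ref, const char *sha1)
{
	char path[4096];
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path), "%s/%s", git_dir, ref);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%s\n", sha1);
	return fclose(f) ? -1 : 0;
}
```

A caller would read refsinfo line by line with fgets() and call write_ref_file(".git", ref, sha1) whenever the parser returns 1.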
 
-- 
Lars Johannsen 
mail@Lars-johannsen.dk


* Re: Figured out how to get Mozilla into git
From: Petr Baudis @ 2006-06-10 18:37 UTC
  To: Jon Smirl; +Cc: Junio C Hamano, git
In-Reply-To: <9e4733910606100844v5f4765d8o85c9a6f239faed43@mail.gmail.com>

Dear diary, on Sat, Jun 10, 2006 at 05:44:58PM CEST, I got a letter
where Jon Smirl <jonsmirl@gmail.com> said that...
> The fetch on my local machine failed too. It left nothing behind, now
> I have to download the 680MB again.
> 
> walk 1f19465388a4ef7aff7527a13f16122a809487d4
> walk c3ca840256e3767d08c649f8d2761a1a887351ab
> walk 7a74e42699320c02b814b88beadb1ae65009e745
> error: Couldn't get
> http://mirrors.catalyst.net.nz/pub/mozilla.git//refs/tags/JS%5F1%5F7%5FALPHA%5FBASE
> for tags/JS_1_7_ALPHA_BASE
> Couldn't resolve host 'mirrors.catalyst.net.nz'
> error: Could not interpret tags/JS_1_7_ALPHA_BASE as something to pull
> [jonsmirl@jonsmirl mozgit]$ cg update
> There is no GIT repository here (.git not found)
> [jonsmirl@jonsmirl mozgit]$ ls -a
> .  ..
> [jonsmirl@jonsmirl mozgit]$

  You could try with cg-clone, which won't delete the repository if
things fail. It will clone only the master branch, though.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
A person is just about as big as the things that make them angry.


* Re: Figured out how to get Mozilla into git
From: Rogan Dawes @ 2006-06-10 18:36 UTC
  To: Linus Torvalds; +Cc: Jon Smirl, Martin Langhoff, git
In-Reply-To: <Pine.LNX.4.64.0606101041490.5498@g5.osdl.org>

Linus Torvalds wrote:
> 
> On Sat, 10 Jun 2006, Rogan Dawes wrote:
>> Here's an idea. How about separating trees and commits from the actual blobs
>> (e.g. in separate packs)? My reasoning is that the commits and trees should
>> only be a small portion of the overall repository size, and should not be that
>> expensive to transfer. (Of course, this is only a guess, and needs some
>> numbers to back it up.)
> 
> The trees in particular are actually a pretty big part of the history. 
> 
> More importantly, the blobs compress horribly badly in the absence of 
> history - a _lot_ of the compression in git packing comes very much from 
> the fact that we do a good job at delta-compression.
> 
> So if you get all of the commit/tree history, but none of the blob 
> history, you're actually not going to win that much space. As already 
> discussed, the _whole_ history packed with git is usually not insanely 
> bigger than just the whole unpacked tree (with no history at all).
> 
> So you'd think that getting just the top version of the tree would be a 
> much bigger space-saving than it actually is. If you _also_ get all the 
> tree and commit objects, the space saving is even less.
> 

One possibility, given that the full commit and tree history is so
large, is simply to get the HEAD commit and the trees that the commit
depends directly on, rather than fetching them all up front.

> I actually suspect that the most realistic way to handle this is to use 
> the "fetch.c" logic (ie the incremental fetcher used by http), and add 
> some mode to the git daemon where you fetch literally one object at a time 
> (ie this would be totally _separate_ from the pack-file thing: you'd not 
> ask for "git-upload-pack", you'd ask for something like 
> "git-serve-objects" instead).
> 
> The fetch.c logic really does allow for on-demand object fetching, and is 
> thus much more suitable for incomplete repositories.
> 
> HOWEVER. The fetch.c logic - by necessity - works on an object-by-object 
> level. That means that you'd get no delta compression AT ALL, and I 
> suspect that the downside of that would be a factor of ten expansion or 
> more, which means that it would really not work that well in practice.

Would it be possible to add a mode where fetch.c is given a list of 
desired objects, and returns a list of pointers to those objects? Then 
callers that already have such a list could be modified to pass the 
whole list at once, allowing at least SOME compression, and optimisation 
of round trips, etc? There would be a tradeoff in memory use, though, I 
guess.

Rogan


* Re: Figured out how to get Mozilla into git
From: Jon Smirl @ 2006-06-10 18:02 UTC
  To: Linus Torvalds; +Cc: Rogan Dawes, Martin Langhoff, git
In-Reply-To: <Pine.LNX.4.64.0606101041490.5498@g5.osdl.org>

Here's a random idea, how about a tool that turns a real pack into one
that is segmented and then faults in segments if you do an operation
that needs the old segments? The full pack would always look like it
is there even if it isn't. Something like gitk would be modified not
to fault in the missing segments.

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Figured out how to get Mozilla into git
From: Linus Torvalds @ 2006-06-10 17:53 UTC
  To: Rogan Dawes; +Cc: Jon Smirl, Martin Langhoff, git
In-Reply-To: <448A847C.20105@dawes.za.net>



On Sat, 10 Jun 2006, Rogan Dawes wrote:
>
> Here's an idea. How about separating trees and commits from the actual blobs
> (e.g. in separate packs)? My reasoning is that the commits and trees should
> only be a small portion of the overall repository size, and should not be that
> expensive to transfer. (Of course, this is only a guess, and needs some
> numbers to back it up.)

The trees in particular are actually a pretty big part of the history. 

More importantly, the blobs compress horribly badly in the absence of 
history - a _lot_ of the compression in git packing comes very much from 
the fact that we do a good job at delta-compression.

So if you get all of the commit/tree history, but none of the blob 
history, you're actually not going to win that much space. As already 
discussed, the _whole_ history packed with git is usually not insanely 
bigger than just the whole unpacked tree (with no history at all).

So you'd think that getting just the top version of the tree would be a 
much bigger space-saving than it actually is. If you _also_ get all the 
tree and commit objects, the space saving is even less.

I actually suspect that the most realistic way to handle this is to use 
the "fetch.c" logic (ie the incremental fetcher used by http), and add 
some mode to the git daemon where you fetch literally one object at a time 
(ie this would be totally _separate_ from the pack-file thing: you'd not 
ask for "git-upload-pack", you'd ask for something like 
"git-serve-objects" instead).

The fetch.c logic really does allow for on-demand object fetching, and is 
thus much more suitable for incomplete repositories.

HOWEVER. The fetch.c logic - by necessity - works on an object-by-object 
level. That means that you'd get no delta compression AT ALL, and I 
suspect that the downside of that would be a factor of ten expansion or 
more, which means that it would really not work that well in practice.

It might be worth testing, though. It would work fine for the "after I 
have the initial cauterized tree, fetch small incremental updates" case. 
The operative word here being "small" and "incremental", because I'm 
pretty sure it really would suck for the case of a big fetch.

But it would be _simple_, which is why it's worth trying out. It also has 
the advantage that it would solve the "I had data corruption on my disk, 
and lost 100 objects, but all the rest is fine" issue. Again, that's 
not something that the efficient packing protocol handles, exactly because 
it assumes full history, and uses that to do all its optimizations.

		Linus


* Re: Figured out how to get Mozilla into git
From: Timo Hirvonen @ 2006-06-10 16:15 UTC
  To: Jon Smirl; +Cc: junkio, git
In-Reply-To: <9e4733910606100844v5f4765d8o85c9a6f239faed43@mail.gmail.com>

"Jon Smirl" <jonsmirl@gmail.com> wrote:

> On 6/10/06, Junio C Hamano <junkio@cox.net> wrote:
> > "Jon Smirl" <jonsmirl@gmail.com> writes:
> >
> > > Here's a new transport problem. When using git-clone to fetch Martin's
> > > tree it kept failing for me at dreamhost. I had a parallel fetch
> > > running on my local machine which has a much slower net connection. It
> > > finally finished and I am watching the end phase where it prints all
> > > of the 'walk' messages. The git-http-fetch process has jumped up to
> > > 800MB in size after being 2MB during the download. dreamhost has a
> > > 500MB process size limit so that is why my fetches kept failing there.
> >
> > The http-fetch process works by mmapping the downloaded pack, and
> > if I recall correctly we are talking about 600MB pack, so 500MB
> > limit sounds impossible, perhaps?
> 
> The fetch on my local machine failed too. It left nothing behind, now
> I have to download the 680MB again.

That's sad.  Could git-clone be changed to not remove the .git directory if
fetching objects fails (after other files in the .git directory have
been fetched)?  You could then hopefully continue with git-pull.

-- 
http://onion.dynserv.net/~timo/


* Re: Figured out how to get Mozilla into git
From: Jon Smirl @ 2006-06-10 15:44 UTC
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vr71xk047.fsf@assigned-by-dhcp.cox.net>

On 6/10/06, Junio C Hamano <junkio@cox.net> wrote:
> "Jon Smirl" <jonsmirl@gmail.com> writes:
>
> > Here's a new transport problem. When using git-clone to fetch Martin's
> > tree it kept failing for me at dreamhost. I had a parallel fetch
> > running on my local machine which has a much slower net connection. It
> > finally finished and I am watching the end phase where it prints all
> > of the 'walk' messages. The git-http-fetch process has jumped up to
> > 800MB in size after being 2MB during the download. dreamhost has a
> > 500MB process size limit so that is why my fetches kept failing there.
>
> The http-fetch process works by mmapping the downloaded pack, and
> if I recall correctly we are talking about 600MB pack, so 500MB
> limit sounds impossible, perhaps?

The fetch on my local machine failed too. It left nothing behind, now
I have to download the 680MB again.

walk 1f19465388a4ef7aff7527a13f16122a809487d4
walk c3ca840256e3767d08c649f8d2761a1a887351ab
walk 7a74e42699320c02b814b88beadb1ae65009e745
error: Couldn't get
http://mirrors.catalyst.net.nz/pub/mozilla.git//refs/tags/JS%5F1%5F7%5FALPHA%5FBASE
for tags/JS_1_7_ALPHA_BASE
Couldn't resolve host 'mirrors.catalyst.net.nz'
error: Could not interpret tags/JS_1_7_ALPHA_BASE as something to pull
[jonsmirl@jonsmirl mozgit]$ cg update
There is no GIT repository here (.git not found)
[jonsmirl@jonsmirl mozgit]$ ls -a
.  ..
[jonsmirl@jonsmirl mozgit]$




-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Figured out how to get Mozilla into git
From: Nicolas Pitre @ 2006-06-10 15:14 UTC
  To: Rogan Dawes; +Cc: Junio C Hamano, git
In-Reply-To: <448ADB8A.3070506@dawes.za.net>

On Sat, 10 Jun 2006, Rogan Dawes wrote:

> Out of curiosity, do you think that it may be possible for tree objects to
> compress more/better if they are packed together? Or does the existing pack
> compression logic already do the diff against similar tree objects?

Tree objects for the same directories are already packed and deltified 
against each other in a pack.


Nicolas


* Re: Figured out how to get Mozilla into git
From: Jakub Narebski @ 2006-06-10 14:58 UTC
  To: git
In-Reply-To: <448ADB8A.3070506@dawes.za.net>

Rogan Dawes wrote:

> Junio C Hamano wrote:
>> Rogan Dawes <lists@dawes.za.net> writes:
>> 
>>> Here's an idea. How about separating trees and commits from the actual
>>> blobs (e.g. in separate packs)?
>> 
>> If I remember my numbers correctly, trees for any project with a
>> size that matters contribute a nonnegligible amount of the total
>> pack weight.  Perhaps 10-25%.
> 
> Out of curiosity, do you think that it may be possible for tree objects 
> to compress more/better if they are packed together? Or does the 
> existing pack compression logic already do the diff against similar tree 
> objects?

The problem with compressing and deltifying trees is with sha1 object
identifiers, I guess.

-- 
Jakub Narebski
Warsaw, Poland


* Re: Figured out how to get Mozilla into git
From: Rogan Dawes @ 2006-06-10 14:47 UTC
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vzmglgyz0.fsf@assigned-by-dhcp.cox.net>

Junio C Hamano wrote:
> Rogan Dawes <lists@dawes.za.net> writes:
> 
>> Here's an idea. How about separating trees and commits from the actual
>> blobs (e.g. in separate packs)?
> 
> If I remember my numbers correctly, trees for any project with a
> size that matters contribute a nonnegligible amount of the total
> pack weight.  Perhaps 10-25%.

Out of curiosity, do you think that it may be possible for tree objects 
to compress more/better if they are packed together? Or does the 
existing pack compression logic already do the diff against similar tree 
objects?

>> In this way, the user has a history that will show all of the commit
>> messages, and would be able to see _which_ files have changed over
>> time e.g. gitk would still work - except for the actual file level
>> diff, "git log" should also still work, etc
> 
> I suspect it would make a very unpleasant system to use.
> Sometimes "git diff -p" would show diffs, and other times it
> mysteriously complains, saying that it lacks necessary blobs to do
> its job.  You cannot even run fsck and tell from its output
> which missing objects are OK (because you chose to create such a
> sparse repository) and which are real corruption.

The fsck problem could be worked around by maintaining a list of objects 
that are explicitly not expected to be present. The list would shrink as 
missing objects are fetched (perhaps as diffs are performed, other parts 
of the blob history are retrieved, etc.) until we have a complete clone 
of the original tree.
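
As a rough sketch of that idea (not an existing git feature), the check fsck would need is just a lookup of each missing object in the "expected to be absent" list. The function names and the plain array of hex object ids are assumptions made for this illustration only.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if the object id is on the list of objects the lazy clone
 * deliberately left out, 0 otherwise.  A linear scan keeps the sketch
 * short; a real list would want a sorted array or a hash.
 */
int is_expected_missing(const char *const *expected, size_t n_expected,
			const char *sha1)
{
	size_t i;

	for (i = 0; i < n_expected; i++)
		if (!strcmp(expected[i], sha1))
			return 1;
	return 0;
}

/*
 * Count the missing objects that are NOT on the expected list.
 * Anything counted here would be reported as real corruption.
 */
size_t count_unexpected_missing(const char *const *expected, size_t n_expected,
				const char *const *missing, size_t n_missing)
{
	size_t i, count = 0;

	for (i = 0; i < n_missing; i++)
		if (!is_expected_missing(expected, n_expected, missing[i]))
			count++;
	return count;
}
```

With such a list, an fsck-like tool could stay quiet about objects that were never fetched and still complain about everything else.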

Of course diffs against a version further back in the history would 
fail. But if you start with a checkout of a complete tree, any changes 
made since that point would at least have one version to compare against.

In effect, what we would have is a caching repository (or as Jakub said, 
a lazy clone). An initial checkout would effectively be pre-seeding the 
cache. One does not necessarily even need to get the complete set of 
commit and tree objects, either. The bare minimum would probably be to 
get the HEAD commit, and the tree objects that correspond to that commit.

At that point, one could populate the "uncached objects" list with the 
parent commits. One would not be in a position to get any history at 
all, of course.

As the user performs various operations, e.g. git log, git could either 
go and fetch the necessary objects (updating the uncached list as it 
goes), or fail with a message such as "Cannot perform the requested 
operation - required objects are not available". (We may require another 
utility that would list the objects required for an operation, and 
compare it against the list of "uncached objects", printing out a list 
of which are not yet available locally. I realise that this may be 
expensive. Maybe a repo configuration option "cached" to enable or 
disable this.)

As Jakub suggested, it would be necessary to configure the location of 
the source for any missing objects, but that is probably in the repo 
config anyway.

> A shallow clone with explicit cauterization in grafts file at
> least would not have that problem. Although the user will still
> not see the exact same result as what would happen in a full
> repository, at least we can say "your git log ends at that
> commit because your copy of the history does not go back beyond
> that" and the user would understand.

Or, we could say, perform the operation while you are online, and can 
access the necessary objects. If the user has explicitly chosen to make 
a lazy clone, then they should expect that at some point, whatever they 
do may require them to be online to access items that they have not yet 
cloned.

Rogan


* [PATCH] Built-in git-get-tar-commit-id (was: [PATCH/RFC] Retire SIMPLE_*** stuff.)
From: Rene Scharfe @ 2006-06-10 14:13 UTC
  To: Junio C Hamano; +Cc: Linus Torvalds, git
In-Reply-To: <7v3bedn8ym.fsf_-_@assigned-by-dhcp.cox.net>

By being an internal command git-get-tar-commit-id can make use of
struct ustar_header and other stuff and stops wasting precious
disk space.

Note: I recycled one of the two "tar-tree" entries instead of
splitting that cleanup into a separate patch.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>

diff --git a/Makefile b/Makefile
index 5226fa1..2a1e639 100644
--- a/Makefile
+++ b/Makefile
@@ -142,11 +142,11 @@ SCRIPTS = $(patsubst %.sh,%,$(SCRIPT_SH)
 	  $(patsubst %.py,%,$(SCRIPT_PYTHON)) \
 	  git-cherry-pick git-status
 
 # The ones that do not have to link with lcrypto, lz nor xdiff.
 SIMPLE_PROGRAMS = \
-	git-get-tar-commit-id$X git-mailsplit$X \
+	git-mailsplit$X \
 	git-stripspace$X git-daemon$X
 
 # ... and all the rest that could be moved out of bindir to gitexecdir
 PROGRAMS = \
 	git-checkout-index$X git-clone-pack$X \
@@ -167,11 +167,11 @@ PROGRAMS = \
 BUILT_INS = git-log$X git-whatchanged$X git-show$X \
 	git-count-objects$X git-diff$X git-push$X \
 	git-grep$X git-add$X git-rm$X git-rev-list$X \
 	git-check-ref-format$X git-rev-parse$X \
 	git-init-db$X git-tar-tree$X git-upload-tar$X git-format-patch$X \
-	git-ls-files$X git-ls-tree$X \
+	git-ls-files$X git-ls-tree$X git-get-tar-commit-id$X \
 	git-read-tree$X git-commit-tree$X \
 	git-apply$X git-show-branch$X git-diff-files$X \
 	git-diff-index$X git-diff-stages$X git-diff-tree$X git-cat-file$X
 
 # what 'all' will build and 'install' will install, in gitexecdir
diff --git a/builtin-tar-tree.c b/builtin-tar-tree.c
index 7663b9b..58a8ccd 100644
--- a/builtin-tar-tree.c
+++ b/builtin-tar-tree.c
@@ -400,5 +400,30 @@ int cmd_tar_tree(int argc, const char **
 		usage(tar_tree_usage);
 	if (!strncmp("--remote=", argv[1], 9))
 		return remote_tar(argc, argv);
 	return generate_tar(argc, argv, envp);
 }
+
+/* ustar header + extended global header content */
+#define HEADERSIZE (2 * RECORDSIZE)
+
+int cmd_get_tar_commit_id(int argc, const char **argv, char **envp)
+{
+	char buffer[HEADERSIZE];
+	struct ustar_header *header = (struct ustar_header *)buffer;
+	char *content = buffer + RECORDSIZE;
+	ssize_t n;
+
+	n = xread(0, buffer, HEADERSIZE);
+	if (n < HEADERSIZE)
+		die("git-get-tar-commit-id: read error");
+	if (header->typeflag[0] != 'g')
+		return 1;
+	if (memcmp(content, "52 comment=", 11))
+		return 1;
+
+	n = xwrite(1, content + 11, 41);
+	if (n < 41)
+		die("git-get-tar-commit-id: write error");
+
+	return 0;
+}
diff --git a/builtin.h b/builtin.h
index ffa9340..b9f36be 100644
--- a/builtin.h
+++ b/builtin.h
@@ -30,10 +30,11 @@ extern int cmd_add(int argc, const char 
 extern int cmd_rev_list(int argc, const char **argv, char **envp);
 extern int cmd_check_ref_format(int argc, const char **argv, char **envp);
 extern int cmd_init_db(int argc, const char **argv, char **envp);
 extern int cmd_tar_tree(int argc, const char **argv, char **envp);
 extern int cmd_upload_tar(int argc, const char **argv, char **envp);
+extern int cmd_get_tar_commit_id(int argc, const char **argv, char **envp);
 extern int cmd_ls_files(int argc, const char **argv, char **envp);
 extern int cmd_ls_tree(int argc, const char **argv, char **envp);
 extern int cmd_read_tree(int argc, const char **argv, char **envp);
 extern int cmd_commit_tree(int argc, const char **argv, char **envp);
 extern int cmd_apply(int argc, const char **argv, char **envp);
diff --git a/get-tar-commit-id.c b/get-tar-commit-id.c
deleted file mode 100644
index 4166290..0000000
--- a/get-tar-commit-id.c
+++ /dev/null
@@ -1,30 +0,0 @@
-/*
- * Copyright (C) 2005 Rene Scharfe
- */
-#include <stdio.h>
-#include <string.h>
-#include <unistd.h>
-
-#define HEADERSIZE	1024
-
-int main(int argc, char **argv)
-{
-	char buffer[HEADERSIZE];
-	ssize_t n;
-
-	n = read(0, buffer, HEADERSIZE);
-	if (n < HEADERSIZE) {
-		fprintf(stderr, "read error\n");
-		return 3;
-	}
-	if (buffer[156] != 'g')
-		return 1;
-	if (memcmp(&buffer[512], "52 comment=", 11))
-		return 1;
-	n = write(1, &buffer[523], 41);
-	if (n < 41) {
-		fprintf(stderr, "write error\n");
-		return 2;
-	}
-	return 0;
-}
diff --git a/git.c b/git.c
index 6db8f2b..9469d44 100644
--- a/git.c
+++ b/git.c
@@ -161,11 +161,11 @@ static void handle_internal_command(int 
 		{ "grep", cmd_grep },
 		{ "rm", cmd_rm },
 		{ "add", cmd_add },
 		{ "rev-list", cmd_rev_list },
 		{ "init-db", cmd_init_db },
-		{ "tar-tree", cmd_tar_tree },
+		{ "get-tar-commit-id", cmd_get_tar_commit_id },
 		{ "upload-tar", cmd_upload_tar },
 		{ "check-ref-format", cmd_check_ref_format },
 		{ "ls-files", cmd_ls_files },
 		{ "ls-tree", cmd_ls_tree },
 		{ "tar-tree", cmd_tar_tree },


* Re: gitk on Windows: layout problem
From: Rutger Nijlunsing @ 2006-06-10 11:13 UTC
  To: Paul Mackerras; +Cc: git, git
In-Reply-To: <17537.22986.653849.367731@cargo.ozlabs.ibm.com>

On Sat, Jun 03, 2006 at 07:43:38PM +1000, Paul Mackerras wrote:
> Rutger Nijlunsing writes:
> 
> > Is this a known problem? gitk-du-jour on Windows starts up with an
> > unusable layout. Screenshot attached.
> 
> Is that using Tk with the cygwin X server, or the native Windows Tk
> port?

I installed the default cygwin version but I don't have to start an X
server for it. So while it's not the native Windows Tk port, it also
doesn't seem to be the X-server version.

-- 
Rutger Nijlunsing ---------------------------------- eludias ed dse.nl
never attribute to a conspiracy which can be explained by incompetence
----------------------------------------------------------------------


* Re: Figured out how to get Mozilla into git
From: Junio C Hamano @ 2006-06-10  9:08 UTC
  To: Rogan Dawes; +Cc: git
In-Reply-To: <448A847C.20105@dawes.za.net>

Rogan Dawes <lists@dawes.za.net> writes:

> Here's an idea. How about separating trees and commits from the actual
> blobs (e.g. in separate packs)?

If I remember my numbers correctly, trees for any project with a
size that matters contribute a nonnegligible amount of the total
pack weight.  Perhaps 10-25%.

> In this way, the user has a history that will show all of the commit
> messages, and would be able to see _which_ files have changed over
> time e.g. gitk would still work - except for the actual file level
> diff, "git log" should also still work, etc

I suspect it would make a very unpleasant system to use.
Sometimes "git diff -p" would show diffs, and other times it
mysteriously complains, saying that it lacks necessary blobs to do
its job.  You cannot even run fsck and tell from its output
which missing objects are OK (because you chose to create such a
sparse repository) and which are real corruption.

A shallow clone with explicit cauterization in grafts file at
least would not have that problem. Although the user will still
not see the exact same result as what would happen in a full
repository, at least we can say "your git log ends at that
commit because your copy of the history does not go back beyond
that" and the user would understand.
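
To make the cauterization concrete: as far as I know, each line of .git/info/grafts holds a commit id followed by the ids of the parents git should pretend it has, so a line with the commit id alone makes that commit look parentless. The helper below only formats such a line; its name and signature are hypothetical and nothing here is taken from git's own sources.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Format one grafts line: "<commit> <parent>...\n".
 * With n_parents == 0 the commit is recorded as parentless, which is
 * what cauterizing history at that commit amounts to.
 * Returns 0 on success, -1 if the output buffer is too small.
 */
int format_graft_line(char *out, size_t size, const char *commit,
		      const char *const *parents, size_t n_parents)
{
	size_t used, i;
	int n;

	n = snprintf(out, size, "%s", commit);
	if (n < 0 || (size_t)n >= size)
		return -1;
	used = (size_t)n;
	for (i = 0; i < n_parents; i++) {
		n = snprintf(out + used, size - used, " %s", parents[i]);
		if (n < 0 || (size_t)n >= size - used)
			return -1;
		used += (size_t)n;
	}
	n = snprintf(out + used, size - used, "\n");
	if (n < 0 || (size_t)n >= size - used)
		return -1;
	return 0;
}
```

Appending the result of format_graft_line(buf, sizeof(buf), commit, NULL, 0) to the grafts file would be the "explicit cauterization" step for a single commit.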


* Re: Figured out how to get Mozilla into git
From: Junio C Hamano @ 2006-06-10  9:00 UTC
  To: Jakub Narebski; +Cc: git
In-Reply-To: <e6dvds$oes$1@sea.gmane.org>

Jakub Narebski <jnareb@gmail.com> writes:

> Couldn't it be solved by enhancing initial handshake to send from puller
> (object receiver) to pullee (object sender) the contents of graft file, or
> better the contents of cauterizing graft file - without splitting graft
> file we better have an option to send graft file or not, when graft file is
> used to join historical repository line of development not to cauterize
> history.
>
> Then the sender would use sent cauterizing history graft file for
> calculating which objects to send _only_, "in memory" cauterizing its own
> history.
>
> Now I guess you would tell me why this very simple idea is stupid...

It is not stupid at all; what you said is actually on a correct
track.  You indeed just reinvented a half of what I've outlined
earlier for implementing shallow clone (the other half you
missed is that the graft exchange needs to happen both ways,
limiting the commit ancestry graph that both ends walk to the
intersection of the fake view of the ancestry graph both ends
have, but that is a minor detail).

The problem is that what Linus described as "fundamentally hard"
is not the initial "shallow clone" stage, but lies elsewhere.
Namely, what to do after you create such a shallow clone and
when you want to unplug an earlier cauterization point.

In order to unplug a cauterization point (a commit we faked to
be parentless earlier, whose parents and associated objects we
ought to have but we do not because we made a shallow clone),
the downloader needs to re-fetch that commit while temporarily
pretending that it does not have any objects that are newer,
perhaps defining another earlier point as a new cauterization
point at the same time.  Git format allows for that, and the
protocol exchange certainly can be extensible to support
something like that, but the design work would be quite
involved.


* Lazy clone ideas
From: Jakub Narebski @ 2006-06-10  8:58 UTC
  To: git

I've started a new thread for lazy clone ideas,
splitting from "Figured out how to get Mozilla into git"

Rogan Dawes wrote:
> Here's an idea. How about separating trees and commits from the actual 
> blobs (e.g. in separate packs)? My reasoning is that the commits and 
> trees should only be a small portion of the overall repository size, and 
> should not be that expensive to transfer. (Of course, this is only a 
> guess, and needs some numbers to back it up.)
> 
> So, a shallow clone would receive all of the tree objects, and all of 
> the commit objects, and could then request a pack containing the blobs 
> represented by the current HEAD.

That would be _lazy_ clone (with on-demand pack downloading from "master"
full history repository), rather than shallow clone.

I had an idea for having all the commit objects (without all the tree
objects) below the soft-grafts line (beyond the line we cut-off full
history and start being lazy).
 
> In this way, the user has a history that will show all of the commit 
> messages, and would be able to see _which_ files have changed over time 
> e.g. gitk would still work - except for the actual file level diff, "git 
> log" should also still work, etc
> 
> This would also enable other optimisations.
> 
> For example, documentation people would only need to get the objects 
> under the doc/ tree, and would not need to actually check out the 
> source. Git could detect any actual changes by checking whether it has 
> the previous blob in its local repository, and whether the file exists 
> locally. Creating a patch would obviously require that the person checks 
> out the previous version, but one could theoretically commit a new blob 
> to a repo without having the previous one (not saying that this would be 
> a good idea, of course)

Something akin to CVS's modules, or rather to how CVS modules can be abused?
Something called, I think, partial checkout?

This is a separate idea and I think worth implementing even for full
repository.

> This would probably require Eric Biederman's "direct access to blob" 
> patches, I guess, in order to be feasible.

And it would need a place to store the URI from which to download objects
on-demand: perhaps 'remote alternatives'?

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git


* Re: Figured out how to get Mozilla into git
From: Rogan Dawes @ 2006-06-10  8:36 UTC
  To: Linus Torvalds; +Cc: Jon Smirl, Martin Langhoff, git
In-Reply-To: <Pine.LNX.4.64.0606092001590.5498@g5.osdl.org>

Linus Torvalds wrote:
> 
> On Fri, 9 Jun 2006, Carl Worth wrote:
> 
>> On Fri, 9 Jun 2006 22:21:17 -0400, "Jon Smirl" wrote:
>>> Could you clone the repo and delete changesets earlier than 2004? Then
>>> I would clone the small repo and work with it. Later I decide I want
>>> full history, can I pull from a full repository at that point and get
>>> updated? That would need a flag to trigger it since I don't want full
>>> history to come over if I am just getting updates from someone else's
>>> tree that has a full history.
>> This is clearly a desirable feature, and has been requested by several
>> people (including myself) looking to switch some large-ish histories
>> from an existing system to git.
> 
> The thing is, to some degree it's really fundamentally hard.
> 
> It's easy for a linear history. What you do for a linear history is to 
> just get the top commit, and the tree associated with it, and then you 
> cauterize the parent by just grafting it to go away. Boom. You're done.
> 
> The problems are that if the preceding history _wasn't_ linear (or, in 
> fact, _subsequent_ development refers to it by having branched off at an 
> earlier point), and you try to pull your updates, the other end (that 
> knows about all the history) will assume you have all the history that you 
> don't have, and will send you a pack assuming that.
> 
> Which won't even necessarily have all the tree/blob objects (it assumed 
> you already had them), but more annoyingly, the history won't be 
> cauterized, and you'll have dangling commits. Which you can cauterize by 
> hand, of course, but you literally _will_ have to get the objects and 
> cauterize the thing by hand.
> 
> You're right that it's not "fundamentally impossible" to do: the git 
> format certainly _allows_ it. But the git protocol handshake really does 
> end up optimizing away all the unnecessary work by knowing that the other 
> side will have all the shared history, so lacking the shared history will 
> mean that you're a bit screwed.

Here's an idea. How about separating trees and commits from the actual 
blobs (e.g. in separate packs)? My reasoning is that the commits and 
trees should only be a small portion of the overall repository size, and 
should not be that expensive to transfer. (Of course, this is only a 
guess, and needs some numbers to back it up.)

So, a shallow clone would receive all of the tree objects, and all of 
the commit objects, and could then request a pack containing the blobs 
represented by the current HEAD.
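
As a rough illustration of the numbers this would need, the sketch below
tallies how many bytes would land in a commit/tree pack versus a blob
pack. The obj struct, the obj_kind enum and split_by_kind() are
hypothetical names made up for this example, not git internals, and the
sizes are plain byte counts with no delta or zlib compression modelled.

```c
#include <stddef.h>

/* Hypothetical object kinds for this sketch (tags are left out). */
enum obj_kind { OBJ_KIND_COMMIT, OBJ_KIND_TREE, OBJ_KIND_BLOB };

struct obj {
	enum obj_kind kind;
	unsigned long size;	/* size in bytes */
};

struct split_sizes {
	unsigned long meta;	/* commits + trees */
	unsigned long blobs;	/* file contents */
};

/* Sum object sizes into the two proposed packs. */
static struct split_sizes split_by_kind(const struct obj *objs, size_t n)
{
	struct split_sizes s = { 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		if (objs[i].kind == OBJ_KIND_BLOB)
			s.blobs += objs[i].size;
		else
			s.meta += objs[i].size;
	}
	return s;
}
```

Feeding it the real object list of a repository would show whether the
"metadata is small" guess holds.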

In this way, the user has a history that will show all of the commit 
messages, and would be able to see _which_ files have changed over time. 
For example, gitk would still work (except for the actual file-level diff), 
and "git log" should also still work.

This would also enable other optimisations.

For example, documentation people would only need to get the objects 
under the doc/ tree, and would not need to actually check out the 
source. Git could detect any actual changes by checking whether it has 
the previous blob in its local repository, and whether the file exists 
locally. Creating a patch would obviously require that the person checks 
out the previous version, but one could theoretically commit a new blob 
to a repo without having the previous one (not saying that this would be 
a good idea, of course).

This would probably require Eric Biederman's "direct access to blob" 
patches, I guess, in order to be feasible.

Regards,

Rogan

* Re: Figured out how to get Mozilla into git
From: Jakub Narebski @ 2006-06-10  8:21 UTC (permalink / raw)
  To: git
In-Reply-To: <Pine.LNX.4.64.0606092001590.5498@g5.osdl.org>

Linus Torvalds wrote:


> On Fri, 9 Jun 2006, Carl Worth wrote:
> 
>> On Fri, 9 Jun 2006 22:21:17 -0400, "Jon Smirl" wrote:
>> > 
>> > Could you clone the repo and delete changesets earlier than 2004? Then
>> > I would clone the small repo and work with it. Later I decide I want
>> > full history, can I pull from a full repository at that point and get
>> > updated? That would need a flag to trigger it since I don't want full
>> > history to come over if I am just getting updates from someone else's
>> > tree that has a full history.
>> 
>> This is clearly a desirable feature, and has been requested by several
>> people (including myself) looking to switch some large-ish histories
>> from an existing system to git.
> 
> The thing is, to some degree it's really fundamentally hard.
> 
> It's easy for a linear history. What you do for a linear history is to 
> just get the top commit, and the tree associated with it, and then you 
> cauterize the parent by just grafting it to go away. Boom. You're done.
> 
> The problems are that if the preceding history _wasn't_ linear (or, in 
> fact, _subsequent_ development refers to it by having branched off at an 
> earlier point), and you try to pull your updates, the other end (that 
> knows about all the history) will assume you have all the history that you 
> don't have, and will send you a pack assuming that.

Couldn't it be solved by enhancing the initial handshake to send, from puller
(object receiver) to pullee (object sender), the contents of the graft file,
or better, the contents of a cauterizing graft file? Without splitting the
graft file we had better have an option to send it or not, because a graft
file can also be used to join a historical line of development rather than
to cauterize history.

Then the sender would use the received cauterizing graft file _only_ for
calculating which objects to send, cauterizing its own history "in memory".
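
A toy version of that sender-side walk might look like the following. It
is only a sketch under assumptions: commits are array indices rather than
SHA-1s, each commit has at most two parents, and toy_commit and
walk_cauterized() are invented for illustration, not existing git code.

```c
#include <stddef.h>

#define MAX_PARENTS 2

/* Toy commit: parents are indices into the commit array, -1 = none. */
struct toy_commit {
	int parent[MAX_PARENTS];
};

/*
 * Mark every commit reachable from `tip`, treating each commit listed in
 * `cauterized` as if it had no parents. `seen` must hold n zeroed entries.
 * Returns the number of commits newly marked.
 */
static int walk_cauterized(const struct toy_commit *commits, size_t n,
			   int tip, const int *cauterized, size_t ncaut,
			   unsigned char *seen)
{
	int count = 1;
	size_t i;
	int p;

	if (tip < 0 || (size_t)tip >= n || seen[tip])
		return 0;
	seen[tip] = 1;

	for (i = 0; i < ncaut; i++)
		if (cauterized[i] == tip)
			return count;	/* graft says: stop here */

	for (p = 0; p < MAX_PARENTS; p++)
		count += walk_cauterized(commits, n, commits[tip].parent[p],
					 cauterized, ncaut, seen);
	return count;
}
```

The sender's own history is untouched; only the set of commits it
considers when deciding what to send is cut short.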

The main disadvantage: if one cauterizes history too eagerly, the shallow
clone can lack merge bases, and there is no way to get them _simply_
using this approach...


Now I guess you would tell me why this very simple idea is stupid...

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

* Re: Figured out how to get Mozilla into git
From: Junio C Hamano @ 2006-06-10  6:15 UTC (permalink / raw)
  To: Jon Smirl; +Cc: git
In-Reply-To: <9e4733910606092302h646ff554p107564417183e350@mail.gmail.com>

"Jon Smirl" <jonsmirl@gmail.com> writes:

> Here's a new transport problem. When using git-clone to fetch Martin's
> tree it kept failing for me at dreamhost. I had a parallel fetch
> running on my local machine which has a much slower net connection. It
> finally finished and I am watching the end phase where it prints all
> of the 'walk' messages. The git-http-fetch process has jumped up to
> 800MB in size after being 2MB during the download. dreamhost has a
> 500MB process size limit so that is why my fetches kept failing there.

The http-fetch process works by mmaping the downloaded pack, and
if I recall correctly we are talking about 600MB pack, so 500MB
limit sounds impossible, perhaps?
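
For illustration, mapping a whole file looks roughly like this in plain
POSIX C. This is a generic sketch, not the http-fetch code, and
map_whole_file() is a made-up helper. The point is that the mapping is as
long as the file, so if the host's limit applies to address space (an
assumption), a 600MB pack cannot be mapped in one piece under a 500MB cap.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map a whole file read-only. On success returns the mapping and stores
 * its length in *len; returns NULL on any failure or for an empty file.
 * The caller releases the mapping with munmap(map, *len).
 */
static void *map_whole_file(const char *path, size_t *len)
{
	struct stat st;
	void *map;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return NULL;
	if (fstat(fd, &st) < 0 || st.st_size == 0) {
		close(fd);
		return NULL;
	}
	/* The address space used grows by the full file size here. */
	map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);	/* the mapping stays valid after close */
	if (map == MAP_FAILED)
		return NULL;
	*len = (size_t)st.st_size;
	return map;
}
```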

* Re: Figured out how to get Mozilla into git
From: Jon Smirl @ 2006-06-10  6:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Martin Langhoff, git
In-Reply-To: <Pine.LNX.4.64.0606092109380.5498@g5.osdl.org>

Here's a new transport problem. When using git-clone to fetch Martin's
tree it kept failing for me at dreamhost. I had a parallel fetch
running on my local machine which has a much slower net connection. It
finally finished and I am watching the end phase where it prints all
of the 'walk' messages. The git-http-fetch process has jumped up to
800MB in size after being 2MB during the download. dreamhost has a
500MB process size limit so that is why my fetches kept failing there.

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: [PATCH] shared repository settings enhancement.
From: Linus Torvalds @ 2006-06-10  4:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vver9k5gg.fsf@assigned-by-dhcp.cox.net>



On Fri, 9 Jun 2006, Junio C Hamano wrote:
> 
> Yes, the user can mistype "gruop", people would start making
> noises about having "world" as a synonym for "everybody", and
> the parsing becomes somewhat cumbersome, and all that trouble,
> but on the other hand that is probably the easiest to explain.

Actually, it's quite easy to parse using the git config file parsers.

Let's say that 0 means umask, 1 means group, 2 means user and 3 means 
everybody. That leaves "0/1" with the old false/true behaviour, and leaves 
umask as the default.

So we'd have

	enum sharedrepo {
		PERM_UMASK = 0,
		PERM_GROUP,
		PERM_USER,
		PERM_EVERYBODY
	};

	int git_config_perm(const char *var, const char *value)
	{
		/* strcmp, not strncmp: these are whole-string matches */
		if (!strcmp(value, "umask"))
			return PERM_UMASK;
		if (!strcmp(value, "group"))
			return PERM_GROUP;
		if (!strcmp(value, "user"))
			return PERM_USER;
		if (!strcmp(value, "world") || !strcmp(value, "everybody"))
			return PERM_EVERYBODY;
		return git_config_bool(var, value);
	}

and then in check_repository_format_version() you just have

	..
	else if (strcmp(var, "core.sharedrepository") == 0)
		shared_repository = git_config_perm(var, value);
	..

instead of git_config_bool() there, and you're done. That's not so bad, is 
it?

		Linus

* Re: [PATCH] shared repository settings enhancement.
From: Junio C Hamano @ 2006-06-10  4:19 UTC (permalink / raw)
  To: git
In-Reply-To: <Pine.LNX.4.64.0606092103170.5498@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> How about making it be
>
> 	[core]
> 		sharedrepository = {umask | user | group | everybody}
>
> and allow the old boolean expression syntax to mean "0/false means umask, 
> 1/true means group".
>
> So you'd have:
>
>  - umask/0/false means "use 0777 permissions with default umask"
>  - user means "use 0500 permissions"
>  - group means "use 0550 permissions"
>  - everybody means "use 0555 permissions"
>
> (where "5" is r-x, and only for directories, and obviously degenerates to 
> just "4" aka r-- for regular files).
>
> That sounds really pretty self-explanatory and obvious, wouldn't you say?

Yes, the user can mistype "gruop", people would start making
noises about having "world" as a synonym for "everybody", and
the parsing becomes somewhat cumbersome, and all that trouble,
but on the other hand that is probably the easiest to explain.

* Re: Figured out how to get Mozilla into git
From: Linus Torvalds @ 2006-06-10  4:11 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Jon Smirl, git
In-Reply-To: <Pine.LNX.4.64.0606092043460.5498@g5.osdl.org>



On Fri, 9 Jun 2006, Linus Torvalds wrote:
> 
> You can try to approximate the latency by just looking at the number of 
> packets, and using a large MTU (and on localhost, the MTU will be pretty 
> large - roughly 16kB). Don't count packet size at all, just count how many 
> packets each protocol sends (both ways), ignoring packets that are just 
> empty ACK's.

Btw, the reason you should ignore empty acks is that they happen when you 
have a nice streaming one-way thing, because the TCP rules say that you 
should send an ACK every two full packets minimum, even if you have 
nothing to say.

So empty acks really approximate to "streaming data", while packets with 
payload _could_ obviously mean "nice streaming data going both ways", but 
almost always end up being synchronization discussion of some sort.

		Linus

* Re: [PATCH] shared repository settings enhancement.
From: Linus Torvalds @ 2006-06-10  4:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7v8xo5lleo.fsf@assigned-by-dhcp.cox.net>



On Fri, 9 Jun 2006, Junio C Hamano wrote:
> 
> Having said that, I do not think the distinction is that
> important; I would rather make the core.sharedrepository = true
> to mean an equivalent of "chmod go+rX" (it does "chmod g+rX"
> currently).

How about making it be

	[core]
		sharedrepository = {umask | user | group | everybody}

and allow the old boolean expression syntax to mean "0/false means umask, 
1/true means group".

So you'd have:

 - umask/0/false means "use 0777 permissions with default umask"
 - user means "use 0500 permissions"
 - group means "use 0550 permissions"
 - everybody means "use 0555 permissions"

(where "5" is r-x, and only for directories, and obviously degenerates to 
just "4" aka r-- for regular files).
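
Taking the numbers in that list literally, the mapping could be sketched
as below. The shared_perm enum and both helper functions are hypothetical
names for this example only, and masking off the execute bits for files
in the umask case (0777 becoming 0666) is an assumption the list does not
spell out.

```c
#include <sys/types.h>

/* Hypothetical names; the order matches "umask | user | group | everybody". */
enum shared_perm {
	SHARED_UMASK,
	SHARED_USER,
	SHARED_GROUP,
	SHARED_EVERYBODY
};

/* Directory permissions for each setting, as listed above. */
static mode_t shared_dir_mode(enum shared_perm perm)
{
	switch (perm) {
	case SHARED_USER:
		return 0500;
	case SHARED_GROUP:
		return 0550;
	case SHARED_EVERYBODY:
		return 0555;
	case SHARED_UMASK:
	default:
		return 0777;	/* still subject to the process umask */
	}
}

/* Regular files drop the execute bits: "5" (r-x) becomes "4" (r--). */
static mode_t shared_file_mode(enum shared_perm perm)
{
	return shared_dir_mode(perm) & ~(mode_t)0111;
}
```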

That sounds really pretty self-explanatory and obvious, wouldn't you say?

			Linus

* Re: Figured out how to get Mozilla into git
From: Linus Torvalds @ 2006-06-10  4:02 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Jon Smirl, git
In-Reply-To: <46a038f90606092041neadcc54n2acb6272d1f71de7@mail.gmail.com>



On Sat, 10 Jun 2006, Martin Langhoff wrote:
> 
> So the per-file and per-directory overhead are significant. I can do a
> cvs checkout via pserver:localhost but I don't know off-the-cuff how
> to measure the traffic. Hints?

Over localhost, you won't see the biggest issue, which is just latency.

The git protocol should be absolutely _wonderful_ with bad latency, 
because once the early back-and-forth on what each side has is done, 
there's no synchronization any more - it's all just streaming, with 
full-frame TCP.

If :pserver: does per-file "hey, what are you up to" kind of 
synchronization, the big killer would be the latency from one end to the 
other, regardless of any throughput.
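
A back-of-the-envelope model of that argument, as a sketch only:
transfer_seconds() is a made-up helper, and it assumes every synchronous
exchange costs one full round trip while streamed bytes cost only
bandwidth. Real TCP behaviour (slow start, loss) is ignored.

```c
/*
 * Rough wall-clock estimate for a transfer, in seconds.
 * round_trips:  number of synchronous request/response exchanges
 * rtt_s:        round-trip time in seconds
 * bytes:        total payload bytes streamed
 * bytes_per_s:  available throughput
 */
static double transfer_seconds(unsigned long round_trips, double rtt_s,
			       double bytes, double bytes_per_s)
{
	return (double)round_trips * rtt_s + bytes / bytes_per_s;
}
```

With purely hypothetical inputs, 10000 per-file exchanges at a 100ms
round trip add 1000 seconds of waiting no matter how fast the link is,
while a protocol that needs 5 exchanges adds half a second.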

You can try to approximate the latency by just looking at the number of 
packets, and using a large MTU (and on localhost, the MTU will be pretty 
large - roughly 16kB). Don't count packet size at all, just count how many 
packets each protocol sends (both ways), ignoring packets that are just 
empty ACK's.

I don't know how to build a tcpdump expression for "TCP packet with an 
empty payload", but I bet it's possible.
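
The arithmetic behind such a test is simple enough to sketch in C: the
payload length is the IP total length minus the IP header length minus
the TCP header length. The code below assumes a raw IPv4 packet with no
link-layer header and no fragmentation; tcp_payload_len() and
is_empty_ack() are names invented for this example. A capture filter
would have to express the same subtraction over the same header bytes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the TCP payload length of the IPv4 packet in `pkt`, or -1 if
 * the buffer is too short or does not hold an IPv4/TCP packet.
 */
static long tcp_payload_len(const uint8_t *pkt, size_t len)
{
	size_t ip_hlen, tcp_hlen, total;

	if (len < 20 || (pkt[0] >> 4) != 4 || pkt[9] != 6)
		return -1;
	ip_hlen = (size_t)(pkt[0] & 0x0f) * 4;		/* IHL, 32-bit words */
	total = ((size_t)pkt[2] << 8) | pkt[3];		/* IP total length */
	if (ip_hlen < 20 || len < ip_hlen + 20 || total > len)
		return -1;
	tcp_hlen = (size_t)(pkt[ip_hlen + 12] >> 4) * 4;	/* data offset */
	if (tcp_hlen < 20 || total < ip_hlen + tcp_hlen)
		return -1;
	return (long)(total - ip_hlen - tcp_hlen);
}

/* "Empty ACK": ACK set, none of FIN/SYN/RST set, zero payload bytes. */
static int is_empty_ack(const uint8_t *pkt, size_t len)
{
	uint8_t flags;
	size_t ip_hlen;

	if (tcp_payload_len(pkt, len) != 0)
		return 0;
	ip_hlen = (size_t)(pkt[0] & 0x0f) * 4;
	flags = pkt[ip_hlen + 13];
	return (flags & 0x10) && !(flags & 0x07);
}
```

Counting the packets for which is_empty_ack() is false, in both
directions, gives the packet count described above.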

[ And I won't guarantee that it's a wonderful approximation for "network 
  cost", but I think it's potentially a reasonably good one. It's totally 
  realistic to equate 32kB of _streaming_ data (two packets flowing in 
  one direction with no synchronization) with just a single byte of data 
  going back-and-forth synchronously ]

		Linus

* Re: Figured out how to get Mozilla into git
From: Junio C Hamano @ 2006-06-10  3:55 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: git
In-Reply-To: <46a038f90606092041neadcc54n2acb6272d1f71de7@mail.gmail.com>

"Martin Langhoff" <martin.langhoff@gmail.com> writes:

> Yes, most people have -z3, and I agree with you, on paper it sounds
> like the cost is 1/4 of a git clone.
>
> However.
>
> The CVS protocol is very chatty because the client _acts_ extremely
> stupid. It says, ok, I got here an empty directory, and the server
> walks the client through every little step. And all that chatter is
> uncompressed cleartext under pserver.
>
> So the per-file and per-directory overhead are significant. I can do a
> cvs checkout via pserver:localhost but I don't know off-the-cuff how
> to measure the traffic. Hints?

If you have an otherwise unused interface, you can look at
ifconfig output and see RX/TX bytes?  But that sounds very
crude.

Running it through a proxy perhaps?

