Git development


* (rework) [PATCH 5/5] Accept commit in some places when tree is needed.
From: Junio C Hamano @ 2005-04-21  0:24 UTC
  To: Linus Torvalds; +Cc: git

Updates read-tree to use read_tree_with_tree_or_commit_sha1()
function.  The command can take either tree or commit IDs with
this patch.

The change involves a slight modification of how it recurses down
the tree.  Earlier the caller only supplied SHA1 and the recurser
read the object using it, but now it is the caller's responsibility
to read the object and give it to the recurser.  This matches the
way recursive behaviour is done in other tree-related commands.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 read-tree.c |   34 ++++++++++++++++++++++++----------
 1 files changed, 24 insertions(+), 10 deletions(-)

read-tree.c: 46747b5e99b102ed547e87f55a8ee734c9ddb074
--- a/read-tree.c
+++ b/read-tree.c
@@ -23,16 +23,11 @@ static int read_one_entry(unsigned char 
 	return add_cache_entry(ce, 1);
 }
 
-static int read_tree(unsigned char *sha1, const char *base, int baselen)
+static int read_tree_recursive(void *buffer, const char *type,
+			       unsigned long size,
+			       const char *base, int baselen)
 {
-	void *buffer;
-	unsigned long size;
-	char type[20];
-
-	buffer = read_sha1_file(sha1, type, &size);
-	if (!buffer)
-		return -1;
-	if (strcmp(type, "tree"))
+	if (!buffer || strcmp(type, "tree"))
 		return -1;
 	while (size) {
 		int len = strlen(buffer)+1;
@@ -50,10 +45,20 @@ static int read_tree(unsigned char *sha1
 			int retval;
 			int pathlen = strlen(path);
 			char *newbase = malloc(baselen + 1 + pathlen);
+			void *eltbuf;
+			char elttype[20];
+			unsigned long eltsize;
+
+			eltbuf = read_sha1_file(sha1, elttype, &eltsize);
+			if (!eltbuf)
+				return -1;
 			memcpy(newbase, base, baselen);
 			memcpy(newbase + baselen, path, pathlen);
 			newbase[baselen + pathlen] = '/';
-			retval = read_tree(sha1, newbase, baselen + pathlen + 1);
+			retval = read_tree_recursive(eltbuf, elttype, eltsize,
+						     newbase,
+						     baselen + pathlen + 1);
+			free(eltbuf);
 			free(newbase);
 			if (retval)
 				return -1;
@@ -65,6 +70,15 @@ static int read_tree(unsigned char *sha1
 	return 0;
 }
 
+static int read_tree(unsigned char *sha1, const char *base, int baselen)
+{
+	void *buffer;
+	unsigned long size;
+
+	buffer = read_tree_with_tree_or_commit_sha1(sha1, &size, 0);
+	return read_tree_recursive(buffer, "tree", size, base, baselen);
+}
+
 static int remove_lock = 0;
 
 static void remove_lock_file(void)


* Re: Possible problem with git-pasky-0.6.2 (patch: **** Only garbage was found in the patch input.)I
From: Steven Cole @ 2005-04-21  0:20 UTC
  To: Petr Baudis; +Cc: git
In-Reply-To: <200504201715.00058.elenstev@mesatop.com>

On Wednesday 20 April 2005 05:15 pm, Steven Cole wrote:
> On Wednesday 20 April 2005 05:12 pm, Petr Baudis wrote:
> > Dear diary, on Thu, Apr 21, 2005 at 01:06:09AM CEST, I got a letter
> > where Steven Cole <elenstev@mesatop.com> told me that...
> > > After getting the latest tarball, and make, make install:
> > > 
> > > Tree change: 55f9d5042603fff4ddfaf4e5f004d2995286d6d3:a46844fcb6afef1f7a2d93f391c82f08ea322221
> > > *100755->100755 blob    a78cf8ccab98861ef7aecb4cb5a79e47d3a84b67->74b4083d67eda87d88a6f92c6c66877bba8bda8a     gitcancel.sh
> > > Tracked branch, applying changes...
> > > Fast-forwarding 55f9d5042603fff4ddfaf4e5f004d2995286d6d3 -> a46844fcb6afef1f7a2d93f391c82f08ea322221
> > >         on top of 55f9d5042603fff4ddfaf4e5f004d2995286d6d3...
> > > patch: **** Only garbage was found in the patch input.
> > > 
> > > This may be a harmless message, but I thought I'd bring it to your attention.
> > 
> > This _is_ weird. What does
> > 
> > 	$ git diff -r 55f9d5042603fff4ddfaf4e5f004d2995286d6d3:a46844fcb6afef1f7a2d93f391c82f08ea32222
> > 
> > tell you? 
> 
> [steven@spc git-pasky-0.6.2]$ git diff -r 55f9d5042603fff4ddfaf4e5f004d2995286d6d3:a46844fcb6afef1f7a2d93f391c82f08ea32222
> Index: gitcancel.sh
[ output snipped, see previous message for output]
> 
> > What if you feed it to patch -p1? 
> I haven't done that yet, awaiting response to above.
> 
> > What if you feed it to git  
> > apply?
> > 
> > Thanks,
> > 
> You're welcome.  I'll do the "git patch -p1 <stuff_from_above" if that's what's needed,
> same with git apply.  Corrections to syntax appreciated.
> Steven

Actually, I meant "patch -p1 <stuff_from_above".

But before doing that, I did a fsck-cache as follows, with these results.
This seems damaged.

[steven@spc git-pasky-0.6.2]$ fsck-cache --unreachable $(cat .git/HEAD)
root 1bf00e46973f7f1c40bc898f1346a1273f0a347f
unreachable commit 0128396de7ca8a7dc74f6fbff59a68bb781bb9b2
unreachable blob 012c82312c99606f914bda5c501b616237a3b7e9
unreachable tree 02a1b5337f78b807d4404f473e55c44f4273d2f8

[ lots of snippage...]

unreachable blob fee26cc5b378819ff48ef8cb54c35744c0f1c17f
unreachable tree fff7294434014ea68153770da3965ed315806499

[steven@spc git-pasky-0.6.2]$ fsck-cache --unreachable $(cat .git/HEAD) | wc -l
467

I renamed the repo to git-pasky-0.6.2-damaged, and repeated untarring the 0.6.2 tarball,
make, (didn't do make install this time), and repeated "git pull pasky" with
similar results as before.

[steven@spc git-pasky-0.6.2-damaged]$ cat .git/HEAD
a46844fcb6afef1f7a2d93f391c82f08ea322221
[steven@spc git-pasky-0.6.2-damaged]$ cd ../git-pasky-0.6.2
[steven@spc git-pasky-0.6.2]$ cat .git/HEAD
7a4c67965de68ae7bc7aa1fde33f8eb9d8114697

Hope this helps,
Steven



* (rework) [PATCH 4/5] Accept commit in some places when tree is needed.
From: Junio C Hamano @ 2005-04-21  0:23 UTC
  To: Linus Torvalds; +Cc: git

Updates ls-tree.c to use read_tree_with_tree_or_commit_sha1()
function.  The command can take either tree or commit IDs with
this patch.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 ls-tree.c |   11 +++++------
 1 files changed, 5 insertions(+), 6 deletions(-)

ls-tree.c: c063640c114634dc7cf950ce44863dd17ddf83c1
--- a/ls-tree.c
+++ b/ls-tree.c
@@ -24,9 +24,9 @@ static void print_path_prefix(struct pat
 }
 
 static void list_recursive(void *buffer,
-			  unsigned char *type,
-			  unsigned long size,
-			  struct path_prefix *prefix)
+			   const unsigned char *type,
+			   unsigned long size,
+			   struct path_prefix *prefix)
 {
 	struct path_prefix this_prefix;
 	this_prefix.prev = prefix;
@@ -72,12 +72,11 @@ static int list(unsigned char *sha1)
 {
 	void *buffer;
 	unsigned long size;
-	char type[20];
 
-	buffer = read_sha1_file(sha1, type, &size);
+	buffer = read_tree_with_tree_or_commit_sha1(sha1, &size, 0);
 	if (!buffer)
 		die("unable to read sha1 file");
-	list_recursive(buffer, type, size, NULL);
+	list_recursive(buffer, "tree", size, NULL);
 	return 0;
 }


* (rework) [PATCH 3/5] Accept commit in some places when tree is needed.
From: Junio C Hamano @ 2005-04-21  0:22 UTC
  To: Linus Torvalds; +Cc: git

Updates diff-tree.c to use read_tree_with_tree_or_commit_sha1()
function.  The command can take either tree or commit IDs with this patch.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 diff-tree.c |   25 ++++---------------------
 1 files changed, 4 insertions(+), 21 deletions(-)

diff-tree.c: 65bb9d66c5610b2ede11f03a9120da48c59629f8
--- a/diff-tree.c
+++ b/diff-tree.c
@@ -164,14 +164,13 @@ static int diff_tree_sha1(const unsigned
 {
 	void *tree1, *tree2;
 	unsigned long size1, size2;
-	char type[20];
 	int retval;
 
-	tree1 = read_sha1_file(old, type, &size1);
-	if (!tree1 || strcmp(type, "tree"))
+	tree1 = read_tree_with_tree_or_commit_sha1(old, &size1, 0);
+	if (!tree1)
 		die("unable to read source tree (%s)", sha1_to_hex(old));
-	tree2 = read_sha1_file(new, type, &size2);
-	if (!tree2 || strcmp(type, "tree"))
+	tree2 = read_tree_with_tree_or_commit_sha1(new, &size2, 0);
+	if (!tree2)
 		die("unable to read destination tree (%s)", sha1_to_hex(new));
 	retval = diff_tree(tree1, size1, tree2, size2, base);
 	free(tree1);
@@ -179,20 +178,6 @@ static int diff_tree_sha1(const unsigned
 	return retval;
 }
 
-static void commit_to_tree(unsigned char *sha1)
-{
-	void *buf;
-	char type[20];
-	unsigned long size;
-
-	buf = read_sha1_file(sha1, type, &size);
-	if (buf) {
-		if (!strcmp(type, "commit"))
-			get_sha1_hex(buf+5, sha1);
-		free(buf);
-	}
-}
-
 int main(int argc, char **argv)
 {
 	unsigned char old[20], new[20];
@@ -214,7 +199,5 @@ int main(int argc, char **argv)
 
 	if (argc != 3 || get_sha1_hex(argv[1], old) || get_sha1_hex(argv[2], new))
 		usage("diff-tree <tree sha1> <tree sha1>");
-	commit_to_tree(old);
-	commit_to_tree(new);
 	return diff_tree_sha1(old, new, "");
 }


* (rework) [PATCH 2/5] Accept commit in some places when tree is needed.
From: Junio C Hamano @ 2005-04-21  0:21 UTC
  To: Linus Torvalds; +Cc: git

Updates diff-cache.c to use read_tree_with_tree_or_commit_sha1()
function.  The end-user visible result is the same --- the command
takes either tree or commit ID.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 diff-cache.c |   17 +----------------
 1 files changed, 1 insertion(+), 16 deletions(-)

diff-cache.c: fcbc4900d32f4ca24f67bb8f0fe344c6c5642ac9
--- a/diff-cache.c
+++ b/diff-cache.c
@@ -220,7 +220,6 @@ int main(int argc, char **argv)
 	unsigned char tree_sha1[20];
 	void *tree;
 	unsigned long size;
-	char type[20];
 
 	read_cache();
 	while (argc > 2) {
@@ -245,23 +244,9 @@ int main(int argc, char **argv)
 	if (argc != 2 || get_sha1_hex(argv[1], tree_sha1))
 		usage("diff-cache [-r] [-z] <tree sha1>");
 
-	tree = read_sha1_file(tree_sha1, type, &size);
+	tree = read_tree_with_tree_or_commit_sha1(tree_sha1, &size, 0);
 	if (!tree)
 		die("bad tree object %s", argv[1]);
 
-	/* We allow people to feed us a commit object, just because we're nice */
-	if (!strcmp(type, "commit")) {
-		/* tree sha1 is always at offset 5 ("tree ") */
-		if (get_sha1_hex(tree + 5, tree_sha1))
-			die("bad commit object %s", argv[1]);
-		free(tree);
-		tree = read_sha1_file(tree_sha1, type, &size);       
-		if (!tree)
-			die("unable to read tree object %s", sha1_to_hex(tree_sha1));
-	}
-
-	if (strcmp(type, "tree"))
-		die("bad tree object %s (%s)", sha1_to_hex(tree_sha1), type);
-
 	return diff_cache(tree, size, active_cache, active_nr, "");
 }



* (rework) [PATCH 1/5] Accept commit in some places when tree is needed.
From: Junio C Hamano @ 2005-04-21  0:19 UTC
  To: Linus Torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.58.0504200826360.6467@ppc970.osdl.org>

Linus,

    sorry for bringing up an issue that is already 8 hours old.

LT> I don't think that's a good interface. It changes the sha1 passed into it: 
LT> that may actually be nice, since you may want to know what it changed to, 
LT> but I think you'd want to have that as an (optional) separate 
LT> "sha1_result" parameter. 

Point taken about "_changing_ _is_ _bad_" part.  It was a mistake.

LT> Also, the "type" or "size" things make no sense to have as a parameter 
LT> at all.

Well, the semantics is "I want to read the raw data of a tree
and I do not know nor care if this sha1 I got from my user is
for a commit or a tree."  So type does not matter (if it returns
a non NULL we know it is a tree), but the size matters.

And that semantics is not some hacky thing specific to diff-cache.
Rather, it applies in general to the way those recursive walkers
are structured.  The recursive walkers in ls-tree,
diff-cache, and diff-tree all expect the caller to supply the
buffer read by read_sha1_file, and when a walker calls itself it
does the same (read-tree's recursing convention is an oddball
that needs to be addressed, though).

When the recursion is structured this way, the only thing you
need to do to allow a commit ID from the user where a tree ID is
needed, without breaking the error checking done by the part
that recurses down (i.e. we must error on a commit object ID
when we are expecting a tree object ID stored in objects we read
from the tree downwards), is to change the top-level caller to
use "I want tree with this tree/commit ID" instead of "I want a
buffer with this ID and I'll make sure it is a tree myself".
The recursor then becomes "Give me a buffer and its type,
I'll barf if it does not say a tree."  When the recursor
calls itself, it reads with read_sha1_file and feeds the result
to itself, and has the callee do the checking.
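
The split of responsibilities described above can be sketched in a few
lines.  This is only an illustrative model in Python, not the C code from
the patches: the in-memory OBJECTS table, the object IDs, and every
function name here are hypothetical stand-ins.  Only the top-level entry
point resolves a commit to its tree; the recursor refuses anything that is
not a tree, so a commit ID stored inside a tree is still an error.

```python
# Toy object store: id -> (type, payload).  All IDs and contents are made up.
OBJECTS = {
    "c1": ("commit", "t1"),  # a commit records the ID of its tree
    "t1": ("tree", [("Makefile", "b1", False), ("src", "t2", True)]),
    "t2": ("tree", [("main.c", "b2", False)]),
    "b1": ("blob", "all:\n"),
    "b2": ("blob", "int main(void) { return 0; }\n"),
}


def read_object(oid):
    """Stand-in for read_sha1_file(): return (type, payload) or (None, None)."""
    return OBJECTS.get(oid, (None, None))


def read_tree_with_tree_or_commit(oid):
    """Top-level only: accept a tree ID or a commit ID, return tree data or None."""
    otype, payload = read_object(oid)
    if otype == "commit":
        otype, payload = read_object(payload)
    return payload if otype == "tree" else None


def walk(otype, payload, base, out):
    """Recursor: the caller already read the object; barf unless it is a tree."""
    if payload is None or otype != "tree":
        return -1
    for name, oid, is_dir in payload:
        if is_dir:
            # Read the sub-object ourselves and let the callee check its type.
            etype, epayload = read_object(oid)
            if walk(etype, epayload, base + name + "/", out):
                return -1
        else:
            out.append(base + name)
    return 0


def list_paths(oid):
    """Entry point: works for both tree IDs and commit IDs."""
    tree = read_tree_with_tree_or_commit(oid)
    if tree is None:
        raise ValueError("not a tree or commit: %s" % oid)
    out = []
    if walk("tree", tree, "", out):
        raise ValueError("bad object below %s" % oid)
    return out
```

In this model list_paths("c1") and list_paths("t1") give the same
listing, while a tree entry that points at a commit is rejected by the
recursor rather than silently followed.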

The commit_to_tree() thing you introduced in diff-tree.c is
simple to use.  IMHO it is however conceptually a wrong thing to
use in these contexts.  When the user supplies a tree ID, you
first read that object only to see if it is not a commit and
throw it away, then immediately read it again for your real
processing.  In these particular cases of four tree-related
files, "I want tree with this tree/commit ID" semantics is a
_far_ _better_ match for the problem.

Having said that, here is a reworked version.  This first one 
introduces read_tree_with_tree_or_commit_sha1() function.

<end-of-cover-letter>

This patch implements read_tree_with_tree_or_commit_sha1(),
which can be used when you are interested in reading unpacked
raw tree data but you neither know nor care whether the SHA1 you
obtained from your user is a tree ID or a commit ID.  Before this
function's introduction, you would have called read_sha1_file(),
examined its type, parsed it to call read_sha1_file() again if
it is a commit, and verified that the resulting object is a
tree.  Instead, this function does that for you.  It returns
NULL if the given SHA1 is neither a tree nor a commit.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 cache.h     |    4 ++++
 sha1_file.c |   40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+)

cache.h: eab355da5d2f6595053f28f0cca61181ac314ee9
--- a/cache.h
+++ b/cache.h
@@ -124,4 +124,8 @@ extern int error(const char *err, ...);

 extern int cache_name_compare(const char *name1, int len1, const char *name2, int len2);

+extern void *read_tree_with_tree_or_commit_sha1(const unsigned char *sha1,
+						unsigned long *size,
+						unsigned char *tree_sha1_ret);
+
 #endif /* CACHE_H */

sha1_file.c: eee3598bb75e2199045b823f007e7933c0fb9cfe
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -166,6 +166,46 @@ void * read_sha1_file(const unsigned cha
 	return NULL;
 }

+void *read_tree_with_tree_or_commit_sha1(const unsigned char *sha1,
+					 unsigned long *size,
+					 unsigned char *tree_sha1_return)
+{
+	char type[20];
+	void *buffer;
+	unsigned long isize;
+	int was_commit = 0;
+	char tree_sha1[20];
+
+	buffer = read_sha1_file(sha1, type, &isize);
+
+	/* 
+	 * We might have read a commit instead of a tree, in which case
+	 * we parse out the tree_sha1 and attempt to read from there.
+	 * (buffer + 5) is because the tree sha1 is always at offset 5
+	 * in a commit record ("tree ").
+	 */
+	if (buffer &&
+	    !strcmp(type, "commit") &&
+	    !get_sha1_hex(buffer + 5, tree_sha1)) {
+		free(buffer);
+		buffer = read_sha1_file(tree_sha1, type, &isize);
+		was_commit = 1;
+	}
+
+	/*
+	 * Now do we have something and if so is it a tree?
+	 */
+	if (!buffer || strcmp(type, "tree")) {
+		free(buffer);
+		return NULL;
+	}
+
+	*size = isize;
+	if (tree_sha1_return)
+		memcpy(tree_sha1_return, was_commit ? tree_sha1 : sha1, 20);
+	return buffer;
+}
+
 int write_sha1_file(char *buf, unsigned len, unsigned char *returnsha1)
 {
 	int size;

* Re: [ANNOUNCE] git-pasky-0.6.2 && heads-up on upcoming changes
From: Linus Torvalds @ 2005-04-21  0:14 UTC
  To: Petr Baudis; +Cc: Greg KH, git
In-Reply-To: <20050420222815.GM19112@pasky.ji.cz>

Pasky,
 what do you think about this change to "git log"?

It makes it a _lot_ easier to parse the result, as it indents all the
comments by two spaces, meaning that the header is clearly marked, and you
can then do various 'sed'/'grep' things with nice normal regular
expressions like '^parent' without having to worry about there being a 
line that starts with "parent" in the free-form part.
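
To illustrate the parsing argument, here is a small Python sketch.  The
log entry, its IDs, and the helper names are invented for the example and
are not actual "git log" output; indent_body() only mimics what the
sed 's/^/  /' in the patch below does.

```python
def indent_body(text, prefix="  "):
    """Counterpart of sed 's/^/  /': prefix every line, empty ones included."""
    return "".join(prefix + line for line in text.splitlines(True))


def header_lines(log_text, key):
    """Return the lines that start with 'key ' in column 0, as '^key' would."""
    return [line for line in log_text.splitlines() if line.startswith(key + " ")]


# Hypothetical log entry; the IDs are placeholders.
HEADER = "commit 1111\ntree 2222\nparent 3333\n"
BODY = "Fix the frobnicator.\nparent pointers are walked only once now.\n"

OLD_ENTRY = HEADER + "\n" + BODY               # free-form text in column 0
NEW_ENTRY = HEADER + "\n" + indent_body(BODY)  # free-form text indented
```

With the unindented entry, header_lines(OLD_ENTRY, "parent") also picks up
the comment line that happens to begin with "parent"; with the indented
entry only the real header line matches.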

I also think the end result is more readable from a human standpoint, with 
indentation as the way to distinguish the headers from the commentary, 
and less ugly ASCII barfic's with "------" etc.

I'm doing a 2.6.12-rc3 release, so I care more than usual about the 
changelog ;)

		Linus

---
gitlog.sh: a496a864f9586e47a4d7bd3ae0af0b3e07b7deb8
--- a/gitlog.sh
+++ b/gitlog.sh
@@ -28,7 +28,7 @@ rev-tree $base | sort -rn | while read t
 				fi
 				;;
 			"")
-				echo; cat
+				echo; sed 's/^/  /'
 				;;
 			*)
 				echo $key $rest
@@ -36,5 +36,5 @@ rev-tree $base | sort -rn | while read t
 			esac

 		done
-	echo -e "\n--------------------------"
+	echo
 done

* Re: [Gnu-arch-users] Re: [GNU-arch-dev] [ANNOUNCEMENT] /Arch/ embraces `git'
From: Denys Duchier @ 2005-04-21  0:05 UTC
  To: gnu-arch-users; +Cc: gnu-arch-dev, git
In-Reply-To: <200504202304.QAA17069@emf.net>

[-- Attachment #1: Type: text/plain, Size: 2045 bytes --]

Tom Lord <lord@emf.net> writes:

> Thank you for your experiment.

you are welcome.

> I think that to a large extent you are seeing artifacts
> of the questionable trade-offs that (reports tell me) the
> ext* filesystems make.   With a different filesystem, the 
> results would be very different.

No, this is not the only thing that we observe.  For example, here are the
reports for the following two experiments:

Indexing method = [2]

Max keys at level  0:     256
Max keys at level  1:     108
Total number of dirs:     257
Total number of keys:   21662
Disk footprint      :    1.8M

Indexing method = [4 4]

Max keys at level  0:   18474
Max keys at level  1:       5
Max keys at level  2:       1
Total number of dirs:   40137
Total number of keys:   21662
Disk footprint      :    157M

Notice the huge number of directories in the second experiment and they don't
help at all in performing discrimination.

> I'm imagining a blob database containing many revisions of the linux
> kernel.  It will contain millions of blobs.

It is very easy to write code that uses an adaptive discrimination method
(i.e. when a directory becomes too full, introduce an additional level of
discrimination and rehash).  In fact I have code that does that (rehashing if
the size of a leaf directory exceeds 256), but the [2] method above doesn't even
need it even though it has 21662 keys.
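
The adaptive method is not part of the attached scripts.  The following is
only a rough in-memory sketch of the idea in Python 3, with hypothetical
names (Node, insert, stats, build) and directories modelled as
dictionaries rather than created on disk.  It assumes fixed-length hex
keys, a leaf limit of 256, and two characters per level, as in the [2]
experiment.

```python
import hashlib


class Node:
    """One directory: a leaf holding keys, or an inner node holding subdirs."""

    def __init__(self, depth):
        self.depth = depth      # number of key slices consumed to get here
        self.keys = set()       # contents while this node is a leaf
        self.children = None    # dict: key slice -> Node, once split


def insert(node, key, limit=256, width=2):
    # Descend through directories that were already split, one slice per level.
    while node.children is not None:
        prefix = key[node.depth * width:(node.depth + 1) * width]
        node = node.children.setdefault(prefix, Node(node.depth + 1))
    node.keys.add(key)
    # Leaf too full: add one more level of discrimination and rehash its keys,
    # provided the key still has characters left to slice on.
    if len(node.keys) > limit and (node.depth + 1) * width <= len(key):
        keys, node.keys, node.children = node.keys, set(), {}
        for k in keys:
            insert(node, k, limit, width)


def stats(node):
    """Return (directories, keys, largest leaf) for the subtree under node."""
    if node.children is None:
        return 1, len(node.keys), len(node.keys)
    dirs, keys, largest = 1, 0, 0
    for child in node.children.values():
        d, k, m = stats(child)
        dirs, keys, largest = dirs + d, keys + k, max(largest, m)
    return dirs, keys, largest


def build(count, limit=256, width=2):
    """Index `count` synthetic SHA-1 keys and return the root directory."""
    root = Node(0)
    for i in range(count):
        insert(root, hashlib.sha1(str(i).encode()).hexdigest(), limit, width)
    return root
```

A store with few keys stays flat, and extra levels appear only under the
prefixes that actually fill up, which is the property the fixed [4 4]
layout lacks.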

Just in case there is some interest, I attach below the python scripts which I
used for my experiments:

To create an indexed archive:

	python build.py SRC DST N1 ... Nk

where SRC is the root directory of the tree to be indexed, and DST names the
root directory of the indexed archive to be created.  N1 through Nk are integers
that each indicate how many chars to chop off the key to create the next level
indexing key.

	python info.py DST

collects and then prints out statistics about an indexed archive.

For example, the invocation that relates to your original proposal would be:

	python build.py /usr/src/linux store 4 4
	python info.py store
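
As a quick illustration of how N1 ... Nk carve a key into directory
levels, here is a minimal Python 3 sketch.  key_to_path is a hypothetical
helper, not a function from the attached scripts, and unlike the attached
build.py (which slices the final component with an end index of -1) it
keeps the whole remainder of the key as the file name.

```python
import os


def key_to_path(root, key, widths):
    """Map a key to root/<N1 chars>/<N2 chars>/.../<rest of key>."""
    parts, pos = [], 0
    for width in widths:
        parts.append(key[pos:pos + width])  # next level's indexing key
        pos += width
    parts.append(key[pos:])                 # whatever is left names the file
    return os.path.join(root, *parts)
```

For the key "0123456789abcdef", widths [4, 4] give the components
"0123", "4567" and "89abcdef" under the root, while widths [2] give "01"
and "23456789abcdef".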


[-- Attachment #2: script to build an indexed archive --]
[-- Type: text/plain, Size: 1741 bytes --]

import os,os.path,stat,sha

tree      = None
archive   = None
slices    = []
lastslice = (0,-1)

def recurse(path):
    s = os.stat(path)
    if stat.S_ISDIR(s.st_mode):
        print path
        contents = []
        for n in os.listdir(path):
            uid = recurse(os.path.join(path,n))
            contents.append('\t'.join((n,uid)))
        contents = '\n'.join(contents)
        buf = sha.new(contents)
        uid = buf.hexdigest()
        uid = ','.join((uid,str(len(contents))))
        store(uid)
        return uid
    else:
        fd = file(path,"rb")
        contents = fd.read()
        fd.close()
        buf = sha.new(contents)
        uid = ','.join((buf.hexdigest(),str(s.st_size)))
        store(uid)
        return uid

def store(uid):
    p = archive
    if not os.path.exists(p):
        os.mkdir(p)
    for s in slices:
        p = os.path.join(p,uid[s[0]:s[1]])
        if not os.path.exists(p):
            os.mkdir(p)
    p = os.path.join(p,uid[lastslice[0]:lastslice[1]])
    fd = file(p,"wb")
    fd.close()

if __name__ == '__main__':
    import sys
    from optparse import OptionParser
    from types import IntType
    parser = OptionParser(usage="usage: %prog TREE ARCHIVE N1 ... Nk")
    (options, args) = parser.parse_args()
    if len(args) < 3:
        print >>sys.stderr, "expected at least 3 positional arguments"
        sys.exit(1)
    tree    = args[0]
    archive = args[1]
    prev    = 0
    for a in args[2:]:
        try:
            next = prev+int(a)
            slices.append((prev,next))
            prev = next
        except:
            print >>sys.stderr, "positional argument is not an integer:",a
            sys.exit(1)
    lastslice = (next,-1)
    recurse(tree)
    sys.exit(0)

[-- Attachment #3: script to print statistics about an indexed archive --]
[-- Type: text/plain, Size: 1214 bytes --]

import os,os.path,stat

info = []
archive = None
total_keys = 0
total_dirs = 0

def collect_info(path,i):
    global total_dirs,total_keys
    s = os.stat(path)
    if stat.S_ISDIR(s.st_mode):
        total_dirs += 1
        l = os.listdir(path)
        n = len(l)
        if i==len(info):
            info.append(n)
        elif n>info[i]:
            info[i] = n
        i += 1
        for f in l:
            collect_info(os.path.join(path,f),i)
    else:
        total_keys += 1

def print_info():
    i = 0
    for n in info:
        print "Max keys at level %2s: %7s" % (i,n)
        i += 1
    print "Total number of dirs: %7s" % total_dirs
    print "Total number of keys: %7s" % total_keys
    fd = os.popen("du -csh %s" % archive,"r")
    s = fd.read()
    fd.close()
    s = s.split()[0]
    print "Disk footprint      : %7s"  % s

if __name__ == '__main__':
    import sys
    from optparse import OptionParser
    parser = OptionParser(usage="usage: %prog ARCHIVE")
    (options, args) = parser.parse_args()
    if len(args) != 1:
        print >>sys.stderr, "expected exactly 1 positional argument"
        sys.exit(1)
    archive = args[0]
    collect_info(archive,0)
    print_info()
    sys.exit(0)

[-- Attachment #4: Type: text/plain, Size: 216 bytes --]


Cheers,

PS: I should mention again, that my indexed archives only contain empty files
because I am only interested in measuring overhead.

-- 
Dr. Denys Duchier - IRI & LIFL - CNRS, Lille, France
AIM: duchierdenys

* Re: Change "pull" to _only_ download, and "git update"=pull+merge?
From: David Mansfield @ 2005-04-20 23:58 UTC
  To: Petr Baudis
  Cc: Ingo Molnar, Martin Schlemmer, David Greaves, dwheeler,
	Daniel Barkalow, git
In-Reply-To: <20050420211505.GE19112@pasky.ji.cz>

Petr Baudis wrote:
> Dear diary, on Wed, Apr 20, 2005 at 10:32:35PM CEST, I got a letter
> where Ingo Molnar <mingo@elte.hu> told me that...
> 
>>* Petr Baudis <pasky@ucw.cz> wrote:
>>
>>
>>>>yet another thing: what is the canonical 'pasky way' of simply nuking 
>>>>the current files and checking out the latest tree (according to 
>>>>.git/HEAD). Right now i'm using a script to:
>>>>
>>>>  read-tree $(tree-id $(cat .git/HEAD))
>>>>  checkout-cache -a
>>>>
>>>>(i first do an 'rm -f *' in the working directory)
>>>>
>>>>i guess there's an existing command for this already?
>>>
>>>git cancel
>>
>>hm, that's a pretty unintuitive name though. How about making it 'git 
>>checkout' and providing a 'git checkout -f' option to force the 
>>checkout? (or something like this)
> 
> 
> Since it does not really checkout. Ok, it does, but that's only a small
> part of it. It just cancels whatever local changes you are making in the
> tree and brings it to a consistent state. When you have a merge in progress
> and after you see the sheer number of conflicts you decide to get your
> hands off, you just type git cancel. Basically anything you have done to your
> tree (not only local changes that checkout would fix, but also various git
> operations, including git add/rm and git seek) can be easily undone by
> git cancel.


How about 'git revert'?

Most editors and word processors use that idiom for reverting to the saved
copy, with the obvious parallel here.

David

* Re: chunking (Re: [ANNOUNCEMENT] /Arch/ embraces `git')
From: C. Scott Ananian @ 2005-04-20 23:42 UTC
  To: Linus Torvalds
  Cc: Petr Baudis, Tom Lord, gnu-arch-users, gnu-arch-dev,
	Git Mailing List, talli
In-Reply-To: <Pine.LNX.4.58.0504201510520.6467@ppc970.osdl.org>

On Wed, 20 Apr 2005, Linus Torvalds wrote:

> What's the disk usage results? I'm on ext3, for example, which means that
> even small files invariably take up 4.125kB on disk (with the inode).
>
> Even uncompressed, most source files tend to be small. Compressed, I'm
> seeing the median blob size being ~1.6kB in my trivial checks. That's
> blobs only, btw.

I'm working on it.  The format was chosen so that blobs under 1 block long 
*stay* 1 block long; i.e. there's no 'chunk plus index file' overhead.
So the chunking should only kick in on multiple-block files.
I hacked 'convert-cache' to do the conversion and it's running out of
memory on linux-2.6.git, however --- I found a few memory leaks in your 
code =) but I certainly seem to be missing a big one still (maybe it's in 
my code!).

When I get this working properly, my plan is to do a number of runs over 
the linux-2.6 archive trying out various combinations of chunking 
parameters.  I *will* be watching both 'real' disk usage (bunged up to 
block boundaries) and 'ideal' disk usage (on a reiserfs-type system).
The goal is to improve both, but if I can improve 'ideal' usage 
significantly with a minimal penalty in 'real' usage then I would argue 
it's still worth doing, since that will improve network times.
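
The 'real' versus 'ideal' distinction can be made concrete with a small
Python sketch.  The function names are hypothetical; the 4096-byte block
and 128-byte inode are assumptions read off the "4.125kB" figure quoted
above, not measurements of any particular filesystem.

```python
def ideal_usage(sizes):
    """Bytes used if every file occupied exactly its own length."""
    return sum(sizes)


def real_usage(sizes, block=4096, inode=128):
    """Bytes used when each file is rounded up to whole blocks plus an inode."""
    total = 0
    for size in sizes:
        blocks = -(-size // block)  # ceiling division without floats
        total += blocks * block + inode
    return total
```

Under these assumptions a 1600-byte blob costs 4224 bytes (4.125kB) of
'real' space, so splitting it further cannot help, whereas a file just
over one block already pays for two.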

The handshaking penalties you mention are significant, but that's why 
rsync uses a pipelined approach.  The 'upstream' part of your full-duplex 
pipe is 'free' while you've got bits clogging your 'downstream' 
pipe.  The wonders of full-duplex...

Anyway, "numbers talk, etc".  I'm working on them.
  --scott

LIONIZER LCPANES shortwave MKSEARCH ESGAIN Saddam Hussein Rijndael 
WASHTUB Morwenstow ZPSEMANTIC SKIMMER cryptographic FJHOPEFUL assassination
                          ( http://cscott.net/ )

* Re: on when to checksum
From: Tom Lord @ 2005-04-20 23:39 UTC
  To: torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.58.0504201601130.6467@ppc970.osdl.org>

(I'll have to study/think about that for a while before a proper
reply.  Tomorrow, probably.)

Thanks,
-t

* Re: [PATCH] Add help details to git help command. (This time with Perl)
From: Petr Baudis @ 2005-04-20 23:34 UTC
  To: David Greaves; +Cc: Steven Cole, git
In-Reply-To: <42655630.80207@dgreaves.com>

Dear diary, on Tue, Apr 19, 2005 at 09:04:16PM CEST, I got a letter
where David Greaves <david@dgreaves.com> told me that...
> I don't love the 'require gitadd.pl' but it's a gradual start...

I hate it, for one. ;-)

> Cogito.pm seems to be a good place for the library stuff.

Sounds sensible.

> git.pl
> passes everything to scripts except gitadd.pl

We've decided to go for the individual scripts directly. :-)

Unfortunately, you didn't send the attachments inline, so I can't
comment on them sensibly.

Perhaps my main problem now is style. I'd prefer that you format it like
the C sources of git, with 8-char indentation and such. Also make sure
you use spaces around (or after) operators. Also, for just a few short
functions I prefer putting the functions before the code itself.

> use IO::File;   # leads to less perlish syntax and is standard in perl dists

Oh come on. Are you writing Perl or not? I think it looks pretty awful,
and you are using Perl filehandle idioms anyway, so...

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

* Re: [PATCH] gittrack.sh accepts invalid branch names
From: Paul Jackson @ 2005-04-20 23:22 UTC
  To: Pavel Roskin; +Cc: git, pasky
In-Reply-To: <1114026510.15186.15.camel@dv>

Pavel wrote:
> 	sed -ne "/^$name\t/p" .git/remotes | grep -q .

Consider using the following to look for a match of $name with
the first tab separated field of the remotes file (and to avoid
using 'grep -q', which is not in all greps, so far as I know):

	cut -f1 .git/remotes | grep -Fx "$name" >/dev/null
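
For comparison, the same test (exact, literal match against the first
tab-separated field) can be modelled in Python.  This is only a sketch of
the matching rule; has_remote and the sample URLs are made up for the
example.

```python
def has_remote(remotes_text, name):
    """True if some line's first tab-separated field equals name exactly."""
    return any(line.split("\t", 1)[0] == name
               for line in remotes_text.splitlines())


# Hypothetical contents of a remotes file: "<name>\t<url>" per line.
REMOTES = "pasky\trsync://example.org/git-pasky\nlinus\trsync://example.org/git\n"
```

Because the comparison is literal, a name such as "pas.y" or a bare prefix
such as "pask" does not match the "pasky" entry, which is the point of
using -F and -x instead of a regular expression built from $name.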

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

* Re: [PATCH] gittrack.sh accepts invalid branch names
From: Petr Baudis @ 2005-04-20 23:21 UTC
  To: Pavel Roskin; +Cc: git
In-Reply-To: <1114026510.15186.15.camel@dv>

Dear diary, on Wed, Apr 20, 2005 at 09:48:30PM CEST, I got a letter
where Pavel Roskin <proski@gnu.org> told me that...
> --- a/gittrack.sh
> +++ b/gittrack.sh
> @@ -35,7 +35,7 @@ die () {
>  mkdir -p .git/heads
>  
>  if [ "$name" ]; then
> -	grep -q $(echo -e "^$name\t" | sed 's/\./\\./g') .git/remotes || \
> +	sed -ne "/^$name\t/p" .git/remotes | grep -q . || \
>  		[ -s ".git/heads/$name" ] || \
>  		die "unknown branch \"$name\""

This fixes the acceptance, but not the choice.

What does the grep -q . exactly do? Just sets error code based on
whether the sed output is non-empty? What about [] instead?

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

* Re: Possible problem with git-pasky-0.6.2 (patch: **** Only garbage was found in the patch input.)I
From: Steven Cole @ 2005-04-20 23:15 UTC
  To: Petr Baudis; +Cc: git
In-Reply-To: <20050420231212.GN19112@pasky.ji.cz>

On Wednesday 20 April 2005 05:12 pm, Petr Baudis wrote:
> Dear diary, on Thu, Apr 21, 2005 at 01:06:09AM CEST, I got a letter
> where Steven Cole <elenstev@mesatop.com> told me that...
> > After getting the latest tarball, and make, make install:
> > 
> > Tree change: 55f9d5042603fff4ddfaf4e5f004d2995286d6d3:a46844fcb6afef1f7a2d93f391c82f08ea322221
> > *100755->100755 blob    a78cf8ccab98861ef7aecb4cb5a79e47d3a84b67->74b4083d67eda87d88a6f92c6c66877bba8bda8a     gitcancel.sh
> > Tracked branch, applying changes...
> > Fast-forwarding 55f9d5042603fff4ddfaf4e5f004d2995286d6d3 -> a46844fcb6afef1f7a2d93f391c82f08ea322221
> >         on top of 55f9d5042603fff4ddfaf4e5f004d2995286d6d3...
> > patch: **** Only garbage was found in the patch input.
> > 
> > This may be a harmless message, but I thought I'd bring it to your attention.
> 
> This _is_ weird. What does
> 
> 	$ git diff -r 55f9d5042603fff4ddfaf4e5f004d2995286d6d3:a46844fcb6afef1f7a2d93f391c82f08ea32222
> 
> tell you? 

[steven@spc git-pasky-0.6.2]$ git diff -r 55f9d5042603fff4ddfaf4e5f004d2995286d6d3:a46844fcb6afef1f7a2d93f391c82f08ea32222
Index: gitcancel.sh
===================================================================
--- f29be8140c5f1175052ec96ad2fa2b2901fd6ba5/gitcancel.sh  (mode:100755 sha1:a78cf8ccab98861ef7aecb4cb5a79e47d3a84b67)
+++ 2e1f16579fdcd9cd5d242f53a3cfaad52ac5d207/gitcancel.sh  (mode:100755 sha1:74b4083d67eda87d88a6f92c6c66877bba8bda8a)
@@ -13,6 +13,19 @@
 [ -s ".git/add-queue" ] && rm $(cat .git/add-queue)
 rm -f .git/add-queue .git/rm-queue

+# Undo seek?
+branch=
+[ -s .git/blocked ] && branch=$(grep '^seeked from ' .git/blocked | sed 's/^seeked from //')
+if [ "$branch" ]; then
+       echo "Unseeking: $(cat .git/HEAD) -> $(cat ".git/heads/$branch")"
+       if [ -s ".git/heads/$branch" ]; then
+               rm .git/HEAD
+               ln -s "heads/$branch" .git/HEAD
+       else
+               echo "ERROR: Unknown branch $branch! Preserving HEAD." >&2
+       fi
+fi
+
 rm -f .git/blocked .git/merging .git/merging-sym .git/merge-base
 read-tree $(tree-id)





> What if you feed it to patch -p1? 
I haven't done that yet, awaiting response to above.

> What if you feed it to git  
> apply?
> 
> Thanks,
> 
You're welcome.  I'll do the "git patch -p1 <stuff_from_above" if that's what's needed,
same with git apply.  Corrections to syntax appreciated.
Steven

* Re: Possible problem with git-pasky-0.6.2 (patch: **** Only garbage was found in the patch input.)I
From: Petr Baudis @ 2005-04-20 23:12 UTC (permalink / raw)
  To: Steven Cole; +Cc: git
In-Reply-To: <200504201706.09656.elenstev@mesatop.com>

Dear diary, on Thu, Apr 21, 2005 at 01:06:09AM CEST, I got a letter
where Steven Cole <elenstev@mesatop.com> told me that...
> After getting the latest tarball, and make, make install:
> 
> Tree change: 55f9d5042603fff4ddfaf4e5f004d2995286d6d3:a46844fcb6afef1f7a2d93f391c82f08ea322221
> *100755->100755 blob    a78cf8ccab98861ef7aecb4cb5a79e47d3a84b67->74b4083d67eda87d88a6f92c6c66877bba8bda8a     gitcancel.sh
> Tracked branch, applying changes...
> Fast-forwarding 55f9d5042603fff4ddfaf4e5f004d2995286d6d3 -> a46844fcb6afef1f7a2d93f391c82f08ea322221
>         on top of 55f9d5042603fff4ddfaf4e5f004d2995286d6d3...
> patch: **** Only garbage was found in the patch input.
> 
> This may be a harmless message, but I thought I'd bring it to your attention.

This _is_ weird. What does

	$ git diff -r 55f9d5042603fff4ddfaf4e5f004d2995286d6d3:a46844fcb6afef1f7a2d93f391c82f08ea32222

tell you? What if you feed it to patch -p1? What if you feed it to git
apply?

Thanks,

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

* Possible problem with git-pasky-0.6.2 (patch: **** Only garbage was found in the patch input.)I
From: Steven Cole @ 2005-04-20 23:06 UTC (permalink / raw)
  To: Petr Baudis; +Cc: git

After getting the latest tarball, and make, make install:

[steven@spc git-pasky-0.6.2]$ git pull pasky
MOTD:  Welcome to Petr Baudis' rsync archive.
MOTD:
MOTD:  If you are pulling my git branch, please do not repeat that
MOTD:  every five minutes or so - new stuff is likely not going to
MOTD:  appear so fast, and my line is not that thick. Nothing wrong
MOTD:  with pulling every half an hour or so, of course.
MOTD:
MOTD:  Feel free to contact me at <pasky@ucw.cz>, shall you have
MOTD:  any questions or suggestions.

receiving file list ... done
2e/1f16579fdcd9cd5d242f53a3cfaad52ac5d207
3e/f49665799151ced5e03ae1d544b1d67a6b7e5b
74/b4083d67eda87d88a6f92c6c66877bba8bda8a
7f/621eae988378ee776c040a5856e873e41691e1
a2/44b27ac61489b7d7fa4246e82479897d3bb886
a3/87546d148df5718a9c53bbe0cbea441e793d98
a4/6844fcb6afef1f7a2d93f391c82f08ea322221
a6/7b79e97f9db01bc270a07f3be9cda610845128
ba/4c6268d14989801b15e87cab98f6a236cc5e7f
f9/3b5e3d8a427d93e7e5125b55b17cd1a9479af9

wrote 228 bytes  read 99996 bytes  6466.06 bytes/sec
total size is 1753925  speedup is 17.50

receiving file list ... done

wrote 62 bytes  read 633 bytes  198.57 bytes/sec
total size is 369  speedup is 0.53
Tree change: 55f9d5042603fff4ddfaf4e5f004d2995286d6d3:a46844fcb6afef1f7a2d93f391c82f08ea322221
*100755->100755 blob    a78cf8ccab98861ef7aecb4cb5a79e47d3a84b67->74b4083d67eda87d88a6f92c6c66877bba8bda8a     gitcancel.sh
Tracked branch, applying changes...
Fast-forwarding 55f9d5042603fff4ddfaf4e5f004d2995286d6d3 -> a46844fcb6afef1f7a2d93f391c82f08ea322221
        on top of 55f9d5042603fff4ddfaf4e5f004d2995286d6d3...
patch: **** Only garbage was found in the patch input.

This may be a harmless message, but I thought I'd bring it to your attention.

Steven

* Re: [ANNOUNCE] git-pasky-0.6.2 && heads-up on upcoming changes
From: Greg KH @ 2005-04-20 23:04 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Linus Torvalds, git
In-Reply-To: <20050420222815.GM19112@pasky.ji.cz>

On Thu, Apr 21, 2005 at 12:28:15AM +0200, Petr Baudis wrote:
> Dear diary, on Thu, Apr 21, 2005 at 12:09:06AM CEST, I got a letter
> where Linus Torvalds <torvalds@osdl.org> told me that...
> > Yeah, yeah, it looks different from "cvs update", but dammit, wouldn't it 
> > be cool to just write "cg-<tab><tab>" and see the command choices? Or 
> > "cg-up<tab>" and get cg-update done for you..
> 
> I like this idea! :-) I guess that is in fact exactly what I have been
> looking for, and (as probably apparent from the current git-pasky
> structure) I prefer to have the scripts separated anyway.

I agree, it would solve the issue with 'cg' being overloaded, and I too
like the <tab><tab> completion idea.

thanks,

greg k-h

* Re: [ANNOUNCEMENT] /Arch/ embraces `git'
From: Tom Lord @ 2005-04-20 23:04 UTC (permalink / raw)
  To: duchier; +Cc: gnu-arch-users, gnu-arch-dev, git
In-Reply-To: <877jixfjxw.fsf@star.lifl.fr>

   From: duchier@ps.uni-sb.de

Thank you for your experiment.  I'm not surprised by the 
result but it is very nice to know that my expectations
are right.

I think that to a large extent you are seeing artifacts
of the questionable trade-offs that (reports tell me) the
ext* filesystems make.   With a different filesystem, the 
results would be very different.

I'm imagining a blob database containing many revisions of the linux
kernel.  It will contain millions of blobs.

It's fine that some filesystems and some blob operations work fine
on a directory with millions of files but what about other operations
on the database?   I pity the poor program that has to `readdir' through
millions of files.

That said: I may add an optional flat-directory format to my library,
just to avoid issues such as those you raise over the next couple 
years.

-t

* Re: on when to checksum
From: Linus Torvalds @ 2005-04-20 23:07 UTC (permalink / raw)
  To: Tom Lord; +Cc: git
In-Reply-To: <200504202252.PAA16837@emf.net>

On Wed, 20 Apr 2005, Tom Lord wrote:
> 
> How many times per day do you invoke `write-tree' and why?

Every single commit does a write-tree, so when I merge with Andrew, it's 
usually a series of 100-250 of them in a row.

(Actually, _usually_ it's smaller series, but it's the big series that can
be painful enough to matter).

> It takes a large multiple of `0.3s' to get me to take you seriously
> on this point.

The thing is, I don't "trickle" things in. That would be horribly 
inefficient for me. So I go over the patches, make a mbox, and do them all 
in one go. And then they need to happen _fast_. If it takes 20 minutes, I 
go away for coffee or something, and then if something didn't apply 
half-way through, I will have lost my "context".

That's why I want things instant. Not because I have huge daily throughput 
issues, but I have huge _latency_ issues. 

I considered doing a "two-level" thing, where I first did the stuff in a
light-weight patch manager, and then batched things up in the background
for the real thing. But the fact is, I don't think it's needed. Not the
way git performs now. If I can apply a hundred patches in a minute or two,
I have not "lost the context" if it turns out that there is some silly
glitch with one of them.

		Linus

* Re: on when to checksum
From: Tom Lord @ 2005-04-20 22:52 UTC (permalink / raw)
  To: torvalds; +Cc: git
In-Reply-To: <Pine.LNX.4.58.0504201539180.6467@ppc970.osdl.org>

   From: Linus Torvalds <torvalds@osdl.org>

   On Wed, 20 Apr 2005, Tom Lord wrote:
   > 
   > I think you have made a mistake by moving the sha1 checksum from the
   > zipped form to the inflated form.  Here is why:

   I'd have agreed with you (and I did, violently) if it wasn't for the
   performance issues. It makes a huge difference for write-tree, and to me,
   clearly performance _does_ matter.

   Fractions of seconds may not sound like a lot, but they add up. I work 
   with 200-patch series myself all the time, so I'm very sensitive to a 0.3 
   second difference in performance.

How many times per day do you invoke `write-tree' and why?

It takes a large multiple of `0.3s' to get me to take you seriously
on this point.

I have long harbored the suspicion that your perceived bandwidth
implies that you process a lot of patches unread or barely read --
implying that your day-to-day bitslinging could/should largely be
handled by an Arch-style patch-queue-manager (a script).

-t

* Re: [Gnu-arch-users] Re: [ANNOUNCEMENT] /Arch/ embraces `git'
From: Tomas Mraz @ 2005-04-20 22:51 UTC (permalink / raw)
  To: duchier; +Cc: gnu-arch-dev, talli, git, torvalds
In-Reply-To: <877jixfjxw.fsf@star.lifl.fr>

On Wed, 2005-04-20 at 19:15 +0200, duchier@ps.uni-sb.de wrote:
...
> As data, I used my /usr/src/linux which uses 301M and contains 20753 files and
> 1389 directories.  To compute the key for a directory, I considered that its
> contents were a mapping from names to keys.
I suppose if you used the blob archive for storing many revisions the
number of stored blobs would be much higher. However even then we can
estimate that the maximum number of stored blobs will be in the order of
millions.

> When constructing the indexed archive, I actually stored empty files instead of
> blobs because I am only interested in overhead.
> 
> Using your suggested indexing method that uses [0:4] as the 1st level key and
                                                 [0:3]
> [4:8] as the 2nd level key, I obtain an indexed archive that occupies 159M,
> where the top level contains 18665 1st level keys, the largest first level dir
> contains 5 entries, and all 2nd level dirs contain exactly 1 entry.
Yes, it really doesn't make much sense to have such big keys on the
directories. If we assume that SHA1 is a really good hash function,
so that every hash value is equally probable, this would
allow storing 2^16 * 2^16 * 2^16 blobs with approximately the same directory
usage.

> Using Linus suggested 1 level [0:2] indexing, I obtain an indexed archive that
                                [0:1] I suppose
> occupies 1.8M, where the top level contains 256 1st level keys, and where the
> largest 1st level dir contains 110 entries.
The question is how many entries per directory is the optimal compromise
between space and the speed of access to its files.

If we suppose the maximum number of stored blobs is on the order of millions,
the optimal indexing would probably be 1 level [0:2] indexing or 2
levels [0:1] [2:3]. However, it would be necessary to do some
benchmarking first before setting this in stone.
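
A minimal sketch of the fan-out layouts discussed above, assuming the
bracket notation names inclusive hex-digit positions (so [0:1] is two
digits, giving 256 directories, which matches the count quoted above).
The names `fanout_path` and `fanout_dirs` are hypothetical and are not
taken from git; the object root is whatever the caller passes in.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: split a 40-character hex SHA1 into directory
 * components. l1 is the number of leading hex digits used for the
 * first directory level; l2 is the number used for an optional second
 * level (0 means a single level). Returns 0 on success, or -1 on bad
 * input or if the result does not fit in out.
 */
static int fanout_path(char *out, size_t outlen, const char *root,
		       const char *hex, int l1, int l2)
{
	int n;

	if (strlen(hex) != 40 || l1 <= 0 || l2 < 0 || l1 + l2 >= 40)
		return -1;
	if (l2 > 0)
		n = snprintf(out, outlen, "%s/%.*s/%.*s/%s", root,
			     l1, hex, l2, hex + l1, hex + l1 + l2);
	else
		n = snprintf(out, outlen, "%s/%.*s/%s", root,
			     l1, hex, hex + l1);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* Number of leaf directories when `digits` hex digits are used: 16^digits. */
static unsigned long fanout_dirs(int digits)
{
	return 1UL << (4 * digits);
}
```

With l1 = 2 and l2 = 0, the ID a46844fcb6afef1f7a2d93f391c82f08ea322221
maps to a4/6844fcb6afef1f7a2d93f391c82f08ea322221 under the root, which
is the shape of the file names in the rsync listing quoted elsewhere on
this page. For a store of one million blobs with evenly distributed
hashes, [0:2] gives 4096 directories of roughly 244 entries each, and
[0:1] [2:3] gives 65536 leaf directories of roughly 15 entries each.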

-- 
Tomas Mraz <t8m@centrum.cz>

* Re: [Gnu-arch-users] Re: [GNU-arch-dev] [ANNOUNCEMENT] /Arch/ embraces `git'
From: Tomas Mraz @ 2005-04-20 22:40 UTC (permalink / raw)
  To: duchier; +Cc: gnu-arch-dev, talli, git, torvalds
In-Reply-To: <877jixfjxw.fsf@star.lifl.fr>

On Wed, 2005-04-20 at 19:15 +0200, duchier@ps.uni-sb.de wrote:
...
> As data, I used my /usr/src/linux which uses 301M and contains 20753 files and
> 1389 directories.  To compute the key for a directory, I considered that its
> contents were a mapping from names to keys.
I suppose if you used the blob archive for storing many revisions the
number of stored blobs would be much higher. However even then we can
estimate that the maximum number of stored blobs will be in the order of
millions.

> When constructing the indexed archive, I actually stored empty files instead of
> blobs because I am only interested in overhead.
> 
> Using your suggested indexing method that uses [0:4] as the 1st level key and
                                                 [0:3]
> [4:8] as the 2nd level key, I obtain an indexed archive that occupies 159M,
> where the top level contains 18665 1st level keys, the largest first level dir
> contains 5 entries, and all 2nd level dirs contain exactly 1 entry.
Yes, it really doesn't make much sense to have such big keys on the
directories. If we assume that SHA1 is a really good hash function,
so that every hash value is equally probable, this would
allow storing 2^16 * 2^16 * 2^16 blobs with approximately the same directory
usage.

> Using Linus suggested 1 level [0:2] indexing, I obtain an indexed archive that
                                [0:1] I suppose
> occupies 1.8M, where the top level contains 256 1st level keys, and where the
> largest 1st level dir contains 110 entries.
The question is how many entries per directory is the optimal compromise
between space and the speed of access to its files.

If we suppose the maximum number of stored blobs is on the order of millions,
the optimal indexing would probably be 1 level [0:2] indexing or 2
levels [0:1] [2:3]. However, it would be necessary to do some
benchmarking first before setting this in stone.

-- 
Tomas Mraz <t8m@centrum.cz>


* Re: on when to checksum
From: Linus Torvalds @ 2005-04-20 22:41 UTC (permalink / raw)
  To: Tom Lord; +Cc: git
In-Reply-To: <200504202225.PAA15992@emf.net>

On Wed, 20 Apr 2005, Tom Lord wrote:
> 
> I think you have made a mistake by moving the sha1 checksum from the
> zipped form to the inflated form.  Here is why:

I'd have agreed with you (and I did, violently) if it wasn't for the
performance issues. It makes a huge difference for write-tree, and to me,
clearly performance _does_ matter.

Fractions of seconds may not sound like a lot, but they add up. I work 
with 200-patch series myself all the time, so I'm very sensitive to a 0.3 
second difference in performance.

		Linus

^ permalink raw reply

* Re: WARNING! Object DB conversion (was Re: [PATCH] write-tree performance problems)
From: David Woodhouse @ 2005-04-20 22:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: H. Peter Anvin, Git Mailing List, Chris Mason
In-Reply-To: <Pine.LNX.4.58.0504200731590.6467@ppc970.osdl.org>

On Wed, 2005-04-20 at 07:59 -0700, Linus Torvalds wrote:
>         external-parent <commit-hash> <external-parent-ID>
>                 comment for this parent
> 
> and the nice thing about that is that now that information allows you to 
> add external parents at any point. 
> 
> Why do it like this? First off, I think that the "initial import" ends up
> being just one special case of the much more _generic_ issue of having
> patches come in from other source control systems 

This isn't about patches coming in from other systems -- it's about
_history_, and the fact that it's imported from another system is just
an implementation detail. It's git history now, and what we have here is
just a special case of wanting to prune ancient git history to keep the
size of our working trees down. You refer to this yourself...

> Secondly, we do need something like this for pruning off history anyway, 
> so that the tools have a better way of saying "history has been pruned 
> off" than just hitting a missing commit. 

Having a more explicit way of saying "history is pruned" than just a
reference to a missing commit is a reasonable request -- but I really
don't see how we can do that by changing the now-oldest commit object to
contain an 'external-parent' field. Doing that would change the sha1 of
the commit object in question, and then ripple through all the
subsequent commits.
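
The ripple effect described above can be sketched with a toy hash
chain. This is only an illustration under stated assumptions: FNV-1a
stands in for SHA1, the "parent ..." text layout is invented for the
example and is not git's commit format, and `toy_hash` and
`toy_commit_id` are hypothetical names.

```c
#include <stdint.h>
#include <stdio.h>

/* 64-bit FNV-1a over a NUL-terminated string; a stand-in for SHA1. */
static uint64_t toy_hash(const char *s)
{
	uint64_t h = 14695981039346656037ULL;	/* FNV offset basis */

	while (*s) {
		h ^= (unsigned char)*s++;
		h *= 1099511628211ULL;		/* FNV prime */
	}
	return h;
}

/*
 * ID of a toy commit: the hash of its parent's ID followed by its own
 * body. Because the parent ID is part of the hashed text, any change
 * to an ancestor changes the ID of every descendant.
 */
static uint64_t toy_commit_id(uint64_t parent, const char *body)
{
	char buf[256];

	snprintf(buf, sizeof(buf), "parent %016llx\n%s",
		 (unsigned long long)parent, body);
	return toy_hash(buf);
}
```

Building a two-commit chain twice, once with an extra line appended to
the oldest commit's body, yields a different ID for the oldest commit
and therefore a different ID for its child, even though the child's own
body is unchanged. That is why adding a field to the now-oldest commit
cannot be done without rewriting everything after it.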

Come this time next year, if I decide I want to prune anything older
than 2.6.40 from all the trees on my laptop, it has to happen _without_
changing the commit objects which occur after my arbitrarily-chosen
cutoff point.

If we want to have an explicit record of pruning rather than just
coping with a missing object, then I think we'd need to do it with an
external note to say "It's OK that commit XXXXXXXXXXX is missing".

> Thirdly, I don't actually want my new tree to depend on a conversion of
> the old BK tree.
> 
> Two reasons: if it's a really full conversion, there are definitely going
> to be issues with BitMover. They do not want people to try to reverse
> engineer how they do namespace merges

Don't think of it as "a conversion of the old BK tree". It's just an
import of Linux's development history. This isn't going to help
reverse-engineer how BK does merges; it's just our own revision history.
I'm not sure exactly how Thomas is extracting it, but AIUI it's all
obtainable from the SCCS files anyway without actually resorting to
using BK itself. 

There's nothing here for Larry to worry about. It's not as if we're
actually using BK to develop git by observing BK's behaviour w.r.t
merges and trying to emulate it. Besides -- if we wanted to do that,
we'd need to use the _BK_ version of the tree; the git version wouldn't
help us much anyway.

And given that BK's merges are based on individual files and we're not
going that route with git, it's not clear how much we could lift
directly from BK even if we _were_ going to try that.

> The other reason is just the really obvious one: in the last week, I've
> already changed the format _twice_ in ways that change the hash. As long
> as it's 119MB of data, it's not going to be too nasty to do again.

That's fine. But by the time we settle on a format and actually start
using it in anger, it'd be good to be sure that it _is_ possible to
track development from current trees all the way back -- be that with
explicit reference to pruned history as you suggest, or with absent
parents as I still prefer.

> it's not that it's necessarily the wrong thing to do, but I think it
> is the wrong thing to do _now_.

OK, time for us to keep arguing over the implementation details of how
we prune history then :)

-- 
dwmw2
